author | Santiago Ruano Rincón <santiago@debian.org> | 2013-12-13 07:53:37 -0800 |
---|---|---|
committer | Jim Meyering <meyering@fb.com> | 2013-12-21 10:51:22 -0800 |
commit | 178ed7cc324bc2000c19a3f7a4be649dfa99b44a (patch) | |
tree | 7506d9b55886051555be2305608953c2bb3f3549 /tests | |
parent | 1a8b1b370eace41be892e9fef041f36b72baeefb (diff) | |
download | grep-178ed7cc324bc2000c19a3f7a4be649dfa99b44a.tar.gz |
pcre: tell grep -P to relax its stance on invalid multibyte chars

Do not exit-2 for invalid UTF-8 characters. Just prior to this
change, this command would match no lines and fail like this:

  $ printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
  grep: invalid UTF-8 byte sequence in input
  2

After this change, the same command matches both lines, and succeeds:

  jM-^B$
  j$
  0

* src/pcresearch.c (Pcompile): Use PCRE_NO_UTF8_CHECK, too, and
add a comment.
* tests/pcre-utf8: Add a test and a comment.

This change did not work with Debian unstable pcre-8.31-2
or with some 8.33 and 8.34-based versions, but does work with
Fedora 20's 8.33 and with a built-from-latest source library.
Based on a patch by Santiago Ruano Rincón.
See http://bugs.gnu.org/15758/
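
The before/after behaviour described in the message can be checked with a small script. The following is only a sketch: `describe_status` is a hypothetical helper (not part of grep or its test suite), and the script assumes that the `en_US.UTF-8` locale is installed and that grep was built with PCRE support. The stray byte is written in octal (`\202`, the same byte as `\x82`) because not every `printf` accepts `\x` escapes.

```shell
#!/bin/sh
# Hypothetical helper: map a grep exit status to a short label.
# grep exits 0 when a line is selected, 1 when none is, and 2 on error.
describe_status() {
  case "$1" in
    0) echo "matched" ;;
    1) echo "no match" ;;
    *) echo "error" ;;
  esac
}

# Feed grep -P the input from the commit message: a line holding 'j'
# followed by the stray byte 0x82, then a line holding a plain 'j'.
# Assumption: en_US.UTF-8 exists and grep supports -P on this system.
status=0
printf 'j\202\nj\n' | LC_ALL=en_US.UTF-8 grep -P j >/dev/null 2>&1 || status=$?
describe_status "$status"
```

On a grep that includes this change and is linked against a PCRE library that honours it, the script would be expected to report a match. A grep from just before the change, or one built without PCRE, would be expected to report an error instead.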
Diffstat (limited to 'tests')
-rwxr-xr-x | tests/pcre-utf8 | 6 |
1 files changed, 6 insertions, 0 deletions
diff --git a/tests/pcre-utf8 b/tests/pcre-utf8
index b8228d51..a3b9390b 100755
--- a/tests/pcre-utf8
+++ b/tests/pcre-utf8
@@ -19,9 +19,15 @@ echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' \
 euro='\342\202\254 euro'
 printf "$euro\\n" > in || framework_failure_
 
+# The euro sign has the unicode "Symbol" property, so this must match:
 LC_ALL=en_US.UTF-8 grep -P '^\p{S}' in > out || fail=1
 compare in out || fail=1
 
+# This RE must *not* match in the C locale, because the first
+# byte is not a "Symbol".
+LC_ALL=C grep -P '^\p{S}' in > out && fail=1
+compare /dev/null out || fail=1
+
 LC_ALL=en_US.UTF-8 grep -P '^. euro$' in > out2 || fail=1
 compare in out2 || fail=1
 
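
The test relies on two complementary idioms: `|| fail=1` for a command that must succeed, and `&& fail=1` for a command that must not. The sketch below shows the same pattern outside the test framework. It makes several assumptions for illustration: plain `grep` replaces `grep -P`, `cmp -s` stands in for the framework's `compare`, and the input text, patterns, and temporary directory are invented.

```shell
#!/bin/sh
# Sketch of the "must match" / "must not match" idiom from the test.
# Plain grep and cmp stand in for grep -P and the framework's compare;
# the file names, input text, and patterns are illustrative only.
fail=0
tmp=$(mktemp -d) || exit 1
printf 'euro sign line\n' > "$tmp/in"

# Must match: a non-zero exit from grep records a failure,
# and the output must be identical to the input.
LC_ALL=C grep '^euro' "$tmp/in" > "$tmp/out" || fail=1
cmp -s "$tmp/in" "$tmp/out" || fail=1

# Must not match: a zero exit from grep records a failure,
# and the output must be empty (identical to /dev/null).
LC_ALL=C grep '^dollar' "$tmp/in" > "$tmp/out" && fail=1
cmp -s /dev/null "$tmp/out" || fail=1

rm -rf "$tmp"
echo "fail=$fail"
```

Comparing the output against `/dev/null` in the negative case catches a grep that prints something even though it reports no match, which checking the exit status alone would miss.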