pcre: tell grep -P to relax its stance on invalid multibyte chars

Do not exit-2 for invalid UTF-8 characters.
Just prior to this change, this command would match no lines
and fail like this:

    $ printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
    grep: invalid UTF-8 byte sequence in input
    2

After this change, the same command matches both lines, and succeeds:

    jM-^B$
    j$
    0

* src/pcresearch.c (Pcompile): Use PCRE_NO_UTF8_CHECK, too, and add
a comment.
* tests/pcre-utf8: Add a test and a comment.

This change did not work with Debian unstable pcre-8.31-2 or with some
8.33 and 8.34-based versions, but does work with Fedora 20's 8.33 and
with a built-from-latest source library.

Based on a patch by Santiago Ruano Rincón.
See http://bugs.gnu.org/15758/
author: Santiago Ruano Rincón <santiago@debian.org> 2013-12-13 07:53:37 -0800
committer: Jim Meyering <meyering@fb.com> 2013-12-21 10:51:22 -0800
commit: 178ed7cc324bc2000c19a3f7a4be649dfa99b44a (patch)
tree: 7506d9b55886051555be2305608953c2bb3f3549 /tests/pcre-utf8
parent: 1a8b1b370eace41be892e9fef041f36b72baeefb (diff)
download: grep-178ed7cc324bc2000c19a3f7a4be649dfa99b44a.tar.gz
1 files changed, 6 insertions, 0 deletions
diff --git a/tests/pcre-utf8 b/tests/pcre-utf8
index b8228d51..a3b9390b 100755
--- a/tests/pcre-utf8
+++ b/tests/pcre-utf8
@@ -19,9 +19,15 @@ echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' \
 euro='\342\202\254 euro'
 printf "$euro\\n" > in || framework_failure_
 
+# The euro sign has the unicode "Symbol" property, so this must match:
 LC_ALL=en_US.UTF-8 grep -P '^\p{S}' in > out || fail=1
 compare in out || fail=1
 
+# This RE must *not* match in the C locale, because the first
+# byte is not a "Symbol".
+LC_ALL=C grep -P '^\p{S}' in > out && fail=1
+compare /dev/null out || fail=1
+
 LC_ALL=en_US.UTF-8 grep -P '^. euro$' in > out2 || fail=1
 compare in out2 || fail=1
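
The new C-locale test depends on how the euro sign is encoded: in UTF-8 it is the three-byte sequence written in octal as `\342\202\254`, so a byte-oriented matcher sees `\342` first and not a single "Symbol" character. The following is only an illustrative sketch and is not part of the grep test suite; the temporary directory and the variable names `tmp`, `bytes`, and `first` are assumptions made for this example. It inspects the test input without invoking grep at all.

```shell
#!/bin/sh
# Illustrative sketch only; not part of tests/pcre-utf8.
# "tmp", "bytes" and "first" are hypothetical names used for this example.
tmp=$(mktemp -d) || exit 1

# Same input the test script builds: the UTF-8 euro sign, a space, "euro".
euro='\342\202\254 euro'
printf "$euro\\n" > "$tmp/in"

# Count bytes: 3 (euro sign) + 1 (space) + 4 ("euro") + 1 (newline).
bytes=$(wc -c < "$tmp/in" | tr -d ' ')

# Dump only the first byte in octal; this is what a byte-oriented
# matcher (as in the C locale) sees at the start of the line.
first=$(od -An -to1 -N1 "$tmp/in" | tr -d ' \n')

echo "bytes=$bytes first_byte_octal=$first"

rm -rf "$tmp"
```

Seen this way, `^\p{S}` in the C locale is tested against the single byte `\342` and not against the whole euro sign, which is why the added test expects no match and an empty `out` file.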