author: Santiago Ruano Rincón <santiago@debian.org> 2013-12-13 07:53:37 -0800
committer: Jim Meyering <meyering@fb.com> 2013-12-21 10:51:22 -0800
commit: 178ed7cc324bc2000c19a3f7a4be649dfa99b44a
tree: 7506d9b55886051555be2305608953c2bb3f3549 /tests/pcre-utf8
parent: 1a8b1b370eace41be892e9fef041f36b72baeefb
pcre: tell grep -P to relax its stance on invalid multibyte chars

Do not exit-2 for invalid UTF-8 characters.
Just prior to this change, this command would match no lines
and fail like this:

  $ printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
  grep: invalid UTF-8 byte sequence in input
  2

After this change, the same command matches both lines, and succeeds:

  jM-^B$
  j$
  0

* src/pcresearch.c (Pcompile): Use PCRE_NO_UTF8_CHECK, too, and
add a comment.
* tests/pcre-utf8: Add a test and a comment.
This change did not work with Debian unstable pcre-8.31-2 or
with some 8.33 and 8.34-based versions, but does work with Fedora 20's
8.33 and with a built-from-latest source library.
Based on a patch by Santiago Ruano Rincón.
See http://bugs.gnu.org/15758/
Diffstat (limited to 'tests/pcre-utf8')
-rwxr-xr-x tests/pcre-utf8 6
1 files changed, 6 insertions, 0 deletions
diff --git a/tests/pcre-utf8 b/tests/pcre-utf8
index b8228d51..a3b9390b 100755
--- a/tests/pcre-utf8
+++ b/tests/pcre-utf8
@@ -19,9 +19,15 @@ echo '$' | LC_ALL=en_US.UTF-8 grep -qP '\p{S}' \
euro='\342\202\254 euro'
printf "$euro\\n" > in || framework_failure_
+# The euro sign has the unicode "Symbol" property, so this must match:
LC_ALL=en_US.UTF-8 grep -P '^\p{S}' in > out || fail=1
compare in out || fail=1
+# This RE must *not* match in the C locale, because the first
+# byte is not a "Symbol".
+LC_ALL=C grep -P '^\p{S}' in > out && fail=1
+compare /dev/null out || fail=1
+
LC_ALL=en_US.UTF-8 grep -P '^. euro$' in > out2 || fail=1
compare in out2 || fail=1
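
The new C-locale check rests on how the euro sign is encoded. A minimal sketch follows, separate from the test script, that dumps the bytes of the same input. The variable names `bytes` and `first_byte` are illustrative only and do not appear in tests/pcre-utf8. The euro sign (U+20AC) is the three-byte UTF-8 sequence e2 82 ac; in a single-byte locale such as C, '^\p{S}' can only examine the lone byte e2, which, per the comment in the patch, is not a "Symbol".

```shell
# Same input string as the test, written with octal escapes.
euro='\342\202\254 euro'

# Hex dump of the line, normalized to single spaces on one line.
bytes=$(printf "$euro\\n" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "$bytes"

# The first byte by itself: the only thing a pattern anchored at '^'
# sees as a "character" when the locale is single-byte.
first_byte=${bytes%% *}
echo "$first_byte"
```

In a UTF-8 locale, the first three bytes are decoded together as one character, which is why the en_US.UTF-8 variant of the same pattern is expected to match.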