grep: fix encoding-error incompatibilities among regex, DFA, KWset

This follows up on http://bugs.gnu.org/17376 and fixes a different set of incompatibilities, namely between the regex matcher and the other matchers, when the pattern contains encoding errors.

The GNU regex matcher is not consistent in this area: sometimes an encoding error matches only itself, and sometimes it matches part of a multibyte character. There is no documentation for grep's behavior in this area and users don't seem to care, and it's simpler to defer to the regex matcher for problematic cases like these.

* NEWS: Document this.
* src/dfa.c (ctok): Remove. All uses removed.
(parse_bracket_exp, atom): Use BACKREF if a pattern contains an encoding error, so that the matcher will revert to regex.
* src/dfasearch.c, src/grep.c, src/pcresearch.c, src/searchutils.c: Don't include dfa.h, since search.h now does that for us.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): In a UTF-8 locale, there's no need to worry about matching part of a multibyte character.
* src/grep.c (contains_encoding_error): New static function.
(main): Use it, so that grep -F is consistent with plain fgrep when the pattern contains an encoding error.
* src/search.h: Include dfa.h, so that kwsearch.c can call using_utf8.
* src/searchutils.c (is_mb_middle): Remove UTF-8-specific code. Callers now ensure that we are in a non-UTF-8 locale. The code was clearly wrong, anyway.
* tests/fgrep-infloop, tests/invalid-multibyte-infloop:
* tests/prefix-of-multibyte: Do not require that grep have a particular behavior for this test. It's OK to match (exit status 0), not match (exit status 1), or report an error (exit status 2), since the pattern contains an encoding error and grep's behavior is not specified for such patterns. Test only that KWset, DFA, and regex agree.
* tests/prefix-of-multibyte: Add tests for ABCABC and __..._ABCABC___.
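
The relaxed acceptance rule for the tests can be sketched as a small shell helper. This is an illustration only: `check_grep_result` is a hypothetical name that does not appear in the test suite, it uses plain `cmp` and `test -s` rather than the framework's `compare`, and it assumes a single-line input file, so that a match echoes the whole file.

```shell
# Hypothetical helper, not part of grep's test suite.
# Accept any of the three exit statuses grep may legitimately return,
# but require the output to be consistent with the status.
check_grep_result() {
  status=$1 input=$2 output=$3
  case $status in
    0) cmp -s "$input" "$output" ;;  # matched: output must equal the input
    1) test ! -s "$output" ;;        # no match: output must be empty
    2) : ;;                          # error reported: acceptable
    *) return 1 ;;                   # anything else (e.g. a timeout) fails
  esac
}
```

A caller would run grep, save `$?`, and then use `check_grep_result $status in out || fail=1`, which mirrors the shape of the if/elif/else blocks in the diff.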
author: Paul Eggert <eggert@cs.ucla.edu> 2014-05-05 20:19:19 -0700
committer: Paul Eggert <eggert@cs.ucla.edu> 2014-05-05 20:19:59 -0700
commit: eb3292b3b205e50d0373f26ff0950ec82f49c14a (patch)
tree: bfcb18201f277f03886e83efedfc070693652d45 /tests
parent: 17683df11fbea7aa01c9d60f1b45874c9ea5e26a (diff)
download: grep-eb3292b3b205e50d0373f26ff0950ec82f49c14a.tar.gz
3 files changed, 44 insertions, 17 deletions
diff --git a/tests/fgrep-infloop b/tests/fgrep-infloop
index 07a0ce04..015ec74d 100755
--- a/tests/fgrep-infloop
+++ b/tests/fgrep-infloop
@@ -8,14 +8,20 @@ require_compiled_in_MB_support
 
 encode() { echo "$1" | tr ABC '\357\274\241'; }
 
+encode ABC > in || framework_failure_
 fail=0
 
 for LOC in en_US.UTF-8 $LOCALE_FR_UTF8; do
   out=out1-$LOC
-  encode ABC \
-    | LC_ALL=$LOC timeout 10s grep -F "$(encode BC)" > $out 2>&1
-  test $? = 1 || fail=1
-  compare /dev/null $out || fail=1
+  LC_ALL=$LOC timeout 10s grep -F "$(encode BC)" in > $out
+  status=$?
+  if test $status -eq 0; then
+    compare in $out
+  elif test $status -eq 1; then
+    compare_dev_null_ /dev/null $out
+  else
+    test $status -eq 2
+  fi || fail=1
 done
 
 Exit $fail
diff --git a/tests/invalid-multibyte-infloop b/tests/invalid-multibyte-infloop
index e98c1707..b28bc532 100755
--- a/tests/invalid-multibyte-infloop
+++ b/tests/invalid-multibyte-infloop
@@ -14,7 +14,14 @@ encode AA > input
 fail=0
 
 # Before 2.15, this would infloop.
-LC_ALL=en_US.UTF-8 timeout 3 grep -F $(encode A) input > out || fail=1
-compare input out || fail=1
+LC_ALL=en_US.UTF-8 timeout 3 grep -F $(encode A) input > out
+status=$?
+if test $status -eq 0; then
+  compare input out
+elif test $status -eq 1; then
+  compare_dev_null_ /dev/null out
+else
+  test $status -eq 2
+fi || fail=1
 
 Exit $fail
diff --git a/tests/prefix-of-multibyte b/tests/prefix-of-multibyte
index 2ab9a99a..2228a22b 100755
--- a/tests/prefix-of-multibyte
+++ b/tests/prefix-of-multibyte
@@ -9,21 +9,35 @@ encode() { echo "$1" | tr ABC '\357\274\241'; }
 
 encode ABC >exp1
 encode aABC >exp2
+encode ABCABC >exp3
+encode _____________________ABCABC___ >exp4
 
 fail=0
 
 for LOC in en_US.UTF-8 $LOCALE_FR_UTF8; do
-  for type in dfa fgrep regex; do
-    case $type in
-      dfa) opt= prefix= ;;
-      fgrep) opt=-F prefix= ;;
-      regex) opt= prefix='\(\)\1' ;;
-    esac
-    out=out-$type-$LOC
-    LC_ALL=$LOC grep $opt "$prefix$(encode A)" exp1 >$out || fail=1
-    compare exp1 $out || fail=1
-    LC_ALL=$LOC grep $opt "$prefix$(encode aA)" exp2 >$out || fail=1
-    compare exp2 $out || fail=1
+  for pat in A aA BCA; do
+    for file in exp1 exp2 exp3 exp4; do
+      for type in regex dfa fgrep; do
+        case $type in
+          dfa) opt= prefix= ;;
+          fgrep) opt=-F prefix= ;;
+          regex) opt= prefix='\(\)\1' ;;
+        esac
+        pattern=$prefix$(encode $pat)
+        out=out-$type-$LOC
+        LC_ALL=$LOC grep $opt "$pattern" $file >$out
+        status=$?
+        echo $status >$out.status
+        if test $status -eq 0; then
+          compare $file $out
+        elif test $status -eq 1; then
+          compare_dev_null_ /dev/null $out
+        else
+          test $status -eq 2
+        fi || fail=1
+        compare out-regex-$LOC.status $out.status || fail=1
+      done
+    done
   done
 done
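
To see why these patterns count as encoding errors, it may help to look at the bytes that the tests' `encode` function produces. The sketch below reuses `encode` from the tests, with `LC_ALL=C` added as an assumption so that `tr` works bytewise whatever the ambient locale. `hex` is a hypothetical helper introduced here for display. The letters A, B and C map to the bytes 0xEF, 0xBC and 0xA1, which together form the UTF-8 encoding of U+FF21, so any pattern that does not contain the three bytes in that order as a unit is malformed in a UTF-8 locale.

```shell
# encode as in the tests, with LC_ALL=C added here (an assumption) so that
# tr translates bytes regardless of the caller's locale.
encode() { echo "$1" | LC_ALL=C tr ABC '\357\274\241'; }

# Hypothetical display helper: print stdin as a run of hex digits.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

encode ABC | hex   # efbca10a: one complete character plus newline
encode BC  | hex   # bca10a:   continuation bytes with no lead byte
encode A   | hex   # ef0a:     lead byte with no continuation bytes
```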