grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

These days, nearly everyone uses a multibyte locale, and grep is often
used with the --ignore-case (-i) option, but that option imposes a very
high cost in order to handle some unusual cases in just a few multibyte
locales.  This change gets most of the performance of using LC_ALL=C
without eliminating the ability to search for multibyte strings.

With the following example, I see an 11x speed-up with a 2.3GHz i7:
Generate a 10M-line file, with each line consisting of 40 'j's:

  yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k

Time searching it for the simple/nonexistent string "foobar",
first with this patch (best-of-5 trials):

  LC_ALL=en_US.UTF-8 env time src/grep -i foobar k
    1.10 real         1.03 user         0.07 sys

Back out that commit (temporarily), recompile, and rerun the experiment:

  git log -1 -p|patch -R -p1; make
  LC_ALL=en_US.UTF-8 env time src/grep -i foobar k
    12.50 real        12.41 user         0.08 sys

The trick is to realize that for some search strings, it is easy
to convert to an equivalent one that is handled much more efficiently.
E.g., convert this command:

  grep -i foobar k

to this:

  grep '[fF][oO][oO][bB][aA][rR]' k

That allows the matcher to search in buffer mode, rather than having to
extract/case-convert/search each line separately.  Currently, we perform
this conversion only when search strings contain neither '\' nor '['.
See the comments for more detail.

* src/main.c (trivial_case_ignore): New function.
(main): When possible, transform the regexp so we can drop the -i.
* tests/turkish-eyes: New file.
* tests/Makefile.am (TESTS): Use it.
* NEWS (Improvements): Mention it.
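
The conversion described in the message can be approximated in a few lines of shell. This is only a rough sketch and not the code added to src/main.c: the function name to_bracket_pattern is hypothetical, and it folds only ASCII letters (it runs awk under LC_ALL=C and passes every other byte through unchanged). The real trivial_case_ignore has to deal with multibyte characters as well. Like the patch, the sketch declines to convert any pattern containing '\' or '['.

```shell
# Hypothetical helper, ASCII-only: rewrite a pattern so that each letter
# becomes a two-character bracket expression, e.g. f -> [fF].
# Returns 1 and prints nothing when the pattern contains '\' or '[',
# mirroring the restriction stated in the commit message.
to_bracket_pattern() {
  case $1 in
    *\\*|*\[*) return 1 ;;
  esac
  printf '%s\n' "$1" | LC_ALL=C awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      lo = tolower(c)
      up = toupper(c)
      # Characters without distinct cases are copied through unchanged.
      out = out (lo == up ? c : "[" lo up "]")
    }
    print out
  }'
}

to_bracket_pattern foobar    # prints [fF][oO][oO][bB][aA][rR]
```

Under these assumptions, grep "$(to_bracket_pattern foobar)" k would search for the same ASCII strings as grep -i foobar k, without asking the matcher to ignore case.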
author: Jim Meyering <meyering@fb.com> 2013-11-24 18:49:31 -0800
committer: Jim Meyering <meyering@fb.com> 2014-01-09 21:08:41 -0800
commit: 97318f5e59a1ef6feb8a378434a00932a3fc1e0b (patch)
tree: 4941a8fa9a48bdbd142216abc134bc8197b75e2b /tests/turkish-eyes
parent: c53ed7be03da564fc45836048324ee184f4541f1 (diff)
download: grep-97318f5e59a1ef6feb8a378434a00932a3fc1e0b.tar.gz
1 files changed, 44 insertions, 0 deletions
diff --git a/tests/turkish-eyes b/tests/turkish-eyes
new file mode 100755
index 00000000..323eb354
--- /dev/null
+++ b/tests/turkish-eyes
@@ -0,0 +1,44 @@
+#!/bin/sh
+# Ensure that case-insensitive matching works with all Turkish i's
+
+# Copyright (C) 2014 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_compiled_in_MB_support
+
+fail=0
+
+L=tr_TR.UTF-8
+
+# Check for a broken tr_TR.UTF-8 locale definition.
+# In this locale, 'i' is not a lower-case 'I'.
+echo I | LC_ALL=$L grep -i i > /dev/null \
+    && skip_ "your $L locale appears to be broken"
+
+# Ensure that this matches:
+# printf 'I:İ ı:i\n'|LC_ALL=tr_TR.utf8 grep -i 'ı:i I:İ'
+I=$(printf '\304\260') # capital I with dot
+i=$(printf '\304\261') # lowercase dotless i
+
+data=$(      printf "I:$I $i:i")
+search_str=$(printf "$i:i I:$I")
+printf "$data\n" > in || framework_failure_
+
+LC_ALL=$L grep -i "^$search_str\$" in > out || fail=1
+compare out in || fail=1
+
+Exit $fail
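
The octal escapes in the test are the UTF-8 encodings of U+0130 (capital I with dot above) and U+0131 (small dotless i). As a quick check outside the test suite, the bytes can be dumped with od. The helper name hex_bytes below is hypothetical and is not part of the grep test framework.

```shell
# Hypothetical helper: print the bytes of its argument as lowercase hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

I=$(printf '\304\260')   # U+0130, capital I with dot above
i=$(printf '\304\261')   # U+0131, small dotless i

hex_bytes "$I"; echo     # prints c4b0
hex_bytes "$i"; echo     # prints c4b1
```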