author: Jim Meyering <meyering@redhat.com> 2012-06-01 21:18:00 +0200
committer: Jim Meyering <meyering@redhat.com> 2012-06-02 11:06:08 +0200
commit: 7aa698d36b5b2eeb8e90e7a327eb7ebe46d59e87
tree: cadef59a85a9ffe401994e8addeb1a995394c279 /NEWS
parent: 2665746b756bd372ba856e165388dc98032362fd
grep: fix how -i works with a match containing the Turkish I-with-dot

Fix a long-standing problem in the way grep's -i interacts with
data whose byte count changes when we convert it to lower case.
For example, the UTF-8 Turkish I-with-dot (İ) occupies two bytes,
but its lower case analog, i, occupies just one byte.  The code
converts both search string and the haystack data to lower case,
and then searches for the modified string in the modified buffer.
The trouble arose when using a lowercase buffer <offset,length>
pair to manipulate the original (longer) buffer.

The solution is to change mbtolower to return additional information:
a malloc'd mapping vector.  With that, the caller maps the
lowercase-relative <offset,length> to numbers that refer to the
original buffer.  This mapping is used only when lengths actually
differ, so the cost in general should be small.

* src/searchutils.c (mbtolower): Add the new map parameter.
* src/search.h (mb_case_map_apply): New function.
* src/kwsearch.c (Fexecute): Update mbtolower caller, and upon
success, apply the new map.
* src/dfasearch.c (EGexecute): Likewise.
* tests/Makefile.am (XFAIL_TESTS): Remove turkish-I from this list;
that test is no longer expected to fail.
* NEWS (Bug fixes): Mention it.

Reported by Ilya Basin in
http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413
and later by Strahinja Kustudic in
http://savannah.gnu.org/bugs/?36567
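
The mapping idea described above can be illustrated with a small, self-contained sketch. This is not the code from src/searchutils.c or src/search.h: the names toy_tolower and toy_map_apply are hypothetical, the map layout (one original-buffer offset per lowercased byte) is an assumption chosen for clarity, and the converter handles only ASCII and the two-byte UTF-8 encoding of U+0130 (0xC4 0xB0).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT grep's mbtolower.  Lower-case a UTF-8 buffer,
   handling only ASCII letters and U+0130 (bytes 0xC4 0xB0).
   On success, (*map)[i] is the offset in SRC of the character that
   produced output byte i, and (*map)[*out_len] == src_len.
   The caller frees both the returned buffer and *map.  */
static char *
toy_tolower (const char *src, size_t src_len, size_t *out_len, size_t **map)
{
  char *dst = malloc (src_len + 1);
  size_t *m = malloc ((src_len + 1) * sizeof *m);
  size_t i = 0, o = 0;

  if (!dst || !m)
    {
      free (dst);
      free (m);
      return NULL;
    }

  while (i < src_len)
    {
      unsigned char c = (unsigned char) src[i];
      m[o] = i;                 /* output byte o came from input offset i */
      if (c == 0xC4 && i + 1 < src_len
          && (unsigned char) src[i + 1] == 0xB0)
        {
          dst[o++] = 'i';       /* two input bytes shrink to one */
          i += 2;
        }
      else
        {
          dst[o++] = (char) (('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c);
          i += 1;
        }
    }

  assert (o <= src_len);        /* this toy conversion never grows the data */
  m[o] = src_len;               /* sentinel: one past the last character */
  dst[o] = '\0';
  *out_len = o;
  *map = m;
  return dst;
}

/* Hypothetical helper: convert an <offset,length> pair that is relative
   to the lowercased buffer into one relative to the original buffer.  */
static void
toy_map_apply (const size_t *map, size_t *off, size_t *len)
{
  size_t end = map[*off + *len];
  *off = map[*off];
  *len = end - *off;
}
```

For the five-byte input "x", 0xC4, 0xB0, "y", "Z", the lowercased buffer is the four bytes "xiyz". A match at lowercase <offset 1, length 2> ("iy") maps back to original <offset 1, length 3>, which covers the full two-byte İ plus "y". Using the unmapped pair on the original buffer would cut the span one byte short, which is the kind of incomplete output the commit fixes.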
Diffstat (limited to 'NEWS')
-rw-r--r-- NEWS 5
1 files changed, 5 insertions, 0 deletions
diff --git a/NEWS b/NEWS
index 69262765..d0ea60ab 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,11 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
+ grep -i, in a multi-byte locale, when matching a line containing a character
+ like the UTF-8 Turkish I-with-dot (U+0130) (whose lower-case representation
+ occupies fewer bytes), would print an incomplete output line.
+ [bug introduced in grep-2.6]
+
--include and --exclude can again be combined, and again apply to
the command line, e.g., "grep --include='*.[ch]' --exclude='system.h'
PATTERN *" again reads all *.c and *.h files except for system.h.