author | Jim Meyering <meyering@redhat.com> | 2012-06-16 12:32:35 +0200 |
---|---|---|
committer | Jim Meyering <meyering@redhat.com> | 2012-06-16 18:18:00 +0200 |
commit | 2be0c6591ea5d56ee7fbc364228f24b85f569a5c (patch) | |
tree | 094cd910c8362e8fb429cfced49d96459f47b784 /src/searchutils.c | |
parent | 074842d3e3054714a495252e582886f0e4ace4e4 (diff) | |
download | grep-2be0c6591ea5d56ee7fbc364228f24b85f569a5c.tar.gz |
grep -i: work also when converting to lower-case inflates byte count
Commit v2.12-16-g7aa698d addressed the case in which the lower-case
representation of an input character occupies fewer bytes than the original.
However, even with commit v2.12-20-g074842d, grep -i would still
misbehave when converting a character to lower-case increased its
byte count. The map-manipulation code assumed that the case conversion
could only shrink the byte count. With the consideration that it may
also inflate it, the deltas recorded in the map array must be signed,
and we must account for the one-to-two-or-more mapping when the
original-to-lower-case conversion causes the byte count to increase.
* src/searchutils.c (mbtolower): When a lower-case character occupies
more than one byte, set its remaining map slots to zero. Change the
type of the map to be signed, and compute the change in character
byte count as new_length - old_length.
* src/search.h: Include <stdint.h>, for decl of intmax_t.
(mb_case_map_apply): Adjust for signed increments:
each map entry is now signed.
(mb_len_map_t): Define type. Thanks to Paul Eggert for noticing
in review that using a bare "char" as the base type would be wrong on
systems for which it is an unsigned type (as with gcc's -funsigned-char).
* src/kwsearch.c (Fcompile, Fexecute): Likewise.
* src/dfasearch.c (kwsincr_case, EGexecute): Likewise.
* tests/turkish-I-without-dot: New test. Thanks to Paolo Bonzini
for the tip that in the tr_TR.utf8 locale, mapping "I" to lower case
increases the character's byte count.
* tests/Makefile.am (TESTS): Add it.
* tests/init.cfg (require_tr_utf8_locale_): New function.
* NEWS (Bug fixes): Expand the existing entry.
Diffstat (limited to 'src/searchutils.c')
-rw-r--r-- | src/searchutils.c | 41 |
1 files changed, 26 insertions, 15 deletions
diff --git a/src/searchutils.c b/src/searchutils.c
index c1fb656d..ca30134a 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -43,10 +43,10 @@ kwsinit (kwset_t *kwset)
 }
 
 #if MBS_SUPPORT
-/* Convert the *N-byte string, BEG, to lowercase, and write the
+/* Convert the *N-byte string, BEG, to lower-case, and write the
    NUL-terminated result into malloc'd storage. Upon success, set *N
    to the length (in bytes) of the resulting string (not including the
-   trailing NUL byte), and return a pointer to the lowercase string.
+   trailing NUL byte), and return a pointer to the lower-case string.
    Upon memory allocation failure, this function exits.
    Note that on input, *N must be larger than zero.
 
@@ -55,26 +55,35 @@ kwsinit (kwset_t *kwset)
    to the buffer and reuses it on any subsequent call. As a consequence,
    this function is not thread-safe.
 
-   When all the characters in the lowercase result string have the
-   same length as corresponding characters in the input string,
-   set *LEN_MAP_P to NULL. Otherwise, set it to a malloc'd buffer (like the
-   returned buffer, this must not be freed by caller) of the same length as
-   the result string. (*LEN_MAP_P)[J] is one less than the length-in-bytes
-   of the character in BEG that formed byte J of the result. This map is
-   used by the caller to convert offset,length pairs that reference the
-   lowercase result to numbers that refer to the corresponding parts of
-   the original buffer. */
+   When each character in the lower-case result string has the same length
+   as the corresponding character in the input string, set *LEN_MAP_P
+   to NULL. Otherwise, set it to a malloc'd buffer (like the returned
+   buffer, this must not be freed by caller) of the same length as the
+   result string. (*LEN_MAP_P)[J] is the change in byte-length of the
+   character in BEG that formed byte J of the result as it was converted to
+   lower-case. It is usually zero. For the upper-case Turkish I-with-dot
+   it is -1, since the upper-case character occupies two bytes, while the
+   lower-case one occupies only one byte. For the Turkish-I-without-dot
+   in the tr_TR.utf8 locale, it is 1 because the lower-case representation
+   is one byte longer than the original. When that happens, we have two
+   or more slots in *LEN_MAP_P for each such character. We store the
+   difference in the first one and 0's in any remaining slots.
+
+   This map is used by the caller to convert offset,length pairs that
+   reference the lower-case result to numbers that refer to the matched
+   part of the original buffer. */
+
 char *
-mbtolower (const char *beg, size_t *n, unsigned char **len_map_p)
+mbtolower (const char *beg, size_t *n, mb_len_map_t **len_map_p)
 {
   static char *out;
-  static unsigned char *len_map;
+  static mb_len_map_t *len_map;
   static size_t outalloc;
   size_t outlen, mb_cur_max;
   mbstate_t is, os;
   const char *end;
   char *p;
-  unsigned char *m;
+  mb_len_map_t *m;
   bool lengths_differ = false;
 
   if (*n > outalloc || outalloc == 0)
@@ -123,9 +132,11 @@ mbtolower (const char *beg, size_t *n, unsigned char **len_map_p)
         }
       else
         {
-          *m++ = mbclen - 1;
           beg += mbclen;
           size_t ombclen = wcrtomb (p, towlower ((wint_t) wc), &os);
+          *m = mbclen - ombclen;
+          memset (m + 1, 0, ombclen - 1);
+          m += ombclen;
           p += ombclen;
           outlen += ombclen;
           lengths_differ |= (mbclen != ombclen);
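To illustrate why the map entries need to be signed, here is a minimal, self-contained sketch of building and applying such a map. It is not the grep implementation: `len_delta_t`, `build_len_map` and `apply_len_map` are hypothetical names, the per-character byte lengths are passed in directly rather than obtained from `mbrtowc`/`wcrtomb`, and the sign of each entry follows the assignment `*m = mbclen - ombclen` in the hunk above (original length minus lower-case length).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the map element type.  It is explicitly
   signed, so it does not depend on the signedness of plain "char".  */
typedef signed char len_delta_t;

/* Build a length map with one slot per byte of the lower-case result.
   ORIG_LEN[i] and LOWER_LEN[i] are the byte counts of character i
   before and after conversion.  The first slot of each character holds
   the delta (same sign as "mbclen - ombclen" in the diff); any
   remaining slots of a multi-byte lower-case character hold 0.
   Return the number of slots written.  */
static size_t
build_len_map (const size_t *orig_len, const size_t *lower_len,
               size_t nchars, len_delta_t *map)
{
  size_t j = 0;
  for (size_t i = 0; i < nchars; i++)
    {
      assert (lower_len[i] > 0);
      map[j++] = (len_delta_t) ((intmax_t) orig_len[i]
                                - (intmax_t) lower_len[i]);
      for (size_t k = 1; k < lower_len[i]; k++)
        map[j++] = 0;
    }
  return j;
}

/* Convert an (offset, length) pair that refers to the lower-case
   buffer into one that refers to the original buffer, by summing the
   deltas before the match and the deltas inside the match.  */
static void
apply_len_map (const len_delta_t *map, size_t *off, size_t *len)
{
  intmax_t off_incr = 0;
  intmax_t len_incr = 0;
  for (size_t k = 0; k < *off; k++)
    off_incr += map[k];
  for (size_t k = *off; k < *off + *len; k++)
    len_incr += map[k];
  *off = (size_t) ((intmax_t) *off + off_incr);
  *len = (size_t) ((intmax_t) *len + len_incr);
}
```

For a three-character input whose middle character grows from one byte to two (as with "I" in tr_TR.utf8), the map under this convention is {0, -1, 0, 0}. A match at offset 3, length 1 in the lower-case buffer maps back to offset 2, length 1 in the original. With an unsigned element type the -1 entry could not be represented.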