path: root/src/search.h
Commit message | Author | Age | Files | Lines
* grep: prefer signed to unsigned integers | Paul Eggert | 2021-08-25 | 1 | -15/+32
    This improves runtime checking for integer overflow when compiling with gcc -fsanitize=undefined and the like. It also avoids the need for some integer casts, which can be error-prone.
    * bootstrap.conf (gnulib_modules): Add idx.
    * src/dfasearch.c (struct dfa_comp, kwsmusts): (possible_backrefs_in_pattern, regex_compile, GEAcompile) (EGexecute):
    * src/grep.c (struct patloc, patlocs_allocated, patlocs_used) (n_patterns, update_patterns, pattern_file_name, poison_len) (asan_poison, fwrite_errno, compile_fp_t, execute_fp_t) (buf_has_encoding_errors, buf_has_nulls, file_must_have_nulls) (bufalloc, pagesize, all_zeros, fillbuf, nlscan) (print_line_head, print_line_middle, print_line_tail, grepbuf) (grep, contains_encoding_error, fgrep_icase_available) (fgrep_icase_charlen, fgrep_to_grep_pattern, try_fgrep_pattern) (main):
    * src/kwsearch.c (struct kwsearch, Fcompile, Fexecute):
    * src/kwset.c (struct trie, struct kwset, kwsalloc, kwsincr) (kwswords, treefails, memchr_kwset, acexec_trans, kwsexec) (treedelta, kwsprep, bm_delta2_search, bmexec_trans, bmexec) (acexec):
    * src/kwset.h (struct kwsmatch):
    * src/pcresearch.c (Pcompile, Pexecute):
    * src/search.h (mb_clen):
    * src/searchutils.c (kwsinit, mb_goback, wordchars_count) (wordchars_size, wordchar_next, wordchar_prev): Prefer idx_t to size_t or ptrdiff_t for nonnegative sizes, and prefer ptrdiff_t to size_t for sizes plus error values.
    * src/grep.c (uword_size): New constant, used for signed size calculations.
    (totalnl, add_count, totalcc, print_offset, print_line_head, grep): Prefer intmax_t to uintmax_t for wide integer calculations.
    (fgrep_icase_charlen): Prefer ptrdiff_t to int for size offsets.
    * src/grep.h: Include idx.h.
    * src/search.h (imbrlen): New function, like mbrlen except with idx_t and ptrdiff_t.
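The last entry above describes imbrlen only as "like mbrlen except with idx_t and ptrdiff_t". A minimal sketch of what such a wrapper might look like follows; the names `idx_t_sketch` and `imbrlen_sketch` are hypothetical stand-ins, and the body is an assumption rather than the code in src/search.h.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for gnulib's idx_t: a signed type that holds
   nonnegative sizes.  */
typedef ptrdiff_t idx_t_sketch;

/* Like mbrlen, but with a signed size argument and a signed result:
   -1 for an encoding error, -2 for an incomplete character,
   otherwise the character's byte length.  */
static ptrdiff_t
imbrlen_sketch (char const *s, idx_t_sketch n, mbstate_t *mbs)
{
  size_t len = mbrlen (s, (size_t) n, mbs);
  if (len == (size_t) -1)
    return -1;
  if (len == (size_t) -2)
    return -2;
  return (ptrdiff_t) len;
}
```

With a signed result, callers can test for failure with a plain `len < 0` instead of comparing against cast constants.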
* grep: avoid some size_t casts | Paul Eggert | 2021-08-24 | 1 | -2/+2
    This helps move the code away from unsigned types.
    * src/grep.c (buf_has_encoding_errors, contains_encoding_error):
    * src/searchutils.c (mb_goback): Compare to MB_LEN_MAX, not to (size_t) -2. This is a bit safer anyway, as grep relies on MB_LEN_MAX limits elsewhere.
    * src/search.h (mb_clen): Compare to -2 before converting to size_t.
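The MB_LEN_MAX comparison mentioned above works because every valid mbrlen result is at most MB_LEN_MAX, while the two failure values, (size_t) -1 and (size_t) -2, are the largest values of the type. A small sketch, where the helper name `mbrlen_failed` is hypothetical and not taken from grep:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* True if LEN, a value returned by mbrlen, signals an encoding error
   ((size_t) -1) or an incomplete character ((size_t) -2).  Valid
   lengths never exceed MB_LEN_MAX, so one comparison covers both.  */
static bool
mbrlen_failed (size_t len)
{
  return MB_LEN_MAX < len;
}
```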
* maint: run "make update-copyright" | Paul Eggert | 2021-01-01 | 1 | -1/+1
* grep: -P: report input filename upon PCRE execution failure | Jim Meyering | 2020-10-11 | 1 | -0/+2
    Without this, it could be tedious to determine which input file evokes a PCRE-execution-time failure.
    * src/pcresearch.c (Pexecute): When failing, include the error-provoking file name in the diagnostic.
    * src/grep.c (input_filename): Make extern, since used above.
    * src/search.h (input_filename): Declare.
    * tests/filename-lineno.pl: Test for this.
    ($no_pcre): Factor out.
    * NEWS (Bug fixes): Mention this.
* grep: avoid unnecessary regex compilation | Norihiro Tanaka | 2020-09-22 | 1 | -3/+3
    Grep resorts to using the regex engine when the precision of either -o or --color is required, or when the pattern is not supported by our DFA engine (e.g., backref). Otherwise, grep would perform regex compilation solely to check the syntax. This change makes grep skip that compilation in the common case for which it is unnecessary. The compilation we are avoiding is quite costly, consuming O(N^2) RSS for N regular expressions.
    * src/dfasearch.c (GEAcompile): Add new argument, and avoid unneeded compilation of regex.
    * src/grep.c (compile_fp_t): Update prototype.
    (main): Update caller.
    * src/kwsearch.c (Fcompile): Update caller and add new argument.
    * src/pcresearch.c (Pcompile): Add new argument.
    * src/search.h (GEAcompile, Fcompile, Pcompile): Update prototype.
* maint: update all copyright year number ranges | Jim Meyering | 2020-01-01 | 1 | -1/+1
    Run "make update-copyright" and then...
    * gnulib: Update to latest with copyright year adjusted.
    * tests/init.sh: Sync with gnulib to pick up copyright year.
    * bootstrap: Likewise.
    * doc/grep.in.1: Use "-" in copyright year ranges, not \en.
* grep: improve grep -Fw performance in non-UTF8 multibyte locales | Norihiro Tanaka | 2019-11-17 | 1 | -1/+2
    * src/searchutils.c (mb_goback): New parameter. All callers changed.
    * src/search.h (mb_goback): Update prototype.
    * src/kwsearch.c (Fexecute): Use mb_goback's MBCLEN to detect a word-boundary even more efficiently.
* maint: update all copyright dates via "make update-copyright" | Jim Meyering | 2019-01-01 | 1 | -1/+1
    * gnulib: Also update submodule for its copyright updates.
* maint: update gnulib and copyright dates for 2018 | Jim Meyering | 2018-01-06 | 1 | -1/+1
    * gnulib: Update to latest.
    * all files: Run "make update-copyright".
    * bootstrap: Update from gnulib.
* Improve -i performance in typical UTF-8 searches | Paul Eggert | 2017-01-17 | 1 | -3/+4
    Currently ‘grep -i i’ is slow in a UTF-8 locale, because ‘i’ in the pattern matches the two-byte character 'ı' (U+0131, LATIN SMALL LETTER DOTLESS I) in data, and kwset handles only single-byte character translations, so grep falls back on a slower DFA-based search for all searches. Improve -i performance in the typical case by using kwset when data are free of troublesome characters like 'ı', falling back on the DFA only when data contain troublesome characters.
    * src/dfasearch.c (GEAcompile):
    * src/grep.c (compile_fp_t):
    * src/kwsearch.c (Fcompile):
    * src/pcresearch.c (Pcompile): Pattern arg is now char *, not char const *, since Fcompile now reallocates it sometimes.
    * src/grep.c (all_single_byte_after_folding): Remove. All callers removed.
    (fgrep_icase_charlen): New function.
    (fgrep_icase_available, try_fgrep_pattern): Use it, for more-generous semantics.
    (fgrep_to_grep_pattern): Now extern.
    (main): Do not free keys, since Fexecute may use them.
    * src/kwsearch.c (struct kwsearch): New struct.
    (Fcompile): Return it. If -i, be more generous about patterns.
    (Fexecute): Use it. Fall back on DFA when the data contain troublesome characters; this should be rare in practice.
    * src/kwset.c, src/kwset.h (kwswords): New function.
* maint: update gnulib and copyright dates for 2017 | Jim Meyering | 2017-01-01 | 1 | -1/+1
    * gnulib: Update to latest.
    * all files: Run "make update-copyright".
* grep: minor performance tweak for pure functions | Paul Eggert | 2016-12-26 | 1 | -3/+4
    * src/search.h (wordchars_size, wordchar_next, wordchar_prev): Declare to be pure.
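Declaring a function pure tells the compiler that it has no side effects and that its result depends only on its arguments and on readable memory, so repeated calls with the same inputs can be merged. A sketch of such a declaration using GCC's attribute syntax; the macro `SKETCH_PURE` and the function `leading_word_bytes` are hypothetical and are not grep's wordchars_size.

```c
#include <ctype.h>
#include <stddef.h>

/* Expand to GCC's pure attribute where available, else to nothing.  */
#if defined __GNUC__
# define SKETCH_PURE __attribute__ ((pure))
#else
# define SKETCH_PURE
#endif

/* Return the number of leading word bytes (alphanumerics or '_') in
   the range [BUF, END).  It only reads memory, so it qualifies as pure.  */
static size_t leading_word_bytes (char const *buf, char const *end)
  SKETCH_PURE;

static size_t
leading_word_bytes (char const *buf, char const *end)
{
  char const *p = buf;
  while (p < end && (isalnum ((unsigned char) *p) || *p == '_'))
    p++;
  return (size_t) (p - buf);
}
```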
* grep: move localeinfo to grep.c | Zev Weiss | 2016-12-25 | 1 | -1/+3
    It's not really dfasearch-specific, and grep.c initializes it, so it seems like the most appropriate "owner".
    * src/dfasearch.c (localeinfo): Remove.
    * src/grep.c (localeinfo): Add.
    * src/search.h (localeinfo): Move to new commented section.
* grep: prepare search backends for thread-safety | Zev Weiss | 2016-12-25 | 1 | -6/+6
    To facilitate removing mutable global state from search backends, compile() functions will return an opaque pointer to backend-specific data, which must then be passed back into the corresponding execute() function. This is merely a preparatory step changing function signatures and call sites, so the pointers passed & returned are dummies for now and not (yet) actually used.
    * src/grep.c (compile_fp_t): Now returns an opaque pointer (the compiled pattern).
    (execute_fp_t): Now passed the pointer returned by a compile_fp_t. All call sites updated accordingly.
    (compiled_pattern): New static variable.
    * src/dfasearch.c (GEAcompile): Return a void pointer (dummy NULL).
    (EGexecute): Receive a void pointer argument (unused).
    * src/kwsearch.c (Fcompile): Return a void pointer (dummy NULL).
    (Fexecute): Receive a void pointer argument (unused).
    * src/pcresearch.c (Pcompile): Return a void pointer (dummy NULL).
    (Pexecute): Receive a void pointer argument (unused).
    * src/search.h: Update compile/execute function prototypes.
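The compile/execute split described above can be illustrated with a toy backend that keeps its state behind a void pointer instead of in globals. Everything in this sketch is hypothetical: the names `literal_compile` and `literal_execute`, the struct, and the simplified signatures do not match grep's real compile_fp_t and execute_fp_t.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Backend-private state; callers see it only as void *.  */
struct literal_pattern
{
  char *text;
  size_t len;
};

/* Hypothetical compile step: returns an opaque handle, or NULL on
   allocation failure.  */
static void *
literal_compile (char const *pattern, size_t len)
{
  struct literal_pattern *pat = malloc (sizeof *pat);
  if (!pat)
    return NULL;
  pat->text = malloc (len ? len : 1);
  if (!pat->text)
    {
      free (pat);
      return NULL;
    }
  memcpy (pat->text, pattern, len);
  pat->len = len;
  return pat;
}

/* Hypothetical execute step: receives the handle back and returns 1
   if BUF contains the literal pattern, 0 otherwise.  */
static int
literal_execute (void *compiled, char const *buf, size_t size)
{
  struct literal_pattern const *pat = compiled;
  for (size_t i = 0; pat->len <= size && i <= size - pat->len; i++)
    if (memcmp (buf + i, pat->text, pat->len) == 0)
      return 1;
  return 0;
}
```

Because each handle carries its own state, two threads could search with two different compiled patterns without sharing mutable globals, which is the stated goal of the change.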
* grep: improve word checking with UTF-8 | Paul Eggert | 2016-12-23 | 1 | -1/+1
    * src/searchutils.c: Do not include <verify.h>.
    (word_start): Remove, replacing with ...
    (sbwordchar): New static var. All uses changed.
    (wordchar_prev): Return size_t, not bool, as this generates slightly better code. Go back faster if UTF-8.
* grep: speed up -wf in C locale | Paul Eggert | 2016-12-23 | 1 | -0/+1
    Problem reported by Norihiro Tanaka (Bug#22357#100). This patch improves the performance on that benchmark on my platform so that grep is now only about 2x slower than grep 2.26, which means it is considerably faster than grep 2.25 and earlier.
    * src/kwsearch.c (Fexecute): Use wordchars_size to boost performance for this case.
    * src/search.h, src/searchutils.c (wordchars_size): New function.
* grep: specialize word-finding functions | Paul Eggert | 2016-12-23 | 1 | -2/+3
    This improves performance a bit.
    * src/dfasearch.c, src/kwsearch.c (wordchar): Remove; now in searchutils.c.
    * src/grep.c (main): Call wordinit if -w.
    * src/search.h: Adjust.
    * src/searchutils.c: Include verify.h.
    (word_start): New static var.
    (wordchar): Move here from dfasearch.c and kwsearch.c.
    (wordinit, wordchars_count, wordchar_next, wordchar_prev): New functions.
    (mb_prev_wc, mb_next_wc): Remove. All callers changed to use the new functions instead.
* grep: simplify matcher configuration | Paul Eggert | 2016-12-20 | 1 | -2/+2
    * src/grep.c (matcher, compile): Remove static vars.
    (compile_fp_t): Now takes a 3rd syntax argument.
    (Gcompile, Ecompile, Acompile, GAcompile, PAcompile): Remove.
    (struct matcher): Now nameless, since it is used only once. Make 'name' a bit shorter. New member 'syntax'.
    (matchers): Initialize it, and change removed functions to GEAcompile.
    (F_MATCHER_INDEX, G_MATCHER_INDEX): New constants.
    (setmatcher): New arg MATCHER, and return new matcher index. Avoid unnecessary call to strcmp.
    (main): Keep matcher as a local int, not a global pointer.
    * src/kwsearch.c (Fcompile):
    * src/pcresearch.c (Pcompile): Ignore the 3rd syntax argument.
* grep: -P no longer uses PCRE_MULTILINE | Paul Eggert | 2016-11-19 | 1 | -3/+3
    This reverts commit f6603c4e1e04dbb87a7232c4b44acc6afdf65fef, as the extra performance is not worth the trouble for PCRE users. Problem reported by Stephane Chazelas in: http://bugs.gnu.org/22655#103
    * NEWS: Document this and the next patch.
    * src/dfasearch.c (EGexecute):
    * src/grep.c (execute_fp_t):
    * src/kwsearch.c (Fexecute):
    * src/pcresearch.c (Pexecute): First arg is now a const pointer again.
    * src/grep.c (buf_has_encoding_errors): Now static.
    * src/grep.h (buf_has_encoding_errors): Remove decl.
    * src/search.h: Adjust decls.
    * src/pcresearch.c (reflags): Remove. All uses removed.
    (Pcompile, Pexecute): Do not use PCRE_MULTILINE.
* grep: die more systematically | Paul Eggert | 2016-10-04 | 1 | -1/+0
    * src/die.h: New file.
    * src/dfasearch.c, src/grep.c, src/pcresearch.c: Include die.h.
    * src/dfasearch.c (dfaerror):
    * src/grep.c (context_length_arg, add_count, prline, setmatcher, main):
    * src/pcresearch.c (jit_exec, Pcompile, Pexecute): Use 'die' instead of 'error' when exiting.
    * src/pcresearch.c: Do not include verify.h.
    (die): Remove; now in die.h.
    * src/search.h: Do not include error.h here, since this file does not use anything defined in error.h. Instead, dfasearch.c, which uses error.h's symbols, now includes error.h directly.
* grep: avoid code duplication with -iF | Paul Eggert | 2016-09-01 | 1 | -1/+1
    This follows up on the -iF performance improvement (Bug#23752).
    * NEWS: Simplify description of -iF improvement.
    * src/dfa.c: Do not include wctype.h.
    (lonesome_lower, case_folded_counterparts): Move to localeinfo.c.
    (CASE_FOLDED_BUFSIZE): Move to localeinfo.h.
    * src/grep.c: Do not include wctype.h.
    (lonesome_lower): Remove.
    (fgrep_icase_available): Use case_folded_counterparts instead. Do not call it for the same character twice. Return false on wcrtomb failures (which should never happen).
    (fgrep_to_grep_pattern, main): Simplify. Let fgrep_to_grep’s caller fiddle with the global variables.
    * src/localeinfo.c: Include <wctype.h>
    (lonesome_lower, case_folded_counterparts): Move here from src/dfa.c. Return int, not unsigned int. Verify that CASE_FOLDED_BUFSIZE is big enough.
    * src/localeinfo.h (CASE_FOLDED_BUFSIZE): Now 32, so that we don’t expose lonesome_lower’s size.
    * src/searchutils.c (kwsinit): Return new kwset instead of storing it via a pointer. All callers changed. Simplify a bit.
* grep: speed up -iF in multibyte locales | Norihiro Tanaka | 2016-09-01 | 1 | -1/+1
    In a multibyte locale, if a pattern is composed of only single-byte characters, all their counterparts are also single-byte characters, and the pattern does not have invalid sequences, grep -iF uses the fgrep matcher, the same as in a single-byte locale (Bug#23752).
    * NEWS: Mention it.
    * src/grep.c (lonesome_lower): New constant.
    (fgrep_icase_available): New function.
    (fgrep_to_grep_pattern): Simplify it.
    (main): Use them.
    * src/searchutils.c (kwsinit): New arg MB_TRANS; all uses changed. Try fgrep matcher for case insensitive matching by grep -F in multibyte locale.
* dfa: make dfa.c fully thread-safe | Paul Eggert | 2016-08-31 | 1 | -4/+3
    This follows up on Zev Weiss’s recent patches to make the DFA code thread-safe (Bug#24249). It removes the remaining static variables used by dfa.c. These variables are locale-dependent, so they would cause problems in multithreaded code where different threads are in different locales (e.g., via uselocale). I abstracted most of the variables into a new localeinfo module.
    * src/Makefile.am (grep_SOURCES): Add localeinfo.c.
    (noinst_HEADERS): Add localeinfo.h.
    * src/dfa.c: Include localeinfo.h.
    (struct dfa): Remove multibyte member, as it is now part of localeinfo. New members simple_locale and localeinfo. Put locale-related members at the end.
    (mbrtowc_cache): Remove; now part of dfa->localeinfo.
    (charclass_index): Rename back from dfa_charclass_index, since it's private.
    (unibyte_word_constituent): New arg DFA; use its sbctowc member.
    (using_utf8, dfa_using_utf8, init_mbrtowc_cache, check_utf8): Remove; now done by localeinfo members. All uses changed.
    (dfasyntax): New localeinfo arg. Move to end to avoid forward decls. Initialize the entire DFA.
    (unibyte_c, check_unibyte_c): Remove; now in simple_locale member.
    (using_simple_locale): Now takes bool instead of DFA. Do the locale check here, rather than in the caller, as the result is now cached in dfa->simple_locale.
    (dfaalloc): Just allocate the DFA. dfasyntax now initializes it.
    * src/dfa.h: Add forward decl of struct localeinfo. Adjust to new dfa.c API.
    * src/dfasearch.c (localeinfo): New var, replacing former static vars like mbrtowc_cache.
    * src/localeinfo.c, src/localeinfo.h: New files.
    * src/search.h: Include localeinfo.h.
    (localeinfo): New decl.
    * src/searchutils.c (mbclen_cache, build_mbclen_cache): Remove. All uses changed to localeinfo.
    * tests/Makefile.am (dfa_match_aux_LDADD): Add localeinfo.o.
    * tests/dfa-match-aux.c: Include localeinfo.h.
    (main): Adjust to changes in DFA API.
* maint: remove unused mbtoupper function | Jim Meyering | 2016-04-10 | 1 | -1/+0
    * src/searchutils.c (mbtoupper): Remove now-unused function. Also remove inclusion of <assert.h>, since this change removed the final use of assert.
    * src/search.h (mbtoupper): Remove declaration.
* grep: restore -P optimization (followup fix) | Paul Eggert | 2016-01-06 | 1 | -3/+3
    * src/search.h (EGexecute, Fexecute, Pexecute): Change decls to match new implementations. I forgot to add this file to the previous commit.
* maint: update copyright year, bootstrap, init.sh | Jim Meyering | 2016-01-01 | 1 | -1/+1
    Run "make update-copyright" and then...
    * gnulib: Update to latest.
    * tests/init.sh: Update from gnulib.
    * bootstrap: Likewise.
* maint: update copyright year ranges to include 2015 | Jim Meyering | 2015-01-01 | 1 | -1/+1
    Run "make update-copyright". Also, ...
    * grep.texi: Update manually, converting each "--" to "-".
* grep: use mbclen cache more effectively | Paul Eggert | 2014-09-16 | 1 | -0/+19
    * src/grep.c (buffer_textbin, contains_encoding_error): Use mb_clen for speed.
    (buffer_textbin): Bypass mb_clen in unibyte locales.
    (main): Always initialize the cache, since it's sometimes used in unibyte locales now. Initialize it before contains_encoding_error might be called.
    * src/search.h (SEARCH_INLINE): New macro.
    (mbclen_cache): Now extern decl.
    (mb_clen): New inline function.
    * src/searchutils.c (SEARCH_INLINE, SYSTEM_INLINE): Define.
    (mbclen_cache): Now extern.
    (build_mbclen_cache): Put 1 into the cache when mbrlen returns 0.
    (mb_goback): Use mb_len for speed, and rely on it returning nonzero.
    * src/system.h (SYSTEM_INLINE): New macro.
    (to_uchar): Use it.
* grep: fix -w match next to a multibyte letter | Paul Eggert | 2014-05-05 | 1 | -0/+2
    * NEWS: Document this.
    * src/dfasearch.c, src/kwsearch.c (WCHAR): Remove.
    (wordchar): New static function.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): Use the new functions, so that the code works correctly if a multibyte character adjacent to the match has two or more bytes.
    * src/search.h, src/searchutils.c (mb_prev_wc, mb_next_wc): New functions.
    * tests/word-delim-multibyte: Add a test for grep -w (which now passes), and a test for \> (which still fails). The \< test also still fails.
* grep: improve internal API for multibyte boundary | Paul Eggert | 2014-05-05 | 1 | -2/+2
    * src/search.h, src/searchutils.c (mb_goback): Rename from is_mb_middle. Omit last arg. Return number of bytes to go back, not just a boolean. All uses changed.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): Adjust to API change.
    * src/kwsearch.c (Fexecute): Eliminate common subexpression.
* grep: fix encoding-error incompatibilities among regex, DFA, KWset | Paul Eggert | 2014-05-05 | 1 | -0/+1
    This follows up to http://bugs.gnu.org/17376 and fixes a different set of incompatibilities, namely between the regex matcher and the other matchers, when the pattern contains encoding errors. The GNU regex matcher is not consistent in this area: sometimes an encoding error matches only itself, and sometimes it matches part of a multibyte character. There is no documentation for grep's behavior in this area and users don't seem to care, and it's simpler to defer to the regex matcher for problematic cases like these.
    * NEWS: Document this.
    * src/dfa.c (ctok): Remove. All uses removed.
    (parse_bracket_exp, atom): Use BACKREF if a pattern contains an encoding error, so that the matcher will revert to regex.
    * src/dfasearch.c, src/grep.c, src/pcresearch.c, src/searchutils.c: Don't include dfa.h, since search.h now does that for us.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): In a UTF-8 locale, there's no need to worry about matching part of a multibyte character.
    * src/grep.c (contains_encoding_error): New static function.
    (main): Use it, so that grep -F is consistent with plain fgrep when the pattern contains an encoding error.
    * src/search.h: Include dfa.h, so that kwsearch.c can call using_utf8.
    * src/searchutils.c (is_mb_middle): Remove UTF-8-specific code. Callers now ensure that we are in a non-UTF-8 locale. The code was clearly wrong, anyway.
    * tests/fgrep-infloop, tests/invalid-multibyte-infloop:
    * tests/prefix-of-multibyte: Do not require that grep have a particular behavior for this test. It's OK to match (exit status 0), not match (exit status 1), or report an error (exit status 2), since the pattern contains an encoding error and grep's behavior is not specified for such patterns. Test only that KWset, DFA, and regex agree.
    * tests/prefix-of-multibyte: Add tests for ABCABC and __..._ABCABC___.
* grep: simplify dfa.c by having it not include mbsupport.h directly | Paul Eggert | 2014-04-05 | 1 | -3/+0
    * src/mbsupport.h: Remove.
    * src/Makefile.am (noinst_HEADERS): Remove mbsupport.h.
    * src/dfa.c, src/grep.c, src/search.h: Don't include mbsupport.h.
    * src/dfa.c: Include wchar.h and wctype.h unconditionally, as this simplifies the use of dfa.c in grep, and it does no harm in gawk.
    (setlocale, static_assert): Remove gawk-specific hacks, as gawk now does these itself.
    (struct dfa, dfambcache, mbs_to_wchar) (is_valid_unibyte_character, setbit_wc, using_utf8, FETCH_WC) (addtok_wc, add_utf8_anychar, atom, state_index, epsclosure) (dfaanalyze, dfastate, prepare_wc_buf, dfaoptimize, dfafree, dfamust):
    * src/dfasearch.c (EGexecute):
    * src/grep.c (main):
    * src/searchutils.c (mbtoupper): Assume MBS_SUPPORT.
* fgrep: fix case-fold incompatibility with plain 'grep' | Paul Eggert | 2014-03-07 | 1 | -1/+1
    fgrep converted to lowercase, whereas the regex code converted to uppercase. The resulting behaviors don't agree in offbeat cases like Greek sigmas and Turkish Is. Fix this by changing fgrep to agree with the regex code.
    * src/kwsearch.c (Fcompile, Fexecute):
    * src/searchutils.c (kwsinit, mbtoupper): Convert to uppercase, not to lowercase, for compatibility with plain 'grep'.
    * src/search.h, src/searchutils.c (mbtoupper): Rename from mbtolower, since it now converts to uppercase. All uses changed.
    * tests/case-fold-titlecase: Add tests for this.
* grep: avoid 'inline' when it doesn't matter | Paul Eggert | 2014-02-28 | 1 | -19/+0
    These days, compilers generally do just fine without advice from users about 'inline', and there's little need for 'static inline', just as there's little need for 'register'.
    * src/dfa.c (to_uchar):
    * src/dosbuf.c (guess_type, undossify_input, dossified_pos):
    * src/main.c (undossify_input): No longer inline.
    * src/search.h (mb_case_map_apply): Move from here ...
    * src/kwsearch.c (mb_case_map_apply): ... to here, and make it no longer 'inline'.
* speed up mb-boundary-detection after each preliminary match | Norihiro Tanaka | 2014-02-09 | 1 | -0/+1
    After each kwsexec or dfaexec match, we must determine whether the tentative match falls in the middle of a multi-byte character. That is what our is_mb_middle function does, but it was expensive, even when most input consisted of single-byte characters. The main cost was for each call to mbrlen. This change constructs and uses a cache of the lengths returned by mbrlen for unibyte values. The largest speed-up (3x to 7x, CPU-dependent) is when most lines contain a match, yet few are printed, e.g., when using grep -v common-pattern ... to filter out all but a few lines.
    * src/search.h (build_mbclen_cache): Declare it.
    * src/main.c: Include "search.h".
    [MBS_SUPPORT] (main): Call build_mbclen_cache in a multibyte locale.
    * src/searchutils.c [HAVE_LANGINFO_CODESET]: Include <langinfo.h>.
    (mbclen_cache): New global.
    (build_mbclen_cache): New function.
    (is_mb_middle) [HAVE_LANGINFO_CODESET]: Use it.
    * NEWS (Improvements): Mention it.
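The cache described above can be pictured as a 256-entry table filled once per locale, so that a later lookup replaces a call to mbrlen for each first byte. The following is a rough sketch under that reading; the `_sketch` names are hypothetical, and the real build_mbclen_cache in src/searchutils.c may differ in types and details.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One slot per possible byte value: the result of calling mbrlen on
   that single byte in the current locale.  */
static size_t mbclen_cache_sketch[UCHAR_MAX + 1];

/* Fill the table.  Call this again after any change of locale.  */
static void
build_mbclen_cache_sketch (void)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    {
      char c = (char) i;
      mbstate_t mbs;
      memset (&mbs, 0, sizeof mbs);	/* start from the initial state */
      mbclen_cache_sketch[i] = mbrlen (&c, 1, &mbs);
    }
}
```

An entry of 1 means the byte is a complete character on its own; an entry of (size_t) -2 marks a byte that can only begin a longer character, so only those bytes need the slower path.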
* maint: update copyright dates for 2014 | Jim Meyering | 2014-01-01 | 1 | -1/+1
    Do that by running "make update-copyright".
* maint: update all copyright year number ranges | Jim Meyering | 2013-01-04 | 1 | -1/+1
    Run "make update-copyright".
* grep -i: work also when converting to lower-case inflates byte count | Jim Meyering | 2012-06-16 | 1 | -5/+12
    Commit v2.12-16-g7aa698d addressed the case in which the lower-case representation of an input byte occupies fewer bytes than the original. However, even with commit v2.12-20-g074842d, grep -i would still misbehave when converting a character to lower-case increased its byte count. The map-manipulation code assumed that the case conversion could only shrink the byte count. With the consideration that it may also inflate it, the deltas recorded in the map array must be signed, and we must account for the one-to-two-or-more mapping when the original-to-lower-case conversion causes the byte count to increase.
    * src/searchutils.c (mbtolower): When a lower-case character occupies more than one byte, set its remaining map slots to zero. Change the type of the map to be signed, and compute the change in character byte count as new_length - old_length.
    * src/search.h: Include <stdint.h>, for decl of intmax_t.
    (mb_case_map_apply): Adjust for signed increments: each map entry is now signed.
    (mb_len_map_t): Define type. Thanks to Paul Eggert for noticing in review that using a bare "char" as the base type would be wrong on systems for which it is an unsigned type (as with gcc's -funsigned-char).
    * src/kwsearch.c (Fcompile, Fexecute): Likewise.
    * src/dfasearch.c (kwsincr_case, EGexecute): Likewise.
    * tests/turkish-I-without-dot: New test. Thanks to Paolo Bonzini for the tip that in the tr_TR.utf8 locale, mapping "I" to lower case increases the character's byte count.
    * tests/Makefile.am (TESTS): Add it.
    * tests/init.cfg (require_tr_utf8_locale_): New function.
    * NEWS (Bug fixes): Expand the existing entry.
* grep: fix how -i works with a match containing the Turkish I-with-dot | Jim Meyering | 2012-06-02 | 1 | -1/+20
    Fix a long-standing problem in the way grep's -i interacts with data whose byte count changes when we convert it to lower case. For example, the UTF-8 Turkish I-with-dot (İ) occupies two bytes, but its lower case analog, i, occupies just one byte. The code converts both search string and the haystack data to lower case, and then searches for the modified string in the modified buffer. The trouble arose when using a lowercase buffer <offset,length> pair to manipulate the original (longer) buffer. The solution is to change mbtolower to return additional information: a malloc'd mapping vector. With that, the caller maps the lowercase-relative <offset,length> to numbers that refer to the original buffer. This mapping is used only when lengths actually differ, so the cost in general should be small.
    * src/searchutils.c (mbtolower): Add the new map parameter.
    * src/search.h (mb_case_map_apply): New function.
    * src/kwsearch.c (Fexecute): Update mbtolower caller, and upon success, apply the new map.
    * src/dfasearch.c (EGexecute): Likewise.
    * tests/Makefile.am (XFAIL_TESTS): Remove turkish-I from this list; that test is no longer expected to fail.
    * NEWS (Bug fixes): Mention it.
    Reported by Ilya Basin in http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413 and later by Strahinja Kustudic in http://savannah.gnu.org/bugs/?36567
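The offset mapping described above can be sketched as a pair of running sums over a per-byte delta vector. The layout assumed here (each entry holds the original character's byte length minus the converted character's, stored at the converted character's first byte, zero elsewhere) and the name `case_map_apply_sketch` are illustrative assumptions, not the code that was added to src/search.h; signed entries are used so that the sketch also covers conversions that grow a character.

```c
#include <stddef.h>

/* Map a match found in the case-converted buffer back to the original.
   MAP[i] is the original character's byte length minus the converted
   character's, stored at the converted character's first byte and 0
   elsewhere (an assumed layout).  On entry *OFF and *LEN describe the
   match in the converted buffer; on return, in the original buffer.  */
static void
case_map_apply_sketch (signed char const *map, size_t *off, size_t *len)
{
  ptrdiff_t off_incr = 0;
  ptrdiff_t len_incr = 0;
  for (size_t k = 0; k < *off; k++)
    off_incr += map[k];		/* bytes gained or lost before the match */
  for (size_t k = *off; k < *off + *len; k++)
    len_incr += map[k];		/* bytes gained or lost inside the match */
  *off = (size_t) ((ptrdiff_t) *off + off_incr);
  *len = (size_t) ((ptrdiff_t) *len + len_incr);
}
```

For example, the original "aİb" occupies 4 bytes in UTF-8 and converts to the 3-byte "aib", giving the map {0, +1, 0}. A match on "i" at offset 1, length 1 maps back to offset 1, length 2, and a match on "b" at offset 2 maps back to offset 3.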
* maint: update all copyright year number ranges | Jim Meyering | 2012-01-01 | 1 | -1/+1
    Run "make update-copyright".
* maint: remove #if-MBS_SUPPORT declaration guards | Jim Meyering | 2011-09-16 | 1 | -2/+0
    * src/search.h: Don't bother to #if-out declarations.
* maint: mark some function declarations as extern | Jim Meyering | 2011-04-28 | 1 | -10/+9
    * src/search.h: Add "extern" keyword to each function declaration.
* maint: update copyright year ranges to include 2011 | Jim Meyering | 2011-01-03 | 1 | -1/+1
    Run "make update-copyright", so "make syntax-check" works in 2011.
* maint: include <wchar.h> and <wctype.h> unconditionally | Jim Meyering | 2010-04-02 | 1 | -6/+4
    * src/main.c: Include <wchar.h> and <wctype.h> unconditionally. Their presence/usefulness are assured by gnulib.
    * src/dfa.c: Likewise.
    * src/search.h: Likewise.
* maint: MBS_SUPPORT: define to 0/1, not undef/1 | Jim Meyering | 2010-04-02 | 1 | -2/+2
    Prepare to remove many of these #ifdefs.
    * src/mbsupport.h (MBS_SUPPORT): Define to 0/1, not undef/1.
    Change each "#ifdef MBS_SUPPORT" to "#if MBS_SUPPORT". Use this:
        perl -pi -e 's/ifdef (MBS_SUPPORT)/if $1/' $(g grep -l ifdef.MBS_SUPPO)
    * src/dfa.c: s/#ifdef MBS_SUPPORT/#if MBS_SUPPORT/
    * src/dfa.h: Likewise.
    * src/dfasearch.c: Likewise.
    * src/kwsearch.c: Likewise.
    * src/main.c: Likewise.
    * src/search.h: Likewise.
    * src/searchutils.c: Likewise.
* grep -F: fix a multi-byte erroneous-match-in-middle bug | Jim Meyering | 2010-03-28 | 1 | -1/+1
    Just as Perl prints nothing in this case,
        printf '\357\274\241\n' | perl -CIO -lne '/\357/ and print'
    grep should also print nothing when used as follows. However, these would mistakenly match with grep prior to 2.6.2:
        printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\357'
        printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\357\274'
    * src/searchutils.c (is_mb_middle): New parameter: the length of the match, in bytes, as determined by kwsexec. Use this to detect when the nominal match found by kwsexec must be skipped because it is for an incomplete multi-byte character that is a prefix of a character in the input.
    * src/dfasearch.c (EGexecute): Update caller.
    * src/kwsearch.c (Fexecute): Likewise.
    * src/search.h: Update prototype.
    * NEWS (Bug fixes): Mention it.
    Report and analysis by Norihiro Tanaka.
* grep: libify *search.c | Paolo Bonzini | 2010-03-22 | 1 | -0/+14
    * src/Makefile.am (libsearch_a_SOURCES): Add dfasearch.c, kwsearch.c, pcresearch.c.
    * src/esearch.c, src/fsearch.c,
    * src/gsearch.c: Only include search.h.
    * src/dfasearch.c (GEAcompile, EGexecute): Export.
    * src/kwsearch.c (Fcompile, Fexecute): Export.
    * src/pcresearch.c (Pcompile, Pexecute): Export.
    * src/search.h: Add new exported functions.
* grep: split search.c | Paolo Bonzini | 2010-03-22 | 1 | -0/+47
    * po/POTFILES.in: Update.
    * src/Makefile.am (grep_SOURCES, egrep_SOURCES, fgrep_SOURCES): Move kwset.c and dfa.c to libsearch.a. Add searchutils.c there too.
    * src/search.h, src/dfasearch.c, src/pcresearch.c, src/kwsearch.c, src/searchutils.c: New files, split out of src/search.c.
    * src/esearch.c, src/fsearch.c: Include the new files instead of search.c.
    * src/gsearch.c: Likewise, plus move Gcompile/Acompile here.