path: root/src/dfasearch.c
Commit message | Author | Age | Files | Lines
* maint: update all copyright year number ranges | Jim Meyering | 2020-01-01 | 1 | -1/+1
    Run "make update-copyright" and then...
    * gnulib: Update to latest with copyright year adjusted.
    * tests/init.sh: Sync with gnulib to pick up copyright year.
    * bootstrap: Likewise.
    * doc/grep.in.1: Use "-" in copyright year ranges, not \en.
* doc: spell "back-reference" more consistently | Paul Eggert | 2019-12-30 | 1 | -6/+6
* maint: adjust new comments | Jim Meyering | 2019-12-22 | 1 | -7/+7
    * src/dfasearch.c (possible_backrefs_in_pattern): Remove a duplicate "a", insert a "be" and a comma, and reformat.
* grep: fix some bugs in pattern-grouping speedup | Paul Eggert | 2019-12-22 | 1 | -47/+78
    This fixes some bugs in the previous commit, and should finish the fix for Bug#33249.
    * NEWS: Mention fix for Bug#33249.
    * src/dfasearch.c (possible_backrefs_in_pattern, regex_compile) (GEAcompile): In new code, prefer ptrdiff_t to size_t when either will do, since ptrdiff_t has better error checking. At some point we should adjust the old code too.
    (possible_backrefs_in_pattern): Rename from find_backref_in_pattern. New arg BS_SAFE. All uses changed. Fix false negative if a multibyte character ends in a single '\\' byte, followed by the two bytes '\\', '1'.
    (regex_compile): Simplify.
    (GEAcompile): Avoid quadratic behavior when reallocating growing buffers. Fix a couple of bugs in copying pattern data involving backreferences. Fix another bug in copying pattern metadata involving backreferences, by removing the need to copy it.
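The BS_SAFE distinction in the entry above can be illustrated with a small scanner. This is only a sketch under stated assumptions: the name may_have_backref and its exact logic are hypothetical and are not grep's possible_backrefs_in_pattern. It shows why "\\\\1" may be skipped as an escaped backslash only when a '\\' byte cannot be the tail of a multibyte character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not grep's actual code.  Return true if the LEN
   bytes at PAT might contain a back-reference \1 .. \9.  If BS_SAFE is
   false, a '\\' byte might be the trailing byte of a multibyte
   character, so two adjacent backslashes are not trusted to be an
   escaped backslash and the second one is examined as well.  False
   alarms are acceptable; false negatives are not.  */
static bool
may_have_backref (char const *pat, ptrdiff_t len, bool bs_safe)
{
  assert (0 <= len);
  for (ptrdiff_t i = 0; i + 1 < len; i++)
    {
      if (pat[i] != '\\')
        continue;
      if ('1' <= pat[i + 1] && pat[i + 1] <= '9')
        return true;
      /* Skip the second byte of an escaped backslash only when safe.  */
      if (bs_safe && pat[i + 1] == '\\')
        i++;
    }
  return false;
}
```

With bs_safe true, the four bytes a, \, \, 1 are treated as an escaped backslash followed by a literal 1; with bs_safe false the same bytes are conservatively reported as a possible back-reference.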
* grep: grouping of a pattern with multiple lines | Norihiro Tanaka | 2019-12-22 | 1 | -20/+107
    When grep uses regex, it splits a multi-line pattern at each newline into fragments. Compilation and execution run for each fragment, which causes a slowdown. With this change, fragments are grouped by whether they include back references. Each fragment with back references constitutes its own group, and all fragments that lack back references together constitute a single group. This change greatly speeds up the following case.
    $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat
    $ yes 00000000000000000000000000000000000000000x | head -10000 >in
    $ time -p env LC_ALL=C src/grep -f pat in
    * src/dfasearch.c (find_backref_in_pattern, regex_compile): New functions.
    (GEAcompile): Use the new functions to group fragments as mentioned above.
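The grouping described above might be sketched as follows. The function names and the simplified back-reference test are assumptions made for illustration, not the code this commit added: fragments without back references are gathered into one newline-separated buffer (one group), and fragments with back references are only counted here, since each would be compiled on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified, hypothetical test: does the fragment contain \1 .. \9?
   It treats every backslash as escaping the next byte.  */
static bool
fragment_has_backref (char const *s, ptrdiff_t len)
{
  for (ptrdiff_t i = 0; i + 1 < len; i++)
    if (s[i] == '\\')
      {
        if ('1' <= s[i + 1] && s[i + 1] <= '9')
          return true;
        i++;  /* Skip the escaped byte.  */
      }
  return false;
}

/* Copy the newline-separated fragments of PAT (LEN bytes) that lack
   back references into OUT, joined by newlines, and return the number
   of bytes written.  Store the number of fragments that have back
   references in *NBACKREF.  OUT must have room for LEN bytes.  */
static ptrdiff_t
group_fragments (char const *pat, ptrdiff_t len, char *out, int *nbackref)
{
  assert (0 <= len);
  ptrdiff_t outlen = 0;
  bool first = true;
  char const *p = pat;
  char const *lim = pat + len;
  *nbackref = 0;

  for (;;)
    {
      char const *nl = memchr (p, '\n', lim - p);
      char const *end = nl ? nl : lim;
      if (fragment_has_backref (p, end - p))
        ++*nbackref;
      else
        {
          if (!first)
            out[outlen++] = '\n';
          memcpy (out + outlen, p, end - p);
          outlen += end - p;
          first = false;
        }
      if (!nl)
        break;
      p = nl + 1;
    }
  return outlen;
}
```

The combined buffer can then be compiled once, so the per-fragment cost is paid only for the (typically few) fragments that need back references.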
* dfa: separate parse and compile phase | Norihiro Tanaka | 2019-12-19 | 1 | -1/+2
    DFAMUST() must be called after the parse and before the token reordering that was introduced in commit 5c7a0371823876cca7a1347fa09ca26bbbff0c98, but both are executed in the compilation phase.
    * lib/dfa.c (dfaparse): Change it to global function.
    (dfacomp): If first argument is NULL, skip parse.
    * lib/dfa.h: (dfaparse): Add a prototype.
* grep: do not match invalid UTF-8 | Paul Eggert | 2019-12-17 | 1 | -1/+1
    Update Gnulib to latest. Also:
    * src/dfasearch.c (EGexecute): Use ptrdiff_t, not size_t, to match new Gnulib API.
    * tests/Makefile.am (TESTS): Add dfa-invalid-utf8.
    * tests/dfa-invalid-utf8: New file.
* grep: improve grep -Fw performance in non-UTF8 multibyte locales | Norihiro Tanaka | 2019-11-17 | 1 | -1/+1
    * src/searchutils.c (mb_goback): New parameter. All callers changed.
    * src/search.h (mb_goback): Update prototype.
    * src/kwsearch.c (Fexecute): Use mb_goback's MBCLEN to detect a word-boundary even more efficiently.
* maint: update all copyright dates via "make update-copyright" | Jim Meyering | 2019-01-01 | 1 | -1/+1
    * gnulib: Also update submodule for its copyright updates.
* maint: update gnulib and copyright dates for 2018 | Jim Meyering | 2018-01-06 | 1 | -1/+1
    * gnulib: Update to latest.
    * all files: Run "make update-copyright".
    * bootstrap: Update from gnulib.
* Improve -i performance in typical UTF-8 searches | Paul Eggert | 2017-01-17 | 1 | -1/+1
    Currently ‘grep -i i’ is slow in a UTF-8 locale, because ‘i’ in the pattern matches the two-byte character 'ı' (U+0131, LATIN SMALL LETTER DOTLESS I) in data, and kwset handles only single-byte character translations, so grep falls back on a slower DFA-based search for all searches. Improve -i performance in the typical case by using kwset when data are free of troublesome characters like 'ı', falling back on the DFA only when data contain troublesome characters.
    * src/dfasearch.c (GEAcompile):
    * src/grep.c (compile_fp_t):
    * src/kwsearch.c (Fcompile):
    * src/pcresearch.c (Pcompile): Pattern arg is now char *, not char const *, since Fcompile now reallocates it sometimes.
    * src/grep.c (all_single_byte_after_folding): Remove. All callers removed.
    (fgrep_icase_charlen): New function.
    (fgrep_icase_available, try_fgrep_pattern): Use it, for more-generous semantics.
    (fgrep_to_grep_pattern): Now extern.
    (main): Do not free keys, since Fexecute may use them.
    * src/kwsearch.c (struct kwsearch): New struct.
    (Fcompile): Return it. If -i, be more generous about patterns.
    (Fexecute): Use it. Fall back on DFA when the data contain troublesome characters; this should be rare in practice.
    * src/kwset.c, src/kwset.h (kwswords): New function.
* dfa: prefer ptrdiff_t to size_t | Paul Eggert | 2017-01-15 | 1 | -6/+7
    The code already cannot handle objects with size greater than SIZE_MAX / 2, so be more honest about it and use ptrdiff_t instead of size_t. ptrdiff_t arithmetic is signed, which allows for more checking via -fsanitize=undefined. It also makes the code a tad smaller on x86-64, since it can test for < 0 rather than for == SIZE_MAX.
    * src/dfasearch.c (struct dfa_comp.kwset_exact_matches):
    (kwsmusts, EGexecute):
    * src/kwsearch.c (Fcompile, Fexecute):
    * src/kwset.c (struct kwset.kwsexec, kwsincr, memchr_kwset) (memoff2_kwset, bmexec_trans, bmexec, cwexec, acexec_trans) (acexec, kwsexec):
    * src/kwset.h (struct kwsmatch.index, .offset, .size): Prefer ptrdiff_t to size_t where either will do.
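The "test for < 0 rather than for == SIZE_MAX" remark above can be seen in a tiny search routine. This is a hypothetical example, not code from kwset.c or dfasearch.c: with a signed result type, "not found" is simply a negative offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical search routine.  Return the offset of the first byte C
   in the SIZE bytes at BUF, or -1 if C is absent.  A size_t version
   would have to return (size_t) -1, i.e. SIZE_MAX, and callers would
   compare against that sentinel instead of testing the sign.  */
static ptrdiff_t
find_byte (char const *buf, ptrdiff_t size, char c)
{
  assert (0 <= size);
  char const *p = memchr (buf, c, size);
  return p ? p - buf : -1;
}
```

A caller then writes `if (find_byte (buf, size, '\n') < 0)`, and a mistaken negative size is caught by the assertion rather than silently wrapping around to a huge unsigned value.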
* maint: update gnulib and copyright dates for 2017 | Jim Meyering | 2017-01-01 | 1 | -1/+1
    * gnulib: Update to latest.
    * all files: Run "make update-copyright".
* grep: move localeinfo to grep.c | Zev Weiss | 2016-12-25 | 1 | -2/+0
    It's not really dfasearch-specific, and grep.c initializes it, so it seems like the most appropriate "owner".
    * src/dfasearch.c (localeinfo): Remove.
    * src/grep.c (localeinfo): Add.
    * src/search.h (localeinfo): Move to new commented section.
* dfasearch: thread safety | Zev Weiss | 2016-12-25 | 1 | -65/+61
    * src/dfasearch.c (struct dfa_comp): New struct to hold previously-global variables.
    (dfawarn): Remove static variable.
    (kwsmusts): Operate on a dfa_comp parameter instead of global variables.
    (GEAcompile): Allocate and return a dfa_comp struct instead of setting global variables.
    (EGexecute): Operate on a dfa_comp parameter instead of global variables.
    * src/searchutils.c (kwsinit): Replace a static array with a dynamically-allocated one.
* grep: prepare search backends for thread-safety | Zev Weiss | 2016-12-25 | 1 | -2/+4
    To facilitate removing mutable global state from search backends, compile() functions will return an opaque pointer to backend-specific data, which must then be passed back into the corresponding execute() function. This is merely a preparatory step changing function signatures and call sites, so the pointers passed & returned are dummies for now and not (yet) actually used.
    * src/grep.c (compile_fp_t): Now returns an opaque pointer (the compiled pattern).
    (execute_fp_t): Now passed the pointer returned by a compile_fp_t. All call sites updated accordingly.
    (compiled_pattern): New static variable.
    * src/dfasearch.c (GEAcompile): Return a void pointer (dummy NULL).
    (EGexecute): Receive a void pointer argument (unused).
    * src/kwsearch.c (Fcompile): Return a void pointer (dummy NULL).
    (Fexecute): Receive a void pointer argument (unused).
    * src/pcresearch.c (Pcompile): Return a void pointer (dummy NULL).
    (Pexecute): Receive a void pointer argument (unused).
    * src/search.h: Update compile/execute function prototypes.
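The compile/execute hand-off described above can be sketched with a toy fixed-string backend. Everything here (struct fixed_pattern, fixed_compile, fixed_execute, and their signatures) is hypothetical and much simpler than grep's real backends; the point is only that all per-pattern state travels through the opaque pointer instead of through globals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Backend-private state; callers see it only as a void pointer.  */
struct fixed_pattern
{
  char *text;
  size_t len;
};

/* Hypothetical compile step: copy the pattern and return it as an
   opaque pointer.  */
static void *
fixed_compile (char const *pattern, size_t len)
{
  struct fixed_pattern *fp = malloc (sizeof *fp);
  if (!fp)
    abort ();
  fp->text = malloc (len + 1);
  if (!fp->text)
    abort ();
  memcpy (fp->text, pattern, len);
  fp->text[len] = '\0';
  fp->len = len;
  return fp;
}

/* Hypothetical execute step: search the SIZE bytes at BUF using only
   the state reachable from COMPILED.  Return the match offset, or -1.  */
static ptrdiff_t
fixed_execute (void *compiled, char const *buf, size_t size)
{
  struct fixed_pattern const *fp = compiled;
  assert (fp);
  for (size_t i = 0; i + fp->len <= size; i++)
    if (memcmp (buf + i, fp->text, fp->len) == 0)
      return i;
  return -1;
}
```

Because no state lives in static variables, two compiled patterns can coexist and be searched independently, which is the property the later thread-safety commits rely on.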
* grep: standardize on localeinfo.multibyte | Paul Eggert | 2016-12-23 | 1 | -1/+1
    * src/dfasearch.c (EGexecute):
    * src/grep.c (main):
    * src/kwsearch.c (Fexecute):
    * src/pcresearch.c (Pcompile): Prefer localeinfo.multibyte to (MB_CUR_MAX > 1).
* grep: specialize word-finding functions | Paul Eggert | 2016-12-23 | 1 | -9/+2
    This improves performance a bit.
    * src/dfasearch.c, src/kwsearch.c (wordchar): Remove; now in searchutils.c.
    * src/grep.c (main): Call wordinit if -w.
    * src/search.h: Adjust.
    * src/searchutils.c: Include verify.h.
    (word_start): New static var.
    (wordchar): Move here from dfasearch.c and kwsearch.c.
    (wordinit, wordchars_count, wordchar_next, wordchar_prev): New functions.
    (mb_prev_wc, mb_next_wc): Remove. All callers changed to use the new functions instead.
* build: update gnulib submodule to latest | Arnold D. Robbins | 2016-12-13 | 1 | -2/+1
    * src/dfasearch.c (GEAcompile): Remove use of flag, RE_ICASE covers it.
* grep: -P no longer uses PCRE_MULTILINE | Paul Eggert | 2016-11-19 | 1 | -1/+1
    This reverts commit f6603c4e1e04dbb87a7232c4b44acc6afdf65fef, as the extra performance is not worth the trouble for PCRE users. Problem reported by Stephane Chazelas in: http://bugs.gnu.org/22655#103
    * NEWS: Document this and the next patch.
    * src/dfasearch.c (EGexecute):
    * src/grep.c (execute_fp_t):
    * src/kwsearch.c (Fexecute):
    * src/pcresearch.c (Pexecute): First arg is now a const pointer again.
    * src/grep.c (buf_has_encoding_errors): Now static.
    * src/grep.h (buf_has_encoding_errors): Remove decl.
    * src/search.h: Adjust decls.
    * src/pcresearch.c (reflags): Remove. All uses removed.
    (Pcompile, Pexecute): Do not use PCRE_MULTILINE.
* grep: die more systematically | Paul Eggert | 2016-10-04 | 1 | -5/+3
    * src/die.h: New file.
    * src/dfasearch.c, src/grep.c, src/pcresearch.c: Include die.h.
    * src/dfasearch.c (dfaerror):
    * src/grep.c (context_length_arg, add_count, prline, setmatcher, main):
    * src/pcresearch.c (jit_exec, Pcompile, Pexecute): Use 'die' instead of 'error' when exiting.
    * src/pcresearch.c: Do not include verify.h.
    (die): Remove; now in die.h.
    * src/search.h: Do not include error.h here, since this file does not use anything defined in error.h. Instead, dfasearch.c, which uses error.h's symbols, now includes error.h directly.
* dfa: new option for anchored searches | Paul Eggert | 2016-09-02 | 1 | -1/+3
    This follows up on a suggestion by Norihiro Tanaka (Bug#24262).
    * src/dfa.c (struct regex_syntax): New member 'anchor'.
    (char_context): Use it.
    (dfasyntax): Change signature to specify it, along with the old FOLD and EOL args, as a single DFAOPTS arg. All uses changed.
    * src/dfa.h (DFA_ANCHOR, DFA_CASE_FOLD, DFA_EOL_NUL): New constants for dfasyntax new last arg.
* grep: use regex fastmap unless -i | Paul Eggert | 2016-09-01 | 1 | -1/+4
    This builds on a suggestion by Norihiro Tanaka (Bug#24009).
    * src/dfasearch.c (GEAcompile): Use a fastmap unless -i. This improves performance 20x for me using the first benchmark given in Bug#24009.
* grep: improve dfasearch storage management | Paul Eggert | 2016-09-01 | 1 | -37/+38
    This patch is mostly refactoring, with a bit of performance tweaking. It is done in preparation for a fix for Bug#24009.
    * src/dfasearch.c (patterns): Now of type struct re_pattern_buffer * instead of an anonymous struct pointer, since there is no longer any need to keep regs here. All uses changed.
    (GEAcompile): Use patlim instead of a hard-to-follow "total". Use x2nrealloc to avoid potential O(N**2) reallocation algorithm. Initialize just the pattern members that need clearing.
    (EGexecute): Put regs into a static variable, as this code did before 2001-02-18, as there is no need to have a separate set of regs for each pattern. Explain the "Q@#%!#" comment better.
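The O(N**2) concern above arises when an array is reallocated one element at a time, so that N appends may copy on the order of N*N bytes. A minimal sketch of the alternative follows; grow_array is a hypothetical stand-in written for this illustration, not gnulib's x2nrealloc, and the doubling factor is an assumption of the sketch rather than a claim about gnulib's growth policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a geometric-growth allocator.  P points to
   an array of *N elements of ELSIZE bytes each (or is NULL with *N
   zero).  Grow the array, update *N to the new element count, and
   return the new pointer.  Abort on overflow or allocation failure.  */
static void *
grow_array (void *p, size_t *n, size_t elsize)
{
  assert (elsize != 0);
  size_t newn = 1;
  if (*n)
    {
      /* Refuse to double if 2 * *N * ELSIZE would overflow.  */
      if (SIZE_MAX / 2 / elsize < *n)
        abort ();
      newn = 2 * *n;
    }
  void *q = realloc (p, newn * elsize);
  if (!q)
    abort ();
  *n = newn;
  return q;
}
```

Since capacity grows geometrically, appending N elements triggers only about log2(N) reallocations, and the total number of bytes copied stays proportional to N.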
* grep: avoid code duplication with -iF | Paul Eggert | 2016-09-01 | 1 | -1/+1
    This follows up on the -iF performance improvement (Bug#23752).
    * NEWS: Simplify description of -iF improvement.
    * src/dfa.c: Do not include wctype.h.
    (lonesome_lower, case_folded_counterparts): Move to localeinfo.c.
    (CASE_FOLDED_BUFSIZE): Move to localeinfo.h.
    * src/grep.c: Do not include wctype.h.
    (lonesome_lower): Remove.
    (fgrep_icase_available): Use case_folded_counterparts instead. Do not call it for the same character twice. Return false on wcrtomb failures (which should never happen).
    (fgrep_to_grep_pattern, main): Simplify. Let fgrep_to_grep’s caller fiddle with the global variables.
    * src/localeinfo.c: Include <wctype.h>
    (lonesome_lower, case_folded_counterparts): Move here from src/dfa.c. Return int, not unsigned int. Verify that CASE_FOLDED_BUFSIZE is big enough.
    * src/localeinfo.h (CASE_FOLDED_BUFSIZE): Now 32, so that we don’t expose lonesome_lower’s size.
    * src/searchutils.c (kwsinit): Return new kwset instead of storing it via a pointer. All callers changed. Simplify a bit.
* grep: speed up -iF in multibyte locales | Norihiro Tanaka | 2016-09-01 | 1 | -1/+1
    In a multibyte locale, if a pattern is composed of only single-byte characters, all their case counterparts are also single-byte characters, and the pattern does not have invalid sequences, grep -iF uses the fgrep matcher, the same as in a single-byte locale (Bug#23752).
    * NEWS: Mention it.
    * src/grep.c (lonesome_lower): New constant.
    (fgrep_icase_available): New function.
    (fgrep_to_grep_pattern): Simplify it.
    (main): Use them.
    * src/searchutils.c (kwsinit): New arg MB_TRANS; all uses changed. Try fgrep matcher for case insensitive matching by grep -F in multibyte locale.
* dfa: make dfa.c fully thread-safe | Paul Eggert | 2016-08-31 | 1 | -2/+4
    This follows up on Zev Weiss’s recent patches to make the DFA code thread-safe (Bug#24249). It removes the remaining static variables used by dfa.c. These variables are locale-dependent, so they would cause problems in multithreaded code where different threads are in different locales (e.g., via uselocale). I abstracted most of the variables into a new localeinfo module.
    * src/Makefile.am (grep_SOURCES): Add localeinfo.c.
    (noinst_HEADERS): Add localeinfo.h.
    * src/dfa.c: Include localeinfo.h.
    (struct dfa): Remove multibyte member, as it is now part of localeinfo. New members simple_locale and localeinfo. Put locale-related members at the end.
    (mbrtowc_cache): Remove; now part of dfa->localeinfo.
    (charclass_index): Rename back from dfa_charclass_index, since it's private.
    (unibyte_word_constituent): New arg DFA; use its sbctowc member.
    (using_utf8, dfa_using_utf8, init_mbrtowc_cache, check_utf8): Remove; now done by localeinfo members. All uses changed.
    (dfasyntax): New localeinfo arg. Move to end to avoid forward decls. Initialize the entire DFA.
    (unibyte_c, check_unibyte_c): Remove; now in simple_locale member.
    (using_simple_locale): Now takes bool instead of DFA. Do the locale check here, rather than in the caller, as the result is now cached in dfa->simple_locale.
    (dfaalloc): Just allocate the DFA. dfasyntax now initializes it.
    * src/dfa.h: Add forward decl of struct localeinfo. Adjust to new dfa.c API.
    * src/dfasearch.c (localeinfo): New var, replacing former static vars like mbrtowc_cache.
    * src/localeinfo.c, src/localeinfo.h: New files.
    * src/search.h: Include localeinfo.h.
    (localeinfo): New decl.
    * src/searchutils.c (mbclen_cache, build_mbclen_cache): Remove. All uses changed to localeinfo.
    * tests/Makefile.am (dfa_match_aux_LDADD): Add localeinfo.o.
    * tests/dfa-match-aux.c: Include localeinfo.h.
    (main): Adjust to changes in DFA API.
* dfa: thread-safety: eliminate static local variables | Zev Weiss | 2016-08-20 | 1 | -1/+1
    * src/dfa.c: Replace utf8 and unibyte_c static local variables with static globals initialized by a new function dfa_init() which must be called before any other dfa*() functions.
    (dfa_using_utf8): Rename using_utf8() to dfa_using_utf8() for consistency with other exported functions.
    * src/dfa.h (dfa_using_utf8): Rename using_utf8() to dfa_using_utf8(); also add _GL_ATTRIBUTE_PURE.
    (dfa_init): New function.
    * src/grep.c (main), tests/dfa-match-aux.c (main): Call dfa_init().
    * src/dfasearch.c (EGexecute): Replace using_utf8 with dfa_using_utf8.
    * src/kwsearch.c (Fexecute): Likewise.
    * src/pcresearch.c (Pcompile): Likewise.
    http://bugs.gnu.org/24259
* dfa: thread-safety: move regex syntax configuration into struct dfa | Zev Weiss | 2016-08-20 | 1 | -2/+3
    * src/dfa.c: move global variables holding regex syntax configuration into a new struct (`struct regex_syntax') and add an instance of it to struct dfa. All references to the globals are replaced with references to the dfa struct's new member. As a side effect, a `struct dfa' must be allocated with dfaalloc() and passed to dfasyntax().
    * src/dfa.h (dfasyntax): Add new struct dfa* parameter.
    * src/dfasearch.c (GEAcompile): Allocate `dfa' earlier and pass it to dfasyntax().
    * tests/dfa-match-aux.c (main): Pass `dfa' to dfasyntax().
    http://bugs.gnu.org/24259
* dfa: avoid uninitialized constants | Paul Eggert | 2016-08-17 | 1 | -5/+3
    Some compilers warn about 'static int const x;' on the grounds that X should have an initializer. Instead of worrying about this, rewrite to avoid this sort of thing.
    * src/dfa.c (emptyset): New function.
    (parse_bracket_exp): Use it instead of 'equal' and a zero constant.
    * src/dfasearch.c (struct patterns): Remove tag 'patterns'.
    (patterns0): Remove zero constant.
    (GEAcompile): Use memset instead of the zero constant.
* grep: print "filename:lineno:" in invalid-regex diagnostic | Jim Meyering | 2016-07-25 | 1 | -1/+15
    Determining the file name and line number is a little tricky because of the way the regular expressions are all concatenated onto a newline-separated list. By the time grep would compile regular expressions, the <filename,lineno> origin of each regexp was no longer available. This patch adds a list of filename,first_lineno pairs, one per input source, by which we can then map the ordinal regexp number to a filename,lineno pair for the diagnostic.
    * src/dfasearch.c (GEAcompile): When diagnosing an invalid regexp specified via -f FILE, include the "FILENAME:LINENO: " prefix. Also, when there are two or more lines with compilation failures, diagnose all of them, rather than stopping after the first.
    * src/grep.h (pattern_file_name): Declare it.
    * src/grep.c: (struct FL_pair): Define type.
    (fl_pair, n_fl_pair_slots, n_pattern_files, patfile_lineno): Define globals.
    (fl_add, pattern_file_name): Define functions.
    (main): Call fl_add for each type of the following: -e argument, -f argument, command-line-specified (without -e) regexp.
    * tests/filename-lineno.pl: New file.
    * tests/Makefile.am (TESTS): Add it.
    * NEWS (Improvements): Mention this.
    Initially reported by Gunnar Wolf in https://bugs.debian.org/525214
    Forwarded to grep's bug list by Santiago Ruano Rincón as http://debbugs.gnu.org/23965
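The mapping from an ordinal regexp number back to a filename,lineno pair might look roughly like this. The struct layout, the name locate_pattern, and the choice to pass the table as a parameter are assumptions of this sketch; grep's own fl_pair table and pattern_file_name may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per pattern source (hypothetical layout): FIRST is the
   1-based ordinal, within the concatenated pattern list, of the first
   regexp that came from FILENAME.  Entries are in increasing order.  */
struct source
{
  char const *filename;
  size_t first;
};

/* Map ORDINAL, a 1-based index into the concatenated pattern list, to
   the source it came from.  Store the line number within that source
   in *LINENO and return the source's file name.  */
static char const *
locate_pattern (struct source const *src, size_t nsrc,
                size_t ordinal, size_t *lineno)
{
  assert (0 < nsrc && src[0].first <= ordinal);
  size_t i = 1;
  while (i < nsrc && src[i].first <= ordinal)
    i++;
  *lineno = ordinal - src[i - 1].first + 1;
  return src[i - 1].filename;
}
```

For example, if a.pat contributed regexps 1 through 3 and b.pat starts at ordinal 4, then regexp number 5 is reported as line 2 of b.pat.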
* grep: always match single line only with DFA superset | Norihiro Tanaka | 2016-07-21 | 1 | -10/+9
    \n cannot occur inside a multibyte character, so a match found with the DFA superset always lies within a single line.
    * src/dfasearch.c (EGexecute): Simplify it accordingly.
* maint: mark a couple of static variables const | Zev Weiss | 2016-06-09 | 1 | -1/+1
    * src/dfa.c (parse_bracket_exp): mark zeroclass const.
    * src/dfasearch.c: mark patterns0 const.
    http://bugs.gnu.org/23712
* grep: -F multiword longest match not always needed | Norihiro Tanaka | 2016-06-02 | 1 | -1/+1
    When searching for multiple fixed words, grep now returns immediately without looking for the longest match if it is not needed. Without this change, grep looks for the longest match among multiple words even when it is not needed.
    * src/kwset.c (kwsexec, acexec, cwexec, bmexec): Add a bool argument for whether longest match is needed. All callers changed.
    * src/kwset.h (kwsexec): Update prototype.
* dfa: prefer bool at DFA interfaces | Norihiro Tanaka | 2016-05-01 | 1 | -1/+1
    * src/dfa.c (struct dfa, dfasyntax, dfaanalyze, dfaexec_main) (dfaexec_mb, dfaexec_sb, dfaexec_noop, dfaexec, dfacomp):
    * src/dfa.h (dfasyntax, dfacomp, dfaexec, dfaanalyze):
    * src/dfasearch.c (EGexecute): Use bool for boolean.
* grep -z: avoid erroneous match with regexp anchor and \n in text | Jim Meyering | 2016-02-20 | 1 | -0/+1
    * src/dfasearch.c (EGexecute): Clear the newline_anchor bit when eolbyte is not '\n'.
    * tests/z-anchor-newline: New file.
    * tests/Makefile.am (TESTS): Add it.
    * NEWS (Bug fixes): Describe it.
    Originally reported by Ulrich Mueller in https://bugs.gentoo.org/show_bug.cgi?id=574662
    Reported to us by Sergei Trofimovich as http://debbugs.gnu.org/22655
* grep: -x now supersedes -w more consistently | Paul Eggert | 2016-01-15 | 1 | -3/+3
    * NEWS, doc/grep.texi (Matching Control): Mention this.
    * src/dfasearch.c (EGexecute):
    * src/pcresearch.c (Pcompile): Don't get confused by -w if -x is also present.
    * src/pcresearch.c (Pcompile): Remove misleading comment about non-UTF-8 multibyte locales, as PCRE doesn't support them. Calculate buffer sizes more carefully; the old method allocated a buffer slightly too big, seemingly due to luck.
    * tests/backref-word, tests/pcre: Add tests for this bug.
* grep: restore -P PCRE_NO_UTF8_CHECK optimization | Paul Eggert | 2016-01-06 | 1 | -1/+1
    On my platform in the en_US.utf8 locale, this makes 'grep -P "z.*a" k' 220x faster, where k is created by the shell command:
    yes 'abcdefg hijklmn opqrstu vwxyz' | head -n 10000000 >k
    * src/dfasearch.c (EGexecute):
    * src/grep.c (execute_fp_t):
    * src/kwsearch.c (Fexecute):
    * src/pcresearch.c (Pexecute): First arg is now char *, not char const *, since Pexecute now temporarily modifies this argument.
    * src/grep.c, src/grep.h (buf_has_encoding_errors): Now extern.
    * src/pcresearch.c (Pexecute): Use it. If the input is free of encoding errors, use a multiline search and the PCRE_NO_UTF8_CHECK option, as this is typically way faster. This restores an optimization that was removed with the recent changes for binary file detection.
* maint: update copyright year, bootstrap, init.sh | Jim Meyering | 2016-01-01 | 1 | -1/+1
    Run "make update-copyright" and then...
    * gnulib: Update to latest.
    * tests/init.sh: Update from gnulib.
    * bootstrap: Likewise.
* grep: avoid use of uninitialized variable | Norihiro Tanaka | 2015-08-19 | 1 | -1/+1
    EGexecute would use "backref" uninitialized. While that could have no bearing on correctness, it could impact performance, via an unnecessary use of regexp.
    * src/dfasearch.c (EGexecute): Initialize backref.
    Reported as http://debbugs.gnu.org/21273
    Introduced by commit v2.21-55-gea0ebaa.
* dfa: build struct dfamust on demand | Norihiro Tanaka | 2015-07-04 | 1 | -33/+29
    If we won't use KWset, do not build a "struct dfamust". Now it is built only when needed.
    * src/dfa.c (struct dfa) [musts]: Remove member.
    (dfacomp): Don't build dfamust here.
    (dfamustfree): New function to free a struct dfamust.
    (dfamust): Make it a global function, and make it return a pointer to a malloc'd struct dfamust.
    (dfamusts): Remove it.
    * src/dfa.h (struct dfamust) [next]: Remove member. In the implementation preceding this patch, there was never more than one of these in a given "struct dfa".
    (dfamustfree, dfamust): Add prototypes.
    (dfamusts): Remove prototype.
    (dfaalloc): Declare with _GL_ATTRIBUTE_MALLOC. To make that symbol usable there, move the inclusion of "xalloc.h" from dfa.c to this file, dfa.h.
    * src/dfasearch.c (kwsmusts): Adapt to use the new interface. Update the comments to reflect reality.
    This addresses http://bugs.gnu.org/17715
* maint: update copyright year ranges to include 2015 | Jim Meyering | 2015-01-01 | 1 | -1/+1
    Run "make update-copyright". Also, ...
    * grep.texi: Update manually, converting each "--" to "-".
* grep: minor improvements to retry-DFA-superset patch | Paul Eggert | 2014-05-09 | 1 | -14/+10
    * src/dfasearch.c (EGexecute): Avoid unnecessary test in a context where memrchr cannot return a null pointer.
* grep: retry DFA superset after matching multiple lines | Norihiro Tanaka | 2014-05-09 | 1 | -13/+19
    * src/dfasearch.c (EGexecute): Without this patch, the code reverts to KWset when the DFA superset matches multiple lines. However, if the DFA superset matches multiple lines, it most likely also matches a single line, and reverting to KWset means dfafast won't work effectively. Change the code so that it retries the DFA superset immediately after it matches multiple lines. On my platform this improves the performance of "LC_ALL=C grep '\(ab\)cd\1d' k" from 3.48 to 2.14 seconds realtime, where k contains the output of "yes abcdabc | head -50000000".
* grep: fix -w match next to a multibyte letter | Paul Eggert | 2014-05-05 | 1 | -5/+8
    * NEWS: Document this.
    * src/dfasearch.c, src/kwsearch.c (WCHAR): Remove.
    (wordchar): New static function.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): Use the new functions, so that the code works correctly if a multibyte character adjacent to the match has two or more bytes.
    * src/search.h, src/searchutils.c (mb_prev_wc, mb_next_wc): New functions.
    * tests/word-delim-multibyte: Add a test for grep -w (which now passes), and a test for \> (which still fails). The \< test also still fails.
* grep: improve internal API for multibyte boundary | Paul Eggert | 2014-05-05 | 1 | -2/+1
    * src/search.h, src/searchutils.c (mb_goback): Rename from is_mb_middle. Omit last arg. Return number of bytes to go back, not just a boolean. All uses changed.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): Adjust to API change.
    * src/kwsearch.c (Fexecute): Eliminate common subexpression.
* grep: fix encoding-error incompatibilities among regex, DFA, KWset | Paul Eggert | 2014-05-05 | 1 | -2/+1
    This follows up to http://bugs.gnu.org/17376 and fixes a different set of incompatibilities, namely between the regex matcher and the other matchers, when the pattern contains encoding errors. The GNU regex matcher is not consistent in this area: sometimes an encoding error matches only itself, and sometimes it matches part of a multibyte character. There is no documentation for grep's behavior in this area and users don't seem to care, and it's simpler to defer to the regex matcher for problematic cases like these.
    * NEWS: Document this.
    * src/dfa.c (ctok): Remove. All uses removed.
    (parse_bracket_exp, atom): Use BACKREF if a pattern contains an encoding error, so that the matcher will revert to regex.
    * src/dfasearch.c, src/grep.c, src/pcresearch.c, src/searchutils.c: Don't include dfa.h, since search.h now does that for us.
    * src/dfasearch.c (EGexecute):
    * src/kwsearch.c (Fexecute): In a UTF-8 locale, there's no need to worry about matching part of a multibyte character.
    * src/grep.c (contains_encoding_error): New static function.
    (main): Use it, so that grep -F is consistent with plain fgrep when the pattern contains an encoding error.
    * src/search.h: Include dfa.h, so that kwsearch.c can call using_utf8.
    * src/searchutils.c (is_mb_middle): Remove UTF-8-specific code. Callers now ensure that we are in a non-UTF-8 locale. The code was clearly wrong, anyway.
    * tests/fgrep-infloop, tests/invalid-multibyte-infloop:
    * tests/prefix-of-multibyte: Do not require that grep have a particular behavior for this test. It's OK to match (exit status 0), not match (exit status 1), or report an error (exit status 2), since the pattern contains an encoding error and grep's behavior is not specified for such patterns. Test only that KWset, DFA, and regex agree.
    * tests/prefix-of-multibyte: Add tests for ABCABC and __..._ABCABC___.
* grep: clarify EGexecute slightly | Paul Eggert | 2014-05-03 | 1 | -3/+3
    * src/dfasearch.c (EGexecute): Change if-then-else to !if-else-then.
* grep: fix the bug in previous patch. | Norihiro Tanaka | 2014-05-03 | 1 | -2/+2
    * src/dfasearch.c (EGexecute): Do it.
* grep: simplify EGexecute further | Paul Eggert | 2014-04-30 | 1 | -70/+41
    * src/dfa.c, src/dfa.h (dfasuperset): Arg is now const pointer. Now pure.
    * src/dfasearch.c (EGexecute): Coalesce some duplicate code. Don't worry about memrchr returning NULL when that's impossible.