summaryrefslogtreecommitdiff
path: root/src/kwset.h
Commit message (Collapse)AuthorAgeFilesLines
* maint: make update-copyrightJim Meyering2022-01-011-1/+1
|
* grep: prefer signed to unsigned integersPaul Eggert2021-08-251-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This improves runtime checking for integer overflow when compiling with gcc -fsanitize=undefined and the like. It also avoids the need for some integer casts, which can be error-prone. * bootstrap.conf (gnulib_modules): Add idx. * src/dfasearch.c (struct dfa_comp, kwsmusts): (possible_backrefs_in_pattern, regex_compile, GEAcompile) (EGexecute): * src/grep.c (struct patloc, patlocs_allocated, patlocs_used) (n_patterns, update_patterns, pattern_file_name, poison_len) (asan_poison, fwrite_errno, compile_fp_t, execute_fp_t) (buf_has_encoding_errors, buf_has_nulls, file_must_have_nulls) (bufalloc, pagesize, all_zeros, fillbuf, nlscan) (print_line_head, print_line_middle, print_line_tail, grepbuf) (grep, contains_encoding_error, fgrep_icase_available) (fgrep_icase_charlen, fgrep_to_grep_pattern, try_fgrep_pattern) (main): * src/kwsearch.c (struct kwsearch, Fcompile, Fexecute): * src/kwset.c (struct trie, struct kwset, kwsalloc, kwsincr) (kwswords, treefails, memchr_kwset, acexec_trans, kwsexec) (treedelta, kwsprep, bm_delta2_search, bmexec_trans, bmexec) (acexec): * src/kwset.h (struct kwsmatch): * src/pcresearch.c (Pcompile, Pexecute): * src/search.h (mb_clen): * src/searchutils.c (kwsinit, mb_goback, wordchars_count) (wordchars_size, wordchar_next, wordchar_prev): Prefer idx_t to size_t or ptrdiff_t for nonnegative sizes, and prefer ptrdiff_t to size_t for sizes plus error values. * src/grep.c (uword_size): New constant, used for signed size calculations. (totalnl, add_count, totalcc, print_offset, print_line_head, grep): Prefer intmax_t to uintmax_t for wide integer calculations. (fgrep_icase_charlen): Prefer ptrdiff_t to int for size offsets. * src/grep.h: Include idx.h. * src/search.h (imbrlen): New function, like mbrlen except with idx_t and ptrdiff_t.
* maint: run "make update-copyright"Paul Eggert2021-01-011-1/+1
|
* grep: minor kwset cleanupsPaul Eggert2020-10-111-3/+3
| | | | | | | | | * src/kwsearch.c (Fexecute): Assume C99 to put declarations nearer uses. * src/kwset.c (bmexec): Omit unnecessary test. * src/kwset.h (struct kwsmatch): Make OFFSET and SIZE individual elements, not arrays of size 1 (a revenant of an earlier API). All uses changed.
* maint: update all copyright year number rangesJim Meyering2020-01-011-1/+1
| | | | | | | | Run "make update-copyright" and then... * gnulib: Update to latest with copyright year adjusted. * tests/init.sh: Sync with gnulib to pick up copyright year. * bootstrap: Likewise. * doc/grep.in.1: Use "-" in copyright year ranges, not \en.
* maint: update all copyright dates via "make update-copyright"Jim Meyering2019-01-011-1/+1
| | | | * gnulib: Also update submodule for its copyright updates.
* maint: update gnulib and copyright dates for 2018Jim Meyering2018-01-061-1/+1
| | | | | | * gnulib: Update to latest. * all files: Run "make update-copyright". * bootstrap: Update from gnulib.
* grep: remove Commentz-Walter codePaul Eggert2017-01-181-1/+1
| | | | | | | | | | | | | This code was not being used, and complicated maintenance. We can bring it back from the repository if it turns out to be useful later. * src/kwset.c (struct kwset.reverse): Remove. All uses of FOO->reverse replaced by (FOO->kwsexec == bmexec). (kwsalloc): Remove 'reverse' arg, as callers outside this module do not care about algorithm choice. All callers changed. (kwsprep): When deciding whether to use Boyer-Moore, do not worry about being called twice on the same kwset, as that is not allowed. (cwexec): Remove; it was never called. All uses removed.
* Improve -i performance in typical UTF-8 searchesPaul Eggert2017-01-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently ‘grep -i i’ is slow in a UTF-8 locale, because ‘i’ in the pattern matches the two-byte character 'ı' (U+0131, LATIN SMALL LETTER DOTLESS I) in data, and kwset handles only single-byte character translations, so grep falls back on a slower DFA-based search for all searches. Improve -i performance in the typical case by using kwset when data are free of troublesome characters like 'ı', falling back on the DFA only when data contain troublesome characters. * src/dfasearch.c (GEAcompile): * src/grep.c (compile_fp_t): * src/kwsearch.c (Fcompile): * src/pcresearch.c (Pcompile): Pattern arg is now char *, not char const *, since Fcompile now reallocates it sometimes. * src/grep.c (all_single_byte_after_folding): Remove. All callers removed. (fgrep_icase_charlen): New function. (fgrep_icase_available, try_fgrep_pattern): Use it, for more-generous semantics. (fgrep_to_grep_pattern): Now extern. (main): Do not free keys, since Fexecute may use them. * src/kwsearch.c (struct kwsearch): New struct. (Fcompile): Return it. If -i, be more generous about patterns. (Fexecute): Use it. Fall back on DFA when the data contain troublesome characters; this should be rare in practice. * src/kwset.c, src/kwset.h (kwswords): New function.
* dfa: prefer ptrdiff_t to size_tPaul Eggert2017-01-151-5/+6
| | | | | | | | | | | | | | | | | The code already cannot handle objects with size greater than SIZE_MAX / 2, so be more honest about it and use ptrdiff_t instead of size_t. ptrdiff_t arithmetic is signed, which allows for more checking via -fsanitize=undefined. It also makes the code a tad smaller on x86-64, since it can test for < 0 rather than for == SIZE_MAX. * src/dfasearch.c (struct dfa_comp.kwset_exact_matches): (kwsmusts, EGexecute): * src/kwsearch.c (Fcompile, Fexecute): * src/kwset.c (struct kwset.kwsexec, kwsincr, memchr_kwset) (memoff2_kwset, bmexec_trans, bmexec, cwexec, acexec_trans) (acexec, kwsexec): * src/kwset.h (struct kwsmatch.index, .offset, .size): Prefer ptrdiff_t to size_t where either will do.
* grep: improve comments, mostly in kwsetPaul Eggert2017-01-111-23/+4
| | | | | | | Remove kwset.h comments that are obsolete and seemingly not maintained anyway; people can look in kwset.c instead. Update comments to reflect current behavior better. Cite Faro & Lecroq 2013. Use GNU style for end-of-sentence.
* maint: update gnulib and copyright dates for 2017Jim Meyering2017-01-011-1/+1
| | | | | * gnulib: Update to latest. * all files: Run "make update-copyright".
* grep: minor cleanups for -F Aho-CorasickPaul Eggert2016-06-021-3/+3
| | | | | | | | | | | | | | | | | | | | | * NEWS: Don't claim 7x, as the value seems to be system-dependent. * src/kwset.c (struct kwset.kwsexec, bmexec, acexec_trans, acexec): * src/kwset.c, src/kwset.h (kwsalloc, kwsexec): Don't put 'const' into the declaration when that is irrelevant to the API. More generally, don't bother with 'const' when it's only a local so it is reasonably obvious to a reader that it is 'const' anyway. It would be overkill to add 'const' to all locals that never change. * src/kwset.c (U): Avoid unnecessary parens. (treefails, memoff2_kwset, bmexec_trans, bmexec, cwexec, acexec_trans): Prefer SIZE_MAX to (size_t) -1. (bmexec_trans, cwexec, acexec_trans): Remove attributes for static functions that no longer seem needed. (memoff2_kwset): Rename from memchr2_kwset, since it returns an offset, not a pointer. All uses changed. (cwexec, acexec_trans) [lint]: Remove initialization that is no longer needed; at least, GCC 6.1 x86-64 does not need it. (acexec_trans): Clarify code by using nesting rather than 'continue'.
* grep: -F multiword longest match not always neededNorihiro Tanaka2016-06-021-2/+2
| | | | | | | | | Searching multiple fixed words, grep immediately returns without longest match if not needed. Without this change, grep tries longest match for multiple words even if not needed. * src/kwset.c (kwsexec, acexec, cwexec, bmexec): Add a bool argument for whether longest match is needed. All callers changed. * src/kwset.h (kwsexec): Update prototype.
* grep: use Aho-Corasick algorithm to search multiple fixed wordsPaul Eggert2016-06-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Searching multiple fixed words, grep used the Commentz-Walter algorithm, but this was O(m*n) and was very slow in the worst case. For example: - input: yes `printf %040d` | head -10000000 - word1: x0000000000000000000 - word2: x This change instead uses the Aho-Corasick algorithm to search multiple fixed words. It uses a high-quality trie-building function that is already defined for Commentz-Walter in kwset.c. I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5 by this change. Using best-of-5 trials for the benchmark: find /usr/share/doc/ -type f | LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null The results were: real 11.37 user 11.03 sys 0.24 [without the change] real 1.49 user 1.31 sys 0.15 [with the change] * src/kwset.c (struct kwset): Add a new member 'mode'. (kwsalloc): Use it. All callers are changed. (kwsincr): Using Aho-Corasick algorithm, build tries in normal order. (acexec_trans, acexec): Add a new function. (kwsexec): Use it. * src/kwset.h (kwsalloc): Update a prototype. * NEWS (Improvements): Mention it.
* maint: update copyright year, bootstrap, init.shJim Meyering2016-01-011-1/+1
| | | | | | | | Run "make update-copyright" and then... * gnulib: Update to latest. * tests/init.sh: Update from gnulib. * bootstrap: Likewise.
* maint: update copyright year ranges to include 2015Jim Meyering2015-01-011-1/+1
| | | | | Run "make update-copyright". Also, ... * grep.texi: Update manually, converting each "--" to "-".
* grep: simplify memory allocation in kwsetPaul Eggert2014-04-071-9/+8
| | | | | | | | | | | | | | | * src/kwset.c: Include kwset.h first, to check its prereqs. Include xalloc.h, for xmalloc. (kwsalloc): Use xmalloc, not malloc, so that the caller need not worry about memory allocation failure. (kwsalloc, kwsincr, kwsprep): Do not worry about obstack_alloc returning NULL, as that's not possible. (kwsalloc, kwsincr, kwsprep, bmexec, cwexec, kwsexec, kwsfree): Omit unnecessary conversion between struct kwset * and kwset_t. (kwsincr, kwsprep): Return void since memory-allocation failure is not possible now. All uses changed. * src/kwset.h: Include <stddef.h>, for size_t, so that this include file doesn't require other files to be included first.
* maint: update copyright dates for 2014Jim Meyering2014-01-011-1/+1
| | | | Do that by running "make update-copyright".
* maint: update all copyright year number rangesJim Meyering2013-01-041-1/+1
| | | | Run "make update-copyright".
* grep: fix some core dumps with long lines etc.Paul Eggert2012-03-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These problems mostly occur because the code attempts to stuff sizes into int or into unsigned int; this doesn't work on most 64-bit hosts and the errors can lead to core dumps. * NEWS: Document this. * src/dfa.c (token): Typedef to ptrdiff_t, since the enum's range could be as small as -128 .. 127 on practical hosts. (position.index): Now size_t, not unsigned int. (leaf_set.elems): Now size_t *, not unsigned int *. (dfa_state.hash, struct mb_char_classes.nchars, .nch_classes) (.nranges, .nequivs, .ncoll_elems, struct dfa.cindex, .calloc, .tindex) (.talloc, .depth, .nleaves, .nregexps, .nmultibyte_prop, .nmbcsets): (.mbcsets_alloc): Now size_t, not int. (dfa_state.first_end): Now token, not int. (state_num): New type. (struct mb_char_classes.cset): Now ptrdiff_t, not int. (struct dfa.utf8_anychar_classes): Now token[5], not int[5]. (struct dfa.sindex, .salloc, .tralloc): Now state_num, not int. (struct dfa.trans, .realtrans, .fails): Now state_num **, not int **. (struct dfa.newlines): Now state_num *, not int *. (prtok): Don't assume 'token' is no wider than int. (lexleft, parens, depth): Now size_t, not int. (charclass_index, nsubtoks) (parse_bracket_exp, addtok, copytoks, closure, insert, merge, delete) (state_index, epsclosure, state_separate_contexts) (dfaanalyze, dfastate, build_state, realloc_trans_if_necessary) (transit_state_singlebyte, match_anychar, match_mb_charset) (check_matching_with_multibyte_ops, transit_state_consume_1char) (transit_state, dfaexec, free_mbdata, dfaoptimize, dfafree) (freelist, enlist, addlists, inboth, dfamust): Don't assume indexes fit in 'int'. (lex): Avoid overflow in string-to-{hi,lo} conversions. (dfaanalyze): Redo indexing so that it works with size_t values, which cannot go negative. * src/dfa.h (dfaexec): Count argument is now size_t *, not int *. (dfastate): State numbers are now ptrdiff_t, not int. * src/dfasearch.c: Include "intprops.h", for TYPE_MAXIMUM. (kwset_exact_matches): Now size_t, not int. (EGexecute): Don't assume indexes fit in 'int'. Check for overflow before converting a ptrdiff_t to a regoff_t, as regoff_t is narrower than ptrdiff_t in 64-bit glibc (contra POSIX). Check for memory exhaustion in re_search rather than treating it merely as failure to match; use xalloc_die () to report any error. * src/kwset.c (struct trie.accepting): Now size_t, not unsigned int. (struct kwset.words): Now ptrdiff_t, not int. * src/kwset.h (struct kwsmatch.index): Now size_t, not int.
* maint: update all copyright year number rangesJim Meyering2012-01-011-1/+1
| | | | Run "make update-copyright".
* maint: update copyright year ranges to include 2011Jim Meyering2011-01-031-1/+1
| | | | Run "make update-copyright", so "make syntax-check" works in 2011.
* kwset: correct comments; require non-NULL kwsmatch argumentJim Meyering2010-03-271-2/+4
| | | | | | | | | * src/kwset.c (kwsexec): Correct comments. This function has been returning an offset, not a pointer, for 9 years. Do not test for kwsmatch == NULL. All callers pass non-NULL. (cwexec): Likewise. * src/kwset.h (kwsexec): Mark the 4th parameter, kwsmatch, as non-NULL. Include "arg-nonnull.h".
* kwset/system: remove ptr_tPaolo Bonzini2010-03-121-1/+2
| | | | | * src/kwset.h: Declare kwset using an incomplete struct type. * src/system.h (ptr_t): Remove.
* maint: remove all uses of PARAMSJim Meyering2010-03-111-5/+5
| | | | | | | | | | | * lib/savedir.h (PARAMS): Remove definitions manually. Remove the remaining ones via this command: git grep -l define.PARAMS |xargs perl -ni -e '/define PARAMS/ or print' * src/dfa.h (PARAMS): Remove definitions. * src/system.h (PARAMS): Likewise. Remove most uses with this: git grep -lw PARAMS |xargs perl -pi -e 's/\bPARAMS *\((.*)\);/$1;/' Remove the remainder manually.
* maint: update all FSF copyright year lists to include 2010Jim Meyering2010-01-011-1/+2
| | | | | | Use this command: git ls-files |grep -vE '^(\..*|COPYING|gnulib)$' |xargs \ env UPDATE_COPYRIGHT_USE_INTERVALS=1 build-aux/update-copyright
* update/add copyright noticesKarl Berry2009-01-301-1/+1
|
* GPLv2 -> GPLv3Bernhard Rosenkraenzer2007-06-281-1/+1
|
* Update FSF's civic address, zip code, and citizen relocation code.Charles Levert2005-05-021-2/+2
| | | | | * 78 files: Update FSF's civic address, zip code, and citizen relocation code.
* fix warnings about constStepan Kasal2005-02-091-3/+3
| | | | | * src/kwset.h (kwsincr, kwsprep): Change return type to 'const char *'. * src/kwset.c (kwsincr, kwsprep): Likewise.
* Remove sentinel code.Alain Magloire2001-02-081-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we know that the input is always terminated by a newline before the matching algorithms see it, clean up the matching algorithms so that they no longer need to modify the input by inserting a sentinel newline, and no longer worry about running off the end of the buffer due to a missing sentinel. * src/grep.c (nlscan, prpending, prtext, grepbuf): Do not worry about running off the end of the input buffer, since it's now guaranteed to end in the sentinel newline. * src/search.c (EGexecute, Pexecute): Likewise. * src/dfa.c (prtok, dfasyntax, dfaparse, copy, merge, state_index, epsclosure, dfaexec, dfacomp): Change many instances of "T *" to "T const *", to catch any inadvertent programming errors made during this conversion. * src/dfa.h (dfacomp, dfaexec, dfaparse): Likewise. * src/grep.c (struct stats.parent, long_options, grepdir, compile, execute, fillbuf, lastnl, lastout, nlscan, prline, prpending, prtext, grepbuf, grep, grepfile, grepdir): Likewise. * src/grep.h (struct matcher.compile, struct matcher.execute): Likewise. * src/kwset.c (struct kwset.trans, kwsalloc, kwsincr, treefails, treedelta, hasevery, treenext, bmexec, cwexec, kwsexec): Likewise. * src/kwset.h (kwsalloc, kwsincr, kwsexec): Likewise. * src/search.c (kwsmusts, Gcompile, Ecompile, EGexecute, Pcompile, Pexecute): Likewise. * src/dfa.c (dfaexec): Use size_t, not char *, to avoid worrisome casts to convert char const * to char *. * src/dfa.h (dfaexec): Likewise. * src/grep.c (execute): Likewise. * src/grep.h (execute): Likewise. * src/kwset.c (bmexec, cwexec, kwsexec): Likewise. * src/kwset.h (struct kwsmatch.offset, kwsalloc, kwsincr, kwsexec): Likewise. * src/search.c (EGexecute, Fexecute, Pexecute): Likewise. * src/dfa.h (_PTR_T): Depend on defined __STDC__, not __STDC__. (PARAMS): Depend on PROTOTYPES, not __STDC__. * src/dfa.c (dfasyntax): Last arg is unsigned char, not int. * src/dfa.h (dfasyntax): Likewise. * src/dfa.h (struct dfa): Remove member newlines; no longer needed. * src/dfa.c (build_state, dfaexec, dfafree): Do not worry about special newline state. * src/search.c (matchers): Move definition to end of file, so that we don't need forward decls. (lastexact): Remove. (kwset_exact_matches): New var; subsumes old lastexact var. All uses changed. * src/dfa.c (index): Remove macro. (REALLOC_IF_NECESSARY): Skip unnecessary test. (tstbit, setbit, clrbit): Declare arg to be unsigned, to help compiler. (copyset, zeroset, equal): Use C builtin primitives, to help compiler. (dfaexec): Do not modify input string. Remove newline parameter; no longer needed. (comsubs): Use strchr, not index. * src/grep.h (matchers): Use fixed name size, not pointer (as there's no need for the extra flexibility). All uses changed. * src/kwset.h (struct kwsmatch.offset): Renamed from beg, with change of type to size_t. All uses changed. * src/grep.c (reset): No longer need kludge for dfaexec. Simplify. (reset, grepbuf): Adjust to new interface for 'execute'. (install_matcher): List is now terminated by null compile, not null name. Do not invoke setrlimit if that wouldn't change the limit. * src/dfa.c (xcalloc, xmalloc, xrealloc, prtok, tstbit, setbit, clrbit, copyset, zeroset, notset, equal, charclass_index, looking_at, lex, addtok, atom, nsubtoks, copytoks, closure, branch, regexp, copy, insert, merge, delete, state_index, build_state, build_state_zero, icatalloc, icpyalloc, istrstr, ifree, freelist, enlist, comsubs, addlists, inboth): Remove forward decls; no longer needed. * src/grep.c (ck_atoi, usage, error, setmatcher, install_matcher, prepend_args, prepend_default_options, page_alloc, reset, fillbuf, grepbuf, prtext, prpending, prline, print_offset_sep, nlscan, grep, grepfile): Likewise. * src/kwset.c (enqueue, treefails, treedelta, hasevery, treenext, bmexec, cwexec): Likewise. * src/search.c (Gcompile, Ecompile, EGexecute, Fcompile, Fexecute, Pcompile, Pexecute, kwsinit): Likewise. * src/search.c (Pcompile): Do not assume newly allocated storage is zeroed.
* Initial revisionAlain Magloire1998-11-031-0/+57