path: root/src/grep.c
Commit message (Author, Age, Files, Lines)
* grep: prefer signed to unsigned integers (Paul Eggert, 2021-08-25, 1 file, -146/+136)
  This improves runtime checking for integer overflow when compiling with gcc -fsanitize=undefined and the like. It also avoids the need for some integer casts, which can be error-prone.
  * bootstrap.conf (gnulib_modules): Add idx.
  * src/dfasearch.c (struct dfa_comp, kwsmusts): (possible_backrefs_in_pattern, regex_compile, GEAcompile) (EGexecute):
  * src/grep.c (struct patloc, patlocs_allocated, patlocs_used) (n_patterns, update_patterns, pattern_file_name, poison_len) (asan_poison, fwrite_errno, compile_fp_t, execute_fp_t) (buf_has_encoding_errors, buf_has_nulls, file_must_have_nulls) (bufalloc, pagesize, all_zeros, fillbuf, nlscan) (print_line_head, print_line_middle, print_line_tail, grepbuf) (grep, contains_encoding_error, fgrep_icase_available) (fgrep_icase_charlen, fgrep_to_grep_pattern, try_fgrep_pattern) (main):
  * src/kwsearch.c (struct kwsearch, Fcompile, Fexecute):
  * src/kwset.c (struct trie, struct kwset, kwsalloc, kwsincr) (kwswords, treefails, memchr_kwset, acexec_trans, kwsexec) (treedelta, kwsprep, bm_delta2_search, bmexec_trans, bmexec) (acexec):
  * src/kwset.h (struct kwsmatch):
  * src/pcresearch.c (Pcompile, Pexecute):
  * src/search.h (mb_clen):
  * src/searchutils.c (kwsinit, mb_goback, wordchars_count) (wordchars_size, wordchar_next, wordchar_prev): Prefer idx_t to size_t or ptrdiff_t for nonnegative sizes, and prefer ptrdiff_t to size_t for sizes plus error values.
  * src/grep.c (uword_size): New constant, used for signed size calculations.
    (totalnl, add_count, totalcc, print_offset, print_line_head, grep): Prefer intmax_t to uintmax_t for wide integer calculations.
    (fgrep_icase_charlen): Prefer ptrdiff_t to int for size offsets.
  * src/grep.h: Include idx.h.
  * src/search.h (imbrlen): New function, like mbrlen except with idx_t and ptrdiff_t.
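
The motivation given above (overflow in signed arithmetic can be detected, while unsigned arithmetic silently wraps) can be illustrated with a minimal sketch. It assumes a GCC- or Clang-compatible compiler for __builtin_add_overflow; the typedef is only a stand-in for gnulib's idx_t, and idx_add and offset_in_buffer are hypothetical helpers, not functions from grep.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for gnulib's idx_t: a signed type as wide as ptrdiff_t.  */
typedef ptrdiff_t idx_t;

/* Hypothetical helper: store A + B in *SUM and return true,
   or return false if the sum would overflow idx_t.
   Assumes a compiler that provides __builtin_add_overflow.  */
bool
idx_add (idx_t a, idx_t b, idx_t *sum)
{
  return !__builtin_add_overflow (a, b, sum);
}

/* Hypothetical helper: is OFF a valid index into a buffer of SIZE bytes?
   With a signed type, a bogus negative offset stays visibly negative
   instead of wrapping around to a huge positive value.  */
bool
offset_in_buffer (idx_t off, idx_t size)
{
  return 0 <= off && off < size;
}
```

A caller would check the result of idx_add before using the sum, for example when growing a buffer by a computed amount.
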
* grep: avoid some size_t casts (Paul Eggert, 2021-08-24, 1 file, -4/+4)
  This helps move the code away from unsigned types.
  * src/grep.c (buf_has_encoding_errors, contains_encoding_error):
  * src/searchutils.c (mb_goback): Compare to MB_LEN_MAX, not to (size_t) -2. This is a bit safer anyway, as grep relies on MB_LEN_MAX limits elsewhere.
  * src/search.h (mb_clen): Compare to -2 before converting to size_t.
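
The comparison described above works because mbrlen reports errors as (size_t) -1 and incomplete input as (size_t) -2, both far larger than MB_LEN_MAX, so a single test against MB_LEN_MAX rejects both without a cast. A minimal sketch, where first_char_is_valid is a hypothetical name and not a function from grep:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return true if S (N bytes long) starts with a
   complete, validly encoded character in the current locale.  */
bool
first_char_is_valid (const char *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);   /* start in the initial shift state */
  size_t clen = mbrlen (s, n, &state);
  /* (size_t) -1 and (size_t) -2 both exceed MB_LEN_MAX, so this one
     comparison covers the error and incomplete-sequence cases.  */
  return clen <= MB_LEN_MAX;
}
```
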
* grep: avoid sticky problem with ‘-f - -f -’ (Paul Eggert, 2021-08-21, 1 file, -6/+11)
  Inspired by bug#50129 even though this is a different bug.
  * src/grep.c (main): For ‘-f -’, use clearerr (stdin) after reading, so that ‘grep -f - -f -’ reads stdin twice even when stdin is a tty. Also, for ‘-f FILE’, report any I/O error when closing FILE.
* grep: djb2 correction (Paul Eggert, 2021-08-18, 1 file, -1/+9)
  Problem reported by Alex Murray (bug#50093).
  * src/grep.c (hash_pattern): Use a nonzero initial value.
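
To show why the initial value matters, here is a minimal sketch of the classic djb2 hash with the seed passed as a parameter. The function name and signature are illustrative assumptions, and 5381 is the conventional djb2 seed; the exact value used in grep.c is not shown in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative djb2: H = H * 33 + BYTE for each byte, starting from SEED.
   Unsigned arithmetic is used on purpose here, since a hash is expected
   to wrap around.  */
uint32_t
djb2 (const char *s, size_t len, uint32_t seed)
{
  uint32_t h = seed;
  for (size_t i = 0; i < len; i++)
    h = h * 33 + (unsigned char) s[i];
  return h;
}
```

With a zero seed, any run of NUL bytes hashes to 0 regardless of its length, so such keys collide; with a nonzero seed the length influences the result. This is one plausible weakness of a zero seed, not necessarily the exact problem reported in the bug.
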
* grep: simplify data movement slightly (Paul Eggert, 2021-08-09, 1 file, -11/+5)
  * src/grep.c (fillbuf): Simplify movement of saved data.
* grep: pointer-integer cast nit (Paul Eggert, 2021-08-09, 1 file, -2/+2)
  * src/grep.c (ALIGN_TO): When converting pointers to unsigned integers, convert to uintptr_t not size_t, as size_t in theory might be too narrow.
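
The point about uintptr_t can be sketched as a small function. This is not grep's ALIGN_TO macro; align_up is a hypothetical name, and the sketch assumes ALIGNMENT is nonzero and that the result stays inside the same object.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return P rounded up to the next multiple of
   ALIGNMENT.  The pointer is converted to uintptr_t, which is defined
   to hold any object pointer value, rather than to size_t, which only
   needs to hold object sizes.  */
char *
align_up (char *p, size_t alignment)
{
  uintptr_t addr = (uintptr_t) p;
  size_t rem = addr % alignment;
  return rem == 0 ? p : p + (alignment - rem);
}
```
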
* doc: usage: --group-separator/--no-group-separator (Kevin Locke, 2021-08-06, 1 file, -0/+2)
  * src/grep.c (usage): Document --group-separator and --no-group-separator.
* maint: run "make update-copyright" (Paul Eggert, 2021-01-01, 1 file, -1/+1)
* grep: use of --unix-byte-offsets (-u) now elicits a warning (Jim Meyering, 2020-12-25, 1 file, -2/+2)
  * NEWS (Change in behavior): Mention this.
  * src/grep.c (main): Warn about each use of obsolete --unix-byte-offsets (-u).
  * doc/grep.in.1 (-u): Remove its documentation.
* grep: avoid performance regression with many patterns (Jim Meyering, 2020-11-26, 1 file, -2/+3)
  * src/grep.c (hash_pattern): Switch from PJW to DJB2, to avoid an O(N) to O(N^2) performance regression due to hash collisions with patterns from e.g., seq 500000|tr 0-9 A-J
  Reported by Frank Heckenbach in https://bugs.gnu.org/44754
  * NEWS (Bug fixes): Mention it.
  * tests/hash-collision-perf: New file.
  * tests/Makefile.am (TESTS): Add it.
* build: update gnulib to latest for warning fixes (Jim Meyering, 2020-11-26, 1 file, -1/+1)
  * gnulib: Update submodule to latest.
  * src/grep.c (printf_errno): Reflect gnulib's renaming: change _GL_ATTRIBUTE_FORMAT_PRINTF to _GL_ATTRIBUTE_FORMAT_PRINTF_STANDARD
* grep: remove GREP_OPTIONS (Paul Eggert, 2020-11-03, 1 file, -67/+2)
  * NEWS: Mention this.
  * doc/grep.in.1: Remove GREP_OPTIONS documentation.
  * doc/grep.texi (Environment Variables): Move GREP_OPTIONS stuff into a “no longer implemented” paragraph.
  * src/grep.c (prepend_args, prepend_default_options): Remove.
    (main): Do not look at GREP_OPTIONS.
  * tests/Makefile.am (TESTS_ENVIRONMENTS):
  * tests/init.cfg (vars_): Remove GREP_OPTIONS.
* grep: -P: report input filename upon PCRE execution failure (Jim Meyering, 2020-10-11, 1 file, -1/+1)
  Without this, it could be tedious to determine which input file evokes a PCRE-execution-time failure.
  * src/pcresearch.c (Pexecute): When failing, include the error-provoking file name in the diagnostic.
  * src/grep.c (input_filename): Make extern, since used above.
  * src/search.h (input_filename): Declare.
  * tests/filename-lineno.pl: Test for this.
    ($no_pcre): Factor out.
  * NEWS (Bug fixes): Mention this.
* grep: pacify Sun C 5.15 (Paul Eggert, 2020-09-23, 1 file, -1/+1)
  This suppresses a false alarm '"grep.c", line 720: warning: initializer will be sign-extended: -1'.
  * src/grep.c (uword_max): New static constant.
    (initialize_unibyte_mask): Use it.
* grep: fix more Turkish-eyes bugs (Paul Eggert, 2020-09-23, 1 file, -35/+81)
  Fix more bugs recently uncovered by Norihiro Tanaka (Bug#43577).
  * NEWS: Mention new bug report.
  * src/grep.c (ok_fold): New static var.
    (setup_ok_fold): New function.
    (fgrep_icase_charlen): Reject single-byte characters if they match some multibyte characters when ignoring case. This part of the patch is partly derived from <https://bugs.gnu.org/43577#14>, which means it is:
    Co-authored-by: Norihiro Tanaka <noritnk@kcn.ne.jp>
    (main): Call setup_ok_fold if ok_fold might be needed.
  * src/searchutils.c (kwsinit): With the grep.c changes, this code can now revert to classic 7th Edition Unix style; aborting would be wrong.
  * tests/turkish-eyes: Add tests for these bugs.
* grep: fix recently-introduced performance glitch (Paul Eggert, 2020-09-23, 1 file, -1/+0)
  * src/grep.c (main): Do not double-increment n_patterns. update_patterns increments n_patterns now; do not increment it again, as the incorrect count would hurt performance heuristics later.
* grep: avoid unnecessary regex compilation (Norihiro Tanaka, 2020-09-22, 1 file, -3/+4)
  Grep resorts to using the regex engine when the precision of either -o or --color is required, or when the pattern is not supported by our DFA engine (e.g., backref). Otherwise, grep would perform regex compilation solely to check the syntax. This change makes grep skip that compilation in the common case for which it is unnecessary. The compilation we are avoiding is quite costly, consuming O(N^2) RSS for N regular expressions.
  * src/dfasearch.c (GEAcompile): Add new argument, and avoid unneeded compilation of regex.
  * src/grep.c (compile_fp_t): Update prototype.
    (main): Update caller.
  * src/kwsearch.c (Fcompile): Update caller and add new argument.
  * src/pcresearch.c (Pcompile): Add new argument.
  * src/search.h (GEAcompile, Fcompile, Pcompile): Update prototype.
* grep: "grep '\)'" reports an error again (Paul Eggert, 2020-09-18, 1 file, -0/+6)
  * src/grep.c (try_fgrep_pattern): With -G, pass \) through to GEAcompile so that it can complain. This fixes an unexpected change in behavior from grep 3.4 and earlier.
  * tests/filename-lineno.pl: Add tests for this sort of thing.
* grep: tweak by using mempcpy (Paul Eggert, 2020-09-18, 1 file, -4/+2)
  * src/grep.c (try_fgrep_pattern): Tweak previous change by using mempcpy.
* grep: make echo .|grep '\.' match once again (Jim Meyering, 2020-09-18, 1 file, -0/+3)
  The same applied for many other backslash-escaped bytes, not just metacharacters. The switch to rawmemchr in v3.4-almost-10-g9393b97 made some parts of the code require the usually-guaranteed newline sentinel at the end of each pattern. Before, some consumers used a (correct) pattern length and did not care that try_fgrep_pattern could transform a pattern (with sentinel) like "\\.\n" to "..\n", thus violating that assumption.
  * src/grep.c (try_fgrep_pattern): Preserve the invariant that each regexp is newline-terminated.
  * tests/backslash-dot: New file. Test for this.
  * tests/Makefile.am (TESTS): Add it.
* grep: be more consistent about diagnostic format (Paul Eggert, 2020-09-18, 1 file, -6/+3)
  * NEWS: Mention this.
  * bootstrap.conf (gnulib_modules): Remove 'quote'.
  * src/grep.c: Do not include quote.h.
    (grep, grepdirent, grepdesc): Put the three unusual diagnostics into the same "grep: FOO: message" form that grep uses elsewhere.
  * tests/binary-file-matches, tests/in-eq-out-infloop: Adjust tests to match new diagnostic format.
* maint: avoid syntax-check failure (Jim Meyering, 2020-09-17, 1 file, -1/+1)
  * src/grep.c (grep): Lower-case the "B" in "Binary file... matches" diagnostic that we now emit to stderr. This avoids the following when running "make syntax-check":
    maint.mk: found capitalized error message
    make: *** [maint.mk:469: sc_error_message_uppercase] Error 1
* Send "Binary file FOO matches" to stderr (Paul Eggert, 2020-09-17, 1 file, -6/+2)
  * NEWS, doc/grep.texi: Mention this change (Bug#29668).
  * src/grep.c (grep): Send "Binary file FOO matches" to stderr instead of stdout.
  * tests/encoding-error, tests/invalid-multibyte-infloop:
  * tests/null-byte, tests/pcre-count, tests/surrogate-pair:
  * tests/symlink, tests/unibyte-binary: Adjust tests to match new behavior. In all cases this simplifies the tests, which is a good sign.
* Suppress "Binary file FOO matches" if -I (Paul Eggert, 2020-09-17, 1 file, -2/+3)
  Problem reported by Jason Franklin (Bug#33552).
  * NEWS: Mention this.
  * src/grep.c (grep): Do not output "Binary file FOO matches" if -I.
  * tests/encoding-error: Add test for this bug.
* Prefer rawmemchr to memchr when it’s easy (Paul Eggert, 2020-09-07, 1 file, -6/+5)
  * bootstrap.conf (gnulib_modules): Add rawmemchr.
  * src/dfasearch.c (GEAcompile, EGexecute):
  * src/grep.c (update_patterns, prpending, prtext):
  * src/kwsearch.c (Fcompile, Fexecute):
  * src/pcresearch.c (Pcompile, Pexecute): Simplify (and presumably speed up a little) by using rawmemchr with a sentinel, instead of using memchr.
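
The sentinel technique mentioned above can be sketched as follows. To stay portable, the sketch uses a hand-written unbounded scan as a stand-in for rawmemchr (a GNU extension); scan_unbounded and count_lines are hypothetical names, not functions from grep.

```c
#include <stddef.h>

/* Portable stand-in for rawmemchr: return the address of the first
   byte equal to C at or after S.  There is no length limit, so the
   caller must guarantee that C really occurs (the sentinel).  */
const char *
scan_unbounded (const char *s, char c)
{
  while (*s != c)
    s++;
  return s;
}

/* Count the lines in BUF[0..LEN-1].  Precondition: LEN is 0 or the
   buffer ends with '\n', which acts as the sentinel that makes the
   unbounded scan safe.  */
size_t
count_lines (const char *buf, size_t len)
{
  const char *lim = buf + len;
  size_t n = 0;
  for (const char *p = buf; p < lim; p = scan_unbounded (p, '\n') + 1)
    n++;
  return n;
}
```

Because every scan is known to hit the sentinel, the inner loop needs no remaining-length bookkeeping, which is the simplification the commit message refers to.
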
* Simplify pattern_file_name (Paul Eggert, 2020-09-07, 1 file, -2/+1)
  * src/grep.c (pattern_file_name): Make first argument origin-0, not origin-1, as this simplifies both caller and callee. All uses changed.
* Omit duplicate regexps (Paul Eggert, 2020-09-07, 1 file, -100/+177)
  Do not pass two copies of the same regexp to the regular-expression engine. Although the engines should perform nearly as well even with the copies, in practice they do not. Problem reported by Luca Borzacchiello (Bug#43040).
  * bootstrap.conf (gnulib_modules): Add hash.
  * src/grep.c: Include stdint.h, for SIZE_WIDTH. Include hash.h.
    (struct patloc, patloc, patlocs_allocated, patlocs_used): Rename from struct FL_pair, fl_pair, n_fl_pair_slots, n_pattern_files, respectively, since the data type is no longer a pair. All uses changed.
    (struct patloc): New member FILELINE. The lineno member is now ptrdiff_t since nowadays we prefer signed types.
    (pattern_array, patterns_table): New static vars.
    (count_nl_bytes, fl_add): Remove; no longer used.
    (hash_pattern, compare_patterns, update_patterns): New functions. update_patterns does what fl_add used to do, plus remove dups.
    (pattern_file_name): Adjust to change from fl_pair to patloc.
    (main): Move some variables to inner blocks for clarity. Maintain the pattern_table hash of all patterns. Update pattern_array to match keys, and use update_patterns instead of fl_add to remove duplicate keys.
  * tests/filename-lineno.pl (invalid-re-2-files) (invalid-re-2-files2, invalid-re-2e): Ensure regexps are unique in tests so that dups aren’t removed in diagnostics.
    (invalid-re-line-numbers): New test.
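
The effect of removing duplicate patterns can be sketched on a newline-separated pattern buffer. This sketch deliberately swaps in a simpler technique: it compares each line against the lines already kept, which takes quadratic time, whereas the commit above uses a hash table (the gnulib hash module) to avoid that cost. dedup_lines is a hypothetical name, not a function from grep.

```c
#include <stddef.h>
#include <string.h>

/* Return the length, including any trailing newline, of the line that
   starts at P and lies within the next AVAIL bytes.  */
static size_t
line_length (const char *p, size_t avail)
{
  const char *nl = memchr (p, '\n', avail);
  return nl ? (size_t) (nl - p) + 1 : avail;
}

/* Remove duplicate lines from BUF[0..LEN-1] in place, keeping the
   first occurrence of each line, and return the new length.
   Quadratic-time illustration only.  */
size_t
dedup_lines (char *buf, size_t len)
{
  size_t out = 0;               /* bytes kept so far */
  for (size_t i = 0; i < len; )
    {
      size_t linelen = line_length (buf + i, len - i);
      int dup = 0;
      /* Compare against every line already kept in BUF[0..OUT-1].  */
      for (size_t j = 0; j < out && !dup; )
        {
          size_t keptlen = line_length (buf + j, out - j);
          dup = (keptlen == linelen
                 && memcmp (buf + j, buf + i, linelen) == 0);
          j += keptlen;
        }
      if (!dup)
        {
          memmove (buf + out, buf + i, linelen);
          out += linelen;
        }
      i += linelen;
    }
  return out;
}
```
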
* Revert -L exit status change introduced in grep 3.2 (Paul Eggert, 2020-08-22, 1 file, -5/+5)
  Problems reported by Antonio Diaz Diaz in: https://bugs.gnu.org/28105#29
  * NEWS, doc/grep.texi (Exit Status), src/grep.c (usage): Adjust documentation accordingly.
  * src/grep.c (grepdesc, main): Go back to old behavior.
  * tests/skip-read: Adjust tests accordingly.
* doc: fix --exclude description in man page (Paul Eggert, 2020-01-02, 1 file, -2/+2)
  Problem reported by Duncan Moore (Bug#37212).
  * src/grep.c (usage): Fix incorrect statement about --exclude and directories. Standardize on “that match GLOB” instead of “matching GLOB”.
* maint: update all copyright year number ranges (Jim Meyering, 2020-01-01, 1 file, -1/+1)
  Run "make update-copyright" and then...
  * gnulib: Update to latest with copyright year adjusted.
  * tests/init.sh: Sync with gnulib to pick up copyright year.
  * bootstrap: Likewise.
  * doc/grep.in.1: Use "-" in copyright year ranges, not \en.
* grep: new --no-ignore-case option (Paul Eggert, 2019-11-05, 1 file, -1/+8)
  Suggested by Karl Berry and mostly implemented by Arnold Robbins (Bug#37907).
  * NEWS:
  * doc/grep.in.1:
  * doc/grep.texi (Matching Control):
  * src/grep.c (usage): Document the new option.
  * src/grep.c (NO_IGNORE_CASE_OPTION): New constant.
    (long_options, main): Support new option.
* grep: simplify previous patch (Paul Eggert, 2019-11-05, 1 file, -13/+7)
  * src/grep.c (main): Use an int rather than an enum for a local var, which is overkill here.
* grep: further simplify out_file handling (Paul Eggert, 2019-11-05, 1 file, -23/+24)
  * src/grep.c (print_filenames): Make this a local variable instead of static. Rename it to filename_option, to avoid confusion with the print_filename function, and rename the enum values for the same reason. All uses changed.
    (out_file): Now -1, 0, 1 to represent unknown, false, true. All uses changed.
    (single_command_line_arg): Remove. This static variable’s function is now accomplished by a local variable ‘num_operands’.
    (grepdesc): Simplify adjustment of out_file accordingly.
    (main): Initialize out_file to -1 if not known yet.
* grep: simplify out_file handling (Zev Weiss, 2019-11-05, 1 file, -16/+24)
  * src/grep.c (print_filenames): New tristate enum (-H, -h, or neither); supplants with_filenames and no_filenames.
    (single_command_line_arg): New variable indicating if grep was run with a single command-line argument.
    (no_filenames): Remove variable.
    (grepdirent): Don't twiddle out_file back and forth during recursion.
    (grepdesc): Turn off out_file on 'grep -r foo nondirectory'.
    (main): Replace with_filenames and no_filenames with print_filenames. Enable out_file when both -r/-R and multiple arguments are given.
* grep: fix ‘grep -L ... >/dev/null’ bug (Paul Eggert, 2019-10-12, 1 file, -2/+2)
  Problem reported by Adam Sampson (Bug#37716).
  * NEWS: Mention this.
  * src/grep.c (grepdesc): Don’t assume that stdout being /dev/null means list_files == LISTFILES_NONE.
    (main): Do not change list_files merely because stdout is /dev/null.
  * tests/skip-read: Test for this bug.
* grep: tighten -i doc (Paul Eggert, 2019-10-03, 1 file, -1/+1)
  * doc/grep.in.1:
  * doc/grep.texi (Matching Control):
  * src/grep.c (usage): Make it clearer that -i affects patterns and data, but not file names (Bug#37604).
* grep: parse --color arg independent of locale (Paul Eggert, 2019-02-03, 1 file, -7/+10)
  This is a better fix for Bug#34285.
  * bootstrap.conf (gnulib_modules): Add c-strcase.
  * src/grep.c: Include c-strcase.h, not strings.h.
    (main): Use c_strcasecmp, not strcasecmp.
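
A locale-independent comparison of the kind this commit switches to can be sketched as below. This is not gnulib's c_strcasecmp; ascii_tolower and ascii_strcasecmp are hypothetical names, and the sketch assumes an ASCII-compatible execution character set. Only the letters A to Z are folded, so the result cannot change with the locale (for example, one where the lowercase form of "I" is not "i").

```c
/* Fold only the ASCII letters A-Z; every other value is returned as is.
   Assumes an ASCII-compatible character set.  */
int
ascii_tolower (int c)
{
  return ('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Hypothetical helper: compare A and B ignoring ASCII case only.
   Returns 0 if equal, otherwise a negative or positive value.  */
int
ascii_strcasecmp (const char *a, const char *b)
{
  const unsigned char *p = (const unsigned char *) a;
  const unsigned char *q = (const unsigned char *) b;
  for (;; p++, q++)
    {
      int x = ascii_tolower (*p);
      int y = ascii_tolower (*q);
      if (x != y || x == 0)
        return x - y;
    }
}
```

An option parser could then accept an argument such as "ALWAYS" or "Always" wherever it accepts "always", independent of the user's locale settings.
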
* grep: fix grep.c includes (Paul Eggert, 2019-02-02, 1 file, -1/+1)
  * src/grep.c: Include strings.h; problem reported by David Monniaux (Bug#34285). Do not include fcntl.h, as system.h does that for us.
* grep: simplify pcresearch.c ifdefs (Paul Eggert, 2019-01-20, 1 file, -0/+7)
  This fixes a warning if PCRE is not used (Bug#34054).
  * configure.ac (USE_PCRE): New conditional.
  * src/Makefile.am (grep_SOURCES) [!USE_PCRE]: Omit pcresearch.c.
  * src/grep.c (matchers) [!HAVE_LIBPCRE]: Omit perl matcher.
    (setmatcher) [!HAVE_LIBPCRE]: If helpful, mention --disable-perl-regexp in diagnostic.
  * src/pcresearch.c: Simplify by assuming HAVE_LIBPCRE.
* maint: update all copyright dates via "make update-copyright" (Jim Meyering, 2019-01-01, 1 file, -1/+1)
  * gnulib: Also update submodule for its copyright updates.
* grep: fit --version authorship into 80 (Paul Eggert, 2018-12-20, 1 file, -5/+3)
  * src/grep.c (AUTHORS): Remove.
    (main): Output the authorship info ourselves instead of having version_etc do it. This is better for i18n anyway.
* grep: triple initial buffer size: 32k->96k (Jim Meyering, 2018-10-13, 1 file, -1/+3)
  Changing 32k to 96k gives a 3-23% performance improvement.
  All timings ran with this diff on top of commit v3.1-39-g7179b21:

    for n in 32 64 96 128; do
      echo n=$n
      perl -pi -e 's/(INITIAL_BUFSIZE =) \d+/$1 '$n/ src/grep.c &&
        make AM_CFLAGS=-O3 WERROR_CFLAGS= >& makerr-$n &&
      for needle in 1f2 1f298lkjskjhahjklkj34; do
        echo " needle=$needle"
        for i in $(seq 10); do
          env MALLOC_PERTURB_= time -qf%e src/grep $needle w2000
        done 2>&1 |sort -g | tee >(head -1|sed 's/^/ /') > .time-${n}KB-$needle
      done
    done

  Tested searches: search for a short literal pattern that is not present in a 9.3GB file containing 2000 copies of /usr/share/dict/words, created via this:

    ln -s /usr/share/dict/words k && cat $(yes k|head -2000) > w2000

  I ran this command:

    env MALLOC_PERTURB_= time src/grep 1f2 w2000

  old (32k) vs new elapsed time, best of 10 trials (gcc-9.0.0 20180831, -O3):

    32k   64k   96k(%incr)  128k  CPU
    1.25  1.18  1.16( 7.2)  1.20  i7-4770S@3.10GHz cache=8MB
    1.21  1.16  1.17( 3.3)  1.19  Xeon(R) E3-1505M v5 @ 2.80GHz cache=8MB
    2.36  2.29  2.29( 3.0)  2.36  Xeon(R) E5-2680 v4 @ 2.40GHz cache=32MB
    1.40  1.32  1.31( 6.4)  1.33  i5-6260U @ 1.80GHz cache=4MB
    1.31  1.26  1.24( 5.3)  1.23  AMD FX(tm)-4100 cache=2MB (with only 1000 copies)

  Searching for a longer string: 1f298lkjskjhahjklkj34

    32k   64k   96k(%incr)  128k  CPU
    2.03  1.76  1.61(20.7)  1.53  i7-4770S@3.10GHz cache=8MB
    1.95  1.70  1.56(20.0)  1.51  Xeon(R) E3-1505M v5 @ 2.80GHz
    3.27  2.98  2.84(13.1)  3.02  Xeon(R) E5-2680 v4 @ 2.40GHz
    2.48  2.12  1.91(23.0)  1.80  i5-6260U @ 1.80GHz cache=4MB
    1.72  1.54  1.46(15.1)  1.41  AMD FX(tm)-4100 cache=2MB

  * src/grep.c (INITIAL_BUFSIZE): Triple it: 32kB -> 96kB
* grep: fix usage 80-column glitch (Paul Eggert, 2018-09-28, 1 file, -1/+2)
  * src/grep.c (usage): Do not go over 80 columns in the source code, to pacify "make dist".
* doc: “pattern” vs “patterns” (Paul Eggert, 2018-05-11, 1 file, -16/+16)
  * doc/grep.in.1, doc/grep.texi, src/grep.c (usage): Be more careful about saying that an argument or option specifies one or more patterns, not just a single pattern. Problem reported by Kaz Kylheku (Bug#31400).
* maint: update URLs (Paul Eggert, 2018-04-21, 1 file, -1/+1)
  Mostly this is just changing http: to https:. In one or two places it removes no-longer-useful URLs.
* maint: update gnulib and copyright dates for 2018 (Jim Meyering, 2018-01-06, 1 file, -1/+1)
  * gnulib: Update to latest.
  * all files: Run "make update-copyright".
  * bootstrap: Update from gnulib.
* grep: diagnose stack overflow rather than segfaulting (Jim Meyering, 2017-12-16, 1 file, -0/+2)
  * bootstrap.conf (gnulib_modules): Add c-stack.
  * src/grep.c: Include "c-stack.h".
    (main): Call c_stack_action (NULL);
  * tests/stack-overflow: New file.
  * tests/Makefile.am (TESTS): Add name of new file.
  * NEWS (Improvements): Mention it.
  Interestingly, this bug does not afflict grep-2.5.4 or prior, so it appeared to have been introduced with grep-2.6. However, the origin is in glibc's regexp compiler, and I tracked it to stack-aware parsing that was removed from glibc's regexp in 2002. However, grep-2.5.4 was released in 2009. That version worked (and still works, now) because it included and (by default) used an old copy of glibc's regexp code.
  Jeremy Feusi reported the grep segfault in https://bugs.gnu.org/29666. I reported the glibc regexp bug in https://sourceware.org/bugzilla/show_bug.cgi?id=22620
* grep: omit a dup 'const' (Paul Eggert, 2017-11-03, 1 file, -1/+1)
  * src/grep.c (matchers): Omit duplicate 'const'.
* Pacify GCC 5.4 (Paul Eggert, 2017-08-21, 1 file, -1/+1)
  * src/grep.c (grepdesc): Rework to pacify GCC 5.4 warning about logical not.
* grep: -L exits with status 0 if a file is selected (Paul Eggert, 2017-08-17, 1 file, -3/+3)
  Problem reported by Anthony Sottile (Bug#28105).
  * NEWS, doc/grep.texi (Exit Status), src/grep.c (usage): Document this.
  * src/grep.c (grepdesc): Implement it.
  * tests/skip-read: Test it.