| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
Problem reported by Nathan Weeks in: http://bugs.gnu.org/17856
* src/grep.c (Ecompile): Also specify RE_UNMATCHED_RIGHT_PAREN_ORD.
* doc/grep.texi (Fundamental Structure), NEWS: Document this.
* tests/ere.tests: Add a couple of tests for this.
* tests/spencer1.tests: Fix exit status.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With --max-count=N (-m N), grep is supposed to stop reading input
after it has found the Nth match. However, a recent context-related
change made grep always read to end of file.
* src/grep.c (prtext): Don't let a negative "out_after" value
make "pending" line count negative.
* tests/max-count-overread: New test, for this.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention it.
* THANKS: Add names of two recent bug reporters.
This bug was introduced by commit v2.18-139-g5122195.
Reported by Marc Aldorasi in http://bugs.gnu.org/17640.
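The intended behavior can be sketched with an endless producer. This is only an illustration, not the contents of tests/max-count-overread, and the helper name first_match is hypothetical:

```shell
# first_match WORD: feed grep an endless stream of WORD, stop at the first hit.
# 'yes' never finishes on its own, so this function returns only if grep
# stops reading after the first match instead of reading to end of input.
first_match() {
  yes "$1" 2>/dev/null | grep -m 1 -e "$1"
}

printf 'first match: %s\n' "$(first_match x)"
```

With a grep that over-reads, as described above, the pipeline would be expected to hang rather than print a line.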
|
|
|
|
|
|
|
|
|
| |
grep -E 'a(b$|c$)' would mistakenly match "aa".
* src/dfa.c (dfamust): When resetting 'is' in OR, also reset
'begline' and 'endline' of 'must'.
* NEWS (Bug fixes): Mention it.
This bug was introduced via commit v2.18-85-g2c94326.
Reported by Péter Radics in <http://bugs.gnu.org/17617>.
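A quick way to check the pattern from this report against a few inputs is sketched below. The helper name classify and the sample inputs other than "aa" are illustrative assumptions, not part of the commit:

```shell
# classify LINE: report whether LINE matches the ERE from the bug report.
classify() {
  if printf '%s\n' "$1" | grep -E -q -e 'a(b$|c$)'; then
    echo "$1: match"
  else
    echo "$1: no match"
  fi
}

# "aa" has no 'b' or 'c' at end of line, so it must not match;
# "abx" has 'b', but not at end of line.
for line in aa ab ac abx; do
  classify "$line"
done
```

On a grep that includes the fix, only "ab" and "ac" should be reported as matching.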
|
|
|
|
|
|
|
| |
Problem reported by Khaled Ziyaeen; see: http://bugs.gnu.org/17481
* NEWS, doc/grep.texi (File and Directory Selection): Document this.
* src/grep.c (main): Implement this.
* tests/include-exclude: Test this.
|
|
|
|
|
| |
* tests/count-newline: New test.
* tests/Makefile.am (TESTS): Add it.
|
|
|
|
|
| |
* tests/mb-non-UTF8-performance (timeout): Use an integer,
as 'timeout 1.234' doesn't work in EUC locales.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Problem reported by Stephane Chazelas in: http://bugs.gnu.org/16867
* NEWS: Document the fix.
* src/dfa.c (dfaoptimize): Remove any superset if changing from
UTF-8 to unibyte, and if the pattern has no backreferences.
(dfassbuild): In multibyte locales, treat \< \> \b \B as
backreferences in the DFA, since the DFA relies on unibyte
tests to check them.
(dfacomp): Optimize after building the superset, so that
dfassbuild can depend on d->multibyte. A downside is that
dfaoptimize must remove supersets that are likely slower than the
DFA after optimization, but that's been done in the
above-described change.
* tests/Makefile.am (XFAIL_TESTS): Remove word-delim-multibyte,
since the test works now.
|
|
|
|
|
| |
* tests/context-0: New test.
* tests/Makefile.am (TESTS): Add it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* NEWS: Document this.
* src/dfasearch.c, src/kwsearch.c (WCHAR): Remove.
(wordchar): New static function.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): Use the new functions, so that the
code works correctly if a multibyte character adjacent to the
match has two or more bytes.
* src/search.h, src/searchutils.c (mb_prev_wc, mb_next_wc):
New functions.
* tests/word-delim-multibyte: Add a test for grep -w (which now
passes), and a test for \> (which still fails). The \< test also
still fails.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This follows up to http://bugs.gnu.org/17376 and fixes a different
set of incompatibilities, namely between the regex matcher and the
other matchers, when the pattern contains encoding errors.
The GNU regex matcher is not consistent in this area: sometimes
an encoding error matches only itself, and sometimes it
matches part of a multibyte character. There is no documentation
for grep's behavior in this area and users don't seem to care,
and it's simpler to defer to the regex matcher for problematic
cases like these.
* NEWS: Document this.
* src/dfa.c (ctok): Remove. All uses removed.
(parse_bracket_exp, atom): Use BACKREF if a pattern contains
an encoding error, so that the matcher will revert to regex.
* src/dfasearch.c, src/grep.c, src/pcresearch.c, src/searchutils.c:
Don't include dfa.h, since search.h now does that for us.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): In a UTF-8 locale, there's no need to
worry about matching part of a multibyte character.
* src/grep.c (contains_encoding_error): New static function.
(main): Use it, so that grep -F is consistent with plain fgrep
when the pattern contains an encoding error.
* src/search.h: Include dfa.h, so that kwsearch.c can call using_utf8.
* src/searchutils.c (is_mb_middle): Remove UTF-8-specific code.
Callers now ensure that we are in a non-UTF-8 locale.
The code was clearly wrong, anyway.
* tests/fgrep-infloop, tests/invalid-multibyte-infloop:
* tests/prefix-of-multibyte:
Do not require that grep have a particular behavior for this test.
It's OK to match (exit status 0), not match (exit status 1), or
report an error (exit status 2), since the pattern contains an
encoding error and grep's behavior is not specified for such
patterns. Test only that KWset, DFA, and regex agree.
* tests/prefix-of-multibyte: Add tests for ABCABC and __..._ABCABC___.
|
|
|
|
| |
* tests/prefix-of-multibyte: Also test the regex version.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See: http://bugs.gnu.org/17376
* src/dfa.c (dfambcache): Don't cache invalid sequences, because they can't be
represented by wide characters.
(dfambcache, mbs_to_wchar): Return WEOF for invalid sequences.
(ctok): New global variable.
(parse_bracket_exp, atom, match_anychar, match_mb_charset): Don't allow WEOF.
(lex): Set 'ctok'.
* src/kwsearch.c (Fexecute):
* src/searchutils.c (is_mb_middle): Don't check here.
* tests/invalid-multibyte-infloop: Adjust to fixed behavior.
* tests/prefix-of-multibyte: Add test cases for this bug.
|
|
|
|
|
|
|
| |
Problem reported by Stephane Chazelas in: http://bugs.gnu.org/16871
* doc/grep.texi (Usage): Remove incorrect example with -P.
* tests/pcre: Improve test so that it actually tests whether \s
matches a newline.
|
|
|
|
|
|
| |
* tests/pcre-infloop: Spell locale name, en_US.UTF-8, consistently,
converting this one use from "en_US.utf8", which would provoke a
test failure on OS X.
|
|
|
|
|
|
|
|
|
|
| |
See <http://bugs.gnu.org/17245> and <http://bugs.exim.org/1468>.
* NEWS: Document this.
* src/pcresearch.c (Pexecute): Do not use PCRE_NO_UTF8_CHECK,
as this leads to undefined behavior when the input is not UTF-8.
* tests/pcre-infloop, tests/pcre-invalid-utf8-input:
Exit status is now 2, not 1, when grep -P is given invalid UTF-8
data in a UTF-8 locale.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bug was introduced in the early-2012 patches that fixed some
context-handling bugs. Bisecting found commit
d8951d3f4e1bbd564809aa8e713d8333bda2f802 (2012-02-05 18:00:43 +0100),
but it appears the underlying problem was introduced in commit
8b47c4cf6556933f59226c234b0fe984f6c77dc7 (2012-01-03 11:22:09 +0100).
* NEWS: Mention bug fix.
* src/dfa.c (char_context): Consider NUL to be a newline only if -z.
* tests/Makefile.am (TESTS): Add null-byte.
* tests/null-byte: New file.
|
|
|
|
|
| |
* tests/pcre-infloop: New test.
* tests/Makefile.am (TESTS): Add it.
|
|
|
|
|
|
|
|
| |
* NEWS: Document it.
* src/dfasearch.c (GEAcompile):
* src/kwsearch.c (Fcompile):
Use C99-style decls to simplify. Avoid duplicate code.
* tests/empty-line: Add some more tests like this.
|
|
|
|
|
| |
* src/dfasearch.c (GEAcompile): Fix it.
* src/kwsearch.c (Fcompile): Fix it.
|
|
|
|
|
|
| |
* tests/euc-mb: Reverse order of arguments to compare.
Be consistent in ordering compare arguments: expected followed
by actual.
|
|
|
|
|
|
|
|
|
|
|
| |
When kwsexec gives us the offset of a potential match, we compute
line begin/end and then run the DFA matcher to see if there really
is a match on that line. When the beginning of the line, BEG, is
not on a multibyte character boundary, advance BEG until it is on such
a boundary, before running the DFA search.
* src/dfasearch.c (EGexecute): As above. Add a comment.
* tests/euc-mb: Add a test case that exercises this code.
This addresses http://debbugs.gnu.org/17095.
|
|
|
|
|
| |
* tests/mb-non-UTF8-performance: Avoid false-positive failure
when run on certain AMD processors.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Andreas Schwab reported in http://debbugs.gnu.org/16941
that this test would time out and fail on m68k-suse-linux.
Rather than testing absolute duration with a limit tuned
to today's hardware, compare performance of grep with LC_ALL=C
against that same command using LC_ALL=ja_JP.eucJP.
* tests/init.cfg (require_hi_res_time_): New function.
* tests/mb-non-UTF8-performance: Rewrite to use it:
record absolute duration D of the first (normally much faster)
command, and set a timeout of 8*D for the command running in
an affected locale.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
fgrep converted to lowercase, whereas the regex code converted
to uppercase. The resulting behaviors don't agree in offbeat
cases like Greek sigmas and Turkish Is. Fix this by changing
fgrep to agree with the regex code.
* src/kwsearch.c (Fcompile, Fexecute):
* src/searchutils.c (kwsinit, mbtoupper):
Convert to uppercase, not to lowercase, for compatibility with
plain 'grep'.
* src/search.h, src/searchutils.c (mbtoupper):
Rename from mbtolower, since it now converts to uppercase.
All uses changed.
* tests/case-fold-titlecase: Add tests for this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The DFA code and the regex code didn't use the same semantics for
case-folding. The regex code says that the data char d matches
the pattern char p if uc (d) == uc (p). POSIX is unclear in this
area; the simplest fix for now is to change the DFA code to agree
with the regex code. See <http://bugs.gnu.org/16919>.
* src/dfa.c (static_assert): New macro, if not already defined.
(setbit_case_fold_c): Assume MB_CUR_MAX is 1 and that case_fold
is nonzero; all callers changed.
(setbit_case_fold_c, parse_bracket_exp, lex, atom):
Case-fold like the regex code does.
(lonesome_lower): New constant.
(case_folded_counterparts): New function.
(parse_bracket_exp): Prefer plain setbit when case-folding is
not needed.
* src/dfa.h (CASE_FOLDED_BUFSIZE): New constant.
(case_folded_counterparts): New function decl.
* src/main.c (trivial_case_ignore): Case-fold like the regex code does.
(main): Try to improve comment re trivial_case_ignore.
* tests/case-fold-titlecase: Add lots more test cases.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* NEWS: Document this.
* src/dfa.c (setbit_wc): Simplify.
(setbit_c): Remove; no longer used.
(setbit_case_fold_c, parse_bracket_exp, atom):
Don't mishandle titlecase. For 'atom', this removes the need for
the refactoring of Bug#16729.
(lex): Use the slower approach only for letters that have a
differing case.
* tests/case-fold-titlecase: New file.
* tests/Makefile.am (TESTS): Add it.
|
|
|
|
|
|
|
|
|
|
|
| |
* NEWS: Document this.
* src/dfa.c (using_simple_locale): New function.
(parse_bracket_exp): Handle bracket expressions like [a-[.z.]]
correctly. Don't assume that dfaexec handles expressions like
[^a-z] correctly, as they can match multiple characters in some
locales.
* tests/posix-bracket: New file.
* tests/Makefile.am (TESTS): Add it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For the -w option, with -P, we used to look for the pattern surrounded by
word boundaries. That's different from what grep -w does and what the
documentation describes. Now align with grep -w and the documentation by
using PCRE look-behind and look-ahead operators to match the pattern if
it is not surrounded by word constituents.
* src/pcresearch.c (Pcompile): Use (?<!\w)(?:...)(?!\w) rather than
\b(?:...)\b.
* NEWS (Bug fixes): Mention it.
* tests/pcre-w: New file.
* tests/Makefile.am (TESTS): Add it.
This complements the fix for http://debbugs.gnu.org/16865
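The -w rule that -P is being aligned with can be sketched with a pattern that starts with a non-word character, which is where \b and the look-around form differ. The helper name word_match and the '@foo' sample are illustrative assumptions. The -P lines run only if this grep was built with PCRE support:

```shell
# word_match LINE PATTERN: does PATTERN match LINE under grep -w?
word_match() {
  if printf '%s\n' "$1" | grep -q -w -e "$2"; then
    echo match
  else
    echo no-match
  fi
}

# '@foo' at the start of a line is not preceded by a word constituent.
word_match '@foo' '@foo'
# In 'a@foo' the only candidate is preceded by the word constituent 'a'.
word_match 'a@foo' '@foo'

# With \b(?:...)\b these two cases would come out the other way around,
# because \b needs a word character on one side of the boundary.
if printf 'x\n' | grep -P x >/dev/null 2>&1; then
  printf '%s\n' '@foo' | grep -P -w -e '@foo'
fi
```

Under the documented -w semantics the first call should report a match and the second should not.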
|
|
|
|
|
|
|
|
|
|
|
|
| |
To implement -w and -x, we bracket the search term with parentheses.
However, that set of parentheses had the default semantics of
"capturing", i.e., creating a backreferenceable matched quantity.
Instead, use (?:...), to create a non-capturing group.
* src/pcresearch.c (Pcompile): Use (?:...) rather than (...).
* NEWS (Bug fixes): Mention it.
* tests/pcre-wx-backref: New file.
* tests/Makefile.am (TESTS): Add it.
This addresses http://debbugs.gnu.org/16865
|
|
|
|
|
|
|
|
|
|
|
| |
Test for the just-fixed performance regression.
With a 100-200x differential, it is reasonable to expect that
a very slow system will be able to complete the designated
task in a few seconds, while with the bug, even a very fast
system would exceed the timeout.
* tests/mb-non-UTF8-performance: New file.
* tests/Makefile.am (TESTS): Add it.
* tests/init.cfg (require_JP_EUC_locale_): New function.
|
|
|
|
|
|
|
|
| |
This is a bug in the current dfa.c, which was reintroduced by the
recent reversion from RRI.
* tests/unibyte-negated-circumflex: New file.
* tests/Makefile.am (TESTS): Add it.
* tests/init.cfg (require_unibyte_locale): New function.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This option was disabled in March of 2010, and began to elicit a
warning in January of 2012. Its time has come.
* doc/grep.in.1: Remove mention.
* doc/grep.texi: Likewise.
* src/main.c (GROUP_SEPARATOR_OPTION, usage, MMAP_OPTION)
(long_options, main): Remove all traces.
* tests/Makefile.am (check_PROGRAMS): Remove mention of ignore-mmap.
* tests/ignore-mmap: Remove file.
* NEWS (Maintenance): Mention it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that DFA searching works with multi-byte locales, the only remaining
reason to case-convert the searched input is the kwset optimization.
But multi-byte case-conversion is so expensive that it's not
worthwhile even to attempt that optimization.
* src/dfasearch.c (kwsmusts): Skip this function in ignore-case mode
when the locale is multi-byte.
(EGexecute): Now that this code need not handle multi-byte case-ignoring
matches, remove the expensive copy/case-conversion code.
With no case-converted buffer, there is no longer any need to call
mb_case_map_apply, so remove it and associated code.
(kwsincr_case): Remove function. Now, every use of this function
is equivalent to a use of kwsincr. Replace all uses.
* tests/turkish-eyes: Test all of -E, -F and -G.
|
|
|
|
| |
* tests/turkish-eyes: Remove unnecessary uses of printf.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These days, nearly everyone uses a multibyte locale, and grep is often
used with the --ignore-case (-i) option, but that option imposes a very
high cost in order to handle some unusual cases in just a few multibyte
locales. This change gets most of the performance of using LC_ALL=C
without eliminating the ability to search for multibyte strings.
With the following example, I see an 11x speed-up with a 2.3GHz i7:
Generate a 10M-line file, with each line consisting of 40 'j's:
    yes jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj | head -10000000 > k
Time searching it for the simple/nonexistent string "foobar",
first with this patch (best-of-5 trials):
    LC_ALL=en_US.UTF-8 env time src/grep -i foobar k
        1.10 real 1.03 user 0.07 sys
Back out that commit (temporarily), recompile, and rerun the experiment:
    git log -1 -p|patch -R -p1; make
    LC_ALL=en_US.UTF-8 env time src/grep -i foobar k
        12.50 real 12.41 user 0.08 sys
The trick is to realize that for some search strings, it is easy
to convert to an equivalent one that is handled much more efficiently.
E.g., convert this command:
    grep -i foobar k
to this:
    grep '[fF][oO][oO][bB][aA][rR]' k
That allows the matcher to search in buffer mode, rather than having to
extract/case-convert/search each line separately. Currently, we perform
this conversion only when search strings contain neither '\' nor '['.
See the comments for more detail.
* src/main.c (trivial_case_ignore): New function.
(main): When possible, transform the regexp so we can drop the -i.
* tests/turkish-eyes: New file.
* tests/Makefile.am (TESTS): Use it.
* NEWS (Improvements): Mention it.
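The transformation described above can be approximated in shell. This is a minimal ASCII-only sketch under stated assumptions, not the code of trivial_case_ignore: the function name fold_case is hypothetical, and multibyte letters are not handled.

```shell
# fold_case PATTERN: rewrite each ASCII letter c as the bracket pair [cC],
# so the result can be searched without -i.  Mirroring the restriction
# described above, give up if PATTERN contains '\' or '['.
fold_case() {
  case $1 in
    *\\* | *\[*) return 1 ;;
  esac
  printf '%s\n' "$1" | LC_ALL=C awk '{
    out = ""
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c ~ /[A-Za-z]/)
        out = out "[" tolower(c) toupper(c) "]"
      else
        out = out c
    }
    print out
  }'
}

fold_case foobar
printf 'FooBar\n' | grep -e "$(fold_case foobar)"
```

The first call prints the same bracketed pattern as the example in the commit message; the second uses it to find a mixed-case line without -i.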
|
|
|
|
|
|
|
|
| |
Problem reported by Jim Meyering.
* tests/bre, tests/ere, tests/spencer1-locale:
Prefer re_shell, not re_shell_.
* tests/init.sh (re_shell): New var, which is exported instead of
re_shell_.
|
|
|
|
|
|
|
| |
Problem reported by Dagobert Michelsen in <http://bugs.gnu.org/16380>.
* tests/bre, tests/ere, tests/spencer1-locale:
Prefer re_shell_ to SHELL, if re_shell_ is set.
* tests/init.sh (re_shell_): Export if it's used.
|
|
|
|
| |
Do that by running "make update-copyright".
|
|
|
|
|
|
|
|
|
| |
In order to obtain the behavior we want, i.e., to disable
error-on-invalid-UTF-in-input, apply this PCRE option in
pcre_exec, not when compiling.
* src/pcresearch.c (Pexecute): Use PCRE_NO_UTF8_CHECK here, ...
(Pcompile): ...rather than here.
* tests/pcre-invalid-utf8-input: Adjust test case to test for this.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do not exit-2 for invalid UTF-8 characters. Just prior to this
change, this command would match no lines and fail like this:
    $ printf 'j\x82\nj\n'|LC_ALL=en_US.UTF-8 grep -P j|cat -A; echo $?
    grep: invalid UTF-8 byte sequence in input
    2
After this change, the same command matches both lines, and succeeds:
    jM-^B$
    j$
    0
* src/pcresearch.c (Pcompile): Use PCRE_NO_UTF8_CHECK, too, and
add a comment.
* tests/pcre-utf8: Add a test and a comment.
This change did not work with Debian unstable pcre-8.31-2
or with some 8.33 and 8.34-based versions, but does work with
Fedora 20's 8.33 and with a built-from-latest source library.
Based on a patch by Santiago Ruano Rincón.
See http://bugs.gnu.org/15758/
|
|
|
|
|
| |
* tests/long-line-vs-2GiB-read: Don't declare the test "failed"
when running out of memory. In that case, skip it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When trying to exercise some long-line-handling code, I ran these
commands:
    $ dd bs=1 seek=2G of=big < /dev/null; grep -l x big; echo $?
    grep: big: Invalid argument
    2
grep should not have issued that diagnostic, and it should
have exited with status 1, not 2. What happened?
grep read the 2GiB of NULs, doubled its buffer size,
copied the 2GiB into the new 4GiB buffer, and proceeded
to call "read" with a byte-count argument of 2^32.
On at least Darwin 12.5.0, that makes read fail with EINVAL.
The solution is to use gnulib's safe_read wrapper.
* src/main.c: Include "safe-read.h"
(fillbuf): Use safe_read, rather than bare read. The latter
cannot handle a read size of 2^32 on some systems.
* bootstrap.conf (gnulib_modules): Add safe-read.
* tests/long-line-vs-2GiB-read: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention it.
|
|
|
|
|
|
|
|
| |
* tests/multibyte-white-space (utf8_space_characters): The generation
of test inputs relied on GNU sed's interpretation of \<, but that is
not portable, and caused spurious test failures. Adjust the sed regexp
to work on all versions.
Reported by Karl Dubost in http://bugs.gnu.org/15953.
|
|
|
|
|
|
|
|
|
|
|
|
| |
* src/pcresearch.c (Pexecute): Don't abort upon unexpected
PCRE-specific error code. Explicitly handle PCRE_ERROR_BADUTF8,
and change the default to print a diagnostic including the unhandled
integer PCRE error code and exit with status 2.
* tests/pcre-invalid-utf8-input: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention it.
* THANKS: Update.
Reported by Dave Reisner in http://bugs.gnu.org/15758.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit v2.14-40-g01ec90b made \s and \S work with multi-byte
characters, but it also caused any use like \s*, \s+, \s?, \s{3}
to malfunction in a multi-byte locale.
* src/dfa.c (lex): Also reset laststart.
* tests/backslash-s-and-repetition-operators: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention it.
* THANKS: Update.
Reported by Mirraz Mirraz in http://bugs.gnu.org/15773.
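The four repetition forms named above can be exercised with a small sample. This is an illustrative sketch, not the contents of the new test file. The helper name count_ws_matches is hypothetical, \s is a GNU extension, and the en_US.UTF-8 locale is assumed to be installed (if it is not, grep typically falls back to the C locale, where the bug did not apply):

```shell
# count_ws_matches PATTERN: count lines of a three-line sample matched by
# PATTERN.  The sample has one space, three spaces, and no space
# between 'a' and 'b'.
count_ws_matches() {
  printf 'a b\na   b\nab\n' | LC_ALL=en_US.UTF-8 grep -c -E -e "$1"
}

count_ws_matches 'a\s+b'
count_ws_matches 'a\s*b'
count_ws_matches 'a\s?b'
count_ws_matches 'a\s{3}b'
```

With the fix, the four calls should print 2, 3, 2 and 1 respectively.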
|
|
|
|
|
|
|
| |
* tests/pcre-utf8: Convert the hex \xHH literals for the euro symbol
to octal \OOO.
* tests/turkish-I: Likewise for "I with dot".
* tests/turkish-I-without-dot: Likewise for another Turkish I: U+0131.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use octal escapes, not hex, in printf(1) format strings,
and in one case, use $AWK's printf so we can continue
to use the table of hex values.
* tests/char-class-multibyte: Use printf octal escapes, not hex,
for portability to shells like dash and Solaris 10's /bin/sh.
* tests/backslash-s-vs-invalid-multitype: Likewise.
* tests/surrogate-pair: Likewise.
* tests/unibyte-bracket-expr: Count in decimal and convert to octal.
* tests/multibyte-white-space (hex_printf): New function.
Use it in place of printf so we can retain the table of hex digits
without hitting the limitation of some Bourne shells.
Reported by Paul Eggert in http://bugs.gnu.org/15690#11
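The portability point can be sketched as follows. Octal escapes in a printf format are specified by POSIX, while \xHH is not. The helper hex_byte is a hypothetical stand-in that converts a hex table entry to an octal escape; it is not the hex_printf function added to the test, which per the message above uses $AWK's printf.

```shell
# The euro sign U+20AC is the UTF-8 byte sequence E2 82 AC,
# which is 342 202 254 in octal.
printf '\342\202\254\n'

# hex_byte HH: emit the single byte whose hex value is HH, going through
# an octal escape so the format string stays portable.
hex_byte() {
  printf "\\$(printf '%03o' "0x$1")"
}

# Emit the same three bytes from hex values and show them with od.
{ hex_byte e2; hex_byte 82; hex_byte ac; } | od -An -tx1
```

The od output should list the bytes e2 82 ac, the same sequence the octal literal produces.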
|
|
|
|
|
|
|
| |
* tests/multibyte-white-space (utf8_space_characters): Add more
single-byte whitespace characters. Align RHS hex values and
make the sed substitution less rigid, to accommodate.
Also, ensure that grep '\S' exits with status 1.
|
|
|
|
|
|
| |
* tests/spencer1.tests: Add a non-range bracket expression representing the
same regexp, to cover the alternate code path, the one that does not require
a regcomp/exec call to interpret the regexp.
|
|
|
|
|
|
|
| |
* tests/backslash-S-vs-invalid-multitype: New file.
Prompted by the bug report from Roman at
http://savannah.gnu.org/bugs/?40009
* tests/Makefile.am (TESTS): Add it.
|