| Commit message | Author | Age | Files | Lines |
| |
This improves runtime checking for integer overflow when compiling
with gcc -fsanitize=undefined and the like. It also avoids
the need for some integer casts, which can be error-prone.
* bootstrap.conf (gnulib_modules): Add idx.
* src/dfasearch.c (struct dfa_comp, kwsmusts):
(possible_backrefs_in_pattern, regex_compile, GEAcompile)
(EGexecute):
* src/grep.c (struct patloc, patlocs_allocated, patlocs_used)
(n_patterns, update_patterns, pattern_file_name, poison_len)
(asan_poison, fwrite_errno, compile_fp_t, execute_fp_t)
(buf_has_encoding_errors, buf_has_nulls, file_must_have_nulls)
(bufalloc, pagesize, all_zeros, fillbuf, nlscan)
(print_line_head, print_line_middle, print_line_tail, grepbuf)
(grep, contains_encoding_error, fgrep_icase_available)
(fgrep_icase_charlen, fgrep_to_grep_pattern, try_fgrep_pattern)
(main):
* src/kwsearch.c (struct kwsearch, Fcompile, Fexecute):
* src/kwset.c (struct trie, struct kwset, kwsalloc, kwsincr)
(kwswords, treefails, memchr_kwset, acexec_trans, kwsexec)
(treedelta, kwsprep, bm_delta2_search, bmexec_trans, bmexec)
(acexec):
* src/kwset.h (struct kwsmatch):
* src/pcresearch.c (Pcompile, Pexecute):
* src/search.h (mb_clen):
* src/searchutils.c (kwsinit, mb_goback, wordchars_count)
(wordchars_size, wordchar_next, wordchar_prev):
Prefer idx_t to size_t or ptrdiff_t for nonnegative sizes,
and prefer ptrdiff_t to size_t for sizes plus error values.
* src/grep.c (uword_size): New constant, used for signed
size calculations.
(totalnl, add_count, totalcc, print_offset, print_line_head, grep):
Prefer intmax_t to uintmax_t for wide integer calculations.
(fgrep_icase_charlen): Prefer ptrdiff_t to int for size offsets.
* src/grep.h: Include idx.h.
* src/search.h (imbrlen): New function, like mbrlen except
with idx_t and ptrdiff_t.
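A minimal sketch of the idea behind preferring a signed index type, assuming idx_t is a typedef for ptrdiff_t as in Gnulib's idx.h. The helper name size_add is hypothetical and is not taken from grep. Unsigned arithmetic wraps silently, whereas signed overflow is undefined behavior that -fsanitize=undefined can trap, and a signed type lets an explicit check be written without casts:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors Gnulib's idx.h, where idx_t is a signed type
   that is used only for nonnegative sizes.  */
typedef ptrdiff_t idx_t;

/* Hypothetical helper: store A + B in *SUM and return true, or
   return false if the sum would not fit in idx_t.  */
static bool
size_add (idx_t a, idx_t b, idx_t *sum)
{
  assert (0 <= a && 0 <= b);    /* sizes are never negative */
  if (PTRDIFF_MAX - a < b)
    return false;               /* would overflow */
  *sum = a + b;
  return true;
}
```

A caller that needs "size or error" in one value can instead return ptrdiff_t and reserve negative values for errors, which is the second preference the entry describes.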
| |
* src/searchutils.c (mb_goback): When scanning backward through
UTF-8, check the length implied by the putative byte 1 before
bothering to invoke mb_clen. This length check also lets us use
mbrlen directly rather than calling mb_clen, which would
eventually defer to mbrlen anyway.
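The length check can be sketched as follows. This is an illustration of the technique only, not the code in mb_goback: the names utf8_lead_len and utf8_goback are hypothetical, and unlike the real function the sketch does not go on to validate the sequence with mbrlen.

```c
#include <stddef.h>

/* Length implied by the UTF-8 lead byte B, or 0 if B cannot start a
   character (a continuation byte or an invalid lead byte).  */
static int
utf8_lead_len (unsigned char b)
{
  if (b < 0x80) return 1;
  if (b < 0xC2) return 0;
  if (b < 0xE0) return 2;
  if (b < 0xF0) return 3;
  if (b < 0xF5) return 4;
  return 0;
}

/* CUR points at a byte inside the buffer starting at BUF.  Return the
   number of bytes to step back to reach the putative start of the
   character containing *CUR, or 0 if CUR is not inside one.  */
static size_t
utf8_goback (char const *buf, char const *cur)
{
  if (((unsigned char) *cur & 0xC0) != 0x80)
    return 0;                   /* not a continuation byte */
  for (size_t back = 1; back <= 3 && back <= (size_t) (cur - buf); back++)
    {
      unsigned char b = cur[-(ptrdiff_t) back];
      if ((b & 0xC0) != 0x80)
        /* Putative byte 1: it covers CUR only if its implied
           length reaches past CUR.  */
        return back < (size_t) utf8_lead_len (b) ? back : 0;
    }
  return 0;
}
```

Because the implied length rules out most candidates with a table-free comparison, the more expensive multibyte decoding is needed only when the lead byte could really cover the current position.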
| |
* src/searchutils.c (mb_goback): Set *MBCLEN only in
non-UTF-8 encodings, since that’s the only time it’s needed,
and this lets us see more clearly that the UTF-8 clen value
is not useful to the caller.
| |
* src/searchutils.c (wordchar_prev): Tweak performance by using a
value already in a local variable rather than consulting a table.
| |
* src/searchutils.c (mb_goback): Improve the comment to better
describe this confusing function. And remove an unnecessary
test of cur vs end.
| |
This helps move the code away from unsigned types.
* src/grep.c (buf_has_encoding_errors, contains_encoding_error):
* src/searchutils.c (mb_goback):
Compare to MB_LEN_MAX, not to (size_t) -2. This is a bit safer
anyway, as grep relies on MB_LEN_MAX limits elsewhere.
* src/search.h (mb_clen): Compare to -2 before converting to size_t.
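A small sketch of the comparison described above. The names is_mbrlen_error and char_len are hypothetical. mbrlen returns (size_t) -1 for an encoding error and (size_t) -2 for an incomplete character; both exceed MB_LEN_MAX while no valid length does, so a single comparison against MB_LEN_MAX catches both without naming the unsigned sentinel values:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* True if CLEN, a value returned by mbrlen, denotes an encoding
   error or an incomplete character.  */
static bool
is_mbrlen_error (size_t clen)
{
  return MB_LEN_MAX < clen;
}

/* Length of the character at S, with N > 0 bytes available.  An
   encoding error, an incomplete character, or a null byte counts as
   a single byte, so the caller always advances.  */
static size_t
char_len (char const *s, size_t n)
{
  assert (0 < n);
  mbstate_t mbs = {0};
  size_t clen = mbrlen (s, n, &mbs);
  return is_mbrlen_error (clen) || clen == 0 ? 1 : clen;
}
```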
| |
Fix more bugs recently uncovered by Norihiro Tanaka (Bug#43577).
* NEWS: Mention new bug report.
* src/grep.c (ok_fold): New static var.
(setup_ok_fold): New function.
(fgrep_icase_charlen): Reject single-byte characters
if they match some multibyte characters when ignoring case.
This part of the patch is partly derived from
<https://bugs.gnu.org/43577#14>, which means it is:
Co-authored-by: Norihiro Tanaka <noritnk@kcn.ne.jp>
(main): Call setup_ok_fold if ok_fold might be needed.
* src/searchutils.c (kwsinit): With the grep.c changes,
this code can now revert to classic 7th Edition Unix style;
aborting would be wrong.
* tests/turkish-eyes: Add tests for these bugs.
| |
Problem reported by Mayo Fark (Bug#43225).
* src/searchutils.c (wordchar_prev): In a UTF-8 locale, do not
assume that an encoding-error byte cannot be part of a word
constituent, as this assumption is incorrect for the last byte
of a multibyte word constituent.
* tests/word-delim-multibyte: Add a test for the bug.
| |
Run "make update-copyright" and then...
* gnulib: Update to latest with copyright year adjusted.
* tests/init.sh: Sync with gnulib to pick up copyright year.
* bootstrap: Likewise.
* doc/grep.in.1: Use "-" in copyright year ranges, not \en.
| |
* src/searchutils.c (mb_goback): New parameter. All callers changed.
* src/search.h (mb_goback): Update prototype.
* src/kwsearch.c (Fexecute): Use mb_goback's MBCLEN to detect a
word-boundary even more efficiently.
| |
* gnulib: Also update submodule for its copyright updates.
| |
* gnulib: Update to latest.
* all files: Run "make update-copyright".
* bootstrap: Update from gnulib.
| |
This code was not being used, and complicated maintenance.
We can bring it back from the repository if it turns out
to be useful later.
* src/kwset.c (struct kwset.reverse): Remove. All uses of
FOO->reverse replaced by (FOO->kwsexec == bmexec).
(kwsalloc): Remove 'reverse' arg, as callers outside this
module do not care about algorithm choice. All callers changed.
(kwsprep): When deciding whether to use Boyer-Moore, do not worry
about being called twice on the same kwset, as that is not allowed.
(cwexec): Remove; it was never called. All uses removed.
| |
Remove kwset.h comments that are obsolete and seemingly not
maintained anyway; people can look in kwset.c instead.
Update comments to reflect current behavior better.
Cite Faro & Lecroq 2013. Use GNU style for end-of-sentence.
| |
* gnulib: Update to latest.
* all files: Run "make update-copyright".
| |
* src/dfasearch.c (struct dfa_comp): New struct to hold
previously-global variables.
(dfawarn): Remove static variable.
(kwsmusts): Operate on a dfa_comp parameter instead of global
variables.
(GEAcompile): Allocate and return a dfa_comp struct instead of setting
global variables.
(EGexecute): Operate on a dfa_comp parameter instead of global
variables.
* src/searchutils.c (kwsinit): Replace a static array with a
dynamically-allocated one.
| |
* src/searchutils.c: Do not include <verify.h>.
(word_start): Remove, replacing with ...
(sbwordchar): New static var. All uses changed.
(wordchar_prev): Return size_t, not bool, as this generates
slightly better code. Go back faster if UTF-8.
| |
Problem reported by Norihiro Tanaka (Bug#22357#100).
This patch improves the performance on that benchmark on my
platform so that grep is now only about 2x slower than grep 2.26,
which means it is considerably faster than grep 2.25 and earlier.
* src/kwsearch.c (Fexecute):
Use wordchars_size to boost performance for this case.
* src/search.h, src/searchutils.c (wordchars_size): New function.
| |
This improves performance a bit.
* src/dfasearch.c, src/kwsearch.c (wordchar):
Remove; now in searchutils.c.
* src/grep.c (main): Call wordinit if -w.
* src/search.h: Adjust.
* src/searchutils.c: Include verify.h.
(word_start): New static var.
(wordchar): Move here from dfasearch.c and kwsearch.c.
(wordinit, wordchars_count, wordchar_next, wordchar_prev):
New functions.
(mb_prev_wc, mb_next_wc): Remove.
All callers changed to use the new functions instead.
| |
These days, the dangerous powers of C macros are not needed if
constants or functions will do just as well.
* src/grep.c (SEP_CHAR_SELECTED, SEP_CHAR_REJECTED, SEP_STR_GROUP)
(INITIAL_BUFSIZE):
* src/kwset.c (DEPTH_SIZE):
Now constants, not macros.
* src/kwset.c (link): Remove macro. Instead, rename local vars
from 'link' to 'cur'.
(malloc) [GREP]: Remove macro. All uses of malloc changed to xmalloc.
Omit double-inclusion of xalloc.h. Do not depend on 'GREP'.
(U): Now a function, not a macro.
* src/kwset.c, src/searchutils.c (NCHAR): Move this macro to ...
* src/system.h: ... here, and make it a constant.
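A sketch of the macro-to-constant and macro-to-function conversions described above. The "before" forms quoted in the comments are assumptions made for illustration, and count_bytes is a hypothetical helper; only the names NCHAR and to_uchar come from the log.

```c
#include <limits.h>
#include <stddef.h>

/* Instead of a macro such as "#define NCHAR (UCHAR_MAX + 1)" (assumed
   earlier form), an enum constant obeys scope rules and is visible
   to debuggers.  */
enum { NCHAR = UCHAR_MAX + 1 };

/* Instead of a cast macro, a function evaluates its argument exactly
   once and has its argument type checked.  */
static unsigned char
to_uchar (char ch)
{
  return ch;
}

/* Hypothetical use: count how often each byte value occurs in the
   N bytes at S.  COUNTS must have NCHAR elements.  */
static void
count_bytes (char const *s, size_t n, size_t counts[NCHAR])
{
  for (size_t i = 0; i < n; i++)
    counts[to_uchar (s[i])]++;
}
```

Indexing through to_uchar matters on platforms where plain char is signed, since a byte such as 0xFF would otherwise become a negative array index.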
| |
This follows up on the -iF performance improvement (Bug#23752).
* NEWS: Simplify description of -iF improvement.
* src/dfa.c: Do not include wctype.h.
(lonesome_lower, case_folded_counterparts): Move to localeinfo.c.
(CASE_FOLDED_BUFSIZE): Move to localeinfo.h.
* src/grep.c: Do not include wctype.h.
(lonesome_lower): Remove.
(fgrep_icase_available): Use case_folded_counterparts instead.
Do not call it for the same character twice.
Return false on wcrtomb failures (which should never happen).
(fgrep_to_grep_pattern, main): Simplify. Let fgrep_to_grep’s
caller fiddle with the global variables.
* src/localeinfo.c: Include <wctype.h>
(lonesome_lower, case_folded_counterparts):
Move here from src/dfa.c. Return int, not unsigned int.
Verify that CASE_FOLDED_BUFSIZE is big enough.
* src/localeinfo.h (CASE_FOLDED_BUFSIZE): Now 32, so that
we don’t expose lonesome_lower’s size.
* src/searchutils.c (kwsinit): Return new kwset instead of
storing it via a pointer. All callers changed. Simplify a bit.
| |
In a multibyte locale, if a pattern is composed of only single byte
characters and all their counterparts are also single byte characters
and the pattern does not have invalid sequences, grep -iF uses the
fgrep matcher, the same as in a single byte locale (Bug#23752).
* NEWS: Mention it.
* src/grep.c (lonesome_lower): New constant.
(fgrep_icase_available): New function.
(fgrep_to_grep_pattern): Simplify it.
(main): Use them.
* src/searchutils.c (kwsinit): New arg MB_TRANS; all uses changed.
Try fgrep matcher for case insensitive matching by grep -F in multibyte
locale.
| |
This follows up on Zev Weiss’s recent patches to make the DFA code
thread-safe (Bug#24249). It removes the remaining static
variables used by dfa.c. These variables are locale-dependent, so
they would cause problems in multithreaded code where different
threads are in different locales (e.g., via uselocale). I
abstracted most of the variables into a new localeinfo module.
* src/Makefile.am (grep_SOURCES): Add localeinfo.c.
(noinst_HEADERS): Add localeinfo.h.
* src/dfa.c: Include localeinfo.h.
(struct dfa): Remove multibyte member, as it is now part of
localeinfo. New members simple_locale and localeinfo.
Put locale-related members at the end.
(mbrtowc_cache): Remove; now part of dfa->localeinfo.
(charclass_index): Rename back from dfa_charclass_index,
since it's private.
(unibyte_word_constituent): New arg DFA; use its sbctowc member.
(using_utf8, dfa_using_utf8, init_mbrtowc_cache, check_utf8):
Remove; now done by localeinfo members. All uses changed.
(dfasyntax): New localeinfo arg. Move to end to avoid forward decls.
Initialize the entire DFA.
(unibyte_c, check_unibyte_c): Remove; now in simple_locale member.
(using_simple_locale): Now takes bool instead of DFA.
Do the locale check here, rather than in the caller,
as the result is now cached in dfa->simple_locale.
(dfaalloc): Just allocate the DFA. dfasyntax now initializes it.
* src/dfa.h: Add forward decl of struct localeinfo.
Adjust to new dfa.c API.
* src/dfasearch.c (localeinfo): New var, replacing former static
vars like mbrtowc_cache.
* src/localeinfo.c, src/localeinfo.h: New files.
* src/search.h: Include localeinfo.h.
(localeinfo): New decl.
* src/searchutils.c (mbclen_cache, build_mbclen_cache):
Remove. All uses changed to localeinfo.
* tests/Makefile.am (dfa_match_aux_LDADD): Add localeinfo.o.
* tests/dfa-match-aux.c: Include localeinfo.h.
(main): Adjust to changes in DFA API.
| |
Searching multiple fixed words, grep used the Commentz-Walter
algorithm, but this was O(m*n) and was very slow in the worst case.
For example:
- input: yes `printf %040d` | head -10000000
- word1: x0000000000000000000
- word2: x
This change instead uses the Aho-Corasick algorithm to search multiple
fixed words. It uses a high-quality trie-building function that is
already defined for Commentz-Walter in kwset.c.
I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change. Using best-of-5 trials for the benchmark:
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
The results were:
real 11.37 user 11.03 sys 0.24 [without the change]
real 1.49 user 1.31 sys 0.15 [with the change]
* src/kwset.c (struct kwset): Add a new member 'mode'.
(kwsalloc): Use it.
All callers are changed.
(kwsincr): Using Aho-Corasick algorithm, build tries in normal order.
(acexec_trans, acexec): New functions.
(kwsexec): Use it.
* src/kwset.h (kwsalloc): Update a prototype.
* NEWS (Improvements): Mention it.
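For readers unfamiliar with Aho-Corasick, here is a compact table-driven sketch. It is not the kwset.c implementation: the struct, the function names, and the fixed MAXSTATES limit are all assumptions chosen to keep the example short. It builds a trie, computes failure links breadth-first, and fills in missing transitions so that the search does one table lookup per input byte regardless of how many keywords there are.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAXSTATES = 256, ALPHABET = 256 };

struct ac
{
  int next[MAXSTATES][ALPHABET]; /* trie edges; a full DFA after ac_prep */
  int fail[MAXSTATES];           /* failure links */
  int accept[MAXSTATES];         /* nonzero if a keyword ends here */
  int nstates;
};

static void
ac_init (struct ac *ac)
{
  memset (ac, 0, sizeof *ac);
  ac->nstates = 1;               /* state 0 is the root */
}

/* Insert the nonempty string WORD into the trie.  */
static void
ac_add (struct ac *ac, char const *word)
{
  int s = 0;
  for (; *word; word++)
    {
      unsigned char c = *word;
      if (!ac->next[s][c])
        {
          assert (ac->nstates < MAXSTATES);
          ac->next[s][c] = ac->nstates++;
        }
      s = ac->next[s][c];
    }
  ac->accept[s] = 1;
}

/* Compute failure links in breadth-first order and replace each
   missing transition with the one taken from the failure state.  */
static void
ac_prep (struct ac *ac)
{
  int queue[MAXSTATES];
  int head = 0, tail = 0;
  for (int c = 0; c < ALPHABET; c++)
    if (ac->next[0][c])
      queue[tail++] = ac->next[0][c];   /* depth-1 states fail to root */
  while (head < tail)
    {
      int s = queue[head++];
      ac->accept[s] |= ac->accept[ac->fail[s]];
      for (int c = 0; c < ALPHABET; c++)
        {
          int t = ac->next[s][c];
          if (t)
            {
              ac->fail[t] = ac->next[ac->fail[s]][c];
              queue[tail++] = t;
            }
          else
            ac->next[s][c] = ac->next[ac->fail[s]][c];
        }
    }
}

/* Return the offset just past the first keyword occurrence in the
   N bytes at TEXT, or -1 if there is none.  */
static ptrdiff_t
ac_search (struct ac const *ac, char const *text, size_t n)
{
  int s = 0;
  for (size_t i = 0; i < n; i++)
    {
      s = ac->next[s][(unsigned char) text[i]];
      if (ac->accept[s])
        return (ptrdiff_t) (i + 1);
    }
  return -1;
}
```

The search never rescans input, so its cost is linear in the text length; this is the property that avoids the O(m*n) behavior described above for inputs like the "x000..." example.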
| |
* src/searchutils.c (mbtoupper): Remove now-unused function.
Also remove inclusion of <assert.h>, since this change removed
the final use of assert.
* src/search.h (mbtoupper): Remove declaration.
| |
Run "make update-copyright" and then...
* gnulib: Update to latest.
* tests/init.sh: Update from gnulib.
* bootstrap: Likewise.
| |
Run "make update-copyright". Also, ...
* grep.texi: Update manually, converting each "--" to "-".
| |
* src/grep.c (buffer_textbin, contains_encoding_error):
Use mb_clen for speed.
(buffer_textbin): Bypass mb_clen in unibyte locales.
(main): Always initialize the cache, since it's sometimes used in
unibyte locales now. Initialize it before contains_encoding_error
might be called.
* src/search.h (SEARCH_INLINE): New macro.
(mbclen_cache): Now extern decl.
(mb_clen): New inline function.
* src/searchutils.c (SEARCH_INLINE, SYSTEM_INLINE): Define.
(mbclen_cache): Now extern.
(build_mbclen_cache): Put 1 into the cache when mbrlen returns 0.
(mb_goback): Use mb_clen for speed, and rely on it returning nonzero.
* src/system.h (SYSTEM_INLINE): New macro.
(to_uchar): Use it.
| |
glibc has a bug where mbrlen and mbrtowc mishandle length-0 inputs.
Working around it in gnulib slows grep down, so disable the tests for it
and make sure grep works even if the bug is present.
* bootstrap.conf (avoided_gnulib_modules): Add mbrtowc-tests.
* configure.ac (gl_cv_func_mbrtowc_empty_input): Assume yes.
* src/searchutils.c (mb_next_wc): Don't invoke mbrtowc on empty input.
| |
This reverts commit v2.18-148-ga6ae68d.
Now that we have gnulib change v0.1-131-g2a045bc, "mbrlen, mbrtowc:
fix bug with empty input", this work-around is no longer needed.
| |
* src/searchutils.c (mb_next_wc): Work around glibc bug 16950; see:
https://sourceware.org/bugzilla/show_bug.cgi?id=16950
This bug was masked in the other GNU/Linux tests I made. It was
exposed on RHEL 6.5 x86-64, where the compiler (GCC Red Hat 4.4.7-4)
happened to use temporaries in a different way.
Also see recent changes to the Gnulib documentation in this area:
http://lists.gnu.org/archive/html/bug-gnulib/2014-05/msg00013.html
| |
* NEWS: Document this.
* src/dfasearch.c, src/kwsearch.c (WCHAR): Remove.
(wordchar): New static function.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): Use the new functions, so that the
code works correctly if a multibyte character adjacent to the
match has two or more bytes.
* src/search.h, src/searchutils.c (mb_prev_wc, mb_next_wc):
New functions.
* tests/word-delim-multibyte: Add a test for grep -w (which now
passes), and a test for \> (which still fails). The \< test also
still fails.
| |
* src/search.h, src/searchutils.c (mb_goback): Rename from
is_mb_middle. Omit last arg. Return number of bytes to go back,
not just a boolean. All uses changed.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): Adjust to API change.
* src/kwsearch.c (Fexecute): Eliminate common subexpression.
| |
This follows up to http://bugs.gnu.org/17376 and fixes a different
set of incompatibilities, namely between the regex matcher and the
other matchers, when the pattern contains encoding errors.
The GNU regex matcher is not consistent in this area: sometimes
an encoding error matches only itself, and sometimes it
matches part of a multibyte character. There is no documentation
for grep's behavior in this area and users don't seem to care,
and it's simpler to defer to the regex matcher for problematic
cases like these.
* NEWS: Document this.
* src/dfa.c (ctok): Remove. All uses removed.
(parse_bracket_exp, atom): Use BACKREF if a pattern contains
an encoding error, so that the matcher will revert to regex.
* src/dfasearch.c, src/grep.c, src/pcresearch.c, src/searchutils.c:
Don't include dfa.h, since search.h now does that for us.
* src/dfasearch.c (EGexecute):
* src/kwsearch.c (Fexecute): In a UTF-8 locale, there's no need to
worry about matching part of a multibyte character.
* src/grep.c (contains_encoding_error): New static function.
(main): Use it, so that grep -F is consistent with plain fgrep
when the pattern contains an encoding error.
* src/search.h: Include dfa.h, so that kwsearch.c can call using_utf8.
* src/searchutils.c (is_mb_middle): Remove UTF-8-specific code.
Callers now ensure that we are in a non-UTF-8 locale.
The code was clearly wrong, anyway.
* tests/fgrep-infloop, tests/invalid-multibyte-infloop:
* tests/prefix-of-multibyte:
Do not require that grep have a particular behavior for this test.
It's OK to match (exit status 0), not match (exit status 1), or
report an error (exit status 2), since the pattern contains an
encoding error and grep's behavior is not specified for such
patterns. Test only that KWset, DFA, and regex agree.
* tests/prefix-of-multibyte: Add tests for ABCABC and __..._ABCABC___.
| |
* src/dfa.c (dfambcache, parse_bracket_exp): Simplify.
(mbs_to_wchar, wctok, FETCH_WC, match_anychar, match_mb_charset)
(check_matching_with_multibyte_ops, transit_state_consume_1char)
(transit_state, dfaexec): Use wint_t, not wchar_t, so that
WEOF is treated correctly on platforms where WEOF is not a valid
wchar_t value.
(ctok, lex): Use int, not unsigned int, for characters,
so that EOF is treated more naturally.
(parse_bracket_exp): Use NOTCHAR to mark uninitialized char, since
FETCH_WC can now set the char to EOF.
(lex): Remove unnecessary test for EOF.
(parse_bracket_exp, atom): Swap then and else parts, to put
the small one first; this is more readable here.
* src/searchutils.c (is_mb_middle): Simplify.
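A sketch of why wint_t is the right type when WEOF can be returned. The function name first_wc is hypothetical. wint_t is guaranteed to hold every wchar_t value plus WEOF, so the "no character" result cannot collide with a real character even on platforms where WEOF is not a valid wchar_t value:

```c
#include <stddef.h>
#include <wchar.h>

/* Decode the first character in the N bytes at S.  Return it as a
   wint_t, or WEOF if the bytes are empty, invalid, or incomplete.  */
static wint_t
first_wc (char const *s, size_t n)
{
  if (n == 0)
    return WEOF;        /* avoid calling mbrtowc on empty input */
  wchar_t wc;
  mbstate_t mbs = {0};
  size_t len = mbrtowc (&wc, s, n, &mbs);
  if (len == (size_t) -1 || len == (size_t) -2)
    return WEOF;
  return (wint_t) wc;
}
```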
| |
See: http://bugs.gnu.org/17376
* src/dfa.c (dfambcache): Don't cache invalid sequences, because they can't be
represented by wide characters.
(dfambcache, mbs_to_wchar): Return WEOF for invalid sequences.
(ctok): New global variable.
(parse_bracket_exp, atom, match_anychar, match_mb_charset): Don't allow WEOF.
(lex): Set 'ctok'.
* src/kwsearch.c (Fexecute):
* src/searchutils.c (is_mb_middle): Don't check here.
* tests/invalid-multibyte-infloop: Adjust to fixed behavior.
* tests/prefix-of-multibyte: Add test cases for this bug.
| |
On some hosts, nl_langinfo returns strings other than "UTF-8" when
UTF-8 is used, and (worse) returns "UTF-8" even if the encoding is
single-byte. Work around these problems by trying a sample
character instead.
* src/dfa.c, src/pcresearch.c, src/searchutils.c:
Don't include <langinfo.h>.
* src/dfa.c (using_utf8): Test for UTF-8 by trying a character
rather than by invoking nl_langinfo (CODESET); this is more
portable in practice, and removes a dependency on
HAVE_LANGINFO_CODESET.
* src/pcresearch.c: Include dfa.h, for using_utf8.
(Pcompile): Use using_utf8 rather than nl_langinfo.
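The "try a sample character" approach can be sketched like this. The function name locale_is_utf8 and the choice of sample are assumptions for illustration; the sketch also assumes that wchar_t values are Unicode code points, as on most current platforms. The bytes C4 80 decode to U+0100 in UTF-8 and to something else, or to an error, in other encodings:

```c
#include <stdbool.h>
#include <wchar.h>

/* Return true if the current locale decodes the two-byte sequence
   C4 80 as the single character U+0100, as a UTF-8 locale does.
   Call this after setlocale has selected the locale to test.  */
static bool
locale_is_utf8 (void)
{
  wchar_t wc;
  mbstate_t mbs = {0};
  return mbrtowc (&wc, "\xc4\x80", 2, &mbs) == 2 && wc == 0x100;
}
```

Probing the decoder tests what the C library actually does, which is why it stays correct on hosts whose codeset name is unusual or misleading.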
| |
* src/mbsupport.h: Remove.
* src/Makefile.am (noinst_HEADERS): Remove mbsupport.h.
* src/dfa.c, src/grep.c, src/search.h: Don't include mbsupport.h.
* src/dfa.c: Include wchar.h and wctype.h unconditionally, as
this simplifies the use of dfa.c in grep, and it does no harm
in gawk.
(setlocale, static_assert): Remove gawk-specific hacks, as
gawk now does these itself.
(struct dfa, dfambcache, mbs_to_wchar)
(is_valid_unibyte_character, setbit_wc, using_utf8, FETCH_WC)
(addtok_wc, add_utf8_anychar, atom, state_index, epsclosure)
(dfaanalyze, dfastate, prepare_wc_buf, dfaoptimize, dfafree, dfamust):
* src/dfasearch.c (EGexecute):
* src/grep.c (main):
* src/searchutils.c (mbtoupper):
Assume MBS_SUPPORT.
| |
fgrep converted to lowercase, whereas the regex code converted
to uppercase. The resulting behaviors don't agree in offbeat
cases like Greek sigmas and Turkish Is. Fix this by changing
fgrep to agree with the regex code.
* src/kwsearch.c (Fcompile, Fexecute):
* src/searchutils.c (kwsinit, mbtoupper):
Convert to uppercase, not to lowercase, for compatibility with
plain 'grep'.
* src/search.h, src/searchutils.c (mbtoupper):
Rename from mbtolower, since it now converts to uppercase.
All uses changed.
* tests/case-fold-titlecase: Add tests for this.
| |
* src/dfa.c (using_utf8): Remove "static inline".
* src/dfa.h (using_utf8): Declare it.
* src/searchutils.c (is_mb_middle): Use using_utf8 rather than
rolling our own.
| |
After each kwsexec or dfaexec match, we must determine whether
the tentative match falls in the middle of a multi-byte character.
That is what our is_mb_middle function does, but it was expensive,
even when most input consisted of single-byte characters. The main
cost was for each call to mbrlen. This change constructs and uses
a cache of the lengths returned by mbrlen for unibyte values.
The largest speed-up (3x to 7x, CPU-dependent) is when most
lines contain a match, yet few are printed, e.g., when using
grep -v common-pattern ... to filter out all but a few lines.
* src/search.h (build_mbclen_cache): Declare it.
* src/main.c: Include "search.h".
[MBS_SUPPORT] (main): Call build_mbclen_cache in a multibyte locale.
* src/searchutils.c [HAVE_LANGINFO_CODESET]: Include <langinfo.h>.
(mbclen_cache): New global.
(build_mbclen_cache): New function.
(is_mb_middle) [HAVE_LANGINFO_CODESET]: Use it.
* NEWS (Improvements): Mention it.
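A sketch of such a cache, under stated assumptions: the names clen_cache, build_clen_cache and cached_clen are hypothetical, and storing 1 for the null byte is a choice made here so that callers always advance. Each entry records what mbrlen says about a one-byte string, so the common single-byte case is answered by a table lookup and mbrlen is called only for bytes that begin a longer character.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* For each byte value: 1 if it is a complete character by itself,
   -1 if it is an encoding error, -2 if it begins a longer character.  */
static signed char clen_cache[UCHAR_MAX + 1];

/* Fill the cache for the current locale.  */
static void
build_clen_cache (void)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    {
      unsigned char b = i;
      mbstate_t mbs = {0};
      size_t len = mbrlen ((char const *) &b, 1, &mbs);
      if (len == (size_t) -2)
        clen_cache[i] = -2;
      else if (len == (size_t) -1)
        clen_cache[i] = -1;
      else
        clen_cache[i] = 1;      /* also covers the null byte */
    }
}

/* Like mbrlen (S, N, MBS) for N > 0, but consult the cache first.  */
static size_t
cached_clen (char const *s, size_t n, mbstate_t *mbs)
{
  signed char len = clen_cache[(unsigned char) *s];
  return len == -2 ? mbrlen (s, n, mbs) : (size_t) len;
}
```

The cache must be rebuilt whenever the locale changes, since the answers depend on the locale's encoding.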
| |
Do that by running "make update-copyright".
| |
grep -i would segfault on systems using UTF-16-based wchar_t (Cygwin)
when converting an input string containing certain 4-byte UTF-8
sequences to lower case. The conversions to wchar_t and back to
a UTF-8 multibyte string did not take surrogate pairs into account.
* src/searchutils.c (mbtolower) [__CYGWIN__]: Detect and handle
surrogate pairs when converting.
* NEWS (Bug fixes): Mention it.
* tests/surrogate-pair: New test.
* tests/Makefile.am (TESTS): Add it.
Reported by: Jim Burwell
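For background, the surrogate arithmetic that a 16-bit wchar_t forces on callers can be sketched as below. This is the standard UTF-16 formula rather than the code added to mbtolower, and the function names are hypothetical. A code point above U+FFFF, which is what a 4-byte UTF-8 sequence encodes, occupies two 16-bit units that must be handled together:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool
is_high_surrogate (uint_least32_t u)
{
  return 0xD800 <= u && u <= 0xDBFF;
}

static bool
is_low_surrogate (uint_least32_t u)
{
  return 0xDC00 <= u && u <= 0xDFFF;
}

/* Combine a high and a low surrogate into the code point they
   jointly encode.  */
static uint_least32_t
surrogates_to_code_point (uint_least16_t hi, uint_least16_t lo)
{
  assert (is_high_surrogate (hi) && is_low_surrogate (lo));
  return 0x10000 + (((uint_least32_t) hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

Code that converts one wchar_t at a time sees each surrogate as a separate, unconvertible character, which is the situation the fix detects.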
| |
Run "make update-copyright".
| |
Commit v2.12-16-g7aa698d addressed the case in which the lower-case
representation of an input character occupies fewer bytes than the original.
However, even with commit v2.12-20-g074842d, grep -i would still
misbehave when converting a character to lower-case increased its
byte count. The map-manipulation code assumed that the case conversion
could only shrink the byte count. With the consideration that it may
also inflate it, the deltas recorded in the map array must be signed,
and we must account for the one-to-two-or-more mapping when the
original-to-lower-case conversion causes the byte count to increase.
* src/searchutils.c (mbtolower): When a lower-case character occupies
more than one byte, set its remaining map slots to zero. Change the
type of the map to be signed, and compute the change in character
byte count as new_length - old_length.
* src/search.h: Include <stdint.h>, for decl of intmax_t.
(mb_case_map_apply): Adjust for signed increments:
each map entry is now signed.
(mb_len_map_t): Define type. Thanks to Paul Eggert for noticing
in review that using a bare "char" as the base type would be wrong on
systems for which it is an unsigned type (as with gcc's -funsigned-char).
* src/kwsearch.c (Fcompile, Fexecute): Likewise.
* src/dfasearch.c (kwsincr_case, EGexecute): Likewise.
* tests/turkish-I-without-dot: New test. Thanks to Paolo Bonzini
for the tip that in the tr_TR.utf8 locale, mapping "I" to lower case
increases the character's byte count.
* tests/Makefile.am (TESTS): Add it.
* tests/init.cfg (require_tr_utf8_locale_): New function.
* NEWS (Bug fixes): Expand the existing entry.
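A sketch of why the map entries must be signed. The type len_delta, the function to_original_offset, and the sign convention (original length minus converted length, stored on the first byte of each converted character) are choices made for this illustration and may differ from what mbtolower records. A match offset found in the case-converted buffer is translated back to the original buffer by summing the deltas that precede it:

```c
#include <stddef.h>

/* One entry per byte of the case-converted buffer.  The first byte
   of each converted character holds (original length - converted
   length); the character's remaining bytes hold 0.  */
typedef signed char len_delta;

/* Translate OFF, a character-boundary offset in the converted
   buffer, into the corresponding offset in the original buffer.  */
static size_t
to_original_offset (len_delta const *map, size_t off)
{
  ptrdiff_t adjust = 0;
  for (size_t k = 0; k < off; k++)
    adjust += map[k];
  return (size_t) ((ptrdiff_t) off + adjust);
}
```

For the three-byte original "xIy", a tr_TR.utf8 conversion yields the four bytes x, C4, B1, y, so the map is {0, -1, 0, 0} and offset 3 in the converted buffer maps back to offset 2. An unsigned entry could not represent that -1.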
| |
* src/searchutils.c (mbtolower): Return the map back to the caller
if any input character's length differs from the corresponding output
character's, not merely if the total string length differs.
Problem reported by Johannes Mercer in
<http://lists.gnu.org/archive/html/bug-grep/2012-06/msg00029.html>.