diff options
author | Paul Eggert <eggert@cs.ucla.edu> | 2016-06-02 10:00:48 -0700 |
---|---|---|
committer | Paul Eggert <eggert@cs.ucla.edu> | 2016-06-02 15:26:21 -0700 |
commit | a3d42de483a29c1da381f5c3dad87cbbc86a5c70 (patch) | |
tree | 211301764c05bac498d6153d7975e85d4959c2be /thanks-gen | |
parent | c7526623f458084cbe947f0e1b43f328481b2ca3 (diff) | |
download | grep-a3d42de483a29c1da381f5c3dad87cbbc86a5c70.tar.gz |
grep: use Aho-Corasick algorithm to search multiple fixed words
Searching multiple fixed words, grep used the Commentz-Walter
algorithm, but this was O(m*n) and was very slow in the worst case.
For example:
- input: yes `printf %040d` | head -10000000
- word1: x0000000000000000000
- word2: x
This change instead uses the Aho-Corasick algorithm to search multiple
fixed words. It uses a high-quality trie-building function that is
already defined for Commentz-Walter in kwset.c.
I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change. Using best-of-5 trials for the benchmark:
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words >/dev/null
The results were:
real 11.37 user 11.03 sys 0.24 [without the change]
real 1.49 user 1.31 sys 0.15 [with the change]
* src/kwset.c (struct kwset): Add a new member 'mode'.
(kwsalloc): Use it.
All callers are changed.
(kwsincr): Using Aho-Corasick algorithm, build tries in normal order.
(acexec_trans, acexec): Add a new function.
(kwsexec): Use it.
* src/kwset.h (kwsalloc): Update a prototype.
* NEWS (Improvements): Mention it.
Diffstat (limited to 'thanks-gen')
0 files changed, 0 insertions, 0 deletions