diff options
author | Reuben Thomas <rrt@sc3d.org> | 2010-03-08 15:41:28 +0000 |
---|---|---|
committer | Paolo Bonzini <bonzini@gnu.org> | 2010-03-08 17:02:34 +0100 |
commit | 564ba765e84b27e2d8aa415def3841c879f668e4 (patch) | |
tree | 4adb0bd30a8c2e7765c4b170f4a18fa1436dfaf7 /TODO | |
parent | 0d38a8bba16bdc4aa8d657becd5526575a3f8ca8 (diff) | |
download | grep-564ba765e84b27e2d8aa415def3841c879f668e4.tar.gz |
bring TODO up-to-date
* TODO: merge with TODO section of http://www.gnu.org/software/grep/devel.html
and remove done items. Some small bits of tidying also.
Diffstat (limited to 'TODO')
-rw-r--r-- | TODO | 372 |
1 files changed, 306 insertions, 66 deletions
@@ -6,16 +6,12 @@ Get sane performance with UTF-8 locales. -1) rewrite the configure.in script, perhaps also Makefile.am -2) set up for gnulib-tool --import -3) improve the test infrastructure -4) check in the patches for the sync of dfa.c with GNU awk -5) other small patches which wait for a test case -6) process the Fedora/Red Hat patches -7) some _minimal_ cleanup of the grep(), grepdir(), recursion - (the "main loop") and fix --directories=read +Improve the test infrastructure. -## +Other small patches which wait for a test case. + +Some _minimal_ cleanup of the grep(), grepdir(), recursion (the "main +loop") and fix --directories=read Write better Texinfo documentation for grep. The manual page would be a good place to start, but Info documents are also supposed to contain a @@ -28,77 +24,321 @@ Improve the performance of the regex backtracking matcher. This matcher is agonizingly slow, and is responsible for grep sometimes being slower than Unix grep when backreferences are used. -Provide support for the Posix [= =] and [. .] constructs. This is -difficult because it requires locale-dependent details of the character -set and collating sequence, but Posix does not standardize any method -for accessing this information! +Some test in tests/spencer2.tests should have failed! +Need to filter out some bugs in dfa.[ch]/regex.[ch]. -## +Threads for grep? -Some test in tests/spencer2.tests should have failed !!! -Need to filter out some bugs in dfa.[ch]/regex.[ch]. +GNU grep does 32-bit arithmetic, it needs to move to 64-bit. -Threads for grep ? +Clean up, too many #ifdefs! -GNU grep does 32 bits arithmetic, it needs to move to 64. +Check some new algorithms for matching; talk to Karl Berry and Nelson. +Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142) +claim that his algorithm is faster than Boyer-More. Worth checking. -Clean up, too many #ifdef's !! +Lazy dynamic linking of libpcre, libz, and libbz2? -Check some new Algorithms for matching, talk to Karl Berry and Nelson. -Sunday's "Quick Search" Algorithm (CACM 33, 8 August 1990 pp. 132-142) -claim that his algo. is faster then Boyer-More ???? -Worth Checking. +Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one +binary. Is there a possibility of doing even better by automatically +checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F +0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? + +## Check <http://tony.abou-assaleh.net/greps.html>. -Take a look at: - -- cgrep (Context grep) seems like nice work; - -- sgrep (Struct grep); - -- agrep (Approximate grep), from glimpse; - -- nr-grep (Nondeterministic reverse grep); - -- ggrep (Grouse grep); - -- grep.py (Python grep); - -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep); - -- freegrep <http://www.vocito.com/downloads/software/grep/>; - -- ja-grep's mlb2 patch (Japanese grep); - -- pcregrep (from Perl-Compatible Regular Expressions library). -Can we merge ? - -Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one binary. - -POSIX Compliance see p10003.x +Take a look at these and consider opportunities +for merging or cloning: + + -- ja-grep's mlb2 patch (Japanese grep) + <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz> + -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep) + <http://www.ff.iij4u.or.jp/~nrt/lv/>; + -- pcregrep (from Perl-Compatible Regular Expressions library) + <http://www.pcre.org/>; + -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/> + seems like nice work; + -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>; + -- agrep (Approximate grep) <http://www.tgries.de/agrep/>, + from glimpse; + -- nr-grep (Nondeterministic reverse grep) + <http://www.dcc.uchile.cl/~gnavarro/software/>; + -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>; + -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>; + -- freegrep (a BSD-licensed grep for those who can't stand the GNU GPL) + <http://www.vocito.com/downloads/software/grep/>; + +## + +POSIX Compliance: see p10003.x + +In general, interesting things to check in POSIX/OpenGroup include: + +Provide support for the POSIX [= =] and [. .] constructs. This is +difficult because it requires locale-dependent details of the +character set and collating sequence, but POSIX does not standardize +any method for accessing this information! Moving away from GNU regex API for POSIX regex API. -Better and faster !! +## + +POSIX and --ignore-case + +For this issue, interesting things to check in POSIX include the +Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in +particular Section “Regular Expression General Requirements” and its +paragraph about caseless matching (note that this may not have been +fully thought through and that this text may be self-contradicting +[specifically: “of either data or patterns” versus all the rest]). + +In particular, consider the following with POSIX's approach to case +folding in mind. Assume a non-Turkic locale with a character +repertoire reduced to the following various forms of “LATIN LETTER I”: + +0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069; +0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049 +0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069; +0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049 + +First note the differing UTF-8 octet lengths of U+0049 (0x49) and +U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This +implies that whole UTF-8 strings cannot be case-converted in place, +using the same memory buffer, and that the needed octet-size of the +new buffer cannot merely be guessed (although there's a simple upper +bound of six times the size of the input, as the longest UTF-8 +encoding of any character is six bytes). + +We have + +lc(I) = i, uc(I) = I +lc(i) = i, uc(i) = I +lc(İ) = i, uc(İ) = İ +lc(ı) = ı, uc(ı) = I + +where lc() and uc() denote lower-case and upper-case conversions. + +There are several candidate --ignore-case logics (including the one +mandated by POSIX): + +Using the + +if (lc(input_wchar) == lc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y n +"i" | Y Y Y n +"İ" | Y Y Y n +"ı" | n n n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if (uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y n Y +"İ" | n n Y n +"ı" | Y Y n Y + +There is a lack of symmetry between CAPITAL and SMALL LETTERs with +this. + +Using the + +if ( lc(input_wchar) == lc(pattern_wchar) +|| uc(input_wchar) == uc(pattern_wchar)) + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y n +"ı" | Y Y n Y + +There is some elegance and symmetry with this. But there are +potentially two conversions to be made per input character. If the +pattern is pre-converted, two copies of it need to be kept and used in +a mutually coherent fashion. + +Using the + +if ( input_wchar == pattern_wchar +|| lc(input_wchar) == pattern_wchar +|| uc(input_wchar) == pattern_wchar) + +logic (as mandated by POSIX) leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n Y +"i" | Y Y Y n +"İ" | n n Y n +"ı" | n n n Y + +There is a different CAPITAL/SMALL symmetry with this. But there's +also a loss of pattern/input symmetry that's unique to it. Also there +are potentially two conversions to be made per input character. + +Using the + +if (lc(uc(input_wchar)) == lc(uc(pattern_wchar))) + + +logic leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y Y Y +"i" | Y Y Y Y +"İ" | Y Y Y Y +"ı" | Y Y Y Y + +This shows total symmetry and transitivity +(at least in this example analysis). +There are two conversions to be made per input character, +but support could be added for having +a single straight mapping performing +a composition of the two conversions. + +Any optimization in the implementation of each logic +must not change its basic semantic. ## -Check POSIX: - -- Volume "Base Definitions (XBD)", - Chapter "Regular Expressions" - and in particular - Section "Regular Expression General Requirements" - and its paragraph about caseless matching. - -Check the Unicode Standard: - -- Chapter 3 ("Conformance"), - Section 3.13 ("Default Case Operations") - and the toCasefold() case conversion operation; - -- Chapter 4 ("Character Properties"), - Section 4.2 ("Case -- Normative") - and the SpecialCasing.txt and CaseFolding.txt - files from the Unicode Database; - -- Chapter 5 ("Implementation Guidelines"), - Section 5.18 ("Case Mappings"), - Subsection "Caseless Matching". - -Check Unicode Technical Standard #18 ("Unicode Regular Expressions"). -Check Unicode Standard Annex #15 ("Unicode Normalization Forms"). +In general, interesting things to check in Unicode include: + +The <http://www.unicode.org/standard/standard.html> Unicode Standard. + +Unicode Technical Standard #18 (<http://www.unicode.org/reports/tr18/> +“Unicode Regular Expressions”). + +Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/> +“Unicode Normalization Forms”). ## -Before every release: - -- drop dfa.[ch] into a copy of gawk and run "make check"; - -- send pot file to the Translation Project to get fresh po files; - -- get up-to-date version of ABOUT-NLS; - -- update NEWS. +Unicode and --ignore-case + +For this issue, interesting things to check in Unicode include: + + -- The Unicode Standard, Chapter 3 + (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf> + “Conformance”), Section 3.13 (“Default Case Operations”) and the + toCasefold() case conversion operation. + + -- The Unicode Standard, Chapter 4 + (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf> + “Character Properties”), Section 4.2 (“Case—Normative”) and + the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt> + SpecialCasing.txt and + <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt> + CaseFolding.txt files from the + <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode + Character Database . + +The <http://www.unicode.org/standard/standard.html> Unicode Standard, +Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf> +Implementation Guidelines ”), Section 5.18 (“Case Mappings”), +Subsection “Caseless Matching”. + +The Unicode <http://www.unicode.org/charts/case/> case charts. + +Unicode uses the + +if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string)) + +logic for caseless matching. Let's consider the “LATIN LETTER I” +example mentioned above. In a non-Turkic locale, simple case folding +yields + +toCasefold_simple(U+0049) = U+0069 +toCasefold_simple(U+0069) = U+0069 +toCasefold_simple(U+0130) = U+0130 +toCasefold_simple(U+0131) = U+0131 + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y n n +"i" | Y Y n n +"İ" | n n Y n +"ı" | n n n Y + +This is different from anything so far! + +In a non-Turkic locale, full case folding yields + +toCasefold_full(U+0049) = U+0069 +toCasefold_full(U+0069) = U+0069 +toCasefold_full(U+0130) = <U+0069, U+0307> +toCasefold_full(U+0131) = U+0131 + +with + +0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;; + +which leads to the following matches: + + \in I i İ ı +pat\ ---------- +"I" | Y Y * n +"i" | Y Y * n +"İ" | n n Y n +"ı" | n n n Y + +This is just sad! + +Note that having toCasefold(U+0131), simple or full, map to itself +instead of U+0069 is in contradiction with the rules of Section 5.18 +of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same +thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is +U+0069. The justification for the weird toCasefold_full(U+0130) +mapping is unknown; it doesn't even make sense to add a dot (U+0307) +to a letter that already has one (U+0069). It would have been so +simple to put them all in the same equivalence class! + +Otherwise, also consider the following problem with Unicode's approach +on case folding in mind. Assume that we want to perform + +echo 'AßBC | grep -i 'Sb' + +which corresponds to + +input: U+0041 U+00DF U+0042 U+0043 U+000A +pattern: U+0053 U+0062 + +Following “CaseFolding-4.1.0.txt”, applying the toCasefold() +transformation to these yields + +input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A +pattern: U+0073 U+0062 + +so, according to this approach, the input should match the pattern. As +long as the original input line is to be reported to the user as a +whole, there is no problem (from the user's point-of-view; +implementation is complicated by this). + +However, consider both these GNU extensions: + +echo 'AßBC' | grep -i --only-matching 'Sb' +echo 'AßBC' | grep -i --color=always 'Sb' + +What is to be reported in these cases, since the match begins in the +*middle* of the original input character 'ß'? + +Note that Unicode's toCasefold() cannot be implemented in terms of +POSIX' towctrans() since that can only return a single wint_t value +per input wint_t value. |