author | Jim Meyering <meyering@fb.com> | 2023-03-18 08:28:36 -0700 |
---|---|---|
committer | Jim Meyering <meyering@meta.com> | 2023-03-18 17:08:09 -0700 |
commit | c83ffc197ec483c6f44f907346f34127ec044ef0 (patch) | |
tree | d3b01f6a00fe5a9573f596e45c4f5ad8b8a856b5 /doc | |
parent | 7979ea7ddbf83f3203d53b6351c3717ce0af91c4 (diff) | |
download | grep-c83ffc197ec483c6f44f907346f34127ec044ef0.tar.gz |
grep: -P (--perl-regexp) \d: match only ASCII digits
Prior to grep-3.9, the PCRE matcher had always treated \d just
like [0-9]. grep-3.9's fix for \w and \b mistakenly relaxed \d
to also match multibyte digits.
* src/grep.c (P_MATCHER_INDEX): Define enum.
(pcre_pattern_expand_backslash_d): New function.
(main): Call it for -P.
* NEWS (Bug fixes): Mention it.
* doc/grep.texi: Document it: with -P, \d matches only ASCII digits.
Provide a PCRE documentation URL and an example of how
to use (?s) with -z.
* tests/pcre-ascii-digits: New test.
* tests/Makefile.am (TESTS): Add that file name.
Reported as https://bugs.gnu.org/62267
Diffstat (limited to 'doc')
-rw-r--r-- | doc/grep.texi | 31 |
1 files changed, 31 insertions, 0 deletions
diff --git a/doc/grep.texi b/doc/grep.texi
index 621beaf5..eaad6e17 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1141,6 +1141,37 @@ combined with the @option{-z} (@option{--null-data}) option, and
 note that @samp{grep@ -P} may warn of unimplemented features.
 @xref{Other Options}.
 
+For documentation, refer to @url{https://www.pcre.org/}, with these caveats:
+@itemize
+@item
+@samp{\d} always matches only the ten ASCII digits, regardless of locale or
+in-regexp directives like @samp{(?aD)}.
+Use @samp{\p@{Nd@}} if you require to match non-ASCII digits.
+Once pcre2 support for @samp{(?aD)} is widespread enough,
+we expect to make that the default, so it will be overridable.
+@c Using pcre2 git commit pcre2-10.40-112-g6277357, this demonstrates how
+@c we'll prefix with (?aD) to make \d's ASCII-only behavior the default:
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '(?aD)^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c [Exit 1]
+@c $ LC_ALL=en_US.UTF-8 ./pcre2grep -u '^\d+' <<< '٠١٢٣٤٥٦٧٨٩'
+@c ٠١٢٣٤٥٦٧٨٩
+
+@item
+By default, @command{grep} applies each regexp to a line at a time,
+so the @samp{(?s)} directive (making @samp{.} match line breaks)
+is generally ineffective.
+However, with @option{-z} (@option{--null-data}) it can work:
+@example
+$ printf 'a\nb\n' |grep -zP '(?s)a.b'
+a
+b
+@end example
+But beware: with the @option{-z} (@option{--null-data}) and a file
+containing no NUL byte, grep must read the entire file into memory
+before processing any of it.
+Thus, it will exhaust memory and fail for some large files.
+@end itemize
+
 @end table
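
The two caveats documented in this diff can be illustrated outside grep. The sketch below uses Python's `re` module purely as an analogy: it is not PCRE and not grep's implementation, and the variable names are made up for illustration. It assumes only that Python's default `\d` for `str` patterns is Unicode-aware, that `re.ASCII` restricts it to `[0-9]`, and that `(?s)` makes `.` match a newline.

```python
import re

# The ten Arabic-Indic digits U+0660..U+0669, the same sample used in the
# @c comments of the diff (built from code points to avoid bidi display issues).
arabic_indic = "".join(chr(cp) for cp in range(0x0660, 0x066A))

# Unicode-aware \d accepts these digits (the relaxed behaviour the commit undoes).
unicode_match = re.fullmatch(r"\d+", arabic_indic) is not None

# ASCII-only \d rejects them, analogous to what is documented for grep -P.
ascii_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII) is not None

# (?s) only helps when the matcher sees both lines in a single buffer,
# which is roughly what -z gives grep for input without NUL bytes.
whole_input = "a\nb\n"
dotall_whole = re.search(r"(?s)a.b", whole_input) is not None

# Matching one line at a time (grep's default) never sees the newline.
dotall_per_line = any(
    re.search(r"(?s)a.b", line) for line in whole_input.splitlines()
)

print(unicode_match, ascii_match, dotall_whole, dotall_per_line)
```

The per-line case is why the documentation calls `(?s)` "generally ineffective" without `-z`: each line is matched separately, so there is never a line break for `.` to match.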