author | Paul Eggert <eggert@cs.ucla.edu> | 2022-05-22 14:59:53 -0700 |
---|---|---|
committer | Paul Eggert <eggert@cs.ucla.edu> | 2022-05-22 15:01:32 -0700 |
commit | a860bd39e384ed6111bc63fe6aabeb7f7120e6d5 (patch) | |
tree | 97b23ce449a172f1f09c6b7a7ef9159b95623e0f /doc | |
parent | 80bcb074aeed9b508a02940c8036c4ea5a1b9c63 (diff) | |
download | grep-a860bd39e384ed6111bc63fe6aabeb7f7120e6d5.tar.gz |
doc: document regex corner cases better
* doc/grep.texi (Environment Variables)
(Fundamental Structure, Character Classes and Bracket Expressions)
(Special Backslash Expressions, Back-references and Subexpressions)
(Basic vs Extended): Say more precisely what happens with
problematic regular expressions.
(Problematic Expressions): New section.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/grep.texi | 224 |
1 files changed, 168 insertions, 56 deletions
diff --git a/doc/grep.texi b/doc/grep.texi
index a717e32d..69b52dc2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -265,8 +265,7 @@ begin and end with word constituents, it differs from surrounding a
 regular expression with @samp{\<} and @samp{\>}. For example, although
 @samp{grep -w @@} matches a line containing only @samp{@@}, @samp{grep
 '\<@@\>'} cannot match any line because @samp{@@} is not a
-word constituent. @xref{The Backslash Character and Special
-Expressions}.
+word constituent. @xref{Special Backslash Expressions}.
 
 @item -x
 @itemx --line-regexp
@@ -830,8 +829,8 @@ is specified by examining the three environment variables
 in that order.
 The first of these variables that is set specifies the locale.
 For example, if @env{LC_ALL} is not set,
-but @env{LC_COLLATE} is set to @samp{pt_BR},
-then the Brazilian Portuguese locale is used
+but @env{LC_COLLATE} is set to @samp{pt_BR.UTF-8},
+then a Brazilian Portuguese locale is used
 for the @env{LC_COLLATE} category.
 As a special case for @env{LC_MESSAGES} only, the environment variable
 @env{LANGUAGE} can contain a colon-separated list of languages that
@@ -1176,10 +1175,11 @@ pages, but work only if PCRE is available in the system.
 @menu
 * Fundamental Structure::
 * Character Classes and Bracket Expressions::
-* The Backslash Character and Special Expressions::
+* Special Backslash Expressions::
 * Anchoring::
 * Back-references and Subexpressions::
 * Basic vs Extended::
+* Problematic Expressions::
 * Character Encoding::
 * Matching Non-ASCII::
 @end menu
@@ -1259,9 +1259,10 @@ the resulting regular expression
 matches any string formed by concatenating two substrings
 that respectively match the concatenated expressions.
 
-Two regular expressions may be joined by the infix operator @samp{|};
-the resulting regular expression
-matches any string matching either alternate expression.
+@cindex alternatives in regular expressions
+Two regular expressions may be joined by the infix operator @samp{|}.
+The resulting regular expression matches any string matching either of
+the two expressions, which are called @dfn{alternatives}.
 
 Repetition takes precedence over concatenation,
 which in turn takes precedence over alternation.
@@ -1269,14 +1270,8 @@ A whole expression may be enclosed in parentheses
 to override these precedence rules and form a subexpression.
 An unmatched @samp{)} matches just itself.
 
-Some strings are not valid regular expressions and cause
-@command{grep} to issue a diagnostic and fail. For example, @samp{xy\1}
-is invalid because there is no parenthesized subexpression for the
-back-reference @samp{\1} to refer to. Also, some regular expressions
-have unspecified behavior and should be avoided in portable scripts
-even if @command{grep} does not currently diagnose them. For example,
-@samp{xy\0} has unspecified behavior because @samp{0} is not a special
-character and there is no documentation for the behavior of @samp{\0}.
+Not every character string is a valid regular expression.
+@xref{Problematic Expressions}.
 
 @node Character Classes and Bracket Expressions
 @section Character Classes and Bracket Expressions
@@ -1442,7 +1437,7 @@ represents the close character class symbol.
 
 @item -
 represents the range if it's not first or last in a list or the ending point
-of a range.
+of a range. To make the @samp{-} a list item, it is best to put it last.
 
 @item ^
 represents the characters not in the list.
@@ -1451,8 +1446,8 @@ character a list item, place it anywhere but first.
 
 @end table
 
-@node The Backslash Character and Special Expressions
-@section The Backslash Character and Special Expressions
+@node Special Backslash Expressions
+@section Special Backslash Expressions
 @cindex backslash
 
 The @samp{\} character followed by a special character is a regular
@@ -1524,8 +1519,6 @@ for example, @samp{(a)*\1} fails to match @samp{a}.
 If the parenthesized subexpression matches more than one substring,
 the back-reference refers to the last matched substring; for example,
 @samp{^(ab*)*\1$} matches @samp{ababbabb} but not @samp{ababbab}.
-The back-reference @samp{\@var{n}} is invalid
-if preceded by fewer than @var{n} subexpressions.
 When multiple regular expressions are given with
 @option{-e} or from a file (@samp{-f @var{file}}),
 back-references are local to each expression.
@@ -1536,65 +1529,181 @@ back-references are local to each expression.
 @section Basic vs Extended Regular Expressions
 @cindex basic regular expressions
 
-In basic regular expressions the characters @samp{?}, @samp{+},
+Basic regular expressions differ from extended regular expressions
+in the following ways:
+
+@itemize
+@item
+The characters @samp{?}, @samp{+},
 @samp{@{}, @samp{|}, @samp{(}, and @samp{)} lose their special meaning;
 instead use the backslashed versions @samp{\?}, @samp{\+}, @samp{\@{},
 @samp{\|}, @samp{\(}, and @samp{\)}. Also, a backslash is needed
-before an interval expression's closing @samp{@}}, and an unmatched
-@code{\)} is invalid.
+before an interval expression's closing @samp{@}}.
 
-Portable scripts should avoid the following constructs, as
-POSIX says they produce unspecified results:
+@item
+An unmatched @samp{\)} is invalid.
 
-@itemize @bullet
 @item
-An extended regular expression that uses back-references.
+If an unescaped @samp{^} appears neither first, nor directly after
+@samp{\(} or @samp{\|}, it is treated like an ordinary character and
+is not an anchor.
+
 @item
-A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+If an unescaped @samp{$} appears neither last, nor directly before
+@samp{\|} or @samp{\)}, it is treated like an ordinary character and
+is not an anchor.
+
 @item
-An empty parenthesized regular expression like @samp{()}.
+If an unescaped @samp{*} appears first, or appears directly after
+@samp{\(} or @samp{\|} or anchoring @samp{^}, it is treated like an
+ordinary character and is not a repetition operator.
+@end itemize
+
+@node Problematic Expressions
+@section Problematic Regular Expressions
+
+@cindex invalid regular expressions
+@cindex unspecified behavior in regular expressions
+Some strings are @dfn{invalid regular expressions} and cause
+@command{grep} to issue a diagnostic and fail. For example, @samp{xy\1}
+is invalid because there is no parenthesized subexpression for the
+back-reference @samp{\1} to refer to.
+
+Also, some regular expressions have @dfn{unspecified behavior} and
+should be avoided even if @command{grep} does not currently diagnose
+them. For example, @samp{xy\0} has unspecified behavior because
+@samp{0} is not a special character and @samp{\0} is not a special
+backslash expression (@pxref{Special Backslash Expressions}).
+Unspecified behavior can be particularly problematic because the set
+of matched strings might be only partially specified, or not be
+specified at all, or the expression might even be invalid.
+
+The following regular expression constructs are invalid on all
+platforms conforming to POSIX, so portable scripts can assume that
+@command{grep} rejects these constructs:
+
+@itemize @bullet
 @item
-An empty alternative (as in, e.g, @samp{a|}).
+A basic regular expression containing a back-reference @samp{\@var{n}}
+preceded by fewer than @var{n} closing parentheses. For example,
+@samp{\(a\)\2} is invalid.
+
 @item
-A repetition operator that immediately follows an empty expression,
-unescaped @samp{$}, or another repetition operator.
+A bracket expression containing @samp{[:} that does not start a
+character class; and similarly for @samp{[=} and @samp{[.}. For
+example, @samp{[a[:b]} and @samp{[a[:ouch:]b]} are invalid.
+@end itemize
+
+GNU @command{grep} treats the following constructs as invalid.
+However, other @command{grep} implementations might allow them, so
+portable scripts should not rely on their being invalid:
+
+@itemize @bullet
+@item
+Unescaped @samp{\} at the end of a regular expression.
+
 @item
-An interval expression with a repetition count greater than 255.
+Unescaped @samp{[} that does not start a bracket expression.
+
+@item
+A @samp{\@{} in a basic regular expression that does not start an
+interval expression.
+
 @item
 A basic regular expression with unbalanced @samp{\(} or @samp{\)},
 or an extended regular expression with unbalanced @samp{(}.
+
+@item
+In the POSIX locale, a range expression like @samp{z-a} that
+represents zero elements. A non-GNU @command{grep} might treat it as
+a valid range that never matches.
+
+@item
+An interval expression with a repetition count greater than 32767.
+(The portable POSIX limit is 255, and even interval expressions with
+smaller counts can be impractically slow on all known implementations.)
+
 @item
 A bracket expression that contains at least three elements, the first
 and last of which are both @samp{:}, or both @samp{.}, or both
-@samp{=}. For example, it is unspecified whether the bracket expression
-@samp{[:alpha:]} is equivalent to @samp{[[:alpha:]]}, equivalent to
-@samp{[:ahlp]}, or invalid.
+@samp{=}. For example, a non-GNU @command{grep} might treat
+@samp{[:alpha:]} like @samp{[[:alpha:]]}, or like @samp{[:ahlp]}.
+@end itemize
+
+The following constructs have well-defined behavior in GNU
+@command{grep}. However, they have unspecified behavior elsewhere, so
+portable scripts should avoid them:
+
+@itemize @bullet
 @item
-A range expression like @samp{z-a} that represents zero elements;
-it might never match, or it might be invalid.
+Special backslash expressions like @samp{\<} and @samp{\b}.
+@xref{Special Backslash Expressions}.
+
 @item
-A range expression outside the POSIX locale.
+A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+
 @item
-A backslash escaping an ordinary character (e.g., @samp{\S}),
-unless it is a back-reference.
+An extended regular expression that uses back-references.
+
 @item
-An unescaped backslash at the end of a regular expression.
+An empty regular expression, subexpression, or alternative. For
+example, @samp{(a|bc|)} is not portable; a portable equivalent is
+@samp{(a|bc)?}.
+
 @item
-An unescaped @samp{[} that is not part of a bracket expression.
+In a basic regular expression, an anchoring @samp{^} that appears
+directly after @samp{\(}, or an anchoring @samp{$} that appears
+directly before @samp{\)}.
+
 @item
-A @samp{\@{} in a basic regular expression (or an unescaped @samp{@{}
-in an extended regular expression) that does not start an interval
-expression.
+In a basic regular expression, a repetition operator that
+directly follows another repetition operator.
+
+@item
+In an extended regular expression, unescaped @samp{@{}
+that does not begin a valid interval expression.
+GNU @command{grep} treats the @samp{@{} as an ordinary character.
+
+@item
+A null character or an encoding error in either pattern or input data.
+@xref{Character Encoding}.
+
+@item
+An input file that ends in a non-newline character,
+where GNU @command{grep} silently supplies a newline.
 @end itemize
 
-@cindex interval expressions
-GNU @samp{grep@ -E} treats @samp{@{} as special
-only if it begins a valid interval expression.
-For example, the command
-@samp{grep@ -E@ '@{1'} searches for the two-character string @samp{@{1}
-instead of reporting a syntax error in the regular expression.
-POSIX allows this behavior as an extension, but portable scripts
-should avoid it.
+The following constructs have unspecified behavior, in both GNU
+and other @command{grep} implementations. Scripts should avoid
+them whenever possible.
+
+@itemize
+@item
+A backslash escaping an ordinary character, unless it is a
+back-reference like @samp{\1} or a special backslash expression like
+@samp{\<} or @samp{\b}. @xref{Special Backslash Expressions}. For
+example, @samp{\x} has unspecified behavior now, and a future version
+of @command{grep} might specify @samp{\x} to have a new behavior.
+
+@item
+A repetition operator that appears directly after an anchor, or at the
+start of a complete regular expression, parenthesized subexpression,
+or alternative. For example, @samp{+|^*(+a|?-b)} has unspecified
+behavior, whereas @samp{\+|^\*(\+a|\?-b)} is portable.
+
+@item
+A range expression outside the POSIX locale. For example, in some
+locales @samp{[a-z]} might match some characters that are not
+lowercase letters, or might not match some lowercase letters, or might
+be invalid. With GNU @command{grep} it is not documented whether
+these range expressions use native code points, or use the collating
+sequence specified by the @env{LC_COLLATE} category, or have some
+other interpretation. Outside the POSIX locale, it is portable to use
+@samp{[[:lower:]]} to match a lower-case letter, or
+@samp{[abcdefghijklmnopqrstuvwxyz]} to match an ASCII lower-case
+letter.
+
+@end itemize
 
 @node Character Encoding
 @section Character Encoding
@@ -1900,7 +2009,10 @@ other patterns cause @command{grep} to match every line.
 
 To match empty lines, use the pattern @samp{^$}. To match blank
 lines, use the pattern @samp{^[[:blank:]]*$}. To match no lines at
-all, use the command @samp{grep -f /dev/null}.
+all, use an extended regular expression like @samp{a^} or @samp{$a}.
+To match every line, a portable script should use a pattern like
+@samp{^} instead of the empty pattern, as POSIX does not specify the
+behavior of the empty pattern.
 
 @item How can I search in both standard input and in files?
 
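
The portable alternatives recommended in the patched text can be tried with a short shell sketch. This is an illustration only, not part of the commit: the `sample` function, its four input lines, and the variable names are made up here, and only the patterns `^`, `^$`, `[[:lower:]]`, and `(a|bc)?` come from the patch.

```shell
#!/bin/sh
# Hypothetical four-line sample: "abc", an empty line, "ABC", "bc".
sample() { printf 'abc\n\nABC\nbc\n'; }

# '^' matches every line; the patch recommends it over the empty pattern,
# whose behavior POSIX does not specify.
every=$(sample | grep -c '^')

# '^$' matches only the empty line.
empty=$(sample | grep -c '^$')

# '[[:lower:]]' is the portable way to match a lowercase letter;
# the range '[a-z]' is unspecified outside the POSIX locale.
lower=$(sample | grep -c '[[:lower:]]')

# '(a|bc)?' is the portable equivalent of '(a|bc|)'. It is anchored here
# so that only lines matching in full ("" and "bc") are counted.
alt=$(sample | grep -E -c '^(a|bc)?$')

echo "every=$every empty=$empty lower=$lower alt=$alt"
```

With a POSIX-conforming `grep` this should print `every=4 empty=1 lower=2 alt=2`. The sketch deliberately leaves out the constructs the patch lists as invalid or unspecified, since their results differ between implementations.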