author Paul Eggert <eggert@cs.ucla.edu> 2022-05-22 14:59:53 -0700
committer Paul Eggert <eggert@cs.ucla.edu> 2022-05-22 15:01:32 -0700
commit a860bd39e384ed6111bc63fe6aabeb7f7120e6d5
tree 97b23ce449a172f1f09c6b7a7ef9159b95623e0f /doc
parent 80bcb074aeed9b508a02940c8036c4ea5a1b9c63
doc: document regex corner cases better
* doc/grep.texi (Environment Variables)
(Fundamental Structure, Character Classes and Bracket Expressions)
(Special Backslash Expressions, Back-references and Subexpressions)
(Basic vs Extended):
Say more precisely what happens with problematic regular expressions.
(Problematic Expressions): New section.
Diffstat (limited to 'doc')
-rw-r--r-- doc/grep.texi 224
1 files changed, 168 insertions, 56 deletions
diff --git a/doc/grep.texi b/doc/grep.texi
index a717e32d..69b52dc2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -265,8 +265,7 @@ begin and end with word constituents, it differs from surrounding a
regular expression with @samp{\<} and @samp{\>}. For example, although
@samp{grep -w @@} matches a line containing only @samp{@@}, @samp{grep
'\<@@\>'} cannot match any line because @samp{@@} is not a
-word constituent. @xref{The Backslash Character and Special
-Expressions}.
+word constituent. @xref{Special Backslash Expressions}.
@item -x
@itemx --line-regexp
@@ -830,8 +829,8 @@ is specified by examining the three environment variables
in that order.
The first of these variables that is set specifies the locale.
For example, if @env{LC_ALL} is not set,
-but @env{LC_COLLATE} is set to @samp{pt_BR},
-then the Brazilian Portuguese locale is used
+but @env{LC_COLLATE} is set to @samp{pt_BR.UTF-8},
+then a Brazilian Portuguese locale is used
for the @env{LC_COLLATE} category.
As a special case for @env{LC_MESSAGES} only, the environment variable
@env{LANGUAGE} can contain a colon-separated list of languages that
@@ -1176,10 +1175,11 @@ pages, but work only if PCRE is available in the system.
@menu
* Fundamental Structure::
* Character Classes and Bracket Expressions::
-* The Backslash Character and Special Expressions::
+* Special Backslash Expressions::
* Anchoring::
* Back-references and Subexpressions::
* Basic vs Extended::
+* Problematic Expressions::
* Character Encoding::
* Matching Non-ASCII::
@end menu
@@ -1259,9 +1259,10 @@ the resulting regular expression
matches any string formed by concatenating two substrings
that respectively match the concatenated expressions.
-Two regular expressions may be joined by the infix operator @samp{|};
-the resulting regular expression
-matches any string matching either alternate expression.
+@cindex alternatives in regular expressions
+Two regular expressions may be joined by the infix operator @samp{|}.
+The resulting regular expression matches any string matching either of
+the two expressions, which are called @dfn{alternatives}.
Repetition takes precedence over concatenation,
which in turn takes precedence over alternation.
@@ -1269,14 +1270,8 @@ A whole expression may be enclosed in parentheses
to override these precedence rules and form a subexpression.
An unmatched @samp{)} matches just itself.
-Some strings are not valid regular expressions and cause
-@command{grep} to issue a diagnostic and fail. For example, @samp{xy\1}
-is invalid because there is no parenthesized subexpression for the
-back-reference @samp{\1} to refer to. Also, some regular expressions
-have unspecified behavior and should be avoided in portable scripts
-even if @command{grep} does not currently diagnose them. For example,
-@samp{xy\0} has unspecified behavior because @samp{0} is not a special
-character and there is no documentation for the behavior of @samp{\0}.
+Not every character string is a valid regular expression.
+@xref{Problematic Expressions}.
@node Character Classes and Bracket Expressions
@section Character Classes and Bracket Expressions
@@ -1442,7 +1437,7 @@ represents the close character class symbol.
@item -
represents the range if it's not first or last in a list or the ending point
-of a range.
+of a range. To make the @samp{-} a list item, it is best to put it last.
@item ^
represents the characters not in the list.
@@ -1451,8 +1446,8 @@ character a list item, place it anywhere but first.
@end table
-@node The Backslash Character and Special Expressions
-@section The Backslash Character and Special Expressions
+@node Special Backslash Expressions
+@section Special Backslash Expressions
@cindex backslash
The @samp{\} character followed by a special character is a regular
@@ -1524,8 +1519,6 @@ for example, @samp{(a)*\1} fails to match @samp{a}.
If the parenthesized subexpression matches more than one substring,
the back-reference refers to the last matched substring;
for example, @samp{^(ab*)*\1$} matches @samp{ababbabb} but not @samp{ababbab}.
-The back-reference @samp{\@var{n}} is invalid
-if preceded by fewer than @var{n} subexpressions.
When multiple regular expressions are given with
@option{-e} or from a file (@samp{-f @var{file}}),
back-references are local to each expression.
@@ -1536,65 +1529,181 @@ back-references are local to each expression.
@section Basic vs Extended Regular Expressions
@cindex basic regular expressions
-In basic regular expressions the characters @samp{?}, @samp{+},
+Basic regular expressions differ from extended regular expressions
+in the following ways:
+
+@itemize
+@item
+The characters @samp{?}, @samp{+},
@samp{@{}, @samp{|}, @samp{(}, and @samp{)} lose their special meaning;
instead use the backslashed versions @samp{\?}, @samp{\+}, @samp{\@{},
@samp{\|}, @samp{\(}, and @samp{\)}. Also, a backslash is needed
-before an interval expression's closing @samp{@}}, and an unmatched
-@code{\)} is invalid.
+before an interval expression's closing @samp{@}}.
-Portable scripts should avoid the following constructs, as
-POSIX says they produce unspecified results:
+@item
+An unmatched @samp{\)} is invalid.
-@itemize @bullet
@item
-An extended regular expression that uses back-references.
+If an unescaped @samp{^} appears neither first, nor directly after
+@samp{\(} or @samp{\|}, it is treated like an ordinary character and
+is not an anchor.
+
@item
-A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+If an unescaped @samp{$} appears neither last, nor directly before
+@samp{\|} or @samp{\)}, it is treated like an ordinary character and
+is not an anchor.
+
@item
-An empty parenthesized regular expression like @samp{()}.
+If an unescaped @samp{*} appears first, or appears directly after
+@samp{\(} or @samp{\|} or anchoring @samp{^}, it is treated like an
+ordinary character and is not a repetition operator.
+@end itemize
+
+@node Problematic Expressions
+@section Problematic Regular Expressions
+
+@cindex invalid regular expressions
+@cindex unspecified behavior in regular expressions
+Some strings are @dfn{invalid regular expressions} and cause
+@command{grep} to issue a diagnostic and fail. For example, @samp{xy\1}
+is invalid because there is no parenthesized subexpression for the
+back-reference @samp{\1} to refer to.
+
+Also, some regular expressions have @dfn{unspecified behavior} and
+should be avoided even if @command{grep} does not currently diagnose
+them. For example, @samp{xy\0} has unspecified behavior because
+@samp{0} is not a special character and @samp{\0} is not a special
+backslash expression (@pxref{Special Backslash Expressions}).
+Unspecified behavior can be particularly problematic because the set
+of matched strings might be only partially specified, or not be
+specified at all, or the expression might even be invalid.
+
+The following regular expression constructs are invalid on all
+platforms conforming to POSIX, so portable scripts can assume that
+@command{grep} rejects these constructs:
+
+@itemize @bullet
@item
-An empty alternative (as in, e.g, @samp{a|}).
+A basic regular expression containing a back-reference @samp{\@var{n}}
+preceded by fewer than @var{n} closing parentheses. For example,
+@samp{\(a\)\2} is invalid.
+
@item
-A repetition operator that immediately follows an empty expression,
-unescaped @samp{$}, or another repetition operator.
+A bracket expression containing @samp{[:} that does not start a
+character class; and similarly for @samp{[=} and @samp{[.}. For
+example, @samp{[a[:b]} and @samp{[a[:ouch:]b]} are invalid.
+@end itemize
+
+GNU @command{grep} treats the following constructs as invalid.
+However, other @command{grep} implementations might allow them, so
+portable scripts should not rely on their being invalid:
+
+@itemize @bullet
+@item
+Unescaped @samp{\} at the end of a regular expression.
+
@item
-An interval expression with a repetition count greater than 255.
+Unescaped @samp{[} that does not start a bracket expression.
+
+@item
+A @samp{\@{} in a basic regular expression that does not start an
+interval expression.
+
@item
A basic regular expression with unbalanced @samp{\(} or @samp{\)},
or an extended regular expression with unbalanced @samp{(}.
+
+@item
+In the POSIX locale, a range expression like @samp{z-a} that
+represents zero elements. A non-GNU @command{grep} might treat it as
+a valid range that never matches.
+
+@item
+An interval expression with a repetition count greater than 32767.
+(The portable POSIX limit is 255, and even interval expressions with
+smaller counts can be impractically slow on all known implementations.)
+
@item
A bracket expression that contains at least three elements, the first
and last of which are both @samp{:}, or both @samp{.}, or both
-@samp{=}. For example, it is unspecified whether the bracket expression
-@samp{[:alpha:]} is equivalent to @samp{[[:alpha:]]}, equivalent to
-@samp{[:ahlp]}, or invalid.
+@samp{=}. For example, a non-GNU @command{grep} might treat
+@samp{[:alpha:]} like @samp{[[:alpha:]]}, or like @samp{[:ahlp]}.
+@end itemize
+
+The following constructs have well-defined behavior in GNU
+@command{grep}. However, they have unspecified behavior elsewhere, so
+portable scripts should avoid them:
+
+@itemize @bullet
@item
-A range expression like @samp{z-a} that represents zero elements;
-it might never match, or it might be invalid.
+Special backslash expressions like @samp{\<} and @samp{\b}.
+@xref{Special Backslash Expressions}.
+
@item
-A range expression outside the POSIX locale.
+A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+
@item
-A backslash escaping an ordinary character (e.g., @samp{\S}),
-unless it is a back-reference.
+An extended regular expression that uses back-references.
+
@item
-An unescaped backslash at the end of a regular expression.
+An empty regular expression, subexpression, or alternative. For
+example, @samp{(a|bc|)} is not portable; a portable equivalent is
+@samp{(a|bc)?}.
+
@item
-An unescaped @samp{[} that is not part of a bracket expression.
+In a basic regular expression, an anchoring @samp{^} that appears
+directly after @samp{\(}, or an anchoring @samp{$} that appears
+directly before @samp{\)}.
+
@item
-A @samp{\@{} in a basic regular expression (or an unescaped @samp{@{}
-in an extended regular expression) that does not start an interval
-expression.
+In a basic regular expression, a repetition operator that
+directly follows another repetition operator.
+
+@item
+In an extended regular expression, unescaped @samp{@{}
+that does not begin a valid interval expression.
+GNU @command{grep} treats the @samp{@{} as an ordinary character.
+
+@item
+A null character or an encoding error in either pattern or input data.
+@xref{Character Encoding}.
+
+@item
+An input file that ends in a non-newline character,
+where GNU @command{grep} silently supplies a newline.
@end itemize
-@cindex interval expressions
-GNU @samp{grep@ -E} treats @samp{@{} as special
-only if it begins a valid interval expression.
-For example, the command
-@samp{grep@ -E@ '@{1'} searches for the two-character string @samp{@{1}
-instead of reporting a syntax error in the regular expression.
-POSIX allows this behavior as an extension, but portable scripts
-should avoid it.
+The following constructs have unspecified behavior, in both GNU
+and other @command{grep} implementations. Scripts should avoid
+them whenever possible.
+
+@itemize
+@item
+A backslash escaping an ordinary character, unless it is a
+back-reference like @samp{\1} or a special backslash expression like
+@samp{\<} or @samp{\b}. @xref{Special Backslash Expressions}. For
+example, @samp{\x} has unspecified behavior now, and a future version
+of @command{grep} might specify @samp{\x} to have a new behavior.
+
+@item
+A repetition operator that appears directly after an anchor, or at the
+start of a complete regular expression, parenthesized subexpression,
+or alternative. For example, @samp{+|^*(+a|?-b)} has unspecified
+behavior, whereas @samp{\+|^\*(\+a|\?-b)} is portable.
+
+@item
+A range expression outside the POSIX locale. For example, in some
+locales @samp{[a-z]} might match some characters that are not
+lowercase letters, or might not match some lowercase letters, or might
+be invalid. With GNU @command{grep} it is not documented whether
+these range expressions use native code points, or use the collating
+sequence specified by the @env{LC_COLLATE} category, or have some
+other interpretation. Outside the POSIX locale, it is portable to use
+@samp{[[:lower:]]} to match a lower-case letter, or
+@samp{[abcdefghijklmnopqrstuvwxyz]} to match an ASCII lower-case
+letter.
+
+@end itemize
@node Character Encoding
@section Character Encoding
@@ -1900,7 +2009,10 @@ other patterns cause @command{grep} to match every line.
To match empty lines, use the pattern @samp{^$}. To match blank
lines, use the pattern @samp{^[[:blank:]]*$}. To match no lines at
-all, use the command @samp{grep -f /dev/null}.
+all, use an extended regular expression like @samp{a^} or @samp{$a}.
+To match every line, a portable script should use a pattern like
+@samp{^} instead of the empty pattern, as POSIX does not specify the
+behavior of the empty pattern.
@item
How can I search in both standard input and in files?