New option --line-buffered; back-references are local; beef up manual.

* doc/grep.texi : Document the new options and the new behaviour
back-references are local. Use excerpt from Karl Berry regex texinfo.

* bootstrap/Makefile.try : Added xstrtoumax.o xstrtoul.o hard-local.o

From Guglielmo 'bond' Bondioni :
The bug was that using a multi line file that contained REs (one per
line), backreferences in the REs were considered global (to the file)
and not local (to the line). That is, \1 in line n refers to the first
\(.\) in the whole file, rather than in the line itself.

From Tapani Tarvainen :
# Re: grep -e '\(a\)\1' -e '\(b\)\1'
That's not the way it should work: multiple -e arguments should be
treated as independent patterns and back references should not refer
to previous ones.

From Paul Eggert :
GNU grep currently does not issue diagnostics for the following two
cases, both of which are erroneous:
    grep -e '[' -e ']'
    grep '[ ]'
POSIX requires a diagnostic in both cases because '[' is not a valid
regular expression.

To overcome those problems, grep no longer pass the concatenate
patterns to GNU regex but rather compile each patterns separately and
keep the result in an array.

* src/search.c (patterns) : New global variable; a structure array
holding the compiled patterns. Declare function prototypes to minimize
error.
(dfa, kswset, regexbuf, regs) : Removed, no longer static globals, but
rather fields in patterns[] structure per motif.
(Fcompile) : Alloc an entry in patterns[] to hold the regex.
(Ecompile) : Alloc an entry per motif in the patterns[] array.
(Gcompile) : Likewise.
(EGexecute) : Loop through of array of patterns[] for a match.

From Bernd Strieder :
# tail -f logfile | grep important | do_something_urgent
# tail -f logfile | grep important | do_something_taking_very_long
If grep does full buffering in these cases then the urgent operation
does not happen as it should in the first case, and in the second case
time is lost due to waiting for the buffer to be filled.
This is clearly spoken not grep's fault in the first place, but
libc's. There is a heuristic in libc that make a stream line-buffered
only if a terminal is on the other end. This doesn't take care of the
cases where this connection is somehow indirect.

* src/grep.c (line_buffered) : new option variable.
(prline) : if line_buffered is set fflush() is call.
(usage) : line_buffered new option.
Input from Paul Eggert, doing setvbuf() may not be portable and breaks
grep -z.

This patch prevent kwset_matcher from following problems. For example,
in SJIS encoding, one character has the codepoint 0x895c. So the
second byte of the character can match with '\' incorrectly. And in
eucJP encoding, there are the characters whose codepoints are 0xa5b9,
0xa5c8. On the other hand, there is one character whose codepoint is
0xb9a5. So 0xb9a5 can match with 2nd byte of 0xa5b9 and 1st byte of
0xa5c8.
(EGexecute) : call check_multibyte_string when kwset is set.
(Fexecute) : call to check_multibyte_string.
(MB_CUR_MAX) : new macro.
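
The false match described in the last log entry can be reproduced at the byte level. The following is a minimal Python sketch, not grep's C code: the helper names `char_starts` and `find_on_boundary` are hypothetical, and the boundary walker assumes a simplified EUC-JP layout in which every byte >= 0x80 starts a two-byte character (the single-shift prefixes 0x8E and 0x8F are ignored).

```python
def char_starts(data: bytes) -> set:
    """Offsets where a character begins, under the simplified two-byte rule."""
    starts = set()
    i = 0
    while i < len(data):
        starts.add(i)
        # Assumption: a byte >= 0x80 leads a two-byte character; ASCII is one byte.
        i += 2 if data[i] >= 0x80 else 1
    return starts


def find_on_boundary(haystack: bytes, needle: bytes) -> int:
    """Return the first offset of needle that begins on a character boundary, else -1."""
    starts = char_starts(haystack)
    pos = haystack.find(needle)
    while pos != -1:
        if pos in starts:
            return pos
        # Hit straddles two characters: keep looking further right.
        pos = haystack.find(needle, pos + 1)
    return -1


# EUC-JP text made of the two characters 0xa5b9 and 0xa5c8 from the log.
text = bytes([0xA5, 0xB9, 0xA5, 0xC8])
# The unrelated character 0xb9a5.
needle = bytes([0xB9, 0xA5])

print(text.find(needle))               # raw byte search: false hit at offset 1
print(find_on_boundary(text, needle))  # boundary-aware search: -1, no match
```

The raw search reports a hit that straddles two characters, which is the problem the log attributes to the keyword matcher; rejecting hits that do not start on a character boundary removes it in this sketch.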
author: Alain Magloire <alainm@rcsm.ee.mcgill.ca> 2001-02-18 04:13:21 +0000
committer: Alain Magloire <alainm@rcsm.ee.mcgill.ca> 2001-02-18 04:13:21 +0000
commit: 1236f00774c60964d9c1661e7a8f6833d45596f5 (patch)
tree: 2392756d9ab216677151659ec06073f76ee16f6e /doc
parent: 67c5b94f135ef975f657bac7e896719761646db9 (diff)
download: grep-1236f00774c60964d9c1661e7a8f6833d45596f5.tar.gz
3 files changed, 215 insertions, 84 deletions
diff --git a/doc/.cvsignore b/doc/.cvsignore
new file mode 100644
index 00000000..14bb26d9
--- /dev/null
+++ b/doc/.cvsignore
@@ -0,0 +1,3 @@
+Makefile
+Makefile.in
+version.texi
diff --git a/doc/grep.1 b/doc/grep.1
index 387e0024..752eb334 100644
--- a/doc/grep.1
+++ b/doc/grep.1
@@ -12,7 +12,7 @@
 .de Id
 .ds Dt \\$4
 ..
-.Id $Id: grep.1,v 1.17 2001/02/16 05:50:23 alainm Exp $
+.Id $Id: grep.1,v 1.18 2001/02/18 04:13:21 alainm Exp $
 .TH GREP 1 \*(Dt "GNU Project"
 .SH NAME
 grep, egrep, fgrep \- print lines matching a pattern
@@ -269,6 +269,9 @@ is operating, or if an I/O error occurs.
 Prefix each line of output with the line number
 within its input file.
 .TP
+.BR \-\^\-line-buffered
+Use line buffering; this can incur a performance penalty.
+.TP
 .BR \-q ", " \-\^\-quiet ", " \-\^\-silent
 Quiet; do not write anything to standard output.
 Exit immediately with zero status if any match is found,
diff --git a/doc/grep.texi b/doc/grep.texi
index 42ad1c2a..66ea5394 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -147,7 +147,7 @@ extensions.
 @item -c
 @itemx --count
 @opindex -c
-@opindex -count
+@opindex --count
 @cindex counting lines
 Suppress normal output; instead print a count of matching
 lines for each input file.  With the @samp{-v}, @samp{--invert-match} option,
@@ -261,12 +261,8 @@ Print @var{num} lines of leading context before matching lines.
 @cindex context
 Print @var{num} lines of output context.
 
+@item --colour
 @itemx --color
-@opindex --color
-@cindex highlight, color, colour
-Equivalent to @samp{--colour}.
-
-@itemx --colour
 @opindex --colour
 @cindex highlight, color, colour
 The matching string is surrounded by the marker specify in @var{GREP_COLOR}.
@@ -346,6 +342,11 @@ Print the filename for each match.
 @cindex no filename prefix
 Suppress the prefixing of filenames on output when multiple files are searched.
 
+@item --line-buffered
+@opindex --line-buffered
+@cindex line buffering
+Use line buffering on output; this can incur a performance penalty.
+
 @item -L
 @itemx --files-without-match
 @opindex -L
@@ -381,12 +382,8 @@ it must be either at the end of the line or followed by
 a non-word constituent character.  Word-constituent
 characters are letters, digits, and the underscore.
 
-@item -R
-@cindex recursive search
-@cindex searching directory trees
-Equivalent to @sam{--directories=recurse}.
-
 @item -r
+@itemx -R
 @itemx --recursive
 @opindex -r
 @opindex --recursive
@@ -396,18 +393,18 @@ For each directory mentioned in the command line, read and process all
 files in that directory, recursively.  This is the same as the
 @samp{--directories=recurse} option.
 
-@item --include=@var{pattern}
+@item --include=@var{file_pattern}
 @opindex --include
-@cindex recursive search
+@cindex include files
 @cindex searching directory trees
-When processing directories recursively, only files matching @var{pattern}
-@var{pattern} will be search.
+When processing directories recursively, only files matching @var{file_pattern}
+will be searched.
 
-@item --exclude=@var{pattern}
+@item --exclude=@var{file_pattern}
 @opindex --exclude
-@cindex recursive search
+@cindex exclude files
 @cindex searching directory trees
-When processing directories recursively, skip files matching @var{pattern}.
+When processing directories recursively, skip files matching @var{file_pattern}.
 
 @item -m @var{num}
 @itemx --max-count=@var{num}
@@ -558,7 +555,7 @@ specify an option containing whitespace or a backslash.
 
 @item GREP_COLOR
 @vindex GREP_COLOR
-@cindex default options environment variable, highlight, color, coulor
+@cindex highlight markers
 This variable specifies the surrounding markers use to highlight the matching
 text.  The default is control ascii red.
 
@@ -690,8 +687,8 @@ A @dfn{regular expression} is a pattern that describes a set of strings.
 Regular expressions are constructed analogously to arithmetic expressions,
 by using various operators to combine smaller expressions.
 @command{grep} understands two different versions of regular expression
-syntax: ``basic'' and ``extended''.  In @sc{gnu} @command{grep}, there is no
-difference in available functionality using either syntax.
+syntax: ``basic'' (BRE) and ``extended'' (ERE).  In @sc{gnu} @command{grep},
+there is no difference in available functionality using either syntax.
 In other implementations, basic regular expressions are less powerful.
 The following description applies to extended regular expressions;
 differences for basic regular expressions are summarized afterwards.
@@ -701,13 +698,74 @@ a single character.  Most characters, including all letters and digits,
 are regular expressions that match themselves.  Any metacharacter
 with special meaning may be quoted by preceding it with a backslash.
 
+A regular expression may be followed by one of several
+repetition operators:
+
+@table @samp
+
+@item .
+@opindex .
+@cindex dot
+@cindex period
+The period @samp{.} matches any single character.
+
+@item ?
+@opindex ?
+@cindex question mark
+@cindex match sub-expression at most once
+The preceding item is optional and will be matched at most once.
+
+@item *
+@opindex *
+@cindex asterisk
+@cindex match sub-expression zero or more times
+The preceding item will be matched zero or more times.
+
+@item +
+@opindex +
+@cindex plus sign
+The preceding item will be matched one or more times.
+
+@item @{@var{n}@}
+@opindex @{n@}
+@cindex braces, one argument
+@cindex match sub-expression n times
+The preceding item is matched exactly @var{n} times.
+
+@item @{@var{n},@}
+@opindex @{n,@}
+@cindex braces, second argument omitted
+@cindex match sub-expression n or more times
+The preceding item is matched n or more times.
+
+@item @{@var{n},@var{m}@}
+@opindex @{n,m@}
+@cindex braces, two arguments
+The preceding item is matched at least @var{n} times, but not more than
+@var{m} times.
+
+@end table
+
+Two regular expressions may be concatenated; the resulting regular
+expression matches any string formed by concatenating two substrings
+that respectively match the concatenated subexpressions.
+
+Two regular expressions may be joined by the infix operator @samp{|}; the
+resulting regular expression matches any string matching either subexpression.
+
+Repetition takes precedence over concatenation, which in turn
+takes precedence over alternation.  A whole subexpression may be
+enclosed in parentheses to override these precedence rules.
+
+@section Character Class
+
 @cindex bracket expression
+@cindex character class
 A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
-@samp{]}.  It matches any
-single character in that list; if the first character of the list is the
-caret @samp{^}, then it
-matches any character @strong{not} in the list.  For example, the regular
-expression @samp{[0123456789]} matches any single digit.
+@samp{]}.  It matches any single character in that list; if the first
+character of the list is the caret @samp{^}, then it matches any character
+@strong{not} in the list.  For example, the regular expression
+@samp{[0123456789]} matches any single digit.
 
 @cindex range expression
 Within a bracket expression, a @dfn{range expression} consists of two
@@ -812,82 +870,96 @@ depends upon the C locale and the @sc{ascii} character
 encoding, whereas the former is independent of locale and character set.
 (Note that the brackets in these class names are
 part of the symbolic names, and must be included in addition to
-the brackets delimiting the bracket list.)  Most metacharacters lose
-their special meaning inside lists.  To include a literal @samp{]}, place it
-first in the list.  Similarly, to include a literal @samp{^}, place it anywhere
-but first.  Finally, to include a literal @samp{-}, place it last.
+the brackets delimiting the bracket list.)
 
-The period @samp{.} matches any single character.  The symbol @samp{\w}
-is a synonym for @samp{[[:alnum:]]} and @samp{\W} is a synonym for
-@samp{[^[:alnum]]}.
+Most metacharacters lose their special meaning inside lists.
 
-The caret @samp{^} and the dollar sign @samp{$} are metacharacters that
-respectively match the empty string at the beginning and end
-of a line.  The symbols @samp{\<} and @samp{\>} respectively match the
-empty string at the beginning and end of a word.  The symbol
-@samp{\b} matches the empty string at the edge of a word, and @samp{\B}
-matches the empty string provided it's not at the edge of a word.
+@table @samp
+@item ]
+ends the list if it's not the first list item.  So, if you want to make
+the @samp{]} character a list item, you must put it first.
 
-A regular expression may be followed by one of several
-repetition operators:
+@item [.
+represents the open collating symbol.
+
+@item .]
+represents the close collating symbol.
+
+@item [=
+represents the open equivalence class.
+
+@item =]
+represents the close equivalence class.
+
+@item [:
+represents the open character class followed by a valid character class name.
+
+@item :]
+represents the close character class followed by a valid character class name.
 
+@item -
+represents the range if it's not first or last in a list or the ending point
+of a range.
+
+@item ^
+represents the characters not in the list.  If you want to make the @samp{^}
+character a list item, place it anywhere but first.
+
+@end table
+
+@section Backslash Character
+@cindex backslash
+
+The @samp{\} character, when followed by certain ordinary characters, takes a
+special meaning:
 
 @table @samp
 
-@item ?
-@opindex ?
-@cindex question mark
-@cindex match sub-expression at most once
-The preceding item is optional and will be matched at most once.
+@item @samp{\b}
+Match the empty string at the edge of a word.
 
-@item *
-@opindex *
-@cindex asterisk
-@cindex match sub-expression zero or more times
-The preceding item will be matched zero or more times.
+@item @samp{\B}
+Match the empty string provided it's not at the edge of a word.
 
-@item +
-@opindex +
-@cindex plus sign
-The preceding item will be matched one or more times.
+@item @samp{\<}
+Match the empty string at the beginning of a word.
 
-@item @{@var{n}@}
-@opindex @{n@}
-@cindex braces, one argument
-@cindex match sub-expression n times
-The preceding item is matched exactly @var{n} times.
+@item @samp{\>}
+Match the empty string at the end of a word.
 
-@item @{@var{n},@}
-@opindex @{n,@}
-@cindex braces, second argument omitted
-@cindex match sub-expression n or more times
-The preceding item is matched n or more times.
+@item @samp{\w}
+Match a word constituent; it is a synonym for @samp{[[:alnum:]]}.
 
-@item @{@var{n},@var{m}@}
-@opindex @{n,m@}
-@cindex braces, two arguments
-The preceding item is matched at least @var{n} times, but not more than
-@var{m} times.
+@item @samp{\W}
+Match a non-word constituent; it is a synonym for @samp{[^[:alnum:]]}.
 
 @end table
 
-Two regular expressions may be concatenated; the resulting regular
-expression matches any string formed by concatenating two substrings
-that respectively match the concatenated subexpressions.
+For example, @samp{\brat\b} matches the separate word @samp{rat},
+@samp{c\Brat\Be} matches @samp{crate}, but @samp{dirty \Brat} doesn't
+match @samp{dirty rat}.
 
-Two regular expressions may be joined by the infix operator @samp{|}; the
-resulting regular expression matches any string matching either
-subexpression.
+@section Anchoring
+@cindex anchoring
 
-Repetition takes precedence over concatenation, which in turn
-takes precedence over alternation.  A whole subexpression may be
-enclosed in parentheses to override these precedence rules.
+The caret @samp{^} and the dollar sign @samp{$} are metacharacters that
+respectively match the empty string at the beginning and end of a line.
+
+@section Back-reference
+@cindex back-reference
 
-The backreference @samp{\@var{n}}, where @var{n} is a single digit, matches the
-substring previously matched by the @var{n}th parenthesized subexpression
-of the regular expression.
+The back-reference @samp{\@var{n}}, where @var{n} is a single digit, matches
+the substring previously matched by the @var{n}th parenthesized subexpression
+of the regular expression.  For example, @samp{(a)\1} matches @samp{aa}.
+When used with alternation, if the group does not participate in the match,
+the back-reference makes the whole match fail.  For example, @samp{a(.)|b\1}
+will not match @samp{ba}.  When multiple regular expressions are given with
+@samp{-e} or from a file (@samp{-f file}), the back-references are local to
+each expression.
 
+@section Basic vs Extended
 @cindex basic regular expressions
+
 In basic regular expressions the metacharacters @samp{?}, @samp{+},
 @samp{@{}, @samp{|}, @samp{(}, and @samp{)} lose their special meaning;
 instead use the backslashed versions @samp{\?}, @samp{\+}, @samp{\@{},
@@ -1038,6 +1110,9 @@ ps -ef | grep '[c]ron'
 If the pattern had been written without the square brackets, it would
 have matched not only the @command{ps} output line for @command{cron},
 but also the @command{ps} output line for @command{grep}.
+Note that on some platforms @command{ps} limits the output to the width
+of the screen; @command{grep} has no limit on the length of a line
+except the available memory.
 
 @item
 Why does @command{grep} report ``Binary file matches''?
@@ -1077,6 +1152,56 @@ Use the special file name @samp{-}:
 @example
 cat /etc/passwd | grep 'alain' - /etc/motd
 @end example
+
+@item
+@cindex palindromes
+How to express palindromes in a regular expression?
+
+It can be done by using back-references; for example, a palindrome
+of 5 characters can be written in BRE.
+
+@example
+grep -w -e '\(.\)\(.\).\2\1' file
+@end example
+
+It matches the word "radar" or "civic".
+
+Guglielmo Bondioni proposed a single RE that finds all the palindromes up to 19
+characters long.
+
+@example
+egrep -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' file
+@end example
+
+Note that this relies on GNU ERE extensions; it might not be portable to
+other greps.
+
+@item
+Why do my expressions with the vertical bar fail?
+
+@example
+/bin/echo "ba" | egrep '(a)\1|(b)\1'
+@end example
+
+The first alternate branch fails, so the first group does not participate in
+the match; this makes the second alternate branch fail.  For example, "aaba"
+will match: the first group participates in the match and can be reused in the
+second branch.
+
+@item
+What do @command{grep, fgrep, egrep} stand for?
+
+grep comes from the way line editing was done on Unix.  For example,
+@command{ed} uses this syntax to print a list of matching lines on the screen.
+
+@example
+global/regular expression/print
+g/re/p
+@end example
+
+@command{fgrep} stands for Fixed @command{grep}, and @command{egrep} for Extended
+@command{grep}.
+
 @end enumerate
 
 @node Reporting Bugs
@@ -1090,7 +1215,7 @@ Large repetition counts in the @samp{@{m,n@}} construct may cause
 @command{grep} to use lots of memory.  In addition, certain other
 obscure regular expressions require exponential time and
 space, and may cause grep to run out of memory.
-Backreferences are very slow, and may require exponential time.
+Back-references are very slow, and may require exponential time.
 
 @page
 @node Concept Index
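
The back-reference rules documented in the diff above (back-references are local to each pattern, and a back-reference to a group that did not participate makes the branch fail) can be illustrated outside grep. The following is a sketch using Python's `re` module as an analogy, not grep's matcher: Python writes groups as `(...)` as in ERE, joining with `|` stands in for the old behaviour of handing concatenated patterns to the regex library, and `compile_patterns` and `matches_any` are hypothetical names.

```python
import re


def compile_patterns(patterns):
    """Compile each pattern on its own and keep the results in a list."""
    return [re.compile(p) for p in patterns]


def matches_any(compiled, line):
    """Try every compiled pattern in turn; the line matches if any one matches."""
    return any(rx.search(line) for rx in compiled)


patterns = [r"(a)\1", r"(b)\1"]

# Separate compilation: \1 in the second pattern refers to (b).
separate = compile_patterns(patterns)
print(matches_any(separate, "bb"))  # True

# One joined expression: groups are numbered across the whole expression,
# so \1 in the second branch refers to (a), which did not participate.
joined = re.compile("|".join(patterns))
print(joined.search("bb") is None)        # True: no match
print(joined.search("ba") is None)        # True: the FAQ case
print(joined.search("aaba") is not None)  # True: the first branch matches "aa"
```

Compiling per pattern keeps group numbering private to each expression, which is the behaviour the commit log says multiple `-e` arguments should have.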