Doc improvements from P.E.

author: Alain Magloire <alainm@rcsm.ee.mcgill.ca> 2001-02-08 15:03:50 +0000
committer: Alain Magloire <alainm@rcsm.ee.mcgill.ca> 2001-02-08 15:03:50 +0000
commit: 67d86c7220648435fdfe63dbfcb5494180d1085c (patch)
tree: 88422478f83ec2ce9c22b36d345523242979bfab
parent: 5d6c5528ceaf61a916d667af1a54562c9caa7789 (diff)
download: grep-67d86c7220648435fdfe63dbfcb5494180d1085c.tar.gz
4 files changed, 138 insertions, 49 deletions
diff --git a/ChangeLog b/ChangeLog
index 9351a60a..b2576710 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,4 +1,10 @@
-2000-03-26  Paul Eggert  <eggert@twinsun.com>
+2000-04-06  Paul Eggert
+
+	* doc/grep.1, doc/grep.texi, NEWS: Improve the explanation of
+	locale-dependent behavior of range expressions.  Mention
+	LC_COLLATE, since this affects range expressions.
+
+2000-03-26  Paul Eggert
 
 	* Makefile.am (ACINCLUDE_INPUTS): Add decl.m4, inttypes_h.m4,
 	uintmax_t.m4, ulonglong.m4, xstrtoumax.m4.
@@ -27,7 +33,7 @@
 	New files, taken unchanged from textutils, fileutils, sh-utils
 	and/or tar.
 
-2000-03-23  Paul Eggert  <eggert@twinsun.com>
+2000-03-23  Paul Eggert
 
 	* src/search.c (Pcompile): Add support for NUL bytes in
 	Perl regular expressions.
diff --git a/NEWS b/NEWS
index 91d225d6..b96222ca 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,10 @@
+  - Bracket regular expressions like [a-z] are now locale-dependent,
+    as POSIX.2 requires.  For example, many locales sort characters in
+    dictionary order, and in these locales the regular expression
+    [a-d] is not equivalent to [abcd]; it might be equivalent to
+    [aBbCcDd], for example.  To obtain the traditional interpretation
+    of bracket expressions, you can use the C locale by setting the
+    LC_ALL environment variable to the value "C".
 
   - The new -P or --perl-regexp option tells grep to interpert the pattern as
     a Perl regular expression.
diff --git a/doc/grep.1 b/doc/grep.1
index b87f5412..a71c4f3c 100644
--- a/doc/grep.1
+++ b/doc/grep.1
@@ -12,7 +12,7 @@
 .de Id
 .ds Dt \\$4
 ..
-.Id $Id: grep.1,v 1.13 2001/02/08 05:33:57 alainm Exp $
+.Id $Id: grep.1,v 1.14 2001/02/08 15:03:50 alainm Exp $
 .TH GREP 1 \*(Dt "GNU Project"
 .SH NAME
 grep, egrep, fgrep \- print lines matching a pattern
@@ -395,11 +395,13 @@ a single character.  Most characters, including all letters and digits,
 are regular expressions that match themselves.  Any metacharacter with
 special meaning may be quoted by preceding it with a backslash.
 .PP
-A list of characters enclosed by
+A
+.I "bracket expression"
+is a list of characters enclosed by
 .B [
 and
-.B ]
-matches any single
+.BR ] .
+It matches any single
 character in that list; if the first character of the list
 is the caret
 .B ^
@@ -408,10 +410,32 @@ then it matches any character
 in the list.
 For example, the regular expression
 .B [0123456789]
-matches any single digit.  A range of characters
-may be specified by giving the first and last characters, separated
-by a hyphen.
-Finally, certain named classes of characters are predefined.
+matches any single digit.
+.PP
+Within a bracket expression, a
+.I "range expression"
+consists of two characters separated by a hyphen.
+It matches any single character that sorts between the two characters,
+inclusive, using the locale's collating sequence and character set.
+For example, in the default C locale,
+.B [a\-d]
+is equivalent to
+.BR [abcd] .
+Many locales sort characters in dictionary order, and in these locales
+.B [a\-d]
+is typically not equivalent to
+.BR [abcd] ;
+it might be equivalent to
+.BR [aBbCcDd] ,
+for example.
+To obtain the traditional interpretation of bracket expressions,
+you can use the C locale by setting the
+.B LC_ALL
+environment variable to the value
+.BR C .
+.PP
+Finally, certain named classes of characters are predefined within
+bracket expressions, as follows.
 Their names are self explanatory, and they are
 .BR [:alnum:] ,
 .BR [:alpha:] ,
@@ -428,8 +452,8 @@ and
 For example,
 .B [[:alnum:]]
 means
-.BR [0-9A-Za-z] ,
-except the latter form depends upon the \s-1POSIX\s0 locale and the
+.BR [0\-9A\-Za\-z] ,
+except the latter form depends upon the C locale and the
 \s-1ASCII\s0 character encoding, whereas the former is independent
 of locale and character set.
 (Note that the brackets in these class names are part of the symbolic
@@ -576,6 +600,29 @@ instead of reporting a syntax error in the regular expression.
 \s-1POSIX.2\s0 allows this behavior as an extension, but portable scripts
 should avoid it.
 .SH "ENVIRONMENT VARIABLES"
+Grep's behavior is affected by the following environment variables.
+.PP
+A locale
+.BI LC_ foo
+is specified by examining the three environment variables
+.BR LC_ALL ,
+.BR LC_\fIfoo\fP ,
+.BR LANG ,
+in that order.
+The first of these variables that is set specifies the locale.
+For example, if
+.B LC_ALL
+is not set, but
+.B LC_MESSAGES
+is set to
+.BR pt_BR ,
+then Brazilian Portuguese is used for the
+.B LC_MESSAGES
+locale.
+The C locale is used if none of these environment variables are set,
+or if the locale catalog is not installed, or if
+.B grep
+was not compiled with national language support (\s-1NLS\s0).
 .TP
 .B GREP_OPTIONS
 This variable specifies default options to be placed in front of any
@@ -593,28 +640,26 @@ Option specifications are separated by whitespace.
 A backslash escapes the next character,
 so it can be used to specify an option containing whitespace or a backslash.
 .TP
-\fBLC_ALL\fP, \fBLC_MESSAGES\fP, \fBLANG\fP
+\fBLC_ALL\fP, \fBLC_COLLATE\fP, \fBLANG\fP
 These variables specify the
-.B LC_MESSAGES
-locale, which determines the language that
-.B grep
-uses for messages.
-The locale is determined by the first of these variables that is set.
-American English is used if none of these environment variables are set,
-or if the message catalog is not installed, or if
-.B grep
-was not compiled with national language support (\s-1NLS\s0).
+.B LC_COLLATE
+locale, which determines the collating sequence used to interpret
+range expressions like
+.BR [a\-z] .
 .TP
 \fBLC_ALL\fP, \fBLC_CTYPE\fP, \fBLANG\fP
 These variables specify the
 .B LC_CTYPE
 locale, which determines the type of characters, e.g., which
 characters are whitespace.
-The locale is determined by the first of these variables that is set.
-The \s-1POSIX\s0 locale is used if none of these environment variables
-are set, or if the locale catalog is not installed, or if
+.TP
+\fBLC_ALL\fP, \fBLC_MESSAGES\fP, \fBLANG\fP
+These variables specify the
+.B LC_MESSAGES
+locale, which determines the language that
 .B grep
-was not compiled with national language support (\s-1NLS\s0).
+uses for messages.
+The default C locale uses American English messages.
 .TP
 .B POSIXLY_CORRECT
 If set,
diff --git a/doc/grep.texi b/doc/grep.texi
index caeba681..9595bdca 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -501,9 +501,20 @@ matching engine is used.  @xref{Grep Programs}.
 @section Environment Variables
 
 Grep's behavior is affected by the following environment variables.
+
+A locale @code{LC_@var{foo}} is specified by examining the three
+environment variables @env{LC_ALL}, @env{LC_@var{foo}}, and @env{LANG},
+in that order.  The first of these variables that is set specifies the
+locale.  For example, if @env{LC_ALL} is not set, but @env{LC_MESSAGES}
+is set to @samp{pt_BR}, then Brazilian Portuguese is used for the
+@code{LC_MESSAGES} locale.  The C locale is used if none of these
+environment variables are set, or if the locale catalog is not
+installed, or if @command{grep} was not compiled with national language
+support (@sc{nls}).
+
 @cindex environment variables
 
-@table @code
+@table @env
 
 @item GREP_OPTIONS
 @vindex GREP_OPTIONS
@@ -518,22 +529,17 @@ whitespace.  A backslash escapes the next character, so it can be used to
 specify an option containing whitespace or a backslash.
 
 @item LC_ALL
-@itemx LC_MESSAGES
+@itemx LC_COLLATE
 @itemx LANG
 @vindex LC_ALL
-@vindex LC_MESSAGES
+@vindex LC_COLLATE
 @vindex LANG
-@cindex language of messages
-@cindex message language
+@cindex character type
 @cindex national language support
 @cindex NLS
-@cindex translation of message language
-These variables specify the @code{LC_MESSAGES} locale, which determines
-the language that @command{grep} uses for messages.  The locale is determined
-by the first of these variables that is set.  American English is used
-if none of these environment variables are set, or if the message
-catalog is not installed, or if @command{grep} was not compiled with national
-language support (@sc{nls}).
+These variables specify the @code{LC_COLLATE} locale, which determines
+the collating sequence used to interpret range expressions like
+@samp{[a-z]}.
 
 @item LC_ALL
 @itemx LC_CTYPE
@@ -545,11 +551,22 @@ language support (@sc{nls}).
 @cindex national language support
 @cindex NLS
 These variables specify the @code{LC_CTYPE} locale, which determines the
-type of characters, e.g., which characters are whitespace.  The locale is
-determined by the first of these variables that is set.  The @sc{posix}
-locale is used if none of these environment variables are set, or if the
-locale catalog is not installed, or if @command{grep} was not compiled with
-national language support (@sc{nls}).
+type of characters, e.g., which characters are whitespace.
+
+@item LC_ALL
+@itemx LC_MESSAGES
+@itemx LANG
+@vindex LC_ALL
+@vindex LC_MESSAGES
+@vindex LANG
+@cindex language of messages
+@cindex message language
+@cindex national language support
+@cindex NLS
+@cindex translation of message language
+These variables specify the @code{LC_MESSAGES} locale, which determines
+the language that @command{grep} uses for messages.  The default C
+locale uses American English messages.
 
 @item POSIXLY_CORRECT
 @vindex POSIXLY_CORRECT
@@ -649,17 +666,31 @@ The fundamental building blocks are the regular expressions that match
 a single character.  Most characters, including all letters and digits,
 are regular expressions that match themselves.  Any metacharacter
 with special meaning may be quoted by preceding it with a backslash.
-A list of characters enclosed by @samp{[} and @samp{]} matches any
+
+@cindex bracket expression
+A @dfn{bracket expression} is a list of characters enclosed by @samp{[} and
+@samp{]}.  It matches any
 single character in that list; if the first character of the list is the
 caret @samp{^}, then it
 matches any character @strong{not} in the list.  For example, the regular
 expression @samp{[0123456789]} matches any single digit.
-A range of characters may be specified by giving the first
-and last characters, separated by a hyphen.
 
-Finally, certain named classes of characters are predefined, as follows.
+@cindex range expression
+Within a bracket expression, a @dfn{range expression} consists of two
+characters separated by a hyphen.  It matches any single character that
+sorts between the two characters, inclusive, using the locale's
+collating sequence and character set.  For example, in the default C
+locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.  Many locales sort
+characters in dictionary order, and in these locales @samp{[a-d]} is
+typically not equivalent to @samp{[abcd]}; it might be equivalent to
+@samp{[aBbCcDd]}, for example.  To obtain the traditional interpretation
+of bracket expressions, you can use the C locale by setting the
+@env{LC_ALL} environment variable to the value @samp{C}.
+
+Finally, certain named classes of characters are predefined within
+bracket expressions, as follows.
 Their interpretation depends on the @code{LC_CTYPE} locale; the
-interpretation below is that of the @sc{posix} locale, which is the default
+interpretation below is that of the C locale, which is the default
 if no @code{LC_CTYPE} locale is specified.
 
 @cindex classes of characters
@@ -743,7 +774,7 @@ Hexadecimal digits:
 
 @end table
 For example, @samp{[[:alnum:]]} means @samp{[0-9A-Za-z]}, except the latter
-depends upon the @sc{posix} locale and the @sc{ascii} character
+depends upon the C locale and the @sc{ascii} character
 encoding, whereas the former is independent of locale and character set.
 (Note that the brackets in these class names are
 part of the symbolic names, and must be included in addition to
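
Illustration (not part of the commit): the two rules documented by this patch can be sketched in Python. The first function follows the lookup order LC_ALL, LC_foo, LANG described in the new ENVIRONMENT VARIABLES text. The second expands a range expression against an explicit collating sequence. All names here (effective_locale, expand_range, DICTIONARY_ORDER, C_ORDER) are hypothetical and are not taken from grep's source. The "A a B b ..." ordering is an assumed dictionary-order collation, chosen because it reproduces the [aBbCcDd] example in the patch; real locales may sort differently. Treating an empty variable as unset is also an assumption, following POSIX rather than the wording of the patch.

```python
import string


def effective_locale(category, env):
    """Return the locale name used for `category`, e.g. "LC_COLLATE".

    Checks LC_ALL, then the category variable, then LANG, in that order.
    Assumption: an empty value counts as unset.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # None of the three variables is set: fall back to the C locale.
    return "C"


# Assumed dictionary-order collating sequence: A a B b C c ... Z z.
DICTIONARY_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)

# C locale with ASCII: characters collate by code value.
C_ORDER = "".join(chr(code) for code in range(128))


def expand_range(first, last, collation):
    """Return the characters that sort between first and last, inclusive."""
    low = collation.index(first)
    high = collation.index(last)
    return collation[low:high + 1]


if __name__ == "__main__":
    print(effective_locale("LC_MESSAGES", {"LC_MESSAGES": "pt_BR"}))  # pt_BR
    print(expand_range("a", "d", C_ORDER))           # abcd
    print(expand_range("a", "d", DICTIONARY_ORDER))  # aBbCcDd
```

With LC_ALL set to "C", effective_locale returns "C" for every category regardless of LC_COLLATE or LANG, which is why the patch recommends that setting to get the traditional interpretation of [a-d].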