Doc changes for [perl #89750]

author: Karl Williamson <public@khwilliamson.com> 2011-05-03 14:08:43 -0600
committer: Jesse Vincent <jesse@bestpractical.com> 2011-05-03 17:14:06 -0400
commit: 1f59b28370e2e2b18e56e01ba9cf10440343bcd1 (patch)
tree: 1e4b74e48d5bc2a0edc6f4d0d3db6502251c5c28
parent: 7b4a7e586ed8557b4b47ff04c789aa6a65b1c944 (diff)
download: perl-1f59b28370e2e2b18e56e01ba9cf10440343bcd1.tar.gz
3 files changed, 59 insertions, 4 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index ea22a00c22..4319436f54 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -56,6 +56,8 @@ This release provides full functionality for C<use feature
 'unicode_strings'>.  Under its scope, all string operations executed and
 regular expressions compiled (even if executed outside its scope) have
 Unicode semantics.  See L<feature/"the 'unicode_strings' feature">.
+However, see L</Inverted bracketed character classes and multi-character folds>,
+below.
 
 This feature avoids most forms of the "Unicode Bug" (see
 L<perlunicode/The "Unicode Bug"> for details).  If there is any
@@ -529,6 +531,29 @@ In addition to the sections that follow, see L</C API Changes>.
 
 =head2 Regular Expressions and String Escapes
 
+=head3 Inverted bracketed character classes and multi-character folds
+
+Some characters match a sequence of two or three characters in C</i>
+regular expression matching under Unicode rules.  One example is
+C<LATIN SMALL LETTER SHARP S> which matches the sequence C<ss>.
+
+ 'ss' =~ /\A[\N{LATIN SMALL LETTER SHARP S}]\z/i  # Matches
+
+This, however, can lead to very counter-intuitive results, especially
+when inverted.  Because of this, Perl 5.14 does not use multi-character C</i>
+matching in inverted character classes.
+
+ 'ss' =~ /\A[^\N{LATIN SMALL LETTER SHARP S}]+\z/i  # ???
+
+This should match any sequence of characters that are neither C<SHARP S>
+nor what C<SHARP S> matches under C</i>.  C<"s"> isn't C<SHARP S>, but
+Unicode says that C<"ss"> is what C<SHARP S> matches under C</i>.  So
+which one "wins"? Do you fail the match because the string has C<ss> or
+accept it because it has an C<s> followed by another C<s>?
+
+Earlier releases of Perl did allow this multi-character matching,
+but due to bugs, it mostly did not work.
+
 =head3 \400-\777
 
 In certain circumstances, C<\400>-C<\777> in regexes have behaved
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 12617e251a..c4ec417a1d 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -72,7 +72,11 @@ are split between groupings, or when one or more are quantified.  Thus
  # be even if it did!!
  "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i;      # Doesn't match!
 
-Also, this matching doesn't fully conform to the current Unicode
+Perl doesn't match multiple characters in an inverted bracketed
+character class, which otherwise could be highly confusing.  See
+L<perlrecharclass/Negation>.
+
+Also, Perl matching doesn't fully conform to the current Unicode C</i>
 recommendations, which ask that the matching be made upon the NFD
 (Normalization Form Decomposed) of the text.  However, Unicode is
 in the process of reconsidering and revising their recommendations.
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 4c91931cc1..2b76dfbe46 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -401,7 +401,7 @@ The third form of character class you can use in Perl regular expressions
 is the bracketed character class.  In its simplest form, it lists the characters
 that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
 This matches one of C<a>, C<e>, C<i>, C<o> or C<u>.  Like the other
-character classes, exactly one character is matched. To match
+character classes, exactly one character is matched.* To match
 a longer string consisting of characters mentioned in the character
 class, follow the character class with a L<quantifier|perlre/Quantifiers>.  For
 instance, C<[aeiou]+> matches one or more lowercase English vowels.
@@ -417,6 +417,19 @@ Examples:
                            # a single character.
  "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.
 
+ -------
+
+* There is an exception to a bracketed character class matching only a
+single character.  When the class is to match caselessly under C</i>
+matching rules, and a character inside the class matches a
+multiple-character sequence caselessly under Unicode rules, the class
+(when not L<inverted|/Negation>) will also match that sequence.  For
+example, Unicode says that the letter C<LATIN SMALL LETTER SHARP S>
+should match the sequence C<ss> under C</i> rules.  Thus,
+
+ 'ss' =~ /\A\N{LATIN SMALL LETTER SHARP S}\z/i             # Matches
+ 'ss' =~ /\A[aeioust\N{LATIN SMALL LETTER SHARP S}]\z/i    # Matches
+
 =head3 Special Characters Inside a Bracketed Character Class
 
 Most characters that are meta characters in regular expressions (that
@@ -525,13 +538,26 @@ It is also possible to instead list the characters you do not want to
 match. You can do so by using a caret (C<^>) as the first character in the
 character class. For instance, C<[^a-z]> matches any character that is not a
 lowercase ASCII letter, which therefore includes almost a hundred thousand
-Unicode letters.
+Unicode letters.  The class is said to be "negated" or "inverted".
 
 This syntax make the caret a special character inside a bracketed character
 class, but only if it is the first character of the class. So if you want
 the caret as one of the characters to match, either escape the caret or
 else not list it first.
 
+In inverted bracketed character classes, Perl ignores the Unicode rules
+that normally say that a given character matches a sequence of multiple
+characters under caseless C</i> matching, which otherwise could be
+highly confusing:
+
+ "ss" =~ /^[^\xDF]+$/ui;
+
+This should match any sequence of characters that are neither C<\xDF> nor
+what C<\xDF> matches under C</i>.  C<"s"> isn't C<\xDF>, but Unicode
+says that C<"ss"> is what C<\xDF> matches under C</i>.  So which one
+"wins"? Do you fail the match because the string has C<ss> or accept it
+because it has an C<s> followed by another C<s>?
+
 Examples:
 
  "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
@@ -765,7 +791,7 @@ C<\p{HorizSpace}> and \C<\p{XPosixBlank}>.  For example,
 C<\p{PosixAlpha}> can be written as C<\p{Alpha}>.  All are listed
 in L<perluniprops/Properties accessible through \p{} and \P{}>.
 
-=head4 Negation
+=head4 Negation of POSIX character classes
 X<character class, negation>
 
 A Perl extension to the POSIX character class is the ability to
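
The behaviour documented by this patch can be checked with a short script. The following is a minimal sketch, not part of the commit, and it assumes Perl 5.14 or later. The variable names and the verdict() helper are hypothetical and exist only for illustration. It contrasts a plain bracketed class, where SHARP S may match the sequence "ss" under /i, with the inverted class, where the multi-character fold is ignored and each "s" is tested on its own.

```perl
use strict;
use warnings;
use charnames ':full';   # lets \N{...} names resolve on perls older than 5.16

# Hypothetical helper, not part of the patch: label a match result.
sub verdict { return $_[0] ? "matches" : "does not match" }

# Plain class: per the documentation in this patch, SHARP S inside the
# class may match the two-character sequence "ss" under /i.
my $plain = 'ss' =~ /\A[\N{LATIN SMALL LETTER SHARP S}]\z/i;

# Inverted class: the multi-character fold is ignored, so each "s" is
# compared on its own. Neither "s" is SHARP S, so the class accepts both.
my $inverted = 'ss' =~ /\A[^\N{LATIN SMALL LETTER SHARP S}]+\z/i;

# SHARP S itself is still excluded by its own inverted class.
my $self = "\N{LATIN SMALL LETTER SHARP S}"
    =~ /\A[^\N{LATIN SMALL LETTER SHARP S}]+\z/i;

print "plain class vs 'ss':       ", verdict($plain),    "\n";
print "inverted class vs 'ss':    ", verdict($inverted), "\n";
print "inverted class vs SHARP S: ", verdict($self),     "\n";
```

Earlier releases may give different answers, since the patch notes that multi-character matching mostly did not work there.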