Work properly under UTF-8 LC_CTYPE locales

This large commit (sorry, I couldn't figure out how to split it up meaningfully) causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales.

As a side effect it resolves [perl #56820].

The basics are easy, but there were a lot of details, and one troublesome edge case, discussed below.

What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.

More work had to be done for regular expressions. There are three cases.

1) The character classes, such as \w and [[:punct:]], needed no extra work, as the changes fall out from the base work.

2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only above-Latin1 characters that match only themselves, the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly, if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on.

In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters, whose folds are locale-dependent, are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers needed more work. Also, the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances.

3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list.

The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case-insensitively match the CAP MU. This could be special-cased in regcomp and regexec, but I wanted to avoid that. Instead, the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
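
The global-flag mechanism described in the message can be pictured with a small standalone C sketch. This is only an illustration, not the code in the commit: `in_utf8_ctype_locale`, `note_ctype_locale()`, `is_alpha_latin1()` and `is_alpha_lc()` are hypothetical names, and deciding "is this a UTF-8 locale?" by looking at the suffix of the locale name is a simplifying assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the global flag (PL_in_utf8_CTYPE_locale in the
 * diff).  It is updated whenever the LC_CTYPE locale changes. */
static bool in_utf8_ctype_locale = false;

/* Hypothetical hook, called when the LC_CTYPE locale is changed to 'name'.
 * ASSUMPTION: a locale counts as UTF-8 if its name ends in ".UTF-8" or
 * ".utf8"; real detection is more involved than this. */
static void note_ctype_locale(const char *name)
{
    const char *dot = strrchr(name, '.');
    in_utf8_ctype_locale = dot != NULL
        && (strcmp(dot + 1, "UTF-8") == 0 || strcmp(dot + 1, "utf8") == 0);
}

/* Unicode alphabetic test restricted to code points 0-255. */
static bool is_alpha_latin1(unsigned c)
{
    assert(c <= 0xFF);
    if (c < 0x80)
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    /* U+00AA, U+00B5, U+00BA, and U+00C0-U+00FF except the two math signs
     * U+00D7 (multiplication) and U+00F7 (division) */
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* Locale-aware classification: Unicode rules when the flag is set, the C
 * library otherwise.  Anything that does not fit in an octet is FALSE. */
static bool is_alpha_lc(unsigned c)
{
    if (c > 0xFF)
        return false;
    return in_utf8_ctype_locale ? is_alpha_latin1(c) : isalpha((int) c) != 0;
}
```

After `note_ctype_locale("en_US.UTF-8")`, `is_alpha_lc(0xE9)` follows the Unicode rules and is true; after `note_ctype_locale("C")`, the same call defers to `isalpha()`.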
author: Karl Williamson <public@khwilliamson.com> 2014-01-27 15:35:00 -0700
committer: Karl Williamson <public@khwilliamson.com> 2014-01-27 23:03:48 -0700
commit: 31f05a37c4e9c37a7263491f2fc0237d836e1a80 (patch)
tree: 7537c7e179350243b3de0f3a99d6747c9c7812e6 /handy.h
parent: cea315b64e0c4b1890867df0c925cafc8823ba38 (diff)
download: perl-31f05a37c4e9c37a7263491f2fc0237d836e1a80.tar.gz
1 files changed, 46 insertions, 13 deletions
diff --git a/handy.h b/handy.h
index 0714d4ea6d..c65170a31a 100644
--- a/handy.h
+++ b/handy.h
@@ -523,12 +523,13 @@ Variant C<isFOO_utf8> is like C<isFOO_uni>, but the input is a pointer to a
 classification of just the first (possibly multi-byte) character in the string
 is tested.
 
-Variant C<isFOO_LC> is like the C<isFOO_A> and C<isFOO_L1> variants, but uses
-the C library function that gives the named classification instead of
-hard-coded rules.  For example, C<isDIGIT_LC()> returns the result of calling
-C<isdigit()>.  This means that the result is based on the current locale, which
-is what C<LC> in the name stands for.  FALSE is always returned if the input
-won't fit into an octet.
+Variant C<isFOO_LC> is like the C<isFOO_A> and C<isFOO_L1> variants, but the
+result is based on the current locale, which is what C<LC> in the name stands
+for.  If Perl can determine that the current locale is a UTF-8 locale, it uses
+the published Unicode rules; otherwise, it uses the C library function that
+gives the named classification.  For example, C<isDIGIT_LC()> when not in a
+UTF-8 locale returns the result of calling C<isdigit()>.  FALSE is always
+returned if the input won't fit into an octet.
 
 Variant C<isFOO_LC_uvchr> is like C<isFOO_LC>, but is defined on any UV.  It
 returns the same as C<isFOO_LC> for input code points less than 256, and
@@ -1241,18 +1242,24 @@ EXTCONST U32 PL_charclass[];
 #define toUPPER_LATIN1_MOD(c) ((! FITS_IN_8_BITS(c))                       \
                                ? (c)                                       \
                                : PL_mod_latin1_uc[ (U8) (c) ])
+#define IN_UTF8_CTYPE_LOCALE PL_in_utf8_CTYPE_locale
 
 /* Use foo_LC_uvchr() instead  of these for beyond the Latin1 range */
 
 /* For internal core Perl use only: the base macro for defining macros like
  * isALPHA_LC, which uses the current LC_CTYPE locale.  'c' is the code point
- * (0-255) to check.  'utf8_locale_classnum' is currently unused.  The code to
- * actually do the test this is passed in 'non_utf8'.  If 'c' is above 255, 0
- * is returned.  For accessing the full range of possible code points under
- * locale rules, use the macros based on _generic_LC_uvchr instead of this. */
+ * (0-255) to check.  In a UTF-8 locale, the result is the same as calling
+ * isFOO_L1(); the 'utf8_locale_classnum' parameter is something like
+ * _CC_UPPER, which gives the class number for doing this.  For non-UTF-8
+ * locales, the code to actually do the test this is passed in 'non_utf8'.  If
+ * 'c' is above 255, 0 is returned.  For accessing the full range of possible
+ * code points under locale rules, use the macros based on _generic_LC_uvchr
+ * instead of this. */
 #define _generic_LC_base(c, utf8_locale_classnum, non_utf8)                    \
            (! FITS_IN_8_BITS(c)                                                \
            ? 0                                                                 \
+           : IN_UTF8_CTYPE_LOCALE                                              \
+             ? cBOOL(PL_charclass[(U8) (c)] & _CC_mask(utf8_locale_classnum))  \
              : cBOOL(non_utf8))
 
 /* For internal core Perl use only: a helper macro for defining macros like
@@ -1275,15 +1282,41 @@ EXTCONST U32 PL_charclass[];
  * helper macros */
 #define _generic_toLOWER_LC(c, function, cast)  (! FITS_IN_8_BITS(c)           \
                                                 ? (c)                          \
+                                                : (IN_UTF8_CTYPE_LOCALE)       \
+                                                  ? PL_latin1_lc[ (U8) (c) ]   \
                                                 : function((cast)(c)))
 
+/* Note that the result can be larger than a byte in a UTF-8 locale.  It
+ * returns a single value, so can't adequately return the upper case of LATIN
+ * SMALL LETTER SHARP S in a UTF-8 locale (which should be a string of two
+ * values "SS");  instead it asserts against that under DEBUGGING, and
+ * otherwise returns its input */
 #define _generic_toUPPER_LC(c, function, cast)                                 \
                     (! FITS_IN_8_BITS(c)                                       \
                     ? (c)                                                      \
-                      : function((cast)(c)))
-
+                    : ((! IN_UTF8_CTYPE_LOCALE)                                \
+                      ? function((cast)(c))                                    \
+                      : ((((U8)(c)) == MICRO_SIGN)                             \
+                        ? GREEK_CAPITAL_LETTER_MU                              \
+                        : ((((U8)(c)) == LATIN_SMALL_LETTER_Y_WITH_DIAERESIS)  \
+                          ? LATIN_CAPITAL_LETTER_Y_WITH_DIAERESIS              \
+                          : ((((U8)(c)) == LATIN_SMALL_LETTER_SHARP_S)         \
+                            ? (__ASSERT_(0) (c))                               \
+                            : PL_mod_latin1_uc[ (U8) (c) ])))))
+
+/* Note that the result can be larger than a byte in a UTF-8 locale.  It
+ * returns a single value, so can't adequately return the fold case of LATIN
+ * SMALL LETTER SHARP S in a UTF-8 locale (which should be a string of two
+ * values "ss"); instead it asserts against that under DEBUGGING, and
+ * otherwise returns its input */
 #define _generic_toFOLD_LC(c, function, cast)                                  \
-                      _generic_toLOWER_LC(c, function, cast)
+                    (LIKELY((c) != MICRO_SIGN)                                 \
+                    ? (__ASSERT_(! IN_UTF8_CTYPE_LOCALE                        \
+                                 || (c) != LATIN_SMALL_LETTER_SHARP_S)         \
+                       _generic_toLOWER_LC(c, function, cast))                 \
+                    : (IN_UTF8_CTYPE_LOCALE)                                   \
+                      ? GREEK_SMALL_LETTER_MU                                  \
+                      : (c))
 
 /* Use the libc versions for these if available. */
 #if defined(HAS_ISASCII) && ! defined(USE_NEXT_CTYPE)
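
The nested ternaries in _generic_toUPPER_LC and _generic_toFOLD_LC may be easier to follow when written out as plain functions. The following is a rough standalone sketch of the same decision order, not a replacement for the macros: the flag `utf8_ctype_locale` and the helpers `lower_latin1()`, `upper_latin1()`, `to_upper_lc()` and `to_fold_lc()` are hypothetical names, the small range-based helpers stand in for Perl's lookup tables, and the DEBUGGING-only assertion on LATIN SMALL LETTER SHARP S is left out, so the sketch simply returns its input for that character.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

enum {
    MICRO_SIGN                 = 0xB5,
    SMALL_SHARP_S              = 0xDF,
    SMALL_Y_WITH_DIAERESIS     = 0xFF,
    CAPITAL_Y_WITH_DIAERESIS   = 0x178,
    GREEK_CAPITAL_LETTER_MU    = 0x39C,
    GREEK_SMALL_LETTER_MU      = 0x3BC
};

/* Hypothetical stand-in for IN_UTF8_CTYPE_LOCALE. */
static bool utf8_ctype_locale = false;

/* Unicode lowercase for 0-255 (stand-in for a lookup table). */
static unsigned lower_latin1(unsigned c)
{
    assert(c <= 0xFF);
    if ((c >= 'A' && c <= 'Z') || (c >= 0xC0 && c <= 0xDE && c != 0xD7))
        return c + 0x20;
    return c;
}

/* Unicode uppercase for those 0-255 characters whose uppercase is also in
 * 0-255; every other character is returned unchanged. */
static unsigned upper_latin1(unsigned c)
{
    assert(c <= 0xFF);
    if ((c >= 'a' && c <= 'z') || (c >= 0xE0 && c <= 0xFE && c != 0xF7))
        return c - 0x20;
    return c;
}

static unsigned to_upper_lc(unsigned c)
{
    if (c > 0xFF)
        return c;                           /* doesn't fit in 8 bits */
    if (!utf8_ctype_locale)
        return (unsigned) toupper((int) c); /* defer to the C library */
    if (c == MICRO_SIGN)
        return GREEK_CAPITAL_LETTER_MU;     /* result is above 255 */
    if (c == SMALL_Y_WITH_DIAERESIS)
        return CAPITAL_Y_WITH_DIAERESIS;    /* result is above 255 */
    if (c == SMALL_SHARP_S)
        return c;   /* "SS" can't be returned as a single value */
    return upper_latin1(c);
}

static unsigned to_fold_lc(unsigned c)
{
    if (c == MICRO_SIGN)
        return utf8_ctype_locale ? GREEK_SMALL_LETTER_MU : c;
    if (c > 0xFF)
        return c;
    return utf8_ctype_locale ? lower_latin1(c)
                             : (unsigned) tolower((int) c);
}
```

With `utf8_ctype_locale` set, `to_upper_lc(0xB5)` yields 0x39C and `to_fold_lc(0xB5)` yields 0x3BC, the two results that fall outside the 0-255 range for the MICRO SIGN; with the flag clear, `to_fold_lc(0xB5)` returns 0xB5 unchanged.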