author: Karl Williamson <public@khwilliamson.com> 2012-01-31 09:55:44 -0700
committer: Karl Williamson <public@khwilliamson.com> 2012-02-04 16:29:32 -0700
commit: b0b13adaa5af8ce6d9754d028537cbf8b370e3ab
tree: 3bbed1f1f22fab8c65e3f6931deb021bbcb2b604 /lib
parent: 4066e594fe94825b10f07a4bb94dfb8072e3405f
Unicode::UCD::prop_invmap() compress digit results
This changes the output of prop_invmap() for the Perl_Decimal_Digit property to use code point deltas, as other properties already do. Each run of ten consecutive digits now collapses into a single entry, so the output is about one-tenth of its previous size.
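The size reduction can be illustrated with a minimal sketch. It assumes the delta-form arrays shown in the example in the patched POD (hard-coded here, not obtained by calling prop_invmap()) and expands them back into the old one-entry-per-digit form so the two entry counts can be compared.

```perl
use strict;
use warnings;

# Delta-form inversion map, copied from the first rows of the example in
# the patched documentation. Illustrative data only: these arrays are not
# produced by calling Unicode::UCD::prop_invmap() here.
my @digits = (0x0000, 0x0030, 0x003A, 0x0660, 0x066A);
my @values = ("",     -48,    "",     -1632,  "");

# Expand every delta range into one entry per code point, which is how
# the pre-patch output was laid out.
my (@old_digits, @old_values);
for my $i (0 .. $#digits) {
    if ($values[$i] eq "") {
        # Non-digit range: a single entry in both forms.
        push @old_digits, $digits[$i];
        push @old_values, "";
        next;
    }
    # Digit range: runs up to, but not including, the next range start.
    # Assumes a digit range is never the last element of this sample.
    for my $cp ($digits[$i] .. $digits[$i + 1] - 1) {
        push @old_digits, $cp;
        push @old_values, $cp + $values[$i];
    }
}

printf "delta form: %d entries, expanded form: %d entries\n",
    scalar @digits, scalar @old_digits;
```

Only the digit ranges shrink; the empty-string entries that separate them are the same in both forms.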
Diffstat (limited to 'lib')
-rw-r--r-- lib/Unicode/UCD.pm 57
-rw-r--r-- lib/Unicode/UCD.t 9
-rw-r--r-- lib/unicore/mktables 11
3 files changed, 49 insertions, 28 deletions
diff --git a/lib/Unicode/UCD.pm b/lib/Unicode/UCD.pm
index 3473ecbe1a..f15b4180a6 100644
--- a/lib/Unicode/UCD.pm
+++ b/lib/Unicode/UCD.pm
@@ -2396,9 +2396,40 @@ for forcing the addition is to make the returned map array significantly more
compact. There is no such advantage to doing the same thing to the elements
that are lists, and the addition is extra work.
+=item B<C<ce>>
+
+This is like C<c>, but some elements are the empty string, so not all are
+integers.
+The one internal Perl property accessible by C<prop_invmap> is of this type:
+"Perl_Decimal_Digit" returns an inversion map which gives the numeric values
+that are represented by the Unicode decimal digit characters. Characters that
+don't represent decimal digits map to the empty string, like so:
+
+ @digits   @values
+ 0x0000    ""
+ 0x0030    -48
+ 0x003A    ""
+ 0x0660    -1632
+ 0x066A    ""
+ 0x06F0    -1776
+ 0x06FA    ""
+ 0x07C0    -1984
+ 0x07CA    ""
+ 0x0966    -2406
+ ...
+
+This means that the code points from 0 to 0x2F do not represent decimal digits;
+the code point 0x30 (DIGIT ZERO, =48 decimal) represents 48-48 = 0; code
+point 0x31, (DIGIT ONE), represents 49-48 = 1; ... code point 0x39, (DIGIT
+NINE), represents 57-48 = 9; ... code points 0x3A through 0x65F do not
+represent decimal digits; 0x660 (ARABIC-INDIC DIGIT ZERO, =1632 decimal),
+represents 1632-1632 = 0; ... 0x07C1 (NKO DIGIT ONE, = 1985), represents
+1985-1984 = 1 ...
+
=item B<C<cle>>
-means that some of the map array elements have the forms given by C<cl>, and
+is a combination of the C<cl> type and the C<e> type. Some of
+the map array elements have the forms given by C<cl>, and
the rest are the empty string. The property C<NFKC_Casefold> has this form.
An example slice is:
@@ -2490,27 +2521,6 @@ With this, C<charinrange()> will return C<undef> if its input code point maps
to C<$missing>. You can avoid this by omitting the C<next> statement, and adding
a line after the loop to handle the final element of the inversion map.
-One internal Perl property is accessible by this function.
-"Perl_Decimal_Digit" returns an inversion map in which all the Unicode decimal
-digits map to their numeric values, and everything else to the empty string,
-like so:
-
- @digits @values
- 0x0000 ""
- 0x0030 0
- 0x0031 1
- 0x0032 2
- 0x0033 3
- 0x0034 4
- 0x0035 5
- 0x0036 6
- 0x0037 7
- 0x0038 8
- 0x0039 9
- 0x003A ""
- 0x0660 0
- 0x0661 1
- ...
Note that the inversion maps returned for the C<Case_Folding> and
C<Simple_Case_Folding> properties do not include the Turkic-locale mappings.
@@ -3145,6 +3155,9 @@ RETRY:
# could
$format = 'sl';
}
+ elsif ($returned_prop eq 'ToPerlDecimalDigit') {
+ $format = 'ce';
+ }
elsif ($format ne 'n' && $format ne 'r') {
# All others are simple scalars
diff --git a/lib/Unicode/UCD.t b/lib/Unicode/UCD.t
index 89d2c596fe..6018638c42 100644
--- a/lib/Unicode/UCD.t
+++ b/lib/Unicode/UCD.t
@@ -1260,7 +1260,14 @@ foreach my $prop (keys %props) {
}
}
elsif ($format =~ /^ c /x) {
- if ($missing ne "0") {
+ if ($full_name eq 'Perl_Decimal_Digit') {
+ if ($missing ne "") {
+ fail("prop_invmap('$mod_prop')");
+ diag("The missings should be \"\"; got '$missing'");
+ next PROPERTY;
+ }
+ }
+ elsif ($missing ne "0") {
fail("prop_invmap('$mod_prop')");
diag("The missings should be '0'; got '$missing'");
next PROPERTY;
diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index 98898e8250..9b9dd7f194 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -10129,14 +10129,15 @@ END
Perl_Extension => 1,
Directory => $map_directory,
Type => $STRING,
- Range_Size_1 => 1,
+ To_Output_Map => $OUTPUT_DELTAS,
);
$Decimal_Digit->add_comment(join_lines(<<END
This file gives the mapping of all code points which represent a single
-decimal digit [0-9] to their respective digits. For example, the code point
-U+0031 (an ASCII '1') is mapped to a numeric 1. These code points are those
-that have Numeric_Type=Decimal; not special things, like subscripts nor Roman
-numerals.
+decimal digit [0-9] to their respective digits, but it uses a delta to
+make the table significantly smaller. For example, the code point U+0031 (an
+ASCII '1') is mapped to a numeric "-48", because 0x31 = 49, and 49 + -48 = 1.
+These code points are those that have Numeric_Type=Decimal; not special
+things, like subscripts nor Roman numerals.
END
));
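
The arithmetic described above (49 + -48 = 1) can be sketched as a lookup against a delta-form inversion map. This is a minimal illustration, not code from the commit: the arrays are hard-coded from the example in the patched POD and truncated at 0x07CA, and `digit_value` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample delta-form inversion map, copied from the documentation example
# and truncated after 0x07CA. Code points beyond that are therefore
# reported as non-digits by this sketch, even where the real table has
# further digit ranges.
my @digits = (0x0000, 0x0030, 0x003A, 0x0660, 0x066A,
              0x06F0, 0x06FA, 0x07C0, 0x07CA);
my @values = ("",     -48,    "",     -1632,  "",
              -1776,  "",     -1984,  "");

# Hypothetical helper: return the decimal value that $cp represents, or
# undef if $cp is not a decimal digit according to the sample map.
sub digit_value {
    my ($cp) = @_;

    # Binary search for the last range whose start is <= $cp.
    my ($lo, $hi) = (0, $#digits);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($digits[$mid] <= $cp) { $lo = $mid }
        else                      { $hi = $mid - 1 }
    }

    # The empty string marks a non-digit range; otherwise add the delta.
    my $delta = $values[$lo];
    return $delta eq "" ? undef : $cp + $delta;
}

for my $cp (0x0031, 0x0041, 0x0660, 0x07C1) {
    my $v = digit_value($cp);
    printf "U+%04X => %s\n", $cp, defined $v ? $v : "not a decimal digit";
}
```

Storing one delta per range keeps the lookup a single addition after the range is found, which is the same scheme the `c` format uses for other properties.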