author Karl Williamson <khw@cpan.org> 2018-03-23 13:43:56 -0600
committer Karl Williamson <khw@cpan.org> 2018-03-26 16:26:54 -0600
commit 8946fcd98c63bdc848cec00a1c72aaf232d932a1
tree d121082565c788d8cb4876b5d867ba701e3c4662 /embed.fnc
parent 608bdd1e9ade8e9ca6d2312c64b2b1c0a653eadc
Move UTF-8 case changing data into core

Prior to this commit, if a program wanted to compute the case-change of
a character above 0xFF, the C code would switch to perl, loading
lib/utf8_heavy.pl, then reading another file from disk, and then
creating a hash. Future references would use the hash, but the startup
cost is quite large. There are five case change types: uc, lc, tc, fc,
and simple fc. Only the first one encountered required loading
utf8_heavy, but each required switching to utf8_heavy and reading the
appropriate file from disk.

This commit changes these functions to use compiled-in C data
structures (inversion maps) to represent the data. Looking something up
requires a binary search instead of a hash lookup.

An individual hash lookup tends to be faster than a binary search, but
the differences are small for small sizes. I did some benchmarking some
years ago (see the commit message of
87367d5f9dc9bbf7db1a6cf87820cea76571bf1a), and the results were that for
fewer than 512 entries, the binary search was just as fast as a hash, if
not actually faster. Now I've done some more benchmarks on blead, using
the tool benchmark.pl, which wasn't available back then. The results
below indicate that the differences are minimal up through 2047 entries,
which all Unicode properties are well within.

A hash, PL_foldclosures, is still constructed at runtime for the case of
regular expression /i matching. It could be generated at Perl compile
time as a further enhancement for later, but reading a file from disk is
no longer required to do this.

======================= benchmarking results =======================

Key:
    Ir   Instruction read
    Dr   Data read
    Dw   Data write
    COND conditional branches
    IND  indirect branches
    _m   branch predict miss
    _m1  level 1 cache miss
    _mm  last cache (e.g. L3) miss
    -    indeterminate percentage (e.g. 1/0)

The numbers represent raw counts per loop iteration.

"\x{10000}" =~ qr/\p{CWKCF}/"

          swash  invlist  Ratio %
          fetch   search
         ------  -------  -------
    Ir   2259.0   2264.0     99.8
    Dr    665.0    664.0    100.2
    Dw    406.0    404.0    100.5
  COND    406.0    405.0    100.2
   IND     17.0     15.0    113.3

COND_m      8.0      8.0    100.0
 IND_m      4.0      4.0    100.0

 Ir_m1      8.9     17.0     52.4
 Dr_m1      4.5      3.4    132.4
 Dw_m1      1.9      1.2    158.3

 Ir_mm      0.0      0.0    100.0
 Dr_mm      0.0      0.0    100.0
 Dw_mm      0.0      0.0    100.0

These were constructed by using the file whose contents are below, which
uses the property in Unicode that currently has the largest number of
entries in its inversion list, > 1600. The test was run on blead -O2, no
debugging, no threads. Then the cut-off boundary for when we use a hash
vs an inversion list was changed from 512 to 2047, and the test was run
again. This yields the difference between a hash fetch and an inversion
list binary search.

===================== The benchmark file is below ===============

    no warnings 'once';

    my @benchmarks;

    push @benchmarks, 'swash' => {
        desc    => '"\x{10000}" =~ qr/\p{CWKCF}/"',
        setup   => 'no warnings "once"; my $re = qr/\p{CWKCF}/; my $a = "\x{10000}";',
        code    => '$a =~ $re;',
    };

    \@benchmarks;

Diffstat (limited to 'embed.fnc')
-rw-r--r-- embed.fnc 10
1 files changed, 6 insertions, 4 deletions
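
The commit message describes replacing a hash lookup with a binary search over a compiled-in inversion map. The following is a minimal sketch of that idea, not the code in perl's core: the table contents and the names `starts`, `map`, `range_index`, and `toy_lower` are all hypothetical, and the convention that a map value of 0 means "maps to itself" is an assumption made for this illustration. The auxiliary tables for multi-character mappings that appear in the `_to_utf8_case` signature below are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy inversion map (NOT perl's real tables).
 * starts[i] is the first code point of range i; range i extends up to
 * starts[i + 1] - 1, and the last range is open-ended.
 * map[i] is what starts[i] maps to; later code points in the same range
 * map to consecutive values. Assumption for this sketch: 0 means the
 * whole range maps to itself. */
static const uint32_t starts[] = { 0x0, 0x41, 0x5B, 0x10400, 0x10428 };
static const int32_t  map[]    = { 0,   0x61, 0,    0x10428, 0       };
#define N_RANGES (sizeof starts / sizeof starts[0])

/* Binary search: index of the last range whose start is <= cp.
 * starts[0] is 0, so every code point falls in some range. */
static size_t range_index(uint32_t cp)
{
    size_t lo = 0, hi = N_RANGES;   /* the answer lies in [lo, hi) */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (starts[mid] <= cp)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}

/* Look up the simple lowercase mapping of cp in the toy table. */
static uint32_t toy_lower(uint32_t cp)
{
    size_t i = range_index(cp);

    if (map[i] == 0)
        return cp;                  /* range maps to itself */
    return (uint32_t)map[i] + (cp - starts[i]);
}
```

With this toy table, `toy_lower(0x41)` returns `0x61` and `toy_lower(0x10400)` returns `0x10428`, while code points in a range whose map value is 0 come back unchanged. A table of n ranges needs about log2(n) comparisons per lookup, which is the cost the benchmark in the commit message compares with a hash fetch.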
diff --git a/embed.fnc b/embed.fnc
index 5adc705e4a..43fc31aa29 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1745,7 +1745,7 @@ EiMRn |UV |_invlist_len |NN SV* const invlist
EMiRn |bool |_invlist_contains_cp|NN SV* const invlist|const UV cp
EXpMRn |SSize_t|_invlist_search |NN SV* const invlist|const UV cp
EXMpR |SV* |_get_swash_invlist|NN SV* const swash
-EXMpR |HV* |_swash_inversion_hash |NN SV* const swash
+EXMpR |HV* |_swash_inversion_hash
#endif
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C)
EXpM |SV* |_get_regclass_nonbitmap_data \
@@ -1797,9 +1797,11 @@ s |UV |_to_utf8_case |const UV uv1 \
|NN const U8 *p \
|NN U8* ustrp \
|NULLOK STRLEN *lenp \
- |NN SV **swashp \
- |NN const char *normal \
- |NULLOK const char *special
+ |NN SV *invlist \
+ |NN const IV * const invmap \
+ |NULLOK const int * const * const aux_tables \
+ |NULLOK const U8 * const aux_table_lengths \
+ |NN const char * const normal
#endif
ApbmdD |UV |to_utf8_lower |NN const U8 *p|NN U8* ustrp|NULLOK STRLEN *lenp
AMp |UV |_to_utf8_lower_flags|NN const U8 *p|NULLOK const U8* e \