path: root/charclass_invlists.h
Commit message | Author | Age | Files | Lines
* charclass_invlists.h: Add regen entry | Karl Williamson | 2015-02-21 | 1 | -0/+1
    This missing entry is one used by t/porting/regen.t to see if the contents are up-to-date. I don't know why it didn't get added earlier, or why there aren't failures, except apparently on my machine, due to its not being there. I thought I took great care in getting it right.
* regen/mk_invlists.pl: Add tables for \b{sb} | Karl Williamson | 2015-02-19 | 1 | -1/+22982
    This single-line addition generates a very confused diff listing for the generated file.
* regen/mk_invlists.pl: Add tables for \b{wb} | Karl Williamson | 2015-02-19 | 1 | -1/+12147
    This single-line change generates a very confused diff listing for the generated file, so it is kept separate from the other \b{wb} commits.
* regen/mk_invlists.pl: Add GCB tables | Karl Williamson | 2015-02-19 | 1 | -1/+12005
    This will enable the next commit to add \b{gcb}. I separated this out from that commit because the diff output here is very confused, not accurately showing the underlying changes. Actually two data structures are being added for every character set, and nothing else changed.
* regen/mk_invlists.pl: Revamp #if generation | Karl Williamson | 2015-02-19 | 1 | -23/+71
    This changes where the symbols are defined to a single file each. This may save text space, depending on the compiler. The next commit will cause this header to be included in more places, so it becomes more important to do this.

    At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE. The code now is executed regardless of that. This is simpler, and previously there might have been the possibility of uninitialized memory being read, should re_comp.o be executed before regcomp.o.
* regen/mk_invlists.pl: Add capability for line break props | Karl Williamson | 2015-02-19 | 1 | -1/+1
    This is a partial implementation of a full inversion map generation capability, which is why some code is indented more than necessary -- in the future there will be things that use that. But this is sufficient for 5.22.

    This allows for the generation of tables to handle the Unicode line breaking properties, like GCB and WB. Future commits will actually use this capability.
* Unicode::UCD: Add charprops_all() public function | Karl Williamson | 2015-02-18 | 1 | -1/+1
* Unicode::UCD: Add charprop public function | Karl Williamson | 2015-02-18 | 1 | -1/+1
* Unicode::UCD::prop_value_aliases() Don't return invalid value | Karl Williamson | 2015-02-18 | 1 | -1/+1
    Prior to this commit, if you said prop_value_aliases("Any", "foo") it would return "foo". But there really aren't any synonyms for the "Any" property values, so it should return undef instead.
* Unicode::UCD: Pod corrections, clarifications | Karl Williamson | 2015-02-18 | 1 | -1/+1
* Unicode::UCD: Generalize for EBCDIC platforms | Karl Williamson | 2015-02-13 | 1 | -1/+1
* Unicode::UCD: Fix synopsis | Karl Williamson | 2015-02-10 | 1 | -1/+1
    Instead of using a constant code point in some of the lines, use the $variable that is used in other lines.

    Spotted by Dagfinn Ilmari Mannsåker
* Unicode::UCD: Add prop_values() function | Karl Williamson | 2015-02-10 | 1 | -1/+1
    This new function returns the input property's possible values.
* regen/mk_invlists.pl: Rename function | Karl Williamson | 2015-01-21 | 1 | -1/+1
    The new name more clearly reflects its input restrictions.
* regen/mk_invlists.pl: Do less work | Karl Williamson | 2015-01-21 | 1 | -1/+1
    We only need to reorder the native code points (0..255) for EBCDIC, so we can quit when we get there, by appropriately refactoring the code.
* regen/mk_invlists.pl: White-space only | Karl Williamson | 2015-01-21 | 1 | -1/+1
    Indent as a result of new block in the previous commit; reformat a comment.
* regen/mk_invlists.pl: Skip unnecessary work | Karl Williamson | 2015-01-21 | 1 | -1/+1
    This reorders the code points below 256 depending on the platform. However, all platforms have the same values for those above 255, so we can skip this code if the first code point (and hence all code points) being output isn't one of those affected.
* regen/mk_invlists.pl: output sorted | Karl Williamson | 2015-01-21 | 1 | -5868/+5868
    This will make it easier to see differences in future commits.
* regen/mk_invlists.pl: Output code points as hex | Karl Williamson | 2015-01-21 | 1 | -49375/+49375
    Unicode represents all code points as hex, so follow suit. I, for one, am used to seeing hex code points, and so eyeballing these makes more sense when they are in hex.
* Unicode::UCD: Allow internal properties in invmap() | Karl Williamson | 2015-01-21 | 1 | -1/+1
    This adds an undocumented way to get invmap() to return internal properties, like invlist(). This is intended only for Perl-core use.
* Unicode::UCD: pod nits | Karl Williamson | 2015-01-21 | 1 | -1/+1
* Correct dependencies for charclass_invlists.h | Father Chrysostomos | 2014-12-04 | 1 | -1/+42
    regen.t should fail if Unicode tables are updated and this header is not regenerated.

    See commit 713f4b7fa and the thread beginning at <20141204124705.472.qmail@lists-nntp.develooper.com>.
* Add checksum to charclass_invlists.h | Father Chrysostomos | 2014-12-03 | 1 | -1/+3
    and check that checksum in t/porting/regen.t. This makes the tests run faster.
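    The idea of storing a checksum in a generated header and comparing it in a test can be sketched as follows. This is only an illustration in Python, not the code used by regen/mk_invlists.pl or t/porting/regen.t: the hash algorithm (SHA-256), the "Checksum:" line format, and both function names are assumptions made for the example.

```python
import hashlib


def sources_checksum(sources):
    """Hash the contents of every input the generated file depends on.

    `sources` is an iterable of bytes objects (hypothetical stand-ins for
    the Unicode tables and the generating script).
    """
    digest = hashlib.sha256()
    for data in sources:
        digest.update(data)
    return digest.hexdigest()


def is_up_to_date(generated_text, sources):
    """Compare the checksum recorded in the generated text with a fresh one.

    Assumes the generator wrote a line containing "Checksum: <hex>"
    (a made-up format for this sketch).
    """
    expected = sources_checksum(sources)
    for line in generated_text.splitlines():
        if "Checksum:" in line:
            fields = line.split("Checksum:", 1)[1].split()
            return bool(fields) and fields[0] == expected
    return False  # no checksum recorded, so treat the file as stale
```

    Comparing a short recorded digest avoids regenerating the whole header just to see whether it would differ, which is one plausible reason such a check would make a test faster.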
* Use Unicode 7.0 | Karl Williamson | 2014-06-16 | 1 | -1320/+5224
* regen/mk_invlists.pl: Remove unnecessary #if's | Karl Williamson | 2014-05-31 | 1 | -333/+9
    Even though this file is not intended to be human consumable, it is annoying to see #if ... #endif #if ... where the #endif and #if could be consolidated. It turns out not to be hard to do that.
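    One way to picture the consolidation is to merge consecutive guarded blocks before printing them. The sketch below is in Python rather than the Perl of the real generator, and it assumes that "consolidated" means adjacent blocks guarded by the same condition; the function names and the (condition, lines) representation are hypothetical.

```python
def merge_adjacent_guards(blocks):
    """Merge consecutive blocks that share the same #if condition.

    `blocks` is a list of (condition, lines) pairs in output order.
    """
    merged = []
    for condition, lines in blocks:
        if merged and merged[-1][0] == condition:
            # Same guard as the previous block: extend it instead of
            # emitting "#endif" immediately followed by the same "#if".
            merged[-1][1].extend(lines)
        else:
            merged.append((condition, list(lines)))
    return merged


def render(blocks):
    """Render (condition, lines) pairs as C preprocessor text."""
    out = []
    for condition, lines in blocks:
        out.append(f"#if {condition}")
        out.extend(lines)
        out.append("#endif")
    return "\n".join(out)
```

    Only neighbouring blocks are merged, so the order of the generated declarations is unchanged.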
* regen/mk_invlists.pl: Update to use EBCDIC utilities | Karl Williamson | 2014-05-31 | 1 | -23/+35130
    This causes the generated charclass_invlists.h to be valid on all supported platforms.
* Move an inversion list generation to mktables | Karl Williamson | 2014-01-27 | 1 | -1/+1
    Prior to this patch, this was in regen/mk_invlists.pl, but future commits will want it to also be used by the header generated by regen/regcharclass.pl, so use a common source so the logic doesn't have to be duplicated.
* IDStart and IDCont no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -0/+2160
    These are the base names for various macros used in parsing identifiers. Prior to this patch, parsing a code point above Latin1 caused loading disk files. This patch causes all the information to be compiled into the Perl binary.
* Rmv PL_Posix_ptrs | Karl Williamson | 2014-01-09 | 1 | -225/+0
    Previous commits in this series have removed all uses of this global array. This completely removes it.

    Since it is a global, consideration need be given to possible uses of it outside the core. It has never been externally documented, and is an opaque structure whose internals have changed with every release. The functions used to access it are almost all static to regcomp.c; those few that aren't have been hidden from all but the few .c files that need to have access to them, via #if's.
* Remove PL_L1Posix_ptrs | Karl Williamson | 2014-01-09 | 1 | -224/+0
    This global array is no longer used, having been removed in previous commits in this series.

    Since it is a global, consideration need be given to possible uses of it outside the core. It has never been externally documented, and is an opaque structure whose internals have changed with every release. The functions used to access it are almost all static to regcomp.c; those few that aren't have been hidden from all but the few .c files that need to have access to them, via #if's.
* Compile in list of foldable code points | Karl Williamson | 2014-01-09 | 1 | -0/+240
    When constructing what matches code points under /i, Perl uses an inversion list of all the possible code points that participate in folds. This number is relatively few compared to the possible universe of code points, as most of the world's scripts aren't cased, and many characters in the scripts that do fold aren't foldable (such as punctuation).

    Prior to this commit, the list for the above-Latin1 code points was read in from disk if and only if needed. This commit causes the list to be added to read-only data in a C header, trading a little space in Perl's text segment for speed at execution. This will enable ripping out some code in this and future commits (offsetting the space used by this one).
* Compile in all POSIX class inversion lists | Karl Williamson | 2014-01-09 | 1 | -0/+8682
    This changes charclass_invlists.h to have the complete definitions for all the POSIX classes, like \w and [:alpha:]. Thus these won't have to be loaded off disk at run-time. Taking advantage of this will be done in stages in future commits.
* Upgrade to Unicode 6.3 | Karl Williamson | 2013-10-03 | 1 | -9/+3
* Add inversion list for U+80 - U+FF | Karl Williamson | 2013-09-24 | 1 | -0/+14
    This is the upper half of the Latin1 range. This simplifies some code very slightly, but will be of use in future commits.
* Fix off-by-one error in inversion lists. | Karl Williamson | 2013-07-16 | 1 | -72/+68
    The first commit of this topic branch added a dummy 0 element to the end of certain inversion lists to work around an off-by-one error. This commit makes the necessary changes to stop that error, and to remove the dummy element. SvCUR() and invlist_len() now are kept in sync.
* Reinstate "regcomp.c: Make C-array inversion lists const" | Karl Williamson | 2013-07-16 | 1 | -68/+68
    This reverts commit 18505f093a44607b687ae5fe644872f835f66313, which reverted 241136e0ed70738cccd6c4b20ce12b26231f30e5, thus reinstating the latter commit. It turns out that the error being chased down was not due to this commit.

    Its original message was:

        The inversion lists that are compiled into a C header are now const.
* Reinstate "regcomp.c: Move 2 hdr inversion fields to SV hdr" | Karl Williamson | 2013-07-16 | 1 | -102/+34
    This reverts commit 67434bafe4f2406e7c92e69013aecd446c896a9a, which reverted 4fdeca7844470c929f35857f49078db1fd124dbc, thus reinstating the latter commit. It turns out that the error being chased down was not due to this commit.

    Its original message was:

        This commit continues the process of separating the header area of inversion lists from the body. 2 more fields are moved out of the header portion of the inversion list, and into the header portion of the SV that contains it.
* Reinstate + fix "Revert "regcomp.c: Add a constant 0 element before inversion lists" " | Karl Williamson | 2013-07-16 | 1 | -102/+136
    This reverts commit de353015643cf10b437d714d3483c1209e079916 which reverted 533c4e2f08b42d977e5004e823d4849f7473d2d0, thus reinstating it, plus this commit adds a fix to get it to pass under Address Sanitizer.

    The root cause of the problem is that there are two measures of the length of an inversion list. One is SvCUR(), and the other is invlist_len(). The original commit caused these to get off-by-one in some cases. The ultimate solution is to only store one value, and return the other one based off that.

    Rather than redo the whole branch, I've taken an easier way out, which is to add a dummy element at the end of some inversion lists, so that they aren't off-by-one. Then the other patches from the original branch will be applied. Each will be tested with Address Sanitizer. Then the work to fix the underlying problem will be done.

    The original commit's message was:

        This commit is the first step to separating the header from the body of inversion lists. Doing so will allow the compiled-in inversion lists to be fully read-only.

        To invert an inversion list, one simply unshifts a 0 to the front of it if one is not there, and shifts off the 0 if it does have one.

        The current data structure reserves an element at the beginning of each inversion list that is either 0 or 1. If 0, it means the inversion list begins there; if 1, it means the inversion list starts at the next element. Inverting involves flipping this bit.

        This commit changes the structure so that there is an additional element just after the element that flips. This new element is always 0, and the flipping element now says whether the inversion list begins at the constant 0 element, or the one after that.

        Doing this allows the flipping element to be separated in later commits from the body of the inversion list, which will always begin with the constant 0 element. That means that the body of the inversion list can be const.
* Revert "regcomp.c: Add a constant 0 element before inversion lists" | Karl Williamson | 2013-07-04 | 1 | -140/+110
    This reverts commit 533c4e2f08b42d977e5004e823d4849f7473d2d0.

    This continues the backing out of this topic branch. A bisect shows that the first commit exhibiting an error is the first one in the branch.
* Revert "regcomp.c: Move 2 hdr inversion fields to SV hdr" | Karl Williamson | 2013-07-04 | 1 | -34/+102
    This reverts commit 4fdeca7844470c929f35857f49078db1fd124dbc.

    This continues the backing out of this topic branch. A bisect shows that the first commit exhibiting an error is the first one in the branch.
* Revert "regcomp.c: Make C-array inversion lists const" | Karl Williamson | 2013-07-04 | 1 | -68/+68
    This reverts commit 241136e0ed70738cccd6c4b20ce12b26231f30e5.

    This continues the backing out of this topic branch. A bisect shows that the first commit exhibiting an error is the first one in the branch.
* regcomp.c: Make C-array inversion lists const | Karl Williamson | 2013-07-03 | 1 | -68/+68
    The inversion lists that are compiled into a C header are now const.
* regcomp.c: Move 2 hdr inversion fields to SV hdr | Karl Williamson | 2013-07-03 | 1 | -102/+34
    This commit continues the process of separating the header area of inversion lists from the body. 2 more fields are moved out of the header portion of the inversion list, and into the header portion of the SV that contains it.
* regcomp.c: Add a constant 0 element before inversion lists | Karl Williamson | 2013-07-03 | 1 | -110/+140
    This commit is the first step to separating the header from the body of inversion lists. Doing so will allow the compiled-in inversion lists to be fully read-only.

    To invert an inversion list, one simply unshifts a 0 to the front of it if one is not there, and shifts off the 0 if it does have one.

    The current data structure reserves an element at the beginning of each inversion list that is either 0 or 1. If 0, it means the inversion list begins there; if 1, it means the inversion list starts at the next element. Inverting involves flipping this bit.

    This commit changes the structure so that there is an additional element just after the element that flips. This new element is always 0, and the flipping element now says whether the inversion list begins at the constant 0 element, or the one after that.

    Doing this allows the flipping element to be separated in later commits from the body of the inversion list, which will always begin with the constant 0 element. That means that the body of the inversion list can be const.
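    The "unshift a 0 / shift off the 0" rule above can be illustrated with a small model of an inversion list. This is a sketch in Python, not Perl's C implementation: it models only the list body as a sorted sequence of code points in which even-indexed elements start ranges that are in the set and odd-indexed elements start ranges that are not. The function names and the example list are hypothetical.

```python
from bisect import bisect_right


def invlist_contains(invlist, code_point):
    """Return True if code_point is in the set described by invlist."""
    # Count the elements <= code_point. An odd count means the most
    # recent boundary crossed was the start of an "in the set" range.
    return bisect_right(invlist, code_point) % 2 == 1


def invlist_invert(invlist):
    """Return the complement of invlist.

    If the list starts with 0, drop it; otherwise prepend a 0. Every
    range boundary then swaps between "starts in" and "starts out".
    """
    if invlist and invlist[0] == 0:
        return invlist[1:]
    return [0] + invlist


# Example: the ASCII digits, U+0030 through U+0039.
digits = [0x30, 0x3A]
non_digits = invlist_invert(digits)  # [0, 0x30, 0x3A]
```

    In this model inversion copies the list. The commit instead keeps a constant 0 at the front of the body and records separately whether the list begins at that 0 or at the element after it, so the body itself never has to be modified and can be const.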
* De-globalize regcomp inversion lists. | Craig A. Berry | 2012-10-26 | 1 | -34/+166
    These lists are declared at file scope so will be global unless made static. Actual use of these lists is via the various PL_xxx global variables that point to them and that (except for NonL1_Perl_Non_Final_Folds_invlist) are initialized in Perl_re_op_compile in regcomp.c (but not in its incarnation as ext/re/re_comp.c).

    So change the lists to be static, and also skip declaring and initializing them in ext/re/re_comp.c except for the one case that is actually used in the extension version.
* regen/mk_invlists.pl: Make list for multi-fold chars | Karl Williamson | 2012-10-14 | 1 | -0/+67
    This causes charclass_invlists.h to have a new list of all the characters whose fold is a sequence of more than one character.
* Add caching to inversion list searches | Karl Williamson | 2012-08-25 | 1 | -33/+66
    Benchmarking showed some speed-up when the result of the previous search in an inversion list is cached, thus potentially avoiding a search in the next call. This adds a field to each inversion list which caches its previous search result.
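    A rough model of the caching idea, again in Python rather than the C of regcomp.c: remember the index returned by the previous search and, if the next code point falls in the same range, return it without searching. The class, its field names, and the `full_searches` counter (added only so the effect is observable) are all hypothetical.

```python
from bisect import bisect_right


class CachedInvlist:
    """Inversion list that remembers the result of its previous search."""

    def __init__(self, elements):
        self.elements = list(elements)  # sorted range boundaries
        self.prev_index = None          # cached result of the last search
        self.full_searches = 0          # instrumentation for this sketch

    def search(self, code_point):
        """Return the index of the highest element <= code_point, or -1."""
        e = self.elements
        i = self.prev_index
        if i is not None and i >= 0:
            upper = e[i + 1] if i + 1 < len(e) else None
            if e[i] <= code_point and (upper is None or code_point < upper):
                return i  # same range as last time: no search needed
        self.full_searches += 1
        i = bisect_right(e, code_point) - 1
        self.prev_index = i
        return i

    def contains(self, code_point):
        i = self.search(code_point)
        # Even indexes start ranges that are in the set.
        return i >= 0 and i % 2 == 0
```

    The cache pays off when successive lookups hit the same range, which is plausible for text written mostly in one script; whether that matches the access pattern benchmarked in the commit is not stated in the log.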
* mktables: Generate tables for chars that aren't in final fold pos | Karl Williamson | 2012-08-02 | 1 | -0/+52
    This starts with the existing table that mktables generates that lists all the characters in Unicode that occur in multi-character folds, and aren't in the final positions of any such fold.

    It generates data structures with this information to make it quickly available to code that wants to use it. Future commits will use these tables.
* Experimentally add VT to \s definition | Karl Williamson | 2012-05-22 | 1 | -6/+2
    This commit is the minimal necessary to get \s to match the vertical tab. It is being done early in the 5.17 series in order to see what repercussions there might be from doing this. It may well be that we decide that this change will require a 'use feature' to activate. In any event there is significant documentation of the behavior without the VT that this patch does not address at all.

    Tom Christiansen asked Larry Wall why \s did not include VT, and reported that Larry replied that he did not remember, but had no objections to adding it.
* Patch [perl #111400] [:upper:] broken for above Latin1 | Karl Williamson | 2012-02-28 | 1 | -6/+4
    This was an off-by-one error caused by my failing to realize that things had to be done differently at the 255/256 boundary depending on whether U+00FF matched or did not match the property.

    Two properties were affected, [:upper:] and [:punct:]. The bug was that all code points above the first one > 255 that legitimately matches the property would match whether or not they should. In the case of [:upper:], this meant that effectively anything from 256..infinity matched. For [:punct:], it was anything above U+037D.