path: root/regcharclass.h
Commit message (Author, Age; Files, Lines)
* Update information about using older Unicode (Karl Williamson, 2015-03-19; 1 file, -1/+1)
|
* mktables: Better work with earlier Unicodes (Karl Williamson, 2015-03-19; 1 file, -1/+1)
| Unicode adds new files to its character database from time to time in new versions of the Standard. mktables is supposed to be able to handle this when it knows about a file, but it is compiling a version of the Standard that predates that file's existence. It was not dealing properly with this situation.
* regen/regcharclass.pl: Need to rebuild when source files change (Karl Williamson, 2015-03-19; 1 file, -1/+45)
| Like regen/mk_invlists.pl, if any of various Unicode-related files change, we can't rely on the generated file remaining unchanged.
* Remove obsolete macros/tables for \X (Karl Williamson, 2015-02-19; 1 file, -437/+1)
| A previous commit changed how \X is implemented, and now we don't need these anymore.
* Add checksum to regcharclass.h (Father Chrysostomos, 2014-12-03; 1 file, -1/+3)
| and check that checksum in t/porting/regen.t. This makes the tests run faster.
* Use Unicode 7.0 (Karl Williamson, 2014-06-16; 1 file, -20/+28)
|
* Add some (UN)?LIKELY() to UTF8 handling (Karl Williamson, 2014-05-31; 1 file, -8/+8)
| It's very rare actually for code to be presented with malformed UTF-8, so give the compiler a hint about the likely branches.
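As a rough illustration of the branch hints this commit describes, the sketch below wraps the GCC/Clang `__builtin_expect` builtin in two macros and marks the malformed-input path as the rare one. The `SKETCH_` macros and `sketch_is_two_byte_char` are hypothetical names chosen for this example, not Perl's actual definitions, and only the two-byte case is handled.

```c
#include <stddef.h>

/* Hypothetical stand-ins for branch-hint macros; they fall back to the
 * plain condition on compilers without __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define SKETCH_LIKELY(x)   (x)
#  define SKETCH_UNLIKELY(x) (x)
#endif

/* Return 1 if buf (of length len) starts with a well-formed two-byte
 * UTF-8 character, else 0.  Malformed input is hinted as the rare case. */
int sketch_is_two_byte_char(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *)buf;

    if (len < 2 || s[0] < 0xC2 || s[0] > 0xDF)
        return 0;                       /* not a two-byte character at all */
    if (SKETCH_UNLIKELY((s[1] & 0xC0) != 0x80))
        return 0;                       /* malformed continuation byte: rare */
    return 1;
}
```

The hint does not change the result of the test; it only tells the compiler which side of the branch to lay out as the fall-through path.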
* utf8.h: Use new macro type from previous commit (Karl Williamson, 2014-05-31; 1 file, -12/+9)
| This allows for an efficient isUTF8_CHAR macro, which does its own length checking, and uses the UTF8_INVARIANT macro for the first byte. On EBCDIC systems this macro, which does a table lookup, is quite a bit more efficient than all the branches that would normally have to be done.
* regen/regcharclass.pl: Add new macro type with intermed checking (Karl Williamson, 2014-05-31; 1 file, -30/+18)
| This adds a new macro generation option for inputs that are checked elsewhere for buffer overflow, but otherwise need validity checks.
* regen/regcharclass.pl: Update to use EBCDIC utilities (Karl Williamson, 2014-05-31; 1 file, -0/+2184)
| This causes the generated regcharclass.h to be valid on all supported platforms.
* regen/regcharclass.pl: make a 'do' into a 'require' (Karl Williamson, 2014-05-31; 1 file, -1/+0)
| This is because a future commit will execute this code multiple times, and the library file should only be read once.
* Revert "regen/regcharclass.pl: Make more EBCDIC-friendly" (Karl Williamson, 2014-05-31; 1 file, -467/+451)
| This reverts commit c4c8e61502fd5289a080f20332c6e3f9f23ce6e2.
|
| It turns out that this scheme to bootstrap regcharclass.h onto a machine not running ASCII created too much manual labor getting things to work. A better solution is to cross compile on an ASCII machine for the target. Commit 6ff677df5d6fe0f52ca0b6736f8b5a46ac402943 created the infrastructure to do that, and this commit starts the process of changing regen/regcharclass.pl to use that.
* regen/regcharclass.pl: Improve the generated code (Karl Williamson, 2014-05-30; 1 file, -2/+2)
| This is a small improvement when a consecutive group of U8 code points begins at 0 or ends at 255. A U8 value cannot fall outside these end points, so there is no need to test for that end of the range. In several places this causes a mask operation to not be generated.
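A minimal sketch of the idea, assuming a range that ends at 255 and an input already typed as an 8-bit value: the upper-bound comparison can never fail, so the trimmed form gives the same answer. The function names are hypothetical and exist only for this illustration.

```c
/* Range 0xF0..0xFF tested on an 8-bit value, with both ends checked. */
int sketch_in_range_both_ends(unsigned char c)
{
    return 0xF0 <= c && c <= 0xFF;   /* second test is always true */
}

/* Same range with the redundant upper-bound test dropped. */
int sketch_in_range_trimmed(unsigned char c)
{
    return 0xF0 <= c;
}

/* Count the 8-bit values on which the two forms disagree. */
int sketch_range_mismatches(void)
{
    int c, bad = 0;

    for (c = 0; c <= 0xFF; c++) {
        if (sketch_in_range_both_ends((unsigned char)c)
            != sketch_in_range_trimmed((unsigned char)c))
            bad++;
    }
    return bad;
}
```

The same reasoning applies at the other end: a range starting at 0 needs no lower-bound test.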
* /x in patterns now includes all \p{PatWS} (Karl Williamson, 2014-05-30; 1 file, -12/+0)
| This brings Perl regular expressions more into conformance with Unicode. /x now accepts 5 additional characters as white space. Use of these characters as literals under /x has been deprecated since 5.18, so now we are free to change what they mean.
|
| This commit eliminates the static function that processes the old whitespace definition (and a generated macro that was used only for this), using the already existing one for the new definition. It refactors slightly the static function that skips comments to mesh better with the needs of its callers, and calls it in one place where before the code was essentially duplicated.
|
| p5p discussion starting in http://nntp.perl.org/group/perl.perl5.porters/214726 convinced me that the (?[ ]) comments should be terminated the same way as regular /x comments, and this was also done in this commit. No prior notice is necessary as this is an experimental feature.
* regcomp.c: Don't read past string-end (Karl Williamson, 2014-03-12; 1 file, -26/+0)
| In doing an audit of regcomp.c, and experimenting using Encode::_utf8_on(), I found this one instance of a regen/regcharclass.pl macro that could read beyond the end of the string if given malformed UTF-8. Hence we convert to use the 'safe' form. There are no other uses of the non-safe version, so we don't need to generate them.
* regen/regcharclass.pl: Don't generate unused macros (Karl Williamson, 2014-03-12; 1 file, -177/+9)
| Having these unused macros around just clutters up the header file.
* regen/regcharclass.pl: Don't generate unused macros (Karl Williamson, 2014-03-01; 1 file, -43/+0)
| The macros generated by these options are not needed in the core; generating them just clutters up the header file, and some will actually be forbidden by the next commit.
* Revert most of 3a8bbffbce: Avoid unnecessary malformed checking (Karl Williamson, 2014-03-01; 1 file, -83/+235)
| My thinking was muddled when I made that commit, and this reverts the essence of it. The theory was that since we have already processed the regex pattern, we don't need to check it for malformedness, hence we don't need the "safe" form of certain macros that check for and avoid running off the end of the buffer. It is true that we don't have to worry about malformedness indicating that the buffer is bigger than it really is, but these macros can match up to three well-formed characters, so we do have to make sure that all three are in the buffer, and that the input isn't just the first two at the buffer's end. This was caught by running valgrind.
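The point about multi-character matches can be sketched as follows, using a small hypothetical subset of fold targets ("ffi", "ff" and "ss"). The function name and the choice of sequences are assumptions for illustration only; the generated Perl macros are more general. Each candidate is tested only after confirming that enough bytes remain, so perfectly well-formed input that is cut off at the buffer's end is never read past.

```c
#include <stddef.h>

/* Return the byte length of the longest sequence from the hypothetical
 * set {"ffi", "ff", "ss"} found at the start of buf, or 0 if none.
 * Every read is guarded by a length check against the end of the buffer. */
size_t sketch_multi_char_fold_len(const char *buf, size_t len)
{
    const char *s = buf;
    const char *e = buf + len;

    if (e - s >= 3 && s[0] == 'f' && s[1] == 'f' && s[2] == 'i')
        return 3;
    if (e - s >= 2 && s[0] == 'f' && s[1] == 'f')
        return 2;
    if (e - s >= 2 && s[0] == 's' && s[1] == 's')
        return 2;
    return 0;
}
```

With the buffer "ffi" truncated to two bytes, the function reports the two-byte match and never touches the third byte, which is the situation valgrind flagged in the unchecked form.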
* regen/regcharclass.pl: Simplify generated safe macros (Karl Williamson, 2014-03-01; 1 file, -91/+59)
| This simplifies the generated macros that make sure there are no read errors. Prior to this commit, the code generated looked like
|
|     (e - s) > 3
|     ? see if things of at most length 4 match
|     : (e - s) > 2
|       ? see if things of at most length 3 match
|       : (e - s) > 1
|         ? see if things of at most length 2 match
|         : (e - s) > 0
|           ? see if things of at most length 1 match
|
| For things that are a single character, the ones greater than length 2 must be in UTF8, and their needed length can be determined by UTF8SKIP, so we can get rid of most of the (e-s) tests. This doesn't change the macros which can match multiple characters; that is harder to do.
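A hedged sketch of the simplified shape for a single-character class: the needed length is derived once from the first byte, one bounds test is made, and the remaining bytes are then compared without further `(e - s)` checks. `sketch_utf8_skip` is a hypothetical stand-in for UTF8SKIP, and the vertical-white-space set (U+000A..U+000D, U+0085, U+2028, U+2029) is used only as a convenient example.

```c
#include <stddef.h>

/* Expected byte length of a UTF-8 character, judged from its first byte.
 * Hypothetical stand-in for UTF8SKIP; stray continuation bytes report 2
 * and are rejected later by the byte comparisons. */
size_t sketch_utf8_skip(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* Return the byte length of a vertical-white-space character at s,
 * or 0 if there is none or it does not fit before e. */
size_t sketch_is_vertws_safe(const unsigned char *s, const unsigned char *e)
{
    size_t need;

    if (s >= e)
        return 0;
    need = sketch_utf8_skip(s[0]);
    if ((size_t)(e - s) < need)          /* the single bounds test */
        return 0;
    if (need == 1)
        return (0x0A <= s[0] && s[0] <= 0x0D) ? 1 : 0;
    if (need == 2)
        return (s[0] == 0xC2 && s[1] == 0x85) ? 2 : 0;
    if (need == 3)
        return (s[0] == 0xE2 && s[1] == 0x80
                && (s[2] == 0xA8 || s[2] == 0xA9)) ? 3 : 0;
    return 0;
}

/* Convenience wrapper taking a buffer and its length. */
size_t sketch_vertws_len(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *)buf;
    return sketch_is_vertws_safe(s, s + len);
}
```

A truncated U+2028 (only its first two bytes present) is rejected by the single bounds test rather than by a cascade of length comparisons.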
* regen/regcharclass.pl: Warn that macros are internal only (Karl Williamson, 2014-03-01; 1 file, -0/+2)
| This adds a comment to the generated file that the macros are not to be generally used.
* Work properly under UTF-8 LC_CTYPE locales (Karl Williamson, 2014-01-27; 1 file, -0/+77)
| This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales.
|
| As a side effect it resolves [perl #56820].
|
| The basics are easy, but there were a lot of details, and one troublesome edge case discussed below.
|
| What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
|
| More work had to be done for regular expressions. There are three cases.
|
| 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work.
|
| 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only above-Latin1 characters that match only themselves, the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on.
|
| In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work.
|
| Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances.
|
| 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list.
|
| The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case-insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
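For readers unfamiliar with inversion lists, here is a minimal sketch of a membership test. It assumes the common layout in which the list is a sorted array of code points, element 0 starts the first range inside the set, element 1 is the first code point after that range, and so on alternately. The function name, the array, and the `unsigned long` element type are hypothetical; this is not Perl's internal API.

```c
#include <stddef.h>

/* Example inversion list for the set A-Z and a-z:
 * in from 0x41, out from 0x5B, in from 0x61, out from 0x7B. */
const unsigned long sketch_ascii_letters[] = { 0x41, 0x5B, 0x61, 0x7B };

/* Return 1 if cp is in the set described by the inversion list. */
int sketch_invlist_contains(const unsigned long *list, size_t n,
                            unsigned long cp)
{
    size_t lo = 0, hi = n;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means the last boundary crossed switched the set on. */
    return (lo & 1) != 0;
}
```

The appeal of this representation is that a large set made of a few ranges needs only two entries per range, and lookup cost grows with the logarithm of the number of ranges.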
* Avoid unnecessary malformed checking (Karl Williamson, 2014-01-27; 1 file, -235/+83)
| regen/regcharclass.pl can create macros for use where we need to worry about the possibility of malformed UTF-8, and for where we don't. In the case of looking at regex patterns, the Perl core has complete control over generating them, and hence isn't generally going to create too short a buffer; if it does, it's a bug that will show up and get fixed. This commit changes to generate and use the faster macros that don't do bounds checking.
* Move an inversion list generation to mktables (Karl Williamson, 2014-01-27; 1 file, -0/+27)
| Prior to this patch, this was in regen/mk_invlists.pl, but future commits will want it to also be used by the header generated by regen/regcharclass.pl, so use a common source so the logic doesn't have to be duplicated.
* Upgrade to Unicode 6.3 (Karl Williamson, 2013-10-03; 1 file, -33/+14)
|
* regen/regcharclass.pl: Make more EBCDIC-friendly (Karl Williamson, 2013-08-29; 1 file, -559/+575)
| This commit changes the code generated by the macros so that they work right out-of-the-box on non-ASCII platforms for non-UTF-8 inputs. THEY ARE WRONG for UTF-8, but this is good enough to get perl bootstrapped onto the target platform, and regcharclass.pl can be run there, generating macros with correct UTF-8.
* Fix multi-char fold edge case (Karl Williamson, 2013-05-20; 1 file, -0/+6)
|     use locale;
|     fc("\N{LATIN CAPITAL LETTER SHARP S}")
|         eq fc("\N{LATIN SMALL LETTER LONG S}") x 2
|
| should return true, as the SHARP S folds to two 's's in a row, and the LONG S is an antique variant of 's', and folds to s. Until this commit, the expression was false.
|
| Similarly, the following should match, but didn't until this commit:
|
|     "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa
|
| The reason these didn't work properly is that in both cases the actual fold to 's' is disallowed. In the first case because of locale; and in the second because of /aa. And the code wasn't smart enough to realize that these were legal.
|
| The fix is to special case these so that the fold of sharp s (both capital and small) is two LONG S's under /aa; as is the fold of the capital sharp s under locale. The latter is user-visible, and the documentation of fc() now points that out. I believe this is such an edge case that no mention of it need be done in perldelta.
* regcharclass.h: Add macro for non-ASCII PATWS (Karl Williamson, 2013-01-23; 1 file, -0/+22)
| This will be used to deprecate uses of non-ASCII Pattern White Space.
* regcharclass.h: Add macro for finding pattern white space (Karl Williamson, 2013-01-11; 1 file, -0/+44)
| This Unicode property will be used in future commits.
* Rename property involved in \X matching, for clarity (Karl Williamson, 2012-12-16; 1 file, -3/+3)
| I was re-reading some code and got confused. This table matches just the first character of a sequence that may or may not contain others.
* make regcharclass generate submacros if necessary to keep them short (Yves Orton, 2012-12-06; 1 file, -63/+71)
| Some compilers can't handle unexpanded macros longer than something like 8000 characters. So we split up long ones into sub macros to work around the problem.
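The splitting can be pictured with a toy example. The macro names and the code point set below are arbitrary and hypothetical, chosen only to show the shape: each sub-macro definition stays short, and the top-level macro merely references the pieces.

```c
/* Each piece is a short definition in its own right. */
#define sketch_is_FOO_cp_part0(cp) \
    ( 0x09 == (cp) || 0x20 == (cp) || 0xA0 == (cp) )

#define sketch_is_FOO_cp_part1(cp) \
    ( 0x1680 == (cp) || ( 0x2000 <= (cp) && (cp) <= 0x200A ) )

/* The top-level definition is short too: it only names the pieces. */
#define sketch_is_FOO_cp(cp) \
    ( sketch_is_FOO_cp_part0(cp) || sketch_is_FOO_cp_part1(cp) )
```

Callers use `sketch_is_FOO_cp(cp)` exactly as they would a single long macro; only the length of each individual definition changes.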
* regexec.c: Use SPACE macros instead of swash (Karl Williamson, 2012-11-19; 1 file, -0/+59)
| This will avoid loading a swash when an above Latin1 code point is tested to see if it matches \s. The SPACE macro is quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The CCC_TRY_U macro in regexec.c could not be used for this, and I just expanded out what it would generate, modified to use the macro instead of a swash.
* Refactor is_XDIGIT_uni(), is_XDIGIT_utf8() and macros (Karl Williamson, 2012-11-19; 1 file, -0/+27)
| This adds macros to regen/regcharclass.pl that are usable as part of the is_XDIGIT_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
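To give a feel for how small such a macro can be, the sketch below tests a code point against the fullwidth hexadecimal digits, assuming these three ranges (U+FF10..U+FF19, U+FF21..U+FF26, U+FF41..U+FF46) are the only hex digits above Latin1. The macro name is hypothetical and the generated definition in regcharclass.h may be laid out differently.

```c
/* Hypothetical above-Latin1 hex digit test: three short range checks,
 * with no table or hash to load at run time. */
#define sketch_is_XDIGIT_cp_high(cp)                  \
    ( ( 0xFF10 <= (cp) && (cp) <= 0xFF19 )            \
   || ( 0xFF21 <= (cp) && (cp) <= 0xFF26 )            \
   || ( 0xFF41 <= (cp) && (cp) <= 0xFF46 ) )
```

Because the whole test is a handful of comparisons, it can be inlined wherever the classification is needed, which is what removes the function call mentioned in the commit.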
* Refactor is_BLANK_uni() and is_BLANK_utf8() macros (Karl Williamson, 2012-11-19; 1 file, -0/+34)
| This adds macros to regen/regcharclass.pl that are usable as part of the is_BLANK_foo() macros in handy.h, so that no function call need be done to handle above Latin1 input. These macros are quite small, and unlikely to grow over time, as Unicode has mostly finished adding white space equivalents to the Standard. The functions that implement these in utf8.c are also changed to use the macros instead of generating a swash. This should speed things up slightly, with less memory used over time as the swash fills.
* handy.h: Add isVERTWS_uni(), isVERTWS_utf8() (Karl Williamson, 2012-11-19; 1 file, -0/+12)
| These two macros match the same things as \v does in patterns. I'm leaving them undocumented for now.
* make regcharclass hash order deterministic (Yves Orton, 2012-11-17; 1 file, -2/+2)
|
* Eliminate test from generated cp macros (Yves Orton, 2012-11-17; 1 file, -4/+2)
| Sayeth Karl:
|
| In the _cp macros, the final test can be simplified:
|
|     /*** GENERATED CODE ***/
|     #define is_VERTWS_cp(cp) \
|     ( ( 0x0A <= cp && cp <= 0x0D ) || ( 0x0D < cp && \
|     ( 0x85 == cp || ( 0x85 < cp && \
|     ( 0x2028 == cp || ( 0x2028 < cp && \
|     0x2029 == cp ) ) ) ) ) )
|
| That 0x2028 < cp can be omitted and it will still mean the same thing.
|
| And So Be It.
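The equivalence is easy to check mechanically. The sketch below transcribes the macro from the commit message, adds a variant with the final `0x2028 < cp` test removed, and counts code points on which the two disagree. The `sketch_` names are hypothetical, and the arguments are parenthesized here for safety, which the quoted macro does not do.

```c
/* Form before the change, as quoted in the commit message. */
#define sketch_is_VERTWS_cp_before(cp) \
    ( ( 0x0A <= (cp) && (cp) <= 0x0D ) || ( 0x0D < (cp) && \
    ( 0x85 == (cp) || ( 0x85 < (cp) && \
    ( 0x2028 == (cp) || ( 0x2028 < (cp) && \
    0x2029 == (cp) ) ) ) ) ) )

/* Same test with the redundant "0x2028 < cp" dropped: a value equal
 * to 0x2029 is necessarily greater than 0x2028. */
#define sketch_is_VERTWS_cp_after(cp) \
    ( ( 0x0A <= (cp) && (cp) <= 0x0D ) || ( 0x0D < (cp) && \
    ( 0x85 == (cp) || ( 0x85 < (cp) && \
    ( 0x2028 == (cp) || 0x2029 == (cp) ) ) ) ) )

/* Count code points in 0..0x3000 on which the two forms disagree. */
int sketch_vertws_mismatches(void)
{
    unsigned long cp;
    int bad = 0;

    for (cp = 0; cp <= 0x3000; cp++) {
        if (!!sketch_is_VERTWS_cp_before(cp)
            != !!sketch_is_VERTWS_cp_after(cp))
            bad++;
    }
    return bad;
}
```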
* regen/regcharclass.pl: Generate better code for some macros (Karl Williamson, 2012-10-20; 1 file, -13/+13)
| This commit revamps the recently added function calculate_mask() to not just work to give a single mask/compare value for its input and fail if there are none, but to return a list of masks/compares when the set can be split up into subsets that each can be represented by a mask/compare. If this list taken as a whole yields fewer branches than what we get otherwise, it is better code, and is used. Said another way, what we had there before was all or nothing; this works to improve things even if we can't do it all.
* regen/regcharclass.pl: Change name of generated macro (Karl Williamson, 2012-10-16; 1 file, -2/+2)
| This changes the macro isMULTI_CHAR_FOLD() (non-utf8 version) from just generating ascii-range code points to generating the full Latin1 range. However there are no such non-ASCII values, so the macro expansion is unchanged. By changing the name, it becomes clearer in future commits that we aren't excluding things that we should be considering.
* regen/regcharclass.pl: Generate macros for multi-char fold sequences (Karl Williamson, 2012-10-09; 1 file, -0/+225)
| These will be used in future commits.
* regen/regcharclass.pl: improved optree generation (Yves Orton, 2012-10-03; 1 file, -12/+6)
| Karl Williamson noticed that we don't always deal with common suffixes in the most efficient way. This change reworks how we convert a trie to an optree so that common suffixes are always grouped together.
* remove test define from regen/regcharclass.pl (Yves Orton, 2012-09-29; 1 file, -16/+0)
|
* improve conditional folding logic in regen/regcharclass.pl (Yves Orton, 2012-09-29; 1 file, -122/+40)
|
* fix perl #115078, ternary folding logic failure (Yves Orton, 2012-09-29; 1 file, -4/+5)
|
* add a new define for testing perl #115078 (Yves Orton, 2012-09-29; 1 file, -0/+19)
| We don't have any easy way to test regen/regcharclass.pl currently. Perl #115078 is related to a bug in the _cleanup() routine which is fixed with the next patch.
* utf8.h: Remove some EBCDIC dependencies (Karl Williamson, 2012-09-13; 1 file, -0/+39)
| regen/regcharclass.pl has been enhanced in previous commits so that it generates as good code as these hand-defined macro definitions for various UTF-8 constructs. And, it should be able to generate EBCDIC ones as well. By using its definitions, we can remove the EBCDIC dependencies for them. It is quite possible that the EBCDIC versions were wrong, since they have never been tested. Even if regcharclass.pl has bugs under EBCDIC, it is easier to find and fix those in one place, than all the sundry definitions.
* regen/regcharclass.pl: Add optimization (Karl Williamson, 2012-09-13; 1 file, -39/+46)
| On UTF-8 input known to be valid, continuation bytes must be in the range 0x80 .. 0xBF. Therefore, any tests for being within those bounds will always be true, and may be omitted.
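A small sketch of why the test can be dropped, using the code points U+0080..U+00BF (encoded as 0xC2 followed by a continuation byte) as an assumed example class. In standard UTF-8 every continuation byte has the form 10xxxxxx, that is 0x80 through 0xBF, so on valid input a range test covering all of those values adds nothing. The function names are hypothetical.

```c
/* Match U+0080..U+00BF: lead byte 0xC2, then any continuation byte.
 * This form re-checks the continuation byte's range. */
int sketch_match_checked(const unsigned char *s)
{
    return s[0] == 0xC2 && 0x80 <= s[1] && s[1] <= 0xBF;
}

/* On input already known to be valid UTF-8, the range test on the
 * second byte is always true, so only the lead byte is examined. */
int sketch_match_valid_input(const unsigned char *s)
{
    return s[0] == 0xC2;
}

/* Count valid continuation bytes on which the two forms disagree. */
int sketch_continuation_mismatches(void)
{
    unsigned char buf[2];
    int b, bad = 0;

    buf[0] = 0xC2;
    for (b = 0x80; b <= 0xBF; b++) {
        buf[1] = (unsigned char)b;
        if (sketch_match_checked(buf) != sketch_match_valid_input(buf))
            bad++;
    }
    return bad;
}
```

The shortcut is only sound when validity has been established elsewhere; on unchecked input the first form is the one to use.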
* regen/regcharclass.pl: Extend previously added optimization (Karl Williamson, 2012-09-13; 1 file, -4/+4)
| A previous commit added an optimization to save a branch in the generated code at the expense of an extra mask when the input class has certain characteristics. This extends that to the case where sub-portions of the class have similar characteristics. The first optimization for the entire class is moved to right before the new loop that checks each range in it.
* regen/regcharclass.pl: Add an optimization (Karl Williamson, 2012-09-13; 1 file, -30/+30)
| Branches can be eliminated from the macros that are generated here by using a mask in cases where applicable. This adds checking to see if this optimization is possible, and applies it if so.
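One way to check whether a single mask applies is sketched below. It is an illustrative reconstruction, not the logic in regen/regcharclass.pl, and the function and array names are hypothetical. The idea: if the members of a set differ only in k bit positions and all 2^k combinations are present, then `c == a || c == b || ...` collapses to `(c & mask) == compare`.

```c
#include <stddef.h>

/* Example sets: 'A' and 'a' differ in one bit; 'a','b','c' do not form
 * a complete power-of-two group. */
const unsigned char sketch_set_Aa[]  = { 0x41, 0x61 };
const unsigned char sketch_set_abc[] = { 0x61, 0x62, 0x63 };

/* Try to express a set of n distinct 8-bit values as (c & mask) == compare.
 * Returns (mask << 8) | compare on success, or -1 if no single mask works. */
int sketch_single_mask(const unsigned char *set, size_t n)
{
    unsigned and_all = 0xFF, or_all = 0x00, diff, d, bits = 0;
    size_t i;

    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        and_all &= set[i];
        or_all  |= set[i];
    }
    diff = and_all ^ or_all;            /* bit positions that vary */
    for (d = diff; d; d >>= 1)          /* count the varying bits */
        bits += d & 1;
    if (n != ((size_t)1 << bits))       /* not every combination present */
        return -1;
    return (int)(((~diff & 0xFF) << 8) | and_all);
}
```

For the set {0x41, 0x61} this yields mask 0xDF and compare 0x41, so the two-way test `c == 0x41 || c == 0x61` becomes the single comparison `(c & 0xDF) == 0x41`.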
* Use macro not swash for utf8 quotemeta (Karl Williamson, 2012-09-13; 1 file, -0/+65)
| The rules for deciding whether an above-Latin1 code point should be quoted are now saved in a macro generated from a trie by regen/regcharclass.pl, and this macro is now used by pp.c to test these cases. This allows removal of a wrapper subroutine, and also there is no need for dynamic loading at run-time into a swash.
|
| This macro is about as big as I'm comfortable compiling in, but it saves the building of a hash that can grow over time, and removes a subroutine and interpreter variables. Indeed, performance benchmarks show that it is about the same speed as a hash, but it does not require having to load the rules in from disk the first time it is used.
* regen/regcharclass.pl: Generate macros for \X processing (Karl Williamson, 2012-09-13; 1 file, -0/+128)
| \X is implemented in regexec.c as a complicated series of property look-ups. It turns out that many of those are for just a few code points, and so can be more efficiently implemented with a macro than a swash. This generates those.