path: root/regcomp.h
Commit message  Author  Age  Files  Lines
...
* \s matching VT is no longer experimental  Karl Williamson  2015-02-21  1  -2/+0
| | | | | | | This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* Add \b{sb}  Karl Williamson  2015-02-19  1  -0/+1
|
* Add qr/\b{wb}/  Karl Williamson  2015-02-19  1  -1/+2
|
* Add qr/\b{gcb}/  Karl Williamson  2015-02-19  1  -0/+5
| | | | | | | | | | | A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup. This makes the implementation of \X go from being complicated to trivial.
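The trade-off between a rule-walking function and an array lookup can be sketched as follows. This is only an illustration under stated assumptions: the class set is a heavily reduced, hypothetical subset of the Unicode Grapheme_Cluster_Break values, the rules shown are just the CR/LF, control, and Extend rules, and none of the names are taken from the Perl source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced set of grapheme-cluster-break classes; the real
 * Unicode property has more values than these. */
typedef enum {
    GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT
} gcb_class;

/* Function form: walk the rules in priority order. */
static bool is_gcb_by_rules(gcb_class before, gcb_class after)
{
    if (before == GCB_CR && after == GCB_LF)
        return false;                       /* never break inside CR LF */
    if (before == GCB_CR || before == GCB_LF || before == GCB_CONTROL)
        return true;                        /* always break after controls */
    if (after == GCB_CR || after == GCB_LF || after == GCB_CONTROL)
        return true;                        /* always break before controls */
    if (after == GCB_EXTEND)
        return false;                       /* never break before Extend */
    return true;                            /* otherwise break */
}

/* Table form: the same decisions, precomputed.  Rows are the class of the
 * character before the boundary, columns the class of the one after. */
static const bool gcb_table[GCB_COUNT][GCB_COUNT] = {
    /* after:      OTHER  CR     LF     CONTROL EXTEND */
    /* OTHER   */ { true,  true,  true,  true,   false },
    /* CR      */ { true,  true,  false, true,   true  },
    /* LF      */ { true,  true,  true,  true,   true  },
    /* CONTROL */ { true,  true,  true,  true,   true  },
    /* EXTEND  */ { true,  true,  true,  true,   false },
};

static bool is_gcb_by_table(gcb_class before, gcb_class after)
{
    assert(before < GCB_COUNT && after < GCB_COUNT);
    return gcb_table[before][after];
}
```

Both forms give the same answer for every pair of classes in this reduced set; the table form replaces the chain of comparisons with one indexed load.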
* Corrections to spelling and grammatical errors.  Lajos Veres  2015-01-28  1  -1/+1
| | | | Extracted from patch submitted by Lajos Veres in RT #123693.
* regcomp.h: Clarify comment  Karl Williamson  2015-01-21  1  -1/+1
|
* Add regex nodes for locale  Karl Williamson  2014-12-29  1  -1/+9
| | | | | These will be used in a future commit to distinguish between /l patterns vs non-/l.
* Eliminate unused BACK regnode  Aaron Crane  2014-09-29  1  -3/+1
|
* regcomp.c: Add a function and use it  Karl Williamson  2014-09-29  1  -0/+7
| | | | | | | This adds a function to allocate a regnode with 2 32-bit arguments, and uses it, rather than the ad-hoc code that did the same thing previously. This is in preparation for this code being used in a 2nd place in a future commit.
* regcomp.h: Add comment  Karl Williamson  2014-09-29  1  -1/+1
|
* regcomp.h: Remove obsolete #defines  Karl Williamson  2014-09-29  1  -5/+0
| | | | These internal definitions are no longer used.
* regcomp.h: Use existing macro instead of reinventing  Karl Williamson  2014-09-29  1  -2/+2
|
* Add tests for a51d618a fix of RT #122283  Yves Orton  2014-09-28  1  -0/+3
| | | | | | | | | | | | | | | | | | | | | Add a new re debug mode for outputting stuff useful for testing. In this case we count the number of times that we go through study_chunk. With a51d618a we should do 5 times (or less) when we traverse the test pattern. Without a51d618a we recurse 11 times. In the case of RT #122283 we would do gazillions of recursions, so many I never let it run to finish. / (?(DEFINE)(?<foo>foo)) (?(DEFINE)(?<bar>(?&foo)bar)) (?(DEFINE)(?<baz>(?&bar)baz)) (?(DEFINE)(?<bop>(?&baz)bop)) /x I say "or less" because you could argue that since these defines are never called, we should not actually recurse at all, and should maybe just compile this as a simple empty pattern.
* change NODE_ALIGN_FILL to set flags to 0  Yves Orton  2014-09-17  1  -1/+10
| | | | | | | | | | | | In 075abff3 Andy Lester set the flags field of regops to default to 0xde. I find this really weird, and possibly dangerous, as it seems to me reasonable to assume a new regop would have this field set to 0, so that later on code can set it to something else if necessary. (Which is what I wanted to do.) Since nothing breaks if I set it to 0x0 and I find that to be a much more natural default than 0xde (the prefix of 0xdeadbeef), I am changing this to set it to 0.
* Eliminate the duplicative regops BOL and EOL  Yves Orton  2014-09-17  1  -6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | See also perl5porters thread titled: "Perl MBOLism in regex engine" In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda) the BOL regop was split into two behaviours MBOL and SBOL, with SBOL and BOL behaving identically. Similarly the EOL regop was split into two behaviors SEOL and MEOL, with EOL and SEOL behaving identically. This then resulted in various duplicative code related to flags and case statements in various parts of the regex engine. It appears that perhaps BOL and EOL were kept because they are the type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl to handle aliases for the type data so that SBOL/MBOL are of type BOL, even though BOL == SBOL seems to cover that case without adding to the confusion. This means two regops, a regstate, and an internal regex flag can be removed (and used for other things), and various logic relating to them can be removed. For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and MBOL is /^/m. (I consider it a fail we have no way to say MBOL without the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is also a /\z/ which is EOS "end of string" with or without the /m).
* regcomp.h: Comment nits  Karl Williamson  2014-09-03  1  -2/+2
|
* Allow for changing size of bracketed regex char class  Karl Williamson  2014-09-03  1  -1/+14
| | | | | | | | This commit allows Perl to be compiled with a bitmap size that is larger than 256. This bitmap is used to directly look up whether a character matches or not, without having to do a binary search or hash lookup. It might improve the performance for some installations that have a lot of use of scripts that are above the Latin1 range.
* Rename some internal regex #defines  Karl Williamson  2014-09-03  1  -23/+24
| | | | | | | | | These are renamed to be more clear as to their actual meanings. I know other people have been confused by their former names. Some of the name changes will become more important as future commits will allow the bitmap in a bracketed character class to be a different size.
* regcomp.h: Remove some no-longer used #defines  Karl Williamson  2014-09-03  1  -10/+0
| | | | This is an internal header, so can change names within it.
* regcomp.h: Use unsigned 1 in left shift  Karl Williamson  2014-09-03  1  -2/+2
| | | | | | This prevents a signed result if this macro ever gets used in a U8. The ANYOF_BITMAP_TEST macro must now be cast or it would generate warnings when compiled with -DPERL_BOOL_AS_CHAR.
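The signedness point can be illustrated with a minimal sketch. The macro names below (SKETCH_BIT, SKETCH_BYTE, SKETCH_SET, SKETCH_TEST) are hypothetical stand-ins and not the actual definitions in regcomp.h; they only show a mask built from an unsigned 1 and a test whose result is converted explicitly to a truth value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */

/* Mask for the bit of code point c within its byte.  Writing 1U rather
 * than 1 keeps the shifted value unsigned. */
#define SKETCH_BIT(c)      (1U << ((c) & 7))

/* The byte of bitmap p that holds the bit for code point c. */
#define SKETCH_BYTE(p, c)  ((p)[(c) >> 3])

/* Set the bit for c; the mask is narrowed back to a byte on purpose. */
#define SKETCH_SET(p, c)   (SKETCH_BYTE(p, c) |= (uint8_t)SKETCH_BIT(c))

/* Test the bit for c; the masked value is turned into a bool explicitly
 * instead of relying on an implicit narrowing conversion. */
#define SKETCH_TEST(p, c)  ((bool)((SKETCH_BYTE(p, c) & SKETCH_BIT(c)) != 0))

static bool sketch_has(const uint8_t *bitmap, unsigned c)
{
    return SKETCH_TEST(bitmap, c);
}
```

With a 32-byte array as the bitmap, SKETCH_SET(bm, 'a') followed by sketch_has(bm, 'a') reports a match, while sketch_has(bm, 'b') does not.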
* regcomp.h: Fix comment that said the opposite of the truth  Karl Williamson  2014-09-03  1  -1/+1
| | | | Too many negations led to this.
* regex: Use #define for number of bits in ANYOF  Karl Williamson  2014-08-21  1  -3/+8
| | | | | | | ANYOF nodes (for bracketed character classes) currently are for code points 0-255. This is the first step in eventually making that size configurable. This also renames a static function, as the domain may not necessarily be 'latin1'.
* regcomp.c: Make SSC node clone safe  Karl Williamson  2014-03-12  1  -9/+13
| | | | | | This just sets the ptr field in the Synthetic Start Class that will be passed to regexec.c to NULL, and clarifies the comments in regcomp.h. See the thread starting at http://markmail.org/message/2txwaqnjco6zodeo
* regcomp.c: Fix more alignment problems  Karl Williamson  2014-02-19  1  -20/+16
| | | | | | | | | | | | | | | | | | | | | | | | | I believe this will fix the remaining alignment problems recently being shown on gcc on HP-UX. It works on the procura machine. regnodes should not have stricter alignment than required by U32, for reasons given in the comments this commit adds to the beginning of regcomp.h. Commit 31f05a37 added a new ANYOF regnode struct with a pointer field. This requires stricter alignment on some 64-bit platforms, and hence doesn't work on those platforms. This commit removes that regnode struct type, and instead stores the pointer it used via a more indirect, but already existing mechanism that stores other data. The function that returns that other data is enlarged to return this new field as well. It now needs to be called from regcomp.c, so the previous commit had renamed and made it accessible from there. The "public" function that wraps this one is unchanged. (I put "public" in quotes here, because I don't think anyone outside core is or should be using it, but since it has been publicly available for a long time, I'm treating the API as unchangeable. regcomp.c called this public function before this commit, but needs the additional data returned by the inner one).
* regcomp.h: Allow compiler to perform calculation  Karl Williamson  2014-02-19  1  -1/+1
| | | | | | | | Instead of doing the calculation of how many bytes a 256 bitmap occupies, let the compiler do it. I believe we are not too far away from having the ability to allow applications to recompile Perl to increase the bitmap size trading speed for memory. ICU has an 8192 bitmap last time I checked.
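Letting the compiler derive the byte count from the bit count might look like the sketch below. The names (SKETCH_BITMAP_BITS, sketch_class, and so on) are assumptions made for this example and are not the identifiers used in regcomp.h; the fallback for code points beyond the bitmap is reduced to "no match" to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical configuration knob: how many code points the bitmap covers. */
#ifndef SKETCH_BITMAP_BITS
#  define SKETCH_BITMAP_BITS 256
#endif

/* Derived by the compiler instead of being typed in as a literal 32. */
#define SKETCH_BITMAP_BYTES (SKETCH_BITMAP_BITS / 8)

_Static_assert(SKETCH_BITMAP_BITS % 8 == 0,
               "bitmap size must be a whole number of bytes");

typedef struct {
    uint8_t bits[SKETCH_BITMAP_BYTES];
} sketch_class;

/* Add a code point that lies inside the bitmap's range. */
static void sketch_class_add(sketch_class *cls, uint32_t cp)
{
    assert(cp < SKETCH_BITMAP_BITS);
    cls->bits[cp >> 3] |= (uint8_t)(1U << (cp & 7));
}

/* Direct lookup for code points inside the bitmap.  Anything beyond it is
 * reported as not matched here; a real engine would consult a slower
 * structure for those instead. */
static bool sketch_class_has(const sketch_class *cls, uint32_t cp)
{
    if (cp >= SKETCH_BITMAP_BITS)
        return false;
    return (cls->bits[cp >> 3] >> (cp & 7)) & 1U;
}
```

Changing SKETCH_BITMAP_BITS at build time resizes the array and the range check together, which is the memory-for-speed trade the commit message describes.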
* Change method of passing some info from regcomp to regexec  Karl Williamson  2014-02-19  1  -14/+6
| | | | | | | | | | | | | | For the last several releases, the fact that an ANYOF node could match something outside its bitmap has been passed to regexec.c by having its ARG field not be -1 (appropriately cast). A bit was set if the match could occur even if the target string was not UTF-8 encoded. This design was used to save a bit, as previously there was a bit also for it matching UTF-8 strings. That design is no longer tenable, as a future commit will have a third (independent) reason for something to match outside the bitmap. This commit uses the current spare bit flag to indicate if the match can only occur if the target string is UTF-8.
* regcomp.h: Remove extraneous comment  Karl Williamson  2014-02-19  1  -7/+0
| | | | | This is obsolete and is a partial copy of the up-to-date comment below it.
* regcomp.h: Free up flag bit in ANYOF nodes  Karl Williamson  2014-02-19  1  -10/+8
| | | | The ANYOF_LOC bit was removed from final use in the previous commit.
* regexes: Remove uses of ANYOF_LOCALE flag  Karl Williamson  2014-02-19  1  -4/+2
| | | | | | | | | | | | | This flag no longer adds any useful information and can be removed. An ANYOF node that depends on locale either matches a POSIX class like /d, or matches case insensitively, or both. There are flags for both these cases, and to see if something matches locale, one merely needs to see if either flag is set. Not having to keep track of this extra flag simplifies things, and will allow it to be removed. There was a time when this flag was shared with one of the remaining locale ones, and there was relict code that allowed that sharing to be reinstated, and which this commit also removes.
* regcomp.c: Simplify /l Synthetic Start Class construction  Karl Williamson  2014-02-19  1  -3/+12
| | | | | | | | | | | | | | | The ANYOF_POSIXL flag is needed in general for ANYOF nodes to indicate if the struct contains an extra U32 element used to hold the list of POSIX classes (like \w and [:punct:]) whose matches depend on the locale in effect at the time of runtime pattern matching. But the SSC always contains this U32, and so doesn't need to use the flag. Instead, if there aren't any such classes, the U32 will be zero. Removing keeping track of this flag during the assembly of the SSC simplifies things. At the completion of this process, this flag is set if the U32 is non-zero to pass that information on to regexec.c so that it doesn't have to special case things.
* Revert "Free up bit for regex ANYOF nodes"  Karl Williamson  2014-02-15  1  -5/+21
| | | | | This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and adds comments referring to it, in case it is ever needed.
* Free up bit for regex ANYOF nodes  Karl Williamson  2014-02-15  1  -16/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | This commit frees up a bit by using an extra regnode to pass the information to the regex engine instead of the flag. I originally thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL bit, as that might speed some things up. But if we need to do this again by adding another node to get another bit, we want one that is mutually exclusive of the first one we did. Otherwise we start having to make 3 nodes instead of two to get the combinations: 1 0, 0 1, 1 1. This combinatorial problem is avoided by using bits that are mutually exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this commit, ANYOF_NON_UTF8_NON_ASCII_ALL, is only set under /d matching, and there are other bits that are set only under /l, so if we need to do this again, we should use one of those. I wrote this code when I thought I really needed a bit. But since then, I have figured out a better way to get the bit needed now. But I don't want to lose this code to posterity, so this commit is being made long enough to get the commit number, then it will be reverted, adding comments referring to the commit number, so that it can easily be reconstructed when necessary.
* regcomp.h: Rmv false comments  Karl Williamson  2014-02-12  1  -4/+4
| | | | I misread the code when I added these comments
* eliminate RXf_ANCH_SINGLE  David Mitchell  2014-02-07  1  -2/+2
| | | | | | | | | This macro defines two flag bits: #define PREGf_ANCH_SINGLE (PREGf_ANCH_SBOL|PREGf_ANCH_GPOS) but it is only used twice in core (and not on CPAN), doesn't really add any value, and increases cognitive complexity.
* Add RXf_UNBOUNDED_QUANTIFIER and regexp->maxlen  Yves Orton  2014-02-03  1  -0/+2
| | | | | | | | | The flag tells us that a pattern may match an infinitely long string. The new member in the regexp struct tells us how long the string might be. With these two items we can implement regexp based $/
* rename REG_SEEN_WHATEVER to REG_WHATEVER_SEEN to match RXf_ and PREGf_ convention  Yves Orton  2014-01-31  1  -12/+11
|
* Move the RXf_ANCH flags to intflags as PREGf_ANCH_xxx and add RXf_IS_ANCHORED as a replacement  Yves Orton  2014-01-31  1  -2/+9
| | | | | | | | | | The only requirement outside of the regex engine is to identify that there is an anchor involved at all. So we move the 4 anchor flags to intflags and replace it with a single aggregate flag RXf_IS_ANCHORED in extflags. This frees up another 3 bits in extflags.
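The idea of keeping detailed anchor bits in an internal flag word and exposing one aggregate bit externally can be sketched like this. Every name and bit value below is hypothetical; in particular the four internal anchor kinds are placeholders and do not reproduce the real PREGf_ or RXf_ constants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical internal (engine-private) anchor bits. */
#define SK_INT_ANCH_SBOL   (1U << 0)
#define SK_INT_ANCH_MBOL   (1U << 1)
#define SK_INT_ANCH_GPOS   (1U << 2)
#define SK_INT_ANCH_OTHER  (1U << 3)   /* placeholder for the fourth kind */
#define SK_INT_ANCH_ANY    (SK_INT_ANCH_SBOL | SK_INT_ANCH_MBOL | \
                            SK_INT_ANCH_GPOS | SK_INT_ANCH_OTHER)

/* Hypothetical external flag: one bit instead of four. */
#define SK_EXT_IS_ANCHORED (1U << 0)

typedef struct {
    uint32_t intflags;   /* seen only by the engine */
    uint32_t extflags;   /* seen by code outside the engine */
} sk_regexp;

/* Record anchor details internally and publish the aggregate bit. */
static void sk_set_anchor(sk_regexp *rx, uint32_t anchor_bits)
{
    rx->intflags |= (anchor_bits & SK_INT_ANCH_ANY);
    if (rx->intflags & SK_INT_ANCH_ANY)
        rx->extflags |= SK_EXT_IS_ANCHORED;
}

/* Outside callers only need to know whether any anchor is involved. */
static int sk_is_anchored(const sk_regexp *rx)
{
    return (rx->extflags & SK_EXT_IS_ANCHORED) != 0;
}
```

Four external bits collapse into one, which matches the three freed bits the commit message mentions.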
* move RXf_GPOS_SEEN and RXf_GPOS_FLOAT to intflags  Yves Orton  2014-01-31  1  -1/+3
| | | | | | | | This required removing the RXf_GPOS_CHECK mask as it uses one flag that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't know why we test both. Further investigation required.
* Rename RXf_CANY_SEEN to PREGf_CANY_SEEN and move from extflags to intflags  Yves Orton  2014-01-31  1  -0/+1
|
* move RXf_NOSCAN from extflags to intflags as PREGf_NOSCAN  Yves Orton  2014-01-31  1  -0/+5
| | | | | Includes some improvements to how we dump regexps so that when a regexp is for the standard perl engine we also show the intflags for the engine
* regcomp.c: Change a variable and flag bit names  Karl Williamson  2014-01-27  1  -1/+1
| | | | | The meaning of these was expanded two commits ago, so update the name to reflect this, to prevent future confusion
* Work properly under UTF-8 LC_CTYPE locales  Karl Williamson  2014-01-27  1  -3/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below. What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale. More work had to be done for regular expressions. There are three cases. 1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work. 2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only characters above-Latin1 that match only themselves, that the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on. 
In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work. Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances. 3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list. The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. 
Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
* Rename regex internal flag bit  Karl Williamson  2014-01-22  1  -1/+1
| | | | | This is a clearer name; is used internally only in regcomp.c and regexec.c
* Use bit instead of node for regex SSC  Karl Williamson  2014-01-22  1  -4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flag bits in regular expression ANYOF nodes are perennially in short supply. However there are still plenty of regex nodes possible. So one solution to needing to pass more information is to create a node that encapsulates what is needed. That is what commit 9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a particular ANYOF node is for the synthetic start class (SSC). However this solution introduces other issues. If you have to express two things, then you need a regnode for A, a regnode for B, a regnode for both A and B, and another regnode for neither A nor B. With three things, you need 8 regnodes to express all possible combinations. This becomes unwieldy to write code for. The number of combinations goes way down if some of them are mutually exclusive. At the time of that commit, I thought that a SSC need not ever warn if matching against an above-Unicode code point. I was wrong, and that has been corrected earlier in the 5.19 series. But it finally came to me how to tell regexec that an ANYOF node is for the SSC without taking up a flag bit and without requiring a regnode type. The 'next_off' field in a regnode tells the engine the offset in the regex program to the node it's supposed to go to after processing this one. Since the SSC stands alone, its 'next_off' field is unused, and we can put anything we want in it. That, however, is not true of other ANYOF regnodes. But it turns out that there are certain values that will never be legitimate in the 'next_off' field in these, and so this commit uses one of those to signal that this ANYOF field is an SSC. regnodes come in various sizes, and the offset is in terms of how many of the smallest ones are there to the next node to look at. Since ANYOF nodes are large, the offset is always > 1, and so this commit uses 1 to indicate an SSC.
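Both halves of that argument can be sketched briefly: the count of node types needed for independent yes/no properties, and the reuse of an otherwise impossible 'next_off' value as a marker. The struct layout and every name below are assumptions for illustration, with only the field name 'next_off' and the marker value 1 taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node header; only the 'next_off' name comes from the text. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;  /* distance to the next node, in smallest-node units */
} sk_node;

/* A large node can never legitimately have its successor one unit away,
 * so 1 is free to act as the "this is the SSC" marker. */
#define SK_SSC_MARKER 1

/* Node types needed to encode n independent yes/no properties purely by
 * node type: every combination needs its own type. */
static unsigned sk_node_types_needed(unsigned n_properties)
{
    return 1U << n_properties;
}

static void sk_mark_ssc(sk_node *n)
{
    n->next_off = SK_SSC_MARKER;
}

static bool sk_is_ssc(const sk_node *n)
{
    return n->next_off == SK_SSC_MARKER;
}
```

Two properties need four node types and three need eight, whereas the marker approach costs neither a flag bit nor a node type.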
* regcomp.h: Reorder some #defines  Karl Williamson  2013-12-31  1  -8/+8
| | | | | | There are no logic changes. The previous commit changed the numbers for some of the bits. This commit re-arranges things so that the #defines are again in numerical order.
* Re-order some flag bits to avoid potential branches  Karl Williamson  2013-12-31  1  -3/+4
| | | | | | | | | | | The ANYOF_INVERT flag is used in every single pattern match of [bracketed character classes]. With backtracking, this can be a huge number. All the other flags' uses pale by comparison. I noticed that by making it the lowest bit, we don't have to use cBOOL, as the only possibilities are 0 and 1. cBOOL hopefully will be optimized away, but not always. This commit reorders some of the flag bits to make this one the lowest, and adds a compile check to make sure it isn't inadvertently changed.
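Why the lowest bit avoids a boolean conversion can be seen in a short sketch. The names here are hypothetical, and the static assertion only stands in for the compile check mentioned above; it is not the form used in the Perl source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout with the invert flag as the lowest bit. */
#define SK_ANYOF_INVERT 0x01
#define SK_ANYOF_OTHER  0x02   /* placeholder for any other flag */

/* Compile-time guard: the trick below only works while the flag is bit 0. */
_Static_assert(SK_ANYOF_INVERT == 1, "invert flag must stay the lowest bit");

/* in_bitmap must already be 0 or 1.  Because the masked flag is also
 * exactly 0 or 1, a plain XOR yields the final answer with no conversion
 * to a boolean and no branch. */
static int sk_class_matches(uint8_t flags, int in_bitmap)
{
    assert(in_bitmap == 0 || in_bitmap == 1);
    return in_bitmap ^ (flags & SK_ANYOF_INVERT);
}
```

If the flag lived in a higher bit, the masked value would be something like 0x40 and would need normalizing to 0 or 1 before the XOR.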
* Output regex above-Unicode matching in syn strt class  Karl Williamson  2013-12-31  1  -1/+1
| | | | | | | A warning is supposed to be raised under some conditions when matching an above-Unicode code point against a Unicode property. Prior to this patch, if the synthetic start class excluded the code point, the warning would be skipped, even though it was attempted to be matched.
* Convert regnode to a flag for [...]  Karl Williamson  2013-12-31  1  -4/+6
| | | | | | | | | | | | | | | | | | Prior to this commit, there were 3 types of ANYOF nodes; now there are two: regular, and one for the synthetic start class (ssc). This commit converted the third type dealing with warning about matching \p{} against non-Unicode code points, into using the spare flag bit for ANYOF nodes. This allows this bit to apply to ssc ANYOF nodes, whereas previously it couldn't. There is a bug in which the warning isn't raised if the match is rejected by the optimizer, because of this inability. This bug will be fixed in a later commit. Another option would have been to create a new node-type which was an ANYOF_SSC_WARN_SUPER node. But this adds extra complications to things; and we have a spare bit that we might as well use. The comments give better possibilities for freeing up 2 bits should they be needed.
* regcomp.c: Split #define into two  Karl Williamson  2013-12-31  1  -0/+5
| | | | | | | | | | | | The synthetic start class regnode (SSC) and a bracketed character class node share much of the same data structure, including a flags field, and some of the same flag bits within it. Currently, only locale-related flags (under /l rules) are the same between the two during construction of the SSC. But a future commit will introduce another common flag. This commit creates an extra #define for use where we want the common flags, while retaining the existing one for use where we want the locale flags. The new #define is just a copy of the existing one, to be changed in the future commit.
* Avoid pointer churn in study_chunk recursion bitmap allocation  Yves Orton  2013-11-24  1  -0/+1
| | | | | | | | | | | | | | | | Since we can only recurse into a given paren (or the entire pattern) once, we know that the maximum recursion depth is the number of parens in the pattern (plus one for "whole pattern"). This means we can preallocate one large bitmap, and then use different chunks of it for each level. That avoids SAVEFREEPV costs for each bitmap, which are likely short anyway. (One could imagine an optimization where a flag somewhere lets us use the RExC_study_chunk_recursed pointer as a bitmap, so we don't have to allocate all when we have less than 32 parens.) This removes the "recursed" argument from study_chunk() and replaces it with a "recursive_depth" argument which counts how deep we are in the bitmap "stack".
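The preallocation scheme described above can be sketched as one block of memory sliced into per-depth bitmaps. All names are hypothetical and the sketch uses plain calloc/free rather than Perl's own allocation and save-stack machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One allocation holding a small bitmap for every recursion depth. */
typedef struct {
    uint8_t *bits;
    size_t   bytes_per_level;  /* one bit per paren, plus the whole pattern */
    size_t   levels;           /* maximum recursion depth */
} sk_recursion_map;

/* Index 0 stands for the whole pattern, 1..nparens for the groups, so
 * there are nparens + 1 possible targets and as many possible depths. */
static bool sk_map_init(sk_recursion_map *m, size_t nparens)
{
    m->levels = nparens + 1;
    m->bytes_per_level = (nparens + 1 + 7) / 8;
    m->bits = calloc(m->levels, m->bytes_per_level);
    return m->bits != NULL;
}

/* The chunk of the big bitmap that belongs to a given depth. */
static uint8_t *sk_map_level(const sk_recursion_map *m, size_t depth)
{
    assert(depth < m->levels);
    return m->bits + depth * m->bytes_per_level;
}

static void sk_map_mark(sk_recursion_map *m, size_t depth, size_t paren)
{
    sk_map_level(m, depth)[paren >> 3] |= (uint8_t)(1U << (paren & 7));
}

static bool sk_map_seen(const sk_recursion_map *m, size_t depth, size_t paren)
{
    return (sk_map_level(m, depth)[paren >> 3] >> (paren & 7)) & 1U;
}

static void sk_map_free(sk_recursion_map *m)
{
    free(m->bits);
    m->bits = NULL;
}
```

Entering a deeper level then only means passing depth + 1 along, with no per-level allocation or cleanup registration.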