path: root/regnodes.h
Commit message (author, date, files changed, lines removed/added)
* regcomp.sym: Note specialized use of 'flags' in 2 OPs (Karl Williamson, 2018-12-30, 1 file, -2/+2)
|
* Add new regnode: ANYOFH, without a bitmap (Karl Williamson, 2018-12-26, 1 file, -164/+170)
| This commit adds a regnode for the case where nothing in the bitmap matches. This allows the bitmap to be omitted, saving 32 bytes of otherwise wasted space per node. Many non-Latin Unicode properties have this characteristic. Further, since this node applies only to code points above 255, which are representable only in UTF-8, we can trivially fail a match where the target string isn't in UTF-8. Time savings also accrue from skipping the bitmap look-up. When swashes are removed, even more time will be saved.
* Remove ASCII/NASCII regnodes (Karl Williamson, 2018-12-26, 1 file, -153/+142)
| The ANYOFM/NANYOFM regnodes are generalizations of these. They have more masks and shifts than the removed nodes, but not more branches, so are effectively the same speed. Remove the ASCII/NASCII nodes in favor of having less code to maintain.
* regcomp.c: Simplify handling of EXACTFish nodes with 's' at edge (Karl Williamson, 2018-12-26, 1 file, -129/+119)
| Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for handling an 's' at the leading edge, at the trailing edge, and at both edges, for nodes under /di that contain nothing else that would prevent them from being EXACTFU nodes. If two of these get joined, it could create an 'ss' sequence which can't be an EXACTFU node, for U+DF would match them unconditionally. Instead, under /di it should match if and only if the target string is UTF-8 encoded.
|
| I realized later that having three types becomes harder to deal with when adding yet more node types, so this commit turns the three into just one node type, indicating that at least one edge of the node is an 's'. It also simplifies the parsing of the pattern and determining which node to use.
* Collapse regnode EXACTFU_SS into EXACTFUP (Karl Williamson, 2018-12-26, 1 file, -134/+129)
| EXACTFUP was created by the previous commit to handle a problematic case in which not all the code points in an EXACTFU node are /i foldable at compile time. Doing so will allow a future commit to use the pre-folded EXACTFU nodes (done in a prior commit), saving execution time for the common case. The only problematic code point is the MICRO SIGN. Most patterns don't use this character.
|
| EXACTFU_SS is problematic in a different way. It contains the sequence 'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in it can be pre-folded (unless it also contains a MICRO SIGN). The reason this is problematic is that it is the only non-UTF-8 node where the length in folding can change. To process it at runtime, the more general fold equivalence function is used that is capable of handling length disparities, but is slower than the functions otherwise used for non-UTF-8.
|
| What I've chosen to do for now is to make a single node type for all the problematic cases (which at this time means just the two aforementioned ones). If we didn't do this, we'd have to add a third node type for patterns that contain both 'ss' and MICRO. Or artificially split the pattern so the two never were in the same node, but we can't do that because it can cause bugs in handling multi-character folds. If more special handling is found to be needed, there'd be a combinatorial explosion of additional node types to handle all possible combinations.
|
| What this effectively means is that the slower, more general foldEQ function is used for portions of patterns containing the MICRO sign when the pattern isn't in UTF-8, even though there is no inherent reason to do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
* Add regnode EXACTFUP, for problematic (Karl Williamson, 2018-12-26, 1 file, -127/+132)
| If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now created. This character is the only one not needing UTF-8 to represent, but its fold does need UTF-8, which causes some issues, so it has to be specially handled. When matching against a non-UTF-8 target string, the pattern is effectively folded, but not if the target is UTF-8.
|
| By creating this node, we can remove the special handling required for the nodes that don't have a MICRO SIGN, in a future commit.
* regcomp.c: Generate EXACTFU_SS only for non-UTF8 (Karl Williamson, 2018-12-26, 1 file, -1/+1)
| It turns out that now, the regular methods for handling multi-character folds work for the ones involving LATIN SMALL LETTER SHARP S when the pattern is in UTF-8. So the special code for handling this case can be removed, and a regular EXACTFU node is generated. This has the advantage of being trie-able, and requiring fewer operations at run time, as the pattern is pre-folded at compile time, and doesn't have to be re-folded during each backtracking at run-time.
|
| This means that the EXACTFU_SS node type will only be generated for non-UTF-8 patterns, and the handling of it is unchanged in these cases.
* regcomp.c: Allow more EXACTFish nodes to be trieable (Karl Williamson, 2018-12-07, 1 file, -114/+129)
| The previous two commits fixed bugs where it would be possible during optimization to join two EXACTFish nodes together, and the result would not work properly with LATIN SMALL LETTER SHARP S. But by doing so, the commits prevented all non-UTF-8 EXACTFU nodes that begin or end with [Ss] from being trieable. This commit changes things so that the only non-trieable ones are those that, when joined, have the sequence [Ss][Ss] in them.
|
| To do so, I created three new node types that indicate if the node begins with [Ss] or ends with them, or both. These preclude having to examine the node contents at joining to determine this. And since there are plenty of node types available, it seemed the best choice. But other options would be available should we run out of nodes. Examining the first and final characters of a node is not expensive, for example.
* regcomp.sym: Clarify descriptions of EXACTish regnodes (Karl Williamson, 2018-12-06, 1 file, -9/+9)
|
* Add regnode EXACTFU_ONLY8 (Karl Williamson, 2018-11-27, 1 file, -113/+118)
| This is a regnode that otherwise would be an EXACTFU except that it contains a code point that requires UTF-8 to match, including all the possible folds involving it. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.
|
| For completeness, there could also be an EXACTFAA_ONLY8 and an EXACTFL_ONLY8 created, but I think these are unlikely to actually appear in the wild, since using /aa is mainly about ASCII, and /l mostly will involve characters that don't require UTF-8.
* Add regnode EXACT_ONLY8 (Karl Williamson, 2018-11-27, 1 file, -113/+118)
| This is a regnode that otherwise would be an EXACT except that it contains a code point that requires UTF-8 to represent. Hence if the target string isn't UTF-8, we know it can't possibly match, without needing to try.
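The early exit described for EXACT_ONLY8 can be pictured with a small C sketch. It is illustrative only: `demo_exact_only8_match` and its parameters are hypothetical names, not code from regexec.c, and it compares only at the start of the target.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (not from regexec.c). The pattern bytes are assumed
 * to be UTF-8 and to contain at least one code point above 255, so a
 * target that is not UTF-8 can never match. */
int demo_exact_only8_match(const char *target, size_t target_len,
                           int target_is_utf8,
                           const char *pat, size_t pat_len)
{
    if (!target_is_utf8)
        return 0;                 /* fail without comparing any bytes */
    if (target_len < pat_len)
        return 0;                 /* target too short to hold the pattern */
    return memcmp(target, pat, pat_len) == 0;
}
```

The point of the node type is the first test: for a non-UTF-8 target the answer is known before any byte of the string is examined.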
* Add regnode NANYOFM (Karl Williamson, 2018-11-17, 1 file, -159/+164)
| This matches when the existing node ANYOFM would not match; i.e., they are complements.
|
| I almost didn't create this node, but it turns out to significantly speed up various classes of matches. For example qr/[^g]/, both /i and not, turn into this node; and something like
|
|     (("a" x $large_number) . "b") =~ /[^a]/
|
| goes through the string a word at a time, instead of previously byte-by-byte. Benchmarks are at the end of this message.
|
| This node gets generated when complementing any single ASCII character and when complementing any ASCII case pair, like /[^Gg]/. It never gets generated if the class includes a character that isn't ASCII (actually UTF-8 invariant, which matters only on EBCDIC platforms).
|
| The details of when this node gets constructed are complicated. It happens when the bit patterns of the characters in the class happen to have certain very particular characteristics, depending on the vagaries of the character set. [BbCc] will do so, but [AaBb] does not. [^01] does, but not [^12]. Precisely, look at all the bit patterns of the characters in the set, and count the total number of differing bits, calling it 'n'. If and only if the number of characters is 2**n, this node gets generated. As an example, on both ASCII and EBCDIC, the last 4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are 0011. The other 4 bits are the same for each of these 4 digits. That means that only 2 bits differ among the 4 characters, and 2**2==4, so the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000 differ only in one bit so 2**1==2, and so [^08] will generate this node.
|
| We could consider in the future, an extension where, if the input doesn't work to generate this node, that we construct the closure of that input to generate this node, which would have false positives that would have to be tested for. The speedup of this node is so significant that that could still be faster than what we have today.
|
| The benchmarks are for a 64-bit word. 32-bits would not be as good.
|
| Key:
|     Ir   Instruction read
|     Dr   Data read
|     Dw   Data write
|     COND conditional branches
|     IND  indirect branches
|
| The numbers (except for the final column) represent raw counts per loop iteration. The higher the number in the final column, the faster.
|
| (('a' x 1) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir     2782.0    2648.0     105.1
|     Dr      845.0     799.0     105.8
|     Dw      531.0     500.0     106.2
|     COND    431.0     419.0     102.9
|     IND      22.0      22.0     100.0
|
| (('a' x 10) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir     3358.0    2671.0     125.7
|     Dr      998.0     801.0     124.6
|     Dw      630.0     500.0     126.0
|     COND    503.0     424.0     118.6
|     IND      22.0      22.0     100.0
|
| (('a' x 100) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir     9118.0    2773.0     328.8
|     Dr     2528.0     814.0     310.6
|     Dw     1620.0     500.0     324.0
|     COND   1223.0     450.0     271.8
|     IND      22.0      22.0     100.0
|
| (('a' x 1000) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir    66718.0    3650.0    1827.9
|     Dr    17828.0     923.0    1931.5
|     Dw    11520.0     500.0    2304.0
|     COND   8423.0     668.0    1260.9
|     IND      22.0      22.0     100.0
|
| (('a' x 10000) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir   642718.0   12650.0    5080.8
|     Dr   170828.0    2048.0    8341.2
|     Dw   110520.0     500.0   22104.0
|     COND  80423.0    2918.0    2756.1
|     IND      22.0      22.0     100.0
|
| (('a' x 100000) . 'b') =~ /[^a]/
|
|             blead    nanyof   Ratio %
|          -------- --------- ---------
|     Ir        Inf  102654.8    6237.1
|     Dr        Inf   13299.3   12788.9
|     Dw        Inf     500.9  219708.7
|     COND 800424.1   25419.1    3148.9
|     IND      22.0      22.0     100.0
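The 2**n rule stated in this commit message can be checked mechanically. The C sketch below is illustrative only: `mask_representable` is a hypothetical helper, not a function from regcomp.c, and it assumes the caller passes distinct bytes (ASCII values are used in the examples).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from regcomp.c). Given `count` distinct bytes,
 * report whether the set is exactly describable by one mask/base pair,
 * i.e. whether count == 2**n, where n is the number of bit positions in
 * which the members differ. On success, a byte c is a member if and only
 * if (c & *mask) == *base. */
int mask_representable(const uint8_t *chars, size_t count,
                       uint8_t *mask, uint8_t *base)
{
    uint8_t all_and = 0xFF;   /* bits set in every member */
    uint8_t all_or = 0x00;    /* bits set in at least one member */
    uint8_t differing;
    unsigned n = 0;
    size_t i;

    if (count == 0)
        return 0;
    for (i = 0; i < count; i++) {
        all_and &= chars[i];
        all_or |= chars[i];
    }
    /* A bit differs among the members when it is set in some but not all. */
    differing = (uint8_t)(all_and ^ all_or);
    for (i = 0; i < 8; i++)
        n += (differing >> i) & 1u;
    if (count != ((size_t)1 << n))
        return 0;
    *mask = (uint8_t)~differing;
    *base = all_and;
    return 1;
}
```

With ASCII values, "BbCc" differs in two bit positions (0x21) and has four members, so it qualifies; "AaBb" differs in three bit positions but has only four members, so it does not. The same reasoning accepts "01" and "08" and rejects "12", in line with the examples in the commit message.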
* regcomp.sym: longj field is a boolean (Karl Williamson, 2018-11-16, 1 file, -2/+2)
| The comments could lead one to think that one could specify any of the argument fields that nodes can have. But in fact, the value is a boolean, 0 meaning to use the normal offset field of all regnodes; and 1 meaning to use the ARG field that some regnodes have. If a regnode had more than just the one argument field, the one that corresponds to that would be used.
|
| This commit enforces that, and changes regcomp.sym to not use '2', which is misleading. It clarifies the comments about this and what '.' means in the flags field.
* regcomp.sym: Add lengths for ANYOF nodes (Karl Williamson, 2018-10-20, 1 file, -4/+4)
| This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special cased in regcomp.c, leading to simplification.
* regcomp.sym: Add node type ANYOF_POSIXL (Karl Williamson, 2018-10-20, 1 file, -160/+166)
| This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it, which requires extra space. Adding this will allow a future commit to simplify handling for ANYOF nodes.
* S_regmatch(): combine CURLY_B_min/_known states (David Mitchell, 2018-08-26, 1 file, -35/+29)
| There are currently two similar backtracking states for simple non-greedy pattern repeats:
|
|     CURLY_B_min
|     CURLY_B_min_known
|
| the latter is a variant of the former for when the character which must follow the repeat is known, e.g. /(...)*?X.../, which allows quick skipping to the next viable position.
|
| The code for the two cases:
|
|     case CURLY_B_min_fail:
|     case CURLY_B_min_known_fail:
|
| share a lot of similarities. This commit merges the two states into a single CURLY_B_min state, with an associated single CURLY_B_min_fail fail state.
|
| That one code block can handle both types, with a single
|
|     if (ST.c1 == CHRTEST_VOID) ...
|
| test to choose between the two variant parts of the code. This makes the code smaller and more maintainable, at the cost of one extra test per backtrack.
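The shape of the merged state can be sketched in C as one routine with a single test that selects between the two variants. This is an assumption-laden illustration, not the S_regmatch() code: `demo_next_candidate` and `DEMO_CHRTEST_VOID` are hypothetical stand-ins, and the sentinel value is chosen arbitrarily.

```c
#include <stddef.h>

/* Stand-in sentinel meaning "the character after the repeat is unknown".
 * The value here is an assumption made for this sketch. */
#define DEMO_CHRTEST_VOID (-1000)

/* Hypothetical sketch: return the next position at or after `pos` that is
 * worth trying. One function serves both the "unknown follower" and the
 * "known follower" cases, at the cost of one extra test. */
size_t demo_next_candidate(const char *s, size_t len, size_t pos, int c1)
{
    if (c1 == DEMO_CHRTEST_VOID)
        return pos;                 /* nothing known: try the current position */
    while (pos < len && (unsigned char)s[pos] != (unsigned char)c1)
        pos++;                      /* known follower: skip ahead to it */
    return pos;                     /* pos == len means no viable position */
}
```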
* Spelling correction for consistency with pod/perldebguts.pod. (James E Keenan, 2018-04-08, 1 file, -1/+1)
|
* Change name of regnode for clarity (Karl Williamson, 2018-02-16, 1 file, -10/+10)
| The EXACTFA nodes are in fact not generated by /a, but by /aa. Change the name to EXACTFAA to correspond. I found myself getting confused by this.
* regcomp.sym: Add ANYOFM regnode (Karl Williamson, 2018-01-30, 1 file, -158/+163)
| This uses a mask instead of a bitmap, and is restricted to representing invariant characters under UTF-8 that meet particular bit patterns.
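A mask-based class test can be pictured as follows. This is a hedged C sketch, not the regexec.c implementation: the function names are hypothetical, and the mask/base pair for [Gg] is worked out by hand from ASCII values (0x47 and 0x67 differ only in bit 0x20).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: a mask-based class accepts byte c when the
 * bits selected by `mask` equal `base`. No bitmap lookup is involved. */
int demo_anyofm_matches(uint8_t c, uint8_t mask, uint8_t base)
{
    return (uint8_t)(c & mask) == base;
}

/* Index of the first byte in s[0..len) that is not in the class, or len
 * if every byte is a member. This is the question a complemented class
 * such as /[^Gg]/ has to answer. */
size_t demo_first_non_member(const uint8_t *s, size_t len,
                             uint8_t mask, uint8_t base)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (!demo_anyofm_matches(s[i], mask, base))
            break;
    return i;
}
```

For [Gg] in ASCII the pair is mask 0xDF and base 0x47. This sketch scans one byte at a time; it does not attempt the word-at-a-time scanning that the NANYOFM commit describes.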
* regcomp.sym: Add regnodes for [[:ascii:]] (Karl Williamson, 2017-12-29, 1 file, -137/+147)
| These will be used in a future commit.
* regcomp.sym: Add nodes for script runs (Karl Williamson, 2017-12-24, 1 file, -87/+97)
| To be used in the implementation thereof.
* regcomp.sym: Clarify regnode comment (Karl Williamson, 2017-12-16, 1 file, -1/+1)
|
* clear savestack on (?{...}) failure and backtrack (David Mitchell, 2017-02-14, 1 file, -79/+85)
| RT #126697
|
| In a regex, after executing a (?{...}) code block, if we fail and backtrack over the codeblock, we're supposed to unwind the savestack, so that, for example, any local()s within the code block are undone.
|
| It turns out that a backtracking state isn't pushed for (?{...}), only for postponed evals (i.e. (??{...})). This means that it relies on one of the earlier backtracking states to clear the savestack on its behalf. This can't always be relied upon, and the ticket above contains code where this falls down; in particular:
|
|     'ABC' =~ m{
|         \A
|         (?:
|             (?: AB | A | BC )
|             (?{
|                 local $count = $count + 1;
|                 print "! count=$count; ; pos=${\pos}\n";
|             })
|         )*
|         \z
|     }x
|
| Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next doesn't, since there's nothing it would be responsible for that needs cleaning up.
|
| The solution to this is to push a backtrack state for every (?{...}) as well as every (??{...}). The sole job of that state is to do a LEAVE_SCOPE(ST.lastcp).
|
| The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB to make it clear it's only used on postponed /(??{A})B/ regexes, and a new state has been added, EVAL_B, which is only called when backtracking after failing something in the B in /(?{...})B/.
* Unify GOSTART and GOSUB (Yves Orton, 2016-03-06, 1 file, -40/+36)
| GOSTART is a special case of GOSUB; we can remove a lot of offset twiddling, and other special casing, by unifying them, at pretty much no cost.
|
| GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion.
|
| Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern". After this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.
* Cleanup, document, and restructure regen/regcomp.pl (Yves Orton, 2015-10-05, 1 file, -1/+3)
| We clean up the parsing code, replacing our set of arrays of properties with an array of hashes of properties, with utility subs registering new items, etc.
|
| We also split up the output code into a set of subs, one sub per output "blob" (generally a var definition), so that we have some visibility of the higher level structure of our output code. With this patch visibility of the structure of what we generate emerges from the nest of here docs. :-)
|
| Note this change does not (greatly) alter regcomp.sym or perldebguts.pod, it merely cleans up and generally speaking modernizes and most importantly documents the code.
* fix perl #126186 make all verbs allow an optional arg (Yves Orton, 2015-10-05, 1 file, -4/+4)
| In perl #126186 it was pointed out we had started allowing name arguments for verbs where we did not document them to be supported, albeit in an inconsistent way. The previous patch cleaned up some of the cause of this, but it seems better to just generally allow the existing verbs to all support a mark name argument.
|
| So this patch reverses the effect of the previous patch, and makes all verbs, FAIL, ACCEPT, etc, allow an optional argument, and set REGERROR/REGMARK appropriately as well.
* Add ANYOFD regex node (Karl Williamson, 2015-08-24, 1 file, -154/+159)
| This is like an ANYOF node, but just for when /d is in effect. It will be used in future commits.
* perldebguts: Add clarification (Karl Williamson, 2015-08-24, 1 file, -1/+1)
|
* remove deprecated /\C/ RE character class (David Mitchell, 2015-06-19, 1 file, -163/+157)
| This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* regcomp.sym: Update \b descriptions (Karl Williamson, 2015-03-18, 1 file, -7/+7)
|
* Add qr/\b{gcb}/ (Karl Williamson, 2015-02-19, 1 file, -8/+8)
| A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.
|
| This makes the implementation of \X go from being complicated to trivial.
* Reserve a bit for the 're strict' subpragma. (Karl Williamson, 2015-01-13, 1 file, -3/+3)
| This is another step in the process.
* Add regex nodes for locale (Karl Williamson, 2014-12-29, 1 file, -148/+163)
| These will be used in a future commit to distinguish between /l patterns vs non-/l.
* Create bit for /n. (Karl Williamson, 2014-12-28, 1 file, -7/+7)
|
* Eliminate unused BACK regnode (Aaron Crane, 2014-09-29, 1 file, -132/+127)
|
* Make space for /xx flag (Karl Williamson, 2014-09-29, 1 file, -7/+7)
| This doesn't actually use the flag yet. We no longer have to make version-dependent changes to ext/Devel-Peek/t/Peek.t, (it being in /ext) so this doesn't
* Up regex flags limit for (??{}) (Karl Williamson, 2014-09-29, 1 file, -1/+1)
| Previously the regex pattern compilation flags needed for this construct would fit into an 8-bit byte. This conveniently fits into the flags structure element of a regnode. There are changes coming that require more than 8 bits, so in preparation, this commit adds an argument to the node that implements (??{}) (31-bits usable for flags), and moves the storage to that.
* regcomp.sym: ANYOF nodes have an argument (Karl Williamson, 2014-09-29, 1 file, -1/+1)
| Plus a bitmap, but they always have an argument besides, contrary to what was specified here. Future commits rely on this, whereas heretofore this error was harmless.
* regexp.h Remove unused bit placeholders (Karl Williamson, 2014-09-29, 1 file, -7/+7)
| We do not need a placeholder for unused flag bits. And removing them makes the generated regnodes.h more accurate as to what bits are available.
* regexp.h: Move regex flag bit positions. (Karl Williamson, 2014-09-29, 1 file, -6/+6)
| This moves three bits to create a block of unused bits at the beginning. The first bit had to be moved to make space for other uses that are coming in future commits. This breaks binary compatibility, so might as well move the other two bits so that all the unused bits are consolidated at the beginning.
|
| This pool of unused bits is the boundary between the bits that are common to op.h and regexp.h (and in op_reg_common.h) and those that are separate. It's best to have all the unused bits there, so when we need to use one, it can be taken from either side, as needed, without us being trapped into having an available bit, but of the wrong kind.
* Eliminate the duplicative regops BOL and EOL (Yves Orton, 2014-09-17, 1 file, -207/+198)
| See also perl5porters thread titled: "Perl MBOLism in regex engine"
|
| In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda) the BOL regop was split into two behaviours MBOL and SBOL, with SBOL and BOL behaving identically. Similarly the EOL regop was split into two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
|
| This then resulted in various duplicative code related to flags and case statements in various parts of the regex engine.
|
| It appears that perhaps BOL and EOL were kept because they are the type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl to handle aliases for the type data so that SBOL/MBOL are of type BOL, even though BOL == SBOL seems to cover that case without adding to the confusion.
|
| This means two regops, a regstate, and an internal regex flag can be removed (and used for other things), and various logic relating to them can be removed.
|
| For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and MBOL is /^/m. (I consider it a fail we have no way to say MBOL without the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is also a /\z/ which is EOS "end of string" with or without the /m).
* Fix for Coverity perl5 CID 29034: Out-of-bounds read (OVERRUN) (Jarkko Hietaniemi, 2014-04-30, 1 file, -0/+8)
| overrun-local: Overrunning array PL_reg_intflags name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).
|
| Needed compile-time limits for the PL_reg_intflags_name so that the bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH because the size of name array is not visible during compile time (only const char*[] is), so modified regcomp.pl to generate the size, made it visible only under DEBUGGING. Did extflags analogously even though its size is currently exactly 32 already. The sizeof(flags)*8 is extra paranoia for ILP64.
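The fix described here amounts to bounding the bit loop by both the width of the flags word and a generated array-size constant. A minimal C sketch of that pattern follows; every name in it (`demo_intflags_name`, `DEMO_INTFLAGS_NAME_SIZE`, the flag names themselves) is a hypothetical stand-in, not the identifier that regcomp.pl actually generates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag names; in Perl the real table is generated. */
static const char *const demo_intflags_name[] = {
    "DEMO_SKIP", "DEMO_IMPLICIT", "DEMO_NAUGHTY"
};
/* Stand-in for the generated size constant the commit describes. */
#define DEMO_INTFLAGS_NAME_SIZE 3

/* Name for a bit, or NULL when the bit has no entry in the table. */
const char *demo_flag_name(unsigned bit)
{
    return bit < DEMO_INTFLAGS_NAME_SIZE ? demo_intflags_name[bit] : NULL;
}

/* Count the set bits that have a name. The loop stops at whichever limit
 * comes first: the width of `flags` or the end of the name table. */
unsigned demo_count_named_flags(uint32_t flags)
{
    unsigned bit, named = 0;

    for (bit = 0;
         bit < sizeof(flags) * 8 && bit < DEMO_INTFLAGS_NAME_SIZE;
         bit++)
        if (flags & ((uint32_t)1 << bit))
            named++;
    return named;
}
```

Without the second bound, a set bit 31 would index past the end of a table that has fewer than 32 entries, which is the overrun Coverity reported.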
* Change 'semantics' to 'rules' (Karl Williamson, 2014-02-20, 1 file, -12/+12)
| The term 'semantics' in documentation when applied to character sets is changed to 'rules' as being a shorter less-jargony synonym in this case. This was discussed several releases ago, but I didn't get around to it.
* Revert "Free up bit for regex ANYOF nodes" (Karl Williamson, 2014-02-15, 1 file, -155/+150)
| This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and adds comments referring to it, in case it is ever needed.
* Free up bit for regex ANYOF nodes (Karl Williamson, 2014-02-15, 1 file, -150/+155)
| This commit frees up a bit by using an extra regnode to pass the information to the regex engine instead of the flag. I originally thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL bit, as that might speed some things up. But if we need to do this again by adding another node to get another bit, we want one that is mutually exclusive of the first one we did, for otherwise we start having to make 3 nodes instead of two to get the combinations:
|
|     1 0
|     0 1
|     1 1
|
| This combinatorial problem is avoided by using bits that are mutually exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this commit, ANYOF_NON_UTF8_NON_ASCII_ALL, is only set under /d matching, and there are other bits that are set only under /l, so if we need to do this again, we should use one of those.
|
| I wrote this code when I thought I really needed a bit. But since then, I have figured out a better way to get the bit needed now. But I don't want to lose this code to posterity, so this commit is being made long enough to get the commit number, then it will be reverted, adding comments referring to the commit number, so that it can easily be reconstructed when necessary.
* Add RXf_UNBOUNDED_QUANTIFIER and regexp->maxlen (Yves Orton, 2014-02-03, 1 file, -1/+1)
| The flag tells us that a pattern may match an infinitely long string. The new member in the regexp struct tells us how long the string might be. With these two items we can implement regexp based $/.
* Move the RXf_ANCH flags to intflags as PREGf_ANCH_xxx and add RXf_IS_ANCHORED as a replacement (Yves Orton, 2014-01-31, 1 file, -8/+12)
| The only requirement outside of the regex engine is to identify that there is an anchor involved at all. So we move the 4 anchor flags to intflags and replace them with a single aggregate flag RXf_IS_ANCHORED in extflags.
|
| This frees up another 3 bits in extflags.
* move RXf_GPOS_SEEN and RXf_GPOS_FLOAT to intflags (Yves Orton, 2014-01-31, 1 file, -6/+8)
| This required removing the RXf_GPOS_CHECK mask as it uses one flag that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't know why we test both. Further investigation required.
* Rename RXf_CANY_SEEN to PREGf_CANY_SEEN and move from extflags to intflags (Yves Orton, 2014-01-31, 1 file, -2/+4)
|
* Use bit instead of node for regex SSC (Karl Williamson, 2014-01-22, 1 file, -155/+150)
| The flag bits in regular expression ANYOF nodes are perennially in short supply. However there are still plenty of regex nodes possible. So one solution to needing to pass more information is to create a node that encapsulates what is needed. That is what commit 9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a particular ANYOF node is for the synthetic start class (SSC). However this solution introduces other issues. If you have to express two things, then you need a regnode for A, a regnode for B, a regnode for both A and B, and another regnode for neither A nor B. With three things, you need 8 regnodes to express all possible combinations. This becomes unwieldy to write code for. The number of combinations goes way down if some of them are mutually exclusive.
|
| At the time of that commit, I thought that a SSC need not ever warn if matching against an above-Unicode code point. I was wrong, and that has been corrected earlier in the 5.19 series.
|
| But it finally came to me how to tell regexec that an ANYOF node is for the SSC without taking up a flag bit and without requiring a regnode type. The 'next_off' field in a regnode tells the engine the offset in the regex program to the node it's supposed to go to after processing this one. Since the SSC stands alone, its 'next_off' field is unused, and we can put anything we want in it. That, however, is not true of other ANYOF regnodes. But it turns out that there are certain values that will never be legitimate in the 'next_off' field in these, and so this commit uses one of those to signal that this ANYOF field is an SSC. regnodes come in various sizes, and the offset is in terms of how many of the smallest ones are there to the next node to look at. Since ANYOF nodes are large, the offset is always > 1, and so this commit uses 1 to indicate an SSC.
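The sentinel idea in this commit can be sketched in C. The struct below is a simplified stand-in for a regnode header and the names are hypothetical; the only detail taken from the commit message is that an ANYOF node's `next_off` is always greater than 1 in normal use, so the value 1 is free to mean "this is an SSC".

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for a regnode header. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;   /* offset to the next node, in smallest-node units */
} demo_regnode;

/* A value that can never be a legitimate next_off for a large ANYOF node,
 * reused as a marker instead of spending a flag bit or a node type. */
#define DEMO_SSC_NEXT_OFF 1

void demo_mark_as_ssc(demo_regnode *node)
{
    node->next_off = DEMO_SSC_NEXT_OFF;
}

int demo_is_ssc(const demo_regnode *node)
{
    return node->next_off == DEMO_SSC_NEXT_OFF;
}
```

The design trade-off is the one the message spells out: an in-band sentinel costs nothing in flag bits or node types, but it is only safe because the marked node stands alone and never needs its real `next_off`.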