| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a regnode for the case where nothing in the bit map has
matches. This allows the bitmap to be omitted, saving 32 bytes of
otherwise wasted space per node. Many non-Latin Unicode properties have
this characteristic. Further, since this node applies only to code
points above 255, which are representable only in UTF-8, we can
trivially fail a match where the target string isn't in UTF-8. Time
savings also accrue from skipping the bitmap look-up. When swashes are
removed, even more time will be saved.
|
|
|
|
|
|
|
| |
The ANYOFM/NANYOFM regnodes are generalizations of these. They have
more masks and shifts than the removed nodes, but not more branches, so
are effectively the same speed. Remove the ASCII/NASCII nodes in favor
of having less code to maintain.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for
handling an 's' at the leading edge, at the trailing edge, and at both
edges for nodes under /di that there is nothing else in that would
prevent them from being EXACTFU nodes. If two of these get joined, it
could create an 'ss' sequence which can't be an EXACTFU node, for U+DF
would match them unconditionally. Instead, under /di it should match
if and only if the target string is UTF-8 encoded.
I realized later that having three types becomes harder to deal with
when adding yet more node types, so this commit turns the three into
just one node type, indicating that at least one edge of the node is an
's'.
It also simplifies the parsing of the pattern and determining which node
to use.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
EXACTFUP was created by the previous commit to handle a problematic case
in which not all the code points in an EXACTFU node are /i foldable at
compile time. Doing so will allow a future commit to use the pre-folded
EXACTFU nodes (done in a prior commit), saving execution time for the
common case. The only problematic code point is the MICRO SIGN. Most
patterns don't use this character.
EXACTFU_SS is problematic in a different way. It contains the sequence
'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in
it can be pre-folded (unless it also contains a MICRO SIGN). The reason
this is problematic is that it is the only non-UTF-8 node where the
length in folding can change. To process it at runtime, the more
general fold equivalence function is used that is capable of handling
length disparities, but is slower than the functions otherwise used for
non-UTF-8.
What I've chosen to do for now is to make a single node type for all the
problematic cases (which at this time means just the two aforementioned
ones). If we didn't do this, we'd have to add a third node type for
patterns that contain both 'ss' and MICRO. Or artificially split the
pattern so the two never were in the same node, but we can't do that
because it can cause bugs in handling multi-character folds. If more
special handling is found to be needed, there'd be a combinatorial
explosion of additional node types to handle all possible combinations.
What this effectively means is that the slower, more general foldEQ
function is used for portions of patterns containing the MICRO sign when
the pattern isn't in UTF-8, even though there is no inherent reason to
do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
|
|
|
|
|
|
|
|
|
|
| |
If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now
created. This character is the only one not needing UTF-8 to represent,
but its fold does need UTF-8, which causes some issues, so it has to be
specially handled. When matching against a non-UTF-8 target string, the
pattern is effectively folded, but not if the target is UTF-8. By
creating this node, we can remove the special handling required for the
nodes that don't have a MICRO SIGN, in a future commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that now, the regular methods for handling multi-character
folds work for the ones involving LATIN SMALL LETTER SHARP S when the
pattern is in UTF-8. So the special code for handling this case can be
removed, and a regular EXACTFU node is generated. This has the
advantage of being trie-able, and requiring fewer operations at run
time, as the pattern is pre-folded at compile time, and doesn't have to
be re-folded during each backtracking at run-time.
This means that the EXACTFU_SS node type will only be generated for
non-UTF-8 patterns, and the handling of it is unchanged in these cases.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous two commits fixed bugs where it would be possible during
optimization to join two EXACTFish nodes together, and the result would
not work properly with LATIN SMALL LETTER SHARP S. But by doing so,
the commits caused all non-UTF-8 EXACTFU nodes that begin or end with
[Ss] from being trieable.
This commit changes things so that the only the ones that are
non-trieable are the ones that, when joined, have the sequence [Ss][Ss]
in them. To do so, I created three new node types that indicate if the
node begins with [Ss] or ends with them, or both. These preclude having
to examine the node contents at joining to determine this. And since
there are plenty of node types available, it seemed the best choice.
But other options would be available should we run out of nodes.
Examining the first and final characters of a node is not expensive, for
example.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a regnode that otherwise would be an EXACTFU except that it
contains a code point that requires UTF-8 to match, including all the
possible folds involving it. Hence if the target string isn't UTF-8, we
know it can't possibly match, without needing to try.
For completeness, there could also be an EXACTFAA_ONLY8 and an
EXACTFL_ONLY8 created, but I think these are unlikely to actually appear
in the wild, since using /aa is mainly about ASCII, and /l mostly will
involve characters that don't require UTF-8.
|
|
|
|
|
|
|
| |
This is a regnode that otherwise would be an EXACT except that it
contains a code point that requires UTF-8 to represent. Hence if the
target string isn't UTF-8, we know it can't possibly match, without
needing to try.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This matches when the existing node ANYOFM would not match; i.e., they
are complements.
I almost didn't create this node, but it turns out to significantly
speed up various classes of matches. For example qr/[^g]/, both /i and
not, turn into this node; and something like
(("a" x $large_number) . "b") =~ /[^a]/
goes through the string a word at a time, instead of previously
byte-by-byte. Benchmarks are at the end of this mesage.
This node gets generated when complementing any single ASCII character
and when complementing any ASCII case pair, like /[^Gg]/. It never gets
generated if the class includes a character that isn't ASCII (actually
UTF-8 invariant, which matters only on EBCDIC platforms).
The details of when this node gets constructed are complicated. It
happens when the bit patterns of the characters in the class happen to
have certain very particular characteristics, depending on the vagaries
of the character set. [BbCc] will do so, but [AaBb] does not. [^01]
does, but not [^12]. Precisely, look at all the bit patterns of the
characters in the set, and count the total number of differing bits,
calling it 'n'. If and only if the number of characters is 2**n, this
node gets generated. As an example, on both ASCII and EBCDIC, the last
4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are
0011. The other 4 bits are the same for each of these 4 digits. That
means that only 2 bits differ among the 4 characters, and 2**2==4, so
the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000
differ only in one bit so 2**1==2, and so [^08] will generate this node.
We could consider in the future, an extension where, if the input
doesn't work to generate this node, that we construct the closure of
that input to generate this node, which would have false positives that
would have to be tested for. The speedup of this node is so significant
that that could still be faster than what we have today.
The benchmarks are for a 64-bit word. 32-bits would not be as good.
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
The numbers (except for the final column) represent raw counts per loop
iteration. The higher the number in the final column, the faster.
(('a' x 1) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 2782.0 2648.0 105.1
Dr 845.0 799.0 105.8
Dw 531.0 500.0 106.2
COND 431.0 419.0 102.9
IND 22.0 22.0 100.0
(('a' x 10) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 3358.0 2671.0 125.7
Dr 998.0 801.0 124.6
Dw 630.0 500.0 126.0
COND 503.0 424.0 118.6
IND 22.0 22.0 100.0
(('a' x 100) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 9118.0 2773.0 328.8
Dr 2528.0 814.0 310.6
Dw 1620.0 500.0 324.0
COND 1223.0 450.0 271.8
IND 22.0 22.0 100.0
(('a' x 1000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 66718.0 3650.0 1827.9
Dr 17828.0 923.0 1931.5
Dw 11520.0 500.0 2304.0
COND 8423.0 668.0 1260.9
IND 22.0 22.0 100.0
(('a' x 10000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 642718.0 12650.0 5080.8
Dr 170828.0 2048.0 8341.2
Dw 110520.0 500.0 22104.0
COND 80423.0 2918.0 2756.1
IND 22.0 22.0 100.0
(('a' x 100000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir Inf 102654.8 6237.1
Dr Inf 13299.3 12788.9
Dw Inf 500.9 219708.7
COND 800424.1 25419.1 3148.9
IND 22.0 22.0 100.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The comments could lead one to thinking one could specify any of the
argument fields that nodes can have. But in fact, the value is a
boolean, 0 meaning to use the normal offset field of all regnodes; and 1
meaning to use the ARG field that some regnodes have. If a regnode had
more than just the one argument field, the one that corresponds to that
would be used.
This commit enforces that, and changes regcomp.sym to not use '2',
which is misleading.
It clarifies the comments about this and what '.' means in the flags
field
|
|
|
|
|
|
| |
This changes regcomp.sym to generate the correct lengths for ANYOF
nodes, which means they don't have to be special cased in regcomp.c,
leading to simplification
|
|
|
|
|
|
| |
This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it,
which requires extra space. Adding this will allow a future commit to
simplify handling for ANYOF nodes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are currently two similar backtracking states for simple
non-greedy pattern repeats:
CURLY_B_min
CURLY_B_min_known
the latter is a variant of the former for when the character which must
follow the repeat is known, e.g. /(...)*?X.../, which allows quick
skipping to the next viable position.
The code for the two cases:
case CURLY_B_min_fail:
case CURLY_B_min_known_fail:
share a lot of similarities. This commit merges the two states into a
single CURLY_B_min state, with an associated single CURLY_B_min_fail
fail state.
That one code block can handle both types, with a single
if (ST.c1 == CHRTEST_VOID) ...
test to choose between the two variant parts of the code.
This makes the code smaller and more maintainable, at the cost of one
extra test per backtrack.
|
| |
|
|
|
|
|
|
|
| |
The EXACTFA nodes are in fact not generated by /a, but by /aa. Change
the name to EXACTFAA to correspond.
I found myself getting confused by this.
|
|
|
|
|
| |
This uses a mask instead of a bitmap, and is restricted to representing
invariant characters under UTF-8 that meet particular bit patterns.
|
|
|
|
| |
These will be used in a future commit
|
|
|
|
| |
To be used in the implementation thereof.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RT #126697
In a regex, after executing a (?{...}) code block, if we fail and
backtrack over the codeblock, we're supposed to unwind the savestack, so
that for any example any local()s within the code block are undone.
It turns out that a backtracking state isn't pushed for (?{...}), only
for postponed evals ( i.e. (??{...})). This means that it relies on one
of the earlier backtracking states to clear the savestack on its behalf.
This can't always be relied upon, and the ticket above contains code where
this falls down; in particular:
'ABC' =~ m{
\A
(?:
(?: AB | A | BC )
(?{
local $count = $count + 1;
print "! count=$count; ; pos=${\pos}\n";
})
)*
\z
}x
Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next
doesn't, since there's nothing it would be responsible for that needs
cleaning up.
The solution to this is to push a backtrack state for every (?{...}) as
well as every (??{...}). The sole job of that state is to do a
LEAVE_SCOPE(ST.lastcp).
The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB
to make it clear it's only used on postponed /(??{A})B/ regexes, and a new
state has been added, EVAL_B, which is only called when backtracking after
failing something in the B in /(?{...})B/.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling,
and other special casing by unifying them, at pretty much no cost.
GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as
a U32 and an I32 respectively. ARG() holds the "parno" we will recurse
into. ARG2L() holds a signed offset to the relevant start node for the
recursion.
Prior to this patch the argument to GOSUB would always be >=, and unlike
other parts of our logic we would not use 0 to represent "start/end" of
pattern, as GOSTART would be used for "recurse to beginning of pattern",
after this patch we use 0 to represent "start/end", and a lot of
complexity "goes away" along with GOSTART regops.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We cleanup the parsing code, replacing our set of arrays of
properties with an array of hashes of properties, with
utility subs registering new items, etc.
We also split up the output code into a set of subs,
one sub per output "blob" (generaly a var definition),
so that we have some visibility of the higher level strucuture
of our output code. With this patch visibility of the structure
of what we generate emerges from the nest of here docs. :-)
Note this change does not (greatly) alter regcomp.sym or
perldebguts.pod, it merely cleans up and generally speaking
modernizes and most importantly documents the code.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In perl #126186 it was pointed out we had started allowing name
arguments for verbs where we did not document them to be supported,
albeit in an inconsistent way. The previous patch cleaned up some
of the cause of this, but it seems better to just generally allow
the existing verbs to all support a mark name argument.
So this patch reverses the effect of the previous patch, and makes
all verbs, FAIL, ACCEPT, etc, allow an optional argument, and
set REGERROR/REGMARK appropriately as well.
|
|
|
|
|
| |
This is like an ANYOF node, but just for when /d is in effect. It will
be used in future commits
|
| |
|
|
|
|
|
|
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A function implements seeing if the space between any two characters is
a grapheme cluster break. Afer I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
|
|
|
|
| |
This is another step in the process
|
|
|
|
|
| |
These will be used in a future commit to distinguish between /l patterns
vs non-/l.
|
| |
|
| |
|
|
|
|
|
|
| |
This doesn't actually use the flag yet.
We no longer have to make version-dependent changes to
ext/Devel-Peek/t/Peek.t, (it being in /ext) so this doesn't
|
|
|
|
|
|
|
|
|
| |
Previously the regex pattern compilation flags needed for this construct
would fit into an 8-bit byte. This conveniently fits into the flags
structure element of a regnode. There are changes coming that require
more than 8 bits, so in preparation, this commit adds an argument to the
node that implements (??{}) (31-bits usable for flags), and moves the
storage to that.
|
|
|
|
|
|
| |
Plus a bitmap, but they always have an argument besides, contrary to
what was specified here. Future commits rely on this, whereas
heretofore this error was harmless.
|
|
|
|
|
|
| |
We do not need a placeholder for unused flag bits. And removing them
makes the generated regnodes.h more accurate as to what bits are
available.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves three bits to create a block of unused bits at the beginning.
The first bit had to be moved to make space for other uses that are
coming in future commits. This breaks binary compatibility, so might as
well move the other two bits so that all the unused bits are
consolidated at the beginning.
This pool of unused bits is the boundary between the bits that are
common to op.h and regexp.h (and in op_reg_common.h) and those that are
separate. It's best to have all the unused bits there, so when we need
to use one, it can be taken from either side, as needed, without us
being trapped into having an available bit, but of the wrong kind.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See also perl5porters thread titled: "Perl MBOLism in regex engine"
In the perl 5.000 release (a0d0e21ea6ea90a22318550944fe6cb09ae10cda)
the BOL regop was split into two behaviours MBOL and SBOL, with SBOL
and BOL behaving identically. Similarly the EOL regop was split into
two behaviors SEOL and MEOL, with EOL and SEOL behaving identically.
This then resulted in various duplicative code related to flags and
case statements in various parts of the regex engine.
It appears that perhaps BOL and EOL were kept because they are the
type ("regkind") for SBOL/MBOL and SEOL/MEOL/EOS. Reworking regcomp.pl
to handle aliases for the type data so that SBOL/MBOL are of type
BOL, even though BOL == SBOL seems to cover that case without adding
to the confusion.
This means two regops, a regstate, and an internal regex flag can
be removed (and used for other things), and various logic relating
to them can be removed.
For the uninitiated, SBOL is /^/ and /\A/ (with or without /m) and
MBOL is /^/m. (I consider it a fail we have no way to say MBOL without
the /m modifier). Similarly SEOL is /$/ and MEOL is /$/m (there is
also a /\z/ which is EOS "end of string" with or without the /m).
|
|
|
|
|
|
|
|
|
|
|
|
| |
overrun-local: Overrunning array PL_reg_intflags name of 14 8-byte elements at element index 31 (byte offset 248) using index bit (which evaluates to 31).
Needed compile-time limits for the PL_reg_intflags_name so that the
bit loop doesn't waltz off past the array. Could not use C_ARRAY_LENGTH
because the size of name array is not visible during compile time
(only const char*[] is), so modified regcomp.pl to generate the size,
made it visible only under DEBUGGING. Did extflags analogously
even though its size currently exactly 32 already. The sizeof(flags)*8
is extra paranoia for ILP64.
|
|
|
|
|
|
| |
The term 'semantics' in documentation when applied to character sets is
changed to 'rules' as being a shorter less-jargony synonym in this case.
This was discussed several releases ago, but I didn't get around to it.
|
|
|
|
|
| |
This reverts commit 34fdef848b1687b91892ba55e9e0c3430e0770f6, and
adds comments referring to it, in case it is ever needed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit frees up a bit by using an extra regnode to pass the
information to the regex engine instead of the flag. I originally
thought that if this was needed, it should be the ANYOF_ABOVE_LATIN1_ALL
bit, as that might speed some things up. But if we need to do this
again by adding another node to get another bit, we want one that is
mutually exclusive of the first one we did, For otherwise we start
having to make 3 nodes instead of two to get the combinations:
1 0
0 1
1 1
This combinatorial problem is avoided by using bits that are mutually
exclusive, which the ABOVE_LATIN1_ALL isn't, but the one freed by this
commit ANYOF_NON_UTF8_NON_ASCII_ALL is only set under /d matching, and
there are other bits that are set only under /l, so if we need to do
this again, we should use one of those.
I wrote this code when I thought I really needed a bit. But since, I
have figured out a better way to get the bit needed now. But I don't
want to lose this code to posterity, so this commit is being made long
enough to get the commit number, then it will be reverted, adding
comments referring to the commit number, so that it can easily be
reconstructed when necessary.
|
|
|
|
|
|
|
|
|
| |
The flag tells us that a pattern may match an infinitely long string.
The new member in the regexp struct tells us how long the string might
be.
With these two items we can implement regexp based $/
|
|
|
|
|
|
|
|
|
|
| |
RXf_IS_ANCHORED as a replacement
The only requirement outside of the regex engine is to identify that there is
an anchor involved at all. So we move the 4 anchor flags to intflags and replace
it with a single aggregate flag RXf_IS_ANCHORED in extflags.
This frees up another 3 bits in extflags.
|
|
|
|
|
|
|
|
| |
This required removing the RXf_GPOS_CHECK mask as it uses one flag
that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that
moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as
you cant have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I dont
know why we test both. Further investigation required.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The flag bits in regular expression ANYOF nodes are perennially in short
supply. However there are still plenty of regex nodes possible. So one
solution to needing to pass more information is to create a node that
encapsulates what is needed. That is what commit
9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a
particular ANYOF node is for the synthetic start class (SSC). However
this solution introduces other issues. If you have to express two
things, then you need a regnode for A, a regnode for B, a regnode for
both A and B, and another regnode for both not A nor B; With three
things, you need 8 regnodes to express all possible combinations. This
becomes unwieldy to write code for. The number of combinations goes way
down if some of them are mutually exclusive. At the time of that
commit, I thought that a SSC need not ever warn if matching against an
above-Unicode code point. I was wrong, and that has been corrected
earlier in the 5.19 series.
But it finally came to me how to tell regexec that an ANYOF node is
for the SSC without taking up a flag bit and without requiring a regnode
type. The 'next_off' field in a regnode tells the engine the offeset in
the regex program to the node it's supposed to go to after processing
this one. Since the SSC stands alone, its 'next_off' field is unused,
and we can put anything we want in it. That, however, is not true of
other ANYOF regnodes. But it turns out that there are certain values
that will never be legitimate in the 'next_off' field in these, and so
this commit uses one of those to signal that this ANYOF field is an SSC.
regnodes come in various sizes, and the offset is in terms of how many
of the smallest ones are there to the next node to look at. Since ANYOF
nodes are large, the offset is always > 1, and so this commit uses 1 to
indicate an SSC.
|