| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
|
|
|
|
|
| |
This new test script has a test that's supposed to exercise an up-to-10s
wait-and-retry loop when loading properties. It has a 500s timeout
built in, in case that fails. On my system it's been intermittently
failing (not sure if due to something I'm doing or a problem with the
test or with regcomp.c), which effectively hangs the test run.
So decrease the timeout to 25 secs.
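A bounded wait-and-retry loop of this general shape can be sketched in Python as follows. This is only an illustration: the function name, its parameters, and the convention that the loader returns None on failure are all assumptions, and the real test is a Perl script that is not reproduced here.

```python
import time


def retry_until(load, timeout=25.0, interval=0.01,
                clock=time.monotonic, sleep=time.sleep):
    """Call load() until it returns a non-None value or `timeout` seconds pass.

    `load`, `timeout` and `interval` are hypothetical names for this sketch.
    """
    deadline = clock() + timeout
    while True:
        result = load()
        if result is not None:
            return result
        # Give up once the overall deadline has passed, so a failing loader
        # cannot hang the caller indefinitely.
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {timeout} seconds")
        sleep(interval)
```

A short overall deadline bounds how long a persistently failing loader can stall the whole run, which is the point of reducing the timeout.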
|
|
|
|
|
|
|
|
|
|
|
| |
The problem here is that a syntax error occurs and hence certain things
don't get done, but processing continues, as the error isn't checked for
until after the return of the function that found it. The failing
assertion is checking that those certain things actually did get done.
There appear to be good reasons to defer the raising of the error until
then, so the simplest way to fix this is to generalize the code so that
the failing assertion doesn't happen.
|
| |
|
|
|
|
|
|
|
| |
User-defined properties are supposed to be called just once for /i and
once for non-/i. This adds tests for that.
It turns out that this was broken in blead.
|
|
|
|
|
| |
Add some tests. These test various error conditions that haven't been
tested before.
|
|
|
|
| |
That is, in \p{user-defined}
|
|
|
|
|
| |
This adds some trailing spaces and comments in expansion of
\p{user-defined}/ to verify things work.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This large commit moves the handling of user-defined properties to C
code. This should speed it up, but the main reason to do this is to
stop using swashes in this case, leaving only tr/// using them. Once
that too is converted, all swash handling can be ripped out of perl.
Doing this in perl has caused some nasty interactions that will now be
fixed automatically.
The change is not entirely transparent, however (besides speed and the
possibility of removing these interactions). perldelta in this commit
details these.
|
|
|
|
|
| |
Improve t/porting/manifest.t output on errors
to show the line number.
|
|
|
|
|
| |
This is similar to the changes made in 7bfdd8260c:
we do not want to use 'sudo' during the tests.
|
|
|
|
|
|
|
|
|
|
| |
This retains blead customizations:
* 1a58b39af8 removal of 'use vars'
* 7bfdd8260c 500_ping_icmp.t: remove sudo code
These changes are not required anymore, as they
are merged upstream:
* 0fc44d0a18 avoid stderr noise in tests
|
|
|
|
| |
The bug in this case was fixed in db9848c8d.
|
|
|
|
|
| |
When looking for locales on a system, try this one, which seems to be
becoming widely available.
|
|
|
|
|
|
| |
Previous commits in this series have changed uc(), lc(), fc(), etc. to
know how to handle Turkish UTF-8 locales. This commit extends this to
/i regular expression pattern matching.
|
|
|
|
| |
But since these aren't recognized yet, they will be skipped
|
|
|
|
|
|
|
| |
This just calls fold_grind.pl with a particular option.
But, as of this commit, Turkish locales aren't recognized specially, so
this test just always skips.
|
|
|
|
|
|
|
|
|
|
| |
The CaseFolding.txt file has special locale-dependent rules. This
commit changed fold_grind to notice them, and to generate tests for
the situation we aren't in, which are expected to fail.
Since, as of this commit, the Turkic locale is not recognized, this
commit has the effect of generating tests for the Turkic locale, running
them, and making sure they fail when appropriate.
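As a rough illustration of why the same pair of characters is expected to match in one situation and fail in the other, here is a Python sketch of the two Turkic ('T') rules from CaseFolding.txt. The function names are hypothetical, and Python's str.casefold() merely stands in for the default folding; this is not how fold_grind itself is written.

```python
# Turkic ('T') rules from CaseFolding.txt. They override the default folds
# for these two code points only.
TURKIC_FOLDS = {
    "I": "\u0131",       # LATIN CAPITAL LETTER I -> dotless i
    "\u0130": "i",       # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
}


def fold(text, turkic=False):
    """Fold `text`, optionally applying the Turkic overrides."""
    out = []
    for ch in text:
        if turkic and ch in TURKIC_FOLDS:
            out.append(TURKIC_FOLDS[ch])
        else:
            out.append(ch.casefold())
    return "".join(out)


def matches_caseless(a, b, turkic=False):
    """True if `a` and `b` are equal after folding."""
    return fold(a, turkic) == fold(b, turkic)
```

Under these rules "I" matches "i" only outside a Turkic locale, and matches dotless i only inside one, so a test written for one situation should fail in the other.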
|
|
|
|
|
|
| |
These will be used by later commits. But right now Perl doesn't know
how to determine if a locale is Turkic, so these functions return no
locale, until later in this commit series.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The code knew this, but it was adding the ASCII alphabetics to the list
of things that matched in UTF-8 locales. This is unnecessary, as we've
long had the infrastructure elsewhere to handle all potential mappings
from a Latin1 code point to other Latin1, so we can just rely on it.
And it created complexities for future commits in this series.
The MICRO SIGN is the exception, as it folds to non-Latin1 in UTF-8
locales, and this is the place where the structure exists to handle
that.
|
|
|
|
| |
This will be needed in future commits
|
| |
|
|
|
|
| |
Just align some logical or clauses for readability.
|
|
|
|
|
|
| |
This bug was introduced in b2296192536090829ba6d2cb367456f4e346dcc6
in 5.29.7. Using /il should not result in looking for a [:posix:] class
that matches the code points given.
|
|
|
|
|
|
|
|
|
| |
The regexp engine sets and restores $^R in a few places, but didn't
mg_set() (SvSETMAGIC()) it at all.
Calls to length() on $^R, both within regexp code blocks and on
a successful match could add utf8 length magic to $^R, and modifying
$^R without mg_set() could leave now invalid length magic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This has been a goal for a long time. I thought it would be a lot of
work, but have now realized that there is a fairly easy, simplistic
approach.
The core file is renamed fold_grind.pl. It formerly had an outer loop
which iterated over the possible character set regex pattern modifiers,
/a, /l, etc that were tested. Now that loop is just a block and new
wrapper files have been created, one per modifier. They just pass a
global to the core file that gives which modifier this test file is to
use. Hence each file corresponds to one iteration of the old outer
loop, splitting the tests up into 6 smaller tests that can run in
parallel.
|
|
|
|
|
|
|
|
| |
The test I added allocated more temp files, but didn't arrange for
backup files to be cleaned up.
Modified the cleanup to clean up every generated temp and backup file,
even if more are allocated in the future with mkfiles().
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was caused by a counting error.
An EXACTFish regnode has a finite length it can hold for the string
being matched. If that length is exceeded, a 2nd node is used for the
next segment of the string, for as many regnodes as are needed.
A problem occurs if a regnode ends with one of the 22 characters in
Unicode 11 that occur in non-final positions of a multi-character fold.
The design of the pattern matching engine doesn't allow matches across
regnodes. Consider, for example, a node that ends in the letter 'f' when
the next node begins with the letter 'i'. That sequence should match,
under /i, the ligature "fi" (U+FB01). But it wouldn't because the
pattern splits them across nodes. The solution I adopted was to forbid
a node to end with one of those 22 characters if there is another string
node that follows it. This is not foolproof; for example, if the
entire node consisted of only these characters, one would have to split
it at some position. (In that case, we just take as much of the string as
will fit.) But for real life applications, it is good enough.
What happens if a node ends with one of the 22 is that the node is
shortened so that those are instead placed at the beginning of the
following node. When the code encounters this situation, it backs off
until it finds a character that isn't a non-final fold one, and closes
the node with that one.
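The back-off rule described so far can be sketched in Python like this. It is a simplified model under stated assumptions: it works on an already-folded string, counts characters rather than bytes, and takes the set of non-final fold characters as a parameter. The function and parameter names are hypothetical and this is not the code in regcomp.c.

```python
def split_into_nodes(folded, max_len, non_final):
    """Split `folded` into chunks of at most `max_len` characters.

    A chunk that is followed by more text is shortened so that it does not
    end in a character from `non_final`; those characters start the next
    chunk instead.
    """
    nodes = []
    pos = 0
    while pos < len(folded):
        end = min(pos + max_len, len(folded))
        if end < len(folded):
            # Back off over trailing non-final fold characters.
            cut = end
            while cut > pos and folded[cut - 1] in non_final:
                cut -= 1
            # If the chunk consists only of such characters, there is no
            # good split point, so just take as much as fits.
            if cut > pos:
                end = cut
        nodes.append(folded[pos:end])
        pos = end
    return nodes
```

For example, with a maximum length of 4 and {'f', 't'} as an illustrative subset of the 22 characters, "abcfidef" splits into "abc", "fide", "f", which keeps the 'f' and 'i' together in one chunk.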
A /i node is filled with the fold of the input, for several reasons.
The most obvious is that it saves time: you can skip folding the pattern
at runtime. But there are reasons based on the design of the optimizer
as well, which I won't go into here, but are documented in regcomp.c.
When we back out the final characters in a node, we also have to back
out the corresponding unfolded characters in the input, so that those
can be (folded) into the following node. Since the number of characters
in the fold may not be the same as unfolded, there is not an easily
discernible correspondence between the input and the folded output.
That means that generally, what has to be done is that the input is
reparsed from the beginning of the node, but the permitted length has
been shortened (we know precisely how much to shorten it to) so that it
will end with something other than the 22. But, the code saves the
previous input character's position (for other reasons), so if we only
have to back up one character, we can just use that and not have to
reparse.
This bug was that the code thought a two character backup was really a
one character one, and did not reparse the node, creating an off-by-one
error, and a character was simply omitted in the pattern (that should
have started the following node). And the input had two of the 22
characters adjacent to each other in just the right positions that the
node was split. The bisect showed that when the node size was changed
the bug went away, at least for this particular input string. But a
different, longer, string would have triggered the bug, and this commit
fixes that.
This bug is actually very unlikely to occur in most real world
applications. That is because other changes in the regex compiler have
caused nodes to be split so that things that don't participate in folds
at all are separated out into EXACT nodes. (The reason for that is it
gives the optimizer things to grab on to under /i that it wouldn't
otherwise have known about.) That means that anything like this string
would never cause the bug to happen because blanks and commas, etc.
would be in separate nodes, and so no node would ever get large enough
to fill the 238 available byte slots in a node (235 on EBCDIC). Only a
long string without punctuation would trigger it. I have artificially
constructed such a string in the tests added by this commit.
One of the 22 characters is 't', so long strings of DNA "ACTG" could
trigger this bug. I find it somewhat amusing that this is something
like a DNA transcription error, which occurs in nature at very low
rates, but selection, it is believed, will make sure the error rate is
above zero.
|
| |
|
|
|
|
|
| |
When no file has previously been opened, "eof" should return true. This
behavior was broken by 32e653230c7ccc (see also [#60978]).
|
| |
|
|
|
|
|
| |
Previously COPLINE was updated (to the end of the file) before
reporting the error, which wasn't useful.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
A node that matches only 'A' and 'a', for example, can be turned into an
ANYOFM node, which is faster to execute. This is done after joining of
adjacent EXACTFish nodes, as longer nodes are better than shorter ones,
not least because they reduce the number of bugs with multi-char folds
not matching because of node boundaries.
But if a length 1 node remains, ANYOFM is better.
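To illustrate why a node matching only 'A' and 'a' can be cheap to execute, here is a Python sketch of a mask-based test. It assumes that such a node accepts a byte when the byte ANDed with a mask equals a fixed value; that is an assumption about the mechanism made for this sketch, not something stated in this commit message, and the function name is hypothetical.

```python
def make_mask_matcher(byte_a, byte_b):
    """Build a single-comparison matcher for two bytes differing in one bit."""
    diff = byte_a ^ byte_b
    if bin(diff).count("1") != 1:
        raise ValueError("bytes must differ in exactly one bit")
    mask = 0xFF & ~diff          # ignore the one differing bit
    base = byte_a & mask         # value every accepted byte must reduce to
    return lambda byte: (byte & mask) == base


# 'A' is 0x41 and 'a' is 0x61 in ASCII; they differ only in bit 0x20.
match_a = make_mask_matcher(ord("A"), ord("a"))
```

A single AND and compare then replaces a set lookup for each byte of the target.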
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a regnode for the case where nothing in the bitmap
matches. This allows the bitmap to be omitted, saving 32 bytes of
otherwise wasted space per node. Many non-Latin Unicode properties have
this characteristic. Further, since this node applies only to code
points above 255, which are representable only in UTF-8, we can
trivially fail a match where the target string isn't in UTF-8. Time
savings also accrue from skipping the bitmap look-up. When swashes are
removed, even more time will be saved.
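The early-failure shortcut can be sketched in Python as below. The function name and the use of a plain set for the code points above 255 are assumptions made for illustration; the real node stores its data differently.

```python
def high_only_match(code_point, target_is_utf8, high_set):
    """Match a code point against a class that contains nothing <= 255.

    `high_set` is a hypothetical stand-in for whatever structure lists the
    matching code points above 255.
    """
    # A non-UTF-8 target cannot contain any code point above 255, so the
    # match fails without consulting anything else.
    if not target_is_utf8:
        return False
    # There is no bitmap to look up: anything at or below 255 never matches.
    return code_point > 255 and code_point in high_set
```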
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit extensively changes the optimizations for ANYOF regnodes
that represent bracketed character classes.
The removal of the regex compilation pass now makes these feasible and
desirable. Compilation now tries hard to optimize an ANYOF node into
something smaller and/or faster when feasible.
Now, qr/[X]/ for any single character or POSIX class X, and any
modifiers like /d, /i, etc, should be the same as qr/X/ for the same
modifiers, unless it would require the pattern to be upgraded from
non-UTF-8 to UTF-8, unless not doing so could introduce bugs.
These changes fix some issues with multi-character /i folding.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for
handling an 's' at the leading edge, at the trailing edge, and at both
edges, for nodes under /di that contain nothing else that would
prevent them from being EXACTFU nodes. If two of these get joined, it
could create an 'ss' sequence which can't be an EXACTFU node, for U+DF
would match them unconditionally. Instead, under /di it should match
if and only if the target string is UTF-8 encoded.
I realized later that having three types becomes harder to deal with
when adding yet more node types, so this commit turns the three into
just one node type, indicating that at least one edge of the node is an
's'.
It also simplifies the parsing of the pattern and determining which node
to use.
|
|
|
|
|
| |
This makes it easier to add new tests without duplicating, as witnessed
by the duplicate ones this commit removes.
|
|
|
|
|
| |
This is in preparation for a future commit where it will be used in more
than one place.
|
|
|
|
|
|
| |
ANYOF nodes can generate different things depending on the UTF-8ness of
the pattern. This adds the capability of conveniently specifying in a
test that the pattern should be upgraded.
|
| |
|
|
|
|
| |
For: RT # 133722
|
|
|
|
|
|
|
| |
In the (?[ ... ]) regex sets features, one can embed another compiled
regex set pattern. Such compiled patterns always have a flag of '^',
which we weren't looking for prior to this commit. That meant that
uncompiled patterns would be mistaken for compiled ones.
|
|
|
|
|
|
|
|
|
| |
The text of perl5294delta was wrong about a change. This commit changes
that text, and adds an entry to the latest perldelta with the
correction. A test has been added to verify the way things work.
The wrong language led to this blog post, and my comment in it:
https://www.effectiveperlprogramming.com/2018/12/perl-v5-30-lets-you-match-more-with-the-general-quantifier/
|
|
|
|
|
|
|
|
|
|
|
|
| |
The tests where we write a string larger than the pipe size to
a pipe hang on Darwin 15.6.0, while they seem to work on Darwin 17.7.0.
So we will skip these tests on Darwin, if the major version is
less than 16. (We may adjust this if we have more reports on
which versions between 15.6.0 and 17.7.0 succeed/fail.)
Note that the tests hang even if we send a string of 512 characters,
which is much, much smaller than the actual size of the string in
the test.
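The skip condition amounts to a check on the OS name and the major release number. A minimal Python sketch, with a hypothetical function name and with the OS name and release passed in as plain strings, could be:

```python
def should_skip_pipe_tests(sysname, release):
    """Return True on Darwin when the major release number is below 16."""
    if sysname.lower() != "darwin":
        return False
    major = release.split(".", 1)[0]
    # An unparsable release string is treated as "do not skip".
    return major.isdigit() and int(major) < 16
```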
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous two commits fixed bugs where it would be possible during
optimization to join two EXACTFish nodes together, and the result would
not work properly with LATIN SMALL LETTER SHARP S. But by doing so,
the commits prevented all non-UTF-8 EXACTFU nodes that begin or end with
[Ss] from being trieable.
This commit changes things so that the only ones that are
non-trieable are the ones that, when joined, have the sequence [Ss][Ss]
in them. To do so, I created three new node types that indicate if the
node begins with [Ss] or ends with them, or both. These preclude having
to examine the node contents at joining to determine this. And since
there are plenty of node types available, it seemed the best choice.
But other options would be available should we run out of nodes.
Examining the first and final characters of a node is not expensive, for
example.
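The idea of recording the edge information once, so that a join does not need to look inside the nodes, can be sketched in Python as follows. The Node class, its field names, and the helper functions are hypothetical; the real implementation encodes this in the regnode type rather than in boolean fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    text: str
    starts_with_s: bool   # recorded once, when the node is created
    ends_with_s: bool


def make_node(text):
    """Create a node, noting whether its first or last character is [Ss]."""
    return Node(text,
                text[:1].lower() == "s",
                text[-1:].lower() == "s")


def join_would_create_ss(left, right):
    """True if joining puts an [Ss] from each node next to each other."""
    return left.ends_with_s and right.starts_with_s
```

The reason the sequence matters is that LATIN SMALL LETTER SHARP S folds to "ss"; Python shows the same fold with "\u00df".casefold().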
|
|
|
|
|
| |
The previous commit fixed a bug. This commit detects if someone creates
a new instance of that bug.
|