| Commit message (Collapse) | Author | Age | Files | Lines |
|
A synthetic start class (SSC) is generated by the regular expression
pattern compiler to consolidate everything that can possibly match at
the position where a match of the pattern begins.
For example,
qr/a?bfoo/;
requires the match to begin with either an 'a' or a 'b'. There are no
other possibilities. We can set things up to quickly scan for either of
these in the target string, and only when one of these is found do we
need to look for 'foo'.
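The idea can be sketched in C as follows. This is only an illustration
for the a?bfoo example, not the regex engine's actual code; the names
in_start_class(), match_here() and find_match() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical start class for qr/a?bfoo/: a match can only begin at
 * an 'a' or a 'b'. */
static int in_start_class(unsigned char c)
{
    return c == 'a' || c == 'b';
}

/* Try to match a?bfoo anchored at s; returns 1 on success. */
static int match_here(const char *s)
{
    if (*s == 'a')
        s++;                        /* consume the optional 'a' */
    return strncmp(s, "bfoo", 4) == 0;
}

/* Return the offset of the first match of a?bfoo in target, or -1. */
static long find_match(const char *target)
{
    size_t i;

    assert(target != NULL);
    for (i = 0; target[i] != '\0'; i++) {
        if (!in_start_class((unsigned char)target[i]))
            continue;               /* cheap rejection via the start class */
        if (match_here(target + i)) /* full attempt only at candidates */
            return (long)i;
    }
    return -1;
}
```

Only positions holding an 'a' or a 'b' ever reach match_here(), which is
the saving the start class is meant to provide.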
There is an overhead associated with using SSCs. If the number of
possibilities that the SSC excludes is relatively small, it can be
counter-productive to use them.
This patch creates a crude sieve to decide whether to use an SSC or not.
If the SSC doesn't exclude at least half the "likely" possibilities, it
is discarded. This patch is a starting point, and can be refined if
necessary as we gain experience.
See thread beginning with
http://nntp.perl.org/group/perl.perl5.porters/212644
In many patterns, no SSC is generated; and with the advent of tries,
SSCs have become less important, so whatever we do is not terribly
critical.
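A minimal sketch of such a sieve, under assumptions that are not stated
in this message: the SSC is modeled as a 256-entry byte table, and the
"likely" possibilities are taken to be the ASCII printables 0x20 through
0x7E. The name ssc_is_worth_keeping() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the "likely" code points are the ASCII printables.
 * These values are valid on ASCII platforms only. */
#define LIKELY_FIRST 0x20
#define LIKELY_LAST  0x7E

/* ssc[c] is nonzero when byte c can begin a match.  Returns 1 if the
 * SSC excludes at least half of the likely code points, else 0. */
static int ssc_is_worth_keeping(const unsigned char ssc[256])
{
    unsigned int c, total = 0, excluded = 0;

    assert(ssc != NULL);
    for (c = LIKELY_FIRST; c <= LIKELY_LAST; c++) {
        total++;
        if (!ssc[c])
            excluded++;
    }
    return 2 * excluded >= total;
}
```

An SSC that admits only 'a' and 'b' excludes 93 of the 95 printables and
is kept; one that admits every byte excludes none and is discarded.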
|
This creates a #define that gives the highest code point that is an
ASCII printable. On ASCII-ish platforms, this is 0x7E, but on EBCDIC
platforms it varies, and can be as high as 0xFF. This prepares for a
future commit in regcomp.c that needs this value.
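One way to see why the value is platform dependent is to compute it from
native character literals. This is a sketch only; the real #define is
generated, and the function name highest_ascii_printable() is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the highest native code point among the 95 ASCII printables.
 * The literals take whatever values the platform's character set
 * assigns them, so the result differs between ASCII and EBCDIC. */
static unsigned int highest_ascii_printable(void)
{
    static const char printables[] =
        " !\"#$%&'()*+,-./0123456789:;<=>?"
        "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"
        "`abcdefghijklmnopqrstuvwxyz{|}~";
    unsigned int highest = 0;
    size_t i;

    assert(sizeof printables - 1 == 95);
    for (i = 0; i < sizeof printables - 1; i++) {
        unsigned int c = (unsigned char)printables[i];
        if (c > highest)
            highest = c;
    }
    return highest;
}
```

On an ASCII platform this returns 0x7E, the code point of '~'.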
|
These will be used in future commits
|
This causes the generated unicode_constants.h to be valid on all
supported platforms
|
This is currently allowed, but is non-graphic, and is indistinguishable
from a regular space. I was the one who initially allowed it, and did
so out of ignorance of the negative consequences of doing so. There is
no other precedent for including it.
|
These character constants were used only for a special edge case in trie
construction that has been removed -- except for one instance in
regexec.c which could just as well be some other character.
|
These will be used in a future commit
|
These will be used in future commits
|
These will be used in future commits
|
I think it's clearer to use Copy. When I wrote this custom macro, we
didn't have the infrastructure to generate a UTF-8 encoded string at
compile time.
|
This was added in the 5.17 series so there's no code relying on its
current name. I think that the abbreviation is clearer.
|
This now uses the U+ notation to indicate code points, which is
unambiguous no matter what the platform's character set is. (charnames
accepts the U+ notation.)
|
This was added in the 5.17 series, so it can't yet be in the field, and
it isn't needed.
|
join_exact() prior to this commit returned a delta for 3 problematic
sequences showing that the minimum length they match is less than their
nominal length. It turns out that this is needed for all
multi-character fold sequences; our test suite just did not have the
tests in it to show that. Tests that do show this will be added in a
future commit, but code elsewhere must be fixed before they pass.
regcomp.c
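The arithmetic behind the delta can be illustrated as below. This is a
sketch, not join_exact() itself: it assumes lengths are counted in code
points, and it uses a small hand-picked table of full case folds from
Unicode rather than the complete set. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A code point whose full case fold is more than one code point. */
struct multi_fold {
    unsigned long cp;     /* the single code point */
    size_t fold_len;      /* code points in its full case fold */
};

/* Illustrative subset only. */
static const struct multi_fold folds[] = {
    { 0x00DF, 2 },  /* LATIN SMALL LETTER SHARP S folds to "ss" */
    { 0xFB00, 2 },  /* LATIN SMALL LIGATURE FF folds to "ff" */
    { 0xFB03, 3 },  /* LATIN SMALL LIGATURE FFI folds to "ffi" */
    { 0x0390, 3 },  /* folds to U+03B9 U+0308 U+0301 */
};

/* Length of the fold of cp; 1 when cp is not in the table. */
static size_t fold_len_of(unsigned long cp)
{
    size_t i;

    for (i = 0; i < sizeof folds / sizeof folds[0]; i++)
        if (folds[i].cp == cp)
            return folds[i].fold_len;
    return 1;
}

/* How much shorter than the folded (nominal) length a match can be:
 * each multi-character fold can be matched by one character. */
static size_t min_len_delta(const unsigned long *cps, size_t n)
{
    size_t i, delta = 0;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++)
        delta += fold_len_of(cps[i]) - 1;
    return delta;
}
```

For instance, a pattern that folds to "ss" has nominal length 2, yet it
can match the single character U+00DF, so the delta is 1.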
|
A future commit will want to use the first surrogate code point's UTF-8
value. Add this to the generated macros, and give it a name, since
there is no official one. The program has to be modified to cope with
this.
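For reference, the bytes in question can be derived with an encoder that
does not reject surrogates. The sketch below covers ASCII platforms only
(EBCDIC platforms use a different encoding), and encode_utf8_lax() is a
hypothetical name, not a function in the Perl source.

```c
#include <assert.h>
#include <stddef.h>

/* Encode cp (at most 0x10FFFF) into out and return the byte count.
 * Unlike a strict encoder, this deliberately accepts surrogates. */
static size_t encode_utf8_lax(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The first surrogate, U+D800, comes out as the three bytes ED A0 80.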
|
A previous commit has caused macros to be generated that will match
Unicode code points of interest to the \X algorithm. This patch uses
them. This speeds up modern Korean processing by 15%.
Together with recent previous commits, the throughput of modern Korean
under \X has more than doubled, and is now comparable to other
languages (which have themselves increased by 35%).
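To give a flavor of the kind of test involved, here is a sketch of
classifying precomposed Hangul syllables as LV or LVT using the
arithmetic from the Unicode Standard. It is not the generated macros
themselves, and the function names are hypothetical.

```c
#include <assert.h>

#define HANGUL_S_BASE  0xAC00UL   /* first precomposed syllable */
#define HANGUL_S_COUNT 11172UL    /* syllables U+AC00..U+D7A3 */
#define HANGUL_T_COUNT 28UL       /* trailing-consonant slots per LV */

static int is_hangul_syllable(unsigned long cp)
{
    return cp >= HANGUL_S_BASE && cp < HANGUL_S_BASE + HANGUL_S_COUNT;
}

/* Index of the trailing consonant; 0 means there is none. */
static unsigned long hangul_trailing_index(unsigned long cp)
{
    assert(is_hangul_syllable(cp));
    return (cp - HANGUL_S_BASE) % HANGUL_T_COUNT;
}

/* LV: leading consonant plus vowel, no trailing consonant. */
static int is_LV(unsigned long cp)
{
    return is_hangul_syllable(cp) && hangul_trailing_index(cp) == 0;
}

/* LVT: leading consonant, vowel and trailing consonant. */
static int is_LVT(unsigned long cp)
{
    return is_hangul_syllable(cp) && hangul_trailing_index(cp) != 0;
}
```

Range and modulus checks like these avoid a table lookup for the 11172
precomposed syllables.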
|
The recently added utf8_strings.h has been expanded to include more than
just strings. I'm renaming it to avoid confusion.
|