| Commit message (Collapse) | Author | Age | Files | Lines |
|
This commit revamps the recently added function calculate_mask(). It no
longer just gives a single mask/compare value for its input and fails if
there is none; it now returns a list of masks/compares when the set can
be split into subsets that can each be represented by a mask/compare.
If this list, taken as a whole, yields fewer branches than we would
otherwise get, it is better code, and is used.
Said another way, what we had before was all or nothing; this improves
things even when we can't do it all.
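As a rough illustration of the idea (not the code regen/regcharclass.pl
actually emits), consider a hypothetical class containing 'A', 'a', '0'
and '1'. No single mask/compare covers all four bytes, but the set splits
into two subsets that each do. All function names below are made up for
this sketch.

```c
#include <assert.h>

/* Hypothetical class: 'A' (0x41), 'a' (0x61), '0' (0x30), '1' (0x31). */

/* All-or-nothing fallback: one equality test per member. */
static int in_class_branches(unsigned char c)
{
    return c == 0x41 || c == 0x61 || c == 0x30 || c == 0x31;
}

/* Split into two subsets, each covered by one mask/compare:
 *   {0x41, 0x61} differ only in bit 0x20 -> (c & 0xDF) == 0x41
 *   {0x30, 0x31} differ only in bit 0x01 -> (c & 0xFE) == 0x30 */
static int in_class_masks(unsigned char c)
{
    return (c & 0xDF) == 0x41 || (c & 0xFE) == 0x30;
}

/* Confirms that both forms agree on every byte value. */
static void check_masks_agree(void)
{
    for (unsigned c = 0; c < 256; c++)
        assert(in_class_branches((unsigned char)c)
               == in_class_masks((unsigned char)c));
}
```

Here the list of two mask/compares replaces four equality tests, which is
the kind of saving the commit looks for before choosing the list form.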
|
This changes the macro isMULTI_CHAR_FOLD() (non-utf8 version) from just
generating ascii-range code points to generating the full Latin1 range.
However there are no such non-ASCII values, so the macro expansion is
unchanged. By changing the name, it becomes clearer in future commits
that we aren't excluding things that we should be considering.
|
These will be used in future commits
|
Karl Williamson noticed that we don't always deal with common suffixes in
the most efficient way. This change reworks how we convert a trie to an
optree so that common suffixes are always grouped together.
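To make the effect concrete, here is a hand-written sketch, assuming a
hypothetical class that holds U+0085 (UTF-8 bytes C2 85) and U+00C5
(bytes C3 85), which share their final byte. It is not output from
regen/regcharclass.pl, and the function names are invented.

```c
#include <assert.h>

/* Ungrouped: the shared trailing byte 0x85 is tested once per alternative. */
static int match_ungrouped(const unsigned char *s)
{
    return (s[0] == 0xC2 && s[1] == 0x85)
        || (s[0] == 0xC3 && s[1] == 0x85);
}

/* Grouped: the differing lead bytes are tested together, and the common
 * suffix is tested a single time. */
static int match_grouped(const unsigned char *s)
{
    return (s[0] == 0xC2 || s[0] == 0xC3) && s[1] == 0x85;
}

/* Confirms that both forms agree on every two-byte input. */
static void check_suffix_grouping(void)
{
    unsigned char buf[2];
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            buf[0] = (unsigned char)a;
            buf[1] = (unsigned char)b;
            assert(match_ungrouped(buf) == match_grouped(buf));
        }
    }
}
```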
|
We don't have any easy way to test regen/regcharclass.pl currently.
Perl #115078 is related to a bug in the _cleanup() routine, which is
fixed in the next patch.
|
regen/regcharclass.pl has been enhanced in previous commits so that it
generates as good code as these hand-defined macro definitions for
various UTF-8 constructs. And, it should be able to generate EBCDIC
ones as well. By using its definitions, we can remove the EBCDIC
dependencies for them. It is quite possible that the EBCDIC versions
were wrong, since they have never been tested. Even if
regcharclass.pl has bugs under EBCDIC, it is easier to find and fix
those in one place than in all the sundry definitions.
|
On UTF-8 input known to be valid, continuation bytes must be in the
range 0x80 .. 0xBF. Therefore, any tests for being within those bounds
will always be true, and may be omitted.
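A sketch of the saving, assuming a hypothetical class covering U+0080
through U+00BF, whose UTF-8 encodings are the lead byte 0xC2 followed by
one continuation byte. In standard UTF-8 every continuation byte lies in
0x80 .. 0xBF. The function names are illustrative only.

```c
#include <assert.h>

/* "Safe" form: makes no assumption about the input, so it also checks
 * that the second byte is a continuation byte. */
static int match_checked(const unsigned char *s)
{
    return s[0] == 0xC2 && s[1] >= 0x80 && s[1] <= 0xBF;
}

/* Form for input already known to be valid UTF-8: after a 0xC2 lead byte
 * the next byte must be a continuation byte, so the range test on it is
 * always true and is dropped. */
static int match_valid_input(const unsigned char *s)
{
    return s[0] == 0xC2;
}

/* On every valid two-byte sequence starting with 0xC2, both forms agree. */
static void check_valid_sequences(void)
{
    unsigned char buf[2];
    buf[0] = 0xC2;
    for (unsigned b = 0x80; b <= 0xBF; b++) {
        buf[1] = (unsigned char)b;
        assert(match_checked(buf) == match_valid_input(buf));
    }
}
```

The shorter form is only correct under the stated precondition; on
unvalidated input it would accept a malformed sequence such as C2 41.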
|
A previous commit added an optimization to save a branch in the
generated code at the expense of an extra mask when the input class has
certain characteristics. This extends that to the case where
sub-portions of the class have similar characteristics. The first
optimization for the entire class is moved to right before the new loop
that checks each range in it.
|
Branches can be eliminated from the macros that are generated here
by using a mask in cases where applicable. This adds checking to see if
this optimization is possible, and applies it if so.
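The message does not spell out when the optimization applies. One case,
assumed here purely for illustration, is a range whose length is a power
of two and whose start is aligned to that length; then two comparisons
collapse into one mask/compare. The helper name range_to_mask() is
hypothetical and is not taken from regen/regcharclass.pl.

```c
#include <assert.h>

/* Returns 1 and fills *mask and *cmp if the byte range [lo, hi] is an
 * aligned power-of-two block, in which case a byte c is in the range
 * exactly when (c & *mask) == *cmp.  Returns 0 otherwise. */
static int range_to_mask(unsigned lo, unsigned hi,
                         unsigned *mask, unsigned *cmp)
{
    unsigned len;

    if (lo > hi || hi > 0xFF)
        return 0;
    len = hi - lo + 1;
    if ((len & (len - 1)) != 0)   /* length must be a power of two */
        return 0;
    if ((lo & (len - 1)) != 0)    /* start must be aligned to the length */
        return 0;
    *mask = 0xFF & ~(len - 1);
    *cmp = lo;
    return 1;
}

/* 0x40 .. 0x47 qualifies; the shifted range 0x41 .. 0x48 does not. */
static void check_range_to_mask(void)
{
    unsigned mask, cmp;

    assert(range_to_mask(0x40, 0x47, &mask, &cmp));
    for (unsigned c = 0; c < 256; c++)
        assert(((c & mask) == cmp) == (c >= 0x40 && c <= 0x47));
    assert(!range_to_mask(0x41, 0x48, &mask, &cmp));
}
```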
|
The rules for deciding whether an above-Latin1 code point matches are now
saved in a macro generated from a trie by regen/regcharclass.pl, and pp.c
now uses them to test these cases. This allows removal of a wrapper
subroutine, and there is also no need for dynamic loading at run-time
into a swash.
This macro is about as big as I'm comfortable compiling in, but it
saves the building of a hash that can grow over time, and removes a
subroutine and interpreter variables. Indeed, performance benchmarks
show that it is about the same speed as a hash, but it does not require
having to load the rules in from disk the first time it is used.
|
\X is implemented in regexec.c as a complicated series of property
look-ups. It turns out that many of those are for just a few code
points, and so can be more efficiently implemented with a macro than a
swash. This generates those.
|
Instead of having to list all code points in a class, you can now use
\p{} or a range.
This changes some classes to use \p{}, so that any changes Unicode
makes to the definitions don't have to be made manually here as well.
|
Future commits will have other headers #include the headers generated by
these programs. It is best to guard against the preprocessor
trying to process these twice.
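The usual way to do this in C is an include guard. A minimal sketch
follows; the guard name H_EXAMPLE_GENERATED and the function inside it
are placeholders, not the names the generated headers actually use. The
guarded text is written out twice to simulate one header being included
twice.

```c
#include <assert.h>

/* First "inclusion": the guard macro is not yet defined, so the body is
 * processed and the guard becomes defined. */
#ifndef H_EXAMPLE_GENERATED
#define H_EXAMPLE_GENERATED
static int example_is_ascii_digit(unsigned c) { return c >= '0' && c <= '9'; }
#endif /* H_EXAMPLE_GENERATED */

/* Second "inclusion": the guard is already defined, so the preprocessor
 * skips the body and there is no redefinition error. */
#ifndef H_EXAMPLE_GENERATED
#define H_EXAMPLE_GENERATED
static int example_is_ascii_digit(unsigned c) { return c >= '0' && c <= '9'; }
#endif /* H_EXAMPLE_GENERATED */

/* The single surviving definition is usable as normal. */
static void check_guarded_definition(void)
{
    assert(example_is_ascii_digit('7'));
    assert(!example_is_ascii_digit('x'));
}
```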
|
Tricky folds have been removed from the code, so the removed #defines
are obsolete. I'm leaving this in so that it can conveniently be
referred to in case we ever need it again.
|
Sync copyright dates with actual changes according to git history.
[Plus run regen_perly.h to update the SHA-256 checksums, and
regen/regcharclass.pl to update regcharclass.h]
|
The tricky fold characters need to be expanded to include the ones
that map to the same ones as the original set. This isn't because the
new ones have a length issue; it's that they get left out of comparisons
because of the special regnodes generated for the tricky ones.
|
This results in small changes to the formatting of the generated comments
in regcharclass.h
|
Includes an updated regcharclass.h without a datestamp in it, so that
when it is trivially rebuilt its contents don't change.
p4raw-id: //depot/perl@31636
|
regex engine.
Message-ID: <9b18b3110704270709y50ef652ci436b3bb29abca275@mail.gmail.com>
p4raw-id: //depot/perl@31102
|
regex engine.
Message-ID: <9b18b3110704240746u461e4bdcl208ef7d7f9c5ef64@mail.gmail.com>
p4raw-id: //depot/perl@31081
|
(Yves Orton). Also, avoid trailing spaces.
p4raw-id: //depot/perl@31037
|
p4raw-id: //depot/perl@31031
|
p4raw-id: //depot/perl@31030
|
PCRE and unicode tr18
Message-ID: <9b18b3110704221434g43457742p28cab00289f83639@mail.gmail.com>
p4raw-id: //depot/perl@31026
|