| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
This missing entry is one used by t/porting/regen.t to see if the
contents are up-to-date. I don't know why it didn't get added earlier,
and why there aren't failures except apparently on my machine due to
its not being there. I thought I took great care in getting it right.
|
|
|
|
|
| |
This single line addition generates a very confused diff listing for the
generated file.
|
|
|
|
|
| |
This single line change generates a very confused diff listing for the
generated file, so is kept separate from the other \b{wb} commits.
|
|
|
|
|
|
|
|
|
| |
This will enable the next commit to add \b{gcb}.
I separated this out from that commit because the diff output here is
very confused, not accurately showing the underlying changes. Actually
two data structures are being added for every character set, and nothing
else changed.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes where the symbols are defined to a single file each. This
may save text space, depending on the compiler. The next commit will
cause this hdr to be included in more places, so it becomes more
important to do this.
At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE.
The code now is executed regardless of that. This is simpler, and
previously there might have been the possibility of uninitialized memory
being read, should re_comp.o be executed before regcomp.o.
|
|
|
|
|
|
|
|
|
|
|
| |
This is a partial implementation of a full inversion map generation
capability, which is why some code is indented more than necessary --
in the future there will be things that use that. But this is
sufficient for 5.22.
This allows for the generation of tables to handle the Unicode break
properties, like GCB and WB. Future commits will actually use this
capability.
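As a rough illustration of what an inversion map is (a Python sketch, not the
code that the generator emits; the range starts and value names below are made
up for the example): a sorted list of range starts is paired with a parallel
list of values, and a lookup finds the range that contains a code point.

```python
from bisect import bisect_right

# Hypothetical data: each entry in `starts` begins a range that runs up to
# (but not including) the next entry; `values` holds that range's value.
starts = [0x00, 0x0A, 0x0B, 0x0D, 0x0E]
values = ["Other", "LF", "Other", "CR", "Other"]


def invmap_lookup(starts, values, cp):
    """Return the value of the range that contains code point cp."""
    # bisect_right gives the number of range starts <= cp, so the
    # containing range is the one just before that position.
    index = bisect_right(starts, cp) - 1
    return values[index]
```

The final range is open-ended, so any code point at or above the last start
maps to the last value.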
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, if you said
prop_value_aliases("Any", "foo")
it would return "foo". But there really aren't any synonyms for the
"Any" property values, so it should return undef instead.
|
| |
|
| |
|
|
|
|
|
|
|
| |
Instead of using a constant code point in some of the lines, use the
$variable that is used in other lines.
Spotted by Dagfinn Ilmari Mannsåker
|
|
|
|
| |
This new function returns the input property's possible values.
|
|
|
|
| |
The new name more clearly reflects its input restrictions.
|
|
|
|
|
| |
We only need to reorder the native code points (0..255) for EBCDIC, so
can quit when we get there, by appropriately refactoring the code.
|
|
|
|
|
| |
Indent as a result of new block in the previous commit; reformat a
comment
|
|
|
|
|
|
|
| |
This reorders the code points below 256 depending on the platform.
However all platforms have the same values for those above 255, so can
skip this code if the first code point (and hence all code points) being
output isn't one of those affected.
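The early exit can be sketched as follows. This is a Python illustration under
stated assumptions, not the generator's code: reorder_for_platform and
toy_native are hypothetical names, and toy_native is a toy permutation rather
than a real EBCDIC table.

```python
def reorder_for_platform(code_points, native_of):
    """Map code points below 256 through native_of; leave the rest alone.

    code_points is assumed sorted ascending, so if the first one is above
    255 every one is, and the remapping can be skipped entirely.
    """
    if not code_points or code_points[0] > 255:
        return list(code_points)
    low = sorted(native_of[cp] for cp in code_points if cp < 256)
    high = [cp for cp in code_points if cp >= 256]
    return low + high


# Toy permutation of 0..255 that only swaps two positions (an assumption
# made for the example, not an actual platform mapping).
toy_native = list(range(256))
toy_native[0x41], toy_native[0xC1] = 0xC1, 0x41
```

With this toy table, a list starting at 0x100 is returned unchanged without
consulting the table at all.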
|
|
|
|
| |
This will make it easier to see differences in future commits.
|
|
|
|
|
|
| |
Unicode represents all code points as hex, so follow suit.
I, for one, am used to seeing hex code points, and so eyeballing these
makes more sense when they are in hex.
|
|
|
|
|
|
| |
This adds an undocumented way to get invmap() to return internal
properties, like invlist(). This is intended only for Perl-core
use.
|
| |
|
|
|
|
|
|
|
|
| |
regen.t should fail if Unicode tables are updated and this header is
not regenerated.
See commit 713f4b7fa and the thread beginning at
<20141204124705.472.qmail@lists-nntp.develooper.com>.
|
|
|
|
|
| |
and check that checksum in t/porting/regen.t. This makes the tests
run faster.
|
| |
|
|
|
|
|
|
|
|
| |
Even though this file is not intended to be human consumable, it is
annoying to see #if ... #endif #if ...
where the #endif and #if could be consolidated.
It turns out not to be hard to do that.
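One way such consolidation could work is sketched below in Python. It assumes
that the blocks being merged are adjacent and guarded by the same condition;
the commit message does not spell that out, and emit_guarded is a hypothetical
name, not a function in the generator.

```python
def emit_guarded(blocks):
    """Emit (condition, text) blocks, merging neighbours that share a condition."""
    lines = []
    previous = None
    for condition, text in blocks:
        if condition != previous:
            # Close the old guard (if any) and open a new one only when
            # the condition actually changes.
            if previous is not None:
                lines.append("#endif")
            lines.append(f"#if {condition}")
            previous = condition
        lines.append(text)
    if previous is not None:
        lines.append("#endif")
    return "\n".join(lines)
```

Two consecutive blocks guarded by the same condition then share a single
#if ... #endif pair instead of closing and immediately reopening it.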
|
|
|
|
|
| |
This causes the generated charclass_invlists.h to be valid on all
supported platforms.
|
|
|
|
|
|
|
| |
Prior to this patch, this was in regen/mk_invlists.pl, but future
commits will want it to also be used by the header generated by
regen/regcharclass.pl, so use a common source so the logic doesn't have
to be duplicated.
|
|
|
|
|
|
|
| |
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused loading
disk files. This patch causes all the information to be compiled into
the Perl binary.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previous commits in this series have removed all uses of this global
array. This completely removes it.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This global array is no longer used, having been removed in previous
commits in this series.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When constructing what matches code points under /i, Perl uses an
inversion list of all the possible code points that participate in
folds. This number is relatively few compared to the possible universe
of code points, as most of the world's scripts aren't cased, and many
characters in the scripts that do fold aren't foldable (such as
punctuation). Prior to this commit, the list for the above-Latin1 code
points was read in from disk if and only if needed. This commit causes
the list to be added to read-only data in a C header, trading a little
space in Perl's text segment for speed at execution. This will enable
ripping out some code in this and future commits (offsetting the space
used by this one).
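For readers unfamiliar with inversion lists, here is a minimal Python sketch
of a membership test. It is an illustration only, not Perl's C implementation,
and the example list of ASCII letters is made up for the purpose.

```python
from bisect import bisect_right


def invlist_contains(invlist, cp):
    """True if cp falls in a range that is in the set.

    Even-indexed elements start ranges that are in the set; odd-indexed
    elements start ranges that are not.
    """
    # An odd count of elements <= cp means the last boundary crossed
    # was the start of an "in" range.
    return bisect_right(invlist, cp) % 2 == 1


# Example set: A-Z and a-z, stored as four boundaries.
ascii_letters = [0x41, 0x5B, 0x61, 0x7B]
```

The list stores only the boundaries between matching and non-matching runs,
which is why a sparse set such as the code points that participate in folds
stays small.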
|
|
|
|
|
|
|
|
| |
This changes charclass_invlists.h to have the complete definitions for
all the POSIX classes, like \w and [:alpha:]. Thus these won't have to
be loaded off disk at run-time.
Taking advantage of this will be done in stages in future commits.
|
| |
|
|
|
|
|
| |
This is the upper half of the Latin1 range. This simplifies some code
very slightly, but will be of use in future commits.
|
|
|
|
|
|
|
| |
The first commit of this topic branch added a dummy 0 element to the end
of certain inversion lists to work around an off-by-one error. This
commit makes the necessary changes to stop that error, and to remove
the dummy element. SvCUR() and invlist_len() now are kept in sync.
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 18505f093a44607b687ae5fe644872f835f66313, which
reverted 241136e0ed70738cccd6c4b20ce12b26231f30e5, thus reinstating the
latter commit. It turns out that the error being chased down was not
due to this commit.
Its original message was:
The inversion lists that are compiled into a C header are now const.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 67434bafe4f2406e7c92e69013aecd446c896a9a, which
reverted 4fdeca7844470c929f35857f49078db1fd124dbc, thus reinstating the
latter commit. It turns out that the error being chased down was not
due to this commit.
Its original message was:
This commit continues the process of separating the header area of
inversion lists from the body. 2 more fields are moved out of the
header portion of the inversion list, and into the header portion of the
SV that contains it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit de353015643cf10b437d714d3483c1209e079916 which
reverted 533c4e2f08b42d977e5004e823d4849f7473d2d0, thus reinstating it,
plus this commit adds a fix to get it to pass under Address Sanitizer.
The root cause of the problem is that there are two measures of the
length of an inversion list. One is SvCUR(), and the other is
invlist_len(). The original commit caused these to get off-by-one in
some cases. The ultimate solution is to only store one value, and
return the other one based off that. Rather than redo the whole branch,
I've taken an easier way out, which is to add a dummy element at the end
of some inversion lists, so that they aren't off-by-one. Then the other
patches from the original branch will be applied. Each will be
tested with Address Sanitizer. Then the work to fix the underlying
problem will be done.
The original commit's message was:
This commit is the first step to separating the header from the body of
inversion lists. Doing so will allow the compiled-in inversion lists to
be fully read-only.
To invert an inversion list, one simply unshifts a 0 to the front of it
if one is not there, and shifts off the 0 if it does have one.
The current data structure reserves an element at the beginning of each
inversion list that is either 0 or 1. If 0, it means the inversion list
begins there; if 1, it means the inversion list starts at the next
element. Inverting involves flipping this bit.
This commit changes the structure so that there is an additional element
just after the element that flips. This new element is always 0, and
the flipping element now says whether the inversion list begins at the
constant 0 element, or the one after that.
Doing this allows the flipping element to be separated in later commits
from the body of the inversion list, which will always begin with the
constant 0 element. That means that the body of the inversion list can
be const.
|
|
|
|
|
|
|
| |
This reverts commit 533c4e2f08b42d977e5004e823d4849f7473d2d0.
This continues the backing out of this topic branch. A bisect shows
that the first commit exhibiting an error is the first one in the
branch.
|
|
|
|
|
|
|
| |
This reverts commit 4fdeca7844470c929f35857f49078db1fd124dbc.
This continues the backing out of this topic branch. A bisect shows
that the first commit exhibiting an error is the first one in the
branch.
|
|
|
|
|
|
|
| |
This reverts commit 241136e0ed70738cccd6c4b20ce12b26231f30e5.
This continues the backing out of this topic branch. A bisect shows
that the first commit exhibiting an error is the first one in the
branch.
|
|
|
|
| |
The inversion lists that are compiled into a C header are now const.
|
|
|
|
|
|
|
| |
This commit continues the process of separating the header area of
inversion lists from the body. 2 more fields are moved out of the
header portion of the inversion list, and into the header portion of the
SV that contains it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit is the first step to separating the header from the body of
inversion lists. Doing so will allow the compiled-in inversion lists to
be fully read-only.
To invert an inversion list, one simply unshifts a 0 to the front of it
if one is not there, and shifts off the 0 if it does have one.
The current data structure reserves an element at the beginning of each
inversion list that is either 0 or 1. If 0, it means the inversion list
begins there; if 1, it means the inversion list starts at the next
element. Inverting involves flipping this bit.
This commit changes the structure so that there is an additional element
just after the element that flips. This new element is always 0, and
the flipping element now says whether the inversion list begins at the
constant 0 element, or the one after that.
Doing this allows the flipping element to be separated in later commits
from the body of the inversion list, which will always begin with the
constant 0 element. That means that the body of the inversion list can
be const.
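The two ideas above can be sketched in Python (an illustration only; the real
structure is a C array inside an SV, and the function names here are
hypothetical). The first function inverts by adding or removing a leading 0.
The other two model the new layout, where the body always begins with a
constant 0 and a separate offset says where the list begins.

```python
def invert(invlist):
    """Return the complement of an inversion list."""
    if invlist and invlist[0] == 0:
        return invlist[1:]      # shift off the leading 0
    return [0] + invlist        # unshift a 0 onto the front


def view(body, offset):
    """body always begins with a constant 0; offset (0 or 1) says
    whether the list begins at that 0 or at the element after it."""
    return body[offset:]


def invert_offset(offset):
    """Inverting only flips the offset; the body is never written to."""
    return offset ^ 1
```

For example, with body [0, 0x41, 0x5B], offset 1 views the list [0x41, 0x5B],
and flipping the offset to 0 views its complement without touching the body.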
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These lists are declared at file scope so will be global unless
made static. Actual use of these lists is via the various PL_xxx
global variables that point to them and that (except for
NonL1_Perl_Non_Final_Folds_invlist) are initialized in
Perl_re_op_compile in regcomp.c (but not in its incarnation as
ext/re/re_comp.c).
So change the lists to be static, and also skip declaring and
initializing them in ext/re/re_comp.c except for the one case that
is actually used in the extension version.
|
|
|
|
|
| |
This causes charclass_invlists.h to have a new list of all the
characters whose fold is a sequence of more than one character.
|
|
|
|
|
|
|
| |
Benchmarking showed some speed-up when the result of the previous
search in an inversion list is cached, thus potentially avoiding a
search in the next call. This adds a field to each inversion list which
caches its previous search result.
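A minimal Python sketch of the caching idea follows. It is not the C code in
regcomp.c: the class and attribute names are hypothetical, and the searches
counter exists only to make the effect of the cache visible.

```python
from bisect import bisect_right


class CachedInvlist:
    def __init__(self, elements):
        self.elements = elements
        self.previous_index = None  # cached result of the last search
        self.searches = 0           # number of full binary searches done

    def search(self, cp):
        """Return the index of the last element <= cp, or -1 if none."""
        i = self.previous_index
        if i is not None and i >= 0:
            low = self.elements[i]
            has_next = i + 1 < len(self.elements)
            # If cp lies in the same range as last time, reuse the answer.
            if low <= cp and (not has_next or cp < self.elements[i + 1]):
                return i
        self.searches += 1
        i = bisect_right(self.elements, cp) - 1
        self.previous_index = i
        return i
```

Successive lookups that fall in the same range, which is common when scanning
text in a single script, are answered from the cache with two comparisons.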
|
|
|
|
|
|
|
|
|
|
| |
This starts with the existing table that mktables generates that lists
all the characters in Unicode that occur in multi-character folds, and
aren't in the final positions of any such fold.
It generates data structures with this information to make it quickly
available to code that wants to use it. Future commits will use these
tables.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit is the minimal necessary to get \s to match the vertical
tab. It is being done early in the 5.17 series in order to see what
repercussions there might be from doing this.
It may well be that we decide that this change will require a 'use
feature' to activate. In any event there is significant documentation
of the behavior without the VT that this patch does not address at all.
Tom Christiansen asked Larry Wall why \s did not include VT, and
reported that Larry replied that he did not remember, but had no
objections to adding it.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was an off-by-one error caused by my failing to realize that things
had to be done differently at the 255/256 boundary depending on whether
U+00FF matched or did not match the property.
Two properties were affected, [:upper:] and [:punct:]. The bug was that
all code points above the first one > 255 that legitimately matches the
property will match whether or not they should. In the case of
[:upper:], this meant that effectively anything from 256..infinity
matched. For [:punct:], it was anything above U+037D.
|