| Commit message | Author | Age | Files | Lines |
|
|
|
|
| |
This meant changing LABEL's definition in perly.y, so most of this
commit is actually from the regened files.
|
| |
|
|
|
|
|
| |
This was added in the previous commit, but was unnecessary, as it
is not used anywhere and is not part of the public API.
|
| |
|
|
|
|
|
|
| |
These functions can read beyond the end of their input strings if
presented with malformed UTF-8 input. Perl core code has been converted
to use other functions instead of these.
|
|
|
|
|
|
|
|
| |
These functions are like utf8_to_uvuni() and utf8_to_uvchr(), but their
names imply that the input UTF-8 has been validated.
They are not currently documented, as it's best for XS writers to call
the functions that do validation.
|
|
|
|
|
|
|
|
| |
The existing functions (utf8_to_uvchr and utf8_to_uvuni) have a
deficiency in that they could read beyond the end of the input string if
given malformed input. This commit creates two new functions which
behave as the old ones did, but have an extra parameter each, which
gives the upper limit to the string, so no read beyond it is done.
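The idea of passing an upper limit can be modelled outside of C. The following is a minimal Python sketch, not the core implementation: the function name `utf8_decode_bounded` and its return convention are hypothetical, and it assumes standard UTF-8 start-byte ranges rather than Perl's more permissive extended encoding.

```python
def utf8_decode_bounded(buf, pos, end):
    """Decode one code point starting at buf[pos] without touching buf[end:].

    Returns (code_point, length), or (None, 0) when the sequence is
    malformed or would run past `end`. Hypothetical helper for illustration.
    """
    if pos >= end:
        return (None, 0)
    first = buf[pos]
    if first < 0x80:
        return (first, 1)
    if 0xC2 <= first <= 0xDF:
        need, cp = 2, first & 0x1F
    elif 0xE0 <= first <= 0xEF:
        need, cp = 3, first & 0x0F
    elif 0xF0 <= first <= 0xF4:
        need, cp = 4, first & 0x07
    else:
        return (None, 0)  # stray continuation byte or invalid start byte
    if pos + need > end:
        # The start byte promises more bytes than the caller says exist.
        # An unbounded decoder would read past the end here.
        return (None, 0)
    for i in range(pos + 1, pos + need):
        byte = buf[i]
        if byte & 0xC0 != 0x80:
            return (None, 0)  # not a continuation byte
        cp = (cp << 6) | (byte & 0x3F)
    return (cp, need)
```

The bounds check happens before any continuation byte is read, which is the property the extra parameter is meant to guarantee. This sketch does not reject every overlong or surrogate encoding.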
|
| |
This cleans up, simplifies, and extends how the trie
logic interacts with the new node types. This change ultimately
makes EXACTFU, EXACTFU_SS, and EXACTFU_NO_TRIE (renamed to
EXACTFU_TRICKYFOLD) work properly with the trie engine regardless
of whether the string is utf8 or latin1.
This patch depends on the following:
EXACT => utf8 or "binary" text
EXACTFU => either pre-folded utf8, or latin1 that has to be folded as though it was utf8
EXACTFU_SS => special case of EXACTFU to handle \xDF/ss (affects latin1 treatment)
EXACTFU_TRICKYFOLD => special case of EXACTFU to handle tricky non-latin1 fold rules
EXACTF => "old style fold logic" untriable nodetype
EXACTFA => (currently) untriable nodetype
EXACTFL => (currently) untriable nodetype
See the comments in regcomp.sym for these fold types.
This patch involves a number of distinct, but related parts. Starting
from compilation:
* Simplify how we detect a triable sequence given the new nodetypes.
This also probably fixed some "bugs" in how we detected certain
sequences, like /||foo|bar/.
* Simplify how we read EXACTFU nodes under utf8 by removing the now
redundant folding logic (EXACTFU nodes under utf8 are prefolded).
Also extend this logic to handle latin1 patterns properly (in
conjunction with other changes).
* Part of the problem associated with EXACTFU_SS and EXACTFU_TRICKYFOLD
has to do with how the trie logic interacts with the minlen logic.
This change handles both by pessimising the minlen when encountering
these nodetypes. One observation is that the minlen logic is basically
broken, and works only because it conflates bytes and codepoints in
such a way that we more or less always get a value small enough that
things work out anyway. Fixing that properly is the job of another
patch.
* Part of the problem of doing folding under unicode rules is that
there are a lot of foldings possible, some with strange rules. This
means that the bitmap logic does not work correctly in all cases,
as we currently do not have any way to populate it properly.
So this patch disables the bitmap entirely when folding is involved
until that is fixed.
The end result of this is: we can TRIE/AHOCORASICK any sequence of
EXACT, or EXACTFU (ish) nodes, regardless of utf8 or not, but we disable
the bitmap when folding.
A note for follow up relating to this patch is that, with the way
EXACTFU_XXX nodes are currently dealt with, we won't build the "maximal"
trie because of their presence, instead creating a "jumptrie" consisting
of either a leading EXACTFU node followed by an EXACTFU_XXX node, or vice
versa. We should eventually address that.
|
|
|
|
|
|
| |
Previously it was being passed &rsfp as a parameter, because it was
returning another value, fdscript. However, the return value has been
ignored since commit cc69b689ee7c2745 removed suidperl in January 2009.
|
|
|
|
|
|
|
|
|
|
| |
As described in the pod changes in this commit, this changes quotemeta()
to consistently quote non-ASCII characters when used under
unicode_strings. The behavior is changed for these and UTF-8 encoded
strings to more closely align with Unicode's recommendations.
The end result is that we *could* at some future point start using other
characters as metacharacters than the 12 we do now.
|
|
|
|
|
|
|
| |
This function assumes that there is enough space in the buffer to read
however many bytes are indicated by the first byte in the alleged UTF-8
encoded string. This may not be true, and so it can read beyond the
buffer end. is_utf8_char_buf() should be used instead.
|
|
|
|
|
|
|
|
|
|
| |
This function is to replace is_utf8_char(), and requires an extra
parameter to ensure that it doesn't read beyond the end of the buffer.
Convert is_utf8_char() and the only place in the Perl core to use the
new one, assuming in each that there is enough space.
Thanks to Jarkko Hietaniemi for suggesting this function name.
|
|
|
|
|
| |
This function provides a convenient and thread-safe way for modules to
hook op checking.
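As a rough model of what thread-safe hooking involves, here is a Python sketch. All names (`wrap_checker`, `run_checkers`, the factory-style argument) are hypothetical and do not mirror the C signature; the sketch only shows the read-old-then-install-new step being done under a lock so that two modules hooking at once cannot lose each other's checker.

```python
import threading

_checkers = {}                      # opcode -> currently installed checker
_checkers_lock = threading.Lock()   # serialises hook installation


def default_checker(op):
    """Checker used when nothing has been hooked: returns the op unchanged."""
    return op


def wrap_checker(opcode, make_checker):
    """Install make_checker(previous) as the checker for opcode.

    Reading the previous checker and storing the new one happen as one
    locked step, so concurrent callers always chain onto each other.
    """
    with _checkers_lock:
        previous = _checkers.get(opcode, default_checker)
        _checkers[opcode] = make_checker(previous)


def run_checkers(opcode, op):
    """Run the installed chain of checkers for opcode on op."""
    return _checkers.get(opcode, default_checker)(op)
```

Each hook receives the previous checker and is expected to call it, so hooks from independent modules compose in installation order.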
|
|
|
|
|
|
|
|
|
| |
This adds a routine that will take a C array and quickly create an
inversion list that points to that array. Thus the array had better be
exactly the internal form that is required for an inversion list. To
make sure that this doesn't get out of sync, a new field in the list's
header is created that is a combination of
version-number/inversion-list-type.
|
|
|
|
|
|
|
|
|
| |
Previous commits have added the ability to the inversion list
intersection routine to take the complement of one of its inputs.
Likewise, for unions, this will be a frequent paradigm, and it is
cheaper to do the complement of an input in the routine than to
construct a new temporary that is the desired complement, and throw it
away.
|
|
|
|
|
|
| |
This function is no longer necessary, as it is just a call to the newly
created _invlist_intersection_maybe_complement_2nd() with the correct
parameters.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that it is a common paradigm to want to take the
intersection of an inversion list with the complement of another
inversion list. In fact, this is how to subtract the second
inversion list from the first, as what remains in the first after the
subtraction is everything in it that is not in the second.
It also turns out that it adds very few cycles to an intersection to
complement one (or both, should we choose to) of the operands. By
adding this capability, we don't have to create a copy of the inverted
operand beforehand, just to throw it away.
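A small Python sketch of the idea follows. It assumes an inversion list is a strictly increasing list of code points at which membership toggles, with even indexes starting a range that is in the set. The function name and the `complement_b` parameter are illustrative, not the core's, and the sweep uses `sorted()` for brevity where the C code would merge the two arrays directly.

```python
def intersection_maybe_complement_2nd(a, b, complement_b=False):
    """Intersect inversion list `a` with `b`, or with the complement of `b`.

    Complementing `b` costs nothing extra: it only flips the state that
    `b` starts in before its first boundary.
    """
    result = []
    in_a = False
    in_b = complement_b      # complement: "inside" until b's first boundary
    in_result = False
    i = j = 0
    # Visit every boundary once; 0 is included so an initial "inside"
    # state is recorded correctly.
    for point in sorted(set(a) | set(b) | {0}):
        if i < len(a) and a[i] == point:
            in_a = not in_a
            i += 1
        if j < len(b) and b[j] == point:
            in_b = not in_b
            j += 1
        now = in_a and in_b
        if now != in_result:
            result.append(point)
            in_result = now
    return result
```

With `a` as A-Z (`[0x41, 0x5B]`) and `b` as A-F (`[0x41, 0x47]`), passing `complement_b=True` gives G-Z, which is the subtraction described above, and no temporary complemented list is built.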
|
|
|
|
|
|
|
| |
It is common in a loop to keep adding inversion lists to a current
running total. But the first time through, the current union list needs
to be initialized from NULL. This puts that code in the function
instead of the callers each having to do it.
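The calling pattern this enables can be sketched in Python, with `None` standing in for NULL. The name `invlist_union` and the list-of-boundaries representation are assumptions made for illustration.

```python
def invlist_union(running, new):
    """Union of two inversion lists; `running` may be None on the first pass."""
    if running is None:
        # First time through the loop: nothing accumulated yet.
        return list(new)
    # Tag each boundary with the list it came from, then sweep in order.
    events = sorted((point, which)
                    for which, lst in enumerate((running, new))
                    for point in lst)
    inside = [False, False]
    was_in = False
    result = []
    idx = 0
    while idx < len(events):
        point = events[idx][0]
        # Apply every toggle at this code point before deciding.
        while idx < len(events) and events[idx][0] == point:
            which = events[idx][1]
            inside[which] = not inside[which]
            idx += 1
        now = inside[0] or inside[1]
        if now != was_in:
            result.append(point)
            was_in = now
    return result
```

A caller can then start from `total = None` and write `total = invlist_union(total, lst)` inside the loop without a special case for the first iteration.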
|
|
|
|
|
| |
402642c6301a1dbc64ea3acc8beee35078afee26 only changed pad_findmy_pvn.
pad_findmy_pv and pad_findmy_sv need the same treatment.
|
| |
|
|
|
|
|
| |
Instead of just doing SvPV on something that is not a PV, SvPVbyte
should actually do what it is advertised as doing.
|
|
|
|
|
|
|
|
|
|
| |
It shouldn’t destroy globs or references passed to it, or try to
coerce them if they are read-only or incoercible.
I added tests for SvPVbyte at the same time, even though it was not
exhibiting the same problems, as sv_utf8_downgrade doesn’t try to
coerce anything. (SvPVbyte has its own set of bugs, which I hope to
fix in forthcoming commits.)
|
|
|
|
| |
so that stringification will be able to use it, too.
|
| |
newATTRSUB requires the sub name to be passed to it wrapped up in
a const op.
Commit 8756617677dbd allowed it to accept a GV that way, since
S_maybe_add_coresub (in gv.c) needed to pass it an existing GV not in
the symbol table yet (to simplify code elsewhere).
This had the inadvertent side-effect of making the GV read-only, since
that’s what the check function for const ops does.
Even if we were to call this a feature, it wouldn’t make sense as
implemented, as GVs for non-ampable (&-able) subs like *CORE::chdir
were not being made read-only.
This commit adds a new flag to newATTRSUB, to allow a GV to be passed
as the o parameter, instead of an op. While this may look as though
it’s undoing the simplification in commit 8756617677dbd by adding
more code, the new code is still conceptually simpler and more
straightforward.
Since newATTRSUB is in the API, I had to add a new _flags variant.
(How did newATTRSUB get into the API to begin with?)
In adding a test, I also discovered that ‘used once’ warnings
were applying to these subs, which is obviously wrong. Commit
8756617677dbd caused that, too, as it was relying on the side-effect
of newATTRSUB doing a GV lookup.
This fixes that, too, by turning on the multi flag in
S_maybe_add_coresub.
|
|
|
|
|
|
|
| |
I think it is clearer to note that what happens here is that the node
can match fewer characters than what it would normally be thought to,
and hence the returned value should be subtracted; it also means that
the absolute value need not be taken.
|
|
|
|
|
|
|
| |
The strings in every EXACTFish node are examined for certain problematic
sequences and code points. Prior to this patch, this was done in
several passes, but this refactors the routine to do it in a single
pass.
|
|
|
|
|
|
| |
This changes a parameter of this function so that, instead of updating a
running total, it returns the actual value computed by the function; the
calling code is changed to compensate.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the function that returns the swash associated with a
bracketed character class so that it returns the original swash and not
a copy. The function is renamed and made accessible only from within
regexec.c, and a new wrapper function with the original name is created
that just calls the other one and returns a copy of the swash.
Thus, all access from outside regexec.c will use a copy which if
overwritten will not harm others; while the option exists from within
regexec.c to use a shared version.
|
|
|
|
| |
This will be used in future commits for debug traces.
|
|
|
|
|
|
|
| |
Add a new parameter to _core_swash_init() that is an inversion list to
add to the swash, along with a boolean to indicate if this inversion
list is derived from a user-defined property. This capability will prove
useful in future commits.
|
|
|
|
|
|
| |
This adds the capability, to be used in future commits, for swash_init()
to return NULL instead of croaking if it can't find a property, so that
the caller can choose how to handle the situation.
|
|
|
|
| |
This function will be used in future commits.
|
|
|
|
|
| |
This function does a binary search on an inversion list. It will be
used in future commits.
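For readers unfamiliar with the data structure, here is a short Python sketch of such a search. It assumes the usual inversion-list layout (a sorted list of code points where membership toggles, even indexes starting an included range); the function names are illustrative.

```python
from bisect import bisect_right


def invlist_search(invlist, code_point):
    """Index of the last element <= code_point, or -1 if there is none."""
    return bisect_right(invlist, code_point) - 1


def invlist_contains(invlist, code_point):
    """A code point is in the set when the element found has an even index."""
    index = invlist_search(invlist, code_point)
    return index >= 0 and index % 2 == 0
```

For `[0x41, 0x5B]` (A-Z), `invlist_contains` is true for 0x41 through 0x5A and false for 0x40 and 0x5B, with the lookup cost growing only logarithmically in the number of ranges.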
|
| |
Currently, swash_init returns a copy of the swash it finds. The core
portions of the swash are read-only, and the non-read-only portions are
derived from them. When the value for a code point is looked up, the
results for it and adjacent code points are stored in a new element,
so that the lookup never has to be performed again. But since a copy is
returned, those results are stored only in the copy, and any other uses
of the same logical swash don't have access to them, so the lookups have
to be performed for each logical use.
Here's an example. If you have 2 occurrences of /\p{Upper}/ in your
program, there are 2 different swashes created, both initialized
identically. As you start matching against code points, say "A" =~
/\p{Upper}/, the swashes diverge, as the results for each match are
saved in the one applicable to that match. If you match "A" in each
swash, it has to be looked up in each swash, and an (identical) element
will be saved for it in each swash. This is wasteful of both time and
memory.
This patch renames the function and returns the original and not a copy,
thus eliminating the overhead for swashes accessed through the new
interface. The old function name is serviced by a new function which
merely wraps the new name result with a copy, thus preserving the
interface for existing calls.
Thus, in the example above, there is only one swash, and matching "A"
against it results in only one new element, and so the second use will
find that, and not have to go out looking again. In a program with lots
of regular expressions, the savings in time and memory can be quite
large.
The new name is restricted to use only in regcomp.c and utf8.c (unless
XS code cheats the preprocessor), where we will code so as to not
destroy the original's data. Otherwise, a change to that would change
the definition of a Unicode property everywhere in the program.
Note that there are no current callers of the new interface; these will
be added in future commits.
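The /\p{Upper}/ example can be modelled in a few lines of Python. This is a sketch only: the `Swash` class, the per-code-point cache (the real structure caches a block of adjacent code points) and the `slow_lookups` counter are all inventions for illustration.

```python
class Swash:
    """Stand-in for a swash: a slow lookup plus a cache of past results."""

    def __init__(self, name, slow_lookup):
        self.name = name
        self._slow_lookup = slow_lookup  # models the expensive property lookup
        self.cache = {}                  # code point -> bool, filled on demand
        self.slow_lookups = 0            # instrumentation for this sketch only

    def matches(self, code_point):
        if code_point not in self.cache:
            self.slow_lookups += 1
            self.cache[code_point] = self._slow_lookup(code_point)
        return self.cache[code_point]


_originals = {}  # one original swash per property name


def swash_shared(name, slow_lookup):
    """New-style interface: every caller gets the same object."""
    if name not in _originals:
        _originals[name] = Swash(name, slow_lookup)
    return _originals[name]


def swash_copy(name, slow_lookup):
    """Old-style interface: wrap the shared result in a private copy."""
    original = swash_shared(name, slow_lookup)
    clone = Swash(original.name, slow_lookup)
    clone.cache = dict(original.cache)
    return clone


def is_upper(code_point):
    return chr(code_point).isupper()
```

Two copies each pay for their own lookup of "A", while two users of the shared swash pay once between them. The flip side, as noted above, is that anything written to the shared object is visible to every user.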
|
|
|
|
| |
Otherwise can have memory leaks
|
|
|
|
|
| |
This function has always confused me, as it doesn't return a swash, but
a swatch.
|
|
|
|
|
|
|
|
|
|
| |
These 4 functions have been replaced by variants to_utf8_foo_flags(),
but for XS code that called the old ones in the Perl_to_utf8_foo()
forms, backwards compatibility versions need to be created.
For calls of just the to_utf8_foo() forms, macros have been used to
automatically call the new forms without the performance penalty of
going through the compatibility functions.
|
|
|
|
|
|
| |
Now that we have hints in $^H to indicate the default feature bundle,
there is no need for entries in %^H that turn features off by
their presence.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some operators, like pp_complement, assign their argument to TARG
(which copies vstring magic), modify it in place, and then call
set-magic. That’s supposed to work, but vstring magic was remaining as it
was, such that ~v7 would still be treated as "v7" by vstring-aware
code, even though the resulting string is not "\7".
This commit adds vstring set-magic that checks to see whether the pv
still matches the vstring. It cannot simply free the vstring magic,
as that would prevent $x=v0 from working.
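One way to picture the check is the Python model below. It is an interpretation, not the C code: the `Scalar` class, storing the literal text as the magic, and comparing the string against the value that literal denotes are all assumptions made for the sketch.

```python
class Scalar:
    """Toy scalar: a string value plus optional vstring magic (the literal)."""

    def __init__(self, pv, vstring=None):
        self.pv = pv
        self.vstring = vstring


def vstring_to_string(literal):
    """'v1.22' -> chr(1) + chr(22); models what a v-string literal denotes."""
    return "".join(chr(int(part)) for part in literal.lstrip("v").split("."))


def set_magic(sv):
    """Drop the vstring magic only if the string no longer matches it."""
    if sv.vstring is not None and vstring_to_string(sv.vstring) != sv.pv:
        sv.vstring = None


def assign(target, source):
    """Assignment copies value and magic, then runs set-magic."""
    target.pv = source.pv
    target.vstring = source.vstring
    set_magic(target)


def complement(source):
    """Like ~: assign to a temporary, modify it in place, run set-magic."""
    targ = Scalar("")
    assign(targ, source)
    targ.pv = "".join(chr(~ord(ch) & 0xFF) for ch in targ.pv)
    set_magic(targ)
    return targ
```

In this model a plain assignment keeps the magic because the value still matches, whereas the complemented value does not match and loses it. A set-magic that freed the magic unconditionally would strip it in both cases.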
|
| |
This bug is a side effect of rv2gv’s starting to return an incoercible
mortal copy of a coercible glob in 5.14:
$ perl5.12.4 -le 'open FH, "t/test.pl"; $fh=*FH; tell $fh; print tell'
0
$ perl5.14.0 -le 'open FH, "t/test.pl"; $fh=*FH; tell $fh; print tell'
-1
In the first case, tell without arguments is returning the position of
the filehandle.
In the second case, tell with an explicit argument that happens to
be a coercible glob (tell has an implicit rv2gv, so tell $fh is
actually tell *$fh) sets PL_last_in_gv to a mortal copy thereof, which is
freed at the end of the statement, setting PL_last_in_gv to null. So
there is no ‘last used’ handle by the time we get to the tell without
arguments.
This commit adds a new rv2gv flag that tells it not to copy the glob.
By doing it unconditionally on the kidop, this allows tell(*$fh) to
work the same way.
Let’s hope nobody does tell(*{*$fh}), which will unset PL_last_in_gv
because the inner * returns a mortal copy.
This whole area is really icky. PL_last_in_gv should be refcounted,
but that would cause handles to leak out of scope, breaking programs
that rely on the auto-closing ‘feature’.
|
|
| |
This adds the array_base feature to feature.pm.
Perl_feature_is_enabled has been modified to use PL_curcop, rather
than PL_hintgv, so it can work with run-time hints as well.
(PL_curcop holds the current state op at run time, and &PL_compiling
at compile time, so it works for both.) The hints in $^H are not
stored in the same place at compile time and run time, so the
FEATURE_IS_ENABLED macro has been modified to check first whether
PL_curcop == &PL_compiling.
Since array_base is on by default with no hint for it in %^H, it is
a ‘negative’ feature, whose entry in %^H turns it off. feature.pm
has been modified to support such negative features. The new
FEATURE_IS_ENABLED_d can check whether such default features
are enabled.
This does make things less efficient, as every version declaration
now loads feature.pm to disable all features (including turning off
array_base, which entails adding an entry to %^H) before loading the
new bundle. I have plans to make this more efficient.
|
|
|
|
|
|
|
|
|
|
| |
_to_uni_fold_flags() and _to_fold_latin1() now have their flags
parameter be a boolean. The name 'flags' is retained in case the usage
ever expands, instead of calling it by the name of the only use this
currently has.
This is a result of confusion between this and _to_utf8_fold_flags(),
which does have more than one flag possibility.
|
|
|
|
|
|
|
|
|
|
| |
This changes the 4 case changing functions to take extra parameters to
specify if the utf8 string is to be processed under locale rules when
the code points are < 256. The current functions are changed to macros
that call the new versions so that current behavior is unchanged.
An additional, static, function is created that makes sure that the
255/256 boundary is not crossed during the case change.
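The boundary check can be illustrated with a Python sketch. The function name is hypothetical, Python's own case mapping stands in for the Unicode tables, and returning the input unchanged when the boundary would be crossed is an assumption about how "not crossed" is enforced.

```python
def lower_within_boundary(code_point):
    """Lowercase one code point, refusing results that cross 255/256.

    Under locale rules, code points below 256 are governed by the locale
    and those above by Unicode, so a mapping from one side to the other
    is rejected. Assumption: the input is returned unchanged in that case.
    """
    lowered = chr(code_point).lower()
    if len(lowered) != 1:
        return code_point  # multi-character mappings are out of scope here
    result = ord(lowered)
    if (code_point < 256) != (result < 256):
        return code_point  # would cross the boundary: leave as is
    return result
```

For example, U+0178 (capital Y with diaeresis) normally lowercases to U+00FF, which is on the other side of the boundary, so the sketch leaves it alone, while U+0100 still maps to U+0101.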
|
|
| |
When substr() occurs in potential lvalue context, the offsets are
adjusted to the current string (negative being converted to positive,
lengths reaching beyond the end of the string being shortened, etc.)
as soon as the special lvalue to be returned is created.
When that lvalue is assigned to, the original scalar is stringified
once more.
That implementation results in two bugs:
1) Fetch is called twice in a simple substr() assignment (except in
void context, due to the special optimisation of commit 24fcb59fc).
2) These two calls are not equivalent:
$SIG{__WARN__} = sub { warn "w ",shift};
sub myprint { print @_; $_[0] = 1 }
print substr("", 2);
myprint substr("", 2);
The second one dies. The first one only warns. That’s mean. The
error is also wrong, sometimes, if the original string is going to get
longer before the substr lvalue is actually used.
The behaviour of \substr($str, -1) if $str changes length is
completely undocumented. Before 5.10, it was documented as being
unreliable and subject to change.
What this commit does is make the lvalue returned by substr remember
the original arguments and only adjust the offsets when the
assignment happens.
This means that the following now prints z, instead of xyz (which is
actually what I would expect):
$str = "a";
$substr = \substr($str,-1);
$str = "xyz";
print $$substr;
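The deferred adjustment can be modelled in Python as below. This is a sketch under stated assumptions: `Scalar` and `DeferredSubstr` are hypothetical names, and out-of-range offsets are simply clamped here, whereas Perl warns or dies in those cases.

```python
class Scalar:
    """Mutable holder for a string, standing in for a Perl scalar."""

    def __init__(self, value):
        self.value = value


class DeferredSubstr:
    """Remembers the original substr arguments; resolves them only on use."""

    def __init__(self, scalar, offset, length=None):
        self.scalar = scalar
        self.offset = offset
        self.length = length

    def _bounds(self):
        # Offsets are adjusted against the string as it is *now*.
        size = len(self.scalar.value)
        start = self.offset + size if self.offset < 0 else self.offset
        start = max(0, min(start, size))
        if self.length is None:
            end = size
        elif self.length < 0:
            end = max(start, size + self.length)
        else:
            end = min(size, start + self.length)
        return start, end

    def fetch(self):
        start, end = self._bounds()
        return self.scalar.value[start:end]

    def store(self, replacement):
        start, end = self._bounds()
        text = self.scalar.value
        self.scalar.value = text[:start] + replacement + text[end:]
```

Creating `DeferredSubstr(s, -1)` while `s.value` is "a" and then setting `s.value = "xyz"` makes `fetch()` return "z", matching the new behaviour described above.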
|
| |
|
|
|
|
|
| |
This simplifies the code, as it's only called from one spot, in
Perl_moreswitches().
|
|
|
|
|
|
|
|
|
|
| |
When -Dusesitecustomize is used with -Duserelocatableinc,
SITELIB_EXP/sitecustomize.pl is not found due to SITELIB_EXP having a
'.../..' relocation path.
This patch refactors the path relocation code from S_incpush() into
S_mayberelocate() so that it can be used in both S_incpush() and in
usesitecustomize's use of SITELIB_EXP.
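The effect of sharing the relocation step can be sketched in Python. The function names are hypothetical, and the sketch assumes the relocatable-@INC convention that a leading ".../" means "relative to the directory containing the perl executable". It uses `normpath` to resolve the ".." components, which is a simplification of what the C code does.

```python
import posixpath


def maybe_relocate(libdir, perl_executable):
    """Resolve a '.../'-prefixed path against the perl binary's directory."""
    if not libdir.startswith(".../"):
        return libdir  # ordinary path: nothing to relocate
    base = posixpath.dirname(perl_executable)
    return posixpath.normpath(posixpath.join(base, libdir[4:]))


def sitecustomize_path(sitelib_exp, perl_executable):
    """Apply the same relocation before looking for sitecustomize.pl."""
    return posixpath.join(maybe_relocate(sitelib_exp, perl_executable),
                          "sitecustomize.pl")
```

With perl installed at /opt/perl/bin/perl and a site library of ".../../lib/site_perl", both the @INC entry and the sitecustomize.pl lookup resolve under /opt/perl/lib/site_perl.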
|
|
|
|
|
| |
S_sv_unglob is only called in one place, so inline it (but cheat, to
preserve blame history).
|
| |
sv_force_normal is passed the SV_COW_DROP_PV flag if the scalar is
about to be written over. That flag is not currently used. We can
speed up assignment over fake GVs a lot by taking advantage of the flag.
Before and after:
$ time ./perl -e '$x = *foo, undef $x for 1..2000000'
real 0m4.264s
user 0m4.248s
sys 0m0.007s
$ time ./perl -e '$x = *foo, undef $x for 1..2000000'
real 0m1.820s
user 0m1.812s
sys 0m0.005s
|
|
|
|
|
|
|
|
|
|
| |
The logic surrounding subroutine redefinition warnings (to warn or not
to warn?) was in three places. Over time, they drifted apart, to the
point that newXS was following completely different rules. It was
only warning for redefinition of functions in the autouse namespace.
Recent commits have brought it into conformity with the other
redefinition warnings.
Obviously it’s about time we put it in one function.
|