| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some code in this function examines the first two nodes in the regex to
set suitable flags etc. Part of the code accesses the second node
by using regnext(first), other parts by NEXTOPER(first). The second method
only works when the node is the same size as a basic node. I *think*
that the code only makes use of this second value in situations where
the node *is* basic, but nevertheless, it makes valgrind unhappy when
the first node is an EXACT node, and reading the second node's
supposed type field is actually reading the padding bytes at the end of
the EXACT string, which are uninitialised.
So just use regnext() only.
Something as simple as /x/ on non-debugging builds was enough to make
valgrind complain. (On debugging builds, the program buffer is initially
zero-filled.)
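The difference between the two ways of reaching the second node can be sketched with a deliberately simplified model. The `node` struct, the `T_*` constants and the helper names below are hypothetical and are not Perl's actual regnode definitions; the 0xAA fill merely stands in for uninitialised memory.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node layout; NOT Perl's real regnode. */
typedef struct {
    uint8_t  flags;     /* string length in bytes for an EXACT node */
    uint8_t  type;      /* node kind */
    uint16_t next_off;  /* distance to the following node, in nodes; 0 = none */
} node;

enum { T_END = 0, T_EXACT = 1 };

/* Fixed-size step: only correct when the node carries no payload. */
const node *next_by_size(const node *n) { return n + 1; }

/* Link-following step: correct whatever the size of the node. */
const node *next_by_link(const node *n) {
    return n->next_off ? n + n->next_off : NULL;
}

/* Build EXACT "x" followed by END in prog, which must hold 3 nodes.
   The buffer is pre-filled with 0xAA to stand in for stale memory. */
void build_exact_x(node *prog) {
    memset(prog, 0xAA, 3 * sizeof(node));
    prog[0].flags = 1;
    prog[0].type = T_EXACT;
    prog[0].next_off = 2;            /* skip the payload slot */
    ((char *)&prog[1])[0] = 'x';     /* only the first payload byte is written */
    prog[2].flags = 0;
    prog[2].type = T_END;
    prog[2].next_off = 0;
}
```

In this model the fixed-size step lands inside the string payload, so the byte it reads as a "type" is padding that was never written, whereas following the link reaches the real END node.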
|
|
|
|
|
|
|
|
|
|
| |
/[A-Z]/ai should match KELVIN SIGN, as it folds to a 'k'. It should not
match under /aai, as that restricts fold matching. But I tested for the
wrong symbol which ended up forbidding both /ai and /aai.
This commit changes to the correct symbol. I also reordered the 'if'
while I was at it as a nano optimisation, to test for the /aa last, as
that is the less common part of the '&&' test.
|
|
|
|
| |
whitespace-only change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RT #124109.
2c1f00b9036 localised PL_curpm to NULL when calling swash init code
(i.e. perl-level code that is loaded and executed when something
like "lc $large_codepoint" is executed).
b4fa55d3f1 followed this up by gutting Perl_save_re_context(), since
that function did, basically,
    if (PL_curpm) {
        for (i = 1; i <= RX_NPARENS(PM_GETRE(PL_curpm)); i++) {
            do the C equivalent of the perl code "local ${i}";
        }
    }
and now that PL_curpm was null, the code wasn't called any more. However,
it turns out that the localisation *was* still needed, it's just that
nothing in the test suite actually tested for it.
In something like the following:
    $x = "\x{41c}";
    $x =~ /(.*)/;
    $s = lc $1;
pp_lc() calls get magic on $1, which sets $1's PV value to a copy of the
substring captured by the current pattern match.
Then pp_lc() calls a function to convert the string to lower case, which
triggers a swash load, which calls perl code that does a pattern match
and, most importantly, uses the value of $1. This triggers get magic on
$1, which overwrites $1's PV value with a new value. When control returns
to pp_lc(), $1 now holds the wrong string value.
Hence $1, $2 etc need localising as well as PL_curpm.
The old way that Perl_save_re_context() used to work (localising
$1..${RX_NPARENS}) won't work directly when PL_curpm is NULL (as in the
swash case), since we don't know how many vars to localise.
In this case, hard-code it as localising $1,$2,$3 and add a porting
test file that checks that the utf8.pm code and its dependencies don't
use anything outside those 3 vars.
|
|
|
|
|
|
| |
This reverts commit b4fa55d3f12c6d98b13a8b3db4f8d921c8e56edc.
Turns out we need Perl_save_re_context() after all
|
|
|
|
|
|
| |
This reverts commit d28a9254e445aee7212523d9a7ff62ae0a743fec.
Turns out we need save_re_context() after all
|
|
|
|
|
|
| |
This reverts commit 0ddd4a5b1910c8bfa9b7e55eb0db60a115fe368c.
Turns out we need the save_re_context() function after all.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
|
|
|
|
|
|
| |
Unicode 5.2 had an anomalous situation, fixed in the next release, which
runs afoul of an assert() in regcomp.c. This just modifies the assert
so that it does not fail in this situation.
|
|
|
|
|
| |
In this experimental feature, the intersection operator ("&") now has
higher precedence than the other binary operators.
|
|
|
|
| |
Outdent code that the previous commit removed the surrounding block from
|
|
|
|
|
|
|
|
|
|
|
| |
Prior to this commit, the regex compiler was relying on the lexer to do
the translation from Unicode to native for \N{...} constructs, where it
was simpler to do. However, when the pattern is a single-quoted string,
it is passed unchanged to the regex compiler, and this did not work. Fixing
it required some refactoring, though it led to a clean API in a static
function.
This was spotted by Father Chrysostomos.
|
|
|
|
|
| |
It actually does do the right thing: /(?(R0))/ and /(?(R00))/ both fall
through to give an appropriate error 'Switch condition not recognized'
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some questions and loose ends:
XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
XXX mg.c:Perl_magic_set - need appropriate error handling for $)
XXX regcomp.c:S_reg - need to check if we do the right thing if parno
was not grokked
Perl_get_debug_opts should probably return something unsigned; not sure
if that's something we can change.
|
| |
|
|
|
|
|
|
| |
Both needed: the macro is for compilers, the comment for static checkers.
(This doesn't address whether each spot is correct and necessary.)
|
|
|
|
|
|
|
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A function determines whether the position between any two characters is
a grapheme cluster break. After I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
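The array-lookup alternative mentioned above can be sketched as follows. This is a minimal illustration that assumes a heavily reduced set of Grapheme_Cluster_Break classes (the real property has many more), and the names `gcb`, `break_between`, `is_gcb` and `count_clusters` are hypothetical, not taken from the Perl source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Reduced set of break classes, for illustration only. */
enum gcb { GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT };

/* break_between[before][after] is true if a cluster boundary falls
   between a character of class 'before' and one of class 'after'. */
static const bool break_between[GCB_COUNT][GCB_COUNT] = {
    /*               Other  CR     LF     Control Extend */
    /* Other   */ {  true,  true,  true,  true,   false },
    /* CR      */ {  true,  true,  false, true,   true  },
    /* LF      */ {  true,  true,  true,  true,   true  },
    /* Control */ {  true,  true,  true,  true,   true  },
    /* Extend  */ {  true,  true,  true,  true,   false },
};

/* Is there a grapheme cluster break between the two classes? */
bool is_gcb(enum gcb before, enum gcb after) {
    return break_between[before][after];
}

/* Count clusters in a sequence of already-classified characters:
   one cluster, plus one more for every boundary found. */
size_t count_clusters(const enum gcb *classes, size_t n) {
    size_t clusters = n ? 1 : 0;
    size_t i;

    for (i = 1; i < n; i++) {
        if (is_gcb(classes[i - 1], classes[i]))
            clusters++;
    }
    return clusters;
}
```

With a break test like this available, matching one \X amounts to advancing until the test reports a boundary.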
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes where the symbols are defined to a single file each. This
may save text space, depending on the compiler. The next commit will
cause this hdr to be included in more places, so it becomes more
important to do this.
At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE.
The code now is executed regardless of that. This is simpler, and
previously there might have been the possibility of uninitialized memory
being read, should re_comp.o be executed before regcomp.o.
|
|
|
|
|
|
|
|
|
|
| |
//n was implemented by avoiding the primary side-effects of compiling
a capture when the flag was turned on; however some secondary effects
still occurred later in the same function, by using the value of the
'paren' variable - even as far as causing coredumps.
Setting paren to ':' when NOCAPTURE is enabled makes the rest of the
function act just as if it had parsed (?:...) instead of (...).
|
|
|
|
| |
This could be triggered by trying to compile eg 'qr{x+(y(?0))*}'.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
AFL (<http://lcamtuf.coredump.cx/afl>) found that the UV to I32 conversion
can evade the necessary range checks on wraparound, leading to bad reads.
Check for it, and force to I32_MAX, expecting that this will usually
yield a "Reference to nonexistent group" error.
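The check can be sketched as a saturating conversion. This is a minimal sketch using standard fixed-width types as stand-ins for UV and I32; the helper names `group_num_to_i32` and `group_exists` are hypothetical and do not appear in regcomp.c.

```c
#include <stdint.h>

/* Convert an unsigned group number parsed from a pattern into a signed
   32-bit value, saturating at INT32_MAX instead of wrapping around. */
int32_t group_num_to_i32(uint64_t parsed) {
    if (parsed > (uint64_t)INT32_MAX)
        return INT32_MAX;
    return (int32_t)parsed;
}

/* The later range check: a group reference is valid only if it falls
   within the groups the pattern actually has. */
int group_exists(int32_t num, int32_t total_groups) {
    return num >= 1 && num <= total_groups;
}
```

Without the saturation step, a value such as 2**32 + 1 would typically truncate to 1 on common platforms and pass the range check; forced to INT32_MAX, it fails the check and can be reported as a nonexistent group.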
|
|
|
|
|
| |
New test in 8a6d8ec6fe revealed additional code problem reading past
end of string under clang with sanitize=address.
|
|
|
|
|
|
|
| |
AFL (<http://lcamtuf.coredump.cx/afl>) found that when producing the
error message for /(??/ we hit an assert because we've stepped past
the end of the pattern string. Code inspection found that we also do
that in other branches, and we also need to check UTF more carefully.
|
|
|
|
|
|
| |
A POSIX character class has to be in a bracketed character class. A
warning is issued when something appearing to be one is found outside.
Until this commit the warning wasn't raised for negated classes.
|
|
|
|
| |
This caused EBCDIC builds to fail
|
|
|
|
| |
Extracted from patch submitted by Lajos Veres in RT #123693.
|
| |
|
|
|
|
|
|
| |
The [:cased:] internal class now handles [:upper:] and/or [:lower:]
under /i matching. This code skipped possible optimizations because it
didn't think to use this.
|
|
|
|
|
|
|
|
|
|
| |
\d, [:digit:], and [:xdigit:] don't match anything in the upper Latin1
range. Therefore whether or not the target string is UTF-8
doesn't change what they match, hence the /d modifier acts exactly like
the /u modifier for them. At run-time /u executes fewer branches
because it doesn't have to test if the target string is in UTF-8 or not,
so treating these as if /u had instead been specified saves some
runtime.
|
|
|
|
| |
This changes some labels to be outdented 2 spaces from surrounding code
|
|
|
|
| |
The code for these two case: statements was almost identical.
|
|
|
|
|
| |
Future commits will want this function to be able to be used in more
than one core file.
|
|
|
|
|
| |
This is in preparation for combining them into common code. No other
changes are made, except for an additional blank line.
|
|
|
|
|
| |
The previous commit has caused this information to no longer be looked
at; no need to store it therefore.
|
|
|
|
|
|
| |
This variable is a boolean with values of 0 and 1, even though it's
stored as 32-bits in the struct, to get the simplest store/retrieval
code generated, so it's safe to cast it to a bool.
|
| |
|
|
|
|
|
|
|
|
|
| |
Generally the guideline is to outdent C labels (e.g. 'foo:') 2 columns
from the surrounding code.
If the label starts at column zero, then diffs, such as those generated
by git, display the label rather than the function name at the head of
a diff block, which makes diffs harder to peruse.
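As an illustration of the guideline, the label below sits 2 columns to the left of the code around it, so it never starts at column zero. The function `first_negative` is hypothetical and exists only to show the placement.

```c
#include <stddef.h>

/* Return the index of the first negative value, or -1 if there is none. */
int first_negative(const int *vals, size_t n) {
    int found = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (vals[i] < 0) {
            found = (int)i;
            goto done;
        }
    }

  done:
    return found;
}
```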
|
| |
|
|
|
|
|
| |
This is in preparation for the next commit. The function previously was
used only in DEBUGGING builds
|
| |
|
|
|
|
|
|
|
|
| |
When a range in a bracketed character class has one end be specified as
Unicode, the whole range is viewed as Unicode. Currently this is not
warned about, though it is somewhat like mixing apples and oranges.
This commit adds a warning, but only under "use re 'strict'", and
documents the behavior when only one end is specified as Unicode.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the way we calculate if the endpoints in a range in a
[bracketed character class] are "literal" (like 'A', 'b') vs non (like
\x{41}) is to have a count of the literal endpoints.
Future commits will expand the definition of literal to include things
that are portably-specified, including things like \t, \N{U+xx}, etc.
It will be easier to specify that we have encountered a non-portable
name instead of the other way around. So that is what this commit does.
The only non-portables are \digit, \o{}, \x{}, and \cX for all X.
|
|
|
|
| |
Indent inside a newly formed block
|