path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* avoid uninit read in re_op_compile() | David Mitchell | 2015-04-28 | 1 | -3/+3
| Some code in this function examines the first two nodes in the regex to set suitable flags etc. Part of the code accesses the second node by using regnext(first), other parts by NEXTOPER(first). The second method only works when the node is the same size as a basic node. I *think* that the code only makes use of this second value in situations where the node *is* basic, but nevertheless, it makes valgrind unhappy when the first node is an EXACT node, and reading the second node's supposed type field is actually reading the padding bytes at the end of the EXACT string, which are uninitialised.
|
| So just use regnext() only.
|
| Something as simple as /x/ on non-debugging builds was enough to make valgrind complain. (On debugging builds, the program buffer is initially zero-filled.)
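The difference between the two ways of reaching the second node can be illustrated with a small sketch. This is not Perl's real `struct regnode`; the layout, the names `node_t`, `next_by_size`, `next_by_link` and `build_exact_then_end`, and the 0xAA fill pattern are all assumptions made for illustration. It only shows why a fixed-size step lands inside the string payload of a variable-sized node, while following the recorded link does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified node layout; NOT Perl's real struct regnode. */
typedef struct {
    uint8_t  type;      /* node kind */
    uint8_t  len;       /* payload length in bytes (0 for basic nodes) */
    uint16_t next_off;  /* distance to the following node, in node_t units */
} node_t;

enum { NODE_BASIC = 1, NODE_EXACT = 2, NODE_END = 3 };

/* Size-based step: only correct when n is a basic (payload-free) node. */
static const node_t *next_by_size(const node_t *n)
{
    return n + 1;
}

/* Link-based step: follows the recorded offset, so it works for any size. */
static const node_t *next_by_link(const node_t *n)
{
    return n->next_off ? n + n->next_off : NULL;
}

/* Lay out "EXACT <str>" followed by END in buf and return the first node. */
static const node_t *build_exact_then_end(node_t *buf, size_t cap,
                                          const char *str)
{
    size_t len = strlen(str);
    /* header plus the payload rounded up to whole node_t units */
    size_t units = 1 + (len + sizeof(node_t) - 1) / sizeof(node_t);

    assert(len <= 255 && units + 1 <= cap);
    /* 0xAA stands in for uninitialised memory so the sketch stays defined */
    memset(buf, 0xAA, cap * sizeof(node_t));

    buf[0].type = NODE_EXACT;
    buf[0].len = (uint8_t)len;
    buf[0].next_off = (uint16_t)units;
    memcpy(buf + 1, str, len);   /* trailing pad bytes keep the 0xAA fill */

    buf[units].type = NODE_END;
    buf[units].len = 0;
    buf[units].next_off = 0;
    return buf;
}
```

With the pattern string "x", the link-based step reaches the END node, whereas the size-based step returns a pointer into the payload, where the remaining bytes are padding that nothing has written.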
* Fix regression in 5.21: /[A-Z]/ai | Karl Williamson | 2015-04-09 | 1 | -3/+2
| /[A-Z]/ai should match KELVIN SIGN, as it folds to a 'k'. It should not match under /aai, as that restricts fold matching. But I tested for the wrong symbol which ended up forbidding both /ai and /aai. This commit changes to the correct symbol. I also reordered the 'if' while I was at it as a nano optimisation, to test for the /aa last, as that is the less common part of the '&&' test.
* Perl_save_re_context(): re-indent after last commit | David Mitchell | 2015-03-30 | 1 | -16/+12
| whitespace-only change.
* save_re_context(): do "local $n" with no PL_curpm | David Mitchell | 2015-03-30 | 1 | -3/+19
| RT #124109.
|
| 2c1f00b9036 localised PL_curpm to NULL when calling swash init code (i.e. perl-level code that is loaded and executed when something like "lc $large_codepoint" is executed).
|
| b4fa55d3f1 followed this up by gutting Perl_save_re_context(), since that function did, basically,
|
|     if (PL_curpm) {
|         for (i = 1; i <= RX_NPARENS(PM_GETRE(PL_curpm)); i++) {
|             do the C equivalent of the perl code "local ${i}";
|         }
|     }
|
| and now that PL_curpm was null, the code wasn't called any more.
|
| However, it turns out that the localisation *was* still needed, it's just that nothing in the test suite actually tested for it.
|
| In something like the following:
|
|     $x = "\x{41c}";
|     $x =~ /(.*)/;
|     $s = lc $1;
|
| pp_lc() calls get magic on $1, which sets $1's PV value to a copy of the substring captured by the current pattern match. Then pp_lc() calls a function to convert the string to lower case, which triggers a swash load, which calls perl code that does a pattern match and, most importantly, uses the value of $1. This triggers get magic on $1, which overwrites $1's PV value with a new value. When control returns to pp_lc(), $1 now holds the wrong string value.
|
| Hence $1, $2 etc need localising as well as PL_curpm.
|
| The old way that Perl_save_re_context() used to work (localising $1..${RX_NPARENS}) won't work directly when PL_curpm is NULL (as in the swash case), since we don't know how many vars to localise. In this case, hard-code it as localising $1,$2,$3 and add a porting test file that checks that the utf8.pm code and dependencies don't use anything outside those 3 vars.
* Revert "Gut Perl_save_re_context" | David Mitchell | 2015-03-30 | 1 | -3/+21
| This reverts commit b4fa55d3f12c6d98b13a8b3db4f8d921c8e56edc.
|
| Turns out we need Perl_save_re_context() after all
* Revert "Don’t call save_re_context" | David Mitchell | 2015-03-30 | 1 | -0/+1
| This reverts commit d28a9254e445aee7212523d9a7ff62ae0a743fec.
|
| Turns out we need save_re_context() after all
* Revert "Mathomise save_re_context" | David Mitchell | 2015-03-30 | 1 | -0/+11
| This reverts commit 0ddd4a5b1910c8bfa9b7e55eb0db60a115fe368c.
|
| Turns out we need the save_re_context() function after all.
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
| An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN.
|
| Committer's work:
|
| To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools.
|
| perldelta entry for module updates.
|
| Add two Emacs control files to MANIFEST; re-sort MANIFEST.
|
| For: RT #124119.
* regcomp.c: Fix so works on Unicode 5.2 | Karl Williamson | 2015-03-19 | 1 | -3/+12
| Unicode 5.2 had an anomalous situation, fixed in the next release, which runs afoul of an assert() in regcomp.c. This just modifies the assert for it to not fail for this situation.
* Change /(?[...]) to have normal operator precedence | Karl Williamson | 2015-03-19 | 1 | -195/+407
| This experimental feature now gives the intersection operator ("&") higher precedence than the other binary operators.
* regcomp.c: White-space only | Karl Williamson | 2015-03-18 | 1 | -14/+14
| Outdent code that the previous commit removed the surrounding block from
* Fix qr'\N{U+41}' on EBCDIC platforms | Karl Williamson | 2015-03-18 | 1 | -196/+263
| Prior to this commit, the regex compiler was relying on the lexer to do the translation from Unicode to native for \N{...} constructs, where it was simpler to do. However, when the pattern is a single-quoted string, it is passed unchanged to the regex compiler, so the translation was never done and such patterns did not work. Fixing it required some refactoring, though it led to a clean API in a static function.
|
| This was spotted by Father Chrysostomos.
* fix XXX comment for regcomp.c:S_reg | Hugo van der Sanden | 2015-03-10 | 1 | -1/+1
| It actually does do the right thing: /(?(R0))/ and /(?(R00))/ both fall through to give an appropriate error 'Switch condition not recognized'
* [perl #123814] replace grok_atou with grok_atoUV | Hugo van der Sanden | 2015-03-09 | 1 | -17/+28
| Some questions and loose ends:
|   XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
|   XXX mg.c:Perl_magic_set - need appropriate error handling for $)
|   XXX regcomp.c:S_reg - need to check if we do the right thing if parno was not grokked
|
| Perl_get_debug_opts should probably return something unsigned; not sure if that's something we can change.
* [perl #123814] stricter handling of numbers in regexp quantifiers | Hugo van der Sanden | 2015-03-09 | 1 | -5/+20
|
* Consistently use NOT_REACHED; /* NOTREACHED */ | Jarkko Hietaniemi | 2015-03-04 | 1 | -6/+6
| Both needed: the macro is for compilers, the comment for static checkers.
|
| (This doesn't address whether each spot is correct and necessary.)
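As a rough illustration of the pairing described above, the sketch below puts a macro and the comment side by side after a switch whose cases all return. `MY_NOT_REACHED`, `enum quant` and `quant_min` are hypothetical names; Perl's own NOT_REACHED is defined in its headers and may expand to something quite different.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for illustration; not Perl's NOT_REACHED definition. */
#define MY_NOT_REACHED (assert(!"unreachable"), abort())

enum quant { QUANT_STAR, QUANT_PLUS, QUANT_OPT };

/* Minimum repeat count for a quantifier; every case returns. */
static int quant_min(enum quant q)
{
    switch (q) {
    case QUANT_STAR: return 0;
    case QUANT_PLUS: return 1;
    case QUANT_OPT:  return 0;
    }
    /* The macro tells the compiler that falling out of the switch cannot
     * happen; the comment carries the same hint for lint-style checkers. */
    MY_NOT_REACHED; /* NOTREACHED */
}
```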
* \s matching VT is no longer experimental | Karl Williamson | 2015-02-21 | 1 | -5/+2
| This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* regcomp.c: Add assertion | Karl Williamson | 2015-02-19 | 1 | -0/+2
|
* Add \b{sb} | Karl Williamson | 2015-02-19 | 1 | -0/+7
|
* Add qr/\b{wb}/ | Karl Williamson | 2015-02-19 | 1 | -1/+8
|
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -13/+83
| A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.
|
| This makes the implementation of \X go from being complicated to trivial.
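The trade-off between a decision function and an array lookup can be sketched on a deliberately tiny subset of the grapheme cluster break rules from UAX #29 (GB3, GB4, GB5 and GB9 only). The enum, the function names and the table below are assumptions for illustration and are not taken from regcomp.c or regexec.c; the real property has many more classes and rules.

```c
#include <assert.h>

/* Hypothetical, heavily reduced set of Grapheme_Cluster_Break classes. */
enum gcb { GCB_OTHER, GCB_CR, GCB_LF, GCB_CONTROL, GCB_EXTEND, GCB_COUNT };

/* Rule-by-rule version: returns 1 if there is a break between the two. */
static int is_break_fn(enum gcb before, enum gcb after)
{
    if (before == GCB_CR && after == GCB_LF)
        return 0;                                   /* GB3: CR x LF */
    if (before == GCB_CR || before == GCB_LF || before == GCB_CONTROL)
        return 1;                                   /* GB4: break after controls */
    if (after == GCB_CR || after == GCB_LF || after == GCB_CONTROL)
        return 1;                                   /* GB5: break before controls */
    if (after == GCB_EXTEND)
        return 0;                                   /* GB9: x Extend */
    return 1;                                       /* otherwise: break */
}

/* Table version: the same answers, indexed as [before][after]. */
static const unsigned char break_table[GCB_COUNT][GCB_COUNT] = {
    /* after:       OTHER CR LF CONTROL EXTEND */
    /* OTHER   */ { 1,    1, 1, 1,      0 },
    /* CR      */ { 1,    1, 0, 1,      1 },
    /* LF      */ { 1,    1, 1, 1,      1 },
    /* CONTROL */ { 1,    1, 1, 1,      1 },
    /* EXTEND  */ { 1,    1, 1, 1,      0 },
};

static int is_break_tbl(enum gcb before, enum gcb after)
{
    assert(before < GCB_COUNT && after < GCB_COUNT);
    return break_table[before][after];
}

/* Returns 1 when both versions agree on every pair of classes. */
static int versions_agree(void)
{
    int b, a;
    for (b = 0; b < GCB_COUNT; b++)
        for (a = 0; a < GCB_COUNT; a++)
            if (is_break_fn((enum gcb)b, (enum gcb)a)
                != is_break_tbl((enum gcb)b, (enum gcb)a))
                return 0;
    return 1;
}
```

The table trades a few bytes of static data for a single indexed load, which is the kind of code the commit message reports gcc producing from the function form anyway.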
* regen/mk_invlists.pl: Revamp #if generation | Karl Williamson | 2015-02-19 | 1 | -2/+0
| This changes where the symbols are defined to a single file each. This may save text space, depending on the compiler. The next commit will cause this header to be included in more places, so it becomes more important to do this.
|
| At the same time this removes the guard for #ifndef PERL_IN_XSUB_RE. The code now is executed regardless of that. This is simpler, and previously there might have been the possibility of uninitialized memory being read, should re_comp.o be executed before regcomp.o.
* [perl #123852] avoid capture side-effects under noncapture flag | Hugo van der Sanden | 2015-02-18 | 1 | -0/+2
| //n was implemented by avoiding the primary side-effects of compiling a capture when the flag was turned on; however some secondary effects still occurred later in the same function, by using the value of the 'paren' variable - even as far as causing coredumps.
|
| Setting paren to ':' when NOCAPTURE is enabled makes the rest of the function act just as if it had parsed (?:...) instead of (...).
* [perl #123843] fix SEGV reading data->flags | Hugo van der Sanden | 2015-02-15 | 1 | -1/+1
| This could be triggered by trying to compile eg 'qr{x+(y(?0))*}'.
* Add comments about how backrefs are parsed | Yves Orton | 2015-02-15 | 1 | -8/+27
|
* fix infinite loop in parsing backrefs in regex patterns | Yves Orton | 2015-02-15 | 1 | -2/+4
|
* [perl #123782] regcomp: check for overflow on /(?123)/ | Hugo van der Sanden | 2015-02-10 | 1 | -1/+3
| AFL (<http://lcamtuf.coredump.cx/afl>) found that the UV to I32 conversion can evade the necessary range checks on wraparound, leading to bad reads. Check for it, and force to I32_MAX, expecting that this will usually yield a "Reference to nonexistent group" error.
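The wraparound problem and the clamping fix can be sketched in isolation. The helper below is hypothetical (it is not grok_atoUV or any other Perl function); it parses a decimal group number into a wide unsigned value and saturates at INT32_MAX instead of letting a narrowing conversion wrap to a small, valid-looking number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Perl's API: parse a decimal group number,
 * saturating at INT32_MAX rather than wrapping on overflow. */
static int32_t parse_group_number(const char *s, const char **end)
{
    uint64_t val = 0;
    int saturated = 0;

    assert(s != NULL);
    while (*s >= '0' && *s <= '9') {
        if (!saturated) {
            /* val <= INT32_MAX here, so val * 10 + 9 cannot overflow 64 bits */
            val = val * 10 + (uint64_t)(*s - '0');
            if (val > (uint64_t)INT32_MAX) {
                val = (uint64_t)INT32_MAX;
                saturated = 1;
            }
        }
        s++;   /* keep consuming digits even after saturating */
    }
    if (end != NULL)
        *end = s;
    return (int32_t)val;
}
```

A naive cast of 4294967297 (2^32 + 1) through a 32-bit type yields 1, which would pass a later "group exists" check; the saturating version returns INT32_MAX, which a range check against the number of groups can reject.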
* regcomp can read past end of string after parsing flags | Hugo van der Sanden | 2015-02-10 | 1 | -1/+2
| New test in 8a6d8ec6fe revealed an additional code problem: reading past the end of the string under clang with sanitize=address.
* [perl #123755] including unknown char in error requires care | Hugo van der Sanden | 2015-02-09 | 1 | -3/+8
| AFL (<http://lcamtuf.coredump.cx/afl>) found that when producing the error message for /(??/ we hit an assert because we've stepped past the end of the pattern string. Code inspection found that we also do that in other branches, and we also need to check UTF more carefully.
* regcomp.c: Warn on [:^posix:] not being in [] | Karl Williamson | 2015-02-05 | 1 | -0/+3
| A POSIX character class has to be in a bracketed character class. A warning is issued when something appearing to be one is found outside. Until this commit the warning wasn't raised for negated classes.
* regcomp.c: Fix typos in variable name | Karl Williamson | 2015-02-01 | 1 | -2/+2
| This caused EBCDIC builds to fail
* Corrections to spelling and grammatical errors. | Lajos Veres | 2015-01-28 | 1 | -1/+1
| Extracted from patch submitted by Lajos Veres in RT #123693.
* regcomp.c: Clarify comment | Karl Williamson | 2015-01-23 | 1 | -1/+1
|
* regcomp.c: Another minor optimization | Karl Williamson | 2015-01-23 | 1 | -8/+2
| The [:cased:] internal class now handles [:upper:] and/or [:lower:] under /i matching. This code skipped possible optimizations because it didn't think to use this.
* regcomp.c: Minor optimizations | Karl Williamson | 2015-01-23 | 1 | -1/+24
| \d, [:digit:], and [:xdigit:] don't match anything in the upper Latin1 range. Therefore whether the target string is UTF-8 or not doesn't change what they match, hence the /d modifier acts exactly like the /u modifier for them. At run-time /u executes fewer branches because it doesn't have to test if the target string is in UTF-8 or not, so treating these as if /u had instead been specified saves some runtime.
* regexec.c, regcomp.c: White-space only | Karl Williamson | 2015-01-23 | 1 | -19/+19
| This changes some labels to be outdented 2 spaces from surrounding code
* regcomp.c: Collapse \b, \B cases | Karl Williamson | 2015-01-21 | 1 | -18/+8
| The code for these two case: statements was almost identical.
* Move inline fcn to #included file | Karl Williamson | 2015-01-21 | 1 | -21/+0
| Future commits will want this function to be able to be used in more than one core file.
* regcomp.c: Reorder two switch cases | Karl Williamson | 2015-01-21 | 1 | -12/+13
| This is in preparation for combining them into common code. No other changes are made, except for an additional blank line.
* regcomp.c: Don't store unnecessary data in \b ops | Karl Williamson | 2015-01-21 | 1 | -2/+0
| The previous commit has caused this information to no longer be looked at; no need to store it therefore.
* regcomp.c: Silence Win32 compiler warnings | Karl Williamson | 2015-01-21 | 1 | -4/+4
| This variable is a boolean with values of 0 and 1, even though it's stored as 32-bits in the struct, to get the simplest store/retrieval code generated, so it's safe to cast it to a bool.
* reg: avoid pointing past end of string on short DEFINE | Hugo van der Sanden | 2015-01-21 | 1 | -2/+2
|
* avoid C labels in column 0 | David Mitchell | 2015-01-21 | 1 | -2/+2
| Generally the guideline is to outdent C labels (e.g. 'foo:') 2 columns from the surrounding code.
|
| If the label starts at column zero, then it means that diffs, such as those generated by git, display the label rather than the function name at the head of a diff block, which makes diffs harder to peruse.
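The labelling guideline can be shown on a small function. The function itself is a made-up example (`sum_until_negative` does not exist in perl); only the placement of the `fail:` label, two columns to the left of the surrounding four-column code rather than in column zero, is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example function; only the label placement matters here. */
static int sum_until_negative(const int *v, size_t n, long *out)
{
    long total = 0;
    size_t i;

    assert(v != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (v[i] < 0)
            goto fail;
        total += v[i];
    }
    *out = total;
    return 0;

  fail:     /* outdented 2 columns from the code around it, not column 0 */
    *out = 0;
    return -1;
}
```

Because the label does not start in column zero, a diff hunk touching the lines after it can still be headed by the function name rather than by `fail:`.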
* regcomp.c: Add warnings under re 'strict' | Karl Williamson | 2015-01-20 | 1 | -0/+27
|
* regcomp.c: Move #define, make a function always compiled | Karl Williamson | 2015-01-20 | 1 | -8/+6
| This is in preparation for the next commit. The function previously was used only in DEBUGGING builds
* regcomp.c: Add warnings under re 'strict' | Karl Williamson | 2015-01-20 | 1 | -0/+53
|
* Add portability warning for re 'strict' | Karl Williamson | 2015-01-20 | 1 | -14/+19
| When a range in a bracketed character class has one end be specified as Unicode, the whole range is viewed as Unicode. Currently this is not warned about, though it is somewhat like mixing apples and oranges. This commit adds a warning, but only under "use re 'strict'", and it now documents the behavior when only one end is specified as Unicode.
* regcomp.c: Fix typo in comment | Karl Williamson | 2015-01-20 | 1 | -2/+2
|
* regcomp.c: Refactor a calculation | Karl Williamson | 2015-01-20 | 1 | -17/+24
| Currently the way we calculate if the endpoints in a range in a [bracketed character class] are "literal" (like 'A', 'b') vs non (like \x{41}) is to have a count of the literal endpoints. Future commits will expand the definition of literal to include things that are portably-specified, including things like \t, \N{U+xx}, etc. It will be easier to specify that we have encountered a non-portable name instead of the other way around. So that is what this commit does. The only non-portables are \digit, \o{}, \x{}, and \cX for all X.
* regcomp.c: White-space only | Karl Williamson | 2015-01-16 | 1 | -14/+14
| Indent inside a newly formed block