path: root/regen
Commit message | Author | Age | Files | Lines
* Work properly under UTF-8 LC_CTYPE locales | Karl Williamson | 2014-01-27 | 1 | -0/+8
  This large (sorry, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing, character classification) in UTF-8 locales. As a side effect it resolves [perl #56820]. The basics are easy, but there were a lot of details, and one troublesome edge case discussed below.

  What essentially happens is that when the locale is changed to a UTF-8 one, a global variable is set TRUE (FALSE when changed to a non-UTF-8 locale). Within the scope of 'use locale', this variable is checked, and if TRUE, the code that Perl uses for non-locale behavior is used instead of the code for locale behavior. Since Perl's internal representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.

  More work had to be done for regular expressions. There are three cases.

  1) The character classes \w, [[:punct:]] needed no extra work, as the changes fall out from the base work.

  2) Strings that are to be matched case-insensitively. These form EXACTFL regops (nodes). Notice that if such a string contains only above-Latin1 characters that match only themselves, the node can be downgraded to an EXACT-only node, which presents better optimization possibilities, as we now have a fixed string known at compile time to be required to be in the target string to match. Similarly, if all characters in the string match only other above-Latin1 characters case-insensitively, the node can be downgraded to a regular EXACTFU node (match, folding, using Unicode, not locale, rules). The code changes for this could be done without accepting UTF-8 locales fully, but there were edge cases which needed to be handled differently if I stopped there, so I continued on.

  In an EXACTFL node, all such characters are now folded at compile time (just as before this commit), while the other characters whose folds are locale-dependent are left unfolded. This means that they have to be folded at execution time based on the locale in effect at the moment. Again, this isn't a change from before. The difference is that now some of the folds that need to be done at execution time (in regexec) are potentially multi-char. Some of the code in regexec was trivial to extend to account for this because of existing infrastructure, but the part dealing with regex quantifiers had to have more work.

  Also the code that joins EXACTish nodes together had to be expanded to account for the possibility of multi-character folds within locale handling. This was fairly easy, because it already has infrastructure to handle these under somewhat different circumstances.

  3) In bracketed character classes, represented by ANYOF nodes, a new inversion list was created giving the characters that should be matched by this node when the runtime locale is UTF-8. The list is ignored except under that circumstance. To do this, I created a new ANYOF type which has an extra SV for the inversion list.

  The edge case that caused the most difficulty is folding involving the MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range character that folds to outside that range. The issue is that it doesn't naturally fall out that it will match the CAP MU. If we let the CAP MU fold to the small mu at compile time (which it can because both are above-Latin1 and so the fold is the same no matter what locale is in effect), it could appear that the regnode can be downgraded away from EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not case-insensitively match the CAP MU. This could be special cased in regcomp and regexec, but I wanted to avoid that. Instead the mktables tables are set up to include the CAP MU as a character whose presence forbids the downgrading, so the special casing is in mktables, and not in the C code.
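  The MICRO SIGN edge case can be checked against Unicode fold data outside of Perl. The following is a small Python sketch, not code from the Perl core: it relies on Python's `str.casefold()`, which applies Unicode full case folding, and the helper name `folds_outside_latin1` is made up for this illustration.

```python
# Illustration only: Python's str.casefold() applies Unicode full case
# folding. This is not Perl's regex engine, just the same Unicode data.
MICRO_SIGN = "\u00b5"
GREEK_SMALL_MU = "\u03bc"
GREEK_CAPITAL_MU = "\u039c"


def folds_outside_latin1(ch):
    """Return True if any character in the fold of ch lies above U+00FF."""
    return any(ord(c) > 0xFF for c in ch.casefold())


# Code points in the 0-255 range whose fold leaves that range.
escaping = [cp for cp in range(256) if folds_outside_latin1(chr(cp))]

if __name__ == "__main__":
    print([hex(cp) for cp in escaping])
```

  Both MICRO SIGN and GREEK CAPITAL LETTER MU fold to the same small mu, which is why a node containing the capital mu cannot safely be treated as independent of the 0-255 range.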
* Avoid unnecessary malformed checking | Karl Williamson | 2014-01-27 | 1 | -2/+2
  regen/regcharclass.pl can create macros for use where we need to worry about the possibility of malformed UTF-8, and for where we don't. In the case of looking at regex patterns, the Perl core has complete control over generating them, and hence isn't generally going to create too short a buffer; if it does, it's a bug that will show up and get fixed. This commit changes to generate and use the faster macros that don't do bounds checking.
* regen/regcharclass.pl: Don't test UV >= 0 | Karl Williamson | 2014-01-27 | 1 | -3/+11
  An unsigned value is always >= 0, and generating a test for that can lead to a compiler warning, even if it gets optimized out. The inputs to the macros generated by this script are supposed to be UVs. This commit suppresses any >= 0 test.
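  A minimal sketch of the idea, written in Python rather than as the actual regen/regcharclass.pl code: a generator of C range tests can simply drop the lower bound when it is 0. The function name `range_test` and the output layout are assumptions made for this example.

```python
def range_test(var, lo, hi):
    """Return a C expression testing lo <= var <= hi for an unsigned var.

    Hypothetical helper for illustration; not taken from regcharclass.pl.
    """
    if lo == hi:
        return f"({var} == 0x{lo:02X})"
    if lo == 0:
        # var is unsigned, so 'var >= 0' is always true; omit that half
        # of the test so the compiler has nothing to warn about.
        return f"({var} <= 0x{hi:02X})"
    return f"(0x{lo:02X} <= {var} && {var} <= 0x{hi:02X})"


if __name__ == "__main__":
    print(range_test("cp", 0x00, 0x7F))
    print(range_test("cp", 0x80, 0xFF))
```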
* regen/regcharclass.pl: Fix warning | Karl Williamson | 2014-01-27 | 1 | -1/+0
  wrap() is already defined by the regen infrastructure; there is no need to define it again, and doing so produces a warning.
* Move an inversion list generation to mktables | Karl Williamson | 2014-01-27 | 2 | -6/+5
  Prior to this patch, this was in regen/mk_invlists.pl, but future commits will want it to also be used by the header generated by regen/regcharclass.pl, so use a common source so the logic doesn't have to be duplicated.
* reentr.c: Handle systems without getpwent | Brian Fraser | 2014-01-26 | 1 | -0/+2
  Namely, Android.
* [perl #120977] bump $warnings::VERSION | Tony Cook | 2014-01-22 | 1 | -1/+1
* assume "all" in "use warnings 'FATAL';" and related | Hauke D | 2014-01-22 | 1 | -1/+5
  Until now, the behavior of the statements

      use warnings "FATAL";
      use warnings "NONFATAL";
      no warnings "FATAL";

  was unspecified and inconsistent. This change causes them to be handled with an implied "all" at the end of the import list.

  Tony Cook: fix AUTHORS formatting
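  The implied "all" can be pictured as a normalization step on the import list. The sketch below is Python, not the warnings.pm implementation; the function name `normalize_import_list` is hypothetical, and the rule that "all" is appended only when the list consists solely of FATAL/NONFATAL is an assumption drawn from the three statements listed in the commit message.

```python
MODIFIERS = {"FATAL", "NONFATAL"}


def normalize_import_list(args):
    """Append an implied "all" when only FATAL/NONFATAL modifiers are given.

    Illustrative assumption: lists that already name a category are left
    untouched, as is the empty list.
    """
    args = list(args)
    if args and all(a in MODIFIERS for a in args):
        args.append("all")
    return args


if __name__ == "__main__":
    print(normalize_import_list(["FATAL"]))
    print(normalize_import_list(["FATAL", "void"]))
```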
* rename aggref warnings to autoderef | Ricardo Signes | 2014-01-14 | 2 | -2/+2
* Increase $warnings::VERSION to 1.21 | Father Chrysostomos | 2014-01-14 | 1 | -1/+1
* Make key/push $scalar experimental | Father Chrysostomos | 2014-01-14 | 2 | -0/+3
  We need a better name for the experimental category, but I have not thought of one, even after sleeping on it.
* IDStart and IDCont no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -0/+2
  These are the base names for various macros used in parsing identifiers. Prior to this patch, parsing a code point above Latin1 caused files to be loaded from disk. This patch causes all the information to be compiled into the Perl binary.
* regen/mk_invlists.pl: White-space only | Karl Williamson | 2014-01-09 | 1 | -14/+14
  This outdents a block to be in line with adjacent lines.
* Rmv PL_Posix_ptrs | Karl Williamson | 2014-01-09 | 1 | -14/+0
  Previous commits in this series have removed all uses of this global array. This completely removes it.

  Since it is a global, consideration need be given to possible uses of it outside the core. It has never been externally documented, and is an opaque structure whose internals have changed with every release. The functions used to access it are almost all static to regcomp.c; those few that aren't have been hidden from all but the few .c files that need to have access to them, via #if's.
* Remove PL_L1Posix_ptrs | Karl Williamson | 2014-01-09 | 1 | -9/+0
  This global array is no longer used, having been removed in previous commits in this series.

  Since it is a global, consideration need be given to possible uses of it outside the core. It has never been externally documented, and is an opaque structure whose internals have changed with every release. The functions used to access it are almost all static to regcomp.c; those few that aren't have been hidden from all but the few .c files that need to have access to them, via #if's.
* Compile in list of foldable code points | Karl Williamson | 2014-01-09 | 1 | -0/+1
  When constructing what matches code points under /i, Perl uses an inversion list of all the possible code points that participate in folds. These are relatively few compared to the possible universe of code points, as most of the world's scripts aren't cased, and many characters in the scripts that are cased aren't foldable (such as punctuation). Prior to this commit, the list for the above-Latin1 code points was read in from disk if and only if needed. This commit causes the list to be added to read-only data in a C header, trading a little space in Perl's text segment for speed at execution. This will enable ripping out some code in this and future commits (offsetting the space used by this one).
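  For readers unfamiliar with the data structure, here is a minimal Python sketch of an inversion list lookup. It is a generic illustration and not the layout used in Perl's generated headers: an inversion list is a sorted array of code points at which membership toggles, so a binary search on the number of boundaries at or below a code point decides membership. The names `invlist_contains` and `ASCII_LETTERS` are invented for the example.

```python
from bisect import bisect_right


def invlist_contains(invlist, cp):
    """Return True if code point cp is in the set described by invlist.

    invlist is a sorted list of boundaries: even positions start a range
    that is in the set, odd positions start a range that is not.
    """
    # Count boundaries <= cp; an odd count means cp is inside a range.
    return bisect_right(invlist, cp) % 2 == 1


# Toy list covering [0x41, 0x5B) and [0x61, 0x7B), i.e. the ASCII letters.
ASCII_LETTERS = [0x41, 0x5B, 0x61, 0x7B]

if __name__ == "__main__":
    for ch in "A[az{":
        print(repr(ch), invlist_contains(ASCII_LETTERS, ord(ch)))
```

  Because the list stores only range boundaries, a set with few ranges stays small even when it spans many code points, which fits a compiled-in read-only table.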
* Compile in all POSIX class inversion lists | Karl Williamson | 2014-01-09 | 1 | -0/+10
  This changes charclass_invlists.h to have the complete definitions for all the POSIX classes, like \w and [:alpha:]. Thus these won't have to be loaded off disk at run-time. Taking advantage of this will be done in stages in future commits.
* regen/warnings.pl: Add comments | Karl Williamson | 2014-01-01 | 1 | -0/+8
  These note that warnings categories should be independent in the calls to ckWARN() and packWARN() type macros.
* silence -Wformat-nonliteral compiler warnings | David Mitchell | 2013-11-28 | 1 | -5/+19
  Due to the security risks associated with user-supplied formats being passed to C-level printf() style functions (eg %n), gcc has a -Wformat-nonliteral warning that complains whenever such a function is passed a non-literal format string.

  This commit silences all such warnings in core and ext/.

  The main changes are

  1) the 'f' (format) flag in embed.fnc is now handled slightly more cleverly. Rather than just applying to functions whose last arg is '...' (and where the format arg is assumed to be the previous arg), it can now handle non-'...' functions: arg checking is disabled, but format checking is still done: it works by assuming that an arg called 'fmt', 'pat' or 'f' is the format string (and dies if it fails to find exactly one such arg).

  2) with the new embed.fnc functionality, more functions have been marked with the 'f' flag. When such a function passes its fmt arg onto an inner printf-like function, we simply disable the warning for that call using GCC_DIAG_IGNORE(-Wformat-nonliteral), since we know that the caller must have already checked it.

  3) In quite a few places the format string isn't literal, but it *is* constant (e.g. PL_warn_uninit_sv). For those cases, again disable the warning.

  4) In pp_formline(), a particular format was one of several different literal strings depending on circumstances. Rather than assigning this string to a temporary variable, incorporate the ?: branches directly in the function call arg. gcc is clever enough to decide the arg is then always literal.
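  The argument-name heuristic from point 1 can be sketched as follows. This is a Python illustration under stated assumptions, not the regen/embed.pl code: the input is taken to be a list of C parameter declarations, and the function name `find_format_arg` and the error text are invented here.

```python
import re

FORMAT_NAMES = {"fmt", "pat", "f"}


def find_format_arg(args):
    """Return the 1-based position of the single format-string argument.

    args is a list of C parameter declarations such as 'const char *pat'.
    Raises ValueError unless exactly one parameter is named fmt, pat or f.
    """
    hits = []
    for position, decl in enumerate(args, start=1):
        match = re.search(r"(\w+)\s*$", decl)  # trailing identifier
        if match and match.group(1) in FORMAT_NAMES:
            hits.append(position)
    if len(hits) != 1:
        raise ValueError(f"expected exactly one format arg, found {len(hits)}")
    return hits[0]


if __name__ == "__main__":
    print(find_format_arg(["SV *sv", "const char *pat", "va_list *args"]))
```

  The resulting position is the kind of value a printf-style format attribute needs for its format-string index, with the argument-checking index left at 0 when there is no '...'.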
* mark Perl_my_strftime with format attribute | David Mitchell | 2013-11-28 | 1 | -2/+7
  Mark this function with

      __attribute__format__null_ok__(__strftime__,pTHX_1,0)

  so that the compiler can check strftime-style format args and warn about them.

  Rather than adding new flag(s) to embed.fnc, I just enhanced the f flag to treat it as strftime-style rather than printf if the function name matches /strftime/. This was quicker, and we're unlikely to have many such functions.
* Reënable qr caching for (??{}) retval where possible | Father Chrysostomos | 2013-11-24 | 1 | -1/+1
  When a scalar is returned from (??{...}) inside a regexp, it gets compiled into a regexp if it is not one already. Then the regexp is supposed to be cached on that scalar (in magic), so that the same scalar returned again will not require another compilation.

  Commit e4bfbed39b disabled caching except on references to overloaded objects. But in that one case the caching caused erroneous behaviour, which was just fixed by 636209429f and this commit’s parent, effectively disabling the cache altogether.

  The cache is disabled because it does not apply to TEMP variables (those about to be freed anyway, for which caching would be a waste of CPU), and all non-overloaded non-qr thingies get copied into new mortal (TEMP) scalars (as of e4bfbed39b) before reaching the caching code.

  This commit skips the copy if the return value is already a non-magical string or number. It also allows the caching to happen on constants, which has never been permitted before. (There is actually no reason for disallowing qr magic on read-only variables.)
* Make &CORE::exit respect vmsish exit hint | Father Chrysostomos | 2013-11-08 | 1 | -1/+1
  by removing the hint from the exit op itself and just having pp_exit look in the cop hint hash, where it is already stored (as a result of having been in %^H at compile time).

  &CORE:: subs intentionally lack a nextstate op (cop) so they can see the hints in the caller’s nextstate op.
* Fix &CORE::exit/die under vmsish "hushed" | Father Chrysostomos | 2013-11-08 | 1 | -1/+1
  This commit makes them behave like exit and die without the ampersand by moving the OPpHUSH_VMSISH hint from exit/die op to the current statement (nextstate/cop) instead.

  &CORE:: subs intentionally lack a nextstate op, so they can see the hints in the caller’s nextstate op.
* Stop lexical CORE sub from interfering with CORE:: | Father Chrysostomos | 2013-11-08 | 1 | -1/+0
  The way CORE:: was handled in the lexer was convoluted. CORE was treated initially as a keyword, with exceptions in the lexer to make it behave correctly. If it turned out not to be followed by ::, then the lexer would fall back to treating it as a bareword or sub name.

  Before even checking for a keyword, the lexer looks for :: and goes to the bareword/sub code. But it made a special exception there for CORE::.

  In the end, treating CORE as a keyword recognized by the keyword() function requires more special cases than simply special-casing CORE:: in toke.c.

  This fixes the lexical CORE sub bug, while reducing the total number of lines.
* Split ck_open into two functions | Father Chrysostomos | 2013-11-06 | 1 | -1/+1
  It is used for two op types, but only a small portion of it applies to both, so we can put that in a static function. This makes the next commit easier.
* rv2hv does not use its TARG | Father Chrysostomos | 2013-10-24 | 1 | -1/+1
  rv2hv has had a TARG since perl 5.000, but it has not used it since hv_scalar was added in perl-5.8.0-3008-ga3bcc51. This commit removes it, saving a tiny bit of space in the pad.
* new warnings category, so bump warnings.pm | Ricardo Signes | 2013-10-05 | 1 | -1/+1
* Make postderef experimental | Father Chrysostomos | 2013-10-05 | 1 | -0/+2
* Add postderef_qq feature feature | Father Chrysostomos | 2013-10-05 | 1 | -0/+1
* Increase $feature::VERSION to 1.34 | Father Chrysostomos | 2013-10-05 | 1 | -1/+1
* Add postderef feature feature | Father Chrysostomos | 2013-10-05 | 1 | -0/+1
* Add inversion list for U+80 - U+FF | Karl Williamson | 2013-09-24 | 1 | -0/+9
  This is the upper half of the Latin1 range. This simplifies some code very slightly, but will be of use in future commits.
* Use IVSIZE not HAS_QUAD to enable 'q' and 'Q' formats in pack. | Nicholas Clark | 2013-09-17 | 1 | -4/+3
  Whilst the code for 'q' and 'Q' in pp_pack is itself well behaved if enabled on a perl with 32 bit IVs (using SvNV instead of SvIV and SvUV), the regression tests are not. Several tests use an eval of "pack 'q'" to determine if 64 bit integer support is available (instead of $Config{ivsize}), and t/op/pack.t fails many tests. While these could be fixed (or skipped), unfortunately the approach of evaling "pack 'q'" is fairly popular on CPAN, so the breakage isn't just in the perl core, and might also be present in code we can't see or submit patches for.
* index/value array slice operation | Ruslan Zakirov | 2013-09-13 | 1 | -0/+1
  kvaslice operator that implements the %a[0,2,4] syntax, which results in a list of index/value pairs. Implemented consistently with the "key/value hash slice" operator.
* key/value hash slice operation | Ruslan Zakirov | 2013-09-13 | 1 | -0/+1
  kvhslice operator that implements the %h{1,2,3,4} syntax, which returns a list of key/value pairs rather than just values (as regular slices do).
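  As a rough model of the two slice operators described in this entry and the previous one, the Python sketch below returns a flat list of key/value (or index/value) pairs. It is only an analogy: the function names mirror the op names but are defined here for illustration, and returning `None` for a missing key or out-of-range index is an assumption standing in for Perl's undef.

```python
def kvhslice(h, keys):
    """Model of %h{...}: flat list of key, value, key, value, ..."""
    out = []
    for key in keys:
        out.extend((key, h.get(key)))  # None stands in for undef
    return out


def kvaslice(a, indices):
    """Model of %a[...]: flat list of index, value, index, value, ..."""
    out = []
    for i in indices:
        value = a[i] if -len(a) <= i < len(a) else None
        out.extend((i, value))
    return out


if __name__ == "__main__":
    print(kvhslice({1: "a", 2: "b", 3: "c"}, [1, 3]))
    print(kvaslice(["x", "y", "z", "w", "v"], [0, 2, 4]))
```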
* [perl #115928] we don't use drand48_r or random_r any longer | Tony Cook | 2013-09-13 | 1 | -44/+1
  Removing this should mean that metaconfig will remove the units from the built Configure
* Remove no longer necessary constants | Karl Williamson | 2013-08-29 | 1 | -6/+0
  These character constants were used only for a special edge case in trie construction that has been removed -- except for one instance in regexec.c which could just as well be some other character.
* utf8.h, unicode_constants.h: Add some #defines. | Karl Williamson | 2013-08-29 | 1 | -0/+3
  These will be used in a future commit
* unicode_constants.h: Add #defines for CR, LF | Karl Williamson | 2013-08-29 | 1 | -0/+2
* regen/regcharclass.pl: Make more EBCDIC-friendly | Karl Williamson | 2013-08-29 | 1 | -3/+19
  This commit changes the code generated by the macros so that they work right out-of-the-box on non-ASCII platforms for non-UTF-8 inputs. THEY ARE WRONG for UTF-8, but this is good enough to get perl bootstrapped onto the target platform, and regcharclass.pl can be run there, generating macros with correct UTF-8.
* unicode_constants.h: Add #defines for Byte Order Mark | Karl Williamson | 2013-08-29 | 1 | -0/+2
  These will be used in future commits
* Don't refer to U+XXXX when mean native | Karl Williamson | 2013-08-29 | 1 | -1/+1
  These messages say the output number is Unicode, but it is really native, so change them to say 0xXXXX instead.
* [perl #117265] safesyscalls: check embedded nul in syscall args | Tony Cook | 2013-08-26 | 1 | -3/+4
  Check for the nul char in pathnames and string arguments to syscalls, return undef and set errno to ENOENT. Adds syscalls to the io warnings category.

  Strings with embedded \0 chars were previously ignored in the syscall but kept in perl. The hidden payloads in these invalid string args may cause unnoticed security problems, as they are hard to detect, ignored by the syscalls but kept around in perl PVs.

  Allow an ending \0 though, as several modules add a \0 to such strings without adjusting the length.

  This is based on a change originally by Reini Urban, but pretty much all of the code has been replaced.
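  The check described above can be modelled in a few lines. This is a Python sketch of the rule as stated in the commit message, not the C code in the Perl core: the helper name `check_syscall_arg` and the `(value, errno)` return convention are assumptions made for the example, with `None` standing in for undef.

```python
import errno


def check_syscall_arg(s):
    """Validate a string destined for a syscall.

    Returns (cleaned_string, 0) on success, or (None, errno.ENOENT) if the
    string contains an embedded NUL. A single trailing NUL is tolerated.
    """
    body = s[:-1] if s.endswith("\0") else s
    if "\0" in body:
        # The syscall would silently stop at the NUL, hiding the rest.
        return None, errno.ENOENT
    return body, 0


if __name__ == "__main__":
    print(check_syscall_arg("/etc/passwd"))
    print(check_syscall_arg("/etc/passwd\0.jpg"))
```

  The second call shows the kind of hidden payload the commit message warns about: the part after the NUL would be invisible to the syscall but still present in the Perl string.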
* Generate the lib/ cleanup rules in the Win32 Makefiles from MANIFEST. | Nicholas Clark | 2013-07-24 | 1 | -4/+32
* Generate the lib/ cleanup rules in Makefile.SH automatically from MANIFEST. | Nicholas Clark | 2013-07-24 | 1 | -3/+33
* Generate lib/.gitignore from MANIFEST. | Nicholas Clark | 2013-07-24 | 1 | -0/+122
  It's possible to programmatically determine almost all the files and directories which will be created in lib/ by building the extensions. Hence add a new script regen/lib_cleanup.pl to do this. This saves having to manually update lib/.gitignore to reflect changes in the build products of extensions, which has become a small but recurring instance of scut-work.
* On failure, regen_lib.pl now generates diagnostics, not just "not ok". | Nicholas Clark | 2013-07-24 | 1 | -2/+33
  We have to stop using File::Compare's compare(), as it doesn't return diagnostics about what went wrong.
* Fix off-by-one error in inversion lists. | Karl Williamson | 2013-07-16 | 1 | -7/+2
  The first commit of this topic branch added a dummy 0 element to the end of certain inversion lists to work around an off-by-one error. This commit makes the necessary changes to stop that error, and to remove the dummy element. SvCUR() and invlist_len() now are kept in sync.
* Reinstate "regcomp.c: Make C-array inversion lists const" | Karl Williamson | 2013-07-16 | 1 | -2/+2
  This reverts commit 18505f093a44607b687ae5fe644872f835f66313, which reverted 241136e0ed70738cccd6c4b20ce12b26231f30e5, thus reinstating the latter commit. It turns out that the error being chased down was not due to this commit.

  Its original message was:

  The inversion lists that are compiled into a C header are now const.
* Reinstate "regcomp.c: Move 2 hdr inversion fields to SV hdr" | Karl Williamson | 2013-07-16 | 1 | -7/+1
  This reverts commit 67434bafe4f2406e7c92e69013aecd446c896a9a, which reverted 4fdeca7844470c929f35857f49078db1fd124dbc, thus reinstating the latter commit. It turns out that the error being chased down was not due to this commit.

  Its original message was:

  This commit continues the process of separating the header area of inversion lists from the body. 2 more fields are moved out of the header portion of the inversion list, and into the header portion of the SV that contains it.