| Commit message (Collapse) | Author | Age | Files | Lines |
|
| |
This large (sorry, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
The basics are easy, but there were a lot of details, and one
troublesome edge case discussed below.
What essentially happens is that when the locale is changed to a UTF-8
one, a global variable is set TRUE (FALSE when changed to a non-UTF-8
locale). Within the scope of 'use locale', this variable is checked,
and if TRUE, the code that Perl uses for non-locale behavior is used
instead of the code for locale behavior. Since Perl's internal
representation is UTF-8, we get UTF-8 behavior for a UTF-8 locale.
More work had to be done for regular expressions. There are three
cases.
1) The character classes \w, [[:punct:]] needed no extra work, as
the changes fall out from the base work.
2) Strings that are to be matched case-insensitively. These form
EXACTFL regops (nodes). Notice that if such a string contains only
above-Latin1 characters that match only themselves, the node can be
downgraded to an EXACT-only node, which presents better optimization
possibilities, as we now have a fixed string known at compile time to be
required to be in the target string to match. Similarly if all
characters in the string match only other above-Latin1 characters
case-insensitively, the node can be downgraded to a regular EXACTFU node
(match, folding, using Unicode, not locale, rules). The code changes
for this could be done without accepting UTF-8 locales fully, but there
were edge cases which needed to be handled differently if I stopped
there, so I continued on.
In an EXACTFL node, all such characters are now folded at compile time
(just as before this commit), while the other characters whose folds are
locale-dependent are left unfolded. This means that they have to be
folded at execution time based on the locale in effect at the moment.
Again, this isn't a change from before. The difference is that now some
of the folds that need to be done at execution time (in regexec) are
potentially multi-char. Some of the code in regexec was trivial to
extend to account for this because of existing infrastructure, but the
part dealing with regex quantifiers needed more work.
Also the code that joins EXACTish nodes together had to be expanded to
account for the possibility of multi-character folds within locale
handling. This was fairly easy, because it already has infrastructure
to handle these under somewhat different circumstances.
3) In bracketed character classes, represented by ANYOF nodes, a new
inversion list was created giving the characters that should be matched
by this node when the runtime locale is UTF-8. The list is ignored
except under that circumstance. To do this, I created a new ANYOF type
which has an extra SV for the inversion list.
The edge case that caused the most difficulty is folding involving the
MICRO SIGN, U+00B5. It folds to the GREEK SMALL LETTER MU, as does the
GREEK CAPITAL LETTER MU. The MICRO SIGN is the only 0-255 range
character that folds to outside that range. The issue is that it
doesn't naturally fall out that it will match the CAP MU. If we let the
CAP MU fold to the small mu at compile time (which it can because both
are above-Latin1 and so the fold is the same no matter what locale is in
effect), it could appear that the regnode can be downgraded away from
EXACTFL to EXACTFU, but doing so would cause the MICRO SIGN to not
case-insensitively match the CAP MU. This could be special cased in regcomp
and regexec, but I wanted to avoid that. Instead the mktables tables
are set up to include the CAP MU as a character whose presence forbids
the downgrading, so the special casing is in mktables, and not in the C
code.
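As a rough illustration of the edge case (a minimal C sketch, not code from
regcomp.c, regexec.c or mktables; demo_fold() and demo_fold_eq() are
hypothetical names, and the fold table covers only the three characters
discussed here): both the MICRO SIGN and the CAP MU fold to the small mu, so
a case-insensitive comparison has to treat them as equal even though only
one of them lies in the 0-255 range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny fold table: every other code point is
 * treated as folding to itself, which is NOT true of real Unicode data. */
static uint32_t demo_fold(uint32_t cp)
{
    switch (cp) {
    case 0x00B5:            /* MICRO SIGN (inside the 0-255 range) */
    case 0x039C:            /* GREEK CAPITAL LETTER MU */
        return 0x03BC;      /* GREEK SMALL LETTER MU */
    default:
        return cp;
    }
}

/* Two code points match case-insensitively when their folds are equal. */
static bool demo_fold_eq(uint32_t a, uint32_t b)
{
    return demo_fold(a) == demo_fold(b);
}

/* Usage: the MICRO SIGN must match the CAP MU, which is why the CAP MU
 * cannot be treated as an ordinary above-Latin1 character. */
static void demo_fold_check(void)
{
    assert(demo_fold_eq(0x00B5, 0x039C));
    assert(!demo_fold_eq(0x00B5, 0x0041));
}
```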
|
| |
regen/regcharclass.pl can create macros for use where we need to worry
about the possibility of malformed UTF-8, and for where we don't. In
the case of looking at regex patterns, the Perl core has complete
control over generating them, and hence isn't generally going to create
too short a buffer; if it does, it's a bug that will show up and get
fixed. This commit changes to generate and use the faster macros that
don't do bounds checking.
|
| |
An unsigned must always be >= 0, and generating a test for that can lead
to a compiler warning, even if it gets optimized out. The input to the
macros generated by this is supposed to be a UV. This commit suppresses
any >= 0 test.
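A minimal C sketch of the idea, assuming a range-test macro of roughly this
shape (uv_t, DEMO_IN_RANGE and DEMO_IN_RANGE_FROM_ZERO are hypothetical
names, not what regen/regcharclass.pl actually emits): when the lower bound
of a range is 0, the lower-bound half of the test is always true for an
unsigned input and can simply be left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for Perl's UV, an unsigned integer type. */
typedef uint64_t uv_t;

/* General form: both bounds are tested. */
#define DEMO_IN_RANGE(cp, lo, hi) \
    ((uv_t)(cp) >= (uv_t)(lo) && (uv_t)(cp) <= (uv_t)(hi))

/* Form for a range starting at 0: the ">= 0" half is omitted because it
 * is always true for an unsigned value and may draw a compiler warning. */
#define DEMO_IN_RANGE_FROM_ZERO(cp, hi) ((uv_t)(cp) <= (uv_t)(hi))

static bool demo_is_ascii(uv_t cp)
{
    return DEMO_IN_RANGE_FROM_ZERO(cp, 0x7F);
}

static bool demo_is_upper_latin1(uv_t cp)
{
    return DEMO_IN_RANGE(cp, 0x80, 0xFF);
}

/* Usage: both forms agree with the obvious expectations. */
static void demo_range_check(void)
{
    assert(demo_is_ascii(0));
    assert(!demo_is_ascii(0x80));
    assert(demo_is_upper_latin1(0xB5));
}
```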
|
| |
wrap() is already defined by the regen infrastructure; there is no need to
define it again, and we get a warning if we persist in doing so.
|
| |
Prior to this patch, this was in regen/mk_invlists.pl, but future
commits will want it to also be used by the header generated by
regen/regcharclass.pl, so use a common source so the logic doesn't have
to be duplicated.
|
| |
Namely, Android.
|
| |
|
| |
Until now, the behavior of the statements
use warnings "FATAL";
use warnings "NONFATAL";
no warnings "FATAL";
was unspecified and inconsistent. This change causes them to be handled
with an implied "all" at the end of the import list.
Tony Cook: fix AUTHORS formatting
|
| |
|
| |
|
| |
We need a better name for the experimental category, but I have not
thought of one, even after sleeping on it.
|
| |
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused loading
disk files. This patch causes all the information to be compiled into
the Perl binary.
|
| |
This outdents a block to be in line with adjacent lines.
|
| |
Previous commits in this series have removed all uses of this global
array. This completely removes it.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
| |
This global array is no longer used, its uses having been removed in
previous commits in this series.
Since it is a global, consideration need be given to possible uses of it
outside the core. It has never been externally documented, and is an
opaque structure whose internals have changed with every release. The
functions used to access it are almost all static to regcomp.c; those
few that aren't have been hidden from all but the few .c files that need
to have access to them, via #if's.
|
| |
When constructing what matches code points under /i, Perl uses an
inversion list of all the possible code points that participate in
folds. This number is relatively few compared to the possible universe
of code points, as most of the world's scripts aren't cased, and many
characters in the scripts that do fold aren't foldable (such as
punctuation). Prior to this commit, the list for the above-Latin1 code
points was read-in from disk if and only if needed. This commit causes
the list to be added to read-only data in a C header, trading a little
space in Perl's text segment for speed at execution. This will enable
ripping out some code in this and future commits (offsetting the space
used by this one).
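For readers unfamiliar with the data structure, here is a minimal C sketch
of a read-only inversion list and its lookup. The names (demo_invlist,
demo_invlist_contains) and the list contents are hypothetical and are not
Perl's generated header; only the general technique is shown. The list is
a sorted array in which even-indexed elements start a range that is in the
set and odd-indexed elements start a range that is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compiled-in list encoding [0x41, 0x5A] and [0x61, 0x7A]
 * (A-Z and a-z).  Being const, it can live in read-only data. */
static const uint32_t demo_invlist[] = { 0x41, 0x5B, 0x61, 0x7B };
static const size_t demo_invlist_len =
    sizeof demo_invlist / sizeof demo_invlist[0];

/* Binary search for the number of elements <= cp.  If that count is odd,
 * the last element <= cp has an even index, so cp is inside a range. */
static bool demo_invlist_contains(const uint32_t *list, size_t len,
                                  uint32_t cp)
{
    size_t lo = 0, hi = len;    /* list[0..lo) <= cp < list[hi..len) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (lo & 1) != 0;
}

/* Usage: membership tests need no disk access or allocation. */
static void demo_invlist_check(void)
{
    assert(demo_invlist_contains(demo_invlist, demo_invlist_len, 'A'));
    assert(!demo_invlist_contains(demo_invlist, demo_invlist_len, '@'));
}
```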
|
| |
This changes charclass_invlists.h to have the complete definitions for
all the POSIX classes, like \w and [:alpha:]. Thus these won't have to
be loaded off disk at run-time.
Taking advantage of this will be done in stages in future commits
|
|
| |
These note that warnings categories should be independent in the calls
to ckWARN() and packWARN() type macros.
|
| |
Due to the security risks associated with user-supplied formats
being passed to C-level printf() style functions (eg %n),
gcc has a -Wformat-nonliteral warning that complains whenever such a
function is passed a non-literal format string.
This commit silences all such warnings in core and ext/.
The main changes are
1) the 'f' (format) flag in embed.fnc is now handled slightly more
cleverly. Rather than just applying to functions whose last arg is '...'
(and where the format arg is assumed to be the previous arg), it
can now handle non-'...' functions: arg checking is disabled, but format
checking is still done: it works by assuming that an arg called 'fmt',
'pat' or 'f' is the format string (and dies if it fails to find exactly one
such arg).
2) with the new embed.fnc functionality, more functions have been marked
with the 'f' flag. When such a function passes its fmt arg onto an inner
printf-like function, we simply disable the warning for that call using
GCC_DIAG_IGNORE(-Wformat-nonliteral), since we know that the caller must
have already checked it.
3) In quite a few places the format string isn't literal, but it *is*
constant (e.g. PL_warn_uninit_sv). For those cases, again disable the
warning.
4) In pp_formline(), a particular format was one of several different
literal strings depending on circumstances. Rather than assigning this
string to a temporary variable, incorporate the ?: branches directly in
the function call arg. gcc is clever enough to decide the arg is then
always literal.
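Point 4 can be sketched in C roughly as follows. This is a minimal example
under stated assumptions, not the pp_formline() code: demo_format_field()
and its format strings are hypothetical. Both branches of the ?: are string
literals placed directly in the call, which is the pattern the commit
message describes gcc as accepting under -Wformat-nonliteral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes n into buf, left- or right-justified in a
 * five-column field.  No temporary "const char *fmt" variable is used;
 * the format choice sits in the call itself. */
static int demo_format_field(char *buf, size_t size, int left_justify, int n)
{
    return snprintf(buf, size, left_justify ? "%-5d|" : "%5d|", n);
}

/* Usage: the two literal formats produce the two layouts. */
static void demo_format_check(void)
{
    char buf[16];

    demo_format_field(buf, sizeof buf, 1, 42);
    assert(strcmp(buf, "42   |") == 0);
    demo_format_field(buf, sizeof buf, 0, 42);
    assert(strcmp(buf, "   42|") == 0);
}
```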
|
| |
mark this function with
__attribute__format__null_ok__(__strftime__,pTHX_1,0)
so that the compiler can check strftime-style format args and warn about
them.
Rather than adding new flag(s) to embed.fnc, I just enhanced the f flag
to treat it as strftime-style rather than printf if the function name
matches /strftime/. This was quicker, and we're unlikely to have many
such functions.
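For comparison, the underlying gcc feature looks roughly like this in plain
C. This is a minimal sketch: demo_strftime() and DEMO_STRFTIME_FORMAT are
hypothetical names, and Perl's own macro shown above additionally handles
pTHX and a possibly-NULL format. With the strftime archetype the last
attribute argument is 0, because there are no variadic arguments to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#if defined(__GNUC__)
#  define DEMO_STRFTIME_FORMAT(idx) __attribute__((format(strftime, idx, 0)))
#else
#  define DEMO_STRFTIME_FORMAT(idx)     /* no checking on other compilers */
#endif

/* Hypothetical wrapper; argument 3 is declared as a strftime-style format
 * so that callers' format strings can be checked at compile time. */
static size_t demo_strftime(char *buf, size_t size, const char *fmt,
                            const struct tm *tm) DEMO_STRFTIME_FORMAT(3);

static size_t demo_strftime(char *buf, size_t size, const char *fmt,
                            const struct tm *tm)
{
    return strftime(buf, size, fmt, tm);
}

/* Usage: a literal format passed by the caller can now be checked. */
static void demo_strftime_check(void)
{
    struct tm t;
    char buf[32];

    memset(&t, 0, sizeof t);
    t.tm_year = 100;                    /* years since 1900, i.e. 2000 */
    t.tm_mday = 1;
    assert(demo_strftime(buf, sizeof buf, "%Y-%m-%d", &t) == 10);
    assert(strcmp(buf, "2000-01-01") == 0);
}
```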
|
| |
When a scalar is returned from (??{...}) inside a regexp, it gets
compiled into a regexp if it is not one already. Then the regexp is
supposed to be cached on that scalar (in magic), so that the same scalar
returned again will not require another compilation.
Commit e4bfbed39b disabled caching except on references to overloaded
objects. But in that one case the caching caused erroneous behaviour,
which was just fixed by 636209429f and this commit’s parent,
effectively disabling the cache altogether.
The cache is disabled because it does not apply to TEMP variables
(those about to be freed anyway, for which caching would be a waste
of CPU), and all non-overloaded non-qr thingies get copied into
new mortal (TEMP) scalars (as of e4bfbed39b) before reaching the
caching code.
This commit skips the copy if the return value is already a non-magical
string or number. It also allows the caching to happen on constants,
which has never been permitted before. (There is actually no
reason for disallowing qr magic on read-only variables.)
|
| |
by removing the hint from the exit op itself and just having pp_exit
look in the cop hint hash, where it is already stored (as a result of
having been in %^H at compile time).
&CORE:: subs intentionally lack a nextstate op (cop) so they can see
the hints in the caller’s nextstate op.
|
|
| |
This commit makes them behave like exit and die without the ampersand
by moving the OPpHUSH_VMSISH hint from exit/die op to the current
statement (nextstate/cop) instead. &CORE:: subs intentionally lack a
nextstate op, so they can see the hints in the caller’s nextstate op.
|
|
| |
The way CORE:: was handled in the lexer was convoluted.
CORE was treated initially as a keyword, with exceptions in the lexer
to make it behave correctly. If it turned out not to be followed
by ::, then the lexer would fall back to treating it as a bareword
or sub name.
Before even checking for a keyword, the lexer looks for :: and goes
to the bareword/sub code. But it made a special exception there
for CORE::.
In the end, treating CORE as a keyword recognized by the keyword()
function requires more special cases than simply special-casing CORE::
in toke.c.
This fixes the lexical CORE sub bug, while reducing the total number
of lines.
|
| |
It is used for two op types, but only a small portion of it applies
to both, so we can put that in a static function. This makes the
next commit easier.
|
| |
rv2hv has had a TARG since perl 5.000, but it has not used it since
hv_scalar was added in perl-5.8.0-3008-ga3bcc51.
This commit removes it, saving a tiny bit of space in the pad.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
This is the upper half of the Latin1 range. This simplifies some code
very slightly, but will be of use in future commits.
|
| |
Whilst the code for 'q' and 'Q' in pp_pack is itself well behaved if enabled
on a perl with 32 bit IVs (using SvNV instead of SvIV and SvUV), the
regression tests are not. Several tests use an eval of "pack 'q'" to
determine if 64 bit integer support is available (instead of
$Config{ivsize}), and t/op/pack.t fails many tests. While these could be
fixed (or skipped), unfortunately the approach of evaling "pack 'q'" is
fairly popular on CPAN, so the breakage isn't just in the perl core, and
might also be present in code we can't see or submit patches for.
|
| |
kvaslice operator that implements the %a[0,2,4] syntax, which
results in a list of index/value pairs. Implemented for
consistency with the "key/value hash slice" operator.
|
| |
kvhslice operator that implements the %h{1,2,3,4} syntax, which
returns a list of key/value pairs rather than just values
(as regular slices do).
|
| |
Removing this should mean that metaconfig will remove the units from
the built Configure.
|
| |
These character constants were used only for a special edge case in trie
construction that has been removed -- except for one instance in
regexec.c which could just as well be some other character.
|
| |
These will be used in a future commit.
|
| |
|
| |
This commit changes the code generated by the macros so that they work
right out-of-the-box on non-ASCII platforms for non-UTF-8 inputs. THEY
ARE WRONG for UTF-8, but this is good enough to get perl bootstrapped
onto the target platform, and regcharclass.pl can be run there,
generating macros with correct UTF-8.
|
| |
These will be used in future commits.
|
| |
These messages say the output number is Unicode, but it is really
native, so change them to say it is 0xXXXX.
|
| |
Check for the nul char in pathnames and string arguments to
syscalls; if one is found, return undef and set errno to ENOENT.
Added the syscalls category to the io warnings category.
Previously, the part of a string after an embedded \0 was ignored by the
syscall but kept in perl. The hidden payloads in these invalid string args
may cause unnoticed security problems, as they are hard to detect: ignored
by the syscalls but kept around in perl PVs.
Allow an ending \0 though, as several modules add a \0 to
such strings without adjusting the length.
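The check described above can be sketched in C along these lines. This is
a minimal illustration, not the code added by the commit:
demo_path_is_nul_free() is a hypothetical helper that takes the string and
its stored length, rejects an embedded \0 with ENOENT, and tolerates a
single \0 as the very last byte.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper.  Returns true if the len-byte string can safely be
 * handed to a syscall expecting a C string: it contains no \0, or its
 * first \0 is the final byte.  Otherwise sets errno to ENOENT. */
static bool demo_path_is_nul_free(const char *s, size_t len)
{
    const char *nul = memchr(s, '\0', len);

    if (nul == NULL || (size_t)(nul - s) == len - 1)
        return true;
    errno = ENOENT;
    return false;
}

/* Usage: "ab\0c" carries a hidden payload after the \0 and is rejected. */
static void demo_path_check(void)
{
    assert(demo_path_is_nul_free("abc", 3));
    assert(demo_path_is_nul_free("abc\0", 4));     /* trailing \0 allowed */
    errno = 0;
    assert(!demo_path_is_nul_free("ab\0c", 4));
    assert(errno == ENOENT);
}
```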
This is based on a change originally by Reini Urban, but pretty much
all of the code has been replaced.
|
| |
|
| |
|
| |
It's possible to programmatically determine almost all the files and
directories which will be created in lib/ by building the extensions.
Hence add a new script regen/lib_cleanup.pl to do this.
This saves having to manually update lib/.gitignore to reflect changes in
the build products of extensions, which has become a small but recurring
instance of scut-work.
|
| |
We have to stop using File::Compare's compare(), as it doesn't return
diagnostics about what went wrong.
|
| |
The first commit of this topic branch added a dummy 0 element to the end
of certain inversion lists to work around an off-by-one error. This
commit makes the necessary changes to stop that error, and to remove
the dummy element. SvCUR() and invlist_len() now are kept in sync.
|
| |
This reverts commit 18505f093a44607b687ae5fe644872f835f66313, which
reverted 241136e0ed70738cccd6c4b20ce12b26231f30e5, thus reinstating the
latter commit. It turns out that the error being chased down was not
due to this commit.
Its original message was:
The inversion lists that are compiled into a C header are now const.
|
| |
This reverts commit 67434bafe4f2406e7c92e69013aecd446c896a9a, which
reverted 4fdeca7844470c929f35857f49078db1fd124dbc, thus reinstating the
latter commit. It turns out that the error being chased down was not
due to this commit.
Its original message was:
This commit continues the process of separating the header area of
inversion lists from the body. 2 more fields are moved out of the
header portion of the inversion list, and into the header portion of the
SV that contains it.
|