| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
These are the base names for various macros used in parsing identifiers.
Prior to this patch, parsing a code point above Latin1 caused disk files
to be loaded.  This patch causes all the information to be compiled into
the Perl binary.
|
|
|
|
|
|
|
| |
Previous commits in this series have caused all the POSIX classes to be
completely specified at C compile time. This allows us to revise the
base function used by all these macros to use these definitions,
avoiding reading them in from disk.
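As a rough illustration of the idea (generic C with invented names, not the
actual perl definitions), a class that is known completely at compile time can
simply be a static constant table, with nothing read from disk at run time:

    #include <stdbool.h>

    /* [:blank:] over the ASCII range, baked into the binary at compile time */
    static const bool is_posix_blank_ascii[128] = {
        ['\t'] = true, [' '] = true
    };

    /* usage sketch:  if (cp < 128 && is_posix_blank_ascii[cp]) ... */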
|
| |
|
|
|
|
|
| |
These functions will be out of the way in mathoms. There were a few
that could not be moved, as-is, so I left them.
|
|
|
|
|
|
| |
In all these cases, there is an already existing macro that does exactly
the same thing as the code that this commit replaces. No sense
duplicating logic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This bottom-level function decodes the first character of a UTF-8 string
into a code point.  Using it directly is discouraged.  This
commit cleans up some of the warnings it can raise. Now, tests for
malformations are done before any tests for other potential issues. One
of those issues involves code points so large that they have never
appeared in any official standard (the current standard has scaled back
the highest acceptable code point from earlier versions). It is
possible (though not done in CPAN) to warn and/or forbid these code
points, while accepting smaller code points that are still above the
legal Unicode maximum. The warning message for this now includes the
code point if representable on the machine. Previously it always
displayed raw bytes, which is what it still does for non-representable
code points.
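A stand-alone sketch of that ordering (plain UTF-8 on an ASCII platform, with a
deliberately simplified set of checks; the real function distinguishes more
cases and falls back to printing raw bytes when the code point cannot be
represented):

    /* sketch: decode the first character of a UTF-8 string, doing the
     * malformation checks before any "well-formed but questionable" checks */
    #include <stdio.h>
    #include <stddef.h>
    typedef unsigned long UV;

    static int decode_first(const unsigned char *s, size_t len, UV *cp_out)
    {
        int need, i;
        UV cp;

        if (len == 0)
            return 0;                       /* nothing to decode */
        if (s[0] < 0x80) {                  /* one-byte (invariant) character */
            *cp_out = s[0];
            return 1;
        }
        need = s[0] >= 0xF0 ? 4 : s[0] >= 0xE0 ? 3 : 2;

        /* 1) malformations first (simplified set of checks) */
        if (s[0] < 0xC2 || s[0] > 0xF7 || len < (size_t)need)
            return 0;
        cp = s[0] & (0xFF >> (need + 1));
        for (i = 1; i < need; i++) {
            if ((s[i] & 0xC0) != 0x80)      /* bad continuation byte */
                return 0;
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* 2) only then the well-formed but questionable cases; the code
         *    point is printed here because it fits in a UV in this sketch */
        if (cp > 0x10FFFF)
            fprintf(stderr, "code point 0x%lX is above the Unicode maximum\n", cp);

        *cp_out = cp;
        return need;
    }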
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The warnings categories non_unicode, nonchar, and surrogate are all
subcategories of 'utf8'.  One should never call packWARN() with both a
category and one of its subcategories, as that makes it impossible to
control the subcategory completely independently.  For example, after
    use warnings 'utf8';
    no warnings 'surrogate';
surrogate warnings will still be output if they are tested with a
    ckWARN2(WARN_UTF8, WARN_SURROGATE);
utf8.c was guilty of this.
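A tiny stand-alone model of why (not perl's real warning machinery, just bit
flags):

    #include <stdio.h>

    enum { UTF8_BIT = 1u << 0, SURROGATE_BIT = 1u << 1 };

    int main(void)
    {
        unsigned bits = UTF8_BIT | SURROGATE_BIT;   /* use warnings 'utf8';     */
        bits &= ~(unsigned)SURROGATE_BIT;           /* no warnings 'surrogate'; */

        /* a ckWARN2(WARN_UTF8, WARN_SURROGATE)-style test fires if EITHER bit
         * is on, so the parent category drags the subcategory back in */
        if ((bits & UTF8_BIT) || (bits & SURROGATE_BIT))
            printf("surrogate warning fires despite 'no warnings'\n");

        /* testing only the subcategory respects the user's request */
        if (bits & SURROGATE_BIT)
            printf("this line is not reached\n");
        return 0;
    }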
|
|
|
|
|
| |
The test here for WARN_UTF8 is redundant, as only if one of the other
three warning categories is enabled will anything actually be output.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The function _invlist_invert_prop() is hereby removed. The recent
changes to allow \p{} to match above-Unicode means that no special
handling of properties need be done when inverting.
This function was accessible to XS code that cheated by using #defines
to pretend it was something it wasn't, but it also has been marked
as subject to change since its inception, and never appeared in any
documentation.
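For context, inverting an inversion list needs no per-property knowledge at
all; a generic sketch (illustrative, not the perl routine):

    #include <string.h>
    #include <stddef.h>
    typedef unsigned long UV;

    /* list[] holds range-start code points, alternating "in" and "out";
     * it must have room for one extra element.  Returns the new length. */
    static size_t invlist_invert(UV *list, size_t len)
    {
        if (len && list[0] == 0) {      /* set began at 0: drop the leading 0 */
            memmove(list, list + 1, (len - 1) * sizeof(UV));
            return len - 1;
        }
        memmove(list + 1, list, len * sizeof(UV));
        list[0] = 0;                    /* otherwise prepend a 0 */
        return len + 1;
    }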
|
|
|
|
|
| |
This indents various newly-formed blocks (by the previous commit) in
these three files, and reflows lines to fit into 79 columns
|
|
|
|
|
|
|
|
|
| |
mktables now outputs the tables for binary properties as inversion
lists, with a size as the first element. This means simpler handling of
these tables in the core, including removal of an entire pass over them
(it was done just to get the size). These tables are marked as for
internal use by the Perl core only, so their format is changeable at
will.
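A sketch of how such a table can be consumed (generic C; names and details are
illustrative, not the exact core code):

    #include <stddef.h>
    typedef unsigned long UV;

    /* tbl[0] is the number of elements; tbl[1..] are range starts that
     * alternate "in" and "out".  A code point is in the set when an odd
     * number of starts are <= it. */
    static int invlist_contains(const UV *tbl, UV cp)
    {
        size_t n = (size_t)tbl[0];
        const UV *data = tbl + 1;
        size_t lo = 0, hi = n;

        while (lo < hi) {               /* binary search: count starts <= cp */
            size_t mid = lo + (hi - lo) / 2;
            if (data[mid] <= cp)
                lo = mid + 1;
            else
                hi = mid;
        }
        return (lo & 1) != 0;
    }

    /* e.g. the ASCII digits as such a table:  { 2, '0', '9' + 1 } */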
|
|
|
|
| |
plus some typo fixes. I probably changed some things in perlintern, too.
|
|
|
|
| |
Rearrange this multi-line conditional to be easier to read.
|
|
|
|
|
| |
"The" referring to a parameter here does not look right to me, a native
English speaker.
|
|
|
|
|
|
|
| |
The names of these hashes stored in some disk files is retrievable by a
standardized lookup. There is no need to have them hard-coded in C
code. This is one less opportunity for the file and the code to get out
of sync.
|
| |
|
|
|
|
|
|
| |
These temporaries are all known to fit into 8 bits; using a U8 should
make that more obvious to an optimizing compiler, so that bounds checking
need not be done.
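A minimal example of the pattern (whether a particular compiler actually takes
advantage of it will vary):

    typedef unsigned char U8;
    static const int class_table[256] = { 0 };  /* contents elided */

    static int classify(unsigned int v)
    {
        U8 tmp = (U8)(v & 0xFF);    /* provably 0..255, so the index cannot be
                                       out of range for a 256-entry table */
        return class_table[tmp];
    }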
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There were a few places that were doing
    unsigned_var = cond ? signed_val : unsigned_val;
or similar. Fixed by suitable casts etc.
The four in utf8.c were fixed by assigning to an intermediate
unsigned var; this has the happy side-effect of collapsing
a large macro expansion, where toUPPER_LC() etc evaluate their arg
multiple times.
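An illustration of the resulting shape, with a stand-in for toUPPER_LC():

    #include <ctype.h>
    typedef unsigned char U8;

    /* stand-in: like toUPPER_LC(), it expands its argument more than once */
    #define MY_TOUPPER_LC(c) (islower((unsigned char)(c)) \
                              ? toupper((unsigned char)(c)) : (c))

    static void upcase_into(U8 *dst, const U8 *src, int do_upcase)
    {
        while (*src) {
            U8 tmp = *src++;            /* evaluated exactly once */
            /* both arms of the ternary are now unsigned, and the macro's
             * argument is a simple variable rather than a large expression */
            *dst++ = do_upcase ? (U8)MY_TOUPPER_LC(tmp) : tmp;
        }
        *dst = '\0';
    }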
|
|
|
|
|
| |
This outdents code to the proper level given that the surrounding block
has been removed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes all the tables in the lib/unicore/To directory that map from
code point to code point be formatted so that the mapped-to code point
is expressed as hexadecimal.
This allows for uniform treatment of these tables in utf8.c, and removes
the final use of strtol() in the (non-CPAN) core. strtol() should be
avoided because it is subject to locale rules, and some older libc
implementations have been buggy. It was used because Perl doesn't have
an efficient way of parsing a decimal number and advancing the parse
pointer to beyond it; we do have such a method for hex numbers.
The input to mktables published by Unicode is also in hex, so this now
conforms to that convention.
This also will facilitate the new work currently being done to read in
the tables that find the closing bracket given an opening one.
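For reference, the shape of a locale-independent hex parse that advances the
pointer (a stripped-down stand-in, not perl's actual routine):

    #include <ctype.h>
    typedef unsigned long UV;

    /* parse a hex number at *pp and leave *pp pointing just past it;
     * overflow handling omitted for brevity */
    static UV parse_hex(const char **pp)
    {
        const char *p = *pp;
        UV val = 0;

        while (isxdigit((unsigned char)*p)) {
            unsigned char c = (unsigned char)*p++;
            val = (val << 4) | (UV)(isdigit(c) ? c - '0'
                                               : (tolower(c) - 'a') + 10);
        }
        *pp = p;
        return val;
    }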
|
|
|
|
|
| |
The Win32 compiler doesn't realize that the values in these places can
be a max of 255. Other compilers don't warn.
|
|
|
|
|
|
| |
IS_UTF8_CHAR is defined by utf8.h, so this is always defined.
In fact, later in utf8.c we use it again, this time without the
ifdef.
|
|
|
|
| |
Previously it was based on HAS_QUAD, which is not (as) correct.
|
| |
|
|
|
|
|
|
|
| |
This removes a macro not yet even in a development release, and splits
its calls into two classes: those where the input is a byte; and those
where it can be any unsigned integer. The byte implementation avoids a
function call on EBCDIC platforms.
|
|
|
|
|
|
|
| |
These functions are still called by some CPAN-upstream modules, so can't
go into mathoms until those are fixed.  There are other changes needed
in these modules, so I'm deferring sending patches to their maintainers
until I know all the necessary changes.
|
|
|
|
|
| |
Since commit 010ab96b9b802bbf77168b5af384569e053cdb63, this function is
no longer a wrapper, so it shouldn't be described as such.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The LATIN SMALL LETTER SHARP S can't fold to 'ss' under /iaa because the
definition of /aa prohibits it, but it can fold to two consecutive
instances of LATIN SMALL LETTER LONG S. A capital sharp s can do the
same, and that was fixed in 1ca267a5, but this one was overlooked then.
It turns out that another possibility was also overlooked in 1ca267a5.
Both U+FB05 (LATIN SMALL LIGATURE LONG S T) and U+FB06 (LATIN SMALL
LIGATURE ST) fold to the string 'st', except under /iaa these folds are
prohibited. But U+FB05 and U+FB06 are equivalent to each other under
/iaa. This wasn't working until now. This commit changes things so
both fold to FB06.
This bug would only surface during /iaa matching, and I don't believe
there are any current code paths which lead to it, hence no tests are
added by this commit.  However, a future commit will expose this bug,
and existing tests will find it then.
|
|
|
|
|
| |
This is a micro optimization. We now check for a common case and return
if found, before checking for a relatively uncommon case.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that the Unicode data is stored in native character set order, it is
rare to need to work with the Unicode order. Traditionally, the real
work was done in functions that worked with the Unicode order, and
wrapper functions (or macros) were used to translate to/from native.
There are two groups of functions: one that translates from code point
to UTF-8, and the other group goes the opposite direction.
This commit changes the base function that translates from UTF-8 to code
point to output native instead of Unicode. Those extremely rare
instances where Unicode output is needed instead will have to hand-wrap
calls to this function with a translation macro, as now described in the
API pod.  Prior to this it was the other way around: the native version
was the wrapper, and the rare, strict-Unicode one was not.  This eliminates a layer of
function call overhead for a common case.
The base function that translates from code point to UTF-8 retains its
Unicode input, as that is more natural to process. However, it is
de-emphasized in the pod, with the functionality description moved to
the pod for a native input wrapper function. And, those wrappers are
now macros in all cases; previously there was function call overhead
sometimes. (Equivalent exported functions are retained, however, for XS
code that uses the Perl_foo() form.)
I had hoped to rebase this commit, squashing it with an earlier commit
in this series, eliminating the use of a temporary function name change,
but the work involved turns out to be large, with no real payoff.
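Schematically (invented names; on an ASCII platform native and Unicode
numbering coincide, so the translation is a no-op, while on EBCDIC it would be
a table lookup):

    typedef unsigned long UV;

    #define MY_NATIVE_TO_UNI(cp) (cp)   /* identity on ASCII platforms */

    /* the base function now answers in NATIVE code points ... */
    static UV my_utf8_to_native(const unsigned char *s)
    {
        return (UV)s[0];    /* decoding proper elided; an invariant is itself */
    }

    /* ... and the rare strict-Unicode consumer is the thin wrapper */
    static UV my_utf8_to_unicode(const unsigned char *s)
    {
        return MY_NATIVE_TO_UNI(my_utf8_to_native(s));
    }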
|
| |
|
|
|
|
|
| |
This moves these two functions to be adjacent to the function they each
call, thus keeping like things together.
|
|
|
|
|
|
| |
There is a macro that accomplishes the same task for a two-byte UTF-8
encoded character, and avoids the overhead of the general purpose
function call.
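The two-byte case is simple bit arithmetic; in plain-UTF-8 terms (perl's macro
additionally folds in the EBCDIC differences):

    /* code point of a character known to occupy exactly two UTF-8 bytes */
    #define MY_TWO_BYTE_UTF8_TO_CP(b1, b2) \
        (((unsigned)(b1) & 0x1F) << 6 | ((unsigned)(b2) & 0x3F))

    /* e.g. MY_TWO_BYTE_UTF8_TO_CP(0xC3, 0xA9) == 0xE9,
     *      LATIN SMALL LETTER E WITH ACUTE */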
|
|
|
|
|
| |
The formal parameter gets evaluated multiple times on an EBCDIC
platform, and so gets incremented more than the single intended time.
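The hazard in miniature, with a stand-in macro rather than the one the commit
fixes:

    /* a macro whose expansion uses its argument several times applies any
     * side effect in that argument several times */
    #define MY_TOUPPER(c) (((c) >= 'a' && (c) <= 'z') ? (c) - ('a' - 'A') : (c))

    static int upcase_first(const char **pp)
    {
        /* wrong:  return MY_TOUPPER(*(*pp)++);  -- *pp could advance 3 times */
        int ch = MY_TOUPPER(**pp);      /* operand has no side effects */
        (*pp)++;                        /* advance exactly once, separately */
        return ch;
    }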
|
|
|
|
| |
There is a macro that accomplishes this task, and is easier to read.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the code so that converting to UTF-8 is avoided unless
necessary. For such inputs, the conversion back from UTF-8 is also
avoided. The cost of doing this is that the first swatches are combined
into one that contains the values for all characters 0-255, instead of
having multiple swatches. That means when first calculating the swatch
it calculates all 256, instead of 128 (160 on EBCDIC).
This also fixes an EBCDIC bug in which characters in this range were
being translated twice.
|
|
|
|
|
|
|
|
| |
This function assumes that the input is well-formed UTF-8, even though
until this commit, the prefatory comments didn't say so. The API does
not pass the buffer length, so there is no way it could check for
reading off the end of the buffer. One code path already calls
valid_utf8_to_uvchr(); this changes the remaining code path to correspond.
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
In the case of invariants these two macros should do the same thing,
but it seems to me that the latter name more clearly indicates what is
going on.
|
|
|
|
|
|
|
|
|
| |
This means it uses official Unicode code point numbering, not native.  Doing
this converts the existing UNISKIP calls in the code to refer to native
code points, which is what they meant anyway. The terminology is
somewhat ambiguous, but I don't think it will cause real confusion.
NATIVE_SKIP is also introduced for situations where it is important to
be precise.
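What a *SKIP macro computes, shown for plain UTF-8 (a sketch; the real macros
also cover EBCDIC and perl's extended-length sequences):

    /* number of UTF-8 bytes needed to represent the code point cp */
    static int my_uvchr_skip(unsigned long cp)
    {
        return cp < 0x80       ? 1
             : cp < 0x800      ? 2
             : cp < 0x10000    ? 3
             : cp < 0x200000   ? 4
             : cp < 0x4000000  ? 5
             : 6;   /* classic UTF-8 reaches 0x7FFFFFFF in 6 bytes;
                       perl goes further with longer sequences */
    }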
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is in preparation for deprecating these functions, to force any
code that has been using these functions to change.
Since the Unicode tables are now stored in native order, these
functions should only rarely be needed.
However, the functionality of these is needed, and in actuality, on
ASCII platforms, the native functions are #defined to these. So what
this commit does is rename the functions to something else, and create
wrappers with the old names, so that anyone using them will get the
deprecation when it actually goes into effect: we are waiting for CPAN
files distributed with the core to change before doing the deprecation.
According to cpan.grep.me, this should affect fewer than 10 additional
CPAN distributions.
|
|
|
|
|
|
|
| |
Code should almost never be dealing with non-native code points.
This is in preparation for later deprecation when our CPAN modules have
been converted away from using it.
|
|
|
|
|
|
|
| |
Now that the tables are stored in native order, there is almost no need
for code to be dealing in Unicode order.
According to grep.cpan.me, there are no uses of this function in CPAN.
|
|
|
|
|
|
|
|
|
| |
Now that all the tables are stored in native format, there is very
little reason to use this function; and those who do need this kind of
functionality should be using the bottom level routine, so as to make it
clear they are doing nonstandard stuff.
According to grep.cpan.me, there are no uses of this function in CPAN.
|
|
|
|
| |
This is in preparation for the current wrappee becoming deprecated.
|
|
|
|
|
| |
Since the value is invariant whether or not the string is in UTF-8, we
already have it in 'uv'; there is no need to do anything else to get it.
|