| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
This takes the few latest changes in the draft Unicode 12.1, ahead of
our freeze. None are substantive. No further non-substantive changes
will be added. In the unlikely event that a substantive change is made,
we will take it and potentially delay Perl 5.30.
|
|
|
|
| |
A variable needed to be updated for Unicode 12.1.
|
|
|
|
|
|
| |
I realized that commit f9c1e7e9ed13a16099c8471c2030b93deb482571
works now, but future Unicode versions may add fractions that fool it.
This commit should handle any such event.
|
| |
|
|
|
|
| |
Indent block newly formed in previous commit
|
|
|
|
|
|
|
| |
This turns out to be because Windows doesn't necessarily round to even
on floating point %e conversions. The solution is to add an extra entry
rounding up to odd when a fraction is precisely representable in binary.
So far, the only case where this occurs is 1/32.
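
The tie described above can be illustrated with a small sketch. This is not the mktables code (which is Perl); the function name `round_digits` and its interface are assumptions made for illustration. Because 1/32 = 0.03125 is exact in binary, rounding it to three significant digits is a genuine tie, and the two rounding policies give different last digits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: round the decimal digit string 'digits' (for
 * example "3125" for 0.03125) to 'keep' significant digits.  'out' must
 * have room for keep + 2 bytes.  When ties_to_even is nonzero an exact
 * tie rounds to the even neighbour; otherwise it rounds up. */
static void round_digits(const char *digits, size_t keep,
                         int ties_to_even, char *out)
{
    size_t len = strlen(digits);
    int rest_nonzero = 0;
    int round_up;

    assert(keep >= 1 && keep < len);

    /* Anything nonzero after the first dropped digit breaks the tie. */
    for (size_t i = keep + 1; i < len; i++)
        if (digits[i] != '0')
            rest_nonzero = 1;

    if (digits[keep] > '5' || (digits[keep] == '5' && rest_nonzero))
        round_up = 1;
    else if (digits[keep] < '5')
        round_up = 0;
    else /* exact tie: the two policies differ only here */
        round_up = ties_to_even ? ((digits[keep - 1] - '0') & 1) : 1;

    memcpy(out, digits, keep);
    out[keep] = '\0';

    if (round_up) {
        size_t i = keep;
        while (i > 0) {
            i--;
            if (out[i] != '9') {
                out[i]++;
                return;
            }
            out[i] = '0';          /* propagate the carry leftwards */
        }
        /* Every kept digit was 9: the result gains a leading 1. */
        memmove(out + 1, out, keep + 1);
        out[0] = '1';
    }
}
```

Calling it on "3125" with three digits kept yields "312" under round-to-even and "313" when ties round up, which is why a table keyed on the formatted string needs both entries.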
|
|
|
|
| |
This was inadvertently left on, slowing down the process a little.
|
|
|
|
|
|
|
|
| |
Somehow I missed updating some files, with the result that a few official
12.0 final corrections did not make it into
906f46d96ca4ba2d1039d576954bc5a47868348c.
These are mostly tests and break property changes for a few characters.
|
| |
|
|
|
|
| |
This supports this new feature.
|
| |
|
|
|
|
| |
These debugging lines were left in by 21c34e9717d.
|
| |
|
|
|
|
|
| |
This renames a variable to more accurately reflect its content, and adds
a new one that has the old name but accurate content.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
I am starting to write a Unicode::Private_Use module which will allow
one to specify the Unicode properties of private use code points, thus
making them actually useful. This commit adds a hook to regcomp.c to
accommodate this module. The changes are pretty minimal. This way we
don't have to wait another release cycle to get it out there.
I don't want to document this interface until it's proven.
|
|
|
|
|
|
|
| |
IBM says that there are 13 characters whose code point varies depending
on the EBCDIC code page. They fail to mention that the \n character may
also vary. This commit adds checks for \n, in addition to the checks
for the 13 graphic variant ones.
|
|
|
|
| |
Unicode 12.0 is finalized. Change to use it.
|
|
|
|
|
|
|
|
|
|
|
| |
Change a couple of
const char * foo[] = { ... }
to
const char * const foo[] = { ... }
Making the string ptrs const means the whole thing is RO and doesn't
appear in the data section, making porting/libperl.t happier when building
under -DPERL_GLOBAL_STRUCT_PRIVATE.
|
|
|
|
| |
These are in a generated structure.
|
|
|
|
| |
This will be used in a future commit.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In a Turkic locale, these are problematic because their mappings
cross the 255/256 boundary.
This change has the side effect of causing U+307 to be added to the
problematic list, and it normally really isn't problematic, because in
those locales where U+130 and U+131 are problematic, U+307 isn't used.
But applications could switch in and out of Turkic locales, so it's best
to keep treating it as problematic. The only consequence of treating
this mark as problematic is slightly less optimized regex pattern code.
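
A minimal sketch of what "crossing the 255/256 boundary" means, assuming the Turkic mappings from Unicode's special casing rules (I lowercases to U+0131, U+0130 lowercases to i, and the reverse for uppercasing). The function names are hypothetical and none of this is taken from regcomp.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: lowercase mapping under a Turkic locale.  Only the two
 * locale-specific cases are modelled; everything else maps to itself. */
static uint32_t turkic_lower(uint32_t cp)
{
    switch (cp) {
    case 0x0049: return 0x0131;   /* I -> dotless i */
    case 0x0130: return 0x0069;   /* I with dot above -> i */
    default:     return cp;
    }
}

/* Hypothetical: uppercase mapping under a Turkic locale. */
static uint32_t turkic_upper(uint32_t cp)
{
    switch (cp) {
    case 0x0069: return 0x0130;   /* i -> I with dot above */
    case 0x0131: return 0x0049;   /* dotless i -> I */
    default:     return cp;
    }
}

/* A mapping is problematic in the sense above when exactly one side of
 * it fits in a single byte (is below 256). */
static int crosses_byte_boundary(uint32_t from, uint32_t to)
{
    return (from < 256) != (to < 256);
}
```

All four Turkic mappings cross the boundary, whereas an ordinary pair such as A and a stays entirely below 256.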
|
| |
|
|
|
|
|
|
|
|
|
| |
This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with
additional changes to get things to compile.
It turns out I was wrong about the underlying cause that commit
addressed, and it is easier to just use the existing constants that get
generated.
|
|
|
|
|
|
|
|
| |
Before the GCB property handling got more complicated, it was possible
to represent its vagaries with a boolean table on early Unicode
releases. Now there are more complicated rules, and even though early
releases only use 0 or 1, the rules exist and lead to compilation
errors. Just remove the special handling, and let the table be U8.
|
|
|
|
|
|
|
|
|
| |
These 2 Unicode-like property definitions used internally by the regular
expression compiler are moved by this commit from regen/mk_invlists.pl
to lib/unicore/mktables.
By placing all these in the same place, maintainers only have to learn
one bit of code, instead of two.
|
|
|
|
|
|
|
|
| |
If two tables are identical, the code created a #define of one index of
a pointer array to be the other index. But in some cases, that's not sufficient,
and the actual pointer must be defined in terms of the other. This
showed up in compiling perl with an early Unicode version, but the
circumstances could arise again in a future version.
|
|
|
|
|
|
|
| |
This table contains all the code points that are in any multi-character
fold (not the folded-from character, but what that character folds to).
It will be used in a future commit.
|
| |
|
|
|
|
|
| |
The new value is the maximum number of code points that fold to any
single code point. It will be used in a future commit.
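
As a rough sketch of the quantity described, the following counts, for each fold target, how many code points fold to it and returns the largest count. The `fold_pair` struct and `max_fold_fan_in` are hypothetical names; the real value is computed by the Perl generation scripts from the Unicode data, not by code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: code point 'from' case-folds to 'to'. */
struct fold_pair {
    uint32_t from;
    uint32_t to;
};

/* Returns the largest number of entries sharing one fold target.
 * Quadratic, which is fine for a sketch over a small sample. */
static size_t max_fold_fan_in(const struct fold_pair *pairs, size_t n)
{
    size_t best = 0;

    assert(pairs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++)
            if (pairs[j].to == pairs[i].to)
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```

For a sample containing K and U+212A KELVIN SIGN (both fold to k) plus a few one-to-one pairs, the function returns 2.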
|
|
|
|
|
|
|
|
| |
Committer: For porting tests: Update $VERSION in 4 files.
Run:
./perl -Ilib regen/mk_invlists.pl
./perl -Ilib regen/regcharclass.pl
|
|
|
|
|
|
|
| |
C99 says there shouldn't be more than 2 digits in an exponent unless
needed, but Windows uses three. This breaks code that expects two.
Change to remove leading zeros so that only two digits are used. This
allows mktables to operate properly on Windows.
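
The normalization can be sketched as follows. mktables itself is Perl, so this C version only illustrates the idea; `normalize_exponent` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Strip leading zeros from the exponent of a "%e"-style string in
 * place, keeping at least two exponent digits, so "3.125e-002" becomes
 * "3.125e-02" while "1e+100" is left alone. */
static void normalize_exponent(char *s)
{
    char *e;
    char *digits;
    size_t len;
    size_t skip = 0;

    assert(s != NULL);

    e = strpbrk(s, "eE");
    if (e == NULL)
        return;                     /* no exponent part at all */

    digits = e + 1;
    if (*digits == '+' || *digits == '-')
        digits++;

    len = strlen(digits);
    while (len - skip > 2 && digits[skip] == '0')
        skip++;

    /* Shift the remaining digits (and the terminating NUL) left. */
    memmove(digits, digits + skip, len - skip + 1);
}
```

Strings that already have a two-digit exponent pass through unchanged, so the same code path can be used on every platform.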
|
|
|
|
|
| |
These tests have been wrongly passing. A future commit will change
that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These are only used in compiling patterns. They previously were placed
in utf8.c because they are large, and there is a copy of regcomp.c in
ext/re, so they would have used twice the space.
This commit changes things so that they only are used and defined in
regcomp.c (not re_comp.c), so that duplication does not occur. They are
accessed only from one function, and that is also moved from utf8.c to
regcomp.c, only compiled in regcomp.c, and referred to as an external by
re_comp.c.
I had to change the names of the tables. Previously they started with
'PL_' in case any got exposed, but globvar.t mindlessly assumes that any
such variables in the file regcomp.c are globals, and wrongly complains.
It was easier to just change the prefix to 'UNI_' instead.
A few tables are used in regexec.c, and are duplicated in re_exec.c.
Things could be adjusted so that only one copy is used. I tried this,
but the tables are far more intertwined in regexec.c functions than
the ones changed in this commit, as only a single function accesses
these. Thus doing this would be a lot harder, and the payback isn't all
that much. I started work to make them EXTCONSTs, and then discovered
the intertwining, but left in that work, unused.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Each Unicode property that specifies boundary conditions, like
Word_Break, partitions all the Unicode code points into equivalence
classes. So, for example, combining marks are placed into the Extend
class, because they are usually used to extend the previous character
and don't stand on their own. mk_invlists.pl creates a boolean table of
all pairwise combinations of these classes, so that it knows by simple
lookup if the first character is class X and the next character is class
Y, if a break is permitted between these.
However, in some cases the answer isn't as simple as this, and other
means such as the characters in the vicinity of X and Y must be used to
disambiguate. In these cases the table value in the cell (X,Y) isn't a
boolean, but is some other number indicating some specially crafted code
section to execute to resolve the issue.
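
The lookup described in the two paragraphs above can be sketched like this. The class names, the table contents and `break_allowed` are simplified assumptions for illustration; they are not the generated Word_Break table, and the `context_says_break` parameter merely stands in for the result of the specially crafted code section.

```c
#include <assert.h>

/* Hypothetical, much reduced set of equivalence classes. */
enum wb_class { WB_OTHER, WB_LETTER, WB_EXTEND, WB_MIDLETTER, WB_CLASS_COUNT };

/* Cell values: 0 and 1 are the plain boolean answers; any other value
 * means "surrounding characters must be examined". */
enum wb_action { NO_BREAK = 0, BREAK = 1, NEEDS_CONTEXT = 2 };

/* Rows are the class before the candidate break, columns the class
 * after it.  The values are illustrative only. */
static const unsigned char wb_table[WB_CLASS_COUNT][WB_CLASS_COUNT] = {
    /*                OTHER          LETTER         EXTEND    MIDLETTER     */
    /* OTHER     */ { BREAK,         BREAK,         NO_BREAK, BREAK         },
    /* LETTER    */ { BREAK,         NO_BREAK,      NO_BREAK, NEEDS_CONTEXT },
    /* EXTEND    */ { NEEDS_CONTEXT, NEEDS_CONTEXT, NO_BREAK, NEEDS_CONTEXT },
    /* MIDLETTER */ { BREAK,         NEEDS_CONTEXT, NO_BREAK, BREAK         },
};

/* Returns nonzero if a break is permitted between 'before' and 'after'. */
static int break_allowed(int before, int after, int context_says_break)
{
    assert(before >= 0 && before < WB_CLASS_COUNT);
    assert(after >= 0 && after < WB_CLASS_COUNT);

    switch (wb_table[before][after]) {
    case NO_BREAK: return 0;
    case BREAK:    return 1;
    default:       return context_says_break;  /* not a boolean cell */
    }
}
```

Most calls are answered by the table alone; only the cells holding a non-boolean value fall through to the context-dependent answer.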
Over the years, Unicode has tended to subdivide partitions into smaller
ones, as they've refined their algorithms. But with Unicode 11, they
used another method and actually removed partitions. Rather, they
retain the partitions, but no code point actually takes on the value of
an obsolete partition.
In order to not have to change the algorithm unnecessarily between
Unicode releases (who knows, they might change their minds, and
unobsolete these next time), mk_invlists has just kept the tables
around, but those cells won't ever get accessed because no code point in
the current release evaluates to them.
But that makes the tables unnecessarily large. We can achieve the same
thing by mapping each unused equivalence class to the same value, which
we call 'unused'. The algorithms that refer to the obsolete partitions
go through the data assigning values to the cells, but now the cells
overlap, since all obsolete classes map to the same row or column. Thus
the data is total garbage. But that doesn't matter, since that row or
column is never read by the data in the Unicode release the table is
constructed for.
mk_invlists also can compile older Unicode releases, and this makes
those tables smaller than before, with all unused classes in a
given release collapsed into a single row and single column of (unused)
garbage.
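
A minimal sketch of the collapsing idea, assuming a made-up property with five classes of which two are obsolete. All names here are hypothetical; mk_invlists.pl generates its tables from the Unicode data rather than from hand-written enums.

```c
#include <assert.h>

/* Hypothetical full set of classes, including two obsolete ones that no
 * code point evaluates to in the release being compiled. */
enum full_class {
    CLASS_A, CLASS_B, CLASS_OBSOLETE_1, CLASS_C, CLASS_OBSOLETE_2,
    FULL_CLASS_COUNT
};

/* Compact row/column indices: every obsolete class shares IDX_UNUSED. */
enum compact_index { IDX_A, IDX_B, IDX_C, IDX_UNUSED, COMPACT_COUNT };

static const unsigned char class_to_index[FULL_CLASS_COUNT] = {
    [CLASS_A]          = IDX_A,
    [CLASS_B]          = IDX_B,
    [CLASS_OBSOLETE_1] = IDX_UNUSED,
    [CLASS_C]          = IDX_C,
    [CLASS_OBSOLETE_2] = IDX_UNUSED,
};

/* Maps a class to the row or column it occupies in the compact table. */
static int compact_index_of(int cls)
{
    assert(cls >= 0 && cls < FULL_CLASS_COUNT);
    return class_to_index[cls];
}

/* Number of cells in a square pairwise table with n classes. */
static int table_cells(int n)
{
    return n * n;
}
```

With two obsolete classes the pairwise table shrinks from 25 cells to 16. Writes for either obsolete class land in the same row and column and overwrite each other, which is harmless because that row and column are never read.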
|
| |
|
|
|
|
| |
This completes the process of upgrading to Unicode 11.0.
|
|
|
|
|
|
|
|
|
| |
Unicode 11 has some new data files needed for it, and some changes in
the boundary rules that need to be accounted for. This does all that
can be done without causing tests to fail. The LB algorithm has
changed, and tests would fail if we included the code changes needed for
that change in this commit. Instead those few lines will come as part
of the Unicode 11.0 commit.
|
| |
|
|
|
|
|
| |
By simply removing a special case, we can avoid having to work around it
later.
|
| |
|
|
|
|
|
|
|
| |
I forgot that mktables (until told that things have been updated) makes
all failing boundary condition tests pass, and hence I got confused.
It's a simple matter to remind the user that this is happening, to
prevent the confusion.
|
| |
|
| |
|
| |
|
|
|
|
|
|
| |
This makes sure that all patterns in this file are compiled under /aa.
Doing this can catch bugs. The bug the previous commit fixes would have
been caught if we did this.
|
|
|
|
|
|
|
|
|
|
|
| |
The problem here is that it was using a regular expression pattern to
determine if a code point is the integer 0. When a new Unicode release
comes along and adds a new block of decimals, this routine should be run
before the interpreter is compiled for real. And the pattern won't know
about the new block, so this would fail.
Solve the problem by using only Unicode::UCD to discover this info, and
not a pattern.
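
The data-driven approach can be sketched as below, in C rather than the Perl the commit changes. The table is a short excerpt of starting code points of decimal digit runs, and the sketch assumes each run is ten consecutive code points for 0 through 9. The names `digit_run_starts`, `decimal_value` and `is_digit_zero` are hypothetical; the commit itself obtains this information from Unicode::UCD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Excerpt only: the code points for digit zero in four scripts (ASCII,
 * Arabic-Indic, Extended Arabic-Indic, Devanagari).  A real table would
 * be generated from the Unicode data for the release being compiled. */
static const uint32_t digit_run_starts[] = { 0x0030, 0x0660, 0x06F0, 0x0966 };

/* Returns 0..9 for a decimal digit found in the table, or -1 otherwise. */
static int decimal_value(uint32_t cp)
{
    size_t n = sizeof digit_run_starts / sizeof digit_run_starts[0];

    for (size_t i = 0; i < n; i++) {
        uint32_t zero = digit_run_starts[i];
        if (cp >= zero && cp < zero + 10)
            return (int)(cp - zero);
    }
    return -1;
}

/* True only for a code point whose numeric value is the integer 0. */
static int is_digit_zero(uint32_t cp)
{
    return decimal_value(cp) == 0;
}
```

Because the answer comes from a data table rather than from a compiled pattern, supporting a new block of decimals only requires regenerating the table.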
|
| |
|
|
|
|
|
|
|
| |
This property is not normally compiled by perl, but an installation may
choose to use it. It was failing some tests because this is a special
property that is like a perl dual-var. It is both binary and
non-binary, and commit 346f9bfbe12 forgot that.
|
|
|
|
|
| |
In some of these, certain properties aren't defined yet, so have no
entries. Just add a check for that, and compensate.
|