path: root/charclass_invlists.h
Commit message (Author, Age, Files, Lines)
* Update Unicode 12.1 (Karl Williamson, 2019-04-19, 1 file, -1/+1)
| This takes the few latest changes in the draft Unicode 12.1, ahead of our freeze. None are substantive. No further non-substantive changes will be added. In the unlikely event that a substantive change is made, we will take it and potentially delay Perl 5.30.
* mktables: Silence warning (Karl Williamson, 2019-04-16, 1 file, -1/+1)
| A variable needed to be updated for Unicode 12.1.
* mktables: Generalize handling of [perl #133979] (Karl Williamson, 2019-04-10, 1 file, -1/+1)
| I realized that commit f9c1e7e9ed13a16099c8471c2030b93deb482571 works now, but future Unicode versions may add fractions that fool it. This commit should handle any such event.
* Preliminary Unicode 12.1 (Unicode Consortium, 2019-04-08, 1 file, -211/+1494)
|
* mktables: White-space only (Karl Williamson, 2019-04-06, 1 file, -1/+1)
| Indent block newly formed in previous commit
* PATCH: [perl #133979] uniprops02 failing on Windows (Karl Williamson, 2019-04-06, 1 file, -1/+1)
| This turns out to be because Windows doesn't necessarily round to even on floating point %e conversions. The solution is to add an extra entry rounding up to odd when a fraction is precisely representable in binary. So far, the only case where this occurs is 1/32.
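The tie behind this fix can be illustrated with a short Python sketch. Python is used only for illustration (mktables itself is Perl), and `sci_candidates` is a hypothetical helper, not something in the Perl source. Because 1/32 is exactly 0.03125 in binary, formatting it to two fractional digits is a genuine tie, so round-half-even and round-half-up print different strings. Emitting both candidates is one way to make a lookup succeed whichever rule a platform's %e conversion applies.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP, localcontext
from fractions import Fraction


def sci_candidates(value, digits):
    """Return every %e-style string a platform might print for value.

    Hypothetical helper: formats the exact binary value once with
    round-half-even and once with round-half-up.
    """
    exact = Decimal(value)  # exact expansion of the binary double
    candidates = set()
    for mode in (ROUND_HALF_EVEN, ROUND_HALF_UP):
        with localcontext() as ctx:
            ctx.rounding = mode
            candidates.add(format(exact, f".{digits}e"))
    return candidates


# 1/32 is exactly representable in binary, so rounding 3.125 to two
# fractional digits is a genuine tie.
assert Fraction(0.03125) == Fraction(1, 32)
print(sorted(sci_candidates(1 / 32, 2)))

# 0.1 is not exactly representable, so there is no tie to break.
print(sorted(sci_candidates(0.1, 2)))
```

Fractions that are not exact in binary never sit precisely on a tie, which is consistent with the commit's note that only 1/32 is affected so far.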
* mktables: Turn off DEBUG (Karl Williamson, 2019-04-04, 1 file, -1/+1)
| This inadvertently was left on, slowing down the process a little
* Corrections to Unicode 12.0 (Unicode Consortium, 2019-04-02, 1 file, -129/+499)
| Somehow I missed updating some files, with the result that a few official 12.0 final corrections did not make it into 906f46d96ca4ba2d1039d576954bc5a47868348c. These are mostly tests and break property changes for a few characters.
* Add tests for wildcards in Unicode property values (Karl Williamson, 2019-03-12, 1 file, -1/+1)
|
* regen/mk_invlists.pl: Add tables for Unicode wildcards (Karl Williamson, 2019-03-12, 1 file, -1/+3472)
| This supports this new feature.
* Add warnings category experimental::uniprop_wildcards (Karl Williamson, 2019-03-12, 1 file, -1/+1)
|
* regen/mk_invlists.pl: Remove stray debugging stmts (Karl Williamson, 2019-03-12, 1 file, -1/+1)
| These debugging lines were left in by 21c34e9717d
* regen/mk_invlists.pl: Comment/white-space only (Karl Williamson, 2019-03-12, 1 file, -1/+1)
|
* regen/mk_invlists.pl, lib/utf8_heavy.pl: Rename variable (Karl Williamson, 2019-03-12, 1 file, -1/+1)
| This renames a variable to more accurately reflect its content, and adds a new one which has the old name but with accurate content.
* charclass_invlists.h: Add comment (Karl Williamson, 2019-03-12, 1 file, -2/+2)
|
* Add hook for Unicode private use override (Karl Williamson, 2019-03-07, 1 file, -1/+1)
| I am starting to write a Unicode::Private_Use module which will allow one to specify the Unicode properties of private use code points, thus making them actually useful. This commit adds a hook to regcomp.c to accommodate this module. The changes are pretty minimal. This way we don't have to wait another release cycle to get it out there. I don't want to document this interface, until it's proven.
* Check for \n in EBCDIC code pages (Karl Williamson, 2019-03-06, 1 file, -427/+427)
| IBM says that there are 13 characters whose code point varies depending on the EBCDIC code page. They fail to mention that the \n character may also vary. This commit adds checks for \n, in addition to the checks for the 13 graphic variant ones.
* Use Unicode 12.0 (Unicode Consortium, 2019-03-04, 1 file, -4758/+10891)
| Unicode 12.0 is finalized. Change to use it.
* PERL_GLOBAL_STRUCT_PRIVATE: fix some const strings (David Mitchell, 2019-02-19, 1 file, -7/+5)
| Change a couple of
|     const char * foo[] = { ... }
| to
|     const char * const foo[] = { ... }
| Making the string pointers const means the whole thing is read-only and doesn't appear in the data section, making porting/libperl.t happier when building under -DPERL_GLOBAL_STRUCT_PRIVATE.
* mktables: Omit unnecessary duplicates (Karl Williamson, 2019-02-16, 1 file, -1/+1)
| These are in a generated structure.
* regen/mk_invlists.pl: Create new inversion list (Karl Williamson, 2019-02-05, 1 file, -1/+355)
| This will be used in a future commit.
* mktables: Make Turkic 'I' chars problematic (Karl Williamson, 2019-02-05, 1 file, -4/+6)
| In a Turkic locale, these are problematic because their mappings cross the 255/256 boundary. This change has the side effect of causing U+307 to be added to the problematic list, and it normally really isn't problematic, because in those locales where U+130 and U+131 are problematic, U+307 isn't used. But applications could switch in and out of Turkic locales, so it's best to leave it considered problematic. The consequences of making this mark problematic are simply slightly less optimized regex pattern code.
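The 255/256 boundary test can be sketched as below. The two pairs are the standard Turkic case pairs (I with dotless ı, and İ with i). The table name and function name are hypothetical, and this is a Python illustration rather than what mktables actually does.

```python
# Turkic case pairs as (upper, lower) code points.
TURKIC_PAIRS = [
    (0x0049, 0x0131),  # CAPITAL I <-> SMALL DOTLESS I
    (0x0130, 0x0069),  # CAPITAL I WITH DOT ABOVE <-> SMALL I
]


def crosses_latin1_boundary(a, b):
    """True when one code point is below 256 and the other is not."""
    return (a < 256) != (b < 256)


# Collect the above-255 member of every pair whose mapping crosses
# the boundary.
problematic = sorted(
    cp
    for upper, lower in TURKIC_PAIRS
    if crosses_latin1_boundary(upper, lower)
    for cp in (upper, lower)
    if cp > 255
)
print([f"U+{cp:04X}" for cp in problematic])
```

Both pairs straddle the boundary, which matches the commit's statement that U+130 and U+131 are the problematic characters in those locales.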
* regen/mk_invlists.pl: Rmv extraneous tab in output (Karl Williamson, 2019-01-04, 1 file, -2/+2)
|
* Revert "regen/mk_invlists.pl: Fix bug when 2 ident tables" (Karl Williamson, 2018-12-31, 1 file, -41/+2)
| This reverts commit 7e9b4fe4d85e9b669993bf96a7e33ffff3197e20, with additional changes to get things to compile.
|
| It turns out I was wrong about the underlying cause that commit addressed, and it is easier to just use the existing constants that get generated.
* regen/mk_invlists.pl: Rmv outdated code (Karl Williamson, 2018-12-26, 1 file, -1/+1)
| Before the GCB property handling got more complicated, it was possible to represent its vagaries with a boolean table on early Unicode releases. Now there are more complicated rules, and even though early releases only use 0 or 1, the rules exist and lead to compilation errors. Just remove the special handling, and let the table be U8.
* Move 2 property defns to mktables (Karl Williamson, 2018-12-25, 1 file, -456/+442)
| These 2 Unicode-like property definitions used internally by the regular expression compiler are moved by this commit from regen/mk_invlists.pl to lib/unicore/mktables. By placing all these in the same place, maintainers only have to learn one bit of code, instead of two.
* regen/mk_invlists.pl: Fix bug when 2 ident tables (Karl Williamson, 2018-12-25, 1 file, -2/+41)
| If two tables are identical, the code created a #define of one index of a pointer array to be the other index. But in some cases, that's not sufficient, and the actual pointer must be defined in terms of the other. This showed up in compiling perl with an early Unicode version, but the circumstances could arise again in a future version.
* regen/mk_invlists.pl: Add new table (Karl Williamson, 2018-12-07, 1 file, -1/+281)
| This table contains all the code points that are in any multi-character fold (not the folded-from character, but what that character folds to). It will be used in a future commit.
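As a rough illustration of what such a table holds, the following Python sketch collects the code points that appear on the folded-to side of any multi-character fold. It relies on Python's `str.casefold()`, which performs full case folding using whatever Unicode version the Python build bundles. The function name is hypothetical and the real table is generated by regen/mk_invlists.pl in Perl.

```python
def code_points_in_multi_char_folds(code_points):
    """Collect the code points that appear in any multi-character fold."""
    found = set()
    for cp in code_points:
        folded = chr(cp).casefold()  # full case folding
        if len(folded) > 1:
            # Record what the character folds to, not the character itself.
            found.update(ord(ch) for ch in folded)
    return found


# 'A' has a single-character fold; sharp s folds to "ss" and the
# ffi ligature folds to "ffi".
sample = [0x0041, 0x00DF, 0xFB03]
print(sorted(hex(cp) for cp in code_points_in_multi_char_folds(sample)))
```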
* regen/mk_invlists.pl: Rmv no longer used array (Karl Williamson, 2018-12-07, 1 file, -1/+1)
|
* regen/mk_invlists.pl: Generate a new value (Karl Williamson, 2018-11-26, 1 file, -1/+8)
| The new value is the maximum number of code points that fold to any single code point. It will be used in a future commit.
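One way to compute such a maximum is to invert the fold mapping and take the largest group, as in this Python sketch. The mapping here is a tiny hand-written excerpt of simple case folds chosen for illustration, and the names are hypothetical. The generated value in perl comes from the full Unicode data and may count things differently.

```python
from collections import defaultdict


def max_folds_to_one(simple_folds):
    """Largest number of code points sharing a single fold target."""
    sources = defaultdict(set)
    for source, target in simple_folds.items():
        sources[target].add(source)
    return max((len(group) for group in sources.values()), default=0)


# Tiny hand-written excerpt of simple case folds (source -> target).
SAMPLE_FOLDS = {
    0x0041: 0x0061,  # A -> a
    0x004B: 0x006B,  # K -> k
    0x212A: 0x006B,  # KELVIN SIGN -> k
}
print(max_folds_to_one(SAMPLE_FOLDS))
```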
* fix typos (Alexandr Savca, 2018-10-09, 1 file, -1/+1)
| Committer: For porting tests: Update $VERSION in 4 files. Run:
|     ./perl -Ilib regen/mk_invlists.pl
|     ./perl -Ilib regen/regcharclass.pl
* mktables: Handle platforms with 3 digit exponents (Karl Williamson, 2018-08-20, 1 file, -1/+1)
| C99 says there shouldn't be more than 2 digits in an exponent unless needed. But Windows uses three. This messes some stuff up that is expecting two. Change to remove leading zeros so that only two digits are used. This allows mktables to properly operate on Windows.
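The normalization described here amounts to stripping leading zeros from a padded exponent until two digits remain. A minimal Python sketch follows. The regex and function name are assumptions made for illustration and are not taken from mktables.

```python
import re

# Matches an exponent whose leading zeros pad it past two digits.
_PADDED_EXPONENT = re.compile(r"([eE][+-])0+(\d{2})$")


def normalize_exponent(text):
    """Strip leading exponent zeros so that two digits remain."""
    return _PADDED_EXPONENT.sub(r"\1\2", text)


for raw in ("3.125e-002", "1.0e+000", "1.5e+100", "3.125e-02"):
    print(raw, "->", normalize_exponent(raw))
```

Exponents that genuinely need three digits, and those already in two-digit form, pass through unchanged.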
* mktables: Some tests are invalid (Karl Williamson, 2018-08-03, 1 file, -1/+1)
| These tests have been wrongly passing. A future commit will change that.
* Move Unicode \p{} definitions to regcomp.c (Karl Williamson, 2018-08-02, 1 file, -3947/+4221)
| These are only used in compiling patterns. They previously were placed in utf8.c because they are large, and there is a copy of regcomp.c in ext/re, so they would have used twice the space. This commit changes things so that they are used and defined only in regcomp.c (not re_comp.c), so that duplication does not occur. They are accessed only from one function, and that is also moved from utf8.c to regcomp.c, only compiled in regcomp.c, and referred to as an external by re_comp.c.
|
| I had to change the names of the tables. Previously they started with 'PL_' in case any got exposed, but globvar.t mindlessly assumes that any such variables in the file regcomp.c are globals, and wrongly complains. It was easier to just change the prefix to 'UNI_' instead.
|
| A few tables are used in regexec.c, and are duplicated in re_exec.c. Things could be adjusted so that only one copy is used. I tried this, but the tables are far more intertwined in regexec.c functions than the ones changed in this commit, as only a single function accesses these. Thus doing this would be a lot harder, and the payback isn't all that much. I started work to make them EXTCONSTs, and then discovered the intertwining, but left in that work, unused.
* regen/mk_invlists.pl: Collapse unused boundary values (Karl Williamson, 2018-07-21, 1 file, -81/+79)
| Each Unicode property that specifies boundary conditions, like Word_Break, partitions all the Unicode code points into equivalence classes. So, for example, combining marks are placed into the Extend class, because they are usually used to extend the previous character and don't stand on their own.
|
| mk_invlists.pl creates a boolean table of all pairwise combinations of these classes, so that it knows by simple lookup, if the first character is class X and the next character is class Y, whether a break is permitted between them. However, in some cases the answer isn't as simple as this, and other means such as the characters in the vicinity of X and Y must be used to disambiguate. In these cases the table value in the cell (X,Y) isn't a boolean, but is some other number indicating some specially crafted code section to execute to resolve the issue.
|
| Over the years, Unicode has tended to subdivide partitions into smaller ones, as they've refined their algorithms. But with Unicode 11, they used another method and actually removed partitions. Rather, they retain the partitions, but no code point actually takes on the value of an obsolete partition. In order to not have to change the algorithm unnecessarily between Unicode releases (who knows, they might change their minds, and unobsolete these next time), mk_invlists has just kept the tables around, but those cells won't ever get accessed because no code point in the current release evaluates to them.
|
| But that makes the tables unnecessarily large. We can achieve the same thing by mapping each unused equivalence class to the same value, which we call 'unused'. The algorithms that refer to the obsolete partitions go through the data assigning values to the cells, but now the cells overlap, since all obsolete classes map to the same row or column. Thus the data is total garbage. But that doesn't matter, since that row or column is never read by the data in the Unicode release the table is constructed for.
|
| mk_invlists also can compile older Unicode releases, and this makes those tables smaller than before, with all unused classes in a given release collapsed into a single row and single column of (unused) garbage.
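The collapsing idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl in regen/mk_invlists.pl: the class names `Old_A` and `Old_B`, the rule values, and both function names are hypothetical. Every class that no code point uses is given one shared index, so the pairwise table shrinks, and rules that mention obsolete classes overwrite each other in a row and column that are never read.

```python
def collapse_unused(all_classes, used_classes):
    """Map each class to a table index, sharing one index for unused ones."""
    index = {}
    names = []
    for name in all_classes:
        if name in used_classes:
            index[name] = len(names)
            names.append(name)
    unused_index = len(names)
    names.append("unused")
    for name in all_classes:
        index.setdefault(name, unused_index)
    return index, names


def build_break_table(index, size, rules):
    """Fill a size x size table from (before, after, value) rules."""
    table = [[0] * size for _ in range(size)]
    for before, after, value in rules:
        # Rules naming obsolete classes all land in the shared row or
        # column and overwrite each other; that garbage is never read.
        table[index[before]][index[after]] = value
    return table


ALL = ["Other", "Extend", "Old_A", "Old_B"]  # Old_* are hypothetical
USED = {"Other", "Extend"}
index, names = collapse_unused(ALL, USED)

# Hypothetical rule values: 1 means a break is permitted, 0 means not.
rules = [
    ("Other", "Extend", 0),
    ("Other", "Other", 1),
    ("Old_A", "Other", 1),
    ("Old_B", "Other", 0),
]
table = build_break_table(index, len(names), rules)
print(names, len(table), "x", len(table[0]))
```

With four classes the uncollapsed table would be 4 x 4. Here two obsolete classes share a slot, giving 3 x 3, and the saving grows with the number of obsolete classes.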
* regen/mk_invlists.pl: Make adjacent comment and its code (Karl Williamson, 2018-07-21, 1 file, -1/+1)
|
* Use Unicode 11.0 (Unicode Consortium, 2018-07-20, 1 file, -6170/+14975)
| This completes the process of upgrading to Unicode 11.0.
* Prepare for Unicode 11.0 (Karl Williamson, 2018-07-20, 1 file, -65/+82)
| Unicode 11 has some new data files needed for it, and some changes in the boundary rules that need to be accounted for. This does all that can be done without causing tests to fail. The LB algorithm has changed, and tests would fail if we included the code changes needed for that change in this commit. Instead those few lines will come as part of the Unicode 11.0 commit.
* mktables: Comment, white-space (Karl Williamson, 2018-07-20, 1 file, -1/+1)
|
* mktables: Avoid some unnecessary work (Karl Williamson, 2018-07-20, 1 file, -1/+1)
| By simply removing a special case, we can avoid having to work around it later.
* regen/mk_invlists.pl: Fix a couple typos, nits (Karl Williamson, 2018-07-20, 1 file, -1/+1)
|
* mktables: Improve warning message (Karl Williamson, 2018-07-20, 1 file, -1/+1)
| I forgot that mktables (until told that things have been updated) makes all failing boundary condition tests pass, and hence I got confused. It's a simple matter to remind the user that this is happening, to prevent the confusion.
* uni_keywords.h: Fix misspelling typo (Karl Williamson, 2018-07-07, 1 file, -1/+1)
|
* mktables: Correct L<> for perluniprops; rmv trail space (Karl Williamson, 2018-06-25, 1 file, -1/+1)
|
* regen/mk_invlists.pl: Fix outdated comments (Karl Williamson, 2018-06-25, 1 file, -1/+1)
|
* regen/mk_invlists.pl: use re 'qr/aa' (Karl Williamson, 2018-06-25, 1 file, -1/+1)
| This makes sure that all patterns in this file are compiled under /aa. Doing this can catch bugs. The bug the previous commit fixes would have been caught if we did this.
* regen/mk_invlists.pl: Fix chicken and egg problem (Karl Williamson, 2018-06-25, 1 file, -1/+1)
| The problem here is that it was using a regular expression pattern to determine if a code point is the integer 0. When a new Unicode release comes along and adds a new block of decimals, this routine should be run before the interpreter is compiled for real. And the pattern won't know about the new block, so this would fail. Solve the problem by using only Unicode::UCD to discover this info, and not a pattern.
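The general shape of the fix, asking a character-data lookup for the decimal value instead of matching a digit pattern, can be shown by analogy in Python. This is only an analogy: the commit uses Perl's Unicode::UCD, whereas the sketch below uses Python's `unicodedata` module, whose tables are bundled with the interpreter and so do not reproduce the bootstrapping situation itself. The function name is hypothetical.

```python
import unicodedata


def is_decimal_zero(cp):
    """True if the code point is a decimal digit whose value is 0."""
    # decimal() returns the supplied default (None) for non-digits.
    return unicodedata.decimal(chr(cp), None) == 0


# DIGIT ZERO, ARABIC-INDIC DIGIT ZERO, DIGIT ONE
for cp in (0x0030, 0x0660, 0x0031):
    print(f"U+{cp:04X}", is_decimal_zero(cp))
```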
* mktables: Add, change some comments (Karl Williamson, 2018-06-25, 1 file, -1/+1)
|
* mktables: Handle cjkiicore properly (Karl Williamson, 2018-06-25, 1 file, -1/+1)
| This property is not normally compiled by perl, but an installation may choose to use it. It was failing some tests because this is a special property that is like a perl dual-var. It is both binary, and non-binary, and commit 346f9bfbe12 forgot that.
* regen/mk_invlists.pl: Fix-ups for early Unicode versions (Karl Williamson, 2018-06-25, 1 file, -1/+1)
| In some of these, certain properties aren't defined yet, so have no entries. Just add a check for that, and compensate.