path: root/utfebcdic.h
Commit message | Author | Age | Files | Lines
* utf8n_to_uvchr() Properly test for extended UTF-8 | Karl Williamson | 2017-07-12 | 1 | -0/+2
| It somehow dawned on me that the code is incorrect for warning/disallowing very high code points. What is really wanted in the API is to catch UTF-8 that is not necessarily portable. There are several classes of this, but I'm referring here to just the code points that are above the Unicode-defined maximum of 0x10FFFF. These can be considered non-portable, and there is a mechanism in the API to warn/disallow these.
| However, an earlier standard defined UTF-8 to handle code points up to 2**31-1. Anything above that is using an extension to UTF-8 that has never been officially recognized. Perl does use such an extension, and the API is supposed to have a different mechanism to warn/disallow on this.
| Thus there are two classes of warning/disallowing for above-Unicode code points: one for things that have some non-Unicode official recognition, and the other for things that have never had official recognition.
| UTF-EBCDIC differs somewhat in this, and since Perl 5.24, we have had a Perl extension that allows it to handle any code point that fits in a 64-bit word. This kicks in at code points above 2**30-1, a different threshold from the one at which extended UTF-8 kicks in on ASCII platforms.
| Things are also complicated by the fact that the API has provisions for accepting the overlong UTF-8 malformation. It is possible to use extended UTF-8 to represent code points smaller than 31-bit ones.
| Until this commit, the extended warning/disallowing was based on the resultant code point, and only when that code point did not fit into 31 bits. But what is really wanted is to know whether extended UTF-8 was used to represent a code point, no matter how large the resultant code point is. This differs from the previous definition, but only for EBCDIC platforms, or when the overlong malformation was also present. So it does not affect very many real-world cases. This commit fixes that.
| It turns out that it is easier to tell if something is using extended UTF-8. One just looks at the first byte of a sequence.
| The trailing part of the warning message that gets raised is slightly changed to be clearer. It's not significant enough to affect perldiag.
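The first-byte test described in this commit can be sketched as follows. This is a minimal illustration, not Perl's actual macro: the function names are hypothetical, the ASCII case assumes the usual UTF-8 layout in which sequences for code points up to 2**31-1 never start with a byte above 0xFD, and the EBCDIC case assumes the caller has already mapped the native start byte to its I8 form.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper for ASCII platforms (assumption, not Perl's macro).
 * The 2**31-1 definition of UTF-8 needs at most 6 bytes, and a 6-byte
 * sequence starts with 0xFC or 0xFD, so a start byte of 0xFE or 0xFF can
 * only introduce an extended sequence, whatever code point it decodes to. */
static bool starts_extended_utf8(const uint8_t *s)
{
    return s[0] >= 0xFE;
}

/* Hypothetical helper for UTF-EBCDIC, operating on the I8 form of the
 * start byte.  Per this log, code points up to 2**30-1 fit in sequences
 * starting with 0xFE or lower; the extension begins with 0xFF. */
static bool starts_extended_i8(const uint8_t *i8)
{
    return i8[0] == 0xFF;
}
```

Because the test looks only at the start byte, an overlong sequence that uses the extended form for a small code point is still reported as extended, which is the behavior the commit message asks for.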
* utf8.c: Move some #defines here, the only file that uses them | Karl Williamson | 2017-07-01 | 1 | -7/+0
| These are very specialized #defines to determine if UTF-8 overflows the word size of the platform. I think it's unwise to make them kinda generally available.
* utfebcdic.h: Fix typo in comment | Karl Williamson | 2016-12-19 | 1 | -1/+1
| Spotted by Christian Hansen
* utfebcdic.h: Follow up to adding const qualifiers | Karl Williamson | 2016-12-11 | 1 | -57/+74
| This is a follow-up to commit 9f2eed981068e7abbcc52267863529bc59e9c8c0, which manually added const qualifiers to some generated code in order to avoid some compiler warnings.
| The code changed by the other commit had been hand-edited after being generated to add branch prediction, which would be too hard to program in at this time, so the const additions also had to be hand-edited in. The commit just before this current one changed the generator to add the const, and I then did comparisons by hand to make sure the only differences were the branch predictions.
| In doing so, I found one missing const, plus a bunch of differences in the generated code for EBCDIC 037. We do not currently have a smoker for that system, so the differences could be as a result of a previous error, or they could be the result of the added 'const' causing the macro generator to split things differently. It splits in order to avoid size limits in some preprocessors, and the extra 'const' tokens could have caused it to make its splits differently.
| Since we don't have any smokers for this, and no known actual systems running it, I decided not to bother to hand-edit the output to add branch prediction.
* Fix const correctness in utf8.h | Petr Písař | 2016-12-01 | 1 | -138/+138
| The original code was generated and then hand-tuned. Therefore I edited the code in place instead of fixing the regen/regcharclass.pl generator.
| Signed-off-by: Petr Písař <ppisar@redhat.com>
* Add macro for Unicode Corrigendum #9 strict | Karl Williamson | 2016-09-17 | 1 | -0/+42
| This macro follows Unicode Corrigendum #9 to allow non-character code points. These are still discouraged but not completely forbidden.
| It's best for code that isn't intended to operate on arbitrary other code text to use the original definition, but code that does things, such as source code control, should change to use this definition if it wants to be Unicode-strict.
| Perl can't adopt C9 wholesale, as it might create security holes in existing applications that rely on Perl keeping non-chars out.
* Add macro for determining if UTF-8 is Unicode-strict | Karl Williamson | 2016-09-17 | 1 | -8/+140
|
* isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII | Karl Williamson | 2016-09-17 | 1 | -0/+51
| This changes the macro isUTF8_CHAR to have the same number of code points built-in for EBCDIC as ASCII. This obsoletes the IS_UTF8_CHAR_FAST macro, which is removed.
| Previously, the code generated by regen/regcharclass.pl for ASCII platforms was hand copied into utf8.h, and LIKELY's manually added, then the generating code was commented out. Now this has been done with EBCDIC platforms as well. This makes regenerating regcharclass.h faster.
| The copied macro in utf8.h is moved by this commit to within the main code section for non-EBCDIC compiles, cutting the number of #ifdef's down, and the comments about it are changed somewhat.
* utfebcdic.h: Fix typo in comment | Karl Williamson | 2016-09-17 | 1 | -1/+1
|
* Add #defines for UTF-8 of highest representable code point | Karl Williamson | 2016-08-31 | 1 | -0/+8
| This will allow the next commit to not have to actually try to decode the UTF-8 string in order to see if it overflows the platform.
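One way such a constant can be used, sketched under assumptions: the function and array names below are hypothetical, the byte string is the I8 form of 2**64-1 quoted in the "Extend UTF-EBCDIC to handle up to 2**64-1" entry of this log, and the caller is assumed to have already checked that the input is a complete, well-formed 14-byte extended sequence. Since equal-length sequences sort in code point order, a byte-wise comparison is enough and no decoding is needed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EXT_I8_LEN 14   /* length of an extended (0xFF-led) I8 sequence */

/* I8 form of 2**64-1, as quoted elsewhere in this log. */
static const uint8_t highest_i8[EXT_I8_LEN] = {
    0xFF, 0xAF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF,
    0xBF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF
};

/* Hypothetical check: returns true if the 14-byte I8 sequence at s encodes
 * a value larger than a 64-bit word can hold.  memcmp() compares bytes as
 * unsigned values, which matches the ordering of the encoded code points. */
static bool i8_overflows_u64(const uint8_t *s)
{
    return memcmp(s, highest_i8, EXT_I8_LEN) > 0;
}
```

The actual #defines in utf8.h and utfebcdic.h may be laid out differently; this only illustrates why having the highest sequence available as a constant avoids a decode.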
* utf8.h, utfebcdic.h: Add comments, align white space | Karl Williamson | 2016-08-31 | 1 | -1/+29
|
* utf8.h, utfebcdic.h: Add #define | Karl Williamson | 2015-12-09 | 1 | -0/+9
| for future use
* utf8.h, utfebcdic.h: Comments, white-space only | Karl Williamson | 2015-12-06 | 1 | -1/+6
|
* utf8.h: Combine EBCDIC and ASCII macros | Karl Williamson | 2015-12-05 | 1 | -8/+0
| Previous commits have set things up so the macros are the same on both platforms. By moving them to the common part of utf8.h, they can share the same definition. The difference listing shows instead other things being moved due to the size of this move in comparison with those things that really stayed the same.
* utf8.h: Combine EBCDIC and ASCII macros | Karl Williamson | 2015-12-05 | 1 | -5/+0
| The previous commits have made these macros be the exact same text, so they can be combined and defined just once. This requires moving them to the portion of the file that is common to both EBCDIC and ASCII. The commit diff shows instead other code being moved.
* utf8.h: Combine EBCDIC and ASCII #defines | Karl Williamson | 2015-12-05 | 1 | -2/+0
| Change to use the same definition for two macros on both types of platforms, simplifying the code, by using the underlying structure of the encoding.
* utf8.h, et al.: Clean up some casts | Karl Williamson | 2015-12-05 | 1 | -2/+2
| By making sure the no-op macros cast the output appropriately, we can eliminate the casts that have been added in things that call them.
* utf8.h: Combine ASCII and EBCDIC defines into one | Karl Williamson | 2015-12-05 | 1 | -1/+0
| By using a more fundamental value, these two definitions of the macro can be made the same, so only one is needed, common to both platforms.
* utfebcdic.h: Use an internal macro to avoid repeating | Karl Williamson | 2015-12-05 | 1 | -15/+12
| This creates a macro that is used in portions of 2 other macros, thus removing repetition.
* utf8.h, utfebcdic.h: Fix-up UTF8_MAXBYTES_CASE defn | Karl Williamson | 2015-12-05 | 1 | -6/+0
| The definition had gotten moved away from its comments in utf8.h, and the wrong thing (UTF8_MAXBYTES instead) was being guarded by a #error. And it is possible to generalize to get the compiler to do the calculation, and to consolidate the definitions from the two files into a single one.
* Extend UTF-EBCDIC to handle up to 2**64-1 | Karl Williamson | 2015-11-25 | 1 | -17/+27
| This uses for UTF-EBCDIC essentially the same mechanism that Perl already uses for UTF-8 on ASCII platforms to extend it beyond what might be its natural maximum. That is, when the UTF-8 start byte is 0xFF, it adds a bunch more bytes to the character than it otherwise would, bringing it to a total of 14 for UTF-EBCDIC. This is enough to handle any code point that fits in a 64-bit word.
| The downside of this is that this extension is not compatible with previous perls for the range 2**30 up through the previous max, 2**31 - 1. A simple program could be written to convert files that were written out using an older perl so that they can be read with newer perls, and the perldelta says we will do this should anyone ask. However, I strongly suspect that the number of such files in existence is zero, as people in EBCDIC land don't seem to use Unicode much, and these are very large code points, which are associated with a portability warning every time they are output in some way.
| This extension brings UTF-EBCDIC to parity with UTF-8, so that both can cover a 64-bit word. It allows some removal of special cases for EBCDIC in core code and core tests. And it is a necessary step to handle Perl 6's NFG, which I'd like eventually to bring to Perl 5.
| This commit causes two implementations of a macro in utf8.h and utfebcdic.h to become the same, and both are moved to a single one in the portion of utf8.h common to both.
| To illustrate, the I8 for U+3FFFFFFF (2**30-1) is "\xFE\xBF\xBF\xBF\xBF\xBF\xBF" before and after this commit, but the I8 for the next code point, U+40000000, is now "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xA0\xA0\xA0\xA0\xA0\xA0", and before this commit it was "\xFF\xA0\xA0\xA0\xA0\xA0\xA0". The I8 for 2**64-1 (U+FFFFFFFFFFFFFFFF) is "\xFF\xAF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF\xBF", whereas before this commit it was unrepresentable.
| Commit 7c560c3beefbb9946463c9f7b946a13f02f319d8 said in its message that it was moving something that hadn't been needed on EBCDIC until the "next commit". That statement turned out to be wrong, overtaken by events. This now is the commit it was referring to. I prematurely pushed that commit.
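The 14-byte I8 examples in this entry are consistent with a simple layout: a 0xFF start byte followed by 13 continuation bytes, each carrying 5 payload bits under a 0xA0 marker. The sketch below encodes under that assumption. The function name is hypothetical, and it produces I8 only; mapping I8 to the native EBCDIC bytes needs a further per-code-page table that is not shown.

```c
#include <stdint.h>

#define EXT_I8_LEN 14   /* 0xFF start byte + 13 continuation bytes */

/* Hypothetical encoder for the extended (0xFF-led) I8 form.  Fills the
 * continuation bytes from the end backwards, 5 bits at a time; 13 * 5 = 65
 * bits of payload, so any 64-bit value fits. */
static void encode_extended_i8(uint64_t cp, uint8_t out[EXT_I8_LEN])
{
    out[0] = 0xFF;
    for (int i = EXT_I8_LEN - 1; i >= 1; i--) {
        out[i] = (uint8_t)(0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
}
```

Encoding 0x40000000 and 2**64-1 with this function reproduces the two 14-byte strings quoted in the commit message.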
* utf8.h, utfebcdic.h: Use mnemonic constant | Karl Williamson | 2015-11-09 | 1 | -14/+15
| The magic number 13 is used in various places on ASCII platforms, and 7 correspondingly on EBCDIC. This moves the #defines for what these represent to early in their files, and uses the symbolic name thereafter.
* Change meaning of UNI_IS_INVARIANT on EBCDIC platforms | Karl Williamson | 2015-09-18 | 1 | -2/+1
| This should make more CPAN and other code work without change. Usually, unwittingly, code that says UNI_IS_INVARIANT means to use the native platform code values for code points below 256, so acquiesce to the expected meaning and make the macro correspond. Since the native values on ASCII machines are the same as Unicode, this change doesn't affect code running on them.
| A new macro, OFFUNI_IS_INVARIANT, is created for those few places that really do want a Unicode value. There are just a few places in the Perl core like that, which this commit changes.
* Fix potential flaw in 2 EBCDIC macros. | Karl Williamson | 2015-09-04 | 1 | -2/+2
| It occurred to me in code reading that it was possible for these macros to not give the correct result if passed a signed argument.
| An earlier version of this commit was buggy. Thanks to Yaroslav Kuzmin for spotting that.
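The log does not name the two macros or show the fix, so the following is only a generic illustration of the kind of flaw described: a range test on a byte value goes wrong when the byte arrives as a negative signed value. Both macro names and the 0xA0 threshold are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical macros, not the ones changed by the commit.
 * The naive form compares the argument as given; a high byte such as 0xFF
 * that was read through a signed char arrives as -1 and wrongly passes. */
#define IS_BELOW_A0_NAIVE(c)  ((c) < 0xA0)

/* The safer form forces the argument into the 0..255 range first. */
#define IS_BELOW_A0_SAFE(c)   (((unsigned char) (c)) < 0xA0)

static bool naive_check(int c) { return IS_BELOW_A0_NAIVE(c); }
static bool safe_check(int c)  { return IS_BELOW_A0_SAFE(c); }
```

With an argument of -1 (the byte 0xFF seen as signed), the naive macro reports true while the cast version reports false.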
* utf8.h, utfebcdic.h: Add some assertions | Karl Williamson | 2015-09-04 | 1 | -4/+6
| These will detect an array bounds error that occurs on EBCDIC machines, and by including the assert on non-EBCDIC, we verify that the code wouldn't fail when built on EBCDIC.
* Change EBCDIC macro definition | Karl Williamson | 2015-09-04 | 1 | -0/+3
| This changes the definition of isUTF8_POSSIBLY_PROBLEMATIC() on EBCDIC platforms to use PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definition | Karl Williamson | 2015-09-04 | 1 | -0/+6
| Prior to this commit UVCHR_SKIP() was defined the same in both ASCII and EBCDIC, but they expanded to different things. Now they are defined separately, each as what it expands to, and the EBCDIC version, when fully expanded, uses PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change EBCDIC macro definition | Karl Williamson | 2015-09-04 | 1 | -0/+8
| Prior to this commit UVCHR_IS_INVARIANT() was defined the same in both ASCII and EBCDIC, but they expanded to different things. Now they are defined separately, each as what it expands to, and the EBCDIC version, when fully expanded, uses PL_charclass[] instead of PL_e2a[]. The new array is more likely to be in the memory cache.
* Change some UTF-EBCDIC macro handling defns | Karl Williamson | 2015-09-04 | 1 | -14/+19
| This commit changes the definitions of some macros for UTF-8 handling on EBCDIC platforms. The previous definitions transformed the bytes into I8 and did tests on the transformed values. The change is to use previously unused bits in l1_char_class_tab.h so the transform isn't needed, and generally only one branch is. These macros are called from the inner loops of, for example, regex backtracking.
* utfebcdic.h: Clarify comment | Karl Williamson | 2015-09-02 | 1 | -4/+6
|
* utf8.h, utfebcdic.h: Add comments; white-space only | Karl Williamson | 2015-08-01 | 1 | -6/+7
|
* utfebcdic.h: Comments only | Karl Williamson | 2015-08-01 | 1 | -2/+3
|
* utfebcdic.h: Remove comments | Karl Williamson | 2015-04-06 | 1 | -4/+1
| One is false, and one is addressed now in the perlebcdic.pod
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
| An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN.
| Committer's work:
| To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools.
| perldelta entry for module updates.
| Add two Emacs control files to MANIFEST; re-sort MANIFEST.
| For: RT #124119.
* utfebcdic.h: Add comment | Karl Williamson | 2015-03-05 | 1 | -0/+14
|
* utfebcdic.h: Add comments | Karl Williamson | 2014-05-31 | 1 | -0/+2
|
* Fix definition of toCTRL() for EBCDIC | Karl Williamson | 2014-05-31 | 1 | -0/+4
| The definition was incorrect. When going from control to printable name, we need to go from Latin1 -> Native, so that e.g., a 65 gets turned into the native 'A'
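A rough sketch of the idea, under loud assumptions: the control/printable pairing is done in Latin1 numbering by flipping bit 64, and the result is then mapped back to the native character set. Every name here is hypothetical, the two conversion helpers are identity stand-ins that are only right on an ASCII platform, and the real EBCDIC definition of toCTRL() may be structured differently.

```c
#include <ctype.h>

/* Stand-ins for the character set conversions (assumption: on an ASCII
 * platform native and Latin1 coincide; EBCDIC would use lookup tables). */
static unsigned char native_to_latin1(unsigned char c) { return c; }
static unsigned char latin1_to_native(unsigned char c) { return c; }

/* Hypothetical toCTRL-like helper: maps 'A' to control-A (1) and back.
 * The XOR is done on the Latin1 value, and the result is converted to
 * native so that 65 comes back as the native 'A'. */
static unsigned char to_ctrl(unsigned char c)
{
    unsigned char latin1 = native_to_latin1((unsigned char) toupper(c));
    return latin1_to_native((unsigned char) (latin1 ^ 64));
}
```

On an ASCII platform, `to_ctrl('A')` gives 1, `to_ctrl(1)` gives 'A', and `to_ctrl('?')` gives 127 (DEL).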
* Make many EBCDIC tables generated instead of hand-coded | Karl Williamson | 2014-05-31 | 1 | -581/+6
| This causes the generated file ebcdic_tables.h to be #included by utfebcdic.h instead of the hand-coded tables that were formerly there. This makes it much easier to add or remove support for EBCDIC code pages.
| The UTF-EBCDIC-related tables for 037 and POSIX-BC are somewhat modified from what they were before. They were changed by hand minimally a long time ago to prevent segfaults, but in so doing, they lost an important sorting characteristic of UTF-EBCDIC. The machine-generated versions retain the sorting, while also avoiding the segfaults. utfebcdic.h has more detail about this, regarding tr16.
* utfebcdic.h: Comment changes only | Karl Williamson | 2014-05-30 | 1 | -26/+45
| Clarifications and typo fix.
* utf8.h, utfebcdic.h: Add #define | Karl Williamson | 2013-08-29 | 1 | -0/+2
|
* utfebcdic.h: Change 'unsigned char' to U8 | Karl Williamson | 2013-08-29 | 1 | -35/+35
| This is for consistency with the rest of Perl
* utfebcdic.h: Add (UV) cast | Karl Williamson | 2013-08-29 | 1 | -1/+1
| The operand of this macro is implicitly a UV. Make sure that it is.
* utfebcdic.h: Add comment | Karl Williamson | 2013-08-29 | 1 | -0/+6
|
* utf8.h: Clean up and use START_MARK definition | Karl Williamson | 2013-08-29 | 1 | -1/+3
| The previous definition broke good encapsulation rules. UTF_START_MARK should return something that fits in a byte; it shouldn't be the caller that does this. So the mask is moved into the definition. This means it can apply only to the portion that creates something larger than a byte. Further, the EBCDIC version can be simplified, since 7 is the largest possible number of bytes in an EBCDIC UTF8 character.
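The encapsulation point can be illustrated with a small function that computes the start-byte marker for a sequence of a given length and applies the 0xFF mask itself, so callers never see a value wider than a byte. This is a sketch assuming the ASCII-platform UTF-8 layout (one leading 1 bit per byte in the sequence, then a 0 bit); the function name is hypothetical and the actual macro text may differ.

```c
#include <stdint.h>

/* Hypothetical equivalent of a start-mark macro for sequences of
 * len >= 2 bytes.  Shifting 0xFE left produces the run of leading 1 bits;
 * the shift overflows a byte, so the mask is applied here rather than by
 * the caller.  Lengths above 7 have no room for a 0 bit and use 0xFF. */
static uint8_t utf8_start_mark(unsigned len)
{
    if (len > 7)
        return 0xFF;
    return (uint8_t) (0xFF & (0xFE << (7 - len)));
}
```

For example, a 2-byte sequence gets 0xC0, a 3-byte sequence 0xE0, and a 7-byte sequence 0xFE.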
* utfebcdic.h: Remove extra parameter expansions | John Goodyear | 2013-08-29 | 1 | -2/+2
| These two macros were improperly expanding the parameters as well as defining the operation, leading to compile errors.
* Add macro OFFUNISKIP | Karl Williamson | 2013-08-29 | 1 | -1/+2
| This means use official Unicode code point numbering, not native. Doing this converts the existing UNISKIP calls in the code to refer to native code points, which is what they meant anyway. The terminology is somewhat ambiguous, but I don't think it will cause real confusion. NATIVE_SKIP is also introduced for situations where it is important to be precise.
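A "skip" macro answers how many bytes the encoding of a code point occupies. The sketch below shows that calculation for Unicode code point numbering on an ASCII platform, where native and Unicode numbering coincide. The function name is hypothetical; the thresholds follow the standard UTF-8 bit capacities (11, 16, 21, 26, 31 bits), and the 7-byte and 13-byte cases assume Perl's extended form as described elsewhere in this log (13 is the ASCII-platform maximum it mentions).

```c
#include <stdint.h>

/* Hypothetical OFFUNISKIP-like function for ASCII platforms: number of
 * bytes needed to encode the Unicode code point cp. */
static unsigned offuni_skip(uint64_t cp)
{
    if (cp < 0x80)            return 1;   /*  7 bits */
    if (cp < 0x800)           return 2;   /* 11 bits */
    if (cp < 0x10000)         return 3;   /* 16 bits */
    if (cp < 0x200000)        return 4;   /* 21 bits */
    if (cp < 0x4000000)       return 5;   /* 26 bits */
    if (cp < 0x80000000ULL)   return 6;   /* 31 bits */
    if (cp < 0x1000000000ULL) return 7;   /* 36 bits, assumed 0xFE form */
    return 13;                            /* assumed 0xFF extended form */
}
```

On EBCDIC the native variant would first have to account for the platform's code page, which is why the commit distinguishes Unicode-numbered and native-numbered versions.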
* Make casing tables native | Karl Williamson | 2013-08-29 | 1 | -10/+162
| These are final tables that haven't been converted to native character set casing.
* utfebcdic.h: Remove trailing spaces | Karl Williamson | 2013-08-29 | 1 | -4/+4
|
* Deprecate NATIVE_TO_NEED and ASCII_TO_NEED | Karl Williamson | 2013-08-29 | 1 | -8/+0
| These macros are no longer called in the Perl core. This commit turns them into functions so that they can use gcc's deprecation facility.
| I believe these were defective right from the beginning, and I have struggled to understand what's going on. From the name, it appears NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the appropriate parameter indicates that. But that is impossible to do correctly from that API, as for variant characters, it needs to return two bytes. It could only work correctly if ch is an I8 byte, which isn't native, and hence the name would be wrong.
| Similar arguments for ASCII_TO_NEED.
| The function S_append_utf8_from_native_byte(const U8 byte, U8** dest) does what I think NATIVE_TO_NEED intended.
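The two-byte problem is easy to see in a sketch. The function below is a hypothetical stand-in written for an ASCII platform, not the body of S_append_utf8_from_native_byte: a byte below 0x80 is invariant and is copied as is, while any other byte expands to a two-byte UTF-8 sequence, which is why an API that returns a single byte cannot do the job.

```c
#include <stdint.h>

/* Hypothetical ASCII-platform sketch: append the UTF-8 encoding of one
 * byte (treated as a code point 0..255) at *dest and advance *dest past
 * what was written.  The caller must provide room for two bytes. */
static void append_utf8_from_byte(uint8_t byte, uint8_t **dest)
{
    if (byte < 0x80) {
        /* Invariant: same representation in UTF-8. */
        *(*dest)++ = byte;
    } else {
        /* Variant: start byte carries the top 2 bits, continuation
         * byte carries the low 6 bits. */
        *(*dest)++ = (uint8_t) (0xC0 | (byte >> 6));
        *(*dest)++ = (uint8_t) (0x80 | (byte & 0x3F));
    }
}
```

Passing the destination as a pointer to a pointer lets the function report how many bytes it wrote simply by how far it advanced the cursor.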
* Use new clearer named #defines | Karl Williamson | 2013-08-29 | 1 | -10/+17
| This converts several areas of code to use the more clearly named macros introduced in the previous commit