path: root/utf8.h
Commit messages (author, date; files changed, lines -deleted/+added)
* perlapi: Add intro text to Unicode section (Karl Williamson, 2015-05-07; 1 file, -0/+6)
* perlapi: Document some functions (Karl Williamson, 2015-05-07; 1 file, -2/+30)
    These are mentioned in some other pods.  It's best to bring them into
    perlapi, and refer to them from the other pods.
* utf8.h: Add a #define (Karl Williamson, 2015-05-07; 1 file, -2/+3)
    The name UVCHR... parallels the usage of various functions uvchr...
    It's less confusing to keep the same name form for the same type of
    functionality.
* Replace common Emacs file-local variables with dir-locals (Dagfinn Ilmari Mannsåker, 2015-03-22; 1 file, -6/+0)
    An empty cpan/.dir-locals.el stops Emacs using the core defaults for
    code imported from CPAN.

    Committer's work: To keep t/porting/cmp_version.t and
    t/porting/utils.t happy, $VERSION needed to be incremented in many
    files, including throughout dist/PathTools.  perldelta entry for
    module updates.  Add two Emacs control files to MANIFEST; re-sort
    MANIFEST.

    For: RT #124119.
* fix assertions for UTF8_TWO_BYTE_HI/LO (Hugo van der Sanden, 2015-02-12; 1 file, -3/+3)
    Replace the stricter MAX_PORTABLE_UTF8_TWO_BYTE check with a looser
    MAX_UTF8_TWO_BYTE check, else we can't correctly convert code points
    in the range 0x400-0x7ff from UTF-16 to UTF-8 on non-EBCDIC
    platforms.
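    A rough, hedged illustration of the range issue (values and names here
    are sketch-only, not Perl's actual assertion): two UTF-8 bytes carry 11
    payload bits on ASCII platforms, but only 10 in UTF-EBCDIC, so the
    "portable" two-byte maximum is smaller, and asserting against it
    wrongly rejects 0x400-0x7ff inputs coming from UTF-16.

        #include <assert.h>

        #define MAX_UTF8_TWO_BYTE_SKETCH          0x7FF  /* ASCII platforms */
        #define MAX_PORTABLE_UTF8_TWO_BYTE_SKETCH 0x3FF  /* EBCDIC's limit too */

        /* Sketch: emit the two UTF-8 bytes for a code point in the
         * two-byte range on an ASCII platform. */
        static void two_byte_hi_lo(unsigned cp, unsigned char out[2])
        {
            assert(cp >= 0x80);                       /* not an overlong encoding */
            assert(cp <= MAX_UTF8_TWO_BYTE_SKETCH);   /* the looser, correct bound */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
        }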
* foldEQ_utf8(): Add some internal flags (Karl Williamson, 2014-12-29; 1 file, -0/+2)
    The comments explain their purpose.
* Make is_invariant_string() (Karl Williamson, 2014-11-26; 1 file, -1/+14)
    This is a more accurately named synonym for is_ascii_string(), which
    is retained.  The old name is misleading to someone programming for
    non-ASCII platforms.
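    A hedged usage sketch (assuming core or XS code with perl.h included,
    and the same signature as is_ascii_string()): the test answers whether
    the bytes mean the same thing whether or not they are treated as
    UTF-8.

        /* Sketch only: if every byte is UTF-8 invariant, the buffer can be
         * used unchanged with or without the UTF-8 flag. */
        if (is_invariant_string((const U8 *) s, len)) {
            /* no upgrading or downgrading needed */
        }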
* utf8.h: EBCDIC fix (Karl Williamson, 2014-10-21; 1 file, -2/+2)
    These macros are supposed to accommodate inputs larger than a byte.
    Therefore, under EBCDIC, we have to use a different macro which
    handles the larger values.  On ASCII platforms the called macros are
    no-ops, so it doesn't matter there.
* Add and use macros for case-insensitive comparison (Karl Williamson, 2014-08-22; 1 file, -2/+1)
    This adds to handy.h isALPHA_FOLD_EQ(c1,c2), which efficiently tests
    whether c1 and c2 are the same character, case-insensitively.  For
    example, isALPHA_FOLD_EQ(c, 's') returns true if and only if <c> is
    's' or 'S'.  isALPHA_FOLD_NE() is also added by this commit.

    At least one of c1 and c2 must be known to be in [A-Za-z] or this
    macro doesn't work properly.  (There is an assert for this in the
    macro in DEBUGGING builds.)  That is why the name includes "ALPHA",
    so you won't forget when using it.

    This functionality has been in regcomp.c for a while, under a
    different name.  I had thought that the only reason to make it more
    generally available was a potential speed gain, but recent gcc
    versions optimize to the same code, so I thought there wasn't any
    point to doing so.  But I now think that using this makes things
    easier to read (and certainly shorter to type).  Once you grok what
    this macro does, it simplifies what you have to keep in your mind
    when reading logical expressions with multiple operands.  That
    something can be either upper or lower case can be a distraction to
    understanding the larger point of the expression.
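    A usage sketch based on the description above (illustrative only; the
    literal letters satisfy the precondition that at least one operand is
    known to be in [A-Za-z]):

        /* Sketch: match a leading "inf", "INF", "Inf", etc. */
        if (isALPHA_FOLD_EQ(s[0], 'i')
            && isALPHA_FOLD_EQ(s[1], 'n')
            && isALPHA_FOLD_EQ(s[2], 'f'))
        {
            /* the first three characters spell "inf" in some casing */
        }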
* utf8.h: Add comment (Karl Williamson, 2014-07-09; 1 file, -1/+3)
* perlapi: Refactor placements, headings of some functions (Karl Williamson, 2014-06-05; 1 file, -7/+0)
    It is not very user friendly to list functions as "Functions found in
    file FOO".  Better is to group them by purpose, as many already were.
    I went through and placed the ones that weren't already so grouped
    into groups.  Patches welcome if you have a better classification.

    I changed the headings of some so that the important distinction was
    the first word, so that they are placed in the file more
    appropriately.  And for a couple that I had created myself, I came up
    with names that I think are better than the originals.
* Add parameters to "use locale" (Karl Williamson, 2014-06-05; 1 file, -2/+5)
    This commit allows one to enable locale-awareness for only a
    specified subset of the locale categories.  Thus you could make a
    section of code LC_MESSAGES-aware, with no locale-awareness for the
    other categories.
* Fix definition of toCTRL() for EBCDIC (Karl Williamson, 2014-05-31; 1 file, -0/+4)
    The definition was incorrect.  When going from control to printable
    name, we need to go from Latin1 to native, so that, e.g., a 65 gets
    turned into the native 'A'.
* Add some (UN)?LIKELY() to UTF8 handling (Karl Williamson, 2014-05-31; 1 file, -3/+3)
    It's actually very rare for code to be presented with malformed
    UTF-8, so give the compiler a hint about the likely branches.
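    The kind of hint involved, as a hedged sketch (not one of the actual
    sites changed by this commit; assumes s and send delimit the buffer):

        /* Sketch: malformed input is the cold path, so mark it unlikely. */
        if (UNLIKELY(! isUTF8_CHAR(s, send))) {
            /* handle the malformation */
        }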
* Make is_utf8_char_buf() a macro (Karl Williamson, 2014-05-31; 1 file, -0/+2)
    This function is now more efficiently implemented as a synonym for
    isUTF8_CHAR().  We retain the Perl_is_utf8_char_buf() function for
    code that calls it that way.
* utf8.h: Use new macro type from previous commit (Karl Williamson, 2014-05-31; 1 file, -35/+25)
    This allows for an efficient isUTF8_CHAR macro, which does its own
    length checking, and uses the UTF8_INVARIANT macro for the first
    byte.  On EBCDIC systems this macro, which does a table lookup, is
    quite a bit more efficient than all the branches that would normally
    have to be done.
* Create isUTF8_CHAR() macro and use it (Karl Williamson, 2014-05-31; 1 file, -13/+39)
    This macro will inline the code to determine if a character is
    well-formed UTF-8 for code points below a certain value, falling back
    to a slower function for larger ones.  On ASCII platforms, it will
    inline for well beyond all legal Unicode code points.  On EBCDIC, it
    currently does so for code points up to 0x3FFF.

    This could be increased, but our porting tests do the regen every
    time to make sure everything is ok, and making it larger slows that
    down.  This is worked around on ASCII by normally commenting out the
    code that generates this info, while including in utf8.h a version
    that did get generated.  This is static information and won't change.
    (This could be done for EBCDIC too, but I chose not to at this time,
    as each code page has a different macro generated, and it gets ugly
    getting all of them into utf8.h.)

    Using this macro allowed for simplification of several functions in
    utf8.c.
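    A hedged sketch of typical use (assuming core or XS code with perl.h
    included, and the convention that the macro evaluates to the
    character's length in bytes, or to 0 for a malformation or a
    character that runs past the end of the buffer):

        /* Sketch: walk a buffer one well-formed character at a time. */
        const U8 *p = s;
        while (p < send) {
            STRLEN char_len = isUTF8_CHAR(p, send);
            if (! char_len)
                break;              /* malformed or truncated */
            p += char_len;
        }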
* utf8.h: Move macro within file (Karl Williamson, 2014-05-31; 1 file, -7/+8)
    This places it in a better-situated spot for later commits.
* regen/regcharclass.pl: Update to use EBCDIC utilities (Karl Williamson, 2014-05-31; 1 file, -1/+1)
    This causes the generated regcharclass.h to be valid on all supported
    platforms.
* White-space, comments only (Karl Williamson, 2014-01-27; 1 file, -1/+1)
    This mostly indents and outdents based on blocks added or removed by
    the previous commit.  But there are a few comment changes, vertical
    alignment of macro backslash continuation characters, and other
    white-space changes.
* Rename an internal flag (Karl Williamson, 2014-01-27; 1 file, -1/+1)
    The UTF8 in the name is somewhat misleading, and would be more
    misleading after future commits make UTF-8 locales special.
* Taint more operands with case changes (Karl Williamson, 2014-01-27; 1 file, -5/+4)
    The documentation says that Perl taints certain operations when
    subject to locale rules, such as lc() and ucfirst().  Prior to this
    commit there were exceptions when the operand to these functions
    contained no characters whose case change actually varied depending
    on the locale, for example the empty string or above-Latin1 code
    points.  Changing to conform to the documentation simplifies the core
    code, and yields more consistent results.
* Change some warnings in utf8n_to_uvchr() (Karl Williamson, 2014-01-01; 1 file, -1/+3)
    This bottom-level function decodes the first character of a UTF-8
    string into a code point.  Using it directly is discouraged.

    This commit cleans up some of the warnings it can raise.  Now, tests
    for malformations are done before any tests for other potential
    issues.  One of those issues involves code points so large that they
    have never appeared in any official standard (the current standard
    has scaled back the highest acceptable code point from earlier
    versions).  It is possible (though not done in CPAN) to warn about
    and/or forbid these code points, while accepting smaller code points
    that are still above the legal Unicode maximum.  The warning message
    for this now includes the code point if it is representable on the
    machine.  Previously it always displayed raw bytes, which is what it
    still does for non-representable code points.
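    For orientation, a hedged sketch of a call to this function (flags and
    the specific warning categories are omitted; as noted above, calling
    it directly is discouraged):

        /* Sketch: decode the first character of the buffer s..send-1. */
        STRLEN retlen;
        UV cp = utf8n_to_uvchr(s, send - s, &retlen, 0);
        /* retlen reports how many bytes were consumed; on a malformation
         * the function warns (by default) and returns a substitute value. */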
* Move a macro from utf8.h to handy.h for wider use. (Karl Williamson, 2014-01-01; 1 file, -10/+0)
    Future commits will want this available outside utf8.h.
* utf8.h: Add parameter checking to some macros in DEBUGGING builds (Karl Williamson, 2013-12-06; 1 file, -23/+51)
    This change should catch some wrong calls to these macros.  The meat
    of the macros is extracted out into two internal-only macros, and the
    other macros are rearranged to call these.
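    A hedged sketch of the general pattern (the names are illustrative,
    not the actual internal-only macros, and Perl has its own assertion
    plumbing):

        #include <assert.h>

        /* Sketch: the public macro asserts its precondition only under
         * -DDEBUGGING, then delegates to the unchecked internal macro. */
        #ifdef DEBUGGING
        #  define MY_EIGHT_BIT_MACRO(c)  \
                      (assert((c) <= 0xFF), MY_EIGHT_BIT_MACRO_internal_(c))
        #else
        #  define MY_EIGHT_BIT_MACRO(c)  MY_EIGHT_BIT_MACRO_internal_(c)
        #endif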
* utf8.h: Fix grammar in comment (Karl Williamson, 2013-12-04; 1 file, -2/+2)
* utf8.h: White-space only (Karl Williamson, 2013-09-30; 1 file, -1/+2)
    I believe this makes the macro easier to read.
* The choice of 7 or 13 byte extended UTF-8 should be based on UVSIZE. (Nicholas Clark, 2013-09-17; 1 file, -5/+3)
    Previously it was based on HAS_QUAD, which is not (as) correct.
* Use separate macros for byte vs uv Unicode (Karl Williamson, 2013-09-10; 1 file, -1/+6)
    This removes a macro not yet even in a development release, and
    splits its calls into two classes: those where the input is a byte,
    and those where it can be any unsigned integer.  The byte
    implementation avoids a function call on EBCDIC platforms.
* PATCH: [perl #119601] Bleadperl breaks ETHER/Devel-Declare (Karl Williamson, 2013-09-06; 1 file, -1/+1)
    I will not otherwise mention that stealing .c code from the core is a
    dangerous practice.

    This is actually a bug in the module, which had been masked until
    now.  The first two parameters to utf8_to_uvchr_buf() are both U8*.
    But both 's' and PL_bufend are char*.  The 's' has a cast to U8* in
    the failing line, but not PL_bufend.  Interestingly, the line in the
    official toke.c (introduced in 4b88fb76) has always been right, so
    the stealer didn't copy it correctly.

    What de69f3af3 did was turn this former function call into a macro
    that manipulates the parameters and calls another function, thereby
    removing a layer of function call overhead.  The manipulation
    involves subtracting 's' from PL_bufend, and this fails to compile
    due to the missing cast on the latter parameter.

    The problem goes away if the macro casts both parameters to U8*, and
    that is what this commit does.
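    The shape of the change, as a hedged sketch (not the verbatim utf8.h
    definition): casting both pointer parameters inside the macro lets a
    caller that passes char* arguments, such as the module's stolen code,
    compile cleanly.

        /* Sketch: wrapper macro that does the pointer arithmetic itself,
         * casting both ends of the buffer to U8*. */
        #define utf8_to_uvchr_buf(s, e, lenp)                              \
            utf8n_to_uvchr((const U8 *)(s),                                \
                           (const U8 *)(e) - (const U8 *)(s), (lenp), 0)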
* utf8.h: White space only (Karl Williamson, 2013-08-29; 1 file, -6/+7)
    Vertically align the definitions of a few #defines.
* utf8.h, unicode_constants.h: Add some #defines. (Karl Williamson, 2013-08-29; 1 file, -0/+2)
    These will be used in a future commit.
* utf8.h: Fix UTF8_IS_SUPER defn for EBCDIC (Karl Williamson, 2013-08-29; 1 file, -1/+1)
    The parentheses were misplaced, so it wasn't looking at the second
    byte of the input string properly.
* utf8.c: Remove wrapper functions. (Karl Williamson, 2013-08-29; 1 file, -4/+7)
    Now that the Unicode data is stored in native character set order, it
    is rare to need to work with the Unicode order.  Traditionally, the
    real work was done in functions that worked with the Unicode order,
    and wrapper functions (or macros) were used to translate to/from
    native.

    There are two groups of functions: one translates from code point to
    UTF-8, and the other group goes the opposite direction.

    This commit changes the base function that translates from UTF-8 to
    code point to output native instead of Unicode.  Those extremely rare
    instances where Unicode output is needed instead will have to
    hand-wrap calls to this function with a translation macro, as now
    described in the API pod.  Prior to this it was the other way around:
    the native was wrapped, and the rare, strict Unicode wasn't.  This
    eliminates a layer of function call overhead for a common case.

    The base function that translates from code point to UTF-8 retains
    its Unicode input, as that is more natural to process.  However, it
    is de-emphasized in the pod, with the functionality description moved
    to the pod for a native-input wrapper function.  And those wrappers
    are now macros in all cases; previously there was sometimes function
    call overhead.  (Equivalent exported functions are retained, however,
    for XS code that uses the Perl_foo() form.)

    I had hoped to rebase this commit, squashing it with an earlier
    commit in this series, eliminating the use of a temporary function
    name change, but the work involved turns out to be large, with no
    real payoff.
* utf8.h: Clarify comments (Karl Williamson, 2013-08-29; 1 file, -3/+3)
* utf8.h, utfebcdic.h: Add #define (Karl Williamson, 2013-08-29; 1 file, -0/+2)
* Fix EBCDIC bugs in UTF8_ACCUMULATE and utf8.c (Karl Williamson, 2013-08-29; 1 file, -4/+8)
* utf8.h: Clean up and use START_MARK definition (Karl Williamson, 2013-08-29; 1 file, -3/+3)
    The previous definition broke good encapsulation rules.
    UTF_START_MARK should return something that fits in a byte; it
    shouldn't be the caller's job to make it do so.  So the mask is moved
    into the definition.  This means the mask need apply only to the
    portion that creates something larger than a byte.  Further, the
    EBCDIC version can be simplified, since 7 is the largest possible
    number of bytes in an EBCDIC UTF8 character.
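    For reference, a hedged, standalone sketch of the standard UTF-8
    arithmetic that a start mark encapsulates (ASCII platforms only; the
    actual utf8.h definition differs, and EBCDIC uses other values):

        #include <assert.h>

        /* Sketch: the start byte of an n-byte UTF-8 sequence has its top
         * n bits set, e.g. n=2 -> 0xC0, n=3 -> 0xE0, n=4 -> 0xF0. */
        static unsigned char start_mark(unsigned len)
        {
            assert(len >= 2 && len <= 7);
            return (unsigned char) ((0xFF << (8 - len)) & 0xFF);
        }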
* utf8.h: Move #includes (Karl Williamson, 2013-08-29; 1 file, -3/+3)
    These two files were only being #included for non-EBCDIC compiles;
    they should always be included.
* utf8.h: Simplify UTF8_EIGHT_BIT_foo on EBCDIC (Karl Williamson, 2013-08-29; 1 file, -5/+8)
    These macros were previously defined in terms of UTF8_TWO_BYTE_HI and
    UTF8_TWO_BYTE_LO.  But the EIGHT_BIT versions can use the less
    general and simpler NATIVE_TO_LATIN1 instead of NATIVE_TO_UNI,
    because the input domain is restricted in the EIGHT_BIT case.  Note
    that on ASCII platforms both expand to the same thing, so the
    difference matters only on EBCDIC.
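    The arithmetic underneath, as a hedged standalone sketch for an ASCII
    platform (the real macros additionally route the input through
    NATIVE_TO_LATIN1 so the same source also works on EBCDIC):

        #include <stdint.h>

        /* Sketch: the two UTF-8 bytes encoding a Latin-1 code point in
         * the 0x80..0xFF range. */
        static uint8_t eight_bit_hi(uint8_t c) { return 0xC0 | (c >> 6); }
        static uint8_t eight_bit_lo(uint8_t c) { return 0x80 | (c & 0x3F); }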
* Add macro OFFUNISKIP (Karl Williamson, 2013-08-29; 1 file, -2/+12)
    This means use official Unicode code point numbering, not native.
    Doing this converts the existing UNISKIP calls in the code to refer
    to native code points, which is what they meant anyway.  The
    terminology is somewhat ambiguous, but I don't think it will cause
    real confusion.  NATIVE_SKIP is also introduced for situations where
    it is important to be precise.
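    To make "skip" concrete, a hedged sketch of the quantity these macros
    compute: the number of bytes a code point occupies once encoded
    (ASCII-platform widths shown; EBCDIC widths differ, and Perl's macros
    are not implemented as this chain of comparisons):

        /* Sketch: UTF-8 length of a code point on an ASCII platform. */
        static int utf8_skip_sketch(unsigned long cp)
        {
            return cp < 0x80     ? 1
                 : cp < 0x800    ? 2
                 : cp < 0x10000  ? 3
                 : cp < 0x200000 ? 4
                 : 5;  /* still larger code points need yet more bytes */
        }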
* utf8.c: Stop using two functions (Karl Williamson, 2013-08-29; 1 file, -2/+2)
    This is in preparation for deprecating these functions, to force any
    code that has been using them to change.  Since the Unicode tables
    are now stored in native order, these functions should only rarely be
    needed.  However, their functionality is needed, and in actuality, on
    ASCII platforms, the native functions are #defined to these.

    So what this commit does is rename the functions to something else
    and create wrappers with the old names, so that anyone using them
    will get the deprecation when it actually goes into effect; we are
    waiting for CPAN files distributed with the core to change before
    doing the deprecation.  According to cpan.grep.me, this should affect
    fewer than 10 additional CPAN distributions.
* Convert uvuni_to_utf8() to a function (Karl Williamson, 2013-08-29; 1 file, -2/+1)
    Code should almost never be dealing with non-native code points.
    This is in preparation for later deprecation, once our CPAN modules
    have been converted away from using it.
* utf8.c: Swap which fcn wraps the other (Karl Williamson, 2013-08-29; 1 file, -1/+0)
    This is in preparation for the current wrappee becoming deprecated.
* Deprecate NATIVE_TO_NEED and ASCII_TO_NEED (Karl Williamson, 2013-08-29; 1 file, -3/+0)
    These macros are no longer called in the Perl core.  This commit
    turns them into functions so that they can use gcc's deprecation
    facility.

    I believe these were defective right from the beginning, and I have
    struggled to understand what's going on.  From the name, it appears
    NATIVE_TO_NEED takes a native byte and turns it into UTF-8 if the
    appropriate parameter indicates that.  But that is impossible to do
    correctly from that API, as for variant characters it needs to return
    two bytes.  It could only work correctly if ch is an I8 byte, which
    isn't native, and hence the name would be wrong.  Similar arguments
    apply to ASCII_TO_NEED.

    The function S_append_utf8_from_native_byte(const U8 byte, U8** dest)
    does what I think NATIVE_TO_NEED intended.
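    A hedged sketch of what that correctly-specified helper has to do,
    and why an interface returning a single transformed byte cannot work
    for variant input (ASCII platform shown; the real function also copes
    with EBCDIC):

        /* Sketch: append one native byte to a growing UTF-8 buffer,
         * advancing the destination pointer.  Variant bytes (>= 0x80 on
         * ASCII) expand to two bytes, which is the whole problem with
         * the old one-byte-in, one-byte-out macro API. */
        static void append_utf8_from_native_byte_sketch(const unsigned char byte,
                                                        unsigned char **dest)
        {
            if (byte < 0x80)
                *(*dest)++ = byte;                  /* UTF-8 invariant */
            else {
                *(*dest)++ = 0xC0 | (byte >> 6);    /* start byte */
                *(*dest)++ = 0x80 | (byte & 0x3F);  /* continuation byte */
            }
        }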
* Use real illegal UTF-8 byte (Karl Williamson, 2013-08-29; 1 file, -0/+4)
    The code here was wrong in assuming that \xFF is not legal in UTF-8
    encoded strings.  It currently doesn't work due to a bug, but that
    may eventually be fixed: [perl #116867].  The comments are also wrong
    that all bytes are legal in UTF-EBCDIC.

    It turns out that in well-formed UTF-8, the bytes C0 and C1 never
    appear (C2, C3, and C4 as well in UTF-EBCDIC), as they would be the
    start byte of an illegal overlong sequence.

    This creates a #define for an illegal byte using one of the real
    illegal ones, and changes the code to use that.

    No test is included due to #116867.
* Add and use macro to return EBCDIC (Karl Williamson, 2013-08-29; 1 file, -4/+7)
    The conversion from UTF-8 to code point should generally be to the
    native code point.  This adds a macro to do that, and converts the
    core calls to the existing macro to use the new one instead.  The old
    macro is retained for possible backwards compatibility, though it
    probably should be deprecated.
* utf8.h: Correct macros for EBCDIC (Karl Williamson, 2013-08-29; 1 file, -5/+10)
    These macros were incorrect for EBCDIC.  The 3-step process given in
    utfebcdic.h wasn't being followed.
* Use new clearer named #defines (Karl Williamson, 2013-08-29; 1 file, -5/+5)
    This converts several areas of code to use the more clearly named
    macros introduced in the previous commit.
* utf8.h, utfebcdic.h: Create less confusing #defines (Karl Williamson, 2013-08-29; 1 file, -9/+25)
    This commit creates macros whose names mean something to me, and
    which I don't find confusing.  The older names are retained for
    backwards compatibility.  Future commits will fix bugs I introduced
    by misunderstanding the meaning of the older names.

    The older names are now #defined in terms of the newer ones, and
    moved so that they are defined only once, valid for both ASCII and
    EBCDIC platforms.