path: root/utf8.c
Commit message | Author | Age | Files | Lines
* IDStart and IDCont no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -2/+11
| | | | | | | These are the base names for various macros used in parsing identifiers. Prior to this patch, parsing a code point above Latin1 caused loading disk files. This patch causes all the information to be compiled into the Perl binary.
* isWORDCHAR_uni(), isDIGIT_utf8() etc no longer go out to disk | Karl Williamson | 2014-01-09 | 1 | -11/+22
| | | | | | | Previous commits in this series have caused all the POSIX classes to be completely specified at C compile time. This allows us to revise the base function used by all these macros to use these definitions, avoiding reading them in from disk.
* utf8.c: Add comment | Karl Williamson | 2014-01-09 | 1 | -1/+3
* utf8.c: Move a bunch of deprecated fcns to mathoms.c | Karl Williamson | 2014-01-05 | 1 | -400/+0
| | | | | These functions will be out of the way in mathoms. There were a few that could not be moved, as-is, so I left them.
* utf8.c: Use existing macros instead of duplicate code | Karl Williamson | 2014-01-05 | 1 | -101/+38
| | | | | | In all these cases, there is an already existing macro that does exactly the same thing as the code that this commit replaces. No sense duplicating logic.
* Change some warnings in utf8n_to_uvchr() | Karl Williamson | 2014-01-01 | 1 | -26/+26
| | | | | | | | | | | | | | | | This bottom level function decodes the first character of a UTF-8 string into a code point. Using it directly is discouraged. This commit cleans up some of the warnings it can raise. Now, tests for malformations are done before any tests for other potential issues. One of those issues involves code points so large that they have never appeared in any official standard (the current standard has scaled back the highest acceptable code point from earlier versions). It is possible (though not done in CPAN) to warn and/or forbid these code points, while accepting smaller code points that are still above the legal Unicode maximum. The warning message for this now includes the code point if representable on the machine. Previously it always displayed raw bytes, which is what it still does for non-representable code points.
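The ordering described above (malformation tests first, the above-Unicode test afterwards) can be illustrated with a small standalone decoder. This is only a sketch, not Perl's utf8n_to_uvchr(): the names decode_first and decode_status are hypothetical, it handles only 1 to 4 byte sequences on an ASCII platform, and it does not check for surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch; these are not Perl's flags. */
enum decode_status { DECODE_OK, DECODE_MALFORMED, DECODE_ABOVE_UNICODE };

/* Decode the first character of s (len bytes available) into *cp.
 * All malformation tests run before the range test. */
static enum decode_status
decode_first(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t v;

    if (len == 0)
        return DECODE_MALFORMED;

    if (s[0] < 0x80)                { need = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else
        return DECODE_MALFORMED;    /* stray continuation or unsupported start byte */

    if (len < need)
        return DECODE_MALFORMED;    /* sequence is truncated */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return DECODE_MALFORMED;    /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    /* Overlong forms: the value would have fit in a shorter sequence. */
    if ((need == 2 && v < 0x80) || (need == 3 && v < 0x800)
        || (need == 4 && v < 0x10000))
        return DECODE_MALFORMED;

    /* Only now test the other issue: a well-formed but too-large value.
     * The code point is still reported so a caller can include it in a message. */
    *cp = v;
    return v > 0x10FFFF ? DECODE_ABOVE_UNICODE : DECODE_OK;
}
```

A caller would treat DECODE_ABOVE_UNICODE as "accepted, but possibly worth a warning", which mirrors the distinction the commit message draws between malformations and over-large code points.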
* utf8.c: Fix warning category and subcategory conflicts | Karl Williamson | 2014-01-01 | 1 | -6/+6
| | | | | | | | | | | | | | | | The warning categories non_unicode, nonchar, and surrogate are all subcategories of 'utf8'. One should never call packWARN() with both a category and one of its subcategories, because then the subcategory cannot be made completely independent. For example, under "use warnings 'utf8'; no warnings 'surrogate';", surrogate warnings will still be output if they are tested with ckWARN2(WARN_UTF8, WARN_SURROGATE). utf8.c was guilty of this.
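The conflict can be modelled with plain bit flags. The following is a toy model only: the names W_UTF8, ck_warn, ck_warn2 and the helper functions are hypothetical stand-ins, and Perl's real warning bitmasks are more involved. It assumes that enabling 'utf8' also enables its subcategories and that a two-category check succeeds when either category is enabled.

```c
#include <stdbool.h>

/* Hypothetical bit flags standing in for the warning categories. */
enum { W_UTF8 = 1, W_SURROGATE = 2, W_NONCHAR = 4, W_NON_UNICODE = 8 };

/* Models "use warnings 'utf8'": the category and all its subcategories go on. */
static unsigned use_utf8(unsigned enabled)
{
    return enabled | W_UTF8 | W_SURROGATE | W_NONCHAR | W_NON_UNICODE;
}

/* Models "no warnings 'surrogate'": only the subcategory goes off. */
static unsigned no_surrogate(unsigned enabled)
{
    return enabled & ~(unsigned)W_SURROGATE;
}

/* Single-category test. */
static bool ck_warn(unsigned enabled, unsigned cat)
{
    return (enabled & cat) != 0;
}

/* Two-category test: true if either category is enabled. */
static bool ck_warn2(unsigned enabled, unsigned a, unsigned b)
{
    return (enabled & (a | b)) != 0;
}
```

In this model, after use_utf8() followed by no_surrogate(), ck_warn2(w, W_UTF8, W_SURROGATE) is still true because the parent bit is set, whereas ck_warn(w, W_SURROGATE) is false, which is the behaviour the user asked for.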
* utf8.c: Don't do redundant test | Karl Williamson | 2014-01-01 | 1 | -1/+1
| | | | | The test here for WARN_UTF8 is redundant, as only if one of the other three warning categories is enabled will anything actually be output.
* utf8.c: Typo in comment, and clarification | Karl Williamson | 2014-01-01 | 1 | -1/+1
* Remove no-longer used inversion list function | Karl Williamson | 2013-12-31 | 1 | -1/+1
| | | | | | | | | | | The function _invlist_invert_prop() is hereby removed. The recent changes to allow \p{} to match above-Unicode mean that no special handling of properties need be done when inverting. This function was accessible to XS code that cheated by using #defines to pretend it was something it wasn't, but it also has been marked as subject to change since its inception, and never appeared in any documentation.
* White-space only | Karl Williamson | 2013-12-31 | 1 | -28/+28
| | | | | This indents various newly-formed blocks (by the previous commit) in these three files, and reflows lines to fit into 79 columns
* Change format of mktables output binary property tables | Karl Williamson | 2013-12-31 | 1 | -0/+28
| | | | | | | | | mktables now outputs the tables for binary properties as inversion lists, with a size as the first element. This means simpler handling of these tables in the core, including removal of an entire pass over them (it was done just to get the size). These tables are marked as for internal use by the Perl core only, so their format is changeable at will.
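An inversion list with a leading size element can be searched without a separate counting pass. The sketch below assumes a layout that the commit message does not spell out: table[0] holds the number of elements that follow, and those elements are ascending code points at which membership toggles, starting with "in the set". The function name invlist_contains is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* table[0]    = count of elements that follow (assumed layout)
 * table[1..n] = ascending code points where membership flips;
 *               the first one starts a range that IS in the set. */
static bool invlist_contains(const unsigned long *table, unsigned long cp)
{
    const size_t n = (size_t)table[0];      /* size is read, not computed */
    const unsigned long *list = table + 1;
    size_t lo = 0, hi = n;

    /* Binary search for the number of elements that are <= cp. */
    while (lo < hi) {
        const size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* An odd count means cp lies inside a range that is in the set. */
    return (lo & 1) != 0;
}
```

For instance, the table { 4, 0x30, 0x3A, 0x660, 0x66A } would describe the two ranges 0x30..0x39 and 0x660..0x669.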
* perlapi: Consistent spaces after dots | Father Chrysostomos | 2013-12-29 | 1 | -9/+12
| | | | plus some typo fixes. I probably changed some things in perlintern, too.
* utf8.c: White-space only | Karl Williamson | 2013-12-06 | 1 | -3/+4
| | | | Rearrange this multi-line conditional to be easier to read.
* perlapi: Grammar nits | Karl Williamson | 2013-12-06 | 1 | -6/+6
| | | | | "The" referring to a parameter here does not look right to me, a native English speaker.
* utf8.c: Remove hard-coded names. | Karl Williamson | 2013-12-06 | 1 | -8/+21
| | | | | | | The names of these hashes stored in some disk files are retrievable by a standardized lookup. There is no need to have them hard-coded in C code. This is one less opportunity for the file and the code to get out of sync.
* perlapi: Nits | Karl Williamson | 2013-12-04 | 1 | -2/+2
* utf8.c: Use U8 instead of UV in several places | Karl Williamson | 2013-12-03 | 1 | -4/+4
| | | | | | These temporaries are all known to fit into 8 bits; by using a U8 it should be more obvious to an optimizing compiler, and so the bounds checking need not be done.
* fix -Wsign-compare in core | David Mitchell | 2013-11-29 | 1 | -4/+8
| | | | | | | | | | | | | There were a few places that were doing unsigned_var = cond ? signed_val : unsigned_val; or similar. Fixed by suitable casts etc. The four in utf8.c were fixed by assigning to an intermediate unsigned var; this has the happy side-effect of collapsing a large macro expansion, where toUPPER_LC() etc evaluate their arg multiple times.
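The "intermediate unsigned var" fix can be sketched as follows. This is an illustration under assumptions, not the code from utf8.c: the function name upper_if is hypothetical, and the standard toupper() stands in for Perl's toUPPER_LC().

```c
#include <ctype.h>

/* Return the uppercase of cp when want_upper is set, otherwise cp itself.
 *
 * Writing  "want_upper ? toupper((int) cp) : cp"  would mix an int and an
 * unsigned long in one conditional expression, which is the pattern that
 * -Wsign-compare flags.  Assigning the int result to an unsigned temporary
 * first keeps both arms of the conditional unsigned, and the case-mapping
 * call is written (and evaluated) exactly once. */
static unsigned long upper_if(int want_upper, unsigned long cp)
{
    /* toupper() is only defined for unsigned char values (and EOF). */
    if (cp > 255)
        return cp;

    const unsigned long upper = (unsigned long) toupper((int) cp);
    return want_upper ? upper : cp;
}
```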
* utf8.c: White-space only | Karl Williamson | 2013-10-16 | 1 | -9/+9
| | | | | This outdents code to the proper level given that the surrounding block has been removed.
* Change mktables output for some tables to use hex | Karl Williamson | 2013-10-16 | 1 | -12/+1
| | | | | | | | | | | | | | | | | | | This makes all the tables in the lib/unicore/To directory that map from code point to code point be formatted so that the mapped-to code point is expressed as hexadecimal. This allows for uniform treatment of these tables in utf8.c, and removes the final use of strtol() in the (non-CPAN) core. strtol() should be avoided because it is subject to locale rules, and some older libc implementations have been buggy. It was used because Perl doesn't have an efficient way of parsing a decimal number and advancing the parse pointer to beyond it; we do have such a method for hex numbers. The input to mktables published by Unicode is also in hex, so this now conforms to that convention. This also will facilitate the new work currently being done to read in the tables that find the closing bracket given an opening one.
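Parsing a hex number while advancing the parse pointer, with no locale dependence, might look like the sketch below. It is not Perl's own hex parser; the function name parse_hex_advance is hypothetical, and overflow checking is omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Parse the hex digits at *pp, advance *pp to the first non-hex character,
 * and return the value.  If there are no digits, *pp is left where it was
 * and 0 is returned.  No locale is consulted and overflow is not detected. */
static unsigned long parse_hex_advance(const char **pp)
{
    assert(pp != NULL && *pp != NULL);

    const char *p = *pp;
    unsigned long value = 0;

    for (;; p++) {
        unsigned digit;

        if (*p >= '0' && *p <= '9')
            digit = (unsigned)(*p - '0');
        else if (*p >= 'a' && *p <= 'f')
            digit = (unsigned)(*p - 'a') + 10;
        else if (*p >= 'A' && *p <= 'F')
            digit = (unsigned)(*p - 'A') + 10;
        else
            break;                      /* first character past the number */

        value = (value << 4) | digit;
    }

    *pp = p;
    return value;
}
```

Reading a "from TAB to" mapping line then amounts to two calls with a single pointer increment in between to skip the separator.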
* utf8.c: Silence Win32 compiler warnings | Karl Williamson | 2013-09-30 | 1 | -8/+8
| | | | | The Win32 compiler doesn't realize that the values in these places can be a max of 255. Other compilers don't warn.
* Removed an ifdef for IS_UTF8_CHAR in utf8.c | Brian Fraser | 2013-09-21 | 1 | -2/+0
| | | | | | IS_UTF8_CHAR is defined by utf8.h, so this is always defined. In fact, later in utf8.c we use it again, this time without the ifdef.
* The choice of 7 or 13 byte extended UTF-8 should be based on UVSIZE. | Nicholas Clark | 2013-09-17 | 1 | -2/+2
| | | | Previously it was based on HAS_QUAD, which is not (as) correct.
* perlapi: Typos; clarify comment | Karl Williamson | 2013-09-16 | 1 | -4/+6
* Use separate macros for byte vs uv Unicode | Karl Williamson | 2013-09-10 | 1 | -3/+3
| | | | | | | This removes a macro not yet even in a development release, and splits its calls into two classes: those where the input is a byte; and those where it can be any unsigned integer. The byte implementation avoids a function call on EBCDIC platforms.
* Move functions prematurely placed into mathoms back to utf8.c | Karl Williamson | 2013-09-04 | 1 | -0/+58
| | | | | | | These functions are still called by some CPAN-upstream modules, so can't go into mathoms until those are fixed. There are other changes needed in these modules, so I'm deferring sending patches to their maintainers until I know all the necessary changes.
* perlapi: Remove newly obsolete statement | Karl Williamson | 2013-09-04 | 1 | -2/+1
| | | | | Since commit 010ab96b9b802bbf77168b5af384569e053cdb63, this function is no longer a wrapper, so shouldn't be described as such.
* utf8.c: Add comment | Karl Williamson | 2013-08-29 | 1 | -3/+11
* utf8.c: Add omitted fold cases | Karl Williamson | 2013-08-29 | 1 | -5/+24
| | | | | | | | | | | | | | | | | | | The LATIN SMALL LETTER SHARP S can't fold to 'ss' under /iaa because the definition of /aa prohibits it, but it can fold to two consecutive instances of LATIN SMALL LETTER LONG S. A capital sharp s can do the same, and that was fixed in 1ca267a5, but this one was overlooked then. It turns out that another possibility was also overlooked in 1ca267a5. Both U+FB05 (LATIN SMALL LIGATURE LONG S T) and U+FB06 (LATIN SMALL LIGATURE ST) fold to the string 'st', except under /iaa these folds are prohibited. But U+FB05 and U+FB06 are equivalent to each other under /iaa. This wasn't working until now. This commit changes things so both fold to FB06. This bug would only surface during /iaa matching, and I don't believe there are any current code paths which lead to it, hence no tests are added by this commit. However, a future commit will lead to this bug, and existing tests find it then.
* utf8.c: Move some code around for speed | Karl Williamson | 2013-08-29 | 1 | -5/+7
| | | | | This is a micro optimization. We now check for a common case and return if found, before checking for a relatively uncommon case.
* utf8.c: Remove wrapper functions. | Karl Williamson | 2013-08-29 | 1 | -103/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the Unicode data is stored in native character set order, it is rare to need to work with the Unicode order. Traditionally, the real work was done in functions that worked with the Unicode order, and wrapper functions (or macros) were used to translate to/from native. There are two groups of functions: one that translates from code point to UTF-8, and the other group goes the opposite direction. This commit changes the base function that translates from UTF-8 to code point to output native instead of Unicode. Those extremely rare instances where Unicode output is needed instead will have to hand-wrap calls to this function with a translation macro, as now described in the API pod. Prior to this, it was the other way, the native was wrapped, and the rare, strict Unicode wasn't. This eliminates a layer of function call overhead for a common case. The base function that translates from code point to UTF-8 retains its Unicode input, as that is more natural to process. However, it is de-emphasized in the pod, with the functionality description moved to the pod for a native input wrapper function. And, those wrappers are now macros in all cases; previously there was function call overhead sometimes. (Equivalent exported functions are retained, however, for XS code that uses the Perl_foo() form.) I had hoped to rebase this commit, squashing it with an earlier commit in this series, eliminating the use of a temporary function name change, but the work involved turns out to be large, with no real payoff.
* perlapi vis utf8.c: Nits | Karl Williamson | 2013-08-29 | 1 | -5/+4
* utf8.c: Move 2 functions to earlier in file | Karl Williamson | 2013-08-29 | 1 | -36/+36
| | | | | This moves these two functions to be adjacent to the function they each call, thus keeping like things together.
* utf8.c: Don't use slower general-purpose function | Karl Williamson | 2013-08-29 | 1 | -3/+7
| | | | | | There is a macro that accomplishes the same task for a two byte UTF-8 encoded character, and avoids the overhead of the general purpose function call.
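For a character known to be exactly two bytes of well-formed UTF-8, the decoding reduces to a couple of mask-and-shift operations, which is why a macro can replace the general function. The macro name TWO_BYTE_TO_CP below is hypothetical (it is not the macro used in utf8.c), and the sketch assumes an ASCII platform.

```c
/* Combine the 5 payload bits of the start byte with the 6 payload bits of
 * the continuation byte.  Hypothetical name; ASCII platforms only.  The
 * caller must already know the input is a valid two-byte sequence. */
#define TWO_BYTE_TO_CP(hi, lo) \
    ((unsigned)((((hi) & 0x1Fu) << 6) | ((lo) & 0x3Fu)))

/* Thin helper so the macro can be applied to a buffer. */
static unsigned decode_two(const unsigned char *s)
{
    return TWO_BYTE_TO_CP(s[0], s[1]);
}
```

No loop, length lookup, or malformation check is involved, so the cost is a few arithmetic instructions instead of a function call.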
* utf8.c: Don't do ++ in macro parameter | Karl Williamson | 2013-08-29 | 1 | -2/+3
| | | | | The formal parameter gets evaluated multiple times on an EBCDIC platform, thus incrementing more than the intended once.
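The hazard arises with any macro that mentions its parameter more than once. The sketch below uses a hypothetical macro, IS_CONTINUATION, that does so on every platform, and shows the safe pattern of reading the byte into a temporary and incrementing the pointer in a separate statement.

```c
#include <stddef.h>

/* Hypothetical macro that evaluates its argument twice.  Calling it as
 * IS_CONTINUATION(*p++) would advance p twice per test. */
#define IS_CONTINUATION(c) ((c) >= 0x80 && (c) <= 0xBF)

/* Count UTF-8 continuation bytes in s[0..len-1]. */
static size_t count_continuations(const unsigned char *s, size_t len)
{
    const unsigned char *p = s;
    const unsigned char *end = s + len;
    size_t count = 0;

    while (p < end) {
        const unsigned char c = *p;     /* read the byte exactly once */
        p++;                            /* increment outside the macro */
        if (IS_CONTINUATION(c))
            count++;
    }
    return count;
}
```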
* utf8.c: Use macro instead of duplicating code | Karl Williamson | 2013-08-29 | 1 | -13/+13
| | | | There is a macro that accomplishes this task, and is easier to read.
* utf8.c: Avoid unnecessary UTF-8 conversions | Karl Williamson | 2013-08-29 | 1 | -27/+62
| | | | | | | | | | | | This changes the code so that converting to UTF-8 is avoided unless necessary. For such inputs, the conversion back from UTF-8 is also avoided. The cost of doing this is that the first swatches are combined into one that contains the values for all characters 0-255, instead of having multiple swatches. That means when first calculating the swatch it calculates all 256, instead of 128 (160 on EBCDIC). This also fixes an EBCDIC bug in which characters in this range were being translated twice.
* utf8.c: No need to check for UTF-8 malformations | Karl Williamson | 2013-08-29 | 1 | -5/+3
| | | | | | | | This function assumes that the input is well-formed UTF-8, even though until this commit, the prefatory comments didn't say so. The API does not pass the buffer length, so there is no way it could check for reading off the end of the buffer. One code path already calls valid_utf8_to_uvchr(); this changes the remaining code path to correspond.
* utf8.c: Fix so UTF-16 to UTF-8 conversion works under EBCDIC | Karl Williamson | 2013-08-29 | 1 | -5/+9
* Fix valid_utf8_to_uvchr() for EBCDIC | Karl Williamson | 2013-08-29 | 1 | -2/+6
* Fix EBCDIC bugs in UTF8_ACCUMULATE and utf8.c | Karl Williamson | 2013-08-29 | 1 | -1/+1
* utf8.c: Use more clearly named macro | Karl Williamson | 2013-08-29 | 1 | -1/+1
| | | | | | In the case of invariants these two macros should do the same thing, but it seems to me that the latter name more clearly indicates what is going on.
* Add macro OFFUNISKIP | Karl Williamson | 2013-08-29 | 1 | -3/+3
| | | | | | | | | This means use official Unicode code point numbering, not native. Doing this converts the existing UNISKIP calls in the code to refer to native code points, which is what they meant anyway. The terminology is somewhat ambiguous, but I don't think it will cause real confusion. NATIVE_SKIP is also introduced for situations where it is important to be precise.
* utf8.c: Stop using two functions | Karl Williamson | 2013-08-29 | 1 | -18/+21
| | | | | | | | | | | | | | | | | This is in preparation for deprecating these functions, to force any code that has been using these functions to change. Since the Unicode tables are now stored in native order, these functions should only rarely be needed. However, the functionality of these is needed, and in actuality, on ASCII platforms, the native functions are #defined to these. So what this commit does is rename the functions to something else, and create wrappers with the old names, so that anyone using them will get the deprecation when it actually goes into effect: we are waiting for CPAN files distributed with the core to change before doing the deprecation. According to grep.cpan.me, this should affect fewer than 10 additional CPAN distributions.
* Convert uvuni_to_utf8() to function | Karl Williamson | 2013-08-29 | 1 | -6/+4
| | | | | | | Code should almost never be dealing with non-native code points. This is in preparation for later deprecation when our CPAN modules have been converted away from using it.
* Deprecate utf8_to_uni_buf() | Karl Williamson | 2013-08-29 | 1 | -8/+8
| | | | | | | Now that the tables are stored in native order, there is almost no need for code to be dealing in Unicode order. According to grep.cpan.me, there are no uses of this function in CPAN.
* Deprecate valid_utf8_to_uvuni() | Karl Williamson | 2013-08-29 | 1 | -2/+3
| | | | | | | | | Now that all the tables are stored in native format, there is very little reason to use this function; and those who do need this kind of functionality should be using the bottom level routine, so as to make it clear they are doing nonstandard stuff. According to grep.cpan.me, there are no uses of this function in CPAN.
* utf8.c: Swap which fcn wraps the other | Karl Williamson | 2013-08-29 | 1 | -27/+25
| | | | This is in preparation for the current wrapee becoming deprecated
* utf8.c: Skip a no-op | Karl Williamson | 2013-08-29 | 1 | -1/+1
| | | | | Since the value is invariant under both UTF-8 and not, we already have it in 'uv'; no need to do anything else to get it