path: root/utf8.c
Commit message | Author | Age | Files | Lines
* Deprecate utf8_to_uvchr() and utf8_to_uvuni()Karl Williamson2012-03-191-4/+8
| | | | | | These functions can read beyond the end of their input strings if presented with malformed UTF-8 input. Perl core code has been converted to use other functions instead of these.
* Use the new utf8 to code point functionsKarl Williamson2012-03-191-21/+25
| | | | | These functions should be used in preference to the old ones which can read beyond the end of the input string.
* utf8.c: Add valid_utf8_to_uvuni() and valid_utf8_to_uvchr()Karl Williamson2012-03-191-0/+26
| | | | | | | | These functions are like utf8_to_uvuni() and utf8_to_uvchr(), but their name implies that the input UTF-8 has been validated. They are not currently documented, as it's best for XS writers to call the functions that do validation.
* utf8.c: Add utf8_to_uvchr_buf() and utf8_to_uvuni_buf()Karl Williamson2012-03-191-1/+54
| | | | | | | | The existing functions (utf8_to_uvchr and utf8_to_uvuni) have a deficiency in that they could read beyond the end of the input string if given malformed input. This commit creates two new functions which behave as the old ones did, but have an extra parameter each, which gives the upper limit to the string, so no read beyond it is done.
* utf8.c: pod clarificationKarl Williamson2012-03-191-1/+2
|
* utf8.c: pod (mostly formatting) + comments changesKarl Williamson2012-03-191-78/+90
|
* perl #77654: quotemeta quotes non-ASCII consistentlyKarl Williamson2012-02-151-0/+12
| | | | | | | | | | As described in the pod changes in this commit, this changes quotemeta() to consistently quote non-ASCII characters when used under unicode_strings. The behavior is changed for these and UTF-8 encoded strings to more closely align with Unicode's recommendations. The end result is that we *could* at some future point start using other characters as metacharacters than the 12 we do now.
* is_utf8_char_slow(): Make consistent, correct docs.Karl Williamson2012-02-131-3/+3
| | | | | | | | | | | | | | | This function is only used by the Perl core for very large code points, though it is designed to be able to be used for all code points. For any variant code points, it doesn't succeed unless the passed in length is exactly the same as the number of bytes the code point occupies. The documentation says it succeeds if the length is at least that number. This commit updates the documentation to match the behavior. Also, for an invariant code point, it succeeds no matter what the passed-in length says. This commit changes this to be consistent with the behavior for all other code points.
* Deprecate is_utf8_char()Karl Williamson2012-02-111-3/+7
| | | | | | | This function assumes that there is enough space in the buffer to read however many bytes are indicated by the first byte in the alleged UTF-8 encoded string. This may not be true, and so it can read beyond the buffer end. is_utf8_char_buf() should be used instead.
* Add is_utf8_char_buf()Karl Williamson2012-02-111-8/+46
| | | | | | | | | | This function is to replace is_utf8_char(), and requires an extra parameter to ensure that it doesn't read beyond the end of the buffer. is_utf8_char() and the only place in the Perl core that uses it are converted to use the new function, each assuming that there is enough space. Thanks to Jarkko Hietaniemi for suggesting this function name.
* Unicode::UCD::prop_invmap(): New improved APIKarl Williamson2012-02-101-1/+0
| | | | | | | | | | | | Thanks to Tony Cook for suggesting this. The API is changed from returning deltas of code points to storing the actual correct values, but requiring adjustments for the non-initial elements in a range, as explained in the pod. This makes the data less confusing to look at, and gets rid of inconsistencies that arose because we didn't make the same sort of deltas for entries that were, e.g., arrays of code points.
* regcomp.c: Use compiled-in inversion listsKarl Williamson2012-02-091-1/+14
| | | | | | | | | This uses the compiled-in inversion lists to generate POSIX character classes and things like \v and \s inside bracketed character classes. This paves the way for future optimizations, and fixes the bug (which has no formal bug number) that /[[:ascii:]]/i matched non-ASCII characters, such as the Kelvin sign, unlike /\p{ascii}/i.
* utf8.c: white-space onlyKarl Williamson2012-02-041-9/+9
| | | | This adds an indent now that the code is in a newly created block
* utf8.c: Use the new compact case mapping tablesKarl Williamson2012-02-041-5/+17
| | | | | | | | This changes the Perl core when looking up the upper/lower/title/fold-case of a code point to use the newly created more compact tables. Currently the look-up is done by a linear search, and the new tables are 54-61% of the size of the old ones, so that on average searches are that much shorter
* mktables: Add duplicate tablesKarl Williamson2012-02-041-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | This is for backwards compatibility. Future commits will change these tables that are generated by mktables to be more efficient. But their existence was advertised in v5.12 and v5.14 as something a Perl program could use, because the Perl core did not provide access to their contents. We can't change the format of those without some notice. The solution adopted is to have two versions of the tables: one, kept under the original file name, retains the original format; the other is free to change formats at will. This commit just creates copies of the original, with the same format. Later commits will change the format to be more efficient. We state in v5.16 that using these files is now deprecated, as the information is now available through Unicode::UCD in a stable API. But we don't test for whether someone is opening and reading these files, so the deprecation cycle should be somewhat long; they will be unused, and the only drawbacks to having them are some extra disk space and the time spent in having to generate them at Perl build time. This commit also changes the Perl core to use the original tables, so that the new format can be gradually developed in a series of patches without having to cut over the whole thing at once.
* Provide as much diagnostic information as possible in "panic: ..." messages.Nicholas Clark2012-01-161-6/+16
| | | | | | | | | | | | | | | The convention is that when the interpreter dies with an internal error, the message starts "panic: ". Historically, many panic messages had been terse fixed strings, which means that the out-of-range values that triggered the panic are lost. Now we try to report these values, as such panics may not be repeatable, and the original error message may be the only diagnostic we get when we try to find the cause. We can't report diagnostics when the panic message is generated by something other than croak(), as we don't have *printf-style format strings. Don't attempt to report values in panics related to *printf buffer overflows, as attempting to format the values to strings may repeat or compound the original error.
* utf8.c: fix typo in podKarl Williamson2012-01-131-1/+1
|
* regcomp.c: Optimize a single Unicode property in a [character class]Karl Williamson2012-01-131-5/+5
| | | | | | | | | | | | | | | | | All Unicode properties actually turn into bracketed character classes, whether explicitly done or not. A swash is generated for each property in the class. If that is the only thing not in the class's bitmap, it specifies completely the non-bitmap behavior of the class, and can be passed explicitly to regexec.c. This avoids having to regenerate the swash. It also means that the same swash is used for multiple instances of a property. And that means the number of duplicated data structures is greatly reduced. This currently doesn't extend to cases where multiple Unicode properties are used in the same class: [\p{greek}\p{latin}] will not share the same swash as another character class with the same components. This is because I don't know of an efficient method to determine if a new class being parsed has the same components as one already generated. I suppose some sort of checksum could be generated, but that is for future consideration.
* utf8.c: White-space onlyKarl Williamson2012-01-131-67/+69
| | | | | | As a result of previous commits adding and removing if() {} blocks, indent and outdent and reflow comments and statements to not exceed 80 columns.
* utf8.c: Add ability to pass inversion list to _core_swash_init()Karl Williamson2012-01-131-7/+69
| | | | | | | Add a new parameter to _core_swash_init() that is an inversion list to add to the swash, along with a boolean to indicate if this inversion list is derived from a user-defined property. This capability will prove useful in future commits
* utf8.c: Add flag to swash_init() to not croak on errorKarl Williamson2012-01-131-2/+7
| | | | | | This adds the capability, to be used in future commits, for swash_init() to return NULL instead of croaking if it can't find a property, so that the caller can choose how to handle the situation.
* utf8.c: Prevent reading before buffer startKarl Williamson2012-01-131-1/+3
| | | | | Make sure there is something before the character being read before reading it.
* Utf8.c: Generate and use inversion lists for binary swashesKarl Williamson2012-01-131-3/+32
| Prior to this patch, every time a code point was matched against a swash and the result was not previously known, a linear search through the swash was performed. This patch changes that to generate an inversion list whenever a swash for a binary property is created; a binary search is then performed for missing values.
|
| This change does not have much effect on the speed of Perl's regression test suite, but the speed-up in worst-case scenarios is huge. The program at the end of this commit message is crafted to avoid the caching that hides much of the current inefficiencies. At character classes of 100 isolated code points, the new method is about an order of magnitude faster; at 1000 code points, two orders of magnitude. The program took 97s to execute on my box using blead, and 1.5 seconds using this new scheme. I was surprised to see that even with classes containing fewer than 10 code points, the binary search trumped, by a little, the linear search.
|
| Even after this patch, under the current scheme, one can easily run out of memory due to the permanent storing of results of swash lookups in hashes. The new search mechanism might be fast enough to enable the elimination of that memory usage. Instead, a simple cache in each inversion list could store its previous result, and be checked for validity before starting the search, under the assumption (which the current scheme also makes) that probes will tend to be clustered together, as nearby code points are often in the same script.
|
| ===============================================
| # This program creates longer and longer character class lists while
| # testing code point matches against them.  By adding or subtracting
| # 65 from the previous member, caching of results is eliminated (as of
| # this writing), so this essentially tests how long it takes to
| # search through swashes to see if a code point matches or not.
| use Benchmark ':hireswallclock';
| my $string = "";
| my $class_cp = 2**30;    # Divide the code space in half, approx.
| my $string_cp = $class_cp;
| my $iterations = 10000;
| for my $j (1..2048) {
|     # Append the next character to the [class]
|     my $hex_class_cp = sprintf("%X", $class_cp);
|     $string .= "\\x{$hex_class_cp}";
|     $class_cp -= 65;
|     next if $j % 100 != 0;    # Only test certain ones
|     print "$j: lowest is [$hex_class_cp]: ";
|     timethis(1, "no warnings qw(portable non_unicode); my \$i = $string_cp; for (0 .. $iterations) { chr(\$i) =~ /[$string]/; \$i += 65 }");
|     $string_cp += ($iterations + 1) * 65;
| }
* utf8.c: Refactor code slightly in prepKarl Williamson2012-01-131-13/+21
| | | | | Future commits will split up the necessary initialization into two components. This patch prepares for that without adding anything new.
* utf8.c: New function to retrieve non-copy of swashKarl Williamson2012-01-131-5/+38
| Currently, swash_init returns a copy of the swash it finds. The core portions of the swash are read-only, and the non-read-only portions are derived from them. When the value for a code point is looked up, the results for it and adjacent code points are stored in a new element, so that the lookup never has to be performed again. But since a copy is returned, those results are stored only in the copy, and any other uses of the same logical swash don't have access to them, so the lookups have to be performed for each logical use.
|
| Here's an example. If you have 2 occurrences of /\p{Upper}/ in your program, there are 2 different swashes created, both initialized identically. As you start matching against code points, say "A" =~ /\p{Upper}/, the swashes diverge, as the results for each match are saved in the one applicable to that match. If you match "A" in each swash, it has to be looked up in each swash, and an (identical) element will be saved for it in each swash. This is wasteful of both time and memory.
|
| This patch renames the function and returns the original and not a copy, thus eliminating the overhead for swashes accessed through the new interface. The old function name is serviced by a new function which merely wraps the new name's result in a copy, thus preserving the interface for existing calls. Thus, in the example above, there is only one swash, and matching "A" against it results in only one new element; the second use will find that, and not have to go out looking again. In a program with lots of regular expressions, the savings in time and memory can be quite large.
|
| The new name is restricted to use only in regcomp.c and utf8.c (unless XS code cheats the preprocessor), where we will code so as to not destroy the original's data. Otherwise, a change to that data would change the definition of a Unicode property everywhere in the program.
|
| Note that there are no current callers of the new interface; these will be added in future commits.
* utf8.c: Change name of static functionKarl Williamson2012-01-131-14/+14
| | | | | This function has always confused me, as it doesn't return a swash, but a swatch.
* utf8.c: Move test out of loopsKarl Williamson2012-01-131-20/+12
| | | | | We set the upper limit of the loops before entering them to the min of the two possible limits, thus avoiding a test each time through
* Comment additions, typos, white-space.Karl Williamson2012-01-131-0/+1
| | | | And the reordering for clarity of one test
* diag_listed_as galoreFather Chrysostomos2011-12-281-0/+2
| | | | | In two instances, I actually modified the code to avoid %s for a constant string, as it should be faster that way.
* utf8.c: white-space, comment clarification onlyKarl Williamson2011-12-181-8/+7
|
* utf8.c: foldEQ_utf8_flags() use specific flag, not just anyKarl Williamson2011-12-181-1/+1
| | | | | | | | | | The test here was whether any flag was set, not whether the particular desired one was. This doesn't cause any bugs as things are currently structured, but could in the future. The reason it doesn't cause any bugs currently is that the other flags are tested first, and only if they are both 0 does this flag get tested.
* utf8.c: Change prototypes of two functionsKarl Williamson2011-12-151-3/+6
| | | | | | | | | | _to_uni_fold_flags() and _to_fold_latin1() now have their flags parameter be a boolean. The name 'flags' is retained in case the usage ever expands, instead of naming the parameter after the only use it currently has. This is a result of confusion between this and _to_utf8_fold_flags(), which does have more than one flag possibility.
* utf8.c: White-space changes onlyKarl Williamson2011-12-151-10/+12
| | | | This indents previous lines that are now within new blocks
* utf8.c: Allow changed behavior of utf8 under localeKarl Williamson2011-12-151-15/+224
| | | | | | | | | | This changes the 4 case changing functions to take extra parameters to specify if the utf8 string is to be processed under locale rules when the code points are < 256. The current functions are changed to macros that call the new versions so that current behavior is unchanged. An additional, static, function is created that makes sure that the 255/256 boundary is not crossed during the case change.
* utf8.c: Add commentKarl Williamson2011-12-151-0/+4
|
* utf8.c: typos in podKarl Williamson2011-11-211-2/+2
|
* PATCH: [perl #32080] is_utf8_string() reads too farKarl Williamson2011-11-211-28/+30
| | | | | | This function and is_utf8_string_loclen() are modified to check before reading beyond the end of the string; and the pod for is_utf8_char() is modified to warn about the buffer overflow potential.
* utf8.c: typo in commentKarl Williamson2011-11-121-1/+1
|
* utf8.c: Skip extra function callsKarl Williamson2011-11-111-7/+3
| | | | | The function to_uni_fold() works without requiring conversion first to utf8.
* utf8.c: Add compiler hintKarl Williamson2011-11-111-1/+1
| | | | It's very rare that someone will be outputting these unusual code points
* utf8.c: Add and revise commentsKarl Williamson2011-11-111-6/+34
| | | | | I now understand swashes enough to document them better; nits in other comments
* utf8.c: Don't warn on \p{user-defined} for above-UnicodeKarl Williamson2011-11-101-13/+18
| | | | | | Perl has allowed user-defined properties to match above-Unicode code points, while falsely warning that it doesn't. This removes that warning.
* utf8.c: Handle swashes at UV_MAXKarl Williamson2011-11-101-0/+13
| | | | | | | The code assumed that there is a code point above the highest value we are looking at. That is true except when we are looking at the highest representable code point on the machine. A special case is needed for that.
* utf8.c: Fix swash handling under USE_MORE_BITSKarl Williamson2011-11-101-1/+1
| | | | | | On a 32 bit machine with USE_MORE_BITS, a UV is 64 bits, but STRLEN is 32 bits. A cast was missing during a bit complement that led to loss of 32 bits.
* utf8.c: Make swashes work close to UV_MAXKarl Williamson2011-11-091-1/+7
| | | | | | | | | | | | | | | | When a code point is to be checked if it matches a property, a swatch of the swash is read in. Typically this is a block of 64 code points that contain the one desired. A bit map is set for those 64 code points, apparently under the expectation that the program will desire code points near the original. However, it just adds 63 to the original code point to get the ending point of the block. When the original is so close to the maximum UV expressible on the platform, this will overflow. The patch is simply to check for overflow and if it happens use the max possible. A special case is still needed to handle the very maximum possible code point, and a future commit will deal with that.
* utf8.c: Faster latin1 foldingKarl Williamson2011-11-081-1/+47
| | | | | | | This adds a function similar to the ones for the other three case changing operations that works on latin1 characters only, and avoids having to go out to swashes. It changes to_uni_fold() and to_utf8_fold() to call it on the appropriate input
* utf8.c: Faster latin1 upper/title casingKarl Williamson2011-11-081-2/+81
| | | | | | | | | | | | | This creates a new function to handle upper/title casing of code points in the latin1 range, and avoids using a swash to compute the case, because the correct values are compiled-in. It calls this function when appropriate for both title and upper casing, in both utf8 and uni forms. Unlike the similar function for lower casing, it may make sense for this function to be called from outside utf8.c but inside the core, so it is not static, but its name begins with an underscore.
* utf8.c: Expand use of refactored to_uni_lowerKarl Williamson2011-11-081-1/+10
| | | | | | | | The new function split out from to_uni_lower is now called when appropriate from to_utf8_lower. And to_uni_lower no longer calls to_utf8_lower, using the macro instead, saving a function call and duplicate work
* utf8.c: Refactor to_uni_lower()Karl Williamson2011-11-081-16/+27
| | | | | The portion that deals with Latin1 range characters is refactored into a separate (static) function, so that it can be called from more than one place.
* utf8.c: Refactor case-changing calls into macrosKarl Williamson2011-11-081-10/+20
| | | | Future commits will use these in additional places, so macroize