path: root/utf8.c
Commit message | Author | Age | Files | Lines
* perlapi: Remove extraneous ">" | Karl Williamson | 2015-05-12 | 1 | -2/+2
|
* perlapi: Use UVCHR_SKIP not UNI_SKIP | Karl Williamson | 2015-05-11 | 1 | -2/+2
| This new name is more consistent with other uses in the API.
* perlapi: Add 2 links to other parts of the pod | Karl Williamson | 2015-05-08 | 1 | -0/+2
|
* Revert "Don’t call save_re_context" | David Mitchell | 2015-03-30 | 1 | -0/+5
| This reverts commit d28a9254e445aee7212523d9a7ff62ae0a743fec.
| Turns out we need save_re_context() after all.
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
| An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN.
| Committer's work:
| To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools.
| perldelta entry for module updates.
| Add two Emacs control files to MANIFEST; re-sort MANIFEST.
| For: RT #124119.
* [perl #123814] replace grok_atou with grok_atoUV | Hugo van der Sanden | 2015-03-09 | 1 | -3/+9
| Some questions and loose ends:
| XXX gv.c:S_gv_magicalize - why are we using SSize_t for paren?
| XXX mg.c:Perl_magic_set - need appropriate error handling for $)
| XXX regcomp.c:S_reg - need to check if we do the right thing if parno was not grokked
| Perl_get_debug_opts should probably return something unsigned; not sure if that's something we can change.
* Consistently use NOT_REACHED; /* NOTREACHED */ | Jarkko Hietaniemi | 2015-03-04 | 1 | -1/+1
| Both needed: the macro is for compilers, the comment for static checkers.
| (This doesn't address whether each spot is correct and necessary.)
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -1/+0
| A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup.
| This makes the implementation of \X go from being complicated to trivial.
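The array-lookup alternative mentioned in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: the enum, table, and function names are hypothetical, and the table covers a toy subset of the grapheme cluster break classes (Other, CR, LF, Control, Extend) rather than the full Unicode rule set or anything taken from utf8.c or regexec.c.

```c
/* Hypothetical, reduced set of grapheme-cluster-break classes. */
enum sketch_gcb { SK_OTHER, SK_CR, SK_LF, SK_CONTROL, SK_EXTEND, SK_GCB_COUNT };

/* sketch_gcb_break[before][after] is 1 when a break is allowed between
 * a character of class 'before' and one of class 'after'.
 * Encoded rules: no break inside CR LF; always break after and before
 * CR, LF and Control otherwise; no break before Extend; break elsewhere. */
static const unsigned char sketch_gcb_break[SK_GCB_COUNT][SK_GCB_COUNT] = {
    /* after:        Other CR LF Control Extend */
    /* Other   */  { 1,    1, 1, 1,      0 },
    /* CR      */  { 1,    1, 0, 1,      1 },
    /* LF      */  { 1,    1, 1, 1,      1 },
    /* Control */  { 1,    1, 1, 1,      1 },
    /* Extend  */  { 1,    1, 1, 1,      0 },
};

/* One table lookup replaces a chain of rule tests. */
static int sketch_is_gcb(enum sketch_gcb before, enum sketch_gcb after)
{
    return sketch_gcb_break[before][after];
}
```

A caller would first map each code point to its class and then index the table, so adding a class means adding one row and one column rather than new branches.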
* utf8.c: Slight refactor of UTF-16 code | Karl Williamson | 2015-02-18 | 1 | -8/+15
| This eliminates a branch in the usual case, at the expense of an extra one in the rarer case, which allows us to collapse some error condition code. It sprinkles some UNLIKELYs.
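The shape of such a refactor might look like the sketch below, which is not the code from utf8.c. It assumes a hypothetical decoder over 16-bit code units: a single test, hinted as unlikely, rules out every surrogate in the usual case, and only the rarer surrogate path pays for the extra checks, whose failures share one error return. The macro and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a branch-prediction hint. */
#if defined(__GNUC__)
#  define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define SKETCH_UNLIKELY(x) (x)
#endif

/* Decode one code point from UTF-16 code units.
 * Returns the number of units consumed, or 0 on malformed input. */
static size_t sketch_utf16_decode(const uint16_t *p, size_t len, uint32_t *out)
{
    uint32_t u;

    if (len == 0)
        return 0;
    u = p[0];

    /* Usual case: one test excludes all surrogates (0xD800..0xDFFF). */
    if (SKETCH_UNLIKELY((u & 0xF800) == 0xD800)) {
        uint32_t low = 0;

        /* Rarer case: lone low surrogate, truncated pair, and bad
         * second unit all collapse into a single error return. */
        if (u > 0xDBFF || len < 2 || ((low = p[1]) & 0xFC00) != 0xDC00)
            return 0;
        *out = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
        return 2;
    }

    *out = u;
    return 1;
}
```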
* move functions marked as mathomed in embed.fnc to mathoms.c | Daniel Dragan | 2015-01-27 | 1 | -16/+0
| Ever since commit 075eb5c9b6 mathom functions must be in mathoms.c for their symbols to be skipped in makedef.pl on Win32 Perl. If a function is marked 'b' in embed.fnc, regen.pl does NOT add its prototype to proto.h (it is commented out). Without the proto.h entry, EXTERN_C will be missing and a -DNO_MATHOMS + C++ Win32 Perl build will not link, since the C function will have a mangled name and the symbol will not be found for creating the perl linking library.
| Also add EXTERN_C to Win32CORE, the init_Win32CORE symbol is special cased for exporting in makedef.pl.
| Perl_is_utf8_char_buf was marked as 'b' in commit 3cedd9d930
| Perl_sv_copypv was marked as 'b' in commit 4bac9ae47b
* avoid C labels in column 0 | David Mitchell | 2015-01-21 | 1 | -4/+4
| Generally the guideline is to outdent C labels (e.g. 'foo:') 2 columns from the surrounding code.
| If the label starts at column zero, then diffs, such as those generated by git, display the label rather than the function name at the head of a diff block, which makes diffs harder to peruse.
* Raise warning on multi-byte char in single-byte locale | Karl Williamson | 2014-12-29 | 1 | -1/+2
| See http://nntp.perl.org/group/perl.perl5.porters/211909
| Something is quite likely wrong with the logic if, say, in a Greek locale, Unicode characters (especially Greek ones) are encountered. The same character will be represented by two different code points. This warning alerts the user to this undesirable state of affairs.
* foldEQ_utf8(): Add some internal flags | Karl Williamson | 2014-12-29 | 1 | -1/+12
| The comments explain their purpose.
* Simplify foldEQ_utf8 | Karl Williamson | 2014-12-29 | 1 | -80/+45
| This moves the uncommon case of handling inputs under non-UTF-8 locales out of this function to the functions it calls, which already have the logic to handle it. This simplifies this function, cutting a couple branches each time through the loop from the common usage.
| The locale handling is slowed down somewhat, but even if that were a concern, another simpler function is normally used for locale handling. This gets called only when one or both of the comparison strings is UTF-8, which should be comparatively rare for non-UTF8 locales.
* utf8.c: Use OP_DESC instead of passing string. | Karl Williamson | 2014-12-29 | 1 | -6/+6
| OP_DESC is simpler and more general.
* utf8.c: Fix potential fold bug | Karl Williamson | 2014-12-29 | 1 | -6/+4
| The function _to_uni_fold_flags() supposedly had the ability to do folding based on the current locale, if the correct flag is passed. However, it didn't actually do that, returning a non-locale fold instead. Fortunately, this is an undocumented capability (actually, the whole function is undocumented), and no current calls to it used the flag. This commit causes it to work.
* utf8.c: Add some function parameter assertions | Karl Williamson | 2014-12-29 | 1 | -1/+5
| Currently these are not violated, but this guards against future mistakes.
* Don't raise 'poorly supported' locale warning unnecessarily | Karl Williamson | 2014-12-29 | 1 | -11/+40
| Commit 8c6180a91de91a1194f427fc639694f43a903a78 added a warning message for when Perl determines that the underlying locale the program has just switched into is poorly supported. At the time it was thought that this would be an extremely rare occurrence. However, a bug in HP-UX - B.11.00/64 causes this message to be raised for the "C" locale. A workaround was done that silenced those. However, before it got fixed, this message would occur gobs of times executing the test suite. It was raised even if the script is not locale-aware, so that the underlying locale was completely irrelevant. There is a good prospect that someone using an older Asian locale as their default would get this message inappropriately, even if they don't use locales, or switch to a supported one before using them.
| This commit causes the message to be raised only if it actually is relevant. When not in the scope of 'use locale', the message is stored, not raised. Upon the first locale-dependent operation within a bad locale, the saved message is raised, and the storage cleared. I was able to do this without adding extra branching to the main-line non-locale execution code. This was done by adding regnodes which get jumped to by switch statements, and refactoring some existing C tests so they exclude non-locale right off the bat.
| These changes would have been necessary for another locale warning that I previously agreed to implement, and which is coming a few commits from now.
| I do not know of any way to add tests in the test suite for this. It is in fact rare for modern locales to have these issues. The way I tested this was to temporarily change the C code so that all locales are viewed as defective, and manually note that the warnings came out where expected, and only where expected.
| I chose not to try to output this warning on any POSIX functions called. I believe that all that are affected are deprecated or scheduled to be deprecated anyway. And POSIX is closer to the hardware of the machine. For convenience, I also don't output the message for some zero-length pattern matches. If something is going to be matched, the message will likely very soon be raised anyway.
* Nits in comments | Karl Williamson | 2014-12-29 | 1 | -2/+2
|
* make more use of NOT_REACHED | Lukas Mai | 2014-11-29 | 1 | -2/+2
| In particular, remove all instances of 'assert(0);'.
* Make is_invariant_string() | Karl Williamson | 2014-11-26 | 1 | -6/+5
| This is a more accurately named synonym for is_ascii_string(), which is retained. The old name is misleading to someone programming for non-ASCII platforms.
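For readers unfamiliar with the term, the idea behind the new name can be sketched as below. This is a simplified illustration, not the implementation in utf8.c: it assumes an ASCII platform, where the bytes whose meaning is the same whether or not a string is UTF-8 encoded are exactly those below 0x80. On an EBCDIC platform the invariant set differs, which is why "ascii" is the misleading name. The function name is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if every one of the first 'len' bytes of 's' is invariant
 * under UTF-8 encoding, 0 otherwise.
 * ASSUMPTION: ASCII platform, so "invariant" means "byte < 0x80". */
static int sketch_is_invariant_string(const unsigned char *s, size_t len)
{
    const unsigned char *const end = s + len;

    while (s < end) {
        if (*s++ >= 0x80)
            return 0;   /* this byte would change meaning under UTF-8 */
    }
    return 1;
}
```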
* Improve API pod of is_ascii_string | Karl Williamson | 2014-11-26 | 1 | -4/+8
|
* utf8.c: Shorten long constant names, and simplify | Karl Williamson | 2014-11-24 | 1 | -6/+10
| The previous commit fixed a typo caused by it being hard to see the differences in a long ALL_CAP name. This uses #defines to type the long name only once, and compile-time variables so the expression for the length of strings is specified only once.
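The general pattern described here might be sketched as follows. The constant, its value, and the helper are all hypothetical and are not the names used in utf8.c. The point is only that the long name and its sizeof()-based length each appear once, so later code cannot pair a string with the length of a different, similarly named one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical long ALL_CAP constant; written out exactly once. */
#define SKETCH_LONG_HARD_TO_READ_PREFIX_NAME "utf8::"

/* The string and its length are bound together at compile time, so the
 * expression for the length is specified only here. */
static const char   sketch_prefix[]   = SKETCH_LONG_HARD_TO_READ_PREFIX_NAME;
static const size_t sketch_prefix_len = sizeof(sketch_prefix) - 1;

/* Returns 1 if the first 'len' bytes of 's' begin with the prefix. */
static int sketch_has_prefix(const char *s, size_t len)
{
    return len >= sketch_prefix_len
        && memcmp(s, sketch_prefix, sketch_prefix_len) == 0;
}
```

Taking sizeof() of an array (rather than of a pointer) counts the trailing NUL, hence the `- 1`.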
* utf8.c: Was taking sizeof() wrong thing | Karl Williamson | 2014-11-24 | 1 | -1/+1
| This was a typo due to the long name. A future commit will make it cleaner. Taking sizeof() of the wrong name evaluates to the right number on ASCII platforms, but not on EBCDIC.
* Add warning message for locale/Unicode intermixing | Karl Williamson | 2014-11-14 | 1 | -5/+21
| This is explained in the added perldiag entry.
* uvoffuni_to_utf8_flags() die if platform can't handle | Karl Williamson | 2014-10-21 | 1 | -0/+9
| On non EBCDIC platforms currently any UV is encodable as UTF-8. (This would change if there were 128-bit words). Thus, much code assumes that nothing can go wrong when converting to UTF-8, and hence does no error checking. However, UTF-EBCDIC is only capable of representing code points below 2**32, so if there are 64-bit words, this function can fail. Prior to this patch, there was no real overflow check, and garbage was returned by this function if called with too large a number. While not ideal, the easiest thing to do is to just die for such a number, like we do for division by 0. This involves changing only code within this function, and not its many callers.
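The kind of guard being described can be illustrated with a different, simpler encoding. The sketch below is an assumption-laden stand-in, not uvoffuni_to_utf8_flags(): it uses the original six-byte form of UTF-8, whose ceiling is 0x7FFFFFFF rather than the 2**32 limit of UTF-EBCDIC, and it returns NULL where Perl would die. The function name is hypothetical. What it shows is the check of a 64-bit input against the encoding's maximum before any bytes are written.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode 'cp' at 'd' (which must have room for 6 bytes).
 * Returns a pointer just past the encoded bytes, or NULL when the value
 * is too large for this encoding to represent. */
static unsigned char *sketch_encode(unsigned char *d, uint64_t cp)
{
    int len, i;

    /* Overflow guard: refuse rather than emit garbage. */
    if (cp > 0x7FFFFFFFu)
        return NULL;

    if (cp < 0x80) {
        *d++ = (unsigned char)cp;
        return d;
    }

    len = cp < 0x800     ? 2
        : cp < 0x10000   ? 3
        : cp < 0x200000  ? 4
        : cp < 0x4000000 ? 5
        :                  6;

    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = len - 1; i > 0; i--) {
        d[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* Lead byte: 'len' high one-bits followed by the remaining bits. */
    d[0] = (unsigned char)((0xFF << (8 - len)) | cp);
    return d + len;
}
```

Returning an error value pushes a check onto every caller, which is the cost the commit message avoids by dying inside the function instead.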
* utf8.c: Improve debug message | Karl Williamson | 2014-10-21 | 1 | -2/+2
| This function was called with an empty string "" because that string was not actually needed in the function, except to better identify the source when there is an error. So change to specify the actual source.
* utf8.c: Move an #ifndef for clarity | Father Chrysostomos | 2014-09-12 | 1 | -1/+1
| The comment really belongs inside it, as it refers to those two lines of code.
* Remove obsolete comment from utf8.c | Father Chrysostomos | 2014-09-12 | 1 | -8/+0
| The call to save_re_context was removed by the previous commit. The commit before that stopped save_re_context from doing anything. Commit db2c6cb33 stopped the errsv_save line from triggering get-magic. So this comment, added in dc0c6abb4, no longer applies.
* Don’t call save_re_context | Father Chrysostomos | 2014-09-12 | 1 | -1/+4
| It is an empty function.
* perl #122747: localize PL_curpm to null in _core_swash_init | Yves Orton | 2014-09-11 | 1 | -2/+17
| Set PL_curpm to null before we do any swash initialization in _core_swash_init(). This "hides" the current regop from the swash code, with the intent of preventing weird reentrancy bugs when the swashes are initialized.
| Long term you could argue that we should just not use the regex engine to initialize a swash, and then this would be unnecessary.
| Thanks to FC for the suggestion!
* utf8.c: Use slightly more efficient macro | Karl Williamson | 2014-07-25 | 1 | -2/+4
| Lowercasing a Latin-1 range character results in a Latin-1 range character, so we can use the more restrictive macro, which is slightly more efficient than the general ones. (This difference is applicable only on EBCDIC platforms, as the macros all expand to nothing on ASCII ones.)
* Use grok_atou instead of strtoul (no explicit strtol uses). | Jarkko Hietaniemi | 2014-07-22 | 1 | -7/+10
|
* Remove or downgrade unnecessary dVAR. | Jarkko Hietaniemi | 2014-06-25 | 1 | -35/+0
| You need to configure with g++ *and* -Accflags=-DPERL_GLOBAL_STRUCT or -Accflags=-DPERL_GLOBAL_STRUCT_PRIVATE to see any difference. (g++ does not do the "post-annotation" form of "unused".)
| The version code has some of these issues, reported upstream.
* PERL_UNUSED_CONTEXT -> remove interp context where possible | Daniel Dragan | 2014-06-24 | 1 | -3/+1
| Removing context params will save machine code in the callers of these functions, and 1 ptr of stack space. Some of these funcs are heavily used, e.g. mg_find*. The contexts can always be readded in the future the same way they were removed. This patch was inspired by commit dc3bf40570. Also remove PERL_UNUSED_CONTEXT when it's not needed. See removal candidate rejection rationale in [perl #122106].
| - Perl_hv_backreferences_p uses context in S_hv_auxinit; commit 96a5add60f was wrong
| - Perl_whichsig_sv and Perl_whichsig_pv wrongly used PERL_UNUSED_CONTEXT from inception in commit 84c7b88cca
| - in author's opinion cast_* shouldn't be public API, no CPAN grep usage, can't be static and/or inline optimized since it is exported
| - Perl_my_unexec: move to block where it is needed, make Win32 block, context free, for inlining likelihood, private api and only 2 callers in core
| - Perl_my_dirfd: make all blocks context free, then change proto
| - Perl_bytes_cmp_utf8 wrongly used PERL_UNUSED_CONTEXT from inception in commit fed3ba5d6b
* Silence -Wunused-parameter my_perl under threads. | Jarkko Hietaniemi | 2014-06-19 | 1 | -3/+4
| For S_ functions, remove the context. For Perl_ functions, add PERL_UNUSED_CONTEXT.
| Tricky because this sometimes depends on DEBUGGING, and sometimes on whether we have PERL_IMPLICIT_SYS.
| (Why all the mathoms Perl_is_uni_... and Perl_is_utf8_... functions are not being whined about is a mystery.)
| vutil.c (included via util.c) has one of these, but it's cpan/, and a known problem: https://rt.cpan.org/Ticket/Display.html?id=96100
* Revert "/* NOTREACHED */ belongs *before* the unreachable." | Jarkko Hietaniemi | 2014-06-19 | 1 | -4/+2
| This reverts commit 148f39b7de6eae9ddd59e0b0aff691d6abea7aca.
| (Still needs more work, but wanted to see how well this passed with Jenkins.)
* /* NOTREACHED */ belongs *before* the unreachable. | Jarkko Hietaniemi | 2014-06-19 | 1 | -2/+4
| Definitely not *after* it. It marks the start of the unreachable, not the first unreachable line. And if they are in that order, it looks better to linebreak after the lint hint.
* Some low-hanging -Wunreachable-code fruits. | Jarkko Hietaniemi | 2014-06-15 | 1 | -1/+1
| - after return/croak/die/exit, return/break are pointless (break is not a terminator/separator, it's a goto)
| - after goto, another goto (!) is pointless
| - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak)
| - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters
| - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (the initializations will be skipped since the flow will bypass the start of the block); in some easy cases simply hoist the declarations out of the block and move them earlier
| Note 1: Since after this patch the core is not yet -Wunreachable-code clean, not enabling that via cflags.SH, one needs to -Accflags=... it.
| Note 2: At least with the older gcc 4.4.7 there are far too many "unreachable code" warnings, which seem to go away with gcc 4.8, maybe better flow control analysis. Therefore, the warning should eventually be enabled only for modernish gccs (what about clang and Intel cc?)
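The last bullet, about declarations inside switch blocks, can be illustrated with a small sketch. The function and variable names are hypothetical and nothing here is taken from the patch; it shows the hoisted form, with the fragile form described in a comment.

```c
/* Hoisted form: 'scaled' is declared and initialized before the switch,
 * so every case label sees an initialized value. */
static int sketch_classify(int kind, int value)
{
    int scaled = value * 2;

    switch (kind) {
        /* Fragile form: writing `int scaled = value * 2;` here, between
         * the opening brace and the first case label, would put the
         * variable in scope for all cases, but control jumps straight
         * to a label and the initializer would never run. */
    case 1:
        return scaled + 1;
    case 2:
        return scaled - 1;
    default:
        return 0;
    }
}
```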
* rmv duplicate SvUV call in Perl__swash_inversion_hash | Darin McBride | 2014-06-14 | 1 | -3/+5
|
* Revert "Some low-hanging -Wunreachable-code fruits." | Jarkko Hietaniemi | 2014-06-13 | 1 | -1/+1
| This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679.
| I don't understand - smoke-me came back happy with three separate reports... oh well, some other time.
* Some low-hanging -Wunreachable-code fruits. | Jarkko Hietaniemi | 2014-06-13 | 1 | -1/+1
| - after croak/die/exit (or return), break (or return!) are pointless (break is not a terminator/separator, it's a promise of a jump)
| - after goto, another goto (!) is pointless
| - in some cases (usually function ends) introduce explicit NOT_REACHED to make the noreturn nature clearer (do not do this everywhere, though, since that would mean adding NOT_REACHED after every croak)
| - for the added NOT_REACHED also add /* NOTREACHED */ since NOT_REACHED is for gcc (and VC), while the comment is for linters
| - declaring variables in switch blocks is just too fragile: it kind of works for narrowing the scope (which is nice), but breaks the moment there are initializations for the variables (they will be skipped!); in some easy cases simply hoist the declarations out of the block and move them earlier
| There are still a few places left.
* perlapi: Include general information | Karl Williamson | 2014-06-05 | 1 | -2/+1
| Unlike other pod handling routines, autodoc requires the line following an =head1 to be non-empty for its text to be included in the paragraph started by the heading. If you fail to do this, the text will be silently omitted from perlapi. I went through the source code, and where it was apparent that the text was supposed to be in perlapi, deleted the empty line so it would be, with some revisions to make more sense.
| I added =cuts where I thought it best for the text to not be included.
* Move some deprecated utf8-handling functions to mathoms | Karl Williamson | 2014-05-31 | 1 | -136/+17
| This entailed creating new internal functions for some of them to call so that the functionality can be retained during the deprecation period.
* Make is_utf8_char_buf() a macro | Karl Williamson | 2014-05-31 | 1 | -1/+1
| This function is now more efficiently implemented as a synonym for isUTF8_CHAR(). We retain the Perl_is_utf8_char_buf() function for code that calls it that way.
* Create isUTF8_CHAR() macro and use it | Karl Williamson | 2014-05-31 | 1 | -68/+15
| This macro will inline the code to determine if a character is well-formed UTF-8 for code points below a certain value, falling back to a slower function for larger ones. On ASCII platforms, it will inline for well beyond all legal Unicode code points. On EBCDIC, it currently does it for code points up to 0x3FFF. This could be increased, but our porting tests do the regen every time to make sure everything is ok, and making it larger slows that down. This is worked around on ASCII by normally commenting out the code that generates this info, but including in utf8.h a version that did get generated. This is static information and won't change. (This could be done for EBCDIC too, but I chose not to at this time as each code page has a different macro generated, and it gets ugly getting all of them in utf8.h.)
| Using this macro allowed for simplification of several functions in utf8.c
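The "inline the cheap cases, call a function for the rest" structure might be sketched like this. It is not the generated isUTF8_CHAR() macro: the names are hypothetical, the inline portion covers only one- and two-byte sequences, and the fallback applies strict Unicode well-formedness (no overlongs, no surrogates, nothing above U+10FFFF), which is narrower than what Perl accepts. Both forms return the length in bytes of a well-formed character starting at s, or 0.

```c
#include <stddef.h>

/* Slow path: three- and four-byte sequences, strict Unicode rules. */
static size_t sketch_utf8_char_len_slow(const unsigned char *s,
                                        const unsigned char *e)
{
    size_t avail = (size_t)(e - s);
    size_t len, i;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd byte */

    if (avail < 1)
        return 0;
    if (s[0] >= 0xE0 && s[0] <= 0xEF)
        len = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF4)
        len = 4;
    else
        return 0;        /* stray continuation or invalid lead byte */
    if (avail < len)
        return 0;

    if      (s[0] == 0xE0) lo = 0xA0;     /* exclude overlongs */
    else if (s[0] == 0xED) hi = 0x9F;     /* exclude surrogates */
    else if (s[0] == 0xF0) lo = 0x90;     /* exclude overlongs */
    else if (s[0] == 0xF4) hi = 0x8F;     /* exclude > U+10FFFF */

    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return len;
}

/* Fast path inlined at the call site; everything else falls through. */
#define SKETCH_IS_UTF8_CHAR(s, e)                                   \
    ( (e) - (s) < 1        ? 0                                      \
    : (s)[0] < 0x80        ? 1                                      \
    : ((e) - (s) >= 2 && (s)[0] >= 0xC2 && (s)[0] <= 0xDF           \
       && ((s)[1] & 0xC0) == 0x80) ? 2                              \
    : sketch_utf8_char_len_slow((s), (e)) )
```

Because the macro evaluates its arguments more than once, callers of a sketch like this should pass plain pointers without side effects.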
* utf8.c: Move a static function to inline.h | Karl Williamson | 2014-05-31 | 1 | -35/+3
| This is in preparation for it being called from outside utf8.c. It is renamed to have a leading underscore to emphasize its private nature.
* utf8.c: Move documentation next to its function | Karl Williamson | 2014-05-30 | 1 | -16/+16
| Somehow this pod stuff was orphaned from the function it describes.
* utf8.c: Silence compiler warning | Karl Williamson | 2014-05-29 | 1 | -1/+1
| This was brought to my attention by Jarkko Hietaniemi. The compiler was complaining that a variable could be used uninitialized. In practice this doesn't happen, as it could occur only on bad data, and Perl itself generates the data used. (I suppose if the data got corrupted, it could happen.) This commit initializes the value unconditionally, which allows a conditional setting of it to be removed.
* utf8.c: Move static function to embed.fnc | Karl Williamson | 2014-05-29 | 1 | -6/+8
| This automatically generates assertions for pointer arguments, etc.