author | Karl Williamson <khw@cpan.org> | 2016-12-10 15:26:24 -0700 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2016-12-23 16:48:35 -0700 |
commit | 9495395586e6a655057cb766ed00213037dd06c0 (patch) | |
tree | dfb0df883a3dd756d58ce106bb70bd8e57a55203 /utf8.h | |
parent | 5a48568dae7e81342fc2f8d0845423834f5c818f (diff) | |
download | perl-9495395586e6a655057cb766ed00213037dd06c0.tar.gz |
Return REPLACEMENT for UTF-8 overlong malformation
When perl decodes UTF-8 into a code point, it must decide what to do if
the input is malformed in some way. When the flags passed to the decode
function indicate that a given malformation type is not acceptable, the
function returns 0 to indicate failure; on success it returns the decoded
code point (unfortunately that may require disambiguation if the
input is validly a NUL). As perl evolved, the handling of allowed
malformations became stricter and stricter. The overlong was the last
malformation that was not turned into a REPLACEMENT CHARACTER when
allowed, and this commit changes it to return that. Unlike most other
malformations, the code point value of an overlong is well-defined,
which is why it hadn't been changed heretofore. But it is safer to use
the Unicode-prescribed behavior on all malformations, which is to
replace them with the REPLACEMENT CHARACTER. Just in case there is code
that requires the old behavior, it is retained, but you have to search
the source for the undocumented flag that enables it.
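
The three outcomes described above (reject, replace, or return the overlong's own value) can be illustrated with a small standalone sketch. This is not perl's decoder and does not call any perl API: the function `sk_decode` and the `SK_*` names are hypothetical, and the flag values merely mirror the bit layout introduced in the utf8.h hunk. It assumes a malformation counts as "overlong" when the decoded value is below the minimum for the sequence length, and it ignores every other malformation type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags for this sketch only; they copy the bit layout of
 * UTF8_ALLOW_LONG and UTF8_ALLOW_LONG_AND_ITS_VALUE but are not perl's. */
#define SK_ALLOW_LONG               0x0010u
#define SK_ALLOW_LONG_AND_ITS_VALUE (SK_ALLOW_LONG | 0x0020u)

#define SK_REPLACEMENT 0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Decode the first UTF-8 sequence (1 to 4 bytes) in s[0..len-1].
 * Returns the code point, or 0 on failure. As the commit message notes,
 * a 0 return is ambiguous when the input is validly a NUL. */
static uint32_t sk_decode(const unsigned char *s, size_t len, unsigned flags)
{
    /* Smallest code point that legitimately needs N bytes, indexed by N. */
    static const uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    size_t need;
    uint32_t cp;

    if (len == 0) return 0;

    if (s[0] < 0x80)                { need = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return 0;                  /* not a start byte */

    if (len < need) return 0;       /* truncated input: out of scope here */

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* bad continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_for_len[need]) {   /* overlong: fewer bytes would do */
        if (!(flags & SK_ALLOW_LONG))
            return 0;               /* malformation not allowed */
        if ((flags & SK_ALLOW_LONG_AND_ITS_VALUE)
                == SK_ALLOW_LONG_AND_ITS_VALUE)
            return cp;              /* old behavior: the overlong's value */
        return SK_REPLACEMENT;      /* new default when allowed */
    }
    return cp;
}
```

For the two-byte overlong encoding of "/" (bytes 0xC0 0xAF), this sketch returns 0 with no flags, U+FFFD with `SK_ALLOW_LONG`, and 0x2F with `SK_ALLOW_LONG_AND_ITS_VALUE`. Because the second flag includes the first flag's bit, a plain `flags & SK_ALLOW_LONG` test is true for both, so the extra bit has to be checked separately.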
Diffstat (limited to 'utf8.h')
-rw-r--r-- | utf8.h | 5 |
1 files changed, 4 insertions, 1 deletions
@@ -738,8 +738,11 @@ case any call to string overloading updates the internal UTF-8 encoding flag.
 #define UTF8_ALLOW_SHORT                0x0008
 #define UTF8_GOT_SHORT                  UTF8_ALLOW_SHORT
 
-/* Overlong sequence; i.e., the code point can be specified in fewer bytes. */
+/* Overlong sequence; i.e., the code point can be specified in fewer bytes.
+ * First one will convert the overlong to the REPLACEMENT CHARACTER; second
+ * will return what the overlong evaluates to */
 #define UTF8_ALLOW_LONG                 0x0010
+#define UTF8_ALLOW_LONG_AND_ITS_VALUE   (UTF8_ALLOW_LONG|0x0020)
 #define UTF8_GOT_LONG                   UTF8_ALLOW_LONG
 
 /* Currently no way to allow overflow */