author Karl Williamson <khw@cpan.org> 2016-12-10 15:26:24 -0700
committer Karl Williamson <khw@cpan.org> 2016-12-23 16:48:35 -0700
commit 9495395586e6a655057cb766ed00213037dd06c0
tree dfb0df883a3dd756d58ce106bb70bd8e57a55203 /utf8.h
parent 5a48568dae7e81342fc2f8d0845423834f5c818f
Return REPLACEMENT for UTF-8 overlong malformation
When perl decodes UTF-8 into a code point, it must decide what to do if the input is malformed in some way. When the flags passed to the decode function indicate that a given malformation type is not acceptable, the function returns 0 to indicate failure; on success it returns the decoded code point (unfortunately, a return of 0 may require disambiguation, since the input could validly be a NUL).

As perl evolved, its handling of the various allowed malformations became stricter and stricter. The overlong was the last malformation that was not turned into the REPLACEMENT CHARACTER when allowed, and this commit changes the decoder to return the REPLACEMENT CHARACTER for it as well. Unlike most other malformations, the code point value of an overlong is well-defined, which is why it hadn't been changed heretofore. But it is safer to use the Unicode-prescribed behavior for all malformations, which is to replace them with the REPLACEMENT CHARACTER.

In case there is code that requires the old behavior, it is retained, but you have to search the source for the undocumented flag that enables it.
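To illustrate the three outcomes described above, here is a minimal sketch that handles only two-byte sequences. It is not perl's decoder: decode_two_byte() and check_overlong_handling() are hypothetical names invented for this illustration, and only the two flag values are taken from the utf8.h hunk in this commit.

```c
#include <assert.h>

/* Flag values copied from the utf8.h hunk in this commit. */
#define UTF8_ALLOW_LONG               0x0010
#define UTF8_ALLOW_LONG_AND_ITS_VALUE (UTF8_ALLOW_LONG|0x0020)

#define REPLACEMENT_CHARACTER 0xFFFDUL   /* U+FFFD */

/* Hypothetical helper, NOT perl's implementation: decodes the two-byte
 * UTF-8 sequence s[0], s[1].  Returns 0 on failure, mirroring the
 * convention described in the commit message. */
static unsigned long decode_two_byte(const unsigned char *s, unsigned flags)
{
    unsigned long cp;

    /* Require a 110xxxxx lead byte followed by a 10xxxxxx continuation. */
    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return 0;

    cp = ((unsigned long)(s[0] & 0x1F) << 6) | (unsigned long)(s[1] & 0x3F);

    /* Values of 0x80 and up genuinely need two bytes: well-formed. */
    if (cp >= 0x80)
        return cp;

    /* Otherwise the sequence is overlong: one byte would have sufficed. */
    if (!(flags & UTF8_ALLOW_LONG))
        return 0;                         /* malformation not accepted */

    /* The second flag contains the first, so test for all of its bits. */
    if ((flags & UTF8_ALLOW_LONG_AND_ITS_VALUE) == UTF8_ALLOW_LONG_AND_ITS_VALUE)
        return cp;                        /* old behavior: the overlong's value */

    return REPLACEMENT_CHARACTER;         /* new default for allowed overlongs */
}

/* Exercises the three cases with C0 AF, an overlong encoding of '/'. */
static void check_overlong_handling(void)
{
    const unsigned char overlong_slash[2] = { 0xC0, 0xAF };

    assert(decode_two_byte(overlong_slash, 0) == 0);
    assert(decode_two_byte(overlong_slash, UTF8_ALLOW_LONG)
           == REPLACEMENT_CHARACTER);
    assert(decode_two_byte(overlong_slash, UTF8_ALLOW_LONG_AND_ITS_VALUE)
           == 0x2F);
}
```

Because UTF8_ALLOW_LONG_AND_ITS_VALUE includes the UTF8_ALLOW_LONG bit, any check written as `flags & UTF8_ALLOW_LONG` keeps treating the overlong as allowed under either flag; only the returned value differs.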
Diffstat (limited to 'utf8.h')
-rw-r--r-- utf8.h 5
1 files changed, 4 insertions, 1 deletions
diff --git a/utf8.h b/utf8.h
index a4cae099d8..3dde45a1dd 100644
--- a/utf8.h
+++ b/utf8.h
@@ -738,8 +738,11 @@ case any call to string overloading updates the internal UTF-8 encoding flag.
#define UTF8_ALLOW_SHORT 0x0008
#define UTF8_GOT_SHORT UTF8_ALLOW_SHORT
-/* Overlong sequence; i.e., the code point can be specified in fewer bytes. */
+/* Overlong sequence; i.e., the code point can be specified in fewer bytes.
+ * First one will convert the overlong to the REPLACEMENT CHARACTER; second
+ * will return what the overlong evaluates to */
#define UTF8_ALLOW_LONG 0x0010
+#define UTF8_ALLOW_LONG_AND_ITS_VALUE (UTF8_ALLOW_LONG|0x0020)
#define UTF8_GOT_LONG UTF8_ALLOW_LONG
/* Currently no way to allow overflow */