Return REPLACEMENT for UTF-8 overlong malformation

When perl decodes UTF-8 into a code point, it must decide what to do if the input is malformed in some way. When the flags passed to the decode function indicate that a given malformation type is not acceptable, the function returns 0 to indicate failure; on success it returns the decoded code point (unfortunately, that may require disambiguation if the input is validly a NUL).

As perl evolved, the handling of the various allowed malformations got stricter and stricter. The overlong was the final malformation that was not turned into a REPLACEMENT CHARACTER when it was allowed, and this commit changes that: an allowed overlong now yields the REPLACEMENT CHARACTER as well. Unlike most other malformations, the code point value of an overlong is well-defined, which is why this had not been changed heretofore. But it is safer to use the Unicode-prescribed behavior on all malformations, which is to replace them with the REPLACEMENT CHARACTER.

In case there is code that requires the old behavior, it is retained, but you have to search the source for the undocumented flag that enables it.
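
The three outcomes described above can be illustrated with a small standalone sketch. This is not perl's decoder: the function demo_decode2 and the DEMO_* names are hypothetical, it handles two-byte sequences only, and it borrows just the flag bit values from the utf8.h diff (0x0010 and 0x0010|0x0020). It assumes that a disallowed overlong returns 0, an allowed one returns U+FFFD, and the value-preserving flag returns what the overlong evaluates to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for this sketch; only the bit values mirror the diff. */
#define DEMO_ALLOW_LONG               0x0010
#define DEMO_ALLOW_LONG_AND_ITS_VALUE (DEMO_ALLOW_LONG | 0x0020)
#define DEMO_REPLACEMENT              0xFFFDu  /* U+FFFD REPLACEMENT CHARACTER */

/* Decode one two-byte UTF-8 sequence (enough to demonstrate overlongs).
 * Returns 0 on failure, which is ambiguous with a validly encoded NUL. */
uint32_t demo_decode2(const unsigned char *s, size_t len, unsigned flags)
{
    assert(s != NULL);

    /* Require the shape 110xxxxx 10xxxxxx. */
    if (len < 2 || (s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return 0;

    uint32_t cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);

    /* Code points of 0x80 and above need two bytes: this is the shortest form. */
    if (cp >= 0x80)
        return cp;

    /* Below 0x80 one byte would have sufficed, so the sequence is overlong. */
    if ((flags & DEMO_ALLOW_LONG_AND_ITS_VALUE) == DEMO_ALLOW_LONG_AND_ITS_VALUE)
        return cp;                /* old behavior: the well-defined value */
    if (flags & DEMO_ALLOW_LONG)
        return DEMO_REPLACEMENT;  /* new behavior: substitute U+FFFD */
    return 0;                     /* overlongs not allowed: failure */
}
```

For example, the bytes C1 81 are an overlong encoding of U+0041: under these assumptions the sketch returns 0 with no flags, 0xFFFD with DEMO_ALLOW_LONG, and 0x41 with DEMO_ALLOW_LONG_AND_ITS_VALUE. Because the second flag includes the bits of the first, the value-preserving test has to come first.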
author: Karl Williamson <khw@cpan.org> 2016-12-10 15:26:24 -0700
committer: Karl Williamson <khw@cpan.org> 2016-12-23 16:48:35 -0700
commit: 9495395586e6a655057cb766ed00213037dd06c0 (patch)
tree: dfb0df883a3dd756d58ce106bb70bd8e57a55203 /utf8.h
parent: 5a48568dae7e81342fc2f8d0845423834f5c818f (diff)
download: perl-9495395586e6a655057cb766ed00213037dd06c0.tar.gz
1 files changed, 4 insertions, 1 deletions
diff --git a/utf8.h b/utf8.h
index a4cae099d8..3dde45a1dd 100644
--- a/utf8.h
+++ b/utf8.h
@@ -738,8 +738,11 @@ case any call to string overloading updates the internal UTF-8 encoding flag.
 #define UTF8_ALLOW_SHORT		0x0008
 #define UTF8_GOT_SHORT		        UTF8_ALLOW_SHORT
 
-/* Overlong sequence; i.e., the code point can be specified in fewer bytes. */
+/* Overlong sequence; i.e., the code point can be specified in fewer bytes.
+ * First one will convert the overlong to the REPLACEMENT CHARACTER; second
+ * will return what the overlong evaluates to */
 #define UTF8_ALLOW_LONG                 0x0010
+#define UTF8_ALLOW_LONG_AND_ITS_VALUE   (UTF8_ALLOW_LONG|0x0020)
 #define UTF8_GOT_LONG                   UTF8_ALLOW_LONG
 
 /* Currently no way to allow overflow */