Documentation updates.

Mostly based on feedback by Richard Stallman <rms@gnu.org>.
author: Bruno Haible <bruno@clisp.org> 2017-12-11 03:16:16 +0100
committer: Bruno Haible <bruno@clisp.org> 2017-12-11 03:16:42 +0100
commit: 66423d10dedd2e1391cac7031bb00271694fafcb (patch)
tree: 09240fc93dadfa82ff93e7a69526db5ffcd5cc83 /doc/wchar_t.texi
parent: b227d76bef2ac9939548d2ed0b3cba8ac5a9ef3c (diff)
download: libunistring-66423d10dedd2e1391cac7031bb00271694fafcb.tar.gz
1 files changed, 51 insertions, 0 deletions
diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 0000000..f5c239a
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The creators of the ISO C and POSIX standards attempted to fix the first
+problem mentioned in the section @ref{char * strings}.  They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementations have numerous problems:
+
+@itemize @bullet
+@item
+On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
+means that it cannot represent every Unicode character.  Either
+the @code{wchar_t *} strings are limited to characters in UCS-2 (the
+``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
+strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
+of a character in the worst case, making the @posixheader{wctype.h} functions
+pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented.  This means that if you want to know any property of a
+@code{wchar_t} character other than those defined by
+@posixheader{wctype.h} (such as whether it's a dash, currency symbol,
+paragraph separator, or similar), you first have to convert it to
+@code{char *} encoding with the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters, through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
+not in the expected encoding, you have no way to locate the invalid
+byte sequence and take corrective action.  If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}.  Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing.  They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
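
The first problem listed in the patch (a 16-bit wchar_t holding only half of a character when strings are UTF-16) can be illustrated with a small sketch. It is not part of the commit or of libunistring: the helper names utf16_encode and utf16_is_surrogate are hypothetical, and uint16_t stands in for a 16-bit wchar_t so that the code behaves the same on every platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libunistring function): encode one Unicode
   code point as UTF-16.  Returns the number of 16-bit units written to
   out (1 or 2), or 0 if cp is not a valid Unicode scalar value.  */
static size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;
  if (cp < 0x10000)
    {
      /* Inside the Basic Multilingual Plane: one unit is enough.  */
      out[0] = (uint16_t) cp;
      return 1;
    }
  /* Outside the BMP: split the code point into a surrogate pair.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}

/* Hypothetical helper: returns 1 if unit is a surrogate, that is, only
   one half of a character, and 0 otherwise.  */
static int
utf16_is_surrogate (uint16_t unit)
{
  return unit >= 0xD800 && unit <= 0xDFFF;
}
```

A code point such as U+20AC fits in one unit, whereas U+1F600 needs two. Each of those two units is a surrogate, so a per-unit classification function in the style of wctype.h has no complete character to examine.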