diff options
author | Bruno Haible <bruno@clisp.org> | 2017-12-11 03:16:16 +0100 |
---|---|---|
committer | Bruno Haible <bruno@clisp.org> | 2017-12-11 03:16:42 +0100 |
commit | 66423d10dedd2e1391cac7031bb00271694fafcb (patch) | |
tree | 09240fc93dadfa82ff93e7a69526db5ffcd5cc83 /doc/wchar_t.texi | |
parent | b227d76bef2ac9939548d2ed0b3cba8ac5a9ef3c (diff) | |
download | libunistring-66423d10dedd2e1391cac7031bb00271694fafcb.tar.gz |
Documentation updates.
Mostly based on feedback by Richard Stallman <rms@gnu.org>.
Diffstat (limited to 'doc/wchar_t.texi')
-rw-r--r-- | doc/wchar_t.texi | 51 |
1 files changed, 51 insertions, 0 deletions
diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi new file mode 100644 index 0000000..f5c239a --- /dev/null +++ b/doc/wchar_t.texi @@ -0,0 +1,51 @@ +@node The wchar_t mess +@appendix The @code{wchar_t} mess + +@cindex wchar_t, type +The ISO C and POSIX standard creators made an attempt to fix the first +problem mentioned in the section @ref{char * strings}. They introduced +@itemize @bullet +@item +a type @samp{wchar_t}, designed to encapsulate an entire character, +@item +a ``wide string'' type @samp{wchar_t *}, and +@item +functions declared in @posixheader{wctype.h} that were meant to supplant the +ones in @posixheader{ctype.h}. +@end itemize + +Unfortunately, this API and its implementation has numerous problems: + +@itemize @bullet +@item +On AIX and Windows platforms, @code{wchar_t} is a 16-bit type. This +means that it can never accommodate an entire Unicode character. Either +the @code{wchar_t *} strings are limited to characters in UCS-2 (the +``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *} +strings are encoded in UTF-16 --- a @code{wchar_t} represents only half +of a character in the worst case, making the @posixheader{wctype.h} functions +pointless. + +@item +On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent +and undocumented. This means, if you want to know any property of a +@code{wchar_t} character, other than the properties defined by +@posixheader{wctype.h} --- such as whether it's a dash, currency symbol, +paragraph separator, or similar ---, you have to convert it to +@code{char *} encoding first, by use of the function @posixfunc{wctomb}. + +@item +When you read a stream of wide characters, through the functions +@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is +not in the expected encoding, you have no way to determine the invalid +byte sequence and do some corrective action. If you use these +functions, your program becomes ``garbage in - more garbage out'' or +``garbage in - abort''. +@end itemize + +As a consequence, it is better to use multibyte strings, as explained in +the section @ref{char * strings}. Such multibyte strings can bypass +limitations of the @code{wchar_t} type, if you use functions defined in gnulib +and libunistring for text processing. They can also faithfully transport +malformed characters that were present in the input, without requiring +the program to produce garbage or abort. |