doc/wchar_t.texi


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51

@node The wchar_t mess
@appendix The @code{wchar_t} mess

@cindex wchar_t, type
The ISO C and POSIX standard creators made an attempt to fix the first
problem mentioned in the section @ref{char * strings}.  They introduced
@itemize @bullet
@item
a type @samp{wchar_t}, designed to encapsulate an entire character,
@item
a ``wide string'' type @samp{wchar_t *}, and
@item
functions declared in @posixheader{wctype.h} that were meant to supplant the
ones in @posixheader{ctype.h}.
@end itemize

Unfortunately, this API and its implementation has numerous problems:

@itemize @bullet
@item
On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
means that it can never accommodate an entire Unicode character.  Either
the @code{wchar_t *} strings are limited to characters in UCS-2 (the
``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
of a character in the worst case, making the @posixheader{wctype.h} functions
pointless.

@item
On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
and undocumented.  This means, if you want to know any property of a
@code{wchar_t} character, other than the properties defined by
@posixheader{wctype.h} --- such as whether it's a dash, currency symbol,
paragraph separator, or similar ---, you have to convert it to
@code{char *} encoding first, by use of the function @posixfunc{wctomb}.

@item
When you read a stream of wide characters, through the functions
@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
not in the expected encoding, you have no way to determine the invalid
byte sequence and do some corrective action.  If you use these
functions, your program becomes ``garbage in - more garbage out'' or
``garbage in - abort''.
@end itemize

As a consequence, it is better to use multibyte strings, as explained in
the section @ref{char * strings}.  Such multibyte strings can bypass
limitations of the @code{wchar_t} type, if you use functions defined in gnulib
and libunistring for text processing.  They can also faithfully transport
malformed characters that were present in the input, without requiring
the program to produce garbage or abort.