Documentation updates.

Mostly based on feedback by Richard Stallman <rms@gnu.org>.
author: Bruno Haible <bruno@clisp.org> 2017-12-11 03:16:16 +0100
committer: Bruno Haible <bruno@clisp.org> 2017-12-11 03:16:42 +0100
commit: 66423d10dedd2e1391cac7031bb00271694fafcb
tree: 09240fc93dadfa82ff93e7a69526db5ffcd5cc83
parent: b227d76bef2ac9939548d2ed0b3cba8ac5a9ef3c
10 files changed, 393 insertions, 198 deletions
diff --git a/ChangeLog b/ChangeLog
index a09ef60..f8c4408 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,18 @@
+2017-12-10  Bruno Haible  <bruno@clisp.org>
+
+	Documentation updates.
+	Mostly based on feedback by Richard Stallman <rms@gnu.org>.
+	* doc/wchar_t.texi: New file, extracted from doc/libunistring.texi.
+	* doc/Makefile.am (libunistring_TEXINFOS): Add it.
+	* doc/libunistring.texi: Move "The wchar_t mess" section to an appendix.
+	* doc/unitypes.texi: Explain difference between uint32_t and ucs4_t.
+	* doc/unistr.texi (Elementary string functions,
+	Elementary string functions on NUL terminated strings): Add subsection
+	structure.
+	* doc/unictype.texi (Object oriented API): Explain each general category
+	once only.
+	* doc/unistdio.texi, doc/uninorm.texi, doc/unicase.texi: Small changes.
+
 2017-11-30  Daiki Ueno  <ueno@gnu.org>
 
 	* version.sh: Bump version number and date.
diff --git a/doc/Makefile.am b/doc/Makefile.am
index bca3aeb..de7647e 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -1,5 +1,5 @@
 ## Makefile for the doc subdirectory of GNU libunistring.
-## Copyright (C) 2009, 2011 Free Software Foundation, Inc.
+## Copyright (C) 2009, 2011, 2017 Free Software Foundation, Inc.
 ##
 ## This program is free software: you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
@@ -34,7 +34,7 @@ info_TEXINFOS = libunistring.texi
 libunistring_TEXINFOS = \
   unitypes.texi unistr.texi uniconv.texi unistdio.texi uniname.texi \
   unictype.texi uniwidth.texi unigbrk.texi uniwbrk.texi unilbrk.texi \
-  uninorm.texi unicase.texi uniregex.texi \
+  uninorm.texi unicase.texi uniregex.texi wchar_t.texi \
   gpl.texi lgpl.texi fdl.texi
 
 # The dependencies of stamp-vti generated by automake are incomplete.
diff --git a/doc/libunistring.texi b/doc/libunistring.texi
index a9c7e0f..d1639dc 100644
--- a/doc/libunistring.texi
+++ b/doc/libunistring.texi
@@ -31,7 +31,19 @@
 @include version.texi
 
 @c Location of the POSIX specification on the web.
-@set POSIXURL http://www.opengroup.org/onlinepubs/9699919799
+@set POSIXURL http://pubs.opengroup.org/onlinepubs/9699919799
+
+@c Macro for referencing a POSIX header.
+@ifinfo
+@macro posixheader{header}
+@code{<\header\>}
+@end macro
+@end ifinfo
+@ifnotinfo
+@macro posixheader{header}
+@uref{@value{POSIXURL}/basedefs/\header\.html,,@code{<\header\>}}
+@end macro
+@end ifnotinfo
 
 @c Macro for referencing a POSIX function.
 @c We don't write it as func(), see section "GNU Manuals" of the
@@ -166,6 +178,7 @@ A copy of the license is included in @ref{GNU GPL}.
 * uniregex.h::                  Regular expressions
 * Using the library::           How to link with the library and use it?
 * More functionality::          More advanced functionality
+* The wchar_t mess::            Why @code{wchar_t *} strings are useless
 * Licenses::                    Licenses
 
 * Index::                       General Index
@@ -180,7 +193,6 @@ Introduction
 * Locale encodings::            What is a locale encoding?
 * In-memory representation::    How to represent strings in memory?
 * char * strings::              What to keep in mind with @code{char *} strings
-* The wchar_t mess::            Why @code{wchar_t *} strings are useless
 * Unicode strings::             How are Unicode strings represented?
 
 unistr.h
@@ -191,6 +203,26 @@ unistr.h
 * Elementary string functions with memory allocation::
 * Elementary string functions on NUL terminated strings::
 
+Elementary string functions
+
+* Iterating::
+* Creating Unicode strings::
+* Copying Unicode strings::
+* Comparing Unicode strings::
+* Searching for a character::
+* Counting characters::
+
+Elementary string functions on NUL terminated strings
+
+* Iterating over a NUL terminated Unicode string::
+* Length::
+* Copying a NUL terminated Unicode string::
+* Comparing NUL terminated Unicode strings::
+* Duplicating a NUL terminated Unicode string::
+* Searching for a character in a NUL terminated Unicode string::
+* Searching for a substring::
+* Tokenizing::
+
 unictype.h
 
 * General category::
@@ -304,8 +336,8 @@ in general, contain characters of all kinds of scripts.  The text processing
 functions provided by this library handle all scripts and all languages.
 
 libunistring is for you if your application already uses the ISO C / POSIX
-@code{<ctype.h>}, @code{<wctype.h>} functions and the text it operates on is
-provided by the user and can be in any language.
+@posixheader{ctype.h}, @posixheader{wctype.h} functions and the text it
+operates on is provided by the user and can be in any language.
 
 libunistring is also for you if your application uses Unicode strings as
 internal in-memory representation.
@@ -390,7 +422,7 @@ in multiple languages present in the same document or even in the same line
 of text.
 
 But use of Unicode is not everything.  Internationalization usually consists
-of three features:
+of four features:
 @itemize @bullet
 @item
 Use of Unicode where needed for text processing.  This is what this library
@@ -402,6 +434,10 @@ GNU gettext is about.
 Use of locale specific conventions for date and time formats, for numeric
 formatting, or for sorting of text.  This can be done adequately with the
 POSIX APIs and the implementation of locales in the GNU C library.
+@item
+In graphical user interfaces, adapting the GUI to the default text direction
+of the current locale (see
+@url{https://en.wikipedia.org/wiki/Right-to-left,right-to-left languages}).
 @end itemize
 
 @node Locale encodings
@@ -415,7 +451,7 @@ yet universally implemented and not widely used.)
 @cindex locale categories
 The locale is partitioned into several aspects, called the ``categories''
 of the locale.  The main various aspects are:
-@itemize
+@itemize @bullet
 @item
 The character encoding and the character properties.  This is the
 @code{LC_CTYPE} category.
@@ -453,7 +489,7 @@ this country earlier.
 
 The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in
 most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in
-many places, though.
+some places, though.
 
 UTF-16 and UTF-32 are not used as locale encodings, because they are not
 ASCII compatible.
@@ -463,7 +499,7 @@ ASCII compatible.
 
 There are three ways of representing strings in memory of a running
 program.
-@itemize
+@itemize @bullet
 @item
 As @samp{char *} strings.  Such strings are represented in locale encoding.
 This approach is employed when not much text processing is done by the
@@ -480,6 +516,21 @@ As @samp{wchar_t *}, a.k.a@. ``wide strings''.  This approach is misguided,
 see @ref{The wchar_t mess}.
 @end itemize
 
+Of course, a @samp{char *} string can, in some cases, be encoded in UTF-8.
+You will use the data type depending on what you can guarantee about how
+it's encoded: If a string is encoded in the locale encoding, or if you
+don't know how it's encoded, use @samp{char *}.  If, on the other hand,
+you can @emph{guarantee} that it is UTF-8 encoded, then you can use the
+UTF-8 string type, @code{uint8_t *}, for it.
+
+The five types @code{char *}, @code{uint8_t *}, @code{uint16_t *},
+@code{uint32_t *}, and @code{wchar_t *} are incompatible types at the C
+level.  Therefore, @samp{gcc -Wall} will produce a warning if, by mistake,
+your code contains a mismatch between these types.  In the context of
+using GNU libunistring, even a warning about a mismatch between
+@code{char *} and @code{uint8_t *} is a sign of a bug in your code
+that you should not try to silence through a cast.
+
 @node char * strings
 @section @samp{char *} strings
 
@@ -509,9 +560,9 @@ The important fact to remember is:
 @end cartouche
 
 As a consequence:
-@itemize
+@itemize @bullet
 @item
-The @code{<ctype.h>} API is useless in this context; it does not work in
+The @posixheader{ctype.h} API is useless in this context; it does not work in
 multibyte locales.
 @item
 The @posixfunc{strlen} function does not return the number of characters
@@ -546,7 +597,7 @@ functions do not work with multibyte strings.
 
 The workarounds can be found in GNU gnulib
 @url{http://www.gnu.org/software/gnulib/}.
-@itemize
+@itemize @bullet
 @item
 gnulib has modules @samp{mbchar}, @samp{mbiter}, @samp{mbuiter} that
 represent multibyte characters and allow to iterate across a multibyte
@@ -577,7 +628,7 @@ preferable to these functions; see below.
 @end itemize
 
 The second problem with the C library API is that it has some assumptions built-in that are not valid in some languages:
-@itemize
+@itemize @bullet
 @item
 It assumes that there are only two forms of every character: uppercase
 and lowercase.  This is not true for Croatian, where the character
@@ -611,58 +662,6 @@ rather than on characters.
 
 This is implemented in this library, through the functions declared in @code{<unicase.h>}, see @ref{unicase.h}.
 
-@node The wchar_t mess
-@section The @code{wchar_t} mess
-
-@cindex wchar_t, type
-The ISO C and POSIX standard creators made an attempt to fix the first
-problem mentioned in the previous section.  They introduced
-@itemize
-@item
-a type @samp{wchar_t}, designed to encapsulate an entire character,
-@item
-a ``wide string'' type @samp{wchar_t *}, and
-@item
-functions declared in @code{<wctype.h>} that were meant to supplant the
-ones in @code{<ctype.h>}.
-@end itemize
-
-Unfortunately, this API and its implementation has numerous problems:
-
-@itemize
-@item
-On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
-means that it can never accommodate an entire Unicode character.  Either
-the @code{wchar_t *} strings are limited to characters in UCS-2 (the
-``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
-strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
-of a character in the worst case, making the @code{<wctype.h>} functions
-pointless.
-
-@item
-On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
-and undocumented.  This means, if you want to know any property of a
-@code{wchar_t} character, other than the properties defined by
-@code{<wctype.h>} --- such as whether it's a dash, currency symbol,
-paragraph separator, or similar ---, you have to convert it to
-@code{char *} encoding first, by use of the function @posixfunc{wctomb}.
-
-@item
-When you read a stream of wide characters, through the functions
-@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
-not in the expected encoding, you have no way to determine the invalid
-byte sequence and do some corrective action.  If you use these
-functions, your program becomes ``garbage in - more garbage out'' or
-``garbage in - abort''.
-@end itemize
-
-As a consequence, it is better to use multibyte strings, as explained in
-the previous section.  Such multibyte strings can bypass limitations
-of the @code{wchar_t} type, if you use functions defined in gnulib and
-libunistring for text processing.  They can also faithfully transport
-malformed characters that were present in the input, without requiring
-the program to produce garbage or abort.
-
 @node Unicode strings
 @section Unicode strings
 
@@ -670,7 +669,7 @@ libunistring supports Unicode strings in three representations:
 @cindex UTF-8, strings
 @cindex UTF-16, strings
 @cindex UTF-32, strings
-@itemize
+@itemize @bullet
 @item
 UTF-8 strings, through the type @samp{uint8_t *}.  The units are bytes
 (@code{uint8_t}).
@@ -683,7 +682,7 @@ memory words (@code{uint32_t}).
 @end itemize
 
 As with C strings, there are two variants:
-@itemize
+@itemize @bullet
 @item
 Unicode strings with a terminating NUL character are represented as
 a pointer to the first unit of the string.  There is a unit containing
@@ -928,6 +927,8 @@ For the rendering of Unicode strings outside of the context of a given toolkit
 (KDE/Qt or GNOME/Gtk), we recommend the Pango library:
 @url{http://www.pango.org/}.
 
+@include wchar_t.texi
+
 @node Licenses
 @appendix Licenses
 @cindex Licenses
@@ -939,7 +940,7 @@ particular file or directory.  Here is a summary:
 @item
 The @code{libunistring} library and its header files are dual-licensed under
 "the GNU LGPLv3+ or the GNU GPLv2". This means, you can use it under either
-@itemize
+@itemize @bullet
 @item @minus{}
 the terms of the GNU Lesser General Public License (LGPL) version 3 or
 (at your option) any later version, or
diff --git a/doc/unicase.texi b/doc/unicase.texi
index e88a0a4..8dac4a4 100644
--- a/doc/unicase.texi
+++ b/doc/unicase.texi
@@ -106,6 +106,9 @@ Returns the uppercase mapping of a string.
 
 The @var{nf} argument identifies the normalization form to apply after the
 case-mapping.  It can also be NULL, for no normalization.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u8_tolower (const uint8_t *@var{s}, size_t @var{n}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@@ -115,6 +118,9 @@ Returns the lowercase mapping of a string.
 
 The @var{nf} argument identifies the normalization form to apply after the
 case-mapping.  It can also be NULL, for no normalization.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u8_totitle (const uint8_t *@var{s}, size_t @var{n}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@@ -128,6 +134,9 @@ are being mapped to lower case.
 
 The @var{nf} argument identifies the normalization form to apply after the
 case-mapping.  It can also be NULL, for no normalization.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @node Case mappings of substrings
@@ -200,6 +209,9 @@ prefix context and the suffix context.
 @deftypefunx {uint32_t *} u32_ct_toupper (const uint32_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the uppercase mapping of a string that is surrounded by a prefix
 and a suffix.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u8_ct_tolower (const uint8_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@@ -207,6 +219,9 @@ and a suffix.
 @deftypefunx {uint32_t *} u32_ct_tolower (const uint32_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the lowercase mapping of a string that is surrounded by a prefix
 and a suffix.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u8_ct_totitle (const uint8_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@@ -214,6 +229,9 @@ and a suffix.
 @deftypefunx {uint32_t *} u32_ct_totitle (const uint32_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the titlecase mapping of a string that is surrounded by a prefix
 and a suffix.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 For example, to uppercase the UTF-8 substring between @code{s + start_index}
@@ -249,6 +267,9 @@ with the @code{u8_cmp2} function is equivalent to comparing @var{s1} and
 
 The @var{nf} argument identifies the normalization form to apply after the
 case-mapping.  It can also be NULL, for no normalization.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u8_ct_casefold (const uint8_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@@ -256,6 +277,9 @@ case-mapping.  It can also be NULL, for no normalization.
 @deftypefunx {uint32_t *} u32_ct_casefold (const uint32_t *@var{s}, size_t @var{n}, casing_prefix_context_t @var{prefix_context}, casing_suffix_context_t @var{suffix_context}, const char *@var{iso639_language}, uninorm_t @var{nf}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the case folded string.  The case folding takes into account the
 case mapping contexts of the prefix and suffix strings.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun int u8_casecmp (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, const char *@var{iso639_language}, uninorm_t @var{nf}, int *@var{resultp})
@@ -290,6 +314,9 @@ equivalent to comparing @var{s1} and @var{s2} with @code{u8_casecoll}.
 
 @var{nf} must be either @code{UNINORM_NFC}, @code{UNINORM_NFKC}, or NULL for
 no normalization.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun int u8_casecoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, const char *@var{iso639_language}, uninorm_t @var{nf}, int *@var{resultp})
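
Note on the hunks above: each added paragraph points to the @var{resultbuf} / @var{lengthp} convention without showing it in use. The following is a minimal sketch of that calling convention as it is commonly understood: the caller passes a buffer and its capacity, and gets back either that buffer or a freshly allocated one. It assumes a hypothetical stand-in function, demo_ascii_toupper, which is not part of libunistring and handles ASCII letters only; demo_caller is likewise illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in, NOT a libunistring function.  It uppercases
   ASCII letters only and copies every other unit unchanged, but it
   follows the resultbuf/lengthp calling convention:
   - On entry, *lengthp is the capacity of resultbuf (ignored if
     resultbuf is NULL).
   - If the result fits, it is stored in resultbuf, which is returned.
   - Otherwise a malloc'ed buffer is returned; the caller frees it.
   - On success, *lengthp is set to the length of the result.
   - On allocation failure, NULL is returned.  */
uint8_t *
demo_ascii_toupper (const uint8_t *s, size_t n,
                    uint8_t *resultbuf, size_t *lengthp)
{
  assert (lengthp != NULL);

  uint8_t *result;
  if (resultbuf != NULL && *lengthp >= n)
    result = resultbuf;
  else
    {
      result = malloc (n > 0 ? n : 1);
      if (result == NULL)
        return NULL;
    }

  for (size_t i = 0; i < n; i++)
    result[i] = (s[i] >= 'a' && s[i] <= 'z'
                 ? (uint8_t) (s[i] - 'a' + 'A')
                 : s[i]);

  *lengthp = n;
  return result;
}

/* Typical caller: try a small stack buffer first, and free the result
   only if the function had to allocate.  Returns 1 if the stack buffer
   sufficed, 0 if the result was heap-allocated, -1 on out-of-memory.  */
int
demo_caller (const uint8_t *s, size_t n)
{
  uint8_t stackbuf[8];
  size_t length = sizeof stackbuf;
  uint8_t *result = demo_ascii_toupper (s, n, stackbuf, &length);
  if (result == NULL)
    return -1;

  int used_stackbuf = (result == stackbuf);
  /* ... use result[0] .. result[length - 1] here ... */
  if (result != stackbuf)
    free (result);
  return used_stackbuf;
}
```

With the real library, functions such as u8_toupper take the same final two arguments, so the caller-side pattern (initialize the length to the buffer capacity, compare the returned pointer against the buffer, free only on mismatch) should carry over unchanged.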
diff --git a/doc/unictype.texi b/doc/unictype.texi
index 5f292cc..7fbeaa5 100644
--- a/doc/unictype.texi
+++ b/doc/unictype.texi
@@ -65,200 +65,199 @@ not an array type.
 The following are the predefined general category value.  Additional general
 categories may be added in the future.
 
-@deftypevr Constant uc_general_category_t UC_CATEGORY_L
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_LC
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Lu
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Ll
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Lt
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Lm
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Lo
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_M
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Mn
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Mc
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Me
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_N
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Nd
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Nl
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_No
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_P
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Pc
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Pd
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Ps
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Pe
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Pi
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Pf
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Po
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_S
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Sm
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Sc
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Sk
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_So
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Z
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Zs
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Zl
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Zp
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_C
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Cc
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Cf
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Cs
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Co
-@deftypevrx Constant uc_general_category_t UC_CATEGORY_Cn
-@end deftypevr
+The @code{UC_CATEGORY_*} constants reflect the systematic general category
+values assigned by the Unicode Consortium.  The other @code{UC_*} macros
+are aliases, for use when readable code is preferred.
 
-The following are alias names for predefined General category values.
-
-@deftypevr Macro uc_general_category_t UC_LETTER
-This is another name for @code{UC_CATEGORY_L}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_L
+@deftypevrx Macro uc_general_category_t UC_LETTER
+This represents the general category ``Letter''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_CASED_LETTER
-This is another name for @code{UC_CATEGORY_LC}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_LC
+@deftypevrx Macro uc_general_category_t UC_CASED_LETTER
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_UPPERCASE_LETTER
-This is another name for @code{UC_CATEGORY_Lu}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Lu
+@deftypevrx Macro uc_general_category_t UC_UPPERCASE_LETTER
+This represents the general category ``Letter, uppercase''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_LOWERCASE_LETTER
-This is another name for @code{UC_CATEGORY_Ll}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Ll
+@deftypevrx Macro uc_general_category_t UC_LOWERCASE_LETTER
+This represents the general category ``Letter, lowercase''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_TITLECASE_LETTER
-This is another name for @code{UC_CATEGORY_Lt}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Lt
+@deftypevrx Macro uc_general_category_t UC_TITLECASE_LETTER
+This represents the general category ``Letter, titlecase''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_MODIFIER_LETTER
-This is another name for @code{UC_CATEGORY_Lm}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Lm
+@deftypevrx Macro uc_general_category_t UC_MODIFIER_LETTER
+This represents the general category ``Letter, modifier''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OTHER_LETTER
-This is another name for @code{UC_CATEGORY_Lo}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Lo
+@deftypevrx Macro uc_general_category_t UC_OTHER_LETTER
+This represents the general category ``Letter, other''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_MARK
-This is another name for @code{UC_CATEGORY_M}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_M
+@deftypevrx Macro uc_general_category_t UC_MARK
+This represents the general category ``Mark''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_NON_SPACING_MARK
-This is another name for @code{UC_CATEGORY_Mn}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Mn
+@deftypevrx Macro uc_general_category_t UC_NON_SPACING_MARK
+This represents the general category ``Mark, nonspacing''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_COMBINING_SPACING_MARK
-This is another name for @code{UC_CATEGORY_Mc}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Mc
+@deftypevrx Macro uc_general_category_t UC_COMBINING_SPACING_MARK
+This represents the general category ``Mark, spacing combining''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_ENCLOSING_MARK
-This is another name for @code{UC_CATEGORY_Me}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Me
+@deftypevrx Macro uc_general_category_t UC_ENCLOSING_MARK
+This represents the general category ``Mark, enclosing''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_NUMBER
-This is another name for @code{UC_CATEGORY_N}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_N
+@deftypevrx Macro uc_general_category_t UC_NUMBER
+This represents the general category ``Number''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_DECIMAL_DIGIT_NUMBER
-This is another name for @code{UC_CATEGORY_Nd}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Nd
+@deftypevrx Macro uc_general_category_t UC_DECIMAL_DIGIT_NUMBER
+This represents the general category ``Number, decimal digit''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_LETTER_NUMBER
-This is another name for @code{UC_CATEGORY_Nl}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Nl
+@deftypevrx Macro uc_general_category_t UC_LETTER_NUMBER
+This represents the general category ``Number, letter''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OTHER_NUMBER
-This is another name for @code{UC_CATEGORY_No}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_No
+@deftypevrx Macro uc_general_category_t UC_OTHER_NUMBER
+This represents the general category ``Number, other''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_PUNCTUATION
-This is another name for @code{UC_CATEGORY_P}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_P
+@deftypevrx Macro uc_general_category_t UC_PUNCTUATION
+This represents the general category ``Punctuation''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_CONNECTOR_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Pc}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Pc
+@deftypevrx Macro uc_general_category_t UC_CONNECTOR_PUNCTUATION
+This represents the general category ``Punctuation, connector''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_DASH_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Pd}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Pd
+@deftypevrx Macro uc_general_category_t UC_DASH_PUNCTUATION
+This represents the general category ``Punctuation, dash''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OPEN_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Ps} (``start punctuation'').
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Ps
+@deftypevrx Macro uc_general_category_t UC_OPEN_PUNCTUATION
+This represents the general category ``Punctuation, open'', a.k.a@. ``start punctuation''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_CLOSE_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Pe} (``end punctuation'').
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Pe
+@deftypevrx Macro uc_general_category_t UC_CLOSE_PUNCTUATION
+This represents the general category ``Punctuation, close'', a.k.a@. ``end punctuation''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Pi}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Pi
+@deftypevrx Macro uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION
+This represents the general category ``Punctuation, initial quote''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Pf}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Pf
+@deftypevrx Macro uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION
+This represents the general category ``Punctuation, final quote''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OTHER_PUNCTUATION
-This is another name for @code{UC_CATEGORY_Po}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Po
+@deftypevrx Macro uc_general_category_t UC_OTHER_PUNCTUATION
+This represents the general category ``Punctuation, other''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_SYMBOL
-This is another name for @code{UC_CATEGORY_S}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_S
+@deftypevrx Macro uc_general_category_t UC_SYMBOL
+This represents the general category ``Symbol''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_MATH_SYMBOL
-This is another name for @code{UC_CATEGORY_Sm}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Sm
+@deftypevrx Macro uc_general_category_t UC_MATH_SYMBOL
+This represents the general category ``Symbol, math''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_CURRENCY_SYMBOL
-This is another name for @code{UC_CATEGORY_Sc}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Sc
+@deftypevrx Macro uc_general_category_t UC_CURRENCY_SYMBOL
+This represents the general category ``Symbol, currency''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_MODIFIER_SYMBOL
-This is another name for @code{UC_CATEGORY_Sk}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Sk
+@deftypevrx Macro uc_general_category_t UC_MODIFIER_SYMBOL
+This represents the general category ``Symbol, modifier''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OTHER_SYMBOL
-This is another name for @code{UC_CATEGORY_So}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_So
+@deftypevrx Macro uc_general_category_t UC_OTHER_SYMBOL
+This represents the general category ``Symbol, other''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_SEPARATOR
-This is another name for @code{UC_CATEGORY_Z}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Z
+@deftypevrx Macro uc_general_category_t UC_SEPARATOR
+This represents the general category ``Separator''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_SPACE_SEPARATOR
-This is another name for @code{UC_CATEGORY_Zs}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Zs
+@deftypevrx Macro uc_general_category_t UC_SPACE_SEPARATOR
+This represents the general category ``Separator, space''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_LINE_SEPARATOR
-This is another name for @code{UC_CATEGORY_Zl}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Zl
+@deftypevrx Macro uc_general_category_t UC_LINE_SEPARATOR
+This represents the general category ``Separator, line''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_PARAGRAPH_SEPARATOR
-This is another name for @code{UC_CATEGORY_Zp}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Zp
+@deftypevrx Macro uc_general_category_t UC_PARAGRAPH_SEPARATOR
+This represents the general category ``Separator, paragraph''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_OTHER
-This is another name for @code{UC_CATEGORY_C}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_C
+@deftypevrx Macro uc_general_category_t UC_OTHER
+This represents the general category ``Other''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_CONTROL
-This is another name for @code{UC_CATEGORY_Cc}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Cc
+@deftypevrx Macro uc_general_category_t UC_CONTROL
+This represents the general category ``Other, control''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_FORMAT
-This is another name for @code{UC_CATEGORY_Cf}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Cf
+@deftypevrx Macro uc_general_category_t UC_FORMAT
+This represents the general category ``Other, format''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_SURROGATE
-This is another name for @code{UC_CATEGORY_Cs}.  All code points in this
-category are invalid characters.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Cs
+@deftypevrx Macro uc_general_category_t UC_SURROGATE
+This represents the general category ``Other, surrogate''.
+All code points in this category are invalid characters.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_PRIVATE_USE
-This is another name for @code{UC_CATEGORY_Co}.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Co
+@deftypevrx Macro uc_general_category_t UC_PRIVATE_USE
+This represents the general category ``Other, private use''.
 @end deftypevr
 
-@deftypevr Macro uc_general_category_t UC_UNASSIGNED
-This is another name for @code{UC_CATEGORY_Cn}.  Some code points in this
-category are invalid characters.
+@deftypevr Constant uc_general_category_t UC_CATEGORY_Cn
+@deftypevrx Macro uc_general_category_t UC_UNASSIGNED
+This represents the general category ``Other, not assigned''.
+Some code points in this category are invalid characters.
 @end deftypevr
 
 The following functions combine general categories, like in a boolean algebra,
diff --git a/doc/uninorm.texi b/doc/uninorm.texi
index 5cad859..ad7a1da 100644
--- a/doc/uninorm.texi
+++ b/doc/uninorm.texi
@@ -209,6 +209,9 @@ The following functions apply a Unicode normalization form to a Unicode string.
 @deftypefunx {uint16_t *} u16_normalize (uninorm_t @var{nf}, const uint16_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
 @deftypefunx {uint32_t *} u32_normalize (uninorm_t @var{nf}, const uint32_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Returns the specified normalization form of a string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @node Normalizing comparisons
@@ -241,6 +244,9 @@ sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
 comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.
 
 @var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun int u8_normcoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
diff --git a/doc/unistdio.texi b/doc/unistdio.texi
index e1fb9cf..8f1a0a1 100644
--- a/doc/unistdio.texi
+++ b/doc/unistdio.texi
@@ -9,7 +9,7 @@ strings.  It defines a set of functions similar to @code{fprintf} and
 
 These functions work like the @code{printf} function family.
 In the format string:
-@itemize
+@itemize @bullet
 @item
 The format directive @samp{U} takes an UTF-8 string (@code{const uint8_t *}).
 @item
diff --git a/doc/unistr.texi b/doc/unistr.texi
index 60f1daa..da0f4da 100644
--- a/doc/unistr.texi
+++ b/doc/unistr.texi
@@ -35,31 +35,61 @@ The following functions perform conversions between the different forms of Unico
 
 @deftypefun {uint16_t *} u8_to_u16 (const uint8_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-8 string to an UTF-16 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint32_t *} u8_to_u32 (const uint8_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-8 string to an UTF-32 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u16_to_u8 (const uint16_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-16 string to an UTF-8 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint32_t *} u16_to_u32 (const uint16_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-16 string to an UTF-32 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint8_t *} u32_to_u8 (const uint32_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-32 string to an UTF-8 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @deftypefun {uint16_t *} u32_to_u16 (const uint32_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
 Converts an UTF-32 string to an UTF-16 string.
+
+The @var{resultbuf} and @var{lengthp} arguments are as described in
+chapter @ref{Conventions}.
 @end deftypefun
 
 @node Elementary string functions
 @section Elementary string functions
 
+@menu
+* Iterating::
+* Creating Unicode strings::
+* Copying Unicode strings::
+* Comparing Unicode strings::
+* Searching for a character::
+* Counting characters::
+@end menu
+
+@node Iterating
+@subsection Iterating over a Unicode string
+
 @cindex iterating
 The following functions inspect and return details about the first character
 in a Unicode string.
@@ -75,9 +105,9 @@ This function is similar to @posixfunc{mblen}, except that it operates on a
 Unicode string and that @var{s} must not be NULL.
 @end deftypefun
 
-@deftypefun int u8_mbtouc_unsafe (ucs4_t *@var{puc}, const uint8_t *@var{s}, size_t @var{n})
-@deftypefunx int u16_mbtouc_unsafe (ucs4_t *@var{puc}, const uint16_t *@var{s}, size_t @var{n})
-@deftypefunx int u32_mbtouc_unsafe (ucs4_t *@var{puc}, const uint32_t *@var{s}, size_t @var{n})
+@deftypefun int u8_mbtouc (ucs4_t *@var{puc}, const uint8_t *@var{s}, size_t @var{n})
+@deftypefunx int u16_mbtouc (ucs4_t *@var{puc}, const uint16_t *@var{s}, size_t @var{n})
+@deftypefunx int u32_mbtouc (ucs4_t *@var{puc}, const uint32_t *@var{s}, size_t @var{n})
 Returns the length (number of units) of the first character in @var{s},
 putting its @code{ucs4_t} representation in @code{*@var{puc}}.  Upon failure,
 @code{*@var{puc}} is set to @code{0xfffd}, and an appropriate number of units
@@ -85,17 +115,21 @@ is returned.
 
 The number of available units, @var{n}, must be > 0.
 
+This function fails if an invalid sequence of units is encountered at the
+beginning of @var{s}, or if additional units (after the @var{n} provided units)
+would be needed to form a character.
+
 This function is similar to @posixfunc{mbtowc}, except that it operates on a
 Unicode string, @var{puc} and @var{s} must not be NULL, @var{n} must be > 0,
 and the NUL character is not treated specially.
 @end deftypefun
 
-@deftypefun int u8_mbtouc (ucs4_t *@var{puc}, const uint8_t *@var{s}, size_t @var{n})
-@deftypefunx int u16_mbtouc (ucs4_t *@var{puc}, const uint16_t *@var{s}, size_t @var{n})
-@deftypefunx int u32_mbtouc (ucs4_t *@var{puc}, const uint32_t *@var{s}, size_t @var{n})
-This function is like @code{u8_mbtouc_unsafe}, except that it will detect an
-invalid UTF-8 character, even if the library is compiled without
-@option{--enable-safety}.
+@deftypefun int u8_mbtouc_unsafe (ucs4_t *@var{puc}, const uint8_t *@var{s}, size_t @var{n})
+@deftypefunx int u16_mbtouc_unsafe (ucs4_t *@var{puc}, const uint16_t *@var{s}, size_t @var{n})
+@deftypefunx int u32_mbtouc_unsafe (ucs4_t *@var{puc}, const uint32_t *@var{s}, size_t @var{n})
+This function is identical to @code{u8_mbtouc}/@code{u16_mbtouc}/@code{u32_mbtouc}.
+Earlier versions of this function performed fewer range-checks on the sequence
+of units.
 @end deftypefun
 
 @deftypefun int u8_mbtoucr (ucs4_t *@var{puc}, const uint8_t *@var{s}, size_t @var{n})
@@ -112,6 +146,9 @@ This function is similar to @code{u8_mbtouc}, except that the return value
 gives more details about the failure, similar to @posixfunc{mbrtowc}.
 @end deftypefun
 
+@node Creating Unicode strings
+@subsection Creating Unicode strings one character at a time
+
 The following function stores a Unicode character as a Unicode string in
 memory.
 
@@ -127,6 +164,9 @@ Unicode strings, @var{s} must not be NULL, and the argument @var{n} must be
 specified.
 @end deftypefun
 
+@node Copying Unicode strings
+@subsection Copying Unicode strings
+
 @cindex copying
 The following functions copy Unicode strings in memory.
 
@@ -161,6 +201,9 @@ This function is similar to @posixfunc{memset}, except that it operates on
 Unicode strings.
 @end deftypefun
 
+@node Comparing Unicode strings
+@subsection Comparing Unicode strings
+
 @cindex comparing
 The following function compares two Unicode strings of the same length.
 
@@ -191,6 +234,9 @@ This function is similar to the gnulib function @func{memcmp2}, except that it
 operates on Unicode strings.
 @end deftypefun
 
+@node Searching for a character
+@subsection Searching for a character in a Unicode string
+
 @cindex searching, for a character
 The following function searches for a given Unicode character.
 
@@ -205,6 +251,9 @@ This function is similar to @posixfunc{memchr}, except that it operates on
 Unicode strings.
 @end deftypefun
 
+@node Counting characters
+@subsection Counting the characters in a Unicode string
+
 @cindex counting
 The following function counts the number of Unicode characters.
 
@@ -233,6 +282,20 @@ Makes a freshly allocated copy of @var{s}, of length @var{n}.
 @node Elementary string functions on NUL terminated strings
 @section Elementary string functions on NUL terminated strings
 
+@menu
+* Iterating over a NUL terminated Unicode string::
+* Length::
+* Copying a NUL terminated Unicode string::
+* Comparing NUL terminated Unicode strings::
+* Duplicating a NUL terminated Unicode string::
+* Searching for a character in a NUL terminated Unicode string::
+* Searching for a substring::
+* Tokenizing::
+@end menu
+
+@node Iterating over a NUL terminated Unicode string
+@subsection Iterating over a NUL terminated Unicode string
+
 The following functions inspect and return details about the first character
 in a Unicode string.
 
@@ -273,6 +336,9 @@ Puts the character's @code{ucs4_t} representation in @code{*@var{puc}}.
 Note that this function works only on well-formed Unicode strings.
 @end deftypefun
 
+@node Length
+@subsection Length of a NUL terminated Unicode string
+
 The following functions determine the length of a Unicode string.
 
 @deftypefun size_t u8_strlen (const uint8_t *@var{s})
@@ -293,6 +359,9 @@ This function is similar to @posixfunc{strnlen} and @posixfunc{wcsnlen}, except
 that it operates on Unicode strings.
 @end deftypefun
 
+@node Copying a NUL terminated Unicode string
+@subsection Copying a NUL terminated Unicode string
+
 @cindex copying
 The following functions copy portions of Unicode strings in memory.
 
@@ -355,6 +424,9 @@ This function is similar to @posixfunc{strncat} and @posixfunc{wcsncat}, except
 that it operates on Unicode strings.
 @end deftypefun
 
+@node Comparing NUL terminated Unicode strings
+@subsection Comparing NUL terminated Unicode strings
+
 @cindex comparing
 The following functions compare two Unicode strings.
 
@@ -396,6 +468,9 @@ This function is similar to @posixfunc{strncmp} and @posixfunc{wcsncmp}, except
 that it operates on Unicode strings.
 @end deftypefun
 
+@node Duplicating a NUL terminated Unicode string
+@subsection Duplicating a NUL terminated Unicode string
+
 @cindex duplicating
 The following function allocates a duplicate of a Unicode string.
 
@@ -408,6 +483,9 @@ This function is similar to @posixfunc{strdup} and @posixfunc{wcsdup}, except
 that it operates on Unicode strings.
 @end deftypefun
 
+@node Searching for a character in a NUL terminated Unicode string
+@subsection Searching for a character in a NUL terminated Unicode string
+
 @cindex searching, for a character
 The following functions search for a given Unicode character.
 
@@ -461,6 +539,9 @@ This function is similar to @posixfunc{strpbrk} and @posixfunc{wcspbrk}, except
 that it operates on Unicode strings.
 @end deftypefun
 
+@node Searching for a substring
+@subsection Searching for a substring in a NUL terminated Unicode string
+
 @cindex searching, for a substring
 The following functions search whether a given Unicode string is a substring
 of another Unicode string.
@@ -486,6 +567,9 @@ Tests whether @var{str} starts with @var{prefix}.
 Tests whether @var{str} ends with @var{suffix}.
 @end deftypefun
 
+@node Tokenizing
+@subsection Tokenizing a NUL terminated Unicode string
+
 The following function does one step in tokenizing a Unicode string.
 
 @deftypefun {uint8_t *} u8_strtok (uint8_t *@var{str}, const uint8_t *@var{delim}, uint8_t **@var{ptr})
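
Note on the u8_mbtouc description in the doc/unistr.texi changes above: the documented contract (return the number of units of the first character, store U+FFFD in *puc on failure, fail on an invalid sequence or when more than n units would be needed) can be illustrated with a minimal sketch. The name sketch_u8_mbtouc and the exact return value chosen on failure are assumptions made for illustration; the text only promises "an appropriate number of units". This is not libunistring's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in that follows the documented u8_mbtouc contract.
   Returns the number of units consumed; on failure stores 0xFFFD in *puc. */
int sketch_u8_mbtouc(uint32_t *puc, const uint8_t *s, size_t n)
{
    uint8_t c;
    uint32_t uc, min;
    size_t len, i;

    assert(puc != NULL && s != NULL && n > 0);  /* preconditions from the text */
    c = s[0];

    if (c < 0x80) {                      /* ASCII: one unit */
        *puc = c;
        return 1;
    } else if (c >= 0xC2 && c <= 0xDF) { /* lead byte of a 2-unit sequence */
        len = 2; uc = c & 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {     /* lead byte of a 3-unit sequence */
        len = 3; uc = c & 0x0F; min = 0x800;
    } else if (c >= 0xF0 && c <= 0xF4) { /* lead byte of a 4-unit sequence */
        len = 4; uc = c & 0x07; min = 0x10000;
    } else {                             /* stray continuation or invalid lead */
        *puc = 0xFFFD;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i == n) {                    /* more units needed than provided */
            *puc = 0xFFFD;
            return (int) n;
        }
        if ((s[i] & 0xC0) != 0x80) {     /* not a continuation unit */
            *puc = 0xFFFD;
            return (int) i;
        }
        uc = (uc << 6) | (uint32_t) (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (uc < min || uc > 0x10FFFF || (uc >= 0xD800 && uc <= 0xDFFF)) {
        *puc = 0xFFFD;
        return (int) len;
    }
    *puc = uc;
    return (int) len;
}
```

A caller would typically advance its pointer by the return value and decrease n by the same amount, which is why the sketch always returns a positive count even on failure.
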
diff --git a/doc/unitypes.texi b/doc/unitypes.texi
index 696ba88..68ab92f 100644
--- a/doc/unitypes.texi
+++ b/doc/unitypes.texi
@@ -13,3 +13,15 @@ taken from @code{<stdint.h>}, on platforms where this include file is present.
 @deftp Type ucs4_t
 This type represents a single Unicode character, outside of an UTF-32 string.
 @end deftp
+
+The types @code{ucs4_t} and @code{uint32_t} happen to be identical.  They differ
+in use and intent, however:
+@itemize @bullet
+@item
+Use @code{uint32_t *} to designate an UTF-32 string.  Use @code{ucs4_t} to
+designate a single Unicode character, outside of an UTF-32 string.
+@item
+Conversion functions that take an UTF-32 string as input will usually perform
+a range-check on the @code{uint32_t} values, whereas functions that are
+declared to take @code{ucs4_t} arguments will not perform such a range-check.
+@end itemize
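
Note on the ucs4_t / uint32_t distinction added to doc/unitypes.texi above: a minimal sketch of the two roles, assuming a local typedef that mirrors the identity of the two types. The function names sketch_u32_is_valid and sketch_uc_is_ascii are hypothetical and exist only for this illustration; they are not libunistring functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: local typedef mirroring the statement that ucs4_t and
   uint32_t are the same type. */
typedef uint32_t ucs4_t;

/* Hypothetical string-level function: it receives an UTF-32 string
   (uint32_t *), so it range-checks every unit. */
bool sketch_u32_is_valid(const uint32_t *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (s[i] > 0x10FFFF)                  /* beyond the Unicode range */
            return false;
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF) /* surrogate code point */
            return false;
    }
    return true;
}

/* Hypothetical character-level function: it receives a single ucs4_t and
   performs no range-check, trusting the caller to pass a character. */
bool sketch_uc_is_ascii(ucs4_t uc)
{
    return uc < 0x80;
}
```

The signatures alone tell the reader which contract applies: a `uint32_t *` parameter is untrusted string data, while a `ucs4_t` parameter is assumed to already be a single character.
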
diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 0000000..f5c239a
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The ISO C and POSIX standard creators made an attempt to fix the first
+problem mentioned in the section @ref{char * strings}.  They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation have numerous problems:
+
+@itemize @bullet
+@item
+On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
+means that it cannot accommodate every Unicode character.  Either
+the @code{wchar_t *} strings are limited to characters in UCS-2 (the
+``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
+strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
+of a character in the worst case, making the @posixheader{wctype.h} functions
+pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented.  This means that if you want to know any property of a
+@code{wchar_t} character other than the properties defined by
+@posixheader{wctype.h} (such as whether it's a dash, currency symbol,
+paragraph separator, or similar), you have to convert it to
+@code{char *} encoding first, by use of the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters, through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
+not in the expected encoding, you have no way to determine the invalid
+byte sequence and take corrective action.  If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}.  Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing.  They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
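
Note on the 16-bit wchar_t item in doc/wchar_t.texi above: the claim that one 16-bit unit may hold only half of a character follows from how UTF-16 encodes code points above U+FFFF. A minimal sketch, using a hypothetical helper sketch_utf16_encode that is not part of any library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
   Stores the units in out[] and returns how many units were written. */
size_t sketch_utf16_encode(uint32_t uc, uint16_t out[2])
{
    /* Only valid code points: at most U+10FFFF and not a surrogate. */
    assert(uc <= 0x10FFFF && !(uc >= 0xD800 && uc <= 0xDFFF));

    if (uc < 0x10000) {
        /* Basic Multilingual Plane: fits in a single 16-bit unit. */
        out[0] = (uint16_t) uc;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    uc -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (uc >> 10));   /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (uc & 0x3FF)); /* low surrogate */
    return 2;
}
```

When the helper returns 2, neither unit is a character on its own, so a per-unit classification function in the style of those in wctype.h has nothing meaningful to classify.
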