author | Bruno Haible <bruno@clisp.org> | 2009-04-05 16:12:24 +0200 |
---|---|---|
committer | Bruno Haible <bruno@clisp.org> | 2009-04-05 16:12:24 +0200 |
commit | d6b5eb017aef649be89aec6dcd2519844f2bd91a (patch) | |
tree | 0ffe9ee2a24dc23aada781d4abf88f4965e47bd3 /doc/libunistring.texi | |
parent | 720fc85a3daaf654e7edb05ae3c652e7d9d4b52c (diff) | |
Add index entries.
Diffstat (limited to 'doc/libunistring.texi')
-rw-r--r-- | doc/libunistring.texi | 28 |
1 files changed, 25 insertions, 3 deletions
diff --git a/doc/libunistring.texi b/doc/libunistring.texi
index 6c907de..d0eff27 100644
--- a/doc/libunistring.texi
+++ b/doc/libunistring.texi
@@ -248,6 +248,8 @@ case folding
 regular expressions (not yet implemented)
 @end table
+@cindex use cases
+@cindex value, of libunistring
 libunistring is for you if your application involves non-trivial text processing, such as upper/lower case conversions, line breaking, operations on words, or more advanced analysis of text. Text provided by the user can,
@@ -274,6 +276,7 @@ internal in-memory representation.
 @node Unicode
 @section Unicode
+@cindex Unicode
 Unicode is a standardized repertoire of characters that contains characters from all scripts of the world, from Latin letters to Chinese ideographs and Babylonian cuneiform glyphs. It also specifies how these characters
@@ -283,6 +286,10 @@ to behave on Unicode text.
 Unicode also specifies three ways of storing sequences of Unicode characters in a computer whose basic unit of data is an 8-bit byte:
+@cindex UTF-8
+@cindex UTF-16
+@cindex UTF-32
+@cindex UCS-4
 @table @asis
 @item UTF-8
 Every character is represented as 1 to 4 bytes.
@@ -320,6 +327,7 @@ Markus Kuhn's UTF-8 and Unicode FAQ:
 @node Unicode and i18n
 @section Unicode and Internationalization
+@cindex internationalization
 Internationalization is the process of changing the source code of a program so that it can meet the expectations of users in any culture, if culture specific data (translations, images etc.) are provided.
@@ -352,12 +360,14 @@ POSIX APIs and the implementation of locales in the GNU C library.
 @node Locale encodings
 @section Locale encodings
+@cindex locale
 A locale is a set of cultural conventions. According to POSIX, for a program, at any moment, there is one locale being designated as the ``current locale''. (Actually, POSIX supports also one locale per thread, but this feature is not
-yet universally implemented and not widely used.) The locale is partitioned
-into several aspects, called the ``categories'' of the locale. The main
-various aspects are:
+yet universally implemented and not widely used.)
+@cindex locale categories
+The locale is partitioned into several aspects, called the ``categories''
+of the locale. The main various aspects are:
 @itemize
 @item
 The character encoding and the character properties. This is the
@@ -377,6 +387,7 @@ category.
 The formatting of date and time. This is the @code{LC_TIME} category.
 @end itemize
+@cindex locale encoding
 In particular, the @code{LC_CTYPE} category of the current locale determines the character encoding. This is the encoding of @samp{char *} strings. We also call it the ``locale encoding''. GNU libunistring has a function,
@@ -425,6 +436,7 @@ see @ref{The wchar_t mess}.
 @node char * strings
 @section @samp{char *} strings
+@cindex C string functions
 The classical C strings, with its C library support standardized by ISO C and POSIX, can be used in internationalized programs with some precautions. The problem with this API is that many of the C library
@@ -432,6 +444,7 @@ functions for strings don't work correctly on strings in locale encodings,
 leading to bugs that only people in some cultures of the world will experience.
+@cindex locale, multibyte
 The first problem with the C library API is the support of multibyte locales. According to the locale encoding, in general, every character is represented by one or more bytes (up to 4 bytes in practice --- but
@@ -442,6 +455,7 @@ to realize that the majority of Unix installations nowadays use UTF-8
 or GB18030 as locale encoding; therefore, the majority of users are using multibyte locales.
+@cindex char, type
 The important fact to remember is:
 @cartouche
 @emph{A @samp{char} is a byte, not a character.}
@@ -552,6 +566,7 @@ This is implemented in this library, through the functions declared in @code{<un
 @node The wchar_t mess
 @section The @code{wchar_t} mess
+@cindex wchar_t, type
 The ISO C and POSIX standard creators made an attempt to fix the first problem mentioned in the previous section. They introduced
 @itemize
 @item
@@ -604,6 +619,9 @@ the program to produce garbage or abort.
 @section Unicode strings
 libunistring supports Unicode strings in three representations:
+@cindex UTF-8, strings
+@cindex UTF-16, strings
+@cindex UTF-32, strings
 @itemize
 @item
 UTF-8 strings, through the type @samp{uint8_t *}. The units are bytes
@@ -636,6 +654,7 @@ zero-valued unit used as ``end marker''.
 This chapter explains conventions valid throughout the libunistring library.
+@cindex argument conventions
 Variables of type @code{char *} denote C strings in locale encoding. See @ref{Locale encodings}.
@@ -674,6 +693,7 @@ All parameters starting with @samp{str} and the parameters of functions
 starting with @code{u8_str}/@code{u16_str}/@code{u32_str} denote a NUL terminated string.
+@cindex return value conventions
 Error values are always returned through the @code{errno} variable, usually with a return value that indicates the presence of an error (NULL for functions that return an pointer, or -1 for functions that
@@ -704,9 +724,11 @@ NULL is returned and @code{errno} is set.
 @node More functionality
 @chapter More advanced functionality
+@cindex bidirectional reordering
 For bidirectional reordering of strings, we recommend the GNU FriBidi library: @url{http://www.fribidi.org/}.
+@cindex rendering
 For the rendering of Unicode strings outside of the context of a given toolkit (KDE/Qt or GNOME/Gtk), we recommend the Pango library: @url{http://www.pango.org/}.