author | Bruno Haible <bruno@clisp.org> | 2011-03-29 23:10:57 +0200 |
---|---|---|
committer | Bruno Haible <bruno@clisp.org> | 2011-03-29 23:47:50 +0200 |
commit | 6ec70a06da21e2bdceb9814fe6fde6be46f890cd (patch) | |
tree | 1eb310cb7adb756f54392eeea8423af19b1e6600 /doc | |
parent | 820590c2b81686f64c50d22022aeb49ff3c6e3ad (diff) | |
download | libunistring-6ec70a06da21e2bdceb9814fe6fde6be46f890cd.tar.gz |
Add grapheme cluster break functions.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/Makefile.am | 2 |
-rw-r--r-- | doc/libunistring.texi | 18 |
-rw-r--r-- | doc/unigbrk.texi | 29 |
-rw-r--r-- | doc/uniwbrk.texi | 2 |
4 files changed, 34 insertions, 17 deletions
diff --git a/doc/Makefile.am b/doc/Makefile.am
index cd5c514..c470f6c 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -33,7 +33,7 @@ info_TEXINFOS = libunistring.texi
 # List of texinfo sources @included by libunistring.texi, excluding version.texi.
 libunistring_TEXINFOS = \
  unitypes.texi unistr.texi uniconv.texi unistdio.texi uniname.texi \
- unictype.texi uniwidth.texi uniwbrk.texi unilbrk.texi unigbrk.texi \
+ unictype.texi uniwidth.texi unigbrk.texi uniwbrk.texi unilbrk.texi \
  uninorm.texi unicase.texi uniregex.texi \
  gpl.texi lgpl.texi fdl.texi

diff --git a/doc/libunistring.texi b/doc/libunistring.texi
index 32209ab..a6f9c8f 100644
--- a/doc/libunistring.texi
+++ b/doc/libunistring.texi
@@ -158,9 +158,9 @@ A copy of the license is included in @ref{GNU GPL}.
 * uniname.h:: Names of Unicode characters
 * unictype.h:: Unicode character classification and properties
 * uniwidth.h:: Display width
+* unigbrk.h:: Grapheme cluster breaking
 * uniwbrk.h:: Word breaks in strings
 * unilbrk.h:: Line breaking
-* unigbrk.h:: Grapheme cluster breaking
 * uninorm.h:: Normalization forms
 * unicase.h:: Case mappings
 * uniregex.h:: Regular expressions
@@ -217,16 +217,16 @@ Properties
 * Properties as objects::
 * Properties as functions::

-uniwbrk.h
-
-* Word breaks in a string::
-* Word break property::
-
 unigbrk.h

 * Grapheme cluster breaks in a string::
 * Grapheme cluster break property::

+uniwbrk.h
+
+* Word breaks in a string::
+* Word break property::
+
 uninorm.h

 * Decomposition of characters::
@@ -281,12 +281,12 @@ character names
 character classification and properties
 @item <uniwidth.h>
 string width when using nonproportional fonts
+@item <unigbrk.h>
+grapheme cluster breaks
 @item <uniwbrk.h>
 word breaks
 @item <unilbrk.h>
 line breaking algorithm
-@item <unigbrk.h>
-grapheme cluster breaks
 @item <uninorm.h>
 normalization (composition and decomposition)
 @item <unicase.h>
@@ -763,9 +763,9 @@ NULL is returned and @code{errno} is set.
 @include uniname.texi
 @include unictype.texi
 @include uniwidth.texi
+@include unigbrk.texi
 @include uniwbrk.texi
 @include unilbrk.texi
-@include unigbrk.texi
 @include uninorm.texi
 @include unicase.texi
 @include uniregex.texi
diff --git a/doc/unigbrk.texi b/doc/unigbrk.texi
index db4df6a..196bd9f 100644
--- a/doc/unigbrk.texi
+++ b/doc/unigbrk.texi
@@ -2,11 +2,18 @@
 @chapter Grapheme cluster breaks in strings @code{<unigbrk.h>}

 @cindex grapheme cluster breaks
+@cindex grapheme cluster boundaries
 @cindex breaks, grapheme cluster
+@cindex boundaries, between grapheme clusters
 This include file declares functions for determining where in a string
 ``grapheme clusters'' start and end. A ``grapheme cluster'' is an
 approximation to a user-perceived character, which sometimes
-corresponds to multiple Unicode characters. The letter @samp{@'e},
+corresponds to multiple Unicode characters. Editing operations such as
+mouse selection, cursor movement, and backspacing often operate on
+grapheme clusters as units, not on individual characters.
+
+Some grapheme clusters are built from a base character and a combining
+character. The letter @samp{@'e},
 for example, is most commonly represented in Unicode as a single
 character U+00E8 @sc{LATIN SMALL LETTER E WITH ACUTE}. It is, however,
 equally valid to use the pair of characters U+0065 @sc{LATIN
@@ -14,6 +21,12 @@ SMALL LETTER E} followed by U+0301 @sc{COMBINING ACUTE ACCENT}. Since
 the user would perceive this pair of characters as a single character,
 they would be grouped into a single grapheme cluster.

+But there are also grapheme clusters that consist of several base characters.
+For example, a Devanagari letter and a Devanagari vowel sign that follows it
+may form a grapheme cluster. Similarly, some pairs of Thai characters and
+Hangul syllables (formed by two or three Hangul characters) are grapheme
+clusters.
+
 @menu
 * Grapheme cluster breaks in a string::
 * Grapheme cluster break property::
@@ -65,10 +78,11 @@ grapheme cluster break at start of text.
 @node Grapheme cluster break property
 @section Grapheme cluster break property

-This is a more low-level API. The grapheme cluster break property is a property defined
-in Unicode Standard Annex #29, section ``Grapheme Cluster Boundaries, see
-@url{http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}.@texnl{} It is
-used for determining the grapheme cluster breaks in a string.
+This is a more low-level API. The grapheme cluster break property is a
+property defined in Unicode Standard Annex #29, section ``Grapheme Cluster
+Boundaries'', see
+@url{http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}.@texnl{}
+It is used for determining the grapheme cluster breaks in a string.

 The following are the possible values of the grapheme cluster break
 property. More values may be added in the future.
@@ -87,7 +101,8 @@ property. More values may be added in the future.
 @deftypevrx Constant int GBP_LVT
 @end deftypevr

-The following function looks up the grapheme cluster break property of a character.
+The following function looks up the grapheme cluster break property of a
+character.

 @deftypefun int uc_graphemeclusterbreak_property (ucs4_t @var{uc})
 Returns the Grapheme_Cluster_Break property of a Unicode character.
@@ -102,7 +117,7 @@ Returns true if there is an grapheme cluster boundary between Unicode
 characters @var{a} and @var{b}.

 There is always a grapheme cluster break at the start or end of text.
-Specify zero for @var{a} or @var{b} to indicate start of text or end
+You can specify zero for @var{a} or @var{b} to indicate start of text or end
 of text, respectively.

 This implements the extended (not legacy) grapheme cluster rules
diff --git a/doc/uniwbrk.texi b/doc/uniwbrk.texi
index 6f06b92..08c273c 100644
--- a/doc/uniwbrk.texi
+++ b/doc/uniwbrk.texi
@@ -2,7 +2,9 @@
 @chapter Word breaks in strings @code{<uniwbrk.h>}

 @cindex word breaks
+@cindex word boundaries
 @cindex breaks, word
+@cindex boundaries, between words
 This include file declares functions for determining where in a string
 ``words'' start and end. Here ``words'' are not necessarily the same as
 entities that can be looked up in dictionaries, but rather groups of
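The documentation touched by this commit describes uc_is_grapheme_break, which reports whether a grapheme cluster boundary lies between two characters and accepts zero to mean start or end of text. The following is a minimal, self-contained C sketch of that idea, not libunistring's implementation. The names toy_is_grapheme_break and toy_count_clusters are hypothetical, and the rules are an assumed simplification that covers only CR LF and the combining marks U+0300..U+036F, not the full rule set of Unicode Standard Annex #29. Real code would presumably call the function declared in <unigbrk.h> instead of the stand-in.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uc_is_grapheme_break (); NOT the library code.
   Assumed simplification: only CR, LF and the combining diacritical marks
   U+0300..U+036F get special treatment.  */
bool
toy_is_grapheme_break (uint32_t a, uint32_t b)
{
  /* Zero stands for start of text (a) or end of text (b): always a break.  */
  if (a == 0 || b == 0)
    return true;
  /* Keep the pair CR LF together.  */
  if (a == 0x000D && b == 0x000A)
    return false;
  /* Otherwise always break before and after CR or LF.  */
  if (a == 0x000D || a == 0x000A || b == 0x000D || b == 0x000A)
    return true;
  /* Do not break before a combining diacritical mark.  */
  if (b >= 0x0300 && b <= 0x036F)
    return false;
  /* Everything else: break between the two characters.  */
  return true;
}

/* Count the grapheme clusters in s[0..n-1] by counting the characters
   that are preceded by a break.  The first character is compared against
   zero, that is, against start of text.  */
size_t
toy_count_clusters (const uint32_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t prev = (i == 0 ? 0 : s[i - 1]);
      if (toy_is_grapheme_break (prev, s[i]))
        count++;
    }
  return count;
}
```

Under these assumptions, the sequence U+0065 U+0301 U+0078 counts as two clusters, because the combining accent attaches to the preceding letter, and CR LF followed by a letter also counts as two.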