author Bruno Haible <bruno@clisp.org> 2011-03-29 23:10:57 +0200
committer Bruno Haible <bruno@clisp.org> 2011-03-29 23:47:50 +0200
commit 6ec70a06da21e2bdceb9814fe6fde6be46f890cd
tree 1eb310cb7adb756f54392eeea8423af19b1e6600 /doc
parent 820590c2b81686f64c50d22022aeb49ff3c6e3ad
Add grapheme cluster break functions.
Diffstat (limited to 'doc')
-rw-r--r-- doc/Makefile.am 2
-rw-r--r-- doc/libunistring.texi 18
-rw-r--r-- doc/unigbrk.texi 29
-rw-r--r-- doc/uniwbrk.texi 2
4 files changed, 34 insertions, 17 deletions
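The unigbrk.texi text touched by this commit says that U+0065 followed by U+0301 forms a single grapheme cluster, and that zero can be passed for either argument of the pairwise break test to mean start or end of text. The sketch below illustrates only that convention. It is not the library's implementation: the `toy_` names are hypothetical, and the predicate knows only the Combining Diacritical Marks block (U+0300..U+036F), whereas the real rules of Unicode Standard Annex #29 cover many more cases (Hangul, Devanagari, Thai and others).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for code points in the Combining Diacritical
   Marks block only. Real code would look up the full
   Grapheme_Cluster_Break property instead. */
bool toy_is_extend(uint32_t uc)
{
    return uc >= 0x0300 && uc <= 0x036F;
}

/* Simplified stand-in for a pairwise break test. Zero for a or b means
   start or end of text, where a break is always reported. Otherwise the
   only rule applied is "no break before a combining mark". */
bool toy_is_grapheme_break(uint32_t a, uint32_t b)
{
    if (a == 0 || b == 0)
        return true;
    return !toy_is_extend(b);
}

/* Counts clusters in an array of n code points by counting the positions
   at which a cluster starts. */
size_t toy_count_clusters(const uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t prev = (i == 0) ? 0 : s[i - 1];
        if (toy_is_grapheme_break(prev, s[i]))
            count++;
    }
    return count;
}
```

Under these assumptions, the sequence U+0065 U+0301 counts as one cluster, and appending U+0078 gives two. Code that needs correct results should call the functions declared in `<unigbrk.h>` rather than a hand-written rule like this one.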
diff --git a/doc/Makefile.am b/doc/Makefile.am
index cd5c514..c470f6c 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -33,7 +33,7 @@ info_TEXINFOS = libunistring.texi
# List of texinfo sources @included by libunistring.texi, excluding version.texi.
libunistring_TEXINFOS = \
unitypes.texi unistr.texi uniconv.texi unistdio.texi uniname.texi \
- unictype.texi uniwidth.texi uniwbrk.texi unilbrk.texi unigbrk.texi \
+ unictype.texi uniwidth.texi unigbrk.texi uniwbrk.texi unilbrk.texi \
uninorm.texi unicase.texi uniregex.texi \
gpl.texi lgpl.texi fdl.texi
diff --git a/doc/libunistring.texi b/doc/libunistring.texi
index 32209ab..a6f9c8f 100644
--- a/doc/libunistring.texi
+++ b/doc/libunistring.texi
@@ -158,9 +158,9 @@ A copy of the license is included in @ref{GNU GPL}.
* uniname.h:: Names of Unicode characters
* unictype.h:: Unicode character classification and properties
* uniwidth.h:: Display width
+* unigbrk.h:: Grapheme cluster breaking
* uniwbrk.h:: Word breaks in strings
* unilbrk.h:: Line breaking
-* unigbrk.h:: Grapheme cluster breaking
* uninorm.h:: Normalization forms
* unicase.h:: Case mappings
* uniregex.h:: Regular expressions
@@ -217,16 +217,16 @@ Properties
* Properties as objects::
* Properties as functions::
-uniwbrk.h
-
-* Word breaks in a string::
-* Word break property::
-
unigbrk.h
* Grapheme cluster breaks in a string::
* Grapheme cluster break property::
+uniwbrk.h
+
+* Word breaks in a string::
+* Word break property::
+
uninorm.h
* Decomposition of characters::
@@ -281,12 +281,12 @@ character names
character classification and properties
@item <uniwidth.h>
string width when using nonproportional fonts
+@item <unigbrk.h>
+grapheme cluster breaks
@item <uniwbrk.h>
word breaks
@item <unilbrk.h>
line breaking algorithm
-@item <unigbrk.h>
-grapheme cluster breaks
@item <uninorm.h>
normalization (composition and decomposition)
@item <unicase.h>
@@ -763,9 +763,9 @@ NULL is returned and @code{errno} is set.
@include uniname.texi
@include unictype.texi
@include uniwidth.texi
+@include unigbrk.texi
@include uniwbrk.texi
@include unilbrk.texi
-@include unigbrk.texi
@include uninorm.texi
@include unicase.texi
@include uniregex.texi
diff --git a/doc/unigbrk.texi b/doc/unigbrk.texi
index db4df6a..196bd9f 100644
--- a/doc/unigbrk.texi
+++ b/doc/unigbrk.texi
@@ -2,11 +2,18 @@
@chapter Grapheme cluster breaks in strings @code{<unigbrk.h>}
@cindex grapheme cluster breaks
+@cindex grapheme cluster boundaries
@cindex breaks, grapheme cluster
+@cindex boundaries, between grapheme clusters
This include file declares functions for determining where in a string
``grapheme clusters'' start and end. A ``grapheme cluster'' is an
approximation to a user-perceived character, which sometimes
-corresponds to multiple Unicode characters. The letter @samp{@'e},
+corresponds to multiple Unicode characters. Editing operations such as
+mouse selection, cursor movement, and backspacing often operate on
+grapheme clusters as units, not on individual characters.
+
+Some grapheme clusters are built from a base character and a combining
+character. The letter @samp{@'e},
for example, is most commonly represented in Unicode as a single
character U+00E9 @sc{LATIN SMALL LETTER E WITH ACUTE}. It is,
however, equally valid to use the pair of characters U+0065 @sc{LATIN
@@ -14,6 +21,12 @@ SMALL LETTER E} followed by U+0301 @sc{COMBINING ACUTE ACCENT}. Since
the user would perceive this pair of characters as a single character,
they would be grouped into a single grapheme cluster.
+But there are also grapheme clusters that consist of several base characters.
+For example, a Devanagari letter and a Devanagari vowel sign that follows it
+may form a grapheme cluster. Similarly, some pairs of Thai characters and
+Hangul syllables (formed by two or three Hangul characters) are grapheme
+clusters.
+
@menu
* Grapheme cluster breaks in a string::
* Grapheme cluster break property::
@@ -65,10 +78,11 @@ grapheme cluster break at start of text.
@node Grapheme cluster break property
@section Grapheme cluster break property
-This is a more low-level API. The grapheme cluster break property is a property defined
-in Unicode Standard Annex #29, section ``Grapheme Cluster Boundaries, see
-@url{http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}.@texnl{} It is
-used for determining the grapheme cluster breaks in a string.
+This is a more low-level API. The grapheme cluster break property is a
+property defined in Unicode Standard Annex #29, section ``Grapheme Cluster
+Boundaries'', see
+@url{http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries}.@texnl{}
+It is used for determining the grapheme cluster breaks in a string.
The following are the possible values of the grapheme cluster break
property. More values may be added in the future.
@@ -87,7 +101,8 @@ property. More values may be added in the future.
@deftypevrx Constant int GBP_LVT
@end deftypevr
-The following function looks up the grapheme cluster break property of a character.
+The following function looks up the grapheme cluster break property of a
+character.
@deftypefun int uc_graphemeclusterbreak_property (ucs4_t @var{uc})
Returns the Grapheme_Cluster_Break property of a Unicode character.
@@ -102,7 +117,7 @@ Returns true if there is an grapheme cluster boundary between Unicode
characters @var{a} and @var{b}.
There is always a grapheme cluster break at the start or end of text.
-Specify zero for @var{a} or @var{b} to indicate start of text or end
+You can specify zero for @var{a} or @var{b} to indicate start of text or end
of text, respectively.
This implements the extended (not legacy) grapheme cluster rules
diff --git a/doc/uniwbrk.texi b/doc/uniwbrk.texi
index 6f06b92..08c273c 100644
--- a/doc/uniwbrk.texi
+++ b/doc/uniwbrk.texi
@@ -2,7 +2,9 @@
@chapter Word breaks in strings @code{<uniwbrk.h>}
@cindex word breaks
+@cindex word boundaries
@cindex breaks, word
+@cindex boundaries, between words
This include file declares functions for determining where in a string
``words'' start and end. Here ``words'' are not necessarily the same as
entities that can be looked up in dictionaries, but rather groups of