author: Bruno Haible <bruno@clisp.org> 2009-04-05 13:41:56 +0200
committer: Bruno Haible <bruno@clisp.org> 2009-04-05 13:41:56 +0200
commit: c01cf0e5628faefdc80cc126a2cddb6403f315d5 (patch)
tree: d7c6db6cde8fe93be3c8ac41e5bce3f3097fe689 /doc/uniwbrk.texi
parent: 1887932c6e11205597a892b29c9d2c3b9ab344c0 (diff)
Documentation of <uniwbrk.h>.
Diffstat (limited to 'doc/uniwbrk.texi')
-rw-r--r-- doc/uniwbrk.texi 69
1 files changed, 69 insertions, 0 deletions
diff --git a/doc/uniwbrk.texi b/doc/uniwbrk.texi
new file mode 100644
index 0000000..4c1a2a1
--- /dev/null
+++ b/doc/uniwbrk.texi
@@ -0,0 +1,69 @@
+@node uniwbrk.h
+@chapter Word breaks in strings @code{<uniwbrk.h>}
+
+This include file declares functions for determining where in a string
+``words'' start and end. Here ``words'' are not necessarily the same as
+entities that can be looked up in dictionaries, but rather groups of
+consecutive characters that should not be split by text processing
+operations.
+
+@menu
+* Word breaks in a string::
+* Word break property::
+@end menu
+
+@node Word breaks in a string
+@section Word breaks in a string
+
+The following functions determine the word breaks in a string.
+
+@deftypefun void u8_wordbreaks (const uint8_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void u16_wordbreaks (const uint16_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void u32_wordbreaks (const uint32_t *@var{s}, size_t @var{n}, char *@var{p})
+@deftypefunx void ulc_wordbreaks (const char *@var{s}, size_t @var{n}, char *@var{p})
+Determines the word break points in @var{s}, an array of @var{n} units, and
+stores the result at @code{@var{p}[0..@var{n}-1]}.
+@table @asis
+@item @code{@var{p}[i] = 1}
+means that there is a word boundary between @code{@var{s}[i-1]} and
+@code{@var{s}[i]}.
+@item @code{@var{p}[i] = 0}
+means that @code{@var{s}[i-1]} and @code{@var{s}[i]} must not be separated.
+@end table
+@code{@var{p}[0]} is always set to 0. If an application wants to consider a
+word break to be present at the beginning of the string (before
+@code{@var{s}[0]}) or at the end of the string (after
+@code{@var{s}[0..@var{n}-1]}), it has to treat these cases explicitly.
+@end deftypefun
+
+@node Word break property
+@section Word break property
+
+This is a lower-level API. The word break property is a property defined
+in Unicode Standard Annex #29, section ``Word Boundaries'', see
+@url{http://www.unicode.org/reports/tr29/#Word_Boundaries}. It is used for
+determining the word breaks in a string.
+
+The following are the possible values of the word break property. More values
+may be added in the future.
+
+@deftypevr Constant int WBP_OTHER
+@deftypevrx Constant int WBP_CR
+@deftypevrx Constant int WBP_LF
+@deftypevrx Constant int WBP_NEWLINE
+@deftypevrx Constant int WBP_EXTEND
+@deftypevrx Constant int WBP_FORMAT
+@deftypevrx Constant int WBP_KATAKANA
+@deftypevrx Constant int WBP_ALETTER
+@deftypevrx Constant int WBP_MIDNUMLET
+@deftypevrx Constant int WBP_MIDLETTER
+@deftypevrx Constant int WBP_MIDNUM
+@deftypevrx Constant int WBP_NUMERIC
+@deftypevrx Constant int WBP_EXTENDNUMLET
+@end deftypevr
+
+The following function looks up the word break property of a character.
+
+@deftypefun int uc_wordbreak_property (ucs4_t @var{uc})
+Returns the Word_Break property of a Unicode character.
+@end deftypefun
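
The semantics of the result array p documented above for u8_wordbreaks and its siblings can be illustrated with a minimal sketch in C. It is not part of this patch or of libunistring: count_segments and print_segments are hypothetical helper names. It assumes p has already been filled in for a string of n units. Because p[0] is always 0, the sketch handles the start and the end of the string explicitly, as the documentation requires.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not a libunistring function.
   Counts the segments delimited by the break flags p[0..n-1].
   p[i] == 1 marks a boundary between unit i-1 and unit i.  p[0] is
   always 0, so the start and the end of the string are treated as
   boundaries explicitly.  */
static size_t
count_segments (const char *p, size_t n)
{
  size_t count, i;

  if (n == 0)
    return 0;
  count = 1;                    /* the segment that starts at unit 0 */
  for (i = 1; i < n; i++)
    if (p[i])
      count++;
  return count;
}

/* Hypothetical helper, not a libunistring function.
   Writes each segment of s[0..n-1] on its own line, in brackets.
   Assumes s is a char array whose units can be printed directly.  */
static void
print_segments (FILE *out, const char *s, const char *p, size_t n)
{
  size_t start = 0, i;

  for (i = 1; i <= n; i++)
    if (i == n || p[i])         /* end of string, or a boundary before unit i */
      {
        fprintf (out, "[%.*s]\n", (int) (i - start), s + start);
        start = i;
      }
}
```

For the 8-unit string "Hi there" and a hand-written stand-in flag array {0, 0, 1, 1, 0, 0, 0, 0} (written by hand for illustration, not captured from the library), count_segments returns 3 and print_segments writes [Hi], [ ] and [there] on separate lines.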