@node uninorm.h
@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}

@cindex normal forms
@cindex normalizing
This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, NFKD.  These
transformations involve decomposition and, for NFC and NFKC, composition of
Unicode characters.

@menu
* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::
@end menu

@node Decomposition of characters
@section Decomposition of Unicode characters

@cindex decomposing
The following enumerated values are the possible types of decomposition of a
Unicode character.

@deftypevr Constant int UC_DECOMP_CANONICAL
Denotes canonical decomposition.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FONT
UCD marker: @code{<font>}.  Denotes a font variant (e.g@. a blackletter form).
@end deftypevr

@deftypevr Constant int UC_DECOMP_NOBREAK
UCD marker: @code{<noBreak>}.  Denotes a no-break version of a space or hyphen.
@end deftypevr

@deftypevr Constant int UC_DECOMP_INITIAL
UCD marker: @code{<initial>}.  Denotes an initial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_MEDIAL
UCD marker: @code{<medial>}.  Denotes a medial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_FINAL
UCD marker: @code{<final>}.  Denotes a final presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_ISOLATED
UCD marker: @code{<isolated>}.  Denotes an isolated presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_CIRCLE
UCD marker: @code{<circle>}.  Denotes an encircled form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUPER
UCD marker: @code{<super>}.  Denotes a superscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUB
UCD marker: @code{<sub>}.  Denotes a subscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_VERTICAL
UCD marker: @code{<vertical>}.  Denotes a vertical layout presentation form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_WIDE
UCD marker: @code{<wide>}.  Denotes a wide (or zenkaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_NARROW
UCD marker: @code{<narrow>}.  Denotes a narrow (or hankaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SMALL
UCD marker: @code{<small>}.  Denotes a small variant form (CNS compatibility).
@end deftypevr

@deftypevr Constant int UC_DECOMP_SQUARE
UCD marker: @code{<square>}.  Denotes a CJK squared font variant.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FRACTION
UCD marker: @code{<fraction>}.  Denotes a vulgar fraction form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_COMPAT
UCD marker: @code{<compat>}.  Denotes an otherwise unspecified compatibility character.
@end deftypevr

The following constant denotes the maximum size of the decomposition of a
single Unicode character.

@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
This macro expands to a constant that is the required size of the buffer
passed to the @code{uc_decomposition} and @code{uc_canonical_decomposition}
functions.
@end deftypevr

The following functions decompose a Unicode character.

@deftypefun int uc_decomposition (ucs4_t@tie{}@var{uc}, int@tie{}*@var{decomp_tag}, ucs4_t@tie{}*@var{decomposition})
Returns the character decomposition mapping of the Unicode character @var{uc}.
@var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
@code{*@var{decomp_tag}} are filled and @var{n} is returned.  Otherwise -1 is
returned.
@end deftypefun

@deftypefun int uc_canonical_decomposition (ucs4_t@tie{}@var{uc}, ucs4_t@tie{}*@var{decomposition})
Returns the canonical character decomposition mapping of the Unicode character
@var{uc}.  @var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
and @var{n} is returned.  Otherwise -1 is returned.

Note: This function returns the (simple) ``canonical decomposition'' of
@var{uc}.  If you want the ``full canonical decomposition'' of @var{uc},
that is, the recursive application of ``canonical decomposition'', use the
function @code{u*_normalize} with argument @code{UNINORM_NFD} instead.
@end deftypefun

@node Composition of characters
@section Composition of Unicode characters

@cindex composing, Unicode characters
@cindex combining, Unicode characters
The following function composes a Unicode character from two Unicode
characters.

@deftypefun ucs4_t uc_composition (ucs4_t@tie{}@var{uc1}, ucs4_t@tie{}@var{uc2})
Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
@var{uc1} is known to have canonical combining class 0.

Returns the combination of @var{uc1} and @var{uc2}, if it exists.
Returns 0 otherwise.

Not all decompositions can be recombined using this function.  See the Unicode
file @file{CompositionExclusions.txt} for details.
@end deftypefun

@node Normalization of strings
@section Normalization of strings

The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.

@deftp Type uninorm_t
An object of type @code{uninorm_t} denotes a Unicode normalization form.
This is a scalar type; its values can be compared with @code{==}.
@end deftp

The following constants denote the four normalization forms.

@deftypevr Macro uninorm_t UNINORM_NFD
Denotes Normalization form D: canonical decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFC
Denotes Normalization form C: canonical decomposition, then canonical
composition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKD
Denotes Normalization form KD: compatibility decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKC
Denotes Normalization form KC: compatibility decomposition, then canonical
composition.
@end deftypevr

The following functions operate on @code{uninorm_t} objects.

@deftypefun bool uninorm_is_compat_decomposing (uninorm_t@tie{}@var{nf})
Tests whether the normalization form @var{nf} performs compatibility
decomposition.
@end deftypefun

@deftypefun bool uninorm_is_composing (uninorm_t@tie{}@var{nf})
Tests whether the normalization form @var{nf} includes canonical composition.
@end deftypefun

@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t@tie{}@var{nf})
Returns the decomposing variant of the normalization form @var{nf}.
This maps NFC, NFD @arrow{} NFD and NFKC, NFKD @arrow{} NFKD.
@end deftypefun

The following functions apply a Unicode normalization form to a Unicode string.

@deftypefun {uint8_t *} u8_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint8_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {uint16_t *} u16_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint16_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {uint32_t *} u32_normalize (uninorm_t@tie{}@var{nf}, const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uint32_t@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
Returns the specified normalization form of a string.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@node Normalizing comparisons
@section Normalizing comparisons

@cindex comparing, ignoring normalization
The following functions compare Unicode strings, ignoring differences in
normalization.

@deftypefun int u8_normcmp (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u16_normcmp (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u32_normcmp (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization.

@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@cindex comparing, ignoring normalization, with collation rules
@cindex comparing, with collation rules, ignoring normalization
@deftypefun {char *} u8_normxfrm (const@tie{}uint8_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {char *} u16_normxfrm (const@tie{}uint16_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
@deftypefunx {char *} u32_normxfrm (const@tie{}uint32_t@tie{}*@var{s}, size_t@tie{}@var{n}, uninorm_t@tie{}@var{nf}, char@tie{}*@var{resultbuf}, size_t@tie{}*@var{lengthp})
Converts the string @var{s} of length @var{n} to a NUL-terminated byte
sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@deftypefun int u8_normcoll (const@tie{}uint8_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint8_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u16_normcoll (const@tie{}uint16_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint16_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
@deftypefunx int u32_normcoll (const@tie{}uint32_t@tie{}*@var{s1}, size_t@tie{}@var{n1}, const@tie{}uint32_t@tie{}*@var{s2}, size_t@tie{}@var{n2}, uninorm_t@tie{}@var{nf}, int@tie{}*@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
the collation rules of the current locale.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@node Normalization of streams
@section Normalization of streams of Unicode characters

@cindex stream, normalizing a
A ``stream of Unicode characters'' is essentially a function that accepts a
@code{ucs4_t} argument repeatedly, optionally combined with a function that
``flushes'' the stream.

@deftp Type {struct uninorm_filter}
This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.
@end deftp

@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t@tie{}@var{nf}, int@tie{}(*@var{stream_func})@tie{}(void@tie{}*@var{stream_data}, ucs4_t@tie{}@var{uc}), void@tie{}*@var{stream_data})
Creates and returns a normalization filter for Unicode characters.

The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
upon failure.

Returns the new filter, or @code{NULL} with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_write (struct@tie{}uninorm_filter@tie{}*@var{filter}, ucs4_t@tie{}@var{uc})
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_flush (struct@tie{}uninorm_filter@tie{}*@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with @code{errno} set upon failure.

Note!  If, after calling this function, additional characters are written
into the filter, the resulting character sequence in the encapsulated stream
will not necessarily be normalized.
@end deftypefun

@deftypefun int uninorm_filter_free (struct@tie{}uninorm_filter@tie{}*@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.

Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun