@node uninorm.h
@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}

@cindex normal forms
@cindex normalizing
This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, NFKD. These
transformations involve decomposition and, for NFC and NFKC, composition of
Unicode characters.

@menu
* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::
@end menu

@node Decomposition of characters
@section Decomposition of Unicode characters

@cindex decomposing
The following enumerated values are the possible types of decomposition of a
Unicode character.

@deftypevr Constant int UC_DECOMP_CANONICAL
Denotes canonical decomposition.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FONT
UCD marker: @code{<font>}. Denotes a font variant (e.g@. a blackletter form).
@end deftypevr

@deftypevr Constant int UC_DECOMP_NOBREAK
UCD marker: @code{<noBreak>}. Denotes a no-break version of a space or hyphen.
@end deftypevr

@deftypevr Constant int UC_DECOMP_INITIAL
UCD marker: @code{<initial>}. Denotes an initial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_MEDIAL
UCD marker: @code{<medial>}. Denotes a medial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_FINAL
UCD marker: @code{<final>}. Denotes a final presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_ISOLATED
UCD marker: @code{<isolated>}. Denotes an isolated presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_CIRCLE
UCD marker: @code{<circle>}. Denotes an encircled form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUPER
UCD marker: @code{<super>}. Denotes a superscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUB
UCD marker: @code{<sub>}. Denotes a subscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_VERTICAL
UCD marker: @code{<vertical>}. Denotes a vertical layout presentation form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_WIDE
UCD marker: @code{<wide>}. Denotes a wide (or zenkaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_NARROW
UCD marker: @code{<narrow>}. Denotes a narrow (or hankaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SMALL
UCD marker: @code{<small>}. Denotes a small variant form (CNS compatibility).
@end deftypevr

@deftypevr Constant int UC_DECOMP_SQUARE
UCD marker: @code{<square>}. Denotes a CJK squared font variant.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FRACTION
UCD marker: @code{<fraction>}. Denotes a vulgar fraction form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_COMPAT
UCD marker: @code{<compat>}. Denotes an otherwise unspecified compatibility character.
@end deftypevr

The following constant denotes the maximum size of the decomposition of a
single Unicode character.

@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
This macro expands to a constant that is the required size of the buffer
passed to the @code{uc_decomposition} and @code{uc_canonical_decomposition}
functions.
@end deftypevr

The following functions decompose a Unicode character.

@deftypefun int uc_decomposition (ucs4_t @var{uc}, int *@var{decomp_tag}, ucs4_t *@var{decomposition})
Returns the character decomposition mapping of the Unicode character @var{uc}.
@var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
@code{*@var{decomp_tag}} are filled and @var{n} is returned. Otherwise -1 is
returned.
@end deftypefun

@deftypefun int uc_canonical_decomposition (ucs4_t @var{uc}, ucs4_t *@var{decomposition})
Returns the canonical character decomposition mapping of the Unicode character
@var{uc}. @var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
and @var{n} is returned. Otherwise -1 is returned.
@end deftypefun

@node Composition of characters
@section Composition of Unicode characters

@cindex composing, Unicode characters
@cindex combining, Unicode characters
The following function composes a Unicode character from two Unicode
characters.

@deftypefun ucs4_t uc_composition (ucs4_t @var{uc1}, ucs4_t @var{uc2})
Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
@var{uc1} is known to have canonical combining class 0.

Returns the combination of @var{uc1} and @var{uc2}, if it exists.
Returns 0 otherwise.

Not all decompositions can be recombined using this function. See the Unicode
file @file{CompositionExclusions.txt} for details.
@end deftypefun

@node Normalization of strings
@section Normalization of strings

The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.

@deftp Type uninorm_t
An object of type @code{uninorm_t} denotes a Unicode normalization form.
This is a scalar type; its values can be compared with @code{==}.
@end deftp

The following constants denote the four normalization forms.

@deftypevr Macro uninorm_t UNINORM_NFD
Denotes Normalization form D: canonical decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFC
Normalization form C: canonical decomposition, then canonical composition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKD
Normalization form KD: compatibility decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKC
Normalization form KC: compatibility decomposition, then canonical composition.
@end deftypevr

The following functions operate on @code{uninorm_t} objects.

@deftypefun bool uninorm_is_compat_decomposing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} does compatibility decomposition.
@end deftypefun

@deftypefun bool uninorm_is_composing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} includes canonical composition.
@end deftypefun

@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t @var{nf})
Returns the decomposing variant of the normalization form @var{nf}.
This maps NFC, NFD @arrow{} NFD and NFKC, NFKD @arrow{} NFKD.
@end deftypefun

The following functions apply a Unicode normalization form to a Unicode string.

@deftypefun {uint8_t *} u8_normalize (uninorm_t @var{nf}, const uint8_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint16_t *} u16_normalize (uninorm_t @var{nf}, const uint16_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint32_t *} u32_normalize (uninorm_t @var{nf}, const uint32_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
Returns the specified normalization form of a string.
@end deftypefun

@node Normalizing comparisons
@section Normalizing comparisons

@cindex comparing, ignoring normalization
The following functions compare Unicode strings, ignoring differences in
normalization.

@deftypefun int u8_normcmp (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcmp (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcmp (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization.

@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@cindex comparing, ignoring normalization, with collation rules
@cindex comparing, with collation rules, ignoring normalization
@deftypefun {char *} u8_normxfrm (const uint8_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u16_normxfrm (const uint16_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u32_normxfrm (const uint32_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
Converts the string @var{s} of length @var{n} to a NUL-terminated byte
sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
@end deftypefun

@deftypefun int u8_normcoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcoll (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcoll (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
the collation rules of the current locale.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@node Normalization of streams
@section Normalization of streams of Unicode characters

@cindex stream, normalizing a
A ``stream of Unicode characters'' is essentially a function that accepts a
@code{ucs4_t} argument repeatedly, optionally combined with a function that
``flushes'' the stream.

@deftp Type {struct uninorm_filter}
This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.
@end deftp

@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t @var{nf}, int (*@var{stream_func}) (void *@var{stream_data}, ucs4_t @var{uc}), void *@var{stream_data})
Creates and returns a normalization filter for Unicode characters.

The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
upon failure.

Returns the new filter, or NULL with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_write (struct uninorm_filter *@var{filter}, ucs4_t @var{uc})
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_flush (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with @code{errno} set upon failure.

Note! If additional characters are written into the filter after this
function has been called, the resulting character sequence in the
encapsulated stream will not necessarily be normalized.
@end deftypefun

@deftypefun int uninorm_filter_free (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.

Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun
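
To illustrate what an encapsulated stream can look like, the following sketch
defines a callback with the @var{stream_func} signature that appends each
received character to a fixed-capacity array. The names @code{struct char_sink}
and @code{char_sink_put} are hypothetical. The local @code{ucs4_t} typedef is
an assumption made only so that the sketch compiles on its own; real code would
get the type from the library headers instead.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for the library's ucs4_t, so that this sketch
   compiles without libunistring installed.  */
typedef uint32_t ucs4_t;

/* Hypothetical stream data: a fixed-capacity array of received characters.  */
struct char_sink
{
  ucs4_t chars[64];
  size_t count;
};

/* Matches the stream_func signature: returns 0 if successful,
   or -1 with errno set upon failure (here: the array is full).  */
int
char_sink_put (void *stream_data, ucs4_t uc)
{
  struct char_sink *sink = stream_data;

  if (sink->count == sizeof sink->chars / sizeof sink->chars[0])
    {
      errno = ENOMEM;
      return -1;
    }
  sink->chars[sink->count++] = uc;
  return 0;
}
```

Such a pair would be handed to the filter as
@code{uninorm_filter_create (@var{nf}, char_sink_put, &sink)}; characters are
then pushed with @code{uninorm_filter_write}, and the normalized sequence
reaches @code{sink} once the filter is flushed or freed.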