This file contains information about the different parser implementations
available in Tracker, each of them based on a different Unicode support
library (GNU libunistring, libicu).

A specific parser implementation can be selected with the following option at
configure time:
  --with-unicode-support=[libunistring|libicu]

Parser based on GNU libunistring (http://www.gnu.org/software/libunistring):
 * Performs word-breaking as defined by UAX#29 [1], but still doesn't allow
   'next-word' searches (as of v0.9.3, but the feature is on the roadmap).
 * Performs full-word casefolding [2] in non-ASCII strings.
 * Performs lowercasing in ASCII strings.
 * Performs NFKD normalization in non-ASCII strings.
 * The library API is UTF-8 friendly.
 * Up to 60% faster than the libicu parser for ASCII words.

Parser based on ICU, libicu (http://icu-project.org):
 * Performs word-breaking as defined by UAX#29 [1], and allows 'next-word'
   searches, which suits the Tracker use case well.
 * Performs full-word casefolding [2] in non-ASCII strings.
 * Performs lowercasing in ASCII strings.
 * Performs NFKD normalization in non-ASCII strings.
 * The library API is not UTF-8 friendly: it is built around a custom data
   type (UChar) based on UTF-16 (convenient on Windows systems, where Unicode
   strings are encoded in UTF-16).
 * Up to 37% faster than the libunistring parser for non-ASCII words.

Notes:
 * As of tracker 0.9.15, the libunistring and libicu parsers have a list of
   Unicode characters which will always act as word breakers. This hack works
   on top of the Unicode word-breaking algorithm, and was mainly added so that
   FTS searches can be performed using a file extension as the search input.

References:
[1] UAX#29, Unicode Standard Annex #29: TEXT BOUNDARIES
    http://unicode.org/reports/tr29
[2] Section 5.18 of Unicode 5 standard: CASE MAPPINGS
    http://www.unicode.org/versions/latest/ch05.pdf
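
Illustration:
The per-word processing listed for both parsers (lowercasing for ASCII words,
casefolding plus NFKD normalization for non-ASCII words) and the forced word
breakers from the Notes can be sketched roughly as follows. This is not
Tracker code: it uses Python's standard library instead of libunistring or
libicu, the names normalize_word, split_on_forced_breakers and FORCED_BREAKERS
are hypothetical, and the set of forced breakers is assumed to contain only
the full stop. The order of casefolding and normalization is also an
assumption.

```python
import unicodedata

# Assumption: only "." is treated as a forced breaker in this sketch.
# The actual list used by Tracker lives in its sources and is not shown here.
FORCED_BREAKERS = {"."}


def normalize_word(word: str) -> str:
    """Lowercase ASCII words; casefold and NFKD-normalize everything else."""
    if word.isascii():
        # Cheap path for ASCII-only words.
        return word.lower()
    # Full casefolding followed by compatibility decomposition (NFKD).
    return unicodedata.normalize("NFKD", word.casefold())


def split_on_forced_breakers(text: str) -> list[str]:
    """Split text on the forced breakers only.

    This stands in for the extra pass that runs on top of UAX#29
    word-breaking; it does not implement UAX#29 itself.
    """
    words, current = [], []
    for char in text:
        if char in FORCED_BREAKERS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    for word in split_on_forced_breakers("Report.PDF"):
        print(normalize_word(word))
```

Under the UAX#29 rules alone, a full stop between two letters does not end a
word, so "Report.PDF" would stay a single word; with the forced breaker, the
extension becomes a separate, searchable word.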