This file contains information about the different parser implementations
 available in Tracker, each of them based on a different Unicode support library
 (GNU libunistring, libicu).

A specific parser implementation can be selected with the following option at
 configure time: --with-unicode-support=[libunistring|libicu]


Parser based on GNU libunistring (http://www.gnu.org/software/libunistring):
 * Performs word-breaking as defined by UAX#29 [1], but does not yet allow
    'next-word' searches (as of v0.9.3; the feature is on the roadmap).
 * Performs full-word casefolding [2] in non-ASCII strings.
 * Performs lowercasing in ASCII strings.
 * Performs NFKD normalization in non-ASCII strings.
 * Library API is UTF-8 friendly.
 * Up to 60% faster than the libicu parser for ASCII words.

Parser based on ICU (libicu, http://icu-project.org):
 * Performs word-breaking as defined by UAX#29 [1], and allows 'next-word'
    searches, which fits the Tracker use case well.
 * Performs full-word casefolding [2] in non-ASCII strings.
 * Performs lowercasing in ASCII strings.
 * Performs NFKD normalization in non-ASCII strings.
 * Library API is not UTF-8 friendly: it is built around a custom data type
    (UChar) based on UTF-16 (convenient on Windows systems, where Unicode
    strings are encoded in UTF-16).
 * Up to 37% faster than the libunistring parser for non-ASCII words.
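
Both parsers apply the same per-word processing: lowercasing for ASCII words,
and full casefolding plus NFKD normalization for non-ASCII words. The Python
sketch below only illustrates that behaviour using the standard unicodedata
module; it is not the Tracker code. The helper name normalize_word is
hypothetical, and applying casefolding before normalization is an assumption,
since this file does not state the order.

```python
import unicodedata


def normalize_word(word: str) -> str:
    """Hypothetical helper approximating the per-word processing listed above."""
    if word.isascii():
        # ASCII words: plain lowercasing is enough.
        return word.lower()
    # Non-ASCII words: full casefolding, then NFKD normalization.
    # (Order is an assumption; this README does not specify it.)
    return unicodedata.normalize("NFKD", word.casefold())
```

For example, normalize_word("Straße") yields "strasse", because full
casefolding expands the sharp s, something simple lowercasing would not do.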

Notes:
  * As of Tracker 0.9.15, the libunistring and libicu parsers have a list of
     Unicode characters which always act as word breakers. This hack works
     on top of the Unicode word-breaking algorithm, and was added mainly so
     that a file extension can be used as the input of an FTS search.
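
The effect of the forced word breakers can be sketched as a second pass over
the words produced by the UAX#29 segmentation. Under UAX#29 a full stop
between letters does not end a word, so "report.pdf" would stay a single word.
The Python sketch below is illustrative only: the function name and the
contents of FORCED_BREAKERS are assumptions, and the real list of characters
lives in the parser sources.

```python
# Hypothetical set of forced breakers; the real list is in the parser sources.
FORCED_BREAKERS = frozenset({"."})


def split_on_forced_breakers(word, breakers=FORCED_BREAKERS):
    """Split an already-segmented word at every forced breaker character."""
    pieces = []
    current = []
    for ch in word:
        if ch in breakers:
            # A breaker closes the current piece and is itself dropped.
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces
```

With this pass, split_on_forced_breakers("report.pdf") gives
["report", "pdf"], so a search for "pdf" can match the file name.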

References:
 [1] UAX#29, Unicode Standard Annex #29: TEXT BOUNDARIES
      http://unicode.org/reports/tr29
 [2] Section 5.18 of Unicode 5 standard: CASE MAPPINGS
      http://www.unicode.org/versions/latest/ch05.pdf