blob: f8e3f725f6d8a692249a43a285ea8c2445f382b9 (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
|
This file contains information about the different parser implementations
available in Tracker, each of them based on a different unicode support library
(GNU libunistring, libunac).
Specific parser implementation can be selected with the following option at
configure time: --with-unicode-support=[libunistring|libicu]
Parser based on GNU libunistring (http://www.gnu.org/software/libunistring)
* Performs word-breaking as defined by UAX#29 [1], but still doesn't allow
'next-word' searches (as of v0.9.3, but feature is in the roadmap).
* Performs full-word casefolding [2] in non-ASCII strings.
* Performs lowercasing in ASCII strings.
* Performs NFKD normalization in non-ASCII strings.
* Library API is UTF-8 friendly.
* Up to 60% faster than the libicu parser for ASCII words.
Parser based on ICU libicu (http://icu-project.org):
* Performs word-breaking as defined by UAX#29 [1], and allows 'next-word'
searches, perfect in the Tracker case.
* Performs full-word casefolding [2] in non-ASCII strings.
* Performs lowercasing in ASCII strings.
* Performs NFKD normalization in non-ASCII strings.
* Library API is not UTF-8 friendly, strongly based on a custom data type
(UChar), which is based on UTF-16 (so great for Windows systems, where
Unicode strings are encoded in UTF-16).
* Up to 37% faster than the libunistring parser for non-ASCII words.
Notes:
* As of tracker 0.9.15, the libunistring and libicu parsers have a list of
Unicode characters which will always act as word breakers. This hack works
on top of the unicode word-breaking algorithm, and was mainly done in order
to be able to perform FTS searches using file extension as input for the
FTS search.
References:
[1] UAX#29, Unicode Standard Annex #29: TEXT BOUNDARIES
http://unicode.org/reports/tr29
[2] Section 5.18 of Unicode 5 standard: CASE MAPPINGS
http://www.unicode.org/versions/latest/ch05.pdf
|