diff options
author | serg@serg.mysql.com <> | 2002-04-19 13:36:16 +0000 |
---|---|---|
committer | serg@serg.mysql.com <> | 2002-04-19 13:36:16 +0000 |
commit | 239f0714a63fd4b8894eec0755aeaac005ad189c (patch) | |
tree | ffbb4597baf09e29d29b267b6435b1c367d98c12 /Docs/internals.texi | |
parent | 5f2d79c55a60decf7305a08313ad84473544ffd5 (diff) | |
download | mariadb-git-239f0714a63fd4b8894eec0755aeaac005ad189c.tar.gz |
boolean fulltext search weighting scheme changed
Diffstat (limited to 'Docs/internals.texi')
-rw-r--r-- | Docs/internals.texi | 45 |
1 files changed, 44 insertions, 1 deletions
diff --git a/Docs/internals.texi b/Docs/internals.texi index 8f358982ded..871e51c50bd 100644 --- a/Docs/internals.texi +++ b/Docs/internals.texi @@ -57,6 +57,7 @@ This is a manual about @strong{MySQL} internals. * mysys functions:: Functions In The @code{mysys} Library * DBUG:: DBUG Tags To Use * protocol:: MySQL Client/Server Protocol +* Fulltext Search:: Fulltext Search in MySQL @end menu @@ -535,7 +536,7 @@ Print query. @end table -@node protocol, , DBUG, Top +@node protocol, Fulltext Search, DBUG, Top @chapter MySQL Client/Server Protocol @menu @@ -785,6 +786,48 @@ Date 03 0A 00 00 |01 0A |03 00 00 00 @c @printindex fn +@node Fulltext Search, , protocol, Top +@chapter Fulltext Search in MySQL + +Hopefully, sometime there will be complete description of +fulltext search algorithms. +Now it's just unsorted notes. + +@menu +* Weighting in boolean mode:: +@end menu + +@node Weighting in boolean mode, , , Fulltext Search +@section Weighting in boolean mode + +The basic idea is as follows: in expression +@code{A or B or (C and D and E)}, either @code{A} or @code{B} alone +is enough to match the whole expression. While @code{C}, +@code{D}, and @code{E} should @strong{all} match. So it's +reasonable to assign weight 1 to @code{A}, @code{B}, and +@code{(C and D and E)}. And @code{C}, @code{D}, and @code{E} +should get a weight of 1/3. + +Things become more complicated when considering boolean +operators, as used in MySQL FTB. Obvioulsy, @code{+A +B} +should be treated as @code{A and B}, and @code{A B} - +as @code{A or B}. The problem is, that @code{+A B} can @strong{not} +be rewritten in and/or terms (that's the reason why this - extended - +set of operators was chosen). Still, aproximations can be used. +@code{+A B C} can be approximated as @code{A or (A and (B or C))} +or as @code{A or (A and B) or (A and C) or (A and B and C)}. +Applying the above logic (and omitting mathematical +transformations and normalization) one gets that for +@code{+A_1 +A_2 ... +A_N B_1 B_2 ... B_M} the weights +should be: @code{A_i = 1/N}, @code{B_j=1} if @code{N==0}, and, +otherwise, in the first rewritting approach @code{B_j = 1/3}, +and in the second one - @code{B_j = (1+(M-1)*2^M)/(M*(2^(M+1)-1))}. + +The second expression gives somewhat steeper increase in total +weight as number of matched B's increases, because it assigns +higher weights to individual B's. Also the first expression in +much simplier. So it is the first one, that is implemented in MySQL. + @summarycontents @contents |