[Dovecot] New full text search indexer

Fri Apr 6 11:42:03 EEST 2007

On 4/5/07, Timo Sirainen <tss at iki.fi> wrote:
> As described earlier
> (http://dovecot.org/list/dovecot/2006-December/018055.html), Dovecot
> nowadays has full text search indexing support in CVS HEAD.
>
> Currently there are two backends: Lucene and Squat. Lucene's problem is
> that standard IMAP SEARCH command can't be used with it without breaking
> IMAP RFC. So Lucene can be used only with non-standard X-BODY-FAST and
> X-TEXT-FAST search commands.
>
> Squat's code then is quite complex and it's pretty slow. It works with 4
> character blocks, so it can answer:
>
>  - Search string < 4 characters: Can't answer anything
>  - Search string = 4 characters: Definite yes/no
>  - Search string > 4 characters: Probable yes / definite no
>
> Since search strings are usually more than 4 characters, Dovecot uses
> the index only to filter out messages which definitely won't match and
> then does regular searching for the non-filtered messages. It works ok,
> but it could work better.
>
> The Squat indexes take maybe 30-50% of the mailbox's size. Most of the
> space goes to UID lists (list of messages matching each 4 character
> block). For example one UID list could be "2,4-10,20-30". This is
> already stored in a compressed format by storing the UID numbers
> relative to previous UID, which allows storing them usually with a
> single byte per UID, even if the UIDs are large.
>
> So, this morning I had an idea how this could maybe be improved with two
> changes:
>
> 1. Don't use 4 character blocks. Instead allow variable sized paths, and
> keep track of the UIDs for each node in the path. So for example adding
> "abc" to UID 1 will place the UID 1 to node "a", "ab" and "abc".
>
> 2. Use UIDs relative to parent's UIDs. For example if node "a" has UIDs
> 1,3,5 and "ab" is added for UID 5, the UID is changed to 2 (3th in 1,3,5
> list - 1). This can help the compression a lot. The best case scenario
> for the above example is if "ab" contains also 1,3,5 UIDs. Then it can
> be written with a single 0-2 range.
>
> With these changes, the indexer can be configured to perform in
> different ways. Text can be indexed in two ways: full words, and
> substrings within those words (as required by IMAP). I think a useful
> combination of these would be to index all 4 character substrings and
> all full words up to 10 characters or so. This kind of an index can
> answer:
>
>  - Search string <= 4 characters: Definite yes/no
>  - For X-BODY-FAST / X-TEXT-FAST:
>    - Search string 5-10 characters: Definite yes/no
>    - Search string 11+ characters: Probable yes / definite no
>  - For standard IMAP SEARCH:
>    - Search string 5-10 characters: Definite yes if message contains the
> key as non-substring, otherwise same as below. This helps only if user
> does a search that matches a lot of messages.
>    - Search string 11+ characters: Probable yes / definite no
>
> UTF-8 text is indexed as bytes, but the path depth can be counted as
> characters instead of as bytes.
>
> I've a test program finished, and unless there are some bugs left I
> think it shows that this can work pretty well:
>
> Squat-like 4 byte substrings (but can answer 1-3 char queries also):
>
> Message count: 7495
> Index time: 4.19 CPU seconds (7.39MB/CPUs)
> Node memory: 3014072 bytes (2943 kB) in 90550 nodes
> UID memory: 8277212 bytes (8083 kB) in 352778 lists
> Total: 11291284 / 32474899 bytes = 34.77%
>
> Indexing the same mailbox file with squat-test gives:
>
>  - Index time: 9.55 CPU seconds, 9.63 real seconds (3.24MB/CPUs)
>  - 4145192 bytes in 102208 nodes (12.76%)
>  - 7286058 bytes in 305501 uid_lists (22.44%)
>  - 11431250 bytes total of 32474899 (35.20%)
>
> So the new indexer is over twice as fast and uses slightly less space.
>
> Indexing 4 char substrings and words up to 10 chars:
>
> Index time: 5.13 CPU seconds (6.04MB/CPUs)
> Node memory: 6002804 bytes (5862 kB) in 615501 nodes
> UID memory: 9044610 bytes (8832 kB) in 519488 lists
> Total: 15047414 / 32474899 bytes = 46.34%
>
> So it takes somewhat more space, but definitely less than having both
> Squat + Lucene.
>
> No substring indexing, words up to 10 chars (Lucene replacement):
>
> Index time: 1.39 CPU seconds (22.28MB/CPUs)
> Node memory: 3297663 bytes (3220 kB) in 548540 nodes
> UID memory: 1803735 bytes (1761 kB) in 213554 lists
> Total: 5101398 / 32474899 bytes = 15.71%
>
> Is there a need for Lucene anymore with this kind of a speed and disk
> space usage? :)
>
> The above tests are run with everything being indexed except space and
> control chars (ie. practically space + tab). If only alphanumeric
> characters are indexed, the total lines are (in same order as above):
>
> Total: 7048664 / 32474899 bytes = 21.70% (-13.07%)
> Total: 9610543 / 32474899 bytes = 29.59% (-16.75%)
> Total: 3963335 / 32474899 bytes = 12.20% (-3.51%)
>
> There are probably still a few optimizations that can be done to get the
> disk space usage even lower. I did spent quite a lot of time making it
> use less CPU though. Even tried a bit different coding style variations
> in the critical functions to get gcc to create faster code..
>
> One interesting thing I noticed was that binary searching a character
> from a char array was slower than just sequentially going through the
> characters. I tried adding checks to do it only if the array was big,
> but it was always slower. Maybe something to do with CPU cache.
>
> If you want to try yourself, the test program is in
> http://dovecot.org/tmp/new-indexer.c

An other problem with squat is that we can't remove items from the
index. (the version of Cyrus). Is that still the case ?

-- 
DINH Viêt Hoà