
    A bloated FM-index reducing the number of cache misses during the search

    The FM-index is a well-known compressed full-text index, based on the Burrows-Wheeler transform (BWT). During a pattern search, the BWT sequence is accessed at "random" locations, which is cache-unfriendly. In this paper, we are interested in speeding up the FM-index by working on $q$-grams rather than individual characters, at the cost of using more space. The first presented variant is related to an inverted index on $q$-grams, yet the occurrence lists in our solution are in sorted suffix order rather than in text order, as they would be in a traditional inverted index. This variant obtains $O(m/|CL| + \log n \log m)$ cache misses in the worst case, where $n$ and $m$ are the text and pattern lengths, respectively, and $|CL|$ is the CPU cache line size, in symbols (typically 64 in modern hardware). This index is often several times faster than the fastest known FM-indexes (especially for long patterns), yet the space requirements are enormous: $O(n\log^2 n)$ bits in theory and about $80n$ to $95n$ bytes in practice. For this reason, we dub our approach FM-bloated. The second presented variant requires $O(n\log n)$ bits of space.
    Comment: 5 figure
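To make the cache-unfriendliness concrete, here is a minimal sketch of the classic character-by-character backward search that a plain FM-index performs. It is not the FM-bloated method from the paper; the function names (`build_fm_index`, `count_occurrences`), the naive suffix sorting, and the uncompressed `occ` table are all assumptions chosen for readability. Note that each pattern symbol triggers two lookups, at positions `lo` and `hi`, which are generally far apart and unrelated to the previous step's positions.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-index for `text` (illustrative only)."""
    # Sentinel: assumed absent from `text` and smaller than every other symbol.
    text += "\0"
    n = len(text)
    # Naive suffix array construction; fine for a small demonstration.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (wraps around for i == 0).
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # C[c] = number of symbols in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, n


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of `pattern` via backward search, one symbol per step."""
    lo, hi = 0, n  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        # Two lookups per symbol, at positions lo and hi: these are the
        # "random" accesses that tend to cause cache misses on large texts.
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


C, occ, n = build_fm_index("abracadabra")
print(count_occurrences("abra", C, occ, n))
```

In this baseline a pattern of length $m$ costs on the order of $m$ scattered accesses, which is the behavior the $q$-gram variants described above aim to reduce.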