Indexing compressed text

Abstract

We study a method by Ferragina and Manzini for creating an index of a text. This index allows us to find any string in the original text. What makes this index special is that it is smaller than the original text, while still allowing quick searching and recovery of the original text. In order to understand the performance bounds given by Ferragina and Manzini, we first examine the concept of information density, the entropy. Next we examine the details of the method suggested by Ferragina and Manzini. Finally we design an extension to their method. Using this extension we are not only able to search for any specific string in the text, but also for more general descriptions of pieces of text. More precisely, we can find all matches for a given regular expression. This lets us answer queries like ‘give all quoted pieces of text’.
