
Research report: a theoretical and experimental study on the construction of suffix arrays in external memory

By A. Crauser (Max-Planck-Institut fuer Informatik, Saarbruecken, Germany) and P. Ferragina (Dipartimento di Informatica, Pisa Univ., Italy)


The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [Manber-Myers, 1993] is one of the most attractive full-text indexing data structures due to its simplicity, its space efficiency, and the powerful and fast search operations it supports. In this paper we analyze, both theoretically and experimentally, the I/O complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are the state of the art; the other three are our new proposals. We perform a set of experiments on three different data sets (English text, amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space vs. construction-time tradeoff. Given the current trends in model design [19, 47] and disk technology [16, 40], we pay particular attention to distinguishing between 'random' and 'contiguous' disk accesses, in order to reasonably explain some practical I/O phenomena that are related to the experimental behavior of these algorithms and that would otherwise be meaningless in the light of other, simpler external-memory models. To the best of our knowledge, this is the first study to provide a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and it should therefore be helpful to anyone interested in building full-text indexes on very large text collections. Finally, we conclude the paper by addressing two further issues. The former concerns the problem of building word indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising programming simplicity, thus achieving a uniform, simple and efficient approach to both indexing models.
The latter issue concerns the intriguing and apparently counter-intuitive 'contradiction' between the effective practical performance of the well-known Baeza-Yates-Gonnet-Snider algorithm [24], verified in our experiments, and its unappealing (i.e., cubic) worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm, but in a significantly different manner, resulting in a novel approach that combines good worst-case bounds with efficient practical performance. (orig.)

SIGLE record. Available from TIB Hannover: RR 1912(1999-1-001) / FIZ - Fachinformationszentrum Karlsruhe / TIB - Technische Informationsbibliothek. Country: Germany. Language: German.

Topics: 09H - Computer software, programming, 12A - Pure mathematics
Year: 1999
Provided by: OpenGrey Repository