The construction of full-text indexes on very large text collections is nowadays a hot problem. The suffix array [Manber-Myers, 1993] is one of the most attractive full-text indexing data structures due to its simplicity, space efficiency and powerful/fast search operations supported. In this paper we analyze, both theoretically and experimentally, the I/O-complexity and the working space of six algorithms for constructing large suffix arrays. Three of them are the state-of-the-art, the other three algorithms are our new proposals. We perform a set of experiments based on three different data sets (English text, Amino-acid sequences and random texts) and give a precise hierarchy of these algorithms according to their working-space vs. construction-time tradeoff. Given the current trends in model design [19, 47] and disk technology [16, 40], we will pose particular attention to differentiate between 'random' and 'contiguous' disk accesses, in order to reasonably explain some practical I/O-phenomena which are related to the experimental behavior of these algorithms and that would be otherwise meaningless in the light of other simpler external-memory models. At the best of our knowledge, this is the first study which provides a wide spectrum of possible approaches to the construction of suffix arrays in external memory, and thus it should be helpful to anyone who is interested in building full-text indexes on very large text collections. Finally, we conclude our paper by addressing two other issues. The former concerns with the problem of building word-indexes; we show that our results can be successfully applied to this case too, without any loss in efficiency and without compromising the simplicity of programming so to achieve a uniform, simple and efficient approach to both the two indexing models. The latter issue is related to the intriguing and apparently counter-intuitive 'contradiction' between the effective practical performance of the well-known Baeza-Yates-Gonnet-Snider's algorithm , verified in our experiments, and its unappealing (i.e., cubic) worst-case behavior. We devise a new external-memory algorithm that follows the basic philosophy underlying that algorithm but in a significantly different manner, thus resulting in a novel approach which combines good worst-case bounds with efficient practical performance. (orig.)SIGLEAvailable from TIB Hannover: RR 1912(1999-1-001) / FIZ - Fachinformationszzentrum Karlsruhe / TIB - Technische InformationsbibliothekDEGerman
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.