303 research outputs found
Lightweight LCP Construction for Very Large Collections of Strings
The longest common prefix array is a very advantageous data structure that,
combined with the suffix array and the Burrows-Wheeler transform, allows to
efficiently compute some combinatorial properties of a string useful in several
applications, especially in biological contexts. Nowadays, the input data for
many problems are big collections of strings, for instance the data coming from
"next-generation" DNA sequencing (NGS) technologies. In this paper we present
the first lightweight algorithm (called extLCP) for the simultaneous
computation of the longest common prefix array and the Burrows-Wheeler
transform of a very large collection of strings having any length. The
computation is realized by performing disk data accesses only via sequential
scans, and the total disk space usage never needs more than twice the output
size, excluding the disk space required for the input. Moreover, extLCP allows
to compute also the suffix array of the strings of the collection, without any
other further data structure is needed. Finally, we test our algorithm on real
data and compare our results with another tool capable to work in external
memory on large collections of strings.Comment: This manuscript version is made available under the CC-BY-NC-ND 4.0
license http://creativecommons.org/licenses/by-nc-nd/4.0/ The final version
of this manuscript is in press in Journal of Discrete Algorithm
- …