Compressed Spaced Suffix Arrays
Spaced seeds are important tools for similarity search in bioinformatics, and
using several seeds together often significantly improves their performance.
With existing approaches, however, for each seed we keep a separate linear-size
data structure, either a hash table or a spaced suffix array (SSA). In this
paper we show how to compress SSAs relative to normal suffix arrays (SAs) and
still support fast random access to them. We first prove a theoretical upper
bound on the space needed to store an SSA when we already have the SA. We then
present experiments indicating that our approach works even better in practice.
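To make the objects in this abstract concrete, here is a minimal Python sketch that builds a plain suffix array and a spaced suffix array by naive sorting. It assumes that the seed is a string of '1' (compare) and '0' (ignore) characters, that the seed is repeated periodically along each suffix, and that ties are broken by text position. The function names are hypothetical, and the paper's relative compression scheme is not shown.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, seed):
    """Start positions ordered by the suffixes as seen through the seed mask.

    Assumption: the seed is repeated periodically, and only characters at
    offsets where the seed has a '1' take part in the comparison.
    """
    m = len(seed)

    def masked_key(i):
        kept = tuple(text[j] for j in range(i, len(text))
                     if seed[(j - i) % m] == "1")
        return (kept, i)  # the position breaks ties between equal masked suffixes

    return sorted(range(len(text)), key=masked_key)


text = "abcabd"
sa = suffix_array(text)
ssa = spaced_suffix_array(text, "101")
```

With the all-ones seed "1" the spaced suffix array coincides with the suffix array; with other seeds it is usually a nearby permutation of it, which is what makes storing it relative to the suffix array plausible.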
Improved ESP-index: a practical self-index for highly repetitive texts
While several self-indexes for highly repetitive texts exist, developing a
practical self-index applicable to real world repetitive texts remains a
challenge. ESP-index is a grammar-based self-index built on the notion of
edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees
upper bounds of parsing discrepancies between different appearances of the same
subtexts in a text. Although ESP-index performs efficient top-down searches of
query texts, it has a serious issue with the binary searches used to find
appearances of variables for a query text, which slows down query searches.
We present an improved ESP-index (ESP-index-I) that leverages the idea
behind succinct data structures for large alphabets. While ESP-index-I retains
the efficiency of ESP-index for top-down searches, it avoids the binary
searches by using fast rank/select operations. We experimentally test
ESP-index-I on its ability to search query texts and extract subtexts from
real-world repetitive texts on a large scale, and we show that ESP-index-I
performs better than other possible approaches.
Comment: This is the full version of a proceeding accepted to the 11th
International Symposium on Experimental Algorithms (SEA2014)
Compressed Spaced Suffix Arrays
As a first step in designing relatively-compressed data structures (i.e., such that storing an instance for one dataset helps us store instances for similar datasets), we consider how to compress spaced suffix arrays relative to normal suffix arrays and still support fast access to them. This problem is of practical interest when performing similarity search with spaced seeds because using several seeds in parallel significantly improves their performance, but with existing approaches we keep a separate linear-space hash table or spaced suffix array for each seed. We first prove a theoretical upper bound on the space needed to store a spaced suffix array when we already have the suffix array. We then present experiments indicating that our approach works even better in practice.
Peer reviewed
CiNCT: Compression and retrieval for massive vehicular trajectories via relative movement labeling
In this paper, we present a compressed data structure for moving object
trajectories in a road network, which are represented as sequences of road
edges. Unlike existing compression methods for trajectories in a network, our
method supports pattern matching and decompression from an arbitrary position
while retaining a high compressibility with theoretical guarantees.
Specifically, our method is based on FM-index, a fast and compact data
structure for pattern matching. To enhance the compression, we incorporate the
sparsity of road networks into the data structure. In particular, we present
the novel concepts of relative movement labeling and PseudoRank, each
contributing to significant reductions in data size and query processing time.
Our theoretical analysis and experimental studies reveal the advantages of our
proposed method as compared to existing trajectory compression methods and
FM-index variants.
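For readers unfamiliar with the FM-index that this method builds on, the following Python sketch shows plain backward search over a Burrows-Wheeler transform. It is only an illustration of the generic technique: trajectories are stood in for by strings whose characters play the role of road-edge labels (an assumption made here), rank is computed by naive counting, and neither relative movement labeling nor PseudoRank is implemented.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for small inputs)."""
    s = text + "\0"  # sentinel, assumed smaller than every symbol in the text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # first_row[c] = number of symbols in the text that are strictly smaller than c
    first_row, total = {}, 0
    for c in sorted(set(bwt_str)):
        first_row[c] = total
        total += bwt_str.count(c)

    def rank(c, i):
        # occurrences of c in bwt_str[:i]; a real index answers this in O(1) or O(log sigma)
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current range of sorted rotations, half-open
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("abracadabra")
hits = count_occurrences(index, "abra")
```

Each pattern symbol costs two rank queries, which is why speeding up rank (the role the abstract assigns to PseudoRank) translates directly into faster queries.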
Galloping in natural merge sorts
We study the algorithm TimSort and the sub-routine it uses to merge monotonic
(non-decreasing) sub-arrays, hereafter called runs. More precisely, we look at
the impact on the number of element comparisons performed of using this
sub-routine instead of a naive routine.
In this article, we introduce a new object for measuring the complexity of
arrays. This notion is dual to the notion of runs on which TimSort built its
success so far, hence we call it dual runs. It induces complexity measures that
are dual to those induced by runs. We prove, for this new complexity measure,
results that are similar to those already known when considering standard
run-induced measures. Although our new results do not lead to any improvement
on the number of element moves performed, they may lead to dramatic
improvements on the number of element comparisons performed by the algorithm.
In order to do so, we introduce new notions of fast- and middle-growth for
natural merge sorts, which allow deriving the same upper bounds. After using
these notions successfully on TimSort, we prove that they can be applied to a
wealth of variants of TimSort and other natural merge sorts.
Comment: 18 pages
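As an illustration of the galloping idea discussed in this abstract, the Python sketch below merges two sorted runs by using exponential search followed by binary search to find how many consecutive elements can be copied from one run. It is a simplified sketch under stated assumptions: it always gallops, whereas TimSort switches between a one-by-one mode and a galloping mode based on a threshold, and the function names are hypothetical.

```python
def gallop(run, start, keep):
    """First index k >= start such that keep(run[k]) is False (or len(run)).

    Assumes keep is True on a prefix of run[start:] and False afterwards.
    Uses exponential probing, then binary search inside the last gap.
    """
    n = len(run)
    lo = hi = start
    step = 1
    while hi < n and keep(run[hi]):
        lo = hi + 1          # everything up to hi is known to be kept
        hi += step
        step *= 2
    hi = min(hi, n)
    while lo < hi:
        mid = (lo + hi) // 2
        if keep(run[mid]):
            lo = mid + 1
        else:
            hi = mid
    return lo


def galloping_merge(a, b):
    """Stable merge of two non-decreasing lists, copying whole blocks at a time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            k = gallop(a, i, lambda x: x <= b[j])  # block of a that precedes b[j]
            out.extend(a[i:k])
            i = k
        else:
            k = gallop(b, j, lambda x: x < a[i])   # block of b strictly below a[i]
            out.extend(b[j:k])
            j = k
    out.extend(a[i:])
    out.extend(b[j:])
    return out


merged = galloping_merge([1, 2, 3, 10], [4, 5, 6, 7, 8, 9, 11])
```

When one run contributes long blocks, a block of length k is located with O(log k) comparisons instead of k, which is the source of the comparison savings the abstract refers to.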
Grammar compressed sequences with rank/select support
An early partial version of this paper appeared in Proc. SPIRE 2014: G. Navarro, A. Ordóñez, Grammar compressed sequences with rank/select support, Proc. 21st International Symposium on String Processing and Information Retrieval, LNCS, SPIRE, vol. 8799 (2014), pp. 31–44. The final publication is available at Springer via http://dx.doi.org/10.1016/j.jda.2016.10.001
[Abstract] Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.
Chile. Fondo Nacional de Desarrollo Científico y Tecnológico; 140796
Ministerio de Economía, Industria y Competitividad; 00645663/ITC-20133062
Ministerio de Economía, Industria y Competitividad; TIN2009-14560-C03-02
Ministerio de Economía, Industria y Competitividad; TIN2010-21246-C02-01
Ministerio de Economía, Industria y Competitividad; TIN2013-46238-C4-3-R
Ministerio de Economía, Industria y Competitividad; TIN2013-47090-C3-3-P
Ministerio de Economía, Industria y Competitividad; AP2010-6038
Xunta de Galicia; GRC2013/05
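The abstract of the grammar-compressed sequences entry describes direct access and rank on a sequence stored as a grammar. The Python sketch below shows one simple way this can work on a straight-line grammar with binary rules, by storing the expansion length and the symbol counts of each nonterminal and descending from the start symbol without expanding the sequence. The class name and the rule format are hypothetical, select is omitted, and the structure in the paper is more refined than this.

```python
class GrammarSequence:
    """Sequence given by binary rules: nonterminal -> (left, right).

    Any symbol that is not a key of `rules` is treated as a terminal.
    """

    def __init__(self, rules, start):
        self.rules = rules
        self.start = start
        self._length = {}   # nonterminal -> length of its expansion
        self._count = {}    # (nonterminal, c) -> occurrences of c in its expansion

    def length(self, sym):
        if sym not in self.rules:
            return 1
        if sym not in self._length:
            left, right = self.rules[sym]
            self._length[sym] = self.length(left) + self.length(right)
        return self._length[sym]

    def count(self, sym, c):
        if sym not in self.rules:
            return 1 if sym == c else 0
        if (sym, c) not in self._count:
            left, right = self.rules[sym]
            self._count[(sym, c)] = self.count(left, c) + self.count(right, c)
        return self._count[(sym, c)]

    def access(self, i):
        """Symbol at 0-based position i."""
        sym = self.start
        while sym in self.rules:
            left, right = self.rules[sym]
            if i < self.length(left):
                sym = left
            else:
                i -= self.length(left)
                sym = right
        return sym

    def rank(self, c, i):
        """Number of occurrences of c among the first i symbols."""
        sym, total = self.start, 0
        while i > 0 and sym in self.rules:
            left, right = self.rules[sym]
            if i < self.length(left):
                sym = left
            else:
                total += self.count(left, c)  # the whole left child lies in the prefix
                i -= self.length(left)
                sym = right
        if i > 0 and sym == c:                # a single terminal remains
            total += 1
        return total


# S -> B A, B -> A A, A -> a b, so S expands to "ababab"
seq = GrammarSequence({"A": ("a", "b"), "B": ("A", "A"), "S": ("B", "A")}, "S")
```

Both operations walk a single root-to-leaf path, so their cost depends on the grammar height rather than on the length of the expanded sequence.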