Search CORE

2,539 research outputs found

Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Author: Prezza Nicola
Publication venue
Publication date: 01/01/2020
Field of study

We consider the problem of encoding a string of length

n

from an integer alphabet of size

\sigma

so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take

n\log\sigma + \Theta(\log (n\log\sigma))

bits. We describe a new data structure matching this lower bound when

\sigma\leq n^{O(1)}

while supporting both queries in optimal

O(1)

time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of

\Theta(\log n)

bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in

O(n\log n)

time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in

O(n^{1.5}\sqrt{\log \sigma})

time. Running times of these Las Vegas algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: 'The Royal Society'
Publication date: 20/01/2020
Field of study

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

arXiv.org e-Print Archive

eScholarship - University of California

Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Author: Ayad Lorraine A. K.
Loukides Grigorios
Pissis Solon P.
Verbeek Hilde
Publication venue
Publication date: 13/10/2023
Field of study

Sparse suffix sorting is the problem of sorting

b=o(n)

suffixes of a string of length

n

. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in

\mathcal{O}(n\log b)

time, in the worst case, or in

\mathcal{O}(n)

time, when the total number of suffixes with an LCP value greater than

2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1

is in

\mathcal{O}(b/\log b)

, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only

8b+o(b)

machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in

\mathcal{O}(n\log b)

time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency.Comment: 16 pages, 1 figur

arXiv.org e-Print Archive