19,230 research outputs found
On Longest Repeat Queries Using GPU
Repeat finding in strings has important applications in subfields such as
computational biology. The challenge of finding the longest repeats covering
particular string positions was recently proposed and solved by \.{I}leri et
al., using a total of the optimal time and space, where is the
string size. However, their solution can only find the \emph{leftmost} longest
repeat for each of the string position. It is also not known how to
parallelize their solution. In this paper, we propose a new solution for
longest repeat finding, which although is theoretically suboptimal in time but
is conceptually simpler and works faster and uses less memory space in practice
than the optimal solution. Further, our solution can find \emph{all} longest
repeats of every string position, while still maintaining a faster processing
speed and less memory space usage. Moreover, our solution is
\emph{parallelizable} in the shared memory architecture (SMA), enabling it to
take advantage of the modern multi-processor computing platforms such as the
general-purpose graphics processing units (GPU). We have implemented both the
sequential and parallel versions of our solution. Experiments with both
biological and non-biological data show that our sequential and parallel
solutions are faster than the optimal solution by a factor of 2--3.5 and 6--14,
respectively, and use less memory space.Comment: 14 page
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on , by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637
Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string of length from an integer
alphabet of size so that access and substring equality queries (that
is, determining the equality of any two substrings) can be answered
efficiently. Any uniquely-decodable encoding supporting access must take
bits. We describe a new data
structure matching this lower bound when while supporting
both queries in optimal time. Furthermore, we show that the string can
be overwritten in-place with this structure. The redundancy of
bits and the constant query time break exponentially a lower bound that is
known to hold in the read-only model. Using our new string representation, we
obtain the first in-place subquadratic (indeed, even sublinear in some cases)
algorithms for several string-processing problems in the restore model: the
input string is rewritable and must be restored before the computation
terminates. In particular, we describe the first in-place subquadratic Monte
Carlo solutions to the sparse suffix sorting, sparse LCP array construction,
and suffix selection problems. With the sole exception of suffix selection, our
algorithms are also the first running in sublinear time for small enough sets
of input suffixes. Combining these solutions, we obtain the first
sublinear-time Monte Carlo algorithm for building the sparse suffix tree in
compact space. We also show how to derandomize our algorithms using small
space. This leads to the first Las Vegas in-place algorithm computing the full
LCP array in time and to the first Las Vegas in-place algorithms
solving the sparse suffix sorting and sparse LCP array construction problems in
time. Running times of these Las Vegas
algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las
Vegas algorithm
Fast Label Extraction in the CDAWG
The compact directed acyclic word graph (CDAWG) of a string of length
takes space proportional just to the number of right extensions of the
maximal repeats of , and it is thus an appealing index for highly repetitive
datasets, like collections of genomes from similar species, in which grows
significantly more slowly than . We reduce from to
the time needed to count the number of occurrences of a pattern of
length , using an existing data structure that takes an amount of space
proportional to the size of the CDAWG. This implies a reduction from
to in the time needed to
locate all the occurrences of the pattern. We also reduce from
to the time needed to read the characters of the
label of an edge of the suffix tree of , and we reduce from
to the time needed to compute the matching
statistics between a query of length and , using an existing
representation of the suffix tree based on the CDAWG. All such improvements
derive from extracting the label of a vertex or of an arc of the CDAWG using a
straight-line program induced by the reversed CDAWG.Comment: 16 pages, 1 figure. In proceedings of the 24th International
Symposium on String Processing and Information Retrieval (SPIRE 2017). arXiv
admin note: text overlap with arXiv:1705.0864
The complexity of resolving conflicts on MAC
We consider the fundamental problem of multiple stations competing to
transmit on a multiple access channel (MAC). We are given stations out of
which at most are active and intend to transmit a message to other stations
using MAC. All stations are assumed to be synchronized according to a time
clock. If stations node transmit in the same round, then the MAC provides
the feedback whether , (collision occurred) or . When ,
then a single station is indeed able to successfully transmit a message, which
is received by all other nodes. For the above problem the active stations have
to schedule their transmissions so that they can singly, transmit their
messages on MAC, based only on the feedback received from the MAC in previous
round.
For the above problem it was shown in [Greenberg, Winograd, {\em A Lower
bound on the Time Needed in the Worst Case to Resolve Conflicts
Deterministically in Multiple Access Channels}, Journal of ACM 1985] that every
deterministic adaptive algorithm should take rounds
in the worst case. The fastest known deterministic adaptive algorithm requires
rounds. The gap between the upper and lower bound is
round. It is substantial for most values of : When constant and (for any constant , the lower bound is
respectively and O(n), which is trivial in both cases. Nevertheless,
the above lower bound is interesting indeed when poly(). In this
work, we present a novel counting argument to prove a tight lower bound of
rounds for all deterministic, adaptive algorithms, closing
this long standing open question.}Comment: Xerox internal report 27th July; 7 page
Minimal Suffix and Rotation of a Substring in Optimal Time
For a text given in advance, the substring minimal suffix queries ask to
determine the lexicographically minimal non-empty suffix of a substring
specified by the location of its occurrence in the text. We develop a data
structure answering such queries optimally: in constant time after linear-time
preprocessing. This improves upon the results of Babenko et al. (CPM 2014),
whose trade-off solution is characterized by product of these
time complexities. Next, we extend our queries to support concatenations of
substrings, for which the construction and query time is preserved. We
apply these generalized queries to compute lexicographically minimal and
maximal rotations of a given substring in constant time after linear-time
preprocessing.
Our data structures mainly rely on properties of Lyndon words and Lyndon
factorizations. We combine them with further algorithmic and combinatorial
tools, such as fusion trees and the notion of order isomorphism of strings
- …