606 research outputs found
Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit
International audienceThe problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we will be primarily concerned with sequences as our objects and with various string comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of various solutions whose time complexity depends both on the chosen parsing method as well as on the string-comparison metric used for the alignment
Finding the region of pseudo-periodic tandem repeats in biological sequences
SUMMARY: The genomes of many species are dominated by short sequences repeated consecutively. It is estimated that over 10% of the human genome consists of tandemly repeated sequences. Finding repeated regions in long sequences is important in sequence analysis. We develop a software, LocRepeat, that finds regions of pseudo-periodic repeats in a long sequence. We use the definition of Li et al. [1] for the pseudo-periodic partition of a region and extend the algorithm that can select the repeated region from a given long sequence and give the pseudo-periodic partition of the region. AVAILABILITY: LocRepeat is available a
An Almost Optimal Edit Distance Oracle
We consider the problem of preprocessing two strings S and T, of lengths m and n, respectively, in order to be able to efficiently answer the following queries: Given positions i,j in S and positions a,b in T, return the optimal alignment score of S[i..j] and T[a..b]. Let N = mn. We present an oracle with preprocessing time N^{1+o(1)} and space N^{1+o(1)} that answers queries in log^{2+o(1)}N time. In other words, we show that we can efficiently query for the alignment score of every pair of substrings after preprocessing the input for almost the same time it takes to compute just the alignment of S and T. Our oracle uses ideas from our distance oracle for planar graphs [STOC 2019] and exploits the special structure of the alignment graph. Conditioned on popular hardness conjectures, this result is optimal up to subpolynomial factors. Our results apply to both edit distance and longest common subsequence (LCS).
The best previously known oracle with construction time and size ?(N) has slow ?(?N) query time [Sakai, TCS 2019], and the one with size N^{1+o(1)} and query time log^{2+o(1)}N (using a planar graph distance oracle) has slow ?(N^{3/2}) construction time [Long & Pettie, SODA 2021]. We improve both approaches by roughly a ? N factor
Computing alignment plots efficiently
Dot plots are a standard method for local comparison of biological sequences.
In a dot plot, a substring to substring distance is computed for all pairs of
fixed-size windows in the input strings. Commonly, the Hamming distance is used
since it can be computed in linear time. However, the Hamming distance is a
rather crude measure of string similarity, and using an alignment-based edit
distance can greatly improve the sensitivity of the dot plot method. In this
paper, we show how to compute alignment plots of the latter type efficiently.
Given two strings of length m and n and a window size w, this problem consists
in computing the edit distance between all pairs of substrings of length w, one
from each input string. The problem can be solved by repeated application of
the standard dynamic programming algorithm in time O(mnw^2). This paper gives
an improved data-parallel algorithm, running in time using
vector operations that work on values in parallel and processors.
We show experimental results from an implementation of this algorithm, which
uses Intel's MMX/SSE instructions for vector parallelism and MPI for
coarse-grained parallelism.Comment: Presented at ParCo 200
Exact Distance Oracles for Planar Graphs
We present new and improved data structures that answer exact node-to-node
distance queries in planar graphs. Such data structures are also known as
distance oracles. For any directed planar graph on n nodes with non-negative
lengths we obtain the following:
* Given a desired space allocation , we show how to
construct in time a data structure of size that answers
distance queries in time per query.
As a consequence, we obtain an improvement over the fastest algorithm for
k-many distances in planar graphs whenever .
* We provide a linear-space exact distance oracle for planar graphs with
query time for any constant eps>0. This is the first such data
structure with provable sublinear query time.
* For edge lengths at least one, we provide an exact distance oracle of space
such that for any pair of nodes at distance D the query time is
. Comparable query performance had been observed
experimentally but has never been explained theoretically.
Our data structures are based on the following new tool: given a
non-self-crossing cycle C with nodes, we can preprocess G in
time to produce a data structure of size that can
answer the following queries in time: for a query node u, output
the distance from u to all the nodes of C. This data structure builds on and
extends a related data structure of Klein (SODA'05), which reports distances to
the boundary of a face, rather than a cycle.
The best distance oracles for planar graphs until the current work are due to
Cabello (SODA'06), Djidjev (WG'96), and Fakcharoenphol and Rao (FOCS'01). For
and space , we essentially improve the query
time from to .Comment: To appear in the proceedings of the 23rd ACM-SIAM Symposium on
Discrete Algorithms, SODA 201
- …