Search CORE

54 research outputs found

Faster subsequence recognition in compressed strings

Author: A Tiskin
A Tiskin
A. Tiskin
BW Watson
CER Alves
G Myers
G Navarro
G Ziv
G Ziv
J Kärkkäinen
JL Bentley
M Crochemore
P Cégielski
TA Welch
W Rytter
WJ Masek
Publication venue
Publication date: 18/01/2008
Field of study

Computation on compressed strings is one of the key approaches to processing massive data sets. We consider local subsequence recognition problems on strings compressed by straight-line programs (SLP), which is closely related to Lempel--Ziv compression. For an SLP-compressed text of length

\bar m

, and an uncompressed pattern of length

n

, C{\'e}gielski et al. gave an algorithm for local subsequence recognition running in time

O(\bar mn^2 \log n)

. We improve the running time to

O(\bar mn^{1.5})

. Our algorithm can also be used to compute the longest common subsequence between a compressed text and an uncompressed pattern in time

O(\bar mn^{1.5})

; the same problem with a compressed pattern is known to be NP-hard

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Dynamic String Alignment

Author: Charalampopoulos Panagiotis
Kociumaka Tomasz
Mozes Shay
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 01/01/2020
Field of study

Dagstuhl Research Online Publication Server

A Succinct Four Russians Speedup for Edit Distance Computation and One-against-many Banded Alignment

Author: Brubach Brian
Ghurye Jay
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. Annual Symposium on Combinatorial Pattern Matching (CPM 2018)
Publication date: 01/01/2018
Field of study

The classical Four Russians speedup for computing edit distance (a.k.a. Levenshtein distance), due to Masek and Paterson [Masek and Paterson, 1980], involves partitioning the dynamic programming table into k-by-k square blocks and generating a lookup table in O(psi^{2k} k^2 |Sigma|^{2k}) time and O(psi^{2k} k |Sigma|^{2k}) space for block size k, where psi depends on the cost function (for unit costs psi = 3) and |Sigma| is the size of the alphabet. We show that the O(psi^{2k} k^2) and O(psi^{2k} k) factors can be improved to O(k^2 lg{k}) time and O(k^2) space. Thus, we improve the time and space complexity of that aspect compared to Masek and Paterson [Masek and Paterson, 1980] and remove the dependence on psi. We further show that for certain problems the O(|Sigma|^{2k}) factor can also be reduced. Using this technique, we show a new algorithm for the fundamental problem of one-against-many banded alignment. In particular, comparing one string of length m to n other strings of length m with maximum distance d can be performed in O(n m + m d^2 lg{d} + n d^3) time. When d is reasonably small, this approaches or meets the current best theoretic result of O(nm + n d^2) achieved by using the best known pairwise algorithm running in O(m + d^2) time [Myers, 1986][Ukkonen, 1985] while potentially being more practical. It also improves on the standard practical approach which requires O(n m d) time to iteratively run an O(md) time pairwise banded alignment algorithm. Regarding pairwise comparison, we extend the classic result of Masek and Paterson [Masek and Paterson, 1980] which computes the edit distance between two strings in O(m^2/log{m}) time to remove the dependence on psi even when edits have arbitrary costs from a penalty matrix. Crochemore, Landau, and Ziv-Ukelson [Crochemore, 2003] achieved a similar result, also allowing for unrestricted scoring matrices, but with variable-sized blocks. In practical applications of the Four Russians speedup wherein space efficiency is important and smaller block sizes k are used (notably k < |Sigma|), Kim, Na, Park, and Sim [Kim et al., 2016] showed how to remove the dependence on the alphabet size for the unit cost version, generating a lookup table in O(3^{2k} (2k)! k^2) time and O(3^{2k} (2k)! k) space. Combining their work with our result yields an improvement to O((2k)! k^2 lg{k}) time and O((2k)! k^2) space

Dagstuhl Research Online Publication Server

Bounded-Length Smith-Waterman Alignment

Author: Tiskin Alexander
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)
Publication date: 01/01/2019
Field of study

Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith-Waterman alignment problem on a pair of strings of lengths m, n, asks for the maximum alignment score across all substring pairs, such that the first substring\u27s length (respectively, the sum of the two substrings\u27 lengths) is above the given threshold w. The latter problem was introduced by Arslan and Egecioglu under the name "local alignment with length threshold". They proposed a dynamic programming algorithm solving the problem in time O(mn^2), and also an approximation algorithm running in time O(rmn), where r is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time O(mn), assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith - Waterman alignment problem, running in time O(mn log n). Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others

Dagstuhl Research Online Publication Server

Bounded-length Smith-Waterman alignment

Author: Tiskin Alexander
Publication venue: Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik
Publication date
Field of study

Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith-Waterman alignment problem on a pair of strings of lengths m, n, asks for the maximum alignment score across all substring pairs, such that the first substring's length (respectively, the sum of the two substrings' lengths) is above the given threshold w. The latter problem was introduced by Arslan and Egecioglu under the name "local alignment with length threshold". They proposed a dynamic programming algorithm solving the problem in time O(mn^2), and also an approximation algorithm running in time O(rmn), where r is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time O(mn), assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith - Waterman alignment problem, running in time O(mn log n). Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others

Warwick Research Archives Portal Repository

Computing alignment plots efficiently

Author: Krusche Peter
Tiskin Alexander
Publication venue
Publication date: 10/09/2009
Field of study

Dot plots are a standard method for local comparison of biological sequences. In a dot plot, a substring to substring distance is computed for all pairs of fixed-size windows in the input strings. Commonly, the Hamming distance is used since it can be computed in linear time. However, the Hamming distance is a rather crude measure of string similarity, and using an alignment-based edit distance can greatly improve the sensitivity of the dot plot method. In this paper, we show how to compute alignment plots of the latter type efficiently. Given two strings of length m and n and a window size w, this problem consists in computing the edit distance between all pairs of substrings of length w, one from each input string. The problem can be solved by repeated application of the standard dynamic programming algorithm in time O(mnw^2). This paper gives an improved data-parallel algorithm, running in time

O(mnw/\gamma/p)

using vector operations that work on

\gamma

values in parallel and

p

processors. We show experimental results from an implementation of this algorithm, which uses Intel's MMX/SSE instructions for vector parallelism and MPI for coarse-grained parallelism.Comment: Presented at ParCo 200

arXiv.org e-Print Archive

Warwick Research Archives Portal Repository

On Almost Monge All Scores Matrices

Author: Carmel Amir
Tsur Dekel
Ziv-Ukelson Michal
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)
Publication date: 01/01/2016
Field of study

Dagstuhl Research Online Publication Server

DSEARCH: sensitive database searching using distributed computing

Author: Keane T.M.
Naughton Thomas J.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2005
Field of study

Summary: We present a distributed and fully cross-platform database search program that allows the user to utilise the idle clock cycles of machines to perform large searches using the most sensitive algorithms. For those in an academic or corporate environment with hundreds of idle desktop machines, DSEARCH can deliver a âfreeâ database search supercomputer. Availability: The software is publicly available under the GNU general public licence from http://www.cs.may.ie/distributed Contact: [email protected] Supplementary Information: Full documentation and a user manual is available from http://www.cs.may.ie/distribute

MURAL - Maynooth University Research Archive Library