1,528 research outputs found

Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit

Authors: Crochemore Maxime; Landau Gad M.; Schieber Baruch; Ziv-Ukelson Michal
Publication venue: King's College London Publications
Publication date: 01/01/2005

The problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we will be primarily concerned with sequences as our objects and with various string comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of various solutions whose time complexity depends both on the chosen parsing method as well as on the string-comparison metric used for the alignment.

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

A practical index for approximate dictionary matching with few mismatches

Authors: Cisłak Aleksander; Grabowski Szymon
Publication date: 11/02/2016

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.

arXiv.org e-Print Archive

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)
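
The Dirichlet (pigeonhole) principle behind the split index entry above can be illustrated for the one-mismatch case: if a query differs from a dictionary word in at most one position, then at least one of its two halves matches that word's corresponding half exactly. The following is a minimal Python sketch under that assumption only. It is not the authors' C++ implementation and omits their data compaction and q-gram substitution; the names `build_split_index` and `query_one_mismatch` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def split(word):
    """Split a word into a left and a right part at its midpoint."""
    mid = len(word) // 2
    return word[:mid], word[mid:]


def build_split_index(words):
    """Map (part number, word length, part text) to the words containing that part."""
    index = defaultdict(set)
    for w in words:
        left, right = split(w)
        index[(0, len(w), left)].add(w)
        index[(1, len(w), right)].add(w)
    return index


def query_one_mismatch(index, q):
    """Return all indexed words within Hamming distance 1 of q, sorted."""
    left, right = split(q)
    # Pigeonhole: a single mismatch can spoil only one of the two parts,
    # so every valid answer shares at least one part with the query exactly.
    candidates = index.get((0, len(q), left), set()) | index.get((1, len(q), right), set())
    # Verify each candidate, since sharing one part does not bound the other.
    return sorted(w for w in candidates if hamming(w, q) <= 1)


if __name__ == "__main__":
    idx = build_split_index(["table", "cable", "tablet", "sable", "fable", "maple"])
    print(query_one_mismatch(idx, "gable"))
```

Keying the index on the word length as well as the part keeps candidates restricted to words for which the Hamming distance is defined. Extending the sketch to k mismatches would mean splitting into k + 1 parts, which is an extrapolation of the same principle rather than something stated in the abstract.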

Approximating Dynamic Time Warping Distance Between Run-Length Encoded Strings

Authors: Kuszmaul William; Xi Zoe
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual European Symposium on Algorithms (ESA 2022)
Publication date: 01/01/2022

Dagstuhl Research Online Publication Server

RLE Edit Distance in Near Optimal Time

Authors: Gawrychowski Pawel; Kociumaka Tomasz; Martin Daniel P.; Uznanski Przemyslaw
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)
Publication date: 01/01/2019

We show that the edit distance between two run-length encoded strings of compressed lengths m and n respectively, can be computed in O(mn log(mn)) time. This improves the previous record by a factor of O(n/log(mn)). The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively closes a line of algorithmic research first started in 1993.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server
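
To make the terms in the RLE edit distance entry above concrete, the sketch below shows run-length encoding together with the textbook dynamic program for edit distance on the decoded strings. This is only a reference baseline whose cost grows with the product of the uncompressed lengths; it is not the O(mn log(mn)) algorithm from the paper, which works on the runs directly. The function names are illustrative assumptions.

```python
from itertools import groupby


def rle(s):
    """Run-length encode a string as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def decode(runs):
    """Expand a list of (symbol, run length) pairs back into a string."""
    return "".join(ch * n for ch, n in runs)


def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute) via the classic DP."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y = rle("aaaabbbc"), rle("aabbbbc")
    print(x, y, edit_distance(decode(x), decode(y)))
```

Here the compressed lengths m and n from the abstract correspond to `len(rle(...))` for each input, which can be far smaller than the string lengths when runs are long.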

Hardness of longest common subsequence for sequences with bounded run-lengths

Authors: Blin Guillaume; Bulteau Laurent; Jiang Minghui; Tejada Pedro J.; Vialette Stéphane
Publication venue: Springer Science and Business Media LLC
Publication date: 03/07/2012

The longest common subsequence (LCS) problem is a classic and well-studied problem in computer science with extensive applications in diverse areas ranging from spelling error corrections to molecular biology. This paper focuses on LCS for fixed alphabet size and fixed run-lengths (i.e., maximum number of consecutive occurrences of the same symbol). We show that LCS is NP-complete even when restricted to (i) alphabets of size 3 and run-length at most 1, and (ii) alphabets of size 2 and run-length at most 2 (both results are tight). For the latter case, we show that the problem is approximable within ratio 3/5.

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM