Hardness of longest common subsequence for sequences with bounded run-lengths

Author: Blin Guillaume
Bulteau Laurent
Jiang Minghui
Tejada Pedro J.
Vialette Stéphane
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

The longest common subsequence (LCS) problem is a classic and well-studied problem in computer science with extensive applications in diverse areas ranging from spelling error correction to molecular biology. This paper focuses on LCS for fixed alphabet size and fixed run-lengths (i.e., maximum number of consecutive occurrences of the same symbol). We show that LCS is NP-complete even when restricted to (i) alphabets of size 3 and run-length at most 1, and (ii) alphabets of size 2 and run-length at most 2 (both results are tight). For the latter case, we show that the problem is approximable within ratio 3/5.
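
To make the two parameters in this abstract concrete, here is a minimal Python sketch, not taken from the paper. The helper names `max_run_length` and `lcs_length` are illustrative assumptions. The first computes the run-length of a string as defined above; the second is the textbook quadratic dynamic program for two strings. It is shown only as a baseline: two strings are solvable in polynomial time, so the hardness result presumably concerns an arbitrary number of input strings.

```python
from itertools import groupby

def max_run_length(s):
    """Longest block of consecutive equal symbols (0 for the empty string)."""
    return max((len(list(group)) for _, group in groupby(s)), default=0)

def lcs_length(x, y):
    """Textbook O(|x| * |y|) dynamic program for the LCS of two strings."""
    prev = [0] * (len(y) + 1)          # row for the previous prefix of x
    for a in x:
        cur = [0]                      # LCS with the empty prefix of y is 0
        for j, b in enumerate(y, 1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, `max_run_length("abcabc")` is 1, which matches restriction (i) on an alphabet of size 3.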

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Sketching, Streaming, and Fine-Grained Complexity of (Weighted) LCS

Author: Bringmann Karl
Chaudhury Bhaskar Ray
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 38th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2018)
Publication date: 01/01/2018

We study sketching and streaming algorithms for the Longest Common Subsequence problem (LCS) on strings of small alphabet size $|\Sigma|$. For the problem of deciding whether the LCS of strings $x, y$ has length at least $L$, we obtain a sketch size and streaming space usage of $O(L^{|\Sigma| - 1} \log L)$. We also prove matching unconditional lower bounds. As an application, we study a variant of LCS where each alphabet symbol is equipped with a weight that is given as input, and the task is to compute a common subsequence of maximum total weight. Using our sketching algorithm, we obtain an $O(\min\{nm, n + m^{|\Sigma|}\})$-time algorithm for this problem, on strings $x, y$ of length $n, m$, with $n \ge m$. We prove optimality of this running time up to lower order factors, assuming the Strong Exponential Time Hypothesis.
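
The weighted variant described in this abstract can be illustrated with the straightforward quadratic dynamic program, which corresponds to the `nm` term of the stated bound. This is a hedged sketch, not the paper's sketching-based algorithm; the function name `weighted_lcs` and the dictionary-based `weight` argument are assumptions made for illustration.

```python
def weighted_lcs(x, y, weight):
    """Maximum total weight of a common subsequence of x and y.

    `weight` maps each alphabet symbol to its (assumed numeric) weight.
    Runs in O(|x| * |y|) time and O(|y|) space.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            best = max(prev[j], cur[j - 1])              # skip a symbol of x or of y
            if a == b:
                best = max(best, prev[j - 1] + weight[a])  # match the common symbol
            cur.append(best)
        prev = cur
    return prev[-1]
```

With unit weights this reduces to the ordinary LCS length; with unequal weights a shorter but heavier subsequence can win.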

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

MPG.PuRe

RLE Edit Distance in Near Optimal Time

Author: Gawrychowski Pawel
Kociumaka Tomasz
Martin Daniel P.
Uznanski Przemyslaw
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 44th International Symposium on Mathematical Foundations of Computer Science (MFCS 2019)
Publication date: 01/01/2019

We show that the edit distance between two run-length encoded strings of compressed lengths $m$ and $n$, respectively, can be computed in $O(mn \log(mn))$ time. This improves the previous record by a factor of $O(n/\log(mn))$. The running time of our algorithm is within subpolynomial factors of being optimal, subject to the standard SETH-hardness assumption. This effectively closes a line of algorithmic research started in 1993.
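
For readers unfamiliar with the setting, the sketch below shows run-length encoding and the naive baseline that decodes both inputs and runs the classic quadratic edit-distance dynamic program on the full strings. It is not the paper's algorithm, which works directly on the runs; the names `rle`, `rle_decode`, and `edit_distance` are illustrative assumptions.

```python
from itertools import groupby

def rle(s):
    """Run-length encode a string as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

def rle_decode(runs):
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(ch * k for ch, k in runs)

def edit_distance(a, b):
    """Classic Levenshtein distance on uncompressed strings, O(|a| * |b|)."""
    prev = list(range(len(b) + 1))     # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]
```

The baseline costs time proportional to the product of the uncompressed lengths, which is what an algorithm parameterized by the numbers of runs avoids.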

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Author: A Farruggia
C Boucher
C Hoobin
D Belazzougui
H Ferrada
J Ziv
M Léonard
P Ferragina
R Raman
S Deorowicz
S Kuruppu
Publication date: 01/01/2016

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times.
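
The greedy parsing with a trailing mismatch character described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not any of the cited implementations: the names `rlz_parse` and `rlz_decode` are hypothetical, the longest match is found by brute force rather than with an index over the reference, and the relative-pointer and adaptive-pointer refinements are omitted.

```python
def rlz_parse(reference, target):
    """Greedily parse `target` into (position, length, next_char) phrases.

    Each phrase copies reference[position:position + length] and then emits
    one explicit character, so a single substitution costs only one phrase.
    """
    phrases = []
    i = 0
    while i < len(target):
        max_len = len(target) - i - 1      # always leave one explicit character
        best_pos, best_len = 0, 0
        for p in range(len(reference)):    # brute-force longest match
            l = 0
            while (l < max_len and p + l < len(reference)
                   and reference[p + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = p, l
        phrases.append((best_pos, best_len, target[i + best_len]))
        i += best_len + 1
    return phrases

def rlz_decode(reference, phrases):
    """Rebuild the target from the reference and the phrase list."""
    return "".join(reference[p:p + l] + c for p, l, c in phrases)
```

For instance, parsing `"ACGAACGT"` against the reference `"ACGTACGT"` yields two phrases, the first ending in the substituted character `A`.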

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve

Author: Abboud A.
Backurs A.
Bringmann K.
Künnemann M.
Publication date: 01/01/2018

Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size $n$ of data that originally has size $N$, and we want to solve a problem with time complexity $T(\cdot)$. The naive strategy of "decompress-and-solve" gives time $T(N)$, whereas "the gold standard" is time $T(n)$: to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, established this as an influential notion for Algorithm design. We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are:
- The $O(nN\sqrt{\log{N/n}})$ bound for LCS and the $O(\min\{N \log N, nM\})$ bound for Pattern Matching with Wildcards are optimal up to $N^{o(1)}$ factors, under the Strong Exponential Time Hypothesis. (Here, $M$ denotes the uncompressed length of the compressed pattern.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the $k$-Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for Disjointness.
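
As a small illustration of working on a grammar compression without decompressing it, the sketch below represents a string by a straight-line program (a grammar in which every rule is either a single character or a pair of earlier nonterminals). It computes the uncompressed length and answers random-access queries in time that depends on the grammar, not on the expanded string. This is a generic textbook example and not an algorithm from the paper; the dictionary layout and the names `slp_lengths` and `slp_char_at` are assumptions.

```python
def slp_lengths(rules):
    """Expanded length of every nonterminal, in time linear in the grammar size.

    `rules` maps a nonterminal to a single character or to a (left, right)
    pair, and must list each rule after the rules it refers to.
    """
    length = {}
    for name, rhs in rules.items():
        if isinstance(rhs, str):
            length[name] = 1
        else:
            left, right = rhs
            length[name] = length[left] + length[right]
    return length

def slp_char_at(rules, length, name, i):
    """Character at 0-based index i of the expansion of `name`, without expanding it."""
    while not isinstance(rules[name], str):
        left, right = rules[name]
        if i < length[left]:
            name = left                # index falls in the left half
        else:
            i -= length[left]          # shift the index into the right half
            name = right
    return rules[name]

# Five rules that expand to the eight-character string "abababab".
rules = {"A": "a", "B": "b", "X1": ("A", "B"), "X2": ("X1", "X1"), "X3": ("X2", "X2")}
lengths = slp_lengths(rules)
```

Here `lengths["X3"]` plays the role of $N$ and the number of rules plays the role of $n$.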

MPG.PuRe

Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings

Author: Karppanen Jari
Publication venue: Helsingin yliopisto
Publication date: 01/01/2012

Differential compression allows expressing a modified document as differences relative to another version of the document. A compressed string requires space proportional to the amount of changes, irrespective of the original document sizes. The purpose of this study was to determine which algorithms are suitable for universal lossless differential compression for synchronizing two arbitrary documents either locally or remotely. The two main problems in differential compression are finding the differences (differencing) and compactly communicating the differences (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length content-defined substrings. We described various heuristics for approximating optimal algorithms, as arbitrarily long strings and memory limitations force discarding information. Discussion also included compact delta encoding and in-place reconstruction. We presented results from empirical testing using the discussed algorithms. The conclusions were that multiple algorithms need to be integrated into a hybrid implementation, which heuristically chooses algorithms based on evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the fewest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant reduction in compression efficiency. The compression efficiency of current popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files without having access to one of the two strings.
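
The hashtable-lookup style of local differencing mentioned in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's implementation: the names `delta_encode` and `delta_decode`, the seed length `k`, and the `("copy", offset, length)` / `("add", literal)` operation format are all hypothetical choices.

```python
def delta_encode(source, target, k=4):
    """Encode `target` as copy/add operations relative to `source`.

    Every length-k substring of `source` is indexed in a dictionary; a hit
    in `target` is extended greedily and emitted as a copy, and everything
    else is emitted as literal text.
    """
    index = {}
    for i in range(len(source) - k + 1):
        index.setdefault(source[i:i + k], i)   # keep the first occurrence

    ops, literal, j = [], [], 0
    while j < len(target):
        pos = index.get(target[j:j + k])
        if pos is None:
            literal.append(target[j])          # no seed match: literal character
            j += 1
            continue
        n = k
        while (pos + n < len(source) and j + n < len(target)
               and source[pos + n] == target[j + n]):
            n += 1                             # extend the match to the right
        if literal:
            ops.append(("add", "".join(literal)))
            literal = []
        ops.append(("copy", pos, n))
        j += n
    if literal:
        ops.append(("add", "".join(literal)))
    return ops

def delta_decode(source, ops):
    """Rebuild the target from `source` and the operation list."""
    out = []
    for op in ops:
        if op[0] == "add":
            out.append(op[1])
        else:
            _, pos, n = op
            out.append(source[pos:pos + n])
    return "".join(out)
```

An unchanged document encodes to a single copy operation, which reflects the abstract's point that the delta size depends on the amount of change rather than on the document size.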

Helsingin yliopiston digitaalinen arkisto

Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve

Author: Abboud Amir
Backurs Arturs
Bringmann Karl
Künnemann Marvin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

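Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size $n$ of data that originally has size $N$, and we want to solve a problem with time complexity $T(\cdot)$. The naive strategy of "decompress-and-solve" gives time $T(N)$, whereas "the gold standard" is time $T(n)$: to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, established this as an influential notion for Algorithm design. We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are:
- The $O(nN\sqrt{\log{N/n}})$ bound for LCS and the $O(\min\{N \log N, nM\})$ bound for Pattern Matching with Wildcards are optimal up to $N^{o(1)}$ factors, under the Strong Exponential Time Hypothesis. (Here, $M$ denotes the uncompressed length of the compressed pattern.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the $k$-Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for Disjointness.
Comment: Presented at FOCS'17. Full version. 63 pages.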

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Recommended from our members

An image delta compression tool: IDelta

Author: Sullivan Kevin Michael
Publication venue: CSUSB ScholarWorks
Publication date: 01/01/2004

The purpose of this thesis is to present iDelta, a modified version of the algorithm used in the open source differencing tool zdelta. This algorithm will manage file data and will be built specifically to difference images in the Photoshop file format.

CSUSB ScholarWorks