529 research outputs found

Faster subsequence recognition in compressed strings

Author: A Tiskin
BW Watson
CER Alves
G Myers
G Navarro
G Ziv
J Kärkkäinen
JL Bentley
M Crochemore
P Cégielski
TA Welch
W Rytter
WJ Masek
Publication date: 18/01/2008

Computation on compressed strings is one of the key approaches to processing massive data sets. We consider local subsequence recognition problems on strings compressed by straight-line programs (SLP), which is closely related to Lempel-Ziv compression. For an SLP-compressed text of length $\bar m$, and an uncompressed pattern of length $n$, Cégielski et al. gave an algorithm for local subsequence recognition running in time $O(\bar m n^2 \log n)$. We improve the running time to $O(\bar m n^{1.5})$. Our algorithm can also be used to compute the longest common subsequence between a compressed text and an uncompressed pattern in time $O(\bar m n^{1.5})$; the same problem with a compressed pattern is known to be NP-hard.
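To make the setting concrete, here is a minimal sketch of the simpler global question: is the whole pattern a subsequence of an SLP-compressed text? It is not the algorithm from the abstract above (which handles local recognition and LCS); it is the standard per-rule table technique, running in time proportional to the number of rules times the pattern length. The rule encoding (`rules`, `order`, `start`) and the function name are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical SLP encoding (an assumption of this sketch): a rule is either
# a single terminal character or a pair of names of previously defined rules.
Rule = Union[str, Tuple[str, str]]


def is_subsequence_of_slp(rules: Dict[str, Rule], order: List[str],
                          start: str, pattern: str) -> bool:
    """Return True if `pattern` is a subsequence of the text derived from `start`.

    `order` lists rule names so that each rule comes after the rules it uses.
    """
    n = len(pattern)
    # advance[X][i] is the pattern index reached after greedily matching the
    # expansion of X against pattern[i:], for every i in 0..n.
    advance: Dict[str, List[int]] = {}
    for name in order:
        rule = rules[name]
        if isinstance(rule, str):
            # A terminal consumes pattern[i] only if the characters agree.
            row = [i + 1 if i < n and pattern[i] == rule else i
                   for i in range(n + 1)]
        else:
            left, right = rule
            # Concatenation: run the left table first, then the right one.
            row = [advance[right][advance[left][i]] for i in range(n + 1)]
        advance[name] = row
    return advance[start][0] == n


# Example grammar deriving the text "abaab".
rules = {"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "A"), "E": ("D", "C")}
order = ["A", "B", "C", "D", "E"]
print(is_subsequence_of_slp(rules, order, "E", "aab"))
print(is_subsequence_of_slp(rules, order, "E", "bba"))
```

The text is never expanded: each rule contributes one table of `n + 1` entries, so the work depends on the grammar size rather than on the uncompressed length.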

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Compressed Subsequence Matching and Packed Tree Coloring

Author: A. Tiskin
D.D. Sleator
G. Das
H. Mannila
J. Ziv
M. Charikar
M. Crochemore
M. Thorup
M.A. Bender
M.L. Fredman
N.J. Larsson
O. Berkman
P. Cégielski
P. Ferragina
P.F. Dietz
R.A. Baeza-Yates
S. Abiteboul
S. Alstrup
T. Yamamoto
W. Rytter
Z. Troníček
Publication date: 01/01/2014

We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\log N\cdot occ)$ time. Here $w$ is the word size and $occ$ is the number of occurrences of the pattern. Our algorithm uses less space than previous algorithms and is also faster for $occ=o(\frac{n}{\log N})$ occurrences. The algorithm uses a new data structure that allows us to efficiently find the next occurrence of a given character after a given position in a compressed string. This data structure in turn is based on a new data structure for the tree color problem, where the node colors are packed in bit strings. Comment: To appear at CPM '1
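The next-occurrence query mentioned in the abstract can be illustrated with a much simpler structure than the packed tree coloring it describes. The sketch below is an assumption-laden stand-in, not the paper's data structure: it stores, for every grammar rule, the expansion length and the set of characters that occur in it, and uses those to skip subtrees that cannot contain the requested character. The class and method names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple, Union

Rule = Union[str, Tuple[str, str]]  # terminal character, or (left, right)


class CompressedString:
    """Grammar-compressed string with a naive next-occurrence query."""

    def __init__(self, rules: Dict[str, Rule], order: List[str], start: str):
        self.rules = rules
        self.start = start
        self.length: Dict[str, int] = {}
        self.chars: Dict[str, FrozenSet[str]] = {}
        # `order` must list each rule after the rules it refers to.
        for name in order:
            rule = rules[name]
            if isinstance(rule, str):
                self.length[name] = 1
                self.chars[name] = frozenset(rule)
            else:
                left, right = rule
                self.length[name] = self.length[left] + self.length[right]
                self.chars[name] = self.chars[left] | self.chars[right]

    def next_occurrence(self, char: str, pos: int,
                        name: Optional[str] = None) -> Optional[int]:
        """Smallest index >= pos (pos >= 0) holding `char`, or None."""
        name = self.start if name is None else name
        if pos >= self.length[name] or char not in self.chars[name]:
            return None  # nothing to find in this subtree
        rule = self.rules[name]
        if isinstance(rule, str):
            return 0  # a matching terminal, and pos is 0 here
        left, right = rule
        left_len = self.length[left]
        if pos < left_len:
            hit = self.next_occurrence(char, pos, left)
            if hit is not None:
                return hit
        hit = self.next_occurrence(char, max(pos - left_len, 0), right)
        return None if hit is None else left_len + hit


# Grammar deriving "abaab" (indices: a0 b1 a2 a3 b4).
rules = {"A": "a", "B": "b", "C": ("A", "B"), "D": ("C", "A"), "E": ("D", "C")}
text = CompressedString(rules, ["A", "B", "C", "D", "E"], "E")
print(text.next_occurrence("b", 2))
```

Subsequence matching then reduces to calling `next_occurrence` once per pattern character, each time starting just after the previous hit. The per-rule character sets play the role that the abstract assigns to bit strings packed into machine words.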

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

Algorithms and data structures for grammar-compressed strings

Author: Cording Patrick Hagge
Publication venue: Technical University of Denmark
Publication date: 01/01/2015

Online Research Database In Technology

Measuring complexity with zippers

Author: Baronchelli Andrea
Caglioti Emanuele
Loreto Vittorio
Publication venue: 'IOP Publishing'
Publication date: 01/01/2005

Physics concepts have often been borrowed and independently developed by other fields of science. In this perspective a significant example is that of entropy in Information Theory. The aim of this paper is to provide a short and pedagogical introduction to the use of data compression techniques for the estimate of entropy and other relevant quantities in Information Theory and Algorithmic Information Theory. We consider in particular the LZ77 algorithm as case study and discuss how a zipper can be used for information extraction. Comment: 10 pages, 3 figure
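As a rough illustration of using a zipper as an entropy estimator, the sketch below measures compressed size with Python's standard `zlib` module, whose DEFLATE format combines LZ77 with Huffman coding. The function names are hypothetical, and the second estimator (compress a reference, then see how many extra bits an appended probe costs) is an assumed reading of "information extraction", not a procedure quoted from the paper.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Size in bits of `data` after zlib compression at the highest level."""
    return 8 * len(zlib.compress(data, 9))


def bits_per_symbol(data: bytes) -> float:
    """Crude entropy estimate: compressed bits per input byte."""
    return compressed_bits(data) / len(data)


def cross_bits_per_symbol(reference: bytes, probe: bytes) -> float:
    """Extra compressed bits per probe byte when the probe follows the reference.

    A low value suggests the probe resembles the reference. The reference
    should fit in zlib's 32 KiB window, or earlier parts are not reused.
    """
    extra = compressed_bits(reference + probe) - compressed_bits(reference)
    return extra / len(probe)


rng = random.Random(0)
repetitive = b"ab" * 5000
noise = bytes(rng.randrange(256) for _ in range(10000))
print(bits_per_symbol(repetitive), bits_per_symbol(noise))
```

Highly regular input compresses to a small fraction of a bit per byte, while pseudo-random bytes stay near eight bits per byte. Both numbers include format overhead, so they are upper-bound style estimates rather than exact entropies.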

arXiv.org e-Print Archive

CiteSeerX

City Research Online

CERN Document Server

Archivio della ricerca- Università di Roma La Sapienza

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Author: Jiang Huiqiang
Li Dongsheng
Lin Chin-Yew
Luo Xufang
Qiu Lili
Wu Qianhui
Yang Yuqing
Publication date: 10/10/2023

In long context scenarios, large language models (LLMs) face three main challenges: higher computational/financial cost, longer latency, and inferior performance. Some studies reveal that the performance of LLMs depends on both the density and the position of the key information (question relevant) in the input prompt. Inspired by these findings, we propose LongLLMLingua for prompt compression towards improving LLMs' perception of the key information to simultaneously address the three challenges. We conduct evaluation on a wide range of long context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. The experimental results show that LongLLMLingua compressed prompt can derive higher performance with much less cost. The latency of the end-to-end system is also reduced. For example, on NaturalQuestions benchmark, LongLLMLingua gains a performance boost of up to 17.1% over the original prompt with ~4x fewer tokens as input to GPT-3.5-Turbo. It can derive cost savings of \$28.5 and \$27.4 per 1,000 samples from the LongBench and ZeroScrolls benchmark, respectively. Additionally, when compressing prompts of ~10k tokens at a compression rate of 2x-10x, LongLLMLingua can speed up the end-to-end latency by 1.4x-3.8x. Our code is available at https://aka.ms/LLMLingua

arXiv.org e-Print Archive

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Author: Alzamel Mai
Grossi Roberto
Hagerup Torben
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/05/2019

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time. Comment: Full version of a paper accepted to STOC 201
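For readers unfamiliar with the two objects named in this abstract, the following sketch gives textbook-style naive definitions of the BWT and of an LCE query. It has nothing to do with string synchronizing sets or the sublinear bounds above; suffix sorting here is done by plain comparison, and the `$` sentinel (assumed to be absent from the text and smaller than every text character) is a convention chosen for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorting all suffixes.

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text`.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Suffix start positions in lexicographic order of the suffixes.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    # Output the symbol preceding each suffix (wrapping around for i = 0).
    return "".join(s[i - 1] for i in order)


def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    k = 0
    while (i + k < len(text) and j + k < len(text)
           and text[i + k] == text[j + k]):
        k += 1
    return k


print(bwt("banana"))         # annb$aa
print(lce("banana", 1, 3))   # 3
```

Each LCE query here scans character by character, which is exactly the cost the $O(1)$-time data structure in the abstract avoids.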

arXiv.org e-Print Archive

Crossref

Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve

Author: Abboud A.
Backurs A.
Bringmann K.
Künnemann M.
Publication date: 01/01/2018

Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size $n$ of data that originally has size $N$, and we want to solve a problem with time complexity $T(\cdot)$. The naive strategy of "decompress-and-solve" gives time $T(N)$, whereas "the gold standard" is time $T(n)$: to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv-family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, established this as an influential notion for Algorithm design. We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are:
- The $O(nN\sqrt{\log{N/n}})$ bound for LCS and the $O(\min\{N \log N, nM\})$ bound for Pattern Matching with Wildcards are optimal up to $N^{o(1)}$ factors, under the Strong Exponential Time Hypothesis. (Here, $M$ denotes the uncompressed length of the compressed pattern.)
- Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the $k$-Clique conjecture.
- We give an algorithm showing that decompress-and-solve is not optimal for Disjointness.
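The contrast between time $T(N)$ and time $T(n)$ can be seen on a deliberately easy task that is not one of the problems studied in the abstract: counting how often a character occurs. The sketch below assumes a toy grammar encoding (each rule is a terminal or a pair of earlier rules) and compares expanding the text with working on the rules directly; all names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Rule = Union[str, Tuple[str, str]]  # terminal character, or (left, right)


def expand(rules: Dict[str, Rule], order: List[str], start: str) -> str:
    """Decompress: build the full text, whose length can be exponential in n."""
    text: Dict[str, str] = {}
    for name in order:
        rule = rules[name]
        text[name] = rule if isinstance(rule, str) else text[rule[0]] + text[rule[1]]
    return text[start]


def count_decompress_and_solve(rules: Dict[str, Rule], order: List[str],
                               start: str, char: str) -> int:
    """Baseline: expand first, then scan the uncompressed text."""
    return expand(rules, order, start).count(char)


def count_on_grammar(rules: Dict[str, Rule], order: List[str],
                     start: str, char: str) -> int:
    """One pass over the rules: constant work per rule, no expansion."""
    count: Dict[str, int] = {}
    for name in order:
        rule = rules[name]
        if isinstance(rule, str):
            count[name] = 1 if rule == char else 0
        else:
            count[name] = count[rule[0]] + count[rule[1]]
    return count[start]


# Doubling grammar: X0 -> 'a', Xi -> X(i-1) X(i-1); 11 rules, text length 1024.
rules: Dict[str, Rule] = {"X0": "a"}
order = ["X0"]
for i in range(1, 11):
    rules[f"X{i}"] = (f"X{i - 1}", f"X{i - 1}")
    order.append(f"X{i}")
print(count_on_grammar(rules, order, "X10", "a"))
print(count_decompress_and_solve(rules, order, "X10", "a"))
```

Both functions return the same count, but the second pass touches each rule once. The lower bounds in the abstract concern problems where no such shortcut is believed to exist.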

MPG.PuRe