
777 research outputs found

Off-line compression by greedy textual substitution

Author: A. Apostolico
S. Lonardi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Compression of Biological Sequences by Greedy Off-Line Textual Substitution

Author: Apostolico Alberto
Lonardi Stefano
Publication venue: 'Purdue University (bepress)'
Publication date: 01/11/1999

Purdue E-Pubs

Structure induction by lossless graph compression

Author: Peshkin Leonid
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

This work is motivated by the necessity to automate the discovery of structure in vast and ever-growing collections of relational data commonly represented as graphs, for example genomic networks. A novel algorithm, dubbed Graphitour, for structure induction by lossless graph compression is presented and illustrated by a clear and broadly known case of nested structure in a DNA molecule. This work extends to graphs some well-established approaches to grammatical inference previously applied only to strings. The bottom-up graph compression problem is related to the maximum cardinality (non-bipartite) matching problem. The algorithm accepts a variety of graph types, including directed graphs and graphs with labeled nodes and arcs. The resulting structure could be used for representation and classification of graphs. Comment: 10 pages, 7 figures, 2 tables published in Proceedings of the Data Compression Conference, 200

arXiv.org e-Print Archive

CiteSeerX

Crossref

Mining, compressing and classifying with extensible motifs

Author: A Apostolico
A Chattaraj
A Lempel
Alberto Apostolico
C Nevill-Manning
C Neville-Manning
E Lehman
J Kieffer
JA Storer
Laxmi Parida
M Li
Matteo Comin
S DeAgostino
S Vinga
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing; however, special classes of "rigid" motifs have been identified whose discovery is affordable in low polynomial time. RESULTS: In the present work, "extensible" motifs are considered such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction. CONCLUSION: Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.
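To make the notion of an extensible motif concrete, the sketch below matches one against a string by translating it into a regular expression. The list-based motif representation (solid characters as strings, elastic gaps as `(min, max)` tuples) and the function name `motif_to_regex` are assumptions made for illustration; they are not the notation or the discovery method of the paper, which concerns finding such motifs rather than merely matching them.

```python
import re

def motif_to_regex(motif):
    """Compile a hypothetical extensible-motif description into a regex.

    `motif` is a list whose items are either solid characters (str) or
    elastic gaps given as (min_len, max_len) tuples of wild characters.
    """
    parts = []
    for item in motif:
        if isinstance(item, str):
            parts.append(re.escape(item))        # solid character
        else:
            lo, hi = item
            parts.append(".{%d,%d}" % (lo, hi))  # gap of lo..hi wild characters
    return re.compile("".join(parts))

# 'A', then between 1 and 3 arbitrary characters, then 'C'.
pattern = motif_to_regex(["A", (1, 3), "C"])
occurrences = [m.group(0) for m in pattern.finditer("AxCAyyyCAC")]
```

Here the same motif fits segments of different lengths ("AxC" and "AyyyC"), which is the elasticity the abstract describes; the trailing "AC" is not matched because the gap must contain at least one character.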

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Padova

Finger Search in Grammar-Compressed Strings

Author: Bille Philip
Christiansen Anders Roy
Cording Patrick Hagge
Gørtz Inge Li
Publication date: 01/01/2016

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. Given a grammar, the random access problem is to compactly represent the grammar while supporting random access, that is, given a position in the original uncompressed string, report the character at that position. In this paper we study the random access problem with the finger search property, that is, the time for a random access query should depend on the distance between a specified index $f$, called the finger, and the query index $i$. We consider both a static variant, where we first place a finger and subsequently access indices near the finger efficiently, and a dynamic variant, where moving the finger is also supported in time that depends on the distance moved. Let $n$ be the size of the grammar, and let $N$ be the size of the string. For the static variant we give a linear space representation that supports placing the finger in $O(\log N)$ time and subsequently accessing in $O(\log D)$ time, where $D$ is the distance between the finger and the accessed index. For the dynamic variant we give a linear space representation that supports placing the finger in $O(\log N)$ time and accessing and moving the finger in $O(\log D + \log \log N)$ time. Compared to the best linear space solution to random access, we improve an $O(\log N)$ query bound to $O(\log D)$ for the static variant and to $O(\log D + \log \log N)$ for the dynamic variant, while maintaining linear space. As an application of our results we obtain an improved solution to the longest common extension problem in grammar-compressed strings. To obtain our results, we introduce several new techniques of independent interest, including a novel van Emde Boas style decomposition of grammars.
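As a point of reference for the random access problem described above, the following is a minimal sketch of the naive approach: store the expansion length of every nonterminal and descend from the start symbol. The grammar `RULES` and the names `expansion_length` and `access` are hypothetical. The query time of this sketch is proportional to the height of the grammar; it does not implement the finger search data structures or the bounds stated in the abstract.

```python
from functools import lru_cache

# Hypothetical straight-line grammar: each rule is either a terminal
# character or a pair (left, right) of rule names. "S" generates "abababab".
RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # ab
    "X2": ("X1", "X1"),  # abab
    "S": ("X2", "X2"),   # abababab
}

@lru_cache(maxsize=None)
def expansion_length(symbol):
    """Length of the string generated by `symbol`."""
    rule = RULES[symbol]
    if isinstance(rule, str):
        return 1
    left, right = rule
    return expansion_length(left) + expansion_length(right)

def access(symbol, i):
    """Return the character at 0-based position i of the expansion of `symbol`."""
    if not 0 <= i < expansion_length(symbol):
        raise IndexError(i)
    rule = RULES[symbol]
    while not isinstance(rule, str):
        left, right = rule
        left_len = expansion_length(left)
        if i < left_len:
            symbol = left       # position lies in the left child
        else:
            i -= left_len       # shift position into the right child
            symbol = right
        rule = RULES[symbol]
    return rule

decoded = "".join(access("S", i) for i in range(expansion_length("S")))
```

Every query in this sketch restarts at the root regardless of how close it is to a previous query, which is exactly the cost that the finger search property is meant to avoid.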

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Online Research Database In Technology

On the Use of Suffix Arrays for Memory-Efficient Lempel-Ziv Data Compression

Author: Ferreira Artur
Figueiredo Mario
Oliveira Arlindo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/03/2009

Much research has been devoted to optimizing algorithms of the Lempel-Ziv (LZ) 77 family, both in terms of speed and memory requirements. Binary search trees and suffix trees (ST) are data structures that have often been used for this purpose, as they allow fast searches at the expense of memory usage. In recent years, there has been interest in suffix arrays (SA), due to their simplicity and low memory requirements. One key issue is that an SA can solve the sub-string problem almost as efficiently as an ST, using less memory. This paper proposes two new SA-based algorithms for LZ encoding, which require no modifications on the decoder side. Experimental results on standard benchmarks show that our algorithms, though not faster, use 3 to 5 times less memory than the ST counterparts. Another important feature of our SA-based algorithms is that the amount of memory is independent of the text to search, thus the memory that has to be allocated can be defined a priori. These features of low and predictable memory requirements are of the utmost importance in several scenarios, such as embedded systems, where memory is at a premium and speed is not critical. Finally, we point out that the new algorithms are general, in the sense that they are adequate for applications other than LZ compression, such as text retrieval and forward/backward sub-string search. Comment: 10 pages, submitted to IEEE - Data Compression Conference 200
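The sub-string search that a suffix array provides for LZ77-style encoding can be sketched as follows. This is an illustrative toy under stated assumptions, not either of the two algorithms proposed in the paper: the suffix array is built by plain sorting, the suffixes are materialised as strings for readability, and the function names are hypothetical. It relies on the fact that the longest prefix of the lookahead occurring in the window is shared with one of the two suffixes adjacent to the lookahead's insertion point in sorted order.

```python
import bisect

def build_suffix_array(text):
    """Naive construction by sorting suffixes; fine for a small window."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_match(window, sa, lookahead):
    """Return (position, length) of the longest prefix of `lookahead` found in `window`."""
    # Materialised for clarity; a memory-conscious encoder would compare in place.
    suffixes = [window[i:] for i in sa]
    k = bisect.bisect_left(suffixes, lookahead)
    best = (0, 0)
    for j in (k - 1, k):  # only the two lexicographic neighbours can be optimal
        if 0 <= j < len(sa):
            length = common_prefix_len(suffixes[j], lookahead)
            if length > best[1]:
                best = (sa[j], length)
    return best

window = "abracadabra"
sa = build_suffix_array(window)
pos, length = longest_match(window, sa, "abrax")
```

The returned pair corresponds to the (position, length) part of an LZ77 token; because the decoder only needs such pairs, the choice of search structure on the encoder side leaves the decoder unchanged, as the abstract notes.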

arXiv.org e-Print Archive

Crossref

Searching for Smallest Grammars on Large Sequences and Application to DNA

Author: Carrascosa Rafael
Coste François
Gallé Matthias
Infante-Lopez Gabriel
Publication venue: 'Elsevier BV'
Publication date: 01/02/2012

Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10%) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms.

HAL-CentraleSupelec

Elsevier - Publisher Connector

INRIA a CCSD electronic archive server

HAL-Ecole des Ponts ParisTech

HAL-Rennes 1

HAL - UPEC / UPEM

In-place Update of Suffix Array while Recoding Words

Author: Coste François
Gallé Matthias
Peterlongo Pierre
Publication venue: HAL CCSD
Publication date: 01/09/2008

Motivated by grammatical inference and data compression applications, we propose an algorithm to update a suffix array after the substitution, in the indexed text, of some occurrences of a given word by a new character. Compared to other published index update methods, the problem addressed here may require the modification of a large number of distinct positions over the original text. The proposed algorithm uses the specific internal order of suffix arrays in order to update groups of entries simultaneously, and ensures that only entries to be modified are visited. Experiments confirm a significant execution time speed-up compared to the construction of the suffix array from scratch at each step of the application.
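For orientation, the baseline that the abstract compares against (recoding the text and rebuilding the suffix array from scratch) can be sketched as below. The function names are hypothetical, the construction is a naive sort, and `str.replace` substitutes all non-overlapping occurrences from left to right, whereas the abstract speaks of some occurrences. The in-place update algorithm itself is not reproduced here.

```python
def suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def recode(text, word, new_char):
    """Replace every non-overlapping occurrence of `word` by `new_char`."""
    return text.replace(word, new_char)

def recode_and_rebuild(text, word, new_char):
    """Baseline step: recode the text, then rebuild its suffix array from scratch."""
    recoded = recode(text, word, new_char)
    return recoded, suffix_array(recoded)

# 'Z' stands in for a fresh symbol that does not occur in the original text.
recoded, sa_after = recode_and_rebuild("banana", "an", "Z")
```

Each such step re-sorts every suffix of the recoded text, even though many entries may keep their relative order; avoiding that repeated work is the motivation stated in the abstract.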

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Rennes 1