22 research outputs found

    A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

    Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine a partitioning of a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If the comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a CPU execution time comparable to that of CD-HIT-EST, which is much slower than UCLUST, and generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.
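    The abstract describes a greedy, representative-based clustering loop: each incoming sequence is compared against cluster representatives under a grammar-based distance, and a new cluster is opened when no representative is close enough. The Python sketch below illustrates only that loop; the paper's actual grammar-based metric is not given here, so the zlib-based normalized compression distance and the threshold value are stand-in assumptions.

        import zlib

        def compressed_size(s: bytes) -> int:
            return len(zlib.compress(s))

        def distance(a: str, b: str) -> float:
            # Normalized compression distance: a placeholder (assumption) for the
            # grammar-based sequence distance used in the paper.
            ca, cb = compressed_size(a.encode()), compressed_size(b.encode())
            cab = compressed_size((a + b).encode())
            return (cab - min(ca, cb)) / max(ca, cb)

        def greedy_cluster(sequences, threshold=0.3):
            # Compare each sequence to existing cluster representatives; assign it
            # to the first sufficiently close cluster, otherwise open a new cluster
            # with this sequence as its representative.
            representatives, clusters = [], []
            for seq in sequences:
                for i, rep in enumerate(representatives):
                    if distance(seq, rep) <= threshold:
                        clusters[i].append(seq)
                        break
                else:
                    representatives.append(seq)
                    clusters.append([seq])
            return clusters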

    3D BUILDING FAÇADE RECONSTRUCTION USING HANDHELD LASER SCANNING DATA


    In-place Update of Suffix Array while Recoding Words

    Motivated by grammatical inference and data compression applications, we propose an algorithm to update a suffix array after the substitution, in the indexed text, of some occurrences of a given word by a new character. Compared to other published index update methods, the problem addressed here may require the modification of a large number of distinct positions over the original text. The proposed algorithm uses the specific internal order of suffix arrays to update groups of entries simultaneously, and ensures that only entries to be modified are visited. Experiments confirm a significant execution-time speed-up compared to constructing the suffix array from scratch at each step of the application.
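    For contrast, the sketch below shows the baseline the abstract measures against: recode the occurrences of a word with a fresh single character, then rebuild the suffix array from scratch after each substitution. The example word, the replacement character, and the naive construction are illustrative assumptions; the paper's in-place update of the existing array is not reproduced here.

        def recode(text: str, word: str, new_char: str) -> str:
            # Substitute every occurrence of `word` by the single character `new_char`.
            return text.replace(word, new_char)

        def suffix_array(text: str) -> list[int]:
            # Naive construction by sorting all suffixes; real implementations use
            # faster algorithms, but this suffices to show the baseline cost.
            return sorted(range(len(text)), key=lambda i: text[i:])

        text = "abracadabra"
        recoded = recode(text, "abra", "#")   # -> "#cad#"
        sa = suffix_array(recoded)            # rebuilt from scratch at each step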

    A cognitively plausible model for grammar induction

    Full text link

    Fine-Grained Complexity of Analyzing Compressed Data: Quantifying Improvements over Decompress-And-Solve

    Can we analyze data without decompressing it? As our data keeps growing, understanding the time complexity of problems on compressed inputs, rather than in convenient uncompressed forms, becomes more and more relevant. Suppose we are given a compression of size $n$ of data that originally has size $N$, and we want to solve a problem with time complexity $T(\cdot)$. The naive strategy of "decompress-and-solve" gives time $T(N)$, whereas "the gold standard" is time $T(n)$: to analyze the compression as efficiently as if the original data was small. We restrict our attention to data in the form of a string (text, files, genomes, etc.) and study the most ubiquitous tasks. While the challenge might seem to depend heavily on the specific compression scheme, most methods of practical relevance (Lempel-Ziv family, dictionary methods, and others) can be unified under the elegant notion of Grammar Compressions. A vast literature, across many disciplines, established this as an influential notion for algorithm design. We introduce a framework for proving (conditional) lower bounds in this field, allowing us to assess whether decompress-and-solve can be improved, and by how much. Our main results are:
    - The $O(nN\sqrt{\log(N/n)})$ bound for LCS and the $O(\min\{N \log N, nM\})$ bound for Pattern Matching with Wildcards are optimal up to $N^{o(1)}$ factors, under the Strong Exponential Time Hypothesis. (Here, $M$ denotes the uncompressed length of the compressed pattern.)
    - Decompress-and-solve is essentially optimal for Context-Free Grammar Parsing and RNA Folding, under the $k$-Clique conjecture.
    - We give an algorithm showing that decompress-and-solve is not optimal for Disjointness.
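    The sketch below illustrates the "decompress-and-solve" baseline on a grammar-compressed string: a straight-line program (a context-free grammar generating exactly one string) of size $n$ is first expanded to the original text of size $N$, and only then is the problem (here, simple pattern matching) solved on the uncompressed text. The rule names and the toy grammar are illustrative assumptions, not taken from the paper.

        def expand(symbol: str, rules: dict[str, tuple[str, ...]]) -> str:
            # Decompress by recursively expanding nonterminals; the output length N
            # can be exponentially larger than the grammar size n.
            if symbol not in rules:
                return symbol  # terminal character
            return "".join(expand(s, rules) for s in rules[symbol])

        # Toy straight-line program: k rules can describe a string of length ~2^k.
        rules = {
            "S": ("A", "A"),
            "A": ("B", "B"),
            "B": ("a", "b"),
        }

        text = expand("S", rules)   # decompress: pay time proportional to N
        found = "abab" in text      # then solve on the uncompressed text: T(N)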