86 research outputs found

A framework for space-efficient string kernels

Author: A Apostolico
AJ Smola
AM İleri
B Chor
D Belazzougui
G Reinert
GE Sims
J Herold
J Qi
J Shawe-Taylor
M Crochemore
R Chikhi
S Chairungsee
Publication date: 23/02/2015

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.
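
For orientation, the k-mer kernel named in this abstract can be read as the inner product of the k-mer count vectors of two strings. The sketch below is a naive hash-table illustration only, not the space-efficient Burrows-Wheeler method the abstract describes; the function names `kmer_profile` and `kmer_kernel` are hypothetical.

```python
from collections import Counter


def kmer_profile(s: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t (naive version)."""
    ps, pt = kmer_profile(s, k), kmer_profile(t, k)
    # Iterate over the smaller profile; Counter returns 0 for missing k-mers.
    if len(ps) > len(pt):
        ps, pt = pt, ps
    return sum(count * pt[kmer] for kmer, count in ps.items())


if __name__ == "__main__":
    print(kmer_kernel("ACGT", "ACGA", 2))
```

This version stores every k-mer explicitly, which is exactly the kind of space overhead the paper aims to avoid.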

arXiv.org e-Print Archive

Crossref

Minimal Absent Words in Rooted and Unrooted Trees

Author: B Schieber
C Barton
D Belazzougui
F Mignosi
G Fici
M Béal
M Crochemore
M-P Béal
MA Bender
P Charalampopoulos
RM Silva
S Chairungsee
T Shibuya
Y Almirantis
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n
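
As background for this entry, a minimal absent word of a string is a word that does not occur in it although all of its proper factors do. The brute-force sketch below illustrates that definition on plain strings only; it does not handle the tree setting studied in the paper, and it assumes the alphabet is exactly the set of letters occurring in the input. The function name `minimal_absent_words` is hypothetical.

```python
def minimal_absent_words(s: str) -> set:
    """Brute-force minimal absent words of s over the letters occurring in s.

    A word a+u+b is a minimal absent word when a+u and u+b both occur in s
    but a+u+b does not. This is far slower than the output-sensitive
    algorithms mentioned in the abstract.
    """
    alphabet = set(s)
    # All factors (substrings) of s, including the empty word.
    factors = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    result = set()
    for u in factors:  # u is the shared middle part, possibly empty
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    result.add(a + u + b)
    return result


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```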

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Palermo

Culture collection of multifunctional microorganisms of Embrapa Clima Temperado: implementation of good practices.

Author: ALMEIDA B. M.
CROCHEMORE A. G.
FACIO M. L. P.
GALARZ L. A.
MATTOS M. L. T.
Publication date: 01/01/2013

Repository Open Access to Scientific Information from Embrapa

RCAAP - Repositório Científico de Acesso Aberto de Portugal

Efficient exact pattern-matching in proteomic sequences

Author: B. Smyth
D.E. Knuth
D.M. Sunday
F. Franek
G. Navarro
H. Peltola
M. Crochemore
P.D. Michailidis
R.A. Baeza-Yates
R.M. Karp
R.N. Horspool
R.S. Boyer
T. Lecroq
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

This paper proposes a novel algorithm for complete exact pattern matching that targets the specificities of protein sequences (an alphabet of 20 symbols) but is also highly efficient for larger alphabets. The searching strategy uses large search windows, allowing multiple alignments per iteration. A new filtering heuristic, named the compatibility rule, contributed decisively to the efficiency improvement. On average, the new algorithm outperforms its best-rated competitors.

CiteSeerX

Universidade do Minho: RepositoriUM

Crossref

Biblioteca Digital do IPB

How to compare arc-annotated sequences: The alignment hierarchy

Author: B. Ma
F. Bernhart
G. Lin
K. Zhang
K.C. Tai
M. Crochemore
S. Dulucq
S. Vialette
T. Jiang
T.C. Biedl
Publication venue: Springer Verlag
Publication date: 01/01/2006

We describe a new unifying framework to express comparison of arc-annotated sequences, which we call alignment of arc-annotated sequences. We first prove that this framework encompasses the main existing models, which allows us to deduce complexity results for several cases from the literature. We also show that this framework gives rise to new relevant problems that have not been studied yet. We provide a thorough analysis of these novel cases by proposing two polynomial-time algorithms and an NP-completeness proof. This leads to an almost exhaustive study of alignment of arc-annotated sequences.

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Culture collection of multifunctional microorganisms of Embrapa Clima Temperado: culture preservation methods.

Author: ALMEIDA B. M.
CROCHEMORE A. G.
FACIO M. L. P.
GALARZ L. A.
MATTOS M. L. T.
RIBEIRO F. V.
THIEL C. H.
Publication date: 01/01/2013

Repository Open Access to Scientific Information from Embrapa

RCAAP - Repositório Científico de Acesso Aberto de Portugal

A suffix tree or not a suffix tree?

Author: A Apostolico
B Cazaux
D Breslauer
D Gusfield
E Ukkonen
G Kucherov
H Bannai
JP Duval
M Crochemore
P Gawrychowski
T I
Tomohiro I.
W Lu
Publication venue: Elsevier BV
Publication date: 01/09/2014

In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question “Is τ a suffix tree?”, i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S, in particular we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. Deciding if τ is a suffix tree is not an easy task, because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes τ from the number of leaves. And without an upper bound on the length of such a string, it is not even clear how to solve the problem by an exhaustive search. In this paper, we prove that τ is a suffix tree if and only if it is realized by a string S of length n−1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

Explore Bristol Research

Burrows-Wheeler transform of words defined by morphisms

Author: A Restivo
B Tan
D Adjeroh
E Barcucci
G Manzini
GA Hedlund
H Kaplan
I Gessel
J Simpson
M Crochemore
M Lothaire
P Ferragina
S Mantaci
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

Crossref

Florence Research

Biodiversity of microorganisms involved in the biodegradation of pesticides in subtropical freshwater swamp forests in Brazil.

Author: ALMEIDA M. T. de
ANDRES A.
CROCHEMORE A. G.
MATTOS M. L. T.
MERTINS J. F. da S.
SANTOS I. B. dos
Publication date: 01/01/2015

Repository Open Access to Scientific Information from Embrapa

RCAAP - Repositório Científico de Acesso Aberto de Portugal

Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Author: B Langmead
Christian Otto
Cynthia M. Sharma
David B. Searls
G Myers
H Li
H Lin
JC Dohm
JM Rothberg
Jörg Hackermüller
Jörg Vogel
K Prüfer
M Crochemore
MI Abouelhoda
P Ferragina
Peter F. Stadler
Philipp Khaitovich
R Li
S Bennett
S Huse
S Karlin
SM Rumble
Stefan Kurtz
Steve Hoffmann
W Chang
Publication venue: Public Library of Science
Publication date: 01/01/2009

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/
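
The abstract mentions index structures in the form of enhanced suffix arrays. As a rough illustration of the underlying idea only, the sketch below builds a plain suffix array and finds exact occurrences of a pattern by binary search. It is not segemehl's method: it handles no mismatches or indels, the construction is the simple sort-based one rather than a linear-time algorithm, and the names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes of text, in lexicographic order.

    Simple sort-based construction, adequate for short illustrative inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list, pattern: str) -> list:
    """Return the sorted start positions where pattern occurs exactly in text."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "ana"))
```

All suffixes that start with the pattern form one contiguous block in the suffix array, which is why two binary searches are enough to locate every occurrence.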

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central