
19 research outputs found

Estimating evolutionary distances between genomic sequences from spaced-word matches

Author
Publication venue: BioMed Central
Publication date: 11/02/2015

A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Martin Donald E. K.
Noé Laurent
Publication venue: Mary Ann Liebert Inc
Publication date: 01/01/2014

Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013; Horwege et al., 2014; Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008; Martin, 2013; Martin and Noé, 2014) to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. Finally, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.
Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
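To make the criteria named in this abstract concrete, here is a minimal sketch, not the authors' automaton-based method. It assumes the common convention that an alignment is written as a binary string ('1' = match, '0' = mismatch) and a spaced seed as a pattern of '1' (must match) and '0' (don't care). The function names `seed_hits` and `coverage` are hypothetical. Under these assumptions, the single-hit criterion asks whether any hit exists, the multiple-hit criterion counts hits, and coverage counts the distinct alignment positions lying under a '1' of at least one hit.

```python
def seed_hits(alignment: str, seed: str) -> list[int]:
    """Return start positions where every '1' of the seed lies over a match."""
    # Offsets inside the seed that must align with a match.
    care = [j for j, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(alignment) - len(seed) + 1)
        if all(alignment[i + j] == "1" for j in care)
    ]


def coverage(alignment: str, seed: str) -> int:
    """Count distinct alignment positions covered by a '1' of any seed hit."""
    care = [j for j, c in enumerate(seed) if c == "1"]
    covered = {i + j for i in seed_hits(alignment, seed) for j in care}
    return len(covered)


if __name__ == "__main__":
    aln = "1101101"
    for pattern in ("1101", "111"):
        hits = seed_hits(aln, pattern)
        print(pattern, "hits:", hits, "coverage:", coverage(aln, pattern))
```

On this toy alignment the spaced pattern `1101` hits twice while the contiguous pattern `111` of the same weight does not hit at all, which is the kind of difference the criteria are meant to quantify.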

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

INRIA a CCSD electronic archive server

Fast algorithms for computing sequence distances by exhaustive substring composition

Author: Alberto Apostolico
Olgert Denas
Publication venue: BioMed Central
Publication date: 01/10/2008

The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background.

Springer - Publisher Connector

Directory of Open Access Journals

On the comparison of regulatory sequences with multiple resolution Entropic Profiles

Author: Antonello Morris
Comin Matteo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Enhancers are stretches of DNA (100-1000 bp) that play a major role in development gene expression, evolution and disease. It has been recently shown that in high-level eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Although the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult and cannot be carried out with traditional alignment-based techniques.

Springer - Publisher Connector

Archivio istituzionale della ricerca - Università di Padova

DNA Sequence Classification: It’s Easier Than You Think: An open-source k-mer based machine learning tool for fast and accurate classification of a variety of genomic datasets

Author: Solis-Reyes Stephen
Publication venue: Scholarship@Western
Publication date: 09/10/2018

Supervised classification of genomic sequences is a challenging, well-studied problem with a variety of important applications. We propose an open-source, supervised, alignment-free, highly general method for sequence classification that operates on k-mer proportions of DNA sequences. This method was implemented in a fully standalone general-purpose software package called Kameris, publicly available under a permissive open-source license. Compared to competing software, ours provides key advantages in terms of data security and privacy, transparency, and reproducibility. We perform a detailed study of its accuracy and performance on a wide variety of classification tasks, including virus subtyping, taxonomic classification, and human haplogroup assignment. We demonstrate the success of our method on whole mitochondrial, nuclear, plastid, plasmid, and viral genomes, as well as randomly sampled eukaryote genomes and transcriptomes. Further, we perform head-to-head evaluations on the tasks of HIV-1 virus subtyping and bacterial taxonomic classification with a number of competing state-of-the-art software solutions, and show that we match or exceed all other tested software in terms of accuracy and speed.
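As an illustration of the "k-mer proportions" representation this abstract refers to, the following is a minimal sketch and not Kameris's actual code or API. It assumes a DNA alphabet of A, C, G, T, skips any window containing another symbol, and uses a hypothetical function name `kmer_proportions`. The resulting fixed-length vector is the kind of feature a supervised classifier could consume.

```python
from collections import Counter
from itertools import product


def kmer_proportions(seq: str, k: int) -> list[float]:
    """Return the proportion of each of the 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    alphabet = set("ACGT")
    # Count overlapping k-mers, ignoring windows with ambiguous symbols (assumption).
    counts = Counter(
        seq[i : i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i : i + k]) <= alphabet
    )
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    # Normalise counts so the vector sums to 1 (or is all zeros for empty input).
    return [counts[m] / total if total else 0.0 for m in kmers]


if __name__ == "__main__":
    print(kmer_proportions("ACGT", 1))
```

Normalising to proportions rather than raw counts makes vectors from sequences of different lengths comparable, at the cost of discarding length information.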

Scholarship@Western

MetaProb: Accurate metagenomic reads binning based on probabilistic sequence signatures

Author: Comin Matteo
Girotto Samuele
Pizzi Cinzia
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2016

Archivio istituzionale della ricerca - Università di Padova