17 research outputs found

A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Laurent Noé
Donald E.K. Martin
Apostolico A.
Bassino F.
Boden M.
Břinda K.
Burkhardt S.
Egidi L.
Gambin A.
Leslie C.S.
Martin D.E.K.
Régnier M.
Simon I.
Zhou L.
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2010
Field of study

Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013; Horwege et al., 2014; Leimeister et al., 2014) and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008; Martin, 2013; Martin and Noé, 2014) to measure seed efficiency in both cases in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification or the true distance. Finally, for alignment-free distances, we propose an extension that adopts the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
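To make the three criteria named in this abstract concrete, here is a minimal Python sketch. It is not the automaton-based method of the article. It assumes a seed is written as a string of `1` (must match) and `0` (don't care), that an alignment is written as a string of `1` (match) and `0` (mismatch), and that coverage means the number of alignment positions covered by a must-match position of at least one seed hit. The function names are hypothetical.

```python
def seed_hits(seed: str, alignment: str) -> list[int]:
    """Start positions where every '1' of the seed faces a match ('1')."""
    must_match = [j for j, c in enumerate(seed) if c == "1"]
    return [
        i
        for i in range(len(alignment) - len(seed) + 1)
        if all(alignment[i + j] == "1" for j in must_match)
    ]


def single_hit(seed: str, alignment: str) -> bool:
    """Single-hit criterion: does the seed hit the alignment at least once?"""
    return bool(seed_hits(seed, alignment))


def multiple_hit(seed: str, alignment: str) -> int:
    """Multiple-hit criterion: number of seed hits."""
    return len(seed_hits(seed, alignment))


def coverage(seed: str, alignment: str) -> int:
    """Coverage (assumed definition): positions covered by a '1' of any hit."""
    must_match = [j for j, c in enumerate(seed) if c == "1"]
    covered = set()
    for i in seed_hits(seed, alignment):
        covered.update(i + j for j in must_match)
    return len(covered)


if __name__ == "__main__":
    alignment = "1111011"
    for seed in ("1101", "111"):
        print(seed, multiple_hit(seed, alignment), coverage(seed, alignment))
```

In this toy alignment both seeds hit twice, yet the spaced seed `1101` covers more positions than the contiguous seed `111`, which is the kind of difference a coverage-based measure can pick up and a hit count cannot.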

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Copenhagen University Research Information System

Indexing k-mers in linear space for quality value compression.

Author: Břinda K
Matteo Comin
Mohamadi H
Ochoa I
Prezza N
Schimd M
Shibuya Y
Yoshihiro Shibuya
Publication date: 01/01/2019

Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and to allow faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the need for an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which are therefore difficult to compress. Here, we present an application that improves the compressibility of quality values while preserving the information needed for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff

Crossref

Open Access Repository

Archivio istituzionale della ricerca - Università di Padova

Indexing k

Author: Břinda K
Matteo Comin
Mohamadi H
Ochoa I
Prezza N
Schimd M
Shibuya Y
Yoshihiro Shibuya
Publication venue: 'World Scientific Pub Co Pte Lt'

Crossref

A Coverage Criterion for Spaced Seeds and Its Applications to Support Vector Machine String Kernels and k

Author: Apostolico A.
Bassino F.
Boden M.
Burkhardt S.
Břinda K.
Donald E.K. Martin
Egidi L.
Gambin A.
Laurent Noé
Leslie C.S.
Martin D.E.K.
Martin D.E.K.
Régnier M.
Simon I.
Zhou L.
Publication venue: 'Mary Ann Liebert Inc'

Crossref

Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

Author: A Apostolico
AE Darling
B Ma
CA Leimeister
DE Wood
G Kucherov
K Břinda
L Hahn
L Noé
M Comin
R Ounit
S Girotto
SM Rumble
T Onodera
U Keich
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash tables and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity in the presence of sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They improve sensitivity and are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be hashed efficiently, so their use substantially increases computation time. In this paper we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values in order to efficiently compute the next hash. We report a series of experiments on hashing HTS reads with several spaced seeds. Our algorithm can compute the hash values of spaced seeds with a speedup of 6.2x, outperforming previous methods. Software and Datasets are available at ISS
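For orientation, the following Python sketch shows the naive baseline that such work sets out to speed up: every window of the read is hashed from scratch by reading only the must-match positions of the seed. It is not the iterative algorithm proposed in the paper. The 2-bit nucleotide encoding, the `1`/`0` seed notation, and the function name are assumptions made for illustration.

```python
# 2-bit encoding of nucleotides (an assumed, commonly used convention).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(sequence: str, seed: str) -> list[int]:
    """Hash every window of `sequence` using only the '1' positions of `seed`.

    Each window costs O(weight of the seed) because nothing is reused
    between adjacent windows; this is the cost that faster methods reduce.
    """
    must_match = [j for j, c in enumerate(seed) if c == "1"]
    hashes = []
    for i in range(len(sequence) - len(seed) + 1):
        h = 0
        for j in must_match:
            # Append the 2-bit code of the selected nucleotide to the hash.
            h = (h << 2) | ENCODING[sequence[i + j]]
        hashes.append(h)
    return hashes


if __name__ == "__main__":
    print(spaced_seed_hashes("ACGTAC", "101"))
    # A substitution at a don't-care position leaves the hash unchanged.
    print(spaced_seed_hashes("AAG", "101") == spaced_seed_hashes("ATG", "101"))
```

With a contiguous k-mer, the next hash can be derived from the previous one by shifting out one nucleotide and shifting in another. With a spaced seed, the selected positions do not slide in that simple way, which is why the naive loop above recomputes each hash in full.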

Crossref

Hal-Diderot

Archivio istituzionale della ricerca - Università di Padova

Improving Metagenomic Classification Using Discriminative k-mers from Sequencing Data

Author: BD Ondov
D Kim
D Wood
DE Wood
DH Huson
F Breitwieser
J Goke
J Qian
JA Eisen
K Břinda
M Antonello
M Comin
R Ounit
S Girotto
SF Altschul
SS Mande
Y Shibuya
YW Yu
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species they contain. Most currently available methods classify reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of recall (the actual number of classified reads) performance falls to around 50%. One reason is that the sequences in a sample can be very different from the corresponding reference genome, e.g. viral genomes are highly mutated. To address this issue, in this paper we study the problem of metagenomic read classification by improving the reference k-mer library with novel discriminative k-mers from the input sequencing reads. We evaluated the performance under different conditions against several other tools, and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com/davide92/K2Mem.gi

Crossref

Archivio istituzionale della ricerca - Università di Padova

Fast Approximation of Frequent k-mers and Applications to Metagenomics

Author: B Solomon
BD Ondov
DE Wood
DR Kelley
DR Zerbino
G Benoit
G Marçais
G Rizk
GE Sims
H Mohamadi
K Břinda
L Salmela
LB Dickson
M Kokot
M Löffler
M Mitzenmacher
P Melsted
P Pandey
PA Pevzner
Q Zhang
R Chikhi
R Danovaro
R Patro
RS Roy
S Girotto
V Vapnik
X Li
Z Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire dataset, which can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all k-mers and their abundances, in other situations reporting only frequent k-mers, which appear with relatively high frequency in a dataset, may suffice. This is the case, for example, in the computation of k-mer abundance-based distances among datasets of reads, commonly used in metagenomic analyses. In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme, and we show how the characterization of the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA can rigorously approximate frequent k-mers by processing only a fraction of a dataset, and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool for estimating k-mer abundances, providing significant speed-ups in the analysis of large sequencing datasets. Comment: Accepted for RECOMB 201
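The general idea of estimating k-mer frequencies from a sample can be illustrated with a short Python sketch. This is plain uniform sampling of k-mer occurrences with a fixed random seed. It is not SAKEIMA's sampling scheme and carries none of its guarantees, and for simplicity it enumerates all positions up front, which a practical tool would avoid. Function names and parameters are hypothetical.

```python
import random
from collections import Counter


def kmer_frequencies(reads: list[str], k: int) -> dict[str, float]:
    """Exact relative frequency of every k-mer over all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def sampled_kmer_frequencies(
    reads: list[str], k: int, sample_size: int, seed: int = 0
) -> dict[str, float]:
    """Estimate k-mer frequencies from a uniform sample of occurrences.

    Positions are drawn with replacement; the estimate of a k-mer's
    frequency is the fraction of sampled positions where it occurs.
    """
    rng = random.Random(seed)
    positions = [
        (r, i) for r, read in enumerate(reads) for i in range(len(read) - k + 1)
    ]
    sample = rng.choices(positions, k=sample_size)
    counts = Counter(reads[r][i:i + k] for r, i in sample)
    return {kmer: c / sample_size for kmer, c in counts.items()}


if __name__ == "__main__":
    reads = ["ACACACGT", "ACACTTTT", "GGACACAC"]
    print(kmer_frequencies(reads, 2))
    print(sampled_kmer_frequencies(reads, 2, sample_size=10))
```

Frequent k-mers tend to show up in even a small sample, while rare ones may be missed entirely, which is why a sampling approach suits the frequent-k-mer setting described in the abstract.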

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Padova

SimBA: A methodology and tools for evaluating the performance of RNA-Seq bioinformatic pipelines

Author: A Conesa
A Kanitz
A Lex
A McKenna
AD Ewing
Christophe F. Grosset
D Kim
EM Quinn
GR Grant
H Li
J Köster
Jean-Marc Holder
Jérôme Audoux
K Břinda
M Carrara
M Garber
M Smolka
M Teng
Mikaël Salson
N Philippe
Nicolas Philippe
PG Engström
PKR Kumar
R Piskol
S Beaumeunier
S Caboche
S Kumar
S Marco-Sola
SA Byron
Sacha Beaumeunier
Seqc/Maqc-Iii Consortium
SH Giese
T Griebel
The 1000 Genomes Project Consortium
Thérèse Commes
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2017

BACKGROUND: The evolution of next-generation sequencing (NGS) technologies has led to increased focus on RNA-Seq. Many bioinformatic tools have been developed for RNA-Seq analysis, each with unique performance characteristics and configuration parameters. Users face an increasingly complex task in understanding which bioinformatic tools are best for their specific needs and how they should be configured. In order to provide some answers to these questions, we investigate the performance of leading bioinformatic tools designed for RNA-Seq analysis and propose a methodology for systematic evaluation and comparison of performance to help users make well-informed choices.
RESULTS: To evaluate RNA-Seq pipelines, we developed a suite of two benchmarking tools. SimCT generates simulated datasets that get as close as possible to specific real biological conditions, accompanied by the list of genomic incidents and mutations that have been inserted. BenchCT then compares the output of any bioinformatics pipeline that has been run against a SimCT dataset with the simulated genomic and transcriptional variations it contains, to give an accurate performance evaluation in addressing a specific biological question. We used these tools to simulate a real-world genomic medicine question involving the comparison of healthy and cancerous cells. Results revealed that performance in addressing a particular biological context varied significantly depending on the choice of tools and settings used. We also found that by combining the output of certain pipelines, substantial performance improvements could be achieved.
CONCLUSION: Our research emphasizes the importance of selecting and configuring bioinformatic tools for the specific biological question being investigated to obtain optimal results. Pipeline designers, developers and users should include benchmarking in the context of their biological question as part of their design and quality control process. Our SimBA suite of benchmarking tools provides a reliable basis for comparing the performance of RNA-Seq bioinformatics pipelines in addressing a specific biological question. We would like to see the creation of a reference corpus of datasets that would allow accurate comparison between benchmarks performed by different groups, and the publication of more benchmarks based on this public corpus. SimBA software and datasets are available at http://cractools.gforge.inria.fr/softwares/simba/

Crossref

HAL-Inserm

INRIA a CCSD electronic archive server

Directory of Open Access Journals

Hal-Diderot

Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

Author: A Hodgkinson
AP Bird
AR Quinlan
DJ Gaffney
E Hodis
E Isidore
H Li
J Harrow
JA Schlueter
JC Mu
JM Zook
John Parkinson
K Břinda
KE McElroy
Liudmila S. Mainzer
M Olivier
Matthew E. Hudson
Matthew R. Weber
MN Premachandran
Morgan Taschuk
N Shanks
P Danecek
P Polak
Ravishankar K. Iyer
S Andrews
S Caboche
S Kim
S Pattnaik
S Subramanian
S van der Walt
TJ Treangen
W Huang
X Hu
XS Puente
Z Su
Zachary D. Stephens
Publication venue: 'Public Library of Science (PLoS)'

Crossref

Efficient computation of spaced seed hashing with block indexing

Author: A Shajii
A Zielezinski
B Ma
C Leslie
C Pizzi
C-A Leimeister
Cinzia Pizzi
D Belazzougui
DE Wood
DG Brown
G Marcais
G Reinert
H Mohamadi
J Buhler
K Břinda
K Song
L Hahn
L Ilie
L Noé
L Parida
M Comin
Matteo Comin
P Ferragina
R Ounit
S Deorowicz
S Girotto
S Van Dongen
Samuele Girotto
SF Altschul
SM Rumble
T Onodera
U Keich
Publication venue: 'Springer Science and Business Media LLC'

Crossref