A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Martin Donald E. K.
Noé Laurent
Publication venue: Mary Ann Liebert Inc
Publication date: 01/01/2014

Spaced seeds have recently been shown not only to detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (Onodera and Shibuya, 2013). We confirm these two results by independent experiments, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both cases in order to design better seed patterns. We first show how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other frequently used criteria, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. Finally, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed. Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
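
To make the three criteria named in the abstract concrete, here is a minimal Python sketch. It assumes an ungapped alignment encoded as a string of '1' (match) and '0' (mismatch), and a seed pattern in which '1' marks a position that must match and '0' a don't-care position. The function names are hypothetical, and coverage is taken here to mean the number of alignment positions covered by a must-match position of at least one seed hit; the article's automaton-based computation is not reproduced.

```python
def seed_hits(seed, alignment):
    """Return the offsets where every '1' of the seed faces a match ('1')."""
    span = len(seed)
    return [
        i
        for i in range(len(alignment) - span + 1)
        if all(alignment[i + j] == "1" for j, s in enumerate(seed) if s == "1")
    ]


def single_hit(seed, alignment):
    """Single-hit criterion: is the alignment hit at least once?"""
    return len(seed_hits(seed, alignment)) >= 1


def multiple_hit(seed, alignment):
    """Multiple-hit criterion: how many times is the alignment hit?"""
    return len(seed_hits(seed, alignment))


def coverage(seed, alignment):
    """Coverage (as assumed here): number of positions covered by
    a must-match position of at least one seed hit."""
    covered = set()
    for i in seed_hits(seed, alignment):
        covered.update(i + j for j, s in enumerate(seed) if s == "1")
    return len(covered)
```

With seed "1101" and alignment "1110111", the seed hits only at offset 1, so the multiple-hit count is 1 and the coverage is 3. Overlapping hits can share positions, which is why coverage is not simply the hit count times the seed weight.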

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

PubMed Central

Reconsidering the significance of genomic word frequency

Author: Csűrös Miklós
Kucherov Gregory
Noé Laurent
Publication date: 14/09/2006

We propose that the distribution of DNA words in genomic sequences can be primarily characterized by a double Pareto-lognormal distribution, which explains the lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random sequence evolution by duplication processes. The parametrization of genomic word frequencies allows for an assessment of significance for frequent or rare sequence motifs.

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

YASS: enhancing the sensitivity of DNA similarity search

Author: Kucherov Gregory
Noé Laurent
Publication venue: Oxford University Press
Publication date: 01/01/2004

YASS is a DNA local alignment tool based on an efficient and sensitive filtering algorithm. It applies transition-constrained seeds to specify the most probable conserved motifs between homologous sequences, combined with a flexible hit criterion used to identify groups of seeds that are likely to exhibit significant alignments. A web interface is available to upload input sequences in FASTA format, query the program and visualize the results in several forms (dot-plot, tabular output and others). A standalone version is available for download from the web page.

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

PubMed Central

HAL Descartes

Hal-Diderot

Improved hit criteria for DNA local alignment

Author: Kucherov Gregory
Noé Laurent
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: The hit criterion is a key component of heuristic local alignment algorithms. It specifies a class of patterns assumed to witness a potential similarity, and this choice is decisive for the selectivity and sensitivity of the whole method. RESULTS: In this paper, we propose two ways to improve the hit criterion. First, we define the group criterion, which combines the advantages of the single-seed and double-seed approaches used in existing algorithms. Second, we introduce transition-constrained seeds, which extend spaced seeds with the ability to distinguish transition and transversion mismatches. We provide analytical data as well as experimental results, obtained with the YASS software, supporting both improvements. CONCLUSIONS: The proposed algorithmic ideas yield a significant gain in the sensitivity of similarity search without an increase in execution time. The method has been implemented in YASS software available at
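
The idea of a transition-constrained seed can be sketched as follows. The three-symbol seed alphabet used here is an assumption made for illustration: '#' requires a match, '@' accepts a match or a transition (A<->G or C<->T), and '-' accepts anything. The function names are hypothetical, and the sketch only checks ungapped, equal-length sequences; it does not implement the group criterion.

```python
# Purines and pyrimidines: a transition stays within one of these classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(x, y):
    """True if x and y differ but belong to the same nucleotide class."""
    return x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)


def position_ok(symbol, x, y):
    """Check one seed symbol against one aligned nucleotide pair."""
    if symbol == "#":  # match required
        return x == y
    if symbol == "@":  # match or transition accepted
        return x == y or is_transition(x, y)
    return True  # '-' is a don't-care position


def tc_seed_hits(seed, s1, s2):
    """Offsets where the seed hits the ungapped alignment of s1 and s2."""
    n = min(len(s1), len(s2))
    return [
        i
        for i in range(n - len(seed) + 1)
        if all(position_ok(c, s1[i + j], s2[i + j]) for j, c in enumerate(seed))
    ]
```

For example, the seed "#@#" hits the pair ("ACGT", "ATGT") at offset 0 because C/T is a transition, but it does not hit ("ACGT", "AAGT"), where C/A is a transversion.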

CiteSeerX

Springer - Publisher Connector

Directory of Open Access Journals

INRIA a CCSD electronic archive server

PubMed Central

HAL Descartes

Hal-Diderot

Back-translation for discovering distant protein homologies in the presence of frameshift mutations

Author: Gîrdea Marta
Kucherov Gregory
Noé Laurent
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. Results: We developed a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/. Conclusions: Our approach uncovers evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.

CiteSeerX

HAL - Lille 3

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

Designing Efficient Spaced Seeds for SOLiD Read Mapping

Author: Gîrdea Marta
Kucherov Gregory
Noé Laurent
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2010

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches together with a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach with several seed designs and demonstrate their efficiency.

CiteSeerX

HAL - Lille 3

Crossref

Directory of Open Access Journals

INRIA a CCSD electronic archive server

PubMed Central

Efficient seeding techniques for protein similarity search

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail
Szczurek Ewa
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

Efficient seeding techniques for protein similarity search

Author: Roytberg Mikhail
Gambin Anna
Noé Laurent
Lasota Slawomir
Furletova Eugenia
Szczurek Ewa
Kucherov Gregory
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

arXiv.org e-Print Archive

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

YASS: Similarity search in DNA sequences

Author: Kucherov Gregory
Noé Laurent
Publication venue: HAL CCSD
Publication date: 01/01/2003

We describe YASS, a new tool for finding local similarities in DNA sequences. The YASS algorithm first scans the sequence(s) and creates on the fly groups of seeds (small exact repeats obtained by hashing) according to statistically founded criteria. It then tries to extend those groups into similarity regions on the basis of a new extension criterion. The method can be seen as a compromise between single-seed and multiple-seed approaches, and achieves a gain in both sensitivity and selectivity. The method is flexible and can be made more efficient by using spaced seeds, and in particular transition-constrained spaced seeds. We provide examples of applying YASS to Saccharomyces cerevisiae and Drosophila melanogaster chromosomes.
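
The first stage described above, finding small exact repeats by hashing and collecting them into groups, can be illustrated with a short Python sketch. The function names and the grouping rule are assumptions: seeds are simply grouped by diagonal (j - i), a simplified stand-in for the statistically founded criteria the abstract mentions, and the extension stage is omitted.

```python
from collections import defaultdict


def exact_seeds(a, b, k):
    """Return (i, j) pairs such that a[i:i+k] == b[j:j+k], found by hashing."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)  # hash every k-mer of the first sequence
    return [
        (i, j)
        for j in range(len(b) - k + 1)
        for i in index.get(b[j:j + k], [])  # look up each k-mer of the second
    ]


def group_by_diagonal(seeds):
    """Group seeds by diagonal j - i (assumed, simplified grouping rule)."""
    groups = defaultdict(list)
    for i, j in seeds:
        groups[j - i].append((i, j))
    return dict(groups)
```

Seeds that share a diagonal are candidates for the same ungapped similarity region, which is why the diagonal is a natural first grouping key; nearby diagonals would also need to be merged to tolerate indels.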

INRIA a CCSD electronic archive server