30 research outputs found

Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models

Author: Furletova Evgenia
Roytberg Mikhail
Régnier Mireille
Yakovlev Victor
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/12/2014

We present a novel algorithm, SufPref, that computes an exact p-value for Hidden Markov models (HMM). The algorithm inductively traverses a specific data structure, the overlap graph. Nodes of the graph are associated with the overlaps of words from a given set H. Edges are associated with the prefix and suffix relations between overlaps. An original feature of our data structure is that the pattern H need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; the approach is analogous to a weighted automaton approach. The gain in size of the SufPref data structure leads to significant space and time complexity improvements. We suppose that all words in the pattern H are of the same length m. The algorithm SufPref was implemented as a C++ program; it can be used both as a Web server and as a stand-alone program for Linux and Windows. This article presents a new algorithm, SufPref, which computes the p-value for a set of words while taking hidden Markov models (HMM) into account. It is implemented in C++ and can be used online or downloaded

Crossref

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

HAL-Polytechnique

A Word Counting Graph

Author: Furletova Eugenia
Kirakossian Zara
Regnier Mireille
Roytberg Mikhail
Publication venue: London College Publications
Publication date: 01/06/2009

We study methods for counting occurrences of words from a given set H over an alphabet V in a given text. All words have the same length m. Our goal is the computation of the probability of finding p occurrences of words from a set H in a random text of size n, assuming that the text is generated by a Bernoulli or Markov model. We have designed an algorithm solving the problem; the algorithm relies on traversals of a graph whose set of vertices is associated with the overlaps of words from H. Edges define two oriented subgraphs that can be interpreted as equivalence relations on words of H. Let P(H) be the set of equivalence classes and S be the set of other vertices. The running time for the Bernoulli model is O(np(|P(H)| + |S|)) and the space complexity is O(pm|S| + |P(H)|). In a Markov model of order K, the additional space complexity is O(pm|V|^K) and the additional time complexity is O(npm|V|^K). Our preprocessing uses a variant of the Aho-Corasick automaton and achieves O(m|H|) time complexity. Our algorithm is implemented and provides a significant space improvement in practice. We compare its complexity to the additional improvement due to Aho-Corasick minimization
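
As a point of reference for the problem stated in this abstract, the following is a minimal sketch of a naive baseline, not the overlap-graph algorithm the authors describe. It computes the distribution of the number of occurrences of words from H in a Bernoulli text of size n by tracking the last m-1 letters of the text. The function name `occurrence_distribution` and its parameters are hypothetical, chosen only for illustration; the state space grows as |V|^(m-1), which is the kind of cost the overlap graph is meant to avoid.

```python
from collections import defaultdict

def occurrence_distribution(words, letter_probs, n, p_max):
    """Return [P(exactly k occurrences) for k = 0..p_max] in a Bernoulli text of size n.

    words        -- set H of words, all of the same length m
    letter_probs -- dict mapping each letter of the alphabet V to its probability
    Overlapping occurrences are counted separately.
    """
    words = set(words)
    m = len(next(iter(words)))
    assert all(len(w) == m for w in words), "all words must have length m"

    # State: (last up to m-1 letters, occurrence count capped at p_max + 1).
    dist = {("", 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (suffix, count), prob in dist.items():
            for letter, q in letter_probs.items():
                window = suffix + letter
                hit = len(window) == m and window in words
                new_count = min(count + hit, p_max + 1)
                new_suffix = window[-(m - 1):] if m > 1 else ""
                nxt[(new_suffix, new_count)] += prob * q
        dist = nxt

    result = [0.0] * (p_max + 1)
    for (_, count), prob in dist.items():
        if count <= p_max:
            result[count] += prob
    return result
```

For example, `occurrence_distribution({"AA"}, {"A": 0.5, "B": 0.5}, 3, 2)` enumerates the eight equiprobable texts of size 3 implicitly; "AAA" contributes two overlapping occurrences.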

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

HAL-Polytechnique

HAL-Rennes 1

Analysis of Sequence Conservation at Nucleotide Resolution

Author: Asthana Saurabh
Roytberg Mikhail
Stamatoyannopoulos John
Sunyaev Shamil
Publication venue: Public Library of Science
Publication date: 01/01/2007

One of the major goals of comparative genomics is to understand the evolutionary history of each nucleotide in the human genome sequence, and the degree to which it is under selective pressure. Ascertainment of selective constraint at nucleotide resolution is particularly important for predicting the functional significance of human genetic variation and for analyzing the sequence substructure of cis-regulatory sequences and other functional elements. Current methods for analysis of sequence conservation are focused on delineation of conserved regions comprising tens or even hundreds of consecutive nucleotides. We therefore developed a novel computational approach designed specifically for scoring evolutionary conservation at individual base-pair resolution. Our approach estimates the rate at which each nucleotide position is evolving, computes the probability of neutrality given this rate estimate, and summarizes the result in a Sequence CONservation Evaluation (SCONE) score. We computed SCONE scores in a continuous fashion across 1% of the human genome for which high-quality sequence information from up to 23 genomes is available. We show that SCONE scores are clearly correlated with the allele frequency of human polymorphisms in both coding and noncoding regions. We find that the majority of noncoding conserved nucleotides lie outside of longer conserved elements predicted by other conservation analyses, and are experiencing ongoing selection in modern humans, as is evident from the allele frequency spectrum of human polymorphism. We also applied SCONE to analyze the distribution of conserved nucleotides within functional regions. These regions are markedly enriched in individually conserved positions and short (<15 bp) conserved “chunks.” Our results collectively suggest that the majority of functionally important noncoding conserved positions are highly fragmented and reside outside of canonically defined long conserved noncoding sequences. A small subset of these fragmented positions may be identified with high confidence

CiteSeerX

Public Library of Science (PLOS)

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)

Author: Kucherov Gregory
Noé Laurent
Roytberg Mikhail
Publication venue: HAL CCSD
Publication date: 01/01/2005

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated probability distribution, and a seed model - that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds
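
To make the notion of seed sensitivity in this abstract concrete, here is a minimal sketch for the simplest setting only: an ordinary spaced seed and a Bernoulli alignment model in which each column is a match with a fixed probability. It is a direct dynamic program over the last span-1 alignment columns, not the three-automaton framework the authors propose, and it does not handle subset seeds. The name `seed_sensitivity` and the `'1'`/`'0'` seed notation are assumptions made for this illustration.

```python
def seed_sensitivity(seed, length, match_prob):
    """Probability that a spaced seed hits a random gapless alignment at least once.

    seed       -- string over {'1', '0'}: '1' = position must match, '0' = don't care
    length     -- number of alignment columns
    match_prob -- probability that a column is a match (Bernoulli model)
    """
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "1"]

    # Probability mass of alignment prefixes with no seed hit so far,
    # keyed by the last (up to) span-1 columns (1 = match, 0 = mismatch).
    no_hit = {(): 1.0}
    for _ in range(length):
        nxt = {}
        for history, prob in no_hit.items():
            for bit, q in ((1, match_prob), (0, 1.0 - match_prob)):
                window = history + (bit,)
                if len(window) == span and all(window[i] for i in required):
                    continue  # the seed hits here: remove this mass from "no hit"
                key = window[-(span - 1):] if span > 1 else ()
                nxt[key] = nxt.get(key, 0.0) + prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

With `match_prob = 0.5` and `length = 3`, the contiguous seed `"11"` is hit by three of the eight equiprobable alignments (110, 111, 011), so the function returns 0.375.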

INRIA a CCSD electronic archive server

Comparative analysis of the quality of a global algorithm and a local algorithm for alignment of two sequences

Author: Mikhail A Roytberg
Valery O Polyanovsky
Vladimir G Tumanyan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

On subset seeds for protein alignment

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail A.
Szczurek Ewa
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds vs. BLASTP. Comment: IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

MPG.PuRe

Multiseed Lossless Filtration

Author: Gregory Kucherov
Laurent Noé
Mikhail Roytberg
Publication date: 01/01/2005

We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on the simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database
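
The lossless property discussed in this abstract can be illustrated with a brute-force check, given here as a sketch under stated assumptions rather than as one of the authors' algorithms. It assumes the usual formulation in which a seed family is lossless for (m, k) if every alignment of m columns with at most k mismatches is hit by at least one seed at some position. The name `is_lossless` and the `'#'`/`'-'` seed notation are hypothetical choices for this example, and the enumeration is exponential in k, so it is only practical for small instances.

```python
from itertools import combinations

def is_lossless(seeds, m, k):
    """Check by enumeration whether a seed family is lossless for (m, k).

    seeds -- iterable of strings over {'#', '-'}: '#' = must match, '-' = don't care
    m     -- number of alignment columns
    k     -- maximum number of mismatches (k <= m)
    Checking exactly k mismatches suffices: removing a mismatch can only
    preserve or create seed hits.
    """
    seeds = list(seeds)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        hit = any(
            all(pos + i not in bad for i, c in enumerate(seed) if c == "#")
            for seed in seeds
            for pos in range(m - len(seed) + 1)
        )
        if not hit:
            return False  # this mismatch placement escapes every seed
    return True
```

For instance, with m = 7 and k = 1, the contiguous seed `"###"` is lossless, because a single mismatch always leaves a run of three matching columns, whereas `"####"` is not, because a mismatch in the middle column leaves no run of four.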

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server

A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)

Author: Gregory Kucherov
Laurent Noé
Mikhail Roytberg
Publication date: 01/01/2004

We propose a general approach to computing seed sensitivity that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem – a set of target alignments, an associated probability distribution, and a seed model – that are specified by distinct finite automata. The approach is then applied to a new concept of subset seeds, for which we propose an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be efficiently designed using our approach, and can then be used in similarity search, producing better results than ordinary spaced seeds

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server