919 research outputs found

The distribution of word matches between Markovian sequences with periodic boundary conditions

Author: Burden Conrad J
Foret Sylvain
Leopardi Paul
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2014

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome

PubMed Central

The Australian National University

Alignment-free sequence comparison for biologically realistic sequences of moderate length

Author: Burden Conrad J
Jing Junmei
Wilson Susan R
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 08/12/2015

The D2 statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D2 may be dominated by the terms that reflect the noise in each of the single sequences only. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of identically and independently distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate length query sequences as putative cis-regulatory modules. This work was funded in part by ARC discovery grant DP098729.
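As an illustration of the definition in the abstract above, the following is a minimal sketch of how a D2 count could be computed for two sequences with open (non-periodic) ends. The function names `kmer_counts` and `d2` are hypothetical and do not come from the paper; the sketch assumes that overlapping words are counted and that every pair of matching positions contributes one match.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of position pairs (i in A, j in B) whose k-words are identical."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A word occurring x times in A and y times in B contributes x * y matches.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `d2("ACGTACGT", "CGTA", 2)` counts the matches of 2-letter words between the two strings. Counting words once per sequence and multiplying the counts avoids comparing every pair of positions directly.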

The Australian National University

Efficient exact motif discovery

Author: Ettwiller
Fratkin
Li
Lladser
Pavesi
Reinert
S. Rahmann
Sandve
Sinha
T. Marschall
Tompa
Publication venue: Oxford University Press
Publication date: 01/01/2009

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif

CiteSeerX

Crossref

PubMed Central

On the first k moments of the random count of a pattern in a multi-states sequence generated by a Markov source

Author: Nuel Grégory
Publication venue: 'Applied Probability Trust'
Publication date: 22/09/2009

In this paper, we develop an explicit formula for computing the first k moments of the random count of a pattern in a multi-state sequence generated by a Markov source. We derive efficient algorithms that handle both low- and high-complexity patterns and both homogeneous and heterogeneous Markov models. We then apply these results to the distribution of DNA patterns in genomic sequences, where we show that moment-based developments (namely, Edgeworth's expansion and Gram-Charlier type B series) improve the reliability of common asymptotic approximations such as Gaussian or Poisson approximations

arXiv.org e-Print Archive

HAL Descartes

Discovering Functional Communities in Dynamical Networks

Author: Camperi Marcelo F.
Klinkner Kristina Lisa
Shalizi Cosma Rohilla
Publication date: 01/01/2006

Many networks are important because they are substrates for dynamical systems, and their pattern of functional connectivity can itself be dynamic -- they can functionally reorganize, even if their underlying anatomical structure remains fixed. However, the recent rapid progress in discovering the community structure of networks has overwhelmingly focused on that constant anatomical connectivity. In this paper, we lay out the problem of discovering functional communities, and describe an approach to doing so. This method combines recent work on measuring information sharing across stochastic networks with an existing and successful community-discovery algorithm for weighted networks. We illustrate it with an application to a large biophysical model of the transition from beta to gamma rhythms in the hippocampus. Comment: 18 pages, 4 figures, Springer "Lecture Notes in Computer Science" style. Forthcoming in the proceedings of the workshop "Statistical Network Analysis: Models, Issues and New Directions", at ICML 2006. Version 2: small clarifications, typo corrections, added referenc

arXiv.org e-Print Archive

CiteSeerX

Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source

Author: Aho
Allauzen
Antzoulakos
Beaudoing
Boeva
Brazma
Chang
Cormen
Cowan
Crochemore
Denise
El Karoui
Erhardsson
Fiduccia
Frith
Fu
Geske
Godbole
Gregory Nuel
Hampson
Hopcroft
Jean-Guillaume Dumas
Kaltofen
Karlin
Kleffe
Knuth
Le Maout
Lladser
Mariño-Ramírez
Nicodème
Nuel
Pevzner
Prum
Reinert
Ribeca
Régnier
Stefanov
Storjohann
van Helden
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

We present two novel approaches for the computation of the exact distribution of a pattern in a long sequence. Both approaches take into account the sparse structure of the problem and are two-part algorithms. The first approach relies on a partial recursion after a fast computation of the second largest eigenvalue of the transition matrix of a Markov chain embedding. The second approach uses fast Taylor expansions of an exact bivariate rational reconstruction of the distribution. We illustrate the value of both approaches on a simple toy example and two biological applications: the transcription factors of the Human Chromosome 5 and the PROSITE signatures of functional motifs in proteins. On these examples, our methods demonstrate their complementarity and their ability to extend the domain of feasibility for exact computations in pattern problems to a new level

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

HAL Descartes

Hal-Diderot

Weighted k-word matches: a sequence comparison tool for proteins

Author: Burden Conrad John
Jing J.
Wilson S. R.
Publication venue: Australian Mathematical Society
Publication date: 04/06/2011

The use of k-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long-range contiguity has been compromised, for example, by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We computed the mean and variance, and simulated the distribution function for various forms of this statistic for sequences of identically and independently distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested by using simulated evolutionary sequences, and the results compared with BLAST.

Australian Mathematical Society (AustMS): E-Journals