225 research outputs found
Adaptive estimation for Hawkes processes; application to genome analysis
The aim of this paper is to provide a new method for the detection of either
favored or avoided distances between genomic events along DNA sequences. These
events are modeled by a Hawkes process. The biological problem is actually
complex enough to need a nonasymptotic penalized model selection approach. We
provide a theoretical penalty that satisfies an oracle inequality even for
quite complex families of models. The resulting theoretical estimator is
shown to be adaptive minimax for Hölderian functions over a range of
regularities; those aspects have not yet been studied for the Hawkes process.
Moreover, we introduce an efficient strategy, named Islands, which is not
classically used in model selection, but that happens to be particularly
relevant to the biological question we want to answer. Since a multiplicative
constant in the theoretical penalty is not computable in practice, we provide
extensive simulations to find a data-driven calibration of this constant. The
results obtained on real genomic data are consistent with biological knowledge
and ultimately refine it.
Comment: Published at http://dx.doi.org/10.1214/10-AOS806 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
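For orientation, a linear Hawkes process is usually described through its conditional intensity. The form below is the standard textbook one, not a formula quoted from the paper; the symbols \nu (spontaneous rate), h (interaction function) and T_i (event positions along the sequence) are assumed notation.

```latex
\[
  \lambda(t) \;=\; \nu + \int_{-\infty}^{t^-} h(t-u)\,\mathrm{d}N_u
            \;=\; \nu + \sum_{i \,:\, T_i < t} h(t - T_i)
\]
```

Under this reading, a distance d at which h(d) is large is a favored distance between events. Estimating h from the observed positions is the model selection problem the abstract describes.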
Multiple Comparative Metagenomics using Multiset k-mer Counting
Background. Large scale metagenomic projects aim to extract biodiversity
knowledge between different environmental conditions. Current methods for
comparing microbial communities face important limitations. Those based on
taxonomic or functional assignment rely on a small subset of the sequences
that can be associated with known organisms. On the other hand, de novo methods,
which compare the whole sets of sequences, either do not scale up to ambitious
metagenomic projects or do not provide precise and exhaustive results.
Methods. These limitations motivated the development of a new de novo
metagenomic comparative method, called Simka. This method computes a large
collection of standard ecological distances by replacing species counts by
k-mer counts. Simka scales up to today's metagenomic projects thanks to a new
parallel k-mer counting strategy on multiple datasets.
Results. Experiments on public Human Microbiome Project datasets demonstrate
that Simka captures the essential underlying biological structure. Simka was
able to compute in a few hours both qualitative and quantitative ecological
distances on hundreds of metagenomic samples (690 samples, 32 billion
reads). We also demonstrate that analyzing metagenomes at the k-mer level is
highly correlated with extremely precise de novo comparison techniques which
rely on an all-versus-all sequence alignment strategy or which are based on
taxonomic profiling.
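The idea of replacing species counts by k-mer counts can be sketched in a few lines. This is an illustration only and not Simka's implementation: the function names are hypothetical, the Bray-Curtis and Jaccard formulas are picked as typical quantitative and qualitative ecological distances, and both samples are assumed to be non-empty.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def bray_curtis(a, b):
    """Quantitative (abundance-based) Bray-Curtis dissimilarity.

    Assumes at least one k-mer overall, otherwise the total is zero.
    """
    shared = sum(min(a[x], b[x]) for x in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total

def jaccard_distance(a, b):
    """Qualitative (presence/absence) Jaccard distance."""
    union = a.keys() | b.keys()
    return 1.0 - len(a.keys() & b.keys()) / len(union)

# Two toy "metagenomic samples" (hypothetical reads).
sample_a = kmer_counts(["ACGTAC", "GTACGT"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
print(bray_curtis(sample_a, sample_b), jaccard_distance(sample_a, sample_b))
```

A real tool has to count k-mers across hundreds of samples at once, which is where the parallel counting strategy mentioned in the abstract comes in. The naive in-memory counters above would not scale to that size.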
Statistical tests to compare motif count exceptionalities
BACKGROUND: Finding over- or under-represented motifs in biological sequences is now a common task in genomics. Thanks to p-value calculation for motif counts, exceptional motifs are identified and represent candidate functional motifs. The present work addresses the related question of comparing the exceptionality of one motif in two different sequences. Simply comparing the motif count p-values in each sequence is not sufficient to decide whether this motif is significantly more exceptional in one sequence than in the other. A statistical test is required.
RESULTS: We develop and analyze two statistical tests, an exact binomial one and an asymptotic likelihood ratio test, to decide whether the exceptionality of a given motif is equivalent or significantly different in two sequences of interest. For that purpose, motif occurrences are modeled by Poisson processes, with special care for overlapping motifs. Both tests can take the sequence compositions into account. As an illustration, we compare the octamer exceptionalities in the Escherichia coli K-12 backbone versus variable strain-specific loops.
CONCLUSION: The exact binomial test is particularly suited to small counts. For large counts, we advise using the likelihood ratio test, which is asymptotic but strongly correlated with the exact binomial test and very simple to use.
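A minimal sketch of an exact binomial comparison of two Poisson counts follows. It rests on an assumed formalization rather than the paper's exact statistic. If the counts n1 and n2 are independent Poisson variables with expected counts e1 and e2 under the background model, then under the null hypothesis of equal exceptionality (equal observed/expected ratio) n1 given n1 + n2 is binomial with success probability e1 / (e1 + e2). The function names and the two-sided p-value convention are illustrative choices, and overlapping motifs are ignored.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes among n trials with success probability p."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

def exact_binomial_test(n1, n2, e1, e2):
    """Two-sided exact p-value for H0: n1/e1 and n2/e2 share the same mean ratio.

    n1, n2: observed motif counts in the two sequences.
    e1, e2: expected counts under the background model (assumed positive).
    """
    n = n1 + n2
    p0 = e1 / (e1 + e2)
    observed = binom_pmf(n1, n, p0)
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    # The small relative tolerance guards against floating-point ties.
    probabilities = [binom_pmf(k, n, p0) for k in range(n + 1)]
    p_value = sum(q for q in probabilities if q <= observed * (1.0 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts: 10 occurrences versus 0, with equal expected counts.
print(exact_binomial_test(10, 0, 10.0, 10.0))
```

Enumerating all n + 1 outcomes is cheap for small counts, the regime where the abstract recommends the exact test. For large counts the likelihood ratio test avoids the enumeration.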
Assessing the Exceptionality of Coloured Motifs in Networks
Various methods have been recently employed to characterise the structure of biological networks. In particular, the concept of network motif and the related one of coloured motif have proven useful to model the notion of a functional/evolutionary building block. However, algorithms that enumerate all the motifs of a network may produce a very large output, and methods to decide which motifs should be selected for downstream analysis are needed. A widely used method is to assess if the motif is exceptional, that is, over- or under-represented with respect to a null hypothesis. Much effort has gone over the last thirty years into deriving P-values for the frequencies of topological motifs, that is, fixed subgraphs. They rely either on (compound) Poisson and Gaussian approximations for the motif count distribution in Erdős-Rényi random graphs or on simulations in other models. We focus on a different definition of graph motifs that corresponds to coloured motifs. A coloured motif is a connected subgraph with fixed vertex colours but unspecified topology. Our work is the first analytical attempt to assess the exceptionality of coloured motifs in networks without any simulation. We first establish analytical formulae for the mean and the variance of the count of a coloured motif in an Erdős-Rényi random graph model. Using simulations under this model, we further show that a Pólya-Aeppli distribution bette
Statistique des comparaisons de génomes complets bactériens (Statistics of complete bacterial genome comparisons)
Comparative genomics is the study of the structural and functional relationships between genomes belonging to different strains or species. This discipline offers great opportunities to investigate and understand the processes that shape genomes over the course of evolution. In this thesis, we focused on the comparative genomics of bacteria and, more precisely, on methods dedicated to comparing the complete DNA sequences of bacterial genomes. Over the last decade, the design of computational tools to compare whole genomes at the DNA scale has become a research topic in its own right. Many tools are now dedicated to this task. However, until now, most efforts have been directed at reducing execution time and memory usage, at the expense of evaluating the quality of the results. To fill this gap, we worked on different statistical issues raised by the comparison of complete bacterial genomes. Our work followed two directions.
First, we assessed the robustness of complete bacterial genome alignments. We proposed an original method based on random perturbations of the compared genomes. Three different scores are then computed to estimate the robustness of genome alignments at different scales, from nucleotides to the complete genome sequences. Our method was tested on real bacterial genomic data. Our scores identify both robust and non-robust alignments. They can be used to correct an alignment or to compare alignments obtained with different tools.
Second, we studied the problem of parametrizing whole-genome comparison tools. Most existing tools lack both documentation and reliable default values for their parameters. Consequently, there is a crucial need for specific methods to help users choose appropriate parameter values. Many whole-genome comparison tools are based on the detection of matches (exact common words). The key parameter for these methods is the length of the matches to be considered. During this thesis, we developed two statistical methods to estimate an optimal match length. Our first approach uses a mixture model of geometric distributions to characterize the distribution of the length of matches obtained when comparing two genomic sequences. The second approach is based on a Poisson approximation of the distribution of the number of matches between two Markov chains. These statistical methods allow us to easily identify an optimal match length for both simulated and real genomic data. We also showed that this optimal length depends on characteristics of the compared genomes such as their length, their nucleotide composition, and their relative divergence. This thesis represents one of the earliest attempts to statistically evaluate and improve the quality of complete genome comparisons. The interest and limitations of our different methods are discussed and several perspectives are proposed.
Identification of DNA Motifs Implicated in Maintenance of Bacterial Core Genomes by Predictive Modeling
Bacterial biodiversity at the species level, in terms of gene acquisition or loss, is so immense that it raises the question of how essential chromosomal regions are spared from uncontrolled rearrangements. Protection of the genome likely depends on specific DNA motifs that impose limits on the regions that undergo recombination. Although most such motifs remain unidentified, they are theoretically predictable based on their genomic distribution properties. We examined the distribution of the “crossover hotspot instigator,” or Chi, in Escherichia coli, and found that its exceptional distribution is restricted to the core genome common to three strains. We then formulated a set of criteria that were incorporated in a statistical model to search core genomes for motifs potentially involved in genome stability in other species. Our strategy led us to identify and biologically validate two distinct heptamers that possess Chi properties, one in Staphylococcus aureus, and the other in several streptococci. This strategy paves the way for wide-scale discovery of other important functional noncoding motifs that distinguish core genomes from the strain-variable regions.
The MatP/matS Site-Specific System Organizes the Terminus Region of the E. coli Chromosome into a Macrodomain
The organization of the Escherichia coli chromosome into insulated macrodomains influences the segregation of sister chromatids and the mobility of chromosomal DNA. Here, we report that organization of the Terminus region (Ter) into a macrodomain relies on the presence of a 13 bp motif called matS repeated 23 times in the 800-kb-long domain. matS sites are the main targets in the E. coli chromosome of a newly identified protein designated MatP. MatP accumulates in the cell as a discrete focus that colocalizes with the Ter macrodomain. The effects of MatP inactivation reveal its role as the main organizer of the Ter macrodomain: in the absence of MatP, DNA is less compacted, the mobility of markers is increased, and segregation of the Ter macrodomain occurs early in the cell cycle. Our results indicate that a specific organizational system is required in the Terminus region for bacterial chromosome management during the cell cycle.
Finding and counting vertex-colored subtrees
The problems studied in this article originate from the Graph Motif problem
introduced by Lacroix et al. in the context of biological networks. The problem
is to decide if a vertex-colored graph has a connected subgraph whose colors
equal a given multiset of colors M. It is a graph pattern-matching problem
variant, where the structure of the occurrence of the pattern is not of
interest but the only requirement is the connectedness. Using an algebraic
framework recently introduced by Koutis et al., we obtain new FPT algorithms
for Graph Motif and variants, with improved running times. We also obtain
results on the counting versions of this problem, proving that the counting
problem is FPT if M is a set, but becomes W[1]-hard if M is a multiset with two
colors. Finally, we present an experimental evaluation of this approach on real
datasets, showing that its performance compares favorably with existing
software.
Comment: Conference version in International Symposium on Mathematical
Foundations of Computer Science (MFCS), Brno, Czech Republic (2010). Journal
version in Algorithmic
String Matching and 1d Lattice Gases
We calculate the probability distributions for the number of occurrences
of a given word in a random string of letters. Analytical
expressions for the distribution are known for two asymptotic regimes, one
Gaussian and one compound Poisson. However, it is known that these
distributions do not work well in the intermediate regime. We show that the
problem of calculating the string matching probability can be cast as
determining the configurational partition function of a 1d lattice gas with
interacting particles so that the matching probability becomes the
grand-partition sum of the lattice gas, with the number of particles
corresponding to the number of matches. We perform a virial expansion of the
effective equation of state and obtain the probability distribution. Our result
reproduces the behavior of the distribution in all regimes. We are also able to
show analytically how the limiting distributions arise. Our analysis builds on
the fact that the effective interactions between the particles consist of a
relatively strong core whose size is the word length, followed by a weak,
exponentially decaying tail. We find that the asymptotic regimes correspond to
the case where the tail of the interactions can be neglected, while in the
intermediate regime they need to be kept in the analysis. Our results are
readily generalized to the case where the random strings are generated by more
complicated stochastic processes such as a non-uniform letter probability
distribution or Markov chains. We show that in these cases the tails of the
effective interactions can be made even more dominant, thus rendering the
asymptotic approximations less accurate in such a regime.
Comment: 44 pages and 8 figures. Major revision of previous version. The
lattice gas analogy has been worked out in full, including virial expansion
and equation of state. This constitutes the main part of the paper now.
Connections with existing work are made and references should be up to date
now. To be submitted for publication.
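The quantity studied in this abstract, the distribution of the number of occurrences of a word in a random string, can be explored numerically with a small simulation. The sketch below assumes independent, uniformly distributed letters; the function names and parameters are hypothetical. It reproduces none of the paper's lattice gas calculation and only gives an empirical histogram to compare against the asymptotic approximations.

```python
import random

def count_occurrences(text, word):
    """Count occurrences of word in text, overlapping occurrences included."""
    return sum(text.startswith(word, i) for i in range(len(text) - len(word) + 1))

def expected_count(string_length, word_length, alphabet_size):
    """Mean number of occurrences for uniform independent letters."""
    positions = string_length - word_length + 1
    return positions / alphabet_size ** word_length

def empirical_distribution(word, string_length, alphabet, trials, seed=0):
    """Estimate P(number of occurrences = n) by simulating random strings."""
    rng = random.Random(seed)
    histogram = {}
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(string_length))
        n = count_occurrences(text, word)
        histogram[n] = histogram.get(n, 0) + 1
    return {n: c / trials for n, c in sorted(histogram.items())}

distribution = empirical_distribution("ab", string_length=50, alphabet="ab", trials=200)
print(expected_count(50, 2, 2), distribution)
```

Self-overlapping words such as "aa" tend to occur in clumps. That clumping is why a compound Poisson approximation, rather than a plain Poisson one, appears among the asymptotic regimes.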