18 research outputs found

    Parsimonious Higher-Order Hidden Markov Models for Improved Array-CGH Analysis with Applications to Arabidopsis thaliana

    Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are based only on first-order HMMs, which have limited ability to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs that enable interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms, revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations on widely considered Array-CGH data of human cell lines indicate that parsimonious HMMs are also well suited for the analysis of non-plant-specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/PHHMM).

    Dealing with Small Data: On the Generalization of Context Trees

    Context trees (CTs) are a widely used tool in machine learning for representing context-specific independences in conditional probability distributions. Parsimonious context trees (PCTs) are a recently proposed generalization of CTs that can enable statistically more efficient learning due to a higher structural flexibility, which is particularly useful for small-data settings. However, this comes at the cost of computationally expensive structure learning, which is feasible only for domains with small alphabets and tree depths. In this work, we investigate to which degree CTs can be generalized to increase statistical efficiency while still keeping the learning computationally feasible. Approaching this goal from two different angles, we (i) propose algorithmic improvements to the PCT learning algorithm, and (ii) study further generalizations of CTs, which are inspired by PCTs but trade structural flexibility for computational efficiency. Through empirical studies on both simulated and real-world data, we demonstrate that combining the two orthogonal approaches yields a substantial breakthrough in obtaining statistically efficient and computationally feasible generalizations of CTs.

    Algorithms for learning parsimonious context trees

    Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions. They are particularly powerful for modeling context-specific independencies in sequential discrete data. Learning PCTs from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Under the score-and-search paradigm, the fastest algorithm for finding an optimal PCT, prior to the present work, is based on dynamic programming. While the algorithm can handle small instances fast, it becomes infeasible already when there are half a dozen four-state predictor variables. Here, we show that common scoring functions enable the use of new algorithmic ideas, which can significantly expedite the dynamic programming algorithm on typical data. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. On real-world data from recent applications of PCTs within computational biology, the ideas are shown to reduce the traversed search space and the computation time by several orders of magnitude in typical cases.

    Structural and functional evolution of genes in conifers

    Technical advances have accelerated the structural and functional exploration of conifer genomes and opened up new approaches to study their physiology and adaptation to environmental conditions. This thesis focuses on the evolution of conifer genes and explores (i) the genomic factors that have impacted the evolution of gene structure and (ii) the evolution of a large gene family involved in drought tolerance, the dehydrins. The analysis of gene structure was based on white spruce (Picea glauca [Moench] Voss) sequence data from BAC clones, the genome assembly and the gene space obtained from sequence capture. Through comparative analyses, we found that conifers presented more intronic sequence per gene than most flowering plants (angiosperms) and that the average intron length was not directly correlated to genome size. We found that repetitive elements, which are responsible for the very large size of conifer genomes, also affect the evolution of exons and introns. In the second part of the thesis, we undertook the first exhaustive analysis of the dehydrin gene family in conifers. The phylogenetic analyses indicated the occurrence of a series of gene duplications in conifers and a major lineage duplication, which caused the expansion of the dehydrin family in the genus Picea. Conifer dehydrins have an array of modular amino acid structures, and in spruce, these structures are particularly diverse and are associated with different expression patterns in response to dehydration stress. Taken together, our findings suggest that the evolution of gene structure is dynamic in conifers, which contrasts with a widely accepted slow rate of chromosome evolution. They further indicate that the expansion and diversification of adaptation-related genes, like the dehydrins in spruce, may confer the phenotypic plasticity to respond to environmental changes during their long life span.

    Cereal Genomics II

    During the last decades, major advances have been made in the field of cereal genomics. For instance, high-density genetic maps, physical maps, QTL maps and even draft genome sequences have become available for several cereal species. This has been facilitated by the development of next generation sequencing (NGS) technologies, so that it is now possible to sequence the genomes of hundreds or thousands of accessions of an individual cereal crop. The large amounts of data generated using these latest NGS technologies have created a demand for computational tools to analyse them. These developments in technology and tools, along with their applications not only to plant and genome biology but also to breeding, are documented in this volume. The volume, entitled “Cereal Genomics II”, therefore supplements the earlier edited volume “Cereal Genomics” published in 2004. The new volume has updated chapters, from the leading authorities in their fields, on molecular markers, next generation sequencing platforms and their use for QTL analysis, domestication studies, functional genomics and molecular breeding. In addition, there are also chapters on computational genomics, whole genome sequencing and comparative genomics of cereals. The book should prove useful to students, teachers and young research workers as a ready reference to the latest information on cereal genomics.

    A statistical modeling framework for analyzing tree-indexed data: Application to plant development on microscopic and macroscopic scales

    We address statistical models for tree-indexed data. In the Virtual Plants team, the host team for this thesis, applications of interest focus on plant development and its modulation by environmental and genetic factors. We thus focus on plant developmental applications both at a microscopic level, with the study of the cell lineage in the biological tissue responsible for plant growth, and at a macroscopic level, with the mechanism of branch production. Far fewer models are available for tree-indexed data than for path-indexed data. This thesis therefore aims to propose a statistical modeling framework for studying patterns in tree-indexed data. To this end, two classes of statistical models are investigated: Markov models, together with their extension to hidden Markov models, and multiple change-point models. In particular, we propose new methods for inferring the conditional independence structure between parent and child nodes in Markov models, based on graph selection algorithms for probabilistic graphical models. The models studied are applied to cell lineage trees at the microscopic scale and to branching systems at the macroscopic scale.