61 research outputs found

    InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites

    Get PDF
    Summary: Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role for an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking into account dependencies only if justified by the data while choosing for simplicity otherwise. Availability and Implementation: InMoDe is implemented in Java and is available as command line application, as application with a graphical user-interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe.Peer reviewe

    Intersection-Validation: A Method for Evaluating Structure Learning without Ground Truth

    Get PDF
    To compare learning algorithms that differ by the adopted statistical paradigm, model class, or search heuristic, it is common to evaluate the performance on training data of varying size. Measuring the performance is straightforward if the data are generated from a known model, the ground truth. However, when the study concerns real-world data, the current methodology is limited to estimating predictive performance, typically by cross-validation. This work introduces a method to compare algorithms’ ability to learn the model structure, assuming no ground truth is given. The idea is to identify a partial structure on which the algorithms agree, and measure the performance in relation to that structure on subsamples of the data. The method is instantiated to structure learning in Bayesian networks, measuring the performance by the structural Hamming distance. It is tested using benchmark ground truth networks and algorithms that maximize various scoring functions. The results show that the method can produce evaluation outcomes that are close to those one would obtain if the ground truth was available.Peer reviewe

    Extended Sunflower Hidden Markov Models for the recognition of homotypic cis-regulatory modules}

    Get PDF
    The transcription of genes is often regulated not only by transcription factors binding at single sites per promoter, but by the interplay of multiple copies of one or more transcription factors binding at multiple sites forming a cis-regulatory module. The computational recognition of cis-regulatory modules from ChIP-seq or other high-throughput data is crucial in modern life and medical sciences. A common type of cis-regulatory modules are homotypic clusters of binding sites, i.e., clusters of binding sites of one transcription factor. For their recognition the homotypic Sunflower Hidden Markov Model is a promising statistical model. However, this model neglects statistical dependences among nucleotides within binding sites and flanking regions, which makes it not well suited for de-novo motif discovery. Here, we propose an extension of this model that allows statistical dependences within binding sites, their reverse complements, and flanking regions. We study the efficacy of this extended homotypic Sunflower Hidden Markov Model based on ChIP-seq data from the Human ENCODE Project and find that it often outperforms the traditional homotypic Sunflower Hidden Markov Model

    On Structure Priors for Learning Bayesian Networks

    Get PDF
    Peer reviewe

    Comparison of NML and Bayesian scoring criteria for learning parsimonious Markov models

    Get PDF
    Parsimonious Markov models, a generalization of variable order Markov models, have been recently introduced for modeling biological sequences. Up to now, they have been learned by Bayesian approaches. However, there is not always sufficient prior knowledge available and a fully uninformative prior is difficult to define. In order to avoid cumbersome cross validation procedures for obtaining the optimal prior choice, we here adapt scoring criteria for Bayesian networks that approximate the Normalized Maximum Likelihood (NML) to parsimonious Markov models. We empirically compare their performance with the Bayesian approach by classifying splice sites, an important problem from computational biology.Non peer reviewe

    Pruning Rules for Learning Parsimonious Context Trees

    Get PDF
    We give a novel algorithm for finding a parsimonious context tree (PCT) that best fits a given data set. PCTs extend traditional context trees by allowing context-specific grouping of the states of a context variable, also enabling skipping the variable. However, they gain statistical efficiency at the cost of computational efficiency, as the search space of PCTs is of tremendous size. We propose pruning rules based on efficiently computable score upper bounds with the aim of reducing this search space significantly. While our concrete bounds exploit properties of the BIC score, the ideas apply also to other scoring functions. Empirical results show that our algorithm is typically an order-of-magnitude faster than a recently proposed memory-intensive algorithm, or alternatively, about equally fast but using dramatically less memory.Peer reviewe

    Learning Bayesian networks with local structure, mixed variables, and exact algorithms

    Get PDF
    Modern exact algorithms for structure learning in Bayesian networks first compute an exact local score of every candidate parent set, and then find a network structure by combinatorial optimization so as to maximize the global score. This approach assumes that each local score can be computed fast, which can be problematic when the scarcity of the data calls for structured local models or when there are both continuous and discrete variables, for these cases have lacked efficient-to-compute local scores. To address this challenge, we introduce a local score that is based on a class of classification and regression trees. We show that under modest restrictions on the possible branchings in the tree structure, it is feasible to find a structure that maximizes a Bayes score in a range of moderate-size problem instances. In particular, this enables global optimization of the Bayesian network structure, including the local structure. In addition, we introduce a related model class that extends ordinary conditional probability tables to continuous variables by employing an adaptive discretization approach. The two model classes are compared empirically by learning Bayesian networks from benchmark real-world and synthetic data sets. We discuss the relative strengths of the model classes in terms of their structure learning capability, predictive performance, and running time. (C) 2019 The Authors. Published by Elsevier Inc.Peer reviewe

    DNA-binding properties of the MADS-domain transcription factor SEPALLATA3 and mutant variants characterized by SELEX-seq

    Get PDF
    Key message We studied the DNA-binding profile of the MADS-domain transcription factor SEPALLATA3 and mutant variants by SELEX-seq. DNA-binding characteristics of SEPALLATA3 mutant proteins lead us to propose a novel DNA-binding mode. MIKC-type MADS-domain proteins, which function as essential transcription factors in plant development, bind as dimers to a 10-base-pair AT-rich motif termed CArG-box. However, this consensus motif cannot fully explain how the abundant family members in flowering plants can bind different target genes in specific ways. The aim of this study was to better understand the DNA-binding specificity of MADS-domain transcription factors. Also, we wanted to understand the role of a highly conserved arginine residue for binding specificity of the MADS-domain transcription factor family. Here, we studied the DNA-binding profile of the floral homeotic MADS-domain protein SEPALLATA3 by performing SELEX followed by high-throughput sequencing (SELEX-seq). We found a diverse set of bound sequences and could estimate the in vitro binding affinities of SEPALLATA3 to a huge number of different sequences. We found evidence for the preference of AT-rich motifs as flanking sequences. Whereas different CArG-boxes can act as SEPALLATA3 binding sites, our findings suggest that the preferred flanking motifs are almost always the same and thus mostly independent of the identity of the central CArG-box motif. Analysis of SEPALLATA3 proteins with a single amino acid substitution at position 3 of the DNA-binding MADS-domain further revealed that the conserved arginine residue, which has been shown to be involved in a shape readout mechanism, is especially important for the recognition of nucleotides at positions 3 and 8 of the CArG-box motif. This leads us to propose a novel DNA-binding mode for SEPALLATA3, which is different from that of other MADS-domain proteins known.Peer reviewe

    Robust learning of inhomogeneous PMMs

    Get PDF
    Inhomogeneous parsimonious Markov models have recently been introduced for modeling symbolic sequences, with a main application being DNA sequence analysis. Structure and parameter learning of these models has been proposed using a Bayesian approach, which entails the practically challenging choice of the prior distribution. Cross validation is a possible way of tuning the prior hyperparameters towards a specific task such as prediction or classification, but it is overly time-consuming. On this account, robust learning methods, which do not require explicit prior specification and – in the absence of prior knowledge – no hyperparameter tuning, are of interest. In this work, we empirically investigate the performance of robust alternatives for structure and parameter learning that extend the practical applicability of parsimonious Markov models to more complex settings than before.Peer reviewe
    • …
    corecore