Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression
One important issue commonly encountered in the analysis of microarray data
is to decide which and how many genes should be selected for further studies.
For discriminant microarray data analyses based on statistical models, such as
the logistic regression models, gene selection can be accomplished by a
comparison of the maximum likelihood of the model given the real data
and the expected maximum likelihood of the model given an
ensemble of surrogate data with randomly permuted labels.
Typically, the computational burden for obtaining the latter is immense,
often exceeding the limits of available computing resources by orders of
magnitude. Here, we propose an approach that circumvents such heavy
computations by mapping the simulation problem to an extreme-value problem. We
present the derivation of an asymptotic distribution of the extreme-value as
well as its mean, median, and variance. Using this distribution, we propose two
gene selection criteria, and we apply them to two microarray datasets and three
classification tasks for illustration. Comment: to be published in Journal of Computational Biology (2004)
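The brute-force comparison that the abstract describes as computationally heavy can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it fits a one-gene logistic regression per gene with a hand-written Newton-Raphson loop, takes the best maximized log-likelihood over genes, and compares it with the average of the same quantity over label permutations. The function names, the toy data generator, and the small ridge term used for numerical stability are all hypothetical choices made for this sketch.

```python
import numpy as np

def max_loglik_logistic(x, y, n_iter=25, ridge=1e-3):
    """Log-likelihood of a one-gene logistic regression at its fitted optimum.

    The tiny ridge term is an assumption of this sketch; it only keeps the
    Newton-Raphson steps stable when classes are (almost) separable.
    """
    X = np.column_stack([np.ones_like(x), x])  # intercept + expression value
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - p) - ridge * beta
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(2)
        beta += np.linalg.solve(hess, grad)
    eta = np.clip(X @ beta, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-eta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def best_gene_loglik(expr, labels):
    """Largest maximized log-likelihood over all genes (columns of expr)."""
    return max(max_loglik_logistic(expr[:, g], labels)
               for g in range(expr.shape[1]))

def permutation_baseline(expr, labels, n_perm=20, seed=0):
    """Average best-gene log-likelihood over randomly permuted labels."""
    rng = np.random.default_rng(seed)
    values = [best_gene_loglik(expr, rng.permutation(labels))
              for _ in range(n_perm)]
    return float(np.mean(values))

def make_toy_data(n_samples=60, n_genes=20, seed=1):
    """Hypothetical toy data: only gene 0 carries class information."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0.0, 1.0], n_samples // 2)
    expr = rng.normal(size=(n_samples, n_genes))
    expr[:, 0] += 2.0 * labels
    return expr, labels
```

Each permutation requires refitting every gene, so the cost grows with the product of the number of genes and the number of permutations; this is the burden the extreme-value approximation in the abstract is meant to avoid.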
InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites
Summary: Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role in an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking dependencies into account only if justified by the data and opting for simplicity otherwise. Availability and Implementation: InMoDe is implemented in Java and is available as a command-line application, as an application with a graphical user interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe. Peer reviewed
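To make the notion of an intra-motif dependency concrete, the sketch below computes the mutual information between two positions of a set of aligned binding sites. A position weight matrix treats positions as independent, so a clearly positive value hints at a dependency that such a model cannot represent. This is only an illustrative statistic and an assumption of this sketch; it is not the parsimonious-model selection that InMoDe performs, and the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)           # marginal counts at position i
    count_j = Counter(s[j] for s in sites)           # marginal counts at position j
    count_ij = Counter((s[i], s[j]) for s in sites)  # joint counts
    return sum(
        (c / n) * math.log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )
```

For the toy alignment `["ACA", "ACA", "TGA", "TGA"]`, positions 0 and 1 always co-vary, whereas position 2 is constant and therefore carries no information about position 0.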
Extended Sunflower Hidden Markov Models for the recognition of homotypic cis-regulatory modules
The transcription of genes is often regulated not only by transcription factors binding at single sites per promoter, but by the interplay of multiple copies of one or more transcription factors binding at multiple sites forming a cis-regulatory module.
The computational recognition of cis-regulatory modules from ChIP-seq or other high-throughput data is crucial in modern life and medical sciences.
A common type of cis-regulatory modules are homotypic clusters of binding sites, i.e., clusters of binding sites of one transcription factor.
For their recognition the homotypic Sunflower Hidden Markov Model is a promising statistical model.
However, this model neglects statistical dependences among nucleotides within binding sites and flanking regions, which makes it not well suited for de-novo motif discovery.
Here, we propose an extension of this model that allows statistical dependences within binding sites, their reverse complements, and flanking regions.
We study the efficacy of this extended homotypic Sunflower Hidden Markov Model based on ChIP-seq data from the Human ENCODE Project and find that it often outperforms the traditional homotypic Sunflower Hidden Markov Model.
A general approach for discriminative de-novo motif discovery from high-throughput data
De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches work perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research
Cross-kingdom comparison of the developmental hourglass.
The developmental hourglass model has its foundations in classic anatomical studies by von Baer and Haeckel. In this context, even the conservation of animal body plans has been explained by evolutionary constraints acting on mid-embryogenic development. Recent studies have shown that developmental hourglass patterns also exist on the transcriptomic level, mirroring the corresponding morphological patterns. The identification of similar patterns in embryonic, post-embryonic, and life cycle spanning transcriptomes in plant and fungus development, however, contradicts the notion of a direct coupling between morphological and molecular patterns. To explain the existence of hourglass patterns across kingdoms and developmental processes, we propose the organizational checkpoint model that integrates the developmental hourglass model into a framework of transcriptome switches.
Comparison of NML and Bayesian scoring criteria for learning parsimonious Markov models
Parsimonious Markov models, a generalization of variable order Markov models, have recently been introduced for modeling biological sequences. Up to now, they have been learned by Bayesian approaches. However, sufficient prior knowledge is not always available, and a fully uninformative prior is difficult to define. In order to avoid cumbersome cross-validation procedures for obtaining the optimal prior choice, we here adapt scoring criteria for Bayesian networks that approximate the Normalized Maximum Likelihood (NML) to parsimonious Markov models. We empirically compare their performance with the Bayesian approach by classifying splice sites, an important problem from computational biology. Non peer reviewed
MotifAdjuster: a tool for computational reassessment of transcription factor binding site annotations
MotifAdjuster helps to detect errors in binding site annotations.
Prediction of regulatory targets of alternative isoforms of the epidermal growth factor receptor in a glioblastoma cell line.
Background: The epidermal growth factor receptor (EGFR) is a major regulator of proliferation in tumor cells. Elevated expression levels of EGFR are associated with prognosis and clinical outcomes of patients in a variety of tumor types. There are at least four splice variants of the mRNA encoding four protein isoforms of EGFR in humans, named I through IV. EGFR isoform I is the full-length protein, whereas isoforms II-IV are shorter protein isoforms. Nevertheless, all EGFR isoforms bind the epidermal growth factor (EGF). Although EGFR is an essential target of long-established and successful tumor therapeutics, the exact function and biomarker potential of alternative EGFR isoforms II-IV are unclear, motivating more in-depth analyses. Hence, we analyzed transcriptome data from glioblastoma cell line SF767 to predict target genes regulated by EGFR isoforms II-IV, but not by EGFR isoform I or by other receptors such as HER2, HER3, or HER4. Results: We analyzed the differential expression of potential target genes in a glioblastoma cell line in two nested RNAi experimental conditions and one negative control, contrasting expression with EGF stimulation against expression without EGF stimulation. In one RNAi experiment, we selectively knocked down EGFR splice variant I, while in the other we knocked down all four EGFR splice variants, so the associated effects of EGFR II-IV knock-down can only be inferred indirectly. For this type of nested experimental design, we developed a two-step bioinformatics approach based on the Bayesian Information Criterion for predicting putative target genes of EGFR isoforms II-IV.
Finally, we experimentally validated a set of six putative target genes, and we found that qPCR validations confirmed the predictions in all cases. Conclusions: By performing RNAi experiments for three poorly investigated EGFR isoforms, we were able to successfully predict 1140 putative target genes specifically regulated by EGFR isoforms II-IV using the developed Bayesian Gene Selection Criterion (BGSC) approach. This approach is readily applicable to the analysis of data from other nested experimental designs, and we provide an implementation in R that is easily adaptable to similar data or experimental designs, together with all raw datasets used in this study, in the BGSC repository, https://github.com/GrosseLab/BGSC
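The general idea of choosing between competing expression models with the Bayesian Information Criterion can be sketched as below. This is a generic illustration, assuming a Gaussian model with one mean per group of conditions and a shared variance; it is not the BGSC implementation from the repository above (which is written in R), and the function name and grouping encoding are hypothetical.

```python
import numpy as np

def gaussian_bic(values, groups):
    """BIC of a Gaussian model with one mean per group and a shared variance.

    BIC = k * ln(n) - 2 * ln(L_max), where k counts the group means plus the
    variance. Lower values indicate the preferred model.
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    n = values.size
    labels = np.unique(groups)
    resid = values.copy()
    for g in labels:
        mask = groups == g
        resid[mask] -= values[mask].mean()       # subtract the fitted group mean
    sigma2 = max(float(np.mean(resid ** 2)), 1e-12)  # ML variance estimate
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    k = len(labels) + 1
    return k * np.log(n) - 2.0 * loglik
```

Each candidate grouping of the experimental conditions corresponds to one hypothesis about which knock-down changes a gene's expression; the grouping with the lowest BIC is the one that best balances fit against the number of parameters.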