5,402 research outputs found
Regulatory Motif Finding by Logic Regression
Multiple transcription factors coordinately control transcriptional regulation of genes in eukaryotes. Although multiple computational methods consider the identification of individual transcription factor binding sites (TFBSs), very few focus on the interactions between these sites. We consider finding transcription factor binding sites and their context-specific interactions using microarray gene expression data. We devise a hybrid approach called LogicMotif, composed of a TFBS identification method combined with the new regression methodology, logic regression, of Ruczinski et al. (2003). LogicMotif has two steps. First, potential binding sites are identified from transcription control regions of genes of interest. Various available methods can be used in this first step when the genes of interest can be divided into groups such as up- and down-regulated. For this step, we also develop a simple univariate regression and extension method, MFURE, to extract candidate TFBSs from a large number of genes when microarray gene expression data are available. MFURE provides an alternative method for this step when partitioning of the genes into disjoint groups is not preferred. This first step aims to identify individual sites within gene groups of interest or sites that are correlated with the gene expression outcome. In the second step, logic regression is used to build a predictive model of the outcome of interest (either gene expression or up- and down-regulation) using these potential sites. This two-fold approach creates a rich, diverse set of potential binding sites in the first step and builds regression or classification models in the second step using logic regression, which is particularly good at identifying complex interactions.
LogicMotif is applied to two publicly available data sets. A genome-wide gene expression data set of Saccharomyces cerevisiae is used for validation. The regression models obtained are interpretable and the biological implications are in agreement with the known results. This analysis suggests that LogicMotif provides biologically more reasonable regression models than previous analyses of this data set with standard linear regression methods. Another data set of Saccharomyces cerevisiae illustrates the use of LogicMotif in classification questions by building a model that discriminates between up- and down-regulated genes in iron copper deficiency. LogicMotif identified an inductive and two repressor motifs in this data set. The inductive motif matches the binding site of the transcription factor Aft1p, which has a key role in regulation of the uptake process. One of the novel repressor sites is highly present in transcription control regions of FeS genes. This site could represent a TFBS for an unknown transcription factor involved in repression of genes encoding FeS proteins in iron deficiency. We established the stability of the method to the type of outcome variable by using both continuous and binary outcome variables for this data set. Our results indicate that logic regression used in combination with cluster/group operating binding site identification methods or with our proposed method MFURE is a powerful and flexible alternative to linear regression based motif finding methods.
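The idea of scoring Boolean combinations of candidate sites against expression, as described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' MFURE or the logic regression of Ruczinski et al., which searches over logic trees; it only enumerates AND/OR pairs exhaustively. The function names (`has_motif`, `rss_binary`, `best_logic_pair`) and the scoring by residual sum of squares are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def has_motif(seq, motif):
    """Binary site indicator: 1 if motif occurs in the sequence, else 0."""
    return 1 if motif in seq else 0


def rss_binary(x, y):
    """Residual sum of squares of a least-squares fit of y on a binary x.

    With a single binary predictor, the fitted values are the group means.
    """
    rss = 0.0
    for level in (0, 1):
        group = [yi for xi, yi in zip(x, y) if xi == level]
        if group:
            m = mean(group)
            rss += sum((yi - m) ** 2 for yi in group)
    return rss


def best_logic_pair(promoters, expression, motifs):
    """Return (rss, description) for the best AND/OR pair of motifs.

    Hypothetical helper: exhaustive search over pairs, for illustration only.
    """
    indicators = {m: [has_motif(s, m) for s in promoters] for m in motifs}
    operators = (("AND", lambda p, q: p & q), ("OR", lambda p, q: p | q))
    best = None
    for a, b in combinations(motifs, 2):
        for name, op in operators:
            x = [op(p, q) for p, q in zip(indicators[a], indicators[b])]
            score = rss_binary(x, expression)
            if best is None or score < best[0]:
                best = (score, f"{a} {name} {b}")
    return best
```

Calling `best_logic_pair(promoters, expression, motifs)` with a list of promoter strings, a matching list of expression values, and a list of candidate motif strings returns the pairwise logic term that best explains the expression values under this simplified score.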
Validating module network learning algorithms using simulated data
In recent years, several authors have used probabilistic graphical models to
learn expression modules and their regulatory programs from gene expression
data. Here, we demonstrate the use of the synthetic data generator SynTReN for
the purpose of testing and comparing module network learning algorithms. We
introduce a software package for learning module networks, called LeMoNe, which
incorporates a novel strategy for learning regulatory programs. Novelties
include the use of a bottom-up Bayesian hierarchical clustering to construct
the regulatory programs, and the use of a conditional entropy measure to assign
regulators to the regulation program nodes. Using SynTReN data, we test the
performance of LeMoNe in a completely controlled situation and assess the
effect of the methodological changes we made with respect to an existing
software package, namely Genomica. Additionally, we assess the effect of
various parameters, such as the size of the data set and the amount of noise,
on the inference performance. Overall, application of Genomica and LeMoNe to
simulated data sets gave comparable results. However, LeMoNe offers some
advantages, one of them being that the learning process is considerably faster
for larger data sets. Additionally, we show that the location of the regulators
in the LeMoNe regulation programs and their conditional entropy may be used to
prioritize regulators for functional validation, and that the combination of
the bottom-up clustering strategy with the conditional entropy-based assignment
of regulators improves the handling of missing or hidden regulators.
Comment: 13 pages, 6 figures + 2 pages, 2 figures supplementary information
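As a rough illustration of scoring regulators by conditional entropy, the sketch below ranks candidate regulators by how well their discretized expression explains a two-way split of conditions. The abstract does not give LeMoNe's exact measure, so the thresholding at a fixed value, the function names, and the `threshold` parameter are all assumptions for illustration.

```python
from collections import Counter
from math import log2


def conditional_entropy(labels, states):
    """H(labels | states) in bits, estimated from paired observations."""
    n = len(labels)
    joint = Counter(zip(states, labels))
    marginal = Counter(states)
    h = 0.0
    for (state, _label), count in joint.items():
        # p(state, label) * log2 p(label | state)
        h -= (count / n) * log2(count / marginal[state])
    return h


def rank_regulators(split, regulator_profiles, threshold=0.0):
    """Rank regulators by H(split | regulator state), lowest entropy first.

    `split` assigns each condition to one side of a tree node; each profile is
    discretized as "above threshold" or not (an assumed, simplistic choice).
    """
    scores = {}
    for name, profile in regulator_profiles.items():
        states = [value > threshold for value in profile]
        scores[name] = conditional_entropy(split, states)
    return sorted(scores.items(), key=lambda item: item[1])
```

A regulator whose state fully determines the split scores zero bits, so it appears first in the returned list.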
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
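For readers unfamiliar with the EM algorithm surveyed above, here is a minimal generic sketch for a two-component, one-dimensional Gaussian mixture with unit variances. It is not taken from the article; the function name, the starting values, and the fixed-variance simplification are assumptions chosen to keep the example short.

```python
from math import exp


def em_two_gaussians(data, mu=(-1.0, 1.0), pi=0.5, n_iter=50):
    """EM for a mixture of two unit-variance Gaussians.

    Returns (mu0, mu1, pi), where pi is the mixing weight of component 1.
    """
    mu0, mu1 = mu
    n = len(data)
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        resp = []
        for x in data:
            p0 = (1.0 - pi) * exp(-0.5 * (x - mu0) ** 2)
            p1 = pi * exp(-0.5 * (x - mu1) ** 2)
            resp.append(p1 / (p0 + p1))
        # M-step: responsibility-weighted means and updated mixing weight.
        n1 = sum(resp)
        n0 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu0 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
        pi = n1 / n
    return mu0, mu1, pi
```

The same alternation of an expectation step over hidden assignments and a maximization step over parameters underlies the motif discovery and alignment applications the abstract lists, with sequence models in place of Gaussians.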
Meta-analysis of massively parallel reporter assays enables prediction of regulatory function across cell types.
Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications in understanding genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure regulatory activity of thousands of DNA sequences and their variants in a single experiment. With an increasing number of publicly available MPRA data sets, one can now develop data-driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta-analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrate that predictive models trained in one cellular context can be used to predict MPRA output in another, with loss of accuracy attributed to cell-type-specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" Challenge for predicting effects of single-nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide effective design of future experiments by better prioritizing regions of interest.
FAM129B, an antioxidative protein, reduces chemosensitivity by competing with Nrf2 for Keap1 binding.
Background: The transcription factor Nrf2 is a master regulator of antioxidant response. While Nrf2 activation may counter increasing oxidative stress in aging, its activation in cancer can promote cancer progression and metastasis, and confer resistance to chemotherapy and radiotherapy. Thus, Nrf2 has been considered as a key pharmacological target. Unfortunately, there are no specific Nrf2 inhibitors for therapeutic application. Moreover, high Nrf2 activity in many tumors without Keap1 or Nrf2 mutations suggests that alternative mechanisms of Nrf2 regulation exist. Methods: Interaction of FAM129B with Keap1 is demonstrated by immunofluorescence, colocalization, co-immunoprecipitation and mammalian two-hybrid assay. Antioxidative function of FAM129B is analyzed by measuring ROS levels with DCF/flow cytometry, Nrf2 activation using luciferase reporter assay and determination of downstream gene expression by qPCR and western blotting. Impact of FAM129B on in vivo chemosensitivity is examined in mice bearing breast and colon cancer xenografts. The clinical relevance of FAM129B is assessed by qPCR in breast cancer samples and data mining of publicly available databases. Findings: We have demonstrated that FAM129B in cancer promotes Nrf2 activity by reducing its ubiquitination through competition with Nrf2 for Keap1 binding via its DLG and ETGE motifs. In addition, FAM129B reduces chemosensitivity by augmenting Nrf2 antioxidative signaling and confers poor prognosis in breast and lung cancer. Interpretation: These findings demonstrate the important role of FAM129B in Nrf2 activation and antioxidative response, and identify FAM129B as a potential therapeutic target. Fund: The Chang Gung Medical Foundation (Taiwan) and the Ministry of Science and Technology (Taiwan).
Learning “graph-mer” Motifs that Predict Gene Expression Trajectories in Development
A key problem in understanding transcriptional regulatory networks is deciphering what cis regulatory logic is encoded in gene promoter sequences and how this sequence information maps to expression. A typical computational approach to this problem involves clustering genes by their expression profiles and then searching for overrepresented motifs in the promoter sequences of genes in a cluster. However, genes with similar expression profiles may be controlled by distinct regulatory programs. Moreover, if many gene expression profiles in a data set are highly correlated, as in the case of whole organism developmental time series, it may be difficult to resolve fine-grained clusters in the first place. We present a predictive framework for modeling the natural flow of information, from promoter sequence to expression, to learn cis regulatory motifs and characterize gene expression patterns in developmental time courses. We introduce a cluster-free algorithm based on a graph-regularized version of partial least squares (PLS) regression to learn sequence patterns (represented by graphs of k-mers, or “graph-mers”) that predict gene expression trajectories. Applying the approach to wild-type germline development in Caenorhabditis elegans, we found that the first and second latent PLS factors mapped to expression profiles for oocyte and sperm genes, respectively. We extracted both known and novel motifs from the graph-mers associated with these germline-specific patterns, including novel CG-rich motifs specific to oocyte genes. We found evidence supporting the functional relevance of these putative regulatory elements through analysis of positional bias, motif conservation and in situ gene expression. This study demonstrates that our regression model can learn biologically meaningful latent structure and identify potentially functional motifs from subtle developmental time course expression data.
Identification of Yeast Transcriptional Regulation Networks Using Multivariate Random Forests
The recent availability of whole-genome scale data sets that investigate complementary and diverse aspects of transcriptional regulation has spawned an increased need for new and effective computational approaches to analyze and integrate these large scale assays. Here, we propose a novel algorithm, based on random forest methodology, to relate gene expression (as derived from expression microarrays) to sequence features residing in gene promoters (as derived from DNA motif data) and transcription factor binding to gene promoters (as derived from tiling microarrays). We extend the random forest approach to model a multivariate response as represented, for example, by time-course gene expression measures. An analysis of the multivariate random forest output reveals complex regulatory networks, which consist of cohesive, condition-dependent regulatory cliques. Each regulatory clique features homogeneous gene expression profiles and common motifs or synergistic motif groups. We apply our method to several yeast physiological processes: cell cycle, sporulation, and various stress conditions. Our technique displays excellent performance with regard to identifying known regulatory motifs, including high order interactions. In addition, we present evidence of the existence of an alternative MCB-binding pathway, which we confirm using data from two independent cell cycle studies and two other physiological processes. Finally, we have uncovered elaborate transcription regulation refinement mechanisms involving PAC and mRRPE motifs that govern essential rRNA processing. These include intriguing instances of differing motif dosages and differing combinatorial motif control that promote regulatory specificity in rRNA metabolism under differing physiological processes.
Transcription factor binding site prediction with multivariate gene expression data
Multi-sample microarray experiments have become a standard experimental
method for studying biological systems. A frequent goal in such studies is to
unravel the regulatory relationships between genes. During the last few years,
regression models have been proposed for the de novo discovery of cis-acting
regulatory sequences using gene expression data. However, when applied to
multi-sample experiments, existing regression based methods model each
individual sample separately. To better capture the dynamic relationships in
multi-sample microarray experiments, we propose a flexible method for the joint
modeling of promoter sequence and multivariate expression data. In higher order
eukaryotic genomes expression regulation usually involves combinatorial
interaction between several transcription factors. Experiments have shown that
spacing between transcription factor binding sites can significantly affect
their strength in activating gene expression. We propose an adaptive model
building procedure to capture such spacing dependent cis-acting regulatory
modules. We apply our methods to the analysis of microarray time-course
experiments in yeast and in Arabidopsis. These experiments exhibit very
different dynamic temporal relationships. For both data sets, we have found all
of the well-known cis-acting regulatory elements in the related context, as
well as being able to predict novel elements.
Comment: Published in at http://dx.doi.org/10.1214/10.1214/07-AOAS142 the
Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute
of Mathematical Statistics (http://www.imstat.org)
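The notion of a spacing-dependent regulatory module in the abstract above can be made concrete with a small sketch that counts pairs of motif occurrences separated by at most a given gap. This is an illustrative feature only, not the authors' adaptive model building procedure; the function names and the `max_gap` parameter are hypothetical.

```python
def find_all(seq, motif):
    """Start positions of every occurrence of motif in seq, overlaps included."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def spaced_pair_count(seq, motif_a, motif_b, max_gap):
    """Count (a, b) pairs where b starts 0..max_gap bases after a ends."""
    count = 0
    starts_b = find_all(seq, motif_b)
    for start_a in find_all(seq, motif_a):
        end_a = start_a + len(motif_a)
        for start_b in starts_b:
            if 0 <= start_b - end_a <= max_gap:
                count += 1
    return count
```

A count of this kind, computed per promoter for a chosen gap, could serve as one predictor column in a regression of expression on sequence features.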
Quantitative evaluation and reversion analysis of the attractor landscapes of an intracellular regulatory network for colorectal cancer
The molecular profiles of CMS cancer cells, statistical significance analysis of reversion targets, and synergistic effect analysis of every two nodes inhibition. (XLSX 67 kb)