Motif Discovery through Predictive Modeling of Gene Regulation
We present MEDUSA, an integrative method for learning motif models of
transcription factor binding sites by incorporating promoter sequence and gene
expression data. We use a modern large-margin machine learning approach, based
on boosting, to enable feature selection from the high-dimensional search space
of candidate binding sequences while avoiding overfitting. At each iteration of
the algorithm, MEDUSA builds a motif model whose presence in the promoter
region of a gene, coupled with activity of a regulator in an experiment, is
predictive of differential expression. In this way, we learn motifs that are
functional and predictive of regulatory response rather than motifs that are
simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model
of the transcriptional control logic that can predict the expression of any
gene in the organism, given the sequence of the promoter region of the target
gene and the expression state of a set of known or putative transcription
factors and signaling molecules. Each motif model is either a k-length
sequence, a dimer, or a PSSM that is built by agglomerative probabilistic
clustering of sequences with similar boosting loss. By applying MEDUSA to a set
of environmental stress response expression data in yeast, we learn motifs
whose ability to predict differential expression of target genes outperforms
motifs from the TRANSFAC dataset and from a previously published candidate set
of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed
binding sites associated with environmental stress response from the
literature.
Comment: RECOMB 200
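As an illustration of what a PSSM motif model is, the sketch below builds a log-odds position-specific scoring matrix from a handful of aligned k-length sequences and scans a promoter for the best-scoring window. It is a minimal sketch and not MEDUSA itself: the boosting procedure and the agglomerative clustering that selects the sequences are not reproduced, and the function names, the pseudocount and the uniform background are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"

def build_pssm(kmers, pseudocount=0.5, background=0.25):
    """Build a log-odds PSSM from equal-length sequences (hypothetical helper)."""
    k = len(kmers[0])
    total = len(kmers) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(k):
        column = [s[pos] for s in kmers]
        # log2 ratio of smoothed base frequency to an assumed uniform background
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in ALPHABET
        })
    return pssm

def best_hit(pssm, promoter):
    """Return (score, position) of the highest-scoring window in the promoter."""
    k = len(pssm)
    best = (float("-inf"), -1)
    for i in range(len(promoter) - k + 1):
        score = sum(pssm[j][promoter[i + j]] for j in range(k))
        best = max(best, (score, i))
    return best

# Toy input: four similar 6-mers, as might come out of a clustering step.
pssm = build_pssm(["CACGTG", "CACGTT", "CACGTG", "CAGGTG"])
score, position = best_hit(pssm, "TTTTCACGTGAAAA")
```

In a predictive setting, the presence of a hit above some threshold would be one feature of a gene's promoter; the threshold is left open here because the abstract does not specify one.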
Mixture model with multiple allocations for clustering spatially correlated observations in the analysis of ChIP-Seq data
Model-based clustering is a technique widely used to group a collection of
units into mutually exclusive groups. There are, however, situations in which
an observation could in principle belong to more than one cluster. In the
context of Next-Generation Sequencing (NGS) experiments, for example, the
signal observed in the data might be produced by two (or more) different
biological processes operating together and a gene could participate in both
(or all) of them. We propose a novel approach to cluster NGS discrete data,
coming from a ChIP-Seq experiment, with a mixture model, allowing each unit to
belong potentially to more than one group: these multiple allocation clusters
can be flexibly defined via a function combining the features of the original
groups without introducing new parameters. The formulation naturally gives rise
to a `zero-inflation group' in which values close to zero can be allocated,
acting as a correction for the abundance of zeros that manifest in this type of
data. We take into account the spatial dependency between observations, which
is described through a latent Conditional Auto-Regressive process that can
reflect different dependency patterns. We assess the performance of our model
within a simulation environment and then apply it to real ChIP-Seq data.
Comment: 25 pages; 3 tables, 6 figure
Genome-wide discovery of modulators of transcriptional interactions in human B lymphocytes
Transcriptional interactions in a cell are modulated by a variety of
mechanisms that prevent their representation as pure pairwise interactions
between a transcription factor and its target(s). These include, among others,
transcription factor activation by phosphorylation and acetylation, formation
of active complexes with one or more co-factors, and mRNA/protein degradation
and stabilization processes.
This paper presents a first step towards the systematic, genome-wide
computational inference of genes that modulate the interactions of specific
transcription factors at the post-transcriptional level. The method uses a
statistical test based on changes in the mutual information between a
transcription factor and each of its candidate targets, conditional on the
expression of a third gene. The approach was first validated on a synthetic
network model, and then tested in the context of a mammalian cellular system.
By analyzing 254 microarray expression profiles of normal and tumor-related
human B lymphocytes, we investigated the post-transcriptional modulators of the
MYC proto-oncogene, an important transcription factor involved in
tumorigenesis. Our method discovered a set of 100 putative modulator genes,
responsible for modulating 205 regulatory relationships between MYC and its
targets. The set is significantly enriched in molecules with function
consistent with their activities as modulators of cellular interactions,
recapitulates established MYC regulation pathways, and provides a notable
repertoire of novel regulators of MYC function. The approach has broad
applicability and can be used to discover modulators of any other transcription
factor, provided that adequate expression profile data are available.
Comment: 15 pages, 3 figures, 2 tables; minor changes following referees' comments; accepted to RECOMB0
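The test described in this abstract compares the mutual information between a transcription factor and a target under different expression levels of a third gene. The sketch below is one way to illustrate that idea, not the authors' implementation: the equal-frequency binning, the plug-in estimator, the split into the lowest and highest third of modulator expression, and the permutation p-value are all assumptions made for this example.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys, bins=3):
    """Plug-in mutual information (nats) after equal-frequency binning."""
    def discretize(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        labels = [0] * len(values)
        for rank, idx in enumerate(order):
            labels[idx] = rank * bins // len(values)
        return labels
    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def delta_mi(tf, target, modulator, frac=1 / 3):
    """MI(tf; target) in high-modulator samples minus MI in low-modulator samples."""
    n = len(modulator)
    order = sorted(range(n), key=lambda i: modulator[i])
    k = max(2, round(n * frac))
    low, high = order[:k], order[-k:]
    mi_low = mutual_information([tf[i] for i in low], [target[i] for i in low])
    mi_high = mutual_information([tf[i] for i in high], [target[i] for i in high])
    return mi_high - mi_low

def permutation_pvalue(tf, target, modulator, n_perm=200, seed=0):
    """Compare the observed |delta MI| with values from shuffled modulator labels."""
    rng = random.Random(seed)
    observed = abs(delta_mi(tf, target, modulator))
    shuffled = list(modulator)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(delta_mi(tf, target, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A large positive difference suggests that the TF-target dependence is present mainly when the candidate modulator is highly expressed. With real expression data, the number of bins and the split fraction would need to be chosen with the sample size in mind.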
Discovering transcriptional modules by Bayesian data integration
Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.
Application of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data
We present two complementary approaches for the interpretation of clusters of
co-regulated genes, such as those obtained from DNA chips and related methods.
Starting from a cluster of genes with similar expression profiles, two basic
questions can be asked:
1. Which mechanism is responsible for the coordinated transcriptional response
of the genes? This question is approached by extracting motifs that are shared
between the upstream sequences of these genes. The motifs extracted are putative
cis-acting regulatory elements.
2. What is the physiological significance for the cell of expressing these genes
together? One way to answer the question is to search for potential metabolic
pathways that could be catalyzed by the products of the genes. This can be
done by selecting the genes from the cluster that code for enzymes, and trying
to assemble the catalyzed reactions to form metabolic pathways.
We present tools to answer these two questions, and we illustrate their use with
selected examples in the yeast Saccharomyces cerevisiae. The tools are available
on the web (http://ucmb.ulb.ac.be/bioinformatics/rsa-tools/;
http://www.ebi.ac.uk/research/pfbp/; http://www.soi.city.ac.uk/~msch/).
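The first question above, finding motifs shared between the upstream sequences of co-regulated genes, can be illustrated with a simple k-mer enrichment count. This is a minimal sketch under stated assumptions and does not reproduce the statistics used by the tools listed in the abstract: the function names, the smoothed frequency ratio used as a score, and the toy sequences are all hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each k-mer is counted at most once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(cluster_seqs, background_seqs, k=6, pseudocount=1.0):
    """Rank k-mers by smoothed occurrence rate in the cluster versus the background."""
    fg = count_kmers(cluster_seqs, k)
    bg = count_kmers(background_seqs, k)
    n_fg, n_bg = len(cluster_seqs), len(background_seqs)
    scores = {}
    for kmer, c in fg.items():
        fg_rate = (c + pseudocount) / (n_fg + 2 * pseudocount)
        bg_rate = (bg.get(kmer, 0) + pseudocount) / (n_bg + 2 * pseudocount)
        scores[kmer] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy upstream sequences (invented for this example) sharing one hexamer.
cluster = ["AAAACACGTGTTTT", "GGGGCACGTGCCCC", "TTTTCACGTGAAAA"]
background = ["AAAACCCCGGGGTT", "TTTTGGGGCCCCAA", "ACACACACACACAC", "GTGTGTGTGTGTGT"]
ranking = enriched_kmers(cluster, background)
```

The top-ranked k-mers are only candidate cis-acting elements; a real analysis would use a proper significance test and a genome-wide background model.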
Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
The test data set, the web tool, source code and supplementary data are available at http://www.probtf.org
Infinite factorization of multiple non-parametric views
Combined analysis of multiple data sources has increasing application interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views.
Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.