Harvard University - DASH

eScholarship - University of California

A classification-based framework for predicting and analyzing gene regulatory response

Author: AJ Hartemink
Anshul Kundaje
AP Gasch
Chris H Wiggins
Christina Leslie
CI Holmberg
D Pe'er
D Pollard
DC Raitt
E Ramil
E Segal
ER Gansner
HJ Bussemaker
I Ota
I Pedruzzi
J Ihmels
JD Hughes
JT Lin
M Middendorf
MA Beer
Manuel Middendorf
Mihir Shah
P Zarzov
RE Schapire
TI Lee
VK Vyas
W Hoeffding
Y Pilpel
Yoav Freund
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem: predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the AdaBoost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree.
METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a shallower and more interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP-chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data.
RESULTS: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast (the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors) and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code are available for download from
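The "fast matrix computation of the loss function for all features" and the "abstaining weak rules" mentioned in the abstract can be illustrated with a small sketch. This is not the Robust GeneClass implementation: it is plain confidence-rated boosting over binary rules that either fire (1) or abstain (0), without the alternating decision tree or the stabilization step. The function names, the feature matrix `F`, and the smoothing constant `eps` are assumptions made for illustration.

```python
import numpy as np

def boost_abstaining_rules(F, y, n_rounds=10, eps=1e-3):
    """Boost rules that fire (1) or abstain (0).

    F : (n_examples, n_rules) 0/1 matrix; in a GeneClass-like setting a column
        could mean "motif present AND parent in a given state" (assumption).
    y : (n_examples,) labels in {+1, -1}.
    Returns a list of (rule_index, alpha) pairs.
    """
    F = np.asarray(F, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.full(len(y), 1.0 / len(y))  # uniform example weights to start
    model = []
    for _ in range(n_rounds):
        # One matrix product per class gives the weighted counts for ALL rules.
        w_pos = F.T @ (w * (y > 0))   # weight of +1 examples where each rule fires
        w_neg = F.T @ (w * (y < 0))   # weight of -1 examples where each rule fires
        w_abstain = w.sum() - w_pos - w_neg
        # Boosting loss for a rule that may abstain (lower is better).
        z = w_abstain + 2.0 * np.sqrt(w_pos * w_neg)
        j = int(np.argmin(z))
        # Smoothed vote; the sign says whether the rule predicts up or down.
        alpha = 0.5 * np.log((w_pos[j] + eps) / (w_neg[j] + eps))
        model.append((j, alpha))
        # Reweight: examples the rule abstains on keep their weight.
        w = w * np.exp(-alpha * y * F[:, j])
        w = w / w.sum()
    return model

def predict_scores(model, F):
    """Sum the votes of all rules that fire; the sign is the predicted label."""
    F = np.asarray(F, dtype=float)
    return sum(alpha * F[:, j] for j, alpha in model)
```

Because a rule contributes nothing where it abstains, each boosting round touches only the examples it covers, which is one plausible reading of why abstention keeps the learned model shallow.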

Columbia University Academic Commons

Washington University St. Louis: Open Scholarship

A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data

Author: Ruan Shuxiang
Publication venue: Washington University Open Scholarship
Publication date: 15/12/2017

The interaction between transcription factors (TFs) and DNA plays an important role in gene expression regulation. In the past, experiments on protein–DNA interactions could only identify a handful of sequences that a TF binds with high affinity. In recent years, several high-throughput experimental techniques, such as high-throughput SELEX (HT-SELEX), protein-binding microarrays (PBMs) and ChIP-seq, have been developed to estimate the relative binding affinities of large numbers of DNA sequences both in vitro and in vivo. The large volume of data generated by these techniques proved to be a challenge and prompted the development of novel motif discovery algorithms. These algorithms are based on a range of TF binding models, including the widely used probabilistic model that represents binding motifs as position frequency matrices (PFMs). However, the probabilistic model has limitations and the PFMs extracted from some of the high-throughput experiments are known to be suboptimal. In this dissertation, we attempt to address these limitations and develop a generalized biophysical model and an expectation maximization (EM) algorithm for estimating position weight matrices (PWMs) and other parameters using HT-SELEX data. First, we discuss the inherent limitations of the popular probabilistic model and compare it with a biophysical model that assumes the nucleotides in a binding site contribute independently to its binding energy instead of binding probability. We use simulations to demonstrate that the biophysical model almost always provides better fits to the data and conclude that it should take the place of the probabilistic model in characterizing TF binding specificity.
Then we describe a generalized biophysical model, which removes the assumption of known binding locations and is particularly suitable for modeling protein–DNA interactions in HT-SELEX experiments, and BEESEM, an EM algorithm capable of estimating the binding model and binding locations simultaneously. BEESEM can also calculate the confidence intervals of the estimated parameters in the binding model, a rare but useful feature among motif discovery algorithms. By comparing BEESEM with 5 other algorithms on HT-SELEX, PBM and ChIP-seq data, we demonstrate that BEESEM provides significantly better fits to in vitro data and is similar to the other methods (with one exception) on in vivo data under the criterion of the area under the receiver operating characteristic curve (AUROC). We also discuss the limitations of the AUROC criterion, which is purely rank-based and thus misses quantitative binding information. Finally, we investigate whether adding DNA shape features can significantly improve the accuracy of binding models. We evaluate the ability of the gradient boosting classifiers generated by DNAshapedTFBS, an algorithm that takes account of DNA shape features, to differentiate ChIP-seq peaks from random background sequences, and compare them with various matrix-based binding models. The results indicate that, compared with optimized PWMs, adding DNA shape features does not produce significantly better binding models and may increase the risk of overfitting on training datasets
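To make two ideas from this abstract concrete, the sketch below shows (a) an additive-energy binding model, in which each position contributes independently to the binding energy and occupancy follows a logistic function of that energy, and (b) a rank-based AUROC. It is not BEESEM and involves no EM fitting. The energy matrix layout, the chemical-potential parameter `mu`, energies expressed in units of kT, and all function names are assumptions for illustration.

```python
import numpy as np

BASES = "ACGT"

def binding_energy(site, energy_matrix):
    """Additive energy: sum of per-position contributions (rows = positions,
    columns = A, C, G, T). Lower energy means tighter binding."""
    return float(sum(energy_matrix[i, BASES.index(b)] for i, b in enumerate(site)))

def occupancy(site, energy_matrix, mu=0.0):
    """Probability that the site is bound, assuming energies in units of kT
    and a chemical potential mu that reflects TF concentration (assumption)."""
    e = binding_energy(site, energy_matrix)
    return 1.0 / (1.0 + np.exp(e - mu))

def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Only the ordering of the scores matters."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

Applying any strictly increasing transformation to the scores leaves `auroc` unchanged, which illustrates why a purely rank-based criterion cannot distinguish between models that order sequences identically but predict different quantitative affinities.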

A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

Author: A Kundaje
A Smith
A Tanay
AJ Hartemink
AJ Kastaniotis
AM Erkine
Anshul Kundaje
AP Gasch
AV Grishin
BM Bolstad
C Dagsgaard
CH Yeang
Changgui Lan
Christina Leslie
CT Harbison
CV Lowry
D Pe'er
E Segal
FM Ausubel
FP Roth
Herbert M. Sauro
HF Bunn
HJ Bussemaker
J Ernst
J Ihmels
J Olesen
JC Schneider
JD Hughes
JJ ter Linde
JY Choi
K Pfeifer
KA Morano
KD MacIsaac
KE Kwast
KV Shianna
L Guarente
L Zhang
L-C Lai
Li Zhang
M Kaern
M Middendorf
MA Beer
MD Piper
Mei Zhou
MJ Vasconcelles
MK Yeung
MR Grably
N Abramova
N Rachidi
NE Abramova
O Sertil
PV Burke
R Schapire
RA Irizarry
RE Schapire
RS Zitomer
S Kuge
S Labbé
S Tavazoie
SL Tai
Steve Lianoglou
T Hon
T Hoppe
T Keng
T Prezant
TI Lee
TS Gardner
VV Svetlov
Xiantong Xin
Y Benjamini
Y Freund
Y Jiang
Y Pilpel
Y Tu
Z Bar-Joseph
Publication venue: Public Library of Science
Publication date: 01/11/2008

Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included

Public Library of Science (PLOS)

Directory of Open Access Journals

Computational analyses of eukaryotic promoters

Author: AD Smith
BA Lewis
BC Foat
C Bock
CD Schmid
CE Lawrence
CT Workman
D Das
DJ Huebert
E Segal
EM Conlon
G Cavalli
GC Yuan
GZ Hertz
HJ Bussemaker
J Friedman
M Tompa
MC Thomas
Michael Q Zhang
MJ Buck
MJ Martinez
MQ Zhang
N Maeda
ND Heintzman
NI Gershenzon
P Carninci
P Gross
P Hong
P Sumazin
PJ Sabo
R Das
RA Rollins
RV Davuluri
S Keles
S Sinha
S Sonnenburg
SR Schulze
T Hastie
TA Down
TH Kim
TL Bailey
U Ohler
V Matys
VB Bajic
VX Jin
WW Wasserman
X Zhao
Y Suzuki
Publication venue: BioMed Central
Publication date: 01/01/2007

Computational analysis of eukaryotic promoters is one of the most difficult problems in computational genomics and is essential for understanding gene expression profiles and reverse-engineering gene regulation network circuits. Here I give a basic introduction to the problem and a recent update on both experimental and computational approaches. More details may be found in the extended references. This review is based on a summer lecture given at the Max Planck Institute in Berlin in 2005

Cold Spring Harbor Laboratory Institutional Repository

Directory of Open Access Journals

An ensemble learning approach to reverse-engineering transcriptional regulatory networks from time-series gene expression data

Author: Deng Youping
Perkins Edward J
Ruan Jianhua
Zhang Weixiong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: One of the most challenging tasks in the post-genomic era is to reconstruct transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its expression, and how a set of transcription factors coordinate to accomplish temporally and spatially specific regulation.
Results: Here we propose a supervised machine learning approach to address these questions. We focus our study on the transcriptional regulation of the cell cycle in budding yeast, thanks to the large amount of data available and the relatively well-understood biology, although the main ideas of our method can be applied to other data as well. Our method starts by building an ensemble of decision trees for each microarray dataset to capture the association between the expression levels of yeast genes and the binding of transcription factors to gene promoter regions, as determined by chromatin immunoprecipitation microarray (ChIP-chip) experiments. Cross-validation experiments show that the method is more accurate and reliable than the naive decision tree algorithm and several other ensemble learning methods. From the decision tree ensembles, we extract logical rules that explain how a set of transcription factors act in concert to regulate the expression of their targets. We further compute a profile for each rule to show its regulation strengths at different time points. We also propose a spline interpolation method to integrate the rule profiles learned from several time series expression data sets that measure the same biological process. We then combine these rule profiles to build a transcriptional regulatory network for the yeast cell cycle. Compared to the results in the literature, our method correctly identifies all major known yeast cell cycle transcription factors, and assigns them to appropriate cell cycle phases. Our method also identifies many interesting synergistic relationships among these transcription factors, most of which are well known, while many of the rest can also be supported by other evidence.
Conclusion: The high accuracy of our method indicates that it is valid and robust. As more gene expression and transcription factor binding data become available, we believe that our method will be useful for reconstructing large-scale transcriptional regulatory networks in other species as well

Aquila Digital Community

South East Academic Libraries System (SEALS)

Digital Commons@Becker

Transcription factor binding specificity and occupancy : elucidation, modelling and evaluation

Author: Kibet Caleb Kipkurui
Publication venue: Faculty of Science, Computer Science
Publication date: 01/01/2017

The major contributions of this thesis are addressing the need for an objective quality evaluation of a transcription factor binding model, demonstrating the value of the tools developed to this end and elucidating how in vitro and in vivo information can be utilized to improve TF binding specificity models. Accurate elucidation of TF binding specificity remains an ongoing challenge in gene regulatory research. Several in vitro and in vivo experimental techniques have been developed, followed by a proliferation of algorithms and, ultimately, of binding models. This increase led to a choice problem for the end users: which tools to use, and which is the most accurate model for a given TF? Therefore, the first section of this thesis investigates the motif assessment problem: how scoring functions, choice and processing of benchmark data, and statistics used in evaluation affect motif ranking. This analysis revealed that TF motif quality assessment requires a systematic comparative analysis, and that the scoring functions used have a TF-specific effect on motif ranking. These results informed the design of the Motif Assessment and Ranking Suite (MARS), supported by PBM and ChIP-seq benchmark data and an extensive collection of PWM motifs. MARS implements consistency, enrichment, and scoring and classification-based motif evaluation algorithms. Transcription factor binding is also influenced and determined by contextual factors: chromatin accessibility, competition or cooperation with other TFs, cell line or condition specificity, binding locality (e.g. proximity to transcription start sites) and the shape of the binding site (DNA-shape). In vitro techniques do not capture such context; therefore, this thesis also combines PBM and DNase-seq data using a comparative k-mer enrichment approach that compares open chromatin with genome-wide prevalence, achieving a modest performance improvement when benchmarked on ChIP-seq data. 
Finally, since statistical and probabilistic methods cannot capture all the information that determines binding, a machine learning approach (XGBoost) was implemented to investigate how the features contribute to TF specificity and occupancy. This combinatorial approach improves the predictive ability of TF specificity models, with the most predictive feature being chromatin accessibility, while DNA-shape and conservation information both significantly improve on the baseline model of k-mer and DNase data. The results and the tools introduced in this thesis are useful for systematic comparative analysis (via MARS) and a combinatorial approach to modelling TF binding specificity, including appropriate feature engineering practices for machine learning modelling
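The "comparative k-mer enrichment approach that compares open chromatin with genome-wide prevalence" can be pictured with a minimal sketch. The thesis abstract does not give the exact scoring, so the log2 frequency ratio, the pseudocount, and the function names below are assumptions chosen for illustration, not the thesis's method.

```python
from collections import Counter
import math

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def kmer_enrichment(open_seqs, genome_seqs, k=3, pseudocount=1.0):
    """log2 ratio of each k-mer's frequency in open chromatin versus the
    genome-wide background; positive values mean enriched in open chromatin."""
    fg = kmer_counts(open_seqs, k)
    bg = kmer_counts(genome_seqs, k)
    kmers = set(fg) | set(bg)
    # The pseudocount keeps k-mers seen in only one set from dividing by zero.
    fg_total = sum(fg.values()) + pseudocount * len(kmers)
    bg_total = sum(bg.values()) + pseudocount * len(kmers)
    return {
        km: math.log2(((fg[km] + pseudocount) / fg_total)
                      / ((bg[km] + pseudocount) / bg_total))
        for km in kmers
    }
```

For example, `kmer_enrichment(["ACGACG"], ["ACGTTTTT", "TTTTTTTT"], k=3)` assigns a positive score to `ACG` and a negative score to `TTT`, since `ACG` is relatively more frequent in the toy "open" set.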

Rhodes Repository (SEALS)

BoCaTFBS: a boosted cascade learner to refine the binding sites suggested by ChIP-chip experiments

Author: Gerstein Mark
Snyder Michael
Wang Lu-yong
Publication venue: BioMed Central
Publication date: 01/01/2006

Comprehensive mapping of transcription factor binding sites is essential in postgenomic biology. For this, we propose a mining approach combining noisy data from ChIP (chromatin immunoprecipitation)-chip experiments with known binding site patterns. Our method (BoCaTFBS) uses boosted cascades of classifiers for optimum efficiency, in which components are alternating decision trees; it exploits interpositional correlations; and it explicitly integrates massive negative information from ChIP-chip experiments. We applied BoCaTFBS within the ENCODE project and showed that it outperforms many traditional binding site identification methods (for instance, profiles)