Computational modeling of gene expression from regulatory sequences

Author: Samee Md. Abul Hassan

Regulation of gene expression is an important early step in controlling every biological process that underlies the function of living organisms. Even though gene expression may be regulated at several stages, the modulation occurs mostly at the primary stage known as “transcription”. Teasing out the details of transcriptional regulation is therefore a core focus of biological research. Transcriptional regulation of gene expression is dictated by regulatory DNA sequences, often called cis-regulatory modules (CRMs; also known as “enhancers”), that contain specific binding sites for regulatory proteins (transcription factors, TFs). The assembly of TFs bound on a CRM drives the desired expression level of the gene associated with the CRM. As the abundance of TFs varies across different cell types, the expression level of the gene, also termed the “readout” of the CRM, varies accordingly and results in the aforementioned control over biological processes. The rules for predicting gene expression level given information about CRMs and TFs, collectively known as the “cis-regulatory logic”, are however unclear. Decades of experimental studies have hypothesized mechanisms about parts of this regulatory process (e.g., about the influence of TF-TF interactions), but a comprehensive study of cis-regulatory logic is feasible only through computational models. The subject of this thesis is to develop mechanistic models of gene expression from regulatory sequences and to use the models to understand details of the system that are difficult to assess experimentally.
The first part of this thesis develops a model that integrates the regulatory effect of signaling pathways with that of sequence-bound TFs to understand the expression pattern of a gene from its CRM. Given the various types of molecular interactions that the model needs to capture, it is both complex in structure and rich in the number of parameters. Similarly complex models commonly used in other disciplines, from signaling networks to climatology, have been shown to fit many distinct parameterizations that are equally consistent with data but might represent disparate mechanistic hypotheses. Whether this is also the case for models of cis-regulation has never been investigated, with the standard practice in this realm being to report a single or a few best-fit models. We demonstrate here, taking the Drosophila ind gene as an example, that gene expression modeling from cis-regulatory sequences may suffer from incomplete and even incorrect conclusions if one adheres to this current practice. We construct an ensemble of models by systematically exploring the entire parameter space and leveraging both wild-type data and various perturbation experiments, and make statistical inferences from the ensemble about the detailed regulatory mechanisms of ind. Years of genetic experiments have put forth an assortment of hypotheses about ind regulation. We use our modeling approach to show how a mechanism involving MAPK-induced attenuation in the DNA-binding affinity of Capicua and the use of low-affinity Dorsal binding sites may provide a coherent explanation of ind regulation. Also, we quantitatively predict and experimentally validate the role of the “pioneer factor” Zelda in activating ind. Finally, we discuss disparate hypotheses that are supported by our ensemble of models and will need future experimentation for a complete understanding of ind regulation.
The second part of this thesis addresses a fundamental goal of computational biology, namely that of modeling a gene’s expression from its intergenic locus and trans-regulatory context. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers. Here we report the first quantitative model of a gene’s expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: 1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model and 2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about enhancers contributing toward a gene’s expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence segments in the locus that would contribute ectopic expression patterns and hence were “shut down” by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found a deterioration of expression when binding sites in one enhancer were allowed to influence the readout of another enhancer. Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model.
The third part of this thesis applies the aforementioned models to two novel datasets. The first dataset was created by fusing two well-studied CRMs of the even-skipped (eve) gene in Drosophila. The fused constructs differ in the orientation, order, and intervening spacing of the CRMs. Interestingly, the two constituent CRMs regulate eve expression by using the same TFs, although the binding affinities (i.e., strengths) of the repressor sites in the two CRMs are different, an observation that has been suggested to help the CRMs drive expression in two distinct domains (each domain consists of two stripes of eve) when they act in their endogenous context. However, the fact that these two CRMs harbor sites for the same TFs makes it difficult to predict the readouts of the constructs in our dataset. In particular, readouts of these constructs show some subtle aspects that essentially challenge the conventional models of information integration from sequences and suggest that a different mechanism may be necessary to explain these observations. Our modeling of this novel dataset suggests that the conventional assumption that relatively short DNA sequences, e.g., CRMs, do not comprise smaller “independent” regulatory sequences may not be true, since the lengths of the fused constructs are comparable to typical CRMs and their readouts can be modeled by assuming the existence of smaller independent regulatory segments. The second dataset modeled in this part of the thesis features five genes that control the growth and patterning of the wing in Drosophila. Notably, ours is the first attempt to link regulatory sequences and the related molecular details to the growth and scaling of an organ. In the course of fitting this dataset, we identify the important regulatory role of a TF called Scalloped (Sd) and speculate on Sd’s role in assuring that the expression domains of the studied genes scale with wing growth. We also use our models to identify novel regulatory sequences of these genes and to answer several questions that were left open in the experimental studies that first attempted to understand the cis-regulatory logic for these genes.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Incorporating Chromatin Accessibility Data into Sequence-to-Expression Modeling

Author: Hassan Samee Md. Abul
Peng Pei-Chen
Sinha Saurabh
Publication venue: Biophysical Society. Published by Elsevier Inc.
Publication date: 10/03/2015

Prediction of gene expression levels from regulatory sequences is one of the major challenges of genomic biology today. A particularly promising approach to this problem is that taken by thermodynamics-based models that interpret an enhancer sequence in a given cellular context specified by transcription factor concentration levels and predict precise expression levels driven by that enhancer. Such models have so far not accounted for the effect of chromatin accessibility on interactions between transcription factor and DNA and consequently on gene-expression levels. Here, we extend a thermodynamics-based model of gene expression, called GEMSTAT (Gene Expression Modeling Based on Statistical Thermodynamics), to incorporate chromatin accessibility data and quantify its effect on the accuracy of expression prediction. In the new model, called GEMSTAT-A, accessibility at a binding site is assumed to affect the transcription factor’s binding strength at the site, whereas all other aspects are identical to the GEMSTAT model. We show that this modification results in significantly better fits in a data set of over 30 enhancers regulating spatial expression patterns in the blastoderm-stage Drosophila embryo. It is important to note that the improved fits result not from an overall elevated accessibility in active enhancers but from the variation of accessibility levels within an enhancer. With whole-genome DNA accessibility measurements becoming increasingly popular, our work demonstrates how such data may be useful for sequence-to-expression models. It also calls for future advances in modeling accessibility levels from sequence and the trans-regulatory context, so as to predict accurately the effect of cis and trans perturbations on gene expression.
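As a rough illustration of accessibility modulating binding strength, the sketch below scales a site's statistical weight by an accessibility score before computing single-site occupancy. The function names, the linear scaling, and the numbers are assumptions made for illustration only; the abstract does not state the functional form that GEMSTAT-A uses.

```python
def site_weight(tf_conc, rel_affinity, accessibility=1.0):
    """Statistical weight of the TF-bound state of one site.

    Assumption (not from the abstract): an accessibility score in [0, 1]
    multiplies the weight linearly, so a closed site binds more weakly.
    """
    if not 0.0 <= accessibility <= 1.0:
        raise ValueError("accessibility must lie in [0, 1]")
    return tf_conc * rel_affinity * accessibility


def occupancy(weight):
    """Probability that an isolated site is bound: q / (1 + q)."""
    return weight / (1.0 + weight)


# Same TF concentration and affinity, two hypothetical accessibility levels.
open_site = occupancy(site_weight(2.0, 0.5, accessibility=1.0))
closed_site = occupancy(site_weight(2.0, 0.5, accessibility=0.2))
print(f"open: {open_site:.3f}, less accessible: {closed_site:.3f}")
```

Because the scaling acts per site, two sites within one enhancer can receive different effective strengths, which is the kind of within-enhancer variation the abstract highlights.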

Elsevier - Publisher Connector

Thermodynamics-Based Models of Transcriptional Regulation by Enhancers: The Roles of Synergistic Activation, Cooperative Binding and Short-Range Repression

Author: A Krumm
A La Rosee-Borggreve
A Tanay
AM Moses
AP Lifanov
AR Borneman
AV Morozov
CC Fowlkes
Charles Blatti
CM Bergman
D Lebrecht
D Zenklusen
DC Bauer
DN Arnosti
DS Burz
DS Homsi
E Birney
E Segal
EH Davidson
ET Dermitzakis
F Gao
F Sauer
G Jimenez
GD Stormo
H Janssens
H Nakanishi
H Zhu
J Gertz
J Reinitz
JK Joung
K Struhl
LP Andrioli
M Carey
M Hoch
M Ptashne
MA Beer
MA Shea
MB Noyes
Md. Abul Hassan Samee
MM Kulkarni
MR Green
MZ Ludwig
NE Buchler
OG Berg
P Ray
PV Benos
R Hermsen
R Yan
RA Veitia
RP Zinzen
RW Lusk
S Gray
S Small
SA Keller
Saurabh Sinha
SJ Maerkl
T Chi
T Wasson
Uwe Ohler
VB Teif
VJ Makeev
WD Fakhouri
X Ma
Xin He
Y Nibu
Z Hu
Publication venue: Public Library of Science
Publication date: 01/09/2010

Quantitative models of cis-regulatory activity have the potential to improve our mechanistic understanding of transcriptional regulation. However, the few models available today have been based on simplistic assumptions about the sequences being modeled, or heuristic approximations of the underlying regulatory mechanisms. We have developed a thermodynamics-based model to predict gene expression driven by any DNA sequence, as a function of transcription factor concentrations and their DNA-binding specificities. It uses statistical thermodynamics theory to model not only protein-DNA interaction, but also the effect of DNA-bound activators and repressors on gene expression. In addition, the model incorporates mechanistic features such as the synergistic effect of multiple activators, short-range repression, and cooperativity in transcription factor-DNA binding, allowing us to systematically evaluate the significance of these features in the context of available expression data. Using this model on segmentation-related enhancers in Drosophila, we find that transcriptional synergy due to simultaneous action of multiple activators helps explain the data beyond what can be explained by cooperative DNA-binding alone. We find clear support for the phenomenon of short-range repression, where repressors do not directly interact with the basal transcriptional machinery. We also find that the binding sites contributing to an enhancer's function may not be conserved during evolution, and a noticeable fraction of these undergo lineage-specific changes. Our implementation of the model, called GEMSTAT, is the first publicly available program for simultaneously modeling the regulatory activities of a given set of sequences.
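The statistical-thermodynamic idea can be sketched in a few lines: enumerate every bound/unbound configuration of the binding sites, weight each configuration, and report the probability that the polymerase is bound. This is a deliberately simplified toy, not GEMSTAT itself. The function name, the multiplicative treatment of activator and repressor effects, and all numbers are assumptions, and cooperative binding and short-range repression are left out.

```python
from itertools import product


def expression(sites, basal=0.01):
    """Toy thermodynamic readout of an enhancer.

    sites: list of (binding_weight, effect) pairs, where
      binding_weight is the statistical weight of the TF-bound state, and
      effect scales polymerase recruitment when the site is bound
      (> 1 for an activator, < 1 for a repressor; assumed multiplicative).
    Returns the probability that the polymerase is bound.
    """
    z_off = 0.0  # total weight of configurations without polymerase
    z_on = 0.0   # total weight of configurations with polymerase
    for bound in product((0, 1), repeat=len(sites)):
        config_weight = 1.0
        recruitment = basal
        for is_bound, (weight, effect) in zip(bound, sites):
            if is_bound:
                config_weight *= weight
                recruitment *= effect
        z_off += config_weight
        z_on += config_weight * recruitment
    return z_on / (z_on + z_off)


no_sites = expression([])
one_activator = expression([(1.0, 10.0)])
one_repressor = expression([(1.0, 0.1)])
print(no_sites, one_activator, one_repressor)
```

Exhaustive enumeration grows as 2^n in the number of sites, so it is only practical for small examples like this one.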

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs.

Author: Samee Md Abul Hassan
Publication date: 06/02/2019

Ezid

Quantitative modeling of a gene's expression from its intergenic sequence.

Author: Md Abul Hassan Samee
Saurabh Sinha
Publication venue: Public Library of Science (PLoS)
Publication date: 01/03/2014

Modeling a gene's expression from its intergenic locus and trans-regulatory context is a fundamental goal in computational biology. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers. Here we report the first quantitative model of a gene's expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: 1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model and 2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about enhancers contributing toward a gene's expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence-segments in the locus that would contribute ectopic expression patterns and hence were "shut down" by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found a deterioration of expression when binding sites in one enhancer were allowed to influence the readout of another enhancer. 
Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model.
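The second tier described above, a linear combination of independent enhancer readouts, can be illustrated with an ordinary least-squares fit. In this sketch the per-enhancer readouts are made-up numbers standing in for tier-one (thermodynamic) predictions, and the helper names are hypothetical; the paper's actual optimization procedure is not reproduced here.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Collinear readouts (singular systems) are not handled in this sketch."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_weights(readouts, target):
    """Least-squares weights for target ~ sum_k w_k * readouts[k] (normal equations)."""
    k = len(readouts)
    gram = [[sum(x * y for x, y in zip(readouts[i], readouts[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(x * y for x, y in zip(readouts[i], target)) for i in range(k)]
    return solve(gram, rhs)


def combine(readouts, weights):
    """Weighted sum of enhancer readouts at each position along the axis."""
    return [sum(w * r[p] for w, r in zip(weights, readouts))
            for p in range(len(readouts[0]))]


# Hypothetical readouts of two enhancers at five positions along the A/P axis.
stripe_a = [0.0, 0.9, 0.1, 0.0, 0.0]
stripe_b = [0.0, 0.0, 0.1, 0.8, 0.0]
target = [0.0, 0.45, 0.15, 0.8, 0.0]

weights = fit_weights([stripe_a, stripe_b], target)
fitted = combine([stripe_a, stripe_b], weights)
print(weights)
```

Each enhancer contributes independently to the sum, which mirrors the independence assumption the abstract examines.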

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Discovering Spatially Coherent Gene Modules from Spatial Transcriptomics Data

Author: Larina Maria
Samee Md. Abul Hassan
Singh Salvi
Publication date: 29/09/2022

Spatial transcriptomics (ST) is an emerging technology that quantifies gene expression at spatial resolution from intact tissue sections. Although ST is enabling unprecedented studies on spatial gene expression, it has posed new challenges to biological data science. A typical ST dataset contains information on ~20K genes from 50K-100K cells. It is challenging to design efficient and scalable algorithms that generate new biological insights from these datasets. Here we feature an efficient and scalable non-negative matrix factorization (NMF) algorithm for identifying “spatial gene modules” (spatial-gems), i.e., groups of genes that are expressed at spatially adjacent locations, in ST data. Spatial-gems are fundamental aspects of multi-cellular organisms. NMF is suitable for this problem since, in theory, NMF can identify the “informative parts” constituting a dataset, e.g., lips and eyes in human facial images and spatial-gems in ST data. The basic NMF formulation, however, can give sub-optimal results for spatial datasets: it ignores the spatial locations of data points and thus does not guarantee informative parts that are spatially coherent. Graph-regularized NMF (GNMF) overcomes this issue by constraining the informative parts to comprise spatially adjacent data points. We introduce three changes to tailor the state-of-the-art GNMF algorithm for ST data. First, we statistically determine the optimal number of spatial-gems in an ST dataset. Second, we introduce regularizations that minimize the number of genes common between spatial-gems. Finally, we leverage numerical libraries and efficient data structures to obtain a scalable implementation. We benchmarked our GNMF against alternative algorithms on a brain ST dataset. Our algorithm comprehensively charted the spatial-gems in this dataset with a 20x speedup in execution time, making this an attractive tool for large-scale ST consortia like HuBMAP (Human BioMolecular Atlas Program). This tool and our multifaceted approach to enhance efficiency and scalability will be of major interest to the broad user base of TACC.
Texas Advanced Computing Center (TACC)
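For orientation, plain graph-regularized NMF can be sketched with the standard multiplicative updates for the objective ||X - U V^T||^2 + lam * Tr(V^T L V), where L = D - W is the Laplacian of a spot-adjacency graph. This is textbook GNMF only, written in pure Python on a toy matrix. It does not include the three ST-specific changes described in the abstract, and the data, the parameter values, and the function names are assumptions.

```python
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def objective(x, u, v, w, lam):
    """||X - U V^T||_F^2 + lam * Tr(V^T L V) for a symmetric adjacency w."""
    recon = matmul(u, transpose(v))
    fit = sum((xi - ri) ** 2 for xr, rr in zip(x, recon) for xi, ri in zip(xr, rr))
    n = len(v)
    # Tr(V^T L V) equals half the adjacency-weighted sum of squared row differences.
    smooth = 0.5 * sum(w[i][j] * sum((a - b) ** 2 for a, b in zip(v[i], v[j]))
                       for i in range(n) for j in range(n))
    return fit + lam * smooth


def gnmf(x, w, k, lam=1.0, iters=100, seed=0, eps=1e-9):
    """x: genes x spots (non-negative); w: spots x spots adjacency; k: number of modules."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    v = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    deg = [sum(row) for row in w]
    xt = transpose(x)
    for _ in range(iters):
        # U <- U * (X V) / (U V^T V)
        num = matmul(x, v)
        den = matmul(u, matmul(transpose(v), v))
        u = [[u[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
        # V <- V * (X^T U + lam W V) / (V U^T U + lam D V)
        num = matmul(xt, u)
        wv = matmul(w, v)
        den = matmul(v, matmul(transpose(u), u))
        v = [[v[i][j] * (num[i][j] + lam * wv[i][j])
              / (den[i][j] + lam * deg[i] * v[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return u, v


# Toy data: 4 genes x 4 spots arranged on a line, with two block-like modules.
x = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
u0, v0 = gnmf(x, w, k=2, lam=0.1, iters=0)
u, v = gnmf(x, w, k=2, lam=0.1, iters=100)
print(objective(x, u0, v0, w, 0.1), objective(x, u, v, w, 0.1))
```

A practical implementation would use sparse matrices and a numerical library rather than nested lists; the pure-Python form is used here only to keep the sketch dependency-free.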

Texas ScholarWorks

Results of fitting the GEMSTAT-GL model on the intergenic locus of 23 additional genes studied in [13].

Author: Md. Abul Hassan Samee (242314)
Saurabh Sinha (4760)

Quantitative data on target expression patterns were obtained from the companion website of the same study, and were originally derived from in situ expression images at the FlyExpress [81] database. For each gene, the red and the green plots represent the target (real) and the modeled expression patterns, respectively.

FigShare

Overview of the newly proposed GEMSTAT-GL model (details are given in Materials and Methods).

Author: Md. Abul Hassan Samee (242314)
Saurabh Sinha (4760)

The model has two types of parameters, namely thermodynamic parameters (to compute the expression readout of any sequence window) and window-weight parameters (to compute a weighted summation of the expression readouts of a set of selected windows). The parameters are optimized iteratively to fit the expression pattern of a given gene from its locus. In the example shown, GEMSTAT-GL is applied to fit the three-striped expression pattern (shown by the green, the red, and the blue stripes) of a gene g from its locus. Each iteration in model training consists of two phases. In the first phase, through a sliding window mechanism, the model selects a set C(s) of candidate windows for each stripe s. To this end, each window's readout (computed by GEMSTAT, denoted here by function G) is compared against each individual stripe, as exemplified through the operations on the green window. (Computation of the initial estimates for thermodynamic parameters is explained in main text.) In the second phase, a solution is constructed by iteratively checking if including a new window from the candidate sets (computed in Phase 1) improves model performance. In the shown example, the green window first gets included in the solution since it fits the green stripe satisfactorily. Next, the first window from the red stripe's candidate set is added to the solution and weights for the two windows are optimized so that a weighted summation of their readouts (denoted by function GL) fits the expression pattern consisting of the green and the red stripes. The model shows improved performance and hence, the red window is retained in the solution. However, when the second window from the red stripe's candidate set is added to the solution, it deteriorates model performance. The window is therefore discarded. Similarly, the blue window from the blue stripe's candidate set is checked and found to improve model performance – resulting in its inclusion to the solution. 
After completing the second phase, the model re-estimates the thermodynamic parameters and loops back to Phase 1.
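The second phase described in this caption resembles greedy forward selection, which can be sketched as follows. This is a simplified stand-in rather than the GEMSTAT-GL implementation: candidate windows are given as a flat list of precomputed readouts, weights are refit by a basic non-negative coordinate descent, and squared error serves as the performance measure. All names, numbers, and the `min_gain` threshold are assumptions.

```python
def fit_nonneg_weights(windows, target, sweeps=200):
    """Non-negative least squares by coordinate descent (assumed stand-in optimizer)."""
    w = [0.0] * len(windows)
    for _ in range(sweeps):
        for j, rj in enumerate(windows):
            # Residual of the target with window j's own contribution left out.
            resid = [t - sum(w[k] * windows[k][p]
                             for k in range(len(windows)) if k != j)
                     for p, t in enumerate(target)]
            norm = sum(v * v for v in rj)
            w[j] = max(0.0, sum(a * b for a, b in zip(rj, resid)) / norm) if norm else 0.0
    return w


def sse(windows, weights, target):
    """Squared error between the weighted sum of window readouts and the target."""
    return sum((sum(w * r[p] for w, r in zip(weights, windows)) - t) ** 2
               for p, t in enumerate(target))


def greedy_select(candidates, target, min_gain=1e-6):
    """Keep a candidate window only if refitting with it lowers the error."""
    chosen, weights = [], []
    best = sum(t * t for t in target)  # error of the empty solution
    for name, readout in candidates:
        trial = chosen + [(name, readout)]
        readouts = [r for _, r in trial]
        trial_w = fit_nonneg_weights(readouts, target)
        err = sse(readouts, trial_w, target)
        if best - err > min_gain:
            chosen, weights, best = trial, trial_w, err
    return [n for n, _ in chosen], weights, best


# Hypothetical three-striped target and four candidate windows.
target = [1, 0, 1, 0, 1, 0]
candidates = [
    ("green", [1, 0, 0, 0, 0, 0]),
    ("red_1", [0, 0, 1, 0, 0, 0]),
    ("red_2", [0, 1, 1, 0, 0, 0]),  # would add ectopic expression at position 1
    ("blue", [0, 0, 0, 0, 1, 0]),
]
names, weights, error = greedy_select(candidates, target)
print(names, weights, error)
```

With refit non-negative weights, an unhelpful window simply receives zero weight and yields no gain, so the threshold discards it. That plays the role of the "deteriorates model performance" check in the caption.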

FigShare

(A) Schematic of expression pattern of the pair-rule gene even-skipped (eve) in D. melanogaster embryo.

Author: Md. Abul Hassan Samee (242314)
Saurabh Sinha (4760)

‘A’ and ‘P’ denote the anterior and the posterior ends of the embryo, respectively. (B) Quantitative profile of eve gene expression along the anterior-posterior axis of the embryo. (C) Genome Browser view of the five distinct enhancer elements that drive eve gene expression; each enhancer's name denotes the specific stripe(s) of gene expression that it drives. The entire locus is 17 Kbp long. (D) Concentration profiles along the anterior-posterior axis, for the nine TFs used to model the expression patterns of the genes eve, h, run, and gt. (E) Real (red) and GEMSTAT-predicted (green) expression profiles along the A/P axis for the known enhancers of eve, h, run, and gt.

FigShare

Outcome of MCMC sampling to reveal the cis-regulatory architecture of eve intergenic region.

Author: Md. Abul Hassan Samee (242314)
Saurabh Sinha (4760)

(A) Top panel shows the eve intergenic locus along with the known enhancers of eve and windows selected by GEMSTAT-GL to model eve expression pattern. Bottom panel shows the average weight of segments in the locus as estimated by MCMC sampling. The horizontal axis of the bottom panel spans the eve locus; green diamonds in the plot represent the starting positions of the sequence segments that comprise the MCMC samples (segments corresponding to two different green diamonds might therefore differ in length). The vertical axis denotes the average weight (on a relative scale between 0 and 1) that each segment received over 50,000 samples. (B) Predicted readouts of three zero-weight segments that could have an irreconcilable effect on the gene expression pattern, and were not selected by the two-tiered model.

FigShare
