
    Computational modeling of gene expression from regulatory sequences

    Regulation of gene expression is an important early step in controlling every biological process that underlies the function of living organisms. Even though gene expression may be regulated in several stages, the modulation occurs mostly at the primary stage known as “transcription”. Teasing out the details of transcriptional regulation is therefore a core focus of biological research. Transcriptional regulation of gene expression is dictated by regulatory DNA sequences, often called cis-regulatory modules (CRMs; also known as “enhancers”), that contain specific binding sites for regulatory proteins (transcription factors, TFs). The assembly of TFs bound on a CRM drives the desired expression level of the gene associated with the CRM. As the abundance of TFs varies across different cell types, the expression level of the gene, also termed the “readout” of the CRM, varies accordingly and results in the aforementioned control over biological processes. The rules, collectively known as the “cis-regulatory logic”, to predict gene expression level given information about CRMs and TFs, however, are unclear. Decades of experimental studies have hypothesized mechanisms about parts of this regulatory process (e.g., about the influence of TF-TF interactions), but a comprehensive study of cis-regulatory logic is feasible only through computational models. The subject of this thesis is to develop mechanistic models of gene expression from regulatory sequences and use the models to understand details of the system that are difficult to assess experimentally. The first part of this thesis develops a model that integrates the regulatory effect of signaling pathways with that of sequence-bound TFs to understand the expression pattern of a gene from its CRM. Given the various types of molecular interactions that the model needs to capture, it is both complex in structure and rich in the number of parameters. 
Similarly complex models commonly used in other disciplines, from signaling networks to climatology, have been shown to fit many distinct parameterizations that are equally consistent with data but might represent disparate mechanistic hypotheses. Whether this is also the case for models of cis-regulation has never been investigated, with the standard practice in this realm being to report a single or a few best-fit models. We demonstrate here, taking the Drosophila ind gene as an example, that gene expression modeling from cis-regulatory sequences may suffer from incomplete and even incorrect conclusions if one adheres to this current practice. We construct an ensemble of models by systematically exploring the entire parameter space and leveraging both wild-type data and various perturbation experiments, and make statistical inferences from the ensemble about the detailed regulatory mechanisms of ind. Years of genetic experiments have put forth an assortment of hypotheses about ind regulation. We use our modeling approach to show how a mechanism involving MAPK-induced attenuation in the DNA binding affinity of Capicua and the use of low-affinity Dorsal binding sites may provide a coherent explanation of ind regulation. Also, we quantitatively predict and experimentally validate the role of the “pioneer factor” Zelda in activating ind. Finally, we discuss disparate hypotheses that are supported by our ensemble of models and will need future experimentation for a complete understanding of ind regulation. The second part of this thesis addresses a fundamental goal of computational biology, namely that of modeling a gene’s expression from its intergenic locus and trans-regulatory context. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers. 
Here we report the first quantitative model of a gene’s expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: 1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model and 2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about enhancers contributing toward a gene’s expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence-segments in the locus that would contribute ectopic expression patterns and hence were “shut down” by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found a deterioration of expression when binding sites in one enhancer were allowed to influence the readout of another enhancer. Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model. The third part of this thesis applies the aforementioned models to two novel datasets. The first dataset was created by fusing two well-studied CRMs of the even-skipped (eve) gene in Drosophila. 
The fused constructs differ in the orientation, order, and intervening spacing of the CRMs. Interestingly, the two constituent CRMs regulate eve expression by using the same TFs, although binding affinities (i.e., strength) of the repressor sites in the two CRMs are different, an observation that is thought to help the CRMs drive expression in two distinct domains (each domain consists of two stripes of eve) when they act in their endogenous context. However, the fact that these two CRMs harbor sites for the same TFs makes it difficult to predict the readouts of the constructs in our dataset. In particular, readouts of these constructs show some subtle aspects that challenge the conventional models of information integration from sequences and suggest that a different mechanism may be necessary to explain these observations. Our modeling of this novel dataset suggests that the conventional assumption that relatively short DNA sequences, e.g., CRMs, do not comprise smaller “independent” regulatory sequences may not hold, since the lengths of the fused constructs are comparable to typical CRMs and their readouts can be modeled by assuming the existence of smaller independent regulatory segments. The second dataset modeled in this part of the thesis features five genes that control the growth and patterning of the wing in Drosophila. Notably, ours is the first attempt to link regulatory sequences and the related molecular details to the growth and scaling of an organ. In the course of fitting this dataset, we identify the important regulatory role of a TF called Scalloped (Sd) and speculate on Sd’s role in assuring that the expression domains of the studied genes scale with wing growth. We also use our models to identify novel regulatory sequences of these genes and to answer several questions that were left open in the experimental studies that first attempted to understand the cis-regulatory logic for these genes.

    Incorporating Chromatin Accessibility Data into Sequence-to-Expression Modeling

    Prediction of gene expression levels from regulatory sequences is one of the major challenges of genomic biology today. A particularly promising approach to this problem is that taken by thermodynamics-based models that interpret an enhancer sequence in a given cellular context specified by transcription factor concentration levels and predict precise expression levels driven by that enhancer. Such models have so far not accounted for the effect of chromatin accessibility on interactions between transcription factors and DNA and consequently on gene-expression levels. Here, we extend a thermodynamics-based model of gene expression, called GEMSTAT (Gene Expression Modeling Based on Statistical Thermodynamics), to incorporate chromatin accessibility data and quantify its effect on accuracy of expression prediction. In the new model, called GEMSTAT-A, accessibility at a binding site is assumed to affect the transcription factor’s binding strength at the site, whereas all other aspects are identical to the GEMSTAT model. We show that this modification results in significantly better fits in a data set of over 30 enhancers regulating spatial expression patterns in the blastoderm-stage Drosophila embryo. It is important to note that the improved fits result not from an overall elevated accessibility in active enhancers but from the variation of accessibility levels within an enhancer. With whole-genome DNA accessibility measurements becoming increasingly popular, our work demonstrates how such data may be useful for sequence-to-expression models. It also calls for future advances in modeling accessibility levels from sequence and the transregulatory context, so as to predict accurately the effect of cis and trans perturbations on gene expression.
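
The idea that accessibility modulates binding strength at a site can be illustrated with a minimal sketch. The exponential penalty, the parameter `k`, and both function names are assumptions introduced here for illustration; the abstract does not state the functional form used in GEMSTAT-A.

```python
import math


def effective_binding_weight(q, accessibility, k=1.0):
    """Scale a site's binding weight by its local chromatin accessibility.

    q             -- binding weight at full accessibility
                     (affinity times TF concentration, arbitrary units)
    accessibility -- value in [0, 1]; 1 means fully open chromatin
    k             -- hypothetical penalty strength (illustrative only)
    """
    if not 0.0 <= accessibility <= 1.0:
        raise ValueError("accessibility must lie in [0, 1]")
    # Assumed form: closed chromatin attenuates binding exponentially.
    return q * math.exp(-k * (1.0 - accessibility))


def occupancy(q):
    """Fractional occupancy of an isolated site with binding weight q."""
    return q / (1.0 + q)


# Two hypothetical sites with equal intrinsic strength but different accessibility.
for q, acc in [(2.0, 1.0), (2.0, 0.2)]:
    q_eff = effective_binding_weight(q, acc, k=2.0)
    print(f"accessibility={acc:.1f}  occupancy={occupancy(q_eff):.3f}")
```

Under this assumed form, two sites inside the same enhancer can contribute differently even when their sequence-based affinities are equal, which is the kind of within-enhancer variation the abstract points to.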

    Thermodynamics-Based Models of Transcriptional Regulation by Enhancers: The Roles of Synergistic Activation, Cooperative Binding and Short-Range Repression

    Quantitative models of cis-regulatory activity have the potential to improve our mechanistic understanding of transcriptional regulation. However, the few models available today have been based on simplistic assumptions about the sequences being modeled, or heuristic approximations of the underlying regulatory mechanisms. We have developed a thermodynamics-based model to predict gene expression driven by any DNA sequence, as a function of transcription factor concentrations and their DNA-binding specificities. It uses statistical thermodynamics theory to model not only protein-DNA interaction, but also the effect of DNA-bound activators and repressors on gene expression. In addition, the model incorporates mechanistic features such as synergistic effect of multiple activators, short range repression, and cooperativity in transcription factor-DNA binding, allowing us to systematically evaluate the significance of these features in the context of available expression data. Using this model on segmentation-related enhancers in Drosophila, we find that transcriptional synergy due to simultaneous action of multiple activators helps explain the data beyond what can be explained by cooperative DNA-binding alone. We find clear support for the phenomenon of short-range repression, where repressors do not directly interact with the basal transcriptional machinery. We also find that the binding sites contributing to an enhancer's function may not be conserved during evolution, and a noticeable fraction of these undergo lineage-specific changes. Our implementation of the model, called GEMSTAT, is the first publicly available program for simultaneously modeling the regulatory activities of a given set of sequences
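
A minimal sketch of the statistical-thermodynamics idea follows, under strong simplifying assumptions: sites bind independently (no cooperativity), each bound factor multiplies the weight of the transcription-competent state by a factor `alpha`, and short-range repression is omitted. The function name, parameter names, and values are hypothetical; this is not the GEMSTAT implementation.

```python
from itertools import product


def expression(sites, q_btm=0.01):
    """Probability that the basal transcriptional machinery (BTM) is bound.

    sites -- list of (q, alpha) pairs, one per binding site:
             q     = binding weight (affinity times TF concentration)
             alpha = interaction with the BTM (>1 activator, <1 repressor)
    q_btm -- basal binding weight of the BTM (hypothetical value)
    """
    z_off = 0.0  # total weight of configurations without the BTM
    z_on = 0.0   # total weight of configurations with the BTM bound
    # Enumerate every bound/unbound configuration of the sites.
    for config in product((0, 1), repeat=len(sites)):
        weight = 1.0
        boost = 1.0
        for bound, (q, alpha) in zip(config, sites):
            if bound:
                weight *= q
                boost *= alpha
        z_off += weight
        z_on += weight * boost * q_btm
    return z_on / (z_on + z_off)


basal = expression([])
with_activator = expression([(1.0, 10.0)])
with_repressor = expression([(1.0, 0.1)])
print(basal, with_activator, with_repressor)
```

Enumeration is exponential in the number of sites, so this form is only practical for a handful of sites; it is used here because it makes the sum over configurations explicit.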

    Quantitative modeling of a gene's expression from its intergenic sequence.

    Modeling a gene's expression from its intergenic locus and trans-regulatory context is a fundamental goal in computational biology. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers. Here we report the first quantitative model of a gene's expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: 1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model and 2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about enhancers contributing toward a gene's expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence-segments in the locus that would contribute ectopic expression patterns and hence were "shut down" by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found a deterioration of expression when binding sites in one enhancer were allowed to influence the readout of another enhancer. 
Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model.
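
The two tiers described above can be restated compactly. The notation is an assumption introduced here for illustration and is not taken from the paper: S_k is the k-th selected enhancer sequence, c(x) the vector of transcription factor concentrations at position x, G the thermodynamics-based readout of a single enhancer, and w_k that enhancer's weight.

```latex
\[
  \widehat{E}_{g}(x) \;=\; \sum_{k=1}^{K} w_k \, G\bigl(S_k,\ \mathbf{c}(x)\bigr)
\]
```

Enhancer independence means that each term depends only on the binding sites within its own S_k; the interference test mentioned at the end of the abstract corresponds to relaxing that restriction.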

    Results of fitting the GEMSTAT-GL model on the intergenic locus of 23 additional genes studied in [13].

    Quantitative data on target expression patterns were obtained from the companion website of the same study, and were originally derived from in situ expression images at the FlyExpress [81] database. For each gene, the red and the green plots represent the target (real) and the modeled expression patterns, respectively.

    Overview of the newly proposed GEMSTAT-GL model (details are given in Materials and Methods).

    The model has two types of parameters, namely thermodynamic parameters (to compute the expression readout of any sequence window) and window-weight parameters (to compute a weighted summation of the expression readouts of a set of selected windows). The parameters are optimized iteratively to fit the expression pattern of a given gene from its locus. In the example shown, GEMSTAT-GL is applied to fit the three-striped expression pattern (shown by the green, the red, and the blue stripes) of a gene g from its locus. Each iteration in model training consists of two phases. In the first phase, through a sliding window mechanism, the model selects a set C(s) of candidate windows for each stripe s. To this end, each window's readout (computed by GEMSTAT, denoted here by function G) is compared against each individual stripe, as exemplified through the operations on the green window. (Computation of the initial estimates for thermodynamic parameters is explained in main text.) In the second phase, a solution is constructed by iteratively checking if including a new window from the candidate sets (computed in Phase 1) improves model performance. In the shown example, the green window first gets included in the solution since it fits the green stripe satisfactorily. Next, the first window from the red stripe's candidate set is added to the solution and weights for the two windows are optimized so that a weighted summation of their readouts (denoted by function GL) fits the expression pattern consisting of the green and the red stripes. The model shows improved performance and hence, the red window is retained in the solution. However, when the second window from the red stripe's candidate set is added to the solution, model performance deteriorates. The window is therefore discarded. Similarly, the blue window from the blue stripe's candidate set is checked and found to improve model performance, resulting in its inclusion in the solution. After completing the second phase, the model re-estimates the thermodynamic parameters and loops back to Phase 1.
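
Phase 2 of the procedure described above can be sketched as a greedy forward selection. Everything here is an assumption made for illustration: the window readouts are toy vectors rather than GEMSTAT output, the score is a sum of squared errors, weights are fitted by non-negative coordinate descent, and a window is discarded when it fails to improve the score (with this score a zero weight can always neutralize a window, so the fit cannot strictly worsen). The function names are hypothetical and do not come from GEMSTAT-GL.

```python
def combine(readouts, weights):
    """Weighted sum of window readouts at every position along the axis."""
    n = len(readouts[0])
    return [sum(w * r[i] for w, r in zip(weights, readouts)) for i in range(n)]


def sse(pred, target):
    """Sum of squared errors between predicted and target expression."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def fit_weights(readouts, target, sweeps=200):
    """Non-negative least squares by coordinate descent (illustrative)."""
    w = [0.0] * len(readouts)
    for _ in range(sweeps):
        for j, r in enumerate(readouts):
            denom = sum(x * x for x in r)
            if denom == 0.0:
                continue
            # Residual of the target after removing all windows except j.
            others = [
                sum(w[k] * readouts[k][i] for k in range(len(readouts)) if k != j)
                for i in range(len(target))
            ]
            num = sum(x * (t - o) for x, t, o in zip(r, target, others))
            w[j] = max(0.0, num / denom)
    return w


def greedy_select(candidates, target, min_gain=1e-6):
    """Keep a candidate window only if adding it improves the fit."""
    chosen, weights = [], []
    best = sse([0.0] * len(target), target)
    for idx in range(len(candidates)):
        trial = chosen + [idx]
        trial_readouts = [candidates[i] for i in trial]
        w = fit_weights(trial_readouts, target)
        err = sse(combine(trial_readouts, w), target)
        if err < best - min_gain:
            chosen, weights, best = trial, w, err
    return chosen, weights, best


# Toy three-stripe target and four candidate windows (hypothetical readouts).
target = [1.0, 0.0, 1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # "green" window
    [0.0, 0.0, 1.0, 0.0, 0.0],  # first "red" window
    [0.0, 1.0, 1.0, 0.0, 0.0],  # second "red" window, partly ectopic
    [0.0, 0.0, 0.0, 0.0, 1.0],  # "blue" window
]
chosen, weights, err = greedy_select(candidates, target)
print(chosen, weights, err)
```

The re-estimation of thermodynamic parameters after Phase 2 and the sliding-window candidate search of Phase 1 are left out of this sketch.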

    (A) Schematic of the expression pattern of the pair-rule gene even-skipped (eve) in the D. melanogaster embryo.

    ‘A’ and ‘P’ denote the anterior and the posterior ends of the embryo, respectively. (B) Quantitative profile of eve gene expression along the anterior-posterior axis of the embryo. (C) Genome Browser view of the five distinct enhancer elements that drive eve gene expression; each enhancer's name denotes the specific stripe(s) of gene expression that it drives. The entire locus is 17 kbp long. (D) Concentration profiles along the anterior-posterior axis, for the nine TFs used to model the expression patterns of the genes eve, h, run, and gt. (E) Real (red) and GEMSTAT-predicted (green) expression profiles along the A/P axis for the known enhancers of eve, h, run, and gt.

    Outcome of MCMC sampling to reveal the cis-regulatory architecture of the eve intergenic region.

    (A) Top panel shows the eve intergenic locus along with the known enhancers of eve and windows selected by GEMSTAT-GL to model the eve expression pattern. Bottom panel shows the average weight of segments in the locus as estimated by MCMC sampling. The horizontal axis of the bottom panel spans the eve locus; green diamonds in the plot represent the starting positions of the sequence segments that comprise the MCMC samples (segments corresponding to two different green diamonds might therefore differ in length). The vertical axis denotes the average weight (on a relative scale between 0 and 1) that each segment received over 50,000 samples. (B) Predicted readouts of three zero-weight segments that could have an irreconcilable effect on the gene expression pattern, and were not selected by the two-tiered model.