Computational modeling of gene expression from regulatory sequences

Abstract

Regulation of gene expression is an important early step in controlling every biological process that underlies the function of living organisms. Although gene expression may be regulated at several stages, most of the modulation occurs at the primary stage, known as “transcription”. Teasing out the details of transcriptional regulation is therefore a core focus of biological research. Transcriptional regulation of gene expression is dictated by regulatory DNA sequences, often called cis-regulatory modules (CRMs; also known as “enhancers”), that contain specific binding sites for regulatory proteins (transcription factors, TFs). The assembly of TFs bound to a CRM drives the desired expression level of the gene associated with that CRM. As the abundance of TFs varies across cell types, the expression level of the gene, also termed the “readout” of the CRM, varies accordingly, which results in the aforementioned control over biological processes. However, the rules for predicting gene expression level from information about CRMs and TFs, collectively known as the “cis-regulatory logic”, remain unclear. Decades of experimental studies have produced hypotheses about parts of this regulatory process (e.g., the influence of TF-TF interactions), but a comprehensive study of cis-regulatory logic is feasible only through computational models. This thesis develops mechanistic models of gene expression from regulatory sequences and uses them to understand details of the system that are difficult to assess experimentally.

The first part of this thesis develops a model that integrates the regulatory effect of signaling pathways with that of sequence-bound TFs to understand the expression pattern of a gene from its CRM. Given the various types of molecular interactions that the model needs to capture, it is both complex in structure and rich in parameters.
Similarly complex models used in other disciplines, from signaling networks to climatology, have been shown to admit many distinct parameterizations that are equally consistent with the data but may represent disparate mechanistic hypotheses. Whether this is also the case for models of cis-regulation has never been investigated; the standard practice in this field is to report a single or a few best-fit models. Taking the Drosophila ind gene as an example, we demonstrate that gene expression modeling from cis-regulatory sequences may yield incomplete and even incorrect conclusions if one adheres to this practice. We construct an ensemble of models by systematically exploring the entire parameter space and leveraging both wild-type data and various perturbation experiments, and we make statistical inferences from the ensemble about the detailed regulatory mechanisms of ind. Years of genetic experiments have put forth an assortment of hypotheses about ind regulation. We use our modeling approach to show how a mechanism involving MAPK-induced attenuation of the DNA-binding affinity of Capicua and the use of low-affinity Dorsal binding sites may provide a coherent explanation of ind regulation. We also quantitatively predict and experimentally validate the role of the “pioneer factor” Zelda in activating ind. Finally, we discuss disparate hypotheses that are supported by our ensemble of models and that will require future experimentation for a complete understanding of ind regulation.

The second part of this thesis addresses a fundamental goal of computational biology: modeling a gene’s expression from its intergenic locus and trans-regulatory context. Owing to the distributed nature of cis-regulatory information and the poorly understood mechanisms that integrate such information, gene locus modeling is a more challenging task than modeling individual enhancers.
Here we report the first quantitative model of a gene’s expression pattern as a function of its locus. We model the expression readout of a locus in two tiers: (1) combinatorial regulation by transcription factors bound to each enhancer is predicted by a thermodynamics-based model, and (2) independent contributions from multiple enhancers are linearly combined to fit the gene expression pattern. The model does not require any prior knowledge about the enhancers contributing to a gene’s expression. We demonstrate that the model captures the complex multi-domain expression patterns of anterior-posterior patterning genes in the early Drosophila embryo. Altogether, we model the expression patterns of 27 genes; these include several gap genes, pair-rule genes, and anterior, posterior, trunk, and terminal genes. We find that the model-selected enhancers for each gene overlap strongly with its experimentally characterized enhancers. Our findings also suggest the presence of sequence segments in the locus that would contribute ectopic expression patterns and hence were “shut down” by the model. We applied our model to identify the transcription factors responsible for forming the stripe boundaries of the studied genes. The resulting network of regulatory interactions exhibits a high level of agreement with known regulatory influences on the target genes. Finally, we analyzed whether and why our assumption of enhancer independence was necessary for the genes we studied. We found that the modeled expression deteriorated when binding sites in one enhancer were allowed to influence the readout of another enhancer. Thus, interference between enhancer activities was a possible factor necessitating enhancer independence in our model.

The third part of this thesis applies the aforementioned models to two novel datasets. The first dataset was created by fusing two well-studied CRMs of the even-skipped (eve) gene in Drosophila.
The fused constructs differ in the orientation, order, and intervening spacing of the CRMs. Interestingly, the two constituent CRMs regulate eve expression using the same TFs, although the binding affinities (i.e., strengths) of the repressor sites in the two CRMs differ, an observation that has been proposed to help the CRMs drive expression in two distinct domains (each consisting of two eve stripes) when they act in their endogenous context. However, the fact that these two CRMs harbor sites for the same TFs makes it difficult to predict the readouts of the constructs in our dataset. In particular, the readouts of these constructs show subtle features that challenge conventional models of information integration from sequences and suggest that a different mechanism may be necessary to explain these observations. Our modeling of this dataset suggests that the conventional assumption that relatively short DNA sequences, such as CRMs, do not comprise smaller “independent” regulatory sequences may not hold: the fused constructs are comparable in length to typical CRMs, yet their readouts can be modeled by assuming the existence of smaller independent regulatory segments.

The second dataset modeled in this part of the thesis features five genes that control the growth and patterning of the wing in Drosophila. Notably, ours is the first attempt to link regulatory sequences and the related molecular details to the growth and scaling of an organ. In the course of fitting this dataset, we identify an important regulatory role for a TF called Scalloped (Sd) and speculate on Sd’s role in ensuring that the expression domains of the studied genes scale with wing growth. We also use our models to identify novel regulatory sequences of these genes and to answer several questions left open by the experimental studies that first attempted to understand the cis-regulatory logic of these genes.
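
To illustrate the two-tier locus model described in the second part, the following Python sketch computes a thermodynamics-style readout for each sequence segment and then combines the segments linearly. It is a minimal sketch under stated assumptions, not the model used in the thesis: binding sites are assumed to bind independently (no cooperativity), each bound TF is assumed to scale the statistical weight of the basal transcriptional machinery by a factor alpha, and the names segment_expression, locus_expression, q_btm, and the TF labels "A" and "R" are hypothetical.

```python
from math import prod


def segment_expression(sites, tf_conc, q_btm=0.01):
    """Tier 1 (assumed form): probability that the basal machinery is bound.

    sites   -- list of (tf_name, affinity, alpha) tuples; alpha > 1 marks an
               activator, alpha < 1 a repressor (hypothetical encoding).
    tf_conc -- dict mapping tf_name to its concentration at one position.
    q_btm   -- basal statistical weight of the machinery (assumed value).
    """
    # With independent sites, the sum over all bound/unbound configurations
    # factorizes into a product over sites.
    z_off = prod(1.0 + k * tf_conc[tf] for tf, k, _ in sites)
    z_on = q_btm * prod(1.0 + a * k * tf_conc[tf] for tf, k, a in sites)
    return z_on / (z_on + z_off)


def locus_expression(segments, weights, tf_conc):
    """Tier 2: linear combination of independent segment readouts.

    A weight of 0 corresponds to a segment that is "shut down".
    """
    return sum(w * segment_expression(s, tf_conc)
               for s, w in zip(segments, weights))


# Hypothetical locus with two segments and two TFs.
segments = [
    [("A", 1.0, 10.0)],                    # activator-only segment
    [("A", 1.0, 10.0), ("R", 2.0, 0.1)],   # activator plus repressor
]
weights = [1.0, 0.5]
conc = {"A": 1.0, "R": 0.5}
readout = locus_expression(segments, weights, conc)
```

In an actual fit, the affinities, alpha values, and segment weights would be free parameters estimated from expression data; here they are fixed only to show the structure of the calculation.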
