Computationally-guided inference of cis-regulatory codes in metazoans

Abstract

Thesis: Sc. D., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2014. Cataloged from PDF version of thesis. Includes bibliographical references.

The biomedical problems of spaceflight pose a formidable challenge for the future of long-duration human space exploration. Besides the effects of radiation, astronauts experience muscle atrophy, cardiovascular deconditioning, bone demineralization, cataracts, as well as sensorimotor and immune disruptions. Existing countermeasures are ineffective partly because they focus on the symptoms rather than on the underlying biological circuitry. Deciphering the networks of genes, their regulatory proteins (transcription factors, TFs) and their regulatory DNA elements (cis-regulatory modules, CRMs) is fundamental to understanding the biological processes they underlie. These processes include the formation and maintenance of cells, tissues and organs in healthy conditions, in disease, and in response to environmental perturbations such as spaceflight. A better understanding of these networks can thus aid the development of specific and effective molecular therapeutics. Towards this goal, this thesis focuses on the computational inference of the cis-regulatory codes (combinations of TF binding site motifs) that drive the precise spatio-temporal expression of different sets of genes. The problem of cis-regulatory code inference is notoriously difficult in multicellular organisms (metazoans) because of their large repertoires of TFs and because of their vast genomes, where biological discovery and validation are laborious and costly, requiring computational decision aids to guide and prioritize experiments. First, I present a novel computational approach to control for length-dependent artifacts encountered by popular CRM discovery algorithms in the field.
This LOESS-based method is flexible in capturing diverse score-length relationships and is more effective at correcting for length-dependent artifacts than four available competing approaches. Applying this method to Drosophila melanogaster embryonic muscle and larval neural development resulted in a more accurate inference of their biologically validated cis-regulatory codes. The method is broadly applicable to the detection of other types of patterns in biological sequences. Second, I computationally identified the Forkhead (Fkh) family of TFs as putative regulators in the development of different D. melanogaster mesodermal tissues, including cardiac, somatic and visceral muscle. The LOESS method was used to identify the Fkh family as part of general cis-regulatory codes driving the development of the embryonic heart, which otherwise could not have been identified. This study also found that the same CRM in different cell types can be targeted by different TFs of the same family, a finding that improves our understanding of the principles of gene regulation. Third, I developed Archimedes, a novel efficient algorithm to infer high-order cis-regulatory codes of arity up to 10 using large input collections of TF binding site motifs (~100s). An exhaustive search of the exponential space of motif combinations is computationally intractable. Existing brute-force algorithms are limited to small collections of input TF binding site motifs (fewer than 20) or to low-order combinations (single or pairwise). Many other algorithms make assumptions that are biologically invalid or are unsuitable for metazoans with vast genomes. Archimedes achieves an average of ~7 orders of magnitude savings in the number of motif combinations explored, with respect to a brute-force approach.
These substantial savings are achieved by the use of (a) a qualitative model of gene regulation, and (b) concepts from combinatorial optimization and constraint satisfaction to eliminate large regions of the search space that lead to non-promising or suboptimal solutions. Archimedes performs as well as Lever, a brute-force algorithm, with ~85% mean positive predictive value at a 20% false positive rate, and outperforms its two other best competitors, Compo and CPModule, in identifying known regulatory motif combinations for different gene sets in D. melanogaster. In sum, the computational tools developed in this thesis can be used as decision aids to expedite biological discovery of cis-regulatory codes involved in health, in disease, and in the environmental perturbations of spaceflight. A better understanding of the underlying molecular circuitry of muscle atrophy, cataracts, and immune suppression, for instance, will ultimately inform and pave the way for the development of novel therapeutics to treat the health problems of humans when they are sick on Earth and as they push the frontiers of space exploration.

by Anton Aboukhalil. Sc. D.
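
As a rough illustration of why the abstract calls exhaustive search intractable, the sketch below counts the unordered motif combinations of arity 1 through 10 that a brute-force search would have to score. The library size of 100 is an assumption standing in for "hundreds" of input motifs, and the function name is hypothetical. This is only a counting exercise, not the Archimedes algorithm or any of its pruning rules.

```python
from math import comb


def brute_force_combinations(num_motifs: int, max_arity: int) -> int:
    """Count unordered motif combinations of size 1..max_arity.

    Each combination is a set of distinct motifs, so the count for a
    given arity k is the binomial coefficient C(num_motifs, k).
    """
    return sum(comb(num_motifs, k) for k in range(1, max_arity + 1))


# Assumed library of 100 motifs, combinations of arity up to 10.
total = brute_force_combinations(100, 10)
print(f"combinations to score: {total:.2e}")
```

Under these assumptions the count is on the order of 10^13, and it grows quickly with either the library size or the maximum arity, which is why pruning the search space matters.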
