6,396 research outputs found

    Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

    With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. Thus, a better understanding of this 3D conformation is of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on the development and application of interpretable machine learning methods for prediction and analysis of long-range genomic interactions output from chromatin interaction experiments. In the first part, we demonstrate that the genetic sequence information at the genomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether they interact with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, intuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in 3D genome organization. However, the insights gained from these models are still coarse-grained. We therefore devised a machine learning method, called CoMIK for Conformal Multi-Instance Kernels, capable of providing more fine-grained insights. When comparing sequences of variable length in the supervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin in long-range interactions. Although CoMIK primarily uses only genetic sequence information, it can also simultaneously utilize other information modalities, such as the numerous functional genomics data, if available. The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline to analyze HiChIP experimental data for studying the 3D architectural changes in Ewing sarcoma (EWS), a rare cancer affecting adolescents. In particular, HiChIP data for two experimental conditions, doxycycline-treated and untreated, and for primary tumor samples are analyzed. We demonstrate that pHDee facilitates processing and easy integration of large amounts of 3D genomics data analysis together with other data-intensive bioinformatics analyses.

    DNaseI hypersensitivity at gene-poor, FSH dystrophy-linked 4q35.2

    A subtelomeric region, 4q35.2, is implicated in facioscapulohumeral muscular dystrophy (FSHD), a dominant disease thought to involve local pathogenic changes in chromatin. FSHD patients have too few copies of a tandem 3.3-kb repeat (D4Z4) at 4q35.2. No phenotype is associated with having few copies of an almost identical repeat at 10q26.3. Standard expression analyses have not given definitive answers as to the genes involved. To investigate the pathogenic effects of short D4Z4 arrays on gene expression in the very gene-poor 4q35.2 and to find chromatin landmarks there for transcription control, unannotated genes and chromatin structure, we mapped DNaseI-hypersensitive (DH) sites in FSHD and control myoblasts. Using custom tiling arrays (DNase-chip), we found unexpectedly many DH sites in the two large gene deserts in this 4-Mb region. One site was seen preferentially in FSHD myoblasts. Several others were mapped >0.7 Mb from genes known to be active in the muscle lineage and were also observed in cultured fibroblasts, but not in lymphoid, myeloid or hepatic cells. Their selective occurrence in cells derived from mesoderm suggests functionality. Our findings indicate that the gene desert regions of 4q35.2 may have functional significance, possibly also to FSHD, despite their paucity of known genes

    The non-coding genome in Autism Spectrum Disorders

    Autism Spectrum Disorders (ASD) are a group of neurodevelopmental disorders (NDDs) characterized by difficulties in social interaction and communication, repetitive behavior, and restricted interests. While ASD have been shown to have a strong genetic component, current research largely focuses on coding regions of the genome. However, non-coding DNA, which makes up ∼99% of the human genome, has recently been recognized as an important contributor to the high heritability of ASD, and novel sequencing technologies have been a milestone in opening up new directions for the study of the gene regulatory networks embedded within the non-coding regions. Here, we summarize current progress on the contribution of non-coding alterations to the pathogenesis of ASD and provide an overview of existing methods allowing for the study of their functional relevance, discussing potential ways of unraveling ASD's “missing heritability”.

    Thamodaran. P

    Most genes are biallelically expressed, but imprinted genes exhibit monoallelic expression based on their parental origin. Genomic imprinting differs in its control between flowering plants and mammals: for instance, in plants imprinted genes are specifically activated by demethylation rather than targeted for silencing, and imprinted gene expression occurs in the endosperm. Imprinting also displays sexual dimorphism, such as differential timing of imprint establishment and an RNA-based silencing mechanism in paternally repressed imprinted genes. Within imprinted regions, the unusual occurrence and distribution of various types of repetitive elements may act as genomic imprinting signatures. Imprinting regulation at many loci probably involves insulator protein-dependent and higher-order chromatin interaction, and/or non-coding RNA-mediated mechanisms. However, placenta-specific imprinting involves repressive histone modifications and non-coding RNAs. The higher-order chromatin interaction involves differentially methylated domains (DMDs) exhibiting sex-specific methylation that act as scaffolds for imprinting and regulate allele-specific imprinted gene expression. The paternally methylated differentially methylated regions (DMRs) contain fewer CpGs than the maternally methylated DMRs. The non-coding RNA-mediated mechanisms include C/D RNA and microRNA, which are involved in RNA-guided post-transcriptional RNA modifications and RNA-mediated gene silencing, respectively. The maintenance and reprogramming of imprinting are not significantly affected by reduced expression of Dicer1, and the evolution of imprinting might be related to acquisition of DNMT3L (de novo methyltransferase 3L) by a common ancestor of eutherians and marsupials. The common features among diverse imprinting control elements and the evolutionary significance of imprinting remain to be identified.

    RNA, the Epicenter of Genetic Information

    The origin story and emergence of molecular biology is muddled. The early triumphs in bacterial genetics and the complexity of animal and plant genomes complicate an intricate history. This book documents the many advances, as well as the prejudices and founder fallacies. It highlights the premature relegation of RNA to simply an intermediate between gene and protein, the underestimation of the amount of information required to program the development of multicellular organisms, and the dawning realization that RNA is the cornerstone of cell biology, development, brain function and probably evolution itself. Key personalities, their hubris as well as prescient predictions are richly illustrated with quotes, archival material, photographs, diagrams and references to bring the people, ideas and discoveries to life, from the conceptual cradles of molecular biology to the current revolution in the understanding of genetic information. Key Features: Documents the confused early history of DNA, RNA and proteins, a transformative history of molecular biology like no other. Integrates the influences of biochemistry and genetics on the landscape of molecular biology. Chronicles the important discoveries, preconceptions and misconceptions that retarded or misdirected progress. Highlights major pioneers and contributors to molecular biology, with a focus on RNA and noncoding DNA. Summarizes the mounting evidence for the central roles of non-protein-coding RNA in cell and developmental biology. Provides a thought-provoking retrospective and forward-looking perspective for advanced students and professional researchers.

    Exon-phase symmetry and intrinsic structural disorder promote modular evolution in the human genome

    A key signature of module exchange in the genome is phase symmetry of exons, suggestive of exon shuffling events that occurred without disrupting translation reading frame. At the protein level, intrinsic structural disorder may be another key element because disordered regions often serve as functional elements that can be effectively integrated into a protein structure. Therefore, we asked whether exon-phase symmetry in the human genome and structural disorder in the human proteome are connected, signalling such evolutionary mechanisms in the assembly of multi-exon genes. We found an elevated level of structural disorder of regions encoded by symmetric exons and a preferred symmetry of exons encoding for mostly disordered regions (>70% predicted disorder). Alternatively spliced symmetric exons tend to correspond to the most disordered regions. The genes of mostly disordered proteins (>70% predicted disorder) tend to be assembled from symmetric exons, which often arise by internal tandem duplications. Preponderance of certain types of short motifs (e.g. SH3-binding motif) and domains (e.g. high-mobility group domains) suggests that certain disordered modules have been particularly effective in exon-shuffling events. Our observations suggest that structural disorder has facilitated modular assembly of complex genes in evolution of the human genome. © 2013 The Author(s)

    Organization of chromosome ends in the rice blast fungus, Magnaporthe oryzae

    Eukaryotic pathogens of humans often evade the immune system by switching the expression of surface proteins encoded by subtelomeric gene families. To determine if plant pathogenic fungi use a similar mechanism to avoid host defenses, we sequenced the 14 chromosome ends of the rice blast pathogen, Magnaporthe oryzae. One telomere is directly joined to ribosomal RNA-encoding genes, at the end of the ∼2 Mb rDNA array. Two are attached to chromosome-unique sequences, and the remainder adjoin a distinct subtelomere region, consisting of a telomere-linked RecQ-helicase (TLH) gene flanked by several blocks of tandem repeats. Unlike other microbes, M. oryzae exhibits very little gene amplification in the subtelomere regions: out of 261 predicted genes found within 100 kb of the telomeres, only four were present at more than one chromosome end. Therefore, it seems unlikely that M. oryzae uses switching mechanisms to evade host defenses. Instead, the M. oryzae telomeres have undergone frequent terminal truncation, and there is evidence of extensive ectopic recombination among transposons in these regions. We propose that the M. oryzae chromosome termini play more subtle roles in host adaptation by promoting the loss of terminally positioned genes that tend to trigger host defenses.

    Role of siRNA Pathway in Epigenetic Modifications of the Drosophila melanogaster X Chromosome

    Eukaryotic genomes are organized into large domains of coordinated regulation. The role of small RNAs in formation of these domains is largely unexplored. An extraordinary example of domain-wide regulation is X chromosome compensation in Drosophila melanogaster males. This process occurs by hypertranscription of genes on the single male X chromosome. Extensive research in this field has shown that the Male Specific Lethal (MSL) complex binds X-linked genes and modifies chromatin to increase expression. The components of this complex, and their actions on chromatin, are well studied. In contrast, the mechanism that results in exclusive recruitment to the X chromosome is not understood. Our research focuses on the process by which male flies selectively modulate expression from their single X chromosome. Prior studies in the lab have found that the siRNAs produced from repetitive sequences on the X chromosome, and the repeat DNA itself, participate in dosage compensation in flies. Interestingly, the siRNA pathway contributes to X-localization of the MSL complex. The basis of enhanced localization is unknown, and no RNAi components have been found to interact directly with the MSL complex. This suggests that siRNA influences X-recognition by an indirect and novel mechanism. I found evidence that chromatin around these repeats is modulated by the siRNA pathway. I demonstrated that FLAG-tagged Argonaute2 protein localizes at these repeats. I show that numerous Argonaute2-interacting proteins show evidence of participation in compensation. One of these, Su(var)3-9, deposits H3K9me2 in and near the repeats. When a repeat-containing transgene is inserted on an autosome, H3K9me2 is enriched in surrounding chromatin, an effect that is enhanced by ectopic production of cognate siRNA. In accord with the idea that these repeats contribute to recruitment of dosage compensation, genes as much as 100 kb from the autosomal insertion increase in expression upon expression of ectopic siRNA. My studies demonstrate that chromatin around a group of X-enriched sequences is modulated by siRNA, and support the idea that siRNA contributes to the elevated expression that characterizes the compensated male X chromosome. This study advances our understanding of the mechanism of X recognition by showing a direct relationship between siRNA-directed chromatin modification and a class of repetitive elements that helps mark the X chromosome.