52 research outputs found

    Integration of Hi-C with short and long-read genome sequencing reveals the structure of germline rearranged genomes

    Structural variants are a common cause of disease and contribute to a large extent to inter-individual variability, but their detection and interpretation remain a challenge. Here, we investigate 11 individuals with complex genomic rearrangements including germline chromothripsis by combining short- and long-read genome sequencing (GS) with Hi-C. Large-scale genomic rearrangements are identified in Hi-C interaction maps, allowing for an independent assessment of breakpoint calls derived from the GS methods, resulting in >300 genomic junctions. Based on a comprehensive breakpoint detection and Hi-C, we achieve a reconstruction of whole rearranged chromosomes. Integrating information on the three-dimensional organization of chromatin, we observe that breakpoints occur more frequently than expected in lamina-associated domains (LADs) and that a majority reshuffle topologically associating domains (TADs). By applying phased RNA-seq, we observe an enrichment of genes showing allelic imbalanced expression (AIG) within 100 kb around the breakpoints. Interestingly, the AIGs hit by a breakpoint (19/22) display both up- and downregulation, thereby suggesting different mechanisms at play, such as gene disruption and rearrangements of regulatory information. However, the majority of interpretable genes located 200 kb around a breakpoint do not show significant expression changes. Thus, there is an overall robustness in the genome towards large-scale chromosome rearrangements

    Integration of Hi-C with short and long-read genome sequencing reveals the structure of germline rearranged genomes

    Here the authors characterize structural variations (SVs) in a cohort of individuals with complex genomic rearrangements, identifying breakpoints by employing short- and long-read genome sequencing and investigating their impact on gene expression and the three-dimensional chromatin architecture. They find breakpoints are enriched in inactive regions and can result in chromatin domain fusions. Structural variants are a common cause of disease and contribute to a large extent to inter-individual variability, but their detection and interpretation remain a challenge. Here, we investigate 11 individuals with complex genomic rearrangements including germline chromothripsis by combining short- and long-read genome sequencing (GS) with Hi-C. Large-scale genomic rearrangements are identified in Hi-C interaction maps, allowing for an independent assessment of breakpoint calls derived from the GS methods, resulting in >300 genomic junctions. Based on a comprehensive breakpoint detection and Hi-C, we achieve a reconstruction of whole rearranged chromosomes. Integrating information on the three-dimensional organization of chromatin, we observe that breakpoints occur more frequently than expected in lamina-associated domains (LADs) and that a majority reshuffle topologically associating domains (TADs). By applying phased RNA-seq, we observe an enrichment of genes showing allelic imbalanced expression (AIG) within 100 kb around the breakpoints. Interestingly, the AIGs hit by a breakpoint (19/22) display both up- and downregulation, thereby suggesting different mechanisms at play, such as gene disruption and rearrangements of regulatory information. However, the majority of interpretable genes located 200 kb around a breakpoint do not show significant expression changes. Thus, there is an overall robustness in the genome towards large-scale chromosome rearrangements

    Computational methods in protein structure comparison and analysis of protein interaction networks

    Proteins are versatile biological macromolecules that perform numerous functions in a living organism. For example, proteins catalyze chemical reactions, store and transport various small molecules, and are involved in transmitting nerve signals. As the number of completely sequenced genomes grows, we are faced with the important but daunting task of assigning function to proteins encoded by newly sequenced genomes. In this thesis we contribute to this effort by developing computational methods for which one use is to facilitate protein function assignment. Functional annotation of a newly discovered protein can often be transferred from that of evolutionarily related proteins of known function. However, distantly related proteins can still only be detected by the most accurate protein structure alignment methods. As these methods are computationally expensive, they are combined with less accurate but fast methods to allow large-scale comparative studies. In this thesis we propose a general framework to define a family of protein structure comparison methods that reduce protein structure comparison to distance computation between high-dimensional vectors and therefore are extremely fast. Interactions among proteins can be detected through the use of several mature experimental techniques. These interactions are routinely represented by a graph, called a protein interaction network, with nodes representing the proteins and edges representing the interactions between the proteins. In this thesis we present two computational studies that explore the connection between the topology of protein interaction networks and protein biological function. Unfortunately, protein interaction networks do not explicitly capture an important aspect of protein interactions, their dynamic nature. 
In this thesis, we present an automatic method that relies on graph theoretic tools for chordal and cograph graph families to extract dynamic properties of protein interactions from the network topology. An intriguing question in the analysis of biological networks is whether biological characteristics of a protein, such as essentiality, can be explained by its placement in the network. In this thesis we analyze protein interaction networks for Saccharomyces cerevisiae to identify the main topological determinant of essentiality and to provide a biological explanation for the connection between the network topology and essentiality

    On Computable Protein Functions

    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse high dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins

    Machine Learning Applications for Drug Repurposing

    The cost of bringing a drug to market is astounding and the failure rate is intimidating. Drug discovery has had limited success under the conventional reductionist one-drug-one-gene-one-disease paradigm, where a single disease-associated gene is identified and a molecular binder to the specific target is subsequently designed. Under this simplistic paradigm of drug discovery, a drug molecule is assumed to interact only with the intended on-target. However, small molecular drugs often interact with multiple targets, and those off-target interactions are not considered under the conventional paradigm. As a result, drug-induced side effects and adverse reactions are often neglected until a very late stage of drug discovery, where the discovery of drug-induced side effects and potential drug resistance can decrease the value of the drug and even completely invalidate its use. Thus, a new paradigm in drug discovery is needed. Structural systems pharmacology is a new paradigm in drug discovery in which drug activities are studied by data-driven large-scale models with consideration of the structures and drugs. Structural systems pharmacology will model, on a genome scale, the energetic and dynamic modifications of protein targets by drug molecules as well as the subsequent collective effects of drug-target interactions on the phenotypic drug responses. To date, however, few experimental and computational methods can determine genome-wide protein-ligand interaction networks and the clinical outcomes mediated by them. As a result, the majority of proteins have not been charted for their small molecular ligands; we have a limited understanding of drug actions. To address the challenge, this dissertation seeks to develop and experimentally validate innovative computational methods to infer genome-wide protein-ligand interactions and multi-scale drug-phenotype associations, including drug-induced side effects. 
The hypothesis is that the integration of data-driven bioinformatics tools with structure-and-mechanism-based molecular modeling methods will lead to an optimal tool for accurately predicting drug actions and drug associated phenotypic responses, such as side effects. This dissertation starts by reviewing the current status of computational drug discovery for complex diseases in Chapter 1. In Chapter 2, we present REMAP, a one-class collaborative filtering method to predict off-target interactions from protein-ligand interaction network. In our later work, REMAP was integrated with structural genomics and statistical machine learning methods to design a dual-indication polypharmacological anticancer therapy. In Chapter 3, we extend REMAP, the core method in Chapter 2, into a multi-ranked collaborative filtering algorithm, WINTF, and present relevant mathematical justifications. Chapter 4 is an application of WINTF to repurpose an FDA-approved drug diazoxide as a potential treatment for triple negative breast cancer, a deadly subtype of breast cancer. In Chapter 5, we present a multilayer extension of REMAP, applied to predict drug-induced side effects and the associated biological pathways. In Chapter 6, we close this dissertation by presenting a deep learning application to learn biochemical features from protein sequence representation using a natural language processing method

    An Inferential Framework for Network Hypothesis Tests: With Applications to Biological Networks

    The analysis of weighted co-expression gene sets is gaining momentum in systems biology. In addition to substantial research directed toward inferring co-expression networks on the basis of microarray/high-throughput sequencing data, inferential methods are being developed to compare gene networks across one or more phenotypes. Common gene set hypothesis testing procedures are mostly confined to comparing average gene/node transcription levels between one or more groups and make limited use of additional network features, e.g., edges induced by significant partial correlations. Ignoring the gene set architecture disregards relevant network topological comparisons and can result in familiar

    Pan-genomics and the structural diversity of plant genomes

    A central task of genetics research is to uncover genotypes linked to important phenotypes. However, many genomic loci are incompletely or inaccurately represented in genetics studies, thus obscuring their function and evolution. New technology can accurately and continuously sequence large segments of genomic DNA at affordable cost and unprecedented scale, raising the possibility of complete and accurate representations of genomes across the tree of life. However, new computational methods are required to automatically finish, validate, and curate the forthcoming wave of genome assemblies enabled by these technologies. Researchers must also devise analytical approaches to comparing previously unresolved and usually repetitive genomic loci within and between species. Here, we introduce RaGOO and RagTag, new methods that leverage genome maps to automatically scaffold and improve draft genome assemblies into chromosome-scale representations. By applying these new methods to a bread wheat genome, we show how the established reference falsely collapsed functional paralogs genome-wide. In Arabidopsis thaliana, we present a new reference assembly that completely resolves all five centromeres for the first time, revealing centromere architecture, genetics, epigenetics, and evolution. Finally, we present a catalog of natural structural variants (SVs) across 100 diverse tomato accessions revealing exceptional genetic diversity via artificial introgression as well as broad and specific examples of how SVs influence molecular, domestication, and improvement phenotypes. This work underscores the potential to accelerate genetics research with complete and diverse genotype data and apply these findings to plant breeding and engineering

    Biogenesis and Stability of Germline Small RNAs in C. elegans.

    Across the animal kingdom, small, noncoding RNAs preserve and promote fertility by engaging Argonaute effector proteins to silence deleterious genetic elements. Generated in germline and inherited into progeny, endogenous small interfering RNAs (endo-siRNAs) and Piwi-interacting RNAs (piRNAs) regulate vast suites of gametic and zygotic genes, yet remarkably little is known about how they are regulated. With an expanded repertoire of small RNA classes, Caenorhabditis elegans provides an ideal model for investigating how animals drive epigenetic inheritance of fertility-preserving germline small RNAs. The conserved methyltransferase HEN1 methylates small RNAs to prevent their degradation. Methylation of germline small RNAs enhances accumulation, promoting robust inheritance into progeny. All plant small RNAs are methylated, but animal HEN1 methylates only some small RNAs. The mechanisms of selective methylation were unknown. I identified the functional C. elegans ortholog of HEN1 and demonstrated that it methylates all piRNAs but only select subclasses of endo-siRNAs. I further found that particular endo-siRNAs are methylated in maternal, but not paternal, germlines. Through genetic and biochemical analyses, I showed that small RNA methylation status is likely dictated by the associated Argonaute. This established selective expression of divergent Argonautes as a novel mechanism for differentially stabilizing germline small RNAs, with significant implications for preferential inheritance of maternal epigenetic information. piRNAs are essential for animal fertility, but their expression mechanisms are poorly characterized. In collaboration with bioinformatician Mallory Freeberg, I showed that C. elegans male and female germlines express distinct piRNA subsets that evolve independently and differ in inheritance. A common sequence motif lies upstream of nematode piRNA loci. We discovered that this motif varies significantly between male and female piRNAs. 
Using a novel transgenic approach, I established that C. elegans piRNAs represent thousands of tiny, autonomous transcriptional units, rivaling coding genes in number. I further demonstrated that the upstream motif is required for piRNA expression and that variation at a single nucleotide position within this motif orchestrates selective male versus female germline enrichment and inheritance of piRNAs. These and additional included studies define novel factors and mechanisms involved in regulation of germline small RNAs and transgenerational transmission of their crucial epigenetic information. PhD, Human Genetics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111471/1/acbilli_1.pd

    Examination of Genetic Components Affecting Human Obesity-Related Quantitative Traits

    Obesity increases the risk for several conditions, including type 2 diabetes mellitus, cardiovascular disease, hypertension, osteoarthritis and certain types of cancer. Twin and family studies have shown that there is a major genetic component in the determination of body mass. In recent years several technological and scientific advances have been made in obesity research. For instance, novel replicated loci have been revealed by a number of genome-wide association studies. This thesis aimed to investigate the association of genetic factors and obesity-related quantitative traits. The first study investigated the role of the lactase gene in anthropometric traits. We genetically defined lactase persistence by genotyping 31 720 individuals of European descent. We found that lactase persistence was significantly correlated with weight and body mass index but not with height. In the second study we performed the largest whole genome linkage scan for body mass index to date. The sample consisted of 4401 twin families and 10 535 individuals from six European countries. We found supporting evidence for two loci (3q29 and 7q36). We observed that the heritability estimate increased substantially when additional family members were removed from the analyses, which suggests reduced environmental variance in the twin sample. In the third study we assessed metabonomic, transcriptomic and genomic variation in a Finnish population cohort of 518 individuals. We formed gene expression networks to portray pathways and showed that a set of highly correlated genes of an inflammatory pathway associated with 80 serum metabolites (of 134 quantified measures). Strong association was found, for example, with several lipoprotein subclasses. We inferred causality by using genetic variation as anchors. The expression of the network genes was found to be dependent on the circulatory metabolite concentrations.
    Obesity is a considerable and growing problem worldwide. Obesity increases the risk of cardiovascular disease, type 2 diabetes, osteoarthritis and certain types of cancer. Family and twin studies have shown that a large part of the variation in body weight is explained by hereditary factors. The aim of this work was to study the interplay between obesity-related continuous traits and hereditary components. The first study examined the effect of the lactase gene on body build. We genetically determined lactose intolerance in 31 720 European individuals. We found that lactose-intolerant individuals had a statistically significantly lower body weight and body mass index than lactose-tolerant individuals. Lactose intolerance was not found to affect final height. In the second study we examined body mass index with the largest linkage study of twin families to date. The sample comprised 10 535 European individuals from 4 401 families in six countries. On chromosomes 3q29 and 7q36 we found results supporting earlier studies. We also found that heritability increased when we excluded other family members from the analyses, which would suggest reduced environmental variance in the twin sample. In the third study we examined metabolic, gene expression and genetic marker data in a Finnish population sample of 518 individuals. We built gene networks from strongly intercorrelated genes and found that an inflammation-related gene network correlated strongly with 80 of the 134 measured serum metabolites. Very strong correlations were found, for example, among lipoprotein subclasses. We also assessed causality by using genetic markers as directional anchors in the network analysis. The integrity of the gene network's expression was found to depend on metabolite concentrations in the blood

    Using MapReduce Streaming for Distributed Life Simulation on the Cloud

    Distributed software simulations are indispensable in the study of large-scale life models but often require the use of technically complex lower-level distributed computing frameworks, such as MPI. We propose to overcome the complexity challenge by applying the emerging MapReduce (MR) model to distributed life simulations and by running such simulations on the cloud. Technically, we design optimized MR streaming algorithms for discrete and continuous versions of Conway’s life according to a general MR streaming pattern. We chose life because it is simple enough as a testbed for MR’s applicability to a-life simulations and general enough to make our results applicable to various lattice-based a-life models. We implement and empirically evaluate our algorithms’ performance on Amazon’s Elastic MR cloud. Our experiments demonstrate that a single MR optimization technique called strip partitioning can reduce the execution time of continuous life simulations by 64%. To the best of our knowledge, we are the first to propose and evaluate MR streaming algorithms for lattice-based simulations. Our algorithms can serve as prototypes in the development of novel MR simulation algorithms for large-scale lattice-based a-life models.