
    INVESTIGATING ORIGIN AND FUNCTIONAL IMPACT OF GENOMIC STRUCTURAL VARIANTS WITH NEXT-GENERATION SEQUENCING

    Genomic variants play an important role in phenotypic variation and have a significant impact on disease development. Owing to technological limitations, inference of genomic variants and their potential consequences for phenotype was until recently restricted. Only with the advent of next-generation sequencing (NGS) approaches could a vast majority of genomic variants be successfully identified for the first time. In my PhD Thesis I will present my work on structural variants (SVs), their formation mechanisms and their functional impact. The first part of my Thesis focuses on structural variants in non-human primates, which had not been studied using NGS prior to the research we carried out. In order to inspect the origin and functional impact of SV formation mechanisms, we constructed a comprehensive SV map based on fibroblast-derived DNA from three different species: chimpanzee, orangutan and rhesus macaque. We noted striking differences in the activity of homology-related SV formation mechanisms between the great apes and rhesus macaques, with a third of the chimpanzee and orangutan SVs inferred to be formed by non-allelic homologous recombination compared with only 2% of the macaque SVs. One additional key finding was markedly higher mobile element activity in macaques compared to the other non-human primates studied. Additionally, we could show that long L1 elements surpassed Alu activity in chimpanzee and orangutan, as opposed to macaque, where AluMacYa3 dominates the genomic landscape, causing a burst of relatively short SVs. In addition to inserting into the genome, active L1 elements possess the ability to mobilize 3’ flanking DNA to different genomic loci as transductions. By combining translocation and L1 discovery pipelines we further developed a novel computational methodology, termed TIGER, for the discovery of polymorphic L1-mediated 3’ transductions. 
We applied TIGER to a deeply sequenced human genome and to the aforementioned non-human primate species to characterize transductions. TIGER enables studying germline L1-mediated 3’ transductions, making a relevant structural variation class amenable to population and disease studies for the first time. In the second part of my Thesis, I discuss the differences in the formation mechanisms of both germline and somatic SVs in the human genome. Our de novo mechanism classification analyses performed on four previously published SV datasets revealed that almost half of germline human SVs are due to mechanisms independent of homology, followed by homology-related DNA repair, mobile elements and variable number of tandem repeats. We also investigated the formation of somatic SVs in four medulloblastoma brain tumor patients with a germline TP53 mutation (Li-Fraumeni syndrome). In contrast to the germline SVs, our analyses of rearrangement breakpoints in medulloblastoma in the context of mutated TP53 rather support a model of massive DNA double-strand breaks known as chromothripsis, followed by exclusively homology-independent repair.

    Understanding and improving high-throughput sequencing data production and analysis

    Advances in DNA sequencing revolutionized the field of genomics over the last 5 years. New sequencing instruments make it possible to rapidly generate large amounts of sequence data at substantially lower cost. These high-throughput sequencing technologies (e.g. Roche 454 FLX, Life Technology SOLiD, Dover Polonator, Helicos HeliScope and Illumina Genome Analyzer) make whole genome sequencing and resequencing, transcript sequencing as well as quantification of gene expression, DNA-protein interactions and DNA methylation feasible at an unanticipated scale. In the field of evolutionary genomics, high-throughput sequencing permitted studies of whole genomes from ancient specimens of different hominin groups. Further, it allowed large-scale population genetics studies of present-day humans as well as different types of sequence-based comparative genomics studies in primates. Such comparisons of humans with closely related apes and hominins are important not only to better understand human origins and the biological background of what sets humans apart from other organisms, but also for understanding the molecular basis for diseases and disorders, particularly those that affect uniquely human traits, such as speech disorders, autism or schizophrenia. However, while the cost and time required to create comparative data sets have been greatly reduced, the error profiles and limitations of the new platforms differ significantly from those of previous approaches. This requires a specific experimental design in order to circumvent these issues, or to handle them during data analysis. During the course of my PhD, I analyzed and improved current protocols and algorithms for next generation sequencing data, taking into account the specific characteristics of these new sequencing technologies. 
The presented approaches and algorithms were applied in different projects and are widely used within the Department of Evolutionary Genetics at the Max Planck Institute for Evolutionary Anthropology. In this thesis, I will present selected analyses from the whole genome shotgun sequencing of two ancient hominins and the quantification of gene expression from short-sequence tags in five tissues from three primates.

    Phylogenetics in the Genomic Era

    Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystalized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees. This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era. One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. 
This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6). Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, instead of sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order. Analyses of multiple genes typically recover multiple, distinct trees. 
Maybe the biggest conceptual advance induced by the phylogenetic to phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather to be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene trees/species trees reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user’s perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments. Modern genomics is comparative and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6). 
The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology; i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy. It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease

    Insights into the Evolution of small nucleolar RNAs: Prediction, Comparison, Annotation

    Over the last decades, the formerly irrevocable belief that proteins are the only key factors in the complex regulatory machinery of a cell was crushed by a plethora of findings in all major eukaryotic lineages. These suggested a rugged landscape in the eukaryotic genome consisting of sequential, overlapping, or even bi-directional transcripts and myriads of regulatory elements. The vast part of the genome is indeed transcribed into an RNA intermediate, but only a small fraction is finally translated into functional proteins. The sweeping majority, however, is either degraded or functions as non-protein-coding RNA (ncRNA). Due to continuous developments in experimental and computational research, the variety of ncRNA classes grew larger and larger, with roles ranging from key processes in the cellular lifespan to regulatory processes that are driven and guided by ncRNAs. The bioinformatical part primarily concentrates on the prediction, annotation, and extraction of characteristic properties of novel ncRNAs. Due to conservation of sequence and/or structure, this task is often performed through a homology search that uses information about functional, and hence conserved, regions as an indicator. This thesis focuses mainly on a special class of ncRNAs, small nucleolar RNAs (snoRNAs). These abundant molecules are mainly responsible for the guidance of 2’-O-ribose-methylations and pseudouridylations in different types of RNAs, such as ribosomal and spliceosomal RNAs. Although the relevance of single modifications is still rather unclear, the elimination of several modifications has been shown to cause severe effects, including lethality. Several de novo prediction programs have been published over the last years, and a substantial number of publicly available snoRNA databases have emerged. Normally, these are restricted to a small number of species and a collection of experimentally extracted snoRNAs. 
The detection of snoRNAs by means of wet lab experiments and/or de novo prediction tools is generally time-consuming (wet lab) and quite tedious (identification of snoRNA-specific characteristics). The snoRNA annotation pipeline snoStrip was developed with the intention to circumvent these obstacles. It therefore utilizes a homology-based search procedure to reliably predict snoRNA genes in genomic sequences. In a subsequent step, all candidates are filtered with respect to specific sequence motifs and secondary structures. In a functional analysis, potential target sites are predicted in ribosomal and spliceosomal RNA sequences. In contrast to de novo prediction tools, snoStrip focuses on the extension of the known snoRNA world to uncharted organisms and the mapping and unification of the existing diversity of snoRNAs into functional, homologous families. The pipeline is well suited to analyzing a diverse set of organisms in search of their snoRNAome on short timescales. This offers the opportunity to generate large-scale analyses over whole eukaryotic kingdoms to gain insights into the evolutionary history of these special ncRNA molecules. A set of experimentally validated snoRNA genes in Deuterostomia and Fungi served as the starting point for highly comprehensive surveys searching and analyzing the snoRNA repertoire in these two major eukaryotic clades. In both cases, the snoStrip pipeline proved to be a fast and reliable tool and collected thousands of snoRNA genes in nearly 200 organisms. Additionally, the Interaction Conservation Index (ICI), which is extended to also work on single lineages, provides a convenient measure to analyze and evaluate the conservation of snoRNA-targetRNA interactions across different species. 
The massive amount of data and the possibility to score the conservation of predicted interactions constitute the main pillars for gaining extraordinary insight into the evolutionary history of snoRNAs on both the sequence and the functional level. A substantial part of the snoRNAome is traceable down to the root of both eukaryotic lineages and might indicate an even more ancient origin of these snoRNAs. However, a plenitude of lineage-specific innovation and deletion events are also discernible. Due to its automated detection of homologous and functionally related snoRNA sequences, snoStrip identified extraordinary target switches in fungi. These unveiled a coupled evolutionary history of several snoRNA families that were previously thought to be independent. Although these findings are exceedingly interesting, the broad majority of snoRNA families is found to show remarkable conservation of the sequence and the predicted target interactions. On two occasions, this thesis shifts its focus from a genuine snoRNA inspection to an analysis of introns. Both investigations, however, are still conducted from an evolutionary viewpoint. In the case of the ubiquitously present U3 snoRNA, functional genes in a notable number of fungi are found to be disrupted by U2-dependent introns. The set of previously known U3 genes is considerably enlarged by an adapted snoStrip search procedure. Intron-disrupted genes are found in several fungal lineages, while their precise insertion points within the snoRNA precursor are located in a small and homologous region. A potential target RNA of snoRNA genes, U6 snRNA, is also found to contain intronic sequences. Within this work, U6 genes are detected and annotated in nearly all fungal organisms. Although a few U6 intron-carrying genes have been known before, the wide spread of these findings and the diversity regarding the particular insertion points are surprising. 
Those U6 genes are commonly found to contain more than just one intron. In both cases of intron-disrupted non-coding RNA genes, the detected RNA molecules seem to be functional and the intronic sequences show remarkable sequence conservation for both their splice sites and the branch site. In summary, the snoStrip pipeline is shown to be a reliable and fast prediction tool that works on homology-based search principles. Large-scale analyses of whole eukaryotic lineages become feasible on short notice. Furthermore, the automated detection of functionally related but not yet mapped snoRNA families adds a new layer of information. Based on surveys covering the evolutionary history of Fungi and Deuterostomia, profound insights into the evolutionary history of this ncRNA class are revealed, suggesting an ancient origin for a main part of the snoRNAome. Lineage-specific innovation and deletion events are also found to occur at a large number of distinct timepoints.

    THE ANALYSIS OF ANCIENT DNA: FROM MITOCHONDRIA TO PATHOGENS

    Ancient DNA (aDNA) is arguably one of the most difficult scientific fields to work in due to the constant battle against contamination and degradation; however, it is also one of the most rewarding. aDNA researchers have consistently garnered interest the world over with their findings, sparking the curiosity of many who wish to know more about who we are as Homo sapiens. Mitochondrial DNA (mtDNA) and pathogen DNA were used in this dissertation to understand more about where populations came from, how they moved, and what their environment was like through the identification of their maternally inherited mtDNA and pathogens. This is a synthesis of my work and collaboration with other researchers, both in the lab and at the computer, to add more data to the story of humankind.

    Annual Report


    Design of new algorithms for gene network reconstruction applied to in silico modeling of biomedical data

    Doctoral Program in Biotechnology, Engineering and Chemical Technology. Research line: Engineering, Data Science and Bioinformatics. Program code: DBI. Line code: 111. The root causes of disease are still poorly understood. The success of current therapies is limited because persistent diseases are frequently treated based on their symptoms rather than the underlying cause of the disease. Therefore, biomedical research is experiencing a technology-driven shift to data-driven holistic approaches to better characterize the molecular mechanisms causing disease. Using omics data as an input, emerging disciplines like network biology attempt to model the relationships between biomolecules. To this effect, gene co-expression networks arise as a promising tool for deciphering the relationships between genes in large transcriptomic datasets. However, because of their low specificity and high false positive rate, they demonstrate a limited capacity to retrieve the disrupted mechanisms that lead to disease onset, progression, and maintenance. Within the context of statistical modeling, we dove deeper into the reconstruction of gene co-expression networks with the specific goal of discovering disease-specific features directly from expression data. Using ensemble techniques, which combine the results of various metrics, we were able to more precisely capture biologically significant relationships between genes. We were able to find de novo potential disease-specific features with the help of prior biological knowledge and the development of new network inference techniques. Through our different approaches, we analyzed large gene sets across multiple samples and used gene expression as a surrogate marker for the inherent biological processes, reconstructing robust gene co-expression networks that are simple to explore. 
By mining disease-specific gene co-expression networks we arrive at a useful framework for identifying new omics-phenotype associations from conditional expression datasets. In this sense, understanding diseases from the perspective of biological network perturbations will improve personalized medicine, impacting rational biomarker discovery, patient stratification and drug design, and ultimately leading to more targeted therapies. Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informátic
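The idea of combining several association metrics into one co-expression score can be illustrated with a minimal sketch. This is not the method from the thesis: the choice of Pearson and Spearman correlation, the plain averaging, the `threshold` value, and the function name `ensemble_network` are all assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length lists (undefined for constant input)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def ensemble_network(expr, threshold=0.8):
    """Keep gene pairs whose averaged |Pearson| and |Spearman| score
    reaches the (hypothetical) threshold. `expr` maps gene -> expression list."""
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        score = (abs(pearson(expr[g1], expr[g2]))
                 + abs(spearman(expr[g1], expr[g2]))) / 2
        if score >= threshold:
            edges[(g1, g2)] = score
    return edges


# Toy expression profiles across five samples (made-up numbers).
toy = {"A": [1, 2, 3, 4, 5], "B": [2, 4, 6, 8, 10], "C": [5, 1, 4, 2, 3]}
network = ensemble_network(toy)
```

In this toy input only genes `A` and `B` are perfectly co-expressed, so they form the single retained edge. Real pipelines would typically also need multiple-testing control and handling of constant expression profiles, which this sketch omits.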

    Using MapReduce Streaming for Distributed Life Simulation on the Cloud

    Distributed software simulations are indispensable in the study of large-scale life models but often require the use of technically complex lower-level distributed computing frameworks, such as MPI. We propose to overcome the complexity challenge by applying the emerging MapReduce (MR) model to distributed life simulations and by running such simulations on the cloud. Technically, we design optimized MR streaming algorithms for discrete and continuous versions of Conway’s life according to a general MR streaming pattern. We chose life because it is simple enough as a testbed for MR’s applicability to a-life simulations and general enough to make our results applicable to various lattice-based a-life models. We implement and empirically evaluate our algorithms’ performance on Amazon’s Elastic MR cloud. Our experiments demonstrate that a single MR optimization technique called strip partitioning can reduce the execution time of continuous life simulations by 64%. To the best of our knowledge, we are the first to propose and evaluate MR streaming algorithms for lattice-based simulations. Our algorithms can serve as prototypes in the development of novel MR simulation algorithms for large-scale lattice-based a-life models.
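To make the map and reduce roles concrete, here is a minimal in-process sketch of one generation of discrete Conway's life expressed in a streaming style (tab-separated key/value text lines). It is a plain cell-keyed formulation, not the authors' optimized algorithm, and it does not implement strip partitioning; the function names and the `x,y` line format are assumptions, and `sorted()` merely stands in for the framework's shuffle phase.

```python
from itertools import groupby


def mapper(lines):
    """Map step: each input line is one live cell written as 'x,y'."""
    for line in lines:
        x, y = map(int, line.strip().split(","))
        # Mark the cell itself as currently alive.
        yield f"{x},{y}\tA"
        # Contribute one neighbour count to each of the 8 surrounding cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    yield f"{x + dx},{y + dy}\t1"


def reducer(sorted_lines):
    """Reduce step: lines arrive grouped by key, as they would after a sort."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [value for _, value in group]
        alive = "A" in values
        neighbours = sum(1 for value in values if value == "1")
        # Standard rules: birth on 3 neighbours, survival on 2 or 3.
        if neighbours == 3 or (alive and neighbours == 2):
            yield key


def step(live_cells):
    """Advance one generation; returns the set of live cells as 'x,y' strings."""
    return set(reducer(sorted(mapper(live_cells))))


# A horizontal "blinker" becomes vertical after one generation.
blinker = {"0,1", "1,1", "2,1"}
next_generation = step(blinker)
```

In an actual streaming deployment the mapper and reducer would be separate scripts reading standard input and writing standard output, with the framework handling the sort between them.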

    Proceedings, MSVSCC 2015

    The Virginia Modeling, Analysis and Simulation Center (VMASC) of Old Dominion University hosted the 2015 Modeling, Simulation, & Visualization Student Capstone Conference on April 16th. The Capstone Conference features students in Modeling and Simulation undergraduate and graduate degree programs and related fields from many colleges and universities. Students present their research to an audience of fellow students, faculty, judges, and other distinguished guests. For the students, these presentations afford the opportunity to share their innovative research with members of the M&S community from academic, industry, and government backgrounds. Also participating in the conference are faculty and judges who have volunteered their time to directly support their students’ research, facilitate the various conference tracks, serve as judges for each of the tracks, and provide overall assistance to this conference. 2015 marks the ninth year of the VMASC Capstone Conference for Modeling, Simulation and Visualization. This year our conference attracted a number of fine student-written papers and presentations, resulting in a total of 51 research works that were presented. This year’s conference had record attendance thanks to the support from the various departments at Old Dominion University, other local universities, and the United States Military Academy at West Point. We greatly appreciate all of the work and energy that has gone into this year’s conference; it truly was a highly collaborative effort that has resulted in a very successful symposium for the M&S community and all of those involved. Below you will find a brief summary of the best papers and best presentations with some simple statistics of the overall conference contribution. Following that is a table of contents, broken down by conference track category, with a copy of each included body of work. 
Thank you again for your time and your contribution, as this conference is designed to continuously evolve and adapt to better suit the authors and M&S supporters. Dr. Yuzhong Shen, Graduate Program Director, MSVE, Capstone Conference Chair. John Shull, Graduate Student, MSVE, Capstone Conference Student Chai