    Distance-based methods for the analysis of Next-Generation sequencing data

    The analysis of NGS data is a central aspect of modern genomic research. However, extracting data from the two most commonly used source organisms poses a variety of problems. The first chapter presents a novel approach that determines a distance between cancer cell line cultures based on their small genomic variants in order to identify the cultures. A whole-exome-sequenced culture is identified through pairwise comparisons with reference datasets if a measured distance is smaller than would be expected for unrelated cultures. The effectiveness of the method was verified, but limitations remain because only the whole-exome sequencing format is supported. The second chapter therefore presents a published modification of the approach that adds support for the widely used bulk RNA sequencing and panel sequencing. However, broadening the technology base amplifies confounding effects, which lead to violations of the mathematical conditions of a distance metric. These violations are therefore first quantified by statistical methods and then successfully compensated by dynamic threshold adjustments. The third chapter presents a novel data augmentation method that enables the training of machine learning models in the absence of neoplastic training data. An abstract distance measure between neoplastic entities and entities of healthy origin is established by means of a transcriptomic deconvolution. The output of the deconvolution then allows the effective prediction of clinical characteristics of rare yet biologically diverse cancer types, with the predictive power of the method matching that of the established gold standard.
    The analysis of NGS data is a central aspect of modern molecular genetics and oncology. The first scientific contribution is the development of a method that identifies whole-exome-sequenced cancer cell lines (CCL) by quantifying a distance between their sets of small genomic variants. A distinguishing aspect of the method is that it was designed for the computer-based identification of NGS-sequenced CCL. An unknown CCL is identified when its abstract distance to a known CCL is smaller than expected by chance. The method performed favorably in benchmarks but supported only the whole-exome sequencing technology. The second contribution therefore extended the identification method to additionally support the bulk mRNA-sequencing technology and the panel-sequencing format. However, the technological extension incurred predictive biases that detrimentally affected the quantification of abstract distances. Hence, statistical methods were introduced to quantify and compensate for confounding factors. The extended method showed a heterogeneity-robust benchmark performance at the cost of a slightly reduced sensitivity compared to the whole-exome sequencing method. The third contribution is a method that trains machine learning models for rare and diverse cancer types. Machine learning models are subsequently trained on deconvolution-derived distances between neoplastic entities and entities of healthy origin to predict clinically relevant characteristics. The performance of such-trained models was comparable to that of models trained on both the substituted neoplastic data and the gold-standard biomarker Ki-67. No proliferation rate-indicative features were utilized to predict clinical characteristics, which is why the method can complement the proliferation rate-oriented pathological assessment of biopsies. The thesis revealed that the quantification of an abstract distance can address sources of erroneous NGS data analysis.
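
    The identification rule described in this abstract (an unknown cell line is matched to a reference when the distance between their variant sets falls below what is expected by chance) can be illustrated with a minimal sketch. The abstract does not name the distance or the threshold procedure, so the Jaccard distance, the fixed `threshold`, the function names, and the example variants below are all assumptions made for illustration, not the thesis's method.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two variant sets (0 = identical, 1 = disjoint).

    Assumption: Jaccard is used here only as a stand-in for the abstract
    distance mentioned in the abstract.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def identify(query, references, threshold):
    """Return (name, distance) of the closest reference below `threshold`.

    Returns None when no reference is closer than the threshold, i.e. the
    query is not identified. `threshold` is a hypothetical fixed cutoff.
    """
    best = None
    for name, variants in references.items():
        d = jaccard_distance(query, variants)
        if d < threshold and (best is None or d < best[1]):
            best = (name, d)
    return best


# Hypothetical variants encoded as (chromosome, position, ref, alt).
query = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "T", "G")}
references = {
    "line_A": query | {("chr4", 400, "C", "A")},
    "line_B": {("chr5", 500, "G", "T")},
}
match = identify(query, references, threshold=0.5)
```

    In this toy setup the query shares three of four variants with `line_A` and none with `line_B`, so only `line_A` falls below the cutoff.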

    Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires

    The adaptive immune system recognizes antigens via an immense array of antigen-binding antibodies and T-cell receptors, the immune repertoire. The interrogation of immune repertoires is of high relevance for understanding the adaptive immune response in disease and infection (e.g., autoimmunity, cancer, HIV). Adaptive immune receptor repertoire sequencing (AIRR-seq) has driven the quantitative and molecular-level profiling of immune repertoires, thereby revealing the high-dimensional complexity of the immune receptor sequence landscape. Several methods for the computational and statistical analysis of large-scale AIRR-seq data have been developed to resolve immune repertoire complexity in order to understand the dynamics of adaptive immunity. Here, we review the current research on (i) diversity, (ii) clustering and network, (iii) phylogenetic and (iv) machine learning methods applied to dissect, quantify and compare the architecture, evolution, and specificity of immune repertoires. We summarize outstanding questions in computational immunology and propose future directions for systems immunology towards coupling AIRR-seq with the computational discovery of immunotherapeutics, vaccines, and immunodiagnostics. Comment: 27 pages, 2 figures
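
    As a small illustration of the diversity methods named in this abstract, the sketch below computes the Shannon entropy of a repertoire and a normalized clonality score from a list of clonotype labels. The abstract does not specify which diversity indices the review covers, so the choice of Shannon entropy, the function names, and the toy clonotype labels are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(clonotypes):
    """Shannon entropy (natural log) of clonotype frequencies."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def clonality(clonotypes):
    """1 - normalized entropy: 0 for an even repertoire, 1 for a single clone."""
    n_clones = len(set(clonotypes))
    if n_clones < 2:
        return 1.0  # a single clone (or empty input) is treated as fully clonal
    return 1.0 - shannon_entropy(clonotypes) / math.log(n_clones)


# Hypothetical clonotype labels, one entry per sequenced read.
even_repertoire = ["c1", "c2", "c3", "c4"]
skewed_repertoire = ["c1"] * 97 + ["c2", "c3", "c4"]
```

    With these definitions the evenly distributed repertoire has clonality 0, while the repertoire dominated by one clone scores closer to 1.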

    A NOVEL COMPUTATIONAL FRAMEWORK FOR TRANSCRIPTOME ANALYSIS WITH RNA-SEQ DATA

    The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Starting directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Building on the identification of alternative splicing modules, the framework then performs precise and efficient differential analysis at automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency in transcription among samples, removing the restriction of presumed grouping. The performance of the framework has been demonstrated on a series of simulation studies and real datasets, including the Cancer Genome Atlas (TCGA) breast cancer analysis. These successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of disease.

    Hybrid capture data unravel a rapid radiation of pimpliform parasitoid wasps (Hymenoptera: Ichneumonidae: Pimpliformes)

    The parasitoid wasp family Ichneumonidae is among the most diverse groups of organisms, with conservative estimates suggesting that it contains more species than all vertebrates together. However, ichneumonids are also among the most severely understudied groups, and our understanding of their evolution is hampered by the lack of a robust higher‐level phylogeny of this group. Based on newly generated transcriptome sequence data, which were filtered according to several criteria of phylogenetic informativeness, we developed target DNA enrichment baits to capture 93 genes across species of Ichneumonidae. The baits were applied to DNA of 55 ichneumonids, with a focus on Pimpliformes, an informal group containing nine subfamilies. Phylogenetic trees were inferred under maximum likelihood and Bayesian approaches, at both the nucleotide and amino acid levels. We found maximum support for the monophyly of Pimpliformes but low resolution and very short branches close to its base, strongly suggesting a rapid radiation. Two genera and one genus‐group were consistently recovered in unexpected parts of the tree, prompting changes in their higher‐level classification: Pseudorhyssa Merrill, currently classified in the subfamily Poemeniinae, is transferred to the tribe Delomeristini within Pimplinae, and Hemiphanes Förster is moved from Orthocentrinae to Cryptinae. Likewise, the tribe Theroniini is resurrected for the Theronia group of genera (stat. rev.). Phylogenetic analyses, in which we gradually increased the numbers of genes, revealed that the initially steep increase in mean clade support slows down at around 40 genes, and consideration of up to 93 genes still left various nodes in the inferred phylogenetic tree poorly resolved. 
It remains to be shown whether more extensive gene or taxon sampling can resolve the early evolution of the pimpliform subfamilies. This is the pre-peer reviewed version of the following article: Klopfstein, S., Langille, B., Spasojevic, T., Broad, G.R., Cooper, S.J.B., Austin, A.D. and Niehuis, O. (2019), Hybrid capture data unravel a rapid radiation of pimpliform parasitoid wasps (Hymenoptera: Ichneumonidae: Pimpliformes). Syst Entomol, 44: 361-383, which has been published in final form at doi:10.1111/syen.12333. The attached document is the authors’ submitted version of the journal article.

    Meta Analysis of Gene Expression Data within and Across Species

    Since the second half of the 1990s, a large number of genome-wide analyses have been described that study gene expression at the transcript level. To this end, two major strategies have been adopted: the first relies on hybridization techniques such as microarrays, and the second on sequencing techniques such as serial analysis of gene expression (SAGE), cDNA-AFLP, and analysis based on expressed sequence tags (ESTs). Although both types of profiling experiments have become routine techniques in many research groups, their application remains costly and laborious. As a result, the number of conditions profiled in individual studies is still relatively small, usually ranging from only two to a few hundred samples for the largest experiments. Increasingly, scientific journals require these high-throughput experiments to be deposited in public databases upon publication. Mining the information present in these databases offers molecular biologists the possibility to view their own small-scale analysis in the light of what is already available. So far, however, the richness of the public information remains largely unexploited. Several obstacles impede the successful mining of these data: the difficulty of correctly associating ESTs and microarray probes with the corresponding gene transcript, the incompleteness and inconsistency in the annotation of experimental conditions, and the lack of standardized experimental protocols to generate gene expression data. Here, we review the potential and difficulties of combining publicly available expression data from EST analyses and microarray experiments. With examples from the literature, we show how meta-analysis of expression profiling experiments can be used to study expression behavior in a single organism or between organisms, across a wide range of experimental conditions. We also provide an overview of the methods and tools that can aid molecular biologists in exploiting these public data.

    Technology dictates algorithms: Recent developments in read alignment

    Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on the speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations for read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.

    Identification and Functional Annotation of Alternatively Spliced Isoforms

    Alternative splicing is a key mechanism for increasing the complexity of the transcriptome and proteome in eukaryotic cells. A large portion of multi-exon genes in humans undergo alternative splicing, and this can have significant functional consequences, as the proteins translated from alternatively spliced mRNA might have different amino acid sequences and structures. The study of alternative splicing events has been accelerated by next-generation sequencing technology. However, reconstruction of transcripts from short-read RNA sequencing is not sufficiently accurate. Recent progress in single-molecule long-read sequencing has provided researchers with alternative ways to help solve this problem. With the help of both short and long RNA sequencing technologies, tens of thousands of splice isoforms have been catalogued in humans and other species, but relatively few of the protein products of splice isoforms have been characterized functionally, structurally and biochemically. The scope of this dissertation includes using short and long RNA sequencing reads together for transcript reconstruction, and using high-throughput RNA-sequencing data and gene-level gene ontology functional annotations to predict functions for alternatively spliced isoforms in mouse and human. In the first chapter, I introduce alternative splicing and discuss existing studies in which next-generation sequencing is used for transcript identification. Then, I define the isoform function prediction problem and explain how it differs from the better-known gene function prediction problem. In the second chapter of this dissertation, I describe our study in which the overall transcriptome of the kidney is studied using both long reads from the PacBio platform and RNA-seq short reads from the Illumina platform.
We used short reads to validate full-length transcripts found by long PacBio reads, and generated two high-quality sets of transcript isoforms that are expressed in glomerular and tubulointerstitial compartments. In the third chapter, I describe our generic framework, in which we implemented and evaluated several related algorithms for isoform function prediction for mouse isoforms. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm is the first effort to predict and differentiate isoform functions through large-scale genomic data integration. In the fourth chapter, I present the extension of the isoform function prediction study to protein-coding isoforms in human. We used a similar multiple instance learning (MIL)-based approach to predict the function of protein-coding splice variants in human. We evaluated our predictions using literature evidence for the ADAM15, LMNA/C, and DMXL2 genes. In the fifth and final chapter, I summarize the previous chapters and outline future directions for alternatively spliced isoform reconstruction and function prediction studies. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144017/1/ridvan_1.pd

    A practical guide to design and assess a phylogenomic study

    Over the last decade, molecular systematics has undergone a change of paradigm as high-throughput sequencing now makes it possible to reconstruct evolutionary relationships using genome-scale datasets. The advent of 'big data' molecular phylogenetics provided a battery of new tools for biologists but simultaneously brought new methodological challenges. The increase in analytical complexity comes at the price of highly specific training in computational biology and molecular phylogenetics, resulting very often in a polarized accumulation of knowledge (technical on one side and biological on the other). Interpreting the robustness of genome-scale phylogenetic studies is not straightforward, particularly as new methodological developments have consistently shown that the general belief of 'more genes, more robustness' often does not apply, and because there is a range of systematic errors that plague phylogenomic investigations. This is particularly problematic because phylogenomic studies are highly heterogeneous in their methodology, and best practices are often not clearly defined. The main aim of this article is to present what I consider as the ten most important points to take into consideration when planning a well-thought-out phylogenomic study and while evaluating the quality of published papers. The goal is to provide a practical step-by-step guide that can be easily followed by nonexperts and phylogenomic novices in order to assess the technical robustness of phylogenomic studies or improve the experimental design of a project.