14 research outputs found

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements statistics for individual genomic positions using compact bit counters.
Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up
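
    As a rough illustration of the spaced-seed idea mentioned in this abstract, the sketch below extracts spaced k-mers from a read: a binary mask selects which positions of each window are kept, so mismatches at the skipped positions do not change the extracted seed. The function name `spaced_kmers` and the mask `1101` are hypothetical choices made for this example; they are not taken from Seed-Kraken or Kraken.

```python
def spaced_kmers(seq, mask):
    """Yield the spaced k-mers of seq under a binary mask ('1' = keep, '0' = skip)."""
    # Positions inside each window that contribute to the seed.
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    # Slide a window of the mask's full length along the sequence.
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


if __name__ == "__main__":
    # Hypothetical read and mask, for illustration only.
    for seed in spaced_kmers("ACGTACGT", "1101"):
        print(seed)
```

    With a mask made only of `1`s this reduces to ordinary contiguous k-mers, which makes the two seeding schemes easy to compare on the same data.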

    Using Nanopore Sequencing to Interrogate the Genome and Epigenome

    This work uses native RNA nanopore sequencing to directly characterize the transcriptome of a human cell line, GM12878. We demonstrate several new methods and findings, including newly discovered isoforms, allele-specific isoforms, measurement of polyadenylation length, and measurement of RNA modifications. We also describe an application of nanopore RNA sequencing and chemical labeling to measure the secondary structure of RNA. Lastly, we demonstrate an analysis framework built around a new file format for single-molecule/long-read modification data

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    2021 Spring. Includes bibliographical references. Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation.
Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification
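
    To make the general technique concrete, here is a minimal sketch of a multinomial naive Bayes classifier over k-mer counts. It is not WarpNL: the function names, the uniform class prior, the add-one smoothing (`alpha`), and the toy reference sequences are all assumptions made for this example. The returned `margin` is simply the log-likelihood difference between the two best classes, a simple stand-in for a log-odds comparison, and not necessarily the log-odds ratio defined in the thesis.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labelled, k, alpha=1.0):
    """Count k-mers per class; labelled is a list of (label, sequence) pairs."""
    counts = {}
    for label, seq in labelled:
        counts.setdefault(label, Counter()).update(kmers(seq, k))
    vocab = 4 ** k  # number of possible DNA k-mers (assumes an A/C/G/T alphabet)
    # Store each class's counts with its smoothed denominator.
    return {lab: (c, sum(c.values()) + alpha * vocab) for lab, c in counts.items()}


def log_likelihood(model, label, seq, k, alpha=1.0):
    """Smoothed multinomial log-likelihood of seq under one class."""
    c, total = model[label]
    return sum(math.log((c[m] + alpha) / total) for m in kmers(seq, k))


def classify(model, seq, k, alpha=1.0):
    """Return (best label, log-likelihood margin over the runner-up)."""
    # Uniform class prior assumed, so the prior term cancels in the comparison.
    scores = {lab: log_likelihood(model, lab, seq, k, alpha) for lab in model}
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else math.inf
    return ranked[0], margin


if __name__ == "__main__":
    # Toy reference sequences, invented for illustration.
    refs = [("A_rich", "AAAAAAAAAA"), ("C_rich", "CCCCCCCCCC")]
    model = train(refs, k=2)
    print(classify(model, "AAAACAAA", k=2))
```

    A small margin suggests the top two classes are hard to tell apart for that query, which is one way such a score could be used to flag ambiguous assignments.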

    Developing bioinformatics approaches for the analysis of influenza virus whole genome sequence data

    Influenza viruses represent a major public health burden worldwide, resulting in an estimated 500,000 deaths per year, with potential for devastating pandemics. Considerable effort is expended in the surveillance of influenza, including major World Health Organization (WHO) initiatives such as the Global Influenza Surveillance and Response System (GISRS). To this end, whole-genome sequencing (WGS), and corresponding bioinformatics pipelines, have emerged as powerful tools. However, due to the inherent diversity of influenza genomes, circulation in several different host species, and noise in short-read data, several pitfalls can appear during bioinformatics processing and analysis. Conventional mapping approaches can be insufficient when a sub-optimal reference strain is chosen. For short-read datasets simulated from human-origin influenza H1N1 HA sequences, read recovery after single-reference mapping was routinely as low as 90% for human-origin influenza sequences, and often lower than 10% for those from avian hosts. To this end, I developed software using de Bruijn graphs (DBGs) for classification of influenza WGS datasets: VAPOR. In real data benchmarking using 257 WGS read sets with corresponding de novo assemblies, VAPOR provided classifications for all samples with a mean of >99.8% identity to assembled contigs. This resulted in an increase of the number of mapped reads by 6.8% on average, up to a maximum of 13.3%. Additionally, using simulations, I demonstrate that classification from reads may be applied to detection of reassorted strains. The approach used in this study has the potential to simplify bioinformatics pipelines for surveillance, providing a novel method for detection of influenza strains of human and non-human origin directly from reads, minimization of potential data loss and bias associated with conventional mapping, and facilitating alignments that would otherwise require slow de novo assembly.
Whilst with expertise and time these pitfalls can largely be avoided, with pre-classification they are remedied in a single step. Furthermore, this algorithm could be adapted in future to surveillance of other RNA viruses. VAPOR is available at https://github.com/connor-lab/vapor. Lastly, VAPOR could be improved by future implementation in C++, and should employ more efficient methods for DBG representation

    On Computable Protein Functions

    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse, high-dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins

    The Development of Candidate Therapeutics for Transmissible Spongiform Encephalopathies

    Previous studies have shown that adding recombinant prion protein (rPrP) to a cell-free prion replication assay, protein misfolding cyclic amplification (PMCA), inhibits the formation of PrPSc. Naturally occurring variants of ovine prion protein (rARQ, rARR and rVRQ) were previously tested in this assay. Of these, rVRQ was the most potent inhibitor of amplification of different scrapie isolates (IC50 120 nM) and bovine BSE (IC50 171 nM). The main aim of this study was to produce additional molecular clones for expression of recombinant ovine prion protein in which codon 136 had been mutated to code for different amino acids. These rPrPs were tested in dose-response experiments to investigate whether the change at position 136 in ovine PrP could affect the ability to inhibit or stop prion protein misfolding compared to the previously tested rVRQ. Site-directed mutagenesis was used to produce rPrP mutants at codon 136. All rPrPs were purified by metal affinity chromatography, taking advantage of the metal-binding properties of the PrP molecule. All mutated rPrPs were added to PMCA at different concentrations and compared to rVRQ. After amplification, samples were digested with Proteinase K (100 µg/ml) and quantified on immunoblots. The best inhibitors were tested with different ovine scrapie (ARQ/VRQ, VRQ/VRQ, AHQ/VRQ), bovine BSE and ovine BSE isolates (ARQ/ARQ). The results showed that three of the recombinant prion proteins, rRRQ, rKRQ and rPRQ (with arginine, lysine and proline at position 136, respectively), inhibited PrPSc misfolding significantly better than the naturally occurring rVRQ. The structure of the rPrP variants and the amino acid substitutions at position 136 were analysed, and peptides of different lengths containing valine, arginine, lysine or proline at position 136 were designed. None of these peptides, when analysed in PMCA, gave levels of inhibition similar to those of the equivalent full-length recombinant prion protein.
Moreover, structural analysis showed that introducing amino acids with longer side chains at position 136 did not alter the overall scaffold of the prion protein. In addition, the longer side chains of arginine136 and lysine136, or the pyrrolidine loop of proline, could result in more interatomic bonding than valine136 and could therefore act to stabilize the whole PrP molecule. Furthermore, the longer side chains of arginine136 and lysine136 would not be predicted to cause further structural changes because of the ‘structural’ pocket present on the opposite site of position 136 in ovine PrP. The Rov9 cell line could be persistently infected with processed (heated and sonicated) scrapie brain homogenate SSBP1 (VRQ/VRQ) and with SSBP1-derived, NaPTA-precipitated PrPSc. In both cases, PrPres was detected in lysates from cells induced with 1 µg/ml doxycycline and when 500 µg of total protein was digested with 20 µg/ml of PK, followed by PrPres concentration by centrifugation. The best inhibitory rPrPs were used in experiments to prevent infection with the SSBP1 isolate or to reduce PrPres in persistently infected Rov9 cells. Addition of 250 nM of rRRQ, rKRQ and rPRQ prevented the infection of Rov9 cells at culture passage 1. The rPrP variants showed more promising results than natural rVRQ. In contrast, no significant reduction of PrPres was observed when persistently infected Rov9 cells were treated with 250 nM of either the variant or the natural rPrPs for 4 days. Overall, this work demonstrated a novel therapeutic approach for prion diseases using recombinant prion proteins. The recombinant protein treatment was effective not only in the scrapie model but also against other TSEs, and therefore these rPrPs or an analogous strategy could be applied as a potential human TSE therapeutic
