9 research outputs found

    mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications

    High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. Equally important, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure and are not simple post-processing steps, and thus are performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net
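
    The idea of SNP-aware multi-mapping described in this abstract can be illustrated with a minimal sketch. It is not mrsFAST-Ultra's method: the abstract says these features live inside a compact index, whereas the code below simply scans every offset by brute force. The function names and the `snp_alleles` dictionary (position mapped to a set of known alternative bases) are hypothetical and chosen only for illustration.

```python
def snp_aware_mismatches(read, reference, offset, snp_alleles):
    """Count mismatches of `read` against `reference` at `offset`,
    not charging a mismatch when the read base is a known SNP allele.

    `snp_alleles` is a hypothetical dict: reference position -> set of bases.
    """
    mismatches = 0
    for i, base in enumerate(read):
        pos = offset + i
        if base == reference[pos]:
            continue  # exact match
        if base in snp_alleles.get(pos, ()):
            continue  # differs from reference, but matches a known SNP allele
        mismatches += 1
    return mismatches


def map_read(read, reference, snp_alleles, max_errors):
    """Return ALL offsets where the read aligns within `max_errors`
    mismatches (multi-mapping), using a naive scan instead of an index."""
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        if snp_aware_mismatches(read, reference, offset, snp_alleles) <= max_errors:
            hits.append(offset)
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    read = "ACTT"
    # With a known G/T SNP at position 2 the read maps with zero errors.
    print(map_read(read, reference, {2: {"T"}}, max_errors=0))
    # Without SNP information the same read needs one allowed error.
    print(map_read(read, reference, {}, max_errors=1))
```

    In this toy setting, supplying SNP information lets a read map under a stricter error threshold, which is the effect the abstract describes as increasing the number of mappable reads.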

    A Linkage Map for the Newt Notophthalmus viridescens: Insights in Vertebrate Genome and Chromosome Evolution

    Genetic linkage maps are fundamental resources that enable diverse genetic and genomic approaches, including quantitative trait locus (QTL) analyses and comparative studies of genome evolution. It is straightforward to build linkage maps for species that are amenable to laboratory culture and genetic crossing designs, and that have relatively small genomes and few chromosomes. It is more difficult to generate linkage maps for species that do not meet these criteria. Here, we introduce a method to rapidly build linkage maps for salamanders, which are known for their enormous genome sizes. As proof of principle, we developed a linkage map with thousands of molecular markers (N=2349) for the Eastern newt (Notophthalmus viridescens). The map contains 12 linkage groups (152.3–934.7 cM), only one more than the number of chromosome pairs. Importantly, this map was generated using RNA isolated from a single wild-caught female and her 28 offspring. We used the map to reveal chromosome-scale conservation of synteny among N. viridescens, A. mexicanum (Urodela), and chicken (Amniota), and to identify large conserved segments between N. viridescens and Xenopus tropicalis (Anura). We also show that met1, a major effect QTL that regulates the expression of alternate metamorphic and paedomorphic modes of development in Ambystoma, associates with a chromosomal fusion that is not found in the N. viridescens map. Our results shed new light on the ancestral amphibian karyotype and reveal specific fusion and translocation events that shaped the genomes of three amphibian model taxa. The ability to rapidly build linkage maps for large salamander genomes will enable genetic and genomic analyses within this important vertebrate group, and more generally, empower comparative studies of vertebrate biology and evolution.

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to use this information effectively to solve classical and new problems remains underexplored. In this thesis, we explored how to use this information in four problems, i.e. the sequence orthology search problem, the multiple alignment improvement problem, the short read mapping problem, and the genome rearrangement inference problem. For the first problem, we developed an EM algorithm to iteratively align a query with a multiple alignment database, using the information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracy of both. For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experimental results showed that our algorithm is very stable in terms of the resulting alignments. The results showed that our method is more accurate than existing methods, i.e. Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization goal function leads to a degenerate solution in which the inferred order is the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences.

    CHARACTERIZATION OF A LARGE VERTEBRATE GENOME AND HOMOMORPHIC SEX CHROMOSOMES IN THE AXOLOTL, AMBYSTOMA MEXICANUM

    Changes in the structure, content and morphology of chromosomes accumulate over evolutionary time and contribute to cell, developmental and organismal biology. The axolotl (Ambystoma mexicanum) is an important model for studying these changes because: 1) it provides important phylogenetic perspective for reconstructing the evolution of vertebrate genomes and amphibian karyotypes, 2) its genome has evolved to a large size (~10X larger than human) but has maintained gene orders, and 3) it possesses potentially young sex chromosomes that have not undergone extensive differentiation in the structure that is typical of many other vertebrate sex chromosomes (e.g. mammalian XY chromosomes and avian ZW chromosomes). Early chromosomal studies were performed through cytogenetics, but more recent methods involving next generation sequencing and comparative genomics can reveal new information. Due to the large size and inherent complexity of the axolotl genome, multiple approaches are needed to cultivate the genomic and molecular resources essential for expanding its utility in modern scientific inquiries. This dissertation describes our efforts to improve the genomic and molecular resources for the axolotl and other salamanders, with the aim of better understanding the events that have driven the evolution of vertebrate (and amphibian) chromosomes. First, I review our current state of knowledge with respect to genome and karyotype evolution in the amphibians, present a case for studying sex chromosome evolution in the axolotl, and discuss solutions for performing analyses of large vertebrate genomes. In the second chapter, I present a study that resulted in the optimization of methods for the capture and sequencing of individual chromosomes and demonstrate the utility of the approach in improving the existing Ambystoma linkage map and generating targeted assemblies of individual chromosomes. In the third chapter, I present a published work that focuses on using this approach to characterize the two smallest chromosomes and provides an initial characterization of the huge axolotl genome. In the fourth chapter, I present another study that details the development of a dense linkage map for a newt, Notophthalmus viridescens, and its use in comparative analyses, including the discovery of a specific chromosomal fusion event in Ambystoma at the site of a major effect quantitative trait locus for metamorphic timing. I then describe the characterization of the relatively undifferentiated axolotl sex chromosomes, identification of a tiny sex-specific (W-linked) region, and a strong candidate for the axolotl sex-determining gene. Finally, I provide a brief discussion that recapitulates the main findings of each study, their utility in current studies, and future research directions. The research in this dissertation has enriched this important model with genomic and molecular resources that enhance its use in modern scientific research. The information provided by evolutionary studies of axolotl chromosomes sheds critical light on vertebrate genome and chromosome evolution, specifically among amphibians, an underrepresented vertebrate clade in genomics, and in homomorphic sex chromosomes, which have been largely unstudied in amphibians.

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.
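
    The spaced seeds mentioned in this abstract can be sketched briefly. A spaced seed samples only some positions of each window, so a mismatch under a skipped position does not change the extracted k-mer. The code below is a generic illustration under stated assumptions, not Seed-Kraken's implementation: the mask "1101" and the function names are hypothetical.

```python
def spaced_kmers(sequence, mask):
    """Return the spaced k-mers of `sequence` for a mask such as "1101".

    Bases under '1' are kept and bases under '0' are skipped, so each
    window of len(mask) bases yields a k-mer of mask.count("1") bases.
    """
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    span = len(mask)
    result = []
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        result.append("".join(window[i] for i in keep))
    return result


def shared_seed_count(read, genome, mask):
    """Count spaced k-mers of `read` that also occur in `genome`."""
    genome_seeds = set(spaced_kmers(genome, mask))
    return sum(1 for seed in spaced_kmers(read, mask) if seed in genome_seeds)


if __name__ == "__main__":
    mask = "1101"  # hypothetical seed: third position is a "don't care"
    print(spaced_kmers("ACGTAC", mask))
    # A substitution under the skipped position leaves the seed unchanged.
    print(spaced_kmers("ACGT", mask) == spaced_kmers("ACTT", mask))
```

    A classifier built on this idea would look up each spaced k-mer of a read in an index of reference seeds; which mask works best is an empirical question that this sketch does not address.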

    Sensitive and fast mapping of di-base encoded reads

    No full text

    A systems based approach to neutrophil gene expression

    Neutrophils are the major cellular constituent of blood leukocytes and play a central role in the inflammatory response, expressing an array of destructive molecules and antimicrobial processes that characterise the cells as front-line defenders of the innate immune system; neutrophils are thus crucial to host defence. It is now appreciated that neutrophils produce and respond to a variety of inflammatory signals and are able to regulate both the innate and adaptive immune response. The molecular changes that underlie this regulation are poorly defined, yet represent an attractive area of research to fully elucidate the role and regulatory capacity of neutrophils within the immune response. RNA-Seq provides an accurate and robust mechanism for global characterisation of cellular transcripts. Neutrophils were isolated from healthy donors and incubated with or without inflammatory cytokines for 1 h. RNA was extracted and analysed by RNA-Seq using the SOLiD or Illumina platforms. Raw data were quantified using a number of software packages that formed a bioinformatic pipeline for data analysis, developed during the course of the research. Results were validated by a selection of traditional laboratory functional assays. Priming of neutrophils by GM-CSF and TNFα was found to induce differential gene expression and activation of transcription factors, which led to differential regulation of apoptotic pathways. Stimulation of neutrophils with inflammatory cytokines/chemokines (IL-1β, IL-8, G-CSF, IFNγ) resulted in expression of discrete gene sets and differential activation of signalling pathways. Stimulation of neutrophils with IL-6 did not induce any significant expression of genes but resulted in activation of STAT signalling. Comparison of gene expression of neutrophils isolated by density gradient and magnetic bead preparation revealed significant differences in gene expression and function, in part attributable to levels of contamination associated with each isolation method. Bead isolation was found to enrich a more heterogeneous neutrophil population including a subpopulation of neutrophils expressing transcripts previously associated with low density granulocytes. Thus, RNA-Seq and bioinformatic analysis have provided a full characterisation of neutrophil gene expression under inflammatory conditions and identified several new areas of research that could lead to targeted drug design for the treatment of inflammatory disease.