Transcript assembly and abundance estimation with high-throughput RNA sequencing
We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high-throughput RNA sequencing ("RNA-Seq") data. We evaluate these approaches through large-scale experiments on a well-studied model of muscle development.
We begin with an overview of sequencing assays and outline why the short-read alignment problem is fundamental to the analysis of these assays. We then describe two approaches to the contiguous alignment problem: one uses massively parallel graphics hardware to accelerate alignment, and the other exploits an indexing scheme based on the Burrows-Wheeler transform. We then turn to the spliced alignment problem, which is fundamental to RNA-Seq, and present TopHat, the first algorithm able to align the reads from an entire RNA-Seq experiment to a large genome without the aid of reference gene models.
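For readers less familiar with BWT-based indexing, exact matching over such an index proceeds by backward search. Below is a minimal, illustrative Python sketch, with naive rank queries standing in for the sampled FM-index structures a real aligner uses; all names are ours, not Bowtie's or TopHat's.

    # Minimal FM-index backward search: count occurrences of a pattern.
    # Sketch only; real aligners add sampled rank structures, a suffix-array
    # sample for locating hits, and mismatch handling.

    def bwt_from_text(text):
        """Build the Burrows-Wheeler transform of text (must end with '$')."""
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def backward_search(bwt, pattern):
        """Return the number of occurrences of pattern via LF-mapping."""
        # C[c] = number of characters in bwt strictly smaller than c.
        C, total = {}, 0
        for c in sorted(set(bwt)):
            C[c] = total
            total += bwt.count(c)

        def rank(c, i):
            # Occurrences of c in bwt[:i]; real indexes answer this in O(1).
            return bwt[:i].count(c)

        lo, hi = 0, len(bwt)            # current suffix-array interval
        for c in reversed(pattern):     # extend the match backwards
            if c not in C:
                return 0
            lo = C[c] + rank(c, lo)
            hi = C[c] + rank(c, hi)
            if lo >= hi:
                return 0                # pattern does not occur
        return hi - lo

    print(backward_search(bwt_from_text("ACAACG$"), "AC"))  # -> 2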
In the second part of the thesis, we present the first comparative RNA-Seq assembly algorithm, Cufflinks, which is adapted from a constructive proof of Dilworth's Theorem, a classic result in combinatorics. We evaluate Cufflinks by assembling the transcriptome from a time-course RNA-Seq experiment on developing skeletal muscle cells. The assembly contains 13,689 known transcripts and 3,724 novel ones. Of the novel transcripts, 62% were strongly supported by earlier sequencing experiments or by homologous transcripts in other organisms. We further validated interesting genes with isoform-specific RT-PCR.
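The role of Dilworth's Theorem can be made concrete: a parsimonious assembly is a minimum chain cover of a DAG of compatible fragments, and by the constructive proof (via König's theorem) its size equals the number of fragments minus a maximum bipartite matching. A toy sketch, assuming networkx is available; the overlap DAG here is hypothetical, not Cufflinks' actual graph construction.

    # Toy minimum chain cover of a fragment-compatibility DAG via
    # Dilworth's theorem: minimum chains = n - maximum bipartite matching
    # over the reachability relation. Illustrative only.
    import networkx as nx

    def min_chain_cover_size(dag):
        closure = nx.transitive_closure_dag(dag)
        # Split each node u into left (u, "L") and right (u, "R"); connect
        # them whenever v is reachable from u (fragments are compatible).
        B = nx.Graph()
        left = {u: (u, "L") for u in dag}
        right = {u: (u, "R") for u in dag}
        B.add_nodes_from(left.values(), bipartite=0)
        B.add_nodes_from(right.values(), bipartite=1)
        for u, v in closure.edges():
            B.add_edge(left[u], right[v])
        matching = nx.bipartite.maximum_matching(B, top_nodes=set(left.values()))
        return dag.number_of_nodes() - len(matching) // 2

    # Fragment 1 is compatible with 2 and 3; both 2 and 3 are compatible with 4.
    g = nx.DiGraph([(1, 2), (1, 3), (2, 4), (3, 4)])
    print(min_chain_cover_size(g))  # -> 2: two transcripts explain all fragments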
We then present the statistical model for RNA-Seq included in Cufflinks, with which we estimate transcript abundances from RNA-Seq data. Simulation studies demonstrate that the model is highly accurate. We apply this model to the muscle data and track the abundances of individual isoforms over development.
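Likelihood models of this kind are typically fit with an EM-style procedure; the following is a minimal sketch under strong simplifying assumptions (no effective-length or fragment-length modeling), so it illustrates the idea rather than the exact Cufflinks likelihood.

    # Minimal EM for isoform abundances from read-isoform compatibility.
    import numpy as np

    def em_abundances(compat, n_iter=100):
        """compat: (reads x isoforms) 0/1 matrix of compatibility."""
        n_reads, n_iso = compat.shape
        theta = np.full(n_iso, 1.0 / n_iso)       # uniform start
        for _ in range(n_iter):
            # E-step: fractionally assign each read among compatible isoforms.
            weights = compat * theta
            weights /= weights.sum(axis=1, keepdims=True)
            # M-step: abundances proportional to expected read counts.
            theta = weights.sum(axis=0) / n_reads
        return theta

    # Read 0 fits both isoforms; reads 1 and 2 fit only isoform 1.
    compat = np.array([[1., 1.], [0., 1.], [0., 1.]])
    print(em_abundances(compat))  # -> close to [0, 1], the maximum-likelihood fit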
Finally, we present significance tests for changes in relative and absolute abundances between time points, which we employ to uncover differential expression and differential regulation. By testing for relative abundance changes within and between transcripts sharing a transcription start site, we find significant shifts in the rates of alternative splicing and promoter preference in hundreds of genes, including those believed to regulate muscle development.
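To illustrate the flavor of a relative-abundance test within a transcription start site group: the isoform-fraction vectors of two time points can be compared with an information-theoretic distance such as the Jensen-Shannon metric. The sketch below computes only the statistic; calibrating its significance against estimation uncertainty, as the tests above require, is beyond it.

    # Change statistic for relative isoform usage within one TSS group:
    # the square root of the Jensen-Shannon divergence between the two
    # isoform-fraction vectors. Illustrative only.
    import numpy as np

    def js_metric(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        def kl(a, b):                     # KL divergence, base-2 entropy
            mask = a > 0
            return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
        return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

    # Isoform fractions of one gene before and after differentiation:
    print(js_metric([0.7, 0.2, 0.1], [0.2, 0.2, 0.6]))  # ~0.50, a large shift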
Novel computational techniques for mapping and classifying Next-Generation Sequencing data
Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. The ability to quickly obtain vast numbers of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies increases surpasses the growth of storage and compute capacities, creating new computational challenges in NGS data processing.
In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred mappers published, read mapping might be considered a solved problem. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we advocate so-called dynamic mapping, which we show significantly improves the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.
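A toy sketch of the dynamic mapping loop follows; a naive Hamming-distance "mapper" on a single contig stands in for a real read mapper, so this illustrates the idea rather than the simulator's implementation.

    # Toy dynamic mapping: the reference is updated online from accumulated
    # alignment evidence, so later reads map against an improved sequence.
    from collections import Counter

    def map_read(ref, read):
        """Offset minimizing Hamming distance (naive stand-in for a mapper)."""
        return min(range(len(ref) - len(read) + 1),
                   key=lambda i: sum(a != b for a, b in zip(ref[i:], read)))

    def dynamic_mapping(ref, reads, min_evidence=2):
        ref = list(ref)
        counts = [Counter() for _ in ref]          # per-position pileup
        for read in reads:
            pos = map_read(ref, read)
            for offset, base in enumerate(read):   # collect statistics
                counts[pos + offset][base] += 1
            for i, c in enumerate(counts):         # online consensus update
                if c:
                    base, n = c.most_common(1)[0]
                    if n >= min_evidence and base != ref[i]:
                        ref[i] = base
            # (a static mapper would skip the update and keep ref fixed)
        return "".join(ref)

    # Two reads supporting a G at position 2 correct the initial reference:
    print(dynamic_mapping("ACTTA", ["ACG", "CGT"]))  # -> "ACGTA"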
An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller; it maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk.
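The compact-counter idea can be sketched as follows: each genomic position keeps four few-bit nucleotide counters, and when one of them would overflow, all four are halved, so the relative proportions (which is what consensus calling needs) survive. This is an illustrative re-implementation, not Ococo's actual bit layout.

    # Sketch of compact, saturating per-position nucleotide counters.
    BITS = 3                                  # bits per counter
    CAP = (1 << BITS) - 1                     # saturation value: 7

    class PositionStats:
        def __init__(self):
            self.counts = [0, 0, 0, 0]        # A, C, G, T

        def add(self, base):
            i = "ACGT".index(base)
            if self.counts[i] == CAP:         # would overflow:
                self.counts = [c >> 1 for c in self.counts]  # halve all four
            self.counts[i] += 1

        def consensus(self):
            return "ACGT"[self.counts.index(max(self.counts))]

    pos = PositionStats()
    for base in "AAAAAAAAAAAC":               # eleven As, then one C
        pos.add(base)
    print(pos.counts, pos.consensus())        # counters stay small; calls 'A'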
Metagenomic classification of NGS reads is another major topic studied in this thesis. Given a database of thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign huge numbers of NGS reads to tree nodes and, possibly, to estimate the relative abundances of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, currently the most popular classifier. Furthermore, we introduce ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
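For readers unfamiliar with spaced seeds: only the match positions ('1') of a binary pattern contribute to the extracted seed, so mismatches under the don't-care positions ('0') are tolerated. A sketch with an illustrative pattern, not the one tuned for Seed-Kraken:

    # Extract spaced seeds from a read under a binary pattern.
    def spaced_seeds(read, pattern="110110"):
        span = len(pattern)
        keep = [i for i, ch in enumerate(pattern) if ch == "1"]
        for start in range(len(read) - span + 1):
            window = read[start:start + span]
            yield "".join(window[i] for i in keep)

    # A mismatch under a '0' position leaves the extracted seed unchanged:
    print(list(spaced_seeds("ACGTAC")))   # ['ACTA']
    print(list(spaced_seeds("ACCTAC")))   # ['ACTA'] as well: the G->C is masked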
Linear chemically sensitive electron tomography using DualEELS and dictionary-based compressed sensing
We have investigated the use of DualEELS in elementally sensitive tilt-series tomography in the scanning transmission electron microscope. A procedure is implemented using deconvolution to remove the effects of multiple scattering, followed by normalisation by the zero-loss peak intensity, to produce a signal that is linearly dependent on the projected density of the element in each pixel. This method is compared with one that does not include deconvolution (although normalisation by the zero-loss peak intensity is still performed). Additionally, we compare 3D reconstruction using a new compressed sensing algorithm, DLET, with the well-established SIRT algorithm. VC precipitates, extracted from a steel on a carbon replica, are used in this study. It is found that the use of this linear signal results in a very even density throughout the precipitates, whereas when deconvolution is omitted, a slight density reduction is observed in the cores of the precipitates (a so-called cupping artefact). Additionally, it is clearly demonstrated that the 3D morphology is much better reproduced using the DLET algorithm, with very little elongation in the missing-wedge direction. It is therefore concluded that reliable elementally sensitive tilt tomography using EELS requires the appropriate use of DualEELS together with a suitable reconstruction algorithm, such as the compressed-sensing-based algorithm used here, to make the best use of the limited data volume and signal-to-noise ratio inherent in core-loss EELS.
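For reference, the SIRT baseline is the simultaneous update x <- x + C A^T R (b - A x), where R and C hold the inverse row and column sums of the projection matrix A. A toy numpy sketch on a two-pixel problem follows; DLET's dictionary-based compressed sensing step is beyond this illustration.

    # SIRT on a tiny dense toy system (not a real tomography geometry).
    import numpy as np

    def sirt(A, b, n_iter=200):
        R = 1.0 / A.sum(axis=1)          # inverse row sums
        C = 1.0 / A.sum(axis=0)          # inverse column sums
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            x += C * (A.T @ (R * (b - A @ x)))
            x = np.clip(x, 0, None)      # enforce non-negativity
        return x

    # Two-pixel "object" seen by three projection rays:
    A = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    x_true = np.array([2.0, 3.0])
    print(sirt(A, A @ x_true))           # -> approximately [2., 3.]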
Data compression for sequencing data
Post-Sanger sequencing methods produce enormous volumes of data, and there is general agreement that the challenge of storing and processing them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. We then answer the questions “what” and “how” by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we return to the question “why compression” and give other, perhaps surprising, answers, demonstrating the pervasiveness of data compression techniques in computational biology.
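One of the fundamental ideas alluded to above is that DNA over the alphabet {A, C, G, T} needs only two bits per base, a quarter of its ASCII footprint. A sketch of that packing step follows; real tools go much further (context modeling, reference-based compression, quality-score handling), and a real codec would also store the sequence length to recover trailing bases.

    # Pack a DNA string into bytes, four bases per byte.
    ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq):
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | ENC[base]   # append 2 bits
            out.append(byte)
        return bytes(out)

    print(len("ACGTACGTACGT"), "->", len(pack("ACGTACGTACGT")), "bytes")  # 12 -> 3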
3.5Å cryoEM structure of hepatitis B virus core assembled from full-length core protein.
The capsid shell of infectious hepatitis B virus (HBV) is composed of 240 copies of a single protein called HBV core antigen (HBc). An atomic model of a core assembled from truncated HBc was determined previously by X-ray crystallography. In an attempt to obtain atomic structural information on the HBV core in a near-native, non-crystalline environment, we reconstructed a 3.5Å-resolution structure of a recombinant core assembled from full-length HBc by cryo electron microscopy (cryoEM) and derived an atomic model. The structure shows that the 240 molecules of full-length HBc form a core with two layers. The outer layer, composed of the N-terminal assembly domain, is similar to the crystal structure of the truncated HBc, but shows three differences. First, unlike the crystal structure, our cryoEM structure shows no disulfide bond between the Cys61 residues of the two subunits within the dimer building block, indicating that such a bond is not required for core formation. Second, our cryoEM structure reveals up to four more residues in the linker region (amino acids 140-149). Third, the loops in the cryoEM structure containing this linker region in subunits B and C are oriented differently (by ~30° and ~90°) from their counterparts in the crystal structure. The inner layer, composed of the C-terminal arginine-rich domain (ARD) and the ARD-bound RNAs, is partially ordered and connected with the outer layer through linkers positioned around the two-fold axes. Weak densities emanate from the rims of positively charged channels through the icosahedral three-fold and local three-fold axes. We attribute these densities to the exposed portions of some ARDs, explaining the ARD's accessibility to proteases and antibodies. Our data support a role for the ARD in mediating communication between the inside and outside of the core during HBV maturation and envelopment.
Accurate alignment of sequencing reads from various genomic origins
Ph.D. (Doctor of Philosophy)