khmer: Working with Big Data in Bioinformatics
We introduce design and optimization considerations for the 'khmer' package. Comment: Invited chapter for a forthcoming book on the Performance of Open Source Applications
Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants.
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy.
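The core mechanism lends itself to a compact sketch. The Python below is a toy illustration of the idea, not Kevlar's actual implementation; the k-mer size and abundance threshold are assumed defaults, and exact Counters stand in for the probabilistic data structures a real tool would use:

```python
from collections import Counter

K = 21  # assumed k-mer size for illustration

def kmers(seq, k=K):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def count_kmers(reads):
    counts = Counter()
    for read in reads:
        counts.update(kmers(read))
    return counts

def novel_kmers(proband_reads, parent_read_sets, min_abund=5):
    """K-mers seen at least min_abund times in the proband (high abundance,
    so unlikely to be sequencing errors) and never seen in either parent."""
    proband = count_kmers(proband_reads)
    parents = [count_kmers(reads) for reads in parent_read_sets]
    return {km for km, n in proband.items()
            if n >= min_abund and all(km not in p for p in parents)}

def interesting_reads(proband_reads, novel):
    """Reads carrying at least one novel k-mer; these are the reads that
    would be partitioned by shared k-mer content before variant calling."""
    return [r for r in proband_reads if any(km in novel for km in kmers(r))]
```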
Differentially-Expressed Pseudogenes in HIV-1 Infection.
Not all pseudogenes are transcriptionally silent, as was previously thought. Pseudogene transcripts, although not translated, contribute to the non-coding RNA pool of the cell that regulates the expression of other genes. Pseudogene transcripts can also directly compete with the parent gene transcripts for mRNA stability and other cell factors, modulating their expression levels. Tissue-specific and cancer-specific differential expression of these "functional" pseudogenes has been reported. To ascertain potential pseudogene:gene interactions in HIV-1 infection, we analyzed transcriptomes from infected and uninfected T-cells and found that 21 pseudogenes are differentially expressed in HIV-1 infection. This is interesting because the parent genes of one-third of these differentially-expressed pseudogenes are implicated in the HIV-1 life cycle, and the parent genes of half of these pseudogenes are involved in different viral infections. Our bioinformatics analysis identifies candidate pseudogene:gene interactions that may be of significance in HIV-1 infection. Experimental validation of these interactions would establish that retroviruses exploit this newly-discovered layer of host gene expression regulation for their own benefit.
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory, which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer.
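For readers unfamiliar with the data structure, a minimal Count-Min Sketch for k-mer counting fits in a few lines of Python. This is a toy sketch, not khmer's C++ implementation; the table sizes below are deliberately tiny and illustrative, where khmer scales its prime-sized tables to the available memory:

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch: only counts are stored, never the k-mers, and
    a retrieved count is an upper bound on the truth (overcount only)."""

    def __init__(self, tablesizes=(4999, 5003, 5009, 5011)):  # small primes
        self.tablesizes = tablesizes
        self.tables = [bytearray(size) for size in tablesizes]  # 8-bit counters

    def _indices(self, kmer):
        # Hash the k-mer once, then reduce modulo each prime table size.
        h = int(hashlib.sha1(kmer.encode()).hexdigest(), 16)
        return [h % size for size in self.tablesizes]

    def add(self, kmer):
        for table, i in zip(self.tables, self._indices(kmer)):
            if table[i] < 255:  # saturating counter
                table[i] += 1

    def get(self, kmer):
        # Take the minimum across tables: collisions only inflate counts,
        # so the smallest counter is the best available estimate.
        return min(t[i] for t, i in zip(self.tables, self._indices(kmer)))
```

Because every table must suffer a collision simultaneously for a count to be inflated, the expected miscount rate falls roughly as the product of the per-table collision rates, which is why sparse data sets fare so well.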
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and de novo assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for de
novo sequence assembly, all without significantly impacting the content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
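The single-pass algorithm can be expressed as a short sketch. The Python below is a simplified toy, not khmer's normalize-by-median implementation; the coverage cutoff of 20 and k=20 are assumed parameters, and an exact Counter stands in for the Count-Min Sketch used in practice:

```python
from collections import Counter
from statistics import median

K = 20        # assumed k-mer size
CUTOFF = 20   # assumed target coverage

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, cutoff=CUTOFF):
    counts = Counter()  # stand-in for the Count-Min Sketch
    kept = []
    for read in reads:
        kms = kmers(read)
        if not kms:
            continue
        # Estimate this read's coverage as the median abundance of its
        # k-mers as counted so far; redundant reads score above the cutoff.
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            counts.update(kms)  # only retained reads update the counts
    return kept
```

Because reads from already well-covered loci score above the cutoff and are discarded on the fly, most redundant data (and with it, the bulk of the absolute number of errors) is removed in a single pass.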
Keeping it light: (re)analyzing community-wide datasets without major infrastructure
© The Author(s), 2019. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published as Alexander, H., Johnson, L. K., & Brown, C. T. Keeping it light: (re)analyzing community-wide datasets without major infrastructure. GigaScience, 8(2) (2019): giy159, doi:10.1093/gigascience/giy159.
DNA sequencing technology has revolutionized the field of biology, shifting biology from a data-limited to a data-rich state. Central to the interpretation of sequencing data are the computational tools and approaches that convert raw data into biologically meaningful information. Both the tools and the generation of data are actively evolving, yet the practice of re-analyzing previously generated data with new tools is not commonplace. Re-analysis of existing data provides an affordable means of generating new information and will likely become more routine within biology, yet it necessitates a new set of considerations for best practices and resource development. Here, we discuss several practices that we believe to be broadly applicable when re-analyzing data, especially when done by small research groups.
Funding was provided by the Gordon and Betty Moore Foundation (award GBMF4551 to C.T.B.).
Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes
© The Author(s), 2019. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published as Johnson, L. K., Alexander, H., & Brown, C. T. Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. GigaScience, 8(4) (2019): giy158, doi:10.1093/gigascience/giy158.
Background: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without
an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using
different workflows, or “pipelines,” on the resulting assemblies are poorly understood. Here, a pipeline was
programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of
the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated
and compared against assemblies that were previously generated with a different pipeline developed by the National
Center for Genome Research. Results: New transcriptome assemblies contained the majority of previous contigs as well as
new content. On average, 7.8% of the annotated contigs in the new assemblies carried novel gene names not found in the
previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed
a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora
had a lower percentage of open reading frames compared to other phyla. Conclusions: Given current bioinformatics
approaches, there is no single “best” reference transcriptome for a particular set of raw data. As the optimum
transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable
pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples
with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing
data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific
trends across samples, in addition to novel and useful products for the community.
Funding was provided by the Gordon and Betty Moore Foundation under award number GBMF4551 to C.T.B. The Jetstream cloud platform was used with XSEDE allocation TG-BIO160028 [66, 67].
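Two of the metrics above, contig counts and unique k-mers, are straightforward to compute for any assembly. The sketch below is a hypothetical illustration rather than the pipeline's actual evaluation code; it assumes a plain multi-line FASTA file and k=25:

```python
K = 25  # assumed k-mer size for the uniqueness metric

def read_fasta(path):
    """Minimal multi-line FASTA parser (no error handling)."""
    seqs, cur = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if cur:
                    seqs.append("".join(cur))
                cur = []
            else:
                cur.append(line.strip())
    if cur:
        seqs.append("".join(cur))
    return seqs

def assembly_metrics(path, k=K):
    contigs = read_fasta(path)
    unique = {c[i:i + k] for c in contigs for i in range(len(c) - k + 1)}
    return {"contigs": len(contigs), "unique_kmers": len(unique)}
```

Applied uniformly across hundreds of assemblies, a common evaluation workflow of this kind makes taxonomic trends such as those above directly comparable.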
Assembling large, complex environmental metagenomes
The large volumes of sequencing data required to sample complex environments
deeply pose new challenges to sequence analysis approaches. De novo metagenomic
assembly effectively reduces the total amount of data to be analyzed but
requires significant computational resources. We apply two pre-assembly
filtering approaches, digital normalization and partitioning, to make large
metagenome assemblies more computationally tractable. Using a human gut mock
community dataset, we demonstrate that these methods result in assemblies
nearly identical to assemblies from unprocessed data. We then assemble two
large soil metagenomes from matched Iowa corn and native prairie soils. The
predicted functional content and phylogenetic origin of the assembled contigs
indicate significant taxonomic differences despite similar function. The
assembly strategies presented are generic and can be extended to any
metagenome; full source code is freely available under a BSD license. Comment: Includes supporting information.
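Partitioning separates reads into disjoint sets that share no k-mers, so each set can be assembled independently. The toy union-find sketch below captures the idea; khmer's actual implementation traverses a probabilistic de Bruijn graph in C++, and k=31 is an assumption here:

```python
K = 31  # assumed k-mer size

def partition(reads, k=K):
    """Group reads into connected components linked by shared k-mers."""
    parent = {}  # union-find over read indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    first_seen = {}  # k-mer -> index of the first read containing it
    for i, read in enumerate(reads):
        parent[i] = i
        for j in range(len(read) - k + 1):
            km = read[j:j + k]
            if km in first_seen:
                union(i, first_seen[km])  # shared k-mer: same partition
            else:
                first_seen[km] = i

    groups = {}
    for i in range(len(reads)):
        groups.setdefault(find(i), []).append(reads[i])
    return list(groups.values())
```

Each partition corresponds approximately to one connected subgraph of the assembly graph, so it can be assembled on its own with a fraction of the memory.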