22 research outputs found
These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure
K-mer abundance analysis is widely used for many purposes in nucleotide
sequence analysis, including data preprocessing for de novo assembly, repeat
detection, and sequencing coverage estimation. We present the khmer software
package for fast and memory efficient online counting of k-mers in sequencing
data sets. Unlike previous methods based on data structures such as hash
tables, suffix arrays, and trie structures, khmer relies entirely on a simple
probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits
online updating and retrieval of k-mer counts in memory which is necessary to
support online k-mer analysis algorithms. On sparse data sets this data
structure is considerably more memory efficient than any exact data structure.
In exchange, the use of a Count-Min Sketch introduces a systematic overcount
for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we
analyze the speed, the memory usage, and the miscount rate of khmer for
generating k-mer frequency distributions and retrieving k-mer counts for
individual k-mers. We also compare the performance of khmer to several other
k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC,
Turtle and KAnalyze. Finally, we examine the effectiveness of profiling
sequencing error, k-mer abundance trimming, and digital normalization of reads
in the context of high khmer false positive rates. khmer is implemented in C++
wrapped in a Python interface, offers a tested and robust API, and is freely
available under the BSD license at github.com/ged-lab/khmer
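As a rough illustration of the data structure the abstract describes, here is a minimal Count-Min Sketch for k-mer counting in plain Python. It is a generic textbook sketch, not khmer's implementation: the class name `CountMinSketch`, the helper `count_kmers`, the salted BLAKE2 hashing, and the default `width` and `depth` are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` tables of `width` counters each."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per table; salting the hash with the table index makes
        # the tables behave like independent hash functions (depth < 256).
        for i in range(self.depth):
            digest = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for i, slot in self._slots(item):
            self.tables[i][slot] += 1

    def count(self, item):
        # Collisions can only inflate a counter, so the minimum over the
        # tables is the tightest estimate and never undercounts.
        return min(self.tables[i][slot] for i, slot in self._slots(item))


def kmers(seq, k):
    """Yield every overlapping k-mer of `seq`."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k, width=1000, depth=4):
    """Stream reads into a sketch; only counters are stored, not k-mers."""
    sketch = CountMinSketch(width, depth)
    for read in reads:
        for kmer in kmers(read, k):
            sketch.add(kmer)
    return sketch
```

Because only counters are kept, the sketch can answer "how often was this k-mer seen?" but cannot list the k-mers it has seen, and any error is a one-sided overcount, matching the trade-off stated in the abstract.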
Assembling large, complex environmental metagenomes
The large volumes of sequencing data required to sample complex environments
deeply pose new challenges to sequence analysis approaches. De novo metagenomic
assembly effectively reduces the total amount of data to be analyzed but
requires significant computational resources. We apply two pre-assembly
filtering approaches, digital normalization and partitioning, to make large
metagenome assemblies more computationally tractable. Using a human gut mock
community dataset, we demonstrate that these methods result in assemblies
nearly identical to assemblies from unprocessed data. We then assemble two
large soil metagenomes from matched Iowa corn and native prairie soils. The
predicted functional content and phylogenetic origin of the assembled contigs
indicate significant taxonomic differences despite similar function. The
assembly strategies presented are generic and can be extended to any
metagenome; full source code is freely available under a BSD license.
Comment: Includes supporting information
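The abstract names digital normalization without spelling it out. As commonly described, the filter keeps a read only while the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. The sketch below illustrates that idea under stated assumptions: it uses an exact `Counter` rather than a probabilistic counter, and the function name `normalize` and the defaults `k=4` and `cutoff=2` are chosen only to keep the example small.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer of `seq` as a list."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()  # exact counts; an assumption made for clarity
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to coverage
    return kept
```

With five copies of one read and a single copy of another, the redundant copies are discarded once their estimated coverage reaches the cutoff, while the rare read is retained.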
Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem
Soil microorganisms mediate biogeochemical processes, but how microbial community composition influences these processes remains contested. We combined monthly sequencing of soil 16S rRNA genes and intensive measurements of nitrogen (N), carbon (C), and iron (Fe) cycling along a topographic gradient in a poorly drained intensive agricultural ecosystem (corn–soybean rotation) in the midwestern United States. Observed microbial composition changed little over time within and among years despite large differences in weather and crop type. Yet, microbial composition varied greatly with topographic location and correlated strongly with moisture, soil organic carbon (SOC), and especially pH. Microbial families, genera, and/or amplicon sequence variants often correlated significantly with measured biogeochemical processes or pools, yet different taxa within the same phylogenetic groups often responded in opposite ways, indicating a lack of ecological coherence among close relatives. Dominant phyla were generally similar across the topographic gradient but specific members showed consistent tradeoffs among locations. Ammonia oxidizing archaea and bacteria sequences varied oppositely with pH across the gradient, but their combined relative abundances remained similar, as did potential nitrification rates. Nitrospira sequences correlated positively with nitrous oxide (N2O) fluxes, suggesting a direct or indirect contribution of nitrification (or possibly comammox) to N2O production. We also found significant linkages between taxonomic groups and redox-sensitive Fe pools, indicating a role for redox variation in structuring microbial communities. Several globally dominant bacteria identified previously correlated significantly with measured biogeochemical variables, providing insights into their possible functional roles. 
Overall, microbial composition provided a coarse measure of several key biogeochemical functions and implicated taxa that possibly mediate these processes in a widespread agroecosystem of North America.
This is a manuscript of an article published as Yu, Wenjuan, Nathaniel C. Lawrence, Thanwalee Sooksa-nguan, Schuyler D. Smith, Carlos Tenesaca, Adina Chuang Howe, and Steven J. Hall. "Microbial linkages to soil biogeochemical processes in a poorly drained agricultural ecosystem." Soil Biology and Biochemistry (2021): 108228. doi:10.1016/j.soilbio.2021.108228. Posted with permission.
Tackling soil diversity with the assembly of large, complex metagenomes
The large volumes of sequencing data required to sample deeply the microbial communities of complex environments pose new challenges to sequence analysis. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires substantial computational resources. We combine two preassembly filtering approaches--digital normalization and partitioning--to generate previously intractable large metagenome assemblies. Using a human-gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes totaling 398 billion bp (equivalent to 88,000 Escherichia coli genomes) from matched Iowa corn and native prairie soils. The resulting assembled contigs could be used to identify molecular interactions and reaction networks of known metabolic pathways using the Kyoto Encyclopedia of Genes and Genomes Orthology database. Nonetheless, more than 60% of predicted proteins in assemblies could not be annotated against known databases. Many of these unknown proteins were abundant in both corn and prairie soils, highlighting the benefits of assembly for the discovery and characterization of novelty in soil biodiversity. Moreover, 80% of the sequencing data could not be assembled because of low coverage, suggesting that considerably more sequencing data are needed to characterize the functional content of soil
Iterative low-memory k-mer trimming.
The results of trimming reads at unique (erroneous) k-mers from a 5 million read E. coli data set (1.4 GB) in under 30 MB of RAM. After each iteration, we measured the total number of distinct k-mers in the data set, the total number of unique (and likely erroneous) k-mers remaining, and the number of unique k-mers present at the 3' end of reads.
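The iteration described in this caption can be sketched roughly as follows. This is an illustrative toy, not the procedure used to produce the figure: it counts k-mers exactly with a `Counter` instead of a low-memory probabilistic counter, and the trimming rule (truncate each read just before the first k-mer seen only once) and the function names `trim_once`, `stats`, and `iterate_trimming` are assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping k-mer of `seq` as a list."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_all(reads, k):
    return Counter(km for r in reads for km in kmers(r, k))


def trim_once(reads, k):
    """Truncate each read so that it ends before its first unique k-mer."""
    counts = count_all(reads, k)
    trimmed = []
    for r in reads:
        cut = len(r)
        for i, km in enumerate(kmers(r, k)):
            if counts[km] == 1:
                cut = i + k - 1  # drop the last base of the unique k-mer
                break
        if cut >= k:  # discard reads too short to hold a single k-mer
            trimmed.append(r[:cut])
    return trimmed


def stats(reads, k):
    """Return (distinct k-mers, unique k-mers, unique k-mers at 3' ends)."""
    counts = count_all(reads, k)
    unique = sum(1 for c in counts.values() if c == 1)
    at_3prime = sum(1 for r in reads if len(r) >= k and counts[r[-k:]] == 1)
    return len(counts), unique, at_3prime


def iterate_trimming(reads, k, rounds=3):
    """Trim repeatedly, recording the three statistics after each round."""
    history = [stats(reads, k)]
    for _ in range(rounds):
        reads = trim_once(reads, k)
        history.append(stats(reads, k))
    return reads, history
```

Repeating the pass matters because trimming removes k-mers from the data set, so a k-mer that was seen twice in one round may be unique in the next.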