Extreme Scale De Novo Metagenome Assembly
Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into an accurate representation of the underlying microbiomes' genomes.
State-of-the-art tools require large shared-memory machines and cannot handle
contemporary metagenome datasets that run to multiple terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared- and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads and 2.6 TB of data.
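For readers unfamiliar with the approach, the sketch below illustrates the basic single-node, single-k de Bruijn graph idea that this family of assemblers builds on: extract fixed-length k-mers from reads, link overlapping (k-1)-mers, and walk unambiguous paths into contigs. This is only a toy illustration, not MetaHipMer's UPC implementation; the reads and the value of k are invented, and the real pipeline iterates over multiple k values and distributes the graph across nodes.

```python
# Toy de Bruijn graph assembly sketch (not the MetaHipMer implementation).
# Reads and k are illustrative; real assemblers handle errors, repeats,
# reverse complements, and distribute the graph across many nodes.
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Grow a contig from `start` while every node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop on repeats
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

reads = ["ATGGCGTGCA", "GTGCAATCGG", "ATCGGAT"]
graph = build_de_bruijn(reads, k=5)
has_predecessor = set().union(*graph.values())
starts = [node for node in graph if node not in has_predecessor]
for s in starts:
    print(walk_unambiguous(graph, s))  # reconstructs ATGGCGTGCAATCGGAT
```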
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large-scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
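As a concrete, if toy, illustration of the hashing motif the paper highlights, the sketch below partitions k-mer counts by a hash of the k-mer, a serial stand-in for the asynchronous remote updates a distributed hash table receives at scale. The partition count, reads, and choice of CRC32 are assumptions made for this example, not details from the paper.

```python
# Serial simulation of the hashing motif: each k-mer is hashed to the
# "rank" that owns its counter, mimicking asynchronous updates to a
# distributed hash table. Reads, k, and nparts are illustrative.
import zlib
from collections import Counter

def owner(kmer, nparts):
    """Hash-partition a k-mer to the simulated rank that owns its counter."""
    return zlib.crc32(kmer.encode()) % nparts

def count_kmers_partitioned(reads, k, nparts):
    # One Counter per simulated rank; in a real distributed run these would
    # be remote increments rather than local updates.
    tables = [Counter() for _ in range(nparts)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            tables[owner(kmer, nparts)][kmer] += 1
    return tables

reads = ["ACGTACGTT", "CGTACGTTA"]
for rank, table in enumerate(count_kmers_partitioned(reads, k=4, nparts=2)):
    print(f"rank {rank}: {dict(table)}")
```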
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive
kernels in genomic data analysis, accounting for more than 90% of the runtime
for key bioinformatics applications. This method is particularly expensive for
third-generation sequences due to the high computational cost of analyzing
sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact
pairwise algorithms for long alignments, the community primarily relies on
approximate algorithms that search only for high-quality alignments and stop
early when one is not found. In this work, we present the first GPU
optimization of the popular X-drop alignment algorithm, which we named LOGAN.
Results show that our high-performance multi-GPU implementation achieves up to
181.6 GCUPS and speed-ups of up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs,
respectively, over the state-of-the-art software running on two IBM Power9
processors using 168 CPU threads, with equivalent accuracy. We also demonstrate
a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for
sequence alignment implemented in minimap2, a long-read mapping software. To
highlight the impact of our work on a real-world application, we couple LOGAN
with a many-to-many long-read alignment software called BELLA, and demonstrate
that our implementation improves the overall BELLA runtime by up to 10.6x.
Finally, we adapt the Roofline model for LOGAN and demonstrate that our
implementation is near-optimal on NVIDIA Tesla V100 GPUs.
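For context, the sketch below shows a deliberately simplified, ungapped X-drop extension: the alignment grows past a seed while its running score stays within X of the best score seen so far, and stops as soon as it drops further. LOGAN itself implements the gapped variant and parallelizes it across GPU threads; the scoring values, X threshold, and sequences here are illustrative only.

```python
# Simplified ungapped X-drop extension (illustration of the heuristic LOGAN
# accelerates, not LOGAN's gapped GPU implementation). Scores and X are
# illustrative.
def xdrop_extend_right(query, target, qstart, tstart, x=10, match=1, mismatch=-1):
    score = best = 0
    best_len = 0
    i = 0
    while qstart + i < len(query) and tstart + i < len(target):
        score += match if query[qstart + i] == target[tstart + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x:  # X-drop termination: score fell too far below the best
            break
    return best, best_len

q = "ACGTTTGCAGGTACCA"
t = "ACGTTTGCTGGTACGG"
score, length = xdrop_extend_right(q, t, 0, 0, x=3)
print(f"best score {score} over {length} aligned bases")
```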
Coassembly and binning of a twenty-year metagenomic time-series from Lake Mendota
The North Temperate Lakes Long-Term Ecological Research (NTL-LTER) program has been extensively used to improve understanding of how aquatic ecosystems respond to environmental stressors, climate fluctuations, and human activities. Here, we report on the metagenomes of samples collected between 2000 and 2019 from Lake Mendota, a freshwater eutrophic lake within the NTL-LTER site. We utilized the distributed metagenome assembler MetaHipMer to coassemble over 10 terabases (Tbp) of data from 471 individual Illumina-sequenced metagenomes. A total of 95,523,664 contigs were assembled and binned to generate 1,894 non-redundant metagenome-assembled genomes (MAGs) with ≥50% completeness and ≤10% contamination. Phylogenomic analysis revealed that the MAGs were nearly exclusively bacterial, dominated by Pseudomonadota (Proteobacteria, N = 623) and Bacteroidota (N = 321). Nine eukaryotic MAGs were identified by eukCC, with six assigned to the phylum Chlorophyta. Additionally, 6,350 high-quality viral sequences were identified by geNomad, with the majority classified in the phylum Uroviricota. This expansive coassembled metagenomic dataset provides an unprecedented foundation to advance understanding of microbial communities in freshwater ecosystems and explore temporal ecosystem dynamics.
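As a hedged illustration of the quality filter described above (≥50% completeness, ≤10% contamination), the sketch below selects bins from a tabular quality report. The file name and column names are hypothetical stand-ins for CheckM-style output, not the study's actual pipeline.

```python
# Minimal sketch of a MAG quality filter (>=50% completeness, <=10%
# contamination). "bin_quality.csv" and its column names are hypothetical.
import csv

def filter_mags(path, min_completeness=50.0, max_contamination=10.0):
    kept = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            completeness = float(row["completeness"])
            contamination = float(row["contamination"])
            if completeness >= min_completeness and contamination <= max_contamination:
                kept.append(row["bin_id"])
    return kept

if __name__ == "__main__":
    print(filter_mags("bin_quality.csv"))
```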