411 research outputs found
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
Extreme Scale De Novo Metagenome Assembly
Metagenome assembly is the process of transforming a set of short,
overlapping, and potentially erroneous DNA segments from environmental samples
into the accurate representation of the underlying microbiomes's genomes.
State-of-the-art tools require big shared memory machines and cannot handle
contemporary metagenome datasets that exceed Terabytes in size. In this paper,
we introduce the MetaHipMer pipeline, a high-quality and high-performance
metagenome assembler that employs an iterative de Bruijn graph approach.
MetaHipMer leverages a specialized scaffolding algorithm that produces long
scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is
end-to-end parallelized using the Unified Parallel C language and therefore can
run seamlessly on shared and distributed-memory systems. Experimental results
show that MetaHipMer matches or outperforms the state-of-the-art tools in terms
of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and
is able to assemble previously intractable grand challenge metagenomes. We
demonstrate the unprecedented capability of MetaHipMer by computing the first
full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion
reads - size 2.6 TBytes.Comment: Accepted to SC1
Compiling Tree Transforms to Operate on Packed Representations
When written idiomatically in most programming languages, programs
that traverse and construct trees operate over pointer-based data
structures, using one heap object per-leaf and per-node. This
representation is efficient for random access and shape-changing
modifications, but for traversals, such as compiler passes, that
process most or all of a tree in bulk, it can be inefficient. In this
work we instead compile tree traversals to operate on
pointer-free pre-order serializations of trees. On modern
architectures such programs often run significantly faster than
their pointer-based counterparts, and additionally are directly suited
to storage and transmission without requiring marshaling.
We present a prototype compiler, Gibbon, that compiles a
small first-order, purely functional language sufficient for tree
traversals. The compiler transforms this language into intermediate
representation with explicit pointers into input and output buffers
for packed data. The key compiler technologies include an effect
system for capturing traversal behavior, combined with an algorithm to
insert destination cursors. We evaluate our compiler on tree
transformations over a real-world dataset of source-code syntax trees.
For traversals touching the whole tree, such as maps and folds, packed
data allows speedups of over 2x compared to a highly-optimized
pointer-based baseline
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are challenging to
implement on new parallel hardware such as GPUs because of three challenges:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.Comment: 50 pages, 14 figures, 14 table
Efficient Strategies for Graph Pattern Mining Algorithms on GPUs
Graph Pattern Mining (GPM) is an important, rapidly evolving, and computation
demanding area. GPM computation relies on subgraph enumeration, which consists
in extracting subgraphs that match a given property from an input graph.
Graphics Processing Units (GPUs) have been an effective platform to accelerate
applications in many areas. However, the irregularity of subgraph enumeration
makes it challenging for efficient execution on GPU due to typical uncoalesced
memory access, divergence, and load imbalance. Unfortunately, these aspects
have not been fully addressed in previous work. Thus, this work proposes novel
strategies to design and implement subgraph enumeration efficiently on GPU. We
support a depth-first search style search (DFS-wide) that maximizes memory
performance while providing enough parallelism to be exploited by the GPU,
along with a warp-centric design that minimizes execution divergence and
improves utilization of the computing capabilities. We also propose a low-cost
load balancing layer to avoid idleness and redistribute work among thread warps
in a GPU. Our strategies have been deployed in a system named DuMato, which
provides a simple programming interface to allow efficient implementation of
GPM algorithms. Our evaluation has shown that DuMato is often an order of
magnitude faster than state-of-the-art GPM systems and can mine larger
subgraphs (up to 12 vertices).Comment: Accepted for publication on IEEE 34th International Symposium on
Computer Architecture and High Performance Computing (SBAC-PAD'22
- …