Fast multiplication of random dense matrices with fixed sparse matrices
This work focuses on accelerating the multiplication of a dense random matrix
with a (fixed) sparse matrix, an operation used frequently in sketching algorithms.
We develop a novel scheme that takes advantage of blocking and recomputation
(on-the-fly random number generation) to accelerate this operation. The
techniques we propose decrease memory movement, thereby increasing the
algorithm's parallel scalability on shared-memory architectures. On the Intel
Frontera architecture, our algorithm can achieve 2x speedups over libraries
such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we
can obtain a parallel efficiency of up to approximately 45%. We also present a
theoretical analysis of our algorithm's data movement lower bound, showing
that under mild assumptions, it is possible to beat the data movement lower
bound of general matrix-matrix multiplication (GEMM) by a factor of
$\sqrt{M}$, where $M$ is the cache size. Finally, we incorporate our sketching
algorithm
into a randomized least squares solver. For extremely over-determined sparse
input matrices, we show that our results are competitive with SuiteSparse; in
some cases, we obtain a speedup of 10x over SuiteSparse.
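To make the blocking-and-recomputation idea concrete, here is a minimal
single-threaded Python sketch (our own illustration, not the paper's
implementation; the function name sketch_multiply and the per-block seeding
scheme are assumptions). It computes G @ A for an implicit dense Gaussian G
without ever materializing G: each column block of G is regenerated from a
deterministic seed while streaming over the sparse operand.

    import numpy as np
    import scipy.sparse as sp

    def sketch_multiply(seed, d, A, block=256):
        """Compute G @ A for an implicit d x n Gaussian G, never storing G."""
        A = sp.csr_array(A)
        n, m = A.shape
        out = np.zeros((d, m))
        for start in range(0, n, block):
            stop = min(start + block, n)
            # Recompute this column block of G from (seed, block offset);
            # on-the-fly generation trades extra FLOPs for less memory traffic.
            rng = np.random.default_rng([seed, start])
            G_blk = rng.standard_normal((d, stop - start))
            # dense (d x b) times sparse (b x m), done as (sparse.T @ dense.T).T
            out += (A[start:stop, :].T @ G_blk.T).T
        return out

    # e.g., sketch a tall sparse matrix down to 50 rows
    S = sketch_multiply(seed=42, d=50, A=sp.random(10_000, 100, density=1e-3))

Because each block of G is recomputed rather than re-read, the random matrix
contributes no memory traffic, which is the intuition behind beating the GEMM
data movement bound.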
diBELLA: Distributed Long Read to Long Read Alignment
We present a parallel algorithm and scalable implementation for genome
analysis, specifically the problem of finding overlaps and alignments for data
from "third generation" long read sequencers. While long sequences of DNA offer
enormous advantages for biological analysis and insight, current long read
sequencing instruments have high error rates and therefore require different
approaches to analysis than their short read counterparts. Our work focuses on
an efficient distributed-memory parallelization of an accurate single-node
algorithm for overlapping and aligning long reads. We achieve scalability of
this irregular algorithm by addressing the competing issues of increasing
parallelism, minimizing communication, constraining the memory footprint, and
ensuring good load balance. The resulting application, diBELLA, is the first
distributed memory overlapper and aligner specifically designed for long reads
and parallel scalability. We describe and analyze high-level design
trade-offs and conduct an extensive empirical study that compares
performance characteristics across state-of-the-art HPC systems as well as a
commercial cloud architecture, highlighting the advantages of state-of-the-art
network technologies.
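As a rough illustration of the overlap stage, here is a minimal single-node
Python sketch (a simplification under our own assumptions; diBELLA implements
this with a distributed k-mer hash table and then aligns the candidate
pairs): reads that share a k-mer become candidate overlap pairs.

    from collections import defaultdict
    from itertools import combinations

    def overlap_candidates(reads, k=17, max_occ=8):
        """Map each k-mer to the reads containing it; pair reads sharing one."""
        index = defaultdict(set)
        for rid, seq in enumerate(reads):
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(rid)
        pairs = set()
        for rids in index.values():
            # Highly repetitive k-mers are filtered to bound the pair count.
            if len(rids) <= max_occ:
                pairs.update(combinations(sorted(rids), 2))
        return pairs

    print(overlap_candidates(["ACGTACGTACGTACGTACGTA",
                              "GTACGTACGTACGTACGTACC"]))  # {(0, 1)}

Distributing this k-mer index across nodes is what produces the irregular
communication pattern that the paper's load balancing and communication
optimizations address.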
Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Dedicated accelerator hardware has become essential for processing AI-based
workloads, leading to the rise of novel accelerator architectures. Furthermore,
fundamental differences in memory architecture and parallelism have made these
accelerators attractive targets for scientific computing.
The sequence alignment problem is fundamental in bioinformatics; we have
implemented the X-Drop algorithm, a heuristic method for pairwise alignment
that reduces the search space, on the Graphcore Intelligence Processing Unit
(IPU) accelerator. The X-Drop algorithm has an irregular computational
pattern, which makes it difficult to accelerate due to load imbalance.
Here, we introduce a graph-based partitioning and queue-based batch system to
improve load balancing. Our implementation achieves speedups over both a
state-of-the-art GPU implementation and a CPU baseline. In addition, we
introduce a memory-restricted X-Drop algorithm that reduces the memory
footprint and efficiently uses the IPU's limited low-latency SRAM. This
optimization further improves the strong scaling performance.
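The batching idea can be illustrated with a small Python sketch (hypothetical
simplification; the actual system combines graph-based partitioning with
on-device queues): alignment tasks have highly uneven costs, so greedily
assigning the most expensive remaining task to the least-loaded worker evens
out the work.

    import heapq

    def balance(tasks, n_workers):
        """tasks: (task_id, estimated_cost) pairs -> batches per worker."""
        queues = [(0, w, []) for w in range(n_workers)]  # (load, worker, batch)
        heapq.heapify(queues)
        for tid, cost in sorted(tasks, key=lambda t: -t[1]):
            load, w, batch = heapq.heappop(queues)       # least-loaded worker
            batch.append(tid)
            heapq.heappush(queues, (load + cost, w, batch))
        return {w: batch for _, w, batch in queues}

    # costs might be estimated from sequence-pair lengths
    print(balance([("p0", 900), ("p1", 120), ("p2", 880), ("p3", 150)], 2))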
Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
HipMCL is a high-performance distributed memory implementation of the popular
Markov Cluster Algorithm (MCL) and can cluster large-scale networks within
hours using a few thousand CPU-equipped nodes. It relies on sparse matrix
computations, making heavy use of the sparse matrix-sparse matrix
multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL are
not scalable to exascale architectures, both because their communication costs
dominate the runtime at large concurrencies and because they cannot take
advantage of the accelerators that are increasingly popular.
In this work, we systematically remove scalability and performance
bottlenecks of HipMCL. We enable GPU acceleration by performing the expensive
expansion phase of the MCL algorithm on GPUs. We propose a CPU-GPU joint distributed
SpGEMM algorithm called pipelined Sparse SUMMA and integrate a probabilistic
memory requirement estimator that is fast and accurate. We develop a new
merging algorithm for the incremental processing of partial results produced by
the GPUs, which improves the overlap efficiency and the peak memory usage. We
also integrate a recent and faster algorithm for performing SpGEMM on CPUs. We
validate our new algorithms and optimizations with extensive evaluations. With
the enabling of GPUs and the integration of new algorithms, HipMCL is up to
12.4x faster, able to cluster a network with 70 million proteins and 68
billion connections in just under 15 minutes using 1024 nodes of ORNL's Summit
supercomputer.
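For context, a single MCL iteration, the computation HipMCL distributes and
accelerates, fits in a few lines of Python with scipy.sparse (a simplified
sketch; HipMCL adds distributed SpGEMM, pruning/recovery heuristics, and the
GPU offload described above):

    import numpy as np
    import scipy.sparse as sp

    def mcl_step(M, inflation=2.0, prune=1e-4):
        """One MCL iteration on a column-stochastic sparse matrix M."""
        M = M @ M                       # expansion: the SpGEMM that dominates runtime
        M = M.power(inflation).tocsr()  # inflation: sharpens strong transitions
        M.data[M.data < prune] = 0.0    # pruning keeps the iterate sparse
        M.eliminate_zeros()
        colsum = np.asarray(M.sum(axis=0)).ravel()
        colsum[colsum == 0.0] = 1.0     # guard empty columns
        return (M @ sp.diags(1.0 / colsum)).tocsr()  # re-normalize columns

Iterating this step to convergence and reading clusters off the connected
components of the resulting nonzero structure is the sequential algorithm
that HipMCL scales out.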
SIAM Data Mining Brings It to Annual Meeting
The Data Mining Activity Group is one of SIAM's most vibrant and dynamic activity groups. To better share our enthusiasm for data mining with the broader SIAM community, our activity group organized six minisymposia at the 2016 Annual Meeting. These minisymposia included 48 talks organized by 11 SIAM members on:
- GraphBLAS (Aydın Buluç)
- Algorithms and statistical methods for noisy network analysis (Sanjukta Bhowmick & Ben Miller)
- Inferring networks from non-network data (Rajmonda Caceres, Ivan Brugere & Tanya Y. Berger-Wolf)
- Visual analytics (Jordan Crouser)
- Mining in graph data (Jennifer Webster, Mahantesh Halappanavar & Emilie Hogan)
- Scientific computing and big data (Vijay Gadepally)
These minisymposia were well received by the broader SIAM community, and below are some of the key highlights.
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive
kernels in genomic data analysis, accounting for more than 90% of the runtime
for key bioinformatics applications. This method is particularly expensive for
third-generation sequences due to the high computational cost of analyzing
sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact
pairwise algorithms for long alignments, the community primarily relies on
approximate algorithms that search only for high-quality alignments and stop
early when one is not found. In this work, we present the first GPU
optimization of the popular X-Drop alignment algorithm, which we name LOGAN.
Results show that our high-performance multi-GPU implementation achieves up to
181.6 GCUPS and speed-ups of up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs,
respectively, over the state-of-the-art software running on two IBM Power9
processors using 168 CPU threads, with equivalent accuracy. We also demonstrate
a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for
sequence alignment implemented in minimap2, a long-read mapping software. To
highlight the impact of our work on a real-world application, we couple LOGAN
with a many-to-many long-read alignment software called BELLA, and demonstrate
that our implementation improves the overall BELLA runtime by up to 10.6x.
Finally, we adapt the Roofline model for LOGAN and demonstrate that our
implementation is near-optimal on the NVIDIA Tesla V100s.
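To show the idea LOGAN accelerates, here is the simple ungapped variant of
X-drop seed extension in Python (an illustrative sketch; LOGAN implements the
gapped, antidiagonal formulation on GPUs): extension stops as soon as the
running score falls more than X below the best score seen, so poor alignments
terminate early.

    def xdrop_extend(a, b, match=1, mismatch=-1, x=3):
        """Extend an exact seed to the right under the X-drop criterion."""
        score = best = best_len = 0
        for i in range(min(len(a), len(b))):
            score += match if a[i] == b[i] else mismatch
            if score > best:
                best, best_len = score, i + 1
            elif best - score > x:      # dropped more than X below the best
                break
        return best, best_len           # best score and extension length

    print(xdrop_extend("ACGTACGTTTTT", "ACGTACGAAAAA"))  # (7, 7)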
Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
Modern network sensors continuously produce enormous quantities of raw data
that are beyond the capacity of human analysts. Cross-correlation of network
sensors increases this challenge by enriching every network event with
additional metadata. These large volumes of enriched network data present
opportunities to statistically characterize network traffic and quickly answer
a key question: "What are the primary cyber characteristics of my network
data?" The Python GraphBLAS and PyD4M analysis frameworks enable anonymized
statistical analysis to be performed quickly and efficiently on very large
network data sets. This approach is tested using billions of anonymized network
data samples from the largest Internet observatory (CAIDA Telescope) and tens
of millions of anonymized records from the largest commercially available
background enrichment capability (GreyNoise). The analysis confirms that most
of the enriched variables follow expected heavy-tail distributions and that a
large fraction of the network traffic is due to a small number of cyber
activities. This information can simplify the cyber analysts' task by enabling
prioritization of cyber activities based on statistical prevalence.
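As a toy illustration of this kind of characterization (hypothetical; the
actual analysis uses Python GraphBLAS and PyD4M at far larger scale), one can
count events per anonymized source and estimate the heavy-tail exponent from
the slope of the log-log rank-frequency plot:

    import numpy as np
    from collections import Counter

    def tail_slope(sources):
        """Fit the log-log rank-frequency slope of event counts per source."""
        counts = sorted(Counter(sources).values(), reverse=True)
        ranks = np.arange(1, len(counts) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
        return slope  # a slope near -1 suggests a Zipf-like heavy tail

    rng = np.random.default_rng(0)
    print(tail_slope(rng.zipf(2.0, 100_000)))  # synthetic heavy-tailed sources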