Fast multiplication of random dense matrices with fixed sparse matrices
This work focuses on accelerating the multiplication of a dense random matrix
with a (fixed) sparse matrix, an operation used frequently in sketching algorithms.
We develop a novel scheme that takes advantage of blocking and recomputation
(on-the-fly random number generation) to accelerate this operation. The
techniques we propose decrease memory movement, thereby increasing the
algorithm's parallel scalability on shared-memory architectures. On the Intel
Frontera architecture, our algorithm can achieve 2x speedups over libraries
such as Eigen and Intel MKL on some examples. In addition, with 32 threads, we
can obtain a parallel efficiency of up to approximately 45%. We also present a
theoretical analysis of our algorithm's data movement lower bound, showing
that under mild assumptions, it is possible to beat the data movement lower
bound of general matrix-matrix multiplication (GEMM) by a factor of
$\sqrt{M}$, where $M$ is the cache size. Finally, we incorporate our sketching
algorithm
into a randomized least squares solver. For extremely over-determined sparse
input matrices, we show that our results are competitive with SuiteSparse; in
some cases, we obtain a speedup of 10x over SuiteSparse.
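To make the blocking-and-recomputation idea concrete, here is a minimal
single-threaded Python sketch (our own illustration, not the paper's
implementation; the function name sketch_multiply and the per-block seeding
scheme are assumptions). It computes G @ A for an implicit dense Gaussian G
without ever materializing G: each column block of G is regenerated from a
deterministic seed while streaming over the sparse operand.

    import numpy as np
    import scipy.sparse as sp

    def sketch_multiply(seed, d, A, block=256):
        """Compute G @ A for an implicit d x n Gaussian G, never storing G."""
        A = sp.csr_array(A)
        n, m = A.shape
        out = np.zeros((d, m))
        for start in range(0, n, block):
            stop = min(start + block, n)
            # Recompute this column block of G from (seed, block offset);
            # on-the-fly generation trades extra FLOPs for less memory traffic.
            rng = np.random.default_rng([seed, start])
            G_blk = rng.standard_normal((d, stop - start))
            # dense (d x b) times sparse (b x m), done as (sparse.T @ dense.T).T
            out += (A[start:stop, :].T @ G_blk.T).T
        return out

    # e.g., sketch a tall sparse matrix down to 50 rows
    S = sketch_multiply(seed=42, d=50, A=sp.random(10_000, 100, density=1e-3))

Because each block of G is recomputed rather than re-read, the random matrix
contributes no memory traffic, which is the intuition behind beating the GEMM
data movement bound.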
diBELLA: Distributed Long Read to Long Read Alignment
We present a parallel algorithm and scalable implementation for genome
analysis, specifically the problem of finding overlaps and alignments for data
from "third generation" long read sequencers. While long sequences of DNA offer
enormous advantages for biological analysis and insight, current long read
sequencing instruments have high error rates and therefore require different
approaches to analysis than their short read counterparts. Our work focuses on
an efficient distributed-memory parallelization of an accurate single-node
algorithm for overlapping and aligning long reads. We achieve scalability of
this irregular algorithm by addressing the competing issues of increasing
parallelism, minimizing communication, constraining the memory footprint, and
ensuring good load balance. The resulting application, diBELLA, is the first
distributed memory overlapper and aligner specifically designed for long reads
and parallel scalability. We describe and analyze high-level design
trade-offs and conduct an extensive empirical study that compares
performance characteristics across state-of-the-art HPC systems as well as a
commercial cloud architecture, highlighting the advantages of state-of-the-art
network technologies.
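As a rough illustration of the overlap stage, here is a minimal single-node
Python sketch (a simplification under our own assumptions; diBELLA implements
this with a distributed k-mer hash table and then aligns the candidate
pairs): reads that share a k-mer become candidate overlap pairs.

    from collections import defaultdict
    from itertools import combinations

    def overlap_candidates(reads, k=17, max_occ=8):
        """Map each k-mer to the reads containing it; pair reads sharing one."""
        index = defaultdict(set)
        for rid, seq in enumerate(reads):
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].add(rid)
        pairs = set()
        for rids in index.values():
            # Highly repetitive k-mers are filtered to bound the pair count.
            if len(rids) <= max_occ:
                pairs.update(combinations(sorted(rids), 2))
        return pairs

    print(overlap_candidates(["ACGTACGTACGTACGTACGTA",
                              "GTACGTACGTACGTACGTACC"]))  # {(0, 1)}

Distributing this k-mer index across nodes is what produces the irregular
communication pattern that the paper's load balancing and communication
optimizations address.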
Space Efficient Sequence Alignment for SRAM-Based Computing: X-Drop on the Graphcore IPU
Dedicated accelerator hardware has become essential for processing AI-based
workloads, leading to the rise of novel accelerator architectures. Furthermore,
fundamental differences in memory architecture and parallelism have made these
accelerators attractive targets for scientific computing.
The sequence alignment problem is fundamental in bioinformatics; we have
implemented the X-Drop algorithm, a heuristic method for pairwise alignment
that reduces the search space, on the Graphcore Intelligence Processing Unit
(IPU) accelerator. The X-Drop algorithm has an irregular computational
pattern, which makes it difficult to accelerate due to load imbalance.
Here, we introduce a graph-based partitioning and queue-based batch system to
improve load balancing. Our implementation achieves speedups over both a
state-of-the-art GPU implementation and a CPU baseline. In addition, we
introduce a memory-restricted X-Drop algorithm that reduces the memory
footprint and efficiently uses the IPU's limited low-latency SRAM. This
optimization further improves the strong scaling performance.
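The batching idea can be illustrated with a small Python sketch (hypothetical
simplification; the actual system combines graph-based partitioning with
on-device queues): alignment tasks have highly uneven costs, so greedily
assigning the most expensive remaining task to the least-loaded worker evens
out the work.

    import heapq

    def balance(tasks, n_workers):
        """tasks: (task_id, estimated_cost) pairs -> batches per worker."""
        queues = [(0, w, []) for w in range(n_workers)]  # (load, worker, batch)
        heapq.heapify(queues)
        for tid, cost in sorted(tasks, key=lambda t: -t[1]):
            load, w, batch = heapq.heappop(queues)       # least-loaded worker
            batch.append(tid)
            heapq.heappush(queues, (load + cost, w, batch))
        return {w: batch for _, w, batch in queues}

    # costs might be estimated from sequence-pair lengths
    print(balance([("p0", 900), ("p1", 120), ("p2", 880), ("p3", 150)], 2))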
Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
HipMCL is a high-performance distributed memory implementation of the popular
Markov Cluster Algorithm (MCL) and can cluster large-scale networks within
hours using a few thousand CPU-equipped nodes. It relies on sparse matrix
computations, making heavy use of the sparse matrix-sparse matrix
multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL are
not scalable to exascale architectures, both because their communication costs
dominate the runtime at large concurrencies and because they cannot take
advantage of the accelerators that are increasingly popular.
In this work, we systematically remove scalability and performance
bottlenecks of HipMCL. We enable GPU acceleration by performing the expensive
expansion phase of the MCL algorithm on GPUs. We propose a CPU-GPU joint distributed
SpGEMM algorithm called pipelined Sparse SUMMA and integrate a probabilistic
memory requirement estimator that is fast and accurate. We develop a new
merging algorithm for the incremental processing of partial results produced by
the GPUs, which improves the overlap efficiency and the peak memory usage. We
also integrate a recent and faster algorithm for performing SpGEMM on CPUs. We
validate our new algorithms and optimizations with extensive evaluations. With
the enabling of GPUs and the integration of new algorithms, HipMCL is up to
12.4x faster, able to cluster a network with 70 million proteins and 68
billion connections in just under 15 minutes using 1024 nodes of ORNL's Summit
supercomputer.
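For context, a single MCL iteration, the computation HipMCL distributes and
accelerates, fits in a few lines of Python with scipy.sparse (a simplified
sketch; HipMCL adds distributed SpGEMM, pruning/recovery heuristics, and the
GPU offload described above):

    import numpy as np
    import scipy.sparse as sp

    def mcl_step(M, inflation=2.0, prune=1e-4):
        """One MCL iteration on a column-stochastic sparse matrix M."""
        M = M @ M                       # expansion: the SpGEMM that dominates runtime
        M = M.power(inflation).tocsr()  # inflation: sharpens strong transitions
        M.data[M.data < prune] = 0.0    # pruning keeps the iterate sparse
        M.eliminate_zeros()
        colsum = np.asarray(M.sum(axis=0)).ravel()
        colsum[colsum == 0.0] = 1.0     # guard empty columns
        return (M @ sp.diags(1.0 / colsum)).tocsr()  # re-normalize columns

Iterating this step to convergence and reading clusters off the connected
components of the resulting nonzero structure is the sequential algorithm
that HipMCL scales out.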
SIAM Data Mining Brings It to Annual Meeting
The Data Mining Activity Group is one of SIAM's most vibrant and dynamic activity groups. To better share our enthusiasm for data mining with the broader SIAM community, our activity group organized six minisymposia at the 2016 Annual Meeting. These minisymposia included 48 talks organized by 11 SIAM members on:
- GraphBLAS (Aydın Buluç)
- Algorithms and statistical methods for noisy network analysis (Sanjukta Bhowmick & Ben Miller)
- Inferring networks from non-network data (Rajmonda Caceres, Ivan Brugere & Tanya Y. Berger-Wolf)
- Visual analytics (Jordan Crouser)
- Mining in graph data (Jennifer Webster, Mahantesh Halappanavar & Emilie Hogan)
- Scientific computing and big data (Vijay Gadepally)
These minisymposia were well received by the broader SIAM community, and below are some of the key highlights.
LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Pairwise sequence alignment is one of the most computationally intensive
kernels in genomic data analysis, accounting for more than 90% of the runtime
for key bioinformatics applications. This method is particularly expensive for
third-generation sequences due to the high computational cost of analyzing
sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact
pairwise algorithms for long alignments, the community primarily relies on
approximate algorithms that search only for high-quality alignments and stop
early when one is not found. In this work, we present the first GPU
optimization of the popular X-Drop alignment algorithm, which we name LOGAN.
Results show that our high-performance multi-GPU implementation achieves up to
181.6 GCUPS and speed-ups of up to 6.6x and 30.7x using 1 and 6 NVIDIA Tesla V100 GPUs,
respectively, over the state-of-the-art software running on two IBM Power9
processors using 168 CPU threads, with equivalent accuracy. We also demonstrate
a 2.3x LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for
sequence alignment implemented in minimap2, a long-read mapping software. To
highlight the impact of our work on a real-world application, we couple LOGAN
with a many-to-many long-read alignment software called BELLA, and demonstrate
that our implementation improves the overall BELLA runtime by up to 10.6x.
Finally, we adapt the Roofline model for LOGAN and demonstrate that our
implementation is near-optimal on the NVIDIA Tesla V100s.
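To show the idea LOGAN accelerates, here is the simple ungapped variant of
X-drop seed extension in Python (an illustrative sketch; LOGAN implements the
gapped, antidiagonal formulation on GPUs): extension stops as soon as the
running score falls more than X below the best score seen, so poor alignments
terminate early.

    def xdrop_extend(a, b, match=1, mismatch=-1, x=3):
        """Extend an exact seed to the right under the X-drop criterion."""
        score = best = best_len = 0
        for i in range(min(len(a), len(b))):
            score += match if a[i] == b[i] else mismatch
            if score > best:
                best, best_len = score, i + 1
            elif best - score > x:      # dropped more than X below the best
                break
        return best, best_len           # best score and extension length

    print(xdrop_extend("ACGTACGTTTTT", "ACGTACGAAAAA"))  # (7, 7)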
Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
Modern network sensors continuously produce enormous quantities of raw data
that are beyond the capacity of human analysts. Cross-correlation of network
sensors increases this challenge by enriching every network event with
additional metadata. These large volumes of enriched network data present
opportunities to statistically characterize network traffic and quickly answer
a key question: "What are the primary cyber characteristics of my network
data?" The Python GraphBLAS and PyD4M analysis frameworks enable anonymized
statistical analysis to be performed quickly and efficiently on very large
network data sets. This approach is tested using billions of anonymized network
data samples from the largest Internet observatory (CAIDA Telescope) and tens
of millions of anonymized records from the largest commercially available
background enrichment capability (GreyNoise). The analysis confirms that most
of the enriched variables follow expected heavy-tail distributions and that a
large fraction of the network traffic is due to a small number of cyber
activities. This information can simplify the cyber analysts' task by enabling
prioritization of cyber activities based on statistical prevalence.
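As a toy illustration of this kind of characterization (hypothetical; the
actual analysis uses Python GraphBLAS and PyD4M at far larger scale), one can
count events per anonymized source and estimate the heavy-tail exponent from
the slope of the log-log rank-frequency plot:

    import numpy as np
    from collections import Counter

    def tail_slope(sources):
        """Fit the log-log rank-frequency slope of event counts per source."""
        counts = sorted(Counter(sources).values(), reverse=True)
        ranks = np.arange(1, len(counts) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
        return slope  # a slope near -1 suggests a Zipf-like heavy tail

    rng = np.random.default_rng(0)
    print(tail_slope(rng.zipf(2.0, 100_000)))  # synthetic heavy-tailed sources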