Implementing Push-Pull Efficiently in GraphBLAS
We factor Beamer's push-pull, also known as direction-optimized
breadth-first search (DOBFS), into three separable optimizations, and analyze
them for generalizability, asymptotic speedup, and contribution to overall speedup.
We demonstrate that masking is critical for high performance and can be
generalized to all graph algorithms where the sparsity pattern of the output is
known a priori. We show that these graph algorithm optimizations, which
together constitute DOBFS, can be neatly and separably described using linear
algebra and can be expressed in the GraphBLAS linear-algebra-based framework.
We provide experimental evidence that with these optimizations, a DOBFS
expressed in a linear-algebra-based graph framework attains competitive
performance with state-of-the-art graph frameworks on the GPU and on a
multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph.Comment: 11 pages, 7 figures, International Conference on Parallel Processing
(ICPP) 201
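The direction switch the abstract describes can be sketched in plain Python. This is an illustrative reconstruction, not the paper's GraphBLAS code: the function name, the switch threshold, and the set-based frontier are our assumptions, and the pull step assumes an undirected graph so that `adj[v]` also lists in-neighbors.

```python
# Hypothetical sketch of direction-optimized BFS (push-pull).
# Push corresponds to a sparse vector-matrix product; pull corresponds to a
# masked product whose output sparsity (the unvisited set) is known a priori.

def bfs_push_pull(adj, source, switch_ratio=0.05):
    """Level-synchronous BFS over an adjacency-list graph `adj`.

    Chooses push (scatter from the frontier) when the frontier is small,
    and pull (gather into unvisited vertices) when it is large.
    """
    n = len(adj)
    visited = {source}
    frontier = {source}
    level = {source: 0}
    depth = 0
    while frontier:
        depth += 1
        if len(frontier) < switch_ratio * n:
            # Push: visit only the frontier's out-edges (SpMSpV-like work).
            nxt = {v for u in frontier for v in adj[u] if v not in visited}
        else:
            # Pull: iterate unvisited vertices (the mask) and probe their
            # neighbors; assumes adj[v] doubles as the in-neighbor list.
            nxt = {v for v in range(n) if v not in visited
                   and any(u in frontier for u in adj[v])}
        for v in nxt:
            level[v] = depth
        visited |= nxt
        frontier = nxt
    return level
```

The pull branch stops probing a vertex as soon as one frontier neighbor is found, which is the short-circuit that makes the bottom-up direction pay off on large frontiers.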
Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key
primitive for many high performance graph algorithms as well as for some linear
solvers, such as algebraic multigrid. Here we show that SpGEMM also yields
efficient algorithms for general sparse-matrix indexing in distributed memory,
provided that the underlying SpGEMM implementation is sufficiently flexible and
scalable. We demonstrate that our parallel SpGEMM methods, which use
two-dimensional block data distributions with serial hypersparse kernels, are
indeed highly flexible, scalable, and memory-efficient in the general case.
This algorithm is the first to yield increasing speedup on an unbounded number
of processors; our experiments show scaling up to thousands of processors in a
variety of test scenarios.
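The reduction from sparse-matrix indexing to SpGEMM can be illustrated with a small sketch: extracting A(I, J) is two multiplications by boolean selection matrices, A(I, J) = R · A · Q. The dict-of-dicts format and the helper names below are ours for illustration, not the paper's distributed implementation.

```python
# Submatrix extraction via SpGEMM: A(I, J) = R * A * Q, where R picks rows
# (R[r][I[r]] = 1) and Q picks columns (Q[J[c]][c] = 1).

def spgemm(A, B):
    """Row-wise sparse product over dict-of-dicts {row: {col: val}}."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a in row.items():
            for j, b in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a * b
        if acc:
            C[i] = acc
    return C

def extract(A, I, J):
    """Return A(I, J) using two SpGEMMs with selection matrices."""
    R = {r: {i: 1} for r, i in enumerate(I)}   # row selector
    Q = {j: {c: 1} for c, j in enumerate(J)}   # column selector
    return spgemm(spgemm(R, A), Q)
```

Because R and Q have one nonzero per selected row/column, they are extremely sparse (hypersparse in the distributed 2D setting), which is why a flexible SpGEMM is the right primitive here.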
Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
This paper shows how to generate code that efficiently converts sparse
tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL,
and many others. We decompose sparse tensor conversion into three logical
phases: coordinate remapping, analysis, and assembly. We then develop a
language that precisely describes how different formats group together and
order a tensor's nonzeros in memory. This lets a compiler emit code that
performs complex remappings of nonzeros when converting between formats. We
also develop a query language that can extract statistics about sparse tensors,
and we show how to emit efficient analysis code that computes such queries.
Finally, we define an abstract interface that captures how data structures for
storing a tensor can be efficiently assembled given specific statistics about
the tensor. Disparate formats can implement this common interface, thus letting
a compiler emit optimized sparse tensor conversion code for arbitrary
combinations of many formats without hard-coding for any specific combination.
Our evaluation shows that the technique generates sparse tensor conversion
routines with performance between 1.00× and 2.01× that of hand-optimized
versions in SPARSKIT and Intel MKL, two popular sparse linear algebra
libraries. By emitting code that avoids materializing temporaries, which
both libraries need for many combinations of source and target formats, our
technique outperforms those libraries by 1.78× to 4.01× for CSC/COO to
DIA/ELL conversion.
Comment: Presented at PLDI 202
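The three phases named in the abstract can be seen in even the simplest conversion, COO to CSR. The sketch below is a minimal hand-written instance of the pattern (the paper generates such routines automatically); the function name and argument layout are our assumptions.

```python
# COO -> CSR conversion in three phases: coordinate remapping, analysis,
# assembly. `coords` is a list of (row, col) pairs, `vals` the nonzero values.

def coo_to_csr(nrows, coords, vals):
    # Phase 1: coordinate remapping -- order nonzeros row-major.
    order = sorted(range(len(vals)), key=lambda k: coords[k])
    # Phase 2: analysis -- a "query" counting nonzeros per row.
    counts = [0] * nrows
    for i, _ in coords:
        counts[i] += 1
    # Phase 3: assembly -- prefix-sum the counts into the row pointer,
    # then place column indices and values in remapped order.
    rowptr = [0] * (nrows + 1)
    for i in range(nrows):
        rowptr[i + 1] = rowptr[i] + counts[i]
    colind = [coords[k][1] for k in order]
    data = [vals[k] for k in order]
    return rowptr, colind, data
```

For formats like DIA or ELL, the analysis phase would instead query statistics such as the number of occupied diagonals or the maximum row length, which is what the paper's query language expresses.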
Unraveling the functional dark matter through global metagenomics
Comment: 30 pages, 4 figures, 1 table, supplementary information https://doi.org/10.1038/s41586-023-06583-7
Data availability: All of the analysed datasets along with their corresponding sequences are available from the IMG system (http://img.jgi.doe.gov/). A list of the datasets used in this study is provided in Supplementary Data 8. All data from the protein clusters, including sequences, multiple alignments, HMM profiles, 3D structure models, and taxonomic and ecosystem annotation, are available through NMPFamsDB, publicly accessible at www.nmpfamsdb.org. The 3D models are also available at ModelArchive under accession code ma-nmpfamsdb.
Code availability: Sequence analysis was performed using Tantan (https://gitlab.com/mcfrith/tantan), BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi), LAST (https://gitlab.com/mcfrith/last), HMMER (http://hmmer.org/) and HH-suite3 (https://github.com/soedinglab/hh-suite). Clustering was performed using HipMCL (https://bitbucket.org/azadcse/hipmcl/src/master/). Additional taxonomic annotation was performed using Whokaryote (https://github.com/LottePronk/whokaryote), EukRep (https://github.com/patrickwest/EukRep), DeepVirFinder (https://github.com/jessieren/DeepVirFinder) and MMseqs2 (https://github.com/soedinglab/MMseqs2). 3D modelling was performed using AlphaFold2 (https://github.com/deepmind/alphafold) and TrRosetta2 (https://github.com/RosettaCommons/trRosetta2). Structural alignments were performed using TMalign (https://zhanggroup.org/TM-align/) and MMalign (https://zhanggroup.org/MM-align/). All custom scripts used for the generation and analysis of the data are available at Zenodo (https://doi.org/10.5281/zenodo.8097349)
Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities1,2. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes.
Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database3. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.
With the institutional support of the "Severo Ochoa Centre of Excellence" accreditation (CEX2019-000928-S). Peer reviewed.
A work-efficient parallel sparse matrix-sparse vector multiplication algorithm.
We design and develop a work-efficient multithreaded algorithm for sparse
matrix-sparse vector multiplication (SpMSpV) where the matrix, the input
vector, and the output vector are all sparse. SpMSpV is an important primitive
in the emerging GraphBLAS standard and is the workhorse of many graph
algorithms including breadth-first search, bipartite graph matching, and
maximal independent set. As thread counts increase, existing multithreaded
SpMSpV algorithms can spend more time accessing the sparse matrix data
structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is
work-efficient in the sense that its total work is proportional to the number
of arithmetic operations required. The key insight is to avoid having each
thread individually scan the list of matrix columns.
Our algorithm is simple to implement and operates on existing column-based
sparse matrix formats. It performs well on diverse matrices and vectors with
heterogeneous sparsity patterns. A high-performance implementation of the
algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and
up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to
implementations of existing algorithms, the performance of our algorithm is
sustained on a variety of different input types, including matrices
representing scale-free and high-diameter graphs.
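The work-efficiency property has a simple serial analogue: with a column-based matrix format, y = A·x only needs to touch the columns where x is nonzero, so the work is proportional to the arithmetic performed rather than to the matrix dimension. The sketch below shows that access pattern; it is our illustration, serial for clarity, whereas the paper's contribution is keeping this property under multithreading.

```python
# SpMSpV sketch: sparse matrix (column-based) times sparse vector.
# A_cols maps each column index to its list of (row, value) nonzeros;
# x maps column indices to values; the result y is also sparse.

def spmspv(A_cols, x):
    y = {}
    for j, xj in x.items():             # visit only columns where x is nonzero
        for i, a in A_cols.get(j, ()):  # scatter-accumulate into the output
            y[i] = y.get(i, 0) + a * xj
    return y
```

In a BFS built on this primitive, x is the current frontier and y the candidate next frontier, which is why SpMSpV is the workhorse the abstract describes.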
Highly Parallel Sparse Matrix-Matrix Multiplication
Abstract. Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
Key words. parallel computing, numerical linear algebra, sparse matrix-matrix multiplication, SpGEMM, sparse matrix indexing, sparse matrix assignment, two-dimensional data decomposition, hypersparsity, graph algorithms, sparse SUMMA, subgraph extraction, graph contraction, graph batch update
Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in
various graph, scientific computing and machine learning algorithms. In this
paper, we consider SpGEMMs performed on hundreds of thousands of processors
generating trillions of nonzeros in the output matrix. Distributed SpGEMM at
this extreme scale faces two key challenges: (1) high communication cost and
(2) inadequate memory to generate the output. We address these challenges with
an integrated communication-avoiding and memory-constrained SpGEMM algorithm
that scales to 262,144 cores (more than 1 million hardware threads) and can
multiply sparse matrices of any size as long as inputs and a fraction of output
fit in the aggregated memory. As we go from 16,384 cores to 262,144 cores on a
Cray XC40 supercomputer, the new SpGEMM algorithm runs 10x faster when
multiplying large-scale protein-similarity matrices.
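The memory-constrained idea, that the output need not fit in memory all at once, can be sketched in miniature: multiply A against column batches of B and emit one output slab at a time. This is our single-process illustration of the batching principle only; the generator name and dict-of-dicts format are assumptions, and the paper's algorithm additionally avoids communication across hundreds of thousands of cores.

```python
# Batched SpGEMM sketch: when the full C = A * B cannot be held in memory,
# restrict B to a window of `batch` columns and produce C slab by slab.
# Matrices are dict-of-dicts {row: {col: val}}.

def batched_spgemm(A, B, ncols, batch):
    for lo in range(0, ncols, batch):
        hi = min(lo + batch, ncols)
        # Restrict B to the current column batch [lo, hi).
        Bslab = {k: {j: v for j, v in row.items() if lo <= j < hi}
                 for k, row in B.items()}
        C = {}
        for i, row in A.items():
            acc = {}
            for k, a in row.items():
                for j, b in Bslab.get(k, {}).items():
                    acc[j] = acc.get(j, 0) + a * b
            if acc:
                C[i] = acc
        yield lo, C  # slab of C covering columns [lo, hi)
```

Each slab can be written out or consumed before the next is formed, so peak memory scales with the batch width rather than with the trillions of output nonzeros the abstract mentions.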