12,481 research outputs found
Graph Expansion and Communication Costs of Fast Matrix Multiplication
The communication cost of algorithms (also known as I/O-complexity) is shown
to be closely related to the expansion properties of the corresponding
computation graphs. We demonstrate this on Strassen's and other fast matrix
multiplication algorithms, and obtain first lower bounds on their communication
costs.
In the sequential case, where the processor has a fast memory of size ,
too small to store three -by- matrices, the lower bound on the number of
words moved between fast and slow memory is, for many of the matrix
multiplication algorithms, ,
where is the exponent in the arithmetic count (e.g., for Strassen, and for conventional matrix multiplication).
With parallel processors, each with fast memory of size , the lower
bound is times smaller.
These bounds are attainable both for sequential and for parallel algorithms
and hence optimal. These bounds can also be attained by many fast algorithms in
linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester
equation)
Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many
high-performance graph algorithms as well as for some linear solvers, such as
algebraic multigrid. The scaling of existing parallel implementations of SpGEMM
is heavily bound by communication. Even though 3D (or 2.5D) algorithms have
been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi
matrices, those algorithms had not been implemented in practice and their
complexities had not been analyzed for the general case. In this work, we
present the first ever implementation of the 3D SpGEMM formulation that also
exploits multiple (intra-node and inter-node) levels of parallelism, achieving
significant speedups over the state-of-the-art publicly available codes at all
levels of concurrencies. We extensively evaluate our implementation and
identify bottlenecks that should be subject to further research
Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
A parallel algorithm has perfect strong scaling if its running time on P
processors is linear in 1/P, including all communication costs.
Distributed-memory parallel algorithms for matrix multiplication with perfect
strong scaling have only recently been found. One is based on classical matrix
multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's
fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz,
2012). Both algorithms scale perfectly, but only up to some number of
processors where the inter-processor communication no longer scales.
We obtain a memory-independent communication cost lower bound on classical
and Strassen-based distributed-memory matrix multiplication algorithms. These
bounds imply that no classical or Strassen-based parallel matrix multiplication
algorithm can strongly scale perfectly beyond the ranges already attained by
the two parallel algorithms mentioned above. The memory-independent bounds and
the strong scaling bounds generalize to other algorithms.Comment: 4 pages, 1 figur
Algebraic Methods in the Congested Clique
In this work, we use algebraic methods for studying distance computation and
subgraph detection tasks in the congested clique model. Specifically, we adapt
parallel matrix multiplication implementations to the congested clique,
obtaining an round matrix multiplication algorithm, where
is the exponent of matrix multiplication. In conjunction
with known techniques from centralised algorithmics, this gives significant
improvements over previous best upper bounds in the congested clique model. The
highlight results include:
-- triangle and 4-cycle counting in rounds, improving upon the
triangle detection algorithm of Dolev et al. [DISC 2012],
-- a -approximation of all-pairs shortest paths in
rounds, improving upon the -round -approximation algorithm of Nanongkai [STOC 2014], and
-- computing the girth in rounds, which is the first
non-trivial solution in this model.
In addition, we present a novel constant-round combinatorial algorithm for
detecting 4-cycles.Comment: This is work is a merger of arxiv:1412.2109 and arxiv:1412.266
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Chameleon: A Hybrid Secure Computation Framework for Machine Learning Applications
We present Chameleon, a novel hybrid (mixed-protocol) framework for secure
function evaluation (SFE) which enables two parties to jointly compute a
function without disclosing their private inputs. Chameleon combines the best
aspects of generic SFE protocols with the ones that are based upon additive
secret sharing. In particular, the framework performs linear operations in the
ring using additively secret shared values and nonlinear
operations using Yao's Garbled Circuits or the Goldreich-Micali-Wigderson
protocol. Chameleon departs from the common assumption of additive or linear
secret sharing models where three or more parties need to communicate in the
online phase: the framework allows two parties with private inputs to
communicate in the online phase under the assumption of a third node generating
correlated randomness in an offline phase. Almost all of the heavy
cryptographic operations are precomputed in an offline phase which
substantially reduces the communication overhead. Chameleon is both scalable
and significantly more efficient than the ABY framework (NDSS'15) it is based
on. Our framework supports signed fixed-point numbers. In particular,
Chameleon's vector dot product of signed fixed-point numbers improves the
efficiency of mining and classification of encrypted data for algorithms based
upon heavy matrix multiplications. Our evaluation of Chameleon on a 5 layer
convolutional deep neural network shows 133x and 4.2x faster executions than
Microsoft CryptoNets (ICML'16) and MiniONN (CCS'17), respectively
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm as well as the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and graph decompositions such as 1D
and 2D decomposition.Comment: arXiv admin note: text overlap with arXiv:1104.451
- …