Sparse Matrix Multiplication Kernels by Dusefante, Matteo
Sparse Matrix Multiplication Kernels
Matteo Dusefante
Advisor: Riko Jacob
Submitted: August 2018

iii
Abstract
The fast progress of information technology in the present years
led to a substantial growth in terms of the volume of data to be an-
alyzed and the size of the problems needed to be solved. Since
the Eighties, the formulation of out-of-memory models of com-
putation was motivated by the intractability of the problems by
the prevailing architectures. This trend is expected to grow even
further. Therefore, algorithm designers have to provide solutions
specifically tailored for big data processing.
The design of algorithms often relies on kernels, basic yet essen-
tial primitives that may be invoked several times by an algorithm
during its execution. It is therefore fundamental to rely on theoret-
ically efficient, highly optimized and high performance kernels.
In the field of linear algebra, the most common primitives con-
cern the manipulation of data vectors, such as matrix-matrix mul-
tiplication or matrix-vector multiplication. Matrix multiplication
has countless applications, from graph algorithms, machine learn-
ing, to computer graphics and image processing. Its prominence is
validated by the almost half century old research field that aims at
providing optimal algorithms for matrix multiplication and trying
to answer a long-standing conjecture.
This dissertation aims at providing efficient kernels for sparse
matrix multiplication. The main contributions are the following:
(i) we present new Monte Carlo data structures for sparse output-
sensitive matrix multiplication based on repeated predecessor
queries that are used to find specific random linear combina-
tion of vectors in sparse matrices,
(ii) we introduce new Atlantic City algorithms for sparse Boolean
matrix multiplication derived from a novel framework that ex-
ploits size estimators for sparse matrix products,
(iii) we design new Monte Carlo algorithms for fast sparse matrix
multiplication that combine several existing frameworks to pro-
vide improved bounds for sparse matrix multiplication,
(iv) we present new deterministic data structures for permutation
matrices in the Parallel External Memory model that translate into
efficient and experimentally evaluated algorithms for permut-
ing in concurrent out-of-memory models.
The result is a collection of randomized and deterministic al-
gorithms for sparse matrix computations that are meant to provide
improvements over the state of the art in time or memory efficiency.

vResumé
Informationsteknologiens hurtige udvikling medfører at stadig
større mængder data skal analyseres og større problemer skal lø-
ses. Siden 1980’erne har motivationen for maskinmodeller med ek-
stern hukommelse været baseret på intraktibiliteten af store proble-
mer i tidens fremherskende maskinarkitekturer. Denne udvikling
forventes at fortsætte. Det er nødvendigt at algoritmedesignere ud-
vikler løsninger som er skræddersyet til behandlingen af Big Data.
Design af algoritmer bygger ofte på kerner. En kerne er en sim-
pel, men essentiel grundalgoritme, som kan kaldes flere gange i
løbet af kørslen af en algoritme. Det er derfor vigtigt at algoritme-
designere har adgang til kerner som har teoretiske effektivitetsga-
rantier og samtidig fungerer godt i praksis.
Indenfor lineær algebra bruges de mest almindelige kerner til
manipulation af datavektorer såsom matrix-matrix og matrix-vektor
multiplikation. Matrixmultiplikation har utallige anvendelser som
strækker sig fra grafalgoritmer og maskinlæring til computergrafik
og billedbehandling. Problemets fremtrædende betydning er un-
derstøttet af op mod et halvt århundrede af stadig aktiv forskning
med mål om at vise optimale algoritmer til matrixmultiplikation.
Målet med denne afhandling er at vise effektive kerner til mul-
tiplikation af tynde matricer. Hovedbidragene er som følger:
(i) Vi præsenterer nye Monte Carlo datastrukturer til resultatføl-
som tynd matrixmultiplikation baseret på gentagne prædeces-
sorforespørgsler som anvendes til at finde bestemte tilfældige
linearkombinationer af vektorer i tynde matricer,
(ii) Vi introducerer nye Atlantic City algoritmer til multiplikation af
tynde Boolske matricer som udnytter størrelsesestimatorer for
tynde matrixprodukter,
(iii) Vi designer nye Monte Carlo algoritmer til hurtig multiplikation
af tynde matricer som kombinerer flere eksisterende tilgange
for at vise forbedrede øvre grænser,
(iv) Vi præsenterer nye deterministiske datastrukturer til permuta-
tionsmatricer i den Parallelle Eksterne Hukommelsesmodel som
medfører effektive og eksperimentelt understøttede algorit-
mer til at udføre permutationer i sideløbende eksterne hukom-
melsesmodeller.
Resultatet er en samling af randomiserede og deterministiske
algoritmer til beregninger på tynde matricer som er mere tid- eller
pladseffektive end hvad hidtil har været kendt.

vii
Acknowledgments
First of all I would like to show my gratitude to my supervisor,
Professor Riko Jacob, for guiding me through the years of my PhD.
I thank him for the time he found to listen and discuss my ideas,
for introducing me to a new field of research and for supporting
me in the quest of becoming a better and independent researcher.
My gratitude goes also to Professor Micheal T. Goodrich for
hosting me at University of California at Irvine in the summer of
2018. Having had the chance to visit the Center for Algorithms and
Theory of Computation at UCI was a great honor.
My office mates and colleagues deserve my deepest apprecia-
tion: Daniel, Johan, Martin, Thomas, Tobias. My PhD would def-
initely not have been the same without such amazing colleagues.
A special acknowledgment goes to Johan for proofreading part of
this dissertation and to Tobias for the translation of the abstract.
The Algorithm group at ITU was a very pleasant and motivat-
ing workplace. During my PhD I was privileged to meet outstand-
ing researchers who were capable to make from research seminars
to lunch conversations inspiring moments.
A special thanks goes to the “Italians” at ITU: Francesco, Hugo,
Marco, Paolo, Rosario. You were able to make my journey an expe-
rience worth remembering both inside and outside ITU. A special
mention goes to Paolo for the countless “coffee breaks” and en-
lightening conversations. Grazie.
I would like to personally thank the PhD students at the Center
for Algorithms and Theory of Computation of the University of
California at Irvine for the pleasant stay.
On a person level, I would like to thank my long time friends
Gioel and Mattia. Thank you for all the good memories. A special
acknowledgment goes to Mattia to whom I owe my decision of
pursuing a PhD abroad and many unforgettable moments.
Last but not least, I would like to thank my parents, Dino and
Maria. I owe to you a debt of gratitude for your constant support
and neverending guidance throughout my life. Finally, I am deeply
grateful to Ilaria for her endless amount of encouragement and
support during the years of my PhD.

Contents
Contents ix
1 Introduction 1
1.1 Synopsis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Matrix Multiplication in Linear Algebra . . . . . . . . . . . . . . . . . 6
1.3 Matrix Multiplication and Graph Theory . . . . . . . . . . . . . . . . 7
1.4 Algebraic structures . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Models of Computation . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7 Lower Bounds and Conjectures . . . . . . . . . . . . . . . . . . . . . 21
1.8 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Cache Oblivious Sparse Matrix Multiplication 25
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Comparison with the Related Work . . . . . . . . . . . . . . . . . . . 27
2.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Probabilistic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3 Atlantic City Boolean Matrix Multiplication 43
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Comparison with the Related Work . . . . . . . . . . . . . . . . . . . 45
3.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Probabilistic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 Empirical Study of Sparse Submatrices . . . . . . . . . . . . . . . . . 58
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
x Contents
4 Compressed Sparse Matrix Multiplication 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Comparison with the Related work . . . . . . . . . . . . . . . . . . . 66
4.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 Permuting in Parallel External Memory 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Comparison with the Related Work . . . . . . . . . . . . . . . . . . . 81
5.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.5 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.7 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6 Conclusions 107
Bibliography 113
Chapter 1
Introduction
Matrix multiplication is a fundamental operation in computer science
and mathematics with numerous applications in graph algorithms and
combinatorial optimization. Matrix multiplication kernels can be ex-
ploited to obtain improved algorithms for finding cycles in graphs, small
subgraphs and cliques, shortest paths, for designing improved dynamic
reachability algorithms and for solving matching problems.
Given two n × n matrices A and C, the matrix product AC can be
trivially computed with O(n3) arithmetic operations. Strassen [Str69],
in 1969, provided a recursive algorithm able to multiply two matrices in
O(nω) with ω = log2 7 = 2.807 by cleverly exploiting cancellations. Be-
side this, he contributed to lay the foundations of a new, prolific research
path [Pan78, BCRL79, Sch81, Pan81, Rom82, CW82, Str86, CW87, Sto10,
Wil12]. The last known result is due to Le Gall [LG14] who holds the
current record of ω < 2.3728639. In his paper, Le Gall analyzed the pow-
ers of a specific trilinear form introduced by Coppersmith and Winograd
combined with convex optimization methods, obtaining improved up-
per bounds on the asymptotic complexity of matrix multiplication.
The vast literature shows that the problem of matrix multiplication
over a ring admits truly subcubic algebraic algorithms. That is, it is pos-
sible to multiply n× n matrices using O(n3−δ) additions and multiplica-
tions, for δ > 0. Nevertheless, over algebraic structures that do not guar-
antee inverse elements, Strassen-like techniques are not allowed which
translate to asymptotically worse bounds. For instance, in the Boolean
semiring, the best known upper bound for multiplying n× n matrices
is due to Yu [Yu15] whose combinatorial algorithm uses O˜(n3/ log4 n)
operations. What is more, it is conjectured that there exist no truly sub-
2 Chapter 1. Introduction
cubic combinatorial algorithm for Boolean matrix multiplication. Using
matrix multiplication over algebraic structures without inverse elements
can lead to interesting results. For example, the all pairs bottleneck
paths problem on weighted graphs, where it is required to find a maxi-
mum capacity path from all pairs of nodes in the graph, can be solved
using matrix multiplication over the (min, max) semiring [WW10].
In several application, the input matrices are sparse, i.e. the num-
ber of nonzero elements is upper bounded by o(n2) and in several in-
stances it is upper bounded by O(n). In this regard, sparse matrix prod-
ucts are a central kernel in many algorithms, ranging from machine
learning, data mining to graph analysis. Sparse matrix-matrix products
have been used for multi-source breadth-first search, Markov clustering,
high-dimensional and topological similarity search. In numerical appli-
cations, sparse matrix computations appear in the Algebraic Multigrid
method for solving sparse system of linear equations, linear systems,
eigenvalues and least squares problems. A non exhaustive list of spe-
cific applications can be found in [BG12, NMAB18].
As mentioned, matrix multiplication has applications in graph the-
ory. Specifically, consider a directed graph G(V, E) with n vertices and
h edges. The adjacency matrix AG associated with the graph G(V, E) is
an n× n matrix with h nonzero entries. Real-world graphs commonly
exhibit sparse structures, as confirmed by the Stanford Large Network
Dataset Collection [LK14], see Figure 1.1. For instance, the web graph
representing web pages from berkely.edu and stanford.edu domains
collected in 2002 have average degree d = 2|E|/|V| ≤ 23, [LLDM09],
where directed edges represent hyperlinks between web pages. Simi-
larly, Facebook and Twitter social network graphs have an average de-
gree d ≤ 43, [LM12]. Graphs with sparser structures can be found in
road networks, e.g. the California and the Texas road network are low-
dimensional networks with average degree d ≤ 3, [LLDM09].
This dissertation contributes with new kernels for matrix computa-
tions. The algorithms and data structures are specifically designed for
matrices that exhibit a sparse structure, in the input or in the output.
We give particular emphasis to big data processing by designing and
analyzing kernels on several computational models. In relation to this,
we assume that the size of the input is too large to be stored in main
memory and we design our algorithms with a specific focus on out-of-
memory and concurrent models.
1.1. Synopsis 3
0 500 1000
0
500
1000
(a)
0 2000 4000
0
2000
4000
(b)
0 500 1000
0
500
1000
(c)
0 500 1000
0
500
1000
(d)
Figure 1.1: Matrix representation of (a) the Berkeley and Stanford do-
mains, (b) Facebook combined egonets, (c) California road network and
(d) Texas road network. Matrices (a), (c) and (d) have been collapsed
to matrices of size 103 × 103. The datasets are from the Stanford Large
Network Dataset Collection [LK14].
1.1 Synopsis
The present dissertation is structured in six chapters. Chapter 1 forms
the introduction of the thesis. Chapter 2, Chapter 3, Chapter 4 and
Chapter 5 are based on research papers listed in detail in the sequel.
Each research chapter is structured to give a short presentation at the
4 Chapter 1. Introduction
beginning of each chapter followed by a detailed description of the con-
tribution and a thorough comparison with the related work. For each
research chapter we identify open questions and future working direc-
tions which are summarized in Chapter 6, together with concluding
remarks derived from this dissertation.
Chapter 1 The chapter introduces the problem of matrix multiplica-
tion. We briefly survey the history of the problem starting from the
seminal paper by Strassen [Str69]. We proceed by providing motivation
for studying matrix multiplication problems which leads to two funda-
mental research areas: linear algebra and graph theory. The chapter con-
tinues by providing preliminaries for the upcoming research chapters.
We analyze more thoroughly the problem of matrix multiplication by
introducing algebraic structures and the current hardware design which
leads to the review of the models of computation taken into consider-
ation in this dissertation. We conclude the chapter by introducing the
probabilistic tools for the analysis of our randomized algorithms and
by giving an overview of the lower bounds and conjectures revolving
around matrix multiplication.
Chapter 2 The chapter introduces new randomized algorithms for
sparse matrix multiplication in the Word RAM model, in the Cache
Oblivious model and in the Parallel External Memory model. Specifi-
cally, we present Monte Carlo data structures for output-sensitive sparse
matrix multiplication that extend the algorithms from [WY14] to the
sparse input case. In the Word RAM model, the time bounds of our
algorithms match with the state of the art algorithms for sparse matrix
multiplication. However, we are able to compute sparse matrix prod-
ucts using linear space, improving the space complexity compared to
the state of the art. The chapter is based on the paper Cache Oblivious
Sparse Matrix Multiplication by Matteo Dusefante and Riko Jacob, ap-
peared on the proceedings of the 2018 Latin American Symposium on
Theoretical Informatics [DJ18].
Chapter 3 In this chapter we study the problem of Boolean Matrix mul-
tiplication in the Word RAM model and in the Cache Oblivious model
and we introduce Atlantic City algorithms for Boolean matrix multipli-
cation. The algorithms in this chapter solve Boolean matrix multipli-
cation by embedding the Boolean semiring into a larger semiring and
1.1. Synopsis 5
by exploiting size estimation algorithm in a novel way. Motivated by
the impracticality of fast matrix multiplication à la Coppersmith and
Winograd, if we disallow the use of this class of algorithms, then the
algorithm presented in this chapter improves over the current state of
the art by a polylog factor. The chapter is based on the unpublished
manuscript Atlantic City Boolean Matrix Multiplication by Matteo Duse-
fante [Dus18a].
Chapter 4 In this chapter we investigate the problem of Sparse Matrix
multiplication in the Word RAM model and in the External Memory
model, when fast matrix multiplication is allowed as a subroutine. The
result is new partitioning techniques and we show how combining the
combinatorial framework of Yuster and Zwick [YZ05] and Amossen and
Pagh [AP09] with randomized compression techniques by Jacob and
Stöckel [JS15] yields improved bounds for sparse matrix multiplication.
The chapter is based on the unpublished manuscript Compressed Sparse
Matrix Multiplication by Matteo Dusefante [Dus18b].
Chapter 5 This chapter investigates empirically the problem of per-
muting in the Parallel External Memory model and we present a col-
lection of algorithms, with related C++ implementation, that permute
records in the Parallel External Memory model. This chapter serves
the purpose of giving motivation for designing algorithms in external
memory models. Influenced by the empirical studies of Vuduc [VD03],
we study how cache-friendly algorithms operate on modern mutlicore
architectures and we achieve significant speed-ups for the permuting
phase of our algorithms. This chapter is based on the unpublished
manuscript An Empirical Evaluation of Permuting in Parallel External Mem-
ory by Matteo Dusefante and Riko Jacob [DJ17].
Chapter 6 This chapter aims at summarizing the contributions of this
dissertation and to present new open problems and future research di-
rections arose in each chapter. For each research chapter, we highlight
the contributions and we guide the reader through each open problem
via a detailed discussion.
6 Chapter 1. Introduction
1.2 Matrix Multiplication in Linear Algebra
The natural applications of matrix multiplication reside in linear alge-
bra. A useful way to understand its algebraic significance is by inves-
tigating how matrix multiplication is defined. In fact, the most natural
way of defining multiplication between matrices is to consider the el-
ementwise product between entries. The latter, known as Hadamard
product, although being used for lossy compression algorithms, does
not capture the true significance of matrix multiplication: the compo-
sition of linear transformations. It should be mentioned that there are
several ways of defining matrix products, e.g. the Hadamard product,
the Kronecker product. In this thesis we refer to matrix multiplication
as composition of linear transformations.
A transformation is a map between two vector spaces. A linear trans-
formation is a map that preserves the operations of addition and scalar
multiplication. Given a linear transformation T : Rm → Rn, the lin-
earity implies the following properties: T(u + v) = T(u) + T(v) and
T(αu) = αT(u) for α ∈ R and u, v ∈ Rm.
Consider a linear transformation S : Rn → Rp. It maps a n-
dimensional vector space to a p-dimensional vector space. The linear
transformation S can be represented as a matrix C ∈ Rp×n where it
holds S(u) = Cu with u ∈ Rp. Consider another linear transformation
T : Rp → Rm. As above, it holds T(v) = Av with v ∈ Rp and A ∈ Rm×p
is the matrix representation of T.
Given the linear transformations S and T, suppose we want to com-
pose them thereby generating the transformation T ◦ S : Rn → Rm de-
fined as (T ◦ S)(u) = T(S(u)). Note that the the transformation T ◦ S
is linear since T(S(u + v)) = T(S(u) + S(v)) = T(S(u)) + T(S(v)) and
T(S(αu)) = T(αS(u)) = αT(S(u)) for the linearity of T and S. The
question now becomes, what is the matrix D associated with the transforma-
tion T ◦ S? By construction, we know that the transformation T ◦ S maps
a n-dimensional vector space to a m-dimensional vector space. Hence,
the matrix D ∈ Rm×n. Consider the standard basis for the vector space
Rn, i.e. the n× n identity matrix In, where (In)∗,j is the j-th column vec-
tor. Recall that A ∈ Rm×p and C ∈ Rp×n. Define D as n, m-dimensional
column vectors as follows,
D =
(
A
(
C(In)∗,1
)
A
(
C(In)∗,2
) · · · A (C(In)∗,n)) ,
which is equal to
D =
(
A
(
C∗,1
)
A
(
C∗,2
) · · · A (C∗,n)) = AC. (1.1)
1.3. Matrix Multiplication and Graph Theory 7
Formula 1.1 defines matrix multiplication as a composition of linear
transformations. It is clear that, the applications of matrix multiplication
in linear algebra reside on scenarios where it is necessary to compose
transformations, e.g. roto-translation, scaling or shearing. The latter are
fundamental operations in the fields of computer graphics and image
processing.
1.3 Matrix Multiplication and Graph Theory
The relation between matrices and graphs is straightforward: every
graph G(V, E) can be represented through an adjacency matrix AG,
where (AG)i,j = 1 if and only if (i, j) ∈ E with i, j ∈ V. Furthermore,
let |V| = n and |E| = h. This affinity makes possible to solve graph
related problems using matrix multiplication. The most well known
connections between graphs and matrix multiplication revolve around
computing the transitive closure of a graph and solving the all pairs
shortest path problem.
The transitive closure G?(V, E?) of a graph G(V, E) is the graph in
which (u, v) ∈ E? if and only if there is a path from u to v in G. If AG
is the adjacency matrix of a graph, then (AkG)i,j = 1 if and only if there
is a path of length k from i to j. Furman [Fur70] and Munro [Mun71]
showed independently that AG? = In ∨ AG ∨ A2G ∨ · · · ∨ AnG (where In is
the n× n identity matrix) can be computed in O(log n) iterations using
the method of repeated squaring. The intuition behind the repeated
squaring method lies in the observation that the shortest path of at most
m edges is the same as the shortest path of at most n− 1 edges for all
m > n− 1. Hence, it is sufficient to compute (AG)2k+1 = (AG)2k ∨ (AG)2k
for k ≤ log n. This leads to an algorithm for computing the transitive
closure in time O(nω log n).
The distance version of the all pairs shortest path problem was
solved by Seidel [Sei95] who presented an algorithm for undirected, un-
weighted graphs with n vertices that runs in O(nω log n) time together
with a randomized algorithm for finding a shortest path between each
pair of vertices with the same time bound. Shoshan and Zwick [SZ99]
presented a pseudo-polynomial time algorithm for the all pairs shortest
path problem for undirected, weighted graphs of n vertices that runs in
O˜(Wnω), where the edge weights are integers and W is the largest edge
weight in the graph. The all pairs shortest paths problem for weighted
8 Chapter 1. Introduction
and directed graphs was solved by Zwick [Zwi02] who provided an al-
gorithm using O(n2+µ) time, where ω(1, µ, 1) = 1+ 2µ is the exponent
of fast matrix multiplication between matrices of size n× nµ and nµ× n.
The problem of triangle finding and matrix multiplication are in-
trinsically connected. A triangle in a graph G(V, E) with n vertices is
a three node subgraph {(u, v), (v, w), (w, u)} such that u, v, w ∈ V and
{(u, v), (v, w), (w, u)} ∈ E. That is, the nodes u, v and w are connected
to form a triangle. In order to identify a triangle in G(V, E), consider
the n× n adjacency matrix AG associated with G and compute A2G. If
(A2G)i,j 6= 0 then there exists k ∈ [n] such that Ai,k 6= 0 and Ak,j 6= 0. If
Ai,j 6= 0 then there is a triangle in G, namely {(i, j), (j, k), (k, i)} ∈ E. It
follows that, an algorithm for finding triangles in graphs has complexity
at most the one required to square a matrix, i.e. O(nω).
The algorithm just presented for triangle finding is algebraic in na-
ture and it achieves subcubic asymptotic complexity by mapping the
connectivity matrix into a matrix with entries from a ring. Nevertheless,
when considering algebraic structures without subtraction, Strassen-
like algorithms are not allowed, as the additive inverse is not guar-
anteed. In contrast to algebraic algorithms, Boolean matrix multipli-
cation admits combinatorial algorithms. The most famous combinato-
rial algorithm is the Four Russians algorithm by Arlazarov, Dinic, Kro-
nrod, and Faradzhev [ADFK70], which stores precomputed lookup ta-
bles of polynomial size, used to answer queries of O(log n) bits in con-
stant time. The result is an algorithm to multiply Boolean matrices us-
ing O(n3/ log2 n) time, with words of size Θ(log n). As mentioned,
the last known result belongs to Yu [Yu15] whose algorithm achieves
O˜(n3/ log4 n) time. Furthermore, Williams and Williams [WW10]
proved equivalences between Boolean matrix multiplication and detect-
ing triangles in a graph. That is, either both problems have truly subcu-
bic combinatorial algorithms, or none of them do.
1.4 Algebraic structures
The problem of multiply matrices of m× p and p× n, as in Figure 1.2,
can be seen as mn inner products of size p. An alternative approach
to matrix multiplication requires the computation of p outer products
of size mn, see Figure 1.3. Each of these p outer products generates
an intermediate m × n matrix that contributes to the final output ma-
trix. Regardless of the technique used, matrix multiplication involves
1.4. Algebraic structures 9
〈A i
1
, C
1j
〉
〈A i
k
, C
kj
〉
〈A i
p
, C
pj
〉
+ . . .+
+ . . .+
A11 . . . A1k . . . A1p
...
. . .
...
...
...
Ai1 . . . Aik . . . Aip
...
...
...
. . .
...
Am1 . . . Amk . . . Amp


C11 . . . C1j . . . C1n
...
. . .
...
...
...
Ck1 . . . Ckj . . . Ckn
...
...
...
. . .
...
Cp1 . . . Cpj . . . Cpn


D11 . . . D1j . . . D1n
...
. . .
...
...
...
Di1 . . . Dij . . . Din
...
...
...
. . .
...
Dm1 . . . Dmk . . . Dmn


Figure 1.2: Falk’s scheme for visualizing the product between a m× p
matrix A with a p× n matrix C viewed as mn inner products 〈Ai,∗, C∗,j〉
of size p. Image template courtesy of texample.net/tikz/examples/
matrix-multiplication.
the product and the sum of entries from row and column vectors. The
entry type, i.e. the algebraic structure an entry belongs to, may simplify
or complicate the problem. For instance, the inner product of Boolean
vectors is nonzero if there exists a nonzero entry in the same coordinate
in both vectors. It follows that, Boolean matrix multiplication condenses
to detecting at least a witness for each of the mn inner products.
Definition 1.1. Let A ∈ {0, 1}n×n, C ∈ {0, 1}n×n and let AC ∈ {0, 1}n×n
be the product matrix. A witness for the entry (AC)i,j = 1 is an index κ ∈ [n]
such that Ai,κ = 1 and Cκ,j = 1.
Despite its apparent simplicity, we mentioned that there are no truly
subcubic algorithms for Boolean matrix multiplication. As surveyed in
Chapter 1, matrix multiplication is commonly defined among matrices
with elements from a ring. In the following, we review the main alge-
braic structures where matrix multiplication has been studied starting
10 Chapter 1. Introduction
with semirings. A semiring (S,+, ·) is a set S together with two binary
operations + and · satisfying the following axioms:
• Additive associativity: for all a, b, c ∈ S, (a + b) + c = a + (b + c),
• Additive commutativity: for all a, b ∈ S, a + b = b + a,
• Multiplicative associativity: for all a, b, c ∈ S, (a · b) · c = a · (b · c),
• Left and right distributivity: for all a, b, c ∈ S, a · (b+ c) = (a · b) +
(a · c) and (b + c) · a = (b · a) + (c · a).
The absence of an additive inverse in the semiring causes an in-
ner product to be nonzero if the partial inner product (or interme-
diate result) is nonzero. Examples of semirings include the set of
natural numbers under addition and multiplication, the non-negative
rational numbers and the non-negative real numbers. Several stud-
ies have been conducted on matrix multiplication in semiring mod-
els, e.g. [BBF+10, PS14]. Other more peculiar semirings include the
(max, min)-semiring used for the all pairs bottleneck paths problem and
the (min,+)-semiring used for the all pairs shortest paths problem.
A ring (R,+, ·) is a semiring satisfying the ring axioms:
• Additive identity: there exists an element 0 ∈ R such that for all
a ∈ R, 0+ a = a + 0 = a,
• Additive inverse: for every a ∈ R there exists −a ∈ R such that
a + (−a) = (−a) + a = 0.
A ring satisfying the multiplicative associativity is an associative
ring. As an example, the set of all algebraic integers forms a ring.
A field (F,+, ·) is a ring satisfying the additional following proper-
ties.
• Multiplicative commutativity: for all a, b ∈ F, a · b = b · a (also
referred as commutative ring),
• Multiplicative identity: there exists an element 1 ∈ F such that for
all a 6= 0 ∈ F, 1 · a = a · 1 = a (also referred as ring with identity),
• Multiplicative inverse: for each a 6= 0 ∈ F, there exists an element
a−1 ∈ F such that for all a 6= 0 ∈ F, a · a−1 = a−1 · a = 1, where 1 is
the identity element.
1.4. Algebraic structures 11
A ∗
,k
⊗ C
k,
∗
A11 . . . A1k . . . A1p
...
. . .
...
...
...
Ai1 . . . Aik . . . Aip
...
...
...
. . .
...
Am1 . . . Amk . . . Amp


C11 . . . C1j . . . C1n
...
. . .
...
...
...
Ck1 . . . Ckj . . . Ckn
...
...
...
. . .
...
Cp1 . . . Cpj . . . Cpn


Dk11 . . . D
k
1j . . . D
k
1n
...
. . .
...
...
...
Dki1 . . . D
k
ij . . . D
k
in
...
...
...
. . .
...
Dkm1 . . . D
k
mk . . . D
k
mn


Figure 1.3: Falk’s scheme for visualizing the product between a m ×
p matrix A with a p × n matrix C viewed as p outer products A∗,k ⊗
Ck,∗ of size mn. Each of the p outer products generates a m× n matrix
Dk = A∗,k ⊗ Ck,∗. The output matrix is computed via the element-wise
sum D = ∑
p
k=1 D
k. Image template courtesy of texample.net/tikz/
examples/matrix-multiplication.
Examples of fields are the set of rational numbers, real numbers and
the complex numbers. Conversely to semirings, rings and fields do not
guarantee nonzero inner products, given nonzero intermediate results.
This phenomenon is known as cancellation of terms. A cancellation
occurs when (AC)i,j = 〈Ai,∗, C∗,j〉 = 0 while elementary products do
not evaluate to zero, i.e. Ai,κ · Cκ,j 6= 0, for some κ ∈ [n]. Equivalently,
when the following property holds
−Ai,` · C`,j =
`−1
∑
k=1
Ai,k · Ck,j +
n
∑
k=`+1
Ai,k · Ck,j,
for some ` ∈ [n]. In general, in semirings, either a = −Ai,` · Ck,` /∈ S or
a = ∑`−1k=1 Ai,k ·Ck,j +∑nk=`+1 Ai,k ·Ck,j /∈ S since the inverse of an element
12 Chapter 1. Introduction
a ∈ S is not guaranteed to be in S. Hence, cancellations cannot occur.
Nevertheless, the following may hold: nnz(Ai,∗) > 0, nnz(C∗,j) > 0
while (AC)i,j = 〈Ai,∗, C∗,j〉 = 0. From a set notation and by considering
Boolean matrices, this corresponds to nonempty sets with empty inter-
section, i.e. disjoint sets. For algebraic structures with modular arith-
metics, cancellations may occur whenever 0 ≡ (AC)i,j (mod q) while
(AC)i,j 6= 0.
The matrix multiplication kernels presented in this dissertation are
designed for matrices with entries from fields. That is, our focus is on al-
gebraic structures that allow cancellation of terms. In general, the choice
of fields (rather than more generic rings) is required since multiplicative
inverses appear in our probability analysis. For our Boolean matrix mul-
tiplication algorithms we embed the Boolean semiring into a sufficiently
large semiring to allow the size estimation algorithm to output reliable
estimates for the number of nonzero entries in the output matrix.
1.5 Models of Computation
The design and analysis of algorithms requires an abstract model of
computation with predefined atomic operations. The Turing machine
is one of the first theoretical models of computation designed for the
purpose of mathematical calculation. Despite its Turing completeness,
a Turing Machine is unsuitable for designing algorithms for modern
real-world machines since it does not support random-access memory.
The Random Access Machine, on the other hand, is an computational
model similar to a multiple-register machine with indirect addressing
closer in design to moder architectures. It can be considered as the
standard de-facto model for the design and analysis of algorithms. As
Turing machines, the Random Access machine does not capture the con-
cept of locality of memory. In general, locality of memory refers to either
temporal locality, i.e. a process access the same data a multiple number
of times, or spacial locality, i.e. a process is more likely to access data in
a neighborhood of a location of an early memory access. The temporal
and spacial locality are exploited through the use of caches and block
transfers respectively on current hardwares.
In order to fill the locality-gap of the Random Access Machine it
is necessary to design a computational model that support temporal
and spacial locality. A motivation for memory-related models, applied
to matrix multiplication algorithms, is given by empirical evidences.
1.5. Models of Computation 13
Core 2Core 1
L1d Cache L1i Cache
L2 Cache
L1d Cache L1i Cache
L2 Cache
L3 Cache
Figure 1.4: Example of a single-socket, dual core architecture. Each core
has private L1 data and L1 instruction cache and private L2 cache. The
L3 cache is shared between cores and is located above the main memory
in the memory hierarchy.
Vuduc [VD03] observed that the CPU utilization is less than 10% when
computing elementary products for inner products. This relies on irreg-
ular memory access patterns generated by sparse matrices [Gre12].
The structure of a modern architecture is depicted in Figure 1.4. The
figure shows a simplification of a single-socket dual core architecture
with a three level memory hierarchy. A setup as depicted in Figure
1.4 has to be expected on modern architectures as, among others, Intel®
[Lev09]. The purpose of caches is indeed to exploit spacial locality. Ev-
ery level of the cache hierarchy collects data that has been previously
accessed by the processor. If the current requested data is present in
one cache layer then we have a so called cache hit. On the other hand,
if the data is not present in one cache layer then we have a so called
cache miss and the request for a specific memory address is propagated
down the hierarchy to the first layer contacting the address. Meanwhile,
the processor remains idle and on hold for the data to be loaded in the
caches. It is clear that, to optimize performances, algorithms designers
have to exploit data locality in order to reduce cache misses and CPU
idleness. In fact, cache hits account for fewer CPU cycles compared to
cache misses, especially in the lower levels of the memory hierarchy,
where they are responsible for the most expensive operations, causing
the CPU to remain idle for up to 40 to 100 cycles for an L3 cache on Intel®
Xeon® processors [Lev09]. The motivation for multi-layered memory hi-
erarchies is given by a trade off between cache sizes and performances.
The performance of caches degrades descending the hierarchy whereas
the size increases. Fast caches are usually of limited size, ranging from
32 KB for L1 to 256 KB for L2. On the performance front, Intel’s Haswell
14 Chapter 1. Introduction
architecture has clean-read latencies that double from L1 cache to L2
cache, namely from 5 ns to 10/15 ns, see Figure 5.6b in Chapter 5.
1.5.1 Word RAM model
The Random Access Machine is the standard de-facto for the design and
analysis of algorithms. It supports all the basic data manipulation op-
erations and it allows indirect addressing. Nevertheless, the progress in
hardware architectures causes the RAM model to be outdated in certain
aspects. The recent advances in CPU hardware makes possible to take
advantage of bit-level parallelism through packing various elements in
one word and operating on words simultaneously.
The Word RAM model, introduced by Fredman and Willard [FW90],
aims at overcoming this issue and specifies an abstract model similar to
a Random Access Machine, but with the additional property of perform-
ing bitwise operations on a word of w bits, where w = Ω(log n) and n
is the problem size. Note that it is natural to assume also w = O(log n)
otherwise it would be possible to pack super-polynomially many ele-
ments in a word. Similar to the RAM model, we are interested in bound-
ing the number of word operations performed during the computation,
such as logical and algebraic operations, bit shift and comparison oper-
ations. It is generally assumed that a memory access takes constant time
and the space usage is given by the number of words used during the
computation.
For the sake of completeness, we acknowledge that there exists other
RAM models, such as the Real RAM model, by Shamos [Sha78], used
mostly in computational geometry problems, where computation is car-
ried on exact real numbers. The Word RAM model is widely accepted as
the standard for analysis of algorithms as it closely resembles the mod-
ern architectures which operate on 64 bits registers and the atomic oper-
ations in the Word RAM model follow an imperative, C-like paradigm
that can be considered as an abstraction of the assembly language.
1.5.2 The External Memory model
The pioneers in external memory analysis were Hong and Kung
[JWK81] who introduced, in 1981, the red-blue pebble game to analyze
the I/O complexity of algorithms. This framework was used to derive
lower bounds to the n-point Fast Fourier Transform and the n× n matrix
multiplication algorithm using a memory of size O(M).
1.5. Models of Computation 15
In 1988, Aggarwal and Vitter [AV88] introduced the External Mem-
ory model. This model enhances Hong and Kung’s framework by in-
troducing the concept of spacial locality via block-organized memory.
Also known as I/O model, the External Memory model is a two level
memory hierarchy with an unbounded external memory and a limited
internal memory that can host at most M elements. The computation
can take place exclusively in internal memory and, within the mem-
ory hierarchy, elements are exchanged using blocks of size B. In this
model, we are interested in analyzing the number of I/Os, i.e. memory
exchanges, between internal and external memory. In this setting, we
denote with M the number of records that can fit into internal memory
and B the number of records that can be transferred in a single block.
The implicit constraint between memory and block size is given by
the inequality M ≥ B, since internal memory must host at least a block
for the purpose of I/O operations. A common assumption, legitimized
by modern architectures, is to consider the size of internal memories ex-
pressed as a univariate polynomial on B. Known as tall cache assumption,
this constraint corresponds to requiring the number of blocks M/B to be
larger than the size of each block B. Using asymptotic notation, this be-
comes M = Ω(B2) which can be adjusted to obtain weaker constraints
of the form M = Ω(B1+ε), for some constant ε > 0. Assuming a tall
cache assumption may simplify certain problems. In this regard, con-
sider the problem of transposing an m× n matrix into an n×m matrix.
This problem is equivalent to a layout transformation, namely from row
major to column mayor layout (or vice versa). Assuming M = Ω(B2),
simply load submatrices of size B × B in internal memory, order the
entries accordingly and output the submatrices transposed. Such an
algorithm would yield an optimal I/O complexity of O(mn/B) I/Os.
Conversely, without a tall cache assumption a sorting-based algorithm
has to be preferred.
1.5.3 The Cache Oblivious Model
The External Memory model is commonly referred as the cache aware
model since the parameters M and B are known to the algorithm de-
signer. This allows the design of algorithms that exploit blocking tech-
niques. As a matter of fact, one of the first lower bounds for matrix
multiplication on a model of limited memory was provided by Hong
and Kung [JWK81] using blocking arguments. The intuition was to
16 Chapter 1. Introduction
CPU
x1 x2 x3 xB
m1 m2 · · · · · · mM
x1 x2 x3 · · · xB
x1, x2, x3, . . . , xB
Figure 1.5: The External Memory model with an input/output operation
where the B elements x1, x2, x3, . . . , xB are loaded in internal memory.
divide the matrix into blocks of size
√
M × √M and perform matrix
multiplication thereby obtaining a lower bound of Ω(n3/
√
M).
The advantages of knowing the parameters M and B reflect on faster
algorithms for a specific layer of the memory hierarchy. Conversely,
cache aware algorithms do not scale efficiently on multi-level memory
hierarchies. In order to optimize programs such that they perform well
on all layers of the memory hierarchy, it is necessary to consider a model
oblivious to block and memory size. The Cache Oblivious model by
Frigo et al. [FLPR99] is a two level memory hierarchy with an internal
memory of size M and an unbounded external memory. Similar to the
External Memory model, the I/O communication between the levels of
the memory is performed through blocks of size B. Nevertheless, in this
model, the parameters M and B are oblivious to the algorithm designer
and they can only be used to analyze the I/O complexity. The obliv-
iousness causes an algorithm that is optimal in one unknown level of
the memory hierarchy to be optimal on all levels simultaneously. There-
fore, the Cache Oblivious model makes possible to model a multi-level
memory hierarchy via a two-level abstract model.
The Cache oblivious model implicitly assumes an optimal and auto-
matic cache replacement strategy, the presence of only two levels in the
memory hierarchy and full cache associativity, i.e. data can be stored in
any cache block. The motivation for designing a model with such rigid
assumptions is to make the design of algorithms easier. Although these
constraints may be considered unrealistic, Frigo et al. showed that, to a
certain extent, these assumptions require no loss of generality.
Besides the apparent limitations of oblivious models, Brodal and
Fagerberg [BF03] proved an inherent trade off for cache oblivious al-
gorithms between the strength of the tall cache assumption and the
1.5. Models of Computation 17
CPUCPU CPU
y1 y2 y3 yB
m1 m2 · · · · · · mM
x1 x2 x3 xB
m1 m2 · · · · · · mM
z1 z2 z3 zB
m1 m2 · · · · · · mM· · ·
x1 x2 x3 · · · xB y1 y2 y3 · · · yB z1 z2 z3 · · · zB
y1, y2, y3, . . . , yBx1, x2, x3 · · · , xB z1, z2, z3, · · · , zB
Figure 1.6: The Parallel External Memory model with a parallel in-
put/output operation where each processor loads B elements in internal
memory.
overhead when M  B. In addition, they proved that I/O optimal
cache-oblivious comparison-based sorting is not possible without a tall
cache assumption and that the problem of permuting is not possible
cache-obliviously, not even under the tall cache assumption, see Chap-
ter 5 for further considerations on permuting in the I/O model.
1.5.4 The Parallel External Memory model
The constant progress on hardware design and optimization implies the
necessity to design models of computation that closely reflects real hard-
ware settings. The advent of parallel architectures motivated the design
of the Parallel Random Access Machine, a parallel-computing counter-
part to the Random Access Machine, in order to allow algorithm design-
ers to model the performance of parallel programs.
Similarly, the The Parallel External Memory (PEM) model was in-
troduced by Arge et al. [AGNS08] and it is a cache aware version of
the model proposed by Bender et al. [BFGK05]. It is a computational
model with P processors and a two-level memory hierarchy which con-
sists of an external memory, shared by all the processors, and P private
memories, i.e. caches, exclusive to each processor of limited size M. To
perform any operation on the data, a processor must have the data in its
cache. The data is transferred between the main memory and the cache
in blocks of size B. Multiple processors can access distinct blocks of the
external memory concurrently. This means that the external memory
is treated as shared memory, and within one parallel I/O, each proces-
sor can perform an input or an output. When it comes to accessing
the same block of the external memory by multiple processors three
different approaches arise: Concurrent Read Concurrent Write (CRCW),
Concurrent Read Exclusive Write (CREW) and Exclusive Read Exclusive
Write (EREW). Arge et al., as well as we do in this dissertation, consider
the Concurrent Read Exclusive Write (CREW) PEM.
18 Chapter 1. Introduction
1.6 Probability Theory
Several algorithms presented in this thesis are randomized. Random-
ized algorithms exploit a probabilistic process during their execution.
Since randomness is involved, the algorithms exhibit a certain degree
of uncertainty which may cause the algorithms to fail by producing an
incorrect result. Known in literature as Monte Carlo, this class of al-
gorithms use deterministic runtime and provide correct results with a
bounded probability. In this regard, it is fundamental to guarantee that
the algorithms output correct solutions with a reasonable probability. As
a consequence, we require our algorithms to succeed with probability
polynomially approaching one as the problem size approaches infinity.
We refer to this as with high probability (whp), which can be expressed as
1−O(1/nc) for some constant c > 0. Similarly, algorithms may oper-
ate using resources, such as number of operations or memory, that are
expressed as random variables. This class of algorithms, known as Las
Vegas, always output a correct result whereas running time or memory
consumption are random variables. In this context, we are interested in
bounding the resources used in expectation by the algorithms.
In this section, we give an overview of the techniques used in the
following chapters for the probabilistic analysis of algorithms. We start
with the definition of random variables.
Definition 1.2 (Mitzenmacher and Upfal [MU05, Definition 2.1]). A ran-
dom variable, X on a sample space Q is a real-valued function on Q; that is,
X : Q → R. A discrete random variable X takes on only a finite or countably
infinite number of values.
One of the most common probability distribution is the discrete uni-
form distribution where all the events on the sample space Q are equally
likely to be observed. This distribution requires the sample space to
be finite since random variables cannot be uniformly distributed among
infinite and countable sets. For instance, consider a discrete random
variable X which assumes values in a countable infinite set Q. We claim
that there is no uniform distribution on Q. That is, there exists no prob-
ability function such that Pr[X = q] = p, for all q ∈ Q. By the second
probability axiom, i.e. the unit measure axiom, it holds
∑
q∈Q
p = ∑
q∈Q
Pr[X = q] = Pr[X ∈ Q] = 1 (1.2)
1.6. Probability Theory 19
By the first probability axiom, i.e. the probability of an event is a non-
negative real number, Pr[X = q] = p ≥ 0. Therefore, if p = 0 then
∑q∈Q p = 0 contradicting (1.2). Similarly, if p > 0, then ∑q∈Q p = ∞
contradicting (1.2). As a result, whenever we sample uniformly from a
set Q we require Q to be finite.
A common procedure in the analysis of randomized algorithms is to
estimate the probability of the union of some events. Boole’s inequality,
also known as union bound, gives an upper bound on the probability of
the union of a countable set of events Xi = qi, with i ∈N,
Pr
[⋃
i
(Xi = qi)
]
≤∑
i
Pr[Xi = qi].
In probability theory, it is often fundamental to study the deviation
of a random variable from its expectation E[X], also known as tail distri-
bution. This property, particularly significant to the analysis of Las Vegas
algorithms, gave rise to several tail inequalities. One of the simplest tail
bounds is the following inequality due to Markov.
Theorem 1.3 (Markov’s Inequality [MU05, Theorem 3.1]). Let X be a ran-
dom variable that assumes only nonnegative values. Then, for all δ > 0,
Pr[X ≥ δ] ≤ E[X]/δ.
Markov’s inequality, in contrast to stronger tail bounds such as
Chebyshev or Chernoff-Hoeffding, does not require any type of inde-
pendence as it is derived solely from the linearity of expectation.
Stronger tail distributions are given by Chernoff bounds which pro-
vide exponentially decreasing probabilities on the sum of independent
random variables. Hoeffding’s inequality provides an upper bound on
the probability that the sum of bounded independent random variables
deviates from its expected value by more than a certain amount.
Theorem 1.4 (Chernoff-Hoeffding bounds [DP09, Theorem 1.1]). Let
X = ∑i∈[k] Xi where the variables Xi, with i ∈ [k], are independently dis-
tributed in [0, 1]. Then, for all δ > 0
Pr[X > E[X] + δ] ≤ e−2δ2/k.
If δ > 2e E[X] then
Pr[X > δ] ≤ 2−δ.
20 Chapter 1. Introduction
Chernoff-Hoeffding bounds as presented in Theorem 1.4 require in-
dependent random variables. The random variables Xi and Xj, with
i 6= j, are independent if and only if
Pr[(Xi = qi) ∩ (Xj = qj)] = Pr[Xi = qi] · Pr[Xj = qj],
for all values qi, qj ∈ Q. However, Dubhashi and Panconesi [DP09]
showed that, for negatively associated random variables, Chernoff-
Hoeffding bounds still hold.
Definition 1.5 (Negative association [DP09, Definition 3.1]). The random
variables Xi, with i ∈ [k], are negatively associated if, for all disjoint subsets
I, J ⊆ [k] and all nondecreasing functions f and g,
E[ f (Xi)g(Xj)] ≤ E[ f (Xi)] E[g(Xj)],
with i ∈ I and j ∈ J.
To prove that a family of random variables is negatively associated
[DP09] used two basic properties, namely closure under products and
monotone aggregation, that exploit solely monotonicity, symmetry and
independence.
Definition 1.6 (Closure under products [DP09]). If X1, . . . , Xn and
Y1, . . . , Ym are two independent families of random variables that are separately
negatively associated, then the family X1, . . . , Xn, Y1, . . . , Ym is also negatively
associated.
Definition 1.7 (Monotone aggregation [DP09]). If Xi, i ∈ [n], are nega-
tively associated and A is a family of disjoint subsets of [n], then the random
variables
fA(Xi, i ∈ A), A ∈ A
are also negatively associated, where the fA’s are arbitrary non decreasing (or
non-increasing) functions.
As promised, we present the result from [DP09] which allows to use
Chernoff-Hoeffding bounds on negatively associated random variables.
Theorem 1.8 ([DP09, Theorem 3.1]). The Chernoff-Hoeffding bounds can be
applied as is to X = ∑i∈[k] Xi if the random variables X1, . . . , Xk are negatively
associated.
1.7. Lower Bounds and Conjectures 21
The canonical example of negatively associated random variables is
given by bivariate 0-1 random variables Xi,j which take value 1 if the j-
th ball is placed into the i-th bin. In order to prove Chernoff-Hoeffding
bounds on the sum Xi = ∑j Xi,j, one only needs to prove negative asso-
ciation among disjoint subsets of the random variables.
1.7 Lower Bounds and Conjectures
The trivial lower bound for multiplying two n × n dense matrices is
Ω(n2). This stems from the size of the input/output matrices, which,
in the worts case, will have n2 entries. It is conjectured that the expo-
nent of matrix multiplication is ω = 2 + o(1) which, given n2+o(1) =
polylog(n2), yields an O˜(n2) algorithm. Since Coppersmith and Wino-
grad [CW87], the approach exploited to derived improved bounds was
to analyze higher tensor powers of a specific trilinear form. For instance,
Le Gall [LG14] analyzed the thirty-second power of the Coppersmith
and Winograd’s tensor achieving ω < 2.3728639. Similarly, one could
think of analyzing higher powers of the Coppersmith and Winograd’s
tensor hoping to achieve ω = 2 + o(1). Ambainis, Filmus and Le Gall
[AFLG15] proved that analyzing higher tensor powers cannot lead to
algorithms running in O(n2.3725) time. In addition, they identify vari-
ants of this approach which cannot lead to algorithms with running
time O(n2.3078). The results from [AFLG15] imply explicitly that taking
higher powers of the Coppersmith and Winograd’s tensor cannot lead
to an O(n2+ε) algorithm for matrix multiplication.
Besides defining a specific trilinear form used until recently to de-
rived improved bounds for matrix multiplication, Coppersmith and
Winograd showed that the existence of an Abelian group and a sub-
set of it, satisfying certain conditions, imply that their techniques can
yield an O(n2+o(1)) time algorithm. Alon, Shpilka and Umans [ASU13]
showed that if the sunflower conjecture of Erdo˝s and Rado [ER60] is
true then no such group and subset exist. Similarly, Cohn and Umans
[CU03], by introducing a new approach for matrix multiplication based
on group representation theory, conjectured the existence of combina-
torial structures that would yield fast matrix multiplication algorithms.
Alon, Shpilka and Umans formulate a variant of the sunflower conjec-
ture and prove that, if true, it contradicts Cohn and Umans’ hypothesis.
Despite the subcubic advances of matrix multiplication on a ring, in
the Boolean semiring, there is no truly subcubic algorithm know so far.
22 Chapter 1. Introduction
The current state of the art is given by Yu’s combinatorial algorithm.
It is believed that combinatorial techniques cannot yield truly subcubic
algorithms. Abboud and Williams [AW14] conjectured that, in the Word
RAM model with words of O(log n) bits, any combinatorial algorithm
requires n3−o(1) time in expectation to compute the Boolean product of
two n× n matrices.
Bläser [Blä99] showed that 5n2/2− 3n multiplications are required
for computing the rank of n× n matrix multiplication. The same result
hold for the rank of multiplication in noncommutative division algebras
(where all the nonzero elements are invertible) over an arbitrary field.
Bläser [Blä01] extended the same result for the multiplicative complexity
of n× n matrix multiplication over arbitrary fields. Landsberg [Lan14]
proved that the rank of matrix multiplication is at least 3n2 − 4n3/2 − n.
This result was later improved by Massarenti and Raviolo [MR13] to
3n2 − 2√2n3/2 − 3n. Raz [Raz02] proved that, for any c = c(n) ≥ 1,
the size of any arithmetic circuit for the product of two matrices, over
the real or complex numbers, is Ω(n2 log2c n) unless products with field
elements of absolute value larger than c are performed by the circuit.
The lower bounds surveyed so far concern dense matrix products.
When the input matrices are sparse and they generate sparse products,
the bounds one wants to achieve are of the form O˜(h + k), where h =
nnz(A) + nnz(C) and k = nnz(AC). This is consistent with the matrix
multiplication conjectured since, for dense matrices, it holds O˜(n2).
Consider the problem of determining a pair of vectors u and v such
that 〈u, v〉 = 0, i.e. orthogonal, from two sets A and C each containing
n vectors of d = ω(log n) dimensions. This problem, known as the Or-
thogonal Vectors Problem, is conjectured not to have a truly subquadratic
algorithm, i.e. running in time n2−e. In fact, if such an algorithm exists,
then k-CNF-SAT would be solvable in 2(1−e/2)n time, refuting the Strong
Exponential Time Hypothesis (SETH) [IPZ01], see [WY14, Lemma A.1].
This problem has been thoroughly studied in communication complex-
ity and it is widely used for proving conditional lower bounds. It is
also intrinsically connected with sparse matrix multiplication. In this
context, Williams and Yu [WY14] designed algorithms for the commu-
nication complexity of sparse matrix multiplication where, using at most
O˜(kn) randomized communication, they were able to compute the prod-
uct of matrices with elements over an arbitrary finite field. Moreover,
they conjectured that Ω(n) bits of communication were necessary in the
worst case for exchanging the n-bit vectors, for every nonzero entry in
1.8. Notation 23
the output. This result was later improved by Van Gucht et al. [VG+15]
who proved that o(kn) randomized communication is in fact possible
for computing the product of matrices from arbitrary fields.
1.8 Notation
Given matrices A ∈ An×n and C ∈ An×n from a generic algebraic struc-
ture A, we define h as the number of nonzero entries in the input, i.e.
h = nnz(A) + nnz(C), k as the number of nonzero entries in the output,
i.e. k = nnz(AC), where nnz(A) denotes the number of nonzero entries
in a matrix A. Let Ai,j be the value of the entry in the matrix A with
coordinates (i, j). We denote with A∗,j and Ai,∗ the j-th column of A and
the i-th row of A respectively.
Sparse matrices are stored using a coordinate format, i.e. each entry
Ai,j = a is stored as a triple 〈a, i, j〉. Note that we can easily detect,
by scanning the input matrices, null vectors. Hence, without loss of
generality, we consider only the case where h ≥ 2n and the rows of A
and the columns of C are not n-dimensional null vectors. Observe that,
in contrast cancellations can lead to situations where k ≤ n. We use the
notation [n] to denote the set {1, 2 . . . , n− 1, n} and [m, n] to denote the
set {m, m + 1, . . . , n− 1, n}.
The definition of Boolean matrix multiplication follows by the stan-
dard definition of logical operators,
(AC)i,j = 〈Ai,∗, C∗,j〉 =
n∨
`=1
Ai,` ∧ C`,j.
The inner product (also known as dot product or scalar product, see
Figure 1.2) of a row vector Ai,∗ and a column vector C∗,j is defined as
(AC)i,j = 〈Ai,∗, C∗,j〉 =
n
∑
`=1
Ai,` · C`,j
The outer product (also known as tensor product, see Figure 1.3), of a
column vector A∗,` and a row vector C`,∗ is defined as
(AC)` = A∗,` ⊗ C`,∗ where (AC)`i,j = Ai,` · C`,j and (AC)i,j =
n
∑
`=1
(AC)`i,j
24 Chapter 1. Introduction
We denote with column major layout the lexicographic order where the
entries of A are sorted by column first, and by row index within a col-
umn. Analogously, we denote with row major layout the order where the
entries of A are sorted row-wise first, and column-wise within a row.
Observe that a row major layout can be obtained from column major lay-
out by transposing A, denoted with AT, and vice versa. Furthermore, we
use f (n) = O˜(g(n)) as shorthand for f (n) = O(g(n) logc g(n)) for some
constant c. The following equivalence holds O˜(g(n)) = O(g(n)1+o(1)).
The memory hierarchy we refer to is modeled by the I/O model by
Aggarwal and Vitter [AV88], the Ideal Cache Oblivious model by Frigo
et al. [FLPR99] and the Parallel External Memory model by Arge et
al. [AGNS08]. We denote with M the number of elements that can fit
into internal memory, B the number of elements that can be transferred
in a single block and P as the number of processors. The parameters
M, and B are referred as the memory size and block size respectively. It
holds 1 ≤ B ≤ M  h, n. We assume that a word is big enough to
hold a matrix element as well as its coordinates positions, i.e. a block
holds B matrix elements in the coordinate format, and the main memory
can store at most M words. Note that, packing a constant number of
elements in a word yields only a constant improvement and therefore,
implies no loss of generality.
We use the notation sort(h) as a shorthand for indicating the num-
ber of I/Os required to sort h elements. Cache obliviously, it holds
sort(h) = O((h/B) logM h), see [FLPR99]. In the External Memory
model it holds sort(h) = (h/B) logM/B(h/B) while in the Parallel Exter-
nal Memory model we have sort(h) = (h/PB) logM/B(h/PB). Further-
more, we assume that the input is stored contiguously in main memory
and, for our parallel algorithms, we assume that the input is partitioned
into P segments that are assigned to each processor. Hence, processors
are responsible only for the address space spanned by its segment.
Chapter 2
Cache Oblivious Sparse Matrix Multiplication
In this chapter we study the problem of sparse matrix multiplication in
the Word RAM model, in the Cache Oblivious model and in the Parallel
External Memory model. We present output-sensitive algorithms that
exploit randomization in order to compute the product of two sparse
matrices with elements over an arbitrary field.
Let A ∈ Fn×n and C ∈ Fn×n be matrices with h nonzero entries in
total from an arbitrary field F. In the Word RAM model, we are able
to compute all the k nonzero entries of the product matrix AC ∈ Fn×n
using O˜(h + kn) time and O(h) space.
In the Cache Oblivious model, we are able to compute cache oblivi-
ously all the k nonzero entries of the product matrix AC ∈ Fn×n using
O˜(h/B + kn/B) I/Os and O(h) space. In the Parallel External Memory
model, we are able to compute all the k nonzero entries of the prod-
uct matrix AC ∈ Fn×n using O˜(h/PB + kn/PB) time and O(h) space,
which makes the analysis in the Cache Oblivious model a special case
of Parallel External Memory for P = 1.
The guarantees are given in terms of the size of the field and by
bounding the size of F as |F| > kn log(n2/k) we guarantee an error
probability of at most 1/n for computing the matrix product.
2.1 Introduction
Williams and Yu [WY14] studied the problem of finding pairs of orthog-
onal vectors in discrete structures. Inspired by the work of Freivalds
[Fre77], they provided a simple probabilistic procedure to test whether
there is a pair of non-orthogonal vectors in discrete sub-structure. Their
26 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
analysis is centered around dense input matrices. In this chapter we
extend their techniques to the sparse input case.
2.2 Contributions
We study the problem of sparse matrix multiplication in the Word RAM
model, in the Cache Oblivious model and in the Parallel External Mem-
ory model over an arbitrary field (F,+, ·), where (+, ·) are atomic oper-
ations over the elements of F in the computational models.
We present a randomized algorithm for multiplying matrices A ∈
Fn×n, C ∈ Fn×n that, after O(h) time for preprocessing using determin-
isticO(h) space, computes, usingO(nk log(n2/k)) time all the k nonzero
entries of the product matrix AC ∈ Fn×n, with high probability. The al-
gorithm is summarized in the following theorem.
Theorem 2.1 (Word RAM). Let F be an arbitrary field, let A ∈ Fn×n,
C ∈ Fn×n and assume A and C have h nonzero entries. After O(h) time
for preprocessing and using deterministic O(h) space, it is possible to com-
pute all the k nonzero entries of AC ∈ Fn×n with high probability, using
O(kn log(n2/k)) time.
Subsequently, we present a cache oblivious algorithm for multiplying
matrices A ∈ Fn×n and C ∈ Fn×n. After O((h/B) logM/B h/B) I/Os for
preprocessing, using deterministic O(h) space, we are able to compute,
usingO((n/B)k log(n2/k)) I/Os, all the k nonzero entries of the product
matrix AC ∈ Fn×n, with high probability, under a tall cache assumption,
i.e. M ≥ B1+ε, for some ε > 0.
Theorem 2.2 (Cache Oblivious). Let F be an arbitrary field, let A ∈ Fn×n,
C ∈ Fn×n and assume A and C have h nonzero entries. Let M ≥ B1+ε for
some ε > 0. After O((h/B) logM/B(h/B)) I/Os for preprocessing and using
deterministic O(h) space, it is possible to compute all the k nonzero entries of
AC ∈ Fn×n with high probability, using O((kn/B) log(n2/k)) I/Os.
Similarly, from Theorem 2.2, in the Parallel External Memory model,
we derive an algorithm for multiplying matrices A ∈ Fn×n, C ∈ Fn×n.
After O((h/PB) logd(h/B)) I/Os for preprocessing, with d = max{2,
min{M/B, h/PB}}, using deterministic O(h) space, it computes, us-
ing O((n/PB + log P)k log(n2/k)) I/Os, all the k nonzero entries of the
product matrix AC ∈ Fn×n, with high probability, as phrased in the
following corollary.
2.3. Comparison with the Related Work 27
Corollary 2.3 (Parallel External Memory). Let F be an arbitrary field,
let A ∈ Fn×n, C ∈ Fn×n, assume A and C have h nonzero entries and
let P ≤ n/B. After O((h/PB) logd(h/B)) I/Os for preprocessing, with
d = max{2, min{M/B, h/PB}}, and using deterministic O(h) space, it is
possible to compute all the k nonzero entries of AC ∈ Fn×n with high probabil-
ity, using O((n/PB + log P)k log(n2/k)) I/Os.
Note that, for the Cache Oblivious model and the Parallel External
Memory model, preprocessing is dominated by O(sort(h)) I/Os which
stems from layout transposition. Although faster algorithms for trans-
posing sparse matrices exist, for the ease of exposition, we consider
O(sort(h)) I/Os as an upper bound for preprocessing which weakens
the bounds only in the parameters of the logarithmic factors. For our
parallel algorithm, we consider a cache aware model since concurrency
is nontrivial in external memory models whenever the block size B is
unknown [BFGK05].
We give rigorous guarantees on the probability of detecting all the
nonzero entries of the output matrix by studying how the process of
generating random elements from the field affects the process of locating
entries. The guarantees are given in terms of the size of the field. If the
algorithms generate random variables from an arbitrary field F then we
are able to detect a nonzero entry in the matrix with probability at least
1 − 2/|F| + 1/|F|2. On the other hand, given an arbitrary field F, if
the random variables are generated from F∗ = F \ {0} then we detect
a nonzero entry with probability at least 1− 1/|F∗|. By bounding the
size of F as |F| ≥ ckn log(n2/k), for some constant c, we guarantee an
error probability of at most 1/n. Conversely, if |F| < ckn log(n2/k) we
can improve the error probability by repeating the random process an
arbitrary number of time, say log n times, thus sacrificing a log n factor
in space and time with the effect of decreasing the error probability by
a factor of n.
2.3 Comparison with the Related Work
Yuster and Zwick [YZ05] presented an algorithm that multiplies two n×
n matrices over a ring using O˜(h0.7n1.2 + n2+o(1)) arithmetic operations.
Iwen and Spencer [IS09] proved that if each column of AC contains
at most n0.29462 nonzero entries, then A and C can be multiplied with
O(n2+ε) operations. Our algorithms improve over Yuster and Zwick
28 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
[YZ05] as well as Iwen and Spencer [IS09] when k < n and h  n2. In
addition, we do not require a balanced assumption of the output matrix,
e.g. the number of nonzero entries per column, as in [IS09].
Amossen and Pagh [AP09] provided a sparse, output-sensitive ma-
trix multiplication that incurs in O˜(h2/3k2/3 + h0.862k0.408) operations
and O˜(h√k/(BM1/8)) I/Os. Lingas [Lin09] presented an output-
sensitive, randomized algorithm for computing the product of two
n × n Boolean matrices using O˜(n2kω/2−1) operations. Compared to
Amossen and Pagh and Lingas, we exploit cancellations of terms, we
do not restrict our analysis to Boolean matrices and we do not make
use of fast matrix multiplication, considered by many as impractical
[Rob05, VG+15]. In addition, Amossen and Pagh’s I/O algorithm is not
cache oblivious, i.e. it requires knowledge of the memory size.
Pagh [Pag12] presented a randomized algorithm that computes an
unbiased estimator of AC in time O˜(h + nk), with guarantees given in
terms of the Frobenius norm. Pagh’s compressed algorithm achieves
the same time bounds as our algorithms. However, we improve over
space complexity whenever k < h/ log n, i.e when cancellations account
for a 1/ log n factor compared to the number of input entries. Besides
this, Pagh’s result is algorithmically more involved, since it makes use
of hash functions and Fast Fourier Transform.
Williams and Yu [WY14] provided an output-sensitive, randomized
algorithm for matrix multiplication with elements over any field, that,
after O(n2) preprocessing, computes each nonzero entry of the matrix
product in O˜(n) time. We extend their techniques to the sparse input
case and we improve whenever h n2, i.e. when the input matrices are
sparse, both in time and space complexity.
Jacob and Stöckel [JS15] presented a Monte Carlo algorithm that uses
O˜(n2(k/n)ω−2 + h) operations and, with high probability, outputs the
nonzero elements of the matrix product. In addition, they state an I/O
complexity of O˜(n2(k/n)ω−2/(Mω/2−1B) + n2/B). Their analysis is fo-
cused specifically on dense matrices and we improve over their results,
both in time and I/O complexity, whenever k is asymptotically smaller
than n in the general case while we achieve the same bounds when
k = n. In addition, we do not require knowledge of the memory size as
opposed to [JS15].
Van Gucht et al. [VG+15] presented a randomized algorithm for
multiplying two Boolean matrices in O˜(k + h√k + h) time. In contrast
2.4. Algorithms 29
to their results, our algorithms are not restricted to the Boolean case and
we are able to compute the product of matrices from an arbitrary field.
Matrix multiplication has been widely studied in external memory
as well. In a restricted setting, i.e. the semiring model, Hong and Kung
[JWK81] provided a lower bound of Ω(n3/
√
M) I/Os for multiplying
two n× n matrices using n3 operations and a memory of size M.
Frigo et al. [FLPR99] provided a cache oblivious algorithm for multi-
plying two n× n matrices cache obliviously using O(n3) operations and
O(n2/B + n3/(B√M)) I/Os. In the External Memory model, Pagh and
Stöckel [PS14] provided a randomized, I/O optimal algorithm for ma-
trix multiplication that incurs in O˜((h/B)min{√k/√M, h/M}) I/Os.
However, their algorithm does not allow cancellation of terms and it
requires knowledge of the memory size in order to partition the input
matrices. In relation to this, we require no knowledge on the size of
M and our algorithm run cache obliviously. To the knowledge of the
authors, there are no previously known cache oblivious algorithms for
sparse matrix multiplication that exploit cancellations.
2.4 Algorithms
Williams and Yu designed a very simple, output-sensitive algorithm for
matrix multiplication, when the output matrices are sparse. Their idea
was to design a membership query algorithm to test whether a subma-
trix in the output contains nonzero entries. This subroutine is then used
to guide a two dimensional binary search among rows and columns of
AC in search of the k nonzero entries. Their result can be rephrased as
follows.
Theorem 2.4 (Williams and Yu [WY14]). Let F be an arbitrary finite field,
let A ∈ Fn×n and let C ∈ Fn×n. After O(n2) preprocessing time, it is possible
to compute all the k nonzero entries of AC ∈ Fn×n with high probability, with
only O(kn log n) time.
We recall more in detail the proof idea from [WY14]. The intuition
behind [WY14] is that nonzero entries in a submatrix of AC with in-
dices in [i1, i2]× [j1, j2] and i1, i2, j1, j2 ∈ [n], can be detected by extracting
sketches
a =
i2
∑
k=i1
uk Ak,∗ c =
j2
∑
k=j1
vkC∗,k
30 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
and testing whether 〈a, c〉 = 0, where uk and vk are chosen uniformly at
random from F. In general, we can generate u = (u1, . . . , un) ∈ Fn and
v = (v1, . . . , vn) ∈ Fn uniformly at random, compute all the prefix sums
Si =
i
∑
k=1
uk Ak,∗ Tj =
j
∑
k=1
vkC∗,k
for i, j ∈ [n], with S0 = T0 = 0. For any interval [i1, i2], [j1, j2] we can
detect a nonzero in the [i1, i2] × [j1, j2] submatrix of AC by computing
a = Si2 − Si1−1, c = Tj2 − Tj1−1 and subsequently 〈a, c〉.
When the input matrices are sparse, the prefix sums densify the ma-
trices thus having to compute and store n2 elements. In addition, sparse
matrices make nontrivial to compute linear combinations since row/-
column vectors are not explicitly stored.
We refine the analysis of [WY14] as follows. In order to detect the k
positions (i, j) such that (AC)i,j 6= 0, using binary search among the n2
feasible locations, we need at most k log n2 comparisons. We note that
the algorithms do not yield false positives when querying submatrices.
That is, given an all-zero submatrix with related sketches a and c, it
holds 〈a, c〉 = 0 always. This leads to the following lemma.
Lemma 2.5. Let F be an arbitrary field and let A = {a1, . . . , an} and C =
{c1, . . . , cn} be two sets of d-dimensional vectors such that ai, cj ∈ Fd, for all
i, j ∈ [n]. In addition, let u1, . . . , un, v1, . . . , vn be 2n random variables chosen
uniformly at random from F and let a, c be vectors defined as
a =
n
∑
k=1
ukak, c =
n
∑
k=1
vkck.
If 〈ai, cj〉 = 0, for all i, j ∈ [n], then 〈a, c〉 = 0.
Proof. The proof follows from the linearity of the inner product,
〈a, c〉 = 〈 n∑
i=1
uiai,
n
∑
j=1
vjcj
〉
=
m
∑
i=1
n
∑
j=1
〈uiai, vjcj〉 =
m
∑
i=1
n
∑
j=1
uivj〈ai, cj〉 = 0,
where the last equality holds since vectors are pairwise orthogonal by
hypothesis.
As a consequence, at most k queries produce a positive answer, i.e.
〈a, c〉 6= 0. Via a level-by-level top-down analysis, we note that at most
2.4. Algorithms 31
min{2i, 2k} nodes are explored at each recursive level, with i ∈ [log n2].
Hence, we deduce the following.
log n2
∑
i=1
min{2i, 2k} =
log k
∑
i=1
2i +
log n2
∑
i=log k
k ≤ 2k + 2k log(n2/k). (2.1)
Accordingly, we recursively split AC into two evenly divided submatri-
ces, which resembles the splitting phase of a k-d tree. We query each
submatrix and after at most log(n2/k) queries we isolate each nonzero
entry. In the following theorem we show how to compute linear combi-
nations of sparse matrices.
Algorithms overview The intuition for our algorithm is as follows:
we preprocess the matrix where we compute sparse vector prefix sums
while preserving the sparseness of the input matrices. In order to
compute the matrix product, we binary search among the rows and
the columns of AC following a partitioning scheme similar to a k-d
tree. Each submatrix is queried using sketches a = ∑nk=1 ukak and
c = ∑nk=1 vkck. For the Word RAM model, Theorem 2.1, we show how
to compute sketches efficiently using fractional cascading [CG86] and lin-
ear time. For the Cache Oblivious model, Theorem 2.2, we show that is
possible to compute sketches using O(n/B) I/Os with an I/O efficient
range coalesced data structure [DGH15]. The latter is then extended to
the Parallel External Memory model, Theorem 2.3.
Theorem 2.1 (Word RAM). Let F be an arbitrary field, let A ∈ Fn×n,
C ∈ Fn×n and assume A and C have h nonzero entries. After O(h) time
for preprocessing and using deterministic O(h) space, it is possible to com-
pute all the k nonzero entries of AC ∈ Fn×n with high probability, using
O(kn log(n2/k)) time.
Proof. We assume that the input matrices A and C are stored in column
major and row major layout respectively. If not, we can transpose A and
C using O(h) time and O(h) additional space.
Preprocessing: We generate vectors u = (u1, . . . , un) ∈ Fn and v =
(v1, . . . , vn) ∈ Fn uniformly at random and we initialize the data struc-
tures A and C as follows: for each Ai,j 6= 0 and Ci,j 6= 0 then
Ai,j =
i
∑
k=1
uk Ak,j Ci,j =
j
∑
k=1
vkCi,k Ak,j, Ci,k 6= 0, i, j ∈ [n]. (2.2)
32 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
Intuitively, Ai,j (resp. Ci,j) denotes the prefix sum of the nonzero entries
of the column vector A∗,j up to row i (row vector Ci,∗ up to column j).
After this phase, A and C maintain the same sparse structure, as well
as the same layout, of the original input matrices. Initializing A and C
requires O(h) time and O(h) space.1
Starting from column j = n − 1, every column vector A∗,j is aug-
mented with every element in even position from the sparse column
vector A∗,j+1. After the augmentation, the vector A∗,j contains entries
native toA∗,j and entries inherited from A∗,j+1. For each inherited entry,
we add pointers to its native-predecessor and its native-successor. If A1,j
is undefined, every column vector stores a dummy entry in first position
with value 0. For each entry in A∗,j, we add a bridge to the entry with
the same row index in A∗,j+1 or, if it is undefined, we add a bridge to
the predecessor. Dummy entries ensure that every element in A∗,j has at
least a bridge towards A∗,j+1. The augmentation, together with bridg-
ing, requires a linear scan of the column vectors. The space required
by the augmented vectors is T(j) = nnz(A∗,j) + T(j + 1)/2 + 1, with
T(n) = nnz(A∗,n) and j ∈ [n− 1], which is a geometric series bounded
by 2h. The data structure A is further augmented with a dense vector
A∗,0 where every Ai,0 has a bridge to either the entry with the same row
index or its predecessor in A∗,1. The total space required is 2h+ n ≤ 3h.
Analogous considerations hold for data structure C.
Computing AC: We recursively divide AC into two evenly divided
submatrices (which resembles the splitting of a k-d tree) and query each
submatrix in order to detect nonzero entries. Each query is answered
via an inner product 〈a, c〉 where sketches a and c are constructed using
fractional cascading. Given a generic submatrix of AC with indices in
[i1, i2]× [j1, j2] we compute sketches of matrix A with rows in [i1, i2] and
of matrix C with columns in [j1, j2] respectively.
We start by indexing Ai1,0 which redirects to an entry Ai,1. We probe
the data structure for the native-predecessor, call it Aip,1, and the native-
successor, call it Ais,1, of Ai,1. Recall i2 ≥ i1 and i ≤ i1.
1. If Ai,1 is native then:
a) if is < i1 then we emit Ais,1,
1Initializing A and C corresponds to computing prefix sums of each row and col-
umn vector of A and C respectively, which requires a linear scan of the input matrices.
2.4. Algorithms 33
b) if i = i1 then we emit Aip,1,
c) otherwise we emit Ai,1.
2. If Ai,1 is inherited then:
a) if is < i1 then we emit Ais,1,
b) otherwise we emit Aip,1.
Note that, if the predecessor or the successor of Ai,j is not defined in
the j-th column vector we simply output 0 or Ai,j respectively. Accord-
ingly, we correct the following lookup by redirecting the search from
either the successor Ais,1, if is < i1, or to Ai,1, otherwise, and follow-
ing its bridge to Ai,2. We iterate the process up to the n-th column
and we produce a n-dimensional vector ai1 . The process for i2 is analo-
gous. Note that, for i2, the case (1b) is omitted and inequalities become
non-strict as we want to capture the elements with row index i2. After
cascading through the n columns we have vectors ai1 and ai2 .
The sketch of the submatrix A with row indices in [i1, i2] stems from
a = ai2 − ai1 , i.e. the element-wise difference. We repeat the same pro-
cess for C thus computing c and we query the submatrix of AC by per-
forming the inner product 〈a, c〉. The construction of sketches a and c
requires to probe the data structure a constant number of times per col-
umn and per row respectively. Hence, O(n) time is required per query.
By Formula (2.1) at most k log(n2/k) queries are required to isolate the
k nonzero entries of AC. The claim follows.
The algorithm from Theorem 2.1 computes k locations (i, j) to as
many nonzero entries in AC ∈ Fn×n. In order to compute (AC)i,j we
can retrieve, using Formula (2.2), the entry value as follows
(AC)i,j = 〈Ai,∗, C∗,j〉 =
n
∑
`=1
[
(Ai,` −Ai−1,`)(C`,j − C`,j−1)
]
/uivj
while querying unit length matrices.
Figure (2.1a) shows an example of a sparse matrix where A∗,j iden-
tifies the j-th column. Figure (2.1b) shows the same matrix after pre-
processing where the red elements of A∗,j are the entries inherited from
column A∗,j+1 while the black elements are the original entries of the
matrix. The 0-th column, i.e. the vector containing the blue entries, is
the dense vector used to index the rows of the matrix. White entries in
34 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
A∗,0 A∗,1 A∗,2 A∗,3 A∗,4 A∗,5 A∗,6 A∗,7 A∗,8 A∗,9 A∗,10
(a)
A∗,0
i1
i2
A∗,1 A∗,2 A∗,3 A∗,4 A∗,5 A∗,6 A∗,7 A∗,8 A∗,9 A∗,10
(b)
Figure 2.1: Example of a sparse matrix (a) and the process of generating
a sketch a with the data structure of Theorem 2.1, (b).
A1,∗ denote dummy entries with value 0. The highlighted paths depict
the process of computing a sketch of the submatrix of A with row in-
dexes in [i1, i2]. Note that, for the ease of exposition, we omit pointers to
predecessors and successors between consecutive elements of the same
column, as they are implicitly derived from the ordering of the column
vectors.
Example 2.6. Starting from the sparse matrix of Figure 2.1a, we show how the
data structure A is built and how to compute sketches. Given j = n = 10,
A∗,10 = A∗,10 = {A1,10, A2,10, A6,10, A8,10, A9,10, A10,10}. For j = 9, the aug-
mented vector is as follows A∗,9 = {A1,9, A4,9, A6,9} ∪ {A2,10, A8,10, A10,10}.
That is, A∗,9 together with the entries from A∗,10 in even position. Note that,
if Ai,j has to be inherited but Ai,j−1 6= 0 then we trivially consider Ai,j−1. This
is the case of, e.g., A10,9 and A10,8. After the augmentation, predecessors and
successors are computed for the inherited entries and bridges are added between
column vectors A∗,9 and A∗,10. The process is iterated down to vector A∗,1.
Finally, bridges are added between the dense vector A∗,0 and A∗,1. Storing A
in column major layout allow us to keep the data structure compact and to de-
rive implicitly predecessors/successors for consecutive elements, even between
different columns.
Given indexes, e.g. i1 = 5 and i2 = 8, we proceed to show how to compute
a sketch of the submatrix. For i1, we index A5,0 which redirects to A5,1 /∈ A∗,1.
That is, A5,1 is inherited. The predecessor and the successor of A5,1 have row
2.4. Algorithms 35
A∗,0 A∗,1 A∗,2 A∗,3 A∗,4 A∗,5 A∗,6 A∗,7 A∗,8 A∗,9 A∗,10
(a)
i1
i2
A1
A2
A3
(b)
Figure 2.2: Example of a sparse matrix (a) and the process of generating
a sketch a with a coalesced data structure of Theorem 2.2, (b).
index ip = 4 and is = 8 respectively. Since is > i1 we return A4,1, case 2b
from Theorem 2.1. The same considerations apply for A∗,2. In A∗,3 the lookups
redirect to A5,3 ∈ A∗,3, i.e. a native entry. Since the row index is equal to i1
we emit its predecessor, i.e. A1,3, case 1b. In A∗,4, the inherited entry A3,4
has a successor A6,4 with row index within [i1, i2] but has no predecessor. As
a result, we output a dummy entry with value 0. The lookups for the other
columns are similar. As a remark, note the path correction between A3,7 and
A4,7, case 1a. The construction of the vector ai2 is similar and the sketch follows
straightforwardly.
2.4.1 External Memory and Parallel External Memory
Fractional cascading relies on random memory accesses for cascading
through A∗,j, with j > 1. In the worst case, O(n) blocks must be loaded
in memory. Instead, we use a data structure which is close in spirit to
the range coalescing data structure by Demaine et al. [DGH15].
Theorem 2.2 (Cache Oblivious). Let F be an arbitrary field, let A ∈ Fn×n,
C ∈ Fn×n and assume A and C have h nonzero entries. Let M ≥ B1+ε for
some ε > 0. After O((h/B) logM/B(h/B)) I/Os for preprocessing and using
deterministic O(h) space, it is possible to compute all the k nonzero entries of
AC ∈ Fn×n with high probability, using O((kn/B) log(n2/k)) I/Os.
36 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
Proof. We describe the procedure for preprocessing matrix A and gen-
erating the sketch a. We transpose the input matrix A in column major
layout using O((h/B) logM/B(h/B)) I/Os with a cache oblivious sorting
algorithm [BF02] (this requires the tall cache assumption M ≥ B1+ε) and
we compute column-wise prefix sums using O(h/B) I/Os.
Given the matrix A, we generate a sparse 0-1 representation A′ of
A, where A′i,j = 1 if and only if Ai,j 6= 0, A′i,j = 0 otherwise, using
O(h/B) I/Os. We compute a counting vector H = A′1, where 1 ∈ 1n
and Hi = ∑i nnz(Ai,∗), using a cache oblivious Sparse Matrix Vector
Multiplication algorithm [BBF+10] and O((h/B) logM/B(n/M)) I/Os.
After a prefix sum over H we are able to emit h/n index positions
rl ∈ [n] such that ∑rl+1i=rl nnz(Ai,∗) ≤ 3n. As a consequence, we build h/n
buckets Al of size O(n) where the elements of Al are the entries Ai,j
such that i ∈ [rl, rl+1). Starting from A2, we incrementally augment the
bucket Al with elements from Al−1 such that, after the augmentation,
for every column index j, there is an entry with value equal to the prefix
sum up to bucket l. As in Theorem 2.1, we augment the data structure
with a column vector A∗,0 of size n, where Ai,0 indices the l-th bucket if
and only if i ∈ [rl, rl+1), with l ∈ [h/n].
A query on the data structure A probes Ai1,0 using a single I/O and
it incurs in O(n/B) I/Os for scanning the bucket, thus generating the
sketch a. Analogously, we generate the sketch c and we compute the
inner product 〈a, c〉 by scanning the vectors using O(n/B) I/Os.
Corollary 2.3 (Parallel External Memory). Let F be an arbitrary field,
let A ∈ Fn×n, C ∈ Fn×n, assume A and C have h nonzero entries and
let P ≤ n/B. After O((h/PB) logd(h/B)) I/Os for preprocessing, with
d = max{2, min{M/B, h/PB}}, and using deterministic O(h) space, it is
possible to compute all the k nonzero entries of AC ∈ Fn×n with high probabil-
ity, using O((n/PB + log P)k log(n2/k)) I/Os.
Proof. We describe the procedure for preprocessing matrix A and gen-
erating the sketch a. We transpose the input matrix A in Column
Major Layout using O((h/PB) logM/B(h/B)) I/Os with a parallel sort-
ing algorithm [Gre12] and we compute column-wise prefix sums using
O(h/B + log P) I/Os, where the log P term stems from a synchroniza-
tion phase among processors [AGNS08, Ble90].
We generate a sparse 0-1 representation A′ of A, where A′i,j = 1 if and
only if Ai,j 6= 0, A′i,j = 0 otherwise, using O(h/PB) I/Os. We compute
2.5. Probabilistic Analysis 37
a counting vector H = A′1, where 1 ∈ 1n and Hi = ∑i nnz(Ai,∗), us-
ing a Sparse Matrix Vector Multiplication algorithm in Parallel External
Memory [Gre12] and O((h/PB) logd min{n2/h, n/B}+ log(h/B)) I/Os.
After a prefix sum over H, using O(n/PB+ log P) I/Os, we emit h/n
index positions rl ∈ [n] such that ∑rl+1i=rl nnz(Ai,∗) ∈ [n, 3n]. Accordingly,
we build h/n buckets Al of size O(n) where the elements of Al are the
entries Ai,j such that i ∈ [rl, rl+1). Starting from A2, we incrementally
augment the bucket Al with elements from Al−1 such that, after the
augmentation, for every column index j, there is an entry with value
equal to the prefix sum up to bucket l. As in Theorem 2.1, we augment
the data structure with a column vectorA∗,0 of size n, whereAi,0 indexes
the l-th bin if and only if i ∈ [rl, rl+1), with l ∈ [h/n].
A query on the data structureA probesAi1,0 using a single I/O and it
incurs in O(n/PB) I/Os for scanning the bucket, thus generating sketch
a. Analogously, we generate the sketch c and we compute the inner
product 〈a, c〉 by scanning the vectors using O(n/PB + log P) I/Os.
2.5 Probabilistic Analysis
We proceed to give guarantees on the probability of detecting non-zero
entries in the output matrix and we study how altering the process of
random generation alters the probability of detection.
The guarantees are given in terms of the field size rather on the size
of the matrix as, e.g., in [Pag12]. Throughout the analysis we gave no
restriction on the field F. Nevertheless, when F is infinite and countable,
we require to sample from a finite subset of F. This constraint is justified
since random variables cannot be uniformly distributed among infinite
and countable sets, see Section 1.6 for further considerations.
Fields, in contrast with other algebraic structures, guarantee the ex-
istence of the multiplicative inverse for elements of F, a property we use
for proving the following lemmas, see also Section 1.4.
Lemma 2.7. Let A ∈ Fn×n, C ∈ Fn×n and let AC ∈ Fn×n have at most k
nonzero entries. Consider a submatrix of AC with indices [i1, i2]× [j1, j2] and
assume to query the submatrix with sketches a, c as in Theorem 2.1.
(i) The matrix has a nonzero entry if and only if 〈a, c〉 6= 0 with probability
at least 1− 2/|F|+ 1/|F|2.
38 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
(ii) The submatrix is all zero if and only if 〈a, c〉 = 0 with probability at least
1− 2k log(n2/k)/|F|+ k log(n2/k)/|F|2.
Lemma 2.7. (i) If 〈a, c〉 6= 0, then there exist i, j ∈ [n] such that ui, vj 6= 0
and 〈ai, cj〉 6= 0, hence, (AC)i,j 6= 0. If there is a nonzero entry then
〈a, c〉 6= 0 with probability at least 1− 2/|F|+ 1/|F|2. This is equivalent
of saying that if there is a nonzero entry then 〈a, c〉 = 0 with probability
at most 2/|F| − 1/|F|2.
Without loss of generality, let i1 = i2 = i and j1 = j2 = j. Considering
a bigger submatrix with exactly one nonzero entry leaves the probability
unchanged, while considering more nonzero entries will only increase
the probability of 〈a, c〉 6= 0. Therefore, we consider the case where we
want to isolate, with high probability, the location of a single nonzero
entry in a submatrix of unit size. It follows that, in order to query
the submatrix we have to perform the following inner product 〈a, c〉 =
〈uiai, vjcj〉, where u, v are chosen uniformly at random from F. Since
〈ai, cj〉 6= 0 by hypothesis, we have that Pr(〈a, c〉 = 0) ≥ 2/|F| − 1/|F|2.
(ii) If the submatrix of AC with indices [i1, i2] × [j1, j2] is all
zero then 〈a, c〉 = 0 with probability at least 1 − 2k log(n2/k)/|F| +
k log(n2/k)/|F|2. By Lemma 2.5 this is true. If 〈a, c〉 = 0 then the
submatrix of AC with indices [i1, i2] × [j1, j2] is all zero with probabil-
ity at least 1− 2k log(n2/k)/|F| + k log(n2/k)/|F|2. That is, if 〈a, c〉 =
0 then the submatrix has a nonzero entry with probability at most
2k log(n2/k)/|F| − k log(n2/k))/|F|2.
Without loss of generality, let i1 = i2 = i and j1 = j2 = j. We have that
〈a, c〉 = 〈uai, vcj〉 = 0, where u, v are chosen uniformly at random from
F. Therefore, Pr(〈ai, cj〉 6= 0) ≥ 2/|F| − 1/|F|2. The latter is a lower
bound on the probability to not detect a nonzero entry in the output
matrix. A union bound over the k log(n2/k) queries needed to isolate
the k nonzero entries, gives us the probability to incur in at least one
false negative. By considering its complement, the claim follows.
Pr(〈a, c〉 = 〈uiai, vjcj〉 = 0), with 〈ai, cj〉 6= 0, is given by the proba-
bility of choosing either ui or vj zero uniformly at random from F. By
altering the algorithm, such that random entries are now generated from
F∗ = F \ {0}, we derive the following lemma.
Lemma 2.8. Let A ∈ Fn×n, C ∈ Fn×n and let AC ∈ Fn×n have at most k
nonzero entries. Let F∗ = F \ {0}, consider the submatrix of AC with indices
2.5. Probabilistic Analysis 39
[i1, i2] × [j1, j2] and assume to query the submatrix with sketches a, c as in
Theorem 2.1 where the entries of the vectors u and v are chosen uniformly at
random from F∗.
(i) The submatrix has a nonzero entry if and only if 〈a, c〉 6= 0 with proba-
bility at least 1− 1/|F∗|.
(ii) The submatrix is all zero if and only if 〈a, c〉 = 0 with probability at least
1− k log((n2/k)− 1)/|F∗|.
Lemma 2.8. (i) If 〈a, c〉 6= 0, then there exist i, j ∈ [m] such that ui, vj 6= 0
and 〈ai, cj〉 6= 0, hence, (AC)i,j 6= 0. If there is a nonzero entry then
〈a, c〉 6= 0 with probability at least 1 − 1/|F∗|. This is equivalent of
saying that if there is a nonzero entry then 〈a, c〉 = 0 with probability at
most 1/|F∗|.
If i1 = i2 = i, j1 = j2 = j and 〈ai, cj〉 6= 0 then 〈a, c〉 6= 0 since scaling
vectors with random elements from F∗ preserves non orthogonality. If
i1 < i2 and j1 < j2 and the submatrix contains exactly one nonzero entry,
the same reasoning applies. If the submatrix has ` > 1 nonzero entries,
then, without loss of generality, there exist a1, . . . , a`, c1, . . . , c` such that
〈u1a1, v1c1〉+ · · ·+ 〈u`a`, v`c`〉 = 0 and 〈ai, cj〉 6= 0, for all i, j ∈ [`]. That
is, ` inner products that generate as many nonzero entries and produce
a false negative when the submatrix is queried. By the linearity of the
inner product, we have that
〈u1a1, v1c1〉+ · · ·+ 〈u`a`, v`c`〉 = u1v1〈a1, c1〉+ · · ·+ u`v`〈a`, c`〉.
Hence, the sum cancels whenever
ui = −u1v1〈a1, c1〉+ · · ·+ u`v`〈a`, c`〉vi〈ai, ci〉
for a generic i ∈ [`]. Note that, such a ui is in F since fields guarantee
the existence of additive and multiplicative inverses. The probability to
choose ui such that it cancels the other inner products is the same as
choosing an element from F∗ uniformly at random, i.e. 1/|F∗|.
(ii) If the submatrix of AC with indices [i1, i2] × [j1, j2] is all zero
then 〈a, c〉 = 0 with probability at least 1 − k log((n2/k) − 1)/|F∗|.
By Lemma 2.5 this is true. If 〈a, c〉 = 0 then the submatrix of
AC with indices [i1, i2] × [j1, j2] is all zero with probability at least
40 Chapter 2. Cache Oblivious Sparse Matrix Multiplication
1− k log((n2/k) − 1)/|F∗|. That is, if 〈a, c〉 = 0 then the submatrix of
AC has a nonzero entry with probability at most k log((n2/k)− 1)/|F∗|.
If 〈a, c〉 = 0 and i1 = i2 = i, j1 = j2 = j then 〈ai, cj〉 = 0. The same
reasoning applies for i1 < i2, j1 < j2 and exactly one nonzero entry in
the submatrix. Let 〈a, c〉 = 0 and suppose there exist a1, . . . , a`, c1, . . . , c`
such that 〈u1a1, v1c1〉 + · · · + 〈u`a`, v`c`〉 = 0 and 〈ai, cj〉 6= 0, for all
i, j ∈ [`]. Hence, as in (i), the sum cancels with probability 1/|F∗|. The
latter is a lower bound on the probability to not detect a nonzero entry
in the output matrix. A union bound over the k log((n2/k)− 1) queries
needed to isolate the k nonzero entries, gives us the probability to incur
in at least one false negative. We do not consider the last layer, i.e.
log(n2/k), as it does not involve any stochastic process. By considering
its complement, the claim follows.
Lemma 2.7 and Lemma 2.8 refine the analysis of [WY14] and gives
guarantees on the probability that, multiplying two sketches a and c re-
lated with submatrices of A, with row indexes in [i1, i2], and C, with
column indexes in [j1, j2], respectively, gives the correct answer, i.e.
〈a, c〉 = 0 if there is no entry in the submatrix, 〈a, c〉 6= 0 otherwise.
In addition, Lemma 2.7 gives guarantees even for small fields, e.g. F2.
On the other hand, Lemma 2.8 does not apply for F2 as removing the
zero from the field takes randomness out of the process. Nevertheless,
removing the zero allow us to deterministically detect entries whenever
a submatrix with exactly one nonzero element is considered.
2.6 Conclusions
We present randomized algorithms for multiplying two n × n matri-
ces with elements over an arbitrary field F. Our analysis covers the
Word RAM model, the Cache Oblivious model and the Parallel External
Memory model. The algorithms exploit a preprocessing phase in order
to compute sketches used to query the output matrix and to find the k
nonzero entries of the matrix product.
The algorithms presented require no specific knowledge of the in-
put/output matrices, such as number of nonzero entries per column or
balanced distribution of the entries, as, e.g., in [IS09, JS15]. As a matter
of fact, we ask whether knowledge on the structure of the input/output
matrices may be exploited in order to improve our algorithms.
2.6. Conclusions 41
Algorithm 2.3 computes the matrix product cache obliviously. That
is, no knowledge of M and B is required during the computation. Con-
versely, Pagh and Stöckel [PS14] provided an optimal cache aware al-
gorithm for multiplying sparse matrices that uses blocking arguments
and matrix size estimations. Motivated by [PS14], we ask whether such
bounds hold even for cache oblivious models. Similarly, in parallel
cache-aware models, we formulate the problem whether the same tech-
niques from [PS14] are efficient even in parallel models. That is, we
ask whether the following bound O˜((h/PB)min{√k/√M, h/M}), with
matching lower bound, holds in the Parallel External Memory model.
The parallel algorithm from Theorem 2.3 has been designed only in
a cache aware model. Given the concurrency problems that arise from
concurrent cache oblivious models [BFGK05]. An interesting future di-
rection is to try to extend the bounds from Theorem 2.3 to a concurrent
cache oblivious model.

Chapter 3
Atlantic City Boolean Matrix Multiplication
In this chapter, we study the problem of Boolean Matrix multiplication
in the Word RAM model and in the Cache Oblivious model. We present
novel randomized techniques for computing the product of two Boolean
matrices A ∈ {0, 1}n×n and C ∈ {0, 1}n×n with h nonzero entries.
The first algorithm computes all the k nonzero entries of AC ∈
{0, 1}n×n using expected O˜(h + h√k + k) time and expected O˜(h/B +
h
√
k/B + k/B) I/Os cache obliviously, where the notation O˜(·) sup-
presses polylogarithmic factors, matching the time bound of Van Gucht
et al. [VG+15]. Similarly, we propose a second, more refined, ran-
domized algorithm that computes the k nonzero entries of the Boolean
product AC using O(h+ h√k+ k) expected operations and O(sort(h) +
h
√
k/B + k/B) expected I/Os cache obliviously, thus improving over
our previous result and over Van Gucht et al. by a log2 n factor.
The guarantees for detecting nonzeros are given in terms of the size
of the matrix and state an error probability of at least 1 − O(1/n1/4)
for detecting single entries. For a specif class of output matrices, we
guarantee a probability of at least 1− 1/2
√
k to exceed our time and I/O
bounds while computing nonzero in submatrices of AC. In the general
case, our bounds are guaranteed with error probability at most 1/
√
k.
3.1 Introduction
Randomized algorithms are usually categorized in two classes: Monte
Carlo and Las Vegas. The former give correct results with a bounded
probability using deterministic runtime. The latter produce always the
44 Chapter 3. Atlantic City Boolean Matrix Multiplication
correct answer, with a random variable as runtime whose expectation is
bounded. In between these categories, there lies a third class: Atlantic
City algorithms. These algorithm give correct results with a bounded
probability and their runtime is a random variable whose expectation
is also bounded. Mollin [Mol02] provides the following definition: “an
Atlantic City algorithm is a probabilistic polynomial-time algorithm that
answers correctly at least seventy-five percent of the time. In other ver-
sions of the algorithm, the value 3/4 may be replaced by any value
greater than 1/2”. As Mollin reports, the term Atlantic City was first
introduced by J. Finn in one of his unpublished manuscripts.
The algorithms presented in this chapter compute the product of
matrices where single entries are computed with error probability poly-
nomially decreasing on the size of the matrix. Conversely, the expected
time and I/O bounds on output submatrices are guaranteed with proba-
bility at least polynomially decreasing on the size of the product matrix.
3.2 Contributions
We study the problem of Boolean matrix multiplication in the Word
RAM model and in the Cache Oblivious model. We present a ran-
domized algorithm for multiplying matrices A ∈ {0, 1}n×n and C ∈
{0, 1}n×n, with h = nnz(A) + nnz(C) nonzero entries which can be sum-
marized in the following theorem.
Theorem 3.1. Let A, C ∈ {0, 1}n×n be Boolean matrices and assume A and
C have h nonzero entries. It is possible to compute all the k nonzero entries of
AC ∈ {0, 1}n×n using expected O˜(h+ h√k+ k) time and expected O˜(h/B+
h
√
k/B + k/B) I/Os cache obliviously.
The intuition behind Theorem 3.1 is to use size estimation algorithms
to compute the positions (i, j) of nonzero entries in highly sparse sub-
matrices of AC. In this framework, it is clear that, the more efficient
the estimation algorithm is, the more efficient the overall solution will
be. As a matter of fact, the polylogarithmic factors in Theorem 3.1 stem
from estimating the number of nonzero entries in the matrix product.
By considering an improved algorithm for size estimation, we derive
our second result presented in the following theorem.
3.3. Comparison with the Related Work 45
Theorem 3.2. Let A, C ∈ {0, 1}n×n be Boolean matrices and assume A and
C have h nonzero entries. It is possible to compute all the k nonzero en-
tries of AC ∈ {0, 1}n×n using expected O(h + h√k + k) time and expected
O(sort(h) + h√k/B + k/B) I/Os cache obliviously.
Theorem 3.1 and Theorem 3.2 share the same technique. However,
the primitive for the estimation step is now replaced with an extension
of Amossen et al. [ACP10] size estimator, where a 1± ε approximation
of k is computed using expected linear time in terms of the number of
nonzeros in the input matrices.
The guarantees of detecting nonzero entries are given in terms of
the size n of the input matrix. The algorithm from Theorem 3.1 detects
single nonzero entries with probability at least 1− 1/n. Conversely, the
algorithm from Theorem 3.2 detects single nonzero entries with proba-
bility at least 1−O(1/n1/4). Our time and I/O bounds are guaranteed
in terms of the size k of the matrix product AC. In the specific case of
entries on the same row/column not sharing the same submatrix, by a
Chernoff-Hoeffding bound on negatively associated random variables,
we prove that the probability of having highly dense submatrices is ex-
ponentially decreasing in k, namely 1− 2−
√
k for k > 4e2. In the general
case, we prove polynomially decreasing bounds in k, namely 1− 1/√k.
3.3 Comparison with the Related Work
O’Neil and O’Neil [OO73] presented a probabilistic algorithm to com-
pute the Boolean product of two n × n Boolean matrices using O(n2)
expected elementary operations. Their solution is based on an inductive
argument where they exploit precomputed positions of the nonzeros
entries. Nevertheless, their algorithm runs in worst case cn3 operations
while they achieve O(n2) for random matrices.
The Four Russians algorithm by Arlazarov, Dinic, Kronrod, and
Faradzhev [ADFK70] is the most famous combinatorial algorithm for
Boolean matrix multiplication. The Four Russians algorithm exploits
partitioning techniques and look up tables to compute the product of
Boolean matrices in the Word RAM model using O(n3/ log2 n) time,
with words of size Θ(log n). Bansal and Williams [BW09] presented
an improved combinatorial algorithm for Boolean Matrix multiplica-
tion that uses O(n3(log log n)2/ log9/4 n) time. Their algorithm com-
bined the Four Russians technique together with exploiting small sub-
46 Chapter 3. Atlantic City Boolean Matrix Multiplication
structures in the graph to give the aforementioned time algorithm
for triangle detection. Chan [Cha15] later improved over Bansal and
Williams and presented a divide and conquer algorithm that incurs in
O(n3(log log n)3/ log3 n) operations. The last known result, in term of
combinatorial algorithms, is due to Yu [Yu15] who presented a new al-
gorithm for triangle finding and Boolean matrix multiplication. Yu’s
algorithm generalizes the divide and conquer technique of Chan and
runs in O˜(n3/ log4 n) time. Our algorithms are not combinatorial in
nature. By exploiting randomization, we improve over combinatorial
algorithms and we achieve both input and output sensitivity.
Lingas [Lin09] presented an output-sensitive, randomized algo-
rithm that computes the product of two n × n Boolean matrices us-
ing O˜(n2kω/2−1) operations. Compared to Lingas, our algorithms are
both input and output sensitive and we improve over Lingas’ algo-
rithm whenever hk0.314 < n2, according to the actual exponent ω of
matrix multiplication [LG14]. Amossen and Pagh [AP09] presented
an algorithm for computing Boolean matrix products that incurs in
O˜(h2/3k2/3 + h0.862k0.408) operations.1 Compared to our first result,
considering h = k = O(n), our bound is worse than Amossen and
Pagh by a factor of n0.167. However, as primitives for our algorithm,
we make use of size estimators, compared to fast matrix multiplica-
tion à la Coppersmith and Winograd of [AP09]. Our second algorithm
closes the gap with Amossen and Pagh for sparse input matrices, given
O˜(g(n)) = O(g(n)1+o(1)), unless n is large. It holds ny ≤ log n if and
only if y ≤ log log n/ log n and n0.168 ≤ log n for n ≤ 108, assuming
[AP09] suppresses only a log n factor in the analysis. For log n log n fac-
tors, it holds n0.168 ≤ log n log n for n ≤ 1022. Assuming the latter and
ignoring the constants for fast matrix multiplication, we expect to see
asymptotic improvements of [AP09] over our solution only for n > 1022.
Amossen and Pagh, as well as Lingas, exploit fast matrix multipli-
cation. It is folklore, e.g. see [Rob05, VG+15], that fast matrix multipli-
cation algorithms are empirically impractical. It follows that, although
designing algorithms that rely on fast matrix multiplication may lead to
asymptotically better bounds, the result may be of no practical use. Con-
versely, our algorithms exploit simple primitives and they are designed
for providing practical advantages for computing matrix products.
Van Gucht et al. [VG+15] presented a randomized algorithm for
multiplying two Boolean matrices which runs in O˜(h + h√k + k) time.
1The bound from [AP09] becomes O˜(h2/3k2/3 + h0.859k0.406) with ω from [LG14].
3.4. Algorithms 47
Our algorithm is very close in spirit to their solution. As a matter of fact,
we treat dense subinstances of the problem as in [VG+15], i.e. by just
computing matrix vectors products. Conversely, Van Gucht et al. solve
sparse sub-instances using algorithmically more involved techniques,
including dyadic intervals, Count Sketch matrices and efficient recovery
techniques. Conversely, we use simple matrix products derived from es-
timation algorithms. In terms of complexity analysis, our first algorithm
matches the time bounds of Van Gucht et al. while our second, more
efficient algorithm, improves over [VG+15] by a log2 n factor. In term of
I/O bounds, our second algorithm improves over our first solution by a
log2 n factor in the h
√
k term.
3.4 Algorithms
The algorithms presented in this chapter are based on size estimation
techniques. The idea is to estimate the number of nonzero entries in
highly sparse submatrices. In the following, we survey the state-of-the-
art estimation algorithms used as subroutines by our algorithms. In
addition, we prove that estimation algorithms are false positive free.
This property is fundamental since false positives increase our time and
I/O complexity. Conversely, we prove that false negatives (also known
as silent failure) occur with probability 1/n.
Pagh and Stöckel [PS14] showed the existence of an algorithm that,
using quasilinear time and I/Os, i.e. O(e−3h polylog(n)), computes a
1± ε approximation k˜1, . . . , k˜n of the number of nonzero entries for the
columns of AC ∈ Fn×n with high probability.
Lemma 3.3 (Pagh and Stockel [PS14, Lemma 1]). Let F be an arbitrary
field and let A ∈ Fn×n and C ∈ Fn×n, with h = nnz(A) + nnz(C).
Furthermore, let δ, ε ∈ (0, 1]. We can compute estimates k˜1, . . . , k˜n using
O(ε−3h log(n/δ) log n) time and O˜(ε−3h/B) I/Os such that with probability
at least 1− δ it holds that (1− ε) nnz((AC)∗,j) ≤ k˜ j ≤ (1+ ε) nnz((AC)∗,j),
for all j ∈ [n].
Intuitively, such result holds for the rows of AC, since
CT AT = (AC)T. Therefore, assume k˜T1 , . . . , k˜
T
n are estimates for
nnz(AC)1,∗, . . . , nnz(AC)n,∗. Equivalently, the same guarantees hold,
i.e. (1− ε) nnz((AC)i,∗) ≤ k˜Ti ≤ (1+ ε) nnz((AC)i,∗), for all i ∈ [n].
Lemma 3.3 works by creating a F0-distinguishability sketch F ∈
{0, 1}d×n with d = O(ε−2 log n/δ) and ε constant. The estimates are
48 Chapter 3. Atlantic City Boolean Matrix Multiplication
then computed via (FA)C, i.e. a 2d dense-vector sparse-matrix multi-
plications of dimension d× n× n. In order to compute estimates with
high probability, in Lemma 3.3, we set δ = 1/n. This increases the time
complexity by a factor of two.
The following lemma ensures that the estimation algorithm does not
yield false positives, i.e. nnz((AC)∗,j) = 0 while k˜ j 6= 0. Since false
positives are deterministically detected, we assume to have d = 1 in the
following lemma.
Lemma 3.4. Let A, C ∈ {0, 1}n×n be matrices with pairwise orthogonal vec-
tors, i.e. AC = 0n×n. Assume to compute estimates k˜1, . . . , k˜n using the
F0-distinguishability sketch F ∈ {0, 1}1×n from Lemma 3.3. It holds k˜ j = 0
for all j ∈ [n].
Proof. Let j ∈ [n] be an arbitrary index. By construction, we have that
k˜ j = (F1A1,1 + · · ·+ Fn An,1)C1,j + · · ·+ (F1A1,n + · · ·+ Fn An,n)Cn,j
= F1(A1,1C1,j + · · ·+ A1,nCn,j) + · · ·+ Fn(An,1C1,j + · · ·+ An,nCn,j)
= F1(A1,∗C∗,j) + · · ·+ Fn(An,∗C∗,j) = 0
where the last equality follows since 〈Ai,∗, C∗,j〉 = 0 for all i, j ∈ [n] by
hypothesis.
Conversely, false negatives may occur, i.e. nnz((AC)∗,j) > 0 while
k˜ j = 0. Nevertheless, in the following, we prove that, with an appro-
priate choice of d, the size estimator is false negative free with high
probability.
Lemma 3.5. Let A, C ∈ {0, 1}n×n be matrices such that there exist at least
a nonzero entry in AC. Assume to compute estimates k˜1, . . . , k˜n using the
F0-distinguishability sketch F ∈ {0, 1}d×n from Lemma 3.3, with d = log n.
Then, there exists j ∈ [n] such that k˜ j 6= 0 with probability at least 1− 1/n.
Proof. Without loss of generality, assume (AC)i,j 6= 0 and ` is the witness
for 〈Ai,∗, C∗,j〉 6= 0, i.e. Ai,` · C`,j 6= 0. To lower bound Pr[k˜ j 6= 0], we
study its complement, i.e. Pr[k˜ j 6= 0] = 1− Pr[k˜ j = 0]. Consider the
definition of k˜ j from the proof of Lemma 3.4.
Pr[k˜ j = 0] = Pr[F1(A1,∗C∗,j) + · · ·+ Fi(Ai,∗C∗,j) + · · ·+ Fn(An,∗C∗,j) = 0]
≤ Pr[Fi(Ai,1C1,j + · · ·+ Ai,`C`,j + · · ·+ Ai,nCn,j) = 0]
≤ Pr[Fi(Ai,`C`,j) = 0],
3.4. Algorithms 49
0 1 0 0 0 1 1 0
0 0 0 1 0 1 0 0
0 1 0 0 0 0 1 0
0 0 0 0 0 1 0 0
0 1 0 1 1 0 0 0
0 1 0 1 0 0 0 0
0 0 0 0 0 1 1 0
0 1 0 1 0 0 1 0


0 1 0 0 1 0 1 0
0 0 0 0 0 0 0 0
1 1 0 1 0 1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 0 1 0 1 1 0 1


0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0


k˜1 k˜j· · · k˜n· · ·
k˜T1
k˜Ti
...
k˜Tn
...
Figure 3.1: Boolean matrices A, C ∈ {0, 1}n×n where the product matrix
AC has a single nonzero entry. By Lemma 3.4 and Lemma 3.5, the
estimates k˜1, . . . , k˜n and k˜T1 , . . . , k˜
T
n are zero except for k˜ j 6= 0 and k˜Ti 6= 0,
with high probability. Therefore, (AC)i,j 6= 0.
where the first inequality holds since, in the worst case, (AC)i,j is the
unique nonzero in (AC)∗,j while the second inequality holds since, in
the worst case, ` is the unique witness for 〈Ai,∗, C∗,j〉 6= 0. The probabil-
ity is given by choosing Fi as 0 uniformly at random from {0, 1}. With
a F0-distinguishability sketch F ∈ {0, 1}d×n we can increase such prob-
ability up to 1− 1/2d. Taking d = log n, it holds Pr[k˜ j 6= 0] = 1− 1/n,
whenever (AC)i,j 6= 0.
Algorithm overview The intuition for our algorithm is as follows.
Consider the matrices A, C ∈ {0, 1}n×n and assume that the product
AC ∈ {0, 1}n×n has a single nonzero entry, i.e. nnz(AC) = 1. In order to
compute AC it is sufficient to estimate the number of nonzero entries in
the columns and in the rows of AC. That is, by applying the algorithm
from Lemma 3.3, we have access to estimates k˜1, . . . , k˜n and k˜T1 , . . . , k˜
T
n
for the number of nonzeros in the columns and in the rows of AC. It
follows that, there exists a unique pair (i, j) ∈ n × n such that k˜ j 6= 0
50 Chapter 3. Atlantic City Boolean Matrix Multiplication
and k˜Ti 6= 0 with high probability. The uniqueness holds since AC has
a single nonzero entry and, by Lemma 3.4, an F0 sketch on random
linear combinations of orthogonal vectors do not yield false positives.
Additionally, by Lemma 3.5, (i, j) is reported with probability 1− 1/n.
Hence, we set (AC)i,j = 1, see Figure 3.1. Nevertheless, the constraint
introduced so far, i.e. nnz(AC) = 1, is rigid and makes the algorithm
impractical. However, by exploiting random permutations and matrix
splittings we can extend this technique to a more general case.
Lemma 3.6. Let AC ∈ {0, 1}n×n be the product of two Boolean matrices
A, C ∈ {0, 1}n×n. Let k = nnz(AC) and assume that every row and every
column of AC has at most
√
k nonzero entries. Assume AC is the matrix
resulting from permuting uniformly at random the rows of A and the columns
of C with permutation functions pir and pic. Divide AC into k submatrices of
size n/
√
k× n/√k. The expected number of nonzero entries in each submatrix
is one.
Proof. Suppose to enumerate the k nonzero entries as z = 1, . . . , k. Con-
sider an arbitrary submatrix (AC)` of size n/
√
k × n/√k. Define an
indicator random variable
X`,z =
{
1 if the z-th nonzero entry is in (AC)`
0 otherwise
(3.1)
We have X` = ∑kz=1 X`,z = nnz((AC)
`). The probability that a nonzero
entry is mapped into a submatrix (AC)` is given by the probability that
among n possible outcomes the random permutation function for the
rows pir maps a nonzero to the n/
√
k rows of (AC)` and the random
permutation function for the columns pic maps a nonzero to the n/
√
k
columns of (AC)`. That is, if z = (AC)i,j then pir(i) = i′ and pic(j) = j′,
where i′ and j′ are row and column indices within the submatrix (AC)`.
It follows,
Pr[pir(i) = i′ ∧ pic(j) = j′] = n/
√
k
n
· n/
√
k
n
=
1
k
. (3.2)
The expected number of nonzero entries in the submatrix (AC)` is given
by
E[X`] = E
[ k
∑
z=1
X`,z
]
=
k
∑
z=1
E[X`,z] =
k
∑
z=1
Pr[pir(i) = i′ ∧ pic(j) = j′] = 1,
which follows from the linearity of expectation.
3.4. Algorithms 51
Algorithm BMM(A, C)
Input A, C ∈ {0, 1}n×n
Output AC ∈ {0, 1}n×n
(1) Generate uniformly distributed random permutations pir and pic
and permute the rows and columns of AC.
(2) Compute estimates for the number of nonzero entries in the rows
of AC and for the number of nonzero entries in the columns of
AC.
(3) Define Dc = {j ∈ [n] : k˜ j >
√
k} and Dr = {i ∈ [n] : k˜Ti >
√
k}.
That is, Dc and Dr refers to dense columns and rows of AC re-
spectively, i.e. with more than
√
k nonzeros. Similarly, define
Sc = [n] \ Dc and Sr = [n] \ Dr, i.e. sparse columns and rows
of AC.
(4) For each index in Dc and Dr compute directly the matrix vector
product.
(5) Filter AC by excluding all the column vectors j ∈ Dc and all the
row vectors i ∈ Dr from the estimation algorithm.
(6) Divide AC into k submatrices (AC)` of size n/
√
k × n/√k, with
` ∈ [k].
(7) For each submatrix (AC)`, compute estimates for the number of
nonzero entries in the columns of (AC)`, k˜1, . . . , k˜ν, and for the
number of nonzero entries in the rows of (AC)`, k˜T1 , . . . , k˜
T
ν , with
ν = n/
√
k.
(8) For each k˜ j 6= 0 and k˜Ti 6= 0, set (AC)`i,j = 1, with i, j indices of the
sumbatrix (AC)`.
Algorithm 3.1: Boolean Matrix Multiplication
Lemma 3.6 allows to use our main technique for an arbitrary number
of nonzero entries in the output matrix. Nevertheless, we still require
the condition that every row and column of AC has at most
√
k nonzero
entries. With a sparsification trick we achieve full generality.
Theorem 3.1. Let A, C ∈ {0, 1}n×n be Boolean matrices and assume A and
C have h nonzero entries. It is possible to compute all the k nonzero entries of
52 Chapter 3. Atlantic City Boolean Matrix Multiplication
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0


k˜j k˜j′· · ·
k˜Ti
k˜Ti′
...
(a)
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0


k˜j k˜j′· · ·
k˜Ti
k˜Ti′
...
(b)
Figure 3.2: Ambiguous estimators of a Boolean matrix AC ∈ {0, 1}n×n
where k˜ j 6= 0 and k˜ j′ 6= 0 and k˜Ti 6= 0 and k˜Ti′ 6= 0. To discern whether
(a) (AC)i,j 6= 0 and (AC)i′,j′ 6= 0 or (b) (AC)i,j′ 6= 0 and (AC)i′,j 6= 0,
we generate two estimation subproblems where the rows Ai,∗ and Ai′,∗
are mutually ignored by the estimation algorithm. In the worst case, k
nonzero entries in a submatrix generate k estimation subproblems.
AC ∈ {0, 1}n×n using expected O˜(h+ h√k+ k) time and expected O˜(h/B+
h
√
k/B + k/B) I/Os cache obliviously.
Proof. Steps (3),(5) and (6) of Algorithm 3.1 require at most a linear scan
of the input matrices. Hence, we focus on the remaining steps. Ran-
dom permutations can be computed in O(h) time and O˜(h/B) I/Os
[BBF+10]. Estimates in (2) require O˜(h) time and O˜(h/B) I/Os, by
Lemma 3.3.
Consider now the set Dc. It holds |Dc| ≤
√
k. If, by contradiction,
|Dc| >
√
k then there would be more than k entries in AC, since Dc
collects the indices of column vectors with more than
√
k entries. Hence,
the step (4) requires to compute at most
√
k matrix vector products. It
follows O˜(h√k) and O˜(h√k/B) I/Os. After the filtering step (5), every
column and every row of AC has at most
√
k nonzero entries since all the
dense vectors with more than
√
k nonzeros have already been computed.
Therefore, we can apply the estimation technique from Lemma 3.6.
Step (7) requires to compute k estimates for k distinct submatrices.
In the event that a submatrix has more than one nonzero entry, it is
impossible to recover the entries. For instance, consider the case where
the algorithm from Lemma 3.3 returns estimates k˜ j 6= 0 and k˜ j′ 6= 0 and
k˜Ti 6= 0 and k˜Ti′ 6= 0. It is impossible to discern whether (AC)i,j 6= 0 and
3.4. Algorithms 53
(AC)i′,j′ 6= 0 or (AC)i,j′ 6= 0 and (AC)i′,j 6= 0, see Figure 3.2. Hence, we
need to run the estimation algorithm for each nonzero in the submatrix
(AC)`. Since we are interested in the expected running time, we repeat
the estimation algorithm only E[X`] times.
Define φ(A, C, k) as the complexity of step (7) with parameters A, C
and k. In the following, we use the notation AI to denotenews n/
√
k
consecutive rows of A and C J for n/
√
k consecutive columns of C. The
submatrix of AC induced by AI and C J is (AC)`=I∩J with related ran-
dom variable X`=I∩J .
E[φ(A, C, k)] ≤
√
k
∑
I=1
√
k
∑
J=1
E[X`=I∩J ]
(
∑
i∈AI
O˜(nnz(Ai,∗)) + ∑
j∈C J
O˜(nnz(C∗,j))
)
=
√
k
∑
I=1
(
∑
i∈AI
O˜(nnz(Ai,∗)) + O˜(nnz(C))
)
(3.3)
= O˜(nnz(A) + nnz(C)
√
k),
where (3.3) holds since E[X`=I∩J ] = 1 by Lemma 3.6. Hence, O˜(h
√
k)
time and O˜(h√k/B) I/Os.
3.4.1 Hashing Based Estimators
Amossen et al. [ACP10] designed an algorithm for estimating the num-
ber of nonzero elements in a sparse Boolean matrix product. They for-
mulate their analysis in terms of a join operation between two binary
relations R1 and R2. However, we can rephrase their main results in
terms of Boolean matrix multiplication.
Lemma 3.7 (Amossen et al. [ACP10]). Let A, C ∈ {0, 1}n×n, h =
nnz(A) + nnz(C) and assume k = nnz(AC). Let ε be a given constant,
with 0 < ε < 1/4. There are algorithms that run in expected O(h) time in
the RAM model, and expected O(sort(h)) I/Os in the Cache Oblivious model,
that output k˜ such that Pr[(1− ε)k < k˜ < (1+ ε)k] = 1−O(1/√h).
We provide an intuition of the idea from [ACP10]. For a set U, let
η1, η2 : U → [0, 1] be hash functions chosen independently at random
from a pairwise independent family, and define η : U ×U → [0, 1] by
η(x, y) = (η1(x)− η2(y)) (mod 1). (3.4)
54 Chapter 3. Atlantic City Boolean Matrix Multiplication
Note that, by taking the (mod 1) in Formula 3.4 we have η(x, y) ∈
[0, 1]. Let W = {w ∈ [n] | Ai,w 6= 0 ∨ Cw,j 6= 0}. Observe that, the set of
witnesses, see Definition 1.1, is a subset of W where both Ai,w and Cw,j
are required to be different than zero.
For each w ∈ W let Aw = {i ∈ [n] | Ai,w 6= 0} and Cw = {j ∈ [n] |
Cw,j 6= 0}. The intuition behind [ACP10] is to efficiently iterate over all
pairs (x, y) ∈ Aw × Cw, with w ∈ W, for which η(x, y) is smaller than a
threshold t = min{1/√h,√h/(|Aw||Cw|)}, see [ACP10, Lemma 2, Sec-
tion 2.4]. While finding the relevant pairs, Amossen et al. maintain the x
smallest hash values in an unordered buffer in constant amortized time
per insertion.2 By setting x to
√
h, Lemma 3.7 holds. Note that, in terms
of x, the algorithm from Lemma 3.7 estimates the number of nonzero
entries using expected O(x2) time with error probability of O(1/x).
Lemma 3.7, as is, cannot replace in Theorem 3.1 the estimator algo-
rithm, as it outputs a cumulative estimate of the nonzero entries of the
matrix product instead of an estimate vector k˜1, . . . , k˜n. An extension of
Lemma 3.7 allows to compute estimate vectors.
Lemma 3.8. Let A, C ∈ {0, 1}n×n be two Boolean matrices and let ε be a
given constant, with 0 < ε < 1/4. There are algorithms that run in ex-
pected O(h√k) time in the RAM model, and expected O(sort(h) + h√k/B)
I/Os in the Cache Oblivious model, that computes k˜ j such that Pr[(1 −
ε) nnz((AC)∗,j)) < k˜ j < (1+ ε) nnz((AC)∗,j)] ≥ 1−O(1/k1/4).
Proof. Estimate the number of nonzero entries in AC using the algorithm
from Lemma 3.7 and setting x to
√
h. That is, we have access to an
estimate k of nnz(AC) with error probability O(1/√h). This estimate
is used to initialize buffers for the following step. Construct the matrix
C′ ∈ {0, 1}n×n as follows
C′∗,` =
{
C∗,j if ` = j,
0n×1 otherwise,
with ` ∈ [n].
Estimate the number of nonzero entries in (AC′) with the algorithm
from Lemma 3.7, setting x to h1/2k1/4/n1/2 where k is the estimate ob-
tained in the preliminary estimation. After O(h√k/n) time we have ac-
cess to estimate k˜ with error probability of at most O(1/k1/4). It holds
k˜ j = k˜. By repeating the estimation process for the n columns of C we
have the claim.
2Amossen et al. [ACP10] uses k for the number of hash values stored in the buffer
and n as the number on nonzero entries in A and C.
3.4. Algorithms 55
Lemma 3.8 gives guarantees in terms of the size of the product ma-
trix, i.e. k. For k = n, the overall number of hash values stored is at
least n5/4, a factor of n3/4 more than in Lemma 3.7. Hence, we expect
our estimators to be more accurate. Conversely, for highly sparse sub-
matrices, e.g. when k is constant, as in Theorem 3.1, the probability of
detecting nonzero entries as in Lemma 3.8 degrades. The intuition is to
binary search among the n columns of AC using Lemma 3.7 to answer
membership queries.
Lemma 3.9. Let A, C ∈ {0, 1}n×n be two Boolean matrices and assume AC ∈
{0, 1}n×n has a single nonzero entry. Let ε be a given constant, with 0 < ε <
1/4. There are algorithms that run in expected O(h) time in the RAM model,
and expected O(sort(h)) I/Os in the Cache Oblivious model, that computes
AC with probability at least 1−O(1/n1/4).
Proof. Divide the matrix C into two evenly divided submatrices C′ and
C′′ of size n× n/2. Estimate the number of nonzero entries in (AC′) and
(AC′′) with the algorithm from Lemma 3.7, setting x to h1/2/ log1/2 n.
The submatrix containing the single nonzero entry is detected with
probability 1 − O(log1/2 n/h1/2) ≥ 1 − O(1/n1/4) using O(h/ log n)
time. After log n queries, we detect the column j ∈ [n] such that
nnz((AC)∗,j) 6= 0. We repeat such process for the rows of AC thus
finding i ∈ [n] such that nnz((AC)i,∗) 6= 0.
In contrast to Lemma 3.3, where the estimation was performed d =
log n times, Lemma 3.8 gives weaker guarantees whenever the set of
witnesses for the entry (AC)i,j has few elements or, equivalently, when
Ai,∗ and C∗,j are highly sparse. Conversely, the denser the submatrix
(analogously, the larger the ratio h/n) the more hash values are collected
in the buffer.
Before restating the main theorem we prove that the estimation algo-
rithm does not yield false positives, i.e. nnz((AC)∗,j) = 0 while k˜ j 6= 0.
Lemma 3.10. Let A, C ∈ {0, 1}n×n be matrices with pairwise orthogonal vec-
tors, i.e. AC = 0n×n. Compute the estimate k˜ using the algorithm from Lemma
3.7. It holds k˜ = 0.
Proof. Let WA = {w ∈ [n] | Ai,w 6= 0} and WC = {w ∈ [n] | Cw,j 6= 0}
as defined in the algorithm of Lemma 3.7. It holds W = WA ∪WC.
Unless A, C ∈ 0n×n, W 6= ∅. Conversely, WA ∩WC = ∅. Assume,
by contradiction, WA ∩WC 6= ∅. Then, there exists w ∈ [n] such that
56 Chapter 3. Atlantic City Boolean Matrix Multiplication
Ai,w 6= 0 and Cw,j 6= 0. Hence, (AC)i,j 6= 0 since w is a witness for the
inner product 〈Ai,∗, C∗,j〉. Therefore, we reached a contradiction since
we assumed nnz((AC)i,∗) = 0. It follows, Aw × Cw = ∅ since Aw and
Cw are mutually disjoint.
Theorem 3.2. Let A, C ∈ {0, 1}n×n be Boolean matrices and assume A and
C have h nonzero entries. It is possible to compute all the k nonzero en-
tries of AC ∈ {0, 1}n×n using expected O(h + h√k + k) time and expected
O(sort(h) + h√k/B + k/B) I/Os cache obliviously.
Proof. The proofs follows the outline of Theorem 3.1. We generate uni-
formly distributed random permutations pir and pic and we permute the
rows and columns of A and C. In Algorithm 3.1, we replace the estima-
tor of step (2) with algorithm from Lemma 3.8 and the estimator of step
(7) with the algorithm from Lemma 3.9. The sort(h) term in the I/O
complexity stems from the sorting phase required by the estimator from
Lemma 3.7, where sets Aw and Cw are sorted according to the witnesses
w ∈W.
3.5 Probabilistic Analysis
The algorithms from Theorem 3.1 and Theorem 3.2 have two degrees of
randomness. The first, related with the correctness of the output, has
been analyzed in Chapter 3.4. There, we gave a probability of 1− 1/n
for detecting single entries with the algorithm from Theorem 3.1 and a
probability of 1−O(1/n1/4) with the algorithm from Theorem 3.2. The
second degree of randomness concerns the running time and the num-
ber of I/Os, which depends on the sparseness of the submatrices. We
can recover entries from ambiguous estimators by running the estima-
tion algorithm for each entry. This requires linear time in the size of the
input matrices. If the random generated permutations do not distribute
uniformly the entries across the matrix, submatrices may be denser than
expected, leading to a worst case of k nonzero in (AC)` and O(hk) time
and I/Os for recovery. In general, more than
√
k nonzero per submatrix
worsen our time bounds. In the following, we study tail bounds of our
distribution and, under specific constraints, we give an exponentially
decreasing probability of having highly dense submatrices in AC.
We would like to apply tail bounds à la Chernoff on the ran-
dom variables X` = ∑kz=1 X`,z. Unfortunately, the random variables
3.5. Probabilistic Analysis 57
X`,1, . . . , X`,k are not independent. While independence was never re-
quired by Lemma 3.6, which follows from the linearity of expectation,
Chernoff bounds require the random variables to be independent and
tail bounds cannot be studied directly. Nevertheless, Dubhashi and Pan-
conesi [DP09] show how to bypass independence whenever the random
variables are negatively associated, see Definition 1.5. We proceed by
following the analysis of [DP09].
As discussed in Section 1.6, for negatively associated random vari-
able, we can still apply Chernoff-Hoeffding bounds and obtain expo-
nentially decreasing probabilities for our probability distribution. In a
restricted setting, we show how to prove that the random variables X`,z
are negatively associated.
Lemma 3.11. In the settings of Lemma 3.6, if entries on the same row or on
the same column are forced not to share the same submatrix, then the random
variables X`,z, with `, z ∈ [k], are negatively associated.
Proof. Let I, J ∈ F be disjoint subsets from a family of disjoint subsets F
of [k] and let f , g be nondecreasing functions such that f (0) = g(0) = 0.
Consider two entries z ∈ I and z′ ∈ J. We need to prove that the
variables X`,z and X`,z′ are negatively associated. If z and z′ are in the
same row/column, then Pr[(X`,z = 1) ∧ (X`,z′ = 1)] = 0 by hypothesis.
Hence, E[ f (X`,z)g(X`,z′)] = 0. If the entries do not reside on the same
row and on the same column, then Pr[(X`,z = 1) ∧ (X`,z′ = 1)] = 1/k2.
Hence, E[ f (X`,z)g(X`,z′)] = 1/k2. The probability of an entry z being
mapped in a submatrix (AC)` is given by Pr[X`,z = 1] = 1/k. It follows,
E[ f (X`,z)g(X`,z′)] ≤ E[ f (X`,z)] E[g(X`,z′)].
Similarly, consider I, J and f , g as above. Let i ∈ I and j ∈ J. If
Xi,z = 1 then Xj,z = 0 since, given z ∈ (AC)i, z 6∈ (AC)j. Hence
E[ f (Xi,z)g(Xj,z)] = 0. It follows,
E[ f (Xi,z)g(Xj,z)] ≤ E[ f (Xi,z)] E[g(Xj,z)],
since E[ f (Xi,z)] ≥ 0 and E[g(Xj,z)] ≥ 0 and f , g are nondecreasing.
Since each variable is negatively associated then the entire fam-
ily X`,z, with `, z ∈ [k], is negatively associated (closure under prod-
ucts, Definition 1.6). Additionally, since the variables X`,z are nega-
tively associated from a family of disjoint subsets F then the variables
58 Chapter 3. Atlantic City Boolean Matrix Multiplication
X` = ∑kz=1 X`,z, with ` ∈ [k], are negatively associated (disjoint mono-
tone aggregation, Definition 1.7).
The constraints introduced in Lemma 3.11 are restrictive as they force
entries from the same row/column not to be placed in the same subma-
trix. Nevertheless, given two entries z and z′ on the same row/column,
if z ∈ (AC)` and z′ ∈ (AC)` then proving Chernoff-Hoeffding bounds
is nontrivial as X`,z and X`,z′ are not even negatively associated.
Lemma 3.12. Let AC ∈ {0, 1}n×n be the product of two Boolean matrices
A, C ∈ {0, 1}n×n and let k = nnz(AC). Assume every row and every column
of AC has at most
√
k nonzero entries. Assume AC is the matrix resulting from
permuting uniformly at random the rows of A and the columns of C. Divide
AC into k submatrices (AC)` of size n/
√
k× n/√k. If entries in the same row
or in the same column do not share the same submatrix then the probability that
a submatrix has strictly more than
√
k entries is at most 1/2
√
k for k > 4e2.
Proof. By Lemma 3.6, E[X`] = 1. The lemma follows from Lemma 3.11,
Theorem 1.8 and the standard Chernoff-Hoeffding bound, see Theo-
rem 1.4.
In the general case, i.e. whenever entries in the same row or column
share the same submatrix, we can still prove tail bounds. The latter are
only polynomially decreasing on k and they are derived from Markov’s
inequality, Theorem 1.3.
Lemma 3.13. In the settings of Lemma 3.6, if two entries in the same row
or in the same column share the same submatrix, then the probability that a
submatrix has more than
√
k entries is at most 1/
√
k.
Proof. By Lemma 3.6, E[X`] = 1. The lemma follows from Markov’s
inequality, Theorem 1.3, by setting δ =
√
k.
3.6 Empirical Study of Sparse Submatrices
The sparseness of the submatrices is validated using empirical evidence.
Figure 3.3 shows a script that, given an arbitrary n × n matrix A with
n = 102, k nonzero entries and at most
√
k nonzero entries per row and
column, permute the rows and the columns uniformly at random. The
matrix is then split in k submatrices of size n/
√
k× n/√k and, for each
submatrix A`, we increase the counter relative to nnz(A`) obtaining the
3.6. Empirical Study of Sparse Submatrices 59
0 20 40 60 80
0
20
40
60
80
(a)
0 20 40 60 80
0
20
40
60
80
(b)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0
100
200
300
400
500
Fr
eq
ue
nc
y
(c)
0 20 40 60 80
0
20
40
60
80
(d)
0 20 40 60 80
0
20
40
60
80
(e)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0
100
200
300
400
500
600
Fr
eq
ue
nc
y
(f)
Figure 3.3: Block diagonal matrix before (a) and after (b) the rows
and the columns were permuted uniformly at random. The frequency
histogram (c) counts the number of submatrices with x ∈ {0, . . . , 16}
nonzero entries. Block matrix before (d) and after (e) the rows and the
columns were permuted with related frequency histogram (f).
related frequency histogram. While it is not straightforward to identify
hard instances for this problem, we investigates how random permuta-
tion distribute entries for block diagonal and block matrices.
The block diagonal matrix has k = 1000 nonzero entries. Each row
and column is d-sparse, meaning that there are exactly d = 10 nonzero
entries. The matrix is split in k submatrices using d√ke = 31 row/col-
umn splitters thus obtaining submatrices of 4× 4 size. The block matrix
has k = 1024 nonzero entries. Each row and column has
√
k = 32
nonzero entries. The matrix is split in k submatrices using
√
k row/col-
umn splitters thus obtaining submatrices of 4× 4 size.
The frequency histograms show that the majority of the submatrices
are empty. By Lemma 3.4 and Lemma 3.10, the estimation algorithms do
not report nonzero entries for such submatrices. Conversely, the number
of dense submatrices approaches zero rapidly. Note that, according to
our parameters n and k, the probability of having more than
√
k entries
60 Chapter 3. Atlantic City Boolean Matrix Multiplication
is zero, since the submatrices have size 4 × 4. However, for different
parameters we expect to see a more evident decreasing probability.
3.7 Conclusions
In this chapter we considered the problem of Boolean matrix multiplica-
tion in the Word RAM model and in the Cache Oblivious model. We pre-
sented a randomized algorithm that computes the product of Boolean
matrices with worst case complexity matching the state of the art algo-
rithm for Boolean matrix multiplication by [VG+15] with the advantage
of using, as a subroutine, estimation algorithms only. What is more, we
improve over the aforementioned solution by providing a second, more
refined algorithm that improves over our solution and over [VG+15] in
the polylogarithmic factors in regard to time complexity.
The partitioning step of our algorithms divides the output matrix in
meta-rows and meta-columns of size n/
√
k. This technique seems to be
relatively standard as it appears similarly in Van Gucht et al. [VG+15]
and Lingas [Lin09]. Amossen et al. [AP09] exploit improved partition-
ing techniques for isolating dense submatrices which are computed us-
ing fast matrix multiplication. This technique does not apply directly
to our framework. We ask whether using estimators differently, e.g. by
detecting zeros in highly dense submatrices, achieves improved bounds,
similar to [AP09], e.g. O(h + h2/3k2/3 + k).
The random permutation, when applied to input matrices, may not
spread the nonzero entries of the output uniformly across the matrix.
Dense submatrices with more than
√
k nonzero entries worsen our time
and I/O bounds. Although we prove polynomially decreasing probabil-
ity of having dense submatrices, we ask whether we can achieve a partial
derandomization by designing deterministic algorithms to partition the
output matrix in submatrices with O(1) nonzero entries.
In relation to the sparseness of the submatrices, the tail bounds
guarantee either polynomially or exponentially decreasing probability
of having dense submatrices. In order to guarantee that time and I/O
bounds deviate from their expectation with low probability, we have to
prove that the probability of having at least one dense submatrix, i.e.
with more than
√
k nonzeros, is approaching zero rapidly as k grows.
For our exponential tail bounds such probability is 1−√k/2
√
k. Con-
versely, for our polynomial tail bounds the probability is unsatisfactory.
3.7. Conclusions 61
Consequently, we leave open the problem of proving stronger bounds
(e.g. exponential tail inequalities) for the general case.
The algorithms presented in this chapter are Atlantic City, i.e. time,
I/Os and output are random variables. On the front of derandomiza-
tion, we ask whether we can design full Las Vegas algorithms, where
only running time and number of I/Os are random variables while the
output is always correct. The main hurdle for this problem is to provide
deterministic algorithms for detecting nonzero entries in highly sparse
submatrices. Note that, this problem is presumably easier compared to
approximating the number of nonzero entries in a matrix product.
The framework we used for computing the Boolean matrix product
can be extended straightforwardly to matrices with entries from a field:
apply the algorithm from Theorem 3.1 and compute k inner products us-
ing kn additional operations. That said, the algorithm from Theorem 3.2
can only be extended to semirings as it does not capture cancellation of
terms. The questions that arise are whether we can extend the size esti-
mator from Lemma 3.7 to algebraic structures that allow cancellation of
terms and whether this framework can be used to improve over state of
the art algorithms for sparse matrix multiplication, e.g. [DJ18, Pag12].

Chapter 4
Compressed Sparse Matrix Multiplication
In this chapter we study the problem of sparse matrix multiplication
in the Word RAM model and in the External Memory model when
a fast matrix multiplication subroutine à la Coppersmith and Wino-
grad [CW87] is used. We investigate how combining the combinatorial
framework of Yuster and Zwick [YZ05] and Amossen and Pagh [AP09]
with randomized compression techniques from Jacob and Stöckel [JS15]
yields improved bounds for sparse matrix multiplication.
Let A, C ∈ Fn×n be two matrices from a field F with h = nnz(A) +
nnz(C) nonzero entries in total. For balanced product matrices, we
are able to compute all the k nonzero entries of the matrix prod-
uct AC ∈ Fn×n using O˜(hk(ω−1)/4) time and O˜(hk(ω−1)/4/M(ω−2)/4B)
I/Os for sparse output matrices and O˜(h2(ω−2)/(ω−1)k1/(ω−1)) time and
O˜(h2(ω−2)/ω−1k1/ω−1/M(ω−2)/2(ω−1)B) I/Os for dense output matrices,
where ω is the current exponent of fast matrix multiplication [LG14]. By
exploiting Coppersmith and Winograd’s rectangular fast matrix multi-
plication algorithm [Cop97] as subroutine, we provide a second algo-
rithm that incurs in O˜(h0.690k0.604) operations where the analysis is in-
dependent on the sparseness of the input/output matrices.
For unbalanced product matrices, we are able to compute all
the k nonzero elements of the output matrix AC ∈ Fn×n us-
ing O˜(hk(ω−1)/(ω+1)) time and O˜(hk(ω−1)/(ω+1)/M(ω−2)/2(ω+1)B) I/Os
for sparse output matrices and O˜(h2(ω−1)/(ω+1)k2/(ω+1)) time and
O˜(h2(ω−1)/ω+1k2/ω+1/M(ω−2)/2(ω+1)B) I/Os for dense output matrices.
The guarantees for computing the matrix product AC follow from
the analysis of [JS15] and are given in terms of the size of the matrices
and state a probability of at least 1− 1/n to compute the matrix product.
64 Chapter 4. Compressed Sparse Matrix Multiplication
4.1 Introduction
The technique exploited for deriving new algorithms for sparse matrix
multiplication since [YZ05] was to partition the matrices into submatri-
ces, apply fast matrix multiplication on dense parts and compute the
remaining products using direct matrix multiplication algorithms.
The choice of fast matrix multiplication as a subroutine is straight-
forward: it provides the best asymptotic upper bound for multiplying
matrices. Nevertheless, when it comes to real life applications, fast ma-
trix multiplication does not perform well in practice as it only provides
an advantage for matrices so large that cannot be processed by modern
hardware [Rob05]. Due to the level of complexity, to the best of our
knowledge, there are no implementation of fast matrix multiplication.
The choice of algorithms based on fast matrix multiplication seems
therefore to be unreasonable. Nevertheless, algorithm designers use fast
matrix multiplication as a black box and, theoretically, any algorithm
for multiplying matrices can be substitute as a subroutine obtaining al-
gorithms that may perform well in practice yet yield presumably sub-
optimal bounds. For instance, the algorithms presented in this chapter,
as well as in [YZ05, AP09], can be redesigned to exploit Strassen’s algo-
rithm therefore obtaining a simpler yet suboptimal solution. In this re-
gard, the algorithms presented in this chapter provide more of a frame-
work for sparse matrix multiplication that can be adapted to any matrix
multiplication subroutine.
4.2 Contributions
We study the problem of sparse matrix multiplication in the Word RAM
model and in the External Memory model. We present an algorithm for
multiplying matrices A ∈ Fn×n and C ∈ Fn×n, with entries from a field
F and h = nnz(A) + nnz(C) nonzero entries which can be phrased in
the following theorem.
Theorem 4.1. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. Furthermore, assume that the
product AC ∈ Fn×n has k nonzero entries and Θ(k/n) nonzeros per row and
per column. It is possible to compute all the k nonzero entries of AC using
(i) O˜(hk(ω−1)/4) time for h ≥ k(ω+1)/4 and O˜(hk(ω−1)/4/M(ω−2)/4B)
I/Os for h ≥ k(ω+1)/4/M(ω−2)/4;
4.2. Contributions 65
(ii) O˜(h2(ω−2)/(ω−1)k1/(ω−1)) time for h < k(ω+1)/4 and O˜(h2(ω−2)/(ω−1)
k1/(ω−1)/M(ω−2)/2(ω−1)B) I/Os, for h < k(ω+1)/4/M(ω−2)/4.
For the ease of exposition, we omit in the bounds the terms h and
k which are implicit. With the current exponent of fast matrix multi-
plication, i.e. ω = 2.372 from [LG14], the bounds of Theorem 4.1 cor-
respond to O˜(hk0.343) operations for h ≥ k0.843 and O˜(hk0.343/BM0.093)
I/Os for h ≥ k0.843/M0.093 and O˜(h0.542k0.728) operations for h < k0.843
and O˜(h0.542k0.728/BM0.186) I/Os for h < k0.843/M0.093.
The algorithm from Theorem 4.1 makes use of fast matrix multipli-
cation invoked on square submatrices. By exploiting rectangular fast
matrix multiplication [Cop97], we achieve the following bound.
Theorem 4.2. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. Furthermore, assume that the
product AC ∈ Fn×n has k nonzero entries and Θ(k/n) nonzeros per row and
per column. It is possible to compute all the k nonzero entries of AC using
O˜(h0.690k0.604) time.
For general product matrices, i.e. without any assumptions on bal-
anced output, we design an algorithm with worst case complexity sum-
marized in the following theorem.
Theorem 4.3. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. It is possible to compute all the
k nonzero entries of AC using
(i) O˜(hk(ω−1)/(ω+1)) time and O˜(hk(ω−1)/(ω+1)/M(ω−2)/2(ω+1)B) I/Os,
for h ≥ k;
(ii) O˜(h2(ω−1)/(ω+1)k2/(ω+1)) time and O˜(h2(ω−1)/(ω+1)k2/(ω+1)/
M(ω−2)/2(ω+1)B) I/Os, for h < k.
With the current exponent of fast matrix multiplication the
bounds of Theorem 4.3 correspond to O˜(hk0.406) operations and
O˜(hk0.406/M0.093B) I/Os for h ≥ k and O˜(h0.814k0.593) operations and
O˜(h0.814k0.593/M0.135B) I/Os for h ≤ k.
The guarantees for computing the matrix product are given in terms
of the size of the input matrices and are derived from the analysis of
[JS15]. Jacob and Stöckel guarantee an error probability of 1/n.
66 Chapter 4. Compressed Sparse Matrix Multiplication
0.0 0.2 0.4 0.6 0.8 1.0
k 1e8
0.0
0.5
1.0
1.5
2.0
2.5
3.0
h
1e6
1
1
h = k + 14
< 1
1
> 1
(a)
0.2 0.4 0.6 0.8 1.0
k 1e2
0.0
0.2
0.4
0.6
0.8
1.0
h
1e12
< 1
1
> 1
(b)
Figure 4.1: Comparison between f (h, k) = h2/3k2/3 + h0.859k0.406 and (a)
g(h, k) = hk(ω−1)/4 for h ≥ k(ω+1)/4 and g(h, k) = h2(ω−2)/(ω−1)k1/(ω−1)
for h ≤ k(ω+1)/4 and (b) g(h, k) = h0.690k0.604. Functions f (h, k)
and g(h, k) are compared by computing the ratio function R(h, k) =
f (x, k)/g(h, k) and studying where R(h, k) > 1. The contour lines define
boundaries where the function becomes smaller than one. The absence
of contours in (b) indicates that the function R(h, k) is always strictly
larger than one, hence f (h, k) > g(h, k).
4.3 Comparison with the Related work
Given two n× n matrices A and C, with h = nnz(A) + nnz(C) nonzero
entries, Yuster and Zwick [YZ05] designed a very simple algebraic al-
gorithm for sparse matrix multiplication. Their idea was to partition
the inner products into disjoint sets and perform direct matrix multi-
plication on one set and fast matrix multiplication on the other. This
simple combinatorial idea yields an improved bound for sparse matrix
multiplication, namely O˜(h0.7n1.2 + n2+o(1)) operations which becomes
O˜(h0.689n1.209 + n2+o(1)) with the current ω from [LG14].
Amossen and Pagh [AP09] extended the result of Yuster and Zwick
by considering a bivarite combinatorial problem, where they take ad-
vantage of outer products. The latter are split into two disjoint sub-
sets where one subset is computed using direct matrix multiplica-
tion while the other is computed by partitioning inner products as in
[YZ05]. This leads to an improved algorithm for sparse matrix mul-
tiplication that incurs in O˜(h2/3k2/3 + h0.862k0.408) operations (with ω
4.3. Comparison with the Related work 67
0.0 0.5 1.0 1.5 2.0 2.5
k 1e11
0.0
0.5
1.0
1.5
2.0
2.5
h
1e9
1
1
h = k
+ 1
4
M 24
< 1
1
> 1
(a)
0 1 2 3 4 5
k 1e19
0
1
2
3
4
5
h
1e19
1
1
h = k
< 1
1
> 1
(b)
Figure 4.2: Comparison between f (h, k) = (h/B)min{√k/√M, h/M}
and (a) g(h, k) = hk(ω−1)/4/M(ω−2)/4B for h ≥ k(ω+1)/4/M(ω−2)/4
and g(h, k) = h2(ω−2)/(ω−1)k1/(ω−1)/M(ω−2)/2(ω−1)B for h <
k(ω+1)/4/M(ω−2)/4 and (b) g(h, k) = hk(ω−1)/(ω+1)/M(ω−2)/2(ω+1)B for
h ≥ k and g(h, k) = h2(ω−1)/(ω+1)k2/(ω+1)/M(ω−2)/2(ω+1)B for h < k.
The contour lines define boundaries where the function becomes smaller
than one. We set the values of M = 8000 and B = 16 to simulate a real
hardware setting with 32-bit integers, a 32 KB cache and a 16 Bytes cache
line.
from [CW87]) which is equal to O˜(h2/3k2/3 + h0.859k0.406) with the cur-
rent ω = 2.372. Additionally, Amossen and Pagh provided an I/O
complexity of O˜(h√k/M1/8B) I/Os in the External Memory model.
Compared to Yuster and Zwick and Amossen and Pagh, we use
randomized compression techniques from [JS15] and obtain improved
bounds on sparse matrix multiplication. For the sparse case, namely
when h = k = O(n), [YZ05] achieves O˜(n1.898) operations, [AP09]
achieves O˜(n1.333) operations and O˜(n1.5/M0.125B) I/Os. For bal-
anced output matrices, i.e. when every row and column of AC
has Θ(k/n) nonzeros, our algorithm uses O˜(n1.270) operations and
O˜(n1.270/M0.093B) I/Os, given h ≥ k(ω+1)/4 and h ≥ k(ω+1)/4/M(ω−2)/4,
unless M is large. Using fast rectangular matrix multiplication, we
achieve O(n1.294) operations independently from the input and output
sparseness. In the general case, i.e. without assumptions on balanced
output, we achieve O˜(n1.406) which is asymptotically worse than [AP09].
68 Chapter 4. Compressed Sparse Matrix Multiplication
Nevertheless, since the exponents are dissimilar between the factors h
and k, we still achieve an improvement in a limited parameter range.
Lingas [Lin09] presented an output-sensitive, randomized algo-
rithm for computing the product of two n × n boolean matrices using
O˜(n2kω/2−1) operations, which corresponds to O(n2.186) for k = O(n).
For kω+1/4 ≤ h ≤ n2/k(3−ω)/4 and h ≤ min{n/k1−ω, k(ω+1)/4} our algo-
rithm is asymptotically faster than [Lin09]. Moreover, our algorithm is
designed for arbitrary fields and it is not restrict to Boolean matrices.
Jacob and Stöckel [JS15] presented a Monte Carlo algorithm that uses
O˜(nk(ω−1)/2) operations and, with high probability, outputs the nonzero
elements of the matrix product. In addition, they state an I/O complex-
ity of O˜(nk(ω−1)/2/Mω/2−1B) I/Os. We show that the same technique
yields improved bounds for sparse matrix multiplication when com-
bined to the combinatorial framework of [YZ05, AP09]. Namely, for
sparse input matrices and balanced output, we achieve O˜(nk(ω−1)/4)
operations and O˜(nk(ω−1)/4/M(ω−2)/4B) I/Os, a factor of k(ω−1)/4 op-
erations and k(ω−1)/4/M(ω−2)/4 I/Os faster than [JS15]. For generic out-
put matrices, when h = O(n), we improve over [JS15] by a factor of
k(ω−1)2/2(ω+1) operations and k(ω−1)2/2(ω+1)/Mω(ω−2)/2(ω+1) I/Os.
Van Gucht et al. [VG+15] presented a randomized algorithm for
multiplying two Boolean matrices in O˜(h + h√k + k) time. This result
was later improved by Dusefante [Dus18a] in the logarithmic factors,
see also Chapter 3. Compared to their results, we are asymptotically
faster than [VG+15, Dus18a] in almost every parameter range and our
algorithms are not restricted to the Boolean semiring.
In the External Memory model, Pagh and Stöckel [PS14] provided
an I/O optimal algorithm for matrix multiplication that incurs in
O˜((h/B)min{√k/√M, h/M}) I/Os. Their algorithm is designed for
semiring algebras and, therefore, does not allow cancellation of terms.
Compared to their I/O bound, we improve over [PS14] unless M is large,
see Figure 4.2 for a more thorough comparison.
4.4 Algorithms
The algorithms presented in this chapter are close in spirit with [YZ05,
AP09] and they are design around three subroutines:
1. rectangular fast matrix multiplication (RFMM), via a blocking al-
gorithm (Lemma 4.4), or actual RFMM (Lemma 4.7),
4.4. Algorithms 69
2. compressed matrix multiplication (Lemma 4.5 and Lemma 4.9),
3. matrix product estimation (Lemma 3.3 from Chapter 3).
The estimation algorithm is used solely for the unbalanced case. In
the following we present the aforementioned primitives.
Algorithms overview The intuition for our algorithms is as follows:
we filter out dense columns of A and dense rows of C which are com-
puted directly using outer products. For the balanced case, the remain-
ing rows and columns are compressed and computed using fast matrix
multiplication. In the event of an unbalanced output matrix, we require
an extra step. The rows of A and columns of C are split depending on
the number of nonzero entries they produce in the output. Dense rows
and dense columns of AC are computed with the direct matrix multipli-
cation algorithm. The resulting sparse submatrix of AC is compressed
using the data structure from [JS15] and computed using fast matrix
multiplication. The balanced case is presented in detail in Algorithm 4.1
while the unbalanced case is presented in Algorithm 4.2.
4.4.1 Balanced Output Matrix
Fast matrix multiplication algorithms are commonly designed specifi-
cally for square matrices. A straightforward idea allows to apply these
algorithms to the rectangular case: simply divide the matrices into
square submatrices and perform fast matrix multiplication. Although
this idea can be considered folklore, it appeared formally in [JS15, Fact
1]. We rephrase their result in an alternate form in the following lemma.
Lemma 4.4. Let ω be the current exponent for fast matrix multiplication on
two square n× n matrices. Consider two matrices of size m× n and n× p. Fast
rectangular matrix multiplication has f (m, n, p) = O(mω−2np + mnω−2p +
mnpω−2) time complexity and g(m, n, p) = O( f (m, n, p)/M(ω−2)/2B) I/O
complexity.
Proof. Let µ = min{m, n, p}. Divide the matrices into submatrices of
size µ× µ. The number of subproblems is given by mnp/µ3 and each
subproblem can be solved using O(µω) operations. Without loss of
generality, assume µ = m < n < p. It holds
mnp
m3−ω
>
mnp
n3−ω
>
mnp
p3−ω
.
70 Chapter 4. Compressed Sparse Matrix Multiplication
Therefore, the term mnp · µω−3 dominates which leads to the follow-
ing complexity O(mω−2np + mnω−2p + mnpω−2). Regarding the I/O
bound, consider a blocking argument similar to [JWK81]. The number
of subproblems remains the same, i.e. mnp/µ3. The number of I/Os is
given by
mnp
µ3
·
(
µ√
M
)ω
· M
B
=
f (m, n, p)
M
ω−2
2 B
since each subproblem requires to multiply tiles of size
√
M×√M.
Jacob and Stöckel [JS15] proved that, for balanced output matrices, it
is possible to multiply matrices faster using a clever compression tech-
nique. Their key result for balanced product matrix [JS15, Theorem 2]
can be rephrased as follows.
Lemma 4.5. Let F be an arbitrary field and let A ∈ Fn×x and C ∈ Fx×n
where h = nnz(A) + nnz(C) and k = nnz(AC). Assume each row and each
column of AC has Θ(k/n) nonzero entries. Let ω be the current exponent
of fast matrix multiplication. There is a data structure that can be initialized
using f (
√
k, x,
√
k) = O˜(xω−2k + xk(ω−1)/2) time and can answer queries
for entries in AC in time O(log n).
The intuition behind [JS15] is to collapse rows of A and columns of C,
thereby obtaining a compressed version of the same problem. This com-
pression technique is used by our algorithms to obtain improved bounds
for sparse matrix multiplication. Therefore, the probabilistic nature of
our algorithms rely solely on the the Monte Carlo data structure from
[JS15]. The failure probability is derived by Jacob and Stöckel’s analysis
and state a probability of 1− 1/n of computing the matrix product.
Theorem 4.1. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. Furthermore, assume that the
product AC ∈ Fn×n has k nonzero entries and Θ(k/n) nonzeros per row and
per column. It is possible to compute all the k nonzero entries of AC using
(i) O˜(hk(ω−1)/4) time for h ≥ k(ω+1)/4 and O˜(hk(ω−1)/4/M(ω−2)/4B)
I/Os for h ≥ k(ω+1)/4/M(ω−2)/4;
(ii) O˜(h2(ω−2)/(ω−1)k1/(ω−1)) time for h < k(ω+1)/4 and O˜(h2(ω−2)/(ω−1)
k1/(ω−1)/M(ω−2)/2(ω−1)B) I/Os, for h < k(ω+1)/4/M(ω−2)/4.
4.4. Algorithms 71
Algorithm SBMM(A, C)
Input A, C ∈ Fn×n
Output AC ∈ Fn×n
(1) Let A′ = {A∗,j : nnz(A∗,j) ≥ τ} and C′ = {Ci,∗ : nnz(Ci,∗) ≥ τ}.
(2) Let A¯′ = {A∗,j : nnz(A∗,j) < τ} and C¯′ = {Ci,∗ : nnz(Ci,∗) < τ}.
(3) Compute A′C′ using the data structure of Lemma 4.5.
(4) Compute directly the matrix product A¯′C¯′ through outer products.
(5) Combine the matrix products A′C′ and A¯′C¯′.
Algorithm 4.1: Sparse Balanced Matrix Multiplication.
Proof. Steps (1), (2) and (5) from Algorithm 4.1 require linear time in
terms of the size of the input/output matrices. Step (4) requires O(hτ)
time and O˜(hτ/B) I/Os, see [AP09, Lemma 3.3]. In addition, the output
matrix A′′ has size n × h/τ and the matrix C′′ has size h/τ × n, by a
counting argument. By hypothesis, the product matrix A′′C′′ of step (3)
has Θ(k/n) nonzeros per row and per column. Therefore, we can apply
the data structure of Lemma 4.5. Note that, in the settings of Lemma
4.5, x = h/τ. The time required by the algorithm is
f (
√
k, h/τ,
√
k) + hτ. (4.1)
We want to find τ such that it minimizes the function from Formula
(4.1). With respect to τ, the function f (
√
k, h/τ,
√
k) is monotoni-
cally decreasing while the function hτ is monotonically increasing. It
follows that the function from Formula (4.1) is minimized whenever
f (
√
k, h/τ,
√
k) = hτ. By Lemma 4.4 it holds
f (
√
k, h/τ,
√
k) + hτ =
hk
ω−1
2
τ
+
hω−2k
τω−2
+ hτ. (4.2)
In order to minimize the complexity for the first term in (4.2) we set
(hk(ω−1)/2)/τ = hτ and we solve for τ,
hk
ω−1
2
τ
= hτ
k
ω−1
4 = τ,
72 Chapter 4. Compressed Sparse Matrix Multiplication
which yields the bound hk
ω−1
4 . Similarly, to minimize the complexity in
the second term of (4.2) we set k(h/τ)ω−2 = hτ and we solve for τ,
hω−2k
τω−2
= hτ
hω−3k = τω−1
h
ω−3
ω−1 k
1
ω−1 = τ.
which gives a bound equal to h2(ω−2)/(ω−1)k1/(ω−1). Combining the
aforementioned bounds it holds
hk
ω−1
4 + h
2(ω−2)
ω−1 k
1
ω−1 . (4.3)
We now proceed to study Formula (4.3) in order to obtain a function
µ(h, k) such that, for positive values of µ the first term dominates while
for negative values the second term in (4.3) dominates the complexity.
hk
ω−1
4 ≥ h 2(ω−2)ω−1 k 1ω−1
h
3−ω
ω−1 ≥ k
4−(ω−1)2
4(ω−1)
h3−ω ≥ k 3−ω
2+2ω
4
h−(ω−3) ≥ k−(ω−3)(ω+1)4
h ≥ kω+14
If h ≥ k(ω+1)/4, then the first term in Formula (4.3) dominates which
yields an O˜(hk(ω−1)/4) algorithm and the result from (i). If h < k(ω+1)/4,
then the second term in Formula (4.3) dominates which gives a worst
case complexity of O˜(h2(ω−2)/(ω−1)k1/(ω−1)) and the result from (ii). Re-
garding the I/O bounds, by Lemma 4.4, Formula (4.2) becomes
f (
√
k, h/τ,
√
k)
M
ω−2
2 B
+
hτ
B
=
hk
ω−1
2
τM
ω−2
2 B
+
hω−2k
τω−2Mω−22 B
+
hτ
B
.
By the same argument, we obtain τ = k(ω−1)/4/M(ω−2)/4 which yields
the bound from the case (i). Similarly, for the case (ii), by substituting
τ = h(ω−3)/(ω−1)k1/(ω−1)/M(ω−2)/2(ω−1) it holds the related complexity.
The function for studying the dominant terms is the following µ(h, k) =
h− k(ω+1)/4/M(ω−2)/4 and it is derived by a similar analysis.
4.4. Algorithms 73
Amossen and Pagh [AP09] proved that, if the exponent of matrix
multiplication is ω = 2 + o(1), then there is an algorithm for matrix
multiplication with worst case complexity of O˜(h2/3k2/3) operations.
Assuming ω = 2 + o(1), our algorithm improves over [AP09] as stated
in the following corollary.
Corollary 4.6. If ω = 2+ o(1) then Algorithm 4.1 has worst case complexity
of O˜(hk1/4), for h ≥ k3/4, and worst case complexity of O˜(h + k), for h ≤
k3/4.
As a consequence and considering h = k = O(n), Algorithm 4.1
requires O˜(n1.25) operations for dense output matrices (h ≥ k3/4) and
O˜(n) operations for sparse output matrices (h ≤ k3/4).
The upper bounds derived in Theorem 4.1 are based on a blocking
algorithm that solves rectangular matrix multiplication by dividing the
matrices into square submatrices and performs fast (square) matrix mul-
tiplication on them, see Lemma 4.4. Alternatively, one may apply fast
rectangular matrix multiplication on the input matrices.
Let ω(r, s, t) be the minimal exponent ω for which f (nr, ns, nt) =
O(nω+o(1)). Accordingly, ω(1, r, 1) is the exponent of rectangular matrix
multiplication. Following [YZ05], let α = max{0 ≤ r ≤ 1 : ω(1, r, 1) =
2} and β = (ω − 2)/(1 − α). Coppersmith [Cop97] proved that α >
0.294. Hence, let
ω(1, r, 1) ≤
{
2 if 0 ≤ r ≤ α,
2+ β(r− α) otherwise.
With ω = 2.372 and α = 0.294, we have β = 0.526.
Starting from ω(1, r, 1), Huang and Pan [HP98] proved the following
result about fast rectangular matrix multiplication.
Lemma 4.7 (Huang and Pan [HP98]). Consider two matrices of size m× n
and n×m. Fast matrix multiplication has
f (m, n, m) = m2−αβ+o(1)nβ + m2+o(1)
running time.
Given the result from Lemma 4.7, we can modify Algorithm 4.1 to ex-
ploit actual rectangular matrix multiplication. The result is summarized
in the following theorem.
74 Chapter 4. Compressed Sparse Matrix Multiplication
Theorem 4.2. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. Furthermore, assume that the
product AC ∈ Fn×n has k nonzero entries and Θ(k/n) nonzeros per row and
per column. It is possible to compute all the k nonzero entries of AC using
O˜(h0.690k0.604) time.
Proof. In Theorem 4.1 we proved that the time required by Algorithm
4.1 is f (
√
k, h/τ,
√
k) + hτ. By Lemma 4.7, we have
f (
√
k, h/τ,
√
k) + hτ = k
2−αβ
2
(
h
τ
)β
+ k + hτ (4.4)
In order to minimize the complexity we set f (
√
k, h/τ,
√
k) = hτ and
we solve for τ.
f (
√
k, h/τ,
√
k) = hτ
k
2−αβ
2
(
h
τ
)β
= hτ
k
2−αβ
2 hβ−1 = τβ+1
k
2−αβ
2(β+1) h
β−1
β+1 = τ
which simplifies to τ = k0.604/h0.310 with the current values of α and β.
By substituting τ in Formula (4.4) we have the claim.
Corollary 4.8. If ω = 2+ o(1) then Algorithm 4.1 has complexity O˜(√hk).
4.4.2 Unbalanced Output Matrix
Algorithm 4.1 relies on the assumption of a balance output. Without
knowledge on the number of nonzero entries in the rows and columns
of AC, we cannot apply Lemma 4.5. In order to overcome this issue, we
need to extrapolate information on the structure of the output matrix.
Pagh and Stöckel [PS14] proved that is possible to compute a 1± ε
approximation of the number of nonzero entries in the rows and in
the columns of the product matrix using quasilinear time. The result,
proposed in Chapter 3 Lemma 3.3, sates that we can compute estimates
k˜1, . . . , k˜n using O(ε−3h log(n) log n) time and O˜(ε−3h/B) I/Os. The
estimates satisfy the following condition (1 − ε) nnz((AC)i,∗) ≤ k˜i ≤
(1+ ε) nnz((AC)i,∗), for all i ∈ [n], with probability 1− 1/n.
4.4. Algorithms 75
Algorithm SMM(A, C)
Input A, C ∈ Fn×n
Output AC ∈ Fn×n
(1) Let A′ = {A∗,j : nnz(A∗,j) ≥ τ} and C′ = {Ci,∗ : nnz(Ci,∗) ≥ τ}.
(2) Let A¯′ = {A∗,j : nnz(A∗,j) < τ} and C¯′ = {Ci,∗ : nnz(Ci,∗) < τ}.
(3) Compute directly the matrix product A¯′C¯′ through outer products.
(4) Compute estimates k˜T1 , . . . , k˜
T
n for the number of nonzero entries
in the rows of (A′C′) and estimates k˜1, . . . , k˜n for the number of
nonzero entries in the columns of (A′C′).
(5) Let A′′ = {A′i,∗ : k˜Ti ≥ k/τ} and C′′ = {C′∗,j : k˜ j ≥ k/τ}.
(6) Let A¯′′ = {A′i,∗ : k˜Ti < k/τ} and C¯′′ = {C′∗,j : k˜ j < k/τ}.
(7) Compute directly the matrix product A′′C′′.
(8) Compute A¯′′C¯′′ using the data structure of Lemma 4.9.
(9) Combine the matrix products A¯′C¯′, A′′C′′ and A¯′′C¯′′.
Algorithm 4.2: Sparse Matrix Multiplication.
The primitive from [PS14] is used by Algorithm 4.2 to detect dense
rows and columns in the output. The latter are computed directly. Con-
versely, sparse rows and columns are compressed using a randomized
data structure by Jacob and Stöckel [JS15]. We propose their result in
an alternate form, which follows directly from [JS15, Theorem 2] and
Lemma 4.5.
Lemma 4.9. Let F be an arbitrary field and let A ∈ Fn×x and C ∈ Fx×n where
h = nnz(A) + nnz(C) and k = nnz(AC). Assume each row and each column
of AC has at most δ nonzero entries. Let ω be the current exponent of fast
matrix multiplication. There is a data structure that can be initialized using
f (δ, x, δ) = O˜(xω−2δ2 + xδω−1) time and can answer queries for entries in
AC in time O(log n).
The compression data structure of Lemma 4.9, together with the es-
timate algorithm from [AP09], give us all the necessary tools to prove
our last main result.
76 Chapter 4. Compressed Sparse Matrix Multiplication
Theorem 4.3. Let A, C ∈ Fn×n be sparse matrices with elements from a field
F and h = nnz(A) + nnz(C) nonzero entries. It is possible to compute all the
k nonzero entries of AC using
(i) O˜(hk(ω−1)/(ω+1)) time and O˜(hk(ω−1)/(ω+1)/M(ω−2)/2(ω+1)B) I/Os,
for h ≥ k;
(ii) O˜(h2(ω−1)/(ω+1)k2/(ω+1)) time and O˜(h2(ω−1)/(ω+1)k2/(ω+1)/
M(ω−2)/2(ω+1)B) I/Os, for h < k.
Proof. Steps (1), (2), (5), (6) and (9) from Algorithm 4.2 require linear
time in terms of the size of the input/output matrices. In the proof of
Theorem 4.1 we showed that computing the outer products of step (3)
requires O(hτ) time and O˜(hτ/B) I/Os. By Lemma 3.3, computing esti-
mates for the number of nonzero entries in the rows and columns of AC
in step (4) requires O˜(h) operations and O˜(h/B) I/Os. The matrices A′′
and C′′ collect rows of A and columns of C that generate more than k/τ
nonzero entries in rows and columns of AC. By a counting argument,
A′′ and C′′ have at most τ rows and columns respectively. If not, there
would be more than k nonzero entries in AC. Hence, step (7) requires
O˜(hτ) operations and O˜(hτ/B) I/Os. By construction, the output sub-
matrix A¯′′C¯′′ have at most k/τ nonzero entries per row and per column.
Therefore, we can apply the data structure of Lemma 4.9. Recall that A¯′′
has h/τ columns and C¯′′ has h/τ rows. Summarizing, the complexity
of Algorithm 4.2 is
f (k/τ, h/τ, k/τ) + hτ (4.5)
Following the proof of Theorem 4.1, the function from Formula 4.5 is
minimized when f (k/τ, h/τ, k/τ) = hτ. By Lemma 4.4 it holds
f (k/τ, h/τ, k/τ) + hτ =
hkω−1
τω
+
hω−2k2
τω
+ hτ (4.6)
For the firs term, we set hkω−1/τω = hτ and we solve for τ.
hkω−1
τω
= hτ
kω−1 = τω+1
k
ω−1
ω+1 = τ
4.5. Conclusions 77
Similarly, we set hω−2k2/τω = hτ and we solve for τ.
hω−2k2
τω
= hτ
hω−3k2 = τω+1
h
ω−3
ω+1 k
2
ω+1 = τ
In contrast to the proof of Theorem 4.1, where we needed to find a
function µ(h, k) to discern which term dominates, the function is sim-
ply µ(h, k) = h − k, since the minimum in Formula (4.5) is given by
min{h/τ, k/τ} = min{h, k}. Accordingly, the time complexity follows
by substituting the values of τ in Formula (4.5). Regarding the I/O com-
plexity, the bounds follow from Lemma 4.4 and by the same argument
from the proof of Theorem 4.1.
Theorem 4.3 generates a compressed matrix of size k/τ × k/τ. Such
compression is consistent only if kτ < n, namely if k0.594 < n or
k0.406/h0.813 < n. Observe that, although we are introducing an addi-
tional constraint in Theorem 4.3, compression techniques are only com-
patible with sparse matrices, as for
√
k = n no compression can be
performed. We conclude our analysis with the following corollary that
partially matches the result from [AP09].
Corollary 4.10. If ω = 2+ o(1) then Algorithm 4.2 has worst case complexity
of O˜(hk1/3), for h ≥ k, and worst case complexity of O˜(h2/3k2/3), for h ≤ k.
4.5 Conclusions
In this chapter we considered the problem of sparse matrix multipli-
cation in the Word RAM model and in the External Memory model.
We showed how to combine the combinatorial framework of Yuster and
Zwick [YZ05] and Amossen and Pagh [AP09], with randomized com-
pression techniques from Jacob and Stöckel [JS15] to obtain improved
bounds for sparse matrix multiplication. The result is a collection of al-
gorithms that exploit fast matrix multiplication kernels in order to mul-
tiply sparse matrices asymptotically faster in the balanced case and in
the general case.
Motivated by the studies of [JS15] on balanced output matrices, we
showed that, if the product matrix exhibits a balanced structure, then we
78 Chapter 4. Compressed Sparse Matrix Multiplication
are able to multiply sparse matrices asymptotically faster compared to
the state of the art and to the unbalanced case. Unfortunately, we were
not able to replicate the same time bounds for unbalanced output matri-
ces. Consequently, we leave open the problem of designing algorithms
for the unbalanced case that achieve the same bound of the balanced
one. In relation to this, the bound derived in the proof of Theorem 4.3
is not sharp as the compression is only down to k/τ × k/τ which, for
τ <
√
k, is suboptimal. Alternatively, one could think of proving lower
bounds either on the combinatorial framework of [YZ05, AP09] or on the
compression technique of [JS15]. As a matter of fact, the marginal im-
provements gained by our randomized algorithms over [AP09] suggest
little hope for significant further improvements in this direction.
The algorithms have been designed and analyzed in the External
Memory model. An interesting future direction is to extend the analysis
to the Parallel External Memory model and the Cache Oblivious model.
The absence of parallel algorithms for fast matrix multiplication leave
little room for the distribution step. That is, each processor has to be
responsible for fast-multiply batches of submatrices. For blocked-based
rectangular fast matrix multiplication, this seems doable. For actual rect-
angular fast matrix multiplication à la Coppersmith [Cop97] this seems
to require major efforts. On the oblivious front, in the Cache Oblivi-
ous model, one would need to find partitioning techniques, similar to
Lemma 4.4, oblivious to memory and block size.
Algorithms that exploit fast matrix multiplication as a kernel are able
to outperform most of the (if not every) present matrix multiplication
algorithms. The impracticality of fast matrix multiplication makes these
algorithms of theoretical interest only. The question we leave open is
whether we can design algorithms matching the bounds of, e.g., [AP09,
Dus18b] that do not rely on fast matrix multiplication primitives.
Chapter 5
Permuting in Parallel External Memory
In this chapter we investigate empirically the problem of permuting in
a concurrent cache aware model. We present sorting-based algorithms
for permuting elements in the Parallel External Memory model.
The result is a collection of I/O optimal algorithms that per-
mute records in Θ(min{n/P, (n/PB) logd(n/B)}) I/Os with d =
max{2, min{M/B, n/PB}}. Furthermore, we present efficient imple-
mentations of the aforementioned algorithms and we study how optimal
concurrent out of memory permutation algorithms perform on modern
parallel architectures.
We conduct extensive experiments for main-memory permutations
where the algorithms are profiled according to several metrics, such as
running times, cache and TLB misses. Finally, we test our algorithms
against a state of the art numerical library for matrix permutation.
5.1 Introduction
Permuting is one of the simplest problems in algorithm theory, yet fre-
quently used, e.g permutation matrices. The task of permuting consists
of arranging n input records according to a permutation function. The
problem is trivially solved with Θ(n/P) operations in PRAM, where P is
the number of processors, and therefore it is not of algorithmic interest.
Nevertheless, when this problem is analyzed in the External Mem-
ory Model and, specifically, in the Parallel External Memory model, it
becomes more challenging, [AGNS08, Guo11, Gre12, BF03]. Despite the
simplicity of the task and its trivial complexity in PRAM, permuting in
the Parallel External Memory model becomes nearly as hard as sorting.
80 Chapter 5. Permuting in Parallel External Memory
0 0 0 0 1
0 0 0 1 0
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0


a1
a2
a3
a4
a5


a5
a4
a1
a3
a2

=
(a)
1
2
3
4
5
1
2
3
4
5
pi
(b)
Figure 5.1: Permutation matrix Π (a) and its functional representation
(b) as a map pi : [n] → [n] which can be stored concisely as a vector
(3, 5, 4, 2, 1).
The problem of permuting can be defined as follows: given n records
a = (a1, . . . , an) and a permutation function pi : [n] → [n], compute
c = (c1, . . . , cn) such that ai = cj if and only if pi(i) = j, for i, j ∈ [n].
The problem of permuting can also be defined from a linear algebra
perspective. In this regard, consider a = (a1, . . . , an) as above. The
permutation is now defined as a n× n matrix Π. The task now becomes
computing the sparse matrix vector product Πa = c such that ai =
cj if and only if 〈Πi,∗, a〉 = j, for i, j ∈ [n]. Intuitively, the problems
are equivalent. Given a permutation matrix Π, it is always possible to
represent Π concisely as a function pi : [n] → [n] and vice versa. Such
function is generated as follows: ΠTi,j = 1 if and only if pi(i) = j. Observe
that it is sufficient to store the j values only.
For the ease of exposition, we consider only the case where 2 ≤
M/B ≤ n/PB. This allows us to use interchangeably the variables q =
M/B and d = max{2, min{M/B, n/PB}}.
5.2 Contributions
Sorting-based algorithms are proven to be optimal in external memory
models, under certain circumstances, for several problems, e.g. per-
muting or matrix computations [AV88, BBF+10, Gre12]. However, in
parallel architectures, sorting-based techniques require thread synchro-
nization in order to prevent race conditions and exploit high levels of
parallelism. This may lead to severe performance reductions that can
even cancel the benefits of optimal I/O algorithms.
5.3. Comparison with the Related Work 81
In this chapter we provide deterministic data structures for thread
synchronization applied to the problem of permuting. We present a first
data structure that consists of a multi-layer table computed via tracing
the positions the elements are assigned to, when a sorting-based algo-
rithm is executed on the input records.
Similarly, we provide a second data structure that consist of a multi-
layer table initialized with memory locations that identify conflict free
regions. These regions are then exploited by a sorting-based algorithm
to permute the input records. The data structure exploits a multi-level
approach where each layer of the table induces a finer-grained partition-
ing of the elements into contiguous memory locations.
Finally, starting from the two previous ideas, we provide a hybrid
data structure that combines the benefits of traces and thread synchro-
nization tables. By using these data structures, we are able to overcome
synchronization issues between processors, to exploit even further the
parallelism provided by the architecture and to take advantage of I/O
optimal, sorting-based algorithms.
Empirically, we address the question whether optimal algorithms in
theoretical models, e.g. the Parallel External Memory model, exhibit
high performance on modern parallel architectures. To do so, we pro-
vide C++ code that exploits memory management directives and algo-
rithm engineering techniques in order to reduce performance issues
such as false sharing and concurrent memory access. The implemen-
tation is then profiled using performance counters where we trace cache
and TLB misses as well as running times. In order to highlight the ef-
ficiency of our implementation, we test our algorithms against a state
of the art library for linear algebra that provides directives for matrix
manipulation.
5.3 Comparison with the Related Work
Aggarwal and Vitter [AV88] studied the problem of permuting in the
External Memory model. Specifically, they provided tight upper and
lower bounds for the number of I/Os for several sorting-related prob-
lems including sorting, permutation networks, permuting and matrix
transposition. For permuting in external memory, they proved an I/O
bound of Θ(min{n, (n/B) logM/B(n/B)}) using a counting argument.
Nodine and Vitter [NV93] proposed a single disk, multiple processor
model. They provide a multiprocessor generalization of the I/O model
82 Chapter 5. Permuting in Parallel External Memory
in which each of the processors controls one disk and has an internal
memory of size M/P. In this model they presented a load balancing
strategy applied to the problem of external sorting with multiple disks
and parallel processors. Vitter and Shriver [VS94], described the Parallel
Disk model, a single processor, multiple disk model, where D blocks can
simultaneously be read or written from/to D different disks. They pro-
vide upper and lower bounds for permuting, sorting and Fast Fourier
Transform in the Parallel Disk model.
Brodal and Fagerberg [BF03] showed that permuting is not possible
cache-obliviously using Θ(min{n, (n/B) logM/B(n/B)}), not even un-
der the tall-cache assumption, i.e M ≥ B1+e. Intuitively, the minimum
min{n, (n/B) logM/B(n/B)} cannot be decided without knowledge of
M and B in order to apply either a sorting-based algorithm or perform
the direct approach. Bender et al. [BFGK05] extended the cache oblivi-
ous model of Frigo et al. to a parallel setting consisting of P processors.
Arge et. al [AGNS08] studied parallel algorithms for private-cache
chip multiprocessors (CMPs). They introduced the Parallel External
Memory (PEM) model, and they provided lower bounds for several
fundamental problems such as sorting and permuting. Bender et
al. [BBF+10] provided optimal algorithm for sparse matrix vector multi-
plication in the I/O model when different layouts are considered.
Guo [Guo11] performed empirical as well as theoretical studies on
permuting in external memory. He proposed algorithms exploiting two
different techniques, namely funnels and buckets. In addition, Guo’s
analysis covers upper bounds and experiments on several case stud-
ies. Greiner [Gre12], in his PhD dissertation, analyzed I/O algorithms
in external memory. His effort is specifically focused on matrix mul-
tiplication in the Parallel External Memory model, however, his work
covers lower bounds for permuting in Parallel External Memory and
bit-matrix-multiply/complement permutations.
5.4 Algorithms
The direct algorithm for permuting, given n records and a permutation
function, rearranges elements by directly applying the permutation to
the input entries. As a matter of fact, it does not produce intermediate
results and its trivial implementation can be summarized in the follow-
ing formula:
c(pi(i))← a(i), for i ∈ [n]. (5.1)
5.4. Algorithms 83
a1 ai · · · a2 an · · · · · · a3 aj · · ·
a1 a2 a3 · · · · · · ai · · · · · · aj · · · · · · an
B1
T (1) T (i) . . .
B2
T (2) T (n) . . .
BM
B
T (3) T (j) . . .
Figure 5.2: A distribution step of a sorting-based algorithm, where the
elements are partitioned in M/B buckets. The Trace data structure
stores the intermediate positions, i.e. the trace, induced by the partition-
ing algorithm.
Despite its simplicity, the direct algorithm for permuting requires
O(n/PB) I/Os in the best case and O(n/P) both in the average and
worst case. The proof for the average case follows in the next section.
The worst case occurs when the records are arranged in a way such that,
to write each element ai in position pi(i), it is necessary to perform an
I/O, i.e. the memory address is not mirrored in internal memory.
The Formula (5.1) can be modified as follows c(i) ← a(pi−1(i)),
where pi−1(j) = i if and only if pi(i) = j is the inverse, in order to achieve
optimal I/O behavior during write operations while loosing memory
contiguity when reading the entries. However, the same analysis and
considerations apply.
5.4.1 Permuting using Traces
Permutation algorithms that use random memory accesses do not en-
sure an optimal number of I/Os. Intuitively, two consecutive input
records can be placed in two non-contiguous memory locations in the
output that are not mirrored in internal memory.
In order to induce a more cache-efficient behavior during the
permutation process, we provide a `-layer data structure T where
every level induces a finer-grained partitioning of the records into
contiguous memory locations. More precisely, given a permutation
pi = pi(1),pi(2), . . . ,pi(n) we decompose the permutation in ` levels
producing the following data structure, namely a ` × n table, T =
T1(1), . . . , T1(n), . . . , T`(1) . . . T`(n), where Tl(il) = i denotes that, in
the l-th layer, the record in il-th position is placed in position i. It
84 Chapter 5. Permuting in Parallel External Memory
holds pi(i) = T`(i`−1) ◦ T`−1(i`−2) ◦ · · · ◦ T1(i) which can be rewritten
as pi(i) = T`(T`−1(· · · T1(i) · · · )) for every i ∈ [n].
1 for l ← 1 to `− 1 do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do . δ = n/ql−1
4 bucket← pil(i)/ql
5 mutex_lock(bucket)
6 index ← B(bucket) ← B(bucket) + 1
7 mutex_unlock(bucket)
8 Tl(i)← index
9 pil+1(index)← pil(i)
10 for i← 1 to n in parallel grouped_by P do
11 T`(i)← pi`(i)
12 return T
Preprocessing 5.1: Initialization of the Trace data structure T .
We denote with pil(i) the state of the permutation in the i-th position
at the l-th level. Equivalently, we use the following notation al(i). Pre-
processing 5.1 works as follows: for every level l ∈ [`− 1] (line 1), we
divide the n records into subproblems of size δ = n/ql−1 (lines 2-3) and
for each subproblem we partition in parallel the elements into buckets
(lines 3-9). Such partitioning is computed by diving the elements into
q = M/B buffers (line 6), according to the value of bucket (line 4). The
trace is then stored into the data structure (line 8) and the permutation
is updated (line 9). The last layer ` stores the permutation that satisfies
the original permutation (line 11). After T is computed, every layer of
the data structure is applied to the records by Permuting 5.1 (lines 1-4),
eventually arranging the elements according to the given permutation
function.
1 for l ← 1 to ` do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do . δ = n/ql−1
4 al+1(Tl(i))← al(i)
5 return c← a`+1
Permuting 5.1: Permuting using the Trace data structure T .
5.4.2 Permuting using Thread Synchronization Tables
The data structure initialization from Preprocessing 5.1 computes explic-
itly the permutation decomposition which is stored in the data structure
5.4. Algorithms 85
a1 · · · a nP +1 · · · · · · an−P+1 · · · an−1 · · · a2 · · · a nP +2 · · · · · · an−P+2 · · · an
a1 a2 · · · a nP +1 a nP +2 · · · · · · an−P+1 an−P+2 · · · an−1 an
 P1  P2  PP· · ·
B1 BMB
P1

P2

PP
· · · 
P1

P2

PP
· · ·
Figure 5.3: A distribution step of a sorting-based algorithm, where the
elements are partitioned in M/B buckets. The Sync data structure stores
the pointers to concurrent-free locations where each processor operates
exclusively.
T and then applied, by Permuting 5.1, to the input records. Alterna-
tively, in order to avoid storing the trace, we can compute intermediate
positions on the fly.
1 for l ← 1 to `− 1 do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do . δ = n/ql−1
4 p← GetProcessor()
5 bucket← pil(i)/ql
6 I(p, bucket)← I(p, bucket) + 1
7 Sl ← Sl ← ParallelPrefixSum(I)
8 for i← (b− 1) · δ+ 1 to b · δ in parallel do
9 p← GetProcessor()
10 bucket← pil(i)/ql
11 index ← Sl (p · ql + bucket)← Sl (p · ql + bucket) + 1
12 pil+1(index)← pil(i)
13 return S , pi
Preprocessing 5.2: Initialization of the Sync data structure S .
To manage concurrent memory access, it is possible to allocate, for
each bucket, a fixed number of entries where each processor will write
to. Therefore, we partition the permutation entries in P segments and
for each segment we count the number of records for each bucket.
Formally, we denote with Bj the j-th bucket, with Pi the i-th proces-
sor and with q = M/B the number of buckets. The resulting table
I = I(1, 1), . . . , I(1, q), . . . , I(P, 1), . . . , I(P, q) consists of Pq elements
where the generic entry I(i, j) denotes the number of entries the pro-
cessor Pi will write in bucket Bj. A prefix sum of I gives us Pq pointers
to as many memory locations where each processor will write regardless
race conditions. This leads to the construction of the following `-layer
86 Chapter 5. Permuting in Parallel External Memory
data structure, with l ∈ [`],
Sl = S(1, 1), . . . , S(1, q), . . . , S(P, 1), . . . , S(P, q), where S(i, j) =
i
∑
k=1
I(k, j).
Note that, the table I does not store the elements I(1, i), . . . , I(P, i) con-
tiguously. Nevertheless, we can still compute parallel prefix sums using
O(q/B + log P) I/Os [Ble90, AGNS08], see also Theorem 5.3.
1 for l ← 1 to `− 1 do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do
4 p← GetProcessor()
5 bucket← pil(i)/ql . pil from Preprocessing 5.2
6 index ← Sl(p, b · ql + bucket)
7 Sl(p, b · ql + bucket)← Sl(p, b · ql + bucket) + 1
8 al+1(index)← al(i)
9 for b← 1 to q`−1
10 for i← (b− 1) · δ+ 1 to b · δ in parallel do
11 c(pi` + (i))← a`(i)
12 return c
Permuting 5.2: Permuting using the Sync data structure S .
We use the following notation Sl(i, j), where Sl denotes the l-th level
of the data structure. We start with Preprocessing 5.2. As in Preprocess-
ing 5.1, at the l-th level, we generate ql−1 subproblems of size n/ql−1
(lines 2-3). Each subproblem is then divided in P contiguous segments
and assigned to each processor. For each element in the segment, we
compute its bucket and we increment the corresponding counter in I
(lines 5-6). Once the table I is computed, we prefix-sum over I thus
producing Sl (line 7). The permutation is then updated according to Sl
using a local copy Sl (lines 8-12). Permuting 5.2 applies the permutation
to the input exploiting the data structure S . Note that Preprocessing 5.1
and Permuting 5.2 are identical except for the locking phases that are
replaced by the data structure S and for the trace which is stored in
Preprocessing 5.1 (line 8) while it is directly applied to the input records
in Permuting 5.2 (line 8).
5.4.3 Combining the Data Structures
The Trace data structure T , obtained by decomposing the permutation
function, introduces a performance issue known as false sharing. False
5.5. Theoretical Analysis 87
1 for l ← 1 to `− 1 do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do . δ = n/ql−1
4 p← GetProcessor()
5 bucket← pil(i)/ql
6 I(p, bucket)← I(p, bucket) + 1
7 Sl ← Sl ← ParallelPrefixSum(I)
8 for i← (b− 1) · δ+ 1 to b · δ in parallel do
9 p← GetProcessor()
10 bucket← pil(i)/ql
11 Hl(i)← index ← Sl (p · ql + bucket)← Sl (p · ql + bucket) + 1
12 pil+1(index)← pil(i)
13 for i← 1 to n in parallel do
14 H`(i)← pi`(i)
15 return H
Preprocessing 5.3: Initialization of the Hybrid data structure H.
sharing occurs on symmetric multiprocessing systems, where each pro-
cessor has a local cache, when threads on different processors modify
variables that reside on the same cache line.
Conversely, the thread synchronization table S does not cause such
issue as it induces rigid concurrent-free locations where threads operate
exclusively. Nevertheless, the data structure S requires more compu-
tational effort, compared to data structure T , as intermediate positions
need to be computed during the permutation process.
We provide a data structureH obtained by combining the ideas from
the previous algorithms meeting the best of both worlds. The data struc-
ture H is a `-layer data structure obtained by decomposing the permu-
tation function using the trace defined by the data structure S .
1 for l ← 1 to ` do
2 for b← 1 to ql−1
3 for i← (b− 1) · δ+ 1 to b · δ in parallel do
4 al+1(Hl(i))← al(i)
5 return c← a`+1
Permuting 5.3: Permuting using the Hybrid data structure H.
5.5 Theoretical Analysis
In this section we study the I/O complexity of algorithms of Section
5.4. We prove that Permuting 5.1, Permuting 5.2 and Permuting 5.3
are optimal in the PEM model. Conversely, we prove that, even in the
average case, the direct algorithm for permuting is not I/O optimal.
88 Chapter 5. Permuting in Parallel External Memory
Lemma 5.1. Given n elements to permute with Direct using a random gen-
erated permutation, the expected number of blocks required to perform k write
operations is Ω(k), with 1 ≤ k ≤ n/B.
Proof. Let pi be a permutation chosen uniformly at random from the n!
feasible permutations. Define an indicator random variable
Xi =
{
1 if block i is required by at least one of the k operations,
0 otherwise.
The probability that a block i among the n/B total blocks is not used
during the k operations is given by (1− B/n)k. It follows that the prob-
ability that the block i is required is 1− (1− B/n)k. Let X = ∑ki=1 Xi
be the number of blocks required to perform k write operations with
Direct, i.e. c(pi(i))← a(i). The expected number of blocks required for
k operations is given by
E[X] = E
[ n/B
∑
i=1
Xi
]
=
n/B
∑
i=1
E[Xi] =
n
B
(
1−
(
1− B
n
)k)
.
Given (1− x/n)n = 1/ex, it follows,
E[X] =
n
B
(
1−
(
1− B
n
)k)
=
n
B
(
1−
(
1− B
n
) kn
n
)
=
n
B
(
1− 1
e
Bk
n
)
.
Since 1/ex ≥ 1− x, then
E[X] =
n
B
(
1− 1
e
Bk
n
)
≥ n
B
(
Bk
n
)
= k.
Hence, E[X] = Ω(k).
In the following, we study the I/O complexity of data structure ini-
tialization and permutation process of the algorithms of Section 5.4. As
a consequence, we prove that permuting elements with the data struc-
tures of Section 5.4 induces an optimal number of I/Os.
Theorem 5.2. Let P ≤ n/B, after initializing the Trace data structure with
Preprocessing 5.1 usingO((n/PB) logd(n/B)) I/Os, Permuting 5.1 permutes
n records using O((n/PB) logd(n/B)) I/Os and using O(n logd(n/B))
space for the data structure, with d = max{2, min{M/B, n/PB}}.
5.5. Theoretical Analysis 89
Proof. Let q be the distribution degree in each of the ` levels. Partitioning
the n elements through the q buckets requires at most n/PB I/Os. Let
f = f (n, M, B, P) be the number of I/Os for parameters n, M, B and P.
If n/B ≤ 1 then f = 1. If n/B > 1 then we generate q subproblems of
size n/qB. In general, ql subproblems of size n/qlB are generated, with
l ∈ [`]. Since n/q`B = 1 then ` = logq(n/B). Hence, we derive
f (n, M, B, P) ≤
`
∑
l=1
O
( n
PB
)
.
It yields f = O((n/PB) logq(n/B)). If n/P < (n/PB) logq(n/B)
then we can permute using the direct algorithm. Hence, it holds
f = O(min{n/P, (n/PB) logq(n/B)}).
We can sharpen our analysis by studying how the number of buck-
ets affect the complexity. If M/B > n/PB then the number of buck-
ets is greater than the number of I/Os required to scan the input.
Hence, we consider d′ = min{M/B, n/PB}. If d′ < 2 then we are
considering only one bucket which means that no partitioning is per-
formed on any of the ` layers. Therefore, we assume to always have at
least two buckets available. Combining everything together we have
` = logd(n/B), where d = max{2, min{M/B, n/PB}}. It follows
f = O(min{n/P, (n/PB) logd(n/B)}). The I/O complexity for Prepro-
cessing 5.1 and Permuting 5.1 follows. Space complexity immediately
follows by observing that the data structure requires O(n) space for
each layer.
Theorem 5.3. Let 2 ≤ P ≤ n/B, after initializing the Sync
data structure with Preprocessing 5.2 using O((n/PB) logd(n/B) +
n/B + log P logd(n/B)) I/Os, Permuting 5.2 permutes n records using
O((n/PB) logd(n/B)) I/Os and using O(Pn/B) space for the data struc-
ture, with d = max{2, min{M/B, n/PB}}.
Proof. We divide the n entries between the P processors. Following
the proof of Theorem 5.2, the distribution degree is given by d =
max{2, min{M/B, n/PB}}. At the l-th level, every processors computes
at most Pdl pointers corresponding to table I using O(n/PB) I/Os. In
order to initialize Sl we have to calculate a prefix sums of Pdl total
elements, namely Sl(i, j) = ∑nk=1 I(k, j), ∀j ∈ [dl]. Since P > dl/B,
we need an extra log P term for down-sweep and up-sweep operations
[Ble90, AGNS08] on blocks of size B. Therefore, O(dl/B + log P) I/Os
90 Chapter 5. Permuting in Parallel External Memory
are required. Let f = f (n, M, B, P) be the number of I/Os, for parame-
ters n, M, B and P, then
f (n, M, B, P) =
`−1
∑
l=1
( n
PB
+
dl
B
+ log P
)
≤ n
PB
logd
n
B
+
(d` − 1)
d− 1 + log P logd
n
B
≤ n
PB
logd
n
B
+
n
B
+ log P logd
n
B
where the first inequality hold since ` = logd(n/B) and we used
∑k−1i=1 x
i = (xk − 1)/x− 1. Regarding Permuting 5.2, the bound follows
from the proof of Theorem 5.2. Space complexity follows by observing
that, for each bucket, we store P pointers and there are dl buckets at the
l-th level. Hence,
`−1
∑
l=1
Pdl = P
`−1
∑
l=1
dl =
P(d` − 1)
d− 1 ≤ Pd
` =
Pn
B
since we assumed ` = logd(n/B).
Theorem 5.4. Let 2 ≤ P ≤ n/B, after initializing the Hybrid
data structure with Preprocessing 5.3 using O((n/PB) logd(n/B) +
n/B + log P logd(n/B)) I/Os, Permuting 5.3 permutes n records using
O((n/PB) logd(n/B)) I/Os and using O(n logd(n/B)) space for the data
structure, with d = max{2, min{M/B, n/PB}}.
Proof. The proof follows from Theorem 5.2 and Theorem 5.3.
Theorem 5.5 (Greiner [Gre12, Theorem 2.7]). The average and worst case
number of parallel I/Os required to permute n records in the PEM model
with P ≤ n/B processors is Ω(min{n/P, (n/PB) logd(n/B)}) where d =
max{2, min{M/B, n/PB}}.
The proof of Theorem 5.5 relies on a counting argument through
a time forward analysis. Briefly, given a family of programs that can
generate every permutation of n records, there are at least n! different
configurations reached by the programs. The bound on the number
of I/Os derives by considering the number of configurations reached
by programs with ` I/Os [Gre12]. We conclude the section with the
following result.
Corollary 5.6. Permuting 5.1, Permuting 5.2 and Permuting 5.3 are optimal
in the Parallel External Memory model in terms of parallel I/Os.
5.6. Experiments 91
5.6 Experiments
We conduct extensive experiments using the algorithms of Section 5.4.
We identify the following experimental objectives that are addressed
through empirical evaluation.
Q1: How to specify the parameters to the algorithms? Can we exploit theo-
retical insights in order to tune our algorithms and achieve optimal
performance in practice?
Q2: Is there a correlation between algorithms optimal in theory and optimal in
practice? Can we achieve a significant speedup by designing exter-
nal memory algorithms? Are external memory algorithms efficient
in practice or is their nature merely theoretical?
Q3: How do data structures perform against state of the art permutation al-
gorithms? Can our data structures outperform the state of the art
numerical algorithms for matrix manipulations?
Q4: How do data structures perform when executed on different architectures?
We address the question whether different architectures yield dif-
ferent experimental results.
The questions we want to answer are strictly correlated. A theoret-
ical model that reflects the structure of the physical architecture allows
to tune the algorithms optimally which, in turn, correspond to high per-
formance in practice.
5.6.1 Experimental setup
The algorithms of Section 5.4 are tested on architectures listed in Ta-
ble 5.1. We implement the algorithms using C++11 and compile using
g++5.1.1 with optimization flags -O3 and -march=native. For Broadwell-
EP we compile using g++5.4.0. Moreover, we use OpenMP 4.5, an API for
multi-platform shared-memory parallel programming, to instruct our
code with parallel directives.
Experiments make use of statistical data. Every data point is the re-
sult of the median among k repetitions. We set 11 ≤ k ≤ 51 according to
the size of the problem. In order to profile the code we use PAPI 5.2.0.0
while we validate the experimental results using Cachegrind from Val-
grind 3.10.0. PAPI, Performance API, is a standard API for accessing
92 Chapter 5. Permuting in Parallel External Memory
hardware performance counters while Cachegrind is a cache simulator.
We use PAPI thread initialization together with thread inheritance in or-
der to enable thread support and to collect, through hardware counters,
cumulative cache misses deriving from multiple threads. The events
we consider during profiling are PAPI_Lx_TCM for Lx cache misses and
PAPI_TLB_DM for data TLB misses. PAPI_L1_TCM accounts for all de-
mand requests and any code read to L2 cache. PAPI_L2_TCM accounts
for each request originating from the core to reference a cache line in
the L3 cache. PAPI_L3_TCM counts each cache miss condition for refer-
ences to the last level cache. All the above events do not account for
cache line fills due to prefetching. Lastly, PAPI_TLB_DM provides sup-
port fo data TLB load and store misses. To collect running times we use
std::chrono::high_resolution_clock from the standard library.
We use the LMbench3 [MS12, MS+96] benchmark suite to compute
memory latencies of the entire memory hierarchy. We permute ran-
dom vectors of n 32-bit integers, where n ∈ [107, 109], i.e. we consider
main memory computations only. We generate random permutation us-
ing the Knuth/Fisher-Yates algorithm to avert biased distribution in the
permutation function. We set P equal to the number of physical cores
and we bind threads using OMP_PROC_BIND=true, thus disabling Hyper-
Threading. This setup correspond to a hybrid model where we consider
a PEM model with P physical cores, L1 and L2 as a hierarchical private
memory and L3 and main memory as a hierarchical external memory
shared by the processors which identifies an External Memory model.
The primary objective of the experiments is to evaluate optimal I/O
algorithms in modern parallel architectures. To achieve this, we mea-
sure time, cache and TLB misses stemming only from the permutation
phase and not from the data structure initialization. It has been observed
[WD10, WTM13] that hardware counters produce nondeterministic re-
sults. Therefore, in order to collect reliable and meaningful metrics, we
instruct only small parts of our code with performance measurement
directives.
5.6.2 Experimental Evaluation
How to specify the parameters to the algorithms? The process of specify-
ing the parameters for the data structures is crucial in order to achieve
optimal performance. Memory size M, cache line B and the number
5.6. Experiments 93
Table 5.1: The architectures where the experiments are performed.
Code Broadwell-EP Haswell-EP Haswell
Arch Xeon® E5-2690 v4 Xeon® E5-1650 v3 Core™ i7-4790
CPU 2×14, 2.6 - 3.5 GHz† 6, 3.5 - 3.8 GHz‡ 4, 3.6 - 4.0 GHz‡
Line 64 B (L1/L2/L3) 64 B (L1/L2/L3) 64 B (L1/L2/L3)
L1 32 KB – 8-way 32 KB – 8-way 32 KB – 8-way
L2 256 KB – 8-way 256 KB – 8-way 256 KB – 8-way
L3 35000 KB – 20-way 15360 KB – 20-way 8192 KB – 16-way
RAM 528 GB 65 GB 32 GB
OS Xenial 16.04.2 LTS CentOS 3.10.0.x86_64
† Two sockets with private L3 cache.
‡ Single socket.
of processors are not user dependent. However, although the number
of cores is fixed, the number of threads is user-dependent and Hyper-
Threading allow us to assign multiple threads per core. We design an
experiment where we fix the problem size while we select the number
of threads in an arbitrary interval, e.g. [2, 120] for Haswell-EP. The re-
sults for this experiment report a local minimum when P is equal to
the number of physical cores while the performance degrades when-
ever 2 ≤ P < 6, as the architecture is not fully exploited, and when
6 < P ≤ 120, as increasing the number of threads causes cache trashing,
i.e. data eviction from the cache, in the levels of the hierarchy. This is
in line with our expectations as permuting is an I/O-bounded problem,
rather than CPU-bounded problem. As a result, we set P equal to the
number of physical cores.
We now focus our tuning on the number of buckets and the number
of levels `. In terms of buckets, the theory suggests to set q = M/B,
such that the internal memory can host at least a cache line from each
bucket. In our experimental setup, using 32-bit integers and consider-
ing L2 as internal memory, M = 65536 and B = 16 for the machines
listed in Table 5.1. It follows q = 4096. The theory suggests to set
` = logq(n/B). That is, in line with our experimental setup, ` ∈ {1, 2},
since logq(n/B) ∈ [1.60, 2.16] for n ∈ [107, 109]. Although this analysis is
optimal for the number of the parallel I/Os between L2 and L3, it does
not consider the cache misses generated between L3 and main mem-
ory which account for the most expensive operations for main memory
computations [Lev09]. To do so, we consider q = min{M/B, M3/PB},
94 Chapter 5. Permuting in Parallel External Memory
0.0
0.5
1.0
1.5
2.0
2.5
1e+02 1e+03 2e+03 3e+03 4e+03 5e+03 6e+03 7e+03 8e+03 9e+03 1e+04
L
1
/n
q
n=1e+08
n=2e+08
n=3e+08
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1e+02 1e+03 2e+03 3e+03 4e+03 5e+03 6e+03 7e+03 8e+03 9e+03 1e+04
L
2
/n
q
n=1e+08
n=2e+08
n=3e+08
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1e+02 1e+03 2e+03 3e+03 4e+03 5e+03 6e+03 7e+03 8e+03 9e+03 1e+04
L
3/
n
q
n=1e+08
n=2e+08
n=3e+08
0.0e+00
5.0e-10
1.0e-09
1.5e-09
2.0e-09
1e+02 1e+03 2e+03 3e+03 4e+03 5e+03 6e+03 7e+03 8e+03 9e+03 1e+04
T
im
e
(s
)/
n
q
n=1e+08
n=2e+08
n=3e+08
Figure 5.4: Permuting using the Hybrid data structure on Haswell-EP
with fixed problem size n ∈ {1e+08, 2e+08, 3e+08} and q ∈ [102, 104].
where Mi specifies the number of elements in the Li cache. Every choice
of q in the following interval [M2/B, M3/PB] will increase the number
of L2 cache misses while leaving the number of L3 cache misses unal-
tered. In other words, instead of increasing the number of layers when-
ever logq(n/B) increases, which will unavoidably increase the number
of L3 cache misses, we consider ` = min{logq(n/B), logq(n/B)}, where
q = M3/PB. According to our experimental setup and to the above anal-
ysis, we set q = 4096 and ` = 1. Note that, permuting using an `-layer
data structure generates `+ 1 different configurations, with ` = 0 cor-
responding to permuting using Direct. We do not include results for
` = 2 since we do not record any improvement when our data structures
exploit more than one layer.
We now address the question whether the choice of q obtained via a
theoretical analysis can be confirmed by the experiments. The distribu-
tion degree is as follows: q = min{M2/B, M3/PB} where M2 and M3
are the size of L2 and L3 cache respectively. This analysis provides more
of an upper bound rather than an approximate result of q. Therefore we
5.6. Experiments 95
design an experiment where we fix the problem size, i.e. n, while q
ranges in a fixed subset of values, e.g. [102, 104]. We choose Permuting
5.3 for this task since it is not affected by false sharing, as Permuting
5.1, and it minimizes the computational effort, in contrast with Permut-
ing 5.2. The results of such experiments are depicted in Figure 5.4. By
comparing the best values for running times and cache misses we notice
that, although our choice of q is not optimal, the advantage of using
optimal values of q is negligible and in general, an exhaustive search is
the only approach to ensure the optimality of the distribution degree q.
Is there a correlation between algorithms optimal in theory, namely in PEM,
and optimal in practice? Once the parameters are set for the data struc-
tures, we address the question whether there is an actual improvement
when using data structures to permute. To do so, we compare Permut-
ing 5.1, Permuting 5.2 and Permuting 5.3 with Direct.
1 for i← 1 to n in parallel do
2 c(pi(i))← a(i)
Permuting 5.4: Permuting n records usingDirect.
The choice of Direct as the main competitor for our benchmarks is
no coincidence. The memory access pattern induced by Direct on a
random generated permutation can be considered as a generic cache-
inefficient algorithm on a random input. For instance, sorting is a spe-
cial case of permuting where the permutation induces an ordering of
the elements. Hence, every sorting algorithm that is not optimized for
I/O efficiency is lower bounded by Direct. This makes permuting an
appropriate candidate for studying algorithms for data manipulation.
The experiments show that data structures ensure a reduction of
cache misses for L2 and L3 caches. We do not report information on
L1 cache misses as we are never able to outperform over L1 misses. In
general, cache hits in the lower levels of the memory hierarchy account
for the most expensive operations, namely 40 to 100 cycles for L3 on
Intel® Xeon® processors [Lev09]. Hence, optimization in such levels
accounts for the most effective. In order to measure the efficiency of our
algorithms we rely on the following function to compute improvements
factors
g(x, y) = n
√
n
∏
i=1
xi
yi
96 Chapter 5. Permuting in Parallel External Memory
0.0
0.5
1.0
1.5
2.0
2.5
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
L
2
/n
n
Direct
Trace
Sync
Hybrid
0.0e+00
5.0e-10
1.0e-09
1.5e-09
2.0e-09
2.5e-09
3.0e-09
3.5e-09
4.0e-09
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
T
im
e
(s
)/
n
n
Direct
Trace
Sync
Hybrid
Figure 5.5: Permuting using the Trace, Sync and Hybrid data structures
compared to Direct on Haswell-EP with a random generated permuta-
tion, P = 6 and q = 4096. Note the effect of false sharing on Trace that
induces an increase in the number of L2 cache misses.
where x = {x1, . . . , xn}, y = {y1, . . . , yn} are the data points we want
to compare pairwise. Given n different problem sizes, the function g
computes the geometric mean over single improvement factors. The re-
sults are shown in Table 5.2. Experiments show improvement factors of
more than 3.00 on the number of L3 cache misses for Haswell-EP and
from 1.38 to 2.31 on running times for Permuting 5.2 and Permuting 5.3
respectively. The improvement over running times is even more promi-
nent for Haswell, see also Figure 5.6, where the factors span from 2.72 of
Permuting 5.2 to 4.06 of Permuting 5.3. Despite the high level of paral-
lelism, we record slight improvement factors for Broadwell-EP with 1.54
and 1.86 for Permuting 5.1 and Permuting 5.2 respectively and 2.93 for
Permuting 5.3. Such improvements are even more significant when the
L3 cache stores only prefetched data. In addition, experiments show that
a reduction on cache misses for Permuting 5.2, compared to Permuting
5.1, does not correspond to a reduction on running times. Although this
may seem counterintuitive, Permuting 5.2 requires more computational
effort which affects running times. The attempt to relate cache miss
reduction with running time improvements, in order to strengthen the
experimental results, was unsuccessful. As Levinthal [Lev09] reports,
L3 cache hits have different latencies depending on whether cache lines
are unshared, shared or modified by different cores thereby making any
reverse analysis not reliable. A downside of algorithms from Section 5.4
is certainly data structures initialization which accounts for most of the
computational process. In order to amortize the preprocessing time, we
5.6. Experiments 97
Table 5.2: Performance improvement factors for permuting using Per-
muting 5.1, Permuting 5.2 and Permuting 5.3 compared to Direct for
running times and cache misses using the geometric mean as an estima-
tor for the average improvement.
Trace Sync Hybrid
Time L2 L3 Time L2 L3 Time L2 L3
Haswell 3.31 0.69 3.03 2.72 0.90 3.03 4.06 1.14 3.20
Haswell-EP 1.64 0.67 3.05 1.38 0.90 3.05 2.31 1.16 3.31
Broadwell-EP 1.54 0.65 1.30 1.86 1.05 2.92 2.93 1.09 2.92
acknowledge that applications of such data structures are restricted to
scenarios where the permutation has to be applied repeatedly. We em-
phasize that all the experiments do not consider preprocessing time but
only the mere permutation process.
How do data structures perform against state of the art permutation algo-
rithms? In order to compare data structures of Section 5.4, we would like
to benchmark different linear algebra libraries that provide directives for
permutation matrices. To our knowledge, there is no C++ library that fea-
tures parallel algorithms for permutation matrices.1 Eigen3 [GJ+10] fea-
tures a PermutationMatrix with related member functions which does
not support multithreading. However, it provides operators for ma-
trix vector multiplication. We compare Eigen3 PermutationMatrix with
Permuting 5.1 setting P = 1. We do not compare Permuting 5.2 as the
data structure S acts as a thread synchronization table which would be
useless with a single thread. In addition, note that, for P = 1, T = H.
1 Eigen::PermutationMatrix<Eigen::Dynamic, Eigen::Dynamic, uint32_t> pi′ = pi
2 Eigen::VectorXuint32 a′ = A . Custom 32-bit integer type
3 Eigen::VectorXuint32 c
4 c = pi′ × a′
Permuting 5.5: Permuting n records using Eigen.
The data structure T is initialized using q = 4096, i.e. we con-
sider the L2 cache as internal memory while the L3 cache is treated
as external memory so as to preserve consistency with the experimental
1Boost [Sch11] linear algebra library uBLAS provides a data type
permutation_matrix wich does not provide operators for matrix vector multi-
plication.
98 Chapter 5. Permuting in Parallel External Memory
Table 5.3: Performance improvement factors for permuting using Hy-
brid compared to Eigen3 for running times and cache misses using the
geometric mean as an estimator for the average improvement.
Time L2 L3
Haswell 1.89 1.14 3.33
Haswell-EP 1.49 1.16 3.34
Broadwell-EP 1.40 1.10 2.88
setup. By analyzing Eigen3 source code, we verify that the C++ tem-
plate function used by Eigen3 to permute is indeed the direct algorithm
for permuting using std::swap as routine to permute entries. There-
fore, we claim that a possible multithreading implementation of Eigen3
PermutationMatrix would not outperform our algorithms. For the sake
of completeness, we stress that our implementations are not focused on
single thread computations and we claim that experimental results can
be improved even further for serial setups.
How do data structures perform when executed on different architectures?
The data structures are architecture dependent. Hence, we expect to
find dissimilarities in the experimental results when we permute us-
ing algorithms of Section 5.4 on machines listed in Table 5.1. Unfortu-
nately, the difference in architecture between Haswell and Haswell-EP
is not clean-cut. Conversely, Broadwell-EP is dissimilar to the other ar-
chitectures as it features a two socket CPU with private L3 cache and
the level of parallelism is higher compared to Haswell and Haswell-
EP. The machines provide a three level memory hierarchy with iden-
tical cache sizes for L1 and L2 while Haswell has a L3 cache half the
size of Haswell-EP. Broadwell-EP has a L3 cache of 35MB divided be-
tween private to each sockets. However, q remains unchanged as the
minimum min{M2/B, M3/PB} is still determined by the size of the L2
cache. Conversely, the number of processors changes and consequently
the size of data structure S . Figure 5.6 shows the execution of Permut-
ing 5.1, Permuting 5.2 and Permuting 5.3 on Haswell. By comparing
the experiments from Haswell and Haswell-EP, we identify more simi-
larities than dissimilarities. The difference on the number of processors
reduces the size of data structure S , however, the improvement is at
most of 8 · 10−4 cache misses per element. The size of the L3 cache
influences the execution only partially. Read and write operations in
L3 are performed sequentially, therefore the number of cache misses is
5.7. Empirical Validation 99
not dependent on M. However, using a larger L3 cache slightly im-
proves upon the number of L2 and L3 cache misses for prefetched data.
The improvement factors between L1, L2 and L3 match for the two ar-
chitectures, even for the serial setup. Conversely, running times differ
and Haswell is roughly a factor of 2.00 faster than Haswell-EP when
both architectures are compared to Direct. Since there is no improve-
ment in cache misses, as highlighted by the improvement factors, and
since permuting is not a CPU bounded problem we conjecture that such
factor is due to memory latency. To investigate this issue we use LM-
bench3 benchmark suite to compute memory latencies from L1 to main
memory, see Figure 5.6. LMbench3 lat_mem_rd benchmark measures
memory read latencies by traversing an array of fixed size with steps
the size of the stride. By fixing the stride to 64 bytes, i.e. the size of
the cache line, we ensure that every read operation is performed on a
different cache line. Figure 5.6 shows the result of LMbench3 on both
architectures and clearly demonstrates that Haswell main memory is
slower than Haswell-EP while the contrary holds for L2 and L3 caches.
As a results algorithms that rely on random accesses are penalized by
high memory latencies and the improvement is more prominent when
the difference in latencies is sharper. On the contrary, algorithms that
rely on locality of memory benefit from fast caches. Nevertheless, such
benchmark measures only “clean read” latencies while, in our experi-
ments, writes are responsible for the majority of cache misses. When
the difference in architectures is more consistent, i.e. Haswell/Haswell-
EP and Broadwell-EP, dissimilarities are more evident. As highlighted in
Section 5.4, Permuting 5.1 is affected by false sharing. This performance
issue is evident for Haswell and Haswell-EP from the L2 counts while it
is even sharper for Broadwell-EP where false sharing affect even the L3
counts as the L3 cache is private to each socket. As a consequence, de-
spite the high level of parallelism, Broadwell-EP scores the lowest results
for running times, L2 and L3 cache misses for Permuting 5.1.
5.7 Empirical Validation
The experiments make use of statistical measurements in order to reduce
the noise deriving from the background activity of the operative system.
The median of k repetitions gives an accurate estimate of both running
times and cache misses. However, we detected significant oscillations in
some experiments. Such oscillations derive from several aspects. The
100 Chapter 5. Permuting in Parallel External Memory
0.0e+00
2.0e-09
4.0e-09
6.0e-09
8.0e-09
1.0e-08
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
T
im
e
(s
)/
n
n
Direct
Trace
Sync
Hybrid
(a)
0
10
20
30
40
50
60
70
80
90
100
110
120
1KB 10KB 100KB 1MB 10MB 100MB 1GB
T
im
e
(n
s)
Array Size
Broadwell Haswell-EP Haswell
L
1
B
/
H
/
H
-E
P
L
2
B
/
H
/
H
-E
P
L
3
H
L
3
H
-E
P
L
3
B
(b)
Figure 5.6: (a) Permuting using the Trace, Sync and Hybrid data struc-
tures compared to Direct on Haswell with P = 4 and q = 4096. (b)
Latency benchmark obtained with LMbench3 using an array of 1GB and
a stride of 64 bytes. The boundary lines indicate L1, L2 and L3 mem-
ory sizes for each architecture. As LMbench benchmark shows, Haswell
has the fastest L2 and L3 caches which translates to the most significant
improvements.
technology developed in order to improve cached data can influence
data measurements. For instance, prefetchers are commonly used in
caches and they can alter the number of cache misses as they are invoked
in kernel-space instead of being user-space operations. The latter are
generally the only activity accounted by profilers.
Cache replacement strategies affect I/O complexity only by a con-
stant factor. Nevertheless, when it comes to profiling, cache replacement
strategies can alter the results of the experiments. Replacement policies
are often unknown, however, some Intel architectures have been reverse
engineered. As a result, the Haswell architecture uses adaptive cache re-
placement policies which only behave as pseudo-LRU in some situations
[GMM16]. Pseudo-LRU refers to an optimized LRU strategy where the
age of the cache lines is approximated rather than computed exactly.
We acknowledge that the tools we used to produce the results suf-
fer from precision issues and the measurements are not guaranteed to
be exact throughout the whole execution. In addition, as Weaver et
al. [WD10, WTM13] suggest, for several performance events, hardware
counters cannot produce expected and deterministic results. We can
overcome hardware counters and exploit cache simulators in order to
profile the execution. Nevertheless, despite the precision and the re-
5.7. Empirical Validation 101
Table 5.4: Number of cache misses per element when permuting n
records using a random generated permutation and the identity permu-
tation using the direct algorithm on Haswell-EP. We set P = 6 and we
use Cachegrind to profile the execution. Cache misses are listed divided
by read and write operations for L1, L2 and L3.
n Direct (Random)
L1W L2W L3W L1R L2R L3R
106 0.993 0.946 0.062 0.125 0.125 0
107 0.999 0.994 0.680 0.125 0.125 0.125
108 0.999 0.999 0.967 0.125 0.125 0.125
109 0.999 0.999 0.996 0.125 0.125 0.125
n Direct (Identity)
L1W L2W L3W L1R L2R L3R
106 0.062 0.062 0.062 0.125 0.125 0
107 0.062 0.062 0.062 0.125 0.125 0.125
108 0.062 0.062 0.062 0.125 0.125 0.125
109 0.062 0.062 0.062 0.125 0.125 0.125
peatability of the results obtained using simulators we acknowledge that
they do not reflect a true execution on the real hardware. For instance,
Cachegrind schedules threads differently from the native configuration.
It accounts neither for kernel activity nor for virtual-to-physical address
mappings nor for cache misses arising from TLB misses. Regarding
replacement strategies, Cachegrind simulates LRU and, as our exper-
iments confirm, it can closely approximate Haswell replacement algo-
rithm. Despite the limitations of Cachegrind, we use cache simulators
in order to validate our results up to a certain accuracy.
5.7.1 Cache Misses Validation
Given n 32-bit integers and a permutation function expressed as n 32-bit
integers, the direct algorithm reads sequentially the permutation func-
tion and the input records using n/B I/Os each. According to Table
5.1, B = 16, hence, we predict 2/16 = 0.125 I/Os per element for L1,
L2 and L3, see Table 5.4. This analysis is identical for both random and
identity permutations as the permutation function does not affect the
number of reading operations. On the other hand, writing is affected
102 Chapter 5. Permuting in Parallel External Memory
by the permutation map and the number of writings can be predicted
only for the identity while it can be approximated for the random per-
mutation. Permuting with the identity permutation induces sequential
writes. Hence, as above, we predict 1/16 = 0.0625 write misses per
element. Regarding random permutations, on average, the direct algo-
rithm for permuting induces one write miss per element which matches
with writing operations of Table 5.4, as long as the problem size satu-
rates the caches. As expected, the number of write operation is settled
for L1, as it can store approximately 8 · 103 32-bit integers. Regarding L3,
note that, on Haswell-EP, the cache can store less than 4 · 106 integers.
Consequently, for 106 records, the number of L3 writes corresponds to
scanning while it reaches approximately one write miss per element for
n ≥ 108. By comparing Figure 5.7 and Figure 5.8 with Table 5.4, we can
see that the data matches for the random permutation, while we identify
a significant gap between PAPI and Cachegrind for the identity permu-
tation. The explanation relies on the hardware prefetcher. Prefetchers
work by predicting certain memory patterns and preloading cache lines
in memory in order to reduce memory latencies. In the case of the
identity permutation, the direct algorithm exploits prefetched data as it
access the input in a streaming fashion. On the other hand, random per-
mutations induce random memory patterns and there is no guarantee
that prefetched data will be used by the algorithm.
We can apply the same analysis to Permuting 5.1, Permuting 5.2 and
Permuting 5.3, see Table 5.5.2 Unfortunately, as the complexity of the
algorithms increases it becomes more difficult to predict the number of
cache misses. Permuting 5.1 works by reading sequentially the input
records and the data structure T and writing sequentially in the output
vector. Since ` = 1 and B = 16, we predict 4/16 = 0.250 read misses and
2/16 = 0.125 write misses per element. Similarly, Permuting 5.2 works
by reading sequentially the input records, the permutation function and
the data structure S and writing sequentially in the output vector. Write
misses account for 2/16 = 0.125 as above, for ` = 1 and B = 16. Con-
versely, we would expect to have 6/16 = 0.375 read misses per elements
since we are considering more memory locations. However, according to
our experimental setup, S stores only 2.4 · 104 pointers which accounts
for 1.5 · 10−4 additional cache misses per element for n = 107. This argu-
ment apply only for L2 and L3 since S fits in memory while it does not
2Note that, from a theoretical standpoint, Permuting 5.1 and Permuting 5.3 are
identical. Hence from now on we only consider Permuting 5.1.
5.7. Empirical Validation 103
Table 5.5: Number of cache misses per element when permuting n
records with a random generated permutation using Permuting 5.1 and
Permuting 5.2 on Haswell-EP. We set P = 6, ` = 1 and q = 4096 and we
use Cachegrind to profile the execution. Cache misses are listed divided
by read and write operations for L1, L2 and L3.
n Trace, Hybrid
L1W L2W L3W L1R L2R L3R
106 0.958 0.354 0.125 0.250 0.250 0.170
107 0.958 0.360 0.127 0.250 0.250 0.249
108 1.631 0.365 0.125 0.250 0.250 0.249
109 1.867 1.092 0.125 0.250 0.250 0.250
n Sync
L1W L2W L3W L1R L2R L3R
106 0.994 0.387 0.125 0.543 0.251 0.067
107 0.994 0.391 0.127 0.542 0.250 0.249
108 1.667 0.398 0.125 0.542 0.250 0.250
109 1.904 1.121 0.125 0.540 0.250 0.250
apply for L1 which can store only 8 · 103 elements. In general, our pre-
dictions are accurate for reading operations on all levels of the memory
hierarchy. On the other hand, writing operations are predictable only
for L3 cache while they can be approximated for the higher levels of the
memory hierarchy. According to Table 5.5, write operations account for
one cache miss per element for L1 cache as the problem size increases.
This has to be expected since we are not optimizing for the higher levels
of the hierarchy.
The data collected through cache simulators has proven to be pre-
dictable for certain operations and caches. However, we already em-
phasize how cache simulators do not always reflect a true execution
on the real hardware. We address the issue whether data collected via
hardware counters is somehow comparable to Cachegrind data. Figure
5.7 and Figure 5.8 show the comparison between PAPI and Cachegrind
when profiling the exact same algorithms. The figures reveal clear dis-
crepancies between the profilers and confirm the limitation of cache sim-
ulators. Starting from L1 cache misses, Figure 5.8, we identify a dissimi-
larity between graphs for n ≥ 5 · 108 that correspond to an increment on
TLB misses. As stated, Cachegrind does not account for cache misses
104 Chapter 5. Permuting in Parallel External Memory
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
L
1
/n
n
Direct
Trace
Sync
Hybrid
Direct (CG)
Trace (CG)
Sync (CG)
Hybrid (CG)
(a)
0.0
0.1
0.2
0.3
0.4
0.5
0.6
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
T
L
B
/n
n
Direct
Trace
Sync
Hybrid
(b)
Figure 5.7: Raw data of L1 (a) and TLB (b) misses from permuting using
the Trace, Sync and Hybrid data structures compared to Direct on
Haswell-EP with a random generated permutation with parameters P =
6 and q = 4096. Each x-coordinate collects k repetitions of the same
algorithm, with k = 51 for Permuting 5.1 and Permuting 5.2 and k =
11 for the direct algorithm. The reference plots (CG) are obtained by
simulating the same algorithms using Cachegrind, see Table 5.4 and
Table 5.5. Graphs are scaled by a 1/n factor to depict information per
element.
arising from TLB misses, hence, such dissimilarity has to be expected.
Regarding L3 cache misses, Figure 5.8, we identify some inconsistencies
as well. The number of L3 cache misses is lower than what reported
by Cachegrind for n ≤ 3 · 108 and q = 4096. The explanation relies
on the hardware prefetcher that preloads in memory cache lines to im-
prove memory management. The result is a reduction on cache misses
which are not accounted by profilers as they are performed in kernel-
space. In addition, note the similarities between Haswell and Haswell-
EP. The number of L3 cache misses converges to approximately 0.375
cache misses per element for n ∼ 3 · 108 for both machines as they arise
from the L2 cache. As a matter of fact, when q = 4096 the size of the
buckets ranges between 5 · 104 and 7.5 · 104, for 2 · 108 ≤ n ≤ 3 · 108,
and the L2 cache gets filled in such interval. The result for L2 cache
misses show some interesting behavior as well. The number of cache
misses for Direct and Permuting 5.2 is approximate with high preci-
sion. However, Permuting 5.1 diverges from Cachegrind measurements
by approximately 0.5 additional cache misses per element. In addition,
the raw data collected from such experiment showed significant oscil-
5.7. Empirical Validation 105
0.0
0.5
1.0
1.5
2.0
2.5
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
L
2
/n
n
Direct
Trace
Sync
Hybrid
Direct (CG)
Trace (CG)
Sync (CG)
Hybrid (CG)
(a)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1e+07 1e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 8e+08 9e+08 1e+09
L
3
/n
n
Direct
Trace
Sync
Hyrid
Direct (CG)
Trace (CG)
Sync (CG)
Hybrid (CG)
(b)
Figure 5.8: Raw data of L2 (a) and L3 (b) cache misses from permuting
using the Trace, Sync and Hybrid data structures compared to Direct
on Haswell-EP with a random generated permutation with parameters
P = 6 and q = 4096. Each x-coordinate collects k repetitions of the
same algorithm, with k = 51 for Permuting 5.1 and Permuting 5.2 and
k = 11 for Direct. The reference plots (CG) are obtained by simulating
the same algorithms using Cachegrind, see Table 5.4 and Table 5.5.
lations for the L2 cache misses of Permuting 5.1. This behavior is can-
celed whenever we consider single thread computations and the data
overlaps with the one from Permuting 5.2. We conjecture that our data
structure is sensitive to false sharing during the process of computing
the intermediate positions of the records. Such behavior is not present
in Permuting 5.2 as the data structure S induces a strong thread syn-
chronization. Conversely, thread synchronization for Permuting 5.1 is
managed by the OpenMP directive schedule(static,segment). How-
ever, switching from the compiler optimization -O3 to a less aggressive
one, i.e. -O1, cancels the problem while it occurs again for -O2. We claim
that the set of instructions included in the -O2 optimization forces the
compiler to optimize thread scheduling thus enhancing false sharing.
The process of investigating the -O2 optimization in order to identify
the instructions that causes performance degradation in the L2 cache
is far for being trivial if not unfeasible. The -02 optimization enables
almost ninety flags, fifty more than -01. While we attempted to identify
such flag with a trial and error approach, the results were inconclusive
and we conjecture that such behavior is caused by a combination of
optimization flags. Nevertheless, we highlight that such performance
degradation affect only partially running times.
106 Chapter 5. Permuting in Parallel External Memory
5.8 Conclusions
The studies conducted on permuting have confirmed the motivation for
designing External Memory algorithms. Permuting using data struc-
tures and sorting-based algorithms ensures a reduction in the number of
I/O operations. In particular, exploiting data structures allow us to pro-
vide I/O optimal algorithms that are efficient in practice. Experiments
show that, even if the level of parallelism changes, the data structures
achieve satisfactory performances. Nevertheless, the difference in the ar-
chitectures is not clean-cut and it is not know what level of parallelism
such data structures can achieve on massively parallel systems. As a
matter of fact, our data structures are architecture dependent as they
rely on number of cores and memory size. Hence, we formalize the
problem of designing data structures that are architecture independent
as well as optimal for different levels of parallelism.
Sorting-based algorithms in parallel systems introduce synchroniza-
tion issues that are commonly solved via atomic operations. Data struc-
tures overcome this issue by exploiting index tables that grow consider-
ably as the input size increases. Data structures initialization is, indeed,
time consuming and, in order to amortize the preprocessing time, the
applications of such data structures are restricted to scenarios where the
permutation is applied repeatedly. We leave open the question whether
the trade off between time and space for process synchronization is still
well established or algorithms that permute I/O optimally on the fly
can achieve high performance in practice.
The problem of permuting is oriented entirely to memory transfers,
hence, it does not reflect the structure of standard numerical problems,
e.g. matrix computations. We formalize the question whether we can
achieve the same performance for tasks that require intensive compu-
tation, such as matrix multiplication algorithms. While the algorithms
in this chapter aim at shedding some light on external memory models
and the relation between algorithms optimal in theory and in practice,
our future work will extend the focus on architecture independent algo-
rithms for intensive computation problems.
Chapter 6
Conclusions
This chapter aims at collecting the contributions and the future research
directions highlighted in each chapter. For each research output, we
give a brief overview of the contributions and we introduce the reader
to open problems that are addressed through a detailed discussion.
Chapter 2
In Chapter 2 we presented new randomized algorithms for sparse ma-
trix multiplication. Specifically, we presented new data structures that
extend the algorithm from Williams and Yu [WY14] to the sparse input
case. Our analysis covers the Word RAM model, the Cache Oblivious
model and the Parallel External Memory model.
The algorithms presented do not exploited any structure of the in-
put/output matrices, such as number of nonzero entries per column or
balanced distribution of the nonzero elements. As addressed in Chapter
4, balanced assumptions yield improved bounds for sparse matrix mul-
tiplication. It is natural to believe that, by exploiting the anatomy of the
matrices, one may derive improved bounds for sparse matrix multipli-
cation, e.g., in the logarithmic factors.
Open Problem 1. Can we exploit the structure of the input/output matrices
in order to improve the bounds of Chapter 2?
In the Cache Oblivious model we are able to multiply matrices with
no knowledge of M and B during the computation. Conversely, Pagh
and Stöckel [PS14] provided an optimal cache aware algorithm for mul-
tiplying sparse matrices that uses blocking arguments and matrix size
108 Chapter 6. Conclusions
estimations. The limitations of the Cache Oblivious are well understood.
However, we address the issue of designing improved algorithms in
oblivious models. The latter are commonly a non trivial framework for
parallel algorithms, given the concurrency issues that arise from concur-
rent cache oblivious models. Hence, an interesting research direction is
to study more in depth parallel, cache oblivious models.
Open Problem 2. Can we obtain bounds similar to [PS14] in the Cache Obliv-
ious model? Can we extend the bounds obtained in the Parallel External Mem-
ory model to a concurrent cache oblivious model?
The algorithms presented scale efficiently to multiple processors.
The Cache Oblivious algorithm of Chapter 2 can be easily derived from
Parallel External Memory analysis by setting P = 1. Nevertheless, for
parallel, cache aware models, the bounds one wants to achieve are of the
form O˜((h/PB)min{√k/√M, h/M}) which follow straightforwardly
from [PS14].
Open Problem 3. Can we extend Pagh and Stöckel’s techniques to the Parallel
External Memory model, to obtain a bound similar to [PS14], with matching
lower bound?
Chapter 3
In Chapter 3 we presented new algorithms for sparse Boolean matrix
multiplication. Our bounds are asymptotically worse than the state of
the art. Nevertheless, our algorithms are designed with the purpose of
practical efficiency. As a matter of fact, we use only estimation algo-
rithms as subroutines.
The partitioning techniques used by our algorithms seem to be rela-
tively standard as they appear similarly in Van Gucht et al. [VG+15] and
Lingas [Lin09]. Nonetheless, Amossen et al. [AP09] exploit improved
partitioning techniques for isolating submatrices which are computed
using fast matrix multiplication. This technique does not apply directly
to our framework.
Open Problem 4. Can we use estimators differently, e.g. by detecting zeros
in highly dense submatrices, and obtain bounds similar to [AP09], e.g. O(h +
h2/3k2/3 + k)?
109
The algorithms presented in this chapter are Atlantic City, i.e. time,
I/Os and output are random variables. On the front of derandomiza-
tion, we would like to reduce the amount of randomness during the ex-
ecution by having deterministic output results. The main hurdle for this
problem is to provide deterministic algorithms for detecting nonzero
entries in highly sparse submatrices.
Open Problem 5. Can we design full Las Vegas algorithms, where only run-
ning time and number of I/Os are random variables while the output is always
correct?
The random permutation, when applied to input matrices, may not
spread the nonzero entries of the output uniformly across the matrix.
Dense submatrices with more than
√
k nonzero entries worsen our time
bounds. Our analysis guarantees either a polynomially or exponentially
decreasing probability of having highly dense submatrices. An interest-
ing future direction is to partially derandomize the permutation process
to deterministically obtain sparse submatrices.
Open Problem 6. Can we achieve a partial derandomization by designing
deterministic algorithms to partition the output matrix in submatrices with
O(1) nonzero entries?
The restriction on submatrices such that they do not contain entries
on the same row/column, allows us to prove exponentially decreas-
ing probability of having dense submatrices. In the general case, we
proved only a polynomially decreasing probability. For our polynomial
tail bounds, the chances of deviating from the expected overall time and
I/O bounds are still undesirable.
Open Problem 7. Can we prove stronger tail inequalities, e.g. exponential,
for the general case?
The framework we used for computing the Boolean matrix product
can be extended straightforwardly to matrices with entries from a field
by simply computing k inner products using kn additional operations.
Unfortunately, cancellation of terms is not supported by the estimation
algorithm from [ACP10] which is designed specifically for semiring al-
gebras.
Open Problem 8. Can we extend the size estimator from [ACP10] to allow
cancellation of terms and use it as a subroutine for sparse matrix multiplication
algorithms?
110 Chapter 6. Conclusions
Chapter 4
In Chapter 4 we investigated how exploiting compression techniques
from [JS15], combined with the combinatorial framework of [YZ05,
AP09] yields improved bounds for sparse matrix multiplication.
Motivated by the studies of [JS15] on balanced output matrices, we
showed that, exploiting the balanced structure of product matrices, yield
asymptotically better bounds compared to the state of the art and to
the unbalanced case. When no information on the output matrix is
available, we are not able to replicate the same time bounds and we
improve over the state of the art only for a specific parameter range.
Open Problem 9. Can we design algorithms for the unbalanced case that
achieve the same time bounds of the balanced one?
An alternative research direction is to prove lower bounds either on
the combinatorial framework of [YZ05, AP09] or on the compression
technique of [JS15]. The motivation for this line of work is given by
the marginal improvements gained by our randomized algorithms over
[AP09] that does not suggest significant further improvements.
Open Problem 10. Can we prove lower bounds either on the combinatorial
framework of [YZ05, AP09] or on the compression techniques of [JS15]?
The literature is lacking with parallel algorithms for fast matrix mul-
tiplication This leaves little room for the distribution step as each pro-
cessor can only fast-multiply batches of submatrices. For blocked-based
rectangular fast matrix multiplication, this seems doable. For actual rect-
angular fast matrix multiplication à la Coppersmith [Cop97] this seems
to require major efforts. Similarly, in the Cache Oblivious model, one
would need to find partitioning techniques, similar to the block-based
of Lemma 4.4, oblivious to memory and block size.
Open Problem 11. Can we extend the analysis to the Parallel External Mem-
ory model and to the Cache Oblivious model?
Algorithms that exploit fast matrix multiplication as a kernel are able
to outperform almost every present matrix multiplication algorithms.
The impracticality of fast matrix multiplication makes these algorithms
of theoretical interest only.
Open Problem 12. Can we design algorithms matching the bounds of, e.g.,
[AP09, Dus18b] that do not rely on fast matrix multiplication subroutines?
111
Chapter 5
In Chapter 5 we studied empirically the problem of permuting in the
Parallel External Memory model. We provide sorting based data struc-
tures and efficient algorithm implementation for permuting elements on
multicore architectures.
The data structures rely on parameters defined by the hardware ar-
chitecture. That is, we adapt our algorithms to memory structure and
level of parallelism to obtain high performance in the permutation pro-
cess. It is clear that, our data structures are architecture dependent.
Nevertheless, the difference in the architectures we tested is not clean-
cut and it is not know what level of parallelism such data structures can
achieve on massively parallel systems.
Open Problem 13. Can we designing data structures for the problem of per-
muting that are architecture independent as well as efficient for different levels
of parallelism?
Algorithms in parallel systems introduce synchronization issues that
are commonly solved via atomic operations. The algorithms in Chap-
ter 5 are no exception and our data structures overcome this issue by
exploiting synchronization tables that grow considerably as the input
size increases. Data structures initialization is the most time consuming
phase and, in order to amortize the preprocessing time, the applications
of such data structures are restricted to scenarios where the permutation
is applied repeatedly.
Open Problem 14. Is the trade off between time and space for process syn-
chronization still well established? Can we achieve high performance in practice
with algorithms that permute I/O optimally on the fly?
The problem of permuting is an I/O bounded problem, rather than
a CPU bounded one. As a consequence, it does not reflect the structure
of standard numerical problems, e.g. matrix computations, and we ex-
pect further improvements for sorting-based algorithms applied to more
computational demanding problems.
Open Problem 15. Can we achieve high performance by using sorting-based
data structures in more intensive computational problems, such as matrix mul-
tiplication?

Bibliography
[ACP10] Rasmus Resen Amossen, Andrea Campagna, and Rasmus
Pagh. Better Size Estimation for Sparse Matrix Products. In
Approx-Random, pages 406–419. Springer, 2010.
[ADFK70] V.L. Arlazarov, E.A. Dinic, I.A. Faradzev, and A. Kronrod.
On Economical Construction of Transitive Closure of an Ori-
ented Graph. Doklady Akademii Nauk SSSR, 194(3):487–88,
1970.
[AFLG15] Andris Ambainis, Yuval Filmus, and François Le Gall. Fast
Matrix Multiplication: Limitations of the Coppersmith-
Winograd Method. In Proceedings of the 47th annual ACM
Symposium on Theory of Computing, pages 585–593. ACM,
2015.
[AGNS08] Lars Arge, Michael T Goodrich, Michael Nelson, and Nodari
Sitchinava. Fundamental Parallel Algorithms for Private-
Cache Chip Multiprocessors. In Proceedings of the 20th An-
nual Symposium on Parallelism in Algorithms and Architectures,
pages 197–206. ACM, 2008.
[AP09] Rasmus Resen Amossen and Rasmus Pagh. Faster Join-
projects and Sparse Matrix Multiplications. In Proceedings
of the 12th International Conference on Database Theory, pages
121–126. ACM, 2009.
[ASU13] Noga Alon, Amir Shpilka, and Christopher Umans. On Sun-
flowers and Matrix Multiplication. Computational Complexity,
22(2):219–243, 2013.
114 Bibliography
[AV88] Alok Aggarwal and Jeffrey Vitter. The Input/Output Com-
plexity of Sorting and Related Problems. Communications of
the ACM, 31(9):1116–1127, 1988.
[AW14] Amir Abboud and Virginia Vassilevska Williams. Popular
Conjectures Imply Strong Lower Bounds for Dynamic Prob-
lems. In Proceedings of the 55th Annual Symposium on Founda-
tions of Computer Science, pages 434–443. IEEE, 2014.
[BBF+10] Michael A Bender, Gerth Stølting Brodal, Rolf Fagerberg,
Riko Jacob, and Elias Vicari. Optimal sparse matrix dense
vector multiplication in the I/O-model. Theory of Computing
Systems, 47(4):934–962, 2010.
[BCRL79] Dario Bini, Milvio Capovani, Francesco Romani, and Grazia
Lotti. O(n2.7799) Complexity for n × n Approximate Ma-
trix Multiplication. Information processing letters, 8(5):234–
235, 1979.
[BF02] Gerth Stølting Brodal and Rolf Fagerberg. Cache Obliv-
ious Distribution Sweeping. In International Colloquium
on Automata, Languages, and Programming, pages 426–438.
Springer, 2002.
[BF03] Gerth Stølting Brodal and Rolf Fagerberg. On the Limits of
Cache-Obliviousness. In Proceedings of the 35th Annual ACM
Symposium on Theory of Computing, pages 307–315. ACM,
2003.
[BFGK05] Michael A Bender, Jeremy T Fineman, Seth Gilbert, and
Bradley C Kuszmaul. Concurrent Cache-Oblivious B-trees.
In Proceedings of the 17th annual ACM Symposium on Paral-
lelism in Algorithms and Architectures, pages 228–237. ACM,
2005.
[BG12] Aydin Buluç and John R Gilbert. Parallel Sparse
Matrix-Matrix Multiplication and Indexing: Implementa-
tion and Experiments. SIAM Journal on Scientific Computing,
34(4):C170–C191, 2012.
[Blä99] Markus Bläser. A 52 n
2-Lower Bound for the Rank of n× n-
Matrix Multiplication over Arbitrary Fields. In Proceedings
Bibliography 115
of the 40th Annual Symposium on Foundations of Computer Sci-
ence, pages 45–50. IEEE, 1999.
[Blä01] Markus Bläser. A 52 n
2-Lower Bound for the Multiplicative
Complexity of n× n-Matrix Multiplication. In Annual Sym-
posium on Theoretical Aspects of Computer Science, pages 99–
109. Springer, 2001.
[Ble90] Guy E. Blelloch. Prefix Sums and Their Applications. Tech-
nical report, Synthesis of Parallel Algorithms, 1990.
[BW09] Nikhil Bansal and Ryan Williams. Regularity Lemmas and
Combinatorial Algorithms. In Proceedings of the 50th Annual
Symposium on Foundations of Computer Science, pages 745–
754. IEEE, 2009.
[CG86] Bernard Chazelle and Leonidas J Guibas. Fractional Cascad-
ing: I. A Data Structuring Technique. Algorithmica, 1(1):133–
162, 1986.
[Cha15] Timothy M Chan. Speeding Up the Four Russians Algo-
rithm by about One More Logarithmic Factor. In Proceed-
ings of the 26th annual ACM-SIAM symposium on Discrete al-
gorithms, pages 212–217. Society for Industrial and Applied
Mathematics, 2015.
[Cop97] Don Coppersmith. Rectangular Matrix Multiplication Re-
visited. Journal of Complexity, 13(1):42–49, 1997.
[CU03] Henry Cohn and Christopher Umans. A Group-Theoretic
Approach to Fast Matrix Multiplication. In Proceedings of the
44th Annual Symposium on Foundations of Computer Science,
pages 438–449. IEEE, 2003.
[CW82] Don Coppersmith and Shmuel Winograd. On the Asymp-
totic Complexity of Matrix Multiplication. SIAM Journal on
Computing, 11(3):472–492, 1982.
[CW87] Don Coppersmith and Shmuel Winograd. Matrix Multipli-
cation via Arithmetic Progressions. In Proceedings of the nine-
teenth annual ACM symposium on Theory of computing, pages
1–6. ACM, 1987.
116 Bibliography
[DGH15] Erik D Demaine, Vineet Gopal, and William Hasenplaugh.
Cache-Oblivious Iterated Predecessor Queries via Range
Coalescing. In Workshop on Algorithms and Data Structures,
pages 249–262. Springer, 2015.
[DJ17] Matteo Dusefante and Riko Jacob. An Empirical Evalua-
tion of Permuting in Parallel External Memory. Manuscript,
2017.
[DJ18] Matteo Dusefante and Riko Jacob. Cache Oblivious Sparse
Matrix Multiplication. In Latin American Symposium on The-
oretical Informatics, pages 437–447. Springer, 2018.
[DP09] Devdatt P. Dubhashi and Alessandro Panconesi. Concen-
tration of Measure for the Analysis of Randomized Algorithms.
Cambridge University Press, 2009.
[Dus18a] Matteo Dusefante. Atlantic City Boolean Matrix Multiplica-
tion. Manuscript, 2018.
[Dus18b] Matteo Dusefante. Compressed Sparse Matrix Multiplica-
tion. Manuscript, 2018.
[ER60] Paul Erdo˝s and Richard Rado. Intersection Theorems for
Systems of Sets. Journal of the London Mathematical Society,
1(1):85–90, 1960.
[FLPR99] Matteo Frigo, Charles E Leiserson, Harald Prokop, and Srid-
har Ramachandran. Cache-oblivious Algorithms. In Proceed-
ing of the 40th Annual Symposium on Foundations of Computer
Science, pages 285–297. IEEE, 1999.
[Fre77] Rusins Freivalds. Probabilistic Machines Can Use Less Run-
ning Time. In IFIP congress, volume 839, page 842, 1977.
[Fur70] ME Furman. Application of a method of rapid multipli-
cation of matrices to the problem of finding the transitive
closure of a graph. In Doklady Akademii Nauk, volume 194,
pages 524–524. Russian Academy of Sciences, 1970.
[FW90] Michael L Fredman and Dan E Willard. Blasting Through
the Information Theoretic Barrier with Fusion Trees. In Pro-
ceedings of the twenty-second annual ACM symposium on Theory
of Computing, pages 1–7. ACM, 1990.
Bibliography 117
[GJ+10] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3.
http://eigen.tuxfamily.org, 2010.
[GMM16] Daniel Gruss, Clémentine Maurice, and Stefan Mangard.
Rowhammer. js: A remote software-induced fault attack in
javascript. In Detection of Intrusions and Malware, and Vulner-
ability Assessment, pages 300–321. Springer, 2016.
[Gre12] Gero Greiner. Sparse Matrix Computations and their I/O Com-
plexity. PhD thesis, Dissertation, Technische Universität
München, München, 2012.
[Guo11] Yang Guo. Tuning Funnelsort for Permuting. Master’s
thesis, Fakultät für Informatik der Technischen Universität
München, 2011.
[HP98] Xiaohan Huang and Victor Y Pan. Fast Rectangular Ma-
trix Multiplication and Applications. Journal of complexity,
14(2):257–299, 1998.
[IPZ01] Russell Impagliazzo, Ramamohan Paturi, and Francis Zane.
Which Problems Have Strongly Exponential Complexity?
Journal of Computer and System Sciences, 63(4):512–530, 2001.
[IS09] Mark A Iwen and Craig V Spencer. A Note on Compressed
Sensing and the Complexity of Matrix Multiplication. Infor-
mation Processing Letters, 109(10):468–471, 2009.
[JS15] Riko Jacob and Morten Stöckel. Fast Output-Sensitive Ma-
trix Multiplication. In European Symposium on Algorithms,
pages 766–778. Springer, 2015.
[JWK81] Hong Jia-Wei and Hsiang-Tsung Kung. I/O Complexity:
The Red-Blue Pebble Game. In Proceedings of the 13th An-
nual ACM Symposium on Theory of Computing, pages 326–333.
ACM, 1981.
[Lan14] Joseph M Landsberg. New Lower Bounds for the Rank
of Matrix Multiplication. SIAM Journal on Computing,
43(1):144–149, 2014.
[Lev09] David Levinthal. Performance Analysis Guide for Intel®
Core™ i7 Processor and Intel® Xeon™ 5500 processors. Intel
Performance Analysis Guide, 30:18, 2009.
118 Bibliography
[LG14] François Le Gall. Powers of Tensors and Fast Matrix Mul-
tiplication. In Proceedings of the 39th International Symposium
on Symbolic and Algebraic Computation, pages 296–303. ACM,
2014.
[Lin09] Andrzej Lingas. A Fast Output-Sensitive Algorithm for
Boolean Matrix Multiplication. In European Symposium on
Algorithms, pages 408–419. Springer, 2009.
[LK14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford
Large Network Dataset Collection. http://snap.stanford.
edu/data, June 2014.
[LLDM09] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and
Michael W Mahoney. Community Structure in Large Net-
works: Natural Cluster Sizes and the Absence of Large Well-
Defined Clusters. Internet Mathematics, 6(1):29–123, 2009.
[LM12] Jure Leskovec and Julian J Mcauley. Learning to Discover
Social Circles in Ego Networks. In Advances in neural infor-
mation processing systems, pages 539–547, 2012.
[Mol02] Richard A Mollin. RSA and Public-Key Cryptography. CRC
Press, 2002.
[MR13] Alex Massarenti and Emanuele Raviolo. The Rank of n× n
Matrix Multiplication is at least 3n2 − 2√2n 32 − 3n. Linear
Algebra and its Applications, 438(11):4500–4509, 2013.
[MS+96] Larry W McVoy, Carl Staelin, et al. LMbench: Portable Tools
for Performance Analysis. In USENIX Annual Technical Con-
ference, pages 279–294. San Diego, CA, USA, 1996.
[MS12] Larry W McVoy and Carl Staelin. LMbench - Tools for Per-
formance Analysis, 2012.
[MU05] Michael Mitzenmacher and Eli Upfal. Probability and com-
puting: Randomized algorithms and probabilistic analysis. Cam-
bridge university press, 2005.
[Mun71] Ian Munro. Efficient determination of the transitive closure
of a directed graph. Information Processing Letters, 1(2):56–58,
1971.
Bibliography 119
[NMAB18] Yusuke Nagasaka, Satoshi Matsuoka, Ariful Azad, and Ay-
dın Buluç. High-Performance Sparse Matrix-Matrix Prod-
ucts on Intel KNL and Multicore Architectures. arXiv
preprint arXiv:1804.01698, 2018.
[NV93] Mark H Nodine and Jeffrey Scott Vitter. Deterministic Dis-
tribution Sort in Shared and Distributed Memory Multipro-
cessors. In Proceedings of the 5th Annual ACM Symposium on
Parallel Algorithms and Architectures, pages 120–129. ACM,
1993.
[OO73] Patrick O’Neil and Elizabeth O’Neil. A Fast Expected Time
Algorithm for Boolean Matrix Multiplication and Transitive
Closure. Information and Control, 22(2):132–138, 1973.
[Pag12] Rasmus Pagh. Compressed Matrix Multiplication. In Pro-
ceedings of the 3rd Innovations in Theoretical Computer Science
Conference, pages 442–451. ACM, 2012.
[Pan78] V Ya Pan. Strassen’s Algorithm Is not Optimal: Trililnear
Technique of Aggregating, Uniting and Canceling for Con-
structing Fast Algorithms for Matrix Operations. In Proceed-
ings of the 19th Annual Symposium on Foundations of Computer
Science, pages 166–176. IEEE, 1978.
[Pan81] V Ya Pan. New Combinations of Methods for the Accelera-
tion of Matrix Multiplications. Computers & Mathematics with
Applications, 7(1):73–125, 1981.
[PS14] Rasmus Pagh and Morten Stöckel. The Input/Output Com-
plexity of Sparse Matrix Multiplication. In European Sympo-
sium on Algorithms, pages 750–761. Springer, 2014.
[Raz02] Ran Raz. On the Complexity of Matrix Product. In Proceed-
ings of the 34th Annual ACM Symposium on Theory of Comput-
ing, pages 144–151. ACM, 2002.
[Rob05] Sara Robinson. Toward an Optimal Algorithm for Matrix
Multiplication. SIAM news, 38(9):1–3, 2005.
[Rom82] Francesco Romani. Some Properties of Disjoint Sums of Ten-
sors Related to Matrix Multiplication. SIAM Journal on Com-
puting, 11(2):263–267, 1982.
120 Bibliography
[Sch81] Arnold Schönhage. Partial and Total Matrix Multiplication.
SIAM Journal on Computing, 10(3):434–455, 1981.
[Sch11] Boris Schling. The Boost C++ Libraries. XML Press, 2011.
[Sei95] Raimund Seidel. On the All-Pairs-Shortest-Path Problem in
Unweighted Undirected Graphs. Journal of Computer and Sys-
tem Sciences, 51(3):400–403, 1995.
[Sha78] Michael Ian Shamos. Computational Geometry. PhD thesis,
New Haven, CT, USA, 1978. AAI7819047.
[Sto10] Andrew James Stothers. On the complexity of matrix multipli-
cation. PhD thesis, 2010.
[Str69] Volker Strassen. Gaussian Elimination is not Optimal. Nu-
merische mathematik, 13(4):354–356, 1969.
[Str86] Volker Strassen. The Asymptotic Spectrum of Tensors and
the Exponent of Matrix Multiplication. In Proceedings of the
27th Annual Symposium on Foundations of Computer Science,
pages 49–54. IEEE, 1986.
[SZ99] Avi Shoshan and Uri Zwick. All Pairs Shortest Paths in
Undirected Graphs with Integer Weights. In Proceeding of the
40th Annual Symposium on Foundations of Computer Science,
pages 605–614. IEEE, 1999.
[VD03] Richard Wilson Vuduc and James W Demmel. Automatic per-
formance tuning of sparse matrix kernels, volume 1. University
of California, Berkeley, 2003.
[VG+15] Dirk Van Gucht et al. The Communication Complexity of
Distributed Set-Joins with Applications to Matrix Multipli-
cation. In Proceedings of the 34th ACM SIGMOD-SIGACT-
SIGAI Symposium on Principles of Database Systems, pages
199–212. ACM, 2015.
[VS94] Jeffrey Scott Vitter and Elizabeth AM Shriver. Algorithms
for Parallel Memory, I: Two-Level Memories. Algorithmica,
12(2-3):110–147, 1994.
Bibliography 121
[WD10] Vince Weaver and Jack Dongarra. Can Hardware Perfor-
mance Counters Produce Expected, Deterministic Results.
In Proceedings of the 3rd Workshop on Functionality of Hardware
Performance Monitoring, 2010.
[Wil12] Virginia Vassilevska Williams. Multiplying Matrices Faster
Than Coppersmith-Winograd. In Proceedings of the 44th an-
nual ACM Symposium on Theory of Computing, pages 887–898.
ACM, 2012.
[WTM13] Vincent M Weaver, Dan Terpstra, and Shirley Moore. Non-
Determinism and Overcount on Modern Hardware Perfor-
mance Counter Implementations. In International Symposium
on Performance Analysis of Systems and Software, pages 215–
224. IEEE, 2013.
[WW10] Virginia Vassilevska Williams and Ryan Williams. Subcubic
Equivalences between Path, Matrix and Triangle Problems.
In Proceedings of the 51st Annual Symposium on Foundations of
Computer Science, pages 645–654. IEEE, 2010.
[WY14] Ryan Williams and Huacheng Yu. Finding Orthogonal Vec-
tors in Discrete Structures. In Proceedings of the 25th Annual
ACM-SIAM Symposium on Discrete Algorithms, pages 1867–
1877. SIAM, 2014.
[Yu15] Huacheng Yu. An Improved Combinatorial Algorithm for
Boolean Matrix Multiplication. In International Colloquium
on Automata, Languages, and Programming, pages 1094–1105.
Springer, 2015.
[YZ05] Raphael Yuster and Uri Zwick. Fast Sparse Matrix Multipli-
cation. ACM Transactions on Algorithms, 1(1):2–13, 2005.
[Zwi02] Uri Zwick. All Pairs Shortest Paths using Bridging Sets
and Rectangular Matrix Multiplication. Journal of the ACM,
49(3):289–317, 2002.
