Engineering Boolean Matrix Multiplication for Multiple-Accelerator
  Shared-Memory Architectures by Karppa, Matti & Kaski, Petteri
Engineering Boolean Matrix Multiplication for Multiple-Accelerator
Shared-Memory Architectures
MATTI KARPPA, Aalto University, Finland
PETTERI KASKI, Aalto University, Finland
We study the problem of multiplying two bit matrices with entries either over the Boolean algebra (0, 1, ∨, ∧) or over the binary
field (0, 1, +, ·). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator
shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared
memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product
in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds
(2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak
system power (3.5 kW).
Our contributions are (a) for the binary product, we use alternative-basis techniques of Karstadt and Schwartz [SPAA ’17] to
design novel alternative-basis variants of Strassen’s recurrence for 2 × 2 block multiplication [Numer. Math. 13 (1969)] that have been
optimized for both the number of additions and low working memory, (b) structuring the parallel block recurrences and the memory
layout for coalescent and register-localized execution on accelerator hardware, (c) low-level engineering of the innermost block
products for the specific target hardware, and (d) structuring the top-level shared-memory implementation to feed the accelerators
with data and integrate the results for input and output sizes beyond the aggregate memory capacity of the available accelerators.
CCS Concepts: • Computing methodologies→ Linear algebra algorithms; Shared memory algorithms; •Mathematics of
computing→ Mathematical software performance; • Computer systems organization→ Single instruction, multiple data.
Additional Key Words and Phrases: shared-memory, heterogeneous architecture, gpu, binary matrix multiplication, Boolean matrix
multiplication
1 INTRODUCTION
Matrix multiplication is one of the most widely deployed primitives in computing. In a combinatorial context, one often
encounters the task of multiplying Boolean matrices with entries in {0, 1}, and with the arithmetic taking place either
over the Boolean algebra (0, 1,∨,∧) or over the binary field (0, 1,+, ·). Fast algorithms for the Boolean product and the
binary product of two bit matrices underlie the fastest known algorithms for a wide range of tasks, such as transitive
closure, context-free parsing, and triangle detection [11, 12, 17, 36, 38].
Our interest in this paper is to engineer high-performance open-source implementations of Boolean and binary
matrix multiplication for shared-memory platforms connected to multiple vector-parallel accelerator devices. Such
platforms are the state of the art in terms of delivering both
(i) speed through extensive parallelism available in the accelerator devices, and
(ii) energy-efficiency for moderately large data through shared memory.1
Configurations of this type are not only powerful individual systems; indeed, shared-memory platforms with multiple
accelerator devices are individual compute nodes in larger distributed-memory systems ranging from a few compute
1For example, a representative configuration of this type is the NVIDIA DGX-1, with eight Tesla V100 SXM2 accelerator devices (in total 40960 32-bit
cores running at 1530-MHz boost clock), 512 GiB of shared DDR4-2133 memory, and 3.5-kW peak power consumption. This configuration can execute, at
peak, 40960 · 1530 · 106 · 32 = 2.01 · 1015 bit operations per second (2.01 Pbop/s), which at peak power consumption translates to 1.8 · 10−12 Joules of
energy consumed per bit operation (1.8 pJ/bop).
1
ar
X
iv
:1
90
9.
01
55
4v
1 
 [c
s.D
S]
  4
 Se
p 2
01
9
2 Matti Karppa and Petteri Kaski
nodes to current leading-edge supercomputers.2 Thus, scalability to distributed-memory systems starts from carefully
engineered shared-memory-accelerated implementations.
To our knowledge, the present paper constitutes the first study of Boolean matrix multiplication that seeks to push
the envelope towards peak hardware performance on shared-memory-accelerated platforms simultaneously in terms
of speed, energy consumption, and input sizes of up to one terabinary-bit (1048576 × 1048576 = 220 × 220 = 240 bits
= 1 Tib = 128 GiB) per operand matrix.3 For example, given two terabinary-bit square matrices as input, our present
open-source implementation computes the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for
a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop
for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW).
1.1 Engineering Challenges
Before outlining our specific contributions in more detail, let us introduce some of the challenges when engineering for
peak performance on inputs that nearly occupy the available capacity of shared memory and exceed the total memory
available on the accelerators in a platform.
First, the mathematical framework for algorithm design needs fine-grained optimization beyond coarse tools offered
by traditional asymptotic analysis. Algorithms for fast matrix multiplication are a canonical example of this phenomenon,
where the asymptotically fastest known algorithm designs [8, 10, 22, 37] hide in their asymptotic analysis very large
constants, which make these designs infeasible to use in practice; we refer to the recent survey by Pan [33] for an
extensive discussion. For bit matrices, a natural first measure for the concrete efficiency of an algorithm design is
the number of bit operations executed. For example, the elementary algorithm to multiply two n × n bit matrices
uses 2n3 − n2 bit operations, and the Strassen–Winograd algorithm [35, 41] for the binary field uses 6nlog2 7 − 5n2 bit
operations when n is a power of two. Already here one observes a leading constant that is three times the leading
constant for the elementary algorithm, which necessitates n ≥ 512 before the Strassen–Winograd algorithm uses
fewer bit operations than the elementary algorithm. The Strassen–Winograd algorithm is known to be an optimal
implementation of matrix multiplication relying on recursive 2 × 2 multiplications [5, 34], assuming one works in
the standard basis. Recently, Karstadt and Schwartz [20] introduce an alternative-basis framework that reduces the
number of bit operations of Strassen’s algorithm to 5nlog2 7 − 4n2, assuming the input and output are represented in an
alternative basis. (The transformation of an n × n bit matrix between the standard basis and the alternative basis uses
n2 log2 n bit operations.)
Second, accelerator hardware is designed to give peak performance when executing an identical instruction on each
entry of a fixed-length vector of words. Thus, all recurrences of bit operations must be designed to support vectorization
and coalesced memory accesses across words of bits. Due to pipelining and long latencies in accelerator hardware (cf. Mei
and Chu [25] and Volkov [39]), recurrences should make use of the available lowest-latency memory (registers) and
expose sufficient parallelism to enable a large number of threads simultaneously in execution compared with the number
of cores for effective latency hiding. Furthermore, the bottom layer of recursion needs to be engineered for compatibility
with the available low-level instruction set, such as vector-shuffles and custom bit operations given by a truth table.
While the word- and vector-level hardware is at the moment rapidly evolving towards integrating increasingly domain-
specific hardware units—for example, units that that perform small-size numerical matrix multiplication—the high-level
2For example, the Summit supercomputer at Oak Ridge National Laboratory—the first-ranked supercomputer on the June 2019 Top500-list (https:
//top500.org)—consists of 4608 compute nodes, each of which has 512 GiB of shared DDR4 memory and six Tesla V100 SXM2 accelerator devices.
3Here we note that numerical rather than Boolean matrix multiplication has been extensively studied from an implementation engineering perspective
both in the setting of shared memory and in the setting of distributed memory—we postpone a discussion of related work to §1.3.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 3
architecture with vectorization, pipelining, and long latencies is likely to remain more stable over time. In particular,
this requires designs and engineering for the on-accelerator memory hierarchy that saturates the bandwidth of the
dedicated hardware units.
Third, assuming each accelerator can be engineered to perform close to the peak available bandwidth in terms of bit
operations per second, the host-level algorithm must be engineered to make effective use of the accelerators. This in
particular means feeding the accelerators with input and aggregating their output in comparatively low-bandwidth
shared memory across a low-bandwidth interconnect between the accelerators and the host. Our objective of engineering
for inputs that are near the capacity of shared memory presents a further challenge in that at the host level we must
use designs with a low memory footprint. We expect also these traits to withstand the test of time and thus warrant the
present engineering effort.
1.2 Our Contributions
Let us now proceed to our specific contributions to address the aforementioned engineering challenges.
Fine-grained optimization of the base design. First, we proceedwith a fine-grained optimization of Strassen’s algorithm [35]
using the alternative-basis framework of Karstadt and Schwartz [20] for the binary field. Essentially, we investigate all
possible alternative bases over the binary field, and optimize (i) the number of Boolean operations for the alternative-
basis multiplication, (ii) the weight distribution of linear combinations in the alternative basis, and (iii) the number
of Boolean operations for changing between bases. Further desirable properties include in-place-computability and
self-invertibility of the basis changes, as well as the ability to make chains of matrix multiplications in the alternative
basis.
Theorem 1.1 (Main, Self-Inverse and In-Place Basis Changes). There exists an alternative-basis bilinear algorithm
that multiplies two 2 × 2 input matrices over the binary field using
(a) 3 additions to preprocess each input,
(b) 7 noncommutative multiplications, and
(c) 6 additions for postprocessing to obtain the output.
Moreover, the weight distribution in the alternative bases is 1, 1, 1, 1, 2, 2, 2. The basis changes between the standard basis
and the alternative bases each use 2 additions, are self-inverse, and can be computed in-place.
Applied recursively in the alternative basis, Theorem 1.1 gives binary matrix multiplication using 5nlog2 7 − 4n2
bit operations when n is a power of two. Changing between the standard basis and the alternative bases takes
1
2n
2 log2 n bit operations and can be done in-place in memory, which presents a minor improvement over the original
Karstadt–Schwartz design, but comes at the cost of losing the chain-multiplication property, which is desirable in
many applications. The next theorem summarizes a new design that supports chain multiplication but is slightly less
efficient than Theorem 1.1 in terms of its weight distributions, which results in slightly less efficiency when aggregating
solutions of sub-instances in tight working memory at the host.
Theorem 1.2 (Main, Chain-Multiplication). There exists an alternative-basis bilinear algorithm that multiplies two
2 × 2 input matrices over the binary field using
(a) 3 additions to preprocess each input,
(b) 7 noncommutative multiplications, and
4 Matti Karppa and Petteri Kaski
(c) 6 additions for postprocessing to obtain the output.
Moreover, the weight distributions are 1, 1, 1, 1, 2, 2, 2 and 1, 1, 1, 1, 2, 3, 3 for taking linear combinations of the operands
and the results of noncommutative multiplications, respectively. The basis changes between the standard basis and the
alternative basis each use 2 additions, can be computed in-place, and support chain-multiplication in the alternative basis.
The proofs of Theorem 1.1 and Theorem 1.2 are presented in §2 together with a development of the alternative-
basis framework using Kronecker products and Yates’s algorithm [42] that enables easy parallelization on vectorized
accelerator hardware.
Engineering for performance on accelerator hardware. Second, beyond optimizing the base design, we engineer an
implementation suitable for vectorized accelerator hardware with extensive bit-operation-and-memory bandwidth
but long latencies. Here the key engineering principle is to expose sufficient parallelism to saturate the compute cores
with work and hide latency, but not to exceed the available on-accelerator memory. It is well known (e.g. [2, 23]) that
bilinear block-recursive algorithm designs for matrix multiplication enable a tradeoff between
(1) parallel processing (executing the recursive block multiplications in parallel, or “breadth-first”, using independent
memory for each recursive case) and
(2) serial processing (by processing the recursive multiplications serially one after another, or “depth-first”, reusing
memory).
We observe that this tradeoff also applies to alternative-basis algorithms, and illustate its application with a recursive
design whose upper levels proceed serially through recursive calls to lower levels executed in parallel, relying on the
recurrences underlying Theorems 1.1 and 1.2. The lower parallel levels consist of (i) parallel preprocessing for both
inputs, (ii) parallel low-level optimized 64 × 64 bit-matrix-multiplication, and (iii) parallel postprocessing to recover the
output. We fine-tune the performance of these levels for the target hardware by (a) merging consecutive levels so that
the intermediate results are stored in per-thread registers and (b) working with the widest available per-thread load
and store instructions for communicating between the per-thread registers and the on-accelerator memory.
For an n × n binary product for large enough n, these engineering considerations together with Theorem 1.1 and 1.2
enable us to obtain on a single Tesla V100 SXM2 accelerator an empirical effective bit-operation bandwidth that
exceeds the theoretical boost-clock peak bandwidth for bit operations. Here by effective bandwidth we mean 2n3−n2T bit
operations per second, where 2n3 − n2 is the number of bit operations to compute an n × n binary product with the
elementary algorithm, andT is the measured wall-clock running time to compute an n ×n binary product. We postpone
a detailed review of empirical performance to §4.
Host-level implementation with a low memory footprint. Third, we engineer host-level subroutines that in parallel feed
multiple accelerators with recursive subproblems and aggregate the results obtained to the host-level buffers. Here we
follow a strategy of using multiple groups of threads on the host CPUs, where each group contains exactly one thread
for each accelerator. Each group is responsible for a specific task in a pipeline of tasks, such as preparing subproblems
from the host-level input, solving subproblems on accelerators, and integrating the results of subproblems into the
host-level output. The threads coordinate their work through standard synchronization primitives such as mutexes
and blocking. To maintain a low working-memory footprint in host memory, we use q-fold Kronecker products of the
decomposition matrices underlying Theorems 1.1 and 1.2 to produce subproblems and aggregate the sub-results. This
strategy in particular benefits from optimization of the weights of the decomposition matrices discussed above.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 5
For the binary product, this strategy suffices to obtain aggregate effective bandwidth that exceeds the aggregate
peak boost-clock bandwidth of the accelerators on inputs whose size is beyond the aggregate memory capacity of the
accelerators. In particular, we consider it a very satisfactory engineering outcome as well as benchmark that we can
deliver the binary product of two one-terabinary-bit square matrices in less than one thousand seconds and with less
than one kilowatt-hour of energy consumption.
Open source. We release our present experimental algorithm implementations as open source4 to enable further
engineering and to communicate precisely the lowest-level design decisions for current microarchitectures.
1.3 Earlier and Related Work
From a theory perspective, the study of fast algorithms for Boolean matrix multiplication has proceeded along essentially
two lines of study. The first line of study either relies on the binary field directly or embeds the Boolean algebra to a
ring, employing techniques for fast matrix multiplication over rings to obtain the result. Asymptotically the fastest
known such algorithms run in time O(nω+o(1)), where 2 ≤ ω < 2.3728639 is the exponent of matrix multiplication
[8, 10, 22, 37].
While such algorithms are the fastest known in terms of asymptotic efficiency, practical algorithms for fast
matrix multiplication over rings rely on recursive tensor techniques using small base tensors (e.g. [2, 4, 6, 9, 13–
15, 20, 21, 23, 24]) or trilinear aggregation–cancellation techniques [31, 32] for a small number of independent matrix
multiplications (e.g. [18, 19]). These practical studies differ from our present work in that they consider numerical
(floating-point) matrix multiplication for CPU-based shared- or distributed-memory systems, whereas we optimize for
bit matrices and a multiple-accelerator shared-memory system. Furthermore, we rely on an alternative-basis approach
optimized for bit matrices, whereas most of the earlier work—with the exception of Karstadt and Schwartz [20]—operates
in the standard basis.
A second line of study on Boolean matrix multiplication seeks combinatorial rather than algebraic techniques
to obtain competitive algorithms [1, 3, 7, 38, 43]. Currently, the fastest such algorithm runs asymptotically in time
Oˆ(n3/log4 n) [43]. However, we are not aware of engineering work to bring algorithms in this second line of study to
the computing practice on contemporary parallel architectures.
2 ALTERNATIVE-BASIS MATRIX-MULTIPLICATION
This section develops the mathematical framework for alternative-basis matrix multiplication over the binary field
(0, 1,+, ·). We start by recalling the Strassen–Winograd standard-basis design, and then proceed to introduce two novel
alternative-basis designs for use with the binary field as well as establishing our main theorems (Theorem 1.1 and
Theorem 1.2). In essence, both new designs are variants of the design in Karstadt and Schwartz [20], with some further
optimization of the designs in particular as pertains to the additive complexity and in-place computability of the basis
changes. The rest of the section develops the mathematical framework of fast alternative-basis matrix multiplication
with no claim on originality, apart perhaps from an expositionary choice to work with Kronecker products and Yates’s
algorithm [42] to highlight the symmetries and the easily vector-parallelizable sum-product-sum-layered structure of
the framework. As discussed in the introduction, we expect this framework to withstand the test of time with further
designs and evolving computing hardware.
4Cf. https://github.com/mkarppa/matmul.
6 Matti Karppa and Petteri Kaski
2.1 The Strassen–Winograd Recurrences
We start by recalling the classical Strassen–Winograd design that works in the standard basis. It will be convenient to
first state all algorithms as straight-line programs to highlight their additive complexity. Toward this end, suppose we
are to multiply two matrices
A =
[
A00 A01
A10 A11
]
and B =
[
B00 B01
B10 B11
]
(1)
with entries over the binary field (0, 1,+, ·).
The Strassen-Winograd design is as follows. First, compute the linear combinations
T0 ← A10 +A11 , T1 ← A01 , T2 ← A01 +A11 , T3 ← A10 +T2 , T4 ← A00 +T3 , T5 ← A10 , T6 ← A00 ,
S0 ← B10 + B11 , S1 ← B10 , S2 ← B01 + B11 , S3 ← B10 + S2 , S4 ← B01 , S5 ← B00 + S3 , S6 ← B00 .
(2)
Then, multiply the linear combinations to obtain the products
Q0 ← T0S0 , Q1 ← T1S1 , Q2 ← T2S2 , Q3 ← T3S3 , Q4 ← T4S4 , Q5 ← T5S5 , Q6 ← T6S6 . (3)
Finally, compute the linear combinations of products
U0 ← Q1 +Q3 , U1 ← Q2 +U0 , U2 ← Q4 +U0 ,
C00 ← Q1 +Q6 , C01 ← Q0 +U2 , C10 ← Q5 +U1 , C11 ← Q0 +U1 .
(4)
In total, this straight-line program makes exactly 15 additions and 7 multiplications. Furthermore, a direct calculation
shows that the straight-line program correctly computes the product
C =
[
C00 C01
C10 C11
]
=
[
A00B00 +A01B10 A00B01 +A01B11
A10B00 +A11B10 A10B01 +A11B11
]
= AB . (5)
2.2 An Alternative-Basis Algorithm with Self-Inverse Basis Changes
Let us now turn to alternative-basis designs. We continue to assume that the input is given in 2 × 2 block form, as in (1).
Our first new alternative-basis algorithm is as follows. First, change basis by computing
Aˆ00 ← A00 , Aˆ01 ← A01 , Aˆ10 ← A10 , Aˆ11 ← A01 +A10 +A11 ,
Bˆ00 ← B00 , Bˆ01 ← B01 , Bˆ10 ← B10 , Bˆ11 ← B01 + B10 + B11 ,
(6)
We observe that the same basis change is applied for both matrices A and B. Next, compute the linear combinations
Tˆ0 ← Aˆ00 , Tˆ1 ← Aˆ01 , Tˆ2 ← Aˆ10 , Tˆ3 ← Aˆ11 , Tˆ4 ← Aˆ00 + Aˆ11 , Tˆ5 ← Aˆ01 + Aˆ11 , Tˆ6 ← Aˆ10 + Aˆ11 ,
Sˆ0 ← Bˆ00 , Sˆ1 ← Bˆ10 , Sˆ2 ← Bˆ00 + Bˆ11 , Sˆ3 ← Bˆ11 , Sˆ4 ← Bˆ01 , Sˆ5 ← Bˆ01 + Bˆ11 , Sˆ6 ← Bˆ10 + Bˆ11 .
(7)
Multiply the linear combinations to obtain the products
Qˆ0 ← Tˆ0Sˆ0 , Qˆ1 ← Tˆ1Sˆ1 , Qˆ2 ← Tˆ2Sˆ2 , Qˆ3 ← Tˆ3Sˆ3 , Qˆ4 ← Tˆ4Sˆ4 , Qˆ5 ← Tˆ5Sˆ5 , Qˆ6 ← Tˆ6Sˆ6 . (8)
Compute the linear combinations of products
Cˆ00 ← Qˆ0 + Qˆ1 , Cˆ01 ← Qˆ4 + Qˆ6 , Cˆ10 ← Qˆ2 + Qˆ5 , Cˆ11 ← Qˆ1 + Qˆ3 + Qˆ5 + Qˆ6 . (9)
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 7
Finally, transform Cˆ back to the standard basis with
C00 ← Cˆ00 , C01 ← Cˆ01 + Cˆ11 , C10 ← Cˆ10 + Cˆ11 , C11 ← Cˆ11 . (10)
A direct calculation shows that the product (5) is correctly evaluated. This establishes Theorem 1.1 with the exception
of the claim on weights, which can be easily verified from (46) in what follows.
Compared with the design of Karstadt and Schwartz [20], we observe that the basis transformations (6) and (10) over
the binary field use only two additions per matrix, whereas their design uses three additions per matrix, but works over
an arbitrary ring. Furthermore, our transformations (6) and (10) admit straightforward in-place computation as well as
in-place inversion—indeed, both (6) and (10) are easily verified to be self-inverses over the binary field.
Remark. A drawback of the design above is that it does not enable alternative-basis chaining of matrix multiplications
in the sense that the transformations (6) and (10) are not inverses of each other, which would be advantageous in
applications that seek chain-multiplication. Our next design removes this drawback but loses the appealing self-inverse
property.
2.3 An Alternative-Basis Algorithm with Chaining
Our second algorithm retains the arithmetic advantage in basis changes over the Karstadt–Schwartz [20] design and
works in an alternative basis for the matrix ring. For a matrix in the standard basis
X =
[
X00 X01
X10 X11
]
,
we transform to the alternative basis by computing
Xˆ00 ← X00 , Xˆ01 ← X01 , Xˆ11 ← X01 + X11 , Xˆ10 ← Xˆ11 + X10 . (11)
The inverse transform from the alternative basis to the standard basis is given by
X00 ← Xˆ00 , X01 ← Xˆ01 , X10 ← Xˆ10 + Xˆ11 , X11 ← Xˆ01 + Xˆ11 . (12)
The algorithm now proceeds as follows. Given two matrices A and B as input, we transform both to the alternative
basis using (11) to obtain the matrices Aˆ and Bˆ. Then, compute the linear combinations
Tˆ0 ← Aˆ00 , Tˆ1 ← Aˆ01 , Tˆ2 ← Aˆ10 , Tˆ3 ← Aˆ11 , Tˆ4 ← Aˆ00 + Aˆ10 , Tˆ5 ← Aˆ01 + Aˆ10 , Tˆ6 ← Aˆ10 + Aˆ11 ,
Sˆ0 ← Bˆ00 , Sˆ1 ← Bˆ10 + Bˆ11 , Sˆ2 ← Bˆ10 , Sˆ3 ← Bˆ11 , Sˆ4 ← Bˆ01 , Sˆ5 ← Bˆ01 + Bˆ10 , Sˆ6 ← Bˆ00 + Bˆ10 .
(13)
Multiply the linear combinations to obtain the products
Qˆ0 ← Tˆ0Sˆ0 , Qˆ1 ← Tˆ1Sˆ1 , Qˆ2 ← Tˆ2Sˆ2 , Qˆ3 ← Tˆ3Sˆ3 , Qˆ4 ← Tˆ4Sˆ4 , Qˆ5 ← Tˆ5Sˆ5 , Qˆ6 ← Tˆ6Sˆ6 . (14)
Compute the linear combinations of products
Rˆ ← Qˆ1 + Qˆ2 + Qˆ4 , Cˆ00 ← Qˆ0 + Qˆ1 , Cˆ01 ← Rˆ + Qˆ5 , Cˆ10 ← Rˆ + Qˆ6 , Cˆ11 ← Qˆ3 + Qˆ4 . (15)
Finally, transform the matrix Cˆ back to the standard basis using (12) to obtain the product matrixC . A direct calculation
shows thatC = AB. Furthermore, since the transformations (11) and (12) are mutual inverses, multiplications (Aˆ, Bˆ) 7→ Cˆ
in the alternative basis may be chained without transforming back to the standard basis in between multiplications; this
8 Matti Karppa and Petteri Kaski
property will become immediate from our exposition in §2.7 and §2.8. This establishes Theorem 1.2 with the exception
of the claim on weights, which can be easily verified from (48) in what follows.
2.4 Arrays, Vectors, and Matrices
This section develops preliminaries and notational conventions for working with binary matrices used throughout the
rest of this paper. The basic data structure wework with is an array ofm entries, indexed by the set [m] = {0, 1, . . . ,m−1}.
A tensor is an array with an associated shape m1 × m2 × · · · × md for nonnegative integers m1,m2, . . . ,md with
m =m1m2 · · ·md . In this case we say that the tensor has d modes and that the mode ℓ = 1, 2, . . . ,d has lengthmℓ . A
tensor with one mode is called a vector and a tensor with two modes is called a matrix.
We index the entries of a tensorT of shapem1 ×m2 × · · · ×md using tuples (i1, i2, . . . , id ) ∈ [m1] × [m2] × · · · × [md ],
with the convention that the tuple refers to the entry indexed by
i = i1m2m3 · · ·md + i2m3m4 · · ·md + . . . + id−1md + id (16)
in the underlying array of lengthm. In other words, in linearizing the tuple (i1, i2, . . . , id ) ∈ [m1] × [m2] × · · · × [md ]
to a linear index i ∈ [m], the first index i1 ∈ [m1] is the most significant, the next index i2 ∈ [m2] is the next most
significant, and so forth. We write T(i1,i2, ...,id ) for the entry of T indexed by (i1, i2, . . . , id ). To lighten the notation,
we may also write simply Ti1i2 · · ·id assuming the indexing is immediate from the context. When working with modes
of composite lengthmℓ =mℓ,1mℓ,2 · · ·mℓ,k for nonnegative integersmℓ,1,mℓ,2, . . . ,mℓ,k , for convenience we often
choose to index such a mode with tuples in [mℓ,1] × [mℓ,2] × · · · × [mℓ,k ], with the understanding that we follow the
first-index-major convention (16) to arrive at the linear index in [mℓ].
We use the following notation for subtensors of tensors indexed along the most significicant modes. For a tensorT of
shapem1×m2×· · ·×md and a tuple (i1, i2, . . . , iℓ) ∈ [m1]×[m2]×· · ·×[mℓ] of indices, we writeTi1i2 · · ·iℓ for the tensor of
shapemℓ+1×mℓ+2×· · ·×md with entries (Ti1i2 · · ·iℓ )iℓ+1iℓ+2 · · ·id for all (iℓ+1, iℓ+2, . . . , id ) ∈ [mℓ+1]×[mℓ+2]× · · ·×[md ].
In particular, we find it convenient to use the same notation for entries and subarrays of a tensor T , as the structure of
the indexing tuple will indicate whether an entry or a subarray is meant.
In what follows let us assume that all arrays have their entries in the binary field (0, 1,+, ·) unless indicated otherwise.
For an s × t matrix µ, write write µ⊤ for the t × s transpose of µ with entries defined for all i ∈ [s] and j ∈ [t] by the
rule µ⊤i j = µ ji . For an s × t matrix µ and a t × u matrix ν , we write µν for the s × u product matrix with entries defined
for all i ∈ [s] and k ∈ [u] by (µν )ik =
∑
j ∈[t ] µi jνjk . We write I = In for the n × n identity matrix. For an s × t matrix µ
and a p × q matrix ν , the Kronecker product µ ⊗ ν is the sp × tq matrix with entries defined for all i ∈ [s], j ∈ [t], k ∈ [p],
and ℓ ∈ [q] by the rule (µ ⊗ ν )ik jℓ = µi jνkℓ . For an s × t matrix µ, a t × u matrix σ , a p × q matrix ν , and a q × r matrix
τ , let us recall the composition rule for Kronecker products
(µσ ) ⊗ (ντ ) = (µ ⊗ ν )(σ ⊗ τ ) . (17)
Let us also recall the transposition rule
(µ ⊗ ν )⊤ = µ⊤ ⊗ ν⊤ . (18)
For two tensors σ and τ of identical shape, the entrywise product σ ⊙ τ has the same shape, with entries defined by the
rule (σ ⊙ τ )i = σiτi for all indices i .
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 9
2.5 The Triple Product Property and Matrix Multiplication
This section recalls the triple product of matrix multiplication and its closure under taking of Kronecker products. For
an su × r matrix ζ , an r × st matrix ξ , and an r × tu matrix η, we say that the three-tuple (ζ |ξ ,η) has the triple product
property with parameters ⟨s, t ,u⟩r if for all i, i ′ ∈ [s], j, j ′ ∈ [t], and k,k ′ ∈ [u] it holds that∑
h∈[r ]
ζi′k ′hξhi jηhj′k =

1 if i = i ′, j = j ′, and k = k ′ ,
0 otherwise .
(19)
For example, the Strassen–Winograd design (2), (3), (4) gives rise to the matrices
ξ =

0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 1
1 1 1 1
0 0 1 0
1 0 0 0

, η =

0 0 1 1
0 0 1 0
0 1 0 1
0 1 1 1
0 1 0 0
1 1 1 1
1 0 0 0

, and ζ =

0 1 0 0 0 0 1
1 1 0 1 1 0 0
0 1 1 1 0 1 0
1 1 1 1 0 0 0

, (20)
which are readily verified to satisfy the triple product property with s = t = u = 2 and r = 7.
Let (ζ |ξ ,η) satisfy the triple product rule with parameters ⟨s, t ,u⟩r . Let A be an s × t matrix and B be a t × u matrix.
Define an s × u matrix C for all i ′ ∈ [s] and k ′ ∈ [u] by the rule
Ci′k ′ =
∑
h∈[r ]
ζi′k ′h
∑
i ∈[s]
∑
j ∈[t ]
ξhi jAi j
∑
j′∈[t ]
∑
k ∈[u]
ηhj′kBj′k
=
∑
i ∈[s]
∑
j ∈[t ]
∑
j′∈[t ]
∑
k ∈[u]
Ai jBj′k
∑
h∈[r ]
ζi′k ′hξhi jηhj′k
=
∑
j ∈[t ]
Ai′jBjk ′ ,
(21)
where the second equality follows by changing the order of summation and the last equality follows from the triple
product property (19). Viewing A, B, and C as vectors of length st , tu, and su, respectively, from (21) it immediately
follows that
C = ζ (ξA ⊙ ηB) . (22)
In other words, we observe that a three-tuple (ζ |ξ ,η) with the triple product property reduces multiplication of the
matrices A and B to (i) taking linear combinations ξA and ηB of entries of A and B independently, (ii) multiplying these
linear combinations pointwise to obtain ξA ⊙ ηB, and (iii) taking linear combinations ζ (ξA ⊙ ηB) to obtain the product
matrix C of A and B.
The triple product property gains its computational power from closure under Kronecker products and the compo-
sition rule. Indeed, let (ζℓ |ξℓ ,ηℓ) satisfy the triple product property with parameters ⟨sℓ , tℓ ,uℓ⟩rℓ for ℓ = 1, 2, . . . ,d .
Let A be a vector of length s1t1s2t2 · · · sd td , let B be a vector of length t1u1t2u2 · · · tdud , and let C be a vector of length
s1tus2u2 · · · sdud . Similarly to (21), for all (i ′1, i ′2, . . . , i ′d ) ∈ [s1]×[s2]×· · ·×[sd ] and (k ′1,k ′2, . . . ,k ′d ) ∈ [u1]×[u2]×· · ·×[ud ],
10 Matti Karppa and Petteri Kaski
by the triple product property (19) we now have
Ci′1k
′
1i
′
2k
′
2 · · ·i′dk ′d =
=
∑
h1∈[r1]
h2∈[r2]...
hd ∈[rd ]
ζ
(1)
i′1k
′
1h1
· · · ζ (d )i′dk ′dhd
∑
i1∈[s1]
i2∈[s2]...
id ∈[sd ]
∑
j1∈[t1]
j2∈[t2]...
jd ∈[td ]
ξ
(1)
h1i1 j1
· · · ξ (d )hd id jdAi1 j1i2 j2 · · ·id jd
∑
j′1∈[t1]
j′2∈[t2]...
j′d ∈[td ]
∑
k1∈[u1]
k2∈[u2]...
kd ∈[ud ]
η
(1)
h1 j′1k1
· · ·η(d )hd j′dkd Bj′1k1 j′2k2 · · ·j′dkd
=
∑
j1∈[t1]
j2∈[t2]...
jd ∈[td ]
Ai′1 j1i
′
2 j2 · · ·i′d jd Bj1k ′1 j2k ′2 · · ·jdk ′d .
(23)
Similarly to (22), from (23) we observe that
C =
(
ζ (1) ⊗ · · · ⊗ ζ (d )) ( (ξ (1) ⊗ · · · ⊗ ξ (d ))A ⊙ (η(1) ⊗ · · · ⊗ η(d ))B) . (24)
In other words, to multiply A and B, it suffices to first compute the two matrix-vector products
T ← (ξ (1) ⊗ · · · ⊗ ξ (d ))A , S ← (η(1) ⊗ · · · ⊗ η(d ))B , (25)
then multiply the resulting vectors elementwise
Q ← T ⊙ S , (26)
and finally compute the matrix-vector product
C ← (ζ (1) ⊗ · · · ⊗ ζ (d ))Q (27)
to recover the product C of A and B as a vector of length s1u1s2u2 · · · sdud .
2.6 Yates’s Algorithm and Fast Matrix Multiplication
This section develops fast matrix multiplication in the above framework by reduction to Yates’s algorithm. The
computational bottleneck of the matrix-multiplication algorithm (A,B) 7→ C given by (25), (26), and (27) occurs with
matrix-vector multiplications of the following form. For ℓ = 1, 2, . . . ,d , let µ(ℓ) be a matrix of shape bℓ × aℓ , and letU
be a vector of length a1a2 · · ·ad . We want to compute the vector V of length b1b2 · · ·bd with
V =
(
µ(1) ⊗ µ(2) ⊗ · · · ⊗ µ(d ))U .
The key idea is to rely on the composition rule (17) to implement multiplication with µ(1)⊗µ(2)⊗· · ·⊗µ(d ) one component
matrix µ(ℓ) at a time, via a sequence of sparse matrices µ¯[ℓ] defined in what follows. Let π : {1, 2, . . . ,d} → {1, 2, . . . ,d}
be an arbitrary permutation that encodes the order in which the matrices µ(ℓ) will be applied.5 For all k, ℓ ∈ {1, 2, . . . ,d},
let
mπk, ℓ =m
π
k, ℓ(b,a) =

ak if k ∈ π
({
1, 2, . . . ,π−1(ℓ)}) ;
bk otherwise.
5At first reading, it may be convenient to assume that π is the identity permutation.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 11
Let us recall that we write In for an n × n identity matrix. From the composition rule (17) of Kronecker products, we
obtain the decomposition
µ(1) ⊗ µ(2) ⊗ · · · ⊗ µ(d ) = µ¯[π (1)] µ¯[π (2)] · · · µ¯[π (d )] (28)
with
µ¯[ℓ] = Imπ1, ℓ ⊗ Imπ2, ℓ ⊗ · · · ⊗ Imπℓ−1, ℓ ⊗ µ
(ℓ) ⊗ Imπ
ℓ+1, ℓ
⊗ Imπ
ℓ+2, ℓ
⊗ · · · ⊗ Imπd, ℓ (29)
for ℓ = 1, 2, . . . ,d . In essence, each matrix µ¯[ℓ] implements∏dk=1, k,ℓmπk, ℓ independent matrix-vector multiplications
with the matrix µ(ℓ). Accordingly, µ¯[ℓ] is sparse with at most aℓbℓ
∏d
k=1, k,ℓm
π
k, ℓ nonzero entries. Furthermore, if
matrix-vector multiplication with the bℓ × aℓ matrix µ(ℓ) has a straight-line-program implementation consisting of
Pℓ ≤ bℓ(aℓ − 1) binary additions, the total number of binary additions in the algorithm
U [d ] ← U , U [d−1] ← µ¯[π (d )]U [d ] , U [d−2] ← µ¯[π (d−1)]U [d−1] , U [0] ← µ¯[π (1)]U [1] , V ← U [0] , (30)
is
d∑
ℓ=1
Pℓ
d∏
k=1
k,ℓ
mπk, ℓ . (31)
The algorithm (30) is known as Yates’s algorithm [42] for multiplying a Kronecker-product-structured matrix with a
given vector.
Yates’s algorithm is known to admit highly efficient parallelization. Indeed, since each layer U [ℓ−1] ← µ¯[π (ℓ)]U [ℓ] in
Yates’s algorithm consists of a large number of independent matrix-vector multiplications with the same matrix µ(π (ℓ)),
Yates’s algorithm admits immediate vector-parallelization (single-instruction-multiple-data parallelization). Furthermore,
multiple consecutive layers may be aggregated into one layer to optimize use of local storage such as per-scalar-thread
registers in vectorized parallel execution. Here the ability to permute layers arbitrarily yields considerable freedom to
optimize both the arithmetic cost as well as the vectorized execution and the use of local storage in applications.
For matrix multiplication, using Yates’s algorithm with permutations πζ ,πξ ,πη : {1, 2, . . . ,d} → {1, 2, . . . ,d}
to implement the matrix-vector multiplications in (25) and (27), we obtain an algorithm design that multiplies an
s1s2 · · · sd × t1t2 · · · td matrix with a t1t2 · · · td × u1u2 · · ·ud matrix using exactly
d∑
ℓ=1
P
ζ
ℓ
d∏
k=1
k,ℓ
m
πζ
k, ℓ(s ⊙ u, r ) +
d∑
ℓ=1
P
ξ
ℓ
d∏
k=1
k,ℓ
m
πξ
k, ℓ(r , s ⊙ t) +
d∑
ℓ=1
P
η
ℓ
d∏
k=1
k,ℓ
m
πη
k, ℓ(r , t ⊙ u) (32)
binary additions and r1r2 · · · rd binary multiplications, where Pζℓ , P
ξ
ℓ
, and Pη
ℓ
are the number of additions in a straight-
line program that multiplies the matrix ζ (ℓ), ξ (ℓ), and η(ℓ), respectively, with a vector, ℓ = 1, 2, . . . ,d .
2.7 Fast Alternative-Basis Matrix Multiplication
Let us now review the key idea of Karstadt and Schwartz [20] to change basis to reduce the total additive cost within
the previous framework. For ℓ = 1, 2, . . . ,d , let (ζ (ℓ) |ξ (ℓ),η(ℓ)) satisfy the triple product property with parameters
⟨sℓ , tℓ ,uℓ⟩rℓ . For each ℓ = 1, 2, . . . ,d , let us now decompose the matrices ζ (ℓ), ξ (ℓ),η(ℓ) into
ζ (ℓ) = χ (ℓ)γ (ℓ) , ξ (ℓ) = α (ℓ)ϕ(ℓ) , η(ℓ) = β (ℓ)ψ (ℓ) , (33)
where ϕ(ℓ),ψ (ℓ), and χ (ℓ) are arbitrary invertible matrices. Applying the decomposition (33) together with the composi-
tion rule (17) of Kronecker products, the key observation of Karstadt and Schwartz [20] in the present framework is
12 Matti Karppa and Petteri Kaski
that the multiplication identity (24) decomposes as
C =
(
ζ (1) ⊗ · · · ⊗ ζ (d )) ( (ξ (1) ⊗ · · · ⊗ ξ (d ))A ⊙ (η(1) ⊗ · · · ⊗ η(d ))B)
=
(
χ (1)γ (1) ⊗ · · · ⊗ χ (d )γ (d )) ( (α (1)ϕ(1) ⊗ · · · ⊗ α (d )ϕ(d ))A ⊙ (β (1)ψ (1) ⊗ · · · ⊗ β (d )ψ (d ))B)
=
(
χ (1) ⊗ · · · ⊗ χ (d )) (γ (1) ⊗ · · · ⊗ γ (d )) ( (α (1) ⊗ · · · ⊗ α (d )) (ϕ(1) ⊗ · · · ⊗ ϕ(d ))A
⊙ (β (1) ⊗ · · · ⊗ β (d )) (ψ (1) ⊗ · · · ⊗ψ (d ))B) .
(34)
From (34) we can immediately extract the following alternative-basis multiplication algorithm, which we state in a form
that relies on Yates’s algorithm for multiplying with each Kronecker-product-structured matrix in (34).
LetA be a vector of length s1t1s2t2 · · · sd td and let B be a vector of length t1u1t2u2 · · · tdud . Let πϕ ,πψ ,πχ ,πα ,πβ ,πγ :
{1, 2, . . . ,d} → {1, 2, . . . ,d} be arbitrary permutations. To multiply A and B, first change basis for both inputs by
computing
Aˆ← ϕ¯[πϕ (1)] · · · ϕ¯[πϕ (d )]A , Bˆ ← ψ¯ [πψ (1)] · · · ψ¯ [πψ (d )]B . (35)
Then compute linear combinations in the new bases
Tˆ ← α¯ [πα (1)] · · · α¯ [πα (d )]Aˆ , Sˆ ← β¯ [πβ (1)] · · · β¯ [πβ (d )]Bˆ . (36)
Multiply the resulting vectors elementwise
Qˆ ← Tˆ ⊙ Sˆ . (37)
Take linear combinations of the products
Cˆ ← γ¯ [πγ (1)] · · · γ¯ [πγ (d )]Qˆ . (38)
Finally change basis
C ← χ¯ [πχ (1)] · · · χ¯ [πχ (d )]Cˆ (39)
to recover the product C of A and B as a vector of length s1u1s2u2 · · · sdud .
The key insight of Karstadt and Schwartz [20] is that the additive straight-line-program cost of the matrix-vector
multiplications with the basis-transformed matrices α (ℓ), β (ℓ), and γ (ℓ) may be strictly lower than the corresponding
cost for the matrices ξ (ℓ), η(ℓ), and ζ (ℓ). Furthermore, one may engineer the basis changes ϕ(ℓ),ψ (ℓ), and χ (ℓ) to such
an effect.
In more precise terms, implementing each of the transformations (35), (36), (38), and (39) using Yates’s algorithm (30)
with the identity permutation, we obtain that the alternative-basis algorithm uses exactly
d∑
ℓ=1
P
χ
ℓ
d∏
k=1
k,ℓ
m
πχ
k, ℓ(s ⊙ u, s ⊙ u) +
d∑
ℓ=1
P
ϕ
ℓ
d∏
k=1
k,ℓ
m
πϕ
k, ℓ(s ⊙ t , s ⊙ t) +
d∑
ℓ=1
P
ψ
ℓ
d∏
k=1
k,ℓ
m
πψ
k, ℓ(t ⊙ u, t ⊙ u) (40)
binary additions for the basis changes,
d∑
ℓ=1
P
γ
ℓ
d∏
k=1
k,ℓ
m
πγ
k, ℓ(s ⊙ u, r ) +
d∑
ℓ=1
Pαℓ
d∏
k=1
k,ℓ
mπαk, ℓ(r , s ⊙ t) +
d∑
ℓ=1
P
β
ℓ
d∏
k=1
k,ℓ
m
πβ
k, ℓ(r , t ⊙ u) (41)
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 13
binary additions for the linear combinations, and
r1r2 · · · rd (42)
binary multiplications.
In the diagonal case with r = rℓ , s = sℓ , t = tℓ , u = uℓ , α = α (ℓ), β = β (ℓ), γ = γ (ℓ), ϕ = ϕ(ℓ),ψ = ψ (ℓ), and χ = χ (ℓ)
for ℓ = 1, 2, . . . ,d , and s = t = u with r > s2, we obtain that the alternative-basis algorithm uses(
P χ + Pϕ + Pψ
)
s2(d−1)d (43)
binary additions for the basis changes,(
Pγ + Pα + Pβ
) d∑
ℓ=1
s2(ℓ−1)rd−ℓ =
(
Pγ + Pα + Pβ
) rd − s2d
r − s2 (44)
binary additions for the linear combinations, and rd binary multiplications. In particular, from (43) and (44) we see
that optimizing the additive costs Pγ , Pα , and Pβ optimizes the leading constant of the dominant cost for large d .
In the non-diagonal case, more fine-grained optimization may be performed by optimizing the components of the
decomposition and the permutations to minimize (40) and (41).
A useful subfamily of alternative-basis algorithms are the alternative-basis algorithms that satisfy ϕ(ℓ) = ψ (ℓ) and
χ (ℓ) =
(
ϕ(ℓ)
)−1 for all ℓ = 1, 2, . . . ,d . We say that such algorithms support matrix chain multiplication or chaining in
the alternative basis. Indeed, for such an algorithm, one may chain multiplications (Aˆ, Bˆ) 7→ Cˆ in the alternative basis
using (36), (37), and (38) only, without changing back to the standard basis in between multiplications.
2.8 The Two New Algorithms in Matrix Form
Let us first express the new alternative-basis algorithm in §2.2 in matrix form. The change-of-basis matrices are
ϕ = ψ =

1 0 0 0
0 1 0 0
0 0 1 0
0 1 1 1

, χ =

1 0 0 0
0 1 0 1
0 0 1 1
0 0 0 1

. (45)
We readily observe that ϕ = ψ = χ⊤ and that ϕ = ϕ−1 and χ = χ−1 over the binary field. This self-inverse property and
the ease of in-place computation of matrix-vector multiplication with (45) are appealing properties when working with
inputs with tight space constraints. From (6) we observe that we can take Pϕ = Pψ = 2 and from (10) we observe that
we can take P χ = 2. The alternative-basis decomposition is defined by the matrices
α =

1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 0 1
0 1 0 1
0 0 1 1

, β =

1 0 0 0
0 0 1 0
1 0 0 1
0 0 0 1
0 1 0 0
0 1 0 1
0 0 1 1

, and γ =

1 1 0 0 0 0 0
0 0 0 0 1 0 1
0 0 1 0 0 1 0
0 1 0 1 0 1 1

. (46)
14 Matti Karppa and Petteri Kaski
We readily verify that matrix-vector multiplication with the matrices α and β can be implemented using the recur-
rences (7) so that Pα = Pβ = 3, and matrix-vector multiplication with the matrix γ can be implemented with Pγ = 6
using (9). We observe that the three-tuple (χγ |αϕ, βψ ) with ϕ, α , β , and γ given by (45) and (46) satisfies the triple
product property (19) with parameters ⟨2, 2, 2⟩7.
Let us now turn to the alternative-basis algorithm in §2.3. The change-of-basis matrices are
ϕ = ψ =

1 0 0 0
0 1 0 0
0 1 1 1
0 1 0 1

and χ =

1 0 0 0
0 1 0 0
0 0 1 1
0 1 0 1

, (47)
We observe that χ = ϕ−1 = ψ−1, so we have an alternative-basis algorithm that supports chaining. From (11) we observe
that Pϕ = Pψ = 2. From (12) we observe that P χ = 2. The alternative-basis decomposition is defined by the matrices
α =

1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
1 0 1 0
0 1 1 0
0 0 1 1

, β =

1 0 0 0
0 0 1 1
0 0 1 0
0 0 0 1
0 1 0 0
0 1 1 0
1 0 1 0

, and γ =

1 1 0 0 0 0 0
0 1 1 0 1 1 0
0 1 1 0 1 0 1
0 0 0 1 1 0 0

. (48)
We readily verify that matrix-vector multiplication with the matrices α and β can be implemented using the recurrences
(13) so that Pα = Pβ = 3, and matrix-vector multiplication with the matrix γ can be implemented using (15) so that
Pγ = 6. Finally, we have that the three-tuple (χγ |αϕ, βψ ) with ϕ,ψ , χ , α , β , and γ given by (47) and (48) satisfies the
triple product property (19) with parameters ⟨2, 2, 2⟩7.
3 IMPLEMENTATION ENGINEERING
This section describes our algorithm engineering effort in more detail, with a focus on implementing the binary product
of two matrices; our implementation of Boolean products is described more concisely at the end of this section. We
strive for generality of exposition in terms of the target platform and in terms of enabling the use of future advances in
specific alternative-basis decompositions beyond our new decompositions presented in §2.2 and §2.3.
The high-level exposition withstanding, our goal with the present engineering effort is that a careful low-level
implementation of our engineering framework will be able to utilize specific hardware configurations across a range of
configurations efficiently, which we seek to demonstrate in our experiments across a range of current platforms in
the subsequent section. Furthermore, we would like to highlight that detailed low-level parameterization and specific
optimizations for current target platforms can always be found in the accompanying open-source release.
3.1 The Family of Target Platforms
Let us recall from the introduction that we seek generality towards a family of target platforms that consist of a single
shared-memory compute node (the host) equipped with N independent and identical vectorized accelerator devices
(the accelerators), each joined to the host by a low-bandwidth interconnect compared with the bandwidth available at
each device.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 15
While the detailed parameters of such a configuration are expected to vary and evolve over time, this overall topology
of the configuration—that is, a large-capacity shared-memory host joined by a low-bandwidth interconnect to N
independent high-bandwidth accelerators with restricted memory capacity and vectorized parallelism based on thread
arrays—can perhaps be expected to remain more stable over time and thus warrant engineering attention with the
goal of providing a design that can be fine-tuned to the varying parameters of each specific configuration. A modern
typical configuration of this type is a one-or-more-CPU-based shared-memory host joined by a peripheral component
interconnect to N GPU accelerators.
A further specific goal in our engineering effort is to enable consideration of input sizes that are close to the
shared-memory capacity of the host. Accordingly, our present design at the host level has been structured for a low
shared-memory footprint in terms of working memory, but with the understanding that, memory permitting, a faster
design can be obtained by following a series-parallel strategy at the host, similarly to the present framework employed
at each accelerator.
One further design choice in the present framework is that we assume that the accelerators work independently
and asynchronously, apart from synchronization enforced by the host. This choice scopes out platforms where the
accelerators may be joined to each other by a fast interconnect; indeed, making use of such an interconnect would
require at least partial synchronization between the accelerators.
3.2 Accelerators and Thread Arrays
Let us now review our more detailed conventions for working with components of a target platform, starting with the
accelerators. We assume the accelerator devices are vector processors designed to execute a large number of threads in
parallel, working asynchronously with one or more arrays of data residing in accelerator memory. Each of these data
arrays has a shape, which makes it convenient to assume that the array of threads working on the data has a shape to
structure the workload.
More precisely, we say that a parallel workload of L = L1L2 · · · Lr asynchronous threads for positive integers
L1,L2, . . . ,Lr is a thread array of volume L and shape
L1 × L2 × · · · × Lr .
This assumption enables us to index a thread t = 0, 1, . . . ,L − 1 inside a thread array of volume L alternatively by its
linear index t or by the unique tuple (t1, t2, . . . , tr ) ∈ [L1] × [L2] × · · · × [Lr ] with
t = t1L2L3 · · · Lr + t2L3L4 · · · Lr + . . . + tr−1Lr + tr (49)
for compatibility with our convention for linearization of data arrays (16). Workloads based on thread arrays are also
essentially immediately translatable to modern GPU platforms.
With current and envisaged future platforms in mind, a key property for a thread-array based design is that of
coalescence. We say that a thread array of shape L1 × L2 × · · · × Lr is coalescent up to c modes if any two threads
threads whose tuple-indices agree in all but least significant c modes Lr−c+1,Lr−c+2, . . . ,Lr execute an identical stream
of instructions; furthermore, whenever an instruction in the stream is a memory access, either a single address or
consecutive addresses of words in accelerator memory are accessed across consecutive threads in the c least significant
modes. Here it is important to note that we do not assume that the threads in a thread array execute synchronously,
although the underlying hardware in most cases has a vectorized structure that is optimized for synchronized execution
of the workload in groups of V coalescent threads for a hardware-dependent parameter V .
16 Matti Karppa and Petteri Kaski
Further engineering principles for accelerators include the following:
(1) work with a large enough L to expose sufficient parallelism to saturate the accelerator hardware;
(2) seek coalescent execution by careful design and ordering of the modes of the thread array and the modes of the
data tensors; and
(3) make sure each thread in the array works with enough local data to make use of low-latency and high-bandwidth
storage available to each thread or groups of threads, for example, in the form of per-thread registers and
per-thread-group cache memory.
3.3 The Host and Coordinating the Work
The role of the host is to coordinate the work of the N accelerators through multithreading and synchronization
primitives at the host. Furthermore, our assumption is that the input and the output require memory capacity in excess
of what is available at the accelerators, thus requiring the host to prepare workloads for the accelerators and aggregate
the results obtained from the accelerators. We also recall that we are seeking a design that has a low working-memory
footprint at the host to enable processing of large inputs.
Our engineering principles for the host-side implementation include the following:
(1) design for efficient use of the host-side memory hierarchy; each host-thread works with at least one cache line
of data at a time, utilizing the cache hierarchy available;
(2) the N accelerators are coordinated asynchronously in parallel so that each accelerator has its own pipeline
implemented using host-side threads, with synchronization on buffers and submatrices of the output to avoid
data races;
(3) each of the N accelerator-pipelines involves four stages, implemented using dedicated host-threads: (i) prepare
left input to accelerator in host memory, (ii) prepare right input to accelerator in host memory, (iii) upload input
from host memory to accelerator, compute at accelerator, download result from accelerator to host memory, and
(iv) aggregate the result in host memory to output;
(4) the accelerators must be supplied with extensive-enough workloads if possible to hide the running time of the
other stages of the pipeline behind the accelerator-side compute on the workloads;
(5) the accelerator memory and the available low bandwidth between the host and the accelerators constrain the
size of the workloads, which we ease by adopting a series/parallel approach on the accelerators; and
(6) the memory budget at the host is tight due to our goal of scaling to large inputs and the need to buffer of
workloads for N accelerators in host memory—we use low-memory-footprint designs to prepare and aggregate
the workloads, including in-place basis changes and other primitives help with the memory budget.
3.4 The High-Level Design
We are now ready for a high-level exposition of the algorithm design engineered for our target platforms. Once the
high-level description is done, we proceed to review more detailed implementation considerations. We will use the
alternative-basis framework of §2.
Let d be a positive integer that represents the depth of the algorithm design in terms of the number of three-tuples of
matrices that satisfy the triple-product property (19) used to structure the arithmetic. The exact implementation of this
arithmetic will vary somewhat from the ideal design (35), (36), (37), (38), and (39) given in §2, in particular due to storage
and bandwidth considerations both at the host and at the accelerators, and due to the need to supply independent
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 17
and asynchronous parallel workloads to each of the N accelerators. Each accelerator will in turn work both serially
(recursively) and in parallel with its specific subproblems delegated to it by the host.
We will split d according to the high-level layers of the design. We work with four layers:
(1) host layer,
(2) accelerator serial layer,
(3) accelerator parallel layer, and
(4) accelerator inner layer.
It will be convenient to indicate the layers in our notation with the symbols “h” (host), “s” (accelerator serial), “p”
(accelerator parallel), and “i” (inner). Accordingly, we have
d = d(h) + d(s) + d(p) + d(i) ,
where the nonnegative integers d(h), d(s), d(p), and d(i) indicate the number of decomposition matrices employed at
each high-level layer.
For each type of layer t ∈ {h, s, p, i} and each level ℓ = 1, 2, . . . ,d(t) inside a layer, let the three-tuple(
χ (t, ℓ)γ (t, ℓ)
α (t, ℓ)ϕ(t, ℓ), β (t, ℓ)ψ (t, ℓ))
satisfy the triple product property with parameters
〈
s
(t)
ℓ
, t
(t)
ℓ
,u
(t)
ℓ
〉
r (t)
ℓ
, where χ (t, ℓ), ϕ(t, ℓ), and ψ (t, ℓ) are invertible
matrices.
Our task is to multiply a given matrix A′ of shape s × t with a given matrix B′ of shape t × u to yield as output the
product matrix C ′ = A′B′ of shape s × u, where
s = s
(h)
1 · · · s
(h)
d (h)
· s(s)1 · · · s
(s)
d (s)
· s(p)1 · · · s
(p)
d (p)
· s(i)1 · · · s
(i)
d (i)
,
t = t
(h)
1 · · · t
(h)
d (h)
· t (s)1 · · · t
(s)
d (s)
· t (p)1 · · · t
(p)
d (p)
· t (i)1 · · · t
(i)
d (i)
,
u = u
(h)
1 · · ·u
(h)
d (h)
·u(s)1 · · ·u
(s)
d (s)
·u(p)1 · · ·u
(p)
d (p)
·u(i)1 · · ·u
(i)
d (i)
.
(50)
The following subsectionswill expose themore detailed algorithm design.Wewill first present the flow of computation
from the host to the accelerators and back to the host one layer at a time, and then present in detail how the host
coordinates its own work and the work of the accelerators through host-side threading and appropriate synchronization
primitives. Finally, we will parameterize the algorithm implementation.
Our expositionary focus will be on the implementation of the alternative-basis phase (36), (37), and (38) of the
algorithm, with less attention and optimization effort devoted to the pre-and-postprocessing phases of data permutation
and the basis changes at the host; the latter can be found in the accompanying source code.
Linearization and indexing conventions recalled. At this point it is convenient to recall that we tacitly employ the
first-index-major linearization convention (16) when linearizing bit arrays to words of memory; that is, changing the
most significant (leftmost) indices in a tuple of indices causes the largest stride in memory addressing, and the least
significant (rightmost) indices will index bits inside a word of memory as appropriate. We follow a similar convention
when linearizing thread arrays on accelerators (49); that is, hardware vectorization occurs with the least significant
(rightmost) indices of a tuple indexing a thread in an array of asynchronous threads that implements a workload on an
accelerator. We assume that the accelerator hardware computes with vectors ofV synchronous threads with consecutive
linear indices and that each thread can load and storeW -bit words.
18 Matti Karppa and Petteri Kaski
3.5 Input Permutation at the Host
To obtain coalesecent execution of the layered design on vectorized accelerator hardware, it will be convenient to
permute the input matrices A′ and B′ from the s × t and t × u layouts with (50) into the interleaved-layout vectors A
and B of lengths
s
(h)
1 t
(h)
1 · · · s
(h)
d (h)
t
(h)
d (h)
s
(s)
1 t
(s)
1 · · · s
(s)
d (s)
t
(s)
d (s)
s
(p)
1 t
(p)
1 · · · s
(p)
d (p)
t
(p)
d (p)
s
(i)
1 t
(i)
1 · · · s
(i)
d (i)
t
(i)
d (i)
(51)
and
t
(h)
1 u
(h)
1 · · · t
(h)
d (h)
u
(h)
d (h)
t
(s)
1 u
(s)
1 · · · t
(s)
d (s)
u
(s)
d (s)
t
(p)
1 u
(p)
1 · · · t
(p)
d (p)
u
(p)
d (p)
t
(i)
1 u
(i)
1 · · · t
(i)
d (i)
u
(i)
d (i)
, (52)
respectively. This permutation is executed in host memory, using appropriate parallelization, vectorization primitives,
and cache-blocking at the host for efficiency. For uniform implementation of Boolean and binary matrix multiplication,
our low-level implementation also transposes the right-hand side operand at this point by transposing the roles of t
and u in (52). The innermost transpose is implemented for 64 × 64 submatrices recursively using word operations on
32-bit words, cf. [40].
3.6 Input Basis Change at the Host
After input permutation, we execute basis changes Aˆ← ϕ¯A and Bˆ ← ψ¯B at the host. Both basis changes are executed
in parallel using Yates’s algorithm (30) on d levels. The innermost d(i) levels employ the identity basis change and thus
are omitted; indeed, we work with the standard-basis cubic multiplication algorithm in the accelerator inner layer in
§3.10, so no basis changes will be required.
3.7 Instance Generation at the Host
We now proceed to describe the sub-instances that the host generates and forwards to the N accelerators for processing.
In particular, we assume that the alternative-basis inputs Aˆ and Bˆ have been prepared and reside in shared memory for
parallel access by multiple host-threads working in parallel to keep all the N accelerators saturated with work; we
postpone the precise description of the host-side threading and synchronization to §3.16.
In essence, the sub-instances are generated from Aˆ and Bˆ as the sub-arrays obtained by aggregating the linear
combinations at the d(h) first levels of the base design (36) at once, which results in increased arithmetic cost compared
with (36) implemented using Yates’s algorithm, but keeps the working memory needs low at the host, as per our
engineering goal to scale up to large inputs on a single host.
In precise terms, the sub-instance indexed by (h1,h2, . . . ,hd (h) ) ∈ [r (h)1 ] × [r
(h)
2 ] × · · · × [r
(h)
d (h)
] consists of the vectors
Tˆ
[h]
h1h2 · · ·hd (h)
←
∑
(i1,i2, ...,id (h) )∈[s
(h)
1 ]×[s (h)2 ]×···×[s (h)d (h) ]
(j1, j2, ..., jd (h) )∈[t
(h)
1 ]×[t (h)2 ]×···×[t (h)d (h) ]
α
(h,1)
h1i1 j1
α
(h,2)
h2i2 j2
· · ·α (h,d (h))hd (h) id (h) jd (h) Aˆi1 j1i2 j2 · · ·id (h) jd (h) ,
Sˆ
[h]
h1h2 · · ·hd (h)
←
∑
(j1, j2, ..., jd (h) )∈[t
(h)
1 ]×[t (h)2 ]×···×[t (h)d (h) ]
(k1,k2, ...,kd (h) )∈[u
(h)
1 ]×[u (h)2 ]×···×[u (h)d (h) ]
β
(h,1)
h1 j1k1
β
(h,2)
h2 j2k2
· · · β (h,d (h))hd (h) jd (h)kd (h) Bˆj1k1 j2k2 · · ·jd (h)kd (h) .
(53)
We compute these vectors by aggregating the sums of subvectors in (53) as they are stated, taking care to not consider
subvectors whose associated α-product (respectively, β-product) in (53) is zero. At low level, subroutines for performing
the XOR and memory copy operations on subvectors have been handcrafted to make use of the 256-bit AVX2-registers
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 19
in CPU hardware. The routines also perform reads and writes of 512 bits at a time, which corresponds to the length of
cache lines on the target CPU. We leave these low-level implementation details to the accompanying source code.
In total thus r (h)1 r
(h)
2 · · · r
(h)
d (h)
sub-instances are generated and forwarded for solution by the N accelerators. Such
solving of a sub-instance (Tˆ [h]h1h2 · · ·hd (h) , Sˆ
[h]
h1h2 · · ·hd (h)
) by an accelerator consists of (i) uploading the sub-instance to one
of the accelerators, (ii) having the accelerator compute the solution Qˆ[h]h1h2 · · ·hd (h)
, and (iii) downloading the solution
from the accelerator to shared memory at the host. We next describe how each accelerator proceeds to solve a given
sub-instance, and then describe how each downloaded solution Qˆ[h]h1h2 · · ·hd (h)
is aggregated to the master solution Cˆ at
the host.
3.8 Accelerator Serial Layer
Suppose the sub-instance (Tˆ [h]h1h2 · · ·hd (h) , Sˆ
[h]
h1h2 · · ·hd (h)
) has been uploaded to an accelerator. The accelerator then proceeds
to solve the instance using three layers. First, a serial layer works with the instance recursively, at the bottom of the
recursion switching to parallel work with three parts, the expanding parallel layer, the inner layer, and the compressing
parallel layer. We start with a description of the serial (recursive) layer. Let the uploaded instance
Tˆ [s,0] = Tˆ [h]h1h2 · · ·hd (h)
,
Sˆ[s,0] = Sˆ[h]h1h2 · · ·hd (h)
be the input to the serial layer. The serial layer consists of d(s) levels.
Each recursive invocation at level ℓ = 1, 2, . . . ,d(s) can be indexed by a tuple (h1,h2, . . . ,hℓ−1) ∈ [r (s)1 ] × [r
(s)
2 ] ×
· · · × [r (s)
ℓ−1]. In particular, the initial invocation at ℓ = 1 is indexed by the empty tuple and with input consisting of the
vectors Tˆ [s,0] and Sˆ[s,0]. More generally, the input to invocation (h1,h2, · · · ,hℓ−1) consists of vectors Tˆ [s, ℓ−1]h1h2 · · ·hℓ−1 and
Sˆ
[s, ℓ−1]
h1h2 · · ·hℓ−1 . For hℓ ∈ [r
(s)
ℓ
], the invocation computes the vectors
Tˆ
[s, ℓ]
h1h2 · · ·hℓ ←
∑
iℓ ∈[s (s)ℓ ]
jℓ ∈[t (s)ℓ ]
α
(s, ℓ)
hℓ iℓ jℓ
Tˆ
[s, ℓ−1]
h1h2 · · ·hℓ−1iℓ jℓ , Sˆ
[s, ℓ]
h1h2 · · ·hℓ ←
∑
jℓ ∈[t (s)ℓ ]
kℓ ∈[u (s)ℓ ]
β
(s, ℓ)
hℓ iℓ jℓ
Sˆ
[s, ℓ−1]
h1h2 · · ·hℓ−1iℓ jℓ ,
(54)
and makes the recursive invocation with index (h1,h2, . . . ,hℓ) and input consisting of the vectors Tˆ [s, ℓ]h1h2 · · ·hℓ and
Sˆ
[s, ℓ]
h1h2 · · ·hℓ , obtaining as return value and storing the solution Qˆ
[s, ℓ]
h1h2 · · ·hℓ . At the bottom level of recursion when ℓ = d
(s),
the expanding parallel layer (described in the following section) is invoked with the input, and the return value is
obtained from the output of that layer. Once all r (s)
ℓ
return values Qˆ[s, ℓ]h1h2 · · ·hℓ for hℓ ∈ [r
(s)
ℓ
] are available, the invocation
computes and returns the value Qˆ[s, ℓ−1]h1h2 · · ·hℓ−1 defined for all iℓ ∈ [s
(s)
ℓ
] and jℓ ∈ [t (s)ℓ ] by
Qˆ
[s, ℓ−1]
h1h2 · · ·hℓ−1iℓ jℓ ←
∑
hℓ ∈[r (s)ℓ ]
γ
(s, ℓ)
iℓ jℓhℓ
Qˆ
[s, ℓ−1]
h1h2 · · ·hℓ . (55)
The array-arithmetic in (54) and (55) is implemented with thread arrays on the accelerator so that each thread produces
oneW -bit word of output using optimized straight-line programs for matrix-vector multiplication with the α , β , and γ
matrices.
20 Matti Karppa and Petteri Kaski
3.9 Accelerator Expanding Parallel Layer
We now turn to the layer that implements Yates’s algorithm with each of the α and β matrices, as follows. Each input to
the expanding parallel layer arrives from the bottom level of recursion in the serial layer. Let us write Tˆ [p,0] and Sˆ[p,0]
for the input vectors to the parallel layer. These vectors have lengths
s
(p)
1 t
(p)
1 s
(p)
2 t
(p)
2 · · · s
(p)
d (p)
t
(p)
d (p)
s
(i)
1 t
(i)
1 s
(i)
2 t
(i)
2 · · · s
(i)
d (i)
t
(i)
d (i)
and t (p)1 u
(p)
1 t
(p)
2 u
(p)
2 · · · t
(p)
d (p)
u
(p)
d (p)
t
(i)
1 u
(i)
1 t
(i)
2 u
(i)
2 · · · t
(i)
d (i)
u
(i)
d (i)
,
respectively. We describe the processing of the input Tˆ [p,0] only, with the understanding that the processing of Sˆ[p,0] is
similar but utilizes the β matrices in place of the α matrices.
Let us write α¯ [p] for the matrix
α¯ [p] = α (p,1) ⊗ α (p,2) ⊗ · · · ⊗ α (p,d (p)) ⊗ I
s (i)1 t
(i)
1 s
(i)
2 t
(i)
2 · · ·s (i)d (i) t
(i)
d (i)
and decompose the matrix using Yates’s decomposition (28) with respect to the reversal permutation as
α¯ [p] = α¯ [p,d (p)]α¯ [p,d (p)−1] · · · α¯ [p,1] .
The parallel layer proceeds to compute α¯ [p]Tˆ [p,0] by a sequence of matrix-vector multiplications Tˆ [p, ℓ] ← α¯ [p, ℓ]T [p, ℓ−1]
for ℓ = 1, 2, . . . ,d(p). Each multiplication in the sequence is implemented with vectorization usingW -bit words6 and
parallelization via a thread array of
r
(p)
1 r
(p)
2 · · · r
(p)
ℓ−1s
(p)
ℓ+1t
(p)
ℓ+1s
(p)
ℓ+2t
(p)
ℓ+2 · · · s
(p)
d (p)
t
(p)
d (p)
s
(i)
1 t
(i)
1 s
(i)
2 t
(i)
2 · · · s
(i)
d (i)
t
(i)
d (i)
/W
threads working asynchronously, so that each thread loads s(p)
ℓ
t
(p)
ℓ
words from the array Tˆ [p, ℓ−1] to its local registers,
word-bit-parallel multiplies in registers withW copies of the matrix α (p, ℓ) using an optimized straight-line program,
and saves r (p)
ℓ
words to the array Tˆ [p, ℓ]. We illustrate the design in pseudocode with Algorithm 1, where we assume that
w is a nonnegative integer withW = s(i)
d (i)−w+1t
(i)
d (i)−w+1 · · · s
(i)
d (i)
t
(i)
d (i)
, and that the parameterization of the inner layer is
such that such aw exists; this will be the case in what follows.
The output of the layer consists of the expanded vectors Tˆ [p,d (p)] and Sˆ[p,d (p)] of lengths
r
(p)
1 r
(p)
2 · · · r
(p)
d (p)
s
(i)
1 t
(i)
1 s
(i)
2 t
(i)
2 · · · s
(i)
d (i)
t
(i)
d (i)
and r (p)1 r
(p)
2 · · · r
(p)
d (p)
t
(i)
1 u
(i)
1 t
(i)
2 u
(i)
2 · · · t
(i)
d (i)
u
(i)
d (i)
,
respectively, which are in turn given as input to the inner layer.
3.10 Accelerator Inner Layer
The inner layer is the most performance-critical part of the design since it works with the expanded vectors and thus
with the most data aggregated over the execution of the algorithm. The inner layer takes as input two vectors Tˆ [i]
and Sˆ[i]. The output of the layer is the product
Qˆ[i] = γ¯ [i]
(
α¯ [i]Tˆ [i] ⊙ β¯ [i]Sˆ[i]) ,
6We tacitly assume in what follows thatW divides the product s (i)1 t
(i)
1 s
(i)
2 t
(i)
2 · · · s (i)d (i) t
(i)
d (i) . Indeed,W is in most cases a power of two, and the product
can be similarly chosen to be a power of two by careful design of the inner layer.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 21
Algorithm 1 The accelerator expanding parallel layer illustrated in pseudocode. This pseudocode illustrates the
procedure implemented using a thread array for level ℓ = 1, 2, . . . ,d(p) that takes as input Tˆ [p, ℓ−1] and yields the output
Tˆ [p, ℓ]. Each thread in the array reads s(p)
ℓ
t
(p)
ℓ
words to its local registers, multiplies with the matrix α (p, ℓ) in registers,
and writes r (p)
ℓ
words. This procedure illustrates operation for the left input only; the procedure for the right input is
similar.
1: procedure AcceleratorExpandingParallelLeft(Tˆ [p, ℓ−1], ℓ)
2: parallel for thread
(h(p)1 , . . . , h
(p)
ℓ−1, i
(p)
ℓ+1, j
(p)
ℓ+1, . . . , i
(p)
d (p)
, j (p)
d (p)
, i (i)1 , j
(i)
1 , . . . , i
(i)
d (i)−w , j
(i)
d (i)−w ) ∈
[r (p)1 ] × · · · × [r
(p)
ℓ−1] × [s
(p)
ℓ+1] × [t
(p)
ℓ+1] × · · · × [s
(p)
d (p)
] × [t (p)
d (p)
] × [s (i)1 ] × [t (i)1 ] × · · · × [s (i)d (i)−w ] × [t
(i)
d (i)−w ]
do
3: for (i (p)
ℓ
, j (p)
ℓ
) ∈ [s (p)
ℓ
] × [t (p)
ℓ
] do
4: Tˆ [local, ℓ−1]
i (p)
ℓ
j (p)
ℓ
← Tˆ [p, ℓ−1]
h(p)1 ···h
(p)
ℓ−1i
(p)
ℓ
j (p)
ℓ
i (p)
ℓ+1 j
(p)
ℓ+1 ···i
(p)
d (p)
j (p)
d (p)
i (i)1 j
(i)
1 ···i
(i)
d (i)−w
j (i)
d (i)−w
5: end for
6: Tˆ [local, ℓ] ← α (p, ℓ)Tˆ [local, ℓ−1] [[ Implemented with word-bit-operations using an optimized
straight-line program for multiplying the r (p)
ℓ
× s (p)
ℓ
t (p)
ℓ
matrix α (p, ℓ)
with the s (p)
ℓ
t (p)
ℓ
vector Tˆ [local, ℓ−1], such as (7) or (13). ]]
7: for h(p)
ℓ
∈ [r (p)
ℓ
] do
8: Tˆ [p, ℓ]
h(p)1 ···h
(p)
ℓ−1h
(p)
ℓ
i (p)
ℓ+1 j
(p)
ℓ+1 ···i
(p)
d (p)
j (p)
d (p)
i (i)1 j
(i)
1 ···i
(i)
d (i)−w
j (i)
d (i)−w
← Tˆ [local, ℓ]
h(p)
ℓ
9: end for
10: end for
11: end procedure
where
α¯ [i] = I
r (p)1 r
(p)
2 · · ·r (p)d (p)
⊗ α (i,1) ⊗ α (i,2) ⊗ · · · ⊗ α (i,d (i)) ,
β¯ [i] = I
r (p)1 r
(p)
2 · · ·r (p)d (p)
⊗ β (i,1) ⊗ β (i,2) ⊗ · · · ⊗ β (i,d (i)) ,
γ¯ [i] = I
r (p)1 r
(p)
2 · · ·r (p)d (p)
⊗ γ (i,1) ⊗ γ (i,2) ⊗ · · · ⊗ γ (i,d (i)) .
As the inner layer works with the least significant bit positions of the bit vectors, in most cases the implementation of the
inner layer is hardware-specific and amounts to making the best possible use of the available bit- and word-operations
in the instruction set as well as the available vectorization and associated vector-shuffling instructions.
For example, withW -bit words and length-V hardware vectorization of thread arrays withM2 =WV , one possibility
to implement the inner layer is to perform r (p)1 r
(p)
2 · · · r
(p)
d (p)
independentM ×M binary matrix multiplications in parallel
using the elementary cubic algorithm implemented with word operations and a thread array of r (p)1 r
(p)
2 · · · r
(p)
d (p)
V threads,
so that each thread works with oneW -bit-word-size fragment from each of theM ×M operand matrices and the result
matrix, with hardware V -vector shuffle instructions used to communicate words between threads. This is essentially
how our low-level implementation of the inner layer is structured, though some optimizations remain to be discussed
in §3.15. This part of the framework is also the most sensitive to changes in underlying hardware, and so is perhaps the
least likely to withstand the test of time.
22 Matti Karppa and Petteri Kaski
3.11 Accelerator Compressing Parallel Layer
Once the inner layer is complete, its output Qˆ[i] is given as input to a further parallel layer that implements Yates’s
algorithm with the γ matrices, as follows. Let us write Qˆ[p,d (p)] for the input vector to the compressing parallel layer.
Similarly to the expanding layer, let us write γ¯ [p] for the matrix
γ¯ [p] = γ (p,1) ⊗ γ (p,2) ⊗ · · · ⊗ γ (p,d (p)) ⊗ I
s (i)1 u
(i)
1 s
(i)
2 u
(i)
2 · · ·s (i)d (i)u
(i)
d (i)
.
and decompose the matrix using Yates’s decomposition (28) with respect to the identity permutation as
γ¯ [p] = γ¯ [p,1]γ¯ [p,2] · · · γ¯ [p,d (p)] .
The parallel layer proceeds to compute γ¯ [p]Qˆ[p,d (p)] by a sequence of matrix-vector multiplications Qˆ[p, ℓ−1] ←
γ¯ [p, ℓ]Q[p, ℓ] for ℓ = d(p),d(p) − 1, . . . , 1. Each multiplication in the sequence is implemented with vectorization using
W -bit words and parallelization via a thread array of
r
(p)
1 r
(p)
2 · · · r
(p)
ℓ−1s
(p)
ℓ+1u
(p)
ℓ+1s
(p)
ℓ+2u
(p)
ℓ+2 · · · s
(p)
d (p)
u
(p)
d (p)
s
(i)
1 u
(i)
1 s
(i)
2 u
(i)
2 · · · s
(i)
d (i)
u
(i)
d (i)
/W
threads working asynchronously, so that each thread loads r (p)
ℓ
words from the array Qˆ[p, ℓ] to its local registers,
word-bit-parallel multiplies in registers withW copies of the matrix γ (p, ℓ) using an optimized straight-line program, and
stores s(p)
ℓ
u
(p)
ℓ
words to the array Qˆ[p, ℓ−1]. We illustrate the design in pseudocode with Algorithm 2, where, analogously
to Algorithm 1, we assume thatw is a nonnegative integer withW = s(i)
d (i)−w+1u
(i)
d (i)−w+1 · · · s
(i)
d (i)
u
(i)
d (i)
.
The output of the layer consists of the compressed vector Qˆ[p,0] of length
s
(p)
1 u
(p)
1 s
(p)
2 u
(p)
2 · · · s
(p)
d (p)
u
(p)
d (p)
s
(i)
1 u
(i)
1 s
(i)
2 u
(i)
2 · · · s
(i)
d (i)
u
(i)
d (i)
,
which is in turn given as output to the serial layer.
3.12 Instance Aggregation at the Host
The aggregation of a solution obtained at an accelerator proceeds as follows at the host, again with a focus on obtaining
a design with a low working memory footprint in host memory at the price of increased arithmetic cost. Before
any solutions are accepted, the output vector Cˆ is initialized to all-0 values. Suppose that the solution Qˆ[h]h1h2 · · ·hd (h)
for (h1,h2, . . . ,hd (h) ) ∈ [r (h)1 ] × [r
(h)
2 ] × · · · × [r
(h)
d (h)
] has been downloaded from an accelerator to host memory. We
then execute the following aggregation procedure. For each (i1, i2, . . . , id (h) ) ∈ [s(h)1 ] × [s
(h)
2 ] × · · · × [s
(h)
d (h)
] and each
(k1,k2, . . . ,kd (h) ) ∈ [u(h)1 ] × [u
(h)
2 ] × · · · × [u
(h)
d (h)
], whenever it holds that γ (h,1)i1k1h1γ
(h,2)
i2k2h2
· · ·γ (h,d (h))id (h)kd (h)hd (h) = 1, then
aggregate the solution to the output by
Cˆi1k1i2k2 · · ·id (h)kd (h) ← Cˆi1k1i2k2 · · ·id (h)kd (h) + Qˆ
[h]
h1h2 · · ·hd (h)
. (56)
Whenmultiple host-threads execute aggregation in parallel, appropriate synchronization primitives need to be employed
to avoid conflicts between threads when executing (56). This will be described in detail in §3.16.
3.13 Output Basis Change at the Host
After the alternative-basis output Cˆ has been aggregated, we execute the basis change C ← χ¯Cˆ at the host, in parallel
using Yates’s algorithm (30) on d levels. In our implementation, the innermost d(i) levels employ the identity basis
change (since the accelerator inner-layer uses the standard-basis cubic multiplication algorithm) and thus are omitted.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 23
Algorithm 2 The accelerator compressing parallel layer illustrated in pseudocode. This pseudocode illustrates the
procedure implemented using a thread array for level ℓ = d(p),d(p) − 1, . . . , 1 that takes as input Qˆ[p, ℓ] and yields the
output Qˆ[p, ℓ−1] . Each thread in the array reads r (p)
ℓ
words to its local registers, multiplies with the matrix γ (p, ℓ) in
registers, and writes s(p)
ℓ
u
(p)
ℓ
words.
1: procedure AcceleratorCompressingParallel(Qˆ [p, ℓ], ℓ)
2: parallel for thread
(h(p)1 , . . . , h
(p)
ℓ−1, i
(p)
ℓ+1, k
(p)
ℓ+1, . . . , i
(p)
d (p)
, k (p)
d (p)
, i (i)1 , k
(i)
1 , . . . , i
(i)
d (i)−w , k
(i)
d (i)−w ) ∈
[r (p)1 ] × · · · × [r
(p)
ℓ−1] × [s
(p)
ℓ+1] × [u
(p)
ℓ+1] × · · · × [s
(p)
d (p)
] × [u (p)
d (p)
] × [s (i)1 ] × [u (i)1 ] × · · · × [s (i)d (i)−w ] × [u
(i)
d (i)−w ]
do
3: for h(p)
ℓ
∈ [r (p)
ℓ
] do
4: Qˆ [local, ℓ]
h(p)
ℓ
← Qˆ [p, ℓ]
h(p)1 ···h
(p)
ℓ−1h
(p)
ℓ
i (p)
ℓ+1k
(p)
ℓ+1 ···i
(p)
d (p)
k (p)
d (p)
i (i)1 k
(i)
1 ···i
(i)
d (i)−w
k (i)
d (i)−w
5: end for
6: Qˆ [local, ℓ−1] ← γ (p, ℓ)Qˆ [local, ℓ] [[ Implemented with word-bit-operations using an optimized
straight-line program for multiplying the s (p)
ℓ
u (p)
ℓ
× r (p)
ℓ
matrix γ (p, ℓ)
with the r (p)
ℓ
vector Qˆ [local, ℓ], such as (9) or (15). ]]
7: for (i (p)
ℓ
, k (p)
ℓ
) ∈ [s (p)
ℓ
] × [u (p)
ℓ
] do
8: Qˆ [p, ℓ−1]
h(p)1 ···h
(p)
ℓ−1i
(p)
ℓ
k (p)
ℓ
i (p)
ℓ+1k
(p)
ℓ+1 ···i
(p)
d (p)
k (p)
d (p)
i (i)1 k
(i)
1 ···i
(i)
d (i)−w
k (i)
d (i)−w
← Qˆ [local, ℓ−1]
i (p)
ℓ
k (p)
ℓ
9: end for
10: end for
11: end procedure
3.14 Output Permutation at the Host
The last layer of the framework permutes the interleaved-layout vector C of length
s
(h)
1 u
(h)
1 · · · s
(h)
d (h)
u
(h)
d (h)
s
(s)
1 u
(s)
1 · · · s
(s)
d (s)
u
(s)
d (s)
s
(p)
1 u
(p)
1 · · · s
(p)
d (p)
u
(p)
d (p)
s
(i)
1 u
(i)
1 · · · s
(i)
d (i)
u
(i)
d (i)
(57)
to an s ×u output matrixC ′ with s and u as in (50). This completes the description of the layers of the algorithm design.
3.15 Further Optimization by Merging Levels and Shifting Levels Between Layers
The previous framework can be further optimized at the accelerators by making more extensive use of the registers local
to each thread in a thread array, assuming such registers are available in sufficient quantity. This section documents
two techniques towards this end.
Merging levels. Assuming the parameters r (p)
ℓ
, s
(p)
ℓ
, t
(p)
ℓ
,u
(p)
ℓ
of the expanding levels ℓ = 1, 2, . . . ,d(p) in the expanding
layer (cf. §3.9) are small enough, one may utilize the per-thread registers and compute two consecutive levels, ℓ and
ℓ + 1, instead of only one level ℓ. In this case one works with a thread array of
r
(p)
1 r
(p)
2 · · · r
(p)
ℓ−1s
(p)
ℓ+2t
(p)
ℓ+2s
(p)
ℓ+3t
(p)
ℓ+3 · · · s
(p)
d (p)
t
(p)
d (p)
s
(i)
1 t
(i)
1 s
(i)
2 t
(i)
2 · · · s
(i)
d (i)
t
(i)
d (i)
/W
threads, so that each thread loads s(p)
ℓ
t
(p)
ℓ
s
(p)
ℓ+1t
(p)
ℓ+1 words from accelerator memory to its local registers, then multiplies
word-bit-parallel in registers withW copies of the matrix α (p, ℓ) ⊗ α (p, ℓ+1) using optimized straight-line programs
for both component matrices (that is, one implements with register arithmetic Yates’s algorithm using straight-line
programs for α (p, ℓ) and α (p, ℓ+1) for the levels ℓ and ℓ + 1 computed in registers), and finally saves r (p)
ℓ
r
(p)
ℓ+1 words
to accelerator memory. Similar merging of levels may be used for the compressing layer (cf. §3.11). Our low-level
24 Matti Karppa and Petteri Kaski
implementation uses this two-level merging for pairs of levels closest to the inner layer. In particular one wants to
optimize at levels closest to the inner layer because these layers process the longest vectors and thus do the most work.
If more per-thread registers are available, more consecutive levels can be merged and computed in registers.
Shifting levels between layers. The inner layer can use per-thread registers more efficiently by shifting expand-
ing/compressing levels to the inner layer from the expanding/compressing layers. Assuming that the parameters
of the expansion/compression are r (p)
ℓ
, s
(p)
ℓ
, t
(p)
ℓ
,u
(p)
ℓ
, and that the inner layer originally implementsM ×M standard-
basis matrix multiplication (cf. §3.10) using oneW -bit-word per thread, the inner layer with shifting (i) first loads
sℓtℓ + tℓuℓ words to each thread, (ii) expands in registers using straight-line programs for α (p, ℓ) and β (p, ℓ) to rℓ + rℓ
words of operands per thread, (iii) executes in registers rℓ independent M × M matrix multiplications (essentially
repeating the original inner layer rℓ times in registers) to obtain rℓ words of expanded results, and (iv) finally compresses
to sℓuℓ words of results per thread using a straight-line program for γ (p, ℓ) and writes these results to accelerator
memory. Our low-level implementation follows this strategy of shifting one level of expansion/compression to the
inner layer to make more efficient use of per-thread registers and to perform less accelerator-memory transactions at
the inner layer.
3.16 Coordinating the Workload
This section describes how the entire workload for the host and the N accelerators is coordinated at the host using
host-threads and appropriate synchronization.
Let us start by recalling that the top-level alternative-basis workload (Aˆ, Bˆ) 7→ Cˆ consists of r (h)1 r
(h)
2 · · · r
(h)
d (h)
sub-
instances that need to be (i) generated from Aˆ and Bˆ on the host, (ii) solved in an accelerator, and (iii) the solution
aggregated to Cˆ on the host. This suggests processing the workload with N essentially independent pipelines imple-
mented by one or more host-threads each so that each pipeline is responsible for keeping exactly one accelerator busy
with work. Synchronization between pipelines is also required because the aggregation steps (56) of different pipelines
can access and update the same subarray of Cˆ , which can lead to data races unless synchronization is used to ensure
serial updates to each subarray.
Our strategy is to implement the pipeline for each accelerator ℓ ∈ [N ] using three buffers Tℓ , Sℓ ,Qℓ in host memory
and four host-threads: one thread to prepare the left input Tℓ in host memory, one thread to prepare the right input Sℓ
in host memory, one thread to multiply Tℓ , Sℓ on accelerator ℓ to obtain the result Qℓ (including uploading the input
and downloading the result), and one thread to aggregate the result Qℓ to the result C˜ .
Pseudocode for coordinating work at the host is given in Algorithm 3. This workload of 4N host-threads is deadlock-
free due to the design of the pipeline—blocking is needed to prevent race conditions occurring on the arrays Tℓ , Sℓ ,Qℓ ,
the accelerator devices, and subarrays of the array Cˆ , but each pipeline will always be flushed eventually. We also
observe that the order in which the subproblems (h1, . . . ,hd (h)) are processed is insignificant as long as the order is
well-defined, which is guaranteed by the locks on subarrays of Cˆ . Here we have presented a 4N -thread design with a
single 4-thread pipeline for each of the N accelerators; the design can be easily extended to multiple pipelines competing
for a single accelerator device to better saturate the accelerators with data as appropriate.
3.17 Parameterizing the Implementation
Let us now turn to the detailed parameterization of the framework. Our current implementation uses the framework
in a diagonal setting based on either of the two ⟨2, 2, 2⟩7 alternative-basis decompositions developed in §2.2 and §2.3,
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 25
Algorithm 3 Procedure for coordinating work at a host joined to N accelerators. The input arrays are Aˆ and Bˆ and the
output array is Cˆ . The workload is formed of 4N threads. Each thread is associated with one of the N accelerators and a
specific part of the pipeline for this accelerator.
,
1: procedure HostCoordinate(Aˆ, Bˆ)
2: Let Tℓ for ℓ ∈ [N ] be an array of shape s (s)1 t (s)1 · · · s (s)d (s) t
(s)
d (s)s
(p)
1 t
(p)
1 · · · s
(p)
d (p)
t (p)
d (p)
s (i)1 t
(i)
1 · · · s (i)d (i) t
(i)
d (i)
3: Let Sℓ for ℓ ∈ [N ] be an array of shape t (s)1 u (s)1 · · · t (s)d (s)u
(s)
d (s) t
(p)
1 u
(p)
1 · · · t
(p)
d (p)
u (p)
d (p)
t (i)1 u
(i)
1 · · · t (i)d (i)u
(i)
d (i)
4: Let Qℓ for ℓ ∈ [N ] be an array of shape s (s)1 u (s)1 · · · s (s)d (s)u
(s)
d (s)s
(p)
1 u
(p)
1 · · · s
(p)
d (p)
u (p)
d (p)
s (i)1 u
(i)
1 · · · s (i)d (i)u
(i)
d (i)
5: Initialize Cˆ as zero
6: parallel for thread t ∈ [4N ] do
7: ℓ ← t mod N
8: for (h1, . . . , hd (h) ) ∈ [r (h)1 ] × · · · × [r (h)d (h) ] such that ℓ ≡
∑d (h)
i=1 hi
∏d (h)
j=i+1 r
(h)
j (mod N ) do
9: if 0 ≤ t ≤ N − 1 then
10: Block until Tℓ is free
11: Initialize Tℓ as zero
12: for (i1, j1, . . . , id (h), jd (h) ) ∈ [s (h)1 ] × [t (h)1 ] × · · · × [s (h)d (h) ] × [t
(h)
d (h) ] do
13: if α (h,1)h1i1 j1 · · · α
(h,d (h))
h
d (h) id (h) jd (h)
, 0 then
14: Tℓ ← Tℓ + Aˆi1 j1 ···id (h) jd (h)
15: end if
16: end for
17: Mark Tℓ occupied
18: else if N ≤ t ≤ 2N − 1 then
19: Block until Sℓ is free
20: Initialize Sℓ as zero
21: for (j1, k1 . . . , jd (h), kd (h) ) ∈ [t (h)1 ] × [u (h)1 ] × · · · × [t (h)d (h) ] × [u
(h)
d (h) ] do
22: if β (h,1)h1 j1k1 · · · β
(h,d (h))
h
d (h) jd (h)kd (h)
, 0 then
23: Sℓ ← Sℓ + Bˆj1k1 ···jd (h)kd (h)
24: end if
25: end for
26: Mark Sℓ occupied
27: else if 2N ≤ t ≤ 3N − 1 then
28: Block until Tℓ and Sℓ are occupied and Qℓ is free
29: Upload Tℓ and Sℓ to accelerator ℓ
30: Mark Tℓ and Sℓ free
31: Multiply Tℓ and Sℓ on accelerator ℓ to obtain the result Qℓ
32: Download Qℓ from accelerator ℓ
33: Mark Qℓ occupied
34: else if 3N ≤ t ≤ 4N − 1 then
35: Block until Qℓ is occupied
36: for (i1, k2, . . . , id (h), kd (h) ) ∈ [s (h)1 ] × [u (h)1 ] × · · · × [s (h)d (h) ] × [u
(h)
d (h) ] do
37: if γ (h,1)i1k1h1 · · ·γ
(h,d (h))
i
d (h)kd (h)hd (h)
, 0 then
38: Acquire the lock for Cˆi1k1 ···id (h)kd (h)
39: Cˆi1k1 ···id (h)kd (h) ← Cˆi1k1 ···id (h)kd (h) +Qℓ
40: Release the lock for Cˆi1k1 ···id (h)kd (h)
41: end if
42: end for
43: Mark Qℓ free
44: end if
45: end for
46: end for
47: end procedure
26 Matti Karppa and Petteri Kaski
respectively. Accordingly, we set
sd
(h)
j = s
d (s)
j = s
d (p)
j = 2 ,
td
(h)
j = t
d (s)
j = t
d (p)
j = 2 ,
ud
(h)
j = u
d (s)
j = u
d (p)
j = 2 ,
rd
(h)
j = r
d (s)
j = r
d (p)
j = 7 .
When more efficient decompositions are discovered, these can be immediately used in the present framework.
The present accelerator hardware has V -length vectorization of threads with V = 32, and the maximum word length
supported per thread for memory transactions isW = 128 bits.7 We structure the inner layer to work with M ×M
matrices,M = 64, so thatM2 = VW and theM ×M multiplication can work with instructions for shuffling data between
threads across each V -length vector of threads in a thread array. Accordingly, we let the inner layer consists of a single
level with
d(i) = 1, s(i)1 = t
(i)
1 = u
(i)
1 = M .
This level works in the standard basis and does not use basis changes.
The setting of the parameters d(h), d(s), and d(p) for best performance depends on the details of the target platform,
such as the communication bandwidth available between the host and the accelerators, and the memory capacities in
each component. Here we will give a discussion of considerations for parameterization that we expect to generalize and
withstand the test of time. In the present case we have up to N = 8 accelerator devices available in the host.
The following considerations affect the detailed parameterization:
(1) The host layer should consist of as few levels d(h) as possible since the N accelerators with their parallel capacity
have far superior aggregate instruction bandwidth and they should shoulder the bulk of the workload. Constraints
on decreasing d(h) include the memory capacity at the host for maintaining the arrays Tℓ , Sℓ ,Qℓ associated
with each of the N host-side accelerator pipelines, as well as the bandwidth of the communication interconnect
between the host and the accelerators—large subinstances and their solutions take more time to transfer over the
interconnect, but on the other hand take more time to solve on the accelerator, leaving the interconnect idle and
thus enabling the hiding of transfer latency. When working with inputs close to the memory capacity of the host
as per our engineering goal, the memory capacity at the host is the primary constraint that determines how
many levels need to be performed at the host layer.
(2) The accelerator serial layer in its d(s) levels enables each accelerator to accept larger inputs than would be
otherwise possible by resorting to the accelerator parallel layer only. That is, because of limited memory available
at the accelerator, the data-expanding/-compressing accelerator parallel layers can only be invoked on input
sizes that tolerate a constant-factor expansion in the size of the data at each level, up to the inner layer. The serial
layer is more space-efficient by working through the r (s)
ℓ
cases at each level recursively and serially, reusing
space, but becomes time-inefficient when the size of a case becomes so small that its parallel processing does not
saturate the parallel hardware at the accelerator.
(3) The accelerator parallel layer should consist of sufficiently many levels d(p) to saturate the parallel hardware of
the accelerator and to enable utilization of the optimizations in §3.15.
7The latter implemented with instructions that work with four consecutive 32-bit words.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 27
The setting of the parametersd(h),d(s), andd(p) that gives the fastest overall performancemay be done by first identifying
the memory-capacity-induced constraints on the parameters and then, subject to these constraints, empirically finding
the best parameter combination that optimizes performance among the relatively few choices that remain.
For the detailed target platforms considered in what follows, subproblems of shape 65536 × 65536 (512 MiB per
matrix) are small enough to be accommodated in the host-side arrays Tℓ , Sℓ ,Qℓ for ℓ ∈ [N ] alongside the input arrays.
For example, with subproblems of shape 65536 × 65536 = 216 × 216 delegated to the accelerators, an input of shape
1048576 × 1048576 = 220 × 220 (1 Tib; or what is the same, 128 GiB per matrix) will perform d(h) = 20 − 16 = 4 levels in
the host layer using a ⟨2, 2, 2⟩7-decomposition. On an input of shape 65536 × 65536 = 216 × 216, we parameterize the
layers on the accelerators so that each accelerator performs two levels in the accelerator serial layer (d(s) = 2) using a
⟨2, 2, 2⟩7-decomposition, and then proceeds with the accelerator parallel layer. That is, the parallel layer is started with
inputs of shape 16384 × 16384 = 214 × 214 (32 MiB per matrix).
3.18 Boolean Multiplication
We also provide reference implementations for both Boolean and binary multiplication using the elementary cubic
algorithm distributed to the accelerators. For an s × t by t ×u multiplication to yield an s ×u product, the N accelerators
each work with subproblems of size s × t by t × u such that s divides s , t divides t , and u divides u. That is, the total
workload of volume stu (in precise terms, stu+s(t−1)u bit operations) gets executed on the accelerators so that each unit
of work for an accelerator has volume stu. Each accelerator executes a thread array of shape sM ×
u
M × M ·MW such that
the least significantM2/W threads are responsible for aggregating oneM ×M block of the s × u output, with low-level
vectorization and vector shuffle operations used for each work unit of volumeM3. Empirically, s = t = u = 131072 and
M = 64 give best parameterization for our target hardware withW = 128 and V = M ·MW = 32. Coordination at the host
is simplified compared with Algorithm 3; namely, we partition the s × u result matrix to disjoint segments of size s × u,
and set up threads on the host so that a single thread is responsible for each segment, which eliminates the need for
synchronization between threads to guard against race conditions. In addition, we set up each host thread so that it
integrates each s × u subresult it has downloaded from an accelerator in parallel when the next subproblem is being
solved on the accelerator; this effectively hides the latency of integrating each subresult because the time to solve each
subproblem is greater than the time to integrate the result.
4 EXPERIMENTS
This section describes the experiments we ran to evaluate the performance of the framework described in the previous
section. In particular, our goal is to observe that a careful implementation of the framework enables efficient use of
current multiple-accelerator systems at input sizes that are close to the host memory capacity of the entire system.
The implementation was written in C++ [16], and the accelerator device routines were written in CUDA C [28]. Host-
level parallelization was prepared using the OpenMP API [30]. We ran three sets of experiments: (i) alternative-basis
binary matrix multiplication, (ii) classical binary matrix multiplication, and (iii) Boolean matrix multiplication.
The single-accelerator versionwas evaluated at all powers of two until the subproblem size for themultiple-accelerator
case. The multiple-accelerator case was evaluated up to one terabinary-bit (1 Tib) input size (n = 220).
4.1 Hardware configuration
The target hardware for our implementation was an NVIDIA DGX-1 system with eight NVIDIA Tesla V100 accelerators,
512 GiB of memory, and 2 × 20 CPU cores. We also ran experiments on two less powerful configurations: One with
28 Matti Karppa and Petteri Kaski
Table 1. Comparison of the hardware used in our experiments.
System CPUs CPU RAM GPUs RAM/GPU
Dell PowerEdge
C4130
2 × 6-core Xeon
E5 2620v3 2.50 GHz
(Haswell)
8 × 16 GiB = 128 GiB
DDR4-2133
4 × 2 Tesla K80 12 GiB GDDR5
Dell PowerEdge
C4130
2 × 12-core Xeon
E5-2680v3 2.50 GHz
(Haswell)
16×16 GiB = 256 GiB
DDR4-2400
4 × Tesla P100 16 GiB HBM2
NVIDIA DGX-1 2×20-core Xeon E5-
2698v4 (Broadwell)
16×32 GiB = 512 GiB
DDR4-2133
8 × Tesla V100
SXM2
16 GiB HBM2
Table 2. Peak performance of the GPU accelerators used in our experiments. This presents an upper bound on the number of bit
operations that could theoretically be achieved by the hardware.
GPU #Cores Boost clock Peak performance
4 × Tesla K80 [26] 8 · 2496 875 MHz 8 · 2496 · 875 · 106 · 32 ≈ 5.60 · 1014 bops
4 × Tesla P100 [27] 4 · 3584 1480 MHz 4 · 3584 · 1480 · 106 · 32 ≈ 6.79 · 1014 bops
8 × Tesla V100 [29] 8 · 5120 1530 MHz 8 · 5120 · 1530 · 106 · 32 ≈ 2.01 · 1015 bops
four NVIDIA Tesla P100 accelerators, 256 GiB of memory, and 2 × 12 CPU cores, and one with four NVIDIA Tesla K80
accelerators, 128 GiB of memory, and 2 × 6 CPU cores. A more detailed description of the hardware configurations is
given in Table 1.
Table 2 lists technical details of the accelerators: the number of cores, boost clock speed, and the peak performance
that presents an upper bound on the number of bit operations (bops) that could be theoretically achieved by simultaneous
use of all the accelerators in each configuration. In practice, this upper bound is unattainable due to thermal effects and
the need to coordinate data across the storage hierarchy ranging from per-thread registers to the global memory of each
accelerator. Yet this upper bound presents an uncompromising benchmark against which to measure the performance
of the actual implementation.
The power consumption of the systems used in our experiments is listed in Table 3. The table lists the nominal power
consumption of the CPUs and GPUs, and the system power intake as indicated in the basic system documentation
of each vendor. While one may perhaps expect the CPUs and GPUs to operate essentially at their maximum power,
the given system power intake is perhaps expected to be higher than the actual power usage. While we would like
to be able to measure the actual energy consumption by each system and its components during the computation
more precisely, such instrumentation has been unavailable to us, so in the present experiments we resort to using
the tabulated power values as the mean power over the duration of each computation, which then gives us a crude
approximation of the energy used by each computation. In particular, we would like to highlight the importance of
implementation engineering for low energy consumption as an important goal beyond the present study, with the hope
of having in the future available more fine-grained measurement tools.
Finally, to record a rough benchmark of the performance of the different system components in our main target
configuration, measurements on the bandwith for transferring data within host memory, and between the accelerator
device and the host system are shown in Tables 4 and 5, respectively.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 29
Table 3. Power consumption of the system and the GPUs.
System CPU watts GPU watts System watts
C4130/K80 2 × 85 W 4 × 300 W 2000 W
C4130/P100 2 × 120 W 4 × 300 W 2000 W
DGX-1/V100 2 × 135 W 8 × 300 W 3500 W
Table 4. Measured host-memory bandwiths of the DGX-1 node. The values shown are averages of five consecutive repetitions.
Benchmark Single core All cores
Read from linear addresses (consecutive 64-bit words) 9.08 GiB/s 37.28 GiB/s
Write to linear addresses (consecutive 64-bit words) 6.92 GiB/s 19.93 GiB/s
Read from random addresses (individual 64-bit words) 0.17 GiB/s 3.13 GiB/s
Read from random addresses (full cache lines) 0.78 GiB/s 14.09 GiB/s
Table 5. Measured memory bandwiths of transferring data within a single GPU device or between the host and the GPU device.
Measurements on the DGX-1 node.
Benchmark Bandwith
Host to device 10.5 GiB/s
Device to host 11.8 GiB/s
Device to device 704.9 GiB/s
Table 6. Running times for the routine that transposes each of the 64 × 64 submatrices. The procedure is parallelized and works using
32-bit words as the base data type. All runtimes reported are the median of five repeats on the DGX-1 node.
n 64 × 64 submatrix transpose time
1024 1.56 · 10−5 s
2048 3.69 · 10−5 s
4096 1.23 · 10−4 s
8192 4.63 · 10−4 s
16384 1.84 · 10−3 s
32768 7.45 · 10−3 s
65536 2.98 · 10−2 s
131072 1.29 · 10−1 s
262144 4.94 · 10−1 s
524288 2.03 · 100 s
1048576 7.99 · 100 s
4.2 Results
Runtimes, effective bit operations per second, and energy consumption at different sizes of input are shown in Tables 8,
9, 10, 11, and 12. All values shown are medians over five consecutive runs. Since the instrumentation does not permit us
to measure the actual power use, two different values are shown for the energy consumption: one is computed assuming
full CPU utilization with nominal CPU wattage and the maximum wattage of a single GPU accelerator times the number
of GPUs in use, and the other by simply taking the maximum power intake of the entire system in account. We expect
the actual energy consumption to lie somewhere between these two values, since the former ignores macroscopic
30 Matti Karppa and Petteri Kaski
Table 7. Scalability of running times for the various change-of-basis routines. The forward columns show the running times for
the forward transformations (that is, corresponding to ψ and ϕ) whereas the inverse columns show the inverse transformations
(corresponding to χ ). For the chaining variant, measurements were made to confirm that accounting for transposed data in the
right-hand operand has no effect on the running time. The procedures have been crafted using AVX2 intrinsics and to work with
512-bit cache lines on the DGX-1 host hardware. All runtimes reported are the median of five repeats on the DGX-1 node.
n Forward self-inverse Inverse self-inverse Forward chain left Forward chain right Inverse chain
1024 3.28 · 10−5 s 3.42 · 10−5 s 3.41 · 10−5 s 3.44 · 10−5 s 3.41 · 10−5 s
2048 4.70 · 10−5 s 4.88 · 10−5 s 5.02 · 10−5 s 5.13 · 10−5 s 5.00 · 10−5 s
4096 8.10 · 10−5 s 8.80 · 10−5 s 9.42 · 10−5 s 9.33 · 10−5 s 9.35 · 10−5 s
8192 1.80 · 10−4 s 2.09 · 10−4 s 2.04 · 10−4 s 2.07 · 10−4 s 2.05 · 10−4 s
16384 7.15 · 10−4 s 8.78 · 10−4 s 8.55 · 10−4 s 8.68 · 10−4 s 8.67 · 10−4 s
32768 2.06 · 10−2 s 2.56 · 10−2 s 2.59 · 10−2 s 2.61 · 10−2 s 2.59 · 10−2 s
65536 9.36 · 10−2 s 1.20 · 10−1 s 1.20 · 10−1 s 1.20 · 10−1 s 1.20 · 10−1 s
131072 4.10 · 10−1 s 5.30 · 10−1 s 5.27 · 10−1 s 5.30 · 10−1 s 5.26 · 10−1 s
262144 1.81 · 100 s 2.31 · 100 s 2.31 · 100 s 2.33 · 100 s 2.32 · 100 s
524288 7.82 · 100 s 1.01 · 101 s 1.01 · 101 s 1.01 · 101 s 1.01 · 101 s
1048576 3.45 · 101 s 4.41 · 101 s 4.46 · 101 s 4.46 · 101 s 4.40 · 101 s
factors such as memory and cooling, and the latter is simply the maximum amount of power that the entire system can
supply.
In Tables 8 and 9, given runtime T , the exact bit operation count of 2n3 − n2 is used to compute (2n3 − n2)/T , the
number of individual bit operations performed in a second on average. The same number is also used in Tables 10, 11,
and 12 to compute the effective bit operation count; that is, to highlight the relative difference in performance, we show
how many elementary bit operations per second a classical implementation operating on individual bits would have to
perform to achieve the same wall-clock runtime as the alternative-basis design.
The times reported here do not include the time required for data permutation or change of basis operations at the
host; it is assumed that the data is already in the desired format. Table 7 shows the runtime of an implementation of the
change of basis with AVX2 vector-intrinsics on the DGX-1 host. It can be seen that, in the case of the best-performing
algorithm (with self-inversion), the forward-transformation takes approximately 34.5 seconds per 1-Tib input matrix,
and 44.1 seconds for the inverse transformation of the result. The favorable difference is due to the structure of the
transformation which means that only 1 of the 4 subarrays needs to be modified at any level of recursion. Also, to
preserve linearity of access in memory, we assume that the right-hand-side operand is transposed. For Strassen-like
algorithms, this needs to be only done for the 64 × 64 submatrices; we implemented a parallelized, recursive transpose
function using word-operations that can perform the transpose for all 228 submatrices of a 1-TiB input matrix in 7.99 s
(median of five consecutive runs, see Table 6 for scalability).
With two 1-Tib input matrices, on the DGX-1, the classical cubic algorithm runs for 1880 and 2110 seconds, and
performs over 1.23 and 1.09 peta-bit-operations per second, in the binary and Boolean cases, respectively. This means
we are able to achieve over 50 % of the theoretical maximum peak performance of 2.01 Pbops per second. At the assumed
level of power used, this means that one such multiplication takes at most 1.83 and 2.05 kWh of electricity, or 2.85
and 3.20 pJ/bop, in the binary and Boolean cases, respectively. In both cases, a single V100 accelerator processes a
131072 × 131072-subproblem in just under 30 seconds, meaning that the runtime obtained here is accelerator-bound.
The time required for data upload and download, and result integration of individual subproblems is in the order of
seconds, making it negligible in comparison and effectively hidden by the host-side pipeline design.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 31
Due to their limited memory, the P100 and K80 systems can only be evaluated up to n = 524288. However, otherwise
the results are very similar to those of the DGX-1, apart from worse energy consumption and slower processing power,
owing to the older hardware in use.
As an interesting artifact of the low-level implementation, we see in the Tables 8 and 9 that, with the V100 and P100
accelerators, the Boolean version is slightly slower than the binary version of the algorithm, even though the only
differences in code are with the replacement of the popcount-parity check with the nonzero test, and the replacement
of XOR with OR. Furthermore, the difference is inverse with the K80 accelerator. This artifact appears to be due to the
optimizations performed by the NVIDIA compiler: On the P100 and the V100, the compiler compresses several bitwise
operations into one LOP3.LUT instruction, or arbitrary trinary logical instructions, and for one reason or another, the
sequence is slightly more efficient in the XOR case. However, these instructions are not available on the K80.
For binary multiplication, the Strassen-Winograd design in standard basis executes in 2610 seconds, the alternative
basis with self-inverse in 821 seconds, and the alternative basis with chaining in 972 seconds, on 1-Tib inputs on the
DGX-1, and achieve 0.8, 2.8, and 2.3 effective Pbops per second, respectively. The best alternative-basis variant exceeds
the theoretical peak hardware performance of the elementary algorithm by more than 30 %. One alternative-basis
multiplication takes less than 0.8 kWh of electricity, or less than 1.26 pJ/bop. This is less than one half of that required
by the cubic multiplication algorithm. Furthermore, considering that a subproblem of size n = 65536 can be solved by
an accelerator in 1.87 seconds, this means that the available host-side bandwidth forms the bottleneck (cf. Tables 4 and
5). Indeed, if subproblem construction and aggregation as well as the data transfer over the PCIe link were completely
free, the multiplication could be executed in 748 × 1.87 s ≈ 560 s. This in particular highlights the need to carefully
engineer workloads and their pipelining both at the host and at the accelerators to obtain balanced overall performance
from a multi-accelerator system. Tables 11 and 12 also show that the present approach generalizes to alternative
multiple-accelerator configurations, even though the more limited memory on the K80 accelerators requires a smaller
subproblem size.
ACKNOWLEDGMENTS.
The research leading to these results has received funding from the European Research Council under the European
Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement 338077 “Theory and Practice of
Advanced Search and Enumeration”. We gratefully acknowledge the use of computational resources provided by the
Aalto Science-IT project at Aalto University.
REFERENCES
[1] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzhev. 1970. On economical construction of the transitive closure of an oriented graph. Sov.
Math., Dokl. 11 (1970), 1209–1210.
[2] Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. 2012. Communication-optimal parallel algorithm for Strassen’s
matrix multiplication. In 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012, Guy E.
Blelloch and Maurice Herlihy (Eds.). ACM, 193–204. https://doi.org/10.1145/2312005.2312044
[3] Nikhil Bansal and Ryan Williams. 2012. Regularity Lemmas and Combinatorial Algorithms. Theory of Computing 8, 1 (2012), 69–94. https:
//doi.org/10.4086/toc.2012.v008a004
[4] Austin R. Benson and Grey Ballard. 2015. A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, February 7-11, 2015, Albert Cohen and David
Grove (Eds.). ACM, 42–53. https://doi.org/10.1145/2688500.2688513
[5] Nader H. Bshouty. 1995. On the Additive Complexity of 2 × 2 Matrix Multiplication. Inf. Process. Lett. 56, 6 (1995), 329–335. https://doi.org/10.1016/
0020-0190(95)00176-X
32 Matti Karppa and Petteri Kaski
Table 8. Scalability of the running times, the bit operations per second, and the energy requirements for the classical (cubic) binary
(AND/XOR) multiplication procedure. The subproblem size is 131072 for the V100 (DGX-1) and P100 accelerators, and 65536 for the
K80. The number of bit operations is 2n3 − n2. The first two energy columns are computed assuming full usage of CPU Watts and the
Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake.
GPU n d Runtime Bop/s Energy/bop Total energy System en./bop Total system en.
V100 1024 1 1.07 · 10−4 s 20.22 Tbop/s 132.06 pJ/bop 7.88 · 10−8 kWh 173.12 pJ/bop 1.04 · 10−7 kWh
V100 2048 1 2.15 · 10−4 s 79.94 Tbop/s 33.40 pJ/bop 1.60 · 10−7 kWh 43.79 pJ/bop 2.09 · 10−7 kWh
V100 4096 1 1.16 · 10−3 s 118.78 Tbop/s 22.48 pJ/bop 8.59 · 10−7 kWh 29.47 pJ/bop 1.13 · 10−6 kWh
V100 8192 1 8.28 · 10−3 s 132.83 Tbop/s 20.10 pJ/bop 6.14 · 10−6 kWh 26.35 pJ/bop 8.05 · 10−6 kWh
V100 16384 1 5.21 · 10−2 s 169.11 Tbop/s 15.79 pJ/bop 3.86 · 10−5 kWh 20.70 pJ/bop 5.06 · 10−5 kWh
V100 32768 1 4.09 · 10−1 s 172.46 Tbop/s 15.48 pJ/bop 3.03 · 10−4 kWh 20.29 pJ/bop 3.97 · 10−4 kWh
V100 65536 1 3.28 · 100 s 171.74 Tbop/s 15.55 pJ/bop 2.44 · 10−3 kWh 20.38 pJ/bop 3.19 · 10−3 kWh
V100 131072 1 2.65 · 101 s 170.14 Tbop/s 15.69 pJ/bop 1.97 · 10−2 kWh 20.57 pJ/bop 2.58 · 10−2 kWh
V100 262144 8 6.76 · 101 s 533.46 Tbop/s 5.01 pJ/bop 5.01 · 10−2 kWh 6.56 pJ/bop 6.57 · 10−2 kWh
V100 524288 8 2.40 · 102 s 1204.30 Tbop/s 2.22 pJ/bop 1.78 · 10−1 kWh 2.91 pJ/bop 2.33 · 10−1 kWh
V100 1048576 8 1.88 · 103 s 1230.18 Tbop/s 2.17 pJ/bop 1.40 · 100 kWh 2.85 pJ/bop 1.83 · 100 kWh
P100 1024 1 6.22 · 10−5 s 34.56 Tbop/s 41.67 pJ/bop 2.49 · 10−8 kWh 57.87 pJ/bop 3.46 · 10−8 kWh
P100 2048 1 2.96 · 10−4 s 58.21 Tbop/s 24.74 pJ/bop 1.19 · 10−7 kWh 34.36 pJ/bop 1.64 · 10−7 kWh
P100 4096 1 1.88 · 10−3 s 73.31 Tbop/s 19.64 pJ/bop 7.50 · 10−7 kWh 27.28 pJ/bop 1.05 · 10−6 kWh
P100 8192 1 1.25 · 10−2 s 88.56 Tbop/s 16.26 pJ/bop 4.97 · 10−6 kWh 22.58 pJ/bop 6.90 · 10−6 kWh
P100 16384 1 8.64 · 10−2 s 101.90 Tbop/s 14.13 pJ/bop 3.46 · 10−5 kWh 19.63 pJ/bop 4.80 · 10−5 kWh
P100 32768 1 6.83 · 10−1 s 103.03 Tbop/s 13.98 pJ/bop 2.74 · 10−4 kWh 19.41 pJ/bop 3.80 · 10−4 kWh
P100 65536 1 5.46 · 100 s 103.15 Tbop/s 13.96 pJ/bop 2.19 · 10−3 kWh 19.39 pJ/bop 3.04 · 10−3 kWh
P100 131072 1 4.37 · 101 s 103.21 Tbop/s 13.95 pJ/bop 1.75 · 10−2 kWh 19.38 pJ/bop 2.43 · 10−2 kWh
P100 262144 4 9.57 · 101 s 376.85 Tbop/s 3.82 pJ/bop 3.83 · 10−2 kWh 5.31 pJ/bop 5.32 · 10−2 kWh
P100 524288 4 7.24 · 102 s 398.11 Tbop/s 3.62 pJ/bop 2.90 · 10−1 kWh 5.02 pJ/bop 4.03 · 10−1 kWh
K80 1024 1 2.27 · 10−4 s 9.47 Tbop/s 144.69 pJ/bop 8.63 · 10−8 kWh 211.23 pJ/bop 1.26 · 10−7 kWh
K80 2048 1 1.95 · 10−3 s 8.83 Tbop/s 155.16 pJ/bop 7.41 · 10−7 kWh 226.50 pJ/bop 1.09 · 10−6 kWh
K80 4096 1 9.94 · 10−3 s 13.84 Tbop/s 99.01 pJ/bop 3.78 · 10−6 kWh 144.54 pJ/bop 5.52 · 10−6 kWh
K80 8192 1 6.44 · 10−2 s 17.09 Tbop/s 80.15 pJ/bop 2.45 · 10−5 kWh 117.00 pJ/bop 3.58 · 10−5 kWh
K80 16384 1 4.28 · 10−1 s 20.59 Tbop/s 66.52 pJ/bop 1.63 · 10−4 kWh 97.11 pJ/bop 2.38 · 10−4 kWh
K80 32768 1 3.43 · 100 s 20.55 Tbop/s 66.68 pJ/bop 1.31 · 10−3 kWh 97.34 pJ/bop 1.91 · 10−3 kWh
K80 65536 1 2.86 · 101 s 19.70 Tbop/s 69.54 pJ/bop 1.09 · 10−2 kWh 101.52 pJ/bop 1.59 · 10−2 kWh
K80 131072 8 6.34 · 101 s 71.11 Tbop/s 19.27 pJ/bop 2.42 · 10−2 kWh 28.13 pJ/bop 3.52 · 10−2 kWh
K80 262144 8 2.51 · 102 s 143.73 Tbop/s 9.53 pJ/bop 9.54 · 10−2 kWh 13.91 pJ/bop 1.40 · 10−1 kWh
K80 524288 8 1.98 · 103 s 146.29 Tbop/s 9.36 pJ/bop 7.50 · 10−1 kWh 13.67 pJ/bop 1.10 · 100 kWh
[6] Murat Cenk and M. Anwar Hasan. 2017. On the arithmetic complexity of Strassen-like matrix multiplications. J. Symbolic Comput. 80, part 2 (2017),
484–501. https://doi.org/10.1016/j.jsc.2016.07.004
[7] Timothy M. Chan. 2015. Speeding up the Four Russians Algorithm by About One More Logarithmic Factor. In Proceedings of the Twenty-Sixth
Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, Piotr Indyk (Ed.). SIAM, 212–217.
https://doi.org/10.1137/1.9781611973730.16
[8] Don Coppersmith and Shmuel Winograd. 1990. Matrix multiplication via arithmetic progressions. J. Symb. Comput. 9, 3 (1990), 251–280.
https://doi.org/10.1016/S0747-7171(08)80013-2
[9] Paolo D’Alberto, Marco Bodrato, and Alexandru Nicolau. 2011. Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor
systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation. ACM Trans. Math. Softw.
38, 1 (2011), 2:1–2:30. https://doi.org/10.1145/2049662.2049664
[10] A. M. Davie and A. J. Stothers. 2013. Improved bound for complexity of matrix multiplication. Proc. R. Soc. Edinb., Sect. A, Math. 143, 2 (2013),
351–369. https://doi.org/10.1017/S0308210511001648
[11] Michael J. Fischer and Albert R. Meyer. 1971. Boolean Matrix Multiplication and Transitive Closure. In 12th Annual Symposium on Switching and
Automata Theory, East Lansing, Michigan, USA, October 13-15, 1971. IEEE Computer Society, 129–131. https://doi.org/10.1109/SWAT.1971.4
[12] M. E. Furman. 1970. Application of a method of fast multiplication of matrices in the problem of finding the transitive closure of a graph. Sov. Math.,
Dokl. 11 (1970), 1252.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 33
Table 9. Scalability of the running times, the bit operations per second, and the energy requirements for the classic (cubic) Boolean
(AND/OR) multiplication procedure. The subproblem size is 131072 for the V100 (DGX-1) and P100 accelerators, and 65536 for the
K80. The number of bit operations is 2n3 − n2. The first two energy columns are computed assuming full usage of CPU Watts and the
Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake.
GPU n d Runtime Bop/s Energy/bop Total energy System en./bop Total system en.
V100 1024 1 9.84 · 10−5 s 21.83 Tbop/s 122.33 pJ/bop 7.30 · 10−8 kWh 160.36 pJ/bop 9.57 · 10−8 kWh
V100 2048 1 1.93 · 10−4 s 89.25 Tbop/s 29.92 pJ/bop 1.43 · 10−7 kWh 39.22 pJ/bop 1.88 · 10−7 kWh
V100 4096 1 1.30 · 10−3 s 106.45 Tbop/s 25.08 pJ/bop 9.58 · 10−7 kWh 32.88 pJ/bop 1.26 · 10−6 kWh
V100 8192 1 7.86 · 10−3 s 140.00 Tbop/s 19.07 pJ/bop 5.83 · 10−6 kWh 25.00 pJ/bop 7.64 · 10−6 kWh
V100 16384 1 5.94 · 10−2 s 148.16 Tbop/s 18.02 pJ/bop 4.41 · 10−5 kWh 23.62 pJ/bop 5.78 · 10−5 kWh
V100 32768 1 4.69 · 10−1 s 150.33 Tbop/s 17.76 pJ/bop 3.48 · 10−4 kWh 23.28 pJ/bop 4.56 · 10−4 kWh
V100 65536 1 3.73 · 100 s 151.07 Tbop/s 17.67 pJ/bop 2.77 · 10−3 kWh 23.17 pJ/bop 3.63 · 10−3 kWh
V100 131072 1 2.98 · 101 s 151.27 Tbop/s 17.65 pJ/bop 2.21 · 10−2 kWh 23.14 pJ/bop 2.90 · 10−2 kWh
V100 262144 8 7.51 · 101 s 479.81 Tbop/s 5.56 pJ/bop 5.57 · 10−2 kWh 7.29 pJ/bop 7.31 · 10−2 kWh
V100 524288 8 2.71 · 102 s 1065.59 Tbop/s 2.51 pJ/bop 2.01 · 10−1 kWh 3.28 pJ/bop 2.63 · 10−1 kWh
V100 1048576 8 2.11 · 103 s 1094.98 Tbop/s 2.44 pJ/bop 1.57 · 100 kWh 3.20 pJ/bop 2.05 · 100 kWh
P100 1024 1 5.09 · 10−5 s 42.24 Tbop/s 34.09 pJ/bop 2.04 · 10−8 kWh 47.35 pJ/bop 2.83 · 10−8 kWh
P100 2048 1 2.72 · 10−4 s 63.35 Tbop/s 22.73 pJ/bop 1.09 · 10−7 kWh 31.57 pJ/bop 1.51 · 10−7 kWh
P100 4096 1 1.62 · 10−3 s 85.13 Tbop/s 16.91 pJ/bop 6.46 · 10−7 kWh 23.49 pJ/bop 8.97 · 10−7 kWh
P100 8192 1 1.16 · 10−2 s 95.10 Tbop/s 15.14 pJ/bop 4.63 · 10−6 kWh 21.03 pJ/bop 6.43 · 10−6 kWh
P100 16384 1 8.90 · 10−2 s 98.89 Tbop/s 14.56 pJ/bop 3.56 · 10−5 kWh 20.22 pJ/bop 4.95 · 10−5 kWh
P100 32768 1 7.09 · 10−1 s 99.29 Tbop/s 14.50 pJ/bop 2.84 · 10−4 kWh 20.14 pJ/bop 3.94 · 10−4 kWh
P100 65536 1 5.67 · 100 s 99.40 Tbop/s 14.49 pJ/bop 2.27 · 10−3 kWh 20.12 pJ/bop 3.15 · 10−3 kWh
P100 131072 1 4.53 · 101 s 99.45 Tbop/s 14.48 pJ/bop 1.82 · 10−2 kWh 20.11 pJ/bop 2.52 · 10−2 kWh
P100 262144 4 9.89 · 101 s 364.55 Tbop/s 3.95 pJ/bop 3.96 · 10−2 kWh 5.49 pJ/bop 5.50 · 10−2 kWh
P100 524288 4 7.50 · 102 s 384.46 Tbop/s 3.75 pJ/bop 3.00 · 10−1 kWh 5.20 pJ/bop 4.17 · 10−1 kWh
K80 1024 1 2.15 · 10−4 s 10.03 Tbop/s 136.60 pJ/bop 8.15 · 10−8 kWh 199.41 pJ/bop 1.19 · 10−7 kWh
K80 2048 1 1.21 · 10−3 s 14.22 Tbop/s 96.35 pJ/bop 4.60 · 10−7 kWh 140.66 pJ/bop 6.72 · 10−7 kWh
K80 4096 1 8.08 · 10−3 s 17.02 Tbop/s 80.48 pJ/bop 3.08 · 10−6 kWh 117.49 pJ/bop 4.49 · 10−6 kWh
K80 8192 1 5.52 · 10−2 s 19.92 Tbop/s 68.77 pJ/bop 2.11 · 10−5 kWh 100.40 pJ/bop 3.07 · 10−5 kWh
K80 16384 1 3.26 · 10−1 s 27.06 Tbop/s 50.62 pJ/bop 1.24 · 10−4 kWh 73.90 pJ/bop 1.81 · 10−4 kWh
K80 32768 1 2.53 · 100 s 27.82 Tbop/s 49.24 pJ/bop 9.63 · 10−4 kWh 71.88 pJ/bop 1.41 · 10−3 kWh
K80 65536 1 2.11 · 101 s 26.72 Tbop/s 51.27 pJ/bop 8.02 · 10−3 kWh 74.84 pJ/bop 1.18 · 10−2 kWh
K80 131072 8 4.63 · 101 s 97.47 Tbop/s 14.06 pJ/bop 1.76 · 10−2 kWh 20.52 pJ/bop 2.57 · 10−2 kWh
K80 262144 8 1.80 · 102 s 200.24 Tbop/s 6.84 pJ/bop 6.85 · 10−2 kWh 9.99 pJ/bop 10.00 · 10−2 kWh
K80 524288 8 1.41 · 103 s 205.47 Tbop/s 6.67 pJ/bop 5.34 · 10−1 kWh 9.73 pJ/bop 7.80 · 10−1 kWh
[13] Brian Grayson and Robert A. van de Geijn. 1996. A High Performance Parallel Strassen Implementation. Parallel Processing Letters 6, 1 (1996), 3–12.
https://doi.org/10.1142/S0129626496000029
[14] Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de Geijn. 2017. Generating Families of Practical Fast Matrix Multiplication
Algorithms. In 2017 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, Orlando, FL, USA, May 29 - June 2, 2017. IEEE
Computer Society, 656–667. https://doi.org/10.1109/IPDPS.2017.56
[15] Jianyu Huang, Tyler M. Smith, Greg M. Henry, and Robert A. van de Geijn. 2016. Strassen’s algorithm reloaded. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, SC 2016, Salt Lake City, UT, USA, November 13-18, 2016, John West and
Cherri M. Pancake (Eds.). IEEE Computer Society, 690–701. https://doi.org/10.1109/SC.2016.58
[16] International Organization for Standardization. 2017. Programming Languages – C++. Standard. ISO/IEC 14882:2017.
[17] Alon Itai and Michael Rodeh. 1978. Finding a Minimum Circuit in a Graph. SIAM J. Comput. 7, 4 (1978), 413–423. https://doi.org/10.1137/0207033
[18] Igor Kaporin. 1999. A practical algorithm for faster matrix multiplication. Numer. Linear Algebra Appl. 6, 8 (1999), 687–700. https://doi.org/10.1002/
(SICI)1099-1506(199912)6:8<687::AID-NLA177>3.0.CO;2-I
[19] Igor Kaporin. 2004. The aggregation and cancellation techniques as a practical tool for faster matrix multiplication. Theoret. Comput. Sci. 315, 2-3
(2004), 469–510. https://doi.org/10.1016/j.tcs.2004.01.004
[20] Elaye Karstadt and Oded Schwartz. 2017. Matrix Multiplication, a Little Faster. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms
and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, Christian Scheideler and Mohammad Taghi Hajiaghayi (Eds.). ACM, 101–110.
34 Matti Karppa and Petteri Kaski
Table 10. Scalability the running times, the effective bit operations per second, and the energy requirements for the standard-basis
binary Strassen-Winograd multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators, and
32768 for the K80. The effective number of bit operations is computed with the same value of 2n3 −n2 as in the cubic case to highlight
the relative difference in performance. The first two energy columns are computed assuming full usage of CPUWatts and the Wattage
of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. The
right-hand-side input matrix is assumed to be pre-transposed; the times reported here do not include the transposition of submatrices.
GPU n d Runtime Effective bop/s Energy/bop Total energy System en./bop Total system en.
V100 1024 1 1.67 · 10−4 s 12.92 Tbop/s 206.67 pJ/bop 1.24 · 10−7 kWh 270.92 pJ/bop 1.62 · 10−7 kWh
V100 2048 1 2.72 · 10−4 s 63.24 Tbop/s 42.22 pJ/bop 2.02 · 10−7 kWh 55.34 pJ/bop 2.65 · 10−7 kWh
V100 4096 1 1.05 · 10−3 s 132.07 Tbop/s 20.22 pJ/bop 7.72 · 10−7 kWh 26.50 pJ/bop 1.02 · 10−6 kWh
V100 8192 1 5.86 · 10−3 s 187.69 Tbop/s 14.23 pJ/bop 4.35 · 10−6 kWh 18.65 pJ/bop 5.70 · 10−6 kWh
V100 16384 1 3.97 · 10−2 s 222.03 Tbop/s 12.03 pJ/bop 2.94 · 10−5 kWh 15.76 pJ/bop 3.86 · 10−5 kWh
V100 32768 1 2.81 · 10−1 s 251.04 Tbop/s 10.64 pJ/bop 2.08 · 10−4 kWh 13.94 pJ/bop 2.73 · 10−4 kWh
V100 65536 1 1.97 · 100 s 285.95 Tbop/s 9.34 pJ/bop 1.47 · 10−3 kWh 12.24 pJ/bop 1.92 · 10−3 kWh
V100 131072 8 3.74 · 100 s 1205.38 Tbop/s 2.22 pJ/bop 2.78 · 10−3 kWh 2.90 pJ/bop 3.64 · 10−3 kWh
V100 262144 8 2.34 · 101 s 1543.54 Tbop/s 1.73 pJ/bop 1.74 · 10−2 kWh 2.27 pJ/bop 2.27 · 10−2 kWh
V100 524288 8 2.66 · 102 s 1086.54 Tbop/s 2.46 pJ/bop 1.97 · 10−1 kWh 3.22 pJ/bop 2.58 · 10−1 kWh
V100 1048576 8 2.61 · 103 s 884.18 Tbop/s 3.02 pJ/bop 1.94 · 100 kWh 3.96 pJ/bop 2.54 · 100 kWh
P100 1024 1 1.39 · 10−4 s 15.54 Tbop/s 92.64 pJ/bop 5.53 · 10−8 kWh 128.67 pJ/bop 7.68 · 10−8 kWh
P100 2048 1 3.34 · 10−4 s 51.50 Tbop/s 27.96 pJ/bop 1.34 · 10−7 kWh 38.84 pJ/bop 1.86 · 10−7 kWh
P100 4096 1 1.53 · 10−3 s 90.17 Tbop/s 15.97 pJ/bop 6.10 · 10−7 kWh 22.18 pJ/bop 8.47 · 10−7 kWh
P100 8192 1 9.12 · 10−3 s 120.55 Tbop/s 11.94 pJ/bop 3.65 · 10−6 kWh 16.59 pJ/bop 5.07 · 10−6 kWh
P100 16384 1 5.76 · 10−2 s 152.77 Tbop/s 9.43 pJ/bop 2.31 · 10−5 kWh 13.09 pJ/bop 3.20 · 10−5 kWh
P100 32768 1 4.08 · 10−1 s 172.84 Tbop/s 8.33 pJ/bop 1.63 · 10−4 kWh 11.57 pJ/bop 2.27 · 10−4 kWh
P100 65536 1 2.87 · 100 s 196.61 Tbop/s 7.32 pJ/bop 1.15 · 10−3 kWh 10.17 pJ/bop 1.60 · 10−3 kWh
P100 131072 4 7.37 · 100 s 611.79 Tbop/s 2.35 pJ/bop 2.95 · 10−3 kWh 3.27 pJ/bop 4.09 · 10−3 kWh
P100 262144 4 4.37 · 101 s 824.68 Tbop/s 1.75 pJ/bop 1.75 · 10−2 kWh 2.43 pJ/bop 2.43 · 10−2 kWh
P100 524288 4 3.02 · 102 s 956.67 Tbop/s 1.51 pJ/bop 1.21 · 10−1 kWh 2.09 pJ/bop 1.68 · 10−1 kWh
K80 1024 1 3.07 · 10−4 s 7.00 Tbop/s 195.67 pJ/bop 1.17 · 10−7 kWh 285.64 pJ/bop 1.71 · 10−7 kWh
K80 2048 1 1.19 · 10−3 s 14.50 Tbop/s 94.47 pJ/bop 4.51 · 10−7 kWh 137.91 pJ/bop 6.58 · 10−7 kWh
K80 4096 1 6.96 · 10−3 s 19.77 Tbop/s 69.31 pJ/bop 2.65 · 10−6 kWh 101.18 pJ/bop 3.87 · 10−6 kWh
K80 8192 1 4.08 · 10−2 s 26.99 Tbop/s 50.76 pJ/bop 1.56 · 10−5 kWh 74.10 pJ/bop 2.27 · 10−5 kWh
K80 16384 1 2.35 · 10−1 s 37.44 Tbop/s 36.59 pJ/bop 8.95 · 10−5 kWh 53.42 pJ/bop 1.31 · 10−4 kWh
K80 32768 1 1.62 · 100 s 43.51 Tbop/s 31.49 pJ/bop 6.16 · 10−4 kWh 45.97 pJ/bop 8.99 · 10−4 kWh
K80 65536 8 2.34 · 100 s 241.49 Tbop/s 5.67 pJ/bop 8.88 · 10−4 kWh 8.28 pJ/bop 1.30 · 10−3 kWh
K80 131072 8 1.36 · 101 s 331.26 Tbop/s 4.14 pJ/bop 5.18 · 10−3 kWh 6.04 pJ/bop 7.56 · 10−3 kWh
K80 262144 8 1.33 · 102 s 272.80 Tbop/s 5.02 pJ/bop 5.03 · 10−2 kWh 7.33 pJ/bop 7.34 · 10−2 kWh
K80 524288 8 1.60 · 103 s 180.20 Tbop/s 7.60 pJ/bop 6.09 · 10−1 kWh 11.10 pJ/bop 8.89 · 10−1 kWh
https://doi.org/10.1145/3087556.3087579
[21] Bharat Kumar, Chua-Huang Huang, Rodney W. Johnson, and P. Sadayappan. 1993. A Tensor Product Formulation of Strassen’s Matrix Multiplication
Algorithm with Memory Reduction. In The Seventh International Parallel Processing Symposium, Proceedings, Newport Beach, California, USA, April
13-16, 1993. IEEE Computer Society, 582–588. https://doi.org/10.1109/IPPS.1993.262814
[22] François Le Gall. 2014. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation,
ISSAC ’14, Kobe, Japan, July 23-25, 2014, Katsusuke Nabeshima, Kosaku Nagasaka, Franz Winkler, and Ágnes Szántó (Eds.). ACM, 296–303.
https://doi.org/10.1145/2608628.2608664
[23] Benjamin Lipshitz, Grey Ballard, James Demmel, and Oded Schwartz. 2012. Communication-avoiding parallel Strassen: implementation and
performance. In SC Conference on High Performance Computing Networking, Storage and Analysis, SC ’12, Salt Lake City, UT, USA - November 11 - 15,
2012, Jeffrey K. Hollingsworth (Ed.). IEEE/ACM, 101. https://doi.org/10.1109/SC.2012.33
[24] Qingshan Luo and John B. Drake. 1995. A scalable parallel Strassen’s matrix multiplication algorithm for distributed-memory computers. In
Proceedings of the 1995 ACM symposium on applied computing, SAC’95, Nashville, TN, USA, February 26-28, 1995, Jim Hightower, Ed Deaton, K. M.
George, Janice H. Carroll, and Dave Oppenheim (Eds.). ACM, 221–226. https://doi.org/10.1145/315891.315965
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 35
Table 11. Scalability the running times, the effective bit operations per second, and the energy requirements for the alternative-basis
(with self-inversion) binary multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators,
and 32768 for the K80. The effective number of bit operations is computed with the same value of 2n3 − n2 as in the cubic case to
highlight the relative difference in performance. The first two energy columns are computed assuming full usage of CPU Watts and
the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake.
The right-hand-side input matrix is assumed to be pre-transposed and in the alternate basis; the times reported here do not include
the transposition of submatrices or the change of basis.
GPU n d Runtime Effective bop/s Energy/bop Total energy System en./bop Total system en.
V100 1024 1 1.51 · 10−4 s 14.26 Tbop/s 187.30 pJ/bop 1.12 · 10−7 kWh 245.53 pJ/bop 1.47 · 10−7 kWh
V100 2048 1 2.45 · 10−4 s 70.36 Tbop/s 37.95 pJ/bop 1.82 · 10−7 kWh 49.74 pJ/bop 2.38 · 10−7 kWh
V100 4096 1 9.97 · 10−4 s 137.96 Tbop/s 19.35 pJ/bop 7.39 · 10−7 kWh 25.37 pJ/bop 9.69 · 10−7 kWh
V100 8192 1 5.55 · 10−3 s 198.32 Tbop/s 13.46 pJ/bop 4.12 · 10−6 kWh 17.65 pJ/bop 5.39 · 10−6 kWh
V100 16384 1 3.76 · 10−2 s 234.14 Tbop/s 11.40 pJ/bop 2.79 · 10−5 kWh 14.95 pJ/bop 3.66 · 10−5 kWh
V100 32768 1 2.66 · 10−1 s 265.39 Tbop/s 10.06 pJ/bop 1.97 · 10−4 kWh 13.19 pJ/bop 2.58 · 10−4 kWh
V100 65536 1 1.87 · 100 s 301.77 Tbop/s 8.85 pJ/bop 1.39 · 10−3 kWh 11.60 pJ/bop 1.82 · 10−3 kWh
V100 131072 8 3.33 · 100 s 1356.04 Tbop/s 1.97 pJ/bop 2.47 · 10−3 kWh 2.58 pJ/bop 3.23 · 10−3 kWh
V100 262144 8 1.83 · 101 s 1970.12 Tbop/s 1.36 pJ/bop 1.36 · 10−2 kWh 1.78 pJ/bop 1.78 · 10−2 kWh
V100 524288 8 1.19 · 102 s 2440.74 Tbop/s 1.09 pJ/bop 8.76 · 10−2 kWh 1.43 pJ/bop 1.15 · 10−1 kWh
V100 1048576 8 8.21 · 102 s 2809.41 Tbop/s 0.95 pJ/bop 6.09 · 10−1 kWh 1.25 pJ/bop 7.98 · 10−1 kWh
P100 1024 1 1.18 · 10−4 s 18.22 Tbop/s 79.02 pJ/bop 4.72 · 10−8 kWh 109.75 pJ/bop 6.55 · 10−8 kWh
P100 2048 1 2.94 · 10−4 s 58.49 Tbop/s 24.62 pJ/bop 1.18 · 10−7 kWh 34.19 pJ/bop 1.64 · 10−7 kWh
P100 4096 1 1.38 · 10−3 s 100.27 Tbop/s 14.36 pJ/bop 5.49 · 10−7 kWh 19.95 pJ/bop 7.62 · 10−7 kWh
P100 8192 1 8.31 · 10−3 s 132.34 Tbop/s 10.88 pJ/bop 3.33 · 10−6 kWh 15.11 pJ/bop 4.62 · 10−6 kWh
P100 16384 1 5.37 · 10−2 s 163.82 Tbop/s 8.79 pJ/bop 2.15 · 10−5 kWh 12.21 pJ/bop 2.99 · 10−5 kWh
P100 32768 1 3.79 · 10−1 s 185.81 Tbop/s 7.75 pJ/bop 1.52 · 10−4 kWh 10.76 pJ/bop 2.11 · 10−4 kWh
P100 65536 1 2.67 · 100 s 211.34 Tbop/s 6.81 pJ/bop 1.07 · 10−3 kWh 9.46 pJ/bop 1.48 · 10−3 kWh
P100 131072 4 7.12 · 100 s 633.00 Tbop/s 2.27 pJ/bop 2.85 · 10−3 kWh 3.16 pJ/bop 3.96 · 10−3 kWh
P100 262144 4 4.12 · 101 s 876.27 Tbop/s 1.64 pJ/bop 1.65 · 10−2 kWh 2.28 pJ/bop 2.29 · 10−2 kWh
P100 524288 4 2.60 · 102 s 1110.41 Tbop/s 1.30 pJ/bop 1.04 · 10−1 kWh 1.80 pJ/bop 1.45 · 10−1 kWh
K80 1024 1 2.87 · 10−4 s 7.50 Tbop/s 182.63 pJ/bop 1.09 · 10−7 kWh 266.62 pJ/bop 1.59 · 10−7 kWh
K80 2048 1 1.12 · 10−3 s 15.44 Tbop/s 88.74 pJ/bop 4.24 · 10−7 kWh 129.54 pJ/bop 6.19 · 10−7 kWh
K80 4096 1 6.56 · 10−3 s 20.95 Tbop/s 65.38 pJ/bop 2.50 · 10−6 kWh 95.45 pJ/bop 3.65 · 10−6 kWh
K80 8192 1 4.17 · 10−2 s 26.40 Tbop/s 51.89 pJ/bop 1.59 · 10−5 kWh 75.75 pJ/bop 2.32 · 10−5 kWh
K80 16384 1 2.24 · 10−1 s 39.43 Tbop/s 34.75 pJ/bop 8.50 · 10−5 kWh 50.73 pJ/bop 1.24 · 10−4 kWh
K80 32768 1 1.52 · 100 s 46.45 Tbop/s 29.49 pJ/bop 5.77 · 10−4 kWh 43.05 pJ/bop 8.42 · 10−4 kWh
K80 65536 8 2.04 · 100 s 276.04 Tbop/s 4.96 pJ/bop 7.77 · 10−4 kWh 7.25 pJ/bop 1.14 · 10−3 kWh
K80 131072 8 1.24 · 101 s 365.09 Tbop/s 3.75 pJ/bop 4.70 · 10−3 kWh 5.48 pJ/bop 6.86 · 10−3 kWh
K80 262144 8 7.79 · 101 s 462.55 Tbop/s 2.96 pJ/bop 2.97 · 10−2 kWh 4.32 pJ/bop 4.33 · 10−2 kWh
K80 524288 8 5.64 · 102 s 511.71 Tbop/s 2.68 pJ/bop 2.15 · 10−1 kWh 3.91 pJ/bop 3.13 · 10−1 kWh
[25] Xinxin Mei and Xiaowen Chu. 2017. Dissecting GPU Memory Hierarchy Through Microbenchmarking. IEEE Trans. Parallel Distrib. Syst. 28, 1 (2017),
72–86. https://doi.org/10.1109/TPDS.2016.2549523
[26] NVIDIA Corporation. 2015. NVIDIA Tesla K80 GPU accelerator board specification. https://www.nvidia.com/content/dam/en-zz/Solutions/
Data-Center/tesla-product-literature/Tesla-K80-BoardSpec-07317-001-v05.pdf (retrieved 2019-01-09).
[27] NVIDIA Corporation. 2016. NVIDIA Tesla P100 datasheet. https://www.nvidia.com/en-us/data-center/tesla-p100/ (retrieved 2019-01-09).
[28] NVIDIA Corporation. 2018. CUDA C Programming Guide. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (retrieved
2018-08-11). PG-02829-001_v9.2.
[29] NVIDIA Corporation. 2018. NVIDIA Tesla V100 datasheet. https://www.nvidia.com/en-us/data-center/tesla-v100/ (retrieved 2019-01-09).
[30] OpenMP Architecture Review Board. 2015. OpenMP Application Program Interface. https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf.
Version 4.5.
[31] Victor Pan. 1984. How can we speed up matrix multiplication? SIAM Rev. 26, 3 (1984), 393–415. https://doi.org/10.1137/1026076
[32] Victor Y. Pan. 1978. Strassen’s Algorithm Is not Optimal: Trililnear Technique of Aggregating, Uniting and Canceling for Constructing Fast
Algorithms for Matrix Operations. In 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16-18 October 1978.
36 Matti Karppa and Petteri Kaski
Table 12. Scalability the running times, the effective bit operations per second, and the energy requirements for the alternative-basis
(with chaining) binary multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators, and 32768
for the K80. The effective number of bit operations is computed with the same value of 2n3 − n2 as in the cubic case to highlight the
relative difference in performance. The first two energy columns are computed assuming full usage of CPU Watts and the Wattage
of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. The
right-hand-side input matrix is assumed to be pre-transposed and in the alternate basis; the times reported here do not include the
transposition of submatrices or the change of basis.
GPU n d Runtime Effective bop/s Energy/bop Total energy System en./bop Total system en.
V100 1024 1 1.63 · 10−4 s 13.20 Tbop/s 202.35 pJ/bop 1.21 · 10−7 kWh 265.25 pJ/bop 1.59 · 10−7 kWh
V100 2048 1 2.53 · 10−4 s 68.02 Tbop/s 39.25 pJ/bop 1.88 · 10−7 kWh 51.45 pJ/bop 2.46 · 10−7 kWh
V100 4096 1 1.01 · 10−3 s 136.65 Tbop/s 19.54 pJ/bop 7.46 · 10−7 kWh 25.61 pJ/bop 9.78 · 10−7 kWh
V100 8192 1 5.56 · 10−3 s 197.78 Tbop/s 13.50 pJ/bop 4.13 · 10−6 kWh 17.70 pJ/bop 5.41 · 10−6 kWh
V100 16384 1 3.77 · 10−2 s 233.70 Tbop/s 11.42 pJ/bop 2.80 · 10−5 kWh 14.98 pJ/bop 3.66 · 10−5 kWh
V100 32768 1 2.66 · 10−1 s 265.02 Tbop/s 10.07 pJ/bop 1.97 · 10−4 kWh 13.21 pJ/bop 2.59 · 10−4 kWh
V100 65536 1 1.87 · 100 s 301.20 Tbop/s 8.86 pJ/bop 1.39 · 10−3 kWh 11.62 pJ/bop 1.82 · 10−3 kWh
V100 131072 8 3.50 · 100 s 1287.65 Tbop/s 2.07 pJ/bop 2.60 · 10−3 kWh 2.72 pJ/bop 3.41 · 10−3 kWh
V100 262144 8 1.83 · 101 s 1976.47 Tbop/s 1.35 pJ/bop 1.36 · 10−2 kWh 1.77 pJ/bop 1.78 · 10−2 kWh
V100 524288 8 1.37 · 102 s 2105.28 Tbop/s 1.27 pJ/bop 1.02 · 10−1 kWh 1.66 pJ/bop 1.34 · 10−1 kWh
V100 1048576 8 9.72 · 102 s 2374.71 Tbop/s 1.12 pJ/bop 7.21 · 10−1 kWh 1.47 pJ/bop 9.45 · 10−1 kWh
P100 1024 1 1.26 · 10−4 s 17.17 Tbop/s 83.89 pJ/bop 5.01 · 10−8 kWh 116.51 pJ/bop 6.95 · 10−8 kWh
P100 2048 1 3.04 · 10−4 s 56.63 Tbop/s 25.43 pJ/bop 1.22 · 10−7 kWh 35.32 pJ/bop 1.69 · 10−7 kWh
P100 4096 1 1.38 · 10−3 s 99.65 Tbop/s 14.45 pJ/bop 5.52 · 10−7 kWh 20.07 pJ/bop 7.67 · 10−7 kWh
P100 8192 1 8.38 · 10−3 s 131.30 Tbop/s 10.97 pJ/bop 3.35 · 10−6 kWh 15.23 pJ/bop 4.66 · 10−6 kWh
P100 16384 1 5.40 · 10−2 s 163.10 Tbop/s 8.83 pJ/bop 2.16 · 10−5 kWh 12.26 pJ/bop 3.00 · 10−5 kWh
P100 32768 1 3.82 · 10−1 s 184.64 Tbop/s 7.80 pJ/bop 1.53 · 10−4 kWh 10.83 pJ/bop 2.12 · 10−4 kWh
P100 65536 1 2.69 · 100 s 210.00 Tbop/s 6.86 pJ/bop 1.08 · 10−3 kWh 9.52 pJ/bop 1.49 · 10−3 kWh
P100 131072 4 7.08 · 100 s 636.99 Tbop/s 2.26 pJ/bop 2.83 · 10−3 kWh 3.14 pJ/bop 3.93 · 10−3 kWh
P100 262144 4 4.12 · 101 s 875.64 Tbop/s 1.64 pJ/bop 1.65 · 10−2 kWh 2.28 pJ/bop 2.29 · 10−2 kWh
P100 524288 4 2.63 · 102 s 1096.61 Tbop/s 1.31 pJ/bop 1.06 · 10−1 kWh 1.82 pJ/bop 1.47 · 10−1 kWh
K80 1024 1 2.97 · 10−4 s 7.25 Tbop/s 189.02 pJ/bop 1.13 · 10−7 kWh 275.94 pJ/bop 1.65 · 10−7 kWh
K80 2048 1 1.13 · 10−3 s 15.22 Tbop/s 90.03 pJ/bop 4.30 · 10−7 kWh 131.43 pJ/bop 6.28 · 10−7 kWh
K80 4096 1 6.70 · 10−3 s 20.53 Tbop/s 66.73 pJ/bop 2.55 · 10−6 kWh 97.42 pJ/bop 3.72 · 10−6 kWh
K80 8192 1 3.97 · 10−2 s 27.71 Tbop/s 49.44 pJ/bop 1.51 · 10−5 kWh 72.17 pJ/bop 2.21 · 10−5 kWh
K80 16384 1 2.24 · 10−1 s 39.33 Tbop/s 34.84 pJ/bop 8.52 · 10−5 kWh 50.86 pJ/bop 1.25 · 10−4 kWh
K80 32768 1 1.58 · 100 s 44.61 Tbop/s 30.71 pJ/bop 6.01 · 10−4 kWh 44.84 pJ/bop 8.77 · 10−4 kWh
K80 65536 8 2.15 · 100 s 262.51 Tbop/s 5.22 pJ/bop 8.17 · 10−4 kWh 7.62 pJ/bop 1.20 · 10−3 kWh
K80 131072 8 1.31 · 101 s 344.31 Tbop/s 3.98 pJ/bop 4.98 · 10−3 kWh 5.81 pJ/bop 7.27 · 10−3 kWh
K80 262144 8 8.05 · 101 s 447.61 Tbop/s 3.06 pJ/bop 3.07 · 10−2 kWh 4.47 pJ/bop 4.48 · 10−2 kWh
K80 524288 8 6.92 · 102 s 416.80 Tbop/s 3.29 pJ/bop 2.64 · 10−1 kWh 4.80 pJ/bop 3.85 · 10−1 kWh
IEEE Computer Society, 166–176. https://doi.org/10.1109/SFCS.1978.34
[33] Victor Y. Pan. 2018. Fast Feasible and Unfeasible Matrix Multiplication. CoRR abs/1804.04102 (2018). arXiv:1804.04102 http://arxiv.org/abs/1804.04102
[34] Robert L. Probert. 1976. On the Additive Complexity of Matrix Multiplication. SIAM J. Comput. 5, 2 (1976), 187–203. https://doi.org/10.1137/0205016
[35] V. Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13 (1969), 354–356. https://doi.org/10.1007/BF02165411
[36] Leslie G. Valiant. 1975. General Context-Free Recognition in Less than Cubic Time. J. Comput. Syst. Sci. 10, 2 (1975), 308–315. https://doi.org/10.
1016/S0022-0000(75)80046-8
[37] Virginia Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the 44th Symposium on Theory
of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012, Howard J. Karloff and Toniann Pitassi (Eds.). ACM, 887–898. https:
//doi.org/10.1145/2213977.2214056
[38] Virginia Vassilevska Williams and Ryan Williams. 2010. Subcubic Equivalences between Path, Matrix and Triangle Problems. In 51th Annual
IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA. IEEE Computer Society, 645–654.
https://doi.org/10.1109/FOCS.2010.67
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures 37
Table 13. This table presents a best-case comparison between the runtimes of the different algorithms as run on the DGX-1 with
8 GPUs and subproblem size of 131072 for the cubic binary and Boolean algorithms and 65536 for Strassen-like algorithms. The
right-hand-side operand is assumed to be pre-transposed and in the desired basis; the change of basis or the transpose of the
submatrices is not included in the reported times.
n Cubic Boolean Strassen-Winograd Alt.-basis self-inverse Alt.-basis chaining
1024 1.07 · 10−4 s 9.84 · 10−5 s 1.67 · 10−4 s 1.51 · 10−4 s 1.63 · 10−4 s
2048 2.15 · 10−4 s 1.93 · 10−4 s 2.72 · 10−4 s 2.45 · 10−4 s 2.53 · 10−4 s
4096 1.16 · 10−3 s 1.30 · 10−3 s 1.05 · 10−3 s 9.97 · 10−4 s 1.01 · 10−3 s
8192 8.28 · 10−3 s 7.86 · 10−3 s 5.86 · 10−3 s 5.55 · 10−3 s 5.56 · 10−3 s
16384 5.21 · 10−2 s 5.94 · 10−2 s 3.97 · 10−2 s 3.76 · 10−2 s 3.77 · 10−2 s
32768 4.09 · 10−1 s 4.69 · 10−1 s 2.81 · 10−1 s 2.66 · 10−1 s 2.66 · 10−1 s
65536 3.28 · 100 s 3.73 · 100 s 1.97 · 100 s 1.87 · 100 s 1.87 · 100 s
131072 2.65 · 101 s 2.98 · 101 s 3.74 · 100 s 3.33 · 100 s 3.50 · 100 s
262144 6.76 · 101 s 7.51 · 101 s 2.34 · 101 s 1.83 · 101 s 1.83 · 101 s
524288 2.40 · 102 s 2.71 · 102 s 2.66 · 102 s 1.19 · 102 s 1.37 · 102 s
1048576 1.88 · 103 s 2.11 · 103 s 2.61 · 103 s 8.21 · 102 s 9.72 · 102 s
[39] Vasily Volkov. 2016. Understanding Latency Hiding on GPUs. Ph.D. Dissertation. Electrical Engineering and Computer Sciences, University of
California at Berkeley. Technical Report No. UCB/EECS-2016-143.
[40] Henry S. Warren. 2013. Hacker’s Delight (2nd ed.). Addison-Wesley.
[41] S. Winograd. 1971. On multiplication of 2 × 2 matrices. Linear Algebra Appl. 4 (1971), 381–388. https://doi.org/10.1016/0024-3795(71)90009-7
[42] F. Yates. 1937. The Design and Analysis of Factorial Experiments. Imperial Bureau of Soil Science, Harpenden.
[43] Huacheng Yu. 2018. An improved combinatorial algorithm for Boolean matrix multiplication. Inf. Comput. 261 (2018), 240–247. https://doi.org/10.
1016/j.ic.2018.02.006
