We study the problem of multiplying two bit matrices with entries either over the Boolean algebra (0, 1, ∨, ∧) or over the binary field (0, 1, +, ·). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW).
INTRODUCTION
Matrix multiplication is one of the most widely deployed primitives in computing. In a combinatorial context, one often encounters the task of multiplying Boolean matrices with entries in {0, 1}, and with the arithmetic taking place either over the Boolean algebra (0, 1, ∨, ∧) or over the binary field (0, 1, +, ·). Fast algorithms for the Boolean product and the binary product of two bit matrices underlie the fastest known algorithms for a wide range of tasks, such as transitive closure, context-free parsing, and triangle detection [11, 12, 17, 36, 38] .
Our interest in this paper is to engineer high-performance open-source implementations of Boolean and binary matrix multiplication for shared-memory platforms connected to multiple vector-parallel accelerator devices. Such platforms are the state of the art in terms of delivering both (i) speed through extensive parallelism available in the accelerator devices, and (ii) energy-efficiency for moderately large data through shared memory. 1 Configurations of this type are not only powerful individual systems; indeed, shared-memory platforms with multiple accelerator devices are individual compute nodes in larger distributed-memory systems ranging from a few compute 1 For example, a representative configuration of this type is the NVIDIA DGX-1, with eight Tesla V100 SXM2 accelerator devices (in total 40960 32-bit cores running at 1530-MHz boost clock), 512 GiB of shared DDR4-2133 memory, and 3.5-kW peak power consumption. This configuration can execute, at peak, 40960 · 1530 · 10 6 · 32 = 2.01 · 10 15 bit operations per second (2.01 Pbop/s), which at peak power consumption translates to 1.8 · 10 −12 Joules of energy consumed per bit operation (1.8 pJ/bop).
nodes to current leading-edge supercomputers. 2 Thus, scalability to distributed-memory systems starts from carefully engineered shared-memory-accelerated implementations.
To our knowledge, the present paper constitutes the first study of Boolean matrix multiplication that seeks to push the envelope towards peak hardware performance on shared-memory-accelerated platforms simultaneously in terms of speed, energy consumption, and input sizes of up to one terabinary-bit (1048576 × 1048576 = 2 20 × 2 20 = 2 40 bits = 1 Tib = 128 GiB) per operand matrix. 3 For example, given two terabinary-bit square matrices as input, our present open-source implementation computes the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW).
Engineering Challenges
Before outlining our specific contributions in more detail, let us introduce some of the challenges when engineering for peak performance on inputs that nearly occupy the available capacity of shared memory and exceed the total memory available on the accelerators in a platform.
First, the mathematical framework for algorithm design needs fine-grained optimization beyond coarse tools offered by traditional asymptotic analysis. Algorithms for fast matrix multiplication are a canonical example of this phenomenon, where the asymptotically fastest known algorithm designs [8, 10, 22, 37] hide in their asymptotic analysis very large constants, which make these designs infeasible to use in practice; we refer to the recent survey by Pan [33] for an extensive discussion. For bit matrices, a natural first measure for the concrete efficiency of an algorithm design is the number of bit operations executed. For example, the elementary algorithm to multiply two n × n bit matrices uses 2n 3 − n 2 bit operations, and the Strassen-Winograd algorithm [35, 41] for the binary field uses 6n log 2 7 − 5n 2 bit operations when n is a power of two. Already here one observes a leading constant that is three times the leading constant for the elementary algorithm, which necessitates n ≥ 512 before the Strassen-Winograd algorithm uses fewer bit operations than the elementary algorithm. The Strassen-Winograd algorithm is known to be an optimal implementation of matrix multiplication relying on recursive 2 × 2 multiplications [5, 34] , assuming one works in the standard basis. Recently, Karstadt and Schwartz [20] introduce an alternative-basis framework that reduces the number of bit operations of Strassen's algorithm to 5n log 2 7 − 4n 2 , assuming the input and output are represented in an alternative basis. (The transformation of an n × n bit matrix between the standard basis and the alternative basis uses n 2 log 2 n bit operations.) Second, accelerator hardware is designed to give peak performance when executing an identical instruction on each entry of a fixed-length vector of words. Thus, all recurrences of bit operations must be designed to support vectorization and coalesced memory accesses across words of bits. Due to pipelining and long latencies in accelerator hardware (cf. Mei and Chu [25] and Volkov [39] ), recurrences should make use of the available lowest-latency memory (registers) and expose sufficient parallelism to enable a large number of threads simultaneously in execution compared with the number of cores for effective latency hiding. Furthermore, the bottom layer of recursion needs to be engineered for compatibility with the available low-level instruction set, such as vector-shuffles and custom bit operations given by a truth table.
While the word-and vector-level hardware is at the moment rapidly evolving towards integrating increasingly domainspecific hardware units-for example, units that that perform small-size numerical matrix multiplication-the high-level architecture with vectorization, pipelining, and long latencies is likely to remain more stable over time. In particular, this requires designs and engineering for the on-accelerator memory hierarchy that saturates the bandwidth of the dedicated hardware units.
Third, assuming each accelerator can be engineered to perform close to the peak available bandwidth in terms of bit operations per second, the host-level algorithm must be engineered to make effective use of the accelerators. This in particular means feeding the accelerators with input and aggregating their output in comparatively low-bandwidth shared memory across a low-bandwidth interconnect between the accelerators and the host. Our objective of engineering for inputs that are near the capacity of shared memory presents a further challenge in that at the host level we must use designs with a low memory footprint. We expect also these traits to withstand the test of time and thus warrant the present engineering effort.
Our Contributions
Let us now proceed to our specific contributions to address the aforementioned engineering challenges.
Fine-grained optimization of the base design. First, we proceed with a fine-grained optimization of Strassen's algorithm [35] using the alternative-basis framework of Karstadt and Schwartz [20] for the binary field. Essentially, we investigate all possible alternative bases over the binary field, and optimize (i) the number of Boolean operations for the alternativebasis multiplication, (ii) the weight distribution of linear combinations in the alternative basis, and (iii) the number of Boolean operations for changing between bases. Further desirable properties include in-place-computability and self-invertibility of the basis changes, as well as the ability to make chains of matrix multiplications in the alternative basis. Theorem 1.1 (Main, Self-Inverse and In-Place Basis Changes). There exists an alternative-basis bilinear algorithm that multiplies two 2 × 2 input matrices over the binary field using Moreover, the weight distribution in the alternative bases is 1, 1, 1, 1, 2, 2, 2. The basis changes between the standard basis and the alternative bases each use 2 additions, are self-inverse, and can be computed in-place.
Applied recursively in the alternative basis, Theorem 1.1 gives binary matrix multiplication using 5n log 2 7 − 4n 2 bit operations when n is a power of two. Changing between the standard basis and the alternative bases takes 1 2 n 2 log 2 n bit operations and can be done in-place in memory, which presents a minor improvement over the original Karstadt-Schwartz design, but comes at the cost of losing the chain-multiplication property, which is desirable in many applications. The next theorem summarizes a new design that supports chain multiplication but is slightly less efficient than Theorem 1.1 in terms of its weight distributions, which results in slightly less efficiency when aggregating solutions of sub-instances in tight working memory at the host. Theorem 1.2 (Main, Chain-Multiplication). There exists an alternative-basis bilinear algorithm that multiplies two 2 × 2 input matrices over the binary field using (a) 3 additions to preprocess each input, (b) 7 noncommutative multiplications, and (c) 6 additions for postprocessing to obtain the output.
Moreover, the weight distributions are 1, 1, 1, 1, 2, 2, 2 and 1, 1, 1, 1, 2, 3, 3 for taking linear combinations of the operands and the results of noncommutative multiplications, respectively. The basis changes between the standard basis and the alternative basis each use 2 additions, can be computed in-place, and support chain-multiplication in the alternative basis.
The proofs of Theorem 1.1 and Theorem 1.2 are presented in §2 together with a development of the alternativebasis framework using Kronecker products and Yates's algorithm [42] that enables easy parallelization on vectorized accelerator hardware.
Engineering for performance on accelerator hardware. Second, beyond optimizing the base design, we engineer an implementation suitable for vectorized accelerator hardware with extensive bit-operation-and-memory bandwidth but long latencies. Here the key engineering principle is to expose sufficient parallelism to saturate the compute cores with work and hide latency, but not to exceed the available on-accelerator memory. It is well known (e.g. [2, 23] ) that bilinear block-recursive algorithm designs for matrix multiplication enable a tradeoff between (1) parallel processing (executing the recursive block multiplications in parallel, or "breadth-first", using independent memory for each recursive case) and (2) serial processing (by processing the recursive multiplications serially one after another, or "depth-first", reusing memory).
We observe that this tradeoff also applies to alternative-basis algorithms, and illustate its application with a recursive design whose upper levels proceed serially through recursive calls to lower levels executed in parallel, relying on the recurrences underlying Theorems 1.1 and 1.2. The lower parallel levels consist of (i) parallel preprocessing for both inputs, (ii) parallel low-level optimized 64 × 64 bit-matrix-multiplication, and (iii) parallel postprocessing to recover the output. We fine-tune the performance of these levels for the target hardware by (a) merging consecutive levels so that the intermediate results are stored in per-thread registers and (b) working with the widest available per-thread load and store instructions for communicating between the per-thread registers and the on-accelerator memory.
For an n × n binary product for large enough n, these engineering considerations together with Theorem 1.1 and 1.2 enable us to obtain on a single Tesla V100 SXM2 accelerator an empirical effective bit-operation bandwidth that exceeds the theoretical boost-clock peak bandwidth for bit operations. Here by effective bandwidth we mean 2n 3 −n 2 T bit operations per second, where 2n 3 − n 2 is the number of bit operations to compute an n × n binary product with the elementary algorithm, and T is the measured wall-clock running time to compute an n × n binary product. We postpone a detailed review of empirical performance to §4.
Host-level implementation with a low memory footprint. Third, we engineer host-level subroutines that in parallel feed multiple accelerators with recursive subproblems and aggregate the results obtained to the host-level buffers. Here we follow a strategy of using multiple groups of threads on the host CPUs, where each group contains exactly one thread for each accelerator. Each group is responsible for a specific task in a pipeline of tasks, such as preparing subproblems from the host-level input, solving subproblems on accelerators, and integrating the results of subproblems into the host-level output. The threads coordinate their work through standard synchronization primitives such as mutexes and blocking. To maintain a low working-memory footprint in host memory, we use q-fold Kronecker products of the decomposition matrices underlying Theorems 1.1 and 1.2 to produce subproblems and aggregate the sub-results. This strategy in particular benefits from optimization of the weights of the decomposition matrices discussed above.
Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures
For the binary product, this strategy suffices to obtain aggregate effective bandwidth that exceeds the aggregate peak boost-clock bandwidth of the accelerators on inputs whose size is beyond the aggregate memory capacity of the accelerators. In particular, we consider it a very satisfactory engineering outcome as well as benchmark that we can deliver the binary product of two one-terabinary-bit square matrices in less than one thousand seconds and with less than one kilowatt-hour of energy consumption.
Open source. We release our present experimental algorithm implementations as open source 4 to enable further engineering and to communicate precisely the lowest-level design decisions for current microarchitectures.
Earlier and Related Work
From a theory perspective, the study of fast algorithms for Boolean matrix multiplication has proceeded along essentially two lines of study. The first line of study either relies on the binary field directly or embeds the Boolean algebra to a ring, employing techniques for fast matrix multiplication over rings to obtain the result. Asymptotically the fastest known such algorithms run in time O(n ω+o(1) ), where 2 ≤ ω < 2.3728639 is the exponent of matrix multiplication [8, 10, 22, 37] .
While such algorithms are the fastest known in terms of asymptotic efficiency, practical algorithms for fast matrix multiplication over rings rely on recursive tensor techniques using small base tensors (e.g. [2, 4, 6, 9, 13-15, 20, 21, 23, 24] ) or trilinear aggregation-cancellation techniques [31, 32] for a small number of independent matrix multiplications (e.g. [18, 19] ). These practical studies differ from our present work in that they consider numerical (floating-point) matrix multiplication for CPU-based shared-or distributed-memory systems, whereas we optimize for bit matrices and a multiple-accelerator shared-memory system. Furthermore, we rely on an alternative-basis approach optimized for bit matrices, whereas most of the earlier work-with the exception of Karstadt and Schwartz [20] -operates in the standard basis.
A second line of study on Boolean matrix multiplication seeks combinatorial rather than algebraic techniques to obtain competitive algorithms [1, 3, 7, 38, 43] . Currently, the fastest such algorithm runs asymptotically in timê O(n 3 /log 4 n) [43] . However, we are not aware of engineering work to bring algorithms in this second line of study to the computing practice on contemporary parallel architectures.
ALTERNATIVE-BASIS MATRIX-MULTIPLICATION
This section develops the mathematical framework for alternative-basis matrix multiplication over the binary field (0, 1, +, ·). We start by recalling the Strassen-Winograd standard-basis design, and then proceed to introduce two novel alternative-basis designs for use with the binary field as well as establishing our main theorems (Theorem 1.1 and Theorem 1.2). In essence, both new designs are variants of the design in Karstadt and Schwartz [20] , with some further optimization of the designs in particular as pertains to the additive complexity and in-place computability of the basis changes. The rest of the section develops the mathematical framework of fast alternative-basis matrix multiplication with no claim on originality, apart perhaps from an expositionary choice to work with Kronecker products and Yates's algorithm [42] to highlight the symmetries and the easily vector-parallelizable sum-product-sum-layered structure of the framework. As discussed in the introduction, we expect this framework to withstand the test of time with further designs and evolving computing hardware.
The Strassen-Winograd Recurrences
We start by recalling the classical Strassen-Winograd design that works in the standard basis. It will be convenient to first state all algorithms as straight-line programs to highlight their additive complexity. Toward this end, suppose we are to multiply two matrices A = A 00 A 01
A 10 A 11 and B = B 00 B 01
B 10 B 11 (1) with entries over the binary field (0, 1, +, ·).
The Strassen-Winograd design is as follows. First, compute the linear combinations
Then, multiply the linear combinations to obtain the products
Finally, compute the linear combinations of products
In total, this straight-line program makes exactly 15 additions and 7 multiplications. Furthermore, a direct calculation shows that the straight-line program correctly computes the product
A 10 B 00 + A 11 B 10 A 10 B 01 + A 11 B 11 = AB .
An Alternative-Basis Algorithm with Self-Inverse Basis Changes
Let us now turn to alternative-basis designs. We continue to assume that the input is given in 2 × 2 block form, as in (1).
Our first new alternative-basis algorithm is as follows. First, change basis by computinĝ
We observe that the same basis change is applied for both matrices A and B. Next, compute the linear combinationŝ
Multiply the linear combinations to obtain the productŝ
Compute the linear combinations of productŝ
Finally, transformĈ back to the standard basis with
A direct calculation shows that the product (5) is correctly evaluated. This establishes Theorem 1.1 with the exception of the claim on weights, which can be easily verified from (46) in what follows.
Compared with the design of Karstadt and Schwartz [20] , we observe that the basis transformations (6) and (10) over the binary field use only two additions per matrix, whereas their design uses three additions per matrix, but works over an arbitrary ring. Furthermore, our transformations (6) and (10) admit straightforward in-place computation as well as in-place inversion-indeed, both (6) and (10) are easily verified to be self-inverses over the binary field.
Remark. A drawback of the design above is that it does not enable alternative-basis chaining of matrix multiplications in the sense that the transformations (6) and (10) are not inverses of each other, which would be advantageous in applications that seek chain-multiplication. Our next design removes this drawback but loses the appealing self-inverse property.
An Alternative-Basis Algorithm with Chaining
Our second algorithm retains the arithmetic advantage in basis changes over the Karstadt-Schwartz [20] design and works in an alternative basis for the matrix ring. For a matrix in the standard basis X = X 00 X 01
we transform to the alternative basis by computinĝ X 00 ← X 00 ,X 01 ← X 01 ,X 11 ← X 01 + X 11 ,X 10 ←X 11 + X 10 .
The inverse transform from the alternative basis to the standard basis is given by X 00 ←X 00 , X 01 ←X 01 , X 10 ←X 10 +X 11 , X 11 ←X 01 +X 11 .
The algorithm now proceeds as follows. Given two matrices A and B as input, we transform both to the alternative basis using (11) to obtain the matricesÂ andB. Then, compute the linear combinationŝ 
Finally, transform the matrixĈ back to the standard basis using (12) to obtain the product matrix C. A direct calculation shows that C = AB. Furthermore, since the transformations (11) and (12) 
Arrays, Vectors, and Matrices
This section develops preliminaries and notational conventions for working with binary matrices used throughout the rest of this paper. The basic data structure we work with is an array of m entries, indexed by the set [m] = {0, 1, . . . , m−1}.
A tensor is an array with an associated shape
In this case we say that the tensor has d modes and that the mode ℓ = 1, 2, . . . , d has length m ℓ . A tensor with one mode is called a vector and a tensor with two modes is called a matrix.
We index the entries of a tensor T of shape m 1 ×m 2 × · · · ×m d using tuples
with the convention that the tuple refers to the entry indexed by
in the underlying array of length m. In other words, in linearizing the tuple 
In particular, we find it convenient to use the same notation for entries and subarrays of a tensor T , as the structure of the indexing tuple will indicate whether an entry or a subarray is meant.
In what follows let us assume that all arrays have their entries in the binary field (0, 1, +, ·) unless indicated otherwise.
For an s × t matrix µ, write write µ ⊤ for the t × s transpose of µ with entries defined for all i ∈ [s] and j ∈ [t] by the rule µ ⊤ i j = µ ji . For an s × t matrix µ and a t × u matrix ν , we write µν for the s × u product matrix with entries defined for all i ∈ [s] and k ∈ [u] by (µν) ik = j ∈[t ] µ i j ν jk . We write I = I n for the n × n identity matrix. For an s × t matrix µ and a p × q matrix ν , the Kronecker product µ ⊗ ν is the sp × tq matrix with entries defined for all
and ℓ ∈ [q] by the rule (µ ⊗ ν ) ik j ℓ = µ i j ν k ℓ . For an s × t matrix µ, a t × u matrix σ , a p × q matrix ν , and a q × r matrix τ , let us recall the composition rule for Kronecker products
Let us also recall the transposition rule
For two tensors σ and τ of identical shape, the entrywise product σ ⊙ τ has the same shape, with entries defined by the rule (σ ⊙ τ ) i = σ i τ i for all indices i.
The Triple Product Property and Matrix Multiplication
This section recalls the triple product of matrix multiplication and its closure under taking of Kronecker products. For an su × r matrix ζ , an r × st matrix ξ , and an r × tu matrix η, we say that the three-tuple (ζ |ξ , η) has the triple product property with parameters ⟨s, t, u⟩ r if for all i, i ′ ∈ [s], j, j ′ ∈ [t], and k, k ′ ∈ [u] it holds that
For example, the Strassen-Winograd design (2), (3), (4) gives rise to the matrices 
which are readily verified to satisfy the triple product property with s = t = u = 2 and r = 7.
Let (ζ |ξ , η) satisfy the triple product rule with parameters ⟨s, t, u⟩ r . Let A be an s × t matrix and B be a t × u matrix.
Define an s × u matrix C for all i ′ ∈ [s] and k ′ ∈ [u] by the rule
where the second equality follows by changing the order of summation and the last equality follows from the triple product property (19) . Viewing A, B, and C as vectors of length st, tu, and su, respectively, from (21) it immediately
In other words, we observe that a three-tuple (ζ |ξ , η) with the triple product property reduces multiplication of the matrices A and B to (i) taking linear combinations ξ A and ηB of entries of A and B independently, (ii) multiplying these linear combinations pointwise to obtain ξ A ⊙ ηB, and (iii) taking linear combinations ζ (ξ A ⊙ ηB) to obtain the product matrix C of A and B.
The triple product property gains its computational power from closure under Kronecker products and the composition rule. Indeed, let (ζ ℓ |ξ ℓ , η ℓ ) satisfy the triple product property with parameters ⟨s ℓ , t ℓ , u ℓ ⟩ r ℓ for ℓ = 1, 2, . . . , d.
Let A be a vector of length
, and let C be a vector of length
by the triple product property (19) we now have
. . .
Similarly to (22) , from (23) we observe that
In other words, to multiply A and B, it suffices to first compute the two matrix-vector products
then multiply the resulting vectors elementwise
and finally compute the matrix-vector product
to recover the product C of A and B as a vector of length
Yates's Algorithm and Fast Matrix Multiplication
This section develops fast matrix multiplication in the above framework by reduction to Yates's algorithm. The computational bottleneck of the matrix-multiplication algorithm (A, B) → C given by (25), (26) , and (27) occurs with matrix-vector multiplications of the following form. For ℓ = 1, 2, . . . , d, let µ (ℓ) be a matrix of shape b ℓ × a ℓ , and let U be a vector of length a 1 a 2 · · · a d . We want to compute the vector V of length
The key idea is to rely on the composition rule (17) to implement multiplication with µ (1) ⊗µ (2) ⊗· · ·⊗µ (d ) one component matrix µ (ℓ) at a time, via a sequence of sparse matricesμ [ℓ] defined in what follows. Let π : {1, 2, . . . , d} → {1, 2, . . . , d} be an arbitrary permutation that encodes the order in which the matrices µ (ℓ) will be applied. 5 For all k, ℓ ∈ {1, 2, . . . , d},
Let us recall that we write I n for an n × n identity matrix. From the composition rule (17) of Kronecker products, we obtain the decomposition
independent matrix-vector multiplications with the matrix µ (ℓ) . Accordingly,μ [ℓ] is sparse with at most a ℓ b ℓ
nonzero entries. Furthermore, if matrix-vector multiplication with the b ℓ × a ℓ matrix µ (ℓ) has a straight-line-program implementation consisting of P ℓ ≤ b ℓ (a ℓ − 1) binary additions, the total number of binary additions in the algorithm
The algorithm (30) is known as Yates's algorithm [42] for multiplying a Kronecker-product-structured matrix with a given vector.
Yates's algorithm is known to admit highly efficient parallelization. Indeed, since each layer
Yates's algorithm consists of a large number of independent matrix-vector multiplications with the same matrix µ (π (ℓ)) ,
Yates's algorithm admits immediate vector-parallelization (single-instruction-multiple-data parallelization). Furthermore, multiple consecutive layers may be aggregated into one layer to optimize use of local storage such as per-scalar-thread registers in vectorized parallel execution. Here the ability to permute layers arbitrarily yields considerable freedom to optimize both the arithmetic cost as well as the vectorized execution and the use of local storage in applications.
For matrix multiplication, using Yates's algorithm with permutations
to implement the matrix-vector multiplications in (25) and (27) , we obtain an algorithm design that multiplies an
binary additions and r 1 r 2 · · · r d binary multiplications, where
, and P η ℓ are the number of additions in a straightline program that multiplies the matrix ζ (ℓ) , ξ (ℓ) , and η (ℓ) , respectively, with a vector, ℓ = 1, 2, . . . , d.
Fast Alternative-Basis Matrix Multiplication
Let us now review the key idea of Karstadt and Schwartz [20] to change basis to reduce the total additive cost within the previous framework.
) satisfy the triple product property with parameters
where ϕ (ℓ) , ψ (ℓ) , and χ (ℓ) are arbitrary invertible matrices. Applying the decomposition (33) together with the composition rule (17) of Kronecker products, the key observation of Karstadt and Schwartz [20] in the present framework is that the multiplication identity (24) decomposes as
From (34) we can immediately extract the following alternative-basis multiplication algorithm, which we state in a form that relies on Yates's algorithm for multiplying with each Kronecker-product-structured matrix in (34) .
. . , d} be arbitrary permutations. To multiply A and B, first change basis for both inputs by computingÂ
Then compute linear combinations in the new baseŝ
Multiply the resulting vectors elementwiseQ ←T ⊙Ŝ .
Take linear combinations of the productsĈ
Finally change basis
The key insight of Karstadt and Schwartz [20] is that the additive straight-line-program cost of the matrix-vector multiplications with the basis-transformed matrices α (ℓ) , β (ℓ) , and γ (ℓ) may be strictly lower than the corresponding cost for the matrices ξ (ℓ) , η (ℓ) , and ζ (ℓ) . Furthermore, one may engineer the basis changes ϕ (ℓ) , ψ (ℓ) , and χ (ℓ) to such an effect.
In more precise terms, implementing each of the transformations (35), (36), (38) , and (39) using Yates's algorithm (30) with the identity permutation, we obtain that the alternative-basis algorithm uses exactly
binary additions for the basis changes,
binary additions for the linear combinations, and
binary multiplications.
In the diagonal case with r = r ℓ ,
, and χ = χ (ℓ)
for ℓ = 1, 2, . . . , d, and s = t = u with r > s 2 , we obtain that the alternative-basis algorithm uses
binary additions for the linear combinations, and r d binary multiplications. In particular, from (43) and (44) we see that optimizing the additive costs P γ , P α , and P β optimizes the leading constant of the dominant cost for large d.
In the non-diagonal case, more fine-grained optimization may be performed by optimizing the components of the decomposition and the permutations to minimize (40) and (41).
A useful subfamily of alternative-basis algorithms are the alternative-basis algorithms that satisfy ϕ (ℓ) = ψ (ℓ) and
We say that such algorithms support matrix chain multiplication or chaining in the alternative basis. Indeed, for such an algorithm, one may chain multiplications (Â,B) →Ĉ in the alternative basis using (36), (37) , and (38) only, without changing back to the standard basis in between multiplications.
The Two New Algorithms in Matrix Form
Let us first express the new alternative-basis algorithm in §2.2 in matrix form. The change-of-basis matrices are
We readily observe that ϕ = ψ = χ ⊤ and that ϕ = ϕ −1 and χ = χ −1 over the binary field. This self-inverse property and the ease of in-place computation of matrix-vector multiplication with (45) are appealing properties when working with inputs with tight space constraints. From (6) we observe that we can take P ϕ = P ψ = 2 and from (10) we observe that we can take P χ = 2. The alternative-basis decomposition is defined by the matrices 
We readily verify that matrix-vector multiplication with the matrices α and β can be implemented using the recurrences (7) so that P α = P β = 3, and matrix-vector multiplication with the matrix γ can be implemented with P γ = 6 using (9) . We observe that the three-tuple (χγ |αϕ, βψ ) with ϕ, α, β, and γ given by (45) and (46) satisfies the triple product property (19) with parameters ⟨2, 2, 2⟩ 7 .
Let us now turn to the alternative-basis algorithm in §2.3. The change-of-basis matrices are
We observe that χ = ϕ −1 = ψ −1 , so we have an alternative-basis algorithm that supports chaining. From (11) we observe that P ϕ = P ψ = 2. From (12) we observe that P χ = 2. The alternative-basis decomposition is defined by the matrices 
We readily verify that matrix-vector multiplication with the matrices α and β can be implemented using the recurrences (13) so that P α = P β = 3, and matrix-vector multiplication with the matrix γ can be implemented using (15) so that P γ = 6. Finally, we have that the three-tuple (χγ |αϕ, βψ ) with ϕ, ψ , χ , α, β, and γ given by (47) and (48) satisfies the triple product property (19) with parameters ⟨2, 2, 2⟩ 7 .
IMPLEMENTATION ENGINEERING
This section describes our algorithm engineering effort in more detail, with a focus on implementing the binary product of two matrices; our implementation of Boolean products is described more concisely at the end of this section. We strive for generality of exposition in terms of the target platform and in terms of enabling the use of future advances in specific alternative-basis decompositions beyond our new decompositions presented in §2.2 and §2.3.
The high-level exposition withstanding, our goal with the present engineering effort is that a careful low-level implementation of our engineering framework will be able to utilize specific hardware configurations across a range of configurations efficiently, which we seek to demonstrate in our experiments across a range of current platforms in the subsequent section. Furthermore, we would like to highlight that detailed low-level parameterization and specific optimizations for current target platforms can always be found in the accompanying open-source release.
The Family of Target Platforms
Let us recall from the introduction that we seek generality towards a family of target platforms that consist of a single shared-memory compute node (the host) equipped with N independent and identical vectorized accelerator devices (the accelerators), each joined to the host by a low-bandwidth interconnect compared with the bandwidth available at each device.
While the detailed parameters of such a configuration are expected to vary and evolve over time, this overall topology of the configuration-that is, a large-capacity shared-memory host joined by a low-bandwidth interconnect to N independent high-bandwidth accelerators with restricted memory capacity and vectorized parallelism based on thread arrays-can perhaps be expected to remain more stable over time and thus warrant engineering attention with the goal of providing a design that can be fine-tuned to the varying parameters of each specific configuration. A modern typical configuration of this type is a one-or-more-CPU-based shared-memory host joined by a peripheral component interconnect to N GPU accelerators.
A further specific goal in our engineering effort is to enable consideration of input sizes that are close to the shared-memory capacity of the host. Accordingly, our present design at the host level has been structured for a low shared-memory footprint in terms of working memory, but with the understanding that, memory permitting, a faster design can be obtained by following a series-parallel strategy at the host, similarly to the present framework employed at each accelerator.
One further design choice in the present framework is that we assume that the accelerators work independently and asynchronously, apart from synchronization enforced by the host. This choice scopes out platforms where the accelerators may be joined to each other by a fast interconnect; indeed, making use of such an interconnect would require at least partial synchronization between the accelerators.
Accelerators and Thread Arrays
Let us now review our more detailed conventions for working with components of a target platform, starting with the accelerators. We assume the accelerator devices are vector processors designed to execute a large number of threads in parallel, working asynchronously with one or more arrays of data residing in accelerator memory. Each of these data arrays has a shape, which makes it convenient to assume that the array of threads working on the data has a shape to structure the workload.
More precisely, we say that a parallel workload of
This assumption enables us to index a thread t = 0, 1, . . . , L − 1 inside a thread array of volume L alternatively by its linear index t or by the unique tuple
for compatibility with our convention for linearization of data arrays (16) . Workloads based on thread arrays are also essentially immediately translatable to modern GPU platforms.
With current and envisaged future platforms in mind, a key property for a thread-array based design is that of coalescence. We say that a thread array of shape L 1 × L 2 × · · · × L r is coalescent up to c modes if any two threads threads whose tuple-indices agree in all but least significant c modes L r −c+1 , L r −c+2 , . . . , L r execute an identical stream of instructions; furthermore, whenever an instruction in the stream is a memory access, either a single address or consecutive addresses of words in accelerator memory are accessed across consecutive threads in the c least significant modes. Here it is important to note that we do not assume that the threads in a thread array execute synchronously, although the underlying hardware in most cases has a vectorized structure that is optimized for synchronized execution of the workload in groups of V coalescent threads for a hardware-dependent parameter V .
Further engineering principles for accelerators include the following:
(1) work with a large enough L to expose sufficient parallelism to saturate the accelerator hardware; (2) seek coalescent execution by careful design and ordering of the modes of the thread array and the modes of the data tensors; and (3) make sure each thread in the array works with enough local data to make use of low-latency and high-bandwidth storage available to each thread or groups of threads, for example, in the form of per-thread registers and per-thread-group cache memory.
The Host and Coordinating the Work
The role of the host is to coordinate the work of the N accelerators through multithreading and synchronization primitives at the host. Furthermore, our assumption is that the input and the output require memory capacity in excess of what is available at the accelerators, thus requiring the host to prepare workloads for the accelerators and aggregate the results obtained from the accelerators. We also recall that we are seeking a design that has a low working-memory footprint at the host to enable processing of large inputs.
Our engineering principles for the host-side implementation include the following:
(1) design for efficient use of the host-side memory hierarchy; each host-thread works with at least one cache line of data at a time, utilizing the cache hierarchy available; (2) the N accelerators are coordinated asynchronously in parallel so that each accelerator has its own pipeline implemented using host-side threads, with synchronization on buffers and submatrices of the output to avoid data races; (3) each of the N accelerator-pipelines involves four stages, implemented using dedicated host-threads: (i) prepare left input to accelerator in host memory, (ii) prepare right input to accelerator in host memory, (iii) upload input from host memory to accelerator, compute at accelerator, download result from accelerator to host memory, and (iv) aggregate the result in host memory to output; (4) the accelerators must be supplied with extensive-enough workloads if possible to hide the running time of the other stages of the pipeline behind the accelerator-side compute on the workloads; (5) the accelerator memory and the available low bandwidth between the host and the accelerators constrain the size of the workloads, which we ease by adopting a series/parallel approach on the accelerators; and (6) the memory budget at the host is tight due to our goal of scaling to large inputs and the need to buffer of workloads for N accelerators in host memory-we use low-memory-footprint designs to prepare and aggregate the workloads, including in-place basis changes and other primitives help with the memory budget.
The High-Level Design
We are now ready for a high-level exposition of the algorithm design engineered for our target platforms. Once the high-level description is done, we proceed to review more detailed implementation considerations. We will use the alternative-basis framework of §2.
Let d be a positive integer that represents the depth of the algorithm design in terms of the number of three-tuples of matrices that satisfy the triple-product property (19) used to structure the arithmetic. The exact implementation of this arithmetic will vary somewhat from the ideal design (35) , (36) , (37) , (38) , and (39) given in §2, in particular due to storage and bandwidth considerations both at the host and at the accelerators, and due to the need to supply independent and asynchronous parallel workloads to each of the N accelerators. Each accelerator will in turn work both serially (recursively) and in parallel with its specific subproblems delegated to it by the host.
We will split d according to the high-level layers of the design. We work with four layers:
(
It will be convenient to indicate the layers in our notation with the symbols "h" (host), "s" (accelerator serial), "p" (accelerator parallel), and "i" (inner). Accordingly, we have
where the nonnegative integers
, and d (i) indicate the number of decomposition matrices employed at each high-level layer.
For each type of layer t ∈ {h, s, p, i} and each level ℓ = 1, 2, . . . , d (t) inside a layer, let the three-tuple
satisfy the triple product property with parameters s
, where χ (t, ℓ) , ϕ (t, ℓ) , and ψ (t, ℓ) are invertible matrices.
Our task is to multiply a given matrix A ′ of shape s × t with a given matrix B ′ of shape t × u to yield as output the product matrix C ′ = A ′ B ′ of shape s × u, where
(50)
The following subsections will expose the more detailed algorithm design. We will first present the flow of computation from the host to the accelerators and back to the host one layer at a time, and then present in detail how the host coordinates its own work and the work of the accelerators through host-side threading and appropriate synchronization primitives. Finally, we will parameterize the algorithm implementation.
Our expositionary focus will be on the implementation of the alternative-basis phase (36), (37) , and (38) of the algorithm, with less attention and optimization effort devoted to the pre-and-postprocessing phases of data permutation and the basis changes at the host; the latter can be found in the accompanying source code.
Linearization and indexing conventions recalled. At this point it is convenient to recall that we tacitly employ the first-index-major linearization convention (16) when linearizing bit arrays to words of memory; that is, changing the most significant (leftmost) indices in a tuple of indices causes the largest stride in memory addressing, and the least significant (rightmost) indices will index bits inside a word of memory as appropriate. We follow a similar convention when linearizing thread arrays on accelerators (49); that is, hardware vectorization occurs with the least significant (rightmost) indices of a tuple indexing a thread in an array of asynchronous threads that implements a workload on an accelerator. We assume that the accelerator hardware computes with vectors of V synchronous threads with consecutive linear indices and that each thread can load and store W -bit words.
Input Permutation at the Host
To obtain coalesecent execution of the layered design on vectorized accelerator hardware, it will be convenient to permute the input matrices A ′ and B ′ from the s × t and t × u layouts with (50) into the interleaved-layout vectors A and B of lengths s
and t
respectively. This permutation is executed in host memory, using appropriate parallelization, vectorization primitives, and cache-blocking at the host for efficiency. For uniform implementation of Boolean and binary matrix multiplication, our low-level implementation also transposes the right-hand side operand at this point by transposing the roles of t and u in (52). The innermost transpose is implemented for 64 × 64 submatrices recursively using word operations on 32-bit words, cf. [40] .
Input Basis Change at the Host
After input permutation, we execute basis changesÂ ←φA andB ←ψ B at the host. Both basis changes are executed in parallel using Yates's algorithm (30) on d levels. The innermost d (i) levels employ the identity basis change and thus are omitted; indeed, we work with the standard-basis cubic multiplication algorithm in the accelerator inner layer in §3.10, so no basis changes will be required.
Instance Generation at the Host
We now proceed to describe the sub-instances that the host generates and forwards to the N accelerators for processing.
In particular, we assume that the alternative-basis inputsÂ andB have been prepared and reside in shared memory for parallel access by multiple host-threads working in parallel to keep all the N accelerators saturated with work; we postpone the precise description of the host-side threading and synchronization to §3.16.
In essence, the sub-instances are generated fromÂ andB as the sub-arrays obtained by aggregating the linear combinations at the d (h) first levels of the base design (36) at once, which results in increased arithmetic cost compared with (36) implemented using Yates's algorithm, but keeps the working memory needs low at the host, as per our engineering goal to scale up to large inputs on a single host.
In precise terms, the sub-instance indexed by
We compute these vectors by aggregating the sums of subvectors in (53) as they are stated, taking care to not consider subvectors whose associated α-product (respectively, β-product) in (53) is zero. At low level, subroutines for performing the XOR and memory copy operations on subvectors have been handcrafted to make use of the 256-bit AVX2-registers in CPU hardware. The routines also perform reads and writes of 512 bits at a time, which corresponds to the length of cache lines on the target CPU. We leave these low-level implementation details to the accompanying source code.
In total thus r
d (h) sub-instances are generated and forwarded for solution by the N accelerators. Such solving of a sub-instance (T
) by an accelerator consists of (i) uploading the sub-instance to one of the accelerators, (ii) having the accelerator compute the solutionQ
, and (iii) downloading the solution from the accelerator to shared memory at the host. We next describe how each accelerator proceeds to solve a given sub-instance, and then describe how each downloaded solutionQ
is aggregated to the master solutionĈ at the host.
Accelerator Serial Layer

Suppose the sub-instance (T [h]
) has been uploaded to an accelerator. The accelerator then proceeds to solve the instance using three layers. First, a serial layer works with the instance recursively, at the bottom of the recursion switching to parallel work with three parts, the expanding parallel layer, the inner layer, and the compressing parallel layer. We start with a description of the serial (recursive) layer. Let the uploaded instancê
be the input to the serial layer. The serial layer consists of d (s) levels. 
Each recursive invocation at level
and makes the recursive invocation with index (h 1 , h 2 , . . . , h ℓ ) and input consisting of the vectorsT 
The array-arithmetic in (54) and (55) is implemented with thread arrays on the accelerator so that each thread produces one W -bit word of output using optimized straight-line programs for matrix-vector multiplication with the α, β, and γ matrices.
Accelerator Expanding Parallel Layer
We now turn to the layer that implements Yates's algorithm with each of the α and β matrices, as follows. Each input to the expanding parallel layer arrives from the bottom level of recursion in the serial layer. Let us writeT [p,0] andŜ [p,0] for the input vectors to the parallel layer. These vectors have lengths
respectively. We describe the processing of the inputT [p,0] only, with the understanding that the processing ofŜ [p,0] is similar but utilizes the β matrices in place of the α matrices.
Let us writeᾱ [p] for the matrix
and decompose the matrix using Yates's decomposition (28) with respect to the reversal permutation as
The parallel layer proceeds to computeᾱ [p]T [p,0] by a sequence of matrix-vector multiplicationsT
. Each multiplication in the sequence is implemented with vectorization using W -bit words 6 and parallelization via a thread array of
threads working asynchronously, so that each thread loads s words to the arrayT [p, ℓ] . We illustrate the design in pseudocode with Algorithm 1, where we assume that w is a nonnegative integer with W = s
, and that the parameterization of the inner layer is such that such a w exists; this will be the case in what follows.
The output of the layer consists of the expanded vectorsT
and r
respectively, which are in turn given as input to the inner layer.
Accelerator Inner Layer
The inner layer is the most performance-critical part of the design since it works with the expanded vectors and thus with the most data aggregated over the execution of the algorithm. The inner layer takes as input two vectorsT [i] andŜ [i] . The output of the layer is the product
We tacitly assume in what follows that W divides the product s
. Indeed, W is in most cases a power of two, and the product can be similarly chosen to be a power of two by careful design of the inner layer.
Algorithm 1
The accelerator expanding parallel layer illustrated in pseudocode. This pseudocode illustrates the procedure implemented using a thread array for level ℓ = 1, 2 
for (i
[[ Implemented with word-bit-operations using an optimized straight-line program for multiplying the r
with the s
, such as (7) or (13) . ]]
7:
for h
end for 10:
end for 11: end procedure
As the inner layer works with the least significant bit positions of the bit vectors, in most cases the implementation of the inner layer is hardware-specific and amounts to making the best possible use of the available bit-and word-operations in the instruction set as well as the available vectorization and associated vector-shuffling instructions.
For example, with W -bit words and length-V hardware vectorization of thread arrays with M 2 = W V , one possibility to implement the inner layer is to perform r
independent M × M binary matrix multiplications in parallel using the elementary cubic algorithm implemented with word operations and a thread array of r
so that each thread works with one W -bit-word-size fragment from each of the M × M operand matrices and the result matrix, with hardware V -vector shuffle instructions used to communicate words between threads. This is essentially how our low-level implementation of the inner layer is structured, though some optimizations remain to be discussed in §3.15. This part of the framework is also the most sensitive to changes in underlying hardware, and so is perhaps the least likely to withstand the test of time.
Accelerator Compressing Parallel Layer
Once the inner layer is complete, its outputQ [i] is given as input to a further parallel layer that implements Yates's algorithm with the γ matrices, as follows. Let us writeQ [p,d (p) ] for the input vector to the compressing parallel layer.
Similarly to the expanding layer, let us writeγ [p] for the matrix
. and decompose the matrix using Yates's decomposition (28) with respect to the identity permutation as
The parallel layer proceeds to computeγ
Each multiplication in the sequence is implemented with vectorization using W -bit words and parallelization via a thread array of
threads working asynchronously, so that each thread loads r to Algorithm 1, we assume that w is a nonnegative integer with W = s
. The output of the layer consists of the compressed vectorQ [p,0] of length
which is in turn given as output to the serial layer.
Instance Aggregation at the Host
The aggregation of a solution obtained at an accelerator proceeds as follows at the host, again with a focus on obtaining a design with a low working memory footprint in host memory at the price of increased arithmetic cost. Before any solutions are accepted, the output vectorĈ is initialized to all-0 values. Suppose that the solutionQ
has been downloaded from an accelerator to host memory. We then execute the following aggregation procedure. For each
aggregate the solution to the output bŷ
When multiple host-threads execute aggregation in parallel, appropriate synchronization primitives need to be employed to avoid conflicts between threads when executing (56). This will be described in detail in §3.16.
Output Basis Change at the Host
After the alternative-basis outputĈ has been aggregated, we execute the basis change C ←χĈ at the host, in parallel using Yates's algorithm (30) on d levels. In our implementation, the innermost d (i) levels employ the identity basis change (since the accelerator inner-layer uses the standard-basis cubic multiplication algorithm) and thus are omitted. parallel for thread (h
[[ Implemented with word-bit-operations using an optimized straight-line program for multiplying the s
with the r
, such as (9) or (15). ]] 7 :
Output Permutation at the Host
The last layer of the framework permutes the interleaved-layout vector C of length
to an s × u output matrix C ′ with s and u as in (50). This completes the description of the layers of the algorithm design.
Further Optimization by Merging Levels and Shifting Levels Between Layers
The previous framework can be further optimized at the accelerators by making more extensive use of the registers local to each thread in a thread array, assuming such registers are available in sufficient quantity. This section documents two techniques towards this end.
Merging levels. Assuming the parameters r
of the expanding levels ℓ = 1, 2, . . . , d (p) in the expanding layer (cf. §3.9) are small enough, one may utilize the per-thread registers and compute two consecutive levels, ℓ and ℓ + 1, instead of only one level ℓ. In this case one works with a thread array of
words from accelerator memory to its local registers, then multiplies word-bit-parallel in registers with W copies of the matrix α (p, ℓ) ⊗ α (p, ℓ+1) using optimized straight-line programs for both component matrices (that is, one implements with register arithmetic Yates's algorithm using straight-line programs for α (p, ℓ) and α (p, ℓ+1) for the levels ℓ and ℓ + 1 computed in registers), and finally saves r words to accelerator memory. Similar merging of levels may be used for the compressing layer (cf. §3.11). Our low-level implementation uses this two-level merging for pairs of levels closest to the inner layer. In particular one wants to optimize at levels closest to the inner layer because these layers process the longest vectors and thus do the most work.
If more per-thread registers are available, more consecutive levels can be merged and computed in registers.
Shifting levels between layers. The inner layer can use per-thread registers more efficiently by shifting expanding/compressing levels to the inner layer from the expanding/compressing layers. Assuming that the parameters of the expansion/compression are r
, and that the inner layer originally implements M × M standardbasis matrix multiplication (cf. §3.10) using one W -bit-word per thread, the inner layer with shifting (i) first loads s ℓ t ℓ + t ℓ u ℓ words to each thread, (ii) expands in registers using straight-line programs for α (p, ℓ) and β (p, ℓ) to r ℓ + r ℓ words of operands per thread, (iii) executes in registers r ℓ independent M × M matrix multiplications (essentially repeating the original inner layer r ℓ times in registers) to obtain r ℓ words of expanded results, and (iv) finally compresses to s ℓ u ℓ words of results per thread using a straight-line program for γ (p, ℓ) and writes these results to accelerator memory. Our low-level implementation follows this strategy of shifting one level of expansion/compression to the inner layer to make more efficient use of per-thread registers and to perform less accelerator-memory transactions at the inner layer.
Coordinating the Workload
This section describes how the entire workload for the host and the N accelerators is coordinated at the host using host-threads and appropriate synchronization.
Let us start by recalling that the top-level alternative-basis workload (Â,B) →Ĉ consists of r
instances that need to be (i) generated fromÂ andB on the host, (ii) solved in an accelerator, and (iii) the solution aggregated toĈ on the host. This suggests processing the workload with N essentially independent pipelines implemented by one or more host-threads each so that each pipeline is responsible for keeping exactly one accelerator busy with work. Synchronization between pipelines is also required because the aggregation steps (56) of different pipelines can access and update the same subarray ofĈ, which can lead to data races unless synchronization is used to ensure serial updates to each subarray.
Our strategy is to implement the pipeline for each accelerator ℓ ∈ [N ] using three buffers T ℓ , S ℓ , Q ℓ in host memory and four host-threads: one thread to prepare the left input T ℓ in host memory, one thread to prepare the right input S ℓ in host memory, one thread to multiply T ℓ , S ℓ on accelerator ℓ to obtain the result Q ℓ (including uploading the input and downloading the result), and one thread to aggregate the result Q ℓ to the resultC.
Pseudocode for coordinating work at the host is given in Algorithm 3. This workload of 4N host-threads is deadlockfree due to the design of the pipeline-blocking is needed to prevent race conditions occurring on the arrays T ℓ , S ℓ , Q ℓ , the accelerator devices, and subarrays of the arrayĈ, but each pipeline will always be flushed eventually. We also observe that the order in which the subproblems (h 1 , . . . , h d ( h) ) are processed is insignificant as long as the order is well-defined, which is guaranteed by the locks on subarrays ofĈ. Here we have presented a 4N -thread design with a single 4-thread pipeline for each of the N accelerators; the design can be easily extended to multiple pipelines competing for a single accelerator device to better saturate the accelerators with data as appropriate.
Parameterizing the Implementation
Let us now turn to the detailed parameterization of the framework. Our current implementation uses the framework in a diagonal setting based on either of the two ⟨2, 2, 2⟩ 7 alternative-basis decompositions developed in §2.2 and §2.3, Algorithm 3 Procedure for coordinating work at a host joined to N accelerators. The input arrays areÂ andB and the output array isĈ. The workload is formed of 4N threads. Each thread is associated with one of the N accelerators and a specific part of the pipeline for this accelerator. Let T ℓ for ℓ ∈ [N ] be an array of shape s
Let S ℓ for ℓ ∈ [N ] be an array of shape t
Let Q ℓ for ℓ ∈ [N ] be an array of shape s
InitializeĈ as zero 6: parallel for thread t ∈ [4N ] do 7:
Block until T ℓ is free 11: Initialize T ℓ as zero 12: for
end if 16: end for 17: Mark T ℓ occupied 18: else if N ≤ t ≤ 2N − 1 then
19:
Block until S ℓ is free 20: Initialize S ℓ as zero 21 :
end if
25:
end for 26: Mark S ℓ occupied 27: else if 2N ≤ t ≤ 3N − 1 then
28:
Block until T ℓ and S ℓ are occupied and Q ℓ is free 29: Upload T ℓ and S ℓ to accelerator ℓ
30:
Mark T ℓ and S ℓ free 31: Multiply T ℓ and S ℓ on accelerator ℓ to obtain the result Q ℓ
32:
Download Q ℓ from accelerator ℓ 33: Mark Q ℓ occupied 34: else if 3N ≤ t ≤ 4N − 1 then 35: Block until Q ℓ is occupied 36 :
Acquire the lock forĈ
Release the lock forĈ respectively. Accordingly, we set
When more efficient decompositions are discovered, these can be immediately used in the present framework.
The present accelerator hardware has V -length vectorization of threads with V = 32, and the maximum word length supported per thread for memory transactions is W = 128 bits. 7 We structure the inner layer to work with M × M matrices, M = 64, so that M 2 = VW and the M × M multiplication can work with instructions for shuffling data between threads across each V -length vector of threads in a thread array. Accordingly, we let the inner layer consists of a single level with between the host and the accelerators-large subinstances and their solutions take more time to transfer over the interconnect, but on the other hand take more time to solve on the accelerator, leaving the interconnect idle and thus enabling the hiding of transfer latency. When working with inputs close to the memory capacity of the host as per our engineering goal, the memory capacity at the host is the primary constraint that determines how many levels need to be performed at the host layer.
(2) The accelerator serial layer in its d (s) levels enables each accelerator to accept larger inputs than would be otherwise possible by resorting to the accelerator parallel layer only. That is, because of limited memory available at the accelerator, the data-expanding/-compressing accelerator parallel layers can only be invoked on input sizes that tolerate a constant-factor expansion in the size of the data at each level, up to the inner layer. The serial layer is more space-efficient by working through the r (s) ℓ cases at each level recursively and serially, reusing space, but becomes time-inefficient when the size of a case becomes so small that its parallel processing does not saturate the parallel hardware at the accelerator.
(3) The accelerator parallel layer should consist of sufficiently many levels d (p) to saturate the parallel hardware of the accelerator and to enable utilization of the optimizations in §3.15. 7 The latter implemented with instructions that work with four consecutive 32-bit words.
The setting of the parameters d (h) , d (s) , and d (p) that gives the fastest overall performance may be done by first identifying the memory-capacity-induced constraints on the parameters and then, subject to these constraints, empirically finding the best parameter combination that optimizes performance among the relatively few choices that remain. the host layer using a ⟨2, 2, 2⟩ 7 -decomposition. On an input of shape 65536 × 65536 = 2 16 × 2 16 , we parameterize the layers on the accelerators so that each accelerator performs two levels in the accelerator serial layer (d (s) = 2) using a ⟨2, 2, 2⟩ 7 -decomposition, and then proceeds with the accelerator parallel layer. That is, the parallel layer is started with inputs of shape 16384 × 16384 = 2 14 × 2 14 (32 MiB per matrix).
Boolean Multiplication
We also provide reference implementations for both Boolean and binary multiplication using the elementary cubic algorithm distributed to the accelerators. For an s × t by t × u multiplication to yield an s × u product, the N accelerators each work with subproblems of size s × t by t × u such that s divides s, t divides t, and u divides u. That is, the total workload of volume stu (in precise terms, stu +s(t −1)u bit operations) gets executed on the accelerators so that each unit of work for an accelerator has volume stu. Each accelerator executes a thread array of shape
W such that the least significant M 2 /W threads are responsible for aggregating one M × M block of the s × u output, with low-level vectorization and vector shuffle operations used for each work unit of volume M 3 . Empirically, s = t = u = 131072 and M = 64 give best parameterization for our target hardware with W = 128 and V = M ·M W = 32. Coordination at the host is simplified compared with Algorithm 3; namely, we partition the s × u result matrix to disjoint segments of size s × u, and set up threads on the host so that a single thread is responsible for each segment, which eliminates the need for synchronization between threads to guard against race conditions. In addition, we set up each host thread so that it integrates each s × u subresult it has downloaded from an accelerator in parallel when the next subproblem is being solved on the accelerator; this effectively hides the latency of integrating each subresult because the time to solve each subproblem is greater than the time to integrate the result.
EXPERIMENTS
This section describes the experiments we ran to evaluate the performance of the framework described in the previous section. In particular, our goal is to observe that a careful implementation of the framework enables efficient use of current multiple-accelerator systems at input sizes that are close to the host memory capacity of the entire system.
The implementation was written in C++ [16] , and the accelerator device routines were written in CUDA C [28] . Hostlevel parallelization was prepared using the OpenMP API [30] . We ran three sets of experiments: (i) alternative-basis binary matrix multiplication, (ii) classical binary matrix multiplication, and (iii) Boolean matrix multiplication.
The single-accelerator version was evaluated at all powers of two until the subproblem size for the multiple-accelerator case. The multiple-accelerator case was evaluated up to one terabinary-bit (1 Tib) input size (n = 2 20 ).
Hardware configuration
The target hardware for our implementation was an NVIDIA DGX-1 system with eight NVIDIA Tesla V100 accelerators, 512 GiB of memory, and 2 × 20 CPU cores. We also ran experiments on two less powerful configurations: One with accelerators, 128 GiB of memory, and 2 × 6 CPU cores. A more detailed description of the hardware configurations is given in Table 1 . Table 2 lists technical details of the accelerators: the number of cores, boost clock speed, and the peak performance that presents an upper bound on the number of bit operations (bops) that could be theoretically achieved by simultaneous use of all the accelerators in each configuration. In practice, this upper bound is unattainable due to thermal effects and the need to coordinate data across the storage hierarchy ranging from per-thread registers to the global memory of each
accelerator. Yet this upper bound presents an uncompromising benchmark against which to measure the performance of the actual implementation.
The power consumption of the systems used in our experiments is listed in Table 3 . The table lists the nominal power consumption of the CPUs and GPUs, and the system power intake as indicated in the basic system documentation of each vendor. While one may perhaps expect the CPUs and GPUs to operate essentially at their maximum power, the given system power intake is perhaps expected to be higher than the actual power usage. While we would like to be able to measure the actual energy consumption by each system and its components during the computation more precisely, such instrumentation has been unavailable to us, so in the present experiments we resort to using the tabulated power values as the mean power over the duration of each computation, which then gives us a crude approximation of the energy used by each computation. In particular, we would like to highlight the importance of implementation engineering for low energy consumption as an important goal beyond the present study, with the hope of having in the future available more fine-grained measurement tools.
Finally, to record a rough benchmark of the performance of the different system components in our main target configuration, measurements on the bandwith for transferring data within host memory, and between the accelerator device and the host system are shown in Tables 4 and 5 , respectively. 
Results
Runtimes, effective bit operations per second, and energy consumption at different sizes of input are shown in Tables 8,   9 , 10, 11, and 12. All values shown are medians over five consecutive runs. Since the instrumentation does not permit us to measure the actual power use, two different values are shown for the energy consumption: one is computed assuming full CPU utilization with nominal CPU wattage and the maximum wattage of a single GPU accelerator times the number of GPUs in use, and the other by simply taking the maximum power intake of the entire system in account. We expect the actual energy consumption to lie somewhere between these two values, since the former ignores macroscopic Table 7 . Scalability of running times for the various change-of-basis routines. The forward columns show the running times for the forward transformations (that is, corresponding to ψ and ϕ) whereas the inverse columns show the inverse transformations (corresponding to χ ). For the chaining variant, measurements were made to confirm that accounting for transposed data in the right-hand operand has no effect on the running time. The procedures have been crafted using AVX2 intrinsics and to work with 512-bit cache lines on the DGX-1 host hardware. All runtimes reported are the median of five repeats on the DGX-1 node. factors such as memory and cooling, and the latter is simply the maximum amount of power that the entire system can supply.
In Tables 8 and 9 , given runtime T , the exact bit operation count of 2n 3 − n 2 is used to compute (2n 3 − n 2 )/T , the number of individual bit operations performed in a second on average. The same number is also used in Tables 10, 11 , and 12 to compute the effective bit operation count; that is, to highlight the relative difference in performance, we show how many elementary bit operations per second a classical implementation operating on individual bits would have to perform to achieve the same wall-clock runtime as the alternative-basis design.
The times reported here do not include the time required for data permutation or change of basis operations at the host; it is assumed that the data is already in the desired format. Table 7 shows the runtime of an implementation of the change of basis with AVX2 vector-intrinsics on the DGX-1 host. It can be seen that, in the case of the best-performing algorithm (with self-inversion), the forward-transformation takes approximately 34.5 seconds per 1-Tib input matrix, and 44.1 seconds for the inverse transformation of the result. The favorable difference is due to the structure of the transformation which means that only 1 of the 4 subarrays needs to be modified at any level of recursion. Also, to preserve linearity of access in memory, we assume that the right-hand-side operand is transposed. For Strassen-like algorithms, this needs to be only done for the 64 × 64 submatrices; we implemented a parallelized, recursive transpose function using word-operations that can perform the transpose for all 2 28 submatrices of a 1-TiB input matrix in 7.99 s (median of five consecutive runs, see Table 6 for scalability).
With two 1-Tib input matrices, on the DGX-1, the classical cubic algorithm runs for 1880 and 2110 seconds, and performs over 1.23 and 1.09 peta-bit-operations per second, in the binary and Boolean cases, respectively. This means we are able to achieve over 50 % of the theoretical maximum peak performance of 2.01 Pbops per second. At the assumed level of power used, this means that one such multiplication takes at most 1.83 and 2.05 kWh of electricity, or 2.85 and 3.20 pJ/bop, in the binary and Boolean cases, respectively. In both cases, a single V100 accelerator processes a 131072 × 131072-subproblem in just under 30 seconds, meaning that the runtime obtained here is accelerator-bound.
The time required for data upload and download, and result integration of individual subproblems is in the order of seconds, making it negligible in comparison and effectively hidden by the host-side pipeline design.
Due to their limited memory, the P100 and K80 systems can only be evaluated up to n = 524288. However, otherwise the results are very similar to those of the DGX-1, apart from worse energy consumption and slower processing power, owing to the older hardware in use.
As an interesting artifact of the low-level implementation, we see in the Tables 8 and 9 that, with the V100 and P100
accelerators, the Boolean version is slightly slower than the binary version of the algorithm, even though the only differences in code are with the replacement of the popcount-parity check with the nonzero test, and the replacement of XOR with OR. Furthermore, the difference is inverse with the K80 accelerator. This artifact appears to be due to the optimizations performed by the NVIDIA compiler: On the P100 and the V100, the compiler compresses several bitwise operations into one LOP3.LUT instruction, or arbitrary trinary logical instructions, and for one reason or another, the sequence is slightly more efficient in the XOR case. However, these instructions are not available on the K80.
For binary multiplication, the Strassen-Winograd design in standard basis executes in 2610 seconds, the alternative basis with self-inverse in 821 seconds, and the alternative basis with chaining in 972 seconds, on 1-Tib inputs on the DGX-1, and achieve 0.8, 2.8, and 2.3 effective Pbops per second, respectively. The best alternative-basis variant exceeds the theoretical peak hardware performance of the elementary algorithm by more than 30 %. One alternative-basis multiplication takes less than 0.8 kWh of electricity, or less than 1.26 pJ/bop. This is less than one half of that required by the cubic multiplication algorithm. Furthermore, considering that a subproblem of size n = 65536 can be solved by an accelerator in 1.87 seconds, this means that the available host-side bandwidth forms the bottleneck (cf. Tables 4 and   5 ). Indeed, if subproblem construction and aggregation as well as the data transfer over the PCIe link were completely free, the multiplication could be executed in 7 4 8 × 1.87 s ≈ 560 s. This in particular highlights the need to carefully engineer workloads and their pipelining both at the host and at the accelerators to obtain balanced overall performance from a multi-accelerator system. Tables 11 and 12 also show that the present approach generalizes to alternative multiple-accelerator configurations, even though the more limited memory on the K80 accelerators requires a smaller subproblem size. Table 8 . Scalability of the running times, the bit operations per second, and the energy requirements for the classical (cubic) binary (AND/XOR) multiplication procedure. The subproblem size is 131072 for the V100 (DGX-1) and P100 accelerators, and 65536 for the K80. The number of bit operations is 2n 3 − n 2 . The first two energy columns are computed assuming full usage of CPU Watts and the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. Table 9 . Scalability of the running times, the bit operations per second, and the energy requirements for the classic (cubic) Boolean (AND/OR) multiplication procedure. The subproblem size is 131072 for the V100 (DGX-1) and P100 accelerators, and 65536 for the K80. The number of bit operations is 2n 3 − n 2 . The first two energy columns are computed assuming full usage of CPU Watts and the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. Table 10 . Scalability the running times, the effective bit operations per second, and the energy requirements for the standard-basis binary Strassen-Winograd multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators, and 32768 for the K80. The effective number of bit operations is computed with the same value of 2n 3 − n 2 as in the cubic case to highlight the relative difference in performance. The first two energy columns are computed assuming full usage of CPU Watts and the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. The right-hand-side input matrix is assumed to be pre-transposed; the times reported here do not include the transposition of submatrices. Table 11 . Scalability the running times, the effective bit operations per second, and the energy requirements for the alternative-basis (with self-inversion) binary multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators, and 32768 for the K80. The effective number of bit operations is computed with the same value of 2n 3 − n 2 as in the cubic case to highlight the relative difference in performance. The first two energy columns are computed assuming full usage of CPU Watts and the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. The right-hand-side input matrix is assumed to be pre-transposed and in the alternate basis; the times reported here do not include the transposition of submatrices or the change of basis. Table 12 . Scalability the running times, the effective bit operations per second, and the energy requirements for the alternative-basis (with chaining) binary multiplication procedure. The subproblem size is 65536 for the V100 (DGX-1) and P100 accelerators, and 32768 for the K80. The effective number of bit operations is computed with the same value of 2n 3 − n 2 as in the cubic case to highlight the relative difference in performance. The first two energy columns are computed assuming full usage of CPU Watts and the Wattage of a single GPU times the number GPUs in use. The latter two columns are computed using the full system power intake. The right-hand-side input matrix is assumed to be pre-transposed and in the alternate basis; the times reported here do not include the transposition of submatrices or the change of basis. 
ACKNOWLEDGMENTS.
GPU
