Constant-Depth and Subcubic-Size Threshold Circuits for Matrix
  Multiplication by Parekh, Ojas et al.
Constant-Depth and Subcubic-Size Threshold Circuits for
Matrix Multiplication
Ojas Parekh
Sandia National Laboratories
Albuquerque, NM, USA
odparek@sandia.gov
Cynthia A. Phillips
Sandia National Laboratories
Albuquerque, NM, USA
caphill@sandia.gov
Conrad D. James
Sandia National Laboratories
Albuquerque, NM, USA
cdjame@sandia.gov
James B. Aimone
Sandia National Laboratories
Albuquerque, NM, USA
jbaimon@sandia.gov
ABSTRACT
Boolean circuits of McCulloch-Pitts threshold gates are a classic
model of neural computation studied heavily in the late 20th century
as a model of general computation. Recent advances in large-scale
neural computing hardware has made their practical implemen-
tation a near-term possibility. We describe a theoretical approach
for multiplying two N by N matrices that integrates threshold
gate logic with conventional fast matrix multiplication algorithms,
that perform O(Nω ) arithmetic operations for a positive constant
ω < 3. Our approach converts such a fast matrix multiplication al-
gorithm into a constant-depth threshold circuit with approximately
O(Nω ) gates. Prior to our work, it was not known whether the
Θ(N 3)-gate barrier for matrix multiplication was surmountable by
constant-depth threshold circuits.
Dense matrix multiplication is a core operation in convolutional
neural network training. Performing this work on a neural archi-
tecture instead of off-loading it to a GPU may be an appealing
option.
KEYWORDS
threshold circuits; matrix multiplication; triangle counting; numer-
ical algorithms; neural-inspired algorithms; neuromorphic comput-
ing; neural networks
ACM Reference Format:
Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone.
2018. Constant-Depth and Subcubic-Size Threshold Circuits for Matrix Mul-
tiplication. In SPAA ’18: 30th ACM Symposium on Parallelism in Algorithms
and Architectures, July 16–18, 2018, Vienna, Austria. ACM, New York, NY,
USA, Article 4, 10 pages. https://doi.org/10.1145/3210377.3210410
1 INTRODUCTION
Neuromorphic computing devices (NCDs) are composed of mas-
sively parallel networks of artificial neurons that compute boolean
functions, akin to conventional logic gates but with relatively larger
fan-in. The simplicity of each neuron affords NCDs immense energy
SPAA’18, July 16–18, 2018, Vienna, Austria
© 2018 Association for Computing Machinery.
This is the author’s version of the work. It is posted here for your personal use. Not for
redistribution. The definitive Version of Record was published in SPAA ’18: 30th ACM
Symposium on Parallelism in Algorithms and Architectures, July 16–18, 2018, Vienna,
Austria, https://doi.org/10.1145/3210377.3210410.
efficiency and scalability, circumventing some of the data move-
ment and energy bottlenecks associated with traditional HPC [17].
Within the last few years, large-scale commercial NCD platforms,
such as SpiNNaker, IBM’s TrueNorth, and Intel’s Loihi, have emerged
and currently offer up to hundreds of millions of neurons, with
systems with billions of neurons a near-term likelihood [5, 17, 23].
Such NCDs were envisioned, for example, to accelerate deep learn-
ing; however, a specific application for which NCDs offer a clear and
rigorously established advantage over conventional parallel com-
puting remains elusive. Developing theoretical models for NCDs
and identifying potential advantages and tradeoffs over other par-
allel models of computation remain largely open problems. We
consider a well-known theoretical model that captures some of the
features offered by NCDs, and we demonstrate resource-efficient
constant-time algorithms for a fundamental problem, matrix multi-
plication.
We study the computational power of boolean circuits where
the fundamental gates have unbounded fan-in and compute a lin-
ear threshold function1. Such circuits are rooted in the classical
McCulloch-Pitts neuronal model [12], with linear threshold func-
tions serving as plausible models of spiking neurons. A boolean
threshold gate withm binary inputs y1,y2, . . . ,ym computes a lin-
ear threshold function, outputting 1 if and only if
∑m
i=1wiyi ≥ t ,
where the integer weightswi and integer threshold t are constants
associated with the gate. Rational wi and t may be represented,
for example, by multiplying suchwi and t with a sufficiently large
number. There are several natural measures of complexity associ-
ated with boolean circuits, including size: the total number of gates,
depth: the length of the longest directed path from an input node
to an output node, edges: the total number of connections between
gates, and fan-in: the maximum number of inputs to any gate.
Threshold gates are the most basic model for an artificial neuron,
and as such, all currently available neuromorphic computing archi-
tectures provide a hardware implementation of threshold circuits.
We consider threshold circuits with constant depth and polynomial
size with respect to the total number of inputs; this class of circuits
is called TC0. Such circuits represent a plausible model of constant-
time parallel computing. This is a notion of perfect parallelizability,
faster than the polylogarithmic time allowed in the complexity class
1As we explain below, for convolutional neural networks, we can restrict the fan-in to
a constant, or whatever practical fan-in the architecture supports.
ar
X
iv
:2
00
6.
14
65
2v
1 
 [c
s.D
S]
  2
5 J
un
 20
20
SPAA’18, July 16–18, 2018, Vienna, Austria Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone
NC .TC0 circuits can compute a variety of functions including inte-
ger arithmetic, sorting, and matrix multiplication [19, 21]. There is
also a TC0 threshold-gate circuit of sublinear size to compute the
parity of n bits [20]. In contrast, any constant-depth circuit with
NOT gates and unbounded-fan-in AND and OR gates requires a
superpolynomial number of gates [7, 26].
Understanding the power and limitations of TC0 circuits has
been a major research challenge over the last couple of decades.
The 1990’s saw a flurry of results showing what TC0 circuits could
do [19, 21], while more recent results have focused on lower bounds
showing what TC0 circuits cannot do [9]. TC0 has been studied as
a theoretical model. Its practicality is an open question. Currently,
large-scale electronic circuits with high fan-in may be difficult to
implement. However, neural-inspired architectures may offer hope.
The adult human brain contains about 100 billion neurons, with
maximum fan-in 10,000 or larger some regions, operating at under
50 watts [1]. Though impressive, this figure represents a single
class of instance size, so one might wonder how a synthetic system
based on the physical limitations governing a brain might scale
asymptotically. We are not aware of any generative brain models
for which this has been analyzed. However, a fan-in that grows with
the total system size seems plausible for a 3-dimensional circuit
such as the brain. Although large fan-in is a critical resource in de-
signingTC0 algorithms, in practical neuromorphic devices resource
requirements may grow as a function of fan-in. For example, the
available numerical precision or dynamic range may decrease as
fan-in increases, resulting in an overall increase in energy expendi-
ture or execution time. Thus constant depth, in theTC0 sense, may
not practically equate to constant time. However, ideal theoretical
algorithms may still guide the development of resource-efficient
practical algorithms as neuromorphic architectures become more
prevalent.
While neuromorphic computing has long focused on the use
of analog computation to emulate neuronal dynamics [8], recent
years have seen rapid development of novel digital CMOS neural
hardware platforms which can scale to very large numbers of neu-
rons [10, 13]. While initially designed for large biologically inspired
circuits, these architectures are attracting attention as an alterna-
tive to conventional CMOS architectures for accelerating machine
learning algorithms such as deep artificial neural networks [6].
Many of these neural architectures, such as TrueNorth and the
SpiNNaker platform, achieve considerable benefits in energy and
speed by using large numbers of simple digital spiking neurons
instead of a relatively smaller number of powerful multi-purpose
processors. These systems are almost configurable threshold gate
circuits, except that they are capable of extended temporal dynam-
ics. Scientific computing is an application domain for which neural
architectures are often quickly dismissed. There is a perception
that human cognition is better for data-centric functions, such
as image recognition, and for abstract decision making than for
precise numerical calculations, particularly at large scale. While
biologically-inspired neural algorithms are often probabilistic or
approximate, the neuronal-level computations in large scale neural
architectures are sufficiently precise for numerical computation.
We consider a fundamental scientific-computing-inspired prob-
lem: can one produce constant-depth threshold circuits that com-
pute the product of two N × N matrices using O(N 3−ε ) gates for
constant ε > 0? For matrices with relatively large entries (say Ω(N )
bits), this goal seems out of reach as we would need to output
Ω(N 3) bits in the worst case. However, prior to our work, it was
not known if this was possible even for binary matrices, those with
entries that are all either 0 or 1.
We show how to multiply two N × N integer matrices with
O(logN )-bit entries using O(N 3−ε ) gates in constant depth. The
näive algorithm based on the definition of matrix multiplication
requires Θ(N 3) arithmetic operations. Our results are based on clas-
sical breakthroughs for fast matrix multiplication [22]: multiplying
two N × N matrices using O(Nω ) arithmetic operations, for a pos-
itive constant ω < 3, that depends on the particular fast matrix
multiplication being employed. These techniques can be extended,
in a relatively straightforward manner, to yield O(logN )-time con-
ventional parallel algorithms (for architectures such as PRAMs)
with O(Nω ) total work. In contrast, we give a constant-time algo-
rithm, in the threshold circuit model, with approximately O(Nω )
total gates, which is a reasonable measure of total work.
One of our motivations for neural-circuit-based matrix multi-
plication is convolutional neural networks for deep learning. See
Section 5 for more details. Deep learning is a major driver for
neural-inspired architectures. A current vision for using these ar-
chitectures for deep learning requires the matrix multiplication to
be moved off-system to a GPU. If circuit-based matrix multiplica-
tion can be made practical, perhaps this computation can be left
on-chip, avoiding energy-intensive and slow I/O.
We also consider the somewhat simpler problem of determining
whether the trace of A3 is at least τ , for an N × N integer matrix
A with entries of size O(logN ) bits. This case illustrates the funda-
mental ideas of our approach and has applications in social network
analysis, particularly to triangle counting. The problem we solve
allows one to answer: “Does a graph G have at least τ triangles?”
The user may select a relevant value of τ . See Section 5 for more
details on triangles, social network analysis, and picking τ .
There is a simple depth-2 threshold circuit to solve this problem
for a graph G = (V ,E). The circuit has an input variable, xi j for
i, j ∈ V with i < j; the variable xi j is 1 if ij ∈ E and 0 otherwise.
The first layer of the circuit consists of a gate, дi jk for each triple
i, j,k ∈ V with i < j < k . The gate дi jk computes the value of
the linear threshold function xi j + xik + x jk ≥ 3 as an output
yi jk . That is, the gate fires (yi jk = 1) if and only if all edges in the
triangle on i , j, and k are in the graph. The second layer consists
of a single output gate that computes the linear threshold function∑
i, j,k ∈V :i<j<k yi jk ≥ τ ; this gate fires if and only if the number of
triangles in G is at least τ . The circuit has
(N
3
)
+ 1 = Θ(N 3) gates.
We ask (and answer) whether it is possible to beat the size of
this threshold circuit in constant depth. This is akin to asking if it
is possible to beat the näive matrix multiplication algorithm with
an algorithm that performs O(Nω ) operations for ω < 3. In fact
the above threshold circuit is a specialization of the näive matrix
multiplication algorithm.
The analysis of our new threshold circuits is more involved than
analyzing conventional fast matrix multiplication methods. We
must explicitly consider sparsity (see Definition 2.1), a measure of
how many times a matrix element or intermediate result is part
of a computation during the fast multiplication. Thus, while we
Constant-Depth and Subcubic-Size Threshold Circuits for Matrix Multiplication SPAA’18, July 16–18, 2018, Vienna, Austria
use existing fast matrix multiplication techniques to achieve our
results, we use them in a new context. Our performance results
exploit different features of fast matrix multiplication techniques
than those traditionally used.
Results and contributions
Consider a fast recursive or divide-and-conquer matrix multiplica-
tion algorithm like Strassen’s, with run-time complexity O(Nω ).
We consistently useω as the exponent in the runtime complexity of
a base non-circuit fast matrix multiplication algorithm. Our results
leverage such a fast matrix multiplication to construct a constant-
depth threshold circuit with O˜(Nω+ε )-gates, where ε depends on
the depth of the circuit.
Our main result is an O(d)-depth, O˜(Nω+O (γ d ))-gate threshold
circuit for multiplying twoN×N integer matrices withO(logN )-bit
entries, for a positive integerd and constantγ < 1. Specifically, for a
given integerd , the depth is 4d+1. The constantd is a multiplicative
factor hidden in the O˜ for the number of gates. Section 4.3 gives a
more detailed discussion of the value of γ . For Strassen’s algorithm
it is about 0.491. The constant multiplier of γd is about 1.581 for
Strassen’s algorithm. Thus for d > 3, this circuit will haveO(N 3−ε )
gates for positive constant ε > 0. We also give aO(log logN )-depth,
O˜(Nω )-gate circuit for this task.
We present a simplified circuit of the same gate complexity and
slightly lower depth (2d + 2) for computing the trace of A3, for an
N×N integer matrixA. This gives triangle counts for a graphG with
adjacencymatrixA (see Section 2.3). Our circuits implement limited-
depth versions of fast divide-and-conquer matrix multiplication,
and our techniques should extend to other types of algebraic divide-
and-conquer algorithms.
Our contributions are:
• This work revives and redirects research on designing al-
gorithms for a classical theoretical model of parallel com-
putation to a data science problem on an emerging class of
neural-inspired parallel architectures. We show that thresh-
old circuits, comprised of threshold gates that model neurons,
might be applicable to linear-algebra computations for deep
learning on new neuromorphic hardware.
• We give O(log logn)-depth and O˜(Nω )-gate threshold cir-
cuits for computing the product of two N ×N matrices with
O(logN )-bit entries, where Nω is the complexity of a fast
conventional matrix multiplication algorithm like Strassen’s.
• We give constant-depth threshold circuits with O˜(Nω+ε )
gates, where ε is exponentially small in the depth of the
circuit.
• Our circuits are elementary and are composed entirely of
copies of a relatively simple depth-2 threshold circuit that
performs addition. We hope this will facilitate implementa-
tion of our approach in neural-inspired hardware.
2 PRELIMINARIES AND PROBLEM
STATEMENT
2.1 Fast matrix multiplication algorithms
Strassen developed the first matrix multiplication algorithm requir-
ing O(N 3−ε ) multiplications [22]. Strassen observed that one can
M1 = A11(B12 − B22)
M2 = (A21 +A22)B11
M3 = (A11 +A22)(B11 + B22)
M4 = A22(B21 − B11)
M5 = (A11 +A12)B22
M6 = (A21 −A11)(B11 + B12)
M7 = (A12 −A22)(B21 + B22).
C11 = M3 +M4 −M5 +M7
C12 = M1 +M5
C21 = M2 +M4
C22 = M1 −M2 +M3 +M6.
Figure 1: Strassen’s algorithm for multiplying two 2 × 2 matrices
A and B. The 7 multiplications computed in Strassen’s algorithm
are represented byM1, . . . ,M7. EachMi is the product of weighted
sums of entries of matricesA and B. The entries of product matrixC
are then computed from theMi using only addition and subtraction.
One can verify by substitution and expansion that the entries of C
are set to the proper expressions involving entries of A and B.
compute the matrix product, C = AB for 2 × 2 matrices A and B
using 7 multiplications rather than the 8 multiplications required
by the näive algorithm. The reduction in multiplications comes at
the expense of additional additions and subtractions.
Figure 1 gives Strassen’s algorithm for 2 × 2 matrices. The al-
gorithm is generalized to N × N matrices A and B, where N = 2l
for some positive integer l , as follows. We partition A and B into 4
blocks, each of size N /2 × N /2, and let Ai j and Bi j refer to these
blocks, for i, j ∈ {1, 2}. The above equations remain correct. How-
ever, each Mi now represents a multiplication of two N /2 × N /2
matrices. We can recursively apply the above procedure to perform
each of these multiplications until the blocks are individual matrix
elements or, for more practical applications, sufficiently small. For
each of the l = log2 N levels of recursion, we invoke 7 recursive ma-
trix multiplications, resulting in a total of 7log2 N = N log2 7 ≈ N 2.81
scalar multiplications. The recurrence relation for the total number
of arithmetic operations is T (N ) = 7 · T (N /2) + 18 · (N /2)2 and
T (1) = O(1). The 18 · (N /2)2 term arises from the 18 additions or
subtractions on N /2×N /2 blocks in the expressions above. This re-
currence shows the total number of scalar additions or subtractions
is also O(N log2 7).
Although Strassen’s seminal approach was based on a fast matrix
multiplication algorithm for 2 × 2 matrices, subsequent work has
yielded improved algorithms by employing a fast matrix multipli-
cation algorithm involving larger square matrices as well as more
sophisticated techniques. The currently best known algorithm re-
quires O(N 2.373) operations [11]. For the sake of exposition, we
view fast matrix multiplication algorithms as recursive divide-and-
conquer approaches, yet our techniques extend to the more general
tensor perspective of fast matrix multiplication. See the survey by
Bläser [4] for a detailed introduction to and history of this area,
including the connection between the (border) rank of the matrix
multiplication tensor and fast matrix multiplication algorithms.
2.2 Technical challenges
The divide-and-conquer Strassen’s algorithmhas a naturalO(logN )-
time parallel (PRAM) implementationwith a total work ofO(N log2 7)
SPAA’18, July 16–18, 2018, Vienna, Austria Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone
arithmetic operations. The main question we consider is whether
Strassen’s approach and subsequent improvements of it can yield a
constant-time algorithm implemented using threshold circuits with
O(N 3−ε ) gates, where the latter is a measure of total work. The
recursive (O(logN )-depth) implementation of Strassen’s algorithm
only performs scalar multiplications during the base case. However,
it performs matrix additions and subtractions at each level before
and after the recursion, reusing computed results.
If we attempt to implement Strassen’s approach without recur-
sion and the consequent reuse of computed results, we must first
compute N log2 7 scalars that are linear combinations of entries of
A and another N log2 7 scalars representing linear combinations of
entries of B. The main technical hurdle is that such linear combi-
nations involve up to N entries of A or B, and we seek to compute
O(N log2 7) of these sums with constant-depth threshold circuits. A
näive implementation would require Ω(N 1+log2 7) gates.
We overcome this hurdle by selecting a constant orO(log logN )
number of levels of recursion out of the O(logN ) levels suggested
by the standard implementation of Strassen’s approach. We make
the notion of selecting a level of recursion precise in Section 4. This
allows us enough reuse of computed results within the confines
of a constant-depth or O(log logN )-depth circuit to achieve our
results. We note that we must carefully select the levels of recursion
employed; for instance, simply selecting every kth level does not
achieve our best results.
2.3 Problem statement
We develop threshold circuits to compute the matrix product C =
AB of two N × N integer matrices A and B. Our results assume the
entries of A and B require at mostO(logN ) bits. We also consider a
related problem: given an integer matrix A as above and an integer
τ , determine whether the matrix trace of A3 is at least τ . This
problem is solved by a simpler threshold circuit than for computing
matrix product and serves to illustrate our main ideas. It also has
applications to triangle counting in graphs as we describe below.
LetA be theN ×N symmetric adjacency matrix of a simple graph
G = (V ,E) with N nodes: for i, j ∈ V , Ai j = Aji = 1 if ij ∈ E, and
Ai j = Aji = 0 otherwise. Since there are no self-loops in the graph,
we haveAii = 0 for all nodes i . Consider the square of the adjacency
matrix,C = A2. For i, j ∈ V with i , j ,Ci j = ∑k ∈V AikAk j = |{k ∈
V | ik ∈ E and kj ∈ E and k , i, j}|, which is the number of paths
of length 2 between i and j. If there is an edge between the nodes
i and j, then each path of length 2 between them, along with the
edge ij, forms a triangle in G. Moreover, every triangle containing
i and j arises in this way. Suppose G has ∆ triangles. Then,
3∆ =
∑
i, j ∈V :i<j
Ai jCi j , (1)
since the sum counts each triangle once for each of its edges. Thus
one can count the triangles inG by summing some of the entries of
A2. An equivalent computation is the trace of A3, trace(A3), which,
from (1), is equal to 6∆. This counts the loop from each vertex in
each direction.
We employ a threshold circuit implementation of fast matrix
multiplication algorithms to compute this sum in constant depth
using O(N 3−ε ) gates. In fact the exponent of our gate count can be
made arbitrarily close to the exponent of the arithmetic operation
count for the best possible fast matrix multiplication algorithm.
We explain our notion of a fast matrix multiplication algorithm.
We assume we are given an algorithm for multiplying two T ×T
matrices using a total of r multiplications (for Strassen’s algorithm,
T = 2 and r = 7). We assume N = T l for some positive integer
l . As outlined in Section 2.1, this yields a recursive algorithm for
computing the product of two N × N matrices, C = AB, using a
total of r l = r logT N = N logT r scalar multiplications.
As with Strassen’s algorithm, we assume we are given a list
of r expressions for each of the multiplications, M1, . . . ,Mr ; we
view each Mi as an expression involving the T 2 different N /T ×
N /T blocks of A and B. In particular each Mi is a product of a
{−1, 1}-weighted sum of blocks ofAwith a {−1, 1}-weighted sum of
blocks of B. We also assume the fast matrix multiplication algorithm
provides a list of T 2 expressions, each representing a N /T × N /T
block of C as a {−1, 1}-weighted sum of theMi . More general fast
matrix multiplication algorithmsmay allow theMi to be products of
linear combinations with rational weights beyond {−1, 1} (likewise
for the entries of C). Although we do not present details here, our
techniques can be extended for such fast matrix multiplication
algorithms (these weights correspond to the wi in Lemma 3.2 in
Section 3).
For 1 ≤ i ≤ r , let ai be the number of distinct blocks of A that
appear in the expressionMi , and let bi be defined analogously with
respect to B. We let ci be the number of expressions for blocks ofC
in whichMi appears.
Definition 2.1. We let
sA =
∑
1≤i≤r
ai , sB =
∑
1≤i≤r
bi , and sC =
∑
1≤i≤r
ci .
We define the sparsity of a fast matrix multiplication algorithm as
s = max{sA, sB , sC }.
Sparsity will be an essential ingredient of our analysis, and we
better motivate it in Section 4.3. Others [2, 3] consider sparsity
in analyzing and improving the numerical stability of fast matrix
multiplication algorithms, though they do not refer to it by the
same name.
We use the following notation in proofs of circuit quality. We
define bits(m) as the minimum number of bits required to express
the nonnegative integer m in binary, i.e. the least integer l such
thatm < 2l .
3 BASIC TC0 ARITHMETIC CIRCUITS
We first develop the fundamental TC0 arithmetic circuits on which
our results rely. Our circuits are designed with neuromorphic im-
plementation in mind, and we try to favor simple constructions
over those that offer the least depth or gate count. The bulk of the
computation performed by our circuits is computing the bits of
integer-weighted sums of nonnegative integers,
∑
i wixi , where
the nonnegative xi depend upon the inputs to the circuit but the
weightswi are constants associated with the circuit.
Our first circuit follows from a classical technique to compute
symmetric functions inTC0 by Muroga from 1959 [14, 15]; it is also
a special case of a more general result by Siu et al. [20]. We include
a proof to demonstrate the simplicity of the construction.
Constant-Depth and Subcubic-Size Threshold Circuits for Matrix Multiplication SPAA’18, July 16–18, 2018, Vienna, Austria
Lemma 3.1. Let s =
∑
i wixi be an integer-weighted sum of bits,
xi ∈ {0, 1}. We assume s ≥ 0 and fix an integer l such that s ∈ [0, 2l ).
For 1 ≤ k ≤ l , the kth most significant bit of s can be computed by a
depth-2 threshold circuit using 2k + 1 gates.
Proof. We define bool(P), for a predicate P , to be 1 if predicate
P is true and 0 otherwise.
The kth most significant bit of s is 1 precisely when s lies in
one of the intervals [i2l−k , (i + 1)2l−k ) for some odd integer 1 ≤
i < 2k . The interval enumerates over all combinations of bits less
signicant than the kth and the odd multipliers i enumerate over
all combinations of bits more significant than k . The first layer
of our circuit computes the function yi := bool(s ≥ i2l−k ), for
1 ≤ i ≤ 2k . The output of the circuit is bool(∑i odd(yi −yi+1) ≥ 1),
since yi − yi+1 is 1 if s ∈ [i2l−k , (i + 1)2l−k ) and is 0 otherwise.
□
The circuit construction for the above lemma requires an integer
l such that the sum s is guaranteed to be in [0, 2l ). Note that if
s < [0, 2l ), the circuit for any k outputs 0. We build upon the
above to obtain our primary addition circuit. The next lemma is a
generalized and strengthened version of Siu et al.’s Lemma 3.1 in
[20] for the depth-2 case.
Lemma 3.2. Let s =
∑
i wizi be an integer-weighted sum of n
nonnegative numbers zi , each with at most b bits. We assume s ≥ 0
and let w = maxi |wi |. The sum s can be computed by a depth-2
threshold circuit with O(wbn) gates.
Proof. The sum s requires at most bits(nw(2b − 1)) ≤ bits(n) +
bits(w) + b bits. (Recall the defintion of bits() in Section 2.3).
First we compute the jth (least significant) bit of s , for 1 ≤
j ≤ b. Let sj = ∑i wi z˜i , where z˜i is obtained from zi by ignoring
all but the least significant j bits. Note that sj requires at most
bits(n) + bits(w) + j bits and that s and sj have the same least
significant j bits. Furthermore, since we are given the bits of each
z˜i , we may treat sj as an integer-weighted sum of bits, where each
weight is a product of some wi with a power of 2. Thus we may
compute the jth bit of sj by appealing to Lemma 3.1 on sj with
k = bits(n) + bits(w) + 1. To see this recall that k represents the
kth most significant bit. The jth least significant bit of sj is the
(bits(sj ) − j + 1) = n +w + j − j + 1most significant bit. This circuit
requires 2k + 1 = 2 · 2bits(n) · 2bits(w ) + 1 = O(wn) gates, hence
O(bwn) gates suffice to compute the b least significant bits of s .
Appealing to Lemma 3.1 to compute each of the remaining a =
bits(n) + bits(w) most significant bits of s requires O(∑ak=1 2k ) =
O(2a ) = O(wn) gates. This is improved in practice by observing
that the functionsyi computed for k = bits(n)+bits(w) in the proof
of Lemma 3.1 include those required for all the most significant
bits(n) + bits(w) bits of s . □
We need to compute products of numbers as well. However, the
products we compute are only used as inputs to other threshold
gates, and we do not need an explicit base-2 representation of the
bits of these products. A more generic representation suffices: a
representation of an integer x is an integer-weighted sum of binary
variables, x =
∑
1≤i≤d wixi with xi ∈ {0, 1} and d polynomial in
bits(x).
Lemma 3.3. A representation of the product of threem-bit nonneg-
ative integers can be computed by a depth-1 threshold circuit withm3
gates.
Proof. We compute a representation of the product of x =∑
1≤i≤m 2i−1xi , y =
∑
1≤j≤m 2j−1yj , and z =
∑
1≤k≤m 2k−1zk ,
with xi ,yj , zk ∈ {0, 1}. Thus xyz =
∑
1≤i, j,k≤m 2i+j+k−3xiyjzk ,
which is a representation for xyz. This representation differs from
the standard binary representation in that 2i+j+k−3 can represent
the same power of 2 for different values of i, j, and k . We usem3
gates in a single layer with predicates xi + yj + zk ≥ 3 to compute
xiyjzk ∈ {0, 1} for 1 ≤ i, j,k ≤ m. □
Our results require only the above relatively simple arithmetic
circuits. This facilitates practical implementation.
Negative numbers. The above lemmas give circuits to compute
products and weighted sums of nonnegative integers. However,
they can be extended to handle negative integers. We represent
each integer x as x = x+ − x−, where x+ and x− are each non-
negative. Other more efficient representations are possible, but we
select this one as it makes for a relatively simple presentation and
implementation at the cost of a constant-factor overhead in gate
and wire count.
Theworkhorse subroutine of our circuits, captured by Lemma 3.2,
is computing the bits of integer-weighted sums of nonnegative
integers, s =
∑
i wixi , where the xi = x+i − x−i depend upon the
inputs to the circuit but the wi are constant with respect to the
inputs. LetW + be the set of indices withwi > 0, and letW − be those
indices withwi < 0. We define s+ =
∑
i ∈W + wix+i +
∑
i ∈W − (−wi )x−i
to be the positive terms in sum s , and we define s− = ∑i ∈W + wix−i +∑
i ∈W − (−wi )x+i to be the negation of the negative terms in sum s .
We have s = s+ − s− and s+, s− ≥ 0. Moreover, each of s+ and s− is
an integer-weighted sum of nonnegative integers, hence the bits of
each of s+ and s− may be computed using two separate instances
of the circuit of Lemma 3.2. Each of these circuit instances only
depends on the x+i and x
−
i hence we may apply them in parallel
without increasing the depth of the resulting overall circuit.
Computing products also incurs extra constant overhead. The
representation of the product of three numbers, xyz described in
the proof of Lemma 3.3 becomes xyz =
∑
1≤i, j,k≤m 2i+j+k−3(x+i −
x−i )(y+j −y−j )(z+k−z−k ). Thus
∑
1≤i, j,k≤m [2i+j+k−3(x+i y+j z+k+x+i y−j z−k+
x−i y
+
j z
−
k + x
−
i y
−
j z
+
k ) + (−2i+j+k−3)(x+i y+j z−k + x+i y−j z+k + x−i y+j z+k +
x−i y
−
j z
−
k )] is also a representation of xyz that requires eight times
as many gates to compute, which is still O(m3).
For ease of exposition, we proceed as if we were only computing
positive quantities. From this point on, we assume that a number x
requires at most b bits, by which we mean that each of x+ and x−
requires at most b bits.
4 SUBCUBIC TC0 CIRCUITS FOR TRACE AND
MATRIX MULTIPLICATION
4.1 Overview
Our circuits for matrix trace and matrix multiplication implement
a given conventional fast matrix multiplication algorithm in both a
depth-efficient and gate-efficient manner. We assume we are given
SPAA’18, July 16–18, 2018, Vienna, Austria Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone
A
A11
(A11)11
...
...
(A11)12 − (A11)22
...
...
A12 −A22
(A12 −A22)11
...
...
(A12−A22)12
−(A12−A22)22
...
...
...
r2, NT 2 × NT 2
r1, NT × NT
r0, N × N
(#, size)
of matrices
· · ·
· · · · · ·
· · · · · · · · · · · ·
1 2
1 2 1 2
Figure 2: The r -ary tree TA for Strassen’s Algorithm (r = 7, T = 2).
For K × K matricesU and V , the notationUi j or (U )i j refers to the
(i, j)th KT × KT block ofU . Observe that (U +V )i j = Ui j +Vi j . Each
node has children corresponding to the r multiplication expressions
Mi (see Figure 1). An edge associated withMi is labeled with the
number of terms of A that appear in Mi . Each node u on level h,
starting with the root as level 0, corresponds to a matrix that is a
weighted sum of N
T h
× N
T h
blocks of A. The number of blocks of A
appearing in such a sum is the product of the edge labels on the
path from u to the root of the tree. For example, (A12 − A22)12 −
(A12 −A22)22 = (A12)12 − (A22)12 − (A12)22 + (A22)22 is a weighted
sum of 4 NT 2 × NT 2 blocks of A. The N logT r leaves of TA correspond
to scalars that are weighted sums of entries of A.
N × N integer matrices A and B with entries of size O(logN ). We
consider two problems: (1) determining whether trace(A3) ≥ τ ,
for an integer τ , and (2) computing the bits of the matrix product
C = AB. We consider the first problem in this section, and the
second problem in the next section.
We define trees TA and TB for the input matricesA and B, respec-
tively, based on the recursive or divide-and-conquer structure of
the fast matrix multiplication algorithm. The nodes in TA represent
weighted sums of blocks of A and likewise for TB . The root of TA
represents the matrix A, while the leaves represent weighted sums
of its entries. See Figure 2 for a detailed explanation.
In a conventional PRAM implementation of a fast matrix mul-
tiplication algorithm, all the matrices at each level of TA and TB
are computed, and the results are reused. Since there are O(logN )
levels, we cannot hope to compute all the matrices at each level in
a constant-depth circuit. We give constant-depth threshold circuits
that computes all nodes on only a constant number of levels of TA
and TB while using a number of gates arbitrarily close to the total
work performed by the fast matrix multiplication algorithm.
Our circuit computes the same O(Nw ) scalar products as the
underlying fast matrix multiplication algorithm. These scalars cor-
respond to the leaves of TA and those of TB respectively. Our algo-
rithm processes these trees in a top-down fashion to compute the
scalars, corresponding to the leaves. We then appeal to Lemma 3.3
to compute the product of each scalar corresponding to a leaf of TA
with the corresponding leaf of TB . For both the problems we solve,
we will need to consider another tree with similar structure to TA
and TB ; however, these trees will be used in different ways.
We assume, as in Section 2.3, that we have a fast matrix mul-
tiplication algorithm that multiplies two T × T matrices using r
multiplications. We describe an improved TC0 circuit for comput-
ing the values at the leaves of TA. Our results extend naturally to
computing the leaves of TB . Level h of TA contains rh nodes, each
corresponding to an N /Th × N /Th matrix. Moreover, each entry
of each matrix at level h is the {−1, 1}-weighted sum of at most
T 2h entries of the root matrix,A. Hence if each entry of the integer
matrix A requires at most b bits, the number of bits required for
each entry of a matrix at level h is at most
bits((2b − 1)T 2h ) ≤ b + bits(T 2h ) = b +O(h logT ). (2)
For our results we assume b = O(logN ) bits. Moreover, T is a
constant associated with the fast matrix multiplication selected,
and h ≤ logT N , hence the scalar values at the leaves of TA require
O(logN ) bits.
We give a subcubicTC0 circuit for computing the matrix product
C = AB in the next section, but first, we illustrate our main ideas by
showing how to check trace(A3) ≥ τ . As mentioned in Section 2.3,
this allows us, for example, to count triangles in a graph. The bulk
of our work lies in showing how to compute the N logT r scalar
products prescribed by the fast matrix multiplication algorithm.
Each such scalar product is between a weighted sum of entries of A
and a weighted sum of entries of B. We next show how to compute
these weighted sums for A with a circuit that computes a constant
number of levels of TA. An analogous construction works for B.
4.2 Approach
Our main goal in the following sections is to give O(log logN )-
depth, O˜(Nω )-gate and O(d)-depth, O˜(Nω+O (γ d ))-gate threshold
circuits for multiplying two N × N integer matrices withO(logN )-
bit entries, for a positive integer d and constant γ < 1.
We first motivate our approach by attempting to construct a
constant-depth and O(N 3−ϵ )-gate threshold circuit for matrix mul-
tiplication using Strassen’s algorithm as a guide. As described in
the previous section and Figure 2, our immediate goal is to compute
the scalars associated with the leaves of TA (and TB ) for Strassen’s
algorithm by selecting a constant number of levels of TA to compute.
Themost natural approach is perhaps to directly compute the leaves,
at level log2 N , without computing any other level. The leaves of TA
correspond to {−1, 1}-weighted sums of at most N entries of A. By
recalling (2) and invoking Lemma 3.2, we can compute each such
sum in depth 2 usingO(N logN ) gates. However, we must compute
O(N log2 7) such sums, yielding a total of O˜(N 1+log2 7) ≈ O˜(N 3.81)
gates. This can be improved to ≈ O˜(N 3.58) by observing that not
all sums have the same number of summands; however, this still
fails to achieve our goal.
We can improve the approach by employing addition circuits of
depth greater than 2 due to Siu et al. [20] (Corollary 2). This allows
computation of the desired sums in depth O(d) using O(dN 1/d )
gates, yielding the following result.
Theorem 4.1. Suppose we are given N × N integer matrices A
and B with entries of sizeO(logN ) bits. There is a threshold circuit of
depth O(d) that computes the matrix product AB using O˜(dNω+1/d )
gates.
Constant-Depth and Subcubic-Size Threshold Circuits for Matrix Multiplication SPAA’18, July 16–18, 2018, Vienna, Austria
We do not include a full proof of this theorem as our main results
give superior results, both in terms of gate count and the simplicity
of the resulting circuits. The results of the following sections show
how to more carefully select a constant number of levels of TA and
TB in order to improve the exponent in the gate count from ω + 1/d
to ω +O(γd ) for a constant γ < 1.
4.3 Matrix trace
We select t levels, 0 = h0 < h1 < h2 < · · · < ht of the tree TA. Our
TC0 circuit computes all of the matrices at these t levels of TA. Our
goal is to compute the scalars corresponding to the leaves of TA,
hence ht = logT N . The benefit of computing level hi is that each
entry of each matrix at level hi+1 is then a {−1, 1}-weighted sum
of at most T 2(hi+1−hi ) matrices at level hi .
Our results rely on parameters associated with our fast matrix
multiplication algorithm. Recall sA from Definition 2.1. We define
α = r/sA and β = sA/T 2, and we have that 0 < α ≤ 1 and β ≥ 1
(for Strassen’s algorithm, α = 7/12 and β = 3).
Lemma 4.2. For 1 ≤ i ≤ t , if the matrices at level hi−1 of TA have
been computed, then the matrices at level hi can be computed in depth
2 using O((b + hi−1)αhi−1βhiN 2) gates.
Proof. The rhi nodes at level hi of TA each correspond to an
N /Thi × N /Thi matrix. We set δi = hi − hi−1 for convenience.
We can associate each node u at level hi with the unique subtree
rooted at level hi−1 that contains it. The N /Thi × N /Thi matrix
corresponding tou is a sum of at mostT 2δi blocks of the N /Thi−1 ×
N /Thi−1 matrix associated with the root of the subtree containing
u.
We seek a better bound on the number of such blocks we must
sum to obtain the matrix associated with u. Let size(u) represent
this quantity, and let root(u) be the node at level hi−1 on the path
from u to the root of TA. Recall that each edge of TA corresponds
to one of the fast matrix multiplication expressionsMi and that ai
is the number of distinct blocks of A that appear inMi (defined in
Section 2.3). The quantity size(u) is the product of the ai associated
with the edges on the path from u to root(u) (see Figure 2). Thus
for each node v at level hi−1, we have:∑
{u |root(u)=v }
size(u) =
∑
m1+· · ·+mr=δi
(
δi
m1, . . . ,mr
) ∏
1≤j≤r
a
mj
j = s
δi
A ,
(3)
where the last equality follows from the multinomial theorem. We
now bound the number of gates required to compute the matrices
at level hi . Since we assume the matrices at level hi−1 have been
computed, by Lemma 3.2, each entry of the matrix associated with
node u at level hi can be computed usingO((b +hi−1)size(u)) gates
in depth 2. We charge the gate count for u to root(u), and by (3)
and (2), we have that the number of gates charged to each node at
level hi−1 is
O((b + hi−1)shi−hi−1A N 2/T 2hi ),
hence the total number of gates required for level hi is
O((b + hi−1)rhi−1shi−hi−1A N 2/T 2hi ) =
O((b + hi−1)(r/sA)hi−1 (sA/T 2)hiN 2),
as desired. □
Next we show how to set the hi so that the number of gates
required at each level is approximately balanced. This yields a total
gate count that is, at worst, within a factor of t of the gate count for
an optimal setting of the hi . We must assume the number of multi-
plications our fast T ×T matrix multiplication algorithm requires
is greater than T 2. The results, as stated and proven below, do not
hold if we have an optimal fast matrix multiplication algorithm
where the number of multiplications, r = T 2. We set γ = logβ (1/α).
Note that 0 < γ < 1 since r > T 2 is equivalent to αβ > 1 (for
Strassen’s algorithm, γ ≈ 0.491).
Lemma 4.3. Let hi = ⌈(1 − γ i )ρ⌉, for some ρ > 0. Then all the
matrices at levels h1, . . . ,ht of TA can be computed in depth 2t using
O(t(αβ)ρ (b + logN )N 2) gates.
Proof. We have hi ≤ logT N for all 0 ≤ i ≤ t since the latter is
the height of TA. By Lemma 4.2, level hi can be computed in depth
2 using O((b + logN )αhi−1βhiN 2) gates.
Let h˜i = (1 − γ i )ρ. Observe that∑
1≤i≤t
αhi−1βhi < β
∑
1≤i≤t
α h˜i−1β h˜i ,
hence it suffices to bound
∑
1≤i≤t α h˜i−1β h˜i . The terms in this sum
are all equal:
α h˜i−1β h˜i =
(
αβ
αγ
i−1
βγ
i
)ρ
=
(
αβ
αγ
i−1 (βγ )γ i−1
)ρ
= (αβ)ρ ,
so that
∑
1≤i≤t αhi−1βhi = O(t(αβ)ρ ), from which the claim fol-
lows. □
The above lemma establishes a tradeoff in the following sense.
The value ρ impacts the total number of gates. However, we require
thatht = logT N , which imposes constraints on t and, consequently,
the depth of the circuit. The larger ρ, the smaller t needs to be in
order for ht = logT N .
The natural strategy of taking hi = ⌈i logT N /t⌉ yields a weaker
result, comparable to Theorem 4.1. We now establish our main
theorems by better quantifying the tradeoff between ρ and t . For
these theorems we assume we are given a fast matrix multiplication
algorithm and take ω = logT r .
Theorem 4.4. Suppose we are given an integer τ and an N × N
integer matrixAwith entries of sizeO(logN ) bits. There is a threshold
circuit of depth O(log logN ) that determines whether trace(A3) ≥ τ
using O˜(Nω ) gates.
Proof. We appeal to Lemma 4.3, setting ρ = logT N . The gate
bound follows from (αβ)ρ = (r/T 2)loдT N = Nω−2. To bound the
depth, we set t = ⌊log1/γ logT N ⌋+1 > log1/γ logT N . This implies:
logT N − (1 − γ t ) logT N < logT N − (1 − 1/logT N ) logT N = 1.
Thus ht = ⌈(1 − γ t ) logT N ⌉ = logT N as desired.
This shows that we can compute the values corresponding to the
leaves of TA and TB in the stated gate and depth bounds. One may
see that each entryCi j is aweighted sumof products,
∑
k ∈Ii j wi jkpk ,
where each pk corresponds to a product between a leaf of TA and
SPAA’18, July 16–18, 2018, Vienna, Austria Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone
the corresponding leaf of TB with each weightwi jk ∈ {−1, 1}. We
seek to compute
trace(A3)
2 =
∑
i<j
Ai jCi j =
∑
i<j
Ai j
©­«
∑
k ∈Ii j
wi jkpk
ª®¬ (4)
=
∑
k
pk
©­«
∑
i<j :k ∈Ii j
wi jkAi j
ª®¬ .
Thus for each product, pk , we want to multiply it with a {−1, 1}-
weighted sum over entries of A. We may compute these weighted
sums independently and in parallel with those for A and B using
the same techniques. Thus we seek to compute Nω products of 3
O(logN )-bit numbers, and we appeal to Lemma 3.3 to accomplish
this in depth 1 using a total of O˜(Nω ) gates. A final output gate
sums the representations of the products computed by Lemma 3.3
and compares with the threshold τ . □
We now prove our main theorem by exhibiting a more refined
tradeoff between ρ and t .
Theorem 4.5. Suppose we are given an integer τ , an N ×N integer
matrix A with entries of size O(logN ) bits, and a positive integer d .
There is a threshold circuit of depth at most 2d + 5 that determines
whether trace(A3) ≥ τ using O˜(dNω+cγ d ) gates, where c > 0 and
γ < 1 are constants with respect to N and d that depend on the
parameters of the fast matrix multiplication algorithm employed.
Proof. As for the previous theorem, we appeal to Lemma 4.3,
this time setting ρ = logT N +ε logα β N , for a constant ε > 0whose
value is given below. We have (αβ)ρ = (r/T 2)loдT NN ε = Nω−2+ε .
We set ε = γd logT (αβ)/(1 − γ ) > γd logT (αβ)/(1 − γd ). This
implies:
logT N − (1 − γd )(logT N + ε logα β N )
< logT N − (1 − γd )(logT N + (γd/(1 − γd )) logT (αβ) logα β N )
= logT N − (1 − γd ) logT N − γd logT N
= 0,
hence we may take t < d in Lemma 4.3 in order to have ht =
logT N . The theorem follows from the argument used in the proof
of Theorem 4.4 and taking c = logT (αβ)/(1 − γ ) (for Strassen’s
algorithm, c ≈ 1.585) . □
4.4 Matrix product
Now we describe how to compute the entries of the matrix product
C = AB, where we assume the entries of the N × N matrices A
and B require O(logN ) bits. We define a tree TAB with the same
structure as TA and TB . Each node of TAB represents the product
of the matrices of the corresponding nodes of TA and TB . Hence
the root of TAB represents the matrix C = AB, and the leaves
represent the N logT r scalar products computed by our fast matrix
multiplication algorithm. We compute the root of TAB in a bottom-
up manner assuming that we are only computing the nodes at levels
logT N = ht > ht−1 > . . . > h1 > h0 = 0.
We let αC = r/sC and βC = sC/T 2 be parameters that are
a function of the fast matrix multiplication algorithm employed.
Recall that from (2) we have that the scalars at the leaves of TA
and TB each require O(logN ) bits. Therefore the products of these
scalars represented by the leaves of TAB also require O(logN ) bits.
We show that TAB can be computed in a bottom-up manner
with depth and gate bounds comparable to those we obtained for
computing the leaves of TA and TB in the previous section. We will
need a lemma analogous to Lemma 4.2.
Lemma 4.6. For 1 ≤ i ≤ t , if the matrices at level hi of TAB have
been computed, then the matrices at level hi−1 can be computed in
depth 2 using O˜(αhi−1C β
hi
C N
2) gates.
Proof: See the appendix.
Using the above lemma, the proof of Lemma 4.3 yields the fol-
lowing.
Lemma 4.7. Let hi = ⌈(1 − γ i )ρ⌉, for some ρ > 0. Then all the
matrices at levelsh1, . . . ,ht of TAB can be computed in depth 2t using
O˜(t(αβ)ρN 2) gates.
Armed with the above lemmas, we obtain our main results in
similar fashion to those for the trace of A3. The structure of our
circuit is that we compute the scalars corresponding to the leaves
of TA and TB as described in the previous section. We then use
Lemma 3.3 to compute the scalar products between corresponding
leaves of TA and TB in depth 1. We finally apply the procedure
outline above to compute the root of TAB , representing the matrix
product AB, in a bottom-up manner. This essentially doubles the
depth of the circuits we obtain compared to the corresponding
circuits for the trace of A3.
Theorem 4.8. Suppose we are given N × N integer matrices A
and B with entries of size O(logN ) bits. There is a threshold circuit
of depth O(log logN ) that computes the matrix product AB using
O˜(Nω ) gates.
Theorem 4.9. Suppose we are given N ×N integer matricesA and
B with entries of size O(logN ) bits, and a positive integer d . There is
a threshold circuit of depth at most 4d + 1 that computes the matrix
product AB using O˜(dNω+cγ d ) gates, where c > 0 and γ < 1 are
constants with respect to N and d that depend on the parameters of
the fast matrix multiplication algorithm employed.
5 MATRIX-MULTIPLICATION APPLICATION
BACKGROUND
In this section, we provide more background on the relevance of
dense matrix multiplication to deep learning. We also discuss the
relevance of matrix multiplication to triangle counting in graphs
and triangle counting’s relevance to social network analysis.
Deep Learning. As mentioned in Section 1, our primary motiva-
tion for neural-circuit-based matrix multiplication is convolutional
neural networks for deep learning. See Warden’s clear explanation
of the role of matrix multiplication in convolution steps for neural
networks [25], which we summarize here. For more details see the
Stanford course notes at http://cs231n.github.io. These networks
assume the input is a two-dimensional image, with an n × n grid of
pixels, each with ℓ channels. The neural networks usually refer to
the number of channels as depth, but in this paper, “depth” refers
Constant-Depth and Subcubic-Size Threshold Circuits for Matrix Multiplication SPAA’18, July 16–18, 2018, Vienna, Austria
to the number of layers in our circuit. Typically the number of
channels ℓ is a constant, but not necessarily just the three classic
color channels (red, green, blue). In a convolutional step, we apply
a set of K kernels to the image. Each kernel looks for a particular
subpattern such as a horizontal edge or a splash of red. The ker-
nel considers a small constant q × q submatrix of pixels (with ℓ
channels) at a time and is applied across the whole image based on
a stride. This recognizes the pattern no matter where it is in the
image. For example, if the stride is four, then the kernel is applied
to every fourth column and every fourth row. A place where the
kernel is applied is called a patch. For each patch, for each kernel, a
dot product scores the extent to which the patch matches the kernel.
Computing all the kernels simultaneously is a matrix multiplication.
The first matrix is P ×Q , where P = O(n2) is the number of patches
andQ = q×q× ℓ is the number of elements in a kernel. The second
matrix is Q × K . This gives a P × K output matrix, giving the score
for each patch for each kernel.
Let N be the largest matrix dimension and suppose we use a
fast matrix multiplication algorithm that can multiply two N × N
matrices in time O(Nω ). Our circuit requires fan-in as large as
O(Nω ). These are gates at the end that compute the final output
matrix entries. Two of the relevant matrix dimensions for convo-
lutional neural networks, K and Q , are generally constants. The
third dimension P is not. However, if the particular architecture can
only support fan in x , we can break the matrix multiplication into
independent pieces, each with at most ω
√
x rows in the first matrix.
These can run in parallel, so they have the same depth, given a large
enough architecture. Thus the unbounded fan-in in our algorithm is
not neccesarily a practical limitation for our motivating application.
Social Network Analysis. Social networks of current interest are
too large for our circuit methods to be practical for neuromorphic
architectures in the near future. Also social network adjacency ma-
trices are sparse, unlike the dense small matrices for convolutional
neural networks we described above. Nevertheless, we briefly re-
view the motivation for matrix multiplication in this setting. One
application is computing the clustering coefficient of an N -node
graph (or subgraph). The global clustering coefficient is the ratio
of the number of triangles in the graph to the number of wedges
(length-2 paths) in the graph. A degree-δ node is at the center of(δ
2
)
wedges. The global clustering coefficient is the fraction of total
wedges in the graph that close into triangles. These triangles are
common in social networks, where the central node of a wedge
may introduce two neighbors. Social-network-analysis researchers
believe a high global clustering coefficient (also called transitivity)
means the graph has community structure. For example, Seshadri,
Kolda and Pinar [18] assumed constant global clustering coeffi-
cients when proving a structural property of social networks they
used for their BTER (Block Two-Level Erdös-Renyi) generative
model. Orman, Labatut and Cherifi [16] empirically studied the re-
lationship between community structure and clustering coefficient.
They found that high clustering coefficients did imply community
structure, although low clustering coefficients did not preclude it.
This paper considered the question: “Does a graph G have at
least τ triangles?” The user can pick a value of τ that represents
reasonable community structure for their particular kind of graph.
Usually they compute the total number of wedges D in O(N ) time
and set τ to some function of D (perhaps just scaling by a constant).
6 OPEN PROBLEMS
The main open problem is whether we can do matrix multiplication
with O˜(Nω ) gates in constant depth. Theorem 4.4 shows this can
be done in O(log logN ) depth. Another open question is lower
bounds: What is the minimum depth of a threshold circuit for
computing matrix products using O(N 3−ε ) gates? Can one show
that a constant-depth threshold circuit using O˜(Nω ) gates yields
an O(logN ) PRAM algorithm with O(Nω ) work?
One may show that our circuits are L-uniform. Can a stronger
uniformity condition be imposed?
One advantage of neural networks is their low energy relative to
CMOS-based electronics. One possible energy model for threshold
gates is to charge a gate only if it fires [24]. That is, charge a gate one
unit of energy for sending a signal if and only if the weighted sum
of the inputs exceeds the threshold. What is the energy complexity
of the kinds of matrix-multiplication circuits we consider?
ACKNOWLEDGEMENTS
This research was supported by the Laboratory Directed Research
and Development program at Sandia National Laboratories, a multi-
mission laboratory managed and operated by National Technology
and Engineering Solutions of Sandia, LLC., a wholly owned sub-
sidiary of Honeywell International, Inc., for the U.S. Department of
Energy’s National Nuclear Security Administration under contract
DE-NA0003525.
REFERENCES
[1] Frederico A. C. Azevedo, Ludmila R. B. Carvalho, Lea T. Grinberg, José Marcelo
Farfel, Renata E. L. Ferretti, Renata E. P. Leite, Wilson Jacob Filho, Roberto
Lent, and Suzana Herculano-Houzel. 2009. Equal numbers of neuronal and
nonneuronal cells make the human brain an isometrically scaled-up primate
brain. The Journal of Comparative Neurology 513, 5 (April 2009), 532–541. https:
//doi.org/10.1002/cne.21974
[2] Grey Ballard, Austin R. Benson, Alex Druinsky, Benjamin Lipshitz, and Oded
Schwartz. 2016. Improving the Numerical Stability of Fast Matrix Multiplication.
SIAM J. Matrix Anal. Appl. 37, 4 (Jan. 2016), 1382–1418. https://doi.org/10.1137/
15M1032168
[3] Dario Bini and Grazia Lotti. 1980. Stability of fast algorithms for matrix multipli-
cation. Numer. Math. 36 (1980), 63–72.
[4] Markus Bläser. 2013. Fast Matrix Multiplication. Number 5 in Graduate Surveys.
Theory of Computing Library. http://theoryofcomputing.org/articles/gs005/
[5] M. Davies, N. Srinivasa, T. H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou,
P. Joshi, N. Imam, S. Jain, Y. Liao, C. K. Lin, A. Lines, R. Liu, D. Mathaikutty, S.
McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. H. Weng, A. Wild, Y. Yang, and H.
Wang. 2018. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning.
IEEE Micro 38, 1 (Jan. 2018), 82–99. https://doi.org/10.1109/MM.2018.112130359
[6] Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and
Dharmendra S Modha. 2015. Backpropagation for energy-efficient neuromorphic
computing. In Advances in Neural Information Processing Systems. 1117–1125.
[7] Merrick Furst, James B. Saxe, and Michael Sipser. 1984. Parity, circuits, and the
polynomial-time hierarchy. Mathematical systems theory 17, 1 (Dec. 1984), 13–27.
https://doi.org/10.1007/BF01744431
[8] Giacomo Indiveri, Bernabe Linares-Barranco, Tara Julia Hamilton, André van
Schaik, Ralph Etienne-Cummings, Tobi Delbruck, Shih-Chii Liu, Piotr Dudek,
Philipp Häfliger, Sylvie Renaud, Johannes Schemmel, Gert Cauwenberghs,
John Arthur, Kai Hynna, Fopefolu Folowosele, Sylvain Saïghi, Teresa Serrano-
Gotarredona, Jayawan Wijekoon, Yingxue Wang, and Kwabena Boahen. 2011.
Neuromorphic silicon neuron circuits. Frontiers in Neuroscience 5, 73 (2011).
https://doi.org/10.3389/fnins.2011.00073
[9] Daniel M. Kane and Ryan Williams. 2015. Super-Linear Gate and Super-
Quadratic Wire Lower Bounds for Depth-Two and Depth-Three Threshold Cir-
cuits. arXiv:1511.07860 [cs] (Nov. 2015). http://arxiv.org/abs/1511.07860 arXiv:
1511.07860.
SPAA’18, July 16–18, 2018, Vienna, Austria Ojas Parekh, Cynthia A. Phillips, Conrad D. James, and James B. Aimone
[10] MuhammadMukaramKhan, David R Lester, Luis A Plana, A Rast, Xin Jin, Eustace
Painkras, and Stephen B Furber. 2008. SpiNNaker: mapping neural networks
onto a massively-parallel chip multiprocessor. In Neural Networks, 2008. IJCNN
2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint
Conference on. IEEE, 2849–2856.
[11] François Le Gall. 2014. Powers of Tensors and Fast Matrix Multiplication. In
Proceedings of the 39th International Symposium on Symbolic and Algebraic Com-
putation (ISSAC ’14). ACM, New York, NY, USA, 296–303. https://doi.org/10.
1145/2608628.2608664
[12] Warren S. McCulloch and Walter Pitts. 1943. A logical calculus of the ideas
immanent in nervous activity. The bulletin of mathematical biophysics 5, 4 (Dec.
1943), 115–133. https://doi.org/10.1007/BF02478259
[13] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun
Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka Naka-
mura, et al. 2014. A million spiking-neuron integrated circuit with a scalable
communication network and interface. Science 345, 6197 (2014), 668–673.
[14] Robert C. Minnick. 1961. Linear-Input Logic. IRE Trans. Electronic Computers 10,
1 (1961), 6–16. https://doi.org/10.1109/TEC.1961.5219146
[15] Saburo Muroga. 1959. The principle of majority decision logical elements and
the complexity of their circuits. In IFIP Congress. 400–406.
[16] Günce Keziban Orman, Vincent Labatut, and Hocine Cherifi. 2013. An empirical
study of the relation between community structure and transivity. Studies in
Computational Intelligence 424 (2013), 99–110.
[17] Catherine D. Schuman, Thomas E. Potok, Robert M. Patton, J. Douglas Bird-
well, Mark E. Dean, Garrett S. Rose, and James S. Plank. 2017. A Survey of
Neuromorphic Computing and Neural Networks in Hardware. (May 2017).
https://arxiv.org/abs/1705.06963
[18] C Seshadhri, Tamara G. Kolda, and Ali Pinar. 2012. Community structure and
scale-free collections of Erdös-Rényi graphs. Physical Review E 85, 056109 (2012).
[19] Jiří Šíma and Pekka Orponen. 2003. General-Purpose Computation with Neural
Networks: A Survey of Complexity Theoretic Results. Neural Computation 15, 12
(Dec. 2003), 2727–2778. https://doi.org/10.1162/089976603322518731
[20] Kai-Yeung Siu, Vwani Roychowdhury, and Thomas Kailath. 1991. Depth-size
tradeoffs for neural computation. IEEE Trans. Comput. 40, 12 (Dec. 1991), 1402–
1412. https://doi.org/10.1109/12.106225
[21] Kai-Yeung Siu, Vwani Roychowdhury, and Thomas Kailath. 1995. Discrete Neural
Computation: A Theoretical Foundation. Prentice-Hall, Inc., Upper Saddle River,
NJ, USA.
[22] Volker Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4
(Aug 1969), 354–356.
[23] I. Sugiarto, G. Liu, S. Davidson, L. A. Plana, and S. B. Furber. 2016. High per-
formance computing on SpiNNaker neuromorphic platform: A case study for
energy efficient image processing. In 2016 IEEE 35th International Performance
Computing and Communications Conference (IPCCC). 1–8. https://doi.org/10.
1109/PCCC.2016.7820645
[24] Kei Uchizawa, Rodney Douglas, and Wolfgang Maass. 2006. Energy Complexity
and Entropy of Threshold Circuits. In Automata, Languages and Programming
(Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, 631–642. https:
//doi.org/10.1007/11786986_55
[25] Pete Warden. 2015. Why GEMM is at the heart of deep learning. https://
petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/. on-
line, accessed February 9, 2017.
[26] Andrew C. C. Yao. 1985. Separating the polynomial-time hierarchy by oracles.
In , 26th Annual Symposium on Foundations of Computer Science, 1985. 1–10.
https://doi.org/10.1109/SFCS.1985.49
APPENDIX
Proof of Lemma 4.6:
Our proof uses a similar analysis to that of Lemma 4.2. We need
new parameters derived from our fast matrix multiplication algo-
rithm. For 1 ≤ j ≤ T 2, we use j to index the T 2 expressions for
entries of C . We define c ′j as the number of Mi terms that appear
in the jth expression for an entry of C . For Strassen’s algorithm
(Figure 1), we have c ′1 = 4, c
′
2 = 2, c
′
3 = 2, and c
′
4 = 4. Recall
the sparsity parameter sC from Definition 2.1, and observe that
sC =
∑
1≤j≤T 2 c ′j .
We assume the matrix products at level hi of TAB have been
computed and compute a node u at level hi−1. We again define
δi = hi − hi−1 for convenience. As an example, suppose we were
using the scalars corresponding to the leaves of TAB , at level ht ,
to construct a T ×T matrix at level ht − 1. Each of the T 2 entries
of this matrix consists of a {−1, 1}-weighted sum of the scalars at
level ht .
More generally, the matrix corresponding to node u at level hi−1
is composed of T 2δi blocks that are each {−1, 1}-weighted sums of
matrices of size N /Thi × N /Thi from level hi . We seek to bound
the number of terms in each such sum. Let ul for 1 ≤ l ≤ T 2δi
correspond to the blocks of the matrix at node u, and let size(ul )
be the number of terms in the weighted sum of matrices from level
hi that is equal to the block ul . Using an approach similar to that
from the proof of Lemma 4.2, we have:∑
1≤l ≤T 2δi
size(ul ) =
∑
m1+· · ·+mT 2=δi
(
δi
m1, . . . ,mT 2
) ∏
1≤j≤T 2
(c ′j )mj
=
©­«
∑
1≤j≤T 2
c ′j
ª®¬
δi
= sδiC , (5)
where, as in the proof of Lemma 4.2, the penultimate equality fol-
lows from the multinomial theorem.
Each block ul is of size N /Thi × N /Thi , and by the discussion
preceding this lemma and (2), we have that each entry of a block ul
requiresO(logN ) bits. Thus, by Lemma 3.2, we may compute all of
the blocks ul of u in depth 2 with a gate count of:∑
1≤l ≤T 2δi
O(N 2/T 2hi logN size(ul ))
= O(N 2/T 2hi logN
∑
1≤l ≤T 2δi
size(ul ))
= O(N 2/T 2hi logNsδiC ), (6)
where the last equality follows from the above equation. Since
there are rhi−1 nodes in total on level hi−1, we may compute all the
matrices on level hi−1 with a total gate count of,
O(rhi−1shi−hi−1C N 2/T 2hi logN ) = O((r/sC )hi−1 (sC/T 2)hiN 2 logN )
= O˜(αhi−1C β
hi
C N
2).
□
