Efficient Circuit Simulation in MapReduce by Frei, Fabian & Wada, Koichi
Efficient Circuit Simulation in MapReduce
Fabian Frei
Department of Computer Science, ETH Zürich, Universitätstrasse 6, CH-8006 Zürich, Switzerland
fabian.frei@inf.ethz.ch
Koichi Wada
Department of Applied Informatics, Hosei University, 3-7-2 Kajino, 184-8584 Tokyo, Japan
wada@hosei.ac.jp
Abstract
The MapReduce framework has firmly established itself as one of the most widely used parallel
computing platforms for processing big data on tera- and peta-byte scale. Approaching it from a
theoretical standpoint has proved to be notoriously difficult, however. In continuation of Goodrich et
al.’s early efforts, explicitly espousing the goal of putting the MapReduce framework on footing equal
to that of long-established models such as the PRAM, we investigate the obvious complexity question
of how the computational power of MapReduce algorithms compares to that of combinational
Boolean circuits commonly used for parallel computations. Relying on the standard MapReduce
model introduced by Karloff et al. a decade ago, we develop an intricate simulation technique to show
that any problem in NC (i.e., a problem solved by a logspace-uniform family of Boolean circuits
of polynomial size and a depth polylogarithmic in the input size) can be solved by a MapReduce
computation in $O(T(n)/\log n)$ rounds, where n is the input size and $T(n)$ is the depth of the
witnessing circuit family. Thus, we are able to closely relate the standard, uniform NC hierarchy
modeling parallel computations to the deterministic MapReduce hierarchy DMRC by proving that
$\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^i$ for all i ∈ N. Besides the theoretical significance, this result has important
applied aspects as well. In particular, we show for all problems in $\mathrm{NC}^1$ – many practically relevant
ones, such as integer multiplication and division and the parity function, being among these – how
to solve them in a constant number of deterministic MapReduce rounds.
2012 ACM Subject Classification Theory of computation → Complexity classes; Computing
methodologies → MapReduce algorithms; Theory of computation → Circuit complexity; Theory of
computation → MapReduce algorithms; Software and its engineering → Ultra-large-scale systems
Keywords and phrases MapReduce, Circuit Complexity, Parallel Algorithms, Nick’s Class NC
Digital Object Identifier 10.4230/LIPIcs.ISAAC.2019.52
Related Version The full version of this paper including all figures and proofs is freely available at
http://arxiv.org/abs/1907.01624.
Funding Koichi Wada: Research done in part during a supported visit at ETH Zürich and partly
supported by JSPS KAKENHI No. 17K00019 and by the Japan Science and Technology Agency
(JST) SICORP (Grant#JPMJSC1806).
Acknowledgements We thank the anonymous reviewers for their helpful comments.
1 Introduction
Despite the overwhelming success of the MapReduce framework in the big data industry and
the great attention it has garnered ever since its inception over a decade ago, theoretical
results about it have remained scarce in the literature. In particular, it is very natural to
ask how powerful exactly MapReduce computations are in comparison to the traditional
models of parallel computations based on circuits; a question that has practical implications
as well. The answers have proved to be very elusive, however. In this paper, we show how
MapReduce programs can efficiently simulate circuits used for parallel computations, thus
tying these two worlds together more tightly.
© Fabian Frei and Koichi Wada;
licensed under Creative Commons License CC-BY
30th International Symposium on Algorithms and Computation (ISAAC 2019).
Editors: Pinyan Lu and Guochuan Zhang; Article No. 52; pp. 52:1–52:21
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany
In this section we first provide an introduction to the concept of MapReduce, then present
the related work, and finally describe our contribution. In Section 2, we will formally define
the traditional models of parallel computing and the MapReduce model. In Section 3, we
then derive our main results. Section 4 concludes the paper with a short summary and a
discussion of our findings, outlining opportunities for future research.
1.1 Background and Motivation
In recent years the amount of data available and demanding analysis has experienced an
astonishing growth. The amount of memory in commercially available servers has also grown
at a remarkable pace over the past decade; it is now exceeding tera- and even peta-bytes.
Despite the considerable advances in the availability of computational power, traditional
approaches remain insufficient to cope with such huge amounts of data. A new form of
parallel computing has become necessary to deal with these enormous quantities of available
data. The MapReduce framework has been attracting great interest due to its suitability for
processing massive data-sets. This framework was originally developed by Google [5], but
an open source implementation called Hadoop has recently been developed and is currently
used by over a hundred companies, including Yahoo!, Facebook, Adobe, and IBM [19].
MapReduce differs substantially from previous models of parallel computation in that
it combines aspects of both parallel and sequential computation. Informally, a MapReduce
computation can be described as follows.
The input is a multiset of key-value pairs 〈k; v〉. In a first step, the map step, each of
these key-value pairs is separately and independently transformed into an entire multiset of
key-value pairs by a map function µ. In the next step, the shuffle step, we collect all key-value
pairs from the multisets that have been produced in the previous step, group them by their
keys, and collapse each group {〈k; v1〉, 〈k; v2〉, . . .} of pairs containing the same key into a
single key-value pair 〈k; {v1, v2, . . .}〉 consisting of said key and a list of the associated values.
In a third step, the reduce step, a reduce function ρ transforms the list of values in each
key-value pair 〈k; {v1, v2, . . .}〉 into a new list {v′1, v′2, . . . }. Again, this is done separately
and independently for each pair. The final output consists of the pairs {〈k; v′1〉, 〈k; v′2〉, . . .}
for each key k. The different instances that implement the reduce function for the different
groups of pairs are called reducers. Analogously, mappers are instances of the map function.
The three steps described above constitute one round of the MapReduce computation and
transform the input multiset into a new multiset of key-value pairs. A complete MapReduce
computation consists of any given number of rounds and acts just as the composition of
the single rounds. The shuffle step works the same way every time; the map and reduce
functions, however, may change from round to round. A MapReduce computation with
R rounds is therefore completely described by a list µ1, ρ1, µ2, ρ2, . . . , µR, ρR of map and
reduce functions. In both the map step and the reduce step, the input pairs can be processed
in parallel since the map and reduce functions act independently on the pairs and groups
of pairs, respectively. These steps therefore capture the parallel aspect of a MapReduce
computation, whereas the shuffle step enforces a partial sequentiality since the shuffled pairs
can be output only once the previous map step is completed in its entirety.
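The three steps of a round can be sketched in a few lines of Python. This is a minimal, sequential illustration only: the names `mapreduce_round`, `mu`, and `rho` are hypothetical, pairs are modeled as Python tuples, and the memory restrictions of the formal model are ignored.

```python
from collections import defaultdict

def mapreduce_round(pairs, mu, rho):
    """Run one round: map every pair, shuffle by key, reduce every group."""
    # Map step: each pair is transformed independently into a list of pairs.
    mapped = [out for k, v in pairs for out in mu(k, v)]
    # Shuffle step: group all values by their key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce step: each group is transformed independently into new values.
    return [(k, v2) for k, vs in groups.items() for v2 in rho(k, vs)]

# Hypothetical example: parity of a bit string, one bit per input pair.
def mu(k, bit):
    return [("parity", bit)]

def rho(k, bits):
    return [sum(bits) % 2]

result = mapreduce_round([(i, b) for i, b in enumerate([1, 0, 1, 1])], mu, rho)
```

Sending all bits to a single key, as this toy example does, is exactly what the memory limits of the formal model rule out for large inputs; the sketch only shows the data flow of a round.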
The MapReduce paradigm was introduced in [5] in the context of algorithm design
and analysis. A treatment as a formal computational model, however, was missing in
the beginning. Later, a number of models emerged to deal more rigorously with
algorithmic issues [7, 10, 12, 14, 15]. In this paper, our interest lies in studying the MapReduce
framework from a standpoint of parallel algorithmic power by comparing it to standard
models of parallel computation such as Boolean circuits and parallel random access machines
(PRAMs). A PRAM can be classified by how far simultaneous access by processors to its
memory is restricted; it can be CRCW, EREW, CREW, or ERCW, where R, W, C, and
E stand for Read, Write, Concurrent, and Exclusive, respectively [4]. If concurrent writing
is allowed, we need to further specify how parallel writes by multiple processors to a single
memory cell are handled. The most natural choice is arguably that every memory cell contains
after each time step the total of all numbers assigned to it by different processors during that
step. In fact, all constructions in this paper work with this treatment of simultaneous writes;
we thus generally assume this model. If the context warrants it, we speak of a Sum-CRCW
to make this assumption explicit.
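The Sum rule for concurrent writes can be illustrated with a small sketch. The function name `sum_crcw_write` is hypothetical, and the sketch assumes that cells receiving no write in a step keep their previous content, which the text does not state explicitly.

```python
from collections import defaultdict

def sum_crcw_write(memory, writes):
    """Apply one time step of concurrent writes under the Sum rule.

    `writes` is a list of (cell, value) pairs, one per writing processor.
    Every written cell ends up holding the total of the values assigned to it.
    """
    totals = defaultdict(int)
    for cell, value in writes:
        totals[cell] += value
    new_memory = dict(memory)
    new_memory.update(totals)  # assumption: unwritten cells keep their content
    return new_memory

memory = {0: 7, 1: 7}
# Three processors write to cell 0 in the same step; cell 1 is not written.
memory = sum_crcw_write(memory, [(0, 2), (0, 3), (0, 5)])
```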
1.2 Related Work
We briefly present and discuss the following known results on the comparative power of the
MapReduce framework and PRAM models.
1. A $T$-time EREW-PRAM algorithm can be simulated by an $O(T)$-round MapReduce
algorithm, where each reducer uses memory of constant size and an aggregate memory
proportional to the amount of shared memory required by the PRAM algorithm [10, 12].
2. A $P$-processor, $M$-memory, $T$-time EREW-PRAM algorithm can be simulated by an $O(T)$-round,
$(P+M)$-key MUD algorithm with a communication complexity of $O(\log(P+M))$
bits per key, where a MUD (massive, unordered, distributed) algorithm is a data-streaming
MapReduce algorithm in the following sense: The reducers do not receive the entire list of
values associated with a given key at once, but rather as a stream to be processed in one
pass, using only a small working memory determining the communication complexity [7].
3. When using MapReduce computations to simulate a CRCW-PRAM instead, again with
$P$ processors and $M$ memory, we incur an $O(\log_m(P+M))$ slowdown compared to the
simulations above, where $m$ is an upper bound on each reducer’s input and output [10].
These results imply that any problem solved by a PRAM with a polynomial number of
processors and in polylogarithmic time T can be solved by a MapReduce computation
with an amount of memory equal to the number of PRAM processors, and in a number
of rounds equal to the computation time of even the powerful CRCW-PRAM. Since the
class of problems solved by CRCW-PRAMs in time $T \in O(\log^i n)$ is equal to the class of
problems solved by families of polynomial-sized combinational circuits consisting of gates
with unbounded fan-in and fan-out and depth $T \in O(\log^i n)$ (often denoted $\mathrm{AC}^i$) [1], these
circuits can be simulated in a MapReduce computation with a number of rounds equal to
the time required by these circuits.
Since the publication of the seminal paper by Karloff et al. [12], extensive effort has been
spent on developing efficient algorithms in MapReduce-like frameworks [3, 6, 13, 11, 17]. Only
a few relationships between the theoretical MapReduce model of [12] and classical complexity
classes have been established, however; for example, any problem in $\mathrm{SPACE}(o(\log n))$ can be
solved by a MapReduce computation with a constant number of rounds [8].
Recently, Roughgarden et al. [16, Theorem 6.1] described a short and simple way of
simulating $\mathrm{NC}^1$ circuits with a certain class of models of parallel computation. The constraints
of these models, namely the number of machines and the memory restrictions, are exactly
tailored to allow for this general simulation method, however. In particular, it crucially relies
on the fact that all models of this class are more powerful than the MapReduce model in
that they all grant us a number of machines that is polynomial in the input size; this makes
it possible to just dedicate one machine to each of the circuit gates. Such a simple simulation
is impossible with MapReduce computations since the standard model due to Karloff only
allows for a sublinear number of machines with sublinear memory.
1.3 Contribution
We prove that $\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^i$ for all i ∈ {0, 1, 2, . . . }, where $\mathrm{DMRC}^i$ is the set of
problems solvable by a deterministic MapReduce computation in $O(\log^i n)$ rounds. In the
case of $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$, which already opens up a plethora of applications on its own, the
result holds for every possible choice of the model parameter ε (see Section 2.2), that is, for 0 < ε ≤ 1/2. The higher levels of the
hierarchy require an entirely different proof method, which yields the result for 0 < ε < 1/2.
This is a substantial improvement over the previous results that only imply, as outlined
above, the far weaker claim $\mathrm{AC}^i \subseteq \mathrm{MRC}^i$. The case i = 1 is of particular practical interest
since $\mathrm{NC}^1 \setminus \mathrm{AC}^0$ contains plenty of relevant problems such as integer multiplication and
division, the parity function, and the recognition of Dyck languages; see [1]. Our results show
how to solve all of these problems with a deterministic MapReduce program in a constant
number of rounds.
2 Preliminaries
We denote by N = {0, 1, 2, . . .} the natural numbers including zero and let N+ = N \ {0}.
Moreover, we let [i] = {0, 1, . . . , i − 1} denote the first i natural numbers for any i ∈ N+.
2.1 Models of Parallel Computation
In this section, we define the common complexity classes capturing the power of parallel
computation; most prominently the NC hierarchy.
A finite set $B = \{f_0, \dots, f_{|B|-1}\}$ of Boolean functions $f_i : \{0,1\}^{n_i} \to \{0,1\}$ with $n_i \in \mathbb{N}$
for every i ∈ [|B|] is called a basis. For every n,m ∈ N+, a (Boolean) circuit C over the basis
B with n inputs and m outputs is a directed acyclic graph that contains n sources (nodes
with no incoming edges), called the input nodes, and m sinks (nodes with no outgoing edges).
The fan-in of a node is the number of incoming edges, the fan-out is the number of outgoing
edges. Nodes that are neither sources nor sinks are called gates. Each gate is labeled with a
function $f_i \in B$ and has fan-in $n_i$. It computes $f_i$ on the input given by the incoming edges
and outputs the result (either 0 or 1) to each of the outgoing edges. A basis B is said to be
complete if for every Boolean function f , we can construct over the basis B a circuit of the
described form that computes f . In the following, we use the complete basis B = {∨,∧,¬}.
The size of a circuit C, denoted by size(C), is the total number of edges it contains. The
level of a node v in a circuit C, denoted level(v), is defined recursively: The level of a sink is 0,
and the level of a node v with nonzero fan-out is one greater than the maximum of the levels of
the outgoing neighbors of v. The depth of C, denoted depth(C), is the maximum level across
all nodes in C. A function $f : \{0,1\}^* \to \{0,1\}^*$ is implicitly logspace computable if the two
mappings $(x, i) \mapsto \chi_{i \le |f(x)|}$, where $\chi$ denotes the characteristic function, and $(x, i) \mapsto (f(x))_i$
are computable using logarithmic space. A circuit family $\{C_n\}_{n=0}^{\infty}$ is logspace-uniform if
there is an implicitly logspace computable function mapping $1^n$ to the description of the
circuit $C_n$. It is known that the class of languages that have logspace-uniform circuits of
polynomial size equals P [1, Thm. 6.15].
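The recursive definition of level and depth given above translates directly into code. The following is a sequential sketch, not the parallel algorithm used later in Section 3.3.1; the function name `node_levels` and the adjacency-list encoding of the circuit are assumptions made for illustration.

```python
from functools import lru_cache

def node_levels(successors):
    """Map each node to its level; `successors[v]` lists v's outgoing neighbors."""
    @lru_cache(maxsize=None)
    def level(v):
        outs = successors[v]
        # A sink has level 0; otherwise one more than the highest successor.
        return 0 if not outs else 1 + max(level(u) for u in outs)
    return {v: level(v) for v in successors}

# Hypothetical circuit (x0 AND x1) OR (NOT x2) with a single sink "out".
circuit = {
    "x0": ["and"], "x1": ["and"], "x2": ["not"],
    "and": ["or"], "not": ["or"], "or": ["out"], "out": [],
}
levels = node_levels(circuit)
depth = max(levels.values())
```

In this example the sink has level 0, the OR gate level 1, the AND and NOT gates level 2, and the three inputs level 3, so the depth is 3.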
For any i ∈ N, the complexity class $\mathrm{NC}^i$ contains a language L exactly if there is a
constant c and a logspace-uniform family of circuits $\{C_n\}_{n=0}^{\infty}$ recognizing L such that $C_n$ has
size $O(n^c)$, depth $O(\log^i n)$, and all nodes have fan-in at most 2. The union is Nick’s class
$\mathrm{NC} = \bigcup_{i=0}^{\infty} \mathrm{NC}^i$. We mention that there is an analogous definition of classes Nonuniform-$\mathrm{NC}^i$
that do not require logspace uniformity from the circuits; they constitute a different hierarchy.
The complexity classes $\mathrm{AC}^i$ and $\mathrm{AC} = \bigcup_{i=0}^{\infty} \mathrm{AC}^i$ are defined exactly as $\mathrm{NC}^i$ and NC,
except that the restriction of the maximal fan-in to at most 2 is omitted. Nevertheless, the
restriction on the circuit size implies that the fan-in of a node is bounded by a polynomial
in n. The OR gates and AND gates in such a circuit can therefore be replaced by trees of
gates of fan-in at most 2 with a depth in $O(\log n)$. It follows that $\mathrm{AC}^i \subseteq \mathrm{NC}^{i+1}$ for all i ∈ N
and thus NC = AC. (Analogously, we see why Nick’s class can also be defined, as it often
is, by upper-bounding the fan-in by an arbitrary constant greater than 2.) The inclusion
$\mathrm{NC}^i \subseteq \mathrm{AC}^i$ for every i ∈ N is immediate from the definition. The first two inclusions of the
resulting chain are known to be strict – namely, we have $\mathrm{NC}^0 \subsetneq \mathrm{AC}^0 \subsetneq \mathrm{NC}^1$; see [1].
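The replacement of a gate with large fan-in by a tree of fan-in-2 gates can be made concrete as follows. This is an illustrative sketch; the function name `binary_tree_depth` is hypothetical, and a nonempty input list is assumed. For k inputs, the number of gate layers is the base-2 logarithm of k, rounded up.

```python
def binary_tree_depth(values, op):
    """Combine `values` with a balanced tree of fan-in-2 `op` gates.

    Returns (result, depth), where depth counts the layers of gates used.
    """
    layer, depth = list(values), 0
    while len(layer) > 1:
        # Pair up neighbors; an unpaired last wire is passed through unchanged.
        nxt = [op(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], depth

# An OR gate with fan-in 9 becomes a tree with 4 layers (9 -> 5 -> 3 -> 2 -> 1).
result, depth = binary_tree_depth([0, 0, 1, 0, 0, 0, 0, 0, 0], lambda a, b: a | b)
```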
Finally, we summarize the known results on how the classes of languages recognized
by different PRAMs fit into the two hierarchies of NC and AC. Let $\mathrm{EREW}^i$, $\mathrm{CREW}^i$, and
$\mathrm{CRCW}^i$ denote the sets of problems of size n computed by EREW-PRAMs, CREW-PRAMs,
and CRCW-PRAMs, respectively, with a polynomial number of processors in $O(\log^i n)$ time.
For every i ∈ N, we have $\mathrm{NC}^i \subseteq \mathrm{EREW}^i \subseteq \mathrm{CREW}^i \subseteq \mathrm{CRCW}^i = \mathrm{AC}^i \subseteq \mathrm{NC}^{i+1}$; see [1].
2.2 The MapReduce Model
In this section we describe the standard MapReduce model as proposed by [12]. It defines
the notions of map functions and reduce functions, which are summarized under the term
primitives. Roughly speaking, a MapReduce computing system executes primitives, interleaved
with so-called shuffle operations. The basic data unit in these computations is an
ordered pair 〈key; value〉, called key-value pair. In general, keys and values are just binary
strings, allowing us to encode all the usual entities.
A map function is a (possibly randomized) function that takes as input a single key-value
pair and outputs a finite multiset of new key-value pairs. A reduce function (again, possibly
randomized) takes instead an entire set of key-value pairs $\{\langle k; v_{k,1}\rangle, \langle k; v_{k,2}\rangle, \dots\}$, where all
the keys are identical, and outputs a single key-value pair 〈k; v′〉 with that same key.
A MapReduce program is nothing else than a sequence $\mu_1, \rho_1, \mu_2, \rho_2, \dots, \mu_R, \rho_R$ of map
functions $\mu_r$ and reduce functions $\rho_r$. The input of this program is a multiset $U_0$ of key-value
pairs. For each r ∈ {1, . . . , R}, a map step, a shuffle step and a reduce step are successively
executed as follows:
1. Map step: Each pair 〈k; v〉 in $U_{r-1}$ is given as input to an arbitrary instance of the map
function $\mu_r$, which then produces a finite sequence of pairs. The multiset of all produced
pairs is denoted by $V_r$.
2. Shuffle step: For each key k, let $V_{k,r}$ be the multiset of all values $v_i$ such that $\langle k; v_i\rangle \in V_r$. The
MapReduce system automatically constructs the multiset $V_{k,r}$ from $V_r$ in the background.
3. Reduce step: For each key k, a reducer (i.e., an instance calculating the reduce function
$\rho_r$) receives k and the elements of $V_{k,r}$ in arbitrary order. We usually write such an input
as a set of key-value pairs that all have key k. The reducer calculates, for each key k
independently, from $V_{k,r}$ a set $U_{k,r}$ of key-value pairs. The output will then consist of all
key-value pairs computed in this reduce step; that is, $U_r$ is the union over all sets $U_{k,r}$.
Fix any ε with 0 < ε ≤ 1/2 and denote the size of the MapReduce program’s input by
N. For every i ∈ N, a problem is in $\mathrm{MRC}^i$ if and only if there is a MapReduce program
$\mu_1, \rho_1, \mu_2, \rho_2, \dots, \mu_R, \rho_R$ satisfying the following properties:
1. It outputs a correct answer to the problem with probability at least 3/4.
2. The number of rounds of the MapReduce program, R, is in $O(\log^i N)$.
3. The potentially randomized primitives (i.e., all map and reduce functions) are computable
by a RAM with $O(\log N)$-bit words using $O(N^{1-\varepsilon})$ space and time polynomial in N.
4. The pairs produced by the map functions can be stored in $O(N^{2(1-\varepsilon)})$ space.
A MapReduce program satisfying these conditions is called an $\mathrm{MRC}^i$-algorithm. Note
that due to the last condition it is impossible to even store the input unless $2(1-\varepsilon) \ge 1$,
which explains the restriction 0 < ε ≤ 1/2. As with NC, we define the union class
$\mathrm{MRC} = \bigcup_{i=0}^{\infty} \mathrm{MRC}^i$. Requiring all primitives to be deterministic yields the analogous hierarchy of
$\mathrm{DMRC} = \bigcup_{i=0}^{\infty} \mathrm{DMRC}^i$. Note that we obviously have $\mathrm{DMRC}^i \subseteq \mathrm{MRC}^i$ for all i ∈ N. We
will often refer to the single rounds of such MapReduce algorithms as MRC-rounds and
DMRC-rounds, respectively.
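The restriction on ε can be spelled out in one line. The sketch below assumes that storing the input takes space at least N and that the fourth condition grants at most $c\,N^{2(1-\varepsilon)}$ space for some constant c; the constant c is introduced here only for illustration.

```latex
% Assumption: the input needs space N, and condition 4 allows c * N^{2(1-eps)}.
\[
  N \le c\,N^{2(1-\varepsilon)} \text{ for all large } N
  \iff 2(1-\varepsilon) \ge 1
  \iff \varepsilon \le \tfrac{1}{2}.
\]
```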
3 Simulating Parallel Computations in MapReduce
We are now going to prove our two main results $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$ for 0 < ε ≤ 1/2 and
$\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^i$ for all i ∈ N+ and 0 < ε < 1/2 in Sections 3.2 and 3.3, respectively. In
both cases, we will be making use of a technical tool derived in Section 3.1 and obtain the
results by showing how to use MapReduce computations for two different, delicate simulations.
For the inclusion $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$, we simulate width-bounded branching programs that are
equivalent to the respective circuits by Barrington’s classical theorem [2], whereas for the
higher levels of the hierarchy, we directly simulate the combinational circuits themselves.
3.1 A Technical Tool
Goodrich et al. [10] parametrize MapReduce algorithms, on the one hand, by the memory limit
m for the input/output buffer of the reducers and, on the other hand, by the communication
complexity $K_r$ of round r, that is, the total size of inputs and outputs for all mappers and
reducers in round r. We state a useful result from [10].
Theorem 1. Any CRCW-PRAM algorithm using M total memory, P processors and T
time can be simulated in $O(T \log_m P)$ deterministic MapReduce-rounds with communication
complexity $K_r \in O((M+P) \log_m(M+P))$.
We denote by N the size of the smallest circuit representation of the CRCW-PRAM
algorithm (i.e., its number of edges) plus the size of its input. Taking into account our
requirements $m \in O(N^{1-\varepsilon})$ and $K_r \in O(N^{2(1-\varepsilon)})$, we obtain the following technical tool,
which will prove to be useful in our endeavor.
Corollary 2. Any CRCW-PRAM algorithm using M total memory, P processors and T time
can be simulated in $O(T \log_{N^{1-\varepsilon}} P)$ DMRC-rounds if $(M+P) \log_{N^{1-\varepsilon}}(M+P) \in O(N^{2(1-\varepsilon)})$.
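To see why the logarithmic factor in Corollary 2 is harmless in typical applications, consider a change of base. The bound $P \le N^{c}$ for a constant c is an assumption made here for illustration; it holds, for example, whenever the number of processors is polynomial in N.

```latex
\[
  \log_{N^{1-\varepsilon}} P
  = \frac{\log P}{(1-\varepsilon)\log N}
  \le \frac{c}{1-\varepsilon} \in O(1)
  \qquad \text{if } P \le N^{c}.
\]
```

Under this assumption, a PRAM algorithm running in time T is simulated in $O(T)$ DMRC-rounds, provided the communication condition of the corollary is met.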
3.2 Simulating NC1
It is known that Nonuniform-$\mathrm{NC}^1$ is equal to the class of languages recognized by nonuniform
width-bounded branching programs. A careful inspection of the proof due to Barrington [2]
– crucially relying on the non-solvability of the permutation group on 5 elements – reveals
that it naturally translates to the uniform analogue: Our uniform class $\mathrm{NC}^1$ is identical
with the class of languages recognized by uniform width-bounded branching programs. In
order to prove $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$, it therefore suffices to show how to simulate such branching
programs by appropriate MapReduce computations with a constant number of rounds.
We first define what a width-bounded branching program is. Let n,w ∈ N+. The input to
the program is an assignment α to n Boolean variables X = {x0, . . . , xn−1}. An instruction
or line of the program is a triple (xi, f, g), where i is the index of an input variable xi ∈ X
and f and g are endomorphisms of [w]. An instruction (xi, f, g) evaluates to f if α(xi) = 1
and to g if α(xi) = 0. A width-w branching program of length t is a sequence of instructions
$(x_{i_j}, f_j, g_j)$ for j ∈ [t]. We also refer to the t instructions as the lines of the program. Given
an assignment α to X , a branching program B yields a function B(α) that is the composition
of the functions to which the instructions evaluate.
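The evaluation of a branching program can be sketched as follows. The encoding is an assumption made for illustration: a function on [w] is a tuple whose entry at position s is the image of s, an instruction is a triple (i, f, g) with i the variable index, and the functions are composed in program order (earlier lines are applied first), a convention the definition above leaves open. The names `evaluate` and `parity_program` are hypothetical.

```python
def evaluate(program, alpha):
    """Compose the functions selected by `alpha`, in program order.

    `program` is a list of instructions (i, f, g): variable index i and two
    functions on [w], each given as a tuple t with t[s] the image of s.
    Returns the composite as a tuple of the same form.
    """
    w = len(program[0][1])
    state = tuple(range(w))  # identity on [w]
    for i, f, g in program:
        h = f if alpha[i] == 1 else g  # f if the variable is 1, g if it is 0
        state = tuple(h[s] for s in state)  # apply h after what came before
    return state

# Hypothetical width-2 program computing the parity of three bits:
# a 1 swaps the two states, a 0 leaves them alone.
SWAP, ID = (1, 0), (0, 1)
parity_program = [(i, SWAP, ID) for i in range(3)]
```

With this program, an assignment has odd parity exactly if the composite equals `SWAP`, so `{SWAP}` plays the role of the accepting set of functions.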
To recognize a language L ⊆ {0, 1}∗, we need a family $(B_n)_{n=0}^{\infty}$ of width-w branching
programs with $B_n$ taking n Boolean inputs. We say that L is recognized by $B_n$ if there is,
for each n ∈ N, a set $F_n$ of endomorphisms of [w] such that for all $\alpha \in \{0,1\}^n$, α ∈ L if
and only if $B_n(\alpha) \in F_n$. If $f_i$ and $g_i$ are automorphisms, that is, permutations of [w] for all
i ∈ [t], then $B_n$ is called a width-w permutation branching program, or w-PBP for short.
Theorem 3 ([2]). If $L \in \mathrm{NC}^1$, then L is recognized by a logspace-uniform 5-PBP family.
Due to Theorem 3 it is sufficient for our purposes to simulate the w-PBPs with constant
w instead of the circuit families provided by the definition of $\mathrm{NC}^1$. In order to do this, we
need to encode the given w-PBP and the possible assignments in the right form, namely we
express them as sets of key-value pairs. A w-PBP of length t can be described as the set
$\{\langle p; (x_{i_p}, f_p, g_p)\rangle \mid p \in [t]\}$, where we call p the line number of line $(x_{i_p}, f_p, g_p)$. Similarly, an
assignment $\alpha : X \to \{0,1\}$, $x_i \mapsto v_i$ to the input variables $X = \{x_0, x_1, \dots, x_{n-1}\}$ is described
by the set of key-value pairs $\{\langle i; (x_i, v_i)\rangle \mid i \in [n]\}$, letting the mappers divide the information
by the indices of the input variables. Let $N_O$ and $N_I$ be the total size of the encodings of the
w-PBP and the input assignment α, respectively. Let $N = N_O + N_I$ and let $d = \lceil N_O^{1-\varepsilon} \rceil$ and
$\ell = \lceil N_O^{\varepsilon} \rceil$. We denote by ÷ the integer division. For every $q \in [t \div d]$, let w-PBP$_q$ be the qth of
the subprogram blocks of w-PBP of length d, that is $\{\langle p; (x_{i_p}, f_p, g_p)\rangle \mid qd \le p \le (q+1)d - 1\}$.
For ease of readability, we assume from now on without loss of generality that $d\ell = t$, so
that w-PBP can be partitioned into exactly $\ell$ such subprograms.
For every $q \in [\ell]$, we denote by $X_q$ the subset of variables from X appearing in the
instructions of subprogram w-PBP$_q$. An assignment $\alpha_q : X_q \to \{0,1\}$ to these variables is
represented as a set of key-value pairs in the following way. Recall that the subprogram
w-PBP$_q$ is a list of lines, each of which requires the assignment of a value, either 0 or 1, for
exactly one variable. Let $x_{q,j}$ be the jth variable to which a value is assigned in w-PBP$_q$,
let $p_{q,j}$ denote the number of the line in which this assignment occurs for the first time in
w-PBP$_q$, and let $v_{q,j}$ denote the value that is assigned to $x_{q,j}$ in this line. Now, we represent
$\alpha_q$ by $\{\langle q; (p_{q,j}, x_{q,j}, v_{q,j})\rangle \mid j \in [|X_q|]\}$. Note that despite the dependence of $X_q$ on q, we
always have $|X_q| \le d$. Having seen how to express w-PBP, α, and both w-PBP$_q$ and $\alpha_q$ for
all $q \in [\ell]$ as a set of key-value pairs, we are ready to state and prove the following lemma.
Lemma 4 (Proof in Appendix A [9]). Let L be a w-PBP-recognized language. If the
representations of w-PBP and, for every $q \in [\ell]$, $\alpha_q$ are given, then we can decide in a
2-round DMRC-computation whether α ∈ L or not.
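The proof of Lemma 4 is deferred to the appendix, but the underlying idea can be illustrated: because composition is associative, each block can be evaluated on its own and the block results can be composed afterwards. The sketch below is an illustration under assumptions, not a reproduction of the appendix proof. Functions on [w] are tuples, composition follows program order, the split into "round 1" and "round 2" is a plausible reading of the lemma, and all function names are hypothetical.

```python
from functools import reduce

def compose(first, second):
    """Return the map that applies `first` and then `second` (tuples on [w])."""
    return tuple(second[s] for s in first)

def evaluate_block(block, alpha_q, w):
    """Composite of one block; `alpha_q` maps variable index -> 0/1."""
    funcs = [f if alpha_q[i] == 1 else g for i, f, g in block]
    return reduce(compose, funcs, tuple(range(w)))

def evaluate_blockwise(program, alpha, d, w):
    # Assumed round 1 (one reducer per block): evaluate each block on its own,
    # using only the assignment restricted to the variables of that block.
    blocks = [program[p:p + d] for p in range(0, len(program), d)]
    partial = [evaluate_block(b, {i: alpha[i] for i, _, _ in b}, w) for b in blocks]
    # Assumed round 2 (a single reducer): compose the block results in order.
    return reduce(compose, partial, tuple(range(w)))

# Hypothetical width-3 program with six lines over three variables.
ROT, SWAP, ID = (1, 2, 0), (1, 0, 2), (0, 1, 2)
prog = [(0, ROT, ID), (1, SWAP, ID), (2, ROT, SWAP),
        (0, SWAP, ROT), (1, ROT, ID), (2, ID, SWAP)]
alpha = [1, 0, 1]
blockwise = evaluate_blockwise(prog, alpha, d=2, w=3)
whole = evaluate_blockwise(prog, alpha, d=len(prog), w=3)
```

Each block result is a function on [w] and thus has constant size for constant w, which is what makes collecting all of them in a single reducer conceivable.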
In the following four lemmas, we show that $\alpha_q$ can be computed in a constant number
of rounds from w-PBP and α for every $q \in [\ell]$. The challenge lies in designing an interface
between the different reducers to bridge the gap between the $\ell$ program blocks w-PBP$_q$
and the given assignments, initially cut into $\ell$ blocks based solely on the indices of the input
variables, without exceeding the memory limits. We begin with a brief overview of the
four steps.
1. For each $x_i$, where i ∈ [n], we compute the number of subprograms in which $x_i$ appears,
and denote this number by $\#S(x_i)$. Note that $\#S(x_i) \le \ell$ and that $\#S(x_i)$ is the number
of all those reducers for which the value assignment of $x_i$ is generally required to compute
the resulting permutations in the corresponding subprograms.
2. We compute the prefix sums of $\#S(x_i)$. For i ∈ [n], let $y_i = \sum_{j=0}^{i} \#S(x_j)$. Note that $y_i$
is the number of assignment triples $(p_{q,j}, x_{q,j}, v_{q,j})$ with $0 < j \le i$ needed to compute the
action of the first i subprograms and that $y_{n-1} = \sum_{q=0}^{\ell-1} |\alpha_q|$.
3. Based on the prefix sums, we will compute a separation of the input variables into $\ell$
contiguous blocks such that, for each $q \in [\ell]$, it is feasible for reducer$_q$ to produce from
the qth block the input value assignments that it needs to contribute for the next step.
This is nontrivial since the number of input assignments must not exceed $O(d)$ due to
the memory limitation of reducer$_q$. A separation of the input variables $\{x_0, \dots, x_{n-1}\}$
is a list of $\ell - 1$ split values $\sigma_1, \dots, \sigma_{\ell-1}$ such that we have $\ell$ ordered, contiguous blocks
$\{x_0, \dots, x_{\sigma_1}\}, \{x_{\sigma_1+1}, \dots, x_{\sigma_2}\}, \dots, \{x_{\sigma_{\ell-1}+1}, \dots, x_{n-1}\}$. For notational convenience, we
let $\sigma_0 = -1$ and $\sigma_\ell = n - 1$. Let $\sigma_q = \max\{j \in [n] \mid y_j \le qd\}$ for $q \in \{1, \dots, \ell-1\}$.
Using these split values each reducer$_q$ can provide all value assignments needed for the
computation of all subprograms in the next step without violating the memory limitations.
4. We compute $\alpha_q$ for $q \in [\ell]$ by using w-PBP, the input assignment α, and the split values.
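The quantities from the first three steps can be computed sequentially as in the sketch below, which is meant only to make the definitions concrete and says nothing about how the lemmas distribute the work over reducers. The function name `split_values` and the input encoding are hypothetical. The sketch also assumes that d divides the program length and returns −1 when no index satisfies the condition for a split value, a case the text does not address.

```python
from itertools import accumulate

def split_values(var_of_line, n, d):
    """Return (#S, prefix sums y, split values sigma_1..sigma_{l-1}).

    `var_of_line[p]` is the index of the variable read in line p; lines are cut
    into blocks of length d (assuming d divides the program length).
    """
    num_blocks = len(var_of_line) // d
    # Step 1: #S(x_i) = number of blocks in which x_i appears.
    count = [0] * n
    for q in range(num_blocks):
        for i in set(var_of_line[q * d:(q + 1) * d]):
            count[i] += 1
    # Step 2: y_i = #S(x_0) + ... + #S(x_i).
    y = list(accumulate(count))
    # Step 3: sigma_q = max{ j in [n] : y_j <= q*d } for q = 1, ..., l-1
    # (assumption: -1 if no such j exists).
    sigma = [max((j for j in range(n) if y[j] <= q * d), default=-1)
             for q in range(1, num_blocks)]
    return count, y, sigma

# Hypothetical program with 8 lines over 4 variables, cut into 4 blocks of length 2.
count, y, sigma = split_values([0, 1, 1, 2, 0, 3, 3, 3], n=4, d=2)
```

In this example the variables appear in 2, 2, 1, and 2 blocks, respectively, the prefix sums are 2, 4, 5, 7, and the split values are 0, 1, 2.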
Lemma 5 (Proof in Appendix A [9]). Calculating $\#S(x_i)$ is in $\mathrm{DMRC}^0$. That is, for each
i ∈ [n], $\#S(x_i)$ is computable from w-PBP in a constant number of DMRC-rounds.
Lemma 6 (Proof in Appendix A [9]). Computing the prefix-sums of $\#S(x_i)$ is in $\mathrm{DMRC}^0$.
Lemma 7 (Proof in Appendix A [9]). Each of the split values $\sigma_1, \dots, \sigma_{\ell-1}$ can be computed
in one reducer with the required prefix-sums being made available in one more DMRC-round.
Lemma 8 (Proof in Appendix A [9]). Given w-PBP, α, and the split values $\sigma_0, \dots, \sigma_\ell$, we
can, for each $q \in [\ell]$, compute $\alpha_q$ in a constant number of DMRC-rounds.
We finally obtain the desired inclusion by applying Theorem 3 and Lemmas 4 through 8.
Theorem 9. We have $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$.
3.3 Simulating $\mathrm{NC}^i$ For All $i \ge 2$
For the higher levels in the hierarchy of Nick’s class, we show how to simulate the involved
circuits directly. We begin with a short outline of the proof.
Let $C_n = (V_n, E_n)$ be an $\mathrm{NC}^{i+1}$ circuit with an input of size n, given as a set of nodes
and a set of directed edges, together with an input assignment α. The total size of $C_n$ in
bits is $N_O$, the total size of the input assignment in bits is $N_I$, and $N = N_O + N_I$. Note that
$\mathrm{size}(C_n)$ is polynomial in n and $\mathrm{depth}(C_n) \in O(\log^{i+1} n)$. We will take the following steps to
simulate the circuit $C_n$ with deterministic MapReduce computations:
1. We compute the level of each node in Cn.
2. The nodes and edges are sorted by their level.
3. Both the circuit Cn and the input assignment α are divided equally among the reducers.
4. We split the circuit into subcircuits computable in a constant number of rounds.
5. A custom communication scheme collects and constructs the complete subcircuits.
6. The entire circuit is evaluated via evaluation of the subcircuits.
Note that equal division of $C_n$ in the third step is very different from the split in the
fourth one, where the parts may differ radically in size. Great care must be taken so as
not to violate any of the memory and time restrictions, necessitating the two unlike partitions.
The subsequent steps then need to mediate between these dissimilar divisions. We will
show that the steps (1) to (6) can be computed in $O(\log n)$, $O(1)$, $O(1)$, $O(1)$, $O(\log n)$, and
$O(\mathrm{depth}(C_n)/\log n)$ rounds, respectively, yielding the desired theorem.
Theorem 10. We have $\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^i$ for all i ∈ N+ and all 0 < ε < 1/2.
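The round counts of the six steps add up as follows. The sketch uses that a circuit for a problem in $\mathrm{NC}^{i+1}$ has depth $O(\log^{i+1} n)$ by the definition in Section 2.1, and it assumes that $\log n$ and $\log N$ agree up to a constant factor, which is plausible since the circuit size is polynomial in n.

```latex
\begin{align*}
  R &\in O(\log n) + O(1) + O(1) + O(1) + O(\log n)
       + O\!\left(\frac{\operatorname{depth}(C_n)}{\log n}\right) \\
    &\subseteq O(\log n) + O\!\left(\frac{\log^{i+1} n}{\log n}\right)
     = O(\log^{i} n) \qquad (i \ge 1).
\end{align*}
```

The last equality needs $i \ge 1$, since otherwise the $O(\log n)$ rounds spent on the first and fifth steps would dominate.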
3.3.1 Computing The Levels
We begin by showing how to compute the level of each node in the circuit in $O(\log n)$
DMRC-rounds by simulating a CRCW-PRAM algorithm. (We mention in passing that this
step requires more than a constant number of rounds, which prevents us from obtaining the
result for $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$ by simulating the circuits directly; the separate approach from
Subsection 3.2 via Barrington’s theorem is thus required for this case.)
In [18], an algorithm is presented that computes the levels of all nodes in a directed acyclic
graph and can be run on a CREW-PRAM with $O(n + m)$ processors in $O(\log^2 m)$
time, where n and m are the numbers of nodes and edges in the graph, respectively. The
first stage of this algorithm relies partly on the computation of prefix-sums, which can be
computed much more efficiently when switching to a CRCW-PRAM, as we will show below.
A straightforward adaptation of the analysis in [18], taking into account the maximum
in-degree and out-degree and separating out the computation of prefix-sums, yields the
following result.
Lemma 11. Let G = (V, E) be a directed acyclic graph with n nodes, m edges, maximum
in-degree $d_{\mathrm{in}}$, and maximum out-degree $d_{\mathrm{out}}$. The level of each node in G can then be
computed on a CRCW-PRAM with $P \in O(m + P_{\text{P-Sum}}(O(m)))$ processors in time
$T \in O((\log m) \cdot (T_{\text{P-Sum}}(O(m)) + \log \max\{d_{\mathrm{in}}, d_{\mathrm{out}}\}))$, where $P_{\text{P-Sum}}(q)$ and $T_{\text{P-Sum}}(q)$ denote,
respectively, the number of processors and the computation time to compute the prefix-sums
of q numbers on a CRCW-PRAM.
In the following lemma, we aim to lower the time and memory requirements for computing
prefix-sums on a CRCW-PRAM as far as possible.
▶ Lemma 12 (Proof in Appendix A [9]). The prefix-sums of q numbers can be computed on a
CRCW-PRAM with P ∈ O(q log q) processors and memory M ∈ O(q) in constant time.
We plug the result of Lemma 12 into Lemma 11 and then apply it to the graph Cn.
Since its in-degrees and out-degrees are bounded by a constant ∆, we have m ≤ ∆n/2 ∈ O(n).
Hence we can compute the levels of the nodes of Cn on a CRCW-PRAM with P ∈ O(N log N)
processors in time T ∈ O(log n). By Corollary 2, we obtain the following result.
▶ Lemma 13 (Proof in Appendix A [9]). Computing the levels of all nodes in Cn is in
DMRC^1.
3.3.2 Sorting By Levels
Once the levels of all nodes are computed, each node in the circuit can be represented as
(level(xi), xi). Recall that the depth of Cn is just the maximum level. Since depth(Cn) ∈
O(log^k n) for some k ∈ N+ and the number of nodes is bounded by the number of edges,
which is size(Cn) ∈ O(N), we can encode each pair (level(xi), xi) by appending to a bit
string of length log(c1 log^k n) another one of length log(c2 N), for appropriate constants c1
and c2, which results in a bit string of length log(cN log^k n) for c = c1c2 ∈ N. This enables
us to identify each pair (level(xi), xi) with a different bit string, which can be interpreted as an
integer bounded by cN log^k n. We call this integer the sorting index of node xi. Crucially, we
chose the bit string to start with the encoding of the level. Sorting the sorting indices thus
means to sort the nodes of Cn by their level. The following lemma shows how prefix-sums
can be used to perform such a sort so efficiently on a CRCW-PRAM that we can apply
Corollary 2 to simulate it in a constant number of DMRC-rounds.
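The encoding of the sorting index can be illustrated by a minimal Python sketch. The function name sorting_index, the parameter id_bits, and the sample nodes are hypothetical and serve only as an illustration; the sketch assumes that every node identifier fits into id_bits bits.

```python
def sorting_index(level: int, node_id: int, id_bits: int) -> int:
    """Concatenate the level bits (high part) with the node-id bits (low part)."""
    assert 0 <= node_id < (1 << id_bits)
    return (level << id_bits) | node_id

# Hypothetical (level, node id) pairs; three bits suffice for the ids below.
nodes = [(2, 5), (0, 7), (1, 3), (0, 1)]
id_bits = 3

# Sorting the integers sorts the nodes by level first, because the level
# occupies the most significant bits of the sorting index.
indices = sorted(sorting_index(lvl, v, id_bits) for lvl, v in nodes)
decoded = [(idx >> id_bits, idx & ((1 << id_bits) - 1)) for idx in indices]
```

Since the level is placed in the leading bits, an ordinary integer sort of the indices yields the nodes in ascending order of level, with ties broken by the node identifier.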
▶ Lemma 14 (Proof in Appendix A [9]). A CRCW-PRAM with P ∈ O(D log D) processors
and memory M ∈ O(D) can sort any subset I ⊆ {1, . . . , D} of integers in constant time.
Combining Lemma 14 and Corollary 2, we obtain, by a careful analysis using ε ≠ 1/2, the
promised result.
▶ Corollary 15 (Proof in Appendix A [9]). Let c ∈ N and 0 < ε < 1/2. Any set of distinct
integers from {1, . . . , ⌈cN log^k n⌉} can be sorted in a constant number of DMRC-rounds.
Once all the nodes are sorted by their sorting index (and therefore implicitly by their
level), we can enumerate them in ascending order using the sorting index j; that is, we
represent each node as the key-value pair 〈j; (level(v), v)〉. Clearly, we obtain an analogous
representation of the edges of the form 〈i; ((j, level(v), v), (j′, level(v′), v′))〉, which will
prove useful later on.
3.3.3 Division of Circuit And Assignment Among Reducers
As we have already seen when discussing the branching programs, an assignment α to
input variables X = {x0, x1, . . . , xn−1} can be represented as a set {〈i; (xi, vi)〉 | i ∈ [n]} of
key-value pairs, where α(xi) = vi ∈ {0, 1}.
The circuit Cn is now divided into ℓ = N_O^ε subsets of edges according to the sorting
indices, and input values are assigned to each subset as in the case of branching programs.
For every q ∈ [ℓ], let C_n^q = {((j, level(v), v), (j′, level(v′), v′)) | qd ≤ j ≤ (q + 1)d − 1},
where d = N_O^{1−ε}, be the qth subset. Note that |C_n^q| ∈ O(d). For every q ∈ [ℓ], the set of
variables appearing in C_n^q is denoted as X_q and the assignment α_q to X_q is represented as
{〈j; x_{q,j}, v_{q,j}〉 | j ∈ [|α_q|]}, where x_{q,j} is the jth variable that appears as an input in C_n^q, and
v_{q,j} is its assignment value. Just as seen in Lemma 8 for the case of a branching program,
we can now, for all q ∈ [ℓ], compute α_q from Cn and α, yielding the following lemma.
▶ Lemma 16. Computing α_q from Cn and α is in DMRC^0 for every q ∈ [ℓ].
We can therefore assume that each input node is represented by 〈j; (level(x_{j_i}), x_{j_i}, v_{j_i})〉,
a key-value pair that is computed from C_n^q and α_q for q ∈ [ℓ] in a single DMRC-round.
3.3.4 Division Into Subcircuits By Levels
We divide Cn = (Vn, En) into as few subcircuits as possible such that the simulation of each
subcircuit is in DMRC0 and we can evaluate Cn by evaluating the subcircuits sequentially.
Given v ∈ Vn and δ ∈ N, we define the v-down-circuit C^down_δ(v) = (V^down_δ(v), E^down_δ(v))
of depth δ to be the subcircuit of Cn induced by V^down_δ(v) = {u | level(v) ≤ level(u) ≤
level(v) + δ, u →* v}, where u →* v means that there is a directed path of any length (including
0) from u to v in Cn. The v-up-circuit C^up_δ(v) = (V^up_δ(v), E^up_δ(v)) of depth δ is analogously
the subcircuit of Cn induced by V^up_δ(v) = {u | level(v) − δ ≤ level(u) ≤ level(v), v →* u}.
When dividing Cn into subcircuits we have two conflicting goals. On the one hand, we
want as few of them as possible, which implies that they have to be of great depth. On the
other hand, we need to simulate them in MapReduce without exceeding the memory bounds.
A depth in Θ(log n) turns out to be the right choice. Let s = (γ log n)/log ∆, where ∆ ≥ 2
is a constant bounding the maximum degree of Cn and γ is an arbitrary constant satisfying
0 < γ < 1 − 2ε. (Note that such a γ exists exactly if ε < 1/2.) Since a tree of depth s and
maximum degree bounded by a constant ∆ contains at most ∑_{i=1}^{s} ∆^i edges, its size is in
O(∆^s) = O(n^γ) ⊆ O(N^γ). Hence each reducer may contain up to N^{1−ε}/N^γ such subcircuits
without exceeding the memory constraint of O(N^{1−ε}); see Figure 2 in Appendix B [9]. We
denote this number of allowed subcircuits per reducer by β = N^{1−ε−γ}.
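The size estimate and the choice of β can be spelled out as follows. This is only a worked restatement of the bounds above, ignoring the rounding of s to an integer.

```latex
\begin{align*}
\Delta^{s} &= \Delta^{\gamma \log n / \log \Delta}
            = \bigl(\Delta^{\log_\Delta n}\bigr)^{\gamma} = n^{\gamma},\\
\sum_{i=1}^{s} \Delta^{i} &= \frac{\Delta^{s+1} - \Delta}{\Delta - 1}
            \le \frac{\Delta}{\Delta - 1}\,\Delta^{s}
            \le 2\,\Delta^{s} = 2\,n^{\gamma} \qquad (\Delta \ge 2),\\
\beta \cdot N^{\gamma} &= N^{1-\varepsilon-\gamma} \cdot N^{\gamma} = N^{1-\varepsilon}.
\end{align*}
```

The last line shows that β subcircuits of size O(N^γ) each exactly exhaust the O(N^{1−ε}) memory available to a reducer.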
For each i ∈ [⌈depth(Cn)/s⌉ + 1], we define Li = i · s. For every node v on level Li –
that is, with level(v) = Li – we call the v-down-circuit (v-up-circuit, resp.) of depth s an
Li-down-circuit (Li-up-circuit, resp.). We will construct in each reducer the v-down-circuits
and v-up-circuits of depth 1 of all its nodes. From those we then construct all Li-down-circuits
and Li-up-circuits for every i. Note that we can evaluate all Li-down-circuits if the values of
the nodes of level Li+1 are given. The values of the nodes v of level Li+1 that are necessary
to compute the Li-up-circuits are then known from the Li+1-down-circuits.
When the circuit Cn is divided into Li-down-circuits, there may exist edges of Cn that are
not contained in any Li-down-circuit. If an edge ((j_u, level(u), u), (j_v, level(v), v)) satisfies
L_{i_u} ≤ level(u) ≤ L_{i_u+1} and L_{i_v} ≤ level(v) ≤ L_{i_v+1} for i_u ≠ i_v, then this edge is not included
in any L_{i_u}-down-circuit nor any L_{i_v}-down-circuit. We call such edges level-jumping edges;
see Figure 3 in Appendix B [9] for an example. We would like to replace every level-jumping
edge (u, v) by a path from u to v that consists only of edges that will be part of the respective
Li-down-circuits and Li-up-circuits in the resulting, augmented circuit. The following lemma
states that this is possible without increasing the size by too much.
▶ Lemma 17 (Proof in Appendix A [9]). We can subdivide the jumping edges in Cn in a way
that renders the subcircuit-wise evaluation possible without increasing the size beyond O(N).
3.3.5 Construction of Subcircuits in Reducers
Having described the subcircuits on which the evaluation of the entire circuit will be based,
we now need to show how to split and construct them in the ℓ different reducers. In each
reducer, we start with the nodes v contained in it that satisfy level(v) = Li for any i and
the associated v-down-circuits and v-up-circuits of depth 1. We then iteratively increase
the depth one by one, until the full Li-down-circuits and Li-up-circuits of depth up to s
are constructed. Note that the nodes of any level Li and their corresponding circuits may
be scattered across multiple reducers since edges were split equally among them according
to their sorting index and not depending on the level. We therefore need to carefully
implement a communication scheme that allows each reducer to encode requests for missing
edges required in the construction, which are then delivered to them in multiple rounds,
without exceeding any of the memory or time bounds. Taking care of all these details, we
obtain the following lemma.
▶ Lemma 18 (Proof in Appendix A [9]). Given Cn, all Li-down-circuits and Li-up-circuits
can be constructed in O(log n) DMRC-rounds whenever 0 < ε < 1/2.
3.3.6 Evaluation Via Subcircuits
The main idea in the proof of the following lemma is to compute the evaluation values
subcircuit-wise, starting with the deepest ones, and then iteratively moving up the circuit in
depth(Cn)/s rounds, passing on the newly computed values to the right reducers, until the
value of the unique output node is known.
▶ Lemma 19. If all up-circuits and down-circuits are constructed in the proper reducers, Cn
can be evaluated in O(depth(Cn)/log n) DMRC-rounds.
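The round structure behind this evaluation can be sketched sequentially in Python. The sketch is a simplification under stated assumptions: it ignores the distribution of subcircuits among reducers, it assumes that the output has level 0 and that every child of a gate lies on a strictly greater level, and the names evaluate_in_bands and OPS as well as the gate set are hypothetical.

```python
from math import ceil

# Hypothetical gate semantics for this sketch only.
OPS = {
    "and": lambda vals: all(vals),
    "or": lambda vals: any(vals),
    "not": lambda vals: not vals[0],
}

def evaluate_in_bands(gates, level, inputs, s):
    """Evaluate a circuit band by band, deepest band first.

    gates:  node -> (operation, list of child nodes feeding the gate)
    level:  node -> level; the output has level 0 and every child has a
            strictly greater level than its parent (assumption)
    inputs: input node -> Boolean value
    s:      number of levels handled per round
    """
    values = dict(inputs)
    depth = max(level.values())
    rounds = 0
    for band in range(ceil(depth / s) - 1, -1, -1):
        lo, hi = band * s, (band + 1) * s
        # Within a band, process deeper gates first so that children are ready.
        band_gates = sorted(
            (v for v in gates if lo <= level[v] < hi),
            key=lambda v: level[v],
            reverse=True,
        )
        for v in band_gates:
            op, children = gates[v]
            values[v] = OPS[op]([values[c] for c in children])
        rounds += 1
    return values, rounds

gates = {"out": ("or", ["g1", "g2"]), "g1": ("and", ["x0", "x1"]), "g2": ("not", ["x2"])}
level = {"out": 0, "g1": 1, "g2": 1, "x0": 2, "x1": 2, "x2": 2}
assignment = {"x0": True, "x1": False, "x2": False}
values, rounds = evaluate_in_bands(gates, level, assignment, s=1)
```

The number of iterations of the outer loop is ⌈depth/s⌉, which corresponds to the O(depth(Cn)/log n) bound when s ∈ Θ(log n).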
4 Conclusion and Research Opportunities
In a substantial improvement over all previously known results, we have shown that NC^{i+1} ⊆
DMRC^i for all i ∈ N. In the case of NC^1 ⊆ DMRC^0, we have proved this result for every
feasible choice of ε in the model, that is, for 0 < ε ≤ 1/2. For i > 0, we have shown the
result to hold for all but one value, namely ε = 1/2.
Achieving these two results required a detailed description of two different, delicate
simulations within the MapReduce framework. For the case of NC1, which is particularly
relevant in practice, we applied Barrington’s theorem and simulated width-bounded branching
programs [2], whereas we directly simulated the circuits for the higher levels of the hierarchy.
We emphasize that neither of the two approaches can replace the other: Barrington’s theorem
only gives a characterization for the first level of the NC hierarchy and the second approach
does not even yield NC^1 ⊆ MRC^0. (Recall that DMRC is just the deterministic variant of
MRC, so we have DMRC^i ⊆ MRC^i for all i ∈ N.)
We would like to briefly address the small question that immediately arises from our
result, namely whether it is possible to extend the inclusion NC^{i+1} ⊆ DMRC^i of Theorem 10
to the case ε = 1/2. Going through all involved lemmas, we see that the two reasons that
our proof does not work in this corner case are the sorting of the nodes using Corollary 15 and
the construction of the up-circuits and down-circuits in Lemma 18. Regarding the former,
we can avoid the restriction by allowing randomization. For the latter, it is not clear that
this can be achieved, however. If there were any way to construct the levels for ε = 1/2 as
well, then Theorem 10 would immediately extend to the full range 0 < ε ≤ 1/2 of feasible
choices for ε.
Besides dealing with the small issue mentioned above, the natural next step for future
research is to take the complementary approach and address the reverse relationship: Having
shown in this paper how to obtain efficient deterministic MapReduce algorithms for
parallelizable problems, we now aim to include DMRC^i into NC^{i+1} for all i ∈ N, thus finally
settling the long-standing open question of how exactly the MapReduce classes correspond
to the classical classes of parallel computation.
References
1 S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University
Press, 2009.
2 D.A. Barrington. Bounded-Width Polynomial-Size Branching Programs Recognize Exactly
Those Languages in NC1. J. of Computer and System Sciences, 38:150–164, 1989.
3 C.-T. Chu, S.K. Kim, Y.-A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun.
Map-Reduce for machine learning on multicore. In Advances in Neural Information Processing
Systems (NIPS), pages 281–288, 2006.
4 T.H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
5 J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun.
ACM, 51(1):107–113, 2008.
6 A.K. Farahat, A. Elgohary, A. Ghodsi, and M. S. Kamel. Distributed Column Subset Selection
on MapReduce. In International Conference on Data Mining (ICDM), pages 171–180, 2013.
7 J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On Distributing
Symmetric Streaming Computations. ACM Trans. on Algorithms, 6(4):66:1–66:15, 2010.
8 B. Fish, J. Kun, Á.D. Lelkes, L. Reyzin, and G. Turán. On the Computational Complexity of
MapReduce. In International Symposium on Distributed Computing (DISC), pages 1–15, 2015.
9 F. Frei and K. Wada. Efficient Circuit Simulation in MapReduce. Technical Report arXiv.org,
cs(arXiv:1907.01624):1–20, 2019. arXiv:1907.01624.
10 M. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, Searching, and Simulation in the MapReduce
Framework. In 22nd Int. Symp. on Algorithms and Computation (ISAAC), pages 374–383,
2011.
11 S. Kamara and M. Raykova. Parallel Homomorphic Encryption. In Financial Cryptography
Workshops, pages 213–225, 2013.
12 H. Karloff, S. Suri, and S. Vassilvitskii. A Model of Computation for MapReduce. In 21st
ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 938–948, 2010.
13 R. Kumar, B. Moseley, and S. Vassilvitskii. Fast Greedy Algorithms in MapReduce and
Streaming. In ACM Symp. on Parallelism in Algorithms and Architectures (SPAA), pages
1–10, 2013.
14 M.F. Pace. BSP vs MapReduce. In 12th Int. Conf. on Computational Science (ICCS), pages
246–255, 2012.
15 A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-Round Tradeoffs for
MapReduce Computations. In 26th ACM Int. Conf. on Supercomputing (ICS), pages 235–244,
2012.
16 T. Roughgarden, S. Vassilvitskii, and J. R. Wang. Shuffles and Circuits (On Lower Bounds for
Modern Parallel Computation). Journal of the ACM (JACM), 65(6):41:1–66:24, 2018.
17 A.D. Sarma, F.N. Afrati, S. Salihoglu, and J.D. Ullman. Upper and Lower Bounds on the
Cost of a Map-Reduce Computation. In Proceedings of the VLDB Endowment (PVLDB),
pages 277–288, 2013.
18 A. Tada, M. Migita, and R. Nakamura. Parallel Topological Sorting Algorithm. J. of the
Information Processing Society of Japan (IPSJ), 45(4):1102–1111, 2004.
19 T. White. Hadoop: The Definitive Guide, 4th edition. O’Reilly, 2015.
A Deferred Proofs
In this appendix, we provide all proofs that had to be deferred due to the space constraints.
For the reader’s convenience, we reprint all statements.
▶ Lemma 4 (Reprint of Lemma 4 on page 7). Let L be a w-PBP-recognized language. If the
representations of the w-PBP and, for every q ∈ [ℓ], α_q are given, then we can decide in a
2-round DMRC-computation whether α ∈ L or not.
Proof. As already described above, let w-PBP be represented by the set {〈p; (x_{i_p}, f_p, g_p)〉 |
p ∈ [t]} and, for every q ∈ [ℓ], the assignment α_q by {〈q; (p_{q,j}, x_{q,j}, v_{q,j})〉 | j ∈ [|X_q|]}. Note
that there are ℓ subprograms of length at most d and ℓ partial assignments that each assign
values to at most one variable per line of the corresponding partial program. The total size
of the input is thus in O(dℓ) ⊆ O(N_O) ⊆ O(N).
We define the first map function µ1 by
µ1(〈p; (x_{i_p}, f_p, g_p)〉) = {〈p ÷ d; (p, x_{i_p}, f_p, g_p)〉} for each p ∈ [t] and
µ1(〈q; (p_{q,j}, x_{q,j}, v_{q,j})〉) = {〈p_{q,j} ÷ d; (p_{q,j}, x_{q,j}, v_{q,j})〉} for each q ∈ [ℓ], j ∈ [k + 1].
For any q ∈ [ℓ], there is one subprogram w-PBP_q and an associated assignment set α_q. We use
the map function µ1 to find the value assignment for each variable appearing in w-PBP_q and
store it in a key-value pair. This pair has the key q and is thereby designated to be processed
by reducer_q, which can calculate ρ1, having all pairs with key q available. This function
simulates, for each permutation π of [w], the subprogram w-PBP_q on this permutation with
the received assignment and stores the resulting permutation π′. This yields a table T_q of size
w! ∈ O(1), describing the action of w-PBP_q for the given assignment on all w! permutations.
(We mention in passing that for the first reducer_0 it would be sufficient to compute and store
only the permutation that results from applying w-PBP_0 on the given assignment to the
identity as the initial permutation, thus saving the time and memory necessary for the rest
of the first table.) The output of ρ1 on the qth reducer is 〈q; T_q〉.
The map function µ2 of the second round is simple: it maps 〈q; T_q〉 to 〈0; (q, T_q)〉, thus
delivering all pairs (i, T_i) to a single instance of the reduce function ρ2. This first reducer
therefore has all tables T_0, . . . , T_{ℓ−1} at its disposal and knows which one is which. Using T_q
as a look-up table for the permutation performed by w-PBP_q, reducer_0 can now compute,
starting from the identity permutation id, the permutation π = T_{ℓ−1} ◦ · · · ◦ T_2 ◦ T_1 ◦ T_0(id),
and the input is accepted if and only if π ∈ F_n, where F_n is the set of accepted permutations
that is given to us alongside the program w-PBP. ◀
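The two rounds of this proof can be mimicked by a sequential Python sketch. It is a simplification under assumptions: all names (compose, build_table, evaluate) are hypothetical, a single global assignment replaces the partial assignments α_q, and the convention that f_p is applied when the variable is 1 and g_p otherwise is assumed, since it is fixed elsewhere in the paper.

```python
from itertools import permutations

def compose(step, current):
    """Apply `current` first and `step` afterwards (tuples over range(w))."""
    return tuple(step[current[i]] for i in range(len(current)))

def build_table(subprogram, assignment, w):
    """Round 1 in one reducer: map every start permutation to its result."""
    table = {}
    for start in permutations(range(w)):
        current = start
        for var, f, g in subprogram:
            # Assumed convention: f is applied if the variable is 1, g otherwise.
            current = compose(f if assignment[var] == 1 else g, current)
        table[start] = current
    return table

def evaluate(program, assignment, w, d):
    """Split the program into pieces of length d, build one table per piece,
    then chain the look-up tables starting from the identity (round 2)."""
    pieces = [program[i:i + d] for i in range(0, len(program), d)]
    tables = [build_table(piece, assignment, w) for piece in pieces]
    pi = tuple(range(w))
    for table in tables:
        pi = table[pi]
    return pi

# Hypothetical program of width 3: lines are (variable index, f, g).
e, a, c = (0, 1, 2), (1, 0, 2), (1, 2, 0)
program = [(0, a, e), (1, c, e), (0, e, c), (2, a, c), (1, c, a)]
assignment = {0: 1, 1: 0, 2: 1}
result = evaluate(program, assignment, w=3, d=2)
```

Chaining the tables gives the same permutation as running the whole program at once, because each table records the effect of its piece on every possible start permutation.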
▶ Lemma 5 (Reprint of Lemma 5 on page 8). Calculating #S(xi) is in DMRC^0. That is,
for each i ∈ [n], #S(xi) is computable from w-PBP in a constant number of DMRC-rounds.
Proof. For each q ∈ [ℓ], the subprogram w-PBP_q is stored in reducer_q. The output of
reducer_q – which will be the input to compute #S(xi) – is 〈q; (q, 1)〉, . . . , 〈q; (q, k_q)〉, with
the variables x_{q,1}, . . . , x_{q,k_q} appearing in the subprogram w-PBP_q and k_q ∈ O(d). The total
number of inputs used to compute #S(xi) is therefore at most dℓ ∈ O(N). We use a
Sum-CRCW-PRAM, whose concurrent writes to a single memory register are resolved by
summing up all values being written to the same register simultaneously; see [10]. We use
at most dℓ processors, P_{q,1}, . . . , P_{q,k_q} for each q ∈ [ℓ], and registers R_0, . . . , R_{n−1} and let all
processors P_{q,j} add 1 to R_j concurrently. Thus we see that computing #S(xi) is possible in
constant time on a Sum-CRCW-PRAM and therefore, by Corollary 2, in DMRC^0. ◀
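The effect of the summing concurrent write can be emulated sequentially, as in the following Python sketch. The function name and the sample data are hypothetical, and the sketch assumes that #S(xi) counts the subprograms in which the variable xi appears.

```python
def count_subprogram_occurrences(subprogram_vars, n):
    """Emulate one Sum-CRCW write step: every (subprogram, variable) pair acts
    as a processor writing 1 into the register of its variable, and writes
    to the same register are summed up."""
    registers = [0] * n
    writes = [(i, 1) for vars_q in subprogram_vars for i in sorted(vars_q)]
    for register, value in writes:
        registers[register] += value
    return registers

# Hypothetical example: three subprograms over the variables x0, ..., x3.
subprogram_vars = [{0, 1, 2}, {2, 3}, {0, 2}]
counts = count_subprogram_occurrences(subprogram_vars, n=4)
```

On the PRAM all writes happen in a single time step; the loop here only serializes what the summing write rule does at once.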
▶ Lemma 6 (Reprint of Lemma 6 on page 8). Computing the prefix-sums of #S(xi) is in
DMRC^0.
Proof. The input is given as 〈i; (#S(xi), i)〉 for i ∈ [n]. We compute the prefix-sums yi of
#S(xi) for all i ∈ [n] in three rounds that can be summarized as follows:
1. Each reducer_q, for q ∈ [ℓ], determines its local prefix-sums; that is, it computes the d
prefix-sums y^local_{dq}, . . . , y^local_{d(q+1)−1} of the d numbers #S(x_{dq}), . . . , #S(x_{d(q+1)−1}).
2. A single reducer computes the prefix-sums z_0, z_1, . . . , z_{ℓ−1} of y^local_{d−1}, y^local_{2d−1}, . . . , y^local_{ℓd−1}, which
are known from the first round. For every q ∈ [ℓ − 1], we send z_q to reducer_{q+1}.
3. Each reducer_{q+1} with q ∈ [ℓ − 1] computes y_{d(q+1)+j} = y^local_{d(q+1)+j} + z_q for each j ∈ [d].
We now describe the three rounds in more detail at the level of the key-value pairs.
1. By defining the map function µ1(〈i; (#S(xi), i)〉) = 〈i ÷ d; (#S(xi), i)〉, each reducer_q, for
q ∈ [ℓ], receives #S(x_{dq}), . . . , #S(x_{d(q+1)−1}) together with the correct indices. Thus we
can compute in reducer_q all local prefix-sums y^local_{dq}, . . . , y^local_{d(q+1)−1} of these numbers. The
output of reducer_q consists of the local prefix-sums in the format 〈q; (p-sum, q, j, y^local_{q,j})〉
for j ∈ [d] and the last of each group of local prefix-sums in the format 〈q; (last, y^local_{d(q+1)−1})〉,
where p-sum = 0 and last = 1 are simple binary identifiers.
2. By defining the map function µ2(〈q; (last, y^local_{d(q+1)−1})〉) = 〈0; (last, y^local_{d(q+1)−1})〉, all last
parts of the local prefix-sums can be gathered in reducer_0. Thus, the prefix-sums
z_0, z_1, . . . , z_{ℓ−1} of y^local_{d−1}, . . . , y^local_{dℓ−1} can be computed in it and the output of the reducer is
〈0; (last, i + 1, z_i)〉 for every i ∈ [ℓ − 1]. All other key-value pairs – that is, those of the
form 〈q; (p-sum, q, j, y^local_{q,j})〉 – are passed on unaltered.
3. The input of the third round consists of the output pairs 〈q; (p-sum, q, j, y^local_{q,j})〉 for all
j ∈ [d] and q ∈ [ℓ] passed on from the first round and the pairs 〈0; (last, q + 1, z_q)〉 for all q ∈
[ℓ − 1] from the second round. Defining the map function as µ3(〈q; (p-sum, q, j, y^local_{q,j})〉) =
〈q; (p-sum, q, j, y^local_{q,j})〉 and µ3(〈0; (last, q + 1, z_q)〉) = 〈q + 1; (last, q + 1, z_q)〉, we can, for
each j ∈ [d] and each q ∈ {1, . . . , ℓ − 1}, compute y_{q,j} = y^local_{q,j} + z_{q−1} in reducer_q.
The memory limitations of the mappers and reducers are clearly respected. ◀
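The three rounds can be summarized by the following sequential Python sketch, in which each list comprehension or loop stands for the work of the reducers in one round. The function name three_round_prefix_sums and the sample input are hypothetical.

```python
from itertools import accumulate

def three_round_prefix_sums(values, d):
    """Blockwise prefix-sums following the three rounds described above."""
    blocks = [values[i:i + d] for i in range(0, len(values), d)]
    # Round 1: every reducer computes the local prefix-sums of its block.
    local = [list(accumulate(block)) for block in blocks]
    # Round 2: one reducer computes the prefix-sums z of the block totals.
    z = list(accumulate(loc[-1] for loc in local))
    # Round 3: reducer q adds z[q - 1] to each of its local prefix-sums.
    result = []
    for q, loc in enumerate(local):
        offset = z[q - 1] if q > 0 else 0
        result.extend(y + offset for y in loc)
    return result

values = [2, 1, 3, 2, 1, 3, 4]
prefix = three_round_prefix_sums(values, d=3)
```

Only the ℓ block totals travel to the single reducer of the second round, which is why the memory bound of each reducer is respected.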
▶ Lemma 7 (Reprint of Lemma 7 on page 8). Each of the split values σ_1, . . . , σ_{ℓ−1} can be
computed in one reducer with the required prefix-sums being made available in one more
DMRC-round.
Proof. If there is a k ∈ [ℓ − 1] such that y_{n−1} ≤ kd, then it is clear from the definition
σ_q = max{j ∈ [n] | y_j ≤ qd} of the split values that σ_k = σ_{k+1} = . . . = σ_{ℓ−1}. We can
therefore assume that y_{n−1} > (ℓ − 1)d and characterize, for each q ∈ {1, . . . , ℓ − 1}, the split
value σ_q as the unique integer satisfying (q − 1)d < y_{σ_q} ≤ qd and qd < y_{σ_q+1}; see Figure 1 in
Appendix B [9].
This characterization is well defined since 0 < #S(xi) ≤ ℓ < d for each i ∈ [n] and
y_{n−1} ≤ dℓ ∈ O(N_O). For each q ∈ [ℓ], in order to determine the split value σ_q, it is therefore
sufficient to have available in the respective reducer a sequence of consecutive prefix-sums
such that the first one is at most qd and the last one is greater than qd. This condition is
satisfied if reducer_q has the d + 2 consecutive prefix-sums y_{qd−1}, y_{qd}, . . . , y_{(q+1)d−1}, y_{(q+1)d}
available. (For the first and the last reducer, the d + 1 prefix-sums y_0, . . . , y_{d−1}, y_d and
y_{(ℓ−1)d−1}, y_{(ℓ−1)d}, . . . , y_{ℓd−1}, respectively, will suffice.) Slightly extending the sequence of
available prefix-sums in each reducer by copying the overlapping prefix-sums from another
reducer thus enables us to compute all split values in the ℓ reducers. Since for each q ∈ [ℓ],
there are the d prefix-sums y_{qd}, . . . , y_{(q+1)d−1} in reducer_q, each reducer can have the d + 2
prefix-sums made available after one more round by having each neighboring reducer copy
one more prefix-sum into it. We have σ_0 = −1 and σ_ℓ = n − 1; it is thus immediately verified
that, for every q ∈ [ℓ], the total number of subprograms in which input variables between
x_{σ_q+1} and x_{σ_{q+1}} appear is at most 2d, showing that all the memory restrictions on the
reducers are observed. ◀
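The definition σ_q = max{j ∈ [n] | y_j ≤ qd} translates directly into a binary search on the sorted prefix-sums. The following Python sketch is sequential and only illustrates the definition; the function name split_values and the sample numbers are hypothetical.

```python
from bisect import bisect_right

def split_values(y, d, ell):
    """Return [sigma_0, ..., sigma_ell] for non-decreasing prefix-sums y."""
    sigma = [-1]  # sigma_0 = -1 by convention
    for q in range(1, ell):
        # Largest index j with y[j] <= q * d (or -1 if there is none).
        sigma.append(bisect_right(y, q * d) - 1)
    sigma.append(len(y) - 1)  # sigma_ell = n - 1 by convention
    return sigma

# Hypothetical counts #S(x_i) = 2, 1, 3, 2, 1, 3 with d = 4 and ell = 3.
y = [2, 3, 6, 8, 9, 12]
sigma = split_values(y, d=4, ell=3)
```

In this example the variables are split into the index ranges 0..1, 2..3, and 4..5, and the occurrence counts of each range sum to at most 2d = 8.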
▶ Lemma 8 (Reprint of Lemma 8 on page 8). Given w-PBP, α, and the split values σ_0, . . . , σ_ℓ,
we can, for each q ∈ [ℓ], compute α_q in a constant number of DMRC-rounds.
Proof. We can assume that, for each κ ∈ [ℓ], the reducer_κ has the subprogram w-PBP_κ, the
κth block of input assignments {(x_j, v_j) | κ · d ≤ j ≤ (κ + 1)d − 1}, and the split values
σ_0, . . . , σ_ℓ available. The output of reducer_κ then consists of the following:
1. 〈κ; (q, p, x_{i_p}, f_p, g_p)〉 for each line (p, x_{i_p}, f_p, g_p) in w-PBP_κ, where σ_q + 1 ≤ i_p ≤ σ_{q+1}.
2. 〈κ; (q, x_j, v_j)〉 for each value assignment (x_j, v_j) with σ_q + 1 ≤ j ≤ σ_{q+1}.
For any κ ∈ [ℓ], we need to bound the total number of outputs with key κ from above. From
the definition of the split values we see that this number is in O(d) since it is bounded by the
number of lines, which is at most 2d, plus the number of assignments, which is at most d.
Naturally, the map function µ of the next round is defined by
1. µ(〈κ; (q, p, x_{i_p}, f_p, g_p)〉) = 〈q; (p, x_{i_p}, f_p, g_p)〉 and
2. µ(〈κ; (q, x_j, v_j)〉) = 〈q; (x_j, v_j)〉.
For any q ∈ [ℓ], the assignment α_q can be computed by the subsequent reduce
function using the key-value pairs produced above. For each q ∈ [ℓ], the reducer_q has now
available the lines of w-PBP and the value assignments for the input variables between x_{σ_q+1}
and x_{σ_{q+1}}. It can therefore go through all the program lines and determine, on the one
hand, which value assignments they require and, on the other hand, to which subprogram
they belong. The required assignment information is then sent to the respective reducers by
outputting 〈q; (p ÷ d, p, x_{i_p}, v_{i_p})〉. ◀
▶ Lemma 12 (Reprint of Lemma 12 on page 9). The prefix-sums of q numbers can be computed
on a CRCW-PRAM with P ∈ O(q log q) processors and memory M ∈ O(q) in constant time.
Proof. We use a Sum-CRCW-PRAM, where concurrent writes to the same memory register
are resolved by adding up all simultaneously assigned numbers [10]. Let q numbers
x_0, x_1, . . . , x_{q−1} be given as input. Without loss of generality, we assume q to be a power
of 2 and calculate s_i(j) = ∑_{j·2^i ≤ p < (j+1)·2^i} x_p for all i ∈ [1 + log q] and all j ∈ [q/2^i + 1]; see
Figure 4 in Appendix B [9] for an illustrative example.
Since each of the q/2^i elements in s_i is the sum of 2^i elements, we can – by allocating
q processors for each i ∈ [1 + log q] – compute every s_i(j) in a Sum-CRCW-PRAM with
O(q log q) processors and O(1) time.
We now describe how the prefix-sums y(0), y(1), . . . , y(q − 1) are computed from the s_i(j).
Assume first that j + 1 is a power of 2, that is, j + 1 = 2^p. Then we have y(j) = s_p(0),
so the value has already been computed. If j + 1 = 2^p + 1 for some p, then we have
y(j) = s_p(0) + s_0(2^p), so we need to add two summands. In general, y(j) can be calculated
as the sum of at most log q known summands.
Let a^j_{log q} a^j_{(log q)−1} . . . a^j_0 be the binary representation of j + 1. Now, we can see that
y(j) = s_{log q}(0) · a^j_{log q}
+ s_{(log q)−1}((j + 1 − 2^{(log q)−1}) ÷ 2^{(log q)−1}) · a^j_{(log q)−1}
+ . . .
+ s_1((j + 1 − 2^1) ÷ 2^1) · a^j_1
+ s_0((j + 1 − 2^0) ÷ 2^0) · a^j_0;
that is, y(j) can be computed as the sum of all s_p((j + 1 − 2^p) ÷ 2^p) such that a^j_p = 1. Thus,
it is sufficient to supply a maximum of log q processors for the calculation of each y(j)
in a second time step, and the prefix-sums can be computed on a Sum-CRCW-PRAM with
O(q log q) processors in constant time. ◀
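The two time steps of this proof can be checked with a sequential Python sketch. It computes the block sums s_i(j) and then assembles every y(j) from the blocks selected by the binary digits of j + 1. The function name prefix_sums_via_blocks and the sample input are hypothetical, and the loops merely serialize what the PRAM processors do in parallel.

```python
def prefix_sums_via_blocks(x):
    """Prefix-sums of x (length a power of two) via aligned block sums."""
    q = len(x)
    assert q > 0 and q & (q - 1) == 0, "q must be a power of two"
    log_q = q.bit_length() - 1
    # Step 1: s[i][j] = x[j * 2^i] + ... + x[(j + 1) * 2^i - 1].
    s = [
        [sum(x[j << i:(j + 1) << i]) for j in range(q >> i)]
        for i in range(log_q + 1)
    ]
    # Step 2: y(j) is the sum of s_p((j + 1 - 2^p) div 2^p) over all p
    # for which the p-th binary digit of j + 1 equals 1.
    y = []
    for j in range(q):
        total = 0
        for p in range(log_q + 1):
            if ((j + 1) >> p) & 1:
                total += s[p][((j + 1) - (1 << p)) >> p]
        y.append(total)
    return y

xs = [3, 1, 4, 1, 5, 9, 2, 6]
ys = prefix_sums_via_blocks(xs)
```

For j + 1 = 6 = 110 in binary, for instance, the sketch adds s_2(0), covering x_0, . . . , x_3, and s_1(2), covering x_4 and x_5.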
▶ Lemma 13 (Reprint of Lemma 13 on page 9). Computing the levels of all nodes in Cn is
in DMRC^1.
Proof. From Lemmas 11 and 12 we know that the level of each node in Cn can be computed
in T ∈ O(log n) time on a Sum-CRCW-PRAM with P ∈ O(N + N log N) processors. Now,
Corollary 2 yields a MapReduce simulation of this Sum-CRCW-PRAM. We need to check that
the conditions of Corollary 2 are indeed all satisfied: From T ∈ O(log n), P ∈ O(N + N log N),
and M ∈ O(N) follows M + P ∈ O(N log N) and log_{N^{1−ε}}(M + P) ∈ O(1), hence we have
(M + P) log_{N^{1−ε}}(M + P) ∈ O(N^{2(1−ε)}). Thus, the level of each node in Cn can be computed
in O(log n) DMRC-rounds. ◀
▶ Lemma 14 (Reprint of Lemma 14 on page 10). A CRCW-PRAM with P ∈ O(D log D)
processors and memory M ∈ O(D) can sort any subset I ⊆ {1, . . . , D} of integers in constant
time.
Proof. Recall that we use a Sum-CRCW-PRAM that sums up concurrent writes. Assume
that the input and output are stored in the arrays x[0], . . . , x[p − 1] and y[0], . . . , y[p − 1],
respectively. We will use two auxiliary arrays z[0], . . . , z[D] and ẑ[0], . . . , ẑ[D] of size D + 1.
The algorithm works in four steps:
1. Initialize z by using D + 1 ≤ P processors to set z[k] ← 0 for all k ∈ [D + 1].
2. Use p ≤ P processors in parallel to set z[x[k]] ← 1 for all k ∈ [p].
3. Compute the prefix-sums of the array z and save them into ẑ.
4. Use D processors to set, for all k ∈ {1, . . . , D} in parallel, y[ẑ[k] − 1] ← k if and only if
ẑ[k] ≠ ẑ[k − 1].
Since the prefix-sums of D numbers can be computed by the Sum-CRCW-PRAM with
P ∈ O(D log D) processors and memory M ∈ O(D) in constant time by Lemma 12, the
above algorithm stays within these bounds as well.
We now prove that this algorithm is correct. First we observe that after step 2, for every
k ∈ {1, . . . , D}, we have z[k] = 1 if and only if one of the p integers to be sorted is k. Because
ẑ contains the prefix-sums of z, the value stored in ẑ[k] hence tells us how many of the p
integers in x are at most k. (Note that accordingly we always have z[0] = ẑ[0] = 0.) Thus k
is one of the integers in x if and only if ẑ[k] = ẑ[k − 1] + 1; otherwise, we have ẑ[k] = ẑ[k − 1].
As a consequence, the array ẑ contains exactly the values 0, 1, . . . , p in
non-decreasing order, that is, 0 = ẑ[0] ≤ ẑ[1] ≤ · · · ≤ ẑ[D − 1] ≤ ẑ[D] = p. Stepping through
ẑ from start to end, that is, from k = 0 to k = D, we therefore observe an increment of
1 from ẑ[k − 1] to ẑ[k] exactly if k is one of the integers to be sorted. This means that in
step 4 the integers contained in x are detected from left to right in ascending order and
subsequently stored into y[0], . . . , y[p − 1] in the same order. ◀
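The four steps can be traced in a sequential Python sketch. The function name sort_distinct is hypothetical, the prefix-sums are taken from the standard library rather than from the construction of Lemma 12, and the output array is indexed from 0, so that the integer k with rank ẑ[k] is stored at position ẑ[k] − 1.

```python
from itertools import accumulate

def sort_distinct(x, D):
    """Sort distinct integers from {1, ..., D} via a bitmap and prefix-sums."""
    z = [0] * (D + 1)               # step 1: clear the bitmap
    for v in x:                     # step 2: mark every integer that occurs
        z[v] = 1
    z_hat = list(accumulate(z))     # step 3: z_hat[k] = number of inputs <= k
    y = [None] * len(x)
    for k in range(1, D + 1):       # step 4: an increment at k means k occurs
        if z_hat[k] != z_hat[k - 1]:
            y[z_hat[k] - 1] = k
    return y

sorted_values = sort_distinct([5, 2, 9, 1], D=10)
```

Every loop body here is independent of the others within its step, which is what allows the PRAM to execute each step in constant time with one processor per index.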
▶ Corollary 15 (Reprint of Corollary 15 on page 10). Let c ∈ N and 0 < ε < 1/2. Any
set of distinct integers from {1, . . . , ⌈cN log^k n⌉} can be sorted in a constant number of
DMRC-rounds.
Proof. We apply Lemma 14 with D ∈ O(N log^k n). We have D ∈ O(N log^k N) ⊆ O(N^{1+ζ})
and thus also D log D ∈ O(N^{1+ζ}) for any constant ζ > 0. Choose any ζ < 1 − 2ε, which
is possible for ε < 1/2. The sorting is then possible on a CRCW-PRAM with O(N^{1+ζ})
processors and O(N^{1+ζ}) memory in constant time. By Corollary 2, this CRCW-PRAM can be
simulated in a constant number of DMRC-rounds because log_{N^{1−ε}}(N^{1+ζ}) = (1 + ζ)/(1 − ε) ∈
O(1) and O(N^{1+ζ}) ⊆ O(N^{2(1−ε)}). ◀
▶ Lemma 17 (Reprint of Lemma 17 on page 11). We can subdivide the jumping edges in
Cn in a way that renders the subcircuit-wise evaluation possible without increasing the size
beyond O(N).
Proof. Let ((j_u, level(u), u), (j_v, level(v), v)) be a jumping edge, where L_{i_u} ≤ level(u) ≤
L_{i_u+1}, L_{i_v} ≤ level(v) ≤ L_{i_v+1}, and i_u < i_v. If i_u = i_v − 1, then this edge is divided into
two edges ((j_u, level(u), u), dummy) and (dummy, (j_v, level(v), v)), introducing a new node
dummy of the id kind with level(dummy) = i_v. If i_u ≤ i_v − 2, then this edge is divided into
three edges ((j_u, level(u), u), dummy1), (dummy1, dummy2), and (dummy2, (j_v, level(v), v)),
introducing two new nodes with level(dummy1) = i_u + 1, level(dummy2) = i_v. Having
divided the jumping edges in this way, the newly created edges are all part of some dummy-
down-circuit or dummy-up-circuit, except for edges of the form (dummy1,dummy2). Note
that we cannot further subdivide the edges of the form (dummy1,dummy2) because we
would exceed the size limit on the circuit otherwise. The most convenient way to deal
with this is to adjust our definition of down-circuits and up-circuits such that every edge
of the form (dummy1,dummy2) is considered to be both a dummy1-down-circuit and a
dummy2-up-circuit on its own. This way, every edge in the augmented circuit is included in
some down-circuit or up-circuit. Note that this augmentation can be performed in a single
round and that the size of the augmented circuit is in O(N). In what follows, we consider
Cn to be this augmented circuit. ◀
▶ Lemma 18 (Reprint of Lemma 18 on page 11). Given Cn, all Li-down-circuits and
Li-up-circuits can be constructed in O(log n) DMRC-rounds whenever 0 < ε < 1/2.
Proof. In the first round, the map function µ1 is defined such that each reducer_q is assigned
(via the choice of the key) β nodes of the form 〈j; (level(v), v)〉 and directed edges adjacent
to these nodes. Note that one edge can thus be assigned to two different reducers, once as
an outgoing and once as an incoming edge. Specifically, we define
µ1(〈j; (level(v), v)〉) = {〈j ÷ β; (j, level(v), v)〉}
for the key-value pairs representing nodes and
µ1(〈i; ((j, level(v), v), (j′, level(v′), v′))〉) = {〈j ÷ β; ((j, level(v), v), (j′, level(v′), v′))〉,
〈j′ ÷ β; ((j, level(v), v), (j′, level(v′), v′))〉}
for the key-value pairs representing edges.
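The map function µ1 amounts to an integer division of the sorting rank by β. A minimal Python sketch, with the hypothetical names mu1_node and mu1_edge and tuples standing in for the key-value pairs, could look as follows.

```python
def mu1_node(j, level, v, beta):
    """Send node v with sorting rank j to reducer number j div beta."""
    return [(j // beta, (j, level, v))]

def mu1_edge(edge, beta):
    """Send an edge to the reducers of both endpoints. The result behaves
    like a set, so an edge inside a single reducer is emitted only once."""
    (j, _, _), (j2, _, _) = edge
    keys = {j // beta, j2 // beta}
    return [(key, edge) for key in sorted(keys)]

# Hypothetical edge between the nodes with sorting ranks 5 and 12, beta = 4.
edge = ((5, 2, "a"), (12, 1, "b"))
pairs = mu1_edge(edge, beta=4)
```

The same division j ÷ β is what later allows a reducer to determine, from the sorting rank stored alongside a node, which reducer holds the edges adjacent to that node.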
In the subsequent execution of ρ1, each reducer can therefore directly construct the
v-up-circuits and v-down-circuits of depth 1 for its β assigned nodes. We will now describe
how some of these initial circuits, namely those on levels Li for any i ∈ [r], can be extended
to full Li-up-circuits and Li-down-circuits by iteratively increasing the circuit depth one by
one in the following way:
Let v be a node with level(v) = Li in reducer_q for any i ∈ [r] and q ∈ [ℓ]. We want
to extend C^down_1(v) and C^up_1(v) to C^down_2(v) and C^up_2(v), respectively. Let u_in (u_out, resp.)
be any node of in-degree (out-degree, resp.) 0 in it, that is, any node that potentially
needs to be extended by one or multiple edges. These extending edges are not necessarily
available in reducerq, however. We need to find out which reducer stores them – if there
are any – and then request these edges from it in some way. To determine the right
reducer, we make use of the sorting index stored alongside each node, even when part of an
edge. Any edge (uin, v) that we need to check for possible extensions is in fact represented
as 〈q , ( (juin , level(uin), uin) , (jv, level(v), v) ) 〉 in reducerq. The number of the reducer
containing the downward extending edges is now retrieved as to(uin) = juin ÷β. Analogously,
the upward extending edges for an edge (v, uout) are to be found in reducerto(uout), where
to(uout) = juout ÷ β. We now know whom to ask for edges extending the subcircuit beyond
node u, namely reducer number to(u). Let from(v) = q denote the number of the reducer
sending the request, which we encode in form of the key-value pair 〈q; (u, to(u), from(v))〉.
Each $\mathrm{reducer}_q$ does the above for every node with possible extending edges and also
passes along to the mapper, unaltered, all $v$-up-circuits and $v$-down-circuits constructed so far.
This concludes the first round. In the second round, the map function $\mu_2$ naturally reassigns
$\langle q; (u, \mathrm{to}(u), \mathrm{from}(v)) \rangle$ to $\mathrm{reducer}_{\mathrm{to}(u)}$, and returns the $v$-up-circuits and $v$-down-circuits
to the reducers that sent them. Having received the edge request of the form
$\langle \mathrm{to}(u); (u, \mathrm{to}(u), \mathrm{from}(v)) \rangle$ while executing $\rho_2$, $\mathrm{reducer}_{\mathrm{to}(u)}$ now sends all edges potentially
useful to $\mathrm{reducer}_{\mathrm{from}(v)}$ (that is, the entire $u$-up-circuit and the entire $u$-down-circuit of
depth 1) to the next mapper in the form of a pair $(\mathrm{from}(v), e)$ for every edge $e$ containing
node $u$. As before, all other circuits constructed so far get passed along without modification
as well.
In the third round, the map function $\mu_3$ routes the requested edges to the requesting
reducer by generating the key-value pairs $\langle \mathrm{from}(v); (\mathrm{from}(v), e) \rangle$. In the reducing step, which
implements the same reduce function $\rho_1$ as in the first round, $\mathrm{reducer}_{\mathrm{from}(v)}$ now finally has
all $v$-up-circuits and $v$-down-circuits fully extended to depth 2.
Since performing the two rounds $\mu_2, \rho_2, \mu_3, \rho_1$ deepens the $L_i$-up-circuits and $L_i$-down-circuits
by one level in the way just seen, the complete $L_i$-up-circuits and $L_i$-down-circuits
can be constructed by repeating these two rounds $s$ times.
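The request-and-reply routing described above can be traced with a small Python sketch. It is a sequential illustration under assumptions, not the authors' implementation: the function names, the tuple encoding of nodes as (j, level, name) and of edges as (tail, head), and the integer division `//` for ÷ are all chosen here for illustration.

```python
def to_reducer(node, beta):
    """to(u) = j_u ÷ beta, where node = (j_u, level(u), u)."""
    return node[0] // beta


def make_request(q, u, beta):
    """Emitted by reducer q at the end of round 1: <q; (u, to(u), from(v))>."""
    return (q, (u, to_reducer(u, beta), q))


def mu2(request):
    """Round 2 map: re-key the request to the reducer that stores u's edges."""
    _, (u, to_u, from_v) = request
    return (to_u, (u, to_u, from_v))


def rho2_reply(routed_request, local_edges):
    """Round 2 reduce at reducer to(u): emit (from(v), e) for every edge e containing u."""
    _, (u, _, from_v) = routed_request
    return [(from_v, e) for e in local_edges if u in e]


def mu3(reply):
    """Round 3 map: deliver each edge to the requester as <from(v); (from(v), e)>."""
    from_v, e = reply
    return (from_v, (from_v, e))
```

For example, with β = 4, a request issued by reducer 0 for a node with sorting index 9 is routed to reducer 2, and every edge stored there that contains this node travels back under key 0.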
It is again clear that the memory and I/O requirements of the reducers are all met in
every round since the input size and output size are in $O(d)$ for each reducer. Moreover, the
total memory for storing the $v$-up-circuits and $v$-down-circuits is $\beta \cdot N \in O(N^{1+\gamma})$ because
$C_n$ has at most NO $\in O(N)$ nodes. Since the constant $\gamma$ was chosen such that $0 < \gamma \le 1 - 2\varepsilon$,
we have $N^{1+\gamma} \in O(N^{2(1-\varepsilon)})$ and thus all up-circuits and down-circuits can be stored in the
respective reducers. ◀
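The last step of this memory estimate can be spelled out as a short chain of implications. The only assumption made here beyond the text is that the stated bound $\beta \cdot N \in O(N^{1+\gamma})$ is taken as given.

```latex
\[
  0 < \gamma \le 1 - 2\varepsilon
  \;\Longrightarrow\;
  1 + \gamma \le 2 - 2\varepsilon = 2(1 - \varepsilon)
  \;\Longrightarrow\;
  \beta \cdot N \in O\bigl(N^{1+\gamma}\bigr) \subseteq O\bigl(N^{2(1-\varepsilon)}\bigr).
\]
```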
▶ Lemma 19 (Reprint of Lemma 19 on page 11). If all up-circuits and down-circuits are
constructed in the proper reducers, $C_n$ can be evaluated in $O(\mathrm{depth}(C_n)/\log n)$ DMRC-rounds.
Proof. Without loss of generality, let $\mathrm{depth}(C_n)$ be divisible by $s$ and let $r = \mathrm{depth}(C_n)/s$.
Once all $L_i$-down-circuits and $L_i$-up-circuits for all $i \in \{1, \dots, r\}$ have been constructed, we
can evaluate $C_n$ on the given input assignment. We begin by evaluating the $L_{r-1}$-down-circuits.
Since every input node has its value assigned in a $v$-down-circuit, the $L_{r-1}$-down-circuits
can be computed in the reducers containing these $v$-down-circuits. With the values
of all nodes at level $L_{r-1}$ determined, we can send the necessary values to the $L_{r-2}$-down-circuits
and, in the case of edges that were divided using two dummy nodes, to lower-level
down-circuits. The nodes at level $L_{r-1}$ that are necessary to compute the $L_{r-2}$-down-circuits are
described in the $L_{r-1}$-up-circuits: any such node $v$ is described in its $v$-up-circuit.
Therefore, the output of $\mathrm{reducer}_q$ is
as follows: Let $v$ be at level $L_{r-1}$ and let $u_i$, for $i \in \{1, \dots, k_v\}$, be the nodes at level $L_{r-2}$
in the $v$-up-circuit. For each such $v$ in $\mathrm{reducer}_q$, it outputs $(\mathrm{to}(u_i), v, \mathrm{val}(v))$ for every $i \in \{1, \dots, k_v\}$, where $\mathrm{to}(u_i)$ is the
index of the reducer containing the $u_i$-down-circuit and $\mathrm{val}(v)$ is the value of $v$ determined
in the computation of the $v$-down-circuit. $\mathrm{reducer}_q$ also passes on all $v$-down-circuits
and $v$-up-circuits contained in it.
In the next round, the map function sends each $(\mathrm{to}(u_i), v, \mathrm{val}(v))$ to the reducer containing
the $u_i$-down-circuit; that is, it generates the key-value pair $\langle \mathrm{to}(u_i); (v, \mathrm{val}(v)) \rangle$. Of course,
the map function also passes along all $v$-down-circuits and $v$-up-circuits to the proper reducers.
Since now each $L_{r-2}$-down-circuit is contained completely in a reducer that has gathered
all values of nodes at level $L_{r-1}$ necessary to compute this subcircuit, all $L_{r-2}$-down-circuits
can be computed in their reducers. Now we can compute the values of nodes higher and
higher up in the circuit, by iterating the last mapping-reducing function pair, until the value
is finally known for the unique output node. As before, we clearly stay within the memory
and I/O buffer limits of each reducer. ◀
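The round count in this argument can be illustrated with a toy, purely sequential Python simulation that evaluates a circuit in blocks of $s$ levels and counts one simulated round per block. Everything in it is an assumption made for illustration: the name `evaluate_in_blocks`, the gate encoding, and the use of a "height" counted upward from the inputs instead of the paper's levels $L_i$. It does not model reducers, keys, or memory limits.

```python
OPS = {
    "AND": lambda bits: all(bits),
    "OR": lambda bits: any(bits),
    "NOT": lambda bits: not bits[0],
}


def evaluate_in_blocks(gates, inputs, s):
    """Evaluate a Boolean circuit s levels at a time.

    gates:  dict name -> (height, op, operand names), where height >= 1 is
            the distance from the inputs (hypothetical encoding).
    inputs: dict input name -> bool.
    Returns (values, rounds), with one simulated round per block of s levels.
    """
    values = dict(inputs)
    depth = max(height for height, _, _ in gates.values())
    rounds = 0
    for low in range(0, depth, s):
        # All gates whose height lies in (low, low + s] form one block;
        # sorting by height ensures operands are evaluated first.
        block = sorted(
            (height, name)
            for name, (height, _, _) in gates.items()
            if low < height <= low + s
        )
        for _, name in block:
            _, op, operands = gates[name]
            values[name] = OPS[op]([values[o] for o in operands])
        rounds += 1
    return values, rounds
```

A circuit of depth 4 evaluated with $s = 2$ takes 2 simulated rounds, matching $r = \mathrm{depth}(C_n)/s$; the proofs of Lemmas 18 and 19 together suggest that $s$ grows proportionally to $\log n$, which yields the stated bound.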
B Illustrating Figures
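[Figure 1: diagram not reproduced. Recoverable labels: the input variables from $x_0 = x_{\sigma_0 + 1}$ up to $x_{n-1} = x_{\sigma_\ell}$, with block boundaries at $x_{\sigma_1}, x_{\sigma_1 + 1}, x_{\sigma_2}, \dots, x_{\sigma_{\ell-1}}$; the values $y_0, y_1, y_{\sigma_1}, y_{\sigma_1 + 1}, \dots, y_{n-2}, y_{n-1}$; and the reducers $\mathrm{reducer}_0, \mathrm{reducer}_1, \dots, \mathrm{reducer}_{\ell-1}$.]
Figure 1 Separation of the input variables $x_0, \dots, x_{n-1}$ into $\ell$ blocks for the $\ell$ reducers, in
dependence of the values of $y_i$.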
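[Figure 2: diagram not reproduced. It shows the contents of $\mathrm{reducer}_q$ (up to $\beta$ edges) in two cases.
For every key-value pair $\langle q; (j_v, \mathrm{level}(v), v) \rangle$ such that there is an $i \in \mathbb{N}$ with $\mathrm{level}(v) = L_i$: a $v$-up-circuit of depth $L_{i-1} - L_i$ and a $v$-down-circuit of depth $L_i - L_{i+1}$.
For every key-value pair $\langle q; (j_v, \mathrm{level}(v), v) \rangle$ such that $\mathrm{level}(v) \neq L_i$ for all $i \in \mathbb{N}$: a $v$-up-circuit of depth 1 and a $v$-down-circuit of depth 1.]
Figure 2 The up-circuits and down-circuits constructed in $\mathrm{reducer}_q$, comprising up to $\beta$ edges.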
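[Figure 3: diagram not reproduced. Recoverable labels: the levels $L_i$, $L_{i+1}$, $L_{i+2}$; the edge endpoints $v$, $u$ and $v'$, $u'$; and the inserted nodes $\mathrm{dummy}$, $\mathrm{dummy}'_1$, $\mathrm{dummy}'_2$.]
Figure 3 Two jumping edges on the left and their resolving division on the right.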
[Figure 4: diagram not reproduced. The inputs $x_0, \dots, x_7$ are labeled $s_0(0), \dots, s_0(7)$; addition nodes combine them pairwise into $s_1(0), \dots, s_1(3)$, then into $s_2(0), s_2(1)$, and finally into $s_3(0)$. The outputs are:
$y_0 = s_0(0)$
$y_1 = s_1(0)$
$y_2 = s_1(0) + s_0(2)$
$y_3 = s_2(0)$
$y_4 = s_2(0) + s_0(4)$
$y_5 = s_2(0) + s_1(2)$
$y_6 = s_2(0) + s_1(2) + s_0(6)$
$y_7 = s_3(0)$]
Figure 4 Calculation of the prefix-sums $s_i(j) = \sum_{p \in [(j+1)2^i] \setminus [j 2^i]} x_p$ for every $i \in [1 + \log q]$ and
$j \in [q/2^i]$ for the example of $q = 8$.
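The block sums $s_i(j)$ and their combination into prefix sums can be reproduced with a short sequential Python sketch. The function names `block_sums` and `prefix_sums` are chosen here for illustration, and the loop that selects which block sums to add (one block per set bit of $k + 1$) is an assumption that reproduces the combinations listed for $q = 8$; it says nothing about how the work is distributed over reducers.

```python
def block_sums(x):
    """Return s with s[i][j] = sum of x[p] for j * 2**i <= p < (j + 1) * 2**i.

    Assumes len(x) is a power of two.
    """
    s = [list(x)]
    while len(s[-1]) > 1:
        prev = s[-1]
        s.append([prev[2 * j] + prev[2 * j + 1] for j in range(len(prev) // 2)])
    return s


def prefix_sums(x):
    """Return y with y[k] = x[0] + ... + x[k], assembled from block sums."""
    s = block_sums(x)
    y = []
    for k in range(len(x)):
        total, start, remaining = 0, 0, k + 1
        # Cover the first k + 1 positions greedily with the largest aligned blocks.
        for i in reversed(range(len(s))):
            if remaining >= 2 ** i:
                total += s[i][start >> i]
                start += 2 ** i
                remaining -= 2 ** i
        y.append(total)
    return y
```

For $k = 6$ the loop adds $s_2(0)$, $s_1(2)$, and $s_0(6)$, which is exactly the combination given for $y_6$ in the figure.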
