The MapReduce framework has firmly established itself as one of the most widely used parallel computing platforms for processing big data on the tera- and petabyte scale. Approaching it from a theoretical standpoint has proved to be notoriously difficult, however. Continuing the early efforts of Goodrich et al., who explicitly espoused the goal of putting the MapReduce framework on an equal footing with long-established models such as the PRAM, we investigate the natural complexity question of how the computational power of MapReduce algorithms compares to that of the combinational Boolean circuits commonly used for parallel computations. Relying on the standard MapReduce model introduced by Karloff et al. a decade ago, we develop an intricate simulation technique to show that any problem in NC (i.e., a problem solved by a logspace-uniform family of Boolean circuits of polynomial size and a depth polylogarithmic in the input size) can be solved by a MapReduce computation in $O(T(n)/\log n)$ rounds, where $n$ is the input size and $T(n)$ is the depth of the witnessing circuit family. Thus, we are able to closely relate the standard, uniform NC hierarchy modeling parallel computations to the deterministic MapReduce hierarchy DMRC by proving that
Introduction
Despite the overwhelming success of the MapReduce framework in the big data industry and the great attention it has garnered ever since its inception over a decade ago, theoretical results about it have remained scarce in the literature. In particular, it is very natural to ask how powerful exactly MapReduce computations are in comparison to the traditional models of parallel computations based on circuits; a question that has practical implications as well. The answers have proved to be very elusive, however. In this paper, we show how MapReduce programs can efficiently simulate circuits used for parallel computations, thus tying these two worlds together more tightly. In this section we first provide an introduction to the concept of MapReduce, then present the related work, and finally describe our contribution. In Section 2, we will formally define the traditional models of parallel computing and the MapReduce model. In Section 3, we then derive our main results. Section 4 concludes the paper with a short summary and a discussion of our findings, outlining opportunities for future research.
Background and Motivation
In recent years the amount of data available and demanding analysis has grown at an astonishing rate. The amount of memory in commercially available servers has also grown at a remarkable pace in the past decade; it is now exceeding tera- and even petabytes. Despite the considerable advances in the availability of computational power, traditional approaches remain insufficient to cope with such huge amounts of data. A new form of parallel computing has become necessary to deal with these enormous quantities of available data. The MapReduce framework has been attracting great interest due to its suitability for processing massive data sets. This framework was originally developed by Google [5], but an open-source implementation called Hadoop has recently been developed and is currently used by over a hundred companies, including Yahoo!, Facebook, Adobe, and IBM [17].
MapReduce differs substantially from previous models of parallel computation in that it combines aspects of both parallel and sequential computation. Informally, a MapReduce computation can be described as follows.
The input is a set of key-value pairs $\langle k; v \rangle$. In a first step, the map step, each of these key-value pairs is separately and independently transformed into an entire set of key-value pairs by a map function $\mu$. In the next step, the shuffle step, we collect all key-value pairs from the sets that have been produced in the previous step, group them by their keys, and merge each group $\{\langle k; v_1 \rangle, \langle k; v_2 \rangle, \ldots\}$ of pairs sharing a key $k$ into a single pair $\langle k; \{v_1, v_2, \ldots\} \rangle$. In the final step, the reduce step, each of these merged pairs is transformed, again separately and independently, into a set of key-value pairs by a reduce function $\rho$.
The three steps described above constitute one round of the MapReduce computation and transform the input set into a new set of key-value pairs. A complete MapReduce computation consists of any given number of rounds and acts just as the composition of the single rounds. The shuffle step works the same way every time; the map and reduce functions, however, may change from round to round. A MapReduce computation with $R$ rounds is therefore completely described by a list $\mu_1, \rho_1, \mu_2, \rho_2, \ldots, \mu_R, \rho_R$ of map and reduce functions. In both the map step and the reduce step, the input pairs can be processed in parallel since the map and reduce functions act independently on the pairs and groups of pairs, respectively. These steps therefore capture the parallel aspect of a MapReduce computation, whereas the shuffle step enforces a partial sequentiality since the shuffled pairs can be output only once the previous map step is completed in its entirety.
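To make the round structure concrete, the following is a minimal sequential sketch of the map, shuffle, and reduce steps. It is an illustration only: the function names `run_round` and `run_computation` and the word-count example are our own assumptions, and the sketch ignores all memory and time restrictions of the formal model.

```python
from collections import defaultdict

def run_round(pairs, mapper, reducer):
    """Simulate one round: map each pair, shuffle by key, reduce each group."""
    # Map step: every pair is transformed independently into a list of pairs.
    mapped = [out for key, value in pairs for out in mapper(key, value)]
    # Shuffle step: group all values by their key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce step: every group is transformed independently into a list of pairs.
    return [out for key, values in groups.items() for out in reducer(key, values)]

def run_computation(pairs, rounds):
    """A computation with R rounds is described by a list of (mapper, reducer) pairs."""
    for mapper, reducer in rounds:
        pairs = run_round(pairs, mapper, reducer)
    return pairs

# Hypothetical example: counting words in two input lines with a single round.
def word_map(_line_number, line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return [(word, sum(counts))]

result = run_computation([(0, "a b a"), (1, "b a")], [(word_map, word_reduce)])
print(sorted(result))  # [('a', 3), ('b', 2)]
```

In an actual deployment the loops over pairs and groups would run in parallel on different machines; only the shuffle needs all map outputs of the round.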
The MapReduce paradigm was introduced in [5] in the context of algorithm design and analysis. A treatment as a formal computational model, however, was missing in the beginning. Later on, a number of models emerged to deal more rigorously with algorithmic issues [8, 9, 10, 13, 14]. In this paper, our interest lies in studying the MapReduce framework from the standpoint of parallel algorithmic power by comparing it to standard models of parallel computation such as Boolean circuits and parallel random access machines (PRAMs). A PRAM can be classified by how far simultaneous access by processors to its memory is restricted; it can be CRCW, EREW, CREW, or ERCW, where R, W, C, and E stand for Read, Write, Concurrent, and Exclusive, respectively [4]. If concurrent writing is allowed, we need to further specify how parallel writes by multiple processors to a single memory cell are handled. The most natural choice is arguably that, after each time step, every memory cell contains the total of all numbers assigned to it by different processors during that step. In fact, all constructions in this paper work with this treatment of simultaneous writes; we thus generally assume this model. If the context warrants it, we speak of a Sum-CRCW to make this assumption explicit.
Related Work
We briefly present and discuss the following known results on the comparative power of the MapReduce framework and PRAM models.
1. A $T$-time EREW-PRAM algorithm can be simulated by an $O(T)$-round MapReduce algorithm, where each reducer uses memory of constant size and an aggregate memory proportional to the amount of shared memory required by the PRAM algorithm [9, 10].
2. A $P$-processor, $M$-memory, $T$-time EREW-PRAM algorithm can be simulated by an $O(T)$-round, $(P + M)$-key MUD algorithm with a communication complexity of $O(\log(P + M))$ bits per key, where a MUD (massive, unordered, distributed) algorithm is a data-streaming MapReduce algorithm in the following sense: The reducers do not receive the entire list of values associated with a given key at once, but rather as a stream to be processed in one pass, using only a small working memory determining the communication complexity [8].
3. When using MapReduce computations to simulate a CRCW-PRAM instead, again with $P$ processors and $M$ memory, we incur an $O(\log_m(P + M))$ slowdown compared to the simulations above, where $m$ is an upper bound on each reducer's input and output [9].
These results imply that any problem solved by a PRAM with a polynomial number of processors in polylogarithmic time $T$ can be solved by a MapReduce computation with an amount of memory equal to the number of PRAM processors and in a number of rounds equal to the computation time of even the powerful CRCW-PRAM. Since the class of problems solved by CRCW-PRAMs in time $T \in O(\log^i n)$ is equal to the class of problems solved by families of polynomial-sized combinational circuits consisting of gates with unbounded fan-in and fan-out and time $T \in O(\log^i n)$ (often denoted $\mathrm{AC}^i$) [1], these circuits can be simulated in a MapReduce computation with a number of rounds equal to the time required by these circuits.
Since the publication of the seminal paper by Karloff et al. [10], extensive effort has been spent on developing efficient algorithms in MapReduce-like frameworks [3, 6, 11, 12, 15]. Only a few relationships between the theoretical MapReduce model of [10] and classical complexity classes have been established, however; for example, any problem in SPACE(o(log n)) can be solved by a MapReduce computation with a constant number of rounds [7].
Contribution
We prove that $\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^{i}$ for all $i \in \{0, 1, 2, \ldots\}$, where $\mathrm{DMRC}^{i}$ is the set of problems solvable by a deterministic MapReduce computation in $O(\log^i n)$ rounds. In the case of $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$, which already opens up a plethora of applications on its own, the result holds for every possible choice of $\varepsilon$, that is, for $0 < \varepsilon \le 1/2$. The higher levels of the hierarchy require an entirely different proof method, which yields the result for $0 < \varepsilon < 1/2$. This is a substantial improvement over the previous results that only imply, as outlined above, the far weaker claim $\mathrm{AC}^i \subseteq \mathrm{MRC}^i$. The case $i = 1$ is of particular practical interest since $\mathrm{NC}^1 \setminus \mathrm{AC}^0$ contains plenty of relevant problems such as integer multiplication and division, the parity function, and the recognition of Dyck languages; see [1]. Our results show how to solve all of these problems with a deterministic MapReduce program in a constant number of rounds.
Preliminaries
We denote by $\mathbb{N} = \{0, 1, 2, \ldots\}$ the natural numbers including zero and let $\mathbb{N}^+ = \mathbb{N} \setminus \{0\}$. Moreover, we let $[i] = \{0, 1, \ldots, i - 1\}$ denote the first $i$ natural numbers for any $i \in \mathbb{N}^+$.
Models of Parallel Computation
In this section, we define the common complexity classes capturing the power of parallel computation; most prominently the NC hierarchy.
A finite set $B = \{f_0, \ldots, f_{|B|-1}\}$ of Boolean functions $f_i : \{0,1\}^{n_i} \to \{0,1\}$ with $n_i \in \mathbb{N}$ for every $i \in [|B|]$ is called a basis. For every $n, m \in \mathbb{N}^+$, a (Boolean) circuit $C$ over the basis $B$ with $n$ inputs and $m$ outputs is a directed acyclic graph that contains $n$ sources (nodes with no incoming edges), called the input nodes, and $m$ sinks (nodes with no outgoing edges). The fan-in of a node is the number of incoming edges; the fan-out is the number of outgoing edges. Nodes that are neither sources nor sinks are called gates. Each gate is labeled with a function $f_i \in B$ and has fan-in $n_i$. It computes $f_i$ on the input given by the incoming edges and outputs the result (either 0 or 1) to the outgoing edges. A basis $B$ is said to be complete if for every Boolean function $f$, we can construct a circuit of the described form that computes $f$ over the basis $B$. In the following, we use the complete basis $B = \{\lor, \land, \lnot\}$.
The size of a circuit C, denoted by size(C), is the total number of edges it contains. The level of a node v in a circuit C, denoted level(v), is defined recursively: The level of a sink is 0, and the level of a node v with nonzero fan-out is one greater than the maximum of the levels of the outgoing neighbors of v. The depth of C, denoted depth(C), is the maximum level across all nodes in C.
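The definitions of size, level, and depth translate directly into a short recursive computation. The sketch below is only an illustration under our own assumptions: the circuit is given as a list of node names and a list of directed edges, and the example circuit for $(x_0 \land x_1) \lor \lnot x_2$ with a separate output node `out` is hypothetical.

```python
from functools import lru_cache

def circuit_levels(nodes, edges):
    """Return the level of every node: sinks have level 0, every other node
    has level one greater than the maximum level of its outgoing neighbors."""
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)

    @lru_cache(maxsize=None)
    def level(v):
        if not successors[v]:
            return 0
        return 1 + max(level(w) for w in successors[v])

    return {v: level(v) for v in nodes}

# Hypothetical example circuit computing (x0 AND x1) OR (NOT x2).
nodes = ["x0", "x1", "x2", "and", "not", "or", "out"]
edges = [("x0", "and"), ("x1", "and"), ("x2", "not"),
         ("and", "or"), ("not", "or"), ("or", "out")]

levels = circuit_levels(nodes, edges)
size = len(edges)              # size(C): total number of edges
depth = max(levels.values())   # depth(C): maximum level across all nodes
print(levels, size, depth)
```

This sequential recursion is only meant to pin down the definitions; it says nothing about how levels are computed within the round bounds of the MapReduce model.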
A function $f : \{0,1\}^* \to \{0,1\}^*$ is implicitly logspace computable if the two mappings $(x, i) \mapsto \chi_{i \le |f(x)|}$, where $\chi$ denotes the characteristic function, and $(x, i) \mapsto (f(x))_i$ are computable using logarithmic space. A circuit family $\{C_n\}_{n=0}^{\infty}$ is logspace-uniform if there is an implicitly logspace computable function mapping $1^n$ to the description of the circuit $C_n$. It is known that the class of languages that have logspace-uniform circuits of polynomial size equals P [1, Thm. 6.15].
For any $i \in \mathbb{N}$, the complexity class $\mathrm{NC}^i$ contains a language $L$ exactly if there is a constant $c$ and a logspace-uniform family of circuits $\{C_n\}_{n=0}^{\infty}$ recognizing $L$ such that $\mathrm{size}(C_n) \in O(n^c)$, $\mathrm{depth}(C_n) \in O(\log^i n)$, and all nodes have fan-in at most 2. The union is Nick's class $\mathrm{NC} = \bigcup_{i=0}^{\infty} \mathrm{NC}^i$. We mention that there is an analogous definition of classes Nonuniform-$\mathrm{NC}^i$ that do not require logspace uniformity from the circuits; they constitute a different hierarchy.
The complexity classes $\mathrm{AC}^i$ and $\mathrm{AC} = \bigcup_{i=0}^{\infty} \mathrm{AC}^i$ are defined exactly as $\mathrm{NC}^i$ and NC, except that the restriction of the maximal fan-in to at most 2 is omitted. Nevertheless, the restriction on the circuit size implies that the fan-in of a node is bounded by a polynomial in $n$. The OR gates and AND gates in such a circuit can therefore be replaced by trees of gates of fan-in at most 2 with a depth in $O(\log n)$. It follows that $\mathrm{AC}^i \subseteq \mathrm{NC}^{i+1}$ for all $i \in \mathbb{N}$ and thus NC = AC. (Analogously, we see why Nick's class can also be defined, as it often is, by upper-bounding the fan-in by an arbitrary constant greater than 2.) The inclusion
The MapReduce Model
In this section we describe the standard MapReduce model as proposed by [10]. A MapReduce program satisfying these conditions is called an $\mathrm{MRC}^i$-algorithm. Note that due to the last condition it is impossible to even store the input unless $2(1 - \varepsilon) \ge 1$, which explains the restriction to $0 < \varepsilon \le 1/2$. As with NC, we define the union class
Requiring all primitives to be deterministic yields the analogous hierarchy of $\mathrm{DMRC} = \bigcup_{i=0}^{\infty} \mathrm{DMRC}^i$. Note that we obviously have
for all $i \in \mathbb{N}$. We will often refer to the single rounds of such MapReduce algorithms as MRC-rounds and DMRC-rounds, respectively.
Simulating Parallel Computations by MapReduce
We are now going to prove our two main results NC 1 ⊆ DMRC 0 for 0 < ε ≤ 1/2 and NC i+1 ⊆ DMRC i for all i ∈ N + and 0 < ε < 1/2 in Sections 3.2 and 3.3, respectively. In both cases, we will be making use of the technical tool derived in Section 3.1 and obtain the results by showing how to use MapReduce computations for two different, delicate simulations. For the inclusion NC 1 ⊆ DMRC 0 , we simulate width-bounded branching programs that are equivalent to the respective circuits by Barrington's classical theorem [2] , whereas for the higher levels of the hierarchy, we directly simulate the combinational circuits themselves.
A Technical Tool
Goodrich et al. [9] parametrize MapReduce algorithms, on the one hand, by the memory limit m for the input/output buffer of the reducers and, on the other hand, by the communication complexity K r of round r, that is, the total size of inputs and outputs for all mappers and reducers in round r. We state a useful result from [9] .
◮ Theorem 1. Any CRCW-PRAM algorithm using M total memory, P processors and T time can be simulated in O(T log m P ) deterministic MapReduce-rounds with communication complexity
We denote by N the size of the smallest circuit representation of the CRCW-PRAM algorithm (i.e., its number of edges) plus the size of its input. Taking into account our
, we obtain the following technical tool, which will prove to be useful in our endeavor.
◮ Corollary 2. Any CRCW-PRAM algorithm using $M$ total memory, $P$ processors and $T$ time can be simulated in $O(T \log_{N^{1-\varepsilon}} P)$ DMRC-rounds if $(M + P) \log_{N^{1-\varepsilon}}(M + P) \in O(N^{2(1-\varepsilon)})$.
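As an illustrative calculation (not part of the original statement), consider how the logarithmic factor in Corollary 2 behaves under the additional assumption that the number of processors is polynomially bounded, say $P \le N^c$ for some constant $c$ and $N > 1$:

```latex
\[
  \log_{N^{1-\varepsilon}} P
  \;=\; \frac{\log P}{(1-\varepsilon)\log N}
  \;\le\; \frac{c \log N}{(1-\varepsilon)\log N}
  \;=\; \frac{c}{1-\varepsilon} \;\in\; O(1).
\]
```

Under this assumption the round bound $O(T \log_{N^{1-\varepsilon}} P)$ collapses to $O(T)$, so a constant-time PRAM algorithm yields a constant number of DMRC-rounds, provided that the memory condition of the corollary is checked separately.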
Simulating $\mathrm{NC}^1$
It is known that Nonuniform-$\mathrm{NC}^1$ is equal to the class of languages recognized by nonuniform width-bounded branching programs. A careful inspection of the proof due to Barrington [2], which crucially relies on the non-solvability of the permutation group on 5 elements, reveals that it naturally translates to the uniform analogue: Our uniform class $\mathrm{NC}^1$ is identical with the class of languages recognized by uniform width-bounded branching programs. In order to prove $\mathrm{NC}^1 \subseteq \mathrm{DMRC}^0$, it therefore suffices to show how to simulate such branching programs by appropriate MapReduce computations with a constant number of rounds.
We first define width-bounded branching programs. Let $n, w \in \mathbb{N}^+$. The input to the program is an assignment $\alpha$ to $n$ Boolean variables $X = \{x_0, \ldots, x_{n-1}\}$. An instruction or line of the program is a triple $(x_i, f, g)$, where $i$ is the index of an input variable $x_i \in X$ and $f$ and $g$ are endomorphisms of $[w]$. An instruction $(x_i, f, g)$ evaluates to $f$ if $\alpha(x_i) = 1$ and to $g$ if $\alpha(x_i) = 0$. A width-$w$ branching program of length $t$ is a sequence of instructions $(x_{i_j}, f_j, g_j)$ for $j \in [t]$. We also refer to the $t$ instructions as the lines of the program. Given an assignment $\alpha$ to $X$, a branching program $B$ yields a function $B(\alpha)$ that is the composition of the functions to which the instructions evaluate.
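The definition can be illustrated by a small evaluator. This is a sketch under assumptions of our own: endomorphisms of $[w]$ are stored as tuples of length $w$, instructions are stored as triples `(i, f, g)` with the variable index `i`, and the composition applies the instructions in program order (first line first); the text above does not fix the order. The parity example is hypothetical.

```python
def compose(first, second):
    """Return the map on [w] that applies `first` and then `second`."""
    return tuple(second[first[a]] for a in range(len(first)))

def evaluate(program, alpha, w):
    """Compute B(alpha) for a width-w branching program.

    program: list of instructions (i, f, g); alpha: list of 0/1 values.
    Assumption: instructions are composed in program order.
    """
    result = tuple(range(w))  # identity on [w]
    for i, f, g in program:
        result = compose(result, f if alpha[i] == 1 else g)
    return result

# Hypothetical width-2 program computing the parity of x0 and x1:
# the program yields `swap` exactly if an odd number of inputs is 1.
ident, swap = (0, 1), (1, 0)
parity = [(0, swap, ident), (1, swap, ident)]

print(evaluate(parity, [1, 0], 2))  # (1, 0)
print(evaluate(parity, [1, 1], 2))  # (0, 1)
```

With the accepting set containing only `swap`, this program recognizes the inputs of odd parity.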
To recognize a language L ⊆ {0, 1} * , we need a family (B n ) ∞ n=0 of width-w branching programs with B n taking n Boolean inputs. We say that L is recognized by B n if there is, for each n ∈ N, a set F n of endomorphisms of [w] 
then L is recognized by a logspace-uniform 5-PBP family.
Due to Theorem 3 it is sufficient for our purposes to simulate the w-PBPs with constant w instead of the circuit families provided by the definition of NC 1 . In order to do this, we need to encode the given w-PBP and the possible assignments in the right form, namely we express them as sets of key-value pairs. A w-PBP of length t can be described as the set 
For ease of readability, we assume from now on without loss of generality that dℓ = t, so that w-PBP can be partitioned into exactly ℓ such subprograms.
For every q ∈ [ℓ], we denote by X q the subset of variables from X appearing in the instructions of subprogram w-PBP q . An assignment α q : X q → {0, 1} to these variables is represented as a set of key-value pairs in the following way. Recall that the subprogram w-PBP q is a list of lines, each of which requires the assignment of a value, either 0 or 1, for exactly one variable. Let x q,j be the jth variable to which a value is assigned in w-PBP q , let p q,j denote the number of the line in which this assignment occurs for the first time in w-PBP q , and let v q,j denote the value that is assigned to x q,j in this line. Now, we represent
Note that despite the dependence of X q on q, we always have |X q | ≤ d. Having seen how to express w-PBP, α, and both w-PBP q and α q for all q ∈ [ℓ] as a set of key-value pairs, we are ready to state and prove the following lemma.
◮ Lemma 4. Let $L$ be a $w$-PBP-recognized language. If, for every $q \in [\ell]$, the representations of $w$-PBP and $\alpha_q$ are given, then we can decide in a 2-round DMRC-computation whether $\alpha \in L$ or not.
Proof. As already described above, let $w$-PBP be represented by the set { p;
Note that there are ℓ subprograms of length at most d and ℓ partial assignments that each assign values to at most one variable per line of the corresponding partial program, thus the total size of the input is in
We define the first map function µ 1 by
and
For any q ∈ [ℓ], there is one subprogram w-PBP q and an associated assignment set α q .
We use the map function $\mu_1$ to find the value assignment for each variable appearing in $w$-PBP$_q$ and store it in a key-value pair. This pair has the key $q$ and is thereby designated to be processed by reducer $q$, which can calculate $\rho_1$, having all pairs with key $q$ available. This function simulates, for each permutation $\pi$ of $[w]$, the subprogram $w$-PBP$_q$ on this permutation with the received assignment and stores the resulting permutation $\pi'$. This yields a table $T_q$ of size $w! \in O(1)$, describing the action of $w$-PBP$_q$ for the given assignment on all $w!$ permutations. (We mention in passing that for the first reducer, reducer 0, it would be sufficient to compute and store only the permutation that results from applying $w$-PBP$_0$ on the given assignment to the identity as the initial permutation, thus saving the time and memory necessary for the rest of the first table.) The output of $\rho_1$ on the $q$th reducer is $\langle q; T_q \rangle$. The map function $\mu_2$ of the second round is simple: it maps $\langle q; T_q \rangle$ to $\langle 0; (q, T_q) \rangle$, thus delivering all pairs $(i, T_i)$ to a single instance of the reduce function $\rho_2$. This first reducer therefore has all tables $T_0, \ldots, T_{\ell-1}$ at its disposal and knows which one is which. Using $T_q$ as a look-up table for the permutation performed by $w$-PBP$_q$, reducer 0 can now compute, starting from the identity permutation id, the permutation $\pi$
and the input is accepted if and only if π ∈ F n , where F n is the set of accepted permutations that is given to us alongside the program w-PBP. ◭
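The two rounds of this proof can be mimicked sequentially as follows. The sketch makes several simplifying assumptions of our own: permutations are tuples, instructions are triples `(i, f, g)` whose functions `f` and `g` are permutations, every block sees the full assignment `alpha` instead of only its part $\alpha_q$, instructions are applied in program order, and the names `block_table` and `simulate` are hypothetical.

```python
from itertools import permutations

def compose(first, second):
    """Return the permutation that applies `first` and then `second`."""
    return tuple(second[first[a]] for a in range(len(first)))

def block_table(block, alpha, w):
    """Round 1, reducer q: for every permutation pi of [w], record the
    permutation obtained by running the subprogram `block` starting from pi."""
    table = {}
    for pi in permutations(range(w)):
        current = pi
        for i, f, g in block:
            current = compose(current, f if alpha[i] == 1 else g)
        table[pi] = current
    return table

def simulate(program, alpha, w, d):
    """Split the program into blocks of d lines, build one table per block
    (independent work, one reducer each), then chain the look-ups (round 2)."""
    blocks = [program[s:s + d] for s in range(0, len(program), d)]
    tables = [block_table(block, alpha, w) for block in blocks]
    pi = tuple(range(w))  # start from the identity permutation
    for table in tables:
        pi = table[pi]
    return pi

# Hypothetical width-3 program with four lines, split into two blocks.
ident, cycle, swap = (0, 1, 2), (1, 2, 0), (1, 0, 2)
program = [(0, cycle, ident), (1, swap, cycle), (2, swap, ident), (0, cycle, swap)]
print(simulate(program, [1, 0, 1], w=3, d=2))  # (0, 2, 1)
```

Each table has exactly $w!$ entries regardless of the block length, which is why the single reducer of the second round can hold all of them when $w$ is constant.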
In the following four lemmas, we show that $\alpha_q$ can be computed in a constant number of rounds from $w$-PBP and $\alpha$ for every $q \in [\ell]$. The challenge lies in designing an interface between the different reducers to bridge the gap between the $\ell$ program blocks $w$-PBP$_q$ and the given assignments, initially cut into $\ell$ blocks based solely on the indices of the input variables, without exceeding the memory limits. We begin with a brief overview of the four steps.
1. For each $x_i$, where $i \in [n]$, we compute the number of subprograms in which $x_i$ appears, and denote this number by $\#S(x_i)$. Note that $\#S(x_i) \le \ell$ and that $\#S(x_i)$ is the number of all those reducers for which the value assignment of $x_i$ is generally required to compute the resulting permutations in the corresponding subprograms.
2. We compute the prefix sums of $\#S(x_i)$: for each $i \in [n]$, let $y_i = \sum_{k=0}^{i} \#S(x_k)$. Note that $y_i$ is the number of assignment triples $(p_{q,j}, x_{q,j}, v_{q,j})$ with $0 < j \le i$ needed to compute the action of the first $i$ subprograms and that $y_{n-1} = \sum_{q=0}^{\ell-1} |\alpha_q|$.
3. Based on the prefix sums, we will compute a separation of the input variables into $\ell$ contiguous blocks such that, for each $q \in [\ell]$, it is feasible for reducer $q$ to produce from the $q$th block the input value assignments that it needs to contribute for the next step. This is nontrivial since the number of input assignments must not
Using these split values, each reducer $q$ can provide all value assignments needed for the computation of all subprograms in the next step without violating the memory limitations.
4. We compute $\alpha_q$ for $q \in [\ell]$ by using $w$-PBP, the input assignment $\alpha$, and the split values.
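The first two steps of this overview can be sketched sequentially as follows. The sketch is an illustration under our own assumptions: the program is reduced to the list of variable indices read by its lines, it is cut into blocks of `d` consecutive lines, and the increment `counts[i] += 1` stands in for the concurrent sum-writes of the PRAM; the function name `subprogram_counts` is hypothetical.

```python
from itertools import accumulate

def subprogram_counts(line_vars, n, d):
    """Step 1: for every variable index i in [n], count the number of
    subprograms (blocks of d consecutive lines) in which x_i appears."""
    counts = [0] * n
    for start in range(0, len(line_vars), d):
        # Each variable is counted once per block, however often it occurs there.
        for i in set(line_vars[start:start + d]):
            counts[i] += 1
    return counts

# Hypothetical program with six lines reading x0, x1, x0, x2, x1, x1.
line_vars = [0, 1, 0, 2, 1, 1]
counts = subprogram_counts(line_vars, n=3, d=2)

# Step 2: prefix sums y_i of the counts.
y = list(accumulate(counts))
print(counts, y)  # [2, 2, 1] [2, 4, 5]
```

In this example the three blocks use 2, 2, and 1 distinct variables, so the last prefix sum equals the total number of assignments that all subprograms need together.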
Proof. For each $q \in [\ell]$, the subprogram $w$-PBP$_q$ is stored in reducer $q$. The output of reducer $q$, which will be the input to compute $\#S(x_i)$, is $\langle q; (q, 1) \rangle, \ldots, \langle q; (q, k_q) \rangle$, with the variables $x_{q,1}, \ldots, x_{q,k_q}$ appearing in the subprogram $w$-PBP$_q$ and $k_q \in O(d)$. The total number of inputs used to compute $\#S(x_i)$ is therefore at most $d\ell \in O(N)$. We use a Sum-CRCW-PRAM, whose concurrent writes to a single memory register are resolved by summing up all values being written to the same register simultaneously; see [9]. We use at most $d\ell$ processors, $P_{q,1}, \ldots, P_{q,k_q}$ for each $q \in [\ell]$, and registers $R_0, \ldots, R_{n-1}$ and let all processors $P_{q,j}$ add 1 to $R_j$ concurrently. Thus we see that computing $\#S(x_i)$ is possible in constant time on a Sum-CRCW-PRAM and therefore, by Corollary
We now describe the three rounds in more detail at the level of key-value pairs. prefix-sums made available after one more round by having each neighboring reducer copy one more prefix-sum into it. We have σ 0 = −1 and σ ℓ = n− 1; it is thus immediately verified that, for every q ∈ [ℓ], the total number of subprograms in which input variables between x σq +1 and x σ (q+1) appear is at most 2d, showing that all the memory restrictions on the reducers are observed. Proof. We can assume that, for each κ ∈ [ℓ], the reducer κ has the subprogram w-PBP κ , the κth block of input assignments {(
, and the split values σ 0 , . . . , σ ℓ available. The output of reducer κ then consists of the following:
, we need to bound the total number of outputs with key κ from above. From the definition of the split values we see that this number is in O(d) since it is bounded by the number of lines, which is at most 2d, plus the number of assignments, which is at most d. Naturally, the map function µ of the next round is defined by
For any $\kappa \in [\ell]$, the assignment variables $\alpha_q$ can be computed by the subsequent reduce function using the key-value pairs produced above. For each $q \in [\ell]$, the reducer $q$ now has available the lines of $w$-PBP and the value assignments for the input variables between $x_{\sigma_q + 1}$ and $x_{\sigma_{q+1}}$. It can therefore go through all the program lines and determine, on the one hand, which value assignments they require and, on the other hand, to which subprogram they belong. The required assignment information is then sent to the respective reducers by outputting q;
We finally obtain the desired inclusion by applying Theorem 3 and Lemmas 4 through 8.
◮ Theorem 9. We have NC 1 ⊆ DMRC 0 .
Simulating $\mathrm{NC}^i$ for All $i \ge 2$
For the higher levels in the hierarchy of Nick's class, we show how to simulate the involved circuits directly. We begin with a short outline of the proof. Let $C_n = (V_n, E_n)$ be an $\mathrm{NC}^{i+1}$ circuit with an input of size $n$, given as a set of nodes and a set of directed edges, together with an input assignment $\alpha$. The total size of $C_n$ in bits is $N_O$, the total size of the input assignment in bits is $N_I$, and $N = N_O + N_I$. Note that $\mathrm{size}(C_n)$ is polynomial in $n$ and $\mathrm{depth}(C_n) \in O(\log^{i+1} n)$. We will take the following steps to simulate the circuit $C_n$ with deterministic MapReduce computations:
1. We compute the level of each node in $C_n$.
2. The nodes and edges are sorted by their level.
3. Both the circuit $C_n$ and the input assignment $\alpha$ are divided equally among the reducers.
4. We split the circuit into subcircuits computable in a constant number of rounds.
5. A custom communication scheme collects and constructs the complete subcircuits.
6. The entire circuit is evaluated via evaluation of the subcircuits.
Note that the equal division of $C_n$ in the third step is very different from the split in the fourth one, where the parts may differ radically in size. Great care must be taken so as not to violate any of the memory and time restrictions, necessitating two unlike partitions. The subsequent steps then need to mediate between these dissimilar divisions. We will show that the steps (1) to (6) can be computed in $O(\log n)$, $O(1)$, $O(1)$, $O(1)$, $O(\log n)$, and $O(\mathrm{depth}(C_n)/\log n)$ rounds, respectively, yielding the desired theorem.
◮ Theorem 10. We have $\mathrm{NC}^{i+1} \subseteq \mathrm{DMRC}^{i}$ for all $i \in \mathbb{N}^+$ and all $0 < \varepsilon < 1/2$.
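As a quick check of how the six round bounds listed for the individual steps add up (an illustrative calculation, using that a circuit witnessing membership in $\mathrm{NC}^{i+1}$ has depth in $O(\log^{i+1} n)$ by the definition of the class):

```latex
\[
  \underbrace{O(\log n)}_{\text{step 1}}
  + \underbrace{O(1) + O(1) + O(1)}_{\text{steps 2--4}}
  + \underbrace{O(\log n)}_{\text{step 5}}
  + \underbrace{O\!\left(\frac{\log^{i+1} n}{\log n}\right)}_{\text{step 6}}
  \;=\; O(\log n + \log^{i} n)
  \;=\; O(\log^{i} n)
  \quad\text{for } i \ge 1 .
\]
```

The final equality is where the restriction to $i \in \mathbb{N}^+$ enters: for $i = 0$ the two $O(\log n)$ terms would dominate, which is consistent with treating the lowest level of the hierarchy by a separate argument.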
Computing The Levels
We begin by showing how to compute the level of each node in the circuit in O(log n) DMRC-rounds by simulating a CRCW-PRAM algorithm. (We mention in passing that this step requires more than a constant number of rounds, which prevents us from obtaining the result for NC 1 ⊆ DMRC 0 by simulating the circuits directly; the separate approach from Subsection 3.2 via Barrington's theorem is thus required for this case.)
In [16], an algorithm is presented that computes the levels of all nodes in a directed acyclic graph on a CREW-PRAM with $O(n + m)$ processors in $O(\log^2 m)$ time, where $n$ and $m$ are the numbers of nodes and edges in the graph, respectively. The first stage of this algorithm relies partly on the computation of prefix-sums, which can be computed much more efficiently when switching to a CRCW-PRAM, as we will show below. A straightforward adaptation of the analysis in [16], taking into account the maximum in-degree and out-degree and separating out the computation of prefix-sums, yields the following result.
◮ Lemma 11. Let G = (V, E) be a directed acyclic graph with n nodes, m edges, maximum in-degree d in , and maximum out-degree d out . The level of each node in G can then be computed on a CRCW-PRAM with P ∈ O(m + P P-Sum (O(m))) processors and time T ∈ O((log m)
, where P P-Sum (q) and T P-Sum (q) denote, respectively, the number of processors and the computation time to compute the prefix-sums of q numbers on a CRCW-PRAM.
In the following lemma, we aim to lower the time and memory requirements for computing prefix-sums on a CRCW-PRAM as far as possible.
◮ Lemma 12. The prefix-sums of $q$ numbers can be computed on a CRCW-PRAM with $P \in O(q \log q)$ processors and memory $M \in O(q)$ in constant time.
Proof. We use a Sum-CRCW-PRAM, where concurrent writes to the same memory register are resolved by adding up all simultaneously assigned numbers [9]. Let $q$ numbers $x_0, x_1, \ldots, x_{q-1}$ be given as input. Without loss of generality, we assume $q$ to be a power of 2 and calculate $s_i(j) = \sum_{j 2^i \le p < (j+1) 2^i} x_p$ for all $i \in [1 + \log q]$ and all $j \in [q/2^i + 1]$; see Figure 2 for an illustrating example.
Since each of the $q/2^i$ elements in $s_i$ is the sum of $2^i$ elements, we can, by allocating $q$ processors for each $i \in [1 + \log q]$, compute every $s_i(j)$ on a Sum-CRCW-PRAM with $O(q \log q)$ processors and $O(1)$ time.
We now describe how the prefix-sums y(0), y (1) 
that is, y(j) can be computed as the sum of all
Thus, it is sufficient to supply a maximum of (log q) − 1 processors for the calculation of each y(j) in a second time step, and the prefix-sums can be computed on a Sum-CRCW-PRAM with O(q log q) processors in constant time. ◭
We plug the result of Lemma 12 into Lemma 11 and then apply it to the graph $C_n$. Since its in-degrees and out-degrees are bounded by a constant $\Delta$, we have $m \le \Delta n/2 \in O(n)$. Hence we can compute the levels of the nodes of $C_n$ on a CRCW-PRAM with $P \in O(N \log N)$ processors in time $T \in O(\log n)$. By Corollary 2, we obtain the following result.
◮ Lemma 13. Computing the levels of all nodes in
Proof. From Lemmas 11 and 12 we know that the level of each node in C n can be computed in T ∈ O(log n) time on a Sum-CRCW-PRAM with P ∈ O(N + N log N ) processors. Now, Corollary 2 yields a MapReduce simulation of this Sum-CRCW-PRAM. We need to check that the conditions of Corollary 2 are indeed all satisfied:
Thus, the level of each node in C n can be computed in O(log n) DMRC-rounds. ◭
Sorting By Levels
Once the levels of all nodes are computed, each node in the circuit can be represented as $(\mathrm{level}(x_i), x_i)$. Recall that the depth of $C_n$ is just the maximum level. Since $\mathrm{depth}(C_n) \in O(\log^k n)$ for some $k \in \mathbb{N}^+$ and the number of nodes is bounded by the number of edges $\mathrm{size}(C_n) \in O(N)$, we can encode each pair $(\mathrm{level}(x_i), x_i)$ by appending to a bit string of length $\log(c_1 \log^k n)$ another one of length $\log(c_2 N)$, for appropriate constants $c_1$ and $c_2$, which results in a bit string of length $\log(c_1 \log^k n) + \log(c_2 N) = \log(cN \log^k n)$ for $c = c_1 c_2 \in \mathbb{N}$. This enables us to identify each pair $(\mathrm{level}(x_i), x_i)$ with a different bit string, which can be interpreted as an integer bounded by $cN \log^k n$. We call this integer the sorting index of node $x_i$. Crucially, we choose the bit string to start with the encoding of the level. Sorting the sorting indices thus means sorting the nodes of $C_n$ by their level. The following lemma shows how prefix-sums can be used to perform such a sort so efficiently on a CRCW-PRAM that we can apply Corollary 2 to simulate it in a constant number of DMRC-rounds. The algorithm works in four steps:
3. Compute the prefix-sums of the array $z$ and save them into $\hat{z}$.
Since the prefix-sums of D numbers can be computed by the Sum-CRCW PRAM with P ∈ O(D log D) processors and memory M ∈ O(D) in constant time by Lemma 12, the above algorithm stays within these bounds as well.
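A sequential sketch of such a prefix-sum-based sort of distinct integers from $\{1, \ldots, D\}$ is given below. Only the prefix-sum step is taken from the text above; the other three steps (clearing an indicator array $z$, marking the input values in it, and writing each value to the position given by its prefix sum) are our assumption about the natural way to complete the algorithm, and the name `sort_distinct` is hypothetical.

```python
from itertools import accumulate

def sort_distinct(values, D):
    """Sort distinct integers from {1, ..., D} using an indicator array
    and its prefix sums (assumed four-step scheme, see the lead-in)."""
    # Assumed step 1: initialise the indicator array z (index 0 is unused).
    z = [0] * (D + 1)
    # Assumed step 2: mark every input value; on a PRAM, one processor per value.
    for a in values:
        z[a] = 1
    # Step 3 (from the text): prefix-sums of z, stored in z_hat.
    z_hat = list(accumulate(z))
    # Assumed step 4: z_hat[a] is the rank of a, so a goes to position z_hat[a] - 1.
    out = [None] * len(values)
    for a in values:
        out[z_hat[a] - 1] = a
    return out

print(sort_distinct([5, 2, 9, 1], D=10))  # [1, 2, 5, 9]
```

Because the values are distinct, no two of them receive the same rank, so the final writes never collide.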
We now prove that this algorithm is correct. First we observe that after step 2, for every $k \in \{1, \ldots, D\}$, we have z
Combining Lemma 14 and Corollary 2 we obtain, by a careful analysis using $\varepsilon < 1/2$, the promised result.
◮ Corollary 15. Let c ∈ N and 0 < ε < 1/2. Any set of distinct integers from {1, . . . , ⌈cN log k n⌉} can be sorted in a constant number of DMRC-rounds.
Proof. We apply Lemma 14 with
for any constant ζ > 0. Choose any ζ < 1 − 2ε, which is possible for ε < 1/2. The sorting is then possible on a CRCW-PRAM with O(N 1+ζ ) processors and O(N 1+ζ ) memory in constant time. By Corollary 2, this CRCW-PRAM can be simulated in a constant number of DMRC-rounds because log
Once all the nodes are sorted by their sorting index (and therefore implicitly by their level), we can enumerate them in ascending order using a sorting index j; that is, we store each node as the key-value pair j; (level(v), v) . Clearly, we obtain an analogous representation of the edges in the form i;
) , which will prove useful later on.
Division of Circuit And Assignment Among Reducers
As we have already seen when discussing the branching programs, an assignment α to input variables X = {x 0 , x 1 , . . . , x n−1 } can be represented as a set { i;
The circuit C n is now divided into ℓ = N ε O subsets of edges according to the sorting indices and input values that are assigned to each subset as in the case of branching programs. For
, the set of variables appearing in C q n is denoted as X q and the assignment α q to X q is represented as { j; x q,j , v q,j | j ∈ [|α q |]}, where x q,j is the jth variable that appears as an input in C q n , and v q,j is its assignment value. Just as seen in Lemma 8 for the case of a branching program, we can now compute α q from C n and α for all q ∈ [ℓ], yielding the following lemma.
We can therefore assume that each input node is represented by j; (level(x ji ), x ji , v ji ) , a key-value pair that is computed from C q n and α q for q ∈ [ℓ] in a single DMRC-round.
Division Into Subcircuits By Levels
We divide C n = (V n , E n ) into as few subcircuits as possible such that the simulation of each subcircuit is in DMRC 0 and we can evaluate C n by evaluating the subcircuits sequentially. Given v ∈ V n and δ ∈ N, we define the v-down-circuit C 
When dividing C n into subcircuits we have two conflicting goals. On the one hand, we want as few of them as possible, which implies that they have to be of great depth. On the other hand, we need to simulate them in MapReduce without exceeding the memory bounds. A depth in O(log n) turns out to be the right choice. Let s = γ(log n)/ log ∆, where ∆ ≥ 2 is a constant bounding the maximum degree of C n and γ is an arbitrary constant satisfying 0 < γ < 1 − 2ε. (Note that such a γ exists exactly if ε < 1/2.) Since a tree of depth s and maximum degree bounded by a constant ∆ contains at most 
For every key-value pair q; When the circuit C n is divided into L i -down-circuits, there may exist edges of C n that are not contained in any
We call such edges level-jumping edges; see Figure 4 for an example. We would like to replace every level-jumping edge (u, v) by a path from u to v that consists only of edges that will be part of the respective L i -down-circuits and L i -up-circuits in the resulting, augmented circuit. The following lemma states that this is possible without increasing the size by too much.
◮ Lemma 17. We can subdivide the jumping edges in C n in a way that renders the subcircuit-wise evaluation possible without increasing the size beyond O(N ). (j v , level(v), v) ), introducing two new nodes with level(dummy 1 ) = i u + 1, level(dummy 2 ) = i v . Having divided the jumping edges in this way, the newly created edges are all part of some dummy-down-circuit or dummy-up-circuit, except for edges of the form (dummy 1 , dummy 2 ). Note that we cannot further subdivide the edges of the form (dummy 1 , dummy 2 ) because we would otherwise exceed the size limit on the circuit. The most convenient way to deal with this is to adjust our definition of down-circuits and up-circuits such that every edge of the form (dummy 1 , dummy 2 ) is considered to be both a dummy 1 -down-circuit and a dummy 2 -up-circuit on its own. This way, every edge in the augmented circuit is included in some down-circuit or up-circuit. Note that this augmentation can be performed in a single round and that the size of the augmented circuit is in O(N ). In what follows, we consider C n to be this augmented circuit. ◭

Figure 4. Two jumping edges on the left and their resolving division on the right.
Construction of Subcircuits in Reducers
Having described the subcircuits on which the evaluation of the entire circuit will be based, we now need to show how to split and construct them in the ℓ different reducers. In each reducer, we start with the nodes v contained in it that satisfy level(v) = L i for some i and the associated v-down-circuits and v-up-circuits of depth 1. We then iteratively increase the depth one by one, until the full L i -down-circuits and L i -up-circuits of depth up to s are constructed. Note that the nodes of any level L i and their corresponding circuits may be scattered across multiple reducers since edges were split equally among them according to their sorting index and not depending on the level. We therefore need to carefully implement a communication scheme that allows each reducer to encode requests for missing edges required in the construction, which are then delivered to them in multiple rounds, without exceeding any of the memory or time bounds. Taking care of all these details, we obtain the following lemma.
Proof. In the first round, the map function µ 1 is defined such that each reducer q is assigned (via the choice of the key) β nodes of the form j; (level(v), v) and the directed edges adjacent to these nodes. Note that one edge can thus be assigned to two different reducers, once as an outgoing edge and once as an in-going edge. Specifically, we define
for the key-value pairs representing nodes and
for the key-value pairs representing edges.
In the subsequent execution of ρ 1 , each reducer can therefore directly construct the v-up-circuits and v-down-circuits of depth 1 for its β assigned nodes. We will now describe how some of these initial circuits, namely those on levels L i for any i ∈ [r], can be extended to full L i -up-circuits and L i -down-circuits by iteratively increasing the circuit depth one by one in the following way:
Let (v) and C up 2 (v), respectively. Let u in (u out , resp.) be any node of in-degree (out-degree, resp.) 0 in it, that is, any node that potentially needs to be extended by one or multiple edges. These extending edges are not necessarily available in reducer q , however. We need to find out which reducer stores them-if there are any-and then request these edges from it in some way. To determine the right reducer, we make use of the sorting index stored alongside each node, even when part of an edge. Any edge (u in , v) that we need to check for possible extensions is in fact represented as q ,
The number of the reducer containing the downward extending edges is now retrieved as to(u in ) = j uin ÷ β. Analogously, the upward extending edges for an edge (v, u out ) are to be found in reducer to(u out ), where to(u out ) = j uout ÷ β. We now know whom to ask for edges extending the subcircuit beyond node u, namely reducer number to(u). Let from(v) = q denote the number of the reducer sending the request, which we encode in the form of the key-value pair q; (u, to(u), from(v)) .
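The lookup of the responsible reducer can be sketched in a few lines. The sketch assumes that ÷ denotes integer division, that sorting indices and reducer numbers both start at 0, and that each reducer holds β consecutive nodes; the function names `to_reducer` and `make_request` are hypothetical.

```python
def to_reducer(sorting_idx: int, beta: int) -> int:
    """Number of the reducer that stores the node with the given sorting index."""
    return sorting_idx // beta


def make_request(u_idx: int, requester: int, beta: int):
    """Build the request pair q; (u, to(u), from(v)) emitted by reducer q = requester."""
    return (requester, (u_idx, to_reducer(u_idx, beta), requester))
```

For instance, with β = 4 the node with sorting index 9 lives in reducer 2, so a request for it issued by reducer 0 is encoded as `(0, (9, 2, 0))`.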
Each reducer q does the above for every node with possible extending edges and also passes along to the mapper all v-up-circuits and v-down-circuits constructed so far unaltered. This concludes the first round.
In the second round, the map function µ 2 naturally re-assigns q; (u, to(u), from(v)) to reducer to(u) and returns the v-up-circuits and v-down-circuits to the reducers that sent them. Having received an edge request of the form to(u); (u, to(u), from(v)) while executing ρ 2 , reducer to(u) now sends all edges potentially useful to reducer from(v), that is, the entire u-up-circuit and the entire u-down-circuit of depth 1, to the next mapper in the form of a pair (from(v), e) for every edge e containing node u. As before, all other circuits constructed so far get passed along without modification as well.
In the third round, the map function µ 3 routes the requested edges to the requesting reducer by generating the key-value pairs from(v); (from(v), e) . In the reducing step, which implements the same reduce function ρ 1 as in the first round, reducer from(v) now finally has all edges it needs to fully extend the v-up-circuits and v-down-circuits to depth 2.
Since performing the two rounds µ 2 , ρ 2 , µ 3 , ρ 1 deepens the L i -up-circuits and L i -down-circuits by one level in the way just seen, the complete L i -up-circuits and L i -down-circuits can be constructed by repeating these two rounds s times.
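The request-and-reply scheme can be mimicked sequentially as follows. This is a simplified sketch under several assumptions: it builds only down-circuits, it starts from depth 0 rather than depth 1, node j is stored in reducer j // β, and each request carries the root r as extra bookkeeping that is not part of the triple (u, to(u), from(v)). The function name `build_down_circuits` and the data layout are hypothetical.

```python
from collections import defaultdict


def build_down_circuits(edges, roots, beta, s):
    """Grow the down-circuit of every root to depth s, one level per iteration.

    edges: list of (u, v) pairs of sorting indices, meaning u feeds into v.
    roots: sorting indices of the nodes whose down-circuits are wanted.
    beta:  number of nodes per reducer; node j lives in reducer j // beta.
    """
    # Local storage: each reducer knows the in-going edges of its own nodes.
    incoming = defaultdict(lambda: defaultdict(list))
    for u, v in edges:
        incoming[v // beta][v].append((u, v))

    # Depth-0 circuits: no edges yet, and the root itself is the frontier.
    circuits = {r: set() for r in roots}
    frontier = {r: {r} for r in roots}

    for _ in range(s):
        # The reducer owning root r requests extensions for its frontier nodes.
        requests = [(u, u // beta, r // beta, r)
                    for r in roots for u in frontier[r]]
        # Reducer to(u) answers every request with the in-going edges of u.
        replies = defaultdict(list)
        for u, target, _requester, r in requests:
            replies[r].extend(incoming[target][u])
        # The requesting reducer extends its circuits by the delivered edges.
        for r in roots:
            circuits[r].update(replies[r])
            frontier[r] = {u for u, _ in replies[r]}
    return circuits
```

Each pass through the loop stands for the pair of rounds described above, so s passes yield circuits of depth s.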
It is again clear that the memory and I/O requirements of the reducers are all met in every round since the input size and output size are in O(d) for each reducer. Moreover, the total memory for storing the v-up-circuits and v-down-circuits is β · N ∈ O(N 1+γ ) because C n has at most N O ∈ O(N ) nodes. Since the constant γ was chosen such that 0 < γ < 1 − 2ε, we have N 1+γ ∈ O(N 2(1−ε) ) and thus all up-circuits and down-circuits can be stored in the respective reducers. ◭
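The memory bound at the end of the proof rests on a simple comparison of exponents, spelled out here for N ≥ 1 and the strict bound γ < 1 − 2ε from the definition of s:

```latex
\[
  \gamma < 1 - 2\varepsilon
  \;\Longrightarrow\;
  1 + \gamma < 2 - 2\varepsilon = 2(1 - \varepsilon)
  \;\Longrightarrow\;
  N^{1+\gamma} \le N^{2(1-\varepsilon)} ,
\]
```

which gives N^{1+γ} ∈ O(N^{2(1−ε)}) as claimed.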
Evaluation Via Subcircuits
The main idea in the proof of the following lemma is to compute the evaluation values subcircuit-wise, starting with the deepest ones, and then iteratively moving up the circuit in depth(C n )/s rounds, passing on the newly computed values to the right reducers, until the value of the unique output node is known.
◮ Lemma 19. If all up-circuits and down-circuits are constructed in the proper reducers, C n can be evaluated in O(depth(C n )/ log n) DMRC-rounds.
Proof. Without loss of generality, let depth(C n ) be divisible by s and let r = depth(C n )/s. Once all L i -down-circuits and L i -up-circuits for all i ∈ {1, . . . , r} have been constructed, we can evaluate C n on the given input assignment. We begin by evaluating the L r−1 -down-circuits. Since every input node has its value assigned in a v-down-circuit, the L r−1 -down-circuits can be computed in the reducers containing these v-down-circuits. With the values of all nodes at level L r−1 determined, we can send the necessary values to the L r−2 -down-circuits and, in the case of edges that were divided using two dummy nodes, to lower-level down-circuits. Nodes at level L r−1 that are necessary to compute L r−2 -down-circuits are described in the L r−1 -up-circuits. Any node v at level L r−1 that is necessary to compute L r−2 -down-circuits is described in the v-up-circuit. Therefore, the output of the reducer q is as follows:

In the next round, the map function sends each (to(u in ), v, val(v)) to the reducer containing the u in -down-circuit; that is, it generates the key-value pair to(u in ); (v, val(v)) . Of course, the map function also passes along all v-down-circuits and v-up-circuits to the proper reducers.
Since now each L r−2 -down-circuit is contained completely in a reducer that has gathered all values of nodes at level L r−1 necessary to compute this subcircuit, all L r−2 -down-circuits can be computed in their reducers. Now we can compute the values of nodes higher and higher up in the circuit, by iterating the last mapping-reducing function pair, until the value is finally known for the unique output node.
As before, we clearly stay within the memory and I/O buffer limits of each reducer. ◭
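The round count in Lemma 19 can be illustrated by a sequential sketch that evaluates a levelled circuit in blocks of s consecutive levels, one block per round. The sketch makes assumptions that are not part of the proof: input nodes sit at level 0, levels are counted upward from the inputs (an orientation chosen only for this illustration), and the function name `evaluate_in_rounds` and the gate encoding are hypothetical.

```python
def evaluate_in_rounds(gates, inputs, depth, s):
    """Evaluate a levelled Boolean circuit in blocks of s levels.

    gates:  dict node -> (level, op, predecessors), op in {"AND", "OR", "NOT"};
            every predecessor must have a smaller level than the gate itself.
    inputs: dict node -> bool for the input nodes (level 0).
    Returns the dict of all node values and the number of blocks processed.
    """
    ops = {"AND": all, "OR": any, "NOT": lambda vals: not vals[0]}
    values = dict(inputs)
    rounds = 0
    for low in range(1, depth + 1, s):
        high = min(low + s - 1, depth)
        # One block corresponds to one round: all gates on levels low..high.
        block = sorted((lvl, v) for v, (lvl, _, _) in gates.items()
                       if low <= lvl <= high)
        for _, v in block:
            _, op, preds = gates[v]
            values[v] = ops[op]([values[p] for p in preds])
        rounds += 1
    return values, rounds
```

The number of blocks is ⌈depth/s⌉, which is in O(depth(C n )/ log n) for s ∈ Θ(log n).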
Conclusion and Research Opportunities
In a substantial improvement over all previously known results, we have shown that NC i+1 ⊆ DMRC i for all i ∈ N. In the case of NC 1 ⊆ DMRC 0 , we have proved this result for every feasible choice of ε in the model, that is, for 0 < ε ≤ 1/2. For i > 0, we have shown the result to hold for all but one value, namely ε = 1/2.
Achieving these two results required a detailed description of two different, delicate simulations within the MapReduce framework. For the case of NC 1 , which is particularly relevant in practice, we applied Barrington's theorem and simulated width-bounded branching programs [2] , whereas we directly simulated the circuits for the higher levels of the hierarchy. We emphasize that neither of the two approaches can replace the other: Barrington's theorem only gives a characterization for the first level of the NC hierarchy, and the second approach does not even yield NC 1 ⊆ MRC 0 . (Recall that DMRC is just the deterministic variant of MRC, so we have DMRC i ⊆ MRC i for all i ∈ N.)

We would like to briefly address the small question that immediately arises from our result, namely whether it is possible to extend the inclusion NC i+1 ⊆ DMRC i of Theorem 10 to the case ε = 1/2. Going through all involved lemmas, we see that the two reasons that our proof does not work in this corner case are the sorting of the nodes using Corollary 15 and the construction of the up-circuits and down-circuits in Lemma 18. Regarding the former, we can avoid the restriction by allowing randomization. For the latter, it is not clear that this can be achieved, however. If we found a way to construct the levels for ε = 1/2 as well, then Theorem 10 would immediately extend to the full range 0 < ε ≤ 1/2 of feasible choices for ε.
Besides dealing with the small issue mentioned above, the natural next step for future research is to take the complementary approach and address the reverse relationship: Having shown in this paper how to obtain efficient deterministic MapReduce algorithms for parallelizable problems, we now aim to include DMRC i into NC i+1 for all i ∈ N, thus finally settling the long-standing open question of how exactly the MapReduce classes correspond to the classical classes of parallel computation.
