The Boolean circuit has been an important model of parallel computation, but not many parallel algorithms have been designed on this model because it is 'awkward to program.' To overcome this drawback, we propose a description language for designing parallel algorithms. This description language is to parallel algorithms what the pseudocode is to sequential algorithms. Through example codes, we show that the description language is a convenient tool to design parallel algorithms due to its general iterative and recursive structures and the ease of modular design.
Introduction
Numerous parallel algorithms have been developed on various models of parallel computation, which can be roughly classified into the following three.
1. Network models such as mesh and hypercube [12]: These network models can actually be used as the architectures of parallel machines. However, a parallel algorithm designed for a network model is usually too specific to the fixed connections of that model. A network model itself also imposes constraints on problem complexity. For example, sorting in an n × n mesh requires Ω(n) time, and a similar result applies to a three-dimensional mesh; both are widely used topologies in massively parallel processors [11].
Another important model of parallel computation is the Boolean circuit [18, 19]. Uniform Boolean circuits have long been considered as a parallel computation model [14], and the classes NC^i for i ≥ 1 and NC are defined by them. Sipser [15] describes the following advantages and disadvantages of the Boolean circuit.
• It is simple and realistic.
• It is 'awkward to program' because individual processors (AND, OR, NOT) are so weak.
The latter is the main reason why not many parallel algorithms have been designed on this model. On the other hand, 'programming logic circuits' using VHDL and Verilog, which is inherently parallel programming, is a main task for hardware designers. However, apart from basic problems such as multiplication and FFT, very few parallel algorithms are readily available to hardware designers. Hence, there is a mismatch between theory (parallel algorithms) and practice (circuit design).
In this paper we use the Boolean circuit as the model of parallel algorithm design, and present a way to overcome its drawback (awkward to program). We propose a description language like VHDL [4] for designing parallel algorithms, which we call the BC (Boolean Circuit) language. We explain the language features of BC by giving example codes for the problems of counting and sorting. Although VHDL and Verilog also describe Boolean circuits, they are very complicated languages because they must describe all physical behaviors of hardware logic circuits. BC is a simplified version of VHDL that retains only the features necessary for algorithm design. Some features such as recursion are generalised in BC to facilitate algorithm description. Still, the mapping from the BC language to VHDL is straightforward.
Our work has analogies with algorithm design in sequential computation. The model of sequential computation is the 'random access machine' (RAM) [1, 14], and its instruction set is at the level of assembly languages. Thus, designing an algorithm with the RAM's instruction set would be painful. Fortunately, most algorithms are described in pseudocode [6], because any code in a high-level language can be translated to assembly code by compilers and this process is well understood [3]. Similarly, VHDL-like codes can be translated to logic circuits by high-level synthesizers and even to silicon layouts by silicon compilers. In fact, the codes in this paper were synthesized to logic circuits by a VHDL synthesis tool.
Our approach to parallel computation meets the needs from both theory and practice. It provides a convenient tool (description language) to parallel algorithm designers, and any parallel algorithm designed in this description language can be automatically translated to logic circuits, which means that this approach is very realistic. Our approach puts forward the Boolean circuit not just as a model to study complexity classes but also as a model to design parallel algorithms.
When we talk about practice in regard to parallel algorithms, there are two different fields: one is parallel computing in existing parallel machines where each processor is quite powerful, and the other is parallel programming in logic circuits where the standard set of gates is {AND, OR, NOT}. Boolean circuit programming suggests that parallel programming in logic circuits is a more appropriate context for developing parallel algorithms and studying parallel complexity of problems. 
Preliminaries
A Boolean circuit consists of inputs, outputs, logic gates, and directed wires. The inputs, outputs, and logic gates are connected by the wires. If we consider the inputs, outputs, and logic gates as nodes and the wires as edges, a Boolean circuit can be represented as a graph. A combinatorial Boolean circuit is a Boolean circuit that can be represented as a directed acyclic graph, i.e., there are no cycles in the graph (circuit). Since we only consider the combinatorial Boolean circuit in this paper, we will say just 'circuit' instead of combinatorial Boolean circuit. We define the depth and the size of a circuit. The depth of a circuit is the number of logic gates on a longest path from an input to an output. The depth of a circuit is well-defined because the circuit is acyclic, and it corresponds to the worst-case running time of the circuit. The size of a circuit is the number of inputs, outputs, and logic gates in it. Note that the number of wires is proportional to the number of logic gates and outputs because the number of inputs to a logic gate is either one or two and the number of wires connected to an output is one. We describe some circuits that are used as building blocks in composing more complicated circuits. The basic circuits are logic gates (AND, OR, NOT) and additional components are multiplexers (MUX), demultiplexers (DEMUX), adders (ADD), and carry save adders (CSADD).
• Logic gates (AND, OR, NOT): Normally, an AND (resp. OR) gate has two input wires (Fig. 1 (a) and (b)) and a NOT gate has one input wire (Fig. 1 (c)). However, we allow the AND gate and the OR gate to have n input wires for an arbitrary integer n ≥ 2 to simplify the description of circuits (Fig. 1 (d)). Actually, the n-input OR gate is a circuit composed of n − 1 2-input OR gates that are connected in a tree-like fashion (Fig. 1 (f)). The internal structure of an n-input AND gate is similar. Thus, the depth of the n-input OR (AND) gate is O(log n) and its size is O(n). Furthermore, we allow the OR gate and the AND gate to have b-bit integers as inputs and output (Fig. 1 (e)). A b-bit n-input OR gate is a circuit composed of b n-input OR gates such that the ith n-input OR gate, 0 ≤ i ≤ b − 1, gets the ith bits of the n input integers as inputs and produces the ith bit of the b-bit output (Fig. 1 (g)). The internal structure of a b-bit n-input AND gate is similar. Since a b-bit n-input OR (AND) gate is composed of b n-input OR (AND) gates that operate in parallel, the depth of the b-bit n-input OR (AND) gate is O(log n) and its size is O(bn).
• MUX (
• DEMUX (
• Adder (Fig. 4 (a)): The depth of the k-bit adder is O(log k) and its size is O(k) [6].
• Carry save adder ( Fig. 4 (b) ): The k-bit carry save adder gets three k-bit integers
The depth of the k-bit carry save adder is O(1) and its size is O(k) [6] . 
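To make the constant depth of the carry save adder concrete, the following Python sketch models the usual 3:2 compression: three integers go in, and a sum word and a carry word come out whose total equals the total of the three inputs. The function name `carry_save_add` and the use of Python integers as bit vectors are assumptions made for this illustration; they are not part of BC.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple:
    """Compress three non-negative integers into two with the same total.

    Every bit position is handled independently by one full adder,
    which is why the depth does not depend on the word length k.
    """
    sum_word = a ^ b ^ c                              # per-bit sum, carries ignored
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, moved up one position
    return sum_word, carry_word


if __name__ == "__main__":
    s, c = carry_save_add(5, 3, 6)
    print(s, c, s + c)  # prints "0 14 14"
```

No carry propagates between bit positions here; the propagation is deferred to a single ordinary adder that finally adds the two output words.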
The BC language
We explain how to describe Boolean circuits using the language BC. The language features of BC are essentially the same as those of VHDL with some exceptions. Those features of VHDL that are not directly related to algorithm design were eliminated or simplified. Some features such as recursion were generalised to facilitate algorithm description. We first introduce the basic rules to describe the circuit elements and the interconnections between them, and then advanced features that are necessary to describe the iterative structures in circuits.
The basic rules to describe circuit elements (logic gates, components, wires) and the interconnections between them are as follows.
• A wire is represented as a variable such as A and B. A set of wires is represented by an array of variables such as A(0..7) and C(0..n − 1)(0..b − 1). For brevity, an omitted index in an array means the whole range, e.g., C for C(0..n − 1)(0..b − 1) and C(1) for C(1)(0..b − 1).
• Logic gates are represented as operators such as AND, OR, and NOT.
• The interconnection between wires and logic gates is described using a set of assignment statements. For example, if two wires A and B are connected to the inputs of an AND gate and a wire C is connected to the output of the AND gate, the description of the interconnection is 'C ← A AND B.'
• Components are represented as functions. A function for a component is composed of two parts. In the first part, we define the wires (variables) associated with the component. The input and output wires of the component are defined using the type identifiers in and out, respectively. The wires inside the component are defined using the type identifier signal. In the second part, which is enclosed by begin and end, we describe the internal structure of the component. For example, a function for the 2 × 1 MUX (Fig. 2 (b)), named 2X1MUX, is shown in Fig. 5 (a). Some functions may have a varying number of elements in an array for inputs and/or outputs. For example, the function for the b-bit 2^k × 1 MUX (Fig. 2 (a)) should have b · 2^k + k elements in its input arrays and b elements in its output array. To describe a function like this, we use the keyword generic. The keyword generic indicates the use of special parameters such that the numbers of elements in arrays for inputs and outputs are represented as mathematical expressions of the special parameters. Function MUX for describing the b-bit 2^k × 1 MUX is shown in Fig. 5 (c). In function MUX, k and b are defined as special parameters and the numbers of inputs and outputs are defined using k and b.
The internal structure of function MUX will be clarified after we introduce recursion.
We show how to describe iterative structures in circuits. In an iterative structure, a similar structure appears multiple times with small regular variations. The iterative structures are divided into two categories: parallel iterative structures and sequential iterative structures. In a parallel iterative structure, there are no wires connecting the structures constituting the parallel iterative structure. In a sequential iterative structure, there are. The parallel and sequential iterative structures are not distinguished in VHDL, but we distinguish them in BC to enhance readability of codes. We first show how to describe parallel iterative structures and then sequential iterative structures.
Parallel iterative structures appear in the 8-input OR gate in Fig. 1 (f). The 8-input OR gate is a circuit composed of seven 2-input OR gates that are connected in a tree-like fashion: the 8 input wires are connected to four 2-input OR gates, then the four output wires of the four 2-input OR gates are connected to two 2-input OR gates, and finally the two output wires are connected to a 2-input OR gate. Thus, there are three parallel iterative structures in the 8-input OR gate, which are composed of four, two, and one 2-input OR gates, respectively. To describe the parallel iterative structure, we use pfor. Let us assume that 8 input wires are represented as A(i) for 0 ≤ i ≤ 7 and 4 output wires as B(i) for 0 ≤ i ≤ 3. Three parallel iterative structures consisting of 4, 2, and 1 OR gates form a sequential iterative structure in the 8-input OR gate. To describe the sequential iterative structure, we use sfor. Let us assume that wires are represented as A(i)(j) for 0 ≤ i ≤ 3 and 0 ≤ j ≤ 2^i − 1, where A(i)(j) denotes the jth wire to the ith parallel iterative structure.
A(i)(j) ← A(i + 1)(2j) OR A(i + 1)(2j + 1); } }
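The nesting of a sequential loop around a parallel loop can be imitated in ordinary Python. The sketch below is only an analogy under stated assumptions: the name `or_tree`, the list `wires` standing in for A(i)(j), and the returned gate count are all introduced here for illustration, and Python of course executes the inner loop one iteration at a time, whereas in a circuit those gates operate simultaneously.

```python
def or_tree(inputs: list) -> tuple:
    """Evaluate a 2^k-input OR gate built from 2-input OR gates.

    Returns (output bit, depth, number of 2-input OR gates used).
    """
    n = len(inputs)
    if n < 2 or n & (n - 1):
        raise ValueError("number of inputs must be a power of 2, at least 2")
    k = n.bit_length() - 1
    # wires[i][j] plays the role of A(i)(j); level k holds the inputs.
    wires = [[0] * (1 << i) for i in range(k + 1)]
    wires[k] = list(inputs)
    gates = 0
    # Sequential structure: level i reads the wires of level i + 1.
    for i in range(k - 1, -1, -1):
        # Parallel structure: the gates on one level share no wires.
        for j in range(1 << i):
            wires[i][j] = wires[i + 1][2 * j] | wires[i + 1][2 * j + 1]
            gates += 1
    return wires[0][0], k, gates


if __name__ == "__main__":
    print(or_tree([0, 0, 0, 0, 0, 1, 0, 0]))  # prints "(1, 3, 7)"
```

For 8 inputs the sketch reports depth 3 and 7 gates, matching the 4 + 2 + 1 gates counted in the text.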
Describing (and reading) an iterative structure is much easier if we replace it with recursion. To describe recursion, keywords if and else are used. Keywords if and else are used to describe a conditional structure that is dependent on special parameters indicated by generic, or indices of pfor or sfor loops. A recursive function ORF describing the b-bit 2^k-input OR gate, k ≥ 1, is shown in Fig. 5 (b). A full description of the b-bit 2^k × 1 MUX using recursion is shown in Fig. 5 (c).
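A rough Python analogue may help to see why recursion reads more easily. The functions `orf` and `mux` below are assumptions of this sketch: they follow the usual halving construction and are not a transcription of the BC codes in Fig. 5. Integers stand for b-bit words, and the MUX select input is given as an integer in the range 0 to 2^k − 1.

```python
def orf(k: int, values: list) -> int:
    """Bitwise OR of 2^k integers, k >= 1, by recursive halving."""
    if k < 1 or len(values) != 1 << k:
        raise ValueError("expected 2^k values with k >= 1")
    if k == 1:
        return values[0] | values[1]          # base case: one 2-input OR gate
    half = 1 << (k - 1)
    return orf(k - 1, values[:half]) | orf(k - 1, values[half:])


def mux(k: int, data: list, select: int) -> int:
    """2^k x 1 multiplexer: return data[select]."""
    if len(data) != 1 << k or not 0 <= select < (1 << k):
        raise ValueError("expected 2^k data words and a k-bit select value")
    if k == 0:
        return data[0]
    half = 1 << (k - 1)
    # Both halves are evaluated, as both sub-circuits exist in hardware.
    low = mux(k - 1, data[:half], select % half)
    high = mux(k - 1, data[half:], select % half)
    top_bit = select >> (k - 1)
    return high if top_bit else low            # final 2 x 1 MUX


if __name__ == "__main__":
    print(orf(3, [1, 2, 4, 0, 0, 0, 0, 8]))    # prints "15"
    print(mux(3, list(range(10, 18)), 5))      # prints "15"
```

The conditional on k corresponds to the role that if and else play for special parameters: it is resolved when the circuit is generated, not while the circuit runs.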
Remark.
A Boolean circuit can be described either recursively or iteratively, and either way the final logic circuits produced by a synthesis tool are the same. Hence, using recursion in circuit design is highly recommended. In contrast, recursion in sequential computation suffers performance degradation due to the use of system stacks.
Comparison
We describe a 2^k-bit magnitude comparator (2^k-COMP), k ≥ 0, that gets two 2^k-bit integers a and b as inputs and produces a pair of bits (x, y). The output (x, y) is (1, 0) if a = b, (0, 1) if a > b, and (0, 0) if a < b. A 2^0-COMP (1-COMP) consists of three AND gates, two NOT gates, and an OR gate (Fig. 6 (a)). A 2^k-COMP, k ≥ 1, is constructed recursively from two 2^(k−1)-COMPs and a 2-input 2 × 1 MUX (Fig. 6 (b)). The correctness of a 2^k-COMP is as follows. Fig. 7 (a). Note that the wires inside 1-COMP (i.e., T(0..1)) and the wires inside 2^k-COMP, k ≥ 1 (i.e., P(0..1)(0..1), Q(0..1), and R) are not the same and their definitions are affected by special parameter k.
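The recursive construction can be checked with a small Python model. This is a sketch under assumptions: the bit lists are most significant bit first, the helper names `comp` and `to_bits` are introduced here, and the MUX is assumed to be selected by the "equal" output of the comparator on the upper halves, which is one natural wiring consistent with the output encoding above.

```python
def to_bits(value: int, width: int) -> list:
    """Return the bits of value, most significant bit first."""
    return [(value >> (width - 1 - p)) & 1 for p in range(width)]


def comp(a_bits: list, b_bits: list) -> tuple:
    """Return (x, y): (1, 0) if a = b, (0, 1) if a > b, (0, 0) if a < b."""
    n = len(a_bits)
    if n != len(b_bits) or n == 0 or n & (n - 1):
        raise ValueError("both inputs need the same power-of-2 length")
    if n == 1:
        a, b = a_bits[0], b_bits[0]
        not_a, not_b = 1 - a, 1 - b        # two NOT gates
        x = (a & b) | (not_a & not_b)      # two AND gates and one OR gate
        y = a & not_b                      # third AND gate
        return x, y
    half = n // 2
    upper = comp(a_bits[:half], b_bits[:half])
    lower = comp(a_bits[half:], b_bits[half:])
    # 2 x 1 MUX: if the upper halves are equal, the lower halves decide.
    return lower if upper[0] else upper


if __name__ == "__main__":
    print(comp(to_bits(9, 4), to_bits(9, 4)))   # prints "(1, 0)"
    print(comp(to_bits(12, 4), to_bits(9, 4)))  # prints "(0, 1)"
    print(comp(to_bits(3, 4), to_bits(9, 4)))   # prints "(0, 0)"
```

Each level of the recursion adds only the constant depth of the 2 × 1 MUX, so the depth of the 2^k-bit comparator in this model grows linearly in k, i.e., logarithmically in the word length.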
Bit counting
We describe a bit counter that gets n bits as inputs and produces the number of 1 bits among the n bits, i.e., the sum of the n bits. Thus, the output can be as large as ⌈log(n + 1)⌉ bits. The bit counter consists of carry save adders and an adder. In fact, the bit counter is a variant of a Wallace tree [6], which is designed to add n n-bit numbers with O(log n) depth and O(n^2) size. However, since the bit counter adds 1-bit numbers, its size is reduced to O(n). The internal structure of the bit counter can be divided into stages. Each stage i, i ≥ 1, gets i-bit integers. Let x_i denote the number of i-bit integers. Stage i (except the last) outputs ⌈2x_i/3⌉ (= x_i − ⌊x_i/3⌋) (i + 1)-bit integers. The last stage takes two integers and outputs their sum. The number of stages is O(log n) because the number of output integers of each stage (except the last) is about 2/3 of the number of input integers of the stage. Hence, the carry save adders in stage i are i-bit carry save adders and the adder in the last stage is an O(log n)-bit adder.
For example, a bit counter taking 6 bits as inputs is shown in Fig. 8. In the first stage, the six input bits are reduced to four 2-bit integers. In the second and the third stages, the four 2-bit integers are reduced to three 3-bit integers and then to two 4-bit integers. Note that the integer that is an output of a 1-CSADD and directly connected to a 3-CSADD should be expanded by one bit. In the last stage, the two 4-bit integers are added by an adder. Although the adder outputs a 5-bit integer, we take only its 3 least significant bits because the sum of 6 input bits is at most 6 and thus 3 bits are sufficient for the sum.
Consider the depth and the size of the bit counter. Since the depth of a carry save adder is O(1) and the depth of the O(log n)-bit adder is O(log n), the depth of the bit counter is O(log n). We compute the size of the bit counter by adding the sizes of all carry save adders and the O(log n)-bit adder. Because stage i (except the last) has O(n · (2/3)^(i−1)) i-bit carry save adders, each of size O(i), the total size of all carry save adders in the bit counter is
Σ_{i ≥ 1} O(i · n · (2/3)^(i−1)) = O(n).
Since the size of the O(log n)-bit adder is O(log n), the size of the bit counter is O(n). The bit counter is described as a function call to recursive function ITADD. The function ITADD, taking n b-bit numbers and producing the sum of the n numbers, is shown in Fig. 7 (b). The bit counter for n inputs is described as a function call 'ITADD generic (n, 1, ⌈log(n + 1)⌉) (X(0..n − 1), Y(0..⌈log(n + 1)⌉ − 1)).'
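The stage structure of the bit counter can be simulated in a few lines of Python. The sketch is iterative rather than recursive and is not the ITADD code of Fig. 7 (b); the names `csa` and `bit_count`, and the convention that integers left over in a stage pass through unchanged, are assumptions made here for illustration.

```python
def csa(a: int, b: int, c: int) -> tuple:
    """Carry save adder: three integers in, two integers with the same total out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def bit_count(bits: list) -> tuple:
    """Return (number of 1 bits, number of carry save adder stages used)."""
    values = list(bits)
    stages = 0
    while len(values) > 2:
        full = len(values) - len(values) % 3
        reduced = []
        for t in range(0, full, 3):          # independent adders within one stage
            reduced.extend(csa(*values[t:t + 3]))
        reduced.extend(values[full:])        # leftovers pass to the next stage
        values = reduced
        stages += 1
    return sum(values), stages               # last stage: one ordinary adder


if __name__ == "__main__":
    print(bit_count([1, 0, 1, 1, 0, 1]))  # prints "(4, 3)"
```

For 6 input bits the number of integers shrinks as 6, 4, 3, 2, so three carry save adder stages precede the final adder, in line with the example of Fig. 8.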
Sorting
We describe a sorting circuit that sorts n integers (of b bits each) with O(log n + log b) depth and O(bn^2) size. We represent the sequence of input integers as a and each input integer as a_i (0 ≤ i ≤ n − 1). We represent the sequence of (sorted) output integers as a′ and each output integer as a′_i (0 ≤ i ≤ n − 1). The sequence of output integers is sorted in nondecreasing order. When two integers a_i and a_j are equal, a_i appears after a_j in the output sequence if a_i is after a_j in the input sequence, i.e., i > j. Hence, a_i will be after a_j in the output sequence if and only if the Boolean expression (a_i > a_j) ∨ (a_i = a_j ∧ i > j) is true. We assume that n and b are powers of 2. (If not, we can take the smallest power of 2 that is at least n or b, respectively.)
This sorting circuit consists of three top-level components, which are PAIRWISE COMPARISON, COMPUTE RANK, and PERMUTATION. This is a naive sorting algorithm that is also described in [18], but it shows many interesting aspects of Boolean circuit design. In addition, it produces a sorting circuit of optimal depth O(log n + log b). We first overview these top-level components, then describe the internal structures of these components, and finally consider their depths and sizes.
• PAIRWISE COMPARISON: For every pair of input integers a_i and a_j (0 ≤ i, j ≤ n − 1), we determine whether or not a_i will appear after a_j in the output sequence by evaluating (a_i > a_j) ∨ (a_i = a_j ∧ i > j). If the Boolean expression is true, we set c_ij to 1, and to 0 otherwise.
• COMPUTE RANK: For each a_i, we compute the rank r_i of a_i in the output sequence, i.e., the number of input integers that will appear before a_i in the output sequence. The rank r_i of a_i is obtained by r_i = c_i0 + · · · + c_i,n−1.
• PERMUTATION: Each a_i is sent to position r_i of the output sequence.
The code for the interconnection of these top-level components is shown in Fig. 11 (a). We now describe the internal structures of these three components. Component PAIRWISE COMPARISON (Fig. 9) consists of n^2 blocks such that each block computes c_ij (0 ≤ i, j ≤ n − 1) in parallel. To compute c_ij, each block needs to evaluate the Boolean expression (a_i > a_j) ∨ (a_i = a_j ∧ i > j). To compare a_i and a_j, we use a b-bit comparator. Additionally, we use an AND gate and an OR gate. The Boolean expression i > j can be evaluated by the synthesis tool at compile time because i and j are fixed regardless of the values of a_i and a_j; thus (i > j) is a constant 0 or 1. The depth of PAIRWISE COMPARISON is O(log b) due to the b-bit comparator, and its size is n^2 times the size of the b-bit comparator, which is O(bn^2). The code for PAIRWISE COMPARISON is shown in Fig. 11 (b).
Component COMPUTE RANK consists of n bit counters such that each bit counter computes the rank r_i = c_i0 + · · · + c_i,n−1 (0 ≤ i ≤ n − 1) in parallel. Its depth is the depth of the bit counter, which is O(log n). Its size is n times the size of the bit counter and thus it is O(n^2). The code for COMPUTE RANK is shown in Fig. 11 (c).
Component PERMUTATION consists of n b-bit 1 × n DEMUXs and n b-bit n-input OR gates (Fig. 10). The ith DEMUX, 0 ≤ i ≤ n − 1, from the left gets a_i and r_i and sends a_i to its r_i-th output wire, which is connected to the r_i-th OR gate from the left. The depth of this component is the sum of the depth of the DEMUX and the depth of the OR gate, which is O(log n) because the depths of the DEMUX and the OR gate are both O(log n). Its size is n times the sum of the sizes of the DEMUX and the OR gate, which is O(bn^2), because the sizes of the DEMUX and the OR gate are both O(bn). The code for PERMUTATION is shown in Fig. 11 (d).
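The three components can be traced end to end with a short Python model. It is a sequential simulation under assumptions, not the BC code of Fig. 11: the function name `rank_sort` is introduced here, the inputs are assumed to be non-negative integers, and each DEMUX is assumed to drive its unselected outputs with 0 so that the OR gates can merge the results.

```python
def rank_sort(a: list) -> tuple:
    """Sort non-negative integers by ranking; return (sorted list, ranks)."""
    n = len(a)
    # PAIRWISE COMPARISON: c[i][j] = 1 iff a[i] must appear after a[j].
    c = [[int(a[i] > a[j] or (a[i] == a[j] and i > j)) for j in range(n)]
         for i in range(n)]
    # COMPUTE RANK: one bit counter per row of c.
    r = [sum(row) for row in c]
    # PERMUTATION: DEMUX i puts a[i] on output r[i] and 0 elsewhere;
    # OR gate p merges the p-th outputs of all DEMUXs.
    out = [0] * n
    for i in range(n):
        demux = [a[i] if p == r[i] else 0 for p in range(n)]
        for p in range(n):
            out[p] |= demux[p]
    return out, r


if __name__ == "__main__":
    print(rank_sort([3, 1, 2, 1]))  # prints "([1, 1, 2, 3], [3, 0, 2, 1])"
```

Because ties are broken by input position, the ranks form a permutation of 0..n − 1, so every OR gate receives exactly one DEMUX output that carries an input integer and the sort is stable.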
Theorem 1
The depth and the size of the sorting circuit are O(log n + log b) and O(bn^2), respectively.
