This paper presents the following algorithms to compute the sum of n d-bit integers on reconfigurable parallel computation models: (1) 
Introduction
A reconfigurable mesh (RM) is a processor array that consists of processors arranged in a two-dimensional grid and a dynamically reconfigurable bus system (Fig. 1 ). There is a link between the ports of each two adjacent processors and the four ports of each processor can be connected or disconnected locally during execution of an algorithm. Each connected component formed by links and internal connections constitutes a subbus. On a RM of the word model, each processor can execute one of the word operations and a word of data can be transferred through a subbus in one unit of time. On the bit model, a bit operation and the transfer of a bit of data requires one unit of time.
RMs have recently attracted considerable attention as theoretical models of parallel computation, and many studies have been devoted to developing efficient parallel algorithms on RMs. For example, they efficiently solve problems such as sorting [25, 31], selection [8, 10], arithmetic operations [2, 18, 6], graph problems [20, 15, 33], geometric problems [24, 28], and image processing [1, 16, 17, 3].

[Table 1. Rows: Nakano [22]; Chen et al. [7]; Jang et al. [11]; Jang et al. [13]; this paper; this paper.]

This paper deals with the problems of computing the sum of n binary values and the sum of n d-bit integers, and it describes efficient algorithms on RMs. These problems are important because they are fundamental procedures used as subroutines in many algorithms: sorting, selection, geometric algorithms, graph algorithms, arithmetic operations, matrix computations, and so on. Furthermore, these summing problems have no highly parallelized algorithms on traditional parallel machine models without a dynamically reconfigurable bus system: even a PRAM requires $\Omega(\log n/\log\log n)$ time to solve these problems [4]. It is, therefore, interesting to find highly parallelized algorithms for these problems on RMs. Tables 1 and 2 list relevant results known previously as well as results presented in this paper. We will first show that for n binary values given to processors on every $\sqrt{m}$ rows, their sum can be computed in $O(\log^* n - \log^* m)$ ($1 \le m \le \log n$) time on a $\sqrt{nm} \times \sqrt{n}$ RM of the bit model, where $\log^{(k+1)} n = \log(\log^{(k)} n)$ for all $k$, $\log^{(1)} n = \log n$, and $\log^* n$ is the minimum integer $k$ such that $\log^{(k)} n \le 1$. The key idea of the algorithm is an implementation of Nakano's summing algorithm [22] in the $\sqrt{nm} \times \sqrt{n}$ RM. From this algorithm, we can get an $O(\log^* n)$-time algorithm on a $\sqrt{n} \times \sqrt{n}$ RM and an $O(1)$-time algorithm on a $\sqrt{n}\log^{(O(1))} n \times \sqrt{n}$ RM.
These algorithms improve on Jang's algorithms [11, 13], which were the best previously known, in the sense that our algorithms use fewer processors.
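The iterated logarithm that appears in these bounds can be made concrete with a small sketch. The function names `iterated_log` and `log_star` are illustrative and do not come from the paper, and base-2 logarithms are assumed.

```python
import math


def iterated_log(n: float, k: int) -> float:
    """Return log^(k) n: the base-2 logarithm applied k times (assumed base)."""
    value = float(n)
    for _ in range(k):
        value = math.log2(value)
    return value


def log_star(n: float) -> int:
    """Return the minimum k >= 1 such that log^(k) n <= 1, for n >= 1."""
    value = float(n)
    k = 0
    while True:
        value = math.log2(value)
        k += 1
        if value <= 1:
            return k
```

Under these assumptions `log_star(65536)` evaluates to 4, since the successive logarithms are 16, 4, 2, and 1.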
We will also present an algorithm that computes the sum of n d-bit integers in $O(\log^* n - \log^* m)$ ($1 \le m \le \log n$) time on a $\sqrt{nm} \times d\sqrt{n}$ RM of the bit model.

Table 2. Algorithms for summing n d-bit integers

Algorithm               Model   Size of RM
Wang et al. [6]         bit     $dn \times dn$
Ben-Asher et al. [5]    bit     $n \times n \times n$ for n n-bit integers
Jang et al. [14]        bit     $n \times dn$
Jang et al. [13]        bit     $\min(dn \times n,\ dn \times d\log^2 n)$
Olariu et al. [27]      word    $n \times n$ for n log n-bit integers
Chen et al. [7]         word    $\sqrt{n} \times \sqrt{n}$
Fragopoulou [9]         word    $\sqrt{n} \times \sqrt{n}$
This paper              bit     $\sqrt{nm} \times d\sqrt{n}$ ($1 \le m \le \log n$)
This paper              bit     $\sqrt{n}\log^{(O(1))} n \times d\sqrt{n}$
This paper              bit     $\sqrt{n/\log^* n} \times d\sqrt{n/\log^* n}$
This paper              word    $\sqrt{n/(\log d + \log^* n)} \times \sqrt{n/(\log d + \log^* n)}$

This algorithm is based on the two-stage summing method (Fig. 7), in which the sum of each digit of the given integers is computed and then the sum of these sums is computed. In this algorithm, the sum of each digit is computed by the $O(\log^* n - \log^* m)$-time algorithm for binary values above, and the sum of these sums is computed by using Jang's algorithm [14]. Thus, a $\sqrt{n}\log^{(O(1))} n \times d\sqrt{n}$ RM is sufficient for constant-time summing, and if the number of processors is the same as the number of input bits, the sum can be computed in $O(\log^* n)$ time. We will then consider the case where n d-bit integers are given to a $\sqrt{n} \times \sqrt{n}$ RM of the word model and present an $O(\log d + \log^* n)$-time algorithm. We will also show that the size of the RM can be reduced to $\sqrt{n/(\log d + \log^* n)} \times \sqrt{n/(\log d + \log^* n)}$ without asymptotically increasing the computing time. The best previously known algorithm, by Fragopoulou [9], takes $O(d + \log\log n)$ time, so our algorithm improves on it. This paper also deals with a VLSI reconfigurable circuit (VLSI RC) under the assumption that a switch with two terminals and a single control terminal is available.
By the signal sent to the control terminal, the switch connects or disconnects the two terminals. Other assumptions of this treatment of the VLSI RC are the same as those of the usual VLSI model [30]. Under the usual VLSI model, Muller et al. [21] presented a VLSI circuit with $O(n)$ gates and $O(\log n)$ depth to compute the sum of n binary values. Based on this circuit, Wada et al. [32] showed a VLSI circuit of area $O(n/\log n)$ to compute the sum of n binary values in $O(\log n)$ time. This VLSI algorithm attains the trivial AT lower bound $\Omega(n)$. For summing n d-bit integers, Wada et al. [32] also presented a VLSI circuit constructed with area $O(dn + (d\log n)^2)$ and time $O(\log n + \log d)$. In this paper, we will show that the sum of n binary values can be computed in $O(\log^* n)$ time using a VLSI RC of area $O(n/\log^* n)$ and that the sum of n d-bit integers can be computed in $O(\log^* n)$ time using a VLSI RC of area $O(dn/\log^* n)$. These VLSI RCs are optimal with respect to the trivial AT lower bounds $\Omega(n)$ and $\Omega(dn)$, respectively. Furthermore, simulation of the VLSI RCs by a RM of the bit model enables us to reduce the size of the RM for summing and get the following algorithms:
an $O(\log^* n)$-time algorithm for summing n binary values on a $\sqrt{n/\log^* n} \times \sqrt{n/\log^* n}$ RM of the bit model, and an $O(\log^* n)$-time algorithm for summing n d-bit integers on a $\sqrt{n/\log^* n} \times d\sqrt{n/\log^* n}$ RM of the bit model. These algorithms are cost optimal in the sense that the product of the time and the number of processors is equal to the number of bits of the input. This paper is organized as follows. Section 2 defines RMs formally, and Section 3 briefly explains the basic algorithms and techniques used in this paper. Section 4 presents a summing algorithm for the bit model, Section 5 presents a summing algorithm for the word model, and Section 6 presents an AT optimal VLSI RC for the summing problems.
Reconfigurable mesh
This section defines a RM. A reconfigurable mesh (RM) consists of processors arranged in a grid. An n RM is a RM with n processors PE(0), PE(1), ..., PE(n-1), in which for each i, PE(i) and PE(i+1) are connected with a link (Fig. 1). An n x m RM is a RM in which n x m processors are arranged in a two-dimensional grid in which any two adjacent processors are connected with a link (Fig. 1).
The control mechanism of the RM is based on the SIMD principle. A single control unit dispatches instructions to each processor. Although all processors execute the same instructions, their behaviors may differ, because they work on different inputs and different coordinates. Each processor has locally controllable switches that can configure the connection patterns of its four ports, denoted by N (North), E (East), W (West), and S (South). The computing power of RMs depends on the connection patterns that are allowed. For example, if the cross-over pattern (i.e., the connection of N and S and that of E and W are configured independently) is not allowed, a $\sqrt{n} \times \sqrt{n}$ RM requires $\Omega(\log^* n)$ time to compute the parity of n binary values, whereas if all patterns are allowed it can compute the parity in constant time [19]. If the branch patterns (i.e., three or four ports are connected internally) are allowed, the connected components of an n-node graph can be labeled in constant time on an n x n RM [34], whereas we have not found an algorithm that can do this labeling in constant time when branch patterns are not allowed. In this paper, we assume that processors in a RM can configure any pattern except branch patterns.
The connected components formed by links and internal connections constitute subbuses, through which the processors can communicate. We assume that any broadcast through a subbus takes a unit of time, and that for each subbus only one processor can send a piece of data to it. We deal with two models of RMs, the bit model and the word model. On the RM of the bit model, each processor has a constant storage size and the bandwidth of each subbus is 1; that is, either 0 or 1 can be transferred through the subbus. On the RM of the word model, each processor has an arbitrarily large storage size, can perform basic arithmetic operations (addition, multiplication, division, log, square root, and so on), and can transfer an arbitrarily large value through a subbus in a unit of time.
A VLSI reconfigurable circuit (VLSI RC), which is an extension of the usual VLSI model [30], has a switch device with two terminals and one control terminal. By the signal sent to the control terminal, the switch connects or disconnects the two terminals. If two terminals are connected, a signal sent to a terminal passes to the other terminal without delay. A 4-terminal switch that can configure every internal connection pattern of a processor of the bit-model RM can be constructed from six switches as shown in Fig. 2. We assume that a switch occupies constant area, hence a 4-terminal switch also occupies constant area. We will use 4-terminal switches to implement an algorithm executed on RMs in a VLSI RC. Since we are interested in the computational power of reconfiguration in terms of the capacity for summing, we assume that the computing time of an algorithm excludes the time necessary to give the input data to each processor.
To represent an integer x on a RM, we use the following formats:
VALUE: This format is available only to the word model. A processor knows the value of x. For example, the value of x is stored in the local storage of PE(0).
UNARY: Processor PE(x) knows that the integer is equal to its own index, and the other processors know that the integer is not equal to their own index.
BINARY: Let $x_{n-1} x_{n-2} \cdots x_0$ be the binary representation of x. Each PE(i) ($0 \le i \le n-1$) knows the value of $x_i$.
The conversions between these formats will be shown in the following section.
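As a minimal sketch of the three formats, the following models a one-dimensional RM as a Python list with one entry per processor. The helper names are hypothetical and only illustrate what each PE(i) holds; they say nothing about how the RM performs the conversions.

```python
def to_unary(x: int, n: int) -> list[bool]:
    """UNARY: PE(i) holds True exactly when i == x (requires 0 <= x < n)."""
    return [i == x for i in range(n)]


def to_binary(x: int, n: int) -> list[int]:
    """BINARY: PE(i) holds bit x_i of x, with x_0 the least significant bit."""
    return [(x >> i) & 1 for i in range(n)]


def from_unary(flags: list[bool]) -> int:
    """VALUE recovered from UNARY: the index of the single True entry."""
    return flags.index(True)


def from_binary(bits: list[int]) -> int:
    """VALUE recovered from BINARY."""
    return sum(bit << i for i, bit in enumerate(bits))
```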
Basic algorithms and techniques
This section briefly explains the basic algorithms and techniques used in this paper.
Basic algorithms on one-dimensional RM
First, we will note basic algorithms on a one-dimensional RM.
Lemma 3.1 (Nakano, Masuzawa and Tokura [23]). For n binary values given to an n RM, the rightmost element whose value is 1 can be determined in O(1) time.
The idea of the proof is as follows. Each processor connects its two ports if the input is 0, otherwise disconnects them. By using the subbuses thus configured, the rightmost element can be determined. Similarly, for given n Boolean values, their logical OR can be determined in O(1) time.
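The bus configuration in this proof idea can be simulated sequentially. The sketch below assumes that PE(n-1) is the rightmost processor and that every processor holding 1 writes a signal onto the subbus to its left; both conventions, and the function name, are assumptions made for illustration. The loop stands in for what the buses deliver in one step.

```python
from typing import Optional


def rightmost_one(bits: list[int]) -> Optional[int]:
    """Return the index of the rightmost 1, or None if all inputs are 0."""
    n = len(bits)
    heard = [False] * n      # heard[i]: PE(i) sees a signal arriving from its right
    signal = False           # signal currently travelling leftwards
    for i in range(n - 1, -1, -1):
        heard[i] = signal
        if bits[i] == 1:
            # PE(i) disconnects its ports and drives the subbus on its left.
            signal = True
        # A processor holding 0 keeps its ports connected, so the signal passes.
    winners = [i for i in range(n) if bits[i] == 1 and not heard[i]]
    return winners[0] if winners else None
```

A processor holding 1 that hears nothing from its right is the rightmost one; the logical OR is true exactly when such a processor exists.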
By transferring carries through the subbus, we have Lemma 3.2 (Thangavel and Muthuswamy [29] ). The sum of two n-bit integers can be computed in constant time on an n RM of the bit model.
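One way to picture the carry transfer behind Lemma 3.2 is the usual generate/kill/propagate classification of bit positions. This is a sketch under that assumption, not a description of the construction in [29]: a position with equal bits writes its carry onto the bus, and a position with different bits connects its ports so that the incoming carry passes through.

```python
def add_bits(a: list[int], b: list[int]) -> list[int]:
    """Add two equal-length bit lists (index 0 = least significant bit)."""
    assert len(a) == len(b)
    carry = 0                      # value on the subbus entering position i
    out = []
    for ai, bi in zip(a, b):
        out.append(ai ^ bi ^ carry)
        if ai == bi:
            # Generate (1,1) or kill (0,0): the bus is cut and a new carry written.
            carry = ai
        # Otherwise propagate: ports stay connected and the carry is unchanged.
    out.append(carry)              # final carry becomes the top bit
    return out
```

On the RM all positions read their incoming carry from the configured subbuses at once; the loop here only replays that behaviour in order.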
By applying the divide-and-conquer technique, we have the following lemma.
Lemma 3.3 (Nakano [22] ). The sum of n integers given to an n RM of the word model can be computed in O(logn) time.
Integer summing by using lookahead carry generators
This subsection briefly explains an integer summing algorithm by Jang et al. The algorithm computes the sum of n d-bit integers in O(1) time on a 2n x 2dn RM of the bit model. See [14] for details. Fig. 3 shows a lookahead carry generator that sums four binary values and a carry in the right part and divides the result by two in the left part. The carry enters from the rightmost column and the sum is represented by the position of the topmost processor that the signal goes through. In Fig. 3, the carry is 3 and 0 + 1 + 0 + 1 is added to it.
Then, the position of the topmost processor that the signal goes through in the leftmost column corresponds to the carry to the next digit. For example, in the figure, the carry to the next digit is $\lfloor (3 + 0 + 1 + 0 + 1)/2 \rfloor = 2$.
By using n lookahead carry generators as shown in the figure, we can compute the sum of n d-bit integers in constant time on the 2n x 2dn RM. Furthermore, since the product of two n-bit integers corresponds to the sum of n (2n-1)-bit integers, it can be computed in constant time on a $2n \times 4n^2$ RM of the bit model. Consequently, we have Lemma 3.4. The sum of n d-bit integers can be computed in constant time on a 2n x 2dn RM of the bit model, and the product of two n-bit integers can be computed in constant time on a $2n \times 4n^2$ RM of the bit model.
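The arithmetic carried out by a chain of lookahead carry generators can be sketched as follows. The function names are hypothetical, and the loop visits the digits one after another, whereas on the RM the signal passes through all generators in a single step.

```python
def carry_generator(carry_in: int, column_bits: list[int]) -> tuple[int, int]:
    """Add a column of bits to the incoming carry.

    Returns (result bit of this digit, carry to the next digit).
    """
    total = carry_in + sum(column_bits)
    return total % 2, total // 2


def sum_integers(values: list[int], d: int) -> int:
    """Sum d-bit integers digit by digit, passing the carry along."""
    carry = 0
    bits = []
    for j in range(d):
        column = [(v >> j) & 1 for v in values]
        bit, carry = carry_generator(carry, column)
        bits.append(bit)
    while carry:                   # write out the remaining carry as high bits
        bits.append(carry % 2)
        carry //= 2
    return sum(bit << i for i, bit in enumerate(bits))
```

With the values of Fig. 3, `carry_generator(3, [0, 1, 0, 1])` gives a carry of 2 to the next digit, matching the example in the text.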
A more efficient multiplication algorithm that computes the product of two n-bit integers in constant time on an n x n RM has been developed [12] , but although our summing algorithms use the multiplication algorithm as a subroutine, they do not require such high performance.
Conversions between VALUE, UNARY, and BINARY formats
We will show how to convert an integer x ($0 \le x \le n-1$) represented in one of the three formats to the other formats in constant time, if sufficiently many processors are available. We will start with the conversions on the bit model, where it is easy to convert from the UNARY format to the BINARY format. Assume that PE(x,0) knows that x is equal to its own index. PE(x,0) sends 1 to the processors in the same row, and every processor tries to receive it. Each PE(x,i) that succeeds in receiving 1 learns that the ith digit of the binary representation of x is $x_i$. So PE(x,i) sends $x_i$ to PE(i,0). Note that for each PE(x,i) the value of $x_i$ is a constant that does not depend on the input. Therefore, the UNARY format can be converted to the BINARY format in constant time on an n x log n RM of the bit model. Conversely, the conversion from the BINARY format to the UNARY format can also be done in constant time on an n x n RM of the bit model. Assume that the BINARY of x is given to the top row. Each PE(0,i) ($0 \le i \le \log n - 1$) first broadcasts $x_i$ to $2^i$ processors PE($2^i$,0), PE($2^i + 1$,0), ..., PE($2^{i+1} - 1$,0), and each of them makes a copy of $x_i$ in its local storage. Hence, $2^i$ copies of each $x_i$ are made in the leftmost column. By using the right part of the lookahead carry generator for Lemma 3.4, the UNARY of the number of copies whose value is 1 can be computed in constant time, which corresponds to the UNARY of x. Now consider the conversions on the word model. The conversions between the UNARY and BINARY formats obviously can be done in the same way as on the bit model, so we will consider the conversions to and from the VALUE format. It is quite easy to convert from the VALUE format to the UNARY and BINARY formats: PE(0,0) broadcasts x to all processors, and each PE(i,0) can determine the corresponding digit of the UNARY or BINARY of x by local computation in constant time.
The conversion from the UNARY format to the VALUE format is as follows: PE(x,0) sends x to PE(0,0), and PE(0,0) can then get the VALUE of x. The conversion from the BINARY format to the VALUE format can be done by two conversions: after the BINARY format is converted to the UNARY format by the algorithm on the bit model, the UNARY format is converted to the VALUE format by the method just described. The conversion from the BINARY format to the VALUE format can thus be done in constant time on an n x n RM of the word model. By repeating this procedure, an O(log n)-bit integer represented in the BINARY format, whose value is at most $n^{O(1)}$, can also be converted in constant time on an n x n RM.
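The BINARY-to-UNARY conversion rests on the identity $x = \sum_i x_i 2^i$: copying each bit $x_i$ into $2^i$ cells and counting the ones gives x. The sketch below replays that idea with a Python list standing in for the leftmost column; the function name is hypothetical, and the count that the lookahead carry generator would produce is computed here with `sum`.

```python
def binary_to_unary(bits: list[int]) -> list[bool]:
    """Convert BINARY (index 0 = least significant bit) to UNARY."""
    n = 1 << len(bits)
    column = [0] * n               # leftmost column; row 0 stays unused
    for i, xi in enumerate(bits):
        # PE(0, i) broadcasts x_i to rows 2^i .. 2^(i+1) - 1.
        for row in range(1 << i, 1 << (i + 1)):
            column[row] = xi
    count = sum(column)            # number of 1s equals x
    return [row == count for row in range(n)]
```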
Summing binary values using prefix remainder computation
This subsection describes an algorithm to compute the sum of binary values using prefix-remainders computation. The algorithm computes the sum of n binary values given to an m x n RM of the bit model in $O(\log n/\sqrt{m})$ time. See [22] for details.
We will start with the prefix-remainders computation. The prefix w-remainders are $(x'_0, x'_1, \ldots, x'_{n-1})$, where $x'_i = (a_0 + a_1 + \cdots + a_i) \bmod w$ for given n binary values $A = (a_0, a_1, \ldots, a_{n-1})$. Each $a_i$ ($0 \le i \le n-1$) is given to the 2ith column. After executing the algorithm, each 2ith column knows the UNARY of $x'_i$. In the algorithm each column functions as a cyclic shift register of size w. In the rightmost column, only the bottom element of the cyclic shift register is 1. In each column, if the element given to the column is 1, then the processors in the column shift the cyclic shift register; otherwise, they hold it. Then each prefix-remainder is equal to the position where 1 is found on the cyclic shift register. See Fig. 4 for an example of the bus configuration. Hence, the UNARY of the prefix w-remainders can be computed in constant time on a (w+1) x 2n RM.

Next, let x be the sum of the n binary values and let lcm(q) denote the least common multiple of 2, 3, ..., q. Suppose that the RM is partitioned into q-1 subRMs of sizes 3 x 2n, 4 x 2n, ..., (q+1) x 2n. In each subRM of size (i+1) x 2n ($2 \le i \le q$) the UNARY of x mod i is computed using the prefix-remainders algorithm and is sent to all processors in the subRM. In each 2jth column of the subRM of size (i+1) x 2n, it is checked whether j mod i = x mod i.
Then in each 2jth column, the logical OR technique shown in Lemma 3.1 is used to check whether j mod i = x mod i holds for all i ($2 \le i \le q$). After that, the minimum j such that j mod i = x mod i for all i is computed using the rightmost-element algorithm of Lemma 3.1. For such j, the relation j = x mod lcm(q) holds. Therefore, the UNARY of x mod lcm(q) can be computed in constant time. Furthermore, the BINARY of x mod lcm(q) can be obtained in constant time by converting the UNARY to the BINARY.
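The remainder computation can be replayed in a few lines. This sketch assumes a non-empty input and takes lcm(q) to be the least common multiple of 2, ..., q; the names `lcm_upto`, `prefix_remainders`, and `sum_mod_lcm` are illustrative. The position of the 1 in each cyclic shift register is tracked as an integer, and the column search is done sequentially.

```python
from functools import reduce
from math import gcd


def lcm_upto(q: int) -> int:
    """Least common multiple of 2, 3, ..., q (1 if q < 2)."""
    return reduce(lambda x, y: x * y // gcd(x, y), range(2, q + 1), 1)


def prefix_remainders(a: list[int], w: int) -> list[int]:
    """Prefix w-remainders: entry i is (a_0 + ... + a_i) mod w."""
    position = 0                   # position of the 1 in the shift register
    out = []
    for bit in a:
        if bit == 1:
            position = (position + 1) % w   # shift the register
        out.append(position)
    return out


def sum_mod_lcm(a: list[int], q: int) -> int:
    """Return x mod lcm(q), x = sum(a), from the remainders x mod i only."""
    remainders = {i: prefix_remainders(a, i)[-1] for i in range(2, q + 1)}
    for j in range(len(a) + 1):    # the minimum matching j is x mod lcm(q)
        if all(j % i == r for i, r in remainders.items()):
            return j
    raise ValueError("no matching column index")
```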
Using the remainder algorithm above, we can compute the sum efficiently. Let $b_j$ be a binary value such that $b_j = 1$ iff $x_j \bmod \mathrm{lcm}(q) = 0$ and $a_j = 1$, where $x_j = a_0 + a_1 + \cdots + a_j$. Then $x = (x \bmod \mathrm{lcm}(q)) + (b_0 + b_1 + \cdots + b_{n-1})\,\mathrm{lcm}(q)$ holds. The BINARY of x is, therefore, computed by recursively computing the sum of $(b_0, b_1, \ldots, b_{n-1})$: after computing the sum of $b_0, b_1, \ldots, b_{n-1}$, the multiplication by lcm(q) is computed in constant time on a $2\log n \times 4\log^2 n$ RM, because this can be done by the multiplication of two log n-bit integers. The addition to x mod lcm(q) can be done in constant time on a log n RM from Lemma 3.2. Hence, each recursion takes constant time. Now let us estimate the depth t of the recursion. For the estimation of t, we have to evaluate the value of lcm(q). From the prime number theorem, there are approximately q/log q prime numbers less than or equal to q. More precisely, for any small $\varepsilon > 0$, there exists $q'$ such that for all $q \ge q'$, the number of prime numbers less than or equal to q is more than $(1 - \varepsilon)q/\log q$ and less than $(1 + \varepsilon)q/\log q$. Hence, there are $\Theta(q/\log q)$ prime numbers in the range $[q/2, q]$ and $\mathrm{lcm}(q) = \Theta(q)^{\Theta(q/\log q)} = 2^{\Theta(q)}$ holds. (See [22] for the details of the estimation.) Since the recursion is terminated when all the b's are zero, $n \le (\mathrm{lcm}(q))^t$ is sufficient for the termination. Thus $t \le \log n/\log(\mathrm{lcm}(q)) = O(\log n/q)$, which corresponds to the computing time. If an m x 2n RM is available, the sum of n binary values can be computed in $O(\log n/\sqrt{m})$ time as follows:
To execute the above algorithm on an m x 2n RM, let q be the maximum integer satisfying $(q-1)(q+4)/2 \le m$. Since $q = \Theta(\sqrt{m})$, the computing time is $O(\log n/\sqrt{m})$.
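The bound on q comes from stacking the q-1 subRMs of heights 3, 4, ..., q+1, whose total height is (q-1)(q+4)/2. A small sketch of that count, with hypothetical helper names, is:

```python
def rows_needed(q: int) -> int:
    """Total rows of the subRMs of heights i + 1 for i = 2, ..., q."""
    return sum(i + 1 for i in range(2, q + 1))


def max_q(m: int) -> int:
    """Largest q with (q - 1)(q + 4) / 2 <= m; returns 1 if even q = 2 does not fit."""
    q = 1
    while rows_needed(q + 1) <= m:
        q += 1
    return q
```

Because the row count grows quadratically in q, the largest feasible q grows like the square root of m.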
Furthermore, if n binary values are given to an m x n RM (one binary value for each processor in the bottom row), the sum can be computed without an asymptotic increase in the computing time: the sum for even columns and the sum for odd columns are computed, and then their sum is computed by the addition algorithm for Lemma 3.2. Consequently we have Lemma 3.5 (Nakano [22]). The sum of n binary values can be computed in $O(\log n/\sqrt{m})$ time on an m x n RM of the bit model.
By choosing the prime numbers and computing the prefix remainders for them, the computing time of the above algorithm can be reduced to $O(\log n/\sqrt{m\log m})$ [22]. However, the $O(\log n/\sqrt{m})$-time algorithm is sufficient for our algorithms to use as a subroutine.
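The recursion behind Lemma 3.5 can be sketched sequentially as follows. The code assumes q >= 2 and takes lcm(q) to be the least common multiple of 2, ..., q; the function names are illustrative. Only the running remainder modulo lcm(q) is kept, mirroring the fact that the RM knows the prefix remainders rather than the prefix sums.

```python
from functools import reduce
from math import gcd


def lcm_upto(q: int) -> int:
    """Least common multiple of 2, 3, ..., q."""
    return reduce(lambda x, y: x * y // gcd(x, y), range(2, q + 1), 1)


def sum_bits(a: list[int], q: int) -> int:
    """Sum binary values via x = (x mod L) + (b_0 + ... + b_{n-1}) * L, L = lcm(q)."""
    if not any(a):
        return 0                   # recursion ends when all values are zero
    modulus = lcm_upto(q)
    remainder = 0
    b = []
    for bit in a:
        remainder = (remainder + bit) % modulus
        # b_j = 1 iff a_j = 1 and the prefix sum x_j is a multiple of L.
        b.append(1 if bit == 1 and remainder == 0 else 0)
    return remainder + modulus * sum_bits(b, q)
```

Each level divides the number of ones by at least lcm(q), which is why a larger q shortens the recursion.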
Integer summing on the bit model
This section first gives an algorithm that sums n binary values on a $\sqrt{nm} \times \sqrt{n}$ RM and then uses this algorithm in an algorithm that computes the sum of n d-bit integers on a $\sqrt{nm} \times d\sqrt{n}$ RM.
Algorithm for summing n binary values
This subsection gives a summing algorithm that is based on the algorithm for Lemma 3.5 but that is more efficient. The key idea is as follows. We first assume that nk ($k \le n$) binary values are given, and show an efficient summing algorithm on a 2mk x (2n+1) RM. The algorithm for Lemma 3.5 can compute the sum of nk binary values in $O(\log(nk)/\sqrt{m})$ time on an m x nk RM. By embedding this algorithm in a 2mk x (2n+1) RM using the snake-like embedding, the sum can be computed faster. Fig. 5 illustrates the snake-like embedding of an m x nk RM in a 2mk x (2n+1) RM. In the snake-like embedding, a processor in the m x nk RM corresponds to a processor whose position is in an even row and odd column of the 2mk x (2n+1) RM. In the figure these processors are represented by thick circles and the connections for the embedding are represented by thick lines. The m x nk RM is bent k-1 times and is partitioned into k segments, each of which corresponds to 2m consecutive rows in the 2mk x (2n+1) RM. To complete the embedding, the connections between any two adjacent processors in the m x nk RM should be embedded in the 2mk x (2n+1) RM. Embedding the connection between two adjacent processors in the same segment is trivial. To embed inter-segment connections, m processors in the rightmost column of each segment should be connected one-to-one to m processors in the next segment. For the one-to-one connections between the 2m processors, m odd rows in the 2mk x (2n+1) RM are used as shown in the figure. Similarly, m processors in the leftmost column can be connected to the corresponding processors in the previous segment.
Suppose that the algorithm for Lemma 3.5 is used to compute the remainder of nk ($k \le n$) binary values $(a_0, a_1, \ldots, a_{nk-1})$ on the snake-like embedding. The nk binary values are given to the RM such that n binary values are given to each segment, one binary value for every two columns (Fig. 5). That is, the input is given to every 2m rows in the 2mk x (2n+1) RM. Let q be the maximum integer such that $(q-1)(q+4)/2 \le m$. Using the algorithm for Lemma 3.5, it is easy to compute $x \bmod \mathrm{lcm}(q)$ and $b_0, b_1, \ldots, b_{nk-1}$, where $x = a_0 + a_1 + \cdots + a_{nk-1}$ and each $b_i$ is defined in the same way as in the summing algorithm for Lemma 3.5. . . . In each recursion, the BINARY of $(x \bmod \mathrm{lcm}(q)) + (c_0 + c_1 + \cdots + c_{nk/\mathrm{lcm}(q)-1})\,\mathrm{lcm}(q)$ can be computed by using the algorithm for Lemma 3.4 in the same manner as the algorithm for Lemma 3.5. Therefore each recursion can be done in constant time. Finally, we will evaluate the depth of the recursion. Let $q_i$ be the value of q at the ith recursion. For each i, integer $q_i$ is the maximum integer such that $(q_i - 1)(q_i + 4)/2 \le m \cdot \mathrm{lcm}(q_1) \cdot \mathrm{lcm}(q_2) \cdots \mathrm{lcm}(q_{i-1})$. Obviously, $q_1 \ge \sqrt{m}$ holds. Furthermore, from $\mathrm{lcm}(q_i) = 2^{\Theta(q_i)}$, $q_i = 2^{\Theta(q_{i-1})}$ holds. Let t be the depth of the recursion. Then, the condition $\mathrm{lcm}(q_t) \ge n$ is sufficient for the termination of the recursion. By applying $\log^{(t-1)}$ to the inequality $\mathrm{lcm}(q_t) \ge n$, we have $O(q_1) \ge \log^{(t-1)} n$. Furthermore, by applying $\log^{(\log^* m - 1)}$ again, $\log^{(\log^* m - 1)}(O(q_1)) \ge \log^{(\log^* n - 1)} n > 1$ holds. Thus, from $q_1 \ge \sqrt{m}$, the inequality $t \ge \log^* n - \log^* m + O(1)$ is sufficient for the termination of the recursion. Therefore, the depth of the recursion is $t = O(\log^* n - \log^* m)$. As a result, we can see that the sum of nk binary values can be computed in $O(\log^* n - \log^* m)$ time on a 2mk x (2n+1) RM. Replacing n, m, and k by $\sqrt{n}$, $\sqrt{m}$, and $\sqrt{n}$, we can compute the sum of n binary values in $O(\log^* n - \log^* m)$ time on a $2\sqrt{nm} \times (2\sqrt{n} + 1)$ RM of the bit model.
Note that in this algorithm n binary values are given to a $2\sqrt{nm} \times (2\sqrt{n} + 1)$ RM, one value for each $2\sqrt{m} \times 2$ subRM. Therefore, by executing this algorithm four times we can compute the sum of 4n binary values (given one for each $\sqrt{m} \times 1$ subRM). Using this technique generally enables the size of a RM to be reduced by a constant factor without asymptotically increasing the computing time. As a result, we have Theorem 4.1. For n binary values given to every $\sqrt{m}$ rows of a $\sqrt{nm} \times \sqrt{n}$ RM, the sum of these values can be computed in $O(\log^* n - \log^* m)$ ($1 \le m \le \log n$) time.
If $m = \log^{(k)} n$, then $\log^* n - \log^* m = k$. Thus, as a corollary to Theorem 4.1, we have Corollary 4.2. 1. The sum of n binary values given one for each processor on a $\sqrt{n} \times \sqrt{n}$ RM can be computed in $O(\log^* n)$ time, and 2. The sum of n binary values given to every $\log^{(O(1))} n$ rows of a $\sqrt{n}\log^{(O(1))} n \times \sqrt{n}$ RM can be computed in constant time.
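The identity $\log^* n - \log^* m = k$ for $m = \log^{(k)} n$ follows directly from the definition of $\log^*$. A short derivation, assuming $1 \le k < \log^* n$ so that $m > 1$, is:

```latex
\begin{align*}
\log^{(j)} m &= \log^{(j)}\bigl(\log^{(k)} n\bigr) = \log^{(k+j)} n, \\
\log^* m &= \min\{\, j : \log^{(j)} m \le 1 \,\}
          = \min\{\, j : \log^{(k+j)} n \le 1 \,\}
          = \log^* n - k .
\end{align*}
```

For instance, with base-2 logarithms and $n = 65536$ we have $\log^* n = 4$; taking $k = 2$ gives $m = 4$ and $\log^* m = 2$.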
If the pipeline method is used, the size of the RM implementing the $O(\log^* n)$-time algorithm of Corollary 4.2 can be reduced to $\sqrt{n/\log^* n} \times \sqrt{n/\log^* n}$. This reduction will be shown in Section 6.
Algorithm for summing n d-bit integers
This subsection shows an algorithm that computes the sum of n d-bit integers in $O(\log^* n - \log^* m)$ ($1 \le m \le \log n$) time on a $\sqrt{nm} \times d\sqrt{n}$ RM. The algorithm is based on the two-stage summing method (Fig. 7). . . . bj,1 is computed in the jth square A, log n bits are transferred to the jth subRM B through the (j + log n - 1)st subRM B, one bit for each subRM.
Hence, 2 log n rows are enough to transfer every bit simultaneously. In the figure, these transfers are illustrated by arrows. After that, the bits transferred to each subRM B in Fig. 8 correspond to a column of the parallelogram B in Fig. 7. Therefore, to compute the sum of the parallelogram B in Fig. 7, a 2 log n x 2 log n lookahead carry generator is implemented in each subRM B. Each lookahead carry generator receives a carry from the previous carry generator, adds it to the sum of a column of the parallelogram B in Fig. 7, and sends the carry to the following carry generator. Hence, in a way similar to the summing algorithm for Lemma 3.4, the sum of the parallelogram B can be computed in constant time. Therefore the sum of n d-bit integers can be computed in $O(\log^* n - \log^* m)$ time on an $O(\sqrt{nm}) \times O(d\sqrt{n})$ RM. Furthermore, the big-O notation in the size of the RM can be removed in the same way as for Theorem 4.1. As a consequence, the sum of n d-bit integers can be computed in $O(\log^* n - \log^* m)$ ($1 \le m \le \log n$) time on a $\sqrt{nm} \times d\sqrt{n}$ RM of the bit model. We also show in Section 6 that the size of a RM for the $O(\log^* n)$-time algorithm can be reduced by a factor of $\log^* n$.
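The two-stage summing method can be sketched as plain arithmetic. The function name is hypothetical; stage one stands in for the d independent binary-value summations, and stage two for the carry-generator chain that adds the shifted digit sums (the parallelogram B).

```python
def two_stage_sum(values: list[int], d: int) -> tuple[list[int], int]:
    """Return (per-digit sums, total) for a list of d-bit integers."""
    # Stage 1: for every digit position j, sum the j-th bits of all integers.
    digit_sums = [sum((v >> j) & 1 for v in values) for j in range(d)]
    # Stage 2: add the digit sums, the j-th one shifted left by j positions.
    total = 0
    for j, s in enumerate(digit_sums):
        total += s << j
    return digit_sums, total
```

Each digit sum has at most about log n bits, so stage two adds d short integers arranged as a parallelogram rather than n long ones.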
Integer summing on the word model
This section describes an algorithm that computes the sum of n d-bit integers, given one for each processor, in $O(\log d + \log^* n)$ time on a $\sqrt{n} \times \sqrt{n}$ RM of the word model. In this algorithm each d-bit integer is given in the VALUE format, and the VALUE format of the sum is computed.
If $d > n^{1/4}$, the sum can be computed in $O(\log n) = O(\log d)$ time from Lemma 3.3. Hence, we consider only the case where $d \le n^{1/4}$.
First, let us imagine that the RM is partitioned into $\sqrt{n}/d \times \sqrt{n}/d$ subRMs each of size d x d as shown in Fig. 9. The sum of each group can be computed in $O(\log d)$ time from Lemma 3.3. Since each sum has at most d + log d bits, we have to compute the sum of $n/d^2$ (d + log d)-bit integers. To do this, we apply the two-stage summing method (Fig. 7). We have next to compute the sum of the parallelogram B in Fig. 7. To do this, we use the method used in Lemma 3.4 that implements lookahead carry generators. In this case, the size of each lookahead carry generator is $2\log(n/d^2) \times 2\log(n/d^2)$ and the number of generators is $d + \log d + \log(n/d^2) - 1 \le d + \log n$. Therefore, a RM of size $2\log(n/d^2)(d + \log n) \times 2\log(n/d^2) \le 2n^{1/4}\log n \times 2\log n$ is sufficient to compute all the bits in constant time. Note that the sum thus obtained is in the BINARY format. Finally, we have to convert the BINARY of the sum into the VALUE format. The BINARY has $\log n + \log d \le 2\log n$ bits, because $d \le n^{1/4}$. The conversion can therefore . . . In this case $(\log d + \log^* n)$ d-bit integers are given to each processor. Each processor first independently computes the sum of $\log^* n + \log d$ d-bit integers in $O(\log d + \log^* n)$ time. The sum thus obtained has at most $d + \log(\log d + \log^* n)$ bits. Then, the sum of $n/(\log d + \log^* n)$ $(d + \log(\log d + \log^* n))$-bit integers is computed in $O(\log(d + \log(\log d + \log^* n)) + \log^*(n/(\log d + \log^* n))) = O(\log d + \log^* n)$ time on the $\sqrt{n/(\log d + \log^* n)} \times \sqrt{n/(\log d + \log^* n)}$ RM by the algorithm for Lemma 5.1.
Hence, the computing time is still O(logd + log*n). Therefore, we have 
VLSI RC for summing integers
This section describes an AT optimal VLSI RC for summing integers. First, though, let us implement the algorithm for Theorem 4.1 on the VLSI RC, and estimate its AT. Since each processor executes the algorithm in $O(\log^* n - \log^* m)$ time and has a constant storage size, the processor can be embedded in $O(\log^* n - \log^* m)$ area on the VLSI RC as follows. Each processor consists of $O(\log^* n - \log^* m)$ registers, a local storage, a decoder, and a 4-terminal switch as shown in Fig. 10. The decoder receives an instruction, a value of the local storage, and signals transferred through terminals of the 4-terminal switch, and then outputs a new value of the local storage, signals to control the switch, and data to be sent from the terminals. Since the decoder receives and outputs a constant number of bits, the size and the depth of the circuit necessary to implement the decoder are constant. Therefore, the implementation shown in Fig. 10 has an area of $O(\log^* n - \log^* m)$ and performs one step of execution of a bit-model processor in a constant number of machine cycles. By using the implementation above, the total area is $A = O(n\sqrt{m}(\log^* n - \log^* m))$ and the computing time is $T = O(\log^* n - \log^* m)$. Hence, $AT = O(n\sqrt{m}(\log^* n - \log^* m)^2)$ holds and this AT upper bound does not match the trivial lower bound $AT = \Omega(n)$.
If the program has loops, a processor may be embedded in a smaller area by reusing a subsequence of instructions. However, even if every processor can be implemented in constant area, the upper bound is $AT = O(n\sqrt{m}(\log^* n - \log^* m))$, which does not match the lower bound.
To attain AT = O(n) we will reduce the number of processors by using the Muller-Preparata VLSI circuit on the usual VLSI model. As a preliminary, we start by explaining the Muller-Preparata VLSI circuit [21, 32], which is an AT optimal VLSI circuit for summing n binary values (Fig. 11). . . . Each block consists of $S(\log^* n)$ and a $2\log^* n \times 2\log^* n$ subRM of the bit model. Since each processor has a program of size $O(\log^* n)$, the area for each subRM is $O((\log^* n)^3)$ and does not dominate the area in its block. Each subRM is connected to the subRMs in four adjacent blocks. Thus, the VLSI RC has a $2\sqrt{n\log^* n/2^{\log^* n}} \times 2\sqrt{n\log^* n/2^{\log^* n}}$ RM overall. In each block the sum of $2^{\log^* n}\log^* n$ binary values is computed in $O(\log^* n)$ time by using $S(\log^* n)$, and it is transferred to the subRM in the same block. The sum thus obtained has at most $\log(2^{\log^* n}\log^* n) \le 2\log^* n$ bits. Then the sum of $n/(2^{\log^* n}\log^* n)$ $2\log^* n$-bit integers is computed in $O(\log^* n)$ time by the two-stage summing method in the same way as in the summing algorithm for the word model in Section 5. That is, in the first stage the sum of each digit of the $n/(2^{\log^* n}\log^* n)$ $2\log^* n$-bit integers is computed independently by the RM. This can be done in $O(\log^* n)$ time because the RM can be partitioned into $2\log^* n$ subRMs each of size $\sqrt{n/(2^{\log^* n}\log^* n)} \times \sqrt{n/(2^{\log^* n}\log^* n)}$. Each sum thus obtained has at most log n bits. In the second stage $2\log^* n$ lookahead carry generators, each of size 2 log n x 2 log n, are implemented in the RM, and the sum of $2\log^* n$ log n-bit integers is computed in constant time. The RM in the VLSI RC is large enough to implement the lookahead carry generators. Therefore, we have . . . The sum of integers can be computed on a VLSI RC by a method similar to that given in Subsection 4.2, which uses the RM as shown in Fig. 8. The RC for summing integers has the same layout as shown in Fig. 8, in that the RC for summing binary values is implemented in each A and a lookahead carry generator is implemented in each B.
Since each A occupies an area of $O(\sqrt{n/\log^* n}) \times O(\sqrt{n/\log^* n})$, the total area is $O(dn/\log^* n)$. Note that this corollary assumes that the input is given to the RM by the pipeline scheme. If the pipeline scheme is not available, n processors to store the input are required.
Conclusions
This paper has presented efficient summing algorithms for binary values and integers on reconfigurable meshes of the bit and the word models. It has also presented AT optimal VLSI reconfigurable circuits for summing problems. An interesting open problem is whether the sum of n binary values can be computed in constant time on a reconfigurable mesh of size $\sqrt{n} \times \sqrt{n}$.
