Abstract-This brief introduces a novel (or ancient) technique for high-speed arithmetic. The proposed method is based on the still-used Chinese abacus. We show that suitable electronic circuits, based on pass-transistor and domino logic, can perform the same functions as the Chinese abacus. Simulations with a 0.35-μm CMOS technology show that both a pipeline 8-bit adder and an 8 × 8 multiplier can run at a speed as high as 1 GHz.
I. INTRODUCTION
The Chinese abacus is a very popular and efficient tool for performing arithmetic functions. It has been used for centuries in many parts of the world (mainly in China), and it is still in use in shops and small commercial enterprises. The main feature of the Chinese abacus is its speed of use: a well-trained operator is often capable of competing with electronic pocket calculators. The time required to input data manually is comparable to that of the electronic approach, and the generation of the result on the Chinese abacus is so straightforward that the total computation time is extremely short.
The above observation stimulated us to analyze the basic reason for this speed and, possibly, to transfer the same features to an electronic circuit. This paper shows that the Chinese-abacus approach leads to promising results when using, for example, a 0.35-μm CMOS process. The speed of an 8-bit pipeline full adder is as high as 1.3 GHz, and a parallel 8 × 8 bit multiplier can run at 980 MHz. Moreover, the compactness of the physical layout leads to a relatively small area for the circuits.
II. OPERATION PRINCIPLE
The Chinese abacus is made of a set of unit elements, each representing one decade of a decimal number. Each element is made up of five beads having unity weight and two beads having a weight of 5. The configuration shown in Fig. 1(a) represents the number seven.
The coding rule is thermometric; thus, in order to represent a number lower than five, the same number of beads will be raised in the main part of the unit. For numbers higher than five, one bead with weight 5 will be lowered. In this way, a basic element is able to represent a decimal number in the range from 0 to 15. The key feature of the Chinese abacus is the use of two beads with weight 5. This allows the operator to minimize the transmission of carries. Moreover, the use of the thermometric code permits a fast implementation of elementary arithmetic functions such as addition and subtraction. The number representation used in the Chinese abacus refers to the decimal numeric system. As we are mostly interested in the case of binary-based coding, it is more convenient to use a basic element made up of four unity-weight beads and two beads having a weight of four units [Fig. 1(b)]. In practice, we use a base of $2^2 = 4$, and the basic element is able to represent numbers in the range from 0 to 12. The configuration shown in Fig. 1(b) represents the number five.
As in the other cases considered, this coding is able to represent numbers exceeding the full scale by half of the base of the numeric system. Having this over-range headroom is the key to the operation of the method.
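The coding rule can be illustrated with a small behavioral model. The sketch below is only an illustration: the names `encode_element` and `decode_element` are hypothetical, and the choice of using a heavy bead as soon as the value reaches the bead weight is an assumption, since the redundant code admits more than one configuration for some values.

```python
def encode_element(value, unit_beads=4, heavy_beads=2, heavy_weight=4):
    """Return (heavy, units) thermometric bead lists for one abacus element.

    Defaults model the binary-oriented element (four unity-weight beads,
    two beads of weight 4). Use (5, 2, 5) for the decimal element.
    """
    max_value = unit_beads + heavy_beads * heavy_weight
    if not 0 <= value <= max_value:
        raise ValueError(f"value must be in 0..{max_value}")
    # Assumption: prefer heavy beads whenever possible.
    n_heavy = min(value // heavy_weight, heavy_beads)
    n_units = value - n_heavy * heavy_weight
    heavy = [1] * n_heavy + [0] * (heavy_beads - n_heavy)
    units = [1] * n_units + [0] * (unit_beads - n_units)
    return heavy, units


def decode_element(heavy, units, heavy_weight=4):
    """Recover the value: count the moved beads, weighted by bead type."""
    return heavy_weight * sum(heavy) + sum(units)
```

For instance, `encode_element(5)` yields one weight-4 bead and one unity bead, and `encode_element(7, 5, 2, 5)` models the decimal element holding seven.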
III. CHINESE-BEAD BASIC BLOCKS
In order to design circuits based on the Chinese-abacus approach, it is necessary to implement some basic functions with electronic circuits.
The first of them is the binary-to-Chinese-bead conversion. We attain it in two steps: a binary-to-thermometric (B/T) conversion and a thermometric-to-abacus (T/A) coding. Fig. 2 shows the basic block for the B/T conversion, where we have four unity-weight inputs. Similar circuits with binary-weight inputs can be designed. The solution in Fig. 2 is based on the pass-transistor approach [1] and contains n-channel transistors. The control is given by the inputs $x_0, x_1, x_2, x_3$ and their complements $\bar{x}_0, \bar{x}_1, \bar{x}_2, \bar{x}_3$. Each output is either a logic 0 of the thermometric representation or high impedance.
1057-7130/99$10.00 © 1999 IEEE
The status of the output nodes when they are in the high-impedance condition can be set to one by a complementary block made up of p-channel transistors. However, this circuit solution would require additional silicon area and would increase the parasitic capacitance, which reduces the achievable operation speed. An alternative method is to use the pre-charge approach: the output nodes are pre-charged to logic 1 during a pre-charge phase, and the data output is valid during the complementary phase.
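The logical function of the B/T block can be summarized by a behavioral model. The sketch below ignores the electrical aspects (pass transistors, pre-charge, high impedance) and only reproduces the input/output mapping; the function name `binary_to_thermometric` and the `weights` parameter are hypothetical.

```python
def binary_to_thermometric(inputs, weights=None):
    """Behavioral B/T model: weighted input bits -> thermometric code.

    The returned list is ordered from d0 upward; the first `total`
    entries are 1, where `total` is the weighted sum of the inputs.
    """
    if weights is None:
        weights = [1] * len(inputs)  # unity-weight inputs, as in Fig. 2
    total = sum(w * x for w, x in zip(weights, inputs))
    length = sum(weights)  # largest value the block can produce
    return [1 if k < total else 0 for k in range(length)]
```

With four unity-weight inputs the model returns a four-level code; with weights `[1, 1, 2, 2]` it returns a six-level code.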
In the use of the Chinese abacus, a possible carry coming from one unit is accounted for by raising one lower bead in the successive unit. The simple circuit shown in Fig. 3 achieves this function. It shifts up (SU) the input by one position and provides an extra "one" on the output $d_0$. Note that the representation used so far is just a thermometric one. In order to have a representation in the abacus form, a further block is necessary: the T/A converter.
The blocks discussed above use a total of 6 bits (four lower beads and two upper beads), whereas a binary representation with the same number of bits can represent numbers between 0 and 63. This is an intrinsic cost of the approach used.
The speed of the basic blocks described above has been simulated using a digital CMOS 0.35-μm technology. For the circuit in Fig. 2, the delay between the clock edge and the data output is as low as 0.34 ns. Therefore, we can expect an excellent speed of operation in the overall architecture.
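A behavioral sketch of the SU and T/A functions follows. It is an illustration under stated assumptions, not the circuit itself: the `carry` control input of `shift_up` and the mapping used in `thermometric_to_abacus` (one upper bead for every four counted levels) are assumptions, and both function names are hypothetical.

```python
def shift_up(thermo, carry):
    """SU model: with a carry, shift the code up and set d0 to 1 (add one).

    Without a carry the code is only extended by one empty top level,
    so the output length is the same in both cases (assumption).
    """
    thermo = list(thermo)
    return [1] + thermo if carry else thermo + [0]


def thermometric_to_abacus(thermo, base=4):
    """T/A model: split a thermometric count into (upper, lower) bead counts."""
    count = sum(thermo)
    return count // base, count % base
```

For example, shifting up the code `[1, 1, 1, 0, 0, 0]` (value 3) with a carry gives a code of value 4, which the T/A model maps to one upper bead and no lower beads.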
IV. THE CIRCUIT OF THE SUM OPERATION
The basic blocks discussed in the previous section are used here to achieve an N-bit full adder (we will assume N = 8, but the method can be extended to any value of N). Since the advantage of the Chinese abacus lies mainly in the number representation used, we will exploit the Chinese-abacus representation of numbers with a specific sum procedure. The required operation is $G = A + B$.
The partial sums $G_{10}$, $G_{32}$, $G_{54}$, and $G_{76}$ are derived with a B/T circuit. We have two inputs with weight 1 and two inputs with weight 2; the schematic with pre-charge transistors is shown in Fig. 6, which also contains a symbol representing the entire block. The six thermometric outputs of the B/T block are processed as shown in Fig. 7. We have four processing lines, each of them receiving the carry $d_{(3+6k)}$ from the lower line. Each line is the cascade of the B/T block, an SU block (which accommodates the carry of the lower line), and the conversion from thermometric into binary achieved with the T/A and A/B blocks. The pair of bits at the output of each line and the carry of the upper T/A, as stated by (9), give the binary representation of the sum G.
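The four processing lines can be modeled behaviorally as follows. This is a sketch under assumptions: each partial sum is taken as the two weight-1 bits plus twice the two weight-2 bits of a bit pair, which is inferred from the description of the B/T inputs, and the carry to the next line is taken after the shift-up step. The function name `abacus_add` is hypothetical.

```python
def abacus_add(a, b, n_bits=8):
    """Behavioral model of the adder: one processing line per bit pair."""
    result = 0
    carry = 0
    for k in range(n_bits // 2):
        a0, a1 = (a >> (2 * k)) & 1, (a >> (2 * k + 1)) & 1
        b0, b1 = (b >> (2 * k)) & 1, (b >> (2 * k + 1)) & 1
        # B/T: partial sum in 0..6, as a six-level thermometric code.
        partial = a0 + b0 + 2 * (a1 + b1)
        thermo = [1 if i < partial else 0 for i in range(6)]
        # SU: absorb the carry of the lower line (value now in 0..7).
        thermo = [1] + thermo if carry else thermo + [0]
        count = sum(thermo)
        # T/A and A/B: two result bits plus the carry for the next line.
        result |= (count % 4) << (2 * k)
        carry = count // 4
    # The carry of the upper line is the most significant bit of the sum.
    return result | (carry << n_bits)
```

Because each line handles a value of at most 7, a single carry bit between adjacent lines is sufficient in this model.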
The architecture in Fig. 7 depicts a parallel implementation of the adder. However, we can achieve the result with pipeline architecture (2) and (3).

It is well known that the digital representation of the product P results from the sum of the binary elements $a_i b_j$ in (11), where the elements of the same column have equal binary weight, which increases by a factor of 2 moving from right to left. Of course, the term $a_0 b_0$ represents the LSB. The conventional approach to calculating (11) is to use a "shift-and-add" serial technique or, for fast applications, to implement (11) in hardware in a parallel or a pipeline fashion. For our purposes, a convenient way to calculate P is to express (11) in the form

$$
\begin{array}{cccccccc}
 & & & a_7 b_1 & P_{7,0} & P_{5,0} & P_{3,0} & P_{1,0} \\
 & & a_7 b_3 & P_{7,2} & P_{5,2} & P_{3,2} & P_{1,2} & \\
 & a_7 b_5 & P_{7,4} & P_{5,4} & P_{3,4} & P_{1,4} & & \\
a_7 b_7 & P_{7,6} & P_{5,6} & P_{3,6} & P_{1,6} & & &
\end{array}
\tag{12}
$$

Again, the elements in the same column have identical weight; moreover, the weight of the columns increases by a factor of 4 when moving from right to left. The generic partial sums $P_{i,j}$ represent the expression

$$
P_{i,j} = 2\,(a_i b_j + a_{i-1} b_{j+1}) + a_{i-1} b_j + a_{i-2} b_{j+1}
\tag{13}
$$

where $i = 1, 3, 5, 7$ and $j = 0, 2, 4, 6$; moreover, for $i = 1$, the last term in (13) must be set equal to zero. We achieve a thermometric representation of the partial sums $P_{i,j}$ with simple logic (for achieving the necessary "and" operations) and a schematic similar to the one in Fig. 6. The successive use of a T/A block permits representing the result in the abacus format.

Using the same principle followed to compute (11), we can group the terms in (12) as follows:

$$
\begin{array}{cccc}
 & K_{7,3} & H_{7,0} & H_{3,0} \\
K_{7,7} & H_{7,4} & H_{3,4} &
\end{array}
\tag{14}
$$

where the weight of each column increases by a factor of 16 moving from right to left. Moreover,

$$
K_{l,m} = 4\,a_l b_m + a_l b_{m-2} + P_{l,m-1}
\tag{15}
$$

$$
H_{i,j} = 4\,(P_{i,j} + P_{i-2,j+2}) + P_{i-2,j} + P_{i-4,j+2}
\tag{16}
$$

The approach proposed here is similar to the well-known Wallace tree [2] and Dadda [3], [4] implementations.
The basic idea is to achieve the multiplication result with a hierarchical operand reduction. However, the method proposed here utilizes an abacus representation of numbers with a 0-7 range instead of a simpler binary coding. This feature leads to a further reduction of the need for carry transfers and to a lower number of hierarchical levels. Moreover, specific architectures can be studied in order to reduce the critical path. Nevertheless, the proposed method requires using the variety of basic blocks discussed in Section III. This is only a partial limitation, since the basic blocks can be implemented with a regular layout and a well-structured floor plan.
The calculation of the $K_{l,m}$ and $H_{i,j}$ terms represented by (15) and (16) involves the addition of terms with different weights, some of them having an abacus format. It is possible to achieve the result by a proper use of abacus blocks. Unity-weight bits, like the lower beads of the partial sums $P_{i,j}$ or the outputs of the logic "and," are added by B/T or SU blocks. The results are then transformed into the abacus format by T/A converters. Similar to the architecture in Fig. 7, we can design parallel computation lines with a minimum (or pipelined) carry path. The strategy used was to perform the required operations with a hierarchical approach: the various terms are successively grouped in groups of three or four terms, and the results are calculated with architectures made of basic blocks.
Pipeline implementations are also possible: the technique, of course, requires partitioning the architecture into various stages. Each stage provides the input to a "hold block" used as the interface to the successive pipeline stage.
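The grouping defined by (12)-(16) can be checked numerically. The sketch below is an arithmetic check only, not a model of the circuit: it evaluates the partial sums of (13), recombines them with the column weights of (12) and (14), and compares the result with the ordinary product. Bits with an out-of-range index are treated as zero, and all function names are hypothetical.

```python
def bit(x, i):
    """Bit i of x; indices outside 0..7 are treated as zero."""
    return (x >> i) & 1 if 0 <= i <= 7 else 0


def partial_sum(a, b, i, j):
    """P_{i,j} as in (13); value in 0..6."""
    return (2 * (bit(a, i) * bit(b, j) + bit(a, i - 1) * bit(b, j + 1))
            + bit(a, i - 1) * bit(b, j) + bit(a, i - 2) * bit(b, j + 1))


def product_from_12(a, b):
    """Sum the array (12): columns weigh 4x more moving left."""
    total = 0
    for j in (0, 2, 4, 6):
        total += (bit(a, 7) * bit(b, j + 1)) << (8 + j)
        for i in (1, 3, 5, 7):
            total += partial_sum(a, b, i, j) << (i + j - 1)
    return total


def product_from_14(a, b):
    """Sum the array (14): columns weigh 16x more moving left."""
    def p(i, j):
        return partial_sum(a, b, i, j)

    def h(i, j):  # (16)
        return 4 * (p(i, j) + p(i - 2, j + 2)) + p(i - 2, j) + p(i - 4, j + 2)

    def k(l, m):  # (15)
        return 4 * bit(a, l) * bit(b, m) + bit(a, l) * bit(b, m - 2) + p(l, m - 1)

    return (h(3, 0) + (h(7, 0) << 4) + (k(7, 3) << 8)
            + (h(3, 4) << 4) + (h(7, 4) << 8) + (k(7, 7) << 12))
```

Both functions return `a * b` for 8-bit operands, which confirms that every elementary product $a_i b_j$ is counted exactly once with its proper weight.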
VI. SIMULATION RESULTS AND IMPLEMENTATION ISSUES
Using the methodology discussed in the previous section, we have designed an 8-bit parallel adder, an 8-bit pipeline adder, and an 8 × 8 pipeline multiplier. The circuits have been simulated with SPICE using a 0.35-μm CMOS process. Parasitic capacitances extracted from the layout of the basic blocks and an estimate of the interconnection capacitances have been taken into account. The results are summarized in Table I. We can observe that, for the pipeline implementations, the pre-charge phase and the I/O delay due to the transfer-gate operation are less than 0.38 and 0.51 ns for the 8-bit adder and the 8 × 8 multiplier, respectively. Therefore, the maximum possible clock frequency is, in the nominal case, 1.3 GHz and 980 MHz, respectively. The total number of transistors required by the simulated circuits is limited; it ranges from 296 to 3699. These figures are quite acceptable for the implemented functions. Moreover, a custom layout permits good compactness. The 33 transistors required to achieve a B/T function can be accommodated within a 16 × 19 μm space, leading to an area per single transistor as small as 9.5 μm². Assuming that the overhead for block interconnections is 100% of the basic block area, we can estimate that the entire 8 × 8 pipeline multiplier can be accommodated in 0.07 mm². This estimate is rough; nevertheless, it gives an idea of the possible chip area of the proposed solution.
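The quoted figures can be reproduced with a short calculation. The sketch below rests on two assumptions that are not stated explicitly in the text: the clock period is taken as two phases of equal duration (pre-charge and evaluation), and the 3699-transistor count is taken to be that of the 8 × 8 pipeline multiplier. The function names are hypothetical.

```python
def max_clock_ghz(phase_ns):
    """Clock frequency in GHz for a period made of two equal phases (assumption)."""
    return 1.0 / (2.0 * phase_ns)


def estimated_area_mm2(n_transistors, area_per_transistor_um2=9.5,
                       interconnect_overhead=1.0):
    """Area estimate: transistor area plus a fractional interconnect overhead."""
    area_um2 = n_transistors * area_per_transistor_um2 * (1.0 + interconnect_overhead)
    return area_um2 * 1e-6  # 1 mm^2 = 1e6 um^2


adder_ghz = max_clock_ghz(0.38)             # about 1.3 GHz
multiplier_ghz = max_clock_ghz(0.51)        # about 0.98 GHz
multiplier_area = estimated_area_mm2(3699)  # about 0.07 mm^2
```

Under these assumptions the results agree with the 1.3 GHz, 980 MHz, and 0.07 mm² values given above.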
VII. CONCLUSION
This brief presented a technique for performing arithmetic functions that mimics the Chinese abacus. The key feature of the method is the use of a different data representation. Using abacus basic blocks, it was possible to achieve fast CMOS adders and multipliers operating at a clock frequency higher than that of traditional counterparts. The circuit implementation requires a small chip area. Nevertheless, it is difficult to compare our solution with traditional architectures, because the chip area critically depends on the design rules of the specific technology used.
