Index Terms-High-speed integrated circuit, logic design, logic family, race logic, RALA.
I. INTRODUCTION

EVEN IN the age of deep submicron technology, the demand for faster logic circuits has not been tempered, and numerous kinds of high-speed logic families and techniques are constantly being reported. Some of them, such as differential cascode voltage switch logic [1] and sample-set differential logic [2], improve the operation speed by inserting feedback pMOSs or sense amplifiers. Pass gate techniques, wave pipelining, and self-timing are also frequently used for fast logic circuits [3]-[5]. However, these logic families and techniques have fundamental limitations in speed because their logical functions rely entirely on level-based logic computation. Some other logic families based on analog computation, such as Josephson logic [6], current steering logic [7], and the current-mode circuit [8], have been reported. However, Josephson logic requires integrated inductors that are inadequate for monolithic integrated circuits, current steering logic has been developed for low switching noise rather than high-speed operation, and the current-mode circuit technique is hard to integrate with conventional digital logic families.
Moreover, these conventional logic techniques reduce only the gate delay time, while the interconnection delay caused by the coupling capacitance between metal routings causes a more serious delay as fabrication technology develops. In other words, the time during which the signals run along the wires is wasted. As long as transistor logic gates perform logic operations, the delay time caused by interconnection wires is an extra expenditure out of an already limited time budget.
The proposed new logic architecture, Race Logic Architecture (RALA), abandons transistor logic gates for logic operation. RALA utilizes the racing between input variables to perform logic operations. Sequentially triggered input signals race with each other, and by detecting the winning signal of the race, the logic operation is realized. When a circuit is implemented on the basis of RALA, it can not only overcome the limitations of the transistor delay, but also utilize the interconnection delay to obtain the logic operation [9].
In Section II, the concept of RALA is presented. In Section III, a CMOS circuit implementation of RALA is suggested and discussed. The comparisons with the circuits designed by other various logic families follow in Section IV. In Section V, CMOS adder implementation using RALA is shown, and finally, Section VI concludes the paper.
II. OPERATION OF RALA
A. Structure
The conceptual diagram of RALA is illustrated in Fig. 1(a). RALA consists of three parts: the clock distribution line (CDL), the race lines, and the winner-take-all circuit (WTAC). When the clock transits from low to high, the clock travels along the resistor array, and switches (1), (2), (3), and (4) are triggered sequentially. The termination switch, switch (4), is attached at the end of the CDL. The race lines consist of a true-line and a false-line, each of which performs a wired-OR operation. The WTAC determines which line becomes 1 earlier. If the true-line becomes 1 earlier, OUT becomes 1. On the other hand, if the false-line becomes 1 earlier, OUT becomes 0. Even in the case of all-0 inputs, one of the race lines, e.g., the true-line in this example, is set to 1 by the termination switch. The input of the termination switch is 1, so that there is no situation in which both race lines stay 0. Logic values of each node are depicted in Fig. 1(b) in the case of , , and .
There can be various methods of circuit implementation of RALA. One circuit implementation is shown as an example in Fig. 1(c). To implement the switches by nMOS, negative logic is adopted. During the low clock period, the race lines and OUT/$\overline{\mathrm{OUT}}$ are precharged. If an input variable is 1, the relevant switch is turned on when the clock triggers the switch, and then the race line is discharged. The WTAC only allows the signal that falls faster to pass. The remaining signal that arrives later is blocked by the WTAC. A detailed explanation of the circuit is given in Section III.
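The structure described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the function name `race`, the tuple encoding of a switch as `(line, value)`, and the uniform 10-ps tap spacing are hypothetical choices, and the WTAC is idealized as a comparison of arrival times with no delay of its own.

```python
import math

def race(switches, termination_line="true", gap_ps=10.0):
    """Behavioral model of one race logic; returns OUT (0 or 1).

    switches: list of (line, value) in trigger order, where line is
    "true" or "false" and value is the input applied to that switch.
    The termination switch (input fixed to 1) is triggered last.
    """
    chain = list(switches) + [(termination_line, 1)]
    # Each race line is a wired-OR: it becomes 1 at the time of the
    # earliest triggered switch whose input is 1.
    arrival = {"true": math.inf, "false": math.inf}
    for position, (line, value) in enumerate(chain, start=1):
        if value == 1:
            arrival[line] = min(arrival[line], position * gap_ps)
    # Idealized WTAC: the line that becomes 1 earlier decides OUT.
    return 1 if arrival["true"] < arrival["false"] else 0

# All-0 inputs: the termination switch on the true-line decides the race.
print(race([("true", 0), ("false", 0), ("true", 0)]))
# The false-line switch fires before the later true-line switch.
print(race([("true", 0), ("false", 1), ("true", 1)]))
```

Because every switch is triggered at a distinct time, the two arrival times can never tie in this model; a real WTAC has to resolve small but nonzero gaps.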
0018-9200/02$17.00 © 2002 IEEE
B. Basic Boolean Operations
Fig. 2(a) shows the AND operation. When the clock goes high, the input connected to the false-line is triggered first. If it is 1, the output becomes 0. If it is 0, OUT depends on the value of the second input. If the second input is 1, OUT becomes 1, and if it is 0, OUT becomes 0 because of the termination that is attached to the false-line. As shown in the truth table of Fig. 2(a), the race logic performs the AND operation with the depicted circuit configuration. Fig. 2(b) shows another implementation of the AND operation. The OR operation can be implemented with the circuit configurations shown in Fig. 3. Implementation of the OR operation is quite similar to that of the AND operation except for the positions of the input variables.
From Figs. 2 and 3, connecting the first input variable to the false-line gives an AND operation between the first variable and the following input variable. An OR operation is obtained by connecting the first input variable to the true-line, regardless of the position of the second input variable. This characteristic is useful when figuring out the functionality of cascaded multiple race logics.
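The rule that the line of the first input selects AND or OR can be checked with a short truth-table sketch. The configuration below is an assumption made for illustration: the false-line switch of the AND gate is taken to be driven by the complement of the first variable, and the names `race`, `race_and`, and `race_or` are hypothetical rather than taken from the figures.

```python
import math

def race(switches, termination_line):
    """Idealized race: the line with the earliest asserted switch wins."""
    chain = list(switches) + [(termination_line, 1)]
    arrival = {"true": math.inf, "false": math.inf}
    for position, (line, value) in enumerate(chain, start=1):
        if value == 1:
            arrival[line] = min(arrival[line], position)
    return 1 if arrival["true"] < arrival["false"] else 0

def race_and(a, b):
    # First switch on the false-line (assumed to carry NOT a),
    # second switch on the true-line, termination on the false-line.
    return race([("false", 1 - a), ("true", b)], termination_line="false")

def race_or(a, b):
    # First switch on the true-line, second switch also on the
    # true-line, termination on the false-line.
    return race([("true", a), ("true", b)], termination_line="false")

for a in (0, 1):
    for b in (0, 1):
        print(a, b, race_and(a, b), race_or(a, b))
```

Other placements of the second input, such as the alternative AND of Fig. 2(b), would be modeled the same way by changing the switch list and the termination line.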
C. Cascading Race Logics
The race logics can be cascaded to implement more complicated functions. A simple cascading of two race logics is shown in Fig. 4. The left gray block and the right one both perform AND operations. The logical relationship between input and is an OR operation because input is connected to the true-line. When two race logics are cascaded, the Boolean function of the cascaded race logic is expected to be (1). However, cascading the two race logics directly creates an associative operation, and the resulting Boolean function is given as (2) instead.
By inserting a WTAC between two race logics, cascading can be performed without an associative operation. Fig. 5(a) and (b) show the cascading for the case of AND and OR operations, respectively. In the AND operation, the OUT and $\overline{\mathrm{OUT}}$ of the first WTAC are connected to the CDL and the false-line of the second race logic, respectively. In contrast, in the OR operation, OUT and $\overline{\mathrm{OUT}}$ of the first WTAC are connected to the true-line and the CDL, respectively.
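A behavioral sketch of WTAC-based cascading is given below. It assumes that the OUT of the first WTAC clocks the second CDL and its complement drives the second false-line for AND, and that the roles are swapped onto the true-line for OR. The helper names, the stage encoding `(switches, termination_line)`, and the zero-delay WTAC are hypothetical simplifications.

```python
import math

def race_times(switches, termination_line, start_ps=0.0, gap_ps=10.0):
    """Earliest assertion time of each race line for one race logic."""
    arrival = {"true": math.inf, "false": math.inf}
    chain = list(switches) + [(termination_line, 1)]
    for k, (line, value) in enumerate(chain, start=1):
        if value:
            arrival[line] = min(arrival[line], start_ps + k * gap_ps)
    return arrival

def wtac(arrival):
    """Idealized WTAC: returns (OUT, complement of OUT, decision time)."""
    out = 1 if arrival["true"] < arrival["false"] else 0
    return out, 1 - out, min(arrival.values())

def cascade(stage1, stage2, mode):
    """Cascade two race logics through a WTAC; mode is "and" or "or"."""
    out1, out1_bar, t1 = wtac(race_times(*stage1))
    if mode == "and":
        clock2, forced_line, forced = out1, "false", out1_bar
    else:
        clock2, forced_line, forced = out1_bar, "true", out1
    arrival2 = {"true": math.inf, "false": math.inf}
    if clock2:   # the second CDL only starts if its clock source rises
        arrival2 = race_times(*stage2, start_ps=t1)
    if forced:   # the other WTAC output asserts a race line directly
        arrival2[forced_line] = min(arrival2[forced_line], t1)
    return wtac(arrival2)[0]

def and_stage(a, b):
    # Assumed AND configuration: NOT a on the false-line, b on the true-line.
    return ([("false", 1 - a), ("true", b)], "false")

print(cascade(and_stage(1, 1), and_stage(1, 0), "or"))
```

Exactly one of the two WTAC outputs rises in each evaluation, so the second stage either runs its own race or is decided immediately by the forced line.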
D. Design Algorithm of Race Logics
Systematic methodology can be developed for the efficient design of the race logics. For convenience, several symbols are defined as in Table I . and can be expressed as the following equations. (3) (4) where the and are the triggering sequences. There are six design rules for the synthesis of race logic. The relationship between input and is AND with an associative operation; therefore, method M-AND2 in Table II is applied. Method M-OR1 is applied to and for an OR operation. When we relate and , method M-AND2 is used instead of M-AND1 because input should be placed in the true-line for the next OR operation. Method M-OR2 is proper for input and . Method M-AND1 is used for and . Actually, method M-AND1 creates the associative operation. However, the created associative operation has no effect because the next operation will terminate the association. When placing input after , the previous associative operation is to be closed and a new OR operation is to be created. Therefore, we apply method M-OR3 when relating input and . Inputs , , and are placed by methods M-AND1, M-AND2, and M-OR1, respectively. Finally, the WTAC is placed at the ends of the true-line and the false-line.
When the input vector is given by 1 0 0 1 1 1 1 1 0 , the operation of the race logic shown in Fig. 6 is as follows. When the clock goes high, input , , , , and are sequentially triggered. When the clock triggers , , and , true-line1 and false-line1 stay low because all of them are 0. When the clock triggers , false-line1 is activated, and WTAC1 detects the activation. Since false-line1 is activated earlier than true-line1, the $\overline{\mathrm{OUT}}$ of WTAC1 becomes high. This operation proceeds regardless of the value of input . As shown in Fig. 6, CDL2 is connected to $\overline{\mathrm{OUT}}$. Therefore, race logic2 is enabled. If CDL2 were connected to OUT, race logic2 would not be activated. Except for the , the and are all 0, so that true-line2 is activated earlier than false-line2. Therefore, WTAC2 detects true-line2 and the OUT of WTAC2 becomes 1. Finally, the result of the circuit becomes 1, the same value as that of (5) for the given input vector.
III. CMOS CIRCUIT IMPLEMENTATION
RALA can be realized by various methods, and in this study, a CMOS implementation is presented to demonstrate the feasibility of RALA. As shown in Fig. 7(a), the CDL consists of resistors and nMOS gate capacitors. The value of gate capacitance is chosen to be 2 fF, and the value of resistance is determined to satisfy the required timing gap, in this study 10 ps, between two neighboring tapped clock signals. The operation of the WTAC is as follows. True-line, false-line, OUT, and $\overline{\mathrm{OUT}}$ are precharged high when the clock is low. During the precharge period, and are turned off, and and are also turned off. If the voltage of the true-line falls down by , is turned on, and OUT starts to fall. , whose gate is connected to OUT, is turned off, and is turned on. Therefore, $\overline{\mathrm{OUT}}$ is isolated from the false-line and stays high with the help of .
The simulation results with 0.18-μm CMOS model parameters are shown in Fig. 7(b). Among the inputs, only is set to 1. Once the clock goes high, the clock signal travels along the CDL, and after 30 ps, the nMOS switch of input receives the clock signal. The termination switch receives the clock signal after 40 ps. The WTAC detects this 10-ps difference to generate OUT = 1 and $\overline{\mathrm{OUT}}$ = 0. As shown in Fig. 7(b), the WTAC detects the timing difference successfully.
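To make the choice of CDL resistance concrete, the following sketch sizes the resistors of an RC ladder with a first-order Elmore delay estimate. The Elmore model, the four-tap ladder, and the function names are assumptions introduced here; the text only states that the resistance is chosen to give a 10-ps gap with 2-fF tap capacitors, and a real design would be tuned by circuit simulation.

```python
def cdl_resistances(n_taps, gap_s=10e-12, c_tap_f=2e-15):
    """Resistor values (ohms) giving an equal Elmore delay step per tap.

    Resistor k (1-based, counted from the clock source) charges the
    capacitors of taps k..n_taps, so its contribution to the delay of
    every downstream tap is R_k * C * (n_taps - k + 1).
    """
    return [gap_s / (c_tap_f * (n_taps - k + 1))
            for k in range(1, n_taps + 1)]

def elmore_delays(resistances, c_tap_f=2e-15):
    """Cumulative Elmore delay (seconds) from the clock source to each tap."""
    n = len(resistances)
    delays, total = [], 0.0
    for k, r in enumerate(resistances, start=1):
        total += r * c_tap_f * (n - k + 1)
        delays.append(total)
    return delays

rs = cdl_resistances(4)
print([round(r) for r in rs])
print([round(d * 1e12, 3) for d in elmore_delays(rs)])
```

In this estimate the resistors near the clock source come out smaller because they charge more downstream capacitance. The model ignores the driver resistance and the loading of the switch gates.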
In order to estimate the speed limitations of the RALA, Monte Carlo simulation is performed, taking the process Fig. 7(a) . To highlight the mismatch effect, the input pattern or 0 1 1 1 1 is selected, with which three switches pull down the voltage of the true-line and only two switches pull down that of the false-line that must be discharged earlier than the true-line. Fig. 8 shows the malfunction probability measured with respect to the timing gap in the CDL. As shown in this simulation, even 5 ps can be used to maximize its operation speed. In this study, we choose 10 ps for ease of design.
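The trend of malfunction probability against timing gap can be illustrated with a purely behavioral Monte Carlo sketch. It is not the circuit-level simulation of the study and does not reproduce Fig. 8: mismatch is reduced to Gaussian jitter on the two arrival times, and the 2-ps standard deviation, the trial count, and the function name are hypothetical values chosen only for illustration.

```python
import random

def malfunction_probability(gap_ps, sigma_ps=2.0, trials=20000, seed=0):
    """Fraction of trials in which the later line reaches the WTAC first.

    sigma_ps is an assumed standard deviation of arrival-time jitter,
    standing in for device mismatch; it is not a measured value.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        early = rng.gauss(0.0, sigma_ps)      # line that should win
        late = rng.gauss(gap_ps, sigma_ps)    # line that should lose
        if late < early:
            errors += 1
    return errors / trials

for gap in (0.0, 2.5, 5.0, 10.0):
    print(gap, malfunction_probability(gap))
```

With these assumptions the error rate starts near one half at zero gap and drops quickly as the gap grows, which is the qualitative behavior a gap-versus-malfunction curve is expected to show.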
To improve the mismatch robustness and noise immunity, several circuit techniques can be applied as shown in Fig. 9 . When the true-line discharges first, the nMOS gate of the false-line, in Fig. 9 , is turned off, so that the load capacitance on the false-line is diminished because the junction capacitance of and is isolated from the false-line. This capacitance reduction on the false-line slightly helps the discharging of the false-line, and this can cause the malfunction of the WTAC. To compensate for the isolation, a replica circuit that balances the capacitance of each race line can be added. The optimized sizing of the input switches also improves robustness and noise immunity. In the worst case, the switch may discharge the false-line alone and all of the true-line switches may discharge the true-line. Even though the discharging of the false-line started earlier than that of the true-line, the discharging of the true-line overrides that of the false-line, which may result in the malfunction of the WTAC. As the number of input switches increases, the situation becomes worse. This problem can be resolved by adjusting the sizes of the input switches. If the size is optimized to make the early-triggered switch discharge the race line earlier and more strongly, the operation of the WTAC can be more stable.
IV. COMPARISON WITH CONVENTIONAL LOGIC FAMILIES
The carry generation circuits of the carry-look-ahead adder are implemented by RALA, dynamic logic, DCVSPG, and SSDL. The comparisons focus on speed, area, and power consumption. 0.18-μm CMOS model parameters are used for the simulation.
In the delay time comparison, no other logic family is found to be faster than the race logic, as shown in Fig. 10(a). Compared to the DCVSPG logic, which is the fastest among the conventional schemes in this comparison, the race logic shows a 50% speed improvement. The area is compared in terms of the total width of the MOSFETs. We assume that the resistor array in the CDL of the race logic is made of polysilicon without salicide. The area occupied by one resistor of the resistor array is taken to be the same as that of the smallest nMOS. In the area comparison, the area of the race logic is similar to that of the dynamic logic, as shown in Fig. 10(b). In the power comparison shown in Fig. 10(c), the race logic consumes 26% more power than the DCVSPG logic, which consumes the lowest power. A circuit implemented by race logic has two dynamic nodes, the true-line and the false-line. These two lines are precharged and discharged at every clock cycle, and this consumes dynamic power. However, the power consumption of the race logic is smaller than that of the dynamic logic even though it also has dynamic nodes. This is because the transistors related to the dynamic nodes of the race logic are of the smallest size. The race logic is so fast that it does not need to enlarge the transistor width. The delay-area-power product is shown in Fig. 10(d), and the race logic shows the best performance.
V. CMOS ADDER IMPLEMENTATION
To verify the feasibility and functionality of RALA, a 64-bit carry-look-ahead adder is designed and fabricated by 0.25-μm six-metal CMOS technology. The carry generation logic of this adder is designed on the basis of RALA. The architecture of the adder is illustrated in Fig. 11. This carry-look-ahead adder consists of eight 8-bit subadders. Each subadder has one carry generation circuit, which has one gate depth. If the carry generation circuit is implemented by conventional logic styles, it must have several gate stages. The race logic is one type of the clocked nMOS logic, and the inputs of the race logic must be 0 during the precharge phase. The p and g generators, the prestage of the carry generator, are designed by dynamic logic, so that the outputs are precharged to low during the precharge phase. The sum generation circuit is designed by conventional static logic style.
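One way to see why carry generation fits a single race is sketched below. The switch assignment is hypothetical, since this text does not state how p and g are placed on the race lines. The sketch assumes that bits are triggered from the most significant bit down, that a generate condition drives the true-line, that a kill condition (both operand bits 0) drives the false-line, and that the carry-in decides the race if every bit propagates.

```python
def race_carry(a, b, carry_in, width=8):
    """Carry-out of a width-bit addition, evaluated as a single race."""
    for i in reversed(range(width)):       # MSB switch is triggered first
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                        # generate: true-line wins
            return 1
        if not (ai | bi):                  # kill: false-line wins
            return 0
        # propagate: neither line is asserted, the clock moves on
    # Every bit propagated: the carry-in (followed by a termination
    # switch on the false-line) decides the race.
    return carry_in

print(race_carry(0xFF, 0x01, 0))
print(race_carry(0x7F, 0x80, 0))
```

The first asserted switch decides the result, which is the same priority structure as the expanded carry-look-ahead equation. This is why a single race can replace several conventional gate stages.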
The carry generator designed by race logic certainly does not waste wire delay. As depicted in Fig. 12, the timing difference between two neighboring signals is determined by the clock delay in the CDL plus the wire delay. This means that WTAC requires the timing difference and it can be the wire delay in the race lines instead of the clock delay in the CDL. Therefore, the delay time in the race lines that is inevitable because of the geometry of the layout can be used effectively instead of being wasted. In the actual layout, the length of the race lines is about 100 μm in each carry generation circuit.
The layout of WTAC is carefully handcrafted to maintain symmetry as shown in Fig. 13. Polycide with salicide blocking is used as a CDL resistor. The active area of the adder is 800 μm × 150 μm. The delay time from the clock to Sum31 measures 0.9 ns. The photograph of the adder and measured waveform are shown in Fig. 14(a) and (b), respectively.
VI. CONCLUSION
A novel logic concept, Race Logic Architecture (RALA), is proposed. RALA realizes logic operations with signal racing between two race lines instead of the actions of logic gates. RALA overcomes the transition delay times of logic gates in conventional logic circuits, and rather than wasting the propagation delay of routing wires, it utilizes that delay for logic operations. In this regard, RALA is a promising concept, because the delay caused by the parasitic capacitance of routing wires becomes more serious as fabrication technology develops.
RALA consists of three parts: the clock distribution line (CDL), the race lines, and the winner-take-all circuit (WTAC). The CDL generates sequential trigger signals for input variable switches. Triggered signals start racing on two race lines, the true-line and the false-line. The winner-take-all circuit determines which signal arrives earlier.
CMOS circuits of the carry generation function of a general carry-look-ahead adder are implemented by RALA and by various other logic families. The circuit implemented by RALA shows good performance in terms of speed and area, and also shows reasonable power consumption.
Using RALA, a 64-bit carry-look-ahead adder was fabricated by 0.25-μm CMOS technology to confirm its feasibility. Its measured delay time from the clock to Sum31 was 0.9 ns.
