Abstract-We present a new comparator design featuring wide-range and high-speed operation using only conventional digital CMOS cells. Our comparator exploits a novel scalable parallel prefix structure that leverages the comparison outcome of the most significant bit, proceeding bitwise toward the least significant bit only when the compared bits are equal. This method reduces dynamic power dissipation by eliminating unnecessary transitions in a parallel prefix structure that generates the N-bit comparison result after $\log_4 N + \log_{16} N + 4$ CMOS gate delays. Our comparator is composed of locally interconnected CMOS gates with a maximum fan-in and fan-out of five and four, respectively, independent of the comparator bitwidth. The main advantages of our design are high speed and power efficiency, maintained over a wide range. Additionally, our design uses a regular reconfigurable VLSI topology, which allows analytical derivation of the input-output delay as a function of bitwidth. HSPICE simulation for a 64-b comparator shows a worst case input-output delay of 0.86 ns and a maximum power dissipation of 7.7 mW using 0.15-µm TSMC technology at 1 GHz.
Several prior designs [9]-[13] use subtractors in the form of flat adder components, but these designs are typically slow and area-intensive, even when implemented using fast adders [14]-[16]. Other comparator designs improve scalability and reduce comparison delays using a hierarchical prefix tree structure composed of 2-b comparators [17]. These structures require $\log_2 N$ comparison levels, with each level consisting of several cascaded logic gates. However, the delay and area of these designs may be prohibitive for comparing wide operands.
The prefix tree structure's area and power consumption can be improved by leveraging two-input multiplexers (instead of 2-b comparator cells) at each level and generate-propagate logic cells on the first level (instead of 2-b adder cells), which takes advantage of one's complement addition [18] . Using this logic composition, a prefix tree requires six levels for the most common comparison bitwidth of 64 bits, but suffers from high power consumption due to every cell in the structure being active, regardless of the input operands' values. Furthermore, the structure can perform only "greater-than" or "less-than" comparisons and not equality.
To improve the speed and reduce power consumption, several designs rely on pipelining and power-down mechanisms [19] to reduce switching activity [20], [21] with respect to the actual input operands' bit values. One design uses all-N-transistor (ANT) circuits to compensate for high fan-in with high pipeline throughput [22]. A 64-b comparator requires only three pipeline cycles using a multiphase clocking scheme [23]. However, such a clocking scheme may be unsuitable for high-speed single-cycle processors because of several heavily loaded global clock signals that have high-power transition activity. Additionally, race conditions and a heavily constrained clock jitter margin may make this design unsuitable for wide-range comparators.
An alternative architecture leverages priority-encoder magnitude decision logic with two pipelined operations that are triggered at both the falling and rising clock edges [24] to improve operating speed and eliminate long dynamic logic chains. However, 64-b and wider comparators require a multilevel cascade structure, with each logic level consisting of seven nMOS transistors connected in series that behave in saturating mode during operation. This structure leads to a large overall conductive resistance [16] , with heavily loaded parasitic components on the clock signal, which severely limits the clock speed and jitter margin.
Other architectures use a multiplexer-based structure to split a 64-b comparator into two comparator stages [25]. The first stage consists of eight modules that perform 8-b comparisons, with the modules' outputs fed into a priority encoder; the second stage uses an 8-to-1 multiplexer to select the appropriate result from the eight first-stage modules. This architecture uses two-phase domino clocking [14], [23], [26] to perform both stages in a single clock cycle. Since operations occur on both the rising and falling clock edges, this further limits the operating speed and jitter margin and makes the design highly susceptible to race conditions [27].
Some comparators combine a tree structure with a two-phase domino clocking structure [28] for speed enhancement. These architectures add the two inputs, after negating one input via two's complement, using the carry-out signal as the "greater-than" or "less-than" indicator (equality is not supported). Since the critical signal is the carry-out, the tree structure's adder modules are optimized to compute only the carry signal. Because the adder module is implemented using a Manchester carry chain [19], this architecture reduces the tree structure's area, power consumption, and comparison delay. However, the heavy loading of the clock signal with 64×2 gates for the precharge and evaluate phases complicates routing, constrains the long clock cycle required for two-phase clocking, and necessitates large drivers for the clock signals.
Some architectures save power by dynamically eliminating unnecessary computations using novel ripple-based structures, such as those incorporating wide-range ripple-carry adders [29]-[31]. Similarly, other energy-efficient designs [32]-[34] leverage schemes to reduce switching activity. Compute-on-demand comparators compare two binary numbers one bit at a time, rippling from the most significant bit (MSB) to the least significant bit (LSB). The outcome of each bit comparison either enables the comparison of the next bit if the bits are equal, or represents the final comparison decision if the bits are different. Thus, a comparison cell is activated only if all bits of greater significance are equal. Although these designs reduce switching, they suffer from long worst case comparison delays for wide operands.
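The compute-on-demand behavior described above can be illustrated with a small behavioral model. This is only a software sketch, not a circuit from any of the cited designs; the function name `ripple_compare` and its return convention are hypothetical and chosen for illustration.

```python
def ripple_compare(a, b, n_bits):
    """Behavioral model of an MSB-to-LSB compute-on-demand comparison.

    Returns (decision, cells_activated), where decision is one of
    ">", "<", "=" and cells_activated counts the bit positions examined.
    A position is examined only if all more significant bits were equal.
    """
    for activated, k in enumerate(range(n_bits - 1, -1, -1), start=1):
        a_k = (a >> k) & 1
        b_k = (b >> k) & 1
        if a_k != b_k:
            # The first unequal bit pair settles the comparison.
            return (">" if a_k > b_k else "<"), activated
    # Every bit pair was equal, so every cell was activated.
    return "=", n_bits
```

For operands that differ only in the LSB, the model activates all `n_bits` cells one after another, which is the long worst case delay that ripple-based designs suffer from.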
To reduce the long delays suffered by bitwise ripple designs, an enhanced architecture incorporates an algorithm that uses no arithmetic operations. This scheme [35] detects the larger operand by determining which operand possesses the leftmost 1 bit after pre-encoding, before supplying the operands to a bitwise competition logic (BCL) structure. The BCL structure partitions the operands into 8-b blocks and the result for each block is input into a multiplexer to determine the final comparison decision. Due to this BCL-based design's low transistor count, this design has the potential for low power consumption, but the pre-encoder logic modules preceding the BCL modules limit the maximum achievable operating frequency. In addition, special control logic is needed to enable the BCL units to switch dynamically in a synchronized fashion, thus increasing the power consumption and reducing the operating frequency.
To alleviate some of the drawbacks of previous designs (such as high power consumption, multicycle computation, custom structures unsuitable for continued technology scaling, long time to market due to irregular VLSI structures, and irregular transistor geometry sizes), in this paper we leverage standard CMOS cells to architect fast, scalable, wide-range, and power-efficient algorithmic comparators with the following key features.

1) Use of reconfigurable arithmetic algorithms, with total (input-to-output) hardware realization for both full-custom and standard-cell approaches, improves the longevity of our design and makes our design ideal for technology scaling and short time to market.
2) A novel MSB-to-LSB parallel-prefix tree structure, based on a reduced switching paradigm and using parallelism at each level (as opposed to a sequential approach [32]), contributes to the speed and energy efficiency of our design.
3) Use of components built from simple single-gate-level logic, with maximum fan-in and fan-out of five and four, respectively, regardless of the comparator bitwidth, makes it easy to characterize and accurately model our comparator for arbitrary bitwidths.
4) Use of combinatorial logic, with neither clock gating nor latency delay, enables global partitioning into two main pipelined stages or locally into several pipelined stages based on the number of levels. This flexibility provides area versus performance tradeoffs.

Fig. 1. Block diagram of our comparator architecture, consisting of a comparison resolution module connected to a decision module.

The remainder of this paper is organized as follows. Section II covers our comparator's operating principles and overall structure and Section III provides the design details. Section IV evaluates the area, operating speed, and power consumption of our comparator. Performance analysis and simulation results for input widths ranging from 16 to 256 bits, along with generalization to N-bit inputs, appear in Section V. Concluding remarks and suggestions for further work are provided in Section VI.
II. COMPARATOR ARCHITECTURAL OVERVIEW
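The comparison resolution module in Fig. 1 (which depicts the high-level architecture of our proposed design) is a novel MSB-to-LSB parallel-prefix tree structure that performs bitwise comparison of two N-bit operands A and B and encodes the results on two buses, the left bus and the right bus, each of which stores the partial comparison result as each bit position is evaluated, such that

$$
(\mathrm{left}_k, \mathrm{right}_k) =
\begin{cases}
(1, 0) & \text{if } A_k > B_k \\
(0, 1) & \text{if } A_k < B_k \\
(0, 0) & \text{if } A_k = B_k.
\end{cases}
\tag{1}
$$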
In addition, to reduce switching activities, as soon as a bitwise comparison is not equal, the bitwise comparison of every bit of lower significance is terminated and all such positions are set to zero on both buses; thus, there is never more than one high bit on either bus. The decision module uses two OR-networks to output the final comparison decision based on separate OR-scans of all of the bits on the left bus (producing the L bit) and all of the bits on the right bus (producing the R bit). If LR = 00, then A = B; if LR = 10, then A > B; if LR = 01, then A < B; and LR = 11 is not possible.
An 8-b comparison of input operands A = 01011101 and B = 01101001 is illustrated in Fig. 2. In the first step, a parallel prefix tree structure generates the encoded data on the left bus and right bus for each pair of corresponding bits from A and B. In this example, $A_7 = 0$ and $B_7 = 0$ encode as $\mathrm{left}_7 = \mathrm{right}_7 = 0$; $A_6 = 1$ and $B_6 = 1$ encode as $\mathrm{left}_6 = \mathrm{right}_6 = 0$; and $A_5 = 0$ and $B_5 = 1$ encode as $\mathrm{left}_5 = 0$ and $\mathrm{right}_5 = 1$. At this point, since the bits are unequal, the comparison terminates and a final comparison decision can be made based on the first three bits evaluated. The parallel prefix structure forces all bits of lesser significance on each bus to 0, regardless of the remaining bit values in the operands. In the second step, the OR-networks perform the bus OR-scans, resulting in L = 0 and R = 1, respectively, and the final comparison decision is A < B.
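The two-step operation in this example can be mirrored by a short functional model. The sketch below is sequential software, so it captures only the input-output behavior of the left and right buses and the OR-scans, not the parallel prefix hardware; the names `encode_buses` and `decide` are hypothetical.

```python
def encode_buses(a, b, n_bits):
    """Return (left, right) bus contents as lists of bits, MSB first.

    Only the most significant unequal bit position produces a 1, on the
    left bus if A_k > B_k or on the right bus if A_k < B_k. All other
    positions stay 0.
    """
    left = [0] * n_bits
    right = [0] * n_bits
    for i in range(n_bits):
        k = n_bits - 1 - i          # bit index, counting down from the MSB
        a_k = (a >> k) & 1
        b_k = (b >> k) & 1
        if a_k != b_k:
            left[i] = int(a_k > b_k)
            right[i] = int(a_k < b_k)
            break                   # lower-significance positions remain 0
    return left, right


def decide(left, right):
    """OR-scan each bus and map the LR code to a decision string."""
    l_bit = int(any(left))
    r_bit = int(any(right))
    # LR = 11 cannot occur because at most one bus carries a 1.
    return {(0, 0): "A = B", (1, 0): "A > B", (0, 1): "A < B"}[(l_bit, r_bit)]
```

For the operands of the example, A = 01011101 (93) and B = 01101001 (105), the model places a single 1 on the right bus at bit position 5 and yields LR = 01.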
We partition the structure into five hierarchical prefixing sets, as depicted in Fig. 3, with the associated symbol representations in Tables I and II, where each set performs a specific function whose output serves as input to the next set, until the fifth set produces the output on the left bus and the right bus. All cells (components) within each set operate in parallel, which is key to increasing operating speed while limiting transitions to the minimal set of leftmost bits needed for a correct decision. This prefixing set structure bounds the components' fan-in and fan-out regardless of comparator bitwidth and eliminates heavily loaded global signals with parasitic components, thus improving the operating speed and reducing power consumption. Additionally, the OR-network's fan-in and fan-out are limited by partitioning the buses into 4-b groupings of the input operands, thus reducing the capacitive load of each bus.
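To illustrate how 4-b groupings bound the fan-in of an OR-network, the following sketch reduces a bus with a tree of gates that each take at most four inputs and counts the resulting levels. It uses plain OR operations for clarity and is not a model of the actual gate-level network; the function name `or_tree` is hypothetical, and the input is assumed to be non-empty.

```python
def or_tree(bits, fan_in=4):
    """OR-reduce a non-empty list of bits with gates of bounded fan-in.

    Returns (result, levels), where levels is the number of gate levels
    needed when every gate takes at most `fan_in` inputs.
    """
    level = list(bits)
    levels = 0
    while len(level) > 1:
        # Each gate combines one group of up to `fan_in` neighboring signals.
        level = [int(any(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels
```

With a fan-in of four, a 64-b bus is reduced in three levels and a 256-b bus in four, so the depth grows logarithmically while no gate ever sees more than four inputs.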
III. COMPARATOR DESIGN DETAILS
In this section, we detail our comparator's design (Fig. 3), which is based on a novel parallel prefix tree (Tables I and II contain the symbol representations). Each set or group of cells produces outputs that serve as inputs to the next set in the hierarchy, with the exception of set 1, whose outputs serve as inputs to several sets. Set 1 compares the N-bit operands A and B bit-by-bit, using a single level of N cells. The set 1 cells provide a termination flag $D_k$ to cells in sets 2 and 4, indicating whether the computation should terminate. These cells compute (where
Set 2 consists of cells that each combine the termination flags of the four set 1 cells in one 4-b partition, using NOR-logic to limit the fan-in and fan-out to a maximum of four.
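A minimal sketch of how sets 1 and 2 might interact is given below. It rests on two assumptions that this section does not confirm: that the termination flag $D_k$ is 1 when $A_k$ and $B_k$ differ, and that each set 2 cell behaves as a four-input NOR over the flags of one 4-b partition. The function names are hypothetical.

```python
def termination_flags(a, b, n_bits):
    """Per-bit flags D_k, MSB first.

    Assumed polarity: D_k = 1 when A_k != B_k (comparison should terminate).
    """
    return [((a >> k) & 1) ^ ((b >> k) & 1)
            for k in range(n_bits - 1, -1, -1)]


def partition_continue_flags(flags, group=4):
    """Assumed set 2 behavior: NOR over each 4-b partition of flags.

    The output is 1 for a partition only when none of its bits differ,
    meaning the comparison must continue into less significant partitions.
    """
    return [int(not any(flags[i:i + group]))
            for i in range(0, len(flags), group)]
```

Under these assumptions, two 8-b operands that agree in their upper four bits and differ only in the LSB give partition outputs of 1 for the upper partition and 0 for the lower one.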
Set 3 consists of cells that are similar to the set 2 cells, but can have more logic levels, different inputs, and different triggering points. A set 3 cell provides no comparison functionality; the cell's sole purpose is to limit the fan-in and fan-out regardless of operand bitwidth. To limit a set 3 cell's local interconnect to four, the number of levels in set 3 increases if the fan-in exceeds four. Set 3 provides functionality similar to set 2, using the same NOR-logic to continue or terminate the bitwise comparison activity.
$$
\mathrm{Levels}_{\mathrm{set3}} = \log_{16}(N).
$$
From left to right, the first four cells in set 3 combine the 4-b partition comparison outcomes from the one, two, three, and four 4-b partitions of set 2. Since the fourth
Set 4 consists of cells whose outputs control the select inputs of the set 5 cells (two-input multiplexers), which in turn drive both the left bus and the right bus. For a set 4 cell and the 4-b partition to which the cell belongs, bitwise comparison outcomes from set 1 provide information about the more significant bits in the cell's -type cells, which
The number of inputs in the set 4 cells increases from left to right in each partition, ending with a fan-in of five. Thus, the set 4 cells determine whether set 5 propagates the bitwise comparison codes. Table III:
The output $F_k^{1,0}$ denotes the "greater-than," "less-than," or "equal to" final comparison decision
Essentially, the 2-b code $F_k^{1,0}$ can be realized by OR-ing all left bits and all right bits separately, as shown in the decision module (Figs. 2 and 3), using an OR-gate network in the form of NOR-NAND gates yielding a more efficient gate structure
The superscripts "1" and "0" in (8) and (9) denote the summation of the left and right bits, respectively, and the subscript "1" denotes the first level of OR-logic in the decision module that receives data directly from set 5. If we limit the fan-in of each gate to four, the number $L_{DM}$ of the OR-gate tree levels for the decision module is given by

$$
L_{DM} = \log_4 N. \tag{10}
$$
IV. AREA, SPEED, AND POWER EVALUATIONS
In this section, we analyze the area (in number of transistors), operating speed, and power requirements of our proposed comparator architecture and calculate the number of logic levels required for an N-bit comparator based on simple CMOS logic gates. Both faster logic structures [19] , [23] , [27] and wider zero detectors [36] may be used in the decision module. However, since this paper is focused on the architecture and arithmetic levels, enhanced circuit techniques are orthogonal and constitute potential future improvements.
A. Area Analysis
We begin by deriving the total number of cells required and use Table IV to translate the cell counts into transistors for an N-bit comparator. Based on (1)-(10), the numbers of cells required for the comparison resolution module, $C_{CRM}$, and for the decision module, $C_{DM}$, are, respectively
Table IV shows the total number of cells and the required number of levels per set for various comparator bitwidths, based on (11) and (12). The cell counts in Table IV, along with the number of transistors per cell type (Table I), allow us to derive the total number of transistors for various bitwidths (Table V). The results show approximately linear growth in comparator size as a function of bitwidth.
B. Operating Speed
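We analyze the critical path delay of our proposed comparator with N-bit inputs. The delay $D_{CRM}$ for the comparison resolution module is

$$
D_{CRM} = D_{\mathrm{set1}} + D_{\mathrm{set2}} + D_{\mathrm{set3}} + D_{\mathrm{set4}} + D_{\mathrm{set5}}. \tag{13}
$$

All terms, except the third, on the right-hand side of (13) entail a single gate delay $D_U$, resulting in

$$
D_{CRM} = (4 + \log_{16} N)\, D_U. \tag{14}
$$

The delay $D_{DM}$ for the decision module's NOR-NAND gate network is

$$
D_{DM} = (\log_4 N)\, D_U. \tag{15}
$$

The total (asynchronous) comparator delay $D_T$ from input to output for an N-bit comparator is

$$
D_T = D_{CRM} + D_{DM} = (\log_4 N + \log_{16} N + 4)\, D_U. \tag{16}
$$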
To the best of our knowledge, the total delay of (16) puts our design among the fastest comparators reported in the literature based on a basic CMOS gate circuit without any circuit level modifications. Detailed simulation-based comparisons will be provided in Section V.
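As a quick numerical illustration of the delay model, the sketch below evaluates the gate-level count $\log_4 N + \log_{16} N + 4$ stated for this design. The function name `gate_levels` is hypothetical, and the expression is evaluated without rounding, because how fractional level counts are rounded for bitwidths that are not powers of 16 is not specified here.

```python
import math


def gate_levels(n_bits):
    """Evaluate log4(N) + log16(N) + 4 for an N-bit comparator (unrounded).

    Uses log4(N) = log2(N) / 2 and log16(N) = log2(N) / 4, which is exact
    in floating point when N is a power of two.
    """
    log2_n = math.log2(n_bits)
    return log2_n / 2 + log2_n / 4 + 4
```

For N = 16 and N = 256 the expression evaluates to 7 and 10 gate levels, respectively, so a 16-fold increase in bitwidth adds only three gate levels.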
C. Power Requirements
Minimizing the switching activity reduces the average power dissipation and is considered a key enabling technique for modern low-power design [29]-[35]. In this subsection, we assess the impact of this method on power dissipation in our comparator design.
The operands activate all cells in set 1 in parallel, thus set 1 provides no power savings. Table V shows that set 1 accounts for 25% of the total transistors, and thus power dissipation, for an arbitrary comparator size.
The cells of each partition in set 2 are selectively activated in parallel (except for the most significant partition, which is always active) if the previous partition's set 1 provides no comparison decision. However, to preserve parallelism and ensure high operating speed, set 2 does not limit activity to only one cell, and accounts for 4.2% of the transistor switching activity due to set 2's share of the total transistor count.
A partition in set 3, which is composed of multilevel NOR-logic gates, is activated only if all bits of greater significance are equal. Thus, if the bitwise comparison is equal for all cells in set 1, a comparison request is sent to the next lower significant bit in set 3; otherwise, no gate activity occurs at this level. Set 3 achieves significant power savings, because set 3 uses the smallest number of gates necessary to make a final comparison decision, with only one cell per level being active. Table V shows that set 3 accounts for only 1.1% of the total switching activity.
Set 4 combines the results of set 1 and the single active cell in set 3, which incorporates the comparison outcomes of all more significant sets, to activate the cell at this bitwise position only if all more significant bits are equal. Therefore, only one cell in set 4 is active, leading to a significant reduction in power dissipation. Table V shows that set 4 accounts for 41.6% of the total transistors for an arbitrary comparator size, but since only one cell in set 4 is active, set 4 only accounts for 2.6% of the total transistor switching activity, with this share decreasing as comparator bitwidth increases.
The single activated cell in set 4 triggers the multiplexer circuit in set 5 and provides an additional reduction in power consumption. Set 5 accounts for only 1.56% of the total transistor switching activity, with this share decreasing for wider comparators.
Our comparator's worst case cell activities occur when A = 00…01 and B = 00…00 (or vice versa), and Fig. 4 depicts the number of transitions versus comparator bitwidth. For each comparator bitwidth, the first bar shows the total number of transistors and the second bar shows the number of active transistors. We note that for all comparator bitwidths, less than half of the transistors are active, making the power dissipation roughly one-third of what it would be if all of the transistors were active. Our design is thus competitive with other low-power comparators while offering the additional advantages of high-speed operation and scalability.

As technology scales further, the contribution of leakage current to the overall power consumption increases. Given that our design operates at the threshold voltage level and considering that dynamic power consumption has been reduced through circuit techniques, leakage power could become dominant (especially since every circuit component, not only the active components, contributes to the total leakage), thus overshadowing the savings achieved in dynamic power consumption via reduced activity. The worst leakage power is usually measured at the fast-fast corner with a severe temperature of 100°C [37], [38] for a single NAND gate that is built using four CMOS transistors, as depicted in Table IV, for different technology node factors. Table VII shows the results of HSPICE simulations for our proposed 64-b comparator and reveals a leakage contribution of only 0.6%, 1.7%, and 4.3% with respect to the total power at 0.15 μm, 0.13 μm, and 90 nm, respectively, as compared to Table VI. This modest increase in the leakage power percentage is due to our design's small sizes and local cell interconnects with very limited fan-out and fan-in, as well as the absence of global routing and ratioed dynamic sizes; therefore, leakage power will not impact our power-saving method in near-future technologies.
The average power consumption values are significantly better, given that when the probability of reaching a decision at each bit position is 50%, the expected number of positions examined before reaching a decision is only two. 
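The expected value quoted above follows from a geometric argument, which the sketch below checks exactly. It assumes independent bit positions that each differ with probability 1/2; the function name `expected_positions_examined` is hypothetical. Position k (counting from the MSB) is examined only if the k - 1 more significant positions were all equal, so the expectation is the sum of those probabilities.

```python
from fractions import Fraction


def expected_positions_examined(n_bits, p_differ=Fraction(1, 2)):
    """Expected number of bit positions examined for n_bits-wide operands.

    Position k (1 = MSB) is examined with probability (1 - p_differ)**(k - 1),
    so the expectation is the sum of these probabilities over all positions.
    """
    p_equal = 1 - p_differ
    return sum(p_equal ** (k - 1) for k in range(1, n_bits + 1))
```

Under this assumption the result is $2 - 2^{-(N-1)}$, which is 255/128 for N = 8 and is indistinguishable from two for a 64-b comparator.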
V. SIMULATION-BASED COMPARISONS
To evaluate the functionality and performance of our comparator, we simulated the complete design with various inputs using the HSPICE simulator [39] with 0.15-μm TSMC digital CMOS technology [40] at the slow-slow corner (1.35 V at 125°C). The worst case delay was evaluated by activating the maximum number of cells, including all the least significant cells (i.e., all input operand bits were equal, except at the least significant position). We limited the N-type transistor width to 2 μm and enlarged the P-type transistor width to a maximum of 5 μm, since all cells were locally interconnected and there were no global signals that required a large driver.
Since our key objective was to maximize the operating speed, both transistor types were chosen to have the minimum channel length (i.e., 0.15 μm), given the lack of restriction on the channel length modulation for our design. The maximum measured cell delay was 0.0847 ns for the set 4 cell with a maximum fan-in of five and a maximum fan-out of one, as suggested by Table I.
We evaluated our comparator against several state-of-the-art implementations, whose structures represent recently proposed topologies and circuits targeted for high-speed operation and power savings (i.e., objectives similar to ours). Simulation results for our 64-b comparator and reported results for several other comparators [25], [28], [32], [35], [41] are shown in Table VIII. The maximum total input-to-output delay (in nanoseconds) versus input bitwidth for our comparator is shown in Fig. 5. The simulation results closely match the analytical model in Table V, showing that the number of gate levels increases at $\log_4 N + \log_{16} N + 4$.
Independent of technology scaling, our comparator offers a 40% speed advantage over the design in [28], whose number of levels increases at $\log_4 N$ plus a two's-complement stage, with each level comprising approximately three cascaded gates. Furthermore, the Cadence data sheet reported in [28] and [41] shows that the design used 14 cascaded gates with a fan-out of four for a 64-b comparator, which operates at a slower speed than our design, which uses eight cascaded gates with a maximum fan-out of four. Additionally, for comparators wider than 64 bits in our design, the nonlinearity in the growth rate of the number of levels becomes less significant, as evident from Fig. 5. This is due to the second-order effect of logarithmic scaling for large parameter values [4], [16].

Fig. 6 shows the maximum power dissipation versus the number of bits that must be evaluated to reach a decision for a 64-b comparator based on our design operating at 1 GHz. For example, if the two input operands have the values 11111…1 and 01111…1, only one bit needs to be evaluated for the comparison decision. As expected, the power dissipation for our comparator is always higher than that in [32], which uses one logic level per cell to evaluate each bit sequentially, thereby trading off operating speed for low power. We also observed that our comparator dissipates more leakage power than all of the alternate comparator designs due to a larger number of transistors. Taking into consideration that leakage power is on the order of nanowatts, while our savings are mainly with respect to dynamic activity, which is on the order of milliwatts, the disadvantage is not critical. Essentially, our design accepts a nanowatt-order leakage penalty in exchange for milliwatt-order savings in dynamic activity and a high operating speed.
According to Fig. 6, our proposed design consumes a maximum of 7.7 mW while operating at 1 GHz. When fewer than 28 bits must be evaluated, which is the case with probability very close to 1 for random inputs, our comparator dissipates power at a rate of 0.9 μW/MHz. When the number of evaluated bits is greater than 32, our comparator dissipates power at a rate of 4.12 μW/MHz. Our comparator operates at very low power when the number of evaluated bits ranges from 8 to 28, which makes our comparator suitable for applications with typical data-dependent completion time and a low average number of evaluated bits.
VI. CONCLUSION
In this paper, we presented a scalable high-speed low-power comparator using regular digital hardware structures consisting of two modules: the comparison resolution module and the decision module. These modules are structured as parallel prefix trees with repeated cells in the form of simple stages that are one gate level deep with a maximum fan-in of five and fan-out of four, independent of the input bitwidth. This regularity allows simple prediction of comparator characteristics for arbitrary bitwidths and is attractive for continued technology scaling and logic synthesis.
Leveraging the parallel prefix tree structure [42] for our comparator design is novel in that this design performs the comparison operation from the most significant to the least significant bit, using parallel operation, rather than rippling. Regardless of the comparator bitwidth, our structure guarantees that less than 35% of all of the transistors used in the design are active during operation. Additionally, all cells are locally interconnected, which avoids the need for large cell drivers, thus balancing all cells to a uniform transistor size.
Simulation results with standard CMOS transistor cells revealed operating speeds of 1.2 and 1 GHz for 64- and 512-b comparators, respectively, under a 0.15-μm CMOS process and worst case operands. These results translate to a 40% speed advantage over state-of-the-art fast comparators. Furthermore, simulation results confirmed our comparator's power efficiency, with a power dissipation of 0.9 μW/MHz on average and 4.12 μW/MHz in the worst case when 32 bits or more of the inputs must be evaluated.
Our simulation-based analysis of leakage power dissipation showed that, although the percentage contribution of leakage power increases with each new technology generation, the increase is not significant enough to nullify the savings in dynamic power dissipation in near-future technologies.
Future work will include additional circuit optimizations to further reduce the power dissipation by adapting dynamic and analog implementations for the comparison resolution module and a high-speed zero-detector circuit for the decision module. Given that our comparator is composed of two balanced timing modules, the structure can be divided into two or more pipeline stages with balanced delays, based on a set structure, to effectively increase the comparison throughput at the expense of increased power and latency.
