We propose several synchronous counter designs that have high counting and sampling rates and low cost at the same time. We first present carry-select counters, which improve the maximum counting and sampling rates of previous counters based on carry anticipation by a factor of about 2 while requiring similar cost, or reduce the hardware cost of the fastest counters proposed thus far by a factor of about 2 while achieving a comparable counting/sampling rate. We then propose a novel technique called postponed readout to further reduce the counting/sampling period to the delay of a 2-input AND gate plus the time for loading a flip-flop, while requiring similar cost. The resultant counting/sampling rate is competitive with the fastest previous designs and is achieved at a hardware cost that is lower by a factor of about 2. The price paid is that the count is read out 2 or 3 cycles later (depending on the length of the counter), instead of 1 cycle in previous synchronous counters.
Introduction
Counters are sequential circuits that keep track of the number of pulses applied on their inputs. They are among the most widely used components in digital systems, with applications in computer systems, communication equipment, scientific instruments, and industrial process control, to name a few. A vast variety of counter designs have been proposed in the literature [2, 8, 10, 11, 13, 14, 15, 17], patented [1, 3, 5], and/or used in practice [6, 9, 12]. They can be classified into synchronous counters, such as ring counters and twisted ring counters [6], and asynchronous counters, such as ripple counters [12]. In many applications, synchronous counters are required or preferred. Other criteria for the performance of counters include their counting rate, sampling rate, and hardware cost.
Many conventional synchronous counter designs are based on an incrementer (i.e., an adder with addend 1) and a register [9, 12]. A vast variety of fast adders have been proposed in the literature [6, 7, 12, 16, 19, 21], which can be adapted for building fast incrementers and, thus, fast counters. However, full carry propagation is required in each increment cycle in such designs, leading to counting periods lower-bounded by $\Omega(\log m)$ for modulo-m counters, even when the fastest adders are used.
To further improve the performance, several constant-time counters have been proposed. Binary counters proposed by Vuillemin [18] have a counting period equal to the delay of two 3-input gates plus the loading time of a flip-flop, and have cost equal to $\log_2 m$ flip-flops and half adders (HAs) plus a small number of AND gates. They use about half as many flip-flops as the constant-time designs given in [4], but their maximum counting/sampling rates are only half those of the designs in [4], which have counting/sampling periods equal to the delay of an HA plus the loading time of a flip-flop.
In this paper, we first propose carry-select counters, whose counting/sampling rates are approximately double those of the designs proposed in [18], while achieving similarly low cost. We then propose a novel technique called postponed readout to further increase the counting and sampling rates. In a synchronous counter with postponed readout, the count read out is actually the total number of pulses up to several clock cycles ago, but the counter is synchronous in the sense that all count bits appear at the output at the same time, so that we can continue inputting counting signals while reading out the count. We apply the technique to carry-select counters and obtain carry-select counters with postponed readout (CSC/PRs), which further reduce the counting/sampling periods to the delay of a 2-input gate plus the loading time of a flip-flop. CSC/PRs are the fastest synchronous counters proposed in the literature thus far. The price paid is that the count of a CSC/PR appears at the output 2 or 3 cycles (depending on counter range) after the input is applied to the counter, instead of 1 cycle later. Carry-select counters and CSC/PRs are the only designs reported in the literature thus far that achieve near minimal counting/sampling period and hardware cost at the same time. Our single-input counter designs can be combined with the techniques we proposed in [20] to derive efficient multi-input counters.

0-7803-6514-3/00/$10.00 © 2000 IEEE
Carry-Select Counters (CSCs)
In this section we present carry-select counters, a variant of the binary counters proposed in [18] that increases the maximum counting and sampling rates by a factor of about 2.
Short Carry-Select Counters
In Fig. 1 we present a 12-bit carry-select counter. The counter is composed of three sections with 1, 2, and 9 bits respectively, starting with the least significant bit of the count. If the number of bits required is smaller than 12, we simply remove the flip-flops, and the associated HAs and AND gates, that are not needed. Each of the sections includes an incrementer, which is implemented as a carry-ripple adder with addend 1, the same as a carry-ripple counter with input 1 or a carry-anticipate counter [18]. The enable signal of each section is the logical AND of the carry-out signals from all the previous sections.
The minimum counting period of carry-select counters is the delay of a 3-input AND gate plus the time required for loading a flip-flop. To verify the correctness of the counter in Fig. 1 for such a minimum counting period, we first consider the case where bit b1 and possibly bit b2 change at time 0. It can be seen that the earliest time for any bits in b3 b4 b5 ... b11 to change again is time 2, since bits b1 and b2 will not change value before that, where a time unit is assumed to be equal to the counting period. Therefore, the requirement for carry c3 is that it has to become stable before time 2. The signals corresponding to the new values of c0 and b0 are fed directly to the AND gate for carry c3, so they will not cause any problem. Let us now consider the signal corresponding to the new value of section-carry s2. Section-carry s2 will become stable before time 1, and the signal corresponding to the new value of section-carry s2 will in turn propagate through the 3-input AND gate and make carry c3 stable before time 2. From this example, we can see that carry c3 is always available in time after a change in b1 and/or b2.

Let us now examine whether b11 can be updated correctly. Consider another case where bit b3 and an arbitrary subset of bits b4 b5 b6 ... b11 change their values at time 100. It can be seen that the earliest time for any bits in b3 b4 b5 ... b11 to change their values again is time 108, since bit b3 will not change value before the counter receives another 8 inputs. Therefore, the requirement for b11 is that it has to become stable before time 108. The signals corresponding to the new values of b3 b4 ... b11 will propagate through at most 7 concatenated 2-input AND gates and a 2-input XOR gate before time 108, so the new value of b11 is ready to be loaded into the flip-flop for bit b11 (if required) at time 108. Note that if the signal can propagate through more than 7 concatenated 2-input AND gates and a 2-input XOR gate within this time, we can add more bits to the third section of the counter without increasing the counting period.
We illustrate the timing diagram of the 12-bit carry-select counter in Fig. 2. It can be seen that the carry-select counter functions correctly with the aforementioned counting period.
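The sectioned structure described above can be illustrated with a small cycle-level sketch. It is a functional model only: gate delays are ignored, so it checks that the 1/2/9 partition counts like an ordinary binary counter but says nothing about the timing argument. The function names `carry_select_step` and `to_int` are hypothetical and introduced here for illustration.

```python
def carry_select_step(bits, sections, c0=1):
    """Advance a sectioned counter by one clock cycle (zero-delay model).

    bits     -- list of 0/1 values, bits[0] is the least significant bit
    sections -- section lengths starting at the LSB, e.g. (1, 2, 9)
    c0       -- counting input for this cycle (1 = count, 0 = hold)
    """
    new_bits = list(bits)
    enable = c0              # the first section is enabled by the input itself
    start = 0
    for length in sections:
        carry = enable       # ripple incrementer inside the section
        section_carry = 1    # becomes 0 unless every bit of the section is 1
        for i in range(start, start + length):
            new_bits[i] = bits[i] ^ carry   # half-adder sum
            carry = carry & bits[i]         # half-adder carry
            section_carry &= bits[i]
        # Enable of the next section: c0 AND all previous section-carries.
        enable &= section_carry
        start += length
    return new_bits


def to_int(bits):
    """Interpret a little-endian bit list as an unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


SECTIONS = (1, 2, 9)                 # the 12-bit example
state = [0] * sum(SECTIONS)
for n in range(5000):
    assert to_int(state) == n % 4096  # matches a modulo-2^12 binary counter
    state = carry_select_step(state, SECTIONS)
print(to_int(state))                 # 5000 mod 4096 = 904
```

All next-state values are computed from the old bit values, which mirrors the synchronous update of the flip-flops.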
Long Carry-Select Counters
The design in Fig. 1 can be extended to obtain a k-bit carry-select counter for any $13 \le k \le 2^{31} + 6$. Such a counter is composed of five sections with 1, 1, 3, 31, and k - 36 bits respectively for the count when $k \ge 39$, and is composed of four sections with 1, 1, 3, and k - 5 bits respectively when $13 \le k \le 38$. Each of the sections includes an incrementer, and the enable signal of each section is the conjunction of the input c0 and the section-carry bits from all the previous sections. The main difference between the counter in Fig. 1 and the extended design is that we now use more than one level of AND gates to keep their fan-ins equal to 3 or less. Signals c0 and s0 have to be fed directly to the AND gate at the last level since their values may change during every cycle. Other section-carry bits can be fed to the AND gate at the first level since the lengths of the corresponding sections are designed to be short enough for the section-carry bits to propagate through several levels of AND gates before the resultant carry bit is needed.
The minimum counting period of a long carry-select counter is still the delay of a 3-input AND gate plus the time required for loading a flip-flop. The correctness for the first three sections and bits b5 b6 ... bk can be verified as in the preceding examples. Let us now verify that carry bit c5 will always become stable in time. Consider the case where bit b2 and possibly bit b3 and/or bit b4 change at time 0. It can be seen that the earliest time for any bits in b5 b6 ... b35 to change again is time 4, since bits b2 b3 b4 will not change before that. Therefore, the requirement for carry bit c5 is that it has to become stable before time 4. Section-carry s4 will become stable before time 2, since the signals corresponding to the new values of b2 b3 b4 updated at time 0 only need to propagate through at most two concatenated 2-input AND gates; the signal corresponding to the new value of section-carry s4 will in turn propagate through two 3-input AND gates and make carry c5 stable before time 4. The propagation of signals for all other inputs for computing carry bits can be verified in a similar manner.
A k-bit carry-select counter only requires k flip-flops, slightly fewer than k HAs, as well as several inverters and AND gates. The number of flip-flops required is approximately half that of the design proposed in [4]. Therefore, carry-select counters achieve near maximal counting/sampling rate and near minimal hardware cost at the same time.
The design can be easily generalized to even longer counters and to carry-select counters with different section sizes, which is of practical importance. In general, as long as the length of a section is smaller than $2^{k'} - l + 1$, the counting rate remains the same, where $k'$ is the total number of bits before the section and $l$ is the maximum number of levels the section-carry of that section has to propagate through to generate the carry bit(s) for subsequent section(s). For example, to obtain a 128-bit carry-select counter, we can use section sizes 1, 1, 3, 8, 115. An advantage of having more bits in the last section is that the subcircuit for the last section of such counters can use considerably slower logic (e.g., by a factor of 64) without reducing the counting rate. Such designs may lead to lower-power and/or lower-cost designs.
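The section-length rule can be checked numerically with a short sketch. It assumes the bound reads 2^k' - l + 1, with k' the number of bits before the section and l the number of AND-gate levels the section-carry passes through; the helper names are hypothetical, and taking l = 0 for a last section (whose carry-out feeds no later section) is an assumption made here.

```python
def bits_before_each_section(sections):
    """Return k' for every section: the number of lower-order bits before it."""
    offsets, total = [], 0
    for length in sections:
        offsets.append(total)
        total += length
    return offsets


def max_section_length(bits_before, levels):
    """Bound 2^k' - l + 1 on the section length (as read from the text above)."""
    return 2 ** bits_before - levels + 1


sections = (1, 1, 3, 8, 115)            # the 128-bit example
offsets = bits_before_each_section(sections)
print(sum(sections), offsets)           # 128 [0, 1, 2, 5, 13]

# Third section of the 12-bit counter: 3 bits before it, no later section.
print(max_section_length(3, 0))         # 9

# Cycles available to the last 128-bit section per bit of ripple length.
print(2 ** offsets[-1] // sections[-1])  # 71
```

The last figure, 8192 cycles for a 115-bit ripple chain, is of the same order as the factor of 64 quoted for slower logic in the last section.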
Note that in practice the length of a section can be larger than the preceding restriction since a 2-input AND gate requires a delay smaller than a clock cycle. The large fan-out of the AND gate feeding larger sections can be reduced by replicating the AND gate. For example, the fan-out of the AND gate producing c5 can be reduced from 31 to 8 by simply using four AND gates, each producing c5 for 8 bits of the following section.
CSC with Postponed Readout
Carry-select counters with postponed readout (CSC/PRs) further reduce the counting period and sampling period to the delay of a 2-input gate plus the loading time of a flip-flop. However, the count read out from a CSC/PR is actually the total number of pulses up to several clock cycles ago. Therefore, carry-select counters with or without postponed readout are complementary and useful to different systems, depending on application requirements.
In Fig. 3 we present the design of a k-bit CSC/PR for any
The counter is composed of four sections with 1, 2, 8, and k - 11 bits for the count when $k \ge 13$, and is composed of three sections with 1, 2, and k - 3 bits when $4 \le k \le 12$. Each of the sections also includes an incrementer, and the enable signal for each section is still the logical AND of the carry-out signals from all the previous sections. There are three major differences between the structures of carry-select counters and CSC/PRs: (1) each bit of the first two sections of a CSC/PR will go through two levels of latches before readout, and each bit of the third section will go through one level of latches; (2) latches are added between two levels of AND gates for computing carry bits c3 and c11, and signals corresponding to c0, s0, and s2 go through a level of latches before being fed into the AND gates for computing carry c11; (3) the count bits b0 b1 b2 go through two levels of latches, and the count bits b3 b4 ... b10 go through one level of latches, before readout. The advantage of CSC/PRs is that their minimum counting period is reduced to the delay of a 2-input AND gate plus the time required for loading a flip-flop. The hardware cost of CSC/PRs is only slightly increased compared to the original carry-select counters since only a few additional latches are added to a CSC/PR for the first few bits and the calculation of carry bits. Therefore, the CSC/PR has the fastest counting and sampling rate among the synchronous counters proposed in the literature thus far, and requires near minimum hardware cost at the same time. Such counters, which we refer to as synchronous counters with postponed readout, provide the count as it was up to 3 cycles ago, and thus add up to 2 cycles to the usual 1-cycle delay.
Note that the sampling rates of carry-select counters, CSC/PRs, and synchronous counters with postponed readout in general are the same as their counting rates. Note also that the proposed postponed readout technique is different from the delayed readout mechanism we presented in [11], since the latter was mainly designed for combinational circuits and would lead to a smaller sampling rate if applied to pipelined circuits.
The correctness for the count bits can be verified as in the first example in Section 2. Let us now verify the correctness of carry bits in CSC/PRs. Consider the case where bit b1 and possibly bit b2 change at time 0. It can be seen that the earliest time for any bits in b3 b4 b5 ... b10 to change again is time 2. Section-carry s2 will become stable before time 1, and the corresponding signal will then propagate through the 2-input AND gate and be stored during time 2 to 3 at the latch before the input of the AND gate for carry bit c3. This latched value together with the latched value for input c0 from the previous cycle (time 1 to 2) will generate the value for carry bit c3, which will be used to update the count bits b3 b4 ... b10 at time 3. Note that this update is delayed by one clock cycle. Similarly, we can verify that carry bit c11 will become available to update the count bits with a delay of two cycles. Therefore, the count bits from the first two sections, which do not experience any additional delay, have to be delayed by two cycles before readout so that they are read at the same time as the corresponding count bits from the fourth section. Similarly, the count bits from the third section have to be delayed by one cycle before readout. The sampling rate is not reduced due to the postponed readout since the count can be read during every clock cycle.
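The alignment argument above can be illustrated with a behavioral sketch. It is an abstraction rather than a gate-level model: it assumes that at cycle t the first two sections hold bits of the current count, the third section lags by one cycle, and the fourth by two, and it then adds 2, 2, 1, and 0 readout latches respectively. The 16-bit width and all names are hypothetical choices made for illustration.

```python
SECTIONS = (1, 2, 8, 5)          # 16-bit example: 1, 2, 8, and k - 11 bits
UPDATE_DELAY = (0, 0, 1, 2)      # extra cycles before each section is updated
READOUT_LATCHES = (2, 2, 1, 0)   # latch levels inserted before readout


def assemble(counts, t, latches):
    """Build the word seen at cycle t from per-section delayed bit fields.

    counts[t] is the true count after cycle t; each section contributes its
    own bit field taken from the count (update delay + latches) cycles earlier.
    """
    word, shift = 0, 0
    for length, delay, extra in zip(SECTIONS, UPDATE_DELAY, latches):
        source = max(t - delay - extra, 0)
        field = (counts[source] >> shift) & ((1 << length) - 1)
        word |= field << shift
        shift += length
    return word


counts = list(range(300))        # one pulse per cycle, so counts[t] == t

# Without readout latches the sections reflect different counts.
print(assemble(counts, 8, (0, 0, 0, 0)))   # 0, not a valid recent count

# With the latches, every section reflects the count from two cycles earlier.
assert all(assemble(counts, t, READOUT_LATCHES) == counts[t - 2]
           for t in range(2, 300))
```

In this model the total delay is the same for every section, so the word read out is always a consistent, if slightly old, count.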
Multi-Input Counters
Fig. 4 illustrates the design for a k-bit multi-input counter [20]. The right part of the multi-input counter is a short accumulative parallel counter [11, 20], which is essentially a pipelined parallel counter followed by a pipelined ripple-carry adder. The full-adder (FA) sum and carry outputs are connected to latches; the carry-out of the leftmost FA in each group that forms a ripple-carry adder is connected to a second latch before going to the next level. The left part of the multi-input counter is a carry-select counter (without postponed readout). We add 2 latches to the most significant sum bit of the pipelined ripple-carry adder in the last level, 3 latches to the second most significant sum bit, 4 latches to the third, and so on. Then the outputs of the carry-select counter and those of the last latches of the various sum bits collectively represent the count of the multi-input counter.
For an n-input counter, the count will be available for readout after $2\lceil \log_2 n \rceil$ cycles. For example, in the particular multi-input counter depicted in Fig. 4, the number of inputs is 16, so the count will appear at the lowermost sum latches and counter outputs after 8 cycles, no matter how large k is. Note that the readout delay $2\lceil \log_2 n \rceil$ is very close to the minimum possible. The counting and sampling period of the multi-input counter in Fig. 4 is the delay of one FA (plus the loading time of a latch). Therefore, there is no need to use a CSC/PR, and a carry-select counter is usually fast enough for the left part of the design.
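As a quick check of the readout-delay expression, the following sketch evaluates 2⌈log2 n⌉ with integer arithmetic. The function name is hypothetical.

```python
def readout_delay_cycles(num_inputs):
    """Readout delay 2 * ceil(log2(n)) of an n-input counter, in clock cycles."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be a positive integer")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return 2 * (num_inputs - 1).bit_length()


print(readout_delay_cycles(16))   # 8, the 16-input example
print(readout_delay_cycles(17))   # 10
```

Using `bit_length` avoids floating-point rounding for large n.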
We may parallelize high-frequency sequential counting signals to allow the use of relatively slow, and thus low-cost, compact, and/or power-efficient, multi-input counters.
For example, 800 MHz incoming signals, when demultiplexed 32 ways, are transformed into a set of 25 MHz signals that can be handled by non-speed-critical components. In this way, a complex circuit that would have to be designed by paying meticulous attention to speed optimization at every juncture (with the attendant overheads in design time, testing effort, VLSI area, and power consumption) is replaced by a simple high-speed front end feeding a lower-grade main part. With the unyielding pursuit of high-throughput and low-power digital systems, this type of "demultiplexed" computation has become the norm in certain areas (notably data communications) that, as recently as a decade ago, used to rely on "multiplexed" hardware for reasons of economy.
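The demultiplexing arithmetic can be sketched as follows. The round-robin distribution of pulses is an assumption made for illustration, and the function names are hypothetical.

```python
def lane_rate_mhz(input_rate_mhz, ways):
    """Signal rate seen by each lane after demultiplexing."""
    return input_rate_mhz / ways


def demultiplex(pulses, ways):
    """Deal a serial 0/1 pulse stream round-robin into `ways` slower lanes."""
    lanes = [[] for _ in range(ways)]
    for i, pulse in enumerate(pulses):
        lanes[i % ways].append(pulse)
    return lanes


print(lane_rate_mhz(800, 32))     # 25.0

# The total count is preserved: a multi-input counter sums the lane pulses.
stream = [1, 0, 1, 1] * 64
lanes = demultiplex(stream, 32)
assert sum(sum(lane) for lane in lanes) == sum(stream)
```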
By using a (twisted) ring counter, or another type of prescaler, as the first section and following it by a carry-select counter, we can obtain ring-select counters that combine the respective advantages of both. The postponed readout technique can also be incorporated into such designs. The details are omitted in this paper.
Conclusion
In this paper, we have presented counter designs based on carry-select and carry-select with postponed readout. These are the only counters reported in the literature thus far that achieve near maximal counting/sampling rate and near minimal hardware cost at the same time. Carry-select counters and CSC/PRs can be combined with short accumulative parallel counters to form long multi-input counters to implement counting with low power, low cost, or high performance. They can also be combined with ring counters or other prescalers. Moreover, the proposed postponed readout technique can be applied to many counter designs to obtain various synchronous counters with postponed readout.
