We define a new class of parallel counters, Saturating Counters, which 
Introduction
Various designs of parallel counters to be used in multiplier units and other applications have been proposed and implemented (e.g., [1, 2]). Such designs use different basic building blocks like (3,2) counters, (7,3) counters and the like [3]. An $(n,k)$ parallel counter has $n$ input bits and produces a $k$-bit binary count of its inputs that are 1. Clearly, $k$ must satisfy $2^k - 1 \ge n$, or $k \ge \lceil \log_2(n+1) \rceil$. We define here a new type of parallel counter, which we call a saturating counter. A saturating counter needs to provide the exact count of its inputs that are 1 only if this count is below a certain threshold, denoted by $T$. The exact output is less important when the number of inputs that are 1 exceeds the threshold $T$, as long as the output indicates that the threshold has been exceeded. Such a saturating counter is needed in the design of a self-test and repair circuit for large memories embedded in a system-on-a-chip. Note that the saturating counters considered here are different from those used in certain image processing applications and in microprocessors' branch prediction units. The latter normally saturate at their maximum count of $2^k - 1$ (and sometimes also at their minimum count of 0), and all other results must be exact.
The necessary number of output bits of a saturating counter, denoted by $k$, does not have to satisfy the condition $k \ge \lceil \log_2(n+1) \rceil$. Instead, the inequality which must be satisfied is $k \ge \lceil \log_2(T+1) \rceil$. In principle, $T$ can be any number smaller than $n$; however, a simpler and faster implementation can be achieved when $T$ is a power of 2. Moreover, for the application considered in this paper, if the threshold is not a power of 2 we can still employ a saturating counter with $k+1$ output bits, where $k = \lceil \log_2 T \rceil$. We will therefore focus in this paper on the special case of $[n,k]$ saturating counters with $n$ inputs, a threshold of $T = 2^{k-1}$, and $k$ output bits.
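The output-width conditions above can be checked numerically. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```python
def exact_counter_bits(n: int) -> int:
    """Smallest k with 2**k - 1 >= n, i.e. ceil(log2(n + 1))."""
    return n.bit_length()


def saturating_counter_bits(threshold: int) -> int:
    """Smallest k with k >= ceil(log2(T + 1)) for a threshold T."""
    return threshold.bit_length()


def rounded_threshold(threshold: int) -> int:
    """Smallest power of two that is >= threshold (threshold >= 1)."""
    return 1 << (threshold - 1).bit_length()


# A full 1024-input counter versus a saturating counter with threshold 8.
full_width = exact_counter_bits(1024)
saturating_width = saturating_counter_bits(8)
```

Under these definitions, a threshold that is not a power of 2 (say 6) is rounded up to 8, and the rounded threshold again needs four output bits.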
The paper is organized as follows. In Section 2 we describe the application that requires the design of a fast saturating counter. In Section 3 we present some design alternatives for saturating counters and in Section 4 we compare the delay and area of the various alternatives. Section 5 concludes the paper.
Self-Test and Repair for Embedded Memories in a System on a Chip
The high density and size of memory units, implemented either as separate ICs or as embedded memories, have resulted in an increasing number of manufacturing defects leading to low yields of high volume ICs. System-On-a-Chip (SOC) designs that contain megabits of embedded memory are now available from several companies. The manufacturing yield of these SOC products is strongly dependent on the yield of their embedded memory.
Spare memory rows and columns have traditionally been added to memory designs to replace defective rows, columns or individual cells. To perform such replacements, the defective rows, columns or cells must be identified first. In the past, dedicated external memory testers with fault diagnosis capabilities have been used. Following the identification of the defective cells, the chip is taken to a laser repair station and fuses are blown to replace faulty memory cells with spare memory cells [4].
To eliminate the costly memory tester from the chip manufacturing process, designers have started to incorporate Built-In Self-Test (BIST) circuitry into large memory units. Such circuitry is capable of executing memory tests to diagnose any error, which may be the result of either a manufacturing fault or a fault (intermittent or permanent) that occurs during the normal operation of the IC.
Designers of systems-on-a-chip have gone one step further, and several current designs include a Built-In Self-Test Diagnosis and Repair (BISTDR) circuit for the embedded memories in the SOC. The use of BISTDR not only enables permanent memory repair following manufacturing (hard repair), but also repair every time the system is powered up (soft repair). Hard repair can be done by laser-blown fuses or by writing non-volatile re-configuration flip-flops, while soft repair uses only the latter [5, 6, 7].
The process starts with a self-test operation performed internally in the memory unit. Once the faulty data bits and faulty addresses have been identified, the faulty data bits are replaced with spare data bits, and faulty words are replaced with spare words. The built-in self-repair is usually executed automatically during the power-on reset sequence of the SOC and must, therefore, be performed at system speed using the system clock. The test and repair process is done on the fly in a single cycle to avoid the need to store fault information. The Fault Diagnosis Unit (FDU) is therefore on the critical path in the BISTDR circuit since it has to identify the faulty bit(s), and make a repair decision (address or data repair) within one memory read access cycle. This repair decision is based on the number of failing data bits at the current address, and the currently available repair resources (unused spare data bits and address locations). The failing bits are determined using an array of XOR gates which compares the memory output with the expected output. This produces a bit vector whose width equals the width of the memory. A bit in this vector will have a 1 if there is a mismatch at the corresponding bit position, and a 0 otherwise.
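The XOR comparison described above can be sketched in software as follows. This is only a behavioral illustration, with memory words modeled as Python integers; the function name and argument layout are assumptions, not part of the described circuit.

```python
def mismatch_vector(read_word: int, expected_word: int, width: int) -> list[int]:
    """Bitwise XOR of the memory output and the expected output.

    Element i of the result is 1 if bit position i mismatches, 0 otherwise.
    """
    diff = (read_word ^ expected_word) & ((1 << width) - 1)
    return [(diff >> i) & 1 for i in range(width)]


# Example: a 4-bit word with one failing bit (position 1).
fails = mismatch_vector(0b1010, 0b1000, 4)
failing_bits = sum(fails)
```

The number of 1's in the returned vector is the quantity that the counter on the critical path has to produce.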
The critical path within the FDU includes a circuit that counts the number of failures, or 1's, in this bit vector. If the number of bit failures exceeds the number of spare bits (typically no larger than 8), the memory is not repairable. Therefore, it is sufficient to know the failing bit count accurately only when it does not exceed the number of spares available. The fast saturating counter we design is given a threshold $T$, where $T$ is a power of 2 equal to or slightly larger than the number of available spares. The number of inputs of the required saturating counter, $n$, is the width of the memory. Unlike stand-alone memory chips, embedded memories in SOC designs have no restriction on the data width due to pad limitations. Thus, embedded memories with a width of up to 1024 bits are commonly used in SOCs. There is therefore a need for saturating counters with as many as 1024 input bits.
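As a rough illustration of how the threshold relates to the number of spares, the sketch below models the repair decision behaviorally. The function names are hypothetical, and clamping the output at the largest $k$-bit value for counts above the threshold is an assumption made here; the text only requires that such outputs indicate that the threshold has been exceeded.

```python
def choose_threshold(spares: int) -> int:
    """Smallest power of two that is >= the number of spares (spares >= 1)."""
    return 1 << (spares - 1).bit_length()


def saturating_count(fail_vector, threshold: int) -> int:
    """Behavioral model: exact up to the threshold, clamped beyond it.

    Assumption: counts above the threshold are clamped to 2*threshold - 1,
    the largest value representable with the counter's output bits.
    """
    return min(sum(fail_vector), 2 * threshold - 1)


def is_repairable(fail_vector, spares: int) -> bool:
    """A row is repairable only if its failing bits fit in the spares."""
    threshold = choose_threshold(spares)
    return saturating_count(fail_vector, threshold) <= spares
```

For example, with 8 spares the threshold is 8, and a 1024-bit mismatch vector with 20 failing bits is reported as not repairable even though its exact count is never computed.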
Saturating Counters - Design Alternatives
An $[n,k]$ saturating counter has $n$ input bits, $k$ output bits and a threshold of $T = 2^{k-1}$. For example, a $[1024,4]$ saturating counter has 1024 inputs, a threshold of 8, and produces four output bits whose binary value equals the exact number of inputs that are 1 if there are at most eight input bits which equal 1, and indicates that the threshold has been exceeded otherwise.

A complete Wallace tree for 1024 inputs produces eleven output bits (since $\lceil \log_2 1025 \rceil = 11$) and requires 16 levels of (3,2) counters. A straightforward way to implement a $[1024,4]$ saturating counter is to use (3,2) counters in the columns with weights $2^2$, $2^1$ and $2^0$ but use only OR gates in the column with weight $2^3$. This implementation, shown in Table 1, requires 11 levels of (3,2) counters plus one level of an OR gate, assuming that two levels of OR operations in column $2^3$ can be completed in parallel to the operation of a single level of (3,2) counters in the $2^2$, $2^1$ and $2^0$ columns. Table 1 shows, for each level of the tree, the number of (3,2) or (2,2) counters required in every column, and the resulting number of intermediate results in every column. For example, in the second level of the tree, 114 (3,2) counters are used in the $2^0$ column, producing 114 intermediate bits of weight $2^0$ and 114 bits of weight $2^1$, which are added to the 115 bits generated directly in the $2^1$ column. The notation 9+2(OR) in the $2^3$ column means that two levels of OR gates are used, 9 in the first level and 2 in the second.
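The column-wise construction described above can be simulated at the bit level. The sketch below is not the simulator cited in the paper, and it makes simplifying assumptions: every leftover pair of bits is fed to a (2,2) counter, and the whole top column is OR-ed into a single bit at each level. The number of levels it reports therefore need not match Table 1; the final output value is the point of the illustration.

```python
import random


def full_adder(a, b, c):
    """(3,2) counter: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """(2,2) counter: returns (sum, carry)."""
    return a ^ b, a & b


def saturating_tree(bits, k=4):
    """Counters in the k-1 low columns, OR gates in the top column.

    Returns (output value, number of reduction levels used).
    """
    cols = [list(bits)] + [[] for _ in range(k - 1)]
    levels = 0
    while any(len(col) > 1 for col in cols):
        nxt = [[] for _ in range(k)]
        for w in range(k - 1):                      # exact columns
            col, pos = cols[w], 0
            while len(col) - pos >= 3:
                s, c = full_adder(col[pos], col[pos + 1], col[pos + 2])
                nxt[w].append(s)
                nxt[w + 1].append(c)
                pos += 3
            if len(col) - pos == 2:
                s, c = half_adder(col[pos], col[pos + 1])
                nxt[w].append(s)
                nxt[w + 1].append(c)
            elif len(col) - pos == 1:
                nxt[w].append(col[pos])             # pass the bit through
        if cols[k - 1]:                             # saturating column
            nxt[k - 1].append(int(any(cols[k - 1])))
        cols = nxt
        levels += 1
    value = sum(col[0] << w for w, col in enumerate(cols) if col)
    return value, levels


def random_inputs(n, ones, seed=0):
    """n input bits with exactly `ones` of them set, in random order."""
    bits = [1] * ones + [0] * (n - ones)
    random.Random(seed).shuffle(bits)
    return bits
```

Because the counters preserve the weighted sum and only the top column is OR-ed, the three low-order output bits equal the count modulo 8, and the top bit is 1 whenever the count is at least 8.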
Note that the implementation depicted in Table 1 will produce a result of 8 if the number of input bits which equal 1 is any nonzero multiple of 8, e.g., 16, 32 and so on, and not only when it is exactly 8. If such a situation is not allowed, a threshold of $T = 16$ can be selected. However, for the application at hand the probability of such an event occurring was deemed to be negligible. The average expected number of defective memory cells in a single row is less than 4, with a standard deviation of less than 2, making the probability of 16 defective cells in one row practically zero.
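The aliasing noted above can be seen in a purely functional model of the OR-column design. This is a sketch under the assumption that the low-order columns are exact and that every carry into the top column is OR-ed; the function name is hypothetical.

```python
def or_column_output(count: int, k: int = 4) -> int:
    """Output of an [n, k] counter whose top column uses only OR gates.

    The k-1 low-order bits are the count modulo the threshold 2**(k-1);
    the top bit is set whenever the count reaches the threshold.
    """
    threshold = 1 << (k - 1)
    low = count % threshold
    return low | (threshold if count >= threshold else 0)


# With a threshold of 8, a count of 16 is indistinguishable from a count of 8.
aliased = or_column_output(16, k=4)
# With a threshold of 16, the same count is reported as 16, which exceeds 8 spares.
resolved = or_column_output(16, k=5)
```

Counts such as 9 or 17 both yield 9, which still signals that more than eight bits failed; only the nonzero multiples of the threshold collapse onto the threshold value itself.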
In [1], Jones and Swartzlander compared the design of parallel counters using only (3,2) or (2,2) counters to designs using more complex counters like (7,3), (15,4) and (31,5). They analyzed the delay and area of different implementations and concluded that designs based on (3,2) and (2,2) counters only are generally superior. We therefore decided not to experiment with counters like (7,3), (15,4) and (31,5). However, in recent years (4;2) compressors [3] have become common in parallel multiplier designs, and very efficient implementations for them have been proposed (e.g., [8]). Consequently, we studied the possibility of using (4;2) compressors instead of (3,2) counters in one or more levels of the saturating counter. Table 2 shows that if (4;2) compressors are used in levels 1 through 5, the total number of levels is reduced from 12 to 9. A (4;2) compressor, though, has a higher delay than a (3,2) counter. However, if the delay of a (4;2) compressor is only about 50% larger than the delay of a (3,2) counter, the overall delay of the [1024,4] saturating counter still decreases when (4;2) compressors are used. Detailed delay comparisons are reported in the next section. Tables 1 and 2 were generated using an online saturating counter simulator which is available at [9].
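The delay argument above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, for simplicity, that every level other than a (4;2) level costs one (3,2) counter delay (including the final OR level), which is coarser than the detailed comparison in the paper; the function name is hypothetical.

```python
def tree_delay(other_levels: int, compressor_levels: int, ratio: float) -> float:
    """Total delay in units of one (3,2) counter delay.

    ratio is the delay of a (4;2) compressor relative to a (3,2) counter.
    """
    return other_levels + ratio * compressor_levels


basic = tree_delay(12, 0, 1.5)        # 12 levels, no compressors
with_42 = tree_delay(4, 5, 1.5)       # 9 levels, 5 of them (4;2) compressors
break_even_ratio = (12 - 4) / 5       # ratio at which both designs are equal
```

Under this simplification the design with compressors takes 11.5 unit delays instead of 12, and it remains faster as long as a (4;2) compressor is less than 1.6 times slower than a (3,2) counter.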
$(m,i,j,3)$ units
Re-examining Tables 1 and 2, one can notice that the last few stages achieve only a small reduction in the number of bits but incur a high delay. One could replace the last four stages in Table 1, which reduce the number of bits from (5,5,2,1) to (1,1,1,1), by a look-up table with $2^{5+5+2}$ entries and 4 outputs. However, a simpler and probably faster (for most technologies) solution exists which takes advantage of the saturating nature of the counter. This solution uses a special 3-column (5,5,2,3) unit, as shown in Table 3. If we wish to apply the same approach to the [1024,4] saturating counter which uses (4;2) compressors (see Table 2) [...]

An $(m,i,j,3)$ unit, shown in Figure 2, is a saturating parallel counter which receives $m$ inputs of weight $2^{k-1}$, $i$ inputs of weight $2^{k-2}$ and $j$ inputs of weight $2^{k-3}$. It produces three outputs of weights $2^{k-1}$, $2^{k-2}$ and $2^{k-3}$, where $2^{k-1} = T$ is the threshold of the saturating counter.
We restrict our discussion to the case where $2 \le j \le 3$, for which the maximum carry from the position of weight $2^{k-3}$ to the position of weight $2^{k-2}$ is 1. Thus, we have $i+1$ bits of weight $2^{k-2}$ to be added, with the carry acting as the additional bit $x_{i+1}$. $s_{k-3} = y_1 \oplus \cdots \oplus y_j$ can be replaced, if [...] The resulting Boolean expressions are: [...]

We can substitute $x_{i+1}$ into the expressions for $s_{k-2}$ and $s_{k-1}$. This would result in product terms with up to three literals in $s_{k-1}$, i.e., a fan-in of at least 3. Notice that the simplified Boolean equation for $s_{k-2}$ may in fact produce $s_{k-2} = 1$ even if the correct value is 0, but only if $s_{k-1} = 1$. Therefore, the probability of producing an output of 8 when the number of input bits which equal 1 is a multiple of 8 larger than 8 (e.g., 16) is lower than the corresponding probability for the saturating counters of the types depicted in Tables 1 and 2.
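To illustrate how such a 3-column saturating unit can behave, the sketch below uses one possible set of simplified Boolean expressions. These are assumptions made for illustration and are not the paper's own equations: here `z`, `x` and `y` name the inputs of weight $2^{k-1}$, $2^{k-2}$ and $2^{k-3}$, with at most three `y` inputs, and the threshold output is built from an OR of the `z` inputs, pairs of `x` inputs, and one `x` input combined with a pair of `y` inputs.

```python
from itertools import combinations


def unit_outputs(z, x, y):
    """Assumed simplified logic for a 3-column saturating unit.

    z, x, y: lists of bits of weight 2^(k-1), 2^(k-2), 2^(k-3); len(y) <= 3.
    Returns (s_hi, s_mid, s_lo) for the same three weights.
    """
    # Carry out of the y column: at least two y inputs are 1.
    y_carry = int(any(a & b for a, b in combinations(y, 2)))
    s_lo = 0
    for bit in y:
        s_lo ^= bit                                  # parity of the y inputs
    s_mid = int(any(x) or y_carry)                   # OR instead of an exact sum
    s_hi = int(any(z)
               or any(a & b for a, b in combinations(x, 2))
               or (any(x) and y_carry))
    return s_hi, s_mid, s_lo


def weighted_sum(z, x, y):
    """Exact input sum in units of the lowest weight 2^(k-3)."""
    return 4 * sum(z) + 2 * sum(x) + sum(y)
```

With these expressions the three outputs are exact whenever the weighted input sum is below the threshold, and the threshold output is 1 otherwise; the middle output can be 1 when the exact value would be 0, but only in cases where the threshold output is already 1.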
To calculate the delay and area of the proposed $(m,i,j,3)$ unit, some further analysis is required. Let $N$ denote the total number of signals in an implementation of $s_{k-1}$ after the first level of gates (OR gates for the $z$ inputs and AND gates for the remaining terms), and let $f$ denote the maximum fan-in allowed.
The table below shows the number of signals (after the first level of gates, i.e., OR gates for the $z$ inputs and AND gates for the $x_p x_q$ and $x_p y_s y_t$ terms) and the number of logic levels for a special case of the fan-in and of the unit parameters. For fan-in $f$ the total number of gate levels is therefore $1 + \lceil \log_f N \rceil$.
The exact benefit of using an $(m,i,j,3)$ unit instead of several levels of (3,2) and (2,2) counters is highly dependent on its circuit implementation. For simplicity, we will assume for the numerical results summarized in the next section that the $(m,i,j,3)$ unit is implemented using basic logic gates, with a delay of $\Delta$ for an OR or AND gate whose fan-in does not exceed the allowed maximum. The (3,2) and (2,2) counters are implemented using 2-input XOR gates whose delay is denoted by $\Delta_{XOR}$.

Numerical Results

Figure 3 shows the delay of an $[n,k]$ saturating counter (for $n$ = 72, 136, 264, 520 and 1032) implemented in four different ways: using (3,2) and (2,2) counters only, allowing the use of (4;2) compressors as well, and allowing all types of counters including the special $(m,i,j,3)$ unit. The latter has two implementations that differ in the maximum fan-in allowed, one of them with a fan-in of 3. Only the basic design, which is restricted to the use of (3,2) and (2,2) counters, is unique. The remaining three designs have multiple possible implementations, and the delay of the fastest implementation for that type is shown. The delays in Figure 3 [...]

Figure 4 shows that the reduction in total area due to the use of (4;2) compressors increases with the number of inputs (under the above-mentioned area ratios assumption). The use of an $(m,i,j,3)$ unit may increase the total area, but the area will still be lower than that of the basic design using only (3,2) and (2,2) counters. To make a decision regarding the use of an $(m,i,j,3)$ unit, the designer should consider the delay as well as the area. A measure like Area $\times$ Delay$^2$ can help, and Figure 5 shows that the use of an $(m,i,j,3)$ unit is beneficial.
As mentioned above, a design using (4;2) compressors and an $(m,i,j,3)$ unit is not unique and therefore one can trade off area and delay. Figure 6 illustrates such a tradeoff for $n$ = 520 and 1032, where the basic designs (using only (3,2) and (2,2) counters) are also shown for reference. Note that a reasonable reduction in the area can be achieved with some increase in delay. Further increases in the delay will only marginally reduce the area, and thus are not advisable.
Conclusion
Saturating counters have been defined and several design alternatives have been presented and evaluated. The motivation for this study was the need to design such a counter as part of a self-test and repair unit for an embedded memory in a system-on-a-chip. The saturating counter that has been implemented uses (3,2) counters and an $(m,i,j,3)$ unit. It has been implemented using the Perfect SAGE standard cell library for 0.15 micron TSMC CMOS from Artisan. A preliminary design which did not use an $(m,i,j,3)$ unit did not satisfy the timing requirements.
