Abstract-Adder compressor architectures have been widely used in multipliers and have recently achieved improvements over conventional approaches in the computation of multiple modules, such as transform blocks, in the context of video coding. This paper reviews four different state-of-the-art 8-2 adder compressor architectures and proposes a novel one. A 65 nm commercial standard cell library was used to synthesize the compressors. The results show that, as a consequence of a shorter critical path, our circuit presented significant improvements regarding maximum operational frequency, while still maintaining similar results for power dissipation. Our circuit also managed to achieve a smaller circuit area, as a result of a more straightforward net interconnection.
I. INTRODUCTION
Many applications, such as multipliers and video coding, require blocks that perform multiple additions. In these contexts, N-2 adder compressor circuits tend to achieve better performance than other approaches by reducing the N operands to only two. After this reduction, the original sum is recovered in a recombination stage, which is essentially the sum of the remaining two operands.
Fast multiplier architectures have extensively used adder compressors. Approximate multipliers, for instance, have used approximate adder compressors, as in [1] and [2]. Nevertheless, basic precise multipliers and adder circuits also use basic compressors, such as the 3-2 adder compressor, which serves as the building block of Wallace trees [3].
More recently, adder compressors have been applied extensively inside various video coding modules. They improve the computation of transform blocks, as well as of SAD (Sum of Absolute Differences) and SATD (Sum of Absolute Transformed Differences) architectures.
The works in [4] and [5] achieve impressive energy reductions by building SATD architectures based on adder compressors; these architectures account for a significant share of the power dissipated in video encoders.
The use of 8-2 adder compressors can also reduce energy consumption in SAD modules, as presented in [6]. In addition, combinations of 8-2 with 6-2 and 4-2 adder compressors achieve energy reduction in the computation of the Discrete Tchebichef Transform, as shown in [7]. Finally, the work in [8] demonstrates that applying 5-2 adder compressors [9] leads to power reduction in split-radix butterflies.
Designing architectures with adder compressors, however, implies propagating signals across multiple modules, as explored further in Section II. Previous designs in the literature suffer from suboptimal carry-signal propagation, sometimes leading to a critical path depth proportional to the width of the operands. This issue easily goes unnoticed with narrow operands, but becomes central when wider operands are necessary. Avoiding it requires a careful design of the internal signals.
This paper explores four current state-of-the-art 8-2 adder compressor circuits [6] and proposes a novel circuit with a shorter and limited signal propagation. The architectures are synthesized to a 65 nm standard cell library for operands with 9, 16, 24, 32 and 64 bits and compared for maximum operational frequency, power dissipation and circuit area.
Our proposed circuit achieves a significantly higher maximum clock frequency while maintaining similar results in power dissipation and circuit area.
II. ADDER COMPRESSORS OVERVIEW
In this section, we review four well-established adder compressors, namely the 3-2, 4-2, 5-2 and 7-2 [10]. The current state-of-the-art hierarchical 8-2 adder compressors [6] use these compressors as building blocks.
A. Adder Compressors
Adder compressor circuits reduce the given input operands to only two, the Carry and Sum signals, which can then be recombined through the equation Sum + 2 × Carry to compute the original addition. The carry-out (Cout) signals propagate to the next 1-bit compressor, where they are used as carry-in (Cin) signals, to form N-bit compressors.
If we want, for instance, to add five operands of 8 bits each, we need eight 1-bit 5-2 adder compressors like the one shown in Figure 1 (c), i.e., one for each bit position.
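The reduction and recombination steps described above can be illustrated with a minimal Python sketch. It is not the 5-2 circuit of Figure 1 (c): as a simplifying assumption, it cascades word-level 3-2 (carry-save) stages until only a Sum word and a Carry word remain. The function names `compress_3_2` and `reduce_to_two` are hypothetical.

```python
def compress_3_2(a, b, c):
    """Word-level 3-2 compressor: one full adder per bit, carries saved."""
    sum_word = a ^ b ^ c
    carry_word = (a & b) | (a & c) | (b & c)  # per-bit majority, weight 2
    return sum_word, carry_word


def reduce_to_two(operands):
    """Reduce three or more operands to a (Sum, Carry) pair.

    Assumption: a plain cascade of 3-2 stages, not an optimized N-2 cell.
    """
    words = list(operands)
    if len(words) < 3:
        raise ValueError("need at least three operands")
    while len(words) > 3:
        a, b, c = words.pop(0), words.pop(0), words.pop(0)
        s, carry = compress_3_2(a, b, c)
        # Intermediate carries are shifted so every word keeps its true weight.
        words += [s, carry << 1]
    # The final stage returns Carry unshifted, to match Sum + 2 x Carry.
    return compress_3_2(*words)


operands = [200, 17, 255, 3, 128]        # five 8-bit operands
sum_word, carry_word = reduce_to_two(operands)
total = sum_word + 2 * carry_word        # recombination stage
```

The only carry-propagating addition in this sketch is the final recombination; every earlier stage works on all bit positions independently.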
The logic implemented in the 3-2 adder compressor shown in Figure 1 (a) is the same as that of the full adder. The difference is that the carry signal is 'saved' instead of being propagated. The 3-2 adder compressor is therefore also referred to as a carry-save block and has been used as the building block of Wallace trees [3] and of multipliers in general. Larger compressors can be obtained by arranging 3-2 compressors sequentially and then shortening the resulting circuit. Figure 2 demonstrates this process for the 4-2 compressor, which reduces the critical path from four to only three gates. The 5-2 adder compressor can be built similarly, as shown in Figure 3. The 7-2 adder compressor, shown in Figure 1 (d), is constructed through a different process, which is described in [10].
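As an illustration of the three-gate path, the following Python sketch models a 1-bit 4-2 compressor in a commonly used XOR/MUX form and chains it over the operand width. The gate arrangement and all names (`mux`, `compressor_4_2`, `compress_4_2_words`) are assumptions made for illustration and may differ from the circuit in Figure 2.

```python
from itertools import product


def mux(sel, when0, when1):
    """2:1 multiplexer on single bits."""
    return when1 if sel else when0


def compressor_4_2(a, b, c, d, cin):
    """1-bit 4-2 compressor (assumed XOR/MUX form).

    Satisfies a + b + c + d + cin == s + 2 * (carry + cout).
    """
    x1 = a ^ b
    x2 = c ^ d
    x3 = x1 ^ x2
    s = x3 ^ cin               # longest path: x1 -> x3 -> s, three XORs
    cout = mux(x1, a, c)       # does not depend on cin
    carry = mux(x3, d, cin)
    return s, carry, cout


def compress_4_2_words(a, b, c, d, width):
    """Chain `width` 1-bit cells; each cout feeds the next cell's cin."""
    sum_word = carry_word = 0
    cin = 0
    for i in range(width):
        bits = [(w >> i) & 1 for w in (a, b, c, d)]
        s, carry, cin = compressor_4_2(*bits, cin)
        sum_word |= s << i
        carry_word |= carry << i
    return sum_word, carry_word, cin  # cin now holds the last cout


# Exhaustive check of the 1-bit cell over all 32 input combinations.
cell_ok = all(
    sum(bits) == s + 2 * (carry + cout)
    for bits in product((0, 1), repeat=5)
    for s, carry, cout in [compressor_4_2(*bits)]
)
```

For 8-bit operands, the original addition is recovered as `sum_word + 2 * carry_word + (last_cout << 8)`.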
B. 8-2 Hierarchical Adder Compressor
Four different state-of-the-art architectures for the 8-2 compressor, designed through hierarchically combining the aforementioned basic compressors, are shown in Figure 4 .
These four architectures were thoroughly explored in the context of video encoding in [11], [12] and [6].
If we considered only the 1-bit version of each of the 8-2 compressors, we would conclude that their critical paths are given by six, six, five and eight logic gates for architectures Type I, Type II, Type III and Type IV, respectively. However, this is not the case for operands with more than a single bit. The critical paths of architectures Type I and Type III in fact depend on the size of the operands. Figure 5 shows the case of the hierarchical adder compressor Type I. All modules of the N-bit adder compressor propagate the carry signal because, although the Cout signal in the 4-2 compressor does not depend on Cin, the Carry signal does. Hence, the Cout1 from the 8-2 adder compressor depends on its Cin1, which is itself the Cout1 signal from the previous adder compressor. This propagation thus extends all the way to the least significant bit.
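The dependence property behind this propagation can be checked by brute force. The sketch below models a 4-2 compressor as two cascaded full adders (a functional model assumed for illustration, not the gate-level circuit of the figures) and tests which outputs change when the carry-in is flipped. The helper `depends_on` is hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Returns (sum, carry) of three bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a, b, c, d, cin):
    """Functional 4-2 model: outputs are (s, carry, cout)."""
    s1, cout = full_adder(a, b, c)
    s, carry = full_adder(s1, d, cin)
    return s, carry, cout


def depends_on(func, n_inputs, input_idx, output_idx):
    """True if flipping input `input_idx` can change output `output_idx`."""
    for bits in product((0, 1), repeat=n_inputs):
        flipped = list(bits)
        flipped[input_idx] ^= 1
        if func(*bits)[output_idx] != func(*flipped)[output_idx]:
            return True
    return False


CIN, CARRY, COUT = 4, 1, 2
cout_depends_on_cin = depends_on(compressor_4_2, 5, CIN, COUT)
carry_depends_on_cin = depends_on(compressor_4_2, 5, CIN, CARRY)
```

In this model Cout is independent of Cin while Carry is not. A ripple chain can therefore only form when a signal derived from Carry is forwarded to the next module as a carry-out, which is the situation the text describes for Type I.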
Unlike Type I, the hierarchical 8-2 adder compressor Type II presents a limited signal propagation. Figure 6 shows the critical path (in red) of Type II, containing a total of ten logic gates: eight XORs and two MUXes. This path extends across three 1-bit modules. Hence, a single isolated module has a smaller depth of only six gates, as noted before.
A situation similar to that of Type I arises with the hierarchical adder compressor Type III, as shown in Figure 7. In this case, however, the propagation results from the dependence of the signal Cout3 on the signal Cin3, which derives from the fact that the Carry of the 3-2 adder compressor depends on all its inputs.
The hierarchical adder compressor Type IV, finally, behaves differently from the ones described above. Because it is built on the 7-2 adder compressor, the signal Cout1 has a weight four times that of its inputs, in contrast with Cout0, whose weight is only two times that of its inputs. This implies that the Cout1 from the adder compressor module i must be used as an input to the compressor i + 2 and not to the compressor i + 1. This scheme has the benefit of reducing the number of necessary carry signals in the 8-2 adder compressor from five to only three, at the cost of increased energy consumption [6]. Since this strategy was not considered in our design of the 8-2 adder compressor, we refer the reader to [10] and [6] for further analysis. Lastly, as will be seen in the synthesis results, the hierarchical version Type IV also presents ten gates in its worst path.
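The routing rule for Type IV follows directly from the signal weights. As a short worked step, assume the inputs x of the compressor at bit position i have weight 2^i, and write the two carry outputs of the 7-2 block as C_out0 and C_out1; the index i, the weight function w and these symbol names are notation assumed here for illustration.

```latex
\begin{align}
  w(x)          &= 2^{i}, \\
  w(C_{out0})   &= 2 \cdot 2^{i} = 2^{i+1}, \\
  w(C_{out1})   &= 4 \cdot 2^{i} = 2^{i+2}.
\end{align}
```

A signal can only be consumed by a module whose inputs have the same weight, so the carry output with weight 2^{i+1} feeds module i + 1 and the one with weight 2^{i+2} feeds module i + 2.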
Nonetheless, we note that, for smaller operand widths such as 9 bits, architectures Type I and Type III might still achieve better results for power dissipation, since the carry propagation will not have a noticeable impact on performance.
The reduced power dissipation in Type I is confirmed in this paper, reinforcing a similar conclusion found in [6] for energy consumption in the context of video coding.
III. PROPOSED MONOLITHIC 8-2 ADDER COMPRESSORS
This section explores our proposed 8-2 adder compressor. We refer to it as monolithic because it was conceived using only XOR and MUX gates, instead of hierarchically combining smaller compressors; in fact, it cannot be constructed from the existing smaller compressors. The compressor was designed by carefully considering the signal propagation across the modules, which allows it to achieve a shorter critical path of only eight logic gates, in contrast with the hierarchical versions discussed previously.
Figure 8 shows our proposed monolithic 8-2 adder compressor. This architecture can be understood as the natural consequence of the procedure described in Section II for designing the 4-2 and 5-2 adder compressors: six 3-2 adder compressors are sequentially arranged and then shortened in a manner analogous to that of the 4-2 and 5-2 adder compressors. The 1-bit version of this circuit has a critical path of seven logic gates, whereas a multi-bit version has a critical path of eight gates (seven XOR gates and one MUX), as shown in Figure 9. Finally, the routing of this circuit is straightforward, which results in a smaller circuit area, given that all the compressor architectures have the same number of logic gates.
IV. SYNTHESIS RESULTS AND DISCUSSION
The five analyzed architectures for the 8-2 compressor were synthesized: the four hierarchical versions shown in Figure 4 and our proposed version. All architectures were synthesized to a 65 nm standard cell library for bit widths of 9, 16, 24, 32 and 64, targeting an operating frequency of 100 MHz, to obtain the estimated power dissipation, circuit area, and critical path.
A. Critical Path and Clock Frequency Analysis
Table I shows the number of logic gates in the critical path of each architecture. The path in the hierarchical architectures Type I and Type III grows as the number of bits increases, as a result of the carry propagation. The other structures do not show this behavior. The carry propagation is therefore expected to reduce the maximum operating frequency of the architectures that do not have a limited carry propagation chain, which is indeed observed.
Table II shows how the clock frequency is affected by the carry propagation problem. As the number of bits increases, the maximum operating frequency of the architectures with carry propagation diminishes rapidly, while the frequencies of the architectures without propagation issues remain approximately constant. Our proposed circuit achieved a higher operational frequency than the hierarchical versions, as a consequence of its limited and shorter signal propagation. Hierarchical versions Type I and Type III still reach sufficiently high frequencies when using narrow operands, as is the case in SAD and SATD architectures. Nevertheless, if wider operands are mandatory, only a much lower frequency can be expected.
Moreover, because our monolithic 8-2 compressor has only eight logic gates in its critical path, while the hierarchical versions Type II and Type IV both have ten, our circuit achieved a higher operational frequency. Our proposed compressor reached a frequency 19% higher than the hierarchical version Type II and 10% higher than the hierarchical version Type IV, for all bit widths.
Table III shows the total power dissipation of each architecture as a function of the operand bit width, at a frequency of 100 MHz and a supply voltage of 1.0 V. For small bit widths the power dissipation is approximately equal for all compressors, but the differences become noticeable as the operand bit width increases. Random inputs were used to estimate the power dissipation, to keep the analysis general to all applications.
In agreement with related works on energy consumption in hierarchical adder compressors, such as [6], the hierarchical compressor Type I achieved the lowest power dissipation for all widths up to 32 bits. However, the results change at 64 bits, where our proposed architecture begins to dissipate less power than the hierarchical ones. This change is a consequence of the synthesis tool attempting to keep the circuits functional at the target operating frequency of 100 MHz, which leads to the inclusion of buffers and of logic gates with higher strengths. Therefore, if higher frequencies or larger operands are needed, our proposed version becomes a better option.
For narrow operands, a decision must be made considering the application. When using 9, 16, 24, and 32 bits, the hierarchical version Type I dissipates respectively 12%, 11%, 7% and 5% less power than our proposed monolithic compressor. The highest operational frequency reached by our circuit, however, is almost 50% higher for 9 bits and nearly four times higher for 32 bits.
In comparison with the hierarchical 8-2 adder compressor Type IV, which had a maximum operating clock frequency close to that of the monolithic architecture, our proposed circuit dissipated 23% less power for 9 bits and almost 25% less for 64 bits.
Table IV shows the total circuit area of each of the five architectures as a function of the operand width. As with power dissipation, all circuits use similar areas for small operand widths, and the differences become more apparent as the width increases.
Our proposed circuit achieved slightly better results for circuit area than the hierarchical versions. In comparison with the best hierarchical compressor in terms of area, our proposed version used a circuit area approximately 5% smaller for all bit widths. In comparison with the hierarchical Type IV, the fastest hierarchical compressor, our circuit required an area 12% smaller for all widths.
We note again that, as with power dissipation, the inclusion of buffers and of logic gates with higher strengths in the hierarchical versions Type I and Type III also results in a greater circuit area for 64 bits. While the hierarchical compressor Type I takes, for instance, approximately 6% more area than our proposed circuit for 9, 16, 24 and 32 bits, it takes almost 20% more area for 64 bits.
V. CONCLUSION
This paper presented a comparison between four well-researched state-of-the-art 8-2 compressors and our novel monolithic 8-2 adder compressor. Each circuit was synthesized to a 65 nm standard cell technology for operands of 9, 16, 24, 32 and 64 bits at 1.0 V. Both power dissipation and circuit area were estimated at a frequency of 100 MHz, using random inputs.
The design of the monolithic compressor took into account that it cannot be represented by smaller hierarchical versions, and careful routing of the propagating signals allowed for a shorter critical path. This shorter critical path leads to significantly higher maximum operational frequencies for our circuit.
Moreover, our proposed circuit achieved similar results for power dissipation. Although the hierarchical compressor Type I still dissipated less power for narrow operands, the much higher maximum operational frequency of our circuit implies that, depending on the application, the monolithic version may well be the better option.
Finally, our proposed circuit also presented a slightly smaller total circuit area, which should also be taken into consideration when choosing the best 8-2 adder compressor for a given application.
