Abstract: Variable Latency Adders are attracting strong interest for increasing performance at a low cost. However, most of the literature is focused on achieving a good area-delay tradeoff. In this paper we consider multispeculation as an alternative for designing adders with low energy consumption, while offering better performance than the corresponding nonspeculative ones. Instead of introducing more logic to accelerate the computation, the adder is split into several fragments which operate in parallel, and whose carry-in signals are provided by predictor units. On the one hand, the critical path of the module is shortened, and on the other hand the frequent useless glitches produced in the carry propagation structure are diminished. Together, these effects translate into an overall energy reduction. Several experiments have been performed with linear and logarithmic adders, and results show energy savings of up to 90% and 70%, respectively, while achieving an additional execution time decrease. Furthermore, when utilized in whole datapaths with current control techniques, it is possible to reduce execution time by 24.5% (34% best case) and energy by 32% (48% best case) on average.
I. INTRODUCTION
Addition is the key arithmetic operation in most digital circuits and processors. Signal processing and typical multimedia applications in mobile devices [20] [21] [22] are often dominated by additive structures. These devices are manufactured with increasingly stringent performance, power, and packaging constraints. Moreover, multipliers and other complex modules usually include a great amount of adders as well. Hence, it is crucial to improve the quality of adders and adder-dominated structures without incurring significant area or power overhead.
Historically, several fixed latency adder designs have been proposed. A common principle to implement these modules has consisted of increasing the adders' complexity and thereby, their speed. This will improve performance, while increasing area and power considerably, depending on the concrete implementation. Several designs with different tradeoffs have been proposed, such as Ripple Carry Adders (RCA), Carry Select Adders (CSEL), Carry Lookahead Adders (CLA), Brent-Kung Adders (BK) or Kogge-Stone (KS) Adders [1] . RCA are composed by the replication of a basic cell called Full Adder (FA). This is the simplest implementation of an adder. The main problem is the length of the carry path, which diminishes its performance. CSEL's idea is to divide the adder into several parts and replicate each most significant part. The most significant parts are then executed both with carry in '0' and '1' simultaneously, and the correct result is selected once the real carry is calculated. The final critical path will be roughly the size of the original adder divided by the number of fragments. However, the replication of the most significant modules will penalize both area and power. CLAs also reduce the carry path by dividing the adder into several modules, anticipating the carry-in of every module according to a fast calculation of the carry-out from the previous (less significant) module. With this technique the critical path becomes logarithmic with respect to the input width, but the carry propagation tree increases the adder complexity as well. Prefix adders, such as KS or BK, utilize a similar principle to CLA. They present various tradeoffs depending on the concrete implementation, but all of them have a logarithmic delay in common, as they propagate the carries in a tree-like fashion. Nevertheless the area and power scale differently [2] . For instance, in the case of BK the area is O(n), while in the case of KS it is O(n·log(n)). 
Hence, prefix adders are in general the fastest, largest and most power hungry designs, especially the KS. However, as energy is a magnitude that depends on both power and execution time, KS consumes less energy than other kinds of adders, such as BK or RCA.
All the aforementioned designs operate with fixed latency. In contrast, several Variable Latency Adders (VLA) have appeared recently. Some of these VLA utilize various forms of speculation. We shall refer to them as speculative adders [3] [4] [5] [23] . The adder is divided into several fragments, and the carry-out signals are predicted. Then, each prediction is used to feed the carry-in of the next more significant fragment. Other approaches [6] [7] [8] [9] leverage the observation that carry values at different stages inside an adder are largely independent. Finally, some designs rely on the dynamic modification of the voltage to reduce the energy consumption [10] .
All the abovementioned proposals have proved to be highly efficient in area and delay. However, their power and energy consumption have remained largely unexplored. In this paper, we study the energy efficiency of Multispeculative Adders, and demonstrate that they are superior to their non-speculative counterparts in terms of both energy and performance, without degrading the area and even reducing it in some cases. As part of our systematic analysis, we have investigated the optimal choice for an important design parameter: the fragment size when using multiple predictors. Our analysis also covers both the adder modules in isolation, as well as cases when they are deployed in datapaths. Our results show that it is possible to reduce energy consumption up to 90% in RCA, 85% in BK, and 70% in KS for large data width (256 bits). In the concrete study performed for 64-bit adders, there is a 71% energy reduction for RCA, 69% for BK, and 62% for KS, with additional 3.6X, 2.6X and 1.3X speedups, respectively. Finally, when applied to entire KS-based datapaths, it is possible to decrease energy consumption by 32% (48% best case) while achieving an additional execution time reduction by 24.5% (34% best case).
This work has been supported by the Spanish Government Grants TIN 2008-00508 and TEC2012-33892.
The rest of the paper is organized as follows: section II discusses the related work and section III describes the generic adder design. Section IV shows an example to illustrate our ideas, while sections V and VI present our experimental results and conclusions.
II. RELATED WORK
Recent works on adders design have introduced speculative techniques in order to improve average performance while keeping a low area cost. The main idea behind all these designs is to relax some logic conditions in the equations that will be mapped into hardware in order to quickly execute most additions, while reducing the area. More specifically, some carries are either predicted [3] [4] [5] or assumed to have a '0' value [6] [7] [8] [9] . The combinations of inputs that do not satisfy the new equations will be executed more slowly. In order to achieve high performance, designers must minimize the number of cases that produce a slow addition.
The works presented in [8] [9] establish some conditions under which the carry-in ($c_i$) is independent of a previous carry ($c_{i-1}$) with a very high probability. Nonetheless these designs are asynchronous. The work by Verma et al. [6] adapts these ideas to a synchronous context. The carry-in $c_i$ only depends on $c_{i-1}$ if $a_i \neq b_i$, i.e. the carry is propagated. Their conclusion is that the longest sequence of propagates in an n-bit adder is log(n)+12 with 99.99% probability [6] . Therefore, it is possible to build an extremely fast adder by replicating many fragments, because for a certain k, $c_{i+k+1}$ is quasi-independent from $c_i$. This is a remarkable contribution, because it means that a change in $c_i$ will not affect $c_{i+k+1}$. In general, implementing an n-bit addition with this technique will need to execute in parallel n-(k+1) fragments of size k+2, so replication of these submodules will provoke an area and power penalty as in the CSEL case. Besides, an error detection and recovery circuit must be included. This error detection circuit evaluates every possible chain of k+1 propagates. Hence, the whole structure, i.e. adder, detection and correction, occupies a large amount of area: around 1.5X the area of a conventional logarithmic fast adder. The work presented in [7] is based on the structure proposed in [6] , but it takes into account the fact that there exists some correlation between the inputs, so not every input combination will have the same probability, which was the starting point of the analysis by the adders described in [6, 8, 9] . The area penalty of this scenario is around 25% of the original approach presented in [6] .
Adders proposed in [3] [4] start from a RCA divided into two halves, but without replicating the most significant module. They estimate the carry-in to this most significant fragment taking into account the most significant bits of the least significant fragment. The work presented in [4] establishes a methodology for doing this with an arbitrary number of bits. Nevertheless, the main disadvantage of the designs described in [3] [4] [8] [9] is that they are asynchronous, while most circuits today are synchronous. In order to introduce Speculative FUs (SFUs) in the synchronous context, techniques [5] similar to branch prediction [11] were proposed for speculating the carry-in of the most significant fragment. This Variable Latency FU (VLFU) operates in one short cycle if the speculated carry is guessed correctly, or two if it is not. A theoretical 50% execution time reduction can be achieved in speculative RCA with respect to the nonspeculative case, with negligible area overhead. Although this theoretical improvement will depend on the predictor's accuracy, the hit rates will be high because of data correlation [12] . Moreover, the state of the art provides several controllers for handling one or several VLFUs [5] [14] [15] [16] [23] [24] . The simplest ones stop the whole datapath every time a misprediction happens, which degrades performance as the number of VLFUs increases [15] [16] . Other works are based on handshaking protocols [14, 24] , and other implementations even sacrifice some accuracy for the sake of energy [25] . On the other hand, in the approach presented in [5] , the use of several local controllers is proposed for stopping only the mispredicting VLFUs and thus obtaining a good performance while keeping full accuracy. The approach described in [23] is based on traditional global controllers [14] [15] [16] .
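The bound on the longest propagate chain can be checked empirically. The following is a minimal sketch, assuming uniformly random and independent operand bits; the function names and the trial count are illustrative and do not come from [6] . Whether the logarithm in log(n)+12 is base 2 is also an assumption made here.

```python
import random


def longest_propagate_run(a: int, b: int, n: int) -> int:
    """Longest run of consecutive bit positions where a_i != b_i (propagate)."""
    p = (a ^ b) & ((1 << n) - 1)  # bit i is 1 iff position i propagates
    longest = current = 0
    for i in range(n):
        if (p >> i) & 1:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def fraction_within(n: int, limit: int, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random n-bit operand pairs whose longest propagate run
    does not exceed `limit` (uniform random inputs are an assumption)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.getrandbits(n)
        b = rng.getrandbits(n)
        if longest_propagate_run(a, b, n) <= limit:
            hits += 1
    return hits / trials
```

For a 64-bit adder, and taking the logarithm as base 2, the bound would be 6+12=18, so `fraction_within(64, 18)` is expected to be very close to 1 under this input model.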
In that work, performance is improved for additive structures using VLFUs by propagating mispredictions from a cstep to the following one and predicting only in few selected nodes within the global additive structure.
In conclusion, there is a quickly expanding landscape of several promising proposals to design VLFUs, deploy them efficiently during High Level Synthesis, and manage them in the resulting datapaths. The impact of these efforts on energy efficiency needs careful consideration, which is the main goal of this paper. Specifically, we aim to study the energy efficiency of generic Multispeculative Adders (MSADD), based on the carry-in quasi independency [6, 8] , in order to obtain high performance, and on the non-replication of the modules [3] [4] [5] , to achieve a low power penalty. Finally, we show that the energy reduction of the individual FUs contributes to improve the overall energy consumption of synthesized datapath circuits, while maintaining the accuracy.
III. THE MULTISPECULATIVE ADDER
The general structure of a MSADD is depicted in figure 1 . As can be observed, an n-bit adder will be divided into n/k fragments of width k. Each fragment will operate in parallel using the carry-in value given by the corresponding predictor ($pred_i$). A hit signal will indicate when the operation is correct, i.e. iff all the true carry-out values are equal to the corresponding prediction of the carry-in values of the following module, i.e. iff all the error signals coming from the predictors ($err_i$) are false. Every error signal indicates when the predicted value is different from the true carry. Finally, note that every predictor will be updated with the true carry-out iff there is a misprediction. This error detection and correction logic is less expensive than evaluating every possible chain of (k+1) propagate signals.
The main idea for using MSADD is that, despite utilizing more than one predictor, almost every addition will be executed in two very short cycles at the most, because the correction of a carry will most likely not affect the carry-in of the following fragment. Let us consider the carry-in values $C_{ik-1}$ and $C_{(i-1)k-1}$, which enter into fragments i and i-1, respectively. Let us assume that there is a failure while predicting $C_{(i-1)k-1}$. In the next cycle it will be corrected. Now the question is whether this correction will affect $C_{ik-1}$. The probability that the corrected $C_{(i-1)k-1}$ propagates to $C_{ik-1}$ is the same as that of finding a chain of k propagates. Hence if k is big enough according to [6, 8, 23] , the correction will not be propagated and 2 very short cycles will be enough for executing any addition with an extremely high probability. It is clear that utilizing many predictors will increase the probability of failing once. But it is fairly unlikely to fail more than once. A concrete hit probability study can be found in [23] . Then, if the adder delay is reduced sufficiently such that this cycle penalty is compensated, overall execution time will be reduced. Hence, note that this approach is different from the traditional ones, which introduce more logic for accelerating the carry propagation.
Besides improving the execution time, MSADD are more power and energy efficient. Splitting the adder into several fragments reduces the switching activity of the adder, because frequently the commutation of a carry signal does not affect the value of the following ones, although it can produce the commutation of several intermediate signals. Hence, as dynamic power is proportional to switching activity [13] , multispeculation should produce a reduction of power, which will affect the energy consumption. This is especially important in the case of deep carry-propagation trees, which are fast, but power hungry. Furthermore, the use of small k-bit fragments will reduce the number of possible commutations and the area of fast logarithmic adder implementations, because the number of nodes composing the fragment is much smaller than in an n-bit adder, being k<<n.
Regarding the concrete implementation, 1-bit predictors have been utilized. These predictors are composed of a D flip-flop and are updated iff the true carry is different from the predicted value. The main reason for considering these predictors is their simplicity, as they reduce the critical path when implementing additive structures (e.g. addition chains or trees), which are of greater relevance in Design Automation. In order to implement whole datapaths, a controller similar to the one presented in [23] has been chosen. For that purpose, the D flip-flops will store the intermediate carries during the first steps of the additive structure, and apply a Static Zero Prediction in the last stage, i.e. a misprediction will be produced iff a carry-out is equal to '1'. For instance, let us consider an addition chain. The middle carries can be pipelined from a cstep to the next one, except for the last cstep. In the last cstep, instead of penalizing performance by propagating the carries from a fragment to the following one, the Static Zero Prediction will be applied. If every carry-out is '0', there will be an overall hit and no more cycles will be necessary. If there is a misprediction, this last cstep will be repeated until producing a hit. But according to previous studies, 2 cycles will be enough to calculate correctly the last stage, instead of n/k cycles if we had propagated the carries. A similar philosophy can be applied to the addition trees and some more complex additive structures including product nodes in the leaves.
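The behaviour described above can be captured in a small functional model. This is only a cycle-level sketch, not the authors' hardware description. The function name `msadd`, the list layout of the predictors, and the fixed carry-in of 0 for the least significant fragment are assumptions made for illustration.

```python
from typing import List, Tuple


def msadd(a: int, b: int, n: int, k: int, pred: List[int]) -> Tuple[int, int]:
    """Functional model of an n-bit multispeculative adder with k-bit fragments.

    pred[i] is the predicted carry-in of fragment i+1 (1-bit predictor state);
    it is updated in place with the observed carry-out on a misprediction.
    Returns (sum modulo 2**n, number of cycles until the hit signal is raised).
    """
    assert n % k == 0 and len(pred) == n // k - 1
    mask = (1 << k) - 1
    frags = n // k
    cycles = 0
    while True:
        cycles += 1
        cins = [0] + pred  # assumption: fragment 0 has a known carry-in of 0
        total = 0
        couts = []
        for i in range(frags):  # all fragments operate in parallel in hardware
            s = ((a >> (i * k)) & mask) + ((b >> (i * k)) & mask) + cins[i]
            total |= (s & mask) << (i * k)
            couts.append(s >> k)
        # err[i] is true when the carry-out of fragment i differs from the
        # prediction that was fed to fragment i+1
        err = [couts[i] != pred[i] for i in range(frags - 1)]
        for i, e in enumerate(err):
            if e:
                pred[i] = couts[i]  # update the predictor only on a miss
        if not any(err):  # hit: every prediction matched its carry-out
            return total, cycles
```

In this model an addition without inter-fragment carries completes in one cycle when the predictors hold '0', while a carry that ripples across several fragments costs one extra cycle per fragment it crosses, which is the rare case the paper argues against.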
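As a worked illustration of this argument, and assuming uniformly random and independent operand bits (a simplification that differs from the data model used later in the experiments), each bit position propagates with probability 1/2, so the probability that a correction crosses a whole k-bit fragment is:

```latex
P\bigl(\text{correction of } C_{(i-1)k-1} \text{ reaches } C_{ik-1}\bigr)
  = \prod_{j=(i-1)k}^{ik-1} P(a_j \neq b_j)
  = \left(\tfrac{1}{2}\right)^{k} = 2^{-k}
```

Under this assumption the probability is 6.25% for k=4, about 0.39% for k=8, and about 0.0015% for k=16.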
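The Static Zero Prediction rule for the last cstep can be sketched as a simple check. The function name is hypothetical, and treating the carry-out of the most significant fragment as the adder's own carry-out (so that it does not count as a misprediction) is an assumption of this sketch rather than something stated in the text.

```python
def static_zero_hit(a: int, b: int, n: int, k: int) -> bool:
    """True iff Static Zero Prediction hits in the first cycle, i.e. every
    fragment that feeds a more significant fragment produces carry-out 0
    when all fragments receive a carry-in of 0."""
    assert n % k == 0
    mask = (1 << k) - 1
    for i in range(n // k - 1):  # the top fragment feeds no other fragment
        s = ((a >> (i * k)) & mask) + ((b >> (i * k)) & mask)
        if s >> k:  # a carry-out of '1' contradicts the zero prediction
            return False
    return True
```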
IV. AN ILLUSTRATIVE EXAMPLE
In order to show the principles of our technique and expected performance and energy benefits, let us consider the example of figure 2, where the dot diagrams correspond to a 16-bit Kogge-Stone adder and its multispeculative version (MSKS) with 4-bit fragments. The white dots represent the initial calculation of signals $p_i = a_i \oplus b_i$ and $g_i = a_i \cdot b_i$, while the black dots correspond to the prefix operator $\circ$. This operator works such that $(G', P') \circ (G'', P'') = (G' + P' \cdot G'', P' \cdot P'')$, where the ' signals are located on the left and the '' signals on the right of every node in figures 2 a) and 2 b). The blue dots represent those nodes whose signals can potentially switch. Finally, in figure 2 b) , the red arrows indicate how the carries are interconnected from a fragment to the following one.
In this example, we have assumed that the inputs at the 6th bit are changing. As can be observed in figure 2 a) , a change in the inputs of bit 6 can produce commutations in up to 23 nodes. This problem is even worse if we consider more complex adders such as the ones presented in [6] [7] , where the same bit flip affects several fragments at the same time. On the contrary, in the multispeculative case, as shown in figure 2 b) , the same flip at the 6th bit can only affect up to 5 nodes. This happens if there is a hit in the prediction of the carry-in to the following fragment. Otherwise, in the following cycle the correction can potentially affect 4, in general k, more nodes. Nevertheless, as stated in [6, 8, 23] , in the vast majority of the cases two very short cycles will be enough for producing a correct result. Hence, from the power point of view, in the MSKS case the bit flip on bit 6 can potentially affect 9 nodes at the most with a high probability, although if prediction is accurate enough, it will only affect 5 nodes. A similar fact happens with the rest of the adder inputs. Hence, as k << n, this will produce an overall switching activity reduction of the whole adder. Furthermore, as can be observed in figure 2 b) , the number of required nodes is also diminished. Thereby, the introduction of predictors will contribute to an area reduction in logarithmic-like implementations such as the KS.
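The prefix operator and the Kogge-Stone carry tree can be written down as a short bit-level model. This is an illustrative sketch of the standard Kogge-Stone recurrence, with hypothetical function names, and not the VHDL used in the experiments. The left operand of `combine` is the more significant (G', P') pair.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) pair


def combine(left: GP, right: GP) -> GP:
    """Prefix operator: (G', P') o (G'', P'') = (G' + P'.G'', P'.P'')."""
    g1, p1 = left
    g2, p2 = right
    return (g1 | (p1 & g2), p1 & p2)


def kogge_stone_add(a: int, b: int, n: int) -> int:
    """n-bit addition (result modulo 2**n) using a Kogge-Stone prefix tree."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # white dots: p_i
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # white dots: g_i
    gp: List[GP] = list(zip(g, p))
    d = 1
    while d < n:  # one tree level per iteration, log2(n) levels in total
        gp = [combine(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(n)]
        d *= 2
    # After the last level, gp[i][0] is the carry out of bit i.
    total = 0
    for i in range(n):
        carry_in = gp[i - 1][0] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total
```

Applying the same function to each 4-bit slice, with the carry-in supplied by a predictor, gives the shallower per-fragment trees of the multispeculative version.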
V. EXPERIMENTS
In this section we first describe our experimental framework. Next, we discuss our experimental results.
A. Framework
A simulator has been built in order to measure the number of hits and mispredictions, as well as the average number of cycles. $2^i$-bit adders, $4 \le i \le 8$, are simulated for $10^6$ iterations to obtain execution time results. During the simulation, inputs are modeled stochastically, in a similar fashion to the profiling information obtained in [17] by Brooks and Martonosi. In that work, the authors observe that in the SPECint95 benchmark suite, over half the operations require 16 bits or less, instead of the 64-bit full precision. Similar conclusions are extracted in [18] [19] , working with the SPECint2000 suite. Hence, taking into account the data presented in [17] , the most significant part of each operand consists of a sign extension with a high probability (>0.9). On the contrary, the least significant fragments behave randomly. Afterwards the resulting adders have been coded in VHDL and synthesized with Synopsys Design Compiler, with a 65 nm library. Average execution time per operation is obtained as the product of the average latency and the delay given by Synopsys. In order to get the power results, adders are simulated at the RTL level with Modelsim to obtain the input stimulus for Synopsys Power Compiler. Thus, the energy per operation is computed as the product of power and average execution time. Power is measured in μW and delay in ns. Hence, energy is measured in fJ.
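The input model and the energy computation described above can be sketched as follows. Both function names, the `low_bits` parameter and the default extension probability of 0.9 are illustrative assumptions. The text only states that the upper part is a sign extension with probability greater than 0.9 and does not fix where the random part ends.

```python
import random


def sign_extended_operand(rng: random.Random, n: int, low_bits: int,
                          p_ext: float = 0.9) -> int:
    """n-bit operand whose low part is random; with probability p_ext the
    upper n-low_bits bits replicate the sign bit of the low part, otherwise
    they are random as well. Requires 0 < low_bits < n."""
    low = rng.getrandbits(low_bits)
    high_width = n - low_bits
    if rng.random() < p_ext:
        sign = (low >> (low_bits - 1)) & 1
        high = ((1 << high_width) - 1) if sign else 0
    else:
        high = rng.getrandbits(high_width)
    return (high << low_bits) | low


def energy_per_op_fj(power_uw: float, cycle_delay_ns: float,
                     avg_latency_cycles: float) -> float:
    """Energy per operation in fJ: power (uW) times average execution time
    (ns), where execution time is average latency (cycles) times delay."""
    exec_time_ns = avg_latency_cycles * cycle_delay_ns
    return power_uw * exec_time_ns  # 1 uW * 1 ns = 1e-15 J = 1 fJ
```

As a purely hypothetical example, an adder dissipating 100 μW with a 0.5 ns cycle and an average latency of 1.2 cycles would consume about 60 fJ per operation.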
Furthermore, several experiments have been also performed with random inputs in order to get both execution time and power in a hypothetical scenario where no kind of data correlation could be exploited. Finally, it must be noticed that we have synthesized every adder considering three types of basic blocks: RCA, BK and KS adders.
B. Results
In order to check the performance and energy of the MSADD, several experiments have been performed. First a study about the optimal size of the MSADD fragments, from the latency point of view, has been performed. Second, a study about the energy impact of the number of predictors over the adders has been performed. Next, we have investigated the relationship between performance and energy for different multispeculative adder implementations. Finally, we evaluated the energy and performance implications on complete datapaths utilizing MSKS modules.
1) Optimal fragment size
The main objective of this study is to determine what fragment width, and with what probability, yields a hit after two cycles at the most. This means that there is an overall hit in the first cycle, or after the second cycle, i.e. the failures are not propagated from one fragment to the following more significant fragment. Results are shown in figure 3 a) . The X axis depicts the adder size, n, and the Y axis the fragment width, k, such that the hit is achieved after two cycles at the most with 90%, 95%, and 99% probability (p), respectively. The percentage of hits achieved after more than 2 cycles would be calculated as (100-p). Note that the number of predictors can be obtained as n/k-1.
Results show that even with the smallest fragment widths, k=4 and k=8, it is possible to achieve a hit after two cycles in 90% of the cases. In order to reach the 95% and the 99% hit percentage, k seems to tend to n/4 and n/2, respectively. The impossibility of overcoming the 99% hit percentage is a consequence of the assumed data model [17] . Extremely wide sign bit chains in the most significant part of the operands will cause some additions to be computed in more than 2 cycles because of the propagation of the correction in the middle carry. However, this effect can be lessened by introducing more complex predictors or by propagating the mispredictions to the following cstep, when considering a whole datapath [23] .
2) Energy impact of the number of predictors
Figure 3 b) depicts the average energy per operation for 16 and 32-bit adders. Figures 3 c), d) and e) show the average energy per operation for 64-bit, 128-bit and 256-bit adders. In these four figures, the inputs have been modeled according to the information reported in [17] . It should also be noticed that the MSADD with zero predictors corresponds to the baseline case, i.e. non-speculative, for every adder type. These points have been indicated with non-solid markers. As can be observed, the energy consumption decreases as the number of predictors is increased. It decreases faster with few predictors (3 or 7) and with more predictors the slope is still negative, but less steep. Thus, fragmenting the adders into several submodules prevents the propagation of unnecessary switching activity through the whole adder, regardless of the submodule type. On the other hand, as the number of predictors increases, the application of multispeculation makes the adders independent of the type of basic block. This is because the basic blocks become similar in area, power, delay, and energy when the fragment width is small (4-8 bit).
Figures 3 f), g), h) and i) depict the same results as the four previous figures, but considering random input data, which is not realistic according to [17] [18] [19] . However, we have performed these experiments in order to thoroughly assess the behavior of these modules even in this extreme case. As it is observed, the energy consumption decreases with few predictors (3 or 7) and then it slightly increases with 15 or more predictors. With random input data, the latency is increased because the input bits become completely uncorrelated and predictors fail more times. Hence, more submodules will be active recalculating results due to the mispredictions. This will lead to an increase of the switching activity through the whole adder, and thereby to an increase of the overall energy consumption. Nevertheless, it should be noticed that even under this unrealistic scenario, all the MSADD consume less energy than their corresponding baseline cases.
Another conclusion that can be extracted from these results is the energy efficiency of different implementations. It may seem that KS implementations are the most energy consuming because of their complexity, but it is quite the opposite: the RCA's rippling carry structure is what causes the largest penalty. However, applying multispeculation it is possible to solve these problems and design MSADD that are energy efficient regardless of the implementation style. Considering input data according to [17] , the reduction in energy consumption of MSRCA ranges from 45% to 90%, of MSBK from 30% to 85%, and of MSKS from 30% to 70% with respect to their corresponding baseline case. With random input data, MSADD are less efficient but they still achieve good energy reductions, especially for widths greater than 32-bits. For instance, MSRCA can achieve an energy reduction around 70%, MSBK around 68% and MSKS around 53%.
3) Energy and delay
In this experiment we have checked whether the energy reduction comes with a performance degradation. Figures 4 a) and b) show the evolution of the average latency and the average delay with respect to the number of predictors for a 64-bit adder. Note that figures 4 a) and 4 b) utilize the [17] data model. Results in figure 4 a) show that the latency grows as the number of predictors is increased. Nevertheless, the critical path reduction is more significant and thus average delay is decreased with more predictors. Concretely, it is possible to encounter MSRCA that are 3.6X faster, MSBK 2.6X faster and MSKS 1.3X faster. Hence, the energy reduction, which can be
Figures 4 c) and d) show the same results as the aforementioned figures, but considering random inputs. As can be observed in figure 4 c), the latency increase is around twice the increase in figure 4 a) . This fact affects the average delay and, as figure 4 d) depicts, when there are more than 7 predictors the delay tends to increase slightly. Nonetheless, all the MSADD are still faster than their baseline counterparts, except in the case of KS. Specifically, MSRCA are 2.1X faster and MSBK 1.5X faster. Therefore, when comparing the energy reductions shown in figure 3 g), regardless of the increase of the switching activity all the multispeculative solutions consume less energy than their corresponding baseline cases. And this is achieved with an additional performance increase in all the cases, except the MSKS implementations with more than 3 predictors.
It must be noted that there are two red bold points in figures 4 a) and c), which correspond to the MSADD's average latency with a fragment size of k=n/4=64/4=16 bits. Results show that the latency is always below 2, which agrees with the conclusions extracted from figure 3 a), i.e. with k=n/4, two or fewer cycles are enough to execute most of the additions correctly.
4) Multispeculative Adders in datapaths
In this last experiment several addition-intensive benchmarks have been implemented to see the impact of using MSKS over whole datapaths, namely:
• The 16-bit Dilation inner loop code, used in the ECG [20] .
• The 16-bit Accum submodule in the ADPCM decoder [22] .
• The 32-bit Simpson 3/8 integration (Simpson38) [21] .
• The 32-bit Trapezoid integration (Trapezoid) [21] .
The controllers have been implemented using techniques similar to [23] . Table I depicts the area, cycle time, average latency, execution time, power and energy per iteration of both conventional KS-based and MSKS implementations (labeled as MS). Figure 5 shows the area, execution time and energy percentage variations with respect to a baseline implementation with KS non-speculative adders. As can be observed, the multispeculative implementations are 24.5% faster (34% best case) and consume 32% less energy (48% best case). It must be observed that these reductions are greater when the input width is larger, because the power reduction is greater. This is due to the switching activity decrease, as illustrated in figure 2 . Regarding the area, Simpson38 and Trapezoid also obtain some reductions. Dilation and Accum are implemented with 16-bit inputs, so the area reduction due to MSKS is slightly overshadowed by some area overhead due to the additional routing and control.
VI. CONCLUSIONS
In this work we have explored the energy efficiency of Multispeculative Adders. First, the generic scheme of Multispeculative Adder has been presented, based on the fact that two very short cycles are enough for producing a correct result. Besides, this division reduces the switching activity through the whole adder. Secondly, the energy efficiency of a concrete predictive structure has been checked with several adder types of different sizes and under different switching conditions. Results show that it is possible to diminish the energy consumption by up to 90%, 85% and 70% in the case of RCA, BK and KS adders, respectively, while achieving at the same time an execution time reduction. When applied to whole circuits, these reductions contribute to the overall energy and execution time improvement.
In conclusion, Multispeculative Adders offer new possibilities for designing circuits, as they provide a new dimension, i.e. the number of predictors, to explore cost-efficient designs.
Figure 5. Area, execution time and energy variation of several benchmarks
