Abstract
Introduction
Present-day portable computers run the most common interactive applications (word processors, spreadsheets, windows, etc.) with no noticeable computation delay; weight and battery life have become more important than processing speed. These two factors are related by battery size: to operate the computer for a longer time without recharging, we need a larger, heavier battery.
The limitation is therefore in the total amount of electric energy stored in that battery, that is available for operation. To extend the battery life, we have to make the computer more efficient in the way it uses this energy.
Electrical power dissipation has been used as a figure of merit for this type of application. It is convenient for synchronous circuits with no power management, where power dissipation is largely independent of the level of activity of the circuit. Asynchronous operation is better described in terms of reactive programs: energy is dissipated only when the circuit is active. As a consequence, asynchronous circuits can have remarkable energy performance [6, 7]. For asynchronous systems, a proper measure of performance is the "energy per operation." This metric measures the energy required to execute an instruction, fetch a piece of data from memory, service an interrupt, etc. To maximize the battery life, we minimize the average energy per operation; that is, we maximize the number of instructions that we can execute with one battery charge.
0-8186-6210-7/94 $4.00 © 1994 IEEE
Energy per operation is an additive quantity: given a computation described in terms of more elementary steps, we can calculate the energy required to execute that computation by adding the energy requirements of each step. In this way we can compare the energy efficiency of different algorithms that execute the same computation, independently of timing considerations. Comparing power consumptions would require some knowledge about timing (e.g. "so much power at so much throughput").
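As a minimal sketch of the additivity argument, the following Python fragment sums per-step energies for two algorithms that perform the same computation, and converts a battery capacity into a number of executions. The step names, the energy values, and the battery capacity are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
# Hypothetical per-step energies in nanojoules (illustrative values only).
STEP_ENERGY_NJ = {"fetch": 2.0, "add": 0.5, "store": 1.5}


def energy_of(trace):
    """Energy per operation is additive: sum the energy of every step."""
    return sum(STEP_ENERGY_NJ[step] for step in trace)


def executions_per_charge(battery_joules, trace):
    """Number of times the trace can be executed on one battery charge."""
    return int(battery_joules / (energy_of(trace) * 1e-9))


# Two hypothetical algorithms for the same computation.
algo_a = ["fetch", "fetch", "add", "store"]  # two memory reads
algo_b = ["fetch", "add", "add", "store"]    # one read, extra arithmetic

if __name__ == "__main__":
    for name, trace in (("A", algo_a), ("B", algo_b)):
        print(name, energy_of(trace), executions_per_charge(1.0, trace))
```

The comparison needs no timing information: the algorithm with the lower summed energy runs more times per charge, whatever its throughput.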
In this article we propose an energy model for asynchronous circuits based on the energy cost of data communications. This model is justified in terms of the physical implementation of a communication action, and the actual energy dissipation associated with that implementation.
As an example of the use of the energy model, we analyze the design of asynchronous memories. Memory subsystems are usually designed for speed and density, with secondary consideration given to energy. Memory is slow compared to processors, and high throughput is achieved through parallelism (wide data-words) and prediction (memory caching). These same design techniques can be used to improve energy performance; the design criteria are, however, different, and are explained in detail in this paper.
First we show how to partition a memory array to minimize access energy under the assumption that all addresses are equally probable. Second, we show how to use the statistics of long sequences of addresses to further reduce the average energy per access. These techniques result in a trade-off between area and energy per access. This analysis shows that conventional commercial architectures are not optimal from the point of view of energy efficiency.
Energy Index
The energy dissipation of a CMOS circuit depends on the supply voltage: the speed of operation and the energy required to charge capacitors both increase at higher voltages. In order to evaluate the energy efficiency of a high-level circuit description, we need a measure of energy dissipation that is independent of the supply voltage. In this section, we derive such an index of performance, and use it in the next section to justify an energy model for asynchronous circuits based on the energy cost of communication actions.
Sources of energy dissipation
CMOS circuits have three main sources of energy dissipation: leakage currents, short-circuit currents, and dynamic currents. The total energy dissipated during the execution of one operation, $E_T$, can be calculated as:
$$E_T = E_l + E_d + E_{sc} \qquad (1)$$
where $E_l$ is the energy dissipated by the sub-threshold leakage currents, $E_d$ is the energy used for charging and discharging capacitors, and $E_{sc}$ is the energy dissipated by the short-circuit currents.
Leakage currents come from the sub-threshold behavior of MOSFETs. For $V_{GS} < V_{th}$, the channel conductance, $g_c$, can be modeled by [11]:
All these currents add up, and are responsible for an energy dissipation of the form
where $V_{GS} = 0$ is assumed. At the present state of the technology, energy dissipation due to leakage currents represents only a small fraction of the total power of a CMOS circuit.
Short-circuit currents originate in short transients, as in the case of a CMOS inverter, when both pull-up and pull-down transistors conduct while the input signal switches between $V_{thn}$ and $V_{DD} - V_{thp}$. This energy dissipation has the form [12]:
(4)
where the $s_i$'s are proportionality constants, and the sum is made over all transitions executed in one operation. Short-circuit currents also play a significant role in storing a value into a flip-flop built from cross-coupled inverters.
Dynamic energy dissipation, $E_d$, comes from the energy used to charge the capacitors in the circuit. The capacitors are then discharged to ground, and the energy is not recovered. $E_d$ can be computed as:
where the $C_i$'s are all the capacitors in the circuit, and $n_i$ is the number of times capacitor $C_i$ is switched in the execution of one operation. We rewrite Eq. 5 as:
Linear Energy Model
Using Eqs. 4 and 6, and neglecting the effect of subthreshold currents, we rewrite the energy equation as:
Outside the sub-threshold region ($V_{DD} \gg V_{th}$), Eq. 7 simplifies to:
Figure 1 shows $E_T / V_{DD}^2$ as a function of $V_{DD}$ for a 4-bit counter (SPICE simulation), and for the Caltech Asynchronous Microprocessor and a 3x + 1 engine (measurement). This figure shows that the linear approximation of Eq. 7 is indeed accurate.
Based on these results, we propose as an index of performance for an asynchronous CMOS circuit the corresponding constants $K_L$ and $K_s$. These indices are independent of the power-supply voltage and of the speed of operation; furthermore, $K_L$ and $K_s$ are additive: we can calculate the index corresponding to an operation by adding the indices of all of its suboperations.
As a first-order approximation we assume $K_s = 0$, and use $K_L$ as the energy performance index.
Energy model for CHP programs
Our high-level description language is CHP (Communicating Hardware Processes) [5], which is similar to CSP [4]. The CHP specification of an asynchronous circuit corresponds very closely to its implementation: for each assignment, communication, or function evaluation executed by the CHP program there is a corresponding assignment, communication, or function evaluation computed by the CMOS implementation. The CMOS implementation dissipates energy only during the execution of these actions. This energy can be identified with the energy required to execute the corresponding CHP statement. To calculate the energy required to execute a CHP program, we add the energy required to execute each statement in the trace of that program.
We would like to be able to map each statement into an energy performance index, independently of the other statements in the program. In general, it is not possible to do so: because of layout constraints, the length (and therefore the capacitance) of wires is affected by the connectivity of the whole circuit, not just the local connections. A detailed energy model would have to take into consideration the program as a whole, instead of individual statements.
The purpose of the model is, however, to study architectural trade-offs (e.g. compare bit-serial and parallel implementation of a function) or determine architectural parameters (e.g. determine the optimal width of a cache memory). A very detailed model, with a large number of parameters can be intractable, and not that much more accurate if those parameters are layout-dependent (and therefore not well known before the layout is finished). At the architectural design stage a simpler model is desirable.
The model proposed is based on the energy performance index. To each type of statement, we assign a capacitance that is representative of the energy that we would expect that operation to cost in a typical implementation. A full discussion of the possibilities and limitations of this model can be found in [10].
Communication
A CHP data communication involves two actions: first, copying the data into the wires that implement the communication channel, and second, copying the data from the communication channel into a register; the second part may not be present if the data is to be tested on the channel wires.
We assume that data communications are implemented with a four phase, dual-rail encoded protocol.
The first action involves two transitions per bit; the second action involves one transition per bit on average. If the channel is one-to-one, the energy index of this communication will be proportional to the number of bits. If the channel is a bus (i.e. the channel has three or more ports), the capacitance of the wires will increase with the number of connections. To incorporate this effect, we scale the capacitance of the send action proportionally to the number of senders, and the capacitance of the receive action proportionally to the number of receivers.
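The scaling rules just described can be written down as a small cost function. The sketch below is only an illustration under stated assumptions: the function name, the per-transition constants `k_send` and `k_recv`, and the `stored` flag (for the case where the data is only tested on the channel wires) are hypothetical names introduced here.

```python
def communication_index(bits, senders=1, receivers=1, stored=True,
                        k_send=1.0, k_recv=1.0):
    """Energy index of one data communication, in units of capacitance.

    Send:    two transitions per bit, scaled by the number of senders.
    Receive: one transition per bit on average, scaled by the number of
             receivers; omitted when the data is only tested on the wires.
    k_send and k_recv are assumed layout-dependent constants.
    """
    send = 2 * bits * senders * k_send
    recv = bits * receivers * k_recv if stored else 0.0
    return send + recv


if __name__ == "__main__":
    print(communication_index(8))               # one-to-one channel
    print(communication_index(8, receivers=4))  # bus with four receivers
    print(communication_index(8, stored=False)) # data tested on the wires
```

For a one-to-one channel the index grows linearly with the number of bits; on a bus, the term on the multi-port side dominates.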
Shared Variables
Variables shared by many processes are more expensive to implement than local variables. The value of those variables has to be known in many places, which increases the capacitance of all the related wires. To represent this cost, we scale the cost of writing into a variable proportionally to the number of processes that can read from or write to that variable.
This cost has a number of important consequences. Even though the original specification of the circuit may not contain shared variables, some will appear after process decomposition. Also, guard evaluation may involve several tests on the same variable. The process decomposition presented above for choice statements will make that cost explicit by distributing the guard evaluation, one process per guard. The actual cost of an assignment is therefore not known until after process decomposition. We can, nevertheless, make an estimate of the worst case implementation of the assignment by scaling its cost proportionally to the number of times the variable is used in the program text.
Selection
The cost of selection is the difference in energy consumption between executing one statement from each of the following two programs:
and,
The second program can be transformed into the first program by adding state variables and an extra process: 
CHOOSE

1
Because this implementation of a selection is completely general, the cost of selection is at most the cost of an Or-gate. This cost scales proportionally to the log of the number of inputs.
Function evaluation
Function evaluation can hide part of the computation executed by the program; to incorporate that cost into the energy model, we have to make the evaluation of that function explicit in the CHP specification, or otherwise use a worst case cost for the evaluation of an arbitrary boolean function.
Given the program:
...; F!f(z); ...
we want to express the cost of the evaluation of f(z).
To estimate the worst case cost we give a specific implementation for f and calculate the cost of that implementation based on the energy model described so far; that way we know that the cost of evaluating a function is consistent with the rest of the model.
If the range of z is $\{x_1, \ldots, x_n\}$, and $f(x_i) = f_i$, we can express the function evaluation as:
The cost of this program scales with $n$, which can be a large number. To obtain a more efficient implementation, we encode z as an array of $N = \lceil \log_2 n \rceil$ bits, and eliminate one bit from the function evaluation by currying:
..., 
In general, the cost of evaluating a function of $N$ inputs and $M$ outputs, $K_f(N, M)$, can be expressed as:
Memory array
In CHP, a memory is an array, and reading from memory is one of the two operations:
writing to memory is one of the two operations:
To read one word from the array, we have to execute an $A$ communication (one sender, $n$ receivers, $\log_2 n$ bits wide), an $R_i$ communication (one sender, one receiver, data-less), and an $R$ communication ($n$ senders, one receiver, $b$ bits wide). These costs are summarized in Table 1.
Table 1
Channel   Type     Width
A         1-to-n   log_2 n
R_i       1-to-1   data-less
R         n-to-1   b
The energy cost of reading one word is the sum of the energy costs of executing each of these communications, that is:
$$E_{1D,R}(n, b) = K_A\, n \log_2 n + K_{R_i} + K_R\, n b$$
where $K_A$, $K_{R_i}$, and $K_R$ are layout-dependent proportionality constants.
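To make the growth of the read cost with array size concrete, here is a hedged numerical sketch. It assumes that each communication contributes its width times the number of ports on its multi-port side (the data-less request contributing a constant), and it defaults all constants to one; the function and parameter names are illustrative, not part of the original model.

```python
import math


def read_energy_1d(n, b, k_a=1.0, k_ri=1.0, k_r=1.0):
    """Energy index of one read from a one-dimensional n x b array.

    Assumed cost of each communication (unit constants by default):
      A   : one sender, n receivers, log2(n) bits  -> k_a * n * log2(n)
      R_i : one sender, one receiver, data-less    -> k_ri
      R   : n senders, one receiver, b bits        -> k_r * n * b
    """
    address = k_a * n * math.log2(n)
    request = k_ri
    data = k_r * n * b
    return address + request + data


if __name__ == "__main__":
    for n_bits in (4, 10, 20):
        print(n_bits, read_energy_1d(2 ** n_bits, 32))
```

With these assumptions the address term grows like $n \log_2 n$ and eventually dominates the data term, which grows only like $n b$.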
A one-dimensional array is a viable solution only for small arrays; for large $n$, the energy cost scales like $n \log_2 n$. One way of improving on this cost is to map the one-dimensional array into a two-dimensional array (we verify this fact later). We represent the double indexing by splitting the address in two:
The first indexing is removed by extracting a row decoder:
The second indexing is removed by extracting a column decoder: The process decomposition and channel interconnection are shown in Fig. 2 .
In the following section we show how to choose $l$ and $w$ from a simple energy model.
Energy model and optimization
The energy cost of accessing one element of the array is calculated as the sum of the costs of the communications executed by the DEC, MUX, and ARRAY processes. A read from memory requires executing communication $R$ ($w$ senders, one receiver, $b$ bits wide), communication $A_l$ (one sender, $l$ receivers, $\log_2 l$ bits wide), communication $A_w$ (one sender, $w$ receivers, $\log_2 w$ bits wide), communication $S_i$ (one sender, $w$ receivers), and communication $R_j$ ($l$ senders, one receiver, $b$ bits wide).
We simplify Eq. 14 by assuming all constants equal to one. This approximation is acceptable for most technologies; if a more accurate model is needed, the parameters can be calculated from the layout, and the optimization done with those values of the parameters.
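We minimize $E_{2D,R}$ with respect to $l$ and $w$ under the constraint $l \times w = n$, using Lagrange multipliers:
$$U = l \log_2 l + w \log_2 w + (w + l)\, b + w + \lambda (n - l w) \qquad (16)$$
We take derivatives with respect to $l$, $w$, and $\lambda$:
$$\frac{\partial U}{\partial l} = \log_2 l + \log_2 e + b - \lambda w = 0$$
$$\frac{\partial U}{\partial w} = \log_2 w + \log_2 e + b + 1 - \lambda l = 0$$
$$\frac{\partial U}{\partial \lambda} = n - l w = 0$$
Assuming that $\tfrac{1}{2} \log_2 n + b + \log_2 e \gg 1$, we solve for $l$, $w$, and $\lambda$, and obtain $l_{opt} = w_{opt} = \sqrt{n}$. The optimum energy per access, $E_{2D,R}(n, b)$, is:
$$E_{2D,R}(n, b) = \sqrt{n}\, (\log_2 n + 2b + 1)$$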
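The square optimum can be checked numerically. The sketch below evaluates the unit-constant read cost $l \log_2 l + w \log_2 w + (w + l) b + w$ for every power-of-two factorization $l \times w = n$ and returns the cheapest one. Restricting $l$ and $w$ to powers of two, and the function names, are assumptions made here for illustration.

```python
def read_energy_2d(l_bits, w_bits, b):
    """Unit-constant read cost for an array of l = 2**l_bits rows and
    w = 2**w_bits words per row, each word b bits wide."""
    l, w = 1 << l_bits, 1 << w_bits
    return l * l_bits + w * w_bits + (w + l) * b + w


def best_split(n_bits, b):
    """Exhaustively search l * w = 2**n_bits over power-of-two splits.

    Returns (energy, l_bits, w_bits) for the minimum-energy split.
    """
    return min((read_energy_2d(k, n_bits - k, b), k, n_bits - k)
               for k in range(n_bits + 1))


if __name__ == "__main__":
    print(best_split(20, 32))  # square split for an even number of address bits
    print(best_split(21, 32))  # nearest-to-square split otherwise
```

When the number of address bits is odd, the search favors the split with the smaller $w$, which reflects the extra $w$ term contributed by the data-less select communication.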
A memory designed for speed usually has $l = b \times w$ [1]. A completely square bit-array optimizes the access time per bit, but does not take into account the energy savings derived from selecting only the bits that are part of the desired word. This extra selection step takes time and area, and saves energy.
If we compare the optimal energy for a two dimensional array with the energy used by a one-dimensional array (assuming that all constants are equal to one) we get:
The number of words in a memory chip is usually very large, on the order of $2^{20}$, making the two-dimensional arrangement far better in energy. We can, in principle, generalize this argument to multidimensional arrays, to get an even greater improvement in energy per access. This cannot be done, however, by simply increasing the number of indices in the array. The memory has to be laid out on a 2-dimensional surface; mapping a multidimensional array on this surface will make all wires much longer, and the results will not be comparable with Eq. 20. In the next section we make that mapping explicit in the CHP program for the memory, so that a realistic energy model can be derived from that program.
Multi-bank memory array
We can further reduce the energy per access by breaking up the memory into several sub-arrays, so that only one of the smaller sub-arrays is accessed in each memory reference; this technique is also known as the divided word-line method [13]. We obtain the CHP for the multi-banked memory by applying a divide-and-conquer strategy to the MEM program.
where ME and MO are $n/2 \times b$ arrays.
To read one word from MEM2B, we have to execute communication $A$ (one sender, one receiver, $\log_2 n$ bits wide), communication $R$ (two senders, one receiver, $b$ bits wide), and either MEMO or MEME. The energy cost of reading one word from a memory of size $n \times b$ can therefore be expressed as:
We apply the same transformation to the sub-arrays, until the indexing is completely removed. We get:
Or, in terms of $n$,
$$E_{MB,R}(n, b) \approx 2 K_R\, b \log_2 n + K_A\, \frac{\log_2 n\, (\log_2 n - 1)}{2} \qquad (25)$$
Depending on the cost of merging the results from the two sub-arrays, it may be convenient to stop the divide-and-conquer process after fewer than $N$ steps, and implement the remaining array as a two-dimensional array. After $N - J$ divide steps, the energy cost is:
Choosing the optimum number of banks requires more accurate knowledge of the constants in the energy equation. Some feedback from the layout can help in determining those constants with sufficient accuracy.
Address prediction
The techniques described above try to minimize the energy cost of a single access to memory, under the assumption that all addresses are equally probable. In most applications, long address sequences are far from random: the past history of the address sequence is used to increase memory throughput [8, 9, 3]. The same type of information can be used to decrease the average energy per access.
The assumptions we make about the address sequence are spatial and temporal locality [2]. Spatial locality indicates that, once an address has been accessed, there is a strong probability that nearby addresses will be accessed in the near future. Temporal locality indicates that, once an address has been accessed, there is a strong probability that the same address will be accessed again in the near future.
Spatial locality is used by pre-fetch mechanisms. The cost per word of fetching a multi-word line from memory decreases with the number of words on the line (the energy cost of decoding the address is shared among more words). If we can predict that a sufficient number of words on a line will be used, we will be able to decrease the average cost of accessing a word in memory.
Temporal and spatial locality can be used to store a copy of the contents of the memory locations most likely to be needed in the future, in a small, fast, energy-efficient memory. If the locality is strong enough, most of the memory references will be serviced by the small memory, with a corresponding improvement in energy performance.
Sequential access memory: spatial locality
Instruction memory accesses are, most of the time, accesses to consecutive memory locations. This fact can be exploited in two ways to make a more efficient memory in terms of speed and energy cost. First, the address does not need to be communicated all of the time; it can be calculated locally. Second, several consecutive words can be read in parallel at one time, thus reducing the number of memory references.
The following program describes a read-only memory with sequential access:
where M is an $n \times b$ array (a++ means post-increment a with wrap-around). After an address a is sent to the memory, several data requests are executed. Sequencing between the A and D communications is maintained by the environment.
We can reduce the number of accesses to the array M by reading several words in parallel. We replace M by an $(n/m) \times mb$ array, MP. After receiving an address, the variable line is read from the array, and subsequent data requests are satisfied with data from line, until a.w overflows and new data is read into line.
MEMR ≡ (PREF ∥ MEMP)
PREF
O D A v -~! l i n e [ a . w + + ] ;
Ca.w = 0 -VJ.
Notice that MEMP has the same form as MEMS.
For example, if all constants are equal to 1, $n = 2^{20}$, $b = 32$, % = 8, we obtain the minimum energy cost for $m = 8$, $E_S = 1134$. Compared to the minimum energy cost of accessing a memory array, $E_R = 1493$, we do not obtain a significant improvement. However, the optimal block size for this parameter set is 8 words per block; with this block size, most of the silicon area occupied by the memory will be dedicated to routing of data and address, resulting in very poor memory density. We can choose a sub-optimal block size to improve density; for a block size of $2^{10}$ words, we get $E_R = 3195$ and $E_S = 2590$. The pre-fetch mechanism allows us to use a denser memory with a smaller energy penalty.
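The behavior of the pre-fetch process described above can be illustrated with a small behavioral model that counts how often the wide array is actually read. This is a sketch under assumptions, not the CHP program itself: the class and attribute names are hypothetical, the line is reloaded lazily on the first request after a.w overflows, and no energy values are attached to the counts.

```python
class PrefetchMemory:
    """Behavioral sketch of a sequential-access read-only memory that
    keeps one m-word line in a local buffer."""

    def __init__(self, words, m):
        assert len(words) % m == 0, "array size must be a multiple of m"
        self.words = words
        self.m = m
        self.array_reads = 0  # number of accesses to the wide array
        self.line_index = 0   # which line of the wide array is buffered
        self.word_index = 0   # position within the line (a.w in the text)
        self.line = []

    def _load_line(self):
        start = self.line_index * self.m
        self.line = self.words[start:start + self.m]
        self.array_reads += 1

    def set_address(self, a):
        """Receive a new address and read its line from the array."""
        self.line_index, self.word_index = divmod(a, self.m)
        self._load_line()

    def read(self):
        """Serve one data request from the buffered line."""
        if self.word_index == self.m:  # a.w overflowed: fetch the next line
            self.word_index = 0
            self.line_index = (self.line_index + 1) % (len(self.words) // self.m)
            self._load_line()
        word = self.line[self.word_index]
        self.word_index += 1
        return word


if __name__ == "__main__":
    mem = PrefetchMemory(list(range(32)), m=8)
    mem.set_address(4)
    data = [mem.read() for _ in range(12)]
    print(data, mem.array_reads)
```

In this model, twelve consecutive requests starting in the middle of a line touch the wide array only twice; the remaining requests are served from the line buffer.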
Memory with cache: temporal locality
If the energy per access of a memory of size n is EM(n), and h is the hit ratio of the cache (that is, the fraction of addresses that are found in the cache), then the average energy per access of a system consisting of a memory of size n and a cache of size c is:
n n Using T = f , and taking derivatives with respect to T , we find popt:
Usual values of p are in the range 0.01 1. p 
In this range, we get 08.1 5 popt 5 0.5.
From the previous results we conclude that a cache designed for low energy has to optimize the hit ratio at relatively small cache sizes, to make p as small as possible. In general, caches with good hit ratios use very complicated architectures, which make the energy cost of a cache access high. Fully associative caches, for example, require that a comparison be made for each line in the cache for every access, thus cancelling the energy advantage of a higher hit ratio. The hit ratio can be increased at the expense of added delay, or by specializing the cache to specific address sequence types (instruction memory references, vectors, I/O, etc.).
Conclusions
In this paper we have shown how to estimate the energy-per-operation cost of a CMOS asynchronous circuit from its CHP specification. This technique allows us to exploit early in the design process the trade-off between energy, area, and delay. Ultimately, it allows us to get to a circuit architecture more suited to low-energy design.
The energy model derived from the high-level specification of a circuit is, by its very nature, only approximate. It represents the energy complexity of the algorithm used to solve the problem at hand, under the assumption that there is a strong correlation between this energy complexity and the actual energy dissipation of the circuit.
We have presented several memory designs, and we have shown how to choose the design parameters to obtain the optimum energy cost. These results show that commercial memory designs, optimized for delay and density, can be greatly improved in energy performance.
The execution of this command corresponds to choosing one of the true guards, executing the corresponding statement, and repeating until all guards are found to be false, in which case the command terminates. The notation *[S] is short-hand for *[true → S].
Communication:
