This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are the cache size, block size, and associativity, as well as array organization and process parameters. The model gives estimates that are within 6% of Hspice results for the circuits we have chosen.
Introduction
Most computer architecture research involves investigating trade-offs between various alternatives. This cannot be done adequately without a firm grasp of the costs of each alternative. For example, it is impossible to compare two different cache organizations without considering the difference in access or cycle times. Similarly, the chip area and power requirements of each alternative must be taken into account. Only when all the costs are considered can an informed decision be made.
Unfortunately, it is often difficult to determine costs. One solution is to employ analytical models that predict costs based on various architectural parameters. In the cache domain, both chip area models [1] and access time models [2] have been published.
In [2], Wada et al. present an equation for the access time of an on-chip cache as a function of various cache parameters (cache size, associativity, block size) as well as organizational and process parameters. Unfortunately, Wada's access time model has a number of significant shortcomings. For example, the cache tag and comparator in set-associative memories are not modeled, and in practice, these often constitute the critical path. Each stage in their model (e.g., bitline, wordline) assumes that the inputs to the stage are step waveforms; actual waveforms in memories are far from steps and this can greatly impact the delay of a stage. In the Wada model, all memory subarrays are stacked linearly in a single file; this can result in aspect ratios of greater than 10:1 and overly pessimistic access times. Wada's decoder model is a gate-level model which contains no wiring parasitics. In addition, transistor sizes in Wada's model are fixed independent of the load. For example, the wordline driver is always the same size independent of the number of cells that it drives. Finally, Wada's model predicts only the cache access time, whereas both the access and cycle time are important for design comparisons.
This paper describes a significant improvement and extension of Wada's access time model. The enhanced model is called CACTI. Some of the new features are:
- a tag array model with comparator and multiplexor drivers
- non-step stage input slopes
- rectangular stacking of memory subarrays
- a transistor-level decoder model
- column-multiplexed bitlines and an additional array organizational parameter
- load-dependent transistor sizes for wordline drivers
- cycle times as well as access times
The enhancements to Wada's model can be classified into two categories. First, the assumed cache structure has been modified to more closely represent real caches. Some examples of these enhancements are the column-multiplexed bitlines and the inclusion of the tag array. The second class of enhancements involves the modeling techniques used to estimate the delay of the assumed cache structure (e.g., taking into account non-step input rise times). This paper describes both classes of enhancements. After discussing the overall cache structure and model input parameters, the structural enhancements are described in Section 4. The modeling techniques used are then described in Section 5. A complete derivation of all the equations in the model is far beyond the scope of a journal paper but is available in a technical report [3].
Any model needs to be validated before the results generated using the model can be trusted. In [2], an Hspice model of the cache was used to validate the authors' analytical model. The same approach was used here; Section 6 compares the model predictions to Hspice measurements. Of course, this only shows that the analytical model matches the Hspice model; it does not address the issue of how well the assumed cache structure (and hence the Hspice model) reflects a real cache design. When designing a real cache, many different circuit implementations are possible. In architecture studies, however, the relative differences in access/cycle times between different cache sizes or configurations are usually more important than absolute access times. Thus, even though our model predicts absolute access times for a specific cache implementation, it can be used in a wide variety of situations when an estimate of the cost of varying an architectural parameter is required. Section 7 gives some examples of how the model can be used in architectural studies.
The model described in this paper has been implemented, and the software is available via ftp. Appendix A explains how to obtain and use the CACTI software.
2 Cache Structure
Figure 1 shows the organization of the SRAM cache being considered. The decoder first decodes the address and selects the appropriate row by driving one wordline in the data array and one wordline in the tag array. Each array contains as many wordlines as there are rows in the array, but only one wordline in each array can go high at a time. Each memory cell along the selected row is associated with a pair of bitlines; each bitline is initially precharged high. When a wordline goes high, each memory cell in that row pulls down one of its two bitlines; the value stored in the memory cell determines which bitline goes low. Each sense amplifier monitors a pair of bitlines and detects when one changes. By detecting which line goes low, the sense amplifier can determine the contents of the selected memory cell. It is possible for one sense amplifier to be shared among several pairs of bitlines. In this case, a multiplexor is inserted before the sense amps; the select lines of the multiplexor are driven by the decoder. The number of bitlines that share a sense amplifier depends on the layout parameters described in the next section.
The information read from the tag array is compared to the tag bits of the address. In an A-way set-associative cache, A comparators are required. The results of the A comparisons are used to drive a valid (hit/miss) output as well as to drive the output multiplexors. These output multiplexors select the proper data from the data array (in a set-associative cache or a cache in which the data array width is larger than the output width), and drive the selected data out of the cache.
3 Cache and Array Organization Parameters
In addition to the cache parameters (cache size $C$, block size $B$, and associativity $A$), there are six array organization parameters that are used to estimate the cache access and cycle time. In the basic organization discussed by Wada [2], a single set shares a common wordline. Figure 2-a shows this organization, where $B$ is the block size in bytes, $A$ is the associativity, and $S$ is the number of sets ($S = C/(BA)$). Clearly, such an organization could result in an array that is much larger in one direction than the other, causing either the bitlines or wordlines to be very slow. To alleviate this problem, Wada describes how the array can be broken horizontally and vertically and defines two parameters, $N_{dwl}$ and $N_{dbl}$, which indicate to what extent the array has been divided. We assume that the tag array can be configured independently of the data array. Thus, there are also three tag array parameters: $N_{twl}$, $N_{tbl}$, and $N_{tspd}$.
Model Components
This section gives an overview of key portions of the cache read access and cycle time model. The access and cycle times were derived by estimating delays due to the following components: decoder, wordlines, bitlines, sense amplifiers, comparators, multiplexor drivers, and output drivers. The delay of each of these components is estimated separately and the results combined to estimate the access and cycle time of the entire cache. A complete description of each component can be found in [3]. In this paper, we focus on those parts that differ significantly from Wada's model.
Decoder
Wada's model contains a gate-level decoder model without any parasitic capacitances or resistances. It also assumes all memory subarrays are stacked single file in a linear array. We have used a detailed transistor-level decoder that includes both parasitic capacitances and resistances. We have also assumed that subarrays are placed in a two-dimensional array to minimize critical wiring parasitics. Figure 3 shows the logical structure of the decoder architecture used in this model. The decoder in Figure 3 contains three stages. Each block in the first stage takes three address bits (in true and complement), and generates a 1-of-8 code, driving a precharged decoder bus. These 1-of-8 codes are combined using NOR gates in the second stage. The final stage is an inverter that drives each wordline driver. We also model separate decoder driver buffers for driving the 3-to-8 decoders of the data arrays and the tag arrays.
Estimating the wire lengths in the decoder requires knowledge of the memory tile layout.
As mentioned in Section 3, the memory is divided into $N_{dwl} \times N_{dbl}$ subarrays; each of these arrays is $8 B A N_{spd} / N_{dwl}$ cells wide. If these arrays were placed side-by-side, the total memory width would be $8 B A N_{dbl} N_{spd}$ cells. Instead, we assume they are grouped in two-by-two blocks, with the 3-to-8 predecode NAND gates at the center of each block; Figure 4 shows one of these blocks. This reduces the length of the connection between the decoder driver and the predecode block to approximately one quarter of the total memory width, or $2 B A N_{dbl} N_{spd}$ cells. The length of the connection between the predecode block and the NOR gate is then (on average) half of the subarray height; the full subarray height is $C / (B A N_{dbl} N_{spd})$ cells. In large memories with many groups, the bits in the memory are arranged so that all bits driving the same data output bus are in the same group, shortening the data bus.
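The subarray dimensions and decoder wire lengths described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text, not code from the CACTI program; the function names and the example cache (16 KB, 16-byte blocks, direct-mapped, with $N_{dwl} = N_{dbl} = 2$ and $N_{spd} = 1$) are hypothetical choices made here.

```python
def data_subarray_geometry(C, B, A, Ndwl, Ndbl, Nspd):
    """Return (number of subarrays, columns per subarray, rows per subarray).

    C, B are in bytes; A is the associativity. Columns and rows are counted
    in memory cells, following the expressions in the text.
    """
    subarrays = Ndwl * Ndbl
    cols = 8 * B * A * Nspd // Ndwl          # subarray width in cells
    rows = C // (B * A * Ndbl * Nspd)        # subarray height in cells
    return subarrays, cols, rows


def decoder_wire_lengths(C, B, A, Ndwl, Ndbl, Nspd):
    """Return the two decoder wire lengths, in units of memory cells."""
    total_width = 8 * B * A * Ndbl * Nspd    # all subarrays placed side by side
    driver_to_predecode = total_width / 4    # two-by-two grouping: one quarter
    subarray_height = C / (B * A * Ndbl * Nspd)
    predecode_to_nor = subarray_height / 2   # on average, half the height
    return driver_to_predecode, predecode_to_nor


if __name__ == "__main__":
    # Hypothetical example: 16 KB direct-mapped cache with 16-byte blocks.
    n, cols, rows = data_subarray_geometry(16384, 16, 1, 2, 2, 1)
    print(n, cols, rows, n * cols * rows == 8 * 16384)
    print(decoder_wire_lengths(16384, 16, 1, 2, 2, 1))
```

A quick sanity check on the expressions is that the number of subarrays times the cells per subarray equals the total number of data bits, $8C$.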
Variable Size Wordline Driver
The size of the wordline driver in Wada's model is independent of the number of cells attached to the wordline; this severely overestimates the wordline delay of large arrays. Our model assumes a variable-sized wordline driver. Normally, a cache designer would choose a target wordline rise time, and adjust the driver size appropriately. Rather than $N_{dwl}$ and $k_{rise}$ is a constant that depends on the implementation technology. To obtain the transistor size that would give this rise time, it is necessary to work backwards, using an equivalent RC circuit to find the required driver resistance, and then finding the transistor width that would give this resistance. This is described in Section 5.5.
Bitlines and Sense Ampli ers
Wada's model does not apply to memories with column multiplexing. Our model allows column multiplexing using NMOS pass transistors between several pairs of bitlines and a shared sense amp. In our model, the degree of column multiplexing (number of pairs of bitlines per sense amp) is $N_{spd} N_{dbl}$.
Although we use the same sense amp as Wada's model, we precharge the bitlines to two NMOS diodes less than $V_{dd}$ since the sense amp performs poorly with a common-mode
Comparator
Although Wada's model gives access times for set-associative caches, it only models the data portion of a set-associative memory. However, the tag portion of a set-associative memory is often the critical path. Our model assumes the tag memory array circuits are similar to those on the data side with the addition of comparators to choose between different sets.
The comparator that was modeled is shown in Figure 5. The outputs from the sense amplifiers are connected to the inputs labeled $b_n$ and $\bar{b}_n$. The $a_n$ and $\bar{a}_n$ inputs are driven by tag bits in the address. Initially, the output of the comparator is precharged high; a mismatch in any bit will close one pull-down path and discharge the output. In order to ensure that the output is not discharged before the $b_n$ bits become stable, node EVAL is held high until roughly three inverter delays after the generation of the $\bar{b}_n$ signals. This is accomplished by using a timing chain driven by a sense amp on a dummy row in the tag array. The output of the timing chain is used as a "virtual ground" for the pull-down paths of the comparator. When the large NMOS transistor in the final inverter in the timing chain begins to conduct, the virtual ground (and hence the comparator output, if there is a mismatch) begins to discharge.
Set Multiplexor and Output Drivers
In a set-associative cache, the result of the A comparisons must be used to select which of the A possible data blocks is to be sent out of the cache. Since the width of a block ($8B$ bits) is usually greater than the cache output width ($b_o$), it is also necessary to choose part of the selected block to drive the output lines. An A-way set-associative cache contains A multiplexor driver blocks, as shown in Figure 6. Each multiplexor driver uses a single comparator output bit, along with address bits, to determine which $b_o$ data array outputs drive the output bus.
The delay of each component was estimated by decomposing each component into several equivalent RC circuits, and using simple RC equations to estimate the delay of each stage. This section shows how resistances and capacitances were estimated, as well as how they were combined and the delay of a stage calculated. The stage delay in our model depends on the slope of its inputs; this section also describes how this was done.
Estimating Resistances
To use the RC approximations described in Sections 5.5 and 5.6, it is necessary to estimate the full-on resistance of a transistor. The full-on resistance is the resistance seen between drain and source of a transistor assuming the gate voltage is constant and the gate is fully conducting. This resistance can also be used for pass transistors that (as far as the critical path is concerned) are fully conducting.
It is assumed that the equivalent resistance of a conducting transistor is inversely proportional to the transistor width (only minimum-length transistors were used). Thus,

$$\text{equivalent resistance} = \frac{R}{W}$$

where $R$ is a constant (different for NMOS and PMOS transistors) and $W$ is the transistor width.
Estimating Gate Capacitances
The RC approximations in Sections 5.5 and 5.6 also require an estimation of a transistor's gate and drain capacitances. The gate capacitance of a transistor consists of two parts: the capacitance of the gate itself, and the capacitance of the polysilicon line going into the gate.
If $L_{eff}$ is the effective length of the transistor, $L_{poly}$ is the length of the poly line going into the gate, $C_{gate}$ is the capacitance of the gate per unit area, and $C_{polywire}$ is the poly line capacitance per unit area, then a transistor of width $W$ has a gate capacitance of:
The same formula holds for both NMOS and PMOS transistors.
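The gate capacitance expression itself does not appear in this text. A form consistent with the definitions above can be written as the sum of a gate-area term and a poly-line-area term. It assumes (an assumption made here, not stated in the surviving text) that the poly line going into the gate has the same width, $L_{eff}$, as the gate itself:

```latex
\mathrm{gatecap}(W) = W \, L_{eff} \, C_{gate} + L_{poly} \, L_{eff} \, C_{polywire}
```

The first term is the gate area $W L_{eff}$ times the per-unit-area gate capacitance; the second is the poly line area $L_{poly} L_{eff}$ times the per-unit-area poly capacitance.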
The value of $C_{gate}$ depends on whether the transistor is being used as a pass transistor, or as a pull-up or pull-down transistor in a static gate. Thus, two different values of $C_{gate}$ are required. Figure 7 shows typical transistor layouts for small and large transistors. We have assumed that if the transistor width is larger than 10μm, the transistor is split as shown in Figure 7-b.
Drain Capacitances
The drain capacitance is composed of both an area and a perimeter component. Using the geometries in Figure 7, the drain capacitance for a single transistor can be obtained. If the width is less than 10μm,

$$\mathrm{draincap}(W) = 3 L_{eff} W C_{diffarea} + (6 L_{eff} + W) C_{diffside} + W C_{diffgate}$$

where $C_{diffarea}$, $C_{diffside}$, and $C_{diffgate}$ are process-dependent parameters (there are two values for each of these: one for NMOS and one for PMOS transistors). $C_{diffgate}$ is the

If the width is larger than 10μm, we assume the transistor is folded (see Figure 7-b), reducing the drain capacitance to:
Now, consider two transistors (with widths less than 10μm) connected in series, with only a single $L_{eff} \times W$ wide region acting as both the source of the first transistor and the drain of the second. If the first transistor is on, and the second transistor is off, the capacitance seen looking into the drain of the first is:

$$\mathrm{draincap}(W) = 4 L_{eff} W C_{diffarea} + (8 L_{eff} + W) C_{diffside} + 3 W C_{diffgate}$$

Figure 8 shows the situation if the transistors are wider than 10μm. In this case, the capacitance seen looking into the drain of the inner transistor (x in the diagram), assuming it is on but the outer transistor is off, is:

$$\mathrm{draincap}(W) = 5 L_{eff} \frac{W}{2} C_{diffarea} + 10 L_{eff} C_{diffside} + 3 W C_{diffgate}$$
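The drain capacitance cases that are stated explicitly above can be collected into a short Python sketch. The function names are hypothetical, the process parameters are passed in rather than taken from any real technology, and treating a width of exactly 10μm as unfolded is an assumption made here (the text only covers "less than" and "larger than"). The folded single-transistor case is omitted because its expression is not reproduced in the text.

```python
FOLD_THRESHOLD_UM = 10.0  # widths above this are assumed folded (Figure 7-b)


def draincap_single(W, L_eff, c_area, c_side, c_gate):
    """Drain capacitance of a single unfolded transistor (W up to 10 um)."""
    if W > FOLD_THRESHOLD_UM:
        raise NotImplementedError("folded single-transistor case not given in the text")
    return 3 * L_eff * W * c_area + (6 * L_eff + W) * c_side + W * c_gate


def draincap_series(W, L_eff, c_area, c_side, c_gate):
    """Capacitance seen at the drain of the first (on) transistor of a
    two-transistor series stack whose second transistor is off."""
    if W <= FOLD_THRESHOLD_UM:
        # narrow transistors sharing one L_eff x W diffusion region
        return 4 * L_eff * W * c_area + (8 * L_eff + W) * c_side + 3 * W * c_gate
    # wide (folded) transistors, as in Figure 8
    return 5 * L_eff * (W / 2) * c_area + 10 * L_eff * c_side + 3 * W * c_gate


if __name__ == "__main__":
    # Unit-valued parameters, chosen only to make the arithmetic easy to follow.
    print(draincap_single(2.0, 1.0, 1.0, 1.0, 1.0))
    print(draincap_series(2.0, 1.0, 1.0, 1.0, 1.0))
    print(draincap_series(20.0, 1.0, 1.0, 1.0, 1.0))
```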
Simple RC Circuits
Each component described in Section 4 can be decomposed into several first- or second-order RC circuits. Figure 9-a shows a typical first-order circuit. The time for node x to rise or fall can be determined using the equivalent circuit of Figure 9-b. Here, the pull-down path (assuming a rising input) of the first stage is replaced by a resistance, and the gate capacitances of the second stage and the drain capacitance of the first stage are replaced by a single capacitor. The resistances and capacitances are calculated as shown in Sections 5.1 to 5.3. In stages in which the two gates are separated by a long wire, parasitic capacitances and resistances of the wire are included in $C_{eq}$ and $R_{eq}$.
The delay of the circuit in Figure 9 can be estimated using equations due to Horowitz [4], one for a rising input and one for a falling input.
As described in Section 4.2, the size of the wordline driver depends on the number of cells being driven. For a given array width, the capacitance driven by the wordline driver can be estimated by summing the gate capacitance of each pass transistor being driven by the wordline, as well as the metal capacitance of the line. Using this, and the desired rise time, the required pull-up resistance of the driver can be estimated by:

$$R_p = \frac{-\,\text{desired rise time}}{C_{eq} \ln(0.5)}$$

(recall that the desired rise time is assumed to be the time until the wordline reaches 50% of its maximum value).
Once $R_p$ is found, the required transistor width can be found using the equation in Section 5.1. Since this "backwards analysis" did not take into account the non-zero input fall time, we then use $R_p$ and the wordline capacitance and calculate the adjusted delay using Horowitz's equations as described earlier. These transistor widths are also used to estimate the delay of the final gate in the decoder.
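The backwards sizing of the wordline driver can be sketched as follows. The first two functions follow directly from the expression for $R_p$ and from the resistance model of Section 5.1 (resistance = R / W). The third function is an assumption: Horowitz's equation is not reproduced in this text, so the sketch uses a commonly quoted form of his approximation for a rising input, with `t_f` the step-input time constant ($R_{eq} C_{eq}$), `v_s` the switching threshold as a fraction of the full swing, and `b` a fitting constant. All names and the example values are hypothetical.

```python
import math


def required_pullup_resistance(desired_rise_time, c_eq):
    """R_p such that a step-driven RC stage reaches 50% at desired_rise_time."""
    return -desired_rise_time / (c_eq * math.log(0.5))


def width_for_resistance(r_target, r_unit):
    """Invert 'resistance = R / W' to get the width giving r_target."""
    return r_unit / r_target


def horowitz_delay_rising(t_rise, t_f, v_s, b=0.5):
    """Assumed form of Horowitz's approximation for a rising input.

    With t_rise = 0 this reduces to the step-input delay t_f * |ln(v_s)|.
    """
    return t_f * math.sqrt(math.log(v_s) ** 2 + 2.0 * t_rise * b * (1.0 - v_s) / t_f)


if __name__ == "__main__":
    c_wordline = 1e-12      # hypothetical wordline capacitance, farads
    target = 0.5e-9         # hypothetical desired 50% rise time, seconds
    r_p = required_pullup_resistance(target, c_wordline)
    width = width_for_resistance(r_p, r_unit=10e3)  # r_unit is a made-up constant
    t_f = r_p * c_wordline
    print(r_p, width)
    print(horowitz_delay_rising(0.0, t_f, 0.5), horowitz_delay_rising(1e-9, t_f, 0.5))
```

With a step input the adjusted delay equals the target rise time, and it grows as the input ramp slows, which is the effect the adjustment is meant to capture.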
RC-Tree Solutions
All but two of the stages along the cache's critical path can be approximated by simple first-order stages as in the previous section. The bitline and comparator equivalent circuits, however, require more complex solutions. Figure 10 shows an equivalent circuit that can be used for the bitline and comparator circuits.
If we assume that the same amount of "drive" is required to drive the output to $v_{pre} - v_{sense}$ regardless of the shape of the input waveform, then we can calculate the output delay for an arbitrary input waveform. Consider Figure 12-a. If we assume the area is the same as in Figure 11, then we can calculate the value of $T$ (the delay adjusted for input rise time).
Using simple algebra, it is easy to show that
If the wordline rises quickly, as shown in Figure 12-b, then the algebra is slightly different. In this case,
The cross-over point between the two cases for $T$ occurs when:
The non-zero input rise time of the comparator can be taken into account similarly. The delay of the comparator is composed of two parts: the delay of the timing chain and the delay discharging the output (see Figure 5). The delay of the first three inverters in the timing chain can be approximated using simple first-order RC stages as described in Section 5.5. The time to discharge the comparator output through the final inverter can be estimated using the equivalent circuit of Figure 10 and taking into account the non-zero input rise time using the same technique that was used for the bitline subcircuit. In this case, the "input" is the output of the third inverter in the timing chain (we assume the timing chain is long enough that the $a_n$ and $b_n$ lines are stable). The discharging delay of the comparator output is measured from the time the input reaches the threshold voltage of the final timing chain inverter. The equations for this case can be found in [3].
Total Access and Cycle Time
This section describes how delays of the model components described in Section 4 are combined to estimate the cache read access and cycle times.
Access Time
There are two potential critical paths in a cache read access. If the time to read the tag array, perform the comparison, and drive the multiplexor select signals is larger than the time to read the data array, then the tag side is the critical path, while if it takes longer to read the data array, then the data side is the critical path. In many cache implementations, the designer would try to margin the cache design such that the tag path is slightly faster than the data path so that the multiplexor select signals are valid by the time the data is ready. Often, however, this is not possible. Therefore, either side could determine the access time, meaning both sides must be modeled in detail.
In a direct-mapped cache, the access time is the larger of the two paths:

$$T_{access,dm} = \max(T_{dataside} + T_{outdrive,data},\; T_{tagside,dm} + T_{outdrive,valid})$$

where $T_{dataside}$ is the delay of the decoder, wordline, bitline, and sense amplifier for the data array, $T_{tagside,dm}$ is the delay of the decoder, wordline, bitline, sense amplifier, and comparator for the tag array, $T_{outdrive,data}$ is the delay of the cache data output driver, and $T_{outdrive,valid}$ is the delay of the valid signal driver.
In a set-associative cache, the tag array must be read before the data signals can be driven. Thus, the access time is:

$$T_{access,sa} = \max(T_{dataside},\; T_{tagside,sa}) + T_{outdrive,data}$$

where $T_{tagside,sa}$ is the same as $T_{tagside,dm}$, except that it includes the time to drive the select lines of the output multiplexors.
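The two access time expressions can be written as a pair of small Python functions. This is a minimal sketch of how the component delays combine, with hypothetical function and argument names; the delays themselves would come from the component models of Section 4.

```python
def access_time_direct_mapped(t_dataside, t_outdrive_data,
                              t_tagside_dm, t_outdrive_valid):
    """Direct-mapped: data and tag paths run in parallel, each through its
    own output driver; the slower path sets the access time."""
    return max(t_dataside + t_outdrive_data,
               t_tagside_dm + t_outdrive_valid)


def access_time_set_associative(t_dataside, t_tagside_sa, t_outdrive_data):
    """Set-associative: the data output driver cannot start until both the
    data array and the tag side (including multiplexor select) are done."""
    return max(t_dataside, t_tagside_sa) + t_outdrive_data


if __name__ == "__main__":
    # Arbitrary delays in arbitrary units, for illustration only.
    print(access_time_direct_mapped(5, 1, 4, 3))
    print(access_time_set_associative(5, 6, 1))
```

The structural difference is where the output driver delay sits: inside the `max` for the direct-mapped case, outside it for the set-associative case.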
Figures 13 to 16 show analytical and Hspice estimations of the data and tag sides for direct-mapped and 4-way set-associative caches. A 0.8μm CMOS process was assumed [6]. To gather these results, the model was first used to find the array organization parameters which resulted in the lowest access time (via exhaustive search) for each cache size. These
Cycle Time
The difference between the access and cycle time of a cache varies widely depending on the circuit techniques used. Usually the cycle time is a modest percentage larger than the access time, but in pipelined or post-charge circuits [7, 8] the cycle time can be less than the access time.
There are three elements in our assumed cache organization that need to be precharged: the decoders, the bitlines, and the comparator. The precharge times for these elements are somewhat arbitrary, since the precharging transistors can be scaled in proportion to the loads they are driving. We have assumed that the time for the wordline to fall and bitline to rise in the data array is the dominant part of the precharge delay. Assuming properly ratioed transistors in the wordline drivers, the wordline fall time is approximately the same as the wordline rise time. It is assumed that the bitline precharging transistors are scaled such that a constant (over all cache organizations) bitline charge time is obtained. This constant will, of course, be technology dependent. In the model, we assume that this constant is equal to four inverter delays (each with a fanout of four). Thus, the cycle time of the cache can be written as:

$$T_{cache} = T_{access} + T_{wordline\ delay} + 4 \times (\text{inverter delay})$$

7 Applications of the Model
This section gives examples of how the analytical model can be used to quickly gather data that can be used in architectural studies.
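As a small illustration of such use, the cycle time expression from Section 6 can be evaluated directly once an access time is known. The function below is a sketch with hypothetical names; the inverter delay is taken to be that of an inverter with a fanout of four, as stated in the text.

```python
def cycle_time(t_access, t_wordline_delay, t_inverter_fo4):
    """Cycle time = access time + wordline fall time (taken equal to the
    wordline delay) + a bitline precharge time of four inverter delays."""
    return t_access + t_wordline_delay + 4 * t_inverter_fo4


if __name__ == "__main__":
    # Arbitrary illustrative delays, all in the same time unit.
    print(cycle_time(7.0, 1.0, 0.25))
```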
Cache Size
First consider Figure 17. These graphs show how the cache size affects the cache access and cycle times in a direct-mapped and 4-way set-associative cache. In these graphs (and all graphs in this report), $b_o = 64$ and $b_{addr} = 32$. For each cache size, the optimum array organization parameters were found (these optimum parameters are shown in the graphs as before; the six numbers associated with each point correspond to $N_{dwl}$, $N_{dbl}$, $N_{spd}$, $N_{twl}$, $N_{tbl}$, and $N_{tspd}$, in that order), and the corresponding access and cycle times were plotted.
In addition, the graph breaks down the access time into several components.
There are several observations that can be made from the graphs. Starting from the bottom, it is clear that the time through the data array decoders is always longer than the time through the tag array decoders. For all but one of the organizations selected, there are more data subarrays ($N_{dwl} \times N_{dbl}$) than tag subarrays ($N_{twl} \times N_{tbl}$). This is because the total tag storage is usually much less than the total data storage.
In all caches shown, the comparator is responsible for a significant portion of the access time. Another interesting trend is that the tag side is always the critical path in the cache access. In the direct-mapped cases, organizations are found which result in very closely matched tag and data sides, while in the set-associative case, the paths are not matched nearly as well. This is due primarily to the delay driving select lines of the output multiplexor.
Figure 18 shows how the access and cycle times are affected by the block size (the cache size is kept constant). In the direct-mapped graph, the access and cycle times drop as the block size increases. Most of this is due to a drop in the decoder delay (a larger block decreases the depth of each array and reduces the number of tags required). In the set-associative case, the access and cycle time begins to increase as the block size gets above 32. This is due to the output driver; a larger block size means more drivers share the same cache output line, so there is more loading at the output of each driver. This trend can also be seen in the direct-mapped case, but it is much less pronounced. The number of output drivers that share a line is proportional to A, so the proportion of the total output capacitance that is the drain capacitance of other output drivers is smaller in a direct-mapped cache than in the 4-way set-associative cache. Also, in the direct-mapped case, the slower output driver only affects the data side, and it is the tag side that dictates the access time.
It is dangerous to draw too many conclusions directly from the graphs without considering miss rate data. Figure 19 seems to imply that a direct-mapped cache is always the best. While it is always the fastest, it is important to remember that the direct-mapped cache will have the lowest hit rate. Hit rate data obtained from a trace-driven simulation or some other means must be included in the analysis before the various cache alternatives can be fairly compared.
Similarly, a small cache has a lower access time, but will also have a lower hit rate. In [9], it was found that when the hit rate and cycle time are both taken into account, there is an optimum cache size between the two extremes.
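One simple way to combine cycle time with miss rate data is an average time per reference: the cycle time plus the miss rate times a miss penalty. This metric and the numbers below are not taken from this paper or from [9]; they are a hypothetical sketch of the kind of comparison described above, showing how a mid-sized cache can win once both effects are counted.

```python
def average_time_per_reference(cycle_time, miss_rate, miss_penalty):
    """Every reference pays the cycle time; misses pay an extra penalty."""
    return cycle_time + miss_rate * miss_penalty


def best_cache(options, miss_penalty):
    """options maps a label to (cycle_time, miss_rate); return the label
    with the lowest average time per reference."""
    return min(options,
               key=lambda name: average_time_per_reference(*options[name],
                                                           miss_penalty))


if __name__ == "__main__":
    # Made-up cycle times (ns) and miss rates, for illustration only.
    candidates = {
        "small": (5.0, 0.10),
        "medium": (6.0, 0.04),
        "large": (9.0, 0.03),
    }
    print(best_cache(candidates, miss_penalty=50.0))
```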
Appendix A: Obtaining and Using the CACTI Software
A program that implements the CACTI model described in this paper is available. To obtain the software, log into gatekeeper.dec.com using anonymous ftp (use "anonymous" as the login name and your machine name as the password). The files for the program are stored together in the archive "pub/DEC/cacti.tar.Z". Get this file, "uncompress" it, and extract the files using "tar".
The program consists of a number of C files; time.c contains the model. Transistor widths and process parameters are defined in def.h. A makefile is provided to compile the program.
Once the program is compiled, it can be run using the command:
where C is the cache size in bytes, B is the block size in bytes, and A is the associativity. The output width and internal address width can be changed in def.h. When the program is run, it will consider all reasonable values for the array organization parameters discussed in Section 3 and choose the organization that gives the smallest access time. The values of the array organization parameters chosen are included in the output report.
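The search over array organizations described above can be sketched as an exhaustive minimization. This is not the code in time.c: the candidate values (powers of two up to 8) and the `access_time` callback are assumptions made here for illustration, and the toy cost function in the example merely stands in for the real delay model.

```python
from itertools import product


def best_organization(access_time, choices=(1, 2, 4, 8)):
    """Try every (Ndwl, Ndbl, Nspd, Ntwl, Ntbl, Ntspd) tuple drawn from
    `choices` and return the one with the smallest access time.

    `access_time` is any callable mapping such a 6-tuple to a delay.
    """
    return min(product(choices, repeat=6), key=access_time)


if __name__ == "__main__":
    # Toy stand-in for the delay model: penalize distance from 2 in each slot.
    def toy_cost(org):
        return sum((x - 2) ** 2 for x in org)

    print(best_organization(toy_cost))
```

With four candidate values per parameter the search covers 4^6 = 4096 organizations, which is small enough to enumerate directly.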
The extended description of model details is available on the World Wide Web at URL "http://nsl.pa.dec.com/wrl/techreports/93.5.html#93.5".
