For the newer DRAM designs, the time to extract the required data from the sense amps/row caches for transmission on the memory bus is the largest component in the average access time, though page mode allows this to be overlapped with column access and the time to transmit the data over the memory bus.
INTRODUCTION
In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, evaluating each in terms of its effect on total execution time. We simulate the performance of se Rambus [33]. While there are a number of academic proposals for new DRAM designs, space limits us to covering only existing commercial architectures. To obtain accurate memory-request timing for an aggressive out-of-order processor, we integrate our code into the SimpleScalar tool set [4].
Second, widening buses will present new optimization opportunities. Each application exhibits a different degree of locality and therefore benefits from page mode to a different degree. As buses widen, this effect becomes more pronounced, to the extent that different applications can have average access times that differ by a factor of two. This is a minor issue considering current bus technology. However, future bus technologies will expose the row access as a primary performance bottleneck, justifying the exploration of mechanisms that exploit locality to guarantee hits in the DRAM row buffers: e.g. row-buffer victim caches, prediction mechanisms, etc. Note that recent commercial DRAM proposals address exactly this issue by placing associative SRAM caches on the DRAM die to exploit locality and the tremendous bandwidth available on-chip [12].
High-Performance DRAMs in Workstation Environments
Third, while buses as wide as the L2 cache yield the best memory latency, they have passed the point of diminishing returns: for instance, a bus half as wide would not yield twice the latency. The use of page mode overlaps the components of DRAM access when making multiple requests to the same row, and one can only exploit this overlap when a cache block is larger than the bus width; otherwise, every cache-fill request requires one row access and one column access. Therefore, the DRAM bus should not exceed N/2 bits, where N is the L2 cache width.
Fourth, critical-word-first does not mix well with burst mode. Critical-word-first is a strategy that requests a block of data potentially out of address-order; burst mode delivers data in a fixed but redefinable order. A burst-mode DRAM can thus have longer latencies in real systems, even if its end-to-end latency is low. However, we note that for the applications studied, total execution time seems to correlate more with end-to-end DRAM latencies than with critical-word latencies.
Finally, the choice of refresh mechanism can significantly alter the average memory access time. For some benchmarks and some refresh organizations, the amount of time spent waiting for a DRAM in refresh mode accounted for 50% of the total latency.
As one might expect, our results and conclusions are dependent on our system specifications, which we chose to be representative of mid- to high-end workstations: a 100MHz 128-bit memory bus (an organization that is found in SPARC workstations and has the same bandwidth as a DRDRAM channel), an eight-way superscalar out-of-order CPU, lockup-free caches, and a small-system DRAM organization with ~10 DRAM chips.
RELATED WORK
Burger, Goodman, and Kagi quantified the effect on memory behavior of high-performance latency-reducing or latency-tolerating techniques such as lockup-free caches, out-of-order execution, prefetching, speculative loads, etc. [5]. They concluded that to hide memory latency, these techniques often increase the demands on memory bandwidth. They classify memory stall cycles into two types: those due to lack of available memory bandwidth, and those due purely to latency. This is a useful classification, and we use it in our study. This study differs from theirs in that we focus on the access time of only the primary memory system, while their study combines all memory access time, including the L1 and L2 caches. Their study focuses on the behavior of latency-hiding techniques, while this study focuses on the behavior of different DRAM architectures.
Several marketing studies compare the memory latency and bandwidth available from different DRAM architectures [6, 30, 31]. This paper builds on these studies by looking at a larger assortment of DRAM architectures, measuring DRAM impact on total application performance, decomposing the memory access time into different components, and measuring the hit rates in the row buffers.
Finally, there are many studies that measure system-wide performance, including that of the primary memory system [1, 2, 10, 22, 26, 27, 34, 35]. Our results resemble theirs, in that we obtain similar figures for the fraction of time spent in the primary memory system. However, these studies have different goals from ours, in that they are concerned with measuring the effects on total execution time of varying several CPU-level parameters such as issue width, cache size and organization, number of processors, etc. This study focuses on the performance behavior of different DRAM architectures.
BACKGROUND
A Random Access Memory (RAM) that uses a single transistor-capacitor pair for each binary value (bit) is referred to as a Dynamic Random Access Memory or DRAM. This circuit is dynamic because leakage requires that the capacitor be periodically refreshed for information retention. Initially, DRAMs had minimal I/O pin counts because the manufacturing cost was dominated by the number of I/O pins in the package. Due largely to a desire to use standardized parts, the initial constraints limiting the I/O pins have had a long-term effect on DRAM architecture: the address pins for most DRAMs are still multiplexed, potentially limiting performance.
As the standard DRAM interface has become a performance bottleneck, a number of "revolutionary" proposals [28] have been made. In most cases, the revolutionary portion is the interface or access mechanism, while the DRAM core remains essentially unchanged.
The Conventional DRAM
The addressing mechanism of early DRAM architectures is still utilized, with minor changes, in many of the DRAMs produced today. In this interface, shown in Figure 1, the address bus is multiplexed between row and column components. The multiplexed address bus uses two control signals (the row and column address strobe signals, RAS and CAS respectively) which cause the DRAM to latch the address components. The row address causes a complete row in the memory array to propagate down the bit lines to the sense amps. The column address selects the appropriate data subset from the sense amps and causes it to be driven to the output pins.
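The multiplexed addressing described above can be illustrated with a minimal Python sketch. The function name `split_address` and the bit widths (12 row bits, 10 column bits) are assumptions chosen for illustration; they are not taken from any data sheet or from the simulator used in this study.

```python
def split_address(addr, row_bits, col_bits):
    """Split a cell address into the two components sent over a
    multiplexed address bus: the row part (latched on RAS) and the
    column part (latched on CAS). Bit widths are hypothetical."""
    if addr < 0 or addr >= 1 << (row_bits + col_bits):
        raise ValueError("address out of range for this array")
    col_mask = (1 << col_bits) - 1
    row = addr >> col_bits   # upper bits select the row driven to the sense amps
    col = addr & col_mask    # lower bits select the subset within that row
    return row, col


# Example: a hypothetical array with 4096 rows and 1024 columns.
row, col = split_address(0x2A7F3, row_bits=12, col_bits=10)
```

Because both components share the same pins, the controller needs only max(row_bits, col_bits) address pins, at the cost of sending the address in two steps.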
Extended Data Out DRAM (EDO DRAM)
Extended Data Out DRAM, sometimes referred to as hyper-page mode DRAM, adds a latch between the sense-amps and the output pins of the DRAM, shown in Figure 3. This latch holds output pin state and permits the CAS to rapidly de-assert, allowing the memory array to begin precharging sooner. In addition, the latch in the output path also implies that the data on the outputs of the DRAM circuit remain valid longer into the next clock phase. Figure 4 gives the timing for an EDO read.
Synchronous DRAM (SDRAM)
Conventional, FPM, and EDO DRAM are controlled asynchronously by the processor or the memory controller; the memory latency is thus some fractional number of CPU clock cycles. An alternative is to make the DRAM interface synchronous such that the DRAM latches information to and from the controller based on a clock signal. A timing diagram is shown in Figure 5 . SDRAM devices typically have 
Enhanced Synchronous DRAM (ESDRAM)
Enhanced Synchronous DRAM is a modification to Synchronous DRAM that parallels the differences between FPM and EDO DRAM. First, the internal timing parameters of the ESDRAM core are faster than SDRAM. Second, SRAM row-caches have been added at the sense-amps of each bank. These caches provide the kind of improved inter-row performance observed with EDO DRAM, allowing requests to the last accessed row to be satisfied even when subsequent refreshes, precharges, or activates are taking place. It also allows a write to proceed through the sense amps directly without overwriting the line buffered in the SRAM cache, which would otherwise destroy any read locality.
Double Data Rate DRAM (DDR DRAM)
Double data rate (DDR) DRAM doubles the bandwidth available from SDRAM by transferring data at both edges of the clock. DDR DRAMs are very similar to single data rate SDRAM in all other characteristics.
They use the same signalling technology, the same interface specification, and the same pinouts on the DIMM carriers. Internally, DDR-DRAM employs 2n prefetching, where twice the number of bits is read in or written to the DRAM array on each access, and two n-bit transfers take place every half cycle.
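As a rough illustration of the bandwidth doubling, peak bus bandwidth can be computed from bus width, clock rate, and the number of transfers per clock. The sketch below is a simplified model, not the simulator's code; the function name and the 64-bit 100MHz SDRAM/DDR example are assumptions, while the 100MHz 128-bit figure is the memory bus used in this study.

```python
def peak_bandwidth_mbytes_per_s(bus_bits, clock_mhz, transfers_per_clock):
    """Theoretical peak bandwidth in Mbytes/s.

    transfers_per_clock is 1 for single-data-rate parts and 2 for parts
    that transfer on both clock edges (e.g. DDR).
    """
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock


# Hypothetical 64-bit, 100MHz interface: single vs. double data rate.
sdr = peak_bandwidth_mbytes_per_s(64, 100, 1)
ddr = peak_bandwidth_mbytes_per_s(64, 100, 2)

# The 100MHz 128-bit memory bus of the simulated system.
system_bus = peak_bandwidth_mbytes_per_s(128, 100, 1)
```

This is a peak figure only; it ignores row activation, precharge, refresh, and bus turnaround, which the measured latencies account for.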
Synchronous Link DRAM (SLDRAM)
RamLink is the IEEE standard (P1596.4) for a bus DRAMs use a one-byte-wide multiplexed address/data bus to connect the memory controller to the RDRAM devices. The bus runs at 300 MHz and transfers on both clock edges to achieve a theoretical peak of 600 Mbytes/s. Ph
Direct Rambus (DRDRAM)
This is the portion of T P that is overlapped with memory access. SLDRAM, RDRAM, and DRDRAM utilize narrower, but higher speed buses. These DRAM architectures can be arranged in parallel channels, and we study them here in the context of a single-width DRAM bus, which is the simplest configuration, as well as a dual-channel configuration for SLDRAM and RDRAM.
As in real-world systems, the memory controller coalesces bus packets into 128-bit chunks to be and (d) the parallel-channel SLDRAM and Rambus performance numbers in Figure 11. Due to differences in bus design, the only bus overhead included in the simulations is that of the bus that is common to all organizations: the 100MHz 128-bit memory bus. The simulator models a synchronous memory interface: the processor's interface to the memory controller has a clock signal. This is typically simpler to implement and debug than a fully asynchronous interface. If the processor executes at a faster clock rate than the memory bus (as is likely), the processor may have to stall for several cycles to synchronize with the bus before transmitting the request. We account for the number of stall cycles in Bus Wait Time.
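The synchronization stall can be sketched as follows. This is a minimal model under stated assumptions, not the simulator's implementation: the CPU clock is assumed to be an integer multiple of the bus clock, bus clock edges are assumed to fall on CPU cycles that are multiples of that ratio, and the function name is hypothetical.

```python
def bus_wait_cycles(ready_cycle, cpu_cycles_per_bus_cycle):
    """CPU cycles a request stalls before the next bus clock edge.

    ready_cycle: CPU cycle at which the request is ready to transmit.
    cpu_cycles_per_bus_cycle: clock ratio, e.g. 10 for a 1GHz CPU on a
    100MHz bus (assumed to be an integer).
    """
    if cpu_cycles_per_bus_cycle <= 0:
        raise ValueError("clock ratio must be positive")
    # Distance from ready_cycle up to the next multiple of the ratio.
    return (-ready_cycle) % cpu_cycles_per_bus_cycle


# 1GHz CPU, 100MHz bus: a request ready at CPU cycle 23 waits for cycle 30.
wait = bus_wait_cycles(23, 10)
```

Under this model the stall is bounded by one bus cycle minus one CPU cycle.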
The simulator models several different refresh organizations, as described in Section 5. The amount of time (on average) spent stalling due to a memory reference arriving during a refresh cycle is accounted for in the time component labeled Refresh Time.
Interleaving
For the 100MHz 128-bit bus configuration, the request size is eight times the transfer size; therefore each DRAM access is a pipelined operation that takes advantage of page mode. ved; EDO DRAM specifies a 25ns CAS period and is two-way interleaved. Both are interleaved at a bus-width granularity.
EXPERIMENTAL RESULTS

For most graphs, the performance of several DRAM organizations is given: FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM, ESDRAM, DDR, SLDRAM, SLDRAMx2, RDRAM, RDRAMx2, and DRDRAM. SLDRAMx2 and RDRAMx2 represent the SLDRAM and RDRAM organizations with two channels (described earlier). The remaining labels should be self-explanatory.
Handling Refresh
Surprisingly, DRAM refresh organization can affect performance dramatically. Where the refresh organization is not specified for an architecture, we simulate a model in which the DRAM allocates bandwidth to either memory references or refresh operations, at the expense of predictability [28]. The refresh period for all DRAM parts but Rambus is 64ms; Rambus parts have a refresh period of 33ms. In the simulations presented in this paper, this period is divided into N individual refresh operations that occur 33/N milliseconds apart, where 33 is the refresh period in milliseconds and N is the number of rows in an internal bank times the number of internal banks. This is the Rambus mechanism, and a memory request can be delayed at most the refresh of one DRAM row. For Rambus parts, this behavior is spelled out in the data sheets. For other DRAMs, the refresh mechanism is not explicitly stated. Note that normally, when multiple DRAMs are ganged together into physical banks, all banks are refreshed at the same time. This is different; Rambus refreshes internal banks individually.
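The spacing of the time-interspersed refresh operations follows directly from the period and N. The sketch below computes it; the function name and the row and bank counts in the examples are assumptions for illustration, not values from any data sheet.

```python
def refresh_interval_us(period_ms, rows_per_bank, internal_banks):
    """Microseconds between individual row refreshes when the refresh
    period is divided evenly into N = rows_per_bank * internal_banks
    single-row refresh operations."""
    n = rows_per_bank * internal_banks
    if n <= 0:
        raise ValueError("need at least one row to refresh")
    return period_ms * 1000.0 / n


# Hypothetical Rambus-like part: 33ms period, 512 rows x 16 internal banks.
rambus_like = refresh_interval_us(33, rows_per_bank=512, internal_banks=16)

# Hypothetical conventional part: 64ms period, 4096 rows, one bank.
conventional = refresh_interval_us(64, rows_per_bank=4096, internal_banks=1)
```

With this organization, a memory reference that collides with a refresh waits for at most one row refresh rather than for the whole array.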
Because many textbooks describe the refresh operation as a periodic shutting down of the DRAM until all rows are refreshed (e.g. [17]), we also simulated stalling the DRAM once every 64ms to refresh the entire memory array; thus, every 64ms, one can potentially delay one or more memory references the time it takes to refresh the entire memory array. This approach yields refresh stalls up to two orders of magnitude worse than the time-interspersed scheme. Particularly hard-hit was the compress benchmark, shown in Figure 9, with refresh stalls accounting for over 50% of the average access time in several of the DRAM architectures. Because such high overheads are easily avoided with an appropriate refresh organization, we only present results for the time-interspersed refresh approach. One of the most obvious results is that more than half of the SPECint '95 benchmarks (gcc, ijpeg, m88ksim, perl, and vortex) exhibit the same memory-system overhead that has been reported in the literature for large-footprint applications considered much more memory-intensive than SPEC: the middle bars in Figure 10(a) for these benchmarks, which represent CPU speeds of 1GHz, have non-overlapped 3.
Total Execution Time
We do not look at the floating-point benchmarks here because their regular access patterns make them easy targets for optimizations such as prefetching and access reordering [24, 25]. Another obvious point is that anywhere from 5% to 99% of the memory overhead is overlapped with processor execution; the most memory-intensive applications successfully overlap 5-20%. SimpleScalar schedules instructions extremely aggressively and hides a fair amount of the memory latency with other work, though this "other work" is not all useful work, as it includes all L1 and L2 cache activity. For the 100ns L2 (corresponding to a 100MHz processor), between 50% and 99% of the memory access-time is hidden, depending on the type of DRAM the CPU is attached to (the faster DRAM parts allow a processor to exploit greater degrees of concurrency). For 10ns (corresponding to a 1GHz processor), between 5% and The rankings do not change from application to application (DDR is fastest, followed by ESDRAM, Direct Rambus, and SDRAM), and the gap between the fastest and slowest architectures is only 10-15%.
Summary:
The graphs demonstrate the degree to which contemporary DRAM designs are addressing the memory bandwidth problem. Popular high-performance techniques such as lockup-free caches and out-of-order execution expose memory bandwidth as the bottleneck to improving system performance; i.e., The graphs also show the expected result that as L2 cache and processor speeds increase, systems are less able to tolerate memory latency. Accordingly, the remainder of our study focuses on the components of memory latency.
Average Memory Latency
Though it is a completely unbalanced design, we also measured latencies for 128-bit wide configurations for Rambus and SLDRAM designs, pictured in Figure 7(d). These "parallel-channel" results are intended to demonstrate the mismatch between today's bus speeds and fastest DRAMs; they are shown in the bottom left corner of Figure 11.
Bus Transmission Time is that portion of the bus activity not overlapped with column access or data transfer, and it accounts for 10% to 30% of the total latency. In the DDR results Bus Transmission accounts for 40-45% of the total, and in the parallel-channel results it accounts for more than 50%. EDO DRAM does a much better job than FPM DRAM of overlapping column access with data transfer. This is to be expected, given the timing diagrams for these architectures. Note that the overlap components (Data Transfer Time Overlap) tend to be very large in general, demonstrating relatively significant performance savings due to page-mode. This is an argument for keeping buses no wider than half the block size of the L2 cache.
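The bus-width argument can be made concrete by counting the column accesses needed to fill one cache block. The sketch below is an illustrative model only; the function names are hypothetical, and the 128-byte L2 block size is assumed here for the examples.

```python
def column_accesses_per_fill(block_bytes, bus_bits):
    """Number of bus-width transfers (column accesses) needed to fill
    one cache block, rounded up."""
    bus_bytes = bus_bits // 8
    if bus_bytes <= 0:
        raise ValueError("bus must be at least one byte wide")
    return -(-block_bytes // bus_bytes)  # ceiling division


def page_mode_can_overlap(block_bytes, bus_bits):
    """Page mode overlaps successive column accesses to the same row,
    so it only helps when a fill needs more than one column access."""
    return column_accesses_per_fill(block_bytes, bus_bits) > 1


BLOCK = 128  # bytes; assumed L2 block size for this illustration
narrow = column_accesses_per_fill(BLOCK, 128)    # 128-bit bus
half = column_accesses_per_fill(BLOCK, 512)      # bus half as wide as the block
full = column_accesses_per_fill(BLOCK, 1024)     # bus as wide as the block
```

A bus of N/2 bits is the widest configuration that still issues two column accesses per fill, which is the minimum needed for any page-mode overlap.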
Several of the architectures show no overlap at all between data transfer and column access. SDRAM and ESDRAM do not allow such overlap because they instead use burst mode, which obviates multiple column accesses (see Figure 5). SLDRAM does allow overlap, just as the Rambus parts do; however, for simplicity, in our simulations we modeled SLDRAM's burst mode. The overlapped mode would have yielded similar latencies.
The interleaved configurations (FPM3 and EDO2) demonstrate excellent performance; latency for FPM DRAM improves by a factor of 2 with four-way interleaving, and EDO improves by 25-30% with two-way interleaving. The interleaved EDO configuration performs slightly worse than the FPM configuration because it does not take full advantage of the memory bus; there is still a small amount of unused data bus bandwidth. Note that the break-downs of these organizations look very much like Direct Rambus; Rambus behaves similarly to highly interleaved systems but at much lower cost points.
The "x2" variants of SLDRAM and RDRAM demonstrate excellent performance as well. Both Column Access and Data Transfer decrease by a factor of two; both channels can be active simultaneously, fetching or writing different parts of the same L2 cache line. This behavior is expected. This reduces the average DRAM access time by roughly 30% and the total execution time (see Figure 10) by 25%, making these configurations as fast as any other of the modern DRAM designs.
The time stalled due to refresh tends to account for 1-2% of the total latency; this is more in line with expectations than the results shown in Figure 9 . The time stalled synchronizing with the memory bus is in the same range, accounting for 1-5% of the total.
This is a small price to pay for a simpler DRAM interface, compared to a fully asynchronous design.
Summary:
The FPM architecture is the baseline architecture, but it could be sped up by 30% with a greater degree of overlap between the column access and data transmission. This is seen in the EDO architecture: its column access is a bit faster due to the latch between the sense amps and the output pins, and its degree of overlap with data transfer is greater, yielding a significantly faster design using essentially the same technology as FPM. Synchronous DRAM is another 30% faster than EDO, and Enhanced SDRAM increases performance another 15% by improving the row- and column-access timing parameters and adding an SRAM cache to improve concurrency. DDR is the fastest of the DRAM architectures studied, which is not surprising due to its bandwidth, which is twice that of the other DRAMs studied. It is interesting to note that its performance is slightly better than that of Enhanced Memory's SDRAM, and Figure 10 shows that while it has reduced the bandwidth portion of latency more than ESDRAM, ESDRAM has reduced the latency component more than DDR. This is to be expected, as DDR has a core that is fundamentally similar to that of SDRAM (it simply has a faster interface), while ESDRAM has a core unlike any other DRAM architecture studied: latching the entire row optimally hides the precharge activity and increases the overlap between access to different rows, thus reducing average latency.
As modeled, SLDRAM and Rambus designs have higher end-to-end transaction latencies than SDRAM, ESDRAM, or DDR, as they require twice as many data transfers to complete a 128-bit transaction. However, they are not ganged together into a wide datapath, as are the other organizations.
Despite the handicap, SLDRAM performs well, which is important considering it is a public standard. The 
Perfect-Width Buses
As a limit study 
Row Access component becomes relatively more significant than in the results of a 128-bit bus (Figure 11). Whereas in Figure 11, variations in Row Access caused overall variations in access time of roughly 10%, these graphs quantify the effect that Row Access has on systems with wider buses: average access time can vary by a factor of two.
Summary:
Coupled with extremely wide buses that hide the effects of limited bandwidth and thus highlight the differences in memory latency, the DRAM architectures perform similarly. As FPM1 and ESDRAM show, the variations in Row Access can be avoided by always closing the row buffer after an access and hiding the sense-amp precharge time during idle moments. This yields the best measured performance, and its performance is much more deterministic (e.g. FPM1 yields the same Row Access independent of benchmark). Note that in studies with a 4MB L2 cache, some benchmarks executing with an optimistic strategy showed very high row-buffer hit rates and had Row Access components that were near-zero (see Figure 13); however, this simply serves to illustrate the behavior when the bulk of the requests reaching the DRAM system are compulsory cache misses.
Comparing the 128-byte results to the previous experiment, we see that when one considers current technology (128-bit buses), there is little variation from application to application in the average memory access time. The two components that vary, Row Access and Bus Transmission, contribute little to the total latency, being overshadowed by long memory-access pipelines that exploit page mode. However, moving to wider buses decreases the column accesses per request, and, as a result, the row access, which is much larger than column access to begin with, becomes significant. With fewer column accesses per request, we are less able to hide bus transmission time, and this component becomes more noticeable as well.
The Effect of Limited MSHRs
As mentioned in section 4.1, the measurements presented so f The results in Figure 15 show the individual benchmarks for DRDRAM alone. Obviously, we expect little variation from four to sixteen MSHRs because this exceeds the capabilities of single-bus designs-nonetheless, it acts as a reasonable sanity-check.
As the graphs show, there is on average a 1% difference in execution time between a system with a single MSHR and a system with enough MSHRs to fully occupy a DRAM architecture's abilities. We measured a maximum difference of roughly 5% (shown in the DRDRAM results). We conclude that our MSHR-based limitation of concurrency in the DRAM system introduces no significant performance degradation. This is not to say that concurrency in the memory system is not beneficial, however: we look more closely at the effects of memory-system concurrency in several follow-on studies that suggest concurrency is better exploited at the DRAM-system level than the DRAM-architecture level [7, 8].
Critical-Word Latencies
The average access time numbers shown in Figure 11 represent average end-to-end latency: e.g., for a read they represent the time from the start of the DRAM request to the moment the last word in the requested block reaches the level-2 cache. This is somewhat misleading because it is widely held that the true limiter to performance is the critical-word latency.
Critical-word latencies are shown in Figure 16 for most of the DRAM architectures, at the highest CPU speed. The figure shows that time-to-critical-word is significantly lower than the end-to-end latency, as expected. At great expense, the end-to-end latency can be improved by widening the bus, thereby making the end-to-end latency equal to the critical-word latency. This is shown in Figure 12 (described earlier).
Note that doing so yields latencies similar to the critical-word latencies in Figure 16; in short, there is no significant latency argument for widening the bus. To reduce latency, one must speed up the bus, speed up the DRAM core, improve the hit ratio in the DRAM row buffers, or redesign the DRAM interface.
It is interesting to note that the SLDRAM and Rambus designs excel in their critical-word latencies: though SDRAM and ESDRAM win in end-to-end latency, they are rigid in their access ordering. Parts like Rambus and SLDRAM are like the interleaved FPM and EDO organizations in that they allow the memory controller to request the components of a large block in arbitrary order. Thus, the Rambus parts allow easy critical-word-first ordering, whereas burst-mode DRAMs do not. However, as one can see by looking at Figures 16 and 10 side-by-side, the total execution time seems to correlate more with the end-to-end latency than the critical-word latency; e.g., if total execution time scaled with critical-word latency, we would expect SLDRAM results to be faster than ESDRAM (which they are not), and we would expect SDRAM results to be 10-20% slower than ESDRAM, SLDRAM, RDRAM, and DRDRAM (which they are not). We would expect the ranking from fastest system to slowest to be
when, in fact, the order (for both PERL and GCC, at both medium and high CPU speeds) is 
The fact that, in these cases, the total execution time correlates better with end-to-end latency than with critical-word latency simply suggests that, on average, these benchmarks tend to use a significant portion of each L2 cache line.
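The difference between critical-word-first ordering and a fixed burst order can be sketched as follows. This is an illustrative model rather than the simulator's logic: the wraparound ordering is assumed as one common way to request the critical word first, the fixed order is assumed to be plain address order, and all function names are hypothetical.

```python
def critical_word_first_order(num_words, critical):
    """Request order that starts at the critical word and wraps around
    the block (an assumed, commonly used ordering)."""
    return [(critical + i) % num_words for i in range(num_words)]


def fixed_burst_order(num_words):
    """Burst order assumed here to be simple address order."""
    return list(range(num_words))


def transfers_until_critical(order, critical):
    """How many transfers complete before the CPU has the critical word."""
    return order.index(critical) + 1


# An eight-transfer block whose critical word is word 5.
cwf = critical_word_first_order(8, 5)
burst = fixed_burst_order(8)
cwf_wait = transfers_until_critical(cwf, 5)
burst_wait = transfers_until_critical(burst, 5)
```

Both orderings deliver the full block in the same number of transfers, so end-to-end latency is unchanged; only the time to the critical word differs.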
Cost-Performance Considerations
The organizations are equal in their capacity: all but DDR and the interleaved examples use eight 64Mbit DRAMs. The FPM3 organization uses 32 64Mbit DRAMs, and the EDO2 organization uses sixteen.
However, the cost of each system is very different. Cost is a criterion in DRAM selection that may be as important as performance. Each of these DRAM technologies carries a different price, and these prices are dynamic, based on factors including number of suppliers, sales volume, die area premium, and speed yield.
In the narro . Alternatively, by ganging together several Rambus Channels, one can achieve better performance at the same cost.
Accordingly, Rambus parts typically carry a stiff price premium (roughly 3x at the time of this writing, despite less than a 20% area premium) but significantly less than the 8x disparity in the number of chips required to achieve the same performance.
Using the Collective Row Buffers in Lieu of an L2 Cache
Associated with each DRAM core is a set of sense amps that can latch data; this amounts to a cache of This is also seen in the decreased hit rates relative to hit rates with 1MB and 4MB L2 caches (next figure). Figure 19 presents the variations in hit rates for the row-buffer caches of different DRAM architectures.
Hit rate does not include the effect of hits that are due to multiple requests to satisfy one L2 cacheline: these results are for the ideal buses. We present results for two sets of benchmarks, including applications from SPEC and Etch suites. As mentioned later, the Etch applications are included because they tend to have larger footprints than SPEC.
The results show that memory requests frequently hit the row buffers; hit rates range from 2-97%, with a mean of 40%. Hit rates increase with increasing L2 cache size (because the DRAM traffic is increasingly compulsory misses, which tend to be sequential) and decrease as the L2 cache disappears (because the writeback L2 does a good job of filtering out writes, as well as the fact that more noncompulsory misses will hit the DRAM with the L2 cache gone). As shown in our previous study [9], there is a significant change in hit rate when writes are included in the address stream: including write traffic tends to decrease the row-buffer hit-rate for those DRAMs with less SRAM storage. Writebacks tend to purge useful data from the smaller row-buffer caches; thus the Rambus, SLDRAM, and ESDRAM parts perform better than the others.
This effect suggests that when writebacks happen, they do so without much locality: the cachelines that are written back tend to be to DRAM pages that have not been accessed recently. This is expected behavior.
Note that a designer can play with the ordering of address bits to maximize the row-buffer hits. A similar technique is used in interleaved memory systems to obtain the highest bandwidth.
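The effect of address-bit ordering on row-buffer hits can be illustrated with a small sketch. It assumes an open-page policy with one row buffer per bank; the two mappings, the geometry (2KB rows, two banks, four rows per bank), the example address stream, and all function names are hypothetical and do not describe any of the DRAM parts studied.

```python
def bank_from_low_bits(page, num_banks, rows_per_bank):
    """Consecutive DRAM pages fall in different banks."""
    return page % num_banks, page // num_banks


def bank_from_high_bits(page, num_banks, rows_per_bank):
    """Consecutive DRAM pages fall in the same bank, different rows."""
    return page // rows_per_bank, page % rows_per_bank


def row_buffer_hit_rate(addresses, mapping, row_bytes=2048,
                        num_banks=2, rows_per_bank=4):
    """Fraction of accesses that find their row already open."""
    open_rows = {}  # bank -> row currently held in that bank's row buffer
    hits = 0
    total = 0
    for addr in addresses:
        bank, row = mapping(addr // row_bytes, num_banks, rows_per_bank)
        if open_rows.get(bank) == row:
            hits += 1
        else:
            open_rows[bank] = row  # miss: open the new row in this bank
        total += 1
    return hits / total if total else 0.0


# A stream that alternates between two adjacent 2KB pages.
stream = [base + 64 * i for i in range(4) for base in (0, 2048)]
low = row_buffer_hit_rate(stream, bank_from_low_bits)
high = row_buffer_hit_rate(stream, bank_from_high_bits)
```

For this particular stream, placing the bank bits low keeps both pages open in separate banks, while placing them high makes the two pages evict each other from a single row buffer. Other access patterns can favor a different ordering.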
Trace-Driven Simulations
We also investigated the effect of using trace-driven simulation to measure memory latency. We simulated the same benchmarks using SimpleScalar's in-order mode with single-issue. Clearly, in-order execution cannot yield the same degree of overlap as out-of-order execution, but we did see virtually identical average access times compared to out-of-order execution, for both 128-bit and 128-byte buses. Because SPEC has been criticized as being not representative of real-world applications, we also used University of read or write request (requests are often cacheline-sized, and the cache width is typically greater than the bus width). This is similar to the performance optimization of placing multiple DRAMs in parallel to achieve a bus-width datapath: this optimization works because the bus width is typically greater than an individual DRAM's transfer width. We have seen that each of the DRAM architectures studied takes advantage of internal interleaving and page mode to differing degrees of success. However, as the studies show, we will soon hit the limit of these benefits: the limiting factors are now the speed of the bus and, to a lesser degree, the speed of the DRAM core. To improve performance further, we must explore other avenues. This graph demonstrates that when the row is accessed in the future, it is most often accessed in the very near future. Our conclusion is that the previously-referenced row has a high hit rate, and it is likely to be referenced within a short period of time if it is referenced again at all. A number of proven techniques exist to exploit this behavior, such as victim caching, set associative row buffers, etc.
ACKNOWLEDGMENTS
This study gre 
