



































Organizational Design Trade-Offs at the DRAM, Memory Bus,
and Memory Controller Level: Initial Results
Vinodh Cuppu and Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park
{ramvinod,blj}@eng.umd.edu
University of Maryland Systems & Computer Architecture Group Technical Report UMD-SCA-1999-2, November 1999
ABSTRACT
This paper presents initial results in a study of organization-
level parameters associated with the design of the primary
memory system—the DRAM system beneath the lowest level
of the cache hierarchy. These parameters are orthogonal to
architecture-level parameters such as DRAM core speed, bus
arbitration protocol, etc. and include bus width, bus speed,
number of independent channels, degree of banking, read
burst width, write burst width, etc; this study presents the
effective cross-product of varying each of these parameters
independently. The simulator is based on SimpleScalar 3.0a
and models a fast (simulated as 2GHz), highly aggressive
out-of-order uniprocessor. The interface to the primary mem-
ory system is fully non-blocking, supporting up to 32 out-
standing misses at both the level-1 and level-2 caches.
Our simulations show the following: (a) the choice of pri-
mary memory-system organization is critical, as it can affect
total execution time by a factor of 3x for a constant CPU
organization and DRAM speed; (b) the most important fac-
tors in the performance of the primary memory system are the
channel speed (bus cycle time) and the granularity of data
access, the burst width—each of these can independently
affect total execution time by a factor of 2x; (c) for small
bursts, multiple narrow independent channels to the memory
system exhibit better performance than a single wide chan-
nel; for large bursts, channel cycle time is the most important
factor; (d) the degree of DRAM multi-banking plays a sec-
ondary role in its impact on total execution time; (e) the opti-
mal burst width tends to be high (large enough to fetch an L2
cache block in 2 bursts) and scales with the block size of the
level 2 cache; and (f) the memory queue sizes can be
extremely large, due to the bursty nature of references to the
primary memory system and the promotion of reads ahead of
writes. Among other things, we conclude that the scheduling
of the memory bus is the primary bottleneck and that it should
be the focus of further study.
1 INTRODUCTION
The expanding performance gap between processor speeds
and primary memory speeds has prompted a number of stud-
ies in DRAM systems. These studies range from memory-
controller design [13, 12, 16, 4, 7] to integrating the DRAM
core with the processor core for improved memory band-
width and power consumption [3, 14, 10, 6, 9]. Additionally,
our recent DRAM study compares the performance of several
contemporary DRAM architectures, including FPM, EDO,
Synchronous, Enhanced Synchronous, SLDRAM, Rambus, and
Direct Rambus [5]; one of its primary conclusions was that
present bus architectures are becoming a bottleneck.
As a result, we have been studying bus and memory-controller
organizations and have developed a simulation framework for
placing disparate DRAM architectures on the same footing. The
model defines a continuum of design choices that includes most
contemporary DRAM architectures such as Rambus, Direct Rambus,
PC-100/133/266 SDRAM, etc. Using this framework, we have
investigated the organizational parameters of memory systems
such as bus width, bus speed, number of independent channels,
logical organization of channels, degree of banking, degree of
interleaving, burst-mode vs. packetized access, read burst
width, write burst width, split-transaction vs. pipelined
buses, symmetric vs. asymmetric read/write request shapes, etc.
We label these “organizational” parameters because they are
design choices that can be made independently of the
architecture of the DRAM core.
In this paper, we present the simulation framework and an
initial study of different organization-level parameters,
including bus speed, bus width, number of independent channels,
degree of banking, and read/write burst width; despite the
large range covered in this study, it really only begins to
explore the space of memory-system organizations. We model a
high-performance uniprocessor system (2GHz out-of-order
superscalar CPU with lockup-free L1 and L2 caches [11]) and use
the more memory-intensive applications in the SPEC’95 integer
suite. In this study we ask and answer the following questions
(clearly, our results and conclusions are dependent on our
system configuration and choice of benchmarks):
• How important are the design choices made at the
organization level of the primary memory system?
Holding constant the CPU architecture, the L1/L2 cache
organizations, the DRAM architecture, and the DRAM
speed, the choices made at the organization level can
affect total execution time by a factor of 3x. The choices
of memory-system organization can affect the memory
overhead by a factor of 10x, but much of this overhead is
hidden behind program execution. Clearly, the choices















































• What are the most significant organizational parameters
that affect performance of the primary memory system?
Holding other factors constant, the read/write burst width1
(the granularity of data access) can be responsible for
differences in total execution time of 3x; the cycle time of
the memory channel can be responsible for a factor of 2x;
the number of independent channels connecting the CPU
to the DRAMs can be responsible for a performance
change of 25%. Other parameters are responsible for
differences in total execution time of less than 15%.
• How does the degree of banking affect performance?
Surprisingly, the degree of banking has little impact on
total execution time. While the memory-system overhead
can decrease 10-20% by increasing the number of banks
per channel beyond 1, much of the improvement is hidden
behind CPU execution. The net result is a 5%
improvement in total execution time.
• What are the performance trade-offs between the number
of independent channels, the channel width, the channel
speed, and the total system bandwidth (number of
channels × channel width × channel speed)?
As one might guess, the total per-channel bandwidth (bus
width × bus speed) is often more important than the
choice of either bus width or bus speed, because it takes
the same amount of time to send 128 bits down a 16-bit,
800MHz channel as a 128-bit, 100MHz channel.
However, there are counterexamples. Whereas, for a given
burst size, performance is not particularly sensitive to
bandwidth, it is very sensitive to channel width or speed:
for a given burst size, doubling the memory system’s
bandwidth can occasionally increase execution time,
while changing the number of channels, the speed of a
channel, or the width of a channel (and at the same time
holding bandwidth constant) can often reduce total
execution time by a significant amount.
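The bandwidth equivalence quoted in the last answer above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `transfer_time_ns` is hypothetical, and it assumes one transfer per bus cycle with no latency, arbitration, or turnaround overhead.

```python
def transfer_time_ns(bits: int, bus_width_bits: int, bus_mhz: float) -> float:
    """Time to push `bits` across one channel, ignoring latency and turnaround."""
    cycles = -(-bits // bus_width_bits)   # ceiling division: whole bus cycles
    cycle_ns = 1000.0 / bus_mhz           # duration of one bus cycle in ns
    return cycles * cycle_ns


# The two channels named in the text: same per-channel bandwidth, same time.
narrow_fast = transfer_time_ns(128, bus_width_bits=16, bus_mhz=800)
wide_slow = transfer_time_ns(128, bus_width_bits=128, bus_mhz=100)
print(narrow_fast, wide_slow)  # 10.0 10.0
```

This idealized view is exactly what the counterexamples in the text break: once requests contend for the bus, equal bandwidth no longer implies equal execution time.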
We also make the following observations. First, and most
importantly, there is a very complex tradeoff between the
optimal burst size and the optimal system bandwidth configu-
ration (number of channels, channel width, channel speed).
The optimal burst size is wide enough to fetch an L2 cache
block in two requests (e.g. 64-byte burst for a 128-byte L2
block size). Given a fixed burst size, the optimal choice of
system bandwidth configuration changes dramatically from
large burst sizes to small burst sizes: for example, what is
good for large bursts (few independent channels) is the worst
choice for small bursts, and what is good for small bursts
(many independent channels) is the worst choice for large
bursts. Because the interactions between system configura-
tion and burst size can affect system performance by up to a
factor of three, it is critically important to design the entire
memory system to fit together—no one component of the
memory system can be optimized in isolation. Given that the
optimal burst width scales with the level 2 cache block size,
even the organization of the caches must play a role in the
design of the primary memory system.
Second, the large degrees of internal banking in many of
today’s high-performance DRAMs (e.g. 16 banks in Direct
Rambus DRAM), while perhaps necessary from an implementation
standpoint, might be unnecessary from a performance
standpoint. For the benchmarks studied, relatively low
degrees of internal banking, in the range of 2x to 4x, are all
that is necessary to achieve good performance.
Last, we did not place any restrictions on the size of the
memory controller’s request queue. Given that the combination
of an 8-byte burst and a 128-byte cache block produces
16 requests per L2 read miss, a system with 32 MSHRs can
have up to 512 outstanding requests in the memory system.
For medium and large burst sizes, we saw relatively small
queue sizes (up to tens of entries, down to 1 or 0 on average).
By contrast, for small burst sizes, we frequently saw queue
lengths in the tens of thousands, which is due to the fact that
write requests can be stalled for arbitrarily long periods of
time if a string of read requests appears. Future work will
look at the effects of a finite queue size.
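The 16-requests-per-miss and 512-outstanding-requests figures follow directly from the block size, burst size, and MSHR count. A minimal sketch of that arithmetic is given here; the function names are hypothetical, and the result counts read bursts only (queued writes, which the text identifies as the source of the very long queues, are not modeled).

```python
def requests_per_miss(l2_block_bytes: int, burst_bytes: int) -> int:
    """Number of burst-sized requests needed to fill one L2 block."""
    return -(-l2_block_bytes // burst_bytes)  # ceiling division


def max_outstanding_reads(mshrs: int, l2_block_bytes: int, burst_bytes: int) -> int:
    """Read bursts in flight if every MSHR has a full miss pending."""
    return mshrs * requests_per_miss(l2_block_bytes, burst_bytes)


# 128-byte L2 block and 32 MSHRs, as in the simulated system.
for burst in (8, 16, 32, 64, 128):
    print(burst, requests_per_miss(128, burst), max_outstanding_reads(32, 128, burst))
```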
As previously mentioned, one of the primary results from
our prior work was that present bus architectures are becoming
a bottleneck. This study comes to the same conclusion.
Our observations that small bursts require multiple independent
channels for good performance suggest the interleaving
of small bursts on a single channel to be expensive. Our
observations that the memory queue lengths are enormous for
small bursts suggest that interleaving small bursts creates
traffic jams. Our observations that channel speed can be more
important than channel bandwidth suggest that two different
configurations with equal bandwidth do not necessarily
exploit that bandwidth with the same degree of efficiency.
These results all point to bus scheduling as the bottleneck.
Future work will be to investigate this more closely.
2 SIMULATION FRAMEWORK &
EXPERIMENTAL METHODOLOGY
2.1 High-Performance Memory Systems Primer,
Briefly
High-performance memory systems are not structured as if
each DRAM is connected directly to the CPU; there are usually
several layers of memory controllers that serve to reduce
the amount of time spent on an address or data bus. Typically,
there is a memory controller ASIC that is integrated onto the
DIMM itself that performs the RAS and CAS commands; what is
usually called “the memory controller” is only
responsible for scheduling requests to the DIMMs over the
memory channel; the controller does not usually control the
DRAMs directly. This enables a memory system to have several
independent banks that can be active at the same time,
enabling relatively full utilization of the data bus, even though
the time it takes to get data out of the DRAM core is far
longer than the bus transmission time. If there were only one
bank per memory channel, there could be no such overlap,
and the fastest rate at which requests could be serviced would
be limited by the time to pull data from the DRAM core. For more
information, see [1, 8, 15].
1. Note that this term does not imply that the model is a burst-mode model.
The term refers to the granularity of data access; for example, Direct
Rambus has a packetized DRAM interface, as opposed to burst-mode
DRAMs such as SDRAM or ESDRAM. However, its granularity of































2.2 Channels and Banks
The fundamental idea in this work is to define a model for the
primary memory system that represents most DRAM organi-
zations in existence, including burst-mode organizations such
as SDRAM and packetized organizations such as Rambus
(these being the two primary competing commercial stan-
dards), as well as almost everything else in between.
Several example memory-system organizations that can
be represented by our model are illustrated in Figure 1. A sin-
gle DRAM device can handle one request at a time and pro-
duces a certain number of bits per request: this is the device-
level transfer width. DRAM devices are ganged together into
banks, each of which is independent and can service a differ-
ent request than all other banks at any given moment. The
bank is the smallest unit of granularity represented in this
model. Whether a “bank” is a single physical device or a sub-
component within a single physical device need not be speci-
fied. A single bank has a transfer width at least as wide as the
data bus. Each channel is a split-transaction address-bus/data-
bus pair and is connected to potentially multiple banks, each
of which is operated independently of the others; using multi-
ple banks per channel supports concurrent transactions at the
channel level. The CPU connects via an on-board memory
controller to potentially multiple channels, each of which is
operated independently of the others; using multiple channels
supports concurrent transactions at the DRAM subsystem
level. The bit mapping from address to channel/bank/row
attempts to best exploit the available concurrency in the phys-
ical organization by assigning the lowest-order bits (which
change the most frequently) to the channel number, the next
bits to the bank number, etc. Counters in our simulation
results show that the requests are divided evenly across the
channels in a system and across the banks in each channel.
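The address mapping described above (lowest-order bits select the channel, the next bits select the bank) can be sketched as follows. This is an illustration, not the simulator's code: the function name `map_address` is hypothetical, and interleaving at burst granularity is an assumption, since the text does not state which low-order bits are skipped.

```python
from collections import Counter


def map_address(addr: int, burst_bytes: int, channels: int, banks_per_channel: int):
    """Split a physical address into (channel, bank, row) indices.

    Assumption: the offset within a burst is dropped first, then the
    lowest-order remaining bits pick the channel and the next bits the bank.
    """
    block = addr // burst_bytes
    channel = block % channels
    block //= channels
    bank = block % banks_per_channel
    row = block // banks_per_channel
    return channel, bank, row


# Sequential bursts spread evenly over 4 channels x 2 banks.
hits = Counter(map_address(a, 8, 4, 2)[:2] for a in range(0, 8 * 64, 8))
print(sorted(hits.items()))
```

With a sequential access stream, every (channel, bank) pair receives the same number of requests, which matches the even division reported by the simulation counters.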
This is a very simple organization that accounts for most
existing DRAM architectures: clearly, it can emulate organi-
zations such as PC-XXX SDRAM, but it can also emulate
Rambus-style organizations by increasing the degree of bank-
ing and scaling the channel width and speed, as Rambus
devices use normal DRAM cores and are banked internally.
For the studies presented in this paper, we did not explore
all possible combinations of channel speed and channel width
to obtain the same bandwidth. For example, as shown in Figure 2,
there is a 5% performance range between a 1-byte bus
running at 800 MHz vs. a 2-byte bus at 400MHz vs. a 4-byte
bus at 200MHz vs. an 8-byte bus at 100MHz, with the
highest-frequency bus yielding the best performance. To reduce the
number of simulations run for this paper we simulated the
following combinations: 1x200, 1x400, 1x800, 2x800, 4x800,
8x800 (bandwidths from 200MB/s to 6400MB/s).
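As a quick check on the numbers above, per-channel bandwidth is simply bus width times bus speed. The sketch below assumes the `WxS` notation means width in bytes by speed in MHz, which is consistent with the stated 200MB/s to 6400MB/s range; the names are hypothetical.

```python
# (bus width in bytes, bus speed in MHz)
SIMULATED = [(1, 200), (1, 400), (1, 800), (2, 800), (4, 800), (8, 800)]
FIGURE_2 = [(1, 800), (2, 400), (4, 200), (8, 100)]  # equal-bandwidth group


def channel_bandwidth_mb_s(width_bytes: int, speed_mhz: int) -> int:
    """Peak per-channel bandwidth in MB/s (one transfer per bus cycle)."""
    return width_bytes * speed_mhz


print([channel_bandwidth_mb_s(w, s) for w, s in SIMULATED])
print({channel_bandwidth_mb_s(w, s) for w, s in FIGURE_2})
```

The first line spans 200 to 6400 MB/s, and the four Figure 2 configurations all collapse to the same 800 MB/s, which is why their 5% spread is attributed to bus speed rather than bandwidth.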
2.3 Burst Timing
For the DRAM core speed, we use parameters from the latest
SDRAM, which has reasonably fast timing specifications and
is common to PC-100 and Direct Rambus designs. This gives
us the read and write bus and bank occupancies shown in
Figure 3, which are similar to those reported in the literature [
8, 15]. The figure presents numbers for burst widths equal to
the data bus width, twice the bus width, and four times the
bus width. A burst is the smallest atomic transaction size:
read and write requests are processed as an integral number
of bursts, and the bursts of different requests may be
multiplexed in time over the same channel. We model the bus
turnaround time as a constant number of bus cycles; for this study
we used 1 cycle.
Note that this interface model covers burst-mode DRAM
architectures such as SDRAM, ESDRAM, and burst-mode
SLDRAM, and it also covers packetized DRAM architectures
such as Rambus, Direct Rambus, and packetized
SLDRAM. The only difference with moving to a packetized
interface is that the address bus packet scales with the data
bus packet in the length of time it occupies the address bus.
Since the two are scheduled together, there is no additional
overhead imposed by this scheme.
2.4 Burst Ordering
If a burst is smaller than the level-2 cache line size, then there
are a number of options for the ordering of the burst-sized
blocks that make up the request. In this study, the block
containing the critical word is always fetched first and takes
priority over any other block in the queue, unless that block also
Figure 1: Channels and banks. This study looks at varying such
parameters as the number of independent channels and the number of




































[Figure 1 labels: two independent channels; four independent channels; banking degrees of 1, 2, 4, ...]
Figure 2: Performance as a function of bus width and bus speed.
Though there is up to a 5% difference between different combinations of
bus width and bus speed that yield the same bandwidth, we cut the number
of combinations simulated to reduce simulation time.









































































































































contains a critical word. Write requests are always given low-
est priority and tend to stack up in the queue until all the reads
drain from the queue.
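The ordering policy of Section 2.4 amounts to a three-level priority queue. The sketch below is one possible rendering, not the simulator's implementation: the class name `BurstQueue`, the priority constants, and the first-come-first-served tie-break within a priority level are all assumptions made for illustration.

```python
import heapq
from itertools import count

# Lower value = served first. The names are hypothetical.
CRITICAL_READ, READ, WRITE = 0, 1, 2


class BurstQueue:
    """Bursts ordered by class; arrival order breaks ties (an assumption)."""

    def __init__(self):
        self._heap = []
        self._arrival = count()

    def push(self, kind: int, label: str) -> None:
        heapq.heappush(self._heap, (kind, next(self._arrival), label))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


q = BurstQueue()
q.push(WRITE, "write-0")
q.push(READ, "miss-0/burst-1")
q.push(CRITICAL_READ, "miss-0/critical")
q.push(CRITICAL_READ, "miss-1/critical")
order = [q.pop() for _ in range(len(q))]
print(order)
```

Because writes sit at the lowest level, a steady supply of reads keeps them queued indefinitely, which is the starvation effect the text describes.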
2.5 Handling Concurrency
With multiple channels in a system, it is easy to see how con-
currency can be exploited. However, within a single channel,
provided that there is sufficient banking to support it, there
can also be support for concurrency. Figure 4 illustrates sev-
eral of the ways back-to-back requests are overlapped in time,
sharing the common resources. Back-to-back reads can be
pipelined, provided they require different banks. Back-to-
back read/write pairs can be similarly pipelined, but it is also
possible to nestle writes “inside of” reads, as shown in Fig-
ures 4(b) and (c), provided the conditions support it. This last
feature is only possible because of the asymmetric nature of
read/write requests. Note that, though reads and writes are
asymmetric, they look less so as the burst width increases and
the time that the data bus is held grows large. This will
become important: it is more efficient to interleave symmetric
requests, because there is less wasted dead time on the bus.
2.6 CPU Model
To obtain accurate timing of memory requests in a dynamically
reordered instruction stream, we integrated our code
into SimpleScalar 3.0a, an execution-driven simulator of an
aggressive out-of-order processor [2]. Our simulated processor
is eight-way superscalar; its simulated cycle time is 0.5ns
(2GHz clock). Its L1 caches are split 64KB/64KB; both are
2-way set associative; both have 64-byte linesizes. Its L2
cache is unified 1MB, 4-way set associative, writeback, has a
128-byte linesize and a 10-cycle access time. The L1 and L2
caches are both lockup-free, and both allow up to 32
outstanding requests at a time. For our lockup-free cache model,
a load instruction that misses the L2 cache is blocked until it
obtains an MSHR, and it holds the MSHR only until the critical
burst of data returns (remember that the atomic unit of
transfer between the CPU and DRAM system is a burst). This
scheme frees up the MSHR relatively quickly, allowing
subsequent load instructions that miss the L2 cache to commence
as soon as possible. This scheme is relatively expensive to
implement, as it assumes that the cache tags can be checked
for the subsequently arriving blocks without disturbing cache
traffic. We model this optimization to put the highest possible
pressure on the physical memory system: it represents the
highest rate at which the processor can generate concurrent
memory accesses given the number of available MSHRs.
2.7 Timing Calculations
Much of the DRAM access time is overlapped with instruction
execution. To determine the degree of overlap, we run a
second simulation with perfect primary memory (no overhead).
Similar to the methodology in [5], we partition the
total application execution time into three components: TP,
TM, and TO, which correspond to time spent processing, time
spent stalling for memory, and the portion of time spent in the
memory system that is successfully overlapped with processor
execution. In this paper, time spent “processing” includes
all activity above the primary memory system, i.e. it contains
all processor execution time and L1 and L2 cache activity. Let
TREAL be the total execution time for the realistic simulation;
Figure 3: Bus and bank occupancies for 100MHz channel. Each DRAM request requires the address bus, the data bus, and whatever bank it is destined
for. The shape of these request blocks is dependent on the burst widths. Figures are shown for burst-widths equal to (a) 1x the bus width, (b) 2x the bus width,






















































































Figure 4: Concurrency within a single channel. If two concurrent reads
require different banks, they can be pipelined across the address and data
bus as shown in (a). Writes can be nestled inside of reads, provided the bus

























































let TPERF be the execution time with a perfect DRAM sys-
tem; let TDRAM be the total time spent in the DRAM system.
Then we have the following:
• TP = TREAL – TDRAM
• TM = TREAL – TPERF
• TO = TPERF + TDRAM – TREAL
The relationships between the different time parameters are
illustrated in Figure 5.
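The three definitions above can be applied mechanically to the outputs of the two simulation runs. The sketch below uses made-up cycle counts purely to show the bookkeeping; the function name and the numbers are hypothetical and are not results from this study.

```python
def breakdown(t_real: float, t_perf: float, t_dram: float):
    """Return (TP, TM, TO) from the definitions in Section 2.7."""
    t_p = t_real - t_dram            # processing, not overlapped with DRAM
    t_m = t_real - t_perf            # stalled waiting for memory
    t_o = t_perf + t_dram - t_real   # DRAM time hidden behind execution
    return t_p, t_m, t_o


# Hypothetical inputs: realistic run, perfect-memory run, time in DRAM system.
t_p, t_m, t_o = breakdown(t_real=100.0, t_perf=60.0, t_dram=70.0)
print(t_p, t_m, t_o)  # 30.0 40.0 30.0
```

Two identities follow from the definitions and serve as sanity checks: TP + TM + TO = TREAL, and TM + TO = TDRAM.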
3 EXPERIMENTAL RESULTS
The simulations in this study cover most of the space defined
by the cross-product of these variables:
• {1, 2, 4} independent channels
• {1, 2, 4} banks per channel
• {8, 16, 32, 64, 128} byte burst widths
• {1, 2, 4, 8} byte data-bus widths
• {200, 400, 800} MHz bus speeds (equivalent to 100, 200,
400 MHz dual data rate)
• {gcc, perl} from SPEC’95 known to have relatively large
memory footprints
As described earlier, we did not simulate every combination
of bus width and bus speed. The simulated L1/L2 cache line
sizes are 64/128 bytes, and, for a few configurations, we also
simulated L1/L2 linesizes of 32/64 bytes.
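To give a sense of the size of this design space, the cross-product can be enumerated directly. The sketch below counts the full space and then a reduced space restricted to the six width/speed combinations listed in Section 2.2. Treating that restriction as the exact set of runs is an assumption; the text says only that most of the space was covered.

```python
from itertools import product

channels = (1, 2, 4)
banks = (1, 2, 4)
burst_bytes = (8, 16, 32, 64, 128)
bus_width_bytes = (1, 2, 4, 8)
bus_mhz = (200, 400, 800)
benchmarks = ("gcc", "perl")

full = list(product(channels, banks, burst_bytes, bus_width_bytes, bus_mhz, benchmarks))

# Width/speed pairs from Section 2.2 (assumed to be the only ones simulated).
simulated_bus = {(1, 200), (1, 400), (1, 800), (2, 800), (4, 800), (8, 800)}
reduced = [cfg for cfg in full if (cfg[3], cfg[4]) in simulated_bus]

print(len(full), len(reduced))  # 1080 540
```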
The following sections each present an analysis of a
slightly different slice through the data. The unit of perfor-
mance is cycles per instruction: a direct measurement of exe-
cution time, given a fixed cycle time and the length of each
program. Note that for some system configurations (but not
all), total execution time is further broken down into the com-
ponents described in Section 2.7.
3.1 The Effects of Burst Width and Bandwidth
We begin by presenting in Figure 6 the total execution time as
a function of both burst width and memory-system bandwidth.
On the x-axis is the system bandwidth, which is total
channels × channel width × channel speed. For each bandwidth
value, there are a number of configurations that represent
different combinations of channels/width/speed. For
each configuration, there are five stacked bars representing
the total execution time for burst widths of 8, 16, 32, 64, and
128 bytes.
Among other things, the graphs show that for a given
bandwidth configuration, the choice of burst size can affect
execution time significantly, e.g., by a factor of just under 3x
for gcc and just under 2x for perl. This clearly shows the
importance of selecting an appropriate burst size. Though the
optimal burst width depends on bandwidth and channel speed
(optimal burst width is around 32 bytes for 200MHz channels,
and around 64 bytes for 400 and 800MHz channels), it
tends to be relatively large in general: for most configurations,
it is 64 bytes. Figure 7 shows that it is also dependent
on cache block size. The data are for an L2 cache block of size
64 bytes, and the graph shows the optimal burst width to be
32 bytes, i.e., the burst should be large enough to fetch a
level-2 cache block in two requests.
In Figure 6, if one can ignore the noise, there is a gradual
curve that slopes down as bandwidth increases, showing the
effects of increased bandwidth on execution time. The slope
reflects a 5–10% improvement in execution time for every
doubling of memory-system bandwidth, which is far less
significant than the effect that burst width has on performance.
Within a fixed bandwidth class, the choice of bus speed and
number of channels is significant, but not as significant as
doubling or halving the bandwidth. For example, at 800MB/s,
moving from a quad 200MHz 1-byte bus organization
to a dual 400MHz 1-byte bus organization to a single
800MHz 1-byte bus organization yields a smaller performance
difference than moving to a 400MB/s or 1.6GB/s
organization.
In summary, burst width is an extremely significant
parameter that overshadows both raw bandwidth and the
details of how you choose your bandwidth (number of channels,
channel width, channel speed).
3.2 Optimal Burst Width vs. Channel Organization
Next, we look more closely at optimal burst size in Figures 8
and 9. In each figure there are several graphs, each of which
represents data for a constant burst width. Each graph depicts
the total execution time (and for some bars, a break-down as
well) for constant bitwidth organizations. Note that the data
points at each bitwidth may have different bandwidths. At
each data point, there are three vertical bars, corresponding to
degrees of multibanking of 1, 2, and 4 banks per channel.
The graphs illustrate that there are three distinct regions of
behavior, corresponding to small burst sizes, medium burst
sizes, and large burst sizes. At small burst sizes (8 bytes), the
parameter that influences performance the most is the number
of independent channels: all 1-channel configurations have
roughly the same performance; all 2-channel configurations
have roughly the same performance; all 4-channel configurations
have roughly the same performance, this despite the
configuration’s bandwidth. For a 32-bit datapath, the three
configurations that are composed of four 8-bit channels all outperform







TDRAM = time spent
TPERF = execution time
TREAL
TM = TREAL – TPERF
TP = TREAL – TDRAM
TO = TPERF + TDRAM – TREAL
Figure 5: Definitions for execution-time breakdowns. The results of
several simulations are used to show time spent in the memory system vs.






the 1x32-bit 800MHz configuration by 25%. This happens
even though the worse-performing configurations have 2x
and 4x the bandwidth of the better-performing configurations;
e.g., the 4x8-bit 200MHz system has a bandwidth of
800MB/s and outperforms the 1x32-bit 800MHz system
(which has 3.2GB/s bandwidth) by 25%. This suggests that




















Figure 6: Bandwidth and burst width.































































































































































































Figure 7: Optimal burst width for 32/64-byte L1/L2 line sizes. At each data point, there are three histograms representing the execution time as a function of
the degree of banking. From left to right, the vertical bars show performance for 1, 2, and 4 banks per channel. There is no data for 128-byte burst, because such
a burst size does not make sense for a 64-byte cache block. While the data in Figure 6 suggest the optimal burst width to be 64 bytes, this shows that the optimal
burst size is 32 bytes when the L2 cache block is 64 bytes. Our conclusion is that the optimal burst width scales with the L2 cache block size: it is large enough to fetch
an L2 cache block in two requests.






































Figure 8: Burst width and channel organization tradeoffs — GCC.
[Figure 8 panels recovered: gcc-burst-064, gcc-burst-016. X-axis: Total Datapath Bitwidth (bits = Channels * BusWidth), ticks from 8 to 256.]


































































































































































































Figure 9: Burst width and channel organization tradeoffs — PERL.
[Figure 9 panels recovered: perl-burst-064, perl-burst-016. X-axis: Total Datapath Bitwidth (bits = Channels * BusWidth), ticks from 8 to 256.]













































































































































































































































further dividing the bitpath would yield further improve-
ments—perhaps 8 4-bit channels would continue to yield
improved performance. However, simply changing the burst
width yields better results.
At medium burst sizes (32 bytes), there is little difference
to be seen across all configurations. It is clear that the config-
urations with slower busses and narrower busses are likely to
do slightly worse, but the difference between the best and
worst configurations is roughly 25–30%.
At large burst sizes (128 bytes), it is no longer the case
that more channels yields better performance; in fact, increas-
ing the number of channels always degrades performance.
For example, again at the 32-bit data point, the three configu-
rations at 800MHz (all of which have identical bandwidth)
show the effect of going from 4x8-bit to 2x16-bit to 1x32-bit
configurations: in contrast to the behavior seen at small burst
sizes, increasing the number of independent channels wors-
ens performance. The most significant influence on perfor-
mance for large burst sizes comes from the channel speed:
note, for example, that the worst performance comes from
200MHz channels, which have roughly identical perfor-
mance regardless of the bandwidth represented. The best per-
formance comes from 800MHz channels, all of which
perform within 10% of each other. At this burst width, simply
increasing bandwidth makes little difference in execution
time, provided the channel speed remains the same.
In summary, there is a delicate trade-off between the opti-
mal burst size and the channel configuration: optimal choices
in channel configuration (the number of channels, the speed
of each channel, and the width of each channel) change dra-
matically depending on the choice of burst width. The opti-
mal burst width appears to be somewhere between medium
and large (64 bytes per burst), and we showed earlier that this
parameter seems to scale with cache block size. Therefore,
there are no blanket statements that cover memory-system
design: each system must be optimized by taking into account
all aspects of the design—no one component can be opti-
mized in isolation.
3.3 A Closer Look at Banking and Burst Width
The graphs in Figure 10 illustrate the degree of memory over-
lap for several configurations. Some interesting things to
note: first, with a single channel (top left column), gcc man-
ages to overlap a fair amount of memory activity with CPU
execution; as the number of independent channels increases,
the system becomes much more streamlined, lowering the
memory overhead rapidly. However, it also becomes more
difficult for the system to overlap memory activity with CPU
execution, as shown in the very small overlap components.
Second, the perl benchmark does not have this problem—its
behavior is such that it can always overlap a significant com-
ponent of its memory activity with CPU execution. Clearly,
this behavior is benchmark-dependent. Last, note the behav-
ior of the 8-bit configuration (the bottom row of graphs). As
we have pointed out before, as bus widths become narrow,
large burst sizes tend to perform worse—this graph demon-
strates that the problem occurs even earlier. By increasing the
burst width from 16 bytes to 32 bytes, the memory overhead
is almost always increased; often, this increase is hidden by
CPU execution, but it is clear that there are two factors at
work: small bursts making it more difficult to use the memory
system, and large bursts that occupy the busses for such a
long duration that the average memory access is stalled waiting
for resources.
The graphs show that the degree of banking has a noticeable
impact on the total memory-system time, even though it
might not translate to much in terms of total execution time.
For instance, at 16-bit busses (the top two rows of graphs),
each doubling of the number of banks decreases the overhead
of the memory system by 10-20%. This ultimately translates
to a net savings of around 5% in execution time due to the
degree of overlap with CPU execution time.
4 CONCLUSIONS
We have found that the organization of the memory system is
extremely important and can affect the total execution time of the
application by a factor of 3x. Unfortunately, there are no
choices that are universally good: the interaction of the
parameters is such that no component can be optimized
individually. The only rules of thumb are that the optimal burst
size scales with the L2 blocksize, and that faster channels are
usually better.
As previously mentioned, one of the primary results from
our prior work was that present bus architectures are
becoming a bottleneck. This study comes to the same
conclusion. The fact that small bursts require multiple
independent channels for good performance suggests that the
interleaving of small bursts on a single channel is
expensive. Observations of the run-time lengths of the memory
queues, which are enormous for small bursts, suggest that
interleaving small bursts can create bus traffic jams. The
fact that channel speed can be more important than channel
bandwidth suggests that two different configurations with
equal bandwidth do not necessarily exploit that bandwidth
with the same degree of efficiency.
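One way to see how two channels with equal peak bandwidth can
differ in efficiency is sketched below. The sketch assumes
that each request pays a fixed overhead counted in bus cycles
(for example, turnaround or command cycles); the overhead
value, the two channel configurations, and the function names
are hypothetical and are not taken from the simulations
reported here. If the overhead were instead fixed in
nanoseconds, the two channels would behave identically in
this model.

```python
def request_time_ns(burst_bytes, bus_width_bytes, bus_mhz, overhead_cycles):
    """Bus time (ns) for one request: data cycles plus a fixed
    per-request overhead, both counted in bus cycles (assumption)."""
    data_cycles = -(-burst_bytes // bus_width_bytes)  # ceiling division
    return (data_cycles + overhead_cycles) * 1000.0 / bus_mhz


def peak_bandwidth_mb_s(bus_width_bytes, bus_mhz):
    """Peak bandwidth in MB/s, assuming one transfer per bus cycle."""
    return bus_width_bytes * bus_mhz


OVERHEAD_CYCLES = 2  # hypothetical per-request overhead
configs = {
    "narrow_fast": (1, 800),  # 1 byte wide at 800 MHz
    "wide_slow": (4, 200),    # 4 bytes wide at 200 MHz
}

for name, (width, mhz) in configs.items():
    t = request_time_ns(16, width, mhz, OVERHEAD_CYCLES)
    print(name, peak_bandwidth_mb_s(width, mhz), "MB/s peak,", t, "ns per 16-byte burst")
```

Both channels peak at 800 MB/s, yet under these assumptions a
16-byte burst takes 22.5 ns on the fast narrow channel and
30 ns on the slow wide one, because each overhead cycle is
four times longer on the slower bus.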
These results point to bus scheduling as the primary
overhead. Possible explanations include intermingling writes
with reads, yielding turnaround overhead and odd-shaped
interleaved patterns (due to the asymmetric nature of reads
and writes). Small bursts cause major backups in the memory
system, because the time to transfer a burst is on the order
of the bus turnaround overhead, and because the asymmetric
nature of read requests vs. write requests makes it
inefficient to interleave the two. For larger bursts, the
turnaround time is amortized, and interleaving reads with
writes is not much different than interleaving read pairs or
write pairs, because the time to hold the data bus is
extremely long.
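The amortization argument can be sketched with a toy bus
model that charges a turnaround penalty whenever the bus
switches direction between a read and a write. The penalty of
two cycles, the request sequences, and the function name are
hypothetical choices for illustration; the model ignores the
asymmetric shapes of reads and writes discussed above.

```python
def bus_cycles(ops, burst_cycles, turnaround_cycles):
    """Total bus cycles for a sequence of 'R'/'W' bursts, adding a
    turnaround penalty each time the direction changes."""
    total = 0
    prev = None
    for op in ops:
        if prev is not None and op != prev:
            total += turnaround_cycles
        total += burst_cycles
        prev = op
    return total


interleaved = ["R", "W"] * 4        # direction changes 7 times
grouped = ["R"] * 4 + ["W"] * 4     # direction changes once
TURNAROUND = 2                      # hypothetical penalty in bus cycles

for burst_cycles in (4, 64):
    a = bus_cycles(interleaved, burst_cycles, TURNAROUND)
    b = bus_cycles(grouped, burst_cycles, TURNAROUND)
    print(burst_cycles, a, b, round(a / b, 3))
```

With 4-cycle bursts, the interleaved order costs 46 cycles
against 34 for the grouped order; with 64-cycle bursts the
totals are 526 and 514, so the ordering barely matters once
the burst is long relative to the turnaround.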
More directions for future study include the use of
symmetric read/write shapes to simplify bus scheduling, the
effects of cache organizations (since block size has such a
dramatic influence), the effects of turnaround time (maybe
two separate data busses would do better), as well as the use
of realistic queue sizes and conventional MSHR designs.
REFERENCES
[1] W. R. Bryg, K. K. Chan, and N. S. Fiduccia. “A high-performance, low-cost
multiprocessor bus for workstations and midrange servers.” The Hewlett-Packard
Journal, vol. 47, no. 1, February 1996.
[2] D. Burger and T. M. Austin. “The SimpleScalar tool set, version 2.0.” Tech. Rep.
CS-1342, University of Wisconsin-Madison, June 1997.
[3] D. Burger, J. R. Goodman, and A. Kagi. “Memory bandwidth limitations of
future microprocessors.” In Proc. 23rd Annual International Symposium on
Computer Architecture (ISCA’96), Philadelphia PA, May 1996.
[4] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, et al. “Impulse:
Building a smarter memory controller.” In Proc. Fifth International Symposium on
High Performance Computer Architecture (HPCA’99), Orlando FL, January
1999, pp. 70–79.
[5] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. “A performance comparison of
contemporary DRAM architectures.” In Proc. 26th Annual International
Symposium on Computer Architecture (ISCA’99), Atlanta GA, May 1999, pp. 222–233.
[6] R. Fromm, S. Perissakis, N. Cardwell, C. Kozyrakis, B. McGaughy,
D. Patterson, T. Anderson, and K. Yelick. “The energy efficiency of IRAM
architectures.” In Proc. 24th Annual International Symposium on Computer
Architecture (ISCA’97), Denver CO, June 1997, pp. 327–337.
[7] S. I. Hong, S. A. McKee, M. H. Salinas, R. H. Klenke, J. H. Aylor, and W. A.
Wulf. “Access order and effective bandwidth for streams on a Direct Rambus
memory.” In Proc. Fifth International Symposium on High Performance
Computer Architecture (HPCA’99), Orlando FL, January 1999, pp. 80–89.
[8] T. R. Hotchkiss, N. D. Marschke, and R. M. McColsky. “A new memory system
design for commercial and technical computing products.” The Hewlett-Packard
Journal, vol. 47, no. 1, February 1996.
[9] K. Inoue, K. Kai, and K. Murakami. “Dynamically variable line-size cache
exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs.” In Proc.
Fifth International Symposium on High Performance Computer Architecture
(HPCA’99), Orlando FL, January 1999, pp. 218–222.
[10] C. Kozyrakis, et al. “Scalable processors in the billion-transistor era: IRAM.”
IEEE Computer, vol. 30, no. 9, pp. 75–78, September 1997.
[11] D. Kroft. “Lockup-free instruction fetch/prefetch cache organization.” In Proc.
8th Annual International Symposium on Computer Architecture (ISCA’81),
Minneapolis MN, May 1981.
[12] S. McKee, A. Aluwihare, B. Clark, R. Klenke, T. Landon, C. Oliver, M. Salinas,
A. Szymkowiak, K. Wright, W. Wulf, and J. Aylor. “Design and evaluation of
dynamic access ordering hardware.” In Proc. International Conference on
Supercomputing, Philadelphia PA, May 1996.
[13] S. A. McKee and W. A. Wulf. “Access ordering and memory-conscious cache
utilization.” In Proc. International Symposium on High Performance Computer
Architecture (HPCA’95), Raleigh NC, January 1995, pp. 253–262.
[14] A. Saulsbury, F. Pong, and A. Nowatzyk. “Missing the memory wall: The case
for processor/memory integration.” In Proc. 23rd Annual International
Symposium on Computer Architecture (ISCA’96), Philadelphia PA, May 1996,
pp. 90–101.
[15] R. C. Schumann. “Design of the 21174 memory controller for DIGITAL personal
workstations.” Digital Technical Journal, vol. 9, no. 2, pp. 57–70, 1997.
[16] M. Swanson, L. Stoller, and J. Carter. “Increasing TLB reach using superpages
backed by shadow memory.” In Proc. 25th Annual International Symposium on
Computer Architecture (ISCA’98), Barcelona, Spain, June 1998, pp. 204–213.

Figure 10: Banking degree and burst width. Each graph shows three histograms for each burst width: the three bars correspond to banking degrees of 1, 2,
and 4 banks per channel.
[Figure not reproduced. Panel titles recovered from the extraction, each plotted against Burst Width (bytes) at 8, 16, 32, 64, and 128:
perl-channel-2-buswidth-2-mhz-800, perl-channel-4-buswidth-2-mhz-800,
gcc-channel-2-buswidth-2-mhz-800, gcc-channel-4-buswidth-2-mhz-800,
perl-channel-2-buswidth-1-mhz-200, perl-channel-4-buswidth-1-mhz-200.]
