Main Memory with Proximity Communication: A Wide I/O DRAM Architecture by Harvard, Qawi et al.
Boise State University
ScholarWorks
Electrical and Computer Engineering Faculty
Publications and Presentations
Department of Electrical and Computer
Engineering
4-16-2010
Main Memory with Proximity Communication: A
Wide I/O DRAM Architecture
Qawi Harvard
Boise State University
R. Jacob Baker
Boise State University
Robert Drost
VLSI Research Group
©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes
or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works
must be obtained from the IEEE. DOI: 10.1109/WMED.2010.5453754
Main Memory with Proximity Communication 
A Wide I/O DRAM Architecture 
 
Qawi Harvard and R. Jacob Baker 
Department of Electrical and Computer Engineering 
Boise State University 
Boise, ID, U.S.A. 
Robert Drost 
VLSI Research Group 
Sun Microsystems Laboratory 
Menlo Park, CA, U.S.A. 
 
 
Abstract — The bandwidth and power consumption of dynamic 
random access memory (DRAM), used as the main memory of a 
computer system, impacts computer execution rates. DRAM 
manufacturers focus on density increases, due to the innate price 
per bit decline of main memory, while processor manufacturers 
continually focus on boosting performance. This leads to a 
performance gap between the two technologies. Proximity 
communication promises to increase the off/on chip bandwidth of 
DRAM products while reducing the power consumption of the 
main memory system. The design of a memory system employing 
4 Gb DRAM chips with a 64-bit wide communication bus using 
proximity communication is proposed. Technological roadblocks 
are analyzed and novel solutions are proposed. The proposed 4 
Gb DRAM architecture can reduce the power consumption of a 
main memory system by 50% while increasing the bandwidth by 
100%. The 4 Gb chip architecture measures 68.88 mm2 and has 
an array efficiency of 59.9%. The estimates are comparable to 
2012 International Technology Roadmap for Semiconductors’ 
(ITRS) estimates of 74 mm2 and 56%, respectively. 
Keywords – DRAM, proximity communication, chip-to-chip, 
server memory, main memory, bandwidth, power consumption. 
I.  INTRODUCTION 
The performance gap between the computer's processor and 
its main memory has been growing over the past two decades 
[1]. Density and die size are the figures of merit for main 
memory manufacturers. Increasing these performance 
measurements places a physical limit on the latency of the main 
memory array due to the parasitics [2]. The limitations keep 
memory latency scaling at roughly 7%, while processor 
performance has been scaling at roughly 50%. This 
performance differential is termed the “memory gap”, and 
refers to the growing performance disparity between the 
processor core and its main memory. 
Processor manufacturers have made several architecture 
changes that enable computer performance to scale with 
Moore's Law (double the performance every two years). 
Multiple cores, increased cache levels, multiple threads, and 
speculative accessing, have made memory stalls almost 
transparent to the computer user [3]. Main memory 
manufacturers increase their density per unit area by 
developing longer bitlines, longer wordlines, decreased unit 
cell size, and feature size scaling [4]. Main memory 
manufacturers alleviate bandwidth limitations by using DRAM 
pre-fetch. Unfortunately, the pre-fetch architectures did not 
begin taking hold until 2000 [5]. This places memory 
bandwidth scaling decades behind processor bandwidth 
scaling. 
Proximity communication is an input/output (I/O) 
technology that uses capacitors to electrically connect two 
chips [6]. The off/on chip communication technique has the 
ability to substantially increase the memory bandwidth and not 
impact the power consumption [7]. This work develops a 
memory architecture that utilizes proximity communication to 
substantially increase bandwidth, while reducing power 
consumption.  This is achieved by allowing a single DRAM 
chip to provide a full cache line of memory (64 Bytes).   
II. PROXIMITY COMMUNICATION 
Capacitive coupled proximity communication is a chip-to-
chip interface technology that uses the top level of metal on an 
integrated circuit to form the parallel plates of a capacitor. Two 
chips are placed face to face and their top level of metal is 
allowed to come within close proximity (1 µm – 20 µm) of 
each other without touching. This arrangement creates a 
parallel plate capacitor.  
A. Advantages 
The advantages of proximity communication allow for a 
significant reduction of parasitics in the transmission channel, 
which increases bandwidth and lowers power relative to other 
chip-to-chip interconnects. Fig. 1 depicts a cross sectional view 
of two chips using proximity communication as the I/O 
interface. 
 
Figure 1.  Cross section view of placing two chips face-to-face and within 
close proximity of each other [6]. 
Chip 1 
Chip 2 
Transmit 
Transmit 
Receive 
Receive 
 
 
Chip 1 
 
Chip 2 
978-1-4244-6575-0/10/$26.00 ©2010 IEEE
The removal of off chip wires allows for a 
to be delivered to the transmission channel. 
over the metal pads is not opened, as 
applications, which allows the electrostatic 
protection circuitry to be removed, and for th
die resistive termination is superfluous [8].  
The increase in I/O density is anothe
proximity communication. The capacitance 
capacitors is at least 10 pF/mm2 (with a 1 µm
channels per mm2 is possible when each
channel uses 25 fF. The configuration creates a
for scaling the transmission channel below 25 f
Placing multiple die into a single p
complicated wire-bonding technologies used 
interconnects [9]. Proximity communication a
simply glued in place, which increases the ea
System in package (SiP) configurations can
defective chips easily replaced while u
communication.  
B. Challenges 
Chip misalignment is a major challenge as
development of proximity communication. Re
Microsystems were able to develop a novel
problem [10]. Through the development of el
which could be incorporated into the same sil
transmit and receive circuits, it was possible 
misalignment of the two chips. Electrical steer
developed that allowed the transmit data 
multiple receiver pads to electrically realign 
channel. 
III. DRAM TRENDS 
Incorporated proximity communication 
architecture without understanding the DRA
result in a product that does not meet the 
memory and computer systems. 
A. Effect of Price Decline and Scaling 
The performance differential between mic
the main memory system is referred to as the 
is often misunderstood. The memory gap
microprocessor’s instructions per second 
memory’s access latency. The confusion o
blindly relate these two figures of merit. DRA
focus the majority of their innovations 
technology that allows for an increase in den
for this is due to the innate price per bit decline
DRAM manufacturers are forced to fo
scaling over access latency, or I/O bandwi
historic 36% price per bit decline. Putting this
if two gigabits of memory chip costs $2.00 
gigabits would cost $1.64 in two years. Densit
memory chips follows Moore’s Law. The 
number of transistors in main memory chips 
(or √2 every year) is used to increase the densit
 
fixed impedance 
The passivation 
in wire-bonded 
discharge (ESD) 
is reason the on-
r advantage of 
of parallel plate 
 gap).  400 I/O 
 communication 
 research avenue 
F.  
ackage requires 
for chip-to-chip 
llows chips to be 
se of testability. 
 be tested, and 
sing proximity 
sociated with the 
searchers at Sun 
 solution to this 
ectronic sensors, 
icon substrate as 
to determine the 
ing circuits were 
to be driven to 
the transmission 
into a DRAM 
M market will 
need of current 
roprocessors and 
memory gap and 
 measures the 
and the main 
ccurs when you 
M manufacturers 
on the process 
sity. The reason 
 of DRAM. 
cus on density 
dth, due to the 
 into perspective, 
today, then four 
y scaling in main 
doubling of the 
every two years 
y of the die.  
Figure 2.  Array pre-fetch, of two and highe
to increase at a rate of 26% per year since 20
not scale [5
Microprocessors use the extra 
number of instructions that can b
Main memory manufacturers have 
the latency and bandwidth of the 
sacrifice power and die size, whic
compete in the main memory ma
densities [11]. Instead, these pro
varying applications that do not requ
Reducing the minimum feature 
the memory array achieves the req
reduction in feature size increases th
the array and places a physical lim
parasitics have placed a limit on 
current memory chips to 133 M
generation of main memory starts w
MHz, and then transitions to 167 MH
achieve a generational approach t
shows this bandwidth trend in main 
Array pre-fetch allows the DRA
chip bandwidth. Array pre-fetch re
the latency at once, and serializing
Main memory chips have been op
sixteen data pins over the past three
DDR3), with eight data pins b
maximum bandwidth that can be ac
12.8 Gbps (8 data pins, pre-fetch of 
chip speeds above 12.8 Gbps requi
(to 16) or an increase in data pins,
limit. 
B. Memory Channel Bandwidth Lim
Current computer systems use 
communicate between the m
microprocessor. Series stub termin
used in computer system memory c
memory upgrades. The series stub
terminating electrical signals at eac
resistive pull up device that p
reflections from interfering with tra
memory channel. 
The resistive termination netw
loading, places a bandwidth limit
Server applications require a subs
number of main memory modules)
 
r, has allowed off chip bandwidth 
00, while the core frequency does 
]. 
transistors to increase the 
e completed each second. 
the ability to arbitrarily set 
memory chip. These chips 
h creates their inability to 
rket, which requires large 
ducts find their place in 
ire large density. 
size of the components in 
uired density scaling. The 
e parasitics associated with 
it to the bandwidth. The 
the column bandwidth of 
Hz – 200 MHz. Each 
ith a column access of 133 
z, and then to 200 MHz to 
o chip bandwidth. Fig. 2 
memory chips.  
M to sustain a larger off-
fers to accessing all bits of 
 the data in the data path. 
erating with four, eight, or 
 generations (DDR, DDR2, 
eing most common. The 
hieved with DDR3 chips is 
8, at 200 MHz). Increasing 
res an increase in pre-fetch 
 due to the column access 
itation 
a 64-bit wide data bus to 
ain memory and the 
ated logic connections are 
hannels due to the ease of 
 terminated logic refers to 
h memory module with a 
revents transmission line 
nsmitted data on the shared 
ork, along with module 
 on the memory channel. 
tantially larger density (or 
 then personal computers. 
Registered, fully buffered, and load reduced DIMMs were 
developed for server applications to increase the number of 
DIMMs per memory channel. These innovations have a cost 
and power premium associated with them. 
IV. X64 DRAM ARCHITECTURE 
A 4 Gb DRAM architecture utilizing proximity 
communication was developed that is realizable with existing 
technology and meets 2012 ITRS predictions [12]. Challenges 
associated with incorporating proximity communication into 
DRAM were characterized and several innovations were 
developed that alleviated these challenges.  A novel global I/O 
routing structure was discussed that promises to increase the 
number of data signals that can be read and written to a 
memory array.  The slice architecture was developed to 
increase the modularity of memory systems.   
A. Moving the Pads 
Moving the communication channel to the edge of the 
DRAM chip creates several interesting challenges when 
performing an architectural feasibility study. The bank 
structure used in this research alleviates the initial challenges. 
Once the communication channel is moved to the edge of the 
die additional circuitry is required to buffer the signals into the 
memory chip. Limiting the number of rows per bank creates a 
“short” bank that reduces global data and command signals, 
eliminating the need for additional buffers. 
The inexpensive process technology of DRAM chips 
utilizes 2 – 3 layers of metal above the memory capacitor. This 
places an intrinsic limit to the number of global I/O tracks over 
each bank. Due to this, the half-bank structure used in this 
proposal has 64k columns and 8k rows. This half-bank 
structure must decode the 64k columns into eight 8k pages. A 
by 64 DRAM chip operating with a pre-fetch of eight requires 
512 bits to be accessed at once. Accessing 512 bits from one 
bank requires the use of a half-bank to reduce the total metal 
usage. Each half-bank supplies 256 bits of data. This allows the 
global I/O track to be spread across the chip, limiting metal 
usage for the global I/O bus. The challenges of buffering the 
signals into the array and limited routing channels are 
circumvented by using the proposed bank and segmented page 
structures. Fig. 3 shows the block diagram of the 4 Gb DRAM 
die. The half-bank structure can be thought of as dividing each 
bank horizontally, and firing a wordline in each half-bank. 
 
Figure 3.  A 4 Gb DRAM architecture incorporating proximity 
communication and centralized row and column circuitry. 
B. Local I/O Routing 
 
 
Figure 4.  Space and data mapping of the local input/output routing within a 
half-bank. 
The by 16 and by 32 proximity configurations will not 
require any significant innovation, but the by 64 configuration 
will require additional innovation for local I/O routing. The 
large number of global I/O tracks (256 per half-bank) requires 
32 data signals from each 256 kb memory array. Moving 32 
data signals from the bitline sense amplifiers to the global I/O 
track is a major challenge due to the limited routing space 
above the bitline sense amplifiers. Increasing the page size will 
alleviate this challenge but will also increase the power 
consumption. Instead, these signals can be routed to the top and 
bottom of each 256 kb memory segment, as seen in Fig. 4. An 
additional avenue for architectural research consists of routing 
the data signals through adjacent inactive bitlines (above and 
below). 
 
C. New Global I/O Routing 
As mentioned above the memory array operates at a 
maximum frequency of 200 MHz due to the parasitics of the 
memory array. The global I/O route does not share the 
parasitics of the memory array and can operate at a higher 
frequency. Insertion muxes, and additional latches can be used 
to keep the global I/O bus fully occupied with data. A column 
path protocol can be developed that allows for multiple banks 
to be accessed and data stored in the local I/O channels. Busy, 
ready, and data insertion requests can be used to allow the 
global I/O routing to operate at a higher frequency, while the 
memory array remains operating at frequencies below 200 
MHz.    
D. Modular Architecture 
Main memory DRAM chips use a large number of repeated 
structures and symmetry.  The proposed modular architecture 
speeds up design verification. Each modular architecture 
contains all circuitry required for one data pin to read and 
write. Combining many of these modular structures together 
will create the entire chip. A data, command, and clock 
modular architecture was developed during this research. 
The first advantage of this architecture is that the time 
required for chip verification can be reduced significantly. Due 
to the sheer number of transistors on a modern DRAM chip, 
simulating an extracted netlist can take several weeks to 
complete. Using smaller modular blocks to fully verify the 
data, command, and clock paths within the chip will reduce the 
time required to perform validation on the extracted netlist 
800 µm 
DQ<64:61> Region DQ<60:57> 
Region
DQ<3:0> 
Half-Bank 
BLSA 
BLSA 
32 
256 
32 256 
BL
12.3 mm 
Chip Size = 68.88 mm2 
Array Efficiency = 59.9% 
Proximity Interface 
R
O
W
 
512M Bank 
6.0 mm 
512M Bank 
512M Bank 512M Bank 
512M Bank 512M Bank 
512M Bank 512M Bank 
R
O
W
 
COLUMN COLUMN 
5.6 m
m
 
2.4 m
m
 
because each block is self contained. The second advantage of 
this modular structure is that varying densities of memory chips 
can be easily constructed for varying applications. DRAM 
chips utilizing both proximity communication and this modular 
structure can simply be glued directly over their application 
with the correct density and I/O count.  This has the possibility 
of revolutionizing the way chips access off-chip data. Instead 
of driving data requests away from the central circuitry of an 
integrated circuit to the memory channel, it is possible to 
simply send signals up towards the memory chip. This 
approach provides the exact memory requirement at the exact 
place it is required, reducing the access latency considerably. 
V. SUMMARY 
Developing a wide I/O DRAM architecture that is suitable 
for proximity communication necessitates the communication 
channel to be moved to the side of the DRAM chip. This 
enables a proximity communication DRAM chip with 8 or 16 
data pins. This modification requires limited design changes 
from current DRAM architectures. 
A distributed page and bank structure was developed to 
enable the possibility of using proximity communication with 
32 data pins. The architecture utilized the standard main 
memory page size specification of 8k, which allows the array 
power consumption to remain competitive with current and 
future DRAM architectures.  
Reaching the use of 64 data pins required architectural 
changes that would not increase the manufacturing cost 
compared to current DRAM architectures. Three levels of 
metal above the memory capacitor is the projection for DRAM 
densities greater than 2 Gb. The wide I/O architecture allows 
the metal stack to remain at two levels of metal above the 
memory capacitor without increasing the chip size. The 
reduction of projected metal usage enables a significant cost 
advantage when compared to other DRAM architectures. A 
new column structure was introduced that will aide in the 
development of a proximity communication enabled DRAM 
architecture that utilizes ≥ 64 data pins. 
The wide I/O DRAM architecture utilizing proximity 
communication enables several technological advantages over 
existing DRAM architectures. Fixing the page size and 
increasing the I/O count through the wide I/O DRAM 
architecture allows for an energy efficient DRAM architecture. 
Fig. 5 shows the relative energy per bit estimates for DRAM 
chips utilizing proximity communication.  
 
Figure 5.  Energy per bit comparison. 
 
Figure 6.  Module bandwidth comparison of current and future main memory 
compared to a main memory chips using proximity communication. 
Current commodity DRAM chips have poor energy 
efficiency due to only using 64 data bits of the 8k bits accessed 
per page. The wide I/O architecture increases the number of 
bits accessed per page to 512, which significantly increases the 
energy efficiency of DRAM chips.  
Although it is possible to only access one proximity 
communication DRAM chip to supply the full 64 bytes of data 
to the memory controller, it is also possible to increase the 
amount of data accessed by increasing the memory channel 
width. The projected bandwidth trend shown in Fig. 6 clearly 
shows the advantage of using proximity communication 
DRAM over current and future DRAM technologies.  
REFERENCES 
[1] J. Hennessy, D. Patterson, Computer Architecture A Quantitative 
Approach, 4th ed., Morgan Kaufmann Publishers, San Francisco, 2007. 
ISBN 978-0-12-370490-0 
[2] D. Rhosen, “The evolution of DDR,” VIA Technology Forum, 2005. 
[3] D. Patterson, “Latency lags bandwidth,” Communications of the ACM, 
vol. 47, Issue 10, pp. 71-75, October 2004. 
[4] D. Klein, “The future of memory and sorage: closing the gap,” Microsoft 
WinHEC 2007, May 2007. 
[5] Rambus, “Challenges and solutions for future main memory,” 
http://www.rambus.com/assets/documents/products/future_main_memor
y_whitepaper.pdf, May 2009. 
[6] R. Drost, R. Hopkins, I. Sutherland, “Proximity communication,” 
Proceedings of the IEEE 2003 Custom Integrated Circuits Conference, 
vol. 39, issue 9, pp. 469-472, September 2003. 
[7] Q. Harvard, “Wide I/O DRAM architecture utilizing proximity 
communiation,” Master’s thesis, Boise State University, December 
2009. 
[8] D. Salzman, T. Knight, “Capacitively coupled multichip modules,” 
Multichip Module Conference Proceedings, pp. 487-494, April 1994. 
[9] K. Kilbuck, “Main memory technology direction,” Microsoft WinHEC 
2007, May 2007. 
[10] R. Drost, R. Ho, R. Hopkins, I. Sutherland, “Electronic alignment for 
proximity communication,” IEEE International Solid State Circuits 
Conference, vol. 1, pp. 144-145, February 2004. 
[11] Samsung Semiconductor Inc. Various Datasheets:  
http://www.samsung.com/global/business/semiconductor/productList.do
?fmly_id=690 
[12] International Technology Roadmap for Semiconductor, 2007 Edition, 
http://www.itrs.net/Links/2007ITRS/Home2007.htm, 2007. 
 
200 
175 
150 
125 
100 
75 
50 
25 
2009 
M
od
ul
e 
Ba
nd
w
id
th
 [G
B
/s
]  
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 0 
225 
DDR5 
DDR4 
DDR3 
PxCDRAM x16 
PxCDRAM x32 
PxCDRAM x64 
100
90
80
70
60
50
40
30
20
10
PxCDRAM x64 PxCDRAM x32 PxCDRAM x16 DDR3 
R
el
at
iv
e 
En
er
gy
 [%
] 
 
