I. INTRODUCTION
The performance gap between the processor and its main memory has been growing over the past two decades [1]. Density and die size are the figures of merit for main memory manufacturers. Optimizing for these figures of merit places a physical limit on the latency of the main memory array due to the parasitics [2]. These limitations hold memory latency scaling to roughly 7%, while processor performance has been scaling at roughly 50%. This growing performance disparity between the processor core and its main memory is termed the "memory gap".
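The compounding effect of the two scaling rates can be illustrated with a short calculation. This is a minimal sketch that assumes both percentages are annual rates that compound year over year; the text does not state the period, and the function name `relative_gap` is introduced here only for illustration.

```python
def relative_gap(years, cpu_rate=0.50, mem_rate=0.07):
    """Ratio of processor improvement to memory-latency improvement.

    Assumes (not stated in the text) that both rates are annual and compound.
    """
    return ((1.0 + cpu_rate) / (1.0 + mem_rate)) ** years


if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        print(f"after {years:2d} years the gap has grown {relative_gap(years):.1f}x")
```

Under this assumption the gap widens by a factor of about 1.4 each year, which is why even modest differences in scaling rate become dominant over a decade.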
Processor manufacturers have made several architectural changes that enable computer performance to scale with Moore's Law (double the performance every two years). Multiple cores, increased cache levels, multiple threads, and speculative accesses have made memory stalls almost transparent to the computer user [3]. Main memory manufacturers increase their density per unit area through longer bitlines, longer wordlines, decreased unit cell size, and feature size scaling [4]. They alleviate bandwidth limitations by using DRAM pre-fetch. Unfortunately, pre-fetch architectures did not begin taking hold until 2000 [5], which places memory bandwidth scaling decades behind processor bandwidth scaling.
Proximity communication is an input/output (I/O) technology that uses capacitors to electrically connect two chips [6]. This chip-to-chip communication technique can substantially increase the memory bandwidth without impacting the power consumption [7]. This work develops a memory architecture that utilizes proximity communication to substantially increase bandwidth while reducing power consumption. This is achieved by allowing a single DRAM chip to provide a full cache line of memory (64 bytes).
II. PROXIMITY COMMUNICATION
Capacitively coupled proximity communication is a chip-to-chip interface technology that uses the top level of metal on an integrated circuit to form the parallel plates of a capacitor. Two chips are placed face to face and their top levels of metal are allowed to come within close proximity (1 µm to 20 µm) of each other without touching. This arrangement creates a parallel plate capacitor.
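The parallel plate arrangement can be checked against the ideal capacitor formula C = ε0·εr·A/d. The sketch below is illustrative only: it ignores fringing fields, treats the relative permittivity of the gap as a free parameter (the text does not give one), and uses the 10 pF/mm² and 25 fF figures quoted in the Advantages subsection to estimate channel density. The function names are hypothetical.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m


def plate_capacitance(area_m2, gap_m, eps_r=1.0):
    """Ideal parallel-plate capacitance in farads (no fringing fields).

    eps_r is an assumed relative permittivity of the material in the gap.
    """
    return EPS0 * eps_r * area_m2 / gap_m


def channels_per_mm2(density_f_per_mm2, per_channel_f):
    """Channels that fit in 1 mm^2 if each channel needs per_channel_f farads."""
    return density_f_per_mm2 / per_channel_f


if __name__ == "__main__":
    mm2 = 1e-6  # 1 mm^2 expressed in m^2
    for gap_um in (1, 20):
        c = plate_capacitance(mm2, gap_um * 1e-6)
        print(f"gap {gap_um:2d} um, air: {c * 1e12:.2f} pF per mm^2")
    print(f"channels per mm^2 at 10 pF/mm^2 and 25 fF: "
          f"{channels_per_mm2(10e-12, 25e-15):.0f}")
```

With an air gap of 1 µm the ideal formula gives a little under 9 pF/mm², so the quoted 10 pF/mm² is consistent with a gap material whose relative permittivity is slightly above one.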
A. Advantages
Proximity communication significantly reduces the parasitics in the transmission channel, which increases bandwidth and lowers power relative to other chip-to-chip interconnects. Fig. 1 depicts a cross sectional view of two chips using proximity communication as the I/O interface.
978-1-4244-6575-0/10/$26.00 ©2010 IEEE
The removal of off chip wires allows for a [...] to be delivered to the transmission channel. [...] over the metal pads is not opened, as [...] applications, which allows the electrostatic protection circuitry to be removed, and [...] die resistive termination is superfluous [8].
The increase in I/O density is another advantage of proximity communication. The capacitance [...] capacitors is at least 10 pF/mm² (with a 1 µm [...] channels per mm² is possible when each channel uses 25 fF. The configuration creates a [...] for scaling the transmission channel below 25 fF.
Placing multiple die into a single package requires the complicated wire-bonding technologies used for chip-to-chip interconnects [9]. Proximity communication allows chips to be simply glued in place, which increases the ease of testability. System in package (SiP) configurations can be tested, and defective chips easily replaced, while using proximity communication.
B. Challenges
Chip misalignment is a major challenge associated with the development of proximity communication. Researchers at Sun Microsystems were able to develop a novel solution to this problem [10]. Through the development of electronic sensors, which could be incorporated into the same silicon substrate as the transmit and receive circuits, it was possible to determine the misalignment of the two chips. Electrical steering circuits were developed that allowed the transmit data to be driven to multiple receiver pads to electrically realign the transmission channel.
III. DRAM TRENDS
Incorporating proximity communication into a DRAM architecture without understanding the DRAM market will result in a product that does not meet the needs of current memory and computer systems.
A. Effect of Price Decline and Scaling
The performance differential between microprocessors and the main memory system is referred to as the memory gap and is often misunderstood. The memory gap measures the microprocessor's instructions per second and the main memory's access latency. The confusion occurs when these two figures of merit are blindly related. DRAM manufacturers focus the majority of their innovations on the process technology that allows for an increase in density. The reason for this is the innate price per bit decline of DRAM.
DRAM manufacturers are forced to focus on density scaling over access latency or I/O bandwidth, due to the historic 36% price per bit decline. Putting this into perspective, if two gigabits of memory chip costs $2.00 today, then four gigabits would cost $1.64 in two years. Density scaling in main memory chips follows Moore's Law. The doubling of the number of transistors in main memory chips every two years (or √2 every year) is used to increase the density of the die. Microprocessors use the extra transistors to increase the number of instructions that can be completed each second.
Main memory manufacturers have the ability to arbitrarily set the latency and bandwidth of the memory chip. Chips designed this way sacrifice power and die size, which leaves them unable to compete in the main memory market, where large densities are required [11]. Instead, these products find their place in varying applications that do not require large density.
Reducing the minimum feature size of the components in the memory array achieves the required density scaling. The reduction in feature size increases the parasitics associated with the array and places a physical limit on the bandwidth. [...] parasitics have placed a limit on [...] current memory chips to 133 MHz [...] generation of main memory starts w[...] MHz, and then transitions to 167 MHz [...] achieve a generational approach t[...] shows this bandwidth trend in main [...]er, has allowed off chip bandwidth [...]000, while the core frequency does [...].
Array pre-fetch allows the DRAM to sustain a larger off-chip bandwidth. Array pre-fetch refers to accessing all the bits of a burst from the array at once and serializing the data in the data path. Main memory chips have been operating with four, eight, or sixteen data pins over the past three generations (DDR, DDR2, DDR3), with eight data pins being most common. The maximum bandwidth that can be achieved with DDR3 chips is 12.8 Gbps (8 data pins, pre-fetch of 8, at 200 MHz). Increasing chip speeds above 12.8 Gbps requires an increase in pre-fetch (to 16) or an increase in data pins, due to the column access limit.
B. Memory Channel Bandwidth Limitation
Current computer systems use a 64-bit wide data bus to communicate between the main memory and the microprocessor. Stub series terminated logic (SSTL) connections are used in computer system memory channels due to the ease of memory upgrades. Stub series terminated logic refers to terminating electrical signals at each memory module with a resistive pull up device that prevents transmission line reflections from interfering with transmitted data on the shared memory channel.
The resistive termination network, along with module loading, places a bandwidth limit on the memory channel. Server applications require a substantially larger density (or number of main memory modules) than personal computers.
Registered, fully buffered, and load reduced DIMMs were developed for server applications to increase the number of DIMMs per memory channel. These innovations have a cost and power premium associated with them.
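Two of the numeric claims in Section III-A can be reproduced with simple arithmetic. The sketch below is an illustration, not the authors' method: it assumes that peak chip bandwidth is the product of data pins, pre-fetch depth, and array frequency, and that the 36% price per bit decline is an annual rate that compounds (an assumption that happens to reproduce the $1.64 figure). The pre-fetch depths of 2 and 4 for DDR and DDR2 are the standard values for those generations and are not stated in the text.

```python
def peak_bandwidth_gbps(data_pins, prefetch, core_mhz):
    """Peak chip bandwidth in Gbps: pins * pre-fetch depth * array frequency."""
    return data_pins * prefetch * core_mhz / 1000.0


def future_price(price_today, gbits_today, gbits_future, years,
                 annual_decline=0.36):
    """Price of a larger chip after `years`, assuming the per-bit price
    falls by `annual_decline` each year (assumed to compound annually)."""
    per_gbit_today = price_today / gbits_today
    return per_gbit_today * (1.0 - annual_decline) ** years * gbits_future


if __name__ == "__main__":
    # x8 chips at a 200 MHz array frequency across three generations.
    for name, prefetch in (("DDR", 2), ("DDR2", 4), ("DDR3", 8)):
        print(f"{name:5s} x8: {peak_bandwidth_gbps(8, prefetch, 200):5.1f} Gbps")
    # Going beyond DDR3 at the same array frequency: deeper pre-fetch or more pins.
    print(f"pre-fetch 16, x8: {peak_bandwidth_gbps(8, 16, 200):.1f} Gbps")
    print(f"pre-fetch 8, x16: {peak_bandwidth_gbps(16, 8, 200):.1f} Gbps")
    print(f"4 Gb in two years: ${future_price(2.00, 2, 4, 2):.2f}")
```

Because the array frequency is pinned near 200 MHz, the only remaining terms in the product are pre-fetch depth and pin count, which is the motivation for the wide I/O approach of the following section.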
IV. X64 DRAM ARCHITECTURE
A 4 Gb DRAM architecture utilizing proximity communication was developed that is realizable with existing technology and meets 2012 ITRS predictions [12] . Challenges associated with incorporating proximity communication into DRAM were characterized and several innovations were developed that alleviated these challenges. A novel global I/O routing structure was discussed that promises to increase the number of data signals that can be read and written to a memory array. The slice architecture was developed to increase the modularity of memory systems.
A. Moving the Pads
Moving the communication channel to the edge of the DRAM chip creates several interesting challenges when performing an architectural feasibility study. The bank structure used in this research alleviates the initial challenges. Once the communication channel is moved to the edge of the die, additional circuitry is required to buffer the signals into the memory chip. Limiting the number of rows per bank creates a "short" bank that reduces global data and command signals, eliminating the need for additional buffers.
The inexpensive process technology of DRAM chips utilizes 2 to 3 layers of metal above the memory capacitor. This places an intrinsic limit on the number of global I/O tracks over each bank. Because of this, the half-bank structure used in this proposal has 64k columns and 8k rows, and must decode the 64k columns into eight 8k pages. An x64 DRAM chip operating with a pre-fetch of eight requires 512 bits to be accessed at once. Accessing 512 bits from one bank requires the use of half-banks to reduce the total metal usage. Each half-bank supplies 256 bits of data. This allows the global I/O tracks to be spread across the chip, limiting metal usage for the global I/O bus. The challenges of buffering the signals into the array and limited routing channels are circumvented by using the proposed bank and segmented page structures. Fig. 3 shows the block diagram of the 4 Gb DRAM die. The half-bank structure can be thought of as dividing each bank horizontally and firing a wordline in each half-bank.
B. Local I/O Routing
The x16 and x32 proximity configurations will not require any significant innovation, but the x64 configuration will require additional innovation for local I/O routing. The large number of global I/O tracks (256 per half-bank) requires 32 data signals from each 256 kb memory array. Moving 32 data signals from the bitline sense amplifiers to the global I/O track is a major challenge due to the limited routing space above the bitline sense amplifiers. Increasing the page size would alleviate this challenge but would also increase the power consumption. Instead, these signals can be routed to the top and bottom of each 256 kb memory segment, as seen in Fig. 4. An additional avenue for architectural research consists of routing the data signals through adjacent inactive bitlines (above and below).
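The access-width figures for the x64 configuration follow from a few multiplications and divisions, collected below as a sanity check. The constant names are introduced here for illustration. The count of arrays contributing per half-bank is an inference from the 256 tracks and 32 signals per array quoted in the text, not a number the text states.

```python
DATA_PINS = 64              # x64 configuration
PREFETCH = 8                # bits fetched per data pin per array access
HALF_BANKS_PER_BANK = 2
SIGNALS_PER_ARRAY = 32      # data signals from each 256 kb memory array
COLUMNS = 64 * 1024         # columns per half-bank
PAGE_BITS = 8 * 1024        # standard 8k page

bits_per_access = DATA_PINS * PREFETCH                      # 512 bits
bytes_per_access = bits_per_access // 8                     # one 64-byte cache line
bits_per_half_bank = bits_per_access // HALF_BANKS_PER_BANK # 256 global I/O tracks
# Inferred: number of 256 kb arrays that must each supply 32 signals.
arrays_per_half_bank = bits_per_half_bank // SIGNALS_PER_ARRAY
pages_per_row = COLUMNS // PAGE_BITS                        # eight 8k pages

if __name__ == "__main__":
    print(f"bits per access:      {bits_per_access}")
    print(f"bytes per access:     {bytes_per_access}")
    print(f"bits per half-bank:   {bits_per_half_bank}")
    print(f"arrays per half-bank: {arrays_per_half_bank} (inferred)")
    print(f"pages per row:        {pages_per_row}")
```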

C. New Global I/O Routing
As mentioned above, the memory array operates at a maximum frequency of 200 MHz due to the parasitics of the memory array. The global I/O route does not share the parasitics of the memory array and can operate at a higher frequency. Insertion muxes and additional latches can be used to keep the global I/O bus fully occupied with data. A column path protocol can be developed that allows multiple banks to be accessed and their data stored in the local I/O channels. Busy, ready, and data insertion requests can be used to allow the global I/O routing to operate at a higher frequency, while the memory array remains operating at frequencies below 200 MHz.
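The idea of keeping a faster global I/O bus occupied from several slower banks can be illustrated with a toy model. The sketch below is entirely hypothetical and is not the column path protocol of this work: it assumes each bank has a single local latch, that a bank with an empty latch refills it once per array cycle, and that a round-robin inserter drains one ready latch per bus slot. The parameter `bus_ratio` (bus slots per array cycle) is an illustrative name.

```python
def bus_utilization(num_banks, bus_ratio, array_cycles):
    """Fraction of global-bus slots that carry data in a toy insertion model.

    Assumptions (all hypothetical): one latch per bank, one refill per bank
    per array cycle, round-robin insertion of one latch per bus slot.
    """
    latch_ready = [False] * num_banks  # True when a latch holds undelivered data
    next_bank = 0                      # round-robin pointer for the inserter
    used_slots = 0
    total_slots = 0
    for _cycle in range(array_cycles):
        # Array side: every bank whose latch is empty (not busy) fetches new data.
        for bank in range(num_banks):
            if not latch_ready[bank]:
                latch_ready[bank] = True
        # Bus side: bus_ratio slots fit into one array cycle.
        for _slot in range(bus_ratio):
            total_slots += 1
            for offset in range(num_banks):
                bank = (next_bank + offset) % num_banks
                if latch_ready[bank]:
                    latch_ready[bank] = False        # data inserted onto the bus
                    next_bank = (bank + 1) % num_banks
                    used_slots += 1
                    break
    return used_slots / total_slots


if __name__ == "__main__":
    for banks in (1, 2, 4, 8):
        print(f"{banks} bank(s), bus at 4x array rate: "
              f"{bus_utilization(banks, 4, 100):.2f} utilization")
```

In this model the bus is only fully occupied once the number of banks supplying data is at least the ratio between bus and array frequency, which is the intuition behind accessing multiple banks and buffering their data locally.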
D. Modular Architecture
Main memory DRAM chips use a large number of repeated structures and symmetry. The proposed modular architecture speeds up design verification. Each module contains all circuitry required for one data pin to read and write. Combining many of these modules creates the entire chip. Data, command, and clock modules were developed during this research.
The first advantage of this architecture is that the time required for chip verification can be reduced significantly. Due to the sheer number of transistors on a modern DRAM chip, simulating an extracted netlist can take several weeks to complete. Using smaller modular blocks to fully verify the data, command, and clock paths within the chip will reduce the time required to perform validation on the extracted netlist.
A distributed page and bank structure was developed to enable the possibility of using proximity communication with 32 data pins. The architecture utilized the standard main memory page size specification of 8k, which allows the array power consumption to remain competitive with current and future DRAM architectures.
Reaching 64 data pins required architectural changes that would not increase the manufacturing cost compared to current DRAM architectures. Three levels of metal above the memory capacitor are projected for DRAM densities greater than 2 Gb. The wide I/O architecture allows the metal stack to remain at two levels of metal above the memory capacitor without increasing the chip size. The reduction of projected metal usage enables a significant cost advantage when compared to other DRAM architectures. A new column structure was introduced that will aid in the development of a proximity communication enabled DRAM architecture that utilizes ≥ 64 data pins.
The wide I/O DRAM architecture utilizing proximity communication enables several technological advantages over existing DRAM architectures. Fixing the page size and increasing the I/O count through the wide I/O DRAM architecture allows for an energy efficient DRAM architecture. Fig. 5 shows the relative energy per bit estimates for DRAM chips utilizing proximity communication. Current commodity DRAM chips have poor energy efficiency because they use only 64 data bits of the 8k bits accessed per page. The wide I/O architecture increases the number of bits used per page access to 512, which significantly increases the energy efficiency of DRAM chips.
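The page utilization argument can be made concrete with a short calculation. This sketch is not the energy model behind Fig. 5: it assumes the energy to open an 8k page is fixed and is the only component considered, so that the activation energy charged to each delivered bit scales inversely with the number of bits used per page access. The function names are illustrative.

```python
PAGE_BITS = 8 * 1024  # standard 8k page


def page_utilization(bits_used, page_bits=PAGE_BITS):
    """Fraction of the opened page that is actually delivered off chip."""
    return bits_used / page_bits


def relative_activation_energy_per_bit(bits_used, baseline_bits=64):
    """Activation energy per delivered bit relative to a 64-bit access,
    assuming a fixed energy cost per page activation (an assumption)."""
    return baseline_bits / bits_used


if __name__ == "__main__":
    for bits in (64, 512):
        print(f"{bits:3d} bits used: {page_utilization(bits):.2%} of page, "
              f"{relative_activation_energy_per_bit(bits):.3f}x activation "
              f"energy per bit")
```

Under this simplification, using 512 bits instead of 64 from each opened page spreads the activation cost over eight times as many bits.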
Although it is possible to only access one proximity communication DRAM chip to supply the full 64 bytes of data to the memory controller, it is also possible to increase the amount of data accessed by increasing the memory channel width. The projected bandwidth trend shown in Fig. 6 clearly shows the advantage of using proximity communication DRAM over current and future DRAM technologies.
