Abstract-The demand for capacity and off-chip bandwidth to dynamic random-access memory (DRAM) will continue to grow as we integrate more cores onto a die. However, as the data rate of DRAM has increased, the number of dual in-line memory modules (DIMMs) supported on a multi-drop bus has decreased. Therefore, traditional memory systems are not sufficient to meet both these demands. We propose the DIMM tree architecture for better scalability by connecting the DIMMs as a tree. The DIMM tree architecture is able to grow the number of DIMMs exponentially with each level of latency in the tree. We also propose application of multiband radio-frequency interconnect (MRF-I) to the DIMM tree architecture for even greater scalability and higher throughput. The DIMM tree architecture without MRF-I was able to scale up to 64 DIMMs with only an 8% degradation in throughput over an ideal system. The DIMM tree architecture with MRF-I was able to increase throughput by 68% (up to 200%) on a 64-DIMM system over a 4-DIMM system. Finally, we propose the partitioned DIMM tree, which allows the scaling of a main memory system to a many-DIMM memory system while still maintaining high throughput. The partitioned DIMM tree is able to improve throughput by an average of 19% up to 35% over the DIMM tree with 256 DIMMs on a single channel.
Utilizing Radio-Frequency Interconnect for a Many-DIMM DRAM System

database systems [9] using RAMClouds [21], clouds, and virtualization, the problem has become even worse. As CMPs continue to scale with more and more cores on a chip, we reach a point where overall system performance cannot increase any further due to the limits of the DRAM system. Since the number of concurrent applications and threads increases with the number of cores, the working set size and the DRAM throughput required by the system also increase. The larger working set size will increase the number of page faults, leading to costly transfers from hard disk to memory. The increased throughput requirements will put an even greater strain on the already scarce DRAM bandwidth. Therefore, a DRAM system for future many-core CMPs will require greater capacity and greater throughput. New technologies such as virtualization and clouds, which require an even greater number of applications to run concurrently, increase the memory footprint and the throughput required of the memory system even more. Other technologies such as main memory database systems [9] and RAMClouds [21] also require memory systems with high capacity and high throughput. Main memory database systems [9] store entire databases in main memory for higher throughput and lower latency. RAMClouds [21] are a new kind of main memory database system that aggregates main memory from many systems as a hard disk replacement, while using the hard disk for backup, in order to achieve 100-1000× improvements in throughput and latency. However, a better design would be to build a RAMCloud within a single system by using a many-core CMP with a many-DIMM memory system to reduce the inter-system communication. Unfortunately, current DRAM memory systems are not able to increase throughput while increasing capacity to meet these requirements.
The trend in industry to provide greater throughput in DDRx DRAM has been to increase the DRAM clock and the data rate over each pin. For example, DDR2 has data rates of 400, 533, 667, and 800 Mb/s/pin, and DDR3 has data rates of 800, 1066, 1333, and 1600 Mb/s/pin. However, a faster data rate also decreases the number of DIMMs (and overall capacity) that can be connected together on a multi-drop bus in a conventional DDRx DRAM system. Each drop on a multi-drop bus acts as an impedance discontinuity, causing ringing, a longer delay, and a slower rise time [13]. We have already seen the maximum number of drops fall from eight in DDR2 to four in DDR3. If the trend continues, then supporting higher data rates in future technologies such as DDR4 will require fewer drops on a multi-drop bus and therefore fewer DIMMs.
2156-3357/$31.00 © 2012 IEEE

The other technique used to connect multiple DIMMs is a point-to-point link. Fully buffered DIMM (FB-DIMM) is an example that uses point-to-point links. In a point-to-point link, the signals are buffered and repeated at each DIMM. In principle, this allows an arbitrary number of DIMMs to be chained together at a high data rate. However, each buffer that is traversed adds latency to the system. As the latency increases, throughput degrades, as we will demonstrate later on. Therefore, even with a point-to-point link, the number of DIMMs is limited by the added latency.
In this paper, we make the following contributions.
1) We propose the DIMM tree architecture [27] for scaling the number of DIMMs in a DRAM system without severely degrading latency as in a point-to-point linked system.
2) We propose the application of multiband radio-frequency interconnect (MRF-I) [4], [7], [16] to the DIMM tree architecture to provide even greater scaling and higher throughput.
3) We propose the partitioned DIMM tree, which partitions the DIMM tree into a fast and a slow partition. The fast partition is composed of the faster levels of the DIMM tree and acts as a cache for the pages held by the rest of the DIMMs, which form the slow partition. This allows us to scale the DIMM tree even further.
4) We propose the addition of direct DIMM-to-DIMM transfers for the partitioned DIMM tree in order to reduce contention on the memory channel caused by transferring blocks between the fast and slow partitions.
The remainder of this paper is organized as follows. Section II gives background on DDRx DIMMs. Section III gives background on multiband radio-frequency interconnect (MRF-I). Section IV describes the DIMM tree architecture in detail. Section V describes the tree DIMM. Section VI describes our experimental framework for the DIMM tree. Section VII describes our results for the DIMM tree architecture. Section VIII describes the partitioned DIMM tree. Section IX describes direct DIMM-to-DIMM transfer. Section X describes our experimental framework and results for the partitioned DIMM tree. Section XI describes related work. Section XII concludes this paper.
II. DIMM BACKGROUND
In this section, we first give a brief overview of existing DIMM technologies in order to understand the tradeoffs of the DIMM tree architecture. Fig. 1(a) shows a conventional DDRx DIMM consisting of multiple DDRx DRAM chips that are accessed in parallel. In this example, each DIMM contains eight DRAM chips with each chip containing eight data pins. A 64-bit data bus is created by aggregating the data signals from each chip together, which means each data line is connected to only one DRAM chip. Each command, address, and control signal, however, is connected to all eight DRAM chips on the DIMM. Therefore, if there were four DIMMs on a single multi-drop bus, then each data line would be connected to four DRAM chips, and each command, address, and control signal would be connected to 32 DRAM chips. This demonstrates how quickly the load on a multi-drop bus can increase, thereby degrading the signal integrity.
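The loading arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `bus_loads` and its parameters are hypothetical, and it assumes one rank per DIMM with no on-DIMM buffering, as in the conventional DIMM of Fig. 1(a).

```python
def bus_loads(dimms: int, chips_per_dimm: int = 8, pins_per_chip: int = 8) -> dict:
    """Count DRAM chip loads on a multi-drop bus of unbuffered DDRx DIMMs.

    Assumption (not from the paper's text): one rank per DIMM.
    """
    return {
        # The data bus is the concatenation of every chip's data pins.
        "data_bus_width": chips_per_dimm * pins_per_chip,
        # Each data line reaches exactly one chip on each DIMM.
        "loads_per_data_line": dimms,
        # Command, address, and control lines reach every chip on every DIMM.
        "loads_per_cmd_addr_ctrl_line": dimms * chips_per_dimm,
    }


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, bus_loads(n))
```

With four DIMMs the sketch reports four loads per data line and 32 loads per command, address, and control line, matching the figures quoted in the text.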
One technique to reduce the load is to insert a buffer for all the signals on the DIMM between the DRAM chips and the memory controller. This technique is used in load reduced DIMM (LR-DIMM) [12] as shown in Fig. 1(b) . LR-DIMM uses an isolation memory buffer (iMB) to buffer the command, address, control, and data signals to the DRAM chips. There are no changes needed to the DRAM chips themselves, which is very important. Since DRAM chips are commodity parts, and are optimized for high density and low cost, any design changes that may reduce density or increase cost are usually avoided. Therefore, our proposal to support the DIMM tree architecture will also not modify the DRAM chips, but just interface to them as is done with the iMB in LR-DIMM.
The data rate of DDRx is always twice the rate of the command, address, and control signals, hence the name double data rate (DDR). In order for the DRAM chips to deliver such high data rates, internally the DRAM chips fetch the data with a lower data rate and a wider interface, but transfer externally with a higher data rate and a narrower interface. This is known as an n-bit prefetch. For example, DDR3-1600 is an 8n-bit prefetch architecture. Internally, it fetches eight bits of data in one clock cycle at 800 MHz. Externally, it transfers eight bits of data, one bit at a time, over a single pin at 1.6 GHz. Therefore, the 1.6 Gb/s/pin external data rate of DDR3-1600 is really generated by fetching data at half the clock speed internally. Since the DIMM tree architecture must support more DIMM-to-DIMM interfaces than a conventional DDRx DIMM, we will use this technique to reduce the number of data pins for each DIMM-to-DIMM interface by transmitting at twice the data rate over half the number of pins.
III. OFF-CHIP MULTIBAND RF-I BACKGROUND
Multiband RF-I [4] , [7] , [16] is a high aggregate bandwidth and power saving alternative to a traditional interconnect. MRF-I is realized via transmission of electromagnetic waves through multiple carrier channels over a shared transmission line, rather than the transmission of a voltage signal through a single baseband over a wire. In MRF-I, carrier waves are continuously propagated along the transmission line, and data is generated through either the amplitude or phase modulation of the carrier wave. By transmitting independent data streams each over different RF bands, MRF-I can provide simultaneous transmissions of multiple data streams over a shared physical transmission line to improve the aggregate bandwidth and data rates.
There has been much advancement in off-chip MRF-I in recent years [4], [7], [16]. The most recent advancement, by Byun et al. [4], uses amplitude-shift keying (ASK) modulation with differential signaling, which we refer to as ASK MRF-I. Differential signaling uses two lines to propagate a signal, which allows for higher signal integrity and in turn leads to higher data rates and a higher number of RF bands per pin overall. Byun et al. [4] demonstrated the high data rate, low power, and low bit error rate (BER) of MRF-I, as well as the feasibility of process integration, by implementing it in a general-purpose 65 nm logic complementary metal-oxide-semiconductor (CMOS) process. They also demonstrated a dual-band MRF-I transceiver operating over 10 cm on an FR4 board and a Rogers 4003C board at aggregate data rates of 8.4 and 10 Gb/s, respectively. The power consumption of the dual-band MRF-I transceivers on the FR4 and Rogers boards was 21 and 25 mW, respectively. Both boards operated with less than a BER.
A. Why Use MRF-I for DRAM?
Traditional chip-to-DRAM interconnects support fewer drops as the data rate increases and signal integrity worsens. This has become apparent in the reduced number of DIMMs on a multi-drop bus as technology has changed from the slower DDR2 DIMMs to the faster DDR3 DIMMs and, in the near future, DDR4 DIMMs. Because of its high data rate, DDR4 is projected to support only one DIMM on a bus. Even though Byun et al. [4] did not demonstrate multi-drop for MRF-I, multi-drop for MRF-I is very feasible. We project that MRF-I at 4 Gb/s per RF band (enough to support DDR4) will be able to support four DIMMs on a multi-drop bus. Since Byun et al. [4] implemented differential signaling using dual bands, there is none of the extra pin overhead usually associated with baseband-only differential signaling, i.e., MRF-I uses two pins for two bands instead of the two pins for one band of a traditional interconnect.
MRF-I can also be used to reduce pin count or increase bandwidth by supporting more than two bands per pair of differential lines i.e., greater than one band per pin. For example, with two RF bands per pin (four RF bands per pair of differential lines), we could either support the same bandwidth and reduce the number of pins by half, or keep the number of pins the same and double the bandwidth. The latter application is particularly useful from an energy savings perspective as we start to approach data rates of greater than 5 Gb/s/pin. At about 5 Gb/s/pin traditional interconnects start to consume power super linearly due to power-hungry circuit techniques of pre-emphasis and equalization that must be used to compensate for the signal loss. Examples include current technologies such as GDDR5 (7 Gb/s/pin) and future technologies such as DDR4/DDR5 that will reach around 5 Gb/s/pin. By multiplexing the data over multiple bands, the interconnect power can be kept in the linear power consumption region. For example, transmitting data over two RF bands per pin operating at 4 Gb/s per band will provide 8 Gb/s/pin. ASK MRF-I [4] can currently support up to four RF bands per pin. However, as MRF-I technology advances, we expect that number to increase even more.
B. Overhead of Using MRF-I
There is also an area savings improvement that can be made for multi-bit transceivers. Byun et al. [4] demonstrated a dual-band transceiver over a single pair of differential lines. When creating a multiple-bit transceiver, the simplest approach would be to replicate the design. However, this is very inefficient in both area and energy. The capacitive loading on each ASK transmitter is very low, so each ASK transmitter does not require its own dedicated voltage-controlled oscillator (VCO) to produce the RF carrier [as shown in Fig. 2(a)]. Instead, a single VCO can be shared among up to eight ASK transmitters as long as they use the same RF band [as shown in Fig. 2(b)]. This optimization results in both area and energy savings. We validated these area and energy optimizations by layout and simulation using the Spectre circuit simulator [6], as was done in [17]. The values are shown in Table I and Fig. 3.

The area of 8-bit transceivers, including pads, for baseband (BB) and RF-I transceivers is shown in Table I for 65 nm technology. We label transceivers for two, four, and eight RF bands per pair of differential lines as 2ASK, 4ASK, and 8ASK, respectively. The individual transceiver size can be obtained by dividing the "Area" by "#transceivers." For example, a single 2ASK transceiver is . Table I shows that as the number of RF bands per pin increases, the area and number of pins required to transmit eight bits of data shrink significantly. We will discuss later on which ASK transceivers are required to support the DIMM tree architecture.

Fig. 3 shows the energy per bit as bandwidth is increased for MRF-I versus a traditional interconnect, which is labeled as baseband (BB). The power numbers for the baseband were taken from [10], [14], [20]. We compare BB against 2ASK, 4ASK, and 8ASK.
Fig. 3 shows that we can maintain a low energy per bit at higher data rates by adding more RF bands. This is achieved by keeping each RF band, as well as the BB-only interconnect, within its linear power-consumption versus bandwidth region. The energy efficiency (dotted lines) of 4ASK and 8ASK was projected using both measured DBI power numbers [4], [5], [15] and simulation results.
Latencies for the transmitters, receivers, and transmission lines for RF-I versus the baseband are shown in Table II for 5 and 10 cm. These latencies fall well within the DDR3-1600 cycle time of 1.25 ns, which we use as our level-to-level latency in the DIMM tree architecture. Please note that Byun et al. [4] did not optimize their circuits for area or power, since it was a proof of concept paper to demonstrate the feasibility of off-chip MRF-I. One area reducing improvement that can be made without affecting the operation of the design is to place the digital logic circuits directly underneath the passive structures.
C. Current State of MRF-I
While 8ASK MRF-I has not yet been demonstrated, we believe it is achievable in the not too distant future. Multi-band signaling of MRF-I using ASK is different from multi-level signaling such as four- or eight-level pulse amplitude modulation (PAM). One possible design is to create better band-selective and area-efficient filters to mitigate inter-channel interference between adjacent bands. Differential 2ASK [4] achieved 5 pJ/b/pin at 4 Gb/s/pin, and single-ended 2ASK with an inter-channel interference suppression scheme [15] was demonstrated at 4 pJ/b/pin and 8 Gb/s/pin in 65 nm CMOS. The design complexity and power overhead of an 8ASK MRF-I will be mitigated by moving to a 25 nm (or better) CMOS process technology.
The most critical issue for 4ASK or 8ASK MRF-I with multi-drop is signal integrity, due to the impedance discontinuity at each drop. Our latest MRF-I inter-channel interference suppression scheme [15] demonstrated single-ended multi-band signaling over a point-to-point transmission line (TL) connection. We believe development of a multi-drop 4ASK MRF-I is possible using an on-chip transformer-based band-pass filter [5] and an improved inter-channel interference suppression scheme [15]. In our simulation, multi-drop MRF-I can work up to 4 Gb/s/pin per band with a 5 cm TL. As expected, a multi-drop MRF-I would be limited at much faster data rates due to degraded signal integrity on a multi-drop bus with a 10-20 cm TL.
IV. DIMM TREE ARCHITECTURE
The DIMM tree architecture [27] is designed to increase the capacity of a DRAM system without degradation in throughput. The DIMM tree architecture creates a tree of DIMMs so that latency grows logarithmically instead of linearly with the number of DIMMs; this allows the memory system to scale to a many-DIMM DRAM system. The DIMM tree requires a minimum of two DIMMs to be supported on a multi-drop bus. Otherwise, it becomes a chain of DIMMs connected with point-to-point links. Therefore, in the future with much higher data rates, a technology such as MRF-I that can support two DIMMs on a multi-drop bus will be required. This section describes the benefits and organization of a tree of DIMMs, the implementation details of a tree DIMM (T-DIMM) without MRF-I, and the addition of MRF-I to the DIMM tree architecture.
A. Benefits and Organization of a DIMM Tree
The main benefit of a DIMM tree is the logarithmic increase in latency with the number of DIMMs. This can be seen by comparing the DIMM tree against a point-to-point and a multi-drop bus organization. Fig. 4 shows the varying latencies of each level of DIMMs in a point-to-point organization, a multi-drop organization, and the DIMM tree. A multi-drop connection among DIMMs is represented by the DIMMs sharing a common wire. In Fig. 4(a), the point-to-point organization causes the latency to increase linearly with the number of DIMMs, due to the buffer at each DIMM that acts as a signal repeater. DIMM0 has a latency of only one hop from the CMP, while DIMM3 has a latency of four hops from the CMP. A point-to-point organization of N DIMMs can also be viewed as a tree with branching factor one and height N. In Fig. 4(b), the multi-drop organization causes the latency to be equal among all the DIMMs. In this case, all the DIMMs have an equal latency of one hop from the CMP. A multi-drop organization of N DIMMs can be viewed as a tree with branching factor N and height one. Fig. 4(c) shows a DIMM tree of branching factor two and height two. In a DIMM tree, there are different families of DIMMs, each connected by a multi-drop bus. Each DIMM contains a buffer in order to generate a new clean signal to its children. This is just as in a point-to-point connection, except that each DIMM in the DIMM tree may have multiple children. For example, the CMP and its children, DIMM0 and DIMM1, all share a multi-drop connection [just as the CMPs and their children do in Fig. 4(a) and (b)]. Likewise, DIMM0 also shares a multi-drop connection with its children, DIMM2 and DIMM4. However, DIMM2 is not on the same multi-drop connection as the CMP, since the buffer on DIMM0 separates the connections. Just as in the point-to-point connection, each buffer represents a connection to an additional hop. Therefore, DIMM0 has a latency of one hop while DIMM2 has a latency of two hops.
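The relationship between branching factor and worst-case hop count can be illustrated with a short Python sketch. It assumes a tree that is filled level by level starting from the level closest to the CMP, with level k holding branching**k DIMMs; the function name `max_hops` is hypothetical and the sketch ignores everything except hop count.

```python
def max_hops(num_dimms: int, branching: int) -> int:
    """Hops from the CMP to the farthest DIMM in a level-filled DIMM tree.

    Assumption: level k (k = 1 is closest to the CMP) holds branching**k
    DIMMs, and levels are filled closest-first.
    """
    if num_dimms < 1 or branching < 1:
        raise ValueError("num_dimms and branching must be positive")
    hops, capacity = 0, 0
    while capacity < num_dimms:
        hops += 1
        capacity += branching ** hops
    return hops


if __name__ == "__main__":
    n = 64
    print("point-to-point (BF 1):", max_hops(n, 1))   # grows linearly with n
    print("multi-drop (BF n):   ", max_hops(n, n))    # always one hop
    print("DIMM tree (BF 4):    ", max_hops(n, 4))    # grows logarithmically
```

Under these assumptions, a branching factor of one reproduces the linear point-to-point chain, a branching factor equal to the number of DIMMs reproduces the single-hop multi-drop bus, and intermediate branching factors give the logarithmic growth described above.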
V. TREE DIMM (T-DIMM) IMPLEMENTATION
A conventional DDR3 DIMM supports only one interface, from the DIMM to the memory controller. In a T-DIMM, however, we must be able to support two interfaces: one to the parent and sibling DIMMs and one to the children DIMMs. In order to support an additional interface on the DIMM without the added pin overhead, we can use a technique similar to the n-bit prefetch used in DDRx described in Section II. By transferring some of the signals over half the number of pins but at twice the data rate, we can reduce the overhead of adding a second DIMM interface. Fig. 5(a) shows a single DDR3-1600 T-DIMM with data rates and number of pins for the data, address, command, and control lines. The data and address lines operate at 2X the data rate (3.2 Gb/s/pin for data, 1.6 Gb/s for address) of a conventional DDR3 DIMM (1.6 Gb/s/pin for data, 0.8 Gb/s for address), but use half the number of pins (32 for data, 7 for address). Therefore, in order to support a second DIMM interface, the number of pins on the DIMM is increased by what amounts to another set of command/control lines plus chip select, which is 10 + log2(number of ranks). We assume there is logic on each DIMM to decode the chip select with "log2(number of ranks)" lines instead of "number of ranks" lines. This design also halves the number of pins needed to interface to the memory controller. The command and control lines for the T-DIMM operate at the same rate as a conventional DDR3 DIMM (0.8 Gb/s/pin). All signals to the DRAM chips must go through the DIMM Interface Router (DIR), just as they do for the iMB in LR-DIMM [12].
The DIR, shown in Fig. 5(b) , contains a parent DIMM baseband (BB) transceiver, a router, a buffer, a child DIMM baseband transceiver, several data rate converters, and several baseband transceivers (BB TX/RX). The parent DIMM BB transceiver connects the DIMM to its parent and siblings within the tree hierarchy. The router consists of a lookup table indexed by the rank number specifying four possible routes: the current DIMM, a descendent of the current DIMM, the parent DIMM, or none of the above. The buffer is used to buffer signals that must go to the next level of the tree (i.e., a descendent of the DIMM). The child DIMM BB transceiver connects the DIMM to its children in the tree hierarchy. The memory controller would be the root of the tree, so it would also contain a child DIMM BB transceiver. The data rate converter converts between a data rate of X with Y pins to a data rate of 2X with Y/2 pins and vice versa. This is accomplished by interleaving the values of two signals operating at data rate X onto a single wire at data rate 2X and vice versa.
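The behavior of the data rate converter can be sketched in Python as a pair of interleave and de-interleave functions. The function names and the choice of which input signal is sent first on the fast wire are assumptions made for illustration; the text does not specify the ordering.

```python
from typing import List, Sequence, Tuple


def to_double_rate(sig_a: Sequence[int], sig_b: Sequence[int]) -> List[int]:
    """Interleave two rate-X signals onto one wire running at rate 2X.

    Assumption: within each rate-X period, the bit of sig_a is sent
    before the bit of sig_b.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("both signals must carry the same number of bits")
    wire: List[int] = []
    for bit_a, bit_b in zip(sig_a, sig_b):
        wire.extend((bit_a, bit_b))
    return wire


def to_single_rate(wire: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a rate-2X wire back into the two original rate-X signals."""
    if len(wire) % 2 != 0:
        raise ValueError("a 2X wire must carry an even number of bits")
    return list(wire[0::2]), list(wire[1::2])


if __name__ == "__main__":
    a, b = [1, 0, 1], [0, 0, 1]
    fast = to_double_rate(a, b)
    print(fast)
    print(to_single_rate(fast))
```

Applying the two functions in sequence returns the original pair of signals, which is the property the converter needs so that Y pins at rate X and Y/2 pins at rate 2X carry the same information.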
A. Adding MRF-I to the DIMM Tree Architecture
The drawback of using a traditional interconnect is that as the DRAM chip data rate is increased, fewer and fewer drops are supported. This directly affects the branching factor of the DIMM tree and the rate at which the DIMM tree can grow with each level of latency added. This is especially true since the DRAM chip data signal rates are doubled in order to support an additional DIMM interface. Therefore, as new DRAM chip technologies with much higher data rates such as DDR4 arrive, the scalability of the DIMM tree using a traditional interconnect decreases. Replacing the traditional interconnect with MRF-I will allow the DIMM tree to continue to scale as data rates increase.
MRF-I is projected to support up to four drops on a multi-drop bus up to 4 Gb/s per RF band. That means with 2ASK MRF-I (1 RF band per pin), a DIMM tree with a branching factor of four can support DDRx-2000 DRAM chips (2 Gb/s/pin). In order to support even higher data rates, the data signals can be multiplexed over more RF bands. Therefore, with 4ASK MRF-I (two RF bands per pin) and 8ASK MRF-I (four RF bands per pin), a DIMM tree with a branch factor of four could support DDRx-4000 DRAM chips (4 Gb/s/pin) and DDRx-8000 (8 Gb/s/pin), respectively. Currently, ASK MRF-I [4] is limited to just 8ASK MRF-I. However, as MRF-I technology advances, we expect the number of RF bands per pin to increase.
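The mapping from the number of RF bands per pin to the fastest supportable DRAM chip can be checked with a small Python calculation. It assumes, following the text, a projected limit of 4 Gb/s per RF band with four drops, and that the T-DIMM data lines run at twice the DRAM chip data rate; the constant and function names are hypothetical.

```python
# Projected per-band limit on a four-drop bus, in Gb/s (from the text).
BAND_RATE_GBPS = 4.0
# T-DIMM data lines run at 2X the DRAM chip data rate over half the pins.
TDIMM_RATE_MULTIPLIER = 2


def max_chip_rate_gbps(bands_per_pin: int) -> float:
    """Highest DRAM chip data rate (Gb/s/pin) a four-drop MRF-I bus supports."""
    if bands_per_pin < 1:
        raise ValueError("bands_per_pin must be positive")
    link_rate = bands_per_pin * BAND_RATE_GBPS   # DIMM-to-DIMM Gb/s/pin
    return link_rate / TDIMM_RATE_MULTIPLIER     # what the chips may run at


if __name__ == "__main__":
    # nASK denotes n bands per differential pair, i.e., n/2 bands per pin.
    for name, bands in (("2ASK", 1), ("4ASK", 2), ("8ASK", 4)):
        print(f"{name}: DDRx-{int(max_chip_rate_gbps(bands) * 1000)}")
```

Under these assumptions the calculation yields 2, 4, and 8 Gb/s/pin for 2ASK, 4ASK, and 8ASK, which corresponds to the DDRx-2000, DDRx-4000, and DDRx-8000 parts named in the text.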
MRF-I can also be used to provide multiple logical channels over a single physical channel when there is more than one RF band per pin. Each RF band on a pin would form a logical channel. By partitioning DIMMs on a multi-drop bus among the logical channels, we can increase the concurrency of DRAM transactions. For example, with four DIMMs on a multi-drop bus and two logical channels, there would be two DIMMs per logical channel. All transactions to the DIMMs on the first logical channel can be scheduled independently of the transactions to the DIMMs on the second logical channel. Therefore, we are able to utilize the extra bandwidth provided by MRF-I to current DRAM chip technologies to increase throughput and improve scalability as we shall see in the results section.
Adding MRF-I to the T-DIMM involves replacing the parent and child DIMM baseband transceivers in the DIR with MRF-I transceivers. The parent MRF-I transceiver will always be a 2ASK MRF-I transceiver regardless of the number of logical channels supported, since each T-DIMM has only one parent. The child MRF-I transceiver, however, will be a 4ASK or 8ASK MRF-I transceiver for two or four logical channels, respectively, since each T-DIMM can have multiple children. Adding MRF-I to the T-DIMM also involves having a set of lines for each logical channel from the router to the buffer, and from the buffer to the child DIMM MRF-I transceiver. These extra lines are required because the buffer can only buffer a conventional signal, not an RF signal. The BB TX/RX remain the same, so no modification is needed to interface to the commodity DRAM chips.
VI. EXPERIMENTAL FRAMEWORK
For our evaluation, we generated memory transaction traces from the SPEC CPU 2006 benchmark suite [11] , stream suite [18] , and some medical imaging benchmarks. We selected the most memory intensive benchmarks from the SPEC CPU 2006 benchmark suite. The benchmarks included are bzip2, gcc, libquantum, lbm, mcf, milc, and sjeng. Copy and triad are derived from the stream benchmark suite [18] , and are streaming benchmarks. Deblur [25] , registration [35] , and denoise [30] are medical imaging benchmarks. In order to generate memory transaction traffic for a multiprogrammed workload (or a system using virtualization) running on a many-core CMP, we need to model several of the benchmarks on separate cores concurrently. We create six workloads shown in Table III . Each workload is generated so that the memory footprint is at least 4 GB when the benchmarks are run to completion.
Since simulating the benchmarks to completion with a cycle accurate simulator would take several months to complete, we reduce our analysis to a one billion instruction phase of each benchmark. The traces were gathered using Pin [24] , a dynamic instrumentation tool, with a 2 MB eight-way set associative L2 cache model with 64 B blocks taken from Simplescalar [3] . The traces were generated by warming up for one billion instructions before recording and then running for another one billion instructions while recording memory transactions, similar to Rafique et al. [23] . We found that warming up for one billion instructions was enough to reach beyond the initialization phase of the benchmarks when all the compulsory page faults occur. The traces were taken as input into DRAMsim [28] , a detailed cycle accurate memory system simulator. We use DRAMsim's built-in ability to interleave several trace files together in order to create a multiprogrammed CMP workload similar to Ganesh et al. [8] that will stress the DRAM system. Table IV shows the six different mixes we use. The mixes are categorized by how much they will stress the DDR3-1600 DRAM system (11.92 GB/s per channel)-i.e., low (zero to two channels), med (two to four channels), and high (greater than four channels). We use the parameters in Table V for the simulations. We modify DRAMsim to support the DIMM tree architecture. For the DRAM chips, we use timing and power parameters from the Micron datasheets for DDR3-1600 [19] . We assume one rank on each DIMM.
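The per-channel bandwidth figure and the low/med/high categories can be reproduced with the following Python sketch. It assumes a 64-bit data bus and reads the 11.92 GB/s figure as binary gigabytes (2^30 bytes), which is an interpretation rather than something stated in the text; the handling of demands that fall exactly on a category boundary is also an assumption.

```python
GIB = 2 ** 30  # bytes per binary gigabyte (assumed unit for the 11.92 figure)


def peak_channel_bytes_per_s(transfers_per_s: float = 1600e6,
                             bus_bits: int = 64) -> float:
    """Peak bytes per second of one DDR3-1600 channel with a 64-bit data bus."""
    return transfers_per_s * bus_bits / 8


def stress_class(demand_gib_per_s: float) -> str:
    """Classify a mix by how many channels' worth of bandwidth it demands."""
    channels = demand_gib_per_s * GIB / peak_channel_bytes_per_s()
    if channels <= 2:      # boundary placement is an assumption
        return "low"
    if channels <= 4:
        return "med"
    return "high"


if __name__ == "__main__":
    print(round(peak_channel_bytes_per_s() / GIB, 2), "GiB/s per channel")
    for demand in (8.2, 30.0, 60.0):
        print(demand, stress_class(demand))
```

The first printed value is 11.92, consistent with the per-channel figure quoted above under the binary-unit reading.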
We also assume that the memory controller is on-chip with the CMP, is responsible for scheduling the transactions to each DIMM, and assures there are no conflicts for DRAM resources. The more DIMMs there are, the more complexity is added to the memory controller, since state must be maintained for each DIMM. However, the multilevel nature of the tree allows the memory scheduler to be broken up into simpler parts. For example, when scheduling to a DIMM in the level furthest away from the memory controller, we first check the level closest to the memory controller. If the transaction cannot be scheduled in the level closest to the memory controller, then we do not need to check the other levels, since we know the transaction is not schedulable.
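The level-by-level check can be sketched as an early-exit loop in Python. This is a simplified, hypothetical model rather than the modified DRAMsim scheduler: each level on the path to the target DIMM is represented by an opaque segment identifier, and a segment is either busy or free.

```python
from typing import Hashable, Sequence, Set, Tuple


def can_schedule(path: Sequence[Hashable],
                 busy: Set[Hashable]) -> Tuple[bool, int]:
    """Check a transaction's path, closest level to the controller first.

    path lists the bus segments the transaction must cross, ordered from
    the level nearest the memory controller to the target DIMM's level.
    Returns (schedulable, number_of_levels_checked).
    """
    checked = 0
    for segment in path:
        checked += 1
        if segment in busy:
            # A conflict near the root makes deeper checks unnecessary.
            return False, checked
    return True, checked


if __name__ == "__main__":
    path = ["level1", "level2", "level3"]
    print(can_schedule(path, {"level1"}))  # rejected after one check
    print(can_schedule(path, set()))       # accepted after three checks
```

The second element of the returned tuple shows the saving: a conflict at the level closest to the memory controller is detected after a single check, regardless of how deep the target DIMM sits in the tree.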
VII. RESULTS FOR THE DIMM TREE ARCHITECTURE

A. Throughput and Scalability
In this section we compare the throughput and scalability of a system of DIMMs connected by point-to-point links, an ideal multi-drop system, the DIMM tree architecture, and MRF-I. We implement all systems using DDR3-1600 DRAM chips for a fair comparison of the architecture. For example, it would not be fair to compare a DDR2-800 FB-DIMM against a multi-drop system of DDR3-1600 DRAM chips, since the data rates are different. Fig. 6 shows the throughput and scalability of a DIMM system using DDR3-1600 DRAM chips connected with point-to-point links. Throughput is measured in GB/s. The number of DIMMs is varied from four to 64. As the number of DIMMs is increased from four DIMMs, the added latency degrades throughput drastically. With eight DIMMs, the throughput is degraded on average 15% up to 24%. With 16 DIMMs, the throughput is degraded on average 21% up to 39%. With 32 DIMMs, the throughput is degraded on average 38% up to 56%. The only exception is low_bw_mix_1 where the throughput increases by 34% when going to 16 DIMMs. The reason for this increase is that low_bw_mix_1 has a large amount of rank level parallelism. The gains by exploiting low_bw_mix_1's rank level parallelism exceed the degradation caused by the added latency. However, when increasing the DIMMs from 16 to 32, the throughput begins to decline again with the added latency. Fig. 7 shows the throughput and scalability of an ideal multi-drop system versus the DIMM tree architecture. The ideal multi-drop system models a hypothetical upper bound on the throughput we could achieve if all the DIMMs could be supported on a multi-drop bus. The system modeled with the DIMM tree could either be one using a traditional interconnect or one with one RF band per pin, since they would have the same throughput. The ideal multi-drop system is labeled with "Ideal MD". The DIMM tree architecture with branching factor four is labeled with "DT, BF4". 
We see that the DIMM tree scales much better than a system connected with point-to-point links. With 16 DIMMs, the degradation in throughput with the DIMM tree over the ideal multi-drop is on average 2% up to 4%. With 64 DIMMs, the degradation is on average 8% up to 12%. Mixes low_bw_mix_1 and med_bw_mix_2 both have a large amount of rank level parallelism, and see a large throughput increase as the DIMMs are increased from four to 16 . Fig. 8 shows the throughput and scalability as we add MRF-I to a DIMM tree of branching factor four. MRF-I is added with multiple RF bands per pin in order to support multiple concurrent logical channels. Fig. 8 shows the throughput of a DIMM tree system with branch factor four and four RF bands per pin (labeled DT, BF4, RF4) against an ideal multi-drop system (labeled Ideal MD) and a nonideal multi-drop system with four RF bands per pin (labeled MD, RF4). The addition of four RF bands per pin in a multi-drop system with four DIMMs increases throughput by an average of 93% up to 159%. As the number of DIMMs in the DIMM tree increases, we again see a benefit. The mixes with a high amount of rank level parallelism are able to use the four logical channels to schedule transactions to separate ranks concurrently to improve throughput. For example, low_bw_mix_1 with 32 T-DIMMs increases throughput by 124% over a four-DIMM multi-drop system with four RF bands per pin ("MD, RF4, 4 DIMMs"). All the mixes see similar increases in throughput with four logical channels. The exception though is low_bw_mix_1, which at 16 DIMMs is already close to its maximum throughput of 8.2 GB/s from Table IV . With 64 T-DIMMs, we see a throughput increase on average of 68% up to 200% over "MD, RF4, 4 DIMMs".
The four logical channels created by the four RF bands per pin provide so much more bandwidth than the mixes require that they offset the latency added by the additional levels in the DIMM tree. Therefore, the DIMM tree with multiple RF bands per pin is able to outperform both an ideal multi-drop system and a nonideal multi-drop system using multiple RF bands per pin.
B. Power
In this section, we compare the power of the DIMM tree architecture against a conventional DDRx DIMM system and the LR-DIMM system from Fig. 1. At four DIMMs, a DIMM tree with branching factor four will perform equivalently to a conventional DDRx DIMM and an LR-DIMM, since all three systems will have the four DIMMs connected with a multi-drop bus. Therefore, we compare all three systems with four DIMMs using DDR3-1600 DRAM chips so we can compare the power values directly. We obtain the DRAM chip power from DRAMsim [28] and the Micron datasheet for DDR3-1600 [19]. The non-DRAM chip numbers are obtained from the highly accurate Spectre circuit simulator [6]. Fig. 9 shows the power results in milliwatts. Most of the power is consumed by the DDR3-1600 DRAM chips. The remaining non-DRAM chip power includes the interconnect, the baseband or RF transceivers, and any additional structures needed, e.g., the iMB for LR-DIMM. LR-DIMM adds 3% more power on average (up to 4%) compared to a conventional DDRx DIMM. The DIMM tree adds 5% more power on average (up to 6%) compared to a conventional DDRx DIMM. Therefore, unlike past technologies for improving throughput and capacity such as FB-DIMM, the power overhead of the DIMM tree architecture is very small.
VIII. PARTITIONED DIMM TREE
A high-capacity, high-throughput memory system such as the DIMM tree opens up new opportunities for greater performance by reducing or even eliminating accesses to hard disk. Table VI summarizes the latencies, throughput, and capacities of a DIMM tree with 64 DDR3-1600 tree DIMMs (T-DIMMs) against a hard drive and a solid state drive. We assume each DIMM can hold 4 GB, with the DRAM chip numbers taken from Micron datasheets [19]. The data shows that the DIMM tree dominates solid state drives and hard drives in terms of latency and throughput, while offering a comparable amount of capacity. The read/write and idle power numbers are also shown. Even though the DIMM tree consumes more power than hard drives, the throughput of a DIMM tree is also much higher. Therefore, a DIMM tree is able to transfer data with greater power efficiency, as shown by the higher throughput per watt numbers. The only drawback of using the DIMM tree is its high idle power compared to hard disk. However, high-throughput systems such as those described in the RAMCloud paper [21] would rarely be idle, and would benefit greatly from the DIMM tree.
While the DIMM tree [27] has been shown to scale well to 64 DIMMs, the degradation in throughput starts to become unacceptable as we try to scale beyond that. In an era where terabyte-sized hard drives are prevalent, the DIMM tree would need to scale beyond 64 DIMMs to offer comparable capacity. Fig. 10 shows the normalized throughput as we scale the number of DIMMs in a DIMM tree with a branching factor of four and one rank per DIMM. With 256 DIMMs, throughput degrades by 24% on average (up to 33%) compared to just four DIMMs. Therefore, while the DIMM tree architecture scales much better than any existing DRAM system, it still has its limitations.
The partitioned DIMM tree divides the DIMM tree into a fast partition and slow partition as shown in Fig. 11 . The fast partition consists of the DIMMs in the fast levels of the DIMM tree, while the slow partition consists of the remaining DIMMs. In this example, the fast partition consists of all the DIMMs in the level of the DIMM tree that has an access latency of one hop from the CMP. In general the fast partition does not have to be limited to only one level of the DIMM tree, and does not have to be statically defined i.e., the partition can be dynamically changed by the operating system. However, in this paper, we only study a static fast partition consisting of the DIMMs in the single level closest to the CMP and leave other configurations and dynamically changing the partition for future work. The fast partition acts as a cache for pages in the slow partition. The partitioned DIMM tree also contains a fast partition page table to keep track of the pages in the fast partition and a fast partition page fault handler to handle situations where a page must be transferred between the fast and slow partitions. The remainder of this section will discuss the fast and slow partitions, and the fast partition page table. 
A. Fast and Slow Partitions
The relationship between the fast partition and the slow partition is similar to the relationship between memory and hard disk in a virtual memory system. The fast partition is a cache for pages from the slow partition just as memory is a cache for pages from the hard disk in a virtual memory system. The fast and slow partitions can either partition the same physical address space or have separate physical address spaces. In a dynamically partitioned DIMM tree, the physical address space would have to be the same. Because we are only addressing a static partition in this paper, we assume the fast and slow partitions each have their own separate address space as shown in Fig. 12 . Transfers between the fast and slow partitions occur at the page granularity just as in virtual memory.
In order to keep track of the pages in the fast partition, the partitioned DIMM tree must use a fast partition page table. The fast partition page table is located in the fast partition as shown in Fig. 12 for fast access. However, unlike a traditional page table, the fast partition page table is a set associative structure instead of a fully associative structure. This is done in order to improve lookup times and the time it takes to handle page faults. Even though a set associative structure is not as space efficient as a fully associative structure, the partitioned DIMM tree provides so much more capacity than a traditional DRAM system that the inefficient space organization can be tolerated in order to improve lookup and page fault times.
The TLB and page table also have to be modified in order to support the partitioned DIMM tree. Each TLB and page table entry needs an additional bit to specify whether the current page exists in the fast partition, and a field for the fast partition page number. The TLB also needs to cache the LRU bits of the fast partition page table. The slow partition page number is stored in the field of the page table entry traditionally used for the physical page number. These modifications do increase the size of the page table. However, the additional size is insignificant compared to how much more we can scale the DRAM system using the partitioned DIMM tree.
B. The Fast Partition Page Table
The fast partition page table is a set associative structure as shown in Fig. 13(a). The fast partition page table is accessed using the slow partition page number and returns a fast partition page table entry. The least significant bits of the slow partition page number map to the index of the fast partition page table. The fast partition page table entry contains a valid bit, a dirty bit, a pending replace bit, and tag bits for the slow partition page number as shown in Fig. 13(b). The location of the entry determines which fast partition page the slow partition page number is mapped to, as shown in Fig. 13(c). For example, way 0 in index 0 maps to page 0, way 0 in index 1 maps to page 4, and way 1 in index 1 maps to page 5. The generic mapping from the index and way to the fast partition page number is $\text{fast partition page number} = \text{index} \times \text{associativity} + \text{way}$, where the example above corresponds to an associativity of four.
The fast partition page fault handler is responsible for servicing fast partition page faults. The fast partition page fault handler is shown in Fig. 14. When there is a fast partition page fault, the pending fast partition page fault queue is first checked to see if there is an existing entry for the slow partition page. If there is not, then an entry from the fast partition page table must be chosen for replacement, and a new request must be added to the pending fast partition page fault queue. In this study, we use an LRU replacement policy for choosing entries to replace in the fast partition page table. Once a request is placed in the pending fast partition page fault queue, the evicted fast partition page table entry's pending replace bit is set. This allows the fast partition page fault handler to mark a fast partition page table entry as a candidate for replacement while still allowing any prior transactions in the transaction queue to the evicted page to continue until the actual replacement occurs.
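The index and way mapping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the names `Entry`, `split`, `fast_page_number`, and `lookup` are hypothetical, the four-way associativity is inferred from the example in which way 0 of index 1 maps to page 4, and the number of sets is an arbitrary assumption.

```python
from dataclasses import dataclass

WAYS = 4        # assumed associativity, inferred from the index/way example
NUM_SETS = 8    # hypothetical number of sets, chosen only for illustration


@dataclass
class Entry:
    """One fast partition page table entry (fields named in the text)."""
    valid: bool = False
    dirty: bool = False
    pending_replace: bool = False
    tag: int = 0


def fast_page_number(index, way):
    """Fast partition page implied by an entry's position in the table."""
    return index * WAYS + way


def split(slow_page):
    """Low bits of the slow partition page number pick the set; the rest is the tag."""
    return slow_page % NUM_SETS, slow_page // NUM_SETS


def lookup(table, slow_page):
    """Return the fast partition page, or None on a fast partition page fault."""
    index, tag = split(slow_page)
    for way, entry in enumerate(table[index]):
        if entry.valid and entry.tag == tag:
            return fast_page_number(index, way)
    return None


# Place slow partition page 9 (index 1, tag 1) in way 1 of its set.
table = [[Entry() for _ in range(WAYS)] for _ in range(NUM_SETS)]
index, tag = split(9)
table[index][1] = Entry(valid=True, tag=tag)
```

With this table, `lookup(table, 9)` resolves to fast partition page 5, matching the "way 1 in index 1" example, while a slow partition page that shares the index but has a different tag misses.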
The actual page replacement occurs when the request in the pending fast partition page fault queue has reached the head of the queue. When there are no more transactions to the evicted page, the fast partition page fault handler can fill the page transaction queue with memory transactions to write back the evicted page or read a new page from the slow partition. In this paper, we assume each memory transaction is 64 bytes, and each page is 4 KB. Therefore, transferring an entire page requires 64 memory transactions. Fig. 14(a) shows an example where we are evicting page A, which is dirty and so must be written back before we can read in page B. The page transaction queue only holds transactions to one page at a time, so it is filled with transactions to page A only. At this point, the fast partition page table entry's valid bit is cleared in order to prevent incoming memory transactions from accessing the page, although such accesses are unlikely. Once all the transactions to page A have been issued, though perhaps not yet completed, the page transaction queue can be filled with transactions to page B as shown in Fig. 14(b). The transactions from the page transaction queue are given lower priority than transactions from the transaction queue.
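The transaction count and the one-page-at-a-time filling of the page transaction queue can be illustrated with a short Python sketch. The function names and the tuple layout are hypothetical and chosen only for illustration; only the 4 KB page size, the 64-byte transaction size, and the write-back-before-read ordering come from the text.

```python
PAGE_BYTES = 4 * 1024
TRANSACTION_BYTES = 64
TRANSACTIONS_PER_PAGE = PAGE_BYTES // TRANSACTION_BYTES  # 4096 / 64


def page_transactions(kind, page):
    """One (kind, page, byte offset) tuple per 64-byte transaction of a page."""
    return [(kind, page, i * TRANSACTION_BYTES)
            for i in range(TRANSACTIONS_PER_PAGE)]


def replacement_batches(victim_page, victim_dirty, new_page):
    """Successive contents of the page transaction queue for one replacement.

    The queue holds transactions to a single page at a time, so a dirty
    victim is written back in one batch before the new page is read in.
    """
    batches = []
    if victim_dirty:
        batches.append(page_transactions("writeback", victim_page))
    batches.append(page_transactions("read", new_page))
    return batches


# Evicting dirty page A before reading in page B, as in the Fig. 14 example.
batches = replacement_batches("A", True, "B")
```

A clean victim skips the write-back batch, so the replacement costs one page of transactions instead of two.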
IX. DIRECT DIMM-TO-DIMM TRANSFER
Transferring pages between the fast and slow partitions creates a lot of additional traffic in the memory system. If we are able to support direct DIMM-to-DIMM transfers of data, then we can reduce the impact these page transfers have on normal DRAM transactions. In order to support direct DIMM-to-DIMM transfers, the DIMM interface router (DIR) for the T-DIMM would have to be modified slightly. The rest of this section discusses the motivation for supporting direct DIMM-to-DIMM transfers, the implementation of DIMM-to-DIMM transfers, and the address mapping policy needed for the fast and slow partitions to take full advantage of the DIMM-to-DIMM transfer.
A. Why Direct DIMM-to-DIMM Transfer
The impact of supporting DIMM-to-DIMM transfers when transferring a page from the slow partition to the fast partition is shown in Fig. 15. In Fig. 15(a), without DIMM-to-DIMM transfers, a page from the slow partition first has to be read from DIMM 2 to the CMP, and then written from the CMP to DIMM 0 in the fast partition. This transfer consumes the multi-drop bus between DIMMs 0, 2, and 4 for one time unit and the multi-drop bus between DIMM 0 and the CMP for two time units. In Fig. 15(b), with DIMM-to-DIMM transfer, the page is transferred directly from DIMM 2 to DIMM 0 without having to go through the CMP. Due to the separation of the various multi-drop networks in the DIMM tree, transfers from DIMM 2 to DIMM 0 can occur concurrently with transfers from DIMM 1 to the CMP. Therefore, the CMP is free to process normal transactions to and from DIMM 1 that are not related to the page transfer.
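The bus occupancy argument above can be tallied in a small Python sketch. The function name and bus labels are hypothetical; the time units are the ones quoted in the text for moving one page from DIMM 2 to DIMM 0, and the sketch assumes that a direct transfer does not touch the bus to the CMP at all.

```python
from collections import Counter

SUBTREE_BUS = "bus between DIMMs 0, 2, and 4"
CMP_BUS = "bus between DIMM 0 and the CMP"


def bus_occupancy(direct):
    """Time units each bus is busy while one page moves from DIMM 2 to DIMM 0."""
    busy = Counter()
    busy[SUBTREE_BUS] += 1      # the page always crosses the subtree bus once
    if not direct:
        busy[CMP_BUS] += 2      # read up to the CMP, then write back down
    return busy


without_d2d = bus_occupancy(direct=False)
with_d2d = bus_occupancy(direct=True)
```

Under these assumptions the direct transfer frees the two time units on the bus to the CMP, which is the capacity that normal transactions to DIMM 1 can use.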
B. Implementation of Direct DIMM-to-DIMM Transfer
Support for direct DIMM-to-DIMM transfer requires the addition of a new command to designate when to read data from one DIMM and write it directly to another DIMM. We call this new command CAS_S2D, which derives its name from performing a CAS command from source to destination. Support for the CAS_S2D command is implemented in the DIR in the T-DIMM, so the DRAM chips themselves do not have to be modified at all. A timing diagram showing the use of the CAS_S2D command is given in Fig. 16.
Fig. 16(b) shows the timing of a DIMM-to-DIMM writeback from the fast partition to the slow partition for the partitioned DIMM tree shown in Fig. 16(a). Since we are using closed page mode for row buffer management, a RAS must first be performed to activate the pages at the source and destination. The CAS_S2D command is then issued. The CAS_S2D command consumes twice as many cycles as a regular CAS command, since it contains two sets of addresses, i.e., one for the source and one for the destination. Once the CAS_S2D command reaches the source, DIMM 0, a normal CAS write command (CAS WR DST) is created and forwarded to DIMM 2 by the router in the DIR of DIMM 0. DIMM 0 initiates a CAS read, and DIMM 2 initiates a CAS write one cycle later when the CAS write command reaches it. The timing of data transfers from the fast partition to the slow partition happens to line up correctly no matter what level of the slow partition we are writing to. Therefore, no additional delays are required to line up the timings of the source and destination DIMMs.
Fig. 16(c) shows the timing diagram for performing a read from the slow partition to the fast partition. When the CAS_S2D command reaches destination DIMM 0, a CAS read command (CAS RD SRC) is created by the router in the DIR and forwarded to source DIMM 2. In this case, additional delays are needed at the destination DIMM for the CAS write command to the DRAM chips.
The memory controller is responsible for specifying how large the additional delay should be in order for the source and destination data timings to line up correctly. The delay depends on how many levels apart the source and destination are. For a distance of N levels, the additional delay for the destination DIMM is 2N cycles. Therefore, for a distance of one level, as shown in Fig. 16(c), the delay of the destination is two cycles.
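As a small illustration of the 2N rule, the following Python sketch computes the delay the memory controller would place in a CAS_S2D command for a read from the slow partition into the fast partition. The function name and the level-numbering convention (level 1 is the level closest to the CMP) are assumptions made only for this example.

```python
def destination_delay_cycles(source_level, destination_level):
    """Extra cycles the destination DIMM waits before its CAS write.

    Applies to reads from the slow partition into the fast partition;
    the delay is two cycles per level separating source and destination.
    """
    levels_apart = abs(source_level - destination_level)
    return 2 * levels_apart


# Source DIMM 2 on level 2, destination DIMM 0 on level 1, as in Fig. 16(c).
delay = destination_delay_cycles(source_level=2, destination_level=1)
```

For a source two levels below the destination, the same rule gives a delay of four cycles.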
The delay of the destination can be implemented with a buffer, similar to the way the additive latency (AL) parameter is implemented in a DRAM chip. We choose, however, to implement it in the DIR of the T-DIMM to avoid having to modify the commodity DRAM chip. The changes to the DIR are shown in white in Fig. 17. The router needs to be modified to recognize the CAS_S2D command, create CAS read or write commands for the situations described above, and route data from the source DIMM to the destination DIMM. A new DELAY CMD structure must be added in order to delay the commands to the DRAM chip based on parameters in the CAS_S2D command set by the memory controller.
C. Address Mapping
In order to take advantage of direct DIMM-to-DIMM transfer, it is important to have the right address mapping. For the partitioned DIMM tree shown in Fig. 16(a) , with the wrong mapping, we could get data transfers through level 1 which would interfere with normal non-page transfer transactions.
For example, we might get a transfer from DIMM 1 to DIMM 2. Ideally, we would want the transfers isolated between a fast partition DIMM and its subtree. For example, transfers to/from DIMM 0 can only happen to/from DIMMs 2 and 4. Likewise, transfers to/from DIMM 1 can only happen to/from DIMMs 3 and 5. Fig. 18 shows a closed page mode address mapping we could use for the fast and slow partitions given a physical address from the CMP. The slow partition rank number in the partitioned DIMM tree would be the rank obtained from the slow partition address map plus the number of ranks in the fast partition. For example, physical address 0x0 would map to rank 0 in both the fast and slow partition address maps, but would map to ranks 0 and 2 for the fast partition DIMM and slow partition DIMM, respectively, in the partitioned DIMM tree.
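The rank offset and the subtree isolation property can be checked with a short Python sketch. It does not reproduce the bit layout of Fig. 18: it assumes, purely for illustration, that ranks are interleaved at page granularity, and all function names are hypothetical. The two fast partition ranks (DIMMs 0 and 1) and four slow partition ranks (DIMMs 2 to 5) follow the example tree in the text.

```python
PAGE_BYTES = 4096
FAST_RANKS = 2   # DIMMs 0 and 1
SLOW_RANKS = 4   # DIMMs 2, 3, 4, and 5


def fast_rank(addr):
    """Rank from the fast partition address map (assumed page interleaving)."""
    return (addr // PAGE_BYTES) % FAST_RANKS


def slow_rank_in_map(addr):
    """Rank from the slow partition address map (assumed page interleaving)."""
    return (addr // PAGE_BYTES) % SLOW_RANKS


def slow_rank_in_tree(addr):
    """Slow partition rank in the tree: map rank plus the number of fast ranks."""
    return slow_rank_in_map(addr) + FAST_RANKS


def subtree_root(tree_rank):
    """Fast partition DIMM above a slow DIMM: 2 and 4 sit under 0, 3 and 5 under 1."""
    return tree_rank % FAST_RANKS


# Every page should move only between a fast partition DIMM and its own subtree.
isolated = all(
    subtree_root(slow_rank_in_tree(page * PAGE_BYTES)) == fast_rank(page * PAGE_BYTES)
    for page in range(64)
)
```

Under this assumed interleaving, physical address 0x0 lands on rank 0 in the fast partition and rank 2 in the tree, and no page transfer crosses level 1.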
X. EXPERIMENTAL FRAMEWORK AND RESULTS
A. Experimental Framework
The simulation environment and the collection of traces are the same as described in Section VI. We mix the traces as described in Table VII in order to create traces with a larger memory footprint than those used in Section VI. The mixes are categorized by how much they stress the DRAM system, i.e., low, med, and high.
B. Results
In this section we discuss the results of our simulation of DIMM trees with branching factor four. In Fig. 19 we show the throughput of our proposed partitioned DIMM tree against the DIMM tree architecture with 256 DIMMs and one rank per DIMM. The DIMM tree architecture is labeled "R256.DT". The partitioned DIMM tree without direct DIMM-to-DIMM transfer is labeled "R256.PDT". The partitioned DIMM tree with direct DIMM-to-DIMM transfer is labeled "R256.PDT.D2D". The throughput of the partitioned DIMM tree without DIMM-to-DIMM transfer varies greatly relative to the DIMM tree, from degrading throughput by as much as 44% in high_100M_mix_1 to improving throughput by 32% in med_100M_mix_1. There is a clear benefit when adding direct DIMM-to-DIMM transfer to the partitioned DIMM tree. The partitioned DIMM tree with direct DIMM-to-DIMM transfer has higher throughput than the DIMM tree architecture in all cases, with an average increase of 19% and up to 35% with med_100M_mix_1.
Fig. 20 shows the breakdown of accesses to the fast partition page table for a partitioned DIMM tree. The accesses are broken down into hits, misses to a page that already has a request in the pending fast partition page fault queue (labeled "miss pending replace"), compulsory misses that cause an entry to be evicted (labeled "compulsory miss evict"), and noncompulsory misses that cause an entry to be evicted (labeled "noncompulsory miss evict"). The data shows that most accesses hit in the fast partition page table, with an average hit rate of 86%, ranging from 67% with high_100M_mix_1 to 99% with med_100M_mix_1. Only around 1% of the time do we evict an entry from the fast partition page table. On average 13% of the accesses, ranging from 1% in med_100M_mix_1 to 32% in high_100M_mix_1, are waiting for a new page to be transferred to the fast partition.
Fig. 21 shows the breakdown of transaction types between normal DRAM transactions and transactions created for transferring pages between the fast and slow partitions (labeled "DIMM-to-DIMM transactions") in a partitioned DIMM tree with direct DIMM-to-DIMM transfers. Even though the percentage of fast partition page faults shown in Fig. 20 is low, the number of transactions created for transferring pages is high, since transferring a 4 KB page requires 64 transactions of 64 bytes each. The percentage of transactions for DIMM-to-DIMM transfers is 27% on average, ranging from 2% with med_100M_mix_1 to 47% with high_100M_mix_2. Even though the DIMM-to-DIMM transactions make up a significant part of the total transactions, throughput does not degrade, since normal transactions to other fast partition DIMMs can still be serviced while transfers are happening within a fast partition DIMM's subtree.
XI. RELATED WORK
Tan et al. [26] demonstrated a 10 Gbit/s optical link with GaAs based technology to implement an optical multi-drop memory bus. Vantrease et al. [29] proposed Corona, which used photonic links to provide high bandwidth to the DRAM. However, their design required a 3-D layout and is incompatible with commodity DRAM parts. Beamer et al. [2] redesigned the entire memory system all the way down to the banks in order to support silicon photonics. While much research is being done with optical interconnect, an optical memory bus still suffers from several critical problems. First the photonic GaAs compound technology is still immature and incompatible with silicon-based DRAM commodity fabrication. Second, critical optical building blocks such as a silicon laser and the Ge p-i-n photo detector [1] are extremely sensitive to temperature/process variations. In contrast, our demonstrated Multiband RF-I is fully compatible with the low-cost CMOS manufacturing and is ready for use now, unlike optical technologies. However, once optical interconnect technology does mature enough, our DIMM architectures can be implemented with optical links instead of RF-I.
Ko et al. [16] demonstrated a MRF-I board using BPSK modulation. The demo achieved a data rate of 3.6 Gb/s/pin with two RF bands per pin. Therefore, each RF band was able to achieve 1.8 Gb/s. The RF-I transceivers were manufactured in a 0.18 μm, 1.8 V CMOS technology. However, the BER was too high to be used for DDR3, which requires a BER of . Byun et al. [4] demonstrated a MRF-I board using ASK modulation with differential signaling. The demo achieved a data rate of 5 Gb/s/pin and achieved a BER of less than .
Fully buffered DIMM (FB-DIMM) [8] was designed to reduce load by interfacing all signals through the advanced memory buffer (AMB) and encoding everything as packets. The AMB connected each FB-DIMM in a point-to-point manner using a high-speed serial link operating at six times the DRAM clock. However, FB-DIMM consumed considerably more power than a conventional DDRx DIMM due to its high-frequency serial links and the power-hungry AMB used to decode, store, forward, and encode packets.
Ousterhout et al. proposed RAMClouds [21]. RAMClouds keep data resident in main memory rather than on hard disk by aggregating DRAM from many systems together. They showed that latency and throughput could be improved by 100x to 1000x. The partitioned DIMM tree could be used in each system within the RAMCloud to increase the total RAMCloud capacity, or it could be used to reduce the number of systems needed within a RAMCloud for the same amount of memory capacity.
Qureshi et al. [22] proposed using phase change memory coupled with a DRAM cache to increase the capacity of the DRAM system but without degrading throughput. The partitioned DIMM tree could also use phase change memory instead of DRAM on the slow partition DIMMs to increase capacity even further. The partitioned DIMM tree also has the advantage that it can scale to a much larger number of DIMMs and therefore much larger memory system capacities.
Vantrease et al. [29] proposed Corona, which used photonic links to provide high bandwidth to DRAM. However, their design required 3-D chip stacking and is incompatible with commodity DRAM parts. Furthermore, optical interconnects in general are incompatible with low-cost CMOS logic processes.
XII. CONCLUSION
The DRAM system is one of the most critical components in any modern computing system. We are reaching a point where we are pushing the limits of traditional interconnect technology for DRAM, sacrificing throughput for capacity. The DIMM tree architecture [27] allows DRAM systems to scale to many DIMMs without sacrificing throughput, while reducing the number of pins needed to interface with the memory controller. We have shown that the DIMM tree architecture can scale up to 64 DIMMs with only an 8% reduction in throughput relative to an ideal multi-drop system. By adding MRF-I to the DIMM tree architecture, we are able to scale even further than a system with just the DIMM tree architecture or with just MRF-I. Using four RF bands per pin with a DIMM tree of 64 DIMMs, we see an average increase of 68% (up to 200%) in throughput over a four-DIMM multi-drop system with four RF bands per pin. We have also shown that the additional structures required to support the DIMM tree architecture only require 5% more power than a conventional DDRx DIMM and 2% more power than an LR-DIMM. The DIMM tree architecture is a high-capacity, high-throughput DRAM system for future many-core CMPs running many applications or threads concurrently.
In order to scale the DIMM tree architecture to many more DIMMs without degrading throughput, we proposed the partitioned DIMM tree. The partitioned DIMM tree partitions the DIMM tree into a fast partition and a slow partition. The fast partition is made up of the faster levels of the DIMM tree and acts as a cache for pages in the slow partition. We also proposed the addition of direct DIMM-to-DIMM transfers for pages between the fast and slow partitions. With 256 DIMMs, the partitioned DIMM tree with direct DIMM-to-DIMM transfers is able to improve throughput by 19% on average (up to 35%) over the DIMM tree architecture. While we only cover the partitioned DIMM tree with one logical channel per physical channel, using MRF-I to support multiple logical channels per physical channel can lead to an even larger number of DIMMs with higher throughput using the partitioned DIMM tree. The highly scalable partitioned DIMM tree is a promising solution for high-capacity, high-bandwidth systems of the future.
He was also the first to demonstrate CMOS RFICs in the terahertz frequency range at 324 GHz. He has authored or co-authored over 250 technical papers and 10 book chapters, authored one book, edited one book, and holds 20 U.S. 
