Abstract-Despite power boundaries, Moore's law still holds via scaling the number of cores, which keeps adding demand for more memory bandwidth (MBW) requested by these cores. To obtain higher MBW levels, it is fundamental to address memory controller (MC) scalability. However, MC scalability growth is limited by I/O pin-count scaling. To address MC and pin scaling, a radio frequency (RF) I/O pad-scalable package-based (RFiop) memory organization is further investigated. In RFiop, an RF pad (RFpad) is defined as a quilt-packaging (QP) coplanar waveguide employed at RF ranges. An RFpad connects a rank to an RFMC, which is formed by coupling an MC to RF transmitters/receivers. By using the QP package to explore the architectural benefits of laying out ranks, RFiop replaces the traditional memory path with an RF-based one, while exploring the scalability of RFpads/RFMCs via RF signaling. When evaluating RFiop, our findings show that MBW/performance are enhanced by around 4.3x, which can be viewed as a reduction in transaction queue occupancy/latency, while using a reduced and scalable 4-8 RFpads per RFMC. RFiop architectural area benefits allow MBW/performance improvements of around 3.2x, while reducing interconnection energy by up to 78%.
2) ranks (commercially known as dual in-line memory modules, or DIMMs, which are sets of memory banks with aggregated data output and shared addresses) with larger widths.
Having larger total width means employing a larger number of MCs (MC counts or MC scalability). Despite being a low-cost and proper design alternative for low MC counts, given international technology roadmap for semiconductors (ITRS) pin-count limitations [3], as core counts grow to tens/hundreds, DDR technologies present significant I/O pin-count scalability restrictions, thus limiting the number of MCs, which further restricts MBW and performance. For example, the 16-core Bulldozer [4] and 64-core Tile64 [5] processors have four MCs.
More advanced commercial solutions such as Intel fully buffered DIMM (FBDIMM) [6], hybrid memory cube (HMC) [7], and RAMBUS XDR2 [8], all of which employ serialization, accompanied by adaptive equalization in the latter, are still bound by unscalable I/O pins, which restricts the scalability of the number of MCs and, as a consequence, MBW benefits. Alternatively, while using 2) much wider ranks and presenting no I/O pin/scalability restrictions, scaling MCs in 3-D stacking is reported [9] to be limited by temperature when scaling ranks, thus restricting memory parallelism.
Optical- and radio-frequency (RF)-based memories are technologies that combine telecommunication transmission techniques and fast media on the memory path to address I/O pin scalability. The former solutions employ wavelength division multiplexing and optical fibers to connect processor and memory through optical MCs and scalable optical pins [10]. While still restricted in terms of development costs, optical transmission has advanced significantly in regard to temperature sensitivity [11]. Instead, by sharing manufacturability with CMOS, RF shares its low costs, while remaining advantageous in terms of energy and millimeter-range delays when compared to optical transmission, as reported in [12].
Very importantly, RF transmission has been appointed [13] as one of the areas that can improve processor performance the most, and Tam et al. [14] state that in the 1-10 cm range (which is well within regular package distances [13]) RF transmission is more energy-efficient than optical and digital (traditional) transmission.
Coplanar waveguide (CPW) and microstrip are examples of RF-interconnection types that could be employed along the memory path and placed on the package. In particular, CPW quilt-packaging (QP) lines [15] were prototyped and manufactured, which demonstrates the viability of an on-package RF interconnection that could be used to connect processor to memory.
In the RF domain of scalable-width solutions, the RFiop system employs the package area, which fits ranks (assumed to be manufactured as dies), and QP lines to connect ranks and MCs as a likely solution to improve MBW. Exploring Polka et al.'s [13] guidelines toward improving MBW on the package area, and compared to the RFiop organization [16] previously proposed, this report further leverages the space of scalable-width memory solutions through the following contributions.
1) Given the potential growth in the number of cores, RFiop MBW and latency are further evaluated, and a sensitivity analysis is performed under a larger number of cores (twice that of the previous publication).
2) Through detailed, accurate system simulation, the performance, area, and power architectural implications of replacing an MC with an equivalent RFMC are further investigated, and the most important ones are identified.
3) An RF behavioral model of the RFpads (defined as QP lines in [16]) is introduced. This model includes the following important RF parameters: insertion loss (IL), return loss (RL), and crosstalk noise (CN). Since RFpads are QP lines, the model is obtained from regression over the QP RF simulations performed in [15]. To the best of our knowledge, it is the first time that such a model for these losses is developed using regression. The model allows the designer to predict the RF behavior of RFpads over a wide RF bandwidth range appropriate for future memory solutions.
4) Given the wide variability and complexity of DDR systems, the benefits of RFiop are further validated for different types of memories with different settings (such as data rates and timing parameters) and memory generations. Furthermore, the investigation of RFpad-count scalability for faster memories is extended.
5) Several area and power/energy benefits of RFiop are newly presented and discussed, including comparisons of RFMCs versus traditional MCs.
6) Not previously covered, RFiop is compared to other state-of-the-art memory systems such as HMC [7], and its manufacturing viability is discussed.
7) To the best of our knowledge not previously discussed, this paper demonstrates that scaling ranks laid out on the package area presents lower temperature restrictions than stacking ranks (3-D stacking).
8) Further RFiop architectural benefits are investigated for other MBW-bound benchmarks.
9) Further approaches to RFiop's limitations are analyzed.
QP is a technique where quilt lines [15] are introduced: these lines are CPWs built as extensions of the processor and memory dies, coupled to face each other to enable a low return loss (RL). As CPWs, QP lines present RF properties and can therefore be used as RF interconnections.
The rest of this paper is organized as follows. Section II introduces the motivation of the I/O pad/pin problem. Section III presents RFiop and compares RF technology to other advanced solutions in terms of approaching the I/O pin/pad problem. Section IV describes the experiments, while Section V depicts the related work. Section VI draws the conclusion.
II. MOTIVATION, BACKGROUND, MECHANISMS TO ACHIEVE PIN/PAD SCALABILITY, AND RF BACKGROUND FOR RFPADS
In this section, the impact of the I/O pin problem on MBW limitations and pin/pad-count scalability is illustrated through a sequence of steps. Next, a formulation is introduced to show how current memory technologies and common optical/RF-memory mechanisms, respectively, achieve higher MBW-per-pin/pad and promote pin/pad scalability. In addition, and very importantly, RF background is introduced to facilitate understanding of RFpad behavior.
A. Motivation: I/O Pad/Pin Problem
A baseline reference should be defined to estimate RFiop's further architectural benefits. The baseline strategy determination proposed in [16] is adopted to establish likely MBW/pin requirements. In this strategy, for processors currently in the market, the number of cores as well as a minimum threshold for the number of MCs and pins is determined. For example, for a two-core traditional out-of-order (OOO) microprocessor, one MC is typically utilized, while for a four-core microprocessor, two MCs are employed, and for a 16-core one, four MCs [4] are used. In this example, by observing core counts and numbers of MCs for DDR-family generations, a logarithmic behavior of MC counts as a function of the number of cores can be noted, and a likely estimation for a future 32-core OOO processor is five MCs (defined in this paper as the baseline MC count); thus, the core:MC ratio is 32:5.
Using the reports from [13] and ITRS [3] predictions, in combination with the previously determined core:MC ratio, pin counts are estimated next.
To understand the MBW requirements of a likely 32-core system, an MBW characterization is proposed. In this characterization, in order to guarantee that addresses are equally distributed along the ranks so that no advantage is taken of locality [17], the most conservative addressing mode is adopted by interleaving cache lines along the RFMCs, and closed page mode (server) is employed in all experiments.
The characterization experiments are divided into two sets: 1) in the first, the MBW of one rank is derived to calibrate/validate the system and 2) in the second, 1) is extended to the maximum core:MC ratio, while comparing MBW and pin count in both. A detailed list of the parameters used in these experiments can be found in Table III(a). The rank selected to perform the MBW characterization scaling in 1) is a generic 1-GB-DDR3 DIMM, with 64 data bits and a 1333 MT/s data rate, based on Micron MT41K128M8 [2] [Table III(a)]. MCs are individually connected to independent ranks to extract their maximum MBW. In this characterization, two experiments are performed: in 1), the core:MC ratio adopted is 1:1, and this baseline system is modeled as a set of one core/MC/crossbar/selected rank [settings in Table III(a)] using the M5 [18] and DRAMsim [17] simulators, while MBW is measured using an average of the STREAM [19] benchmarks. This experimentation reports a 2.5 GB/s MBW, which confirms proper calibration and validity, since it fits within the MBW magnitude range reported by Micron [2].
Experiment 2) starts with determining the number of pins employed on each rank: as a first observation, in a regular chip, 50% of the total pads are destined for power purposes, while the other 50% are destined for the remaining signals. Further investigation of Micron manuals [2] shows that 50% of the 240 pins available, i.e., around 120 pads, are dedicated to control/data signals, while the rest are dedicated to power.
To estimate the maximum number of MCs that fit on the on-package area, Marino's assumptions [16] are utilized: 16 rank dies can fit within the package area, and each rank is connected to a different MC (thus 16 MCs) so that the MBW of each rank can be fully explored. Therefore, with the previously assumed 32 cores, the core:MC ratio is 32:16.
The same simulators and benchmark suite of 1) are used in 2), but with a 32:16 core:MC ratio rather than the 1:1 one, as well as scaling pad counts linearly with MC counts. The results of this scaling are reported in Fig. 2, where it is observed that 1920 pads (or 3840 pins, using the previously assumed 1:2 pad:pin proportion) are needed to achieve the 32:16 core:MC ratio and its 30.4 GB/s MBW, a significantly larger amount than the ITRS upper limit of 1023 pads [3]. These findings show that, when comparing the maximum MBW obtained for the 32:16 core:MC ratio to the baseline (core:MC ratio of 32:5), a significant MBW improvement factor of 2.7x (30.4 GB/s over 11.25 GB/s) is obtained. In conclusion, larger MC counts significantly benefit MBW, which motivates the search for pin-scalable solutions.
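The pad-scaling arithmetic above reduces to a few multiplications; the following is a minimal sketch assuming the figures quoted in this section (120 control/data pads per MC, the 1:2 pad:pin proportion, and the 1023-pad ITRS limit), with illustrative constant names rather than simulator parameters.

```python
# Sketch of experiment 2)'s pad scaling; constants taken from the text above.
PADS_PER_MC = 120      # control/data pads per rank interface (Micron [2])
PAD_TO_PIN = 2         # 1:2 pad:pin proportion assumed in the text
ITRS_PAD_LIMIT = 1023  # ITRS package pad budget [3]

for mcs in (5, 8, 12, 16):
    pads = PADS_PER_MC * mcs
    pins = pads * PAD_TO_PIN
    status = "exceeds" if pads > ITRS_PAD_LIMIT else "within"
    print(f"{mcs:2d} MCs -> {pads:4d} pads / {pins:4d} pins ({status} ITRS limit)")
```

At 16 MCs, the sketch reproduces the 1920 pads/3840 pins quoted above, well past the ITRS budget.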
B. Background: Current Memory Solutions Do Not Scale
The main focus of current commercial solutions [2] consists of maximizing MBW by generally increasing the frequency and/or the width of the bus that connects the MC to the rank, while keeping MC counts at lower magnitudes due to pin restrictions. To understand how commercial strategies employ current design parameters, we begin with

bsr = memory_bus_width * freq_multiplier * freq (1)

where bsr represents the maximum MBW supplied by the rank, memory_bus_width the width of the memory bus, freq_multiplier the bus frequency multiplier, and freq the frequency of the memory bus. For a pad, we define

bpp = bsr / number_of_available_iopads (2)

where bpp is the MBW per pad and number_of_available_iopads the number of available I/O pads.
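As a hedged numerical reading of (1) and (2), the sketch below plugs in the DDR3-1333 rank used in Section II-A (64-bit bus, approximated as a 666 MHz bus clock with a 2x transfer multiplier, and 120 control/data pads); the function names mirror the equation symbols and are not from any vendor API.

```python
def bsr(memory_bus_width_bits, freq_multiplier, freq_hz):
    """Eq. (1): maximum MBW supplied by the rank, in bytes/s."""
    return memory_bus_width_bits / 8 * freq_multiplier * freq_hz

def bpp(bsr_bytes_per_s, number_of_available_iopads):
    """Eq. (2): MBW per pad, in bits/s per pad."""
    return bsr_bytes_per_s * 8 / number_of_available_iopads

peak = bsr(64, 2, 666e6)                     # ~10.7 GB/s peak for DDR3-1333
print(peak / 1e9, "GB/s")
print(bpp(peak, 120) / 1e9, "Gb/s per pad")  # ~0.71 Gb/s/pad over 120 pads
```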
As the previous experiments have illustrated, current DDR3 memories present around 180-240 I/O pins/MC [2], which are clearly not scalable. Furthermore, using (2) with the 32-55 pin range that represents the large pin counts of commercial solutions (e.g., Intel FBDIMM [6] with 48 pins/MC and 2.5 Gb/s/pin; RAMBUS XDR2 [8] with 32 pins/MC and 12.8 Gb/s/pin; HMC [7] with 55 pins/MC and 10 Gb/s/pin; typical DDR ranks [2] with 123 MC pins and 1.2-5 Gb/s/pin), lower MBW-per-pin rates are obtained, which remains a challenge when more MBW is required, thus motivating the search for pad/pin-scalable solutions.
C. Mechanisms to Achieve Pin/Pad Scalability: Optics and RF
In this section, the tradeoffs involved when adopting RF/optical technologies to approach pin/pad scalability are explained via modeling modulation signaling principles.
In both RF and optics, high MC scalability can be obtained via modulation combined with very low latencies (light-speed or high-frequency transmission), respectively, over electrical wires or fiber. Equation (1) is modified to estimate the benefits of modulation. Using the total data rate, tdr, results in

tdr = number_carriers * data_rate_per_carrier (3)
bpp = tdr / number_of_available_iopads (4)

where number_carriers also represents the number of wavelengths when optical systems are referred to. For example, optical Corona [10] is reported to have two I/O optical pins, i.e., two optical fibers between the MC and the ranks, and is thus scalable. In this case, (4) applied to Corona [10] indicates that

bpp = 160 GB/s / 2 pins = 640 Gb/s/pin (5)

which is much larger than vendor solutions (12.8 Gb/s/pin [8]). Similarly, as further explained, the typical 30-140 Gb/s data rates used in RF are able to support typical DDR data rates using a low amount of wires/pad counts.
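The following is a minimal sketch of (3) and (4) using Corona's published figures [10]; the helper names are illustrative only.

```python
def tdr(number_carriers, data_rate_per_carrier_bps):
    # Eq. (3): total data rate over one physical line via modulation
    return number_carriers * data_rate_per_carrier_bps

def bpp(tdr_bps, number_of_available_iopads):
    # Eq. (4): bandwidth per pad; grows with carriers, not with pads
    return tdr_bps / number_of_available_iopads

corona_total = 160 * 8e9           # 160 GB/s expressed in bits/s
print(bpp(corona_total, 2) / 1e9)  # 640.0 Gb/s/pin over two optical pins
```

Next, an RF background and modeling are provided to understand the RF behavior of the RFpads.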
D. RF Background for RFpads
To facilitate understanding of the RF behavior of RFpads, a simple model by Liu [15] is adopted. In this model, the characteristic impedance of a QP line is defined as Z0. When the load impedance Zl is different from Z0, the wave arriving at the termination is partially reflected back to the generator, which enables defining the reflection coefficient at the termination, γ(0), as the ratio of the reflected wave to the incident wave in the following way:

γ(0) = V0− / V0+ = (Zl − Z0) / (Zl + Z0) (6)

where V0+ is the incident wave amplitude at z = 0 and V0− is the amplitude reflected back from the load. RL is defined as the available power at the transmission line that will not be delivered thoroughly to the load, and is represented (in dB) as

RL = 20 · log |γ(0)| dB (7)

or

RL = 20 · log(S11) dB. (8)

The reflection coefficient γ(l) at a distance l from the load can be expressed as

γ(l) = γ(0) · e^(−2jβl) (9)

where β is the propagation constant of the line. Then, the input impedance Zin can be defined as

Zin = V(−l) / I(−l) (10)
Zin = Z0 · (Zl + j·Z0·tan(βl)) / (Z0 + j·Zl·tan(βl)) (11)

where V(−l), I(−l), Z0, and Zl are, respectively, the voltage and current at distance l from the load and the impedances at distance 0 and l. With those, the power delivered (Pin) to the transmission line at z = −l can be represented as

Pin = (1/2) · Re{V(−l) · I*(−l)} = (|V0+|² / (2·Z0)) · (1 − |γ(0)|²) (12)

and the power loss through the transmission line can be defined as the difference between Pin and the power Pl delivered to the load

Ploss = Pin − Pl. (13)

Defining the reflection coefficient at the source (γg) and the incident amplitude V0+ as

γg = (Zg − Z0) / (Zg + Z0) (14)

and

V0+ = Vg · Z0 · e^(−jβl) / [(Z0 + Zg) · (1 − γg·γ(0)·e^(−2jβl))] (15)

IL can then be defined as the ratio of the power delivered to the load to the power available from the generator

IL = 10 · log(Pl / Pg) dB. (16)

Alternatively, as defined by Liu [15], using a symmetric general two-port transmission line from port 1 (at voltage V1, with incident wave V1+ traveling to the right and reflected wave V1− traveling to the left) to port 2 (at voltage V2, to the right of port 1, with incident wave V2+ traveling to the left and outgoing wave V2− traveling to the right), the S11 and S21 parameters can be defined as

S11 = V1− / V1+ |(V2+ = 0),  S21 = V2− / V1+ |(V2+ = 0). (17)
In the above model, RL is represented by S11 and IL by S21. Very importantly, (6)-(17) represent a general and simple CPW model. According to Liu [15], it is very challenging to represent and quantify QP line parameters using closed equations such as those exemplified previously, due to the CPW's frequency-dependent parameters and the complex discontinuities between different parts of its structure, especially at high bandwidth.
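For concreteness, the lossless-line relations (6), (7), and (11) reconstructed above can be evaluated numerically; the sketch below uses illustrative impedance values, not QP measurements.

```python
import math

def gamma_load(Zl, Z0):
    # Eq. (6): reflection coefficient at the termination
    return (Zl - Z0) / (Zl + Z0)

def return_loss_db(Zl, Z0):
    # Eq. (7): RL in dB (negative values, as in Fig. 4(a)'s convention)
    return 20 * math.log10(abs(gamma_load(Zl, Z0)))

def input_impedance(Zl, Z0, beta_l):
    # Eq. (11): Zin = Z0 (Zl + j Z0 tan(beta l)) / (Z0 + j Zl tan(beta l))
    t = math.tan(beta_l)
    return Z0 * (Zl + 1j * Z0 * t) / (Z0 + 1j * Zl * t)

print(return_loss_db(55, 50))                # ~-26.4 dB for a mild mismatch
print(input_impedance(55, 50, math.pi / 4))  # Zin an eighth-wavelength away
```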
To approach these challenges for QP [15], the Ansoft HFSS 3-D electromagnetic field solver [20] was adopted to determine the RL (S11), IL (S21), and CN of a QP CPW. Furthermore, besides RL and IL, crosstalk noise (CN) was also investigated by Liu [15]: by simulating several ground-line configurations between QP lines, Liu [15] shows that the isolation between different QP lines is improved.
As a result, many different curves of RL, IL, and CN were obtained for a wide variety of frequencies. Fig. 4(a) illustrates the obtained RL, IL, and CN; these parameters increase proportionally with frequency. RF behavior is further approached in Section IV-B.
III. RFIOP
In this section, RFiop memory organization techniques explore RFpad scalability, which enables RFMC scalability. In order to minimize I/O pin counts and achieve RFMC scalability, memory channels are best matched with RF. While the I/O pin count of each individual MC is minimized, the total pin count must be scalable, targeting MBW increase while keeping power utilization at low levels.
A. RFiop Overview and General Design Rules
A general view of RFiop can be found in Fig. 3. RFiop employs the following strategies: 1) a minimal amount of elements designed for RF and 2) short distances. Fig. 1 illustrates RFiop's memory path, which is composed of: 1) RFMCs, formed by coupling MCs to RF transmitters (TX) and receivers (RX) and placed on the processor die; 2) off-die RF-interconnection lines; and 3) on-package ranks placed on the rank dies in a coplanar fashion. In each RFMC, the RF TX/RX are responsible for modulating/demodulating data/commands. Modulated signals (RF waves) are transmitted/received through the RF QP lines. To address RF-transmission challenges, fewer elements, i.e., the RF TX/RX at the RFMCs, the RFpads (QP), and the ranks, are employed when compared to typical solutions [21].
Furthermore, the fact that all elements in RFiop are properly designed for RF minimizes the previously mentioned RF degradation effects (RL, IL, and CN). The short distances employed in RFiop can be traversed through the QP lines that connect the RFMCs to the ranks, allowing significantly lower degradation effects than those along long printed-circuit-board traces, as reported in [22].
B. Ranks Manufactured as Dies and Rank Width
Before other new technologies such as HMC [7] were developed, RFiop employed ranks manufactured as DDR dies, each die containing its own set of TX/RX to be able to communicate with the RFMCs (at the processor die). In RFiop, the fact that ranks operate as traditional DDR elements allows compatibility with memories in the market, thus not requiring any protocol or memory timing change. In Fig. 1, a memory die with its RF TX/RX is connected to the core (with its RFMCs, i.e., MCs coupled to RF TX/RX). To keep DDR compatibility along future DDR-memory generations, RFiop employs the typical DDR-rank width, i.e., 64 bits (8 B) [2]. The width aspect is further discussed.
C. RFiop Signal Path
In Fig. 3, the interface between the TX/RX elements and the MC (forming an RFMC) and the RFpads is illustrated: TXs/RXs are assumed to be present on each RFMC and rank, and upon a cache request, signals go through the RFMC TX, where they are converted to analog waves. Next, they traverse the waveguide/CPW and reach the RX, where the analog waves are converted back to digital signals in order to reach the busses and a rank. The signal traverses the same path in the opposite direction when a rank responds, and at the RFMC RX it is converted back down to digital before reaching the processor.
D. RFiop Viability
RFiop viability relies on QP lines. The fact that QP was prototyped and tested for bandwidths up to 60 GHz, while presenting low-magnitude RL (0.1 dB), demonstrates the viability of RFpads. Moreover, having been simulated for bandwidths up to 200 GHz, QP lines reduce the number of pads, which is aligned with the pad-reduction goals.
In general, RF design explores the maturity achieved in CMOS manufacturing and is therefore a very consolidated technology. Since putting chips down and sliding them to match each other is a straightforward process according to [15], QP lines are reported to be manufacturable through the programmability of already existing industry tools, such as pattern recognition of the modules. Self-alignment structures are easily built into the shapes of the nodules, as indicated in [15]. Deep reactive ion etching can be used to separate chips from wafers.
E. RFiop Limitations and Approaches to Address Them
The following approaches address the previously mentioned RFiop limitations.
1) Manufacturing technology evolution is likely to allow a twofold reduction of the area used by the cores, thus likely allowing more ranks to be fit, which enables a larger core:package-area ratio. 2) Other than QP, microstrips and striplines could potentially be employed as RF-interconnection lines in RFiop [12], allowing other benefits such as lowering costs, improving data rates, and/or reducing losses.
IV. EXPERIMENTAL SECTION
MBW, latency, number of pads, energy, area, and temperature are the key technical elements that help the researcher understand the goals and achievements of RFiop. To evaluate these RFiop elements, an experimental infrastructure composed of mathematical modeling and several detailed, accurate simulators is employed as follows.
1) Determination of the QP RF bandwidth ranges needed to match memory data rates and minimize the number of RFpads. 2) Mathematical modeling for IL, RL, and CN, obtained (to the best of our knowledge, for the first time) via regression from the RF-behavioral simulations performed by Liu [15]. 3) Mathematical pad-scaling modeling to determine the behavior of the number of RFpads as a function of rank data rates and width. 4) The M5 simulator [18] to simulate the multicore system running MBW-bound applications. 5) The Hotspot tool [25] to determine the thermal behavior of the RFiop memory organization. The first three steps guide the RFpad behavioral modeling in terms of RF behavior and scaling. The remaining steps allow extracting the performance, power, and temperature implications of RFiop.
A. Determination of RF Frequency Ranges to Match Memory Data Rates
To first order, the MBW provided by each rank dictates the number of lines required: not considering loss effects, the ratio between the rank MBW and the RFpad RF bandwidth determines the number of RFpads needed to match the rank data rate.
To show the benefits of an RF-based memory path, since QP was manufactured and its RF properties validated, QP lines/parameters are employed as the RF-interconnection lines between RFMCs/ranks in RFiop without any loss of generality.
To determine the number of RFpads (RFpad counts), the number of QP lines is required: the key is to match the QP data rate to the rank data rate. QP data rates are estimated with the on-chip RF scaling predictions by Chang et al. [12] [Table I(b)]. Though valid for on-chip interconnections, these are also considered valid when connecting two different dies via QP. A second reason justifying this strategy is the significantly reduced inter-die distance in QP (around 40 µm), completely within typical on-chip distance ranges. The RFpad count determination is performed under three strategies: 1) considering the simulated QP bandwidth (200 GHz [15]); 2) considering the validated QP bandwidth (60 GHz [15]); and 3) taking into account just RF predictions (half of the maximum CMOS frequency carrier in Table I [3], [12]), i.e., regardless of the assumption of QP as RFpads.
In strategy 1), the design and estimation of RFpad counts employ the rank previously used in Section II. The 32-nm technology is assumed in Table I(b); it allows 12 carriers and a data rate per carrier of 8 Gb/s. With a static RF band allocation [12], these carriers are spaced by 32 GHz to avoid crosstalk (further described) that could lead to a high bit error rate (BER). Using the QP bandwidth of 200 GHz [15] and the previous carrier spacing, there are up to six carriers, each with 8 Gb/s of data rate; thus, the overall data rate budget available for each RFMC is 48 Gb/s.
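The carrier-budget arithmetic is direct; a minimal sketch assuming the stated figures (200-GHz QP bandwidth, 32-GHz carrier spacing, 8 Gb/s per carrier):

```python
QP_BANDWIDTH_GHZ = 200     # simulated QP bandwidth [15]
CARRIER_SPACING_GHZ = 32   # spacing chosen to limit crosstalk/BER [12]
RATE_PER_CARRIER_GBPS = 8  # 32-nm RF prediction, Table I(b) [12]

carriers = QP_BANDWIDTH_GHZ // CARRIER_SPACING_GHZ  # -> 6 carriers
budget_gbps = carriers * RATE_PER_CARRIER_GBPS      # -> 48 Gb/s per RFMC
print(carriers, "carriers,", budget_gbps, "Gb/s per RFMC")
```

Next, the important RL, IL, and CN parameters are determined.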
B. Determination of Return Loss (RL), Insertion Loss (IL), and Crosstalk Noise (CN) for RFpads
As mentioned in Section II-D, Liu [15] performed a wide range of simulations using the Ansoft HFSS 3-D electromagnetic field solver [20] in order to determine the RL (S11), IL (S21), and CN behavior of the RFpads. In these simulations, different RFpad widths and different silicon substrate resistivities were utilized over a wide range of frequencies. Specifically, four widths (100, 50, 20, and 10 μm) and two silicon substrate resistivities (high, meaning 8000 Ω·cm, and low, meaning 10 Ω·cm) were simulated, with bandwidths from 0 to 40 GHz for the 100- and 50-μm widths and from 0 to 200 GHz for the 20- and 10-μm widths.
The output magnitudes of these previously simulated losses for the 20-μm-width RFpad are illustrated in Fig. 4(a). In this example, IL is lower than −5 dB, RL stays between −20 and −40 dB, and CN between −60 and −10 dB. If such losses are not acceptable, it is the designer's task to tackle them, for example by using larger separation gaps between lines or by augmenting the number of RFpads.
In order to incorporate the behavior of the RF circuits in the RFpads, the RL, IL, and CN parameters are proposed to be represented via an extensive least-squares quadratic polynomial regression over the wide range of IL, RL, and CN simulations performed in [15], in order to determine their mathematical behavior as a function of frequency within the bandwidth. Without any loss of generality, given the simulated bandwidth magnitudes of 200 GHz, the 20-μm width and high resistivity are conservatively adopted. As a result of this regression, quadratic formulations for RL, IL, and CN as functions of frequency are obtained.
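A minimal sketch of such a fit is shown below; the (frequency, loss) samples are placeholders, not Liu's HFSS results [15], so the printed coefficients are illustrative only — substituting the simulated curves recovers the actual RFpad model.

```python
import numpy as np

freq_ghz = np.array([10.0, 50.0, 100.0, 150.0, 200.0])  # placeholder samples
il_db = np.array([-0.5, -1.2, -2.4, -3.8, -5.0])        # placeholder IL curve

# Least-squares quadratic fit: IL(f) ~ a*f^2 + b*f + c
a, b, c = np.polyfit(freq_ghz, il_db, deg=2)
print(f"IL(f) ~ {a:.3e}*f^2 + {b:.3e}*f + {c:.3f}")

# Repeating the same call over the RL and CN curves yields their models.
```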
C. Determination of the Number of RFpads
For the subsequent modeling, memory read/write operations are assumed, while utilizing the RFpad modeling equations [(21)-(27)] developed in Marino's report [16].
RF delays through TX/RX are not included in the following formulations due to their insignificant magnitudes (around 200 ps [12]) compared to the duration of memory timing operations. To determine the RFpad count behavior, memory_bits, or mb, is defined as

mb = mc * dr (21)

i.e., a function of the number of bits transmitted in one memory cycle, mc, where dr is the memory data rate. The RFiop total cycle (tot_cycle) is limited by the maximum bandwidth allowed in QP (200 GHz [15], as QP is adopted). Keeping the DRAM circuitry as original as possible, dedicated RF-interconnection lines (control and data) for RFpads are included. Considering, respectively, RFpads, RFpaddr, RFpads_data, RFpads_ct, drRFc, and nRFc as the number of RFpads per RFMC, the total RFpad data rate, the numbers of RFpads destined for data and control lines, the data rate per carrier, and the number of RF carriers, the following equations can be utilized:

RFpads = number of RFpads per RFMC (22)
RFpads_data = floor(data_mb / (mc * mb)) (23)
RFpads_data = floor(data_mb / (mc * drRFc * nRFc)) (24)
RFpaddr = drRFc * nRFc (25)
RFpads_ct = floor(ct_mb / (mc * drRFc * nRFc)) (26)
RFpads = RFpads_data + RFpads_ct. (27)
Having inspected ranks with similar features in Micron catalogs [2], excluding voltage, ground, and not-connected pins, around 123 bits are used in one rank access (out of a total of 240 pins, around 50%: 64 for data and 59 for control). Assuming the same rank (1-GB DIMM DDR3 rank, 64 data bits, 1333 MT/s data rate, based on Micron MT41K128M8 [2]) previously employed in the MBW characterization (Section II-A), the total amount of bits (tot_bits) transferred via one RFpad in one memory clock (1/1333 MT/s) is

tot_bits = (1/1333 MT/s) * 6 carriers * 8 Gb/s (28)
floor(tot_bits) = 36 bits. (29)
Therefore, in one memory cycle, only four RFpads are needed to perform an RF transfer of 144 bits, which carries the total of 123 memory bits (64 data plus 59 control). Other widths can be handled by recalculating the equations starting from (21).
According to Chang et al. [12], to mitigate the previously observed IL, RL, and CN effects and minimize the likely BER, as a general rule of thumb the RFpads are doubled. Following this rule, eight RFpads are required to transfer the 64 data and 59 control bits.
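The pad count follows from (28) and (29); a sketch assuming the stated 1333 MT/s rank, six carriers at 8 Gb/s, and the 123-bit access:

```python
import math

MEM_RATE = 1333e6            # memory transfers per second (DDR3-1333)
CARRIERS, RATE_BPS = 6, 8e9  # carriers and per-carrier data rate
BITS_PER_ACCESS = 123        # 64 data + 59 control bits per rank access [2]

bits_per_pad = math.floor((1 / MEM_RATE) * CARRIERS * RATE_BPS)  # -> 36
rfpads = math.ceil(BITS_PER_ACCESS / bits_per_pad)               # -> 4
print(bits_per_pad, rfpads, 2 * rfpads)  # 36 bits/pad, 4 pads, 8 doubled
```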
Very importantly, Fig. 4(b) shows related experiments performed in the initial RFiop report [16]. Comparing Fig. 4(a) and (b), either with faster DDR3 memories (1333 MT/s versus 666 MT/s in the initial RFiop report) or with DDR4/DDR5 models, the RFpads still scale properly, enabling RFMC scaling.
By comparing RFpad scalability to current DDR-based pad counts, assuming a pad:pin ratio of 1:1 and 200-GHz bandwidth (QP parameters [15]), it is concluded that RFiop has 4x more MC pads (8 RFpads) than optical Corona [10], an MC pad reduction of 4x when compared to RAMBUS XDR2, and of up to 6x when compared to FBDIMM.
Before comparing RFiop to HMC [7], a brief background on HMC is presented. An HMC rank is composed of a single package containing multiple memory dies stacked on one logic die. A vault is defined as a set of banks of memory dies, and different vaults contain different memory die portions. Each vault has an MC named vault controller (VC), which is responsible for managing the memory references to that specific vault, besides timing, refresh operations, and buffering vault accesses. As opposed to HMC, RFiop follows the typical DDR organization in ranks (rows, columns, and banks) as multiple dies placed in a coplanar layout (Fig. 1).
In HMC, the communication between memory dies and processor happens via serialization/deserialization over I/O links, while RFiop employs modulation over QP lines. Typical I/O links in HMC present 10 Gb/s versus 48 Gb/s links (six carriers at an 8 Gb/s data rate) in RFiop. The maximum aggregated MBW in HMC is 320 GB/s, which is significantly higher than in RFiop: with the memory settings defined in Section II-A, RFiop's maximum MBW achieves 96 GB/s (16 RFMCs x 6 GB/s). However, for RFiop to achieve the same MBW levels as HMC, the improvement of transistor technology is likely to allow: 1) a larger number of RFMCs (1:1 RFMC:rank assumption) and 2) increased QP bandwidth. Assuming that at 22 nm, 32 ranks can fit in the RFiop package area, 1) RFiop memory MBW is leveraged to 192 GB/s and 2) with the assumption that the QP bandwidth is doubled, about double the carriers can be fit while larger data rates are allowed (10 Gb/s), thus resulting in 480 GB/s, which is much larger than the 320 GB/s of HMC.
Alternatively, if the number of pads is not considered, having the 55 pins of HMC (versus four RFpads in RFiop) as the budget allows RFiop 1056 GB/s (55 over 4 = 11; 96 GB/s * 11 = 1056), i.e., 3x more MBW than HMC. Further advancing the RFiop report [16], assuming a pad:pin ratio of 1:2 (as at the beginning of this section) and that an HMC memory package utilizes eight links corresponding to eight VCs and 55 I/O pins, the equivalent RFiop configuration with eight RFMCs (each RFMC corresponding to one VC) is likely to have 32 RFpads, i.e., a much lower pad usage than HMC. To predict the future memory-data-rate versus RFpad scaling behavior (which is supported by the scaling of RF technology, number of carriers, and bandwidth), different types of faster memories (e.g., DDR4/DDR5) are similarly modeled [using (21)-(27)]: 1) with and 2) without a bandwidth limit of 200 GHz (QP [15]), and using 16-nm/22-nm RF technology based on RF ITRS predictions [3], [12]. The result of this modeling is shown in Fig. 4(b), which demonstrates RFpad scalability along future memory and RF-interconnection generations.
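The HMC-comparison arithmetic above reduces to a few products; a sketch of the stated budgets (16 RFMCs at 6 GB/s today, 32 ranks at 22 nm, then doubled QP bandwidth with 10-Gb/s carriers):

```python
rfiop_now = 16 * 6                   # 96 GB/s with the Section II-A settings
rfiop_22nm = 32 * 6                  # 192 GB/s once 32 ranks/RFMCs fit at 22 nm
per_rfmc_gbs = 12 * 10 / 8           # doubled carriers at 10 Gb/s -> 15 GB/s each
rfiop_wide = int(32 * per_rfmc_gbs)  # 480 GB/s with doubled QP bandwidth
print(rfiop_now, rfiop_22nm, rfiop_wide, "GB/s vs. HMC's 320 GB/s")
```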
In strategy 2) (defined in Section IV-A), as assumed in the RFiop report [16], combining the prototyped/validated QP bandwidth of 60 GHz [15] with the pad reductions obtained (30% versus RAMBUS XDR2 and 50% versus Intel FBDIMM), it is found that, compared to HMC, RFiop can reduce the number of pads by up to about 56%.
Moreover, regarding strategy 3) (defined in Section IV-A), assuming RF predictions [3], [12] and disregarding QP parameters, a remarkable count of four RFpads is found, as reported in [16], which is of similar magnitude to optical Corona [10]. Table II compares pad count, MBW-per-pin, interconnection energy, and energy among diverse systems, including RFiop. Other energy aspects are discussed in Section IV-I.
Comparing the modeling equations (21)-(27) to the ones previously developed in [16]: 1) they are valid for different memory types, i.e., different data rates and/or widths other than 8 B (the DDR standard) and 2) they can be used to determine different pad counts as a function of scaling widths. Next, different memory types/technologies and RFpad-count scalings are compared using the developed modeling.
D. RFpad Area, Die Area Saving, and I/O Pad Reduction
Liu's [15] design-space exploration of QP dimensions results in 20-to-100 μm and 10 μm, respectively, for depth and width. Since QP lines are RFpads, the previously obtained dimensions are valid for RFpads. Using these results, Marino [16] reports RFpad dimensions of 200 to 1000 μm². Since the insertion of ground lines is the typical rule of thumb to minimize crosstalk between two adjacent lines, the RFpad pitch is conservatively assumed as the largest dimension of QP, i.e., around 100 μm.
Since RFpads (QP lines) are built at the side of the die, i.e., not at its base, they favor I/O pad die-area savings [21]. To further estimate area savings, the ITRS 1023-pad limitation is assumed, as illustrated in Fig. 2. In this assumption, 50% (512 pads, i.e., 50% of 1023, rounded) are dedicated to data/control bits (the remaining 50% to power and others, e.g., I/O and interrupts) [21].
Thus, for a typical DDR3 240-pin budget and the 50% area estimation, 46.9% (240/512) of the die area allocated to the I/O pads can potentially be saved [21]. Furthermore, since I/O pads are connected to the same set of I/O pins, a significant reduction is expected in the latter as well [21]. A comparative area analysis between RFMC and traditional MC is performed in Section IV-H. Next, a temperature comparison with 3-D stacking is approached.
E. Temperature Comparison: RFiop and 3-D Stacking
In this section, temperature effects when scaling ranks are compared between RFiop and 3-D stacking. Both architectures are assumed to have the following. 1) 256 µm² of rank area, based on 3-D stacking rank dimensions [1], since 3-D stacking is an on-package/on-die technology. 2) Initial rank temperatures of the same magnitude as the L2 caches (assumed as 60 °C). 3) The Hotspot tool [25], with its respective gcc benchmark trace, is used to compare both architectures. 4) Most parameters employed in this estimation are the defaults of the Hotspot configuration file [25], except the area covered by the heat sink and spreader, which is conservatively adjusted to a maximum of 0.05 m in either configuration. 5) The number of ranks is scaled up to 16 in both RFiop and 3-D stacking to match the maximum number of RFMCs/MCs. As a result of this temperature modeling, RFiop runs about 10.5% cooler than 3-D stacking, and is thus likely to be advantageous when scaling ranks/RFMCs.
F. Performance Evaluation Methodology
RFiop is modeled using the M5 [18] and DRAMsim [17] simulators. Memory transactions are generated by M5 and captured by multiple MCs/RFMCs in DRAMsim, which responds to M5 with the result of each memory transaction. To have enough memory pressure and demonstrate higher MBW under RFMC scalability, a clustered microprocessor architecture with 32 cores is selected (as explained in Section II-A), versus 16 in [16]. Furthermore, to ensure higher memory pressure, OOO processors (Alpha-based, four-wide issue, similar to [16]) are employed with private L2 slices to prevent cache sharing from affecting MBW. A banked, scalable L2 miss status holding register (MSHR) structure is assumed, with a 1 MB/core L2 slice size [27]. L2 slices communicate through a one-cycle RF-crossbar, i.e., RF circuitry latency settings similar to those adopted by Chang et al. [12]: 200 ps of TX-RX delay, plus the rest of the cycle to transfer 64 B via high speed/modulation, which also prevents larger interconnection delays from masking memory settings. Instead of bus delays, RF TX-RX delays were also configured in DRAMsim to represent RF transmission. The configuration is based on the rank previously used in Section II (Micron MT41K128M8 [2]; the parameters in Table III(b) are kept constant throughout all experiments). To generalize RFiop usage with different DDR families, rank parameter settings different from [16] are used, particularly the 1333 MT/s memory data rate instead of 666 MT/s.
In all experiments, as stated in Section II-A, to ensure that no advantage is taken of locality [17], addresses are equally distributed along the ranks via cache-address interleaving along the RFMCs and closed page mode (server). Using the previous RF assumptions, a 200-ps TX/RX delay [12] is estimated. Due to the speed-of-light property of RF, the signal delays of command and burst transfers between RFMC/rank are estimated to be reduced from two cycles to one cycle and from the typical eight cycles to one cycle, respectively [2]. DRAMsim was modified to support an arbitrary number of RFMCs. In DRAMsim, each RFMC has an associated FIFO to queue memory requests, as well as the duration and occupation of the banks; taking all of these into consideration, contention is properly modeled. To evaluate RFMC scalability, the core:MC proportion is varied from the baseline configuration of 32:5 up to 32:16 (32 cores, 16 RFMCs, as previously justified) via M5/DRAMsim simulations with different numbers of RFMCs. In Fig. 5(a) and (b), the baseline core:MC ratio of 32:5 (5 MCs, Section II-A) is shown as a reference.
To obtain cache latencies, Cacti [23] is set with aggressive ultralow-power optimizations. The MSHR counts selected for each L2 slice follow the study by Loh [1], since it uses multiple MCs and ranks as well as OOO cores. Summarizing, all parameters used in the simulation environment are listed in Table III(a). Benchmarks have been selected according to Loh's [1] criteria, focusing on those with a high number of misses per kilo-instruction (MPKI) to exercise the memory system. The selection involves the following: 1) the STREAM [19] suite to evaluate MBW, decomposed into its four subbenchmarks (Copy, Add, Scale, and Triad); 2) pChase [28], designed to evaluate MBW and latency, with randomly accessed pointer-chase sequences; 3) hotspot from the Rodinia suite [29]; and 4) conjugate gradient (CG), scalar pentadiagonal (SP), and Fourier transform (FT) from the NASA parallel benchmarks, as part of the HPC challenge to evaluate MBW [30]. STREAM and pChase MBW measurements are extracted from these applications since they are designed to measure MBW. Table III(b) shows the benchmarks, input sizes, read-to-write rates, and the L2 MPKI obtained. In all benchmarks, parallel regions of interest are executed until completion. Input sizes are large enough to stress the memory system (120 MB-1.8 GB). Average results are calculated using the harmonic mean. For the rest of this evaluation, the following are defined. 1) Baseline: As determined in Section II, the electrical counterpart version with five MCs, which is constrained by I/O pin scalability. 2) RFiop: Represents RFiop with RFMC scalability benefits, i.e., with RFMCs scaling up to 16 RFMCs and 16 ranks.
3) RFiopa: To facilitate comparison, the terms RFiopa, RFiop_burst_command, and RFiopa_burst_command are adopted from Marino's report [16]. RFiopa is defined as the RF version with the same area budget as the baseline, to explore its architectural benefits in terms of higher RFMC counts. As further described in Section IV-H, RFiopa can have up to 12 RFMCs. RFiopa magnitudes were not directly obtained from the simulators, but extrapolated from the performance results. 4) RFiop_burst_command: RFiop plus (simultaneously) RF latency benefits on command/burst transfers. 5) RFiopa_burst_command: RFiopa plus RF latency benefits applied to command and burst transfers. 6) RFiopp: The version that uses MC power as its power budget; based on the power/energy analysis (Section IV-I1), RFiopp can have up to 16 RFMCs. RFiopp has MBW/speedup behavior similar to RFiop.
G. Bandwidth, Latency, Speedups, and Number of Cores: Sensitivity Analysis
MBW benefits from RFMC scalability are analyzed first, followed by high-speed signaling. In Fig. 5(a), the MBW obtained for different core:MC ratios (32:5, 32:8, 32:12, and 32:16), with STREAM and pChase respectively representing streaming and random behaviors, improves with the increase of the number of RFMCs. Significantly, RFiop and RFiopa respectively provide 3.6x and 2.6x more MBW than the baseline due to larger RFMC counts (larger memory parallelism). Comparing Figs. 5(a) and 6, MBWs are up to 10% larger due to the use of larger-data-rate memories. Moreover, RFMC scalability provides MBW growth with different memory settings and any number of RFMCs, which generalizes and validates the RFiop RFMC scaling previously proposed [16].
The speedups obtained for different core:MC ratios (32:5, 32:8, 32:12, and 32:16), i.e., with different RFMC counts, are shown in Fig. 5(b). For all benchmarks, speedups increase proportionally with the number of RFMCs compared to the baseline. Considering RFMC scalability, pChase MBW and latency present improvements of 4%-25.8% and 10%, while speedups improve up to 3x (transaction queue average duration/occupancy reduction). Combining RFMC scalability and high speed, overall speedups show a significant improvement of up to 4.3x, while RFiopa achieves a significant factor of 3.2x when compared to the baseline. The latency in Fig. 8 follows a similar reduction trend when considering high-speed RF benefits.
RFiopa (RFiop under area budget constraints) presents similar behavior trends to RFiop for MBW, speedups, and latency. Therefore, performance and energy benefits can be observed when architectural area benefits of RFMCs replacing traditional MCs (RFiopa definition) are considered.
Similar to RFiopa, the architectural power budget is explored by replacing traditional MCs with RFMCs in RFiopp. The architectural area (Section IV-H) and power (Section IV-I) analyses show that a larger number of RFMCs can be used in RFiopp (16 RFMCs) than in RFiopa (12 RFMCs). This demonstrates that the area factor considered in RFiopa is more restrictive than the power factor considered in RFiopp, while MBW and speedup gains are achieved in both.
1) Bandwidth: While some benchmarks exhibit RFMC scalability limitations (observed as saturation of the MBW/speedup curves), considering that memory requests are equally interleaved over the RFMCs and cache transfers are done in one cycle (RF-crossbar latency), a deeper investigation of the simulator statistics shows significantly different L2 miss rates in some slices. This provides evidence of the churn phenomenon reported by Loh [1], in which scaling MSHRs does not necessarily decrease L2 miss rates; this is left for further investigation [16]. Moreover, Fig. 5(b) presents speedups 10% higher than in Fig. 6, consistent with the larger-data-rate memories employed. 2) Latency: Larger RFMC availability results in shallower transaction queues and smaller transaction durations. Due to lack of space, latency results are only shown for STREAM and pChase. In Fig. 5(a), by increasing RFMCs for both RFiop and RFiopa, occupancy is reduced by up to 3x and 2x (STREAM/pChase) when compared to the baseline. Furthermore, Fig. 8 shows that the average duration of memory accesses is decreased by up to 3.5x/2.2x for RFiop/RFiopa. This can also be seen in pChase, where latency is significantly reduced, by 61%, when compared to the baseline.
Comparing the latencies obtained in the previous report [16] with those shown in Fig. 8, a remarkable latency reduction of 30% is obtained. Even when using twice-as-fast memories, RFMC scalability can further reduce latencies under the pressure of twice the number of cores generating memory traffic. Compared to the previous experiments in Fig. 6, where 666 MT/s memories were used, occupancy and duration are lower in Fig. 5(a) with 1333 MT/s memories.
H. RFMC Versus MC Area
First, the TX/RX area is estimated, and then the impact of this area is determined for different technology generations. To estimate the TX/RX area, a methodology similar to Tam et al.'s [14] is adopted (further described in Section IV-I1), combining RF circuitry area estimations from the ITRS [3], the design of TX/RX circuitry [12], and validated TX/RX circuits [31]. As a result, the TX/RX area is estimated at about 0.0123-0.015 mm², which is a low overhead.
The MC internal elements are introduced to highlight the differences between an RFMC and a typical MC. Both contain: 1) the front engine (FE), which processes requests from memory; 2) the transaction engine (TE), which transforms memory requests into control/memory commands; and 3) the physical transmission (PHY), which consists of control and data over traditional physical channels [24] in MCs, versus RF TX/RX and RF channels in RFMCs.
The McPAT [24] tool estimates the area and power of the FE/TE/PHY parts of a regular MC. Since the FE and TE are present in both MC and RFMC, the area occupied by FE/TE is determined using an average over the previously simulated benchmarks in McPAT, with specific RFiop settings (methodology further described in Section IV-I1), while the RF TX/RX area is obtained as previously described.
Similar to Marino's report [16], it is observed in Fig. 7 that the PHY is the dominant element in terms of area; across different technology generations, 57.3% of the MC area can be saved when replacing MCs with RFMCs. Put differently, by adopting the MC area as the area budget, up to 2.4x more RFMCs can fit on the die, i.e., up to 12 RFMCs (versus the five-MC baseline area budget).
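The area-budget arithmetic can be sketched directly from these figures; the normalization below is illustrative, not a McPAT output.

```python
MC_AREA = 1.0                      # normalized traditional MC area
RFMC_AREA = MC_AREA * (1 - 0.573)  # PHY replaced by compact RF TX/RX
BUDGET = 5 * MC_AREA               # baseline area budget (five MCs)

print(MC_AREA / RFMC_AREA)  # ~2.34x more controllers per unit area
print(BUDGET / RFMC_AREA)   # ~11.7 -> up to 12 RFMCs (RFiopa)
```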
I. Power and Energy Analyses
The following analyses aim to identify and compare power/energy magnitudes of RFiop with its respective traditional counterpart: RFpad interconnection and total rank energies.
1) RFpad Interconnection Energy: As analyzed in Section IV-H, the FE/TE are present in both RFMC and MC and, as previously adopted, McPAT is used to estimate the power of both parts. However, since the PHY is the most significant element in terms of power when compared to the FE and TE, its power and the amount of bits transferred to/from memory are included as part of the dynamic energy.
According to the methodologies of [8], [10], and [26], energy is preferable to power, since the former considers the amount of bits transferred with the memory. For a traditional MC, the PHY contains I/O pins and a regular channel, whose power can be estimated by McPAT [24]. However, for an RFMC, the PHY is represented by RF TX/RX and RF interconnection, i.e., I/O pin and line power are replaced with TX/RX and RF line power.
Similar to the RF TX/RX area estimation in Section IV-H, the power estimation relies on a combination of RF circuitry estimations from the ITRS [3], the design of TX/RX circuitry [12], and validated TX/RX circuits [31], all adjusted to RFiop settings: 1) an average distance of about 1 mm from each RFMC to its respective rank RX/TX is assumed and 2) since QP RL is of significantly reduced magnitude [15], and TX/RX elements designed for QP are still an open area, a conservative power reduction (estimated at 10%) is applied to the employed transmission models [12], [14].
Moreover, since energy-per-bit depends on MBW, its modeling considers an average of the simulations performed previously (Sections IV-F and IV-G), including their memory utilization. Fig. 9(a) illustrates the results of the energy modeling, in which different distances and different technologies (45, 32, and 22 nm) are experimented with for RF versus traditional interconnections. Given the assumed distances, RF can save an average of 78% of the PHY energy compared to the baseline. This power budget reduction allows a significant factor of 4.6x more RFMCs to be fit in the package area, i.e., a total of 23 RFMCs (5 x 4.6), conservatively rounded down to 16 RFMCs (the maximum of 16 RFMCs as previously stated [16]).
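The corresponding power-budget arithmetic for RFiopp, assuming the 78% PHY energy saving above:

```python
BASELINE_MCS = 5
scaling = 1 / (1 - 0.78)              # ~4.6x more RFMCs per power budget
rfmcs = round(BASELINE_MCS * scaling) # ~23, conservatively capped at 16
print(round(scaling, 1), min(rfmcs, 16))
```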
2) Total Rank Energy: In this paper, RFiop is set with traditional DDR3-1333 MT/s ranks [detailed in Table III(a)], mainly focusing on memory channel reduction rather than on rank power reduction. Despite this, it is also shown that TX/RX utilization at the rank can reduce power, which can be estimated by employing the Micron power sheet [2], while the previously assumed RF models [12], [14] are employed to estimate the RF TX/RX power. Therefore, I/O pin termination power is replaced with TX/RX power in RFiop: this results in a 6.7% reduction of DRAM power.
In order to determine the total rank energy-per-bit (repb) when using multiple memory channels with ranks attached to them, the following calculation is performed:

repb = (dynamic_rank_energy + static_rank_energy) / total_bits_transferred.
The total rank energy considers the dynamic and static power spent by all ranks: it is obtained via the Micron data sheet [2], combined with M5 generating memory requests when running the benchmarks and DRAMsim [17] responding to M5 and performing the accounting of memory accesses, managing contention, and so on. The obtained results show that static power is roughly 10% of the dynamic power. MBW is obtained via experiments and settings similar to those performed in Section II (with different RFMC/MC counts).
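A hedged sketch of the repb accounting described above — total (dynamic plus static) rank energy over the bits actually transferred; the sample values are illustrative, not simulator output.

```python
def repb(dynamic_energy_j, static_energy_j, bits_transferred):
    # Rank energy-per-bit: total rank energy over bits moved to/from memory
    return (dynamic_energy_j + static_energy_j) / bits_transferred

# Illustrative: static energy ~10% of dynamic, as reported above
print(repb(1.0, 0.1, 8e9), "J/bit")
```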
Energy experimentation results are shown in Fig. 9(b). Under large MBW demand, the rank energy-per-bit level either decreases or stays constant as RFMCs are scaled; for example, as RFMCs are scaled, STREAM energy decreases by up to 50% and Hotspot's by up to 5% (compared to the baseline with five MCs, as explained in Section IV-F), which demonstrates that in these benchmarks RFMC scaling significantly benefits not only performance but also power. For pChase (with its random behavior), performance can be improved while the energy-per-bit level remains approximately constant for lower RFMC counts. Instead, for SP and MG (which demand smaller MBWs), energy levels increase by up to 14% as the number of RFMCs is increased; if performance benefits are considered a priority, this increase in energy levels is likely to be tolerated. By employing the Micron power sheet [2], the typical rank energy-per-bit usage is estimated (STREAM benchmarks average).
V. RELATED WORK
Liu [15] proposed QP lines as on-package inter-die CPWs to connect processor and memory, operating at regular/RF frequency ranges. In RFiop, QP lines [15] are used as RFpads to connect RFMCs and on-package ranks, while QP parameters are used to demonstrate pad reduction.
Muralidhara et al. [32] propose mapping the data of applications to different channels and combining channel partitioning with scheduling to avoid inter-application interference. Memory scheduling is not approached in this paper; therefore, Muralidhara's technique is orthogonal and can be applied to RFiop.
Xie et al. [33] propose that memory banks be dynamically partitioned according to thread utilization profiling. Janz et al. [34] propose a software scheduling framework in which an application interacts with the OS to determine its dynamic memory footprint. Memory thread scheduling is not approached in this report; therefore, Xie's and Janz's techniques can be orthogonally applied to RFiop.
While Ausavarungnirun et al. [35] employ an MC management technique that groups memory requests according to row-buffer locality first, then inter-application and FIFO scheduling, Kayiran et al. [36] manage to alleviate graphics-processing-unit contention for shared resources. These techniques could be orthogonally applied to RFiop RFMC row buffers.
The HMC [7] commercial solution employs sets of banks of memory dies, and processor/memory communication is done via serialization/deserialization, with 10-Gb/s I/O links. Instead, RFiop employs typical DDR ranks and protocol, with RF modulation and demodulation over scalable RFpads/RFMCs. As a result, RFiop has about a 48 Gb/s data rate per I/O channel, larger than HMC's. Finally, in the utilized settings, RFiop presents a maximum aggregate MBW smaller than HMC's; however, it presents a significantly lower number of pads.
RFiop [16] lays out ranks on the on-package area and connects them to MCs via RF modulation of data/addresses (forming RFMCs) using QP (RFpads). As a follow-up, Marino [21] approached the I/O pin problem by defining scalable RFpins (a microstrip interface) and adopting RFMCs connected to ranks, an extension of LaMeres et al.'s [22] RF-designed elements. In this paper, RFiop benefits are extended to more cores and different memories. In addition, an RF behavioral model of the RFpads is introduced, while energy and RFpad scaling behavior are evaluated.
VI. CONCLUSION
To address the I/O pad/pin problem, RFiop replaces the regular memory path with an RF path formed by RF elements such as RFMCs and QP lines, defined as RFpads to replace I/O pads. Compared to the previous RFiop report [16], this investigation advances the RFiop architecture by contributing: 1) scaled MBW/performance; 2) die area reduction; and 3) MC power and energy reduction, all compared to a baseline version with traditional I/O pads. The performance RFMC/RFpad scalability analysis previously evaluated for 16 cores is extended to 32 cores, including energy aspects. Furthermore, to the best of our knowledge, for the first time, a model of the RFpad that includes RL, IL, and CN as functions of RF bandwidth was developed from a real prototyped circuit, aiming to assist the designer with important RF features.
We have demonstrated that RFiop techniques are also valid for other DDR-family members with different data rates and widths. As a result, a significant improvement has been observed with twice the number of cores, which triggers further investigation for the next generations.
As future endeavors, a future RFiop version with low-power DDR memories and more efficient RF interconnections (e.g., carbon nanotubes) is considered. Rather than utilizing the reported transmission line model [12], developing one specific to RFiop is also planned. Moreover, a power-saving strategy is considered that includes both the memory system and the last-level cache for any type of application. Finally, an investigation of the scalability of optical pads, given the significant advances in optical interposers [11], is planned.
ACKNOWLEDGMENT
The author would like to thank M. A. G. Marino and the anonymous reviewers for their valuable feedback.
