Abstract: Recent implementations of heterogeneous multicore systems [central processing unit (CPU), graphics processing unit (GPU), and hybrid] address the issue of communication latency between CPU and GPU memory systems by merging the two, so that they share the same memory address space. In recent years, the escalation in the number of cores, combined with the rise of memory-intensive applications, has significantly increased bandwidth (Bw) needs in both homogeneous and heterogeneous systems. Since tasks assigned to CPU and/or GPU cores have different Bw demands, a two-tier memory system is needed. Hence, in this paper, the Region-Aware Memory cONtroller (RAMON) is proposed as a configurable memory system in which different address space regions can be dedicated to different numbers of memory controllers (MCs) concurrently, supplying different amounts of Bw to different numbers of cores and providing different levels of memory parallelism. By having different address space regions (simply, regions), each with a different number of MCs to match its Bw needs, memory interference per region is reduced. Our findings show that RAMON is promising and improves Bw by factors of 9 for CPU regions, 14.1 for GPU regions, and 4.5 for combined heterogeneous regions.
to form a heterogeneous region (further detailed) on the same multicore chip, thus allowing data exchange through exchanging addresses, rather than transferring contents via PCIe. As a result, communication latencies are significantly reduced, which allows performance improvements. However, this approach requires both types of cores to share one single address space, which further pushes the demands on the memory system side.
As reported in [5]-[9], despite differences among cores and trends in Bw-bound applications, memory interference adds another layer of contention that reduces performance: different programs running on different cores place different demands on the memory channels.
A straightforward solution to address the Bw needs of future multicore generations is to increase memory parallelism by increasing the number of memory controllers (MCs), each assumed to be connected to its own rank [typically known as a dual inline memory module (DIMM), i.e., a set of memory banks with aggregated data output and shared addresses]. To exemplify sets of multiple MCs/ranks, typical PCs have 2 MCs/2 ranks, the embedded Tilera [10] microprocessor has 4 MCs/4 ranks, and the IBM-Cisco router [11] has 16 MCs/16 ranks. However, according to [12], the increase in the number of MCs is restricted by the number of input/output (I/O) pins. To address these I/O pin issues, the number of I/O pin-based structures should be increased while still respecting low-power constraints.
Solutions that use a larger number of MCs typically rely on the principles of modulation and different media [13], [14]. Multiple frequency carriers can carry multiple data simultaneously over the same medium, which can dramatically reduce pin utilization while remaining significantly power-conscious [13], [14]. With reduced pin utilization, the number of MCs can be increased to significantly higher levels. For example, in either optical Corona [14] or radio frequency (RF) DIMM Tree [13], up to 64 MCs can be utilized.
To illustrate the benefits of a larger number of MCs, using 32 RF-based MCs, Marino [12] indicates a Bw improvement factor of about 7.2 compared with the 2-4 MCs of typical microprocessors. Furthermore, according to [14], an optical interface allows memory interconnection energy to be significantly reduced, which is fundamentally important when exploring different numbers of MCs.
In this paper, different degrees of memory parallelism are proposed to be achieved through a different number of MCs with optical- [14] or RF-based [12] memory interfaces applied to different types of regions (CPU, GPU, or heterogeneous), that is, Region-Aware Memory cONtroller (RAMON). Under several Bw-bound benchmarks using the detailed-accurate simulators, RAMON presents the following contributions.
1) Revisiting the operating system (OS) concept of address space used by Marino and Li [15], RAMON defines the novel concept of a region as an address space range dedicated to different sets of cores (CPU, GPU, or both), caches, and the respective interconnection. The inclusion of the two latter elements differentiates RAMON from the nonuniform memory access (NUMA) node mechanism in the Linux OS. MC region awareness in RAMON allows the formation of different regions with different combinations of types of cores. In addition, each formed region can be associated with a different number of MCs, while the user can assign different tasks of a program to different regions.
2) In RAMON, a novel scheme of reorganization and isolation properties is proposed to reduce or eliminate memory interference, defined as follows. The reorganization operation of a region permits changing the number of MCs associated with it, i.e., its degree of memory parallelism, thus likely reducing memory contention by reducing memory interference. Through reorganization, the traffic of regions becomes "more" self-contained (the isolation property), which can reduce or eliminate memory interference from memory requests that do not belong to the region, as further discussed. This novel scheme is enabled by the proposal of a low-overhead configurable optical crossbar.
3) An architectural investigation of RAMON system implications when increasing the number of MCs for CPU, GPU, and heterogeneous regions. This investigation aims to determine the performance and power behavior of several degrees of memory parallelism, represented by different numbers of MCs. To develop this investigation, we vary the number of MCs for the different types of regions, exploring significantly larger MC counts with optical- and RF-based interfaces.
4) An evaluation of the system implications of a larger number of MCs applied to different types of regions, rather than to the same type of region as performed in [12] and [16].
5) An evaluation of the system implications of larger numbers of MCs in heterogeneous regions using optical- and RF-based interfaces (signal modulation) rather than the traditional digital transmission (where, to transmit a "0" or a "1," the whole line should be entirely set to the respective level) developed in [15].
6) The methodology used to determine performance when CPUs and GPUs are combined is an improved version of the one proposed in [15]. The methodology consists of running each simulator (CPU and GPU) independently in regions that contain the maximum number of MCs allocated to each one. A formulation to estimate the Bw of heterogeneous systems is developed and compared with homogeneous ones.
7) A user-feedback scheduling algorithm is introduced to reorganize regions in order to match the Bw needs of each region. Consecutive runs of this algorithm and reconfigurations can guide the user toward a proper allocation of tasks, as further described.
8) The increase in the number of MCs using optical or RF interconnections has been investigated in current out-of-order (OOO) microprocessors [12], [16], [17]. In order to investigate the impact of the number of MCs on heterogeneous regions, we have to investigate it on CPU regions, which we compare with previous research [12], as well as on GPU regions.
Coherency aspects and the scheduling of regions of the memory system are beyond the scope of this paper; coherency and scheduling of programs are assumed to be handled at the programming-level environment.
The remainder of this paper is organized as follows. Section II describes the background and motivation for increasing the number of MCs to improve Bw, and Section III describes RAMON's properties of creation, isolation, and reorganization of regions. In Section IV, the experimental methodology and results are discussed. Section V analyzes the sensitivity of RAMON operations and methodology. Section VI describes the related work, while Section VII gives concluding remarks and future directions.
II. BACKGROUND AND MOTIVATION
Marino [12] characterizes the I/O pin problem as a set of physical restrictions likely to occur when the number of I/O pins increases, larger pin densities are employed, and faster clocks (666 MHz-1.3 GHz or more) along the processor-to-memory channel are implemented. These I/O pin restrictions involve electromigration, crosstalk effects among pins [16], and the reliable design of the connection between the motherboard and the processor chip [12]. In addition, as the number of I/O pins is increased, area costs are likely to increase correspondingly.
We illustrate the effects of these restrictions through: 1) core counts versus MC counts in current embedded and typical microprocessors; 2) rank frequency and its effects on Bw; and 3) the effects of pin counts on Bw and MC counts. To illustrate 1) and 3), as depicted by Marino [12], most typical multicore systems have more cores than MCs. This imbalance between MC and core counts is significant in magnitude and is likely to cause memory requests to queue at the MCs rather than be processed. As for 2), low-power DDR presents lower Bw than traditional DDR3/DDR4/GDDR5 memory, given that it was designed for low power consumption [15], [18]. Therefore, 1) and 2) demonstrate the need to focus on systems with a larger number of MCs in order to serve a larger number of cores; techniques that enable the use of a larger number of MCs are described next.
Optical and RF technologies enable the exploration of large Bw while using a small number of special pins, which are designed for these technologies. Modulation in these technologies enables the multiplexing of simultaneous carriers that transmit simultaneous data, thus allowing significantly larger Bw when compared with traditional transmission with no modulation. To illustrate how modulation benefits Bw in optical or RF modulation, Bw-per-pin is defined similar to [15]

$$\mathrm{bw\_pin} = \frac{\mathrm{ncarrier} \times \mathrm{data\_rate}}{\mathrm{number\_of\_IO\_pins}} \tag{1}$$

This equation shows that by reducing the number of pins, Bw-per-pin can be improved. Considering waves represented by multiple carriers placed at different frequency ranges, as carriers are added, larger data rates are achieved. Combined with a small number of pins, these carriers yield higher Bw-per-pin magnitudes.
The following examples further illustrate the importance of reducing pin utilization in MCs. Optical Corona [14] presents only two optical pins per MC, while RFiof [12] presents about four RF pins; therefore, the pin count is noticeably reduced when compared with the 120 pins (of 240 pins in total) utilized in traditional DDR systems, thus enabling the use of a larger number of MCs.
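As a rough illustration of (1), the following minimal sketch evaluates Bw-per-pin for the three per-MC pin counts quoted above. Only the pin counts come from the text. The carrier count, the per-carrier data rate, and the names `bw_per_pin`, `N_CARRIERS`, and `DATA_RATE_GBPS` are assumptions chosen for illustration, and the carrier count is held equal across interfaces purely to isolate the effect of the pin count (traditional DDR uses no modulation).

```python
def bw_per_pin(n_carriers: int, data_rate: float, n_io_pins: int) -> float:
    """Bw-per-pin following (1): carriers * data rate / number of I/O pins."""
    if n_io_pins <= 0:
        raise ValueError("n_io_pins must be positive")
    return n_carriers * data_rate / n_io_pins


# Pins per MC are taken from the text; the carrier count and data rate are
# hypothetical and held equal across interfaces to isolate the pin effect.
N_CARRIERS = 4          # hypothetical number of simultaneous carriers
DATA_RATE_GBPS = 10.0   # hypothetical data rate per carrier, in Gb/s
PINS_PER_MC = {"traditional DDR": 120, "RF-based": 4, "optical": 2}

bw_pin = {
    name: bw_per_pin(N_CARRIERS, DATA_RATE_GBPS, pins)
    for name, pins in PINS_PER_MC.items()
}

for name, value in bw_pin.items():
    print(f"{name:>16}: {value:.2f} Gb/s per pin")
```

With everything else fixed, halving the pin count doubles Bw-per-pin, which is the inverse proportionality that (1) expresses.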
To estimate Bw improvements when augmenting the number of MCs, we utilize the equation developed in [15]

$$\mathrm{BwP} = \text{rank frequency} \times \text{width} \times \mathrm{MCs} \tag{2}$$

where BwP represents the peak bandwidth that the memory system can supply, rank frequency represents the rank clock frequency or data rate, and width refers to the number of bits that represent the width of the rank. Equation (2) shows that BwP increases as the number of MCs is increased, which means a larger degree of memory parallelism. Considering all elements on the memory path between the core and the rank, BwP degrades due to cache delays, crossbar contention, interconnection delays, MC delays, memory bank delays, and program instructions that effectively use memory (read and write). As all effects are independent, effective Bw is obtained as follows:
where Bw is the effective Bw supplied to the cores (either CPU, GPU, or both), Misslatency% corresponds to the fraction of time spent on cache miss delays, BusDelay% is the fraction of time spent on bus delays, MCdelay% is the fraction of time spent at the MC queue/processing, NetworkCacheAndContention% is the fraction of time spent on cache occupation, interconnectionDelay% is the fraction of time spent at the optical or RF interconnection, ActiveState% is the fraction of time that memory banks are active (read and write operations occur), and NoReadWrite% is the fraction of time in which no read or write operations are present. Effective Bw follows BwP, and is still proportional to the increase in the number of MCs. Through a careful design that includes the control of impedance matching and interference, signal degradation, dispersion, and divergence effects during transmission [15] are minimized. Consequently, interconnection energy is minimized. For instance, Vantrease et al. [14] illustrated an interconnection energy reduction of about 80% when compared with traditional transmission. Additionally, lower area utilization [16] favors the use of a larger number of MCs based on either technology.
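The linear dependence of peak Bw on the MC count in (2) can be sketched as follows. The rank frequency and rank width below are hypothetical values chosen only to make the sweep concrete, the division by 8 is an added unit conversion from Gb/s to GB/s, and `peak_bw_gbytes` is an illustrative name. The sketch covers peak Bw only; the effective Bw is lower for the reasons listed above.

```python
def peak_bw_gbytes(rank_freq_ghz: float, width_bits: int, n_mcs: int) -> float:
    """Peak Bw following (2), converted from Gb/s to GB/s (8 bits per byte)."""
    if n_mcs <= 0:
        raise ValueError("n_mcs must be positive")
    return rank_freq_ghz * width_bits * n_mcs / 8.0


RANK_FREQ_GHZ = 0.666   # hypothetical rank data rate
WIDTH_BITS = 64         # hypothetical rank width

# Sweep a set of MC counts, using 2 MCs as the reference point.
sweep = {
    n: peak_bw_gbytes(RANK_FREQ_GHZ, WIDTH_BITS, n) for n in (2, 4, 8, 16, 32)
}
baseline = sweep[2]

for n, bw in sweep.items():
    print(f"{n:2d} MCs: {bw:7.2f} GB/s peak ({bw / baseline:.0f}x the 2-MC case)")
```

Because (2) is linear in the number of MCs, the ratio between any two configurations depends only on their MC counts, not on the assumed frequency or width.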
III. RAMON MANAGEMENT OF REGIONS
In RAMON, we assume that each region keeps its communication mostly self-contained, so that CPUs/GPUs of one region avoid accessing other regions. Traffic is not self-contained when different regions need the OS. As further described, a proper allocation of the OS to a region can potentially reduce OS-related interregion communication. Although beyond the scope of this paper, complete isolation could be achieved by having copies of the OS in all regions, at the expense of extra overhead to run these multiple OS copies, which would also require a master OS to keep control of the other copies.
A. User/OS Scheduling Assumptions in RAMON
RAMON allows the creation of new regions and/or the reorganization of previously existing ones in terms of the number of MCs. Heterogeneous tasks are considered to be a combination of CPU- and GPU-tasks, and are assumed to be scheduled and executed on heterogeneous regions. Furthermore, heterogeneous tasks are assumed to be created, scheduled, controlled, and triggered by the user (e.g., OpenCL [19]) or the OS according to the Bw selection or the quality of service (QoS) [9] level of the target Bw.
These tasks are assumed to invoke RAMON operations and region management. In this scenario, the frequency of the creation and reorganization operations is controlled and triggered by the user/OS. Additionally, the user/OS is assumed to be responsible for controlling synchronization by invoking the creation of separate regions so that memory interference is minimized. Although region operations incur overheads, these take less time than software overheads (e.g., synchronization among threads and memory allocation). Further considerations are discussed throughout this section.
Though beyond the scope of this paper, the following user-based feedback Bw-scheduling mechanism is proposed to be applied to a set of tasks in order to explore the different degrees of memory parallelism offered by different numbers of MCs.
1) For each individual task running in one region, identify its Bw utilization by indirectly determining the latency via Little's law [20], using an estimation of concurrency (simple circuits that count the number of outstanding memory transactions) and of latency (the number of memory transactions waiting on the transaction queues). These counters could be exposed to the user as performance counters. Repeat this step for each task of the set.
2) A new reference degree of memory parallelism for an individual task can be estimated by the user/OS by assigning a different number of MCs to each created or reorganized region. As a result, the configurations with the highest performance can be found. Region aspects are discussed in Sections III-B and III-C.
3) Given the best Bw-performing region configurations found in the previous step, the user/OS can select the number of MCs for the target region reconfiguration to achieve the Bw goals, while scheduling tasks by assigning or combining them to run in separate or shared regions so as to achieve the required Bw.
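The three steps above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the mechanism's implementation: Little's law (mean outstanding requests = throughput x mean latency) is used to turn two counter readings into a Bw estimate, the counter readings in `profile` are made-up numbers, the 64-byte transaction size is assumed, and choosing the smallest MC count that meets the target is an illustrative policy. All function and variable names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def estimated_bw(avg_outstanding: float, avg_latency_s: float,
                 bytes_per_transaction: int = 64) -> float:
    """Little's law: throughput = outstanding / latency; Bw = throughput * bytes."""
    if avg_latency_s <= 0:
        raise ValueError("avg_latency_s must be positive")
    return avg_outstanding * bytes_per_transaction / avg_latency_s


def pick_mc_count(bw_by_mc_count: Dict[int, float],
                  target_bw: float) -> Optional[int]:
    """Return the smallest MC count whose measured Bw meets the target, if any."""
    feasible = [n for n, bw in sorted(bw_by_mc_count.items()) if bw >= target_bw]
    return feasible[0] if feasible else None


# Step 1-2: hypothetical counter readings per candidate MC count:
# (average outstanding transactions, average latency in seconds).
profile: Dict[int, Tuple[float, float]] = {
    2: (16.0, 200e-9),
    4: (28.0, 180e-9),
    8: (40.0, 150e-9),
}
bw_by_mc_count = {n: estimated_bw(q, lat) for n, (q, lat) in profile.items()}

# Step 3: select a configuration for a hypothetical Bw goal of 9 GB/s.
chosen = pick_mc_count(bw_by_mc_count, target_bw=9e9)
print(f"selected MC count: {chosen}")
```

For phase-varying tasks, the same two functions could be re-run per phase, which matches the time-phase extension the text describes.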
The proposed mechanism can be integrated with other reported mechanisms, such as [5]-[9] and [21], further described in Section VI. The proposed mechanism assumes applications with constant Bw demand. However, by splitting task execution into time phases in which Bw is approximately constant, and by estimating Bw and reconfiguring the regions per phase as previously indicated, the algorithm can be repurposed to target variable Bw.
In all previous assumptions, RAMON exposes to the user/OS the capability of creating new regions and/or reconfiguring these regions in terms of the number of MCs. Importantly, given these considerations, with careful user/OS software-level application scheduling and proper configuration of the MCs in different regions, interference-related traffic among regions and the resulting delays are likely to be reduced, as further discussed.
B. Creation of CPU, GPU, and Heterogeneous Regions
As previously mentioned, each region is defined as a range of addresses and a set of MCs associated with it. Each region is likely to have one or multiple tasks being executed using the same address space and a dedicated number of MCs. Since the number of cores/caches/MCs in each region is reconfigurable, the interleaved address mapping needs to be changed accordingly. Since MCs are subject to the modification of the address mappings on the caches, a region reconfiguration should consider a new interleaving of address mappings, which should be used by the tasks allocated to that region.
Assuming that cache requests are interleaved at MCs, each region accesses a range of addresses according to the available number of MCs. Therefore, addresses accessed within one region are guaranteed to be different, which means that the regions are isolated. The only exception is to access OS-related data/programs out of the region, if the OS is not allocated to that region.
We illustrate the creation of different regions in Fig. 1, based on [15], which shows different regions with homogeneous and heterogeneous behaviors.
The crossbar utilizes region boundaries in order to delimit traffic inside each of these regions. Each region has a unique identifier at the crossbar (further described in Section III-E1). When the creation operation is performed, a new region is formed with the proper number of MCs and the respective tasks associated. If the creation operation is successful, configurations with the appropriate MC counts dedicated to the tasks are created. The creation operation (designed to happen at creation time) fails when the requested configuration or the associated number of MCs is not available to meet the request, in which case the user/OS is informed to reduce the number of MCs requested.
As mentioned in Section III-A, when multiple tasks are successfully initialized at the crossbar, regions are created accordingly. For instance, at the initial phase presented in Fig. 1, the application requests the creation of two different regions. Next, the application is able to schedule tasks to the two different regions (CPU and GPU).
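The success and failure behavior of the creation operation can be sketched as a small allocator. This is a software model under assumptions, not the crossbar hardware: the class names `Region` and `RegionManager`, the pool of 32 MCs, and the policy of handing out consecutive MC identifiers from a free list are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Region:
    region_id: int        # unique identifier of the region at the crossbar
    kind: str             # "CPU", "GPU", or "heterogeneous"
    mc_ids: List[int]     # MCs dedicated to this region


class RegionManager:
    """Hypothetical bookkeeping for region creation over a fixed pool of MCs."""

    def __init__(self, total_mcs: int) -> None:
        self.free_mcs: List[int] = list(range(total_mcs))
        self.regions: Dict[int, Region] = {}
        self._next_id = 0

    def create(self, kind: str, n_mcs: int) -> Optional[Region]:
        # Creation fails when the requested number of MCs is not available;
        # the caller (user/OS) is then expected to request fewer MCs.
        if n_mcs <= 0 or n_mcs > len(self.free_mcs):
            return None
        mc_ids, self.free_mcs = self.free_mcs[:n_mcs], self.free_mcs[n_mcs:]
        region = Region(self._next_id, kind, mc_ids)
        self.regions[region.region_id] = region
        self._next_id += 1
        return region


manager = RegionManager(total_mcs=32)
cpu = manager.create("CPU", 8)
gpu = manager.create("GPU", 16)
too_big = manager.create("heterogeneous", 16)   # only 8 MCs remain: fails
retry = manager.create("heterogeneous", 8)      # reduced request succeeds
print(cpu, gpu, too_big, retry, sep="\n")
```

Returning `None` on failure mirrors the text's description that the user/OS is informed and retries with a smaller request.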
C. Isolation of Regions
Tasks associated with a region mostly access memory addresses within that region, thus guaranteeing that most memory traffic is contained within it. Isolation means that the MCs assigned to each created region allow tasks to proceed with their memory accesses, regardless of other tasks executed in other regions. Also, tasks executed in different regions use different numbers of MCs-each region likely to have a different level of memory parallelism. Therefore, by having traffic mostly self-contained, memory interference of one region with respect to another region is mostly reduced (with the previous assumptions that only OS accesses are allowed to cross regions). Additionally, if a region is set to run more than one task and these tasks share multiple MCs, these tasks are likely to cause memory interference. Instead, if two regions are created, each one with separate MCs, a lesser amount of traffic interference is likely to happen within each region.
D. Bandwidth Behavior
We assume that cache addresses are interleaved on each MC and that each region has a set of MCs dedicated to it. Memory traffic between cores and MCs is mostly kept inside each region, i.e., memory traffic between cores and MCs that do not belong to that region is avoided. Some likely scenarios to be observed are the following.
1) If MCs of a certain region were shared with another one, these MCs would receive more traffic, thus likely to cause a latency increase. However, if each region has its own set of MCs, the set of each region would receive its corresponding memory traffic predominantly. Therefore, memory interference of other regions is significantly reduced.
2) When executing Bw-bound programs, memory traffic generated to respond to processor requests goes through the cache network to get back to the cores. Therefore, regions with larger cache network traffic are likely to be subject to larger congestion. Consequently, if properly separated at the crossbar, these regions can have traffic minimized from the others, and network interference is also reduced. Let MCs be the total number of MCs available. Assuming the existence of one heterogeneous region, 1) and 2) can be demonstrated by comparing the Bw and traffic of a heterogeneous region with the equivalent CPU and GPU regions, all with the same amounts of dedicated MCs.
We start by considering that the memory system has the same properties, regardless of the type of region. As previously mentioned in Section II, the effective memory Bw supplied to any of the regions (CPU, GPU, or heterogeneous) follows the number of MCs:

$$\mathrm{Bw} = \text{rank frequency} \times \text{width} \times \mathrm{MCs}$$

If available MCs are assigned to a region, through isolation, these are mostly used for dealing with memory requests generated by the cores of that region. For example, assume two heterogeneous regions, one with 75% of the available MCs and the other with the remaining 25%. Then, from (5), the first heterogeneous region (A) would have

$$\mathrm{Bw}_{\mathrm{hetA}} = \text{rank frequency} \times \text{width} \times 0.75 \times \mathrm{MCs}$$

where $\mathrm{Bw}_{\mathrm{hetA}}$ is the effective Bw supplied to heterogeneous region A. A similar equation can be developed for heterogeneous region B (replacing 0.75 with 0.25), and with both we can state that $\mathrm{Bw}_{\mathrm{hetA}} > \mathrm{Bw}_{\mathrm{hetB}}$, which demonstrates 1). Analogously, case 2) can be demonstrated using the same equations. More specifically, memory-bound applications generate cache requests at the CPU, GPU, or heterogeneous regions. The respective memory responses to these cache requests (the memory traffic) are generated at the ranks and passed on to the MCs, which in turn pass them on to the crossbar/cache network. In this case, the network traffic of a heterogeneous region would contain memory requests generated at both CPU and GPU cores and is comparatively larger, and thus likely to cause larger interference on the MC channels, than that of the respective CPU and GPU regions.
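The 75%/25% example can be checked numerically with a short sketch. The rank frequency, rank width, and total MC count below are hypothetical placeholders in arbitrary units, and `region_bw` is an illustrative name; only the 0.75 and 0.25 shares come from the text.

```python
def region_bw(rank_freq: float, width: float, total_mcs: int,
              share: float) -> float:
    """Bw of a region that owns `share` of the MCs: freq * width * share * MCs."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return rank_freq * width * share * total_mcs


# Hypothetical memory-system parameters (arbitrary units).
RANK_FREQ, WIDTH, TOTAL_MCS = 1.0, 64.0, 32

bw_het_a = region_bw(RANK_FREQ, WIDTH, TOTAL_MCS, 0.75)
bw_het_b = region_bw(RANK_FREQ, WIDTH, TOTAL_MCS, 0.25)

print(f"region A: {bw_het_a}, region B: {bw_het_b}, "
      f"ratio A/B: {bw_het_a / bw_het_b}")
```

Whatever values are assumed for frequency and width, the ratio between the two regions equals the ratio of their MC shares, and the two regions together account for the Bw of the whole MC pool.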
E. Reorganization of Regions
Similar to creation, new identifiers are assigned so that the region can be manipulated at the crossbar, as further described. At any point, the reorganization of regions redefines existing address spaces as well as the number of MCs allocated to them. Similar to region creation, reorganization considers the amount of Bw needed and the respective MCs that enable the region to achieve the user-desired Bw, while the creation of separate regions associated with different tasks can decrease memory traffic per MC, which reduces memory interference.
Importantly, if a region is reorganized, it is decoupled into different regions. Since the reorganization of a region is assumed to be controlled by the user/OS, the latter is in charge of reorganizing the regions and is assumed to associate new or previous tasks with these new regions. To implement reorganization, we propose a reconfigurable crossbar described next.
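Before turning to the crossbar, the decoupling step can be sketched in software. This is a simplified model under an assumption that is not stated in the text: the MCs of the original region are redistributed, in order, among the new regions, so that each new region keeps consecutive MC identifiers. The function name `reorganize` and the example sizes are hypothetical.

```python
from typing import List


def reorganize(mc_ids: List[int], split_sizes: List[int]) -> List[List[int]]:
    """Decouple one region's MCs into new regions of the requested sizes."""
    if any(size <= 0 for size in split_sizes):
        raise ValueError("every new region needs at least one MC")
    if sum(split_sizes) != len(mc_ids):
        raise ValueError("split sizes must use exactly the region's MCs")
    groups: List[List[int]] = []
    start = 0
    for size in split_sizes:
        # Each new region receives a run of consecutive MC identifiers.
        groups.append(mc_ids[start:start + size])
        start += size
    return groups


# A hypothetical 8-MC region is decoupled into a 6-MC and a 2-MC region.
old_region_mcs = list(range(8))
new_regions = reorganize(old_region_mcs, [6, 2])
print(new_regions)
```

The user/OS would then associate new or previous tasks with each resulting group, as described above.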
1) Crossbar Description and Design Scheme:
The crossbar is a key element not only in the reorganization operation in RAMON but also in the other operations (creation and isolation).
To understand the crossbar design and how address region isolation is implemented, we first show how MCs are grouped and dedicated to a region. Unique identifiers are associated with each MC with the assumption of cache addresses interleaved at each one of them. Since the address space and proper amount of MCs can be used by tasks associated with that region, memory traffic destined to a certain MC can be identified by comparing address mod (MCsDedicatedToRegion) (address represents a general address, mod represents the modulo operation, and MCsDedicatedToRegion represents the number of MCs dedicated to that region) to the MC identifier. Importantly, when executing tasks, region boundaries are used to prevent addresses from going out of bounds of that region, thus guaranteeing that cache addresses are spread at the sets of MCs associated with that region.
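The mapping and boundary check described above can be sketched as follows. Several details are assumptions rather than statements from the text: interleaving is done at a 64-byte cache-line granularity, the modulo is taken on the line index relative to the region base, and the result is offset by the region's first MC identifier. The class `RegionMap` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionMap:
    base: int            # first address of the region (inclusive)
    limit: int           # end address of the region (exclusive)
    first_mc_id: int     # lowest MC identifier dedicated to the region
    mcs_dedicated: int   # MCsDedicatedToRegion

    def contains(self, address: int) -> bool:
        # Boundary check: addresses outside [base, limit) leave the region.
        return self.base <= address < self.limit

    def target_mc(self, address: int, line_bytes: int = 64) -> int:
        # Interleave cache lines over the region's MCs with a modulo.
        if not self.contains(address):
            raise ValueError("address is out of the region's bounds")
        line_index = (address - self.base) // line_bytes
        return self.first_mc_id + line_index % self.mcs_dedicated


# Hypothetical region: 4 KiB of addresses served by MCs 4..7.
region = RegionMap(base=0x1000, limit=0x2000, first_mc_id=4, mcs_dedicated=4)
for addr in (0x1000, 0x1040, 0x10C0, 0x1100):
    print(hex(addr), "-> MC", region.target_mc(addr))
```

Consecutive cache lines rotate through the region's MCs, and no address inside the region can ever be mapped to an MC that belongs to another region.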
By guaranteeing that memory traffic is kept within the boundaries corresponding to the set of MCs dedicated to a region, memory traffic out of the region is avoided, which reduces memory traffic destined to MCs that belong to other regions, i.e., it reduces the interference of other regions. Furthermore, by having consecutive MC identifiers dedicated to a region, the network traffic of that region can be isolated. Region boundaries are utilized at the crossbar to set the reconfiguration hardware, formed by sets of registers and comparators, as follows.
1) Registers are required to perform creation and configuration operations. These registers contain the address boundaries of each created or reorganized region, and are assumed to be exposed to the user/OS scheduling described in Section III-A.
2) The address mod (MCsDedicatedToRegion) calculations are implemented in circuits via shift registers and XOR elements. Since these calculations can occur frequently so that memory traffic is isolated inside each region, the hardware elements needed can be implemented separately at the microprocessor decode hardware unit, thus not incurring additional overhead.
3) Comparators are also required to enforce memory and network traffic isolation. These comparators check the validity of the address with respect to the boundaries of the region, ensuring that the memory traffic is mostly self-contained in that region. Each region has its own comparators, which guarantee that traffic is contained in it.
4) We assume that the memory interfaces are based on optical and RF technologies. Therefore, the crossbar is likely to be designed according to the modulation, transmission, and proper pin-based interfaces required by these technologies.
5) The circuit complexity of the registers and comparators grows linearly with the number of regions.
The individual complexity of the registers and comparators involved in the operations of creation, isolation, and reorganization is negligible in terms of circuit complexity. Importantly, since the number of regions is finite, the overall complexity of the registers and comparators that assist the implementation of the region operations is still estimated to incur low area overhead. We briefly approach the design of the crossbar in Section III-E2.
2) Reconfiguration and Optical Crossbar Design Scheme: Regions should contain a number of MCs and a proper network, which will vary with the behavior of the applications. The crossbar network should be easily configurable so as to properly create regions. This configurable behavior can be implemented using photonic components. Indeed, according to the logical operations proposed by Almeida et al. [22], the direction of the light can be used to implement logical operations. Before illustrating how crossbar operations are implemented, we describe the basic optical inverter element, such as the one developed by Almeida et al. [22], on which our operations are based. As illustrated in Fig. 2(b), this element has the following ports: control, input-output, and throughout. If the control signal is "1," incoming data from the input port are sent to the throughout port. If the control signal is "0," incoming data are not propagated to the throughout port.
Using these logical operations implemented with the optical inverter, we propose straightforward control circuits to be connected to the optical logical control signal of the inverter, responsible for: 1) selecting cache addresses that belong to the address range of the created regions and 2) filtering cache addresses that do not belong to the specified range.
To select cache addresses that belong to a region, the control operation circuit should be coupled to the optical control signal so that these signals are sent to the throughout port. Conversely, for addresses which do not belong to a specified region address range, the control should send them to the drop port.
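The select/filter behavior can be mimicked with a small logical model. This is a software abstraction under assumptions and says nothing about the photonic implementation: the control signal is derived from a simple range comparison, and the function names `control_signal`, `filter_element`, and `route` are hypothetical. The port names follow the text.

```python
from typing import Iterable, List, Tuple


def control_signal(address: int, base: int, limit: int) -> int:
    """Control circuit: 1 if the address belongs to the region's range, else 0."""
    return 1 if base <= address < limit else 0


def filter_element(control: int, address: int) -> Tuple[str, int]:
    """Logical view of the optical element: 1 -> throughout port, 0 -> drop port."""
    return ("throughout", address) if control == 1 else ("drop", address)


def route(addresses: Iterable[int], base: int, limit: int) -> List[Tuple[str, int]]:
    """Apply the control circuit and the filter element to each address."""
    return [filter_element(control_signal(a, base, limit), a) for a in addresses]


# Hypothetical region covering [0x1000, 0x2000).
routed = route([0x0FFF, 0x1000, 0x1FFF, 0x2000], base=0x1000, limit=0x2000)
for port, addr in routed:
    print(f"{hex(addr)} -> {port} port")
```

Only addresses inside the configured range reach the throughout port, which is the isolation behavior the control circuits are meant to enforce.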
A general description of the operation of the crossbar is illustrated in Fig. 2(a). The config (configure) block contains the registers and operations required to configure the crossbar operation. This block takes the cache addresses of different ranges as inputs and generates the configuration signal [conf, as described in detail in Fig. 2(b)] for the ring resonator block as output. The circuits present in this block are mainly simple comparator circuits. The complexity involved in this block, which participates in the operations of creation, isolation, and reorganization, is not significant in terms of overall circuit complexity. Though beyond the scope of this paper, overall complexity grows linearly with the number of regions, and requires a tradeoff investigation between low area overheads and the number of regions required.
The configuration signal of the config registers block in Fig. 2(a) is connected to the control signal (conf) of the ring resonator, as depicted in Fig. 2(b) . This signal is used to implement the address mod MCs and address mod (MCsDedicatedToRegion) operations. We detail this implementation in Section IV.
IV. EXPERIMENTAL RESULTS
We start this section by describing the methodology employed in experiments performed, followed by presenting and discussing the results obtained.
A. Methodology
In the evaluation, we concentrate on determining the benefits of increasing the number of MCs in terms of Bw and processor throughput [instructions per cycle (IPC)], rather than on the reconfiguration process itself. As mentioned in the last section, the time overheads of the optical reconfiguration process are significantly smaller than software ones, and the hardware unit responsible for it has a low area overhead. Before describing the methodology adopted, the baseline in terms of the number of MCs for the regions is explained. We utilize the number of MCs in real systems as the baseline MC count for CPU, GPU, and heterogeneous regions. An in-depth survey of currently available heterogeneous systems indicates that they [1] have 2 MCs, which is adopted as the baseline.
For a global view of the methodology considered, all simulators employed and their descriptions are listed in Table I. The general methodology employed in this paper is adopted from [16] and applied to each type of region. Bw-bound applications are used to evaluate the regions, since the primary goal is to target Bw.
Region-related concepts are implemented as follows. The region boundary check is included in DRAMsim [24], while network traffic isolation is included in the M5 simulator [25]. The region boundary check is basically implemented with the address-mod-related calculations previously explained in Section III-E1. As previously mentioned in Section III-A, memory traffic isolation is controlled via user/OS tasks. Region identification is assumed when the user/Linux tasks are created on the M5 simulator [25] or at the creation of GPU tasks in GPGPUsim [26], while network traffic isolation is performed in M5 by filtering packets that do not belong to the region (a range of memory addresses, as previously explained). For reorganization, region identification, boundary checking, and network traffic isolation are implemented similarly to the creation operation.
To evaluate the memory behavior of the CPU regions, we combine the M5 [25] and the DRAMsim [24] simulators as follows. To predict the behavior of future heterogeneous multicores, a 32-core processor model is created in M5 [25] , and as memory transactions are generated in M5 upon benchmark execution, these are captured in DRAMsim [24] , which is set with multiple MCs, each associated with its own rank. Next, DRAMsim responds to M5 with the result of each transaction. In this environment, we confirm the appropriate calibration of the Bw in one rank: about 2 GB/s as indicated in the manuals [27] . For the CPU regions, we perform an analysis of regions with MC counts ranging from 2 MCs (baseline) to 32 MCs. We should highlight that this paper explores up to 32 MCs, significantly larger than currently used counts in typical microprocessors.
In this evaluation, we have further utilized a CPU ISA based on Alpha processor, set as a four-way issue OOO core similar to high-performance microprocessors such as [1] , at 3 GHz with MCs operating at 1.5 GHz as in typical microprocessors [28] . We used Cacti [23] to obtain cache latencies and adopted miss status holding register (MSHR) counts similar to current microprocessors [28] . In order to be able to increase the number of cores, these are connected as a clustered architecture, and we utilize L2-MSHR structures [29] that can be replicated to achieve higher memory Bws, while we assumed an L2 similar to what can be found in current microprocessors. However, instead of a shared L2, we employed a private L2 to avoid L2-cache sharing effects, which would affect memory Bw and IPC measurements.
To model the behavior of GPU regions with a set of assigned MCs, we used the GPGPUsim [26] simulator, which already contains a module that implements multiple DDR-based MC-systems. Memory transactions are generated by the multiple GPU L3 caches and simulated in the memory module of GPGPUsim. The GPU architecture modeled follows the Nvidia Fermi architecture [30] . Similarly, we explore a variable number of MCs from 2 to 32. Furthermore, we adopted similar GPU cache settings [30] to GPU processors keeping proper (as further described) RF-intercommunication delays [13] between GPU L3 caches and the crossbar.
The most straightforward way to perform a detailed simulation of the Bw impact on a heterogeneous region is to employ a simulator for heterogeneous systems. However, the higher computational complexity of these simulators can lead to significantly higher simulation times. Instead, we propose a methodology that uses the CPU and GPU simulation infrastructures independently and combines their results to determine the impact of increasing the number of MCs in heterogeneous regions.
Similar to Marino's report [12], we employ ranks with proper DDR-family settings in terms of timings, protocols, control-data signal separations, and organization (banks, rows, and columns). Timing parameters are based on a 1-GB DDR3 rank, the Micron model MT41K128M8 [27].
Each MC is associated with one of the previous ranks, i.e., 32 MCs are associated with 32 ranks. The sets of 32 MCs/32 ranks are coupled to 32 cores so that the Bw feeds these cores while keeping a balanced core:MC proportion of 1:1. Furthermore, since each MC is connected to a different DDR rank, cache lines interleaved along MCs are actually interleaved along different ranks. Additionally, closed-page mode was used, since RAMON targets low energy usage in a server environment.
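As a rough illustration of how cache lines interleave along the MCs (and therefore the ranks) of a region, the sketch below maps an address to an MC index with a modulo over the cache-line number. The 64-B line size, the function name, and the exact mapping are assumptions made for this example; they are not the mapping used in the simulators.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for this sketch)


def mc_for_address(addr: int, region_base: int, num_mcs: int) -> int:
    """Return the index, within the region, of the MC (and rank) serving addr."""
    if num_mcs <= 0:
        raise ValueError("a region needs at least one MC")
    if addr < region_base:
        raise ValueError("address lies below the region base")
    line_number = (addr - region_base) // LINE_SIZE
    # Consecutive cache lines rotate over the MCs assigned to the region.
    return line_number % num_mcs


# Six consecutive cache lines in a region with 4 MCs.
mapping = [mc_for_address(a, 0, 4) for a in range(0, 6 * LINE_SIZE, LINE_SIZE)]
```

With 4 MCs, consecutive lines go to MCs 0, 1, 2, 3 and then wrap around, so sequential accesses are spread over all ranks of the region.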
To model communication delays, we follow the validated modeling methodology by Chang et al. [13]: the optical and RF-interconnection delays involved are of a similar order of magnitude, as determined by the signal delays when traversing the estimated distance (e.g., 2.5 cm [12]).
In this RF model, modulation and line separation are taken into account so as to keep the bit error rate as low as possible. Therefore, L2 CPU caches are interconnected via an RF crossbar with a single-cycle latency (adopting the same timing settings as in [13]: 200 ps for TX-RX delays, plus the rest of the burst cycle used to transfer a 64-B memory word using high speed and modulation). The crossbar upper Bw was designed so that 1) total Bw is not restricted as the number of ranks (which follows the number of MCs) is increased and 2) actual delays [13] are approximated. Similar settings are also valid for connections to GPU L2 caches.
Given that the number of regions in benchmarks evaluated is of reduced magnitude (an assumed maximum of 32 regions), boundary registers and comparators setting delays involved in the creation, isolation, and reorganization operations are of significantly reduced magnitude when compared with cache delays and RF-interconnection.
Next, the methodology employed to obtain power and energy-per-bit is discussed. To obtain the total power, DRAMsim power models are considered, following the Micron formulation [27], which includes the power spent on all ranks and interconnection. To determine the total energy-per-bit spent, we adapt the methodology described in [14] to the magnitudes obtained in the DRAMsim power infrastructure (which are also based on the Micron formulations [27]) and combine them with either the memory Bw extracted from the benchmark, if it is designed to measure Bw, or otherwise the number of memory transactions (DRAMsim or GPGPUsim) and the execution time. Table II contains all architectural parameters.
Benchmark selection followed the methodology employed in [12]. It consists of a selection of medium and high Bw-bound benchmarks with a significant number of misses per kiloinstructions (MPKIs) to evaluate the memory system: the STREAM [31] suite designed to measure memory Bw, decomposed into its four subbenchmarks (Copy, Add, Scale, and Triad), as well as Backpropagation, Hotspot, Pathfinder, and Srad from the Rodinia suite [32].
In addition, we exemplify some region operations (creation and reorganization), assuming they happen as a task termination followed by the creation of a new task, as illustrated next. In the experiments, we assume up to 32 different regions, since we have 32 cores. Table III summarizes the selected benchmarks, input data set sizes, read-to-write rates, and L2 MPKIs obtained. In all benchmarks, parallel regions of interest were executed until completion. All input data sets are larger than the total rank memory size, which guarantees that the whole memory space is stressed. Average results are calculated using the harmonic mean. To measure the implications on performance, we propose the following.
1) To measure IPCs as an indication of throughput/performance.
2) To measure Bw gains on each type of region (CPU, GPU, and heterogeneous).
3) To investigate the degree of memory parallelism for each type of region by examining the Bw gains from the baseline (2 MCs) to the maximum number of MCs (32 MCs).
4) To determine the maximum Bw of a CPU, GPU, and heterogeneous region as upper-bound magnitudes.
5) To be able to determine the maximum Bw, each selected benchmark is scheduled to each individual type of region, with all MCs dedicated to that region.
6) To measure Bw for benchmarks not necessarily designed to measure Bw (all from the Rodinia suite [32]) by dividing the number of bytes that the memory system effectively transfers (obtained from the simulation infrastructure and corresponding to memory read and write operations) by the execution time.
7) To allow the creation of regions before the benchmarks are executed in those regions.
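The Bw measurement in item 6 and the harmonic averaging mentioned before the list can be sketched as follows. This is a minimal sketch with hypothetical function names and made-up numbers, used only to show the arithmetic; it is not the simulators' reporting code.

```python
from statistics import harmonic_mean


def measured_bandwidth(bytes_read: int, bytes_written: int,
                       exec_time_s: float) -> float:
    """Bytes effectively transferred by the memory system per second."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    return (bytes_read + bytes_written) / exec_time_s


# Hypothetical transfer counts and execution time, for illustration only.
bw = measured_bandwidth(bytes_read=6_000_000_000,
                        bytes_written=2_000_000_000,
                        exec_time_s=2.0)

# Harmonic mean over hypothetical per-benchmark gains.
average_gain = harmonic_mean([2.0, 4.0, 4.0])
```

The harmonic mean gives more weight to the smaller values than the arithmetic mean does, so a few very large gains do not dominate the reported average.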
B. Determining the Bw of a Heterogeneous Region
Since a CPU region contains solely CPU cores, we define the memory Bw supplied to each CPU core as

$$Bw_{\mathrm{CPU}} = \frac{Bw}{\mathit{CPUcores}} \qquad (6)$$

where $Bw$ is the effective Bw according to (3) and $\mathit{CPUcores}$ represents the total number of CPU cores present in that region. Similarly, for a GPU region with $\mathit{GPUcores}$ clusters of cores (each cluster with an associated MC), Bw can be expressed as

$$Bw_{\mathrm{GPU}} = \frac{Bw}{\mathit{GPUcores}}. \qquad (7)$$

Now, a heterogeneous region created with cores from both CPU and GPU regions has a maximum Bw expressed as

$$Bw_{\mathrm{het}} = \frac{Bw}{\mathit{CPUcores} + \mathit{GPUcores}}. \qquad (8)$$

Using Section III-C considerations and

$$\mathit{CPUcores} + \mathit{GPUcores} > \mathit{CPUcores} \qquad (9)$$

$$\mathit{CPUcores} + \mathit{GPUcores} > \mathit{GPUcores} \qquad (10)$$

we can derive that

$$Bw_{\mathrm{het}} = Bw_{\mathrm{CPU}} \cdot \frac{\mathit{CPUcores}}{\mathit{CPUcores} + \mathit{GPUcores}} \qquad (11)$$

$$Bw_{\mathrm{het}} = Bw_{\mathrm{GPU}} \cdot \frac{\mathit{GPUcores}}{\mathit{CPUcores} + \mathit{GPUcores}}. \qquad (12)$$

Therefore, by analyzing (8)-(10), we demonstrate that the Bw of a heterogeneous region is lower than the Bw of the respective CPU and GPU regions (with similar MC counts) and can be calculated using the latter equations. Thus, to determine the Bw of the heterogeneous region, without loss of generality, we select the CPU region Bw results and apply the factor CPUcores/(CPUcores + GPUcores) to determine it next.
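As a worked illustration of applying the factor CPUcores/(CPUcores + GPUcores) to a CPU-region result, the sketch below scales a Bw value to its heterogeneous estimate. The function name and the input value of 10.0 are hypothetical; only the scaling factor comes from the text.

```python
def heterogeneous_bw(bw_cpu: float, cpu_cores: int, gpu_cores: int) -> float:
    """Scale a CPU-region Bw result by CPUcores / (CPUcores + GPUcores)."""
    if cpu_cores <= 0 or gpu_cores < 0:
        raise ValueError("core counts must be positive (CPU) and non-negative (GPU)")
    return bw_cpu * cpu_cores / (cpu_cores + gpu_cores)


# Hypothetical CPU-region result of 10.0 (arbitrary units), 32 CPU and 32 GPU cores.
estimate = heterogeneous_bw(10.0, cpu_cores=32, gpu_cores=32)
```

With equal CPU and GPU core counts the factor is 1/2, so the heterogeneous estimate is half of the CPU-region value; with no GPU cores the factor is 1 and the CPU-region value is returned unchanged.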
C. Bandwidth
Fig. 3(a)-(d) shows the Bw results obtained for the CPU and GPU regions when executing the selected benchmarks. For all STREAM benchmarks, which were designed to measure the Bw, we can observe that Bw increases as the number of MCs is increased. Similar to the Bw trends in the STREAM suite, Bw also increases proportionally to the number of MCs in the Rodinia suite.
Our findings show that the largest Bw achieved is for Add, where for 32 MCs, it is about 9 times higher than the baseline, while the smallest one is observed with Pathfinder, where the 32-MC configuration is 2.2 times higher than the 2-MC configuration. As with the MC settings found in [12], the selection of a larger input data set allows larger memory data transfers for Pathfinder in the CPU region. We also observe that for the Copy and Scale benchmarks, the Bw saturates at 16 MCs. By analyzing the simulation infrastructure statistics, we observe that this effect is due to the Bw limitations of the crossbar (Table II, 48 GB/s). By applying different degrees of parallelism through the use of different numbers of MCs, tasks executed in CPU regions yield significant Bw improvements when compared with the baseline as the number of MCs (parallelism) is increased.
As shown in Fig. 3(b), GPU regions present trends similar to those of CPU regions as MCs are added: for all the benchmarks in the STREAM suite [31] and in the Rodinia suite [32], Bw is improved for the different degrees of memory parallelism. The best Bw result occurs for Copy, which, for the 32-MC configuration, is 14.1 times higher than the baseline. The worst result (2.3 times higher than the baseline) occurs for Hotspot with the 32-MC configuration.
The largest Bw magnitude in both types of regions occurs for the largest number of MCs, which is obtained according to the behavior observed in 2. By comparing CPU with GPU regions, Bws are larger on the GPU ones, since their architecture [30] demands higher memory Bw.
For the STREAM suite [31], Bw in both types of regions increases proportionally to the number of MCs available to the region. However, in the Rodinia suite [32], it is interesting to observe that some benchmarks present a consistent Bw increase, while others do not. For example, Backpropagation and Srad Bws increase proportionally with the number of MCs, while others, such as Hotspot and Pathfinder, do not.
To explain this phenomenon, we point out that we follow traditional methodologies used in computer architecture [12], [13]: we employ standard benchmarks selected from the mentioned suites and do not explore benchmark parallelization techniques on these different architectures. Furthermore, according to Marino and Li's report [15], a fair comparison between Hotspot and Pathfinder on these different architectures would require both programs to be parallelized using similar techniques to provide the best possible Bw, which is not the case. For example, in a GPU architecture, the use of GPU caches is controlled via programming, which is not the case in the Rodinia benchmarks [32], where CPU programs use OpenMP, while GPU programs use CUDA.
Highlighting its importance, Bw improves in all types of regions upon increasing the number of MCs. In addition, this Bw improvement is valid for a significant diversity of benchmarks for either CPU or GPU regions.
To estimate the Bw of a heterogeneous region, 32 cores (Table II) are used for the CPU regions, 32 shader cores (Table II) are used for GPU regions, and CPU Bw results in Fig. 3(a) are used as the inputs to (11) and (12) . This estimation is illustrated in Fig. 3(c) . Instead of using the baseline of the heterogeneous region as 2 MCs, we use the one from the homogeneous so that Bw reduction can be easily observed. The results show that Bw also increases when the number of MCs is augmented.
D. Throughput or Processor Performance
We have measured IPCs in order to estimate processor performance improvements. Fig. 4(a) and (b) illustrates the IPCs obtained in the experiments. As a general observation, for either CPU or GPU regions, IPC increases follow Bw increases as the number of MCs is increased. The largest IPC gains are very significant, about 14 times for CPU regions and 17 times for GPU regions. The best results are obtained for the STREAM suite and Hotspot in the CPU regions, and for Srad, STREAM, and Pathfinder in the GPU regions. The worst IPC gains are observed for Backpropagation in both CPU and GPU regions.
E. Latency
Since Bw and latency are related [20], the effects of several degrees of parallelism can be observed in terms of latency, which is measured as the average transaction queue occupancy. The results of these measurements are shown in Fig. 5(a). Compared with the baseline, transaction queue occupancy is reduced by up to 93.3% in the CPU regions, up to 93.1% in the GPU regions, and up to 86.6% in the heterogeneous regions. The reduction in heterogeneous regions is lower, since this type of region presents larger memory contention (more cores issuing memory requests). In all cases, the occupancy reduction demonstrates significantly lower levels of memory interference.
Similar to the formulation developed to obtain Bw dependence between the CPU and GPU regions in the heterogeneous regions-represented by (11) and (12), we obtain latency by using a similar proportionality factor. This estimation is shown in Fig. 5(a) , where significantly higher levels of latency exhibit the behavior previously predicted.
To summarize, we observe that for isolated CPU or GPU regions, as well as for merged ones, an increase in the number of MCs benefits performance by improving memory parallelism. Combined CPU/GPU sets need more Bw to achieve the same levels of speedup of isolated ones.
F. Creation, Isolation, and Reorganization
In this section, operations of creation, isolation, and reorganization are discussed. Various degrees of parallelism, starting from 2 MCs (baseline) to 32 MCs, were tested. To perform the evaluation, each region is created with a different number of MCs dedicated to it. For example, in Fig. 3(d) , each vertical bar corresponding to an MC configuration at the X-axis corresponds to a different region set with different numbers of MCs. These regions were created or reorganized assuming the user requested the experimented MC settings (X-axis). Similarly, different regions are present when performing the evaluation of GPU regions. Fig. 3(d) illustrates the reorganization operation assumed (Section IV-A) as a sequence of tasks created and terminated. In this example, after a first region is created, it is reorganized to have the number of MCs double as the previous. This process is repeated until the maximum number of MCs is achieved, after which it is terminated.
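The reorganization sequence described above, in which a region is repeatedly reorganized to double its MC count until the maximum is reached, can be sketched as a simple schedule. The function name is hypothetical, and the sketch covers only the MC counts, not the crossbar reconfiguration itself.

```python
def reorganization_schedule(start_mcs: int = 2, max_mcs: int = 32) -> list[int]:
    """MC counts visited when a region is doubled from start_mcs up to max_mcs."""
    if start_mcs <= 0 or max_mcs < start_mcs:
        raise ValueError("need 0 < start_mcs <= max_mcs")
    counts = []
    mcs = start_mcs
    while mcs <= max_mcs:
        counts.append(mcs)  # region created or reorganized with this MC count
        mcs *= 2            # next reorganization doubles the MC count
    return counts


schedule = reorganization_schedule()
```

Starting from the 2-MC baseline, the schedule visits 2, 4, 8, 16, and 32 MCs, which matches the range of MC configurations evaluated in this section.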
Results showing Bw improvements for each type of region demonstrate that Bw significantly increases with an increase of the number of MCs. Isolation property backs up these results as shown in Section III-C; otherwise, regions would present much larger memory and network traffic for CPU, GPU, and heterogeneous regions.
G. Energy
Due to space considerations, only the energy-per-bit results for the STREAM benchmarks are shown in Fig. 5. Fig. 5 shows that energy-per-bit decreases as the number of MCs increases, which follows Marino's report [12] and can be intuitively explained by dividing power usage over proportionally increasing Bws. In addition, as noted in the latency analysis in the previous section, the memory system is active for a shorter time, thus reducing the energy-per-bit spent. A similar behavior, though with different slopes, is observed for the remaining benchmarks.
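The intuition of dividing power usage over the Bw can be made concrete with a small sketch. The function name and the power and Bw values are hypothetical and chosen only to show the arithmetic; they are not results from the experiments.

```python
def energy_per_bit(total_power_w: float, bandwidth_bytes_per_s: float) -> float:
    """Energy per transferred bit in joules: power divided by bits per second."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return total_power_w / (bandwidth_bytes_per_s * 8)


# Hypothetical baseline: 4 W of memory power while sustaining 2 GB/s.
baseline = energy_per_bit(4.0, 2e9)

# Hypothetical scaled configuration: power doubles while Bw quadruples.
scaled = energy_per_bit(8.0, 8e9)
```

In this made-up case the energy-per-bit is halved, because Bw grows faster than power; the actual slopes depend on the benchmark.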
H. Important Points: Design Guidance Outcomes
Importantly, as a guide to the design of future heterogeneous industry processors and as mentioned in Section I, even when adopting optical or RF technology, the drop in Bw/latency and IPC is significant (roughly 60%) when having more heterogeneous processors, as observed in Fig. 3(a)-(d) and in the previous homogeneous/heterogeneous results. Furthermore, different regions allow interference to be reduced (shown as lower levels of occupancy and higher levels of memory Bw).
We conclude that the contention bottleneck on the PCIe bus, due to communication between traditional CPUs and GPUs, is shifted to the memory system in heterogeneous microprocessors and is still present in scalable memory systems, given the drop in Bw/IPC.
A second important outcome is that the utilization of up to 32 MCs (compared with 8 MCs in [15]) as a straightforward approach to improve Bw/IPC (Section I) is significantly beneficial. For homogeneous and heterogeneous cores, roughly 1.6-4.2 times better Bw/IPCs are obtained. Importantly, the use of more MCs addresses the bottleneck analysis mentioned in [15], i.e., when more cores are present.
V. SENSITIVITY ANALYSIS
Our sensitivity analysis is designed to assess the impact of several key aspects: 1) number of MCs and cores; 2) creation, isolation, and reorganization operations; 3) minimum acceptable performance degradation; and 4) benchmarks.
A. Number of MCs and Number of Cores
Different levels of memory parallelism are estimated by experimenting with a different number of MCs assigned to each region (2-32 MCs).
Via properly designed pins, optical/RF technologies allow the use of larger MC counts, which potentially enables a larger number of cores. Aiming for the highest memory parallelism possible, the number of MCs is scaled with the number of cores to maintain the core:MC ratio (we used 32 MCs/32 ranks).
If the number of cores in either CPU, GPU, or heterogeneous regions is increased, pairs of MCs and ranks can be added to match the core:MC proportion as previously discussed. Using this strategy, increasing the number of MCs to larger magnitudes is an attractive solution for either homogeneous or heterogeneous systems with a larger number of cores. In Marino and Li's report [15], the number of MCs selected is restricted to 8 MCs, whereas we went further, up to 32 MCs, in order to explore larger MC counts.
B. Creation, Isolation, and Reorganization Operations
These operations require the crossbar network to be configured accordingly. Regardless of the type of MC interface (optical/RF), the registers and XORs that perform modulo operations, as well as the registers and comparators required to perform region boundary checks, grow linearly with the number of regions. As previously mentioned, a tradeoff investigation between the area of these elements and the number of required regions in CPU, GPU, or heterogeneous applications is required to establish Bw design requirements.
C. Acceptable Performance Degradation and Crossbar
As we have demonstrated, the increase in the number of MCs for memory-bound applications increases Bw while reducing interference. In order to be used with a larger number of MCs, cache-related structures should allow a larger amount of memory traffic through, which can be achieved via employing a low-transmission delay optical/RF crossbar set as in [12] and [16] . As we have multicores/many cores, by utilizing replicated MSHR structures [29] , we guarantee that traffic is not restricted or reduced. Both previous conditions are likely to avoid significant Bw degradation.
D. Benchmarks
The benchmarks utilized here are composed of Bw-bound applications to evaluate the behavior of each type of region, as well as several degrees of memory parallelism via a different number of MCs. However, if less intensive (lower MPKIs) Bw-bound applications or benchmarks that are not Bw-bound are employed, the benefits of the increase in the number of MCs are likely to be comparatively lower.
A design space exploration can determine the ratio between the number of MCs-for CPU, GPU, or heterogeneous regions-and the number of cores to achieve a desired Bw. Such exploration is beyond this paper and fundamentally depends on the Bw level and QoS aimed.
VI. RELATED WORK
The work by Muralidhara et al. [5] proposes to map application data that are likely to severely interfere with each other on different channels and to combine channel partitioning with scheduling. RAMON is orthogonal to that work in the sense that, as it aims to reduce memory interference on heterogeneous regions, different numbers of channels are created. However, RAMON could be coupled to Muralidhara's techniques with a different scheduling and channel partitioning.
In the work by Xie et al. [6] , memory banks are dynamically partitioned according to thread utilization profiling. Configurable memory regions in RAMON with the isolation property can reduce memory interference of other regions.
Despite RAMON's integration with the OS/user OpenCL environment, it could be combined with Xie et al.'s approach [6] in order to create configurable memory regions based on the utilization profile.
The research by Janz et al. [7] proposes a software scheduling framework where an application interacts with the OS to determine its memory address space dynamic footprint utilization. Instead, RAMON proposes dynamic creation and reconfiguration of regions with a different number of MCs in order to reduce memory interference, according to the OS or the application (CUDA), where Janz's framework fits.
Ausavarungnirun et al. [8] propose a different MC management, which groups memory requests according to row-buffer locality first, then interapplication scheduling, and finally FIFO scheduling. RAMON's approach is orthogonal to Ausavarungnirun et al.'s approach [8] , where individual regions creation or reconfiguration could be triggered by the latter.
Kayiran et al. [21] propose an integrated management technique that alleviates the GPU contention on shared resources when on a heterogeneous environment. Although RAMON's techniques of isolation and creation reduce the interference of memory regions, by reconfiguring the crossbar, Kayiran et al.'s [21] management technique could be coupled to RAMON in order to create/reconfigure regions.
Jeong et al. [9] investigate a QoS mechanism that keeps GPU workloads to dynamically adapt CPU and GPU priorities. RAMON's approach is orthogonal to the latter, i.e., it could be used to create/reconfigure regions to fit different Bw needs using different MC settings according to the QoS priority.
Usui et al. [33] extended Jeong et al.'s approach [9] to hardware accelerators, trading off Bw and latency for QoS applications. RAMON is orthogonal to the latter and could change the number of MCs in regions to match different hardware accelerators and trade off latency/Bw.
Memscale [18] is a set of software and hardware mechanisms, which includes OS policies and hardware power techniques to trade off memory energy and performance in typical memory systems. RAMON shares with Memscale the similarity of adopting Bw estimations independently. In the former, each MC Bw is estimated independently, while in RAMON, Bw is estimated on each independent region. Nevertheless, RAMON can have different regions and each of them can have a variable number of MCs assigned according to the Bw demands.
The investigation by Zhang et al. [34] proposes the utilization of a variation-aware MC scheme that explores the utilization of memory chunks with different access times. RAMON's approach is orthogonal to the former approach, since the proposed scheme can potentially explore not only different variation aspects but also different Bw demands.
Similarly, we share the same interface and RF-technology principles of RFiop [16] and RFiof [12] , as well as the optical technology approaches [14] when increasing the number of MCs. In this paper, the concept of regions and their properties are used to evaluate the impact of different degrees of memory parallelism for different types of regions.
The work by Marino and Li [15] approaches multicore Bw challenges by increasing the number of MCs in traditional digital-based embedded systems, which have a restricted MC count due to high I/O pin usage. In contrast, RAMON approaches the scalability of MCs using optical- and RF-based interfaces that enable a significantly larger number of MCs than in traditional digital systems such as the one in [15]. Additionally, in RAMON, MC region awareness is highlighted while presenting a new low-area-overhead optical-crossbar design that enables the creation of regions with isolation/reorganization properties. Moreover, a significant extension of the Bw model initially proposed in [15] is performed with a new detailed experimental investigation that considers energy and processor IPC as a measure of throughput/performance for a significantly larger number of cores.
VII. CONCLUSION AND FUTURE WORK
In this paper, RAMON approaches heterogeneous multicore memory Bw demands by creating different regions of memory for CPUs and GPUs, or combined heterogeneous CPU/GPU regions, where different levels of memory parallelism are explored via different numbers of MCs. In order to afford these different levels of memory parallelism, each CPU, GPU, or heterogeneous region should provide isolation, i.e., should have its own memory space. Isolation segregates memory accesses by region and avoids memory interference from other regions, also avoiding larger amounts of memory contention and related memory traffic. The lower pin count of optical- and RF-based interfaces enables RAMON to achieve significantly higher levels of memory parallelism when compared with traditional memory systems with few MCs. Our findings show that a larger number of MCs contributes to performance improvement more than region isolation.
An important open issue for industry and research designs is the performance drop (in Bw or IPC) of about 60% when using heterogeneous cores sharing the same memory address space, even with a highly scalable memory design.
As observed in this paper, the Bw improvement has a direct impact on processor performance improvement. Since different algorithms demand significantly different degrees of memory parallelism, we leave the integration of RAMON with user/OS scheduling strategies [5] [6] [7] and a detailed tradeoff analysis of the crossbar area/circuits based on application behaviors as directions for future research.
Given that memory Bw challenges are a very active area of investigation, we also plan to evaluate the user-feedback scheduling proposed in Section III-A for applications with constant and variable Bw demands. As future challenges, an analysis of the tradeoffs involved in the crossbar when switching region operations (creation, isolation, and reorganization) could be coupled with the dynamic region determination and reconfiguration of RAMON required in IoT, scientific, and commercial applications aiming to improve Bw. Moreover, we plan to evaluate regions when introducing other types of cores that run applications with different memory Bw requirements.
