Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs in a way that permits both regular memory access by some applications and accelerating others with an NDA, avoids copying data, enables collaborative processing, and simultaneously offers high performance for both host and NDA. We identify and solve new challenges in this context: mitigating row-locality interference from host to NDAs, reducing read/write-turnaround overhead caused by fine-grain interleaving of host and NDA requests, architecting a memory layout that supports the locality required for NDAs and sophisticated address interleaving for host performance, and supporting both packetized and traditional memory interfaces. We demonstrate our approach in a simulated system that consists of a multi-core CPU and NDA-enabled DDR4 memory modules. We show that our mechanisms enable effective and efficient concurrent access using a set of microbenchmarks, and then demonstrate the potential of the system for the important stochastic variance-reduced gradient (SVRG) algorithm.
Introduction
Near data accelerators (NDAs) are attractive for applications with low temporal locality and low arithmetic intensity. NDAs (a.k.a. processing in/near memory) help by bringing computation close to data, saving power and utilizing proximity to overcome the bandwidth bottleneck of a main memory "bus" (e.g., [68, 38, 22, 39, 55, 33, 24, 19, 2, 1, 6, 21, 5, 45, 10] ). Despite decades of research, many challenges remain.
In this paper we address several of these outstanding issues in the context of an NDA-enabled main memory that can be concurrently accessed both as an NDA and as a memory, and that can collaboratively process data with the host without data copies. Furthermore, we focus on NDAs that perform coarse-grain operations across entire arrays without blocking host access to memory, even when these memory devices are controlled directly by the host (e.g., a DDRx-like DIMM). Figure 1 illustrates an exemplar NDA architecture, which presents the challenges we address, and is similar to other recently-researched main-memory NDAs [19, 6, 5] . We choose a DIMM-based memory system because it offers the high capacity required for a high-end server's main memory. Each DIMM is composed of multiple chips, with one or more DRAM dice stacked on top of a logic die in each chip, using a low-cost commodity 3DS-like approach. Processing elements (PEs) and a memory controller are located on the logic die. Each PE can access memory internally through the NDA memory controller. These local NDA accesses must not conflict with external accesses from the host (e.g., a CPU). A rank that is being accessed by the host cannot at the same time serve NDA requests, though the bandwidth of all other ranks in the channel can be used by the NDAs. There is no communication between PEs other than through the host.
There are two key challenges in enabling this architecture, which have not been addressed by prior work. First, interleaved accesses may hurt memory performance because they can both decrease row-buffer locality and introduce additional read/write turnaround penalties. Second, each NDA can process kernels that consume entire arrays, though all the data a single operation processes must be local to a PE (a memory chip). Therefore, enabling cooperative processing requires that host physical addresses are mapped to memory locations (channel, rank, bank, etc.) in a way that both achieves high host-access performance (through effective and complex interleaving) and maintains NDA locality across all elements of all operands of a kernel. We note that these challenges exist when using either a packetized interface, where the memory-side controller interleaves accesses between NDAs and the host, or a traditional host-side memory controller that sends explicit low-level memory commands. host and NDAs [19] . We control interference on shared ranks by opportunistically issuing NDA memory commands to those ranks that are even briefly not used by the host and curb NDA-to-host interference with mechanisms that can throttle NDA requests, either selectively when we predict a conflict (next-rank prediction) or stochastically.
For the second challenge, we enable fine-grain collaboration by architecting a new data layout that can be simultaneously used by both the high-performance host and NDAs, preserving locality of operands within the distributed NDAs. This layout requires minor modifications to the memory controller and utilizes coarse-grain allocations and physical-frame coloring in OS memory allocation. This combination allows large arrays to be shuffled across memory devices (and their associated NDAs) in a coordinated manner such that they remain aligned in each NDA. This is crucial for coarse-grain NDA operations that can achieve higher performance and efficiency than cacheline-oriented fine-grain NDAs (e.g., [2, 35, 28] ).
A third challenge exists in systems where the host maximizes its memory performance by directly controlling memory devices, because adding NDA capabilities requires providing local memory controllers near memory in addition to the host ones. We coordinate memory controllers and ensure a consistent view of bank and timing state by combining minimal signaling with replication of the controller finite state machine (FSM). Replicating the FSM requires all NDA accesses to be determined only by the NDA operation (known to the host controller) and any host memory operations. Thus, no explicit signaling is required from the NDAs back to the host. We therefore require that for non-packetized NDAs, each NDA operation has a deterministic access pattern for all its operands (which may be arbitrarily fine-grained).
We perform a detailed evaluation both when the host and NDAs process different data and when they collaborate on a single application. We demonstrate that CHoNDA enables high NDA memory throughput (up to 97% of unutilized bandwidth) while maintaining host performance. Performance and scalability are better than with prior approaches of partitioning ranks and only allowing coarse-grain temporal interleaving, or with only fine-grain NDA operations.
We demonstrate the potential of host and NDA collaboration by studying a machine-learning application (logistic regression with stochastic variance-reduced gradient descent [31] ). We map this application to the host and NDAs such that the host stochastically updates weights in a tight inner loop that utilizes the speculation and locality mechanisms of the CPU while NDAs concurrently compute a correction term across the entire input data that helps the algorithm converge faster. Collaborative and parallel NDA and host execution can speed up this application by 2× compared to host-only execution and 1.6× compared to non-concurrent host and NDA execution.
In summary, we make the following main contributions:
• We identify new challenges in concurrent access to memory from the host and NDAs: bank conflicts from host accesses curb NDA performance and read/write-turnaround penalties from NDA writes lower host performance.
• We reduce bank conflicts with a new bank partitioning architecture that, for the first time, is compatible with both huge pages and sophisticated memory interleaving.
• To decrease read/write-turnaround overheads, we throttle NDA writes with two mechanisms: next-rank prediction delays NDA writes to the rank actively read by the CPU; and stochastic issue throttles NDA writes randomly at a configurable rate.
• We develop, also for the first time, a memory data layout that is compatible with both the host and NDAs, enabling them to collaboratively process the same data in parallel while maintaining high host performance with sophisticated memory address interleaving.
• To show the potential of collaboratively processing the same data, we conduct a case study of an important ML algorithm that leverages the fast CPU for its main training loop and the high-bandwidth NDAs for summarization steps that touch the entire dataset. We develop a variant that executes on the NDAs and CPU in parallel, which increases speedup to 2×.
Background
DRAM Basics. A memory system is composed of memory channels that operate independently. In each memory channel, one or more memory modules (DIMMs) share a command/address (C/A) bus and a data bus. A DIMM is usually composed of one or two physical ranks, where all chips in the same rank operate together. Each chip, and thus each rank, is composed of multiple banks whose states are independent. Each bank is either open or closed and, if open, holds one particular open row. To access a certain row, the target row must be opened first. If another row is already open, it must be closed before the target row is opened; this is called a bank conflict and increases access latency. The DRAM protocol specifies the timing parameters and command sequences for accessing DRAM. These are managed by a per-channel memory controller.
Address Mapping. The memory controller translates OS-managed physical addresses into DRAM addresses, which are composed of indices to channel, rank, bank, row, and column. Typically, memory controllers apply the following policies in their address mapping to minimize access latency. Addresses are interleaved across channels at fine granularity because channels can be accessed independently of each other. Ranks, on the other hand, are interleaved at coarse granularity because switching to another rank in the same channel incurs a penalty. In addition, XOR-based hash mapping functions are used when determining channel, rank, and bank addresses to maximally exploit bank-level parallelism. This also minimizes bank conflicts when multiple rows are accessed with the same access pattern since the hash function shuffles the bank address order [75] . To accomplish this, some row address bits are used along with channel, rank, and bank address bits.
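As an illustration of XOR-based bank hashing, the following minimal Python sketch XORs each bank-index bit with one row-address bit. The bit positions are hypothetical and chosen only for this example; they do not correspond to the mapping of any specific memory controller.

```python
def xor_hash_bank(phys_addr, bank_bits=(13, 14, 15, 16), row_bits=(17, 18, 19, 20)):
    """Return a 4-bit bank index; each bank bit is XORed with one row bit.

    bank_bits and row_bits are hypothetical bit positions used for illustration.
    """
    bank = 0
    for i, (b, r) in enumerate(zip(bank_bits, row_bits)):
        bit = ((phys_addr >> b) & 1) ^ ((phys_addr >> r) & 1)
        bank |= bit << i
    return bank


# Sixteen addresses with identical bank-field bits but different row bits:
# without hashing they would all conflict in bank 0.
banks = [xor_hash_bank(row << 17) for row in range(16)]
print(banks)
```

In this sketch, accesses that stride across rows with the same bank field are spread over all sixteen banks instead of repeatedly conflicting in one bank.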
Write-to-Read Turnaround Time. In general, interleaving read and write DRAM transactions incurs higher latency than issuing the same transaction type back to back. Issuing a read transaction immediately following a write suffers from particularly high penalty. The memory controller issues the write command and loads data to the bus after tCWL cycles. Then, data is transferred for tBL cycles to the DRAM device and written to the cells. The next read command can only be issued after tWTR cycles, which guarantees no conflict on the IO circuits in DRAM. The high penalty stems from the fact that the actual write happens at the end of the transaction whereas a read happens right after it is issued. For this reason, the opposite order, read to write, has lower penalty.
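The write-to-read delay described above can be sketched as the sum of the three timing parameters named in the text. The cycle counts below are illustrative assumptions, not values taken from a datasheet or from our evaluated configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    tCWL: int  # cycles from write command to first write data on the bus
    tBL: int   # cycles the data burst occupies the bus
    tWTR: int  # cycles from end of write data to the next read command


def write_to_read_delay(t: Timing) -> int:
    """Minimum cycles between a write command and a following read command."""
    return t.tCWL + t.tBL + t.tWTR


# Illustrative (assumed) cycle counts.
example = Timing(tCWL=12, tBL=4, tWTR=9)
print(write_to_read_delay(example))
```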
NDA Basics. Near-data acceleration adds processing elements near memory to overcome the physical constraints of the host CPU accessing data over limited-bandwidth channels.
Since memory channels are independent, host peak memory bandwidth is determined by the number of channels and peak bandwidth per channel. However, the number of ranks in the system does not affect the peak memory bandwidth of the host since only one rank per channel can transfer data to the host at any given time over the shared bus. On the other hand, near-data accelerators (NDAs) can access data internally without contending for the shared bus. This enables higher peak bandwidth than the host can achieve. However, because NDAs can only access data in their local memory, data layout is crucial for performance. A naive layout may result in frequent data movement among NDAs. In this paper, we assume that inter-NDA communication is only done through the host (alternatives are discussed in [36, 59] ).
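The bandwidth argument above can be made concrete with a small calculation. This sketch assumes, for illustration only, that the NDAs in each rank can draw the same bandwidth internally that one channel offers the host; the channel count, rank count, and per-channel bandwidth are example values.

```python
def host_peak_bw(channels, channel_bw_mbps):
    """Host peak bandwidth: only one rank per channel can drive the shared bus."""
    return channels * channel_bw_mbps


def nda_peak_bw(channels, ranks_per_channel, rank_bw_mbps):
    """NDA peak bandwidth: every rank is accessed internally and in parallel."""
    return channels * ranks_per_channel * rank_bw_mbps


# Example values (assumed): 2 channels, 4 ranks per channel,
# 19200 MB/s per channel (2400 MT/s on a 64-bit bus).
print(host_peak_bw(2, 19200), nda_peak_bw(2, 4, 19200))
```

Under these assumptions, adding ranks raises the NDA peak but leaves the host peak unchanged.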
Baseline NDA Architecture. Our work targets NDAs that are integrated within high-capacity memory modules such that their role as both main memory and as accelerators is balanced. Specifically, our baseline NDA devices are 3D-integrated within DRAM chips on a module (DIMM), similar to 3DS DDR4 [15] yet a logic die is added. DIMMs offer high capacity and predictable memory access. Alternatively, NDAs can utilize high-bandwidth devices, such as the hybrid memory cube (HMC) [57] or high bandwidth memory (HBM) [67] . These offer high internal bandwidth but have limited capacity and high cost due to numerous point-to-point connections to memory controllers [6] . HMC provides capacity scaling via a network but this results in high access latency and cost. HBM does not provide such solutions. As a result, these devices are better for standalone accelerators than for main memory.
Coherence. When two processors read and write shared memory regions concurrently, coherence needs to be maintained to avoid race conditions. Coherence mechanisms between the host and NDAs have been studied in prior NDA work [2, 10, 11] and can be used as is with CHoNDA. We therefore do not focus on coherence in this paper. In our experiments, we use the existing coherence approach of explicitly and infrequently copying the small amount of data that is not read-only using cache bypassing and memory fences.
Address Translation. Before the host or the NDAs access memory, logical-to-physical address translation must be performed. One possible approach is to have the host OS perform address translation for all host and NDA accesses. Alternatively, prior work [29, 27] performs address translation at the NDAs to enable independent NDA execution without host assistance. In this paper, we choose the first approach, where the host has direct control over NDAs.
NDA Workloads. We focus on NDA workloads for which the host inherently cannot outperform an NDA. These exhibit low temporal locality and low arithmetic intensity and are bottlenecked by peak memory bandwidth. By offloading such operations to the NDA, we mitigate the bandwidth bottleneck by leveraging internal memory module bandwidth. Moreover, these workloads usually require simple logic for computation and integrating such logic within DRAM chips/modules is practical because of the low area and power overhead. Fundamental linear algebra matrix and vector operations satisfy these criteria. Dense matrix and vector operations are particularly good candidates for NDA execution because of their deterministic and regular memory access patterns. Representative examples include low arithmetic-intensity linear algebra kernels and machine learning primitives. In this paper, we focus on accelerating the dense matrix and vector operations summarized in Table 1 . We demonstrate and evaluate their use in the SVRG application in Section 4.
Graph processing has also been strongly considered for NDA execution because it can be bottlenecked by peak memory bandwidth due to its low temporal and spatial locality [52, 74, 66, 1, 2] . We do not consider graph processing in this paper, however, because we do not innovate in this context.
CHoNDA
We develop CHoNDA with four main connected goals that push the state of the art: (1) enable fine-grain interleaving of host and NDA memory requests to the same physical memory devices while mitigating the impact of their contention; (2) permit the use of coarse-grain NDA operations that process long vector instructions/kernels; (3) simultaneously support the locality needed for NDAs and the sophisticated memory address interleaving required for high host performance; and (4) integrate with both a packetized interface and a traditional host-controlled DDRx interface. We detail our solutions in this section, first briefly summarizing the need for a new approach.
The need for fine-grain interleaving with opportunistic NDA issue. An ideal NDA opportunistically issues NDA memory requests whenever a rank is idle from the perspective of the host. This is simple to do in a packetized interface where a memory-side controller schedules all accesses, but is a challenge in a traditional memory interface because the host- and NDA-side controllers must be synchronized. Prior work proposed dedicating some ranks to NDAs and some to the host or coarse-grain temporal interleaving. The former approach contradicts one of our goals as devices are not shared. The latter results in large performance overhead because it cannot effectively utilize periods where a rank is naturally idle due to the host access pattern and thus throttles the host. Figure 2 shows that for a range of multi-core application mixes (methodology in Section 6), the majority of idle periods are shorter than 100 cycles with the vast majority under 250 cycles. Fine-grain interleaving is therefore necessary.
The need for coarse-grain NDA vector/kernel operations. Fine-grain interleaving is simple if each NDA command only addresses a single cache block region of memory. Such fine-grain NDA operations have indeed been discussed in prior work [2, 1, 42, 52] . One overhead of this fine-grain approach is that of issuing numerous NDA commands, with each requiring a full memory transaction that occupies both the command and data channels to memory. Issuing NDA commands too frequently degrades host performance, while infrequent issue underutilizes the NDAs. Coarse-grain NDA vector operations that operate on multiple cache blocks mitigate contention on the channel and improve overall performance. The vector width, N, is specified for each NDA instruction. As long as the operands are contiguous in the DRAM address space, one NDA instruction can process numerous data elements without occupying the channel. Coarse-grain NDA operations are therefore desirable, but introduce the data layout, memory contention, and host-NDA synchronization challenges which CHoNDA solves.
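To give a rough sense of the command-traffic difference, the sketch below counts how many NDA commands must cross the channel to cover an array. It assumes, purely for illustration, that the vector width is expressed in cache blocks per instruction; the array size and width are example values.

```python
def nda_commands(num_cache_blocks, blocks_per_instruction):
    """Number of NDA commands needed to cover an array (ceiling division).

    blocks_per_instruction = 1 models fine-grain, cache-block-sized commands.
    """
    return -(-num_cache_blocks // blocks_per_instruction)


blocks = (1 << 20) // 64  # a 1 MiB operand split into 64-byte cache blocks
print(nda_commands(blocks, 1))    # fine-grain: one command per cache block
print(nda_commands(blocks, 256))  # coarse-grain: assumed width of 256 blocks
```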
Localizing NDA Operands while Distributing Host Accesses
To execute the N-way NDA vector instructions, all the operands of each NDA instruction must be fully contained in a single rank. If necessary, data is first copied from other ranks prior to launching an NDA instruction. If the reuse rate of the copied data is low, this copying overhead will dominate the NDA execution time and contention on the memory channel increases because of NDA commands. We solve this problem in CHoNDA by laying out data such that all the operands are localized to each NDA at memory allocation time. Thus, copies are not necessary. This is challenging, however, because the host memory controller uses complex address interleaving functions to maximally exploit channel, rank, and bank parallelism for arbitrary host access streams. Hence, arrays that are contiguous in the host physical address space are not contiguous in physical memory and are shuffled across ranks, possibly in a physical-address-dependent manner. This problem is illustrated in the left side of Figure 3 , where the two operands of an NDA instruction are shuffled differently across ranks and banks. The layout resulting from our approach is shown at the right of the figure, where arrays (operands) are still shuffled, but both operands follow the same pattern and remain correctly aligned to NDAs without copy operations. Note that alignment is to rank because that corresponds to an NDA partition.
Data layout across ranks.
We rely on software (runtime and OS) to use a combination of coarse-grain memory allocation and coloring for operands to ensure all operands of an NDA instruction are shuffled the same way. We allocate memory for NDA operands such that they are aligned at the granularity of one DRAM row for each bank in the system, which we call a system row (e.g., 2MB for a DDR4 1TB system). For all the interleaving mechanisms we are aware of ([58, 47]), this ensures that NDA operands are locally aligned, as long as ranks are also kept aligned. We use page coloring to effect rank alignment. We explain this below using the Intel Skylake address mapping [58] (Figure 4a ) as a concrete and representative interleaving mapping.
In this mapping, rank and channel addresses are determined partly by the low-order bits that fall into the frame offset field and partly by the high-order bits that fall into the physical frame number (PFN) field. Frame offsets are kept the same because of the coarse-grain alignment. The OS colors allocations such that the PFN bits that determine rank and channel are aligned for a particular color; which physical address bits select ranks and channels can be reverse engineered if necessary [58] . The CHoNDA runtime indicates a shared color when it requests memory from the OS and specifies the same color for all operands of an instruction. The runtime can use the same color for many operands to minimize copies needed for alignment. In our baseline system, there are 8 colors and each color corresponds to a shared region of memory of 4 GiB. Multiple regions can be allocated for the same process. Though we focus on one address mapping here, our approach works with any linear address mapping described in prior work [58, 47] as well.
Note that coarse-grain allocation is simple with the common buddy allocator if allocation granularity is also a system row, and can use optimizations that already exist for huge pages [72, 40, 23] . The fragmentation overheads of coarse allocation are similar to those with huge pages and we find that they are negligible because coarse-grain NDA execution works best when processing long vectors.
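The two allocation conditions described above (system-row alignment and a shared color) can be sketched as a simple check. The row size, bank, rank, and channel counts, and the color bit positions below are hypothetical values chosen for illustration; the actual bits that select rank and channel depend on the address mapping in use.

```python
def system_row_bytes(row_bytes, banks_per_rank, ranks_per_channel, channels):
    """One DRAM row for every bank in the system."""
    return row_bytes * banks_per_rank * ranks_per_channel * channels


def color_of(phys_addr, color_bits=(21, 22, 23)):
    """Color formed from the PFN bits that select rank/channel (assumed positions)."""
    return sum(((phys_addr >> b) & 1) << i for i, b in enumerate(color_bits))


def can_share_instruction(operand_bases, sys_row, color_bits=(21, 22, 23)):
    """True if all operands are system-row aligned and have the same color."""
    aligned = all(base % sys_row == 0 for base in operand_bases)
    same_color = len({color_of(b, color_bits) for b in operand_bases}) == 1
    return aligned and same_color


# Assumed geometry: 8 KiB rows, 16 banks per rank, 4 ranks, 4 channels.
SYS_ROW = system_row_bytes(8192, 16, 4, 4)
print(SYS_ROW, can_share_instruction([0, 8 * SYS_ROW], SYS_ROW))
```

With three color bits this sketch yields eight colors; operands that fail the check would need a copy before they can be used in the same NDA instruction.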
Data layout across DRAM chips. In the baseline system, each 4-byte word is striped across multiple chips, whereas in our approach each word is located in a single chip so that NDAs can access words from their local memory. Both the host and NDAs can access memory without copying or reformatting data (as required by prior work [19] ). Memory blocks still align with cache lines, so this layout change is not visible to software. Note that this data layout does not impact the host memory controller's ECC computation (e.g. Chip-kill [17] ) because ECC protects only bits, not how they are interpreted.
Mitigating Frequent Read/Write Penalties
The basic memory access scheduling policy we use for CHoNDA is to always prioritize host memory requests, yet aggressively leverage unutilized rank bandwidth by issuing NDA requests whenever possible. That is, NDAs wait when incoming host requests are detected but, otherwise, always issue their memory requests to maximize their bandwidth utilization and performance. One potential problem is that an NDA request issued in one cycle may delay a host request that could have issued in one of the following cycles otherwise.
We find that read transactions of NDAs have only a small impact on following host commands and that row commands (ACT and PRE) are issued infrequently by NDAs (in our linear algebra context, at least). We prioritize host memory commands over DRAM row commands to the same bank that originate at the NDAs. This has negligible impact on NDA performance in our experiments.
NDA write transactions, however, can have a large impact on host performance because of the read/write-turnaround penalties that they frequently require. The host mitigates turnaround overhead by buffering operations with caches and write buffers [69] . The host and NDAs may issue different types of transactions, however, which are then interleaved if both host and NDA run in parallel. We find that NDA writes interleaved with host reads degrade performance the most. We introduce two mechanisms to selectively throttle NDA writes.
Our first mechanism simply throttles the rate of NDA writes by issuing them with a predefined probability. We call this mechanism stochastic NDA issue. When NDAs detect rank idleness, they flip a coin to determine whether to issue a write transaction or not. By adjusting the probability, the performance of the host and NDAs can be traded off: higher probability leads to more frequent turnarounds while a lower probability throttles NDA progress. Deciding on how much to throttle NDAs requires analysis or profiling.
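Stochastic NDA issue can be sketched as a single biased coin flip per idle opportunity. The function and parameter names are hypothetical, and the probability value is an example rather than a tuned setting.

```python
import random


def stochastic_issue(rank_idle, issue_prob, rng):
    """Decide whether an NDA may issue a write in this cycle.

    rank_idle:  True if the host is not using the rank.
    issue_prob: configurable probability of issuing when the rank is idle.
    rng:        random.Random instance supplying the coin flips.
    """
    if not rank_idle:
        return False  # host requests always have priority
    return rng.random() < issue_prob


rng = random.Random(0)
issued = sum(stochastic_issue(True, 0.25, rng) for _ in range(10000))
print(issued)  # roughly a quarter of the idle opportunities
```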
Our second approach does not require tuning, and we find that it works well in our experiments. In this approach, the memory controller inhibits NDA write requests when more host read requests are expected; the controller stalls the NDA in lieu of providing an NDA write queue. In a packetized interface, the memory controller schedules both host and NDA requests and is thus aware of potential required turnarounds. The traditional memory interface, however, is more challenging as the host controller must explicitly signal the NDA controller to inhibit its write request. This signal must be sent ahead of the regular host transaction because of bus delays.
We use a very simple predictor that inhibits NDA write requests in a particular rank when the oldest outstanding host memory request to that channel is a read to that same rank. For now, we assume that this information is communicated over a dedicated pin and plan to develop other signaling mechanisms that can piggyback on existing host DRAM commands at a later time. Our experiments with an FR-FCFS [61] memory scheduler at the host show that this simple predictor works well and achieves performance that is comparable to a tuned stochastic issue approach.
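The next-rank predictor described above reduces to a check on the oldest outstanding host request in the channel. The sketch below models the host queue as a list ordered oldest first; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HostRequest:
    rank: int
    is_read: bool


def inhibit_nda_write(channel_queue: Sequence[HostRequest], nda_rank: int) -> bool:
    """Return True if NDA writes to nda_rank should be held back.

    channel_queue holds the outstanding host requests of one channel,
    ordered oldest first.
    """
    if not channel_queue:
        return False
    oldest = channel_queue[0]
    return oldest.is_read and oldest.rank == nda_rank


queue = [HostRequest(rank=1, is_read=True), HostRequest(rank=0, is_read=False)]
print(inhibit_nda_write(queue, nda_rank=1), inhibit_nda_write(queue, nda_rank=0))
```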
Partitioning into Host and Shared Banks
In addition to read/write-turnaround overheads, concurrent access also degrades performance by decreasing DRAM row access locality. When the host and NDAs interleave accesses to different rows of the same bank, frequent row conflicts occur. To avoid this bank contention, we propose using bank partitioning to limit bank interference to only those memory regions that must concurrently share data between the NDAs and the host. However, existing bank partitioning mechanisms [50, 30, 46] are incompatible with both huge pages and with sophisticated DRAM address interleaving schemes.
Existing schemes rely on the OS to color pages based on partitions where colors can be assigned to different cores or threads, or in our case, for banks isolated for the host and those that could be shared. The OS then maps pages of different color to frames that map to different banks. Figure 4a shows an example of a modern physical address to DRAM address mapping [58] . One color bit in the baseline mapping belongs to the page offset field so bank partitioning can, at best, be done at two-bank granularity. More importantly, when huge pages are used (e.g., 2MB), this baseline mapping cannot be used to partition banks at all.
To overcome this limitation, we propose a new interface that partitions banks into two groups-host-reserved and shared banks-with flexible DRAM address mapping and any page size. Specifically, our mechanism only requires that the most significant physical address bits are only used to determine DRAM row address, as is common in recent hash mapping functions, as shown in Figure 4b [58] .
Without loss of generality, assume 2 banks out of 16 banks are reserved for the shared data. First, the OS splits the physical address space for host-only and shared memory region with the host-only region occupying the bottom of the address space: 0 − (14 × (bank_capacity) − 1). The rest of the space (with the capacity of 2 banks) is reserved for the shared data and the OS does not use it for other purposes. This guarantees that the most significant bits (MSBs) of the address of host-only region are never b'111. In contrast, addresses in the shared space always have b'111 in their MSBs.
The OS informs the memory controller that it reserved 2 banks (the topmost banks) for the shared memory region. Host-only memory addresses are mapped to DRAM locations using any hardware mapping function, which is not exposed to software and the OS. The idea is then to remap addresses that initially fall into shared banks into the reserved address space that the host is not using. Additional simple logic checks whether the bank ID produced by the initial mapping is a bank reserved for the shared region. If it is not, the DRAM address is used as is. If the DRAM address is initially mapped to one of the reserved banks, the MSBs and the bank bits are swapped. Because the MSBs of a host address are never b'1110 or b'1111, the final bank ID will be one of the host-only bank IDs. Also, because the bank ID of the initial mapping result is 14 or 15, the final address is in a row the host cannot access with the initial mapping and there is no aliasing. Note that the partitioning decision can be adjusted, but only if all affected memory is first cleared.
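The remapping of host-only addresses can be sketched on a toy address space. The field widths and the XOR hash below are assumptions made for illustration (a single rank with 16 banks, 4 low row bits, and 2 column bits); only the swap of the MSBs with the bank bits follows the description above.

```python
BANK_BITS, ROW_LOW_BITS, COL_BITS = 4, 4, 2  # toy field widths (assumed)
RESERVED_BANKS = {14, 15}                    # banks set aside for shared data
HOST_LIMIT = 14 << (ROW_LOW_BITS + BANK_BITS + COL_BITS)  # 14 x bank capacity


def initial_map(addr):
    """Toy hardware mapping: the 4 MSBs feed only the row address."""
    col = addr & ((1 << COL_BITS) - 1)
    bank_field = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row_low = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_LOW_BITS) - 1)
    msb = addr >> (COL_BITS + BANK_BITS + ROW_LOW_BITS)
    bank = bank_field ^ row_low  # XOR hash that does not use the MSBs
    return msb, bank, row_low, col


def map_host(addr):
    """Map a host-only physical address to (bank, row, column)."""
    assert 0 <= addr < HOST_LIMIT, "not a host-only address"
    msb, bank, row_low, col = initial_map(addr)
    if bank in RESERVED_BANKS:
        msb, bank = bank, msb  # swap MSBs and bank bits
    row = (msb << ROW_LOW_BITS) | row_low
    return bank, row, col


mapped = [map_host(a) for a in range(HOST_LIMIT)]
print(all(bank not in RESERVED_BANKS for bank, _, _ in mapped))
print(len(set(mapped)) == len(mapped))  # no two addresses alias
```

In this toy model, swapped addresses land in rows whose top bits are 14 or 15, which no unswapped host address can reach, so the mapping stays one-to-one.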
Tracking Global Memory Controller State
Unlike conventional systems, CHoNDA also enables an architecture that has two memory controllers (MCs) managing the bank and timing state of each rank. This is the case when the host continues to directly manage memory even when the memory itself is enhanced with NDAs, which requires coordinating rank state information. Figure 5 shows how the MCs on both sides track global memory controller state. Information about host transactions is easily obtained by the NDA MCs as they can monitor incoming transactions and update the state tables accordingly (left). However, the host MC cannot track all NDA transactions due to command bandwidth limits.
To solve this problem, we replicate the finite-state machines (FSMs) of NDAs and place them in the host-side NDA controller. When an NDA instruction is launched, FSMs on both sides start at the same time. Whenever an NDA memory transaction is issued, the host-side FSM can also update the state table in the host MC without communicating with NDAs (right). Also, if a host transaction blocks NDA transactions in one of the ranks, that transaction will be visible to both FSMs and can stop the FSM operations until that rank becomes available. The area and power overhead of replicating FSMs are negligible (40-byte microcode store and 20-byte state registers per rank (i.e., per NDA)).
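The idea of FSM replication can be sketched with a deterministic access generator instantiated twice, once on each side. The class below is a hypothetical model, not the actual microcoded FSM; it only shows that two replicas that observe the same host activity produce the same NDA access sequence without any communication.

```python
from typing import List, Optional


class NdaAccessFsm:
    """Deterministic access generator for one NDA instruction (toy model)."""

    def __init__(self, base_column: int, num_accesses: int):
        self.next_column = base_column
        self.remaining = num_accesses

    def step(self, rank_busy_with_host: bool) -> Optional[int]:
        # Host commands are visible to both controllers, so both replicas
        # stall on exactly the same cycles.
        if rank_busy_with_host or self.remaining == 0:
            return None
        column = self.next_column
        self.next_column += 1
        self.remaining -= 1
        return column


def run(fsm: NdaAccessFsm, host_busy_trace: List[bool]) -> List[Optional[int]]:
    return [fsm.step(busy) for busy in host_busy_trace]


trace = [False, True, False, False, True, False]
nda_side = run(NdaAccessFsm(100, 3), trace)
host_side = run(NdaAccessFsm(100, 3), trace)
print(nda_side == host_side, nda_side)
```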
Host-NDA Collaboration
In this section, we conduct a case study to show the potential of concurrent host-NDA execution in which both collaboratively process the same data. The case study shows how to partition ML training tasks between the host and NDAs such that each processor leverages its strengths. It is also a good example because maintaining coherence requires only infrequent, low-overhead operations, while the host and NDAs independently access large, shared, read-only data whose access time dominates overall execution time.
We use logistic regression with stochastic variance reduced gradient (SVRG) [31] as a case study of CHoNDA's benefits for host-NDA collaboration. SVRG is a machine learning technique that enables faster convergence by reducing variance introduced by sampling. Figure 6 shows a simplified version of SVRG and the opportunity for collaboration. A large input matrix, A, is evenly partitioned into multiple chunks and stored in memory. The host samples a random element a within A in every inner loop iteration to update the learned model w. Other than the large input, other data (w, s, and g) takes advantage of the CPU caches. The tight inner loop is therefore ideally suited for high-end CPU execution.
The SVRG algorithm periodically calculates a correction term, g, by summarizing the entire input data (example code in Figure 8 ). Because the summarization operation is simple, exhibits little locality, and traverses the entire large input data, it is ideally suited for the NDAs. The term g is used for correcting error in the host workload, f. With CHoNDA, the host can maximally exploit locality captured by the LLC while NDAs can leverage their high bandwidth for accessing the entire input data A. In SVRG, the epoch refers to the number of inner loop iterations.
The main tradeoff in SVRG is as follows: when summarization is done more frequently, the quality of the correction term increases and, consequently, the per-step convergence rate increases. On the other hand, the overhead of summarization also increases which might offset the improved convergence rate. Therefore, the epoch hyper-parameter, which determines the frequency of summarization, should be carefully selected.
Delayed-Update SVRG. As CHoNDA enables concurrent access by the host and NDAs, we explore an algorithm change to leverage parallel collaboration. Instead of alternating between the summarization and the inner loop, we run them in parallel on the NDAs and the host, respectively. Whenever the NDAs finish computing the correction term, the host and NDAs exchange the correction term and the most up-to-date weights before continuing concurrent execution. However, this results in using stale s and g, which are one epoch behind. The main tradeoff in delayed-update SVRG is that per-iteration time improves by overlapping execution, whereas the convergence rate per iteration degrades due to staleness. Similar tradeoffs have been made in prior work [7, 41, 60, 16]. To avoid races on s and g in delayed-update SVRG, we maintain private copies of each of these variables and use a memory fence that guarantees completion of DRAM writes after the data-exchange step. Note that we bypass caches when accessing data produced or consumed by NDAs during the data-exchange step. Since s and g are small and copied infrequently, the overheads are small and amortized over numerous NDA computations. Whether delayed updates are used or not, the host and NDAs share the large data, A, without copies.
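The overlap and the private copies can be sketched in software as follows. This is only an analogy under stated assumptions: a single worker thread stands in for the NDAs, waiting on the future stands in for the memory fence of the data-exchange step, the model is binary logistic regression, and all names (`grad`, `summarize`, `delayed_update_svrg`) are hypothetical. The host inner loop always uses the s and g obtained at the previous exchange, so they are one epoch behind.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def grad(w, x, y):
    """Gradient of the logistic loss for one sample x with label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def summarize(s, A, labels):
    """Stand-in for the NDA summarization: average gradient over A at snapshot s."""
    g = [0.0] * len(s)
    for x, y in zip(A, labels):
        g = [a + b for a, b in zip(g, grad(s, x, y))]
    return [v / len(A) for v in g]


def delayed_update_svrg(A, labels, epoch, outer_iters, lr, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(A[0])
    s = list(w)                    # host-private snapshot
    g = summarize(s, A, labels)    # initial correction term (not overlapped)
    with ThreadPoolExecutor(max_workers=1) as nda:
        for _ in range(outer_iters):
            s_next = list(w)       # private copy handed to the "NDA"
            pending = nda.submit(summarize, s_next, A, labels)
            for _ in range(epoch):  # host inner loop runs concurrently
                i = rng.randrange(len(A))
                gw = grad(w, A[i], labels[i])
                gs = grad(s, A[i], labels[i])   # stale snapshot
                w = [wj - lr * (a - b + c) for wj, a, b, c in zip(w, gw, gs, g)]
            # Data-exchange step: wait for the summarization, then swap in
            # the new snapshot and correction term for the next epoch.
            g = pending.result()
            s = s_next
    return w
```

Because the worker only reads its own copy `s_next` and the shared, read-only input `A`, no locking is needed in this sketch; the only synchronization point is the exchange at the end of each epoch.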
Runtime and API
CHoNDA is general and helps whenever host/NDA concurrent access is needed. To make the explanations and evaluation concrete, we use an exemplary design as discussed below and summarized in Figure 7. Command and address signals pass through the NDA memory controllers so that they can track host rank state. Processing elements (PEs) in the logic die access data by using their local NDA memory controller (Figure 1). Figure 8 shows example usage of our API for computing an average gradient used in logistic regression. This is simply an example, and other calls, such as fine-grain NDA commands for graph processing, can easily be included [2, 42].
The CHoNDA runtime system manages memory allocations and launches NDA operations. NDA operations are blocking by default, but can also execute asynchronously. If the programmer calls an NDA operation with operands from different shared regions (colors), the runtime system inserts appropriate data copies. We envision a just-in-time compiler that can identify such cases and more intelligently allocate memory and regions to minimize copies. For this paper, we do not implement such a compiler. Instead, programs are written to directly interact with a runtime system that is implemented within the simulator.
NDAs operate directly on DRAM addresses and do not perform address translation. To launch an operation, the runtime (with help from the OS) translates the origin of each operand into a physical address, which is then communicated to the NDAs by the NDA controller. The runtime is responsible for splitting a single API call into multiple primitive NDA operations. The NDA operations themselves proceed through each operand with a regular access pattern implemented as microcode in the hardware.
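A minimal sketch of how a runtime might split one API call into primitive operations is shown below. The chunk size, the `translate` callback (a stand-in for the OS-assisted virtual-to-physical translation), and the function name are all assumptions for illustration; the text does not specify them. Each chunk is translated separately here, since a contiguous virtual range need not be physically contiguous.

```python
from typing import Callable, List, Tuple


def split_into_primitives(
    virt_base: int,
    num_bytes: int,
    max_bytes_per_op: int,
    translate: Callable[[int], int],
) -> List[Tuple[int, int]]:
    """Return (physical_base, size) pairs that together cover one operand."""
    ops = []
    offset = 0
    while offset < num_bytes:
        size = min(max_bytes_per_op, num_bytes - offset)
        # NDAs do no address translation, so they are given physical addresses.
        ops.append((translate(virt_base + offset), size))
        offset += size
    return ops
```

For example, with a hypothetical linear mapping `translate = lambda v: v + 0x1000`, a 2500-byte operand split into 1024-byte primitives yields three operations, the last covering the remaining 452 bytes.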
Optimization for Load-Imbalance. Load imbalance occurs when the host does not access ranks uniformly over short periods of time. The AXPY operation (launched repeatedly within the loop shown in Figure 8) is short, and non-uniform access by the host leads to load imbalance among NDAs. A blocking operation waits for all NDAs to complete before launching the next AXPY, which reduces performance. Our API provides asynchronous launches similar to CUDA streams. Asynchronous launches can overlap AXPY operations from multiple loop iterations. Any load imbalance is then only apparent when the loop ends. Over such a long time period, load imbalance is much less likely. We implement asynchronous launches using macro NDA operations. An example of a macro operation is shown in the loop of Figure 8 and is indicated by the parallel_for annotation.
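The effect of blocking versus asynchronous launches can be seen with a simple timing model. The model and its latencies are hypothetical and ignore launch overhead: each inner list holds the time every NDA needs for one launch. With blocking launches the slowest NDA gates every launch, whereas with asynchronous launches each NDA proceeds through its own sequence and only the end of the loop synchronizes.

```python
def blocking_time(per_launch_times):
    """Every launch waits for the slowest NDA before the next one starts."""
    return sum(max(launch) for launch in per_launch_times)


def async_time(per_launch_times):
    """Each NDA runs its launches back to back; synchronize once at the end."""
    num_ndas = len(per_launch_times[0])
    return max(
        sum(launch[n] for launch in per_launch_times) for n in range(num_ndas)
    )


# Made-up latencies for 3 launches on 2 NDAs; the host slows a different
# NDA in each of the first two launches.
times = [[1, 3], [3, 1], [2, 2]]
print(blocking_time(times))  # 3 + 3 + 2 = 8
print(async_time(times))     # max(1 + 3 + 2, 3 + 1 + 2) = 6
```

In this model the asynchronous time can never exceed the blocking time, and the gap grows when the slow NDA changes from launch to launch, which is the short-term non-uniformity described above.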
Launching NDA Operations. NDA operations are launched similarly to Farmahini et al. [19] . A certain memory region is reserved for accessing control registers of NDAs. NDA packets access the control registers and launch operations. Each packet is composed of the type of operation, the base addresses of operands, the size of data blocks, and scalar values required for scalar-vector operations. On the host side, the NDA controller plays two main roles. First, it accepts acceleration requests, issues commands to the NDAs in the different ranks (in a round-robin manner), and notifies software when a request completes. Second, it extends the host memory controller to coordinate actions between the NDAs and host memory controllers and enables concurrent access. It maintains the replicated FSMs using its knowledge of issued NDA operations and the status of the host memory controller.
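The packet contents and the round-robin issue order listed above can be sketched as follows. The field names, types, and the `RoundRobinIssuer` class are hypothetical; the actual packet encoding and control-register layout are not specified here, and completion notification and FSM coordination are omitted.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List, Tuple


@dataclass(frozen=True)
class NdaPacket:
    """Illustrative launch packet with the four components named in the text."""
    op: str                          # type of operation, e.g. "AXPY"
    operand_bases: Tuple[int, ...]   # base (physical) addresses of operands
    block_size: int                  # size of the data blocks, in bytes
    scalars: Tuple[float, ...] = ()  # scalar values for scalar-vector ops


class RoundRobinIssuer:
    """Host-side stand-in that hands packets to ranks in round-robin order."""

    def __init__(self, num_ranks: int) -> None:
        self._next_rank = cycle(range(num_ranks))
        self.issued: List[Tuple[int, NdaPacket]] = []

    def issue(self, packet: NdaPacket) -> int:
        rank = next(self._next_rank)
        self.issued.append((rank, packet))
        return rank
```

With two ranks, three consecutive `issue` calls target ranks 0, 1, and 0.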
Execution Flow of a Processing Element.
Our exemplar PE is composed of two floating-point fused multiply-add (FPFMA) units, 5 scalar registers (up to 3 operand inputs and 2 for temporary values), a 1KB buffer for accessing memory, and a 1KB scratchpad memory. The memory access granularity is 8B per chip (same as the host). PEs may be further optimized to support lower-precision operations or specialized for specific use cases, but we do not explore these in this paper as we focus on the new capabilities of CHoNDA rather than NDA in general. Figure 9 shows the execution flow of a PE when executing the AXPY operation. Each vector is partitioned into 1KB batches, which is the same size as the DRAM page size per chip. To maximize bandwidth utilization, the vector X is streamed into the buffer. Then, the PE opens another row, reads two elements (8 bytes) of vector Y, and stores them to FP registers. While the next two elements of Y are read, a fused multiply-add (FMA) operation is executed. The result is stored back into the buffer and execution continues such that the read-execute-write operations are pipelined. After the result buffer is filled, the PE writes the results back either to memory or to the scratchpad. This flow for one 1KB batch is repeated over the rest of the batches. This entire process is hardcoded in PE microcode as the AXPY operation.
Table 2 summarizes our system configuration, DRAM timing parameters, energy components, benchmarks, and machine learning configurations. For bank partitioning, we reserve one bank per rank for NDAs and the rest for the host. We use Ramulator [37] as our baseline DRAM simulator and add the NDA memory controllers and PEs to execute the NDA workloads. We modify the memory controller to support the Skylake address mapping [58] and our bank partitioning and data layout schemes. To simulate concurrent host accesses, we use gem5 [8] with Ramulator.
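The batched AXPY flow of the exemplar PE (Figure 9) can be mimicked functionally with the sketch below. It assumes 4-byte (FP32) elements, which is consistent with two elements per 8-byte access, and it computes alpha * X + Y sequentially; the hardware pipelining of read, execute, and write is not modeled, and Python floats do not reproduce a true fused multiply-add. The constant and function names are hypothetical.

```python
BATCH_BYTES = 1024        # one batch equals the per-chip DRAM page size
ELEM_BYTES = 4            # assumption: FP32 elements
ELEMS_PER_BATCH = BATCH_BYTES // ELEM_BYTES
ELEMS_PER_ACCESS = 2      # two elements (8 bytes) of Y per access


def axpy_batched(alpha, x, y):
    """Return alpha * x + y, processed one 1KB batch at a time."""
    out = []
    for start in range(0, len(x), ELEMS_PER_BATCH):
        # Stream one batch of X into the PE buffer.
        buffer = list(x[start:start + ELEMS_PER_BATCH])
        y_batch = y[start:start + ELEMS_PER_BATCH]
        # Read Y two elements at a time and combine with the buffered X.
        for j in range(0, len(buffer), ELEMS_PER_ACCESS):
            for k in range(j, min(j + ELEMS_PER_ACCESS, len(buffer))):
                buffer[k] = alpha * buffer[k] + y_batch[k]
        # Buffer is full: write the batch back (to memory or the scratchpad).
        out.extend(buffer)
    return out
```

Under these assumptions one batch holds 256 elements, so a 600-element vector is processed as two full batches and one partial batch.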
We choose host applications that have medium or high memory intensity from the SPEC2006 [26] and SPEC2017 [54] benchmark suites and form 9 different application mixes with different combinations (Table 2 ). For the NDA workloads, we use DOT and COPY operations to show the impact of extremely low and high write intensity. We use the average gradient kernel (Figure 8 ) to evaluate collaborative execution.
Methodology
For the host workloads, we use Simpoint [25] to find representative program phases and run each simulation until the instruction count of the slowest process reaches 200M instructions. If an NDA workload completes while the simulation is still running, it is relaunched so that concurrent access occurs throughout the simulation time. Since the number of instructions simulated is different, we measure instructions per cycle (IPC) for the host performance. In addition, to show how well the NDAs utilize bandwidth, we show bandwidth utilization and compare with the idealized case where NDAs can utilize all the idle rank bandwidth.
[Table 2, machine learning configuration: logistic regression with ℓ2-regularization (10-class classification), λ = 1e-3, learning rate = best-tuned, momentum = 0.9, dataset = cifar10 (50000 × 3072).]
We estimate power with the parameters in Table 2. We use CACTI 6.5 [51] for the dynamic and leakage power of the PE buffer. A sensitivity study of PE parameters shows that their impact is negligible. We use CACTI-3DD [13] to estimate the power and energy of 3D-stacked DRAM and CACTI-IO [32] to estimate DIMM power and energy.
Evaluation
We present evaluation results for the various CHoNDA mechanisms, analyzing: (1) the benefit of coarse-grain NDA operations; (2) how bank partitioning improves NDA performance; (3) how stochastic issue and next-rank prediction mitigate read/write turnarounds; (4) the impact of NDA workload write intensity and load imbalance; (5) how CHoNDA compares with rank partitioning; (6) the benefits of collaborative and parallel CPU/NDA processing; and (7) energy efficiency.
Coarse-grain NDA Operation. To prevent other factors, such as bank conflicts, bank-level parallelism, and load imbalance, from affecting performance, we use our BP mechanism, the NRM2 workload, and asynchronous launching. We run the most memory-intensive application mix (mix0) on the host. When more CBs are processed by each NDA instruction, contention between host transactions and NDA instruction launches decreases and the performance of both improves. In addition, as the number of ranks grows, contention becomes more severe because more NDA instructions are necessary to keep all NDAs busy. These results show that our data layout, which enables coarse-grain NDA operation, is beneficial, especially under concurrent access.
Takeaway 1: Coarse-grain NDA operations are crucial for mitigating contention on the host memory channel.
Impact of Bank Partitioning. Figure 11 shows performance when banks are shared and when they are partitioned between the host and NDAs, which access different data. We emphasize the impact of the write intensity of NDA operations by running the extreme DOT (read-intensive) and COPY (write-intensive) operations. We compare each memory access mode with an idealized case in which the host accesses memory without any contention and NDAs leverage all the idle rank bandwidth regardless of transaction types and other overheads. Overall, accelerating the read-intensive DOT with concurrent host access does not affect host performance significantly, even with our aggressive approach. However, contention with the shared access mode significantly degrades NDA performance. This is because of the extra bank conflicts caused by interleaving host and NDA transactions. On the other hand, accelerating the write-intensive COPY degrades host performance. This happens because, in the write phase of NDAs, NDAs keep issuing write transactions and host reads are blocked by the long write-to-read turnaround time. To mitigate this problem, we show the impact of our write throttling mechanisms below.
Takeaway 2: Bank partitioning increases row-buffer locality and substantially improves NDA performance, especially for read-intensive NDA operations.
Mitigating NDA Write Interference. Figure 12 shows the impact of mechanisms for write-intensive NDA operations. In this experiment, the most write-intensive operation, COPY, is executed by NDAs and the mechanisms are applied only during the write phase of NDA execution. The stochastic issue is used with two probabilities, 1/4 and 1/16, which clearly shows the host-NDA performance tradeoff compared to the next-rank prediction.
For the stochastic issue, the tradeoff between host and NDA performance is clear. If NDAs issue with high probability, host performance degrades. On the other hand, the next-rank prediction mechanism shows a slightly better performance tradeoff than the stochastic approach. Compared to the stochastic issue with probability 1/16, both host and NDA performance are higher. The stochastic issue, however, extends the tradeoff range and does not require signaling. The main takeaway is that we can improve on static approaches by using dynamic information.
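The static nature of stochastic issue can be illustrated with a toy model. It assumes only that, in each issue opportunity during the write phase, an NDA issues one pending write with a fixed probability p; the slot abstraction and the function name are hypothetical, and neither host traffic nor next-rank prediction is modeled.

```python
import random


def stochastic_issue(pending_writes, slots, p, rng):
    """Count NDA writes issued when each slot issues with probability p."""
    issued = 0
    for _ in range(slots):
        if issued < pending_writes and rng.random() < p:
            issued += 1
    return issued


rng = random.Random(0)
for p in (1 / 4, 1 / 16):
    n = stochastic_issue(pending_writes=10**6, slots=10_000, p=p, rng=rng)
    print(f"p = {p}: issued {n} writes in 10000 slots")
```

In this model the number of issued writes is roughly p times the number of slots, so lowering p from 1/4 to 1/16 leaves more slots to the host at the cost of NDA write throughput, independent of what the host is actually doing at that moment.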
Takeaway 3: Throttling NDA writes mitigates the large impact of read/write turnaround interference on host performance; next-rank prediction is robust and effective while stochastic issue does not require additional signaling.
Impact of Write-Intensity and Input Size. Figure 13 shows host and NDA performance when different types of NDA operations are executed with different input sizes. The host application mix with the highest memory intensity (mix0) and the next-rank prediction mechanism are used. In addition, to identify the impact of input size, three different vector sizes are used: small (8KB/rank), medium (128KB/rank), and large (8MB/rank). We evaluate asynchronous launches with the small vector size. We evaluate GEMV with three matrix sizes, where the number of columns is equal to each of the three vector sizes and the number of rows is fixed at 128.
Overall, performance is inversely related to write intensity, and short execution time per launch results in low NDA performance. The NRM2 operation with the small input has the shortest execution time. Because of its short execution time, NRM2 is highly impacted by the launching overhead and load imbalance caused by concurrent host access. On the other hand, GEMV executes longer than other operations and it is impacted less by load imbalance and launching overhead.
With the asynchronous launching optimization, the impact of load imbalance decreases and NDA bandwidth increases.
Takeaway 4: Short-duration NDA operations can lead to load imbalance, which is mitigated by the asynchronous launch mechanism.
Scalability Comparison. Figure 14 compares CHoNDA with the performance of rank partitioning. For rank partitioning, we assume that ranks are evenly partitioned between the host and NDAs. In addition, since read- and write-intensive NDA operations show different trends, we separate those two cases. We use the most memory-intensive mix0 as the host workload. The first cluster shows performance when the baseline DRAM system is used. For both the read- and write-intensive NDA workloads, CHoNDA performs better than rank partitioning. This shows that opportunistically exploiting idle rank bandwidth can be a better option than dedicating ranks for acceleration. The second cluster shows performance when the number of ranks is doubled. Compared to rank partitioning, CHoNDA shows better performance scalability. While NDA bandwidth with rank partitioning exactly doubles, CHoNDA more than doubles due to the increased idle time per rank.
Takeaway 5: CHoNDA scales better than rank partitioning because short issue opportunities grow with rank count.
SVRG Collaboration Benefits. Figure 15a shows the convergence results with and without NDA (8 NDAs). We use a shared memory region to enable concurrent access to the same data, and the next-rank prediction mechanism is used. Compared to the host-only case, the optimal epoch size decreases from N to N/4 when NDAs are used. This is because the overhead of summarization decreases relative to the host-only case. Furthermore, SVRG with delayed updates gains additional performance, demonstrating the benefits made possible by concurrent host and NDA access when each processes the portion of the workload it is best suited for. Though the delayed update updates the correction term more frequently, the best-performing learning rate is lower than that of ACC with epoch N/4, which shows the impact of staleness on the delayed update. When NDA performance grows by adding NDAs (additional ranks), delayed-update SVRG demonstrates better performance scalability. Figure 15b compares the performance of the best-tuned serialized and delayed-update SVRG with that of host-only execution for different numbers of NDAs. We measure performance as the time it takes the training loss to converge (when it reaches 1e-13 away from the optimum). Because more NDAs can calculate the correction term faster, its staleness decreases; consequently, a higher learning rate with faster convergence is possible.
Takeaway 6: Collaborative host-NDA processing on shared data speeds up SVRG logistic regression by 50%.
Memory Power. We estimate the power dissipation in the memory system under concurrent access. The theoretical maximum possible power of the memory system is 8W when only the host accesses memory. When the most memory-intensive application mixes are executed, the average power is 3.6W. The maximum power of NDAs is 3.7W and is dissipated when the scratchpad memory is maximally used in the average gradient computation. In total, up to 7.3W of power is dissipated in the memory system, which is lower than the maximum possible with host-only access. This power efficiency of NDAs comes from the low-energy internal memory accesses and because CHoNDA minimizes overheads.
Takeaway 7: Operating multiple ranks for concurrent access does not increase memory power significantly.
Related Work
To the best of our knowledge, this is the first work that proposes solutions for near-data acceleration while enabling concurrent host and NDA access without data reorganization and in a non-packetized DRAM context. Many previous studies have influenced our approach to this problem.
Near-data acceleration has been studied widely as the cost of data access becomes increasingly expensive relative to the computation itself. The nearest place for computation is in DRAM cells [63, 43, 62] or in crossbar cells built with emerging technologies [44, 14, 64, 65, 66, 70, 12, 49]. Since the benefit of near-data acceleration comes from high bandwidth and low data transfer energy, the benefit becomes larger as computation moves closer to memory. However, area and power constraints are significant and restrict the addition of complex logic. As a result, workloads with simple ALU operations are the main target of these studies.
3D-stacked memory devices enable more complex logic on the logic die and still exploit high internal memory bandwidth. Many recent studies build on these devices to accelerate diverse applications [21, 34, 18, 1, 2, 24, 28, 29, 48, 56, 73, 20, 53, 27, 11, 45, 10]. However, in these proposals, the main-memory role of the memory devices has received less attention than the acceleration part. Some prior work [3, 71, 4, 9] attempts to support host and NDA access to the same data, but only with data reorganization and in a packetized DRAM context. Pattnaik et al. [56] show the potential of concurrently running both the host and NDAs on the same memory. However, they assume an idealized memory system in which there is no contention between NDA and host memory requests. We do not assume this ideal case. The main contributions of CHoNDA are precisely to provide mechanisms for mitigating interference.
On the other hand, NDA [19], Chameleon [6], and MCN DIMM [5] are based on conventional DIMM devices and change the DRAM design to practically add PEs. Unlike the rank partitioning and coarse-grain mode switching used in the prior work, we let the host and PEs share ranks to maximize parallelism and partition banks to decrease contention.
Conclusion
In this paper, we introduced solutions to share ranks and enable concurrent access between the host and NDAs. Instead of partitioning memory in a coarse-grain manner, both temporally and spatially, we interleave accesses in a fine-grain manner to leverage the unutilized rank bandwidth. To maximize bandwidth utilization, CHoNDA coordinates state between the memory controllers of the host and NDAs with low overhead, reduces extra bank conflicts with bank partitioning, efficiently blocks NDA write transactions with stochastic issue and next-rank prediction to mitigate the penalty of read/write turnaround time, and uses one data layout that allows the host and NDAs to access the same data and realize high performance. Our case study also shows that collaborative execution between the host and NDAs can provide better performance than using just one of them at a time. CHoNDA offers insights to practically enable NDA while serving main memory requests in real systems and enables more effective acceleration by eliminating data copies and encouraging tighter host-NDA collaboration.
