Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope but thrive with these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.
I. INTRODUCTION AND BACKGROUND
As supercomputing technology pushes toward ever more extreme scales, advances in memory systems are a particular area of attention. Memory speed and capacity constraints are often limiting factors for performance and for the maximum problem size that can be accommodated in a grand challenge scientific simulation. Moreover, power usage has emerged as a first-class concern. Manycore processors show order-of-magnitude speedups on some problems and provide more energy-efficient compute cycles in many cases. However, while their massive multithreading can partially hide memory latency, they do not alleviate the power cost of moving data from memory to the compute units.
While system architects contend with incremental advances in traditional memory technology, research and development efforts are exploring more radical possibilities for future memory systems. Resistive memory [1] and phase-change memory [2] are advanced technologies showing promise in the research stage. In the near term, Micron is preparing to release its Hybrid Memory Cube (HMC) [3], an example of three-dimensional stacked memory.
In this paper, we focus on technology that directly addresses the costs of data movement: Processing in Memory (PIM). PIM is not a new idea, but technology advances, particularly in 3D integration, have opened up more implementation options. The fundamental goal of PIM is to move computation closer to the memory cells in order to take advantage of the higher bandwidth, lower latency and lower energy costs enabled by physical proximity. Early implementations of PIM were created by instantiating memory and logic elements on the same silicon die. This was done both by adding logic to a DRAM die [4], [5], as well as by tightly coupling logic to a large amount of SRAM [6]. Both of these approaches have severe limitations. When using SRAM as the memory technology, the result was both fast memory and logic, but with a very limited memory capacity. The DRAM approach allowed for high memory capacity, but the optimizations used for a DRAM process are not conducive to the creation of high performance logic.
In order to counterbalance the limitations of the physical implementation options available, many PIMs incorporated novel processor architectures, such as programmable logic arrays [7] and massive multi-threading [8], [9]. While these approaches created architectures with high compute potential, their adoption was limited, partially because applications would have had to adopt radically new programming models.
With the advent of 3D stacking, the physical limitations of the past have been removed, and high performance, high capacity memory can be heterogeneously integrated with high performance logic. This close integration enables much higher bandwidth, as large, expensive I/O pins can be replaced with smaller, denser through-silicon vias (TSVs). These vias allow a much greater number of wires to move data from one layer to another. Additionally, the close connection allows much shorter connections (tens of microns instead of tens of centimeters) between the processor and memory. One possible instantiation of this PIM architecture is to include traditional processing cores on the logic layer of an HMC. Because the logic layer is already required to provide an interface with the DRAM layers above, adding processing does not require a dramatic change to the fabrication or packaging process. This paper explores this capability by studying the upside potential of adding lightweight cores into a memory system using HMC parts.
Along with developments in memory architectures, it is essential to consider how system software, programming models, and algorithms make use of memory. By exploring architectural changes in the context of the applications, this co-design approach estimates benefits, exposes limitations, and can influence design decisions both up and down the software stack. Our investigations of PIM focus on mini-applications that have been designed to model characteristics of production applications of U.S. National Laboratories, including finite element methods (FEM), difference stencils, and hydrodynamics. We explore how these applications respond to emulation of PIM-like system properties and simulation of PIM systems. In particular, we examine how differences in the applications' data layout impact their performance and data movement profile.
The contributions of this paper are as follows:
• The introduction of a novel approach to hardware-software co-design combining emulation and simulation (Section II)
• An analysis of performance and data movement of mini-applications in an emulated PIM environment (Section IV)
• An evaluation of benchmark and mini-application performance characteristics in cycle-level simulation of a PIM architecture (Section V)
Following this introduction, Section III describes the mini-applications we use and the applications they represent. We close with some final conclusions and projected future work in Section VI.
II. CO-DESIGN APPROACH
Before describing the technical details of our PIM studies, it is instructive to reflect on the co-design process that served as the context for this work. The Extreme-Scale Grand Challenge (XGC) project brought together specialists from the application level down to system software, computer architecture, and device technologies in a large-scale co-design effort. As we considered potential directions for extreme-scale architectures and the application and system software implications, it became clear that timing our efforts was a key concern. Labor-intensive enhancements to our simulation framework were needed to model future architectures, and once complete, the simulations required either small problem sizes or long time periods to run. On the other hand, the differences between currently available hardware testbed platforms and the projected future hardware were substantial. To further progress in our research, we adopted an approach that allowed us to utilize both of these resources in concert, mitigating their limitations and leveraging their strengths.
Figure 1 shows our new approach. We use a common set of mini-applications to represent the applications of interest, modifying them as needed to explore aspects of their algorithms or implementation that may need to adapt to make better use of the new architecture under study. We identify any system software (e.g., run time system) changes needed to expose or hide characteristics of the architecture and to support the programming model used in the mini-applications. To enable experimental evaluation, we determine fundamental characteristics of the architecture to model.
To instantiate and evaluate the architectural model, we use both emulation on existing hardware and cycle-level simulation. The emulation experiments can quickly gather coarse-level qualitative results about the application-architecture combination that can enable rapid prototyping of changes to the code and hardware design. They also allow evaluation of application behavior and performance on large data sets. The simulation experiments provide fine-grained measurements with high fidelity to the architectural design under study. These results allow fine-tuning of the hardware and software. The remainder of the paper presents our application of this novel dual-method approach to the study of PIM technology.
III. MINI-APPLICATIONS FOR EVALUATION
While benchmarks such as HPLinpack and GUPS are useful for testing raw system performance, mini-applications target application characteristics. Each mini-application provides insight into a particular class of full HPC applications, allowing us to explore how that class of HPC applications would respond to new architectures without porting millions of lines of code. Our study considers three mini-applications representing three different scientific applications important to the production U.S. National Nuclear Security Administration (NNSA) workloads.
The first, miniFE [10] , implements basic kernels representative of implicit finite-element applications, such as the Charon [11] semiconductor device simulation code at Sandia National Laboratories. Initially, miniFE assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. It then solves the linear system using a conjugate-gradient algorithm.
The second, miniGhost [12], implements a finite difference stencil across a homogeneous three dimensional domain. This mini-application captures characteristics of Sandia's CTH [13] shock solid mechanics application. The computation simulates heat diffusion with Dirichlet boundary conditions.
The third, LULESH (version 2.0) [14], implements unstructured Lagrangian shock hydrodynamics. It represents the explicit Lagrangian method in Lawrence Livermore National Laboratory's ALE3D Lagrangian-Eulerian hydrocode. The test problem is a three dimensional blast wave simulation.
IV. EMULATION STUDY ON EXISTING HARDWARE
To better understand the implications of PIM hardware for real world applications, we have developed an experimental design for emulation of PIM hardware. Using a commodity multi-socket system, we emulate PIM hardware by clocking down a subset of the sockets. This setup allows for investigation of the effect of PIM hardware on the performance of existing mini-applications. While this style of emulation cannot match the capability of simulations to control system parameters, large input sets to applications can be run much faster than in corresponding simulations. In this way, these experiments complement the simulations presented in Section V.
A. Experimental Setup
The specific architecture for this experiment is a four-socket system consisting of four 8-core Xeon X7550 (Nehalem) processors at a default clock rate of 2 GHz. Each socket has 128 GB of DDR3 running at 1066 MHz. To match our model of future PIM system configurations, we chose one socket to run at full speed, which we will call the CPU socket. The other three sockets we will refer to as PIM sockets. The cores on these sockets are all clocked down. Figure 2 gives a visual representation of the setup.
The evaluation considered all three mini-applications described in Section III. To best emulate how a program would run on PIM hardware, we ran each of the mini-applications in OpenMP-only mode. The rationale for this choice is that individual PIMs will quite possibly lack full MPI stack support. In contrast, it seems likely that a simple DMA mechanism will be in place. We use the QPI bus on our system to emulate such a mechanism. Where necessary, we modified each of the mini-applications in two ways:
1) Bind threads to cores.
2) Allocate memory on-socket.
We achieved the first through sched_setaffinity, and the second through first touch allocation. Because first touch allocation occurs at page level granularity, and because in some cases memory access patterns are not entirely local, there is still a significant amount of inter-socket access for each of the above mini-applications. We will discuss this further in our analysis of the results. The compilers used were GCC 4.4.7 for LULESH and miniGhost, and ICC 13.0.1 for miniFE.
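As an illustration only (the mini-applications themselves are OpenMP codes), the following Python sketch shows both modifications. `pin_to_cpu` uses `os.sched_setaffinity`, Python's wrapper around the same Linux call. `remote_elements` is a hypothetical toy model of page-granularity first touch; it assumes 4 KiB pages, a contiguous block partition of an array across threads, and that the thread owning the first element of a page is the one that touches the page first.

```python
import os

PAGE_BYTES = 4096  # assumed page size


def pin_to_cpu(cpu):
    """Bind the caller to one CPU (Linux only).

    Returns the resulting affinity set, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


def remote_elements(n_elems, elem_bytes, n_threads):
    """Count elements whose page was first touched by another thread.

    Toy model: thread t owns a contiguous chunk of the array, and each page
    is placed with whichever thread owns the first element in that page.
    """
    per_page = PAGE_BYTES // elem_bytes
    chunk = -(-n_elems // n_threads)  # ceiling division
    remote = 0
    for i in range(n_elems):
        owner = i // chunk
        first_in_page = (i // per_page) * per_page
        toucher = first_in_page // chunk
        if owner != toucher:
            remote += 1
    return remote


if __name__ == "__main__":
    # Chunks aligned to page boundaries: nothing lands on the wrong socket.
    print(remote_elements(1024, 8, 2))
    # Misaligned chunks: elements at the chunk boundary share a page
    # that another thread touched first.
    print(remote_elements(1000, 8, 2))
```

The second call shows why first touch alone does not eliminate inter-socket access: whenever a partition boundary falls inside a page, part of one thread's data is placed with its neighbor.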
For each of the three mini-applications, we used four distinct configurations:
1) CPU compute, PIM local memory
2) CPU compute, PIM interleaved memory
3) PIM compute, PIM local memory
4) PIM compute, PIM interleaved memory
For the interleaved memory configurations, we use libnuma [15] to interleave memory allocation across the PIM sockets during the initialization phases. For the local memory configurations, we use first touch to allocate the memory during initialization. The interleaved configurations function as a baseline: there is no attempt at data affinity to the work (other than the caching of data), and so cross-traffic may be a factor for the scenarios where computation occurs on the PIM sockets.
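The expected cross-socket traffic of the four configurations can be sketched with an idealized model. This is not a measurement: it assumes uniform access to a thread's own data, no caching, and perfect locality under first touch, which the real applications do not achieve. The function and constant names are hypothetical.

```python
from fractions import Fraction

N_PIM_SOCKETS = 3  # from the setup: one CPU socket plus three PIM sockets


def remote_fraction(compute, placement):
    """Idealized fraction of memory accesses that cross a socket boundary."""
    if placement not in ("local", "interleaved"):
        raise ValueError("placement must be 'local' or 'interleaved'")
    if compute == "cpu":
        # All data lives on the PIM sockets, so every CPU access is remote.
        return Fraction(1)
    if compute == "pim":
        if placement == "local":
            return Fraction(0)  # idealized: every page sits with its user
        # Pages are spread evenly, so only 1 in N_PIM_SOCKETS is local.
        return Fraction(N_PIM_SOCKETS - 1, N_PIM_SOCKETS)
    raise ValueError("compute must be 'cpu' or 'pim'")


if __name__ == "__main__":
    for compute in ("cpu", "pim"):
        for placement in ("local", "interleaved"):
            frac = remote_fraction(compute, placement)
            print(f"{compute} compute, {placement} memory: {frac} remote")
```

Under these assumptions, interleaving leaves two thirds of PIM-side accesses remote, while CPU-side computation is remote in either placement.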
For the PIM compute, we ran with a broad range of clock speeds. In particular, we ran with clock speeds ranging from 12.5% to 100% of the default clock speed. Down-clocking of the cores was achieved by adjusting the clock modulation model-specific register (MSR) available on the Intel Nehalem processor chips. For each configuration we executed 10 trials, of which the mean and variance are displayed on the graphs in this section.
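A small sketch of the sweep follows. It assumes the modulation register exposes eight evenly spaced duty-cycle steps of 12.5%, which is consistent with the range quoted above but is not stated in the text. The trial times are hypothetical placeholders, not measured data.

```python
import statistics

BASE_GHZ = 2.0  # default clock rate from the experimental setup


def effective_ghz(step, n_steps=8):
    """Effective clock for duty-cycle step `step` out of `n_steps`.

    Assumption: evenly spaced steps, so step 1 is 12.5% and step 8 is 100%.
    """
    if not 1 <= step <= n_steps:
        raise ValueError("step out of range")
    return BASE_GHZ * step / n_steps


levels = [effective_ghz(s) for s in range(1, 9)]

# Hypothetical run times (seconds) for 10 trials of one configuration.
trial_seconds = [10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.0]
mean_s = statistics.mean(trial_seconds)
var_s = statistics.variance(trial_seconds)  # sample variance

if __name__ == "__main__":
    print(levels)
    print(mean_s, var_s)
```

Clock modulation gates the clock for part of each period rather than lowering the frequency itself, so "effective clock" here is only a convenient approximation.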
The memory footprint for each mini-application was approximately as follows:
• miniFE: 24GB (problem size nx = 400)
• miniGhost: 3.6GB (problem size 240^3, 32 variables)
• LULESH: 23-60GB (problem size 400^3)
LULESH differs from the other two mini-applications in that significant allocation and deallocation of memory is performed during compute time. This allows us to examine whether dynamic allocation has an effect on performance in a PIM system, as compared to mini-applications with fairly static allocations.
B. Results and Analysis
We present the results of the emulation in terms of two metrics. The first is QPI traffic, i.e., the number of bytes sent between sockets over Intel's QuickPath Interconnect (QPI) bus. This metric is mostly agnostic to the speed of the links, and is rather a function of how the application's data is distributed across the NUMA domains and the data access patterns of the application. The other metric presented in this section is time to solution. These results, while still offering valuable insight, are much more sensitive to the system's speeds and feeds. Thus, the results should be interpreted more as a back-of-the-envelope estimation of performance implications of various PIM configurations. For each of the graphs, we plot the interleaved configuration alongside the equivalent configuration with local allocation. This highlights the effects of carefully managed data affinity (local allocation) in contrast to locality-oblivious distribution (interleaved allocation).
The QPI traffic results for the three mini-applications are shown in Figure 3. As expected, the configurations in which the PIMs perform the computation send far fewer bytes across the bus. This behavior is consistent across all three mini-applications. Among the configurations with PIMs doing computation, the locality-aware configurations send fewer bytes than the interleaved configurations, but the magnitude of the difference varies by application. The gap is widest for LULESH, with four times the bytes transferred for the interleaved versus the local. The difference in traffic is less for miniGhost, but still substantial. Note that we were not able to achieve a completely optimal first-touch implementation for miniGhost. For miniFE, the QPI traffic increases as the clock speed of the PIMs decreases. We suspect that this behavior is related to cache behavior as a function of clock speed. In all cases, the local configuration sends significantly fewer bytes.
The execution time metric, shown in Figure 4, provides some insight into the performance trade-offs of a PIM system. For example, we can see at what clock speed the PIMs become competitive with a core CPU. For these mini-applications, the results are somewhat intuitive: with 3x the number of PIM cores as CPU cores, they are competitive at roughly 1/3 clock speed. There is some variation among the mini-applications, however. MiniFE is close in performance at roughly 25% clock speed, meaning that more computation is done per cycle. This increase in efficiency could translate into power savings for PIM configurations.
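The break-even intuition can be written as a short calculation. This sketch assumes perfectly parallel, compute-bound work with equal work per cycle on every core, which is an idealization; the function names are hypothetical.

```python
from fractions import Fraction

CPU_CORES = 8       # one 8-core CPU socket
PIM_CORES = 3 * 8   # three 8-core PIM sockets


def break_even_clock_fraction(cpu_cores, pim_cores):
    """Clock fraction at which aggregate PIM cycles equal aggregate CPU cycles."""
    return Fraction(cpu_cores, pim_cores)


def relative_work_per_cycle(ideal_fraction, observed_fraction):
    """How much more work per cycle is implied when parity occurs at a lower clock."""
    return ideal_fraction / observed_fraction


ideal = break_even_clock_fraction(CPU_CORES, PIM_CORES)
print(ideal)  # 1/3

# miniFE reaches parity at roughly 25% clock speed in the emulation.
print(relative_work_per_cycle(ideal, Fraction(1, 4)))  # 4/3
```

Read this way, parity at roughly one quarter of the clock rate corresponds to about one third more work per cycle than the idealized model predicts.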
While the clock speed of the PIMs shows a direct correspondence to performance for all the mini-applications, the impact of local versus interleaved configuration is mostly muted. When running miniFE on the PIMs at low speed, interleaved actually outperforms the local configuration. This behavior will require further investigation, as will the difference in performance between the CPU configurations for LULESH. The negligible performance differences between local and interleaved in most cases do not imply that the differences in data movement shown in Figure 3 should be ignored. Rather, this lack of correlation shows that good performance does not imply a low volume of data movement or, by extension, low power draw and energy consumption.
V. SIMULATION STUDY USING SST

A. Experimental Setup
Emulation provides an excellent mechanism to quickly explore a design space and refine applications. To complement the emulation, we also performed cycle-level simulation of a PIM architecture using the Structural Simulation Toolkit (SST) [16]. This allows more fine-grained analysis of architectural interactions and exploration of latency and bandwidth ratios that hardware emulation cannot provide.
The Structural Simulation Toolkit is a parallel discrete event simulation framework for computer architecture simulation. It contains a number of simulation models for processors,
1) PIM Architecture: Figure 5 shows what a future PIM-enhanced compute node may look like. This hypothetical system consists of a CPU chip with several dozen heavyweight cores and multiple chains of connected PIM-enabled memory stacks. Each memory stack would contain several processor cores.
Due to the limits of simulation speed and to enable a wider design space search, we restricted our experiments to examine a subset or "sliver" of this future design. Simulation experiments examine a system with 8 "heavy" CPU cores and 16 lighter PIM cores.
In this design, cores are arranged into "quads". Each quad contains a set of four cores, each with a private L1 cache. All cores in the quad share an L2 cache. Quads connect to each other and off-chip with a packet-based network. Directory controllers are located on each PIM and provide access to the backing store, in this case an array of highly parallel DRAM vaults located in the memory stack. There are four PIMs, and they are connected with the CPU in a simple ring (i.e., a 1D torus) configuration. Figure 6 provides key architectural parameters that are detailed in the sections below.
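The ring can be sketched as five nodes with shortest-path hop counts. The node numbering (0 for the CPU, 1 to 4 for the PIMs, in ring order) and the assumption that links are bidirectional are illustrative choices, not taken from the simulated configuration.

```python
N_NODES = 5  # one CPU plus four PIMs on the ring


def ring_hops(src, dst, n_nodes=N_NODES):
    """Shortest hop count between two nodes on a bidirectional ring."""
    if not (0 <= src < n_nodes and 0 <= dst < n_nodes):
        raise ValueError("node index out of range")
    d = abs(src - dst)
    return min(d, n_nodes - d)


# Hops from the CPU (node 0) to each PIM (nodes 1..4).
cpu_to_pim = [ring_hops(0, pim) for pim in range(1, N_NODES)]
mean_cpu_hops = sum(cpu_to_pim) / len(cpu_to_pim)

if __name__ == "__main__":
    print(cpu_to_pim, mean_cpu_hops)
```

In this numbering, the two PIMs adjacent to the CPU are one hop away and the other two are two hops away, so remote accesses do not all cost the same.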
Due to simulation time limits, the mini-applications were configured to use smaller problem sizes:
In addition to the described mini-apps, we simulated execution of the GUPS (40MB range) and stream (25MB) benchmarks.
2) Ariel Model: To simulate the processor cores, we use the Ariel model within SST. Ariel is a PIN-based [17] dynamic instruction trace generator. Using the Intel PIN framework, Ariel attaches to a process and performs dynamic rewriting to intercept the flow of execution. In this case, we capture memory instructions from each thread of the executing benchmark and stream the type (load/store), address, and size of each. Incoming instructions are sorted by thread and dispatched into the memory hierarchy at the appropriate point. Other functions in the executing process are also trapped (using PIN's RTN_Replace()) to control when instructions should be captured. This allows us to capture only the relevant portions of the benchmark's execution and avoid simulating setup or initialization code. Ariel can simulate different processor types by throttling the rate at which instructions are issued, the number of memory instructions issued per cycle, and the maximum number of outstanding memory instructions at any time.
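The effect of these throttles can be illustrated with a toy issue model. This is not Ariel's implementation: it assumes a fixed memory latency, in-order completion, and a stream consisting only of memory operations. The parameter values are hypothetical, chosen so that one core has half the capability of the other.

```python
from collections import deque


def drain_time(n_ops, ops_per_cycle, max_outstanding, latency_cycles):
    """Cycle at which the last of `n_ops` memory operations completes.

    Each cycle, completed operations retire, then up to `ops_per_cycle` new
    operations issue as long as fewer than `max_outstanding` are in flight.
    """
    in_flight = deque()  # completion cycles, oldest first
    issued = 0
    cycle = 0
    last_done = 0
    while issued < n_ops:
        # Retire everything that has completed by this cycle.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
        free = max_outstanding - len(in_flight)
        slots = min(ops_per_cycle, free, n_ops - issued)
        for _ in range(slots):
            last_done = cycle + latency_cycles
            in_flight.append(last_done)
        issued += slots
        cycle += 1
    return last_done


# Hypothetical parameters: the "light" core has half the issue width and
# half the outstanding-request limit of the "heavy" core.
heavy_cycles = drain_time(1000, ops_per_cycle=2, max_outstanding=16, latency_cycles=100)
light_cycles = drain_time(1000, ops_per_cycle=1, max_outstanding=8, latency_cycles=100)

if __name__ == "__main__":
    print(heavy_cycles, light_cycles)
```

With a long memory latency, the limit on outstanding requests rather than the issue width sets the throughput, so halving that limit roughly doubles the drain time in this model.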
For these experiments, we use two classes of cores: "heavy" cores in the CPU and "light" cores in the PIMs. The "light" cores have roughly half the capabilities of the "heavy" cores in all areas. The parameters used in the studies are shown in Table I.
3) Memory Hierarchy: The memory hierarchy was simulated with SST's memHierarchy set of components. Each core has a private data and instruction L1. A set of four cores in a quad all share a common L2. Table II contains the cache configuration parameters used in the simulations. Coherence within the quad is maintained through a snoopy bus-based MESI protocol. Between quads, a directory-based protocol is used. Memory requests are dispatched over the Merlin network (see Section V-A4) to the appropriate directory controller. Directory controllers have 128K-entry caches managed with an LRU replacement policy. Memory addresses are assigned to memory stacks with a simple interleaved scheme at 4K granularity. As in the emulation experiments, a local memory configuration was also tested, in which the first PIM to touch a memory page is allocated that page for the simulation.
For these experiments, each memory stack and the CPU has a Merlin router. They are connected in a simple ring topology. Parameters are given in Table II.
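The two address-to-stack policies described above can be sketched as follows. The sketch assumes that "4K" means 4096 bytes and that pages rotate across the four memory stacks in index order; the class and function names are hypothetical.

```python
PAGE_BYTES = 4096  # assumed meaning of "4K granularity"
N_STACKS = 4       # one memory stack per PIM


def interleaved_stack(addr):
    """Stack that holds `addr` when pages rotate across stacks in order."""
    return (addr // PAGE_BYTES) % N_STACKS


class FirstTouchMap:
    """Assigns each page to whichever stack requests it first."""

    def __init__(self):
        self._owner = {}  # page number -> owning stack

    def stack_for(self, addr, requesting_stack):
        page = addr // PAGE_BYTES
        # setdefault records the requester only if the page is still unowned.
        return self._owner.setdefault(page, requesting_stack)


if __name__ == "__main__":
    print([interleaved_stack(p * PAGE_BYTES) for p in range(6)])
    local = FirstTouchMap()
    print(local.stack_for(8192, requesting_stack=3))
    print(local.stack_for(8200, requesting_stack=1))  # same page, owner unchanged
```

Under interleaving, consecutive pages land on different stacks regardless of who uses them; under first touch, a page stays with its first requester even when another PIM accesses it later.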
5) Stacked Memory Model:
Stacked DRAM memory is expected to become more prevalent in future architectures. Stacked memory products like WideIO, Micron's HMC, and the JEDEC HBM are currently available or are planned to be available. Indeed, it is the potential for heterogeneous integration that makes processing in memory possible. The SST VaultSim component allows simulation of DRAM-based stacked main memory.
For these experiments, we do not focus on emulating a particular memory standard (e.g., HMC), but rather a memory organization that is indicative of industry trends. Memory in each memory stack is organized into 8 independent banks or vaults. Each vault can handle up to 32 outstanding memory requests at a time. Detailed DRAM timings are not simulated in these experiments, so vault latency (without contention) is fixed at 25 ns for a row hit and 50 ns for a row miss, with a 512 B row size.
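These parameters allow a rough estimate of average latency and of an upper bound on request throughput. The sketch below applies Little's law (throughput = concurrency / latency) and assumes every request slot in every vault is kept busy, which real access streams will not achieve. The row-hit rate is a free parameter, not a value from the study.

```python
ROW_HIT_NS = 25.0
ROW_MISS_NS = 50.0
VAULTS_PER_STACK = 8
MAX_OUTSTANDING_PER_VAULT = 32


def mean_latency_ns(row_hit_rate):
    """Average uncontended vault latency for a given row-hit rate."""
    if not 0.0 <= row_hit_rate <= 1.0:
        raise ValueError("row_hit_rate must be in [0, 1]")
    return row_hit_rate * ROW_HIT_NS + (1.0 - row_hit_rate) * ROW_MISS_NS


def peak_requests_per_ns(row_hit_rate):
    """Little's-law upper bound on requests per nanosecond for one stack."""
    concurrency = VAULTS_PER_STACK * MAX_OUTSTANDING_PER_VAULT
    return concurrency / mean_latency_ns(row_hit_rate)


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(rate, mean_latency_ns(rate), peak_requests_per_ns(rate))
```

Because the bound scales with the number of outstanding requests, a core that cannot keep many requests in flight will see far less than this peak.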
B. Simulation Results
The performance results for simulation can be seen in Figure 7. Performance results are normalized to the base (CPU-only) case.
In principle, the CPU+PIM case roughly doubles the performance of the system by adding sixteen 1 GHz processors to the CPU's eight 2 GHz processors. Because the PIM processors have a lower dispatch rate, the expected performance improvement is somewhat less than 2×. Nevertheless, all of the mini-applications show a performance improvement greater than 2× (Table III). Some of this is explained by the larger amount of cache capacity available to the PIM system; however, for miniFE and miniGhost the cache performance improvement is fairly modest. From this we can conclude that the performance improvement is due largely to the greater memory bandwidth and lower latency for PIM processors.
With the "local" allocation scheme, simulation results match the emulation experiments: the performance improvement compared with the interleaved allocation was quite modest. For most of the applications, the performance improvement was less than 1%, with the stream benchmark showing a 2.3% improvement.
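The 2× figure follows from aggregate clock cycles alone, as this short calculation shows. It treats every cycle as equally useful and ignores the dispatch-rate difference, so it gives an upper estimate rather than a prediction; the function name is hypothetical.

```python
def aggregate_ghz(groups):
    """Sum of core_count * clock_ghz over groups of identical cores."""
    return sum(count * ghz for count, ghz in groups)


cpu_only = aggregate_ghz([(8, 2.0)])
cpu_plus_pim = aggregate_ghz([(8, 2.0), (16, 1.0)])
ratio = cpu_plus_pim / cpu_only

print(cpu_only, cpu_plus_pim, ratio)  # 16.0 32.0 2.0
```

Any measured speedup above this ratio therefore has to come from something other than added compute cycles, such as cache capacity, memory bandwidth, or latency.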
VI. CONCLUSIONS AND FUTURE WORK
As mentioned earlier, limitations of our emulation and simulation studies require qualifications of any claims. However, the results do suggest two important points. First, computing on the PIMs at lower clock speed can rival CPU performance while reducing data movement (with possible implications for power savings and scalability). Second, data locality among PIMs is nonessential for performance but important for reduction of data movement. This finding highlights a trade-off between data movement minimization and programmer productivity, as careful data layout and access to exploit locality can require significant programmer effort to implement. High volumes of data movement are also a concern for scalability: if a large number of PIMs are linked together, then the multiplicative effect of cross-traffic could become a bottleneck not observed in our study of only a small number of PIMs.
Performing both simulation and emulation studies provides complementary insights. The faster emulation studies allowed analysis of applications with much larger memory footprints and allowed us to examine more designs. Simulation experiments allowed us to examine designs that emulation could not (e.g., different bandwidth and memory ratios, different core issue rates). Both approaches showed similar performance trends.
In future work, we plan to investigate the power draw and energy usage of the mini-application executions on the different configurations by running on newer machines with readily available power measurement capabilities such as Intel's RAPL interface. Such experiments will allow us to quantify the relationships of power and energy to data movement, which we have used as a proxy for them in the emulation study presented here. A follow-on mini-application to miniGhost using Adaptive Mesh Refinement (miniAMR) is under development and will be considered for PIM studies in future work. For this study, we chose mini-applications that have some structure in their memory access patterns. In future work, we will also investigate mini-applications that have very little structure in their memory access patterns. Applications would include key problems in graph processing and data analytics.
We expect that they may perform poorly in a PIM system, but understanding their behavior could enable the development of solutions for mitigation.
Future simulations will explore changes to the system balance such as modifications to the size of caches, different PIM interconnection schemes, and new coherency protocols. Additionally, we will perform more detailed core simulations with cycle-approximate core simulators such as gem5.
Although there are still significant hurdles in the fabrication and packaging of processors in memory, and though system software, programming models, and algorithms will have to evolve to take optimal advantage of the new architecture, these experiments indicate that PIM does not have to be radically disruptive to how applications are written and that there is a high potential for improvements in performance and energy consumption.
