Abstract
Processor in Memory, or PIM, architecture incorporates arithmetic units and control logic directly on the semiconductor memory die to provide direct access to the data in the wide row buffer of the memory. PIM offers the promise of superior performance for certain classes of data-intensive computing through a significant reduction in access latency, a dramatic increase in available memory bandwidth, and expansion of the hardware parallelism for flow control. Advances in PIM architecture under development incorporate innovative concepts to deliver high performance and efficiency in the presence of low data locality. These include the use of PIM to augment and complement conventional microprocessor architectures, the use of a large number of on-chip PIM nodes to expose a high degree of memory bandwidth, and the use of message-driven computation with a transaction-oriented producer-consumer execution model for system-wide latency tolerance. All of these have benefited from previous work, and this study extends those experiences to the domain of PIM. This paper explores the design space of several innovations being considered for PIM through a set of statistical steady-state parametric models that are investigated by means of queuing simulation and analyses.
Introduction
While the advanced PIM concept is encouraging, it is not proven. In order both to prove the effectiveness of this new class of architecture and to quantitatively characterize the design tradeoff space to enable informed choices of resource allocation, a set of simulation experiments and analytical studies was conducted. These include 1) the modeling of the interrelationship between the PIM components and their host microprocessor, and 2) an investigation of the optimal number of nodes that should be implemented on a chip. This paper describes these experiments, presents the results and findings, and discusses their implications for the future design and operation of advanced PIM architectures and the systems that incorporate them. Section 2 describes the basic concepts and identifies important relevant prior work in the field. Section 3 describes the simulation and analysis experiments and presents their results. Finally, Section 4 discusses the implications of these findings for future PIM design and briefly suggests work necessary to broaden and confirm these initial conclusions.
Background
Processing-in-memory encompasses a range of techniques for driving computation into a memory system. This involves not only the design of processor architectures and microarchitectures appropriate to the properties of on-chip memory, but also execution models and communication protocols for initiating and sustaining memory-based program execution.
Reclaiming the Hidden Bandwidth
The key architectural feature of on-chip memory is the extremely high bandwidth that it provides. A single DRAM macro is typically organized in rows with 2048 bits each. During a read operation, an entire row is latched in a digital row buffer just after the analog sense amplifiers. Once latched, data can be paged out of the row buffer to the processing logic in wide words of typically 256 bits. Assuming a very conservative row access time of 20 ns and a page access time of 2 ns, a single on-chip DRAM macro could sustain a bandwidth of over 50 Gbit/s. Much of PIM research has focused upon reclaiming this hidden bandwidth, either through new organization for conventional architectures or through custom ISAs.
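The bandwidth estimate above can be checked with a few lines of arithmetic. The sketch below assumes that each row access (20 ns) is followed by sequential page accesses (2 ns each) that drain the entire 2048-bit row in 256-bit words, with no overlap between operations; this access pattern and the function name are illustrative assumptions rather than properties of a specific device.

```python
def sustained_bandwidth_gbit_s(row_bits=2048, word_bits=256,
                               row_access_ns=20.0, page_access_ns=2.0):
    """Estimate the sustained bandwidth of one DRAM macro in Gbit/s.

    Assumption (not stated in the text): every row activation is followed by
    paging out the whole row, one wide word per page access, with no overlap.
    """
    pages_per_row = row_bits // word_bits              # 2048 / 256 = 8 wide words
    time_per_row_ns = row_access_ns + pages_per_row * page_access_ns
    return row_bits / time_per_row_ns                  # bits per ns equals Gbit/s


if __name__ == "__main__":
    print(f"{sustained_bandwidth_gbit_s():.1f} Gbit/s")
```

Under these assumptions, a row takes 36 ns to activate and drain, giving roughly 56.9 Gbit/s, which is consistent with the "over 50 Gbit/s" figure quoted above.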
Several studies have demonstrated that simple caches designed for on-chip DRAM can yield performance comparable to classical memory hierarchies, but with much less silicon area. In [SPN96], researchers at Sun investigated the performance of very wide but shallow caches that transfer an entire cache line in a single cycle. Using a Petri-net model, they showed that as a result of the lower miss penalty, a PIM with a simple 5-stage RISC pipeline running at 200 MHz would have comparable performance to a DEC Alpha 21164 running at 300 MHz, with less than one-tenth the silicon area. Work at Notre Dame showed similar performance results for a sector cache implemented by adding tag bits directly to the row buffers in DRAM [BZKJ98]. Early simulation results from the Berkeley IRAM project showed that in addition to improved performance-per-area, PIM could also have much lower energy consumption than conventional organizations [FPC+97].
Even greater performance gains are possible through architectures that perform operations on multiple data words accessed from memory simultaneously. Many such designs have been implemented or proposed [BKF+99, FPC+97, HKK+99, KGM+00, Kir02, LY99]. The Berkeley VIRAM has 13 Mbytes of DRAM, a 64-bit MIPS scalar core, and a vector coprocessor with 2 pipelined arithmetic units, each organized into 4 parallel vector lanes. VIRAM has a peak floating-point performance of 1.6 Gflop/s, and shows significant performance improvements in multimedia applications over contemporary superscalar, VLIW, and DSP processors [KGM+00]. The DIVA PIM employs a wideword coprocessor unit supporting SIMD operations similar to the Intel MMX or PowerPC AltiVec extensions. A memory system enhanced with DIVA PIMs produced average speedups of 3.3 over host-only execution for a suite of data-intensive benchmark programs [HKK+99]. Memory manufacturer Micron's Yukon chip is a 16 Mbyte DRAM with a SIMD array of 256 8-bit integer ALUs that can sustain an internal memory bandwidth of 25.6 Gbytes/s [Kir02].
Computation and Communication in Massively-Parallel PIM Systems
The benefits of PIM technology can be further exploited by building massively-parallel systems with large numbers of independent PIM nodes, each an integrated memory/processor/networking device. Many examples of fine-grain MPPs have been proposed, designed, and implemented in the past, for example [Hil81] and others. All have faced stiff challenges in sustaining significant percentages of peak performance related to the interaction of computation and communication, and there is no reason to assume that networked PIM devices would be immune to the same problems. PIM does, however, provide a technology for building massive, scalable systems at lower cost, and for implementing highly efficient mechanisms for coordinating computation and communication.
One of the key potential cost advantages of PIM is the ability to reduce the overhead related to memory hierarchies. In [MSS95], it was first suggested that a petaflops-scale computer could be implemented with a far lower chip count using PIM technology than through a network of traditional shared-memory machines or through a cluster of conventional workstations. The J-Machine was one of the computers envisioned as using DRAM-based PIM components for an MPP, although for engineering considerations the system was eventually implemented in SRAM technology [DCC+]. Execube [Kog94] was the first true MIMD PIM, with 8 independent processors connected in a binary hypercube on a single chip together with DRAM. More recently, IBM's original Blue Gene [Den00] and current BG/L designs [Adi+02] both use embedded DRAM technology in components for highly scalable systems.
Although related, the semantics of requests made of a PIM system differ somewhat from messages in classic parallel architectures.
HTMT and related projects introduced the concept of parcels (parallel communication elements) for memory-borne messages, which range from simple memory reads and writes, through atomic arithmetic memory operations, to remote method invocations on objects in memory [SB99, BKF+99].
There are various ways that one could characterize and set performance objectives for PIM networks communicating through parcels. A useful approach is to view the PIM network as a transaction-processing system, where two important, related figures of merit are the latency in servicing a single transaction and the throughput, or number of transactions serviced per unit time.
Relevant prior architectural approaches include dataflow machines [AN90, PC90], multithreading [Smi78, Smi91, CGSV93], and hybrids [Ian88, NA89]. PIM Lite is a recent PIM architecture and prototype implementation that efficiently uses wide words out of memory to integrate multithreading and fast parcel response with SIMD arithmetic operations [BKF+99, BKKK02].
The Need for Design Space Exploration
As the previous sections show, many architectural and implementation options currently exist for exploiting PIM technology. What is lacking, however, is a framework for evaluating tradeoffs between options in designing balanced, cost-effective systems: what follows are the beginnings of such a framework. Specifically, we have developed a set of analytic and simulation models that help provide insight into some of the key questions affecting the configuration of PIM systems.
The first set of analyses addresses tradeoffs in partitioning a computation into heavyweight/high temporal locality threads running on a conventional host processor and lightweight/low temporal locality threads running in PIM. Parameters of the model include the number of PIM nodes, the percentage of the application with low temporal locality, and the system configuration.
The second set of analyses addresses tradeoffs involving the ability to utilize the high on-chip memory bandwidth and the balance between memory and processor area on a chip. Parameters of this model include the probability that a given memory access in an application hits in a row of memory, which is related to how well concurrent operations such as SIMD or vectors could be used.
The following sections provide the details of these models and their results. 
Experiments and Results

A Queuing Model of a Basic PIM-based System
The HyPerformix Workbench queuing model comprises a master or heavyweight processor (HWP) and a set of PIM or lightweight processors (LWP) in the main memory, as in the block diagram of Figure 1.
Although similar in form, the two classes of processor are distinguished by their operational parameter values as shown in Table 1 .
Also, the HWP includes a cache but experiences a relatively long access time to main memory on a cache miss. The LWP has no cache but is physically adjacent to the memory row buffer and so exhibits much shorter memory access times. Figure 2 presents the simple queue model for the HWP and Figure 3 provides the corresponding queue model for the array of LWP and memories. Note that for simplicity, the model treats the main memory accessed by the HWP and LWP as separate devices, but this is simply an artifact of convenience and does not impact the simulation results. Bank conflicts are not modeled, but the nature of the workload modeled for these experiments precludes this kind of resource contention, so no inaccuracies are introduced in the final results.
The experimental workload divides the operations between the HWP and the array of LWP. Threads of activity that exhibit high temporal locality, such that good cache hit rates should be expected, are scheduled on the HWP. Threads of activity that exhibit low or no temporal locality, which would result in very poor cache performance, are scheduled on the set of LWP/memory components. At any one time, either the HWP or the LWP array is executing, but not both. We also assume that the LWP workload is partitionable into a number of concurrent threads of uniform length, one per LWP. This execution flow is depicted in Figure 4.
While somewhat constraining, the experimental workload permits simple statistical characterization and is representative of many important classes of real-world algorithmic behavior if by no means all. The parameters used to specify the workload are also given in Table 1 .
Experimental Results from the Queuing Simulation
Two experiments were performed: 1) a control run in which the HWP performed all of the work, and 2) the test runs in which the low locality threads were performed on a set of LWP nodes. For both cases, the amount of low locality work, measured as the percentage of operations, was varied across a parameter range of 0% to 100%. For the test runs, the number of LWP nodes was varied as well, in a range typical of a modest-scale system. The performance gains of the test runs with respect to the control run were calculated as a function of the fraction of LWP workload for different numbers of LWP nodes, as shown in Figure 5.
It is seen that even for a small amount of LWP work, including PIMs in the system may double the performance. If the application is data intensive, a significant portion of the total work is scheduled on the array of LWP nodes and as much as an order of magnitude performance gain may be achieved. In the extreme case where essentially all work resides on the LWP array, at least for some configurations, a factor of 100X gain is observed. These results, if substantiated through further studies, imply important advantages of PIM-based systems with respect to their conventional counterparts.
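To make the comparison between control and test runs concrete, the following is a minimal sketch of how such a gain could be computed from per-operation costs. It is not the queuing simulation itself: the three per-operation times are hypothetical placeholders (they are not the Table 1 values), and the function name is ours. The sketch keeps the stated workload assumptions that the HWP and LWP phases never overlap and that the low locality work divides evenly across the LWP nodes.

```python
def performance_gain(f_low, n_lwp, t_hwp_high=1.0, t_hwp_low=30.0, t_lwp=8.0):
    """Ratio of control-run time to test-run time for a unit amount of work.

    f_low      -- fraction of operations with low temporal locality (0..1)
    n_lwp      -- number of LWP nodes sharing the low locality work evenly
    t_hwp_high -- HYPOTHETICAL HWP time per high locality operation
    t_hwp_low  -- HYPOTHETICAL HWP time per low locality operation (cache misses)
    t_lwp      -- HYPOTHETICAL LWP time per low locality operation
    """
    # Control run: the HWP executes everything, including the cache-unfriendly part.
    t_control = (1.0 - f_low) * t_hwp_high + f_low * t_hwp_low
    # Test run: HWP and LWP phases alternate; the LWP work is split n_lwp ways.
    t_test = (1.0 - f_low) * t_hwp_high + f_low * t_lwp / n_lwp
    return t_control / t_test


if __name__ == "__main__":
    for f in (0.0, 0.05, 0.5, 1.0):
        print(f"low locality fraction {f:4.2f}: gain {performance_gain(f, n_lwp=32):7.2f}")
```

With these placeholder costs the gain grows from 1 at 0% LWP work toward two orders of magnitude at 100%, which mirrors the qualitative trend described above; the specific numbers carry no significance beyond illustration.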
Analytical Model of PIM-based Operation
To better understand the simulated results, an analytical model was developed incorporating the same operational parameters. The results derived from the simulation were reproduced with this analytical model to an accuracy of between 5% and 18%. This encouraging result motivated a second analytical study to expose the basic time to solution normalized to that of the HWP alone performing only high temporal locality work; i.e. 0% LWP workload. The equations are given below:
Equation 1. Analytical Expression for Relative Execution Time
This formulation exposes a remarkable property. Unexpectedly, in addition to the two independent parameters of number of nodes (N) and percentage of LWP workload (%W_L), a third orthogonal parameter, here referred to as N_B, was derived from the combined properties of the system configuration and application workload. This theoretical model is plotted in Figure 6.
From this diagram, it is evident that a point of coincidence occurs at a specific value of N, independent of %W_L. The derived equation for N_B also shows that it is orthogonal to N. For N > N_B, time to solution with PIM support will always be as good as or better than that of the control system without PIM elements. If the form of this relationship is sustained as the underlying model grows in fidelity, the finding will provide a strong condition for superiority of PIM-based system architecture.
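The coincidence property can be illustrated with a simple model of our own. Suppose the time per operation on the HWP (for high locality work) is normalized to 1 and the time per operation on a single LWP is N_B in the same units. If a fraction f of the work moves to N LWP nodes, the relative time is (1 - f) + f N_B / N, which rearranges to 1 - f (1 - N_B / N). This form is an assumption chosen to reproduce the behavior described in the text; it is not claimed to be Equation 1, and the value of `n_b` below is hypothetical.

```python
def relative_time(n_nodes, f_low, n_b):
    """Time to solution relative to the HWP running only high locality work.

    Assumed form (NOT taken from the paper): T_rel = 1 - f_low * (1 - n_b / n_nodes),
    where n_b is interpreted as the ratio of LWP to HWP time per operation.
    """
    return 1.0 - f_low * (1.0 - n_b / n_nodes)


if __name__ == "__main__":
    n_b = 8.0  # hypothetical break-even node count
    for n in (2, 8, 32):
        row = [round(relative_time(n, f, n_b), 3) for f in (0.0, 0.5, 1.0)]
        print(f"N={n:2d}: relative time at f=0, 0.5, 1 -> {row}")
```

In this sketch every curve passes through a relative time of 1 when N equals N_B, regardless of the LWP workload fraction, and the relative time never exceeds 1 once N is larger than N_B.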
PIM Technology and Memory Bandwidth
A principal motivating factor for the exploitation of PIM technology and architecture is the opportunity to greatly increase memory bandwidth with respect to conventional system structures. Partitioning the on-chip memory into multiple memory/processor nodes increases the available on-chip memory bandwidth, the total number of nodes being the product of the number of chips and the number of nodes per chip.
But increasing the number of nodes comes at a cost and may not deliver significantly improved performance-to-cost. As the memory block is subdivided, each new node requires additional logic for registers, data paths, controls, and interfaces, as well as part of the memory stack itself to hold data related to the presence, management, and operation of the memory/processor node. Thus the cost of a PIM chip increases with the number of nodes, while the total user memory capacity of a fixed-size chip decreases, requiring more PIM chips to provide the same user memory. The effective memory throughput is also limited by the concurrency of memory accesses as determined by the user application program, as well as by the distribution of those accesses. If there is little program parallelism, then having too many nodes will waste PIM resources. A critical question for future PIM architecture is: how many nodes should be implemented in a PIM-based memory system of a given user memory capacity?
An analysis was conducted to model the dominant parameters and their quantitative interrelationships for this important design trade-off issue. A generalized performance-to-cost parameter was devised such that performance is equated to sustained bandwidth, b, and cost is the die area, a. Efficiency is the ratio of the sustained bandwidth per unit area to the maximum bandwidth per unit area that can be achieved. An abstract measure of memory access concurrency is used to differentiate points in the design space. The total user memory capacity, M (measured in number of rows), remains constant, and the total area increases as the number of node partitions, n, is increased.
The area, a, is the sum of the areas for the user memory, the node overhead logic, and the overhead memory. The area for a row of memory is given by A_m, and the amount of area required for all of the overhead logic, registers, and control for a single node is given by A_P. M_P represents the additional overhead memory per processor, also given in terms of rows of memory.
The overhead area per processor, V, measured in units equivalent to the area of a row of memory, is given by:
$$ V = \frac{A_P}{A_m} + M_P $$
and after a change of variables, the total area is given by:
$$ a = A_m \left( M + nV \right) $$
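The area accounting described above can be sketched in a few lines. The code follows the definitions in the text (A_m per row of memory, A_P of overhead logic per node, M_P rows of overhead memory per node, M rows of user memory, n nodes); the function names and the sample values in the usage section are hypothetical.

```python
def overhead_rows_per_node(a_p, a_m, m_p):
    """Per-node overhead V, expressed in units of one memory row's area."""
    return a_p / a_m + m_p


def total_area(m_rows, n_nodes, a_p, a_m, m_p):
    """Total die area: user memory plus the overhead of every node."""
    v = overhead_rows_per_node(a_p, a_m, m_p)
    return a_m * (m_rows + n_nodes * v)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the trend with node count.
    for n in (1, 4, 16, 64):
        area = total_area(m_rows=4096, n_nodes=n, a_p=64.0, a_m=2.0, m_p=8)
        print(f"n={n:3d}: total area = {area:.0f}")
```

Because the user memory M is held constant, the total area grows linearly with the number of nodes, which is the cost side of the trade-off.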
To model sustained bandwidth, it is necessary to consider some estimated measure of application user demand in terms of concurrency of access requests. As the user demand or request parallelism goes up, and assuming the access pattern is uniform over all nodes (clearly there are exceptions to this), the probability of access to any given node increases. At the finest-grain level, p is the probability that a given memory row is not accessed in a given machine memory cycle. The number of rows of memory in a single node is given by M/n. Employing a Bernoulli process to represent the probability that all rows in a node are not accessed, meaning that in a given cycle that node did not perform a useful memory access, the probability of a failed node cycle is:
$$ \mathrm{prob}(\text{no access for a node}) = p^{M/n} $$
and the probability for a memory access at a node in a given cycle is:
$$ \mathrm{prob}(\text{access for a node}) = 1 - p^{M/n} $$
Figure 7 presents the total efficiency with respect to a measure of user memory per node. The unit of memory for the independent variable, s, is the amount of memory that would fit into the equivalent area of all the overhead space required to implement a node. Thus, the value s = 1.0 means that the total space of a node is equally divided between its user memory and overhead resources. As s increases, the amount of user memory per node increases and the number of nodes decreases. This is the memory intensive regime of PIM design. Conversely, as s decreases, the amount of user memory per node decreases and the number of nodes in the system increases. This is the node intensive regime of PIM design. The variable r represents the probability that a node will not experience a memory access (likelihood of a miss) if that node has the amount of memory equivalent in area to the area of the overhead resources of the node. As s increases, the probability of a miss, r^s, decreases (because r < 1.0) and the likelihood of a memory access for the node increases, as would be expected.
This reflects program concurrency and data access spatial locality. The diagram demonstrates a broad range of optimality implying a preferred balance point of number of nodes and concurrency of memory access requests.
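The shape of the efficiency curve can be explored with a small sketch. It uses the node access probability from the text and, as an assumption of ours, normalizes efficiency so that a node consisting only of overhead and accessed every cycle would score 1; with s = M/(nV) and r = p^V this gives (1 - r^s) / (1 + s). The normalization, the function names, and the grid search are illustrative and are not taken from the original analysis.

```python
def access_probability(p, m_rows, n_nodes):
    """Probability that a node performs a memory access in a given cycle."""
    return 1.0 - p ** (m_rows / n_nodes)


def efficiency(s, r):
    """Assumed normalized bandwidth per unit area: (1 - r**s) / (1 + s).

    s -- user memory per node, in units of the node's overhead area
    r -- probability of no access for a node holding exactly s = 1 of memory
    """
    return (1.0 - r ** s) / (1.0 + s)


def best_s(r, lo=0.01, hi=100.0, steps=2000):
    """Coarse search over a log-spaced grid for the s that maximizes efficiency."""
    ratio = (hi / lo) ** (1.0 / (steps - 1))
    grid = [lo * ratio ** i for i in range(steps)]
    return max(grid, key=lambda s: efficiency(s, r))


if __name__ == "__main__":
    for r in (0.1, 0.5, 0.9):
        s_opt = best_s(r)
        print(f"r={r:.1f}: best s ~ {s_opt:.2f}, efficiency ~ {efficiency(s_opt, r):.3f}")
```

Under this assumed normalization the efficiency tends to zero in both the node intensive limit (s near 0) and the memory intensive limit (large s), so an interior optimum exists for any r between 0 and 1.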
Discussion and Conclusions
In this paper, we've developed a set of analytic and simulation models for exploring tradeoffs in the PIM design space. We may summarize the findings of these experiments as follows:
Interrelationships between PIM Components and a Host Processor
Augmenting the memory system of a host processor with PIM components can yield performance gains ranging from moderate (a factor of 2 or less) to dramatic (an order of magnitude or more) for applications that can be separated into regions of high or low temporal locality. The models show that adding even small amounts of processing capability to the memory system can have significant impact. For data-intensive applications where there is little or no data reuse, and where caches are of little value, PIM may help enormously. The model that we developed for this study provides a strong foundation that characterizes the region of operation in terms of three independent variables: the number of PIM nodes, the fraction of work that can be assigned to PIM, and a third parameter that is both machine and application dependent. While it may be difficult to calibrate these parameters for specific design points, by sweeping them across a range, we are able to get a broad view of the design space and to recognize emerging trends.
In terms of ongoing research, this first study supports the direction taken by projects exploring PIM-enabled memory for conventional hosts, such as DIVA [HKK+99] and Cascade.
Number of Processing Nodes per PIM Chip
This study investigated the relationship between the number of PIM processing nodes per unit memory density and processing efficiency, as measured by the ability to effectively use the supplied memory bandwidth. One important finding from the model is that the relationship is non-monotonic, and that there is indeed a region of optimality. For applications with high spatial locality or regular access patterns, the model suggests that it is cost-effective to devote significant area to processing logic. For applications that don't have these characteristics, additional processing logic would provide little value and result in a waste of area.
