Microprocessors and memory systems suffer from a growing gap in performance. We introduce Active Pages, a computation model which addresses this gap by shifting data-intensive computations to the memory system. An Active Page consists of a page of data and a set of associated functions which can operate upon that data. We describe an implementation of Active Pages on RADram (Reconfigurable Architecture DRAM), a memory system based upon the integration of DRAM and reconfigurable logic. Results from the SimpleScalar simulator [BA97] demonstrate up to 1000X speedups on several applications using the RADram system versus conventional memory systems. We also explore the sensitivity of our results to implementations in other memory technologies.
Introduction
Microprocessor performance continues to follow phenomenal growth curves which drive the computing industry. Unfortunately, memory-system performance is falling behind. Processor-centric optimizations to bridge this processor-memory gap include prefetching, speculation, out-of-order execution, and multithreading [WM95]. Several of these approaches can lead to memory-bandwidth problems [BGK96]. We introduce Active Pages, a model of computation which partitions applications between a processor and an intelligent memory system. Our goal is to keep processors running at peak speeds by off-loading data manipulation to logic placed in the memory system.
Active Pages consist of a page of data and a set of associated functions that operate on that data. For example, an Active Page may contain an array data structure and a set of insert, delete, and find functions that operate on that array. A memory system that implements Active Pages is responsible for both the storage of the data and the computation of the associated functions.
Rapid advances in fabrication technology promise to make the integration of logic and memory practical. Although Active Pages can be implemented in a variety of architectures and technologies, we focus upon the integration of reconfigurable logic and DRAM. We introduce the RADram (Reconfigurable Architecture DRAM) system. On many applications, our simulations show substantial performance gains for a uniprocessor workstation using a RADram system versus a conventional memory system. RADram can also function as a conventional memory system with negligible performance degradation. As we shall see in Section 3, RADram is likely to have superior yield, higher parallelism, and better integration with commodity microprocessors when compared to architectures such as IRAM [Pat95]. Since memory technologies are a moving target, we measure the sensitivity of our results to the speed of Active Page implementations. This allows us to generalize to currently available technologies such as DRAM macrocells in ASIC (Application-Specific Integrated Circuit) technologies.
This paper starts with a description of Active Pages in Section 2, and continues with our RADram implementation in Section 3. We then describe our experimental methodology in Section 4 and our applications in Section 5. We continue with the reconfigurable logic designs for each application in Section 6. We present our results in Section 7 and generalize these results to other technologies in Section 8. Finally, we conclude with a discussion of related work in Section 9, future work in Section 10, and conclusions in Section 11.
Active Pages
Active Pages introduce new programming, system, and fabrication issues. In this section, we shall discuss programming issues which arise from the Active Page computational model. These issues are partitioning, coordination, computational scaling, and data manipulation. We will discuss system and fabrication issues in Section 3, where we introduce the RADram Active-Page implementation.
To use Active Pages, computation for an application must be divided, or partitioned, between processor and memory. For example, we use Active-Page functions to gather operands for a sparse-matrix multiply and pass those operands on to the processor for multiplication. To perform such a computation, the matrix data and gathering functions must first be loaded into a memory system that supports Active Pages. The processor then, through a series of memory-mapped writes, starts the gather functions in the memory system. As the operands are gathered, the processor reads them from user-defined output areas in each page, multiplies them, and writes the results back to the array data structures in memory.
Partitioning
In our sparse-matrix example, the application was partitioned between work done at the memory system and work done at the processor. Such partitioning varies in emphasis between efficient use of processor computation and efficient use of Active-Page computation. We refer to these two extremes as processor-centric and memory-centric partitioning. Processor-centric partitioning is appropriate for algorithms with complex computations, such as floating point. Memory-centric partitioning is appropriate for data manipulation and integer arithmetic.
Sparse-matrix computations require substantial floating-point computation and suggest a processor-centric partitioning. Active Pages compute which operands must be multiplied, with the goal of providing the processor with enough operands to keep it running at peak speeds. Our image processing application, on the other hand, uses integer arithmetic and can be performed almost entirely in Active Pages. Consequently, the goal is to exploit parallelism and use as many Active Pages as possible.
Activation Time
Intuitively, a processor working with a memory system that implements Active Pages is similar to a control processor working with a small data-parallel machine. Typically, an algorithm is partitioned by first dispatching a request for a computation to occur on the data within an Active Page. A well-structured application will have to move little, if any, additional data into the page in order for that function to complete. Thus, the majority of time in dispatching a work request is spent communicating to the Active Page the function to invoke and additional required parameters. We refer to the time it takes to dispatch this request as activation time. Activation time is generally constant for each page for a given function (measurements for each application will be given in Table 4).
Coordination
Partitioning computations implies that Active Pages must coordinate with the processor and with each other. Processor-page coordination is accomplished via predefined synchronization variables. Inter-page coordination is accomplished with inter-page memory references.
Synchronization variables are used to coordinate activities between the Active Page functions and the processor. The structure and layout of these variables are implementation and application specific. The variables may serve as locks to indicate when inputs or outputs for an Active Page operation are valid. This interface is similar to memory-mapped registers used for network interfaces.
The Active Page model of computation does not define an explicit means for inter-page communication. Support for communication between pages can be accomplished in a variety of fashions. Abstractly, all forms of communication are viewed as non-local memory references issued by an Active Page. For performance reasons, an Active Page memory system may choose to combine several references into a contiguous inter-page memory copy. Our RADram implementation (Section 3) simulates such an approach.
Computation Scaling
The computational power of Active Pages scales in an unusual way as application problem sizes grow. In this section, we develop some intuition about this scaling, and we will verify these intuitions in Section 7.
Traditional multiprocessors generally operate with a fixed number of processing engines which must be applied to a variable problem size. With Active Pages, the number of processing engines is coupled to physical memory size. Since many systems are designed to scale memory size to contain the data of their intended applications, more Active Pages will be available for the computation. Figure 1 shows how we expect Active-Page performance to scale as problem size grows. Speedup refers to the execution time of a system using a conventional memory system divided by the execution time of a system using Active Pages. Non-Overlap Time is the time the processor spends waiting for Active Page computation which is not overlapped with processor computation. This is indicative of the quality of partitioning. As illustrated in Figure 1, we expect three regions of speedup as problem sizes scale:
The sub-page region: For very small problem sizes, applications use a small number of Active Pages and utilization of those pages is poor. Activation time dominates the computation and speedups do not scale until the Active Page function offloads sufficient work from the processor.
The scalable region: Once the problem is larger, the number of Active Pages involved increases linearly. The corresponding increase in computational power results in linear speedups.
The saturated region: Although the number of Active Pages grows with data size, the number of processors in a system does not. Consequently, we expect speedups to eventually level off as the processor component of the application saturates constant processor resources. This leveling off can also produce a degradation in performance, as an increased number of Active Pages can increase the synchronization and communication overhead.
Ideally, we want speedups which are in the rightmost portion of the scalable region. Fortunately, partitions can be tuned to shift this scalable region towards specific problem sizes.
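The three regions can be illustrated with a toy cost model. The sketch below is not the analytical model developed later in the paper; every parameter (conv_per_page, activate, page_compute, proc_per_page) is a hypothetical value chosen only to make the shape visible. It assumes the processor activates pages serially, pages then compute in parallel, and the processor performs a fixed amount of work per page. It does not model the degradation that synchronization overhead can cause in the saturated region.

```python
import math


def speedup(pages, conv_per_page=100.0, activate=1.0,
            page_compute=20.0, proc_per_page=0.5):
    """Toy speedup for a problem that fills `pages` Active Pages.

    `pages` may be fractional to model sub-page problems.
    All cost parameters are hypothetical time units.
    """
    n_pages = max(1, math.ceil(pages))
    # Conventional system: the processor does all the work, linear in size.
    conventional = conv_per_page * pages
    # Active Pages: pages run in parallel, so the fullest page bounds
    # the memory-side time.
    memory_time = page_compute * min(pages, 1.0)
    # The processor's own per-page work overlaps with page computation.
    processor_time = proc_per_page * n_pages
    active = activate * n_pages + max(memory_time, processor_time)
    return conventional / active


if __name__ == "__main__":
    for size in (0.01, 0.1, 1, 4, 16, 64, 256, 1024):
        print(f"{size:>8} pages -> speedup {speedup(size):6.2f}")
```

With these made-up constants, the speedup is below one for tiny problems (activation dominates), grows roughly linearly over the first few dozen pages, and flattens once the processor's per-page work exceeds the parallel page time.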
Data Manipulation
In addition to providing scalable computation, Active Pages allow programmers to optimize for density and indexing rather than data manipulation. Currently, programmers have a wealth of data structures they can choose to use for any given problem. However, these data structures each have advantages and disadvantages. For instance, doubly-linked lists provide fast insertion and deletion of elements, but poor random access. On the other hand, arrays provide fast random access, but poor performance on insertions and deletions.
To some extent, Active Pages remove the burden of compromise when choosing a data structure. For example, our implementation of the STL array class uses dense arrays, but exploits Active Page functions to provide fast insertion and deletion.
We adopt a processor-mediated approach to inter-page communication which assumes infrequent communication. When an Active-Page function reaches a memory reference that cannot be satisfied by its local page, it blocks and raises a processor interrupt. The processor satisfies the request by reading and writing to the appropriate pages. Once an interrupt is raised, the processor generally satisfies many requests from different pages in the system. Future work will evaluate hardware mechanisms for in-chip communication, increasing the number of outstanding references per page, and processor polling for requests. The processor-mediated methodology, however, functions well for our applications and will greatly simplify future work in paging and virtual memory.
Table 1 lists the parameters of our reference RADram implementation. Several parameters were also individually varied in our experiments with respect to the reference implementation. The range of variation for these parameters is also given in Table 1. Additionally, a memory bus capable of transferring 32 bits of data between memory and cache every 10 ns is assumed.
Why Reconfigurable Logic?
The potential of gigabit densities in DRAM has prompted research and development in a variety of implementation options for intelligent memory. IRAM [Pat95], an integration of processor core and DRAM, is a well-known option studied at Berkeley. RADram, however, is likely to have better yield, higher parallelism, and better integration with commodity processors than IRAM.
The primary advantage of RADram memory devices is that they will be inexpensive to fabricate. Processor chips cost ten times as much as memory chips because their complexity makes their yield, or percentage of working chips, much lower [Prz97]. DRAMs are fabricated with redundant memory cells that can replace defective cells through laser modification after chip production. The uniform nature of reconfigurable logic allows for similar measures in RADram chips. In contrast, IRAM chip designers will have to work hard to avoid yields similar to processor chips. If IRAM chips are fabricated at processor costs, systems will be limited to a few IRAM chips and to applications with smaller data. RADram is intended to fabricate at DRAM costs, which allows dozens of chips per system and much larger application data.
Our results will show that RADram can exploit extremely high parallelism by supporting simple, application-specific operations in memory. A multi-gigabit RADram can have more than 128 Active Pages, each of which can execute simultaneously. Processor-in-DRAM solutions cannot support such high parallelism. The variety of custom operations used in our applications also suggests that fixed logic would severely limit the functionality of Active Page applications.
Finally, RADram is specifically designed to support commodity microprocessors. The RADram interface is compatible with standard memory busses. A primary goal of RADram is to supply microprocessors with enough data to keep them running at peak speeds. IRAM technology, however, is intended to compete with commodity processors. This competition may eventually be favorable for IRAM as the importance of single-chip systems increases, but ever-growing applications may always demand larger memories and multiple chips.
Fabrication
Interest in the fabrication of Merged DRAM Logic (MDL) devices has grown dramatically in the past few years. Major manufacturers currently have the capability to fabricate DRAM cells (macrocells) in logic chips. Processors have also been fabricated in DRAM chips. Current DRAM in logic chips has poor density. Logic in DRAM chips has poor speed and density. Merged DRAM-logic processes, which can fabricate both kinds of structures well, are becoming available [Prz97]. Our study, however, is conservative and assumes a DRAM process with associated penalties in logic speed and density.
Power
Power consumption is a major concern for DRAM chips because increased chip temperatures result in higher charge leakage from storage cells. This leakage requires more frequent DRAM refresh. Fortunately, this higher refresh rate can be bundled into the logic we add to each DRAM subarray.
Although a detailed study of power is beyond the scope of this paper, we have been conservative in our use of power in RADram. Our applications only use 32 bits of bandwidth between data and logic in RADram pages. This could easily be increased to 256 or 512 bits, but would result in higher power consumption. Increasing bandwidth would also require more reconfigurable logic, which is beyond our area constraints for some applications. Application performance, however, is high despite conservative bandwidth.
Methodology
To evaluate Active Pages, we conducted a detailed application study. The reference Active-Page platform used for this study was previously described in Section 3. This platform was studied using a three-step approach. First, a simulator was implemented which modeled the RADram Active-Page memory system. Second, a set of applications was chosen which represented various algorithmic domains. Finally, these applications were written and optimized for both the RADram and conventional memory system architectures.
As a base for a simulation environment we started with the SimpleScalar v2.0 tool set [BA97]. This tool set provides the mechanisms to compile, debug, and simulate applications compiled to a RISC architecture. The SimpleScalar RISC architecture is loosely based upon the MIPS R3000 instruction set architecture. The SimpleScalar environment was extended by replacing the simulated conventional memory hierarchy with an Active-Page memory system. The new simulated memory hierarchy provides mechanisms which simulate RADram application-specific circuits executing within the DRAM memory system. Further, the SimpleScalar instruction set was extended with Intel MMX multimedia instruction opcodes. Finally, the tool set was enhanced by updating the included GNU C/C++ compiler to the latest v2.7.2.1 compiler suite. All applications in this study were compiled with the -O3 optimization option.
After implementation of this simulation environment, a set of applications was chosen for architectural evaluation. Each application is briefly described in Section 5. Here we explore the methodology used in choosing, partitioning, and evaluating these applications.
Applications were chosen with three motives in mind. First, the algorithms to be implemented in the application were representative of a broad class of algorithms used in a range of applications. Second, the algorithm or application illustrated a certain kind of partitioning as described in Section 2. Finally, an MMX-instruction-set compatible application was chosen to explore Active-Page implementations other than RADram. For instance, future work may investigate the possibility of identifying a small key set of data manipulation primitives which should be implemented in fixed logic in the Active-Page model.
The first step in studying each application or algorithm described in Section 5 is to implement and optimize it on a conventional memory system. The application is then hand-partitioned for an Active-Page memory system. Next, Active-Page functions are coded in VHDL and synthesized to FPGA logic. The results of this are discussed in Section 6. State transition characteristics of these synthesized circuits are used to simulate the functions with our SimpleScalar simulator.
Applications
In order to demonstrate effective partitioning of applications between processor and Active Pages, we chose a range of applications representing both memory- and processor-centric partitioning. Table 2 summarizes the attributes of these applications. This section describes each application and divides those descriptions into each partitioning class.
As discussed in Section 2, Active Pages can exploit the parallelism in applications through memory-centric partitioning. Our array, database, median filtering, and dynamic programming applications are good examples of such partitioning.
STL Array Template
The STL array template is a general purpose C++ template which permits the storage, access, and retrieval of objects based upon a linear integer index. The template class supports the usual array access operators, as well as insert, delete, and binary-find (count) operations. All of the applications implemented hide the layout of data and partitioning of algorithmic operations from the application via a simple C++ interface. However, the STL array best demonstrates this principle. Library calls, derived from a common subclass, allow single source files to work with either the Active-Page or conventional-system implementation of the array template. The implementation uses reconfigurable logic to speed up the following operations: array insert, delete, and count. The insert and delete operations involve moving portions of the array in parallel to accommodate the change in array size. The count operation is implemented by a binary comparison circuit.
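The parallel shift behind an array insert can be sketched in software. The following is a minimal illustration, not the RADram circuit: it assumes a dense array stored as a list of fixed-size pages (all full except possibly the last), and the names paged_insert and page_size are hypothetical. The element that spills off the end of each full page is collected first (standing in for the inter-page transfers), after which every affected page shifts independently.

```python
def paged_insert(pages, page_size, index, value):
    """Insert `value` at global position `index` in a dense paged array.

    `pages` is a list of lists; every page except possibly the last
    holds exactly `page_size` elements. Modifies `pages` in place.
    """
    first, offset = divmod(index, page_size)
    if first == len(pages):
        # Appending just past a full last page: start a new page.
        pages.append([value])
        return

    affected = pages[first:]
    # Step 1: find the element each full page will push into its successor.
    spills = [p[-1] if len(p) == page_size else None for p in affected]
    # Step 2: each page receives one element at its insertion point.
    heads = [value] + spills[:-1]
    offsets = [offset] + [0] * (len(affected) - 1)
    # Step 3: pages shift independently; this loop models parallel work.
    for page, head, off, spill in zip(affected, heads, offsets, spills):
        if spill is not None:
            page.pop()
        page.insert(off, head)
    # A full last page spills into a freshly allocated page.
    if spills[-1] is not None:
        pages.append([spills[-1]])


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5, 6], [7]]
    paged_insert(data, 3, 1, 9)
    print(data)
```

Because step 1 reads every spill value before any page is modified, the per-page shifts in step 3 have no dependencies on one another, which is the property a page-parallel implementation relies on.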
These three operations are indicative of a broad range of array operations which the RADram system can effectively compute. Further examples from the STL library include: accumulate, partial sum, random shuffle, rotate, and adjacent difference.
Database Query
Several methods [SKS97] exist to speed up database searches if the searches involve indexed fields. Indexing produces a second table within the database which permits the database engine to quickly locate fields in logarithmic or constant time. However, indexing is often not practical for highly varied queries or under tight storage constraints. Unindexed queries can take time linear in the number of records. Our database benchmark uses a synthetically generated address book. Custom Active Page functions were written to search for exact matches on any of the string fields contained in the address records. The RADram system time complexity of the unindexed database query is $O(1)$; however, the constant bounding it is quite large. The performance gained by the RADram system comes from the parallelism available in the database search. In theory, all records can be searched simultaneously. In practice, the records are grouped into blocks, which are roughly the size of a RADram memory page. These blocks are then distributed among the pages in the RADram memory system. Each page is then custom programmed with the search engine's application-specific circuit. To demonstrate the performance of the RADram system on this application, a count of exact matches for the last name of an individual in the address book is performed. The count is run on the same database in both the RADram system and on a conventional implementation.
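The block-wise search can be sketched as follows. This is an illustrative software analogue under stated assumptions, not the benchmark's code: the record layout (a dict with a "last_name" key), the function names, and records_per_page are all hypothetical. Each page counts matches among its own records, and the processor only sums the per-page counts.

```python
def partition(records, records_per_page):
    """Group records into page-sized blocks."""
    return [records[i:i + records_per_page]
            for i in range(0, len(records), records_per_page)]


def page_count(page, last_name):
    """Work done inside one page: exact-match count on one field."""
    return sum(1 for record in page if record["last_name"] == last_name)


def count_matches(records, last_name, records_per_page):
    """Processor side: dispatch to every page, then sum the results."""
    pages = partition(records, records_per_page)
    # Each call below is independent, so pages could run concurrently.
    return sum(page_count(page, last_name) for page in pages)


if __name__ == "__main__":
    book = [{"last_name": n} for n in ("Smith", "Jones", "Smith", "Lee")]
    print(count_matches(book, "Smith", records_per_page=2))
```

The time for the per-page step depends on the page size rather than the total record count, which is the intuition behind the constant-time claim above when enough pages are available.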
Image Processing
Image processing and signal processing have been traditional strengths of FPGAs and custom processor technologies [R+93, AA95, K+96]. We implemented an image median filtering [RW92] application on RADram. Median filtering is a non-linear method which reduces the noise contained in an image without blurring the high-frequency components of the image signal. The RADram implementation divides the image by row blocks among various Active Pages. Each row block contains two additional rows, one above the current row block and one below it, in order to perform the median filtering kernel computation. The Active Pages are then programmed with a custom circuit designed to find the median of nine short integer values. For comparison, the conventional system uses a hand-coded algorithm which takes a minimal number of comparisons to find the median of nine values. Because the computational work involved is small in terms of circuit area, the bulk of the median filtering application runs inside the RADram memory system. Not surprisingly, this application allows RADram to exploit high parallelism and memory bandwidth. RADram also uses a custom circuit which is designed for sorting nine short integer values. The conventional implementation requires several conditional instructions, as well as memory I/O operations, in order to find the median value.
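The row-block decomposition can be sketched in a few lines. This is a plain software illustration, not the synthesized circuit: it assumes a 3x3 window, an image stored as a list of rows, and border pixels copied unchanged (the border policy is an assumption, as are all function names). Each block reads one extra row above and below its own rows, which corresponds to the two additional rows stored with each row block.

```python
def median9(values):
    """Median of nine values; sorting stands in for the custom circuit."""
    return sorted(values)[4]


def filter_rows(image, start, stop):
    """Filter rows [start, stop), reading halo rows start-1 and stop."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(start, stop):
        row = list(image[r])
        if 0 < r < height - 1:  # border rows are copied unchanged
            for c in range(1, width - 1):
                window = [image[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
                row[c] = median9(window)
        out.append(row)
    return out


def median_filter(image, rows_per_block):
    """Split the image into row blocks and filter each independently."""
    result = []
    for start in range(0, len(image), rows_per_block):
        stop = min(start + rows_per_block, len(image))
        result.extend(filter_rows(image, start, stop))
    return result


if __name__ == "__main__":
    noisy = [[0] * 5 for _ in range(5)]
    noisy[2][2] = 255  # a single noise spike
    for row in median_filter(noisy, rows_per_block=2):
        print(row)
```

Since every block reads only the original image, the result does not depend on the block size, so blocks can be assigned to pages freely.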
Largest Common Subsequence
This algorithm is representative of a broad class of string algorithms which form the basis for modern biological research. At the heart of the computer algorithm to reconstruct DNA sequences are string algorithms such as largest common subsequence, global alignment, and local alignment [Gus97]. The largest common subsequence (LCS) computation is typically done using a dynamic programming construction. This construction runs in $O(n^2)$ time and space for sequences of length $n$. One can view the construction as a set of computations over a plane. For the LCS algorithm, the computation can proceed in parallel as a wave-front starting at the upper left corner and ending in the lower right corner of this plane. This wave-front computation runs in $O(n \log n)$ time on the RADram system. The RADram system implements the LCS computation by dividing the algorithm into two steps. The first step is the computation of the LCS result matrix itself. The second step is the backtracking [CLR96] required to find the largest common subsequence. The RADram system executes the first step entirely within the reconfigurable logic inside the memory system. Backtracking executes entirely within the processor.
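The wave-front order can be made concrete with a short sketch. The code below is a sequential illustration only: it fills the standard dynamic programming table one anti-diagonal at a time, which shows that all cells on a diagonal depend only on earlier diagonals and could therefore be computed together. It does not reproduce the RADram timing, and the function name is hypothetical. Backtracking is omitted; only the table-filling step is shown.

```python
def lcs_length_wavefront(a, b):
    """Length of the common subsequence table entry for full a and b,
    computed in anti-diagonal (wave-front) order."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    # Cells with i + j == d form one wave-front, for d = 2 .. n + m.
    for d in range(2, n + m + 1):
        # Every cell in this inner loop is independent of the others.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


if __name__ == "__main__":
    print(lcs_length_wavefront("ABCBDAB", "BDCABA"))
```

A cell at (i, j) reads (i-1, j-1) from diagonal d-2 and (i-1, j), (i, j-1) from diagonal d-1, so sweeping d from the upper left to the lower right satisfies every dependency.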
Processor-Centric Partitioning
Active Pages are intended for simple, application-specific operations, leaving more complex computations to general-purpose microprocessors. Our MMX and matrix applications are good examples of processor-centric partitioning.
MMX Primitives
The MMX multimedia instruction primitives were chosen for implementation within the RADram system for two reasons. First, they represent a well-known "commodity" set of architecture primitives. Second, they are simple primitive operations designed for parallel execution.
The simulator was extended to support SimpleScalar MMX instructions, and RADram MMX instruction equivalents. The MMX instructions themselves are highly parallel, simple, and generally complete in a single processor cycle. To improve upon the base SimpleScalar MMX instructions, the RADram equivalents operate on larger data widths. While an MMX instruction in SimpleScalar is restricted to producing only 32 bits of data per instruction, a RADram MMX instruction can produce up to 256 kbytes of data per instruction.
While implementation of the complete MMX instruction set is still underway, enough is implemented to carry out key portions of the MPEG encoding and decoding processes. While future work will explore more MPEG routines, current work has focused upon application of the correction matrices within the P and B frames [M+96]. Future implementation of the MPEG algorithm will partition additional components between the processor and RADram memory system. The processor will be responsible for the Discrete Cosine Transform (DCT), while the RADram system will handle motion detection, application of motion correction matrices, run length encoding and decoding (RLE), and Huffman encoding and decoding.
Sparse-Matrix Multiply
A wide range of real-world problems can be represented as sparse matrices. We examine both a common scientific benchmark and a more challenging compiler optimization problem. Our scientific benchmark involves the multiplication of matrices representing finite-element computations taken from the Harwell-Boeing benchmark suite [D+92]. Our compiler optimization problem involves using the Simplex method [NM65] to perform optimal register allocation [GW96].
A key computation in both these applications is sparse vector-vector dot-product. Conventional implementations of this operation are severely limited by processor-memory bandwidth. Sparse vector FLOPS on a conventional system are often an order of magnitude lower than those for dense vectors. The processor must fetch the indices of each nonzero in both vectors of the dot product, determine which indices match, fetch the data corresponding to those indices, multiply the data, and write the data back to its appropriate location.
In contrast, the RADram system implements a compare-gather-compute approach. Active Page functions fetch and compare vector indices, fetch the data values for the indices that match, and gather the data into cache-line size blocks. Vectors are co-located on pages. The processor then reads the packed data, computes the multiplies, and writes back cache-line size blocks of results. Note that only "useful" data travels between the processor and memory, greatly conserving bandwidth. With large matrices, the RADram system has enough Active Pages executing to keep the processor computing at peak floating-point speeds.
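The division of labor in compare-gather-compute can be sketched for a single sparse dot product. This is an illustrative sketch, not the RADram circuit: it assumes each sparse vector is stored as parallel lists of sorted indices and values (a storage format chosen here for simplicity), and the function names are hypothetical. The first function stands in for the Active Page work, the second for the processor.

```python
def gather_matching(idx_a, val_a, idx_b, val_b):
    """Memory side: compare sorted index lists and pack matching operands.

    Returns a dense list of (a_value, b_value) pairs, which models the
    packed blocks the processor reads.
    """
    pairs = []
    i = j = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            pairs.append((val_a[i], val_b[j]))
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return pairs


def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Processor side: floating-point multiplies on packed operands only."""
    return sum(x * y for x, y in gather_matching(idx_a, val_a, idx_b, val_b))


if __name__ == "__main__":
    print(sparse_dot([0, 2, 5], [1.0, 2.0, 3.0],
                     [2, 3, 5], [4.0, 5.0, 6.0]))
```

In this example only two operand pairs cross from the gather step to the multiply step, even though the two vectors hold six nonzeros between them; the index comparisons never reach the processor.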
Synthesized Logic
In order to estimate performance and area of RADram logic configurations, each function of an application's Active Pages was hand-coded in a high-level circuit-description language, VHDL [Ash90], and circuits were synthesized to completely routed designs in contemporary FPGA technology. This provided a means to verify the timing of the simulated circuit implementation, as well as information on circuit area, which helped guide the RADram design.
The results of our implementations of the application-specific circuits for the simulated applications are summarized in Table 3. These results were obtained by implementing the circuit designs in behavioral VHDL and synthesizing them with the Synopsys FPGA design tools. After synthesis to a technology-independent logic description, the designs were placed and routed to an Altera FLEX-10K10-3 part. This allowed us to study the post-routed designs on real FPGA technology. The count of logic block usage reported in Table 3 includes both completely used and partially used LEs. The speed and code size were directly reported by the Synopsys tools.
The results obtained from implementation of application-specific circuits indicate that the RADram Active-Page system can execute the application kernel's circuits. The RADram implementation can implement designs with approximately 256 LEs per Active Page, and all of our designs are below this amount. Our designs can also be further optimized by implementing common memory interfaces in fixed logic. Our system simulation assumes a 100 MHz clock for our circuits. Given modest advances in FPGA technology, this should be achievable for our circuits by 2001. Finally, the code size is an indication of the potential "code-bloat" which will happen when transitioning an application to the RADram system. Code size is also indicative of the page-replacement cost for Active Pages, which we anticipate to be 2-4 times larger than for conventional pages due to reconfiguration time. However, pages which do not use Active-Page functions do not incur this cost, and future reconfigurable technologies may significantly reduce this cost (see Section 10).
Results
In this section, we compare our RADram simulation results of each application kernel described in Section 5 to our expectations from the Active-Page application characteristics discussed in Section 2. First, we discuss performance of RADram versus a conventional memory system executing optimized versions of the same applications. Then we explore the memory hierarchy of both memory systems by studying the effects of cache parameters. Finally, we develop an analytical model to describe partitioned application performance, and then compute the correlation between this model and our experimental results.
Performance
To evaluate performance of the RADram Active Page memory system, each application described in Section 5 was executed on a range of problem sizes using a fixed set of machine characteristics listed in Table 1. The speedup of our applications running on a RADram memory system compared to a conventional memory system is shown in Figure 3. Each application was run on a range of problem sizes, given in terms of number of Active Pages (512 Kbyte superpages). We make two primary observations about this graph.
First, application kernels execute significantly faster on a RADram memory system than on a conventional memory system. The one exception from our application mix is the array-delete primitive in the sub-page region. The SimpleScalar processor instruction set actually favors array-delete over array-insert. To take advantage of this fast delete, the RADram version of array-delete uses an adaptive algorithm that uses the processor more for arrays that are smaller than one Active Page.
Second, our performance results qualitatively scale as we expected in Figure 1. We observe that most applications show little growth in speedup as data size grows within the sub-page region (below one page for most applications). In this region, RADram applications have little parallelism to offset activation costs. As we leave this region, we enter the scalable region and see that performance on all of our applications grows nicely as data size increases. Several applications (database, mmx, matrix-simplex, matrix-boeing, and median-filtering) also reach the saturated region. Here, RADram performance is limited by the progress of the processor. This limitation may be due to either too much work for a given speed processor or too much data traveling between the processor and RADram across the memory bus. Performance can actually decrease as coordination costs dominate performance. Given a large enough problem size, all our applications would eventually reach the saturated region.
Processor-Memory Non-overlap
The saturated region of Active-Page performance emphasizes the importance of partitioning applications to use the processor in a system efficiently. For processor-centric applications, this dependence is obvious. The goal is to keep the processor computing by providing a steady stream of useful data from the memory system. For memory-centric partitions, however, the processor is still a vital resource. Active Pages cannot compute without activation and inter-page communication, both provided by the processor. As data size grows in an Active-Page application, so does the load upon the processor. We measure the remaining capacity of a processor to handle this load with a metric we call processor-memory non-overlap time. Non-overlap is the time the processor spends waiting for the memory system, and it can be used to estimate the boundary between the scalable and saturated regions of application performance.
Figure 4 shows the relative percentage of time the processor is stalled waiting for memory-system computation. As described earlier in Section 7.1, the applications which reached the saturated region of speedup were: database, matrix-simplex, matrix-boeing, and median-filtering. As shown in Figure 4, these applications also reach a point of complete processor-memory overlap. The effect of this is described in Section 2.
We also observe that for the array primitives and the dynamic programming application the non-overlap percentage remains relatively high. These applications are largely memory-centric, with very little processor activity. In fact, the array primitives operate asynchronously to the end of the application, and are artificially forced into synchronous operation for this study. This means that an application can use the insert and delete array primitives with only the cost of RADram function invocation. Modulo dependencies on the array, the time spent by the memory system shifting data can be overlapped with operations outside of the STL array class. This overlap occurs in a natural way with no additional effort required by the programmer who uses the RADram STL array class. Opportunities for overlapping execution of data-structure operations with data-structure usage are intriguing, and are being investigated further. The dynamic programming example maintains a very high processor-memory non-overlap; however, preliminary results indicate that processor-mediated communication required by the RADram memory system eventually dominates performance. This occurs for extremely large problems that are well beyond the range of problem sizes presented in this study.
Cache Effects
The simulated processor used for this study has a default split instruction-data level-one cache. Each level-one cache is 64 kilobytes and 2-way associative. The processor also has a combined level-two cache of 1 megabyte that is 4-way associative. For this study the level-one data cache size was varied from 32 to 256 kilobytes. The level-two cache size was varied from 256 kilobytes to 4 megabytes. Figure 5 (left) plots total conventional application kernel execution time versus the size of the level-one data cache. As illustrated, within the range of cache sizes explored most conventional applications were unaffected. However, at the left edge of Figure 5 (left) we note that some conventional applications are affected by the size of the level-one cache when it falls below 64 kilobytes. Figure 5 (right) plots total RADram application kernel time versus level-one data cache size. As illustrated, all but one application were unaffected by the size of the level-one cache. The median-total application shows various stride effects. The application consists of two phases. The first reads data into an array and transforms it into a special data layout required by the Active-Page memory system. The size of the level-one cache plays a role in enhancing the performance of this operation. The second phase simply dispatches the request for median filtering to the Active-Page memory system and waits for the result. As is evident from the performance of median-kernel, the second phase is unaffected by the size of the level-one cache.
All applications were also executed with a range of level-two cache sizes. Throughout this range no significant performance differences occurred. This, combined with the level-one cache results, indicates that our applications are sensitive to extremely small cache sizes, but small to reasonably sized caches achieve all of the performance of large caches. Active-Page applications tend to work with large datasets. Although their primary working set may fit in a small cache, secondary working sets will not fit in realistic cache sizes. Consequently, without migrating to a cache-only architecture, our application performance is bounded by other architectural characteristics such as DRAM memory latency and bandwidth.
Analysis
To achieve a deeper understanding of the performance of application partitions, we introduce an analytic model. This model is based upon an abstract application. From this abstract application a formula is developed which models performance under various problem sizes. Additionally, total application performance is bounded by Amdahl's Law. We present this model by first developing an intuitive understanding of a partitioned application. Then we characterize processor performance with an Active-Page memory system. Finally, we compute the correlation of this analytical model with the results obtained from our RADram simulator.
Model
Section 2 described partitioning and the role it plays in application performance on an Active-Page memory system. To investigate partitioning in more detail, an abstract application is depicted in Figure 6. As illustrated in Figure 6, a partitioned algorithm undergoes two phases from the perspective of the processor: activation and post-processing. The activation phase is characterized by increasing Active-Page activity. The post-processing phase is characterized by decreasing Active-Page activity, but with potential processor-memory non-overlap stalls mixed with processor computation.
The abstract application depicted in Figure 6 uses $K$ pages of Active-Page memory. The processor spends $TA_i$ time activating Active Page $i$. Initially, the processor activates all pages in sequence, thus requiring $\sum_{i=1}^{K} TA_i$ time to activate all pages. Immediately after activation, an Active Page begins to execute. The time required to complete execution for Active Page $i$ is $TC_i$. After dispatching the activation request to all $K$ pages, the application returns to the first page to perform any follow-up processor computation. Before the processor can perform this computation, however, the processor may be forced to stall and wait for the Active Page in memory location 1 to finish execution. At this point in Figure 6, the processor is stalled, waiting in non-overlap time. We account for this as $NO_1$, or non-overlap time waiting for Active Page 1. The processor, after waiting for $NO_1$ time for the Active [...]

Using this abstract application we observe that all processor time for a single partitioned algorithm is accounted for in three distinct sets of variables: $TA_i$, $TP_i$, and $NO_i$. Thus total kernel execution time for a partitioned application is the summation

$$\sum_{i=1}^{K} \left( TA_i + TP_i + NO_i \right).$$

$T_{conv}$ is time per item. We note that the non-overlap time the processor spends before post-processing of page $i$ is the maximum of zero and the computation time of the Active Page minus the time spent by the processor between finishing activation of page $i$ and the current time [...] on previous pages.

Table 4: Activation time $TA$, computation time $TC$, post-activated processor time $TP$, and minimum problem size for complete overlap.
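As an illustrative sketch (not the simulator used in this study), the timeline of the abstract application can be written as a short Python routine. It assumes that page $i$ begins computing the moment its activation completes, and that the processor post-processes pages in the same order in which it activated them. The function name `partitioned_time` and the sample values are hypothetical.

```python
def partitioned_time(ta, tc, tp):
    """Return (total_time, non_overlap) for K Active Pages.

    ta[i]: processor time to activate page i
    tc[i]: Active-Page computation time of page i
    tp[i]: processor post-processing time for page i
    Hypothetical helper; the timeline rules are assumptions read from the text.
    """
    k = len(ta)
    assert len(tc) == k and len(tp) == k

    # Activation phase: pages are activated in sequence, and each page
    # starts computing as soon as its own activation completes.
    now = 0.0
    finish = []
    for i in range(k):
        now += ta[i]
        finish.append(now + tc[i])

    # Post-processing phase: revisit pages in order, stalling whenever
    # a page has not yet finished its computation.
    non_overlap = []
    for i in range(k):
        stall = max(0.0, finish[i] - now)
        non_overlap.append(stall)
        now += stall + tp[i]

    return now, non_overlap


# Illustrative values only (arbitrary time units).
total, stalls = partitioned_time([1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [1.0, 1.0, 1.0])
print(total, stalls)
```

Under these assumptions the returned total always equals the sum of the activation, post-processing, and non-overlap times, which matches the accounting described in the text.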
Correlation
In general, an average activation time TA and average post-page computation time TP can be measured using a small to medium data-set. Furthermore, an average Active-Page computation time TC can be measured from this small data-set. Using these averages and the model in Figure 7, a rough estimate of the non-overlap time for a particular problem size can be found. Using this estimate, it is possible to predict performance of a partitioned application for a range of problem sizes. This prediction provides insight into the particular characteristics of a partitioned application. By modeling performance as activation, post-page computation, per-page Active-Page computation, and processor-memory non-overlap time, it is possible to gauge performance at a variety of problem sizes and adjust the balance of work between the memory system and processor according to the expected workload of the application.
To illustrate, Table 4 lists the activation time, post-page processor time, and per-page Active-Page computation time for a number of application kernels in our workload. Using a simplified version of the formulas in Figure 7, which assumes constant values for these metrics, the number of pages required for complete overlap is computed. Furthermore, for each application, and for each data-point used to construct Figure 3, a predicted speedup is computed using these constant activation and computation times and a measured non-overlap time taken from Figure 4. The correlation between the speedup predicted by the analytical model and the actual speedup observed is shown in the rightmost column of Table 4. Most applications are well-correlated to the analytical model. A notable exception is the matrix-boeing application. This application violates the assumption of constant activation and computation times per Active Page. The times are inherently data-specific for this application, and using constant values proved to be less useful than for the other applications studied.
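The calculation described above can be sketched in Python under stated assumptions. The simplified formulas of Figure 7 are not reproduced here, so the overlap bound below is one possible reading: with constant TA, TC, and TP, and pages revisited in activation order, no stall occurs once (K - 1) * min(TA, TP) >= TC. The sketch also assumes the Figure 4 value is the stalled fraction of total kernel time. All function names and numbers are hypothetical.

```python
import math


def pages_for_complete_overlap(ta, tc, tp):
    # Assumed bound: smallest K with (K - 1) * min(ta, tp) >= tc.
    return 1 + math.ceil(tc / min(ta, tp))


def predicted_speedup(pages, conv_time, ta, tp, non_overlap_fraction):
    # Processor-busy time with constant per-page costs.
    busy = pages * (ta + tp)
    # Assumption: the measured stall is a fraction of total kernel time,
    # so total = busy / (1 - fraction).
    total = busy / (1.0 - non_overlap_fraction)
    return conv_time / total


def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length series.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Illustrative values only; none are taken from Table 4.
print(pages_for_complete_overlap(1.0, 5.0, 1.0))
print(predicted_speedup(10, 200.0, 1.0, 1.0, 0.5))
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Applying `pearson` to the predicted and observed speedups across the data-points of one application would give a per-application correlation of the kind reported in Table 4.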
Sensitivity to Technology
Our results for the RADram system demonstrate that Active Pages can be implemented with substantial success on a variety of applications. RADram technology, however, is a long-term goal which is several years in the future. Shorter-term and alternative long-term technologies can also be used to implement Active Pages. This section describes such technologies and analyzes the sensitivity of our results to some of the key parameters in the RADram system.
Current technologies exist to implement Active Pages at significantly higher cost than RADram. Such costs would limit the amount of memory available to support Active Pages and, consequently, the problem sizes of the applications. These technologies include: small merged FPGA-DRAM or SRAM chips, DRAM/SRAM macrocells in ASICs, and small processor-in-DRAM/SRAM chips. In general, logic speeds in these technologies are either equal to or better than RADram assumptions. Chip cost, however, will limit most near-term technologies to substantially smaller problem sizes. SRAM or multichip solutions will also have an effect on memory latencies.
We vary two technological parameters in our RADram simulations: memory latency and logic speed. First, Figure 8 plots speedup versus memory latency, expressed in terms of cache-miss penalty. In general, the performance advantage of RADram comes from in-DRAM computation, which is unaffected by cache-miss penalty. Cache effects, however, account for slight changes in both RADram and conventional system performance. These changes can result in either increases or decreases in speedup as cache-miss penalties increase. The sign of the slope depends upon the relative ratio of instruction cycles to memory stall cycles for the conventional versus the partitioned application. If one splits the total application runtime into two components, processor time and memory stall time, and computes the ratio of these two values for both the conventional and partitioned applications, then the slope of application speedup versus memory latency depicted in Figure 8 depends upon the relative size of these two ratios. Second, Figure 9 plots speedup versus the speed of the application-specific circuit. The speed of application-specific circuits in the simulated RADram system is measured in relative clock divisions of the processor clock. In Figure 9, a higher logic divisor corresponds to a slower reconfigurable logic clock.
To generalize across applications, those operating on problems in the scalable region of their partitioning domain are sensitive to the speed of the Active-Page computation, whereas those operating on problems in the saturated region of their partitioning domain are generally insensitive to the speed of the Active-Page computation.
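The dependence of the speedup slope on the two ratios can be made concrete with a small sketch. It assumes that runtime splits into a latency-independent part plus a stall part that grows linearly with the cache-miss penalty $L$. Speedup is then $S(L) = (a + bL)/(c + dL)$, and $dS/dL = (bc - ad)/(c + dL)^2$, whose sign is that of $b/a - d/c$. The function names and values below are hypothetical and are not taken from Figure 8.

```python
def speedup(conv_cpu, conv_misses, part_cpu, part_misses, miss_penalty):
    # Assumption: stall time = number of misses * miss penalty, for both systems.
    conv = conv_cpu + conv_misses * miss_penalty
    part = part_cpu + part_misses * miss_penalty
    return conv / part


def slope_sign(conv_cpu, conv_misses, part_cpu, part_misses):
    # Sign of d(speedup)/d(miss_penalty): the numerator of the derivative is
    # conv_misses * part_cpu - conv_cpu * part_misses.
    num = conv_misses * part_cpu - conv_cpu * part_misses
    return (num > 0) - (num < 0)


# Conventional version is relatively more stall-bound: speedup rises with latency.
print(slope_sign(100.0, 10.0, 20.0, 1.0))
# Partitioned version is relatively more stall-bound: speedup falls with latency.
print(slope_sign(100.0, 1.0, 20.0, 1.0))
```

Here `part_cpu` stands for all latency-independent time of the partitioned application, which under this simplification would include any wait on in-DRAM computation.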
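This generalization can be reproduced with a toy model, offered as a sketch rather than as the simulation behind Figure 9. It assumes identical pages with constant activation, computation, and post-processing times, that pages are post-processed in activation order, and that Active-Page computation time scales linearly with the logic divisor. The names `total_time` and `sweep` and all values are hypothetical.

```python
def total_time(pages, ta, tc, tp):
    """Processor-visible kernel time for `pages` identical Active Pages."""
    now = pages * ta  # all pages are activated in sequence first
    for i in range(pages):
        finish = (i + 1) * ta + tc  # page i starts right after its activation
        stall = max(0.0, finish - now)  # non-overlap time for page i
        now += stall + tp
    return now


def sweep(pages, ta, base_tc, tp, divisors):
    # Assumption: a logic divisor d slows Active-Page computation by a factor d.
    return {d: total_time(pages, ta, base_tc * d, tp) for d in divisors}


# Few pages (scalable region): a slower logic clock lengthens the kernel.
print(sweep(3, 1.0, 4.0, 1.0, [1, 2]))
# Many pages (saturated region): the processor is the bottleneck either way.
print(sweep(20, 1.0, 4.0, 1.0, [1, 2]))
```

In the saturated case the processor never stalls, so slowing the Active-Page logic within this range leaves the total unchanged.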
Related Work
The IRAM philosophy goes to the extreme by shifting all computation to the memory system through integration of a processor onto a DRAM chip. This results in dramatically improved DRAM bandwidth and latency to the processor core, but conventional processors are not designed to exploit these improvements [B+97a]. An interesting alternative is to integrate specialized logic into DRAM to perform operations such as Read-Modify-Write [B+97b]. This alternative is promising, but we have seen that different applications can exploit significantly different computations in the memory system. Our results have shown that integrating reconfigurable logic is highly effective.
Reconfigurable computing has shown considerable success at special-purpose applications [A+96, B+96], but has had difficulty competing with microprocessors on more general-purpose tasks such as floating-point arithmetic. Some groups focus upon building reconfigurable processors [HW97, WH96, RS94, WC96], but face an even more difficult competition with commodity microprocessors. Our approach avoids these difficulties by exploiting the strengths of both microprocessors and reconfigurable logic. We focus upon data manipulation to make the memory system perform better for the processor. DeHon described limited integration of reconfigurable logic and DRAM in an early memo [DeH95], but did not evaluate it further.
Our philosophy is reminiscent of scatter-gather engines from a long line of supercomputers [HT72, SH90, CG86, Bat74, EJ73, HS86, L+92]. Hockney and Jesshope [HJ88] give a good history of such machines. Our approach, however, supports a much wider variety of data manipulations and computations than these machines. Additionally, our emphasis on commodity technologies results in a focus on different applications and design tradeoffs.
Future Work
Active Pages and our RADram implementation have shown great potential in our study. Unlocking this potential involves many interesting issues, including: compiler support for automatic application partitioning, operating system integration, multi-threaded application support, complete application runtimes, application-specific circuits vs. data-primitives, hierarchical computation structures, and inter-page and inter-chip communication. In addition, a detailed power, yield, and hardware implementation study of RADram is required.
For Active Pages to become a successful commodity architecture, the application partitioning process must be automated. Current work uses hand-coded libraries which can be called from conventional code. Ideally, a compiler would take high-level source code and divide the computation into processor code and Active-Page functions, optimizing for memory bandwidth, synchronization, and parallelism to reduce execution time. This partitioning problem is very similar to that encountered in hardware-software co-design systems [GVNG94], which must divide code into pieces which run on general-purpose processors and pieces which are implemented by ASICs (Application-Specific Integrated Circuits). These systems estimate the performance of each line of code on alternative technologies, account for communication between components, and use integer programming or simulated annealing to minimize execution time and cost. Active Pages could use a similar approach, but would also need to borrow from parallelizing compiler technology [H+96] to produce data layouts and schedule computation within the memory system.
Integration of Active Pages with a real operating system poses new challenges. Active Pages are similar to both memory pages and parallel processors. Several open operating system issues exist, such as allocation policies, paging mechanisms, scheduling, and security. Of particular concern is the high cost of swapping Active Pages to and from disk. Current FPGA technologies take hundreds of milliseconds to reconfigure. New technologies, however, promise to reduce these times by several orders of magnitude [DeH96a]. Our future work will address these issues both formally and practically by clarifying the policy of interaction between an operating system and the Active-Page memory system, and by simulation of a modified operating system kernel such as Linux [Bee96]. In addition to operating system studies, multi-threaded application support will be investigated.
Future work shall address inter-page and inter-chip communication issues. Before mechanisms are formalized for inter-page communication, a detailed evaluation of inter-page communication requirements is required. This evaluation must study whether inter-page communication is required by a broad class of application domains and, if so, whether it should be simulated via processor intervention or implemented with dedicated hardware support. Along with inter-page and inter-chip communication, a study of inter-page synchronization primitives is required. Such primitives, if implemented in hardware, pose additional challenges.
Finally, further evaluation of application kernels is required. Instruction sets such as MMX codify a set of data-manipulation primitives for a certain application domain. Further study of data-manipulation primitives could distill a common base set of primitives for a broad set of application domains. If such primitives exist, hybrids of the RADram implementation should be investigated.
Conclusion
Active Pages provide a general model of computation to exploit the coming wave of technologies for intelligent memory. Active Pages are designed to leverage existing memory interfaces and integrate well with commodity microprocessors. In fact, a primary goal of Active Pages is to provide microprocessors with enough useful data to run at peak speeds.
Our RADram implementation of Active Pages achieves substantial speedups when compared to conventional memory systems. RADram provides a large number of simple, reconfigurable computational elements which can achieve speedups of up to 1000 times over conventional systems. This high performance, coupled with low cost through high chip yield, makes RADram a highly promising architecture for future memory systems.
