Abstract
This paper discusses an approach to reducing memory latency in future systems. It focuses on systems where a single-chip DRAM/processor will not be feasible even in 10 years, e.g. systems requiring a large memory and/or many CPU's. In such systems a solution needs to be found to DRAM latency and bandwidth as well as to inter-chip communication. Utilizing the projected advances in chip I/O bandwidth, we propose to implement a decoupled access-execute processor where the access processor is placed in memory. A program is compiled to run as a computational process and several access processes, with the latter executing in the DRAM processors. Instruction set extensions are discussed to support this paradigm. Using multi-level branch prediction, the access processor stays ahead of the execute processor and keeps the latter supplied with data. The system reduces latency by moving address computation to memory, so that the computational processor does not have to send addresses to memory. This and the fetch-ahead capabilities of the access processor are combined with multiple DRAM "streaming" to improve performance. DRAM caching is assumed to be used to assist in this as well.

*This work was supported in part by the University of Illi-

Introduction

Memory access is a major problem in high-performance architectures. While processor speed has grown steadily with continuing advances in VLSI technology and attention to circuit design, DRAM speed has not kept pace. Both memory latency and bandwidth remain major problems in today's systems, uni- and multiprocessor alike. Recent developments, such as RAMBUS DRAM and its bus protocol [1], point to a solution for the problem of DRAM and bus bandwidth, but more remains to be done to support higher ILP. The so-called intelligent or active memory RAM architectures [2, 3, 4, 5] have taken a different approach to the problem by avoiding off-chip access and exploiting on-chip access parallelism. Latency, however, remains a problem in both of these cases since they do not change the DRAM core organization.
Thus both uni- and multiprocessor systems continue to rely on a memory hierarchy to solve the memory access problem. However, the cost of cache misses is becoming prohibitive. RAMBUS and synchronous DRAM's utilize a form of on-chip caching and even more aggressive approaches have been proposed [3]. Latency hiding techniques, primarily prefetching [6, 7, 8], also have been utilized to help solve the problem.
Projected advances in VLSI and packaging technology are expected to make the problem much worse in the near future. The number of gates on a chip is projected to reach 20M in 10 years, while a DRAM can contain 2 GBytes of data. This increase comes from growth in die size, which increases interconnect distances on chip, and from feature size reduction. These lead to an increase in both on-chip interconnect length and propagation delays. The frequency of processor or logic ICs is expected to rise to 2GHz with up to 2 x 10^3 I/O's per chip. A chip of this size with a clock of 0.5ns may require 20 to 30 clocks to cross.
Given these semiconductor capabilities, the approach of combining a CPU with memory on the same chip seems to offer hope for a system solution, although the latency problem still remains. This solution, however, will not be sufficient for two important classes of future systems: systems requiring multiple DRAM's, and super-CPUs with extremely high ILP that cannot be integrated with memory on one chip. A solution for multi-chip systems assuming an integrated processor/memory IC has been proposed [10], relying on redundant use of hardware and a cache coherence mechanism. This paper proposes a different multi-chip solution which can be applied to both uni- and multiprocessor systems. It assumes that development of super-CPUs will continue in the future in an attempt to exploit higher ILP [12, ...].

The approach advocated here helps increase the DRAM bandwidth and reduce latency through decoupled access DRAM architecture and compilation. It is accomplished by placing a simple, small integer processor (or processors) on each DRAM to help generate enough memory requests to fully utilize the DRAM bandwidth. It is assumed that more external chip bandwidth to support this will be available within 10 years and that DRAM organization will change to provide more random-access internal bandwidth. Finally, caches are assumed to migrate to memory as well. The performance improvement is achieved by fetch-ahead in the memory processor, by eliminating address transmission to memory, by building a cache hierarchy into the DRAM's, and by using multiple DRAM's simultaneously to send data to a single CPU. The resulting system is already a multiprocessor, albeit heterogeneous, and can be further extended to utilize multiple computational processors.
Decoupled Access-Execute Mechanism
Consider a uniprocessor system that consists of a processor, multiple DRAM's and interconnect. Traditionally, a processor in such a system performs loads and stores. The latter do not usually slow down the processor, as stores can be cached, buffered, and retired later. Loads, on the other hand, can stall the processor if data is needed and are a major cause of performance degradation. A problem with loads is that a processor needs to send the address to memory and then wait for memory to respond with data.
A Decoupled Access-Execute approach has been proposed and implemented [15, 13, 9, 14] to reduce the wait. The idea was to allow memory access instructions to be executed by a dedicated processor which communicates with a computational processor through a set of queues. The queues allow asynchrony in execution. Dynamically scheduled processors using Tomasulo's algorithm [17] implement a version of this approach with tagged queues. The details of organization and whether or not the queues are exposed to the user/compiler vary. A block diagram of such a decoupled processor is shown in Fig. 1.
In all cases the system is viewed as executing a single program which gets split by hardware into access and execute instructions. Another issue for both decoupled access processors and prefetching is conditional branch resolution. Since either processor can perform the computation resolving a branch condition, this information needs to be communicated to the other processor. Communication queues and a Branch-on-Queue instruction have been proposed to solve this problem (E2A BrQ and A2E BrQ in Fig. 1). More recently, branch prediction was utilized as well.
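The queue-based decoupling described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name, the `queue_depth` parameter, and the one-operand-per-step execute unit are hypothetical and are not taken from any of the cited designs. The access side runs ahead until the queue is full; the execute side consumes operands without ever computing an address.

```python
from collections import deque


def run_decoupled(memory, addresses, queue_depth=4):
    """Simulate an access unit feeding an execute unit through a FIFO.

    Returns (total, max_lead), where max_lead is the largest number of
    operands the access unit had queued ahead of the execute unit.
    """
    load_queue = deque()      # access-to-execute data queue (hypothetical name)
    pending = iter(addresses)
    total, max_lead, done = 0, 0, False
    while not done or load_queue:
        # Access side: run ahead until the queue is full or addresses run out.
        while not done and len(load_queue) < queue_depth:
            try:
                load_queue.append(memory[next(pending)])
            except StopIteration:
                done = True
        max_lead = max(max_lead, len(load_queue))
        # Execute side: consume one operand per step; it never sees an address.
        if load_queue:
            total += load_queue.popleft()
    return total, max_lead
```

In this model a deeper queue lets the access side get further ahead, which is the software analogue of the asynchrony the hardware queues provide.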
Another approach allowing some decoupling is prefetching, which attempts to predict the future addresses and initiate a memory read before the actual load is executed. Hardware, software, and integrated methods for doing it have been proposed. The hardware approach relies on a separate hardware engine to generate the addresses and a cache to store the arriving data. The CPU still executes all address computations and issues loads. Branch prediction remains a problem. So far prefetching has been typically confined to the processor or L1 cache. Moving it higher in the memory hierarchy is more difficult, although not impossible [19].
Decoupled Access Mechanism for DRAM's
The approach proposed here decouples uniprocessor computation and data access by allocating them to a computational processor (ComP) and one or more memory processors (MemP), respectively. It uses a compiler to generate code for a program in such a way that address computations and memory access are done by the memory processor. In some cases data computed by ComP may be required for address computations and will have to be sent to MemP. The MemP can be an integer processor. The computational processor is a complete, high-ILP processor. We assume that it does not execute memory loads. Rather it expects the data to be sent by the memory processor and fetches data from its queues. This is not the only possible implementation, but it is pursued here as the closest to the original ideas of decoupled execution. A block diagram of the decoupled DRAM system is shown in Fig. 2.
The MemP is capable of executing all integer instructions of the ComP and will execute address computations, control flow, etc. of a user program as well as the operating system. In addition, the instruction set is augmented with Send and Receive instructions for communication between the processors, including MemP-to-MemP communication. Other instruction set extensions are used as well and are discussed in more detail later in the paper. Examples illustrating the generated code are presented below.
Figure 2. Decoupled Access DRAM System (block diagram: DRAM's, interconnect, receive queues, computational processor)
The MemP attempts to run ahead of the ComP as is typical of decoupled access architecture. In many cases the MemP code can be generated in such a way that it can execute independently of the ComP and thus provide fetch-ahead capabilities. In other cases the code will require data computed in the ComP for conditional branches which will determine subsequent memory accesses. To solve the problem of conditional branching some code is replicated in both MemP and ComP, e.g. "static" loops. When data-dependent branching is present branch prediction is used. We propose to use multi-level branch prediction [20] to allow the MemP to advance without waiting for data from the ComP to resolve the branch condition. MemP can thus execute its program ahead of the ComP. Both may need to exchange branch information expected by the other. A roll-back mechanism is provided to undo misprediction. Data that is sent using a mispredicted branch is identified through tagging and not used by the receiver.
Multiple DRAM's can send multiple data items to the ComP every cycle and they can arrive out of order. The ComP contains queues to receive and reorder load data and to buffer outgoing stores. The ComP tags the stores and checks tags on incoming load data. The MemP tags the loads (Sends) and receives and checks the store tags. The tag contains branch path information.
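The receive-side filtering and reordering just described can be sketched as follows. This is an illustrative model, not the hardware mechanism: the `Tagged` record, its `seq` field (a hypothetical sequence number used for reordering), and the encoding of the branch path as an integer bit vector are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tagged:
    seq: int          # hypothetical sequence number used to restore order
    branch_path: int  # bit vector of the branch outcomes the sender assumed
    data: float


def accept(items, actual_path):
    """Drop data sent along a mispredicted path, then restore program order.

    Items may arrive out of order from several DRAM's; only those whose
    branch-path tag matches the path actually taken are used.
    """
    good = [t for t in items if t.branch_path == actual_path]
    return [t.data for t in sorted(good, key=lambda t: t.seq)]
```

For example, an item tagged with path 0b11 is ignored when the receiver actually followed path 0b10, which mirrors the statement that mispredicted data is identified through tagging and not used.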
Internally, a ComP contains an instruction cache but no data cache. The use of multiple DRAM's is a major source of increased memory bandwidth and parallelism in accessing it. To make this a more powerful system, MemP's in each DRAM are allowed to communicate directly in accessing the data. Thus multiple data streams can be automatically sent to the MemP.
High chip I/O and interconnect bandwidth is assumed to be available for this. In order to send and receive multiple words to/from DRAM's each cycle, queues are partitioned. The decoupled nature of memory access and its exposure to the compiler are expected to allow a much higher memory request rate than a typical high-performance processor can achieve today. Each MemP can run a different program, allowing even more flexibility.
To summarize, the decoupled DRAM approach has the following advantages:
1. It can eliminate the need for the main CPU to issue loads to memory.
2. It allows a significant lookahead in executing loads.
3. It simplifies the CPU address computation and partly eliminates it.
4. It allows address translation to be moved to memory.
5. It can allow the operating system execution in memory instead of ComP, which may help the well-known poor memory hierarchy behavior of the OS.
6. It allows a large part of the memory hierarchy to be moved into the DRAM's.
Multiprocessing
So far only a uniprocessor has been considered. However, the system using decoupled-access DRAM's is obviously a multiprocessor and supports Sends and Receives among processors. Thus one can say that by connecting multiple ComP's to multiple DRAM's and their MemP's one gets a multi-computer. Data layout optimization and programming to exploit it are going to be required for both uni- and multiprocessor systems. The major difference is direct support for shared memory, which is carried from the decoupled access DRAM uniprocessor to the multiprocessor system model. In fact, remote memory loads and stores also need to be supported, making the DRAM and the ComP excellent MP building blocks.
Finally, the programming model for a uniprocessor using decoupled access DRAM's remains unchanged from a standard uniprocessor. But it requires a compiler to convert the program to a parallel decoupled access program. So compiling even for a uniprocessor requires MP concepts to be employed. This makes the unifying system model a multiprocessor.
Multiple decoupled access DRAM's are used in the uniprocessor case to provide sufficient bandwidth to the ComP. Use of multiple ComP's in a uni- or multiprocessor organization will require more memory bandwidth. In both cases, achieving this may require multiple MemP's to be placed in each DRAM. Other ways of increasing the DRAM bandwidth may need to be utilized as well. Finally, the latency may still remain too high if sufficient lookahead cannot be achieved. These issues are discussed in the next section.
High-bandwidth, low latency DRAM
The use of a memory processor MemP leads to latency reduction due to shortened address transmission time and fetch-ahead. To support it, a DRAM needs to provide more internal and off-chip bandwidth and to reduce the latency of accessing the memory. The projected increase in the number of package I/Os and their operating frequency will provide the off-chip bandwidth. This has already been demonstrated by RAMBUS DRAM [1] and will advance even further in spite of many technical hurdles, such as power dissipation, noise immunity, etc. The MemP will help better utilize a large number of DRAM I/O's.
Internally to the DRAM plenty of bandwidth potential exists but modifications are required to extract it. For instance, the RDRAM design required a wider column pitch to increase the access bandwidth. Some existing DRAM's already utilize a row of sense amplifiers as a latch for temporary data storage. Much discussion has taken place about the use of these latches to form a cache. These latches are neither large enough nor sufficiently fast to act as a cache for a 1 to 2GB DRAM. Current high-performance workstations use 2-4MB of cache for 256MB of memory. Mitsubishi CDRAM used a different approach and implemented a 16Kbit cache in a 4Mbit DRAM. This required very little extra real estate, about 7%, and provided a much better cache, albeit still too small for a large memory system.
In this paper, a traditional multi-bank DRAM organization is assumed, each bank possibly operating completely independently to increase parallelism. By allocating more real estate, each bank may be provided with a large enough cache to satisfy a large percentage of accesses. In addition, a large memory cache is assumed at DRAM I/Os. Given a high degree of on-chip DRAM interleaving via multiple independent banks and caching, the MemP needs to be able to utilize it. This may require the use of several MemPs.
We have shown in the past that memory caches are very effective in multiprocessors [21] due to the lack of the coherence problem. The decoupled architecture makes it even more advantageous to move caches to memory and rely on them to further reduce DRAM latency. An important question is the organization of the memory cache. We see the role of the memory cache in hiding the increased on-chip interconnect latency. Thus some form of geometrically distributed cache hierarchy may be necessary to avoid long-distance propagation on chip. It will consist of a large L1 cache for MemP, which is also the main memory cache, followed by smaller, sub-array caches distributed over the DRAM chip. A sense amplifier cache can still be used as the last level before memory.
Details of the organization
To support decoupled execution the following architectural issues have to be addressed and are discussed in this section:
1. Branch prediction
2. Remote memory access
3. Inter-processor Send and Receive instructions
4. Other instruction set extensions
5. A tagging mechanism for load and store data
6. Flow control and synchronization of data interchange
It is assumed that the combined multiple DRAM storage capacity forms the total shared physical address space of the system. Some storage is reserved as private in each DRAM. Let us assume for simplicity that each DRAM contains the complete user program as well as the operating system. These are stored in the private space of each DRAM. The number of DRAM's in the system and each DRAM's id are available to its MemP. The address space is block interleaved across DRAM's, with block size smaller than a page. This allows a DRAM id number to be extracted from a virtual address without translation. All translation is assumed to be performed in memory and each DRAM is responsible for translation of all its addresses.
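To make the interleaving argument concrete, the following sketch shows one way a DRAM id could be derived from an address. All constants and function names are assumptions chosen for illustration. In particular, the sketch assumes that the block size times the number of DRAM's divides the page size, so that the id bits lie entirely within the page offset and are therefore unchanged by translation.

```python
BLOCK_SIZE = 256   # bytes per interleaving block (assumed, smaller than a page)
NUM_DRAMS = 8      # number of DRAM's in the system (assumed)
PAGE_SIZE = 4096   # bytes per page (assumed multiple of BLOCK_SIZE * NUM_DRAMS)


def dram_id(addr: int) -> int:
    """DRAM that owns addr when blocks are interleaved round-robin."""
    return (addr // BLOCK_SIZE) % NUM_DRAMS


def translate(vaddr: int, page_table: dict) -> int:
    """Toy page-table translation: replace the page number, keep the offset."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset
```

Under these assumptions `dram_id(vaddr)` and `dram_id(translate(vaddr, page_table))` always agree, so a request can be routed to the owning DRAM first and translated there.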
Loads and stores are processed asymmetrically by the ComP. The load data is expected to come from architecturally visible queues without address computation and memory access. Store addresses are computed explicitly and stores are buffered in the send queue. The reason for treating stores this way is to obtain the address and, from it, the DRAM id used to send data. To assist decoupled processing by separate MemP and ComP, both loads and stores are tagged with a value number. For example, the induction variable or the value of a compiler-specified variable can be used as the value number. The value number also includes N bits of branch history to precisely identify data.
Operand queues
Each processor contains a number of queues for storing/sending and loading/receiving data. The two types of queues are organized differently. A load queue is assumed to be organized as a small, associative cache with a tag to assist in operand reordering. It is assumed there is a load queue for each register ri accessible by LDri. A store queue is logically a single queue.
Data Communication
MemP and ComP can issue tagged Send and Recv instructions to communicate. The tag is appended on a Send and checked on a Recv. Send is non-blocking while Recv is a blocking operation. A Send/Recv pair uses the same register and thus the same queue. A Send can be cancelled if it uses an invalid register. Recv searches the input queue using the expected value number. The instruction formats are as follows:
Recv ri
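The Send/Recv semantics can be modelled in software roughly as follows. This is a sketch, not the instruction definition: the class and method names are hypothetical, each register's queue is modelled as a dictionary keyed by value number, and a blocking Recv is represented by returning a "not ready" result at the point where real hardware would stall.

```python
class OperandQueues:
    """Per-register receive queues searched associatively by value number."""

    def __init__(self):
        self._queues = {}  # register name -> {value_number: data}

    def send(self, reg, value_number, data):
        """Non-blocking: deposit tagged data in the queue for reg."""
        self._queues.setdefault(reg, {})[value_number] = data

    def recv(self, reg, value_number):
        """Look up the expected value number in the queue for reg.

        Returns (True, data) and removes the entry if it has arrived,
        or (False, None) where a real Recv would block.
        """
        queue = self._queues.get(reg, {})
        if value_number in queue:
            return True, queue.pop(value_number)
        return False, None
```

Because the lookup is by tag rather than by arrival order, data arriving out of order from different DRAM's is still delivered to the matching Recv.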
Remote Memory Access
MemP and ComP can issue tagged remote load (LD) and store (ST) instructions. These remote LD/ST instructions are routed to the requested DRAM and translated there, and memory is then accessed. The requested DRAM id comes from the address in register rj. The remote memory request specifies the node and register to which the load data is returned. The destination node "dest" is an extra specifier in the instruction:
LD-r ri, (rj), dest
A load request also carries the requester's tag. As mentioned above, the ComP program does not utilize loads directly. Instead these are used for MemP remote memory access, but by specifying a ComP as a destination, indirect addressing can be sped up. LD-r is equivalent to a "standard" load followed by a Send.
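The equivalence between LD-r and a load followed by a Send can be sketched as below. The function signature, the list-of-dictionaries memory model, and the block size are assumptions for illustration only, and address translation at the owning DRAM is omitted for brevity.

```python
def ld_r(drams, queues, rd, addr, dest, tag, block_size=256):
    """Remote load: route to the owning DRAM, read there, deliver to dest.

    drams  : list of dicts, one per DRAM, mapping address -> data
    queues : dict mapping (node, register) -> list of (tag, data)
    Returns the id of the DRAM that serviced the request.
    """
    owner = (addr // block_size) % len(drams)  # DRAM id taken from the address
    data = drams[owner][addr]                  # the "standard" load at the owner
    # ... followed by a Send of the data, carrying the requester's tag.
    queues.setdefault((dest, rd), []).append((tag, data))
    return owner
```

When `dest` names a ComP, the data goes straight from the owning DRAM to the consumer, which is how the indirect-addressing case avoids an extra hop through the requesting MemP.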
Local Memory Access
In addition to standard LD/ST instructions a "local-only" version is provided,
LD-l ri, offset(rj)
This instruction computes the address and only issues a load or a store if the address is local to this node. Otherwise the load is cancelled and the register ri is marked invalid. Any instruction using an invalid input register is cancelled as well and its destination register marked invalid. Any of the load or Recv instructions described above can make a register valid.
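The cancellation rule for local-only loads can be modelled with a sentinel value that propagates through dependent instructions. This is a minimal sketch: the `MemP` class, the `INVALID` marker, the `add` instruction, and the interleaving parameters are hypothetical stand-ins, not the proposed hardware.

```python
INVALID = object()  # marker for an invalid register (hypothetical encoding)


class MemP:
    def __init__(self, node_id, num_nodes, memory, block_size=256):
        self.node_id, self.num_nodes = node_id, num_nodes
        self.memory, self.block_size = memory, block_size
        self.regs = {}  # registers never written read as INVALID

    def is_local(self, addr):
        return (addr // self.block_size) % self.num_nodes == self.node_id

    def ld_l(self, rd, base_reg, offset):
        """Local-only load: issue only if the address belongs to this node."""
        base = self.regs.get(base_reg, INVALID)
        if base is INVALID or not self.is_local(base + offset):
            self.regs[rd] = INVALID  # load cancelled
        else:
            self.regs[rd] = self.memory[base + offset]

    def add(self, rd, ra, rb):
        """Any instruction with an invalid input invalidates its destination."""
        a = self.regs.get(ra, INVALID)
        b = self.regs.get(rb, INVALID)
        self.regs[rd] = INVALID if (a is INVALID or b is INVALID) else a + b
```

A chain of dependent instructions on a non-owning node therefore collapses into cancelled operations, so each MemP can run the full access program while touching only its own data.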
Conditional Branches
The MemP may require a data value from ComP to resolve a conditional branch. In such a case a compiler inserts a new instruction, BRanch-on-Prediction, in the MemP code. The ComP sends n bits of its branch history as they are accumulated. MemP also keeps track of its branch history and compares it with the history bits received from ComP. In case of a discrepancy the MemP rolls back the execution, updates its predictor and history register, and re-starts execution. Any data sent by the MemP during execution of the mispredicted path is detected by the ComP when it checks the Recv queue tag.
Standard conditional branches are also executed by both processors. It is assumed that ComP does not execute Branch-on-Prediction but rather receives the data and does its own testing.
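The history comparison and roll-back step can be sketched as follows. The checkpoint list and both function names are assumptions introduced for illustration; the paper does not specify how the roll-back state is kept.

```python
def first_mismatch(predicted, actual):
    """Index of the first branch where the MemP's predicted history differs
    from the history bits received from the ComP, or None if they agree on
    every branch the ComP has resolved so far."""
    for i, (p, a) in enumerate(zip(predicted, actual)):
        if p != a:
            return i
    return None


def reconcile(predicted, actual, checkpoints):
    """Return (resume_state, history) after comparing the two histories.

    checkpoints[i] is the (hypothetical) MemP state saved just before
    branch i. On a mismatch the MemP resumes from that state with the
    ComP's outcome for branch i recorded in its history.
    """
    i = first_mismatch(predicted, actual)
    if i is None:
        return None, list(predicted)  # no roll-back needed
    return checkpoints[i], list(actual[: i + 1])
```

Note that the MemP may have predicted more branches than the ComP has resolved; only the common prefix is compared, which is what allows the MemP to keep running ahead.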
Synchronization
Every so often it may be necessary to synchronize all processors to know that they all have a consistent state with respect to either data or conditional statements. A barrier instruction BAR is provided for this purpose. The memory consistency model relies on the barrier and can be either strong or relaxed. BAR can be used to trigger synchronization and backtracking in case of branch misprediction.
Data Dependences
With decoupled execution one has to check RAW and WAR dependencies. WAW will be guaranteed by hardware. Normally this is not hard to do, but without load addresses it is more difficult. Value tags will be used instead and the send queue checked on reads. The checking will continue all the way to the MemP and its queues.
Caches
MemP uses a standard memory hierarchy, but for ComP tagged decoupled access makes the use of caches more difficult. ComP will have an instruction cache. The tagged queues act a bit like a data cache. If a true data cache is desired it can be tagged with a value number and branch history rather than address. The address must be present in the cache to implement a write-back policy.
Compilation
The instruction set extensions and their semantics for decoupled processing define the compilation approach. In addition, a programming model needs to be defined, especially since multiple processors now cooperate in executing the single program. We discuss it below assuming a "uniprocessor" case, i.e. one ComP and multiple MemP's. Partitioning the program into access and execute programs for this architecture is shown through examples in the next section.
It is assumed that any computation involving variable referencing is pulled into a MemP program. The ComP program loads are replaced with queue access while stores are left as they were. Most simple control structures such as loops with index variables are duplicated in both places and necessary data is sent by a MemP to ComP. We are not concerned with this redundant integer computation since integer processing is relatively inexpensive. The ComP must compute the address of stores to determine to which DRAM the data must be sent. A MemP computes the address for loads that are usually independent of data computations, using data in memory. But occasionally, a ComP and a MemP may have to exchange data.
For simplicity, the examples in this paper assume that both MemP and ComP programs are extracted from the same "original program" and keep the same names, including register names, identical in both. While this places some restrictions on code generation and optimization for each type of processor, it is a valid initial assumption. Each MemP is assumed to execute a complete memory access program but access only local data. This is similar to what is known in the parallel community as the SPMD model of computation. In this case it is further assisted by the fact that a complete program exists in each node. Thus each MemP has access to symbol table information without a need for communication. Each MemP executes the program but only accesses data stored in its memory. This it can determine from the virtual address. Data stored locally is fetched and sent to the ComP with an appropriate tag. If a MemP finds it does not have the data locally it skips the computation. In some cases complicated indirection is used to address data needed by ComP. This can mean that data accessed locally is used to compute an address of data needed by the
Examples
In this section, we consider two simple loops. They are hand-compiled with the instructions described earlier, under the SPMD programming model. Both MemP and ComP codes are shown. A comparison with a "standard" uniprocessor execution is made and qualitative performance differences discussed.
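As a rough software analogue of this kind of partitioning, the sketch below splits a simple vector loop computing a * x[i] + y[i] into an access program and an execute program. It is an illustration written for this purpose, not the hand-compiled code of the examples: the function names, the element-granularity interleaving constant `BLOCK`, and the dictionary queues keyed by the induction variable are all assumptions.

```python
BLOCK = 4  # elements per interleaving block (assumed)


def owner(i, num_memp):
    """MemP that holds element i under block interleaving."""
    return (i // BLOCK) % num_memp


def memp_program(node, num_memp, x, y, n, queues):
    """Access program: every MemP runs the whole loop (SPMD) but sends
    only the elements stored locally, tagged with value number i."""
    for i in range(n):
        if owner(i, num_memp) == node:
            queues["x"][i] = x[i]
            queues["y"][i] = y[i]


def comp_program(a, n, queues):
    """Execute program: the loop is duplicated and each load is replaced
    by a tagged queue read. No addresses are computed for the operands."""
    return [a * queues["x"].pop(i) + queues["y"].pop(i) for i in range(n)]


def run(a, x, y, num_memp=2):
    n = len(x)
    queues = {"x": {}, "y": {}}
    for node in range(num_memp):
        memp_program(node, num_memp, x, y, n, queues)
    return comp_program(a, n, queues)
```

The result is independent of the number of MemP's, since the tags, not the arrival order, determine which operand each iteration consumes. Stores are not modelled here; in the scheme described above the ComP would compute their addresses itself.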
Example 1: Vector Loop
Let us start with a vector code without conditional statements. Consider the SAXPY loop below. The simplified assembly code followed by the decoupled 
Example 2: Sparse Matrix Times Dense Vector
This is a simplified sparse matrix times a dense vector kernel. It assumes consecutive storage for the rows of the matrix. It is used to illustrate the case when MemP to MemP communication can be used to improve performance. The multiply code is shown below with its storage declarations:

To compare performance with a standard processor, note that here all MemP's quickly get through the iteration and move on to the next one, except for the two per iteration that actually access first their local and then remote memory. However, with the SPMD model it is hard to expect more parallelism in memory access. This may not be sufficient to keep the processor busy.
The inner loop can be re-written as a FORALL and "tuned" to execute N / P iterations per processor. In this case, however, it is hard to compute the addresses of only those items residing in a given memory. MemP to MemP communication would be required but will result in more parallelism and better memory utilization. The resulting code is shown next and has the potential to exploit all the available memory bandwidth. For simplicity, the required changes in index variable and address computations are not shown. As 
Conclusions
This paper presented a possible approach to increasing memory bandwidth and reducing latency by using a decoupled access-execute paradigm. The major innovation comes from moving the access processor to memory and compiling programs to execute as separate but communicating processes in memory and the computation processor. The need for such a system comes from two sources: systems requiring multiple DRAM's and Super-CPU's with extremely high ILP that cannot be integrated with memory on one chip. The system we propose is an evolutionary step from separate processor-memory architectures towards an integrated processor-memory system. It takes the idea of decoupled access-execute architectures and partitions the system to place a simple access processor and cache in memory. Eventually it may be possible to put the ComP in memory as well, allowing the approach proposed by Goodman et al. for multi-chip systems. The broadcast involved in this solution is its main drawback and perhaps a hybrid approach is possible. It is argued that, given sufficient memory and communication bandwidth, the system described here can reduce latency by aggressively streaming data from multiple memories to a computational processor.
Another way to reduce latency is via DRAM caching, which was briefly discussed. Overall, the idea of a memory cache is very appealing. The qualitative discussion of performance potential showed this to be a promising approach.
Finally, one can observe a similarity between prefetching and decoupled access. Prefetching engines, even the explicitly programmable ones, are simpler than a decoupled processor. They can also be added to a standard system without the major change in the processor architecture required by the decoupled system described here. They can also peacefully coexist with the CPU caches. Thus an intermediate solution may be possible which adds an even simpler MemP to each DRAM but leaves the computational processor and its program largely the same. We are currently exploring this possibility. ...
...
