This paper presents an algorithm to automatically map code on a generic intelligent memory system that consists of a host processor and a simpler memory processor. To achieve high performance with this type of architecture, code needs to be partitioned and scheduled such that each section is assigned to the processor on which it runs most efficiently. In addition, the two processors should overlap their execution as much as possible.
INTRODUCTION
Integrating substantial processing power and a sizable memory on a single chip can potentially deliver high performance by enabling low-latency and high-bandwidth communication between processor and memory. This type of architecture, which is popularly known as intelligent memory or processor in memory, has been recently proposed for many systems [8, 12, 13 In this second class of systems, we have a heterogeneous mix of processors: host and memory processors. A host processor is more powerful, is backed up by a deep cache hierarchy, and suffers a high latency to access memory. A memory processor is typically less powerful, has a lower memory latency and, at least in theory, is significantly cheaper. The question that we address in this paper is: how do we automatically program these systems?
In previous work on these systems [5, 8, 12, 19], the programmer is expected to identify and isolate the code sections to run on the memory processors. This process is time consuming and error prone. Furthermore, in our experience, visual inspection of the code may not reveal much about which processor is best at running a given code section. In addition, previous work has largely focused 'This work was supported in part by the cessors. This approach is often not much different from running code on a parallel processor.
Our goal, instead, is to automatically partition the code into homogeneous sections and then schedule each section on its most suitable processor, while maximizing host and memory execution overlap. No knowledge of the code should be assumed from the user.
To this end, this paper presents an algorithm embedded in a real compiler that automatically maps code to a system with both host and memory processing. To simplify the analysis, we only generate code for an architecture with a single host and a single memory processor. Using a set of standard applications and a software simulator for the architecture, we show average speedups of 1.7 for numerical applications and 1.2 for non-numerical applications over a single host with plain memory. These speedups are similar to, and often higher than, the ideal speedups on a more expensive multiprocessor system composed of two identical host processors. Overall, our work shows that heterogeneity can be cost-effectively exploited in an automated manner. It represents one step toward effectively mapping code on intelligent memory systems.
The rest of the paper is organized as follows: Section 2 overviews the intelligent memory architecture used; Section 3 presents our algorithm; Section 4 describes the evaluation environment; Section 5 evaluates the algorithm; and Section 6 discusses related work.
INTELLIGENT MEMORY SYSTEM
For this work, we assume a server with a memory system enhanced with processing power. The machine has two types of processors: the off-the-shelf processors that come with ordinary servers (P.hosts) and the processors in the memory system (P.mems). To simplify the analysis, in this work we use a simpler architecture with only one processor of each type (Figure 1-(a)). Supporting multiple processors of each type requires extending the techniques that we will present, possibly by augmenting them with conventional parallelization techniques. Typically, P.host is a wide-issue superscalar with a deep cache hierarchy and a high memory access latency, while P.mem is a simple, narrow-issue superscalar with only a small cache and a low memory access latency. Consequently, to run efficiently, computing-bound code sections should be run on P.host, while memory-bound sections should run on P.mem.
To reduce the cost of the system, a restriction that we impose is that the chips for the host and memory processors are connected with an off-the-shelf interconnection. As a result, only P.host can be the master of the interconnection and initiate transactions; P.mem cannot initiate interconnection transactions. In addition, there is no hardware support to ensure coherence between the P.host and P.mem caches.
We assume, however, some simple support in P.mem's cache that makes programming easier (Figure 1-(b)). Specifically, when P.host writes back a line to memory, P.mem's cache is automatically updated if it contains a copy of the line. In addition, when P.host requests a line from the memory, P.mem's cache overwrites the returning data if it has a copy of the line.
For P.host and P.mem to execute an application concurrently, we need to support synchronization and data coherence correctly. For synchronization, since P.mem cannot be master of the interconnection, the most inexpensive scheme is to poll a special, uncachable memory location. Whichever processor arrives first sets the location and keeps polling until the other processor arrives. Depending on how sophisticated the memory controller is, the controller can offload the spinning from the P.host. Ensuring data coherence is harder. We consider it next.
Data Coherence
Since some sections of the code execute on P.host while others on P.mem, the data in the caches will become incoherent in the course of execution. To avoid incorrect execution, we must ensure that when a processor accesses a variable, it gets the latest version of it. This can be ensured in different ways, depending on the support available. For this work, we assume very simple support, namely that P.host can issue write-back and invalidation commands to control its caches. We use this support as follows:
Before P.mem starts executing a section of code, P.host writes back to memory all the dirty lines in its caches that P.mem may read or only-partially modify in that section. The partial-modification condition is necessary in case the two processors write to different words of the same line. Recall that, as a line is written back, it updates P.mem's cache if the latter has an old copy of the line. This support ensures that P.mem sees the latest versions of data.
Before P.host starts executing a section of code, P.host invalidates from its caches all the lines that P.mem may have updated in the previous section. This support ensures that P.host sees the latest versions of data: if P.host re-references lines written by P.mem, it will miss in its cache.
Therefore, in the general case, transferring execution from P.host to P.mem and then back to P.host induces three overheads on P.host: writing back some cache lines to memory, invalidating some cache lines and, later on, potentially missing in the caches to reload the invalidated lines.
In reality, the cost of these operations heavily depends on the quality of the compiler. The compiler can minimize the cost by performing careful dependence analysis to write back and invalidate only the strictly-necessary cache lines. It can schedule the write-backs in advance and in a gradual manner, to avoid bunching them up right before execution is transferred to P.mem. It can schedule the invalidations when P.host is idle waiting for P.mem. Finally, it can also insert prefetches to reload the data in P.host's caches in advance, but only when it is safe to do so.
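The two coherence rules above can be expressed as set operations on cache-line addresses. The following is a minimal sketch, assuming that the compiler's dependence analysis has already produced conservative sets of lines; the function name and all parameter names are hypothetical and do not come from our implementation.

```python
def coherence_actions(host_dirty, host_cached, mem_reads,
                      mem_partial_writes, mem_writes):
    """Return (writeback, invalidate) sets of cache-line addresses.

    All arguments are sets of line addresses (assumed inputs):
      host_dirty         -- lines that are dirty in P.host's caches
      host_cached        -- lines present in P.host's caches
      mem_reads          -- lines P.mem may read in the next section
      mem_partial_writes -- lines P.mem may only partially modify
      mem_writes         -- lines P.mem may update in the section
    """
    # Before P.mem runs: write back dirty host lines that P.mem may
    # read or only partially modify.
    writeback = host_dirty & (mem_reads | mem_partial_writes)
    # Before P.host resumes: invalidate host-cached lines that P.mem
    # may have updated.
    invalidate = host_cached & mem_writes
    return writeback, invalidate
```

The tighter the input sets, the fewer lines are written back, invalidated, and later reloaded, which is why the accuracy of the dependence analysis matters.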
AN ALGORITHM TO MAP THE CODE
We have implemented a compiler and run-time algorithm that automatically maps applications to the intelligent memory architecture of Section 2. The algorithm maps both numerical and non-numerical applications. For the numerical applications, the algorithm is embedded in the Polaris parallelizing compiler [3] .
The algorithm has several parts. First, the code is partitioned into modules that have a homogeneous memory and computing behavior (Section 3.1). Then, using a static performance model or code profiling, the algorithm estimates which processor each module should run on (Section 3.2). Since static scheduling decisions may be poor, the algorithm also inserts code that identifies at run time where each module should run (Section 3.3). Finally, the algorithm enables the overlap of P.host and P.mem execution (Section 3.5).
Code Partitioning
The first step in the algorithm is to partition the code into modules or sections of code that have a homogeneous computing and memory behavior. In addition, a module should have good locality and be easy to extract from the code. We use two partitioning algorithms: basic partitioning and the more aggressive advanced partitioning. Basic partitioning finds basic modules, while advanced partitioning combines them into compound modules.
Basic Partitioning
Intuitively, the desirable characteristics listed above are most likely to be found in loop nests. Consequently, we define a basic module to be a loop nest where each nesting level has only one loop and possibly several statements. Such a loop nest may span several subroutine levels.
To identify basic modules, the algorithm starts off by marking as initial modules all innermost loops in the application that contain neither subroutine calls leading to loops nor if statements enclosing loops. Then, it tries to expand these initial modules, possibly crossing subroutine boundaries in the process.
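The marking of initial modules can be sketched on a toy program tree. This is only an illustration: the `Node` class, its fields, and the `calls_loop` flag (which stands in for the result of interprocedural analysis) are assumptions made for the example, not structures of the compiler.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                  # "loop", "if", "call", or "stmt"
    name: str = ""
    children: list = field(default_factory=list)
    calls_loop: bool = False   # for "call": does the callee lead to a loop?

def contains_loop_work(node):
    """True if any descendant is a loop or a call leading to a loop."""
    for child in node.children:
        if child.kind == "loop" or (child.kind == "call" and child.calls_loop):
            return True
        if contains_loop_work(child):
            return True
    return False

def initial_modules(root):
    """Names of innermost loops with no calls leading to loops and
    no if statements enclosing loops."""
    found = []

    def visit(node):
        if node.kind == "loop" and not contains_loop_work(node):
            found.append(node.name)
        for child in node.children:
            visit(child)

    visit(root)
    return found
```

A loop that contains an if statement enclosing another loop is rejected, but the enclosed loop itself can still qualify as an initial module.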
For the expansion process, we first order the subroutines in the application starting from the leaves in the static call graph and working bottom up in the graph. We then consider one subroutine at a time.
For a given subroutine, we order the modules depth first. Then, for each module M in the subroutine, we repeatedly apply the following two steps in sequence until the module stops expanding:
1. Given a statement s in the subroutine, which is neither a module, nor a subroutine call leading to a loop, nor an if statement
After all the modules in the subroutine have been processed, we move on to the next subroutine. After all the subroutines have been processed, we need to perform one final operation. We examine each of the resulting modules. If a module is not a loop nest, we peel off statements until it becomes a loop nest. After all these steps, we have all the basic modules.
As an example, the code in Figure 2-(a) contains two loops that are marked as initial modules, namely L1 and L2. After applying the algorithm described, the resulting basic modules are loops L3 and L2, as shown in Figure 2.
Figure 2: A sample code with two loops marked as initial modules (a) and its structure after basic partitioning (b) and advanced partitioning (c). We assume that the two modules in (b) have the same estimated affinity.
Advanced Partitioning
While basic modules may be fairly homogeneous, they may also be small. If so, it is possible that, relative to their small grain size, they induce unacceptably high overheads to keep the data in the caches coherent (Section 2.1) and to bundle up the code into units ready to execute (Section 4). With advanced partitioning, we try to increase the grain size of the modules and, therefore, reduce their relative overheads, at the possible expense of lowering their homogeneity.
In this algorithm, we generate compound modules out of basic modules. Compound modules do not have to be loop nests. They are generated by merging basic modules with nearby statements and other basic modules whose affinity is expected to be the same. We say that a module has affinity for P.host or P.mem if it runs faster on P.host or P.mem, respectively. The affinity of a module is estimated using a static performance model or code profiling (Section 3.2).
The advanced partitioning algorithm starts off by ordering the subroutines in the application starting from the leaves in the static call graph and working bottom up in the graph. Within each subroutine, the basic modules identified in Section 3.1.1 are marked and ordered depth first. The algorithm then works on one subroutine at a time. For a given subroutine, it applies an expansion subalgorithm to every module in sequence. Then it applies a combining sub-algorithm to every module in sequence. The two sub-algorithms are repeatedly applied until the modules in the subroutine do not change further.
The expansion sub-algorithm expands a module. It works like the expansion in basic partitioning. Given a module M in a subroutine, it repeatedly executes the following steps until the module stops expanding:
1. Step 1 from Section 3.1.1.
2. Step 2 from Section 3.1.1.
After a subroutine has been processed, we move on to the next one. After all the subroutines have been processed, we obtain the compound modules.
As an example, Figure 2 -(c) shows the result of applying advanced partitioning to the code of Figure 2 -(b). If we assume that the two basic modules have the same estimated affinity, the whole code becomes a compound module.
Note that there are modules so small that the overheads that they are expected to induce may overwhelm their execution time. We will see in Section 3.2 that these modules are statically assigned undefined affinity. If, in the advanced partitioning algorithm, one of these small modules is considered for combination with a large one, the algorithm assumes that the small module has the same estimated affinity as the large one. Consequently, it combines them. If two small modules are considered for combination, they are also combined but, if the resulting module is large enough, its estimated affinity may now become P.host or P.mem.
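The treatment of small modules during combination can be summarized in a few lines. The sketch below assumes that undefined affinity marks a small module, as in the paragraph above; the function name and the `reestimate` callback (which stands for re-running the affinity estimation on the merged module) are hypothetical.

```python
HOST, MEM, UNDEF = "P.host", "P.mem", "undefined"

def combine(aff_a, aff_b, reestimate):
    """Return the affinity of the merged module, or None if the two
    modules should not be merged.

    reestimate -- assumed callback that re-estimates the affinity of the
                  merged module; it may return HOST, MEM, or UNDEF.
    """
    if aff_a == UNDEF and aff_b == UNDEF:
        # Two small modules: always merge; the result may now be large
        # enough to receive a defined affinity.
        return reestimate()
    if aff_a == UNDEF:
        return aff_b   # small module adopts the large module's affinity
    if aff_b == UNDEF:
        return aff_a
    # Two large modules merge only if their affinities agree.
    return aff_a if aff_a == aff_b else None
```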
Advanced Partitioning with Retraction
One possible problem with the compound modules generated by advanced partitioning is that they may be very large and, as a result, may be invoked very few times during program execution. If this occurs, run-time adaptation (Section 3.3) is harder to apply because the module may not run enough times for the system to learn.
To address this problem, we also propose a variation of advanced partitioning that includes a retraction step at the end. Specifically, after all the compound modules have been identified, the algorithm selects those that (possibly through profiling) are expected to be invoked only 1-2 times. For each of these modules, the algorithm then starts peeling-off statements until it reaches an all-enclosing loop or a set of disjoint loops. In any case, the bodies of these loops or loop are the new compound modules. This is because these bodies, which are usually much larger than basic modules, are likely to be executed several times. With this approach, we allow run-time adaptation to work, at the expense of reducing the grain size of the module.
Affinity Estimation
To be able to combine modules under advanced partitioning, our algorithm must have the ability to estimate the affinity of individual modules. Such ability is also needed to decide where to schedule the execution of modules in case we use a static scheduling policy.
To estimate the affinity of modules, we use a static performance model. Currently, the model is designed only for numerical applications. For non-numerical ones, we use information gathered from two profiling runs of the code, one on P.host and one on P.mem, using a different input set than for the production run. The profiling runs measure the execution time and number of invocations of each module. In the rest of this section, we describe the static performance model.
Static Performance Model
The model is based on Delphi's performance predictor [4]. It estimates the execution time of a module on both P.host ($T_{phost}$) and P.mem ($T_{pmem}$), and selects the processor with the lowest time as the estimated affinity of the module. The model works by estimating the two major components of the execution time: computing time ($T_{comp}$) and memory stall time ($T_{memstall}$):
$$T = T_{comp} + T_{memstall} + T_{other}$$
In the formula, $T_{other}$ is the estimated latency of special items like calls to libraries that compute square roots or intrinsic functions.
To estimate $T_{memstall}$, we estimate, for each level $i$ of the cache hierarchy, the number of cache misses in the module ($miss_i$) and the average processor stall time per miss ($stall_i$). Then, $T_{memstall}$ for a cache hierarchy is estimated as:
$$T_{memstall} = \sum_{i} miss_i \times stall_i$$
The number of cache misses in the module is estimated by using the data dependence structure of the code to predict the access patterns. The latter then drive a stack-distance model for the cache [4, 17]. The stack-distance model assumes fully-associative caches and, therefore, underestimates the number of misses. As for the average stall time per miss, it is hard to estimate since part of the cache miss latency is overlapped with computation. In addition, resource contention increases the stall. For simplicity, in our calculations, we use the full non-overlapped cache miss penalty without contention.
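As a small worked illustration of the model, the sketch below assumes that the execution time is the sum of the computing time, the memory stall time, and the time of the special items, and that the memory stall time is the per-level product of misses and stall time, summed over the cache hierarchy. The function names and the numbers in the usage note are made up for the example.

```python
def memory_stall_time(levels):
    """levels: iterable of (misses, stall_cycles_per_miss), one pair per
    cache level. Uses the full, non-overlapped miss penalty."""
    return sum(misses * stall for misses, stall in levels)

def module_time(t_comp, levels, t_other=0.0):
    """Estimated execution time of a module on one processor, in cycles."""
    return t_comp + memory_stall_time(levels) + t_other
```

For instance, a module with 1,000 cycles of computation, 100 first-level misses at 10 cycles each, 20 second-level misses at 80 cycles each, and 50 cycles of special items would be estimated at `module_time(1000, [(100, 10), (20, 80)], 50)` cycles. The model would be evaluated twice, once with P.host parameters and once with P.mem parameters.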
Affinity and Inaccuracy Window
All these approximations induce some error in the values estimated for $T_{phost}$ and $T_{pmem}$. In addition, there are other sources of error that arise from the fact that some information may not be available at compile time. Examples of such information are the outcomes of branches and full data dependence structure information. Overall, however, the model delivers acceptable results. Recall that its goal is not to estimate the actual execution time as accurately as possible, but to estimate which of $T_{phost}$ and $T_{pmem}$ is smaller.
To be on the safe side, we use an inaccuracy window for the model. We assume that the error can be reasonably bounded by $err\%$ of the measured value and, therefore, have an inaccuracy window of $\pm err\%$. Consequently, our model only reports the affinity for a module if the inaccuracy windows $[T_{phost}(1 - err/100),\ T_{phost}(1 + err/100)]$ and $[T_{pmem}(1 - err/100),\ T_{pmem}(1 + err/100)]$ do not overlap. Otherwise, the affinity is reported as undefined (Table 1).
Finally, as shown in Table 1, very small modules are also assigned undefined affinity. The reason is that the overheads required to keep the data in the caches coherent (Section 2.1) and to bundle up the code into a module ready to execute (Section 4) are likely to overwhelm the execution time of one of these modules. By setting the affinity to undefined, we increase the chances that the very small module combines with nearby modules. In addition, for very small module sizes, the model becomes less accurate.
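The decision rule described in this section can be sketched as follows. The function name is hypothetical, and the window width and size threshold are parameters of the sketch; the values used in the usage note are purely illustrative.

```python
def estimated_affinity(t_phost, t_pmem, err_pct, min_cycles):
    """Return "P.host", "P.mem", or "undefined".

    t_phost, t_pmem -- estimated execution times in cycles
    err_pct         -- half-width of the inaccuracy window, in percent
    min_cycles      -- modules estimated below this size are too small
    """
    # Very small modules get undefined affinity.
    if min(t_phost, t_pmem) < min_cycles:
        return "undefined"
    lo_h, hi_h = t_phost * (1 - err_pct / 100), t_phost * (1 + err_pct / 100)
    lo_m, hi_m = t_pmem * (1 - err_pct / 100), t_pmem * (1 + err_pct / 100)
    # Report an affinity only if the two windows do not overlap.
    if hi_h < lo_m:
        return "P.host"
    if hi_m < lo_h:
        return "P.mem"
    return "undefined"
```

With an illustrative 10% window, estimates of 100,000 and 200,000 cycles yield a P.host affinity, whereas estimates of 100,000 and 110,000 cycles yield an undefined one because the windows overlap.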
Adaptive Execution
The algorithm can use the estimated affinity of the modules to individually schedule each module on either P.host or P.mem statically. Alternatively, the schedule can be decided dynamically. In this case, the compiler generates both P.host and P.mem versions of the module. In addition, it includes code in the module to measure at run time the execution time of some of its invocations or (if applicable) some of its loop iterations. These invocations or iterations where execution times are gathered are called decision runs. Based on the measurements in the decision runs, the run-time system determines the affinity of the module and schedules subsequent invocations or loop iterations of the module in the program appropriately.
Depending on the granularity of the code executed in a decision run, we propose coarse- and fine-grain dynamic scheduling strategies. In coarse strategies, the granularity is an entire module invocation, while in fine strategies, the granularity is a single iteration of the outermost loop. Of course, fine strategies are only applicable to basic modules and to compound modules that have an all-enclosing loop.
We propose two different coarse strategies: coarse basic and coarse most recent. In coarse basic, the module is executed and timed on one processor when it is first invoked in the program, and on the other processor in its second invocation. The affinity of the module is then determined by comparing the two measurements. The module is then executed for the remaining invocations in the program on the processor for which it has affinity.
In coarse most recent, the module is also executed first on one processor and then on the other. However, the scheduling for the remaining invocations in the program is not fixed at that point. Instead, every time that the module executes on a processor, we time it and compare the execution time to its most recent execution time on the other processor. If the latter is lower, we change the affinity of the module.
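The two coarse strategies differ only in whether the affinity is frozen after the first two invocations. The sketch below models both with one small class; the class name, method names, and the way timings are supplied are assumptions of the example rather than the interface of the run-time system.

```python
PROCS = ("P.host", "P.mem")

def other(proc):
    return PROCS[1] if proc == PROCS[0] else PROCS[0]

class CoarseScheduler:
    """most_recent=False models coarse basic; True models coarse most recent."""

    def __init__(self, first="P.host", most_recent=True):
        self.first = first
        self.most_recent = most_recent
        self.last = {}          # most recent measured time per processor
        self.affinity = first
        self.frozen = False

    def choose(self):
        """Processor for the next invocation of the module."""
        if len(self.last) == 1:
            return other(self.first)   # second decision run
        return self.affinity

    def record(self, proc, elapsed):
        """Record the measured time of the invocation that ran on proc."""
        self.last[proc] = elapsed
        rival = other(proc)
        if self.frozen or rival not in self.last:
            return
        # Switch if the most recent time on the other processor is lower.
        self.affinity = rival if self.last[rival] < elapsed else proc
        if not self.most_recent:
            self.frozen = True         # coarse basic decides only once
```

The caller would invoke `choose()` before each module invocation and `record()` after it. Under coarse most recent, a module whose P.host time later degrades migrates to P.mem, while under coarse basic it stays where the first two invocations placed it.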
We propose two different fine strategies: fine basic and fine first invocation. In fine basic, in each invocation of the module, the first iteration of the loop is executed and timed on one processor, and the second one on the other processor. Based on these two measurements, the affinity is determined and fixed for the rest of the loop execution. The whole process is repeated in every invocation. In fine first invocation, the decision runs are performed only in the first invocation of the module. Once the affinity is determined after the second iteration, it is fixed for all subsequent iterations and invocations of the module.
Table 2 helps compare the different dynamic scheduling strategies. The table classifies modules into 8 different classes, based on the number of times the module is invoked in the program, and whether or not the module's behavior varies across invocations and across iterations of the same invocation. The table then lists the best strategy or strategies for each class of module.
Coarse strategies tend to be more effective for modules that are invoked relatively more times. The reason is that, in general, in the first two invocations of a given module, coarse strategies schedule the module on the wrong processor once. As a result, they are not
Of all the strategies, fine basic has the highest overhead, since execution transfer between P.host and P.mem happens at least once for every invocation of the module. However, it adapts to changes across invocations. Compared to all other strategies, fine first invocation has the lowest overhead. However, it is only useful when the behavior of the loop does not change across iterations or invocations.
Overall, we see from the table that no single strategy is best in all cases. In practice, it may be better to focus on the modules that are invoked many times, since they are more likely to contribute significantly to the overall execution time of the application. Moreover, it may be safer to assume that the behavior of a module will vary across both iterations and invocations.
Finally, our algorithm uses several heuristics. Modules with undefined affinity are always run on P.host. This approach is likely to reduce the overheads associated with transferring execution between processors. For a module with defined affinity, the first decision run is always scheduled on the processor for which the module has affinity. This approach minimizes the chances of executing the module on the wrong processor.
Parallelization
At this point, if our intelligent memory architecture included several P.mem processors, we could parallelize the modules assigned to memory. Similarly, if the architecture had several P.hosts, we could parallelize the modules assigned to the host. To do so, we could use many conventional compiler techniques. While such an approach is certainly interesting, we prefer to focus on an architecture with a single P.mem and a single P.host, and examine the less explored issue of overlapping execution between the two. We recognize, however, that both axes of parallelism affect each other and that both need to be included in a complete compiler algorithm for intelligent memory architectures.
Overlapped Execution
To further speed-up execution of the application, the algorithm attempts to overlap the execution of modules on P.host and on P.mem. To this end, the program is divided into two classes of regions. In a module-wise parallel region, there are multiple modules that can be run in parallel with respect to one another, while in a modulewise serial region, there is only one module that can be run at a time because of dependences between modules. The algorithms used in each type of region are different. In the following, we explain them.
Note that, in these algorithms, we use basic modules. The reason is that basic modules are simpler and, therefore, expose more parallelism in the application. In addition, they are easier to parallelize because they are always loops.
Module-Wise Parallel Region.
The goal is to run some of the modules in this region on P.host and some on P.mem concurrently. We attempt to perform an initial partition of the modules that is as balanced as possible. We assign the modules to P.host or to P.mem based on their estimated affinity. However, if the resulting partition is not balanced according to static estimates (Section 3.2), we move some modules from the busier processor to the other one. If necessary, we also take the largest module remaining in the busier processor and partition it according to the algorithm for module-wise serial regions (see below).
If, at run-time, the load is imbalanced, we dynamically change the scheduling of some modules using algorithms similar to those in Section 3.3. If necessary, we also take the module that was partitioned and we dynamically repartition it according to the algorithm for module-wise serial regions.
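One simple way to realize the static balancing step is a greedy refinement of the affinity-based assignment. The sketch below is one possible heuristic under stated assumptions, not necessarily the exact procedure in our compiler: each module has an estimated time on each processor, and a module is moved only if the move lowers the larger of the two loads. The function names are hypothetical, and the splitting of the largest remaining module is not modeled.

```python
def makespan(modules, host):
    """Larger of the two loads; modules maps name -> (t_host, t_mem)."""
    load_host = sum(t[0] for m, t in modules.items() if m in host)
    load_mem = sum(t[1] for m, t in modules.items() if m not in host)
    return max(load_host, load_mem)

def balance(modules):
    """Return (host_set, mem_set) for a module-wise parallel region."""
    if not modules:
        return set(), set()
    # Initial partition: each module goes to the processor it runs faster on.
    host = {m for m, (t_host, t_mem) in modules.items() if t_host <= t_mem}
    while True:
        current = makespan(modules, host)
        # Evaluate moving each single module to the other processor. Only
        # moves away from the busier processor can lower the makespan.
        best_span, best_m = min(
            (makespan(modules, host ^ {m}), m) for m in sorted(modules))
        if best_span >= current:
            return host, set(modules) - host
        host ^= {best_m}
```

The loop terminates because every accepted move strictly lowers the makespan.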
Module-Wise Serial Region.
The goal is to partition the only module in this region between P.host and P.mem so that the load is as balanced as possible. In the following, we explain the general approach. The partition can be performed statically or dynamically, as explained in Section 3.5.1.
If the loop is fully parallel, we divide the iteration count into two chunks (Row 1 of Table 3 ). The sizes of the chunks are those that balance the load between P.host and P.mem, based on static or dynamic information. Otherwise, if the loop is distributable across processors without synchronization, we distribute the loop (Row 2 of Table 3 ).
Loop distribution splits the loop into the maximal strongly-connected components (called π-blocks) in the data dependence graph of the loop body [2]. Each component becomes a new loop. We then topologically sort the data dependence graph of the distributed loops and assign the loops to P.host or P.mem according to their estimated or real affinity. As usual, we try to balance the load based on static or dynamic information. When a final schedule is produced, all the loops assigned to a given processor are combined into a single one to reduce overhead. Note that, if the loop is both parallel and distributable, it is better to run it as a parallel loop than to distribute it. The reason is that parallelization tends to provide better cache reuse and makes it easier to balance the load.
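The first two steps of loop distribution, finding the strongly-connected components and ordering them topologically, can be sketched with Tarjan's algorithm, which emits components in reverse topological order. The graph encoding (a dictionary from each statement to the statements that depend on it) and the function name are assumptions of the example.

```python
def pi_blocks(deps):
    """deps: dict mapping each statement to the list of statements that
    depend on it. Returns the strongly-connected components in
    topological order (producers before consumers)."""
    index, low = {}, {}
    stack, on_stack, order = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            order.append(sorted(comp))

    for v in deps:
        if v not in index:
            visit(v)
    # Tarjan's algorithm finishes consumers first, so reverse the list.
    return order[::-1]
```

Each returned component would become one distributed loop. The recursive formulation is adequate for loop bodies of ordinary size; a very large dependence graph would call for an iterative version.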
Otherwise, if the loop can be distributed using dopipe scheduling [20], we do so (Row 3 of Table 3). The procedure is similar to the previous case. However, we now need to add synchronization between the processors, and write-back and invalidate commands to control P.host's cache. In the example in Row 3 of Table 3, P.host executes at least 4 iterations ahead of P.mem and uses signal and wait to synchronize. To keep the data coherent, before every synchronization, P.host writes back the updated cache lines.
Otherwise, we apply the best sequential scheduling strategy described in Section 3.3 to the loop. While we could still partition the loop and apply doacross scheduling in some cases, the synchronization overhead is likely to be too high [XI.
Note that, when the compiler partitions a loop in cases 1 and 2, it must do so in such a way that P.host and P.mem do not falsely share any memory line. Otherwise, since the compiler is not inserting write-back or invalidation commands, there may be data coherence problems. Therefore, the compiler must be aware of the data access patterns and data layouts when it partitions the loops.
Static and Dynamic Partitioning
All the cases in the algorithm for module-wise serial regions can use static or dynamic information to decide how to partition the module. The procedure is straightforward except that, because we may now be assigning small chunks of work, we need to be aware of cache write-back and invalidation overheads.
As an example, consider Case 1. To make the decision statically, we use the predicted execution time of the module on P.host ($T_{phost}$) and on P.mem ($T_{pmem}$), and the predicted number of iterations $N$. The obvious approach is to assign the iterations so that the load in P.host and in P.mem is balanced. This occurs when we assign $N_{phost}$ iterations to P.host and $N_{pmem}$ to P.mem such that:
$$N_{phost} = N \times \frac{T_{pmem}}{T_{phost} + T_{pmem}}, \qquad N_{pmem} = N \times \frac{T_{phost}}{T_{phost} + T_{pmem}}$$
In this case, the estimated execution time of both P.host and P.mem will be $T_{total} = \frac{T_{phost} \times T_{pmem}}{T_{phost} + T_{pmem}} + T_{wbinv}(N_{pmem})$. In this formula, $T_{wbinv}(N_{pmem})$ is all the overhead involved in performing write-back and invalidation actions on P.host's cache. For simplicity, we assume that this overhead delays execution of both P.host and P.mem equally. $T_{wbinv}$ can be estimated as a function of $N_{pmem}$, the number of iterations assigned to P.mem. Overall, in our algorithm, we use this partition unless $T_{total}$ is larger than $T_{phost}$, in which case we execute the whole module on P.host.
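The static decision for Case 1 can be sketched as follows, assuming a load-balancing split in which each processor receives a share of the iterations inversely proportional to its predicted module time. The function name, the rounding to whole iterations, and the `t_wbinv` callback (an assumed model of the write-back and invalidation overhead) are choices of the sketch.

```python
def static_partition(t_phost, t_pmem, n, t_wbinv):
    """Return (n_phost, n_pmem) for a fully parallel loop.

    t_phost, t_pmem -- predicted time of the whole module on each processor
    n               -- predicted number of iterations
    t_wbinv         -- assumed callable: iterations given to P.mem -> overhead
    """
    # Balanced split: each share is inversely proportional to module time.
    n_phost = round(n * t_pmem / (t_phost + t_pmem))
    n_pmem = n - n_phost
    t_total = t_phost * t_pmem / (t_phost + t_pmem) + t_wbinv(n_pmem)
    # Fall back to running everything on P.host if partitioning does not pay.
    if t_total > t_phost:
        return n, 0
    return n_phost, n_pmem
```

With predicted times of 100 and 300 and 40 iterations, the balanced split is 30 iterations for P.host and 10 for P.mem, for an overhead-free time of 75. An overhead of 0.5 per P.mem iteration keeps the partition worthwhile, whereas an overhead of 3 per iteration raises the total to 105 and sends the whole loop to P.host.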
To make the decision dynamically, we proceed similarly. In the first invocation of the loop, we use a certain partition of iterations, for example $N_{phost}$ and $N_{pmem}$. In this first invocation, the run-time system measures the overhead-free execution time of these iterations, which is $\tau_{phost}$ and $\tau_{pmem}$, respectively. In addition, it also measures the $T_{wbinv}(N_{pmem})$ overhead. With these measurements, the run-time system estimates the average execution time of one iteration on P.host ($t_{phost}$) and on P.mem ($t_{pmem}$) as: $t_{phost} = \frac{\tau_{phost}}{N_{phost}}$ and $t_{pmem} = \frac{\tau_{pmem}}{N_{pmem}}$. Based on these values, the run-time system partitions the loop in its next invocation in the program as follows.
If the loop has $N_{next}$ iterations, the assignment is:
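The assignment formula itself is not reproduced in this text. The sketch below therefore assumes that the dynamic case balances the load in the same way as the static case, using the measured per-iteration times in place of the predicted module times, and it ignores the measured write-back and invalidation overhead. The function and parameter names are hypothetical.

```python
def dynamic_repartition(n_phost, n_pmem, tau_phost, tau_pmem, n_next):
    """Return (next_n_phost, next_n_pmem) for the next invocation.

    n_phost, n_pmem     -- iterations each processor ran in the first
                           invocation (both assumed to be non-zero)
    tau_phost, tau_pmem -- measured overhead-free times of those iterations
    n_next              -- iteration count of the next invocation
    """
    # Average time of one iteration on each processor.
    t_phost = tau_phost / n_phost
    t_pmem = tau_pmem / n_pmem
    # Assumed balance condition: next_phost * t_phost == next_pmem * t_pmem.
    next_phost = round(n_next * t_pmem / (t_phost + t_pmem))
    return next_phost, n_next - next_phost
```

For example, if both processors ran 50 iterations and took 100 and 300 time units, the per-iteration times are 2 and 6, and a next invocation with 80 iterations would give 60 iterations to P.host and 20 to P.mem.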
Compiler Directives
Our system includes source-code compiler directives that allow the programmer to guide the algorithm. For example, they allow the programmer to identify modules and specify where and how they should be run. These directives are useful when the programmer knows the application well. A sample of our directives is shown in Table 4 .
EVALUATION ENVIRONMENT
Compiler
We have implemented the compiler algorithm described in Section 3 so that it can be applied in a fully-automated manner. For the numerical applications, the algorithm is embedded in the Polaris parallelizing compiler [3]. Polaris takes Fortran programs and includes many compilation passes that our algorithm can benefit from. Such passes perform data dependence analysis, interprocedural analysis, symbolic analysis, and other operations. For the non-numerical applications, we cannot use Polaris and, therefore, apply our algorithm by hand. Polaris helps identify with high accuracy the cache lines that have to be written back or invalidated from P.host's caches when execution is transferred between P.host and P.mem (Section 2.1). For the non-numerical applications, since we do not have tools to perform detailed data dependence analysis, we often conservatively write back or invalidate more cache lines than necessary.
Our system attempts to produce efficient code. Any module that is to be run on P.mem is bundled into a subroutine, which simplifies maintaining data coherence for register values. Moreover, the P.host and P.mem versions of a module are optimized for the processor they will run on. Specifically, P.host versions are loop-unrolled so that more ILP can be extracted dynamically and loads can be overlapped. For P.mem versions, we use blocking and loop distribution to minimize the pollution of the small P.mem cache.
Finally, a special case occurs if the loop in a module contains gotos that exit the module. In this case, the compiler transforms these gotos when the module is made into a subroutine. Specifically, new targets for these gotos are generated immediately after the loop in the same subroutine, and the subroutine is given one extra argument that is set to different values at these target positions. After executing the subroutine, the argument is checked by the caller of the subroutine and control branches to the original goto target in the caller according to the returned value of the argument. Multiple entries into the loop in a module are transformed by the compiler in a similar way.
Static Prediction
We set the parameters used in the static performance model to match the processor and system architectures modeled. Some of the most important parameters include the number and type of functional units in the processors, instruction latencies, cache sizes and organizations, and cache miss latencies. As indicated in Section 3.7, the model is currently designed only for numerical applications. For non-numerical ones, we predict based on data from two profiling runs of the code, one on P.host and one on P.mem, using a different input set than for the production run.
As indicated in Section 3.7.2, the static performance model returns undefined affinity for a module when the inaccuracy windows of the estimated P.host and P.mem execution times overlap. Based on our experiments, we use ±35% inaccuracy windows. In addition, when the module is so small that the model becomes less accurate, the affinity is also undefined. This occurs for modules whose estimated execution time on P.host or P.mem is lower than 50,000 cycles per invocation.
Finally, for very small modules, not even the profiles performed on non-numerical applications are accurate. The reason is that these profiles do not include the overheads to keep the data in the caches coherent or to bundle up the code into modules. We consider that profiled modules whose estimated execution time on P.host or P.mem is less than 2,000 cycles per invocation also have undefined affinity.
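The rules for returning undefined affinity can be sketched as follows. This is a minimal illustration, not the compiler's implementation: the function names are hypothetical, the inaccuracy window is assumed to be a symmetric relative band around each estimate, and the window value in the usage lines is illustrative.

```python
def windows_overlap(t_a, t_b, window):
    """True if the bands [t*(1-window), t*(1+window)] of the two estimates overlap.

    Assumption: the inaccuracy window is a symmetric relative band.
    """
    lo_a, hi_a = t_a * (1 - window), t_a * (1 + window)
    lo_b, hi_b = t_b * (1 - window), t_b * (1 + window)
    return lo_a <= hi_b and lo_b <= hi_a


def predicted_affinity(t_host, t_mem, window, min_cycles):
    """Return "P.host", "P.mem", or "undefined" for one module.

    t_host, t_mem: estimated cycles per invocation on each processor.
    min_cycles:    size below which the estimate is not trusted.
    """
    if min(t_host, t_mem) < min_cycles:
        return "undefined"  # module too small for an accurate estimate
    if windows_overlap(t_host, t_mem, window):
        return "undefined"  # estimates too close to separate reliably
    return "P.host" if t_host < t_mem else "P.mem"


# Static model for numerical codes: 50,000-cycle threshold, illustrative window.
print(predicted_affinity(100_000, 300_000, window=0.35, min_cycles=50_000))
# Profile-based prediction for non-numerical codes: 2,000-cycle threshold,
# no window (assumption: profiled times are compared directly).
print(predicted_affinity(3_000, 2_500, window=0.0, min_cycles=2_000))
```

Passing the window and the size threshold as parameters lets the same routine cover both the static model and the profile-based prediction.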
Applications
We evaluate both numerical and non-numerical applications. 
Simulation Environment and Architecture
The code generated by our algorithm is compiled into MIPS executables and run on a MINT-based [23] execution-driven simulation environment [15]. The simulation environment models dynamic superscalar processors with register renaming, branch prediction, and non-blocking memory operations [15]. The architecture modeled is that of Section 2, with a bus connecting the processor and memory chips. The architecture is modeled cycle by cycle, including contention effects. Table 6 shows the parameters used for each component of the architecture.
The L2 cache size used is 1 Mbyte for numerical applications and 512 Kbytes for non-numerical ones. We selected a smaller cache for non-numerical applications because they execute small problem sizes, especially the SPECint95 applications. Table 7 shows the L2 local hit rates of both types of applications. With 512-Kbyte L2 caches, the average hit rates of non-numerical applications get closer to those of numerical ones, which are around 80%.
Our choice of P.mem's clock frequency is motivated by recent advances in the Merged Logic DRAM process. They appear to enable the integration of logic that cycles as fast as in a logic-only chip, with DRAM memory that is only 10% less dense than in a DRAM-only chip [ Then, the controller writes back the desired lines in the background without stalling P.host. The write backs must be completed before passing execution to P.mem. If, instead, we want to invalidate num_cache_lines lines, P.host suffers a total overhead of 5 + 1 × num_cache_lines cycles. These cycles can potentially be overlapped with P.mem execution.
Finally, P.host and P.mem synchronize at module boundaries as described in Section 2. The overheads involved in these synchronizations are considered in our simulation.
EVALUATION
To evaluate our algorithm, we first examine the characteristics of the modules (Section 5.1) and then evaluate non-overlapped execution with basic partitioning (Section 5.2) and advanced partitioning (Section 5.3), overlapped execution (Section 5.4), and overall speedups (Section 5.5).
Characteristics of the Modules
Table 8 shows the characteristics of the basic modules. The table has a section for the numerical applications and one for the non-numerical ones. The first row in each section shows the total number of basic modules in each application and their combined execution time relative to the total execution time of the application. All times are taken running on P.host. The second and third rows in each section break down the modules into whether or not they are parallelizable. The fourth and fifth rows break down the modules into whether they have overall affinity for P.host or P.mem. To compute the affinity of a module, we time the execution of all the invocations of the module on P.host and P.mem, and choose the processor with the lowest average time. The sixth row shows the average number of invocations for each individual module in a program. Finally, the last row shows the average module size, measured as the number of P.host cycles that it takes to execute one invocation.
If we focus on the numerical applications, we see that there are only a few modules per application (5-21) and that they account for 99% of the application time. Parallel modules dominate in Swim, TFFT2, and Mgrid, while they have little weight in LU. In general, modules tend to be large and to be invoked few times.
In non-numerical applications, the number of basic modules is considerably higher (15-75). However, the coverage is low (on average 63% of the application time), especially in Go, where only 16% of the execution time is covered by basic modules. Serial modules dominate in all the non-numerical applications. Most of the modules in the non-numerical applications except Bzip2 are small and, not coincidentally, are invoked many times. Therefore, the overhead involved in executing the basic modules on different processors is likely to be large compared to their execution time. As a result, we may need advanced partitioning for these applications.
The table also shows that different applications have a different distribution of module affinity. While Swim, Tomcatv, Mgrid, and Mcf have mostly P.mem affinity, LU, Bzip2, Go, and M88ksim have mostly P.host affinity. TFFT2 has a balanced distribution of affinity.
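The measured affinity reported in Table 8 (time every invocation of a module on each processor and pick the processor with the lowest average) can be sketched as below. The function name and the tie-breaking rule in favor of P.host are assumptions made for illustration.

```python
from statistics import mean


def measured_affinity(host_times, mem_times):
    """Affinity of one module from per-invocation execution times.

    host_times, mem_times: cycles of each invocation on P.host and P.mem.
    Ties go to P.host (assumption; the text does not specify a tie rule).
    """
    return "P.host" if mean(host_times) <= mean(mem_times) else "P.mem"


# A module that is faster on P.mem on average, even though one
# invocation is faster on P.host.
print(measured_affinity([100, 80, 120], [90, 95, 85]))
```

Because the decision uses the average over all invocations, a module whose workload changes across invocations can have an overall affinity that is wrong for some individual invocations.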
Non-Overlapped & Basic
In this section, we evaluate the non-overlapped execution of the applications with basic partitioning. Since numerical applications and non-numerical ones have different behavior, we discuss them separately.
Numerical Applications
Figure 3 compares the execution times of the numerical applications under several scenarios. For each application, the three leftmost bars correspond to running the application on P.host alone (P.host(alone)), on P.mem alone (P.mem(alone)), and on an ideal non-overlapped environment (Ideal), respectively. P.host(alone) and P.mem(alone) are obtained by running the uninstrumented original applications on either processor. Ideal is obtained by executing each module on the processor where, on average across all invocations, it runs the fastest. The code that is not covered by the basic modules is run on P.host. No overlap between P.host and P.mem execution is allowed. In this ideal environment, we do not consider any data coherence or message bundling overheads. However, we also disregard any dynamic variation of affinities because we fix the scheduling of modules to processors. Moreover, we do not consider any potentially beneficial cache interactions between modules. Overall, Ideal is a good approximation to the lower bound for non-overlapped execution.
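The Ideal bar for non-overlapped execution can be approximated with a short calculation. The sketch below assumes per-module average times and invocation counts are available; the function name and the tuple layout are hypothetical, and coherence and bundling overheads are ignored, as in the definition of Ideal.

```python
def ideal_time(modules, uncovered_host_time):
    """Non-overlapped Ideal execution time.

    modules: list of (avg_host_time, avg_mem_time, invocations) per module.
    uncovered_host_time: time of code outside the modules, run on P.host.
    Each module is fixed to the processor with the lower average time.
    """
    total = uncovered_host_time
    for avg_host, avg_mem, invocations in modules:
        total += min(avg_host, avg_mem) * invocations
    return total


# Two modules: the first prefers P.mem, the second prefers P.host.
modules = [(10, 6, 5), (4, 9, 10)]
print(ideal_time(modules, uncovered_host_time=20))
```

Since each module is pinned to one processor for all of its invocations, an adaptive scheme that tracks workload changes can, in principle, beat this value.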
The remaining, thicker bars correspond to real scenarios where part of the code is executed on P.host and part on P.mem. For now, we only consider the bars that use non-overlapped execution and basic partitioning: static scheduling (Static), coarse basic dynamic scheduling (Coarse), coarse most recent dynamic scheduling (CoarseR), fine basic dynamic scheduling (Fine), and fine first invocation dynamic scheduling (FineF).
In each application, all bars are normalized to P.host(alone). All non-ideal bars are divided into: execution of instructions (Busy), stall time due to memory accesses (Memory), stall time due to pipeline hazards (Other), time waiting for the other processor (Idle), and overhead due to write-back and invalidation of P.host caches (WB&INV). The thicker bars show the breakdown for P.host on the left side and the breakdown for P.mem on the right side. P.host(alone) and P.mem(alone) show that the relative emphasis on computing and memory activity varies across applications: Swim, Tomcatv, and Mgrid run faster on P.mem, while LU runs faster on P.host and TFFT2 runs equally fast on both processors.
These results are consistent with the module affinity of the applications shown in Table 8. Overall, they show that neither P.host nor P.mem is the best place to run every application.
As expected, Ideal is faster than both P.host(alone) and P.mem(alone). Ideal is relatively better in applications with a mixed affinity like TFFT2.
Static schedules modules according to the static performance model; if the latter returns undefined affinity for a module, the module runs on P.host. From the figure, we see that Static is somewhat close to Ideal for most applications except for TFFT2. Overall, Static is attractive because of its simplicity.
Coarse and CoarseR tend to be the best choices. They are usually as fast as or faster than Ideal. The reason is that they use real measurements of execution times to run the modules on the processors adaptively. Unfortunately, in the process of doing so, they are likely to run each module sub-optimally at least once. This is the reason for the slowdown in Swim.
If we compare Coarse and CoarseR, we see that, in many applications, they behave similarly. The reason is that the workload of each individual module tends to remain constant across invocations. As a result, CoarseR does not offer any advantage over Coarse. However, an important exception is LU. In LU, several modules vary their workload across invocations gradually, allowing CoarseR to adapt. The result is that CoarseR is about 20% faster than Coarse.
While we recommend using CoarseR, we note that it can sometimes be slower than Coarse. Specifically, abrupt changes in a module's workload across invocations may confuse CoarseR. Such behavior occurs in TFFT2.
The fine strategies are not as attractive. Fine is sometimes slow because of the high overhead resulting from its frequent decision runs (TFFT2). As for FineF, although it has the lowest decision run overhead of all the dynamic schemes, it often suffers because all decisions are made based exclusively on the first two iterations of the first invocation of the module. Consequently, unless the workload of the module is constant across invocations and iterations, the decision is likely to be sub-optimal (Tomcatv).
Overall, we conclude that, for numerical applications, CoarseR is the best strategy under non-overlapped execution and basic partitioning. CoarseR is very close to Ideal and, on average, it is 20% faster than P.host(alone) and 12% faster than P.mem(alone). We also conclude that the write-back and invalidation overheads are negligible.
Non-Numerical Applications
Figure 4 compares the execution times of the non-numerical applications under different scenarios and input sets. For each application, we consider two different input sets, namely input1 and input2. These inputs are different from the one used in the profiling run that provides data for the static predictions. In each application and input set, the bars are normalized and broken down into categories as in Figure 3.
As in the numerical applications, the P.host(alone) and P.mem(alone) bars show that neither P.host nor P.mem is the best place to run every application. Go, M88ksim and, to a lesser extent, Bzip2 run faster on P.host, while Mcf runs faster on P.mem. These results are expected from Table 8. The bars also confirm that Ideal is often much faster than P.host(alone) and P.mem(alone).
The main characteristic of non-numerical applications is that the Static, Coarse, CoarseR, Fine, and FineF strategies evaluated in Figure 3 for numerical applications perform very poorly now. Although not shown in the figure, these strategies take 11-100% longer than P.host(alone) to complete execution. This effect is caused by the small size of basic modules in non-numerical applications (Table 8). They are so small that overheads due to data coherence, code bundling, and instrumentation affect execution time significantly.
Non-Overlapped & Advanced
We now evaluate the non-overlapped execution of the applications with advanced partitioning. As usual, we consider numerical and non-numerical applications separately.
Numerical Applications
If we apply advanced partitioning to the numerical applications, the average number of modules per application decreases only slightly. Moreover, practically all the resulting compound modules are invoked more than twice, which makes retraction as discussed in Section 3.1.3 not applicable. With these compound modules, we re-execute the applications following the coarse basic and coarse most recent dynamic scheduling strategies. The results, shown in Figure 3, are the bars AdvCoarse and AdvCoarseR, respectively. Recall that fine strategies are not applicable under advanced partitioning. If we compare bars AdvCoarse and AdvCoarseR to Coarse and CoarseR in Figure 3, we see that advanced partitioning has little impact on the performance of numerical applications. In fact, it only manages to speed up TFFT2 by 5-7%. The reasons for the marginal improvement are that there is little reuse of data across the combined modules, and that overheads are an insignificant component of the execution time.
Non-Numerical Applications
If we apply advanced partitioning to the non-numerical applications, we reduce the number of modules per application significantly. Moreover, many of the resulting compound modules are invoked fewer than three times, which means that we can optionally apply retraction. Consequently, we take the compound modules after retraction and re-evaluate three strategies: static scheduling, coarse basic dynamic scheduling, and coarse most recent dynamic scheduling. The results, shown in Figure 4, are the bars AdvStatic, AdvCoarse, and AdvCoarseR, respectively. We have also evaluated the strategies without retraction, which we discuss later.
The figure shows that AdvStatic can perform close to Ideal in some special cases. One case is when the production and profiling runs use similar input sets, as in input1 of Bzip2. Another case is when modules do not change affinities for different input sets. This occurs in Go and M88ksim, where all modules have P.host affinity because the working set fits in the L2 cache for all the input sets evaluated. In general, however, AdvStatic is not a good choice because it can perform very poorly, as in input2 of Bzip2 and input1 of Mcf.
The adaptability of AdvCoarse and AdvCoarseR to run-time conditions enables them to perform well when the production and profiling runs are very different. This occurs in input2 of Bzip2 and input1 of Mcf. Moreover, AdvCoarseR is more robust than AdvCoarse. This can be seen in input1 of Bzip2, where the workload of modules changes across invocations. Consequently, AdvCoarseR is our best choice. On average, it is about as fast as Ideal, while it runs 12% and 30% faster than P.host(alone) and P.mem(alone), respectively.
Although not shown in Figure 4, not using retraction delivers worse results. Specifically, if we do not apply retraction, AdvCoarse and AdvCoarseR do not change much in most cases. However, they become as slow as AdvStatic in input2 of Bzip2. The reason is that they are not exploiting all the potential of adaptive strategies. Therefore, we use retraction.
Overlapped Execution
Finally, we overlap the execution of P.host and P.mem according to the algorithm in Section 3.5. We attempt the two approaches discussed in that section, namely static and dynamic scheduling. The result is shown in Figure 3 as bars OverSta and OverDyn, respectively. We do not show data for the non-numerical applications because our algorithm is currently unable to overlap P.host and P.mem execution in those applications.
Recall that overlapping is applied over basic partitioning only. In the numerical applications, the algorithm finds many module-wise serial regions but no significant module-wise parallel region. In the different module-wise serial regions, it attempts to partition all 66 basic modules. Of those, 57 modules are partitioned using iteration parallelism (Case 1 in the algorithm of Section 3.5) and 2 using loop distribution (Case 2). The remaining 7 are not partitioned. Figure 3 shows that, for the majority of applications, overlapped execution is significantly faster than the best non-overlapped strategy, either real (AdvCoarseR) or not (Ideal). Specifically, in Swim, Tomcatv, and Mgrid, the overlapped strategies OverSta and OverDyn are 30-40% faster than AdvCoarseR. With overlapped strategies, we are utilizing processor resources that would otherwise remain idle.
The behavior of the other applications is easy to explain. In LU, the strategies for overlapped execution have no effect because the most significant modules have dependences and cannot be partitioned. TFFT2, on the other hand, runs slower because the strategies induce extra overhead. While some of the overhead comes from the additional code bundling and instrumentation added, most of it is induced by the need to ensure coherence for the cached data. For example, P.host needs to execute instructions to write back and invalidate cached data structures when execution is transferred to P.mem and back (WB&INV in Figure 3). P.host also executes additional instructions to generate the addresses of these data structures. These instructions add to P.host's Busy in Figure 3, which does not appear to have increased because, at the same time, much work has been transferred to P.mem. Finally, extra misses in the caches of P.host occur when, after the cache invalidations, P.host reloads the data (higher Memory stall in Figure 3). Fortunately, the same chart for TFFT2 shows that, while OverSta suffers greatly from these overheads, OverDyn is able to eliminate most of them. As a result, OverDyn only takes 12% longer than AdvCoarseR to execute TFFT2. OverDyn is better because it is aware of the extra overheads induced by overlapped execution and schedules modules more conservatively. Overall, taking the average over all numerical applications, OverDyn is over 15% faster than AdvCoarseR, the best strategy for non-overlapped execution.
Overall Speedups
The results from the previous sections indicate that the best overall strategy for numerical applications is the OverDyn overlapped execution. However, overlapped execution does not improve over non-overlapped execution for our non-numerical applications. The best non-overlapped execution strategy in both types of applications is AdvCoarseR, which is applied with retraction when applicable.
To summarize our results, Table 9 compares the speedups for different environments. Relative to P.host(alone), AdvCoarseR delivers an average speedup of 1.31 and 1.18 for numerical and nonnumerical applications, respectively. Relative to P.host(alone), OverDyn delivers an average speedup of 1.66 for numerical applications. Recall that P.host(alone) uses the original, uninstrumented code.
In general, we expect individual numerical applications to benefit from overlapped execution, unless they are highly serial (LU) or suffer from the data coherence and related overheads of overlapped execution (TFFT2). Furthermore, we expect individual non-numerical applications to benefit from non-overlapped execution strategies like AdvCoarseR to a larger degree than our average results indicate. Our average is pulled down by the fact that Go and M88ksim have working sets that fit in P.host's caches and, therefore, all their modules have P.host affinity. In this case, our strategies cannot help. To put our results in perspective, Table 9 shows the speedups under two related scenarios. The first one is an ideal system with 2 P.hosts. To obtain it, we use the Polaris [3] and the Silicon Graphics parallelizing compilers to identify the parallel sections in the numerical and non-numerical applications, respectively. Then, we compute the IdealAmdahl execution time of the applications by taking P.host(alone) and dividing by two the time taken by the code marked parallel.
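The computation for the ideal 2-P.host system follows Amdahl's law and can be sketched as below. The function names and the sample numbers are illustrative assumptions; only the rule of halving the time of the code marked parallel comes from the text.

```python
def ideal_amdahl_time(host_alone_time, parallel_time, processors=2):
    """Execution time when only the parallel part is divided among processors.

    host_alone_time: total time of the application on one P.host.
    parallel_time:   portion of that time spent in code marked parallel.
    """
    serial_time = host_alone_time - parallel_time
    return serial_time + parallel_time / processors


def ideal_amdahl_speedup(host_alone_time, parallel_time, processors=2):
    """Speedup of the ideal multi-host system relative to P.host(alone)."""
    return host_alone_time / ideal_amdahl_time(
        host_alone_time, parallel_time, processors
    )


# Illustrative numbers: 80% of a 100-unit run is marked parallel.
print(ideal_amdahl_time(100.0, 80.0))     # 20 serial + 40 parallel = 60.0
print(ideal_amdahl_speedup(100.0, 80.0))
```

This bound ignores synchronization, coherence, and load imbalance, which is why the text calls the 2-P.host system ideal.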
We can see from the table that our algorithm delivers speedups that are comparable to this ideal speedup. Indeed, with our algorithm, the speedups of the numerical applications are nearly as high, while the speedups of the non-numerical ones are higher. We note that the 2-P.host system, in addition to being ideal, can be argued to be more expensive than our heterogeneous system.
The second related scenario is a real 2-processor Silicon Graphics Challenge. We ran the numerical and non-numerical applications on such a machine after the Polaris and the Silicon Graphics parallelizing compilers, respectively, had marked the parallel sections. The speedups on such a machine are shown in Table 9. Clearly, the results are not directly comparable to our simulation results because of differences in architecture. However, they give an idea of how parallelizable these applications are.
RELATED WORK
In this paper, we focus on an architecture with host and memory processing. Past work for these architectures [5, 8, 12, 19] largely assumes that the programmer isolates the code sections to run on the memory processors. In addition, that work has mostly focused on executing sections of code on only a set of identical memory processors. Our work, instead, tries to automatically partition the code into sections and then schedule each section on its most suitable processor. Another characteristic of our approach is that it uses a combination of compiler and run-time algorithms to map the code.
Exploiting parallel execution in a heterogeneous environment has been proposed in systems like Globus [6] Consequently, we are in the process of extending our software infrastructure with code parallelization algorithms and data mapping techniques to support such architectures.
