Technological trends require that future scalable microprocessors be decentralized. Applying these trends to memory systems shows that the size of cache accessible in a single cycle will decrease in future generations of chips. Thus, a bank-exposed memory system comprising small, decentralized cache banks must eventually replace a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs.
Introduction
Technological trends are forcing computer architects to reexamine their assumptions. In future technology, wire delay is scaling approximately with feature size, but a combination of decreasing transistor size and increasing die size means that the fraction of the chip
Figure 1: Points in the graph represent commercial microprocessors from that year. Each microprocessor contributes two points, one for its maximum number of instructions per cycle (ILP), and one for its maximum number of primary cache (L1) accesses per cycle (memory parallelism). Memory parallelism has remained at one or two accesses per cycle, while ILP has improved much more.
A truly scalable microprocessor, however, entails more than a scalable processing core; the memory system must scale as well. A recent study shows that cache access time in terms of gate delay scales poorly with capacity and sub-linearly with technology [1]. Thus, the size of cache accessible in a single cycle will decrease in future generations of chips. For example, in the aggressive 2014 technology projected by the SIA technology roadmap [25], a 512-byte cache will require three cycles to access. Furthermore, as shown in Figure 1, the number of cache ports has not scaled with the number of functional units in a microprocessor.
Based on these observations, it is no longer reasonable to design the memory system in a future microprocessor as a single monolithic unit. A conventional hierarchical memory system does not address the issues, because data from the centralized L1 cache may still traverse a potentially large distance to reach the growing number of compute elements that can fit on a chip. To maintain scalability, the memory system must be decentralized. Each processing element is tightly coupled with a memory bank, but it can access all the memory banks through a communication network. Figure 2 shows an abstract view of a scalable, fully decentralized microprocessor. The microprocessor comprises individual autonomous units, each with its own processing element and L1 cache bank. Each processing element has direct access to an L1 cache bank; a communication network can be used to access values in other cache banks. In this organization, cache banks are kept small, fast, and scalable with technology. Furthermore, because computation is mapped to each processing element statically, data can be statically placed near the processing element that needs it. We call such an architecture a bank-exposed architecture, because it allows the compiler to manage the locality of data along with computation.
This paper considers the challenge of utilizing a bank-exposed architecture for sequential programs. It focuses on bank disambiguation, a central problem in attaining good performance from such an architecture. Bank disambiguation is the problem of determining at compile time which bank a memory reference is accessing. A particular load or store instruction is said to be bank disambiguated if the instruction accesses the same compile-time predictable bank every time it is executed. Bank disambiguation is important because it is a prerequisite for the compile-time optimization of data locality. When a memory reference is known to refer to memory on a particular bank, the computation that operates on it can be placed close to that bank. This paper presents Maps, a compiler-managed memory system that performs bank disambiguation for bank-exposed architectures. It presents two complementary bank disambiguation techniques. Equivalence-class unification uses pointer analysis to guide the intelligent placement of data. Modulo unrolling uses intelligent loop unrolling to turn undisambiguated accesses into disambiguated ones.
A good bank disambiguation scheme should satisfy three criteria. First, it should distribute data evenly across the memory banks. It is easy to bank disambiguate all accesses by mapping all data to a single bank, but that is an inefficient use of memory and will likely lead to poor locality. Second, the distribution should lead to good locality, where data is close to the computation that uses it. Finally, when code transformation is involved, the scheme should try to minimize any increase in code size. The methods in this paper aim for balanced distribution and minimal code growth. They provide the opportunity for a back end to optimize for locality, but they do not address the locality issue directly.
The methods in this paper apply to any bank-exposed architecture for both general-purpose and embedded systems [7, 11, 20, 22, 32]. In theory, they may also be used to map sequential programs onto distributed shared memory multiprocessors (DSMs), although the communication latencies on DSMs have historically been too high to profitably exploit the instruction level parallelism extracted by this compiler approach. This paper uses the Raw machine, a bank-exposed architecture, to illustrate its techniques.
The rest of this paper is organized as follows. Section 2 gives the background for our research. Sections 3 and 4 describe our two methods for bank disambiguation, equivalence-class unification and modulo unrolling, respectively. Section 5 presents experimental results. Section 6 describes related work. Section 7 concludes.
Figure 3: A Raw microprocessor is a mesh of tiles, each with a processing element, some memory and a switch. The processing element contains both registers and an ALU. The processing element interfaces with its local instruction memory, local data memory and local switch. The switch contains its own instruction memory.
Background
This section gives the background of our research. First, it describes the Raw architecture, the bank-exposed architecture on which we implement Maps. Then, it overviews the Raw compiler and its interface with Maps.
The Raw architecture
Figure 3 depicts the Raw microprocessor. It consists of a 2-dimensional mesh of tiles. Each tile is composed of a processing element and a cache memory bank. A switch is provided on each tile to communicate with other tiles. Two communication networks connect the tiles: the static network and the dynamic network. The static network is a fast compiler-routed register-level network. Bank-disambiguated accesses to compile-time-known banks either complete over the static network or are local to a tile. The dynamic network is a slower runtime-routed network that serves the role of a conventional memory system's arbitration logic. Accesses to compile-time-unknown banks complete over the dynamic network. Each Raw tile has its own instruction stream; different tiles proceed in a loosely synchronous manner, communicating only for register and control dependences.
The Raw compiler
Rawcc is the Raw parallelizing compiler based on the SUIF compiler infrastructure [33]. It takes a sequential C or Fortran program, extracts its instruction level parallelism, and parallelizes it across the tiles of the Raw machine. Figure 4 gives the major components of Rawcc.
Rawcc comprises two major components. Maps is the memory front end that performs bank disambiguation of memory accesses. It partitions all memory references and data objects into equivalence classes. Each equivalence class is labeled as either a single-tile equivalence class or a low-order-interleaved one. Objects in a single-tile equivalence class are mapped to a single tile. Objects in a low-order-interleaved class must be arrays; they are interleaved element-wise across the tiles. The Maps analysis ensures that memory references can be bank disambiguated if objects are mapped in such a manner. Maps itself is made up of three components. It begins by collecting information from traditional pointer and array analysis. This information is then used to perform analysis for bank disambiguation. The third component deals with accesses that Maps decides not to bank disambiguate. This paper focuses on techniques for bank disambiguation; a description of the handling of non-disambiguable accesses can be found in [5].
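As a minimal sketch of what element-wise low-order interleaving means, the following Python fragment assumes that element i of an array is placed on bank i mod N, where N is the number of banks. The function names are hypothetical and are not part of Maps or Rawcc.

```python
def low_order_bank(index, num_banks):
    """Bank holding array element `index`, assuming low-order interleaving."""
    return index % num_banks


def bank_layout(length, num_banks):
    """List, for each bank, the element indices it would hold."""
    banks = [[] for _ in range(num_banks)]
    for i in range(length):
        banks[low_order_bank(i, num_banks)].append(i)
    return banks


# A 12-element array spread over 4 banks.
layout = bank_layout(12, 4)
for bank, elements in enumerate(layout):
    print(f"bank {bank}: {elements}")
```

Under this assumption consecutive elements land on consecutive banks, which is what lets neighboring array accesses proceed in parallel.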
The space-time scheduler is the back end of Rawcc. It performs tasks related to the mapping of instruction level parallelism to the Raw tiles. In addition, it is responsible for mapping each equivalence class of data objects to the memory bank of a specific Raw tile. The space-time scheduler performs this mapping with two goals: it tries to map equivalence classes that are rarely concurrently accessed to the same physical bank, and it tries to map accesses close to the computation that needs them. See [17] for details.
Equivalence-class unification
This section describes equivalence-class unification, our first method for bank disambiguation. Equivalence-class unification (ECU) is our baseline disambiguation technique. It is applicable to all memory accesses, including arbitrary array accesses, pointer dereferences, structure references, and heap references. ECU provides disambiguation through careful data placements that are guided by pointer analysis. Section 3.1 gives an introduction to pointer analysis, while Section 3.2 describes ECU itself. Throughout this section, we use Figure 5 as an expository example.
Pointer analysis
Pointer analysis is a compile-time technique that, for every memory-reference instruction, determines the data objects to which the instruction can possibly refer. Maps uses SPAN [27], a state-of-the-art pointer-analysis package that provides an inter-procedural, flow-sensitive, and context-sensitive pointer analysis.
To understand pointer analysis, consider the input program in Figure 5(a). Figure 5(b) shows the results of the SPAN pointer analysis package on the program. SPAN assigns a unique number, called a location-set number, to each abstract object in the program. An abstract data object is either a stack-allocated variable declaration in the program or a group of dynamic objects created by the same heap-memory allocation call site in the program. An entire array is considered a single object, but each field in a structure is considered a separate object. Figure 5(b) shows the abstract data objects marked with assign comments, with the location-set numbers for the objects listed alongside the comment. Finally, pointer analysis annotates each memory reference instruction with a location-set list, a list of location-set numbers corresponding to the objects to which the memory reference can refer. In Figure 5(b), each memory reference is annotated with its location-set list, shown as ref comments. For simplicity, location-set numbers are shown only for objects that have program loads or stores to them; in the compiler all objects are assigned such numbers. Dotted edges represent potential memory dependences derived from pointer analysis: two memory references are potentially dependent if (1) the intersection of their location-set lists is non-empty, and (2) one of the accesses is a store.
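The dependence test stated above can be sketched in a few lines of Python. The record layout (a dictionary with `locsets` and `is_store` keys) and the location-set numbers are hypothetical and chosen only for illustration; they are not SPAN's data structures or the values in Figure 5.

```python
def may_depend(ref_a, ref_b):
    """Two references are potentially dependent if their location-set
    lists intersect and at least one of them is a store."""
    shares_object = bool(set(ref_a["locsets"]) & set(ref_b["locsets"]))
    has_store = ref_a["is_store"] or ref_b["is_store"]
    return shares_object and has_store


# Hypothetical annotated references.
store_via_pointer = {"locsets": [1, 5], "is_store": True}
load_of_5 = {"locsets": [5], "is_store": False}
load_of_1 = {"locsets": [1], "is_store": False}
store_to_7 = {"locsets": [7], "is_store": True}

print(may_depend(store_via_pointer, load_of_5))  # overlap and a store
print(may_depend(load_of_5, load_of_1))          # two loads, no overlap
print(may_depend(store_via_pointer, store_to_7)) # stores, but disjoint lists
```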
Equivalence-class unification method
Figure 5 helps explain the ECU method through an example. First, ECU runs the pointer analysis described above: Figure 5(b) shows the results of pointer analysis. Next, ECU represents the pointer analysis information as a bipartite graph of data objects and memory references. Figure 5(c) shows the bipartite graph for the program in Figure 5(b). A node is constructed for each abstract object and for each memory reference. The upper row shows the abstract objects, with the location-set number for each object in parentheses. The lower row shows the memory references. Edges are constructed from each memory reference to all the abstract objects whose numbers are in the reference's location-set list.
Subsequently, ECU defines alias equivalence classes from the bipartite graph. Alias equivalence classes form the finest partition of the location-set numbers such that each memory access refers to location-set numbers in only one class. ECU derives the equivalence classes by computing the connected components of the bipartite graph. Figure 5(d) shows the bipartite graph in Figure 5(c) partitioned into four equivalence classes. References in the same alias class may potentially refer to the same object, while references in different classes never refer to the same object.
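One way to compute such connected components is a union-find pass over the location-set lists, sketched below. This is an illustrative implementation under our own assumptions, not the code used in Maps; the reference names and location-set numbers are hypothetical and do not reproduce Figure 5.

```python
def alias_classes(ref_locsets):
    """Group location-set numbers into alias equivalence classes.

    ref_locsets maps each memory reference to its location-set list.
    All numbers in one list are merged, which yields the connected
    components of the bipartite reference/object graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for locsets in ref_locsets.values():
        for number in locsets:
            find(number)                   # register every object seen
        for other in locsets[1:]:
            union(locsets[0], other)

    classes = {}
    for number in list(parent):
        classes.setdefault(find(number), set()).add(number)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical references: r2 may touch objects 1 or 5, so they merge.
refs = {"r1": [1], "r2": [1, 5], "r3": [2], "r4": [6], "r5": [7], "r6": [5]}
classes = alias_classes(refs)
print(classes)
```

Objects that no reference touches never enter the table in this sketch; a fuller version would add each of them as a singleton class.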
Finally, each equivalence class is mapped to a single tile. Figure 5(e) shows a sample mapping using the equivalence classes in Figure 5(d). The mapping of equivalence classes to memory banks is performed by the space-time scheduler, the back end of the Raw compiler. The space-time scheduler performs this mapping with two goals: it tries to map virtual banks that are rarely concurrently accessed to the same physical bank, and it tries to map accesses close to the computation that needs them. See [17] for details.
Quality of the disambiguation. The quality of the disambiguation from ECU depends upon the number and size of the alias classes. A large number of small classes gives the most memory parallelism, since accesses mapped to different classes can execute in parallel.
The number and size of the classes depend on the access patterns of the program, which the compiler cannot control. Nevertheless, our results in Section 5 show that many programs contain several alias classes. Finding any more than one class enables us to remove the bottleneck of a centralized memory system.
Figure 5: Example of equivalence-class unification: (a) input program; (b) results of pointer analysis; (c) bipartite graph of abstract objects and memory references; (d) alias equivalence classes; (e) sample mapping of the classes to tiles.
Modulo unrolling
The major limitation of equivalence-class unification is that an array belongs to one equivalence class and is mapped to only one bank. This section presents modulo unrolling, a technique for attaining bank disambiguation and memory parallelism for arrays. Modulo unrolling is applicable to array references whose index expressions are affine functions of enclosing loop induction variables. Such accesses are common in dense-matrix scientific codes as well as some multimedia and streaming applications. This section is organized as follows. Section 4.1 illustrates modulo unrolling through an example. Section 4.2 describes modulo unrolling and its scope. Section 4.3 proves that the unroll factor selected by modulo unrolling is necessary and sufficient. Section 4.4 discusses the issue of code growth.
Example
Figure 6 gives an example of modulo unrolling. Figure 6(a) shows the code fragment forming the compiler input, consisting of a simple for loop containing a single array reference A[i]. The array A[] is indexed from 0 to 99. To enable parallel accesses, the compiler distributes A[] among 4 memory banks using low-order interleaving. This distribution, however, makes A[i] non-bank-disambiguable, because it touches data on all four banks.
In special cases, full unrolling can attain bank disambiguation. Figure 6(b) shows the sample loop in Figure 6(a) fully unrolled. This solution, however, is only possible if the loop bounds are known, and it is only reasonable if the iteration count is small.
Figure 6: Example of modulo unrolling: (a) the original loop; (b) the loop fully unrolled; (c) the loop unrolled by a factor of four. In each panel, the elements of A[] are low-order interleaved across Banks 0 to 3.
Modulo unrolling uses a modest amount of intelligent unrolling to make the example access disambiguable. In the example, it unrolls the loop by a factor of four, as shown in Figure 6(c). After the unrolling, each access refers to elements on the same memory bank: A[i] to tile 0, A[i + 1] to tile 1, A[i + 2] to tile 2, and A[i + 3] to tile 3. Thus, each access in the unrolled loop has become bank disambiguated. Furthermore, the accesses can proceed in parallel, thus providing memory parallelism.
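The effect can be checked with a small Python sketch of the example: a 100-element array low-order interleaved over four banks, with element i assumed to reside on bank i mod 4. The helper name `banks_touched` is hypothetical.

```python
NUM_BANKS = 4
ARRAY_LEN = 100


def banks_touched(first_index, step):
    """Set of banks visited by one static access A[i] as i runs
    from first_index to the end of the array in strides of step."""
    return {i % NUM_BANKS for i in range(first_index, ARRAY_LEN, step)}


# Original loop: a single access A[i] with step 1 touches every bank.
original = banks_touched(0, 1)

# Loop unrolled by 4: accesses A[i], A[i+1], A[i+2], A[i+3] with step 4.
unrolled = [banks_touched(offset, NUM_BANKS) for offset in range(NUM_BANKS)]

print("original access:", sorted(original))
for offset, banks in enumerate(unrolled):
    print(f"A[i + {offset}]:", sorted(banks))
```

Each of the four unrolled accesses stays on exactly one bank, so each is bank disambiguated in the sense defined earlier.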
Modulo unrolling method
Modulo unrolling is a technique for bank disambiguation that is applicable to array references whose index expressions are affine functions of enclosing loop induction variables. This section describes the method.
Modulo unrolling works as follows. First, the compiler looks for affine array accesses inside loop nests. For each array access and each loop, it computes the minimum unroll factor required on the loop in order for the access to be bank disambiguated. Once the compiler has computed the induced unroll factors for each loop and each affine access, the final unroll factor for a loop is the least common multiple (lcm) of the unroll factors induced on it by the affine accesses it encloses. Section 4.4 shows that the overall code growth from modulo unrolling is bounded by the number of memory banks in most cases, even for nested loops.
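The final combination step can be sketched as follows. The induced factors in the example are made-up inputs; how each one is derived is the subject of the later theorems and is not modeled here.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def loop_unroll_factor(induced_factors):
    """Final unroll factor for one loop: the lcm of the factors
    induced on it by the affine accesses it encloses."""
    return reduce(lcm, induced_factors, 1)


# Hypothetical loop enclosing three affine accesses that induce
# unroll factors of 2, 4 and 1 respectively.
print(loop_unroll_factor([2, 4, 1]))
```

A loop that encloses no affine access gets a factor of 1, that is, it is left as it is.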
Let N be the number of memory banks in the target software-exposed architecture. We define the following:
Definition 4.1 Given a k-dimensional (not necessarily perfectly nested) loop nest of the form:
for v 1
Within its framework, modulo unrolling handles imperfectly nested loops, non-unit loop step sizes, hand-linearized multidimensional arrays, and unknown loop bounds. Both imperfectly nested loops and non-unit loop step sizes are handled naturally without any special case. Hand-linearization of multidimensional arrays does not pose a problem, because the transformation preserves the affine property: a linear combination of affine functions is also affine, and the offset of any array element from its base remains unchanged under hand-linearization. Only unknown loop bounds require additional transformation beyond that required in the basic framework [5].
Deriving the unroll factors
This section proves that unrolling each loop in a loop nest by a certain factor disambiguates all affine array accesses in that nest. The proof also derives the formula above for the minimum required unroll factor $U_j$. We inherit the definitions in Section 4.2. Additional variables needed for the proof are defined when needed.
A roadmap for the proof follows. First, two supporting theorems involving modular arithmetic are proved, namely, the product modulo theorem (Theorem 4.2) and the sum-of-products modulo theorem (Theorem 4.3). Then, a formula for the address of an array access is defined for row-major addressing (Definition 4.4). Next, the condition for bank disambiguation of affine accesses is represented as a requirement on the step size after unrolling is performed (Theorem 4.5). After that, the unroll factor implied by the step size required after unrolling is shown to result in the minimum code-size increase among unrolls that provide bank disambiguation (Theorem 4.6). Finally, for each loop in the loop nest containing the affine access, the formula for the actual unroll factor is derived, such that the required step size after unrolling is attained (Theorem 4.7).
In all the proofs that follow, all variables and constants introduced are integers. The proof begins by supplying two supporting theorems involving modular arithmetic, Theorems 4.2 and 4.3. The following theorem derives the condition for memory bank disambiguation for an affine function access of the form in Definition 4.1.
The following theorem derives the final result, i.e., the value of the unroll factor $U_j$ in terms of the step size after unrolling, $D_j$.
Theorem 4.7 (Unroll factor formula) In order to attain the value of $D_j$ in Theorem 4.5, we need to unroll the $j$th loop of the loop nest ($1 \le j \le k$) by a factor $U_j$ given by $U_j = \mathrm{lcm}(D_j, s_j) / s_j$.
Proof Unrolling loop $j$ produces step sizes that are multiples of $s_j$. From Theorem 4.5, an unrolled step size necessary for bank disambiguation is any multiple of $D_j$. Thus, the lowest attainable step size that results in disambiguation is $\mathrm{lcm}(D_j, s_j)$. The necessary unroll factor $U_j$ to reach this step size is the unrolled step size divided by the initial step size: $U_j = \mathrm{lcm}(D_j, s_j) / s_j$. $\Box$
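The formula in Theorem 4.7 is easy to evaluate directly. The sketch below takes the required step size $D_j$ as a given input, since the statement of Theorem 4.5 is not reproduced here; the sample values of `d_j` and `s_j` are assumptions chosen for illustration.

```python
from math import gcd


def unroll_factor(d_j, s_j):
    """U_j = lcm(D_j, s_j) / s_j for a loop with original step s_j
    and required step size D_j after unrolling."""
    lcm = d_j * s_j // gcd(d_j, s_j)
    return lcm // s_j


# Assumed required step size D_j = 4 with various original steps.
for step in (1, 2, 3, 4):
    print(f"s_j = {step}: U_j = {unroll_factor(4, step)}")
```

When the original step already is a multiple of $D_j$ (here `s_j = 4`), the factor is 1 and no unrolling is needed; a step of 1 requires the full factor of 4, matching the example of Section 4.1 under the assumption that $D_j = 4$ there.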
Bounds in code growth and the padding optimization
This section examines the increase in code size implied by modulo unrolling. Code growth is an undesirable side effect of modulo unrolling. Note, however, that the unrolling required by modulo unrolling can often be combined with the unrolling used to expose instruction level parallelism. This combination can help reduce the unrolling overhead.
In this section, we first derive the worst-case code growth. Then, we describe a padding optimization that can reduce the code growth. Finally, we present an example that demonstrates the application of the modulo unrolling formulas, both with and without the padding optimization.
Bounds on unroll factors. Unrolling incurs the cost of increased code size. To establish a bound, we show that the unroll factor $U_j$ derived in Theorem 4.7 is provably at most $N$, the number of banks. From Theorem 4.5, $D_j = N / (\text{a positive integer}) \le N$. Inserting into Theorem 4.7, $U_j = \mathrm{lcm}(D_j, s_j) / s_j \le D_j s_j / s_j = D_j \le N$.
In the worst case, since all $k$ loops in the loop nest may be unrolled $N$ ways, the overall code growth is at most a factor of $N^k$. For $k \ge 2$, $N^k$ can be large. In practice, however, for most affine accesses, the overall code growth can often be limited to $N$ irrespective of $k$ by applying the padding optimization discussed later in this section.
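As a sanity check on these bounds, the following sketch exhaustively tests $U_j \le D_j \le N$ for every $D_j$ that divides $N$ and a range of step sizes, and evaluates the worst-case growth $N^k$. The step-size limit of 64 and the function names are arbitrary choices made for the illustration.

```python
from math import gcd


def unroll_factor(d_j, s_j):
    """U_j = lcm(D_j, s_j) / s_j, as in Theorem 4.7."""
    return (d_j * s_j // gcd(d_j, s_j)) // s_j


def bound_holds(num_banks, max_step=64):
    """Check U_j <= D_j <= N for every divisor D_j of N and every
    step size from 1 to max_step."""
    divisors = [d for d in range(1, num_banks + 1) if num_banks % d == 0]
    return all(
        unroll_factor(d, s) <= d <= num_banks
        for d in divisors
        for s in range(1, max_step + 1)
    )


def worst_case_growth(num_banks, depth):
    """Upper bound N^k on code growth for a k-deep loop nest."""
    return num_banks ** depth


print(bound_holds(32))
print(worst_case_growth(32, 2))
```

For a 32-bank machine and a doubly nested loop the unrestricted bound is already 1024, which is why limiting the growth to $N$ matters in practice.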
A final observation regarding code growth is that the decision of whether to modulo unroll a nested loop is a local decision. If the code growth from modulo unrolling is deemed excessive for one nested loop, the compiler can choose not to unroll the loop without adversely affecting the modulo unrolling decisions in the rest of the program.
Padding optimization. For many affine functions that occur in practice, a simple optimization enables us to restrict the overall code growth and to greatly simplify the code generation. This optimization is the padding optimization, which involves padding the last array dimension size to be a multiple of $N$ for all arrays. To see how, we first derive a simpler expression for $D_j$ than the one in Theorem 4.5, in the case when the padding optimization is performed.
It can be shown that, since the value of $X$ does not matter for this result, the result holds for cases when only the last dimension of the array reference is affine. Most affine functions that occur in real programs are of the simple-index last dimension class. Some references that have non-affine expressions in all but the last array dimension are also in this class. The following theorem shows that for this class, at most one of the enclosing loops is unrolled by modulo unrolling. Thus, for array references that have a simple-index last dimension, Theorem 4.10 shows that the code-size growth is no more than $N$, irrespective of the depth of the loop nest.
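The intuition behind padding can be illustrated with a small sketch. It assumes row-major addressing and low-order interleaving (element at linear offset a on bank a mod N); the array shape and function names are hypothetical. Once the row length is a multiple of N, the bank of A[i][j] depends only on the last index j.

```python
def padded_last_dim(last_dim, num_banks):
    """Smallest multiple of num_banks that is >= last_dim."""
    return -(-last_dim // num_banks) * num_banks


def banks_for_column(col, num_rows, row_len, num_banks):
    """Banks touched by A[row][col] over all rows, assuming row-major
    layout with rows of length row_len."""
    return {(row * row_len + col) % num_banks for row in range(num_rows)}


NUM_BANKS = 4
row_len = 10                                   # hypothetical 8 x 10 array
padded_len = padded_last_dim(row_len, NUM_BANKS)

unpadded = banks_for_column(1, 8, row_len, NUM_BANKS)
padded = banks_for_column(1, 8, padded_len, NUM_BANKS)

print("padded row length:", padded_len)
print("banks for column 1 without padding:", sorted(unpadded))
print("banks for column 1 with padding:", sorted(padded))
```

Without padding the bank changes as the row index changes, so the loop over rows would also need unrolling; with padding only the loop that drives the last index matters.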
In some cases, the padding optimization may fail to bound the overall code growth to $N$. Such cases include those where the affine functions are not simple index functions, as well as cases where the loop nest contains multiple simple index functions that induce unrolls on different loops of the loop nest. For cases where the predicted code growth is prohibitive, modulo unrolling can operate on a subset of array accesses to reduce the unroll factor; only the accesses in the subset will become bank disambiguated.
Results
This section presents results for the Maps memory system in the context of the Raw architecture. Application programs are compiled with Rawcc and simulated on a simulator of the Raw architecture as described in Section 2. The processing element on each Raw tile implements the MIPS R4000 instruction set augmented with network access instructions. Latencies of the basic instructions are as follows: 2-cycle load; 1-cycle store; 1-cycle integer add or subtract; 2-cycle integer multiply; 36-cycle integer divide; 4-cycle floating-point add, subtract, or multiply; and 10-cycle floating-point divide. Except for divides, all basic floating-point operations are fully pipelined. The simulator simulates infinite data and instruction memories on chip.
Two sets of results are presented. Section 5.1 compares application performance with varying amounts of bank disambiguation support. Results show that in the 32-tile case, Maps improves performance by a factor of 3 to 5 for a broad range of programs. Section 5.2 presents a more detailed analysis of the performance with our bank disambiguation techniques. Benchmarks include several dense-matrix applications, multimedia applications, and applications with irregular memory access patterns. All the benchmarks are ordinary sequential programs written for a unified address space. Rawcc compiles them without any user directives or pragmas of any kind. All speedups were attained with our automated compiler without user intervention. Because the Raw machine does not support double-precision floating point, all floating-point operations are converted to single precision.
End-to-end performance
We first present results for end-to-end application performance with a varying degree of bank disambiguation support. In all cases, instruction level parallelism is extracted and exploited across the tiles of the Raw machine. Performance is collected for three types of disambiguation support: trivial, equivalence-class unification only (ECU only), and ECU with modulo unrolling. In trivial support, the compiler has no intelligent disambiguation information. With no information, the compiler generally has two options: leave memory accesses undisambiguated, or perform trivial bank disambiguation by mapping all objects to one tile. On Raw, undisambiguated accesses happen to be very expensive due to software overhead [5], so we select trivial bank disambiguation as our baseline technique. This method models the cost of centralization in Raw's memory system, but it does not model the penalty due to extra capacity misses in a finite-sized cache. In ECU only, different equivalence classes provided by ECU are mapped to different tiles. In ECU with modulo unrolling, non-array equivalence classes are distributed, and arrays accessed through affine references are low-order interleaved across the memory banks.
In these results, we expect end-to-end performance to improve with the quality of bank disambiguation for two reasons. First, bank disambiguation improves memory parallelism, allowing multiple concurrent memory accesses. Second, good disambiguation allows data to be "parallelized" with the computation, so that the data can reside close to the computation that uses it.
Figure 7 compares the performance of the three bank disambiguation schemes on a Raw machine with 32 tiles. The baseline for all three strategies is the execution time of the sequential program running on one tile, with one functional unit and one memory bank. The results show that ILP without memory parallelism yields poor performance. While using ILP alone gives a speedup in the range of 1 to 4, memory parallelism can increase performance substantially. ECU increases the speedup on average by a factor of two beyond using ILP alone, boosting it to between 2 and 6. Modulo unrolling further improves speedup to between 7 and 24 in applications where it is applicable. Overall, the methods in this paper deliver an additional factor of 3 to 5 in performance over using ILP alone for our benchmarks.
Detailed application results
Table 2 shows benchmark speedups for a varying number of tiles on Raw, using our bank disambiguation techniques. The numbers in the last column, for N = 32, are identical to the ILP + ECU + modulo unrolling numbers in Figure 7. We discuss some overall trends.
Benchmark performance can be classified into several types. Dense-matrix programs performed very well, attaining multiprocessor-like speedups on a microprocessor. The performance is due largely to modulo unrolling. Among the multimedia applications, two, Adpcm and SHA, attain low speedups while the remaining three attain high speedups. For Adpcm, the code is inherently serial for the most part; for SHA, while some ILP is available, it is too fine-grained for our current techniques to exploit. The other three multimedia applications benefit significantly from memory parallelism. The results on two of them, MPEG-kernel and FIR-filter, are especially encouraging as these are key components of the emerging workloads of the future involving audio, image and video data.
The irregular applications include Fppp-kernel, Moldyn and Unstructured. Fppp-kernel consists of a large, time-intensive basic block with much parallelism, and it mainly accesses scalar data. Since only scalar variables are present, the benchmark does not benefit from ECU and modulo unrolling. Moldyn and Unstructured are more typical examples of scientific programs with irregular access patterns. Both Moldyn and Unstructured are run in two versions: one using structures, and the other using arrays in place of the structure fields. For these irregular applications, Maps is able to improve performance by a factor of 2 to 3 compared to using ILP alone.
Memory distribution. We measure the distribution of memory across the tiles. In general, balanced data distribution is desirable because it minimizes the per-tile memory needed to run an application, and it alleviates the need to build large and centralized memory that is also fast. Figure 8 shows the distribution of primary data across tiles for our benchmarks executing on 32 tiles. Most dense-matrix codes can fully distribute their data; Swim and Cholesky can only partially distribute their data because of their small problem sizes, but their distributions become balanced with larger problem sizes. Load balance in the other applications is limited by two factors: the limited number of equivalence classes, and the unequal size of the classes.
Memory bandwidth utilization. We measure how well an application takes advantage of Raw's independent memory banks. Bandwidth utilization depends on the amount of memory parallelism exposed by Maps and the amount of parallelism in the application. Figure 9 measures the weighted memory bandwidth utilization of a 32-tile machine. It plots the percentage of memory references being issued in a clock cycle when a given number of tiles is simultaneously issuing memory requests. The sum of the percentages for any one application is 100%. For example, for Cholesky, almost 20% of the memory references are issued in a cycle in which a total of 7 memory references are issued in all the tiles. Results show that except for the two highly serial benchmarks (Adpcm and SHA), all the benchmarks are able to exploit at least a small amount of parallel memory bandwidth. The figure shows that most of our benchmarks are indeed able to take advantage of the many ports in the memory system.
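One plausible reading of this weighted metric can be sketched as follows, assuming a trace that records, for each clock cycle, how many memory references were issued across all tiles. The trace values and the function name are invented for illustration and are not simulator output.

```python
from collections import Counter


def weighted_utilization(refs_per_cycle):
    """For each k, the percentage of all memory references that were
    issued in a cycle in which exactly k references were issued."""
    total = sum(refs_per_cycle)
    weighted = Counter()
    for count in refs_per_cycle:
        if count > 0:
            weighted[count] += count   # weight each cycle by its references
    return {k: 100.0 * v / total for k, v in sorted(weighted.items())}


# Invented five-cycle trace: 8 references in total.
trace = [1, 1, 2, 0, 4]
utilization = weighted_utilization(trace)
print(utilization)
```

Weighting by references rather than by cycles is what makes the percentages for one application sum to 100%, as stated above.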
Related work
Bank-exposed general-purpose microprocessors date back to as early as 1983, when Josh Fisher proposed the ELI-512 VLIW machine [11]. The machine is a bank-exposed architecture with a point-to-point network connecting its processing elements, each with an exposed memory bank. It provides two ways to access memory: a fast "front door" that directly addresses a particular bank, and a slower "back door" that can address any bank. Bank disambiguation is equivalent to the memory bank prediction described by Fisher [11]. He explores memory bank prediction in the context of the ELI-512 VLIW machine. On this machine, successful bank disambiguation allows faster memory accesses through the front door. Fisher does not provide any automatic way to perform bank disambiguation, but he observes that unrolling can sometimes help disambiguate accesses. This observation forms the basis for modulo unrolling, our fully automated technique.
Since then, work on bank disambiguation has mostly been confined to the DSP community. Saghir, Chow, and Lee [28] have developed a method for exploiting digital-signal processing chips with dual memory banks. Their approach examines each memory reference and constructs an interference graph that represents how frequently data objects can be accessed in parallel. A min-cut algorithm is then used to partition the objects across the memories. Similarly, Sudarsanam and Malik [30] also exploit the use of dual memory banks in DSPs, with the additional constraint that each register is tied to a specific memory bank. They use both a greedy algorithm and a simulated annealing technique based on an interference graph. Unlike Maps, these approaches do not deal with pointer aliasing, nor do they attempt to partition arrays. The common partitioning problem they solve is analogous to the problem of mapping virtual object partitions to physical tiles in the Raw compiler. In the Raw compiler, this problem is solved by the space-time scheduler [17], the back end of Rawcc.
Other researchers have parallelized some of the benchmarks in this paper for multiprocessors. Automatic parallelization has been demonstrated to work well for dense-matrix scientific codes [6, 13]. In addition, some irregular scientific applications can be parallelized on multiprocessors using the inspector-executor method [10]. Typically these techniques involve user-inserted calls to a runtime library such as CHAOS and are not automatic [21]. The programmer is responsible for recognizing cases amenable to such parallelization, namely those where the same communication pattern is repeated for the entire duration of the loop. In contrast, the Maps approach exploits instruction level parallelism and is thus more generally applicable.
The literature includes many kinds of memory disambiguation. Most of them are unrelated to bank disambiguation, which is concerned with the location of a reference. Rather, they are usually concerned with the dependence relation between references. Disambiguation of this type includes relative memory disambiguation [18], run-time disambiguation [23], dynamic memory disambiguation [8, 12], and affine-memory disambiguation [2, 4, 19, 34].
Conclusion
This paper presents Maps, a memory system for bank-exposed architectures. Maps provides memory parallelism through a compiler-managed set of decentralized memory banks. This approach contrasts with the centralized view of memory maintained by existing microprocessors, which inhibits scalability due to the need for centralized dependence checking hardware and long wires. The system supports sequential programs and is transparent to the programmer, thus requiring no extra programming effort.
This paper focuses on compile-time bank disambiguation, the main problem in exploiting memory parallelism on a bank-exposed architecture. Bank disambiguation is the act of ensuring that a memory reference refers to data on only one memory bank. Two methods for bank disambiguation are presented. Equivalence-class unification uses pointer analysis to partition data into classes that can be mapped to different banks without disturbing bank disambiguation. Modulo unrolling uses unrolling to enable the bank disambiguation of affine accesses to low-order interleaved arrays.
We are encouraged by the results of the Maps approach to providing memory parallelism. Experimental results demonstrate that our disambiguation methods improve performance by a factor of three to five. Maps is able to exploit memory parallelism in a range of applications, from those containing small amounts of memory parallelism to more regular applications with large amounts of memory parallelism. This versatility opens up a range of possible applications for Maps. From small embedded designs to desktop microprocessor-based systems to supercomputers, machines with exposed memory banks can benefit from our techniques.
