Existing scratchpad memory (SPM) allocation algorithms for arrays, whether they rely on heuristics or resort to integer linear programming (ILP) techniques, typically assume that every array is small enough to fit directly into the SPM. As a result, some arrays have to be spilled entirely to the off-chip memory in order to make room for other arrays to stay in the SPM, which can lead to poor SPM utilization.
INTRODUCTION
Hardware-managed caches have traditionally been used to bridge the ever-widening performance gap between processor and memory. Despite this great success, some deficiencies of caches are well known. First, their complex hardware logic incurs high overhead in power consumption and area. Second, their simple application-independent management strategy cannot exploit the data access characteristics of many applications. Finally, their uncertain access latencies make it difficult to guarantee real-time performance in real-time applications.
In contrast, software-managed scratchpad memory (SPM) has advantages in power, area, real-time guarantees and performance [1]. Thus, SPM is widely adopted in embedded systems, stream architectures (known as stream register file, local memory or streaming memory), and GPUs (known as shared memory in NVIDIA GPUs under its CUDA programming model). In the case of supercomputers, software-managed on-chip memory is also frequently used, especially in their accelerators. Examples include Merrimac [5], Cyclops64 [4], Grape-DR [17] and Roadrunner [2].
Unlike cache-based machines, machines with SPMs require software to explicitly and carefully manage data allocation in the SPM and make full use of the scarce on-chip memory. Manual SPM management is impractical and error-prone, leading to non-portable code.
Many compiler approaches, static or dynamic, for SPM allocation have been proposed. Dynamic approaches, which allow arrays to be swapped into and out of SPM during run time, are known to outperform their static counterparts. However, the proposed dynamic SPM allocators typically assume that every array candidate is small enough to fit directly into the SPM. As a result, some arrays have to be spilled entirely to the off-chip memory to make room for other arrays to stay in the SPM, resulting in sub-optimal solutions.
Data tiling [12, 10], which partitions a large array into smaller subarray tiles, was originally proposed to improve the cache performance of regular loop kernels, i.e., loop kernels whose data dependences are mostly constant or uniform. This data transformation technique was later applied to improve utilization of SPM by copying the subarray tiles of an array, one at a time, rather than the entire array itself between SPM and off-chip memory [11, 16]. While effective for certain programs, these earlier methods exhibit some deficiencies, as discussed in Section 2.
In this paper, we introduce a new comparability graph coloring allocator that integrates for the first time data tiling and SPM allocation for arrays by tiling arrays on-demand to improve utilization of the SPM. Central to graph coloring is the notion of array interference graph. Given a program, its array interference graph G consists of the nodes representing all the arrays in the program and the edges between two nodes representing the fact that the two nodes have overlapping live ranges (i.e., code regions) and thus cannot be placed in overlapping spaces in the SPM.
We propose to solve the SPM allocation problem for G by first completing it into a comparability graph G ′ and then finding an optimal acyclic orientation α for G ′ . As a result, the SPM space required by G is bounded from below by the heaviest (directed) path P in the directed graph of G ′ induced by α. This observation motivates us to perform data tiling on-demand during the iterative scratchpad allocation process (common in a graph coloring allocator). The novelty of our approach lies in identifying the heaviest directed path this way during each iteration step and reducing its weight (if it is still larger than the SPM size) by tiling certain arrays on the path appropriately with respect to the size of the SPM under consideration. The optimal tile sizes required can be found either analytically or numerically with a cost-benefit analysis. In the case when data tiling is not possible, some arrays are spilled entirely to reduce the weight of P.
This paper makes the following contributions:
• We present a new SPM allocator that combines data tiling and scratchpad allocation to improve utilization of SPM.
• We propose to perform demand-driven data tiling during graph coloring scratchpad allocation so that tile sizes can be found with a cost-benefit analysis.
• We demonstrate the effectiveness of our SPM allocator by using a number of benchmark kernels for which existing allocators are ineffective (if tiling is not applied).
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces some basic results required to understand our approach. Section 4 presents our algorithm. Section 5 evaluates our approach. Section 6 concludes the paper.
RELATED WORK
Existing approaches for SPM allocation are either static or dynamic. The static approaches are known to be less efficient than dynamic counterparts. The proposed dynamic approaches can be roughly divided into two classes, those that resort to ILP [21] and those that rely on some well-crafted heuristics [11, 19, 20, 14, 15, 13] . The ILP-based approaches are theoretically optimal but too expensive to be practical for many applications. Among the heuristics-based approaches, graph coloring [14] seems to achieve the best performance for general-purpose applications [13] with interval coloring delivering better performance for embedded applications [15] . However, these existing SPM allocators always spill an array entirely whenever the SPM space is insufficient, resulting in sub-optimal solutions.
Kandemir et al. [11] , Zhang and Kurdahi [23] and Li et al. [16] apply data tiling [12] to improve utilization of SPM. However, the methods described in [11, 23] are restricted to the matrix multiplication kernel only while [16] relies on ILP to find optimal tile sizes to tile user-specified arrays in an ILP-based allocator.
Fabri [6] discovered the connection between interval coloring and compile-time memory allocation. Li et al. [15] apply interval coloring to assign arrays in embedded programs to SPM. Yang et al. [22] apply comparability graph coloring to optimize utilization of the stream register file in a stream architecture, with the assumption that every stream candidate (i.e., array) is small enough so that it can be placed entirely in a stream register file.
BACKGROUND
This section recalls some basic results about interval coloring and comparability graph coloring from [7] as well as minimal comparability completion from [9] , providing a basis for understanding our proposed approach.
Interval Coloring vs. SPM allocation
The SPM allocation problem can be naturally solved by interval coloring as formulated below. Allocating SPM spaces to array live ranges in an array interference graph, IG, is represented by an assignment of intervals to the nodes in the IG. Minimizing the span of intervals amounts to minimizing the required SPM size. Given an undirected graph G = (V, E) with the function w mapping nodes to positive integer weights, an interval coloring α assigns to every node x an interval $\alpha_x$ of width w(x) such that adjacent nodes receive disjoint intervals. The total width of an interval coloring α is $\chi_\alpha(G; w) = |\bigcup_{x \in V} \alpha_x|$. The chromatic number χ(G; w) is the smallest total width needed to color the nodes in G, which corresponds to the optimal SPM allocation.
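The formulation above can be made concrete with a minimal Python sketch. It assumes intervals are half-open integer ranges `[start, end)` and that adjacent nodes must receive disjoint intervals; the function names and the three-node example graph are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


def is_valid_interval_coloring(weights: Dict[str, int],
                               edges: Iterable[Tuple[str, str]],
                               coloring: Dict[str, Interval]) -> bool:
    """Check interval widths and disjointness for every interfering pair."""
    for node, (start, end) in coloring.items():
        if end - start != weights[node]:
            return False
    for x, y in edges:
        (sx, ex), (sy, ey) = coloring[x], coloring[y]
        if sx < ey and sy < ex:  # the two half-open intervals overlap
            return False
    return True


def total_width(coloring: Dict[str, Interval]) -> int:
    """Size of the union of all intervals."""
    total, covered_until = 0, None
    for start, end in sorted(coloring.values()):
        if covered_until is None or start > covered_until:
            total += end - start
            covered_until = end
        elif end > covered_until:
            total += end - covered_until
            covered_until = end
    return total


# Hypothetical example: a interferes with b, and b interferes with c.
weights = {"a": 4, "b": 3, "c": 5}
edges = [("a", "b"), ("b", "c")]
coloring = {"a": (0, 4), "b": (5, 8), "c": (0, 5)}
print(is_valid_interval_coloring(weights, edges, coloring), total_width(coloring))
```

Here a and c share SPM space because they do not interfere, so 8 units suffice although the weights sum to 12.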
Interval Coloring vs. Acyclic Orientation
Let G= (V, E) be an undirected graph. The subgraph of G induced by a subset
An orientation of G is a function α that assigns every edge a direction such that α(x , y) ∈ {(x , y), (y, x )} for all (x, y) ∈ E. Let Gα be the digraph obtained by replacing each edge (x, y) ∈ E with the arc α(x, y). An orientation α is said to be acyclic if Gα contains no directed cycles.
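Every interval coloring α of G induces an acyclic orientation of G, in which each edge is directed from the node whose interval lies to the left toward the node whose interval lies to the right. The problem of finding optimal colorings is NP-complete. In an optimal coloring, the chromatic number χ(G; w) is related to the notion of heaviest path in an acyclic orientation of G as follows:

$$\chi(G; w) = \min_{\alpha \in A(G)} \; \max_{P \in P(\alpha)} w(P)$$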
where A(G) is the set of all acyclic orientations of G and P(α) the set of directed paths in an orientation α ∈ A(G). In other words, the orientation whose heaviest path is the smallest induces an optimal coloring. This heaviest-path-based formulation is exploited in the development of our SPM allocator. Figure 1 illustrates the equivalence between finding an interval coloring and finding an acyclic orientation for a weighted graph. In Figure 1 (b), the heaviest path is x → b → c → z with a (total) weight of χα(G; w ) = 13. In Figure 1 (c), the heaviest path is b → z → c with a weight of χ β (G; w ) = 10. The gap between the two is 3 but can be larger in general. So there is a need to look for an optimal solution efficiently in practice.
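The heaviest-path view can be illustrated with a short Python sketch. It assumes the orientation is given as a list of arcs over weighted nodes and is acyclic; the three-node graph and the function names are hypothetical and do not reproduce Figure 1.

```python
from typing import Dict, List, Tuple


def induced_offsets(weights: Dict[str, int],
                    arcs: List[Tuple[str, str]]) -> Dict[str, int]:
    """Start offset of each node: weight of the heaviest path ending just before it.

    Assumes the arcs form an acyclic orientation (no cycle check is done).
    """
    preds: Dict[str, List[str]] = {v: [] for v in weights}
    for u, v in arcs:
        preds[v].append(u)
    offsets: Dict[str, int] = {}

    def offset(v: str) -> int:
        if v not in offsets:
            offsets[v] = max((offset(u) + weights[u] for u in preds[v]), default=0)
        return offsets[v]

    for v in weights:
        offset(v)
    return offsets


def heaviest_path_weight(weights: Dict[str, int],
                         arcs: List[Tuple[str, str]]) -> int:
    """Total width of the interval coloring induced by the orientation."""
    offsets = induced_offsets(weights, arcs)
    return max(offsets[v] + weights[v] for v in weights)


# Hypothetical graph with edges p-q and q-r, oriented in two different ways.
weights = {"p": 2, "q": 3, "r": 4}
chain = [("p", "q"), ("q", "r")]    # p -> q -> r
fan_in = [("p", "q"), ("r", "q")]   # p -> q <- r
print(heaviest_path_weight(weights, chain), heaviest_path_weight(weights, fan_in))
```

In this example the two orientations of the same graph need 9 and 7 units respectively, which is the kind of gap the minimization over A(G) removes.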
Comparability Graph Coloring
For the purposes of optimizing utilization of SPM, we examine below a class of graphs that allows interval colorings to be found optimally in polynomial time.
DEFINITION 3. An undirected graph G is a comparability graph if there exists a transitive orientation of G.
A transitive orientation is acyclic but the converse is not necessarily true. In Figure 1(b), α is acyclic but not transitive. However, β shown in Figure 1(c) is transitive. As a result, the graph given in Figure 1(a) is a comparability graph.
THEOREM 1. For any transitive orientation α of G, the interval coloring induced for G is optimal.
Furthermore, the problems of recognizing a comparability graph G = (V, E) and finding a transitive orientation of G can both be done in O(δ· | E |) time and O(| V | + | E |) space, where δ is the maximum of the degrees of all nodes in G. Based on α, an optimal coloring of G can be obtained in linear time [7] .
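As a small illustration of the transitivity condition behind Definition 3 and Theorem 1, the following Python sketch checks whether a given orientation is transitive by brute force. It is not the O(δ·|E|) recognition algorithm cited above; the function name and the example arcs are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def is_transitive(arcs: Iterable[Tuple[str, str]]) -> bool:
    """True iff (u, v) and (v, w) in the orientation always imply (u, w)."""
    arc_set: Set[Tuple[str, str]] = set(arcs)
    successors: Dict[str, Set[str]] = {}
    for u, v in arc_set:
        successors.setdefault(u, set()).add(v)
    for u, v in arc_set:
        for w in successors.get(v, ()):
            if (u, w) not in arc_set:
                return False
    return True


# Hypothetical orientations of a graph with edges p-q and q-r.
chain = [("p", "q"), ("q", "r")]    # needs arc p -> r, but p and r are not adjacent
fan_in = [("p", "q"), ("r", "q")]   # no two-arc chain exists
print(is_transitive(chain), is_transitive(fan_in))
```

The first orientation is acyclic but not transitive, while the second is transitive, so by Theorem 1 the coloring it induces is optimal for this small graph.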
Minimal Comparability Completion
Given a graph G, a comparability graph obtained by adding edges to G is called a comparability completion of G. Computing a comparability completion of G with the minimum number of added edges (called a minimum comparability completion) is an NP-hard problem [8] . As an approximation to the minimum comparability completion, a minimal comparability completion H of G is a comparability completion of G such that no proper subgraph of H is a comparability completion of G. Obviously, a minimum comparability completion is minimal but the converse is not necessarily true. A polynomial algorithm is presented in [9] to compute a minimal comparability completion for a graph G.
In our SPM allocator described below, a given IG may be subject to the minimal comparability completion in order to obtain improved SPM utilization.
SPM ALLOCATION WITH ON-DEMAND DATA TILING
Our SPM allocator, given in Figure 2, performs data tiling on-demand in order to improve utilization of the SPM. Presently, our algorithm is restricted to tiling 1-D arrays. Higher-dimensional arrays can be converted to 1-D arrays if it is desirable for them to be tiled. Our SPM allocator proceeds in five main steps, which are described in five separate subsections.
• Live-Range Splitting (Section 4.1). As in [14, 13], the live ranges of the arrays inside loop nests are split based on a cost-benefit analysis. The new live ranges obtained inside loop nests are called hot arrays as they are frequently accessed. Copy operations are inserted at the splitting points to transfer the hot arrays between SPM and off-chip memory.
• Finding the Critical Path (Section 4.2). Next, the IG Go for the program is built. Its subgraph containing all hot arrays is completed into a comparability graph Ht (if necessary). Then a transitive orientation α of Ht is computed, from which the critical path, i.e., the heaviest path $P_{H_t}$, is deduced.
• Live-Range Coalescing (Section 4.3). If the weight of the critical path $P_{H_t}$ is not larger than the SPM size, then all array candidates in Ht can be placed in the SPM. The algorithm tries to coalesce live ranges to eliminate unnecessary copy operations introduced during live range splitting before terminating.
• Spilling (Section 4.4). If the weight of $P_{H_t}$ exceeds the SPM size, then either spilling or data tiling is applied. The latter is preferred since it tends to make better use of the scarce SPM space.
• Data Tiling (Section 4.5). The arrays on the critical path $P_{H_t}$ are tiled on-demand to reduce its weight so that better SPM utilization may be obtained. Guided by $P_{H_t}$ and a cost-benefit analysis, an optimal tile size is determined in a symbolic manner.
The main contribution of this paper is a graph-coloring-based scratchpad allocation method for data aggregates that performs both scratchpad allocation and data tiling together. Due to the iterative nature of graph coloring register/scratchpad allocation [3, 14] , various heuristics such as those for live range splitting and spilling are employed. We use mostly existing heuristics with some slight variations, if necessary. Other more effective heuristics can be developed in future. 
Live Range Splitting
An array may be frequently accessed at some parts of its live range, i.e., at some computation-intensive loops. Following [13] , we split its live range around loops and insert the required array copy operations at the splitting points, which become potentially the data transfer statements between the SPM and off-chip memory. For a loop nest where an array is accessed, the array is copied to a new, i.e., hot array at the earlier splitting point (at the beginning of the loop nest) and restored back at the later splitting point (at the end of the loop nest). During our graph coloring stage, all these hot arrays are the candidates to be colored so that they will likely be placed in the SPM.
The live ranges of arrays in a program are required in order to perform live range splitting and construct an IG for the program later. The live ranges of arrays are computed by extending the def/use definitions for scalars to arrays in the normal manner. At any program point, USE(A) returns true iff some elements of A are read. DEF(A) returns true iff A is defined entirely, i.e., if every element of A is defined. In general, it is difficult to identify whether an array is defined or not at compile time. So we assume conservatively that an array that appears originally in a program is defined only at its definition point, i.e., where it is declared. In addition, for every array copy introduced in live range splitting, the array that appears at its left-hand side is defined. The live range of an array starts from its definition and ends at its last use. Two arrays are move-related if one is obtained as a result of splitting the live range of the other. Such move-related arrays can be coalesced if the corresponding splits are unnecessary.
Consider our algorithm in Figure 2 . In line 2, live range splitting is applied to the program under consideration. We adopt the algorithm described in [13] to perform live range splitting. This algorithm processes all the loop nests in every function one by one and examines all the loops of a particular loop nest, starting from its outermost to innermost loop. For every array A accessed in a loop L, the algorithm checks to see if it is beneficial to split the live range of A. The cost model takes into account the access frequencies of arrays (obtained by compiler analysis as well as runtime profiling) and the data transfer cost between the SPM and off-chip memory. The cost of communicating n bytes between the SPM and off-chip memory is approximated by Cs + Ct × n (cycles), where Cs is the startup cost and Ct the transfer cost per byte. In addition, Sspm and Smem are used to represent the number of cycles required per array element access to the SPM and off-chip memory, respectively. If the splitting is beneficial, then a new array is introduced and appropriate copy in/out operations are inserted.
Consider an example program in Figure 3 , which will be used to illustrate our SPM allocation algorithm. Let Cs = 90, Ct = 10, Sspm = 1 and Smem = 100. 
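The cost model can be sketched in Python with the parameter values given above. The transfer cost Cs + Ct × n is from the text; the decision rule in `splitting_is_beneficial` (access savings must exceed the copy-in and copy-out costs) is an assumption of this sketch, since the exact test used in [13] is not reproduced here. All function names are hypothetical.

```python
C_S = 90     # startup cost of one transfer, in cycles (Cs)
C_T = 10     # transfer cost per byte, in cycles (Ct)
S_SPM = 1    # cycles per array element access in the SPM (Sspm)
S_MEM = 100  # cycles per array element access in off-chip memory (Smem)


def transfer_cost(n_bytes: int) -> int:
    """Cycles to move n_bytes between the SPM and off-chip memory."""
    return C_S + C_T * n_bytes


def access_benefit(num_accesses: int) -> int:
    """Cycles saved when num_accesses are served from the SPM."""
    return num_accesses * (S_MEM - S_SPM)


def splitting_is_beneficial(n_bytes: int, num_accesses: int,
                            copy_back: bool = True) -> bool:
    """Assumed rule: split only if the savings outweigh the inserted copies."""
    copies = 2 if copy_back else 1
    return access_benefit(num_accesses) > copies * transfer_cost(n_bytes)


# Hypothetical hot array of 64 bytes accessed 320 times inside a loop nest.
print(transfer_cost(64), access_benefit(320), splitting_is_beneficial(64, 320))
```

With these values a 64-byte array pays off quickly, while the same array accessed only 10 times would not justify the two copies.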
Finding the Critical Path
Let us continue our discussion with our algorithm in Figure 2. In line 3, the IG Go for all array live ranges is built. Among all the live ranges, the ones introduced in Vt during live range splitting are expected to be placed in SPM. So the interference subgraph induced by them, Gt, is extracted from Go (line 5). In line 6, the algorithm checks to see if Gt is a comparability graph, which holds in many applications, since the array IGs tend to consist of disjoint cliques (complete graphs), which are trivially comparability graphs. If Gt is not a comparability graph by itself, the minimal comparability completion algorithm is performed to make it so (line 9). Next, a transitive orientation α of the resulting comparability graph is obtained, from which the heaviest path $P_{H_t}$ is deduced.

For the program in Figure 4, the interference subgraph Gt and its transitive orientation are depicted in Figure 6(a). With two disjoint cliques, Gt is trivially a comparability graph. The heaviest directed path, highlighted in thick arrows, has a total weight of 122, which gives a lower bound for the SPM capacity to fully hold all these live ranges. The optimal coloring, i.e., the optimal allocation corresponding to the orientation, is shown in Figure 6(b).
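To illustrate why hot arrays often form disjoint cliques, the following Python sketch builds an interference graph from live ranges. It simplifies a live range to a half-open interval of program points, which is an assumption of this sketch (real live ranges are code regions); the array names and ranges are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def build_interference_graph(
        live_ranges: Dict[str, Tuple[int, int]]) -> Set[Tuple[str, str]]:
    """Add an edge between two arrays whenever their live ranges overlap."""
    edges: Set[Tuple[str, str]] = set()
    for (a, (sa, ea)), (b, (sb, eb)) in combinations(sorted(live_ranges.items()), 2):
        if sa < eb and sb < ea:  # half-open intervals overlap
            edges.add((a, b))
    return edges


# Hypothetical hot arrays: two are live only in a first loop nest,
# two only in a second loop nest.
live_ranges = {"a1": (0, 10), "b1": (0, 10), "c2": (10, 20), "d2": (10, 20)}
print(sorted(build_interference_graph(live_ranges)))
```

Hot arrays created for the same loop nest interfere with each other but not with those of other loop nests, giving one clique per loop nest.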
Live Range Coalescing
If the weight of $P_{H_t}$ is not larger than the SPM size (line 14 in Figure 2), then the current candidates in Ht can all be placed in the SPM. In this case, we apply coalescing to remove unnecessary copy operations (if any) introduced during live range splitting. However, coalescing may increase SPM pressure and is thus performed only as long as the IG remains colorable (line 29).
The move-related nodes that are not originally in Gt are inserted back into Gt and coalesced by applying Coalesce_Live_Ranges in line 29 using the optimistic coalescing algorithm [18]. Every time after some coalescing has been done, the current graph is checked to see if it remains a comparability graph. If it is not, the minimal comparability completion is performed to make it so. Next, the heaviest path $P_{H_t}$ is recalculated. If $w(P_{H_t}) \le SPM\_Size$, the coalescing results are kept; otherwise, they are discarded. This is repeated until all coalescing possibilities have been tried. Finally, a transitive orientation of the final IG, i.e., an SPM allocation, is returned (line 31).
For example, Figure 7 (a) depicts the IG Go corresponding to the program in Figure 4 with two copy-related nodes b and d being inserted. Figure 7 
Spilling
If the weight of $P_{H_t}$, $w(P_{H_t})$, is larger than the SPM size, we resort to spilling or data tiling to reduce the weight of $P_{H_t}$. We prefer data tiling since it enables more live ranges to be placed in the SPM, which often improves SPM utilization significantly.
Let us return to our algorithm in Figure 2. In line 17, the set of tileable arrays $TAS_{P_{H_t}}$ is extracted from $P_{H_t}$. If no array can be tiled (line 18), spilling has to be performed. We adopt the heuristic introduced in [15] except that we find the heaviest path rather than (maximum) cliques related to Ht so that the resulting spilling (and data tiling) phases are polynomial.
The colorability of Ht is governed by:
where $P_{H_t}$ is the heaviest path of a transitive orientation of Ht and $P_{H_t}.freq$ is the sum of the access frequencies of all arrays in $P_{H_t}$. Intuitively, if $w(P_{H_t})$ is no larger than the given SPM size, then its colorability is $P_{H_t}.freq$. Otherwise, its colorability is approximated as a percentage reduction of $P_{H_t}.freq$ in terms of the ratio $(w(P_{H_t}) - SPM\_Size) / w(P_{H_t})$, which represents the percentage of data in $P_{H_t}$ that cannot be placed in the SPM.
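One possible reading of the colorability description is sketched below in Python. The piecewise form, in particular scaling the frequency by one minus the overflow ratio, is an assumption inferred from the prose above and is not the paper's stated formula; the function name and the frequency value are hypothetical.

```python
def colorability(path_weight: int, path_freq: float, spm_size: int) -> float:
    """Assumed colorability of Ht from its heaviest path.

    path_weight: w(P_Ht), the weight of the heaviest path
    path_freq:   P_Ht.freq, summed access frequencies of arrays on the path
    """
    if path_weight <= spm_size:
        return float(path_freq)
    # Fraction of the data on the path that cannot be placed in the SPM.
    overflow_ratio = (path_weight - spm_size) / path_weight
    return path_freq * (1.0 - overflow_ratio)


# Path weights 122 and 96 appear in the running example; the frequency
# of 1000 is a made-up value for illustration only.
print(colorability(96, 1000, 100), colorability(122, 1000, 100))
```

Under this reading, a path of weight 122 in a 100-unit SPM keeps about 82% of its access frequency.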
As a result, the benefit for spilling v from Ht is:
The spilling cost, i.e., penalty incurred by v is estimated by:
where v.freq is the access frequency of array v.
We choose a node in $P_{H_t}$ to spill such that the spilling profit defined below is maximized among all v in $P_{H_t}$:
A spilled node in $P_{H_t}$ is excluded from the current IG Ht. No spilling code needs to be generated. Since an induced subgraph G′ of a comparability graph G remains a comparability graph and a transitive orientation of G remains transitive in G′, the orientation does not need to be recomputed. However, the heaviest path $P_{H_t}$, which may have changed in the current IG, should be recomputed (line 23). For the IG given in Figure 6(a) with SPM_Size = 100, we have $w(P_{H_t}) = 122 > SPM\_Size$. If spilling is performed according to the heuristic (5), then d′ can be selected to spill. We simply remove d′ from Figure 6(a), obtaining the resulting graph as shown in Figure 10. The heaviest path now is b′ → a′ → c′ with a total weight of 96. For the program in Figure 4, no spilling code needs to be generated. We only need to undo the live range splitting for d in loop L1 and eliminate the hot array d′ introduced and the associated copy operations.
Data Tiling with Optimal Tile Sizes
If there are tileable arrays in $TAS_{P_{H_t}}$, then data tiling (instead of spilling) can be applied to a selected loop. Data tiling causes some temporary arrays, called tile arrays, to be introduced in the selected loop. Their sizes can be expressed as multiples of a tile size variable for the selected loop. The critical path $P_{H_t}$ gives an upper bound for the tile size to be used. By combining this upper bound constraint with the cost-benefit analysis for all tileable arrays expressed as a function of the tile size variable, the best tile size can be solved analytically or numerically.
The focus of this paper is on demonstrating the feasibility of performing on-demand data tiling together with scratchpad allocation. To this end, we describe a simple approach to selecting which loop to tile and which tileable arrays to tile inside the selected loop. More sophisticated tiling heuristics, as discussed at the end of this section, will be developed in future work.
Let us consider our algorithm in Figure 2 again. If some arrays in $TAS_{P_{H_t}}$ can be tiled, we select a loop L to tile where some tileable arrays in $TAS_{P_{H_t}}$ are accessed. Presently, we select L such that the sum of the access frequencies of all the tileable arrays accessed in L is the largest. We then undo the live range splitting performed earlier for the tileable arrays accessed in L. Then data tiling, together with loop tiling, is applied to L with the tile size being symbolically represented by x, and the procedure Determine_Optimal_Tile_Size() is called to find the optimal tile size (line 27), as detailed in Figure 11.
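The effect of data tiling combined with loop tiling can be sketched in Python as follows. The sketch models the SPM buffer as a Python list slice and counts copy operations; the kernel (scaling a 1-D array), the function name and the sizes are hypothetical and do not correspond to the program in Figure 12.

```python
from typing import List


def scale_in_tiles(array: List[int], factor: int, tile_size: int) -> int:
    """Scale a 1-D array tile by tile and return the number of copy operations."""
    num_copies = 0
    for base in range(0, len(array), tile_size):
        tile = array[base:base + tile_size]    # copy-in: off-chip memory -> SPM tile
        num_copies += 1
        for i in range(len(tile)):             # tiled loop body works on the tile only
            tile[i] *= factor
        array[base:base + len(tile)] = tile    # copy-out: SPM tile -> off-chip memory
        num_copies += 1
    return num_copies


data = list(range(16))
copies_with_tile_8 = scale_in_tiles(data, 2, 8)
copies_with_tile_4 = scale_in_tiles(list(range(16)), 2, 4)
print(copies_with_tile_8, copies_with_tile_4)
```

Only one tile needs SPM space at a time, and a larger tile size means fewer copy operations, which is the trade-off the tile size selection has to balance against the SPM capacity.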
The basic idea behind Figure 11 is simple. The weight of a tileable array in $TAS_{P_{H_t}}$ is replaced with the weight of its corresponding tiled array (lines 7-9), so that $w(P_{H_t})$ becomes a function of the tile size x (line 10).

 1: procedure Determine_Optimal_Tile_Size
 2:   Input: Loop L to be tiled and the heaviest path $P_{H_t}$
 3:   Output: Optimal tile size x
 4:   Let $S_{old}$ be the set of m tileable arrays in L of $TAS_{P_{H_t}}$
 5:   Let $S_{new}$ be the set of m new "tiled arrays"
 6:   Let f be a bijective function from $S_{old}$ to $S_{new}$
 7:   for every array A ∈ $S_{old}$ do
 8:     Replace the weight of A with the size of f(A)
 9:   end for
10:   Recalculate $w(P_{H_t})$ (which is now a function of x)
11:   x = Maximize_Profit()
12:   return x
13: end procedure

In the cost-benefit analysis, num_copies.A denotes the dynamic number of copy operations executed for A prior to tiling. In line 11, we call Maximize_Profit to find x by maximizing the net profit obtained, i.e., the access benefit minus the transfer cost, both expressed in terms of x,
subject to the following SPM capacity constraint:

$$w(P_{H_t}) \le SPM\_Size$$

and other constraints, including, for example, those constraints related to the sizes of the tileable arrays in $S_{old}$ and the loop bounds.

Let us consider the IG in Figure 6(a) and its heaviest path $P_{H_t}$. Let the SPM size be 64. Some arrays must be tiled to be placed in the SPM. Suppose that L1 is selected and that a corresponding tiled array is introduced for each tileable array accessed in L1. Applying data tiling, together with loop tiling, to L1 yields the program in Figure 12. As a result, the IG shown in Figure 6(a) evolves into the one shown in Figure 13, giving rise to $w(P_{H_t}) = 7x$.
When performing the cost benefit analysis, the transfer cost is 8640/x + 5760 and the access benefit is 16 * 4 * 5 * (100 − 1) = 31680. So the net profit to be maximized is 31680 − 8640/x − 5760 = 25920 − 8640/x subject to the space constraint 7x ≤ 64. To be practical, the tile size x is required to be at least 2. So the optimal tile size for L1 is x = 8 as shown in Figure 14 .
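The numerical route to the tile size can be sketched in Python for this example. The profit function and the capacity constraint are taken from the text; the restriction of candidates to divisors of 16 is an assumption of this sketch, standing in for the loop-bound constraints mentioned earlier (the factor 16 in the access benefit suggests, but does not confirm, a trip count of 16). The function names are hypothetical.

```python
from typing import Callable, Iterable, Optional


def maximize_profit(profit: Callable[[int], float],
                    path_weight: Callable[[int], int],
                    spm_size: int,
                    candidates: Iterable[int]) -> Optional[int]:
    """Return the feasible candidate tile size with the largest net profit."""
    feasible = [x for x in candidates if path_weight(x) <= spm_size]
    if not feasible:
        return None
    return max(feasible, key=profit)


def net_profit(x: int) -> float:
    """Access benefit minus transfer cost for loop L1, as given in the text."""
    return 31680 - (8640 / x + 5760)


def path_weight(x: int) -> int:
    """w(P_Ht) after tiling, as given in the text."""
    return 7 * x


# Assumed extra constraint: x >= 2 and x divides a trip count of 16.
candidates = [x for x in range(2, 17) if 16 % x == 0]
best = maximize_profit(net_profit, path_weight, 64, candidates)
print(best, net_profit(best))
```

Since the net profit grows with x, the search simply picks the largest candidate that still satisfies 7x ≤ 64, which is x = 8 under the assumed divisibility constraint.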
Let us return to our algorithm in Figure 2 . Once data tiling has been performed, we go back in line 28 to line 2 to rebuild the IG and repeat the same process until the weight of the heaviest path in the current IG is no longer larger than the SPM size.
Our simplistic tiling heuristics can be further improved along a number of directions. The search space for data tiling is defined by a number of factors, including which loop(s) to tile, which tileable array(s) in a selected loop to tile, what tile size to use to tile a selected loop, and what tile size, i.e., array size to use for a selected tileable array in a selected loop. The most aggressive option is to analytically try all possibilities and pick the best according to some heuristics employed. The most computationally efficient option could be to randomly pick one loop and one contained tileable array in the loop to tile. The solution introduced above represents a simple compromise. Better tiling heuristics that are useful for on-demand data tiling will be developed in future work.
EXPERIMENTS
We have implemented our algorithm in the SUIF compiler infrastructure. We take a C program as input, perform a source-to-source transformation by applying our algorithm, and finally, produce as output a new C program with SPM operations inserted. Then the program is compiled by GCC and run on SimpleScalar. If the program is originally in FORTRAN, then the f2c tool is used to convert it to C code first.
We have modified SimpleScalar to integrate SPM instead of cache. There are four parameters to be considered. The cost of communicating n bytes between the SPM and off-chip memory is estimated as Cs + Ct × n in cycles. Two other parameters are Sspm and Smem, which represent the number of cycles required for one memory access to the SPM and the off-chip memory, respectively. The values of the four parameters are set to be Cs = 90, Ct = 10, Sspm = 1 and Smem = 100.
The benchmarks are computation-intensive procedures taken from SPEC2000 programs. In more detail, resid and psinv are from 172.mgrid, buts is from 173.applu, calc1 and calc2 are from 171.swim, and zaxpy is from 168.wupwise.
These benchmarks are modified with all arrays of more than one dimension being replaced by 1-D arrays so that data tiling can be applied by our algorithm.
Performance
Figure 15 gives the performance speedups over the non-tiling graph coloring SPM allocation algorithm [13]. For resid, when the SPM size is smaller than 256KB, none or only some of the arrays can be placed in SPM without tiling. Therefore, our algorithm achieves relatively large speedups in these cases. When the SPM is 256KB or larger, all the arrays can be placed in SPM. So both algorithms achieve the same performance. For psinv, the same trend is observed except that the demarcation line is 128KB.
We have identified buts as an interesting application. The computation is performed in a five-level loop nest. The three innermost loops access consecutive array elements. However, the two outermost loops access array elements with a large non-unity stride. According to the cost-benefit analysis, it is not worthwhile to copy all the arrays accessed in the loop nest into SPM. However, it is beneficial to copy the array elements accessed in the three innermost loops into SPM (in which case data tiling is needed), requiring only a small SPM (smaller than 16KB). In addition to this loop nest, there are some other small loop nests containing arrays that can all fit into a 16KB SPM. That is why our algorithm achieves the same speedup under different SPM sizes.
For calc1, without tiling, no array can be placed in the SPM even with a size up to 512KB, in which case the performance is equal to that obtained when the SPM is not used. When data tiling is used, the situation is different. As the SPM size increases, the optimal tile size becomes larger and larger, and consequently, the number of copy operations becomes smaller and smaller, as validated in Figure 16. That is why our algorithm achieves increasingly better speedups when the SPM size increases.
Finally, calc2 and zaxpy are similar to calc1, except that zaxpy demonstrates a much lower computation intensity. Figure 17 gives the execution times when the SPM sizes range from 16KB to 512KB, normalized to the execution time obtained with an infinitely large SPM. By demonstrating performance close to the best possible, we show that our algorithm makes good use of the limited SPM space.
The Impact of Live Range Coalescing
Figure 18 demonstrates the impact of live range coalescing on performance for resid through eliminating unnecessary copy operations. As described before, when the SPM size reaches 128KB
CONCLUSION
This paper proposes a new comparability graph coloring allocator that integrates data tiling and SPM allocation for arrays by tiling arrays on-demand to improve utilization of the SPM. The novelty lies in repeatedly identifying the heaviest path in an array interference graph and then reducing its weight by tiling certain arrays on the path appropriately with respect to the size of the SPM. The effectiveness of our algorithm is validated by using a number of selected benchmarks for which existing algorithms are ineffective.
