Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory partitioning for such architectures, taking into account potentially parallel data accesses to physically independent banks. Targeted at affine static control parts (SCoPs), the technique relies on the Z-polyhedral model for program analysis and adopts a partitioning scheme based on integer lattices. The approach enables the definition of a solution space including previous works as particular cases. The problem of minimizing the total amount of memory required across the partitioned banks, referred to as storage minimization throughout the article, is tackled by an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The article also presents a prototype toolchain and a detailed step-by-step case study demonstrating the impact of the proposed technique along with extensive comparisons with alternative approaches in the literature. 
INTRODUCTION
Authors' address: A. Cilardo and L. Gallo, Department of Electrical Engineering and Information Technologies, University of Naples Federico II, via Claudio 21, 80125, Napoli, Italy; emails: {acilardo, luca.gallo}@unina.it. © 2015 ACM 1544-3566/2015/01-ART45 $15.00. DOI: http://dx.doi.org/10.1145/2675359
During the past few years, a particularly important trend has clearly emerged in parallel computing architectures, indicating that significant improvements in performance can only be achieved by customizing the computing platform to some extent. In particular, the term heterogeneous computing refers to systems made of a variety of different general- and special-purpose computational units, such as graphics processing units (GPUs) [Brodtkorb et al. 2013], digital signal co-processors [Li et al. 2012a], and custom accelerators typically implemented on field-programmable gate arrays (FPGAs) [Cilardo et al. 2013]. Many such advanced computing platforms are provided with several independent memory banks that can be accessed simultaneously by parallel computing elements through complex interconnects.
This potentially provides an opportunity for improving the memory bandwidth available to the application. However, to take full advantage of the memory architecture, adopting suitable memory partitioning strategies based on the actual application access patterns is of paramount importance.
This work proposes a methodology for automated memory partitioning in architectures provided with multiple independent memory banks accessed by parallel computing units. The approach encompasses both the problem of bank mapping and the minimization of the total amount of memory required across the partitioned banks, referred to as storage minimization here. Most of the results presented apply to affine static control parts (SCoPs), that is, code segments in performance-critical loops where loop bounds, conditionals, and subscripts are affine functions of the surrounding loop iterators and of constant parameters possibly unknown at compile time. The methodology is based on the Z-polyhedral mathematical framework [Gupta and Rajopadhye 2007], allowing a compact and comprehensive representation of the code and the related transformations. Furthermore, it adopts a partitioning scheme based on integer lattices, identifying memory banks with different full-rank lattices that constitute a partition of the whole memory. We rely on a method for enumerating the solution space exhaustively and evaluating each solution based on the time overhead caused by conflicting accesses. We also propose a technique for representing the transformed program, suitable for existing tools that generate code from Z-polyhedra. For the storage minimization problem, we rely on an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The preceding techniques are demonstrated through a prototype toolchain relying on a range of software libraries for polyhedral analysis, whereas the experimental data are collected through high-level synthesis (HLS) of FPGA-based hardware accelerators.
Adopting a lattice-based approach to memory partitioning in the context of HLS, the proposed technique improves on a few very recent results in the technical literature that formalize the same problem by means of less powerful mathematical tools, resulting in narrower solution spaces and thus missing potential solutions. In the last part of the article, we present extensive comparisons with state-of-the-art proposals concerned with memory partitioning in the context of HLS, showing that the lattice-based approach can effectively identify a superset of the solutions spanned by existing works.
The text is structured as follows. Section 2 presents a motivational example to introduce the problem. Section 3 recapitulates the relevant results on integer lattices and the Z-polyhedral model. Section 4 describes the proposed lattice-based partitioning technique. Section 5 reviews the overall approach by discussing a step-by-step case study. Section 6 presents some comparisons with existing literature works. Section 7 concludes the article with some final remarks on possible developments of this work.
MOTIVATIONAL EXAMPLE
Our motivational example is a downsizing algorithm used for image resizing relying on bilinear interpolation [Danahy et al. 2007] , where the image is shrunk by a factor of 2 in both dimensions. For the sake of simplicity, we neglect boundary conditions and the source image is supposed to be grayscale. Each pixel in the target image is determined by averaging a 2 × 2 block of the source image. The kernel consists of a perfect loop nest and comprises only one statement. There are neither loop-carried dependencies nor loop-independent dependencies [Kennedy and Allen 2002] . The code is thus fully parallel. According to the bilinear interpolation method, each instance of the statement accesses simultaneously four locations of the source image. Figure 1 (a) shows the kernel code, whereas Figure 1(b) highlights the access patterns of a few iterations, where each dotted box corresponds to the locations accessed by the same iteration (i.e., the same value of i and j). Clearly, if the number of memory ports equalled the number of pixels in the source image, all read operations could virtually be performed in parallel.
In contrast, if we have limited memory banks and the array elements are not properly mapped to the available banks, memory conflicts may arise and the inherent parallelism is not fully exploited. A few examples of partitioning approaches include cyclic and block partitioning strategies [Cong et al. 2011]. Both first linearize the array. Then, cyclic partitioning assigns a memory location m to memory bank m mod NB, where NB is the number of banks, whereas block partitioning maps m to memory bank ⌊m/NB⌋. Assume, for example, that we have only four memory banks available. With neither of the two approaches are four ports sufficient to completely avoid conflicts. For instance, if we have a 480 × 640 image and adopt a cyclic partitioning strategy, iteration i = 0, j = 0 would simultaneously access both A[0 · 640 + 1] and A[1 · 640 + 1] in the flattened array, and a conflict would arise since 1 mod 4 = 641 mod 4. In essence, the technique proposed in this work seeks different approaches for partitioning the memory space such that improved parallelism in memory accesses can be achieved. Figure 1(b) shows the mapping determined by the lattice-based partitioning technique, which is introduced later in the article. As highlighted by the figure, the memory accesses in the statement can be completely executed in parallel. Figure 1(c) contains the hardware datapath implementing the example kernel based on our partitioning strategy. Access parallelization can be achieved with the lowest overhead in terms of steering and control circuitry. In fact, the mapping scheme follows a regular pattern, and consequently, a direct connection between the memory banks and the functional units is sufficient to access the required data across all iterations.
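The conflict described above can be checked numerically. The following is a minimal sketch, not the authors' tool: it assumes that target iteration (i, j) reads the 2 × 2 source block starting at row 2i and column 2j, and it uses a hypothetical two-dimensional mapping (one bank per row/column parity class) as a stand-in for the mapping of Figure 1(b), which is not reproduced in the text.

```python
# Bank conflicts for the 2x downsizing kernel on a 480 x 640 image, 4 banks.
ROWS, COLS, NB = 480, 640, 4


def block(i, j):
    """Source pixels read by target iteration (i, j): an assumed 2 x 2 block."""
    return [(2 * i + di, 2 * j + dj) for di in (0, 1) for dj in (0, 1)]


def cyclic_bank(r, c):
    """Cyclic partitioning of the row-major flattened array."""
    return (r * COLS + c) % NB


def lattice_bank(r, c):
    """Hypothetical 2-D mapping: one bank per (row parity, column parity)."""
    return 2 * (r % 2) + (c % 2)


def max_conflicts(bank):
    """Largest number of same-iteration accesses that hit a single bank."""
    worst = 0
    for i in range(ROWS // 2):
        for j in range(COLS // 2):
            banks = [bank(r, c) for r, c in block(i, j)]
            worst = max(worst, max(banks.count(b) for b in set(banks)))
    return worst


if __name__ == "__main__":
    print("cyclic :", max_conflicts(cyclic_bank))
    print("lattice:", max_conflicts(lattice_bank))
```

Under these assumptions, the cyclic mapping serializes two accesses per iteration, while the parity-based mapping places the four accesses of every iteration in four different banks.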
MATHEMATICAL BACKGROUND AND PROBLEM FORMULATION
This section briefly reviews a few mathematical concepts and results that are essential for the formulation of our lattice-based partitioning technique.
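Given n linearly independent vectors $\vec b_0, \ldots, \vec b_{n-1} \in \mathbb{R}^m$, the lattice generated by them is the set of all their integer combinations, $L(\vec b_0, \ldots, \vec b_{n-1}) = \{ \sum_{i=0}^{n-1} x_i \vec b_i : x_i \in \mathbb{Z} \}$.
We refer to $\vec b_0, \ldots, \vec b_{n-1}$ as a basis of the lattice [Kelner 2009]. The rank of the lattice is n, whereas its dimensionality is m. If n = m, the lattice is a full-rank lattice. In particular, if the basis is made of integer points, the lattice is an integer lattice. This article will only deal with integer lattices.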
A basis can also be regarded as an m × n matrix $B = \{b_{ij}\}$ having the vectors $\vec b_j$ as columns. We will say that B spans a lattice by using the notation L(B). Two different bases $B = \{b_{ij}\}$, $C = \{c_{ij}\}$ span the same lattice L if and only if there exists a unimodular matrix U (i.e., a square matrix with integer entries and determinant equal to ±1) such that B = C · U.
Given a basis B, a fundamental parallelepiped associated with B is a set of points in $\mathbb{R}^m$ defined by $FP(B) = \{ \sum_{j=0}^{n-1} r_j \vec b_j : 0 \le r_j < 1 \}$. The set FP(B) is clearly half-open and the points in its translates FP(B) + t, with t ∈ L(B), form a partition of $\mathbb{R}^n$. The determinant of a lattice, denoted det(L(B)), is the n-dimensional volume of its fundamental parallelepiped and may be regarded as the number of integer points contained in it. For a full-rank lattice, det(L(B)) = |det(B)|.
An integer affine lattice $L_t$ is obtained by translating an integer lattice L(B) by a constant offset $\vec t = (t_0, \ldots, t_{m-1}) \in \mathbb{Z}^m$, that is, $L_t = \{ \vec z + \vec t : \vec z \in L(B) \}$.
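The definitions above can be made concrete with a small sketch. It is restricted to full-rank two-dimensional integer lattices for brevity, and the function names `det2` and `in_lattice` are illustrative helpers, not part of any library mentioned in the article. Membership in L(B) + t is tested by solving B · x = p − t with Cramer's rule and checking that x is integral.

```python
from fractions import Fraction


def det2(B):
    """Determinant of a 2 x 2 basis matrix given as a list of rows."""
    return B[0][0] * B[1][1] - B[0][1] * B[1][0]


def in_lattice(B, p, t=(0, 0)):
    """True if p lies in the affine lattice L(B) + t (2-D, full rank).

    The columns of B are the basis vectors.
    """
    d = det2(B)
    q = (p[0] - t[0], p[1] - t[1])
    # Cramer's rule for B x = q; p is a lattice point iff x is integral.
    x0 = Fraction(q[0] * B[1][1] - B[0][1] * q[1], d)
    x1 = Fraction(B[0][0] * q[1] - B[1][0] * q[0], d)
    return x0.denominator == 1 and x1.denominator == 1


if __name__ == "__main__":
    B = [[3, 0], [0, 2]]
    print(abs(det2(B)))                      # determinant of L(B)
    print(in_lattice(B, (3, 2)))             # point of L(B)
    print(in_lattice(B, (4, 2), t=(1, 0)))   # point of the translate L(B) + (1, 0)
```

The same helper can be used to check the unimodular-equivalence statement: multiplying a basis on the right by a unimodular matrix leaves the membership test unchanged on any finite window of integer points.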
An integer polyhedron P is a subset of vectors in $\mathbb{Z}^m$ satisfying a finite number of affine (in)equalities with integer coefficients: $P = \{ \vec z \in \mathbb{Z}^m \mid Q \cdot \vec z + \vec q \ge \vec 0 \}$, where matrix Q and vector $\vec q$ specify the (in)equalities. An integer polytope is a bounded integer polyhedron.
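By combining multiple affine constraints on integer variables, either equalities or inequalities, by means of logical operators (¬, ∧, and ∨) and quantifiers (∀ and ∃), we can specify special sets, called Presburger sets, which are particularly relevant to our work. Consider an integer polyhedron, $P = \{ \vec z \in \mathbb{Z}^m \mid Q \cdot \vec z + \vec q \ge \vec 0 \}$, and an affine function with integer coefficients, $f(\vec z) = T \cdot \vec z + \vec t$. The image of P by f is the set $\{ T \cdot \vec z + \vec t \mid \vec z \in \mathbb{Z}^m,\ Q \cdot \vec z + \vec q \ge \vec 0 \}$. These structures are linearly bound lattices (LBLs) [Teich and Thiele 1993] and can in fact be expressed as Presburger sets.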
A Z-polyhedron Z is the intersection of a polyhedron and an affine lattice.
A Z-polyhedron can in fact be regarded as an affine image of an integer polyhedron. The hypotheses on the affine function ensure compliance with the conditions of Le Verge [1995] for an LBL to be a Z-polyhedron. This representation has been shown to be complete by Quinton et al. [1996]. We will rely on a number of previous results concerning Z-polyhedra [Gupta and Rajopadhye 2007]:
-The intersection of two Z-polyhedra is still a Z-polyhedron.
-The union of Z-polyhedra, called a Z-domain, is not a Z-polyhedron in general.
-The difference between two Z-polyhedra is a Z-domain.
-The preimage of a Z-polyhedron by an affine invertible function is a Z-polyhedron.
-The image of a Z-polyhedron by an arbitrary affine function is an LBL.
-An LBL, and hence a Presburger set, can be expressed as a union of Z-polyhedra [Gupta and Rajopadhye 2007]. A solution for deriving such a union from a generic LBL is given in Seghir [2012].
Z-polyhedra can be used to represent execution information of program loop nests, particularly the so-called affine SCoPs, having compile-time predictable control flow as well as loop bounds, subscripts, and conditionals expressed as affine functions of the loop iterators, which is common in a wide range of HPC and scientific program kernels, such as image processing and linear algebra operations. Representing SCoP code by means of parametric polyhedra enables a precise and instance-wise representation of the execution information, unlike abstract syntax trees (ASTs) normally used by compilers, as well as the composition of well-known loop transformations in a single step [Cohen et al. 2004]. In fact, some compilers currently embody tools for code manipulation based on the polyhedral abstraction, such as GCC Graphite, LLVM Polly, and RStream [Meister et al. 2011]. Following are a few mathematical concepts linking the Z-polyhedral model with the representation of SCoP code.
The iteration vector $\vec v$ of a given loop nest is a vector having as elements the indices of the surrounding loops along with the parameters and the constant term. As an example, the iteration vector for the loop nest in Figure 2 is $\vec v = (i, j, N, 1)^T$. The set of integer points (memory locations) accessed by a certain reference in a statement S can be regarded as the image of the Z-polytope $D_S$ (i.e., the iteration domain) by the affine access function F of that memory reference. The result is generally an LBL, hence a union of Z-polyhedra, since we cannot guarantee the invertibility of F.
Although not relevant to this work, polyhedra are also essential for representing data dependencies between statements. In particular, given two statements $S_i$ and $S_j$ and their iteration vectors $\vec v$ and $\vec w$, respectively, we can build a dependence polyhedron, having a dimensionality equal to $|\vec v| + |\vec w|$, that includes all pairs of instances $(\vec v, \vec w)$ such that $\vec v \in D_{S_i}$, $\vec w \in D_{S_j}$, and $\vec w$ has a dependency on $\vec v$.
Each instance $\vec v$ of a statement S in a loop nest can be associated with a point in time based on a schedule function $\theta_S(\vec v)$, which in fact establishes an ordering of the instances of S, possibly based on a parallelizing transformation. The schedule function has the form $\theta_S(\vec v) = \Theta_S \cdot \vec v$, where $\Theta_S$ is an $n \times |\vec v|$ matrix. The parallelism in the code can be exposed by transforming the iteration domain and making the schedule matrix have a zero column corresponding to each parallel for loop. In Figure 2(b), the schedule has four columns: two for the loop iterators i and j, one for parameter N, and one for the constant term. In this particular example, the outer loop was kept serial, whereas the inner loop was parallelized. Consequently, the schedule is one dimensional, that is, the matrix has one row. The parallelism can be easily recognized by the fact that the schedule has a zero term in the position corresponding to the parallel loop iterator: all instances with the same value of i can be executed in parallel independent of the value taken by j.
To model data-level parallelism, we rely on the concept of parametric polyhedral slice, defined for each statement S as follows:
$$P_S(\vec k) = \left\{ \vec v \;\middle|\; \left[\begin{array}{c} D_S \\ \hline \Theta_S \end{array}\right] \cdot \vec v \;\begin{array}{c} \ge \\ = \end{array}\; \left[\begin{array}{c} \vec 0 \\ \hline \vec k \end{array}\right] \right\}$$
The horizontal line in the preceding notation separates the inequality constraints, here $D_S \cdot \vec v \ge \vec 0$, from the equality constraints, here $\Theta_S \cdot \vec v = \vec k$. The slice $P_S(\vec k)$ identifies all iteration instances in $D_S$ that correspond to the same schedule value $\vec k$. Notice that, being an intersection of Z-polyhedra, $P_S(\vec k)$ is still a Z-polyhedron. Each component $k_h$ of the parameter vector $\vec k$ varies within the projection of the domain $D_S$ on the corresponding serial dimension. In Figure 2, the sets of parallel instances are the points along the planes having i as a constant value. Two of these sets are represented by the blue rectangles in Figure 2(c). Vector $\vec k$ has only one component because we have only one outer sequential dimension corresponding to the i iterator.
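Given an array A and a statement S containing a number of memory references, each with affine access function $F_h$, call $M(\vec v) = \bigcup_h \{ F_h \cdot \vec v \}$ the set of memory cells of array A accessed by the statement instance $\vec v$.
The definition can obviously be extended to sets of instances. In particular,
$$M(P_S(\vec k)) = \bigcup_{\vec v \in P_S(\vec k)} M(\vec v) = \bigcup_h F_h(P_S(\vec k))$$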
is the set of the memory cells referenced by the parallel iterations in the parametric slice $P_S(\vec k)$. We call $M(P_S(\vec k))$ a dataset. For a certain statement S, the dataset is a function of the parameter $\vec k$ and includes all memory locations that can be potentially accessed simultaneously according to the given schedule. Notice that although $P_S(\vec k)$ is a Z-polyhedron, $M(P_S(\vec k))$ is not necessarily a Z-polyhedron for two different reasons. First, the union of a finite number of Z-polyhedra is not generally a Z-polyhedron. Second, the image of a Z-polyhedron is not necessarily a Z-polyhedron, since the access functions may be noninvertible. In general, $M(P_S(\vec k))$ is an LBL, which is formally equivalent to a union of Z-polyhedra, as remarked earlier. The presence of multiple statements affects the calculation of the datasets. In particular, some of the statements may have the same schedule function and, hence, be run in parallel. The formulation shown earlier could be extended by defining multiple datasets, each being the union of the datasets of single statements having the same schedule. For the sake of simplicity, however, in the following we will refer to the case of a single statement.
The parallel datasets defined previously can be expressed as finite unions of Z-polyhedra, allowing a closed formulation of the memory partitioning problem. Furthermore, they can be easily manipulated by existing polyhedral tools, such as those of Loechner [1999], Seghir [2012], and Verdoolaege and Woods [2008].
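As an illustration of slices and datasets, the sketch below enumerates them by brute force for a small, purely hypothetical loop nest (it is not the kernel of Figure 2): a serial outer loop on i from 0 to N − 1, a parallel inner loop on j from 0 to 2, and two assumed references A[i][j] and A[i+1][j+2]. A polyhedral library would represent these sets symbolically; explicit enumeration is used here only to make the definitions tangible.

```python
def iteration_domain(N):
    """Hypothetical domain D_S: 0 <= i < N (serial), 0 <= j < 3 (parallel)."""
    return [(i, j) for i in range(N) for j in range(3)]


def schedule(v):
    """One-dimensional schedule theta_S(i, j) = i (zero coefficient for j)."""
    i, _j = v
    return i


# Assumed affine access functions F_0 and F_1, mapping (i, j) to (row, col).
ACCESSES = [
    lambda i, j: (i, j),
    lambda i, j: (i + 1, j + 2),
]


def parallel_slice(N, k):
    """P_S(k): all instances of the domain scheduled at time k."""
    return [v for v in iteration_domain(N) if schedule(v) == k]


def dataset(N, k):
    """M(P_S(k)): union of the images of the slice by every access function."""
    return {F(*v) for v in parallel_slice(N, k) for F in ACCESSES}


if __name__ == "__main__":
    print(sorted(dataset(4, 1)))
```

For this toy kernel every dataset contains six locations, three in row k and three in row k + 1, which is the set that a bank mapping has to spread over distinct banks.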
Problem Formulation
Although all statement instances in a parametric slice P S ( k) are scheduled in parallel, they normally access the same data structure in memory. Accesses that conflict on the same memory port may cause parallel instances to be serialized, introducing a considerable performance bottleneck. Ideally, if the memory locations accessed by the iterations within the same parametric slice (described by the M(P S ( k)) set) are mapped to independent physical banks, then full parallelization can be achieved.
Consider again Figure 2. Each point (i.e., each memory location of array A) is labeled with the identifier of the bank where the location is mapped. For instance, A[3][1] is mapped to memory bank 4. As shown in the figure, six different banks are used, and the mapping is such that there are never two equal labels in each red parallelogram of Figure 2(d) (i.e., full access parallelization is achieved).
To generalize the earlier reasoning, we introduce the concept of conflict count, denoted $MC(\vec k)$, identifying the maximum number of distinct memory locations in $M(P_S(\vec k))$ mapped to the same bank, as a function of $\vec k$. In essence, $MC(\vec k)$ represents the number of serialized distinct memory accesses in slice $P_S(\vec k)$, which is indicative of the time spent for handling memory references. In addition, notice that the definition of $MC(\vec k)$ only refers to distinct memory accesses. In fact, concurrently scheduled accesses might also include multiple references to the same location, possibly due to more than one parallel statement being executed. Notice that since we assume that the code is correctly parallelized, concurrent conflicting accesses cannot include more than one write operation, whereas the semantics of concurrent write and read operations assumes that the read access gets the old value held by the accessed location, which is consistent with the physical meaning of the concurrent accesses (e.g., in an FPGA memory bank). On the other hand, multiple read operations are simply handled by broadcasting the value to the different operators requesting it (e.g., FPGA-implemented processing blocks).
Based on the definition of conflict count, we can introduce our cost function, simply defined as the summation of values $MC(\vec k)$ across all the values of $\vec k$:
$$C_{time} = \sum_{\vec k} MC(\vec k) \qquad (1)$$
The preceding function is representative of the overall time cost incurred for memory operations across the execution of the entire kernel. As we will show in the following section, it can be expressed in a closed mathematical form with lattice-based memory partitioning. In particular, the Z-polyhedral theory provides useful results for counting the points in unions of parametric Z-polyhedra. Based on such results, the conflict count turns out to be a piecewise quasipolynomial function of the parameter $\vec k$ [Seghir 2012]. In case of multiple statements, as explained earlier, the dataset is indeed a list of sets, each parameterized in $\vec k$. In this case, the cost function $C_{time}$ can still be calculated for every dataset separately and then summed over all sets of the list to evaluate the overall cost, since any two accesses of two different datasets are executed sequentially.
For a given number of physical memory banks NB, each of size $SIZE_i$, the memory partitioning problem can be formulated as follows. A memory partition of an array A is a pair of scalar integer functions $g(\vec m)$, $f(\vec m)$. Vector $\vec m$ has as many components as the dimensionality of A. It represents a memory location and varies in the integer parallelepiped defined by A. Function $g(\vec m)$ identifies the bank to which $\vec m$ is mapped, whereas $f(\vec m)$ is the (linear) address in that bank. Clearly, we must have $0 \le f(\vec m) < SIZE_{g(\vec m)}$, whereas the total amount of memory taken by the arrays across the physical banks is $C_{size} = \sum_{i=0}^{NB-1} SIZE_i$.
ACM Transactions on Architecture and Code Optimization, Vol. 11, No. 4, Article 45, Publication date: January 2015.
The memory partitioning problem can be decomposed into
-a bank mapping problem, which consists of finding a suitable function $g(\vec m)$ that assigns all used locations to existing banks (i.e., $0 \le g(\vec m) < NB$) and minimizes $C_{time}$; and
-a storage minimization problem, which consists of finding a suitable function $f(\vec m)$ that avoids colliding assignments to the same bank (i.e., $f(\vec m_1) \ne f(\vec m_2)$ for any two distinct locations $\vec m_1$, $\vec m_2$ with $g(\vec m_1) = g(\vec m_2)$) and minimizes the total amount of memory $C_{size}$.
In the ideal case, C size coincides with the number of locations in array A, essentially meaning that the partitioned arrays are perfectly exploited with no holes in memory allocation.
The two preceding problems are tackled separately by two sequential steps. Since we aim to maximize parallelism, we determine the bank mapping as a first step by using the proposed lattice-based partitioning technique. Then, we solve the other problem by an optimal storage minimization approach described later.
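The two requirements on a memory partition (banks within range, no two locations sharing an address in the same bank) and the resulting storage cost can be checked mechanically. The following is a minimal sketch with a hypothetical helper named `check_partition`; it takes each bank size to be the largest address used plus one, which is one possible reading of SIZE_i, and it is exercised on a plain one-dimensional cyclic partition rather than on the lattice-based scheme.

```python
from itertools import product


def check_partition(shape, nb, g, f):
    """Validate a partition (g, f) of an array and return the bank sizes.

    shape: array extents; nb: number of banks;
    g(m) -> bank index, f(m) -> address inside that bank.
    """
    used = [set() for _ in range(nb)]
    for m in product(*(range(s) for s in shape)):
        bank, addr = g(m), f(m)
        if not 0 <= bank < nb:
            raise ValueError(f"location {m} mapped to nonexistent bank {bank}")
        if addr < 0 or addr in used[bank]:
            raise ValueError(f"colliding or negative address for {m}")
        used[bank].add(addr)
    # Bank size taken as highest used address + 1 (holes count as waste).
    return [max(u) + 1 if u else 0 for u in used]


if __name__ == "__main__":
    sizes = check_partition((10,), 4, g=lambda m: m[0] % 4, f=lambda m: m[0] // 4)
    print(sizes, sum(sizes))  # per-bank sizes and C_size
```

In this example the sum of the bank sizes equals the number of array elements, which corresponds to the ideal case with no holes mentioned above.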
LATTICE-BASED MEMORY PARTITIONING
The approach presented in this work aims to automatically define efficient partitioning choices minimizing structural conflicts on the memory ports. The proposed methodology assumes that the code is already parallelized by properly rearranging the loops [Griebl and Lengauer 1996; Bondhugula et al. 2007; Feautrier 1992], which corresponds to having zero columns in the schedule matrix, through a preliminary code transformation step based on existing polyhedral tools [Loechner 1999; Verdoolaege and Woods 2008; Seghir 2012]. In essence, the lattice-based memory partitioning technique proposed in this work
-regards a d-dimensional memory array as a polyhedron, precisely a hyperrectangle;
-partitions the hyperrectangle into separate sets, which are Z-polyhedra delimited by the same polyhedron (i.e., the memory array) but having different underlying affine lattices, obtained as translates [Barvinok 2002] of a particular lattice chosen by the methodology so as to minimize the conflict count; and
-generates the new polyhedral representation of the code featuring optimized memory accesses.
As an example, in Figure 2, each set of locations assigned to the same bank forms an integer lattice. Each lattice can be thought of as a translate of the lattice marked with 0, which we denote $L_0$. Mathematically, $L_0$ is a full-rank bidimensional lattice spanned by the basis $B_0 = \begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}$. The remaining lattices are affine lattices. For instance, the set of integer points $L_1$ can be obtained by translating $L_0$ by one position along the j dimension: $L_1 = L_0 + (1, 0)^T$.
In general, we define the lattice containing the origin as the fundamental lattice. As implied by the results summarized in Section 3, it forms a partition of $\mathbb{Z}^d$ along with its translates. The number of distinct translates, including the fundamental lattice itself, is equal to the determinant of the lattice [Barvinok 2002]. Since lattice-based partitioning maps each memory bank to a different translate, the determinant of the fundamental lattice must be equal to the number of actually used banks to fully cover the d-dimensional memory. This establishes a fundamental link between a physical characteristic, the number of memory banks, and the main mathematical objects handled by our technique, the lattices. The problem of bank mapping (i.e., determining the $g(\vec m)$ function defined earlier) directly corresponds to finding the best fundamental lattice.
Generation of the Solution Space
As implied by the previous remark, the total number of available memory banks NB is an upper bound to the determinant of the fundamental lattice to choose. We carry out an exhaustive search of all candidate lattices based on the ideas presented in Darte et al. [2005], considering all lattices having a determinant less than or equal to NB. Among the solutions reaching the minimum conflict count, we then choose the one requiring the least number of banks, that is, the least determinant. Two solutions ensuring the minimum conflict count and the same number of banks are deemed equivalent, although they may have different second-order implications on the implementation costs. As part of our future work, we will also take into account such implications.
Since d-dimensional lattices are spanned by full-rank d × d matrices, we can equivalently generate all matrices corresponding to distinct lattices. In that respect, given a matrix B of rank d in $\mathbb{Z}^{d \times d}$, there exists a unique lower triangular matrix $H = \{h_{ij}\}$ and a unimodular matrix U such that H = B · U and $0 \le h_{ij} < h_{ii}$ for all j < i. The matrix H is called the Hermite normal form (HNF) of B [Schrijver 1986]. Consequently, we can generate all integer lattices of a given determinant δ by enumerating all distinct lower triangular matrices $H = \{h_{ij}\}$ such that det(H) = δ and $0 \le h_{ij} < h_{ii}$. For a given rank d and determinant δ, the number of such matrices, denoted $H_d(\delta)$, is equal to
$$H_d(p \cdot q) = H_d(p) \cdot H_d(q), \qquad H_d(p^r) = \prod_{i=1}^{r} \frac{p^{\,i+d-1} - 1}{p^{\,i} - 1} \ \text{for a prime } p,$$
where p and q are relatively prime numbers [Newman 1972]. For a maximum number of banks NB, the search space thus contains $\sum_{\delta=1}^{NB} H_d(\delta)$ candidate lattices.
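The enumeration of candidate lattices can be sketched directly from the HNF conditions stated above. The code below is an illustrative brute-force generator, not the authors' implementation: `diagonals` and `hnf_matrices` are hypothetical helper names, and matrices are plain lists of rows.

```python
from itertools import product


def diagonals(d, delta):
    """All d-tuples of positive integers whose product is delta."""
    if d == 1:
        yield (delta,)
        return
    for h in range(1, delta + 1):
        if delta % h == 0:
            for rest in diagonals(d - 1, delta // h):
                yield (h,) + rest


def hnf_matrices(d, delta):
    """Lower triangular H with det(H) = delta and 0 <= h_ij < h_ii for j < i."""
    for diag in diagonals(d, delta):
        # Row i has i free entries left of the diagonal, each in [0, h_ii).
        ranges = [range(diag[i]) for i in range(d) for _ in range(i)]
        for below in product(*ranges):
            it = iter(below)
            yield [[next(it) if j < i else (diag[i] if j == i else 0)
                    for j in range(d)] for i in range(d)]


def search_space_size(d, nb):
    """Number of candidate lattices with determinant at most nb."""
    return sum(sum(1 for _ in hnf_matrices(d, delta)) for delta in range(1, nb + 1))


if __name__ == "__main__":
    print(sum(1 for _ in hnf_matrices(2, 6)))  # lattices of determinant 6 in 2-D
    print(search_space_size(2, 6))             # all 2-D candidates for NB = 6
```

For d = 2 the count for a given determinant coincides with the sum of its divisors, which gives a quick sanity check on the generator.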
Evaluation of the Solutions
Lattice-based partitioning enables the conflict count $MC(\vec k)$ to be expressed in a closed mathematical form. Call $L_t$ the t-th translate of a given fundamental lattice L, with 0 ≤ t < det(L). Then, we simply have
$$MC(\vec k) = \max_{0 \le t < \det(L)} \left| L_t \cap M(P_S(\vec k)) \right|$$
The intersection picks the points of the dataset M that belong to translate $L_t$. For each value of $\vec k$, the lattice incurring the maximum conflict count determines the worst-case serialization in the memory accesses. The set $L_t \cap M(P_S(\vec k))$ is a union of parametric Z-polyhedra. We first use the technique proposed in Seghir [2012] to count the integer points in the set as a function of $\vec k$, obtaining different expressions for determined intervals of $\vec k$. Then, by summing such expressions together, we obtain the overall value of the cost function $C_{time} = \sum_{\vec k} MC(\vec k)$. As an example, in Figure 2, we have $|L_t \cap M(P_S(\vec k))| = 1$ for each of the six translates t, and hence $C_{time} = N - 1$.
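The evaluation step can be illustrated by brute force on a small example. The sketch below does not use the symbolic point counting of Seghir [2012]; it enumerates a hypothetical dataset (three cells in row k and three cells in row k + 1, shifted by two columns, which is an assumption and not the kernel of Figure 2) and identifies the translate of each location by reducing it against a lower triangular HNF matrix H, with locations written as (row, column).

```python
from collections import Counter


def translate_of(H, m):
    """Canonical representative of m modulo L(H), for H in lower triangular HNF."""
    m = list(m)
    for i in range(len(H)):
        q = m[i] // H[i][i]
        # Subtract q times column i of H, which is zero above row i.
        for j in range(i, len(H)):
            m[j] -= q * H[j][i]
    return tuple(m)


def dataset(k):
    """Hypothetical dataset M(P_S(k)) as a set of (row, col) locations."""
    return {(k, j) for j in range(3)} | {(k + 1, j + 2) for j in range(3)}


def conflict_count(H, k):
    """MC(k): largest number of dataset points falling in the same translate."""
    per_bank = Counter(translate_of(H, m) for m in dataset(k))
    return max(per_bank.values())


def c_time(H, n_slices):
    """C_time: sum of MC(k) over all schedule values k."""
    return sum(conflict_count(H, k) for k in range(n_slices))


if __name__ == "__main__":
    for H in ([[2, 0], [0, 3]], [[1, 0], [0, 6]], [[6, 0], [0, 1]]):
        print(H, c_time(H, 8))
```

All three candidates use six banks, but only the first one spreads every dataset over six distinct translates; the other two serialize two and three accesses per slice, respectively, which is the kind of difference the cost function is meant to expose.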
Enumeration of the Translates
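The evaluation of the solutions requires the exhaustive enumeration of the translates of a given fundamental lattice. To this aim, we rely on the following property, which can be proven easily.
PROPERTY 1. Let L be a full-rank integer lattice and $H = \{h_{ij}\}$ its HNF. For any two distinct points $\vec m, \vec n \in L$, there exists at least one index i such that $|m_i - n_i| \ge h_{ii}$.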
The property implies that if we take a point $\vec m$ in L, we cannot find any different point of L in the parallelepiped having $\vec m$ as the lower corner and the diagonal elements $h_{ii}$ as heights. In fact, notice that the parallelepiped has a volume equal to the determinant of L because H is in HNF. As a direct consequence, considering $\vec m = \vec 0$, if we take all elements in the parallelepiped having the diagonal elements $h_{ii}$ as heights, we pick exactly |det(H)| points belonging to different translates. Since there are exactly |det(H)| translates, those points cover all of the possible translation vectors. As an example, the HNF of the fundamental lattice $L_0$ in Figure 2 is $\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}$, so the remaining five translates are given by $L_0 + \vec t$, with $\vec t$ = (1, 0), (2, 0), (0, 1), (1, 1), (2, 1).
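The enumeration of the translation vectors follows directly from the diagonal of the HNF. The sketch below lists them and, as a sanity check, reduces arbitrary points to their representative inside the box; the helper names are illustrative, and vectors are written in the same coordinate order as the matrix H.

```python
from itertools import product


def translation_vectors(H):
    """All integer points of the box [0, h_00) x ... x [0, h_dd)."""
    return list(product(*(range(H[i][i]) for i in range(len(H)))))


def translate_of(H, m):
    """Representative of m inside the box, for H in lower triangular HNF."""
    m = list(m)
    for i in range(len(H)):
        q = m[i] // H[i][i]
        for j in range(i, len(H)):
            m[j] -= q * H[j][i]  # subtract q times column i of H
    return tuple(m)


if __name__ == "__main__":
    H = [[3, 0], [0, 2]]
    print(translation_vectors(H))
    print(translate_of(H, (7, 5)))
```

Since each box point reduces to itself and every other integer point reduces to exactly one of them, the box contains one representative per translate, as stated by the property.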
Generation of the New Polyhedral Representation
After picking the best partitioning solution, a new version of the code must be generated so as to capture the allocation to memory banks. A possibility is to use an additional dimension in the partitioned array and then apply block or cyclic mapping. Notice, however, that we are mainly interested in the synthesis of hardware accelerators from high-level C code. Although existing HLS tools [Xilinx Inc. 2012] allow block or cyclic partitioning along a given dimension, they look at statements, not instances of statements. In other words, they parallelize two memory accesses, say $M_1$ and $M_2$, only if it can be proved that all instances of $M_1$ and $M_2$ in the iteration domain of the loop nest access different banks. To address this problem, we decided to explicitly partition the array in the code by declaring multiple distinct arrays. Consider the example in Figure 2. The statement in the loop body contains two memory references, or accesses, to array A, one of which is A[i][j]. Different values of the iterators (j, i) lead to a different bank accessed for each of the two references. For example, instance (1, 1) accesses banks 4 and 0, whereas instance (2, 1) accesses 5 and 1. Indeed, since the dataset has a constant shape and the banks in the example are periodic with period 3 along the j dimension and 2 along the i dimension, there are exactly six different cases (i.e., the number of banks NB), each corresponding to a different statement body.
In general, depending on the structure of the dataset, there may be up to $NB^h$ different combinations of bank accesses, where h is the number of memory references in the statement. The essential idea in our approach is to generate a different statement for each combination of bank accesses (e.g., six statements in the preceding example). To describe the new statements in terms of polyhedral representation, we start from the original iteration domain $D_S$ and generate NB smaller integer sets, each corresponding to a different statement body, covering a specific combination of bank accesses. To express this mathematically, further constraints must be added to the algebraic form describing the original polyhedron. Call r the number of distinct memory references in the statement, and let d be the dimensionality of the array (e.g., d = 2 for bidimensional arrays). For each combination p of bank accesses, the iteration domain of the corresponding statement can be expressed as follows:
$$D_{S_p} = \{ \vec v \in D_S \mid F_i \cdot \vec v \in L + \vec T^p_i,\ 0 \le i < r \},$$
where L is the fundamental lattice and $\vec T^p_i$ is the translation vector identifying the bank accessed by the i-th reference in combination p. Notice that the vectors $\vec T^p_i$ cannot be made parameters because we need to pass separate algebraic structures to the code generator, one for each version of the statement. The preceding structures can be handled by existing tools (e.g., ISL, CLooG) for code generation from the polyhedral model, which can identify and efficiently prune out empty polyhedra. In addition, notice that due to the potentially noninvertible nature of the affine access functions, the sets $D_{S_p}$ can be unions of Z-polyhedra rather than single Z-polyhedra. However, in case the access functions are invertible, $D_{S_p}$ can be proved to be a single Z-polyhedron, delimited by the same polyhedron as $D_S$, yielding a representation that can be manipulated more easily by existing polyhedral tools.
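The splitting of the iteration domain by combination of bank accesses can be sketched by enumeration. In the code below, the bank numbering (column modulo 3 plus three times row modulo 2) is an assumption chosen to agree with the label quoted for A[3][1], and the second reference A[i+1][j+2] is purely hypothetical: it was picked only because it reproduces the two bank pairs quoted for instances (1, 1) and (2, 1), and it is not claimed to be the reference used in Figure 2.

```python
from collections import defaultdict


def bank(row, col):
    """Assumed bank label for the lattice with period 3 along j and 2 along i."""
    return (col % 3) + 3 * (row % 2)


# Access functions mapping (i, j) to (row, col); the second one is hypothetical.
REFS = [
    lambda i, j: (i, j),
    lambda i, j: (i + 1, j + 2),
]


def split_domain(n_rows, n_cols):
    """Group the instances (i, j) by the tuple of banks their references hit."""
    parts = defaultdict(list)
    for i in range(n_rows):
        for j in range(n_cols):
            combo = tuple(bank(*F(i, j)) for F in REFS)
            parts[combo].append((i, j))
    return dict(parts)


if __name__ == "__main__":
    for combo, instances in sorted(split_domain(6, 6).items()):
        print(combo, instances[:3])
```

Each key of the returned dictionary corresponds to one specialized statement body, and the associated list is the explicit counterpart of the set that a polyhedral code generator would receive in constraint form. Only six of the 36 conceivable bank pairs are populated here.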
Storage Minimization
The problem of storage minimization, that is, the definition of the function $f(m)$, consists of assigning a new address within the new bank to each original memory location m. This corresponds to determining new access functions for each memory reference such that the proper location is accessed in each iteration. Call 
The scalar function $f(m)$ is then given by the linearized address of m in bank $g(m)$. It can be easily shown that the preceding solution is consistent and asymptotically optimal. In fact, by Property 1, if m and n are two distinct locations belonging to the same lattice, then there must be at least one i such that $|m_i - n_i| \ge h_{ii}$. Consequently, they are certainly mapped to different locations in the same bank, since | 
earlier) is upper bounded by
Under the assumption that $h_{ii} = o(D_j)$ for all i, j, the preceding upper bound on $C_{size}$ can be written
is the total amount of memory taken by the original array. In other words, $C_{size}$ is asymptotically equal to the original amount of memory; that is, the banks are densely populated, with a number of holes that becomes negligible as the size of the array is increased.
The preceding optimal solution involves asymptotically zero memory waste. Unfortunately, however, it requires an integer division by $h_{ii}$ for each component of m before each memory access. When $h_{ii}$ is a power of 2, which is indeed likely in practice, the division reduces to a right shift. For cases where $h_{ii}$ is not a power of 2, on the other hand, we propose an efficient solution that involves an arbitrarily small amount of asymptotic waste along each dimension of the memory. The solution is based on the idea of replacing the integer division by $h_{ii}$ with a cheaper multiplication by an integer constant a, followed by a right shift by b bits. In the following, we will refer to a single component $m_i$ for the sake of simplicity. In essence, we replace $m_i$ = . First of all, based on this choice, we have that 

Depending on the values of $h_{ii}$ and the actual cost of an integer division, one of the two solutions shown earlier can be adopted to effectively address the problem of storage minimization.

As an example, assume that we have $h_{ii} = 3$ for dimension i, as in the example of Figure 2, and that we require a percentage waste upper bounded by 10% (i.e., $\omega = 0.1$). We can choose $b > \log_2 (h_{ii}/\omega) = \log_2 30$ (e.g., b = 5) and $a = \lceil 2^b / h_{ii} \rceil = \lceil 32/3 \rceil = 11$. Then, in the transformed code, each array subscript corresponding to dimension i, expressed as $F_i \cdot v$ (where $F_i$ is the ith row of the access matrix F), will be replaced with $(11 \cdot F_i \cdot v) \gg 5$.
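The multiply-and-shift replacement can be checked numerically with a minimal sketch. The helper names are hypothetical, and the choice $a = \lceil 2^b / h \rceil$ is an assumption that is consistent with the values a = 11, b = 5 quoted for h = 3 and a 10% waste bound.

```python
import math

def shift_constants(h, max_waste):
    """Pick (a, b) so that (a * m) >> b stands in for m // h.

    b is the smallest integer with 2**b > h / max_waste, and
    a = ceil(2**b / h) (assumed choice, see the lead-in).
    """
    b = math.floor(math.log2(h / max_waste)) + 1
    a = -(-(1 << b) // h)      # integer ceiling of 2**b / h
    return a, b

def approx_index(m, a, b):
    """Division-free replacement for the subscript m // h."""
    return (a * m) >> b

a, b = shift_constants(3, 0.10)
print("a =", a, "b =", b)

# Along this dimension, elements sharing a bank are h = 3 apart.
size = 3000
for residue in range(3):
    members = range(residue, size, 3)
    mapped = [approx_index(m, a, b) for m in members]
    assert len(set(mapped)) == len(mapped)   # distinct elements never collide
    waste = (max(mapped) + 1) / len(mapped) - 1
    print(f"residue {residue}: waste = {waste:.2%}")
```

Because $a \cdot h / 2^b \ge 1$, two elements of the same bank that are h apart always land on different addresses. The price is a small fraction of unused addresses, which stays below the requested bound.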
A DETAILED CASE STUDY
To demonstrate the lattice-based partitioning technique, we built a prototype toolchain, depicted in Figure 3. The implementation is based on a variety of open-source tools for polyhedral analysis. First, the input is analyzed by Clan, which extracts a polyhedral representation. Then, it is scheduled to extract as much internal parallelism as possible. This step is optional since our mathematical framework does not deal with parallelism extraction but assumes an already scheduled loop nest. Subsequently, the representation is manipulated by ad hoc tools applying the proposed methodology. The process relies on the OpenScop format to represent the information on the code being transformed, PolyLib [Loechner 1999] to manipulate polyhedra, and ZPolyTrans [Seghir 2012] to handle images of Z-polyhedra by noninvertible functions and parametric counting of points in unions of Z-polyhedra. Other tools could serve the purpose, particularly those of Verdoolaege and Woods [2008]; ISL supports relations and hence Z-polyhedra, but it does not provide specialized algorithms [Iooss and Rajopadhye 2012]. Last, after manipulation, C code is regenerated using CLooG [Bastoul 2004].
Based on the preceding toolchain, this section provides a step-by-step illustration of the lattice-based partitioning technique applied to a real-world application, a bidimensional window filtering kernel, where a waveform/sequence is multiplied by a window function [Weisstein 2003]. Applications of window functions include spectral analysis, filter design, and beamforming. More specifically, in this section, we target bidimensional signals expressed as arrays (e.g., images) and 3 × 3 window functions expressed as nine separate coefficients. Figure 4(a) shows the bidimensional filter kernel consisting of a perfect nest made of three loops. The inner loops iterate over space, whereas the outer loop iterates over time since the filter may be applied several times to the same signal to amplify the effect or use different coefficients.
Some parallelism has been extracted by stripmining the innermost loop by a factor SF. Figure 4(b) shows the stripmined and parallelized code. Although we can set SF to any value less than N − 1, for the sake of simplicity, here we use SF = 2. The constant N is assumed to be even. All loops in the code are normalized; that is, their iterators start from 0 and have unit stride. The innermost for construct is marked as parallel. In fact, it is evident that it can be parallelized since there are no true dependencies carried by the three innermost loops.
In our example, the memory architecture is assumed to provide up to six independent memory banks. We apply the lattice-based partitioning technique to the input array A; a similar procedure may be applied to the output array Y .
Polyhedral model of the kernel. The kernel contains four loops, and hence the iteration vector is $v = (t, i, j, p, T, N, 1)$, with N and T being two constant parameters. The 

Modeling data-level parallelism. The first step of the methodology consists of modeling the parallelism of the memory accesses. First, we need to build the parallel parametric slice $P_S(k)$. In this example, $k = (k_1, k_2, k_3)$ because we have three serial loops. The parameters $k_h$ directly correspond to the iterators of the three outermost loops (i.e., t, i, and j) and vary within the same intervals. The parametric slice 
For each value of the parameter $k = (k_1, k_2, k_3)$, the slice turns out to contain two iterations, those corresponding to p = 0 and p = 1. The statement in the kernel of Figure 4(b) contains nine bidimensional affine accesses to array A, each having its own access function. As an example, ) · v as its access function. The next step thus concerns the identification of the datasets determined by these accesses, that is, the set $M(P_S(k))$ containing the memory locations accessed in parallel by concurrent iterations. As explained earlier, the datasets are the union of the images of $P_S(k)$ by the affine access functions. Each image represents a set of locations accessed in parallel by a specific memory reference in the statement. In this case, although the memory access functions are noninvertible, the images are still Z-polyhedra (here, indeed, they are simply polyhedra). Furthermore, their unions also turn out to be polyhedra. Figure 5 depicts a few examples of images and datasets for a couple of values of the parameter $k = (k_1, k_2, k_3)$. For $k = (t, 0, 0)$, the figure also shows the images of the slice $P_S$ by two single access functions. Overall, each dataset is formed by nine such images. Although the datasets are a function of k and their cardinality may generally vary, here all datasets always contain 12 memory points. In this example, they can be formally expressed as follows:
Generation of the solution space. Since there are six memory banks available, the methodology looks for the best bidimensional lattice-based partitioning solution having a determinant less than or equal to 6. Indeed, since each dataset contains 12 distinct accesses, the best we can do is to have 2 accesses per memory bank in each dataset, which is only possible with six banks. As a consequence, in this example, we will only discuss the case det(L) = 6. As explained in Section 4.1, enumerating all distinct bidimensional lattices is equivalent to generating all possible 2 × 2 matrices in HNF with a determinant equal to 6. Relying on the quantitative formula given in Section 4.1, we know that there are $H_2(6) = H_2(2)\,H_2(3) = 3 \times 4 = 12$ such matrices. Next, we list all of them:

Each enumerated solution is representative of an infinite number of bases, all equal up to a unimodular transformation. Figure 6 shows the lattices corresponding to $B_8$ and $B_3$. Once the solution space is generated, we need to evaluate each potential solution to identify the best fundamental lattice.
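The count of 12 candidate lattices can be reproduced with a short enumeration. This is a sketch under an assumed HNF convention (upper-triangular matrices whose columns generate the lattice, with the off-diagonal entry reduced modulo the first diagonal entry). The order in which the matrices are produced is arbitrary and does not correspond to the $B_1, \dots, B_{12}$ numbering used in the text.

```python
def hnf_bases(det):
    """All 2x2 upper-triangular HNF matrices [[a, b], [0, d]] with a * d == det.

    Assumed convention: the lattice is generated by the columns (a, 0)
    and (b, d), and the off-diagonal entry satisfies 0 <= b < a.
    Each matrix is returned as a pair of rows ((a, b), (0, d)).
    """
    bases = []
    for a in range(1, det + 1):
        if det % a == 0:
            d = det // a
            bases.extend(((a, b), (0, d)) for b in range(a))
    return bases

bases = hnf_bases(6)
for rows in bases:
    print(rows)
print(len(bases), "lattices of determinant 6")
```

The same function gives 3 matrices for determinant 2 and 4 for determinant 3, in line with the multiplicative formula quoted above.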
Enumeration of the translates. To proceed, we first need to enumerate the translates of each lattice identified previously. Take $B_8 = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ as an example. As pointed out in Section 4.3, we need to enumerate all points in the parallelepiped identified by the diagonal of $B_8$, here a 2 × 3 rectangle, with the lower corner in the origin. The points are as follows:
These vectors correspond to the six translates of the fundamental lattice needed to cover $\mathbb{Z}^2$. By construction, their number is equal to the number of memory banks.
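The enumeration of the translates amounts to listing the integer points of a box whose side lengths are the diagonal entries of the basis. A minimal sketch, with a hypothetical helper name:

```python
from itertools import product

def translates(diagonal):
    """Integer points of the box [0, h_11) x [0, h_22) x ... with corner at the origin."""
    return list(product(*(range(h) for h in diagonal)))

# Diagonal of the basis with entries 2 and 3 taken as the example in the text.
points = translates((2, 3))
print(points)       # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
print(len(points))  # 6, one translate per memory bank
```

The order of the points is the one produced by `itertools.product` and carries no particular meaning. Any fixed numbering of the six vectors can serve as a bank index.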
Evaluation of the solution space. To evaluate a solution, we compute the maximum number of conflicting accesses (i.e., accesses to the same memory bank) within all parallel datasets (see Section 4.2). For each parallel dataset, the conflicts for a given translate $L_t$ are given by the intersection of the affine lattice $L_t$ itself and the polyhedron describing the dataset. The number of integer points contained in such a Z-polyhedron, which we called $MC(k)$, represents, parametrically, the number of conflicts corresponding to a particular translate $L_t$ for a certain solution L. Figure 6 also shows graphically $MC(k)$ for $k = (t, 0, 0)$ and the translates corresponding to the vectors (0, 0) and (1, 0) for $L(B_8)$ and $L(B_3)$, respectively. The number of points in the first Z-polyhedron is 2, whereas we count 3 points in the second Z-polyhedron, resulting in worse performance. In both cases, they are independent of k. Following are the overall memory access times, computed by taking into account the worst-case conflicts within the datasets (which are necessarily serialized in time):
In Section 4.2, the expression of $C_{time}$ contained a sum on k. Here, because there is no dependence on k, the sum is reduced to a multiplication by the number of values spanned by $k = (k_1, k_2, k_3)$, i.e., $(N - 2) \cdot (N - 2) \cdot T$. Notice that $B_5$, $B_8$, $B_9$, and $B_{10}$ all achieve the minimum number of conflicts. Here, we choose $L = B_8$.

Based on the results in Section 4.5, we can simply replace each original access in the new arrays by scaling the column subscript by 2 and the row subscript by 3, obtaining an asymptotically zero memory waste. As an alternative, to avoid the integer division by 3, we can adopt the approximate approach introduced in Section 4.5, which consists of replacing the division with a multiplication by an integer a followed by a b-bit right shift. As already exemplified in Section 4.5 for the case $h_{ii} = 3$ and a maximum waste of 10%, we can use $b > \log_2 (3/0.1)$ (e.g., b = 5) and consequently a = 11.

Complying with the mapping choice previously identified, no more than two conflicts per parallel set of iterations are incurred. Consider the first two parallel statements $S_1$ and $S_2$ as an example. A0 appears three times, and thus the iteration seemingly contains three simultaneous accesses. However, the three accesses can only involve two different values because of the mapping that we enforced. In fact, A0[(11 · i) 

Validation of the cost function. To conclude our case study, we show a few results collected from a practical implementation of the kernel targeted at an FPGA technology by means of an HLS process. In particular, the preceding C code was synthesized to VHDL by the Vivado HLS tool [Xilinx Inc. 2012] and analyzed by cycle-accurate simulation. The actual execution times, denoted ET and expressed in terms of clock count, refer to the case N = 20, T = 8. They were measured for each of the 12 lattice-based partitioning solutions explored earlier:

The results provide a clear confirmation of the significance of the cost function $C_{time}$ that we used to drive the optimization process and confirm the impact that memory conflicts have on the actual execution time of the synthesized kernel.
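The evaluation step of the case study can be approximated by brute force, as a sanity check on the conflict counts discussed above. The sketch below rests on several assumptions: the kernel reads a 3 × 3 window `A[i + di][SF * j + p + dj]` (Figure 4 is not reproduced here), points are stored as (column, row), and the HNF convention is upper triangular with column generators. It counts bank conflicts by enumeration instead of the parametric counting performed by the toolchain.

```python
from collections import Counter
from itertools import product

SF = 2  # stripmining factor used in the case study

def hnf_bases(det):
    """2x2 HNF matrices ((a, b), (0, d)) with a * d == det and 0 <= b < a."""
    return [((a, b), (0, det // a))
            for a in range(1, det + 1) if det % a == 0
            for b in range(a)]

def bank(point, basis):
    """Coset of `point` modulo the lattice generated by columns (a, 0) and (b, d)."""
    (a, b), (_, d) = basis
    x, y = point
    q, ry = divmod(y, d)
    return ((x - b * q) % a, ry)

def dataset(k2, k3):
    """Locations of A read by the SF parallel iterations of slice (t, k2, k3).

    Assumes accesses A[i + di][SF * j + p + dj] with di, dj in 0..2.
    """
    return {(SF * k3 + p + dj, k2 + di)
            for p in range(SF) for di in range(3) for dj in range(3)}

def max_conflicts(basis, points):
    """Largest number of dataset points falling into the same bank."""
    return max(Counter(bank(pt, basis) for pt in points).values())

def worst_case(basis, span=4):
    """Worst conflict count over a few slices (the count does not depend on k)."""
    return max(max_conflicts(basis, dataset(k2, k3))
               for k2, k3 in product(range(span), repeat=2))

scores = {basis: worst_case(basis) for basis in hnf_bases(6)}
best = min(scores.values())
for basis, score in scores.items():
    print(basis, score, "<- minimum" if score == best else "")
```

Under these assumptions, each dataset has 12 points, the minimum worst-case conflict count is 2, and four of the 12 bases reach it, including the diagonal basis with entries 2 and 3. This agrees with the number of optimal solutions reported in the text, although the enumeration order does not reproduce the paper's numbering.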
RELATED WORK
The application of integer lattices to memory allocation problems is addressed by Darte et al. [2005], who propose a methodology for reducing the memory needed for the execution of an algorithm by analyzing variable liveness. The authors establish a correspondence between an integer lattice and a modular mapping of the array indices. As an example, the lattice spanned by the basis $\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$ may equivalently be expressed by the two-dimensional mapping $(b_0, b_1) = (j \bmod 2, i \bmod 3)$, where the pair $(b_0, b_1)$, with $0 \le b_0 < 2$ and $0 \le b_1 < 3$, is used as a bidimensional bank index. Although we base our work on a similar mathematical framework and use some of their results, such as the enumeration of the solutions, they solve a different problem, namely memory reuse. In particular, in our work, the determinant of the lattice corresponds to the number of memory banks, and the objective function is chosen to minimize the number of conflicts on memory ports. On the other hand, in their work, the number of memory locations required by an optimum modular mapping is equal to the determinant of the underlying lattice. Furthermore, they consider indices as conflicting when they are simultaneously live under a given schedule, whereas we have a conflict when there is an access to the same memory bank by simultaneous executions of loop iterations.
The development of mathematical methodologies for the problem of partitioning was also studied extensively for optimizing locally sequential globally parallel (LSGP) computations. Darte [1991] aims to find valid boxes (i.e., parallelepipeds of points) for systolic arrays. The solution space consists of all left Hermite forms of the so-called activity basis that identifies the computations simultaneously active. In Darte [1991], integer points represent computations that may or may not be mapped onto the same processor, rather than memory locations to be mapped onto specific banks. A similar problem is analyzed in Darte et al. [2002], providing a closed form for enumerating all schedules on a cluster of VLIW processors and constructing a linear function that schedules precisely one iteration per cycle of a loop nest. Memory partitioning for LSGP systems has also been studied by Gupta [1992], Chatterjee et al. [1995], and Verdoolaege et al. [2007] to reduce the communication among processors.
Recently, mathematical optimization techniques have targeted reconfigurable devices provided with multiple independent fine-grained memory blocks. In this scenario, minimizing the number of conflicts on the distributed memory banks is of paramount importance for performance. A few recent contributions [Cong et al. 2011; Li et al. 2012b; Wang et al. 2013, 2014] are concerned with partitioning of on-chip memory. In Cong et al. [2011], an automated memory partitioning algorithm is proposed to support multiple simultaneous affine memory references to the same array. In Li et al. [2012b], memory accesses in different loop iterations are partitioned across different memory banks and scheduled in the same cycle, minimizing the number of required banks. In the preceding works, a multidimensional array is first flattened into a single-dimensional array before partitioning. Since memory addresses after flattening depend on the array size, different partitioning schemes are generated for different array sizes, many of which are suboptimal. An improved approach is proposed in Wang et al. [2013], who model memory ports as n-dimensional hyperplanes. However, by limiting their solutions to a single family of hyperplanes, they might ignore some potential solutions. Unlike the work of Wang et al. [2013], which uses a single family of hyperplanes to express bank mapping, we essentially exploit a number of hyperplanes equal to the dimensionality of the array to be partitioned. Furthermore, the work of Wang et al. [2013] does not express parallel datasets formally. Hence, the methodology is valid only for the specific problem of avoiding memory conflicts when loop pipelining techniques are applied. Our technique, on the other hand, explicitly models parallel iterations by means of Z-polyhedral techniques and generalizes the solution space by adopting lattices instead of hyperplanes, which yields a strict superset of the solution space in Wang et al. [2013]. Wang et al. [2014] extend their previous work by introducing block-cyclic partitioning but still rely only on a single family of hyperplanes. The work in Liu et al. [2009] introduces a geometric programming framework combining data reuse with data-level parallelism. Although the problem is still related to data-level parallelization, they assume that each processing element is connected to a single memory and replicate data when needed. This assumption largely simplifies the problem but neglects the possibility of avoiding the replication of data, possibly incurring memory waste and a potential loss of available memory ports. Similar to our work, Chen and Postula [2000] use integer lattices in the context of memory partitioning to improve memory bandwidth by using different physical storage blocks. The work explores different periodic partitioning strategies by varying both the number of banks and the system clock frequency. Their main contribution is related to address generation given a specific partitioning strategy. However, their solution space exploration does not resort to any mathematical technique to model the code and its memory access patterns. As a consequence, the work is based on evaluating the schedule length and the system cost for each possible solution by adopting scheduling algorithms, such as list scheduling.
There are other approaches concerned with optimizing on-chip memory usage, although focused on orthogonal problems. The work in Lu et al. [2009] develops a compile-time framework for data locality optimization via data layout transformation targeting NUCA chip multiprocessors. Liu et al. [2007] use polyhedral abstractions to reduce the occupied on-chip scratchpad memory. The approach is useful when dealing with algorithms working with a great deal of data. Unlike our approach, it aims to improve data reuse instead of maximizing the available on-chip memory bandwidth. In contrast, Pouchet et al. [2013] introduce a framework in the context of HLS for transforming loop nests to maximize on-chip memory reuse and minimize accesses to off-chip memory blocks. A different problem is tackled in the work of Bayliss and Constantinides [2012], which proposes a technique to embed some physical features related to external SDRAMs in the iteration domains of code statements to minimize row activations and maximize reuse. It addresses external memory modules, and in that respect, it is complementary to our work, which is focused on on-chip memory. The work of Alias et al. [2013] optimizes remote accesses for offloaded kernels on reconfigurable platforms.
Comparisons with the Hyperplane-Based Approach
As summarized previously, the work of Wang et al. [2013, 2014] provides current state-of-the-art solutions for memory partitioning in the context of HLS, limited to solutions based on a single family of hyperplanes. In that respect, lattices represent an extension of previous approaches. In fact, partitioning solutions based on one single family of hyperplanes (e.g., $(j + i) \bmod 2$) can always be represented as lattices (e.g., $\begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}$). In some cases, hyperplane-based partitioning may achieve the same level of access parallelism as lattice-based partitioning, but at the price of more complex steering circuitry. For instance, the memory in the example of Figure 1(a) can be partitioned as shown in Figure 7(a) by using hyperplanes, fully exploiting the available four memory banks. However, since the access patterns to the different banks are not always the same in each dataset, each input of each adder needs to receive the operands from two different memory banks depending on the specific value of the iterators, complicating the steering logic of the datapath, as well as causing potential increases in area, clock period, and power consumption. Notice that because of the commutativity of addition, this logic overhead may be optimized out in this specific case. However, this is not true in general. On the other hand, covering a strict superset of hyperplane-based partitioning, lattice-based partitioning allows us to identify the more efficient solution shown in Figure 1(c), where in addition to the fully parallelizable read operations, the access patterns to the banks are the same for each dataset, inherently simplifying the steering logic.

To confirm the preceding considerations, Table I shows the area occupation measured in look-up tables (LUTs) on a Xilinx Virtex-7 FPGA for three different benchmarks. The considered benchmarks are typical HLS applications with loop nests having a high degree of parallelism. In particular, the first benchmark is an ordinary image resize kernel using bilinear interpolation. The remaining two benchmarks perform different stencil computations: a 2D Jacobi kernel and a 2D Gauss-Seidel kernel (e.g., see Leopold [2002] for the details of these algorithms). The kernels were coded in plain C and synthesized by means of the Vivado HLS tool [Xilinx Inc. 2012]. For each benchmark, we varied both the stripmining factors of the two innermost loops and the number of memory banks, here coinciding with BlockRAM components (BRAMs), normally available on Xilinx FPGAs. The table reports the number of LUTs used by the two techniques for the respective optimal solutions. Because of the spatial regularity of lattices, our technique achieves more compact datapaths having a simplified steering logic. These improvements are also reflected in the code complexity. In fact, Table I also contains the overall number of statements present in the final code. As shown by the table, for the chosen benchmarks, our method also helps to keep the code complexity reasonably low.
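The remark made earlier in this subsection, that a single-family hyperplane partitioning such as (j + i) mod 2 can be represented as a lattice, can be checked numerically. The sketch below assumes that the basis with rows (2, 1) and (0, 1) generates the lattice through its columns, and that points are written as (j, i). Both conventions are assumptions made for illustration.

```python
from itertools import product

BASIS = ((2, 1), (0, 1))   # rows of the matrix; columns (2, 0) and (1, 1) generate the lattice

def lattice_bank(point, basis=BASIS):
    """Coset of `point` modulo the lattice generated by columns (a, 0) and (b, d)."""
    (a, b), (_, d) = basis
    x, y = point
    q, ry = divmod(y, d)
    return ((x - b * q) % a, ry)

def hyperplane_bank(point, nb=2):
    """Bank index given by one family of hyperplanes: (j + i) mod nb."""
    j, i = point
    return (j + i) % nb

def same_partition(points):
    """True if the two mappings group the points into exactly the same sets."""
    pairs = {(lattice_bank(p), hyperplane_bank(p)) for p in points}
    lattice_labels = {l for l, _ in pairs}
    hyperplane_labels = {h for _, h in pairs}
    return len(pairs) == len(lattice_labels) == len(hyperplane_labels)

grid = list(product(range(-8, 8), repeat=2))
print(same_partition(grid))   # True
```

The two mappings use different labels for the banks, so the check compares the induced partitions and not the label values.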
In addition to a more complex steering logic, there may be situations where the work of Wang et al. [2013] requires more memory banks than strictly necessary to achieve full parallelization, particularly when the kernel has a complex control flow. As an example, consider the code in Figure 8 . The kernel contains a normalized perfect loop nest with three statements. There is a statically predictable branch, so the nest is still an affine SCoP. Loop iterations access a variable shape 2 × 2 window. In particular, Figure  8 ). This solution cannot be expressed as a set of translate hyperplanes. Figure 8(b) shows all possible hyperplane-based solutions. The green line represents the hyperplane causing the conflict. It can be easily recognized that four memory banks are not sufficient here to avoid conflicts completely, and the optimal solution identified by the lattice-based partitioning technique is missed.
CONCLUSIONS AND FUTURE DEVELOPMENTS
This work addresses the problem of automated memory partitioning for emerging architectures, such as reconfigurable hardware platforms, providing the opportunity of customizing the memory architecture based on the application access patterns. Targeted at affine SCoPs, the technique exploits the Z-polyhedral model for program analysis, yielding a powerful and elegant formalism capturing both the problem of bank mapping and storage minimization. In particular, the approach was based on integer lattices, enabling us to generate a solution space for the bank mapping problem, which includes previous results as particular cases. The problem of storage minimization, on the other hand, was tackled by an optimal approach ensuring asymptotically zero memory waste, or as an alternative, an efficient approach ensuring arbitrarily small waste. The theoretical results were also demonstrated through a prototype toolchain and a detailed step-by-step case study described in the article, along with some comparisons with different approaches found in the technical literature.
The work also opens up a range of further investigation paths. First of all, as pointed out by our case study, there may be different lattices all achieving the minimum number of conflicts for a given number of banks. However, they might not be equivalent in terms of the generated code, which may cause different delays and possibly area results in those computing platforms where the code is directly translated to hardware, such as FPGAs. Analyzing these effects systematically is part of our future work. Furthermore, although the search space is likely to be limited for practical problems, we will also explore the adoption of ad hoc heuristics to explore it more efficiently. In particular, from a purely mathematical point of view, there are situations where given a certain determinant, the solution space of the lattice-based partitioning technique collapses to one single family of hyperplanes. Although this happens for low dimensionalities of the array and very low numbers of memory banks, a precise formalization would help to reduce the solution space in such cases. A further possibility of improvement concerns the storage minimization scheme. Although it is asymptotically optimal in terms of memory waste, it does not include liveness analysis of memory locations, unlike that in the work of Darte et al. [2005] . We thus plan to extend our methodology with per-bank liveness analysis. In addition, like most similar works, the approach in this article is focused on partitioning a single array in memory. When different arrays are concurrently accessed in the same kernel, they may be processed separately, or alternatively, they may be seen as parts of a single memory space. These choices might variously impact performance or result in additional opportunities for optimization, leaving room for further developments in this direction. 
Last, instead of making the partitioning explicit in the code, such as by using different array names, which leads to a static assignment of banks, a different possibility would be to insert ad hoc hardware components that route the memory requests to the corresponding banks by computing the mapping dynamically. This possibility was not explored here, essentially because a static code-level solution can be easily automated and does not interfere with the HLS process in itself. However, its implementation would transfer the complexity of the approach from the software to the underlying hardware architecture. We thus consider the automated generation of such hardware memory access managers as a potential future development of this work.
