In this paper, we present a comprehensive HLS methodology for the design of distributed logic-memory architectures. Given an input behavior, our methodology can automatically determine a very good partitioning of operations (computations and memory accesses), along with the requisite partitioning of physical memory. The techniques are designed to be applicable to behaviors with complex array access patterns. Therefore, the final output of synthesis can be a homogeneous or heterogeneous set of partitions, depending on the behavior being analyzed.
[...] as well as the manner in which the array references are serviced. Several HLS enhancements have addressed the memory organization issues by providing an [...] configurations (monolithic, multi-ported, banked, hierarchical, etc.) [14].
Our methodology incorporates the following novel components: (i) behavioral simulation to determine the array access footprints (the array locations accessed by a reference, and the frequency of access), which allows us to handle array references with index functions of arbitrary complexity, (ii) clustering techniques to partition array access operations into groups according to their locality and data access workloads, (iii) "min-cut" style partitioning to derive a very good partitioning of operations so as to minimize overall execution time (including synchronization overheads), and (iv) an iterative improvement scheme to determine a good physical memory partitioning. We have evaluated our techniques within the framework of an industrial HLS tool to synthesize distributed logic-memory implementations for several benchmark applications. Our techniques demonstrate that up to 2.2X and 1.5X performance speed-ups are possible over conventional designs and homogeneously partitioned designs, respectively.
The rest of this paper is organized as follows. [...]
[...]ing an efficient mapping of computations in a behavior to components in an RTL controller/datapath implementation. For memory-intensive behaviors, it is critical to examine the memory organization in the ASIC [...]
Acknowledgments: This work was supported by NJCST Center for Embedded System-on-a-chip Design and NEC.
In this section, we illustrate the various issues involved in automatically deriving multi-partitioned architectures for a given behavior. These issues are motivated by the general observations outlined below.
• Due to the complex and irregular nature of the array data access patterns of some applications, the use of homogeneous distributed logic-memory partitioning may not benefit the overall performance of the ASIC. However, as shown in Example 1, it is possible to realize performance gains when data and computations are partitioned in a manner that is correlated with the way in which array data get accessed.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD'03, November 11-13, 2003, San Jose, California, USA. Copyright 2003 ACM 1-58113-762-1/03/0011 ... $5.00.
• There are many possible ways of partitioning data and computations heterogeneously for a given circuit behavior. Example 2 demonstrates that a large design space necessitates that several considerations be kept in mind while trying to obtain the "best" partitioned implementation. These include (a) servicing the data requests of computations in a given partition to a large extent from its local memory, since poor locality will lead to remote memory accesses and hence, large communication overheads, and (b) balancing the partitioned system in terms of both memory accesses and computational workloads, since the overall performance is determined by the sub-system which requires the longest execution time.
• Computations in a given partition are likely to be data-dependent on the intermediate results of computations in other partitions. In other words, communication between different partitioned sub-systems includes not only remote memory accesses, but also data values produced during the execution of computations in other partitions. Therefore, synchronization of computations among different partitions becomes essential. Example 3 points out that the synchronization for a distributed architecture requires existing HLS techniques to be suitably enhanced to guarantee functional correctness of the entire system. Since synchronization can introduce significant latencies, an efficient arrangement of synchronization becomes a key factor in minimizing the total system execution time.
[...] h, x and y stored in a monolithic memory block. We assume that addition and subtraction operations take two cycles to complete, multiplications take four cycles to complete, and all other computations are single-cycle operations. Read (write) operations to the single-ported memory unit take an access time of four cycles.
Given a resource library consisting of two multipliers, two adders, two subtracters, four comparators and four shifters, the resulting implementation using conventional HLS takes 6,401 cycles to complete the Haar transformation for N = 64.
2. Using the framework described in [21], we can derive a two-way homogeneous distributed architecture as shown in Fig. 3. The system has three main components: a top-level logic and partitions A_hom and B_hom. The system takes as its inputs a flag INTL that initiates data processing and an input/output channel DATA through which input data are streamed. The top-level logic responds to the system inputs and co-ordinates the functioning of partitions A_hom and B_hom. Each partition consists of RTL controller/datapath implementations (logic) with local memory. The logic in each partition executes its assigned computations concurrently, while reading/writing appropriately from its local memory (accesses to memory in partition A_hom (B_hom) by the logic in partition A_hom (B_hom)) or remote memory (accesses to the memory in partition B_hom (A_hom) by the logic in partition A_hom (B_hom)). After the computations terminate, the results are finally output through DATA with the output flag DONE set to 1.
behavior given in Fig. 1.
Similar to a two-way homogeneous architecture, it has three components: the two partitions A_heter and B_heter, and the top-level logic. The main difference between the homogeneous system of Fig. 3 and the heterogeneous system of Fig. 5 lies in the fact that array data and computations are distributed unequally to partitions A_heter and B_heter. Partition A_heter is assigned the first 10 rows of data in arrays h and y. The remaining regions of the arrays are assigned to partition B_heter (refer again to Fig. 4). The iteration space is divided into two unequal loop tiles given by {(0 ≤ i < 10), (0 ≤ j < 64)} and {(10 ≤ i < 64), (0 ≤ j < 64)}, which are bound to partitions A_heter and B_heter, respectively. The performance of the above system is better than both the conventional HLS and two-way homogeneous distributed HLS cases. The main reason for the performance speed-ups is that "unequal" distributions of data and computations lead to a well-balanced workload. Sub-system A_heter takes 2,874 cycles to complete and sub-system B_heter 2,971 cycles. The two-way heterogeneous system reduces the total execution time to 2,971 cycles (2.2X and 1.5X speed-ups over the conventional and two-way homogeneous implementations, respectively).
[...] B_heter as follows: [...] by (i, j), which is shown on the right side, indicates the access regions of the two loop tiles L(A_heter) and L(B_heter). This heterogeneous partitioning scheme of array h can support all the data accesses by the local memories in both the partitions, which is a communication-free partitioning solution.
2. Fig. 7 shows another memory partitioning choice. Array h is divided evenly into h1 and h2. In sub-system A_heter, h1 is able to provide all the data needed by the computations of L(A_heter). However, computations of L(B_heter) require data residing in h1, which is a remote memory located in sub-system A_heter. The global communications for remote memory accesses degrade the overall circuit performance. With this memory partition, sub-system A_heter takes 3,315 cycles to complete and B_heter 3,207 cycles. The total execution time is 3,315 cycles (11.6% longer compared to the first solution).
In the next example, we will first examine the synchronization overheads in a partitioned architecture and motivate why partitioning must be judiciously performed to minimize these overheads.
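The cycle counts quoted in Examples 1 and 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes, as the text states, that the system execution time is set by the slowest sub-system; the function and variable names are illustrative and do not come from the paper.

```python
def system_cycles(partition_cycles):
    """Total execution time is set by the slowest sub-system."""
    return max(partition_cycles)

conventional = 6401                            # conventional HLS, Haar, N = 64
first = system_cycles([2874, 2971])            # communication-free partitioning
second = system_cycles([3315, 3207])           # array h split evenly into h1 and h2

speedup_vs_conventional = conventional / first
slowdown_pct = 100 * (second - first) / first

print(f"speed-up over conventional HLS: {speedup_vs_conventional:.1f}X")
print(f"second memory partitioning is {slowdown_pct:.1f}% slower")
```

Both printed figures agree with the 2.2X and 11.6% values reported in the text.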
Example 3: The C behavior in Fig. 8 performs the Arakawa-Jacobian formulation, from the OCEAN program in the SPLASH-2 benchmark suite [24]. The CDFG representation of the behavior is shown in Fig. 9(a). Each iteration of the two-level nested loop in the behavior fetches resource data from arrays x and y, and stores the result data back to array z. Since some computations are mainly related to data arrays x and y, separately, we consider partitioning the given behavior into two heterogeneous sub-behaviors (Jacob_partition_A1 and Jacob_partition_B1) as follows.
• Sub-behavior Jacob_partition_A1 contains all the data accesses to array y, and the computations mainly dependent on those fetched resource data. They are {+3, ..., +10, +14, ..., +16}, {-7, ..., -10, -15, ..., -18} and {*5, ..., *8}.
• Data accesses to arrays x and z ({MRS, ..., MR17} and MW1), as well as the computations dependent on those references, including {+11, ..., +13, +17}, {-3, ..., -6, -11, ..., -14} and {*1, ..., *4, *9}, are assigned to Jacob_partition_B1.
• The indexing operations on (i, j) are duplicated in each partition.
The partitioned loop body of the Arakawa-Jacobian behavior is shown as a CDFG in Fig. 9(a). Since some computations are dependent on the intermediate results from the other partition in the execution of each iteration (for example, computation +17 in Jacob_partition_B1 depends on data from computation +16 in Jacob_partition_A1), global communication to transfer these data becomes necessary. [...] The operations on the indices are also duplicated in each partition.
In this solution, data-dependent communications exist only along edges {C1, ..., C4} and {D1, ..., D5}. With lower inter-partition data communications, the entire system takes 36,966 cycles to complete (2.0X and 1.2X speed-ups over the corresponding numbers for the conventional and the aforementioned heterogeneous solutions). Since synchronization can have a significant influence on the total execution time, an efficient arrangement of synchronization becomes a key factor in improving the system performance. Note that we use a handshaking scheme to synchronize the data transfer between the two sub-systems in both the cases. Fig. 10 shows the transformed sub-behaviors for partitions Jacob_partition_A2 and Jacob_partition_B2 shown in Fig. 9(b). We illustrate the synchronization process using data edge C1. When the computation of -11 is finished, the result C1 is sent to Jacob_partition_A2 by function SENDDATA, and a DATAREADY signal is set to 1. When [...]
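The handshaking idea can be mimicked in software with two threads standing in for the two sub-systems. This is only a sketch under stated assumptions: the text confirms that the producer sends the value and raises DATAREADY, while the acknowledge step, the `Channel` class and the function names below are hypothetical additions needed to make a complete two-way handshake.

```python
import threading

class Channel:
    """One-word channel guarded by a ready/acknowledge handshake (assumed protocol)."""
    def __init__(self):
        self.value = None
        self.data_ready = threading.Event()   # plays the role of DATAREADY
        self.ack = threading.Event()          # hypothetical acknowledge signal

    def send_data(self, value):
        self.value = value
        self.data_ready.set()                 # DATAREADY = 1
        self.ack.wait()                       # block until the consumer has read
        self.ack.clear()

    def recv_data(self):
        self.data_ready.wait()                # wait for DATAREADY = 1
        value = self.value
        self.data_ready.clear()               # DATAREADY = 0
        self.ack.set()                        # release the producer
        return value

def run_handshake(values):
    """Producer sends one value per loop iteration; consumer collects them in order."""
    chan, received = Channel(), []

    def producer():
        for v in values:
            chan.send_data(v)

    def consumer():
        for _ in values:
            received.append(chan.recv_data())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(run_handshake([3, 1, 4]))
```

Every transfer stalls one side until the other arrives, which is why the number of cross-partition edges has such a direct effect on the total cycle count.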
III. METHODOLOGY AND ALGORITHMS
In this section, we describe our methodology for synthesizing heterogeneous multi-partitioned architectures. Section III-A presents an overview of the proposed framework, while Section III-B details the constituent steps.
A. Overview
Fig. 11 outlines the proposed synthesis methodology. The inputs to this framework include the behavioral description of the circuit, design constraints (area and resource constraints) and the parameters to the different optimizing steps. The final output is a multi-partitioned RTL implementation.
The algorithm starts with a behavioral profiling step (step 1) to extract simulation statistics of array data references and computations, which are used by the subsequent partitioning steps. The simulation profile of an array data reference enables us to determine its "footprint" (which array locations are accessed and the frequency of access). Using these footprints, step 2 employs a data clustering algorithm [25] to cluster array data reference operations with similar access patterns.
Step 3 then partitions the computations in the behavior into the different clusters by minimizing a cost function that considers the balancing of workloads, locality of data accesses and synchronization overheads. A modified Kernighan-Lin heuristic [26] is used in this step.
After behavioral partitioning into multiple partitions, the next goal is to distribute the array data judiciously into physical memories local to the different partitions, while considering the effect of data distribution on the array references for each sub-behavior (since array data references can become local or remote memory accesses based on the array data distribution). Steps 4-6 perform this data distribution for each array as follows.
Step 4 first determines a designer-specified number of "hotspots" (most accessed regions) of data accesses in an array. These hotspots are the candidate or seed locations around which data partitions are iteratively evolved (step 5), starting with a window of predefined size (data local to each window resides completely within a partition). At each seed location, the window is successively expanded, while minimizing the total memory access time due to the different partitioned sub-behaviors.
Step 6 then selects the best data distribution using the windows evolved at different seed locations.
Finally, Step 7 inserts synchronization code into the behavior to enforce correct communication between the different partitions.
Step 8 then proceeds to perform conventional HLS for individual partitions. This results in an RTL implementation of a heterogeneous distributed architecture, with two or more partitions.
B. Algorithms
We now describe the various features of our algorithm by detailing step 1 in Section III-B.1, step 2 in Section III-B.2, step 3 in Section III-B.3 and steps 4-6 in Section III-B.4.
B.1 Behavioral profiling and simulation statistics
Profiling in our methodology consists of (a) enhancing the behavioral description with new variables that can record the array access footprint [...] where i and j are the row and column indices, respectively, within the range (0 ≤ i < N, 0 ≤ j < M). The results are rounded up or down to the nearest integers. For example, the center-of-gravity of A_CNT in Example 4 is (center_i = 9, center_j = 8) (the gray cell in Fig. 12).
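A minimal sketch of footprint recording and of a center-of-gravity computation follows. The defining equation did not survive in the text above, so the access-count-weighted mean used here is an assumption, as are the function names and the tiny access trace; only the shadow-array idea and the rounding to the nearest integer come from the text.

```python
def profile_reference(n_rows, n_cols, accesses):
    """Build a shadow array counting how often each array cell is accessed."""
    cnt = [[0] * n_cols for _ in range(n_rows)]
    for i, j in accesses:
        cnt[i][j] += 1
    return cnt

def center_of_gravity(cnt):
    """Access-count-weighted mean row/column index, rounded to the nearest integer.

    Assumed definition: center_i = sum(i * cnt[i][j]) / sum(cnt[i][j]),
    and likewise for center_j.
    """
    total = sum(sum(row) for row in cnt)
    ci = sum(i * c for i, row in enumerate(cnt) for c in row) / total
    cj = sum(j * c for row in cnt for j, c in enumerate(row)) / total
    return int(ci + 0.5), int(cj + 0.5)       # round half up

# Hypothetical trace of one array reference: cell (3, 3) is read three times.
trace = [(0, 0), (3, 3), (3, 3), (3, 3)]
a_cnt = profile_reference(4, 4, trace)
print(center_of_gravity(a_cnt))
```

In an instrumented C behavior the same counters would be incremented next to each array reference during simulation; the Python list of lists simply stands in for that shadow array.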
The technique used to partition array data references is the hierarchical agglomerative clustering algorithm, which has been widely used in statistics and data mining [25]. For a given set of K references to, say, a two-dimensional data array in the behavior, the input to the algorithm is a K x 2 center-of-gravity matrix CG, in which each row corresponds to a data reference and the two column entries list the values of its center-of-gravity (center_i and center_j), respectively. The clustering algorithm is briefly outlined below.
1. Compute the Euclidean distance separation metric between every pair of rows in CG (each row of CG corresponds to the center-of-gravity of a data reference).
2. Select the two rows in CG with the lowest distance separation and merge them into a single cluster. Determine the center-of-gravity corresponding to this cluster by adding the shadow arrays corresponding to the two rows. Update CG to delete the two selected rows and add a new row corresponding to the center-of-gravity of this cluster.
3. Repeat steps 1 and 2 until all references are in a single cluster (CG has a single row).
In this way, clusters at any given stage of the algorithm are formed by grouping clusters incrementally from the previous stage. Such a nested grouping of clusters is represented as a graph called a dendrogram. The dendrogram can be broken at different levels to yield different clusterings of references. We use an example to illustrate the results of clustering.
Example 5: Fig. 13(a) shows seven data accesses {MR1, ..., MR7} to data array A. The gray boxes represent the footprints of those data references, while the black circles correspond to their centers of gravity. The final dendrogram corresponding to the data references is given in Fig. 13(b). This figure also shows that the data references are partitioned (marked by the dashed line) into three distinct clusters {C1, ..., C3} ({MR1, MR2, MR4}, {MR6}, {MR3, MR5, MR7}).
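The three clustering steps listed above can be sketched as follows. Shadow arrays are represented as dictionaries mapping a cell (i, j) to its access count, and merging stops once a requested number of clusters remains, which corresponds to breaking the dendrogram at that level. The names, the `stop_at` parameter, the unrounded centers of gravity and the four sample references are assumptions for illustration.

```python
import math
from itertools import combinations

def cog(shadow):
    """Access-count-weighted center of gravity of a shadow array {(i, j): count}."""
    total = sum(shadow.values())
    ci = sum(i * c for (i, _), c in shadow.items()) / total
    cj = sum(j * c for (_, j), c in shadow.items()) / total
    return ci, cj

def agglomerate(shadows, stop_at=1):
    """Repeatedly merge the two clusters whose centers of gravity are closest."""
    clusters = [({name}, dict(sh)) for name, sh in shadows.items()]
    merges = []                                   # merge history (the dendrogram)
    while len(clusters) > stop_at:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: math.dist(cog(clusters[p[0]][1]),
                                           cog(clusters[p[1]][1])))
        merged = dict(clusters[a][1])
        for cell, count in clusters[b][1].items():  # add the two shadow arrays
            merged[cell] = merged.get(cell, 0) + count
        merges.append((sorted(clusters[a][0]), sorted(clusters[b][0])))
        members = clusters[a][0] | clusters[b][0]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((members, merged))
    return [sorted(members) for members, _ in clusters], merges

# Four hypothetical references: two near the top-left, two near the bottom-right.
refs = {"MR1": {(0, 0): 4}, "MR2": {(0, 1): 2},
        "MR3": {(7, 7): 3}, "MR4": {(7, 6): 1}}
groups, history = agglomerate(refs, stop_at=2)
print(sorted(groups))
```

Running with `stop_at=1` records the full merge history, from which any intermediate clustering can be read off.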
B.3 Behavioral partitioning and cost function
Behavioral partitioning involves dividing the CDFG representation of the given behavior into sub-graphs, while minimizing a cost function that considers the overall circuit performance. We define the cost of a partitioning solution as the critical path of the entire partitioned behavior.
Example 6: Suppose N data reads {MR1, ..., MR8} to array A[S][8] are defined in a given C behavior, which are clustered into two groups C1 and C2. Shadow arrays A_CNT1 and A_CNT2 show the data access patterns for array A due to C1 and C2, respectively (see Fig. 14). Consider the scenario when the behavioral partitioning assigned C1 to partition sub-system1 and C2 to sub-system2 in a two-way heterogeneous architecture. The distribution of data array A into those two sub-systems can now be determined as follows.
Since each partition is synchronized on a per-iteration basis for loop body partitioning, finding the critical path for a partitioned behavior involves enhancing the CDFG representation of the original loop body to account for partitioning, and tracing the critical path through it. We modify the CDFG representation by adding two kinds of virtual delay nodes into the CDFG.
• If data read by an operation in one partition are the output of a data reference in another partition, a remote memory access is clearly necessary. We insert a virtual delay node, denoting a remote memory access, on the corresponding edge in the CDFG, which has a timing delay representing the latency difference compared to a local data reference. A similar addition is made for a remote write.
• Whenever there are data dependencies on intermediate computation results between different partitions, synchronization is needed for each data transfer. The corresponding edge connects two computation nodes across the boundaries of two different partitions. A virtual delay node SYNC is added to incorporate the synchronization overheads.
Since the physical data distribution is not available at this juncture, we assume all the data references made in each partition are local memory accesses. By using the simulation statistics of computational operations (including the probability with which conditionals are executed) obtained in step 1 of Fig. 11, critical path analysis is performed to determine the cost of a partitioning solution.
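The cost evaluation described above amounts to a longest-path computation over the loop-body CDFG, with extra delay charged on edges that cross a partition boundary. The following is a simplified sketch: the virtual delay nodes are folded into edge penalties, conditional execution probabilities are ignored, and the graph, the partition labels and the `remote_delay` and `sync_delay` values are all hypothetical. The operation latencies (memory access 4 cycles, add/subtract 2, multiply 4) follow the assumptions stated in Example 1.

```python
from graphlib import TopologicalSorter

def critical_path(delay, edges, part, is_mem, remote_delay, sync_delay):
    """Longest path through a DAG with penalties on cross-partition edges."""
    preds = {node: [] for node in delay}
    for u, v in edges:
        preds[v].append(u)
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        start = 0
        for u in preds[node]:
            extra = 0
            if part[u] != part[node]:
                # remote memory access if the producer is a memory reference,
                # otherwise a SYNC for an intermediate computation result
                extra = remote_delay if is_mem[u] else sync_delay
            start = max(start, finish[u] + extra)
        finish[node] = start + delay[node]
    return max(finish.values())

delay = {"MR1": 4, "add1": 2, "sub1": 2, "mul1": 4}
edges = [("MR1", "add1"), ("MR1", "sub1"), ("add1", "mul1"), ("sub1", "mul1")]
is_mem = {"MR1": True, "add1": False, "sub1": False, "mul1": False}

single = {n: "A" for n in delay}
split = {"MR1": "A", "add1": "A", "sub1": "B", "mul1": "B"}

print(critical_path(delay, edges, single, is_mem, remote_delay=3, sync_delay=5))
print(critical_path(delay, edges, split, is_mem, remote_delay=3, sync_delay=5))
```

In this toy graph the unpartitioned critical path is 10 cycles and the split one is 15, which shows how a poorly chosen cut can cost more in synchronization than it gains in parallelism.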
Partitioning starts with the data reference clusters derived by step 2 in Fig. 11, by assigning data reference clusters to individual partitions. The computational nodes are successively assigned to a partition whenever all their inputs originate from that partition. The remaining nodes are randomly partitioned to form the initial partitioned graph. After this step, we utilize the Kernighan-Lin heuristic [26] with the above cost function to find the final partitioning.
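The construction of the initial partitioned graph might look like the sketch below. It covers only the seeding step; the Kernighan-Lin refinement is not shown. The data layout (a dictionary from each node to its list of input nodes), the function name and the sample graph are assumptions for illustration.

```python
import random

def initial_partition(preds, seed_assignment, partitions, rng):
    """Seed partitions with clustered memory references, then grow them.

    A node joins a partition only when all of its inputs already belong to
    that single partition; whatever is left over is assigned at random.
    """
    part = dict(seed_assignment)
    changed = True
    while changed:
        changed = False
        for node, inputs in preds.items():
            if node in part or not inputs:
                continue
            if all(p in part for p in inputs):
                owners = {part[p] for p in inputs}
                if len(owners) == 1:
                    part[node] = owners.pop()
                    changed = True
    for node in preds:
        if node not in part:
            part[node] = rng.choice(partitions)   # mixed-input or input-free nodes
    return part

preds = {"MR1": [], "MR2": [], "add1": ["MR1"], "sub1": ["MR2"],
         "add2": ["add1"], "mul1": ["add1", "sub1"]}
seeds = {"MR1": "A", "MR2": "B"}                  # from the reference clusters
result = initial_partition(preds, seeds, ["A", "B"], random.Random(0))
print(result)
```

Only `mul1`, whose inputs come from both partitions, is left to the random choice here; an iterative improvement pass would then move such nodes to wherever the critical-path cost is lowest.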
B.4 Data distribution in physical memories
In a partitioned architecture, each partition is associated with a fraction of each data array. The manner in which array data get distributed into the physical local memories residing in each partition, so as to best support the partitioned behavior, becomes crucial in delivering an efficient heterogeneous distributed RTL circuit implementation. The best data distribution choice would, therefore, maximize the data accesses made by a partitioned sub-behavior to the corresponding local memories so as to minimize inter-partition data communication for remote memory accesses. For a given choice of distribution of array data into K partitions, we can formulate a cost function to evaluate the choice as follows (assume that a partition takes n cycles for a single read/write to its local memory [...]): [...] where local_acc(i) and remote_acc(i) are the number of local memory accesses and remote memory accesses by sub-system i (0 ≤ i < K).
We will now illustrate our strategy of iteratively searching for the best physical memory partitioning by minimizing the above cost function.
[...] The total cost is acc_cost = acc_cost(1) + acc_cost(2) = 471.
Expanding the memory window: The memory window in the shadow array is expanded in the direction which results in the largest decrease in the cost function. In this example, the four choices of expanding the initial window in the left, up, right and down directions have the costs cost_left = 472, cost_right = 464, cost_up = 477 and cost_down = 463. Therefore, "down" is the best direction in which to expand the window to reduce the memory access cost. This step stops when no cost reduction is possible. The final window is shown in A_CNT1 in Fig. 14 with the dashed line as boundary. The best data distribution in this example is obtained by assigning the final window region of A to sub-system1 and the rest of the data array (region bounded by the dashed boundary in A_CNT2 in Fig. 14) to sub-system2. The final memory access cost is acc_cost = 431.
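A greedy window expansion of this kind can be sketched for the two-partition case. The cost model below is an assumption, since the paper's cost equation is not recoverable from the text: each access costs `n_local` cycles when the cell lies in the accessing sub-system's own memory and `n_remote` cycles otherwise. The 4 x 4 shadow arrays are made up for illustration and are not the data of Fig. 14.

```python
def access_cost(cnt1, cnt2, win, n_local, n_remote):
    """Total access cycles when the window goes to sub-system 1, the rest to 2."""
    top, left, bottom, right = win                # inclusive bounds
    cost = 0
    for i, (row1, row2) in enumerate(zip(cnt1, cnt2)):
        for j, (c1, c2) in enumerate(zip(row1, row2)):
            if top <= i <= bottom and left <= j <= right:
                cost += n_local * c1 + n_remote * c2
            else:
                cost += n_remote * c1 + n_local * c2
    return cost

def expand_window(cnt1, cnt2, win, n_local=1, n_remote=3):
    """Grow the window one row/column at a time while the cost keeps dropping."""
    rows, cols = len(cnt1), len(cnt1[0])
    best = access_cost(cnt1, cnt2, win, n_local, n_remote)
    while True:
        top, left, bottom, right = win
        candidates = []
        if top > 0:
            candidates.append((top - 1, left, bottom, right))      # up
        if left > 0:
            candidates.append((top, left - 1, bottom, right))      # left
        if bottom < rows - 1:
            candidates.append((top, left, bottom + 1, right))      # down
        if right < cols - 1:
            candidates.append((top, left, bottom, right + 1))      # right
        scored = [(access_cost(cnt1, cnt2, c, n_local, n_remote), c)
                  for c in candidates]
        if not scored or min(scored)[0] >= best:
            return win, best                      # no cost reduction is possible
        best, win = min(scored)

# Cluster C1 touches the top two rows, cluster C2 the bottom two rows.
a_cnt1 = [[5] * 4, [5] * 4, [0] * 4, [0] * 4]
a_cnt2 = [[0] * 4, [0] * 4, [5] * 4, [5] * 4]
final_win, final_cost = expand_window(a_cnt1, a_cnt2, (0, 0, 0, 0))
print(final_win, final_cost)
```

Starting from the single-cell seed at (0, 0), the window grows until it covers exactly the two rows used by C1, at which point every access is local. Repeating the search from several seed locations and keeping the cheapest result mirrors steps 4 to 6 of the methodology.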
IV. EXPERIMENTAL RESULTS
The techniques described in this paper were evaluated within the framework of an existing ASIC design flow. We applied our algorithm to several benchmark behaviors and were able to synthesize optimized circuits with multiple logic-memory partitions in each case. HLS of the input C behavior (with and without our techniques) was performed using an HLS tool called Cyber [28] [...] and LVD [31]. The resulting gate-level circuits and layouts were compared with respect to the following metrics: area and execution time. These metrics were extracted from the technology-mapped circuits and designer-provided testbenches. The results obtained are summarized in Table II.
Of our benchmarks, Haar and Jacob were described in Section II. Gauss is a behavior that performs Gauss-Jordan elimination for solving linear algebraic equations. IIR is an infinite impulse response filter, a well-known digital signal processing benchmark. Edge is a behavior performing edge detection in images. Motion is a behavior that performs video compression during video stream encoding. The workloads for Gauss and Jacob involve processing of 2K bytes of resource data, and those for the remaining four benchmarks involve 4K bytes of resource data.
In Table II, major columns Circuit, Memory block, Logic cells, Total area and Execution time represent the name of the behavior, size of memory blocks including register files (×10^3 μm^2), area of logic cells (×10^3 μm^2), total area (×10^3 μm^2) and performance expressed as the average execution time (μs), respectively. Minor columns Orig. and Opt., respectively, represent the original and optimized systems. Columns A.O. and P.I. report the area overheads and performance improvements, respectively. Area overheads include the effect of replicating circuitry while partitioning physical memory, and performance improvements include the effects of introducing global communications and synchronization across partitions.
The results presented in Table II describe the best performing two-way partitioned architectures found by our algorithm. For the first four benchmarks, our algorithm derived heterogeneous architectures as the best performing two-way partitioning, while for the last two benchmarks, our algorithm derived homogeneous architectures. It is worth noting that Gauss is an input-content-aware benchmark. Thus, its results denote the average over the best-case data trace, the worst-case data trace and a random input. The results show that circuits designed as heterogeneous multi-partitioned logic-memory architectures using our framework achieve significant performance improvements (up to 2.2X, average of 2.0X) over well-optimized conventional HLS designs. The area overheads varied from 10.5% to 17.9% for those examples. These overheads include the effect of laying out the complete system with memory blocks and routing effects. We also compared our heterogeneous architectures with the best-performing homogeneous architectures, whenever feasible. We obtained 1.5X, 1.6X, 1.3X and 1.5X speed-ups, respectively, for benchmarks Haar, Gauss, IIR and Jacob compared to their two-way homogeneous architectures.
V. CONCLUSIONS
In this paper, we proposed a novel HLS methodology for designing multi-partitioned architectures for memory-intensive applications. Using simulation-based statistics of array data references and operations in a given behavior, our methodology tries to automatically derive the best partitioning of memory accesses and computations, as well as the requisite data distribution in physical memory. The proposed design techniques can, in general, be applied to behaviors with complicated memory access patterns, which can arise due to the use of non-affine array indices, loop nests with conditionals, etc. Experiments with several benchmarks in the context of a commercial design flow demonstrated that the enhanced HLS flow can be used to derive high-performance homogeneous and heterogeneous distributed architectures.
