In this paper, we describe a set of compiler analyses and an implementation that automatically map a sequential and un-annotated C program into a pipelined implementation, targeted for an FPGA with multiple external memories. For this purpose, we extend array data-flow analysis techniques from parallelizing compilers to identify pipeline stages, required inter-pipeline stage communication, and opportunities to find a minimal program execution time by trading communication overhead with the amount of computation overlap in different stages. Using the results of this analysis, we automatically generate application-specific pipelined FPGA hardware designs. We use a sample image processing kernel to illustrate these concepts. Our algorithm finds a solution in which transmitting a row of an array between pipeline stages per communication instance leads to a speedup of 1.76 over an implementation that communicates the entire array at once.
INTRODUCTION
The use of pipelined execution is an effective method to improve the throughput of a program. FPGA-based computing machines offer a unique opportunity to realize custom pipelining structures that match the application requirements: how the pipeline stages are defined, which computation can be overlapped, what data is communicated and at what rate, and where communication is placed within the pipeline stages.
The complexity and sophistication of pipelined execution make automatic tools that can analyze sequential applications and derive pipelined implementations extremely desirable. In this paper, we describe a set of compiler analyses to derive, from a sequential algorithm description, the components of a pipelined design, i.e., task parallelism and communication requirements. We use as a foundation parallelizing compiler analyses for array data-flow analysis [3, 6], which we extend to recognize pipelining opportunities and derive communication requirements, for use in mapping to FPGA systems. In the Design Environment for Adaptive Computing Technology (DEFACTO) [8], we combine these analyses with behavioral synthesis tools to automatically synthesize application-specific pipelines onto a target FPGA-based architecture. The work described in this paper makes the following specific contributions:
It defines new analyses for characterizing the task parallelism and communication requirements for use in mapping sequential programs to systems of configurable logic.
It describes an implementation of the analyses and code transformations required to automatically design and synthesize pipelines tailored to sequential program characteristics.
It presents experimental results for a machine vision application excerpt (MVIS) which demonstrate the use and performance potential of these techniques on an FPGA-based architecture.
The paper is organized as follows. In section 2 we describe the problem we are addressing along with an example that illustrates the approach. Section 3 describes the compiler analysis in more detail. We present the execution times for four communication schemes in section 4. In section 5 we survey related work and conclude in section 6. 
MOTIVATION AND BACKGROUND
The problem we address in this paper is automatically mapping an application onto an asynchronous pipeline, executing on a configurable architecture [16]. Mapping a pipelined application involves identifying a set of pipeline stages and the associated communication. The compilation goal is to minimize overall execution time, while meeting the storage and computing capacity constraints of the system. We exploit the parallelism in the application without the use of any programmer-inserted pragmas or directives.
The compiler approach described in this paper, although generic in the sense of mapping a set of communicating pipeline stages to a configurable architecture, has focused on applications whose computations are specified by sequences of loop nests with intervening statements. These loop nests, not necessarily perfectly nested, compute over array data structures using affine index access functions, defined as linear functions of loop index variables, and have constant loop bounds. For the current implementation, we do not map computations with pointer accesses to hardware. Under these assumptions, we have been able to apply our compiler analysis approach to digital image and signal processing kernels and other regular array computations of interest.
Example
We illustrate the mapping of MVIS, depicted in Figure 1. The code is structured as two loop nests. In this example, pipeline stage s1 corresponds to the computation in the first loop nest, and stage s2 to the second loop nest. s1 computes the peak array; s2 reads peak and computes the arrays feature_x and feature_y as output. We determine that we must communicate the whole array peak from the producer s1 to the consumer s2, one row at a time.
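The row-at-a-time hand-off between s1 and s2 can be pictured in software. The following Python sketch is only an illustration under stated assumptions: the functions `stage1_rows` and `stage2_consume` and the arithmetic inside them are hypothetical stand-ins, not the MVIS kernel, and a generator plays the role of the hardware channel. A generator interleaves the two stages rather than running them concurrently, so it shows the communication granularity, not the overlap.

```python
from typing import Iterator, List


def stage1_rows(image: List[List[int]]) -> Iterator[List[int]]:
    """Producer (stand-in for s1): hands over one row of `peak` at a time."""
    for row in image:
        # Placeholder computation; the real MVIS loop body is not reproduced.
        yield [2 * v for v in row]


def stage2_consume(rows: Iterator[List[int]]) -> List[int]:
    """Consumer (stand-in for s2): processes each row as soon as it arrives."""
    features = []
    for peak_row in rows:
        features.append(max(peak_row))
    return features


image = [[1, 5, 3], [4, 2, 6]]
features = stage2_consume(stage1_rows(image))
```

The consumer never holds more than one row of `peak` at a time, which is the storage benefit of communicating at row granularity instead of passing the whole array.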
COMPILER ANALYSIS
The compiler analysis described in this paper is built upon an automatic parallelization system that is part of the Stanford SUIF compiler [1]. Figure 2 depicts the set of compiler analyses implemented specifically for DEFACTO.
The DEFACTO system-level compiler takes an un-annotated, sequential program and applies a set of communication and pipeline analyses based on array data-flow analysis. Previous work on array data-flow analysis has largely focused on identifying loops whose iterations can execute in parallel. This is called data parallelism, since the same code is executed in parallel on different data. In this paper, we extend and develop array data-flow analysis that supports pipelining of independent computations, or task parallelism. These extended analyses are incorporated into the communication and pipeline analysis phase and include the following:
Determining which data must be communicated.
Determining the possible granularities at which data may be communicated.
Determining the corresponding communication placement points within the program.
Then code transformations are applied to reflect the results of the analysis, and the SUIF intermediate format is converted into behavioral VHDL. Estimates from the behavioral synthesis phase are used to evaluate each of the possible granularity solutions in the design space exploration phase; a good design can then be passed on to the logic synthesis and place and route phase.
Analysis Background
Analyzing communication requirements involves characterizing the relationship between data producers and consumers. This characterization can be thought of as a data-flow analysis problem. In compiler terminology, data-flow analysis is the compile-time reasoning about the run-time flow of data values through the program. For decades, data-flow analysis has been used to guide a host of compiler optimizations and has even been incorporated into high-level synthesis tools. For the most part, the analysis has been restricted to scalar variables. Data structures, such as arrays, are treated as a single entity.
Unfortunately, scalar data-flow analysis is too imprecise when optimizing designs that access multi-dimensional array variables, such as commonly occur in multimedia algorithms. To be effective, the analysis must track accesses to individual array elements. For this purpose, we draw on solutions from parallelizing compiler technology to derive array data-flow analysis information. Our compiler uses a specific array data-flow analysis, reaching definitions analysis [2], to characterize the relationship between array accesses in different pipeline stages [11]. Reaching definitions are variable values that are not killed, i.e., not redefined, at another definition point occurring in the data flow from predecessor program points ($p \in pred(s)$) to the current program point s. Definitions occurring at the current program point form a gen(s) set. We typically talk about program points as being basic blocks. While in our system we also calculate reaching definitions at the basic block level, we perform a hierarchical analysis that ultimately derives reaching definitions at the level of pipeline stages. For the purposes of deriving communication requirements, the analysis only retains reaching definitions information when a definition in one pipeline stage reaches a use in another pipeline stage. Reaching definitions, $\gamma(s)$, are defined by a set of simultaneous equations represented by Equation 1.
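As a point of reference, the textbook scalar formulation of reaching definitions can be solved by iterating to a fixed point. The Python sketch below shows that standard formulation only; it is not the hierarchical array analysis described in this paper, and the names `reaching_definitions`, `preds`, `gen` and `kill`, as well as the three-block example program, are assumptions made for illustration.

```python
from typing import Dict, List, Set


def reaching_definitions(
    preds: Dict[str, List[str]],
    gen: Dict[str, Set[str]],
    kill: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return the set of definitions reaching the entry of each program point."""
    reach_in: Dict[str, Set[str]] = {s: set() for s in preds}
    changed = True
    while changed:
        changed = False
        for s in preds:
            new_in: Set[str] = set()
            for p in preds[s]:
                # Definitions leaving p: generated in p, or passing through unkilled.
                new_in |= gen[p] | (reach_in[p] - kill[p])
            if new_in != reach_in[s]:
                reach_in[s] = new_in
                changed = True
    return reach_in


# Hypothetical straight-line program: d1 and d2 define the same variable.
preds = {"b1": [], "b2": ["b1"], "b3": ["b2"]}
gen = {"b1": {"d1"}, "b2": {"d2"}, "b3": set()}
kill = {"b1": {"d2"}, "b2": {"d1"}, "b3": set()}
result = reaching_definitions(preds, gen, kill)
```

In this example d1 is killed in b2, so only d2 reaches b3. The analysis in this paper applies the same idea to array sections and keeps only the definitions that cross a pipeline stage boundary.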
First, we introduce a new abstraction, the Reaching Definition Data Access Descriptor (RDAD). RDADs are a fundamental extension of Data Access Descriptors (DADs) [6], which were originally proposed to detect the presence of data dependences either for data parallelism or task parallelism, but DADs do not capture sufficient information to automatically generate communication when dependences exist. For this purpose, we have extended DADs to capture reaching definitions information. Second, we have developed RDADs in an existing array data-flow analysis implementation, described by Amarasinghe [3], that derives more precise array sections than that of the DAD definition. For example, it can represent array regions that have holes and some nonlinear array regions. Finally, we have added a tuple containing the dominant induction variable (DIV) for each dimension, ordered according to the array traversal order [6]. A dominant induction variable is defined as the loop index variable that is changing the fastest in a given access expression. In the DAD analysis, the dominant induction variables were used to calculate the traversal order. We relax the DAD restriction on traversal orders in order to increase the opportunity for pipelining and then use the dominant induction variables to aid in selecting communication placement. A full discussion of how we combine standard scalar reaching definitions and array data-flow analysis with task parallelism and pipelining information is beyond the scope of this paper.
We present the RDAD definition in the next section and the definition of another abstraction that captures specific communication requirements between pipeline stages, the Communication Edge Descriptor (CED), in section 3.3. We also show how RDADs are used by the compiler to automatically calculate CEDs in section 3.4.
RDAD Description
Reaching Definition Data Access Descriptors (RDADs) summarize information about the read and write accesses for array variables in the high-level algorithm description. Such RDAD sets are derived hierarchically by analysis at different program points, i.e., on a statement, basic block, loop and procedure level. Since we map each nested for loop or intervening statements to a pipeline stage, we also associate RDADs with pipeline stages. Loop variables are normalized before calculating the set of associated RDADs.
DEFINITION 1. A Reaching Definition Data Access Descriptor, $RDAD(A)$, defined as a set of 5-tuples $\langle \alpha \mid \tau \mid \delta \mid \omega \mid \gamma \rangle$, describes the data accessed in the m-dimensional array A at a program point s, where s is either a basic block, a loop, or a pipeline stage. $\alpha$ is an array section describing the accessed elements of array A, represented by a set of integer linear inequalities. $\tau$ is the traversal order of $\alpha$, a vector of length $\le m$, with array dimensions from $(1, \ldots, m)$ as elements, ordered from slowest to fastest accessed dimension. A dimension $d$ traversed in reverse order is annotated as $\bar{d}$. An entry may also be a set of dimensions traversed at the same rate. $\delta$ is a vector of length m and contains the dominant induction variable for each dimension. $\omega$ is a set of definition or use points for which $\alpha$ captures the access information. $\gamma$ is the set of reaching definitions. We refer to $RDAD_{s,r}(A)$ as the set of tuples corresponding to the reads of array A and $RDAD_{s,w}(A)$ as the set of writes of array A at program point s. Since writes do not have associated reaching definitions, for all $RDAD_{s,w}(A)$, $\gamma = \emptyset$.
In the following, we use the notation $f(RDAD(A))$ when selecting a tuple element, $f$, on which to perform an operation. For example, to select the array section $\alpha$, we write $\alpha(RDAD(A))$.
For this example, we show the calculated RDADs in Figure 1(b). The compiler determines that an access to array peak, in statement 3, writes the entire array, as described by $RDAD_{s_1,w}(peak)$. A read access to peak in statement 4 is described by $RDAD_{s_2,r}(peak)$. Similarly, the arrays feature_x and feature_y are written in statements 5, 6, 7, and 8. For all array accesses in the program, we capture the vector $\tau = (1, 2)$, indicating that dimension 1, the row dimension, varies more slowly than dimension 2. Similarly, we capture the dominant induction variables in $\delta = (x, y)$, where x is the DIV corresponding to the row dimension access expression for each RDAD and is also the loop index variable of the enclosing loop in the nest. Reaching definitions are retained in RDADs only if they reach outside of a pipeline stage. In the MVIS example, we have one reaching definition, from statement 3 to 4, from $RDAD_{s_1,w}(peak)$ to $RDAD_{s_2,r}(peak)$. We will show how these RDADs are used to calculate the specific communication between the two pipeline stages in section 3.4 and define the Communication Edge Descriptor next.
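One way to picture an RDAD tuple is as a small immutable record. The Python sketch below is a simplification under stated assumptions: the class name `RDADTuple` and its field names are hypothetical, the array section is reduced to one (low, high) bound pair per dimension instead of a general set of integer linear inequalities, and the image size `N` and the extent of the read section are placeholders. The traversal order, dominant induction variables, and statement numbers follow the MVIS description in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class RDADTuple:
    alpha: Tuple[Tuple[int, int], ...]  # array section, simplified to (lo, hi) per dimension
    tau: Tuple[int, ...]                # traversal order, slowest to fastest dimension
    delta: Tuple[str, ...]              # dominant induction variable per dimension
    omega: FrozenSet[int]               # definition or use points (statement numbers)
    gamma: FrozenSet[int]               # reaching definitions; empty for writes


N = 64  # hypothetical image size; the actual bounds are not given in the text
whole_array = ((0, N - 1), (0, N - 1))

# Write of peak in statement 3 (stage s1); writes carry no reaching definitions.
peak_write = RDADTuple(whole_array, (1, 2), ("x", "y"), frozenset({3}), frozenset())
# Read of peak in statement 4 (stage s2), reached by the definition in statement 3.
# The read is assumed here to cover the whole array.
peak_read = RDADTuple(whole_array, (1, 2), ("x", "y"), frozenset({4}), frozenset({3}))
```

Selecting a tuple element, written $\alpha(RDAD(A))$ in the notation above, corresponds to plain attribute access such as `peak_write.alpha` in this sketch.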
CED Description
We define another abstraction, the Communication Edge Descriptor, to describe the communication requirements on each edge connecting two pipeline stages.
DEFINITION 2. A Communication Edge Descriptor (CED), $CED_{s_i,s_j}(A)$, defined as a set of 3-tuples $\langle \alpha \mid \lambda \mid \rho \rangle$, describes the communication that must occur between two pipeline stages $s_i$ and $s_j$. $\alpha$ is the array section, represented by a set of integer linear inequalities, that is transmitted per communication instance. $\lambda$ and $\rho$ are the communication placement points in the send and receive pipeline stages, respectively.
Determining Communication Requirements
After calculating the set of RDADs for a program, we use the reaching definitions information to determine between which pipeline stages communication must occur. To generate communication between pipeline stages $s_i$ and $s_j$, we consider each pair of write and read RDAD tuples, $R_i$ and $R_j$, where the definition point $\omega(R_i)$ is among the reaching definitions in $\gamma(R_j)$. The communication requirements, i.e., placement and data, are related to the granularity of communication. We calculate a set of valid granularities, based on the comparison of traversal order information from the communicating pipeline stages, and then evaluate the execution time for each granularity in the set. Once we identify the best granularity, we then calculate the specific communication placement and array data section to be communicated and form a CED. The algorithm is shown in Figure 3.
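The pairing step described above can be sketched as a filter over write and read descriptors. This Python fragment is a minimal illustration, not the algorithm of Figure 3: the record `Access`, the function `communicating_pairs`, and the assignment of statements 5 and 6 to feature_x are assumptions introduced here.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Access(NamedTuple):
    stage: str               # pipeline stage containing the access
    array: str               # array being accessed
    omega: FrozenSet[int]    # definition or use points (statement numbers)
    gamma: FrozenSet[int]    # reaching definitions; empty for writes


def communicating_pairs(
    writes: List[Access], reads: List[Access]
) -> List[Tuple[Access, Access]]:
    """Pair each write with every read in another stage that it reaches."""
    pairs = []
    for w in writes:
        for r in reads:
            reaches = bool(w.omega & r.gamma)
            if w.array == r.array and w.stage != r.stage and reaches:
                pairs.append((w, r))
    return pairs


writes = [
    Access("s1", "peak", frozenset({3}), frozenset()),
    # Hypothetical: which of statements 5-8 write feature_x is assumed here.
    Access("s2", "feature_x", frozenset({5, 6}), frozenset()),
]
reads = [Access("s2", "peak", frozenset({4}), frozenset({3}))]
pairs = communicating_pairs(writes, reads)
```

For the MVIS descriptors this yields a single producer and consumer pair, the edge from s1 to s2 for array peak, which is the edge a CED would then describe.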
To calculate the valid set of granularities, the function calcValidGran compares the input traversal order vectors pairwise, starting from the slowest-moving dimension; if two entries are identical, the corresponding entry $P_k$ of the result vector P is assigned the same value. Once a non-matching pair of entries is detected, all remaining entries in P are set to 0.
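The comparison performed by calcValidGran can be sketched in a few lines of Python. This is an illustrative reading of the prose description, not the authors' implementation: the name `calc_valid_gran` and its signature are assumptions, vectors of unequal length are padded with 0 as a simplification, and entries that are sets of dimensions or reverse-order annotations are not modeled.

```python
from typing import Sequence, Tuple


def calc_valid_gran(
    tau_send: Sequence[int], tau_recv: Sequence[int]
) -> Tuple[int, ...]:
    """Compare two traversal orders from the slowest-moving dimension onward."""
    length = max(len(tau_send), len(tau_recv))
    p = [0] * length
    for k, (a, b) in enumerate(zip(tau_send, tau_recv)):
        if a != b:
            break  # first mismatch: every remaining entry stays 0
        p[k] = a   # matching entries are copied into the result
    return tuple(p)


mvis_p = calc_valid_gran((1, 2), (1, 2))
```

For MVIS both stages traverse peak in the order (1, 2), so every entry matches. When the two stages disagree from the first entry on, as in `calc_valid_gran((1, 2), (2, 1))`, the result is all zeros and only whole-array communication remains.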
The function findCommPlace(P, $R_i$, $R_j$) returns the communication placement for the send and receive pipeline stages. The first step in this function is to identify the granularity that yields the minimum execution time. The time input to this calculation includes the Monet™ synthesis estimates for individual pipeline stage execution, the known communication time, and the time during which computation overlap occurs for a pair of stages at a particular communication granularity. If $P_k$ is a set, we calculate one execution time for the whole set, since communication for each entry would be mapped to the same placement points, as discussed below.
From the communication granularity, the communication placement is determined by mapping the array dimension from minDim to its associated dominant induction variable in each stage. The communication will occur inside the loop corresponding to the selected dominant induction variable.
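The selection and mapping steps can be sketched together. In the Python fragment below, `find_comm_place`, the `est_time` table, and the numbers in it are hypothetical; in the system described here the times come from synthesis estimates, which this sketch does not model. Set-valued entries of P are also left out.

```python
from typing import Dict, Optional, Sequence, Tuple


def find_comm_place(
    p: Sequence[int],
    est_time: Dict[int, float],
    delta_send: Sequence[str],
    delta_recv: Sequence[str],
) -> Optional[Tuple[int, str, str]]:
    """Pick the valid dimension with the lowest estimated time and map it
    to the dominant induction variable (DIV) of each stage."""
    candidates = [d for d in p if d != 0]
    if not candidates:
        return None  # no shared traversal order; only whole-array transfer remains
    min_dim = min(candidates, key=lambda d: est_time[d])
    # delta is indexed by array dimension, and dimensions are numbered from 1.
    return min_dim, delta_send[min_dim - 1], delta_recv[min_dim - 1]


# Hypothetical estimates in arbitrary units; these are not measured values.
est_time = {1: 100.0, 2: 150.0}
placement = find_comm_place((1, 2), est_time, ("x", "y"), ("x", "y"))
```

With these placeholder estimates the sketch selects dimension 1 and places the send and the receive inside the loop over x in both stages, which corresponds to row-sized communication.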
Based on the array sections accessed in $R_{i,\lambda}$ and $R_{j,\rho}$, subsets of $R_i$ and $R_j$, respectively, accessed within the pipeline stages, we may also need to perform code transformations, such as peeling, to align the communication. For a $P_k$ that is a set, each dimension is mapped to the same dominant induction variable and thus the same communication placement points.
We also address multiple definitions reaching a single read access. If the definitions are generated within the same pipeline stage, we place the send primitives at the control flow meet point just after the definition points. If multiple definitions are generated in different stages, we place additional logic at the receive point, where the control flow meet point occurs for these definitions. Finally, we calculate the intersection of $\alpha(R_{i,\lambda})$ and $\alpha(R_{j,\rho})$. Intersection on array sections is defined as merging the sets of integer linear inequalities and, if a solution exists, simplifying the linear inequality set [3].
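For the special case of rectangular sections, merging the inequality sets reduces to tightening the bounds in each dimension. The Python sketch below covers only that special case and is an assumption-laden simplification: the type `Box` and the function `intersect` are hypothetical names, and general sections with holes or non-rectangular shapes need a real integer linear inequality solver.

```python
from typing import Optional, Tuple

# A rectangular array section: one inclusive (lo, hi) pair per dimension.
Box = Tuple[Tuple[int, int], ...]


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Merge the bounds of two sections; return None if no element satisfies both."""
    merged = tuple(
        (max(lo1, lo2), min(hi1, hi2))
        for (lo1, hi1), (lo2, hi2) in zip(a, b)
    )
    if any(lo > hi for lo, hi in merged):
        return None  # the combined inequalities have no solution
    return merged


# Hypothetical sections: one full row written, interior of the same row read.
written_row = ((0, 0), (0, 63))
read_row = ((0, 0), (1, 62))
to_send = intersect(written_row, read_row)
```

Only the elements that are both produced at the send point and consumed at the receive point need to be transmitted, which is why the intersection, and not either section alone, forms the array section of the CED.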
Code Generation
Once the compiler has inserted the communication primitives, the SUIF code is translated into behavioral VHDL, shown in Figure 1(d). For presentation, some of the synchronization details have been abstracted away.
EXPERIMENTAL RESULTS
Implementation Status and Methodology
In this experiment, we evaluate our communication analyses by comparing four different communication schemes.
The vector P = (1, 2) indicates that either a row-sized or an element-sized transmission is valid for MVIS, and we can always communicate the whole array, either on- or off-chip, in one transmission. In Array Off-chip, s1 completes its total calculation and then communicates the array peak to s2 via external memory. There is no pipelining in this scheme, and there is additional overhead for the memory accesses; reads take 7 ns and writes 0.1 ns but are pipelined so that there is one memory access per clock cycle. The scheme Array On-chip is similar, except the array is communicated on-chip. In Row, once s1 has produced a row of peak, the row is communicated immediately to s2. s2 consumes the data, such that this computation is overlapped with s1's computation of the next row. Similarly, in Element, s1 communicates each element as it is produced. There is maximum computation overlap in Element. For each instance of on-chip communication, independent of the amount of data communicated, commTime is approximately two cycles. The CEDs for these schemes are shown in Figure 1(c).
For each scheme, we compile the behavioral VHDL with Monet™ and simulate in ModelSim™ to obtain the total execution time. All behavioral VHDL, with the exception of the Array Off-chip scheme, was generated automatically. Figure 4 shows the MVIS execution times for the four communication schemes. As expected, when accessing external memory with no computation overlap, as in Array Off-chip, we see the largest program execution time. When we compare Array On-chip and Row, we see a speedup of 1.76 due to the computation overlap gained from stages s1 and s2 executing in parallel. The overhead of communicating one element at a time, in Element, is not amortized by the computation overlap. The Element scheme therefore does not yield a minimal execution time.
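The trade-off between overlap and per-instance overhead can be illustrated with a toy cost model. Everything in the Python sketch below is an assumption made for illustration: the function `pipeline_time`, the formula it uses, the 64 by 64 array size, and the per-element cost of 3 cycles are not taken from the measurements, and the model does not reproduce the reported execution times or the 1.76 speedup. Only the communication cost of two cycles per instance follows the text.

```python
def pipeline_time(n_chunks: int, produce: int, consume: int, comm: int) -> int:
    """Cycles for a two-stage pipeline that hands over n_chunks pieces.

    Assumed model: the producer spends `produce + comm` cycles per piece,
    the consumer spends `consume` cycles per piece, and after the first
    piece the slower of the two sets the pace.
    """
    send = produce + comm
    return send + (n_chunks - 1) * max(send, consume) + consume


ROWS, COLS = 64, 64   # hypothetical array size
PER_ELEMENT = 3       # hypothetical cycles per element in each stage
COMM = 2              # cycles per on-chip communication instance

whole = ROWS * COLS * PER_ELEMENT
array_on_chip = pipeline_time(1, whole, whole, COMM)
row = pipeline_time(ROWS, COLS * PER_ELEMENT, COLS * PER_ELEMENT, COMM)
element = pipeline_time(ROWS * COLS, PER_ELEMENT, PER_ELEMENT, COMM)
```

Under these assumptions the row scheme is fastest: it overlaps almost all of the two stages while paying the two-cycle overhead only once per row, whereas the element scheme pays it once per element and the whole-array scheme gets no overlap at all.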
Results
For MVIS, we choose minDim = 1, since a row-sized transmission, described by 
RELATED WORK
Previous work on array data-flow analysis [6, 14, 3] focused on data dependence analysis but not at the level of ... In [5], Arnold created a software environment to program a set of FPGAs connected to a workstation. [9] focused on compiling for a tightly coupled hybrid FPGA and RISC architecture; Callahan and Wawrzynek [7] used a VLIW-like compilation scheme for the GARP project; both works exploit intra-loop pipelined execution techniques. Goldstein et al. [10] describe a custom device that implements an execution-time reconfigurable fabric. Weinhardt and Luk [15] describe a set of program transformations to map the pipelined execution of loops with loop-carried dependences onto custom machines.
The approach taken in this paper differs from the previously mentioned efforts. Our approach takes un-annotated sequential programs and maps them into a pipelined execution scheme without programmer intervention. Unlike concurrent languages, our approach neither relies on nor exploits concurrent specification behavior. Instead of focusing on intra-loop pipelining techniques that optimize resource utilization, we focus on increased throughput through task parallelism coupled with pipelining, which we believe is a natural match for data-intensive image processing and streaming applications.
CONCLUSION
In this paper, we describe how parallelizing compiler technology can be adapted and integrated with hardware synthesis tools to automatically derive, from sequential C programs, pipelined implementations for systems with multiple FPGAs and memories. We describe our implementation of these analyses in the DEFACTO system and demonstrate this approach with a case study, a machine vision application. We present experimental results, derived automatically by our system. We illustrate how these analyses can improve application performance, as evidenced by the 1.76 speedup we gain over a non-pipelined implementation. Current work focuses on integrating these analyses with automated design space exploration for a single loop [13] and partitioning pipelined implementations over multiple FPGAs.
