As modern image and video processing applications handle increasingly higher image resolutions, the buffering requirements between communicating functional modules increase correspondingly. The performance and cost of these applications can change dramatically depending on the implementation methods for FIFO buffers and the data delivery methods between modules. This paper introduces a new FIFO hardware mapping algorithm based on pointer-based token delivery from dataflow semantics for image and video processing applications. This approach significantly improves the performance of dataflow-based implementations of image and video processing systems, and allows effective prediction of changes in performance and buffer memory requirements associated with changes in image resolution. Our pointer-based token delivery method allows indirect token delivery between actors by pointers in conjunction with use of a shared memory. Each pointer references a data block stored in the shared memory. In pointer-based token delivery, a buffer can be configured to be implemented as the combination of a small, fast FIFO and a larger, relatively cheap shared memory while providing an attractive trade-off between performance and hardware cost. We present the complete semantics of our pointer-based modeling method, systematic techniques for mapping representations using these semantics into efficient implementations, and experimental results that demonstrate the performance of the proposed pointer-based techniques.
RELATED WORK
Dataflow [7] is widely used for designing DSP applications. Various research efforts on mapping dataflow graphs into hardware implementations have been undertaken. For example, the approach of [2] exploits loop parallelism to map nested loop kernels onto a coarse-grained reconfigurable architecture. The approach of [3, 4] uses direct mapping of each dataflow graph component (actor) onto a corresponding hardware resource. The approach of [5] uses shared resources and looped schedules. The approach of [6] analyzes a given set of applications to extract commonalities across nodes in different applications and uses them to bias the mapping of nodes in the partitioning process. For FPGA implementation, the approach of [10] provides a rapid system prototyping method through a component architecture and an associated set of software tools. The approach of [11] provides a pipelined asynchronous circuit mapping method. For pointer synthesis, the approach of [9] encodes pointer values and generates circuits that can dynamically access different locations with each pointer reference. The approach of [13] points out that pointers can reference indices to RAM, registers or even wires in a hardware mapping. The approach of [1] applies an external memory for mapping FIFO buffers and implements real-time image convolution on an FPGA. The approach of [8] implements image processing applications on FPGAs and points out that such implementations lead to large on-chip FIFO buffers that prevent flexible usage of FPGAs for image processing applications. The approach of [12] presents an elaborate technique for mapping global, static arrays to distributed communication structures while classifying four types of inter-process communication patterns. The approach of [15] studies memory optimization for embedded software, particularly the performance of cache-based systems. The approach of [14] presents a novel technique for background memory allocation in multi-dimensional signal processing applications based on dataflow analysis.
The efforts described above make useful contributions to mapping application representations at various levels of abstraction into hardware implementations. However, the simultaneous analysis of both performance and cost implications when mapping image processing applications, which involve especially large volumes of data token delivery, has not been thoroughly investigated in previous work.
This paper helps to bridge this gap by studying, in the context of mapping dataflow graphs into hardware, the relationship between token delivery methods (indirect, pointer-based token delivery vs. direct-reference, raw token delivery) and FIFO architecture. This paper exploits pointer-based token delivery to reduce on-chip FIFO sizes, and also provides a range of efficient trade-offs between performance (latency and throughput) and FPGA resource cost through a novel FIFO mapping algorithm. This paper also shows how overall performance and cost vary in relation to the selected sub-frame size at which block processing is carried out. Finally, this paper provides a new mapping algorithm for dataflow representations of image processing applications to reduce overall FPGA resource costs without significant performance loss.
FIFO HARDWARE MAPPING FOR DATAFLOW GRAPHS

Modeling and architecture
In this work, an application is modeled under synchronous dataflow (SDF) [7] semantics and then mapped to an FPGA device. Each vertex (actor) within the given SDF graph is mapped to a module within the target FPGA. Edges are con-

Figure 1 shows a comparison of raw data FIFOs and pointer based FIFOs. In Figure 1b, the raw data FIFO is embedded inside the FPGA chip and holds direct raw data tokens. Here, by token we mean the unit of data transfer along an edge in the dataflow graph. The pointer based FIFO involves both an on-chip FIFO, which holds references to token blocks rather than the tokens themselves, and an external (off-chip) RAM-based memory, which may be shared across multiple pointer based FIFOs as well as other storage constructs. In Figure 1a, raw data tokens are located in the external memory, while a relatively small on-chip FIFO buffer holds pointers that provide a stream of indices into the external memory.

The FIFO architectures (raw data vs. pointer based) and FIFO sizes can be configured strategically based on optimization during the synthesis process. This paper formulates and investigates this optimization problem, and studies various important factors that should be taken into account when configuring dataflow buffers for hardware mapping. This is an important problem because the configurations of the FIFOs in a dataflow graph implementation have significant impact on the overall performance and hardware resource costs. This paper presents an effective heuristic FIFO mapping algorithm for mapping SDF graphs efficiently into hardware.
2.2 Performance and cost impact of token delivery methods
As implied above, we consider two alternative token delivery methods between dataflow actors, pointer based token delivery (indirect token delivery) and raw token delivery (direct token delivery). Raw token delivery is the conventional form of data delivery for dataflow graph implementation. Raw token delivery directly transfers data tokens across the FIFOs that connect adjacent pairs of actors in the dataflow graph. Therefore, for applications, such as those found in the image processing domain, that require large volumes of token transfer, very high resource requirements often result from extensive use of raw token delivery. On the other hand, since there is no indirection overhead or external memory access involved, raw token delivery improves performance through faster dataflow communication.
The limited quantity of gates available on FPGAs makes it challenging to implement image processing applications efficiently on these devices. Although FPGA resource density continues to increase in line with Moore's law, the complexity and resolution requirements of state-of-the-art image processing applications are also increasing at a significant pace.
Pointer based token delivery allows for more efficient use of limited FPGA resources by dividing inter-actor communication functionality into two parts. These parts consist of a relatively small set of pointers, and blocks of token data that the pointers reference. The pointers are kept in fast but expensive on-chip FIFOs, while the raw token data is located in slow but cost-effective external RAM. Dataflow graph actors send data to other actors by transferring pointers through the on-chip FIFOs. Actors at the receiving end use the transferred pointers to access external memory and retrieve the actual raw tokens. Pointer based token delivery significantly reduces FPGA resource requirements at the expense of some degradation in latency and throughput.
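The mechanism described above can be illustrated with a minimal software sketch. It is not the hardware design of this paper: the class names `SharedMemory` and `PointerFifo`, the block allocation policy, and all sizes are assumptions made for illustration only. The small queue carries block indices, while the token blocks themselves sit in a separate pool that stands in for the external RAM.

```python
from collections import deque


class SharedMemory:
    """Stand-in for the external RAM: a pool of fixed-size token blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.blocks = [None] * num_blocks
        self.free = deque(range(num_blocks))  # indices of unused blocks

    def alloc(self):
        return self.free.popleft()  # raises IndexError when memory is full

    def release(self, index):
        self.blocks[index] = None
        self.free.append(index)


class PointerFifo:
    """Small FIFO that carries block indices (pointers), not raw tokens."""

    def __init__(self, memory, depth):
        self.memory = memory
        self.depth = depth
        self.pointers = deque()

    def push(self, tokens):
        """Producer side: store a token block off-chip, enqueue its pointer."""
        tokens = list(tokens)
        if len(self.pointers) >= self.depth:
            raise OverflowError("pointer FIFO is full")
        if len(tokens) > self.memory.block_size:
            raise ValueError("token block exceeds block size")
        index = self.memory.alloc()
        self.memory.blocks[index] = tokens
        self.pointers.append(index)

    def pop(self):
        """Consumer side: dequeue a pointer, fetch the block, free the slot."""
        index = self.pointers.popleft()
        tokens = self.memory.blocks[index]
        self.memory.release(index)
        return tokens


# Hypothetical usage: two 64-token blocks pass through a 2-entry pointer FIFO.
mem = SharedMemory(num_blocks=4, block_size=64)
fifo = PointerFifo(mem, depth=2)
fifo.push(range(64))
fifo.push([7] * 64)
first = fifo.pop()
```

In this sketch the on-chip storage holds only two small indices at a time, while the 64-token blocks occupy the shared pool. This mirrors the split between fast, expensive FIFO storage and slow, cheap external memory.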
Equation (1) describes the relationships between pointer based token delivery and raw token delivery in terms of performance (execution time) and cost (the required number of gates). For each delivery method, it involves the number of gates required for a FIFO and the execution time for data token delivery through that FIFO. One coefficient converts the number of gates between the two delivery methods, and a similar coefficient converts the execution time. The values of both coefficients depend on the sub-frame size.
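As a hedged illustration only, relationships of the kind described above could be written in the following form. The symbols are assumed notation chosen for this sketch and are not the notation of Equation (1): $G$ is a gate count, $T$ is a token delivery time, $f$ is a FIFO, $s$ is the sub-frame size, and $\alpha$ and $\beta$ are the two conversion coefficients.

```latex
\begin{align*}
G_{\mathrm{ptr}}(f) &= \alpha(s)\, G_{\mathrm{raw}}(f), \\
T_{\mathrm{ptr}}(f) &= \beta(s)\, T_{\mathrm{raw}}(f).
\end{align*}
```

Under this assumed form, the qualitative discussion earlier in this section (fewer gates, some loss of latency and throughput for pointer based delivery) would correspond to $\alpha(s) < 1$ and $\beta(s) > 1$.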
Equation (2) describes the effects of raw token delivery and pointer based token delivery on latency and throughput.

Figure 2. Effect of sub-frame division on latency and throughput.
Here, a critical path of the given application must be extracted beforehand for the analysis, and the number of actors on this critical path enters (2). Two of the symbols in (2) are related, respectively, to the input port and output port of each actor in the critical path (i.e., with respect to the edges in the critical path that are incident to that actor). In (2), the value associated with a communication depends on whether it is mapped to a raw FIFO architecture or, conversely, to a pointer based FIFO. The other symbols in (2) are defined below in Section 2.3.
2.3 Effect of sub-frame size on performance and cost
Sub-frame division, together with pointer based token delivery, reduces FIFO size since the whole data frame can be processed in smaller units. However, depending on the application, there may be strict constraints on the sub-frame size that can be employed. Many image processing subsystems have minimum window (or block) sizes for their basic units of operation. Some globally-oriented operations, such as contouring, require the whole image frame as their basic unit of input. Sub-frame division influences both performance and cost. To understand this better, we can decompose the execution time of an actor into three parts: the execution time for activation of the actor; the execution time for the main functional logic operation of the actor; and the execution time required for token delivery of the actor.
The activation time is proportional to the number of sub-frame divisions, whereas the totals of the functional logic time and the token delivery time are the same regardless of the sub-frame division format. Usually, the activation time is relatively small compared to the other two components. Equation (3) shows the relationship among the three components of execution time for an actor, taking into account sub-frame division.
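The decomposition of actor execution time can be sketched numerically. The function name `actor_time`, its parameters, and the sample numbers are assumptions for illustration and are not taken from Equation (3). The sketch also assumes that an activation cost is paid once per sub-frame and that the logic and token delivery work divides evenly across sub-frames.

```python
def actor_time(t_act, t_logic_frame, t_token_frame, num_divisions):
    """Return (total_time, per_subframe_time) for one actor.

    t_act          -- activation time paid once per firing (per sub-frame)
    t_logic_frame  -- total functional-logic time for the whole frame
    t_token_frame  -- total token-delivery time for the whole frame
    num_divisions  -- number of sub-frames the frame is split into
    """
    if num_divisions < 1:
        raise ValueError("num_divisions must be at least 1")
    # Only the activation term grows with the number of divisions.
    total = num_divisions * t_act + t_logic_frame + t_token_frame
    per_subframe = total / num_divisions
    return total, per_subframe


# Hypothetical numbers: activation is small relative to logic and delivery.
for n in (1, 16, 64):
    total, per_sub = actor_time(2, 4096, 1024, n)
    print(n, total, per_sub)
```

With these sample values, the whole-frame time grows only slightly as the number of divisions increases, while the time per sub-frame shrinks roughly in proportion to the division count.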
Equation (3) involves the size of the entire image frame, the sub-frame size, the number of sub-frame divisions, and the latency of the actor. It also distinguishes the latency of the actor under the full image frame size from its latency under the sub-frame size.

Unlike the latency and throughput of a single actor, as decomposed in (3), the latency and throughput of the entire application are influenced by the interaction of data dependency, sub-frame size and FIFO architecture. Although sub-frame division generally allows for reduction of FIFO size, and also improves throughput, it generally leads to some increase in application latency. For example, when a single dataflow graph represents two or more applications operating concurrently, and those applications share actors in the graph, the data dependencies and execution time distributions of paths in the graph influence the performance of each application differently. Figure 2 compares, for an illustrative example, performance under sub-frame division with the case where there is no sub-frame division. Here, throughput is improved for both Applications I and II. However, sub-frame division degrades the latency of Application I, whereas the latency of Application II is improved. This phenomenon generally arises when two or more applications share actors (e.g., for more compact representation and implementation) in a common dataflow graph and the quantity defined in (4) is smaller than 0. The effect becomes especially prominent when a ratio involving the pipeline idle time is large. One of the terms in (4) can be obtained by a simple division.
Equation (5) involves the execution time of the actor with the largest execution time and the initial latency for the given sub-frame size. Here, the number of gates required for each application in the common graph is reduced by increasing the number of sub-frame divisions. Equation (6) shows the effect of sub-frame division on the number of gates required for an application.
Effect of data dependency on performance and cost
When a dataflow graph has a "branch point", two or more paths following the branch point merge again at some subsequent point, and these paths exhibit a large execution time deviation, the associated data dependency can greatly degrade the performance of all the associated applications in the dataflow graph. Here, a "branch point" is a point where a single actor has two or more output ports or a single output port feeds two or more successor actors. Figure 3 shows how performance under sub-frame division can be improved through insertion of special FIFOs that we call "delay FIFOs" (shown as labeled FIFOs in Figure 3). The performance improvement from delay FIFO insertion depends on the execution time distribution of the actors on each critical path following the branch point.
Equation (7) represents the relationship between performance and the added delay FIFOs. It relates the latency and throughput without delay FIFOs to the corresponding values with one delay FIFO and with two delay FIFOs, using the latencies for processing the first sub-frame in each of these three cases. Equation (8) represents the increase in the number of gates required for the application as delay FIFOs are added. The overhead of the delay FIFOs can be minimized by using the pointer based FIFO architecture for their implementation.
Optimization of FIFO hardware mapping
Idle intervals and uneven execution time distributions exist due to data dependencies and differences in operational complexity across dataflow actors. Performance and cost can be improved by integrating cost-effective, pointer based FIFOs and fast, raw token FIFOs in strategic ways. Figure 4 provides a simple illustration of how the resource cost for a dataflow graph can be reduced significantly while maintaining overall performance through hybrid FIFO architecture selection. Here, the throughputs of both configurations are identical. Furthermore, by using sub-frame division, the difference between the latencies of Figures 4a and 4b can be made negligible, since the throughput is ultimately the primary factor for determining latency under sub-frame division, as implied by (4) and (5).

Figures 5 and 6 show our FIFO mapping algorithm, which is motivated by the observations and analysis above. It is assumed that the dataflow graph can involve multiple applications, and moreover, that subsets of applications can share common actors for more compact representation and implementation. The function initializeGraph() sets up information about estimated execution times and execution time distributions of the actors. The function also finds two gate-count estimates: the estimated number of gates for the main functional logic portions of the actors, and the number of gates used for FIFOs under the assumption that only raw token FIFOs are used. The actual gate count that results from a mapped implementation lies between bounds determined by these two estimates, as shown in (9).
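The idea of hybrid FIFO architecture selection can be sketched with a simplified greedy rule. This is not the algorithm of Figures 5 and 6. It assumes a linear actor pipeline and a fixed pointer access overhead added to both actors adjacent to a pointer based FIFO, and the function name `select_fifo_types` is hypothetical. An edge is given a pointer based FIFO only when the added overhead keeps both neighbouring actors within the period set by the slowest actor, so the pipeline throughput is unchanged.

```python
def select_fifo_types(exec_times, ptr_overhead):
    """Pick "raw" or "pointer" for each edge of a linear actor pipeline.

    exec_times   -- per-actor execution times, in pipeline order
    ptr_overhead -- extra time a pointer FIFO adds to each adjacent actor
    An edge gets a pointer FIFO only if neither neighbour would then
    exceed the pipeline period set by the slowest actor.
    """
    period = max(exec_times)
    load = list(exec_times)  # effective time per actor, updated greedily
    choice = []
    for edge in range(len(exec_times) - 1):
        src, dst = edge, edge + 1
        fits = (load[src] + ptr_overhead <= period
                and load[dst] + ptr_overhead <= period)
        if fits:
            load[src] += ptr_overhead
            load[dst] += ptr_overhead
            choice.append("pointer")
        else:
            choice.append("raw")
    return choice, period


# Hypothetical pipeline of five actors with two slow stages (40 and 38).
types, period = select_fifo_types([10, 40, 12, 15, 38], ptr_overhead=5)
print(types, period)
```

In this example only the edge between the two lightly loaded actors (12 and 15) receives a pointer based FIFO. The edges that touch the slow actors stay as raw FIFOs, because the added overhead would push those actors past the pipeline period.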
For each application, a critical path is selected and an appropriate FIFO type is determined based on the execution time distribution of actors within the path. For each hierarchical subsystem within the critical path, the mapping procedure is applied recursively. Finally, delay FIFO insertion is performed to improve performance. Pointer based FIFOs are used for the delay FIFOs, and therefore the overhead of redundant FIFOs can be minimized while achieving the desired performance improvement.

EXPERIMENTAL RESULTS

Figure 7 shows a complex, composite morphological image processing application used in this paper for experimentation. Here, the performance and cost of each application under the dataflow representation are influenced by the interaction of shared actors with the applications that contain them. The graph of Figure 7 is implemented in Verilog and simulated under the ModelSim 6.0 environment. Synthesis is performed under Xilinx XST with the Spartan-3 (xc3s1500) as the target device. Input images are consumed and processed by the graph. Experimentation is performed under two different sub-frame sizes, 8x8 and 16x16. In Table 1, some entries serve as lower bounds in performance optimization, and others serve as lower bounds in cost reduction. Equation (10) shows how, in the following discussion, we compare the performance and costs of two different configurations.
In the first comparison, one group of configurations provides approximately 23% performance improvement over the other, while requiring about 81% more gates. In a further comparison, one configuration provides 54% performance improvement over another along with a slight (2%) cost increase. In the comparison of sub-frame division effects, the latency of one configuration is slightly improved, whereas the latency of another is decreased as the sub-frame parameter is decreased. Here, the latency impact is negligible since the affected term is relatively small compared to the execution time of each actor for processing the entire image frame. On the other hand, the throughput and cost improvements are distinguishable as the number of sub-frame divisions is increased.
Next, we see that the configuration that involves both performance and cost optimization provides 54% performance improvement and 16% cost reduction compared with the conventional approach. Similarly, the configuration that leans more toward cost optimization provides 39% performance improvement and 76% cost reduction compared with the conventional approach. Here, delay FIFO insertion in Path 1 (see Figure 7) leads to significant improvement of performance along with a 2% increase in gate count. Combined use of the two FIFO architectures with delay FIFOs significantly improves overall performance while also providing cost reduction. For cases where cost is the primary issue, it is important to note the significant cost reduction of the cost-oriented configuration.
CONCLUSIONS AND FUTURE WORK
This paper studies important issues in the mapping of dataflow representations of image processing applications into hardware implementations. Specifically, we focus on efficient mapping of FIFO buffers, and explore the effects of FIFO architecture, sub-frame division and data dependency on performance and cost. Based on this exploration, we provide heuristic optimization methods that simultaneously improve performance and cost with manageable complexity. A strategic FIFO mapping approach that comprehensively exploits dataflow graph characteristics results in significantly lower FPGA resource requirements with nearly equal performance. Useful directions for future work include

initializeGraph(G) {
    - Analyze the critical path of each application in the dataflow graph.
    - Obtain the estimated execution time
    - Obtain the execution time distribution on the path
    - Obtain and return the gate-count estimates for the actor logic and for raw token FIFOs;
}
