Abstract. Due to increasing complexity of modern real-time image processing applications, classical hardware development at register transfer level becomes more and more the bottleneck of technological progress. Modeling those applications by help of multi-dimensional data flow and providing efficient means for their synthesis in hardware is one possibility to alleviate the situation. The key element of such descriptions is a multi-dimensional FIFO whose hardware synthesis shall be investigated in this paper. In particular, it considers the occurring out-of-order communication and proposes an architecture which is able to handle both address generation and flow control in an efficient manner. The resulting implementation allows reading and writing one pixel per clock cycle with an operation frequency of up to 300 MHz. This is even sufficient to process very huge images occurring in the domain of digital cinema in real-time.
Introduction
With increasing capacities of Field Programmable Gate Arrays (FPGAs), more and more complex applications with growing demands can be realized. This can also be observed in the domain of digital image processing where new standards and algorithms offer powerful functionality at the price of huge complexity. JPEG2000 [1] for instance is a new compression technique which is currently introduced in the domain of digital cinema. However, its huge computational requirements and algorithmic complexity together with the large image sizes attaining up to 4096 × 2140 pixels constitute severe challenges for a real-time implementation.
As this complexity is increasingly difficult to handle with a low-level description such as classical Register Transfer Level (RTL) source code in VHDL, high-level synthesis is expected to play an important role in future system design. Whereas transformation from sequential C code into a parallel hardware implementation requires complex extraction of the contained parallelism, data flow descriptions like Synchronous Data Flow (SDF) [2] or Kahn Process Networks (KPN) [3] offer a natural representation of the inherent coarse-grained parallelism. For this purpose, the system is decomposed into a set of actors representing processes, which are interconnected by edges modeling communication.
In classical one-dimensional data flow, this communication is realized by help of FIFOs which transport data elements, also called tokens, from a source actor to the corresponding sink. Image processing algorithms however work on multi-dimensional arrays of pixels where the order by which the data is produced and consumed might differ (out-of-order communication). FIFOs however only support strict in-order communication. Thus, the necessary pixel reordering either has to be hidden in the actor or leads to complex system descriptions. In both cases, analysis and optimization being important in order to achieve high-performance implementations are difficult.
In order to alleviate this situation, multi-dimensional data flow models of computation like Windowed Synchronous Data Flow (WSDF) [4] have been proposed. They explicitly take into account that a written or read token is part of a whole array. This geometric information allows describing out-of-order communication without hiding information in the actors. Consequently, this information is explicitly available for analysis [5] and optimization. This however also means that communication edges can no longer be realized by simple one-dimensional FIFOs. Instead, multi-dimensional FIFOs [6] have to be deployed which directly support out-of-order communication.
This paper focuses on the hardware implementation of such a FIFO in order to contribute to a synthesis path from multi-dimensional data flow descriptions to FPGA solutions. In particular, it shows how the occurring out-of-order communication can be handled efficiently. Thanks to its explicit modeling, different optimizations like relative address generation and fast flow control can be performed. Special care is taken that applications processing very large images in real-time are supported as well. As a key result, an architecture is proposed which can read and write one token per clock cycle at frequencies up to 300 MHz, which is sufficient to process images with 4096 × 2140 pixels at 30 frames per second.
The remainder of this paper is as follows. After a comparison of our approach with related work in Section 2, Section 3 introduces the out-of-order communication which we want to handle. Section 4 then presents a hardware architecture for the corresponding multi-dimensional FIFO. The two major challenges, namely address generation and flow control, are discussed in Sections 5 and 6. Section 7 finally presents the results obtained by our implementation.
Related Work
As the handling of today's system complexity is a major challenge for future technological progress, communication generation for hardware implementations is an important topic in many research approaches. Ref. [7] for instance investigates for SDF graphs, how to efficiently transport huge amounts of data by splitting the FIFO functionality into an FPGA internal pointer transport and an external background memory for data storage. SA-C [8] allows generating hardware accelerators for sliding window algorithms described by special loops. IMEM [9] describes chains of local image processing algorithms and permits for efficient synthesis of the underlying FPGA-internal memory structures. ADOPT [10] investigates incremental address calculation when accessing multi-dimensional arrays. All these approaches pay no special attention to out-of-order communication and the resulting scheduling challenges.
Ref. [11] describes the communication techniques deployed for translating sequences of loop nests into parallel hardware implementations using the DE-FACTO compiler. Different degrees of communication granularity are supported, ranging from individual pixels to complete images. A granularity however is only allowed, if production and consumption is performed in the same order. Furthermore, for chip-internal communication, huge amounts of data are transmitted in parallel requiring large register banks.
ESPAM is another loop-oriented design flow which takes out-of-order communication into account [12, 13]. Ref. [12] proposes to use a Content Addressable Memory (CAM) storing the data values and an associated valid bit. Whereas this leads to the smallest achievable memory size, CAMs are very expensive in terms of required hardware resources. Furthermore, the dynamic memory allocation requires two clock cycles per write and four clock cycles per read due to polling of the valid bit. One clock cycle is specified as 40 ns. Ref. [13] deploys a normal RAM. The address generation is based on a hyper-rectangle comprising all simultaneously live data elements. As in [12], a valid bit is used to decide whether the required data element or free space is available. A write operation takes two clock cycles and a read three clock cycles, each lasting 10 ns.
In comparison with these approaches, the multi-dimensional FIFO we propose offers the typical status information like full and empty signals, as well as the number of available data elements on the read side and free places on the write side. Thus, no valid bit is necessary and the source or sink can determine very quickly whether a write or read is possible. As extremely large images shall be processed, a static memory allocation is preferred over a CAM, which is considered too expensive. The addressing scheme is not based on a hyper-rectangle. Instead, linearization in production order is applied, as this has turned out to be more efficient for our applications [5]. In order to achieve sufficient throughput, we require that in each clock cycle a pixel can be read and written simultaneously. Pipelining helps to achieve high synthesis frequencies. Finally, communication at pixel granularity is allowed independently of the production and consumption order.
Out-of-Order Communication
Out-of-order communication is a phenomenon which can be observed in various image processing applications. Many camera sources for instance deliver the images in a raster-scan order line by line, while the JPEG2000 compression standard [1] allows cutting these images into sub-images, so called tiles, which are then processed sequentially. Similar situations occur, when composing the code-blocks for entropy encoding or re-ordering the 8 × 8 blocks obtained from a JPEG decoder into a raster-scan order suitable for display devices. In all cases, the order by which the data elements are produced and read differs. Figure 1 exemplarily shows the JPEG2000 tiling operation. The input image is generated line by line in raster scan order whereas the sink reads the tiles sequentially and applies the compression algorithm. In order to support this operation, the source and sink shown in Fig. 1(a) cannot be connected any more by a classical one-dimensional FIFO. Instead, the FIFO has to perform a data reordering and is called multi-dimensional, because it is aware of the token position in the multi-dimensional array. It is parameterized with the image size and the execution order of both the sink and the source. The latter one is described by dividing the array into a hierarchy of (hyper-)rectangles which are written and read in raster-scan order [5] . In Fig. 1 for instance, the source generates the complete image in raster scan order, whereas from the sink's point of view it is divided into six blocks forming the tiles. The multi-dimensional FIFO hence has two different tasks, namely (i) correct flow control and (ii) association of a memory cell to each data element. In other words, the FIFO has to figure out, whether there is enough space to accept the next token produced by the source and whether the data required next by the sink is already available. 
Especially, if the tiles are huge sized, this decision must be performed on the granularity of pixels instead of tiles, as the latter one would lead to a significant increase in delay and buffer size. For each written and read data element, the multi-dimensional FIFO has to derive the memory address of the FIFO buffer where the corresponding token is stored.
In the next section, a hardware architecture of the multi-dimensional FIFO is presented which fulfills these tasks.
As FIFOs have turned out to be an efficient medium for data transport and synchronization, the multi-dimensional FIFO shall provide a similar interface: Full and empty signals indicate whether the next token can be written or read. Fill-level indicators show the number of tokens which can be read by the sink (rd_count) and written by the source (wr_count) before the FIFO gets empty or full. Moreover, it shall be possible to read and write one token per clock cycle. Figure 2 shows the corresponding hardware architecture. It consists of two major parts: the memory where the tokens are stored, which can be either FPGA-internal or external, and the controller. The latter is responsible for the fill-level control and the address generation. For this purpose, it needs to know the current position of the source and sink actors in the processed image. This information is kept in so-called iteration vectors $i_{src} \in I_{src} \subset \mathbb{N}^{n_{src}}$ and $i_{snk} \in I_{snk} \subset \mathbb{N}^{n_{snk}}$, where $I_{src}$ and $I_{snk}$ are sets of indices with dimensions $n_{src}$ and $n_{snk}$ respectively. For the tiling shown in Fig. 1, $i_{src} = (y_{src}, x_{src})$ is a two-dimensional vector indicating the row $y_{src}$ and the column $x_{src}$ of the next pixel to produce. $i_{snk} = (ty_{snk}, tx_{snk}, ry_{snk}, rx_{snk})$ has four dimensions and specifies the tile coordinates $(ty_{snk}, tx_{snk})$ and the position $(ry_{snk}, rx_{snk})$, relative to the tile borders, of the next pixel to read (see Fig. 1).
The possible vector values are given by the image and tile extensions as well as the number of tiles. For Fig. 1, this leads to $0 \le i_{src} \le i_{src,max} = (3, 8)$ and $0 \le i_{snk} \le i_{snk,max} = (1, 2, 1, 2)$. Each time the wr_en signal or the rd_nxt signal is set to one, the corresponding iteration vector is updated by a simple lexicographic increment $succ(\cdot)$ (equation (1)): the last coordinate is incremented, and whenever a coordinate would exceed its maximum value, it is reset to zero and the carry propagates to the preceding coordinate.
Example 1. Suppose that the sink in Fig. 1 is currently processing pixel number 14, corresponding to an iteration vector of $i_{snk} = (0, 1, 1, 2)$. Applying equation (1), the iteration vector of the next sink invocation is given by $succ(i_{snk}) = (0, 2, 0, 0)$. In other words, the sink starts the third tile by processing pixel 6.
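The lexicographic increment can be sketched in a few lines of Python. This is an illustration only: the function name `succ` follows the text, while the tuple representation and the wrap-around to the zero vector after the last iteration (assumed here to start the next image) are assumptions.

```python
def succ(i, i_max):
    """Lexicographic successor of iteration vector i, bounded by i_max.

    Assumption: after the last iteration the vector wraps to all zeros,
    i.e. processing continues with the next image.
    """
    nxt = list(i)
    # Start at the innermost (last) coordinate and propagate the carry outwards.
    for k in reversed(range(len(nxt))):
        if nxt[k] < i_max[k]:
            nxt[k] += 1
            return tuple(nxt)
        nxt[k] = 0  # coordinate overflows: reset it and carry on
    return tuple(nxt)


I_SNK_MAX = (1, 2, 1, 2)  # sink bounds of the tiling example in Fig. 1
print(succ((0, 1, 1, 2), I_SNK_MAX))  # (0, 2, 0, 0), as in Example 1
```

In hardware, the same behaviour would correspond to a chain of small counters with carry signals, one per coordinate.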
Address Generation
For each token which is written into the FIFO or read from it, the corresponding memory address has to be derived. For this task, we use linearization in production order, as investigations in [5] have shown that this leads to good memory efficiency. This means that the generation of the source addresses is very simple, because for each produced source pixel, the write address simply has to be increased by one. If it exceeds the available buffer size B, which can be selected by the user, then a simple wrap-around to address zero has to be performed. The determination of the read address is unfortunately more complex because of the occurring out-of-order communication. Due to the linearization in production order, we first need to calculate the producing source invocation, from which we can then derive the memory address. Although it is easily possible to establish the corresponding mathematical relations (see also Section 6.1), we observed that their solution in general requires several multiplications or even integer divisions. As especially the latter are very expensive in hardware and both in general require several clock cycles, we developed another approach which turned out to work very well for practical applications.
We observed in fact that relative address generation is efficiently possible despite out-of-order communication. Take for instance the example given in Fig. 1. Due to the linearization in production order, the address of a data element simply corresponds to the number of the producing source actor invocation, which is represented by Arabic numerals in Fig. 1. If we now follow the sink invocations in the order indicated by the dashed arrows, we can observe that the address of the accessed data element can be easily derived from the address of the previous invocation. In the concrete example, the address increment simply amounts to one as long as we stay in the same line of the same tile. If we move to the next line in the same tile, the address is increased by seven. Moving to the next tile in horizontal direction means an address decrement of eight, and so on. In other words, relative address generation is easily possible by simply taking the value of the sink iteration vector into account, which tells us the current position of the sink actor.
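The increments quoted above can be checked with a short Python sketch. It assumes the geometry implied by the iteration bounds of Fig. 1 (an image of 4 × 9 pixels split into tiles of 2 × 3 pixels); the function names and the handling of the very last pixel (the next image is assumed to follow directly) are illustrative choices, not necessarily the generated code of the paper.

```python
from itertools import product

I_SNK_MAX = (1, 2, 1, 2)   # (ty, tx, ry, rx) bounds for Fig. 1
IMAGE_WIDTH = 9            # assumed: 3 tiles of 3 pixels per image row


def addr_inc(i_snk):
    """Address increment leading from sink iteration i_snk to its successor."""
    ty, tx, ry, rx = i_snk
    if rx < 2:
        return 1    # stay in the same line of the same tile
    elif ry < 1:
        return 7    # next line of the same tile
    elif tx < 2:
        return -8   # next tile in horizontal direction
    else:
        return 1    # first pixel of the next row of tiles (or next image)


def absolute_addr(i_snk):
    """Reference: number of the source invocation producing the pixel."""
    ty, tx, ry, rx = i_snk
    return (2 * ty + ry) * IMAGE_WIDTH + (3 * tx + rx)


def walk(buffer_size):
    """Follow the sink through one image using only relative increments."""
    addr, visited = 0, []
    # itertools.product enumerates the vectors in lexicographic order.
    for i_snk in product(*(range(m + 1) for m in I_SNK_MAX)):
        visited.append(addr)
        addr = (addr + addr_inc(i_snk)) % buffer_size  # wrap around at B
    return visited
```

Comparing `walk(B)` with `absolute_addr(i) % B` for every sink iteration confirms that no multiplication or division is needed at run time, only one addition with wrap-around per read.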
As we want to process very large images, however, we cannot just synthesize a look-up table which associates the corresponding address offset with each sink iteration vector, as this would be extremely expensive in terms of hardware resources and synthesis time. Instead, we have to group as many identical address offsets as possible. Figure 3(a) shows the pseudo-code for the corresponding algorithm, which generates nested if-then-else statements expressing the correct address increment. It is started with j = 1 and is given a table $T : I_{snk} \to \mathbb{Z}$ which assigns to each sink invocation $i_{snk}$ the address increment $T(i_{snk})$ required in order to calculate the data element address of the next invocation. Based on this table, the algorithm groups identical address increment values in order to obtain a compact representation in the form of nested conditionals. To this end, line (05) checks whether there exist two table entries which differ only in coordinate j and which do not have the same value. If this is the case, a corresponding distinction in the form of a conditional has to be introduced. The latter is generated in lines (06) and (08), whereas line (02) outputs the assignment of the result variable addr_inc.
Part (b) of Figure 3 shows the code generated by the above algorithm for the tiling example in Fig. 1. As can be seen, the number of required if-statements is much smaller than the number of pixels forming the image. They can hence be efficiently synthesized in hardware.
(04) for $\langle c, e_j \rangle = 0 : \langle i_{snk,max}, e_j \rangle - 1$
[...]
end if
(10) end for
(11) $\langle c, e_j \rangle = \langle i_{snk,max}, e_j \rangle$;
(12) create_cond(j+1, c);
Fig. 3. (a) Coding of the address offsets by nested conditionals. The resulting code output is indicated by the "→" sign. $I_{snk}(j, c) = \{ i_{snk} \in I_{snk} \mid \forall 1 \le k < j : \langle i_{snk} - c, e_k \rangle = 0 \}$. (b) The (reformatted) code generated by the algorithm for the example given in Fig. 1. Else-statements immediately followed by an if-statement are replaced by elsif constructs.
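The grouping idea can be illustrated by the following Python sketch. It is a simplified stand-in for the algorithm of Fig. 3(a), not a transcription of it: it visits every table entry and merges neighbouring coordinate values whose sub-trees are identical. The table construction assumes the 4 × 9 image with 2 × 3 tiles of Fig. 1 and an increment of one after the last pixel; all function names are hypothetical.

```python
from itertools import product

I_SNK_MAX = (1, 2, 1, 2)  # (ty, tx, ry, rx) bounds for Fig. 1


def address(i):
    """Source invocation number of the pixel read by sink iteration i."""
    ty, tx, ry, rx = i
    return (2 * ty + ry) * 9 + (3 * tx + rx)


def make_table():
    """T(i_snk): address increment from each sink iteration to the next one."""
    order = list(product(*(range(m + 1) for m in I_SNK_MAX)))
    table = {}
    for cur, nxt in zip(order, order[1:]):
        table[cur] = address(nxt) - address(cur)
    table[order[-1]] = 1  # assumption: the next image follows directly
    return table


def build(table, i_max, j=0, prefix=()):
    """Return an int leaf or a node (j, [(upper_bound, subtree), ...])."""
    if j == len(i_max):
        return table[prefix]
    branches = []
    for c in range(i_max[j] + 1):
        sub = build(table, i_max, j + 1, prefix + (c,))
        if branches and branches[-1][1] == sub:
            branches[-1] = (c, sub)      # same subtree: widen the range
        else:
            branches.append((c, sub))
    if len(branches) == 1:
        return branches[0][1]            # coordinate j needs no distinction
    return (j, branches)


def evaluate(tree, i):
    """Walk the nested conditionals for iteration vector i."""
    while not isinstance(tree, int):
        j, branches = tree
        tree = next(sub for bound, sub in branches if i[j] <= bound)
    return tree


def count_tests(tree):
    """Number of comparisons (if/elsif conditions) in the tree."""
    if isinstance(tree, int):
        return 0
    _, branches = tree
    return (len(branches) - 1) + sum(count_tests(sub) for _, sub in branches)
```

For the example of Fig. 1, the resulting tree needs only a handful of comparisons for 36 table entries, and the number of comparisons depends on the tile structure rather than on the number of pixels.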
The next section shows, how based on the introduced memory model, efficient flow control can be realized. In other words, we want to solve the question when the source or sink can be executed.
Whereas for one-dimensional FIFOs, the fill-level control is rather easy to implement, the out-of-order communication makes this task more challenging. Consider for instance once again the tiling operation illustrated in Fig. 1 . Then it can be easily recognized, that each of the first three source invocations immediately allows the sink to execute once. The source invocations 3-8 however do not allow the sink to continue, because due to out-of-order communication, the latter one requires the data element produced by source invocation 9. On the other hand, once the sink has processed pixel 11, it can immediately continue with pixels 3-5 without waiting for the source, because they are already available.
A similar reasoning is valid for freeing buffer elements and hence for the question, whether the source can still execute or has to wait due to a full buffer. Because of the linearization in production order, pixels stored in the buffer can only be freed in the same order in which they have been produced. In other words, in Fig. 1 it is not possible to discard data elements 9-11 before 3-5, because otherwise we would get holes in the address space which would be too complex to handle in hardware. This however also means, that no buffer elements can be freed, when the sink processes pixels 9-11. On the other hand, when discarding pixel 8, also pixels 9-14 can be freed because they have already been processed.
This example clearly shows that, in contrast to one-dimensional FIFOs, it is no longer sufficient to count the tokens stored in the buffer in order to derive whether the source or the sink can execute. Consequently, in the architecture shown in Fig. 2, both the source and the sink fill-level control modules contain their own counters indicating the number of possible source or sink invocations, respectively. Both are initialized with the correct values during startup. If neither initial tokens nor virtual border extension [4] occur, the sink counter is set to zero and the source counter equals the buffer size. Each time the source or sink fires, the corresponding counter is decreased by one. Additionally, whenever the source executes, it communicates the number of additional sink invocations to the sink fill-level control and vice versa.
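A behavioural sketch of this counter scheme is given below in Python. It is not RTL and not the authors' implementation: the class name, the indexing of the tables by invocation number instead of iteration vector, and the two hand-derived tables for one row of tiles in Fig. 1 (pixels 0 to 17, assuming a 4 × 9 image with 2 × 3 tiles) are assumptions made for illustration.

```python
class FlowControl:
    """Counter-based fill-level control (behavioural sketch)."""

    def __init__(self, buffer_size, delta_snk, delta_src):
        self.wr_count = buffer_size  # possible source invocations
        self.rd_count = 0            # possible sink invocations
        self.delta_snk = delta_snk   # indexed by source invocation number
        self.delta_src = delta_src   # indexed by sink invocation number
        self.src_pos = 0
        self.snk_pos = 0

    @property
    def full(self):
        return self.wr_count == 0

    @property
    def empty(self):
        return self.rd_count == 0

    def write(self):
        if self.full:
            raise RuntimeError("source has to wait: FIFO full")
        self.wr_count -= 1
        # Tell the sink side how many additional reads became possible.
        self.rd_count += self.delta_snk[self.src_pos]
        self.src_pos = (self.src_pos + 1) % len(self.delta_snk)

    def read(self):
        if self.empty:
            raise RuntimeError("sink has to wait: FIFO empty")
        self.rd_count -= 1
        # Tell the source side how many buffer places were freed.
        self.wr_count += self.delta_src[self.snk_pos]
        self.snk_pos = (self.snk_pos + 1) % len(self.delta_src)


# Hand-derived tables for one row of tiles of Fig. 1 (the pattern repeats).
DELTA_SNK = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 4, 1, 1, 4, 1, 1, 1]
DELTA_SRC = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 7, 1, 1, 1]

fc = FlowControl(12, DELTA_SNK, DELTA_SRC)
for _ in range(9):       # source produces pixels 0-8
    fc.write()
print(fc.rd_count)       # 3: pixels 3-8 did not enable any further read
```

The demonstration reproduces the behaviour described for Fig. 1: source invocations 3 to 8 leave the sink counter unchanged, and reading pixels 9 to 11 does not free any buffer place.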
The next subsection will show how the corresponding values can be derived.
Invocation Number Calculation
In order to determine the possible number of source and sink invocations, we need to know the number of additional possible sink invocations $\Delta_{snk}(i_{src})$ resulting from the source invocation $i_{src}$ as well as the number of additional source invocations $\Delta_{src}(i_{snk})$ due to the execution of $i_{snk}$. Both questions can be answered with the help of a Parametric Integer Program (PIP) [14], which minimizes over a system of linear inequalities with respect to the lexicographic order $\prec$. The latter establishes an order on $\mathbb{Z}^n$ and is defined as follows:
$$i_1 \prec i_2 \;\Leftrightarrow\; \exists k,\ 1 \le k \le n : \langle i_1, e_k \rangle < \langle i_2, e_k \rangle \;\wedge\; \forall l,\ 1 \le l < k : \langle i_1, e_l \rangle = \langle i_2, e_l \rangle$$
For example, $i_1 = (1, 1, 1, 2) \prec i_2 = (1, 2, 0, 0)$. Due to our definition of the iteration vectors (see Section 4), $i_1 \prec i_2$ also means that $i_1$ is executed before $i_2$.
With the help of a particular parametric integer program, we can now calculate $\Delta_{src}(i_{snk})$. Given for instance the current sink iteration $i_{snk,0} \in I_{snk} \subset \mathbb{N}^{n_{snk}}$, the following PIP searches for the lexicographically smallest source iteration $i_{src} \in I_{src} \subset \mathbb{N}^{n_{src}}$ whose data elements are still required:
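A program of this kind can be sketched as follows. The formulation is assembled from the way Equations (2) to (7) are described in the surrounding text and is therefore an assumption, in particular the assignment of equation numbers and the absence of offset terms in the mapping.

```latex
\begin{align}
f(i_{snk,0}) \;=\; \min_{\prec}\; & i_{src} \tag{2}\\
\text{subject to}\quad
  & 0 \le i_{src} \le i_{src,max} \tag{3}\\
  & 0 \le i_{snk} \le i_{snk,max} \tag{4}\\
  & i_{snk,0} \preceq i_{snk} \tag{5}\\
  & \underbrace{M_{src}\, i_{src}}_{(a)} \;=\; \underbrace{M_{snk}\, i_{snk}}_{(b)} \tag{6}\\
  & 0 \le i_{snk,0} \le i_{snk,max} \tag{7}
\end{align}
```

Here $\preceq$ denotes "lexicographically smaller than or equal to", and the minimum is taken over all $i_{src}$ for which a suitable $i_{snk}$ exists.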
$i_{snk,0}$ is considered as the PIP parameter whose possible range is specified by the context [14] in equation (7). Equation (6) describes the data element mapping. Part (a) calculates the coordinates of the pixel produced by the source invocation $i_{src}$, part (b) the pixel coordinates accessed by the sink iteration $i_{snk}$. Together, they establish a relation between the sink iteration $i_{snk}$ and the corresponding source iteration $i_{src}$ producing the required data element. Working on m-dimensional images, $M_{src}$ is an $m \times n_{src}$ matrix and $M_{snk}$ an $m \times n_{snk}$ matrix. For the example given in Fig. 1, we have
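With the iteration vectors introduced in Section 4 and pixel coordinates written as (row, column), the two matrices for Fig. 1 would presumably read as follows. The coordinate ordering and the tile size of 2 × 3 pixels are inferred from the iteration bounds and are assumptions.

```latex
M_{src} =
\begin{pmatrix}
1 & 0 \\
0 & 1
\end{pmatrix},
\qquad
M_{snk} =
\begin{pmatrix}
2 & 0 & 1 & 0 \\
0 & 3 & 0 & 1
\end{pmatrix}
```

As a plausibility check, $M_{snk} \cdot (0, 1, 1, 2)^T = (1, 5)^T$, i.e. row 1 and column 5 of the image, which is pixel $1 \cdot 9 + 5 = 14$ as in Example 1.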
Equations (3) and (4) specify the possible iteration range. Equation (5) finally takes care, that only sink iterations which do not occur before i snk,0 are taken into account. This relation can be transformed into a simple inequality as required for PIPs by establishing the following order:
The overall system of inequalities thus searches for the earliest source iteration (Eq. (2)) whose produced data element is required by a sink invocation $i_{snk}$ (Eq. (6)) which does not occur before $i_{snk,0}$. In other words, let $i_{src} = f(i_{snk,0})$ be the solution of the above PIP. Then we know that all data elements produced before $f(i_{snk,0})$ are not required anymore and can be discarded. If we now solve the same PIP for the successor of $i_{snk,0}$, we can derive the number of additional data elements which can be discarded after execution of $i_{snk,0}$ and hence the number of additional possible source invocations:
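Following this description, the relation can be written as the difference of two enumerated PIP solutions. The exact notation is an assumption; the last sink iteration, which has no successor within the image, would need separate treatment.

```latex
\Delta_{src}(i_{snk,0}) \;=\;
O_{src}\bigl(f(succ(i_{snk,0}))\bigr) \;-\; O_{src}\bigl(f(i_{snk,0})\bigr)
\tag{8}
```

For Fig. 1 and $i_{snk,0} = (0, 2, 0, 2)$, which reads pixel 8, the earliest pixel still required is pixel 8 itself, whereas after its execution it is pixel 15. This yields $\Delta_{src} = 15 - 8 = 7$, in line with the observation that pixels 9-14 can be freed together with pixel 8.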
$O_{src} : I_{src} \to \mathbb{N}_0$ is a function which enumerates all source invocations as shown in Fig. 1. A similar reasoning can be performed for $\Delta_{snk}(i_{src})$.
Solution of the PIP
Solutions of parametric integer programs can be expressed symbolically by help of nested conditionals. The latter ones can be obtained by help of the PIP-library [15] which can solve parametric integer programs as those presented in the previous section. However, in practical implementations, we observed severe difficulties. First, the expressions returned by the PIP-library are extremely complex. Although we succeeded to perform various simplifications, for some examples they stayed unsuitable for a hardware implementation. Secondly, even for very small image sizes, we observed sometimes a tremendous calculation effort and extremely huge memory requirements. Even worse, sometimes the PIP-library failed completely.
Consequently, we have elaborated an alternative approach to solve the above parametric integer program. It allows deriving $f(i_{snk,0})$ by simulation and is based on the buffer analysis presented in [5]. Whereas this does not allow for symbolic solutions, it can process very large images in reasonable time. As a result, we obtain a table $T : I_{snk} \to \mathbb{N}_0$ which assigns to each sink iteration the resulting number of additional source invocations. With the help of the algorithm shown in Fig. 3, this table can be coded as nested conditionals which can be efficiently synthesized in hardware.
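How such tables can be obtained by enumeration instead of symbolic solving is sketched below in Python for the example of Fig. 1. This is a brute-force illustration, not the buffer analysis of [5]; the image and tile sizes, the function names, and the convention that nothing is required after the last sink invocation are assumptions.

```python
from itertools import product

ROWS, COLS = 4, 9        # image size assumed for Fig. 1
TILE_H, TILE_W = 2, 3    # tile size assumed for Fig. 1


def sink_order():
    """Source invocation numbers in the order in which the sink reads them."""
    vectors = product(range(ROWS // TILE_H), range(COLS // TILE_W),
                      range(TILE_H), range(TILE_W))
    return [(ty * TILE_H + ry) * COLS + (tx * TILE_W + rx)
            for ty, tx, ry, rx in vectors]


def delta_src_table(order, n_src):
    """Entry k: buffer places freed by the k-th sink invocation."""
    # f[k] = earliest source invocation still needed by sink invocations k, k+1, ...
    f = [n_src] * (len(order) + 1)  # assumption: nothing needed after the last one
    for k in range(len(order) - 1, -1, -1):
        f[k] = min(order[k], f[k + 1])
    return [f[k + 1] - f[k] for k in range(len(order))]


def delta_snk_table(order, n_src):
    """Entry p: sink invocations enabled by the p-th source invocation."""
    table, k = [], 0
    for p in range(n_src):
        start = k
        # The sink advances as long as the pixel it needs next already exists.
        while k < len(order) and order[k] <= p:
            k += 1
        table.append(k - start)
    return table
```

Both functions run in time linear in the number of pixels, which is the property that makes a simulation-based derivation attractive for large images.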
Elimination of Modular Dependencies
Whereas for the example given in Fig. 1 the determination of $\Delta_{src}(i_{snk})$ by the above approach does not cause any difficulties, $\Delta_{snk}(i_{src})$ shows inherent modular dependencies. This is because the number of additional possible sink invocations depends on the tile structure. Consider for instance the first row of tiles in Fig. 1. Then for all tiles except the last one, we observe that whenever the source has generated the last pixel of a tile, the sink can not only read this pixel, but also all lines of the next tile except for the last line. Hence, in order to determine the value of $\Delta_{snk}(i_{src})$, the multi-dimensional FIFO has to know whether the source has produced the last pixel of a tile. Mathematically, this is nothing else than checking whether
$$\langle i_{src}, e_1 \rangle \bmod 2 = 1 \tag{9}$$
$$\langle i_{src}, e_2 \rangle \bmod 3 = 2 \tag{10}$$
Unfortunately, this modular dependency increases the complexity of the resulting hardware implementation. As the nested conditionals generated by the algorithm shown in Fig. 3 do not contain any modulo function, they are translated into a potentially huge number of conditions. Equation (10) for instance can be represented by $\langle i_{src}, e_2 \rangle = 2 \vee \langle i_{src}, e_2 \rangle = 5 \vee \ldots$. This however increases the required resources for a hardware implementation. Furthermore, even if the conditions included a modulo function, this would not help much, as its hardware realization is expensive too.
Fortunately, the situation can be easily improved by replacing $i_{src}$ with a four-dimensional vector $i^*_{src} = (y^*_{src,1}, y^*_{src,2}, x^*_{src,1}, x^*_{src,2})$ with $0 \le i^*_{src} \le (1, 1, 2, 2)$. This removes the modular dependencies, because the tile structure is now already reflected in the source iterator. In other words, equation (10) for instance can be replaced by $x^*_{src,2} = 2$, which can be efficiently represented in hardware.
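The effect of the extended source iterator can be checked with a small Python sketch. The decomposition of row and column into tile index and tile-relative position is inferred from the bounds (1, 1, 2, 2) and is an assumption, as are the function names.

```python
from itertools import product

I_SRC_STAR_MAX = (1, 1, 2, 2)  # bounds of (y1, y2, x1, x2)


def pixel(i_star):
    """Row and column addressed by the four-dimensional source iterator."""
    y1, y2, x1, x2 = i_star
    return 2 * y1 + y2, 3 * x1 + x2


def last_pixel_of_tile(i_star):
    """Conditions (9) and (10) expressed without any modulo operation."""
    _, y2, _, x2 = i_star
    return y2 == 1 and x2 == 2


# Lexicographic enumeration of the extended iterator.
iterations = list(product(*(range(m + 1) for m in I_SRC_STAR_MAX)))
```

Enumerating the extended vectors in lexicographic order still visits the pixels in raster-scan order, so the source behaviour is unchanged while the tile-border test reduces to two equality comparisons.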
Results
In order to verify our concept of the multi-dimensional FIFO, we have implemented it in VHDL and verified its functional correctness by help of different Modelsim simulations. In order to get an idea about the achievable speed, we have furthermore synthesized several configurations of the out-of-order communication shown in Fig. 1 . As we want to process very huge images in real-time, we have written the VHDL code in such a way, that critical operations can be pipelined over several clock cycles. For instance, we allow the calculation of ∆snk and ∆src to take several clock cycles as this does not violate our requirement to process one pixel per clock cycle. Additionally we have realized a pipelined memory access as it is found in many high-speed memories: As for high frequencies it is impossible to retrieve the desired data word within one clock cycle, the interface is designed in such a way, that the memory controller can issue one read request per clock cycle while it might take more than one clock cycle until the requested data word effectively arrives. We deploy the same principle for the sink address generation. Whereas both the sink address generation and the data access might take more than one clock cycle, we have designed the FIFO interface in such a way, that the sink can issue one read-request per clock cycle.
Tab. 1 shows the hardware results obtained after place-and-route with the Xilinx ISE 8.2 tools. Two different configurations are tested, using a big and a small tile size. The latter allows using internal block ram (BRAM) as FIFO buffer, whereas this is infeasible for the big tile size. In the latter case, we assume an external memory by assigning the address bits to FPGA pins. Note that an external memory controller is not taken into account, because the generation of FPGA output signals and sampling of FPGA input signals significantly complicates the achievement of high frequencies due to tight phase requirements. As however the proposed multi-dimensional FIFO is independent of the applied memory controller, the latter one is omitted in order to avoid influence on the synthesis timing. As it can be seen, the proposed architecture for the multi-dimensional FIFO achieves both for recent and older FPGA technology very good operation frequencies. In the case of a Virtex4 device, the frequencies are even sufficient to process an image with 4096 × 2140 pixels at more than 30 frames per second which is a considerable throughput. Moreover, even with a Virtex2, 20 frames per second are possible. Nevertheless, the resource consumption is acceptable, needing not more than 1% of a Virtex2 6000 and not more than 4% of a Virtex4 LX25 which is a rather small chip. Table 2 shows the overhead caused by the support of out-of-order communication. It compares the developed multi-dimensional FIFO with an ordinary FIFO generated by the Xilinx CORE Generator [16] . In order to allow for a fair comparison, the multi-dimensional FIFO is configured in such a way that data production and consumption occur in the same order and can thus be realized by an ordinary FIFO. Both FIFOs dispose of the same amount of memory equaling 16384 items. The multi-dimensional FIFO is synthesized in two variants. 
One uses the same pipeline settings required to achieve the high synthesis frequencies of Table 1 . The second takes into account, that identical consumption and production order is less complex than out-of-order communication. Hence, less pipeline steps are required. Nevertheless, as expected, the multi-dimensional FIFO in both cases requires more resources and the achievable frequency is smaller than for the CORE Generator FIFO. We identified as underlying reason the address calculation which is more complex for out-of-order communication compared to an ordinary FIFO. Consequently, the VHDL implementation needs to deploy more complex logic structures. Unfortunately, the synthesis tool is not able to remove this complexity even when the multi-dimensional FIFO is configured for identical consumption and production order. Consequently, a possible optimization strategy for complete systems is to replace multi-dimensional FIFOs with ordinary ones if no out-of-order communication is required.
However, whenever this is not possible like for the example shown in Fig. 1 , our implementation proves to achieve high throughput with acceptable resources. Thanks to our static compile time analysis, we achieve higher synthesis frequencies compared to [12, 13] (see Section 2). Furthermore, our implementation does not need any valid-bit and allows one read and write access per clock cycle which is not possible in [12, 13] . On the other hand, as [12] uses dynamic memory allocation, it has an increased flexibility and possibly better memory utilization. 
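The real-time claim for digital cinema images can be checked with a short calculation, assuming one pixel per clock cycle and ignoring blanking intervals or any other protocol overhead:

```latex
\begin{align*}
4096 \cdot 2140 &= 8\,765\,440 \ \text{pixels per frame} \\
8\,765\,440 \cdot 30\ \mathrm{s^{-1}} &= 262\,963\,200 \ \text{pixels/s} \approx 263\ \mathrm{MHz} \\
8\,765\,440 \cdot 20\ \mathrm{s^{-1}} &= 175\,308\,800 \ \text{pixels/s} \approx 175\ \mathrm{MHz} \\
\frac{300 \cdot 10^{6}\ \mathrm{s^{-1}}}{8\,765\,440} &\approx 34.2 \ \text{frames per second}
\end{align*}
```

A clock frequency of 300 MHz thus leaves a margin of roughly 14% over the 30 frames per second requirement, while 20 frames per second need a clock of about 175 MHz.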
Conclusions
Due to the increasing complexity of modern image processing applications, classical hardware description at register transfer level gets more and more inadequate.
Modeling with the help of multi-dimensional data flow and providing efficient means for the required synthesis is one possibility to alleviate the situation, because the application is described at a higher level of abstraction. This paper contributes to this new methodology by providing an efficient and fast implementation of a multi-dimensional FIFO which allows for out-of-order communication. The latter is required by different applications like JPEG2000 or JPEG. As a major contribution, this paper presented an architecture which is able to read and write one pixel per clock cycle. Usage of linearization in production order leads to memory-efficient solutions and trivial source address generation. Determination of the corresponding sink address is more challenging due to the occurring out-of-order communication, but can be efficiently solved by relative address calculation. In order to answer the question of how often the source or sink can still be fired before the FIFO gets full or empty, a parametric integer program can be established. For its solution, two different approaches are available, namely with the help of the PIP-library and by simulation. Especially the latter allows processing very large images. The obtained synthesis results show that a very high throughput can be achieved, allowing images with 4096 × 2140 pixels to be processed in real-time.
