h TRADITIONALLY, PARALLEL IMPLEMENTATIONS of multimedia algorithms are carried out manually, since the automation of this task is very difficult due to the complex dependencies that generally exist between different elements of the data set. Moreover, there is a wide family of iterative multimedia algorithms that cannot be executed with satisfactory performance on multiprocessor systems-on-chip (MPSoCs) or graphics processing units (GPUs). For this reason, new methods to design custom hardware circuits that exploit the intrinsic parallelism of multimedia algorithms are needed. As a consequence, in this paper, we propose a novel design method for the definition of hardware systems optimized for a particular class of multimedia iterative algorithms. We have successfully applied the proposed approach to several real-world case studies, such as iterative convolution filters and the Chambolle algorithm, and the proposed design method has been able to automatically implement, for each one of them, a parallel architecture able to meet real-time performance (up to 72 frames per second for the Chambolle algorithm), with on-chip memory requirements from two to three orders of magnitude smaller than the state-of-the art approaches.
In the last years, embedded systems have become a valuable platform not only for simple and low-power applications, but also for the implementation of very complex algorithms, especially in the field of image and video processing [1] . Within this context, iterative algorithms that work on large matrices [11] can be implemented today on GPUs, MPSoCs, or field-programmable gate arrays (FPGAs). However, GPUs and MPSoCs often fail in achieving real-time performance, as shown in The Case Study: The Chambolle Algorithm for Optical Flow Estimation, on certain families of iterative algorithms [8] , [9] , such as the iterative convolution filters presented in Algorithm 1, where: h N is the number of iterations to be performed; h X is the horizontal width of the input image; h Y is the vertical height of the input image (so X Â Y is the size of the input image).
Editor's notes:
The methodology from this paper exploits fine and coarse-grained parallelism for the automated design of digital architectures for multimedia applications. Specific focus is placed on iterative algorithms, as demonstrated through a case study of the Chambolle algorithm for optical flow estimation. VNicola Nicolici, McMaster University
The reason why these architectures are not suited for the considered family of algorithms is that, on the one hand, it can be very difficult to exploit the single-instruction multiple-data (SIMD) paradigm with the static structure of GPUs (which have a fixed number of computational units) to implement algorithms with very complex data dependencies among iterations. On the other hand, MPSoCs are best suited for application or task level parallelism, where the operations on the input data are very sophisticated and fit the high complexity of each processing core. On the contrary, FPGAs provide a fully customizable platform where any kind of custom operation, either complex or very simple, can be implemented in hardware and applied on multiple blocks of data in parallel. Unfortunately, the design of custom systems could be a very challenging task, since methods to drive the designer in the definition of architectures optimized for multimedia algorithms are still missing.
Imageðx; yÞ ¼ Tempðx; yÞ end for end for end for
In addition to this, the parallel implementation of an algorithm that works on large amounts of data and, in particular, on matrices, is also inherently difficult because of the nature of the algorithm itself, as it typically contains complex data dependencies. Historically, algorithms for multidimensional data processing are expressed in a serial form, either because they are meant to illustrate a way to solve the problem rather than to provide an actual implementation, or because they are originally targeted for a software execution on an instruction set processor. The serial version of such algorithms is generally inefficient and inadequate to process large amounts of data, especially at a very high rate (e.g., in video processing applications [9] , [11] ).
Despite the need for parallelism, the hardware design for algorithms that deal with large sets of data is still carried out manually, and its automation is discouraged by the complex dependencies that generally exist between different elements of the data set. In this context, we present in this paper an innovative design method that goes beyond the code manipulation performed by the existing loop transformation techniques, as it is specifically tailored to a well-defined class of algorithms and to a fine-grained programmable hardware. Hence, it is able to take into account the typical dependency schema of multimedia applications, and to exploit them to extract an efficient custom-shaped implementation. With respect to other loop transformation techniques, our approach is able to generate a circuit that directly computes the result of an arbitrary nth iteration, while automatically taking into account the required data dependencies. The ability of combining the unrolling of multiple iterations with the possibility of partitioning the input (i.e., performing the computation starting from a subset of the input matrix), and to possibly process a redundant amount of data in order to guarantee the data dependencies, make our approach innovative with respect to the existing techniques.
Target family of algorithms
The design method proposed in this paper can be successfully applied to a wide set of iterative algorithms that work on matrices, which are typical in image and video processing. In fact, many algorithms in this field [9] , [11] aim at finding an output matrix of the same size as the input (e.g., a filtered image) by using an iterative process: each iteration produces an intermediate matrix, which is computed by processing one or more elements of the result produced during the previous iteration.
Algorithm 1 provides an illustrative example of this class of algorithms: the pseudocode emphasizes the iterative behavior (i.e., the N iterations of the outermost loop), as well as the necessity to scan the entire intermediate matrix (consisting of X Â Y elements) to produce the updated result.
Although the pseudocode in Algorithm 1 might seem very specific, its structure models a large number of existing algorithms, especially in the multimedia field. For instance, all the algorithms presented in [1] , [2] , [4] , [8] , and [11] can be expressed in that form. However, if the whole algorithm has to be repeated several times starting from different input data, as it happens in [4] , each execution can be considered independent from the others, since they do not share intermediate results.
Thus, the approach proposed in this work is still valid under the condition that the generated system is used sequentially for each execution.
In order to formalize the considered family of algorithms, we define the following notations: h ðx; y; nÞ represents the element ðx; yÞ of the intermediate matrix at iteration n;
h Gðx; y; nÞ represents the set of elements that are necessary to correctly compute the value of the element ðx; y; nÞ; in other words, it corresponds to the dependency schema of the algorithm. Since the elements belonging to Gðx; y; nÞ are used to compute an element at iteration n, they have to be generated at iteration n À 1.
An algorithm can be successfully handled by the proposed methodology if the two following conditions hold, which is typically true for iterative algorithms.
1) No read-after-write (RAW) conflicts should exist
within an iteration, because the first innermost loop performs the evaluation of the intermediate matrix coming from the previous iteration and stores the results in a temporary variable, while the second innermost loop updates the intermediate matrix. This means that the computation of an element at iteration n cannot depend on the value of another element at iteration n, but only on previously generated elements (the ones at iteration n À 1). 2) For each generic element ðx; y; nÞ, the set Gðx; y; nÞ should only depend on the position ðx; yÞ of the element. In other words, it should be possible to compute an element of the intermediate matrix starting from a well-defined and limited neighborhood of the element itself, taken from the intermediate matrix coming from the previous iteration.
Implementation approaches
Several approaches can be adopted to design an optimized hardware system for the considered family of algorithms. In Figure 1 , we are considering a quite simple example where an operation k, with two adjacent inputs ðInput k ¼ 2Þ and a single output ðOutput k ¼ 1Þ, is executed on all the elements of the input matrix (which are X Â Y ) for four iterations ðN ¼ 4Þ. Thus, each iteration of the algorithms requires a number Z ¼ ðX Á Y Þ=Output k of k operators. In this context, the implementation of all the k operators needed to execute the N iterations of the algorithm is not a viable solution in typical realistic scenarios, because of its extremely large area requirements (the area usage is directly proportional to both Z and N ), which makes it almost impossible to implement this solution on actual devices for reasonable values of X and Y (e.g., almost eight millions of k operators would be necessary to compute ten iterations of the considered family of algorithms on a 1024 Â 768 image).
The first approach that can be found in the literature [5] , indicated with A in Figure 1 , corresponds to the implementation of a single instance of the operator k, which is used several times in order to compute all the intermediate and final results of the algorithm, which are the ðx; y; nÞ elements introduced in the previous section. The area usage of this first approach is very low, since only a single instance of the operator k has to be implemented in hardware. However, also the performance of this approach is low, since all the operations have to be performed in a completely sequential fashion (only a single operator is available), and its memory requirements are huge, since all the intermediate results have to be stored for the subsequent iteration.
The second approach, indicated as B in Figure 1 , involves a window of k operators running in parallel [6] . This approach is widely used in the literature [7] , especially on algorithms characterized by simple dependencies. However, this approach cannot be considered a viable solution when dealing with algorithms characterized by complex dependencies, since it does not take into consideration their dependency schema, thus leading to unacceptable overheads in terms of both computation time and memory requirements. In fact, if the dependencies among the iterations are not taken into account, it is not possible to ensure that all the output values of an intermediate computational step are directly used in the subsequent step, thus it could happen that some of them have to be stored for later use (memory overhead), or discarded, and then computed again when necessary (timing overhead). In particular, storing all the data requires a big on-chip or off-chip memory (able to contain the whole input image). The first case is obviously very demanding in terms of on-chip memories (which are usually not sufficient to hold the whole input image), while the second one, in addition to the memory requirement, introduces a nonnegligible timing overhead since each time an old value is needed, it has to be retrieved from the external memory. A third option is to abolish the memory requirements by discarding all the values that cannot be stored anymore in the small local cache, thus computing again a value every time it is needed (and this could happens hundreds or thousands of times, depending on the complexity of the dependency schema of the target algorithm).
Finally, the approach indicated as C in Figure 1 is the one proposed within this research work. It deals, at each iteration, only on the subset of data, namely a collection of overlapping Gðx; y; nÞ values, necessary to compute the information needed by the following iteration. Since in most of the cases the set of elements that can be successfully computed at the iteration n is only a subset of the elements computed at the iteration n À 1, the resulting structure is usually very similar to a 3-D cone, such as in the example shown in Figure 1 .
The shape of the architectures generated with the proposed approach is characterized by the size of the input window ðw i Þ, the size of the output window ðw o Þ, and the number of iterations computed by each cone (which is t ¼ N =H, where H is the number of levels in which the computation of the N iterations is split). The shape of the cones (e.g., the relationship between w i and w o ) strictly relies on the data dependencies schema of the selected algorithm (in other words, on the size and shape of the Gðx; y; nÞ sets).
The proposed design method
The methodology proposed in this paper can be successfully applied to the design of hardware architectures based on a hardware template that can be tuned to meet the constraints imposed by the designer and that is able to exploit two different levels of parallelism.
h Coarse-grained parallelism: This is the parallelism that can be extracted among different iterations of the same loop on the same elements of the data set.
h Fine-grained parallelism: This is the parallelism that can be extracted when computing different elements in the same iteration of the loop.
Assuming that the two conditions presented in the target family of algorithms hold for the iterative algorithm taken under consideration, it is sufficient to analyze the instructions (in C or VHDL code) contained in the innermost loop (e.g., sum ¼ sum þ Kernelðj; iÞ Ã Imageðx À j; y À iÞ in Algorithm 1) to identify the sets Gðx; y; N À tÞ, one for each ðx; y; N Þ element of w o , which represents the inputs of the cone. In other words, it is possible to trace back the dependencies among the different elements (all the ones needed to compute the desired output window specified by the designer as input) at different iterations (from the first iteration to the iteration specified by the designer as input). In fact, as shown in Algorithm 2, in order to build the complete architecture, the designer can specify the size and shape of the desired output window ðw o Þ, as a set of ðx; y; nÞ elements, as well as the number of iterations to be implemented within a single cone ðtÞ.
Algorithm 2: Proposed Design Method
Input from the designer: Once the analysis of the dependencies is completed, it is possible to generate and synthesize the basic hardware cone able to produce the desired output window starting from the elements at the iteration N À t. At this point, if the final hardware cone includes all the iterations of the algorithm (thus if N ¼ t), it can be simply executed sequentially (shifting the location of the output window in order to cover the whole input data set) to generate all the output windows necessary to build the final result. Otherwise, if only a portion of the iterations are processed by the hardware cone (see Figure 1) , it is necessary to build the structure shown in Figure 2 in order to cover (following a multilevel approach) all the iterations required by the target algorithm, as described in the Reverse Scheduling procedure of Algorithm 2. In particular, Figure 2 shows the structure of an architecture consisting of cones spanning four iterations ðt ¼ 4Þ, for an algorithm requiring a total of 12 iterations ðN ¼ 12Þ.
In other words, in the final solution, the computation of the previously introduced innermost loop is performed on a subset of elements, instead of considering the whole data set. This makes it possible to split the computation in several different tasks (implemented in hardware as cones that span on more iterations) that can be executed sequentially or completely in parallel, depending on the amount of resources available on the target device.
The method proposed within this paper (intended as a set of subsequent steps to be performed in order to implement the Approach C of Figure 1 ) can be followed by any designer in order to build a system that follows the previously described architectural template. However, we also developed a design framework that makes it possible for the designer to automatically generate the desired architecture by simply specifying the size of the output window and the number of iterations to be implemented within a single cone. Thus, the designer can exploit this tool in order to explore the design space, selecting the architecture with the highest performance among the ones that fulfill the area constraints imposed by the designer. However, the description of this design framework is out of the scope of this paper, since the method is more generic and it is valid also when applied manually by the designer.
Memory requirements analysis
The main advantage of employing the proposed approach with respect to approaches A and B of Figure 1 is that it makes it possible to exploit the knowledge of the dependency schema of the target algorithm in order to obtain high-performance architectures (as shown by the experimental results on several real-world case studies, such as the one presented in the case study: the Chambolle algorithm for optical flow estimation with very low requirements in terms of on-chip memory.
In fact, from the internal memory usage perspective, the on-chip memory requirement of the approaches A and B in their basic formulation is 2 Ã X Ã Y Ã e, where e is the size of a single element of the input. This is because it is necessary to keep in the internal memory at least two arrays (each containing one X Ã Y element): the first one stores all the intermediate values computed at the previous iteration, while the second one stores the results of the current iteration.
However, by optimizing the memory management, it is possible to drastically reduce the memory required to store the intermediate results, but the total memory requirement is bound to a minimum of X Ã Y Ã e. Thus, the actual memory requirement M1 of the different formulations of the approaches A and B is
On the other hand, the memory requirement of the proposed approach is given by
which is based on (see Figure 2 for a graphical representation of the architecture) the following:
h the number of levels of the whole architecture ðHÞ; h the number of iterations performed by a single cone ðN =HÞ;
h the number of inputs ðSiÞ and the number of outputs ðSoÞ of a single cone;
h the number of inputs ðWiÞ and the number of outputs ðWoÞ of the whole architecture.
In general, the memory requirement of the proposed approach is from two to three orders of magnitude smaller than the one of the architectures following the approaches A and B, as proved in The Case Study: The Chambolle Algorithm for Optical Flow Estimation on a real-world case study.
The case study: The Chambolle algorithm for optical flow estimation
We successfully applied the proposed approach to several real-world case studies based on multimedia applications, such as iterative filters and complex image processing algorithms. For the sake of illustration, one of them is presented and analyzed in this section: the Chambolle algorithm [4] , which is very popular in image processing (especially for the estimation of the optical flow [8] ).
The optical flow is a vector field representing the movement of an object in a sequence of frames, and it can be determined by analyzing the variation of the brightness inside a sequence of successive images [3] . The estimation of this vector field is one of the most important problems in image and video processing, as it can be employed for motion estimation and compensation, as well as in other fields such as robotics and even medical analysis.
From the practical point of view, the optical flow between two images I 0 and I 1 , which are given in the form of two input matrices, is a vector u ¼ ðu 1 ; u 2 Þ. The vector u is initialized at 0, and its final value is computed by means of an iterative way in order to increase the precision of the solution [8] . The value of u can be determined using an iterative technique known as the Chambolle algorithm [4] , which is summarized in Algorithm 3, where the vector v is basically a support variable defined by using a thresholding function based on I 1 and on the value of u computed at the previous iteration (for the sake of simplicity, the pseudocode only shows the computation of u 1 , but u 2 is computed in a similar way).
Algorithm 3: The Chambolle Algorithm
In the execution of Chambolle, the vector u is updated by two intermediate values, namely px ¼ ðpx u1 ; px u2 Þ and py ¼ ðpy u1 ; py u2 Þ [8] , as shown in lines 7 and 8 of Algorithm 3. Variables and are predefined values that determine the precision, while N is the number of iterations to be performed (usually set to 200). The Backward X ðzÞ function returns a matrix where each element of z is reduced by its left neighbor, while in Backward Y , it is subtracted from its upper neighbor, in Forward X , it is subtracted from its right neighbor, and in Forward Y , it is subtracted from its lower neighbor.
Before applying the design method described in this paper, it is necessary to check if the conditions presented in the target family of algorithms hold. Forward operations, it can be derived that the dependencies schema is the one depicted in Figure 3 . In particular, an element depends on seven surrounding elements computed at the previous iteration, as outlined in Figure 3a . As a consequence, the dependencies of each element can be delimited in a 3 Â 3 square.
Since all the conditions are met, the proposed approach can be successfully applied to the Chambolle algorithm. Among the possible solutions that can be generated thanks to the method proposed in this paper, we present here the one considering an 8 Â 8 output window ðw o ¼ 64Þ and able to parallelize two iterations in a single cone (H ¼ 5, N =H ¼ 2), since it is the one characterized by the best performance among all the solution of the design space we have explored. With respect to the schema presented in Figure 2 , the complete architecture, which occupies around 95% of the Slices of a Xilinx Virtex 6 VLX760 FGPA, consists of five levels (since H ¼ 5) for a total of 21 executions of the hardware cone (eight for level 1, six for level 2, four for level 3, two for level 4, and, finally, one for level 5) in order to generate the target 8 Â 8 window of valid final results.
In order to realize a proof-of-concept design, we exploited a Xilinx Virtex 6 (XC6VLX760) FPGA device as underlying prototyping platform, thus synthesizing and running on this FPGA device the system obtained by following the methods presented in this paper. Table 1 shows the comparison between the performance achieved by the proposed approach and the ones obtained by other state-ofthe-art implementations on FPGAs or on other powerful parallel architectures, such as GPGPUs. In particular, the results presented in Table 1 underline the frame rate that can be achieved by the different approaches (thus considering, for each frame, both the time necessary to load the input and store the output and the time needed to perform the whole computation) when executing the Chambolle algorithm (100 or 200 iterations) on a flow of images with a resolution of 512 Â 512 or 1024 Â 768. In [8] , the robust TV À L 1 technique to calculate the optical flow between two frames is proposed and implemented using modern GPUs. The authors proved that a real-time frame rate can be achieved by the most powerful devices for lowresolution sequences, but only few frames that are larger than 512 Â 512 can be processed in one second. Additional hardware results of the execution of TV À L 1 on GPUs can be also found in [9] , but even the fastest implementation cannot process more than 6 frames per second, even with just 512 Â 512 images. Finally, the approach presented by Akin et al. [10] is an ad hoc parallel solution that is able to perform the Chambolle algorithm with a very high frame rate on a Virtex-V FPGA device. For what concerns the memory requirement analysis, when considering 1024 Â 768 images (X ¼ 1024, Y ¼ 768, e ¼8 b), according to (1), the memory requirement M1 of approaches A and B is
However, since it could be difficult to completely satisfy the internal memory requirements of approaches A and B in realistic scenarios, the number of accesses to the external memory of this kind of approaches considerably rises (in order to write and then read back the intermediate results produced once the internal memory is completely full), thus leading to a huge d e g r a d a t i o n o f t h e p e rformance (as shown in the results presented in Table 1 (2) , the memory requirement M2 of the Table 1 Comparison with respect to state-of-the-art implementations.
IEEE Design & Test
architecture generated with the proposed design method is M2 ¼ 2.26 kB:
Thus, in conclusion, the memory requirement of the proposed solution is at least two orders of magnitude smaller than a generic architecture following the design approaches A and B.
IN THIS PAPER, we proposed a design method that can be successfully applied to all the iterative algorithms that fulfill the two conditions presented in the target family of algorithms. Thus, if the target algorithm fits the class model presented in this paper, the proposed design method makes it possible for the designer to define a custom architecture able to execute the specified algorithm with very high performance. We successfully applied the proposed design method to several case studies, such as iterative filters and complex image processing algorithms, outperforming most of the alternative implementations that can be found in the literature. In particular, the experimental results presented in the The Case Study: The Chambolle Algorithm for Optical Flow Estimation show that the proposed approach, when applied to a real-world case study such as the one of the Chambolle algorithm, is the first one that guarantees a real-time execution without a manual optimization of the algorithm (such as the one in [10] ), and at the same time, it presents an on-chip memory requirement from two to three orders of magnitude smaller than other approaches. h h References
