We present a compiler that takes high level signal and image processing algorithms described in MATLAB and generates optimized hardware for an FPGA with external memory. We propose a precision analysis algorithm to determine the minimum number of bits required by an integer variable and a combined precision and error analysis algorithm to infer the minimum number of bits required by a floating point variable. Our results show that, on average, our algorithms generate hardware requiring a factor of 5 fewer FPGA resources, in terms of the Configurable Logic Blocks (CLBs) consumed, than the hardware generated without these optimizations. We show that our analysis reduces the size of lookup tables for functions like sin, cos, sqrt, exp etc. Our precision analysis also enables us to pack several array elements into a single memory location to reduce the number of external memory accesses. We show that this technique improves the performance of the generated hardware by an average of 35%.
INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have recently been used as an effective platform for implementing many image/signal processing applications. Though the concept of using FPGAs for custom computing evolved in the late 1980s, certain recent advancements in FPGA technology have made reconfigurable computing more feasible. Current trends indicate that FPGAs have a faster growth of transistor density than even general purpose processors. The implication is that there will be a sufficient transistor budget for larger and more complex applications to be implemented on FPGAs. Most hardware designers today use hardware description languages like VHDL/Verilog or low level CAD tools to implement designs on FPGAs. This involves dealing directly with the complexities of the hardware and understanding the cycle-by-cycle behavior of millions of gates, which can be very tedious and time consuming. Clearly, there is a need for system level design tools that provide designers with a higher level of abstraction, enabling the next generation of complex FPGA applications with reduced time-to-market.
Many researchers have focused on the use of general purpose languages as a target for hardware synthesis. C/C++ is the most popular target language [19-24]. Some other researchers have attempted to use Java as the target language too [25-27]. Our choice of the MATLAB language is guided by the following facts: (1) MATLAB is extremely popular with the signal/image processing community and is easier and more intuitive to use than C/C++. (2) MATLAB has a rich set of libraries for signal/image processing functions which can be directly mapped to efficient libraries, thus making MATLAB very conducive to design reuse. (3) Large amounts of parallelism can be extracted from MATLAB programs with little or no dependency analysis, as opposed to the complex dependency analysis required by languages like C/C++. One of the major issues in compiling a high level language for FPGAs is generating hardware that not only fits within the FPGAs but also provides high performance. Since controlling the bitwidth of the variables results in the instantiation of lower precision operators in hardware, leading to FPGA resource savings, there is a need to assign bits in an optimal manner. The need to conserve bits has been investigated for architectures like Intel's MMX [2], HP MAX-2 [3] and SUN VIS [4], which allow data paths to operate on subwords. Also, with the current interest in generating low power systems, researchers have proposed turning off bit slices [1]. Stephenson et al. [5] have proposed a precision analysis scheme to determine the required bit level precision for various target architectures. All of the above work on subword control has focused entirely on optimizing the bits for integer programs. Since most real applications have floating point operations, our work presents a unified scheme to determine the minimum number of bits for both integer and floating point applications.
We also present a scheme to use the precision information to pack several array elements to a memory location, so that the total number of memory accesses is reduced, thereby improving performance.
The contribution of the paper can be summarized as follows:
We present a value range propagation algorithm to determine the minimum number of bits required for the integral part of our floating point representation and for integers.
We present an error analysis algorithm to determine the minimum number of bits required to represent the fractional part of our floating point representation.
We present a memory packing algorithm to pack more than one array element into a single memory location to improve the performance of the generated hardware.
This work presents an automated way of improving the hardware generated by the MATCH compiler [15]. The rest of the paper is organized as follows. Section 2 presents an overview of the MATCH compiler. Section 3 motivates the need for a bitwidth analysis phase and presents our representation of integer and real variables in the VHDL description of the hardware. Section 4 and Section 5 describe our algorithms for precision and error analysis to optimize the FPGA resources consumed. Section 6 describes our memory packing algorithm to improve the performance of the generated hardware. We present some experimental results in Section 7 and conclude in Section 8.
OVERVIEW OF THE MATCH PROJECT
The work presented in this paper is part of the MATCH compiler [13]. The MATCH compiler takes in the description of an application in MATLAB and partitions it into software to be executed on general purpose and embedded processors and hardware to be mapped to FPGAs. The generated hardware is targeted at the Xilinx FPGAs on the WildChild(TM) board from Annapolis Micro Systems. In this paper, we address the issues involved in generating efficient hardware once the frontend of the compiler has partitioned the system into hardware and software [27]. In particular, we focus on optimizing the FPGA resources, in terms of the number of Configurable Logic Blocks (CLBs) used up by the instantiation of various operators and registers, and on improving the performance of the system by reducing the total number of clock cycles required for external memory accesses. Figure 1 shows an overview of the MATCH compiler. The input MATLAB code is parsed to build a MATLAB AST based on a grammar developed by us [14]. Since MATLAB is a dynamically typed language, the type and shape of the variables are unknown at compile time. Hence, a compiler phase infers the type of the variables and the dimensions of the matrices and uses this information to scalarize the MATLAB AST. The AST is then levelized, wherein complex expressions are broken down into simple expressions with at most three operands. A dependency analysis phase infers the control and data dependencies present in the AST. A precision and error analysis phase infers the optimum number of bits required for representing the variables in the MATLAB AST and generates a resource optimized VHDL AST. Finally, a memory packing phase packs more than one array element into a single memory location, depending on the array precision, and optimizes the number of memory accesses. The output VHDL code is then passed through commercial synthesis and place and route tools to generate a netlist and bit-stream for the FPGAs.
1530-1591/01 $10.00 © 2001 IEEE
Representation of Integer and Real Variables
All variables in the output RTL VHDL code are mapped to bit vectors of type std_logic_vector, with the width decided by the bitwidth analysis phase. The most significant bit of the bit vector is reserved for the sign bit for both integer and real variables. For integer variables, the value of the variable is converted to binary and directly mapped to the bit vector. For real variables, the most widely used representation is the IEEE floating point format, where all variables are represented by 32 bits. Such a representation is often avoided for reconfigurable computing platforms because the floating point operators typically require too much area to be practical. One of the accepted methods of performing fractional operations is to compute the three components of the floating point result (sign, exponent and mantissa) independently [9, 10]. But such systems do not have high clock rates and are also limited by area [9]. One alternative is to use fixed point representations. The main advantage of such a representation is that all operations can be performed using integer operators, resulting in much lower usage of FPGA resources. We use a fixed point scheme in which real numbers are represented by both an integer part and a fractional part. The advantage of such an approach can be seen from Figure 3 (b) and (c), where we require 32 bits to represent both 5.5 and 1399.75 with a floating point representation, while it takes only 4 bits to represent 5.5 and 13 bits to represent 1399.75 with a fixed point representation. This results in the instantiation of a 17 bit multiplier for the multiplication in Figure 3 (c), as compared to a 64 bit multiplier if the floating point representation is used. Hence, the FPGA resources are optimally used. The main disadvantage of such a representation is that the number of bits required for the integral part would be high if the value range of the variable is high. This is acceptable for our study since the dynamic range of all variables in almost all image and signal processing applications is small. Another disadvantage is that, since both the number of bits required for the integer part and the number of bits required for the fractional part would vary for different variables, integer operators would not give correct results, as they require the binary points of their operands to be perfectly aligned. One solution to this is to remember the number of bits for the fractional part of each real variable and generate conditional code so that integer operators could be used. We have made the assumption that the number of bits required to represent the fractional part is constant for all real variables, while the number of bits required to represent the integer part can vary. Figure 4 shows that the multiplication of two numbers under this assumption requires only integer operators. The middle column shows that the multiplication involves an XOR operation on the sign bits and a normal integer operation on the other bits. The last column of Figure 4 shows the actual integer multiplication. Our algorithms in the next section can accurately determine the number of bits required for the integer part of a real variable. Hence, the output of the multiplication in the last column of Figure 4 is sampled so that the first 5 bits are taken for the integer part of the result c (since decimal 13 requires 5 bits) and the next 4 bits are taken for the fractional part (since both the variables a and b have 4 bits for the fractional part).
The main advantage of this representation is that, unlike the floating point representation, each variable has only as many bits as its representation requires, so that integer operators of the optimal precision are instantiated, leading to resource savings.
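The fixed point scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VHDL generated by the compiler: the names `FRAC_BITS`, `int_bits`, `to_fixed`, `from_fixed` and `fixed_mul` are hypothetical, the fractional width of 4 bits follows the Figure 4 discussion, and the operand values 2.5 and 5.25 are chosen here only so that the integer part of the product is 13.

```python
from fractions import Fraction

FRAC_BITS = 4  # assumed: every real variable uses the same number of fractional bits


def int_bits(max_abs_value: int) -> int:
    """Bits for the integer part: magnitude bits plus one sign bit."""
    return max(1, max_abs_value.bit_length()) + 1


def to_fixed(value: float, frac_bits: int = FRAC_BITS) -> int:
    """Scale a real value to an integer with frac_bits fractional bits (truncating toward zero)."""
    return int(Fraction(value) * (1 << frac_bits))


def from_fixed(raw: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert the scaled integer back to a real value."""
    return raw / (1 << frac_bits)


def fixed_mul(a_raw: int, b_raw: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply with integer operators only.

    The sign is the XOR of the operand signs; the magnitudes are multiplied as
    plain integers, and the extra frac_bits produced by the product are dropped.
    """
    negative = (a_raw < 0) != (b_raw < 0)
    magnitude = (abs(a_raw) * abs(b_raw)) >> frac_bits
    return -magnitude if negative else magnitude


a = to_fixed(2.5)    # 40
b = to_fixed(5.25)   # 84
c = fixed_mul(a, b)  # 210, i.e. 13.125 with 4 fractional bits
print(from_fixed(c), int_bits(13))
```

Because the fractional width is the same for every variable, no alignment logic is needed before the integer multiply; only the width of the integer part changes from variable to variable.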
We next present a precision analysis algorithm to determine the minimum number of bits required to represent integer variables and the integer part of real variables. 
PRECISION ANALYSIS
In our representation of variables, the minimum number of bits required to represent integer variables and the integer part of real variables is directly related to the maximum value that the variable attains throughout the program run. Hence, the precision, i.e. the minimum number of bits required to represent the integer part of the variables, can be inferred by value range propagation [28]. We next discuss the value range propagation algorithm to accurately determine the minimum number of bits required for the integer part of the variables.
Algorithm 1:
Up: represents the data range during backward propagation
Down: represents the data range during forward propagation
Actual: represents the actual data range = Up ∩ Down
5. Initialize each of these structures for each variable to < -INT_max, INT_max >.
6. Read in target architecture features from a file so that memory width and address width can be used to optimize the precision of the array elements.
7. do {
8. Traverse the SSA data flow graph in the forward direction and infer the value range of variables in the lhs of an assignment expression.
9. Calculate the data ranges for the variable being calcu-
10. Perform error analysis by finding out the error of the lhs of an assignment expression according to the transformations given in Figure 5.
11. Traverse the SSA data flow graph in the backward direction and infer the value range of variables in the rhs of an assignment expression from similar transformations as in Step 9.
12. } until (none of the data ranges change, or a fixed number of iterations is reached);
13. Change the symbol
Step 1 of the algorithm levelizes the MATLAB AST so that all assignment operations are converted to a three operand format. This helps in formulating a series of transformations, as shown in [5], which can then be applied to these statements to infer the value range. To avoid induction variables used inside loops being type promoted to real numbers, it is necessary to use temporaries as shown in Step 2. Value range propagation is simplified by the assumption that every use of a variable has only one reaching definition. Hence, a dataflow graph with the static single assignment (SSA) property is generated.
Step 3 uses an Array based SSA representation [8] wherein each array element is renamed so that precision inferencing becomes more accurate. We have implemented a forward and backward propagation algorithm to determine the maximum value of each variable. The precision analysis phase ends once the value range of all variables stabilizes. Certain precision information can be derived from the target architecture for which VHDL is generated. For example, the memory of the slave FPGAs on the WildChild board is 16 bits wide and the external memory has 2'' locations.
Step 6 reads in this information from an architecture file and uses it for inferring the precision of address variables and array elements. An added benefit of Value Range Propagation is in optimizations like Constant Propagation [16] and Dead Code Elimination.
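The forward pass of value range propagation can be illustrated with a small interval arithmetic sketch. It is an assumption-laden illustration rather than the compiler's implementation: the `Range` class, the `forward` function and the tuple encoding of three operand statements are hypothetical, only `+`, `-` and `*` are modeled, and the bit count includes the sign bit reserved by our representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed integer value range < lo, hi > of a variable."""
    lo: int
    hi: int

    def __add__(self, other):
        return Range(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Range(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Range(min(corners), max(corners))

    def intersect(self, other):
        """Actual = Up intersected with Down."""
        return Range(max(self.lo, other.lo), min(self.hi, other.hi))

    def bits(self):
        """Magnitude bits for the largest absolute value, plus one sign bit."""
        magnitude = max(abs(self.lo), abs(self.hi))
        return max(1, magnitude.bit_length()) + 1


OPS = {"+": Range.__add__, "-": Range.__sub__, "*": Range.__mul__}


def forward(statements, input_ranges):
    """Propagate ranges through levelized statements (dest, op, src1, src2) in order."""
    ranges = dict(input_ranges)
    for dest, op, src1, src2 in statements:
        ranges[dest] = OPS[op](ranges[src1], ranges[src2])
    return ranges


# Two gray scale pixels in < 0, 255 > feeding a sum and a product.
program = [("t1", "+", "a", "b"), ("t2", "*", "t1", "a")]
result = forward(program, {"a": Range(0, 255), "b": Range(0, 255)})
print(result["t1"], result["t1"].bits())
print(result["t2"], result["t2"].bits())
```

A backward pass would apply the inverse transformations to the right hand side operands and intersect the result with the forward range, which is what `intersect` is meant to suggest.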
ERROR ANALYSIS
Though the value range propagation algorithm in the previous section can determine the minimum number of bits required for the integer part of a real variable, this is not true for the fractional part. This is because a floating point variable can attain innumerable values between two integers. If we use fewer bits to represent the fractional part, we decrease the resolution of the variable, thereby introducing an error in computations. Hence, we require an error analysis phase to determine the tolerable error.
Algorithm for Error Analysis
Step 10 of Algorithm 1 finds the error in the fixed point representation of each variable based on the transformations outlined in Figure 5. Most image processing applications take an image as input and output another, modified image. The actual algorithm performs some floating point operations on these input images to give us the final output image. Hence, the error tolerance in such applications is very high. We can infer the number of resolution bits for real numbers when: [...] Figure 5 to find out the error in the calculation of the intermediate real variables, both because of its representation using fewer bits and because of its computation from other real variables which have errors in their representation. This error is expressed in terms of the number of bits r used in representing the fractional part of the variable. Both pieces of information, namely the tolerable error and the error due to computation using fewer bits, are used to determine the minimum number of bits required to represent the variable. Hence, error analysis gives us the minimum number of bits required to represent the fractional part of the real numbers, while the precision analysis algorithm in the previous section gives us the minimum number of bits required to represent the integer part of the real number.
6. MEMORY PACKING
It is well known that most of the computations in image processing applications involve memory accesses. When such applications are compiled for a system with an external memory, as is true for most commercially available FPGA boards, memory access becomes a performance bottleneck. Hence, reducing the number of external memory accesses could lead to performance gains. An example image processing code for Region Splitting in MATLAB is given in Figure 6. An important observation in this code is that each iteration of the loop makes a memory access which is independent of the other loop iterations. Also, the memory access patterns are uniform.
Most image processing applications that we considered have characteristics similar to Figure 6, namely no loop carried dependence and uniform memory access. If these applications are targeted for execution on commercial FPGA boards with an external memory, such as the WildChild and the WildStar boards from Annapolis Micro Systems, then each memory access could take as long as 3-4 clock cycles on any of these boards. One way of improving the performance is pipelining the memory accesses [12]. Yet another method, which can be implemented on top of pipelining, is packing more than one array element into the same memory location. For example, the WildChild architecture has an external memory which is 32 bits wide for PE0. In Figure 6, if we assume that the images a and b are in a gray scale format and have a value range of < 0, 255 >, then the precision of the images is 8 bits and we can pack up to 4 array elements into one memory location.
In the Region Splitting code shown, since the loop iterations are independent, we can unroll the loop by a factor of 4 so that in each loop iteration there are 4 different array element accesses which share the same physical memory location. Hence, the total number of memory accesses is decreased by a factor of 4, reducing the total number of clock cycles.
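The packing idea can be sketched as follows, assuming 8 bit pixels and a 32 bit memory word as in the PE0 example. The helper names `pack_words`, `unpack_word` and `threshold_packed` are hypothetical, and the threshold test is only a stand-in for the loop body of Figure 6, which is not reproduced here.

```python
def pack_words(pixels, elem_bits=8, word_bits=32):
    """Pack consecutive array elements into memory words, lowest slot first."""
    per_word = word_bits // elem_bits
    mask = (1 << elem_bits) - 1
    words = []
    for start in range(0, len(pixels), per_word):
        word = 0
        for slot, value in enumerate(pixels[start:start + per_word]):
            if not 0 <= value <= mask:
                raise ValueError("element does not fit in elem_bits")
            word |= value << (slot * elem_bits)
        words.append(word)
    return words


def unpack_word(word, elem_bits=8, word_bits=32):
    """Recover all elements stored in one memory word (unused slots read as 0)."""
    per_word = word_bits // elem_bits
    mask = (1 << elem_bits) - 1
    return [(word >> (slot * elem_bits)) & mask for slot in range(per_word)]


def threshold_packed(words, limit):
    """Loop unrolled by the packing factor: one memory read serves 4 elements."""
    reads = 0
    flags = []
    for word in words:
        reads += 1  # a single external memory access
        flags.extend(1 if p > limit else 0 for p in unpack_word(word))
    return flags, reads


image = [10, 200, 30, 255, 0, 128, 64, 90]
memory = pack_words(image)
flags, reads = threshold_packed(memory, limit=100)
print(len(image), "elements read with", reads, "memory accesses")
```

In hardware the unpacking is just a matter of wiring bit slices of the memory word to the unrolled operators, so the saving is in the number of external accesses rather than in any extra computation.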
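Figure 5: Transformations used for error analysis. For the additive operations, $\epsilon_a = \epsilon_b + \epsilon_c$. For multiplication, an additional error $\epsilon$ arises due to rounding or truncation of the $2r$ bits generated on multiplication to $r$ bits.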
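The way error analysis selects the number of fractional bits r can be illustrated with a short sketch. Only the additive rule (the error of a sum is the sum of the operand errors) and the extra truncation term for multiplication come from Figure 5; the first order product bound, the representation error of 2^-r per input, the example computation y = (x1 + x2) * k and all function names are assumptions made for this illustration.

```python
def add_error(err_b, err_c):
    """Error of a = b + c: the operand errors add."""
    return err_b + err_c


def mul_error(max_b, max_c, err_b, err_c, r):
    """Assumed bound for a = b * c with r fractional bits.

    The propagated part is |b|*err_c + |c|*err_b + err_b*err_c; the final term
    models truncating the 2r fractional bits of the product back to r bits.
    """
    propagated = max_b * err_c + max_c * err_b + err_b * err_c
    return propagated + 2.0 ** -r


def output_error(r):
    """Error bound for y = (x1 + x2) * k, with |x1 + x2| <= 2.0 and |k| <= 0.5."""
    q = 2.0 ** -r                 # assumed representation error of each input
    err_sum = add_error(q, q)
    return mul_error(max_b=2.0, max_c=0.5, err_b=err_sum, err_c=q, r=r)


def min_fraction_bits(error_fn, tolerance, max_bits=32):
    """Smallest r whose computed error bound stays within the tolerable error."""
    for r in range(1, max_bits + 1):
        if error_fn(r) <= tolerance:
            return r
    raise ValueError("tolerance not reachable within max_bits")


print(min_fraction_bits(output_error, tolerance=0.01))  # prints 9
```

The search is linear in r because the bound shrinks monotonically as bits are added; a looser tolerance, as is typical for image data, directly translates into narrower operators.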
Figure 7: A typical loop described in MATLAB and its unrolled version (surviving statement of the unrolled loop: a[i+3] = b[i+1] + c[i+2];).
Since most of the images read in from MATLAB are stored in a 2-dimensional array, the precision of the input images is inferred by parsing the input matrices to get the maximum value of the array elements. Figure 7 shows a typical loop described in MATLAB and its unrolled version. As memory packing requires consecutive array accesses across loop iterations, step 8 of Algorithm 2 finds the array access patterns across loop iterations. Since the maximum unroll factor of the loop can be equal to the array PO, we need to find the array access pattern of the first PO iterations of the loop. The unroll factor of each memory access in a loop is defined by the number of array element accesses across loop iterations which lie in the same physical memory location. To minimize the number of memory accesses, step 12 unrolls the loop by the maximum unroll factor. For the unrolled loop in Figure 7, both the arrays a and c require two memory accesses, while array b requires one memory access in a single iteration of the loop. Thus, the total number of memory accesses is reduced by 55 % due to memory packing.
Algorithm 2:
X_i is the array element access in the i-th iteration of the loop.
9. Calculate the set X_i % PO, which is the array element access pattern in the loop.
10. Calculate the maximum unroll factor of the loops so that, in each loop iteration, this particular array access when unrolled leads to only 1 packed memory access.
A detailed description of these algorithms can be found in [26]. For each benchmark, a description of the algorithm in MATLAB was first passed through our compiler without any optimizations to get the unoptimized hardware. Secondly, the algorithm in MATLAB was passed through our compiler with the precision, error and memory packing phases to get the optimized hardware. The output of our compiler was a description of the hardware in VHDL. We used the Synplify tools from Synplicity to get the netlist and the Alliance tools from Xilinx to get the FPGA bit stream for the Xilinx XC4028 FPGA with an external memory on the WildChild(TM) board from Annapolis Micro Systems. Figure 8 shows that, on average, the designs consume about a factor of 5 fewer FPGA resources after our precision analysis phase as compared to the unoptimized hardware. It can be seen that for some benchmarks like IIR Filter, the optimized hardware uses a factor of 9.5 fewer resources than the unoptimized hardware. Further, Figure 2 shows that our manually designed hardware for IIR Filter consumes a factor of 4.7 fewer resources than the unoptimized hardware, which implies that our automated tool generates hardware that is more resource efficient, by almost a factor of 2, than even the manually designed hardware. The reason for this is that, though it is easy to determine the minimum number of bits manually for the input and the output variables even for complex designs, computing the precision for intermediate variables for hardware spanning over 1000 lines of VHDL code is very tedious and error prone. This is because the user has to mimic the precision analysis phase in propagating value ranges throughout the code.
Hence, it can be inferred from Figure 8 that, for large designs, our compiler would generate efficient hardware which would be as good as or better than a manually designed one. Table 2 shows the actual CLBs required and the execution time for some designs after the optimization phases. It can be seen that the execution time of the designs decreases by about 20 % after the precision analysis phase. This is because the number of CLBs required for the logic decreases after this phase, so that the commercial high level synthesis tools can route designs in a more efficient manner, leading to an increased frequency of execution. Figure 9 shows the average reduction in FPGA resources after our combined precision and error analysis algorithm to be a factor of 3.5 as compared to the unoptimized hardware for applications with floating point operations. These savings arise because we were able to use a unified approach of precision and error analysis to determine the minimum number of bits required for real variables. The final savings in FPGA resources would be far more than the number shown. This is because our error analysis phase would also determine the minimum size of the sin, cos, sqrt, log lookup tables so that the error is minimized. For example, without any error analysis, the user would have instantiated a cos lookup [...] Figure 10 shows that, on average, our optimized hardware after memory packing is faster by 35 %. This is because our optimization tries to reduce the total number of accesses to the external memory in the program. For most applications which are easily parallelizable, like Vector Sum, we can get almost a 60 % reduction in the execution time. Column 4 of Table 2 shows that the resources consumed after the memory packing phase go up by almost 50 %. This is because our memory packing algorithm unrolls the MATLAB for loops to extract more parallelism. Hence, there is clearly a resource versus performance tradeoff.
For applications like Sobel Transform, the major part of the algorithm is computed inside a loop with a huge list of statements, which would have been quadrupled if it were unrolled for memory packing. Most high level synthesis tools like Synplify are not able to perform resource sharing optimally in such conditions. Hence, though unrolling would improve the hardware performance, the packing algorithm is applied selectively in the Sobel application, resulting in an improvement of only 8 %. Table 2 shows the details of our experimental results, including the CLB count, the clock frequency of the synthesized design and the execution time of the design on the WildChild board for various benchmark applications, for the hardware without the optimizations, for the optimized hardware after precision and error analysis, for the optimized hardware generated after precision analysis and memory packing, and for the manually generated optimized hardware. It can be seen from Table 2 that the manually generated hardware is better than the hardware generated by the optimizing compiler by almost a factor of 2.7. This is because the manually generated hardware makes use of the fact that the external memory accesses on the WildChild board can be pipelined. Hence, pipelined memory reads and writes take one clock cycle, as compared to three clock cycles for our compiler generated designs. Since a pipelining algorithm [11] can be implemented after memory packing, we expect the designs generated by the compiler, once the pipelining phase has been integrated into the current MATCH framework, to be as good as the manually generated hardware.
8. CONCLUSION
We have presented a framework for generating efficient hardware for image/signal processing applications described in MATLAB. We have proposed a representation of floating point variables which leads to optimal usage of FPGA resources.
Also, we have proposed a precision and error analysis algorithm to generate hardware with an average resource requirement reduced by a factor of 5 as compared to the unoptimized hardware before our analysis. We have also proposed a memory packing algorithm to generate faster hardware requiring an average of 35 % less execution time. We have demonstrated the strength of the optimizing compiler by synthesizing hardware for certain image processing algorithms that is as good as the manually designed hardware in performance and resource needs.
