This report provides a brief summary of the research and development of a compiler that targets a mix of general-purpose processors and adaptive computing processors from MATLAB. It incorporates a list of publications resulting from this research. The objective of the MATCH project was to make it easier for DOD users to develop efficient codes for adaptive computing systems. We have developed a compiler that takes in DOD applications written in a high-level language (MATLAB) and generates efficient low-level code that runs on a distributed environment of commercial-off-the-shelf (COTS) FPGAs, embedded processors, and digital signal processors. The main features of the compiler are: 1. It enables users to reduce the code development time for adaptive applications from weeks using manual approaches to hours using compiler tools. 2. It produces efficient codes that are within a factor of 2-4 of the best manual approach with respect to optimizing resources under performance constraints, or optimizing performance under resource constraints. The project URL is: http://www.ece.nwu.edu/cpdc/Match/Match.html The results of the MATCH compiler have been transferred to a startup company called AccelChip, Inc. (formerly called MACH DESIGN SYSTEMS). The company was founded by two of the PIs of the proposal, Prith Banerjee and Alok Choudhary, and two of the Ph.D. students, Malay Haldar and Anshuman Nayak.
SUMMARY
This final report summarizes the research results obtained during the MATCH compiler project on "A MATLAB Compilation Environment for Adaptive Computing Systems," supported at Northwestern University between March 1998 and August 2001. The objective of the MATCH project was to make it easier for DOD users to develop efficient codes for adaptive computing systems. We have developed a compiler that takes in DOD applications written in a high-level language (MATLAB) and generates efficient low-level code that runs on a distributed environment of commercial-off-the-shelf (COTS) FPGAs, embedded processors, and digital signal processors. The main features of the compiler are: 1. It enables users to reduce the code development time for adaptive applications from weeks using manual approaches to hours using compiler tools. 2. It produces efficient codes that are within a factor of 2-4 of the best manual approach with respect to optimizing resources under performance constraints, or optimizing performance under resource constraints.
INTRODUCTION
Efficient high-level design tools that can map behavioral descriptions of signal and image processing applications to FPGA architectures are one of the key requirements for fully leveraging FPGAs for high-throughput computations and meeting time-to-market pressures. Currently, most FPGA designs are entered at the Register Transfer Level (RTL) in VHDL or Verilog. It is widely recognized that there is a need for design tools at a higher level using languages such as C/C++ or MATLAB. MATLAB is an extremely popular language in the signal and image processing community, with over 500,000 users. A direct synthesis path from MATLAB into hardware would be very useful. The MATCH compiler at Northwestern University takes as input algorithms described in MATLAB and generates RTL VHDL. The RTL VHDL can then be mapped to FPGAs using commercial tools. The input application is mapped to multiple FPGAs by parallelizing the application and embedding computation and synchronization primitives automatically. Our compiler infers the minimum number of bits required to represent the variables through a precision inferencing analysis framework. The compiler can leverage optimized Intellectual Property (IP) cores to enhance the hardware generated. The compiler also exploits parallelism in the input algorithm by pipelining in the presence of resource constraints. We have demonstrated the utility of the compiler by synthesizing hardware for a couple of signal/image processing algorithms and comparing them to manually designed hardware.
MODELS, ASSUMPTIONS AND PROCEDURES
The MATCH project consisted of six research tasks. A brief description of each of the tasks is given below. 
RESULTS AND DISCUSSION
We now report the results of our research under the six tasks in the MATCH project.
Testbed (Task 1):
We have developed a hardware testbed of an adaptive computing system. The testbed consists of: (1) 
Basic Compiler (Task 2):
We have developed a MATCH compiler that takes MATLAB programs as input and produces C programs to be mapped onto the embedded processors and DSP processors, and RTL VHDL to be mapped onto the FPGAs. In addition, the compiler can make calls to library functions that are available on the various targets. An overview of the MATCH compiler is shown in the figure below. As part of the compiler effort we have developed a MATLAB to VHDL compiler, which consists of the following steps. The front-end parses the input MATLAB program and builds a MATLAB AST (Abstract Syntax Tree). The input code may contain directives regarding the types, shapes and precision of arrays that cannot be inferred; these are attached to the AST nodes as annotations. This is followed by a type-shape inference phase. MATLAB variables have no declared type or shape, so the type-shape phase analyzes the input program to infer the type and shape of every variable for which they are not provided by directives. This is followed by a scalarization phase, in which the operations on matrices are expanded into loops. If an optimized library function is available for a particular operation, the operation is not scalarized and the IP core corresponding to the library function is used instead. The scalarized code is then passed through the parallelization phase. The parallelization phase attempts to exploit coarse-grain parallelism either by splitting a loop across multiple FPGAs on the board (data-parallel approach) or by placing different tasks on different FPGAs and pipelining the output of one to the input of another (systolic approach). The parallelization phase relies on communication libraries implemented for the target board to communicate between the different FPGAs. A state machine description in VHDL is then synthesized from the parallelized, scalarized MATLAB code for each of the FPGAs. Most of the hardware-related optimizations are performed on the VHDL AST.
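The scalarization and data-parallel steps described above can be illustrated with a small sketch. It is written in Python rather than MATLAB or VHDL and is not the MATCH implementation: the function names `scalarize_add` and `split_rows` are hypothetical, and the even, contiguous row split is an assumption about how a loop might be divided among FPGAs.

```python
def scalarize_add(a, b):
    """Matrix statement c = a + b expanded into explicit scalar loops."""
    rows, cols = len(a), len(a[0])
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c[i][j] = a[i][j] + b[i][j]
    return c


def split_rows(rows, num_fpgas):
    """Data-parallel split of the outer loop: one contiguous [start, end)
    row range per device (assumed policy, for illustration only)."""
    base, extra = divmod(rows, num_fpgas)
    ranges, start = [], 0
    for k in range(num_fpgas):
        size = base + (1 if k < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    print(scalarize_add([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
    print(split_rows(10, 4))
```

Each range returned by `split_rows` would correspond to the iterations of the outer `i` loop assigned to one FPGA; the systolic alternative would instead assign whole loop nests to different devices.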
A precision inference scheme finds the minimum number of bits required to represent each variable in the AST. The precision information is used in instantiating customized IP blocks corresponding to the functions and operators. Transformations are then performed on the AST to optimize it according to the memory accesses present in the program and characteristics of the external memory. This is followed by a phase to perform optimizations like pipelining under resource constraints that alter parts of the state machine that was constructed earlier. Finally a traversal of the optimized VHDL AST produces the output code. 
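As a rough illustration of precision inference, the sketch below propagates integer value ranges through additions and multiplications and derives the smallest bit width that holds each range. The report does not describe the analysis itself, so simple interval propagation is an assumption here, and the helper names (`bits_for_range`, `add_range`, `mul_range`) are hypothetical.

```python
def bits_for_range(lo, hi):
    """Minimum bit width for integers in [lo, hi].
    Unsigned if lo >= 0, otherwise two's complement."""
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    n = 1
    while not (-(1 << (n - 1)) <= lo and hi <= (1 << (n - 1)) - 1):
        n += 1
    return n


def add_range(x, y):
    """Range of x + y for ranges x = (lo, hi) and y = (lo, hi)."""
    return (x[0] + y[0], x[1] + y[1])


def mul_range(x, y):
    """Range of x * y: the extremes occur at the corner products."""
    corners = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(corners), max(corners))


if __name__ == "__main__":
    pixel = (0, 255)   # 8-bit unsigned input (example value)
    coeff = (-4, 3)    # small signed filter coefficient (example value)
    print(bits_for_range(*add_range(pixel, pixel)))
    print(bits_for_range(*mul_range(pixel, coeff)))
```

Widths derived this way could then size the adders and multipliers instantiated for each operator, instead of defaulting every signal to a fixed word length.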
Automatic Mapping (Task 3):
We have developed automatic algorithms for partitioning and mapping MATLAB programs onto the heterogeneous target. These include algorithms for pipelining, partitioning, allocation of resources, and scheduling of operations on the various platforms to perform time-constrained resource optimization. We have developed a tool called SYMPHANY for automated program partitioning and pipelining. Given a high-level sequential specification of the real-time computation with associated timing constraints (latency and throughput), the tool automatically arrives at a cost-effective solution to the system design problem using embedded processors, DSP processors, and FPGAs. Our algorithm is based on a mixed integer linear programming formulation and uses an off-the-shelf LP solver called "lp_solve". We have applied our tool to the data flow graphs of three synthetic benchmarks and to the graphs for the STAP application and an MPEG decoder. For each benchmark, we studied the solution to the problem for various combinations of throughput and latency constraints. In each case the SYMPHANY tool gave the right solution in terms of the number of pipeline stages used. In most cases its solutions were about 10-20% cheaper, in terms of the cost of the solution in dollars, than a hand-optimized solution. An example set of results on the STAP application is shown below.
Figure 6. Use of the SYMPHANY automated tool on the various benchmarks.
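To make the mapping problem concrete, the following toy sketch chooses a resource type for each task in a chain so that total cost is minimized while a latency bound is met. It is not the SYMPHANY formulation: it replaces the mixed integer linear program with exhaustive search, ignores pipelining and throughput, and every task name, time, and cost in it is an invented example value.

```python
from itertools import product


def cheapest_mapping(tasks, options, max_latency):
    """Return (cost, latency, mapping) for the cheapest feasible assignment,
    or None if no assignment meets the latency bound.

    options[task][resource] = (execution_time, cost)
    """
    best = None
    for choice in product(*(sorted(options[t]) for t in tasks)):
        latency = sum(options[t][r][0] for t, r in zip(tasks, choice))
        cost = sum(options[t][r][1] for t, r in zip(tasks, choice))
        if latency <= max_latency and (best is None or cost < best[0]):
            best = (cost, latency, dict(zip(tasks, choice)))
    return best


if __name__ == "__main__":
    tasks = ["filter", "fft", "detect"]
    options = {  # hypothetical (time, cost) pairs
        "filter": {"dsp": (8, 20), "fpga": (2, 50)},
        "fft": {"dsp": (10, 20), "fpga": (3, 50)},
        "detect": {"cpu": (6, 10), "dsp": (4, 20)},
    }
    print(cheapest_mapping(tasks, options, max_latency=15))
```

Exhaustive search grows exponentially with the number of tasks, which is one reason a solver-based formulation is the more practical choice for real data flow graphs.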
Compiler Directives (Task 4):
We have developed a complete set of directives to specify type, shape, size, precision, data distribution and alignment, task mapping, resource and timing constraints. The compiler recognizes many of these directives. Examples of such directives are:
%!MATCH SHAPE a(100,00)
%!MATCH TARGET WILDCHILD
%!MATCH STREAM
Figure 7 shows an example of a FIR filter code in MATLAB with directives.
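Because the directives are written as MATLAB comments beginning with `%!MATCH`, a front-end can collect them with a simple line scan. The sketch below shows one way this might look in Python. The directive grammar is not specified in this report, so the pattern (a keyword followed by an optional free-form argument) is an assumption, and `parse_directives` is a hypothetical helper, not part of the MATCH compiler.

```python
import re

# Assumed form: "%!MATCH <KEYWORD> [argument text]" on a line of its own.
DIRECTIVE = re.compile(r"^\s*%!MATCH\s+(\w+)\s*(.*?)\s*$")


def parse_directives(source):
    """Return (keyword, argument) pairs for each directive line, in order.
    Ordinary MATLAB code and ordinary comments are ignored."""
    found = []
    for line in source.splitlines():
        match = DIRECTIVE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    program = "\n".join([
        "%!MATCH TARGET WILDCHILD",
        "%!MATCH STREAM",
        "% an ordinary comment",
        "y = a * x;",
    ])
    for keyword, argument in parse_directives(program):
        print(keyword, repr(argument))
```

The argument is kept as raw text; interpreting it (for example, the dimensions in a SHAPE directive) would be a separate step attached to the corresponding AST node.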
Libraries (Task 6):
We have developed various MATLAB libraries on the different platforms. The approach used was to develop each function as a parameterized function of the size of the data, the number of processors or FPGAs used, and the precision of the data (8 bit, 16 bit, 32 bit) for fixed point and floating point representations on three platforms. The platforms are the Annapolis Wildchild FPGA board, the Transtech DSP board and the Motorola embedded processor board. We have completed the development of the following library functions on the Wildchild FPGA board (using RTL VHDL and the commercial synthesis tools, namely Synplicity and the Xilinx XACT place and route tools).
(1) Real matrix addition (2) Real matrix multiplication (3) IIR and FIR filtering (4) One- and two-dimensional FFT.
We have also developed the following library functions on the Transtech DSP board and the Motorola embedded board (using C plus MPI and the native C compilers for the Transtech and the Motorola boards).
(1) Real and complex matrix addition (2) Real and complex matrix multiplication (3) One- and two-dimensional FFT.
Each of these libraries has been developed with a variety of data distributions such as blocked, cyclic and block-cyclic distributions.
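The three data distributions named above differ only in how an element index is assigned to a processor. The sketch below uses the standard textbook definitions for a one-dimensional array; the actual layouts used by the MATCH libraries are not given in this report, so the function names and parameters are illustrative assumptions.

```python
def block_owner(i, n, p):
    """Blocked: n elements in contiguous chunks of ceil(n / p) per processor."""
    block = -(-n // p)  # ceiling division
    return i // block


def cyclic_owner(i, p):
    """Cyclic: elements dealt out one at a time, round-robin."""
    return i % p


def block_cyclic_owner(i, b, p):
    """Block-cyclic: blocks of b elements dealt out round-robin."""
    return (i // b) % p


if __name__ == "__main__":
    n = 8
    print([block_owner(i, n, 4) for i in range(n)])
    print([cyclic_owner(i, 4) for i in range(n)])
    print([block_cyclic_owner(i, 2, 2) for i in range(n)])
```

Blocked layouts keep neighbouring elements together, cyclic layouts balance load when work varies across the index range, and block-cyclic layouts trade off between the two through the block size `b`.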
We have characterized the performance of each of these library functions on various platforms for various data sizes and precision. In each case we have developed C program interfaces to our MATCH compiler so that the programs can be controlled from the host controller (Force board). 
CONCLUSIONS
In conclusion, we have developed the MATCH compiler, which is capable of generating highly optimized hardware from applications described in MATLAB. A set of effective optimizations implemented in the compiler ensures that the quality of the output hardware is comparable to manually optimized hardware. The optimizations include parallelization, precision inferencing, IP core integration and pipelining. The effectiveness of the compiler was demonstrated by synthesizing hardware for a couple of signal/image processing applications. The outputs of the synthesized hardware were functionally verified against the outputs of the MATLAB interpreter. The execution times were almost equivalent to those of manually designed hardware, and in some cases superior where a large amount of parallelism was available across loops. The resource utilizations were within a factor of four of the manual designs. All this was achieved while reducing the design time from months to minutes.
In terms of publications and patents, the MATCH project has:
• Supported 12 graduate students 
