Commercial off-the-shelf computer systems with reconfigurable architectures have recently become available. Some of these machines have architectures, features, and software development environments that seem to make them useful for digital signal processing, especially in electronic warfare and radar signal processing applications. This paper describes experiments to evaluate the architecture, features, software development environment, and performance of one such computer, the SRC Computers model SRC-6e, using an electronic warfare signal processing application.
I. Introduction
THERE are a wide variety of technologies available today for implementing digital signal processing hardware for electronic warfare and radar systems. 1, 2 Options range from off-the-shelf microprocessors at the low end of the performance (and cost) spectrum, through multiple microprocessor systems, dedicated digital signal processing (DSP) chips, and field programmable gate arrays (FPGAs), all the way up to full-custom application-specific integrated circuits (ASICs) at the high end of the performance (and cost) spectrum. Recently, an additional implementation alternative has become available, the commercial-off-the-shelf (COTS) computer system with a reconfigurable architecture. 3, 4 Reconfigurable computing is an attempt to take full advantage of the advances made in FPGA technology, which is a popular method of implementing DSP systems for military applications. 5, 6 Proponents of reconfigurable computing claim such machines have the performance of semi-custom and custom DSP systems implemented with FPGAs and/or a mixture of microprocessors and FPGAs, while at the same time eliminating hardware development costs (and time) and minimizing software development costs. 7, 8 Opponents claim reconfigurable computers suffer from problems which make them marginal for DSP applications and unsuitable for real-time or near-real-time DSP applications, including low I/O bandwidth, unpredictable response times, and immature software development environments.
This paper describes an experiment to evaluate the architecture, features, software development environment, and performance of the SRC Computers model SRC-6e reconfigurable computer. The application selected for this evaluation is an electronic warfare algorithm that synthesizes false target radar images for countering high-resolution imaging inverse synthetic aperture radars (ISARs), such as the U.S. Navy AN/APS-137.
FOUTS, MACKLIN, ZULAICA AND DUREN

II. The SRC-6e Reconfigurable Computer
The SRC-6e computer 9, 10 is considered to be an entry-level system by its manufacturer, SRC Computers, Inc. of Colorado Springs, CO. Within the SRC product line, it has the fewest processors, the fewest reconfigurable processors, and the smallest amount of memory. However, the overall system architecture of the SRC-6e is similar to that of the larger and more expensive models from SRC, and the software development environment is the same.
The overall system architecture of the SRC-6e can be seen in Fig. 1. Where feasible, SRC has leveraged commodity computing components in order to reduce cost and development time. Referring to Fig. 1, the two PCs are COTS dual-processor machines, each having two 1000 MHz Intel Xeon processors, 1.5 Gbytes of memory, and a 100 Mbits/sec network interface. Each computer operates as an independent but cooperating computational node, with inter-processor communications occurring via the network.
As indicated in Fig. 1, each PC is also connected to a MAP via a Snap port. The MAP is the reconfigurable part of the architecture, the acronym being short for multi-adaptive processor. Each MAP has 3 Xilinx Virtex-II FPGAs, 11 as indicated in Fig. 2.
Fig. 2 Architecture of multi adaptive processor (MAP).
Two of the FPGAs are for user-defined logic while the third is dedicated for use as the MAP controller. Each MAP is also equipped with 24 Mbytes of RAM that is divided into 6 banks of 4 Mbytes each. The bandwidth between the user FPGAs and each memory bank is 800 Mbytes/sec, yielding a peak bandwidth to the on-board memory of 4800 Mbytes/sec. The Snap port is 64 bits wide and has a bandwidth of 315 Mbytes/sec. It connects the MAP to the PC via the memory bus in the PC. In fact, the Snap port plugs into an available memory slot in the PC that could otherwise be used for more memory. This architecture allows data transfers into and out of the MAP at the maximum speed the memory bus can support, while at the same time eliminating any requirements to modify the COTS PCs.
MAPs are interconnected with each other via the Chain ports, which have a maximum bandwidth of 800 Mbytes/sec. Chain port usage is defined and controlled by user applications running in the FPGAs. For high-bandwidth I/O operations, such as real-time electronic warfare applications, the chain ports can be broken at any point and connected to other devices, such as an analog-to-digital converter (ADC), digital-to-analog converter (DAC), or digital radio frequency memory (DRFM). In fact, one of the main goals of the research project described here was to test and evaluate the performance of the MAP to determine if it would be worth the effort to design a hardware interface to the chain ports to allow digitized signals to be directly input to the MAPs at a high transfer rate.
Both the PCs in the SRC-6e operate under Red Hat Linux. In fact, if one did not want to utilize the MAPs, the SRC-6e would look like just another PC/Linux cluster, although the version of Linux distributed with the SRC-6e has been augmented to provide the device drivers, compilers, debuggers, and other software necessary to utilize the MAPs. Compilers are available for C and Fortran. It is the opinion of the authors that these compilers, together with the more or less standard Unix/Linux software development environment, create the most sophisticated software development environment available today for computers with reconfigurable architectures. Most commercially available reconfigurable computers are programmed by calling canned procedures out of a library provided by the manufacturer. If a needed procedure is not available the developer must either wait for the vendor to provide it in a future release or develop the required procedure on their own. For all commercially available machines except the SRC, this means programming in VHDL, Verilog, or some similar hardware description language. Furthermore, the software development task is heavily dependent on the architecture of the reconfigurable hardware. This makes software development more of a hardware design task than a programming task. With the SRC approach, developing software truly is programming and it is decoupled from the architecture of the hardware. The available C and Fortran compilers perform a large number of code optimizations, including loop unrolling and the conversion of multiple independent statements into parallel hardware. The compiler also generates all the required hardware and software interfaces. Between the C and Fortran compilers and Linux, the SRC software development environment is as close as possible to programming a COTS PC running Linux.
Developing software for the SRC-6e requires the programmer to explicitly declare what part of the algorithm being coded should execute on a MAP and what part should execute on a Xeon processor. 12 In fact, the code that is to run on the MAP is even placed in a different file than the code that is to run on a Xeon processor. Presumably, the code that runs on the MAP is the part of the algorithm that requires the most execution time when running on a platform with a more conventional architecture. If one is not sure which part of an application requires the most execution time, the available Linux execution profiler can be used. Other execution profilers are available from other sources, such as GNU gprof. The execution profiler was not utilized during this research because of the familiarity of the algorithms being implemented. In addition to the standard execution profiler that comes with Linux, the SRC software development environment provides a graphical tool for displaying and tracing out the internal data flow graph created by the MAP C compiler. Most C compilers with optimization capability generate an internal data flow graph but do not provide a means for the software developer to access it. With the SRC environment, the graph is stored in a disk file at the end of compilation. Analysis of the graph with the tool provided by SRC not only helps to identify the frequently executed inner loops but was also found to be a useful debugging tool during the course of this research.
Both the Xeon processors and the MAPs can be programmed in either C or Fortran. The MAPs can also be programmed with the hardware description languages Verilog and VHDL. MAP programming can also be accomplished with any schematic diagram editor capable of editing logic-level schematic diagrams and generating EDIF output files. It is also possible to run intellectual property (IP) cores for the Xilinx Virtex-II FPGA on the MAPs, so long as an appropriate interface to the IP core can be written for execution on the Xeon processors using either C or Fortran. Essentially, the compilation script sends the code for the MAPs to Synplify Pro, 13 which is an FPGA place-and-route tool created by Synplicity, Inc. Therefore, any format that Synplify Pro can read, or that can be translated to a format Synplify Pro can read, can be used to program the MAPs. This provides a large amount of flexibility in developing software for the MAPs.
Program execution on the SRC-6e initiates on one or more of the Xeon processors. When a procedure is called that is to be accelerated using one or more MAPs, the required number of MAPs are allocated, the MAPs are programmed, input data to the procedure is copied from the common memory in the PCs to the on-board memory in the MAPs, and finally, execution flow is transferred to the MAPs. At this point, execution of the application in the PCs can be suspended, waiting for an interrupt from the selected MAPs indicating that execution in the MAPs has completed. After the MAPs have completed their task, output data from the MAP procedure is copied from the on-board memory to the common memory, the MAPs are deallocated, and execution flow returns to the PCs. It should be noted that after execution flow has transferred from the PC to the MAP, the PC does not have to sit idle. If another execution thread, or even another task, is available, the PC can continue to be productive. Furthermore, multiple tasks can be executed at the same time if they are allocated to different PCs and different MAPs. Also, different threads can even be executed on the same MAP if they are allocated to different FPGAs.
The job of copying input data from the common memory to the on-board memory and copying result data from the on-board memory to the common memory is important to the proper functioning of the SRC-6e, as well as to performance. Therefore, the allocation of specific data structures in the software to specific memory banks in the on-board memory is left to the applications developer. However, the software environment does provide a variety of functions to make this task easier and more efficient. In addition to the block copy functions, functions are also available that automatically stripe arrays across multiple banks in the on-board memory. If elements of an array are striped across multiple memory banks, then multiple read accesses to the same array can be performed on the same clock cycle. The available functions allow user control over the stride of a striped array. Other available functions that help optimize data transfer operations include streaming functions. Streaming allows data to be transferred straight from the common memory into one of the user-programmable FPGAs without first being stored in the on-board memory. Streaming also allows two or more FPGAs within a MAP, and even FPGAs in different MAPs, to communicate with each other without having to go through on-board memory.
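The effect of striping can be illustrated with a small model. The sketch below is not the SRC library interface; the function name `stripe` and the interpretation of the stride as the number of consecutive elements kept together in one bank are assumptions made only for illustration. It maps an array index to a bank and an offset in round-robin order across the 6 on-board memory banks.

```python
NUM_BANKS = 6  # the MAP on-board memory has 6 banks (from the text)


def stripe(index, stride=1, num_banks=NUM_BANKS):
    """Map an array index to (bank, offset) under round-robin striping.

    Hypothetical model: `stride` consecutive elements stay in one bank
    before the next bank is used.
    """
    block = index // stride                  # which group of `stride` elements
    bank = block % num_banks                 # groups rotate across the banks
    offset = (block // num_banks) * stride + index % stride
    return bank, offset


# With stride 1, six consecutive elements fall in six different banks,
# so all six could be read on the same clock cycle.
banks_used = [stripe(i)[0] for i in range(6)]
print(banks_used)
```

With a stride of 1, any six consecutive elements sit in distinct banks, which is the property that permits multiple reads of the same array in one clock cycle.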
III. Phase Extraction Using the CORDIC Algorithm
The electronic warfare DSP application that was selected to be the benchmark program for this research is the digital synthesis of large false-target radar images for countering high-resolution imaging inverse synthetic aperture radars (ISARs), 14 such as the U.S. Navy AN/APS-137. To synthesize an appropriate false-target image, the signal from an interrogating radar must first be intercepted, digitized, and stored in a high-speed digital radio frequency memory (DRFM). 15 Most commercially-available DRFMs use an in-phase/quadrature (I/Q) format to encode and store a signal, rather than storing separate phase and amplitude information for each sample of the intercepted signal. However, the most practical image synthesis algorithms available require pure phase information. Therefore, a method is needed to extract the phase information from a signal in the I/Q format.
The obvious method to extract the phase information from a signal in the I/Q format is to use Eq. (1).

$$\phi = \arctan\left(\frac{Q}{I}\right) \qquad (1)$$
However, a direct implementation of this equation is not practical because of the amount of time required to perform both the division and the arctan function. It needs to be kept in mind that with a typical EW system, a new phase value will need to be extracted from a new I/Q pair about every 2 ns. Fortunately, the CORDIC algorithm can be used to perform both the division and the arctan operations at the same time. 16 The CORDIC algorithm is a well-known and extensively studied successive approximation algorithm that can be implemented either iteratively or recursively. The entire algorithm relies on addition, subtraction, shifting, and comparisons, and there is no need for multiplication, division, direct trigonometric operations, or the evaluation of series or polynomials. Furthermore, the accuracy of the result can be increased or decreased, as needed, by controlling the number of iterations through the algorithm.
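A minimal sketch of an iterative, vectoring-mode CORDIC phase extractor is given below to make the shift-and-add structure concrete. It is not the code used in these experiments. The iteration count, the fixed-point angle scale, the guard bits, and the initial 180-degree pre-rotation for samples with negative I are all assumptions chosen for illustration. The arctangent table is computed once with floating-point arithmetic and stands in for a constant ROM.

```python
import math

ITERATIONS = 16       # assumed; more iterations give a finer phase result
ANGLE_SCALE = 1 << 16 # assumed fixed-point scale: angle in degrees * 65536
GUARD_BITS = 16       # assumed extra low-order bits so shifts keep precision

# Constant table of atan(2^-k) in scaled degrees (a ROM in hardware).
ATAN_TABLE = [round(math.degrees(math.atan(2.0 ** -k)) * ANGLE_SCALE)
              for k in range(ITERATIONS)]


def cordic_phase(i, q):
    """Return the phase of integer sample (i, q) in degrees, in [0, 360)."""
    if i == 0 and q == 0:
        return 0.0                       # phase undefined; return 0 by convention
    x, y, angle = i, q, 0
    if x < 0:                            # rotate by 180 degrees into the right half-plane
        x, y, angle = -x, -y, 180 * ANGLE_SCALE
    x <<= GUARD_BITS
    y <<= GUARD_BITS
    for k in range(ITERATIONS):
        if y > 0:                        # rotate clockwise to drive y toward 0
            x, y = x + (y >> k), y - (x >> k)
            angle += ATAN_TABLE[k]
        else:                            # rotate counter-clockwise
            x, y = x - (y >> k), y + (x >> k)
            angle -= ATAN_TABLE[k]
    return (angle / ANGLE_SCALE) % 360.0


print(cordic_phase(1000, 1000))          # close to 45 degrees
```

Only additions, subtractions, shifts, and comparisons appear inside the loop, and unrolling the loop gives one pipeline stage per iteration, which is the form suited to an FPGA.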
The first version of the CORDIC algorithm developed to test the SRC-6e for extracting phase information from a signal in the I/Q format was written in the C programming language. The main body of the program looped through an array of 256 K I/Q pairs and generated 256 K phase values. This part of the program was compiled twice, once to run on the Intel Xeon processors and once to run on the reconfigurable processors. Both versions of the CORDIC procedure were supported by identical code running on a Xeon processor. The support code generated the I/Q pairs, kept the CORDIC procedure fed with data, stored the result data, and tested the phase calculations for correctness.
The results of the first set of tests can be seen in Table 1.
The version of the CORDIC algorithm running on the Xeon processor took 10.5 seconds while the version running on the MAP took only 9 seconds. Although this represents a performance improvement, the speedup factor is only 1.17, which is somewhat disappointing. However, while reviewing these results it was learned that the interface between the MAP and the on-board memory (OBM) was being used inefficiently. Specifically, only one word of data can be read out of a single memory bank on each clock cycle. Thus, two clock cycles were required by the MAP to read each I/Q pair out of the OBM because the I and Q values were being stored at two different locations in the same array and the entire array was being allocated to a single OBM bank. Unfortunately, the SRC software development environment was not completely decoupling the programmer from the hardware architecture. Another version of the CORDIC algorithm was written in C that stored the I and Q values in separate arrays, which were then allocated to separate memory banks in the MAP, allowing both I and Q to be read from memory at the same time. The result of this experiment can also be seen in Table 1. Execution time dropped to 8.5 seconds, indicating a further improvement, although still not as much as was expected.
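The bank conflict can be expressed as a one-line model. The sketch below assumes only what the text states, that each memory bank delivers one word per clock cycle. The function name and the bank labels are hypothetical.

```python
from collections import Counter


def read_cycles(operand_banks):
    """Clocks needed to fetch one set of operands when each bank
    supplies at most one word per clock cycle."""
    return max(Counter(operand_banks).values())


# I and Q interleaved in one array that lands in a single bank: the reads collide.
interleaved = read_cycles(["bank_a", "bank_a"])
# I and Q in separate arrays placed in different banks: both arrive together.
separate = read_cycles(["bank_a", "bank_b"])
print(interleaved, separate)
```

Under this model the interleaved layout needs two clocks per I/Q pair and the split layout needs one, matching the behavior described above.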
Further examination of the results indicated the code was not using the Snap port efficiently. Specifically, data transfers over the Snap port are always 64 bits wide. Thus, transferring 32-bit values for I and Q and 32-bit values for the phase results was wasting approximately half the Snap port bandwidth. In response to this problem, data packing and unpacking routines were written for both the Xeon processor and the MAP. When execution flow transitions from the Xeon processor to the MAP, data is packed before being sent over the Snap port. When it is received by the MAP, the data is unpacked before the CORDIC algorithm executes. After the CORDIC algorithm executes, the results are packed by the MAP before being sent over the Snap port to the Xeon processor. The results are then unpacked by the Xeon processor before checking and storage. The execution of this version of the code indicated that the time required to accomplish all the packing and unpacking is significantly greater than the time saved by transferring packed data across the Snap port. Referring to Table 1, execution time increased to 16.4 seconds, another disappointing result. Clearly, a more detailed analysis was needed of where the MAP procedure was spending its execution time.
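The packing step can be sketched as follows. The layout, with I in the upper 32 bits and Q in the lower 32 bits of each 64-bit word, and the function names are assumptions for illustration; the layout actually used in the experiments is not stated.

```python
MASK32 = 0xFFFFFFFF


def pack_iq(i, q):
    """Pack two signed 32-bit values into one unsigned 64-bit word.
    Assumed layout: I in the upper half, Q in the lower half."""
    return ((i & MASK32) << 32) | (q & MASK32)


def unpack_iq(word):
    """Recover the signed 32-bit (I, Q) pair from a packed 64-bit word."""
    def to_signed(value):
        # Reinterpret a 32-bit pattern as a two's-complement integer.
        return value - (1 << 32) if value & 0x80000000 else value
    return to_signed((word >> 32) & MASK32), to_signed(word & MASK32)


word = pack_iq(-5, 7)
print(hex(word), unpack_iq(word))
```

Each packed word carries a full I/Q pair, so the number of Snap port transfers is halved, but every sample now passes through extra mask and shift operations on both sides of the port.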
The process of compiling C code for execution on the MAP requires the C code to be translated to a data flow graph intermediate format. Then, off-the-shelf Synplicity FPGA place-and-route software is used to translate the intermediate format into FPGA circuitry. The end result is that a dedicated hardware pipeline is created in the FPGA when a program is loaded for execution. This pipeline reads input data from the OBM, performs the required operations, and stores the results back into the OBM. The clock speed for the dedicated hardware pipeline that is created inside the user-programmable FPGAs is fixed in the hardware at 100 MHz. Thus, if a pipeline can be created that is capable of generating a result on every clock cycle, which is not an unreasonable expectation for an FPGA with the capabilities and logic density of the Virtex-II, then the execution time of a procedure running on the MAP can be described by Eq. (2).
In Eq. (2), $T_{EX}$ is the procedure execution time, $T_{MOH}$ is the MAP overhead time, $T_{CL}$ is the clock period, NPS is the number of pipeline stages, and NOS is the number of samples being processed. Fortunately, the suite of software tools from SRC allows the applications developer to actually measure some of these parameters, while other parameters can be obtained in other ways.
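As a rough worked example, the sketch below evaluates a model of the form $T_{EX} = T_{MOH} + T_{CL}(NPS + NOS)$. This form is an assumption inferred from the parameter definitions above (a pipeline that must first fill and then delivers one result per clock) and should not be read as the exact published equation. The 100 MHz clock and the sample count come from the text; the stage counts of 112 and 68 are illustrative inputs and the function name is hypothetical.

```python
CLOCK_HZ = 100e6  # user-logic clock of the MAP, from the text


def map_exec_time(n_samples, n_stages, overhead_s=0.0, clock_hz=CLOCK_HZ):
    """Assumed model: overhead plus one clock per pipeline stage (fill)
    plus one clock per sample (steady state)."""
    return overhead_s + (n_stages + n_samples) / clock_hz


N_SAMPLES = 256 * 1024            # 256 K samples
slow = map_exec_time(N_SAMPLES, 112)
fast = map_exec_time(N_SAMPLES, 68)
print(f"{slow * 1e3:.5f} ms, {fast * 1e3:.5f} ms, "
      f"difference {(slow - fast) * 1e6:.2f} us")
```

With the overhead term set to zero, both cases come to about 2.62 ms, and a difference of 44 pipeline stages changes the result by well under a microsecond.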
The MAP procedure that distributes the I and Q data between two different data arrays but that does not attempt to use data packing yielded the best execution time in Table 1. Therefore, this procedure was selected for optimization by hand. Furthermore, in an attempt to gain additional improvements in performance, a CORDIC core generation tool was downloaded from Xilinx. 17 This tool is specifically designed to generate efficient core logic on Xilinx FPGAs for implementing CORDIC algorithms. Three different versions of the CORDIC algorithm were created with the Xilinx core generation tool. Table 2 lists the characteristics and performance metrics of the 3 different versions, along with the optimized version of the CORDIC procedure that was written for the MAP in C.
Referring to Table 2 , the internal precision for each version of the algorithm is shown in column 2. The C version of the CORDIC algorithm is most like the "Core 2" version generated with the Xilinx core generation tool. In column 3, the number of pipeline stages required to implement just the CORDIC algorithm is given. For the versions of the algorithm that were created with the core generation tool, this parameter was obtained from the Xilinx software. This information is not available from the SRC applications development tools and thus is not available for the version of the algorithm written in C. Column 4 shows the total number of pipeline stages required to implement the entire MAP procedure, including communications, control, and other overhead processing. This information comes from the available applications development tools and is available for all versions of the procedure. For the 3 versions of the algorithm created using the core generation tool, there is a significant increase in the number of pipeline stages between columns 3 and 4. Clearly, there is a significant amount of processing, which requires a lot of logic and pipeline stages, associated with MAP communications, control, and overhead. This implies that the best algorithms to program on the MAP are algorithms that do a large amount of processing in the MAP each time a MAP procedure is called. It is also interesting to compare the number of pipeline stages indicated in column 4 for the C version of the CORDIC algorithm against the version labeled "Core 2". As mentioned previously, these versions have the same internal precision, yet the C version requires 1.6 times more pipeline stages. Clearly, there is some inefficiency in the SRC process that compiles C code for execution on the MAP. The larger number of pipeline stages results in a longer pipeline load delay.
Another important performance parameter for reconfigurable computer systems is the amount of logic utilized in the FPGAs to implement the desired algorithm. The more efficient the hardware implementation the larger an algorithm can be. Even if a given algorithm is not particularly complex, having an efficient and compact MAP implementation is desirable because it allows iterative loops to be unrolled and executed in parallel, thus further accelerating MAP procedure execution. An efficient MAP implementation of an algorithm will also reduce the pipeline load delay, although it will be seen shortly that this is not a major concern. Column 5 of Table 2 indicates the number of logic slices utilized inside the Xilinx Virtex-II FPGA to implement just the CORDIC algorithm for each of the 3 versions of the procedure that were generated with the Xilinx core generation tool. This data was obtained from the core generation tool and is not available from the SRC applications development tools. Thus, this data is not available for the version of the procedure written in C. Column 6 indicates the total number of logic slices required to implement the entire MAP procedure. It is interesting to compare the data in column 5 with the data in column 6. It is apparent that a large number of logic slices are required to implement the communications, control, and overhead functions. It is also interesting to compare the data in column 6 for the C version of the algorithm with the data in column 6 for the "Core 2" version of the algorithm created with the core generation tool. The previously mentioned inefficiency of the compilation process for the C language is apparent.
Ultimately, the most important parameter shown in Table 2 is the total execution time of the MAP procedure, which is shown in column 7. The execution times are identical for all 4 MAP procedures, within measurement accuracy limitations. There are two reasons for this. First, the communications, control, and other overhead processing are approximately the same for all procedures. Second, the amount of time required to process all 256 K samples is the same because all versions of the CORDIC algorithm produce one result on every clock. The only difference between the MAP procedures is the pipeline load delay, which ranges from a low of 68 clocks for "Core 1" and "Core 2" to a high of 112 clocks for the procedure written in C. However, with 256 K clocks required to process the 256 K samples, the pipeline load delay is such a small fraction of the overall execution time that the difference in load delays between the 4 versions of the procedure is barely noticeable.
The SRC run time environment includes a debug option that creates a text-based log of activities within the MAP. Using this tool, it is possible to measure the amount of time required for various overhead activities associated with using the MAP. Allocation of the MAP can take as little as 0.3 seconds or as long as 3 seconds for a heavily loaded machine, with 0.5 seconds being typically required. The allocation time is not dependent on the amount of logic utilized in the FPGA. Initialization of the FPGA, including programming the FPGA, takes approximately 0.1 seconds. The time required to transfer the input data from the common memory to the on-board memory is dependent on the amount of data, the data rate being the previously stated 315 MB/s for 64-bit words. The time required to actually execute the algorithm coded into the FPGA is dependent on the size of the data set and the pipeline latency, as quantified in Eq. (2). The time required to transfer the result data from the on-board memory to the common memory is dependent on the amount of data, the data rate being 195 MB/s for 64-bit words, noticeably slower than the data rate for transferring data into the MAP. To deallocate the MAP, approximately one second is required. There is also a small amount of miscellaneous overhead. Ultimately, the cost of accessing the MAP makes the SRC-6e most attractive for applications where, once processing has transferred to the MAP, processing remains in the MAP for a long enough time to amortize the time cost of accessing the MAP over a larger number of computations. However, it should be noted that MAP allocation and deallocation only need to occur when a process first needs to use a MAP and when it is done using a MAP. Thus, the MAP allocate and deallocate time can often be amortized across many calls to the MAP from the same process. Also, if an application calls the same MAP function over and over again, the FPGA does not need to be initialized on every call because it was already programmed to perform the desired task on the first call. Finally, it should be noted that the SRC-6e is an entry-level, first-generation machine. SRC is now producing what it considers to be third-generation machines that have greatly improved bandwidth between the common memory and the on-board memory in both directions, as well as decreased MAP allocation and deallocation times.
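The amortization argument can be made concrete with a simple budget. The sketch below uses the typical figures quoted above (0.5 s allocation, 0.1 s initialization, about 1 s deallocation, 315 MB/s into the MAP, 195 MB/s out). It assumes that 1 MB is 10^6 bytes, that the one-time costs are paid once per process, and that miscellaneous overhead is ignored; the function name and the example byte counts are hypothetical.

```python
def per_call_time(bytes_in, bytes_out, exec_s, n_calls=1,
                  alloc_s=0.5, init_s=0.1, dealloc_s=1.0,
                  in_rate=315e6, out_rate=195e6):
    """Average time per MAP call when allocation, FPGA initialization,
    and deallocation are paid once and shared by `n_calls` calls."""
    one_time = alloc_s + init_s + dealloc_s
    recurring = bytes_in / in_rate + exec_s + bytes_out / out_rate
    return one_time / n_calls + recurring


# Example inputs (assumed): 2 MiB in, 1 MiB out, 2.6 ms of pipeline time.
single = per_call_time(2 * 1024 * 1024, 1024 * 1024, 2.6e-3)
many = per_call_time(2 * 1024 * 1024, 1024 * 1024, 2.6e-3, n_calls=1000)
print(f"one call: {single:.3f} s, averaged over 1000 calls: {many:.4f} s")
```

For a single call the fixed costs of roughly 1.6 s dominate everything else; spread over many calls to the same MAP function, the per-call cost approaches the transfer and pipeline time alone.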
IV. False Target Radar Image Synthesis
Once the phase information has been extracted from a sample of the intercepted interrogating radar signal, it can be used to synthesize part of a false-target image, which, when integrated by the radar receiver with image components synthesized from other samples, will cause the radar to see a complete image of the desired false target. The complete false target synthesis algorithm and its analysis are described in other publications, 18 thus the description provided here is brief. The first step is to divide up the false target into sections, as illustrated in Fig. 3.
Fig. 3 Synthesized false target image.
Each section of the false target is assumed to be at a different distance from the interrogating radar. Therefore, each section of the false target is allocated to a different range bin.
Within each range bin, two tasks must be accomplished for every sample of the intercepted radar signal in order to synthesize an output signal. First, the phase of the sample must be rotated to account for the fact that each range bin is at a different distance from the interrogating radar and thus the phase of the synthesized false target signal will be different for each range bin. Second, different parts of the false target will have a different radar cross section. Therefore, each range bin needs to synthesize a signal with an amplitude that corresponds to the radar cross section of the part of the false target allocated to that range bin.
The false-target image synthesis algorithm can be quantified as shown in Eqs. (3) and (4).
Referring to Eqs. (3) and (4), I(n) represents the nth in-phase (I) component of the synthesized output signal that results from the nth sample of the intercepted signal, while Q(n) represents the corresponding quadrature (Q) component. E is the extent of the target, or the number of range bins that contain a part of the false target. Thus, the summation operations combine the I and Q outputs from each range bin for each input sample. $A_i$ represents the magnitude of the synthesized output signal from each range bin, which is dependent on the radar cross section of the part of the false target allocated to each range bin. The cos and sin operations generate the I and Q components once the phase of the output signal from each range bin has been calculated. $\phi(n - i)$ represents the phase value extracted from the original sample of the intercepted signal, while $\Delta\phi_i$ represents the phase rotation that must be added to the original phase value in each range bin to account for the different ranges between the different range bins and the interrogating radar.
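A form of Eqs. (3) and (4) consistent with the symbol definitions above can be written as follows. The summation limits, with range bins indexed from 0 to E − 1, are an assumption; the definitions fix only the terms inside the sums.

```latex
\begin{align}
I(n) &= \sum_{i=0}^{E-1} A_i \cos\bigl(\phi(n-i) + \Delta\phi_i\bigr) \tag{3} \\
Q(n) &= \sum_{i=0}^{E-1} A_i \sin\bigl(\phi(n-i) + \Delta\phi_i\bigr) \tag{4}
\end{align}
```

Each term is the contribution of one range bin: the delayed phase sample is rotated by that bin's phase increment, converted to I and Q, and scaled by that bin's magnitude.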
To synthesize the required output signal in real time, a separate processor is required to implement the calculations performed within each range bin. 19 These processors, known as range bin processors or RBPs, have a custom architecture that is dedicated to the described algorithm. A block diagram of the architecture is shown in Fig. 4.
Referring to Fig. 4, the electronic warfare system controller programs each RBP with appropriate phase rotation and gain coefficients before an interrogating radar signal is intercepted. When a signal is intercepted and sampled, the extracted phase information is fed into the phase rotation adder in the different RBPs, as indicated at the top of Fig. 4. The phase rotation is accomplished using a modulo 360 (degrees) adder because phase rotation is a cyclic function. For example, if the incoming phase is 350 degrees and the phase increment value is 30 degrees, adder overflow is ignored and the sum is expressed as 20 degrees. The output of the phase rotation adder is sent to a Sine/Cosine lookup table ROM to generate the corresponding I and Q components. The I and Q components are then scaled by the desired amount of gain using multipliers. However, the gain is restricted to values of the form $2^n$, where n ranges from 0 to 10. This allows the multiplication to be accomplished at high speed using arithmetic shifting and does not require any addition operations or adder hardware. After the scaling operations, the I and Q components are summed with the I and Q components from the other RBPs using the summation adders shown at the bottom of Fig. 4. To maintain a high clock speed and to maximize throughput, the entire RBP is pipelined with 4 stages of pipeline registers, as illustrated in Fig. 4. A more detailed analysis of the range bin processor architecture is available. 19
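The data path of a range bin processor can be modeled in software as shown below. This is a behavioral sketch, not the VHDL design. The 5-bit phase word (so that ignoring adder overflow implements the modulo operation), the lookup table amplitude of 127, and the treatment of samples before the start of the pulse as contributing nothing are all assumptions for illustration. Pipeline registers are not modeled.

```python
import math

PHASE_BITS = 5                     # assumed phase word width; full scale = 360 degrees
PHASE_STEPS = 1 << PHASE_BITS
LUT_AMPLITUDE = 127                # assumed signed 8-bit lookup table entries

COS_LUT = [round(LUT_AMPLITUDE * math.cos(2 * math.pi * p / PHASE_STEPS))
           for p in range(PHASE_STEPS)]
SIN_LUT = [round(LUT_AMPLITUDE * math.sin(2 * math.pi * p / PHASE_STEPS))
           for p in range(PHASE_STEPS)]


def range_bin(phase, phase_inc, gain_shift):
    """One RBP: modulo phase adder, sine/cosine lookup, power-of-two gain."""
    rotated = (phase + phase_inc) & (PHASE_STEPS - 1)   # adder overflow is ignored
    return COS_LUT[rotated] << gain_shift, SIN_LUT[rotated] << gain_shift


def synthesize(phases, phase_incs, gain_shifts):
    """Sum the outputs of all RBPs; bin i operates on sample n - i."""
    output = []
    for n in range(len(phases)):
        i_sum = q_sum = 0
        for i, (inc, shift) in enumerate(zip(phase_incs, gain_shifts)):
            if n - i >= 0:                              # no sample has reached this bin yet otherwise
                i_part, q_part = range_bin(phases[n - i], inc, shift)
                i_sum += i_part
                q_sum += q_part
        output.append((i_sum, q_sum))
    return output


print(synthesize([0, 8], [0, 0], [0, 0]))
```

The left shift stands in for the gain multiplier, so a gain of $2^n$ costs no adder hardware, and the bitwise mask plays the role of the modulo adder in the text's 350 + 30 = 20 degree example.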
V. SRC-6e Performance on Image Synthesis Algorithm
To synthesize a false-target image of a typical U.S. Navy ship with enough resolution to fool a modern, highresolution, inverse synthetic aperture imaging radar, at least 512 range bin processors are required. Traditionally, this would require ASIC technology. 20 However, with the availability of two MAPS in the SRC-6e and with each MAP having two Xilinx Virtex-II FPGAs, a single SRC-6e should be capable of synthesizing a complete false-target image. However, this requires using the logic in the FPGAs very efficiently. Therefore, for this experiment, a macro was created for the MAP using the hardware description language VHDL, which allows direct control over how the logic cells in the FPGAs are programmed. The entire image synthesis algorithm was coded into the macro for execution on the MAP, with supporting functions written in C for execution on the Linux PC part of the machine. Support functions include programming the different range bin processors with the desired phase rotation and gain values, generation of the phase samples, and checking of the synthesized signal to confirm correct execution of the algorithm.
The initial experiment to program the false-target radar image synthesis algorithm into the MAP utilized 4 RBPs programmed into a single FPGA. 21 The results are shown in Fig. 5, which plots execution time as a function of the number of phase samples processed.
As expected, the MAP executed the false-target image synthesis algorithm extremely quickly, as can be seen by the diamond-marked plot in Fig. 5. After having completed the experiments described in Section III of this paper, it was no surprise that the total execution time of the MAP macro was significantly greater than the actual time required to execute the image synthesis algorithm on the MAP. The total macro execution time is shown in Fig. 5 by the square-marked plot. The difference between these two curves is the time it takes to allocate the MAP, program the FPGAs, transfer the input phase values from the common memory in the PC to the on-board memory in the MAP, transfer the result data from the MAP back to the PC, and deallocate the MAP.
As a basis for comparison, another version of the image synthesis algorithm was written entirely in the C programming language. This version could be configured to emulate 1 to 512 RBPs in software. It was compiled to run on two different platforms: a 3 GHz Windows PC with a Pentium-4 processor and a 1 GHz Linux PC with a Xeon processor, which was essentially one of the computers in the SRC-6e system utilizing only one of the Xeon processors and with the MAP disabled. These benchmarks were then configured to emulate 4 RBPs and performance measurements were taken. The results are also plotted in Fig. 5. The triangle-marked plot is for the Windows PC and the plot marked with crosses is for the Linux PC. Obviously, for this small number of RBPs, it would be faster to just use an off-the-shelf Windows or Linux PC.
Before attempting to increase the number of RBPs implemented on a single FPGA, the VHDL code for the RBP was rewritten. 22 Special attention was paid to design and implementation efficiency, especially with respect to the I/O interface with the Linux PC and the utilization of the 6 memory banks. Specifically, it was determined that a large number of logic gates were being devoted to the distribution of configuration data to each of the RBPs, such as the phase increment values and gain values that each RBP is programmed with before signal processing starts. This configuration information was being transmitted from one RBP in the cascade to the next on each clock signal, along with each sample of the intercepted incident radar pulse. However, configuration information such as phase increment values and gain values does not change very often, relative to how fast the samples of the intercepted incident radar pulse are processed. Therefore, a new method was created that uses significantly fewer signal lines and much less logic for distributing RBP configuration information. The new implementation uses a time-multiplexed approach instead of a large number of parallel wires, at the cost of no longer allowing configuration parameters to be changed on every clock. However, this feature is not necessary for the intended application. These significant improvements in code efficiency allowed up to 128 range bin processors to be implemented in a single FPGA. Fig. 6 shows FPGA space utilization for the new version of the code as a function of the number of range bin processors. With 128 processors on each FPGA, the SRC-6e could be used to implement a total of 512 range bin processors.
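The trade-off between parallel and time-multiplexed configuration distribution can be made concrete with a rough model. The sketch below is not the authors' VHDL design: the shared address/data bus, the 5-bit phase increment width, and the function names are all assumptions chosen for illustration. Only the 4-bit gain field follows from the text, since an exponent n between 0 and 10 fits in 4 bits.

```python
PHASE_BITS = 5  # hypothetical width of the phase increment value
GAIN_BITS = 4   # enough to encode the gain exponent n = 0..10


def parallel_config_wires(num_rbps):
    """Signal lines if every RBP receives its own configuration bits in parallel."""
    return num_rbps * (PHASE_BITS + GAIN_BITS)


def multiplexed_config_wires(num_rbps):
    """Signal lines for one shared bus: address + data + one write strobe.

    Assumed scheme: the controller addresses one RBP per clock and that
    RBP latches the value on the shared data lines.
    """
    address_bits = max(1, (num_rbps - 1).bit_length())
    return address_bits + PHASE_BITS + GAIN_BITS + 1


def load_configs(configs):
    """Simulate time-multiplexed loading, one RBP register per clock.

    Returns the loaded registers and the number of clocks consumed, which
    is the price paid for the smaller wire count.
    """
    registers = [None] * len(configs)
    clocks = 0
    for address, config in enumerate(configs):
        registers[address] = config  # the addressed RBP latches its values
        clocks += 1
    return registers, clocks


if __name__ == "__main__":
    for n in (4, 16, 64, 128):
        print(n, parallel_config_wires(n), multiplexed_config_wires(n))
```

Under these assumptions the parallel scheme grows linearly with the number of RBPs, while the shared bus grows only with the logarithm of that number. Loading takes one clock per RBP, which is acceptable because the configuration changes rarely compared with the sample rate.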
The exponential and polynomial functions shown in Fig. 6 can be used to estimate FPGA usage for implementations with more than 128 RBPs. This will be useful as FPGA technology improves and the number of RBPs that can be included on a single chip further increases.
With 16 processors implemented on a single FPGA, the execution time of the image synthesis algorithm on the MAP and the total execution time of the macro call have not changed much, although the total execution time for the macro has come down a small amount from about 4.7 seconds to about 4.6 seconds at the right side of the graph. However, the performance plots for the two C versions of the algorithm running on the Windows PC and the Linux PC no longer cross. Enough calculations are now being performed that the large overhead of Windows has been fully amortized and the 3 GHz P-4 processor finally shows its raw speed over the 1 GHz Xeon processor.
When 64 range bin processors are implemented on a single FPGA, the significant overhead required to access the MAP is finally amortized over enough computations so that the raw computing power of the MAP can be seen in a performance comparison. Fig. 8 shows performance plots for 64 RBPs on a single FPGA.
As expected, the plot marked with diamonds still shows an extremely low execution time for the image synthesis algorithm running on the MAP. However, the plot marked with squares that indicates the total execution time of the macro, including the time it takes to allocate the MAP, program the FPGAs, transfer the phase samples to the memory in the MAP, transfer the results back to the PC, and deallocate the MAP, has dropped below the execution time of the C version of the benchmark running on the 1 GHz Xeon processor under Linux. Furthermore, the total execution time for the macro has decreased to the point where it is almost lower than the execution time of the C version running on the 3 GHz P-4 processor under Windows for a large number of input samples.
With 128 range bin processors implemented on a single FPGA, the point is finally reached where the SRC-6e becomes the fastest method of implementing the false-target radar image synthesis algorithm. Fig. 9 shows the performance plots for 128 processors. The plot marked with squares is now below the plots for both of the C versions of the algorithm for a large number of input samples. This indicates that for applications requiring a large number of calculations on a large data set, the reconfigurable architecture of the SRC-6e can provide a significant performance improvement. The key to achieving this performance is the ability to amortize the high cost of accessing the MAP over a large number of calculations.
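The amortization argument can be expressed as a simple first-order cost model. The sketch below assumes that the macro call costs a fixed overhead plus a per-sample I/O and MAP cost that does not depend on the number of RBPs (because the RBPs run in parallel in hardware), while the C emulation costs a per-sample time proportional to the number of emulated RBPs. The function and parameter names are hypothetical, and any values passed in are placeholders rather than measurements from this research.

```python
def macro_time(samples, fixed_overhead, io_per_sample, map_per_sample):
    """Total macro time: allocate/program/deallocate plus per-sample costs."""
    return fixed_overhead + samples * (io_per_sample + map_per_sample)


def software_time(samples, num_rbps, cpu_per_rbp_sample):
    """Time for a software emulation that visits every RBP for every sample."""
    return samples * num_rbps * cpu_per_rbp_sample


def break_even_samples(fixed_overhead, io_per_sample, map_per_sample,
                       num_rbps, cpu_per_rbp_sample):
    """Sample count above which the MAP macro beats the software version.

    Returns None when the per-sample cost on the MAP path is not lower
    than in software, in which case the fixed overhead is never recovered.
    """
    saving_per_sample = (num_rbps * cpu_per_rbp_sample
                         - (io_per_sample + map_per_sample))
    if saving_per_sample <= 0:
        return None
    return fixed_overhead / saving_per_sample


if __name__ == "__main__":
    # Placeholder values in arbitrary time units, not measured data.
    for n in (4, 16, 64, 128):
        print(n, break_even_samples(4.0, 1.0, 1.0, n, 0.5))
```

In this model, raising the number of RBPs increases only the software cost, so the break-even point moves toward smaller data sets, which is consistent with the trend reported for 4, 16, 64, and 128 RBPs.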
With the overall performance of the SRC-6e being so heavily dependent on the amount of time required to allocate the MAP, program the FPGAs, transfer input data from the common memory in the PC to the on-board memory in the MAP, transfer output data from the on-board memory to the common memory, and deallocate the MAP, additional research was done to further quantify how much time was being spent on different tasks. The results are shown in Fig. 10.
The plots shown in Fig. 10 were generated with 128 RBPs allocated to an FPGA. However, on a percentage basis, the results were nearly identical for 4, 8, 16, 64, and 128 processors allocated to an FPGA. As expected, the total amount of time required to execute the false-target radar image synthesis algorithm stays fairly low, even for large input data sets, as indicated by the triangle-marked plot. The overhead required to allocate and deallocate the MAP is shown in the plot marked with squares. The percentage of time spent on overhead drops as the input data set size increases because the time required to allocate and deallocate the MAP stays constant. Therefore, as the size of the input data set increases and the number of calculations performed increases, this time becomes a smaller percentage of the overall execution time. However, as indicated by the plot marked with diamonds, as the size of the input data set increases, more and more time is spent copying input data from the PC to the MAP and output data from the MAP to the PC. There is a potential warning in this plot. If the size of the input data set is too large relative to the number of computations performed in the MAP, the performance of the MAP will be limited by the I/O overhead. The applications that can attain the best performance on the MAP are those that have a high ratio of computation to I/O. Additional information about I/O behavior can be learned from Fig. 11.
Referring to Fig. 11, it can be seen that the percentage of time devoted to I/O is not heavily dependent on the number of range bin processors implemented in the FPGA, which is to say it is not heavily dependent on the number of computations performed in the FPGA. It is interesting to note that the percentage of time devoted to I/O can vary noticeably when there is a small amount of input and output data, which is the case on the left side of the graph. However, as the quantity of input and output data increases, the percentage of time devoted to I/O becomes more stable, which is the case on the right side of the graph. It should be noted that the Y axis in Fig. 11 uses a log scale. The large increase in the percentage of time devoted to I/O for a large number of samples in Fig. 11 is not unexpected and is characteristic of the specific algorithm implemented in this research. As the number of samples processed increases to 128 K, 256 K, and 512 K, the algorithm has a lower and lower computation to I/O ratio.
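The qualitative behavior described for Figs. 10 and 11 can be reproduced with a small model of how the macro time is divided. The sketch assumes a constant allocation/deallocation time and I/O and computation times that grow linearly with the number of samples. The function name `time_shares` and its parameters are hypothetical, and the numbers in the usage example are placeholders, not measurements.

```python
def time_shares(samples, alloc_time, io_per_sample, compute_per_sample):
    """Return the fraction of total macro time spent on each activity.

    alloc_time is the fixed cost of allocating and deallocating the MAP;
    the other two costs are assumed to scale linearly with the sample count.
    """
    io_time = samples * io_per_sample
    compute_time = samples * compute_per_sample
    total = alloc_time + io_time + compute_time
    return {
        "overhead": alloc_time / total,
        "io": io_time / total,
        "compute": compute_time / total,
    }


if __name__ == "__main__":
    # Placeholder values in arbitrary time units.
    for samples in (1, 10, 100, 1000, 10000):
        print(samples, time_shares(samples, 8.0, 1.0, 1.0))
```

Because the fixed term shrinks relative to the total, the overhead share falls and the I/O share rises as the data set grows. In this linear model the I/O share levels off at io_per_sample / (io_per_sample + compute_per_sample), so a low computation to I/O ratio leaves the MAP limited by data movement regardless of how many RBPs are implemented.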
VI. Conclusions and Future Work
The raw computing performance provided by the MAPs working in conjunction with the Linux PCs gives the SRC-6e a tremendous amount of computing power. However, there are two important questions that need to be answered before porting a given application to the SRC-6e. First, will the number of computations performed in the MAP be enough to amortize the high cost of allocating the MAP, transferring input data into the MAP, transferring result data out of the MAP, and deallocating the MAP? Second, will the extra time required to program the SRC-6e be worth the amount of performance improvement attained, relative to a computer with a more traditional architecture and programming model? It is important to keep in mind that the real issue is how long it takes to get the answer once the question is known. It is better to have a computer that takes a day to program and a week of execution time to get the answer than to have a computer that takes two weeks to program and a day of execution time. When taking these two questions into consideration, it is the experienced opinion of the authors that for digital signal processing applications in the field of electronic warfare, the performance of the SRC-6e is well worth the extra software development time. However, this assumes a software life-cycle model where the development is done once and the code is then utilized many times.
All of the performance measurements taken in this research were done with benchmarks that utilized the Snap ports for transferring input data to the MAP from the common memory in the PC and result data from the MAP to the PC. However, as pointed out in Section II, the Snap ports, with a bandwidth of 315 Mbytes/sec, are not the only ports in and out of the MAPs. This research has not yet made use of the Chain ports, which have a bandwidth of 800 Mbytes/sec. One of the reasons for conducting this research was to determine if the performance of the MAPs was substantial enough to warrant designing a hardware interface that would allow the MAPs to read data directly from an electronic warfare system or a radar receiver. We feel the answer to this question is definitely yes and the next step of this research will be to design such an interface. Having an 800 Mbytes/sec I/O interface with each MAP and between MAPs should make it much easier to reach the break-even point where the performance of the SRC-6e starts to exceed the performance of a more traditional computer. This will allow applications with less complexity and a lower ratio of computation to communication to take advantage of the power of the MAP. However, it should also be noted that the machine used in this research is a first-generation, entry-level machine. SRC Computers is now producing third-generation machines with a much higher Snap-port bandwidth and decreased MAP allocation, programming, and deallocation times. These improvements should help applications reach the break-even point even if they do not use the Chain ports. It should also be noted that the MAP does not need to be allocated and programmed every time the code that runs on the MAP is called, and the MAP does not need to be deallocated every time the code running on the MAP completes. If a MAP is not deallocated after a call, then it can be used again by the same process without having to allocate it. The advantage of this is that no time is lost allocating and deallocating the MAP. However, no other process can use the MAP until it is finally deallocated. Also, if the code being executed on the MAP does not change from one call to the next, the MAP does not have to be reprogrammed, which can also save some time.
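Two of the points above lend themselves to simple arithmetic: the effect of port bandwidth on transfer time, and the saving from keeping the MAP allocated across calls. The sketch below uses the 315 and 800 Mbytes/sec figures quoted in the text and assumes 1 Mbyte = 10^6 bytes and that the full rated bandwidth is sustained. The function names, the call-cost model, and the values in the usage example are hypothetical.

```python
SNAP_MBYTES_PER_SEC = 315.0   # Snap port bandwidth quoted in the text
CHAIN_MBYTES_PER_SEC = 800.0  # Chain port bandwidth quoted in the text


def transfer_seconds(num_bytes, mbytes_per_sec):
    """Ideal transfer time, assuming 1 Mbyte = 10**6 bytes and no protocol overhead."""
    return num_bytes / (mbytes_per_sec * 1e6)


def total_time_realloc(calls, alloc_time, dealloc_time, run_time):
    """Allocate and deallocate the MAP around every call."""
    return calls * (alloc_time + run_time + dealloc_time)


def total_time_reuse(calls, alloc_time, dealloc_time, run_time):
    """Allocate once, run all calls, then deallocate once.

    The MAP is unavailable to other processes for the whole period.
    """
    return alloc_time + calls * run_time + dealloc_time


if __name__ == "__main__":
    payload = 4 * 1024 * 1024  # placeholder payload size in bytes
    print(transfer_seconds(payload, SNAP_MBYTES_PER_SEC))
    print(transfer_seconds(payload, CHAIN_MBYTES_PER_SEC))
    # Placeholder timings in seconds, not measured values.
    print(total_time_realloc(10, 2.0, 1.0, 0.5))
    print(total_time_reuse(10, 2.0, 1.0, 0.5))
```

Under these assumptions the Chain ports shorten an ideal transfer by a factor of 800/315, or about 2.5, and reusing an allocated MAP removes all but one allocation and deallocation from a sequence of calls.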
