The Systems Level Applications of Adaptive Computing (SLAAC) project is defining an open, distributed, scalable, adaptive computing systems architecture based on a high-speed network cluster of heterogeneous, FPGA-accelerated nodes. Two implementations of this architecture are being created. The Research Reference Platform (RRP) is a Myrinet™ cluster of PCs with SLAAC-1 PCI-based FPGA accelerators. The Deployable Reference Platform (DRP) is a Myrinet cluster of PowerPC nodes with SLAAC-2 VME-based FPGA accelerators. A commercial 6U-VME quad-PowerPC board, the CSPI M2641™, has been adapted to act as a carrier for SLAAC-2. A key strategy proposed for successful ACS technology insertions is source-code compatibility between the RRP and DRP platforms. This paper focuses on the development of the SLAAC-1 and SLAAC-2 accelerators and how the network-centric SLAAC system-level architecture has shaped their designs. A preliminary mapping of a Synthetic Aperture Radar / Automatic Target Recognition (SAR/ATR) algorithm to SLAAC-2 is also discussed.
INTRODUCTION
The mission of the Systems Level Applications of Adaptive Computing (SLAAC) project is to: 1) define an open, distributed, scalable, adaptive computing systems architecture; 2) design, develop, and evolve scalable reference platform implementations of this architecture; and 3) validate the approach by deploying technology in multiple defense application domains. In the context of this research, adaptive computing systems (ACS) refer to systems that reconfigure their logic and/or data paths in response to dynamic application requirements. Creation and validation of a scalable, distributed ACS architecture requires a closely coordinated hardware and software development effort in the areas of next-generation FPGA accelerators, module generators and other tools, runtime control libraries and APIs, and algorithm mapping. Although the SLAAC team is presently engaged in all of these activities, this paper focuses on the SLAAC-1 and SLAAC-2 FPGA accelerator hardware design efforts currently underway for the first-generation reference platforms.
The system-level focus of the SLAAC project came about because of the realization that scalability and portability are the two primary obstructions preventing innovative ACS research from being directly useful in deployed systems. Scalability is an issue in that many real-world applications are larger than the modern PCI-based ACS accelerator. The transition from a small proof-of-concept demonstration to a large real-world application is often overlooked in ACS research. The portability issue has both hardware and software aspects. Physical form-factor and operating system issues can limit the utility of algorithm research on PCI-based FPGA accelerators. For example, a deployed system may require VME-based hardware and a real-time OS such as VxWorks™. Even among strictly PCI-based accelerators there is little commonality in the hardware architectures and software environments; rehosting to a field-friendly platform may be as difficult as the original research.
The SLAAC approach to scalability leverages modern cluster-computing techniques. The cluster computing community uses workstations and a COTS high-speed network "backplane" to build high-performance parallel systems [1]. Therefore, a logical way to build scalable parallel ACS systems is to cluster FPGA-accelerated workstations. This workstation-based architecture is called a Research Reference Platform (RRP). The SLAAC team is developing a scalable API and runtime software to support application control of these network-distributed, multiple-host, multiple-board ACS systems. The RRP has the advantage of being an inexpensive, readily available platform for ACS development that tracks advances in workstations, adaptive computing, and cluster computing. The Tower of Power (ToP) at Virginia Tech is a good example of an RRP. The ToP has sixteen Pentium II™ PCs, each equipped with a WildForce™ board tightly coupled to a Myricom™ LAN/SAN card; the PCs are connected through a sixteen-port Myrinet™ switch. A total of 80 XC4062XL FPGAs and memory banks are distributed throughout the platform and are available as computing resources [2].
Portability to deployable systems is partially addressed by implementing this same ACS-accelerated cluster architecture in a field-friendly platform. The scalable API and runtime control software being developed for the RRP are also under development in a VxWorks environment for Single Board Computers (SBCs). This field-friendly version of the RRP is called a Deployable Reference Platform (DRP). Scalable source-code compatibility can be achieved with respect to the host processor application. However, VHDL-level or bit-file-level compatibility of the application code on the FPGA-based accelerator requires an identical FPGA architecture on both the RRP and DRP platforms. For this purpose, the SLAAC team is developing two FPGA accelerators. The SLAAC-1 board is a standard full-sized 64-bit PCI board intended for RRP workstations. SLAAC-2 is a 6U VME mezzanine board designed to be plugged into a modified CSPI M2641 quad-PowerPC baseboard. From the application perspective, the SLAAC-1 and SLAAC-2 boards represent the same ACS architecture. Section 2 of this paper discusses this common hardware architecture in detail. Details relating to the specific SLAAC-1 and SLAAC-2 implementations are described in Section 3. Hardware support software such as device drivers and a control library is briefly covered in Section 4. A preliminary application mapping to the SLAAC architecture is covered in Section 5, and future work is discussed in Section 6.
SLAAC ARCHITECTURE
The SLAAC architecture is an attached processor system composed of FPGAs and fast local memories. The basic concept is not significantly changed from predecessor reconfigurable computer architectures such as Splash 2 [3] and WildForce™ [4]. As shown in Figure 1, the SLAAC-1 architecture is partitioned into a single interface FPGA (labeled 'IF') and three user-programmable FPGAs (labeled 'X0', 'X1', and 'X2'). The IF chip is configured at power-up to act as a stable bridge to the host system bus. It provides configuration, clock, and control logic for the user FPGAs. The attached host is responsible for actually programming the user FPGAs and controlling the system. SLAAC-1 is designed to act either synchronously with the host, or asynchronously with DMA channels transporting data to and from host memory. A clock generator and FIFOs implemented within IF allow the user FPGAs to operate from a single data-synchronous clock in either mode of operation.
Data Paths
One of our goals with the SLAAC-1 architecture was to design an FPGA accelerator assuming a 64-bit data word. Since fast 64-bit system busses have become more commonplace in commodity PCs, we felt that a 64-bit data word was necessary to keep up with modern I/O rates. A 64-bit word is also a more natural atomic data element for these wider processors, and the even/odd word alignment issues of a 32-bit FPGA system would cause additional complexity in user FPGA designs. Consequently, the two bidirectional 72-bit "FIFO" connections between IF and X0 permit the user FPGAs to produce and consume a 64-bit data word in a single clock cycle. The three user-programmable FPGAs are organized in a ring structure. X0 acts as the control element for managing user data flow, thus enabling X1 and X2 to focus on computation. The ring path (X0→X1→X2→X0) is also 72 bits wide so that an 8-bit tag can be associated with each 64-bit data word. The individual pin directions on the ring connections are user-controlled; this architecture could just as easily support one 36-bit clockwise ring and one 36-bit counterclockwise ring. The "crossbar" connecting X0, X1, and X2 together is a common 72-bit bus. The user also controls the direction of individual pins of this crossbar. Six additional handshake lines not shown (two each from X0 to X1, from X1 to X2, and from X0 to X2) permit crossbar arbitration without requiring unique configurations in X1 and X2.
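The relationship between the 72-bit paths and the 64-bit data word can be illustrated with a short Python sketch. It is only a model of the bit budget: the text states that an 8-bit tag accompanies each 64-bit word, but the placement of the tag in the upper eight bits and the names `pack_word` and `unpack_word` are assumptions made for illustration.

```python
DATA_BITS = 64                      # data word width stated in the text
TAG_BITS = 8                        # tag width stated in the text
WORD_BITS = DATA_BITS + TAG_BITS    # 72-bit ring / FIFO path

DATA_MASK = (1 << DATA_BITS) - 1


def pack_word(data: int, tag: int) -> int:
    """Combine a 64-bit data word and an 8-bit tag into one 72-bit value.

    Assumption: the tag occupies the upper 8 bits. The paper does not
    specify the bit layout.
    """
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 64 bits")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in 8 bits")
    return (tag << DATA_BITS) | data


def unpack_word(word: int) -> tuple[int, int]:
    """Split a 72-bit value back into (data, tag)."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word does not fit in 72 bits")
    return word & DATA_MASK, word >> DATA_BITS
```

In a user design the tag might mark stream boundaries or select a destination; the paper leaves its meaning to the application.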
Processing Elements
The SLAAC processing elements X1 and X2 each consist of one Xilinx XC40150XV-09 FPGA and four 256Kx18-bit synchronous SRAMs. The Xilinx 40150 contains a 72x72 array of CLBs for 100K to 300K equivalent logic gates supporting clock speeds up to 100 MHz. The SRAMs feature zero-bus turnaround permitting a read or write every cycle; no idle cycles are required for write after read, with the only tradeoff being that writes are pipelined [5]. Each PE has two 72-bit connections to left and right neighbors for systolic data and a 72-bit connection to the shared crossbar. Other connections not shown include four LED lines, two handshake lines connected to X0, and miscellaneous reset, clock, configuration, and readback pins.
The location of the memories and the major connections are designed to permit the PE to be divided into four "Splash-2-like" single-memory systolic processors to improve pipelining and floor planning. The memories are arranged along the top of the PE, the crossbar connection is centered on the bottom, and the left and right ring connections are on the left and right sides respectively. Both X1 and X2 processing elements are identical so that the same SIMD or systolic configuration can be easily replicated without redundant synthesis.
Control Element
The SLAAC control element, X0, consists of one Xilinx XC4085XLA-09 and two 256Kx18-bit synchronous SRAMs. The Xilinx 4085 contains a 56x56 array of CLBs for 55K to 180K equivalent gates at clock rates up to 100 MHz. X0 has two 72-bit ring connections, a 72-bit shared crossbar connection, and two 72-bit FIFO connections to the interface FPGA. Unlike the Splash 2 and WildForce™ architectures, X0 in the SLAAC architecture is designed to sit at both ends of the systolic array. X0 acts as the data stream manager for the architecture. Its primary mission is to read and write data from the FIFO module blocks implemented in the IF chip and pass this data on to the processing elements. The location of the memories and major connections in X0 are designed to allow the device to be split into a pre-processing section on the left half and a post-processing section on the right half of the FPGA.
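The memory counts given for X0, X1, and X2 can be tallied in a few lines of Python. This is a worked arithmetic sketch, not part of the SLAAC software; it assumes that "256K" means 262,144 words and that the SRAMs attached to one FPGA could be read in the same cycle to form one wide word, which the paper does not state explicitly.

```python
SRAM_WORDS = 256 * 1024          # assumption: "256K" is 262,144 words
SRAM_WIDTH_BITS = 18             # from the 256Kx18 part description

# SRAM counts per user FPGA, as given in the text.
SRAMS_PER_ELEMENT = {"X0": 2, "X1": 4, "X2": 4}


def sram_bits() -> int:
    """Capacity of a single SRAM in bits."""
    return SRAM_WORDS * SRAM_WIDTH_BITS


def element_width_bits(name: str) -> int:
    """Aggregate data width if all SRAMs on one FPGA are accessed together."""
    return SRAMS_PER_ELEMENT[name] * SRAM_WIDTH_BITS


def total_user_memories() -> int:
    """Number of user SRAMs across X0, X1, and X2."""
    return sum(SRAMS_PER_ELEMENT.values())


def total_user_bytes() -> int:
    """Total user SRAM capacity in bytes (18-bit words counted bit-exactly)."""
    return total_user_memories() * sram_bits() // 8
```

Under these assumptions each processing element sees a 72-bit aggregate memory width, matching the width of the ring and crossbar paths, and the total of ten SRAMs agrees with the "ten user memories" mentioned in the Configuration paragraph.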
Interface
The SLAAC interface includes a Xilinx XC4062XLA-09 and several supporting components for clock generation and distribution, configuration, power management, external memory access, and system bus interfacing.
Clock. The SLAAC interface includes a clock generator tunable from 391 kHz to 100 MHz in increments of 1 MHz. Clock distribution is separated into two domains. A processor clock (PCLK) drives the logic in X0, X1, and X2. PCLK is looped through the interface FPGA to support flexible countdown timers and single-step clocking. A memory clock (MCLK) drives the user memories and allows the host to access the memories while the PCLK is halted.
External Memory Bus. All of the user programmable memories in the SLAAC architecture are accessible from the host through an external memory bus. This feature guarantees a stable path to the memories for initialization, debugging, and retrieving results without depending upon the state of the user FPGAs. For each memory, a pair of transceivers isolates the address/control and data lines from the shared external memory bus. The transceivers are controlled from the IF chip.
For the SLAAC-1 architecture, we chose to implement a preemptive memory access strategy similar to that of Splash 2. In a preemptive memory access, the host interrupts the user FPGAs to read or write the memory. The user FPGAs are unaware that the access has occurred. This greatly simplifies the user's design because exclusive access to the memory is assumed. No special states are required in user state machines for initialization or debugging.
Although we chose a preemptive memory access strategy for the SLAAC-1 interface implementation, the fact that the interface is within an FPGA allows us to explore other approaches. In addition, the SLAAC-1 memories and the transceivers that implement the external memory bus are located on replaceable memory modules (see Figure 3), which presents ample opportunity to experiment with alternate memory designs. Each of the memory modules has 160 undedicated pins connected to one of the processing elements and a 40-pin connection to the external memory bus.
Configuration. The IF device is programmed on power-up by an EEPROM to provide a stable interface to the host. The EEPROM program pins are accessible to the host through a control/status register in IF. This enables in-system updates of the interface through software. The user programmable FPGAs in the system are configured from IF. X0, X1, and X2 can be programmed individually or in parallel. A simple slave bus configuration through a set of control/status registers is supported for the SLAAC-1 prototype. However, there are two additional memories on the external memory bus dedicated to the IF to act as a configuration cache. The host can quickly load the configuration cache and the configuration can occur autonomously in IF, thus freeing up the host more quickly. An added benefit of placing the configuration memories on the external memory bus is that any or all of the ten user memories can be conscripted as configuration caches. Up to six complete SLAAC-1 configurations (including X0, X1, and X2) can be stored simultaneously on SLAAC-1 and selected with minimal effort from the host.
Readback. An integral part of rapid prototyping on reconfigurable architectures is the ability to debug a design on the hardware. The Xilinx readback facility is essential. The IF chip provides readback access to X0, X1, and X2 through a set of control/status registers. For the SLAAC-1 prototype, a simple slave bus readback is supported. However, the configuration cache memories can also be used as a readback cache.
FIFOs. Instead of dedicated hardware, a number of input and output FIFOs are implemented within the logic of the interface chip. The SLAAC-1 architecture is designed to allow the user FPGA logic to simultaneously process a number of input and output streams. This feature is essential for the network-centric SLAAC system architecture. The ability to address multiple input and output FIFOs allows the user FPGAs to dynamically route data across multiple network channels on a cycle-by-cycle basis [2] .
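The idea of user logic addressing several output FIFOs on a cycle-by-cycle basis can be modeled in software. The following Python sketch is purely illustrative: the class `FifoBank`, the FIFO count, the depth, and the stall behavior are all hypothetical, since the paper does not describe the FIFO interface at this level of detail.

```python
from collections import deque
from typing import Iterable, Optional


class FifoBank:
    """A toy model of several bounded FIFOs selected by index."""

    def __init__(self, n_fifos: int, depth: int) -> None:
        self._fifos = [deque() for _ in range(n_fifos)]
        self._depth = depth

    def push(self, index: int, word: int) -> bool:
        """Write one word to the selected FIFO; return False if it is full."""
        fifo = self._fifos[index]
        if len(fifo) >= self._depth:
            return False
        fifo.append(word)
        return True

    def pop(self, index: int) -> Optional[int]:
        """Read the oldest word from the selected FIFO, or None if empty."""
        fifo = self._fifos[index]
        return fifo.popleft() if fifo else None


def route(stream: Iterable[tuple[int, int]], bank: FifoBank) -> int:
    """Send one (fifo_index, word) pair per modeled cycle.

    Returns the number of words that could not be written because the
    destination FIFO was full.
    """
    stalled = 0
    for index, word in stream:
        if not bank.push(index, word):
            stalled += 1
    return stalled
```

In this model each FIFO index would correspond to one network channel, so choosing the index per word stands in for the per-cycle routing described above.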
Power Management. Since the user logic in X0, X1, and X2 has the potential of drawing too much current for the PCI slot, the SLAAC interface includes a power monitoring circuit. Power monitoring is accomplished using a current-to-voltage monitoring circuit on the +5V, +3.3V, and +2.5V supply lines. Each circuit uses an LMC6482 operational amplifier and a low-value current-sensing resistor. Feedback resistors set the appropriate gain. These analog voltage levels are then monitored by a PIC16715E microcontroller that has four A/D input channels available [6]. Once a threshold level has been triggered, the PIC interrupts the IF device. The IF design is able to halt the processor clock to stop the user FPGAs and interrupt the host.
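The threshold check performed by the microcontroller can be sketched as follows. This is a minimal model under stated assumptions: the sense-resistor value, amplifier gain, current limit, ADC resolution, and reference voltage are all hypothetical placeholders, because the paper gives none of these figures.

```python
from dataclasses import dataclass

ADC_BITS = 8      # assumption: ADC resolution is not given in the paper
V_REF = 5.0       # assumption: ADC reference voltage in volts


@dataclass(frozen=True)
class RailMonitor:
    """Per-rail sensing parameters (all values are illustrative)."""
    name: str
    r_sense_ohms: float   # low-value current-sensing resistor
    gain: float           # amplifier gain set by the feedback resistors
    limit_amps: float     # current threshold that triggers the interrupt


def adc_code_to_amps(code: int, rail: RailMonitor) -> float:
    """Convert an ADC reading back to the rail current it represents.

    The amplifier output is I * R_sense * gain, so the current is the
    measured voltage divided by (R_sense * gain).
    """
    volts = code / ((1 << ADC_BITS) - 1) * V_REF
    return volts / (rail.r_sense_ohms * rail.gain)


def over_limit(code: int, rail: RailMonitor) -> bool:
    """True when the reading implies a current above the rail's limit."""
    return adc_code_to_amps(code, rail) > rail.limit_amps
```

When `over_limit` returns true, the corresponding action in the hardware described above would be for the PIC to interrupt IF, which can then halt PCLK and notify the host.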
IMPLEMENTATIONS AND STATUS

SLAAC-1 (PCI)
SLAAC-1 is a full-sized PCI board designed for use in the RRP workstations. Although the initial release of the interface FPGA contains a Xilinx 32-bit PCI core, the hardware is capable of supporting 64-bit PCI. Figure 2 contains a photograph of SLAAC-1 assembled in March 1999. In order from left to right, the large BGA devices are IF, X0, X2, and X1. The double-row of 100-pin connectors above X0, X1, and X2 support memory daughter card modules.
A memory module is shown in Figure 3. Each memory module has four 256Kx18 synchronous SRAMs and the transceivers for the external memory bus. The IF and X0 devices share one memory card, since there are two memories on X0 and two configuration cache memories on IF.
On the back of the SLAAC-1 board (not shown) are four systolic connectors for the high-speed data path through the X1 and X2 chips. The 64 data bits of the X0 to X1 and the X2 to X0 ring paths are shared with the systolic connectors. Additional pins from the X1 and X2 chips provide control for the respective external data sources.
SLAAC-2 (VME)
SLAAC-2 is a 6U VME mezzanine board designed to plug into a modified CSPI M2641 baseboard carrier. As shown in the SLAAC-2 architecture diagram in Figure 4, there are actually two SLAAC-1 compatible accelerators on the SLAAC-2 board, each controlled by one of the two PowerPCs on the M2641. A few modifications to the basic SLAAC-1 design were necessary to accommodate two accelerators in an area not much larger than a single full-sized PCI board. However, most of these changes are not directly visible to the SLAAC-2 application designer.
One change in the SLAAC-2 design was that the Xilinx 4062 IF device on SLAAC-1 was replaced with Xilinx 4085s. The extra I/O pins available on the 4085 were needed to accommodate the unmultiplexed 64-bit PowerPC bus. Other modifications were made to save space on the board, including combining the power management, IF boot EEPROMs, and the reference oscillator. The external memory bus was the only casualty of compute density visible to the user. There was insufficient area available for the transceivers necessary to isolate the external memory bus during normal compute FPGA operation. It was decided that since the SLAAC-1 and SLAAC-2 user FPGAs are bitfile compatible, debugging the memories could happen on a PCI board and was not essential in the VME platform. The only consideration for the application designer is that the memories will have to be loaded from within X0, X1, and X2.
Also shown in the SLAAC-2 architecture diagram are two 40-pin busses between X1A and X2B, and X1B and X2A. The spare pins used for controlling the external systolic connectors in SLAAC-1 were used in SLAAC-2 to bridge the two "independent" designs. Although the A and B designs have separate tunable clock synthesizers, we believe that sharing a single reference oscillator on SLAAC-2 will allow the two designs to operate synchronously with each other. In any event, we cross-clocked to spare pins in the compute FPGAs so that X1A and X2A have access to the B design's clock, and X1B and X2B have access to the A design's clock. This permits cooperation between the two adjacent nodes.
M2641S
The commercial CSPI M2641™ has four 300 MHz PowerPC 603r processors connected by a 1.2 Gb/s Myrinet SAN. The M2641 has an integrated 8-port Myrinet switch and supports network connections from both the front panel and the P0 VME backplane row for cable-free networking [7]. Figure 6 shows the M2641S carrier that has been modified for SLAAC-2. The black circular heat sinks conceal the PowerPC processors in the upper-left and upper-right corners. The BGA packages near the center of the board from left to right are 1) an ASIC interfacing the 2) Myricom LANai network processor to the PowerPC bus, and the 3) LANai processor and 4) ASIC FPGA interface for the second PowerPC node. The M2641S board has been modified for the SLAAC project to include two 120-pin PowerPC bus connectors visible in the SLAAC-2 photo. Two additional 60-pin SAN connectors are also available. However, they are unpopulated in this photograph because SLAAC-2 does not use them.
APPLICATIONS
As a validation of the SLAAC architecture approach, we are implementing portions of the Joint STARS Synthetic Aperture Radar / Automatic Target Recognition (SAR/ATR) application from Sandia National Labs on the SLAAC-2 system. The three primary components of the Sandia SAR/ATR application are Focus-Of-Attention (FOA), Second-Level Detection (SLD), and Final Identification (FI). SLD is the most computationally intensive of the three algorithms used in the Sandia SAR/ATR application [8]. In SLD, regions of interest, called chips, produced by the FOA algorithm are correlated with target templates. Figure 7 shows the SAR/ATR algorithm data flow.
The SLD algorithm can be described mathematically as follows. Two hundred sixteen template pairs represent each target type from 72 orientations (rotations) and 3 angles of elevation. A template pair comprises a bright template, which represents pixels with a strong radar return, and a surround template, which represents pixels with strong radar absorption. The chip image is contained in the 64x64-pixel matrix M. The 32x32-pixel bright and surround templates are contained in the matrices B and S. Let Bias be a template-specific value used to set the adaptive threshold. For each position (i, j) in the search space, SLD can be computed in five phases (P1-P5):
Where a hit at position (i, j) is valid if:
And the variables are defined as shown in Table 1 . SLD returns the two highest quality hits for each chip.
The search space is defined by the set of pixels in the chip that are used as a correlation operation origin. Graphically, the origin is the lower left corner of the correlation images. The origins that are used in SLD are those pixels in the chip where the lower left of the template can be overlaid without any of the template going outside the chip. The current version of FOA guarantees that no target pixels are within nine pixels of the edge of the chip, reducing the effective chip size to 46x46. Since the template size is 32x32, the size of the search space is 15x15 (46-32+1=15).
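The search-space arithmetic above can be checked with a short Python sketch. The function name `search_space` and its parameters are illustrative; the default values are the ones stated in the text (64x64 chip, nine-pixel border, 32x32 template). Origins are returned relative to the effective chip, which is an assumption about the coordinate convention.

```python
def search_space(chip_size: int = 64, margin: int = 9, template_size: int = 32):
    """Return (effective_size, side, origins) for the SLD correlation search.

    effective_size: chip width after removing the border on both edges.
    side: number of valid template origins along one axis.
    origins: all (i, j) origins, relative to the effective chip.
    """
    effective_size = chip_size - 2 * margin
    side = effective_size - template_size + 1
    if side <= 0:
        raise ValueError("template does not fit inside the effective chip")
    origins = [(i, j) for i in range(side) for j in range(side)]
    return effective_size, side, origins
```

With the default values this gives an effective chip of 46x46 and a 15x15 search space, so each chip and template pair requires 225 correlation placements.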
Our SLAAC-2 implementation of the SLD algorithm is based on an FPGA mapping created by Myricom, Inc. [9]. In our implementation, we store the target templates in the SRAMs local to the compute FPGAs, X1 and X2. The image chips are broadcast from the host, through IF and X0, to the compute elements in X1 and X2. Each compute element in X1 and X2 computes a correlation for a relative placement for a single chip and template pair. The thresholds, surround sums, and bright sums are passed through X0, which does the comparisons to determine whether a hit was found, and passes the hits back to the general-purpose processor. The general-purpose processor determines the k best hits, where k can be determined at runtime. Determining the hits on X0 allows the number of FPGAs per general-purpose processor to be scaled, while leaving the selection of the k best hits to the general-purpose processor.
Our estimates are that each compute element will process pixels at 40 MHz, with approximately 75 compute elements per compute FPGA. With two compute FPGAs per SLAAC board, this brute-force method will achieve approximately 15,000 template matches per second. With optimizations, we believe we can achieve 30,000 template matches per second by skipping zero elements in the templates. A computation rate of 30,000 template matches per second per VME slot is three times the performance achieved by Myricom's implementation, and is approximately four times faster than the estimated performance of a quad-PowerPC board. As FPGAs get larger, we expect the performance of SLD to scale at least linearly. The parallelism available in the algorithm allows increased density of devices to translate directly into increased performance in terms of the number of computational elements per chip and clock speed. For example, we would expect a SLAAC board populated with Xilinx Virtex XV1000 parts, with 1,000,000 gates per chip, to achieve at least 200,000 templates/second, twenty times the performance of a quad-PowerPC board. Furthermore, architectural improvements that allow better arithmetic implementations will yield further increases in functional density and clock speed.
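The host-side selection of the k best hits can be sketched in Python as follows. This is a minimal illustration, not the SLAAC runtime code: the `Hit` record, its fields, and the use of a single scalar quality value are assumptions, since the paper does not define the hit format returned by X0.

```python
import heapq
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """A hypothetical hit record as it might arrive from X0."""
    quality: float      # higher is better (assumed ordering)
    template_id: int
    i: int              # origin row in the search space
    j: int              # origin column in the search space


def k_best_hits(hits: Iterable[Hit], k: int) -> list[Hit]:
    """Return the k highest-quality hits, best first.

    k is an ordinary argument, so it can be chosen at runtime as the
    text describes.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return heapq.nlargest(k, hits, key=lambda h: h.quality)
```

Calling `k_best_hits(hits, 2)` corresponds to the statement that SLD returns the two highest-quality hits for each chip. Because the selection runs on the host, the number of FPGAs feeding hits into it can grow without changing this step.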
FUTURE WORK
The SLAAC team has an aggressive schedule of demonstrations in the coming year on both the SLAAC-1 and SLAAC-2 systems. Planned applications include IR/ATR, SAR/ATR, sonar beamforming, and multi-dimensional image processing. Incremental releases are planned to improve the performance of the interface and device drivers. Other future work includes supporting the JHDL design environment and extending the SLAAC VHDL simulator to include multiple-board systems.