The rapid prototype provides full functionality, allowing any design errors or beneficial modifications to the design to be identified.
1. Introduction
Rapid prototyping of hardware is becoming a standard step in structured top down design methodologies that seek to improve the process by which complex digital systems are designed, manufactured, upgraded, and supported [1]. Rapid prototyping attempts to decrease the development cycle time of non-standard custom hardware, interfaces, and software, particularly in real time systems, by first implementing a scaled version of the system using off the shelf parts. Long development cycles delay the delivery of prototypes into the hands of users for operational evaluation. If a system can be rapidly prototyped based on a subset, or relaxed set, of specifications and off the shelf parts, the resulting prototype can be delivered to the user for test and evaluation much more quickly. The design can then be evaluated and upgraded, correcting any functional errors and inserting newer technology. When the prototype is built and evaluated prior to final layout and fabrication of the ultimate system, the functional operation and electrical characteristics of the system can be tested and validated. The results of the testing and validation of the circuits can then be incorporated into the final design. This approach can drastically reduce the need for costly design modifications, and delay of delivery of the final system. This paper presents the rapid prototyping of a custom chip set implementing a Single Instruction Multiple Data (SIMD) architecture [2]. The rapid prototype was designed using automated CAD/CAE development tools, and was implemented using off the shelf parts and FPGAs.
SPAR Architecture
The SPAR architecture was designed to bring the benefits of massively parallel processing to the embedded systems domain. The system philosophy focused on building a hierarchy of scalable subarray modules, allowing systems to be configured by "plugging together" any number of these modules to meet specific system requirements [3]. The SIMD architecture is programmed in Ada, a language designed for use in real time embedded systems, which provides the benefits of abstract data types, generic functions, and high level programming. Figure 1 shows the block diagram of the system for a typical embedded application.
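The modular scaling described above can be illustrated with a small calculation. The sketch below assumes the 16 x 16 chip size given in the architecture description and a rectangular grid of chips; the function name `array_size` is hypothetical and is used only for illustration.

```python
PES_PER_CHIP_SIDE = 16  # each SPAR chip holds a 16 x 16 array of processing elements


def array_size(chip_rows: int, chip_cols: int) -> tuple:
    """Return (pe_rows, pe_cols, total_pes) for a grid of SPAR chips.

    The rectangular tiling is an assumption made for this sketch.
    """
    if chip_rows < 1 or chip_cols < 1:
        raise ValueError("at least one chip is required in each dimension")
    pe_rows = chip_rows * PES_PER_CHIP_SIDE
    pe_cols = chip_cols * PES_PER_CHIP_SIDE
    return pe_rows, pe_cols, pe_rows * pe_cols


# A 4 x 4 grid of chips gives the 64 x 64 array (4,096 processors).
print(array_size(4, 4))
```

Under these assumptions, a single chip corresponds to `array_size(1, 1)`, and larger systems follow by increasing either grid dimension.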
The SPAR architecture is implemented in two custom chips: a controller chip and a SPAR (Systolic Processor Array) chip, as shown in Figure 1. Each SPAR chip consists of a 16 x 16 array of bit serial processing elements configured as shown in Figure 2. The system architecture allows any number of SPAR processor chips to be combined to fit a particular application. The organization of SPAR chips in Figure 2 provides a 64 x 64 array of processing elements, for a total of 4,096 processors. The configuration shown in Figure 2 is capable of performing 6.4 GOPS.
The block diagram of a single processing element is shown in Figure 3. Each processor has corner turning registers, a dual port register file, an ALU, and an interface into the nearest neighbor interconnection network. Each processor also has an associated 64 kbyte local RAM accessible via the corner turning bus. The corner turning registers allow the single bit processor to access memory across a standard parallel bus. The effective address is computed in the ALU and issued to memory from the corner turning address register. Data is loaded into the processor or stored into memory through the corner turning data register.
The register file shown in Figure 4 contains 32 16-bit registers, organized as 512 single bit register cells. Each cycle, the bit serial processor can access two bits from the register file in parallel, or store two bits back into the register file in parallel.
Design Methodology
Figure 4 shows the design cycle used for developing the rapid prototype. As shown in Figure 4, the first step in our structured top down design approach was a requirements analysis for the prototype. After the requirements were determined, the design was captured using a CAD/CAE development environment. Simulations were performed to verify that the prototype was functionally equivalent to the final system design, and that the circuit design was correct.
After the prototype was validated in simulation, the simulation was combined with a simulation of the remaining portions of the system. A complete system simulation was then performed, and hardware was fabricated.
A critical preliminary step in the design methodology was selecting technologies for supporting the design, testing, and fabrication of our low cost hardware emulator in a short time. These technologies included both hardware devices and efficient Computer Aided Design (CAD) tools. Two of the key considerations in the selection of technologies were overall time to hardware, and the geographical separation of two design teams. The hardware selection had to support the rapid development of the circuits, and the CAD tool had to support open standards [4] for interfacing the designs and simulations of the two separate design teams. Field-Programmable Gate Arrays (FPGAs) were selected for the rapid hardware implementation of the SPAR prototype [5], and a CAD tool was identified that would support open standards for interfacing the designs developed at the two geographically separate locations.
Field Programmable Gate Arrays (FPGAs)
Field-Programmable Gate Arrays (FPGAs) are general purpose, high density, user configurable ASIC devices. FPGAs have evolved from simple chips that could replace small discrete subsystems, such as address decoders, to complex arrays capable of implementing large circuits with up to 20,000 gates. This gate count was necessary in order to implement the SPAR SIMD array. The operating speed of the FPGA also had to match the operating speed of the final SPAR system in order to verify rise and fall times, propagation delays, and other circuit timings. The Xilinx [6] 4000 family of Field-Programmable Gate Arrays was chosen based on efficiency of implementation, speed, functionality, and an open systems interface with most existing CAD tools.
CAD Tool Environment
A CAD tool was required that supported the rapid development of the prototype. The CAD tool was required to support a high level of abstraction for the rapid specification of the complex SPAR circuits. Both detailed circuit design via graphical input and specification of circuits using a high level description language were required. The availability of these tools allowed the design of the complex system in a short time, and simplified the programming of the FPGAs and PLDs. The tool was also required to support detailed timing and simulation of the circuits in order to verify the functionality of the design. These tools had to be integrated in a common design environment that supported our rapid prototype methodology.
Requirements Analysis
The first step in the requirements analysis was to identify the type of results and information that the rapid prototype was to provide. The main requirement of the prototype was to validate the electrical and timing characteristics of the SPAR chip (see section 3 for a description of the SPAR). A discrete event simulation of the system had been performed to verify the functionality of the SPAR instruction set. However, the discrete event simulation could not provide detailed electrical and timing information because of the level of abstraction required to keep simulation times reasonable. Further, discrete event simulations cannot generally provide an exhaustive verification of circuits operating asynchronously, and cannot verify complex circuit setup and hold times or perform critical timing analysis.
The rapid prototype would provide an exact duplicate of the SPAR's interfaces with the controller, but in a scaled version. This scaled version was specified to allow development and testing of the debugging software, and to provide early feedback for detailed verification of the interface signals and instruction operations. The rapid prototype would provide full functionality, allowing any design errors or beneficial modifications to the design to be identified. These modifications could then be included in the SPAR chip before the chip was sent for fabrication. The complete functional operation of the processing elements, data paths, interconnections, microcode sequencing, and timing could be validated with the rapid prototype.
The SPAR would be programmed in Ada. However, off the shelf tools for parallel debugging in Ada on a massively parallel embedded SIMD array were not available. This debugging support is fundamental for programmers to develop and debug their code on the SPAR array. Without these tools, programmers could not use the SPAR chip set when fabrication and testing were complete. The scaled, rapid prototype of the SPAR would provide a platform for developing the parallel debugging software while the actual SPAR was being fabricated. Further, the rapid prototype would alleviate the common problem of not having sufficient amounts of custom hardware available to all software development teams. The scaled rapid prototype system would be significantly cheaper to produce than the VLSI SPAR chip. Scaled prototype systems could be made available to software engineers as cheap development stations, thereby allowing parallel development of software to occur.
Prototype Design
Based on the requirements analysis, the prototype array was specified to contain a 2 x 2 PE array. It was determined that a 2 x 2 organization was sufficient to test all interfaces and instructions. The prototype board was specified to operate as a stand alone 2 x 2, or in combination with other 2 x 2 boards in the same fashion as the 16 x 16 SPAR chip to create a bigger array.
Partitioning and Implementation
The next step in the development was to partition the bit-serial processors shown in Figure 3 into discrete components. An initial sizing was performed to determine the implementation complexity and number of I/Os required per processor. The implementation complexity was based on the number of combinatorial gates and flip flops, and the interconnect requirements of all internal data paths. A comparison of implementation complexity and I/O requirements to the resources available on each FPGA showed that the number of CLBs, rather than the number of I/Os or interconnects, was the critical factor in the design. This is in part due to the bit serial nature of the processor, which requires only a single bit per data path, and the fairly modest chip I/O requirements.
Next, the 2 x 2 processor array was partitioned to maximize the use of resources on the FPGA, under the constraint of minimizing the overall cost of the design. It was determined that mapping a single processor into a single Xilinx 4000 series FPGA was not the most efficient partitioning. Implementing the chip enable circuitry for each single bit register in a dual port register file required decoding five variables. This decoding is implemented in a Configurable Logic Block (CLB) within the FPGA. Each CLB can decode two independent four input functions, or a single five input function. Therefore, implementing the dual port register file alone would require 512 CLBs, one per single bit register cell, not including the logic required to implement the corner turning registers and ALU. This ruled out seven of the ten 4000 series FPGA chips.
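The CLB estimate above is simple arithmetic, reproduced here as a minimal sketch. It assumes one CLB per register cell, following the statement that a five input decode occupies a whole CLB; the function name and the four-processor case with separate decoders are hypothetical illustrations, not figures from the design.

```python
REGISTERS_PER_PE = 32    # registers in each processor's register file
BITS_PER_REGISTER = 16   # single bit cells per register


def enable_decode_clbs(num_pes: int = 1) -> int:
    """Estimate CLBs needed for register cell enable decoding.

    Assumes each cell's five input enable decode uses one full CLB
    and that every processor has its own decoder.
    """
    cells_per_pe = REGISTERS_PER_PE * BITS_PER_REGISTER
    return cells_per_pe * num_pes


print(enable_decode_clbs())   # register file of one processor
print(enable_decode_clbs(4))  # hypothetical: four processors, no shared decoding
```

The single-processor result matches the 512 CLBs quoted in the text, before any logic for the corner turning registers or ALU is counted.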
This partitioning was compared to a second partitioning where the ALU and corner turning memories of all four processors were placed into a single FPGA. The dual port register files were placed external to the FPGA. This partitioning required sufficient I/O capabilities to handle all interconnections between each processor's register file and the corresponding corner turning memory and ALU. This proved acceptable as all internal busses were single bit busses. A block diagram of the partitioning is shown in Figure 5. This partitioning simplified the decoding of registers for all four processors. As each processor executes the identical instruction according to the SIMD philosophy, each processor is accessing the same single bit register. A single decoder circuit was implemented instead of four identical decoder circuits, one per processor. The four single bit dual port registers were implemented in a single dual port SRAM.
The utilization output report for the FPGA showed the prototype used 356 out of 400 CLBs. This provided an 89% utilization of CLBs, allowing enough margin for modifications to be made for bug fixes and enhancements during integration. The number of I/Os required was 147 out of a possible 160.
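As a quick check of the reported figures, the utilization percentages and remaining margin can be computed directly. This is only an arithmetic sketch; the helper name `utilization` is hypothetical.

```python
def utilization(used: int, available: int) -> float:
    """Return resource utilization as a percentage."""
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * used / available


clb_pct = utilization(356, 400)  # CLB usage reported for the prototype
io_pct = utilization(147, 160)   # I/O usage reported for the prototype

print(f"CLBs: {clb_pct:.1f}% used, {400 - 356} spare")
print(f"I/Os: {io_pct:.1f}% used, {160 - 147} spare")
```

The I/O figure works out to roughly 92%, so the I/O margin is slightly tighter than the CLB margin.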
The prototype board is shown in Figure 6. As seen in Figure 6, the prototype is built using all standard, off the shelf parts. The address decode and control is implemented in an Altera Programmable Logic Array (PLA). Using a single programmable part consolidated all combinatorial logic associated with chip decoding, and provided the flexibility to modify the logic in case of design errors or modifications.
As shown in Figure 6, the dual port register files for all four processors are implemented in an external dual port register file. Each processor accesses the register bit serially, analogous to the operation of the VLSI SPAR.
In the prototype implementation, the four bit serial register files are efficiently implemented in the single 1k x 8 dual port register file shown in Figure 6.
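One way to picture the shared register file is as a memory in which each word holds the same register bit for all four processors, so a single address serves the whole 2 x 2 array. The sketch below is a behavioral model under that assumption; the address mapping (`reg * 16 + bit`), the bit-to-processor assignment within a word, and all class and function names are hypothetical and are not taken from the prototype design.

```python
NUM_PES = 4   # 2 x 2 prototype array
REGS = 32     # registers per processor
BITS = 16     # bits per register


class SharedRegisterFile:
    """Behavioral model of a 1k x 8 memory shared by four bit serial PEs."""

    def __init__(self):
        self.mem = [0] * 1024  # 1k words; only the low 4 bits of each are used here

    @staticmethod
    def address(reg: int, bit: int) -> int:
        # Assumed mapping: one word per (register, bit position) pair.
        if not (0 <= reg < REGS and 0 <= bit < BITS):
            raise ValueError("register or bit index out of range")
        return reg * BITS + bit

    def read_bits(self, reg: int, bit: int) -> list:
        """Return one bit per PE from the addressed word."""
        word = self.mem[self.address(reg, bit)]
        return [(word >> pe) & 1 for pe in range(NUM_PES)]

    def write_bits(self, reg: int, bit: int, bits: list) -> None:
        """Store one bit per PE into the addressed word."""
        if len(bits) != NUM_PES:
            raise ValueError("expected one bit per PE")
        word = 0
        for pe, b in enumerate(bits):
            word |= (b & 1) << pe
        self.mem[self.address(reg, bit)] = word

    def load(self, reg: int, values: list) -> None:
        """Test helper: store one 16-bit value per PE, bit by bit."""
        for bit in range(BITS):
            self.write_bits(reg, bit, [(v >> bit) & 1 for v in values])

    def value(self, reg: int, pe: int) -> int:
        """Test helper: reassemble the 16-bit value held by one PE."""
        return sum(self.read_bits(reg, bit)[pe] << bit for bit in range(BITS))


def simd_add(rf: SharedRegisterFile, dst: int, a: int, b: int) -> None:
    """Bit serial add: every PE executes the same step on its own bit."""
    carry = [0] * NUM_PES
    for bit in range(BITS):  # least significant bit first
        totals = [x + y + c for x, y, c in
                  zip(rf.read_bits(a, bit), rf.read_bits(b, bit), carry)]
        rf.write_bits(dst, bit, [t & 1 for t in totals])
        carry = [t >> 1 for t in totals]


rf = SharedRegisterFile()
rf.load(0, [1, 2, 3, 65535])
rf.load(1, [10, 20, 30, 1])
simd_add(rf, 2, 0, 1)
print([rf.value(2, pe) for pe in range(NUM_PES)])
```

Because all four processors address the same register bit on every step, the model needs only one address computation per access, which mirrors the single shared decoder described in the partitioning.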
The circuit board shown in Figure 6 was fabricated using six layers: four signal layers, power, and ground. The form factor of the board was developed to plug into the actual system. Automatic routing was first performed, followed by hand routing of critical signals. The hand routing was required due to the fine pitch dimensions associated with the surface mount components. Surface mount components were used to reduce the size of the board, and because of their availability.
Conclusion
This paper presented an overview of the methodology and design of a rapid prototype SIMD embedded processor developed at the University of Arkansas. The prototype is a scaled version of the final processor, implementing the same instructions and containing the same interface. The prototype was built to allow development of the parallel debugging software for programmers, and to validate the design. Validation of the design included both operational verification of assembler instructions and microcode, and verification of all data and control paths. This low level verification was not possible by discrete event simulation, and required a hardware emulator. The prototype will also serve as an economical development platform for software developers. The prototype is implemented in FPGAs, PLDs, and other off the shelf nonprogrammable parts. The FPGAs have provided a fast development cycle, and the flexibility to alter the design for bug fixes and functional enhancements. The CAD development environment provided the framework and support for performing a quick turnaround of the prototype.
The prototype has been successfully used to verify the assembler instruction set, and circuit timing of the overall system. The emulator has been used to verify the board to board interfaces, as well as board level timing.
Several modifications were made to the final VLSI SPAR design based on the prototype evaluation. These modifications were necessary for the correct operation of the system, and were not identified during discrete event simulations.
Bibliography

