A hardware architecture is proposed which allows direct mapping of design simulation topology onto an acceleration platform. In order to clarify architectural principles, the simulation is confined to functional verification of unit delay, binary valued gate level logic designs. Under this approach, a rank ordered design description is executed on a massively parallel processor grid which implements an efficient and direct model of the design, similar to prototyping. Architectural innovation reduces logic complexity and execution time of boolean evaluation and fanout switching circuits, while large scale parallelism is integrated at die level to reduce cost and communication delays. The results of this research form the basis for a multiple order of magnitude improvement in reported state-of-the-art cost-performance merit for hardware gate level simulation accelerators.
INTRODUCTION
The increasing densities provided by VLSI technology have created a demand for CAD systems capable of quickly simulating and verifying large, complex designs. Previously, accelerator development efforts have evolved as either software algorithms implemented on general purpose computers or as special purpose hardware accelerators designed to optimize certain simulation algorithms [1] [2] [3] [4] [5] [6] [7] [8] [9]. This research establishes proof of concept for speed-up in a simulator architecture that matches digital designs rather than

108 E. SCOTT FEHR et al.

In order to simplify design issues for purposes of architectural clarity, the simulation is confined to functional verification of unit delay, binary valued, synchronous, gate level logic designs. In order to match the hardware architecture, the simulation netlist is converted into equivalent four-input gates during preprocessing.
Related work by Kravitz et al. [12] identifies factors contributing to the performance of massively parallel simulators for large VLSI circuits. Although the subject of that work was switch level simulation, their analysis of the rank ordering and execution of the Boolean behavioral model created by COSMOS provides important results which are applicable to the execution model for our research. Kravitz's work was targeted to the Connection Machine [13], a general purpose parallel architecture with a dynamically scheduled message routing network that is not optimal for logic simulation. Their recognition that the communication pattern for this type of simulation "is static and could be supported with a much faster and simpler communication network" underscores the value of one of the principal architectural contributions of our work.
ARCHITECTURE OF THE ACCELERATOR
The fundamental building block chip, designed for minimum cost and maximum speed, contains 256 PE cells arranged as two push-pull rows of 128 PEs each. This two row configuration is adequate to execute single stimulus patterns at full speed. Sequences of stimulus patterns can be executed in pipeline fashion by replicating the fundamental chip so that the number of rows is as great as the depth of the rank ordered netlist block to be executed. The active circuitry in the PE is designed for implementation in custom CMOS logic, which allows all fanin configuration switching and Boolean function evaluation for a logic rank to be done in one clock cycle. Fanout propagation between chips on a board is done at one clock per circuit rank, while wire delay between boards incurs an additional time step penalty.
The grid interconnect topology, shown in Figure 1, is designed to efficiently route arbitrary fanout patterns between successive ranks of a rank ordered design, including feedback paths.

An example of a rank ordered combinational net mapped onto a platform array is given in Figure 8 (a generic gate symbol is used to represent a PE in the array). Redundant gates, designated by (R), are required for primary inputs that enter beyond rank 0 and for fanouts that pass through one or more ranks. Redundant rows/ranks are required to split gates with more than four inputs, or for a net width greater than the platform width. Primary inputs X11, X12, X13, and X14 all enter beyond rank 0 in the original net, which requires redundant gates at each rank back to rank 0 to represent the signal state at each rank during simulation. The five inputs to gate I require a split into one 4-input gate and one 2-input gate, forcing the addition of rank 3. After the design ranks are loaded in order onto sequential platform rows, each rank is evaluated by concurrent execution of all processors in the corresponding row (primary inputs must be available prior to evaluation).
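The preprocessing steps described above (splitting gates with more than four inputs, rank ordering, and counting the redundant pass-through gates) can be illustrated with a minimal Python sketch. This is not the authors' preprocessor. The function names, the netlist representation (a dict mapping each gate to its input signals), and the partial-gate naming are hypothetical. The sketch assumes an acyclic combinational net, associative gate functions when splitting, and a single shared pass-through chain per signal.

```python
MAX_FANIN = 4  # PE input count stated in the text


def split_wide_gate(name, inputs):
    """Split a gate with more than MAX_FANIN inputs into a chain of gates.

    Assumption: the gate function is associative (e.g. AND, OR), so a
    partial result can feed the next gate. Names like "I_p0" are hypothetical.
    """
    gates, ins, k = {}, list(inputs), 0
    while len(ins) > MAX_FANIN:
        part = f"{name}_p{k}"
        gates[part] = ins[:MAX_FANIN]          # one full 4-input gate
        ins = [part] + ins[MAX_FANIN:]         # its output joins the rest
        k += 1
    gates[name] = ins
    return gates


def rank_order(gates, primary_inputs):
    """Rank 0 for gates fed only by primary inputs, else 1 + deepest driver."""
    rank = {}

    def visit(g):
        if g not in rank:
            drivers = [s for s in gates[g] if s not in primary_inputs]
            rank[g] = 1 + max(visit(d) for d in drivers) if drivers else 0
        return rank[g]

    for g in gates:
        visit(g)
    return rank


def count_redundant_gates(gates, primary_inputs, rank):
    """Count pass-through PEs needed to hold each signal in every row it crosses.

    Assumption: all consumers of a signal share one pass-through chain.
    """
    last_use = {}
    for g, ins in gates.items():
        for s in ins:
            last_use[s] = max(last_use.get(s, -1), rank[g])
    total = 0
    for s, used_at in last_use.items():
        source_rank = -1 if s in primary_inputs else rank[s]
        total += max(0, used_at - source_rank - 1)
    return total


# Hypothetical three-gate net: signal A skips a rank, x3 and x4 enter late.
net = {"A": ["x1", "x2"], "B": ["A", "x3"], "C": ["B", "A", "x4"]}
pis = {"x1", "x2", "x3", "x4"}
ranks = rank_order(net, pis)
print(ranks, count_redundant_gates(net, pis, ranks))
print(split_wide_gate("I", ["a", "b", "c", "d", "e"]))
```

In this toy net, a primary input consumed at rank r costs r pass-through gates, which mirrors the requirement that late-entering primary inputs be carried back to rank 0. The five-input split yields one 4-input gate and one 2-input gate, as in the gate I example.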
SIMULATION SLOWDOWN FACTORS
The time required to execute a simulation on the platform is determined by the total number of simulation clock cycles required and the clock period. The number of clock cycles required to execute a simulation is determined by the number of clock cycles required to execute each rank in the compiled netlist, summed over all ranks. Since the accelerator executes at one clock cycle per rank, the number of clock cycles is equal to the number of ranks in the rank ordered net. The period of the simulation clock cycle is constrained by the logic delays needed to execute and propagate a simulation rank. The time required to execute a netlist rank is determined by two types of delays: (1) gate delays through the PE active signal path (the delays required to switch inputs, evaluate functions, and latch outputs), which are a constant factor for each PE/simulation rank, and (2) fanout route delays, which include wire delays for long fanout paths.
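As a rough illustration of this timing model, the sketch below multiplies the rank count by a clock period formed from the two delay types. The function names are hypothetical. The 2 ns and 3 ns delay values and the 40-rank depth are figures quoted later in the paper, used here only as example inputs.

```python
def clock_period_ns(pe_path_delay_ns, fanout_delay_ns):
    """Clock period bounded by PE active path delay plus fanout route delay."""
    return pe_path_delay_ns + fanout_delay_ns


def simulation_time_ns(num_ranks, period_ns, clocks_per_rank=1):
    """Time for one stimulus pattern at the stated one clock cycle per rank."""
    return num_ranks * clocks_per_rank * period_ns


period = clock_period_ns(2.0, 3.0)      # example values from the summary
print(simulation_time_ns(40, period))   # 40-rank net, no slowdown applied
```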
Interconnect Loading
The number of interconnect links consumed by all fanout connections for a net can be determined as follows. Between any pair of fanout connected nodes, the fanout distance is the path length, or number of single bit links that are consumed by the connection. That is, for nodes at row-column positions $(i, j)$ and $(k, l)$, fanout distance on the grid can be determined as the sum of the number of horizontal links, given by the column distance $|l - j|$, and the number of forward links, given by the row distance $|k - i|$. For a net, the total number of links required for fanout connection is the sum of links for each fanout-connected pair in the net. The communications loading for each node is the number of fanouts for the node plus the routing paths of other fanouts which include the node.
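The link count can be sketched in a few lines of Python. The function names and the (row, column) tuple representation are hypothetical. The distance itself is the sum of column distance and row distance described above.

```python
def fanout_distance(src, dst):
    """Links consumed by one fanout connection between (row, col) grid nodes."""
    (i, j), (k, l) = src, dst
    return abs(l - j) + abs(k - i)   # horizontal links + forward links


def total_links(fanout_pairs):
    """Total links for a net: sum over all fanout-connected node pairs."""
    return sum(fanout_distance(s, d) for s, d in fanout_pairs)


# Hypothetical net with two fanout connections between ranks 0 and 1.
pairs = [((0, 2), (1, 5)), ((0, 0), (1, 0))]
print(total_links(pairs))
```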
Since there are four signal lines in each direction between node pairs, any lateral loading >4 produces collisions. In order to measure lateral crossover density, a figure of merit called the collision degree is defined relative to the lateral interconnect width of 4.
A collision degree of zero means that no interconnect load is >4, so that all routing can be resolved by one circuit switching configuration between net ranks. A collision degree of one means that interconnect load is >4 but ≤8, so that all collisions can be resolved by inserting one redundant rank between the original net ranks.
A figure of merit for the execution efficiency for a net can be derived from the sum of collision degrees for all rank pairs in the net. This follows directly from the circuit switching design of the grid, which allows propagation of fanouts between adjacent ranks in a single execution clock cycle. For a net having collision degree one, an additional clock cycle is required to propagate the redundant rank of interconnect load, and execution is slowed by a factor of two between the rank pair. Higher collision degrees are assessed similarly.
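A short Python sketch can make the collision degree concrete. The closed form used here, ceil(load / 4) - 1, is an assumption: it reproduces the two cases stated above (degree zero for loads up to 4, degree one for loads of 5 to 8) and extends them in the obvious way, but the general formula is not given in the text. Function names are hypothetical.

```python
LATERAL_WIDTH = 4  # signal lines in each direction between node pairs


def collision_degree(load):
    """Assumed generalization: one extra degree per additional 4 lines of load."""
    return max(0, -(-load // LATERAL_WIDTH) - 1)   # integer ceil(load / 4) - 1


def propagation_cycles(rank_pair_loads):
    """One clock per rank pair, plus one clock per unit of collision degree."""
    return sum(1 + collision_degree(load) for load in rank_pair_loads)


# Hypothetical peak lateral loads for four successive rank pairs.
loads = [3, 4, 6, 9]
print([collision_degree(x) for x in loads], propagation_cycles(loads))
```

Under this assumption a rank pair with collision degree one takes two clocks instead of one, matching the factor-of-two slowdown stated for that pair.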
Netlist Redundancy Bounds
In order to predict bounds for time-space penalties due to netlist redundancy injection during preprocessing, a typical dense commercial circuit is characterized: a 32-bit × 32-bit multiplier built up from 4-bit × 4-bit multiplier building blocks. For this circuit, partial results are summed in a Wallace tree to produce a 63-bit product, and Booth recoders and carry look-ahead generators are used to achieve high performance. This circuit is a typical logic design application for a 4-input sea-of-gates gate array requiring ~7500 gates, a non-trivial amount of logic.
The multiplier circuit requires 40-50 logic ranks (without pipelining). Three redundant logic ranks are required to distribute fanout signals as follows: 1) one rank for the top of the logic tree for distributing 5 carries, 2) one rank for the 4-bit look-ahead, and 3) one rank for the 4-bit group look-ahead. The 3 redundant ranks represent 3/40, or <10% redundancy. The maximum width of the multiplier circuit can be calculated as 16 (4- board, which is wider than the multiplier circuit. Therefore, no additional redundancy penalty is assessed based on width. As an example of upper bound worst case fanout crossing, the logic design for a 32-bit barrel shifter could require up to 2× rows, or 100% redundancy. Based on the above analysis, a liberal 20% figure will be adopted for typical redundancy to be used in calculating effective gate capacity for the accelerator platform.
Off-Chip Signal Slowdown
Simulation slowdown due to off-chip fanout propagation is calculated based on ns/ft signal propagation speed as: 1) chip-to-chip fanout distance of <2 ft is presumed, so that a slowdown of 5 ns, or 1 clock cycle (developed in section 4.4), is adopted; and 2) board-to-board fanout, including pad delays, is liberally penalized at a slowdown of 2 clock cycles.
Typical netlist fanout propagation slowdown, based on typical commercial circuits, is predicted based on 40-rank subnet depths, with 35 ranks executing in one clock cycle (all on-chip with complete fanout distribution), four ranks requiring two clocks (on-chip but delayed for dense fanout distribution), and one rank requiring two clocks (off-chip for wide ranks). This results in a predicted typical simulation slowdown of (4 × 1 clock) + (1 × 1 clock) = 5 clocks for a 40-rank net. Since the fastest possible simulation time for this net is 40 clocks, the simulation slowdown due to fanout propagation delay is 5/40 = 12.5%, which is less than 20%.
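The arithmetic of this estimate can be checked with a small Python sketch. The function name and the list-of-clocks representation are hypothetical. The rank counts are those of the 40-rank example above.

```python
def fanout_slowdown(clocks_per_rank):
    """Return (extra clocks, fractional slowdown) relative to one clock per rank."""
    ideal = len(clocks_per_rank)
    extra = sum(clocks_per_rank) - ideal
    return extra, extra / ideal


# 35 ranks at one clock, 4 dense on-chip ranks at two, 1 off-chip rank at two.
typical_net = [1] * 35 + [2] * 4 + [2] * 1
extra, fraction = fanout_slowdown(typical_net)
print(extra, fraction)
```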
System Timing
A cornerstone issue for cost-performance analysis, once accelerator functionality and architecture are defined, is "how fast can it go?" The answer can be derived in terms of the simulation clock period, which is a function of the system interconnect and logic topology delays derived from fabrication technology parameters. The required clock period can be bounded by calculating signal path delays through the PE cell active logic, and fanout propagation delays through the mesh interconnect.
Technology parameters for this research are based on currently reported fabrication process values applied to fundamental MOS relationships. In order to avoid current density problems and long signal delays along metal runs, metal width is currently being held at about 2 µm with 0.05 Ω/sq, which will hold wire delays at $t_{wire} \approx 1$ ns/cm. A system's clock period is proportional to the $\tau$ of its smallest devices, and so becomes proportionally smaller as devices scale smaller.
At a feature size of 0.6 µm, a device value of $\tau \approx 0.02$ ns is predicted [14] [15] [16], along with a system clock period of 2-4 ns. We will adopt the value $\tau = 0.03$ ns. The BEU and latch design, Figures 4-6 Figure 3, requires no gate delays for signal propagation. All configuration switches are set within a single gate delay during $\phi_1$. For a chain of PE cells forming a platform row, $\tau_{stage}$ is defined as the fundamental unit of on-chip fanout propagation, calculated at ~1.5 ns in [11]. For clock period measurement purposes, the switch-BEU-latch path delay can now be stated in terms of $\tau_{stage}$: $T_{switch\_BEU} = 2\,\tau_{stage} \approx 3$ ns.
Fanout signal propagation delay between successive net ranks is composed of zero or more rank-forward cell-to-cell wire delays and a single lateral cell-to-cell wire delay. Both delay directions may cross chip and/or board boundaries. All fanout propagation wire delays take place during $\phi_2$. If wire delays are set at 1 ns/cm, then a $\phi_2$ period of 3 ns would allow a fanout propagation length of up to (3 ns × 1 cm/ns) = 3 cm. Based on the above calculations, the netlist pre-processor should observe the following rules for assigning redundant ranks to distribute delay periods based on fanout route lengths. A simulation slowdown of one clock cycle, and therefore one redundant rank, will be assigned to the following fanout route configurations: every 16-cell segment of on-chip route length, every chip-to-chip crossing, and every board-to-board crossing.
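The assignment rules above can be sketched as a small Python helper. The function names are hypothetical, and one interpretation is assumed: only complete 16-cell segments of on-chip route length incur a redundant rank. The text does not say how a partial segment is treated. The 3 ns period and 3 cm length quoted above imply a wire delay of 1 ns/cm, which is used as the example input.

```python
CELLS_PER_SEGMENT = 16  # on-chip route length per redundant rank, from the text


def max_fanout_length_cm(phase_period_ns, wire_delay_ns_per_cm):
    """Longest route that fits in the fanout propagation phase."""
    return phase_period_ns / wire_delay_ns_per_cm


def redundant_ranks(on_chip_cells, chip_crossings, board_crossings):
    """One redundant rank per complete 16-cell segment (assumed), per
    chip-to-chip crossing, and per board-to-board crossing."""
    return (on_chip_cells // CELLS_PER_SEGMENT
            + chip_crossings
            + board_crossings)


print(max_fanout_length_cm(3.0, 1.0))   # 3 ns phase at 1 ns/cm
print(redundant_ranks(40, 1, 0))        # hypothetical 40-cell route, one chip hop
```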
In summary, allowing 2 ns for active path delay within the PE cell, and 3 ns for fanout propagation, the clock period could be set conservatively at 5 ns for the 0.6 µm technology.

Advancements of this architecture under study at The University of Texas include mapping of event driven simulation to the hardware for efficient execution, and implementation of timing delays in the data paths without adding excessive hardware. Although not currently being pursued, the architecture could easily be extended to simulate four valued logic by widening all data paths to two bits per signal, increasing the area for data paths by a factor of two and requiring a modest increase in die size. Simulation of asynchronous logic designs is an interesting extension which will require further analysis of the net preprocessing algorithm.
