We describe the design of Ravel-XL, a 
Overview
Ravel-XL is a hardware implementation of the Ravel logic simulation algorithm [6]. It operates as a dedicated co-processor to a general-purpose host via an interrupt-driven asynchronous interface. The prototype version of Ravel-XL will interface to the DEC TURBOchannel using the TURBOchannel Interface ASIC (TCiA) chip [3]. The host is expected to maintain the user interface to the simulation process, including downloading the simulation program and test vectors to Ravel-XL and reading back the resulting output waveforms and setup and hold violation reports.
In its current implementation, Ravel-XL can simulate circuits with up to four clock phases sharing a common cycle time. Gates are limited to a fanin of 16. Ravel-XL has been implemented in a 0.8-micron custom CMOS VLSI chip.
With an on-board 2K word data cache, the implementation uses approximately 900,000 transistors, requires a 256-pin package and occupies approximately 1.9 cm² of die area. The chip is currently being readied for fabrication, so final clock speeds are not known, though simulation results predict operation at 33MHz.
Ravel Overview
Ravel is a static ("oblivious" according to Lewis [5]) compiled logic simulator that can be distinguished from other static simulators by its ability to take gate delay into account. In addition, Ravel correctly models multiphase clocks and level-sensitive latches, and can detect setup and hold violations [6]. Ravel "compiles" a synchronous gate-level circuit into a C program whose execution simulates the logical and timing behavior of the circuit. This is accomplished by modeling each data signal xi in the circuit with a 4-tuple 
Hardware Solution
A hardware implementation of the Ravel algorithm should aim to reduce both the CPG and the cycle time needed to execute the simulation code. Reducing CPG is primarily an architectural exercise, whereas reducing cycle time is mainly achieved through careful logic, circuit, and layout design. We describe in this section three architectural choices that take advantage of the particular features of the Ravel-generated simulation code. Their impact on CPG is assessed in terms of a typical 3-input gate. Additional features of the implementation include a hold/setup violation handler, which is described in Section 3.4.
Dedicated Data Path
A dedicated data path offers three immediate advantages over a general-purpose instruction processor: 1) increased concurrency allowing the values and times in a 4-tuple to be computed simultaneously; 2) special functional units for min and max computations; and 3) a more efficient encoding of the data that reduces memory traffic.
The instruction set of the data path is summarized in Fig. 1. The primary operation of the data path is encoded in the CISC-style variable-length GEV instruction. This instruction consists of (n + 4) 16-bit fields: 1) an opcode specifying gate type and number of inputs; 2) n input address offsets; 3) the output address offset; and 4) gate delays δ and Δ. To minimize instruction size, addresses are specified as 16-bit offsets to 14-bit bases, providing a data space of one billion signals. The bases are kept in data segment registers which must be loaded (using the LDS instruction) whenever segment boundaries are crossed. Typically, these registers need to be loaded only once at the beginning of the simulation.
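The field layout described above can be illustrated with a minimal sketch. The function names, the argument order, and the way a segment base is combined with an offset (simple concatenation into a 30-bit address) are assumptions made for illustration; the text fixes only the field count, the field widths, and the size of the data space.

```python
from typing import List

FIELD_MAX = 0xFFFF   # every GEV field is 16 bits wide
BASE_BITS = 14       # width of a data segment base
OFFSET_BITS = 16     # width of an address offset


def encode_gev(opcode: int, input_offsets: List[int], output_offset: int,
               delay_a: int, delay_b: int) -> List[int]:
    """Return the (n + 4) 16-bit fields of one GEV instruction.

    Hypothetical helper: the order follows the text (opcode, n input
    offsets, output offset, two gate delays).
    """
    fields = [opcode, *input_offsets, output_offset, delay_a, delay_b]
    for value in fields:
        if not 0 <= value <= FIELD_MAX:
            raise ValueError(f"field {value} does not fit in 16 bits")
    return fields


def effective_address(segment_base: int, offset: int) -> int:
    """Combine a 14-bit base and a 16-bit offset (assumed: concatenation)."""
    if not 0 <= segment_base < (1 << BASE_BITS):
        raise ValueError("segment base does not fit in 14 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 16 bits")
    return (segment_base << OFFSET_BITS) | offset
```

Under this assumption the largest address is 2^30 - 1, which matches the stated data space of roughly one billion signals.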
A signal's 4-tuple in the software implementation is stored in four 32-bit words. In Ravel-XL each 4-tuple is encoded in a single 32-bit word: 2-bit fields are used for the signal's values, and 14-bit fields are used for their times (for a clock period of 50ns, 14 bits provide a 3ps resolution). The CPG for memory traffic can now be estimated. Assuming 32-bit-wide buses between memory and the function units, the evaluation of a 3-input gate requires 8 memory accesses. At a nominal 3-cycle access time this implies a CPG of 24.
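A short sketch of this packing follows. Two 2-bit value fields and two 14-bit time fields add up to exactly 32 bits. The field order within the word and the names `v0`, `t0`, `v1`, `t1` are assumptions for illustration only; the text does not specify them.

```python
VALUE_BITS = 2
TIME_BITS = 14
VALUE_MAX = (1 << VALUE_BITS) - 1
TIME_MASK = (1 << TIME_BITS) - 1


def pack_tuple(v0: int, t0: int, v1: int, t1: int) -> int:
    """Pack two (value, time) pairs into one 32-bit word.

    Assumed layout, most significant bit first: v0 | t0 | v1 | t1.
    """
    for v in (v0, v1):
        if not 0 <= v <= VALUE_MAX:
            raise ValueError("value does not fit in 2 bits")
    for t in (t0, t1):
        if not 0 <= t <= TIME_MASK:
            raise ValueError("time does not fit in 14 bits")
    high = (v0 << TIME_BITS) | t0   # upper 16 bits
    low = (v1 << TIME_BITS) | t1    # lower 16 bits
    return (high << 16) | low


def unpack_tuple(word: int):
    """Inverse of pack_tuple: return (v0, t0, v1, t1)."""
    high, low = word >> 16, word & 0xFFFF
    return (high >> TIME_BITS, high & TIME_MASK,
            low >> TIME_BITS, low & TIME_MASK)


def time_resolution_ps(clock_period_ns: float) -> float:
    """Time step represented by one count of a 14-bit time field."""
    return clock_period_ns * 1000.0 / (1 << TIME_BITS)
```

For a 50ns clock period, `time_resolution_ps(50)` evaluates to about 3.05, in line with the 3ps figure quoted above.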
The calculation of signal times can be completed in n -1 cycles if (2) presence of one or more controlling inputs, and all logical operations are performed bit-wise. Adding the two cycles needed to perform the calculation in (3) for our 3-input gate to the cycles required for memory access to the relevant code and data yields a total CPG of 26, a six-fold improvement over the software implementation.
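The arithmetic behind these figures can be reproduced with a small model. The split of the 8 memory accesses into code words and data words is our reading of the numbers, not something the text states explicitly: the (n + 4) 16-bit instruction fields are assumed to be fetched over the 32-bit bus, and each operand read or result write is assumed to cost one access.

```python
def memory_accesses(n_inputs: int, bus_bits: int = 32) -> int:
    """Accesses needed to evaluate one gate without caching or pipelining."""
    code_bits = (n_inputs + 4) * 16                    # GEV instruction size
    code_accesses = (code_bits + bus_bits - 1) // bus_bits  # round up
    data_accesses = n_inputs + 1                       # n operand reads, 1 write
    return code_accesses + data_accesses


def unpipelined_cpg(n_inputs: int, access_cycles: int = 3) -> int:
    """Cycles per gate: memory traffic plus (n - 1) compute cycles."""
    memory_cycles = memory_accesses(n_inputs) * access_cycles
    compute_cycles = n_inputs - 1
    return memory_cycles + compute_cycles
```

For a 3-input gate this gives 8 accesses, 24 memory cycles, and a total CPG of 26.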
Memory System Design

The foregoing analysis suggests that memory bandwidth must be increased in order to obtain further reductions in CPG. An obvious approach to increasing bandwidth is to use separate memories for code and data and to optimize each memory system for the access patterns it is most likely to experience. Examining the algorithm, we notice the following:
1. Operands are accessed randomly, with a large distance between writes and reads to the same location, making a register file difficult to utilize. Our solution was to use a large cache for the data. Since gate evaluation is done in level order, the inputs needed to evaluate a gate have a high probability of being in the cache, and consequently the number of cache misses is expected to be extremely small. A careful compilation will generate signal addresses in such a way that cache misses are reduced to a minimum. As long as the cache is reasonably larger than the number of gates per logic level, cache misses will not occur often. Using a cache, the average access time will be close to 1 clock cycle if a high hit rate is maintained. For simplicity, we use a write-through direct-mapped cache.
2. The code memory is accessed sequentially. This lack of temporal locality would make a cache completely useless for large circuits. We decided to use 4-bank interleaving [4] to access the code memory to obtain a single-cycle word read. While code is being fetched consecutively, interleaving allows code access at a rate of one word per clock cycle. However, every time code fetching is stalled, resumption requires 3 clock cycles before code becomes available. Again, since code access is completely sequential, a simple code pre-fetch algorithm can be used to hide these resumption delays.
Figure 2: Pipeline Operation for a 3-input GEV Instruction
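The data-cache argument can be explored with a toy model. This is a sketch under stated assumptions, not the chip's cache: one word per line, allocation on write, a Python dictionary standing in for data memory, and a hypothetical level-ordered circuit in which every gate reads two outputs of the previous level. Only the write-through, direct-mapped organization and the 2K-word size come from the text.

```python
class DirectMappedCache:
    """Write-through, direct-mapped cache with one word per line (assumed)."""

    def __init__(self, num_lines=2048):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.words = [0] * num_lines
        self.hits = 0
        self.misses = 0

    def _slot(self, addr):
        return addr % self.num_lines, addr // self.num_lines

    def read(self, addr, memory):
        index, tag = self._slot(addr)
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1                      # fill the line from memory
            self.tags[index] = tag
            self.words[index] = memory.get(addr, 0)
        return self.words[index]

    def write(self, addr, value, memory):
        index, tag = self._slot(addr)
        memory[addr] = value                      # write-through to data memory
        self.tags[index] = tag                    # assumed: allocate on write
        self.words[index] = value


def level_order_hit_rate(gates_per_level, levels, cache_lines=2048):
    """Read hit rate for a hypothetical levelized circuit (levels >= 2)."""
    memory = {}
    cache = DirectMappedCache(cache_lines)
    for g in range(gates_per_level):              # level 0: primary inputs
        cache.write(g, 0, memory)
    for level in range(1, levels):
        prev_base = (level - 1) * gates_per_level
        base = level * gates_per_level
        for g in range(gates_per_level):
            a = cache.read(prev_base + g, memory)
            b = cache.read(prev_base + (g + 1) % gates_per_level, memory)
            cache.write(base + g, a & b, memory)
    return cache.hits / (cache.hits + cache.misses)
```

In this model, `level_order_hit_rate(100, 10)` returns 1.0 because a whole level fits comfortably in the cache, while a level wider than the cache, for example `level_order_hit_rate(4096, 3)`, incurs misses.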
Pipelining
The above memory system is supported by a pipeline structure that overlaps memory access to code and data with the execution of the GEV instruction. While variable in length, the GEV instruction has a regular structure with the operand fields arranged in the order in which they will be needed in the execution unit. Pipelined operation can be optimized for any gate size (value of n); we chose to optimize it for 2- and 3-input gates, as they are typically the most common. This naturally leads to a 3-stage pipeline corresponding to the code fetch and decode (CODE), operand fetch (DATA), and execute and operand write (E&W) phases. As shown in Fig. 2, each instruction requires 4 cycles in each stage of the pipeline. For gates with more than 3 inputs, a certain latency is introduced in accessing the code memory.
Since the code generated by Ravel is sequential, the pipeline does not need to be flushed. The only data dependency that may exist in the pipeline occurs when the operand to be read is still being computed at the execute/write unit. Our implementation detects this data dependency and stalls the pipeline in such situations. In most circuits, however, a careful compilation will avoid such data dependencies almost entirely, so pipeline stalls are expected to be rare.
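The dependency check can be sketched as follows. The model assumes that only the immediately preceding instruction can still occupy the execute/write stage while the current instruction is fetching operands; the `Gate` representation and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# A gate evaluation: (input addresses, output address). Hypothetical form.
Gate = Tuple[Sequence[int], int]


def count_stalls(program: List[Gate]) -> int:
    """Count instructions that read the output of their direct predecessor."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        _, previous_output = previous
        current_inputs, _ = current
        if previous_output in current_inputs:
            stalls += 1   # operand is still being computed in E&W
    return stalls


# Signal 4 depends on signal 2, which is produced by the preceding gate.
naive_order = [((0, 1), 2), ((2, 3), 4), ((0, 3), 5)]
# Moving the independent gate in between removes the dependency.
reordered = [((0, 1), 2), ((0, 3), 5), ((2, 3), 4)]
```

Here `count_stalls(naive_order)` is 1 and `count_stalls(reordered)` is 0, which illustrates how instruction ordering at compile time can remove stalls without changing the simulation result.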
Assuming that we can design a memory system to support the above pipeline structure, the evaluation of a 3-input gate has been reduced to 4 cycles (CPG = 4). This represents a 40X improvement in performance over the software implementation. The pipeline, however, introduces a latency of 9 clock cycles for each gate evaluation.
Violation Processing
The E&W stage has a function unit that checks for setup and hold violations and stores this information in data memory for later retrieval by the host. Since timing violations do not affect the simulation process, violation reports are not stored in the data cache. Rather, they are stored in a separate FIFO and written directly back to data memory. The data memory interface arbitrates write requests from the cache and the violation handler giving the cache priority except when the FIFO is full.
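The arbitration rule can be stated as a small decision function. This is a sketch only: the FIFO depth of 8 used in the usage note and the returned labels are hypothetical, since the text gives just the priority rule.

```python
def arbitrate(cache_request: bool, fifo_len: int, fifo_depth: int) -> str:
    """Choose which unit may write to data memory in this cycle.

    The cache has priority unless the violation FIFO is full, in which
    case the FIFO is drained first so that no report is lost.
    """
    if fifo_len >= fifo_depth:
        return "violation"
    if cache_request:
        return "cache"
    if fifo_len > 0:
        return "violation"   # use an otherwise idle cycle to drain the FIFO
    return "idle"
```

For example, with an assumed depth of 8, `arbitrate(True, 3, 8)` grants the cache, while `arbitrate(True, 8, 8)` grants the violation handler.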
Implementation
System Decomposition
At the top level, the Ravel-XL chip is divided into four main communicating "boxes." Each box implements a portion of the system's functionality:
1. The Data Management Box (DMB) processes requests to the data memory. It contains the data cache, the data bus interface, and the violation handler.
2. The Fetch/Decode Box (FDB) fetches the instructions from code memory, decodes them, and fetches the required operands. FDB implements the first two stages of the pipeline, allows the host to write/read code, and contains the program counter for the simulation process.
3. The Gate Processing Box (GPB) is used only during the simulation process, and realizes the execute and write phase of the pipeline. It accepts decoded GEV and SEV instructions from the FDB and performs the requested computations. This box contains the dedicated data path where the simulation algorithm is implemented.
4. The Host Interface Box (HIB) implements the interface to the DEC TURBOchannel. It can perform reads/writes to the memories and to the registers and can start the simulation process. When a simulation is running, this box stalls and waits for the completion of the simulation.
Design Methodology
Architectural simulations of Ravel-XL were performed using a behavioral model written in the Verilog Hardware Description Language (HDL) [1]. This model was purposely partitioned into distinct data path and control sections to aid the synthesis process. Physical design was performed using a beta-release of the EPOCH logic and layout synthesis package [2]. EPOCH received its input in a synthesizable subset of Verilog-HDL: behavioral data path elements were manually converted into netlists of MSI macro-cells defined in the EPOCH library, while behavioral control modules were input directly from the architectural models. EPOCH performed logic synthesis on the behavioral control logic, and provided technology mapping for the library cells, as well as timing-driven placement, routing, and buffer and power rail sizing. The EPOCH static timing analyzer, TACTIC, was used to determine the critical path. The longest sensitizable path in the design was found to lie in the data path, and results in a maximum clock frequency of 33MHz.
Conclusions and Future Work
We now present implementation optimizations that will allow significant cycle time improvements in a future version of Ravel-XL. With these changes, Ravel-XL is expected to present a speedup of about 100 over Ravel. We also describe possible architectural improvements that may help reduce the current CPG.
Currently, the critical path resides in the gate evaluation data path (MDP), and is due to the comparators available in the EPOCH library, which have a delay of about 17ns. Our initial expectation was that the critical path would be in the cache, and the MDP does not exploit all possible parallelism. A simple optimization would be to evaluate more inputs at each cycle (by using more hardware), and use two cycles to evaluate the comparisons of the simulation algorithm. With these modifications, and with some more die area, we expect to reduce the cycle time to almost half, since all the other blocks in Ravel-XL present delays below 15ns.
Figure 3: Chip Floor Plan
In order to reduce CPG even further we envision using a three-port cache in a future version of Ravel-XL. This will speed up operand reads, and writes could be done concurrently. This solution will also require a faster read cycle to the code memory. A solution to achieve this is to access code through a RAMbus interface. These two improvements can reduce CPG to 3 or even 2 (for a 3-input gate), but control will be much more complicated.
Other improvements that are planned for the Ravel simulation algorithm include the ability to correctly simulate gated-clocks and tri-state busses. Most two-phase systems take advantage of gated-clocks to precisely control interface signal timing. This will require a notion of conditional execution (i.e. branching) in the algorithm, and will introduce significant complication to the hardware. We also intend to explore techniques of circuit partitioning that will allow multiple Ravel-XL accelerators to operate on a single circuit in parallel.
