Monte Carlo based SSTA serves as the golden standard against which alternative SSTA algorithms are validated, but it is seldom used in practice due to its high computation time. In this paper, we accelerate Monte Carlo based SSTA on an FPGA platform. A simple dataflow pipeline does not work well because it consumes an excessive number of FPGA logic slices. We leverage the recently proposed pattern matching method to identify common circuit structures, and use a mathematical programming formulation to explore the trade-off between performance and logic slice consumption. The proposed design provides two orders of magnitude speedup compared to the CPU-based implementation.
INTRODUCTION
As IC technology scales down to the nanometer range, manufacturing variations make Statistical Static Timing Analysis (SSTA) an essential step in timing sign-off for nanoscale circuit designs. Monte Carlo based SSTA generates a large number of samples, and then uses traditional STA to evaluate each sample. This method avoids the path selection problem of path-based methods and the inaccuracy of the statistical max operator in block-based methods. It provides the most accurate predictions of circuit delay distributions, and often serves as the golden standard to validate the accuracy of all the other methods. However, due to its high computation time, it is seldom used in practice. The work in [5] accelerates conventional Monte Carlo based SSTA using graphics processing units.
In this paper, we implement Monte Carlo based SSTA using FPGA hardware. In the SSTA formulation, the output arrival time of each gate is computed as the maximum value among the sums of each input pin's arrival time and the corresponding pin-to-output delay. This allows us to model the entire circuit as a data flow graph for static timing analysis, which consists of a large number of "max" and "sum" operations. The data flow graph can be configured in a pipelined fashion, so that ideally we can accept one STA sample evaluation in each clock cycle. However, this solution consumes a considerable amount of FPGA logic slices, and may not fit into a given FPGA. The challenge of our design is to fit the implementation into the limited FPGA area through extensive use of resource sharing.
In this paper, we leverage the recently proposed pattern matching technique [3] to identify the patterns in the mapped circuit netlist, and then use this information to share hardware resources among the same pattern instances. We further formulate the problem of allocating hardware resources for each pattern within the limited FPGA area bound as an optimization problem. By solving this optimization problem, we can obtain the maximum speedup and execution throughput (with the minimum pipeline initiation interval) on the limited FPGA area.
Our Monte Carlo based SSTA is currently implemented on one FPGA of the BEE3 hardware platform and it shows two orders of magnitude speedup compared to the CPU-based solutions. Note we can easily instantiate four SSTA instances on four FPGAs of the BEE3 hardware platform to obtain an additional 4X speedup.
MONTE CARLO BASED SSTA
Delay Model
Our delay model is the separate rise-fall delay model [7]. Each pin has its own pin phase, which is INV, NONINV, or UNKNOWN depending on whether the logic function is negative unate, positive unate, or binate in the corresponding input variable [7]. If the phase of an input pin is INV, the output rising (falling) arrival time of the signal from the input pin is equal to the sum of the input falling (rising) arrival time and the corresponding rising (falling) pin-to-output delay. Similarly, if the phase of the input pin is NONINV, the output rising (falling) arrival time of the signal from the input pin is equal to the sum of the input rising (falling) arrival time and the corresponding rising (falling) pin-to-output delay. If the phase of the input pin is UNKNOWN, the output rising (falling) arrival time of the signal from the input pin is equal to the maximum of the output rising (falling) arrival times of the INV case and the NONINV case. Delay is the falling delay from the input pin i to the output pin. Let the two input pins of the NOR2 gate be
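The phase rules above can be sketched in a few lines of Python. This is only an illustration of the rise-fall model as described in the text, not the authors' implementation; the names `pin_arrival` and `gate_arrival`, the `(rise, fall)` tuple layout, and all numeric values are hypothetical.

```python
def pin_arrival(phase, at_in, delay):
    """Output (rise, fall) arrival time contributed by one input pin.

    at_in and delay are (rise, fall) tuples; phase is "INV", "NONINV" or "UNKNOWN".
    """
    in_rise, in_fall = at_in
    d_rise, d_fall = delay
    inv = (in_fall + d_rise, in_rise + d_fall)      # output edge is the inverted input edge
    noninv = (in_rise + d_rise, in_fall + d_fall)   # output edge follows the input edge
    if phase == "INV":
        return inv
    if phase == "NONINV":
        return noninv
    if phase == "UNKNOWN":
        # Binate pin: take the worse of the two cases for each edge.
        return (max(inv[0], noninv[0]), max(inv[1], noninv[1]))
    raise ValueError(f"unknown phase: {phase}")


def gate_arrival(pins):
    """Gate output arrival time: element-wise max over all input pins."""
    outs = [pin_arrival(phase, at, d) for phase, at, d in pins]
    return (max(o[0] for o in outs), max(o[1] for o in outs))


# A NOR2-like gate: both pins are INV (negative unate). Values are made up.
nor2 = [("INV", (1.0, 2.0), (0.5, 0.25)), ("INV", (1.5, 1.0), (0.25, 0.5))]
print(gate_arrival(nor2))
```

Each INV or NONINV pin costs two additions (one per edge), and an UNKNOWN pin costs four additions and two maxima, which matches the operation counts used later when the netlist is turned into a data flow graph.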
Modeling of Delay Variations
In this paper, we assume that each pin-to-output delay follows a Gaussian distribution. Each pin-to-output delay has its own variation. We implement the Gaussian random number generator based on the piecewise linear approximation method [8].
Currently we assume that the Gaussian distributions of these pin-to-output delays are independent. In practice, these delays may have spatial correlations. We can modify the random number generator accordingly to accommodate the correlations.
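As a rough software illustration of the sampling step, the sketch below draws independent Gaussian pin-to-output delays and evaluates one 2-input gate per sample. It uses Python's standard `random.Random.gauss` as a stand-in for the piecewise linear hardware generator of [8]; the function name, arrival times, means, and standard deviations are all assumptions chosen for the example.

```python
import random
import statistics


def monte_carlo_gate(n_samples, seed=0):
    """Delay distribution at the output of a single 2-input gate."""
    rng = random.Random(seed)
    arrivals = (1.0, 1.2)                  # fixed input arrival times (made up)
    delays = ((0.5, 0.05), (0.4, 0.08))    # (mean, sigma) of each pin-to-output delay
    results = []
    for _ in range(n_samples):
        # One STA sample: one "sum" per pin, then one "max" across pins.
        results.append(max(at + rng.gauss(mean, sigma)
                           for at, (mean, sigma) in zip(arrivals, delays)))
    return results


samples = monte_carlo_gate(10_000)
print(statistics.mean(samples), statistics.stdev(samples))
```

Because every sample is independent of the others, the loop body is what the FPGA design pipelines: a new sample can enter the data flow graph before the previous one has finished.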
OVERVIEW OF DESIGN FLOW
The input of the SSTA is a mapped circuit netlist, which is represented as a directed acyclic graph (DAG) where each node represents a logic cell. For example, a nine-node circuit is shown in Figure 1 (a). If the mapped circuit netlist is directly transformed into a data flow graph, it consumes excessive area. To reduce the area, we use pattern matching to identify the patterns in the mapped circuit netlist. During the pattern matching process, we find patterns that cover the entire circuit without overlapping. These patterns form a pattern-based DAG. It is important that the patterns do not create cycles, since that would cause problems for delay propagation. An example of the pattern conversion is shown in Figure 1. Through pattern matching, we find a three-node pattern P1 as shown in Figure 1 (b), which has three pattern instances: P1_1, P1_2, P1_3. These pattern instances form a pattern-based DAG as shown in Figure 1 (c). Details about pattern matching are discussed in Section 4.
The pattern found above is the three-node circuit shown in Figure 2. From the delay model, the operations used in STA are "max" and "sum". This allows us to model the three-node circuit as a data flow graph. In the three-node circuit, each node has two inputs. Because the output arrival time of the signal from each input is the sum of the input arrival time and the corresponding pin-to-output delay, each pin of a 2-input node corresponds to one "sum" operation in the data flow graph. The "max" operator chooses the larger output arrival time of the signals from the two input pins. In our example, we assume that the pin phase is INV or NONINV. If the pin phase is UNKNOWN, then each pin corresponds to two "sum" operations and one "max" operation. Using this method, we can apply our delay model to the mapped circuit netlist and transform it into a data flow graph.
In the pattern-based DAG, there is a set of patterns, and each pattern has a number of instances. Patterns provide coarse-grain sharing opportunities that facilitate circuit reuse. We treat each pattern as a unique type of functional unit, and we may allocate multiple instances of one type of functional unit. The optimization step determines the exact number of instances of each functional unit type that we should allocate. For the simple example shown in Figure 1, we can allocate one, two, or three instances of the functional unit for pattern P1, each with a different area/performance trade-off. A mathematical programming formulation is used in Section 5 to address the trade-off.
Figure 3 describes the key steps in the design flow. We leverage state-of-the-art C-to-FPGA tools to generate the RTL design. The input C code to the C-to-FPGA tools is generated from the pattern-based DAG and the data flow graph of each pattern.
Note the resource allocation part generates some resource constraints for the C code, so that C-to-FPGA tools can take the input C code as well as the resource constraints to generate an efficient RTL implementation.
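To make the mapping from a pattern to "sum" and "max" operations concrete, here is a minimal Python sketch of a three-node pattern as a data flow function. The topology (two 2-input nodes feeding a third) is an assumption, since Figure 2 is not reproduced here; the sketch also uses a single delay value per pin instead of separate rise and fall values, and the names `node` and `pattern_p1` are hypothetical. The generated input to the C-to-FPGA tool would be C code rather than Python.

```python
def node(a0, a1, d0, d1):
    """One 2-input node with INV/NONINV pins: two "sum" operations and one "max"."""
    return max(a0 + d0, a1 + d1)


def pattern_p1(inputs, delays):
    """Three-node pattern: nodes n1 and n2 feed node n3 (assumed topology).

    inputs: four primary-input arrival times; delays: six pin-to-output delays.
    """
    i0, i1, i2, i3 = inputs
    n1 = node(i0, i1, delays[0], delays[1])
    n2 = node(i2, i3, delays[2], delays[3])
    return node(n1, n2, delays[4], delays[5])


print(pattern_p1((0.0, 1.0, 2.0, 0.5), (1.0, 1.0, 0.5, 1.0, 1.0, 0.5)))
```

Under these assumptions one functional unit for the pattern contains six adders and three comparators, and every pattern instance that shares the unit feeds it through multiplexers on its four inputs.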
PATTERN MATCHING
As noted above, a flat data flow graph representation of the netlist cannot easily fit into the FPGA. Because FPGA resources are limited, the final performance of our approach is determined by the area optimization algorithm. We apply the recent pattern matching technique in [3] to find patterns in the circuit netlist. These patterns provide coarse-grain sharing opportunities for design reuse and area reduction. The pattern matching algorithm in [3] has two phases: pattern enumeration and pattern selection. The pattern enumeration process discovers a set of patterns based on the structure information of the circuit netlist. The pattern selection algorithm attempts to find a set of pattern instances that cover the entire pattern-based DAG and maximize the area reduction. A greedy algorithm is used for pattern selection. At each step, the best pattern is chosen based on an area-reduction metric, and all its pattern instances are removed from the pattern-based DAG. The process is repeated until no available patterns remain. Note that any available pattern that shares common nodes with patterns that have already been selected is discarded, as we do not allow pattern overlap.
The area-reduction metric $gain(P)$ represents the area saved by allocating one functional unit instance of pattern $P$. It depends on $\#inst(P)$, the number of pattern instances, and on $overhead_{area}$, the area overhead of the multiplexers that are newly introduced when functional units are shared among pattern instances. The area overhead has a linear relation with $port(P)$, the number of inputs of pattern $P$, and scales with the estimated area of a 1-input multiplexer.
The greedy pattern selection process is efficient enough to find good pattern candidates.
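The greedy selection loop can be sketched as follows. The exact form of the area-reduction metric is not recoverable from the text, so the `gain` function below is an assumed form (area saved by sharing one unit among all instances, minus a multiplexer overhead that is linear in the number of inputs). The sketch also filters out overlapping instances one by one, which is a simplification of discarding whole patterns; all names and numbers are hypothetical, and this is not the algorithm of [3] itself.

```python
def gain(n_inst, area, n_ports, mux_area):
    """Assumed metric: area saved by sharing, minus multiplexer overhead."""
    return (n_inst - 1) * area - n_ports * mux_area


def greedy_select(patterns, mux_area=1.0):
    """patterns maps name -> (area, n_ports, [instance, ...]); an instance is a frozenset of node ids."""
    covered = set()
    chosen = []
    remaining = dict(patterns)
    while remaining:
        # Keep only instances that do not touch nodes that are already covered.
        live = {}
        for name, (area, ports, insts) in remaining.items():
            free = [inst for inst in insts if not (inst & covered)]
            if free:
                live[name] = (area, ports, free)
        if not live:
            break
        best = max(live, key=lambda n: gain(len(live[n][2]), live[n][0], live[n][1], mux_area))
        taken = []
        for inst in live[best][2]:
            if not (inst & covered):        # also skips overlaps within the same pattern
                covered |= inst
                taken.append(inst)
        chosen.append((best, taken))
        del live[best]
        remaining = live
    return chosen


# Nine-node circuit: P1 has three disjoint three-node instances, P2 two two-node instances.
patterns = {
    "P1": (3.0, 4, [frozenset({1, 2, 3}), frozenset({4, 5, 6}), frozenset({7, 8, 9})]),
    "P2": (2.0, 3, [frozenset({2, 4}), frozenset({5, 7})]),
}
selection = greedy_select(patterns)
print([(name, len(insts)) for name, insts in selection])
```

In this example P1 has the larger assumed gain and covers all nine nodes, so every instance of P2 overlaps an already covered node and P2 is dropped.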
RESOURCE SHARING & ALLOCATION
The allocation of the number of instances of all functional units plays a critical role in the tradeoff between performance and resource consumption. In this section, we discuss the problem of functional unit allocation in the context of pipelining with patterns. The goal is to obtain the highest possible throughput under a limited resource budget. The problem we discuss in this section can be described as follows.
Given: a target circuit that is covered by patterns $\{P_1, P_2, \ldots, P_N\}$. For each pattern $P_i$, the amount of resource consumed by a corresponding functional unit is denoted as $r_i$. The number of instances of each pattern $P_i$ is denoted as $Inst_i$. The total resource budget is denoted as $R_{budget}$.
Output: the number of allocated functional unit instances $s_i$ for each pattern $P_i$, so that the total resource consumed by these patterns does not exceed $R_{budget}$.
Performance Model
The performance of Monte Carlo based SSTA is mainly determined by the initiation interval $II$ (or equivalently, the throughput). In general, $II$ is limited by two factors: recurrence and resource. In our implementation, one STA sample is computed in one iteration, and different iterations are completely independent, so only resource needs to be considered. Each functional unit is pipelined, and the initiation interval of each functional unit is one. Since the $Inst_i$ instances of pattern $P_i$ share $s_i$ functional unit instances, a lower bound of $II$ considering the $i$-th type of functional unit is given as follows:
$$II \ge \left\lceil \frac{Inst_i}{s_i} \right\rceil$$
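As a small worked example, the snippet below evaluates the resource-limited bound for pattern P1 of Figure 1, which has three instances. It assumes the bound takes the form ceil(Inst_i / s_i), which follows from each pipelined functional unit accepting one pattern instance per cycle; the function name is hypothetical.

```python
def ii_lower_bound(inst, s):
    """Resource-limited initiation interval: max over patterns of ceil(Inst_i / s_i)."""
    return max((n + k - 1) // k for n, k in zip(inst, s))


# Pattern P1 has three instances; try one, two, or three functional units.
for s1 in (1, 2, 3):
    print(s1, ii_lower_bound([3], [s1]))
```

With one unit a new sample can start every three cycles, with two units every two cycles, and with three units every cycle, which is the area/performance trade-off the optimization step explores.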
Resource Model
The total resource usage is estimated from the resources required by every allocated functional unit instance and by the multiplexers:
$$R = \sum_{i=1}^{N} \left( s_i \cdot r_i + R_{mux,i} \right)$$
where $R_{mux,i}$ is the resource required by the multiplexers for pattern $P_i$. On a typical FPGA, the resource required to implement a multiplexer grows at least linearly with the total bitwidth of the inputs of the multiplexer. For pattern $P_i$, $Inst_i$ pattern instances share $s_i$ corresponding functional unit instances. Since large multiplexers are undesirable, an optimized sharing solution is often balanced, i.e., the sizes of the multiplexers for different functional unit instances of pattern $P_i$ should be almost equal. Thus, we use $Inst_i / s_i$ as an estimate of the number of pattern instances of pattern $P_i$ sharing one functional unit instance. Then, at each port of the functional unit instance, an $n$-to-1 multiplexer is needed, where $n \le Inst_i / s_i$, as some operands from different operations may come from the same data source. Suppose the total bitwidth of the input ports of the functional unit for pattern $P_i$ is $b_i$. We assume that the size of a $b_i$-bit $n$-to-1 multiplexer is proportional to $n$ and $b_i$. Thus, the resource required by the multiplexers for the $s_i$ functional unit instances of pattern $P_i$ is:
$$R_{mux,i} = c \cdot n \cdot b_i \cdot s_i$$
where $c$ is a constant. To estimate $n$, we introduce $\alpha$, so that
$$n = \alpha \cdot \frac{Inst_i}{s_i}$$
Optimization
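With the performance and resource models discussed above, we can formulate the problem as a mathematical program:
$$\begin{aligned} \text{minimize} \quad & II \\ \text{subject to} \quad & II \ge \left\lceil \frac{Inst_i}{s_i} \right\rceil, \quad i = 1, \ldots, N \\ & \sum_{i=1}^{N} \left( s_i \cdot r_i + R_{mux,i} \right) \le R_{budget} \end{aligned}$$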
In the above formulation, $II$ and $s_i$ are integer variables and all others are constants. The allocation problem formulated above has been extensively studied. It can be solved by using the incremental algorithm in [4]. At each step, the incremental algorithm selects the pattern whose increment yields the maximum efficiency (performance gain/area cost) and increases its number of allocated functional unit instances by one. Clearly, this is a polynomial-time algorithm.
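The incremental idea can be sketched as a simple greedy loop. This is a simplified variant, not the algorithm of [4]: instead of ranking all patterns by performance gain over area cost, it always grows the pattern that currently limits II, since only that increment can improve throughput. It also assumes the multiplexer overhead has already been subtracted from the budget; the function and parameter names are hypothetical.

```python
def allocate(inst, r, budget):
    """Greedy functional-unit allocation under a resource budget.

    inst[i]: number of instances of pattern i
    r[i]:    resource of one functional unit for pattern i
    budget:  resource available for functional units (mux overhead assumed deducted)
    Returns (s, II), where s[i] is the number of allocated units for pattern i.
    """
    n = len(inst)
    s = [1] * n                      # every pattern needs at least one unit
    used = sum(r)
    if used > budget:
        raise ValueError("budget too small for one unit per pattern")

    def bound(i):
        return (inst[i] + s[i] - 1) // s[i]   # ceil(Inst_i / s_i)

    while True:
        i = max(range(n), key=bound)          # pattern that currently limits II
        if s[i] >= inst[i] or used + r[i] > budget:
            break                             # II cannot be reduced any further
        s[i] += 1
        used += r[i]
    return s, max(bound(i) for i in range(n))


print(allocate([3, 6], [10, 5], 30))
print(allocate([3, 6], [10, 5], 35))
```

Each iteration adds one functional unit, and no pattern ever receives more units than it has instances, so the loop runs at most sum(Inst_i) times, which is consistent with the polynomial running time noted above.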

EXPERIMENTAL RESULTS
Experiment Setup
Our experimental flow is shown in Figure 4 . Before SSTA can be computed on a benchmark, the circuit is first converted into a mapped netlist by using ABC [6] and the standard cell library, mcnc.genlib, from the SIS distribution [7] . We write a standalone program which parses the netlist, extracts the patterns and dumps out the C file description of the pattern-based DAG automatically. Then we use the state-of-the-art C-to-FPGA compilation tool AutoPilot [1] , which can take C/C++/SystemC and some design hints/constraints as input, and generate synthesizable and optimized RTL.
We use the BEE3 hardware platform developed by BEEcube Inc. [2] in our actual implementation. The BEE3 hardware platform has four Xilinx Virtex-5 LX155T FPGAs. Currently, we use only one of the four. Note that it is straightforward to run four SSTA instances on four FPGAs to obtain an additional 4X speedup. We use the Xilinx ISE tool to generate the NGC file from the RTL. The NGC file is used to generate custom IP for the BEEcube Platform Studio.
Speedup Measurement
In order to evaluate the performance of our FPGA implementation, we implemented the same Monte Carlo based SSTA on a 2GHz Xeon CPU with 4 GB of memory running Linux. This implementation provides a reference point, in terms of performance and correctness, for our FPGA implementation.
The test circuits were chosen from the ISCAS85, ISCAS89 and MCNC benchmark suites. Compared to our software implementation, our FPGA implementation provides a speedup ranging from 30X to 52X. The data is shown in Table 1.
Comparison with GPGPU implementation
Paper [5] presents an implementation of Monte Carlo based SSTA using Nvidia GPUs. Their implementation is up to 2X faster than ours (for circuit C499, the FPGA implementation is faster). This is partly because sample-level parallelism is a good fit for SIMD GPU architectures, while our FPGA implementation mainly uses the node-level parallelism of the circuit together with loop pipelining. Note that we used only one Virtex-5 LX155T FPGA. We expect a larger speedup if we use multiple FPGAs or a larger FPGA, e.g., the LX330T. Moreover, the FPGA implementation is much more power efficient than the GPGPU implementation.
CONCLUSIONS AND ONGOING WORK
In this paper, we accelerate Monte Carlo based SSTA using FPGAs. We consider the entire circuit as a data flow graph and exploit the FPGA's ability to pipeline it. We apply the techniques of pattern matching and pattern recognition to fit the custom design into a given FPGA. We obtained two orders of magnitude speedup using a single FPGA of the BEE3 hardware platform.
Currently, our implementation is circuit-specific. That is, for each input netlist, we need to generate a different FPGA implementation for acceleration. To make our implementation independent of the input circuits, we are building a special-purpose processor with some elementary SSTA computing patterns implemented in FPGAs. Using this approach, we expect to handle resource sharing in a more unified and flexible way.
