I. INTRODUCTION
The trend of scaling in CMOS VLSI processes has led to designs with millions of transistors. As a result, functional verification in the pre-silicon stage is no longer sufficient, and some bugs may escape into the post-silicon stage. Limited observability of internal signals is a major obstacle in post-silicon validation, and a number of solutions have been proposed to tackle this problem. The type of error to be located dictates the type of information that must be acquired from the design. The following are the different types of errors encountered during the post-silicon stage [1]:
• Circuit bugs: These bugs arise from circuit mismatches among different levels of abstraction, and from process variation and signal integrity effects in deep sub-micron technologies.
• Logic bugs: The data acquired by the DFD (Design for Debug) hardware can be used to identify logic bugs, which are functional errors that have escaped the pre-silicon verification stage. One method for acquiring functional data in silicon is the scan chain based technique. The problem here is that normal circuit operation is halted and the circuit has to be operated in scan mode, which prevents data from being acquired in real time. Because functional bugs can span thousands of clock cycles [2], it is essential to keep the circuit in normal operation during scan dumps. Hence, to acquire debug data effectively, the trace-based technique is used, which employs on-chip memories for at-speed data sampling.
• System bugs: These types of bugs exist among multiple cores in an SOC. Bugs that occur when software is running on the system and the different cores are interacting with each other require data to be acquired from the interrelated cores. Hence the Design for Debug hardware has to differ from the hardware required for logic bugs in one core.
As discussed, scan-based techniques are not suited to acquiring debug data in real time, because normal operation of the circuit has to be halted and the circuit has to be operated in scan mode. To acquire data in real time, the trace buffer based technique [3] is used. The trace buffer width limits the number of signals that can be traced, and the trace buffer depth limits the number of samples that can be stored. Hence, there is a need for an automated way to determine the signals to be traced, such that the maximum amount of data (combinational and sequential nodes) can be reconstructed based on the data acquired by the trace buffer. This has to be done in such a way that a postprocessing algorithm can use the resulting data to identify design bugs.
Currently, trace signal selection in industry is primarily manual. The choice of signals is guided by the designer's experience and intuition (for example, trace signals are selected from hardware blocks that encountered more bugs during the pre-silicon stage). Due to the lack of techniques for qualifying observability value, the inadequacy of the selected trace signals shows up only during silicon debug, usually in the form of observability holes that make it difficult to identify, diagnose and root-cause an observed failure. At this stage, however, new trace signals cannot be selected. The inability to observe, validate and debug at the post-silicon stage results in costly escapes or silicon respins.
Research in post-silicon validation has attempted to address this problem by automating the selection of trace signals. The key idea is to select a set of signals S that maximizes state restorability (the set of states that can be reconstructed by observing S). Most of the work in this domain [4, 5, 6, 7] defines a metric based on the circuit structure that estimates the state restoration capability of a set of signals, and then uses this estimate to converge to a candidate set of trace signals. Chatterjee et al. [8] designed a simulation-based approach that performs better than the structure-based approach but incurs considerable computational overhead. Li et al. [9] designed a hybrid approach combining the metric-based and simulation-based approaches; however, they use simulation only for a small set of signals and consequently sacrifice restoration quality. Rahmani et al. [10] designed an approach based on two components: (1) an iterative approach to signal selection based on mock simulations and (2) a filtering scheme based on Integer Linear Programming (ILP) to refine the selected set. However, their signal selection algorithm is greedy, which limits the restoration quality.
In this work we have developed a novel simulation-based approach for selecting trace signals. The popular metric for measuring the quality of a set of trace signals is the restoration ratio: Restoration Ratio = (No. of traced and restored values)/(No. of traced values). The objective of our work is to maximize this metric, and to do so we use the simulated annealing heuristic to select trace signals. The idea follows from the observation that trace signal selection can be viewed as a bi-partitioning problem: the set of flip-flops tapped onto the trace buffer is one partition, and the set of all other flip-flops in the design is the other. Another key contribution of this paper is an improvement to the state restoration algorithm developed in [4]. In that algorithm the amount of restoration depends on the order of selection of flip-flops in the trace signal list, which leads to a lower estimate of the state restoration ratio in some cases. We have developed an improved state restoration algorithm that solves this problem; using it together with our simulated annealing based trace signal selection method leads to a higher quality of restoration.
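As a minimal sketch of the restoration ratio defined above, the following Python function computes the metric from two counts. The function name and the way the counts are passed in are assumptions made for illustration; the paper does not prescribe an implementation.

```python
def restoration_ratio(num_traced: int, num_restored: int) -> float:
    """(traced + restored values) / (traced values), as defined in the text."""
    if num_traced <= 0:
        raise ValueError("at least one traced value is required")
    return (num_traced + num_restored) / num_traced


# Illustrative numbers only: 8 flip-flops traced for 64 cycles give 512 traced
# values; suppose 1024 further values are restored from them.
example_ratio = restoration_ratio(8 * 64, 1024)
```

A ratio of 1.0 means that nothing beyond the traced values themselves was recovered.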
The remainder of this paper is organized as follows. In Section 2 we explain the process of state restoration and the algorithm designed for state restoration in [4]. We also show that this algorithm has a drawback, and we propose an improved version of the state restoration algorithm that solves the issue. In Section 3 we propose our novel simulated annealing based technique for selecting trace signals. In Section 4 we describe the experiments performed and present the corresponding results and analysis. We conclude the paper in Section 5.
II. ALGORITHMIC SOLUTION FOR STATE RESTORATION
The accompanying figure shows an example circuit that uses a trace buffer and is used to explain the restoration process. The trace buffer width is set to 1, meaning that only one flip-flop is connected to the trace buffer. In this example flip-flop C is traced, and just by recording the values of flip-flop C at different clock cycles we are able to restore the values of other flip-flops in the circuit. Note also that the number of signals restored depends on which flip-flop is traced: if flip-flop E were traced, no other signal would be restored. Hence, to maximize observability at the post-silicon stage, a clever and automated methodology has to be used to select the trace signals. The basic idea of state restoration is to forward propagate and backward justify the traced values using Boolean inference.
A. Principal Operations for State Restoration
Any combinational logic can be decomposed into primitive two-input gates (and, or, exor, nand, nor, exnor). The algorithm proposed in [4] applies two basic operations to the logic gates of the translated circuit (i.e., after the combinational logic has been decomposed into two-input gates): forward propagation and backward propagation. Forward propagation is applied to a gate when its input values are known, and the output is computed with the help of Boolean algebra. Backward propagation is applied to a gate when its output value is known, and the input is computed with the help of Boolean algebra. This is comparable to what is done in functional simulators. The accompanying figure shows examples of forward propagation and backward propagation.
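The two operations can be sketched in Python using three-valued logic, where None stands for an unknown value. The function names, the gate table and the encoding are assumptions made for illustration and are not taken from [4].

```python
from typing import Optional

Bit = Optional[int]  # 0, 1, or None for "unknown"

# Each gate maps to (base function, output inversion flag).
_GATES = {"and": ("and", 0), "nand": ("and", 1),
          "or": ("or", 0), "nor": ("or", 1),
          "exor": ("exor", 0), "exnor": ("exor", 1)}


def forward(gate: str, a: Bit, b: Bit) -> Bit:
    """Infer the output of a two-input gate from possibly unknown inputs."""
    base, inv = _GATES[gate]
    if base == "and":
        out = 0 if 0 in (a, b) else (1 if a == 1 and b == 1 else None)
    elif base == "or":
        out = 1 if 1 in (a, b) else (0 if a == 0 and b == 0 else None)
    else:  # exor needs both inputs
        out = None if a is None or b is None else a ^ b
    return None if out is None else out ^ inv


def backward(gate: str, out: Bit, other: Bit) -> Bit:
    """Infer one input from the gate output and the (possibly unknown) other input."""
    if out is None:
        return None
    base, inv = _GATES[gate]
    out ^= inv  # undo the output inversion of nand/nor/exnor
    if base == "and":
        if out == 1:
            return 1                       # both inputs must be 1
        return 0 if other == 1 else None   # output 0 and other input 1
    if base == "or":
        if out == 0:
            return 0                       # both inputs must be 0
        return 1 if other == 0 else None   # output 1 and other input 0
    return None if other is None else out ^ other
```

For instance, forward("and", 0, None) yields 0 because a single controlling input decides the output, while backward("and", 0, None) stays unknown because either input could be the 0.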
B. Algorithm for State Restoration and Significance of the Order of Signal Selection
The state restoration algorithm proposed by the authors of [4] is given below.
Input: Circuit Graph, Trace_Signal_List
Output: Circuit Graph with restored data
search_list = Trace_Signal_List;
while search_list is not empty do
    cur_node = first node in search_list;
    for each (parent_node of cur_node) do
        BackwardOperation (cur_node, parent_node);
        if (new data are restored for parent_node) then
            Put parent_node at end of search_list
    for each (child_node of cur_node) do
        ForwardOperation (cur_node, child_node);
        if (new data are restored for child_node) then
            Put child_node at end of search_list
    Delete cur_node from search_list
We found that the original algorithm for state restoration is incomplete: the amount of restoration depends on the order of selection of flip-flops in the trace signal list. This is best explained with an example, in which restoration is carried out for a specific window of 64 clock cycles. In Figure 3, both flip-flops are selected to be part of the trace signal list, and the order in which they appear in the list determines the amount of restoration.
In Case 1, the flip-flop with output port K is chosen first and the flip-flop with output port Z is chosen next. In Fig 3 (a), the restoration process begins and node P is restored by forward propagation. The next node in the search_list is Z, which restores node Q by backward propagation, as shown in Fig 3 (b). The next node in the search_list is P, which applies forward propagation and restores node A, as shown in Fig 3 (c). Node Q is then at the front of the search_list and restores node B by backward propagation, as shown in Fig 3 (d).
In Case 2, the flip-flop with output port Z is chosen first and the flip-flop with output port K is chosen next. In Fig 4 (a), the restoration process begins and node Q is restored by backward propagation. The next node in the search_list is K, which restores node P by forward propagation, as shown in Fig 4 (b). The next node in the search_list is Q, which applies backward propagation and fails to restore anything. Node P is then at the front of the search_list and restores node A by forward propagation, as shown in Fig 4 (c). Finally, node A applies forward propagation and is unable to restore anything, and the final state of the circuit is as shown in Fig 4 (d).
Here node B is not restored, although it should have been restored from the knowledge of nodes A and Q (as seen in Case 1). Hence we conclude that the original restoration algorithm is incomplete, as the amount of restoration depends on the order of selection of flip-flops.
C. Improved Algorithm for State Restoration
As discussed in the previous subsection, the original state restoration algorithm is incomplete: the amount of restoration obtained differs depending on the order in which flip-flops are pushed into the trace signal list. The algorithm below solves this problem. In the improved version of the algorithm, if the output node is already known during forward propagation, a backward operation is carried out in an attempt to restore the sister node of the current node. Note that this only has to be done for two-input gates, as there is no need for it in a NOT gate. Fig 5 revisits the example from the previous subsection and shows how the dependence on the order of selection of flip-flops is removed. As discussed previously, if the flip-flop with output port Z is chosen first, node B is never restored by the original restoration algorithm. With the improved restoration algorithm, when node A is at the front of the search_list it detects that the output node of the gate (node Q) is already known, after which a backward propagation operation is applied to restore the sister node of the current node (node B). This operation successfully restores the value of node B for 63 clock cycles. In this way the dependence on the order of selection of flip-flops is eliminated.
Input: Circuit Graph, Trace_Signal_List
Output: Circuit Graph with restored data
search_list = Trace_Signal_List;
while search_list is not empty do
    cur_node = first node in search_list;
    for each (parent_node of cur_node) do
        BackwardOperation (cur_node, parent_node);
        if (new data are restored for parent_node) then
            Put parent_node at end of search_list
    for each (child_node of cur_node) do
        ForwardOperation
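The effect of the sister-node step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a single time frame, only AND/OR gates, a flag restore_sibling that switches between the original and improved behaviour, and a hypothetical two-gate circuit with nets K, C, A, B and Q that is not the circuit of the paper's figures. It is not the authors' implementation.

```python
from collections import deque


def forward(gate, a, b):
    """Output of a two-input AND/OR gate; None means unknown."""
    ctrl = 0 if gate == "and" else 1           # controlling input value
    if ctrl in (a, b):
        return ctrl
    return None if None in (a, b) else 1 - ctrl


def backward(gate, out, other):
    """One input of an AND/OR gate from its output and the other input."""
    ctrl = 0 if gate == "and" else 1
    if out is None:
        return None
    if out != ctrl:                            # e.g. AND output 1 forces both inputs to 1
        return 1 - ctrl
    return ctrl if other == 1 - ctrl else None


def restore(gates, known, trace_list, restore_sibling):
    """gates: (type, in_a, in_b, out) tuples; known: net -> 0/1 for traced nets."""
    values = dict(known)
    search = deque(trace_list)
    while search:
        cur = search.popleft()
        for gate, a, b, out in gates:
            if out == cur:                     # backward operation on the gate driving cur
                for pin, sib in ((a, b), (b, a)):
                    if pin not in values:
                        v = backward(gate, values[cur], values.get(sib))
                        if v is not None:
                            values[pin] = v
                            search.append(pin)
            if cur in (a, b):                  # forward operation on a gate fed by cur
                sib = b if cur == a else a
                if out not in values:
                    v = forward(gate, values.get(a), values.get(b))
                    if v is not None:
                        values[out] = v
                        search.append(out)
                elif restore_sibling and sib not in values:
                    # Improved step: output already known, so justify the sister input.
                    v = backward(gate, values[out], values[cur])
                    if v is not None:
                        values[sib] = v
                        search.append(sib)
    return values


# Hypothetical circuit: A = K or C, Q = A and B; K and Q are traced.
gates = [("or", "K", "C", "A"), ("and", "A", "B", "Q")]
known = {"K": 1, "Q": 0}
original_q_first = restore(gates, known, ["Q", "K"], restore_sibling=False)
original_k_first = restore(gates, known, ["K", "Q"], restore_sibling=False)
improved_q_first = restore(gates, known, ["Q", "K"], restore_sibling=True)
```

In this sketch the original variant restores B only when K precedes Q in the trace list, whereas the improved variant restores B in either order.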
III. SIMULATION BASED APPROACH TO SELECT TRACE SIGNALS
A common way to reduce effort in simulation-based estimation is to perform several short simulations and average their outcomes. Chatterjee et al. [8] proposed the use of a shorter trace buffer depth, and showed that the variation in state restoration ratio is negligible beyond a trace buffer depth of 64. The trace signal selection problem can be viewed as a bi-partitioning problem: the first partition is the set of flip-flops that will be recorded by the trace buffer, and the other partition is the set of all other flip-flops in the design. This insight gave us the idea of using the simulated annealing heuristic for this problem.
Step 1: Initialize: start with a random initial partition.
Step 2: Move: perturb the partition through a defined move (swap one flip-flop in the trace buffer set with another in the non-trace buffer set).
Step 3: Calculate cost: calculate the change in the score (restoration ratio) due to the move. Depending on the change in score, accept or reject the move.
A. Initialization
The initialization step creates a random initial partition. There are two partitions: the first contains all the flip-flops that will be tapped onto the trace buffer, and the second contains all other flip-flops in the design. The size of the first partition depends on the trace buffer width (usually 8, 16 or 32), and the size of the second partition depends on the number of flip-flops in the design. In the initialization step, a number of flip-flops equal to the trace buffer width is randomly selected to form the first partition; all other flip-flops are placed in the second partition. The state restoration ratio is then evaluated for the set of flip-flops in the first partition. This metric serves as the cost function for the simulated annealing heuristic.
B. Move Function
In the move function the partition is perturbed through a defined move. One flip-flop in the trace buffer set is moved to the non-trace buffer set. Some other flip-flop in the non-trace buffer set is moved to the trace buffer set. The selection of these flip-flops is done randomly.
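A minimal Python sketch of such a swap move is shown below. The function name, the list-based representation of the two partitions and the flip-flop names are assumptions made for illustration.

```python
import random


def swap_move(traced, untraced, rng):
    """Return new (traced, untraced) lists with one randomly chosen pair swapped."""
    i = rng.randrange(len(traced))      # flip-flop leaving the trace buffer set
    j = rng.randrange(len(untraced))    # flip-flop entering the trace buffer set
    new_traced, new_untraced = list(traced), list(untraced)
    new_traced[i], new_untraced[j] = untraced[j], traced[i]
    return new_traced, new_untraced


# Hypothetical flip-flop names; the partition sizes are preserved by the move.
rng = random.Random(0)
traced, untraced = swap_move(["f0", "f1"], ["f2", "f3", "f4"], rng)
```

Returning new lists instead of modifying the inputs keeps the previous partition available, so a rejected move can simply be discarded.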
C. Cost Function
The new trace buffer set may have a different score (state restoration ratio). The difference in score between the new trace buffer set and the old trace buffer set dictates whether the move is accepted. If the score improves, the move is accepted. If the score degrades, the move may or may not be accepted. Initially many inferior moves are accepted, but as the number of iterations increases, the probability of accepting an inferior move decreases. In the end, no degrading moves are accepted.
D. Stop Criteria
For the work done in this paper, Will Naylor's simulated annealing package [11] has been used. Three user inputs to the package control the stop criteria:
• Problem size
• Stop run length
• Epochs to run
Problem size is a parameter that specifies the number of variables in the problem to be optimized. An epoch consists of "problem size" acceptances. At each acceptance, the temperature is decreased by a fixed amount, chosen so that the temperature reaches 0 after "Epochs to run" epochs. The temperature is not decreased at rejections. Stop run length specifies the number of consecutive unaccepted moves after which the anneal terminates.
All moves that give an improvement are immediately accepted. To keep the algorithm from getting stuck in a local minimum too soon, degrading moves are sometimes accepted, with the probability given by (1), where delta is the change in the objective function produced by the move. The temperature decreases by a fixed amount each time a move is accepted: it starts at some medium to large value and falls toward 0 throughout the run. At the end of the run the temperature equals 0, and at that point no degrading moves are accepted.
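The acceptance rule and the cooling schedule can be sketched generically in Python. This is not the interface of the package in [11]. The acceptance probability exp(-delta/temperature) is the standard Metropolis form and is assumed here, since equation (1) is not reproduced in the text; the rejection of neutral moves at temperature 0 and all function and parameter names are likewise assumptions.

```python
import math
import random


def accept(delta, temperature, rng):
    """Decide on a move; delta > 0 means the score degraded by delta."""
    if delta < 0:
        return True                      # improvements are always accepted
    if temperature <= 0:
        return False                     # assumption: neutral moves are also rejected at 0
    return rng.random() < math.exp(-delta / temperature)   # assumed form of (1)


def anneal(state, score, move, t_start, total_acceptances, stop_run_length, rng):
    """Maximize score(state); stop after stop_run_length consecutive rejections."""
    current = score(state)
    accepted = rejected = 0
    while rejected < stop_run_length:
        # Linear cooling per acceptance: 0 is reached after total_acceptances.
        temperature = t_start * max(0.0, 1.0 - accepted / total_acceptances)
        candidate = move(state, rng)
        cand_score = score(candidate)
        if accept(current - cand_score, temperature, rng):
            state, current = candidate, cand_score
            accepted += 1
            rejected = 0
        else:
            rejected += 1                # temperature is unchanged on rejection
    return state, current


def toy_score(x):
    """Toy objective with a single maximum at x = 7."""
    return -(x - 7) ** 2


def toy_move(x, rng):
    """Step one position left or right, clamped to the range 0..20."""
    return min(20, max(0, x + rng.choice((-1, 1))))


final_state, final_score = anneal(0, toy_score, toy_move, t_start=5.0,
                                  total_acceptances=200, stop_run_length=50,
                                  rng=random.Random(1))
```

In the trace signal selection setting, score would be the restoration ratio of the traced set and move would be the flip-flop swap described in the Move Function subsection.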
IV. EXPERIMENTATION AND RESULTS
The benchmarks that we have used to evaluate our proposed technique are the ISCAS 89 benchmarks [12]. These benchmarks are publicly available, and most papers related to this research have used them.
A. Comparison Between Original and Improved Restoration Algorithm
As discussed in Section 2, the original restoration algorithm is incomplete. When the output and one input of a gate are known, the other input of the gate can only be restored by backward propagation (this follows from the way the forward and backward equations are derived: the forward equations can only restore the output, not the other input). To evaluate the improvement obtained, we selected 10 random sets of flip-flops (trace buffer width = 8) for all the benchmarks and calculated the restoration ratio with the original restoration algorithm and with the improved restoration algorithm. The idea behind selecting just 10 random sets was to check whether the improvement is obtained on a regular basis (that is, whether the order of selection of flip-flops matters on a regular basis). Results for 3 benchmarks are presented below.
Table 1. Evaluating difference in restoration algorithms for S5378
B. Tuning for Simulated Annealing and Convergence Plots
The trace buffer widths used in our experiments are 8, 16 and 32; we selected these widths because they are used by many papers on this topic. As explained in Section 3, there are three user-given inputs to the simulated annealing package:
• Problem size
• Stop run length
• Epochs to run
Since we are optimizing one particular variable, the restoration ratio, we set the problem size to 1. We set the stop run length to 500. This is a large number, as it means that the anneal is terminated only after 500 consecutive rejected moves; it was chosen because our goal was to maximize the restoration ratio regardless of the run time. We wanted the epochs to run to be a function of the number of flip-flops in a design and the trace buffer width chosen for that design. Since our goal was to maximize the restoration ratio, we set this variable to (number_of_flip_flops*trace_buffer_width)*100.
We used the simulated annealing heuristic to select trace signals for 6 different ISCAS 89 benchmarks and 3 different trace buffer widths. For a specific ISCAS 89 benchmark and trace buffer width, we launched six different runs. These six runs correspond to 3 different random seeds for obtaining simulation data from Modelsim® and 2 different windows of 64 cycles. We present the convergence plots for four benchmarks with a trace buffer width of 32 for one particular window of 64 clock cycles. These plots show how the cost function (restoration ratio) moves towards the global optimum value. The x axis in the plots is the number of moves.
From these 6 runs we obtain 6 different sets of flip-flops, and we choose the set with the highest score, where the score is the average restoration ratio over the 6 sets of input vectors (each corresponding to 64 clock cycles). Future work would be to feed these 6 sets of flip-flops into an ILP optimizer [10], which would then select the best signal set such that the total number of lost states over all runs is minimized. For now, we use the best average to select the trace signals.
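The selection by best average can be sketched in Python as follows. The function name and the matrix layout are assumptions, and the numbers are invented purely for illustration; they are not results from the paper.

```python
def pick_best_set(ratios):
    """ratios[i][j]: restoration ratio of candidate set i on input vector j."""
    averages = [sum(row) / len(row) for row in ratios]
    best = max(range(len(averages)), key=lambda i: averages[i])
    return best, averages[best]


# Made-up ratios for three candidate sets on two input vectors (illustration only).
example = [[2.0, 4.0], [3.0, 5.0], [1.0, 1.0]]
best_index, best_average = pick_best_set(example)
```

In the setting described above the matrix would have six rows and six columns, one row per candidate set and one column per set of input vectors.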
C. Comparison with Previous Methods
Comparison of simulation based approach with our method: Barring one benchmark (S35932), where there seems to be a mismatch between the simulation data used by other conventional methods and by our method, our approach does quite well in comparison to [8]. We obtain an improvement in restoration ratio of up to 61.93%. We do note, however, that in a few cases our approach yields inferior results when compared to [8].
Comparison of Hybrid based approach with our method: Our method performs up to 49.27% better than the hybrid approach. Barring the benchmark S35932, for which as stated earlier there seems to be a mismatch in the simulation data, our approach again yields better results when compared to the hybrid approach. It is to be noted that there are a few cases in which the hybrid approach performs better than our approach.
Comparison of ILP based approach with our method: In any simulation-based approach, the trace signals may differ between runs depending on the random seed used to generate the input vectors and on the window of tracing. The goal of the ILP refinement is to eliminate the influence of randomness and to cover more states of a given circuit through the selected signals. To do so, the authors of [10] use multiple runs of the signal selection algorithm, whose outcomes are then processed by ILP to select the best signal set. The same methodology can be applied to our approach. As stated before, we launch six different runs for a given benchmark and trace buffer width, which yields six sets of trace signals. We take each of these six sets and calculate its restoration ratio with respect to each of the 6 input vectors, and then select the set with the best average restoration ratio. This step could be replaced by the ILP refinement: the six sets of signals would be fed into the ILP optimizer, which would return a set of signals (of size equal to the trace buffer width) such that the minimum number of states is lost over all the runs. This would greatly enhance the restoration quality, as the base signal selection algorithm used by the authors of [10] is a greedy approach, which limits the quality of restoration obtained. We note that even without the ILP optimization step, our approach performs better than the ILP approach in a few cases.
Conclusion from these comparisons: Among the simulation based approach, the hybrid approach and our approach, no method gives better results for all the benchmarks. The ILP method has merit, and if its initial greedy signal selection approach were replaced by the simulated annealing method, it would yield better results. Even without the ILP optimization step in our methodology, we obtain better results in a few cases. There is no clear winner among the approaches, and it is best for designers to launch all methods and pick the one that gives the best restoration ratio for the circuit at hand.
V. CONCLUSION
In this work we have developed a novel simulation-based approach to select trace signals, which takes a gate-level netlist as input and selects a list of flip-flops to be tapped onto the trace buffer. The flip-flops are selected so as to maximize the number of signals that can be restored from them. We viewed this as a partitioning problem, which led us to use the simulated annealing heuristic. We also improved the original state restoration algorithm. Our methodology works well for most ISCAS 89 benchmarks: it yields up to 61.93% improvement in restoration ratio over the simulation based approach [8], up to 49.27% improvement over the hybrid approach and up to 10.62% over the ILP approach [10] on some benchmarks. Section 4 explains how the ILP method can be integrated into our methodology, which would further improve the restoration ratio. The runtime of our approach is fairly high, as our primary goal was to maximize the restoration ratio regardless of run time. An advantage of our approach is that the runtime can be controlled according to the user's requirements by changing the termination criteria.
