Abstract: Silicon debug poses a unique challenge to the engineer because of the limited access to internal signals of the chip. Embedded hardware such as trace buffers helps overcome this challenge by acquiring data in real time. However, trace buffers only provide access to a limited subset of pre-selected signals. In order to effectively debug, it is essential to configure the trace buffer to trace the relevant signals selected from the pre-defined set. This can be a labor-intensive and time-consuming process. This paper introduces a set of techniques to automate the configuring process for trace buffer-based hardware. First, the proposed approach utilizes UNSAT cores to identify signals that can provide valuable information for localizing the error.

Index Terms: Silicon debug, post-silicon diagnosis, data acquisition setup

5Vennsa Technologies Inc., Toronto, Canada. E-mail: 1{yangy, briank, veneris}@eecg.utoronto.ca, 2nicola@ece.mcmaster.ca, 5sean@vennsa.com

I. INTRODUCTION

Modern integrated circuit development cycles require several different synthesis and verification stages before a silicon prototype is manufactured. Each verification stage ensures that its corresponding synthesis step did not introduce any errors (e.g. timing, functional, power). At the Register Transfer Level (RTL), formal methods [1], [2] and simulation [3] approaches are used to verify that the RTL model complies with its functional specification. However, with the growing size of modern designs along with the prevalent use of intellectual property (IP), it is infeasible to achieve 100% functional verification coverage. In addition, time-to-market constraints only allow a finite amount of engineering resources to be dedicated towards functional verification. This limits the verification engineer's ability to ensure functional correctness. As a result, functional bugs are introduced into silicon prototypes. In fact, more than 60% of design tape-outs require a re-spin with more than half of these containing a bug due to logic or functional errors [4]. Each re-spin dramatically increases the cost of a project and eats away the time-to-market.

The main challenge of silicon debug is the limited observability of the internal signals in the chip. Unlike pre-silicon verification, only nets that are connected directly to output pins can be observed, which is generally insufficient for debugging. To overcome this issue, two types of Design-for-Debug (DfD) techniques, scan-based and trace buffer-based techniques, have been introduced. Scan chains provide the value of all scanned states for one cycle, while trace buffers allow the engineer to trace a small number of states in consecutive cycles. Although these two techniques greatly enhance the observability of the chip, the proportion of nets that can be observed is still small compared to pre-silicon verification.

Once a silicon chip fails during test, an iterative post-silicon validation process is launched to locate the root cause as shown in Figure 1. Engineers start with setting up the environment to capture appropriate data from the chip under test while it is run in real time. Then, this sparse amount of acquired data is analyzed by the engineer to prune the error candidates and also set up the next debug session. This time-consuming and labor-intensive cycle continues until the root cause of the failure is determined.

[Figure 1: iterative silicon debug flow; recoverable labels: "Refine", "Capture Events"]

Several techniques have been proposed to automate the data analysis process [5]-[7]. They analyze the acquired data to determine the root cause of the error. Clearly, the quality of those analyses is affected by the acquired data and it can be more effective if the data contains useful information regarding the error location. Hence, one question software solutions to silicon debug need to address is which set of signals is important for tracing. This includes which signals and cycles during the execution run to trace. Note that acquiring the data at run time can be time-consuming to set up and the data is read out at slower speed. Therefore, it is desired to minimize the iterations of the data acquisition and have a concise set of signals for tracing. Furthermore, since only a small proportion of the entire design can be accessed by DfD hardware, it is important for any software solutions to be able to handle this constraint.

This work proposes two novel approaches and a ranking system to enhance the data analysis while considering the data acquisition hardware constraints. It first utilizes UNSAT cores to identify registers that may contain useful information to help the diagnosis process. Since the UNSAT core relates directly to the error, it can provide greater precision. Secondly, a searching procedure is proposed to find alternatives for untraceable states. The algorithm takes the hardware constraints into account and finds alternative states among the pre-selected registers such that values of those registers can imply the untraceable ones. The proposed searching algorithm is memory efficient because only a small window of the complete silicon trace is analyzed. Finally, since in practice only one register group can be traced at a time, a simple ranking system is suggested to prioritize the traceable register groups according to the result from the proposed analysis.

Experimental results show that when the available hardware is complemented with the new techniques, the data analysis methodology can reduce, on average, 30% of the number of suspects that the engineer needs to investigate. This reduction is achieved with only 8% to 20% of registers traced.

The remainder of the paper is organized as follows. Section II summarizes prior work on hardware and software solutions for silicon debug. Section III and Section IV illustrate the new approaches for selecting signals to be traced at runtime. Then, a simple ranking system to determine the pre-selected group for tracing is presented in Section V. Experimental results are presented in Section VI, followed by the conclusion in Section VII.

II. BACKGROUND

Silicon debug involves hardware and software components. The hardware components consist of DfD hardware that improves signal observability. The software components include the automated debugging tools and the test environment setup to collect and analyze the data from the tester. In the following subsections, we briefly introduce notation and review some of this background material.

A. Notation

In this work, we consider a sequential circuit with primary [...] $x_1^2$ refers to the primary input $x_1$ in the second cycle. We also let the symbol $T_i$ denote the i-th simulated cycle.

B. Design for Debug Hardware Solutions

To increase observability of internal signals in the silicon prototype, there are two main DfD techniques adopted in practice: scan chains and trace buffers.

Scan chains allow engineers to capture and off-load the value of scanned registers at a specific cycle. However, the scan-out operation interrupts the execution of the chip because the values stored in the registers are destroyed. In order to resume the execution from the same point, the environment must be reset and restarted from the beginning of the test vector [8].

Trace buffers [9], [10] are another DfD technique that uses an on-chip memory to record internal signals. As shown in Figure 2, a trace buffer contains control logic, called trigger logic (e.g., embedded hardware assertions), employed for online monitoring of circuit behavior. Once the trigger condition is asserted, the on-chip memory can start/stop recording the logic values of the selected signals into buffers, which typically range from 8K to 256K. Subsequently, the recorded data can be read via a low-bandwidth interface, such as boundary scan. Due to the limited size of this on-chip memory, only a set of pre-selected signals can be traced. Those pre-selected signals are divided into groups and connected to the on-chip memory through a multiplexer. During execution, only one group can be selected and traced at a time. The traceable signals are typically manually selected by the designer. Recently, several algorithms have been developed to automate the selection process [11]-[13]. In those works, the authors try to select a small set of signals such that their values can restore a significant amount of untraceable states. For example, State signal selection proposed in [11] conducts structural analysis on the circuit. Then, it calculates the restorability of signals to determine the signals to be traced.

C. Automated Data Analysis

Although DfD hardware enhancement increases the observability of internal signals, there is a lack of techniques to automate the data analysis process on the acquired data, e.g. the data analysis step in Figure 1. Recently, there has been an effort to develop methodologies to aid the engineer in other parts of the silicon debug process besides data acquisition.

Caty et al. [5] and Yen et al. [14] both perform silicon debug by analyzing the data collected from scan chains. The former uses the data to determine the failing registers at each timeframe and then conducts back-tracing and forward-tracing from those registers to narrow down the potential root cause candidates. The latter proposes a similar approach. They use the scan chain data to identify the portion of the trace where the error is sensitized. Then, the suspect list is pruned using both the path-tracing method [15] and simulating the faulty value of the suspect in the golden model.

Yang et al. [7] propose a software solution to silicon debug that utilizes trace buffers. It analyzes the acquired data to discover the root cause of chip failure in both spatial and temporal domains. At the end of each debug session, if the root cause is inconclusive, the methodology provides suggestions on the data acquisition setup of the subsequent debug session. However, it assumes that all registers can be traced by the trace buffer, which is impractical. In practice, the number of registers that are traceable is limited. Consequently, this constraint can greatly affect the performance of [7]. Formal methods are used in [6] to restore the state information when a chip fails the test. It starts from the crash state and computes the states backward in time. Signatures are captured during the chip execution and used to determine a unique or a small set of possible predecessor states that leads to the crash state. It requires additional hardware structures to compute signatures before they are stored in the trace buffer.
D. UNSAT Cores
This work uses UNSAT cores to find suggestions for the data acquisition setup. A brief overview of UNSAT cores is given in this section.
Given a set of Boolean variables, a literal is an instance of the variable or its negation. A SAT instance in Conjunctive Normal Form (CNF) is a conjunction of clauses where each clause is a disjunction of literals. A SAT instance is satisfied if there exists an assignment over variables such that all clauses evaluate to true. That is, at least one literal in each clause evaluates to 1.
If such an assignment cannot be found, the SAT instance is unsatisfiable. In this case, any subset of clauses in the instance that is also unsatisfiable is referred to as an UNSAT core.
Modern SAT solvers [16]-[18] can produce UNSAT cores as a result of finding an instance to be unsatisfiable. The following is an example of an unsatisfiable CNF formula $\Phi$ and its UNSAT core.
An unsatisfiable SAT instance can have multiple UNSAT cores. Each represents a situation where the CNF formula is unsatisfiable. Additional UNSAT cores can be obtained by performing relaxation on the current UNSAT core clauses [19]. In summary, each relaxable clause in the UNSAT core is augmented with a distinct relaxation variable. Additional clauses are added to the CNF to ensure that assignments to relaxation variables comply with the one-hot property, i.e. one and only one relaxation variable is assigned to 1.
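The definitions above can be illustrated with a minimal sketch. It uses a brute-force satisfiability check instead of a real SAT solver, and a deletion-based loop to shrink an unsatisfiable clause set to a minimal core; the formula, the function names, and the DIMACS-style integer encoding are assumptions made for this illustration and are not taken from the solvers cited in the text.

```python
from itertools import product

def satisfiable(clauses, num_vars):
    """Brute-force SAT check. A clause is a list of DIMACS-style literals:
    +v means variable v, -v means its negation (variables are 1..num_vars)."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def minimal_unsat_core(clauses, num_vars):
    """Deletion-based extraction: drop every clause whose removal leaves
    the remaining set unsatisfiable; what survives is a minimal core."""
    if satisfiable(clauses, num_vars):
        raise ValueError("formula is satisfiable, so it has no UNSAT core")
    core = list(clauses)
    for clause in clauses:
        trial = [c for c in core if c is not clause]
        if not satisfiable(trial, num_vars):
            core = trial
    return core

# Hypothetical formula: (a)(~a + b)(~b)(a + c), with a=1, b=2, c=3.
phi = [[1], [-1, 2], [-2], [1, 3]]
core = minimal_unsat_core(phi, 3)
print(core)  # the clause (a + c) is not needed for the contradiction
```

Any unsatisfiable subset counts as an UNSAT core under the definition above; the sketch returns a minimal one only because that makes the result easy to inspect.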
III. DATA ACQUISITION SETUP
Due to the insufficient observability of internal signals, selecting which set of signals to observe is a key step in the silicon debug process. Trace buffer-based DfD hardware provides the engineer great flexibility in the choice of traced signals. However, it can only trace a limited subset of signals. To make the most efficient use of this hardware, the engineer uses two major criteria for selecting signals to acquire using trace buffers:
1) Signals should be related to the error source or provide valuable information to aid in pruning suspects.
2) Signal selection needs to comply with the hardware constraints. As discussed in Section II-B, in most real-world designs, only a small set of hard-wired signals can be traced during the execution.
In the next subsection, an algorithm that utilizes the proof trace generated by SAT solvers to identify registers that may contain useful information to aid debugging is presented.
A. Registers Identification with UNSAT Cores
As discussed in Section II-D, an UNSAT core of an unsatisfiable SAT problem is a subset of clauses that is also unsatisfiable. Assume that there is a golden model, such as a high-level behavioral model, available during the debug to provide correct responses of the design. Note, although in this behavioral model we do not have access to the data on every single net in the implementation, the important information on the data/address buses, as well as the essential control signals that steer the data through the data-path, can be monitored.
Then, given an erroneous circuit $C$, the input trace $X$ and the correct output response $O$, the CNF formula $C \cdot X \cdot O$ is unsatisfiable due to the contradiction between the erroneous output response and the correct output response. Intuitively, the contradiction can occur at any signal along the paths from the actual fault location to the output where discrepancies are observed. Therefore, signals associated with clauses in the UNSAT cores have to be one of the following:
• Nodes that excite the error
• Nodes along the error propagation paths
• Side inputs to the error propagation paths

Clearly, those signals can be potential locations for tracing and provide information about the erroneous behavior of the design.

Algorithm 1 IDENTIFYTRACEDSIGNALS
 [...]
 6: $U_{init}$ := solve $\Phi$ and extract the UNSAT core
 7: $U \leftarrow U_{init}$
 8: while $\Phi$ is unsatisfiable do
 9:   relax on clauses $\{c \mid c \in U_{init}$ and $c$ is an input vector unit clause$\}$
10:   $U_{new} \leftarrow$ solve $\Phi$ and extract the UNSAT core
11:   $U \leftarrow U \cup U_{new}$
12: end while
13: while $\Phi$ is unsatisfiable do
 [...]

Example 1 Consider the circuit shown in Figure 3 [...] $+ d^3))$ and $e$ at frame 4 (from the clause $(\overline{e^3} + e^4)$).
Algorithm 1 shows the overall algorithm. It starts with obtaining the initial UNSAT core ($U_{init}$ in line 6). Then, the algorithm tries to obtain more UNSAT cores by relaxation as summarized in Section II-D. First, it relaxes unit clauses in $U_{init}$ that represent input vectors (line 9) until the problem is SAT. Next, it repeats for unit clauses in $U_{init}$ that represent output responses (line 14). Since each UNSAT core can represent different error propagation paths, different signals can be included. To ensure all paths are considered, the union of all UNSAT cores is taken as shown in line 11 and line 15 in the algorithm. Finally, registers associated with variables in the UNSAT cores are the potential locations for tracing.
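The effect of taking the union over several UNSAT cores can be sketched on a toy circuit. Note the substitution: instead of the relaxation loop of Algorithm 1, this sketch enumerates all minimal cores by brute force, which is only feasible for a handful of clauses. The circuit, the variable numbering, and the register names are hypothetical.

```python
from itertools import combinations, product

def satisfiable(clauses, num_vars):
    """Brute-force SAT check over DIMACS-style integer literals."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

def minimal_cores(clauses, num_vars):
    """Enumerate all minimal UNSAT cores as tuples of clause indices,
    smallest subsets first (stand-in for the relaxation loop)."""
    cores = []
    for size in range(1, len(clauses) + 1):
        for idx in combinations(range(len(clauses)), size):
            if any(set(core) <= set(idx) for core in cores):
                continue  # contains a known core, so it is not minimal
            if not satisfiable([clauses[i] for i in idx], num_vars):
                cores.append(idx)
    return cores

# Hypothetical circuit: registers r1 and r2 both latch input x; y = r1 OR r2.
# Variables: x=1, r1=2, r2=3, y=4.
circuit = [[-1, 2], [1, -2],              # r1 == x
           [-1, 3], [1, -3],              # r2 == x
           [-2, 4], [-3, 4], [-4, 2, 3]]  # y == r1 OR r2
trace = [[1], [-4]]                       # applied input x=1, expected output y=0
phi = circuit + trace
registers = {2: "r1", 3: "r2"}

cores = minimal_cores(phi, 4)
core_vars = {abs(lit) for core in cores for i in core for lit in phi[i]}
to_trace = sorted(registers[v] for v in core_vars if v in registers)
print(len(cores), to_trace)
```

Here each minimal core follows one propagation path (through r1 or through r2), so a single core would suggest only one register, while the union suggests both.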
IV. ALTERNATIVE SIGNAL SEARCHING
The algorithm IDENTIFYTRACEDSIGNALS from the previous section selects a list of registers that may contain useful information about the behavior of the faulty chip. However, as mentioned in Section II-B, not all registers can be traced with the trace buffer. In this case, one can try to obtain the value of non-traceable registers indirectly by implication using other traceable registers.
Consider a circuit modelled in the ILA representation shown in Figure 4. The goal is to find registers that can imply the non-traceable register, or target register, denoted $s_g^k$ ($s_g$ at timeframe $T_k$). Since $s_g^k$ cannot be traced directly, we want to restore its value with traceable registers within a certain window of timeframes $\{T_{k-w}, \ldots, T_{k+w}\}$, where $w$ is the window_size. Those traceable registers are referred to as candidate registers. As shown in Figure 4, the value of $s_g^k$ can be restored in three ways: (1) forward implication, (2) backward justification or (3) a combination of (1) and (2).
To solve this problem, we formulate a SAT instance that searches for value assignments to candidate registers that, together with the input/output trace, imply a value at the target register. The alternative for the target register consists of those selected candidate registers. The details of the formulation are given in the following subsections.
A. Problem Formulation
The basic problem formulation is presented in this section. In order to indicate whether a candidate register is selected for generating an implication, new variables are added for every candidate register. We use the notation $L = \{l_1, l_2, \ldots\}$ to label those variables. The formula contains two components, expressed as:

$$\Phi = \prod_{j=k-w}^{k+w} \Phi^j \cdot E_N\Big(\bigcup_{j=k-w}^{k+w} L^j\Big)$$

The first component models the design from timeframe $T_{k-w}$ to $T_{k+w}$. Intuitively, each $\Phi^j$ represents a copy of the erroneous design at timeframe $j$ with the vectors $x^j$ and $y^j$ enforced. Previous traced register values ($S_{known}$) are also used to constrain the problem, since they may be helpful in generating implications. As will be explained in the next subsection, special CNF models are required for the target register and candidate registers. The value $w$ is user-defined, but also depends on the size of the trace buffer. We can set $w$ such that $2w + 1$ equals the buffer depth to fully utilize the memory space of the trace buffer. However, larger values of $w$ can increase the computation complexity and [...]

The second component $E_N\big(\bigcup_{j=k-w}^{k+w} L^j\big)$ constrains the number of selected candidate registers. The details of the construction can be found in [20]. To find the minimum number of candidate registers required for implications, it starts the constraint from one active select variable and increments the value up to the total number of the select variables, until a solution is found.

At the end, each solution of the problem is one implication for the target register. Candidate registers whose select variable ($l$) is active are the necessary registers to generate the implication. Note that the algorithm identifies not only the registers, but also the timeframes at which those registers are needed to generate the implication.
In the next subsection, the models for target registers and candidate registers are described.
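The incremental cardinality search described above can be sketched without a SAT solver. The sketch below swaps in three-valued (0/1/unknown) evaluation of a small, hypothetical next-state function and tries subsets of candidate registers of size 1, 2, ... until some value assignment makes the target known. The function `target_value`, the register names, and the logic itself are assumptions for illustration only.

```python
from itertools import combinations, product

X = None  # unknown value in three-valued logic

def and3(a, b):
    """Three-valued AND: a controlling 0 decides the output."""
    if a == 0 or b == 0:
        return 0
    if a is X or b is X:
        return X
    return 1

def or3(a, b):
    """Three-valued OR: a controlling 1 decides the output."""
    if a == 1 or b == 1:
        return 1
    if a is X or b is X:
        return X
    return 0

def target_value(known):
    """Hypothetical logic feeding the target register: (a AND b) OR c."""
    return or3(and3(known.get("a", X), known.get("b", X)), known.get("c", X))

def smallest_implication(candidates):
    """Raise the bound on selected candidates step by step and return the
    first assignment that gives the target a non-unknown value."""
    for k in range(1, len(candidates) + 1):
        for chosen in combinations(candidates, k):
            for values in product([0, 1], repeat=k):
                known = dict(zip(chosen, values))
                if target_value(known) is not X:
                    return known
    return None

print(smallest_implication(["a", "b", "c"]))  # one register suffices
print(smallest_implication(["a", "b"]))       # both registers are needed
```

Starting from a bound of one mirrors the text: the first solution found uses the fewest candidate registers, at the cost of repeating the search for each bound.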
B. Register Modelling
Target register and candidate registers need to be encoded specially in the CNF formula in order to solve the problem. In this section, models applied to these two types of registers are discussed.
Target Register: The goal of the target register $s_g^k$ is to have a non-unknown value. The implication can come from two directions: forward propagation from assignments in an earlier timeframe, or backward justification from assignments in a later timeframe. Hence, the target register is modelled as shown in Figure 5. An extra signal is introduced to disconnect $s_g^k$ from its fanin. The syntax of the model is shown in Figure 5(b). The condition on line 1 ensures that a non-unknown implication is generated by either forward implication or backward justification. A further condition gives the flexibility that the implication only needs to be satisfied from one direction. Furthermore, if there are implications from both directions, the implied values have to be the same.
Candidate Register: Candidate registers are traceable registers to which the SAT solver can assign 1/0 when they are selected. For each candidate register, two variables are introduced as shown in Figure 6(a). The variable $l$, referred to as the select variable, determines whether the register connects to its fanout. When $l$ equals 0, the network remains the same (lines 1-2 in Figure 6(b)). When $l$ equals 1, the register is disconnected from its fanout, which allows the SAT solver to assign 0/1 to either end of the break. This enables the possibility to identify forward/backward implications. Similar to the model for target registers, at least one of the two variables at the disconnected ends must have a non-unknown value. If both ends have non-unknown values, the values must be the same.
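The two register conditions can be written down as small predicates. This is a sketch of the stated rules only, not of the CNF encoding in Figures 5 and 6; the function names and the use of `None` for an unknown value are assumptions.

```python
def consistent(forward, backward):
    """Target-register rule. Each argument is 0, 1 or None (unknown).
    At least one direction must give a known value, and two known
    values must agree."""
    known = {v for v in (forward, backward) if v is not None}
    return len(known) == 1

def candidate_ok(select, fanin_side, fanout_side):
    """Candidate-register rule for a select variable l (0 or 1)."""
    if select == 0:
        # Register left connected: both ends carry the same value.
        return fanin_side == fanout_side
    # Break opened: same rule as for the target register.
    return consistent(fanin_side, fanout_side)

print(consistent(1, None), consistent(0, 1), consistent(None, None))
```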
Example 3 Figure 7 shows a portion of the ILA of the example circuit in Figure 3 [...] 5 is known.
C. Formulation Improvements
As shown in Figure 2, typically, traceable registers are divided into groups. When configuring the trace buffer, one group of the traceable registers is selected and traced for several timeframes. With this observation, we can reduce the number of select variables for the candidate registers. Instead of introducing one distinct select variable for each candidate register, all registers in the same group can share the same select variable. Furthermore, the same register in different timeframes can share one select variable as well. In Example 3, assuming $d$ and $f$ are in different groups, the number of $l$'s can be reduced to two, e.g. $d^5, d^6, d^7$ share one $l$, while $f^5, f^6, f^7$ share another one.
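The saving from sharing select variables is simple to count. The sketch below assumes the setting of Example 3 as described in the text (registers d and f in different groups, timeframes 5 to 7); the group names and the helper function are hypothetical.

```python
def count_select_variables(groups, timeframes, shared):
    """Number of select variables needed for the candidate registers.

    groups: dict mapping a group name to its list of registers.
    shared: if True, one variable per group (across all timeframes);
            otherwise one variable per register per timeframe.
    """
    if shared:
        return len(groups)
    registers = [reg for members in groups.values() for reg in members]
    return len(registers) * len(timeframes)

groups = {"group_d": ["d"], "group_f": ["f"]}  # hypothetical group names
timeframes = [5, 6, 7]
print(count_select_variables(groups, timeframes, shared=False))  # d5..d7, f5..f7
print(count_select_variables(groups, timeframes, shared=True))   # one per group
```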
The second optimization is to find implications for a group of target registers. As mentioned in Section III-A, target registers identified by the proposed method are correlated to each other. Hence, if there exists an implication for one of the target registers, the same implication may also imply the value of other target registers. By grouping several target registers together, the number of executions of the searching algorithm can be reduced. As a result, the overall runtime is reduced. However, there is a trade-off between the runtime and the precision of solutions, because more traceable registers may need to be selected when multiple registers are targeted.
V. GROUP RANKING
The algorithms described in previous sections identify registers that should be traced to provide more information on the error. Since registers are selected by groups at the end when configuring the trace buffer, we describe a simple ranking system to prioritize the traceable register groups according to the result from the proposed algorithms.

TABLE I
Benchmark   # registers   # groups   Registers per group   % traceable
spi         160           8          8                     40%
hpdmc       453           16         8                     28%
usb         2054          32         16                    25%
s1423       74            6          6                     49%
s5378       179           7          8                     31%
s9234       211           8          8                     30%
• Rule 1: The group that contains the most registers returned by the algorithm IDENTIFYTRACEDSIGNALS has higher priority. This is because those registers are directly related to the error source. Their values may contain the most useful and direct information.
• Rule 2: When searching alternatives for non-traceable registers, different target registers may require different traceable groups. If a group is selected at a higher frequency than others, it gets a higher rank. Intuitively, this group contains registers that have a higher chance to provide implications to non-traceable registers.
• Rule 3: A higher rank is assigned to the group that needs to be traced for more timeframes. This is simply to efficiently utilize the memory space of the trace buffer.
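The three rules can be sketched as a sort key. The text does not say how the rules are combined, so this sketch assumes a lexicographic order (Rule 1 first, Rule 2 breaks ties, Rule 3 breaks remaining ties); the function name, the argument layout, and the example data are hypothetical.

```python
def rank_groups(groups, identified, selection_counts, frames_needed):
    """Order traceable register groups from highest to lowest priority.

    groups: dict mapping group name to its list of registers.
    identified: set of registers returned by the UNSAT-core analysis (Rule 1).
    selection_counts: how often each group was picked by the alternative
        search (Rule 2).
    frames_needed: number of timeframes each group should be traced (Rule 3).
    """
    def key(name):
        direct_hits = len(set(groups[name]) & identified)
        return (direct_hits,
                selection_counts.get(name, 0),
                frames_needed.get(name, 0))
    return sorted(groups, key=key, reverse=True)

groups = {"g0": ["r1", "r2"], "g1": ["r3", "r4"], "g2": ["r5", "r6"]}
identified = {"r1", "r3"}
selection_counts = {"g0": 1, "g1": 3, "g2": 5}
frames_needed = {"g0": 4, "g1": 2, "g2": 6}
ranking = rank_groups(groups, identified, selection_counts, frames_needed)
print(ranking)
```

In this example g0 and g1 tie on Rule 1, so Rule 2 places g1 first; g2 is selected most often by the search but contains no directly identified register, so it ranks last under the assumed ordering.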
VI. EXPERIMENTS
In this section, experiments on OpenCores.org designs and ISCAS'89 benchmarks are presented. Minisat [18] is used as the underlying SAT solver. Experiments are conducted on a Core 2 Duo 2.4 GHz processor with 4 GB of memory.
To emulate the real trace buffer hardware structure, a subset of registers of each design is selected randomly, or by State signal selection [11] , as traceable by the trace buffer and they are divided into groups. The grouping configuration is summarized in Table I . The first column lists the benchmark used in the experiments. The second column of the table shows the total number of registers in each design. Columns three and four have the number of the register groups and the number of registers in each group, respectively. Column five shows the percentage of total registers that can be traced.
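As a quick consistency check on the grouping configuration, column five of Table I follows from the other columns as (number of groups × registers per group) / total registers. The sketch below recomputes it from the values listed in Table I; the tuple layout and function name are assumptions.

```python
# Rows of Table I: (benchmark, total registers, groups, registers per group,
# reported percentage of traceable registers).
table_i = [
    ("spi", 160, 8, 8, 40),
    ("hpdmc", 453, 16, 8, 28),
    ("usb", 2054, 32, 16, 25),
    ("s1423", 74, 6, 6, 49),
    ("s5378", 179, 7, 8, 31),
    ("s9234", 211, 8, 8, 30),
]

def traceable_percent(total, groups, per_group):
    """Share of registers wired to the trace buffer, rounded to a percent."""
    return round(100 * groups * per_group / total)

computed = {name: traceable_percent(total, groups, per_group)
            for name, total, groups, per_group, _ in table_i}
reported = {name: pct for name, _, _, _, pct in table_i}
print(computed == reported)
```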
In addition, as mentioned in Section IV-C, several target registers can be combined into one search execution to reduce the runtime. In our experimental setup, the target registers in every four timeframes are targeted together. For designs from OpenCores.org, test vectors are extracted from the testbench provided by OpenCores.org. Test vectors for ISCAS'89 are generated randomly. In both cases, the trace length is between 100 and 300 timeframes. In the experiment, we set window_size (w) to be six timeframes.

Table II summarizes the performance of debug analysis under two situations. Columns 2-4 show the performance of the analysis with no state information available, while columns 5-11 show the performance of the analysis with the proposed techniques. Each row is one individual case that contains a different bug in the design. A single random functional error (wrong assignment, incorrect case statement, etc.) is inserted into the RTL code. The final row is the geometric mean of the data in the columns. In the experiment, the analysis performs the model-free diagnosis for one hierarchical level in each debug session. The total number of modules returned at the end of the debug sessions is shown in columns two and five. This is the sum of the number of modules that the engineer needs to investigate after each debug session. As shown in the table, with state information the debugging tool can effectively eliminate more false candidates in all cases. The percentage reduction in the number of suspects, i.e. comparing column five and column two, is listed in column six. The reduction can be as high as 78% and an average reduction of 31% is achieved.
Columns three and seven show the number of debug sessions performed. About one third of the cases require fewer debug sessions to find the root cause of the error, for example, the second case of spi and hpdmc, and both cases of usb. The number of registers traced by the trace buffer is shown in column eight. Those numbers are small compared to the total number of registers shown in Table I. The benefit of the proposed technique is shown when one considers the reductions in both the number of suspects and the number of debug sessions.
Finally, the runtime of the debugging is summarized in column four and columns 9-11. Because of the reduction of suspects and debug sessions, the runtime for diagnosis is also reduced in the case of the proposed methodology. The runtime is 52% less on average (from 1426s down to 684s). The runtime of searching the registers for tracing is recorded in column 10. This is the additional computation required by the proposed methodology. As shown in the table, it can be significant in some cases, such as hpdmc. This is because the algorithm has a higher failure rate in finding a recommendation for the non-traceable registers in those cases. The details of the performance of the searching algorithm are discussed later. Overall, the total runtime of the proposed method is about 1.43 times longer than the runtime when no state information is used. However, since the number of final suspects is reduced significantly, this additional runtime may be acceptable if the time saved by manually inspecting fewer suspects is greater.
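For reference, the 52% figure quoted above follows directly from the two average diagnosis runtimes given in the text:

```latex
\[
\frac{1426\,\mathrm{s} - 684\,\mathrm{s}}{1426\,\mathrm{s}}
  = \frac{742}{1426} \approx 0.52
\]
```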
Next, we discuss the performance of the alternative searching algorithm. Clearly, the performance of the algorithm depends on the available traceable signals. Some signals cannot be restored at all if the necessary registers are not traced. Hence, in addition to selecting the traceable registers randomly, another approach, State signal selection, is also used. State signal selection selects registers whose values have a higher potential to restore other unknown registers. However, due to the technical implementation, State signal selection only handles ISCAS benchmarks. The results are summarized in Table III. The second and fourth columns of Table III show the percentage of targets for which the search algorithm successfully finds an alternative recommendation. The number of traceable register groups selected in order to generate implications is shown in columns three and five. In the case of the random selection, the algorithm is able to find an alternative for almost half of the targets on average. The performance of using State signal selection is similar to the random selection. This may be because the main goal of State signal selection is to restore as many registers as possible over the whole design. It does not target a specific region of the design.
In the next set of experiments, we investigate the performance of debugging when various state information is available. The experimental results are summarized in Table IV. All numbers are the average of the 11 buggy benchmarks discussed in Table II. The reference case for comparison is the case where no state information is used (columns 2-4 of Table II). The first column lists the four considered cases. The next two columns summarize the number of suspects reduced and the ratio of the number of sessions compared to the reference case. The fourth column is the ratio of traced registers to the total number of registers, followed by the reduction in the diagnosis runtime.
To demonstrate the advantage of the proposed UNSAT core approach, we compare it with the X-simulation described in [7] as shown in the first two cases of the table. In these two cases, no hardware constraints are considered, i.e. all registers are assumed to be traceable. From the table, we can see that the UNSAT core approach outperforms the X-simulation approach in all columns, particularly with respect to the number of traced registers. This demonstrates that the UNSAT core approach can achieve better performance with less state information. In the second two cases, only debugging with the UNSAT core approach is considered, as well as the trace buffer hardware constraints. However, in the third case, the searching algorithm is not carried out to find alternatives for non-traceable registers. Those registers are simply ignored.
Comparing the results of case 3 and case 4, we can see that with the help of the searching algorithm, the debugging has better performance. For example, the reduction of suspects increases from 27% to 31%. This indicates the effectiveness of the searching algorithm.
In the last set of experiments, two variations of the experiment setup are implemented to further investigate the performance of the searching algorithm, namely, the search window size (w) and the hardware group structure.
First, the algorithm is executed with four different values of window_size (w). The performance is summarized in Table V. The first column lists the four window sizes considered. The second column shows the average number of targets over all testcases. One can observe that as the window size becomes larger, the number of targets decreases. This is because more input/output values are applied when a larger window is used, which provides extra information to imply values for some targets that are unknown under the smaller window of the trace. The third and fourth columns show the average number of targets for which an alternative is successfully found and the success rate, respectively. In general, a higher success rate is achieved as the window size increases. This is expected since there are more candidates for selection and a longer trace is used, which can restore values of more signals.
Next, three trace buffer group structures are tried to see the performance of the searching algorithm. Config 1 is the configuration in Table I. Config 2 and Config 3 have the same number of traceable groups as Config 1 does, but the number of registers in each group is only half and a quarter of the size in Config 1, respectively. For instance, Config 1 of hpdmc has 32 groups of the size of eight registers; Config 2 has 32 groups of the size of four registers, while Config 3 has 32 groups of the size of two registers.
The success rate of finding an alternative is plotted in Figure 8(a). As expected, since there are fewer traceable registers, more non-traceable registers cannot be replaced. Hence, the success rate drops as the number of candidates decreases. Figure 8 [...]

VII. CONCLUSION

Internal state information is important for silicon debug because it can aid in pruning the suspect modules. However, due to the hardware constraints, only a small proportion of the registers can be traced during the execution. This paper presents novel techniques to identify a more precise set of registers as the suggested candidates for tracing. It considers the hardware constraints and presents an approach to find alternative recommendations for non-traceable registers such that their values can be obtained through implications of other traceable registers. The experimental results show that the proposed techniques can help silicon debug diagnosis methodologies to achieve good performance under the data acquisition hardware limitation.
