I. INTRODUCTION
Post-silicon validation involves operating manufactured chips in actual application environments to validate correct behavior across specified operating conditions. According to recent industry reports, post-silicon validation is becoming increasingly expensive. Intel reported a headcount ratio of 3:1 for design vs. post-silicon validation [Patra 07]. According to [Abramovici 06], post-silicon validation may consume 35% of average chip development time. [Yeramilli 06] observes that post-silicon validation is becoming prohibitively expensive because of increasing use of design resources and equipment costs.
Loosely speaking, there are two types of bugs that design and validation engineers worry about:
1. Bugs caused by the interactions between the design and the physical effects discussed above, also called electrical bugs [Josephson 01]. Such bugs generally manifest themselves only under certain operating conditions (temperature, voltage, frequency). Examples include setup and hold time problems.
2. Functional bugs, also called logic bugs, caused by design errors. While most functional bugs get caught during pre-silicon verification, a small percentage of them get exposed during post-silicon validation due to increasing design complexity and design schedule constraints.
Post-silicon validation involves four major steps:
1. Detecting a problem by running a test program, such as an OS, games, or functional tests, until a system failure occurs (e.g., system crash, segmentation fault, or exception).
2. Localizing the problem to a small region from the system failure, e.g., a bug in an adder of the second ALU of a complex microprocessor. The stimulus that exposes the bug, e.g., a particular 10 lines of code from some application, is also important.
3. Identifying the root cause of the problem. For example, an electrical bug may be caused by power-supply noise slowing down a circuit path resulting in an error at the adder output for a certain input sequence.
4. Fixing or bypassing the problem by patching [Chang 07, Wagner 06, Sarangi 07], circuit editing [Livengood 99], or, as a last resort, re-spinning using a new mask.
As pointed out in [Josephson 06], the second step dominates post-silicon validation effort and costs. Two major factors that contribute to the high cost of traditional post-silicon bug localization approaches are:
1. Failure reproduction, which involves returning the hardware to an error-free state and re-executing the failure-causing stimulus (including instruction sequences, interrupts, and operating conditions) to reproduce the same failure. Unfortunately, many electrical bugs are very hard to reproduce (often referred to as Heisenbugs [Gray 85]). The difficulty of bug reproduction is exacerbated by the presence of asynchronous I/Os and multiple clock domains. Techniques to make failures reproducible [Heath 04, Sarangi 06, Silas 03] are often intrusive to system operation, and may not expose bugs.
2. System-level simulation for obtaining golden responses, i.e., correct signal values for every clock cycle for the entire system (i.e., processor + all peripheral devices on the board). Running system-level simulation is typically 7-8 orders of magnitude slower than actual silicon. In addition, expensive external logic analyzers are required to record signal values that enter/exit the processor through external pins [Silas 03].
Due to the above factors, a functional bug typically takes hours to days to localize, whereas an electrical bug requires days to weeks and expensive equipment [Josephson 01]. This paper targets localization of electrical bugs in processors using a technique called IFRA, an acronym for Instruction Footprint Recording and Analysis. Figure 1.1 shows an IFRA-based post-silicon bug localization flow for processors. During chip design, a processor is augmented with low-cost hardware recorders (Sec. II) for recording instruction footprints, which are compact pieces of information describing the flows of instructions (i.e., where each instruction was at various points of time), and what the instructions did as they passed through various design blocks of the processor. During post-silicon bug detection, instruction footprints are recorded in each recorder, concurrently with system operation, in a circular fashion to capture the last thousand cycles of history before a failure.
Upon detection of a system failure, the recorded footprints are scanned out through a Boundary-scan interface, which is a standard interface present in most chips for testing purposes. Since a single test run is sufficient to capture all the necessary information, there is no need to reproduce the failure for localization. The details of how IFRA ensures this property are described in Sec. II. The scanned-out footprints, together with the test program binary executed during post-silicon bug detection, are post-processed offline using special analysis techniques (Sec. III) to identify the bug location (e.g., instruction queue control, scheduler, forwarding path, decoders), and the instruction sequence that exposes the bug (i.e., the bug-exposing stimulus). Such analysis techniques do not require any system-level simulation because they rely on checking for self-consistencies in the footprints with respect to the test program binary. Here is a simple example of a self-consistency check: if value X was written into some memory location A, a subsequent read from A should return value X (if the location was untouched). If the returned value is other than X, then we might suspect that there are bugs in the address generation circuitry, the read/write circuitry, or the storage location itself.
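The store/load self-consistency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MemEvent` record layout and the function name are hypothetical, and full data values are compared for clarity, whereas the recorders themselves store compact footprints.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemEvent:
    """One memory access in program order (hypothetical record layout)."""
    seq: int    # position in the serial execution trace
    kind: str   # "store" or "load"
    addr: int
    value: int


def check_store_load_consistency(
    events: List[MemEvent],
) -> List[Tuple[MemEvent, MemEvent]]:
    """Return (store, load) pairs where a load disagrees with the last store."""
    last_store: Dict[int, MemEvent] = {}
    violations: List[Tuple[MemEvent, MemEvent]] = []
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.kind == "store":
            last_store[ev.addr] = ev
        elif ev.kind == "load":
            st = last_store.get(ev.addr)
            # A load from a location never written inside the recorded
            # window has no reference value, so it cannot be checked.
            if st is not None and st.value != ev.value:
                violations.append((st, ev))
    return violations
```

Each reported pair brackets the interval in which the value was corrupted, which is the kind of hint that narrows the search to the address generation circuitry, the read/write circuitry, or the storage location.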
Once a bug is localized using IFRA, existing circuit-level debug techniques [Caty 05, Josephson 06] can then quickly identify the root cause of bugs, resulting in significant gains in productivity, cost, and time-to-market. One method is to derive thousands of test patterns from the bug-exposing stimulus and apply them to the microarchitectural blocks in close vicinity to the pinpointed block, while sweeping over voltage, frequency, and temperature ranges. Another method is to run the exposing stimulus while having all the observation and control mechanisms concentrated on the buggy microarchitectural block(s). For example, if a trace buffer (e.g., [Abramovici 06]) is available, one can probe only the signals associated with the pinpointed block(s).
In this paper, we demonstrate the effectiveness of IFRA for a DEC Alpha 21264-like superscalar processor model because its architectural simulator [Austin 02] and RTL model [Wang 04] are both available as open source. This processor contains aggressive performance-enhancing microarchitectural features (e.g., speculative, multi-way, and out-of-order execution) present in many commercial high-performance processors [Shen 05]. Such features significantly complicate post-silicon validation, yet the structured architecture enables opportunities for efficient bug localization using IFRA.
Extensive IFRA simulations demonstrate:
1. For 75% of injected representative electrical bugs, IFRA pinpointed their exact location (1 out of 200 microarchitectural blocks) and the time they were injected (referred to as location-time pair). For 21% of injected bugs, IFRA correctly identified their location-time pairs together with 6 other candidates (out of over 200,000 possible pairs) on average. IFRA completely missed correct location-time pairs for only 4% of injected bugs.
2. The above results were obtained without relying on system-level simulation and failure reproduction.
3. IFRA hardware introduces a very small area impact of 1% (including 60KBytes of distributed on-chip storage).
II. IFRA HARDWARE SUPPORT
We use an Alpha 21264-like superscalar processor model [Digital 99] to explain the IFRA hardware support. The shaded parts in Fig. 2.1 indicate the additional hardware:
1. A set of distributed recorders with dedicated storage. Each recorder is essentially a circular buffer, associated with a particular pipeline stage of the processor, and records specific information corresponding to that stage (Table 2.1) as instructions pass through that stage.
2. An ID assignment unit for assigning and appending an ID to each instruction that enters the processor. Our ID assignment scheme [Park 08a] works for processors supporting out-of-order, speculative execution, multiple clock domains, and dynamic voltage and frequency scaling.
3. A post-trigger generator, which is a mechanism for shortening the error-to-failure latency by pausing or stopping recording.
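As a rough software analogy of a single recorder, the following Python sketch models a circular buffer that keeps only the most recent footprints and can be paused by a post-trigger. The class and method names are hypothetical, the footprint payload is left opaque, and the default capacity of 1,024 entries is taken from the evaluation setup in Sec. IV; the real recorders are hardware structures, so this is an illustration of the behavior rather than of the design.

```python
from collections import deque
from typing import Any, List, Tuple


class Recorder:
    """Software model of one per-stage recorder (hypothetical interface)."""

    def __init__(self, capacity: int = 1024) -> None:
        # A deque with maxlen silently drops the oldest entry when full,
        # which mimics overwriting in a circular buffer.
        self._buf: deque = deque(maxlen=capacity)
        self.paused = False

    def record(self, instr_id: int, info: Any) -> None:
        """Store a footprint unless recording is paused by a post-trigger."""
        if not self.paused:
            # Instruction IDs are 8 bits wide, so they wrap around.
            self._buf.append((instr_id & 0xFF, info))

    def scan_out(self) -> List[Tuple[int, Any]]:
        """Return the surviving footprints, oldest first."""
        return list(self._buf)
```

Because old entries are overwritten, the same 8-bit ID can appear several times in one scanned-out buffer, which is why the footprints must be linked during post-analysis.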
While an instruction, with an ID appended, flows through [...]
[Table 2.1: information recorded by each recorder; table body not recovered. Table note: Each entry contains an additional 8-bit instruction ID (explained later).]
Our synthesis results (using Synopsys Design Compiler with a TSMC 0.13-micron library) show that the area impact of the IFRA hardware infrastructure is 1% on the Illinois Verilog Model [Wang 04] assuming a 2-MByte on-chip cache, which is typical of current desktop/server processors. The overhead is largely dominated by the circular buffers present in the recorders. Wires connecting the recorders operate at slow speed, and a large portion of this routing reuses existing on-chip scan chains that are present for manufacturing testing purposes.
The need for a post-trigger is illustrated by the following situation. Suppose that a test program has been executing for billions of cycles and an electrical bug is exercised 5 billion cycles after the start. Moreover, suppose that the electrical bug causes a system crash after another 1 billion cycles (i.e., 6 billion cycles from the start). With limited storage, we are only interested in capturing the information around the time when the electrical bug is exercised. Hence, the billions of cycles' worth of information before then is unnecessary. On the other hand, if we stop recording only after the system crash, all the useful information will be overwritten. What is necessary is a failure/suspect detection mechanism that tells us well in advance that something suspicious may lead to system failure at some point in the future; i.e., we must reduce error detection latency, the length of time between bug manifestation and visible system failure. These detection mechanisms are referred to as post-triggers, and are listed in Table 2.2.
Classical hardware error detection techniques, such as parity bits for arrays and residue codes for arithmetic units [Ando 03, Leon 06, Sanda 08], as well as built-in exceptions, such as unimplemented instruction exceptions and arithmetic exceptions, do help. However, these mechanisms are not sufficient, as illustrated by the two tricky situations described in the last two rows of Table 2.2. These two failure scenarios may be detected several million cycles after an error occurs, causing useful information to be overwritten even with the existing error detection mechanisms. Hence, we introduce the notion of soft and hard post-triggers.
A hard post-trigger fires when there is an evident sign of failure, and stops the processor. A soft post-trigger fires when there is an early symptom of a possible failure. It pauses the recording in all recorders, but allows the processor to keep running. If a hard post-trigger for the failure corresponding to the symptom fires within a prespecified amount of time, the processor stops. If a hard post-trigger does not fire within the specified time, the recording resumes, assuming that the symptom was a false alarm.
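The interplay between soft and hard post-triggers can be viewed as a small state machine. The sketch below is a behavioral model only; the class name, the method names, and the cycle-driven `tick` interface are assumptions made for illustration, and the timeout value is left as a parameter because the text does not specify it.

```python
from typing import Optional


class PostTriggerController:
    """Behavioral model of soft/hard post-trigger handling (hypothetical API)."""

    def __init__(self, timeout_cycles: int) -> None:
        self.timeout_cycles = timeout_cycles
        self.recording = True     # recorders are capturing footprints
        self.stopped = False      # processor has been halted
        self._soft_fired_at: Optional[int] = None

    def soft_trigger(self, cycle: int) -> None:
        """Early symptom: pause recording, keep the processor running."""
        if self.stopped or self._soft_fired_at is not None:
            return  # keep the earliest pending symptom
        self.recording = False
        self._soft_fired_at = cycle

    def hard_trigger(self, cycle: int) -> None:
        """Evident failure: freeze the recorders and stop the processor."""
        self.recording = False
        self.stopped = True

    def tick(self, cycle: int) -> None:
        """Resume recording if a soft trigger timed out without a failure."""
        if self.stopped or self._soft_fired_at is None:
            return
        if cycle - self._soft_fired_at >= self.timeout_cycles:
            self.recording = True
            self._soft_fired_at = None
```

Pausing on the symptom, rather than waiting for the failure, is what keeps the footprints around the error from being overwritten during a long error-to-failure interval.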
A segmentation fault (or segfault) requires OS handling and, hence, may take several million cycles to resolve. Null-pointer dereference is detected by adding simple hardware in the Load/Store unit. For other illegal memory accesses, a TLB miss is used as the soft post-trigger. If a segfault is not declared by the OS while servicing the TLB miss, the recording is resumed on TLB refill. On the other hand, if a segfault is returned, then a hard post-trigger is activated.
III. POST-ANALYSIS TECHNIQUES
Once the recorder contents are scanned out, footprints belonging to the same instruction (but in multiple recorders) are identified and linked together using a technique called footprint linking (Sec. III.A). The linked footprints are also mapped to the instruction in the test program binary using the program counter value stored in the fetch-stage recorder (Table 2.1).
After linking the footprints, we run four high-level post-analysis techniques (Sec. III.B) followed by a low-level analysis (Sec. III.C). The low-level analysis asks a series of questions until we obtain the bug location and the bug-exposing stimulus. These questions make up a complex decision diagram similar to the structure shown in Fig. 3.1.
The high-level and the low-level post-analysis techniques rely on the concept of self-consistency, which checks for the existence of contradictory events in the collected footprints with respect to the test program binary. Such self-consistency checks are extensively used in fault-tolerant computing for error detection [Austin 99, Lu 82, Oh 02, Siewiorek 98]. The key difference here is that we use self-consistency checks for failure localization rather than error detection. Such application is possible because, unlike fault-tolerant computing, post-analysis is done offline, allowing significantly more complex analysis for localization purposes.
A. Footprint Linking
Figure 3.2 shows a part of a test program and the contents of three (out of many) recorders right after they are scanned out. As explained in Sec. II, since we use short instruction IDs (8 bits for the Alpha 21264-like processor), we end up having multiple footprints with the same ID in the same recorder and/or multiple recorders. For example, in Fig. 3.2, ID 0 appears in three entries of the fetch-stage recorder, in two entries of the issue-stage recorder, and in three entries of the execution-stage recorder.
Which of these ID 0s correspond to the same instruction? This question is answered by the following special properties enforced by the ID assignment scheme presented in Sec. II:
1. All flushed instructions are uniquely identified.
2. If instruction A was fetched before instruction B, and they both have the same ID, then A will always exit any pipeline stage (and leave its footprint in the corresponding recorder) before B does for that same pipeline stage.
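These two properties suggest a simple linking procedure: discard flushed footprints, then, within each recorder, match same-ID footprints from the most recent one backwards. The Python sketch below illustrates this idea under several assumptions that are not taken from the paper: footprints are dictionaries with an `id` field, flushed footprints carry an explicit `flushed` flag, each recorder's list is ordered oldest first, and every surviving instruction has left a footprint in each recorder it is matched against.

```python
from collections import defaultdict
from typing import Any, Dict, List

Footprint = Dict[str, Any]


def link_footprints(
    recorders: Dict[str, List[Footprint]],
) -> List[Dict[str, Footprint]]:
    """Group footprints of the same instruction across per-stage recorders.

    Returns one dict per linked instruction, mapping stage name to footprint.
    """
    # Property 1: flushed instructions are identifiable, so drop them.
    per_stage: Dict[str, Dict[int, List[Footprint]]] = {}
    for stage, entries in recorders.items():
        by_id: Dict[int, List[Footprint]] = defaultdict(list)
        for fp in entries:
            if not fp["flushed"]:
                by_id[fp["id"]].append(fp)
        per_stage[stage] = by_id

    all_ids = set()
    for by_id in per_stage.values():
        all_ids.update(by_id)

    linked: List[Dict[str, Footprint]] = []
    for instr_id in sorted(all_ids):
        depth = max(len(b.get(instr_id, [])) for b in per_stage.values())
        # Property 2: same-ID footprints keep fetch order in every stage,
        # so the k-th latest entries in all recorders belong together.
        for k in range(1, depth + 1):
            group: Dict[str, Footprint] = {}
            for stage, by_id in per_stage.items():
                fps = by_id.get(instr_id, [])
                if len(fps) >= k:
                    group[stage] = fps[-k]
            linked.append(group)
    return linked
```

For each ID, the groups are returned from the most recent instruction to the oldest. The fetch-stage footprint in each group would carry the PC used to map the instruction back to the test program binary.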
In Fig. 3.2, using the first property, all flushed instructions with ID 0 are identified and discarded. Then, using the second property, the latest instances of ID 0 across all recorders are linked together, followed by linking of the second latest instances of ID 0, and so on. Since the PC is stored in the fetch-stage recorder, we can link the instruction ID back to the test program binary to find the corresponding instruction.
B. High-level Analysis
Each analysis technique is applied separately. If only one of them identifies an inconsistency, then the corresponding entry point into the decision diagram of the low-level analysis is taken. If none of them discovers an inconsistency, then there is a default entry point into the decision diagram. If several of them identify inconsistencies, then, since we are interested in the inconsistency that is closest in time to the electrical bug manifestation, the reported inconsistencies are compared to see which one occurred the earliest. The high-level analysis technique with the earliest occurring inconsistency then dictates the entry point into the decision diagram for low-level analysis. The four analysis techniques are described in detail in [Park 08a]. Here we explain the control-flow analysis to illustrate the idea.
In program control flow analysis, four properties of legal program control flow are checked against the PC sequence of the serial execution trace (obtained from the fetch-stage recorder and the test program binary during footprint linking). A violation of any of them indicates an illegal transition.
I. The PC increments by 4 except in the presence of a control flow transition instruction (e.g., branch, jump).
II. A PC jump always occurs in the presence of an unconditional transition instruction.
III. The PC jumps to the correct target in the presence of a direct transition (with a target address that does not depend on a register value).
IV. The PC jumps to a legal target in the presence of a register-indirect transition (with a target address that depends on a register value). A legal target is an address that is part of the executable address space (determined from the program binary), and whose residue matches the recorded register residue.
A control-flow violation is then scrutinized in the low-level analysis, starting from the PC register at the time when the instruction made the illegal transition.
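The four checks can be sketched as a single pass over consecutive PCs of the serial execution trace. The following Python is an illustrative model, not the authors' tool: the `Instr` and `Step` types, the instruction classification, and the residue modulus of 3 are assumptions, since the text does not specify how the binary is decoded or which residue code is recorded.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

RESIDUE_MOD = 3  # assumption: the modulus is not stated in the text


@dataclass(frozen=True)
class Instr:
    kind: str                     # "plain", "cond", "uncond", or "indirect"
    target: Optional[int] = None  # static target of a direct transition


@dataclass(frozen=True)
class Step:
    pc: int
    reg_residue: Optional[int] = None  # recorded residue of the target register


def check_control_flow(
    trace: List[Step], program: Dict[int, Instr]
) -> List[Tuple[int, str]]:
    """Return (pc, violated_rule) for every illegal PC transition."""
    violations: List[Tuple[int, str]] = []
    for cur, nxt in zip(trace, trace[1:]):
        ins = program.get(cur.pc)
        if ins is None:
            continue  # PC outside the binary; already reported one step earlier
        fallthrough = cur.pc + 4
        if ins.kind == "plain":
            if nxt.pc != fallthrough:                     # rule I
                violations.append((cur.pc, "I"))
        elif ins.kind == "uncond":
            if nxt.pc == fallthrough and ins.target != fallthrough:
                violations.append((cur.pc, "II"))         # no jump occurred
            elif nxt.pc != ins.target:
                violations.append((cur.pc, "III"))        # wrong direct target
        elif ins.kind == "cond":
            if nxt.pc not in (fallthrough, ins.target):   # rule III
                violations.append((cur.pc, "III"))
        elif ins.kind == "indirect":
            in_binary = nxt.pc in program
            residue_ok = (cur.reg_residue is None
                          or nxt.pc % RESIDUE_MOD == cur.reg_residue)
            if not (in_binary and residue_ok):            # rule IV
                violations.append((cur.pc, "IV"))
    return violations
```

The first reported violation gives both the instruction and the point in the trace from which the low-level analysis would start.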
C. Low-level Analysis
The low-level analysis involves back-propagating discovered inconsistencies through hardware locations according to the low-level decision diagram, while updating the bug time from the initial inconsistency. The low-level analysis mainly involves checking for consistency in residue bits collected by recorders (Table 2.1). A detailed explanation of the low-level analysis is beyond the scope of this overview, and can be obtained from [Park 08b].
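To give a flavor of what a residue consistency check looks like, the sketch below uses a mod-3 residue, which is an assumption since the modulus is not stated here. It relies on two standard facts: the residue of a sum equals the sum of the residues (modulo the same base), and flipping any single bit changes a value by a power of two, which is never a multiple of 3, so a mod-3 residue always exposes a single bit-flip. The function names are hypothetical, and the code is not the decision diagram of [Park 08b]. Fixed-width overflow is ignored for simplicity.

```python
RESIDUE_MOD = 3  # assumption: the modulus is not stated in the text


def residue(x: int) -> int:
    """Compact check value that would be stored alongside a footprint."""
    return x % RESIDUE_MOD


def add_is_consistent(res_a: int, res_b: int, res_sum: int) -> bool:
    """Check an addition using only the recorded residues.

    Holds for exact integer addition; wrap-around on overflow is ignored.
    """
    return (res_a + res_b) % RESIDUE_MOD == res_sum
```

For example, `add_is_consistent(residue(20), residue(22), residue(42))` holds, while replacing the result by a value with one flipped bit makes the check fail.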
IV. RESULTS
We evaluated IFRA by injecting errors into a microarchitectural simulator augmented with IFRA. We used the SimpleScalar 3.0 architectural simulator [Austin 02] with an Alpha 21264 configuration.
For this particular configuration, there are 200 different microarchitectural blocks (excluding array structures and arithmetic units, since errors inside those structures are immediately detected and localized using parity and/or residue codes, respectively). Each block has an average size equivalent to 10K 2-input NAND gates. SPECint2000 benchmarks were chosen as the validation test programs, and each recorder was sized to have 1,024 entries.
All bugs were modeled as single bit-flips at flip-flops to target hard-to-repeat electrical bugs. This is an effective model because most electrical bugs (e.g., hold time, noise, speed-path, and signal integrity problems) eventually manifest themselves as incorrect values arriving at flip-flops for certain input combinations and operating conditions. Upon error injection, the following scenarios are possible:
1. The error vanishes without any effect at the system level.
2. The error does not cause any post-trigger mechanism to fire, but produces incorrect program outputs.
3. The failure manifests with short error latency, and the recorders successfully capture the history from error injection to failure manifestation.
4. The failure manifests with long error latency, and the 1,024-entry recorders fail to capture the history from error injection to failure (including soft triggers).
Cases 1 and 2 are related to the coverage of validation test programs and post-triggers, and are not the focus of this paper. Any error injection run that does not result in the activation of any post-trigger within 100K cycles from the point of error injection is included in these categories. Table 4.1 presents results from 800 error injections that resulted in cases 3 and 4. All error injections were performed after a million cycles from the beginning of the program in order to demonstrate that there is no need to keep track of all the history before the appearance of an error. In addition, we pessimistically report all the errors in case 4 as falling within the "completely missed" category.
