Abstract-Existing techniques to ensure functional correctness and hardware trust during pre-silicon verification face severe limitations. In this work, we systematically leverage two key ideas: 1) Symbolic QED, a recent bug detection and localization technique using Bounded Model Checking (BMC); and 2) Symbolic starting states, to present a method that: i) Effectively detects both "difficult" logic bugs and Hardware Trojans, even with long activation sequences where traditional BMC techniques fail; and ii) Does not need skilled manual guidance for writing testbenches, writing design-specific assertions, or debugging spurious counter-examples. Using open-source RISC-V cores, we demonstrate the following: 1. Quick (≤5 minutes for an in-order scalar core and ≤2.5 hours for an out-of-order superscalar core) detection of 100% of hundreds of logic bug and hardware Trojan scenarios from commercial chips and research literature, and 97.9% of "extremal" bugs (randomly-generated bugs requiring ~100,000 activation instructions taken from random test programs). 2. Quick (~1 minute) detection of several previously unknown bugs in open-source RISC-V designs.
I. INTRODUCTION
PRE-SILICON verification requires major effort in a typical hardware design flow [1]. In this paper, we consider pre-silicon verification of single processor cores, which are critical components of any System-on-Chip (SoC). Generally, pre-silicon verification mainly targets logic design errors (logic bugs). However, it is also crucial to detect Hardware Trojans (HTs) [2], which are unauthorized modifications of a system that result in incorrect functionality and/or the exposure of sensitive data [3]. While previous research on HTs focused on attacks implemented during fabrication [4], there is growing concern about HTs being inserted in third-party Intellectual Property (IP) cores by malicious entities [5]. This makes HT detection during pre-silicon verification essential.
Similar to logic bugs, HTs can affect the functionality of a system. For example, an HT can cause an error that creates a change in the software-visible state of a system, defined by the state of software-visible registers and memory. The objective of HT detection is to detect these changes, which encompass many catastrophic attacks on processor cores [2].
Symbolic quick error detection (Symbolic QED or SQED) [6] is a new pre-silicon verification technique based on QED tests [7]. It uses bounded model checking (BMC) [8] for formal analysis of the design. QED tests generate short sequences of instructions that trigger logic bugs in a design. As such, SQED is an automatic bug detection and localization technique that is extremely effective in practice.
Manuscript received Aug. 15, 2019. K. Ganesan and S. S. Nuthakki are with Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305 USA (e-mail: karthik3@stanford.edu). An early draft is cited in [9] as [Ganesan 18].
For example, SQED was recently applied to several industrial microcontroller cores used in commercial automotive products [9]. It was able to detect all recorded logic bugs in the designs, while enabling an 8-60X (depending on the design) reduction in verification effort compared to the standard industrial verification flow.
Importantly, SQED does not target single-instruction bugs (i.e., bugs such that a single instruction on a specific set of inputs always produces an incorrect result). Many other verification techniques are highly effective at detecting such bugs, both in the research literature [10] and in industry [9].
SQED analyzes a design symbolically, but it requires a concrete starting state (i.e., a state of the digital system that is given explicitly as a bit-vector of 0s and 1s). This means that, to find bugs or HTs that require long activation sequences (i.e., many instructions are required to activate such bugs), Symbolic QED must rely on very deep BMC runs (i.e., runs that unroll the system far enough to include all the activation instructions). This can be very difficult for practical designs. In a related study [6], it was shown that BMC could unroll a large, multicore SoC up to around 30 clock cycles, within 24 hours of verification time. The following example [11] shows that SQED, while highly effective for logic bugs, can be insufficient for detecting HTs.
Motivating Example 1. Consider the following HT that is difficult to find using existing HT detection techniques: The HT changes opcodes of the next several decoded instructions if the processor has fetched a specific sequence of 256 instructions. This HT could inject an instruction sequence to bypass physical memory protection and run a privileged instruction. Such privilege escalation attacks [2] can be catastrophic.
Because the HT requires a long sequence of instructions (and hence many clock cycles) for activation, SQED (like other BMC-based methods [12, 13]) fails to detect the HT unless the selected starting state for BMC quickly transitions to a state where the HT activates. Stumbling upon such a "close" state by starting at a concrete state (e.g., obtained from simulation or a power-on reset state) is highly unlikely to succeed since the HT can be designed with an arbitrary activation sequence that is not known a priori.
To overcome this major challenge, we extend SQED so that it is now capable of starting from a symbolic (instead of concrete) starting state (i.e., we give the BMC tool the ability to choose an arbitrary starting state for each run).
However, it is well-known that starting BMC from an unrestricted symbolic starting state risks generating spurious counterexamples (false positives). This occurs when the BMC tool incorrectly indicates that a bug or HT is present in a design, when there is actually no bug or HT. If the BMC tool selects a starting state that is not reachable from the set of all reset states of the system via a sequence of instructions, then a false positive might occur. For example, assume that each word in a memory system is protected with a single even parity bit and assume that a BMC tool is asked to check the following property: for any sequence of reads and writes to the memory, the parity bits remain consistent with the data. If the starting state of the design is not constrained, then the BMC tool can initialize the memory to contain an all-zero word with a '1' for the parity bit, issue an instruction that reads from this location, check the property, and report this false positive.
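The parity example can be made concrete with a small model. The following Python sketch is illustrative only: the names `Word`, `write`, and `parity_ok` are assumptions introduced here and do not correspond to any BMC tool API. It shows that every state produced by bug-free writes satisfies the property, while an unconstrained starting state can violate it even though the design has no bug.

```python
from dataclasses import dataclass


def even_parity_bit(data: int) -> int:
    """Parity bit that makes the total number of 1s (data plus parity) even."""
    return bin(data).count("1") % 2


@dataclass
class Word:
    data: int
    parity: int


def write(data: int) -> Word:
    """Bug-free write: the parity bit is always derived from the data."""
    return Word(data, even_parity_bit(data))


def parity_ok(word: Word) -> bool:
    """Property under check: the stored parity is consistent with the data."""
    return word.parity == even_parity_bit(word.data)


# States reachable through writes always satisfy the property.
reachable = [write(d) for d in range(256)]
all_reachable_ok = all(parity_ok(w) for w in reachable)

# An unconstrained symbolic starting state may pick an unreachable word:
# all-zero data with parity bit 1. Reading it violates the property,
# which is a spurious counterexample rather than a design bug.
spurious = Word(data=0, parity=1)
spurious_ok = parity_ok(spurious)
```

Ruling out `spurious` requires a constraint on the starting state, which is the role the QED constraints play for processor cores.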
Traditional methods rely on verification engineers to (manually) create constraints to rule out such false positives, which can be time-consuming for practical designs having many complex properties. This paper overcomes that challenge by: i) defining QED constraints: sufficient constraints (see Section III for details and Appendix B for proofs) to ensure false positives do not occur when using Symbolic QED with symbolic starting states; and ii) introducing QED recorders, which observe a small subset of internal signals within the processor to ensure the QED constraints are satisfied. QED recorders are used for pre-silicon verification only. They do not incur area overhead for the final design.
Our work improves on a previous technique for SQED with symbolic initial states, called S²QED [14]. That approach differs from ours in the types of processors that can be verified, the types of logic bugs and HTs that can be detected, and the way symbolic initial states are implemented. We explain our advantages in Section II.C.
Experimental results using our new method demonstrate:
1. We automatically, correctly, and quickly (~1 minute) detect several previously unknown (real) logic bugs in open-source out-of-order (OoO) superscalar [15] and in-order scalar [16] RISC-V cores. The bugs found in [15] cannot be detected by [6] or [14].
2. We automatically, correctly, and quickly (within 25 seconds for in-order, 18 minutes for OoO) detect 100% of (117 in-order, 120 OoO) simulated logic bugs, representing a wide variety of "difficult" logic bugs (Appendix A) from commercial designs. SQED with a concrete starting state detects only 33% (in-order) and 5% (OoO).
3. We automatically, correctly, and quickly (within 5 minutes for an in-order core; 2 hours for an OoO superscalar core) detect 100% of (156 in-order, 195 OoO) simulated HTs, encompassing a wide variety of scenarios (Appendix A) from over 100 papers in the HT research literature. SQED with a concrete starting state detects 15% (in-order) and 9% (OoO).
4. We automatically, correctly, and quickly (within 2.5 hours) detect 97.9% of an "extremal" bug family (randomly-generated pre-condition-based bugs which require ~100,000 activation instructions taken from random test programs) in an OoO superscalar core [15]. In contrast, SQED with a concrete starting state detects 0%, and [14] is not applicable.
5. Our method does not require a golden model or simulation data of the design-under-test for detection of logic bugs and/or HTs.
6. Its effectiveness does not depend on the way HTs are designed, i.e., our method is HT-design agnostic.
The rest of this paper is organized as follows. Section II provides background on earlier QED works. Section III describes Symbolic QED with symbolic starting states. Results are presented in Section IV, followed by a survey of related work in Section V and conclusions in Section VI.
Appendix A provides a list of logic bug and HT types used in the experiments of Section IV. Appendix B provides formal proofs of the sufficiency of QED constraints (introduced in Section III.B). Appendix C provides details on how QED constraints are specified to the BMC tool.
II. BACKGROUND
In the following, we present the basics and terminology of QED [7], SQED [6], and S²QED [14].
A. QED and the EDDI-V Transformation
Quick error detection (QED) is a testing technique that takes existing system validation tests (i.e., sequences of instructions) and automatically transforms them into a set of new tests using various QED transformations [7] . Among the various transformations that can be applied, Error Detection using Duplicated Instructions for Validation (EDDI-V) is the focus of our work (illustrated in Fig. 1 ). It targets bugs inside processor cores by checking the results of original instructions against the results of duplicate instructions.
First, the software-visible register and memory space are divided into two halves, one for the original instructions and one for the duplicated instructions. Next, corresponding registers and memory locations for the original and duplicated instructions are initialized to hold the same values. This is called a QED-consistent system state. Then, for every load, store, arithmetic, logical, shift, or move instruction in the original test, EDDI-V creates a corresponding duplicate instruction that performs the same operation, but uses only the registers and memory reserved for duplicates. The duplicated instructions execute after the original instructions (in the same relative order), but may be interleaved. The EDDI-V transformation then inserts periodic check instructions that compare the results of the original instructions against those of the duplicated ones. A failing QED test occurs if, after an equal number of original and duplicate instructions have committed, the system reaches a state that is not QED-consistent. The respective starting state and instruction sequence constitute a counterexample or QED-compatible bug trace.
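The transformation can be illustrated on a toy register machine. The sketch below is a minimal Python model under stated assumptions: a 32-entry register file split into halves, a hypothetical `(op, dst, src1, src2)` instruction encoding, register-only instructions, and the simplest legal ordering (all originals followed by all duplicates, no interleaving). The `faulty_index` parameter is an invented hook that corrupts one dynamic instruction to mimic a bug.

```python
from typing import List, Optional, Tuple

NUM_REGS = 32          # assumed register count
HALF = NUM_REGS // 2   # R0..R15 are original, R16..R31 are duplicate

Instr = Tuple[str, int, int, int]  # hypothetical encoding: (op, dst, src1, src2)

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "xor": lambda a, b: a ^ b,
}


def duplicate(instr: Instr) -> Instr:
    """Same operation, but on the registers reserved for duplicates."""
    op, dst, s1, s2 = instr
    return (op, dst + HALF, s1 + HALF, s2 + HALF)


def eddi_v(test: List[Instr]) -> List[Instr]:
    """Originals followed by their duplicates, in the same relative order."""
    for _, dst, s1, s2 in test:
        assert max(dst, s1, s2) < HALF, "original test must use original registers"
    return list(test) + [duplicate(i) for i in test]


def execute(program: List[Instr], regs: List[int],
            faulty_index: Optional[int] = None) -> List[int]:
    """Run the program; optionally corrupt the result of one instruction."""
    for i, (op, dst, s1, s2) in enumerate(program):
        value = OPS[op](regs[s1], regs[s2]) & 0xFFFFFFFF
        if i == faulty_index:
            value ^= 1  # injected error affecting a single dynamic instruction
        regs[dst] = value
    return regs


def qed_consistent(regs: List[int]) -> bool:
    """QED check: every original register matches its duplicate."""
    return all(regs[a] == regs[a + HALF] for a in range(HALF))


test = [("add", 1, 2, 3), ("sub", 4, 1, 2), ("xor", 5, 4, 4)]
start = list(range(HALF)) * 2                     # QED-consistent starting state
good = execute(eddi_v(test), list(start))
bad = execute(eddi_v(test), list(start), faulty_index=1)
```

In this model `good` remains QED-consistent, while `bad` does not, because the injected error changes the original `sub` but not its duplicate.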
B. Symbolic QED
Symbolic QED [6] combines QED transformations with bounded model checking [8, 17] for pre-silicon verification of a design. SQED creates a BMC problem to check all possible EDDI-V tests within a bounded number of clock cycles for a failing one. It searches for counterexamples to properties of the form Ra == Ra′. Here, Ra is an original register, and Ra′ is the corresponding duplicate register in an EDDI-V test. To ensure that all possible counterexamples are QED-compatible:
1. Original instructions must be valid instructions from the instruction set architecture (ISA) of the design;
2. The instruction sequence must be an EDDI-V test.
A QED module (a small hardware module that is only used during pre-silicon verification and does not incur area overhead for the final design) automatically transforms a sequence of original instructions into a QED-compatible sequence (e.g., as in Fig. 1). The QED module only requires that the input sequence is made up of valid instructions that read or write to only the original registers and memory (conditions that can be specified directly to the BMC tool). After execution, a signal is asserted (denoted below as $\mathit{QED}_{ready}$). All original and corresponding duplicate registers should contain the same values in a bug-free situation, i.e., the BMC tool checks that

$$\uparrow(\mathit{QED}_{ready}) \;\rightarrow\; \bigwedge_{a \in \{1, \ldots, N/2\}} \left(R_a == R_{a'}\right)$$

where N is the number of registers defined by the ISA. Here (for $a \in \{1, \ldots, N/2\}$), Ra and Ra′ correspond to original and duplicate registers. $\uparrow(\mathit{QED}_{ready})$ is true on any clock edge where $\mathit{QED}_{ready}$ transitions from low to high.
The starting state for the BMC run must also be a QED-consistent state, in which the value stored in each original register or memory location matches the corresponding duplicate register or memory location. This is to prevent spurious counterexamples from being generated. One way to obtain such a state is to run an EDDI-V test in simulation and stop immediately after QED checks have compared all register and memory values.
Symbolic QED can also detect HTs, if it finds an EDDI-V test for which the HT affects original registers and duplicate registers differently. For example, assume an HT is inserted (unknown to the designer) that activates when a 128-bit counter reaches its maximum value. Assume the HT changes an in-flight instruction to a NOP when it activates (cf., activation criteria A. These bugs and HTs escape S²QED, because they affect the original and duplicate CPU equivalently. In contrast, our new approach detects the bugs and HTs by creating scenarios where mismatches between original and duplicate instructions occur.
To count how many distinct logic bugs and HTs could exist from the above class, the benchmark core [18] [1] [2] [3] [4] , there are at least 12 billion distinct logic bugs and at least 17 billion distinct HTs that would not be caught by S 2 QED, but will be caught by our new technique.
Our approach also differs from S²QED in that it is especially suited for a broader class of processor designs, including OoO superscalar processors. This capability is enabled by the QED constraints we define (Section III.B), together with QED recorders (Section III.C) and a new QED module (Section III.A). In contrast to our approach, S²QED is applicable to processors with out-of-order writeback. This is possible due to additional constraints that restrict the state of the instruction pipeline. Such constraints can also be integrated in our framework. Further, S²QED does not require a QED module or QED recorders. However, it requires duplication of the CPU in the model of the design-under-test (only during pre-silicon verification), whereas our approach requires only a single CPU.
III. EXTENDING SQED WITH SYMBOLIC STARTING STATES
We now present our new extension of SQED [6] with symbolic starting states. In Section III.A, we describe the design of a new, improved QED module that is integrated in the model of the design-under-test during pre-silicon verification. These improvements enable the detection of real logic bugs that [6] fails to detect (see Section IV.A). Section III.B introduces a set of QED constraints on the symbolic starting state that allow us to avoid false positives. The sufficiency of these constraints is proven in Appendix B. To implement the QED constraints, we introduce QED recorders in Section III.C. These are additional hardware modules (used only during pre-silicon verification) that record a small subset of internal logic values of the processor core to ensure that the QED constraints are satisfied when a QED test begins. Fig. 2 contrasts Symbolic QED without/with symbolic starting states.
A. New QED Module for Single Processor Cores
Pseudocode for the new QED module is given in Fig. 3a . Inputs are: 1) enable: disables the QED module if false; 2) next_instruction: next instruction to be executed; 3) fetch_next: true when the core is ready to receive an instruction, i.e., the fetch stage is not stalled; 4) original: tells the core to execute an original (if true) or duplicate (if false) instruction. Outputs are: 1) instruction_valid: indicates whether the output instruction is valid; and 2) instruction_out: instruction to be executed. The QED module has internal variables: 1) queue: a queue data structure that stores previous original instructions that have not yet been executed in the duplicate subsequence; 2) head_instruction: the previous head of the queue; 3) insert_valid: true when an instruction is loaded into the queue; 4) delete_valid: true when the QED module can execute a duplicate instruction; 5) duplicate_instruction: next instruction in duplicate subsequence to execute (when original is false).
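The interface described above can be mimicked by a small behavioral model. The following Python sketch is not the pseudocode of Fig. 3a; it is a hypothetical reading of the listed inputs, outputs, and internal variables. The instruction encoding, the register split (`HALF`), the queue depth, and the pass-through behavior when `enable` is false are all assumptions made for illustration.

```python
from collections import deque
from typing import Optional, Tuple

HALF = 16  # assumed: registers 0..15 are original, 16..31 are duplicate
Instr = Tuple[str, int, int, int]  # hypothetical encoding: (op, dst, src1, src2)


def to_duplicate(instr: Instr) -> Instr:
    """Map an original instruction onto the duplicate register half."""
    op, dst, s1, s2 = instr
    return (op, dst + HALF, s1 + HALF, s2 + HALF)


class QEDModule:
    def __init__(self, depth: int = 4):
        self.queue = deque()   # originals not yet issued as duplicates
        self.depth = depth     # assumed bound on pending duplicates

    def step(self, enable: bool, next_instruction: Optional[Instr],
             fetch_next: bool, original: bool):
        """One cycle; returns (instruction_valid, instruction_out)."""
        if not enable:
            # Assumed pass-through when the module is disabled.
            return fetch_next, next_instruction
        if not fetch_next:
            return False, None             # fetch stage is stalled
        if original:
            insert_valid = len(self.queue) < self.depth
            if not insert_valid:
                return False, None
            self.queue.append(next_instruction)
            return True, next_instruction
        delete_valid = len(self.queue) > 0
        if not delete_valid:
            return False, None             # no pending duplicate to issue
        duplicate_instruction = to_duplicate(self.queue.popleft())
        return True, duplicate_instruction


qed = QEDModule()
A, B = ("add", 1, 2, 3), ("sub", 4, 1, 2)
trace = [
    qed.step(True, A, True, True),      # original A
    qed.step(True, B, True, True),      # original B
    qed.step(True, None, True, False),  # duplicate of A
    qed.step(True, None, True, False),  # duplicate of B
    qed.step(True, None, True, False),  # queue empty: nothing valid
]
```

Because the `original` input is free on every cycle, a solver driving this model could choose any interleaving in which each duplicate follows its original, which is the timing diversity the new module is meant to expose.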
QED checks occur when the qed_ready signal of the QED module is true. Pseudocode for determining this signal is given in Fig. 3b . To avoid trivial false positives, QED checks occur when an equal number of commits (writes) have been made to original registers and duplicate registers. This is accomplished by keeping track of the number of original and duplicate commits to the register set, as shown in Fig. 3b . For simplicity, in Fig. 3b , we assume that at most one instruction commits per cycle. For superscalar processors that can commit multiple instructions in the same cycle, we track all corresponding pairs of write_valid (tells whether the input data is valid) and write_address (the address for the data to be written) signals, keep a separate is_original signal (identifies if a write address corresponds to an original or duplicate location) for each instruction, and allow the original and duplicate counters to be incremented multiple times if needed.
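The commit-counting idea can be sketched as follows. This Python model is a simplified illustration, not the logic of Fig. 3b: the address split in `is_write_to_original_space` and the requirement that at least one commit has occurred before `qed_ready` is raised are assumptions. Passing several write ports per cycle covers the superscalar case described above.

```python
from typing import Iterable, Tuple


class QEDReady:
    def __init__(self, half: int = 16):
        self.half = half       # assumed: addresses below `half` are original
        self.num_orig = 0      # commits to original registers so far
        self.num_dup = 0       # commits to duplicate registers so far

    def is_write_to_original_space(self, write_address: int) -> bool:
        return write_address < self.half

    def step(self, writes: Iterable[Tuple[bool, int]]) -> bool:
        """writes: (write_valid, write_address) for each commit port this cycle."""
        for write_valid, write_address in writes:
            if not write_valid:
                continue
            if self.is_write_to_original_space(write_address):
                self.num_orig += 1
            else:
                self.num_dup += 1
        # qed_ready: equal, non-zero numbers of original and duplicate commits.
        return self.num_orig > 0 and self.num_orig == self.num_dup


ready = QEDReady()
history = [
    ready.step([(True, 1)]),               # one original commit
    ready.step([(True, 17)]),              # matching duplicate commit
    ready.step([(True, 2), (True, 3)]),    # two originals in one cycle
    ready.step([(True, 18), (False, 19)]), # only one valid duplicate commit
    ready.step([(True, 19)]),              # counts equal again
]
```

The checked property fires only on a rising edge of the signal, so cycles in which `qed_ready` merely stays high do not trigger additional comparisons.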
The old QED module of [6] requires that all original instructions complete, a waiting period occurs for the pipeline to be flushed, and duplicate instructions execute, before the qed_ready signal is asserted. In contrast, the new QED module allows arbitrary interleaving of the original and duplicate instruction subsequences, without requiring a waiting period. This additional timing diversity is made possible by giving the BMC tool control over the original input of Fig. 3a. The QED-ready logic (Fig. 3b) can be further enhanced as follows:
1. The current QED-ready logic is only applicable to single processor cores, since a multi-core system would require considering the original and duplicate commits across all cores. This can be challenging in situations where multiple cores operate with a shared address space. For simplicity, we do not consider this situation in this paper.
2. For some processors, e.g., superscalar processors with explicit register renaming (MIPS 10000 [19] and ARM's Cortex-A15 [20]), the designation of original or duplicate instruction cannot be made solely on physical address (unlike in Fig. 3b). This issue can be corrected by including the current state of the register mapping table as an input to the function is_write_to_original_space. Each time a QED check happens, the same mapping table must be used to map logical to physical addresses before comparing original and duplicate values. The RISC-V cores used in our experimental evaluation (see Section IV), however, do not have this issue.
B. QED Constraints
We first define some terminology used in the constraint definitions: i) Symbolic In-Flight (SIF) "instructions": symbols (i.e., state bits), part of the symbolic starting state (which will be assigned 0s and 1s by the BMC tool), corresponding to (microarchitectural) flip-flops within the pipeline that hold instructions during normal operation of the core; ii) TC: the point in time when all SIF instructions have committed (i.e., written to the architectural state). This is determined by the BMC tool. iii) Symbolic QED instructions: symbols which represent the instructions that form the bug trace (which is part of the counterexample, along with the starting state that BMC assigns) generated by the BMC tool after TC; and iv) Symbolic QED operand data: symbols representing the operand data of dispatched Symbolic QED instructions (dispatched before TC). Fig. 4 illustrates these definitions for a 3-stage in-order pipeline. When the formal analysis begins, there are up to 3 SIF instructions in the pipeline, and all commit by time TC. The first Symbolic QED instruction (R1=R1+5 in Cycle 1 of Fig. 4) is fetched into the pipeline, and its Symbolic QED operand data is available after the Dispatch stage. Now, the QED constraints are stated as follows (Appendix C further details how each constraint is enforced):
Constraint C-1. At TC, all SIF instructions have committed (i.e., no SIF instruction can write to the architectural state after TC), while all Symbolic QED instructions commit after TC.
Constraint C-2. At TC, the architectural state (program-visible registers and memory) is QED-consistent (Section II.B), and nothing but Symbolic QED instructions can write to architectural state after TC (e.g., test modes such as scan that bypass instructions to write to architectural state are disabled).
Constraint C-3. All the operand data for each Symbolic QED instruction I must satisfy one of the following properties: i) if operand data is available (i.e., I has already read data for this operand) at TC, then it matches the corresponding register or memory location (i.e., source operand location) data at TC; ii) if operand data is not available at TC, then I is waiting for the result of an earlier Symbolic QED instruction for this operand data.
The QED constraints form a sufficient condition to ensure no false positives, given that bug-free designs satisfy two assumptions after TC:
Assumption-1. If a Symbolic QED instruction is executed twice on the same data, it results in the same value being stored to architectural state, e.g., Rx=1+2 and Ry=1+2 always result in the same value stored to both registers Rx and Ry. Note: there is no assumption that the stored value is '3'.
Assumption-2. If a Symbolic QED instruction has a read-after-write dependency with earlier instructions, it uses the most recent value of the data in its computation. For example, in the program {R1=5; R2=R1+2; R3=R2-2}, if the first instruction stored '5' to architectural state, the second instruction will use value '5' for R1.
We can now state the main theorem of the paper. Formal definitions and full proofs are deferred to Appendix B.
Theorem 1. Let Constraints C-1, C-2, and C-3 be satisfied by a starting state of a processor core. Let Assumptions-1 and -2 hold after TC for any bug-free design of the core. If any EDDI-V test fails, the failure must be caused by a bug in the design.
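To make Constraint C-3 more tangible, the following Python sketch checks it on a toy snapshot taken at TC. It is an illustration under assumptions, not the encoding used with the BMC tool (see Appendix C for that): the `Operand` record, the use of list position as program order, and the register-only operands are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Operand:
    location: int                      # source register index
    available: bool                    # already read by the instruction at TC?
    data: Optional[int] = None         # value held if available
    waiting_on: Optional[int] = None   # index of the producing instruction


def satisfies_c3(instrs: List[List[Operand]], regs_at_tc: Dict[int, int]) -> bool:
    """instrs[k] lists the operands of the k-th Symbolic QED instruction."""
    for idx, operands in enumerate(instrs):
        for op in operands:
            if op.available:
                # Property i): captured data matches the source location at TC.
                if op.data != regs_at_tc[op.location]:
                    return False
            else:
                # Property ii): must wait on an EARLIER Symbolic QED instruction.
                if op.waiting_on is None or not (0 <= op.waiting_on < idx):
                    return False
    return True


regs = {1: 5, 2: 7}
legal = [
    [Operand(location=1, available=True, data=5)],
    [Operand(location=2, available=False, waiting_on=0)],
]
stale_data = [[Operand(location=1, available=True, data=6)]]
dangling_wait = [[Operand(location=2, available=False, waiting_on=None)]]
```

Here `legal` satisfies the constraint, `stale_data` violates property i) because the captured value disagrees with the register at TC, and `dangling_wait` violates property ii) because the operand is not tied to any earlier Symbolic QED instruction.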
Proof: See Appendix B. ∎ PROOF OUTLINE: We first define notation for a sequence of Symbolic QED instructions in a QED-compatible bug trace. Next, we isolate the first pair of Symbolic QED instructions which cause an EDDI-V test failure. We decompose the execution of these two instructions into a union of six mutually disjoint cases. For each case, we give a proof by contradiction (of one or more Assumptions) that there must be a bug in the design, thus concluding the proof. ∎
We also observed empirically (see Section IV) that at least one assumption was violated in each BMC bug trace.
C. Symbolic QED Recorders
QED recorders copy a small number of internal signals in a design (to track TC and Symbolic QED operands) so that we can specify the QED constraints to the BMC tool. For ease of understanding, we take an in-order core with single instruction fetch and 5-stage pipeline as a running example in Section III.C, but we explain how the technique is generalized to other cores. In Section IV, we present results for both in-order (scalar) and OoO (superscalar) cores.
Recorder for TC. As TC depends on the starting state chosen by the BMC tool, it cannot be statically determined before the formal analysis begins. A recorder is used to give this information to the BMC tool dynamically. For an in-order core, TC can be determined by simply tracking the progress of the first Symbolic QED instruction (the first symbolic instruction the BMC tool creates as part of the bug trace) until it reaches the commit stage (write-back stage) of the pipeline. At this time, all SIF instructions must have committed, as the pipeline is occupied by Symbolic QED instructions.
Specifics of the TC recorder for a 5-stage, single-fetch, in-order pipeline are given in Fig. 5. Inputs are ready signals for all stages that precede the commit stage (e.g., fetch_ready is true when the fetch stage is ready to receive an instruction). The output SIF_complete is true when the first Symbolic QED instruction goes through all pipeline stages and reaches the commit stage. The output mode keeps track of progress made so far by the Symbolic QED instruction (we later make use of this output in the Symbolic QED operand recorder). This TC recorder for a 5-stage pipeline can be easily modified to support in-order pipelines with a different number of stages.
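A TC recorder of this kind behaves like a small state machine. The Python sketch below is a simplified model rather than the design of Fig. 5: the encoding of `mode`, the indexing of the ready signals, and the assumption that the commit stage always accepts an instruction one cycle after the last pre-commit stage are all choices made here for illustration.

```python
from typing import Sequence


class TCRecorder:
    PRE_COMMIT_STAGES = 4  # a 5-stage pipeline has four stages before commit

    def __init__(self):
        # mode 0: first Symbolic QED instruction not yet fetched
        # mode k (1..4): instruction sits in pre-commit stage k
        # mode 5: instruction has reached the commit stage
        self.mode = 0

    @property
    def sif_complete(self) -> bool:
        return self.mode == self.PRE_COMMIT_STAGES + 1

    def step(self, ready: Sequence[bool]) -> bool:
        """ready[k] is true when pre-commit stage k can accept an instruction."""
        n = self.PRE_COMMIT_STAGES
        if self.mode < n:
            if ready[self.mode]:      # next stage accepts: advance one stage
                self.mode += 1
        elif self.mode == n:
            self.mode += 1            # assumed: commit accepts unconditionally
        return self.sif_complete


all_ready = [True, True, True, True]
stalled = [True, True, False, True]   # third pre-commit stage not ready

rec = TCRecorder()
no_stall = [rec.step(all_ready) for _ in range(5)]

rec2 = TCRecorder()
with_stall = [rec2.step(r) for r in
              [all_ready, all_ready, stalled, all_ready, all_ready, all_ready]]
```

Changing `PRE_COMMIT_STAGES` adapts the model to pipelines of a different depth, mirroring the remark that the recorder generalizes to other in-order pipelines.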
For an OoO core, the TC recorder is even simpler, and utilizes the reorder buffer (ROB). The idea is to mark the entry allocated in the ROB for the first Symbolic QED instruction. After this, SIF_complete is assigned true when the ROB head pointer reaches the marked instruction. For cores with no ROB, but OoO commit (e.g., [18]), an additional constraint is required (see Section IV).
Symbolic QED operand recorder. Like TC, the Symbolic QED operands also depend on the starting state. The Symbolic QED operand recorder stores information for both register and memory operands. Specifics of the Symbolic QED operand recorder for a 5-stage, single-fetch, in-order pipeline are given in Fig. 6. Inputs are: 1) *_addr, which gives register/memory address of the corresponding operand; 2) *_data, which gives operand data; 3) *_valid, which is true when *_addr is valid and *_data is valid; 4) mode, which gives the state of the TC recorder (Fig. 5). Output *_buffer stores all Symbolic QED operands and their values (buffer depth is determined by the maximum number of instructions in-flight at a given time). We only store the information for Symbolic QED instruction operands in buffers, i.e., we do not store operand information of any SIF instruction. This is enforced by checking the TC recorder state, i.e., mode (we do not add entries to *_buffer until all SIF instructions pass through the dispatch stage). In Fig. 6, we assume that each instruction requires at most two register values and one memory value, but the idea is easily extended to more source operands.
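The operand recorder for the in-order case can be pictured as a set of gated buffers. The following Python sketch is a hypothetical model, not the design of Fig. 6: the buffer names (`rs1`, `rs2`, `mem`), the `dispatch_mode` threshold on the TC recorder state, and the tuple layout `(valid, addr, data)` are assumptions introduced for illustration.

```python
from typing import Dict, List, Tuple


class OperandRecorder:
    def __init__(self, depth: int, dispatch_mode: int):
        self.depth = depth                  # max instructions in flight
        self.dispatch_mode = dispatch_mode  # assumed TC-recorder mode at which
                                            # SIF instructions have left dispatch
        self.buffers: Dict[str, List[Tuple[int, int]]] = {
            "rs1": [], "rs2": [], "mem": [],
        }

    def step(self, mode: int, operands: Dict[str, Tuple[bool, int, int]]) -> None:
        """operands maps a buffer name to its (valid, addr, data) signals."""
        if mode < self.dispatch_mode:
            return  # operands still belong to SIF instructions: do not record
        for name, (valid, addr, data) in operands.items():
            if valid and len(self.buffers[name]) < self.depth:
                self.buffers[name].append((addr, data))


rec = OperandRecorder(depth=2, dispatch_mode=3)
# Cycle during which a SIF instruction is still dispatching: ignored.
rec.step(1, {"rs1": (True, 1, 10), "rs2": (True, 2, 20), "mem": (False, 0, 0)})
# First Symbolic QED instruction dispatches: valid operands are recorded.
rec.step(3, {"rs1": (True, 1, 7), "rs2": (False, 0, 0), "mem": (True, 0x40, 99)})
```

After these two cycles only the second set of operands appears in the buffers, which is the gating behavior that keeps SIF operand data out of the constraint check.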
For an OoO core, Fig. 6 is extended to include Symbolic QED operands that are waiting on results of earlier Symbolic QED instructions. For each waiting operand, we also store the instruction tag (ROB entry number) of the instruction it is waiting for. This information is used to specify Constraint C-3 for an OoO core (see Appendix C).
IV. RESULTS
In this section, we demonstrate the effectiveness of our new technique on two open-source RISC-V processor cores: i) Vscale [16] , an in-order core targeting embedded applications; and ii) RIDECORE [15] , an OoO superscalar core (2-way pipeline, 64 maximum instructions in-flight, 2 ALUs, 1 multiplier, 1 load/store unit) for high performance applications. For BMC, we used the Questa Formal tool (version 10.5c) from Mentor Graphics on an AMD Opteron 6438 with 128 GB of RAM. For each core, we instrumented the new QED module (Section III.A), QED constraints (Section III.B), and QED recorders (Section III.C).
A. Previously Unknown Bugs
We first found three previously unknown logic bugs in the multiplier reservation station (RS-m) of the RIDECORE design (all three confirmed by RIDECORE designers [21], see Table 1). These bugs only activate when back-to-back multiply instructions execute on successive clock cycles. They were detected due to the new QED module of this paper (see Section III.A). This design improves upon [6] by allowing arbitrary interleaving of original and duplicate instruction subsequences in EDDI-V tests (see Section II.A), without requiring a waiting period between them. The QED module of [6] cannot detect these bugs, and S²QED [14] is not applicable to RIDECORE. We also found two bugs in Vscale (Table 2), by running Symbolic QED with the new QED module, starting at a concrete, power-on reset state in less than 40 seconds (also confirmed by designers). These bugs are due to errors in the Vscale implementation of the RISC-V privileged ISA [22], within specific Control and Status Registers (CSRs). Importantly, Vscale does not implement shadows for CSRs. To circumvent this, the EDDI-V transformation (Section II.A) duplicates instructions using a scratchpad memory for each CSR.
B. "Difficult" Logic Bugs and HT Scenarios
We simulated 120 (117) logic bug types using RIDECORE (Vscale). These are "longer" (up to 256 consecutive activation instructions) versions of "difficult" logic bugs (see Appendix A; Table A.1.a-b) that occurred in various commercial designs [6]. We also simulated 195 (156) difficult HT scenarios (see Appendix A; Table A.2.a-c), which encompass over 100 papers in the research literature (see Section V.C), using RIDECORE (Vscale). Results are in Table 3.
Observation 1: SQED with symbolic starting states correctly and automatically found all "long" logic bugs, in less than 30 minutes, with no false positives. It found bugs that traditional BMC methods (including SQED with a concrete starting state) fail to detect. SQED with a concrete starting state detected only 5% of these bugs in RIDECORE and 33% of these bugs in Vscale.
Observation 2: SQED with symbolic starting states correctly and automatically found all injected HTs (including those designed to evade state-of-the-art HT detection techniques; see Section V and Appendix A), in less than 2.5 hours, without requiring design-specific assertions or debug of false positives. SQED with a concrete starting state detected only 9% of these HTs in RIDECORE and 15% of these HTs in Vscale.
C. "Extremal" Bugs
To further demonstrate the robustness of our presented technique, we inject "extremal" bugs (only triggered when the design reaches a very specific set of states) into RIDECORE. We focused on RIDECORE for this experiment since it is OoO, superscalar, and more complex than Vscale. Also, S²QED [14] is not applicable to RIDECORE. Our extremal bug injection methodology is as follows: i) Run Matrix Multiply (1M cycles) on the design in simulation, and stop the simulation at a random clock cycle; ii) Run a uniform random sequence of 100 ALU or Load/Store instructions; iii) Select a uniformly random subset of flip-flops from the set of all flip-flops in the design and record their logic values; and iv) Generate a bug (effect A.1.b.3 of Appendix A) and inject it into the design. This bug activates when the design reaches a state where all the selected flip-flops (step iii) hold the values recorded in step iii.
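Steps iii and iv amount to building a trigger predicate over a random subset of state bits. The Python sketch below illustrates only that predicate; it does not model RIDECORE or the simulation in steps i and ii. The flat list of 1,024 state bits, the fixed random seed, and the function names are assumptions made for this example.

```python
import random
from typing import Dict, List


def make_extremal_trigger(state: List[int], subset_size: int,
                          rng: random.Random) -> Dict[int, int]:
    """Step iii: pick flip-flops uniformly at random and record their values."""
    chosen = rng.sample(range(len(state)), subset_size)
    return {ff: state[ff] for ff in chosen}


def bug_active(trigger: Dict[int, int], state: List[int]) -> bool:
    """Step iv: the bug fires only if every selected flip-flop matches."""
    return all(state[ff] == value for ff, value in trigger.items())


rng = random.Random(0)
# Stand-in for the design state captured after steps i and ii (hypothetical).
captured = [rng.randint(0, 1) for _ in range(1024)]
trigger = make_extremal_trigger(captured, subset_size=128, rng=rng)

# Any state that differs in a single selected flip-flop does not activate it.
near_miss = list(captured)
first_ff = next(iter(trigger))
near_miss[first_ff] ^= 1
```

With 128 selected flip-flops, a state must agree on all 128 recorded bits to activate the bug, which is why reaching it from a concrete starting state is so unlikely within a short BMC bound.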
We present our results in Table 4. For generating such extremal bugs, we randomly chose 180 time points (step i), ranging from 26,026 to 988,159 clock cycles elapsed from program start. For each time point, we ran a random 100-instruction sequence (step ii), and then randomly selected 10 different subsets of 128 flip-flops (step iii), resulting in 1,800 extremal bugs in total. Using the Questa Formal tool (version 10.5c) from Mentor Graphics on 6 (in parallel) AMD Opteron 6438 machines with 128 GB of RAM, we were able to run roughly 60 experiments to completion each day. We stopped at 1,800 experiments, after roughly 1 month of runtime.
Whereas Symbolic QED with concrete starting state detected 0% of these 1,800 "extremal" bugs, Symbolic QED with symbolic starting state was able to detect 1,763 of these 1,800 bugs. For the remaining cases, the BMC tool timed out after 24 hours. A closer inspection reveals that the BMC tool was not able to unroll the design beyond 7 clock cycles (8 clock cycles are needed to observe these bugs). In future work, we plan to investigate ways to improve BMC tools to address such issues (following approaches such as [23, 24]).
Table 4. "Extremal" logic bugs in RIDECORE.
We report [min, avg., max].
Observation 4: Our new Symbolic QED with symbolic starting states correctly and automatically found 97.9% of the "extremal" logic bugs and generated a bug trace in less than 2.5 hours. In contrast, Symbolic QED with concrete starting state detected 0% of the "extremal" bugs.
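The figures quoted in this experiment are mutually consistent, as the following short check shows. It only recomputes numbers already stated in the text (180 time points, 10 subsets each, 1,763 detections, roughly 60 experiments per day); the variable names are illustrative.

```python
time_points = 180          # randomly chosen simulation stop points (step i)
subsets_per_point = 10     # random 128-flip-flop subsets per time point (step iii)
total_bugs = time_points * subsets_per_point

detected = 1763            # bugs found with symbolic starting states
undetected = total_bugs - detected
detection_rate = 100 * detected / total_bugs

experiments_per_day = 60   # approximate throughput reported in the text
days = total_bugs / experiments_per_day

print(f"{total_bugs} bugs, {undetected} timed out, "
      f"{detection_rate:.1f}% detected, ~{days:.0f} days")
# prints "1800 bugs, 37 timed out, 97.9% detected, ~30 days"
```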
V. RELATED WORK
In this section, we compare and contrast existing pre-silicon verification techniques for logic bug detection (Section V.A) and HT detection (Section V.B) with the method of this paper. In Section V.C, we provide a survey of HT attacks implemented in research literature. We show that each attack fits in one of the categories (Appendix A; Tables A.2.a-b) used in our HT experiments (see Section IV.B).
A. Existing Pre-silicon Verification Methods for Logic Bugs
Existing formal verification techniques employing BMC [6, 10] have difficulty detecting logic bugs that require a long activation sequence. Other works for processor cores use theorem proving [25], or try to learn invariants of the design [26] to be used as constraints, but these techniques tend to be ad hoc and require a high level of manual effort. In seminal work [27] and extensions [28], models of processors were verified based on abstractions by uninterpreted functions with equality. That approach generally requires invariants to be provided in order to avoid false positives. E-QED [29] is a BMC-based technique for electrical bug localization in post-silicon validation. Apart from being BMC-based, it is substantially different from our technique; e.g., it does not rely on the duplication of instructions.
False positives are a major challenge for traditional BMC. However, the same QED constraints (Section III.B) used by our approach may not prevent false positives for general property checking using BMC. The following example illustrates this point. Let a processor core start at a state where the Exception Program Counter (EPC) (i.e., the register storing the return address for an exception) is misaligned (i.e., not aligned with any word in the instruction cache), the current PC is within an exception handling routine, and there are only NOP instructions in the pipeline. This is an unreachable state for processors with strict alignment rules (e.g., MIPS [19] ). It is reasonable to check the property that the EPC is aligned, since returning to a misaligned address can cause programs to crash. Even at time TC, when the NOP sequence is finished, this EPC will still be misaligned, causing a false positive. With QED constraints (Section III.B), we do not get such a false positive because the exception handling routine will be filled by valid QED tests. Hence, any time we assert a QED check, it will not fail unless there is a bug in the design.
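The EPC example can be reproduced on a toy model. The sketch below is an assumption-laden illustration: the 4-bit address space, the two-instruction "ISA" (`nop`, `trap`), and the explicit-state bounded search that stands in for a BMC tool are all hypothetical and not part of the paper's setup. It only shows that a general property (EPC alignment) holds on every state reachable from reset, yet is violated from an unconstrained starting state.

```python
from itertools import product

WORD = 4
ADDR_SPACE = range(16)  # toy 4-bit address space (assumption)

def step(state, instr):
    """Toy next-state function; state is the pair (pc, epc)."""
    pc, epc = state
    if instr == "nop":
        return ((pc + WORD) % 16, epc)   # EPC is left untouched
    if instr == "trap":
        return (0, pc)                   # exception: save the current PC in EPC
    raise ValueError(instr)

def epc_aligned(state):
    """General property under check: the EPC is word-aligned."""
    return state[1] % WORD == 0

def violations(start_states, depth):
    """Return the start states from which the property fails within `depth` steps."""
    bad = []
    for s0 in start_states:
        frontier = {s0}
        found = not epc_aligned(s0)
        for _ in range(depth):
            frontier = {step(s, i) for s in frontier for i in ("nop", "trap")}
            found = found or any(not epc_aligned(s) for s in frontier)
        if found:
            bad.append(s0)
    return bad

reset = [(0, 0)]                                      # concrete starting state
all_states = list(product(ADDR_SPACE, ADDR_SPACE))    # unconstrained starting state

concrete_bad = violations(reset, depth=4)
symbolic_bad = violations(all_states, depth=4)

assert concrete_bad == []        # property holds on states reachable from reset
assert (0, 2) in symbolic_bad    # misaligned EPC start state: spurious counter-example
```

The start state `(0, 2)` plays the role of the misaligned EPC in the text: a NOP sequence never repairs it, so the property check reports a failure even though no reachable execution of the toy model produces that state.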
B. Existing Pre-silicon Verification Methods for HTs
Existing HT detection techniques that can be applied in pre-silicon verification broadly belong to two categories: i) design analysis methods; and ii) formal methods [30]. One class of design analysis techniques uses the observation that signals associated with HTs may be mostly unused or rare. [5, 31-33] use simulation data along with rareness metrics (e.g., code coverage, signal correlation). [34, 35] do not need simulation data, but still trade off false positives (i.e., spurious detection of HTs) for false negatives (i.e., failure to detect HTs) and vice versa, depending on the thresholds set for their rareness metrics. Additionally, stealthy HTs have been designed [11] to bypass such analyses. In contrast, our technique does not require simulation data, detects the stealthy HTs given in [11, 36, 37], and does not produce any false positives. However, our technique is for processor cores, while the aforementioned analysis techniques are applicable to general designs.
Formal methods for finding HTs generally use BMC [12, 16], SAT-based equivalence checking [13, 38, 39], or theorem proving [40-42]. These techniques face similar challenges to BMC-based techniques for logic bug detection, in addition to manual creation of properties.
Complementary approaches to ours include techniques to detect HTs that leak sensitive data, but do not produce incorrect logic values [43-46], and HT prevention methods [47-49].
Table 5 (excerpt). Reference | HT type | Design
[40] | (No explicit designs) | 8051 Microcontroller, RC5
[69] | a.1-a.4 (all TrustHub HTs) | All TrustHub benchmarks
[70] | a.1-a.4 (characterizes TrustHub) | All TrustHub benchmarks
[71] | a.1-a.4 (TrustHub, DeTrust), a.2 (XOR-LFSR) | All TrustHub, DeTrust, XOR-LFSR (special case of 1.2)
[32] | (No explicit designs) | Leon3
[72] | a.3 (M2=1) | Sense-Amplifier
[44] | a.1-a.3 (11 TrustHub Trojans) | AES, RSA
[73] | a.3 (M2=32) | 16-bit Multiplier, 32-bit RSA (both within ARM SoC)
[74] | a.3 (12 HTs of various size) | ISCAS-85 benchmarks
[75] | a.3 (8 HTs of various size) | Alpha encryption module on Spartan-3 FPGA
[45] | a.1-a.4 (Details not sufficient - TrustHUB) | DES encryption core
[41] | a.1-a.4 (Details not sufficient - TrustHub) | 8051 Microcontroller
[76] | a.1-a.4 (Details not sufficient - TrustHub) | DES encryption core
[77] | a. Attack on processor data memory
[93] | a.3 (No details given) | MPEG and 5 DSP accelerators (C programs)
[94] | a.3 (M2=1) | 11 Different Arithmetic, DSP cores (C-code accelerators)
[95] | a.3 (M2=1) | IIR Filter Accelerator (C program)
[96] | a.3 (M2=64); (4 HTs) | Ethernet MAC 10G circuit
[97] | a.1-a.4 (details not given; 7 HTs) | TEA encryption on FPGA
[39] | a.1-a.4 (details not given; 37 HTs) | ITC-99 benchmarks (slightly modified)
[98] | (No explicit designs given) | ISCAS-85 and ISCAS-89 benchmarks
[99] | a.3 (M2 arbitrary) | Leon3
[100] | a.1-a.4 (9 HTs) | RS232 serial interface
[101] | a
C. Literature Survey of Implemented HT Attacks
To confirm that our list of HTs (Appendix A; Table A.2.a-b) is representative, we surveyed over 100 different papers which categorize and/or construct example HTs for experimental evaluation. All explicit HT constructions are shown in Table 5 (with parameters given, if noted explicitly in the paper). Many of the papers use a large subset of the TrustHub [37] or DeTrust [11] benchmarks. Both of these benchmarks are covered by HT activation criteria from Table A.2.a. Additional papers [113-132] not included in Table 5 are surveys or papers that describe attacks in words. We confirmed that there are no HT attack types in [113-132] that we do not encompass.
VI. CONCLUSION
In this paper, we extended Symbolic QED to include symbolic starting states. As a result, we overcome limitations of existing pre-silicon verification techniques for detecting logic bugs and HTs that require long activation sequences. The unique combination of Symbolic QED and QED constraints enables us to achieve this objective. Our results on multiple open-source RISC-V processor cores demonstrate the effectiveness and practicality of our approach: i) detection of previously unknown logic bugs within minutes; ii) detection of 100% of hundreds of long logic bugs and HTs (SQED with a concrete starting state detects, at best, 33%); iii) detection of 97.9% of "extremal" logic bugs (SQED with a concrete starting state detects 0%). Future research directions include: i) extending our approach to detect bugs and HTs in other SoC components beyond processor cores, such as uncore components and accelerators; ii) handling other QED transformations beyond EDDI-V, e.g., CFTSS-V and CFCSS-V [6]; iii) automated methods for inserting QED recorders and generating QED constraints on the symbolic starting state; iv) theoretical comparison of the bug detection capabilities of [6], [14], and the method of this paper.
APPENDIX A: LOGIC BUG AND HARDWARE TROJAN TYPES
In the following tables, we give the different logic bug (harder versions of "difficult" bugs that occurred in various commercial designs [6]) and HT scenarios (from research literature) used in Table 3 of Section IV. Each "long" logic bug is modeled with two parts: i) activation criteria of the bug (Table A.1.a), i.e., the conditions which need to be satisfied for the bug to activate; and ii) effect of the bug once it is activated ( [12] . We create stealthy HTs that are known to evade common detection techniques (e.g., HT designs from [36] evade detection techniques based on UCI [32] and coverage metrics [5, 32]). An HT scenario is formed by using one activation criterion (Table A.2.a) with one bug effect (Table A.2.b), along with an appropriate design strategy (Table A.2.c). We used a wide range of HT scenario parameters, given in Table A [37].
Logic bugs and HTs were injected by introducing a small state machine into the design that checks for the activation criteria, and flips bits at desired wires to achieve the effect.
HT design | Detection techniques evaded
[36] | UCI [32]; coverage metrics [5, 32]
[11] | [5, 32, 33, 34]
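The injection mechanism described above (a small state machine that watches for the activation criteria and then flips bits on a wire) can be sketched as follows. This is a hypothetical illustration only: the class `InjectedBug`, the chosen activation criterion (a run of consecutive matching values), and the assumption that the bug stays active once triggered are not taken from the tables of this appendix.

```python
class InjectedBug:
    """Toy trigger FSM: activates after `threshold` consecutive matching inputs."""

    def __init__(self, trigger_value, threshold, bit_to_flip):
        self.trigger_value = trigger_value
        self.threshold = threshold
        self.bit_to_flip = bit_to_flip
        self.count = 0        # length of the current run of matching inputs
        self.active = False   # assumption: once activated, the bug stays active

    def observe(self, value):
        """Advance the state machine by one clock cycle."""
        if self.active:
            return
        self.count = self.count + 1 if value == self.trigger_value else 0
        if self.count >= self.threshold:
            self.active = True

    def apply(self, wire_value):
        """Effect: flip one bit of the target wire while the bug is active."""
        if self.active:
            return wire_value ^ (1 << self.bit_to_flip)
        return wire_value

bug = InjectedBug(trigger_value=0xAB, threshold=3, bit_to_flip=0)
inputs = [0xAB, 0xAB, 0x00, 0xAB, 0xAB, 0xAB]

outputs = []
for value in inputs:
    bug.observe(value)
    outputs.append(bug.apply(0b1000))  # the fault-free wire value is 8 in every cycle

assert outputs == [8, 8, 8, 8, 8, 9]   # the bit flips only after three matches in a row
```

Lengthening the required run (or combining several such conditions) lengthens the activation sequence, which is the property that makes these scenarios hard for BMC from a concrete starting state.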
APPENDIX B: MATHEMATICAL PROOFS
In this Appendix, we formalize our assumptions on how any bug-free design should operate. We relate these assumptions to Section III.B, and prove Theorem 1.
We first provide preliminary definitions [133]. We will use the term alphabet to refer to a nonempty set of symbols. We will often be dealing with operations on words (sequences of symbols from countable alphabets). We will denote the empty word as ε, and if u and v are two words over the same alphabet, we denote their concatenation as uv. For finite vectors of equal length, x = (x_1, …, x_n) and y = (y_1, …, y_n), we write Δ(x, y) for the subset of indices they differ on.
The next definition provides the most general model of the computation used in our statements.
Definition 1 (Transition System). A tuple M = 〈S, S_0, A, T, ℱ〉 is called a transition system if
• S is a countable alphabet of states.
• S_0 ⊆ S is the nonempty subset of initial states.
• A is a nonempty, finite set of actions.
• T ⊆ S × A × S is the transition relation.
• ℱ ⊆ S is the nonempty subset of accept states.
For each transition system, we also associate a specification function δ: S × A → S. This is the function such that s_1 = δ(s_0, a) ⟺ 〈s_0, a, s_1〉 ∈ T. The state-space S contains all states that can be represented by the transition system. S_0 is the set of states that the transition system can begin in for any execution. A is a set of actions that can be applied to steer the system from one state to the next. T is the transition relation, a countable set which represents the mapping that each action implements. ℱ is a set of accept states, in which executions can end.
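As a concrete reading of Definition 1, the following Python sketch encodes a transition system as a five-field record and derives the specification function from the transition relation. It is illustrative only: the class and field names are hypothetical, the state set is finite here (the definition allows any countable set), and the relation is assumed to be functional so that the specification function is well defined.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable, Optional, Tuple

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class TransitionSystem:
    states: FrozenSet[State]                              # state-space
    initial: FrozenSet[State]                             # nonempty subset of states
    actions: FrozenSet[Action]                            # nonempty, finite
    transitions: FrozenSet[Tuple[State, Action, State]]   # transition relation
    accepting: FrozenSet[State]                           # nonempty subset of states

    def __post_init__(self):
        # Check the side conditions listed in the definition.
        if not self.initial or not self.initial <= self.states:
            raise ValueError("initial states must be a nonempty subset of states")
        if not self.actions:
            raise ValueError("the action set must be nonempty")
        if not self.accepting or not self.accepting <= self.states:
            raise ValueError("accept states must be a nonempty subset of states")
        for s, a, t in self.transitions:
            if s not in self.states or a not in self.actions or t not in self.states:
                raise ValueError(f"ill-formed transition {(s, a, t)}")

    def spec(self, state: State, action: Action) -> Optional[State]:
        """Specification function: the successor of `state` under `action`, if any."""
        successors = [t for (s, a, t) in self.transitions if s == state and a == action]
        if len(successors) > 1:
            raise ValueError("transition relation is not functional")
        return successors[0] if successors else None

# Usage: a 2-bit counter with a single "inc" action (a made-up example).
counter = TransitionSystem(
    states=frozenset(range(4)),
    initial=frozenset({0}),
    actions=frozenset({"inc"}),
    transitions=frozenset((s, "inc", (s + 1) % 4) for s in range(4)),
    accepting=frozenset({3}),
)
assert counter.spec(3, "inc") == 0
```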
The next definition describes finite sequences of actions, and the states they allow a system to traverse.
Definition 2 (Path
For any processor core, we assume there exists a transition system model (with finite state-space) that defines the behavior of the system. We call this transition system the ISA (e.g., [134, 135]) of the core. The actions in an ISA are called instructions. This concept is formalized in the next definition.
Definition 3 (ISA
However, we also know that because an EDDI-V test failed. This implies that the same operation (op) performed on the same data results in two different values when executed twice, contradicting Assumption-1 of any bug-free design. Therefore, for case (A), Theorem 1 holds.
Next, assume (B) holds instead of (A). Depending on when the operands' data are available for the instructions to compute on, we have five mutually disjoint subcases for (B) (for each subcase, we show that Theorem 1 holds): (B.1) At TC, data for both operands x OE and x OE -is available. As Constraint C-3 (i) holds, we know that
