Security and reliability have become important concerns in the design of computer systems. On one hand, microarchitectural enhancements for security (such as for dynamic integrity checking of code at runtime) have been proposed. On the other hand, independently, microarchitectural enhancements for reliability to detect and tolerate natural faults have also been proposed. A fault in these security enhancements due to alpha particles or aging might potentially pass off maliciously modified instructions as safe, rendering the security enhancements useless. Attackers can also launch deliberate fault attacks to disable the security enhancements and then launch well-known security attacks that would otherwise have been detected by these enhancements. We report integrated microarchitecture support for security and reliability in multicore processors. Specifically, we add integrity checkers to protect the code running on the multiple cores in a multicore processor. We then adapt these checkers to check one another periodically to ensure reliable operation. These checkers can also naturally check the other parts of the core. The average performance, power, and area costs for these security-reliability enhancements are 6.42%, 0.73%, and 0.53%, respectively.
INTRODUCTION
Computer systems are being continuously compromised. Recent attacks, like the Evil Maid attack [Rutkowska 2009], record secure data from a computer without leaving a trace. Attackers also exploit vulnerabilities in software, such as lack of bounds checking, to overrun unchecked buffers and maliciously transfer program control to attack code (this is also called a stack smashing attack) [Cowan et al. 2000], or modify the program binary while it is still in main memory [Burrows et al. 2003], thereby compromising the system. Software-based protection approaches like anti-virus and antispyware tools and OS patches are not totally effective in preventing such attacks [Aaraj et al. 2007]. This is because these software-based solutions might have as yet unknown vulnerabilities that can be discovered and exploited by an attacker to launch new attacks. Thus, hardware-based approaches that involve adding an on-chip checker to verify the integrity of the code have been proposed [Kanuparthi et al. 2012a; Fiskiran and Lee 2004; Gelbart et al. 2005; Kirovski et al. 2002; Suh et al. 2005] to secure programs running on the processors. This is called dynamic integrity checking. The on-chip checker calculates hash values of basic blocks in the program at runtime and compares them against precomputed hash values obtained at compile time.
This work was performed while Arun Kanuparthi was a Ph.D. student at New York University. Arun Kanuparthi is now a Security Researcher at Intel Corporation. The views expressed in this article solely belong to the authors and do not in any way reflect the views of Intel Corporation. The countermeasures described in this article were implemented in experimental hardware and software environments. The authors of this article have not explored the potential applicability of these countermeasures to commercially available hardware and software. Authors' addresses: A. K. Kanuparthi, 2111 NE 25th Ave, JF4, Hillsboro, OR 97124; email: arun.kanuparthi@intel.com; R. Karri, Department of Electrical and Computer Engineering, New York University, 5 MetroTech Center, Brooklyn, NY 11201; email: rkarri@nyu.edu.
Moore's law has continually provided exponential growth in the number of transistors on a single chip for several decades. General-purpose processor designs have evolved by attempting to find parallelism in sequential programs through approaches such as pipelined single-issue processors, in-order superscalars, and out-of-order superscalars. Due to limited performance improvements and sky-rocketing energy consumption, the computer industry started migrating to multicore processors. Multicore processors are being used in desktops, mobile phones, gaming consoles, and even in unmanned aerial vehicles. Multicore processors contain several simple cores rather than a single large core. This way they are more energy- and area-efficient than large monolithic cores. Complex multicore processors are increasingly susceptible to a variety of faults, thus motivating the need for microarchitecture support for reliability. Faults can be transient or permanent in nature. A transient fault occurs once and then does not persist. A permanent fault occurs as devices wear out and persists from that time onward. A fault can manifest itself as an error.
Whereas both security and reliability deal with errors, whether malicious or unintended, until now security and reliability have been addressed independently. We propose microarchitecture enhancements to a multicore processor that simultaneously consider security and reliability. The key idea of the proposed enhancements is to use the security components (e.g., those that verify the integrity of the code running on the processors) to perform dynamic integrity checking in a reliable manner. Similarly, reliability enhancements can improve security as well by making sure that the security modules do not become single points of failure. The security components check each other periodically for reliable operation. On encountering a permanent fault, the faulty component is turned off and the execution continues with other components. Furthermore, the security components can detect faults in the instruction fetch path, thus adding an extra line of defense. In this article, we use dynamic integrity checkers as the security component exemplar.
Motivation
Hardware-based dynamic integrity checking adds a component on the chip that is integrated with the pipeline to monitor the integrity of the instructions. This component compares the hash value of the instructions against a corresponding precomputed hash value. If the values match, the instructions have not been maliciously modified and the attacker has not transferred control to malicious code. The instructions are allowed to proceed and make changes to the architectural state of the processor. Otherwise, the program is aborted. However, security does not ensure reliability. Faults can subvert the security provided by these integrity checkers. A stuck-at fault near the output of integrity checkers such as those of Fiskiran and Lee [2004] and Kanuparthi et al. [2012a] can result in a security breach. For instance, a stuck-at-1 fault at the output always allows the instructions to commit, even though the checker correctly detects a breach. Similarly, reliability does not ensure security. For instance, all the components on the chip might be functioning reliably, but an attacker can still gain ownership of the system by performing attacks such as stack smashing, program binary modification, or malicious code injection.
To some extent, integrity checkers can detect faults in the cores that manifest as errors. For instance, faults in the instruction caches, or in the interconnect between the caches can be detected. Thus, they add an additional line of defense. However, a fault in the integrity checker and other security components can render them ineffective. An attacker can inject faults [Kim and Quisquater 2007; Pellegrini et al. 2010; Li et al. 2012 ] to disable the integrity checkers, and then launch security attacks that would have otherwise been detected by the integrity checkers. Thus, there is a need to jointly consider reliability and security at a microarchitectural level. The challenge is to do this without high performance, hardware, or area overheads.
To the best of our knowledge, there is no related work in the literature that considers both security and reliability together. In this article, we present a novel way to secure the programs running on the processors by adding dynamic integrity checkers and ensuring their reliability. Further, we propose to use integrity checkers to check each other periodically and also detect faults in the system. In addition, the proposed architecture ensures liveness of the system in the event of a catastrophic failure of any component without too much impact on performance. Execution is aborted if there is a security breach, but the program continues to execute in the presence of faults.
Key Contributions
With this work, we propose a microarchitecture for reliable integrity checking in multicore processors. We achieve this through the following contributions:
- We provide a framework for dynamic integrity checking in multicore processors using security components that are tightly integrated with the processor pipeline.
- We ensure reliable operation of these checkers by making the checkers check each other periodically using pseudo-randomly generated inputs. If a fault is detected, the checkers check themselves using hardcoded input patterns that achieve 100% fault coverage. The faulty component is isolated and deactivated.
- We propose a solution that is scalable to manycore processors. The integrity checkers are connected using a ring, which is activated when one checker fails. The load of integrity checking is then shared by all the other checkers connected by the ring.
- We extensively evaluate the performance, area, and power overheads of our technique on a cycle-accurate simulator that models an x86 processor containing 4, 8, and 16 cores, using a wide variety of benchmarks from the SPEC CPU2006, BioBench, and STREAM suites. The average performance, power, and area costs for these security-reliability enhancements are 6.42%, 0.73%, and 0.53%, respectively.
The rest of the article is organized as follows. Related work on hardware-based security and reliability in microprocessors is discussed in Section 2. In Section 3, the threat and fault models are described. The proposed enhancements to a multicore processor are explained in Section 4. Performance, security, and reliability analyses, along with the hardware cost are presented in Section 5. We conclude in Section 6 with a discussion on the limitations of the proposed approach and our ideas for future work.
RELATED WORK

Related Work in Microarchitecture Support for Security
Write xor eXecute (W⊕X) (also known as data execution prevention) prevents attackers from executing their own malicious code by ensuring that protected program segments are not writable and executable at the same time [Andersen and Abella 2004] . W⊕X is implemented using a No eXecute (NX) bit that the hardware platform enforces: if execution moves to a page with the NX bit enabled, the hardware raises a fault. Address Space Layout Randomization (ASLR) prevents an attacker from directly referring to objects in memory by randomizing their locations [PaX Team 2003] . However, attackers have exploited vulnerabilities in these approaches and have bypassed the protections offered by W⊕X and ASLR [Skape and Skywing 2005; Shacham et al. 2004] . The eXecute Only Memory (XOM) architecture allows the secure execution of instructions stored in the memory by protecting these instructions from malicious manipulation [Thekkath et al. 2000] . In XOM, the program is distributed in encrypted form by the vendor. During execution, the program is decrypted by the instruction loading path on the target processor using a secret key. Unlike XOM, the proposed approach does not require software to be encrypted by the vendor.
Dynamic information flow tracking is a simple hardware mechanism to track spurious information flows at runtime [Suh et al. 2004]. On every operation, a processor determines whether the result is spurious or not based on the inputs and the type of the operation. With the tracked information flows, the processor can easily check whether an instruction or a branch target is spurious. This prevents control flow changes by potentially malicious inputs and dynamic data generated from them. This approach assumes that any use of untrusted data should lead to an untrusted output. Gate-level information flow tracking (GLIFT) [Tiwari et al. 2009a, 2009b, 2010] shows that a gate-level description of a processor can be automatically augmented with shadow logic gates that dynamically track the flow of information through the processor. This reduces the number of false positives. Unfortunately, these approaches incur hardware overhead in the range of 1× to 3×.
Secure Program Execution Framework (SPEF) [Kirovski et al. 2002] secures the program running on the processor by using a hash function along with a cryptographic transformation. The calculated hash values are compared against precomputed hash values. CODESSEAL [Gelbart et al. 2005 ] uses a similar technique but the precomputed hashes are stored in the memory of an FPGA, which is placed between the main memory and the last level cache. Runtime Execution Monitoring (REM) [Fiskiran and Lee 2004 ] modifies the processor microarchitecture, and the Instruction Set Architecture (ISA). REM controls the processor pipeline and does not commit the instructions in the basic block until their integrity has been checked. The precomputed hashes are stored in the L1 I-cache. A high-performance low-overhead microarchitecture proposed in Kanuparthi et al. [2012a] checks for the integrity of instruction traces (containing up to four basic blocks) by calculating hashes of traces at runtime and comparing them against precomputed hashes. Instead of stalling the pipeline until the integrity check is complete, the instructions are allowed to make temporary changes to the architecture state. These changes are made permanent only when the integrity check is successful. Else, the system rolls back to the last known correct state.
Orthrus [Huang et al. 2010] employs spatial redundancy by replicating the program on multiple cores to enhance security. At runtime, these replicas execute on different cores with the same input, and their outputs are checked for consistency. Orthrus does not provide protection when a different application is running on each core. A related approach [Chen et al. 2008] uses spare cores on multicore platforms to perform runtime checks. The performance overhead for this approach is between 1.02× and 3×. SHIELD [Patel and Parameswaran 2008] provides security against code injection attacks by employing a dedicated security processor that monitors the applications running on the other cores in the processor. SHIELD increases the number of instructions significantly and has a 27% area overhead. Address-Independent Seed Encryption (AISE) and Bonsai Merkle Trees (BMTs) were proposed in Rogers et al. [2007]. AISE decouples security and memory management by using logical identifiers, rather than virtual or physical addresses, as a seed component for counter-based memory integrity verification. BMTs minimize the storage overhead when compared to Merkle tree implementations like the work done in Gassend et al. [2003] by leveraging the address-independent counter-based memory encryption.
Related Work in Microarchitecture Support for Reliability
Fault-tolerant computer architecture is not a new area of research. There are several well-studied fault tolerance solutions such as triple modular redundancy or the more general N-modular redundancy, originally proposed by John von Neumann [1956]. However, for most computing applications, the price of such macro-scale redundancy (in terms of hardware, power, or performance) outweighs its benefits, particularly because faults are relatively uncommon.
Several redundancy approaches (temporal or spatial) have been proposed to detect errors in microprocessor cores. Techniques that detect errors in functional units like ALUs [Patel and Fung 1982] and register files [Blome et al. 2006 ] are well known. Caches are protected either with Error Detection Codes (EDC) (as in the Pentium 4) or Error Correcting Codes (ECC) (as in Alpha 21264). The memory system is usually protected by Single Error Correction Double Error Detection (SEC-DED). Also, numerous schemes for detecting errors in on-chip interconnects have been proposed. Intel processors implement a machine-check architecture that provides a mechanism for detecting and reporting hardware errors, such as: system bus errors, ECC errors, parity errors, cache errors, and TLB errors [Intel 2011] .
Cores that operate in tight lockstep and compare their results after every instruction (or less frequently) have been proposed to detect errors [Yeh 1996]. However, they have not been widely adopted because of steep hardware costs. Approaches that employ redundant multithreading without lockstepping, like AR-SMT [Rotenberg 1999], combine program-level time redundancy and instruction re-execution to provide fault tolerance against transient faults. A few techniques such as DIVA [Austin 1999] and SIS [Schuette and Shen 1987] detect errors in architectures using dynamic verification without replicating any piece of hardware. A recent approach detects hardware faults by periodically executing new instructions (that are added to the instruction set of the processor) to run directed tests on the hardware [Constantinides et al. 2007]. These tests can be used to detect a hardware defect, locate it, and activate system repair. mSWAT uses low-cost monitors that watch for anomalous software behavior as a symptom of hardware faults [Sastry Hari et al. 2009]. It uses a trace-based fault diagnosis algorithm that locates and isolates the faulty microarchitectural component.
The aim of this article is to secure the programs running on the processor against attacks in a reliable manner. Thus, the security components such as those described in Section 2.1 must be fault tolerant. To achieve fault-tolerant operation of the security components, popular reliability approaches such as spatial or temporal redundancy may be used. However, they have shortcomings: very high area overhead and performance degradation, respectively. mSWAT [Sastry Hari et al. 2009] requires hardware monitors in addition to the security components. Moreover, mSWAT activates these monitors only when some software anomaly has been observed, that is, after the attack has been successful. The approach proposed in Constantinides et al. [2007] requires modification of the processor instruction set, making it difficult to use off-the-shelf processors in the design of SoCs. The proposed framework for reliable integrity checking does not require any additional monitors (other than the security components) to ensure reliable operation. First, the security components check each other periodically using random input patterns that are generated using an LFSR. If the check fails, it means that one of the components is faulty. The faulty component is diagnosed by applying a hardcoded input pattern that achieves 100% fault coverage to each checker. If a fault is detected in one of the security components, it is disabled and the remaining components are used to provide integrity checking.
THREAT AND FAULT MODELS
Threat Model
We assume that all components outside the chip boundary, including the memory, peripherals, and the bus are prone to tampering. Because off-chip memory is prone to attacks, the processor must check the integrity of the instructions coming from the memory. Infrequent integrity checks lead to Time Of Check (TOC)-Time Of Use (TOU) threats [Bratus et al. 2008] .
Fault Model
We assume that faults can be either transient or permanent. An important parameter in any fault model is the number of simultaneous errors under consideration. We consider, without loss of generality, only a single fault at a time. This is justified because multiple simultaneous errors are highly unlikely [Sorin 2009].
We assume that the cores, on-chip interconnects, caches, and memory system are protected by state-of-the-art fault tolerance techniques described in the previous section [Patel and Fung 1982; Blome et al. 2006; Intel 2011]. However, the integrity checking modules within the chip boundary are susceptible to transient and permanent faults. We model a permanent fault by making a checker produce a wrong output when it is validated by the other checker. Similarly, we model a transient fault by flipping a bit in one of the components.
ARCHITECTURE SUPPORT
In this section, we describe the proposed microarchitectural changes to provide security and reliability in multicore processors. The proposed framework for security is explained with the dynamic integrity checker proposed in Kanuparthi et al. [2012a] as an example. Note that any of the integrity checking techniques described in Section 2.1 can be used.
Support for Security
A dynamic integrity checker is an on-chip component that is integrated with the processor pipeline. It calculates the hash of a group of instructions (in this case, a trace, consisting of four basic blocks) and compares the obtained hash value against a precomputed hash value. Before the program is loaded, the precomputed hashes are stored in a secure location in the main memory (where the OS kernel is located). These precomputed hashes are prefetched into the hash cache at program load time [Kanuparthi et al. 2012b] . The instructions being verified are executed and allowed to make temporary changes to the architecture state of the processor [Kanuparthi et al. 2012a ]. However, they are not allowed to make permanent changes to the state (register file, data caches) until the integrity check is complete. Once the integrity check is successful, instructions are allowed to make permanent changes to the register file and the data caches.
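The commit-on-match behavior described above can be sketched in software. This is a minimal illustration under stated assumptions, not the authors' hardware design: the `Core` class, its `pending` and `committed` dictionaries, and the byte strings used as instructions are all hypothetical, and SHA-1 is computed over the raw instruction bytes of a trace.

```python
import hashlib


def trace_hash(instructions):
    """SHA-1 digest over the raw bytes of a trace (a list of byte strings)."""
    h = hashlib.sha1()
    for insn in instructions:
        h.update(insn)
    return h.digest()


class Core:
    """Hypothetical model: temporary changes become permanent only after a match."""

    def __init__(self):
        self.committed = {}  # permanent architectural state
        self.pending = {}    # temporary changes made while the check is in flight

    def execute(self, updates):
        # Instructions run ahead of the check, but only touch temporary state.
        self.pending.update(updates)

    def resolve(self, trace, precomputed):
        # Commit on a hash match; otherwise discard the temporary changes.
        if trace_hash(trace) == precomputed:
            self.committed.update(self.pending)
            self.pending.clear()
            return True
        self.pending.clear()
        return False  # the caller would abort the program here


trace = [b"\x55", b"\x48\x89\xe5", b"\xc3"]
golden = trace_hash(trace)  # stands in for the precomputed hash

core = Core()
core.execute({"rax": 1})
ok = core.resolve(trace, golden)

tampered = [b"\x55", b"\x90\x90\x90", b"\xc3"]
core.execute({"rax": 666})
bad = core.resolve(tampered, golden)
```

In this sketch the tampered trace never reaches `committed`, which mirrors the rule that permanent changes to the register file and data caches wait for a successful check.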
Implementation of this scheme in a multicore processor incurs the addition of three components: the scheduler, the checkers, and the hash cache. Figure 1 shows the proposed framework for a quadcore system. The blocks that are specific to the proposed framework are shaded.
Scheduler forwards the requests coming from the cores to the checkers. Figure 2(a) shows the internal structure of the scheduler. It consists of a buffer and a load directory. The scheduler collects the traces sent from all the cores, tags them with a 2-bit core ID, and places them in the buffer. To prevent starvation or overload of any of the checkers, the scheduler performs load balancing by forwarding each request to the checker with the lower load. The load directory allows the scheduler to keep track of the load on each checker. It contains information about the number of requests in every Input Queue (IQ) corresponding to the cores in both the checkers. In Figure 2(a), checker 0 (D0) has a load of 1 for core 1 (C1), while checker 1 (D1) has a load of 2. Since D0 has the lower load, the request corresponding to C1 will be sent to D0. The valid bit indicates whether the corresponding checker is alive. The contents of the load directory are updated by the integrity checkers. When a checker is diagnosed with a permanent fault, it is deactivated and the load will be shared by another checker (DX), as discussed further in the following subsections.
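A small software model may help make the load directory concrete. The sketch below is an assumption-laden simplification: the `Scheduler` class and its method names are hypothetical, ties are broken in favor of the lower-numbered checker, and the scheduler increments the load itself, whereas in the described design the checkers update the directory.

```python
class Scheduler:
    """Hypothetical model of the load directory and least-loaded dispatch."""

    def __init__(self, num_checkers, num_cores):
        # load[d][c] = pending requests from core c in checker d's input queue
        self.load = [[0] * num_cores for _ in range(num_checkers)]
        # valid[d] is False once checker d is diagnosed with a permanent fault
        self.valid = [True] * num_checkers

    def pick_checker(self, core_id):
        alive = [d for d, v in enumerate(self.valid) if v]
        if not alive:
            raise RuntimeError("no live checker available")
        # Least-loaded live checker for this core; ties go to the lower index.
        return min(alive, key=lambda d: (self.load[d][core_id], d))

    def dispatch(self, core_id):
        d = self.pick_checker(core_id)
        self.load[d][core_id] += 1  # simplification: scheduler updates the load
        return d


sched = Scheduler(num_checkers=2, num_cores=4)
sched.load[0][1] = 1  # checker 0 holds one request from core 1
sched.load[1][1] = 2  # checker 1 holds two requests from core 1
first = sched.dispatch(1)   # goes to the checker with the lower load

sched.valid[0] = False      # checker 0 is deactivated after a permanent fault
second = sched.dispatch(1)  # only checker 1 is left
```

Clearing a valid bit is enough to steer all later requests away from a deactivated checker, which is the property the reliability scheme relies on.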
Integrity Checker checks for integrity of traces at runtime. Its internal structure is shown in Figure 2 (b). It contains an RSA-2048 engine to encrypt or decrypt messages, a SHA-1 hash engine to calculate cryptographic hashes, a comparator for comparing the calculated and precomputed hashes, and a Linear Feedback Shift Register (LFSR) to generate pseudorandom numbers. To ensure fairness to all the cores, the requests coming to the checker are placed in separate IQs. Each core has an assigned IQ. CX IQ is used to store requests coming from cores other than C0-C3 in case there is a permanent fault in another checker. The arbiter picks the first trace to be checked for integrity by choosing the Head of Line (HOL) entry of the IQ. In this work, we implemented Simple Round Robin (SRR) and Weighted Round Robin (WRR) algorithms for the arbiter. While SRR gives priority to C0 IQ, WRR gives priority to the IQ with the maximum number of pending requests. A 3-bit counter assigned to each IQ counts the number of entries in the IQ. As soon as the HOL entry is chosen to be checked for integrity, the counter is decremented by one. The information from all the counters is used to update the load directory in the scheduler. The integrity checker calculates the hash of the trace using the SHA-1 hash engine. It then compares the calculated hash with the precomputed hash stored either in the hash cache or the main memory. Upon eviction, the contents of the hash cache are encrypted using an RSA key, unique to every processor and hardcoded in it, before storing them in the main memory. On a hash cache miss, the encrypted precomputed hashes are fetched from main memory, decrypted using the RSA key, and then compared against the calculated hash. If the values match, the trace is marked as safe. The core commits the instructions in the trace. On a mismatch, the execution is aborted immediately.
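The two arbitration policies can be illustrated with a short sketch. The interpretation here is an assumption: SRR is modeled as a rotating pointer that starts at the C0 queue and skips empty queues, and WRR is modeled as picking the queue with the most pending requests. The `Arbiter` class is hypothetical, and the 3-bit per-queue counters are represented simply by queue lengths.

```python
from collections import deque


class Arbiter:
    """Hypothetical arbiter over per-core input queues (IQs)."""

    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.next_rr = 0  # SRR pointer; starts at the C0 queue

    def push(self, q, trace):
        self.queues[q].append(trace)

    def counts(self):
        # Stands in for the per-IQ counters reported to the load directory.
        return [len(q) for q in self.queues]

    def pop_srr(self):
        # Simple round robin: visit queues in order, skipping empty ones.
        n = len(self.queues)
        for step in range(n):
            q = (self.next_rr + step) % n
            if self.queues[q]:
                self.next_rr = (q + 1) % n
                return q, self.queues[q].popleft()
        return None

    def pop_wrr(self):
        # Weighted choice: serve the queue with the most pending requests.
        counts = self.counts()
        q = max(range(len(counts)), key=lambda i: counts[i])
        if counts[q] == 0:
            return None
        return q, self.queues[q].popleft()


arb = Arbiter(4)
arb.push(0, "t0")
arb.push(2, "t2a")
arb.push(2, "t2b")

wrr_pick = arb.pop_wrr()   # queue 2 has the most pending requests
srr_first = arb.pop_srr()  # pointer starts at queue 0
srr_second = arb.pop_srr() # next non-empty queue after 0
srr_empty = arb.pop_srr()  # nothing left
```

In both policies only the head-of-line entry of the chosen queue is removed, so requests from one core are always checked in arrival order.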
Hash Cache stores recently computed hashes. Calculating a cryptographic hash such as SHA-1 takes 80 cycles. The hash computation is an expensive operation and recalculating hashes for traces that were seen recently is an overkill. To avoid recalculation, the hashes of the most recently seen traces are stored in the hash cache. Hash Cache Controller accepts requests from the integrity checkers and sends them to the hash cache on a first-come-first-served basis. It also implements miss handling to process the pending requests while the current hash is being verified, to improve overall performance. This is accomplished using Miss Status Handling Registers (MSHR). When a miss occurs, the request that missed is sent to the MSHR, and the following request is allowed to access the cache while the missing block is fetched from memory.
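As a rough illustration of why caching hash entries helps, the sketch below models the hash cache as a small least-recently-used store. Several things here are assumptions rather than details from the design: the LRU replacement policy, the use of an integer trace identifier as the key, and the `fetch_from_memory` callback that stands in for fetching (and decrypting) an entry on a miss. MSHR-based miss handling is not modeled.

```python
from collections import OrderedDict


class HashCache:
    """Hypothetical LRU model of the hash cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # trace id -> hash value
        self.hits = 0
        self.misses = 0

    def lookup(self, trace_id, fetch_from_memory):
        if trace_id in self.entries:
            self.entries.move_to_end(trace_id)  # mark as most recently used
            self.hits += 1
            return self.entries[trace_id]
        self.misses += 1
        value = fetch_from_memory(trace_id)     # slow path on a miss
        self.entries[trace_id] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return value


backing = {1: b"a", 2: b"b", 3: b"c"}  # placeholder values, not real hashes
cache = HashCache(capacity=2)
for tid in (1, 2, 1, 3, 2):
    cache.lookup(tid, backing.__getitem__)
```

With a capacity of two, the third access is the only hit in this sequence, which shows how quickly a small cache thrashes when the working set of traces exceeds it.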
The stepwise operation of the entire scheme is described in the following text and is also illustrated in Figure 3. (i) The HOL entry and the following three entries of the buffer in the scheduler access the load directory using the core ID to determine which checkers to send the requests to. (ii) Requests from C0, C1, C2, and C3 are sent to the IQs corresponding to the core (i.e., C0 IQ, C1 IQ, C2 IQ, and C3 IQ) in checkers D1, D0, D0, and D1, respectively. (iii) The arbiter determines which IQ to service next. In Figure 3, the arbiter in D0 selects C0 IQ, while the one in D1 selects C2 IQ. (iv) Both the checkers send the request to the hash cache controller, where D0 gets priority over D1 and the entry coming from D1 is pushed onto the FIFO. (v) The HOL entry and the following entry in the FIFO access the hash cache.
How Many Checkers Do We Need?
The optimum number of integrity checkers is chosen based on the following considerations: (1) the idle time of each checker should be minimum, (2) the system must not have a single point of failure, and (3) the architecture must have minimum performance, area, and power overheads.
There are two extremes to implementing integrity checking in multicore processors. One extreme is a straightforward extension of the single-core integrity checking scheme by replicating the checker-enhanced cores. Thus, for a processor with N cores, there will be N checkers, one per core. We call this scheme N-for-N. The checkers will remain idle for workloads that do not generate requests frequently, thus wasting valuable real estate on the processor die, although instead of being idle they could service requests coming from other cores. This scheme results in increased area and power overheads. Also, in case a checker fails, there is no way to check the integrity of the code running on the corresponding core. If the incurred area and power overheads are tolerable, more than N checkers, say 2N checkers, can be used to support integrity checking in N cores.
The other extreme is to use a single checker for requests coming from all the cores. We call this scheme 1-for-N. This approach involves high performance overhead, as the pipeline will be stalled until the request from the core is verified and marked as safe by the only checker. The single checker becomes the bottleneck and performance degrades as the number of cores increases. Moreover, this checker is a single point of failure. Thus, this approach is not scalable. Figure 4 shows the idle time of checkers for various workloads for 4, 8, and 16 cores. We observed that as the number of checkers increases, the amount of time a checker remains idle increases. For instance, the checker idle time in a 16-core processor running memory-intensive benchmarks with just one checker is 0.12%, whereas the idle time with 16 checkers is 2.7% for the same setting. We also observed that while there is very little difference in idle time with N/2 checkers and N checkers, there is a steep increase in the checker idle time in all other cases. For instance, for a 16-core processor running memory-intensive benchmarks, the idle time increases by 46% when the number of checkers increases from 2 to 4, whereas it increases by 4.4% when the number of checkers increases from 8 to 16. The idle checkers are powered on even when they are not performing an integrity check, thereby increasing idle power consumption. Instead, by using fewer checkers, the idle time as well as power consumption can be reduced. Moreover, since more than one checker is used, there is no single point of failure.
We use the metric Normalized Perf/Watt/Area to help determine the optimum number of checkers, which falls in between the two extremes. Perf is the performance, which is given by (geomean of IPC) × (core frequency). Watt is the total power consumption of the processor in Watts, and Area is the total area of the processor in mm². As high performance and low hardware and power costs are desired, this is a "higher is better" metric. A cycle-accurate simulator [Loh et al. 2009] was modified to implement the proposed framework. Area and power details of the cores were obtained using McPAT. Verilog implementations of the scheduler and integrity checkers were synthesized at 45nm technology to obtain their area and power details. We compare the normalized Perf/Watt/Area statistic for Processor intensive (P), Memory intensive (M), and a mix of Processor and Memory intensive workloads (P+M) from several benchmark suites (discussed further in Section 5) for a processor with 4, 8, and 16 cores. The number of checkers in these experiments varies from 1 to 8 for a quadcore processor, from 1 to 16 for an 8-core processor, and from 1 to 32 for a 16-core processor. From Figure 5, we see that for various workloads, the highest Perf/Watt/Area is obtained when the number of checkers is half the number of cores. Though the average performance overhead decreases (to 4.3%) with 2N checkers for N cores, the checker idle time increases (to 2.5%) and the Perf/Watt/Area decreases (by 16%). Therefore, we use N/2 checkers in an N-core processor. We call this scheme N/2-for-N, and we use this architecture in the rest of the article, unless specifically mentioned.
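The metric itself is simple to compute. The sketch below follows the definition given in the text, Perf = (geomean of IPC) × (core frequency), divided by power and then by area. The configuration names and all numbers are made-up placeholders rather than results from the evaluation, and normalizing against a chosen baseline configuration is an assumption about how the normalization is done.

```python
from math import prod


def geomean(values):
    """Geometric mean of a non-empty list of positive numbers."""
    return prod(values) ** (1.0 / len(values))


def perf_per_watt_per_area(ipcs, freq_hz, watts, area_mm2):
    # Perf = (geomean of IPC) x (core frequency), then divide by power and area.
    perf = geomean(ipcs) * freq_hz
    return perf / watts / area_mm2


def normalized(configs, baseline):
    # Express every configuration relative to the chosen baseline.
    base = perf_per_watt_per_area(**configs[baseline])
    return {name: perf_per_watt_per_area(**cfg) / base
            for name, cfg in configs.items()}


# Placeholder inputs for two hypothetical configurations.
configs = {
    "config_a": dict(ipcs=[1.0, 4.0], freq_hz=2.0e9, watts=10.0, area_mm2=100.0),
    "config_b": dict(ipcs=[2.0, 2.0], freq_hz=2.0e9, watts=20.0, area_mm2=100.0),
}
scores = normalized(configs, baseline="config_a")
```

Because the metric divides by both power and area, a configuration that doubles power without raising the geomean IPC scores half as well, which is the trade-off that penalizes adding checkers that mostly sit idle.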
Support for Reliability
As mentioned in the fault model, the cores, caches, memory, and on-chip interconnects are fault tolerant. The hash cache is protected in a manner similar to the other on-chip caches. We propose that, periodically, the checkers check one another first and then check themselves, while the scheduler checks itself. A fault in the instruction fetch path of any core can modify the instructions. This can be detected by the integrity checker during normal operation, that is, outside the reliability checking period.
Reliability of the integrity checkers: During reliability checking, instead of a full-blown hash or RSA encryption algorithm, we use a partial hash or RSA algorithm. This is because, during the reliability checking period, the focus is on reliability and not on security. The stepwise operation of the checkers checking one another is illustrated in Figure 6 and described in the following text:
(i) The LFSR counts through a sequence, and a randomly chosen output (TV) is sent to the hash engine.
(ii) The output of the hash engine, H(TV), is applied to the crypto engine. The output of the crypto engine is E(H(TV)).
(iii) TV is now applied to the hash engine in Checker 1.
(iv) H(TV) is calculated and applied to the crypto engine in Checker 1.
Once it is detected that one of the checkers is faulty, each checker attempts to validate itself by applying a test input that achieves 100% fault coverage and comparing the output to the result that is hardcoded in the checker. The steps involved are described as follows:
(i) A known test vector (TV) is applied to the hash engine.
(ii) The output of the hash engine, H(TV), is applied to the crypto engine. The output of the crypto engine is E(H(TV)).
(iii) E(H(TV)) is compared against a known correct value.
(iv) E(H(TV)) is compared against an incorrect value.
To test for stuck-at-0 or stuck-at-1 faults at the output of the comparison circuitry, the comparison is done twice: first with a known correct value and then with an incorrect value. The checker in which the calculated and hardcoded values do not match is the faulty checker. Figure 7 shows the steps involved in the self-test operation of a checker. The reliability checking period is now decreased and execution continues. If the fault was a transient fault, the faulty checker will now pass the reliability check. If the reliability check fails again, the checker has encountered a permanent fault, and the faulty checker is deactivated. The existing requests in the checker are migrated to the fault-free checker one at a time.
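The mutual check and the self-test described above can be sketched in software as follows. This is an illustrative model only: SHA-1 from the standard library stands in for the (partial) hash engine, a byte-wise XOR with a toy key stands in for the crypto engine, and the `faulty` switch, the `comparator` argument, the LFSR seed, and the final comparison of the two checkers' outputs are assumptions introduced for the example.

```python
import hashlib

KEY = bytes(range(1, 21))  # toy 20-byte key; a stand-in, not RSA

def lfsr16_step(state: int) -> int:
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

class Checker:
    def __init__(self, faulty: bool = False):
        self.faulty = faulty  # hypothetical fault-injection switch

    def hash_engine(self, tv: bytes) -> bytes:
        digest = hashlib.sha1(tv).digest()
        if self.faulty:
            # Model a fault as one flipped bit in the hash output.
            digest = bytes([digest[0] ^ 0x01]) + digest[1:]
        return digest

    def crypto_engine(self, digest: bytes) -> bytes:
        return bytes(d ^ k for d, k in zip(digest, KEY))

    def compute(self, tv: bytes) -> bytes:
        """E(H(TV)): hash engine output fed to the crypto engine."""
        return self.crypto_engine(self.hash_engine(tv))

def mutual_check(c0, c1, seed=0xACE1, steps=5) -> bool:
    """Apply the same LFSR-generated TV to both checkers and compare."""
    state = seed
    for _ in range(steps):
        state = lfsr16_step(state)
    tv = state.to_bytes(2, "big")
    return c0.compute(tv) == c1.compute(tv)

KNOWN_TV = b"known test vector"
GOLDEN = Checker().compute(KNOWN_TV)            # hardcoded in a real checker
WRONG = bytes([GOLDEN[0] ^ 0xFF]) + GOLDEN[1:]  # deliberately incorrect value

def self_test(checker, comparator=lambda a, b: a == b) -> bool:
    """Pass only if the result matches GOLDEN and does not match WRONG."""
    result = checker.compute(KNOWN_TV)
    return comparator(result, GOLDEN) and not comparator(result, WRONG)
```

Passing a comparator that always returns True (or always False) models a stuck-at fault at the comparator output; in both cases `self_test` fails, which is the purpose of the double comparison.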
Reliability of the scheduler: The scheduler keeps track of the number of requests forwarded to each checker and the number of requests received from each core. If the scheduler has not forwarded any request to one of the checkers during the last checking period, or if the difference between the numbers of requests forwarded to the checkers is very large, it indicates that the scheduler may be faulty.
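A minimal sketch of this counter-based check is shown below. The function name, the `max_imbalance` threshold, and the rule that an idle period (no requests from any core) is not flagged are assumptions for illustration; the text does not specify how large a difference counts as a fault.

```python
def scheduler_looks_faulty(received_per_core, forwarded_per_checker, max_imbalance):
    """Flag the scheduler using its request counters for the last period.

    received_per_core: requests received from each core.
    forwarded_per_checker: requests forwarded to each checker.
    max_imbalance: assumed threshold on the spread between checkers.
    """
    if sum(received_per_core) == 0:
        return False  # nothing to forward, so empty checkers are expected
    if any(count == 0 for count in forwarded_per_checker):
        return True   # a checker was ignored for the whole period
    spread = max(forwarded_per_checker) - min(forwarded_per_checker)
    return spread > max_imbalance
```

For example, `scheduler_looks_faulty([60, 60], [120, 0], 50)` flags the scheduler because the second checker received nothing during the period.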
Reliability of the system: When we scale up from multicore processors to manycore processors, the complexity of the scheduler increases manifold. It also increases the number of stalls in the scheduler buffer, resulting in a large performance impact. To tackle this problem, we propose to divide the architecture into tiles, with several simple schedulers instead of one complex scheduler, as shown in Figure 8. A 16-core processor is organized into four tiles, with each tile containing a simple scheduler and two checkers. To ensure reliability, we propose to join the checkers and the schedulers from two tiles in the form of a ring. The ring is activated when one checker is diagnosed as faulty. This is done to prevent overloading the fault-free checker in a tile with requests coming from all the cores in that tile. The requests are instead shared among all the fault-free checkers (D0, D2, and D3 in Figure 8) in the following manner. When the fault-free checker D0 is handling the HOL request in C1 IQ, the request in C0 IQ of D1 is sent to D0. Similarly, requests are sent to D3. Requests coming from cores in another tile are sent to the Cx IQ in the checker. A fault in the scheduler can lead to failures such as all requests being sent to a single checker while the other checker is ignored. This situation is tackled by connecting the schedulers from both tiles. When a scheduler is diagnosed as faulty, it is disabled, and all the requests from the cores to that scheduler are forwarded to the connected scheduler, which then forwards requests to the checkers.
EVALUATION
Experimental Setup
Table I describes the system configuration used in our experiments. We used Zesto [Loh et al. 2009], a cycle-accurate x86 simulator that simulates an out-of-order superscalar processor. The cores are multiprogrammed, that is, each core can run only one application. We configured each core to have 128 Reorder Buffer (ROB) entries and a commit width of 4. We use a DDR3-1600 SDRAM, with a latency of 200 cycles for main memory.
The scheduler buffer contains 128 entries. The checkers contain five IQs (C0, C1, C2, C3, and CX), with each IQ containing 32 entries. The SHA-1 hash engine takes 80 cycles to produce a 20-byte hash output from a 64-byte input. The RSA-2048 crypto engine takes 150 cycles. The partial hash and crypto operations during the reliability checking period take 20 cycles each. The lightweight cryptographic primitives, DM-PRESENT-80 (hash) [Poschmann 2009] and PRESENT-80 (encryption/decryption) [Bogdanov et al. 2007], take 33 and 32 cycles, respectively. The comparator in the checker takes 1 cycle to perform a comparison. The hash cache is 64 KB and 8-way set associative with a latency of 8 cycles (obtained using CACTI [Hewlett Packard 2011]). The hash cache controller consists of 16 miss buffer entries and 32 entries in the FIFO. We used McPAT to calculate the processor area and power consumption at 45nm technology. We implemented the scheduler, the integrity checkers, and the hash cache controller in RTL Compiler using 45nm standard cells [Davis 2009].
The checkers check each other periodically every 10,000 cycles (similar to an operating system time slice). The duration of each reliability checking period is 100 cycles. In case a checker fails the reliability test, it is checked again after another 5,000 cycles. We use the geomean of IPCs as a measure of performance. It is a "higher the better" metric. We use benchmarks from the SPEC CPU2006 [SPEC 2008], BioBench [Albayraktaroglu et al. 2005], and STREAM [McCalpin 1995] benchmark suites. SPEC benchmarks were profiled using exp-bbv, a basic block vector generation tool from Valgrind [2011], using the train input sets. The non-SPEC benchmarks were profiled using smaller test inputs. Three mixes of benchmarks were considered: processor bound (P), memory bound (M), and a mix of processor and memory bound (P+M). The simulation results were obtained for a representative slice of 500 million instructions by fast forwarding the startup part. Table II shows the workload summary for the various simulations.
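The checking schedule and the transient/permanent distinction can be illustrated with a small model. The 10,000-cycle period and the 5,000-cycle recheck come from the setup above; the function `run_checks`, the `check_passes` callback, and the choice to resume the normal period after a transient fault are assumptions made for this sketch.

```python
NORMAL_PERIOD = 10_000  # cycles between reliability checks
RETRY_PERIOD = 5_000    # cycles before a failed checker is checked again

def run_checks(check_passes, total_cycles):
    """Return (cycle, verdict) events for one checker.

    check_passes(cycle) -> bool is a hypothetical callback that reports
    whether the checker passes the reliability test at that cycle.
    """
    events = []
    cycle = NORMAL_PERIOD
    while cycle <= total_cycles:
        if check_passes(cycle):
            cycle += NORMAL_PERIOD
            continue
        retry = cycle + RETRY_PERIOD
        if check_passes(retry):
            events.append((cycle, "transient"))
            cycle = retry + NORMAL_PERIOD  # assumed: resume the normal period
        else:
            events.append((retry, "permanent"))
            break  # the faulty checker is deactivated
    return events
```

A fault that disappears before the recheck is logged as transient, whereas one that persists through the recheck is logged as permanent and ends the checker's participation.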
Performance Analysis
The performance impact of the proposed architecture with SRR scheduling over a baseline architecture is shown in Figure 9(a). When the proposed architecture only performs security operations, that is, when it only verifies the integrity of the code running on the cores, the average performance overhead is 7.13%. When it performs both security and reliability functions, using the full operation of SHA-1 and RSA during the reliability checking period, the performance overhead is 9.18%. When the partial SHA-1 and RSA are used during the reliability checking, the performance impact comes down to 8.19%; this case is indicated in Figure 9(a) as Security+Reliability Partial. Thus, the secure system is made reliable at a very low additional performance impact of 1.06% (8.19% versus 7.13%). These overheads occur when no permanent faults are encountered. The performance impact will be much higher if a permanent fault is encountered. The impact of encountering a permanent fault will be discussed in a later subsection. Figure 9(b) shows the impact of the SRR and WRR scheduling mechanisms on the IPC geomean. The overall performance impact with WRR comes down to 6.42% with Security+Reliability, and 5.18% with security alone.
The performance impact arises because the instructions are stalled in the ROB and are not allowed to commit until the checkers complete the integrity checking. In addition to being stalled in the cores, the requests are also stalled when the buffer in the scheduler or the IQs in the checkers are full. The total number of cycles for which an IQ in any of the checkers or the buffer in the scheduler is full, that is, the number of stalls outside the cores, is shown in Figure 11. When the SRR scheduling mechanism is used, the total number of cycles the buffers are full is very high. With the WRR scheduling mechanism, this number is reduced. The P workloads spend more time in the cores and generate more requests for integrity checking. Since they spend more time waiting for the integrity check result, the number of stalls in the cores, checkers, and scheduler is high, and hence the performance impact is high. The performance impact for the M workloads (B, E, and H) is lower than the impact for processor-bound workloads, as the number of stalls is relatively lower. The performance impact for P+M workloads is higher than that of M workloads and lower than that of the P workloads. Since the workloads with more outstanding requests are given priority, processor-bound workloads benefit the most from WRR scheduling.
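The difference between the two arbitration policies can be illustrated with a simplified model. The sketch assumes that SRR visits the per-core queues in a fixed cyclic order and uses a longest-queue-first pick as a stand-in for WRR, based only on the statement that workloads with more outstanding requests are given priority; the actual weighting used by the scheduler may differ.

```python
from collections import deque

def srr_pick(queues, last):
    """Next non-empty queue after index `last` in cyclic order, or None."""
    n = len(queues)
    for offset in range(1, n + 1):
        idx = (last + offset) % n
        if queues[idx]:
            return idx
    return None

def wrr_pick(queues, last):
    """Non-empty queue with the most outstanding requests, or None.

    Ties are broken in cyclic order after `last` (an assumption).
    """
    n = len(queues)
    order = [(last + offset) % n for offset in range(1, n + 1)]
    candidates = [idx for idx in order if queues[idx]]
    if not candidates:
        return None
    return max(candidates, key=lambda idx: len(queues[idx]))

# Hypothetical per-core request queues; core 1 has the largest backlog.
queues = [deque("a"), deque("bcd"), deque(), deque("ef")]
srr_choice = srr_pick(queues, last=1)  # skips the empty queue 2, picks 3
wrr_choice = wrr_pick(queues, last=1)  # picks the core with the most requests
```

In this example, the cyclic policy serves core 3 next, while the priority policy serves core 1, whose backlog is largest; this is the behavior that would favor request-heavy processor-bound workloads.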
The performance overhead can be further reduced by using lightweight cryptographic primitives, at the cost of the security guarantees offered by standard encryption or hash algorithms such as RSA or SHA, respectively. While RSA takes 150 cycles to encrypt/decrypt, PRESENT takes just 32 cycles. Similarly, while SHA-1 takes 80 cycles to produce the hash digest, the DM-PRESENT hash algorithm takes just 33 cycles. Using these lightweight cryptographic primitives brings the average performance overhead down to 3.42%. Figure 10 compares the performance overhead of using heavyweight cryptographic algorithms (RSA and SHA-1) in the proposed microarchitecture against the performance overhead incurred due to the use of lightweight cryptographic primitives (PRESENT and DM-PRESENT).
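As a rough back-of-the-envelope illustration, the per-request checker latency for the two sets of primitives can be compared using the cycle counts from the experimental setup. The sketch assumes that the hash, crypto, and comparison stages run back to back and ignores hash cache and queueing delays, so it is only an estimate of the raw checker latency, not of the measured overhead.

```python
COMPARE_CYCLES = 1  # comparator latency from the experimental setup

HEAVYWEIGHT = {"hash": 80, "crypto": 150}  # SHA-1, RSA-2048
LIGHTWEIGHT = {"hash": 33, "crypto": 32}   # DM-PRESENT-80, PRESENT-80

def check_latency(primitives):
    """Cycles for one check, assuming the three stages are serialized."""
    return primitives["hash"] + primitives["crypto"] + COMPARE_CYCLES

heavy_cycles = check_latency(HEAVYWEIGHT)
light_cycles = check_latency(LIGHTWEIGHT)
latency_ratio = heavy_cycles / light_cycles
```

Under these assumptions, a check takes 231 cycles with the heavyweight primitives and 66 cycles with the lightweight ones, a ratio of 3.5.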
Security Analysis
The proposed microarchitecture can protect the system from several attacks that are launched at runtime, after the code has been checked at load time. Threats such as stack smashing, malicious code injection, and program binary modification can be thwarted by the proposed microarchitecture. The security guarantees of the proposed scheme stem from the dynamic integrity checkers [Fiskiran and Lee 2004; Gelbart et al. 2005; Kanuparthi et al. 2012a].
Stack smashing: An attempt to overrun unchecked buffers and perform a stack smashing attack will transfer program control to the attacker's malicious code, which is outside the program's address space. These malicious instructions will be fetched into the core just like the normal instructions. While the malicious instructions proceed through the pipeline, the checkers verify the integrity of these instructions. Until the integrity check is complete, the instructions will be held in the core's reorder buffer and will not be allowed to commit and make any changes to the architectural state. These instructions will not be marked as safe by the checkers because a precomputed hash for these malicious instructions does not exist either in the hash cache or in the main memory. As soon as this breach is detected, the checker sends an interrupt to abort execution.
Malicious code injection: Consider the case where an attacker tries to transfer program control to an address by entering the payload information. The checkers detect that a precomputed hash for that particular address does not exist in the hash cache or the main memory. Thus, the malicious instructions will not be allowed to commit and the execution is aborted.
Program binary modification: Now consider the attack scenario in which an attacker maliciously modifies the binary, either in the main memory or on the bus between the processor and memory. These instructions produce a hash value that is different from the precomputed hash value. The comparator in the checker detects that the hash values do not match and immediately sends an interrupt to the core to abort execution.
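The three scenarios above reduce to two outcomes at the checker: either no precomputed hash exists for the fetched trace, or the computed hash does not match the precomputed one. The following minimal software sketch models that decision. The function names, the dictionary keyed by trace start address, and the return strings are hypothetical, and the crypto step applied to the hash is omitted for brevity.

```python
import hashlib

def trace_hash(instr_bytes: bytes) -> bytes:
    """Hash of the instruction bytes in a trace (SHA-1, as in the setup)."""
    return hashlib.sha1(instr_bytes).digest()

def check_trace(start_addr, instr_bytes, golden_hashes):
    """Decide whether the instructions of a trace may commit."""
    expected = golden_hashes.get(start_addr)
    if expected is None:
        # Stack smashing or code injection: the target was never profiled.
        return "abort: no precomputed hash"
    if trace_hash(instr_bytes) != expected:
        # Binary modification in memory or on the bus.
        return "abort: hash mismatch"
    return "safe"

# Hypothetical profiled trace at address 0x400000.
legit = bytes.fromhex("5589e583ec10")
golden = {0x400000: trace_hash(legit)}
```

Calling `check_trace(0x400000, legit, golden)` returns "safe", whereas a modified byte sequence at the same address or any trace at an unprofiled address leads to an abort.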
The proposed microarchitecture has some limitations too. For statically compiled code, if all possible traces are not identified during the profiling phase, an alert would be raised for innocuous traces as well. For dynamically compiled code, there is no golden hash to compare against for dynamically generated traces. This results in false positives.
Reliability Analysis
We assume that any faults that occur within the cores, caches, memory, and interconnect are taken care of by previously proposed work. The security offered by the checkers can fail and the entire system can be compromised if either the checkers or the scheduler fail. The proposed microarchitecture can handle transient and permanent faults in the checkers and schedulers. In addition to detecting any errors in the checkers and the scheduler, the proposed architecture can detect the presence of faults in the instruction caches and memory if existing protections in the cores fail to detect a fault, thus adding an extra line of defense. A fault in the instruction cache manifests itself as a bit-flip and modifies an instruction. When this instruction, as a part of a trace, is checked for integrity, it produces a hash value that does not match the corresponding precomputed hash.
We performed fault analysis on the hardware implementation (Verilog code) of the scheduler and checker using Synopsys TetraMAX. We considered the stuck-at fault model. A total of 1 million faults were injected. Of these, 36.98% affect the checker. Only 17.11% of all the faults are detectable, and 6.32% result in errors in the checker. Assuming that a fault always coincides with an attack, 2.9% of all the injected faults lead to an attack. Similarly, of all the faults injected on the chip, 26.61% affect the scheduler. Only 11.98% of all the faults are detectable, and 3.2% result in an error. Assuming that a fault always coincides with an attack, only 1.8% of all the injected faults lead to an attack (Figure 12). These are the faults that lead to faulty hash calculation, or faulty encryption or decryption, thereby possibly passing off malicious instructions as safe. Since the scheduler and checkers are vital to secure program execution, they need to be fault tolerant.
Transient faults: The checkers validate each other periodically to ensure reliable operation. Thus, our approach can detect transient faults that occur during the checking period, or those whose effects last for the duration of the checking period. This period can be varied to detect transient faults of much shorter duration. We varied the period between 1,000 cycles, 10,000 cycles, and 100,000 cycles. As the frequency of validation increases, the overall geomean of IPC decreases because the checkers halt dynamic integrity checking during this period. The performance impact is 27.3% when the checkers check each other every 1,000 cycles, and it goes down to 5.03% when they check each other every 100,000 cycles. When the checking period is 10,000 cycles, the performance impact is 6.42% using the WRR scheduling scheme. Figure 13(a) illustrates the impact on performance when the reliability checking period is varied for different workloads.
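A simple way to see why the checking period matters is to compute the fraction of time the checkers are unavailable for integrity checking. The sketch below assumes the 100-cycle check duration from the experimental setup applies to every period and that the check is counted within the period; it does not model the queueing that builds up while checking is halted, so the measured impact can be considerably larger than this fraction.

```python
CHECK_DURATION = 100  # cycles per reliability check (from the setup)

def checking_fraction(period_cycles):
    """Fraction of checker time spent on reliability checks (assumed model)."""
    return CHECK_DURATION / period_cycles

fractions = {period: checking_fraction(period)
             for period in (1_000, 10_000, 100_000)}
```

Under this model, the checkers spend 10%, 1%, and 0.1% of their time on reliability checks for periods of 1,000, 10,000, and 100,000 cycles, respectively.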
Permanent faults: We model the permanent fault as the failure of a checker or a scheduler. When a permanent fault occurs, the failover mechanism takes over, and the load of dynamic integrity checking is shared by the three checkers in two neighboring tiles that are connected in the form of a ring, instead of burdening the nonfaulty checker. The impact on performance depends on the time at which the permanent fault occurs. Figure 13(b) shows the impact on the IPC geomean depending on the time of failure and the component that fails. If the checker fails very early, the performance impact is high. This is because the load has to be handled by fewer checkers until the end of execution. For instance, if the permanent fault is detected within the first 100 million cycles, the geomean of IPC for a 16-core workload falls from 0.9792 to 0.2851. However, when the failure occurs toward the end of execution, the impact on the IPC geomean is much smaller. The same 16-core system (under the same workload) would see a decrease in IPC geomean from 0.9792 to 0.9420. The performance impact is more pronounced in case a scheduler fails. This is because the number of schedulers is fewer than the number of checkers.
When the scheduler fails very early in the simulation, the performance impact is very high. The IPC geomean falls from 0.9792 to 0.1132 if the scheduler fails after 100 million instructions. When the scheduler fails toward the end of simulation, the IPC geomean falls from 0.9792 to 0.7424. This performance impact is shown in Figure 13(b).
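The failover of a faulty checker's pending requests to the remaining checkers on the ring can be sketched as follows. This is a simplified model: requests are handed out one at a time in round-robin order, whereas the actual mechanism depends on which IQ each fault-free checker is currently serving. The function name and the dictionary of queues are hypothetical; the checker labels follow Figure 8.

```python
from collections import deque
from itertools import cycle

def migrate(faulty_queue, healthy_queues):
    """Move pending requests, one at a time and in arrival order,
    from the faulty checker to the fault-free checkers on the ring."""
    targets = cycle(sorted(healthy_queues))
    while faulty_queue:
        healthy_queues[next(targets)].append(faulty_queue.popleft())

# D1 is diagnosed as faulty; D0, D2, and D3 remain on the ring.
d1_pending = deque(["r0", "r1", "r2", "r3"])
ring = {"D0": deque(), "D2": deque(), "D3": deque()}
migrate(d1_pending, ring)
```

After the call, the four pending requests are spread over three checkers instead of all landing on D0, which is the overload that the ring is intended to avoid.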
Scalability
We proposed a methodology that is scalable when we scale up from multicore to manycore processors. Figure 14 shows the performance of systems containing 4, 8, and 16 cores when different workloads are applied. Baseline is the case when there is no provision for security or reliability checking. N-for-N gives the best performance because each core has a dedicated checker. 1-for-N has the highest performance overhead, as the load for N cores is handled by just one checker. The performance impact increases as the number of cores increases. Also, the P workloads (A, D, and G) experience a very high performance impact because they generate requests more frequently, and the single checker cannot handle this load. N/2-for-N nearly matches the N-for-N approach for all the workloads. From Figure 14, we see that as the number of cores increases, there is no significant performance degradation for N/2-for-N when compared to N-for-N. While N-for-N is the optimum approach from a performance point of view, 1-for-N is the optimum approach from a hardware cost point of view. However, N/2-for-N is the optimum approach from a Perf/Watt/Area, security, and reliability perspective. The security and reliability implications of the three schemes are discussed in Section 5.6.
Tradeoff between Security, Reliability, and Performance
The proposed architecture incurs the addition of N/2 checkers in an N-core system. The N cores forward the integrity checking requests to a scheduler, which in turn forwards requests to the N/2 checkers. Apart from the already-mentioned Perf/Watt/Area benefits, the N/2-for-N approach has several security and reliability benefits. The load of integrity checking of requests coming from N cores is shared by N/2 checkers. In case any checker becomes faulty, the remaining N/2 - 1 checkers can share the load of the N cores and ensure secure and reliable operation, even in the presence of faults. The 1-for-N scheme has the disadvantage that there is just one checker serving the requests from N cores. This checker becomes the single point of failure and causes the whole system to fail in case it becomes faulty. Similarly, the N-for-N scheme is also disadvantageous because each core has its own checker. If the checker corresponding to a certain core becomes faulty, the core cannot forward requests to the other checkers, unless there is some interconnect joining all the checkers.
Hardware and Power Cost
The area of the core, in 45nm technology, is 26.4 mm². For a baseline 4-core processor, the total area is 150.18 mm², while that of an 8-core processor is 257.93 mm². The area of a baseline 16-core processor is 470.8 mm². The proposed framework incurs the addition of the scheduler, the integrity checkers, the hash cache controller, and the hash cache. For a processor with N cores, we add N/2 checkers. The area overhead for 4-, 8-, and 16-core processors is 0.66%, 0.59%, and 0.53%, respectively.
The peak dynamic power consumption for a baseline 4-, 8-, and 16-core processor is 79.332W, 158.67W, and 317.34W, respectively. The power overhead of the proposed framework over a baseline quadcore, 8-core, and 16-core processor is 1.01%, 0.89%, and 0.73%, respectively. The area and power consumption of the baseline processor were obtained using McPAT, while those of the scheduler, integrity checkers, and hash cache controller were obtained from RTL Compiler. The area and power statistics of the hash cache were obtained using CACTI [Hewlett Packard 2011].
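For a sense of scale, the relative overheads above can be converted into approximate absolute figures. The sketch below only multiplies the reported baseline values by the reported percentages; since those percentages are rounded, the results are estimates rather than reported measurements.

```python
BASE_AREA_MM2 = {4: 150.18, 8: 257.93, 16: 470.8}
AREA_OVERHEAD_PCT = {4: 0.66, 8: 0.59, 16: 0.53}
BASE_POWER_W = {4: 79.332, 8: 158.67, 16: 317.34}
POWER_OVERHEAD_PCT = {4: 1.01, 8: 0.89, 16: 0.73}

def added(base, overhead_pct):
    """Absolute increase implied by a percentage overhead."""
    return base * overhead_pct / 100.0

added_area = {n: added(BASE_AREA_MM2[n], AREA_OVERHEAD_PCT[n])
              for n in BASE_AREA_MM2}
added_power = {n: added(BASE_POWER_W[n], POWER_OVERHEAD_PCT[n])
               for n in BASE_POWER_W}
```

For the 16-core processor, this works out to roughly 2.5 mm² of additional area and about 2.3W of additional peak dynamic power.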
CONCLUSIONS
In this article, we presented a microarchitecture that jointly considers security and reliability and can protect against deliberate as well as accidental attacks. Some of the limitations of the proposed microarchitecture are that it cannot handle dynamically compiled code or security issues that arise from corruption of dynamic data. On the reliability front, it can detect only those transient faults that occur during the checking period, or those whose effects last for the duration of the checking period. If any of the reliability mechanisms protecting the data caches fail, data can be corrupted. The proposed microarchitecture cannot detect such faults.
In addition to improving the proposed framework to overcome the described limitations, we are currently extending the technique to multithreaded processors. The performance impact can be further reduced by increasing the fetch bandwidth to the checkers, thereby reducing the wait time in the scheduler buffers. Implementing arbitration algorithms other than SRR and WRR can further improve the system performance. We currently use a bus-based interconnect between the cores and the checkers. We are investigating the use of other interconnects to support a wide variety of architectures. In this article, we considered dynamic integrity checkers as the security components and used them to provide reliability as well. Other secure architectures can be similarly modified to use their security components to provide reliability too.
