INTRODUCTION
Ensuring reliable operation during the entire service period is essential for computing systems, especially for those deployed in safety-critical applications and in remote or otherwise hazardous locations. However, today's highly complex nano-CMOS processor systems are becoming increasingly vulnerable to premature system failures due to unreliable hardware substrates and environmental stress. Also, increased manifestation of random, systematic, and parametric yield loss mechanisms is making it difficult for the chip manufacturers to maintain an acceptable manufacturing yield rate [ITRS 2012] . Hence, designing adaptive systems that can continue to function reliably even with faulty components can go a long way in sustaining Moore's law [Borkar 2005] .
A typical processor system consists of three classes of structures: (1) arrays such as instruction and data cache, translation lookaside buffer, branch history table, and reorder buffer; (2) data path; and (3) control logic (Figure 1). Arrays make up the largest share of transistors in a processor system. They are usually protected by redundant circuit structures such as spare rows, spare columns, spare blocks, spare banks, and error-correcting codes (ECCs) [Rusu et al. 2006; Rodrigues et al. 2010]. Data paths represent the second-largest class of structures. Data-path circuits are in turn dominated by the floating-point (FP) computation and integer division units. In this article, we present novel ways of protecting these large structures with very little added hardware cost.
We propose and investigate a low-overhead hardware framework for a chip multiprocessor (CMP) that serves to improve chip yield and reliability. The framework enables the CMP to remain in operation, albeit in a degraded performance mode, even when some of the large data-path units fail by exploiting the inherent redundancy that exists in the system. The proposed scheme involves hardware-assisted communication between the cores to share functional units across them. This sharing is exposed at an instruction-level granularity and is completely transparent to user or system software. The hardware and power overhead for the proposed scheme are minimal. Besides improving long-term reliability through graceful degradation, the scheme can also achieve reasonable enhancement in yield at an acceptable level of degradation in performance.
This scheme is targeted toward sparsely used functional units with large on-chip areas such as integer division units or floating-point units (FPUs).
Other major system components can be protected efficiently through alternate techniques. Storage arrays are usually protected by redundant circuit structures such as spare rows, spare columns, spare blocks, spare banks, and ECCs, and small and frequently used data-path units can be protected through replication.
RELATED WORK
Processor systems today have multiple units with the same functionality in order to exploit instruction- and thread-level parallelism and actually require only a critical subset of their hardware to be in working condition to remain functional. In addition, higher device density and lower cost per transistor allow us to include power-gated redundant structures in processors, which can be swapped in for units that fail in operation. Because of the redundancies, manufacturers can avoid throwing away chips unless they have critical functionality-disrupting defects, thereby improving yield. The spare structures are also used to extend the lifetime of a processor beyond that of the baseline architecture by keeping it functional, possibly in a degraded mode, even in the presence of failed subunits. The existing solutions consider redundancies at granularities from system to intraprocessor levels. However, sharing of hardware across multiple cores on a chip has remained a relatively less explored area.
The idea of incorporating redundancy in microprocessors for yield and defect tolerance is well entrenched. The 16-bit HYETI microprocessor is an early example of such a system, which had circuits for most of its functional units replicated 16 times in a bit-sliced design for optimal redundancy [Leveugle et al. 1994]. Commercial high-availability multiprocessor systems like the IBM p690 have been designed to exploit redundancy at chip and module levels [Bossen et al. 2002]. More recently, researchers have proposed ideas to use redundancy at finer granularities in order to achieve more efficient use of redundant hardware [Shivakumar et al. 2003; Srinivasan et al. 2005].
Within a processor, storage components and logic arrays are provided with redundant rows, columns, and subarrays for effective yield and reliability improvement [Stapper 1993; Bower et al. 2004] . However, structural irregularity and testability issues in logic and control units render them unsuitable for such partial redundancy [Koren and Koren 1998 ]. Hence, entire units need to be replicated to achieve better yield or fault tolerance. In this regard, Shivakumar et al. [2003] explored the possibility of using multiple execution units already present within a processor to improve manufacturing yield at the cost of performance degradation. In Srinivasan et al. [2005] , somewhat similar ideas of exploiting existing redundancies or building idle spare units within a processor to improve lifetime reliability were introduced. Joseph [2006] proposes software-controlled thread swapping across partially damaged cores for yield enhancement. Along with the inherent performance degradation due to the presence of faulty units, this scheme suffers from additional performance penalties due to repeated core hopping. The CASH (CMP And SMT Hybrid) architecture also advocates sharing of sparsely used functional units across several cores as a way to save area and reduce hardware complexity of individual SMT superscalar cores [Dolbeau and Seznec 2002] . Powell et al. [2009] use thread migration as a tool to make the entire CMP ISA compliant even though some cores in the CMP have failed components. Again, such a scheme suffers from thread migration overheads. Gupta et al. [2008] propose the potential sharing of every stage in the pipeline between neighboring cores in the CMP. When one core experiences a pipeline stage failure, it takes over or shares the healthy stage from the nearby core. Such a scheme can add significant hardware overhead and complicate design and verification of the CMP. 
Furthermore, for certain instruction types (such as integer ALU), execution on a nonnative execution unit can impose significant performance penalty [Rodrigues and Kundu 2011a] . Hence, there is a need to identify candidates for remote execution such that performance penalty is within bounds. The idea of using redundant resources in a CMP is not just limited to fault tolerance. Recently, Rodrigues and Kundu [2011b] used a hardware framework similar to what is discussed here for online testing of instruction execution in a CMP. A similar scheme was proposed by Borodin et al. [2011] , where functional units are shared across cores in a 3D stacked die for online testing and performance boosting. Weaver et al. [2009] propose Emμcode, a scheme where complex FP instructions that are unable to be executed locally are replaced by sequences of integer instructions. This enables Graceful Performance Degradation (GPD). However, this results in performance slowdowns of 1.3 to 4 times, which is a lot larger than the slowdown experienced by our scheme by outsourcing instructions (1.1 times). Further, this scheme requires complex trace optimization. Hence, even though such schemes reduce the complexity of the hardware for GPD, performance overhead may be unacceptable. A performance comparison between this and the proposed scheme follows later in the article.
In this article, we consider hardware-controlled resource sharing across cores. A core containing a faulty functional unit is unable to serve instructions that require the use of that unit. Since a core in a homogeneous multicore system is expected to handle all types of instructions, the fault renders the entire core unusable. A possible solution would be to incorporate duplicate units inside each core that can keep the damaged core working. However, this would entail considerable hardware overhead, especially for large units, which can be unacceptable for area-constrained embedded processor systems. On the other hand, all other healthy cores already possess copies of the same functional units. Hence, the healthy units in other cores can be used to execute the instructions that the damaged core is unable to serve. This can be achieved through a hardware-software approach, as in Joseph [2006] , with some context-switching overhead (saving process state, cache/TLB misses, branch mispredictions). Here we propose the use of a centralized Intercore Queue (ICQ) as an interface between cores in order to enable resource sharing among them (Figure 2 ).
PROPOSED APPROACH
We now present the proposed solution. To illustrate the general working, we describe a motivational example based on a dual-core CMP in Figure 2 . This is then followed by a more detailed description of the solution and its various components. In the results section, we present results from dual-core and quad-core systems.
General Working
In the proposed scheme, resource sharing is achieved via an ICQ. In a dual-core processor, there is a dedicated queue between the two cores, and for a quad-core processor, a global queue is shared by all four cores. This can be visualized as a simple medium of communication between the two cores, as shown in Figure 2 for the dual-core processor. In the fault-free scenario, the ICQ is not used. However, if certain large execution units (such as the FP ALU, multiplier, or divider) are found to be faulty in a core (faulty core), it outsources any instructions that it is unable to execute to a nearby core (helper core) via the ICQ. Each ICQ entry holds information necessary for execution of instructions. The helper core then tries to schedule and execute the instructions from the ICQ natively.
A pictorial depiction of the general steps followed during ICQ-based execution outsourcing is shown in Figure 3 . Note that in Figure 3 , by faulty stage, we mean that one or more large data-path units are faulty. Smaller data-path units (such as the integer ALU) may be protected via redundancy, since they are often small in size. Further, these units are exercised so often that performance penalty may be excessive if such instructions were to be outsourced [Rodrigues and Kundu 2011a] . In the figure, in stage A, we see that the faulty core has one or more faulty execution units. In stage B, the faulty core tries to schedule an instruction that it is unable to execute natively. It outsources the instruction to the helper core by writing the execution information into the ICQ. In stage C, the helper core is ready to schedule the instructions in the ICQ into its own execution engine. In stage D, the execution of the instruction is complete and the result is written back to the ICQ. In stage E, the faulty core reads the result from the ICQ and then goes ahead and retires the instruction at a later point in time in stage F. A detailed description of the steps and the various components are now presented.
Detailed Description
A simplified version of the homogeneous dual-core CMP described in Tendler et al. [2002] is presented to illustrate the proposed scheme. The idea can be easily extended for a quad-core or many-core system since we are looking at the interaction of two cores (one faulty and one helper) at a time. Each core is assumed to be a superscalar out-of-order execution machine with private L1 data and instruction caches and shared L2 cache. We assume a Symmetric Multiprocessor (SMP) paradigm for this proposed scheme. Basic modification involves incorporation of an ICQ. The architecture of the ICQ and the structural and behavioral modifications required in the cores are described later.
3.2.1. ICQ. ICQ forms the interface for dataflow between a faulty and helper core in our scheme. It acts as a temporary storage for instructions that are to be transferred across cores. The ICQ maintains ordering of instructions using FIFO order. We propose a push-pull scheme for scheduling the head of the ICQ to a helper core. When an instruction is ready in the ICQ to be serviced by the helper core, the helper core is immediately signaled to convey the information. Subsequently, at each cycle, after its native instructions are scheduled, the issue unit of the helper core looks for an empty issue slot to schedule this instruction. If it finds a slot, it pulls the instruction from the ICQ to its pipeline. On the other hand, if the helper core does not pull the ICQ, the priority of the entry is elevated to critical after a specified maximum wait period, called the idling threshold. A hardware counter is used to count the number of cycles the instruction spends in the ICQ, and the helper core is notified after the idling threshold is exceeded. This forces the helper core to give the instruction higher priority than the native instructions, and it is pushed into the helper core pipeline before any additional native instructions are processed.
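The push-pull policy described above can be sketched as a per-cycle decision in the helper core's issue logic. This is only an illustrative model under stated assumptions: the names `IcqHead` and `issue_cycle`, the `issue_width` parameter, and the choice to treat an entry as critical once its wait count reaches the threshold are ours and are not taken from the hardware design.

```python
from dataclasses import dataclass


@dataclass
class IcqHead:
    """Head-of-queue entry as seen by the helper core (hypothetical model)."""
    wait_cycles: int = 0  # cycles the entry has waited at the ICQ head


def issue_cycle(head, native_ready, issue_width, idling_threshold):
    """Return (native_issued, foreign_issued) for one helper-core cycle.

    head: IcqHead, or None when no ICQ entry is ready.
    native_ready: number of native instructions ready to issue this cycle.
    """
    if head is None:
        return min(native_ready, issue_width), False
    if head.wait_cycles >= idling_threshold:
        # Critical: the foreign instruction is pushed in ahead of native ones.
        return min(native_ready, issue_width - 1), True
    native_issued = min(native_ready, issue_width)
    if native_issued < issue_width:
        # Pull: a slot is left over after native instructions are scheduled.
        return native_issued, True
    head.wait_cycles += 1  # no free slot; the hardware counter keeps counting
    return native_issued, False
```

With a four-wide issue stage that is saturated by native instructions and an idling threshold of two, the entry waits for two cycles and is forced through on the third, displacing one native instruction.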
Structure. Figure 4 shows the ICQ's components and the core-interface signals. The ICQ has two main components: a multiported hardware FIFO to store the in-flight instructions and a control logic block (CLB) to synchronize communication with the cores.
The hardware FIFO can be implemented as an addressable memory using an SRAM or flip-flops along with two registers to store the address of the next available entry and the next entry to be served, called the Free Address Register (FAR) and the Occupied Address Register (OAR), respectively. Implementation of FIFO using addressable memory in this way is well known. Each entry in the ICQ has data fields for the instruction including the opcode, input operands, and execution output, as well as identifiers for source and helper cores. The execution output contains the result and the exception information specific to the instruction. The helper core is responsible for exception detection, but the faulty core handles the exception during the retirement stage. The complete process of executing an instruction in the helper core needs two reads and two writes of the ICQ. Hence, port contention might affect system performance when the ICQ load is high. We find through our experiments (Section 6.5) that a single read and write port for two cores and two each of read and write ports for four cores are enough to mitigate any port contention for most cases. In our scheme, only one entry is scheduled to be executed in a helper core per cycle. The write port data (Wr_Data) has two logical parts: (1) Input_Wr, the opcode and input operands from the faulty core, together with Control, the source and destination identifier bits from the CLB, both of which the CLB selects using the Inp_Sel signal, and (2) Result_Wr, the execution output from the helper core, which the CLB selects using the Res_Sel signal. The read port data has Input_Rd and Result_Rd components, which are the same as Input_Wr and Result_Wr, respectively. The CLB selects the entry for data transfer using Address_Wr and Address_Rd signals.
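A small software model may help to visualize the addressable-memory FIFO with its two address registers. The sketch below is not the hardware implementation: the class and method names, the dictionary layout of an entry, and the explicit occupancy counter `used` are assumptions made for illustration. The four methods correspond to the two writes and two reads that one outsourced instruction needs.

```python
class IcqFifo:
    """Circular FIFO addressed by FAR (next free) and OAR (next to serve)."""

    def __init__(self, depth):
        self.entries = [None] * depth
        self.far = 0   # Free Address Register
        self.oar = 0   # Occupied Address Register
        self.used = 0  # occupancy; tells full from empty when FAR == OAR

    def push_input(self, opcode, operands, source_core, helper_core):
        """Write 1: faulty core pushes an entry; returns None if full."""
        if self.used == len(self.entries):
            return None
        addr = self.far
        self.entries[addr] = {
            "opcode": opcode, "operands": operands,
            "source": source_core, "helper": helper_core,
            "result": None, "exception": None,
        }
        self.far = (self.far + 1) % len(self.entries)
        self.used += 1
        return addr

    def read_input(self):
        """Read 1: helper core reads the entry the OAR points to."""
        if self.used == 0:
            return None
        entry = self.entries[self.oar]
        return self.oar, entry["opcode"], entry["operands"]

    def write_result(self, addr, result, exception=None):
        """Write 2: helper core writes the result and exception information."""
        self.entries[addr]["result"] = result
        self.entries[addr]["exception"] = exception

    def read_result(self):
        """Read 2: faulty core reads the result; entry is freed, OAR advances."""
        entry = self.entries[self.oar]
        self.entries[self.oar] = None
        self.oar = (self.oar + 1) % len(self.entries)
        self.used -= 1
        return entry["result"], entry["exception"]
```

The model serves entries strictly in FIFO order, so only the entry at the OAR is ever handed to the helper core.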
The CLB is responsible for interfacing with the cores and maintaining the state of each entry in the FIFO. It also performs the helper-core selection if there are multiple choices for helper cores. It interfaces with each core through the following signals:
- Wr_In_Req: Faulty core to CLB, to request push of input
- Wr_In_Ack: CLB to faulty core, to indicate successful push of input
- In_Avail: CLB to helper core, to indicate that an entry is waiting for service
- In_Critical: CLB to helper core, to indicate that an entry has been waiting beyond the idling threshold and needs to be served as soon as possible
- Rd_In_Req: Helper core to CLB, to request reading of input
- Wr_Res_Req: Helper core to CLB, to request push of result
- Res_Avail: CLB to faulty core, to indicate result is available
Each entry in the FIFO is in one of the following six states:
(1) Invalid: Entry is unused. The address of the next available unused entry is stored in the FAR. If a faulty core pushes an input entry to the ICQ, the state is changed to Queued, and the FAR is advanced to reflect that the entry is no longer free.
(2) Queued: Entry is in the queue waiting for the older entries to be served first. When all the older entries are served, the address in the OAR matches the address of this entry, and the entry goes into the Ready state; it is ready to be served. The counter for tracking the idle cycles for this entry is started. The helper core is conveyed this information by asserting the In_Avail signal.
(3) Ready: The entry is waiting to be served. When the counter reaches the idling threshold, the state is upgraded to Critical, and the count is stopped. The helper core is notified of this priority increase by asserting the In_Critical signal. The helper core might read the entry before it becomes critical, in which case the state is changed to Execute directly.
(4) Critical: The entry has been in the queue long enough and needs to be served at the earliest opportunity.
(5) Execute: The helper core is executing the entry. When the helper core writes the result back to the entry, the state is changed to Served. The faulty core is notified that execution is complete by asserting the Res_Avail signal.
(6) Served: The entry has completed execution at the helper core and the result is available in the ICQ. When the faulty core reads the result from the queue, the entry is freed and returns to the Invalid state. The OAR is incremented to point to the next entry to be served.
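The life cycle of an ICQ entry can be summarized as a small transition table. The sketch below is illustrative only: the event names are hypothetical labels for the actions described in the text, and the Critical-to-Execute transition is inferred from the description rather than stated explicitly.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    QUEUED = auto()
    READY = auto()
    CRITICAL = auto()
    EXECUTE = auto()
    SERVED = auto()


# (current state, event) -> next state; event names are assumptions.
TRANSITIONS = {
    (State.INVALID, "push_input"): State.QUEUED,          # FAR advances
    (State.QUEUED, "reach_head"): State.READY,            # In_Avail asserted
    (State.READY, "threshold_reached"): State.CRITICAL,   # In_Critical asserted
    (State.READY, "helper_read"): State.EXECUTE,
    (State.CRITICAL, "helper_read"): State.EXECUTE,       # inferred transition
    (State.EXECUTE, "write_result"): State.SERVED,        # Res_Avail asserted
    (State.SERVED, "read_result"): State.INVALID,         # OAR advances
}


def step(state, event):
    """Return the next state, or raise ValueError for an illegal event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")
```

An entry that is pulled before the idling threshold goes from Ready directly to Execute; one that waits too long passes through Critical first.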
Figure 5(b) shows the timing diagram for an example scenario where the faulty core (core 1) uses the helper core (core 2) for execution of an instruction while the queue is empty. For this example, let us fix the idling threshold at two cycles and helper core execution time at three cycles. In cycle 1, the faulty core asserts the Wr_In_Req signal to inform the CLB that it needs to push an entry into the ICQ. In response, the CLB determines if there is a free entry and selects the helper core. In cycle 2, the faulty core sends the input data (Input_Wr) to the input port. Since there is a free entry, the CLB sends the acknowledgment signal (Wr_In_Ack) to core 1 and sends the write address (Addr_Wr), the Control bits, and the Inp_Sel signal to the appropriate input port. If core 1 does not receive the acknowledgment in cycle 2, it resends the write request on the next cycle. Since the entry goes into the head of the queue, the CLB sends the In_Avail signal to the helper core in cycle 3. The entry waits in the ICQ for two cycles (the idling threshold), and then the CLB sends the In_Critical signal to the helper core in cycle 6. The helper core sends the Rd_In_Req in cycle 7. In cycle 8, the CLB asserts the Address_Rd signal to the read port so that the helper core can read the entry to be served. The helper core takes three cycles to finish execution and sends the Wr_Res_Req to the CLB in cycle 12 to inform the CLB that it is ready to write the result back to the queue. In response, the CLB sends the address selector signal (Addr_Wr) and the Res_Sel signal to the appropriate write port in cycle 13, which allows the helper core to write the result back. In cycle 14, the CLB informs core 1 that the result is ready through the Res_Avail signal, and the faulty core reads the result back in cycle 15. In the figure, we divide the operation into distinct phases; the phases that involve the ICQ access, namely, phases A, C, E, and F, take two cycles each.
We consider this as the ICQ access latency in our simulations. A faster implementation is possible, such that the ICQ access latency can be only one cycle, but we find it more appropriate to consider a more pessimistic latency while evaluating the feasibility of the scheme. This operation for the ICQ is valid for both dual cores and quad cores, where two or four cores are connected to the ICQ, respectively. For quad cores, the CLB makes sure that there are enough free entries to satisfy all requests per cycle. There might be rare cases where the ICQ receives input requests from more than one core and has the space to satisfy only some of the requests. In such cases, cores are selected to fill the empty spaces based on a static priority and the rest are denied acknowledgment so that they can retry in later cycles.
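The static-priority arbitration for simultaneous push requests can be sketched as follows. This is a hypothetical illustration: the function name `arbitrate`, the representation of cores as integers, and the particular priority order are assumptions, since the text does not specify them.

```python
def arbitrate(requesting_cores, free_entries, priority_order):
    """Return the cores that receive Wr_In_Ack in this cycle.

    requesting_cores: set of core ids asserting Wr_In_Req.
    free_entries: number of unused ICQ entries available this cycle.
    priority_order: core ids listed from highest to lowest static priority.
    """
    granted = []
    for core in priority_order:
        if len(granted) == free_entries:
            break  # queue is full; remaining requesters must retry later
        if core in requesting_cores:
            granted.append(core)
    return granted
```

For example, if cores 1, 2, and 3 request a push while only two entries are free and the priority order is 0, 1, 2, 3, then cores 1 and 2 are acknowledged and core 3 retries in a later cycle.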
Two important design parameters involving the ICQ are the depth and the idling threshold. The depth refers to the number of instructions from all faulty cores that can simultaneously reside in the queue. Increasing the depth is expected to improve performance of the faulty core for workloads that have clusters of instructions requiring the use of the faulty resource. However, the area overhead will also increase. The idling threshold, on the other hand, is expected to have a bearing on the performance of the helper core. Higher values for the interval would mean less frequent force-through from the ICQ, possibly leading to lower performance degradation for the helper core. The advantage is expected to be more pronounced if the helper core also faces a strong demand for the shared functional units from its native instructions.
The ICQ forms a critical component of the scheme. Hence, it has to be protected by redundancies. Fortunately, the regular structure of buffers means they can be provided effective protection with low hardware overhead [Bower et al. 2004] .
Faulty Core.
The path that an instruction requiring a faulty unit takes through the pipeline structures is illustrated in Figure 6. We consider the Tomasulo dataflow model, with reservation stations and a Reorder Buffer (ROB), for a speculative out-of-order pipeline. Functional unit 1 is considered to be faulty. During the decode stage of the pipeline, a parallel lookup of the hardware fault table identifies whether an instruction requires the use of a faulty unit or not. When an instruction requires the use of a faulty unit, a flag, called the migration bit, is set in the control-store entry for that instruction to ensure proper control flow in the subsequent pipeline stages. Then the issue unit finds an empty slot in the ROB for this instruction. The instruction is not sent for execution. The capability of the faulty functional unit to write back to the Common Data Bus (CDB) is disabled in order to prevent data corruption on the CDB. When the instruction reaches the head of the ROB (i.e., it reaches the retirement stage), all dependencies are resolved and operands are available. At this point, the instruction is also known to be nonspeculative. Usually at this point the instruction has already been executed and is ready to be retired. However, if the migration bit is set, the instruction is scheduled to the ICQ if an empty slot is available. Otherwise, the instruction waits in the ROB for an ICQ slot. After the instruction is sent to the ICQ, the core waits for a signal from the ICQ notifying it that the results and exception information are available. On receiving the signal, the results are updated to the entry at the head of the ROB and are broadcast to the CDB from the ROB head, following the normal mode of operation. Any instruction waiting for the result of this instruction would not get the value from the functional unit output but from the ROB head. In the next cycle, the retirement unit handles the instruction commit to the architectural state.
Exceptions are handled during this stage, depending on the exception information received from the ICQ. The use of the migration bit in the control-store information and the exception bits in the ICQ preserves the speculative and precise interrupt behavior of the processor. Modern superscalars are equipped to force serialized execution for selected instructions. For example, the POWER4 microarchitecture [Tendler et al. 2002] implements the "Completion Serialization" mechanism whereby certain instructions are not allowed to execute speculatively and are issued to the issue queues only after all previous instructions are retired. An alternative mode of execution is to send the instruction to the ICQ as soon as all the source operands are available. Such a scheme is more effective in hiding the execution latency but causes speculative and potentially futile instructions to be sent to the ICQ and executed in the helper core. Sending instructions speculatively to the ICQ also increases the implementation complexity since we need to be able to flush multiple entries sitting on the ICQ if they are found to be on a mispredicted path. In our experiments, we use the early dispatch of instructions to the ICQ.
Helper Core.
An instruction is either pushed or pulled by the helper core as described before. In either case, a control-store entry is created for proper control flow for this instruction in the subsequent stages of the pipeline. After an empty reservation station and an ROB entry are found, the instruction is dispatched by the issue unit to the reservation station, and the operands are pulled from the ICQ. We note that the operands can also be pulled by the execution units wherever the critical path is mitigated. Once the instruction completes execution and reaches the head of the ROB, the results and any exception detected during the instruction execution in the helper core are written back to the ICQ. The reservation station and ROB entries are freed. The flow is illustrated in Figure 7. An important consideration here is that the result, once computed by the functional unit, will be broadcast to the CDB. So any other instruction waiting in any reservation station should not interpret this result as a native result. Assuming that results are tagged with the ROB entry number, we need to add an additional bit to the tag in order to identify the result as native or foreign. This composite tag would avoid any data corruption in the helper core.
The timing of the dataflow through the faulty core and the helper core considered together is shown in Figure 8. The top row shows the execution stages in the fault-free scenario: fetch (FE), decode (DE), issue (IS), execution (EX), writeback (WB), and retirement (RE). The fetch, decode, and issue stages form the front-end in-order part of the execution. The issue stage pushes the instruction into the reservation stations after all structural dependencies are resolved. Execution and writeback happen out of order. Instructions wait in the reservation stations until all data dependencies are resolved and are then executed out of order. Immediately after execution, the results are broadcast to the ROB and other reservation stations. Retirement again happens in order when the entry for the instruction reaches the ROB head.
In the faulty-core pipeline, the instruction can optionally wait in the issue queue until it is serialized, marked as serialization wait (SL_WT), that is, until all prior instructions are retired. Then it is written to the ICQ (QWr) and again waits until the results are read back from the ICQ (QRd). This period is marked as "helper core execution wait" (HELPER_EX_WT) in the figure. Writeback and retirement stages follow the QRd stage. The QWt stage in the ICQ denotes the wait period of the entry in the ICQ that includes the period to reach the queue head and the wait thereafter to be read by the helper core. In the helper core, the issue stage is responsible for reading the entry waiting to be served (QRd). This entry is then dispatched for execution after all the structural dependencies are resolved. After execution, when the entry reaches the ROB head, the entry is written back to the queue (QWr) instead of being retired. The writeback stage for the helper core is essentially a NO-OP as discussed before since there are no local reservation stations waiting for the result.
A single ICQ can be used for transferring instructions from each of the two cores to the other one as required. The change in the dataflow is constrained within the pipeline and introduces no data consistency problems in the architectural state of the system.
Overheads
This scheme certainly entails some overhead in terms of area and complexity. The additional hardware for incorporating this scheme involves the following:
(1) The hardware FIFO and the buses connecting the FIFO to each core, (2) An FSM CLB to control dataflow between the hardware FIFO and the cores, (3) Hardware counters for calculating the idling threshold, (4) Extra complexity in the control and synchronization logic of the cores to control the migration of instructions, (5) A couple of bits in the control-store entry for an instruction, and (6) Hardware fault map and associated wires to read and write the map.
We provide a rough estimate of the area of the hardware FIFO, which is the main overhead, in terms of transistor count. We consider two 80-bit extended-precision operands and one 80-bit result, 3 byte opcode, six exception status bits (used in x87 instructions), and 4 bits for source and destination. Rounding up to integral number of bytes, we find that each row consists of 36 bytes. From our experiments, we find that 20 entries per core is enough to sustain operation in the worst case. Considering 1W/1R SRAM cells for the dual-core system with isolated write and read access (eight transistors per cell), we can estimate the total transistor count per entry of the queue to be about 2.3K transistors. So for the dual-core system with 40 entries, we need approximately 92K transistors for the SRAM cells. For four cores, considering two read and two write ports, the overhead is roughly 360K transistors. Considering 10% extra overhead for peripheral circuitry (sense amps, bitline drivers, and the decoder), we similarly estimate the total ICQ overhead for the dual-core system to be 101K transistors. The transistor count for FPUs in modern microprocessors is hard to find. Gerwig and Kroener [1999] noted that a pipelined FPU for IBM hexadecimal FP format featured about 7.2 million transistors. In Naini et al. [2001] , an IEEE-compliant FPU with RAS features and partial support for denormalization had 1.9 million transistors. Assuming 2 million transistors for an FPU, the area overhead of our structure is roughly 5% of an additional FPU. Considering the area of a dual-core CMP with around a billion transistors, the area overhead is insignificant.
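The dual-core estimate above can be reproduced with a few lines of arithmetic. The sketch below simply restates the figures quoted in the text (36 bytes per entry, eight transistors per cell, 40 entries, 10% peripheral overhead, and an assumed 2-million-transistor FPU); the variable names are ours.

```python
BYTES_PER_ENTRY = 36          # entry width quoted in the text
TRANSISTORS_PER_CELL = 8      # 1W/1R SRAM cell with isolated read and write
ENTRIES_DUAL_CORE = 40        # 20 entries per core, two cores
FPU_TRANSISTORS = 2_000_000   # assumed FPU size used for comparison

bits_per_entry = BYTES_PER_ENTRY * 8
transistors_per_entry = bits_per_entry * TRANSISTORS_PER_CELL  # about 2.3K
cell_transistors = ENTRIES_DUAL_CORE * transistors_per_entry   # about 92K
peripheral = cell_transistors // 10                            # 10% overhead
total_icq = cell_transistors + peripheral                      # about 101K
overhead_vs_fpu = total_icq / FPU_TRANSISTORS                  # about 5%

print(transistors_per_entry, cell_transistors, total_icq)
print(f"{overhead_vs_fpu:.1%} of one additional FPU")
```

The result, roughly 101K transistors or about 5% of a 2-million-transistor FPU, matches the estimate given for the dual-core system.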
Additionally, our scheme depends on the ability of cores to reliably detect, diagnose, and deconfigure a faulty functional unit (discussed in detail in Section 4) and implicitly incurs the overheads required for such capabilities.
ARCHITECTURAL PREREQUISITES
The proposed framework requires mechanisms for detection, diagnosis, and deconfiguration of faulty functional units, either during manufacturing or at runtime.
Hard faults in cores are commonly detected during manufacturing through scan tests [Jha and Gupta 2003]. Schuchman and Vijaykumar [2005] outlined the microarchitectural modifications required to isolate faults to microarchitectural blocks through scan tests.
Incorporating such fault awareness at runtime requires nontrivial system modification. Considerable work has been done in the area of online error detection and recovery [Gizopoulos et al. 2011] . We summarize the possible detection, diagnosis, and deconfiguration methods that can be leveraged by our framework. Our work has no contributions toward this end.
Online Error Detection
Redundant execution is a feasible approach for online error detection in CMPs [Mukherjee et al. 2002]. In chip-level redundant multithreading, identical threads are executed on separate processor cores, with the trailing thread receiving load values and line prediction outcomes from the leading thread. Store outcomes from the leading and trailing threads are checked for correctness before writing them to the data cache. This approach is suitable for hard-fault detection since the redundant threads do not share resources. Built-in self-test approaches, which perform opportunistic online checking of processor components, also provide high fault coverage at low overheads [Apostolakis et al. 2009; Shyam et al. 2006]. Note that the proposed ICQ framework may also be used to test execution units as has been explored by us previously [Rodrigues and Kundu 2011b]. A third approach involves the use of robust low-cost dynamic checkers to verify invariants required for execution correctness [Austin 1999; Meixner et al. 2008]. Low-cost approximate checkers have been developed for online checking of FP computation [Seetharam et al. 2013; Lipetz and Schwarz 2011; Maniatakos et al. 2011].
Diagnosis and Deconfiguration
Once an error is detected, we need to determine the location and nature (hard/soft) of the fault that caused the error. The hard-fault detection and diagnosis framework described in Bower et al. [2005] can be used for this purpose. In order to isolate faulty units, substructures in the cores that we wish to isolate and deconfigure are classified as field deconfigurable units (FDUs). Additional bits in the instructions are used to track FDU usage by an instruction from the decode to commit stage. If an instruction result is found to be erroneous, a saturating counter is incremented for each FDU used by the instruction. If the fault count for an FDU rises beyond a threshold within a prespecified time interval, the fault in that unit is considered to be permanent.
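The counter-based diagnosis described above can be sketched roughly as follows. This is a minimal Python model, not the mechanism of Bower et al. [2005] itself: the class name `FduFaultTracker`, the default parameter values, and the policy of clearing all counters when a new interval starts are illustrative assumptions.

```python
class FduFaultTracker:
    """Toy model of per-FDU saturating fault counters (all names hypothetical)."""

    def __init__(self, fdus, threshold=3, interval=1000, counter_max=7):
        self.threshold = threshold      # count above which a fault is deemed permanent
        self.interval = interval        # length of the observation window, in cycles
        self.counter_max = counter_max  # saturation value of each counter
        self.counters = {fdu: 0 for fdu in fdus}
        self.permanent = set()
        self.window_start = 0

    def report_error(self, cycle, used_fdus):
        """Record an erroneous instruction that used `used_fdus` at `cycle`."""
        # Assumption: counters are cleared whenever a new interval begins.
        if cycle - self.window_start >= self.interval:
            self.counters = {fdu: 0 for fdu in self.counters}
            self.window_start = cycle
        for fdu in used_fdus:
            # Saturating increment: the counter never exceeds counter_max.
            self.counters[fdu] = min(self.counters[fdu] + 1, self.counter_max)
            if self.counters[fdu] > self.threshold:
                self.permanent.add(fdu)
        return set(self.permanent)


tracker = FduFaultTracker(["fp_alu", "fp_mul", "fp_div"])
for cycle in (10, 20, 30, 40):
    faulty_now = tracker.report_error(cycle, ["fp_alu"])
```

In this sketch, four errors attributed to the same unit within one interval mark it as permanently faulty, whereas errors spread over several intervals never accumulate past the threshold and are effectively treated as transient.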
A hardware fault table of the FDUs can then be used to deconfigure a faulty FDU. The fault table maintains one entry per FDU to track the operational health of each FDU. For many-core systems, the table can be extended to a fault map, mapping the helper cores to be accessed for each FDU. This table is updated online depending on the entries in the saturating counters. For yield enhancement purposes, the fault table can be initialized offline during preshipment testing to deconfigure any faulty FDU.
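As an illustration of how per-core fault tables might be combined into a fault map, consider the Python sketch below. The data layout (one dictionary of health bits per core) and the function name `build_fault_map` are assumptions made for the example; the actual structure is a hardware table.

```python
def build_fault_map(fault_tables):
    """Map each faulty (core, FDU) pair to the cores that can serve as helpers.

    fault_tables: {core_id: {fdu_name: True if healthy, False if deconfigured}}
    An empty helper list means no healthy unit of that type remains.
    """
    fault_map = {}
    for core, table in fault_tables.items():
        for fdu, healthy in table.items():
            if not healthy:
                fault_map[(core, fdu)] = [
                    other
                    for other, other_table in fault_tables.items()
                    if other != core and other_table.get(fdu, False)
                ]
    return fault_map


# Hypothetical dual-core example: different units are faulty in each core.
tables = {
    1: {"fp_alu": False, "fp_mul": True, "fp_div": True},
    2: {"fp_alu": True, "fp_mul": True, "fp_div": False},
}
fault_map = build_fault_map(tables)
```

An entry with an empty helper list corresponds to the case in which the same unit type has failed in every core, which the sharing scheme cannot recover from.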
SIMULATION FRAMEWORK
For simulation studies, we used the SESC architectural microprocessor simulator. It is an event-driven cycle-level simulator built on MINT, a MIPS processor emulator [Renau et al. 2005] . The simulator was suitably modified to model the proposed modifications in dual-core and quad-core CMPs running multiprogramming workloads.
Processor Configuration
In this work, we model 90nm 32-bit symmetric dual-core and quad-core processors. Each core is a four-way speculative out-of-order superscalar running at 3GHz frequency. Relevant system parameters for each core are summarized in Table I .
Dual-Core Modeling.
For experiments on a dual-core system, we model one or both cores as having permanent damage in their large data-path units. Since this study concentrates on high-area, high-latency, and low-utilization units, we model one of the FP ALU, multiplier, and divider units as the faulty unit in each damaged core. Identical units in both cores are not treated as faulty simultaneously, since no healthy unit of that type would remain and the recovery scheme fails for such a pathological case.
Quad-Core Modeling. In quad-core simulation, we model one, two, or three cores as being damaged simultaneously. Target damaged units are FP ALU and/or divider units. For simulation, we model a centralized queue serving all the cores.
Workloads
Dual-Core Workload Mix. For any simulation run, we combine two benchmarks to form a multiprogrammed workload and then spawn the threads separately on two cores. The benchmarks used are classified according to the proportion of dynamic FP instructions contained in them. SPEC2000 benchmarks equake and gcc are picked with low FP instruction count, and flops and fbench with high FP instruction count. We combine these to form an appropriate mix that is interesting for the analysis, as shown in Table II. These combinations form a representative set of the workloads that the cores can face with respect to FP intensity. For each workload, we set each of the FP ALU, multiplier, and divider units as faulty and measure the performance loss in the degraded system compared to a fault-free system.
Quad-Core Workload Mix. Here we combine seven different benchmark programs to form three four-threaded multiprogrammed workloads and then spawn the threads separately on four cores. SPEC2000 benchmarks equake, gcc, mcf, and ammp are picked with low FP instruction count, art is picked with moderate FP intensity, and flops and fbench are picked with high FP instruction count. Table III shows the representative set of workloads formed by the combination of these benchmarks. The workload names (EAFF, MGFF, and MGAA) are combinations of the first letters of the constituent benchmarks. For each workload, we set each of the FP ALU and divider units as faulty and measure the performance loss in the degraded system compared to a fault-free system.
The choice of helper cores can play a significant role in the performance of the multicore. We consider two possible schemes. In one, helper cores are chosen in a round-robin (RR) fashion, and in the other, they are chosen based on the number of FP instructions executed (nFPEx) by each core. The RR scheme is simple but lacks flexibility, while the nFPEx scheme is slightly more complex but, if implemented well, can reduce performance overhead. Details follow in the next section.
We vary the two design parameters, the ICQ depth and the idling threshold, and analyze their impact on the performance loss. We also record the percentage of instructions that go through the ICQ and exceed the idling threshold. We run 1 billion instructions across both cores after fast-forwarding the initial 2 billion instructions in each core. The performance of each core is measured based on number of instructions issued per cycle (issued IPC). Hence, an instruction issued in a faulty core and served by a helper core will be counted in the IPC of the faulty core. The IPC for the helper core reflects its performance in executing its native thread only.
RESULTS AND ANALYSIS
When a module in a core becomes faulty, a functional neighboring core helps with the execution. This may lead to performance degradation for both the cores involved in such interaction. We report this performance degradation in the faulty and helper cores with respect to the fault-free IPC of the individual cores. In the figures that follow, simulation results for selected workloads are shown, illustrating the performance degradation. When considering the quad-core workloads, results are presented for two possible choices of helper cores: (1) RR based and (2) nFPEx based. Detailed results are presented for the RR scheme. For the nFPEx scheme, we present a single case study, from which important conclusions can nonetheless be drawn.
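The reporting metric can be written down in a few lines of Python. The function names are hypothetical, and the IPC values in the usage lines are made-up numbers chosen only to illustrate the arithmetic, not simulation results.

```python
def relative_performance(ipc_degraded, ipc_fault_free):
    """Issued IPC of a core in the degraded system, relative to its fault-free IPC."""
    return ipc_degraded / ipc_fault_free


def performance_loss(ipc_degraded, ipc_fault_free):
    """Fractional loss; a negative value means the core ran faster than the baseline."""
    return 1.0 - relative_performance(ipc_degraded, ipc_fault_free)


# Hypothetical values: a core with fault-free IPC 0.95 that drops to 0.836.
loss = performance_loss(0.836, 0.95)
print(f"performance loss: {loss:.1%}")
```

Because the metric is a ratio against each core's own fault-free IPC, a helper core can show a relative performance slightly above 1.0 if its measured instruction window differs between the two runs.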
Dual-Core Results
Figures 9 and 10 illustrate the performance degradation when one of the cores is faulty. The y-axis shows the performance of each core relative to the fault-free situation, in which no neighborly help is sought. The performance varies with the depth of the ICQ, the type of faulty unit, and the nature of the workload. For example, a defective FP unit is more likely to hurt the performance of an FP-intensive program. The x-axis in Figure 9 represents the faulty unit type and the depth of the ICQ. The depth of the ICQ was varied from two to 20 entries, keeping the idling threshold constant at five cycles. The x-axis in Figure 10 represents the idling threshold after which an instruction forces its way through. The idling threshold was varied from two to 10 cycles for a constant ICQ depth of 10. In both cases, the results were largely insensitive to the value of the parameter held constant (idling threshold or ICQ depth), so we show results for a single value of that parameter. The performance of an infinite-depth ICQ was also studied. However, the performance improvement obtained from increasing the depth tends to saturate at a depth of around 20. Hence, we report results up to depth 20 only.
Figure 11 shows results when both cores have different faulty units, so that each core has to use the other simultaneously and instructions flow through the ICQ in both directions. Here, the performance improvement with ICQ depth saturated at a depth of 40 entries in the worst case. The x-axis represents the depth of the ICQ for the various combinations of faulty units in both cores, and the y-axis denotes the relative performance of the cores. We use the following combinations of faulty units in cores 1 and 2: FP ALU-FP Multiplier, FP Multiplier-FP ALU, FP ALU-FP Divider, and FP Divider-FP ALU.
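To make the roles of the two swept parameters concrete, the following Python sketch models a bounded queue with an idling threshold. It is a deliberately simplified toy, not the simulated hardware: the class name `ToyICQ`, strict first-in-first-out service, and the limit of one instruction issued per cycle are all assumptions.

```python
from collections import deque


class ToyICQ:
    """Toy bounded queue with an idling threshold (illustrative assumptions only)."""

    def __init__(self, depth, idling_threshold):
        self.depth = depth
        self.idling_threshold = idling_threshold
        self.entries = deque()  # each entry is [instruction, cycles_waited]
        self.forced = 0         # instructions that had to force their way through

    def try_enqueue(self, instr):
        """Called by the faulty core; returns False if it must stall (queue full)."""
        if len(self.entries) >= self.depth:
            return False
        self.entries.append([instr, 0])
        return True

    def tick(self, helper_has_idle_slot):
        """Advance one cycle; return the instruction issued to the helper, or None."""
        issued = None
        if self.entries:
            instr, waited = self.entries[0]
            if helper_has_idle_slot or waited >= self.idling_threshold:
                if not helper_has_idle_slot:
                    self.forced += 1  # waited too long: preempt the helper's slot
                self.entries.popleft()
                issued = instr
        for entry in self.entries:
            entry[1] += 1  # remaining entries age by one cycle
        return issued


icq = ToyICQ(depth=2, idling_threshold=2)
icq.try_enqueue("fadd r1")
icq.try_enqueue("fmul r2")
```

In this toy, a deeper queue lets the faulty core keep issuing while earlier instructions wait, and the idling threshold only matters when the helper rarely has a free slot, which is consistent with the qualitative trends reported in this section.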
Across the experiments, for the cases in which the helper core has no significant native FP activity, a 2% to 5% performance improvement is actually observed. This apparent oddity is due to the nature of the simulator. The simulator stops execution when the sum of fetched instructions in both cores equals the specified number. Hence, while the faulty core incurred more dead cycles due to extra latency of executing faulty instructions and executed fewer instructions, the helper core fetched and executed more instructions, thus changing its native workload profile slightly.
6.1.1. Workload equake-gcc. For this workload, for a faulty FP ALU, less than 1% of the fetched instructions are switched from the faulty core to the helper core, while the helper core has no FP instructions of its own. The idling threshold has consistently been shown to have no impact on performance. This is of particular interest when the helper core is running a critical thread region and would incur a wait period to service remote instructions. As expected for this workload, system performance is similar in the presence of a single faulty core or two simultaneous faulty cores.
6.1.2. Workload flops-gcc. Here the ICQ depth is found to be quite dominant in terms of performance impact. For the faulty FP ALU unit, the faulty core used a helper for about 14% of the issued instructions. As the ICQ depth was varied from two to 20, the faulty-core performance loss improved from 75% to 12%. Similar results are seen for a faulty FP Multiplier unit, for which approximately 12% of the issued instructions are sent to the helper core for execution (the loss improves from 67% to 11%). When the FP Divider is faulty, about 2% of the instructions are sent to the helper core, and the worst- and best-case degradations are 30% and 10%, respectively. The idling threshold has no impact on the faulty-core performance. The monotonic improvement in performance of the faulty core with increasing ICQ depth can be seen in Figure 9.
In this workload, the helper core had insignificant native FP instructions. Hence, there was no contention for the FP execution units. The base-case IPC for the four-way helper core is only around 0.95, which means the schedule and issue units are also utilized only partially, primarily due to the lack of instruction-level parallelism in its native thread. Hence, these units have enough free resources available to serve any foreign instruction that is injected. Almost all switched instructions were served within the idling threshold and very few had to be forced through the helper core.
When both the cores have faulty units (Figure 11 ), core 1 running the floating-point intensive benchmark flops is observed to show marked performance improvement with an increase of the ICQ depth, and the improvement saturates to 5% at a depth of 30. The recovery is better for a faulty FP Divider than for an FP ALU because of lower demand on the divider unit. Core 2, which executes the low-intensity benchmark gcc, recovers the performance loss almost entirely at a depth of around 10.
6.1.3. Workload flops-fbench. This mix of FP-intensive applications represents the worst-case combination that the system can face, since both faulty and helper cores have significant FP load. Although the percentage of instructions switched remains the same as in the previous case, the best-case degradation worsens from 12% to 16% for the faulty FP ALU unit and from 11% to 15% for the FP Multiplier unit. Results for the FP Divider were similar to the previous case (see Figure 9). When both the cores have faulty units, there can be a permanent performance degradation of around 10% in both the cores in the worst case, and the improvement saturates at a higher ICQ depth of 40. Since there was no significant performance improvement with variations in the maximum idling threshold, the results for varying the idling threshold for two simultaneous faulty cores are not shown here.
Quad-Core Results
Since the quad-core version of the system utilizes a global ICQ, it enables flexibility with respect to the choice of helper core. We present results on two such schemes.
6.2.1. Round-Robin (RR)-Based Helper Core Selection. In this subsection, we present detailed results for the scheme where the helper core choice is made in an RR fashion.
Single Faulty Core. For this configuration, exactly one core is considered faulty for each run. Figures 12 and 13 show the performance variations as the ICQ depth is varied, when core 1 and core 3 are faulty, respectively. The x-axis again shows the ICQ depths for each of the four cores. For each ICQ depth, we consider two faulty units: the ALU and the divider. Across the workloads, core 1 faces the least FP intensity and core 3 sees the most. The y-axis shows the relative performance with respect to the fault-free scenario.
Core 1 Faulty. Since core 1 has the minimum FP activity among all cores, a fault in core 1 does not lead to significant performance degradation. In Figure 12 , we see that for MGAA and MGFF workloads, there is no performance degradation in the system at all, since the ICQ is not pressed into service.
For EAFF, there is moderate FP activity in core 1; hence, for a faulty FP ALU, there is a loss of 3% in the faulty core for a queue depth of 4. However, a depth of 8 is sufficient to recover the loss. About 2% of the fetched instructions are switched from core 1 to the other cores, while the helper cores show no performance loss in executing their native threads. Very few instructions survive in the ICQ up to the maximum idling threshold, indicating that the helper cores had enough space to accommodate the foreign instructions without sacrificing their native IPC. For a faulty FP Divider unit, less than 1% of instructions are switched and there is no appreciable loss in the system even for a queue depth of 4.
Core 3 Faulty. In Figure 13 , for MGAA and a faulty FP ALU unit, there is about a 5% drop in performance in core 3 for a depth of 4, which improves to 1% for a depth of 8. All other cores are unaffected. For MGFF and EAFF, the faulty core suffers about 60% degradation at a depth of 4 but recovers sufficiently, with an increase in the depth reaching almost full performance at a depth of 32. Also, a small drop of 1% to 2% can be seen in the helper cores that have native FP instructions to run, representing mild contention for resources. Varying the idling threshold has no effect on reducing this drop.
Hence, for a single faulty core in a quad-core system, the scheme enables us to salvage the system without significant performance degradation using a queue depth of 32. The idling threshold is again consistently shown to have no impact on performance.
Two Faulty Cores. Here we consider simultaneous faults in two cores and investigate three different configurations as shown in Table IV .
Case 1 represents the worst case possible, when both the FP-intensive cores (3 and 4) are faulty (Figure 14) . For MGAA, the performance is similar to the case with only the third core faulty (shown in Figure 13 ), because the fourth core does not have any appreciable FP intensity.
For MGFF and EAFF, the ICQ depth is found to be quite dominant in terms of performance impact on the faulty cores. On average, for a faulty FP ALU unit, core 3 used helper cores for about 18% of the instructions, and core 4 did the same for about 7%. In MGFF, as the ICQ depth was varied from 8 to 48, the performance loss for core 3 improved from 40% to less than 1%, and that of core 4 improved from 90% to 4%. For EAFF, the corresponding numbers were 50% to 1% and 60% to 3%, respectively. Only a mild to negligible drop was observed in the performance of the helper cores (1 and 2) because of their lack of native FP intensity.
For cases 2 ( Figure 15 ) and 3 (not shown), the results show a similar trend. The more the floating-point intensity in a faulty core, the larger is the impact of increasing the queue depth. The performance loss generally saturates within 5% for deep enough queues. The helper cores do not show any significant degradation in these configurations.
Three Faulty Cores. We report the results of two configurations in Figure 16 : faulty FP ALU units in cores 1, 2, and 3 and cores 2, 3, and 4, respectively.
In the first configuration, for MGFF, core 3 suffers the most significant degradation, but the loss saturates to 1% at a depth of 32. There is a 3% to 5% loss in the only helper core (core 4) because it has the only functional FP unit in the system. This loss cannot be avoided by varying any design parameter. For EAFF, the trend is similar, with core 3 saturating at a 5% loss at depth 32 and core 4 (helper core) suffering a steady 3% loss.
In the second configuration (core 1 serving all other cores), for the MGFF workload, core 4 saturates at a loss of 5% and core 3 recovers completely. There is no appreciable loss in the helper core (core 1) since it has no native FP instructions. For EAFF, faulty cores 2 and 4 suffer a loss of 2% to 3%, whereas for faulty core 3 the loss is negligible. The helper core (core 1) does not see any degradation.
Hence, we see that the loss in the faulty cores is generally bounded within 5% for a queue size not exceeding 50. The helper cores also do not show degradation exceeding 5%. The variations in idling threshold have minimal impact, as the cores are wide enough to accept instruction from a faulty core without much interference to their native instructions.
6.2.2. Number of FP Instructions Executed (nFPEx) Based Helper Core Selection. The RR scheme is simple and is sufficient if the workload combinations are homogeneous (i.e., they have the same characteristics with respect to instruction distribution). However, if the workload characteristics differ, then there may be potential gains to be made. For example, when running the workload EAFF or MGFF with core 3 faulty, there are times when the core running fbench is used as the helper core. However, just like flops (the workload running on the faulty core), fbench is also an FP-intensive workload, and hence its performance suffers a little. If instead the cores running equake or art were chosen as helper cores, the helper-core performance penalty would be expected to be minimal. Hence, in such circumstances, a scheme that is slightly more sophisticated than the RR scheme may help mitigate some of the lost performance. We explored the use of the number of FP instructions executed (nFPEx) by a core as the metric for helper core choice. In this scheme, at the time of instruction scheduling from the ICQ, the core that has executed the fewest FP instructions is chosen as the helper core. If this core does not have the capacity for more instructions in its issue queue, the core with the next-lowest number of FP instructions executed is chosen as the helper core, and so on. Hence, in this scheme, helper core priority is given to the cores that, at the given point in time, have executed the fewest FP instructions. This count is the sum of a core's native FP instructions and those scheduled from the ICQ. A brief overview of a potential scheme to measure the number of FP instructions executed on each core is now presented.
Measuring the number of FP instructions executed on each core. Modern microprocessors feature multiple performance counters to count events such as cache misses, branch mispredictions, instructions retired, and so forth [Contreras and Martonosi 2005] . In this study, we assume the availability of performance counters that count the number of executed FP instructions in a core. Accordingly, at the time of instruction issue from the ICQ, the system always tries to schedule the instructions into the pipeline of the cores with the least number of FP instructions executed.
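The two helper-selection policies can be contrasted with a short Python sketch. The function names, the `has_room` callback standing in for the issue-queue capacity check, and the choice to let RR skip cores without capacity are assumptions for illustration; the counter values are made up.

```python
def pick_helper_rr(helpers, last_pick, has_room):
    """Round-robin: start after the previously chosen core, skip cores without room."""
    n = len(helpers)
    start = (helpers.index(last_pick) + 1) % n if last_pick in helpers else 0
    for i in range(n):
        core = helpers[(start + i) % n]
        if has_room(core):
            return core
    return None  # no helper can accept an instruction this cycle


def pick_helper_nfpex(helpers, fp_executed, has_room):
    """nFPEx: prefer the core whose FP-instruction counter is lowest."""
    # Ties are broken by core ID (an assumption; the text does not specify this).
    for core in sorted(helpers, key=lambda c: (fp_executed[c], c)):
        if has_room(core):
            return core
    return None


# Hypothetical state: core 3 is faulty; cores 1, 2, and 4 are potential helpers.
helpers = [1, 2, 4]
fp_executed = {1: 35, 2: 50, 4: 900}  # made-up performance-counter readings
choice = pick_helper_nfpex(helpers, fp_executed, has_room=lambda core: True)
```

With these made-up counters, the nFPEx policy steers work toward core 1 and falls back to core 2 only when core 1 has no issue-queue capacity, whereas RR would visit core 4 every third pick regardless of its FP load.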
We now present a case study on the workload EAFF and compare results obtained by the RR scheme and the nFPEx scheme. In this study, we assume core 3 to have faulty FP ALU, Multiplier, and Divider units. Hence, it must outsource all of its FP instructions. Conclusions drawn from this study help give an indication of the results expected in the other considered scenarios. In order to compare the RR and nFPEx schemes, we present two metrics: (1) the percentage of instructions from the ICQ executed by each potential helper core and (2) the IPC of the helper and faulty cores. In this experiment, the ICQ depth was maintained at 20 and the idling threshold set at 10 cycles.
A case study on EAFF. We ran the workload EAFF for 1 billion instructions after skipping the initial 2 billion instructions. The results obtained are plotted in Figures 17(a) and 17(b), respectively.
• Percentage of faulty core instructions executed by each potential helper core. In Figure 17(a), we have plotted the percentage of instructions of the faulty core executed by each core acting as a helper core. It can be seen that when using the RR scheme, around 50% of the instructions from the faulty core are executed by core 1 (running workload equake) and approximately 25% of the instructions are executed by core 2 (running workload art), while around 24% of the instructions are executed by core 4 (running workload fbench). Since the cores running equake and art exhibit few or no FP instructions, more often than not they possess idle resources to execute the FP instructions belonging to the faulty core in the ICQ. Since fbench is an FP-intensive application, idle FP execution resources are scarce, which explains the low percentage of instructions it executes as a helper core. When using nFPEx as the metric for helper core choice, a very different distribution of faulty core instructions executed per core is seen. In our experiments, we found equake to exhibit 3.5% FP instructions and art to exhibit 5% FP instructions. Hence, as per the metric, core 1, which runs equake, is expected to execute the most instructions from the ICQ, and as can be seen from Figure 17(a), this is the case. Core 1 executes almost 60% of the instructions from the ICQ, while core 2 executes around 37% of the instructions, and core 4 around 1.6% of the instructions. This result shows that the nFPEx scheme is better than the RR scheme at distributing the FP instructions from the ICQ across the available healthy cores.
• Performance per core. The IPC per core obtained when using the RR scheme and nFPEx scheme is shown in Figure 17 (b). In addition to that, we have also plotted the IPC per core in the fault-free multicore as an upper bound. It can be seen that when using any scheme to choose the helper core, cores 1 and 2 show no difference in IPC when compared to the fault-free multicore. This is because they exhibit almost no FP instructions. Further, when looking at the IPC of cores 3 and 4, it can be seen that there is a performance penalty. Of these, core 3 (the faulty core) shows more penalty than core 4 since it must wait to receive results of instructions executed on the helper core. Performance loss is more when using the RR scheme since a reasonable proportion (24%) of the instructions belonging to the faulty core must compete with the helper core instructions for resources (core 4). When using the nFPEx scheme, core 4 shows almost no performance penalty since its native instructions have far less competition for resources. Hence, using the nFPEx scheme may result in better IPC for all cores, especially when the workloads being executed are heterogeneous with respect to the FP instruction mix.
Comparison Against Schemes Based on Emulation
We now compare the proposed scheme against a scheme based on emulation [Weaver et al. 2009 ]. In such schemes, the basic idea is to replace any instructions in the executable that make use of faulty units by a stream of instructions that make use of healthy units. For example, multiply instructions may be replaced by a string of ALU operations. However, such a scheme has several disadvantages when compared to the proposed ICQ scheme:
(1) The program binary needs to be recompiled in order to bring such emulation into effect.
(2) In order to reduce the performance penalty, several additional changes to the microarchitecture and further changes to the executable need to be made. Weaver et al. [2009] argue that these optimizations result in up to six times performance improvement over a naive scheme that only uses recompilation.
(3) Such schemes need code compilation in advance and as such cannot be used to mask hard errors if an execution unit fails during the lifetime of a processor. For the ICQ scheme, there is no such limitation.
Fig. 18. Performance of the optimized emulation scheme [Weaver et al. 2009] and ICQ schemes with respect to the fault-free multicore for various combinations of faulty units and cores. Workload run is EAFF.
For comparison, we implemented the scheme proposed in Weaver et al. [2009]. Here, each instruction that would have run on a faulty unit takes many more cycles than when run on dedicated hardware. We used the latencies for emulated instructions from Weaver et al. [2009] to replace the execution latency of the faulty instructions in our experiments. Specifically, the latencies used for emulated instructions are 64 cycles for the FP ALU and 35 cycles for the FP Mult and Div. Experiments were run for various combinations of faulty cores and faulty units. Specifically, we consider the following scenarios: (1) FP ALU faulty, (2) FP ALU and FP Mult faulty, and (3) FP ALU, FP Mult, and FP Div faulty. Each of cases (1), (2), and (3) was considered with only core 3 faulty, with only core 4 faulty, and with both cores 3 and 4 having faulty units. The results obtained when running the workload EAFF are shown in Figure 18 for the emulation-based and the ICQ schemes. In general, it can be seen that the emulation scheme results in a significantly higher performance penalty. Notably, even when a single unit is faulty, the emulation scheme results in a 15% additional performance penalty when compared to the ICQ scheme. This penalty increases to almost 30% when a single unit each is faulty in cores 3 and 4. The ICQ scheme is thus both much less complicated than the emulation scheme and results in a far lower performance penalty, which makes it the superior option.
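The latency substitution used to model emulation can be sketched in Python as follows. Only the emulated latencies come from the text; the native latencies in `NATIVE` are placeholders invented for illustration, and summing latencies serially ignores the overlap an out-of-order core would achieve.

```python
# Emulated latencies in cycles, as quoted in the text from Weaver et al. [2009].
EMULATED_LATENCY = {"fp_alu": 64, "fp_mul": 35, "fp_div": 35}

# Hypothetical native latencies, for illustration only (not the simulated values).
NATIVE = {"fp_alu": 2, "fp_mul": 4, "fp_div": 12}


def exec_latency(unit, faulty_units, native_latency):
    """Latency of one instruction: emulated if its unit is faulty, native otherwise."""
    if unit in faulty_units:
        return EMULATED_LATENCY[unit]
    return native_latency[unit]


def serial_cycles(trace, faulty_units, native_latency):
    """Crude upper bound: total latency if the instructions ran back to back."""
    return sum(exec_latency(unit, faulty_units, native_latency) for unit in trace)


trace = ["fp_alu", "fp_mul", "fp_alu"]
healthy = serial_cycles(trace, set(), NATIVE)
emulated = serial_cycles(trace, {"fp_alu"}, NATIVE)
```

The sketch only shows where the substituted latencies enter the model; the actual penalty reported in Figure 18 depends on how much of this extra latency the pipeline can hide.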
Energy Overheads
We also evaluated the energy penalty of the proposed ICQ scheme for fault tolerance.
There are a few sources of energy penalty. First, every time a core that has faulty units tries to issue instructions into the ICQ, a check is made to see if an empty slot exists. If a slot exists, the operands and opcode of the instruction need to be read from the host core and written to the ICQ. Energy is also expended when healthy cores read instructions from the ICQ, issue them, execute them, and then write the result back to the ICQ. From here, the result must be read by the host core from the ICQ before the instruction is retired. In order to model energy, we used Wattch [Brooks et al. 2000] and CACTI [Shivakumar et al. 2001]. The overheads arising from these operations for various combinations of faulty cores and faulty units when running the workload EAFF are shown in Figure 19. The combinations of faulty cores and faulty units are the same as those used in Section 6.3. It can be seen that even in the worst case of three faulty units in cores 3 and 4, only a 4% energy penalty is observed. In the typical case, where one unit is faulty in one core, the increase in energy is less than 1%. We thus conclude that the proposed ICQ framework results in a very small increase in energy, an overhead that is acceptable for the fault tolerance capability it provides.
Analysis on the Number of ICQ Ports
We now present an analysis of our design choice of two read and two write ports for the ICQ serving four cores. In order to understand the effects of contention due to a limited number of ports, we performed a series of simulation runs with an increasing number of ports. Several combinations of faulty cores and faulty units were considered. The experiments were run using the equake_art_flops_fbench (EAFF) workload, in which flops and fbench are floating-point intensive. The results obtained are plotted in Figure 21. We considered one to four read/write ports. The results show that two read and two write ports are a good choice for four cores, as seen in the figure, where relative performance saturates at this design point. We have found that a single read/write pair works well for two cores, and also for four cores when there is only a single faulty core, or even when there is a single helper core serving three faulty cores. However, for cases where different units fail in more than one core, port contention leads to appreciable performance degradation. Two read and two write ports are sufficient to mitigate the contention and recover most of the performance, as is evident in the figure.
Fig. 21. Relative performance of the proposed scheme for various combinations of faulty core(s) and execution unit(s). The x-axis is read as {faulty core i_faulty core i+1}-{faulty unit a_faulty unit b}. The bars indicate performance when considering n read/write ports.
Scalability
So far in this article, we have seen the ICQ scheme in operation for dual- and quad-core systems. In general, it is expected that the higher the number of cores that have access to the ICQ, the richer the choice of helper cores and hence the lower the expected performance penalty. Such an architecture is also expected to increase the level of fault tolerance possible. However, an increase in the number of cores that have access to the ICQ will increase the ICQ access latency. Further, a design that shares an ICQ between eight or 16 cores will certainly pose floor-planning problems. Such an increase in complexity would be justified only if there were a commensurate decrease in the system performance penalty. We have used the average time spent by instructions in the ICQ as a metric to determine system performance. To quantify the benefits of sharing between two and four cores, we conducted an experiment in which core 3 is considered faulty. We then measured the average time spent in the ICQ for each instruction when one (FP ALU), two (FP ALU and FP Mult), and three (FP ALU, FP Mult, and FP Div) units are faulty for core 3 in dual and quad cores. The workloads run were flops-fbench on the dual core and EAFF on the quad core. The results obtained are plotted in Figure 20. It can be seen that although the average time spent in the ICQ reduces a little when going from two to four cores, the decrease is not significant. Hence, we extrapolate that going to eight cores will not add much value with respect to average time spent in the ICQ, and although it is possible, we do not explore the sharing of the ICQ between more than four cores. We thus conclude that the ICQ scheme is best used with up to four cores sharing the same ICQ. For a many-core CMP, using one centralized ICQ for each set of four cores would strike a good balance between performance and complexity with respect to our scheme.
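The suggested grouping for a many-core CMP amounts to a simple mapping from cores to ICQs, sketched below in Python. Zero-based core IDs, contiguous grouping, and the function names are assumptions for the example; the text does not prescribe how cores are assigned to clusters.

```python
CORES_PER_ICQ = 4  # cluster size suggested by the scalability analysis


def icq_for_core(core_id, cores_per_icq=CORES_PER_ICQ):
    """Index of the ICQ shared by the cluster containing `core_id` (zero-based)."""
    return core_id // cores_per_icq


def helper_candidates(core_id, num_cores, cores_per_icq=CORES_PER_ICQ):
    """Cores that share an ICQ with `core_id` and could therefore act as helpers."""
    first = icq_for_core(core_id, cores_per_icq) * cores_per_icq
    last = min(first + cores_per_icq, num_cores)
    return [core for core in range(first, last) if core != core_id]


# Hypothetical 16-core CMP: core 5 shares ICQ 1 with cores 4, 6, and 7.
candidates = helper_candidates(5, num_cores=16)
```

Under this grouping, a fault in one cluster never adds traffic to another cluster's ICQ, which keeps the queue's port count and access latency independent of the total core count.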
CONCLUSION
CMPs contain inherent redundancy. In this work, a microarchitectural technique was proposed to exploit such redundancy for salvaging yield and improving reliability. The central idea was to implement an ICQ to seek execution help from functioning neighboring cores. The resulting design changes are minimal and impose insignificant cost in terms of area and power. Simulation studies show that significant yield recovery is possible with only a 16% performance degradation in the worst case. The proposed scheme is useful for high-area, high-latency instructions that are executed sparingly. We have also shown that by careful scheduling of the ICQ instructions in the helper cores, performance penalty may be reduced even further. In the future, we plan to evaluate the performance impact of runtime software thread swapping against hardware-assisted instruction migration presented earlier as part of a comprehensive solution.
