Dennard scaling has ended. Lowering the supply voltage ($V_{dd}$) to sub-volt levels causes intermittent losses in signal integrity, so further voltage scaling is no longer an acceptable means of lowering the power required by a processor core. However, it is possible to efficiently correct the occasional errors caused by a lower $V_{dd}$ and thereby lower power. By deploying the right amount and kind of redundancy, we can strike a balance between the overhead incurred in achieving reliability and the energy savings realized by permitting a lower $V_{dd}$. One promising approach is the Redundant Residue Number System (RRNS) representation. Unlike other error correcting codes, RRNS has the important property of being closed under addition, subtraction, and multiplication, thus enabling computational error correction at a fraction of the overhead of conventional approaches. We use the RRNS scheme to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, Instruction Set Architecture (ISA), and RRNS centered algorithms. This paper is an extension of "Computationally-Redundant Energy-Efficient Processing for Y'all (CREEPY)" [11]. This submission adds the following:
INTRODUCTION
Dennard scaling [12] has been one of the main phenomena driving efficiency improvements of computers through several decades. The main idea of this law is that transistors consume the same amount of power per unit area as they scale down in size. However, leakage current and threshold voltage limits caused Dennard scaling to end [46] about a decade ago. This essentially negates any performance benefits that Moore's law may provide in the future; power considerations dictate that a higher transistor density results in either a lower clock rate or a reduction in active chip area.
Theis and Solomon [82] suggest that new device concepts within the purview of two-dimensional lithography technology, such as tunneling FETs, enable reduction of the $\frac{1}{2}CV^2$ energy to small multiples of $kT$, without resulting in low switching speed [81]. Similarly, research on ferroelectric transistors, a.k.a. negative capacitance FETs (NCFETs), demonstrates a sub-60mV/dec slope as well as a higher drive current [34-36, 65], both of which are necessary in rendering $V_{dd}$ reduction beneficial to energy reduction without sacrificing performance.
These next-generation devices are fast switching even at few tens of millivolts but, as a result, are vulnerable to thermal noise perturbations. This translates into intermittent, stochastic bit errors in logic. With signal energies approaching the kT noise floor, future architectures will need to treat reliability as a first-class citizen by employing efficient computational error correction.
In this article, we propose a scalable architectural technique to effectively extend the benefits of Moore's law. We enable reducing the supply voltage beyond conservative thresholds by efficiently correcting intermittent computational errors that may arise as a result of thermal noise. Energy benefits are observed as long as the overhead incurred in error correction is less than the energy saved by lowering $V_{dd}$. In this article, we demonstrate that the Computationally-Redundant Energy-Efficient Processing for Y'all (CREEPY) approach of introducing error correcting hardware to lower energy is beneficial.
Contributions
(1) Development of microarchitecture, Instruction Set Architecture (ISA) and RRNS centered algorithms towards a Computationally Redundant, Energy-Efficient core design.
(2) Design and analysis of an efficient RRNS multiplier unit using the index-sum technique.
(3) Novel RRNS-check-insertion heuristics to optimize performance/energy/reliability tradeoffs.
(4) Derivation of an estimated lower limit on signal energies via stochastic fault-injecting simulation.
We first introduce some mathematical background and notation in Section 2 and then provide a high-level overview of a CREEPY core in Section 3, before describing the RRNS algorithms and several other aspects of designing a CREEPY core in Section 4. We then describe our evaluation methodology and results, discuss related work, and conclude in Sections 5, 6, 7, and 8, respectively.
BACKGROUND
Triple Modular Redundancy
Error-correcting codes (ECC) are widely used in modern processors to improve reliability. However, these are limited to memory/communication systems and are unable to achieve computational fault tolerance. The conventional approach to computational fault tolerance is triple modular redundancy (TMR) [86]. As shown in Figure 1, the idea is to replicate the computation twice (for a total of three executions of each computation) and then take a majority vote. With a model that assumes that at most one of these three computations can be in error at any given point in time, it follows that at least two of the computations are error free; this can thus be used to detect and correct a single error, assuming an error-free voter.
While simple to understand and implement, this introduces more than 200% overhead in area and power, leaving plenty of room for improvement. Any energy savings from lowering $V_{dd}$ would be eclipsed by the overhead of correcting the resultant errors.
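To make the voting step concrete, the following is a minimal Python sketch of TMR under the stated model (at most one faulty copy, error-free voter). The function names `tmr_vote` and `tmr_add`, and the choice of flipping bit 3 to emulate a fault, are illustrative assumptions rather than part of any design described here.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the majority value of three redundant results.

    Assumes at most one copy is wrong and that the voter itself is error free.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one copy disagrees")
    return value

def tmr_add(x, y, fault_in_copy=None):
    """Run the same addition three times; optionally corrupt one copy."""
    results = [x + y for _ in range(3)]
    if fault_in_copy is not None:
        results[fault_in_copy] ^= 1 << 3  # hypothetical single bit flip
    return tmr_vote(*results)
```

The sketch makes the cost visible: three full-width additions plus a voter are needed to tolerate one fault, which is the overhead the rest of this section tries to avoid.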
Residue Number System
The Residue Number System (RNS) has been used as an alternative to the binary number system chiefly to speed up computation [1, 51] . This increased efficiency comes from the fact that a large integer can be represented using a set of smaller integers, with arithmetic operations permissible on the set in parallel. We present some of the properties of RNS below.
Let $B = \{m_i \in \mathbb{N} \text{ for } i = 1, 2, 3, \ldots, n\}$ be a set of $n$ pairwise co-prime natural numbers, which we shall refer to as bases or moduli. $M = \prod_{i=1}^{n} m_i$ defines the range of natural numbers that can be injectively represented by the RNS that is defined by the set of bases $B$. Specifically, for $x \in \mathbb{N}$ with $x < M$, $x \equiv (|x|_{m_1}, |x|_{m_2}, |x|_{m_3}, \ldots, |x|_{m_n})$, where $|x|_m = x \bmod m$. Each term in this $n$-tuple is referred to as a residue.
We also note that RNS is closed under addition, subtraction, and multiplication. This is because of the following observation: Given $x, y \in \mathbb{N}$ and $x, y < M$, we have $|x \operatorname{op} y|_m = ||x|_m \operatorname{op} |y|_m|_m$, where $\operatorname{op}$ is any add/subtract/multiply operation.
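The closure property can be checked with a short Python sketch. The toy base set (3, 5, 2, 7) and the helper names `to_rns`, `rns_op`, and `from_rns` are assumptions made for illustration; the conversion back to an integer uses the Chinese Remainder Theorem and requires Python 3.8 or later for the modular inverse via `pow`.

```python
import operator
from math import prod

BASES = (3, 5, 2, 7)   # pairwise co-prime toy moduli (assumed for illustration)
M = prod(BASES)        # 210: the integers 0..209 are uniquely representable

def to_rns(x, bases=BASES):
    """Encode x as its tuple of residues."""
    return tuple(x % m for m in bases)

def rns_op(a, b, op, bases=BASES):
    """Apply op residue by residue; no information crosses between residues."""
    return tuple(op(ra, rb) % m for ra, rb, m in zip(a, b, bases))

def from_rns(residues, bases=BASES):
    """Decode with the Chinese Remainder Theorem."""
    total = prod(bases)
    x = 0
    for r, m in zip(residues, bases):
        partial = total // m
        x += r * partial * pow(partial, -1, m)  # modular inverse, Python 3.8+
    return x % total

x, y = to_rns(13), to_rns(14)
sum_value = from_rns(rns_op(x, y, operator.add))   # 13 + 14
prod_value = from_rns(rns_op(x, y, operator.mul))  # 13 * 14, still below 210
```

Each residue is computed independently of the others, which is the source of both the parallelism and the fault isolation discussed later.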
Redundant RNS
To augment RNS with fault tolerance, $r$ redundant bases are introduced. The set of moduli now contains $n$ non-redundant and $r$ redundant moduli: $B = \{m_i \in \mathbb{N} \text{ for } i = 1, 2, 3, \ldots, n, n+1, \ldots, n+r\}$. These extra bases are redundant because any natural number smaller than $M$ ($= \prod_{i=1}^{n} m_i$) can still be represented uniquely by its $n$ non-redundant residues. Intuitively, the $r$ redundant residues form a sort of error code, because all residues are transformed in an identical manner under arithmetic operations. For $x < M$, the representation $x \equiv (|x|_{m_1}, \ldots, |x|_{m_{n+r}})$ contains $n$ non-redundant residues as well as $r$ redundant residues. For convenience, we further define $M_R = \prod_{i=n+1}^{n+r} m_i$.

8:4 B. Deng et al.

Table 1. A (4, 2)-RRNS Example with the Simplified Base Set (3, 5, 2, 7, 11, 13)

Decimal         mod 3               mod 5   mod 2   mod 7   mod 11   mod 13
13              1                   3       1       6       2        0
14              2                   4       0       0       3        1
13 + 14 = 27    (1 + 2) mod 3 = 0   2       1       6       5        1

All columns function independently of one another. An error in any one of these columns (residues) can be corrected by the remaining columns. Range is 210, with 11 and 13 being the redundant bases.

On applying arithmetic transformations to a redundant RNS (RRNS) number, any error that occurs in one of the residues is contained within that residue and does not propagate to other residues. When required, such an error can be corrected with the help of the remaining residues. Specifically, in an RRNS system with $(n, r) = (4, 2)$, a single errant residue can be corrected or two errant residues can be detected. Table 1 provides a simple example, and Section 4.8 outlines the necessary algorithms to do so. Research by Watson and Hastings [25, 89, 90] lays the foundation for the underlying theoretical framework that is used and extended in our work. Their work also details algorithms to handle RRNS scaling and fractional multiplication. They used (199, 233, 194, 239, 251, 509) as the (4, 2)-RRNS system, providing a range $M = 199 \times 233 \times 194 \times 239 \in (2^{31}, 2^{32})$. In Section 4.5, we discuss the methodology and implications of choosing a different set of RRNS bases for the purposes of trading range with overhead.
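As an illustration of how the remaining residues can recover a single errant one, the sketch below uses the Table 1 base set and a textbook leave-one-out projection: drop one residue at a time, decode the other five, and accept the unique decoding that falls inside the legitimate range. This is a swapped-in technique chosen for brevity, not the table-driven algorithm of Section 4.8, and the helper names are hypothetical.

```python
from math import prod

BASES = (3, 5, 2, 7, 11, 13)   # Table 1 base set; 11 and 13 are redundant
N = 4                          # number of non-redundant bases
M = prod(BASES[:N])            # legitimate range: 0..209

def encode(x):
    return [x % m for m in BASES]

def crt(residues, bases):
    """Chinese Remainder Theorem decode (needs Python 3.8+ for pow(.., -1, m))."""
    total = prod(bases)
    return sum(r * (total // m) * pow(total // m, -1, m)
               for r, m in zip(residues, bases)) % total

def correct_single_error(word):
    """Return (value, faulty_index); faulty_index is None for a consistent word."""
    value = crt(word, BASES)
    if value < M:
        return value, None
    for skip in range(len(BASES)):
        kept_r = [r for i, r in enumerate(word) if i != skip]
        kept_m = [m for i, m in enumerate(BASES) if i != skip]
        candidate = crt(kept_r, kept_m)
        if candidate < M:              # only the projection without the bad residue is legitimate
            return candidate, skip
    raise ValueError("more than one residue appears to be in error")

# Add 13 and 14 residue-wise, then corrupt only the mod-7 residue of the sum.
word_sum = [(a + b) % m for a, b, m in zip(encode(13), encode(14), BASES)]
word_sum[3] = (word_sum[3] + 3) % 7
value, faulty = correct_single_error(word_sum)
```

The projection approach needs up to six decodes per check, which is one reason a precomputed correction table is attractive in hardware.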
Not only does a residue number system achieve a higher efficiency due to enhanced bit-level parallelism (also, no carries required for addition), but also introducing 50% of overhead is sufficient to provide resiliency. As the granularity of an error is that of an entire residue, RRNS is capable of potentially correcting multi-bit errors as well, for free.
We design a computer based on these properties.
CREEPY OVERVIEW
Given new device concepts that enable device operation at signal energies close to the $kT$ noise floor [34-36, 65, 81, 82], CREEPY aims to achieve lower energy consumption by lowering $V_{dd}$ in such a manner that the intermittent errors that thereby arise are corrected efficiently. We take note of the compute preserving properties of RRNS (cf. Section 2.3) and propose building a Turing-complete computer around this idea.
A CREEPY core consists of six subcores, an Instruction Register (IR) and a Residue Interaction Unit (RIU), as depicted in Figure 2 .
Each subcore consists of an adder, a multiplier, a portion of the distributed register file, and a portion of the distributed data cache. The bit width of these components is the same as that of their corresponding residue (8-bit or 9-bit in this example). Each subcore is fault isolated from the others, because it is designed to operate on a single residue of data (analogously to a bit-slice processor, with bolsters). After a successful instruction fetch (the instruction cache stores instructions in binary and is ECC-protected), the ECC-checked instruction is dispatched onto the six subcores, which then proceed to operate on their corresponding slice of data. For example, adding two registers is done on a per-residue basis; the register file is itself distributed across the six subcores. Similarly, the data cache is also distributed across the six subcores and stores RRNS protected data. The RIU is then responsible for performing any operations that involve more than a single residue.
Section 4 evaluates several aspects of designing such a core and provides solutions. For example, conventional multipliers incur high cost in both energy and area, and, therefore, we leverage RRNS properties to provide an efficient solution in Section 4.4. The RRNS base selection is also very important in CREEPY core design, because it directly affects the computational range and energy efficiency, which we discuss in Section 4.5. The RIU logic includes three parts: RRNS consistency check logic, RRNS comparison logic, and RRNS to binary conversion logic. For the RRNS consistency check logic and RRNS comparison logic designs, we have detailed discussions in Section 4.8.1 and Section 4.8.4, respectively. The RRNS to binary conversion logic is relatively less important, as it is used only to support operations that are not native to RRNS, such as bit-shifting and division, which are relatively few in number. Furthermore, the circuitry to convert from RRNS to binary is identical to that found in the literature [7, 25, 76] to convert from RNS to binary, and therefore we omit it for space constraints.
Because conversions to and from binary are expensive and rather unnecessary for RRNS data, a CREEPY core operates entirely on RRNS data and literals. An upshot of this is that control-path errors manifest themselves as data errors, meaning that they can be handled simply by handling the data error. For example, if there is an error in bypass logic in a subcore, or, if a faulty decoder in one of the subcores causes it to perform a multiplication instead of an addition, then the resultant residue for that subcore would have an erroneous value but can be recovered from the remaining five residues that were a result of the correct addition operation.
Although operating entirely on RRNS data and literals avoids the significant overheads of converting to/from binary, naively representing a memory address (for the purposes of the Program Counter (PC), Load (LD), and Store (ST)) in an RRNS format may significantly degrade locality and change memory access patterns, both of which are fundamental to memory system performance. This issue has already been addressed by Srikanth et al. [70], who propose bit-manipulation techniques as well as a compiler based approach, with little to no overhead. Of the techniques proposed, their rns_sub scheme, which essentially subtracts the least significant residue from the others, renders the most energy-efficient architecture along with the added advantage of requiring no support from the software stack.
Interfacing a CREEPY core with heterogeneous accelerators (such as those on an SoC) that may or may not be RRNS based is a more involved issue, and we leave that to future work. CREEPY employs standard ECC-protected main memory because of ECC's compactness and efficiency when it comes to protecting stored data. However, standard ECC is not amenable to computational fault tolerance, and, therefore, the representation of data is in RNS form (as opposed to binary). The memory controller checks ECC on a processor load and generates the two redundant residues before loading the resultant RRNS data into the last level cache. Similarly, it generates ECC on a processor store (and the redundant residues are not stored into the main memory). The exact choice of ECC is not relevant to this article; any of the existing schemes [43] may be used.
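The load and store paths described above can be sketched as follows. This is a minimal illustration under several assumptions: the toy bases (3, 5, 2, 7) with redundant bases (11, 13), hypothetical function names `on_load` and `on_store`, a CRT-based decode standing in for whatever base-extension circuit the memory controller would actually use, and ECC checking and generation left out entirely.

```python
from math import prod

NONRED = (3, 5, 2, 7)   # toy non-redundant bases (assumed)
RED = (11, 13)          # toy redundant bases (assumed)

def crt(residues, bases):
    """Chinese Remainder Theorem decode (needs Python 3.8+ for pow(.., -1, m))."""
    total = prod(bases)
    return sum(r * (total // m) * pow(total // m, -1, m)
               for r, m in zip(residues, bases)) % total

def on_load(rns_word):
    """Main memory -> last level cache: append the two redundant residues.

    The ECC check that precedes this step is out of scope for the sketch.
    """
    x = crt(rns_word, NONRED)
    return tuple(rns_word) + tuple(x % m for m in RED)

def on_store(rrns_word):
    """Last level cache -> main memory: drop the redundant residues.

    ECC generation for the stored RNS word is out of scope for the sketch.
    """
    return tuple(rrns_word[:len(NONRED)])
```

For example, the RNS word (1, 3, 1, 6), which encodes 13, would be extended to (1, 3, 1, 6, 2, 0) on a load and truncated back on a store.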
CREEPY CORE
In this section, we present several considerations for the design of the CREEPY core.
Instruction Set Architecture (ISA)
The description of the CREEPY ISA is laid out in a manner similar to that of the Microprocessor without Interlocked Pipelined Stages (MIPS) ISA for explanatory purposes. To simplify instruction fetch and decode, all instructions are of fixed length, 32 bits. The ISA expects 32 registers (R0-R31), with R0 hard-wired to zero, R30 being the link register, and R31 storing the default next PC (= PC + 4). In our micro-architecture, each register is 49 bits long (i.e., it contains the RRNS redundant residues as well) and is sliced on a per-modulus (sub-core) basis. The data cache is also implemented in a similar manner, as it stores data in an RRNS format.
(1) R-Format (ADD/SUB/MUL) These instructions assume that the destination operand as well as both source operands are registers.
Reg3 is the destination for a load and is the source register for a store. The source/destination address for a load/store is given by Reg1 (base) + Reg2 (offset). Note that the memory address is therefore stored in an RRNS format. Recall from Section 3 that efficiently handling RRNS addresses without conversion to binary is critical to application performance. Tradeoffs and methodologies in this space have been handled by Srikanth et al. [70].
Reg1 is the register that needs to be checked. Once an error is detected, the system would try to correct it, for example, by performing the RRNS Single Error Detection and Correction algorithm (Section 4.8). Candidate usage scenarios are discussed in Section 4.6 and evaluated in Section 6. Helper instructions such as mov, ret, and so on, also exist but are omitted from this description for brevity.
Error Model
First, we distinguish fault, error, and failure as follows:
Fault. A single bit flips, but is not stuck-at, i.e., only intermittent/transient faults are considered. Causes may range from unreliable devices to low supply voltage to particle strikes to random noise, and any combination thereof.
Error. One or more faults in a single residue that show up during a consistency check.
Failure. An error that is uncorrectable with no recovery mechanism available, or an error that is undetectable.
Faults may lead to errors, which in turn may lead to failures. We can guarantee that the system is reliable if at most one error per core occurs between two RIU checks. Multiple bit flips are rare, but this phenomenon occurs if a circuit in the carry chain fails [40]. In our design, carry chains are limited to a residue, as there are no carries between residues. Therefore, any resulting multi-bit errors would be localized to a single residue, which we can correct. If this RRNS system needs to detect and correct multiple errant residues, then an extra checkpoint and rollback mechanism is necessary. However, based on the discussion above, the case of multiple residues in error is extremely rare. We therefore omit the design of the checkpoint mechanism from the current system and leave it to future work.
Redundancy in time, i.e., check at cycle x, check again at cycle y, check again at cycle z, and vote, does not apply to this model as it is possible that the three checks suffer 3 independent 1-bit faults, rendering voting useless. The transient clause in the model rules out stuck-at faults. An implication of this is that we cannot achieve reliability by merely trading performance alone. Additional resources in terms of spatial redundancy are necessary, which is exactly what has been designed.
Different components of the core are protected via specialized means that target each component. The guiding principle is to design a system that uses the more efficient of RRNS- and ECC-based redundancy, depending on the range and nature of the data being protected. Where both techniques are deemed insufficient to prevent a fault from metastasizing into an error, and eventually into a failure, the more conventional (and expensive) method, TMR, is employed. An alternative is to prevent the fault from occurring in the first place by using a high $V_{dd}$ (and/or circuit hardening). Choosing optimally between these two expensive techniques is beyond the scope of this document, but we assume that the RIU uses a high $V_{dd}$/hardened circuitry. We assume that errors in control signals manifest themselves as errors in data (for example, a control error causing one of the subcores to operate on the wrong opcode will be caught as a data error); however, one can potentially further improve the control signals' integrity by using either TMR or intelligent state assignment.
Signed Number Representation
There are three competing ways of representing signed numbers, given an RRNS framework as presented in Section 2.3. Each presents its set of tradeoffs, which we now detail.
M is the product of all the non-redundant moduli (M = m1*m2*m3*m4), and MR is the product of all the redundant moduli (MR = m5*m6).
Correction factors (scaling and offset) are applied to the result of each arithmetic operation. However, further analysis indicates that these correction factors require knowledge of the signs of the operands, which are not as trivial to determine as in binary. RRNS sign determination is a time-consuming algorithm. Moreover, no method of detecting arithmetic overflow is known for this representation.
Similarly to the M Complement representation, the results of arithmetic operations must be offset by a correction factor before they can be corrected. However, these correction factors turn out to be independent of the sign of the operands. We also find that this representation enables simple algorithms for comparison (and thereby sign detection) and arithmetic overflow detection. In fact, these algorithms make use of a technique used in the error correction algorithm itself. These algorithms are discussed in detail in Section 4.8. We choose Excess-M/2 to be the de facto signed representation scheme for CREEPY.
Optimized Multiplier Unit Design
Many workloads in the domains of multimedia, image processing and digital signal processing are highly multiplication intensive [87] . Index-sum multiplication has been proposed in the past [57, 58] to achieve multiplication via simple addition and table lookup operations, thereby rendering it more efficient than traditional binary multiplication provided the size of the LUT is not too large. The principle is analogous to using a logarithm operation, i.e., a multiplication can be achieved via a table lookup, addition and a reverse table lookup, as summarized as follows for the product of two numbers X and Y :
(1) Use a pre-defined mapping table to generate index(X) and index(Y).
(2) Add the two indices.
(3) Use the reverse mapping table to obtain the product from the index sum.
While realizing an LUT that is addressable by a 32-bit input is rather expensive, leveraging RNS properties allows us to slice this table into a few tables with address sizes closer to 8 bits, as outlined by Preethy et al. [57, 58]. We extend this idea into RRNS by adjusting the RRNS bases (cf. Section 4.5) to be amenable to index-sum LUTs, the requirements for which are summarized below.
Index-sum multiplication is based on the theory of Galois fields, which can be classified into three types: $GF(p)$, $GF(p^m)$, and $GF(2^m)$, where $p$ is an odd prime number and $m \in \mathbb{Z}^+$. The range of integers that can be represented bijectively in a Galois field, and the encoding methodology, depend on the GF type [58]. We skip the methodology of deriving $GF(p^m)$, as we do not utilize it for CREEPY.
$GF(p)$:
Any integer $X \in [1, p-1]$ can be uniquely coded as a single integral index code $\alpha$ by the relationship $X = |g^{\alpha}|_p$, where $\alpha \in [0, p-2]$ and $g$ is a primitive root such that $|g^{p-1}|_p = 1$. See Table 2 for an example.
, whose sum is 11, which reverse maps to 18 and is indeed the desired product. Ex: X = 3 and Y = 6 map to, respectively, <0, 3, 1> and <1, 3, 1>, whose sum results in <1, 6, 2>. Since $\gamma \in [0, 1]$, the modulo sum results in <1, 6, 0>, which reverse maps to 18 and is the desired product.
See Table 3 for an example.
Therefore, the relative preference of GF types is $GF(p) > GF(p^m) > GF(2^m)$, as they require 1, 2, and 3 index codes, respectively. Furthermore, smaller values of $p$ and $m$ lead to a smaller LUT. These considerations impact the choice of RRNS bases, as discussed in Section 4.5 and Table 4.
By using the index-sum technique in conjunction with RRNS, we greatly simplify the complexity of multiplication. Index-sum multiplication can be efficiently performed via a simple addition and two modest table lookup operations. We achieve a reduction in ALU gate count using this approach by about 87% when compared to using a traditional multiplier in RRNS, which itself reduces the gate count by 52% when compared to a traditional non-error-correcting binary ALU, thereby realizing area, energy, and reliability improvements, as we demonstrate in Section 6.
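For the $GF(p)$ case, the table lookup, addition, and reverse lookup can be sketched in a few lines of Python. The helper names are hypothetical, the primitive root is found by brute force (acceptable only for small primes), and zero operands are handled as a special case because zero has no index; a hardware LUT would need an equivalent provision.

```python
def primitive_root(p):
    """Smallest g whose powers generate every value in 1..p-1 (p an odd prime)."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def build_tables(p):
    """Return (index, inverse): value -> index code, and index code -> value."""
    g = primitive_root(p)
    inverse = [pow(g, a, p) for a in range(p - 1)]
    index = {v: a for a, v in enumerate(inverse)}
    return index, inverse

def index_sum_mul(x, y, p, index, inverse):
    """Multiply modulo p using two lookups, one addition, and a reverse lookup."""
    if x % p == 0 or y % p == 0:       # zero has no index code
        return 0
    return inverse[(index[x % p] + index[y % p]) % (p - 1)]

index, inverse = build_tables(11)
product = index_sum_mul(3, 6, 11, index, inverse)   # equals (3 * 6) mod 11
```

Each table has only p - 1 entries, which is why small prime bases keep the LUTs modest.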
Selecting RRNS Bases
Watson [89] used the base set (199, 233, 194, 239, 251, 509) in his article. However, the range rendered by this set is larger than $2^{31}$ but smaller than that of a 32-bit unsigned integer, $2^{32}$. Furthermore, these bases are not amenable to designing index-sum based multipliers, as discussed in Section 4.4. The necessary and sufficient conditions on the bases can be summarized as follows:
(1) Each pair of bases $m_i$, $m_j$ must be relatively prime (for RRNS representation [89]).
(2) $\max_{n+1 \le i \le n+r}$
The proofs of Conditions (1)-(6) are available in Watson's thesis [89], and Condition (7) is based on the theory of Galois fields, which was discussed in Section 4.4. We limit our analysis to $(n, r) = (4, 2)$ for simplicity and find bases that satisfy the conditions summarized above, while keeping the overhead to a minimum. Table 4 lists several such possibilities. (61, 149, 128, 71, 179, 181) is the set of bases that offers the least overhead, whereas (421, 211, 256, 347, 503, 521) offers a range superior to $2^{32}$ at additional overhead.
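The range and storage figures quoted for these base sets can be reproduced with a short Python sketch. Only Condition (1) (pairwise co-primality) is checked, since it is the only condition fully stated above; the function name `summarize_bases` and the use of per-residue bit widths as an overhead proxy are assumptions for illustration.

```python
from itertools import combinations
from math import gcd, prod

def summarize_bases(bases, n=4):
    """Report co-primality, range, and bit widths for an (n, r)-RRNS base set."""
    nonred = bases[:n]
    return {
        "pairwise_coprime": all(gcd(a, b) == 1 for a, b in combinations(bases, 2)),
        "range": prod(nonred),
        # A residue modulo m takes values 0..m-1, so it needs (m-1).bit_length() bits.
        "data_bits": sum((m - 1).bit_length() for m in nonred),
        "total_bits": sum((m - 1).bit_length() for m in bases),
    }

watson = summarize_bases((199, 233, 194, 239, 251, 509))
compact = summarize_bases((61, 149, 128, 71, 179, 181))
wide = summarize_bases((421, 211, 256, 347, 503, 521))
```

Under this accounting, Watson's set needs 49 bits per value, which matches the register width used in the ISA description.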
RRNS Check Insertion Strategies
Given that the CREEPY microarchitecture supports the error model outlined in Section 4.2, it is necessary to carefully insert RRNS_check instructions, as they have a direct impact on the performance-energy-reliability metrics of the core. In this section, we outline the following check insertion schemes.
Periodic Check.
Insert a single check instruction after every n instructions. When n = ∞, this is an unchecked core, and when n = 1, every instruction is checked. Note that lowering the value of n increases the check insertion frequency, raising performance overhead. While increased check insertion frequency typically provides increased reliability, one must be wary of the fact that the check instruction itself has non-zero latency (cf. Section 4.8), meaning that the longer the core spends in consistency checking, the longer it leaves its state vulnerable for errors to creep in. On the other hand, not checking every instruction also increases the probability of errors manifesting in multiple residues, leading to core failure.
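A minimal Python sketch of periodic insertion over a flat instruction list follows. The function name, the string representation of instructions, and the omission of the register operand of the check instruction are all simplifying assumptions; a compiler or runtime pass would also have to respect basic-block boundaries.

```python
def insert_periodic_checks(instructions, n, check="RRNS_check"):
    """Return a new instruction list with one check after every n instructions."""
    if n < 1:
        raise ValueError("n must be at least 1")
    out = []
    for count, instr in enumerate(instructions, start=1):
        out.append(instr)
        if count % n == 0:          # never true for n = float("inf"): unchecked core
            out.append(check)
    return out

program = ["ADD", "MUL", "SUB", "ADD", "ADD"]
checked = insert_periodic_checks(program, 2)
```

With n = 1 the instruction count doubles, which illustrates why the check period is a first-order performance knob.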
Pipelined Check.
Insert a pipelined check that checks n instructions after every n instructions. This approach has the performance advantage of amortizing the latency of RRNS_check via pipelining, as well as the reliability advantage of being able to increase the state coverage of the consistency check.
StateTable Guided Adaptive Check.
We define a bookkeeping entity known as StateTable in Section 5 to maintain temporal information about the vulnerability of processor state. Whenever the error probability of a register exceeds a certain threshold, an RRNS_check is inserted for that register. Naturally, this can be extended to insert pipelined checks if more than one register is in need of a check. The StateTable itself is assumed to be an error-free entity that can be implemented in either software or hardware. Like the other two schemes, this insertion scheme can be implemented by the compiler or by the runtime (hardware or software); however, it is likely that utilizing a runtime component for this purpose would yield greater accuracy, which translates to improved efficiency and reliability, although subject to the overhead the StateTable itself introduces.
Irrespective of the check strategy, we acknowledge that the following need to be error free for correct execution; however, for the purposes of this simulation, we ignore their overheads/implications on control flow by assuming periodic checkpointing for potential rollbacks: (1) Effective address of each memory access (RRNS check), (2) Instruction contents (standard ECC check), and (3) Main memory contents (standard ECC check). As a low-overhead checkpoint mechanism, one can use an incremental checkpoint scheme to save energy and reduce storage overhead when compared to using full checkpoints alone. An incremental checkpoint records only the entries modified since the last checkpoint (which could itself be either a full checkpoint or an incremental checkpoint). When a rollback is necessary, the system can use the last full checkpoint and the subsequent incremental checkpoints to recover the machine state. A detailed tradeoff analysis of the size, frequency, reliability, and energy of such a scheme is beyond the scope of this article.
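The incremental checkpoint idea can be sketched in Python as follows. The class name `CheckpointLog`, the dictionary model of machine state, and the assumption of a fixed set of state entries (registers are never added or removed) are illustrative choices, not a description of an evaluated mechanism.

```python
class CheckpointLog:
    """One full checkpoint followed by incremental checkpoints of modified entries."""

    def __init__(self, state):
        self.full = dict(state)      # last full checkpoint
        self.deltas = []             # incremental checkpoints taken since then
        self._last = dict(state)     # state as of the most recent checkpoint

    def take_full(self, state):
        self.full = dict(state)
        self.deltas = []
        self._last = dict(state)

    def take_incremental(self, state):
        """Record only the entries that changed since the previous checkpoint."""
        delta = {k: v for k, v in state.items() if self._last[k] != v}
        self.deltas.append(delta)
        self._last = dict(state)

    def restore(self):
        """Replay the incremental checkpoints on top of the last full checkpoint."""
        state = dict(self.full)
        for delta in self.deltas:
            state.update(delta)
        return state

regs = {"R1": 1, "R2": 2, "R3": 3}
log = CheckpointLog(regs)
regs["R1"] = 5
log.take_incremental(regs)
regs["R2"] = 9
log.take_incremental(regs)
regs["R3"] = -1                      # corruption after the last checkpoint
recovered = log.restore()
```

Storage grows with the number of modified entries rather than with the size of the full state, at the cost of a longer replay on rollback.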
Multi-Domain Voltage Supply
The error distributions of the domains of a CREEPY core, viz., computational logic, SRAM cells, and RIU logic, differ. In an SRAM device, any fault occurring in one of its transistors gets latched, thereby resulting in an error. In contrast, glitches in logic transistors get masked if the glitch does not occur close to the clock edge. Also, to avoid having to "check a check instruction," we assume that the RIU logic is error-free, protected via TMR, hardened logic, and/or higher signal energies, with the latter sufficient to model the energy effects of the former. Given that the vulnerability of these domains increases from computational logic to SRAM cells to RIU logic, it is inefficient to assume a uniformly high signal energy across these domains. We model this phenomenon by independent voltage rails for each of these domains. Shimazaki [68] and Rusu [64] proposed some multi-voltage domain designs. The voltage domains referred to in CREEPY are coarse grained (module based), rendering the implementation feasible.
RIU Algorithms
The algorithm for single error correction was originally given by Watson [89]. However, RNS renders comparison and arithmetic overflow detection non-trivial. We present algorithms to perform these RIU functions by augmenting the consistency checking algorithm. This way, no extra hardware is warranted beyond that required by the error check.
(c) A non-zero difference indicates the presence of an error. This pair of differences indexes into an entry of a pre-computed (fixed) error correction table, which contains the index of the residue that is in error and a correction offset that needs to be added to that residue to correct said error.
The RRNS_check instruction performs this RRNS Single Error Detection and Correction algorithm. For the error detection step, the system performs (a) and (b) to get the values of $\Delta m_5$ and $\Delta m_6$. For the error correction step (if necessary), it performs (c). Analysis of the algorithm reveals that the error detection step takes 8 cycles, while the correction step takes 2 cycles. Therefore, once the system inserts an RRNS_check instruction, the first step is to execute the 8-cycle error detection procedure. If no error is found, then this RRNS_check instruction is complete and takes 8 cycles in total. If an error is detected, then 2 more cycles are needed for the RRNS correction operation to complete (resulting in 10 cycles in total).
For ease of presentation, we present such an error correction table for a smaller (toy) set of RRNS base moduli in Table 5. The total number of entries in such a table is at most $2\sum_{i=1}^{4}(m_i - 1)$. For the remainder of this section, this set of bases is used for explanatory purposes.
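The table-driven check can be sketched in Python for the toy bases (3, 5, 2, 7, 11, 13). Several choices here are assumptions made for illustration: the deltas are computed as the base-extended value minus the stored redundant residue (a sign convention chosen to agree with the delta pairs quoted in this section), a CRT decode stands in for base extension, and the table is built by brute-force enumeration of all single-residue errors rather than taken from Table 5.

```python
from math import prod

BASES = (3, 5, 2, 7, 11, 13)   # toy set; 11 and 13 are redundant
N = 4
M = prod(BASES[:N])            # 210

def crt(residues, bases):
    """Chinese Remainder Theorem decode (needs Python 3.8+ for pow(.., -1, m))."""
    total = prod(bases)
    return sum(r * (total // m) * pow(total // m, -1, m)
               for r, m in zip(residues, bases)) % total

def deltas(word):
    """(delta_m5, delta_m6): base-extended value minus stored redundant residue."""
    x = crt(word[:N], BASES[:N])
    return tuple((x - r) % m for r, m in zip(word[N:], BASES[N:]))

def build_table():
    """Map each reachable non-zero delta pair to (residue index, correction offset)."""
    table = {}
    for x in range(M):
        good = [x % m for m in BASES]
        for i, m in enumerate(BASES):
            for e in range(1, m):
                bad = list(good)
                bad[i] = (bad[i] + e) % m
                entry = (i, (-e) % m)                 # offset that undoes the error e
                assert table.setdefault(deltas(bad), entry) == entry  # unambiguous
    return table

TABLE = build_table()

def rrns_check(word):
    """Return the word with a single errant residue corrected, if there is one."""
    d = deltas(word)
    if d == (0, 0):
        return list(word)
    i, offset = TABLE[d]
    fixed = list(word)
    fixed[i] = (fixed[i] + offset) % BASES[i]
    return fixed
```

In this sketch, the entries that point at a non-redundant residue number exactly $2\sum_{i=1}^{4}(m_i - 1) = 26$, because each error value can appear with or without a wrap-around of the decoded value.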
Unsigned Number Overflow Detection.
In the absence of any error or overflow, adding two unsigned RRNS numbers results in both $\Delta m_5$ and $\Delta m_6$ being zero. As has just been explained, the presence of an error is handled by the error correction table. In the absence of error, we observe that any overflow manifests itself as a fixed index into the error correction table, with the entry not corresponding to any error. Table 6 provides some examples of this observation. While computation of the deltas is most efficient using a base-extension algorithm, we use the Chinese Remainder Theorem (CRT) or the Mixed-Radix Conversion (MRC) method to first convert the RRNS number to binary before computing deltas. This is solely for explanatory purposes; binary conversion is not actually necessary to detect overflow. Iterating through all possible combinations of numbers and operations, we observe that the value pair ($\Delta m_5$, $\Delta m_6$) is fixed. Moreover, ($\Delta m_5$, $\Delta m_6$) = (10, 11) is not a legitimate address of the error correction table (Table 5), thus enabling a distinction between an error and an overflow. This approach, however, does not apply to multiplication. Table 8. In this case, we observe that the pair ($\Delta m_5$, $\Delta m_6$) is fixed to (1, 2).
Signed Number Overflow Detection.
Recall from Section 4.3 that CREEPY uses the Excess-M/2 signed representation.
Note that neither (10, 11) nor (1, 2) is a legitimate address in Table 5, thereby enabling a distinction between an error and an overflow. However, while this method works for both addition and subtraction, it does not hold for detection of multiplication overflow, as the delta-pair is not constant and sometimes indexes into a legal error correction table entry. Figure 6 shows an overview of the whole algorithm. We observe that the described algorithm works in a similar manner even with the base sets in Table 4. For example, with Watson's bases (199, 233, 194, 239, 251, 509), an overflow results in a delta-pair of (77, 289), whereas an underflow results in (174, 220). Neither of these pairs indexes into a legitimate entry of the error correction table for this set of bases (cf. Appendix E, Watson [89]).
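The fixed delta pairs can be derived and checked with a short Python sketch for unsigned addition and subtraction. It assumes the deltas are computed as the base-extended value minus the stored redundant residue, in which case an overflow shifts the decoded value by $-M$ and an underflow by $+M$; the function names are hypothetical and a CRT decode again stands in for base extension.

```python
from math import prod

def crt(residues, bases):
    """Chinese Remainder Theorem decode (needs Python 3.8+ for pow(.., -1, m))."""
    total = prod(bases)
    return sum(r * (total // m) * pow(total // m, -1, m)
               for r, m in zip(residues, bases)) % total

def deltas(word, bases, n=4):
    """Base-extended value minus stored redundant residue, per redundant base."""
    x = crt(word[:n], bases[:n])
    return tuple((x - r) % m for r, m in zip(word[n:], bases[n:]))

def signatures(bases, n=4):
    """Delta pairs expected on overflow and on underflow, derived from M mod m_i."""
    M = prod(bases[:n])
    overflow = tuple((-M) % m for m in bases[n:])
    underflow = tuple(M % m for m in bases[n:])
    return overflow, underflow

def classify(x, y, bases, subtract=False, n=4):
    """Add or subtract two unsigned values residue-wise and classify the result."""
    sign = -1 if subtract else 1
    word = [((x % m) + sign * (y % m)) % m for m in bases]
    d = deltas(word, bases, n)
    overflow, underflow = signatures(bases, n)
    if d == (0,) * (len(bases) - n):
        return "ok"
    if d == overflow:
        return "overflow"
    if d == underflow:
        return "underflow"
    return "error"

TOY = (3, 5, 2, 7, 11, 13)
WATSON = (199, 233, 194, 239, 251, 509)
```

Under these assumptions, `signatures(TOY)` and `signatures(WATSON)` reproduce the pairs quoted in the text for both base sets.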
Comparison.
Comparison is an important operation because of its use in determining control flow. In a manner similar to overflow detection, we explore potential algorithms to perform RRNS comparison without incurring unnecessary hardware overhead.
Chiang et al. [9] and Omondi [52] proposed number comparison methods for residue numbers based on parity bits. However, a prerequisite of these parity comparison methods is that all moduli be odd (in addition to being pairwise relatively prime). In CREEPY, one of the nonredundant moduli is even (to enable fast fractional multiplication [89]); therefore, this approach is not suitable.
Instead, we propose leveraging the error check algorithm itself to check for an overflow after a subtraction: To compare $X$ and $Y$, perform $X - Y$ and derive the delta-pair $(\Delta m_5, \Delta m_6)$. Then, $X \ge Y$ iff the delta-pair is (0, 0) (i.e., no overflow) and $X < Y$ iff the delta-pair is (174, 220) (i.e., $X - Y$ results in an underflow).
This new residue number comparison method can be used for both unsigned and Excess-$\frac{M}{2}$ signed numbers. It is easy to understand that this idea is suitable for unsigned residue numbers:
$X < Y \Rightarrow X - Y < 0$, thereby resulting in an underflow. For an Excess-$\frac{M}{2}$ signed number $X$, an injective mapped residue number can be defined as follows:
$X' = X + \frac{M}{2}$, so that $X' - Y' = X - Y$, which reduces to an unsigned comparison. A caveat to note is that correction factors should not be added for a comparison operation. These are summarized in Figure 7(a) and (b).
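A minimal Python sketch of comparison by subtraction follows. It uses hypothetical small bases rather than the CREEPY bases, assumes the delta is the stored redundant residue minus the re-derived one, and assumes that an Excess-$\frac{M}{2}$ number is stored as its value plus $M/2$. The function names are illustrative only.

```python
from math import prod

# Hypothetical bases for illustration only.
NONREDUNDANT = (3, 5, 7, 8)
REDUNDANT = (11, 13)
MODULI = NONREDUNDANT + REDUNDANT
M = prod(NONREDUNDANT)
UNDERFLOW_PAIR = tuple((-M) % m for m in REDUNDANT)


def encode_unsigned(value):
    """Residues of an unsigned integer in [0, M)."""
    return tuple(value % m for m in MODULI)


def encode_signed(value):
    """Assumed Excess-M/2 encoding: a value in [-M/2, M/2) is stored as value + M/2."""
    return encode_unsigned(value + M // 2)


def delta_pair(residues):
    """Redundant residues minus those re-derived (via CRT) from the nonredundant ones."""
    n = len(NONREDUNDANT)
    base = 0
    for r, m in zip(residues[:n], NONREDUNDANT):
        partial = M // m
        base += r * partial * pow(partial, -1, m)
    base %= M
    return tuple((r - base) % m for r, m in zip(residues[n:], REDUNDANT))


def greater_or_equal(x, y):
    """Return True iff x >= y, using only a subtraction and the delta-pair."""
    # No correction factor is applied to the difference for a comparison.
    diff = tuple((a - b) % m for a, b, m in zip(x, y, MODULI))
    pair = delta_pair(diff)
    if pair == (0, 0):
        return True   # no underflow
    if pair == UNDERFLOW_PAIR:
        return False  # x - y underflowed
    raise ValueError("delta-pair indicates an error rather than a comparison outcome")
```

Because both operands carry the same excess, it cancels in the subtraction, so the signed case goes through the same path as the unsigned one: `greater_or_equal(encode_signed(-3), encode_signed(-10))` is true.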
Correction Factors.
In this section, we are concerned with the addition, subtraction, and multiplication operations on two numbers that do not generate any overflow. Recall from Section 4.3 that CREEPY uses the Excess- Consider x = a and y = b. The sum x + y can be represented for each subcore 1 ≤ i ≤ n + r as follows:
However, the expected addition result is
It follows that:
(1) $1 \le i \le n$ and $m_i$ is odd: Examining Equations (1b) and (2) shows that no correction factor is necessary.
(2) $1 \le i \le n$ and $m_i$ is even: Examining Equations (1b) and (2)
However, the expected subtraction result is
From examining Equations (3) and (4), it follows that:

Multiplication. Again, for brevity, we only present the case where two positive integers are multiplied; without loss of generality, $x = a$ and $y = b$. The product $xy$ becomes
However, the expected multiplication result is
As residues are typically 8-bit wide, consider a 511 entry LUT per subcore that stores the following:
From examining Equations (5), (6), and (7), it follows that:

The correction factors for the addition and subtraction operations require a single, constant addition/subtraction operation, whereas for multiplication, two additions/subtractions and a modest table lookup are required. Another advantage of the schemes presented here is that sign determination is not necessary and that they can be performed at the subcore level, without the involvement of the RIU.
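The constant per-subcore correction for addition and subtraction can be sketched in Python under stated assumptions. The bases are hypothetical. An Excess-$\frac{M}{2}$ number is assumed to be stored as its value plus $M/2$, so a raw residue-wise sum represents $a + b + M$ where $a + b + M/2$ is expected. The resulting constant, $|M/2|_{m_i}$, is derived from that assumption rather than taken from the equations of this section, though it does agree with the statement that odd nonredundant moduli need no correction. Multiplication, which needs a lookup table, is not covered.

```python
from math import prod

# Hypothetical bases for illustration only.
NONREDUNDANT = (3, 5, 7, 8)
REDUNDANT = (11, 13)
MODULI = NONREDUNDANT + REDUNDANT
M = prod(NONREDUNDANT)
HALF = M // 2

# One constant per subcore: |M/2| modulo m_i.
# It is 0 for odd nonredundant moduli and m_i/2 for the even one.
CORRECTION = tuple(HALF % m for m in MODULI)


def encode(value):
    """Assumed Excess-M/2 encoding of a signed integer in [-M/2, M/2)."""
    return tuple((value + HALF) % m for m in MODULI)


def add(x, y):
    """Raw sum represents a + b + M; subtract the constant to get a + b + M/2."""
    return tuple((a + b - c) % m for a, b, c, m in zip(x, y, CORRECTION, MODULI))


def sub(x, y):
    """Raw difference represents a - b; add the constant to get a - b + M/2."""
    return tuple((a - b + c) % m for a, b, c, m in zip(x, y, CORRECTION, MODULI))
```

With these bases, `CORRECTION` is `(0, 0, 0, 4, 2, 4)`, and `add(encode(17), encode(-30))` equals `encode(-13)`. Each subcore applies its own constant independently, without determining the sign of either operand.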
EVALUATION METHODOLOGY
To measure the performance-energy-reliability tradeoff of a CREEPY core, we integrate a stochastic fault injection mechanism into a cycle-accurate, in-order, trace-based simulator. We abstract the notion of using next-generation devices operating at low signal energies ($E_s$) and the resulting interaction with the $kT$ noise floor into $P_e$, the probability of an error occurring in a transistor state in any given cycle. $E_s$, provided as an input to the simulation, is a measure of the signal energy at the input of a transistor; $P_e$ is the probability of a fault occurring at the output of a transistor in any given cycle. The relationship between $E_s$ and $P_e$ can be defined by the following relation:
. From Section 4.7, these inputs are vectors as they denote the signal energies and error probabilities for each voltage domain; however, for explanatory purposes, we present them as scalars for the remainder of this section. Also input to the simulator is the check insertion strategy, as discussed in Section 4.6. Because we are evaluating a very different number system, we simulated an unpipelined microarchitecture with no branch prediction and a two-level memory hierarchy (LLC-DRAM, with latencies of 12 cycles and 100 cycles for LLC hit and miss, respectively) to maintain our primary focus in this article. Adding more features to our design has been left as future work.
We first introduce a series of error events and their probabilities.
$P_e$: Probability of an error occurring in a transistor state in any given cycle. This is provided as an input to the simulation, as just discussed.

$P_{add}$: Probability of at least one error in an adder (each sub-core has an adder). If there are $N_{add}$ transistors in an adder, then the probability of all of these transistors being free of error is $(1 - P_e)^{N_{add}}$. Therefore, $P_{add} = 1 - (1 - P_e)^{N_{add}}$. $P_{sub}$ and $P_{mul}$ are calculated similarly. For multi-cycle operations, this definition holds as long as the state of each transistor is used exactly once for the operation, which is true for these operators. Note that this is a conservative (pessimistic) estimate in our evaluation, because we ignore any error masking that may potentially occur.
$P_{R_i}$: Probability of at least one error being present in a slice (sub-core/residue) of register $R_i$ since its last write. To compute this, we devise a StateTable, the $i$th entry of which holds the tuple (P, cycle), where P is the probability of at least one error being present in the corresponding residue of $R_i$ on its most recent update, and cycle is the cycle at which that update occurred. This StateTable is updated for each register write. For example, consider the register $R_0$.
(1) At cycle 0, the default value of the $R_0$ tuple is (P=0, cycle=0).
(2) At cycle 10, assume that we have an ADD instruction, ADD R0, R1, R2, and that it is the first instruction writing to $R_0$. We then update the tuple value to (Error_Probability_ADD, 10). It is necessary to update the P value here, because the error probability of this ADD instruction should be taken into account. The P value is set back to 0 once an RRNS check is inserted for that register and no error is detected, and the cycle field is set to the current system cycle.

This way, the P field in the StateTable always reflects the probability of that register having at least one error present in one of its residues, given its most recent update at the cycle field.
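The bookkeeping in steps (1) and (2) can be sketched as a small Python data structure. The class and method names are hypothetical, and the value 1e-9 merely stands in for Error_Probability_ADD; this is an illustration of the update rules, not the simulator's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class StateEntry:
    p: float = 0.0  # probability of at least one error at the most recent update
    cycle: int = 0  # cycle of the most recent update


class StateTable:
    """One (P, cycle) tuple per register, as described in the text."""

    def __init__(self, num_registers):
        # Step (1): every register starts as (P=0, cycle=0).
        self.entries = [StateEntry() for _ in range(num_registers)]

    def record_write(self, reg, p_error, cycle):
        """Step (2): an instruction writes the register with a given error probability."""
        self.entries[reg] = StateEntry(p_error, cycle)

    def record_clean_check(self, reg, cycle):
        """An RRNS check found no error: reset P and stamp the current cycle."""
        self.entries[reg] = StateEntry(0.0, cycle)


table = StateTable(num_registers=4)
table.record_write(0, p_error=1e-9, cycle=10)  # ADD R0, R1, R2 at cycle 10
table.record_clean_check(0, cycle=25)          # a later check finds no error
```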
Assuming an SRAM implementation of an 8-bit wide $R_i$, the number of transistors is 8 × 6 = 48. The probability of $R_i$ being error free is subject to two probabilities: (1) the probability of an error-free write, $P_1 = 1 - \mathrm{StateTable}[R_i].P$, and (2) the probability of no error creeping into it since its last write, $P_2 = (1 - P'_e)^{48(c - \mathrm{StateTable}[R_i].\mathrm{cycle})}$, where $c$ is the current cycle and $P'_e$ is the probability of an error occurring in the state of an SRAM transistor. Due to the nature of an SRAM device, any fault occurring in one of its transistors gets latched, resulting in a higher probability of an error (compared with glitches in logic transistors, which are masked if they do not occur close to the clock edge). As such, we assume $P'_e = 100 P_e$. Putting it all together, we have $P_{R_i} = 1 - P_1 P_2$.

$P_{LOAD_X}$: Probability of at least one error being present in the loaded data of address $X$. This is analogous to $P_{R_i}$, with the extended StateTable storing an entry for each cache line. As we assume a perfect off-chip (ECC protected) main memory, cache lines brought in on a miss are initialized with zero error probability, and the entries of cache replacement victims are evicted from the StateTable. Finally, $P_{LOAD_X}$ encapsulates the probability of an error in the implicit computation of the address $X$ itself (from its base and offset) during the execution of the load, in addition to the probability of an error in the loaded data from the cache line.

$P_{SC}$: Probability of at least one error occurring in a sub-core since the last time it was checked.
To illustrate, consider the following add instruction: ADD R3, R2, R1. Then, at the end of the instruction, $P_{SC} = 1 - (1 - P_{add})(1 - P_{R_2})(1 - P_{R_1})$.

$P_C$: Probability of exactly one error occurring in a CREEPY core since the last time it was checked.
This translates to exactly one sub-core being in error (where the sub-core error itself may be of multi-bit form; RRNS can tolerate multi-bit flips within a single residue). Therefore,
$P_C = {}^{6}C_{1} \times P_{SC} \times (1 - P_{SC})^{5}$, where the combinatorial choose operator ${}^{n}C_{r}$ enumerates the number of ways in which $r$ items can be chosen from $n$ distinct items.

$P^{0}_{C}$: Probability of no error in a CREEPY core since the last time it was checked. $P^{0}_{C} = {}^{6}C_{0} \times (1 - P_{SC})^{6} = (1 - P_{SC})^{6}$.

$P^{fail}_{C}$: Probability of a CREEPY core failing at any given cycle since the last time it was checked.
The current version of the CREEPY micro-architecture is unable to correct more than one error occurring in the core and assumes that a recovery mechanism such as checkpointing is in place. As such, we deem two or more errors in the core as amounting to a failure. Therefore, $P^{fail}_{C} = 1 - P^{0}_{C} - P_C$, which we use to estimate the probability of a failure $P^{fail}_{C,i}$ at each time step $t_i$. We use a commonly used reliability metric, Mean Time Between Failures (MTBF) [77], which can be defined as follows:
. The subscript i in P fail C,i represents the ith instruction of the instruction stream. MTBF also corresponds to mean time to checkpoint recovery.
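The chain of probabilities defined in this section can be sketched in Python as follows. The function names are hypothetical. The sketch assumes six sub-cores, 48 SRAM transistors per 8-bit residue, an SRAM error probability of 100 times $P_e$, and that two or more erroneous sub-cores constitute a failure, so that the failure probability is one minus the no-error and single-error probabilities. It stops short of MTBF itself.

```python
import math
from math import comb

NUM_SUBCORES = 6         # nonredundant plus redundant sub-cores
SRAM_TRANSISTORS = 8 * 6  # 8-bit residue stored in 6-transistor SRAM cells
SRAM_ERROR_FACTOR = 100   # assumed ratio of SRAM to logic error probability


def p_unit(p_e, n_transistors):
    """At least one error among n transistors: 1 - (1 - p_e)^n, for p_e < 1."""
    # log1p/expm1 keep precision when p_e is many orders of magnitude below 1.
    return -math.expm1(n_transistors * math.log1p(-p_e))


def p_register(p_e, p_at_write, write_cycle, now):
    """At least one error in a register residue: 1 - P1 * P2."""
    p1 = 1.0 - p_at_write
    exposure = SRAM_TRANSISTORS * (now - write_cycle)
    p2 = math.exp(exposure * math.log1p(-SRAM_ERROR_FACTOR * p_e))
    return 1.0 - p1 * p2


def p_subcore(p_op, p_sources):
    """At least one error in a sub-core: the operator or any source register."""
    ok = 1.0 - p_op
    for p in p_sources:
        ok *= 1.0 - p
    return 1.0 - ok


def p_core_no_error(p_sc):
    """No sub-core is in error."""
    return (1.0 - p_sc) ** NUM_SUBCORES


def p_core_single_error(p_sc):
    """Exactly one sub-core is in error (correctable)."""
    return comb(NUM_SUBCORES, 1) * p_sc * (1.0 - p_sc) ** (NUM_SUBCORES - 1)


def p_core_fail(p_sc):
    """Two or more sub-cores are in error (not correctable)."""
    return 1.0 - p_core_no_error(p_sc) - p_core_single_error(p_sc)
```

For an instruction such as ADD R3, R2, R1, one would pass `p_unit(p_e, n_add)` as `p_op`, where `n_add` is a hypothetical adder transistor count, and the two `p_register` values of the source registers as `p_sources`.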
SIMULATION RESULTS

Signal Energy Limits
From an independent set of simulations of a non-error-correcting core operating on binary data, we find that the minimal signal energy required for ensuring its reliable operation is 48kT. In our previous design of an error correcting RRNS core [11] , we assumed a single voltage domain across computational logic, SRAM cells, and RIU logic. Together with a traditional multiplier (i.e., without index-sum), a pipelined check insertion strategy (with a frequency of five instructions) and a 16MB LLC, the result is that we can tolerate gate signal energies of 42-43kT, as shown in Figure 8 .
However, given the dissimilarity in error distribution across computation, SRAM and RIU (Section 4.7), we consider independent voltage domains for these. For simplicity, we conservatively set the RIU gate signal energy to be 48kT (i.e., same as that required for a non-error correcting binary core), although it can be potentially lowered as its functionality is a subset of that of a binary core. We find that the relative impact of energy savings in the RIU is rather limited (Section 6.5), and therefore restrict the RIU gate signal energy to 48kT in our evaluations.
We abstract these voltage domains as a triplet; for example, 30-43-48 denotes the gate signal energy for computational logic to be 30kT, SRAM cells to be 43kT, and RIU logic to be 48kT. For the purposes of this evaluation, we assume a target MTBF of 1E+10 s (over 300 years) and find that the gate signal energy for computational logic can be lowered all the way to 28-31kT, depending on the benchmark, as shown in Table 9.
Given these minimum signal energies, we evaluate the performance, efficiency, and reliability of various core configurations in Sections 6.2, 6.3, and 6.4, respectively. The core configurations presented are as follows:
Binary. A non-error-correcting core operating on binary data. This is the baseline and requires signal energies of at least 48kT to achieve reasonable reliability.
RNS. A non-error-correcting core operating on RNS data; in other words, an RRNS core without redundant subcores and error correction capabilities.
RRNS_pipe5. An error correcting RRNS core with a pipelined check insertion strategy (with a frequency of five instructions), which was determined to be the optimal strategy in our previous RRNS core design [11].
Index-sum_pipe5. Similar to RRNS_pipe5, except that the traditional multiplier is replaced with an index-sum multiplier.
RRNS_Adapt_1e-9. An error correcting RRNS core with an adaptive check insertion strategy (with an error probability threshold of 1e-9). We found via simulation that 1e-9 was the optimal threshold for the target MTBF and signal energy.
Index-sum_Adapt_1e-9. Similar to RRNS_Adapt_1e-9, except that it uses an index-sum multiplier.

Table 9 note: *We use these signal energies for the remainder of this article, as they render reasonable reliability.

Fig. 9. Performance of various core configurations, normalized to a non-error-correcting binary core.

Performance
Figure 9 presents the performance of the core configurations listed in Section 6.1, normalized to that of a non-error-correcting binary core.
There is an inherent performance degradation in running binary-optimized code on an (R)RNS-based core, because position-based bit manipulation techniques are expensive in (R)RNS; however, this is limited to about 20% on average. Introducing error correction may further degrade performance if naive or static check insertion strategies are used. The overhead due to error correction is amortized when the check insertion strategy is adaptive instead.
Energy
The primary concern of CREEPY core design is reducing the core energy overhead. Figure 10 shows the normalized energy consumption of the aforementioned configurations. The non-error-correcting binary core requires high gate signal energies to be reliable. Given the low-bit-width and carry-free nature of RNS arithmetic, RNS-based cores are inherently more energy efficient than their binary counterparts. When efficient error correction is introduced, further energy savings can be achieved, as the supply voltage can be turned down while still maintaining reliable functionality. We ensure that the overhead of error correction is minimal by using an adaptive check insertion strategy. Finally, using index-sum multipliers enables further energy savings, as they are more efficient than traditional multipliers (savings of over 3× for multiplication intensive benchmarks and over 2.3× on average).
Energy Delay Product
Figure 11 shows the Energy Delay Product (EDP) of these core configurations, normalized to that of a non-error-correcting binary core. With the exception of arithmetic intensive workloads, RNS cores typically have a higher EDP than binary cores. However, via efficient error correction, our RRNS cores show significantly improved EDP. Specifically, by utilizing our best optimization scheme (index-sum multiplier and adaptive check insertion), we see EDP benefits of about 2× on average or about 3× for multiplication intensive workloads.
Energy Potential of RIU Optimizations
As described in Sections 4.7 and 6.1, we conservatively choose the gate signal energy for RIU logic to be that necessary for reliable operation of a Turing complete non-error-correcting binary core, i.e., 48kT. One of the reasons for this is to side-step the issue of "checking the checker." However, if we were to deploy self-checking logic or some other optimizations in the RIU, then it may no longer be necessary to use a high voltage supply for the RIU domain. In this limits study, we evaluate three possibilities of the gate signal energy to RIU logic: Binary, 48kT; Computation, same as that of RRNS subcore computational logic; Zero, 0kT. From an Amdahl's law perspective, we find that optimizing RIU logic has limited impact on core energy, as shown in Figure 12 , thanks to our judicious RIU usage via adaptive check insertion. 
RELATED WORK
RNS and RRNS. The energy efficiency of RNS, due to its low-bit-width operations and the absence of carries across residues, has found applications in the digital signal processing (DSP) [10, 14, 60] domain. Furthermore, the representability of high bit-width integers as a tuple of residues has been leveraged by the cryptography (RSA) [4, 28, 94] community. Anderson [1] proposed an architecture and ISA for an RNS co-processor designed to run datapath operations in tandem with a general-purpose processor running binary instructions, where the primary role of the general-purpose processor is to handle control flow. The RNS co-processor uses an accumulator-based ALU and does not support caching or computational error correction (RRNS). Furthermore, it requires a conversion to binary (and vice versa) for comparison operations, which is expensive; CREEPY avoids this cost by comparing numbers directly in RRNS. A unique feature of Anderson's ISA is its ability to encode instructions targeting two ALUs simultaneously. This can easily be extended to our architecture to enable such superscalar-like capabilities if need be.
Chiang et al. [9] provide RNS algorithms for comparison and overflow detection but assume all bases to be odd and do not consider error correction. Similarly, Preethy et al. [57, 58] integrate index-sum multiplication into RNS but do not consider its impact on the properties of RRNS bases critical to CREEPY.
Ever since Watson and Hastings [25, 89, 90] introduced RRNS as an efficient means for computational error correction, there has been a significant body of research [3, 5, 8, 13, 17, 21, 22, 24, 32, 38, 39, 42, 53, 59, 61, 67, 71-73, 75, 76, 78-80, 91-93, 95 ] that strives to improve on it. These are orthogonal to CREEPY, and further such algorithmic research can be used to optimize aspects of the core itself, such as the RIU.
Computational Error Correction. Standard ECCs [43] have already been adopted into modern memory systems. These codes accommodate errors occurring in storage and communication/network traffic but are not able to protect computational logic. The naive approach to computational error correction is TMR [86], requiring over a 200% overhead in area and energy for single error correcting capability. Several techniques in the form of arithmetic codes such as AN codes [6, 18, 19, 41, 66, 88], self-checking [30, 33, 44, 48-50, 84], and self-correcting [15, 20, 26, 37, 45, 55, 62, 63, 74, 83] adders and multipliers have since been devised. Orthogonally, other proposals employ redundancy at a higher granularity, such as timing speculation (wherein error correction capability is limited to circuit timing violations) [16, 23], partial pipeline replication [2], or checkpoint-rollback-recovery such as that in IBM Power8 processors [29]. While these are more efficient than naive TMR, they come with limitations on their error model, or their area overheads are still over 100%, and/or they incur a significant performance penalty because they leverage temporal redundancy in an effort to minimize area overhead [69]. Figure 13 summarizes some of these techniques in comparison with RRNS. We refer the interested reader to Srikanth et al. [69] for a more detailed survey of some of these non-residue techniques, but the takeaway is that RRNS is generally considered superior in terms of capability and efficiency for computational error resilience.

Fig. 13. First-order comparison of area overhead and EDP of various mechanisms for computational error correction, depicting the superiority of RRNS. Computational error correction techniques use a combination of spatial and temporal redundancy techniques. While temporal redundancy allows for a low area overhead, it suffers from a significant performance penalty. Timing speculation techniques seem more efficient than RRNS; however, their error model assumes all bit errors manifest as circuit timing errors, which is not sufficient to work with ultra low energy logic devices.
Approaches that employ timing speculation [16, 23] may seem superior to RRNS at first glance. However, the error model that can be supported by an RRNS error correcting microarchitecture is orthogonal to theirs, if not broader. For example, Razor [16] uses conventional transistors; therefore, lowering $V_{dd}$ lowers MOSFET switching speed, resulting in a frequency drop, which could cause setup time violations that are handled via a delayed latch mechanism. Razor assumes that any error manifests itself as a timing error. Similarly, DeCoR [23] uses a delayed commit approach (with rollback support) to handle violations in timing margins. However, with emerging devices (Section 1), $V_{dd}$ can be lowered to a few tens of millivolts without frequency loss, meaning that operating at the resultant thermal noise floor leads to stochastic, intermittent bit flips, which cannot be captured as circuit timing errors. Unlike such approaches, a CREEPY core can tolerate such errors not only in the data path but also in the control path between memory accesses.
Approaches such as DIVA [2] that replicate parts of the pipeline are capable of tolerating control path errors. Their design provides recovery by having a simple core recalculate the results of an out-of-order core. In this approach, the simple core is assumed to be error free. This is similar to a "double-modular-redundancy" approach with a rad-hard node, implying a relatively high overhead. Furthermore, if the rad-hard simple core is instead prone to error, then checkpoint and re-execute methods would need to be employed, similar to the IBM POWER7/8 processors [29]. On the other hand, a CREEPY core is able to tolerate errors in its redundant as well as nonredundant computations.
CONCLUSION
The advent of next-generation device concepts such as tunneling FETs and ferroelectric/negative-capacitance FETs enables reduction of the supply voltage to a few tens of millivolts without degradation in switching speed. However, as a result of operating close to the kT noise floor, computational logic is subject to intermittent, stochastic errors. The RRNS representation is a promising approach towards using such ultra low power devices by employing efficient computational error correction.
In this article, we design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, ISA, and RRNS-centered algorithms. We elucidate several novel optimizations and RRNS-based design considerations to demonstrate significant improvements over a non-error-correcting binary core.
