Abstract-Crossbar memories are promising memory technologies for future data storage. Although the memories offer trillion-bit data storage capacity at low cost, they are expected to suffer from high defect densities and fault rates impacting their reliability. Error correction codes (ECCs), e.g., Redundant
I. INTRODUCTION
The quest for a new memory technology that can provide further scalability, yet is able to tolerate reliability failures, has made fault tolerance one of the key requirements [1]-[6]. Crossbar memory is one of the emerging memory technologies able to offer trillion-bit data storage capacity at low power consumption and reduced fabrication cost. However, these advantages do not come for free, as several challenges need to be resolved [3]. One of the challenges is that the memories are likely to suffer from high defect densities and fault rates, impacting their reliability.
In order to improve the reliability of crossbar memories, error correction codes (ECCs) such as Hamming, Low-Density Parity-Check (LDPC) and Bose-Chaudhuri-Hocquenghem (BCH) codes [7]-[11] have been proposed. According to [5], [6], defects and faults in crossbar memories tend to induce cluster errors; hence, ECCs able to correct such errors, such as RS [12] and RRNS [13]-[15], are required. Traditionally, these ECCs have been implemented in software, resulting in low performance; this makes such implementations unsuitable for scalable yet unreliable crossbar memories. This paper studies ECC design for fault-tolerant crossbar memories. The encoder and decoder of both RS and RRNS are designed and implemented. An evaluation in terms of their area overhead and decoding speed as well as error correction capability is carried out. The evaluation shows that the encoder and decoder of RS require a smaller area and operate faster than those of RRNS. Moreover, both ECCs can correct almost equivalent numbers of errors.
The rest of the paper is organized as follows. Section II gives the background of crossbar memories and error correction codes. Section III presents the theory of RS and RRNS that are used in our work. Section IV explains the design of the encoder and decoder for both ECCs. Section V analyzes and compares the area overhead, speed and error correction capability of the considered ECCs. Section VI concludes this paper.
II. BACKGROUND
This section gives the background required to understand the paper. It first explains crossbar memories and then error correction codes.
A. Crossbar Memories
Figure 1(a) shows one of the crossbar memories, referred to as CMOS/Molecular (CMOL) memory [7], [8]. CMOL memory provides a data storage capacity as large as 1 Tbit/cm$^2$, which is about three orders of magnitude denser than existing semiconductor memories. In addition to the data storage, the novelties of this hybrid memory are: (i) the memory array is stacked above the peripheral circuits (3D stacking IC instead of planar IC), and (ii) the memory array is formed by non-CMOS devices instead of CMOS devices and/or capacitors.
The memory array consists of nanowire crossbars with reconfigurable two-terminal nanodevices embedded at each crosspoint. Because non-CMOS-based devices are incapable of performing the periphery tasks (e.g., sensing, amplification, etc.), nanoscale CMOS is required to build the peripheral circuits [7], [8]. Two sets of CMOS-to-nano (CtN) interface pins connect the memory array to the peripheral circuits; see Fig. 1(b). These CtN interface pins differ in height such that the short pins connect the lower nanowires, while the tall pins connect the upper nanowires.
In order to write to and read from the memory, a sufficient voltage is applied across the targeted two-terminal nanodevices (memory cells) from the CMOS-based peripheral circuits through the CtN interface pins to the corresponding nanowires [7], [8]. For writing, the voltage must be larger than the threshold voltage of the two-terminal devices to turn them on (representing 1) and smaller to turn them off (representing 0). For reading, a smaller voltage is used. Note that the values of the voltages depend on the two-terminal nanodevices used as the memory cells [7].
B. Error Correction Codes
Error correction codes are formed by a group of codewords. As shown in Fig. 1(c), a codeword $C = \{x_1, \ldots, x_k, x_{k+1}, \ldots, x_n\}$ comprises a k-element dataword and an (n-k)-element checkword, where n and k are integers [17]. Here, the element $x_i$, $1 \le i \le n$, can be either a number of bits (for bit-oriented ECCs) or a number of symbols (for symbol-oriented ECCs); a symbol is a set of bits. The dataword represents the input data, whereas the checkword denotes the extra elements required for error detection and/or correction. Generally, the number of elements required for correction is twice that required for detection.
Depending on their type, whether bit-oriented or symbol-oriented, ECCs can be classified into two groups. Bit-oriented means that the ECCs operate on a bit-by-bit basis during encoding and decoding. Because of this bit-oriented characteristic, these ECCs are suitable for tolerating random faults. The ECCs that belong to this class include Hamming, BCH and LDPC [17]. In contrast, symbol-oriented ECCs operate on groups of bits during encoding and decoding. Due to this symbol-oriented characteristic, these ECCs are suitable for correcting cluster faults, which are the case for crossbar memories. The ECCs that fall into this group are RS and RRNS. While RS is composed of fixed-length symbols, RRNS consists of variable-length symbols. These two ECCs are described in more detail in the next section.
III. ECCs FOR CROSSBAR MEMORIES
This section explains the encoding and decoding theory of RS and RRNS ECCs.
A. RS Code Theory
An n-symbol RS codeword consists of a k-symbol dataword and an (n-k)-symbol checkword, where n and k are integers [12], [17]. Each symbol is generated based on the Galois Field $GF(2^m)$, where m is the number of bits in each symbol. The correction capability of this code is defined as $t = \frac{n-k}{2}$. For example, two symbols are appended as the checkword to correct a single erroneous symbol.
To encode RS code, input data X is multiplied with a primitive polynomial. This primitive polynomial is selected in such a way that it cannot be factorized into smaller polynomials, which ensures the encoding and decoding consistency (the unique relationship between input/output data and RS codeword). The resulting product is then appended to X to produce an RS codeword C.
To decode RS code, the read codeword is validated by checking syndrome Si; it can be expressed as [12] :
where $C_j$ is the codeword symbol, $\alpha^v$ are the roots of the primitive polynomial, and $1 \le v \le 2t$. If $S_i = 0$, then the RS codeword is error-free and is read out. In contrast, if $S_i \ne 0$, then the read codeword has errors and requires correction. Several algorithms can be used to correct errors in an RS codeword, e.g., Peterson-Gorenstein-Zierler (PGZ), Berlekamp-Massey, etc. [13], [17]. PGZ provides a lower computational complexity for small t values than other algorithms, e.g., Berlekamp-Massey, which are preferable for large t values [12]. In this work, PGZ is used because the decoder targets a small t (see Section IV).
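As an illustration of the syndrome check and of single-symbol correction, the following Python sketch works over $GF(2^4)$ with the primitive polynomial $x^4 + x^3 + 1$ used later in Section IV. It assumes the standard textbook syndrome $S_v = \sum_j C_j (\alpha^v)^j$ and builds a valid codeword as a multiple of the generator polynomial $(x + \alpha)(x + \alpha^2)$; both are assumptions made for the example and are not taken from the implementation described in this paper. All function and variable names are hypothetical.

```python
PRIM_POLY, M = 0b11001, 4          # x^4 + x^3 + 1 over GF(2)
N = (1 << M) - 1                   # 15 non-zero field elements

# Exponent (antilog) and logarithm tables for GF(2^4).
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
value = 1
for power in range(N):
    EXP[power] = EXP[power + N] = value
    LOG[value] = power
    value <<= 1                    # multiply by alpha
    if value & (1 << M):           # reduce modulo the primitive polynomial
        value ^= PRIM_POLY

def gf_mul(a, b):
    """Multiply two GF(2^4) elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def poly_mul(p, q):
    """Multiply polynomials with GF(2^4) coefficients (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out

def syndromes(codeword, t=1):
    """Return [S_1, ..., S_2t] with S_v = sum_j C_j * (alpha^v)^j."""
    result = []
    for v in range(1, 2 * t + 1):
        s = 0
        for j, c in enumerate(codeword):
            s ^= gf_mul(c, EXP[(v * j) % N])
        result.append(s)
    return result

# Assumed generator polynomial with roots alpha^1 and alpha^2 (t = 1).
generator = poly_mul([EXP[1], 1], [EXP[2], 1])
data = [0x3, 0xA]                      # two arbitrary 4-bit data symbols
codeword = poly_mul(data, generator)   # any multiple of the generator is valid

corrupted = list(codeword)
corrupted[2] ^= 0b0110                 # cluster error confined to one symbol

# Single-error case (t = 1): location from S_2 / S_1, value from S_1^2 / S_2.
s1, s2 = syndromes(corrupted)
position = (LOG[s2] - LOG[s1]) % N
magnitude = EXP[(2 * LOG[s1] - LOG[s2]) % N]
repaired = list(corrupted)
repaired[position] ^= magnitude

print(syndromes(codeword))     # all zero for an error-free codeword
print(position, repaired == codeword)
```

For t = 1 the locator and the error value follow directly from the two syndromes, which is consistent with the low complexity noted above for small t.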
B. RRNS Code Theory
RRNS code has a similar structure and the same error correction capability as RS code; in RRNS theory, however, a symbol is usually referred to as a residue [13], [18]. The residues in an RRNS code might have different bit lengths b, depending on the moduli used, i.e., $b = \lfloor \log_2(\text{modulus} - 1) + 1 \rfloor$ bits.
To encode RRNS code, input data X is divided by a set of moduli $m_i$, where $1 \le i \le n$; n is the number of symbols of the codeword. The remainders of the division of X by the moduli form the dataword and checkword. In contrast to RS, which relies on the Galois Field for encoding and decoding consistency, RRNS depends on three different rules. Briefly, these three rules are: (i) the moduli must be mutually co-prime, (ii) each modulus must be larger than the preceding one, and (iii) their product must be larger than the operating legitimate range of $2^d - 1$, where d is the input data length [14], [18].
To decode RRNS code, steps similar to those for RS are performed. The read codeword is first validated by checking its syndromes. Two algorithms can be utilized for decoding: Mixed-Radix Conversion (MRC) or the Chinese Remainder Theorem (CRT) [13], [18]. MRC is used in this work because it requires a simple design and is easier to optimize. The calculated codeword, referred to as the syndrome, can be expressed as follows [13], [18]:
where $g_{(i-u)i}$ is the multiplicative inverse of $m_{(i-u)}$ with respect to $m_i$, defined as $|m_{(i-u)} \, g_{(i-u)i}|_{m_i} = 1$; $2 \le i \le n$ and $1 \le u \le n-1$.
After reading, if $\delta_{i_{max}} = 0$, where $i_{max}$ refers to the largest syndrome in each iteration (more detail in Section IV), then the read codeword is error-free and is converted into binary prior to read out; otherwise a correction takes place. As in RS, the correction procedure for RRNS is quite complex; see [13], [18] for further theory and explanation.
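The encoding step and the three consistency rules can be sketched in Python as follows. The sketch uses the four-moduli set described in Section IV for d = 16 and assumes that rule (iii) applies to the product of the non-redundant (dataword) moduli; the function names `check_moduli` and `rrns_encode` are hypothetical.

```python
from math import gcd, prod

def check_moduli(moduli, k, d):
    """Check the three RRNS consistency rules.

    k is the number of non-redundant (dataword) moduli and d the data
    length in bits. Applying rule (iii) to the first k moduli only is an
    assumption of this sketch.
    """
    coprime = all(gcd(a, b) == 1
                  for i, a in enumerate(moduli) for b in moduli[i + 1:])
    ascending = all(a < b for a, b in zip(moduli, moduli[1:]))
    covers_range = prod(moduli[:k]) > 2 ** d - 1
    return coprime and ascending and covers_range

def rrns_encode(x, moduli):
    """Return the residues of x; the first k form the dataword, the rest the checkword."""
    return [x % m for m in moduli]

d = 16
half = d // 2
moduli = [2 ** half, 2 ** half + 1, 2 ** (half + 1) - 1, 2 ** (half + 1) + 1]

print(moduli, check_moduli(moduli, k=2, d=d))
print(rrns_encode(0xBEEF, moduli))
```

A set such as {256, 258, 511, 513} fails rule (i), because 256 and 258 share the factor 2.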
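A generic mixed-radix conversion can be sketched in Python as below. The sketch assumes that the mixed-radix digits belonging to the redundant moduli act as the validity check, i.e., they are all zero when the decoded value lies inside the legitimate range; this is a textbook formulation and may differ in detail from the syndrome expression of [13], [18]. The moduli are those of Section IV for d = 16, and the function names are hypothetical.

```python
def mixed_radix_digits(residues, moduli):
    """Return digits a_1..a_n with X = a_1 + a_2*m_1 + a_3*m_1*m_2 + ..."""
    digits = []
    for i, (r, m) in enumerate(zip(residues, moduli)):
        value = r
        for j in range(i):
            inverse = pow(moduli[j], -1, m)   # inverse of m_j modulo m_i
            value = ((value - digits[j]) * inverse) % m
        digits.append(value)
    return digits

def from_mixed_radix(digits, moduli):
    """Convert mixed-radix digits back to a binary (integer) value."""
    x, weight = 0, 1
    for a, m in zip(digits, moduli):
        x += a * weight
        weight *= m
    return x

moduli = [256, 257, 511, 513]
residues = [0xBEEF % m for m in moduli]

clean = mixed_radix_digits(residues, moduli)
print(clean, hex(from_mixed_radix(clean, moduli)))

corrupted = list(residues)
corrupted[1] ^= 0b101                         # error inside a single residue
faulty = mixed_radix_digits(corrupted, moduli)
print(any(faulty[2:]))                        # non-zero redundant digits flag an error
```

The nested loop makes the sequential nature of MRC visible: digit i depends on all earlier digits, which is why the decoder in Section IV shares common intermediate results.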
IV. ENCODER AND DECODER DESIGN
This section explains the design of the encoder and decoder of RS and RRNS ECCs.
A. RS Encoder and Decoder Design
The RS encoder is designed in such a way that, e.g., 16-bit input data D is encoded into two dataword symbols D1 and D2, each consisting of 8 bits. For this purpose, $GF(2^8)$ is chosen, meaning that each symbol comprises 8 bits. Moreover, the decoder is set to correct one erroneous symbol (t = 1), i.e., a cluster error of at most 8 bits. Therefore, the RS codeword needs (n - k) = 2t = 2 symbols as the checkword.
However, because $GF(2^8)$ may result in a complex conversion from binary to GF elements and vice versa, $GF(2^4)$ is used instead; this results in a simple design without impacting the error correction capability [16]. This means that each 8-bit data symbol is further divided into two sub-groups, each composed of 4 bits. D1 becomes two sub-datawords D11 and D12, while D2 turns into another two sub-datawords D21 and D22. The same applies to the checkword; R1 becomes two sub-checkwords R11 and R12, while R2 turns into another two sub-checkwords R21 and R22.
The primitive polynomial used for encoding is $GF(2^4) = x^4 + x^3 + 1$. This $GF(2^4)$ consists of successive powers of the polynomial root $\alpha$, i.e., $\alpha^0, \alpha^1, \alpha^2, \ldots, \alpha^{14}$ [17]. Each $\alpha^v$ has a binary representation that can be pre-calculated using the polynomial generator.
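The pre-calculation of the binary representations can be sketched in Python as follows, assuming the usual shift-and-reduce construction in which multiplying by $\alpha$ is a left shift followed by a reduction modulo $x^4 + x^3 + 1$. The names `PRIM_POLY` and `gf_powers` are hypothetical.

```python
PRIM_POLY = 0b11001   # x^4 + x^3 + 1
M = 4                 # bits per symbol

def gf_powers(prim_poly=PRIM_POLY, m=M):
    """Return [alpha^0, alpha^1, ..., alpha^(2^m - 2)] as m-bit integers."""
    powers = []
    value = 1                      # alpha^0
    for _ in range((1 << m) - 1):
        powers.append(value)
        value <<= 1                # multiply by alpha
        if value & (1 << m):       # degree reached m: reduce modulo prim_poly
            value ^= prim_poly
    return powers

ALPHA = gf_powers()
for v, element in enumerate(ALPHA):
    print(f"alpha^{v:<2} = {element:04b}")
```

The 15 powers are all distinct and cover every non-zero 4-bit pattern, so the table can be stored directly in a look-up table.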
The following algorithms are used to design the encoder of RS [19]:
1) Generate $GF(2^4)$ elements.
2) Split input data D into 4-bit sub-groups.
1) Split the read C into two sub-codewords C1 and C2. [16].
Each sub-decoder comprises two syndrome units, Syndrome1 and Syndrome2, an error locator unit Locator, and a corrector unit Corrector. The syndrome units, which validate the read codeword, are formed by an array of XOR gates. Their outputs become the inputs to Locator, which is built from a look-up table (LUT) storing the pre-calculated GF roots $\alpha^v$. The outputs of Locator and the outputs of the Syndrome1 unit then become the inputs to Corrector, which is built from XOR gates and a multiplexer. Finally, the outputs of the sub-decoders are concatenated to create the output data.
Fig. 3. Block diagram of RRNS (a) encoder (b) decoder
B. RRNS Encoder and Decoder Design
The RRNS encoder and decoder are designed based on four moduli $m_i = \{2^{d/2}, 2^{d/2}+1, 2^{d/2+1}-1, 2^{d/2+1}+1\}$, where d is the input data length. The moduli set comprises low-cost moduli, which realize small and fast RNS-based arithmetic circuits [13]. Such moduli are selected because the resulting residues have a length fairly similar to that of RS symbols. E.g., for d = 16 bits, the dataword length is $b = \lfloor \log_2(m_1 - 1) + 1 \rfloor + \lfloor \log_2(m_2 - 1) + 1 \rfloor = \lfloor \log_2(256 - 1) + 1 \rfloor + \lfloor \log_2(257 - 1) + 1 \rfloor = 17$ bits. Two residues are set as the checkword to provide single-residue correction.
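The moduli set and the resulting residue lengths can be reproduced with the short Python sketch below. It assumes an even data length d and uses the integer identity that `(m - 1).bit_length()` equals $\lfloor \log_2(m - 1) + 1 \rfloor$ for m > 1, which avoids floating-point logarithms; the function names are hypothetical.

```python
def moduli_set(d):
    """Four low-cost moduli for a d-bit input (d assumed even)."""
    half = d // 2
    return [2 ** half, 2 ** half + 1, 2 ** (half + 1) - 1, 2 ** (half + 1) + 1]

def residue_bits(m):
    """Bit length of a residue modulo m, i.e. floor(log2(m - 1) + 1)."""
    return (m - 1).bit_length()

for d in (16, 32, 64):
    bits = [residue_bits(m) for m in moduli_set(d)]
    print(d, bits, "dataword:", sum(bits[:2]), "codeword:", sum(bits))
```

For d = 16 the two non-redundant residues take 8 + 9 = 17 bits, matching the dataword length derived above.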
The following algorithms are used to design the encoder of RRNS (see Fig. 3(a)), where two sub-encoders operate in parallel [16]. The first sub-encoder consists of two modulo units, based on $2^{d/2}$ and $2^{d/2}+1$, respectively. The second sub-encoder also comprises two modulo units, based on $2^{d/2+1}-1$ and $2^{d/2+1}+1$, respectively. The modulo units may consist of either a simple buffer or more complex circuits (e.g., adders and multiplexers), depending on the modulus they operate on. E.g., the first modulo unit is formed by a $d/2$-bit buffer. The third modulo unit is built from adders and multiplexers. However, the second and fourth modulo units require additional subtracters besides adders and multiplexers. The modulo units run in parallel, producing the corresponding residues, which in turn are concatenated to produce a b-bit (n-symbol) RRNS codeword C.
3) The other two iterations are calculated in a similar way to the above, but with their corresponding residues, moduli and multiplicative inverses.
Because the RRNS decoder is based on MRC, which operates sequentially, a modification has been made to improve the speed [16]. Instead of checking the residues one by one, some of them are checked in parallel. For example, $\delta_1$, $\delta_2$ and $\delta_2 \times m_1$ are calculated twice, i.e., in both the first and second steps. Thus, by calculating these common syndromes once and sharing them with all calculations that require them, faster decoding can be obtained. However, it is worth noting that this might incur extra circuitry (e.g., multiplexers) and additional routing, hence a larger area overhead.
Figure 3(b) illustrates the RRNS decoder, which comprises four syndrome units, three converter units and a multiplexer. All syndrome units operate concurrently to validate the read codeword. They generally comprise a subtracter, adder, multiplier and multiplexer. As mentioned before, common syndromes are shared by the units. These are realized by the feedback connections from, e.g., Syndrome1 to Syndrome2, etc. The outputs of the syndrome units become the inputs to the converter units, which produce binary data D1, D2 and D3. Subsequently, the binary data are compared to the operating legitimate range, which can be hardwired. Finally, the multiplexer selects the valid output data.
Fig. 4. (a) Encoder and decoder area overhead (b) encoder and decoder time delay
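The idea of producing several candidate values and selecting the one inside the legitimate range can be illustrated with the Python sketch below. It is not the three-converter MRC architecture of Fig. 3(b): as a simplification, it drops one residue at a time, reconstructs a candidate from the remaining three with the Chinese Remainder Theorem, and accepts the first candidate below $2^d$. The function names are hypothetical.

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct the integer in [0, prod(moduli)) that matches all residues."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)
    return x % total

def correct_single_residue(residues, moduli, d):
    """Recover the data from a codeword with at most one erroneous residue."""
    legitimate = 2 ** d                      # valid data lies in [0, 2^d - 1]
    for skip in range(len(moduli)):
        kept_r = [r for i, r in enumerate(residues) if i != skip]
        kept_m = [m for i, m in enumerate(moduli) if i != skip]
        candidate = crt(kept_r, kept_m)
        if candidate < legitimate:           # range check acts as the selector
            return candidate
    return None                              # beyond the correction capability

moduli = [256, 257, 511, 513]
codeword = [0xBEEF % m for m in moduli]

corrupted = list(codeword)
corrupted[2] ^= 0b11111                      # 5-bit cluster error in the third residue

print(hex(correct_single_residue(corrupted, moduli, d=16)))
```

Only the candidate that excludes the corrupted residue falls inside the legitimate range, which mirrors the role of the range comparison and the multiplexer described above.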
V. EXPERIMENTAL RESULTS AND DISCUSSION
This section presents the experimental results and discussion. First, it presents the results of the hardware implementation of the encoder and decoder for both ECCs, the analytical evaluation of the memory cell array overhead and the analytical evaluation of the error correction capability. Finally, it discusses the experimental results.
A. Encoder/Decoder Area and Speed
To analyze the implementation cost, the encoder and decoder of both ECCs were designed using Xilinx Design Suite and synthesized using Synopsys Design Compiler based on 90 nm CMOS. Figure 4(a) illustrates the area overhead of the encoder and decoder for both ECCs. It shows that the area overhead for RS is smaller than that of RRNS irrespective of the data length. For example, the encoder and decoder of RS occupy a 5× smaller area than those of RRNS for 16-bit data. As the data length increases, the area overhead for RS grows slightly, while that for RRNS escalates. Figure 4(b) depicts the speed of the encoder and decoder for both ECCs, represented by the critical path time delay. It shows that RS operates faster than RRNS irrespective of the data length. For example, the encoder and decoder of RS are 3× faster than those of RRNS for 16-bit data and 804× for 64-bit data. As the data length increases, the RS time delay stays nearly constant, whereas that of RRNS increases fairly linearly (becomes slower).
B. Memory Cell Array Area Overhead
The area overhead of the memory cell array depends on the bit length of the codeword. Thus, the overhead of the memory cell array when using both ECCs can be estimated analytically. Note that no real hardware synthesis can be carried out for non-CMOS devices because no design tool is available for such devices yet. Figure 5(a) depicts the codeword length required to correct clustered faults in RS and RRNS codewords stored in crossbar memories for different lengths of input data. Although both the RS and RRNS codeword lengths increase as the size of the input data grows, the difference in the number of bits required for the codeword widens considerably. For example, for 64-bit input data, RRNS requires about 1.7× more bits than RS. Translating these numbers into memory cell array area means that RS requires a smaller area than RRNS for a fixed input data capacity. The difference becomes greater as the input data length increases. For example, it is about 1.4× greater for 64-bit input data encoded into both ECCs than for 16-bit input data.
C. Error Correction Capability
In terms of correcting cluster errors, both ECCs have quite similar capability. Theoretically, RRNS scores slightly better than RS when the cluster errors exceed the size of the RS symbols. For example, consider a fault that induces 34-bit cluster errors at the third symbol of both ECCs, as shown in Fig. 5(b). In this scenario, RS cannot correct them because the errors impact two symbols, which is beyond its single-symbol correction capability. However, for RRNS the errors corrupt only the third symbol, which can still be corrected.
D. Discussion
With respect to hardware implementation and the associated cost, RS performs better than RRNS for the following reasons:
• RS symbols are based on Galois Field elements, for which all symbols have equal bit length. However, RRNS symbols are based on the residues generated from mutually co-prime moduli, each of which might have a different bit length. Moreover, the redundant residues (checkword) must be bigger than the non-redundant residues (dataword). Hence, the total bit length of the RRNS codeword is larger than that of RS. Clearly, a larger bit length implies a bigger area and longer execution time of the encoder and decoder as well as a greater memory cell array area.
• The RS encoder and decoder comprise simple XOR gates and a LUT-based error corrector, whereas those of RRNS consist of adders, subtracters and multipliers besides ROM-based moduli and moduli inverses. Obviously, XOR gates are smaller and faster than adders, subtracters and multipliers.
• The RS decoder requires only two syndromes to validate the read codeword. On the other hand, the RRNS decoder needs to compute three syndromes for the same purpose. Even though the MRC-based RRNS decoding has been parallelized, its latency is still worse than that of RS. Moreover, the parallel execution incurs a bigger area overhead.
• Above all, ECCs in essence depend on consistency rules to have a unique relationship between data and codeword. RS requires a single consistency rule, i.e., the Galois Field, whereas RRNS needs three consistency rules (see Section III-B). Intuitively, fewer rules lead to a simpler algorithm and implementation.
VI. CONCLUSION
This paper has presented a case study of two symbol-oriented ECC designs for fault-tolerant crossbar memories. The encoder and decoder of two ECCs, Reed-Solomon and Redundant Residue Number System, have been implemented and experimentally compared. The results show that RS requires a smaller area overhead and operates faster than RRNS. In terms of correcting cluster errors, both ECCs possess quite similar capability. It can be concluded that RS offers better performance at lower cost than RRNS because the former can be implemented mainly using simple logic gates, whereas the latter needs more complex logic circuitry such as adders, multipliers and multiplexers. Moreover, RS relies on one encoding and decoding consistency rule, whereas RRNS depends on three consistency rules.
