ABSTRACT There is increasing demand for fully-flexible Bose-Chaudhuri-Hocquenghem (BCH) decoders that support Galois field (GF) operations of different dimensions. Because previous BCH decoders mainly target fixed GF operations, the conventional techniques are not suitable for multiple GF dimensions. To realize area-optimized flexible BCH decoders, this paper presents several optimization schemes that reduce the hardware cost of multi-dimensional GF operations. We first reformulate the matrix operations in syndrome calculation and Chien search so that more common sub-expressions are shared between GF operations of different dimensions. A cell-based multi-m GF multiplier is then introduced for the area-efficient flexible key-equation solver. As case studies, we design several prototype flexible BCH decoders for digital video broadcasting systems and for NAND flash memory controllers managing different page sizes. The implementation results show that the proposed fully-flexible BCH decoder architecture remarkably enhances the area efficiency compared with the conventional solutions.
I. INTRODUCTION
In digital communication and storage systems, error correction codes (ECCs) are essential to improve data reliability. Owing to their strong and guaranteed error-correcting performance at reasonable hardware cost, Bose-Chaudhuri-Hocquenghem (BCH) codes have been continuously adopted in practical applications [1]-[10]. For a message frame of k bits, an (n, k, t) BCH code is defined to recover up to t random bit errors in the transferred codeword of n bits. To find the t error locations, three decoding steps are generally used as illustrated in Fig. 1, i.e., the syndrome calculation (SC) block, the key-equation solver (KES) stage, and the Chien search (CS) step [11]. In the last decade, high-throughput BCH decoders with massive parallelism have been developed and optimized in numerous works for high-speed applications [12]-[17]. Although the previous optimization techniques are quite effective in relaxing the hardware complexity, they apply only to a single BCH code, where all the internal arithmetic operations are based on a fixed Galois field (GF) dimension. Recently, the specifications of error-correction capability and data length have become increasingly diversified, and BCH decoders must be flexible to cope with the growing requirements of reconfigurability. Targeting multi-standard solutions, only a few studies have reported sharing the processing elements in multi-t BCH decoders, which can change their correcting conditions only within the same GF dimension m [18]-[20]. For the sake of simplicity, we denote this multi-t fixed-m BCH decoder as the partially-flexible BCH decoder.
In order to provide more flexibility in error-correction capability, it is more common to change the dimension of GF, leading to the fully-flexible BCH decoder. For example, as NAND flash memory uses a variety of page sizes such as 512, 1024, 2048, and 4096 bytes, the BCH decoders in the controller system are required to operate in GF(2^13), GF(2^14), GF(2^15), and even GF(2^16) for each page size [6]. In the case of digital video broadcasting (DVB) systems, in addition, recent specifications require two kinds of BCH decoders to support different sizes of frame structures [7]-[9]. In the conventional fully-flexible BCH decoder architecture, which can change both t and m, multiple BCH decoders targeting different GF dimensions are designed separately and integrated into the same system, drastically increasing the hardware complexity. Resource-sharing schemes may relax the hardware costs by reusing the registers and multiplexors of the higher-m BCH decoder [21], [22]. However, the area saving of this technique is inherently limited because it does not consider the GF operators, which dominate the overall hardware complexity.
FIGURE 1. The conventional BCH decoder having 3 processing steps [11].
For the area-efficient fully-flexible BCH decoder architecture, this paper presents several optimization techniques for supporting multiple GF dimensions. More precisely, we reformulate the parallel SC and CS architectures having different field dimensions into a single matrix operator to find the maximum number of common sub-expressions (CSEs). Based on the iterative search [25], the hardware resources for different GF dimensions can be shared so that the complexities of the parallel SC and CS blocks are minimized. For general GF multiplications, in addition, we define a new cell-based multiplier architecture that accepts arbitrary field dimensions. Based on the proposed universal GF multiplier, the area-time complexity of the folded-KES block is remarkably reduced compared to the previous register-sharing approach. To validate the proposed optimization schemes, we also designed prototype decoders in a 65nm CMOS process, targeting DVB systems and NAND flash memory controllers. Implementation results show that the proposed work successfully realizes the area-optimized fully-flexible BCH decoder, which enhances the area efficiency by more than two times compared to the previous approaches.
The rest of this paper is organized as follows. Section II describes the methods and drawbacks of the previous architectures supporting multiple GF dimensions. Section III describes the proposed methods to maximize the effects of CSE sharing for the parallel SC and CS blocks. Section IV shows the architecture of the proposed multi-m GF multiplier for the folded-KES block. Implementation results are presented and analyzed in Section V, and the conclusions are finally drawn in Section VI.
II. THE PREVIOUS FULLY-FLEXIBLE BCH DECODERS
In order to support different BCH decoding operations for multiple GF dimensions, it is common to design a dedicated BCH decoder for each GF dimension, which requires a huge amount of hardware resources. Only a few works have reported area-efficient solutions for flexible decoder architectures [21], [22]. Targeting j different GF dimensions, the previous works are generally based on sharing the pipelined registers, as depicted in Fig. 2. More precisely, the previous fully-flexible BCH decoder is first designed to support the highest GF dimension, i.e., m_1 in the figure, and then the decoders for lower dimensions are added by sharing the internal pipelined registers, eliminating additional storage elements [21], [22]. This register-sharing method is effective in reducing the amount of sequential logic, as the decoders for lower dimensions generally need fewer registers.
FIGURE 2. The previous fully-flexible BCH decoder architecture with register-sharing method [21], [22].
For the combinational logic, as shown in Fig. 2, the GF operators, i.e., constant and general GF multiplications, cannot be shared and should be utilized individually. To overcome this limitation of register sharing, a previous study tried to share the GF computing logic between different GF dimensions [21]. As shown in Fig. 3(a), the shift-and-add parts of the general multiplier are shared for various GF dimensions, and the following modulo parts are individually designed for each dimension. Note that the final result is selected from the outputs of different field dimensions by using the j-to-1 multiplexor. As the complexity of the modulo process dominates the overall multiplier, unfortunately, the effects of sharing are quite limited, still necessitating a huge amount of hardware resources. Moreover, this method cannot be used for reducing the hardware complexity of constant GF multipliers, which require numerous XOR operations [13]. As shown in Fig. 3(b), therefore, the previous constant multiplier covering multiple GF dimensions utilizes a dedicated GF multiplier for each dimension [21].
Considering the complexity breakdown of each building block, as depicted in Fig. 4, the impact of the previous register-only sharing method is inherently limited, as the portion of sequential parts is negligible compared to that of the GF operators. Note that the complexity shown in Fig. 4 is based on an 8-parallel (65535, 65343, 12) BCH decoder, which is designed in a 65nm CMOS process at the speed of 500 MHz. In order to provide a more realistic analysis, in addition, we apply the recent optimization techniques that reduce the complexity of the fixed-m decoder architecture [13], [16], [23]. Considering the various demands from practical applications [6]-[10], [24], therefore, it is urgently required to develop new area-efficient design methods for fully-flexible BCH decoder architectures. For the sake of simplicity, in Sections III and IV, we first describe new optimization schemes for area-efficient dual-m BCH decoders. Note that the proposed methods are easily extended to support more GF dimensions, providing the fully-flexible solution.
FIGURE 3. (a) The general multiplier having limited shared units and (b) the constant GF multiplier with no sharing parts [21], [22].
III. THE PROPOSED SC AND CS ARCHITECTURES FOR MULTIPLE GF DIMENSIONS
It is widely known that constant GF multiplications are actively used in the processing of SC and CS stages [11] . Similar to the reformulated matrix operations for the fixed-m SC and CS blocks in [13] and [16] , in this section, we propose construction steps for a single matrix operation supporting different GF dimensions at the same time. As the common sub-expressions (CSEs) of the given matrix can be shared during the realization step [25] , the proposed single-matrix form naturally relaxes the hardware overheads by maximizing the search area of CSEs among the different GF dimensions.
In order to construct the single matrix form for the dual-m parallel SC and CS blocks, it is first necessary to find an efficient way to construct matrix forms that include different GF dimensions. For a given GF dimension m, y = xα^i is expressed as
$$y = x \cdot A_i(m), \tag{1}$$
where y and x represent elements in GF(2^m), written as 1 × m binary row vectors, and the m × m matrix A_i(m) stands for the corresponding constant multiplication over GF(2^m). Considering the previous register-sharing method for the dual-m BCH decoder [21], [22], the input operand x in (1) can belong to either GF(2^{m_1}) or GF(2^{m_2}). By assuming m_1 > m_2, we set the unused bits for the smaller GF dimension to zero, i.e., x_i = 0 for m_2 ≤ i < m_1. Hence, two constant multiplications can be applied to the same input x as follows.
$$\begin{bmatrix} y(m_1) & y(m_2) \end{bmatrix} = x \cdot \begin{bmatrix} A_i(m_1) & A_i(m_2) \end{bmatrix}, \tag{2}$$
where A_i(m_2) is zero-extended to m_1 × m_1. This assumption is reasonable, as we also apply the register-sharing architecture in Fig. 2 to our fully-flexible decoder, which prepares all the registers based on the largest GF dimension, m_1. In other words, all the data in the lower dimensions are zero-extended in our architecture. As shown in (2), the two multiplication matrices can be merged into a single binary matrix of size m_1 × (m_1 + m_1).
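The zero-extended merge described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the primitive polynomials (x^8 + x^4 + x^3 + x^2 + 1 and x^10 + x^3 + 1), the row-vector convention, and all function names are choices made here and are not taken from the paper.

```python
# Assumed primitive polynomials (hypothetical choice); bit k = coefficient of x^k.
POLY = {8: 0b100011101, 10: 0b10000001001}

def gf_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m) using a polynomial basis."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def alpha_pow(i, m, poly):
    """Return alpha^i, where alpha is the element x (integer 2)."""
    r = 1
    for _ in range(i):
        r = gf_mul(r, 2, m, poly)
    return r

def const_mult_matrix(i, m, poly):
    """Rows of A_i(m): row k holds x^k * alpha^i, so that y = x . A_i(m)."""
    c = alpha_pow(i, m, poly)
    return [gf_mul(1 << k, c, m, poly) for k in range(m)]

def merged_matrix(i, m1, m2):
    """Rows of [A_i(m1) | zero-extended A_i(m2)], packed into integers.

    The low m1 bits of each row belong to A_i(m1); the next m1 bits belong
    to A_i(m2), whose missing rows and columns are zero.
    """
    a1 = const_mult_matrix(i, m1, POLY[m1])
    a2 = const_mult_matrix(i, m2, POLY[m2]) + [0] * (m1 - m2)
    return [a1[k] | (a2[k] << m1) for k in range(m1)]

def apply_rows(x, rows):
    """Row vector times binary matrix over GF(2): XOR the rows selected by x."""
    y = 0
    for k, row in enumerate(rows):
        if (x >> k) & 1:
            y ^= row
    return y

M = merged_matrix(15, 10, 8)
y = apply_rows(0b10110011, M)       # one zero-extended 8-bit operand
y_m1, y_m2 = y & 0x3FF, y >> 10     # products over GF(2^10) and GF(2^8)
```

A single pass over the merged matrix yields both products, which is what allows one CSE search to cover both field dimensions.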
As the merged matrix allows CSEs to be found between A_i(m_1) and A_i(m_2), the hardware complexity for realizing (2) is relaxed significantly compared to the individual realizations of two different constant GF multiplications. Fig. 5 exemplifies how the proposed optimization method reduces the number of XOR operations in the flexible constant GF multiplier. In the previous flexible multiplier performing y = xα^15 over both GF(2^8) and GF(2^10), as depicted in Fig. 5(a), the two multiplications are individually realized by using the matrix operations denoted as A_15(8) and A_15(10) in (3).
FIGURE 5. Constant multiplication y = xα^15 over GF(2^10) and GF(2^8): (a) the previous flexible multiplier, (b) the matrix form of each multiplier [16], (c) the proposed single matrix form.
Note that the input vector x in Fig. 5(b) consists of 10 bits to target the larger dimension GF(2^10), but only 8 bits are involved in the multiplications over GF(2^8). Based on (3), 23 and 14 XOR gates are used for the straightforward realization of A_15(10) and A_15(8), respectively. The iterative searching algorithm in [16] shares as many CSEs of each matrix operator as possible, leading to the following results.
A 15 (10) CSE : 
Note that the optimized matrices in (4), denoted as A 15 (10) 
As described in (2), the merged matrix is zero-extended so that the portion corresponding to A_15(8) becomes equal in size to A_15(10). By applying the iterative searching algorithm, similar to (4), we can share all the CSEs to generate the optimized matrix in (6).
Note that only 24 XOR gates are enough to implement the single matrix operation in (6) , relaxing the complexity by 17% compared to the previous structure. As the matrix becomes dense for supporting the larger multiplicand, moreover, the proposed scheme is more effective to reduce the overall complexity with increased number of common terms. Note that this concept can be further extended to optimize the parallel dual-m SC and CS architectures, which will be described in the following subsections. 
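To make the benefit of a larger CSE search area concrete, the following Python sketch counts XOR gates with a simple greedy pair-sharing heuristic. This heuristic is an assumption made for illustration; it is not claimed to be the iterative search of [16] or [25], and the toy columns are hypothetical rather than taken from Fig. 5.

```python
from collections import Counter
from itertools import combinations

def naive_xor_count(columns):
    """XOR gates needed when every output is built independently."""
    return sum(len(c) - 1 for c in columns if c)

def greedy_cse_xor_count(columns):
    """XOR gates after greedily sharing the most frequent pair of signals.

    Each column is the set of signal indices XORed into one output bit.
    A pair used by at least two outputs is replaced by one new signal,
    which costs a single XOR gate regardless of how many outputs reuse it.
    """
    cols = [set(c) for c in columns]
    next_id = 1 + max((s for c in cols for s in c), default=-1)
    shared = 0
    while True:
        pairs = Counter()
        for c in cols:
            pairs.update(combinations(sorted(c), 2))
        if not pairs:
            break
        (a, b), uses = pairs.most_common(1)[0]
        if uses < 2:
            break
        for c in cols:
            if a in c and b in c:
                c -= {a, b}
                c.add(next_id)
        next_id += 1
        shared += 1
    return shared + sum(len(c) - 1 for c in cols if c)

# Toy outputs of two hypothetical multipliers that both need x0 ^ x1.
field_a = [{0, 1, 2}]
field_b = [{0, 1, 3}]
separate = greedy_cse_xor_count(field_a) + greedy_cse_xor_count(field_b)
merged = greedy_cse_xor_count(field_a + field_b)
```

In this toy case the term x0 ^ x1 occurs only once per matrix, so neither separate search can share it, whereas the merged search finds it and saves one gate.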
A. THE PROPOSED PARALLEL DUAL-m SC ARCHITECTURE
When the n-bit BCH codeword (r_0, r_1, r_2, ..., r_{n−1}) is received, the i-th syndrome s_i (0 < i ≤ 2t) is calculated as follows [11].
$$s_i = \sum_{j=0}^{n-1} r_j\,\alpha^{ij}. \tag{7}$$
In the conventional p-parallel SC architecture, 2t different syndrome units are generally utilized and, in each processing cycle, each syndrome unit accepts p bits from the received codeword, which are denoted as z_0, z_1, ..., z_{p−1} in Fig. 6. Hence, the p-parallel SC architecture necessitates n/p processing cycles to compute 2t syndromes. To reduce the hardware complexity of the p-parallel fixed-m SC, only t odd-indexed SC units are introduced, and the even-indexed syndromes are calculated by using power operations without the original syndrome units [15]. In addition, all the power operations and constant GF multiplications can be reorganized into a single matrix to share the maximum number of CSEs [25]. Based on the previous optimizations, the proposed p-parallel i-th dual-m SC unit, which supports two GF dimensions m_1 and m_2 (m_1 > m_2), is exemplified in Fig. 7, where i is an odd number less than 2t. Note that the GF operations in the shaded region in Fig. 7 are organized into a single matrix operator to share their internal logic. The proposed syndrome unit contains one temporary register of m_1 bits, where the currently stored value and the next value for the following cycle are denoted as q_i and d_i, respectively. Based on the p-bit input (z_0, z_1, ..., z_{p−1}) and the m_1-bit stored data q_i, the proposed dual-m SC unit computes two candidates for the next value, denoted as d_i(m_1) and d_i(m_2). According to the current GF dimension, d_i is determined by selecting one of these two candidates over GF(2^{m_1}) and GF(2^{m_2}), and the stored value q_i becomes the i-th syndrome s_i after n/p processing cycles. For the smaller dimension m_2, the unused upper bits of s_i are filled with zeros so that a regular data size is used throughout the fully-flexible BCH decoder. Similar to the work in [15], the even-indexed syndromes can be calculated from the odd-indexed s_i by power operations. As depicted in Fig. 7, two matrices, denoted as P_i(m_1) and P_i(m_2), are reserved to perform the power operations for each GF dimension, and the final syndromes are generated by selecting the proper dimension between s_x(m_1) and s_x(m_2).
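The register update of one p-parallel syndrome unit can be sketched in Python for a single field dimension. The sketch assumes that the codeword is fed highest-index bit first, that n is a multiple of p, and that GF(2^4) is built from x^4 + x + 1; these choices and the function names are illustrative and are not taken from the paper.

```python
ALPHA = 2  # the primitive element x in the polynomial basis

def gf_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m)."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def gf_pow(a, e, m, poly):
    """Square-and-multiply exponentiation in GF(2^m)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a, m, poly)
        a = gf_mul(a, a, m, poly)
        e >>= 1
    return r

def syndrome_direct(r, i, m, poly):
    """s_i as the sum of r_j * alpha^(i*j) over all received bits."""
    s = 0
    for j, bit in enumerate(r):
        if bit:
            s ^= gf_pow(ALPHA, i * j, m, poly)
    return s

def syndrome_parallel(r, i, p, m, poly):
    """One syndrome unit consuming p bits per cycle (n/p cycles in total)."""
    assert len(r) % p == 0
    step = gf_pow(ALPHA, i * p, m, poly)                  # multiplies the register q_i
    taps = [gf_pow(ALPHA, i * (p - 1 - k), m, poly) for k in range(p)]
    stream = r[::-1]                                      # highest-index bit first
    q = 0
    for start in range(0, len(stream), p):
        z = stream[start:start + p]
        d = gf_mul(q, step, m, poly)                      # next value d_i
        for k in range(p):
            if z[k]:
                d ^= taps[k]
        q = d
    return q

received = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # arbitrary 15-bit word
s1 = syndrome_parallel(received, 1, 5, 4, 0b10011)
s2 = gf_mul(s1, s1, 4, 0b10011)   # even-indexed syndrome from a power operation
```

Because the code is binary, squaring s_1 gives s_2, which is why only the odd-indexed units need the full update logic.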
To construct a huge but single matrix operation for the proposed p-parallel dual-m SC unit, we first formulate the matrix operation for calculating the next value over GF(2^{m_1}), d_i(m_1), as follows.
By grouping t odd-indexed syndrome units, it is possible to generate a single matrix operator as follows.
where D_odd(m_1), Q_odd, and C_odd(m_1) are constructed by serially appending the corresponding matrices in each odd-indexed syndrome unit as follows.
On the other hand, the matrix X_D(m_1) is obtained by diagonally appending each constant GF multiplication as follows.
Note that the grouped matrices for the second GF dimension, i.e., C_odd(m_2), D_odd(m_2), and X_D(m_2), are defined in the same way. After n/p cycles, D_odd(m_1) and D_odd(m_2) become the odd-indexed syndromes for the two different field dimensions, i.e., GF(2^{m_1}) and GF(2^{m_2}), respectively. Based on the odd-indexed syndromes, it is necessary to generate the even-indexed syndromes by applying power operations. For the given i-th odd-indexed syndrome unit, the related even-indexed syndromes over GF(2^{m_1}) are obtained as follows.
where B_w(m_1) is an m_1 × m_1 binary matrix that stands for the power operation y = x^w over GF(2^{m_1}) [15]. Note that I(m_1) is an m_1 × m_1 identity matrix so that q_i becomes s_i(m_1) directly, as discussed previously. By grouping t P_i(m_1) matrices, therefore, it is possible to generate all the 2t syndromes over GF(2^{m_1}), denoted as S(m_1), as follows.
Considering two equations (10) and (16), finally, we can develop a huge single matrix equation that can calculate all the syndromes over two different GF dimensions as follows.
Fig. 8 represents the proposed dual-m SC hardware architecture supporting (17). Note that we added multiplexors to select the target GF dimension after performing a huge matrix operation for different GF dimensions. If we perform the iterative CSE searching algorithm on this matrix [25], the search area of CSEs is theoretically maximized, leading to the minimum number of XOR gates for realizing the parallel dual-m SC block.
B. THE PROPOSED PARALLEL DUAL-m CS ARCHITECTURE
Since the parallel CS block also uses numerous constant GF multipliers [11], the proposed optimization steps for the dual-m architecture can be summarized in a similar way to those of the proposed parallel SC block. In the parallel CS block, which is the last step of BCH decoding in Fig. 1, up to t error positions are determined by evaluating the error locator polynomial, which is generated by the KES block. More precisely, the degree-t error locator polynomial over GF(2^{m_1}) is characterized by t + 1 coefficients of m_1 bits each, denoted as 1 × m_1 binary matrices λ_0, λ_1, ..., λ_t, and α^i becomes a root of this polynomial if the error location is represented by i (1 ≤ i ≤ n). Fig. 9 illustrates the conventional p-parallel CS architecture, which takes n/p cycles by evaluating p consecutive positions in each processing cycle [16]. In the first processing cycle of this architecture, the received t coefficients are selected as f_j (1 ≤ j ≤ t) for checking the first p consecutive positions, and the temporary values d_j are calculated at the same time to prepare the next evaluation. In the following cycles, these d_j values are stored as q_j, which are selected as f_j for the rest of the processing cycles. To reduce the complexity of the p-parallel fixed-m CS block, all the constant GF multipliers and GF adders are grouped into a single matrix operation to share as many CSEs as possible [16].
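The conventional p-parallel evaluation loop described above can be sketched in Python for a single field dimension. The sketch assumes GF(2^4) built from x^4 + x + 1 and follows the convention of the text that position i is flagged when α^i is a root; the function names and the toy locator polynomial are hypothetical.

```python
ALPHA = 2  # the primitive element x in the polynomial basis

def gf_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m)."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r

def gf_pow(a, e, m, poly):
    """Square-and-multiply exponentiation in GF(2^m)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a, m, poly)
        a = gf_mul(a, a, m, poly)
        e >>= 1
    return r

def chien_direct(lam, n, m, poly):
    """Positions i in 1..n for which alpha^i is a root of the locator."""
    roots = []
    for i in range(1, n + 1):
        acc = 0
        for j, c in enumerate(lam):
            acc ^= gf_mul(c, gf_pow(ALPHA, i * j, m, poly), m, poly)
        if acc == 0:
            roots.append(i)
    return roots

def chien_parallel(lam, n, p, m, poly):
    """Evaluate p consecutive positions per cycle with t running registers."""
    t = len(lam) - 1
    f = list(lam[1:])                 # first cycle: f_j is the coefficient lambda_j
    roots = []
    for base in range(0, n, p):       # one iteration models one processing cycle
        for k in range(p):
            v = lam[0]
            for j in range(1, t + 1):
                v ^= gf_mul(f[j - 1], gf_pow(ALPHA, j * (k + 1), m, poly), m, poly)
            if v == 0 and base + k + 1 <= n:
                roots.append(base + k + 1)
        # temporary values d_j = f_j * alpha^(j*p), reused as f_j in the next cycle
        f = [gf_mul(f[j - 1], gf_pow(ALPHA, j * p, m, poly), m, poly)
             for j in range(1, t + 1)]
    return roots

P4 = 0b10011
a3, a10 = gf_pow(ALPHA, 3, 4, P4), gf_pow(ALPHA, 10, 4, P4)
lam = [gf_mul(a3, a10, 4, P4), a3 ^ a10, 1]   # (x + alpha^3)(x + alpha^10)
found = chien_parallel(lam, 15, 4, 4, P4)
```

All multiplications inside the loop are by fixed powers of α, which is why the whole block can be written as a constant binary matrix and handed to the CSE search.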
Based on the previous optimizations, the proposed p-parallel dual-m CS architecture, which supports two GF dimensions m_1 and m_2 (m_1 > m_2), is exemplified in Fig. 10. Based on the selected f_j inputs, the proposed architecture calculates two different temporary values, d_j(m_1) and d_j(m_2), over the different fields GF(2^{m_1}) and GF(2^{m_2}), respectively. As shown in Fig. 10, the proposed CS architecture contains t temporary registers of m_1 bits each, which means that the temporary values for the lower dimension, i.e., d_j(m_2), are zero-extended. Similar to the dual-m SC case, the proposed work generates a single matrix operator by grouping all the GF operators inside the shaded region in Fig. 10, leading to the maximum number of CSEs.
To derive the proposed optimization steps for the p-parallel dual-m CS architecture, we first group t constant GF multipliers for calculating the temporary values d_j(m_1) over GF(2^{m_1}) as follows.
By considering the lower GF dimension, all the temporary values for the dual-m CS block can be derived as follows.
Note that G_D(m_2) contains zero-extended multiplications over GF(2^{m_2}) as depicted in (2).
By grouping p v_k values, denoted as V(m_1), we can construct a single matrix operator, denoted as G_V(m_1), which includes all the GF operations for the p-parallel evaluating processes over GF(2^{m_1}) as follows.
A similar approach can be applied to the lower dimension. The only difference is that we have to consider the zero-extended multiplications as depicted in (2). As a result, all the evaluation parts supporting two different field dimensions can be formulated into a single matrix as follows.
By combining the two parts, i.e., calculating the temporary values in (19) and evaluating the error positions in (22), we finally obtain the single and huge matrix operator for the p-parallel dual-m CS architecture as follows. Fig. 11 illustrates the proposed hardware architecture based on (23). Similar to the proposed dual-m SC case, the proposed dual-m CS architecture enlarges the search area of CSEs as much as possible, leading to the area-efficient solution described in the following subsection.
C. IMPLEMENTATION RESULTS OF DUAL-m SC AND CS BLOCKS
To verify the proposed SC and CS blocks in the dual-m BCH decoder, this subsection shows the implementation results for various parameters. We compared the results of the proposed multi-m BCH decoders by using Synopsys' Design Compiler in a 65nm CMOS technology. Considering the broadcasting standards, we select two BCH codes over GF(2^16) and GF(2^14), where the number of correctable errors is equally set to 12. For fair comparisons, the area efficiency is used in this work, which is defined as follows [27]:
Fig. 12 shows the comparison between the proposed dual-m SC architecture and the previous register-sharing dual-m SC block [21]. It can be seen that the proposed structure is more effective in both area and area efficiency for most parallel factors, as the proposed work shares both sequential and combinational logic. As the parallel factor increases, the impact of the proposed optimization method is remarkably enhanced. In the case of 8-parallel architectures, for example, the proposed dual-m SC block improves the area efficiency by 1.74 times.
The area efficiencies of the parallel dual-m CS architectures are illustrated in Fig. 13. In contrast to our dual-m SC architecture, as depicted in the figure, the proposed dual-m CS block slightly degrades the area efficiency in the serial case, i.e., when the parallel factor is set to 1. This is because of the increased critical delay caused by aggressive sharing of CSEs [30]. If the parallel factor is increased, however, the effects of area saving become dominant, improving the area efficiency remarkably. Moreover, the benefit grows gradually with the parallel factor, as depicted in Fig. 13. For example, when the parallel factor is 8, the proposed dual-m CS architecture improves the area efficiency by 1.61 times by reducing the hardware complexity by 33% compared to the previous register-sharing architecture. Based on the proposed optimization method, the area efficiencies of the parallel dual-m SC and CS architectures are clearly superior to those of the previous works [21]. As the contemporary flexible BCH decoders, which will be discussed in Section V, target architectures with a parallel factor of 8 or more to increase the decoding throughput, the proposed SC and CS architectures are well suited to practical area-efficient designs.
IV. THE PROPOSED FOLDED-KES STRUCTURE FOR MULTIPLE GF DIMENSIONS
This section discusses the proposed optimization methods, which enable the area-efficient dual-m KES module. As the KES operation is normally based on the general GF multiplication rather than the constant ones, it is hard to directly apply the previous matrix-based sharing techniques. In order to share the combinational parts of general multiplier, therefore, we propose a new folding method for dual-m KES architecture.
A. THE PREVIOUS FOLDED-KES ARCHITECTURES
Fig. 14 illustrates the conventional KES architecture based on the simplified inverse-free Berlekamp-Massey (SiBM) algorithm, which takes t cycles for calculating the error locator polynomial [23]. Due to the numerous general GF multipliers, the unfolded-KES operation is generally considered the most complex and area-consuming operation among the three BCH decoding stages [11]. On the other hand, the processing latency of the KES block is relatively shorter than that of the other stages. Hence, the folding technique is widely applied to reduce the hardware complexity and to balance the pipelined process. For example, the original SiBM architecture in Fig. 14 can be folded by sharing the grey-colored processing elements (PEs), allowing more processing cycles [23]. Compared to this PE-level global folding, the recent architecture in [27] reveals that the multiplier-level local folding provides better area efficiency by reducing the internal critical delay as well as the hardware complexity. However, these techniques only target the fixed-m KES architecture and are not suitable for developing the area-efficient fully-flexible BCH decoder. Similar to the previous dual-m solution, the register-sharing method from [21] and [22] is limited by nature, and it is necessary to share the internal combinational logic to reduce the overall hardware complexity of the KES block. To support dual-m operation in a single general multiplier without using separate multipliers, as reported in [21], the modulo parts for two dimensions can be designed individually and attached at the end of the shared processing parts. However, this method is inefficient in terms of critical delay because it utilizes serially-connected logic. As the operating frequency of a whole BCH decoder is normally determined by the KES module [29], the decoding throughput can be degraded due to the increased clock period, even reducing the area efficiency of the decoder.
Based on the locally-folded multiplier architecture in [27] , in this work, we present efficient architectures of processing cells (PCs) for supporting different field dimensions while enhancing the area efficiency remarkably. 
B. THE PROPOSED DUAL-m FOLDED-KES ARCHITECTURE
As reported in [27] , the PC-based locally-folded general GF multiplier is effective in reducing both the critical delay and the hardware complexity of fixed-m SiBM-based KES architecture. In this subsection, we add the flexibility to the previous multiplier by presenting new PC types and simple control schemes.
Targeting two GF dimensions, GF(2^{m_1}) and GF(2^{m_2}), Fig. 15(a) conceptually shows the proposed dual-m locally-folded general GF multiplier. Applying the maximum folding factor, as shown in the figure, the proposed multiplier utilizes m_1 PCs (m_1 > m_2). According to the coefficients of the primitive polynomials used for constructing the Galois fields, the proposed multiplier uses three types of PCs, i.e., the non-biased PC (PC_N), the 0-biased PC (PC_0), and the 1-biased PC (PC_1), as depicted in Fig. 15, respectively. Note that three non-biased PCs are utilized at the first, second, and fourth bit positions, as the two coefficients at these positions are different from each other. For the same coefficient values, PC_1 and PC_0 are used for the zeroth and third positions, respectively, according to the shared value. Note that the last fifth bit position, where PC_1 is introduced, is dedicated only to GF(2^5), the larger dimension.
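The selection rule for the three PC types can be written as a small Python sketch. It assumes the primitive polynomials x^5 + x^2 + 1 and x^4 + x + 1, which are not stated in the text but reproduce the assignment described above; counting positions from 0 up to m_1 and the function names are also assumptions made here.

```python
def coeff(poly, k):
    """Coefficient of x^k in a polynomial packed as an integer."""
    return (poly >> k) & 1

def pc_types(poly1, m1, poly2, m2):
    """Choose a processing-cell type for every bit position 0..m1.

    Positions above m2 exist only in the larger field, so their type is
    fixed by the larger polynomial alone. Elsewhere a non-biased cell is
    needed when the two coefficients differ, and a biased cell is enough
    when they agree.
    """
    assert m1 > m2
    types = []
    for k in range(m1 + 1):
        c1 = coeff(poly1, k)
        if k <= m2 and c1 != coeff(poly2, k):
            types.append("PC_N")
        else:
            types.append("PC_1" if c1 else "PC_0")
    return types

P5 = 0b100101   # x^5 + x^2 + 1 (assumed)
P4 = 0b10011    # x^4 + x + 1 (assumed)
cells = pc_types(P5, 5, P4, 4)
```

Under these assumptions, positions 1, 2, and 4 come out non-biased, positions 0 and 5 are 1-biased, and position 3 is 0-biased, in line with the description of the example.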
In the proposed folded process, for the larger field dimension, m_1 cycles are used for multiplying two m_1-bit input operands over GF(2^{m_1}). More precisely, each bit of (b_0, b_1, ..., b_{m−1}) is serially inserted into the multiplier in order, whereas all the bits of (a_0, a_1, ..., a_{m−1}) are issued in parallel, as depicted in Fig. 15(a). When the lower field dimension is selected, the two operands are zero-extended, and only m_2 cycles are consumed for the general multiplication process. Note that the (m_1 − m_2) right-most PCs are disabled by deactivating the control signal SEL so that the upper bits are automatically disconnected from the current processing steps for GF(2^{m_2}).
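The cycle behavior described above can be mimicked with a bit-serial multiplier model in Python. This is a behavioral sketch only: it assumes that b is consumed from b_0 upward, that the fields are GF(2^5) and GF(2^4) with polynomials x^5 + x^2 + 1 and x^4 + x + 1, and that a boolean `sel_large` plays the role of the SEL signal. It does not model the internal structure of the PCs.

```python
FIELDS = {True: (5, 0b100101), False: (4, 0b10011)}  # assumed (m, polynomial)

def serial_mult(a, b, sel_large):
    """Multiply with a applied in parallel and b fed one bit per cycle.

    Returns the product and the number of cycles used, which equals the
    selected field dimension.
    """
    m, poly = FIELDS[sel_large]
    assert a < (1 << m) and b < (1 << m)   # zero-extended operands
    acc = 0
    cycles = 0
    for k in range(m):                     # one iteration models one clock cycle
        if (b >> k) & 1:
            acc ^= a
        a <<= 1                            # a becomes a * x
        if (a >> m) & 1:
            a ^= poly                      # reduce by the selected polynomial
        cycles += 1
    return acc, cycles

def reference_mult(a, b, m, poly):
    """Carry-less multiply followed by an explicit polynomial reduction."""
    prod = 0
    for k in range(m):
        if (b >> k) & 1:
            prod ^= a << k
    for d in range(2 * m - 2, m - 1, -1):
        if (prod >> d) & 1:
            prod ^= poly << (d - m)
    return prod

product_large, cycles_large = serial_mult(0b10110, 0b01101, True)
product_small, cycles_small = serial_mult(0b1011, 0b0110, False)
```

The same loop serves both fields, and only the reduction polynomial and the cycle count change, which is the behavior that the biased and non-biased cells implement in hardware.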
By replacing the general GF multipliers of the previous SiBM architecture in Fig. 14 with the proposed folded multi-m multiplier in Fig. 15(a), the KES block can successfully support two different field dimensions without using more multiplier units. In addition, the locally-folded architecture may be combined with the global folding concept, i.e., the previous PE-level folding, as discussed in [27]. This hybrid-folding technique further enhances the area efficiency of the dual-m folded-KES block, as discussed in the following subsection.
C. IMPLEMENTATION RESULTS OF DUAL-m KES ARCHITECTURES
Similar to the dual-m parallel SC and CS modules described in the previous section, we design dual-m KES architectures based on the straightforward architecture, the previous register-sharing and partial GF logic sharing structure [21], and the proposed solution. All the architectures support two BCH codes, i.e., the (65535, 65343, 12) and (16383, 16215, 12) codes, and are equally designed in a 65nm CMOS process to provide fair comparison results. In addition, we assume 8-parallel SC/CS stages for finding the best folding parameters, which optimize each KES structure in terms of area efficiency. For example, as depicted in Fig. 17, the proposed dual-m KES module achieves the maximum area efficiency when it adopts the hybrid folding method with a local folding factor of 8 and a global folding factor of 12.
Based on the synthesis results from the commercial EDA tool, Table 1 compares three different dual-m KES architectures. Note that the proposed architecture occupies the minimum silicon area while also having the shortest critical delay. This result is reasonable, as the proposed work internally cuts the delay paths inside the dual-m GF multipliers, whereas the previous architectures separately utilize general multipliers for different dimensions. As a result, the proposed dual-m KES module improves the area efficiency by more than 72% compared to the register-sharing architecture, providing the most attractive solution.
V. CASE STUDIES
This section shows case studies of fully-flexible BCH decoder prototypes targeting the broadcasting specification and storage controllers. To verify the effectiveness of the proposed methods, we implement three types of flexible decoder architectures: the straightforward architecture having multiple dedicated BCH decoders, the previous register-sharing and partial GF logic sharing architecture [21], and the flexible decoder based on the presented optimizations. Note that all the decoder architectures consume the same number of processing cycles, as no additional cycles are required for each optimization scheme. For fair comparisons, all the prototype designs are equally synthesized in a 65nm CMOS process using Synopsys' Design Compiler. In addition, the constant GF multipliers over different GF dimensions in [21] are reformulated into single binary matrix operators, which can guarantee the maximum number of common terms detected by the commercial EDA tools.
A. CASE STUDY ON THE DVB-S2 SYSTEM
In the first case, the proposed fully-flexible BCH decoder is designed to support the DVB-S2 system [7]. In this application, two BCH codes over GF(2^16) and GF(2^14) are defined for managing the normal and short frames, respectively. The system defines a 58320b-length message frame for a normal frame, whereas a short one consists of 14400 bits. It also requires various t parameters including 8, 10, and 12. Based on the standard implementation guide [31], the parallel factor of the proposed fully-flexible decoder is set to 8 to meet the required throughput. Under the specified conditions, therefore, the folding factors of the KES module are set to maintain the overall decoding throughput. Table 2 compares different BCH decoders that can fully support this DVB-S2 specification. Note that the proposed dual-m decoder has the lowest hardware complexity while achieving the highest area efficiency. In terms of the area efficiency, more precisely, the proposed solution provides 2.17 and 1.96 times better results compared to the straightforward realization and the recent register-sharing and partial GF logic sharing architecture, respectively.
B. CASE STUDY ON THE NAND FLASH MEMORY CONTROLLER
Based on the proposed optimizations for supporting multiple GF dimensions, we also design three different types of flexible BCH decoders targeting three field dimensions, which are used in recent NAND flash memory controllers [6], [32]. In these prototype decoders, we support 512B, 1KB, and 2KB of user data, and up to 120 random bit errors can be corrected. As shown in Table 3, the proposed optimization still provides the most attractive results compared to the other solutions. Note that the area efficiency of the proposed fully-flexible BCH decoder is 2.47 times better than that of the previous sharing architecture. Compared to the dual-m cases in Table 2, it is notable that the improvement due to the proposed methods is increased by 58% by supporting one more dimension. This means that the proposed fully-flexible architecture is superior to the previous options when the practical system considers multiple types of BCH codes.
VI. CONCLUSION
In this paper, we have presented several optimization techniques for realizing area-efficient fully-flexible BCH decoders, which can support different field dimensions. By sharing the internal computing parts as much as possible, the proposed methods significantly reduce the hardware complexity. The new folded-multiplier architecture also allows multi-dimensional operations without using redundant hardware resources. Various case studies show that the proposed optimizations are quite effective in enhancing the area efficiency while providing flexible ECC usage, compared to the state-of-the-art designs.
