ABSTRACT This paper proposes an optimized cyclic weight (OCW) decoding algorithm based on novel ideas. The OCW algorithm utilizes the properties of quadratic residue (QR) codes and the advantages of the syndrome-weight algorithm to facilitate fast decoding of the binary systematic (47, 24, 11) QR code. The memory size of the OCW algorithm is reduced to only 0.45% of that of the original cyclic weight algorithm. In addition, it can be carried out in simple and repeatable steps, which is of great importance for hardware implementation. We further propose a hardware architecture on an ALTERA STRATIX V FPGA that adopts parallel processing and pipeline techniques. To the best of our knowledge, it is the first hardware architecture to decode the (47, 24, 11) QR code. The synthesis results in ALTERA QUARTUS V17.0 show that the proposed hardware architecture operates at high throughput and low latency. The proposed hardware module may be a good candidate for 5G wireless communications.
codes cannot be guaranteed to have good performance in applications where short packet transmission is required, such as machine-type communication and the tactile internet. Moreover, emerging Internet of Things (IoT) technologies also use large numbers of short packets in various applications to achieve low latency, where the messages are divided into smaller parts and encoded separately. Therefore, highly efficient algebraic codes of short length have potential in such short-packet communication scenarios.
The well-known quadratic residue (QR) codes are algebraic and cyclic BCH codes with code rates greater than or equal to one-half [7]. Most of the known QR codes are among the best-known codes and are used in various communication systems due to their large minimum distances. For instance, the (24, 12, 8) extended QR code has been used in the Voyager imaging system link of NASA [8] and in high-frequency radio systems [9]. In the past decades, several algebraic decoding algorithms have been developed to improve decoding performance and reduce decoding complexity [10] [11] [12] [13] [14] [15] [16] [17] [18]. These methods aim to solve the Newton identities, which are nonlinear multivariate equations of high degree. However, these algebraic decoding algorithms require a large number of complicated computations in a finite field, leading to a significant decoding delay. Thus, these decoding methods are not suitable for practical implementation, especially when the code length is long. Recently, look-up table based decoding algorithms [19] [20] [21] [22] [23] [24] [25] [26] have been proposed to decode the (47, 24, 11) QR code up to five errors. The full look-up table in the conventional decoding algorithm needs to store $\sum_{i=1}^{5} C_{47}^{i} = 1729647$ syndromes and 1729647 error patterns [27], where $C_{n}^{k}$ denotes the number of combinations of k objects chosen from n elements. It requires 14.85 Mbytes of memory to store the table. Such a large memory requirement makes the computation very complicated. To address this issue, an efficient table look-up decoding algorithm (TLDA) was developed in [21] to decode the (47, 24, 11) QR code. However, the memory size of the TLDA, 36.6 Kbytes, is still so large that further reduction is needed. Thus, Lin et al. [22] proposed a cyclic weight (CW) decoding algorithm that yields a much smaller memory size. Furthermore, Zhang et al. [23] proposed a new efficient syndrome-weight decoding algorithm (NESWDA) that uses no look-up table, which saves memory overhead. However, NESWDA still needs 135.125 bytes of memory to store the parity-check matrix H of the QR code, and its decoding procedure must divide codewords into seven types, which burdens the hardware to some extent.
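The entry count of the full look-up table quoted above can be recomputed in a few lines. The sketch below only checks the count of error patterns; the helper name `full_table_entries` is introduced here for illustration and does not come from the cited works.

```python
from math import comb

def full_table_entries(n: int, t: int) -> int:
    """Number of error patterns of weight 1..t in an n-bit word."""
    return sum(comb(n, i) for i in range(1, t + 1))

entries = full_table_entries(47, 5)
print(entries)  # 1729647
```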
Hence, to reduce the decoding complexity, we design an optimized algorithm to decode the (47, 24, 11) QR code up to five errors. We first derive some theoretical results that capture a vital property of the (47, 24, 11) QR code: the message section is one bit longer than the parity-check section. Based on these derivations, we then propose an optimized cyclic weight (OCW) algorithm that builds on the CW algorithm described in [22]. Compared to the original algorithm [22], the OCW algorithm has two advantages. First, it requires only 69 bytes to store $C_{24}^{1} = 24$ syndromes, while the compact look-up table (CLT) requires 20.43 Kbytes to store the 2324 error patterns associated with $C_{24}^{1} + C_{24}^{2} + C_{24}^{3} = 2324$ syndromes [22]. Second, the OCW algorithm generates $C_{24}^{1} + C_{24}^{2} = 300$ syndromes, which is only 300 ÷ 2324 ≈ 12.9% of the number used in the CLT of the original algorithm. Moreover, the OCW algorithm provides a simple method that avoids categorizing codewords as in [23].
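The counts in this comparison are easy to reproduce. The following sketch assumes that each stored syndrome occupies exactly n − k = 23 bits with no padding; this packing is an assumption that happens to reproduce the 69-byte figure, and the variable names are illustrative only.

```python
from math import comb

SYNDROME_BITS = 23  # assumed storage per syndrome: n - k bits, no padding

ocw_stored = comb(24, 1)                               # single message-bit errors
ocw_bytes = ocw_stored * SYNDROME_BITS / 8             # 24 * 23 bits in bytes
clt_entries = comb(24, 1) + comb(24, 2) + comb(24, 3)  # entries in the CLT of [22]
ocw_generated = comb(24, 1) + comb(24, 2)              # difference syndromes in OCW
ratio_percent = round(100 * ocw_generated / clt_entries, 1)

print(ocw_bytes, clt_entries, ocw_generated, ratio_percent)  # 69.0 2324 300 12.9
```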
Exploiting the properties of the OCW algorithm, we further propose a hardware architecture for QR code decoding on an ALTERA STRATIX V FPGA based on parallel processing and pipeline techniques. The whole architecture is divided into three identical pipelines, each of which has two levels and fourteen steps, so that the execution time is distributed uniformly across the steps of the pipeline. As a result, the proposed architecture benefits from high throughput and low latency, along with reduced area consumption. To the best of our knowledge, it is the first hardware architecture to decode the (47, 24, 11) QR code.
The rest of this paper is organized as follows. Section II briefly describes the background on the (47, 24, 11) QR code. Section III presents the idea of the OCW algorithm for decoding the QR code. Section IV presents the decoding steps and program flowchart of the OCW algorithm in detail. Section V describes the proposed hardware architecture for implementing an efficient decoder. Simulation results are shown in Section VI. Finally, Section VII concludes the paper.
II. BACKGROUND
The binary (47, 24, 11) QR code is defined over $GF(2^{23})$, with codeword length n = 47 bits, message length k = 24 bits and minimum Hamming distance d = 11 [28]. Let α be a root of the primitive irreducible polynomial $p(x) = x^{23} + x^{5} + 1$, so that α is a generator of the multiplicative group of all nonzero elements in $GF(2^{23})$. Let the element $\beta = \alpha^{(2^{23}-1)/47} = \alpha^{178481}$. Then the quadratic residue set is given by
$$Q = \{\, j \mid j \equiv x^{2} \bmod 47,\ 1 \le x \le 46 \,\} = \{1, 2, 3, 4, 6, 7, 8, 9, 12, 14, 16, 17, 18, 21, 24, 25, 27, 28, 32, 34, 36, 37, 42\}$$
and the generator polynomial is
$$g(x) = \prod_{i \in Q} (x - \beta^{i}).$$
According to coding theory, the (47, 24, 11) QR code, defined algebraically as the set of multiples of its generator polynomial g(x), can correct up to $t = \lfloor (d-1)/2 \rfloor = 5$ errors. It follows from [27] that the systematic generator matrix G can be represented by a k × n = 24 × 47 matrix built from the blocks $I_k$ and $P$, where $I_k$ is the k × k = 24 × 24 identity matrix and P is a k × (n − k) = 24 × 23 matrix. The parity-check matrix H is the corresponding (n − k) × n = 23 × 47 binary matrix.
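As a cross-check of the definitions in this section, the following sketch computes the quadratic residues modulo 47 and builds a generator polynomial as the product of the factors (x − β^i) using arithmetic in GF(2^23). It is only an illustration: field elements and binary polynomials are represented as Python integers (bit i is the coefficient of x^i), and the helper names `gf_mul`, `gf_pow` and `gf2_mod` are hypothetical, not the authors'. Subtraction and addition coincide in characteristic 2, so each factor is written as (x + root).

```python
M = 23
P_X = (1 << 23) | (1 << 5) | 1   # p(x) = x^23 + x^5 + 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^23) elements modulo p(x)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        b >>= 1
        a <<= 1
        if a >> M:               # degree 23 reached: reduce by p(x)
            a ^= P_X
    return acc

def gf_pow(a: int, e: int) -> int:
    """Square-and-multiply exponentiation in GF(2^23)."""
    acc = 1
    while e:
        if e & 1:
            acc = gf_mul(acc, a)
        a = gf_mul(a, a)
        e >>= 1
    return acc

def gf2_mod(a: int, b: int) -> int:
    """Remainder of the binary polynomial a(x) divided by b(x)."""
    db = b.bit_length()
    while a.bit_length() >= db:
        a ^= b << (a.bit_length() - db)
    return a

Q = sorted({x * x % 47 for x in range(1, 47)})   # quadratic residues mod 47
beta = gf_pow(2, (2**23 - 1) // 47)              # alpha is the element x (0b10)

g = [1]                                          # coefficients, lowest degree first
for i in Q:
    root = gf_pow(beta, i)
    nxt = [0] * (len(g) + 1)
    for j, c in enumerate(g):                    # multiply g(x) by (x + root)
        nxt[j + 1] ^= c
        nxt[j] ^= gf_mul(c, root)
    g = nxt

g_bits = sum(c << j for j, c in enumerate(g))    # pack the 0/1 coefficients
print(Q)
print(bin(g_bits))
```

Because Q is closed under multiplication by 2 modulo 47, the product has coefficients in GF(2) and divides x^47 + 1. Which of the two degree-23 factors appears depends on the choice of β.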
III. IMPORTANT PROPERTIES ABOUT OCW ALGORITHM
To illustrate the idea of the OCW algorithm, we first present the following definitions, lemmas, theorems and corollaries.
Definition 1: The Hamming weight of a binary vector a is denoted by w(a).
Definition 2: Let $e_{m,i}$, i = 0, 1, . . . , 23, be the single-error patterns having one error in the message section, and let $s_{m,i}$ be the syndrome corresponding to $e_{m,i}$. The $s_{m,i}$ are shown in Table 1.
Lemma 1: Let $s_p$ and $s_m$ be the syndromes corresponding to $e_p$ and $e_m$, respectively, and assume that $w(e) \le t$. The resulting relation and its detailed proof are given in [22].
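For orientation, a relation of this kind can be derived directly from the systematic form of the code. The lines below are a sketch under the assumption that the parity-check matrix has the usual systematic structure, so that an error confined to the parity-check section is its own syndrome; they are not claimed to reproduce the exact statement in [22].

```latex
\begin{align*}
s &= s_m \oplus s_p, \qquad s_p = e_p
  && \text{(assumed systematic form of } H\text{)} \\
w(s \oplus s_m) &= w(e_p) = w(e) - w(e_m) \le t - w(e_m).
\end{align*}
```

Under this assumption and with t = 5, a difference syndrome formed with zero, one or two message-section error candidates would be accepted when its weight is at most 5, 4 or 3, respectively.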
Note that the following definition and corollaries hold for the binary systematic (47, 24, 11) QR codes.
Definition 3: Let cr be the word obtained by cyclically shifting r left by 24 bits, and let cs, $cr_m$, $cr_p$, $ce_m$ and $ce_p$ be, respectively, its syndrome, message section, parity-check section, message-section error pattern and parity-check-section error pattern.
IV. THE OCW ALGORITHM
Based on the above theoretical derivations, the procedure of the proposed OCW algorithm works as follows:
Step 1 Given a received word r, initially set flag = 0.
Step 2 Compute the syndrome s corresponding to r and the weight w(s).
[...] otherwise, if flag = 2, then go to Step 8.
Step 7 Cyclically shift r left by k = 24 bits to obtain cr and compute the syndrome cs corresponding to cr. Let s = cs and r = cr, and then go to Step 4.
Step 8 If decoding has not finished in the steps above, then by Corollary 2 the highest bit of r must be wrong. Invert the highest bit of r to obtain the new received word nr and compute the syndrome ns corresponding to nr. Let s = ns and r = nr, and then go to Step 4.
Step 9 If flag = 1, cyclically shift the output left by 23 bits to obtain the correct codeword and then stop; otherwise, if flag = 2, stop directly.
The flowchart of the OCW algorithm is given in Fig. 1.
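The three-attempt structure of the procedure (decode r, then the cyclically shifted word cr, then nr with the highest bit inverted) can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The acceptance test used here (weight of the difference syndrome at most t minus the number of assumed message errors) is an assumption about the decoding condition. To keep the example small and checkable, it runs on the (23, 12, 7) Golay code, which is also a QR code whose message section is one bit longer than its parity-check section, with generator polynomial 0xC75 and t = 3, so each attempt handles at most one message-section error instead of two. All function names are hypothetical.

```python
from itertools import combinations

def gf2_mod(a, g):
    """Remainder of binary polynomial a(x) modulo g(x) (ints as bit vectors)."""
    dg = g.bit_length()
    while a.bit_length() >= dg:
        a ^= g << (a.bit_length() - dg)
    return a

def rotl(v, shift, n):
    """Cyclic left shift of an n-bit word."""
    return ((v << shift) | (v >> (n - shift))) & ((1 << n) - 1)

def encode(msg, n, k, g):
    """Systematic encoding: message in the high k bits, parity in the low n-k bits."""
    shifted = msg << (n - k)
    return shifted | gf2_mod(shifted, g)

def one_pass(r, n, k, t, g, s_m):
    """Try 0..t//2 message-section errors with the syndrome-weight test."""
    s = gf2_mod(r, g)
    for w in range(t // 2 + 1):
        for idx in combinations(range(k), w):
            diff, e_m = s, 0
            for i in idx:
                diff ^= s_m[i]
                e_m |= 1 << (n - k + i)
            if bin(diff).count("1") <= t - w:
                return r ^ e_m ^ diff      # diff equals the parity-section error
    return None

def ocw_decode(r, n, k, t, g):
    """Sequential sketch: try r, then the shifted word cr, then nr."""
    s_m = [gf2_mod(1 << (n - k + i), g) for i in range(k)]
    c = one_pass(r, n, k, t, g, s_m)
    if c is not None:
        return c
    c = one_pass(rotl(r, k, n), n, k, t, g, s_m)
    if c is not None:
        return rotl(c, n - k, n)           # undo the shift: k + (n - k) = n
    return one_pass(r ^ (1 << (n - 1)), n, k, t, g, s_m)

N, K, T, G = 23, 12, 3, 0xC75              # Golay stand-in, not the (47, 24, 11) code
c = encode(0xABC, N, K, G)
r = c ^ (1 << 22) ^ (1 << 15) ^ (1 << 3)   # two message errors (incl. top bit), one parity error
print(ocw_decode(r, N, K, T, G) == c)      # True
```

The example error pattern is chosen so that the first two attempts fail and the third (inverting the highest bit) succeeds. For the (47, 24, 11) code, the same functions would be called with n = 47, k = 24, t = 5 and the code's generator polynomial.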
V. THE PROPOSED HARDWARE ARCHITECTURE
Next, we exploit two characteristics of the OCW algorithm to design an efficient hardware architecture. First, the OCW algorithm consists of three similar processes that decode r, cr and nr, as shown in Section IV, which is key to a parallel implementation. Second, each process can be implemented by simple, repeatable steps, which is conducive to realizing a high-performance pipeline circuit. These two properties are vital for dividing the decoding process into basic processing units. We therefore propose an architecture that divides the decoding process into three identical pipelines. Each pipeline can correct up to two errors in $r_m$.
If the number of errors in $r_m$ exceeds the error-correcting capacity of the pipeline circuit, namely, $r_m$ has more than two errors, its switch register stays at zero, indicating decoding failure; otherwise, the pipeline outputs the correct codeword and its switch register is set to one. First, the demultiplexer converts the received word into three forms, r, cr and nr, and feeds them into the three identical pipelines simultaneously. For Pipeline One and Pipeline Three, the message is obtained by taking the high 24 bits of the output. For Pipeline Two, however, an extra step is needed. If the switch register in Pipeline Two is one, namely, cr has been corrected, then the multiplexer cyclically shifts the output codeword left by 23 bits and takes the high 24 bits to obtain the message. The whole hardware architecture of the OCW algorithm is shown in Fig. 2. A more detailed description follows.
A. THE PIPELINE CIRCUIT
The pipeline architecture consists of fourteen steps, and each step includes two units: the Syndrome Check Unit and the Correction Unit. The Syndrome Check Unit generates a series of difference syndromes and then detects whether any of them meets the decoding condition. The Correction Unit receives the sequence generated by the Syndrome Check Unit and decides whether to trigger decoding. By connecting the two units together, the pipeline circuit can detect whether a series of locations in the message section contain wrong bits.
The first step of the pipeline generates s and corrects codewords with no error in the message section of the received word. The second step generates the array $s \oplus s_{m,i}$, that is, all $C_{24}^{1} = 24$ difference syndromes at once, and corrects codewords with one error in the message section. The third to fourteenth steps generate the array $s \oplus s_{m,i} \oplus s_{m,j}$, that is, $C_{24}^{2} = 276$ difference syndromes in total, and correct codewords with two errors in the message section. Each step from the third to the fourteenth therefore handles 276 ÷ 12 = 23 difference syndromes. In addition, the Correction Unit lags the Syndrome Check Unit by one clock period. Thus, each pipeline generates $C_{24}^{1} + C_{24}^{2} = 24 + 276 = 300$ difference syndromes and needs only 15 clock cycles to finish decoding.
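The distribution of work over the fourteen steps can be tabulated with a short script. The sketch below assumes that pairs (i, j) are assigned to steps 3 to 14 in lexicographic order; the text does not specify which pairs go to which step, so this ordering and the function name `pipeline_schedule` are illustrative only.

```python
from itertools import combinations

K = 24            # message-section length
PAIR_STEPS = 12   # steps 3..14 handle the two-error candidates

def pipeline_schedule():
    """Map each step number to the message-bit index tuples it tests."""
    schedule = {1: [()], 2: [(i,) for i in range(K)]}
    pairs = list(combinations(range(K), 2))      # 276 pairs (assumed order)
    per_step = len(pairs) // PAIR_STEPS          # 23 pairs per step
    for n in range(PAIR_STEPS):
        schedule[3 + n] = pairs[n * per_step:(n + 1) * per_step]
    return schedule

sched = pipeline_schedule()
difference_syndromes = sum(len(v) for step, v in sched.items() if step >= 2)
latency_cycles = len(sched) + 1                  # one extra cycle for the Correction Unit lag
print(len(sched), difference_syndromes, latency_cycles)  # 14 300 15
```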
B. SYNDROME CHECK UNIT
As mentioned before, the Syndrome Check Unit generates a series of difference syndromes and then detects whether any of them meets the decoding condition. Specifically, it consists of three parts. The first part generates the difference syndromes, which are easily obtained by XOR operations. One register is required to store the syndrome s. The second part calculates the weight of each difference syndrome and judges whether it meets the decoding condition. The third part is a group of registers that saves the detection sequence and transmits it to the Correction Unit. Each bit in the sequence denotes whether the weight of the corresponding difference syndrome satisfies the decoding condition, where one means the condition is met and zero means it is not. That is, the sequence indicates the error locations in $r_m$. By generating difference syndromes in a fixed order, we can determine the error locations quickly. The circuit of the Syndrome Check Unit is shown in Fig. 3.
C. CORRECTION UNIT
The Correction Unit receives the sequence generated by the Syndrome Check Unit and decides whether to trigger decoding. The switch register stores the decoding-completion state. It is assigned the value one when the sequence from the Syndrome Check Unit contains a one; otherwise, it keeps the value from the previous step. If the value of the switch register is one, decoding has finished and the remaining logic operations stop; otherwise, decoding proceeds. The codeword register transmits the codeword. When the sequence from the Syndrome Check Unit contains a one, the codeword in the codeword register is corrected. The syndrome register passes s forward, which is key to generating difference syndromes. The circuit of the Correction Unit is shown in Fig. 4.
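The interplay of the two units in one pipeline step can be mimicked by a small behavioural model. The sketch below is not RTL and not the authors' circuit: the names `StageState`, `syndrome_check` and `correction` are hypothetical, the codeword is assumed to carry its parity-check section in the low bits, and the example values come from the tiny (7, 4) Hamming code with g(x) = x^3 + x + 1 (t = 1), used only because its syndromes can be checked by hand.

```python
from dataclasses import dataclass

@dataclass
class StageState:
    codeword: int    # codeword register
    syndrome: int    # syndrome register (s is passed forward unchanged)
    switch: int = 0  # switch register: 1 once decoding has finished

def syndrome_check(s, masks, threshold):
    """Form the difference syndromes and one flag bit per syndrome."""
    diffs = [s ^ m for m in masks]
    flags = [1 if bin(d).count("1") <= threshold else 0 for d in diffs]
    return diffs, flags

def correction(state, diffs, flags, patterns):
    """Correct the codeword if a flag is set and decoding is not yet finished."""
    if state.switch or 1 not in flags:
        return state                     # keep the values of the previous step
    hit = flags.index(1)
    fixed = state.codeword ^ patterns[hit] ^ diffs[hit]
    return StageState(fixed, state.syndrome, 1)

# Toy run: codeword 0b1010011 with an error in message bit 5, syndrome 0b111.
state = StageState(codeword=0b1110011, syndrome=0b111)
d, f = syndrome_check(state.syndrome, [0], 1)             # step 1: no message error
state = correction(state, d, f, [0])
d, f = syndrome_check(state.syndrome, [3, 6, 7, 5], 0)    # step 2: one message error
state = correction(state, d, f, [8, 16, 32, 64])
print(bin(state.codeword), state.switch)                  # 0b1010011 1
```

Here `masks` holds the syndromes of the candidate message-section error patterns and `patterns` holds the patterns themselves, so a set flag identifies the error locations directly.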
VI. RESULTS OF HARDWARE IMPLEMENTATION AND SIMULATION

A. HARDWARE IMPLEMENTATION RESULTS
To verify our scheme, we implement the proposed hardware architecture on a 5SGSMD4E1H29C1 FPGA using the ALTERA QUARTUS 17.0 tool. Table 2 lists the latency, throughput and clock frequency of the proposed architecture. It has a pipeline latency of 15 clock cycles. After placement and routing, the maximum frequency of the whole architecture reaches 308.36 MHz. Another crucial implementation factor is throughput, which reflects the number of outputs per clock cycle. The proposed hardware architecture yields one message (24 bits) per clock cycle, which means high throughput. By combining parallel processing with pipelining, the proposed architecture performs well in area consumption, path delay, throughput and operating clock frequency. Table 3 shows that the proposed architecture uses resources efficiently.
The results obtained from simulation show that it utilizes 1352 logic LABs and 11314 ALMs, together with 3405 dedicated logic registers, comprising 2987 primary logic registers and 418 secondary logic registers. The utilization of logic LABs and ALMs is no more than 10% of the total on the chip. It is noteworthy that the usage of primary logic registers is merely 1.1% of the total amount, which means that a practical decoding circuit will have low area consumption.
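The quoted clock frequency, latency in cycles and output width can be converted into absolute figures. The numbers printed below are derived here from the values stated in the text, assuming one 24-bit message leaves the decoder on every clock cycle at the maximum frequency; they are not taken from Table 2.

```python
F_MAX_HZ = 308.36e6    # reported maximum clock frequency
BITS_PER_CLOCK = 24    # one 24-bit message per clock cycle
LATENCY_CYCLES = 15    # pipeline latency in clock cycles

throughput_gbps = F_MAX_HZ * BITS_PER_CLOCK / 1e9   # decoded message bits per second
latency_ns = LATENCY_CYCLES / F_MAX_HZ * 1e9        # time from input to output

print(f"{throughput_gbps:.2f} Gbit/s, {latency_ns:.2f} ns")  # 7.40 Gbit/s, 48.64 ns
```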
B. SIMULATION RESULTS
The simulated bit error rate (BER) versus signal-to-noise ratio (SNR) for the (47, 24, 11) QR code, in an additive white Gaussian noise (AWGN) channel with binary phase-shift keying (BPSK) modulation, is illustrated in Fig. 5. We can see that the proposed OCW algorithm and the original CW algorithm have almost the same BER performance over the whole SNR range. At a received SNR of 8 dB, the BER of the OCW algorithm is $4.7 \times 10^{-7}$, which satisfies the requirements of most wireless communication applications.
VII. CONCLUSION
First, in this paper, we have derived some vital theoretical results about QR codes that enable the OCW algorithm to efficiently decode the (47, 24, 11) QR code up to five errors. The OCW algorithm simplifies the decoding process by computing only the weights of the difference syndromes corresponding to received words with one or two bit errors in the message section; meanwhile, it consumes merely 69 bytes of memory to store 24 syndromes, which is 51% and 0.45% of the memory size used in the NESWDA algorithm [23] and the CW algorithm [22], respectively. In the near future, the authors will devote their efforts to decoding QR codes of different lengths. In addition, concatenated coding systems composed of QR codes and other powerful channel codes, such as low-density parity-check codes and Reed-Solomon codes, are also worth investigating.
Second, to verify the advantages of the OCW algorithm, we have proposed an FPGA hardware architecture for decoding the (47, 24, 11) QR code. To the best of our knowledge, it is the first hardware architecture to decode the (47, 24, 11) QR code. The proposed architecture enjoys a low pipeline latency of 15 clock cycles and a high throughput of one message (24 bits) per clock cycle. The synthesis results indicate that the proposed hardware architecture performs well in terms of area consumption, path delay, throughput and operating clock frequency. Thus, the proposed hardware architecture can be seen as a good alternative in forward error correction schemes for 5G wireless communications.
