To meet the higher data rate requirement of encoder, an 
Introduction
QC-LDPC codes [1] have attracted growing research interest as channel codes because of their strong error-correction performance, and they have been adopted by many standards, such as WIMAX, DTMB and DVB. Achieving high-efficiency encoding in large-scale circuits remains an active research problem. The shift register accumulator (SRAA) of [2] is the commonly used encoding method for general matrices. Reference [3] applies a left-loop accumulator to the high-density matrix multiplication of the DTMB standard. Our design targets the WIMAX standard, whose check matrices have a dual-diagonal structure, and obtains an efficient encoder structure by simplifying and optimizing the RU algorithm.
Encoder Architecture Analysis of Design Space Exploration (DSE)
Efficiently finding a hardware structure that satisfies the constraints of a large design space is an important challenge. The micro-encoder is designed with the Design Space Exploration method [4]. The main idea is that the on-chip resource consumption of the system (chip area and storage resources) must be considered before the program is loaded onto the target board. The overall design framework is shown in Figure 1.
Memory Exploration (ME)
The system memory includes the memories for matrices A and C, the information-bit memory, and so on. The constrained storage resource determines the maximum amount of reused data that can be kept on chip. To share resources, reused data should be stored across layers in the on-chip memory system. Taking the 1/2 rate in the WIMAX standard as an example, we store the sub-matrices obtained by partitioning the model matrix into blocks line by line, which halves the amount of storage. The design is detailed in Part IV.
Micro-architecture Design (MD)
The Design Space Exploration method includes two levels: Micro-architecture Design and Macro-architecture Design. The macro-architecture approach uses multiple processors to encode in parallel. Based on an analysis of a digital instantiation, we only consider serially input codewords. The micro-architecture design is comparatively simple, feasible and resource-saving, and it does not degrade performance. It can be realized by inputting data concurrently or by sharing ROM modules. The specific micro-architecture is also described in detail in Part IV.
Communication Network Exploration (CNE)
WIMAX (Worldwide Interoperability for Microwave Access) is based on the IEEE 802.16e WMAN (wireless metropolitan area network) technology, which adopts quasi-cyclic LDPC codes. The standard defines six LDPC code rates: 1/2, 2/3A, 2/3B, 3/4A, 3/4B and 5/6. Each code rate supports 19 different code lengths. For rates that satisfy the constraint between rows, an encoder can be derived by changing only the number of ROM blocks and the parameter r.
Encoder Algorithm and Optimization

Encoder Algorithm Principle
In WIMAX, the base matrix determines the parity-check matrix. In the base matrix, -1 represents the all-zero matrix, 0 represents the identity matrix, and a positive entry represents a cyclic-shift permutation matrix with that shift value. In this design the expansion factor z is 96. The dual-diagonal structure of matrix T also holds for the other WIMAX code rates.
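The entry convention above can be illustrated with a short Python sketch that expands one base-matrix entry into a z-by-z binary block. The function name `expand_entry` and the choice of shifting the identity matrix cyclically to the right are assumptions made for illustration; a small z is used so that the result is easy to inspect.

```python
def expand_entry(entry, z):
    """Expand one base-matrix entry into a z-by-z binary block (list of rows).

    -1 -> all-zero block
     0 -> identity block
     s -> identity block with every row cyclically shifted right by s columns
          (the shift direction is an assumption of this sketch)
    """
    if entry < 0:
        return [[0] * z for _ in range(z)]
    return [[1 if col == (row + entry) % z else 0 for col in range(z)]
            for row in range(z)]


if __name__ == "__main__":
    # Print a shift-by-1 block for a toy expansion factor of 4.
    for block_row in expand_entry(1, 4):
        print(block_row)
```

With z = 96, the same function would produce the 96-by-96 blocks used in this design.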
Every code rate except 3/4B satisfies $\phi = ET^{-1}B + D = I$, where $I$ is the identity matrix.
Combining WIMAX Standard Algorithm to Simplify
Simplification of Processes
$p_1$: The encoder's resource consumption is usually dominated by large inverse-matrix calculations, so simplifying the inverse matrix makes the encoder simpler and its operation more efficient. WIMAX matrices have a dual-diagonal structure, so the simplification also applies to other rates. Take the 1/2 rate as an example. Set
$$x^T = T^{-1} A s^T,$$
and left-multiply both sides by $T$ to obtain
$$T x^T = A s^T, \qquad \text{that is,} \quad x(1) = As(1), \quad x(i-1) + x(i) = As(i) \ \ (i \ge 2)$$
(Note: $i$ indexes the block rows of the matrix.) From the dual-diagonal structure of $T$ we get
$$x(i) = \sum_{j=1}^{i} As(j),$$
and, by the identity-matrix substitution, the result is
$$p_1^T = ET^{-1}As^T + Cs^T = \sum_{i} As(i) + Cs.$$
That is, $p_1$ equals the modulo-2 sum of $Cs$ and all $As(i)$.
$p_2$: The equations for $p_2$ are written in iterative form:
$$p_2(1) = W(1), \qquad p_2(i-1) + p_2(i) = W(i) \ \ (i \ge 2).$$
Because the addition is modulo 2, we finally get
$$p_2(i) = W(i) + p_2(i-1).$$
Thus $p_2$ reduces to the accumulation of the current value of $W$ (with $W^T = As^T + Bp_1^T$) and the previous-cycle value of $p_2$.
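The two results above map onto simple XOR accumulations, which the following Python sketch makes concrete. Each 96-bit group is modelled as an integer bit mask. The function names are illustrative, and the helper `dual_diagonal_times` assumes that T has identity blocks on its main diagonal and first sub-diagonal; it is only used to check that the accumulation really solves the dual-diagonal system.

```python
from functools import reduce
from operator import xor


def parity1(cs, as_groups):
    """p1 = Cs XOR As(1) XOR ... XOR As(m), all groups as integer bit masks."""
    return reduce(xor, as_groups, cs)


def parity2(w_groups):
    """p2(1) = W(1); p2(i) = W(i) XOR p2(i-1) for i >= 2."""
    result = []
    previous = 0
    for w in w_groups:
        previous ^= w          # accumulate the current W with the previous p2
        result.append(previous)
    return result


def dual_diagonal_times(p_groups):
    """Multiply an assumed dual-diagonal T by p (block-wise, over GF(2)).

    Block row 1 is p(1); block row i is p(i-1) XOR p(i).
    """
    return [p if i == 0 else p_groups[i - 1] ^ p
            for i, p in enumerate(p_groups)]


if __name__ == "__main__":
    w = [0b1010, 0b0110, 0b1111]      # toy 4-bit groups instead of 96 bits
    p2 = parity2(w)
    assert dual_diagonal_times(p2) == w
```

The running XOR needs only one register of the group width, which is why no explicit inverse of T appears anywhere in the computation.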
Design and Analysis of the Encoder Structure
The proposed encoder is composed of three main parts: the Intermediate Vector Calculation Module (IVCM), the Checksum1 Calculation Module (C1CM) and the Checksum2 Calculation Module (C2CM), as shown in the dashed box. The intermediate vectors As (the product of matrix A and the information bits s) and Cs (the product of matrix C and the information bits s) are output in groups of 96 bits, while C1CM outputs $p_1$ and C2CM outputs $p_2$. Both are eventually written to the Chk ROM at 96 bits per clock cycle. The following sections describe each part in detail.
Resource Sharing ROM for Matrix A, C
From (1) and (2), D is not needed during the calculation, and the dual-diagonal structure of T is simple. Thus most of the values stored in the ROMs come from matrices A and C of the parity-check matrix. Following the grouping principle that the non-negative values of two rows sharing a ROM must not lie in the same column, we store the 1st and 3rd rows, the 2nd and 11th rows, the 4th and 6th rows, the 5th and 12th rows, the 7th and 9th rows, and the 8th and 10th rows in six ROMs respectively, which keeps the encoder resource-saving. Each ROM has a depth of 12, corresponding to the number of columns in the matrix. Figure 3 shows the ROM that stores the first and third rows. Every shift value is smaller than 96, so 7 bits store the non-negative value and one more bit serves as a flag. A flag bit of 1 indicates that the value comes from the first row; otherwise it comes from the third row. For example, in the first column both the first row and the third row are -1, so we fill in FF. The second-column value in the first row is 94 (hexadecimal 5E); setting the MSB to 1 gives the stored hexadecimal value 'DE'. The contents of the six ROMs are shown in Table 1.
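The 8-bit ROM word format described above can be sketched in Python as follows. The function names, the constant `EMPTY_WORD` and the error handling are hypothetical; only the layout (flag in the MSB, 7-bit shift value, FF when both rows hold -1) comes from the text.

```python
EMPTY_WORD = 0xFF   # stored when both paired rows hold -1 in this column
MAX_SHIFT = 96      # expansion factor z; valid shifts are 0..95


def encode_rom_word(first_row_value, second_row_value):
    """Pack one column of two paired base-matrix rows into an 8-bit word."""
    if first_row_value >= 0 and second_row_value >= 0:
        raise ValueError("paired rows must not both be non-negative in a column")
    for value in (first_row_value, second_row_value):
        if value >= MAX_SHIFT:
            raise ValueError("shift value must be smaller than 96")
    if first_row_value >= 0:
        return 0x80 | first_row_value   # flag bit 1: value from the first row
    if second_row_value >= 0:
        return second_row_value         # flag bit 0: value from the second row
    return EMPTY_WORD


def decode_rom_word(word):
    """Return (row_index, shift) with row_index 0 or 1, or None if empty."""
    if word == EMPTY_WORD:
        return None
    return (0 if word & 0x80 else 1, word & 0x7F)


if __name__ == "__main__":
    print(format(encode_rom_word(94, -1), "02X"))
```

Because every shift is below 96, a flagged value can never reach 0x7F in its low bits, so FF is safe to use as the empty marker.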
Barrel Shifter
Multiplying a cyclic-shift permutation matrix by a column vector is equivalent to cyclically shifting the column vector. In this design, the multiplication is performed by a barrel shifter and XOR gates. We use 96 XOR gates, matching the expansion factor. One conventional shift register accumulator (SRAA) circuit needs 192 flip-flops, 96 XOR gates and 96 AND gates. The 1/2 rate requires 12 SRAAs, that is, 2304 flip-flops, 1152 XOR gates and 1152 AND gates. The barrel shifter requires $(\lfloor \log_2 z \rfloor + 1)\,z$ flip-flops [6], so for z = 96 the intermediate vector calculation module needs 672 flip-flops and 96 XOR gates in all. The barrel shifter is therefore better suited to a resource-saving encoder design.
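As a small check of the equivalence used here, the Python sketch below compares an explicit GF(2) matrix-vector product with a plain rotation, and reproduces the flip-flop counts quoted above. The shift direction of the permutation matrix and all function names are assumptions of this sketch. The barrel-shifter count uses the reading (floor(log2 z) + 1) * z, which gives 672 for z = 96.

```python
def cyclic_permutation(shift, z):
    """z-by-z identity with each row cyclically shifted right by `shift`."""
    return [[1 if col == (row + shift) % z else 0 for col in range(z)]
            for row in range(z)]


def matvec_gf2(matrix, vec):
    """Matrix-vector product over GF(2)."""
    return [sum(m & v for m, v in zip(row, vec)) % 2 for row in matrix]


def barrel_shift(vec, shift):
    """Rotate the vector so that output[i] = vec[(i + shift) % z]."""
    shift %= len(vec)
    return vec[shift:] + vec[:shift]


def sraa_flip_flops(count, per_sraa=192):
    """Flip-flops for `count` SRAA circuits (192 each, as quoted in the text)."""
    return count * per_sraa


def barrel_flip_flops(z):
    """(floor(log2 z) + 1) * z; int.bit_length() equals floor(log2 z) + 1."""
    return z.bit_length() * z


if __name__ == "__main__":
    bits = [1, 0, 0, 1, 1, 0]
    assert matvec_gf2(cyclic_permutation(2, 6), bits) == barrel_shift(bits, 2)
    print(sraa_flip_flops(12), barrel_flip_flops(96))
```

The comparison of 2304 against 672 flip-flops is the saving that motivates the barrel shifter in this design.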
Checksum Calculation Modules
From (4), $p_1$ is obtained by the modulo-2 addition of the As(i) and the 96-bit group Cs. It is calculated by an accumulator that consists mainly of 96 XOR gates. After the calculation, $p_1$ is stored in the Chk ROM. The calculation of W is detailed as follows.
The nonzero blocks of matrix B are needed for the XOR operation with $p_1$ only in the first and sixth clock cycles; in the remaining clock cycles As(i) is stored directly in the W RAM. The shift value 'r' is determined by the non-negative element in matrix B. In the first clock cycle r is 7, so $p_1$ is shifted right by 7 bits and its XOR with the first 96-bit group As(1) is written to the W RAM. In the sixth cycle, the modulo-2 operation of 
Applicability of the Encoder
The proposed encoder is suitable not only for the 1/2 rate but also for the 2/3B rate. Because the dual-diagonal structure exists in all rates of the WIMAX standard, the design of C2CM is universal. However, a rate whose rows cannot be grouped without two non-negative elements falling in the same column cannot use this encoder structure. The 2/3B rate needs 4 ROMs, while the 1/2 rate needs 6. In addition, r is 7 for the 1/2 rate and 95 for the 2/3B rate. The design is a step toward a universal multi-rate encoder.
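The per-rate differences listed above amount to two parameters, which could be captured in a small configuration table. The following Python sketch is illustrative only: the names `RateConfig`, `RATE_CONFIGS` and `block_rows` are hypothetical, and the row count assumes that every ROM holds exactly two base-matrix rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateConfig:
    rom_count: int   # number of shared ROMs for matrices A and C
    shift_r: int     # shift value r taken from matrix B


# Values quoted in the text for the two supported rates.
RATE_CONFIGS = {
    "1/2": RateConfig(rom_count=6, shift_r=7),
    "2/3B": RateConfig(rom_count=4, shift_r=95),
}


def block_rows(rate):
    """Base-matrix rows covered, assuming two rows are stored per ROM."""
    return 2 * RATE_CONFIGS[rate].rom_count


if __name__ == "__main__":
    for name, config in RATE_CONFIGS.items():
        print(name, config.rom_count, config.shift_r, block_rows(name))
```

Supporting a further rate would mean adding one entry to the table, provided its rows satisfy the grouping constraint.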
Simulation Results
Based on the micro-architecture design and improvements above, the LDPC encoder was written in Verilog HDL, compiled for a Cyclone II EP2C70F896C6 device and simulated at two rates. The results are given in Table 2, which shows the resource consumption of the 1/2 rate and the 2/3B rate at code length 2304. The element usage and resource utilization show that the proposed encoder is efficient and resource-saving. The remaining resources leave room for enhancing and expanding the entire system. In addition, the coding waveforms produced on the ModelSim simulation platform were compared with encoding on the MATLAB platform, which verifies the correctness of the results. The information bits are input when s2p_en0=1, and 'msg-data' and 'chk-mdata' are output at the end. The information bits 'msg-data' consist of 12 groups, s2p-q0 to s2p-q11, and 'chk-mdata' holds all the parity bits, 12*96 bits in length.
Conclusions
Based on the methodology of encoder design exploration and aiming to reduce hardware consumption, an efficient encoder architecture suitable for both the 1/2 rate and the 2/3B rate of the WIMAX standard is proposed.
In the memory exploration, two rows of data are stored in one ROM to share resources, which halves the amount of memory. Using the specific structure of the WIMAX standard, we further simplify the RU algorithm and thereby greatly simplify the encoder structure. By summarizing the similarities and differences in the structure of each base matrix, the final design can be applied to two rates. We expect the proposed encoder to be useful for the future design of multi-rate encoders.
