ABSTRACT In this paper, we add a polar decoding function to our previous design and propose a flexible quad-mode forward error correction application specific instruction-set processor (QFEC ASIP) that supports polar, low-density parity-check (LDPC), turbo, and convolutional code (CC) decoding with multiple code lengths and code rates. A unified polar/LDPC/turbo/CC quad-mode algorithm framework is presented. The top level architecture of QFEC ASIP and the polar data path are designed on the basis of this framework. A quad-mode confliction-free global memory system is proposed. Hardware sharing saves 65.2% of global memory banks, 48.9% of global memory bits, and 29.7% of global memory area. Specially accelerated FEC decoding instructions make the decoding procedure fully programmable and ensure high throughput. Synthesis using 65-nm technology shows that the total area of QFEC ASIP is 4.26 mm². QFEC ASIP provides a maximum throughput of 1345 Mb/s for polar, 917 Mb/s for LDPC (WiMAX), 320 Mb/s for turbo, and 387 Mb/s for CC (64 states) at a clock frequency of 344 MHz. QFEC ASIP occupies a much smaller silicon area than the combined area of four single-mode FEC decoders that together provide a similar function range.
I. INTRODUCTION
Wireless communication standards are continually evolving. Baseband chips have to support multiple standards to provide access to ubiquitous wireless communication networks. Forward error correction (FEC) improves communication reliability, and thus is important to a baseband system. FEC decoding is one of the most computation-intensive modules in a baseband system [1]. Furthermore, FEC decoders must support multiple decoding algorithms for different communication standards within a limited hardware cost. These requirements pose great challenges for FEC decoder design.
The application specific integrated circuit (ASIC) is a traditional solution for applications requiring high performance, low power, and small area. However, designing ASICs that support all possible configurations incurs a huge non-recurring engineering cost and years of research and development time because of the high design complexity [2]. In addition, ASICs have limited flexibility. The product lifetime is threatened because an ASIC redesign is unavoidable when a standard changes. We propose a multi-mode application specific instruction-set processor (ASIP) that supports software-defined radio (SDR) for FEC decoding in complex application scenarios. System integration using a multi-mode ASIP is easier than with multiple single-mode decoders. Hardware sharing among multiple modes saves silicon area effectively. Specially accelerated instructions ensure high throughput and sufficient flexibility in the specific application domain, and in turn reserve some design margin for extending the product lifetime.
Low-density parity-check (LDPC), Turbo, and convolutional codes (CC) are commonly used error correction codes. They have been adopted in various wireless communication standards, such as 2G-4G [3]-[6], WLAN [7], and WiMAX [8]. Many multi-mode FEC processors have been proposed to support multiple standards. Alles et al. [9] proposed the first LDPC/Turbo/CC triple-mode FEC ASIP, namely FlexiChaP. FlexiChaP achieves a small silicon area through partly parallel decoding, and provides a throughput of 257 Mbps for LDPC by means of an efficient 12-stage pipeline. However, its throughput in Turbo mode is only 18.6 Mbps. Condo et al. [10] proposed a network-on-chip-based LDPC/Turbo dual-mode decoder. A 35-node network working at a high frequency of 780 MHz is introduced to avoid memory confliction between the two decoding modes. This decoder provides a high throughput of 455 Mbps for LDPC and 150 Mbps for Turbo, but it does not support CC. In [11], we proposed a high-throughput trellis-based ASIP (TASIP) for LDPC/Turbo/CC triple-mode FEC decoding. TASIP achieves high decoding efficiency by means of a triple-mode unified forward-backward recursion decoding kernel with an eight-state parallel trellis structure. TASIP can support multiple standards, including the 3rd Generation Partnership Project Long Term Evolution (3GPP-LTE), WLAN, and WiMAX.
In 2016, Polar code was adopted as the control channel coding scheme for the Enhanced Mobile Broadband (eMBB) scenario in the future 5G standard [12]. However, as far as we know, no multi-mode FEC decoder covering all four important FEC techniques (Polar, LDPC, Turbo, and CC) had been reported as of June 2018. Designing a high-speed Polar/LDPC/Turbo/CC quad-mode FEC decoder is thus an essential research direction.
In this paper, we propose a flexible Polar/LDPC/Turbo/CC quad-mode FEC ASIP (QFEC ASIP) by adding a Polar decoding function to our previous LDPC/Turbo/CC triple-mode work, TASIP [11]. QFEC ASIP is an intellectual property (IP) core based design. It is an upgraded version of TASIP, and inheriting the existing design costs considerably less time than designing from scratch. Compared with TASIP, QFEC ASIP makes the following main contributions. Firstly, a unified Polar/LDPC/Turbo/CC quad-mode algorithm framework is proposed, and the hardware design of QFEC ASIP is based on this framework. Secondly, Polar decoding acceleration instructions are added to the instruction set, and a highly parallel Polar data path is designed. The newly introduced Polar decoding mode is fully programmable, and it supports code lengths from 64 bits to 4096 bits with any code rate. Thirdly, the memory subsystem is redesigned, and a Polar/LDPC/Turbo/CC quad-mode confliction-free global memory system (GMS) is proposed. Memory sharing among the four modes saves 65.2% of global memory banks, 48.9% of global memory bits, and 29.7% of global memory area. QFEC ASIP can be adopted in SDR environments for both terminals and base stations, and the corresponding channel coding functions can be implemented purely by software programming.
The rest of this paper is arranged as follows: Section II presents the design methodology of QFEC ASIP. Section III introduces Polar decoding algorithm selection and the unified Polar/LDPC/Turbo/CC quad-mode algorithm framework. Section IV presents the hardware implementation of QFEC ASIP. Section V introduces the unified quad-mode confliction-free GMS. Section VI analyzes the performance of the Polar decoding mode. Section VII gives comparison results. Finally, Section VIII concludes the paper.
II. QFEC ASIP DESIGN METHODOLOGY
The software-hardware cooperative design methodology shown in Figure 1 is adopted in the QFEC ASIP design. QFEC ASIP inherits its design from TASIP. The primary goal of this paper is to add a Polar decoding function to the previous design. The algorithms and the hardware architecture of TASIP are important inputs to the QFEC ASIP design. First, a Polar decoding algorithm that fits the previous design is selected, with the aim of maximizing hardware reuse. Then, a quad-mode algorithm framework is designed by combining the newly selected Polar decoding algorithm with the algorithms adopted in TASIP. Next, the Polar decoding instructions are determined, and the top level architecture of QFEC ASIP is designed on the basis of the quad-mode algorithm framework. After the hardware submodules and firmware are designed, the function of QFEC ASIP is validated. If the function is not correct, the hardware submodules and firmware are checked. Otherwise, we proceed to the silicon design and validate the area, performance, and power of QFEC ASIP. If the results are unsatisfactory, the instruction set and the top level architecture are optimized. Finally, the hardware and software are integrated.
III. ALGORITHM PRINCIPLE
Before ASIP hardware design, a unified algorithm framework including all decoding modes should be designed. Our previous design, TASIP, is based on a unified LDPC/Turbo/CC triple-mode trellis forward-backward recursion (FBR) algorithm kernel [11] . In this paper, we add Polar decoding algorithm to the triple-mode algorithm kernel, and propose a unified Polar/LDPC/Turbo/CC quad-mode algorithm framework for QFEC ASIP.
A. POLAR DECODING ALGORITHM SELECTION
In this subsection, we select a Polar decoding algorithm for quad-mode algorithm framework design, and provide some necessary background on the selected algorithm.
As we mentioned in Section II, a Polar decoding algorithm that fits the previous design is more suitable for QFEC ASIP. There are mainly three kinds of Polar decoding algorithms: belief propagation (BP) based algorithms, successive cancellation (SC) based algorithms, and successive cancellation list (SCL) based algorithms [13]. Among them, BP decoding algorithms are based on a bi-directional recursion process that is similar to FBR. Moreover, BP decoding has inherently high parallelism. The operands can be segmented using parallel windows (PWs) and sliding windows (SWs) without reducing the decoding accuracy. Therefore, BP based algorithms are the best candidates for the design of the unified quad-mode algorithm framework.
In this paper, we adopt the scaled min-sum (SMS) approximated BP decoding algorithm [14] for easier hardware implementation. The BP decoding procedure can be represented on a factor graph [15], [16]. Figure 2 gives a simple example of a Polar factor graph where the code length (N) is 8 bits. The factor graph has n + 1 columns ($n = \log_2 N$). Each column contains N nodes. The decoding calculation contains n stages, and each stage includes N/2 basic computational units (BCUs). The structure of a BCU is shown at the top of Figure 2. Each BCU involves two nodes on each side, and (i, j) represents the node coordinates in the factor graph. There are two kinds of intermediate decoding results, the R message and the L message. Each node carries only one message at a time.
The Polar decoding procedure is shown at the bottom of Figure 2. The BCUs are calculated as (1) and (2) during right-to-left propagation (LP), and as (3) and (4) during left-to-right propagation (RP). The final results are determined by the sum of $L_{0,j}$ and $R_{0,j}$ ($0 \le j < N$). The criterion is shown in (5).
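Equations (1)-(5) are not reproduced in this text, so the following is only a minimal sketch of one BCU in the scaled min-sum form commonly used for BP Polar decoding. It assumes the node pairing described later in the paper (left nodes j and j + N/2, right nodes 2j and 2j + 1), the 0.9375 scaling factor quoted in Section IV-C, and the usual LLR convention that a non-negative sum decodes to bit 0. The function and argument names are illustrative, not taken from the paper.

```python
ALPHA = 0.9375  # assumed scaling factor (the value quoted for the hardware)


def f(x, y, alpha=ALPHA):
    """Scaled min-sum approximation: alpha * sign(x) * sign(y) * min(|x|, |y|)."""
    sign = -1.0 if (x < 0) != (y < 0) else 1.0
    return alpha * sign * min(abs(x), abs(y))


def bcu_left(l_r0, l_r1, r_l0, r_l1):
    """One LP step of a BCU.

    l_r0, l_r1: L messages of the two right-column nodes (2j, 2j + 1).
    r_l0, r_l1: R messages of the two left-column nodes (j, j + N/2).
    Returns the updated L messages of the two left-column nodes.
    """
    l_l0 = f(l_r0, l_r1 + r_l1)
    l_l1 = f(r_l0, l_r0) + l_r1
    return l_l0, l_l1


def bcu_right(l_r0, l_r1, r_l0, r_l1):
    """One RP step of a BCU; returns the updated R messages of the right nodes."""
    r_r0 = f(r_l0, l_r1 + r_l1)
    r_r1 = f(r_l0, l_r0) + r_l1
    return r_r0, r_r1


def hard_decision(l0, r0):
    """Final decision on the sum of the column-0 messages (assumed convention)."""
    return 0 if (l0 + r0) >= 0 else 1
```

For example, `bcu_left(1.0, 2.0, 8.0, 0.0)` combines the weaker of the two inputs in each `f` call, scaled by 0.9375, with the pass-through message of the lower branch.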
B. UNIFIED QUAD-MODE ALGORITHM FRAMEWORK
In this subsection, we propose a unified Polar/LDPC/Turbo/CC quad-mode algorithm framework. The ASIP hardware will be designed on the basis of this framework. TASIP adopts the turbo-decoding message passing (TDMP) algorithm with the Bahl-Cocke-Jelinek-Raviv (BCJR) routine for LDPC mode, maximum a posteriori (MAP) probability decoding based on BCJR for Turbo mode, and the Viterbi algorithm for CC mode [11]. These three algorithms are kept in QFEC ASIP. The TDMP algorithm, the MAP algorithm, the Viterbi algorithm, and the newly selected BP Polar decoding algorithm are all based on a bi-directional recursion decoding procedure. For the BP algorithm, the bi-directional recursion refers to LP and RP. For the TDMP and MAP algorithms, it refers to the BCJR routine. For the Viterbi algorithm, it refers to the add-compare-select (ACS) operation and the trace back operation. Thus, these four algorithms can be combined into a unified quad-mode algorithm framework.
The unified Polar/LDPC/Turbo/CC quad-mode algorithm framework is proposed as Algorithm 1. The bi-directional recursion calculation of the four single-mode decoding algorithms is implemented by the two innermost loops. In Algorithm 1, the "for" loops marked with "(Pol.)" and the pseudocode in the loop bodies marked with "(Pol.)" are used in Polar decoding mode. Here, it is the counter of the outermost iteration, p is the index of the parallel window, w is the index of the sliding window, t is the sliding window address, and $P_s$ is the number of BCUs that can be executed within one clock cycle. Other loops are not used in Polar decoding mode. The hardware implementation of Polar decoding mode is introduced in Section IV. Detailed information on the algorithms and hardware implementation of LDPC, Turbo, and CC mode can be found in [11] and [17].
IV. QFEC ASIP HARDWARE DESIGN
The ASIP architecture is the implementation of Algorithm 1. In this section, we discuss the general considerations of the Polar mode implementation, including the data precision, the degree of data parallelism, the storage requirements, and the programmability design. We also introduce the top level architecture of QFEC ASIP and the implementation of the Polar soft-in-soft-out (SISO) module.
A. GENERAL CONSIDERATIONS OF POLAR MODE IMPLEMENTATION
QFEC ASIP adopts fixed-point data, so the data quantization scheme of Polar mode is determined first. Figure 3 shows the bit error rates (BER) of Polar decoding with N = 1024 bits using different data quantization schemes. When the fraction length is 3 bits and the integer length is 5 bits, the BER performance of the hardware approaches its limit, and the improvement provided by longer fraction or integer lengths is negligible. The data width is therefore set to 8 bits, of which 3 bits are for the fraction and 5 bits are for the integer. From Figure 2, we can see that BCUs in the same stage do not have data dependencies. Therefore, a single-instruction-multiple-data (SIMD) architecture that achieves partial data-level parallelism can be adopted, and multiple BCUs in the same stage can be calculated in parallel to increase the Polar decoding speed. BCUs in the factor graph are segmented using PWs and SWs according to the segmentation scheme in Figure 4, so that the fixed hardware can support multiple code lengths. There are P PWs, and the n stages are separated into n SWs. The P PWs are implemented in hardware with P homogeneous Polar SISOs (PSISOs) and the corresponding global memory banks. We adopt 16 PSISOs, each computing 32 BCUs in parallel, considering the decoding throughput and the hardware reuse among the four decoding modes.
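The 8-bit quantization described above can be illustrated with a short sketch. It assumes that the sign bit is counted among the 5 integer bits, so that values lie in [-16, 15.875] with a step of 0.125, and that out-of-range inputs are clipped; the rounding mode and function names are assumptions made for illustration.

```python
FRAC_BITS = 3            # fraction length from the text
TOTAL_BITS = 8           # 5 integer bits (sign assumed included) + 3 fraction bits
Q_MIN = -(1 << (TOTAL_BITS - 1))      # -128
Q_MAX = (1 << (TOTAL_BITS - 1)) - 1   # 127


def quantize(x):
    """Map a real-valued message to an 8-bit two's complement integer code."""
    q = int(round(x * (1 << FRAC_BITS)))
    return max(Q_MIN, min(Q_MAX, q))   # clip to the representable range


def dequantize(q):
    """Recover the real value represented by an integer code."""
    return q / (1 << FRAC_BITS)
```

With these assumptions, a channel value of 1.3 is stored as code 10 (1.25), and any value at or above 15.875 is clipped to code 127.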
To reach the peak performance of the SIMD architecture, the data path should be fully loaded every clock cycle. According to (1)-(4), the GMS must be capable of providing 4×32 = 128 operands (128×8 = 1024 bits) to each PSISO in each clock cycle. In addition, all data nodes in the factor graph must be saved in memory for the next iteration of computation. The memory compiler in the 65-nm process library only provides memory banks with a data width ≤ 144 bits and a depth ≥ 32. We therefore adopt eight 128-bit×32 memory banks in each parallel window for Polar decoding mode, although this is not a perfect solution. This storage volume can support Polar decoding with N ≤ 4096 bits.
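The bandwidth and capacity figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that every node of the (n + 1)-column factor graph stores exactly one 8-bit message; the constant and function names are illustrative.

```python
DATA_BITS = 8           # operand width
OPERANDS_PER_BCU = 4    # operands consumed by one BCU
BCUS_PER_PSISO = 32     # BCUs computed per PSISO per cycle
NUM_PW = 16             # parallel windows (one PSISO each)
BANKS_PER_PW = 8
BANK_WIDTH_BITS = 128
BANK_DEPTH = 32


def required_bits_per_cycle():
    """Bits one PSISO must receive per clock cycle."""
    return OPERANDS_PER_BCU * BCUS_PER_PSISO * DATA_BITS


def provided_bits_per_cycle():
    """Bits the eight banks of one parallel window deliver per access."""
    return BANKS_PER_PW * BANK_WIDTH_BITS


def message_capacity():
    """Number of 8-bit messages that fit in the whole global memory."""
    total_bits = NUM_PW * BANKS_PER_PW * BANK_WIDTH_BITS * BANK_DEPTH
    return total_bits // DATA_BITS


def messages_needed(code_length):
    """Messages in the factor graph: (n + 1) columns of N nodes (assumed)."""
    n = code_length.bit_length() - 1
    return (n + 1) * code_length
```

Under these assumptions the eight banks exactly match the 1024-bit requirement, and the 65536-message capacity covers N = 4096 (53248 messages) but not N = 8192, which is consistent with the stated limit.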
Three hardware-accelerated Polar decoding instructions are extracted from the unified quad-mode algorithm framework. "PLEFT" performs L message calculation and write back (one step of LP). "PRIGHT" performs R message calculation and write back (one step of RP). "PFIN" performs the final result decision. A slice of the assembly program for Polar decoding is shown in Figure 5, where "c0" and "c1" are loop flags, "as" and "aw" control the sliding window address and the sliding window number respectively, "SWL" is the sliding window length, and "rep = xx" repeats the corresponding instruction "xx" times. The SW movement in this program slice can be observed more intuitively in Figure 4.
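Since the assembly listing of Figure 5 is not reproduced here, the following is a hypothetical sketch of how the three instructions could be sequenced. It assumes the iteration structure described in Section VI-A (n − 1 LP steps and n − 1 RP steps per iteration, with the last iteration ending in the final decision) and ignores loop-control overhead; only the mnemonics come from the paper.

```python
def polar_schedule(n, itr, swl):
    """Return a list of (mnemonic, repeat_count) pairs for one decoding run.

    n:   number of stages (log2 of the code length)
    itr: total number of iterations
    swl: sliding window length, i.e. cycles per propagation step
    """
    trace = []
    for it in range(itr):
        is_last = (it == itr - 1)
        for _ in range(n - 1):
            trace.append(("PLEFT", swl))       # one step of LP
        if is_last:
            trace.append(("PFIN", swl))        # final result decision
        else:
            for _ in range(n - 1):
                trace.append(("PRIGHT", swl))  # one step of RP
    return trace


def total_cycles(trace):
    """Sum the repeat counts, assuming one clock cycle per repetition."""
    return sum(rep for _, rep in trace)
```

For n = 10, itr = 15, and SWL = 1, this schedule takes 262 cycles, which agrees with the cycle-count expression given in the throughput analysis.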
B. TOP LEVEL ARCHITECTURE OF QFEC ASIP
The top level architecture of QFEC ASIP is shown in Figure 6. Compared with TASIP [11], the modules filled with gray remain completely unchanged (TSISO denotes the unified triple-mode SISO for LDPC, Turbo, and CC in TASIP). Apart from these modules, Polar address computation logic is added to the address generation unit (AGU) and the direct memory access (DMA) module. Polar permutation networks and PSISOs are newly introduced. The GMS is redesigned to meet the high bandwidth requirement of Polar decoding, and eight 128-bit×32 memory banks are adopted in each parallel window. From Figure 6, we can see that PSISOs are independent of TSISOs. PSISOs do not need any pipeline registers or buffers. If PSISOs and TSISOs were integrated, only a negligible amount of computation logic could be shared. The synthesis results show that the integration would save only 1.56% of the total area. Moreover, reusing computation logic would introduce more multiplexing routes into the SISOs, and the critical path of QFEC ASIP would extend from 2.9 ns to 3 ns (a 3.45% increase). Integrating PSISOs and TSISOs is therefore not considered beneficial, and we adopt independent PSISOs.
The red-outlined modules in Figure 6 are shared by the four decoding modes. Program control modules, including the program memory, program counter, instruction decoder, finite state machine, and parameter registers, are fully shared. The linear address computation logic in the AGU and DMA is also fully shared. However, these modules occupy less than 1% of the total area, whereas the GMS takes up more than 45% of the total area. Global memory sharing therefore contributes most to the improvement in hardware efficiency.
It is worth mentioning that TASIP adopts only five 8-bit×512 global memory banks in each parallel window for LDPC, Turbo, and CC mode (the design of the TASIP GMS can be found in [11]). The global memory bandwidth in one parallel window of TASIP is about 1/25 of that of QFEC ASIP (40 bits versus 1024 bits), whereas the global memory depth of TASIP is 16 times that of QFEC ASIP (512 versus 32). The completely opposite storage characteristics of Polar decoding mode and the other three modes make memory sharing very challenging. Data of LDPC, Turbo, and CC mode have to be reallocated to recover the function of these three modes after the GMS redesign. The solution to this problem is discussed in Section V-B.
C. POLAR SISO IMPLEMENTATION
The structure of the PSISO is shown in Figure 7, where "⊕" stands for addition. Polar decoding adopts the 2's complement data format. Negation is implemented as a bitwise NOT operation, and subtraction is implemented as addition after inversion. The implementation of f(x, y) is shown at the top left corner of Figure 7. Multiplication by 0.9375 is implemented as x − (x >> 4). Guard bits are introduced before addition and inversion operations to avoid data overflow. Therefore, data must be saturated before being written back, as shown at the bottom left corner of Figure 7. The final hard decision simply outputs the MSB of the addition result.
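The fixed-point operations described above can be sketched in software as follows. This is an illustrative model, not the circuit of Figure 7: it assumes the scaling is applied to the magnitude of the min-sum result, uses exact negation rather than the bitwise-NOT approximation, and models the guard bit by letting Python integers grow before saturation.

```python
def saturate8(v):
    """Clamp a guarded intermediate result to the 8-bit two's complement range."""
    return max(-128, min(127, v))


def scale_0p9375(mag):
    """Multiply a non-negative magnitude by 0.9375 as mag - (mag >> 4)."""
    return mag - (mag >> 4)


def f_fixed(x, y):
    """Scaled min-sum on 8-bit integer codes (scaling on the magnitude is assumed)."""
    mag = scale_0p9375(min(abs(x), abs(y)))
    return -mag if (x < 0) != (y < 0) else mag


def add_sat(a, b):
    """Add two 8-bit codes with one guard bit, then saturate before write back."""
    return saturate8(a + b)


def hard_decision(l0, r0):
    """Return the MSB (sign bit) of the 9-bit sum of the two column-0 messages."""
    s = l0 + r0            # fits in 9 bits for 8-bit inputs
    return (s >> 8) & 1    # arithmetic shift keeps the sign; 1 means negative
```

For instance, `scale_0p9375(16)` yields 15, which is exactly 16 × 0.9375, while `scale_0p9375(40)` yields 38 instead of 37.5 because of the truncating shift.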
V. QUAD-MODE CONFLICTION-FREE MEMORY SUBSYSTEM
The SIMD architecture proposed in the previous section improves the decoding speed through parallel computing, and its peak performance is reached when the SISOs are fully loaded every clock cycle. However, there are several potential memory confliction problems caused by the parallel data access requirements in Polar, LDPC, and Turbo mode.
During Polar decoding, one column of data can be used as left nodes or right nodes of the BCUs. The access patterns under these two conditions are different, and thus memory confliction will occur if the data are arranged simply in order. For LDPC mode, there are read-after-write conflicts for a posteriori messages access when forward recursion and backward recursion are processed in parallel. For Turbo mode, memory confliction may happen when the extrinsic messages and channel messages are fetched in parallel with the interleaved sequence.
In this section, we propose a Polar/LDPC/Turbo/CC quad-mode confliction-free global memory management scheme that avoids all of the memory confliction problems mentioned above. The memory management of Polar decoding mode is explained in detail in Section V-A. The memory sharing techniques of LDPC, Turbo, and CC decoding mode in TASIP (with five 8-bit×512 global memory banks in each parallel window) have been explained in detail in [18]. These techniques are kept in QFEC ASIP by means of the data reallocation presented in Section V-B. The four decoding modes can fully share the eight 128-bit×32 memory banks in each parallel window, and the memory sharing results are listed in Section VII.
We adopt single-port memories in QFEC ASIP to minimize the silicon cost. The memory usage of all FEC decoding instructions is summarized in Figure 8. Polar and LDPC instructions may read and write a memory bank at the same time, so the global memories have to work at twice the frequency of the SISO modules. Figure 2 implies three requirements for confliction-free memory access in Polar decoding. First, two neighbouring columns of data can be accessed at the same time. Second, node j and node j + N/2 in the left column of the BCUs can be accessed at the same time. Third, node 2j and node 2j + 1 in the right column of the BCUs can be accessed at the same time.
A. MEMORY MANAGEMENT FOR POLAR DECODING MODE
To ensure the flexibility, we propose a memory management scheme for Polar mode that is applicable to any P, n and SWL.
As shown in Figure 4, columns in the factor graph are separated into P parallel windows. The memory location of each column in parallel window p (0 ≤ p ≤ 15) is shown in Figure 9. Two neighbouring columns are accessed in the same clock cycle. Three examples of data arrangement inside one column are shown in Figure 10. As every SISO module calculates 32 BCUs at a time, 32 neighbouring nodes are regarded as the basic unit of data arrangement. When SWL = 1, data are simply arranged in order. When SWL > 1, the first half of the data is still arranged in order, but the second half is arranged in reverse order. The blocks filled with the same color are accessed together when this column is used as the left column of the BCUs. The blocks with the same address in all banks are accessed together when this column is used as the right column of the BCUs. The access addresses of each bank under different conditions are shown in Figure 11, where ad1 and ad2 are the start addresses of the columns, ad3 and ad4 are the corresponding offset addresses, and t is the sliding window address.
B. DATA REALLOCATION OF LDPC, TURBO AND CC DECODING MODE
As we mentioned at the beginning of this section, the memory sharing techniques of LDPC, Turbo, and CC decoding mode in TASIP are kept in QFEC ASIP. Data of LDPC, Turbo, and CC mode are reallocated to recover the function of these three modes after the global memory design modifications. The data reallocation of LDPC mode is shown in Figure 12(a). 40-bit data are saved in one memory address, and the remaining 88 bits are filled with 0. Both Turbo mode and CC mode adopt 8-bit operands, and thus these two modes share the same data reallocation scheme, as shown in Figure 12(b). After data reallocation, the new access addresses and chip selection signals of these modes are calculated on the basis of those in TASIP, and the access procedures are also changed. In LDPC mode, the redundant zeros are removed on memory reads and added on memory writes, as shown in Figure 13(a). In Turbo and CC mode, the required 8-bit data are selected from the 128-bit data on memory reads, and the 8-bit result replaces 8 bits in the 128-bit data on memory writes, as shown in Figure 13(b).
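The read and write procedures described above can be modeled with plain integer bit operations. The exact bit positions and address mapping are defined in Figures 12 and 13, which are not reproduced here, so the sketch below makes two loudly labeled assumptions: the 40 LDPC bits occupy the low end of the 128-bit word, and an old 8-bit×512 address is split into a row (address divided by 16) and a byte lane (remainder). All names are hypothetical.

```python
WORD_BITS = 128
LDPC_BITS = 40
LANES = WORD_BITS // 8                 # 16 byte lanes per 128-bit word
LDPC_MASK = (1 << LDPC_BITS) - 1


def ldpc_write(data40):
    """Zero-pad a 40-bit LDPC datum to a 128-bit word (low bits assumed)."""
    return data40 & LDPC_MASK


def ldpc_read(word128):
    """Drop the 88 padding bits and return the 40-bit LDPC datum."""
    return word128 & LDPC_MASK


def byte_read(word128, lane):
    """Select one 8-bit Turbo/CC operand from a 128-bit word."""
    return (word128 >> (8 * lane)) & 0xFF


def byte_write(word128, lane, value):
    """Replace one 8-bit field of a 128-bit word (read-modify-write)."""
    mask = 0xFF << (8 * lane)
    return (word128 & ~mask) | ((value & 0xFF) << (8 * lane))


def remap_address(old_addr):
    """Hypothetical mapping from an 8-bit x 512 address to (row, lane)."""
    return old_addr // LANES, old_addr % LANES
```

With this assumed mapping, the 512 addresses of one TASIP bank fill exactly the 32 rows of one 128-bit×32 bank.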
VI. POLAR DECODING PERFORMANCE ANALYSIS
In this section, we analyze the performance of the Polar decoding mode, including throughputs, BERs, and the computation complexity. The performance analysis of other three modes can be found in [11] , [17] , and [19] .
A. THROUGHPUT ANALYSIS
A complete iteration of Polar decoding contains n − 1 steps of LP and n − 1 steps of RP. The last iteration only contains n − 1 steps of LP and one step of final result judgement. Each step costs SWL clock cycles. Thus, the total time cost of Polar decoding is [(n − 1) × 2 × (itr. − 1) + n] × SWL clock cycles, where itr. is the total number of iterations. With code length $N = 2^n$, the throughput (TP) of the Polar decoding function can be calculated as (6),
$$TP = \frac{N \times f}{[(n - 1) \times 2 \times (itr. - 1) + n] \times SWL} \quad (6)$$
where f stands for the working frequency of the SISO module. $SWL = \frac{N/2}{32p} = \frac{N}{64p}$, where p represents the number of SISO modules that are actually used in the decoding computation. Thus, (6) can be rewritten as (7).
$$TP = \frac{64p \times f}{(n - 1) \times 2 \times (itr. - 1) + n} \quad (7)$$
The maximum working frequency of QFEC ASIP is 344 MHz. Figure 14 shows the relationship between code lengths and the corresponding maximum TPs when P = 16 and itr. = 15. The curve has two phases. When N < 1024, p is always less than 16, and the maximum TPs are achieved when SWL = 1. The characteristic of this phase is revealed by (6) , where maximum TP increases as N gets longer. When N = 1024, the hardware utilization rate just reaches 100%, and the maximum TP reaches the maximal value. When N > 1024, the maximum TPs are achieved when p = 16. The calculation of one propagation step takes up more than one clock cycle (SWL > 1). The characteristic of this phase is revealed by (7), where maximum TP decreases as N gets longer.
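The two-phase behaviour described above can be reproduced numerically from the cycle count stated in the text. The sketch below divides the N decoded bits by [(n − 1) × 2 × (itr. − 1) + n] × SWL cycles at frequency f; the rule used to pick p (use as many PSISOs as there are groups of 32 BCUs, up to P) is an assumption for illustration.

```python
def polar_throughput_mbps(code_length, itr=15, num_psiso=16, bcus_per_psiso=32,
                          f_mhz=344.0):
    """Estimate the maximum Polar throughput in Mb/s for a power-of-two length."""
    n = code_length.bit_length() - 1
    assert (1 << n) == code_length, "code length must be a power of two"
    bcus_per_stage = code_length // 2
    # p: PSISOs actually used (assumed: as many as needed, capped at num_psiso)
    p = min(num_psiso, max(1, bcus_per_stage // bcus_per_psiso))
    # SWL = (N/2) / (32 p), at least one cycle per propagation step
    swl = max(1, bcus_per_stage // (bcus_per_psiso * p))
    cycles = ((n - 1) * 2 * (itr - 1) + n) * swl
    return code_length * f_mhz / cycles


if __name__ == "__main__":
    for length in (64, 128, 256, 512, 1024, 2048, 4096):
        print(length, round(polar_throughput_mbps(length), 1))
```

For N = 1024 this gives about 1344.5 Mb/s, consistent with the reported maximum of 1345 Mb/s, and the value falls for both shorter and longer codes.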
The throughput results and the supported code length range shown in Figure 14 are valid only when P = 16, $P_s$ = 32, and eight 128-bit×32 global memory banks are used in each parallel window. This Polar decoding architecture is scalable. The supported code length range can be extended simply by increasing the depth of the memories, and higher throughput requirements in future high-speed communication standards can be met by increasing P or $P_s$.
B. BIT ERROR RATES
The BER performance of the decoding algorithm using floating point data and that of the 8-bit fixed-point hardware implementation are shown in Figure 15. Compared with the algorithm using floating point data, the fixed-point hardware has worse BER performance. The errors mainly come from the data quantization, the simplified f(x, y) function implementation, and the saturation operation. From Figure 15, we can see that the BER performance difference between the algorithm using floating point data and the 8-bit fixed-point hardware grows as the code length increases. This is mainly because more approximate calculations and saturation operations are executed by the hardware during the decoding process when the code length is larger. When N = 4096 bits, the 8-bit fixed-point hardware suffers a performance loss of 0.34 dB compared with the decoding algorithm using floating point data at a BER of $10^{-5}$.
C. COMPUTATION COMPLEXITY
The computation complexity is measured by the number of basic operations needed for decoding one bit (OPs/bit).
There are two kinds of basic operations in Polar decoding process. First, arithmetic computation operations, including addition, multiplication, division, absolute value computation, comparison and sign function. Second, memory operations, including loading a datum, storing a datum, and permuting a datum.
As we mentioned in Section III-A, the Polar decoding procedure includes three steps: LP, RP, and final judgement. The number of basic operations of each step is listed in Table 1. From the table, we can see that the number of basic operations of one step of LP (or RP) is 11SWL + 13N, and the number of basic operations of the final judgement is 2N. One normal iteration of decoding contains n − 1 steps of LP and n − 1 steps of RP. The last iteration of decoding contains n steps of LP and the final judgement. Thus, the total number of operations can be calculated as (8).
$$OP_{total} = [(n - 1) \times 2 \times (itr. - 1) + n] \times (11\,SWL + 13N) + 2N \quad (8)$$
When N = 1024 bits and itr. (the total number of iterations) is 15, the total number of operations is 3492674, and the computation complexity is 3411 OPs/bit.
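The quoted totals can be checked with a small calculation. The sketch below reconstructs the operation count from the per-step figures given in the text (11SWL + 13N per propagation step, 2N for the final judgement) and assumes SWL = 1 for N = 1024, as implied by the throughput analysis; the function names are illustrative.

```python
def polar_total_ops(code_length, itr, swl=1):
    """Total basic operations for one decoding run (reconstructed formula)."""
    n = code_length.bit_length() - 1
    steps = (n - 1) * 2 * (itr - 1) + n            # LP and RP steps in total
    ops_per_step = 11 * swl + 13 * code_length     # per-step count from Table 1
    final_ops = 2 * code_length                    # final judgement
    return steps * ops_per_step + final_ops


def ops_per_bit(code_length, itr, swl=1):
    """Computation complexity in basic operations per decoded bit."""
    return polar_total_ops(code_length, itr, swl) / code_length
```

For N = 1024 and 15 iterations this returns 3492674 operations, or about 3411 OPs/bit after rounding, matching the figures in the text.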
VII. RESULTS AND COMPARISON
QFEC ASIP is synthesized using Synopsys Design Compiler, and placed and routed using Cadence Encounter. A 65-nm CMOS technology (1.1 V, 120 °C) is used.
QFEC ASIP adopts 16 PSISOs and 12 TSISOs. The synthesis results show that the total area is 4.26 mm², of which the GMS takes up 45.8%, TSISOs take up 26.4%, PSISOs take up 12.5%, permutation networks take up 7.3%, ATU and LUTs take up 3.7%, pipeline registers and parameter registers take up 3.7%, and the rest of the control path takes up only 0.6%. All memory blocks, including PM, LUTs, global memories in the GMS, and buffers in TSISOs, occupy 44.8% of the total area.
The core layout of QFEC ASIP is shown in Figure 16 , and the floor plan density is 0.68. Most of the silicon area is due to the newly introduced Polar mode modules and the redesigned GMS. The decoding procedures of all decoding modes in QFEC ASIP are fully programmable, and all supported code configurations are listed in Table 2 . 
A. GLOBAL MEMORY SHARING RESULTS
Global memory sharing results in one parallel window are listed in Table 3. The unified quad-mode confliction-free memory subsystem saves 65.2% of global memory banks, 48.9% of global memory bits, and 29.7% of global memory area. The logic area of the memory interface increases by 4.9% because extra address transformation logic is introduced for the data reallocation of LDPC, Turbo, and CC mode. As a result, only 20.1% of the total GMS area is saved. The bandwidth requirement of Polar mode is much larger than those of the other modes, and only 9.2% of the total bandwidth is saved. In this paper, QFEC ASIP adopts the TSISOs of TASIP directly, and those TSISOs can only process 8 sub-layers in LDPC mode and 8 states in CC mode in parallel. In future work, the computation parallelism of the TSISO could be increased to improve the global memory bandwidth sharing ratio.
B. RESULTS COMPARISON
As we mentioned in Section I, no other Polar/LDPC/Turbo/CC quad-mode decoder had been reported in the literature as of June 2018. To evaluate the performance of QFEC ASIP, we provide two alternative comparisons in this section. QFEC ASIP is compared with the sum of TASIP and a single-mode Polar decoder to show the competitiveness of the newly designed Polar mode. QFEC ASIP is also compared with the sum of four single-mode decoders to demonstrate its high area efficiency.
1) COMPARISON WITH THE SUM OF TASIP AND A POLAR DECODER
QFEC ASIP keeps all decoding modes of TASIP, and implements the additional Polar decoding mode without reducing the performance of the other decoding modes. The newly added Polar decoding mode supports 64 bits ≤ N ≤ 4096 bits with an area increment of 2.14 mm² (the total area of TASIP is 2.12 mm² [11]), and the maximum throughput is 1345 Mbps.
There are mainly three kinds of Polar decoders: SC decoders, SCL decoders, and BP decoders.
SC decoders and SCL decoders have low computation complexity. However, they suffer from high decoding latency and low throughput because of the serial processing nature of the decoding algorithm. Coppolino et al. [20] proposed a multi-code SC Polar decoder that supports code lengths from 2 bits to 4096 bits with an area of 2.01 mm². The supported code length range is wider than that of QFEC ASIP, and the silicon cost is slightly smaller than 2.14 mm². However, the throughputs for code lengths that are powers of 2 are only around 350 Mbps, and the decoding latency (measured in clock cycles) is 10 times larger than that of QFEC ASIP. Yuan and Parhi [21] proposed a single-code SCL Polar decoder. It increases the decoding speed by determining $2^K$ bits at a time. The total area is only 0.62 mm² when the code length is 1024 bits. However, the maximum throughput is only 675 Mbps when K = 3, and the decoding latency is twice as large as that of QFEC ASIP.
Compared with SC decoders and SCL decoders, BP decoders have relatively high computation complexity and high storage requirements, but they can achieve high throughput because of their inherently high parallelism. Sha et al. [22] proposed a single-code stage-combined BP Polar decoder that halves the memory area compared with traditional BP decoders. However, its silicon area is 2.57 mm² (scaled to 65 nm technology) when N = 4096 bits. We can see that QFEC ASIP achieves better flexibility and a smaller area than the sum of TASIP and a single-mode BP Polar decoder.
2) COMPARISON WITH THE SUM OF FOUR SINGLE-MODE DECODERS
The parameters of QFEC ASIP and four single-mode decoders that together provide the similar function range as QFEC ASIP are listed in Table 4 , and QFEC ASIP is compared with the sum of these single-mode decoders. From Table 4 , we can see that QFEC ASIP achieves similar energy efficiency as the single-mode decoders. The normalized throughputs of LDPC mode, Turbo mode and CC mode (64 and 256 states) of QFEC ASIP are smaller than those of single-mode decoders. But QFEC ASIP achieves much smaller area than the sum of area of single-mode decoders, and thus achieves better area efficiency.
VIII. CONCLUSION
In this paper, we propose QFEC ASIP, a flexible Polar/LDPC/Turbo/CC quad-mode FEC ASIP, by adding a Polar decoding function to our previous triple-mode design, TASIP [11]. The newly added Polar decoding function supports any code rate and multiple code lengths (from 64 bits to 4096 bits). The proposed quad-mode confliction-free global memory system avoids all access conflict problems caused by the parallel data access requirements. Memory sharing among the four modes saves 65.2% of global memory banks, 48.9% of global memory bits, and 29.7% of global memory area. The synthesis result shows that the area of QFEC ASIP is 4.26 mm² in 65 nm technology. With improved data and SISO parallelization, QFEC ASIP can reach a maximum throughput of 1345 Mbps for Polar, 917 Mbps for LDPC (WiMAX), 320 Mbps for Turbo, and 387 Mbps for CC (64 states) at a clock frequency of 344 MHz. QFEC ASIP achieves better flexibility and a smaller area than the sum of TASIP and a Polar decoder. QFEC ASIP is also compared with the sum of four single-mode decoders that together provide a similar function range, and the results show that QFEC ASIP achieves a much smaller silicon area and better area efficiency.
Combined with the Reed-Solomon decoding function in another of our previous designs, namely BIT ASIP [26], our research provides programmable solutions for all important FEC decoding techniques in SDR systems. It is therefore of practical value for future terminals and base stations.
