In this paper, we present an FPGA implementation of a parallel-node low-density parity-check convolutional-code encoder and decoder. A 2.4 Gb/s rate-1/2 (3, 6) LDPC convolutional-code encoder and decoder were implemented on an Altera development and education board (DE4). Detailed power measurements of the FPGA board were conducted for various configurations of the design to characterize the power consumption of the decoder module. At an E_b/N_o of 5 dB, the decoder with 9 processor cores (pipelined decoder iteration stages) achieves a bit-error rate of 10^-10 and an energy-per-coded-bit of 1.683 nJ. Increasing E_b/N_o effectively reduces the decoder power and energy-per-coded-bit for configurations with 5 or more processor cores when E_b/N_o < 5 dB. The incremental decoder power cost and incremental energy-per-coded-bit also follow a linearly decreasing trend for each additional processor core.
INTRODUCTION
One of the challenges in designing a low-density parity-check (LDPC) decoder is the difficulty of exploiting the trade-off between throughput, power, and area. Past implementations of LDPC decoders have mostly used field-programmable gate arrays (FPGAs) as an intermediate tool for prototyping rather than as the final platform [1, 2, 3], because of their relatively slower clock speed, larger area, and higher power consumption compared to full-custom application-specific integrated-circuit (ASIC) implementations. Recent literature on FPGA-based LDPC decoders has shown improvements in decoding throughput, with some designs achieving throughputs of several gigabits per second [3, 4]. However, the lack of reported power measurements for existing FPGA implementations of LDPC decoders has made it difficult to analyze the gap in power consumption between FPGA and ASIC implementations, and to explore FPGA-specific architectural tradeoffs. This paper presents a set of detailed power measurements obtained with a simple measurement set-up on an FPGA implementation of a parallel-node low-density parity-check convolutional-code (PN-LDPC-CC) decoder [5]. The rest of the paper is organized as follows. Section 2 gives a brief background on the gap between FPGA and ASIC implementations and summarizes the performance results of existing FPGA implementations of LDPC decoders. Section 3 gives a brief background on LDPC-CCs and on the PN-LDPC-CC encoder and decoder architectures used in our experiment. Section 4 details our FPGA implementation of the PN-LDPC-CC decoder system and a simple power measurement method for characterizing its power consumption. Section 5 presents and discusses the measurement results of our FPGA implementation. Conclusions are presented in Section 6.
BACKGROUND

FPGA versus ASIC Implementation
Compared with designs implemented in ASICs, FPGA implementations tend to have higher power and area consumption and slower clock speeds. Kuon and Rose presented experimental measurements of area, speed, and power consumption to analyze the gap between ASICs and FPGAs [6]. They suggest that making use of the hard heterogeneous blocks available on FPGAs (such as memory and DSP blocks) can narrow the gap in area consumption, and likewise the gap in power consumption. However, they also suggest that the possibility of narrowing the gap in clock speed depends largely on how well the designs are tailored to the functionality of the DSP blocks.
An LDPC decoder based on stochastic decoding was presented by Tehrani et al. [2]. The decoder achieves a throughput of 706 Mb/s at an SNR of 3 dB. Chandrasetty et al. propose a modified 2-bit min-sum algorithm (MMS2), whose FPGA implementation achieves an average throughput of 10.2 Gb/s at 4 dB [3].
Chen et al. propose two FPGA-specific architectural techniques, vectorization and folding, for quasi-cyclic LDPC decoders [4]. Vectorization takes advantage of the configurable data width of the embedded memory on FPGAs by packing multiple messages into the same physical word, so that they are loaded and stored simultaneously. However, additional logic resources are needed to process concurrently the messages delivered in each memory access and to handle data alignment and addressing. The extra hardware may degrade performance, or the design may not even fit on the FPGA because of the alignment logic and interconnect complexity. A tool named QCSyn, which synthesizes a vector architecture for a given quasi-cyclic code, was also developed to ensure high resource utilization. Folding, on the other hand, is a memory virtualization technique that allows large LDPC codes to be implemented on commercially available FPGAs with a small number of block RAMs by mapping the messages corresponding to multiple sub-matrices onto the same physical block RAMs. Results of the implementations based on the vectorization technique are presented in Table 1.
PARALLEL-NODE LOW-DENSITY PARITY-CHECK CONVOLUTIONAL CODES
LDPC convolutional codes (LDPC-CC) were first proposed in [7]. An LDPC-CC shares a key characteristic with conventional convolutional codes: a code bit depends only on the present and previous information bits. An LDPC-CC can be characterized by a parity-check matrix, H. However, unlike LDPC block codes (LDPC-BC), which have finite-length parity-check matrices, the parity-check matrix of an LDPC-CC is infinite in length. As with an LDPC-BC, every valid codeword v of an LDPC-CC must satisfy vH^T = 0. The H matrix for an LDPC-CC with rate R = b/c (b < c) is shown in Figure 1, where H_i(t) (i = 0, 1, ..., m_s) denotes the sub-matrices of size c × (c − b) and the parameter m_s is called the code memory. The iterations of the belief-propagation decoding algorithm are implemented as pipelined "processor cores". Additional parallelism is achieved by designing the LDPC-CC to allow the encoding and decoding of ρ information bits per cycle, as demonstrated in [8].
Parallel-Node Low-Density Parity-Check Convolutional Code
Architecture-aware parallel-node low-density parity-check convolutional codes (PN-LDPC-CCs) were initially developed by Chen et al. [8]. Both implementation-oriented and performance-oriented constraints are applied in the construction of PN-LDPC-CCs to allow parallelism in the encoder and decoder architectures while preserving the bit-error-rate (BER) performance of the code. The code length, T_s, and the node-parallelization factor, ρ, are two important parameters of a PN-LDPC-CC. In [5], it is shown that ρ can be increased significantly with little impact on BER performance, and that the main factors affecting BER performance are the T_s of the code and the decoding algorithm.
A series of circuit optimizations to the PN-LDPC-CC encoder/decoder architecture presented in [8] is reported in [5], together with the corresponding synthesis results in terms of energy-per-encoded-bit versus throughput and area versus throughput for each improvement. In both the encoder and the decoder, only one phase, which corresponds to a specific row of the parity-check matrix, is processed per cycle; therefore, the throughput of the design can be defined as the product of the clock frequency and the node-parallelization factor, ρ, as shown in Equation (1).
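The throughput relation described above can be sketched as follows. This is a minimal illustration rather than a reproduction of Equation (1): the function names are hypothetical, and dividing by the code rate to obtain the coded throughput is an assumption, chosen because it matches the 2.4 Gb/s coded throughput reported for a 75 MHz clock, ρ = 16, and rate 1/2.

```python
def information_throughput(f_clk_hz: float, rho: int) -> float:
    """Information bits per second: one phase of rho information bits per clock cycle."""
    return f_clk_hz * rho


def coded_throughput(f_clk_hz: float, rho: int, rate: float) -> float:
    """Coded bits per second.

    Assumption: each information bit corresponds to 1/rate coded bits,
    so the coded throughput is the information throughput divided by the rate.
    """
    return information_throughput(f_clk_hz, rho) / rate


if __name__ == "__main__":
    # Parameters taken from the text: 75 MHz clock, rho = 16, rate 1/2.
    print(coded_throughput(75e6, 16, 0.5) / 1e9, "Gb/s")
```

Under these assumptions, the information throughput is 1.2 Gb/s and the coded throughput is twice that.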
PN-LDPC-CC Encoder
The PN-LDPC-CC-based encoder architecture presented in [8] is a partial syndrome encoder [9]. [...] paths between those inputs and outputs. A series of circuit optimizations to this initial architecture is presented in [5]. The first optimization step proposed in [5] reduces the O(ρ^2) implementation complexity of the initial encoder design by replacing the SW0 switches with fixed re-wiring and phase-gated inputs. This modification changes the shift chain of the design in [8] into a circular buffer. Next, a technique called "gate-swapping" is applied to the simplified design: the 3-input XOR is conditionally replaced by a 2-input XOR and a 2-input OR gate when the phase-gated "info" and "parity" updates inside an encoder node do not occur in the same phase. This technique produces a small reduction in power and area. Furthermore, latch-based clock-gating [10] is applied to the encoder node registers of the gate-swapped design to further reduce dynamic power. The resulting architecture has a slight reduction in throughput and a significant reduction in power consumption for all codes.
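The logic identity behind gate-swapping can be checked with a short sketch. The signal names and the convention that a phase-gated input is 0 when inactive are assumptions made for illustration; the actual encoder node in [5] may differ in detail.

```python
from itertools import product


def xor3(state: int, info: int, parity: int) -> int:
    """Original node update: 3-input XOR of the stored bit and both gated inputs."""
    return state ^ info ^ parity


def xor2_or2(state: int, info: int, parity: int) -> int:
    """Gate-swapped update: 2-input OR of the gated inputs feeding a 2-input XOR."""
    return state ^ (info | parity)


def equivalent_when_exclusive() -> bool:
    """True if both forms agree whenever info and parity are never 1 together."""
    return all(
        xor3(s, i, p) == xor2_or2(s, i, p)
        for s, i, p in product((0, 1), repeat=3)
        if not (i and p)
    )


def differs_when_overlapping() -> bool:
    """True if the forms disagree when both gated inputs are active at once."""
    return any(xor3(s, 1, 1) != xor2_or2(s, 1, 1) for s in (0, 1))


if __name__ == "__main__":
    print(equivalent_when_exclusive(), differs_when_overlapping())
```

The two forms agree only when the gated inputs are mutually exclusive, which is why the swap is conditional on the "info" and "parity" updates falling in different phases.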
PN-LDPC-CC Decoder
The decoder architecture presented in [8] has the same relationship between ρ, throughput, and energy-per-coded-bit as the encoder in Section 3.2. The decoder is formed by cascading identical decoder processors in series. A single decoder processor consists of variable-node (VN) units, check-node (CN) units, and multiple memories. For a PN-LDPC-CC decoder with ρ > 1, each decoder processor contains a rotation switch matrix (SW1) and multiple copies of the VN and CN units. Based on the phase, φ, SW1 performs the rotation and reverse-rotation operations on the log-likelihood ratio (LLR) data routed between the CN units and the memories.
In [5], Brandon proposes a series of improvements to the design presented in [8]. The first step removes the saturation bit from the sign-magnitude representation of the LLRs, which reduces power consumption and area but also results in a small loss in BER performance. Next, the rotation switch matrix of the initial design is eliminated by re-arranging the memory. Removing the rotation switch matrix not only reduces the dependence of the hardware cost on ρ to near-linear, but also further reduces power consumption and area. The following step applies clock-gating, which further reduces both area and energy-per-decoded-bit. By replacing the function of the saturation bit with the maximum magnitude and additional control circuitry, the reset circuitry and memory initialization can also be eliminated from the design. Although removing the reset circuitry has a minimal effect on the energy-per-decoded-bit, it does further reduce area. The final optimization is the truncated min-sum (TMS) check-node operation. Unlike the offset min-sum operation, TMS subtracts a constant from the check-node LLR magnitudes only when the least-significant bit (LSB) of the LLR is set. The subtracted constant is equal to the maximum value that the LSB can represent. TMS decreases the energy-per-decoded-bit and area. The final architecture, which contains all of the aforementioned improvements, is provided in [5] and is the one used in our experiment.
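The TMS rule described above can be illustrated with a small sketch. It is not the circuit from [5]: applying the truncation to the outgoing minimum magnitude, the 3-bit magnitude range, and all function names are assumptions made for illustration.

```python
def tms(magnitude: int) -> int:
    """Truncated min-sum correction on an integer magnitude.

    Subtracts one LSB unit only when the LSB is set, which is the same
    as clearing the LSB. Offset min-sum would subtract unconditionally.
    """
    return magnitude - 1 if magnitude & 1 else magnitude


def check_node_update(llrs):
    """Min-sum check-node update with TMS on sign-magnitude LLRs.

    llrs: list of (sign, magnitude) pairs, sign in {0, 1}, magnitude in 0..7.
    Returns one extrinsic (sign, magnitude) pair per incoming edge.
    Assumption: TMS is applied to each outgoing magnitude.
    """
    outputs = []
    for k in range(len(llrs)):
        others = llrs[:k] + llrs[k + 1:]
        sign = 0
        for s, _ in others:
            sign ^= s                          # parity of the other signs
        magnitude = min(m for _, m in others)  # minimum of the other magnitudes
        outputs.append((sign, tms(magnitude)))
    return outputs


if __name__ == "__main__":
    print(check_node_update([(0, 5), (1, 3), (0, 7)]))
```

Because the correction amounts to clearing one bit, it needs no subtractor, which is consistent with the reported reduction in area and energy.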
FPGA IMPLEMENTATION
System Design
The design is written in Verilog HDL and compiled using Altera Quartus II Version 11.0 for implementation on the DE4, an Altera development and education board that features a Stratix IV 4SGX230 FPGA. An encoder and decoder based on a rate-1/2 (3,6) PN-LDPC-CC with T_s = 192 and ρ = 16 [8] are implemented on the DE4 using the final encoder and decoder architectures described in Sections 3.2 and 3.3, along with an additive white Gaussian noise (AWGN) channel model, a pseudorandom pattern generator (PRPG), a first-in first-out (FIFO) buffer, and a simple error counter.
In this design, ρ test bits are generated by the PRPG in each cycle. A copy of the ρ test bits is fed into both the encoder module and the FIFO buffer. For every ρ information bits entering the encoder, the encoder generates ρ code bits. The noise output from the AWGN generator of the channel module is scaled according to the desired SNR value and added to the info and code bits generated by the encoder. The sum of the signal and noise is then scaled by a linear function and quantized to 4-bit LLRs, where each LLR consists of one sign bit and three fractional magnitude bits. The quantized LLRs are then fed into the decoder module. The decoded data are compared with the original data from the FIFO, which were previously fed to the encoder. The error counter keeps track of the number of information bits and detected errors until a pre-defined target number of errors is reached.
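The channel and quantization steps can be sketched in software as follows. This is an illustrative model, not the hardware implementation: the BPSK mapping, the noise-variance formula for unit-energy symbols, the truncating quantizer, the scale factor, and all names are assumptions.

```python
import math
import random


def noise_sigma(ebno_db: float, rate: float) -> float:
    """AWGN standard deviation for unit-energy BPSK symbols at a given E_b/N_o (assumption)."""
    ebno = 10.0 ** (ebno_db / 10.0)
    return math.sqrt(1.0 / (2.0 * rate * ebno))


def channel_sample(bit: int, sigma: float, rng: random.Random) -> float:
    """Map a bit to +1/-1 (0 -> +1, an assumed convention) and add Gaussian noise."""
    return (1.0 - 2.0 * bit) + rng.gauss(0.0, sigma)


def quantize_llr(x: float, scale: float = 1.0):
    """Quantize a channel sample to a 4-bit sign-magnitude LLR.

    Returns (sign, magnitude), where magnitude is an integer 0..7 counting
    eighths (three fractional bits). Truncation and saturation at 7/8 are
    assumptions; `scale` is a hypothetical linear scaling factor.
    """
    y = scale * x
    sign = 1 if y < 0 else 0
    magnitude = min(int(abs(y) * 8), 7)
    return sign, magnitude


if __name__ == "__main__":
    rng = random.Random(1)
    sigma = noise_sigma(5.0, 0.5)
    print([quantize_llr(channel_sample(b, sigma, rng), scale=0.5) for b in (0, 1, 1, 0)])
```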
A total of 11 configurations of the design, each containing a different number of decoder processors, have been implemented at a fixed clock frequency of 75 MHz using the default compilation options in Quartus II to keep the experiment consistent. To avoid re-compiling each configuration for every combination of parameters, such as the pre-defined SNR and the target number of detected errors, a JTAG interface [11] allows registers to be modified after a design is compiled and downloaded onto the FPGA. The JTAG interface also simplifies the gathering of BER data, since the registers that store the total error counts and the elapsed clock cycle counts can be read for each combination of SNR and target number of detected errors.
Power Measurement
To determine the power consumption of the decoder module, we measure the total power consumed by the entire FPGA board. The power measurement of the DE4 is made possible by inserting a custom unit, consisting of a pair of 8-pin Molex power connectors and the necessary wiring, between the Molex receptacle of the AC/DC power supply and the matching Molex header on the FPGA board. The custom unit places two 0.01-Ω (1% tolerance) current-sense resistors in series with the 12-V power rail. The voltage drops measured across these resistors are used to calculate the total current drawn by the DE4, and an additional digital multimeter measures the input voltage of the FPGA board, V_VCC12_CON. Detailed circuit connections are shown in Figure 2. The total power consumed by the FPGA board is calculated using Equation (2). Three sets of current and voltage readings are captured for each case, and the average of the three resulting power values is taken as the board power consumption for that case. The 12-V supply powers several components and DC-DC converters, which in turn power several more components. All other sources of power consumption on the FPGA board are kept constant; only the number of decoder cores is varied. Therefore, for each E_b/N_o value, the measured power of the "0core" configuration, which contains all the modules except the decoder, serves as the base case for that E_b/N_o and is subtracted from the corresponding total measured board power to obtain the decoder power. Since the clock frequency for all our experiments is fixed at 75 MHz, the resulting coded throughput of all configurations is 2.4 Gbit/s, using Equation (1). The energy-per-coded-bit is calculated using Equation (3).
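The measurement procedure above can be summarized in a short sketch. It is not a reproduction of Equations (2) and (3): the function names are hypothetical, and summing the currents through the two sense resistors assumes that each resistor sits in a separate 12-V conductor.

```python
R_SENSE_OHM = 0.01               # current-sense resistor value from the text
CODED_THROUGHPUT_BPS = 2.4e9     # coded throughput reported for all configurations


def board_power(v_rail: float, v_drops) -> float:
    """Board power from the rail voltage and the drop across each sense resistor.

    Assumption: the resistors carry separate shares of the supply current,
    so the individual currents add up to the total current.
    """
    total_current = sum(v / R_SENSE_OHM for v in v_drops)
    return v_rail * total_current


def mean_board_power(readings) -> float:
    """Average board power over several (v_rail, v_drops) readings."""
    return sum(board_power(v, drops) for v, drops in readings) / len(readings)


def decoder_power(p_board_n_core: float, p_board_0_core: float) -> float:
    """Decoder power: board power minus the "0core" baseline at the same E_b/N_o."""
    return p_board_n_core - p_board_0_core


def energy_per_coded_bit(p_decoder_w: float,
                         throughput_bps: float = CODED_THROUGHPUT_BPS) -> float:
    """Energy per coded bit in joules: decoder power divided by coded throughput."""
    return p_decoder_w / throughput_bps
```

As a consistency check under these assumptions, the reported 1.683 nJ per coded bit at 2.4 Gb/s corresponds to a decoder power of about 4.04 W.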
MEASUREMENT RESULTS AND DISCUSSION
The BER performance of the FPGA implementation is plotted in Figure 3. The configurations with 9 and 10 decoder cores achieve a BER of 10^-10 or better at an E_b/N_o of 5 dB. The decoder power consumption and energy-per-coded-bit of each configuration are plotted against E_b/N_o in Figure 4 and against the number of decoder cores in Figure 5. In Figure 4, the reduction in decoder power consumption and energy-per-coded-bit with increasing E_b/N_o is most apparent for the configurations with 5 or more decoder processors at E_b/N_o < 6 dB. In Figure 5, the decoder power and energy-per-coded-bit show a close-to-linear relationship with the number of processor cores.
To further analyze the relationship between decoder power consumption, energy-per-coded-bit, E_b/N_o, and the number of processor cores, we compute incremental decoder power and energy-per-coded-bit values by subtracting the values at the next lower E_b/N_o, or for the (N − 1)-core configuration, from the values at the current E_b/N_o, or for the N-core configuration. These incremental values are plotted in Figures 6 and 7. In Figure 6, the incremental decoder power and energy-per-coded-bit gradually decrease for every dB of increase in E_b/N_o, with the exception of the values for the "1core" configuration. In Figure 7, we observe a generally decreasing trend in the incremental decoder power and energy-per-coded-bit for each additional processor core.
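The incremental values are first differences along one axis of the measurement grid, which can be sketched as follows. The function name is hypothetical, and the numbers in the usage example are placeholders, not measured data.

```python
def incremental(values):
    """First differences of a sequence ordered by core count or by E_b/N_o.

    Element k of the result is values[k + 1] - values[k], i.e. the change
    when moving from the (N - 1)-core to the N-core configuration, or from
    the next lower E_b/N_o to the current one.
    """
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    # Placeholder decoder powers in watts for 0, 1, 2, and 3 cores (not measured data).
    placeholder_power_w = [0.0, 0.5, 0.9, 1.2]
    print(incremental(placeholder_power_w))
```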
In both Figures 6 and 7, the behaviour of the incremental values for "9core" differs significantly from that of the other curves in the same plots. A possible explanation is the inherent randomness in the CAD methodologies used in Quartus II for placement and routing. In Figure 6, although the overall incremental values for "9core" are lower than those of the other configurations, the curve itself still maintains [...] In Figure 7, at the 9th core, the reduction in decoder power and energy-per-coded-bit from the 8th core is significantly greater than that of the other transitions. Furthermore, Figure 8 shows that the increase in incremental Logic Utilization from the 8th core to the 9th core is significantly greater than that of the others. This behaviour of the "9core" configuration seems to suggest that a higher Logic Utilization can result in lower decoder power and energy-per-coded-bit, but it could also simply be the result of the inherent randomness in the CAD algorithms used in Quartus II. We tested this by performing additional compilations of the "8core", "9core", and "10core" configurations with different random number seeds specified at the Fitter stage in Quartus II. The resulting incremental Logic Utilization and incremental decoder power values varied by 11% and 0.362 W, respectively, across seeds and showed no consistent trend in their relationship, which suggests that the earlier observed relationship is most likely caused by the randomness in the synthesis tools.
We also attempted to measure the encoder power consumption. However, using our power measurement method, the configuration without an encoder turns out to consume slightly more power than the configuration with an encoder. We conclude that the power consumption of the encoder module is minimal in comparison to the FPGA power fluctuations caused by the randomness in the synthesis tools. We will further quantify and explore this behaviour in future work.
To summarize, at a BER of 10^-10, adding one core to the "9core" configuration would cost an additional 0.457 W, while the improvement in BER performance is less than 5 −11. An increase of 0.25 dB or 0.50 dB in E_b/N_o would allow the "8core" or "7core" configuration, respectively, to achieve the same BER performance while saving 0.224 W or 0.350 W compared to the "9core" configuration.
In Table 1, the result from this work with 9 decoder cores at an E_b/N_o of 4.25 dB is summarized along with the results of the existing LDPC-BC decoders described in Section 2.2. Our work achieves a higher throughput than both implementations from [4] at a lower clock frequency. The average incremental energy-per-coded-bit of our design for 7 to 10 decoder cores at E_b/N_o [...] dB is approximately 0.15 nJ per additional core. Using this approximate value, our design with 15 cores at an E_b/N_o between 4 and 5 dB would consume around 2.61 nJ per coded bit, which is around half of what is consumed by the first implementation from [4].
CONCLUSION
In this paper, we presented an FPGA implementation of a rate-1/2 (3,6) PN-LDPC-CC with T_s = 192 and ρ = 16, together with a power measurement technique and model for determining the decoder power consumption. We conclude that increasing E_b/N_o is more effective at reducing the decoder power and energy-per-coded-bit for configurations with 5 or more cores, before E_b/N_o exceeds 5 dB. A decreasing incremental decoder power cost has also been observed for each additional processor core. Using our power measurement method, we have been able to observe the general tradeoffs between decoder power, energy-per-coded-bit, E_b/N_o, and the number of processor cores for our design with the chosen PN-LDPC-CC. This power measurement method is general enough to be applied to other FPGA implementations of LDPC decoders. Directions for future work include improving the existing power measurement method to minimize the fluctuation in measured FPGA power, so that the small changes in power for smaller configurations can be better captured, and using the improved method to better characterize the tradeoffs between power consumption and the effects of FPGA-specific techniques applied to the design of LDPC decoders.
