Abstract-Recent power reduction techniques aggressively modulate the supply voltage of embedded buffering memories allowing acceptable hardware errors to flow through the processing chain. In this paper, we introduce a class of modified Turbo and LDPC decoders that provide significant improvements over standard decoders in the presence of hardware noise. Simulation results show a consistent improvement in the BER performance of the modified decoders across all SNRs with very small area and power overheads as compared to the conventional decoders.
I. INTRODUCTION
Recently, dynamic voltage and frequency scaling (DVFS) [1] has been widely used to reduce both the dynamic and leakage components of power consumption, which is a key factor for mobile applications. To improve the effectiveness of DVFS, aggressive voltage scaling (AVS) [3] has been proposed, where the range of voltage reduction is extended to a point that allows some degree of controlled hardware errors to propagate through the system, as long as an overall quality of service (QoS) metric is achieved. This approach ties the tolerated hardware error rate to the measured signal-to-noise ratio (SNR); i.e., at high SNR, a higher hardware error rate is acceptable, with the benefit of additional power saving.
In modern Systems on Chip (SoC) applications, a large portion of the design consists of embedded memory cores as documented by the International Technology Roadmap for Semiconductors (ITRS) report [2] . Thus, a significant portion of the SoC performance and area statistics are dominated by embedded memory statistics. Moreover, memories are highly structured both in their architecture and access patterns, and therefore, provide structured means of identifying the impact of hardware induced errors on the overall behavior of the system. With these facts in mind, several publications addressed the failure mechanism of embedded memories under aggressive voltage scaling [4] [5] [6] for the purposes of low power operation.
In previous publications [3, 4] , the authors have shown that by incorporating application knowledge in determining the required operational failure rates of memories, an overall system power savings of 20% can be achieved by relaxing the circuit constraints on the data buffering memories in a WCDMA modem. While these facts illustrate the benefits of AVS, they did not consider the impact of hardware errors on the Forward Error Correction (FEC) schemes that are traditionally optimized for Gaussian noise and do not account for the noise introduced by memory failures. In this paper, we address this issue by presenting a unified statistical model for the combined channel noise and memory failure noise. Additionally, we introduce the necessary modification in a class of iterative decoders; namely Turbo and LDPC decoders (which are widely used decoders in communication systems) to accommodate the new distribution. We show that with negligible overhead in the decoder structure, a significant gain in the bit error rate (BER) performance in the presence of faulty memories can be achieved.
The paper is organized as follows: Section II presents the system overview, a review of the memory error modeling, and the general formulation of the combined distribution of the memory error and channel noise. In Sections III and IV, we present the modifications to two widely used classes of iterative decoders, namely Turbo and LDPC decoders, respectively. Sections V and VI provide the simulation results and implementation overheads of the proposed decoders, and Section VII provides the concluding remarks.
II. SYSTEM OVERVIEW
Figure 1: System Model
To illustrate the idea, we introduce a simple communication system consisting of FEC encoding and BPSK modulation as shown in Figure 1. At the front end of the receiver, a buffering memory precedes the decoder. The communication channel can be either an AWGN or a fading channel. The buffering memory introduces random, uniformly distributed errors whose rate depends on the supply voltage scaling. The distribution of memory errors due to voltage over-scaling has been independently verified by numerous publications and confirmed by silicon trials [3]-[5]. For this work, we set up an HSPICE circuit simulation in 65 nm technology using predictive models to quantify the effect of voltage over-scaling on the error rates of embedded memories. Figure 2 illustrates the memory error rate as well as the power saving for different values of the supply voltage. Reducing the supply voltage by 25% from the nominal voltage (1.0 V) results in a power saving of 40% and a memory error probability of $10^{-6}$.
Figure 2: Memory error rate vs. supply voltage
The operating supply voltage can be adjusted based on the signal to noise ratio and the required QoS. At high SNR, a controllable amount of error can be introduced in the hardware (by voltage over-scaling) with the goal of reducing power, while maintaining the required BER. This additional hardware error can be considered as an extra contribution to the channel noise.
Depending on the noise distribution introduced by the channel and the noise introduced by the memory, the data at the memory output has a new distribution. The noise introduced by the memory depends on the memory error probability, as shown in Figure 2. Based on the resultant distribution, the FEC decoder is then modified to accommodate the hardware errors introduced in the memory.
Statistical Noise Model
The memory model assumes one quantized signed word per memory location with (N, r) quantization. The integer part consists of d bits while the fractional part has r bits such that $N = d + r$. We assume that the values stored in memory ($y$) have a general distribution $f(y)$. These values are the channel outputs after the A/D conversion. As a result, the distribution function is in fact a Probability Mass Function (PMF) that gives the probability of taking any of the $2^N$ possible values. In [4], it is shown that the errors in the memory (as a result of voltage over-scaling) are spatially uniformly distributed. Therefore, every bit stored in memory, at any bit position from 0 to $N-1$ of any location, has the same probability $P_e$ of being flipped to an incorrect value.
Considering one memory location with N bits, one or more bits can be flipped, and the distribution of the data read from the faulty memory is expressed as:
$$f_M(y) = \sum_{k=0}^{N} P_k \, f_k(y) \qquad (1)$$
where $P_k$ is the probability of having $k$ bit flips simultaneously and $f_k(y)$ is the distribution of the data when $k$ bit flips occur in one word, with $0 \le k \le N$. The probability of having k errors, or k bit flips, out of N bits is given by
$$P_k = P_e^{k} \, (1 - P_e)^{N-k} \qquad (2)$$
Equation (2) considers only one combination of k bit flips.
The overall effect of all k-bit flips is considered in (3). The detailed derivation of the memory output distribution is found in [7]. The generic k-bit flips distribution is
$$f_k(y) = \sum_{n_1 = 0}^{N-1} \; \sum_{n_2 > n_1} \cdots \sum_{n_k > n_{k-1}} f_{n_1, n_2, \ldots, n_k}(y) \qquad (3)$$
where in general $f_{n_1, n_2, \ldots, n_k}(y)$ is the resultant distribution when bit flips occur at the specific bit locations $n_1, n_2, \ldots, n_k$.
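The mixture described by (1)-(3) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the PMF is stored as a list indexed by the raw N-bit pattern of each word, and the function name `memory_output_pmf` is hypothetical. Every flip pattern (bit mask) with k set bits is weighted by the single-combination probability from (2), and summing over all masks covers every combination of every k.

```python
def memory_output_pmf(pmf, n_bits, p_e):
    """Return the PMF of words read from a memory with i.i.d. bit flips.

    pmf    : list of 2**n_bits probabilities, indexed by the stored bit pattern
             (assumed representation, chosen for illustration only)
    n_bits : word length N
    p_e    : probability that any single bit is flipped
    """
    size = 1 << n_bits
    assert len(pmf) == size
    out = [0.0] * size
    for mask in range(size):                 # each mask is one flip combination
        k = bin(mask).count("1")             # number of flipped bits
        weight = p_e ** k * (1.0 - p_e) ** (n_bits - k)
        for word, prob in enumerate(pmf):
            out[word ^ mask] += weight * prob  # XOR applies the bit flips
    return out


if __name__ == "__main__":
    # All probability mass on word 0 of a 3-bit memory, P_e = 0.1.
    result = memory_output_pmf([1.0, 0, 0, 0, 0, 0, 0, 0], 3, 0.1)
    print(result[0], result[4])
```

The cost grows as 4^N, which is acceptable for short words such as N = 8 but would need a per-bit convolution for wider words.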
It is important to note that the effect of memory error becomes more significant at higher SNR values as illustrated in Figure 3 , which depicts the effect on the data distribution with one bit flip. Furthermore, errors in the sign bit and most significant bits are more severe than errors in other locations. As indicated by the irregularities in the tail of the distributions shown in Figure 3 , these bit-flip/errors are projected in the data distribution as scaled, decoupled and shifted (at the location of the bit flip 2 ) versions of the original data distribution [7] . More details about the efficiency and accuracy of this model are mentioned in [7] . 
III. MODIFIED TURBO DECODER
In prior work, the authors considered the impact of memory errors on convolutional decoders [8] . In this paper, we consider iterative decoders. Specifically, this section focuses on Turbo decoders.
Generally speaking, a generic Turbo decoder architecture consists of two Soft Input Soft Output (SISO) decoders with interleavers. In the iterative decoding process, each SISO decoder generates a soft estimate of the information bit in the form of a Log Likelihood Ratio (LLR). Extrinsic information is generated by each SISO decoder and then passed to the other decoder after (de)interleaving. There are two widely used Turbo decoding schemes, namely Maximum A Posteriori (MAP) [9] and the Soft Output Viterbi Algorithm (SOVA) [10]. The SOVA based decoder requires less computation than the original MAP decoder at the cost of slightly degraded performance. A commonly used MAP approximation is the MAX-Log MAP algorithm [11], which can be considered as two Viterbi decoders: one in the forward direction and one in the backward direction. The basic idea of the MAX-Log MAP algorithm is to calculate the branch metrics (BMs) based on the channel values and the extrinsic likelihoods, as in (4).
Following that, the forward and backward state metrics (FW and BK) are updated based on the previous (next) forward (backward) metrics as well as the current branch metrics, as in (5) and (6), where M = {0, 1, …, N-1} for N states. The calculation of the forward (backward) metrics is a simple Add-Compare-Select (ACS) operation, which is radix-2 or radix-4 for binary and double binary codes respectively. The log likelihood ratios (LLR) are computed for different symbols based on equations (4)-(6).
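The forward and backward ACS recursions can be sketched as follows. This is an illustration of the textbook MAX-Log MAP state-metric update, not a reproduction of the paper's equations: the trellis representation (one dictionary of `(previous_state, next_state) -> branch metric` per time step), the start in state 0, the unknown end state, and the name `max_log_state_metrics` are all assumptions made for this example.

```python
NEG_INF = float("-inf")


def max_log_state_metrics(gamma, n_states):
    """Compute forward (FW) and backward (BK) state metrics by add-compare-select.

    gamma    : list over time steps; gamma[k] maps (s_prev, s_next) to the
               branch metric of that transition (hypothetical data layout)
    n_states : number of trellis states
    """
    n_steps = len(gamma)
    fw = [[NEG_INF] * n_states for _ in range(n_steps + 1)]
    bk = [[NEG_INF] * n_states for _ in range(n_steps + 1)]
    fw[0][0] = 0.0                    # assumption: trellis starts in state 0
    bk[n_steps] = [0.0] * n_states    # assumption: end state is unknown

    for k in range(n_steps):          # forward recursion
        for (sp, s), g in gamma[k].items():
            fw[k + 1][s] = max(fw[k + 1][s], fw[k][sp] + g)

    for k in range(n_steps - 1, -1, -1):  # backward recursion
        for (sp, s), g in gamma[k].items():
            bk[k][sp] = max(bk[k][sp], bk[k + 1][s] + g)

    return fw, bk


if __name__ == "__main__":
    gamma = [
        {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): -1.0},
        {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 3.0, (1, 1): 0.5},
    ]
    fw, bk = max_log_state_metrics(gamma, 2)
    print(fw[2], bk[0])
```

In this toy trellis the best path metric seen from the end of the forward pass, max(fw[2]), equals bk[0][0], the best path metric seen from the start of the backward pass, which is a quick consistency check for any implementation.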
A. Modified MAP algorithm
To simplify the discussion, we consider only the effect of the dominant error, i.e. a single error in the sign bit. The same analysis can be easily generalized to a higher number of bit flips, although as discussed in section II, the contribution of lower order bits (LSBs) is progressively insignificant. The channel probability contributes to the calculation of the branch metric term as shown in (8).
Where is the branch metric at k th time for a specific transition from state s' to state s in the trellis diagram, is the forward metric of state at k th time, is the backward metric of state at k th time, is the transmitted symbol corresponding to this specific transition, and is the received symbol at k th time. In fact, at each time slot, we have multiple transmitted and received symbols represented in systematic and parity bits. For simplicity, our notation refers to any received symbol at k th time. Now, consider is the received symbol at the memory output instead. For a fixed precision of , and , the effect of a single error at the sign bit is the dominant error over all other error combinations. This error introduces two artifacts in the distribution at 2 ⁄ and 2 ⁄ 2 ⁄ with height of P e .
Thus, the distribution is dominated by the mainlobe around 0 and the sidelobes around 2 ⁄ neglecting the term 2 ⁄ . So, from (1) the approximated new channel probability is represented by three terms as follows:
where 2 ⁄ , 0 1 , and 1 1 The remainder of the computations remains unchanged including the forward and backward metrics.
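For orientation, one plausible form of the three-term approximation, consistent with (1) and (2) when only a sign-bit flip is retained, is sketched below. The notation is assumed rather than taken from the original: $f(\cdot)$ is the error-free conditional distribution of the received value around the transmitted symbol, $\Delta$ is the shift caused by flipping the sign bit of an (N, r) word, and $P_0$, $P_1$ are the weights from (2) for $k = 0$ and $k = 1$. The exact expression used by the authors may differ.

```latex
% Assumed notation: f(.) error-free conditional distribution,
% \Delta shift due to a sign-bit flip, P_0 and P_1 from (2) with k = 0 and k = 1.
\[
P(y_k \mid x_k) \approx P_0 \, f(y_k - x_k)
  + P_1 \left[ f(y_k - x_k - \Delta) + f(y_k - x_k + \Delta) \right]
\]
\[
P_0 = (1 - P_e)^{N}, \qquad P_1 = P_e \, (1 - P_e)^{N-1}, \qquad \Delta = 2^{\,N-1-r}
\]
```

Only one of the two shifted terms is non-negligible for a given received value, since the other falls outside the representable range of the word.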
B. Architectural Modifications
When the MAX Log MAP algorithm is applied, the change in the distribution affects only the computation of the branch metrics and the extrinsic likelihood metrics. In the modified decoder, an additional MAX Log approximation step is applied prior to the conventional one in MAX Log algorithm. The log likelihood function is represented as Instead of the direct cross-correlation in the conventional branch metrics calculations, a new metric based on the new distribution is calculated for each symbol for both cases of transmitted symbols 1. Consider the j th transmitted symbol (systematic or parity) at any time, the new metric is:
For the approximated case, ln ln ln , ,
, 14
The summation in (14) The extrinsic likelihood computation is modified as follows:
where , is the output extrinsic likelihood for symbol i at time k, , is the input extrinsic likelihood for symbol i at time k, and , is the metric for the j th systematic bit at time k corresponding to an input value of i. Thus, the modified decoder requires a change in the branch metric computation unit as well as a change in the extrinsic likelihood computation unit. The rest of the decoder remains unchanged.
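The additional MAX-Log step applied to the per-symbol metric can be sketched as follows. This is an illustration under stated assumptions, not the authors' formula: the channel is taken as Gaussian with variance `sigma2`, only a sign-bit flip is modeled, the flip shifts the stored value by `delta` (for example 4.0 for (8,5) precision), and the mixture weights follow (2) with k = 0 and k = 1. The function name `modified_symbol_metric` is hypothetical.

```python
import math


def modified_symbol_metric(y, x, sigma2, p_e, n_bits, delta):
    """Max-log metric of received value y for hypothesized symbol x (+1 or -1).

    The log of the three-term mixture likelihood is approximated by the
    largest of its three log-domain terms (MAX-Log approximation).
    """
    p0 = (1.0 - p_e) ** n_bits                  # no bit flipped
    p1 = p_e * (1.0 - p_e) ** (n_bits - 1)      # only the sign bit flipped
    candidates = [math.log(p0) - (y - x) ** 2 / (2.0 * sigma2)]
    if p1 > 0.0:                                # skip side lobes when P_e = 0
        candidates.append(math.log(p1) - (y - x - delta) ** 2 / (2.0 * sigma2))
        candidates.append(math.log(p1) - (y - x + delta) ** 2 / (2.0 * sigma2))
    return max(candidates)


if __name__ == "__main__":
    # A value near -3 is better explained by x = +1 with a flipped sign bit
    # than by x = +1 with a very large noise sample.
    print(modified_symbol_metric(-3.0, 1.0, 0.5, 5e-3, 8, 4.0))
    print(modified_symbol_metric(1.0, 1.0, 0.5, 5e-3, 8, 4.0))
```

With `p_e = 0` the metric reduces to the conventional squared-distance term, so the conventional decoder is recovered as a special case.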
IV. MODIFIED LDPC DECODER
In this section, we propose the necessary changes in the LDPC decoder to accommodate the combined channel and memory error effect. A binary (n,k) LDPC code is a linear block code represented by a sparse parity check matrix H of size $(n-k) \times n$. The H matrix is represented by a Tanner graph with variable nodes and check nodes, where each 1 in H is represented by a connection between the corresponding variable node and check node. The original decoding scheme is based on the belief propagation approach [12], where soft messages are exchanged among the variable and check nodes. Some approximations were adopted to simplify the hardware implementation, such as min-sum decoding [13] and offset min-sum decoding [14]. Recently, a modified architecture for LDPC codes was proposed in [15] and [16] to enable Turbo-like decoding of LDPC codes. It is known as the Turbo Decoding Message Passing (TDMP) algorithm, and it assumes a certain structure of the H matrix. A summary of belief propagation, min-sum and TDMP decoding, as well as the modified versions for the combined error distribution, is given in the next subsections.
A. Belief propagation algorithm
Let M(n) denote the set of check nodes connected to the variable node n and N(m) denote the set of variable nodes connected to the check node m. The message from the variable node n to the check node m indicates the likelihood of the variable node n being zero or one. Similarly, the message from the m-th check node to the n-th variable node indicates the likelihood of the n-th variable node being zero or one. Assume that the transmitted sequence is [x_1, x_2, …, x_n], and the received sequence is [y_1, y_2, …, y_n]. In the log-domain, the LLRs exchanged from variable nodes to check nodes and from check nodes to variable nodes are defined as follows:
The decision is based on the overall LLR for the n-th variable node as follows:
The value of is quantized such that 1 if 0 and 0 if 0
In the min-sum approximation, the check node update is approximated as follows:
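A minimal sketch of the standard min-sum check node update is given here for reference. It is the textbook rule (sign product times minimum magnitude over all other edges) rather than a reconstruction of the paper's exact equation, and the function name `min_sum_check_update` is hypothetical. The check node is assumed to have at least two connected variable nodes.

```python
def min_sum_check_update(incoming):
    """Return check-to-variable LLRs for one check node.

    incoming : list of variable-to-check LLRs, one per connected variable node.
    For each edge, the outgoing message combines all *other* incoming messages:
    the product of their signs times the smallest of their magnitudes.
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]   # exclude the target edge
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign
        outgoing.append(sign * min(abs(value) for value in others))
    return outgoing


if __name__ == "__main__":
    print(min_sum_check_update([2.0, -0.5, 3.0]))
```

Hardware implementations usually track only the two smallest magnitudes and the overall sign instead of recomputing the minimum per edge; the direct form above is chosen for readability.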
B. Modified Belief propagation algorithm
To accommodate the combined error distribution, the decoder is modified only in the initialization of the LLRs. For the same fixed-precision analysis presented in Section III, and considering only the effect of a single error in the MSB, the initialization is:
23.
4/ . 24
This modification is considered as adding a preprocessing block prior to the decoder as shown in Figure 4 . The additional block adjusts the initial LLRs based on (24) while the rest of the decoder remains unchanged. As will be discussed later in the paper, the overhead of the computation is insignificant. This architecture illustrates that it is not necessary to build a new decoder from scratch; but rather add a small block at the decoder front-end to adjust the value of LLRs based on the new distribution. 
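The front-end preprocessing block can be sketched as a function that maps each stored channel value to an adjusted initial LLR. This is an illustration under assumptions, not the paper's equation (24): BPSK maps bit 0 to +1 and bit 1 to -1, the channel is Gaussian with variance `sigma2`, only a sign-bit flip is modeled with shift `delta`, and the mixture weights follow (2). The name `modified_initial_llr` is hypothetical.

```python
import math


def modified_initial_llr(y, sigma2, p_e, n_bits, delta):
    """Initial LLR ln P(y | x=+1) / P(y | x=-1) under a channel-plus-memory mixture."""
    p0 = (1.0 - p_e) ** n_bits                 # weight of the error-free term
    p1 = p_e * (1.0 - p_e) ** (n_bits - 1)     # weight of each sign-flip term

    def log_likelihood(x):
        terms = [(p0, y - x), (p1, y - x - delta), (p1, y - x + delta)]
        logs = [math.log(w) - d * d / (2.0 * sigma2) for w, d in terms if w > 0.0]
        peak = max(logs)                       # log-sum-exp for numerical safety
        return peak + math.log(sum(math.exp(v - peak) for v in logs))

    return log_likelihood(1.0) - log_likelihood(-1.0)


if __name__ == "__main__":
    # With P_e = 0 the result equals the conventional 2*y/sigma2.
    print(modified_initial_llr(0.7, 0.5, 0.0, 8, 4.0))
    # A strongly negative sample is treated with suspicion when P_e > 0.
    print(modified_initial_llr(-3.5, 0.5, 5e-3, 8, 4.0))
```

The second call shows the intended effect: a sample that the conventional rule would map to a very confident LLR of -14 is instead assigned a small-magnitude LLR, because a sign-bit flip is a more likely explanation than an extreme noise sample.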
C. Modified TDMP LDPC decoding
The TDMP architecture assumes a specific arrangement for the H matrix such that it consists of several sub-matrices. For each sub-matrix, each column and each row contains at most one nonzero element. This architecture enables the representation of the LDPC code as a parallel set of constituent codes (sub-codes) connected to each other through a set of interleavers. Hence, Turbo-like decoding is applicable. The details of the original TDMP algorithm are found in [15]. The same modification as in (24) is applied unchanged at the front end of the TDMP decoder to obtain the combined-noise-resilient TDMP decoder. The LLR vector is initialized as follows:
4/ 25
where and are computed according to (23.a) and (23.b) respectively.
V. SIMULATION RESULTS
To illustrate the benefits of the proposed system, we set up two simulation scenarios for two state-of-the-art standards. The first one is a WiMAX double binary Turbo decoder with block size N=240 and coding rate R=1/2. The second one is a DVB-S2 LDPC decoder with a block size of 64800 and a coding rate of 1/2. The idea can be applied to any other standard with any coding rate and block size. The simulation illustrates the decoders' performance for both AWGN and single-tap Rayleigh fading channel environments. The simulation assumes a memory error probability of $P_e = 5 \times 10^{-3}$. As shown in Figure 2, this value of $P_e$ achieves a considerable saving in the memory power (more than 40%).
A. Performance in AWGN environment
As shown in Figure 5, the dashed and solid lines represent the performance of the conventional and modified decoders, respectively. The system is simulated for (8,5) precision. The modified decoder improves the performance for all values of SNR. Furthermore, it does not suffer from the noise flooring effects observed in the conventional decoders. The modified decoder is approximately 0.5 dB away from the ideal one (with $P_e = 0$) at a BER of $10^{-6}$, while consuming much less power (a reduction of ~40%). This deviation from the ideal decoder can be smaller for lower values of $P_e$. The value of $P_e$, which depends on the voltage over-scaling, can be controlled based on the available slack in the received SNR. For example, if the received SNR is higher than the minimum required by the decoder to achieve the target BER, $P_e$ can be increased as long as the target BER is maintained. Additionally, further improvement in performance can be achieved by considering higher order bits, albeit at a cost in decoder complexity. This adaptive power management is the subject of current research by the authors. The same trends can be observed for the DVB-S2 LDPC decoder as shown in Figure 6.
B. Performance in Rayleigh Fading Channel
The decoder performance was tested under a Rayleigh fading channel, assuming perfect channel estimation and MMSE channel equalization. Figures 7 and 8 depict the BER performance of the Turbo and LDPC decoders, respectively, for multiple iterations. The same trends as in the AWGN case are observed.
VI. IMPLEMENTATION RESULTS
To estimate the overhead in hardware area and power, the additional modules were designed and synthesized using Synopsys Design Compiler for TSMC CMOS 65nm technology, with a 1V supply and operating at 200 MHz. Table 1 illustrates the overhead as compared to previously published decoders. The area overhead is reported in KGates, where the reference gate is a 2-input NAND gate with a drive X1. To maintain a fair comparison, power is scaled appropriately to a reference technology of 65 nm at 1V, 200 MHz.
Based on the results reported in [17] and [18] , the overheads in area are 0.82% and 0.013% for Turbo and LDPC respectively. The overheads in power consumption are 1.1% and 0.016%. 
VII. CONCLUSION
This paper presented an analytical framework for managing the combined channel and hardware noise that is induced by applying aggressive memory power management techniques. We presented the design and performance results of modified Turbo and LDPC decoders that are compatible with the new distribution. The modified decoders achieved significant performance improvement with a negligible overhead in area and power consumption.
VIII. REFERENCES

