I. INTRODUCTION AND MOTIVATION

Since their rediscovery by MacKay, low-density parity-check (LDPC) codes have been extensively adopted in next-generation wired and wireless standards due to their near-Shannon-limit performance. Moreover, LDPC codes over non-binary alphabets outperform their binary counterparts, given proper encoding design and code length [2]. However, this significant improvement comes at the cost of high decoding complexity. The locally optimal, yet most complex, iterative decoding algorithm for non-binary LDPC codes is the belief propagation (BP) algorithm.
Since the size of the messages varies with the size of the alphabet, a straightforward implementation of BP results in memory and complexity requirements that grow linearly and quadratically with the alphabet size, respectively. To reduce the complexity of non-binary decoding, several suboptimal decoding schemes have been proposed in recent years.
The first straightforward simplification is obtained at the check nodes by replacing the discrete convolution of messages, whose complexity is quadratic in the alphabet size, with the product of the message Fourier transforms. The use of the FFT brings the complexity down to log-linear in the alphabet size. In [3], the authors introduce a log-domain version of this approach that has advantages in terms of numerical stability. Further simplifications have been proposed in [4] with the Extended Min Sum (EMS) algorithm, where message vectors are reduced in size by keeping only the most reliable elements of the alphabet. In [5], the same authors propose a hardware implementation of the EMS decoding algorithm for non-binary LDPC codes. In [6], the Min-Max algorithm is introduced together with a reduced-complexity architecture called selective implementation, which can reduce the operations required at the check nodes by a factor of 4; however, the complexity still grows with the alphabet size. Several studies on the VLSI implementation of non-binary decoders based on these algorithms have been presented in the literature [7]-[15]. The results of these studies confirm that all such non-binary decoders require a complexity that grows with the size of the alphabet.
The analog digital belief propagation (ADBP) algorithm proposed in [1] represents a breakthrough in the reduction of complexity and memory requirements with respect to previously proposed algorithms: for ADBP, both are independent of the size of the alphabet. The main simplification of ADBP is that messages are not stored as a vector, as long as the alphabet, containing the likelihoods of the discrete variables (or equivalently their log-likelihood ratios, LLRs), but rather as the two moments, or related quantities, of a suitable predefined class of Gaussian-like distributions. ADBP can be cast into the general class of expectation-propagation algorithms described by Minka [16]. The main contribution of [1] is the definition of a suitable class of distributions for the messages and the derivation of the updating equations for the message parameters at the sum and repetition operations of the Tanner graph.
It should be noted that ADBP cannot be applied to all types of non-binary linear codes, as multiplications by elements different from 1 are not allowed in the graph. The resulting ensemble of codes has been analyzed in [17] (where it is named the "binary" LDPC ensemble) and in [18] (where it is named modulo-LDPC). Both papers show that the ensemble is capacity achieving, as its distance spectrum approaches that of random codes as the underlying graph connectivity grows.
The exact ADBP updating equations, however, are not suitable for a straightforward implementation due to the presence of complex nonlinear operations. Some simplifications of the updating equations have been presented in [19]. In this paper, we demonstrate the practical feasibility of ADBP decoding and provide post-synthesis results for the hardware implementation of the required processing functions. We do not provide results for the whole decoder, which could be organized by reusing architectures already proposed for binary LDPC decoders, such as [20]. In Section II we report the exact updating equations of ADBP and consider its application to the decoding of non-binary codes. In Section III we introduce some simplifications of the updating equations and evaluate their impact on the performance of the decoder. In Section IV we present the results of the fixed-point implementation of ADBP, obtained by optimizing the bit width of the input, output and intermediate quantities of the decoder processing elements. In Section V we report the architecture of the designed core processors and the synthesis results for alphabet sizes up to 64. The results confirm that the implementation of ADBP is feasible with small complexity and, more importantly, that this complexity is independent of the alphabet size.
II. THE ADBP ALGORITHM
ADBP is a particular version of the BP algorithm that performs BP very efficiently on linear systems whose variables can be discrete, wrapped, or both.
From the complexity point of view, ADBP is equivalent to Gaussian BP over linear systems as described in [21]. Let us define the class of Gaussian messages, each parameterized by its mean and its concentration. With continuous variables and messages belonging to this class, simple updating rules for linear real systems can be derived for the sum (1), repetition (2) and axis scaling (3) operations.
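For ordinary (unwrapped, continuous) Gaussian messages, the three rules have a well-known closed form, sketched below in Python. The sketch assumes that the concentration is the inverse of the variance; the class and function names, and the symbol `alpha` for the scaling factor, are illustrative choices and are not taken from [1] or [21].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianMessage:
    mean: float           # location of the Gaussian
    concentration: float  # assumed here to be 1 / variance


def sum_update(a: GaussianMessage, b: GaussianMessage) -> GaussianMessage:
    """Message for z = x + y: means add and variances add."""
    k = a.concentration * b.concentration / (a.concentration + b.concentration)
    return GaussianMessage(a.mean + b.mean, k)


def repetition_update(a: GaussianMessage, b: GaussianMessage) -> GaussianMessage:
    """Message for x = y = z (product of densities): concentrations add,
    and the mean is the concentration-weighted average of the input means."""
    k = a.concentration + b.concentration
    mean = (a.concentration * a.mean + b.concentration * b.mean) / k
    return GaussianMessage(mean, k)


def scaling_update(a: GaussianMessage, alpha: float) -> GaussianMessage:
    """Message for y = alpha * x: the mean scales by alpha and the
    variance by alpha**2, so the concentration is divided by alpha**2."""
    return GaussianMessage(alpha * a.mean, a.concentration / alpha ** 2)
```

Each update touches only two numbers per message, which is why the cost does not depend on how finely the underlying variable is discretized.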
In [21], it is shown that several powerful estimation techniques can be derived as particular instances of this algorithm. ADBP, introduced in [1], adds the possibility of wrapping and/or discretizing the random variables involved in the system. Indeed, in the considered linear system the quantities are bounded and the ordinary sum is replaced by a modular sum.
The modular sum induces a wrapping of the variable axis, which in turn wraps the corresponding messages and requires the use of the class of wrapped Gaussian messages, obtained by applying the wrapping operator (4) to a Gaussian message. In (4), the message is convolved with a train of pulses spaced by the wrapping period. The discretization of the system variables, on the other hand, induces a sampling of the corresponding messages, leading to the class of sampled Gaussian messages, obtained by applying a sampling operator with a given sampling period. Notice that the Fourier transforms of sampled Gaussian messages are wrapped Gaussian messages with imaginary mean.
When the wrapping period is a multiple of the sampling interval, the wrapping and sampling operators commute, so that it is possible to introduce the class of digital messages, or D-messages, which consists of messages that are both wrapped and sampled.
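To make the notion of a message that is both wrapped and sampled concrete, the following Python sketch expands one into the probability vector that a conventional non-binary BP decoder would store. It assumes a unit sampling interval, a wrapping period equal to the alphabet size, and a truncated sum over replicas; the function name `d_message` and its parameters are hypothetical and serve only as illustration.

```python
import math


def d_message(mean, concentration, period, replicas=8):
    """Probability vector over {0, ..., period - 1} of a Gaussian that is
    wrapped with the given integer period and sampled at the integers.

    Assumptions of this sketch: unit sampling interval, concentration equal
    to the inverse variance, and replicas truncated to +/- `replicas`.
    """
    weights = []
    for k in range(period):
        w = sum(
            math.exp(-0.5 * concentration * (k + n * period - mean) ** 2)
            for n in range(-replicas, replicas + 1)
        )
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]
```

ADBP never builds this vector: it propagates only the two parameters passed to the function, which is the source of the memory saving discussed in the introduction.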
Examples of linear systems that use discrete and wrapped variables, i.e. integers in a bounded range, are linear non-binary encoders whose codewords satisfy a set of parity-check equations in which the sum is modular and the coefficients of the parity-check matrix are restricted to a prescribed set. It was shown in [1] that the operations of circular convolution and multiplication of messages, corresponding to the BP updates at the check and repetition nodes, respectively, are "almost" closed with respect to D-messages, in the sense that output messages can be well represented within the same class as the input messages. If multiplications by a coefficient different from 1, which induce a permutation of the order of the symbols, are avoided in the Tanner graph, then ADBP, which forces the messages to stay within the class of D-messages, performs almost as well as exact BP. ADBP using D-messages can then be applied to the iterative decoding of members of this code ensemble, yielding decoders whose complexity is independent of the cardinality of the alphabet.
In [1] it is shown that, in contrast to what happens for Gaussian messages (see (1)-(3)), wrapped messages are not closed under multiplication and sampled Gaussian messages are not closed under convolution. As a consequence, D-messages are not closed with respect to the repetition and sum operations. In particular, while the update (2) is reliable for wrapped messages with large concentrations, it fails to provide accurate results when the concentration is small. This is due to the non-negligible aliasing among the replicas introduced by the wrapping operator (4). For the same reason, (1) fails to provide accurate results for sampled Gaussian messages with high concentration.
However, accurate approximations of the output messages belonging to the same class as the inputs can be found in both cases. This is obtained by exploiting the correspondence between members of the class of wrapped Gaussian messages and those of the class of Von Mises, or Tychonov, messages [22] with the same period.
The mapping between the two distributions preserves the mean and transforms the concentration according to (5) and (6), which relate the concentrations of the wrapped Gaussian and Von Mises messages through the modified Bessel functions.
In summary, skipping the details (reported in [1]), the output distribution at the repetition operation can be approximated for low message concentrations by (7). Similarly, a good approximation of the message for the sum operation, valid for large message concentrations, can be obtained by exploiting the same correspondence in the transform domain, yielding (8), where a real quantity is expressed by separating its integer and fractional parts. The ADBP algorithm that uses exactly the updating rules (7) and (8) is referred to in the following as the exact ADBP algorithm.
III. SIMPLIFICATIONS OF THE UPDATING EQUATIONS
Although the ADBP algorithm introduces the fundamental complexity breakthrough of making the iterative decoding of non-binary codes independent of the alphabet size, the rules (7) and (8) for updating the D-message parameters are still too complex for a hardware implementation. In this section we introduce some simplifications of the ADBP updating equations for the specific purpose of decoding non-binary codes in a digital transmission system.
The considered full transmission system for high spectral efficiencies, shown in Fig. 1, consists of a mod-LDPC encoder, a PAM modulator, a wrapped AWGN channel and the ADBP decoder (see footnote 1). The encoder is a regular LDPC encoder with non-binary input and output symbols and constant variable-node and check-node degrees. The output symbols of the encoder are transmitted using a PAM constellation with natural-order mapping. The outputs of the wrapped Gaussian channel are obtained by offsetting and wrapping the output of a regular AWGN channel into a bounded interval, where the additive noise is Gaussian with zero mean.
Wrapping reduces the channel capacity but at the same time makes the channel input-symmetric, so that transmission of the all-zero sequence can be assumed with the employed linear encoder (see footnote 2). Furthermore, the likelihoods at the output of the wrapped Gaussian channel with PAM inputs naturally take the form of digital messages, so that the channel can be easily interfaced with ADBP. The ADBP decoder takes the corresponding parameter pairs as input messages and performs the decoding using the flooding schedule. After a fixed number of iterations (10 in this case), the hard decision on the symbols is performed simply by rounding the updated message mean to the nearest integer. The Tanner graph of the code is reported in Fig. 2, where sum nodes correspond to check nodes and repetition nodes correspond to variable nodes in binary codes. No significant BER improvement has been observed when increasing the number of iterations, as the structure of the code considered in this paper is not optimized.
Footnote 1: We consider here a one-dimensional PAM constellation for notational simplicity. The extension to complex two-dimensional QAM constellations, achieving twice the spectral efficiency, is simply obtained by repeating the PAM mapping independently on the I and Q components of the signal.
Footnote 2: The use of ADBP in conjunction with the regular AWGN channel, as well as with other types of modulation sets, requires some modifications of the computation of the input message parameters. Since in this paper we focus primarily on implementation issues, we use the wrapped AWGN channel to simplify the analysis. Notice that the loss of capacity induced by wrapping becomes negligible for large constellations.
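The channel interface and the final hard decision described above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: symbols are the integers 0 to m - 1 with unit spacing, the wrapping interval is [0, m), and the channel message is the pair formed by the received value and the inverse noise variance. All function names are hypothetical.

```python
import math
import random


def wrapped_awgn(symbol, sigma, m, rng):
    """Add zero-mean Gaussian noise to an integer symbol and wrap the
    result into [0, m) (assumed interval; the offset is taken as zero)."""
    y = (symbol + rng.gauss(0.0, sigma)) % m
    return 0.0 if y >= m else y  # guard against floating-point rounding up to m


def channel_message(y, sigma):
    """Input message for the decoder: (mean, concentration), with the
    concentration assumed to be the inverse of the noise variance."""
    return y, 1.0 / sigma ** 2


def hard_decision(mean, m):
    """Round the message mean to the nearest integer, modulo m."""
    return int(math.floor(mean + 0.5)) % m
```

For example, with `m = 8` a mean of 7.6 is decided as symbol 0, since after wrapping it is closer to 8 (equivalent to 0) than to 7.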
A. Simplifications of the Repetition Update Equations
As pointed out in Fig. 2, at repetition nodes one of the involved messages is always the channel message. The standard deviation of this message, related to the signal-to-noise ratio over the channel, is typically much smaller than the wrapping period. In this case (corresponding to large concentrations), the exact ADBP expression (7) can be approximated by the simpler expression (2), which neglects the aliasing effect of the replicas of the wrapped Gaussian. Notice, however, that for the proper computation of the output mean in (2) one should consider, among all possible replicas associated with the wrapped Gaussian distributions, those whose means are closest to each other. As a consequence, the approximation (9) can be used to obtain the output mean, with the integer replica indices chosen so as to minimize the distance between the replicated input means. Eq. (9) requires two multiplications, two sums and one division. To simplify it, we first derive the indices associated with the maximum and minimum values of the concentrations, and then write (9) as (10). Expression (10) only requires two sums and a multiplication by a number that is always bounded in [0, 0.5].
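One plausible form of this simplified repetition update is sketched below in Python. It rewrites the concentration-weighted mean of (2) as the mean of the more reliable input plus a correction weighted by a ratio that cannot exceed 0.5, and it selects the closest replica by wrapping the difference of the means. The exact expression (10) is not reproduced here; the function names and the assumption that means live in [0, period) are illustrative.

```python
def wrap_centered(x, period):
    """Map x into [-period / 2, period / 2)."""
    return (x + period / 2.0) % period - period / 2.0


def repetition_update_wrapped(mu1, k1, mu2, k2, period):
    """Simplified repetition update for two wrapped messages.

    Returns (mean, concentration). The weight rho = k_min / (k1 + k2) is
    always in [0, 0.5]; the mean difference is wrapped so that the closest
    replicas of the two inputs are combined.
    """
    if k1 >= k2:
        mu_max, k_max, mu_min, k_min = mu1, k1, mu2, k2
    else:
        mu_max, k_max, mu_min, k_min = mu2, k2, mu1, k1
    rho = k_min / (k_max + k_min)
    delta = wrap_centered(mu_min - mu_max, period)
    mean = (mu_max + rho * delta) % period
    return mean, k1 + k2
```

With a period of 8, inputs with means 7.5 and 0.5 are only one unit apart once wrapping is taken into account, so the update returns a mean between 7.5 and 8 rather than the misleading unwrapped average near the middle of the interval.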
B. Simplifications of the Sum Update Equations
In the sum update, the use of the correct expression (8) instead of (1) is required, as the concentration of the messages actually increases during the iterations. In this case, a simplification of (8) is obtained by approximating the function appearing in it with a simpler expression that is valid over a limited range of its argument.
Using this approximation, we can write the concentration of the characteristic functions in closed form and approximate the output message concentration as in (11), which is expressed through a familiar operator and two operators derived from it. Similarly, we can derive the expression (12) for the output mean (third equation in (8)). A further simplification, which neglects the correction term of the operator and computes the true min, is also considered. The ADBP algorithm that uses the updating rules (10), (11) and (12) with the true-min approximation is named simplified ADBP and denoted by the acronym sADBP. Since the structure of the code considered in this paper is not optimized, improved approximations based on scaled or offset functions [23] do not offer significant advantages.
IV. FIXED POINT MODEL
The fixed-point (FP) model of the proposed algorithm is implemented in C. All variables in the decoding algorithm are represented in 2's complement notation as a pair such as [4, 3], giving the number of bits for the integer part (which includes the sign bit, unless stated otherwise) followed by the number of bits for the fractional part. The performance of the decoder is expressed as the symbol error rate (SER) as a function of the signal-to-noise ratio. At first, the approximations used in (10)-(12) are validated by simulating, in double precision (DP), both the exact ADBP and the sADBP algorithms. Fig. 3(a) shows the results obtained from this step for four alphabet sizes, including 16, 32 and 64. For all four cases reported in the figure, the simulation results show that the sADBP algorithm performs close to the exact algorithm; therefore, it can be safely investigated for FP implementation.
The initial step of the FP implementation of the decoder is the determination of an appropriate number of integer and fractional bits for the input quantities, i.e. the mean and the concentration of the input messages. This acts as a starting point for implementing the internal datapath of the decoder in view of the performance and complexity constraints. To this end, we analyze the DP performance offered by the algorithm when quantized inputs are applied. This analysis is carried out in two steps.
Quantization of the mean only: In this step, only the mean is quantized and applied to the inputs of the sum and repetition functional units, whereas the concentration and the internal datapath are kept in DP. As discussed in Section II, the mean of a D-message is a real number that lies in a range determined by the wrapping period. Therefore, the 2's complement representation of its integer part requires a number of magnitude bits fixed by the wrapping period, plus 1 bit for the sign. Since the input mean is always positive, the most significant (sign) bit is always '0' and is neglected. The number of fractional bits of the mean, on the other hand, is a design parameter that can be changed to trade off performance against hardware complexity. The simulation results of this optimization step are plotted in Fig. 3(b), which shows the performance with the mean quantized using a constant number of integer bits and a variable number of fractional bits (between 2 and 5). For the three cases shown in Fig. 3(b), the simulation results demonstrate that at least 4 fractional bits are required for the mean in order to obtain performance within 0.3 dB of the DP implementation.
Quantization of the mean and concentration: In this step, both inputs, i.e. the mean and the concentration, are quantized and applied to the functional units. The mean is quantized as obtained in the previous step, whereas for the concentration we simulate various combinations of integer and fractional bits. The simulation results of this step are shown in Fig. 3(c). Since the magnitude of the concentration increases with the signal-to-noise ratio, schemes with a large number of integer and fractional bits are required to achieve a low SER at high signal-to-noise ratio. Some of the simulated schemes achieve nearly the same performance as the DP model, whereas others show a performance loss of about 1 dB. From the simulation results, we noticed that the number of bits required for the concentration grows with its magnitude. Thus, a considerable improvement in performance can be achieved by first scaling down the magnitude of the concentration before applying it to the inputs of the functional units. Similarly, the updated concentration at the output of the units is scaled up by the same factor. This scaling makes it possible to represent a large concentration with fewer bits and avoids overflow during intermediate calculations, thus achieving better numerical stability. It also plays an important role in the look-up table (LUT) implementation: without scaling, the LUT would require more bits per memory word. In particular, it has been observed that a scaling factor of 16, used with the [4, 3] quantization scheme, provides a good performance-complexity trade-off. The results of this step are given in Fig. 3(d) for two alphabet sizes, including 32. The curves show that a significant improvement in the performance of the algorithm is achieved with the help of scaling. The same [4, 3] quantization for the concentration and the scaling factor of 16 are also applicable to the remaining alphabet sizes, including 64.
In the final step, the internal datapath of the decoder is optimized for FP implementation using the quantized inputs (mean and concentration). The signals involved at each intermediate step of the computation are optimized by simulating various combinations of integer and fractional bits. Fig. 4 shows the complete FP simulation results of the proposed decoder.
The curves in the figure show that, for each alphabet size, the FP implementation of sADBP performs very close to the DP implementations of both the exact and the simplified algorithms. The hardware architecture and datapath complexity details specific to the FP simulation curves of Fig. 4 are discussed in the next section.
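The fixed-point notation and the effect of power-of-two scaling discussed in this section can be illustrated with a short Python sketch. The authors' model is written in C and is not reproduced here; the function below, its name, its round-to-nearest behavior, its saturation on overflow, and the unsigned treatment of the concentration are all assumptions made for illustration.

```python
import math


def quantize(x, int_bits, frac_bits, signed=True):
    """Round x to the nearest value representable with the given number of
    integer and fractional bits, saturating at the range limits.

    For signed formats the integer bits include the sign bit.
    """
    step = 2.0 ** -frac_bits
    if signed:
        lo = -(2.0 ** (int_bits - 1))
        hi = 2.0 ** (int_bits - 1) - step
    else:
        lo = 0.0
        hi = 2.0 ** int_bits - step
    q = math.floor(x / step + 0.5) * step  # round half up
    return min(max(q, lo), hi)


SCALE = 16.0  # scaling factor reported in the text

# A concentration of 100 saturates an unsigned [4, 3] word ...
unscaled = quantize(100.0, 4, 3, signed=False)
# ... but survives when scaled down by 16 before quantization and
# scaled back up afterwards.
rescaled = quantize(100.0 / SCALE, 4, 3, signed=False) * SCALE
```

Because 16 is a power of two, the scaling itself costs no arithmetic in hardware: it only changes where the binary point is interpreted.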
V. HARDWARE ARCHITECTURE AND SYNTHESIS RESULTS
The ADBP decoding architecture is similar to that of standard two-phase binary LDPC decoders. As discussed in Section III and shown in Fig. 2, there are two kinds of computation nodes, i.e. sum nodes and repetition nodes, each operating on its incoming messages. The decoding process follows a flooding schedule in which a single decoding iteration consists of two steps: a) a horizontal scan, in which all sum nodes update their output messages, and b) a vertical scan, in which all repetition nodes update their output messages. As discussed before, a single message in ADBP decoding is a vector containing two values, i.e. the mean and the concentration of the corresponding message class.
It is worth noting that the architecture of the complete decoder depends on the structure and characteristics of the parity-check matrix. However, the optimization of this matrix is outside the scope of this paper. Thus, in this section, we focus only on the processing functions involved in ADBP. Namely, we first discuss the digital architecture of the binary (i.e. two-input) sum and repetition nodes, and then we propose a generalized processing element that implements the extension of these operations to more than two messages.
Binary Repetition Node Functional Unit (RN-FU):
The binary repetition node functional unit (RN-FU) implements the repetition update using the simplified version (10). This equation is implemented as shown in Algorithm 1. Fig. 5(a) shows the datapath architecture of the RN-FU. The figure also shows the input, output and intermediate variables of Algorithm 1, including the means and scaled concentrations of the two inputs. Comparators (CMP1-CMP3), adders (A1-A4), multipliers (M1-M2) and multiplexers (MUX1-MUX4) act as follows. A1 implements line 5, whereas CMP1, MUX1-MUX3 and A2 implement lines 7 to 11 of Algorithm 1. CMP2-CMP3, MUX4 and A3 implement lines 12 to 18. To decrease the hardware complexity and propagation delay, we adopted division via multiplication by the reciprocal, where the reciprocal needed in line 19 is implemented using a look-up table (LUT). Finally, M1-M2 and A4 implement lines 19 to 21. The input concentration is represented as [8, 0] and is scaled down by 16 by simply moving the decimal point to the left by 4 places. The reverse holds for scaling up by 16 at the output of the node. The repetition node is implemented as a fully pipelined structure (pipeline stages are shown as dotted lines in Fig. 5(a)).
Binary Sum Node Functional Unit (SN-FU):
The binary sum node functional unit (SN-FU) implements the sum update using (11) and (12); the main computational steps are shown in Algorithm 2.
Step 6 of the algorithm computes the integer and fractional parts of the two input means. Similarly, steps 8-13 and 15-18 perform the computations necessary to implement (11) and (12), respectively. The hardware architecture of the SN-FU is reported in Fig. 5(b), where adders (A1-A16), comparators (CMP1-CMP5) and multipliers (M1-M3) act as follows. A1-A4 and M1-M2 compute the preliminary quantities for each of the two inputs. A6-A8 and A14, with CMP2-CMP3 and CMP5, implement lines 8 to 13 of Algorithm 2. A9-A13, with CMP1 and CMP4, implement lines 15 to 18, where a LUT is used. Finally, M3 together with A15-A16 implements lines 19 to 21 of Algorithm 2. The sum node is also a fully pipelined structure. It may be noted that, although the general architecture of both the sum and repetition nodes remains the same for all alphabet sizes, the complexity as well as the decoding performance depends on the representation of all variables. The FP simulation results shown in Fig. 4 have been obtained using the finite-precision model characterized by the choices listed in Tables I.a and I.b for each external and internal quantity. The leftmost column of both tables shows the computation steps involved in Algorithms 1 and 2, while the following columns show the quantization adopted for each step for alphabet sizes up to 64. Some quantities are always positive; therefore, the MSB of their magnitude part is always '0' and is excluded from the representation. For all other quantities, the integer part includes the sign bit. In some cases, where the operands have unequal numbers of integer and fractional bits, an arithmetic shift left (right) is performed on one or both of them in order to align their decimal points. In addition, the tables show the hardware modules (of Fig. 5(a) and 5(b)) involved in each step, along with the total bit width of the operands.
The details of Tables I.a and I.b reveal that moving to larger alphabet sizes, up to 64, requires a 1-bit increase in the bit width of the hardware modules affected by the alphabet size. This results in a slight but affordable increase in the overall complexity of the sum and repetition functional nodes.
Processing Element: The sum and repetition functions discussed above are associative, i.e. for three inputs the result does not depend on how the inputs are grouped when the binary sum or repetition operator is applied. In order to extend these functions to process more than two input messages, the update rule (13) must be satisfied, which involves the input and output message vectors and the node degree. The rule states that the output message corresponding to a given edge of a node (sum or repetition) is the result of applying the sum or repetition operator to all input messages except the input message on that edge. The update rule (13) is implemented in this work by resorting to the forward-backward (FB) strategy with serial read and write adopted in some classical implementations of LDPC decoders, see e.g. [10]. Fig. 5(c) shows the generalized architecture of the proposed FB processing element (PE), which consists of three functional units (FU1-FU3) implementing the binary operator, which can be either RN-FU or SN-FU, and two last-in first-out (LIFO) memory units that store the intermediate results. The functionality of a PE, explained here for sum node processing, can be straightforwardly extended to repetition node processing.
Thanks to the pipelined implementation of the FUs, the PE is able to process one edge of the Tanner graph per clock cycle. Several parity-check equations are in the pipeline at the same time. In the first group of clock cycles, the PE receives the first message of each of these equations; in the next group, it receives the second message of each equation, and so on. This process continues until all messages of the parity-check equations have been received. At the output, the messages are produced in reverse order, i.e. starting from the last message of the last equation, down to the first message of the first equation.
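The forward-backward evaluation of the "all inputs except one" rule can be sketched in software as follows. This Python function is a behavioral illustration only and does not model the pipelining, the LIFO memories or the interleaving of several equations; the function name and the generic `op` argument, which stands for any associative binary update such as the sum or repetition operator, are illustrative.

```python
def forward_backward(messages, op):
    """For each edge j, combine all input messages except messages[j]
    using the associative binary operator `op`.

    Uses a forward pass (prefix results), a backward pass (suffix results)
    and a final combination, i.e. three kinds of binary operations.
    """
    d = len(messages)
    if d < 2:
        raise ValueError("a node needs at least two edges")

    forward = [None] * d   # forward[j]: op over messages[0 .. j-1]
    backward = [None] * d  # backward[j]: op over messages[j+1 .. d-1]

    for j in range(1, d):
        forward[j] = messages[0] if j == 1 else op(forward[j - 1], messages[j - 1])
    for j in range(d - 2, -1, -1):
        backward[j] = messages[d - 1] if j == d - 2 else op(messages[j + 1], backward[j + 1])

    outputs = []
    for j in range(d):
        if j == 0:
            outputs.append(backward[0])          # nothing precedes the first edge
        elif j == d - 1:
            outputs.append(forward[d - 1])       # nothing follows the last edge
        else:
            outputs.append(op(forward[j], backward[j]))
    return outputs
```

For a node of degree d this needs about 3(d - 2) applications of the binary operator instead of d(d - 2) for the naive approach, and the three roles (forward, backward, combine) line up naturally with the three functional units of the PE.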
The width (i.e. number of bits per row) of LIFO1 and LIFO2 is equal to the total number of bits in the FP representation of the mean and the concentration, whereas the depth (i.e. number of rows) is given by (14). Both sum and repetition PEs have a throughput of one edge per clock cycle. The number of clock cycles needed by the sum and repetition PEs to perform a given number of iterations therefore includes an additional constant that accounts for the latency of the two processors. More generally, with parallel processors the throughput is given by (15).
The synthesizable IP core of the ADBP decoder has been written in VHDL, and synthesis has been performed on a 45 nm standard-cell ASIC technology using the Synopsys Design Vision tool at a fixed target clock frequency. Table II shows the synthesis results of the main processing elements of the ADBP algorithm for various alphabet sizes. The table reports the following area figures.
• The areas of the binary SN-FU and of the binary RN-FU.
• The area of a single LIFO for a given node degree (check-node or variable-node degree).
• The areas of a sum PE and of a repetition PE for the corresponding node degrees.
The table also reports the gate count, i.e. the total number of 2-input NAND gates, for both sum and repetition PEs for various alphabet sizes and node degrees.
rates and variable/check-node degree distributions. However, a complexity comparison at the PE level is possible and is presented here. One recent work in the domain of non-binary LDPC decoders is [13], in which the authors propose a decoder with a trellis-based implementation of the forward-backward check node. The main processing core of [13] consists of an iterative decoder processor (IDP) that implements the combined functionality of a single check node and a single variable node. The IDP is synthesized on a 90 nm technology and has a gate count of 0.16 million equivalent gates. In the case of ADBP, the combined area of the sum and repetition PEs of the corresponding degrees is 0.038 at 45 nm. This area is multiplied by 4 to obtain the equivalent area at 90 nm. Finally, the result is divided by the area of a single 2-input NAND gate to obtain the gate count at 90 nm, which is 0.06 million gates. This clearly demonstrates the logic-area advantage of the ADBP processing cores over non-binary LDPC decoders. In addition, the non-binary LDPC decoder of [13] uses several IDPs in parallel and achieves a throughput of 234 Mbps for its reported set of parameters. For the same parameters, the proposed ADBP decoder achieves a throughput of 833 Mbps, which is about 3.5 times higher than that of the decoder in [13].
The synthesis results of Table II clearly demonstrate that the ADBP algorithm achieves a remarkable complexity reduction and a very high throughput. In addition, the complexity increases only slightly when moving to higher cardinalities, which shows the feasibility of this algorithm for large alphabets.
It should be noted that the simple regular mod-LDPC encoder used in this paper provides poor performance with respect to state-of-the-art LDPC encoders constructed over finite fields. The exceptional complexity reduction achieved by ADBP, together with the asymptotic results of [17] and [18], motivates further research effort in the design of good non-binary LDPC encoders within this class. The introduction of ADBP thus converts the decoding complexity drawback associated with non-binary codes into an advantage, and turns the problem of the applicability of non-binary coding techniques back into an encoding design problem. It is worth noting that the scaling of complexity with graph density is clear from the hardware implementation results. Thus, there is a trade-off between graph density and performance, and code construction issues have to be further addressed.
VI. CONCLUSIONS
After almost two decades of research, binary LDPC codes have gained wide diffusion in several fields, and excellent implementations exist for their decoding. However, the efficient implementation of decoders for non-binary LDPC codes is still a challenging task and requires considerable research effort. In particular, the implementation complexity of non-binary LDPC decoders tends to grow with the cardinality of the symbol alphabet, which severely limits the achievable spectral efficiency. In the first part of this paper, we introduce several simplifications of the previously proposed Analog Digital Belief Propagation algorithm. The simulation results show that these approximations do not significantly affect decoding performance, while enabling the practical implementation of ADBP. In the second part of the paper, the fixed-point model of ADBP is developed, showing that between 5 and 10 bits must be allocated to represent external and internal quantities with limited or no effect on performance. Finally, in the last part of the work, we propose and detail the implementation architecture of the key processing nodes. Synthesis results obtained for multiple sizes of the symbol alphabet prove that: (i) the area required to implement the processing nodes is affordable, and (ii) the complexity grows very weakly with the size of the alphabet.

From 1986 to 1988, he was a Researcher with the Centro Studi e Laboratori in Telecomunicazioni (CSELT), Torino, involved in the standardization activities for the GSM system. Since 1992, he has been an Assistant Professor and then Associate Professor with the Electronic Department, where he is a member of the VLSI-Lab group. His research interests include several aspects of the design of digital integrated circuits and systems, with special emphasis on high-performance architecture development (especially for wireless communications and multimedia applications) and on-chip interconnect modeling and optimization.
He has coauthored 230 journal and conference papers in the areas of ASIC-SoC development, architectural synthesis, VLSI circuit modeling and optimization. In the frame of competitive National and European research projects, he has been co-designer of several ASIC and FPGA implementations in the fields of Artificial Intelligence, Computer Networks, Digital Signal Processing, Transmission and Coding.
Dr. Masera is an associate editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.
