Abstract: Data representations for LDPC decoders using the sum-product algorithm in the log-likelihood domain are considered. It is proposed that the look-up table implementation of the domain transform function be separated into two parts, allowing a compact representation of the internal state data. Memory and bus widths can typically be reduced by 16%, while the imposed hardware overhead is insignificant.
I. INTRODUCTION
An LDPC code is defined by a very sparse parity-check matrix H of dimensions M × N, where N is the block length and M is the number of redundant symbols in each block [1], [2]. A code word x belongs to the code if and only if Hx = 0 (mod 2). An LDPC code can be visualized by a Tanner graph consisting of M check nodes and N variable nodes. Each row and each column of H corresponds to a check node and a variable node, respectively. The graph is bipartite, and check node m and variable node n are connected if and only if H_{m,n} = 1.
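As a toy illustration of these definitions, the Python sketch below derives the neighbor lists of each check node and each variable node of the Tanner graph from H, and tests the condition Hx = 0 (mod 2). The function names and the (7, 4) Hamming parity-check matrix are assumptions chosen for brevity; that matrix is small and not sparse, so it is not an LDPC code of the kind considered in this letter.

```python
from typing import List, Tuple

Matrix = List[List[int]]

def tanner_graph(H: Matrix) -> Tuple[List[List[int]], List[List[int]]]:
    """Return the neighbor lists of the bipartite Tanner graph of H.

    check_nbrs[m] lists the variable nodes n with H[m][n] == 1, and
    var_nbrs[n] lists the check nodes m with H[m][n] == 1.
    """
    M, N = len(H), len(H[0])
    check_nbrs = [[n for n in range(N) if H[m][n] == 1] for m in range(M)]
    var_nbrs = [[m for m in range(M) if H[m][n] == 1] for n in range(N)]
    return check_nbrs, var_nbrs

def is_codeword(H: Matrix, x: List[int]) -> bool:
    """True if and only if Hx = 0 (mod 2), i.e. every parity check is satisfied."""
    return all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H)

# Toy parity-check matrix (assumption): the (7, 4) Hamming code, M = 3, N = 7.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

check_nbrs, var_nbrs = tanner_graph(H)
print(check_nbrs[0])                           # → [0, 1, 3, 4]
print(var_nbrs[3])                             # → [0, 1, 2]
print(is_codeword(H, [1, 1, 1, 1, 1, 1, 1]))   # → True
print(is_codeword(H, [1, 0, 0, 0, 0, 0, 0]))   # → False
```

Each check node is simply a row of H and each variable node a column, so the graph carries exactly the information in H.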
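During decoding with the sum-product algorithm [2], messages are passed between the connected nodes of the graph, where β_mn denotes the message from check node m to variable node n and α_mn denotes the message from variable node n to check node m. The messages are computed iteratively according to the relations

$$\alpha_{mn} = \gamma_n + \sum_{m' \in \mathcal{M}(n) \setminus \{m\}} \beta_{m'n} \qquad (1)$$

$$\beta_{mn} = \left( \prod_{n' \in \mathcal{N}(m) \setminus \{n\}} \operatorname{sign}(\alpha_{mn'}) \right) \Phi\!\left( \sum_{n' \in \mathcal{N}(m) \setminus \{n\}} \Phi\left(|\alpha_{mn'}|\right) \right) \qquad (2)$$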
where N(m) and M(n) denote the sets of neighbors of check node m and variable node n, respectively, γ_n denotes the a priori input data (log-likelihood ratio) for variable node n, and Φ(x) is a domain transform function given by

$$\Phi(x) = -\log \tanh\left(\frac{x}{2}\right) = \log \frac{e^{x} + 1}{e^{x} - 1}. \qquad (3)$$

Based on (1) and (2), a decoder data path can be depicted as in Fig. 1, where the β_mn and α_mn messages are computed by the CNU and VNU blocks, respectively. The function Φ(x) is usually implemented as a look-up table, represented by the LUT block in Fig. 1. Because the computations of the VNU and CNU blocks are either temporally separated (in a partly parallel or serial architecture) or spatially separated (in a fully parallel architecture), the messages need to be stored and/or communicated over possibly long signal wires, and the data path can be cut for this purpose at several locations. Two possible cuts proposed in earlier works are marked 1 and 2 in Fig. 1; they have been used in [3] and [4], respectively. In an LDPC decoder, the dominating part of the energy dissipation is associated with the communication and/or storage of messages. Hence, it is of interest to find an energy-efficient representation of the messages.
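For reference, the Python sketch below is a minimal floating-point model of the updates (1) and (2) with the transform (3). It assumes a flooding schedule, a fixed iteration limit, and the convention that a positive log-likelihood ratio favours bit 0; it uses no look-up table or fixed-point format, and the names decode, max_iter and the clamp inside phi are hypothetical choices made for illustration only.

```python
import math
from typing import Dict, List, Tuple

def phi(x: float) -> float:
    """Domain transform (3): Phi(x) = -ln(tanh(x / 2)) for x > 0.

    Phi(0) is infinite; clamping x to 1e-12 (an assumption, not part of (3))
    keeps every message finite. Mathematically, Phi is its own inverse.
    """
    x = max(x, 1e-12)
    return -math.log(math.tanh(x / 2.0))

def decode(H: List[List[int]], gamma: List[float], max_iter: int = 20) -> Tuple[List[int], bool]:
    """Flooding sum-product decoding in the log-likelihood domain.

    Returns the hard decisions and whether all parity checks are satisfied.
    """
    M, N = len(H), len(H[0])
    check_nbrs = [[n for n in range(N) if H[m][n]] for m in range(M)]   # N(m)
    var_nbrs = [[m for m in range(M) if H[m][n]] for n in range(N)]     # M(n)
    # Variable-to-check messages alpha_mn start at the a priori values gamma_n.
    alpha: Dict[Tuple[int, int], float] = {
        (m, n): gamma[n] for m in range(M) for n in check_nbrs[m]
    }
    x = [0] * N
    for _ in range(max_iter):
        # Check node update (2): product of signs times Phi(sum of Phi(|alpha|)).
        beta: Dict[Tuple[int, int], float] = {}
        for m in range(M):
            for n in check_nbrs[m]:
                others = [alpha[(m, k)] for k in check_nbrs[m] if k != n]
                sign = -1.0 if sum(a < 0 for a in others) % 2 else 1.0
                beta[(m, n)] = sign * phi(sum(phi(abs(a)) for a in others))
        # A posteriori values and hard decisions (positive value -> bit 0).
        total = [gamma[n] + sum(beta[(m, n)] for m in var_nbrs[n]) for n in range(N)]
        x = [1 if t < 0 else 0 for t in total]
        if all(sum(x[n] for n in check_nbrs[m]) % 2 == 0 for m in range(M)):
            return x, True
        # Variable node update (1): total minus the message from check m itself.
        for n in range(N):
            for m in var_nbrs[n]:
                alpha[(m, n)] = total[n] - beta[(m, n)]
    return x, False

# Toy example (assumption): (7, 4) Hamming matrix, all-zero word sent,
# bit 2 received weakly in error.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
gamma = [3.0, 3.0, -1.0, 3.0, 3.0, 3.0, 3.0]
print(decode(H, gamma))   # → ([0, 0, 0, 0, 0, 0, 0], True)
```

In a hardware decoder, the two calls to phi per message are the LUT blocks of Fig. 1, and the dictionaries alpha and beta correspond to the messages that must be stored or routed.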
II. LOOK-UP TABLE COMPRESSION
The fixed-point data format used by the decoder is denoted (w_i, w_f), where w_i and w_f are the numbers of integer and fractional bits, respectively. Because of the nonlinear characteristic of Φ(x), shown in Fig. 2, several values never occur at the output of the look-up table. Thus, the number of redundant bits in the representation can be computed as w_i + w_f − log2(N_o), where N_o is the number of unique values in the look-up table. As seen in Fig. 3, the redundancy is considerable for many data formats, and we therefore propose to separate the look-up table into an encoder part and a decoder part, as shown in Fig. 4. Combined, the encoder and decoder perform the same function as the LUT in Fig. 1, but the separation offers the freedom to choose a suitable representation of the intermediate data at the locations indicated by the cut in Fig. 4. By exploiting the redundancy in the representation, the wordlength required for buses and/or memories is reduced. As the compression does not change the algorithm, the performance of the decoder is not affected, while significant savings in memory and/or routing area and energy should be obtained.

It was shown in [5] that the (2, 2) representation provides a good trade-off between hardware complexity and decoder performance. Hence, that format is considered in the following discussion. For the (2, 2) representation, messages consist of 6 bits (including the sign bit and the hard-decision/parity-check bit), and compression saves one bit, a wordlength reduction of 16.7%. The area overhead associated with the separation of the look-up table has been estimated by synthesizing VHDL code to standard cells in a 0.35 µm CMOS process, using Synopsys Design Compiler. The synthesis of the original look-up table required 12 cells with a total area of 726 µm².
Realized as separate blocks using a straightforward mapping of data to the intermediate compressed format, the encoder and decoder parts required 8 cells occupying 528 µm² and 6 cells occupying 309 µm², respectively. For comparison, a 6-input CNU and a 3-input VNU occupy roughly 39000 µm² and 31000 µm², respectively, when synthesized using the same methods. Thus, the area overhead of 111 µm² (528 + 309 − 726) per look-up table amounts to a 1 to 1.5% area increase for the processing elements.
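The Python sketch below illustrates the separation for a (w_i, w_f) magnitude format: it tabulates Φ on the grid of representable inputs, counts the unique outputs N_o, expresses the redundancy as w_i + w_f − log2(N_o) bits, and splits the table into an encoder (input to the rank of its output among the unique values) and a decoder (rank back to output). Rounding to nearest, saturating Φ(0) at the largest code, and the rank-based mapping are assumptions made for illustration; they are one possible straightforward mapping, not necessarily the one synthesized above.

```python
import math
from typing import List, Tuple

def phi(x: float) -> float:
    """Domain transform Phi(x) = -ln(tanh(x / 2)), defined for x > 0."""
    return -math.log(math.tanh(x / 2.0))

def phi_table(wi: int, wf: int) -> List[int]:
    """LUT on the magnitude grid of a (wi, wf) format, as integer codes.

    Entry k holds Phi(k * 2**-wf) in units of 2**-wf, rounded to nearest and
    saturated at 2**(wi + wf) - 1 (assumed quantization).
    """
    step = 2.0 ** -wf
    max_code = 2 ** (wi + wf) - 1
    table = [max_code]                      # k = 0: Phi(0) is infinite, saturate
    for k in range(1, max_code + 1):
        code = math.floor(phi(k * step) / step + 0.5)
        table.append(min(code, max_code))
    return table

def split_table(table: List[int]) -> Tuple[List[int], List[int]]:
    """Split a LUT into an encoder (input -> compressed index) and a decoder
    (compressed index -> original output), so decoder[encoder[k]] == table[k]."""
    decoder = sorted(set(table))            # the N_o unique output values
    rank = {value: i for i, value in enumerate(decoder)}
    encoder = [rank[value] for value in table]
    return encoder, decoder

wi, wf = 2, 2
table = phi_table(wi, wf)
encoder, decoder = split_table(table)
n_o = len(decoder)
redundant_bits = wi + wf - math.log2(n_o)   # redundancy in bits
compressed_bits = math.ceil(math.log2(n_o)) # magnitude width at the cut
# The pair behaves exactly like the original LUT.
assert all(decoder[encoder[k]] == table[k] for k in range(len(table)))
print(n_o, redundant_bits, compressed_bits) # → 8 1.0 3
```

Under these assumptions the 4-bit magnitude needs only 3 bits at the cut, so a 6-bit message (with sign and hard-decision/parity-check bits) becomes 5 bits, consistent with the one-bit reduction for the (2, 2) format.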
As reported in [6], changing the logarithm base of Φ(x) (effectively scaling the inputs and outputs of the CNU) may improve the performance of the decoder, and, as shown in Fig. 5, this approach can be combined with the proposed idea for logarithm bases close to the natural base e. Outside this range, compression may be applied in one direction without approximating the table entries.
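With logarithm base b, Φ_b(x) = log_b((b^x + 1)/(b^x − 1)) = Φ(x ln b)/ln b, which is exactly a scaling of the input by ln b and of the output by 1/ln b. The Python sketch below counts the unique outputs of a (w_i, w_f) table of Φ_b and checks whether the compressed width is below w_i + w_f. It assumes rounding to nearest and saturation of Φ_b(0) at the largest code, and it is an illustration only; it does not reproduce the data behind Fig. 5.

```python
import math

def phi_base(x: float, b: float) -> float:
    """Phi with logarithm base b > 1: log_b((b**x + 1) / (b**x - 1)) = Phi(x ln b) / ln b."""
    s = math.log(b)
    return -math.log(math.tanh(x * s / 2.0)) / s

def unique_outputs(wi: int, wf: int, b: float) -> int:
    """Number N_o of distinct codes in a (wi, wf) table of phi_base (assumed quantization)."""
    step = 2.0 ** -wf
    max_code = 2 ** (wi + wf) - 1
    codes = {max_code}                      # Phi_b(0) is infinite: saturate
    for k in range(1, max_code + 1):
        code = math.floor(phi_base(k * step, b) / step + 0.5)
        codes.add(min(code, max_code))
    return len(codes)

def saves_a_bit(wi: int, wf: int, b: float) -> bool:
    """True if the compressed width ceil(log2(N_o)) is below wi + wf."""
    return math.ceil(math.log2(unique_outputs(wi, wf, b))) < wi + wf

for base in (2.0, 2.5, math.e, 3.0, 4.0):
    n_o = unique_outputs(2, 2, base)
    print(f"b = {base:.3f}: N_o = {n_o}, saves a bit: {saves_a_bit(2, 2, base)}")
```

Under these assumptions, base e yields 8 unique codes for the (2, 2) format while base 2 yields more than 8, so the compressed width depends on the chosen base.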
III. IMPLEMENTATION ASPECTS
In [7], a fully parallel implementation of an (N, K) = (1024, 512) code in a 0.16 µm CMOS process is presented, and the average length of the message nets was found to be 3 mm. Scaling linearly with the feature size, the average length becomes roughly 6 mm in the 0.35 µm process. Further, with 6 wires per message, 6 messages per CNU and a wire width of 0.5 µm, the total wire area for the messages of one CNU amounts to 6 mm × 0.5 µm × 6 × 6 = 108000 µm², which is considerably larger than the area of the CNU itself. Even at low switching activities, the energy required for data communication is thus significant. In a partly parallel architecture, the average net length is expected to be shorter, but the switching activity would be higher because uncorrelated messages are communicated over the same wires. In addition to reducing the wordlength of the messages, the separation of the look-up table makes it possible to choose a representation suited for energy-efficient communication, and, as shown in Sec. II, the separation is essentially free. Consider the distributions of look-up table values at the CNU input (Fig. 6) and at the VNU input (Fig. 7), obtained using an (N, K) = (1152, 576) code. Clearly, the energy dissipated in communicating messages depends on the encoding chosen. For example, in a fully parallel architecture the most common values can be assigned representations with a small mutual Hamming distance, while in a partly parallel or serial architecture an asymmetric memory [8] might be used efficiently if low-weight representations are chosen for the most common values.
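As a sketch of the asymmetric-memory case, the Python code below assigns the compressed codewords, in order of increasing Hamming weight, to the look-up table output values in order of decreasing frequency, and compares the average number of ones per message with a plain rank-based assignment. The frequency counts are hypothetical placeholders, not the distributions of Figs. 6 and 7, and the cost model (each stored 1 assumed more expensive than a 0) is an assumption for illustration.

```python
from typing import Dict

def popcount(c: int) -> int:
    """Number of ones in the binary representation of c."""
    return bin(c).count("1")

def low_weight_assignment(freq: Dict[int, int], width: int) -> Dict[int, int]:
    """Map each value to a width-bit codeword so that more frequent values get
    codewords with fewer ones (ties broken by value and by codeword)."""
    if len(freq) > 2 ** width:
        raise ValueError("not enough codewords for the given width")
    values = sorted(freq, key=lambda v: (-freq[v], v))
    codewords = sorted(range(2 ** width), key=lambda c: (popcount(c), c))
    return dict(zip(values, codewords))

def average_ones(mapping: Dict[int, int], freq: Dict[int, int]) -> float:
    """Average number of ones per stored message under the given assignment."""
    total = sum(freq.values())
    return sum(freq[v] * popcount(mapping[v]) for v in freq) / total

# Hypothetical occurrence counts for eight LUT output codes of a (2, 2) table.
freq = {0: 40, 1: 25, 2: 12, 3: 8, 4: 6, 6: 4, 8: 3, 15: 2}
rank_based = {v: i for i, v in enumerate(sorted(freq))}   # plain rank mapping
low_weight = low_weight_assignment(freq, width=3)
print(average_ones(rank_based, freq), average_ones(low_weight, freq))   # → 0.79 0.77
```

The gain depends entirely on how skewed the real distributions are; for a fully parallel architecture, a similar search could instead minimize the expected Hamming distance between consecutive values on a wire.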
IV. CONCLUSIONS
In this letter, we have shown that the data representation traditionally used for messages in the sum-product decoding algorithm for LDPC codes contains redundancy. We have also shown how this redundancy can be exploited by separating the domain transform function into an encoder part and a decoder part, in order to obtain a compact and energy-efficient data representation.
