We describe an efficient, fully-parallel Network of Programmable Logic Array (NPLA)-based realization of iterative decoders for structured LDPC codes. The LDPC codes are developed in tandem with the underlying VLSI implementation technique, without compromising chip design constraints. Two classes of codes are considered: one, based on combinatorial objects derived from difference sets and generalizations of non-averaging sequences, and another, based on progressive edge-growth techniques. The proposed implementation reduces routing congestion, a major issue not addressed in prior work. The operating power, delay and chip-size of the circuits are estimated, indicating that the proposed method significantly outperforms presently used standard-cell based architectures. The described LDPC designs can be modified to accommodate widely different requirements, such as those arising in recording and optical communication systems and also wireless channel applications.
Introduction
One of the most prominent capacity-approaching error-control techniques in communication theory is coding with low-density parity-check (LDPC) matrices, coupled with decoding in the form of belief propagation on a graphical representation of the code. Currently, long random-like LDPC codes offer the best error-control performance for a wide range of standard channels [5, 6], channels with memory [10, 15] and channels with inter-symbol interference (ISI) [19].
In addition to their excellent performance, LDPC codes have decoders whose complexity is linear in the code length and which are inherently parallel. This makes them amenable to implementation using parallel VLSI architectures. The primary performance-limiting factor of most known parallel implementations is the complexity of the graph connectivity associated with random-like LDPC codes. Additional problems arise from the fact that LDPC codes of random structure also require large block sizes for good error correction performance, leading to prohibitively large chip sizes. Despite these bottlenecks, there have been several attempts to come up with high throughput implementations [3] and implementation-oriented code constructions [51, 52]. The drawbacks of most of these proposed techniques are that the code-design and VLSI implementation the proposed layout while section 6 explains the structure of the LDPC codes supporting the proposed layout. The chip power, area and throughput estimates are presented in section 7. Section 8 introduces generalized LDPC (GLDPC) codes and related VLSI design issues, while section 9 describes some reconfigurability problems. Section 10 discusses possible applications of the designed codecs while the concluding remarks are given in section 11.
LDPC Codes: Implementation bottlenecks
In 1963, Gallager [14] introduced a class of linear block codes known as low-density parity-check codes, endowed with a very simple, yet efficient, decoding procedure. These codes, popularly referred to as LDPC codes, are described in terms of bipartite graphs. In the bipartite graph of a designed-rate $1 - m/n$ code, the $m$ rows of the parity-check matrix $H$ represent check nodes ("right nodes"), while its $n$ columns represent variable nodes ("left nodes"). The edges of the graph are placed according to the non-zero entries in the parity-check matrix. If the degree of all variable nodes is the same, then the code is called left-regular. Similarly, if the degree of all check nodes is the same, the code is termed right-regular. The decoding complexity is directly proportional to the number of edges and hence to the number of ones in the parity-check matrix, justifying the use of sparse matrices.
A consequence of the graphical representation of LDPC codes is that these codes can be efficiently decoded in an iterative manner. More specifically, decoding is performed in terms of belief propagation (BP) [22, 37] , with log-likelihood ratios of bits and checks iteratively passed between the two classes of nodes until either all parity-checks are satisfied or a maximum number of iterations is reached. The iterations are initiated at the variable nodes, which usually receive soft input information from the channel. At the end of message passing decoding, the bits are estimated based on the final reliability information of the variable nodes. We will mostly focus our attention on the sum-product version of the belief propagation (BP) algorithm. The same type of design philosophy can be used for other classes of iterative algorithms, such as min-sum decoding. Furthermore, the design methods proposed in this work can be applied to both regular and irregular codes.
The operations performed at each variable and check node can be summarized as follows:
Variable nodes (VN):
Denote the set of all check nodes neighboring variable node $v$ by $C_v$, the set of all variable nodes connected to check node $c$ by $V_c$, the message on the edge going from variable node $v$ to check node $c$ in the $l$-th iteration by $m_{vc}^{(l)}$, and the message on the edge going from check node $c$ to variable node $v$ in the $l$-th iteration by $m_{cv}^{(l)}$. The variable node update then takes the standard sum-product form
$$ m_{vc}^{(l)} = \begin{cases} m_0, & l = 0, \\ m_0 + \sum_{c' \in C_v \setminus \{c\}} m_{c'v}^{(l)}, & l \geq 1, \end{cases} \tag{1} $$
where $y$ denotes the channel output and $p(y|x = i)$, $i = 0, 1$, represents the channel transition statistics, while $m_0 = \log \frac{p(y|x=1)}{p(y|x=0)}$ denotes the channel output log-likelihood ratio of the variable $v$.
Check nodes (CN):
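From the duality principle [13] it follows that the message $m_{cv}^{(l)}$ takes the standard form
$$ m_{cv}^{(l)} = \log \frac{1 + \prod_{v' \in V_c \setminus \{v\}} \tanh\left(m_{v'c}^{(l-1)}/2\right)}{1 - \prod_{v' \in V_c \setminus \{v\}} \tanh\left(m_{v'c}^{(l-1)}/2\right)}. \tag{2} $$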
The computations in equation (2) will be referred to as the log/tanh operations.
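As a rough floating-point illustration of the message-passing schedule described above (variable node sums as in equation (1), log/tanh check updates as in equation (2), and termination once all parity checks are satisfied or the iteration limit is reached), consider the following minimal sketch. The function name `sum_product_decode`, the dense list-of-lists matrix format, and the sign convention (a non-negative log-likelihood ratio is decoded as bit 0) are assumptions made for this example only; they are not taken from the hardware design.

```python
import math

def sum_product_decode(H, llr, max_iter=16):
    """Flooding sum-product decoding on a 0/1 matrix H (list of rows).

    llr[v] is the channel log-likelihood ratio of variable v; in this
    sketch (assumption) a non-negative value is decoded as bit 0.
    Returns (bits, converged).
    """
    m, n = len(H), len(H[0])
    edges = [(c, v) for c in range(m) for v in range(n) if H[c][v]]
    m_vc = {e: llr[e[1]] for e in edges}      # iteration 0: m_vc = m_0
    m_cv = {e: 0.0 for e in edges}
    bits = [0 if x >= 0 else 1 for x in llr]
    for _ in range(max_iter):
        # Check node update: log/tanh rule over all other edges of the check.
        for (c, v) in edges:
            prod = 1.0
            for v2 in range(n):
                if H[c][v2] and v2 != v:
                    prod *= math.tanh(m_vc[(c, v2)] / 2.0)
            prod = max(min(prod, 1.0 - 1e-12), -1.0 + 1e-12)  # keep the log finite
            m_cv[(c, v)] = math.log((1.0 + prod) / (1.0 - prod))
        # Variable node update: channel value plus all other check messages.
        totals = [llr[v] + sum(m_cv[(c, v)] for c in range(m) if H[c][v])
                  for v in range(n)]
        for (c, v) in edges:
            m_vc[(c, v)] = totals[v] - m_cv[(c, v)]
        bits = [0 if t >= 0 else 1 for t in totals]
        # Stop as soon as every parity check is satisfied.
        if all(sum(H[c][v] * bits[v] for v in range(n)) % 2 == 0 for c in range(m)):
            return bits, True
    return bits, False
```

For instance, with the parity-check matrix of a length-7 Hamming code and a single unreliable position, `sum_product_decode(H, [3, 3, 3, 3, 3, 3, -1])` recovers the all-zero codeword.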
The implementation bottlenecks of the decoding process can be easily identified from the previous discussion, as summarized below:
• Large wiring overhead and routing congestion of the code graph implementation. These problems become particularly apparent for high-rate, long and random-like codes.
• Approximate computations performed at check nodes, involving tanh and arctanh functions. These approximations have to be implemented for every incoming edge of a check node and they have a two-fold effect: first, they may compromise the decoder performance and second, they can lead to a large increase in the chip size.
• Finite precision arithmetic and finite computational time imposed on the hardware implementation. For many codes these constraints have a significant impact on the error-correcting performance. Capacity-approaching random-like, irregular codes [38] are usually very long and take a large number of iterations (typically around 1000) ([37], p. 624) to converge to a stable solution. This has a significant bearing on the throughput of the implementation. On the other hand, restricting the maximum number of iterations performed can in certain cases lead to significant degradations of the error performance.
Current implementations fail to provide solutions to one or more of these problems. Ideally, one would like to use codes with near-capacity performance that also bound the worst-case (longest) wire length desired, and that have chip-area and chip-delay characteristics as good as possible. Most known approaches for handling these obstacles deal with code design and implementation problems as separate issues thereby leading to non-optimal solutions [3] , with the notable exception of the proprietary wireless system chip design by Flarion Technologies [12] . Also, most known implementation schemes use standard-cell circuitry. It was shown in [21, 20] that an implementation of a circuit using a network of medium-sized PLAs has better area and delay characteristics compared to a standard cell design. Hence, we propose to investigate PLA-based decoders and compare their performance with those of known standard-cell implementations.
The Proposed Approach: Structure and Full Parallelism
Our proposed implementation of a fully-parallel LDPC decoding system utilizes extremely fast and area-efficient NPLAs [21, 20]. The major features of the proposed system are:
• Full parallelism with the code structure "embedded" in the wiring;
• Area and delay efficient implementation with PLAs;
• A unified approach of tackling the LDPC code design and VLSI implementation problem.
This approach can yield a throughput of the order of several hundred Gbps. As a consequence, it can be used in most modern recording and wireless systems. Given the placement and routing constraints arising out of the NPLA architecture, LDPC codes are tailor-made to meet these and performance-related constraints. Such an approach leads to an overall solution of the problem that demonstrates a significant improvement over prior attempts to implement LDPC codecs in VLSI.
LDPC Codec Architecture

Encoder Implementation
The central problem of the paper, a fully parallel decoder design, has to be viewed in the context of a scheme that deals jointly with the encoding and decoding process. LDPC encoding can be realized in terms of operations involving matrix multiplications that can be implemented in terms of tree-based XOR operations in hardware. This ensures that encoding delays for the codes investigated are logarithmic in the code length. Additionally, for certain LDPC codes of the form presented in the forthcoming sections, encoders based on shift registers and addition units can be used. In this setting, the parity check matrix itself is used for governing the encoding process. This significantly simplifies the overall implementation of the codec, and as a consequence, the LDPC encoding process is not expected to present a stumbling block of the architecture.
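The logarithmic encoding delay can be illustrated with a balanced tree of two-input XOR gates: a parity bit that depends on k message bits is obtained after ⌈log2 k⌉ gate levels. The sketch below is only meant to make this counting argument concrete; the names `xor_tree` and `encode_parity`, and the 0/1 selection rows passed to them, are hypothetical and do not describe the encoder of the codes considered here.

```python
def xor_tree(bits):
    """Reduce bits with two-input XOR gates; return (parity, gate levels used)."""
    level, depth = list(bits), 0
    while len(level) > 1:
        # One level of the tree: XOR neighbouring pairs in parallel.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])          # odd element passes through unchanged
        level, depth = nxt, depth + 1
    return (level[0] if level else 0), depth

def encode_parity(message, parity_rows):
    """Each parity bit XORs the message bits selected by one 0/1 row."""
    return [xor_tree([b for b, sel in zip(message, row) if sel])[0]
            for row in parity_rows]
```

With eight inputs, `xor_tree` reports three gate levels, in line with the logarithmic delay mentioned above.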
Decoder Implementation
In the proposed approach, the parallel nature of the iterative decoding process is directly exploited in the hardware implementation. Since each of the variable and check nodes makes use of information available from their counterparts only from the previous cycle, it is possible to let these units operate in parallel and complete their operations in one clock cycle.
The main challenge in this implementation is to reduce the complexity of the inter-connects. This problem is solved at the code design level itself. The LDPC codes are hardwired into the chip and have a structure that results in small wiring overhead. The fully parallel design helps avoid storing the code parity-check matrix in a look-up table or some other way.
The hardware architectures used for the variable and check nodes of the decoder are described next.
Variable Node Architecture
The variable node operations are specified by equation (1). The outgoing information through any edge is the sum of the log-likelihood values of the channel information and the information coming into the variable node from all other edges. Hence, at a variable node a series of additions of log-likelihood values is performed. The channel information and check messages are quantized to values that can be represented by 5 bits. Extensive computer simulations show that 5-bit quantization results in very small degradation of the decoder performance in the waterfall region [5, 31], for most types of sufficiently long LDPC codes. Nevertheless, quantization can have a significant impact on the codes' performance in the error-floor region (see, for example, [33], [36], [46]), but this issue will not be dealt with in this paper. Assuming 5-bit quantized messages both from the channel and the checks, a total of $\lceil \log(d_v + 1) \rceil + 1$ stages (levels) of two-input adders is needed to perform the variable operations, where $d_v$ denotes the variable node degree. For this purpose, Manchester adders described in [33] are used. At the beginning of the evaluate period of a clock cycle, the messages from the previous iterations are used to perform a series of additions. The results of these additions are latched and sent as inputs to the check nodes during the next clock cycle. The sign of the sum represents the current estimate of the decoded bit. Figure 1 illustrates the described variable node architecture. Though it is possible to increase the throughput by stopping the iterative process for a given block by checking for its parity, the proposed architecture does not incorporate this feature. This choice is dictated by the constant throughput requirement imposed by most applications. Hence, the number of iterations performed is fixed, and chosen depending on the convergence speed of the decoding process.
To increase the throughput, this number is typically set to 16; in general, a number of 16 or 32 iterations was found to be most appropriate for the code structures proposed in this paper.
For codes with very small gap to capacity, the number of iterations would have to be significantly larger, of the order of several thousands. This follows from the fundamental trade-off between complexity and performance of error-control codes [27]. For this reason, such codes are not suitable for practical implementation. A gap to capacity of approximately 1 dB is usually considered a good choice regarding the trade-off between performance and complexity and the stability of operation of the decoder [35].
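The variable node datapath described above can be sketched in software as follows. One way to read the stage count is a tree of two-input adders over the d_v check messages and the channel value, followed by one subtraction stage per outgoing edge; this reading, the 5-bit two's-complement range, the base-2 logarithm, and the names `saturate`, `adder_tree` and `variable_node` are assumptions of the sketch, not details confirmed by the design.

```python
import math

Q_MIN, Q_MAX = -16, 15   # assumed 5-bit two's-complement message range

def saturate(x):
    """Clip an integer to the assumed 5-bit range."""
    return max(Q_MIN, min(Q_MAX, x))

def adder_tree(values):
    """Sum values with two-input adders; return (sum, number of adder levels)."""
    level, depth = list(values), 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

def variable_node(m0, incoming):
    """Return (outgoing messages, bit estimate, total adder stages)."""
    total, levels = adder_tree([m0] + list(incoming))
    # One extra stage: remove each edge's own contribution from the total.
    outgoing = [saturate(total - m) for m in incoming]
    bit = 0 if total >= 0 else 1      # sign convention is an assumption
    return outgoing, bit, levels + 1
```

For a degree-4 variable node the sketch uses four stages, which agrees with ⌈log2(4 + 1)⌉ + 1.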
Check Node Architecture
At the check nodes, two types of operations are performed, namely parity updates and reliability updates. Since the parity update operation implementation has been dealt with in [3] , and since it has a very small influence on the chip area and power overhead, it will not be discussed in this paper.
The reliability operations described in equation (2) are, as are the variable node operations, performed in the log-likelihood domain in order to avoid multiplication and division operations. The system blocks are required to:
• Perform log/tanh operation on each incoming edge;
• Add all values obtained from these operations on a check node;
• Subtract the incoming value on each edge from the result obtained in the previous step;
• Perform an inverse log/tanh operation on the messages on each of the edges, in order to obtain the "outgoing" information from the variable nodes at the end of an iteration.
Figure 2 shows the reliability update architecture of a check node for the case $d_c = 3$. Finite precision arithmetic is used to develop a PLA-based look-up for the log/tanh and log/arctanh operations, as described below.
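The four reliability-update steps listed above can be mimicked in floating point as follows. The sketch uses the function φ(x) = −ln tanh(x/2), which is its own inverse for x > 0, in place of the PLA look-up tables, and handles the signs separately as a parity update. The names `phi` and `check_node` and the clamping constants are illustrative assumptions.

```python
import math

def phi(x):
    """log/tanh function -ln(tanh(x/2)); it is its own inverse for x > 0."""
    x = min(max(x, 1e-9), 30.0)       # clamp so that the logarithm stays finite
    return -math.log(math.tanh(x / 2.0))

def check_node(incoming):
    """Outgoing messages of one check node (sum-product, log domain)."""
    transformed = [phi(abs(m)) for m in incoming]        # step 1: log/tanh per edge
    total = sum(transformed)                             # step 2: add all values
    negatives = sum(1 for m in incoming if m < 0)
    out = []
    for m, t in zip(incoming, transformed):
        magnitude = phi(total - t)                       # steps 3 and 4
        others_negative = negatives - (1 if m < 0 else 0)
        sign = -1.0 if others_negative % 2 else 1.0      # parity (sign) update
        out.append(sign * magnitude)
    return out
```

Up to rounding, the result agrees with the direct form 2·arctanh of the product of tanh(m/2) over the other edges; a hardware realization would replace `phi` by quantized look-up tables.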
PLA layout
The design of a good PLA layout plays a crucial role in the implementation of the check nodes in an efficient way. The problem of designing good PLA layouts was addressed by one of the authors in [21]. For the sake of completeness, the most important features of the PLAs are described in this section.
A PLA can be considered as a means to directly implement a conjunctive (product-of-sums) or disjunctive (sum-of-products) expression of a set of switching functions. A PLA has an "AND" plane followed by an "OR" plane. In practice, either NAND or NOR arrays are used, with the resulting PLA said to be a NAND/NAND or a NOR/NOR device.
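Let us describe the functionality of a PLA of the NOR-NOR form with $w$ rows, $n$ input variables $x_i$, $i \in \{1, 2, \ldots, n\}$, and $m$ output variables $y_i$, $i \in \{1, 2, \ldots, m\}$. Define a literal $L_i$ as an input variable or its complement. A function $g$ is described by a sum of cubes $g = \sum_{i=1}^{w} C_i$, where each cube is the product of literals $C_i = \prod_j L_j^i$, according to:
$$ g = \sum_{i=1}^{w} C_i = \sum_{i=1}^{w} \prod_j L_j^i = \overline{\overline{\sum_{i=1}^{w} \overline{\sum_j \overline{L_j^i}}}} $$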
In words, the PLA outputs g, which is obtained as the logical NOR of a series of expressions, each corresponding to the NOR of the complement of the literals present in the cubes of g. As can be seen from the schematic view of the PLA core in figure 3 , the outputs of the PLA are implemented by vertically running output lines, which are connected to the vertical word lines implementing the cubes of g. Each cube combines the vertically-running bit-lines implementing the two literals for each input variable, the variable itself and its complement.
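The two-level NOR-NOR evaluation can be checked with a few lines of code. In the sketch below a cube is represented as a dictionary mapping an input index to the value required by its literal; this representation, the names `nor` and `pla_nor_nor`, and the final output inversion (an assumption about output polarity) are illustrative choices only.

```python
def nor(values):
    """Logical NOR of a list of 0/1 values."""
    return 0 if any(values) else 1

def pla_nor_nor(cubes, x):
    """Evaluate g(x) for g given as a sum of cubes, using NOR planes only.

    cubes: list of dicts {input index: required value (0 or 1)}.
    x: list of input bits.
    """
    word_lines = []
    for cube in cubes:
        # The complemented literal is 1 exactly when the literal is violated.
        complemented = [x[i] ^ want for i, want in cube.items()]
        word_lines.append(nor(complemented))   # first NOR plane: value of the cube
    g_bar = nor(word_lines)                    # second NOR plane: complement of g
    return 1 - g_bar                           # assumed inverting output buffer
```

For example, the exclusive-or of two inputs is described by the two cubes `{0: 1, 1: 0}` and `{0: 0, 1: 1}`.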
For the message passing algorithm, literals represent the 5-bit quantized message input log-likelihoods, so a NOR-NOR layout of the function $g$ involving $2^5 = 32$ terms is designed accordingly. For the check node PLAs, a logical function consisting of at most 32 terms is used to implement the log-tanh operations. Based on the underlying logic sharing operations, this number can be modified. The corresponding outputs are retrieved from the output plane through their designated buffers.
For our proposed decoder design, pre-charged NOR-NOR PLAs [21, 20] are used. This is motivated by several observations (see figure 3): when a word line of a PLA switches to "high", it may happen that some neighboring lines switch to "low". The worst case switching delay occurs when all neighboring lines of one line, set to "high", are in a "low" state.
For a pre-charged NOR-NOR PLA, and for every word-line, its neighbors are restricted to either switch with it or remain static. This obviously results in reduced delay deterioration due to cross-talk, when compared to the former case. As a consequence, in a pre-charged NOR-NOR PLA, a word-line of the PLA must switch from "high" to "low" at the end of any computation, or remain pre-charged. In order to ensure that the output of the PLA is sampled only after the slowest
In order to perform a valid comparison between a single PLA implemented in our layout style and the standard-cell layout style, we implemented both styles for four examples. The delay results were obtained utilizing SPICE [32], while the area comparison was obtained from actual layouts of both styles using two routing layers. The standard-cell style layout was done by technology-independent optimizations in SIS [44], afterwards mapping the circuit using a library of 11 standard-cells, which were optimized for low power consumption. Placement and routing were done using the wolfe tool within OCT [4], which in turn calls TimberWolfSC-4.2 [43] for placement and global routing, and YACR [34] for completion of the detailed routing.
The examples for the PLA layout style were flattened, then the magic [16] layout for the resulting PLA was generated using a Perl script. In order to perform the delay computation, a maximally loaded output line pulled down by a single output pull-down device was simulated.
• n denotes the number of input lines or variables;
• m denotes the number of output lines or variables;
• w denotes the number of rows in the PLA;
• D refers to the delay in picoseconds;
• A refers to the layout area of the resulting implementation in square grids.
The values of D for the standard-cell layout style were obtained as the maximum values after simulating about 20 input test vectors. It has to be taken into consideration that wire capacitances, which would increase the delay in the standard-cell implementation, were not accounted for. The delay numbers and area sizes for the PLA layout style are taken as worst-case values. Although this biases the comparison in favor of the standard-cell style, impressive improvements of the PLA layout style over the standard-cell layout style can be observed. The PLA layout requires only an area between 33 and 81 per cent of the standard-cell layout area, while the average area requirement of the PLAs is 46 per cent and the average delay is 48 per cent of the standard-cell layout style. These favorable area and delay characteristics of the PLA are due to the following reasons:
• In the standard-cell implementation, traversing different levels (i.e. gates) of the design leads to considerable delays, while the PLA logic functions have a compact 2-level form with superior delay characteristics, as long as w is bounded.
• Local wiring delays and wire delay variations due to crosstalk are reduced in the PLA, since it is collapsed into a compact 2-level core.
• Extremely compact layout is achieved in the PLA by using minimum-sized devices.
• In a standard-cell layout, both PMOS and NMOS devices are used in each cell, leading to a loss of layout density by the PMOS-to-NMOS diffusion spacing requirements. In contrast, exclusively NMOS devices are used for the PLA core, which can be placed extremely close to each other.
• Finally, PLAs are dynamic, and hence faster than static standard-cell implementations.
In summary, the advantages of the proposed realization are cross-talk immunity and favorable delay and area characteristics compared to traditional standard-cell based ASICs. By utilizing these novel PLAs, interconnected in the manner of [21], all these characteristics can be exploited to implement fast, fully parallel LDPC codecs. For each check node, $2d_c$ PLAs and $\lceil \log(d_c) \rceil + 1$ two-input adders have to be used to accommodate its underlying operations. The checks and the variables are hard-wired with separate wiring in either direction. As already pointed out, uniform 5-bit quantization is performed on the messages, although it is also possible to implement non-uniform quantization schemes suited to the particular channel noise density function. Accuracy of operation can be improved by using non-uniform quantization that can be adaptively changed based on the evolution of the check and variable message densities. The PLA design needs minimal modification to allow for such flexibility.
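If one is willing to somewhat compromise the decoding performance of a code, an alternative belief propagation algorithm can be implemented: the sum-product algorithm can be approximated by the min-sum algorithm, for which the outgoing check-node messages are computed as
$$ m_{cv}^{(l)} = \left( \prod_{v' \in V_c \setminus \{v\}} \operatorname{sign}\left(m_{v'c}^{(l-1)}\right) \right) \cdot \min_{v' \in V_c \setminus \{v\}} \left| m_{v'c}^{(l-1)} \right| $$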
This min-sum approximation leads to an overestimate of the magnitudes of the true message values [50], but the simpler implementation of the min and sign functions greatly reduces the check node complexity, requiring less complicated circuitry and less PLA chip area.
A set of clusters aligned in the form of a square rim will be called a ring. The size of the ring is the number of banks of clusters on one side of the square. Denoting the size of a bank of C/V node clusters in ring $i$ by $a + 2i$, and the total number of check nodes by $m$, one obtains the following formula for the number of rings $r$ in the above described concentric construction:
Alternative C/V cluster packings with different variable to check node ratios can be used for the min-sum version of the iterative decoding algorithm, making the number of blocks packed highly depending on the algorithm; it also makes the C/V cluster structure more amenable for lower-rate codes. Furthermore, different variable to check-node packing ratios can be used for generalized LDPC codes, described in more detail in section 8.
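For reference, the min-sum check-node rule mentioned above (product of the signs and minimum of the magnitudes of all other incoming messages) can be sketched as follows, together with the exact tanh rule for comparison. The function names are hypothetical, and the sketch assumes check nodes of degree at least two.

```python
import math

def min_sum_check_node(incoming):
    """Min-sum outgoing messages: sign product times smallest other magnitude."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]   # all edges except edge i
        sign = 1.0
        for m in others:
            if m < 0:
                sign = -sign
        out.append(sign * min(abs(m) for m in others))
    return out

def tanh_rule_check_node(incoming):
    """Exact sum-product outgoing messages, for comparison only."""
    out = []
    for i in range(len(incoming)):
        prod = 1.0
        for j, m in enumerate(incoming):
            if j != i:
                prod *= math.tanh(m / 2.0)
        out.append(2.0 * math.atanh(prod))
    return out
```

Comparing the two functions on the same inputs shows that the min-sum magnitudes are never smaller than the exact ones, while the signs agree.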
As described before, the PLAs for the reliability operations of check nodes require a large chip area, which allows arrangements of C/V node clusters with a large number of variable nodes neighboring a check node as shown in figure 5 .
This structure can be especially used for high-rate codes with a large ratio in the number of variable nodes compared to the number of check nodes.
The regularity inherent in the IC architecture of figure 5 represents an input constraint for the code construction problem. In particular, the locality of a check node and several variable nodes in a cluster is exploited during the code construction process. In order to minimize the length of long wires between check and variable nodes, the codes are additionally constrained in such a way that nodes in the $S_1$ bank do not communicate with nodes in the $S_4$ bank and, likewise, nodes in the $S_2$ bank do not communicate with nodes in the $S_3$ bank. Prototype codes of this kind have been constructed, and custom IC implementations of these codes have been developed, with very good results presented in section 7. The resulting design has the property that wiring is sparse and that long wire lengths are minimized due to the fact that the codes are constructed so as to exploit the regularity of the above architecture. At the same time, code performance does not have to be significantly compromised by introducing this constraint, as will be seen in the subsequent sections.
For the purpose of achieving more flexibility in the code design process, and hence in the achievable error-correcting performance, alternative layouts can be considered as well. The layouts introduce some losses in desirable VLSI implemen- The idea is to introduce a bridge connecting the basic units across the clocking control region in the center of the chip.
This can increase the percentage of variable nodes communicating across the central region of the chip and lead to improved code performance.
Another approach is to make use of a chip with a 2:1 aspect ratio instead of a square aspect ratio, and to additionally eliminate the central clocking control unit. The proposed architecture is shown in figure 6. This architecture also allows for larger flexibility in the code design process by ensuring the communication of a larger fraction of units across the chip without the constraints imposed by routing and delay issues.
LDPC Codes for the Concentric Construction
Constraints on LDPC Codes from VLSI Implementation Structure
For the concentric VLSI implementation described in the previous section, an LDPC code can be constructed based on the following set of constraints:
• Variable and check nodes on opposite sides of the chip should not be mutually connected, or less restrictively, very few connections should exist between them; this ensures that no wires cross the central region of the block or very few do so.
• Only nodes on the border of two neighboring sides of the chip are allowed to exchange messages during the decoding process; this ensures highly localized wiring.
Posed as constraints on the code design process, these requirements take the following form. Assume that U denotes the set of variable nodes of the code, and that W denotes the set of parity-check nodes. We seek a code with good error-correcting characteristics that allows for a partition of the set U into four subsets U 1 , U 2 , U 3 , U 4 , approximately of the same size. If S i denotes the subset of W that checks variable nodes in U i , i = 1,2,3,4, then one should limit the intersection between those subsets to:
for some integers $s$ and $c$ such that $c \ll s$, and $c$ sufficiently small. In this setting, the vertices in $S_1$, $S_2$, $S_3$, and $S_4$ will be assigned to the four different sides of the chip, and there will be very limited or absolutely no interaction between these sides. Furthermore, the variables in the intersection of sets $S_1$ and $S_2$, say, will be placed on the edge between the two corresponding sides. For a code of interest, a structure satisfying these constraints can be obtained by selectively deleting some non-zero entries in the parity-check matrix. This has to be done in such a way as neither to make the code graph disconnected nor to have a large number of variables of degree less than or equal to two. Furthermore, one can devise code construction methods that directly address the constraints posed in equation (6).
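Whether a candidate parity-check matrix respects the bank structure can be tested mechanically. The sketch below computes, for a given partition of the variable nodes into four subsets, the sets of checks adjacent to each subset and counts the checks shared by opposite banks. It assumes, following the layout description, that banks 1 and 4 face each other, as do banks 2 and 3; the function names and the toy matrix in the usage note are hypothetical.

```python
def checks_of(H, variables):
    """Set of check (row) indices adjacent to any variable (column) in `variables`."""
    return {c for c, row in enumerate(H) if any(row[v] for v in variables)}

def opposite_bank_overlap(H, partition):
    """Return (|S1 ∩ S4|, |S2 ∩ S3|) for partition = [U1, U2, U3, U4].

    Each U_i is a list of column indices; S_i is the set of checks touching U_i.
    Banks 1/4 and 2/3 are assumed to lie on opposite sides of the chip.
    """
    S = [checks_of(H, U) for U in partition]
    return len(S[0] & S[3]), len(S[1] & S[2])
```

A matrix for which both counts are zero needs no wires across the central region; a small positive count corresponds to the relaxed case in which a few such wires are tolerated.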
To clarify the code-design ideas, we consider a "toy-example" of a rate 1/2 code with parity-check matrix given in equation (7) . It is to be observed that the code described by H is of no practical use, since it is of length 24 only and its graphical representation contains a very large number of four-cycles. It can also be seen that the matrix in equation (7) contains linearly dependent and repeated rows. Nevertheless, it is straightforward to explain all the underlying constraints and design issues on such a simple structure.
The vertical labels in the matrix of equation (7) represent the banks of the chip-layout and the horizontal labels represent the variable nodes. All check-nodes with the same label are in the same bank of the layout. Thus, for this case one has: 
Based on equation (8), one can see that the code matrix in equation (7) can be used without any modifications for the proposed design approach. As a result, no wires will be crossing the central region of the chip. Furthermore, although this scenario is not directly applicable in this case, one can make the desired code's parity-check matrix slightly irregular, by deleting certain ones in $H$, in order to meet the implementation constraints of equation (6). This process is to be performed in such a way as to eliminate edges that result in wirings between opposite banks. In addition, such "sparsifying" could also be performed to reduce, rather than completely eliminate, the number of wires crossing the central section of the chip. Consequently, only few entries in the parity-check matrix would be modified, ensuring that with overwhelming probability the overall code characteristics and parameters are not compromised.
The variables in the intersections of adjacent banks can be placed at the "diagonals" of the concentric chip. Placement within the $S_i$, $i = 1, \ldots, 4$, banks themselves can be governed by known proximity-preserving space-filling curves, such as the Hilbert-Peano (HP) curve or Moore's version of the HP curve (HP-M) [42]. The square-traversing structures of these two curves (dimension four) are depicted below.
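To give a concrete idea of such a proximity-preserving placement, the sketch below maps a position d along a Hilbert curve to grid coordinates on an n × n array, with n a power of two, using the commonly used iterative index-to-coordinate conversion. The function name `hilbert_d2xy` and the idea of using the returned coordinates as node positions inside a bank are illustrative assumptions; the Moore variant is not shown.

```python
def hilbert_d2xy(n, d):
    """Map index d (0 <= d < n*n) on a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Consecutive indices always land on neighbouring grid cells, so nodes that are close in the linear ordering are also placed close to each other on the chip.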
An example of a practically important code parity-check matrix, with the partition property described in equation (6) and with $c = 0$, is shown below,
The question of interest is how to choose the blocks $H_{1,1}, \ldots, H_{4,2}$ so that the resulting code has good performance under iterative message passing, and at the same time has a simple structure that is amenable to practical implementation and allows for easy encoding. This problem is addressed in detail in the next section.
Construction Approach Based on Difference Sets
Several design strategies for $H_S$ are described below. The sub-matrices $H_{i,j}$, $i = 1, \ldots, 4$; $j = 1, 2$, are chosen to be row/column subsets of "basic" parity-check matrices $H$ based on permutation blocks, as described in more detail by the authors in [48].
For the first technique the "basic" parity-check matrix H is of the form 
where $P$ is of dimension $N$, $i_{k,l} \in \mathbb{N} \cup \{-\infty\}$ and $P^{-\infty}$ stands for the zero matrix of dimension $N$. The integers $i_{k,l}$ are taken from so-called Cycle-Invariant Difference Sets (CIDS) of order $h$ and cyclic shifts thereof [30]. CIDSs are a subclass of Sidon sets [30] which can be easily constructed according to the formula
where GF(q) denotes a finite field with a prime number of elements q. For (N = 5, h = 2) and (N = 7, h = 4) two such sets (mod 2400). The resulting codes have girth six. The last claim is a consequence of the result described by the authors in [11] .
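The block structure of such a "basic" parity-check matrix can be made explicit by expanding an array of exponents $i_{k,l}$ into circulant permutation blocks. In the sketch below, `None` plays the role of the exponent $-\infty$ (zero block), and $P$ is taken to be the cyclic shift with ones at positions (r, r + 1 mod N); both conventions, the function names, and the small exponent array in the usage note are assumptions made for illustration and are not the CIDS-based exponents of the codes discussed here.

```python
def circulant_power(N, shift):
    """N x N matrix P^shift for the cyclic-shift permutation P.

    shift=None stands for the exponent -infinity, i.e. the all-zero block.
    """
    if shift is None:
        return [[0] * N for _ in range(N)]
    return [[1 if (r + shift) % N == col else 0 for col in range(N)]
            for r in range(N)]

def expand(exponents, N):
    """Assemble the parity-check matrix from an array of block exponents."""
    H = []
    for block_row in exponents:
        blocks = [circulant_power(N, e) for e in block_row]
        for r in range(N):
            # Concatenate row r of every block in this block-row.
            H.append([bit for blk in blocks for bit in blk[r]])
    return H
```

For example, `expand([[0, 1, None], [2, None, 0]], 3)` yields a 6 × 9 matrix in which every row contains exactly two ones.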
Next, we choose the first two rows of the CIDS based LDPC codes for $H_{1,1}$, and form the other sub-blocks from row and column subsets of the parity-check matrices of these codes. Two examples for CIDS-set based parity-check matrices are shown below. The first corresponds to a rate $R = 1/3$ code with $d_v = 4$, $d_c = 6$, while the second corresponds to a rate
In both cases, the dimension of $P$, the basic circulant permutation matrix, is $7^4 - 1 = 2400$.
Both codes have length $2 \times 6 \times (7^4 - 1) = 28800$, and are free of cycles of length four and six (i.e. the girth $g$ of the codes is at least eight). Lower bounds on the minimum distances $d$ of the codes of rate 1/2 and 1/3 can be obtained from the well-known formula due to Tanner [45],
and are equal to eight and six, respectively. Figure 8 shows the BER curves for these codes for different numbers of decoding iterations. For the simulations, 5-bit quantized messages were used. Observe that the LDPC code of rate 1/2 with constraints imposed by the VLSI implementation exhibits an error-floor type behavior at BERs of the order of $10^{-5}$. The rate 1/3 code represents an interesting example of the rare occurrence of multiple error floors. One possible explanation for the onset of such a phenomenon is the decrease in the diameter of the code graphs represented by the matrices in (13) and (14), as compared to the original code graph. The diameter of the graph is the maximum, over all pairs of variable nodes, of the length of the shortest path between them, and it measures the quality of "information mixing" in the code graph. The error floors might also be due to the emergence of different small trapping sets in the code. Despite their good code properties such as girth of ten, these codes show a surprisingly weak performance and will not be considered for implementation purposes.
As observed in the simulations, the improvement in the error-correcting ability of the resulting code in this case is too small to justify the introduction of longer wires.
If one is willing to compromise the throughput in order to achieve better quality of error-protection, the number of iterations can be increased to several hundreds. For the example of the rate 1/3 codes shown in figure 8, the resulting trade-off is summarized in Table 2.
Table 2: BER and throughput at 2.27 dB as a function of the number of iterations for the rate-1/3 code (50% duty cycle)
Construction Approach Based on Lattice Codes
A different technique for designing H S is based on array codes [48] , described in terms of a parity-check matrix of the form:
for some odd prime q, and a circulant P of order q. To construct a code with non-interacting banks, all that is needed is to select an appropriate set of block-row labels A = {a_0, a_1, ...} ⊆ {0, 1, ..., i} and block-column labels B = {b_0, b_1, ...} ⊆ {0, 1, ..., (q − 1)} that are retained, and to delete all other permutation matrices from the matrix. To ensure good code performance, we suggest the use of improper array codes (IAC), a type of shortened array codes of column weight four (d_v = 4) described by the authors in [29]. These codes guarantee girth at least ten for sets of exponents chosen to avoid cycle-governing equations. Choosing these sets of exponents, the parity-check matrix for the lattice-based codes of rate 1/3 with the special structure described by equation (10) is defined by exponents which are products of the form a_i · b_j, i = 0, 1, 2, 3, j = 0, 1, 2, 3, 4, 5:
Codes of different rate (e.g. 1/2) can be obtained by deletion of block-rows, similar to the transition from equation (13) to equation (14).
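The exponent rule a_i · b_j can be illustrated with a small sketch. The code below assumes that P is the cyclic shift by one position and uses placeholder label sets and a toy value of q; these are not the sets or the circulant order used for the codes of this section, and the function names are hypothetical. It assembles the block matrix whose (i, j) block is P raised to the power a_i · b_j (mod q) and checks only for cycles of length four, which is a much weaker property than the girth-ten guarantee of the IAC construction.

```python
def circulant_power(q, e):
    """q x q circulant permutation matrix P^e, where P is the cyclic shift by one."""
    e %= q
    return [[1 if (r + e) % q == c else 0 for c in range(q)] for r in range(q)]


def array_code_matrix(q, row_labels, col_labels):
    """Block matrix whose (i, j) block is P^(a_i * b_j mod q)."""
    H = []
    for a in row_labels:
        blocks = [circulant_power(q, a * b) for b in col_labels]
        for r in range(q):
            # Concatenate row r of every block in this block-row.
            H.append([x for blk in blocks for x in blk[r]])
    return H


def has_four_cycle(H):
    """True if two columns of H share more than one row, i.e. a length-four cycle exists."""
    n = len(H[0])
    supports = [{r for r in range(len(H)) if H[r][c]} for c in range(n)]
    return any(
        len(supports[c1] & supports[c2]) > 1
        for c1 in range(n)
        for c2 in range(c1 + 1, n)
    )


if __name__ == "__main__":
    # Placeholder labels and a toy prime q, chosen only to keep the example small.
    H = array_code_matrix(5, [0, 1, 2], [0, 1, 2, 3])
    print(len(H), len(H[0]), has_four_cycle(H))
```

For a prime q and distinct labels, a four-cycle would require (a_1 − a_2)(b_1 − b_2) ≡ 0 (mod q), which cannot occur; this is why the check fails only when q is composite or labels repeat.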
The performance of shortened array (IAC) codes of rate 1/3 for the above described sequences used for equation (18) is shown in figure 9. Since q = 911, the resulting length of the code is 12 × 911 = 10932. Simulations showed no error floor up to a BER of 10^−7. For comparison, a random-like (irregular) code of length 10800 constructed by the progressive edge-growth (PEG) algorithm [17], with optimized degree distributions obtained from [47], serves as a benchmark of the code performance. If λ_i denotes the fraction of variable nodes of degree d_v = i, the chosen variable degree distribution is {λ_2, λ_3, λ_5, λ_7, λ_15} = {0.5509, 0.2386, 0.1320, 0.000052, 0.0784}. As can be seen, at a bit error rate of around 10^−5, the IAC code with the special VLSI structure has a performance gap of close to 1 dB compared to random-like codes.
Construction Approach based on PEG Codes
The VLSI-adjusted code construction based on cycle-invariant difference sets shows very high or even multiple error floors, and the VLSI-adjusted construction based on lattice codes, despite clear improvements, still has a gap of almost 1 dB compared to random-like irregular LDPC codes. We will now demonstrate that LDPC codes having a VLSI-adjusted structure for the concentric construction can also closely follow random-like code performance, provided that some constraints are relaxed: the regularity of the code, the wiring structure within the banks S_1, S_2, S_3 and S_4, and the easy encodability of the code.
This comes at the cost of somewhat more complex localized wiring and a slightly suboptimal chip area usage within the different sections of the concentric design in figure 5.
Besides using a regular "basis" matrix consisting of permutation matrices, one can construct VLSI-implementation adjusted codes of the structure shown in equation (10) from random-like irregular LDPC codes constructed by progressive edge-growth (PEG) techniques [17] . This code construction algorithm is known to generate random-like codes with excellent performance when used with optimized degree distributions [47] .
A length-n VLSI-implementation adjusted welded code is constructed by arranging four copies of a PEG-optimized code of length n/4 along the diagonal of an otherwise empty matrix (see figure 10a). Decoding with this matrix is equivalent to processing four codewords of length n/4 at once. Interaction between different sections on the chip is generated by randomly shifting half of the ones in each row by n/4 positions to the right. When the same set of shifts is performed for each of the four sub-matrices, not only the existing row degrees but also the optimum column degrees of the overall matrix are preserved. The resulting parity-check matrix structure in figure 10b can easily be made to fit the structure of equation (10) by interchanging the second and third block-rows. Generating interaction between sub-matrices of a parity-check matrix in a random-like or structured manner has previously been described as "welding" [8] of sub-matrices; welded codes can sometimes outperform PEG-constructed codes.
Figure 11 compares the performance of VLSI-implementation adjusted codes of different lengths and rates to PEG-constructed codes of the same length and degree distribution without the special VLSI structure. As can be seen, for rate 1/4 codes of length 48000, comparable to standardized codes used for mobile communications [9], as well as for length 10800 codes of rates 1/3 and 3/4, there is only a small performance degradation for welded codes compared to PEG codes, despite their VLSI-adjusted structure and possible loss in diameter.
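The welding step described above can be sketched in a few lines. The code below is an illustration under stated assumptions, not the construction program used for figure 11: the function name weld is hypothetical, the sub-matrix is passed as a list of 0/1 rows, "half" is rounded down for rows of odd weight, and ones shifted out of the last block are assumed to wrap around cyclically to the first block. The reordering of block-rows needed to match equation (10) is not shown.

```python
import random


def weld(H_sub, seed=0):
    """Place four copies of H_sub on the diagonal and shift half of the ones in each row.

    The same set of shifts is applied to each of the four copies.
    Shifts are by n/4 columns to the right, wrapping around cyclically (an assumption).
    """
    rng = random.Random(seed)
    m, k = len(H_sub), len(H_sub[0])  # k = n/4
    n = 4 * k

    # For every row of the sub-matrix, choose which ones are moved (half, rounded down).
    moved = []
    for row in H_sub:
        ones = [c for c, x in enumerate(row) if x]
        moved.append(set(rng.sample(ones, len(ones) // 2)))

    H = [[0] * n for _ in range(4 * m)]
    for b in range(4):  # index of the diagonal block
        for r, row in enumerate(H_sub):
            for c, x in enumerate(row):
                if x:
                    shift = k if c in moved[r] else 0
                    H[b * m + r][(b * k + c + shift) % n] = 1
    return H
```

Because every block-column loses exactly the ones that its left neighbour hands over, each column keeps the degree it had in the original sub-matrix, which is the degree-preservation property stated above.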
Estimation results
We applied the proposed method of decoder implementation using a 0.1 µm process [1]. The delay and size estimates of the PLA were based on [20, 21], while the size estimates of the adders were taken from [33]. An accurate delay/power description of both these hardware units was obtained from SPICE simulations. It should be noted that in computing the size/delay/power estimates of adders and PLAs, wiring overhead, routing delay and the parity update operations at the checks are not accounted for. A minimal overhead is incurred upon incorporating these schemes.
Table 3: Estimates for n = 28800, rate 1/2
As an example, rate 1/3, 1/2 and 3/4 codes, suited for a variety of applications, are considered. In the first case, the column weight d_v is equal to four, while the number of decoding iterations is 16. Tables 3, 4 and 5 show throughput, chip size and power estimates for these given rates and lengths 28800, 7200 and 8992, respectively. The tables show that the maximum achievable throughput is between one and two orders of magnitude higher than that demanded by most applications. By lowering the clock speed, the power consumption can be brought down.
In order to compare the proposed approach with the standard-cell based implementation in [3], the estimates for a regular rate 1/2 code on 0.16 µm technology are provided as well. The parameters considered are n = 1024, d_v = 3, number of iterations equal to 64, and a power supply voltage of 1.5 V. For a throughput of 1 Gbps, the side of the square chip based on the proposed implementation is 2.956 mm, with a power dissipation of 0.723 Watts. This is a considerable improvement on the area figures provided in [3], where a similar code dissipated 0.690 Watts with a chip size of 7.5 mm × 7 mm.
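The size of the improvement can be made explicit with a short calculation that uses only the figures quoted above. The variable names are illustrative, and the comparison ignores the overheads that are excluded from the estimates of this section.

```python
# Figures quoted in the text for the n = 1024, rate 1/2 comparison.
side_mm = 2.956                  # side of the square chip, proposed implementation
power_proposed_w = 0.723         # power dissipation, proposed implementation
area_reference_mm2 = 7.5 * 7.0   # chip size reported in [3]
power_reference_w = 0.690        # power dissipation reported in [3]

area_proposed_mm2 = side_mm ** 2
area_ratio = area_reference_mm2 / area_proposed_mm2
power_ratio = power_proposed_w / power_reference_w

print(f"proposed area:  {area_proposed_mm2:.2f} mm^2")
print(f"reference area: {area_reference_mm2:.2f} mm^2")
print(f"area reduction: {area_ratio:.1f}x")
print(f"power ratio:    {power_ratio:.3f}")
```

The estimated area is about 8.74 mm^2 against 52.5 mm^2, roughly a sixfold reduction, at a power dissipation that is about 5% higher.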
As a concluding remark, we observe that in the proposed implementation, the delay introduced by the variable nodes is almost three times smaller than that of a check node. It is therefore possible to further reduce the size of the chip by using multiplexers that would allow a single variable node unit to perform calculations for two variable nodes in a single clock cycle. This strategy would involve additional multiplexers, de-multiplexers and latches, but would halve the number of variable node units. In most modern applications, it might also be necessary to incorporate the channel detection block into the bipartite graph structure, as shown in Figure 12. In such a case, the channel nodes perform the same set of operations as the variable nodes and present a minor overhead in terms of area and power dissipation. As an example, we considered a length n = 7200 code with a channel detection scheme added to the decoder, and a total number of 32 iterations. Such a code would have a chip side of 6.1806 mm and a power dissipation of 3.6386 Watts. Based on a very conservative estimate, including all overheads arising from timing recovery circuits and serial-parallel and parallel-serial conversion blocks is not expected to increase the side of the chip by more than 15% of its current value.
Generalized LDPC Codes
The implementations proposed in the previous sections can be easily adapted to accommodate generalized LDPC (GLDPC) codes [24] . GLDPC codes show excellent performance under a combination of iterative message passing and belief propagation algorithms, and for a wide variety of channels [7, 28] . There are two variants of GLDPC codes that one can consider.
The first is the case when each check in the global parity check matrix is a short LDPC code itself (alternatively, each one in a row is replaced by a different column of a smaller length LDPC code). In this case, a natural generalization of the proposed architectures is a fractal concentric architecture. In this realization, each "local" code is implemented as a concentric sub-unit. These units can now be looked at as the basic building blocks of the "global" code. It is to be noted that the "check" blocks in this case will each have a bigger area compared to the blocks of a standard LDPC code implementation. In addition, GLDPC codes usually have a much larger overall parity check matrix. These characteristics impose a constraint on the smallest achievable size of a fractal-like chip. Consequently, a partly parallel implementation seems to be a more attractive solution for this problem. For example, for a GLDPC code with, say, 80000 variable nodes, it is possible to apply the concept of semi-parallelism. It would be reasonable to scale down the level of parallelism by a factor of 16, leaving only 5000 variable node units and a correspondingly decreased number of check units. Of course, in this case the throughput will decrease by the same factor, but it would still be comparable to that of LDPC counterparts of the same rate. Hence, with this approach, it is possible to improve the error performance for the same throughput and almost the same chip size and power consumption.
Another variant of GLDPC codes has the property that each check node represents a short algebraic code, for which an appropriate MAP decoder is used during global iterative decoding [24]. In this case, at each check node one would have to replace the standard tanh and arctanh operations by MAP decoding circuits (this also justifies using PLA circuits, rather than standard-cell ones, since MAP decoding operations tend to be complex).
Thus, the area of each check logic will increase based on the size of the MAP-decoder unit. For example, if a simple [7, 4, 3] Hamming code is used as a local code, a 128 × 7 table look-up may be required. As in the previous scenario, a partly parallel implementation would provide a solution with practical chip sizes, while allowing for good code performance.
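One possible reading of the 128 × 7 figure is a table with one entry for each of the 2^7 = 128 hard-decision input words, each entry holding the 7 bits of the decoded codeword. The sketch below builds such a table under that assumption, using one common systematic generator matrix for the [7, 4, 3] Hamming code and nearest-codeword decoding; a MAP decoder operating on quantized soft messages, as used in the decoder described here, would need a different and larger table. The names G, encode and build_table are hypothetical.

```python
from itertools import product

# One common systematic generator matrix of a [7, 4, 3] Hamming code (an assumption).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(msg):
    """Multiply the 4-bit message by G over GF(2)."""
    return tuple(sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G))


CODEWORDS = [encode(msg) for msg in product((0, 1), repeat=4)]


def build_table():
    """Return a list of 128 entries; entry i is the codeword nearest to the 7-bit word i."""
    table = []
    # product() enumerates words in lexicographic order, so the list index
    # equals the word read as a binary number (first bit most significant).
    for word in product((0, 1), repeat=7):
        nearest = min(
            CODEWORDS,
            key=lambda cw: sum(a != b for a, b in zip(word, cw)),
        )
        table.append(nearest)
    return table


if __name__ == "__main__":
    table = build_table()
    print(len(table), len(table[0]))
```

Since the Hamming code is perfect, every 7-bit word lies within distance one of exactly one codeword, so the nearest codeword is always unique and each codeword appears in exactly eight table entries.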
Reconfigurability
In the context of LDPC decoding, circuit reconfigurability can be achieved by implementing the codes using reconfigurable wiring, and multiplexed tanh and arctanh nodes. Given a fixed number and arrangement of check and variable nodes, one can develop several codes that differ in their connectivity of check and variable nodes, but have a "nested" structure. The latter allows for the wiring differences between the codes to be minimized, resulting in a maximally area-efficient design.
Using these ideas, we predict that such an architecture can operate with a throughput of 25 Gbps and a power consumption of about 0.7 W, for code lengths of approximately 20,000 and rates 1/6, 1/3, 1/2, 2/3 and 5/6. The overall chip size is estimated to be 14 mm on one side.
Applications of the proposed LDPC Code Implementations
The extremely powerful and yet fairly simple error control coding schemes of the form of codes on graphs are currently considered for applications in storage systems, optical communications, as well as wireless systems. We will briefly discuss some potential applications of the practical design scheme proposed in this work.
Since the emergence of magnetic, optical and solid-state recording technologies, the main force supporting their progress has been the improvement of areal storage density. The most promising storage systems that have emerged in the recent past are multi-layer and multi-level recorders and nanoscale-probe storage techniques [49]. Powerful error control techniques are required in particular by systems based on atomic force microscopy (AFM), e.g. the "Millipede", a thermo-mechanical data-storage system based on AFM and micro electro-mechanical systems (MEMS) in which data is recorded in blocks of 1024 × 1024 arrays. First results on utilizing codes on graphs for modern storage systems, namely LDPC codes with iterative decoding for both transversal and perpendicular magnetic recording, have been presented in [39, 40, 41], while joint message-passing decoding of LDPC codes over partial response channels was addressed in [23].
The results of these investigations suggest that very large performance gains can be achieved from utilizing such coding schemes instead of Reed-Solomon (RS) codes, the well-known standard coding schemes in tape and disk systems. A debate is still going on as to how to make a fair comparison of complexity and performance between soft-decision LDPC codes, which have inherently more complex decoders, and hard-decision RS codecs, whose circuitry is complex due to their operation over finite fields of large order. Since quantized soft information can be used for iterative decoding (3-5 bits suffice for this purpose), the fact that all operations are performed over a binary field makes codes on graphs an attractive scheme compared to RS codes.
For the proposed code design technique, the decoder chip can be made very small and power-efficient, and the decoder can also be easily incorporated into a larger system involving channel state estimation/equalization and timing recovery, as described in [2]. Finally, storage system code constraints, such as high rate (usually exceeding 0.8), lead to an even smaller implementation complexity, due to the fact that such codes have a small number of check nodes. For possible applications in nano-storage systems, fractal-like generalized LDPC codes developed by the authors [7] can be used, since they represent extensions of product codes well suited for two-dimensional recording systems.
For wireless communication systems, there already exists a prototype vector-LDPC architecture developed by Flarion Technologies [12] . The central block of the architecture is a programmable parallel processor that reads a description of the particular LDPC code from memory. Several codes can reside in the device at once, and switching between them incurs no overhead. The Flarion LDPC technology was integrated into a mobile wireless communications system for end-to-end Internet Protocol (IP)-based mobile broadband networking. The modulation schemes supported by Flash-OFDM include QPSK and 16QAM. The coding rates currently used are 1/6, 1/3, 1/2, 2/3, and 5/6, and the system uses adaptive modulation to rapidly switch between codes. The current maximum data throughput in the Flash-OFDM system is 3 Mbits/sec, but the decoder actually supports speeds of up to 45 Mbits/sec. Several technical aspects of their design, such as code construction, power consumption information and chip-size are not disclosed. Also the FPGA and ASIC based implementations of the Flarion solution suggest that the throughput of their design is substantially lower compared to a custom IC implementation such as the one described in this work.
The ideas described in this paper propose an LDPC coding scheme construction with a significantly broader perspective.
The described ideas can be extended or modified in order to cover a very wide range of other system architectures, for example, in concatenation with Multiple-Input Multiple-Output (MIMO) wireless systems. As opposed to the Flarion technique, the idea in this paper is based on a fully parallel implementation and the use of PLAs with a low wiring overhead.
In this way, one avoids the problem of storing the code description: the code structure is implicit in the wiring of the chip itself. Also, in contrast to the Flarion implementation, the custom IC based solution proposed here can have the property of on-the-fly reconfigurability between codes, with significantly improved throughput, as described in the previous sections.
Some additional initial experimental results show a decoding throughput of 25 Gbps and a power consumption of about 0.7 W, for a code of length 20000 and rates 1/6, 1/3, 1/2, 2/3 and 5/6 with a die size of 14 mm on a side. Nevertheless, one has to point out that the Flarion implementation includes other functionalities, such as channel estimation and automatic repeat request (ARQ) controls, which can account for their observed performance.
LDPC codes are also becoming increasingly important in modern high-speed long-haul wavelength-division multiplexing (WDM) systems; here, they can be used to provide a necessary system margin or they can effectively increase the amplifier spacing, transmission distance and system capacity. Optical networking interface devices employing a rate 1/2 LDPC code of block length n = 1024 were recently developed by Agere Systems. As in the case of storage systems, high code rates and relatively short code lengths are important design parameters for these applications, and both can be easily accommodated by the code architecture proposed in this paper. Full details regarding code implementations for these applications will be described elsewhere.
Conclusions and Future Research
A general high throughput VLSI architecture was proposed that can be used to design LDPC decoder chips for specific applications such as wireless communications, magnetic recording, or optical communications. By using an efficient code design criterion and a regular chip floor plan, which is exploited during code construction, a high speed, low area design was developed. Furthermore, based on some preliminary estimates, it was concluded that practical size and power constraints can be met with the proposed setting. The current problem of interest is to develop techniques for reducing the power consumption of the chip even further.
