Abstract: This paper presents a general procedure for designing low density parity check (LDPC) codes for multiprocessor software defined radio platforms. Our approach is to design the LDPC code to match the constraints imposed by the hardware architecture, without compromising on the communication performance. The proposed architecture-aware code design procedure involves feature identification, code construction and verification. We demonstrate the effectiveness of our procedure for three cases. If the local memory of the processor is small and it can only process one horizontally partitioned submatrix at a time, we show how the code can be constructed so that the traffic to the global memories is reduced by a factor of 2. If the row weight of the matrix is large and each processor processes a vertically partitioned submatrix, we show how the matrix can be constructed so that the computational load is evenly distributed among the processors. If the processors have no storage capability and all data is stored in global memories, then for the case when all traffic is through a multistage interconnection network, we show how code construction can be used to significantly reduce the number of routing conflicts. In all three cases, the resulting LDPC codes can not only be mapped efficiently onto the multiprocessor platform but also have very good frame error performance.
I. INTRODUCTION
LOW-DENSITY Parity-Check (LDPC) codes [1] are linear block codes with sparse parity check matrices. Their asymptotic performance can come within one-tenth of a dB of the Shannon limit [2]. Another advantage of LDPC codes is that the decoding algorithm is inherently parallel, so a wide variety of hardware implementations can be derived to exploit this feature. Because of their extraordinary performance, LDPC codes have been adopted in the physical layer of many recent communication standards such as DVB-S2, 10GBase-T, 802.16e, and 802.11n.
Recently, there has been a lot of work on LDPC decoder implementation [3]-[10]. The fully parallel ASIC implementation of the LDPC decoder presented in [3] achieves very high throughput at the cost of a large area overhead. Starting with the pioneering work in [5], many studies have considered decoder complexity in the design of the LDPC codes. For instance, [7] presented an architecture-aware LDPC code that utilizes pseudopermutation submatrices and views the LDPC code as a concatenation of supercodes. This facilitates Turbo-decoding-like operations in the LDPC decoder and results in an implementation that is significantly simpler in terms of both interconnection network complexity and memory requirement. This has been demonstrated in a real chip implementation in [11]. Codesign approaches have also been proposed in [9], [10]. The block LDPC design method presented in [9] jointly considers code design, decoder design and encoder design. It is shown that the encoder and decoder complexity can be reduced without significant performance loss by using circulant submatrices in the parity check matrix. In [10], we showed that by including weight-2 circulant submatrices in the parity check matrix, we can achieve higher throughput without any performance loss. While all these architecture-aware LDPC code design studies target ASIC implementations, there are LDPC decoding studies for multiprocessors as well [4]. For instance, randomly constructed LDPC codes are considered in [4]. The focus there is on balancing the computation among the processors and reducing communication cost by clustering computation.
In this paper, we consider the problem of systematically designing LDPC codes that exploit the architectural features of multiprocessor architectures, such as those used in existing SDR platforms. This is an extension of the work presented in [12] and [13] . Fig. 1 shows the design flow that we first proposed in [12] . First, the code features based on: i) system specifications, such as frame error rate (FER) performance, codeword size, code rate; and ii) architectural features of the target platform, such as the number of processing units (PU), the width of the single-instruction multiple-data (SIMD) unit and arithmetic complexity, are obtained. The LDPC code is then constructed based on code features, which include number of block columns, row weight and level of parallel processing. The last step is verifying that the LDPC code meets both the system constraints and the architectural constraints. In some cases, several iterations may be necessary to ensure that all the constraints are met. Note that the code design process is done off-line and does not affect the run-time performance.
We demonstrate the effectiveness of the procedure for three cases. In all cases, we ensure that the FER performance is not compromised. First, we present the case when the local memory in the PU is not large enough to support the computations on the whole parity check matrix. We present a scheme where the parity check matrix is partitioned horizontally into supercodes [7], [14] (Section IV). We find that supercode decoding introduces degree-one nodes which hinder its performance, so we derive code design constraints that minimize the number of such nodes. Our simulation results show that the LDPC codes constructed with these constraints converge twice as fast as the nonoptimized codes. This reduces the memory traffic by a factor of 2, which in turn increases the decoding throughput by a factor of 2.
Next we consider the case when the parity check matrix has a large row weight and implementation on a single PU results in low throughput. We present a scheme where the parity check matrix is partitioned vertically and each of the sections is mapped to a PU in a multiprocessor architecture (Section V). We derive code design constraints that help in distributing the computational load evenly among the PUs.
Finally, we present an extreme case where the local memory in each PU is very small and the bulk of the information is stored in the global (shared) memories. Thus there is significant traffic between the PUs and the global memory, and the throughput is closely related to the number of routing conflicts. We demonstrate the LDPC code design procedure for an SDR platform where the PUs communicate with the global memory units via a multistage interconnection network (MIN). The combinations of routing paths that cause conflicts in the MIN are identified, and mechanisms to avoid them are translated into constraints for the code construction step. The resulting LDPC code can be mapped very efficiently onto the SDR platform and has very high decoding throughput.
The rest of this paper is organized as follows. Section II provides a brief review of LDPC codes. Section III outlines the SDR platform and the mapping of LDPC decoding onto the platform. Sections IV and V describe the schemes based on horizontal partitioning and vertical partitioning of the parity check matrix. Section VI describes the scheme that has been optimized for MIN-based interconnection. Section VII concludes the paper.
II. BACKGROUND

A. Basics of LDPC Codes
LDPC codes can be represented by bipartite graphs in which one set of nodes (variable nodes) corresponds to the elements of a codeword and another set of nodes (check nodes) corresponds to the parity-check constraints of the code. Fig. 2 shows the parity check matrix and the corresponding bipartite graph of a very short (2,3)-regular LDPC code. It is regular since each variable node is connected to 2 check nodes and each check node is connected to 3 variable nodes. LDPC codes that do not have the regularity property are called irregular. Irregular LDPC codes usually achieve better performance than regular LDPC codes [2], [15], and thus are considered here.
The performance of the LDPC codes is often predicted in terms of the ensemble average of codes, a consequence of the random choice of edges between check nodes and variable nodes for a given length and a given degree distribution. The optimal degree distribution for asymptotic cases (infinite codeword size, infinite iteration number) can be computed based on density evolution (DE) [16] . An online computing program based on DE is available in [17] . Let be the minimum (maximum) variable node degree and let be the minimum (maximum) check node degree. A study of the asymptotic performance shows the following features for degree distribution:
• .
• The check node degrees for optimal distribution are limited to only a few (one or two) values close to each other. Thus, a simple notation is used to represent the maximum (average) check node degree.
B. The Iterative Decoding Algorithms
LDPC decoding is done by either the bit flipping hard-decision algorithm [1], or by soft-decision iterative decoding algorithms such as the belief propagation (BP) algorithm and variations of the Min-Sum algorithm. All of them involve two kinds of operations: variable node processing (VP) and check node processing (CP). The operations for soft-decision iterative decoding are summarized in Algorithm 1. In the description in Algorithm 1, is the intrinsic information of variable node from the received signal, and are the extrinsic information from check node to variable node , is the information from variable node to check node , is the total information about the variable node . All information is in the form of log-likelihood ratios (LLRs).
is the set of variable nodes that are connected with check node in the bipartite graph. Similarly, is the set of check nodes that are connected with variable node . The different decoding algorithms differ in how the function is evaluated. For BP and Min-Sum algorithm, the set . The function in (3) for several common algorithms is listed below:
For Min-Sum algorithm:
For BP algorithm:
where the function .
The LLR information is the sum of the intrinsic and extrinsic information of the variable node, as given in (7). After the final iteration, a hard decision based on can be made. The number of iterations required for decoding varies with the SNR. In practice, the maximum number of iterations can be set to between 10 and 15. Early termination algorithms are often adopted to stop the decoding process as soon as all the parity check equations have been met.
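As an illustration of the VP, CP, LLR update and early termination steps described above, the following is a minimal sketch of Min-Sum decoding. It is not a reproduction of Algorithm 1: the names `min_sum_decode`, `check_update`, `H` and `llr_in` are hypothetical, a flooding schedule is assumed, a positive LLR is assumed to favor bit 0, and every check node is assumed to have degree at least 2.

```python
def check_update(q, j):
    """Min-Sum all-but-one rule (assumed form): sign product and minimum
    magnitude of the variable-to-check values q, excluding variable j."""
    others = [v for k, v in q.items() if k != j]
    sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
    return sign * min(abs(v) for v in others)


def min_sum_decode(H, llr_in, max_iter=15):
    """Flooding-schedule Min-Sum decoding on a dense 0/1 matrix H (list of rows).

    llr_in[j] is the intrinsic LLR of variable node j.
    Returns (hard decision bits, number of iterations used).
    """
    n = len(llr_in)
    rows = [[j for j in range(n) if row[j]] for row in H]  # neighbours of each check
    r = [{j: 0.0 for j in nb} for nb in rows]              # check-to-variable messages
    total = list(llr_in)                                   # total LLR per variable node
    bits = [1 if t < 0 else 0 for t in total]
    for it in range(1, max_iter + 1):
        new_r = []
        for i, nb in enumerate(rows):
            # VP: remove this check's own contribution from the total LLR
            q = {j: total[j] - r[i][j] for j in nb}
            # CP: all-but-one update for every edge of the check node
            new_r.append({j: check_update(q, j) for j in nb})
        r = new_r
        # LLR update: intrinsic plus all extrinsic information
        total = [llr_in[j] + sum(ri[j] for ri in r if j in ri) for j in range(n)]
        bits = [1 if t < 0 else 0 for t in total]
        # Early termination once every parity check is satisfied
        if all(sum(bits[j] for j in nb) % 2 == 0 for nb in rows):
            return bits, it
    return bits, max_iter
```

For BP decoding, only `check_update` would change; the surrounding VP and LLR update steps stay the same.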
C. Finite Length LDPC Code Design
Unlike the asymptotic cases, the performance of finite length LDPC code is greatly affected by the code structure as well as the degree distribution. This is because the finite length LDPC codes are no longer cycle-free [18] , [19] . Many different criteria have been proposed to optimize the performance of finite length LDPC codes, for example, large girth (length of the smallest cycles in the bipartite graph) [20] , [21] , selective avoidance of small cycles [22] and optimization based on stopping set [23] or trapping set [24] .
LDPC codes with structured submatrices [25] are widely used in practice to reduce implementation overhead and are the focus of this paper. The parity check matrix of size can be represented as a block matrix of size ; each element in matrix is a circularly shifted identity matrix or a zero matrix, where is an integer. Each row of the block parity matrix (which is equivalent to rows with elements per row of the original parity check matrix) is referred to as a block row. Similarly, each column of the block parity matrix (which is equivalent to columns with elements per column) is referred to as a block column. The nonzero item in is written as , where is the circulant identity matrix with shift value of . For simplicity, we use '1' or '0' to represent the or zero matrix. To design finite length LDPC codes that can be efficiently implemented on the target multiprocessor architecture, we utilize a design process that combines the optimal degree distribution for asymptotic performance, characteristics of the structured submatrices, finite length code optimization criteria and architectural constraints such as the number of processors and the width of the SIMD unit. The design process is performed at two levels: the block matrix level design, which constructs the matrix, and the submatrix level design, which assigns shift values to the submatrices.
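The relationship between the block matrix and the full parity check matrix can be sketched as follows. This is an illustrative helper only: the function name `expand`, the use of -1 to mark a zero submatrix, and the shift direction (row r of a submatrix with shift s has its 1 in column (r + s) mod z) are assumptions, not conventions taken from this paper.

```python
def expand(base, z):
    """Expand a block matrix of shift values into a full 0/1 parity check matrix.

    base[i][j] = -1 denotes a z-by-z zero submatrix (assumed encoding);
    base[i][j] = s >= 0 denotes a z-by-z identity matrix circularly shifted by s.
    """
    H = [[0] * (len(base[0]) * z) for _ in range(len(base) * z)]
    for bi, block_row in enumerate(base):
        for bj, s in enumerate(block_row):
            if s < 0:
                continue  # zero submatrix contributes nothing
            for r in range(z):
                H[bi * z + r][bj * z + (r + s) % z] = 1
    return H
```

Each block row expands into z rows and each block column into z columns, so the row and column weights of the block matrix carry over directly to the full matrix.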
The steps for designing LDPC codes based on the code features can be summarized as follows.
1) First, obtain the code features based on system requirement and architectural specifications. The code features include: , , , , , , level of parallel processing, etc. While system requirements determine the code rate and the range of codeword size , the SIMD width and number of PUs are used to determine the extent of parallelism and the row weight . The specific decoding algorithm is selected based on the arithmetic operation supported by the processor.
2) Compute the degree distribution based on the targeted channel condition, code rate (obtained from and ), , and . For the asymptotic case, the optimal degree distribution can be computed based on density evolution [16]. To use these results for finite sized blocks, adjustments have to be made such as rounding off the actual node numbers to integers.
3) Construct the matrix based on the degree distribution result in the previous step. The construction procedure can be briefly written as:
i) Initialize the of size with "0."
ii) Randomly place "1"s in each row.
iii) Move the "1"s along rows to satisfy the variable node degree distribution.
iv) Switch columns to facilitate further modification. For example, some lower degree columns are moved to the right side of the matrix before step v) in order to make the encoder efficient [26].
v) Move the "1"s in the matrix in pairs to meet additional constraints that have been imposed due to architectural or performance considerations. The pair-wise movement is such that if a "1" is moved from location to location , then another "1" needs to move from location to location , provided that location had a "1" and location had a "0." Note that this pair-wise movement of "1"s does not affect the check node or variable node distributions.
vi) Output the matrix.
Note that steps iv) and v) may have to be processed multiple times until all the constraints have been met or the maximum iteration number has been reached.
4) Assign shift values to all the nonzero elements in matrix. This is done by assigning shift values based on GF( ) [25] or by assigning shift values in a more flexible way such that the trapping set is minimized and small cycles are avoided selectively [22], [24].
A summary of the key notations that are used in this paper is included in Table I for easy reference.
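The pair-wise movement in step 3v) can be illustrated with a short sketch. The index names (`r1`, `c1`, `c2`, `r2`) and the function names are hypothetical; the sketch assumes that the first "1" moves along its row from column `c1` to column `c2`, and that a second "1" in row `r2` moves in the opposite direction, which is one way of keeping every row weight and column weight unchanged.

```python
def pairwise_move(B, r1, c1, c2, r2):
    """Move the '1' at (r1, c1) to (r1, c2) and the '1' at (r2, c2) to (r2, c1).

    B is a 0/1 block matrix (list of lists), modified in place.
    Returns False and leaves B untouched if the move is not legal.
    """
    legal = (B[r1][c1] == 1 and B[r1][c2] == 0 and
             B[r2][c2] == 1 and B[r2][c1] == 0)
    if not legal:
        return False
    B[r1][c1], B[r1][c2] = 0, 1
    B[r2][c2], B[r2][c1] = 0, 1
    return True


def degrees(B):
    """Return (row weights, column weights) of a 0/1 matrix."""
    return [sum(row) for row in B], [sum(col) for col in zip(*B)]
```

Because one "1" leaves and one "1" enters each affected row and column, `degrees(B)` returns the same values before and after a successful move.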
III. LDPC DECODING ON AN SDR PLATFORM
A. SDR Platform
There are many different definitions of SDR systems. In the context of this paper, it is a hardware platform that processes the physical layer of multiple protocols. We only consider the digital processing engine, thus ADC/DAC and RF parts are not studied here.
A typical SDR platform consists of multiple processing units (PU) and multiple global memory units as shown in Fig. 3. The PUs and the global memory units communicate with each other via an interconnection network. The interconnection network could be one or more shared buses or even a multistage interconnection network. Each PU consists of a local memory, a processing element (PE), which consists of a scalar unit and a SIMD unit, and an application specific element (ASE). The combination of a scalar unit and a SIMD unit enables a large class of algorithms to be mapped very efficiently as shown in [27]. The ASEs are typically included to enhance the performance of some target protocols. For example, the ASE could be an optimized shuffle network as in [27] or a sorter. The local memory of a PU is typically quite small, on the order of a few kilobytes. The software control unit coordinates all the PUs in the system.
B. LDPC Decoding
We first consider the case where all the information related to LDPC decoding can be stored in the local memory of a PU. This puts an obvious limitation on the size of the codeword or equivalently, the size of the parity check matrix that can be processed.
Assume that the PE is a SIMD engine with slices. For example, the PE in [27] has slices. Fig. 4 describes such a PU. Each SIMD slice can efficiently compute addition/subtraction and comparison. For LUT-based implementations that are required for BP-based decoding, approximate computations can be used [7] . The local memory in the PU stores the intrinsic and extrinsic information. The local memory and the register file feed data to the slices. In addition, there is a permutation unit that is used to permute values to/from the slices of the SIMD engine.
We assume the scheduling is in check node order. Each SIMD slice is responsible for the computations in a check node. In each step, only the check-to-variable or variable-to-check information corresponding to the scheduled check nodes is updated. The general process for processing check nodes in a single PU is shown in Fig. 4. The decoding includes the following steps: 1) load from memory, permute to the correct slice and store to register file; 2) load from memory, load from register file, calculate (VP), store the to register file; 3) load from register file, calculate (CP), store to memory and register file; 4) load and from register file, calculate , permute to its original order and store to memory. These steps have been marked along the edges in Fig. 4. Round parentheses denote data that can be fed directly to the SIMD slices, and square brackets denote data that is organized for local memory and requires permutation before or after being processed by the SIMD slices.
1) Number of Cycles:
We estimate the number of cycles that are required for the decoding steps outlined above. Each of the steps (except Step 3) requires cycles. This is because the steps are repeated for each of the nonzero elements in a row. The CP operation in Step 3 [described in (5)-(6)] requires the all-but-one operation, which can be implemented very efficiently in ASICs using adder-trees and XOR-trees. In an SDR platform, cycles are required to compute the all-but-one operation. Thus the total number of cycles to process check nodes using the SDR platform is cycles, ignoring the overhead due to the pipelines. The precise cycle count is platform dependent. For example, the permutation step that is used to convert the or values between variable node order and check node order can be done in one cycle in [27], but this may not hold for other platforms.
2) Memory Requirement: Because the precision of operands in processors is limited to a few choices (for example, 8 bits, 16 bits, 32 bits), we evaluate the memory requirement in terms of the number of values in the SDR platform. For BP and Min-Sum (and its variations), 8 bits are more than sufficient to achieve decent performance in hardware [6], [7].
The information that needs to be stored for BP decoding includes the LLR value and the check-to-variable information . The total memory requirement is given by (8), where the first term is equal to the number of nodes and the second term is equal to the total number of edges in the bipartite graph.
If the Min-Sum algorithm is used, then the memory requirement can be reduced. For each check node, it is sufficient to store only the minimum (together with its position), the second minimum, and the signs for all check-to-variable information. Because the sign information can be packed together to save memory, the total memory required for iterative Min-Sum decoding is given by (9), where is the bit width of each word in memory. We can introduce a new notation to represent the memory required by check-to-variable information of each row.
BP algorithm Min-Sum algorithm.
For a SIMD architecture with slices, the local memory is laid out such that each memory word contains values. The local memory required for decoding is given by (11), where . LDPC codes require large codeword sizes to achieve good performance. For example, the DVB-S2 standard specifies a maximum size of 16200 bits, and the new Chinese DTV standard DMB-T specifies a maximum size of 7493 bits. On the other hand, the local memory size of a PU is limited. For example, the SDR architecture proposed in [27] has only 8KB+4KB of local memory in each PU (which is significantly less than the memory in general purpose processors). Algorithmic innovations are required to handle such cases.
IV. LDPC CODE DESIGN FOR HORIZONTALLY PARTITIONED MATRIX
In this section, we study the case when only a single PU can be assigned for LDPC decoding. The PU has SIMD slices (as described in Section III-A) and limited local memory. To satisfy the memory constraints, the parity check matrix can be partitioned horizontally into supercodes [7], [14], and the PU operates on the supercodes one after the other. Supercode-based decoding can degrade the FER performance if the code is not constructed carefully. We show how to avoid this through proper code construction.
A. Supercode-Based LDPC Decoding
Fig. 5 shows the decomposition of an LDPC code into two supercodes. The parity check matrices of supercodes ( and ) are submatrices of the original matrix. LDPC decoding based on supercodes is summarized in Algorithm 2. There are two levels of iteration in the decoding: inner and outer iteration. Each supercode is decoded with a fixed number of iterations (inner iteration). The output LLR values from one supercode serve as the input to the next supercode. The supercodes are decoded one after another and in an iterative fashion (outer iteration). Let denote the number of supercodes, each of which has a block parity check matrix . The outer iteration number is and the inner iteration number of the th supercode is . The AA-LDPC codes in [7] are equivalent to the case , , . The scheme shown in [14] is equivalent to the case and , . Our method is more general and provides more choices for the LDPC code designer.
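The two levels of iteration can be sketched as follows. This is not Algorithm 2 itself: the names `supercode_decode`, `outer_iters` and `inner_iters` are hypothetical, the same inner iteration count is assumed for every supercode, Min-Sum is assumed for the check node update, a positive LLR is assumed to favor bit 0, and each check node is assumed to have degree at least 2.

```python
def check_update(q, j):
    """Min-Sum all-but-one rule (assumed form) over the values in q except q[j]."""
    others = [v for k, v in q.items() if k != j]
    sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
    return sign * min(abs(v) for v in others)


def supercode_decode(supercodes, llr_in, outer_iters, inner_iters):
    """Decode supercode by supercode; the LLRs left by one supercode feed the next.

    supercodes is a list of 0/1 matrices (lists of rows) that together
    form the rows of the full parity check matrix.
    """
    n = len(llr_in)
    total = list(llr_in)  # complete copy of the LLR values
    rows = [[[j for j in range(n) if row[j]] for row in Hk] for Hk in supercodes]
    # r[k][i][j]: check-to-variable message of check i in supercode k
    r = [[{j: 0.0 for j in nb} for nb in rk] for rk in rows]
    for _ in range(outer_iters):
        for k, rk in enumerate(rows):
            for _ in range(inner_iters):
                new = []
                for i, nb in enumerate(rk):
                    q = {j: total[j] - r[k][i][j] for j in nb}         # VP
                    new.append({j: check_update(q, j) for j in nb})    # CP
                # Replace this supercode's old extrinsic values in the LLRs
                for i, nb in enumerate(rk):
                    for j in nb:
                        total[j] += new[i][j] - r[k][i][j]
                r[k] = new
    return [1 if t < 0 else 0 for t in total]
```

In this sketch, a variable node that has only one edge inside a supercode receives the same variable-to-check value in every inner iteration, since its total LLR minus its single incoming message does not change until another supercode is processed.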
B. Memory Analysis
In this section, we analyze the memory requirements for supercode-based BP decoding. The procedure for Min-Sum-based decoding is identical and has not been included here. In supercode-based BP decoding, the local memory requirement is smaller than that given in (11). Here the local memory keeps a complete copy of all values and only the values for the supercode to be decoded; the global memory stores all the values. There is another possible storage scheme, described in [13], which stores only a subset of the values. It results in smaller local memory but larger global memory and increased global memory traffic. Here the memory traffic is defined as the number of clock cycles used to transfer data between the memory and the PU.
The memory requirement is summarized in (12), where , is the number of block rows in . The values are usually set equal for all supercodes, i.e., , to minimize the local memory. The local memory requirement decreases as the number of supercodes increases, as expected. However, the reduction in local memory comes at the price of increased memory traffic between the global memory and the PU.
The number of access cycles for global memory is given by (14), where is the number of cycles for transmitting values between the global memory and the local memory.
The number of decoding cycles is given by (15), where is the number of cycles for one decoding iteration (including VP and CP operation) for a block row in . The can be chosen independently for each supercode. For example, it can be chosen proportional to . However, for simplicity of control and performance comparison, we choose , . In this case, . A comparison between and shows that while is proportional to the outer iteration number , is proportional to the product of the outer and inner iteration numbers, . Thus we need to reduce to minimize the memory traffic and keep a certain number of to maintain the FER performance. In Sections IV-C through IV-E, we describe how this can be achieved by appropriate code construction.
C. Study of FER Performance
In [14], the authors proposed a method where all supercodes have the same degree distribution and the degree distribution of the supercodes is derived directly from the overall degree distribution. The problem with this method is that it results in a large number of degree-one variable nodes in the supercodes. It is easy to verify that the variable-to-check information from such variable nodes does not get updated during the inner iterations. Thus these nodes do not help the convergence process within one supercode.
For example, Fig. 5 shows matrix generated by the block level random placing step in the LDPC code construction procedure (Section II-C). A blind construction of supercodes by directly partitioning an existing parity check matrix is likely to create many degree-one variable nodes in supercodes, as shown in the shaded boxes in Fig. 5 . As a result, the performance of supercode-based decoding is dominated by the number of outer iterations [13] . However, increasing increases the memory traffic and is detrimental to the overall throughput. In Section IV-D, we show how to unleash the potential of supercode-based decoding with small . We achieve this by constructing the codes such that the number of degree-one nodes in supercodes is reduced.
D. Code Constraints
In order to reduce the number of degree one nodes in supercode-based decoding, we impose an additional constraint in the LDPC code construction procedure (Section II-C), which is given as follows:
• Divide into parts, each of which corresponds to a supercode. The number is chosen so that the local memory is able to accommodate all the and values of one supercode.
• If and there exist degree-one nodes in the th supercode, perform the exchange operation to make each of them at least degree 2 or degree 0. The only exception is where the degree-one node is critical in maintaining a certain structure of the parity check matrix.
• Reduce the outer iteration number until the FER performance is no longer acceptable.
In step 3v) of the algorithm in Section II-C, the exchange procedure removes degree-one variable nodes in supercodes except in some special cases where they are kept to preserve the structure of the parity check matrix. If not all of the degree-one variable nodes in supercodes can be removed, the variable nodes with lower degree in the final codeword (especially those with degree 2 or 3) should be handled with higher priority. Fig. 6 shows the matrix of Fig. 5 after undergoing the exchange procedure. Almost all the degree-one nodes in supercodes have been removed. There are only two degree-one nodes left in , which are purposely kept to maintain the lower triangular shape of the matrix. The FER performance of this new code is no longer dependent on .
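The check that drives the exchange procedure can be sketched at the block matrix level. The function name `degree_one_columns` is hypothetical, and the sketch assumes that the block rows are split evenly and contiguously among the supercodes and that each "1" stands for a weight-1 circulant, so a block column with exactly one "1" inside a supercode corresponds to degree-one variable nodes in that supercode.

```python
def degree_one_columns(B, num_supercodes):
    """For each supercode, list the block columns that have exactly one '1' in it.

    B is a 0/1 block matrix (list of rows); its block rows are split
    into num_supercodes contiguous groups of equal size (an assumption).
    """
    rows_per = len(B) // num_supercodes
    result = []
    for k in range(num_supercodes):
        sub = B[k * rows_per:(k + 1) * rows_per]
        weights = [sum(col) for col in zip(*sub)]  # column weight inside supercode k
        result.append([j for j, w in enumerate(weights) if w == 1])
    return result
```

A code construction loop could call this after each pair-wise exchange and accept the exchange only when the lists shrink, skipping any columns that must keep a single "1" for structural reasons.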
E. Simulation Results
We simulated the LDPC codes for the original matrix (Fig. 5) and the optimized matrix (Fig. 6) in an additive white Gaussian noise (AWGN) channel. Each matrix (corresponding matrix of size 1544 × 6176) consists of submatrices of size 193 × 193 and is divided into two supercodes. The FER results are calculated based on the average of 25 error codewords. The optimized LDPC codes have significantly better performance. In fact, they can achieve the same performance with approximately one half of the number of iterations. This translates to the memory traffic being reduced by a factor of 2 and the throughput increasing approximately by a factor of 2.
Table II summarizes the four comparisons of the original and optimized codes in terms of FER performance and normalized memory traffic. The optimized LDPC codes converge significantly (2 times) faster. This demonstrates that, under the same supercode-based decoding, well constructed LDPC codes can achieve much higher throughput compared with blindly constructed LDPC codes. In fact, it was shown in [7] that supercode-based decoding converges 2 times faster than regular decoding for . Our simulation shows that by increasing , our codes can achieve the same 2X convergence speed as in [7] and also have significantly lower memory traffic.
The supercode-based decoding for the LDPC code with codeword size 6176 was mapped onto a single PU in the SODA architecture shown in [27]. Because of the limited size of the local memory and the large codeword size, the Min-Sum algorithm is used. For a 400 MHz clock, the throughput is 10.7 Mbps for iterations. If each SODA PU was assigned one codeword and all 4 PUs were activated, the throughput could be as high as 42.8 Mbps.
V. LDPC CODE DESIGN FOR VERTICALLY PARTITIONED MATRIX
In this section, we study the case when the parity check matrix has a large row weight. To support this, the PU has to have a larger local memory and, more importantly, will require a large number of decoding cycles. If throughput is an important constraint, and it is indeed possible to utilize multiple PUs, one possibility is to partition the matrix vertically and assign different partitions to the different PUs. This scheme can reduce the number of cycles roughly by a factor of , where is the number of vertical partitions.
A. Cooperative Decoding for Vertically Partitioned Matrix
Algorithm 3 describes a cooperative decoding procedure for LDPC decoding. Let be the number of partitions, so each submatrix , , is a block matrix, where is the number of block rows and is the number of block columns in . Let the number of nonzero elements in the th block row of be denoted as . The cooperative decoding algorithm is based on check node-based scheduling. In each iteration, the VP and variable update steps remain the same as in the original BP or Min-Sum algorithm. The difference lies in the processing of the CP step.
Assume that the th partition is assigned to the th PU. The CP processing in each check node is divided into two rounds of operations. In the first round, the th PU updates the partial CP result based on its own data (same as the original algorithm), and calculates the partial sum , where is a variable node belonging to the th partition of the parity check matrix. In the second round, the partial sum is sent to all PUs and the th PU then calculates the based on and , . The , and in Algorithm 3 are defined below. Here is the set of variable nodes that are connected to node in the bipartite graph corresponding to .
For Min-Sum algorithm:
It is easy to verify that the algorithm is equivalent to the original Min-Sum and BP algorithms. Furthermore, the memory requirement for a PU is given by (27).
Note that there are several other LDPC decoding schemes that vertically partition the matrix for different purposes [28]-[30]. In [28], the authors also proposed to partition the matrix vertically to reduce the decoder complexity. The decoding algorithm in [28], which is based on a variation of the BP algorithm, uses heuristic multiplicative factors and results in performance loss. While the decoding in [28] is suboptimal, the inter-processor communication is significantly reduced. In [29], the matrix is vertically partitioned into groups and the groups are updated sequentially from left to right. Our work differs from the scheme in [29] in that the updates of the vertically partitioned groups are performed simultaneously in a cooperative way. Vertical partitioning of the matrix has also been used in a recently published paper [30] to achieve multiple code rates.
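The two-round check node processing in this subsection can be illustrated for the Min-Sum case with the sketch below. The exact per-PU quantities of Algorithm 3 are not reproduced; the sketch assumes that each PU summarizes its share of a check node by its smallest magnitude, second smallest magnitude, the position of the smallest, and the parity of its negative signs, and that these summaries are exchanged in the second round. All function names are hypothetical.

```python
def partial_cp(q):
    """Round 1: summarize one PU's variable-to-check values for a check node.

    q maps variable index -> LLR. Returns (min1, min2, index of min1, sign parity).
    """
    mags = sorted((abs(v), j) for j, v in q.items())
    min1, idx = mags[0]
    min2 = mags[1][0] if len(mags) > 1 else float("inf")
    neg = sum(v < 0 for v in q.values()) % 2
    return min1, min2, idx, neg


def combine(partials):
    """Round 2: merge the summaries of all PUs into one global summary."""
    cands = []
    for min1, min2, idx, _ in partials:
        cands.append((min1, idx))
        cands.append((min2, None))
    cands.sort(key=lambda t: t[0])  # stable sort keeps a min1 entry ahead of ties
    gmin1, gidx = cands[0]
    gmin2 = cands[1][0]
    gneg = sum(p[3] for p in partials) % 2
    return gmin1, gmin2, gidx, gneg


def local_messages(q, summary):
    """Each PU derives the check-to-variable messages for its own variables."""
    gmin1, gmin2, gidx, gneg = summary
    out = {}
    for j, v in q.items():
        mag = gmin2 if j == gidx else gmin1     # all-but-one minimum
        parity = gneg ^ (1 if v < 0 else 0)     # all-but-one sign parity
        out[j] = -mag if parity else mag
    return out


def direct_messages(q_all):
    """Reference: single-PU Min-Sum all-but-one update over the whole row."""
    out = {}
    for j in q_all:
        others = [v for k, v in q_all.items() if k != j]
        sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
        out[j] = sign * min(abs(v) for v in others)
    return out
```

Comparing the output of `local_messages` across all partitions with `direct_messages` on the unpartitioned row gives a quick check that the cooperative update matches the single-PU result for a given input.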
B. Code Constraint
A closer look at Algorithm 3 indicates that although the different PUs work independently, they need to exchange the partial sum value for each check node. Thus the throughput performance is determined by the partition with the largest row weight. This leads us to establish the following LDPC code requirement: all partitions, , should have approximately the same row weight. In addition, should have approximately the same number of block columns, which will minimize the memory requirement of each PU. To characterize this requirement quantitatively, we introduce a parameter called the unbalance factor (UF), defined in (28). Essentially, UF is the number of extra cycles required for decoding a specific partition compared to the optimal partition (which corresponds to ). So during code design, the distribution of 1s should be such that UF is as small as possible. This is achieved by applying the constraint to minimize UF defined in (28) in Step 3v) of the code construction procedure in Section II-C. Specifically, the exchange procedure in Step 3v) would switch block columns of the original matrix such that is as small as possible.
C. Performance Analysis
FER: There is no change in the FER performance, since the proposed algorithm is equivalent to the original BP or Min-Sum algorithms.
Throughput: To evaluate the increase in throughput, we calculate the number of cycles required by this scheme. Partial check node processing is equivalent to the original CP computation (outlined in Section III-B) with nonzero elements. Another cycles are required for the second round of operation during CP processing. This overhead is due to the partial sum values that have to be sent to all the PUs and the updated values that we assume can be broadcast to the other PUs in one cycle. For a simple bus interconnection network, it takes cycles to broadcast the results. Thus the number of cycles is given by (29). Fig. 8 shows the throughput of vertical partition schemes for different values. In each case, the throughput values have been normalized to the case. We see that for small values, the throughput increases sublinearly with , while for large values, the throughput increases linearly with . Thus this method can be used to increase the throughput for matrices with large values by effectively distributing the computational load among the PUs.
The code with codeword size 6176, code rate 3/4, and was mapped onto the 4 SODA PUs. The resulting throughput was 24.8 Mbps, which is lower than the 42.8 Mbps that is achievable if horizontal partitioning is used and all 4 PUs are activated. However, for codes with larger values, the relative difference in throughput is very small and is easily outweighed by the advantage of the smaller memory requirement in the PUs.
VI. LDPC CODE DESIGN FOR MIN BASED SYSTEMS
When the LDPC decoder is required to support very large codeword sizes and the PU local memory is not sufficient for processing a vertically or horizontally partitioned parity check matrix, we consider a different approach. Assume that the whole SDR platform can be utilized for LDPC decoding. We also assume that the PUs and the global memory units are interconnected with an MIN. The MIN can be of different types, such as butterfly or perfect k-shuffle. We consider the butterfly MIN here, though the proposed framework can be applied to other classes of MIN in a similar way, because many of the known MINs, such as butterfly and perfect k-shuffle, belong to the class of Delta networks [31] and can be proved to be topologically and functionally equivalent [32] .
Assume that there are PUs. Each PU consists of SIMD slices, and thus can process data simultaneously. Since the local memories are small and store only temporary data, the values and the values are stored in global memories. In each iteration, the and values have to be read from the global memories and after processing, the updated values have to be stored back into the global memories. Thus there is significant traffic between the PUs and the global memories, and routing conflicts through the MIN network can seriously affect the throughput. In this section we describe how the LDPC code can be designed to reduce the routing conflicts and thereby increase the decoding throughput.
A. Feature Identification
To minimize the routing conflicts through the MIN network during LDPC decoding, we need to first characterize the constraints imposed by the MIN and translate the constraints to the corresponding code features.
1) MIN-Based Constraints:
Let the number of global memory units be . The MIN interconnection network of size connects the PUs to the global memory units. Clearly, , . When multiple data items have to be routed through the network at the same time, certain routing paths (between the input and output nodes) cannot be activated simultaneously. The relative positions of the input/output nodes play an important role in determining the conflicting paths. To make the problem tangible, we focus on reduction of "pair-wise" conflicts. These conflicts are used to derive explicit constraints that are then used in the code construction process. For any given input node , the remaining input nodes are divided into different neighborhood groups :
It is obvious that . If the input node is represented by , , the necessary and sufficient condition that an input node belongs to the th neighborhood group for butterfly MIN is given in (31) . Fig. 9 shows the neighborhood groups of in an 8-point butterfly MIN: , and .
otherwise (31)
To derive a method to reduce the probability of conflict, we introduce the concept of exclusive node set (ENS). Let be a set of input nodes corresponding to th-level ENS which includes and all the elements in , , i.e.,:
For the butterfly MIN in Fig. 9 , . has several properties: (1) . (2) , . (3) . From these properties, we can see that the input nodes can be divided into nonintersecting groups in th level ENS. In Fig. 9 , the groups in 0th level ENS are , , , . To reduce the possibility of routing conflict, no more than one node from each ENS should be selected as input node. Thus the maximum number of input nodes that can be routed simultaneously is the same as the cardinality of th level ENS, i.e., . Our goal is to have the largest number of input nodes participate in routing at any given time so we start with the 0th level ENS.
2) Code Features: input nodes from the nodes in the 0th level ENS can be routed simultaneously. Since the number of input nodes equals the maximum row weight , this translates to the constraint . In addition, the number of block columns . Because of the sparseness of the parity check matrix for LDPC codes, we can easily find configurations that satisfy these constraints.
Since two nodes in the same 0th level ENS group should not be routed simultaneously, the block matrix should have the following structure: in any row, there is no more than one '1' element belonging to the same ENS group.
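The row constraint above is straightforward to check mechanically. The following minimal sketch takes the column-to-group mapping as an input, because the actual 0th level ENS grouping depends on the MIN topology; the pairing of adjacent block columns used in the example, the example matrix, and the function name `ens_conflicts` are assumptions made purely for illustration.

```python
from collections import Counter


def ens_conflicts(base_matrix, group_of_column):
    """Return (row, group, count) for every ENS group that contains more
    than one '1' in the same row of the block matrix."""
    conflicts = []
    for r, row in enumerate(base_matrix):
        hits = Counter(group_of_column[c] for c, v in enumerate(row) if v)
        conflicts.extend((r, g, n) for g, n in sorted(hits.items()) if n > 1)
    return conflicts


# Hypothetical grouping: block columns (0,1), (2,3), (4,5), (6,7) form
# the four ENS groups. The real grouping follows from the MIN in use.
group_of_column = [c // 2 for c in range(8)]

base = [
    [1, 1, 0, 0, 1, 0, 0, 1],  # columns 0 and 1 share a group: conflict
    [1, 0, 0, 1, 0, 1, 1, 0],  # one '1' per group: no conflict
]

print(ens_conflicts(base, group_of_column))
```

A construction procedure can call such a check after each candidate exchange and accept only exchanges that leave the conflict list empty.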
The constraints on and can be further refined by considering the encoding complexity. As shown in [12] , the constraints for encoder-efficient codes are as follows. If , the maximum value of ; if , the result remains the same since we can amortize the unused output nodes for the left part of . Also, it is clear from the pigeonhole principle that the maximum variable node degree . The positions of the input nodes can be chosen arbitrarily as long as the two nodes do not belong to the same ENS group. The positions of the output nodes are constrained by the positions of the input nodes. These constraints can be derived from Theorem 6.1. We will first present a lemma.
• No more than one '1' within an ENS group in any row.
• For two input nodes in , avoid assigning their output nodes with difference of (absolute value), .
By applying the constraints presented above in Step 3v) in Section II-C, the routing conflicts are greatly reduced. Our experimental results show that the algorithm converges very well. Next, we show a code design example following these rules.
B. LDPC Code Design Example
In this section, we illustrate our procedure with the design of a rate LDPC code that fully exploits the characteristics of an SDR platform equipped with a -point butterfly MIN. From the discussion above, we can derive the design parameters as follows:
• , , code rate
• , , check node degree
With these constraints, we can design a finite length LDPC code as described in Section II-C (for details see [12] ). In a single run, we obtained two matrices with the same degree distribution: one is randomly generated, as shown in Fig. 10(a), and the other is optimized for MIN interconnection, as shown in Fig. 10(b).
In the block matrices in Fig. 10 , columns and belong to the same ENS group. In this figure every other ENS group has been highlighted. Notice that in the random matrix of Fig. 10(a) , there are multiple cases where there are two '1's that belong to the same 0th level ENS group in the same row. In contrast, in the matrix of Fig. 10(b) , all such cases are eliminated.
After the LDPC codes are constructed, we map the original and optimized LDPC codes to the target architecture. Here the number of input nodes is and . The mapping results in a look-up table that stores the switch states for each stage of the MIN. For each block row there are that need to be stored. There are block rows in the matrix, resulting in a total of 640 bits of switch state information.
We see that when the original matrix is mapped to the architecture, there are on average 4 conflicts out of the 80 switch states per block row, causing a slowdown of 47.5%. The slowdown factor is calculated by dividing the additional number of cycles, which are introduced by separately routing the data that caused conflicts, by the overall number of cycles. The interconnection-aware LDPC code had no conflicts in any switch and thus no routing conflicts. This translates to a very high decoding throughput.
FER Performance: We verify the performance of the code before and after interconnection-aware optimization. Fig. 11 shows the FER performance for the AWGN channel when the LDPC encoded data is modulated with binary phase shift keying (BPSK). Each codeword is decoded with a maximum of 15 iterations and each FER point represents the average of 25 error codewords. The proposed code generation procedure did not introduce any degradation in the FER performance. For better comparison, we also included simulation results for WiMax LDPC codes.
Fig. 11. FER of the rate (3/4) LDPC codes shown in this section and the WiMax 3/4A LDPC code [33].
VII. CONCLUSION
In this paper, we presented a procedure to design LDPC codes that can be mapped efficiently onto multiprocessor SDR platforms. The general design flow for architecture-aware code design was presented along with three case studies. In all cases, the FER performance of the modified LDPC code was not compromised.
First, we considered the case when only one PU is available and the local memory of the PU is not sufficient to support decoding of the entire parity check matrix. Here the matrix is partitioned horizontally into supercodes and the PU performs supercode-based decoding. We showed how the addition of certain constraints during the code design phase results in enhanced FER performance compared to that of typical supercode-based decoding, as well as a 2X reduction in the memory traffic. Next, we considered the case where the parity check matrix has a large weight and multiple PUs are available for computation. We showed how certain constraints can be used to create a matrix that supports even distribution of the computational load among the PUs. Finally, we considered the case where the PU has very limited local memory and so there is a large volume of traffic between the PU and the global memory. We showed how the addition of code constraints can be used to reduce the number of routing conflicts in an MIN-based interconnection network, resulting in substantial improvement in the throughput.
The code construction principles presented in this paper can also be applied to some of the LDPC codes used in standards. Since LDPC codes are linear block codes, we can exchange whole block rows and block columns of the parity check matrix for efficient decoding. This does not change the code as long as the codeword bits are restored to their original order after decoding.
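The claim that column exchanges leave the code unchanged once the bit order is restored can be checked with a short sketch. The small parity check matrix, the codeword, and the permutation below are hypothetical, and the sketch permutes individual columns for brevity; in a block-structured matrix the same permutation would move whole block columns at a time.

```python
def syndrome(H, c):
    """Syndrome of word c under parity check matrix H over GF(2)."""
    return [sum(h * x for h, x in zip(row, c)) % 2 for row in H]


def permute_columns(H, perm):
    """Column j of the result is column perm[j] of H."""
    return [[row[p] for p in perm] for row in H]


def permute_bits(c, perm):
    """Reorder codeword bits to match the permuted columns."""
    return [c[p] for p in perm]


def restore_bits(c_perm, perm):
    """Undo permute_bits, returning the bits to their original order."""
    out = [0] * len(perm)
    for j, p in enumerate(perm):
        out[p] = c_perm[j]
    return out


# Hypothetical 3 x 6 parity check matrix and one of its codewords.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
c = [1, 0, 1, 1, 1, 0]
perm = [3, 0, 4, 1, 5, 2]

H_perm = permute_columns(H, perm)
c_perm = permute_bits(c, perm)

print(syndrome(H, c), syndrome(H_perm, c_perm))
print(restore_bits(c_perm, perm) == c)
```

Exchanging rows only reorders the parity checks and therefore the entries of the syndrome, so it needs no corresponding reordering of the codeword bits.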
