Abstract: Parallel hardware architectures offer an excellent compromise between area, cost, flexibility and throughput in the hardware design of LDPC decoders.
INTRODUCTION
The near-Shannon-limit error-correcting capability of Low Density Parity Check (LDPC) codes [1] has gained a lot of attention in the information theory community. Owing to their very high decoding throughput and communication performance, LDPC codes are increasingly included in standards such as DVB-S2 and DVB-T2 [3], WiFi (IEEE 802.11n) [4] and WiMAX (IEEE 802.16e) [5]. LDPC codes are linear block codes and are represented either by a parity-check matrix H or by a Tanner graph [2], which is a bipartite graph. In the Tanner graph representation, two types of vertices, variable nodes (VNs) and check nodes (CNs), form the two vertex sets of the bipartite graph (cf. Figure 1). VNs represent the bits of the codeword (i.e. the data to be processed) and CNs correspond to the parity-check sums (i.e. the operations to be done on the data). A VN is connected to a CN by an edge if and only if it is checked by that check node. The decoding process is carried out by an iterative message-passing algorithm called the "Belief Propagation Algorithm". In this algorithm, VNs and CNs iteratively exchange soft information to qualify the likelihood of each variable in accordance with the associated parity-check equations [1]. Currently, three main families of decoder architecture for LDPC codes have been proposed in the literature:
• Serial decoders
• Partially-parallel decoders
• Fully-parallel decoders
Serial decoders suffer from low throughput and fully-parallel decoders from prohibitive area. Thus only partially-parallel architectures are considered in the practical hardware design of LDPC decoders. In a partially-parallel architecture, several processing elements (PEs) are used, and a set of variable nodes and a set of check nodes are allotted to each PE. The throughput requirement can be met by using a proper number of PEs, while the cost of the interconnection network tends to be less critical than in a fully-parallel implementation. A typical architecture for a partially-parallel decoder is shown in Figure 1, in which P PEs are connected to B memory banks, where P = B. The computation at the variable nodes and check nodes is quite simple. When designing a parallel hardware architecture, the implementation issues mainly arise from the communication structure between VNs and CNs. The communication structure becomes more and more challenging as the number of nodes, the node degrees, the number of iterations and the parallelism increase. Hence, parallel implementations suffer from the memory access collision problem, in which more than one PE concurrently accesses the same memory bank to read or write data. In this paper, we present a memory mapping methodology based on a bipartite graph which is able to provide all the PEs with conflict-free parallel access to the memory banks. This algorithm provides a conflict-free memory mapping for all types of decoding methods, code types, codeword lengths and code rates.
The remainder of the paper is organized as follows. Section 2 presents the state of the art related to parallel LDPC decoder design. Section 3 introduces the mapping problem. Section 4 gives some definitions related to bipartite graphs needed to understand the proposed approach. Section 5 details the mapping algorithm we propose. Finally, section 6 explains the algorithm through a pedagogical example.
978-1-4244-8157-6/10/$26.00 ©2010 IEEE
2. RELATED WORKS
Currently, three classes of approaches to the design of partially-parallel LDPC decoder architectures exist to tackle the collision problem:
• Design LDPC codes that avoid the collision problem [6], [7],
• Use extra memory elements and control logic in the interconnection network in order to remove conflicts [8], [9], [10],
• Find a memory mapping that provides conflict-free access to all the memory banks at any time instance [11], [15], [16].
In the first category of decoder implementation, structured or architecture-oriented LDPC codes are designed in order to avoid conflicts when accessing data in the memory banks. These codes remove the memory access conflicts and simplify the interconnection network through the use of a barrel shifter [6] or a network [7]. However, the constraints imposed on the construction of structured LDPC codes may degrade code performance.
In the second class of decoder implementation, memory access conflicts are removed through the addition of extra memory elements, a complex interconnection network, or both. In [8], configuration memories are used along with a 2D-mesh network for LDPC codes of different block sizes and code rates. In [9], concurrent accesses to the same memory bank are avoided through the use of a heterogeneous network. However, this network becomes complex as the degree of parallelization increases, and the achievable throughput is reduced. In [10], a binary de Bruijn network is employed to provide a flexible on-chip network for the LDPC decoder. Concurrent accesses to the same memory bank are avoided through a dedicated routing algorithm which deflects one of the conflicting packets at the router. The flexibility of these complex interconnection networks is paid for with additional hardware, increased decoding latency and higher power consumption.
In the last class, methodologies for solving the collision problem map the data into different memory banks so as to allow conflict-free concurrent read/write accesses. In [11], the authors propose a mapping algorithm to remove memory conflicts in flexible LDPC decoders. However, the proposed approach is based on a simulated annealing algorithm, so the user cannot predict when the algorithm will end. Moreover, it fails to optimize either the storage elements or the interconnection network. Different heuristics [15], [16] have also been proposed to solve the mapping problem in turbo and LDPC decoding. However, they consider in-place memory access, in which data have to be read from and written to the same memory location.
Finally, a conflict graph can be used. In this model, a node represents a data element and two nodes are connected if and only if the associated data are accessed at the same time. A node coloring approach can then be used to solve the mapping problem: each color corresponds to one memory bank. Unfortunately, only one color can be assigned to a node, i.e. a data element can be stored in only one memory bank. This constraint may require more memory banks than needed (see [17] for more details). Similarly, a number of algorithms have been proposed for coloring the edges of a bipartite graph by constructing partitions ([13] and [14] for example). Unfortunately, like node coloring approaches, they cannot be used to solve the mapping problem because each data element is supposed to be stored in one memory bank only, i.e. only one color can be assigned to an edge.
PROBLEM FORMULATION
To explain the problem, we consider a set of $K$ data elements $\{d_0, d_1, \ldots, d_{K-1}\}$ and a set of $P$ processing elements $\{PE_0, PE_1, \ldots, PE_{P-1}\}$ which iteratively process these $K$ data elements in $N$ time instances $\{t_0, t_1, \ldots, t_{N-1}\}$. In order to store these $K$ data elements and to achieve parallel iterative processing of the data for high throughput, a set of $B$ memory banks $\{b_0, b_1, \ldots, b_{B-1}\}$, where $B = P$, is used. All the memory banks have the same size $M = K/P$.
Mapping problem
Store K data elements in B memory banks in such a manner that the P processing elements can access the B memory banks in parallel at each time instance, first to read and then to write B data elements, without any conflict.
To highlight this problem, we introduce a data access matrix with P rows, one per processing element, and N columns, one per time instance. The data elements in each row are processed by the processing element associated with that row. Similarly, the data elements in each column need to be accessed in parallel by the P processing elements of the partially-parallel decoding architecture. Figure 2 represents a data access matrix with K = 6, P = B = 3, M = 2 and N = 6. Each data element is processed 3 times, which reflects the iterative nature of the data access. However, data accesses are interleaved in time and there is no regularity in the processing of the data elements; e.g., data 3 is successively processed at time instances $t_1$ and $t_2$, whereas the first access to data element 4 occurs at time instance $t_3$.
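As a small illustration, the data access matrix of the running example can be written down and queried directly. The sketch below uses the values that appear in the matrix printed with Figure 5 (rows are processing elements, columns are time instances $t_1$ to $t_6$); the names `ACCESS` and `access_times` are illustrative choices and do not come from the paper.

```python
# Data access matrix of the running example (values as listed with Figure 5):
# one row per processing element, one column per time instance t1..t6.
ACCESS = [
    [1, 3, 6, 5, 4, 2],  # PE 1
    [2, 5, 1, 6, 3, 1],  # PE 2
    [3, 6, 4, 2, 5, 4],  # PE 3
]


def access_times(access):
    """Return {data: [time instances, 1-based, in increasing order]}."""
    times = {}
    for t in range(len(access[0])):
        for row in access:
            times.setdefault(row[t], []).append(t + 1)
    return times


TIMES = access_times(ACCESS)
for data in sorted(TIMES):
    print(f"data {data}:", ", ".join(f"t{t}" for t in TIMES[data]))
```

Each of the K = 6 data elements shows up three times, and the access times are irregular from one data element to the next, which is what makes a fixed, regular bank assignment hard to find.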
To successfully map the data (i.e. to allow conflict-free parallel memory access) (1) in a given number of memory banks and (2) while handling the iterative nature of data access in error correction coding, the mapping matrix must fulfill the following two constraints:
1- At each time instance, every memory bank has to be used once and only once.
2- The bank of the last write access to a data element must be the same as the bank of its first read access.
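The two constraints can be checked mechanically for a candidate mapping. The sketch below is only an illustration under stated assumptions: the access matrix is the one printed with Figure 5, the read-bank assignment `READ` was worked out by hand for this example and is not claimed to reproduce the paper's figures, and we assume that every write must land in the bank from which the same data element is read next (constraint 2 is the cyclic case of that rule). All function and variable names are hypothetical.

```python
# Data access matrix (rows = PEs, columns = time instances t1..t6).
ACCESS = [
    [1, 3, 6, 5, 4, 2],
    [2, 5, 1, 6, 3, 1],
    [3, 6, 4, 2, 5, 4],
]

# Hand-derived read banks (0, 1, 2), aligned entry by entry with ACCESS.
# This assignment is an assumption made for illustration only.
READ = [
    [0, 1, 2, 1, 2, 2],
    [2, 2, 0, 0, 0, 1],
    [1, 0, 1, 2, 1, 0],
]


def write_banks(access, read):
    """Bank written by each access = bank of the next read of the same data.

    Accesses to a data element are taken cyclically, so the last write of an
    iteration goes to the bank of the first read (constraint 2).
    """
    n_pe, n_t = len(access), len(access[0])
    positions = {}  # data -> ordered list of (time, pe) occurrences
    for t in range(n_t):
        for p in range(n_pe):
            positions.setdefault(access[p][t], []).append((t, p))
    write = [[None] * n_t for _ in range(n_pe)]
    for occ in positions.values():
        for i, (t, p) in enumerate(occ):
            t_next, p_next = occ[(i + 1) % len(occ)]
            write[p][t] = read[p_next][t_next]
    return write


def is_conflict_free(access, read):
    """Constraint 1: every bank is read once and written once per time instance."""
    n_pe, n_t = len(access), len(access[0])
    banks = set(range(n_pe))  # B = P
    write = write_banks(access, read)
    for t in range(n_t):
        if {read[p][t] for p in range(n_pe)} != banks:
            return False
        if {write[p][t] for p in range(n_pe)} != banks:
            return False
    return True


# A naive in-place mapping (data d always lives in bank (d - 1) mod 3).
IN_PLACE = [[(d - 1) % 3 for d in row] for row in ACCESS]

print(is_conflict_free(ACCESS, READ))
print(is_conflict_free(ACCESS, IN_PLACE))
```

The naive in-place assignment fails at $t_2$, where data 3 and data 6 fall into the same bank, whereas the hand-derived assignment passes because data elements are allowed to move between banks from one access to the next.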
Formal modeling of mapping problem
To tackle the mapping problem, we introduce the concept of multiple read and multiple write access into its formal model: data can not only be accessed with an in-place strategy (when that is possible), but a data element can also be read from one memory bank and then written to a different one, so that the data are mapped into the minimum required number of memory banks. This approach is based on the edge coloring of a bipartite graph and is presented in section 5.
DEFINITIONS
A graph $G = (V, E)$ is a collection of nodes, the set $V$, and edges, the set $E$. If $v, w \in V$ then an edge $(v, w) \in E$ is incident to $v$ and to $w$, and the vertices $v$ and $w$ are said to be adjacent. A subgraph of $G$ is a graph whose vertices and edges are in $G$.
To delete edge $(v, w)$ from $G$ means to form the subgraph $G - (v, w)$, consisting of all vertices of $G$ and all edges of $G$ except $(v, w)$.
A graph $G = (S_1 \cup S_2, E)$ is bipartite if $S_1$ and $S_2$ divide the vertex set, i.e. $S_1 \cap S_2 = \emptyset$, so that each edge is incident to a vertex in $S_1$ and a vertex in $S_2$.
We finally define a proper partition in a semi-regular bipartite graph as a partition that respects either Lemma 1 or Lemma 2.
An edge coloring of $G$ is an assignment of a color to each edge in $G$. The edge chromatic number, $\chi'(G)$, is the smallest number of colors necessary to color each edge of a graph so that no two edges incident to the same vertex have the same color.
In [12], König proved that if the maximum vertex degree of a bipartite graph is $d$ then $\chi'(G) = d$.
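To make the bound concrete, the maximum vertex degree can be computed for the running example, taking data elements and time instances as the two vertex sets (the construction used in the proposed approach). This is a minimal sketch, assuming the data access matrix printed with Figure 5; `EDGES`, `time_degree` and `data_degree` are illustrative names.

```python
from collections import Counter

# Data access matrix (rows = PEs, columns = time instances t1..t6).
ACCESS = [
    [1, 3, 6, 5, 4, 2],
    [2, 5, 1, 6, 3, 1],
    [3, 6, 4, 2, 5, 4],
]

# One edge (data, time) per entry of the data access matrix.
EDGES = [(row[t], t + 1) for t in range(len(ACCESS[0])) for row in ACCESS]

time_degree = Counter(t for _, t in EDGES)
data_degree = Counter(d for d, _ in EDGES)
max_degree = max(list(time_degree.values()) + list(data_degree.values()))

print(max_degree)  # 3 for this matrix
```

For this example the maximum degree is 3, so by König's result three colors are enough for a proper edge coloring, which matches the B = 3 memory banks. Note that the theorem only covers conflicts between edges meeting at a vertex; the write-access conflicts of the mapping problem need the additional treatment described in the next section.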
PROPOSED APPROACH
The proposed algorithm is divided into three steps. In the first step we construct a bipartite graph based on data access matrix. In the second step, we divide our graph into different proper partitions. In the third step, the edges of each partition are colored.
Step I: A bipartite graph $G = (T \cup L, E)$ is constructed from the data access matrix (e.g. Figure 3), in which the vertex set $T$ represents all the time instances and the vertex set $L$ represents all the data elements used in the computation. An edge $(t, l)$ is incident to the data element vertex $l$ and to the time instance vertex $t$ if $l$ needs to be processed at $t$ (i.e. data $l$ will be read and then written at time $t$). Moreover, the different data accesses are represented by the relative position $i$ of the edges at the data vertex, i.e. the first edge at $l$ represents the first read and write accesses, and so on. However, the read access that follows the $i$-th write access is the $((i+1) \bmod \deg(l))$-th edge at the data node $l$. An edge that represents the $i$-th read access is referred to in the rest of this paper as a direct edge, and the edge corresponding to the associated write access as the induced edge. This placement property is used in steps II and III. One interesting property of LDPC decoding is that the number of accesses to data or processing elements is the same at every time instance, which implies that the corresponding bipartite graph is always semi-regular: all the time nodes in the bipartite graph have the same degree $d_t$.
Step II: In this step, the bipartite graph $G$ is divided into proper partitions. In order to simplify the coloring algorithm used in step III, one constraint, named the partitioning constraint, is introduced: no more than 2 read or write accesses may be done at each time instance in a proper partition. Following this constraint always allows proper partitions to be constructed. Each proper partition is constructed using the partitioning algorithm shown in Figure 4.a. In this algorithm, two processes working side by side are applied at each time and data vertex: the process of traversal and the process of elimination. The process of traversal randomly selects one edge available at the current data or time node and records its induced edge. The process of elimination removes from the current partition all the edges which contradict the partitioning constraint. Hence, if $d_t'$ selected direct edges (i.e. read accesses) appear at a time node, then the remaining (i.e. non-selected) available edges at that time instance are eliminated. Also, if $d_t'$ recorded induced edges (i.e. write accesses) appear at a time node, then the direct edges associated with the remaining (i.e. non-recorded) induced edges of that time node are eliminated.
The algorithm starts constructing a path $p_{cur}$ by choosing any data vertex $l_{cur}$ and then applying the process of traversal, which randomly selects an edge $(l_{cur}, t_{cur})$ to reach the time vertex $t_{cur}$. The process of elimination is then applied to remove all the edges which contradict the partitioning constraint. At $t_{cur}$, the process of traversal is applied again to choose another edge $(t_{cur}, l_{next})$ to reach the data vertex $l_{next}$. Again the process of elimination is applied to remove all the edges which contradict the partitioning constraint. At that time $p_{cur} = \{(l_{cur}, t_{cur}), (t_{cur}, l_{next})\}$. The algorithm continues until $p_{cur}$ is completed, i.e. until the process of traversal does not find any valid edge to include in $p_{cur}$. The path is added to the current subgraph $sg_{cur}$. The algorithm then tests whether $sg_{cur}$ is a partition (i.e. whether all the time nodes have been traversed). Once a partition has been extracted, the construction of $sg_{cur}$ stops. Otherwise, the algorithm starts constructing another path $p_{next}$ using the remaining edges of $G$ (those that have not been removed by the process of elimination). Once $sg_{cur}$ becomes a partition, the algorithm starts constructing another partition on the remaining graph $G = G - sg_{cur}$.
Step II is explained through a pedagogical example in the next section.
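The graph construction of step I, together with the placement property, can be sketched in a few lines. This is a minimal sketch under stated assumptions: the data access matrix is the one printed with Figure 5, and the induced edge of a direct edge is taken to be the edge of the cyclically preceding access to the same data element, which is how the worked example pairs $(1, t_1)$ with $(1, t_6)$. The names `build_graph`, `EDGES` and `INDUCED` are hypothetical.

```python
# Data access matrix (rows = PEs, columns = time instances t1..t6).
ACCESS = [
    [1, 3, 6, 5, 4, 2],
    [2, 5, 1, 6, 3, 1],
    [3, 6, 4, 2, 5, 4],
]


def build_graph(access):
    """Return (edges, induced) for the bipartite graph G = (T u L, E).

    An edge is a pair (data, time) with 1-based time instances.
    induced[(l, t)] is the edge carrying the write access that feeds the
    read access of direct edge (l, t): the previous access to l, cyclically.
    """
    times = {}
    for t in range(len(access[0])):
        for row in access:
            times.setdefault(row[t], []).append(t + 1)
    edges = [(l, t) for l, ts in sorted(times.items()) for t in ts]
    induced = {}
    for l, ts in times.items():
        for i, t in enumerate(ts):
            induced[(l, t)] = (l, ts[i - 1])  # i = 0 wraps to the last access
    return edges, induced


EDGES, INDUCED = build_graph(ACCESS)
print(len(EDGES))
print(INDUCED[(1, 1)], INDUCED[(3, 1)])
```

With this representation, the process of elimination only has to count, for each time node, the selected direct edges ending there and the recorded induced edges ending there.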
Step III: Thanks to the construction of proper partitions respecting the partitioning constraint, our coloring algorithm, whose flow chart is shown in Figure 4.b, colors each partition with at most two colors. To do so, it colors each edge in each partition so that there is no conflict in the read and write accesses at each time node. For each uncolored partition $sg_{cur}$, the algorithm starts by removing the read access conflicts by assigning a different color to each edge $(l_i, t_{cur})$ of $t_{cur}$. After that, following the placement property (see the description of step I), the algorithm searches in $G$, for each edge $(l_i, t_{cur})$ of $t_{cur}$, for the induced edge $(t_{pred}, l_i)$. Since only two write accesses are possible at each time node (by the partitioning constraint), the algorithm searches in $G$ for the direct edge $(l_m, t_k)$ of the induced edge $(t_{pred}, l_m)$ that belongs to $sg_{cur}$. The algorithm then colors $(l_m, t_k)$ differently from $(l_i, t_{cur})$ and continues until it reaches the starting node, both of whose direct edges are already colored. While the partition is not completely colored, the algorithm selects another time node $t_{cur}$ and repeats. It should be noticed that simply giving different colors to both the direct edges at each time node in each partition, without taking into account the write access memory conflicts, makes the algorithm recursive.
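The effect of step III on one partition can be reproduced with a generic two-coloring of the conflicts between direct edges. The sketch below is not the chain-following procedure of Figure 4.b; it swaps in a plain breadth-first two-coloring, assuming that two direct edges of a partition conflict when they meet at the same time node (read conflict) or when their induced edges meet at the same time node (write conflict). The partition `SG1` is the one obtained in the example of the next section; all names are hypothetical.

```python
from collections import deque

# Data access matrix (rows = PEs, columns = time instances t1..t6).
ACCESS = [
    [1, 3, 6, 5, 4, 2],
    [2, 5, 1, 6, 3, 1],
    [3, 6, 4, 2, 5, 4],
]

# Partition sg1 of the worked example, as (data, time) edges.
SG1 = [(1, 1), (3, 1), (3, 5), (5, 5), (5, 4), (6, 4), (6, 2), (3, 2),
       (1, 3), (4, 3), (4, 6), (1, 6)]


def induced_edges(access):
    """induced[(l, t)] = edge of the previous access to l (cyclically)."""
    times = {}
    for t in range(len(access[0])):
        for row in access:
            times.setdefault(row[t], []).append(t + 1)
    return {(l, t): (l, ts[i - 1])
            for l, ts in times.items() for i, t in enumerate(ts)}


def two_color(partition, induced):
    """Two-color the direct edges of a partition, or return None on failure."""
    edges = list(partition)
    conflicts = {e: set() for e in edges}
    for i, a in enumerate(edges):
        for b in edges[i + 1:]:
            read_conflict = a[1] == b[1]
            write_conflict = induced[a][1] == induced[b][1]
            if read_conflict or write_conflict:
                conflicts[a].add(b)
                conflicts[b].add(a)
    color = {}
    for start in edges:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            e = queue.popleft()
            for other in conflicts[e]:
                if other not in color:
                    color[other] = 1 - color[e]
                    queue.append(other)
                elif color[other] == color[e]:
                    return None  # odd conflict cycle: two colors are not enough
    return color


INDUCED = induced_edges(ACCESS)
COLORS = two_color(SG1, INDUCED)
for edge in SG1:
    print(edge, "-> b%d" % COLORS[edge])
```

For this partition the conflicts form a single even cycle through all twelve edges, so two colors suffice, in line with the claim that each proper partition is colored with at most two colors.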
PRACTICAL IMPLEMENTATION
Let us present an example based on the data access matrix of Figure 2. The first step is the construction of the bipartite graph, which is already depicted in Figure 3. In this semi-regular bipartite graph, each time vertex has degree $d_t = 3$. Following Lemma 2, after applying step II we will have one partition in which each time vertex has degree $d_t' = 2$ and one subgraph in which $d_t' = 1$. To better explain the modeling approach we propose, we use in this paper a mapping matrix. In this matrix, two columns are added to each time instance column of the data access matrix introduced in section 3. The first column shows the memory banks which are used for read accesses and the second column shows the memory banks which are used for write accesses at this time instance. The mapping matrix of Figure 2 is shown in Figure 5.

Time    t1      t2      t3      t4      t5      t6
        R W     R W     R W     R W     R W     R W
        1       3       6       5       4       2
        2       5       1       6       3       1
        3       6       4       2       5       4

Figure 5: Mapping matrix for data access matrix of Figure 2

The algorithm starts constructing the path $p_1$ by using the first available edge of data 1, which is $(1, t_1)$, leading to $p_1 = \{(1, t_1)\}$. The selected edge $(1, t_1)$ and its corresponding recorded induced edge $(1, t_6)$ appear respectively as a bold and a dotted line in Figure 6.a. By the placement property, the write access of the edge $(1, t_1)$ indeed appears on the edge $(1, t_6)$. The process of elimination is applied and no edge is removed.
The process of traversal continues and adds the edge $(t_1, 3)$ to the path: $p_1 = \{(1, t_1), (t_1, 3)\}$. According to the partitioning constraint, only two read accesses are possible at each time node. Since two read accesses are now selected at $t_1$, the process of elimination deletes all the remaining edges at $t_1$, here $(t_1, 2)$. Deleted edges are simply removed from the graph in Figure 6.b. Edge $(3, t_5)$ is then selected and added to the path. Since this edge is both a recorded induced edge and a selected direct edge, it appears as a bold and dotted line in Figure 6.c.
The process continues until the path $p_1 = \{(1, t_1), (t_1, 3), (3, t_5), (t_5, 5), (5, t_4)\}$ has been traversed and the time node $t_4$ is reached. At this point, the number of recorded induced edges at $t_2$ increases to two and the process of elimination deletes all the direct edges associated with the remaining (i.e. non-recorded) induced edges at $t_2$. This whole process is shown in Figure 6.d. The traversal continues until the path extends to $p_1 = \{(1, t_1), (t_1, 3), (3, t_5), (t_5, 5), (5, t_4), (t_4, 6), (6, t_2), (t_2, 3)\}$, as shown in Figure 7.a. No more edges can be added to the current path. We thus obtain a subgraph $sg_1 = p_1$. However, the current subgraph $sg_1$ is not a partition because the time nodes $t_3$ and $t_6$ are not included in $p_1$. Using the process of traversal, the path $p_2$ is obtained: $p_2 = \{(1, t_3), (t_3, 4), (4, t_6), (t_6, 1)\}$ (see Figure 7.b).
The partition $sg_1$ is the union of all the traversed paths, $sg_1 = p_1 + p_2$ (see Figure 7.c). Unfortunately, the graph is not completely traversed, so the algorithm removes $sg_1$ to obtain the graph $G' = G - sg_1$ and applies the processes again on the remaining graph to obtain the following paths: $p'_1 = \{(2, t_1)\}$, $p'_2 = \{(2, t_4)\}$, $p'_3 = \{(2, t_6)\}$, $p'_4 = \{(4, t_5)\}$, $p'_5 = \{(5, t_2)\}$, $p'_6 = \{(6, t_3)\}$. Similarly, the partition $sg_2$ is the sum of all the traversed paths, $sg_2 = p'_1 + p'_2 + p'_3 + p'_4 + p'_5 + p'_6$ (see Figure 7.d).
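The result of step II on this example can be checked against the partitioning constraint. The sketch below is a minimal check, assuming the data access matrix printed with Figure 5 and the two partitions listed above; it counts, for each time node, the direct edges (reads) and the induced edges (writes) contributed by a partition. The names `SG1`, `SG2` and `time_loads` are hypothetical.

```python
from collections import Counter

# Data access matrix (rows = PEs, columns = time instances t1..t6).
ACCESS = [
    [1, 3, 6, 5, 4, 2],
    [2, 5, 1, 6, 3, 1],
    [3, 6, 4, 2, 5, 4],
]

# Partitions of the worked example, as (data, time) edges.
SG1 = [(1, 1), (3, 1), (3, 5), (5, 5), (5, 4), (6, 4), (6, 2), (3, 2),
       (1, 3), (4, 3), (4, 6), (1, 6)]
SG2 = [(2, 1), (2, 4), (2, 6), (4, 5), (5, 2), (6, 3)]


def induced_edges(access):
    """induced[(l, t)] = edge of the previous access to l (cyclically)."""
    times = {}
    for t in range(len(access[0])):
        for row in access:
            times.setdefault(row[t], []).append(t + 1)
    return {(l, t): (l, ts[i - 1])
            for l, ts in times.items() for i, t in enumerate(ts)}


def time_loads(partition, induced):
    """Count reads and writes that a partition places on each time node."""
    reads = Counter(t for _, t in partition)
    writes = Counter(induced[e][1] for e in partition)
    return reads, writes


INDUCED = induced_edges(ACCESS)
for name, sg in (("sg1", SG1), ("sg2", SG2)):
    reads, writes = time_loads(sg, INDUCED)
    print(name, "reads:", dict(sorted(reads.items())),
          "writes:", dict(sorted(writes.items())))
```

Both partitions together cover all 18 edges of the graph, and every time node carries exactly two reads and two writes in the first partition and exactly one of each in the second, which respects the limit of two accesses per time node.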
After the construction of $sg_2$, the algorithm finds that the graph is completely traversed and terminates.
After that, we search in $G$ for the induced edges of these previously colored edges. The induced edge of $(t_1, 1)$ is $(1, t_6)$, so we search for the other direct edges that belong to $sg_1$ and which have an induced edge at $t_6$ in $G$. Edge $(t_3, 4)$ must be given a different color from $(t_1, 1)$ in order to remove the write access conflict at $t_6$. So we color $(t_3, 4) = b_1$ (see Figure 8.c). The write access of $(t_1, 2)$ also occurs at $t_6$. However, since $(t_1, 2)$ does not belong to $sg_1$, it is not colored at that time. The corresponding mapping matrix is shown in Figure 8.d.
Figure 8: Conflict free edge coloring of $sg_1$
This process continues until the partition is completely colored. The complete coloring of $sg_1$ is shown in Figure 9.a. The corresponding mapping matrix is presented in Figure 9.b.
The coloring of $sg_2$ is easier: all the edges are colored with one single color $b_2$. The complete coloring of $G$ is shown in Figure 10.a. The corresponding mapping matrix is presented in Figure 10.b.
CONCLUSION
In this paper, we have presented a conflict-free mapping approach for designing any parallel iterative decoder, for any type of LDPC code. The approach introduces the concept of multiple read/write access and uses a modified bipartite edge coloring algorithm. In future work, additional constraints will be added to the algorithm to support conflict-free mapping for specific interconnection networks such as barrel shifter, butterfly or binary de Bruijn networks. This effort will enable the design of flexible networks of reduced size, higher throughput and lower hardware cost.
