Abstract: Flexible and reconfigurable architectures have gained wide popularity in the communications field. In particular, reconfigurable architectures for the physical layer are an attractive solution not only to switch among different coding modes but also to achieve interoperability. This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC code decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue by introducing a formal and systematic treatment that, to the best of our knowledge, was not previously addressed and ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead. The results show that dynamic switching between most of the considered communication standards is possible without pausing the decoding activity. Moreover, post-layout results show that tailoring the proposed architecture to the WiMAX standard leads to an area occupation of 2.75 mm² and a power consumption of 101.5 mW in the worst case.
I. INTRODUCTION

In recent years, considerable effort has been devoted to developing systems able to give ubiquitous access to telecommunication networks. These efforts have gone mainly in three directions: i) improving the transmission rate and reliability; ii) developing bandwidth-efficient technologies; iii) designing low-cost receivers. The most relevant results produced by this intense research were included in the latest standards for both wireless and wired communications [1]-[7]. Besides, several standards provide multiple modes and functionalities. However, exploiting their common features to achieve flexibility and interoperability is a challenging task.
Several recent works, including [8], have shown that flexibility is an important property in the implementation of communication systems. Some works investigated this direction, facing the challenge of implementing flexible architectures for the decoding of channel codes. In particular, flexible turbo/low-density-parity-check (LDPC) decoder architectures have been proposed not only to support different coding modes within a specific standard but also to enable interoperability among different standards. In [9]-[11], flexibility is achieved through the design of processing elements (PEs) based on application-specific-instruction-set-processor (ASIP) architectures, whereas in [12]-[14] PEs rely on application-specific-integrated-circuit (ASIC) solutions. In both approaches, flexible and efficient interconnection structures are required to connect PEs to each other. Unfortunately, the communication patterns of turbo and LDPC codes suffer from collisions, which occur when two or more PEs require concurrent access to the same memory resource. To resolve collisions, a network-on-chip (NoC) like approach was proposed in [15] for turbo codes. This idea has been further developed in other works. In particular, in [16] the NoC approach is used as a viable solution to implement flexible and high-throughput interconnection structures for turbo/LDPC decoders.
An intra-IP NoC [17] is an application-specific NoC [18] where the interconnection structure is tailored to the characteristics of the intellectual property (IP). The use of an intra-IP NoC as the interconnection framework for both turbo and LDPC code decoders has been demonstrated in several works [16], [19]-[21]. This choice enables larger flexibility with respect to other interconnection schemes [16], [22], [23], but introduces penalties in terms of additional occupied area and latency in the communication among PEs.
Stemming from the work presented in [14], [19], [20], where an ASIC implementation of an NoC-based turbo/LDPC decoder architecture is proposed, this paper aims to further investigate and optimize it. In particular, this work features the following novel contributions: i) management of dynamic reconfiguration to switch from one code to another without pausing the decoding, ii) description of a new PE architecture with an improved shared memory solution which provides a relevant saving of occupied area for the min-sum decoding algorithm, iii) evaluation of a wide set of standards for both wireless and wired applications: IEEE 802.16e (WiMAX) [5], IEEE 802.11n (WiFi) [6], China Multimedia Mobile Broadcasting (CMMB) [3], Digital Terrestrial Multimedia Broadcast (DTMB) [4], HomePlug AV (HPAV) [2], 3GPP Long Term Evolution (LTE) [7], Digital Video Broadcasting-Return Channel via Satellite (DVB-RCS) [1], and iv) complete VLSI implementation of the decoder up to layout level and accurate evaluation of dissipated power.
It is worth noting that, to the best of our knowledge, this is the first work addressing dynamic reconfiguration of flexible channel decoders with an analytical approach, and showing the actual impact of reconfiguration on both performance and complexity. The paper is structured as follows. In Section II, decoding algorithms are briefly discussed, whereas Section III deals with the basics of NoC-based turbo/LDPC decoder architectures and summarizes the main results this work starts from. The decoder reconfiguration techniques are detailed in Sections IV and V, while Section VI deals with the description of LDPC and turbo decoding cores, along with their respective memory organization. In Section VII, evaluations of the architecture performance on various existing standards are provided. Implementation results are portrayed and discussed in Section VIII, and conclusions are drawn in Section IX.
II. DECODING ALGORITHMS
Turbo and LDPC decoding algorithms are characterized by strong resemblances: they are iterative, work on graph-based representations, are routinely implemented in logarithmic form, process data expressed as logarithmic-likelihood-ratios (LLRs) and require a high level of both processing and storage parallelism. Both algorithms receive intrinsic information from the channel and produce extrinsic information that is exchanged across iterations to obtain the a priori information of uncoded bits, in the case of binary codes, or symbols, in the case of nonbinary codes. Moreover, their arithmetical functions are so similar that joint or derived algorithms for both LDPC and turbo decoding exist [24]. In the following, for both codes we will refer to $K$, $N$ and $r = K/N$ as the number of uncoded bits, the number of coded bits and the code rate respectively.
A. LDPC Codes Decoding Algorithm
Every LDPC code is completely described by its parity check matrix $\mathbf{H}$, which is very sparse [25]. Each valid LDPC codeword $\mathbf{x}$ satisfies $\mathbf{H} \cdot \mathbf{x}^T = \mathbf{0}$, where $(\cdot)^T$ is the transposition operator. The decoding of LDPC codes stems from the Tanner graph representation of $\mathbf{H}$, where two sets of nodes are identified: Variable Nodes (VNs) and Check Nodes (CNs). VNs are associated with the bits of the codeword, whereas CNs correspond to the parity-check constraints. The most common algorithm to decode LDPC codes is the Belief Propagation (BP) algorithm. There are two main scheduling schemes for the BP: two-phase scheduling and layered scheduling [26]. The latter nearly doubles the convergence speed as compared to two-phase scheduling. In a layered decoder, parity-check constraints are grouped in layers, each of which is associated with a component code. Then, layers are decoded in sequence by propagating extrinsic information from one layer to the following one [26]. This process is iterated up to the desired level of reliability.
Let represent the LLR of symbol and, for column in , bit LLR is initialized to the corresponding received soft value. Then, for all parity constraints in a given layer, the following operations are executed:
is the extrinsic information received from the previous layer and updated in (5) to be propagated to the succeeding layer. Term , pertaining to element of and initialized to 0, is used to compute (1); the same amount is then updated in (4), , and stored to be used again in the following iteration. In (2) and (3) is the set of all bit indexes that are connected to parity constraint .
According to [27] , the function in (2) and (4) can be simplified with a limited BER performance loss as (6) usually referred to as normalized-min-sum approximation, where and .
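As an illustration of the layered update with the normalized-min-sum approximation, the following minimal sketch processes a single parity constraint. It is not the authors' implementation: the names `llr`, `r_old`, `q` and the normalization factor `alpha = 0.75` are hypothetical, and the sign convention (the outgoing sign is the product of the signs of all the other inputs) is the common textbook one, assumed here rather than taken from (1)-(6).

```python
from typing import List, Tuple


def min_sum_check_update(q: List[float], alpha: float = 0.75) -> List[float]:
    """Normalized min-sum check-node update for one parity constraint.

    q holds the variable-to-check values of the bits in the constraint.
    alpha is a hypothetical normalization factor (assumption).
    """
    mags = [abs(v) for v in q]
    order = sorted(range(len(q)), key=lambda i: mags[i])
    i_min1 = order[0]                      # position of the first minimum
    min1, min2 = mags[order[0]], mags[order[1]]

    sign_prod = 1                          # product of all input signs
    for v in q:
        if v < 0:
            sign_prod = -sign_prod

    out = []
    for i, v in enumerate(q):
        mag = min2 if i == i_min1 else min1   # exclude the bit itself
        sign_i = -1 if v < 0 else 1
        out.append(alpha * sign_prod * sign_i * mag)
    return out


def layered_update(llr: List[float], r_old: List[float],
                   alpha: float = 0.75) -> Tuple[List[float], List[float]]:
    """One layered step: subtract old check messages, update, add back."""
    q = [l - r for l, r in zip(llr, r_old)]
    r_new = min_sum_check_update(q, alpha)
    llr_new = [qi + ri for qi, ri in zip(q, r_new)]
    return llr_new, r_new
```

For example, `layered_update([2.0, -3.0, 1.0, 4.0], [0.0] * 4)` returns the updated bit LLRs together with the new check-to-variable values to be stored for the following iteration.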
B. Turbo Codes Decoding Algorithm
Turbo codes are obtained as the parallel concatenation of two constituent Convolutional Code (CC) encoders connected by means of an interleaver. Thus, the decoder is made of two constituent decoders, referred to as soft-in-soft-out (SISO) or maximum-a-posteriori (MAP) decoders [28], connected in an iterative loop by means of the interleaver and the de-interleaver. Each constituent decoder performs the so-called BCJR algorithm [29] that, starting from the intrinsic and a priori information, produces the extrinsic information. Let be a step in the trellis representation of the constituent CC, and an uncoded symbol. Each constituent decoder computes where [30], is the a-posteriori information, is the a priori information and is the systematic component of the intrinsic information. According to [29] a-posteriori information is computed as (7) where is an uncoded symbol taken as a reference (usually ) and with the set of uncoded symbols; is a trellis transition and is the corresponding uncoded symbol. Several exact and approximated expressions are available for the $\max^*$ function [31]: for example, it can be implemented as a $\max$ followed by a correction term, often stored in a small look-up-table (LUT). The correction term, usually adopted when decoding binary codes (Log-MAP), can be omitted with minor bit-error-rate (BER) performance degradation (Max-Log-MAP). The term in (7) is defined as
where and are the starting and the ending states of , and are the forward and backward metrics associated to and respectively. The term represents the intrinsic information received from the channel. For further details on the decoding algorithm the reader can refer to [32] . In a parallel decoder, the decoding operations summarized in previous paragraphs are partitioned among PEs. When configured in turbo code mode, these PEs operate as concurrent SISOs. On the other hand, they execute (1)-(5) in parallel for slices of parity check constraints when configured in LDPC code mode. In both cases, messages are exchanged among PEs to propagate and amounts in accordance with the code structure. In the following, we indicate the -th message received and generated by PE as and respectively.
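The difference between the Log-MAP and Max-Log-MAP variants can be sketched with the two-input Jacobian logarithm. The code is only an illustration: the LUT step of 0.5 and its eight entries are hypothetical choices, not values taken from this work.

```python
import math


def max_star(a: float, b: float) -> float:
    """Exact Jacobian logarithm: ln(exp(a) + exp(b))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a: float, b: float) -> float:
    """Max-Log-MAP approximation: the correction term is dropped."""
    return max(a, b)


# Hypothetical LUT for the correction term (step and depth are assumptions).
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-k * LUT_STEP)) for k in range(8)]


def max_star_lut(a: float, b: float) -> float:
    """Log-MAP with the correction term read from a small look-up table."""
    index = int(abs(a - b) / LUT_STEP)
    correction = LUT[index] if index < len(LUT) else 0.0
    return max(a, b) + correction
```

When the two inputs are far apart the correction term vanishes, which is why omitting it costs little in BER terms.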
III. NOC-BASED DECODER
The goal of this work is to design a highly flexible LDPC and turbo decoder, able to support a very wide set of different communication standards. The proposed multi-mode/multi-standard decoder architecture relies on an NoC-based structure, where each node contains a PE and a routing element (RE). Each PE implements the BCJR and layered normalized min-sum algorithms. On the other hand, REs are devoted to deliver values to the correct destination.
The node architecture employed in this work for node is represented in Fig. 1. Each RE is constituted by a 4 × 4 crossbar switch with 4 input FIFOs and 4 output registers. The routing algorithm is the one proposed in [19] as Single-Shortest-Path-FIFO-Length (SSP-FL). SSP-FL relies on a distributed table-based routing algorithm where each table contains the information for shortest path routing. The routing information is precalculated by running the Floyd-Warshall algorithm off-line. Moreover, in SSP-FL shortest path routing is coupled with an input serving policy based on the current status of the FIFOs: in case two messages must be routed to the same output port, priority is given to the message coming from the longer FIFO. It is worth noting that the destination of each is imposed by the interleaver and the matrix, respectively. As a consequence, the routing is deterministic.
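The two ingredients of SSP-FL can be sketched as follows, assuming a directed topology given as adjacency lists. The function names and the tie-breaking rule (lowest input index wins when FIFO lengths are equal) are hypothetical; only the use of Floyd-Warshall for the off-line tables and the longest-FIFO priority come from the description above.

```python
from typing import Dict, List

INF = float("inf")


def routing_tables(adj: List[List[int]]) -> List[List[int]]:
    """Off-line Floyd-Warshall computation of per-node routing tables.

    adj[u] lists the nodes reachable from u in one hop; the position in
    the list is the output port. Returns port[u][d], the output port that
    node u uses for destination d (-1 for d == u or if d is unreachable).
    """
    n = len(adj)
    dist = [[INF] * n for _ in range(n)]
    port = [[-1] * n for _ in range(n)]
    for u in range(n):
        dist[u][u] = 0
        for p, v in enumerate(adj[u]):
            if v != u and dist[u][v] > 1:
                dist[u][v] = 1
                port[u][v] = p
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    port[i][j] = port[i][k]   # first hop toward k
    return port


def arbitrate(fifo_lengths: Dict[int, int]) -> int:
    """Pick the input FIFO to serve among those contending for one output.

    The longest FIFO wins; ties go to the lowest index (assumption).
    """
    return max(fifo_lengths, key=lambda p: (fifo_lengths[p], -p))
```

Since the tables are computed once per topology, each RE only needs a lookup indexed by the destination carried in the message header.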
The PE includes both LDPC and turbo decoding cores: their architectures are structured to be as independent as possible of the supported codes. The LDPC decoding core is completely serial and able to decode any LDPC code, provided that enough memory is available. The SISO core for turbo decoding is tailored around 8-state turbo codes, and no other constraints are present: the two cores share the memories where the incoming data are stored and the location memory containing the pre-computed values, i.e., the memory addresses to store . Also the interconnection structure depends only on the location memory size, that sets an upper bound to the number of messages each PE can handle.
The decoding task is divided uniformly among the different nodes. The process is straightforward in turbo mode, with each node being assigned a portion of the trellis that is processed in a sliding-window fashion [33], [34]. Extrinsic and window-initialization information are carried through the network according to the code interleaving and deinterleaving rules [19]. In LDPC mode, instead, the partitioning of the decoding task on the PEs is obtained as follows. Using a proprietary tool based on the METIS graph coloring library [35], the matrix is partitioned on the chosen network topology. At this point, the destination of every message coming out of each decoding core is known. Thus, in both turbo and LDPC modes each outgoing message is made of a payload and a header containing the destination node.
The performance of meshes, toroidal meshes, spidergon, honeycomb, De Bruijn and Kautz graphs was compared, along with a number of other design choices, such as routing algorithms and collision management policies. This analysis shows that the Kautz topology yields the best results in terms of area occupation and obtainable throughput. In particular, in [14] a 22-node Kautz NoC was used to fully support the IEEE 802.16e standard, each node being connected to a decoding PE and to three other nodes via a 4-way router.
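A degree-3 topology with 22 nodes can be sketched as follows. Since 22 is not one of the sizes of a regular degree-3 Kautz graph (4, 12, 36, ...), this sketch assumes the generalized Kautz (Imase-Itoh) construction, in which node i is linked to nodes (-d·i - a) mod n for a = 1, ..., d. This is an assumption about the construction, not a statement taken from [14]. Note that this rule can produce self-loops on some nodes, which a real design would have to handle.

```python
from collections import deque
from typing import List


def generalized_kautz(n: int, d: int) -> List[List[int]]:
    """Adjacency lists of the Imase-Itoh digraph with n nodes, out-degree d."""
    return [[(-d * i - a) % n for a in range(1, d + 1)] for i in range(n)]


def eccentricity(adj: List[List[int]], src: int) -> int:
    """Largest hop distance from src, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) < len(adj):
        raise ValueError("graph is not strongly connected")
    return max(dist.values())


def diameter(adj: List[List[int]]) -> int:
    """Maximum shortest-path length over all ordered node pairs."""
    return max(eccentricity(adj, s) for s in range(len(adj)))
```

With `generalized_kautz(22, 3)` every message reaches its destination in at most three hops, which gives an idea of why low-diameter topologies are attractive for this kind of decoder.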
IV. DECODER RECONFIGURATION
Flexible decoders available in the literature [9] - [13] , [16] , [17] , [19] , [20] , though supporting a wide range of codes, do not address the reconfiguration issue. Change of decoding mode, standard or code parameters requires not only hardware support, but also memory initialization and specific controls: since in many standards a code switch can be issued as early as one data frame ahead [5], a time efficient reconfiguration technique must be developed.
For the proposed decoder the reconfiguration task consists of i) rewriting the location memory containing values; ii) reloading the CN degree parameters and the window size in the control unit of LDPC decoding cores and SISOs respectively. In the following, the whole set of storage locations to be updated at reconfiguration time will be indicated as "reconfiguration memory". When possible, the decoder must be reconfigured while the decoding process is still running on the previous data frame. This means that the reconfiguration data can be distributed by means of the NoC interconnections only at the cost of severe performance penalties. Consequently, we suppose that the reconfiguration data are moved directly to the PEs via a set of dedicated buses, each one linked to PEs. In the following, we estimate reconfiguration occurrence assuming mobile receivers moving at different speeds and the carrier frequency . This frequency is included in the operating range of most standards, and is used in a variety of applications. In this scenario, the communication channel is affected by fading phenomena, namely slow fading, whose effects have very long time constants, and fast fading. Fast fading can be modeled assuming a change of channel conditions every time the receiver is moved by a distance similar to the wavelength of the carrier. Being , at a speed , the channel changes with a frequency (WiMAX, WiFi, 3GPP-LTE), whereas, at (DVB-RCS, HPAV, CMMB, DTMB) changes occur at . These scenarios result in different reconfiguration probabilities, whose impact on BER performance is addressed in Section V.
The reconfiguration memory is organized as a circular buffer: two sets of pointers are used to manage reading and writing operations. The start of current configuration (SCC) pointer and the end of current configuration (ECC) pointer delimit the memory blocks that are currently being used. A read pointer (RP) is used to retrieve the data during the decoding process, as shown in Fig. 2(a) . The start of future configuration (SFC) and end of future configuration (EFC) pointers are instead used concurrently with the write pointer (WP) to delimit the locations that are going to be used to store the new configuration data.
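The pointer mechanism can be illustrated with a simplified software model. It is a sketch under several assumptions: ECC and EFC are treated as "one past the last location", locations of the old configuration are released only at switch time, and the class and method names are hypothetical.

```python
class ReconfigMemory:
    """Simplified model of a circular reconfiguration buffer."""

    def __init__(self, size, current_words):
        assert 0 < len(current_words) <= size
        self.size = size
        self.mem = [None] * size
        for i, word in enumerate(current_words):
            self.mem[i] = word
        self.scc = 0                      # start of current configuration
        self.cur_len = len(current_words)
        self.rp = self.scc                # read pointer
        self.sfc = self.ecc               # future configuration starts at ECC
        self.wp = self.sfc                # write pointer (also plays EFC)
        self.fut_len = 0

    @property
    def ecc(self):
        """End of current configuration (exclusive, modulo the size)."""
        return (self.scc + self.cur_len) % self.size

    def read(self):
        """Return the next configuration word, wrapping from ECC to SCC."""
        word = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.size
        if self.rp == self.ecc:
            self.rp = self.scc
        return word

    def write(self, word):
        """Store one word of the future configuration if a slot is free."""
        if self.cur_len + self.fut_len >= self.size:
            return False
        self.mem[self.wp] = word
        self.wp = (self.wp + 1) % self.size
        self.fut_len += 1
        return True

    def switch(self):
        """Make the future configuration the current one."""
        self.scc = self.sfc
        self.cur_len = self.fut_len
        self.rp = self.scc
        self.sfc = self.ecc
        self.wp = self.sfc
        self.fut_len = 0
```

In this model, writes never disturb the words delimited by SCC and ECC, so decoding of the current code can continue while the next configuration is being loaded.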
The reconfiguration of the considered decoder to switch from the code currently processed to a new one can be overlapped with the decoding of both current and new code, provided that enough locations are free in the configuration memories. In particular, part of the configuration process can be concurrent with the decoding of one or more frames of ; if necessary, another portion of the configuration can be scheduled during the first iteration of the new code . Finally, in case the overlap with decoding activity is not sufficient to complete the whole configuration, a further option is pausing the decoder by skipping one or more iterations on the last received frame for and using the available time, before starting the decoding of the new frame encoded with .
Let us define as the size of the location buffer available at each PE to store configuration data, and as the duration in clock cycles of a single decoding iteration for codes and . Moreover, and express the number of locations required to store configurations of codes and at each PE, and and their iteration numbers. In the considered architecture, the duration of one decoding iteration expressed in clock cycles is directly proportional to the number of memory locations a PE has to read throughout the decoding process, and consequently to the number of used locations in the reconfiguration memory . Though the actual relationship between and is affected by memory scheduling and ratio between PE and NoC clock frequencies, this analysis is carried out with the worst-case assumption that the reconfiguration memory is read at every clock cycle of each iteration, setting for both and codes. We define five phases , , 2, 3, 4, 5 in the configuration process and for each phase we identify i) as the number of clock cycles available during phase , and ii) as the number of locations in each reconfiguration memory that can be written in clock cycles. In the reconfiguration from code to code , words must be replaced with new words. The first part of the configuration can be scheduled during the initial decoding iterations on and therefore the available time is ; in this range of time a maximum of words can be loaded into each buffer. However, assuming that the buffer size is larger than , we define as the number of unused memory blocks in current configuration for code . Therefore, the actual number of locations written in is the minimum between and . The SFC pointer is thus initialized as ECC [ Fig. 2(b) ].
During the last iteration on , every memory location between SCC and the current position of RP is available for reconfiguration. This means that up to locations are available for receiving configuration words for . However, this has to be done during a single iteration, and therefore cycles are available. During these cycles, up to words can be loaded. As mentioned before, part of the configuration can be overlapped with the first decoding iteration on code. SCC is initialized as SFC, and RP will take the duration of a full iteration to arrive to ECC [ Fig. 2(c) ]. The available time is and the maximum number of words that can be loaded in this phase is . In the event that previously listed phases are not sufficient to complete the configuration, an early stopping in the decoding of code can be scheduled to make available additional cycles to be used for loading the remaining part of the configuration words. We indicate the number of cycles available in this phase as . The number of words that can be loaded in is . As one or more complete iterations are dropped in , is a multiple of , which can be formalized as (12) Differently from the other four phases, affects the decoder performance, as if the number of decoding iterations is reduced for code . Evaluating the actual effect on BER and FER curves is necessary to understand the feasibility of this approach.
If necessary, the reconfiguration process can be overlapped with the decoding of a number of data frames encoded with , in addition to the last frame, which was 
The five described phases are reported in Table I , together with the corresponding and . Thus, , , and are design parameters, and their values must be decided based on decoder parallelism (P) and supported codes, which determine and . Two alternative cases can arise during : either this phase is limited by the available time, or it is limited by the number of free locations in the reconfiguration memory (14) Then, assuming , we define the threshold (15) and distinguish between two cases: 1) (small codes), 2) ( large codes). Let us study the two cases separately.
A. : Small Codes
When , phase is not useful at all, as dropping decoding iterations has the effect of reducing the time of by the same amount that is gained in . Therefore, the following constraint can be set: (16) This constraint simply means that the overall available time through , , and must be long enough to update locations in the reconfiguration memories. From the values in the second column of 
A number of preceding frames can be exploited only if enough locations are unused in the buffers during and . This condition can be expressed as (19) namely (20) Thus, given that , the maximum useful value of depends on as (21) Thus, (18) can be better written as (22) This means that the size of code has an upper bound and this bound is proportional to the size of code . Therefore, the most critical reconfiguration cases are those involving "small" codes: in such cases, there could be many codes that violate condition (22). The bound is also proportional to , and can consequently be increased by raising the number of reconfiguration buses.
B. : Large Codes
In this case, . Now the use of phase makes sense as the duration of does not depend on the number of iterations, because it is limited by the number of free locations in the reconfiguration memory. As a consequence, additional reconfiguration time can be gained if is reduced. On the contrary, is not useful for large , because is limited by the available memory, whereas the number of available cycles is sufficient. Thus, in this case is completed in cycles (when all the available locations are written) and the constraints on is now written as
If and , we have (24) Also for this case, there is a limit to the size of code that can replace during phases from to . However, this limit can be increased by increasing or .
V. RECONFIGURATION: CASES AND EXAMPLES
The reconfiguration method detailed in Section IV has been applied to a set of target standards, in order to identify suitable design parameters (i.e., , , , ) that enable reconfiguration without pausing the decoder for most code sizes. The following analysis has been performed with . Figs. 3-7 plot the maximum , as defined by (22) and (24), for a continuous set of values. The markers represent a subset of the considered intra- and inter-standard code changes: markers below the curve identify reconfigurations that can be performed without pausing the decoder. Fig. 3 shows the maximum for different values of B: in this plot, corresponds to , which is the size of the largest considered , while 160% means . It can be seen that in the cases of small codes, increasing the buffer size does not affect the positive slope portion of the curve.
On the contrary, in Fig. 4 , the maximum is shown for different values of : in this case, an increase of is reflected in all areas of the plot. A higher number of buses means a shorter reconfiguration time, and a larger maximum .
Variation in the maximum allowed (Fig. 5) only affects the maximum in case of large codes (negative-slope portion of the curve), as shown in (24). It can be noticed that with all the large codes are below the right side of the curve: later in this section it will be demonstrated how these skipped iterations are negligible in terms of BER performance. In Fig. 6, the effect of different choices of is shown: from the plot it can be seen that actually increases the maximum only for small codes. Finally, Fig. 7 plots some combinations of the analyzed parameters in order to allow dynamic reconfiguration among most of the considered codes. The represented combinations of , , and all yield very similar performance: the cost underlying every parameter choice consequently becomes the decision metric. A 20% increase in memory, even if backed up by a smaller number of buses, heavily affects the decoder area occupation, ruling out the solution represented by the thick dashed line. Among the remaining three combinations, the one that makes use of six buses yields a higher area occupation than the others. Since with the thin dashed curve crosses one of the lower markers, the final choice falls on , ,
, and . Given that , and consequently is not an integer number, every bus will be exclusively connected to four nodes, while the reconfiguration of the remaining two nodes will be shared among all five buses. The impact of the reconfiguration process on the decoder area is addressed in Section VIII, whereas a set of BER simulations has been performed to evaluate the impact of different , on WiFi, DVB-RCS, WiMAX, CMMB, DTMB, 3GPP-LTE, and HPAV codes. Considering the worst case for each tested standard (i.e., the largest block length, the most unfavorable throughput/code rate ratios), the reconfiguration probability can be expressed as the probability for each incoming frame to request a code change, computed as the channel changing frequency over the number of coded frames received in a second: (25) where is the maximum throughput required by the standard for the and code choices. The reconfiguration probability ranges between 0.25% and 0.3% in the presence of the fast moving receiver, while it remains under 0.15% in the other case. Simulation results show that the BER penalty is negligible as long as , with the average number of iterations performed before a correct codeword is obtained and is the next highest integer value. Fig. 8 shows the BER curves obtained with and , in the pessimistic assumption that a reconfiguration requiring always occurs with . As can be observed, the difference between the case when reconfiguration occurs (solid lines) and the no-reconfiguration case (dashed lines) is completely negligible.
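The reasoning behind the reconfiguration probability can be sketched numerically. The fast-fading model of Section IV gives the channel changing frequency as speed over carrier wavelength, and the probability is that frequency divided by the number of frames received per second. The function names are hypothetical, and the frame rate is assumed to be throughput divided by frame length expressed in the same kind of bits. No numerical value from this work is reproduced here.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def channel_change_frequency(speed_m_s: float, carrier_hz: float) -> float:
    """Channel changes per second: one change per wavelength travelled."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return speed_m_s / wavelength


def reconfiguration_probability(change_hz: float, throughput_bps: float,
                                frame_bits: int) -> float:
    """Probability that an incoming frame requests a code change.

    Assumes throughput and frame length are expressed in the same kind
    of bits (coded or uncoded), so that their ratio is frames per second.
    """
    frames_per_second = throughput_bps / frame_bits
    return change_hz / frames_per_second
```

Longer frames and lower throughput both reduce the number of frames per second, so they raise the probability that any given frame coincides with a channel change.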
VI. DECODING CORES
The design of the decoding cores must yield the same degree of flexibility as the NoC, being as independent as possible of the set of supported codes. In [14] a completely serial LDPC decoding core has been designed, mostly independent of block length and code rate: an arbitrary number of CN operations can be scheduled on it. The same holds true for the serial SISO, where different windows can be scheduled, regardless of the size of the interleaver.
This work stems from the results presented in [14], improving the architectures through novel memory scheduling and addressing methods, reduced latency and simpler control. As shown in [36], sharing the datapath of a min-sum based decoder architecture with a log-MAP SISO does not provide significant advantages. As a consequence, in this work logic sharing is not addressed. Experimental results show that the area of the architecture is indeed dominated by memories.
A. Quantization and Memory Organization
Memory organization evolves from the idea presented in [14], in which two memories are instantiated in every decoding core: a 7-bit memory and a 5-bit memory. Their usage is shown in the left part of Fig. 9: LDPC VN-to-CN values are stored in the 7-bit memory, together with turbo extrinsic information and state metrics. The 5-bit memory is instead used for CN-to-VN values in LDPC decoding, while storing the intrinsic channel information in turbo decoding. The memories are sized to the largest WiMAX codes ( , for LDPC and for turbo). However, according to post-layout synthesis results, memory access multiplexers suffer from excessive area overhead for these particular cuts. To mitigate this problem and at the same time reduce the overall memory area occupation, a novel memory organization technique is proposed, as shown in the rightmost part of Fig. 9. Different colors highlight different metrics, while black-striped parts are unused.
Extensive simulations of WiFi, WiMAX, CMMB, and DTMB have shown that, in LDPC decoding, and channel LLR quantization can be reduced from 7 to 6 bits without significant performance degradation. Fig. 10 shows the BER curves for some WiMAX, DTMB and WiFi LDPC codes with the two quantization choices: the difference is smaller than 0.05 dB for all rates of medium and large code sizes. On the same graph, yielding similar results, a few turbo code examples (WiMAX and HPAV) are plotted, in which and the channel LLR representation changes from 7 to 6 bits, and from 5 to 4 bits (the meaning of will be detailed in Section VI-C1). Also for turbo codes, the performance loss introduced by the proposed quantization change is almost negligible. Very small codes, such as the ones that can be found in 3GPP-LTE and WiMAX, suffer more from the quantization reduction (Fig. 10). Curves obtained with floating point precision show improvements between 0.1 and 0.2 dB w.r.t. the selected precisions. Thanks to these changes, a single 6-bit wide memory is instantiated, in which both and values are saved. Storing all the values requires locations in each decoding core. However, with the normalized min-sum algorithm the number of necessary bits can be reduced by 21.2% by changing the addressing mode as follows. For every CN in the matrix metrics are updated. Since can take only two possible values, for each CN we can memorize magnitudes, and 2-bit indexes that identify the correct magnitude and its sign. The sizing of the 6-bit memory is determined by the Double Binary Turbo Code (DBTC) decoding mode, since it must store and values. To limit the area overhead and speed up the loading process, four values are stored in three memory locations, as portrayed in the left part of Fig. 11. Three 4-bit are stored in three 6-bit locations: the remaining metric can be divided in two pairs of bits, and stored in the leftover locations.
Three clock cycles are used to read the four values for a trellis step with minimal logic overhead. In case of single binary turbo codes (SBTC), like those used in 3GPP-LTE, only two and one are necessary for a trellis step, and they can be read in two clock cycles without impairing the throughput. With a similar method the 2-bit memory is used in turbo decoding mode to store and between iterations, as suggested in [34] . Six locations are used to store 2 or (Fig. 11 ): since at most three 8-state windows initialization metrics, i.e., 24 and 24 , are stored at the same time, only 144 out of 400 locations are used. Multiple memory accesses are necessary to read a single value: the issue is handled with appropriate scheduling (see Section VI-C.2) and does not affect the throughput.
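The packing of four 4-bit values into three 6-bit locations can be sketched as follows. The exact bit layout is an assumption made for illustration (the fourth value is split into two 2-bit halves placed in the two upper bits of the first two words), and values are treated as unsigned bit patterns.

```python
from typing import List


def pack4(values: List[int]) -> List[int]:
    """Pack four 4-bit values into three 6-bit words (hypothetical layout)."""
    a, b, c, d = values
    assert all(0 <= v < 16 for v in values)
    d_hi, d_lo = d >> 2, d & 0b11          # split the fourth value in two pairs
    return [(d_hi << 4) | a, (d_lo << 4) | b, c]


def unpack4(words: List[int]) -> List[int]:
    """Recover the four 4-bit values from three 6-bit words."""
    w0, w1, w2 = words
    assert all(0 <= w < 64 for w in words)
    d = ((w0 >> 4) << 2) | (w1 >> 4)
    return [w0 & 0xF, w1 & 0xF, w2 & 0xF, d]
```

Reading the three words in consecutive clock cycles yields all four values, with only a few wires of recombination logic for the split one.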
B. LDPC Decoding Core
The LDPC decoding core used in the decoder described in [14] relies on a serial architecture suited for exclusive memory usage. The main drawback of this solution is the variable number of cycles to produce the output. The average number of cycles-per-data varies between one and two. To overcome this limitation and to share the memory with the SISO a novel architecture with limited area overhead is proposed.
1) Architecture:
The LDPC decoding core is detailed in Fig. 12(a): this architecture supports all kinds of LDPC codes, as long as the memory requirements are met.
A value (1) is produced at every clock cycle and fed to the minimum extraction unit (MEU) depicted in Fig. 13. Then, is compared to the current first and second minimum (min1 and min2), which are initialized to the maximum allowed value at the beginning of each CN phase. The minimum of both comparisons ( and ) is passed on and sampled on the rising edge of the clock signal, together with the previous first minimum and a flag signaling if . If , is substituted with : min1 and min2 are finally updated on the falling edge of the clock, ready and stable for the next . Unlike the MEU used in [14], which could halt the pipeline when both min1 and min2 had to be updated, the negative-edge triggered registers allow both updates in a single clock cycle, leading to a constant cycles-per-data rate close to one, after the initial cycles latency. Concurrently, signs are XORed as in (3). Once min1 and min2 have been successfully extracted, they are compared to all the of the CN, which are delayed by a number of clock cycles equal to the degree of the CN , to compute as in (4). The CMP unit handles the comparison and produces the two flags (sign and identification) to be stored in the index memory. The correction factor in (6) is applied before the final addition in (5) and are sent to the output buffer.
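The behaviour of the MEU and of the subsequent comparison stage can be summarized by a short software model. The following Python sketch is only a functional illustration of a normalized min-sum check node update, not a description of the hardware of Fig. 13: the function name `check_node_update`, the 5-bit magnitude ceiling `max_value=31` and the 3/4 normalization factor are assumptions introduced here, and the equations (1)-(6) of the paper are not reproduced.

```python
def check_node_update(magnitudes, signs, max_value=31):
    """Functional model of min1/min2 extraction for one check node.

    magnitudes: non-negative integers, one per edge of the CN
    signs:      0 (positive) or 1 (negative), one per edge
    Returns (min1, min2, outputs), where outputs is a list of
    (scaled_magnitude, sign) pairs, one per edge.
    """
    # min1 and min2 start at the maximum allowed value, as in the text.
    min1 = min2 = max_value
    min1_index = -1
    sign_xor = 0
    for i, (m, s) in enumerate(zip(magnitudes, signs)):
        if m < min1:
            # New first minimum: the old one becomes the second minimum,
            # so both registers change in the same step.
            min2, min1, min1_index = min1, m, i
        elif m < min2:
            min2 = m
        sign_xor ^= s  # running XOR of all signs

    outputs = []
    for i, s in enumerate(signs):
        # The edge that supplied min1 receives min2, all others min1.
        mag = min2 if i == min1_index else min1
        scaled = (3 * mag) // 4  # assumed normalization factor of 0.75
        outputs.append((scaled, sign_xor ^ s))  # exclude the edge's own sign
    return min1, min2, outputs


print(check_node_update([5, 2, 7, 9], [0, 1, 1, 1]))
```

Because every outgoing magnitude is either min1 or min2, storing the two magnitudes per CN plus a small index per edge is sufficient, which is the idea behind the memory reduction discussed in Section VI-A.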
The length of the delay lines used for , magnitudes and indexes is initialized by the control unit to .
2) Memory Scheduling:
Both 6-bit and 2-bit memories are implemented as dual port RAMs, allowing two concurrent operations. At iteration , for the -th CN, two clock cycles are devoted to writing on port 1 of the 6-bit memory. This allows the storage of the two magnitudes of CN and computed during the current iteration. Port 2, instead, is set to read mode, loading the two magnitudes of CN stored during the previous iteration. In the 2-bit memory port 1 is always in write mode, storing indexes as soon as they are computed, while port 2 is constantly in read mode. During this first phase, though, no data is loaded.
The second phase of the scheduling lasts for cycles. The ports on the 6-bit memory switch functionality: port 1 is used to store incoming from the network, while port 2 is used to read values of CN . The 2-bit memory is enabled, loading the indexes of CN and storing indexes of CN .
C. Turbo Decoding Core
Like the LDPC decoding core, the SISO core offers a very high degree of flexibility, limited only by the size of the memories: any double-binary turbo code can be decoded as long as the memory capacity is sufficient.
1) Architecture:
Fig. 12(b) portrays the designed architecture. The SISO interfaces with the NoC via two dedicated input and output blocks, respectively called bit-to-symbol conversion unit (BTS-CU) and symbol-to-bit conversion unit (STB-CU). According to [37], symbol-level (SL) information in double-binary codes can be approximated from bit-level (BL) extrinsics, with a limited BER loss and an average NoC complexity reduction of 1/3. The BTS-CU changes the received a priori BL to SL , as required by the algorithm, while the STB-CU reduces the number of messages to be sent on the NoC by converting extrinsic values into BL . For every trellis step in a window, the branch metric unit (BMU) and the Extrinsic Computation Unit combine the two converted by the BTS-CU with four values read from the 6-bit memory to calculate (11) and to update , respectively. The output of the BMU is used by the main computation unit, which handles the calculation of , and (8), (9) and (10). These metrics are computed in this exact order, thus storing values in a dedicated set of registers while are being processed: the metric, that needs both and , is calculated last.
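The role of the two conversion units can be illustrated with a small numerical model. The Python sketch below uses one common formulation, which is an assumption and not necessarily the exact approximation of [37]: bit-level log-likelihood values for the two bits of a symbol are combined additively into symbol-level metrics referred to symbol 00, and the reverse conversion uses max-log marginalization. The function names `bits_to_symbol` and `symbol_to_bits` are hypothetical.

```python
def bits_to_symbol(l_a, l_b):
    """Bit-level to symbol-level conversion (assumed additive form).

    Returns metrics for the symbols 00, 01, 10, 11, with symbol 00
    used as the zero reference.
    """
    return [0, l_b, l_a, l_a + l_b]


def symbol_to_bits(symbol_metrics):
    """Symbol-level to bit-level conversion via max-log marginalization."""
    l00, l01, l10, l11 = symbol_metrics
    # Bit A is 1 for symbols 10 and 11, bit B is 1 for symbols 01 and 11.
    l_a = max(l10, l11) - max(l00, l01)
    l_b = max(l01, l11) - max(l00, l10)
    return l_a, l_b


symbol_metrics = bits_to_symbol(3, -2)
print(symbol_metrics)
print(symbol_to_bits(symbol_metrics))
```

Under this formulation, two bit-level values travel on the network in place of three non-reference symbol-level metrics, which is consistent with the 1/3 reduction quoted above.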
2) Memory Scheduling:
In turbo mode, each trellis step requires three clock cycles to be completed. However, up to five cycles are needed to read all the necessary and . Early simulation results presented in [14] show that the SISO working frequency can be lower than the NoC's one . By timing the memories with the faster clock signal, six values (five memory locations) can be read from port 1 of the 6-bit memory in three SISO-cycles. Port 2 is kept in write mode for the duration of the decoding, and used to store values coming from the network.
The 2-bit memory is used in the same way, with port 1 in read mode and port 2 in write mode. At the beginning of every new window 16 values are needed (8 and 8 ) from this memory to initialize the trellis. Due to the memory organization used for state metrics, see Fig. 11 , one and one are spread over 7 memory locations. Consequently, clock cycles are necessary to load the 16 values. The values must be loaded from the memory before the window is processed. Thus, they are loaded during the processing of the previous window. Since every window is composed of at least 20 trellis steps, requiring clock cycles to be executed, there is enough time to load and values to initialize the next window.
VII. SUPPORTED STANDARDS
The 22-node architecture presented in this work has been tested on a large set of communication standards. In particular, the whole set of turbo and LDPC codes included in [1]-[6] has been tested.
As explained in the previous section, if is sufficiently smaller than , communication time between PEs is negligible. Taking into account the presented 22-node architecture, the maximum ratio for which this assumption holds is 2/3 for LDPC codes and SBTC, while 3/5 is necessary for DBTC. The maximum number of iterations has been set to 10 for LDPC codes, and to 8 for turbo codes.
Every standard has different throughput requirements: both and can be adjusted accordingly.
• CMMB and DTMB: the CMMB [3] and DTMB [4] Chinese broadcast standards, though serving the same purposes as DVB, work with smaller LDPC codes. As in DVB, CMMB codes feature double diagonal submatrices, slightly limiting the number of row nodes that can be concurrently instantiated on the proposed decoder. Both CMMB and DTMB codes demand an increased memory capacity with respect to the aforementioned standards, requiring PE memories to be enlarged by 55% to support CMMB, and by 68% for DTMB. A working frequency is sufficient to guarantee the 20.22 Mb/s throughput required by the CMMB standard, while to comply with the 40.6 Mb/s of DTMB, the frequency must be raised to , as shown in Table V.
• 3GPP-LTE: the LTE version of 3GPP [7] uses a set of 188 SBTC with coding rate 1/3, thus being characterized by a range of widely spaced block lengths. The required 150 Mb/s throughput can be obtained on the 22-node architecture with ; however, if we consider the extended 35-node architecture mentioned for the WiFi standard, compliance with the throughput requirement is met at . This standard requires an additional 41% memory capacity w.r.t. the WiMAX, WiFi, DVB-RCS and HPAV standards, but can be fully supported by the CMMB and DTMB memory sizing.
Table IV summarizes possible switching among the selected standards, taking into account all possible code combinations. The dark gray cells represent the percentages of , combinations between two standards whose reconfiguration requires pausing of the decoder. A few cases arise between DVB-RCS and WiMAX turbo codes and within 3GPP-LTE (due to its wide variety of codes), while when belongs to the CMMB, DTMB and LTE standards, it is more likely to encounter a critical combination. On the contrary, if belongs to CMMB or DTMB standards, any reconfiguration can be completed with : this is also the most common situation among the other standards.
The choice of maximum allows to handle all 
VIII. IMPLEMENTATION RESULTS
The results presented in Section VII show a broad range of possibilities for implementation, and the designed decoder can be scaled with very low effort. Three different complete decoders have been synthesized with TSMC 90 nm CMOS technology: post-layout results have been obtained for all of them, with accurate functional verification, area and power estimation. Synthesis has been carried out with Synopsys Design Compiler, functional simulation with Mentor Graphics ModelSim, and place and route with Cadence SoC Encounter [38].
A. Implementation A
The first decoder implementation has been devised to fully support WiMAX, HPAV and DVB-RCS standards. The memory sizing and organization described in Section VI-A is able to handle the addressed standards with 22 PEs. To comply with each standard's throughput requirements, a single is sufficient in both LDPC and turbo mode, consequently identifying and , both under the constraint. Obtained throughput is presented in Table VI.
Each reconfiguration bus is 18 bits wide: 3 bits are the node identifier, used to address one of the connected decoding cores, 5 bits are assigned to the node degree or window size information, and the remaining 10 bits carry the . These design choices have led to an overall area of 2.75 mm² after place and route, taking into account the additional reconfiguration hardware as well. The logic of the SISO cores occupies 15% of the overall area, while the LDPC cores occupy 11%. Core memories account for another 53%, while the NoC, together with the reconfiguration buses and additional logic, constitutes the remaining 21%. This area overhead is due to two specific functionalities that have been introduced in the proposed decoder: (i) full flexibility in terms of supported turbo and LDPC codes and (ii) dynamic reconfiguration between different standards.
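The 18-bit reconfiguration word can be modeled as three bit fields. The Python sketch below is illustrative only: the field widths (3, 5 and 10 bits) come from the text, whereas the field ordering, the names `pack_bus_word` and `unpack_bus_word`, and the generic name `payload` for the 10-bit field are assumptions.

```python
NODE_BITS = 3      # node identifier, addresses one of the connected cores
INFO_BITS = 5      # node degree or window size
PAYLOAD_BITS = 10  # remaining 10 bits (contents named generically here)


def pack_bus_word(node_id, info, payload):
    """Assemble an 18-bit word as [node_id | info | payload] (assumed order)."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 3 bits")
    if not 0 <= info < (1 << INFO_BITS):
        raise ValueError("info does not fit in 5 bits")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload does not fit in 10 bits")
    return (node_id << (INFO_BITS + PAYLOAD_BITS)) | (info << PAYLOAD_BITS) | payload


def unpack_bus_word(word):
    """Split an 18-bit word back into (node_id, info, payload)."""
    payload = word & ((1 << PAYLOAD_BITS) - 1)
    info = (word >> PAYLOAD_BITS) & ((1 << INFO_BITS) - 1)
    node_id = word >> (INFO_BITS + PAYLOAD_BITS)
    return node_id, info, payload


word = pack_bus_word(5, 20, 513)
print(word, unpack_bus_word(word))
```

A 3-bit identifier can address up to eight decoding cores per bus, and a 5-bit field can express node degrees or window sizes up to 31.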
Estimated power consumption, based on the switching activity in case of WiMAX LDPC code , (for ease of comparison with the state of the art) is 87.8 mW; for WiMAX turbo code with estimated power is 101.5 mW.
A screenshot of the final layout is portrayed in Fig. 14 : the irregularity of the placement is due to the large number of memories and their complex interconnections. However, two different areas can be easily identified: a central zone in which most of the logic is found (black contour), and a border area where the 
B. Implementation B
The second implementation presented extends the set of standards supported by implementation A to WiFi LDPC codes and 3GPP-LTE turbo codes. To limit the complexity of off-chip clock generators, also in this case a single NoC working frequency has been chosen, , while is necessary to provide high enough throughput, can remain set to 200 MHz. The parallelism of the NoC is increased from 22 nodes to 35 nodes, the reconfiguration buses rise from 5 to 8, and the support of LTE requires an increase in the size of 6-bit memories. Throughput results are reported in Table VI 
C. Implementation C
This third implementation extends implementation A's support to CMMB and DTMB. Neither frequency nor NoC parallelism modification is necessary, but the core and reconfiguration memories must be enlarged. Consequently, an extra bit is added to the reconfiguration bus data width. The new post place & route estimated area is 3.42 mm², while power reaches 120 mW for both tested turbo and LDPC codes. This is because the LDPC consumption is calculated on a DTMB code, which makes full use of the extended memories, while the memory usage percentage for DBTC remains low. The enlarged memories also allow LTE codes to be decoded, but the SBTC would need to rise to 333 MHz to meet the throughput requirements. Throughput results for CMMB and DTMB are shown in the Implementation C column of Table VI. Table VIII shows the detailed implementation results in comparison with the state of the art flexible turbo/LDPC decoders. Even though A, B, and C are the only decoders capable of dynamic switching, area, power and efficiency figures prove the effectiveness of this approach.
D. Comparisons
In order to make a fair comparison, normalized area occupation has been included in the Table, , where is the total area and Tp is the technology process, together with throughput and power consumption. Moreover, , where Pow is the peak power consumption, expressing the energy spent for decoded bit, and the area efficiency , reported in Table VII , an efficiency figure that considers both throughput and area occupation.
Baghdadi et al. [11] propose an ASIP decoder architecture supporting WiMAX and WiFi LDPC codes, and WiMAX, 3GPP-LTE and DVB-RCS turbo codes. The A, B and C implementations are designed such that the minimum throughput is sufficient to comply with the supported standards. On the contrary, the worst case throughput in [11] is not high enough for WiMAX. Comparison reveals similar area occupations, but very different frequencies. This leads to a better area efficiency in all three proposed implementations for most of the codes: the difference is particularly evident for DBTC (second-to-last row of Table VII).
The work presented in [9] supports convolutional, LDPC and turbo codes, giving results for WiMAX LDPC, WiFi and general binary and double-binary turbo codes. It yields a very small area occupation with low power consumption and good maximum throughput for LDPC decoding. On the contrary, it features less interesting figures in turbo mode. This situation is reflected both on and , with Implementation A, B, and C having, when comparing the same codes, better efficiencies in turbo mode (last row of Table VII), and worse in LDPC mode. However, under the worst case conditions ( , , 20 iterations), A and B outperform [9] also in LDPC mode.
The multi-standard decoder designed in [12] supports 3GPP-HSDPA, WiFi, WiMAX, and DVB-SH. No specific information on the codes used is given, only the minimum guaranteed throughput: for this reason, results in Table VII refer to the minimum throughput of each standard. Implementations A and B have comparable minimum when working with WiMAX LDPC codes, and A, B, and C yield much better results in turbo mode. When comparing WiFi results, [12] guarantees a higher than A, B and C, even though aiming for a lower throughput than B. All three proposed implementations yield better , and both A and C have a smaller area occupation.
Sun and Cavallaro describe in [13] a decoder working with 3GPP-LTE turbo codes and WiMAX and WiFi LDPC codes. They obtain very high maximum throughput efficiency in both LDPC and turbo mode: the range of supported codes is however quite limited w.r.t. all considered implementations, and the area occupation is larger than A. Since no power analysis is given, comparison based on is impossible, although the difference in working frequencies would suggest a smaller power consumption for at least A and C.
IX. CONCLUSIONS
This work describes a flexible turbo/LDPC decoder architecture able to fully support a wide range of modern communication standards. A complete analysis of the never previously addressed inter- and intra-standard reconfiguration issue is presented, together with a dedicated reconfiguration technique that limits the complexity overhead and performance loss. Three different implementations are proposed to cover different sets of standards. Full layout design has been completed to provide accurate area and power figures. Comparison of the proposed architectures with the state of the art shows very good efficiency, competitive area occupation and an unmatched degree of flexibility.
