Abstract
Parallel Low-Density Parity-Check and turbo code decoding consists of iterative processes that rely on the exchange of messages among multiple processing elements (PEs). They are characterized by complex communication patterns that require area-expensive interconnect and memory management. Channel decoders based on Networks-on-Chip (NoCs) have been proposed in the literature, showing unmatched degrees of flexibility, but yielding high area occupation and power consumption. While general and application-specific power reduction techniques are available to save energy, the gap with respect to dedicated decoders is still large. This paper proposes techniques that reduce and optimize the traffic on the network for NoC-based channel decoders, and can be applied to any NoC architecture. The proposed techniques exploit the probabilistic nature and the processing order of the exchanged messages in
Reducing the dissipated energy in multi-standard turbo and LDPC decoders

solutions. Lately, the original idea of general purpose NoCs has evolved and new kinds of NoCs are today proposed with features fully optimized for a single or reduced number of applications (Application Specific NoCs, or ASNoCs, [11]). LDPC and turbo decoders based on highly specialized ASNoCs have been proposed, providing a high degree of flexibility [14, 21, 22, 27, 29, 30, 32]. These implementations support multiple standards and code types, with dynamic switching capabilities between codes.
In general, ASNoCs proposed for channel decoding are intra-IP NoCs that connect homogeneous PEs; in order to reduce occupied area and dissipated power, they are also characterized by lower complexity and limited capabilities with respect to NoCs usually reported for other applications. For example, it is shown in [27] that the best choices in terms of NoC topology and routing algorithm for a NoC-based turbo decoder are the Kautz topology, which is less regular than the usual 2D mesh but guarantees shorter delivery time, and the shortest path algorithm, which can be implemented with very low complexity.
In spite of these choices, the NoC is responsible for a significant efficiency gap between NoC-based highly flexible decoders and dedicated or partially flexible solutions, such as the implementations in [33, 35]: the NoC guarantees virtual connectivity among all nodes and great flexibility, but packet latencies and intermediate storage, along with routing logic and memories, increase power consumption and decrease throughput with respect to dedicated and partially flexible decoders.
General power reduction techniques, like clock gating and dynamic voltage scaling, can be applied to channel decoders. Additionally, specific features of LDPC and turbo decoding can be exploited to reduce power dissipation.
As an example, since LDPC and turbo decoding are iterative processes, early-stopping criteria for the iterations have been proposed over the years [23, 25]: these techniques rely on the observation of a metric to decide whether additional iterations are worth performing, avoiding unnecessary energy consumption.
The usage of power reduction techniques is in many cases very effective; however, since they can be introduced in any decoding architecture, the power consumption gap between NoC-based decoders and dedicated architectures is still large. This paper proposes new power reduction techniques for flexible channel decoders. These techniques reduce and optimize the traffic due to messages exchanged among PEs, which account for a significant percentage of the consumed energy. Therefore, the proposed methods are particularly effective with NoC-based decoders. Preliminary contributions in the direction of traffic reduction have been made in [31] for turbo codes, and in [15] for LDPC codes, both dealing with the evaluation of the usefulness of exchanged information: this work refines and extends them, while proposing new traffic optimization techniques. The performance of the proposed techniques has been extensively evaluated and compared to alternative methods, while the most promising one has been implemented as an application example on a fully characterized decoder [14]. Area overhead and power gain have been obtained and compared to the state of the art to evaluate the effectiveness of the solution. Both multi-standard decoders [6, 19] and single-standard, optimized implementations [16, 34] are taken into account.
The comparisons show how the proposed solutions greatly improve the energy efficiency of NoC-based decoders and help to reduce the gap between flexible and dedicated implementations. For example, the proposed multi-standard LDPC/turbo decoder consumes 99.2 mW, against the 51.6 mW of the decoder in [16], even though the latter supports WiMAX LDPC codes only and is optimized for ultra-low-power operation.
The rest of the paper is organized as follows: Section 2 introduces LDPC and turbo decoding and analyzes the problems arising in parallel implementations.

NoC-based decoding: practical issues
LDPC codes [18] are described by a sparse parity-check matrix H with M rows and N columns.
Each word x that satisfies H · x = 0 is considered a valid codeword. LDPC decoding is based on the Tanner graph representation of H, composed of Variable Nodes (VNs, the columns of H) and Check Nodes (CNs, the rows of H).
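As a minimal illustration of the codeword condition and of the Tanner graph, the sketch below checks H · x = 0 over GF(2) and lists the VNs attached to each CN. The matrix used is the (7,4) Hamming parity-check matrix, chosen here only as a hypothetical toy example: it is neither sparse nor taken from any of the standards considered in this paper, and the function names are illustrative.

```python
from typing import Dict, List

# Hypothetical toy parity-check matrix (the (7,4) Hamming code), M = 3 rows, N = 7 columns.
H: List[List[int]] = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def syndrome(h: List[List[int]], x: List[int]) -> List[int]:
    """Compute H . x over GF(2): one parity bit per check node (row of H)."""
    return [sum(hij & xj for hij, xj in zip(row, x)) % 2 for row in h]


def is_codeword(h: List[List[int]], x: List[int]) -> bool:
    """A word is a valid codeword when every parity check is satisfied."""
    return not any(syndrome(h, x))


def tanner_graph(h: List[List[int]]) -> Dict[int, List[int]]:
    """Map each check node (row index) to the variable nodes (column indices) it connects to."""
    return {cn: [vn for vn, bit in enumerate(row) if bit] for cn, row in enumerate(h)}
```

Flipping a single bit of a valid word produces a non-zero syndrome, which is the condition an iterative decoder tries to remove.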
The Belief Propagation (BP) algorithm is the most common algorithm for LDPC decoding, especially with the efficient layered scheduling [20]. In a layered decoder, parity-check constraints are grouped in layers of unconnected H rows: extrinsic information is passed from one layer to a subsequent one [20].

Convolutional turbo codes are typically specified as the parallel concatenation of two Convolutional Code (CC) encoders. The decoder is consequently made of two different Soft-In-Soft-Out (SISO) decoders, linked by an interleaver $\Pi$ and a de-interleaver $\Pi^{-1}$. Each SISO decoder relies on the BCJR algorithm [8]. With each CC represented with a trellis, let k be a trellis step and u an uncoded symbol. Each decoder computes

where λ

In parallel decoders, the decoding process of the received frame is typically partitioned among P PEs. Messages are exchanged among PEs by means of an interconnection structure, which is usually deterministic and guarantees fixed and uniform latency. To increase the degree of flexibility of the decoder, recent works have proposed NoCs as interconnection structures [14, 21, 27, 30].
Different techniques are available in the state of the art to partition the received frame and the decoding process among PEs: they all rely on the assignment of a set of variable nodes (in the LDPC case) or trellis steps (in the turbo case) to each processing element. Even though random partitioning has been proven to grant acceptable results, a graph coloring approach that minimizes the inter-PE communication has been employed in this case [15]. Fig. 1 shows the basic structure of a NoC-based decoder.

In order to guarantee NDT < 1 for the totality of messages, either the throughput must be reduced (which means reducing the PE clock frequency) or the NoC clock frequency must be increased (without altering the PE clock frequency) [14].
An ideal situation for a decoder is shown in Fig. 2, which plots the number of messages with respect to the NDT for a WiMAX code of block size 2304 and rate 1/2, decoded with a 16-PE decoder mapped on a Kautz network: in this decoder, to have NDT < 1 for all messages, the NoC frequency is 420 MHz, while the PEs only run at 280 MHz.
Traffic reduction and optimization
Reducing the number of messages traveling on the NoC is bound to speed up the delivery of those remaining, with shorter queues and fewer collisions.
Two techniques have been devised and tested towards these goals: they are based on the general concept that not all information messages (which are of a probabilistic nature) traveling on the network are essential for the success of the decoding. These two techniques (Sections 3.1 and 3.2) can be used either alone or combined. Two additional methods to optimize the NoC traffic are described in Section 3.3.
Hard importance
The Hard Importance (HI) method makes it possible to refrain from sending messages that are estimated to have a low impact on the decoding process. A similar approach was considered in [31] and [15]. In the LDPC decoding case, the messages traveling on the NoC are different updates of $\lambda_k[c]$. The HI checks are performed once per iteration on each of them. Consider the following comparisons:

where n denotes the n-th iteration and max(λ k

For turbo decoding, the choice of whether or not to stop a message can be made by modifying the Symbol Reliability Difference (SRD) criterion proposed in [31].
as the difference between the logarithmic extrinsic probabilities of the first and second most probable symbols, the original SRD criterion proposes, for each
where m ′ refers to the metrics at the input of the SISO, coming from the previous half-iteration, and m at the output. If condition
is satisfied, the message can be stopped. Applying this method as is, however, led to unsatisfactory results. This is because it does not take into account that (7) could give very low φ
expresses just uncertainty about symbols, instead of agreement between SISOs, and the message should not be stopped.
For this reason, an additional control has been added for the message to be stopped:
where $\mathrm{Thr}^{HI}_{Diff}$ assures a degree of reliability on the symbol.
The performance of HI is of great interest, since this method can also be applied to other types of decoders besides NoC-based ones. HI acts on messages by deciding whether it is worth updating a value or not, and can be effective also in the absence of a NoC. It can consequently be exploited in decoders that rely on shared memory banks: in this case, energy is saved by reducing the number of memory write operations.
Soft importance
The Soft Importance (SI) technique evaluates the state and the evolution of the information exchanged through the NoC, flagging non-essential messages as expendable. In case of collisions, i.e. messages that need to be routed through the same output port, the router arbiter will forward a message and discard all the expendable messages that were not granted priority. In LDPC decoding, (3)-(5)

Priority-based routing is a well-explored path to guarantee QoS: multiple virtual channels are often assigned different priorities to differentiate traffic flows [17, 26]. The concept of priority is applied here in an original way, by using a single channel with a reordering buffer.
The Urgency technique (U) implements a priority-based collision management policy. In case of collision, priority is given to the most urgent message, i.e. the message which is needed soonest by its destination PE. To allow this kind of decision, an urgency field must be added to the message during sending, initialized with an estimate of the number of clock cycles available before the message is needed by another PE. The field must be updated by the routers, taking into account the wait cycles spent in input buffers, and a message is discarded if its urgency reaches zero, avoiding unnecessary switching activity for late messages. In the LDPC case, each PE can perform an estimation based on local knowledge of the instant in which the outgoing message is going to be needed. The precision of the estimate strongly depends on the regularity of the partitioning of the H matrix among the PEs. On the contrary, in the turbo case, since the interleaving rule is known to all PEs, the measure can be exact.
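The urgency policy described above can be sketched in software as follows. This is only a behavioural illustration under stated assumptions, not the hardware arbiter: the `Message` type, the function names, and the tie-break on the lowest port index are all hypothetical. Urgency is counted in remaining clock cycles, so a smaller value means a more urgent message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    payload: int
    urgency: int  # estimated clock cycles left before the destination PE needs the message


def age(buffer: List[Message]) -> List[Message]:
    """Model one wait cycle in an input buffer.

    Every waiting message loses one cycle of urgency; messages whose urgency
    reaches zero are late and are discarded instead of being forwarded.
    """
    aged = []
    for msg in buffer:
        remaining = msg.urgency - 1
        if remaining > 0:
            aged.append(Message(msg.payload, remaining))
    return aged


def arbitrate(heads: List[Optional[Message]]) -> Optional[int]:
    """Resolve a collision among the head messages of several input ports.

    Returns the index of the port holding the most urgent message (smallest
    urgency value), or None when no port has a message. Ties go to the lowest
    port index, which is an assumption of this sketch.
    """
    candidates = [(msg.urgency, port) for port, msg in enumerate(heads) if msg is not None]
    if not candidates:
        return None
    return min(candidates)[1]
```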
In Buffer Reordering (BR), a fast lane can be created by arranging the messages in the input buffers not in arrival order, but according to the urgency field. The most urgent message in a buffer will consequently always be the first one to be pulled out, increasing its chances of arriving on time.
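The effect of buffer reordering can be illustrated with a small priority queue. The sketch below is a software stand-in built on Python's `heapq`, not the register-based structure used in hardware; the class name, its methods and the capacity handling are assumptions made for illustration. Messages with equal urgency leave in arrival order.

```python
import heapq
from itertools import count
from typing import List, Tuple


class ReorderingBuffer:
    """Input buffer that releases the most urgent message first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._heap: List[Tuple[int, int, object]] = []
        self._arrival = count()  # arrival index, used only to break urgency ties

    def push(self, urgency: int, payload: object) -> bool:
        """Store a message; return False when the buffer is already full."""
        if len(self._heap) >= self.capacity:
            return False
        heapq.heappush(self._heap, (urgency, next(self._arrival), payload))
        return True

    def pop(self) -> Tuple[int, object]:
        """Remove and return (urgency, payload) of the most urgent message."""
        urgency, _, payload = heapq.heappop(self._heap)
        return urgency, payload

    def __len__(self) -> int:
        return len(self._heap)
```

With a plain FIFO the first message pushed would leave first; here a late-arriving but more urgent message overtakes the others.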
Performance results
The impact of each of the proposed techniques, alone and in combination with one another, has been evaluated with the JANoCS tool [12]. Extensive simulations have considered a wide range of codes, NoC topologies, numbers of PEs, routing algorithms, and PE and RE architectures.

Fig. 7, where the impact of HI on BER for an LDPC code and a turbo code is presented. The percentages of stopped messages assumed in these simulations are substantial (17% and 18% respectively), but both the LDPC and the turbo code show negligible performance losses. In the plots of

$\mathrm{Thr}^{SI}_{Diff} = 0.05$. These thresholds are also used to derive the "Ideal THR" curves in Figs. 8 and 9, which show how variations in the threshold values affect the BER performance.
The results given by the urgency (U) method alone are not satisfactory: since its effects are mostly appreciated in case of collisions, which can also involve non-critical messages, its effectiveness alone is limited. However, as soon as

Table 1 it is possible to make some important observations. As expected from previous analysis [13, 27]

Table 1: similar effects are encountered with larger codes when mapped on the 32-PE NoCs. On the contrary, queues and collisions are the main sources of delay in case of large codes mapped on small NoCs. These limit cases (e.g. LDPC 576, rate 5/6 in Table 1) cannot be completely solved by the proposed traffic handling techniques alone, which need to work in conjunction with alternative techniques: for example, the code can be mapped on a smaller portion of the NoC, and the unused part of the decoder can be deactivated to save energy.
Hardware architecture
The similarity of the calculations involved in HI and SI, and the necessity for controls at each RE for SI, U and BR allow for efficient resource sharing. The multi-mode decoder described in [14] has been taken as reference architecture: among the different implementations, the one denoted A in the paper has been

The HI method can be easily implemented in the SISO. At the beginning of each trellis step one or more read operations from the

At the end of the trellis steps, also the data needed to calculate δ (7), (8) and (9).
A memory bit is required for each trellis step to signal if the outgoing messages are unimportant and must not be sent. For LDPC codes, since the considered metrics are λ message is not sent anymore. The STOPPING message is mapped to the lowest negative value that can be represented with the allocated number of bits: for example, with 9 bits, the data dynamic range is mapped to the interval (-255, +255), and the value -256 is recognized as a STOPPING message.
The HI method can also be applied to save energy by simply reducing the number of memory write operations at the destination PEs. When a given message is received by its destination PE, the write into the internal memory can be avoided if the message is recognized as unimportant. The implementation of this functionality still exploits the STOPPING message, which is used to control the write enable signal of the memory and prevent the write operation.
However, in this case, the STOPPING message must be sent to the memory at every iteration.
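The STOPPING encoding and its use as a write-enable control can be sketched as below, assuming the 9-bit two's complement example given in the text. The function names and the list used as a memory model are hypothetical, and saturating ordinary values to the symmetric range is an assumption about how the reserved code is kept free.

```python
from typing import List

BITS = 9
STOPPING = -(1 << (BITS - 1))        # -256: lowest representable value, reserved as a flag
MAX_MAGNITUDE = (1 << (BITS - 1)) - 1  # 255: ordinary data stays within -255..+255


def encode(value: int, stop: bool) -> int:
    """Build the word carried by a message.

    Unimportant messages are replaced by the STOPPING code; ordinary values are
    saturated to the symmetric range so they can never collide with that code.
    """
    if stop:
        return STOPPING
    return max(-MAX_MAGNITUDE, min(MAX_MAGNITUDE, value))


def write_back(memory: List[int], address: int, word: int) -> bool:
    """Model the write port of the destination PE memory.

    The STOPPING code keeps the write enable low, so the stored value is left
    untouched. Returns True when a write actually took place.
    """
    if word == STOPPING:
        return False
    memory[address] = word
    return True
```

In this model the energy saving comes from the skipped assignment in `write_back`, which stands for the avoided memory write.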
The implementation of SI follows that of HI in the turbo case, only requiring two additional comparators for the different thresholds in (8)

The U method requires, for the initialization of the URGENCY field of outgoing messages, the estimation of the available delivery time. This measure can be obtained, in the turbo case, thanks to the current trellis step together with the globally known interleaving rule. Since each SISO processes a sequential set of trellis steps, the destination memory address is a precise identifier of the time instant a message will be needed. The URGENCY field of each outgoing

where $T_{half}$ is the duration of a half iteration, $t_{send}$ is the time stamp of the sending instant and $t_{need}$ equals the destination memory address multiplied by the number of cycles needed to complete each trellis step. The destination address is also used in the initialization of U in the LDPC case. By multiplying it by the minimum row degree of H, a lower bound of $t_{need}$ is obtained, thus leading to the following equation:
The urgency field requires additional bits in all the NoC FIFOs and channels, together with the simple initialization logic at each PE. Moreover, each FIFO of length F needs F adders to update U at each clock cycle, while WPTR and RPTR must be updated in case the urgency field reaches zero, and the corresponding message discarded. All the FIFO memory elements must be available for writing at each clock cycle: the FIFO consequently must be implemented with registers, and not with a RAM. Finally, the priority of the RE arbiter is changed from being FIFO-length-based [14] to urgency-based.
The implementation of BR requires all the modifications described for U, plus a novel method for the update of WPTR and RPTR. The RE input buffers in fact lose their FIFO nature, since the input order is not guaranteed to be the output order. Fig. 11 shows the simplified structure of the proposed buffer reordering mechanism; white blocks represent registers, while gray blocks in-

The area overhead introduced by the additional operations in the reordering buffer has been evaluated with respect to a typical FIFO buffer. For example, a reordering buffer with five buffer elements accounts for a little more than

Graphics ModelSim, and place and route with Cadence SoC Encounter. Table 2 dissects the post place and route area occupation of various components of the decoder before and after the implementation of the proposed methods.
The small increase in memory occupation is due to the extra memory bit for the unimportant flag related to HI, shared between SISOs and LDPC PEs. Anew + P ow
where $\mathrm{Pow}^{PE}_{\Delta}$ is the power consumption increment in PEs due to the contribution of HI, SI and U initialization. The implementation of HI+SI+U+BR,

The actual impact of HI on the energy consumption of centralized decoders can also be estimated. In [7] the energy breakdown of a decoder based on memory bank sharing is given. The energy consumption of the γ-memory accounts for 70% of the total dynamic energy for the WiMAX LDPC code of size 576 and rate 5/6. For this particular code, if HI is applied, the percentage of stopped messages, and consequently of avoided write operations, is around 17% (Table 1). Since write operations account for approximately 50% of the energy expenditure, the implementation of HI leads to a 6% reduction in the total decoder energy consumption.
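The 6% figure follows from multiplying the three fractions quoted above. The short sketch below reproduces the arithmetic; the function and parameter names are illustrative, and it assumes the 50% write share refers to the memory energy and that every stopped message removes exactly one write.

```python
def hi_energy_saving(memory_share: float, write_share: float, stopped_fraction: float) -> float:
    """Fraction of total decoder energy saved by skipping memory writes.

    memory_share:     share of total dynamic energy spent in the memory (0.70 here)
    write_share:      share of that memory energy due to write operations (0.50 here)
    stopped_fraction: fraction of writes avoided thanks to HI (0.17 here)
    """
    return memory_share * write_share * stopped_fraction


saving = hi_energy_saving(memory_share=0.70, write_share=0.50, stopped_fraction=0.17)
print(f"Estimated saving: {saving * 100:.2f}% of total decoder energy")
```

The product is 0.0595, which rounds to the 6% reduction reported in the text.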
A simple, direct way to reduce the traffic on the NoC is to reduce the number of iterations of the channel decoder.

$E_{frame}$ expresses the energy spent per decoded frame and $\Delta$SNR shows the perfor-

The proposed implementation outperforms the iteration reduction in terms of energy savings, while at the same time affecting the BER performance only marginally.
Finally, Table 5

[19] and $A_{new}$ achieve comparable throughputs when decoding LDPC codes, with [19] having a larger $A^{n}_{tot}$. This leads to $A_{new}$ having slightly better $A_{eff}$ and $E_{eff}$: the gap is much larger in turbo mode.
The high-parallelism, single-standard WiMAX LDPC decoder presented in [16] guarantees very high throughput at a 40 MHz frequency, which allows for reduced power consumption and high efficiency. Though [16] has been designed for ultra-low power consumption, and $A_{new}$ targets multiple code types and standards, the normalized power gap between them is minimal. This work fares even better when compared to the dedicated Application-Specific Integrated Circuit (ASIC) targeting 3GPP-LTE turbo codes in [34], which yields good $A_{eff}$ and throughput. The estimated normalized Pow of 261 mW with the 90 nm node is still higher than that of $A_{new}$, with lower efficiency.
Conclusion
The paper proposes novel power reduction techniques for NoC-based channel decoders: extensive traffic and performance analyses are performed, while the hardware implementation makes it possible to assess the impact on a relevant design example and to obtain accurate power consumption figures. By dealing with the late message delivery problem through traffic reduction and optimization, the
