Abstract-During the last two decades, wireless communication has been revolutionized by near-capacity error-correcting codes (ECCs), such as turbo codes (TCs), which offer a lower bit error ratio (BER) than their predecessors, without requiring an increased transmission energy consumption (EC). Hence, TCs have found widespread employment in spectrum-constrained wireless communication applications, such as cellular telephony, wireless local area network, and broadcast systems. Recently, however, TCs have also been considered for energy-constrained wireless communication applications, such as wireless sensor networks and the 'Internet of Things.' In these applications, TCs may also be employed for reducing the required transmission EC, instead of improving the BER. However, TCs have relatively high computational complexities, and hence, the associated signal-processing-related ECs are not insignificant. Therefore, when parameterizing TCs for employment in energy-constrained applications, both the processing EC and the transmission EC must be jointly considered. In this tutorial, we investigate holistic design methodologies conceived for this purpose. We commence by introducing turbo coding in detail, highlighting the various parameters of TCs and characterizing their impact on the encoded bit rate, on the radio frequency bandwidth requirement, on the transmission EC and on the BER. Following this, energy-efficient TC decoder application-specific integrated circuit (ASIC) architecture designs are exemplified, and the processing EC is characterized as a function of the TC parameters. Finally, the TC parameters are selected in order to minimize the sum of the processing EC and the transmission EC.
upon the design of the wireless communication schemes and on the electronic devices. For example, cellular telephony, Wireless Local Area Networks (WLANs) and broadcast systems [1]-[3] may be considered to be spectrum-constrained, since the ever-increasing demand for faster data rates creates a correspondingly increased demand for the limited Radio Frequency (RF) resources. Therefore, successive generations of cellular telephony, as well as WLAN and broadcast systems, have been designed to make increasingly efficient use of the RF spectrum. In parallel to this trend, there has been a significant amount of recent interest in energy-constrained wireless communication applications [4]-[7], such as in Wireless Sensor Networks (WSNs) and in the 'Internet of Things' (IoT) [8]-[12]. These applications are characterized by the requirement of maintaining sporadic, but reliable data transmissions for extended periods of time. Typically, the communication devices employed in this scenario are required to be mobile, preventing them from relying on access to fixed energy supplies, such as mains electricity. The devices are often required to be shirt-pocket-sized, lightweight and low-cost, preventing the employment of high-capacity batteries. Furthermore, the communication devices may be expected to operate without human interaction, preventing the regular replacement or recharging of batteries. For these reasons, the communication devices are required to make efficient use of all available energy resources, which may include low-capacity batteries and energy harvesters, such as solar cells. In this paper, we focus our attention on the employment of Turbo Codes (TCs) [13], [14] in energy-constrained wireless communication applications, considering the joint design of both the communications and the hardware architecture.
In this paper, TCs are invoked for energy-constrained wireless communication applications due to their widespread employment in operational communication standards, such as LTE [1] and WiMAX [15] .
Wireless communication has been revolutionized by the invention of TCs [13], [14] and other sophisticated Error-Correcting Codes (ECCs). These codes provide resilience to the transmission errors that are caused by noise, interference and fading during wireless transmission. This is achieved by using a turbo encoder to process all information before transmitting it, then employing a corresponding turbo decoder in the receiver to detect and correct any transmission errors. Compared to previous ECCs, TCs facilitate significantly higher information bit rates and/or significantly lower RF bandwidth requirements, without requiring an increased transmission Energy Consumption (EC) or imposing an increased transmission error probability. In other words, TCs facilitate significantly improved spectral efficiencies $\eta$, without requiring an increased Signal-to-Noise Ratio (SNR) per bit $E_b^{\mathrm{tx}}/N_0$ or imposing an increased Bit Error Ratio (BER). Here, the spectral efficiency $\eta$ has units of bit/s/Hz and is given by the ratio of the information bit rate to the required bandwidth. Meanwhile, the transmission EC $E_b^{\mathrm{tx}}$ has units of J/bit and is expressed by the SNR per bit $E_b^{\mathrm{tx}}/N_0$, where it is normalized by the noise power spectral density $N_0$. Finally, the BER quantifies the transmission error probability by expressing the number of information bits that are erroneously decoded as a ratio to the total number of information bits. Fig. 1 plots the capacity of a particular wireless channel, which quantifies the maximum spectral efficiency $\eta$ for which it is theoretically possible to achieve a vanishingly low BER [16], as a function of the SNR per bit $E_b^{\mathrm{tx}}/N_0$. The crosses in Fig. 1 show that at an $E_b^{\mathrm{tx}}/N_0$ of about 11 dB, a low BER can be achieved by a particular repetition code having a spectral efficiency of $\eta = 1/3$ bit/s/Hz, assuming a Nyquist roll-off factor of $\alpha = 0$.
By contrast, a particular punctured TC is capable of achieving this BER, while using a significantly higher spectral efficiency of $\eta = 0.81$ bit/s/Hz, which is much nearer to the channel capacity. Owing to this benefit, TCs are often referred to as near-capacity ECCs and have found widespread employment in spectrum-constrained wireless communication applications, such as cellular telephony, WLAN and broadcast systems. However, Fig. 1 also illustrates an alternative application for TCs in energy-constrained wireless communication systems, where the attainable energy efficiency of $1/E_b^{\mathrm{tx}}$ is of greater concern than the spectral efficiency $\eta$. The crosses in Fig. 1 show that when no puncturing is used, the TC considered achieves the same low BER and the same spectral efficiency of $\eta = 1/3$ bit/s/Hz as the repetition code, albeit at a significantly lower $E_b^{\mathrm{tx}}/N_0$ value of 1.6 dB. This corresponds to a 9.4 dB reduction in the transmission EC $E_b^{\mathrm{tx}}$, which is nearly an order of magnitude. This demonstrates why TCs have found application not only in spectrum-constrained wireless communication scenarios, but also recently in energy-constrained scenarios, such as WSNs and the IoT.
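As a quick check of the figures quoted above, the difference between the two operating points can be converted from decibels into a linear energy ratio:

```latex
\[
11~\mathrm{dB} - 1.6~\mathrm{dB} = 9.4~\mathrm{dB},
\qquad
10^{9.4/10} \approx 8.7 ,
\]
```

so the unpunctured TC needs roughly 8.7 times less transmission energy per bit than the repetition code at the same spectral efficiency, which is consistent with the "nearly an order of magnitude" statement.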
TCs most commonly take advantage of the Bahl-Cocke-Jelinek-Raviv (BCJR) decoding algorithm and its variants with the objective of mitigating the transmission errors corrupting the received information. When used in TCs, the BCJR decoder, also known as the Maximum A Posteriori (MAP) decoder, is activated in an iterative manner. In a similar fashion to the classic Low-Density Parity-Check (LDPC) decoders [17] operating on the basis of the min-sum and sum-product algorithms, the iterative operation of the BCJR algorithm approximates the capacity-approaching performance of a Maximum Likelihood Detector (MLD), with the appealing benefit of imposing a fraction of the complexity [18]. The BCJR algorithm operates on the basis of a trellis in a similar manner to the Viterbi Algorithm (VA) [19], which has a lower complexity but does not facilitate iterative decoding and hence has a reduced error correction capability. The complexity of a decoding algorithm is often quantified in terms of the number of operations required for decoding, which can be expressed in terms of the number of states or trellis transitions. However, this paper will demonstrate that the complexity of the algorithm does not necessarily determine the complexity and the EC of its Application-Specific Integrated Circuit (ASIC) implementation, hence motivating the holistic design methodologies investigated in this paper.

Fig. 1 (caption fragment): ... $E_b^{\mathrm{tx}}/N_0$ for which BERs of $10^{-3}$ can be achieved by three ECCs, namely, (a) a 1/3-rate repetition code employing soft decoding, (b) the 1/3-rate LTE TC [1] employing a message length of 2048 bits and 6 decoding iterations, as well as (c) the same LTE TC but punctured to give a coding rate of 0.81.
Although the iterative decoder of a TC has a significantly lower complexity than the optimal MLD, its processing EC $E_b^{\mathrm{pr}}$ should also be considered when employing TCs for the sake of reducing the transmission EC $E_b^{\mathrm{tx}}$. While turbo encoders have relatively low complexity and EC [20], the EC $E_b^{\mathrm{pr}}$ of turbo decoders is not insignificant [21], even when implemented using an ASIC. This may be attributed to the relatively high complexity of turbo decoding algorithms, such as that of the BCJR algorithm [22]. Indeed, the authors of [23] considered the power consumption of the various components of a transceiver, finding that for the range of LTE base-stations which were considered, the turbo code consumes approximately the same power as the baseband radio components. Additionally, it was found for the smallest 'femto' base-stations that the turbo code also consumes approximately the same power as the Power Amplifier (PA) components. Conventionally, it has been a challenge to jointly optimize both the transmission EC $E_b^{\mathrm{tx}}$ and the processing EC $E_b^{\mathrm{pr}}$ during the design of TCs for energy-constrained wireless communication applications. While the transmission EC $E_b^{\mathrm{tx}}$ can be characterized at an early design stage using BER simulations, it has not previously been possible to characterize the processing EC $E_b^{\mathrm{pr}}$ until after the turbo decoder ASIC has been designed, which is a much later design stage. If it is discovered at this stage that the processing EC $E_b^{\mathrm{pr}}$ is unacceptably high, then it becomes necessary to revert to an earlier design stage and try again. This motivates the holistic TC design methodologies that we demonstrate in this tutorial. These methodologies model the processing EC $E_b^{\mathrm{pr}}$ of an energy-efficient TC decoder ASIC architecture as a function of the TC design parameters, allowing joint optimization at an early design stage.
Typically, the open literature on wireless communication algorithms [24]-[26] considers them independently of their hardware implementation, despite the interdependence of the two. Instead, a simplistic approach is often pursued when considering the implementation aspects. For example, it is typical for a paper in wireless communications to quantify the computational complexity of an algorithm using the number of computational operations that have to be undertaken [24]. This gives a reasonable metric for comparing similar schemes, but it typically does not offer a fair comparison between dissimilar schemes [27]. The parameters that typically matter are the energy consumption and the hardware resources of a scheme, since these ultimately determine the cost and battery life of the system. Furthermore, without considering the hardware implementation, it is not possible to consider metrics such as processing latency and throughput, which can impose bottlenecks upon the overall latency and throughput, particularly in applications such as Machine-to-Machine (M2M) communications for next generation devices [28]. As explored in this paper, considering the algorithm and its implementation jointly allows for holistic optimization of the overall energy consumption, cost, latency and throughput as functions of all algorithmic and implementation design decisions.
Against this background, Fig. 2 summarizes the trade-offs the designer of a TC decoder has to consider. These have been split into the categories of algorithmic trade-offs and architectural trade-offs, since these have previously been considered separately. Building on this, Fig. 3 illustrates the structure of this paper and the holistic design and optimization approach of this tutorial. This facilitates a system-wide EC optimization, while considering how the trade-offs on different sides of Fig. 2 influence each other. We commence in Section II by introducing in detail the TC, and its BER performance.
Section III considers the implementation of the TC, with particular consideration of the computationally intensive Logarithmic BCJR (Log-BCJR) algorithm. The requirements of the Log-BCJR algorithm affect the design decisions made for the architecture, while conversely, architectural trade-offs have to be made which may modify the operation and performance of the Log-BCJR algorithm. This reciprocal relationship is shown in Fig. 3, where the algorithmic design and architectural design are closely linked. We focus our attention on the three main areas of the architectural design, namely on the datapath, on the controller and on the memory, exploring different methods which have been developed for reducing the corresponding EC. The remainder of this tutorial then focuses on the joint optimization of the algorithm and architecture parameterization, with consideration of the possible options developed during the design stage. To achieve this, Section IV discusses a range of different approaches conceived for estimating the processing EC $E_b^{\mathrm{pr}}$ for the different algorithm parameterizations. Although typically extensive simulations are required for estimating the EC of a circuit, this section discusses methods of significantly reducing the required simulation complexity, which is achieved by characterizing the processing EC $E_b^{\mathrm{pr}}$ of a turbo decoder as a function of its parameters. Finally, we holistically consider the performance and energy consumption of the candidate algorithm and architecture trade-offs in Section V. The techniques gleaned from the literature and explored in this section allow all of the factors seen in Fig. 2 to be jointly considered, allowing the selection of carefully optimized TC parameters that minimize the sum of the processing EC $E_b^{\mathrm{pr}}$ and of the transmission EC $E_b^{\mathrm{tx}}$. The tutorial concludes with our recommended design guidelines in Section VI.
II. TURBO CODING
In this section, we introduce the TC scheme of Fig. 4. We begin in Section II-A by describing the convolutional encoders, which are concatenated in parallel in order to form the turbo encoder of Fig. 4. The integration of the turbo encoder into a BPSK transmitter is discussed in Section II-B. Following this, Section II-C describes the modeling of transmission over an Additive White Gaussian Noise (AWGN) channel, subject to a certain path loss. Section II-D discusses the operation of the turbo-coded BPSK receiver of Fig. 4. This operates on the basis of the most frequently used variant of the BCJR decoder, namely the Log-BCJR decoder, which is detailed in Section II-E. Modifications of the Log-BCJR algorithm conceived for practical implementations are discussed in Section II-F, before the TC's error correction performance is characterized in Section II-G.
A. Convolutional Encoder
The convolutional encoder [29] is a widely adopted component in sophisticated error correcting schemes, forming the basis of the turbo encoder, as shown in Fig. 4. In this application, the input of the convolutional encoder is a message frame $b_1$ comprising $N$ bits, while the output is an $N$-bit encoded frame $b_2$. The parameterization of a convolutional encoder may be specified by a trellis, which graphically illustrates the relationship between the frames $b_1$ and $b_2$. The example trellis of Fig. 5 corresponds to a simple convolutional encoder, which may be used for encoding a message frame $b_1$ comprising $N = 5$ bits. This encoder adopts one of two possible states following the encoding of each bit in the frame $b_1$, as represented by the dots in Fig. 5. Depending on the value of this bit, the encoder state is selected by following one of two possible transitions from the previous state, as represented by the lines in Fig. 5. Note that the convolutional code's trellis of Fig. 5 has $2^m = 2$ states, which corresponds to a shift register having $m = 1$ memory element. Furthermore, each transition between states is selected based on the value of $k = 1$ message bit, resulting in the generation of $n = 1$ encoded bit. This results in a coding rate for this convolutional encoder of $R = k/n = 1$, and an overall coding rate for the turbo code of Fig. 4 of $R = 1/3$. However, the convolutional codes of generalized TCs may employ a shift register having any number $m$ of memory elements. Furthermore, transitions may be selected based on any number $k$ of message bits, resulting in the generation of any number $n$ of encoded bits. While the TC of the LTE standard in cellular telephony [1] also employs $k = 1$ and $n = 1$, its shift register has $m = 3$ memory elements, resulting in a trellis having $2^m = 8$ states. The mapping of message and encoded bit values to each transition in the LTE TC trellis is specified by its generator polynomials. Furthermore, the LTE TC appends three additional termination bits to each message frame $b_1$, in order to guarantee that the convolutional encoder always reaches the same particular state at the end of the encoding process.

Fig. 5 (caption fragment): ... $a_1(T)/a_2(T)$. A particular transition $T$ from the current state will be selected if the corresponding bit in the message frame $b_1$ has the value $a_1(T)$, while $a_2(T)$ is the value that will be output for the corresponding bit in the encoded frame $b_2$.
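The trellis description above can be sketched in a few lines of Python. The transition table below is a hypothetical two-state accumulator ($m = 1$, $k = 1$, $n = 1$) chosen purely for illustration; it is not necessarily the trellis drawn in Fig. 5, and the names `TRELLIS` and `conv_encode` are our own.

```python
from typing import List, Tuple

# Hypothetical 2-state trellis (assumption, not taken from Fig. 5):
# TRELLIS[current_state][message_bit a1] -> (next_state, encoded_bit a2)
TRELLIS = {
    0: {0: (0, 0), 1: (1, 1)},
    1: {0: (1, 1), 1: (0, 0)},
}


def conv_encode(b1: List[int], start_state: int = 0) -> Tuple[List[int], int]:
    """Walk the trellis once per message bit and collect the encoded bits."""
    state = start_state
    b2 = []
    for bit in b1:
        # The message bit selects which transition leaves the current state.
        state, encoded_bit = TRELLIS[state][bit]
        b2.append(encoded_bit)
    return b2, state


if __name__ == "__main__":
    frame = [1, 0, 0, 1, 1]          # N = 5 message bits
    print(conv_encode(frame))
```

Each message bit yields exactly one encoded bit, so the encoded frame has the same length $N$ as the message frame, in line with the rate $R = k/n = 1$ of the constituent encoder.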
B. Turbo Coded Transmitter
As shown in Fig. 4, the turbo encoder comprises a parallel concatenation of two convolutional encoders, which we refer to as the upper and lower encoders. The upper encoder processes the frame of message bits $b^u_1$ in their original order, while the lower encoder processes the same bits, but in a different order. This reordering is performed by the interleaver $\pi$ of Fig. 4, which outputs the interleaved message frame $b^l_1$. The upper and lower convolutional encoders produce the $N$-bit encoded frames $b^u_2$ and $b^l_2$, respectively. These encoded frames provide $2N$ parity bits, which are multiplexed in the crossed block of Fig. 4 with $N$ systematic bits, which are provided by the $N$-bit message frame $b^u_1$. The resultant transmission frame $b_3$ comprises $3N$ bits, corresponding to a coding rate of
$$R = \frac{N}{3N} = \frac{1}{3}.$$
Following turbo encoding, the transmitter of Fig. 4 employs BPSK modulation, upsampling, pulse shaping, RF mixing and power amplification. These are employed in order to transmit the frame $b_3$ using the desired carrier frequency $f_c$ at a desired transmission energy per bit $E_b^{\mathrm{tx}}$. Note that the power amplifier may have an efficiency of only around 33%, which corresponds to a power amplifier efficiency loss $A$ of 4.8 dB [4]. Here, $E_b^{\mathrm{tx}}$ is related to the energy $E_s^{\mathrm{tx}}$ dissipated per modulated symbol according to $E_b^{\mathrm{tx}}\,[\mathrm{dBJ}] = E_s^{\mathrm{tx}}\,[\mathrm{dBJ}] - 10\log_{10}(\eta)$, where $\eta = R \log_2(M)$, $R$ is the coding rate and $M$ is the modulation order of the modulation scheme, with $M = 2$ in the case of BPSK. Note that the employment of $E_b^{\mathrm{tx}}$ is typically preferable to $E_s^{\mathrm{tx}}$, since this allows a fair comparison among schemes having different coding rates $R$ and modulation orders $M$ in terms of their transmission energy consumption.
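The transmitter bookkeeping described in this section can be illustrated with a minimal sketch. It assumes a hypothetical accumulator as the constituent encoder, an arbitrary interleaver pattern, and a systematic/upper-parity/lower-parity multiplexing order; none of these details are specified by the text, and all function names are our own.

```python
import math
from typing import List, Sequence


def accumulate(bits: Sequence[int]) -> List[int]:
    """Hypothetical 2-state constituent encoder: running XOR of the input."""
    state, out = 0, []
    for b in bits:
        state ^= b
        out.append(state)
    return out


def turbo_encode(b1_u: Sequence[int], pi: Sequence[int]) -> List[int]:
    """Parallel concatenation: N systematic bits plus 2N parity bits."""
    b1_l = [b1_u[j] for j in pi]          # interleaved message frame
    b2_u = accumulate(b1_u)               # upper parity frame
    b2_l = accumulate(b1_l)               # lower parity frame
    b3 = []
    for s, p_u, p_l in zip(b1_u, b2_u, b2_l):
        b3.extend((s, p_u, p_l))          # assumed multiplexing order
    return b3


def coding_rate(num_message_bits: int, num_transmitted_bits: int) -> float:
    return num_message_bits / num_transmitted_bits


def pa_loss_db(efficiency: float) -> float:
    """Power amplifier efficiency expressed as a loss A in dB."""
    return -10.0 * math.log10(efficiency)


def energy_per_bit_dbj(es_dbj: float, rate: float, mod_order: int) -> float:
    """E_b [dBJ] = E_s [dBJ] - 10 log10(eta), with eta = R log2(M)."""
    eta = rate * math.log2(mod_order)
    return es_dbj - 10.0 * math.log10(eta)
```

For example, `pa_loss_db(0.33)` evaluates to about 4.8 dB, matching the loss $A$ quoted above, and a rate-1/3 BPSK scheme spends about 4.8 dB more energy per information bit than per modulated symbol.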
C. Channel
The wireless channel of Fig. 4 conveys the BPSK-modulated signal between the transmitter and receiver antennas, but imposes degradation. These antennas can be characterized by their gains ($G_{\mathrm{tx}}$ and $G_{\mathrm{rx}}$) for the intended direction of propagation. In the scenario where there is a dominant line-of-sight (LOS) path between these antennas, the degradation may be modeled by the inverse-second-power free space path loss and AWGN. Here, the path loss is imposed by the attenuation of the BPSK-modulated signal as it propagates through free space. This path loss $PL$ depends on the distance $d$ between the transmit and receive antennas (in m) and the carrier frequency $f_c$ (in Hz) [30], according to
$$PL\,[\mathrm{dB}] = 20\log_{10}(d) + 20\log_{10}(f_c) + 20\log_{10}\!\left(\frac{4\pi}{c}\right), \tag{1}$$
where $c = 2.998 \times 10^8$ m/s is the speed of light, resulting in the last term of (1) having a constant value of −147.55 dB. However, the free space path loss model may be optimistic, since often there are multiple paths between the transmitter and receiver but the LOS path might be absent. In order to account for this, the path loss equation can be generalized by parameterizing the path loss exponent $p$ [4], [5], according to
$$PL\,[\mathrm{dB}] = 10\,p\log_{10}(d) + 20\log_{10}(f_c) + 20\log_{10}\!\left(\frac{4\pi}{c}\right). \tag{2}$$
Path loss exponents between $p = 2$ and $p = 4$ can be expected in the diverse environments encountered. The AWGN is imposed by the Brownian motion of electrons, resulting in thermal noise at the receiver, which has the power spectral density of $N_0\,[\mathrm{dBJ}] = 10\log_{10}(k \cdot T)$, where $k = 1.3806503 \times 10^{-23}$ J/K is the Boltzmann constant. For the case of the room temperature $T = 300$ K, we obtain $N_0 = -203.8$ dBJ. Note that depending on the operating conditions, co-user interference is often more significant than the thermal noise. To model this, $N_0$ can instead be replaced with the noise power spectral density that is expected in the operating conditions of the wireless link [31]. Considering the above channel effects, we can therefore express the energy per bit at the receiver $E_b^{\mathrm{rx}}$ in terms of the energy dissipated at the transmitter $E_b^{\mathrm{tx}}$ and the channel conditions, according to
where all quantities are expressed in dB, except $E_b^{\mathrm{tx}}$, which is expressed in Joules. Note that if shadowing or fading is prevalent in the particular wireless environment considered, then (3) can be modified to model this by additionally subtracting corresponding fading margins [32].
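The channel quantities discussed in this section can be evaluated numerically with the following sketch. Two parts are assumptions on our side rather than statements of the text: the generalized path loss applies the exponent $p$ only to the distance term, and `rx_energy_per_bit_dbj` is a hypothetical link budget that simply adds the antenna gains and subtracts the amplifier loss and path loss in dB.

```python
import math

SPEED_OF_LIGHT = 2.998e8          # m/s
BOLTZMANN = 1.3806503e-23         # J/K


def path_loss_db(d_m: float, fc_hz: float, p: float = 2.0) -> float:
    """Path loss in dB; p = 2 gives the free-space case.

    Assumption: the exponent p scales only the distance term.
    """
    constant = 20.0 * math.log10(4.0 * math.pi / SPEED_OF_LIGHT)  # about -147.55 dB
    return 10.0 * p * math.log10(d_m) + 20.0 * math.log10(fc_hz) + constant


def noise_psd_dbj(temperature_k: float = 300.0) -> float:
    """Thermal noise power spectral density N0 = k*T, expressed in dBJ."""
    return 10.0 * math.log10(BOLTZMANN * temperature_k)


def rx_energy_per_bit_dbj(e_tx_b_joules: float, pl_db: float,
                          g_tx_db: float = 0.0, g_rx_db: float = 0.0,
                          pa_loss_db: float = 0.0) -> float:
    """Hypothetical link budget (illustrative only): all terms in dB except
    the transmit energy per bit, which is given in Joules."""
    return (10.0 * math.log10(e_tx_b_joules)
            - pa_loss_db + g_tx_db + g_rx_db - pl_db)


if __name__ == "__main__":
    pl = path_loss_db(10.0, 2.4e9)             # 10 m at 2.4 GHz, free space
    e_rx = rx_energy_per_bit_dbj(1e-9, pl)     # 1 nJ/bit at the transmitter
    print(pl, noise_psd_dbj(), e_rx - noise_psd_dbj())
```

The last printed value is the resulting received SNR per bit in dB under these assumptions, which is the quantity that the BER results are plotted against.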
D. Turbo Coded Receiver
In the receiver of Fig. 4 , the BPSK-modulated signal provided by the receive antenna is passed to a Low Noise Amplifier (LNA). This is employed to boost the weak received signals, while introducing only a minimal amount of additional noise, which is quantified by its Receiver Noise Figure (RNF). The amplified signal is mixed down from the RF range to the baseband, where it is filtered to remove the out-of-band noise, down-sampled and provided to the BPSK demodulator.
The role of the BPSK demodulator is to extract information pertaining to the turbo-encoded bits from the received signal. However, the BPSK demodulator can never be certain of the correct value for each bit, owing to the unpredictable nature of the degradation imposed by the channel. Rather than making a binary hard decision of '1' or '0' for each bit, superior error correction performance can be obtained if the demodulator makes a soft decision. Here, a soft decision expresses not only what the most likely value of the bit is, but also how likely this value is. More specifically, the demodulator, which is also often referred to as a demapper, can express the soft information pertaining to a particular bit using a Logarithmic Likelihood Ratio (LLR), which represents the probabilities associated with the value of the bit $b$ according to $\tilde{b} = \ln[\Pr(b = 1)/\Pr(b = 0)]$. Here, the sign of an LLR expresses whether a value of '1' or '0' is more likely for the corresponding bit, while the magnitude of the LLR is commensurate with how likely this value is. When employing BPSK modulation, it can be shown that each LLR is directly proportional to the corresponding sample provided by the down-sampler [33]. As shown in Fig. 4, the BPSK demodulator generates LLR sequences such as $\tilde{b}^{l,s}_1$, which pertains to the bit sequence $b^l_1$. These LLR sequences are then provided to the turbo decoder, which is invoked for mitigating the corresponding uncertainty and for eliminating transmission errors. As shown in Fig. 4, the turbo decoder comprises two Log-BCJR decoders, which correspond to the two convolutional encoders of the turbo encoder.
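The LLR definition and its proportionality to the received BPSK sample can be illustrated as follows. The sketch assumes that a '1' bit is mapped to $+a$ and a '0' bit to $-a$, and that the channel adds real Gaussian noise of variance $\sigma^2$ per sample; under these assumptions the LLR is $2ay/\sigma^2$. The mapping and the function names are assumptions made for illustration.

```python
import math


def llr_from_probability(p_one: float) -> float:
    """LLR of a bit given Pr(b = 1), following ln[Pr(b=1)/Pr(b=0)]."""
    return math.log(p_one / (1.0 - p_one))


def bpsk_llr(sample: float, noise_variance: float, amplitude: float = 1.0) -> float:
    """LLR of a BPSK sample in real AWGN.

    Assumed mapping: bit 1 -> +amplitude, bit 0 -> -amplitude.
    """
    return 2.0 * amplitude * sample / noise_variance


def hard_decision(llr: float) -> int:
    """A positive LLR favours '1', a negative LLR favours '0'."""
    return 1 if llr > 0 else 0


def gaussian_pdf(y: float, mean: float, variance: float) -> float:
    return math.exp(-(y - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


if __name__ == "__main__":
    y, var = 0.3, 0.5
    direct = math.log(gaussian_pdf(y, +1.0, var) / gaussian_pdf(y, -1.0, var))
    print(direct, bpsk_llr(y, var))   # both routes give the same LLR
```

The closed form avoids evaluating the two Gaussian densities, which is why the demodulator only needs to scale each sample.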
The turbo decoder is operated in an iterative manner, with the switch labeled 'S1' in Fig. 4 being left open during the first decoding iteration. This feeds the LLR sequence $\tilde{b}^{u,s}_1$ provided by the BPSK demodulator directly into the upper Log-BCJR decoder's input $\tilde{b}^{u,a}_1$. As shown in Fig. 4, the upper Log-BCJR decoder's other input $\tilde{b}^{u,a}_2$ is supplied by the BPSK demodulator. The upper Log-BCJR decoder combines the old (or "a priori") information provided by its two input LLR sequences, in order to extract new (or "extrinsic") information for the output LLR sequence $\tilde{b}^{u,e}_1$. Since this LLR sequence pertains to the uncoded bit sequence $b^u_1$, the interleaver $\pi$ may be used for converting it into information pertaining to the bit sequence $b^l_1$. Following this, the resultant interleaved LLR sequence may be added on a bit-by-bit basis to the values in the LLR sequence $\tilde{b}^{l,s}_1$ provided by the BPSK demodulator, which also pertains to $b^l_1$. The resultant LLR sequence is then forwarded to the lower Log-BCJR decoder's input $\tilde{b}^{l,a}_1$, as shown in Fig. 4. Meanwhile, the lower Log-BCJR decoder's other input $\tilde{b}^{l,a}_2$ is supplied by the BPSK demodulator. In turn, the lower Log-BCJR decoder combines these a priori LLR sequences, in order to obtain the extrinsic LLR sequence $\tilde{b}^{l,e}_1$, completing the first decoding iteration.
In the second and in all subsequent decoding iterations, the switch labeled 'S1' in Fig. 4 is closed, so that the extrinsic information generated by the lower Log-BCJR decoder in the previous iteration contributes to the upper Log-BCJR decoder's input $\tilde{b}^{u,a}_1$. This process may be repeated during the third iteration and during all further iterations, in order to gradually improve the quality of the iteratively exchanged LLR sequences. However, as we will show in Section II-G, each additional iteration yields a diminishing return, until convergence is eventually achieved, whereupon additional iterations provide no further improvement. Once a sufficient number $I$ of iterations has been performed, we may obtain a final output by adding the a priori and extrinsic LLR sequences pertaining to the message bits. Finally, these soft-valued LLRs may be converted into hard-valued bits by considering the sign of each LLR, where a positive value corresponds to a '1' and a negative value corresponds to a '0'.
E. Log-BCJR Decoder
In this section, we provide an overview of the Log-BCJR algorithm [34], which is employed both by the upper and lower Log-BCJR decoders of Fig. 4. Note that the Log-BCJR algorithm is a reduced-complexity version of the BCJR algorithm, as will be discussed in greater detail in Section II-F, together with a discussion of other variants of the BCJR algorithm. Here, we use an example in which the trellis of Fig. 5 is used to process a pair of a priori LLR sequences, in order to obtain the extrinsic LLR sequence $\tilde{b}^e_1$. Note that these example LLRs have been rounded to the nearest integer, for the sake of simplicity. The Log-BCJR algorithm comprises four intermediate steps, in which four sets of metrics are calculated, namely the $\gamma(T)$, $\alpha(S)$, $\beta(S)$ and $\delta(T)$ values, where $T$ refers to a particular transition in the trellis and $S$ refers to a particular state, as detailed in the following discussion. We will show that the calculations of each step can be decomposed into simple Add-Compare-Select (ACS) operations. Further detailed discussions are available in [18], [35].
In the first step of the Log-BCJR algorithm, a $\gamma(T)$ value is calculated for each transition in the trellis of Fig. 5. This $\gamma(T)$ value represents the a priori probability that the transition $T$ was selected during the convolutional encoding process. The $\gamma(T)$ value for a particular transition $T$ in the trellis of Fig. 5 is calculated according to
$$\gamma(T) = a_1(T)\,\tilde{b}^a_{1,i(T)} + a_2(T)\,\tilde{b}^a_{2,i(T)},$$
where $a_1(T)$ and $a_2(T)$ are described in Fig. 5, while $i(T)$ is the index of the bit associated with the transition $T$. Therefore, the entire set of $\gamma(T)$ values can be calculated using only addition and selection operations.
The second step of the Log-BCJR algorithm is to calculate an $\alpha(S)$ value for each state $S$ in the trellis. These $\alpha(S)$ values represent the probability that a particular state was entered into during the encoding process. This is obtained by considering the probabilities of the previous states having been entered into during encoding, as well as the probabilities that the transitions between these pairs of states have been taken. Owing to these dependencies between the probabilities associated with consecutive states, a forward recursion is required in order to calculate the $\alpha(S)$ values for the states of the trellis in a specific order, evolving from left to right. The calculation of $\alpha(S)$ for a particular state $S$ is given by
$$\alpha(S) = \max^{*}_{T \in \mathrm{to}[S]} \left[\gamma(T) + \alpha(\mathrm{fr}[T])\right],$$
where $\mathrm{to}[S]$ returns the set of all transitions merging into the state $S$, while $\mathrm{fr}[T]$ returns the particular state that the transition $T$ emerges from. The operation $\max^{*}$ for two inputs $A$ and $B$ is defined as $\max^{*}(A, B) = \max(A, B) + \ln(1 + e^{-|A-B|})$. Since this operation is associative, it can be readily extended to more inputs. In the example of Fig. 7, each state in the trellis is labeled with its $\alpha(S)$ value, where the $\max^{*}$ operator has been approximated using the max operation for simplicity. As shown in Fig. 7, the forward recursion is initialized by setting the $\alpha(S)$ value of the state at the far left of the trellis to zero. Note that the $\alpha(S)$ values are calculated using only addition and $\max^{*}$ operations, which can be further decomposed into only ACS operations, as we shall show in Section II-F.
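The $\max^{*}$ operation defined above is the Jacobian logarithm, so it equals $\ln(e^A + e^B)$ exactly. A minimal sketch, with function names of our own choosing, shows both the two-input form and its extension to more inputs via associativity:

```python
import math
from functools import reduce
from typing import Iterable


def max_star(a: float, b: float) -> float:
    """max*(A, B) = max(A, B) + ln(1 + exp(-|A - B|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_n(values: Iterable[float]) -> float:
    """Extend max* to several inputs by repeated pairwise application."""
    return reduce(max_star, values)


if __name__ == "__main__":
    a, b = 1.0, 2.0
    print(max_star(a, b), math.log(math.exp(a) + math.exp(b)))
```

Because the correction term depends only on $|A - B|$, the operation never needs to exponentiate a large positive number, which is one reason it is preferred over a direct evaluation of $\ln(e^A + e^B)$.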
In the third step of the Log-BCJR algorithm, a $\beta(S)$ value is calculated for each state in the trellis, using a similar process to that of the $\alpha(S)$ values. While the $\alpha(S)$ values depend on the previous $\alpha(S)$ values in the trellis, the $\beta(S)$ value of a particular state depends on those of the next states in the trellis. Therefore, the $\beta(S)$ values must be calculated in order using a backward recursion, evolving from the right end of the trellis to the left end. This is achieved according to
$$\beta(S) = \max^{*}_{T \in \mathrm{fr}[S]} \left[\gamma(T) + \beta(\mathrm{to}[T])\right],$$
where $\mathrm{fr}[S]$ returns the set of all transitions that emerge from the state $S$, while $\mathrm{to}[T]$ returns the particular state that the transition $T$ merges into. Once again, the $\beta(S)$ values for our example are shown on the states of Fig. 7, where the $\max^{*}$ operator has been approximated using the max operation for simplicity. As shown in Fig. 7, the backward recursion is initialized by setting the $\beta(S)$ values of the states at the far right of the trellis to zero. Like the $\alpha(S)$ values, the $\beta(S)$ values can be calculated using only ACS calculations. The fourth set of metrics required for the Log-BCJR algorithm are the $\delta(T)$ values, which combine the results from previous metrics in order to represent the a posteriori probabilities that the transitions were followed in the encoder. The $\delta(T)$ value of a particular transition $T$ is calculated by adding its $\gamma(T)$ value to the $\alpha(S)$ value of the state it emerges from and the $\beta(S)$ value of the state it merges into, according to
$$\delta(T) = \gamma(T) + \alpha(\mathrm{fr}[T]) + \beta(\mathrm{to}[T]).$$
The $\delta(T)$ calculations detailed for our example can be seen in Fig. 8. Since the $\delta(T)$ values are calculated using only additions, they can be decomposed into ACS operations. Finally, the Log-BCJR algorithm can combine the $\delta(T)$ values in order to calculate the output extrinsic LLRs. This is achieved according to
$$\tilde{b}^e_{1,i} = \max^{*}_{T \mid a_1(T) = 1,\; i(T) = i} \left[\delta(T)\right] - \max^{*}_{T \mid a_1(T) = 0,\; i(T) = i} \left[\delta(T)\right] - \tilde{b}^a_{1,i},$$
where $\{T \mid a_1(T) = 0,\; i(T) = i\}$ is the set of all transitions for which the represented uncoded bit value $a_1(T)$ is zero and the index $i(T)$ of that uncoded bit is $i$. As shown in the example of Fig. 8, this corresponds to the grouping of the $\delta(T)$ values into two sets, which are then combined using $\max^{*}$ operations. Following this, the a priori LLR $\tilde{b}^a_{1,i}$ is subtracted from the difference between these two $\max^{*}$ calculations. Note that the extrinsic LLRs are calculated using only subtraction and $\max^{*}$ operations, which can be further decomposed into only ACS operations, as we shall show in Section II-F. This completes the Log-BCJR decoding process.
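The four steps of this section can be put together in a compact sketch. It operates on a hypothetical two-state accumulator trellis (an assumption, not necessarily the trellis of Fig. 5), starts from a single known state with all final states treated as equally likely, and uses the exact $\max^{*}$ operator. The function `brute_force_extrinsic` is only a reference that enumerates every possible message frame, included so that the recursions can be checked on short frames.

```python
import math
from itertools import product

NEG_INF = float("-inf")

# Hypothetical trellis section, one tuple per transition T:
# (fr[T], to[T], a1(T), a2(T)). Assumed for illustration only.
SECTION = [(0, 0, 0, 0), (0, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 0)]
NUM_STATES = 2


def max_star(a, b):
    """ln(e^a + e^b), tolerating -inf for unreachable states."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def log_bcjr(ba1, ba2):
    """Return the extrinsic LLRs for the message bits."""
    n = len(ba1)
    # Step 1: gamma values, using only selection and addition.
    gamma = [[a1 * ba1[i] + a2 * ba2[i] for (_, _, a1, a2) in SECTION]
             for i in range(n)]
    # Step 2: forward recursion; only the leftmost state 0 is reachable.
    alpha = [[NEG_INF] * NUM_STATES for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for i in range(n):
        for t, (fr, to, _, _) in enumerate(SECTION):
            alpha[i + 1][to] = max_star(alpha[i + 1][to], alpha[i][fr] + gamma[i][t])
    # Step 3: backward recursion; all rightmost states start at zero.
    beta = [[NEG_INF] * NUM_STATES for _ in range(n + 1)]
    beta[n] = [0.0] * NUM_STATES
    for i in range(n - 1, -1, -1):
        for t, (fr, to, _, _) in enumerate(SECTION):
            beta[i][fr] = max_star(beta[i][fr], beta[i + 1][to] + gamma[i][t])
    # Step 4: delta values, grouped by message bit value, then extrinsic LLRs.
    extrinsic = []
    for i in range(n):
        ones, zeros = NEG_INF, NEG_INF
        for t, (fr, to, a1, _) in enumerate(SECTION):
            delta = gamma[i][t] + alpha[i][fr] + beta[i + 1][to]
            if a1 == 1:
                ones = max_star(ones, delta)
            else:
                zeros = max_star(zeros, delta)
        extrinsic.append(ones - zeros - ba1[i])
    return extrinsic


def brute_force_extrinsic(ba1, ba2):
    """Reference result obtained by enumerating all 2^N message frames."""
    n = len(ba1)
    ones, zeros = [NEG_INF] * n, [NEG_INF] * n
    for b1 in product((0, 1), repeat=n):
        state, weight = 0, 0.0
        for i, bit in enumerate(b1):
            state ^= bit                  # accumulator: encoded bit = new state
            weight += bit * ba1[i] + state * ba2[i]
        for i, bit in enumerate(b1):
            if bit:
                ones[i] = max_star(ones[i], weight)
            else:
                zeros[i] = max_star(zeros[i], weight)
    return [ones[i] - zeros[i] - ba1[i] for i in range(n)]
```

The trellis-based recursions need a number of operations that grows linearly with the frame length, whereas the brute-force reference grows exponentially, which is the practical motivation for the forward and backward recursions. The LLR values used to exercise the sketch are arbitrary and are not the example values of the figures.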
F. Algorithmic Modifications to the Log-BCJR Decoder
The Log-BCJR algorithm is universally preferred for implementation over the BCJR algorithm owing to its reduced computational complexity. More specifically, the BCJR algorithm operates in the normal domain, requiring addition and multiplication operations for calculating the bit probabilities. Since these probabilities have a high dynamic range, a large number of bits are required for their digital representation. By converting the equations of the BCJR algorithm into the logarithmic domain, the Log-BCJR algorithm replaces multiplications with additions, and replaces additions with the $\max^{*}$ operation. These operations have a lower computational complexity, and representing the probabilities in the logarithmic domain requires fewer bits.
As shown in Section II-E, the max* operation of the Log-BCJR algorithm is defined by max*(A, B) = max(A, B) + f(|A − B|), where the correction term is given by f(|A − B|) = ln(1 + e^{−|A−B|}). Since the logarithmic and exponential functions of f(|A − B|) are costly to implement in hardware, they are often approximated in practical applications of TCs. In the Maximum Log-BCJR (Max-Log-BCJR) approximation [36] of the Log-BCJR algorithm, max*(A, B) is approximated using max(A, B). As shown in Fig. 9, the value of f(|A − B|) is always in the range [0, 0.69], which is typically small compared to max(A, B), justifying this approximation. The Max-Log-BCJR approximation imposes a low computational complexity, but its error correction capability is lower than that of the original Log-BCJR algorithm [34]. This motivates the conception of a Look-Up-Table based Log-BCJR (LUT-Log-BCJR) algorithm [37], which uses a Look-Up Table (LUT) for approximating f(|A − B|). As shown in Fig. 9, the range of |A − B| values for which f(|A − B|) has a significant value is limited, meaning the LUT size can be small. Fig. 9 shows how as few as four values given by {0, 0.25, 0.5, 0.75} can be used for approximating f(|A − B|), hence offering an error correction capability for the LUT-Log-BCJR which approaches that of the Log-BCJR algorithm [37], as shown in Fig. 15.
Both the Max-Log-BCJR and the LUT-Log-BCJR algorithms can be implemented using only ACS operations. Firstly, the max(A, B) operation is performed by comparing A and B, and selecting the larger value. Based on the knowledge of max(A, B), the subtraction |A − B| of the LUT-Log-BCJR can be ordered so that a non-negative number is returned. By comparing this result to the boundary points of the LUT, the approximate value of f(|A − B|) can be selected, and then added to the value of max(A, B).
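The three variants of the max* operation discussed above can be compared using a minimal Python sketch. The four correction values are those quoted in the text, but the boundary points of the LUT are not specified there. The sketch therefore assumes that each boundary lies where f(|A − B|) crosses the midpoint between two adjacent table entries. All function names are illustrative only.

```python
import math

# Four correction values quoted in the text for the LUT approximation.
LUT_VALUES = (0.75, 0.5, 0.25, 0.0)

# ASSUMPTION: each boundary is placed where f(d) = ln(1 + e^-d) crosses the
# midpoint between adjacent table entries; the source does not specify them.
LUT_BOUNDARIES = tuple(
    -math.log(math.expm1((hi + lo) / 2.0))
    for hi, lo in zip(LUT_VALUES, LUT_VALUES[1:])
)


def correction(d):
    """Exact correction term f(d) = ln(1 + e^-d) for d = |A - B| >= 0."""
    return math.log1p(math.exp(-d))


def max_star(a, b):
    """Exact max* operation of the Log-BCJR algorithm."""
    return max(a, b) + correction(abs(a - b))


def max_log(a, b):
    """Max-Log-BCJR approximation: the correction term is dropped."""
    return max(a, b)


def lut_max_star(a, b):
    """LUT-Log-BCJR approximation built from compare, select, subtract, add."""
    # Compare and select first, so that the subtraction is non-negative.
    if a >= b:
        largest, diff = a, a - b
    else:
        largest, diff = b, b - a
    # Compare the difference against the LUT boundary points.
    for boundary, value in zip(LUT_BOUNDARIES, LUT_VALUES):
        if diff < boundary:
            return largest + value
    return largest + LUT_VALUES[-1]
```

Under the assumed boundary placement, the LUT result stays within 0.125 of the exact max* value, whereas the Max-Log result can deviate by up to ln 2 ≈ 0.69 when A = B.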
As shown in Section II-E, the α(S) and β(S) calculations require forward and backward recursions, respectively, owing to the data dependencies between consecutive states. Owing to this, the Log-BCJR and its variants are not naturally suited to parallel processing. Furthermore, a large amount of memory is required, since the α(S) and β(S) values are calculated in different directions along the trellis. More specifically, in order to generate the first output extrinsic LLR $\tilde{b}^{e}_{1,1}$, it is necessary to have first calculated the β(S) values for every state in the trellis and then to store them for the calculation of the subsequent output extrinsic LLRs.
An appealing technique for overcoming the data dependency issue is to decompose the trellis into N/w_s smaller windows [38], each having the above-mentioned length w_s. The Log-BCJR algorithm (or one of its approximations) can be applied to each window independently, significantly reducing the memory required for storing metrics. However, with this approach, it is necessary to initialize the α(S) values of the states at the left end of each window, as well as the β(S) values of the states at the right end. If the windows are processed sequentially in a left-to-right ordering, the boundary α(S) values can be passed from the right end of each window to the left end of the subsequent window. However, this approach cannot supply boundary β(S) values for the right end of each window, requiring a pre-backward recursion to generate these boundary conditions [39]. This technique generates boundary conditions by starting to calculate the β(S) values ahead of the window, then carrying out a backward recursion towards the edge of the window. The first β(S) values used by the pre-backward stage are initialized to zero, and the pre-backward length w_p is chosen for ensuring that the β(S) values generated at the boundary of the window converge to the corresponding values of the non-windowed Log-BCJR algorithm. Further detailed reading on the pre-backward technique is available in [40]. Other windowing techniques include the Previous Iteration Value Initialization (PIVI) technique of [39], [41], which is also known as State-Metric Propagation (SMP) [42]. This avoids the extra computation associated with the pre-backward step by initializing the windows during the current turbo decoding iteration using the boundary conditions 'inherited' from the previous iteration.
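The pre-backward recursion described above may be sketched as follows. This is a simplified illustration rather than the implementation of [39] or [40]: it uses the Max-Log approximation, it assumes that the γ values of each trellis stage are supplied as a hypothetical matrix indexed by (state, next state), and all function names are invented for this example.

```python
def backward_step(beta_next, gamma_k):
    """One Max-Log backward step: beta[s] = max over s2 of gamma_k[s][s2] + beta_next[s2]."""
    return [max(g + b for g, b in zip(row, beta_next)) for row in gamma_k]


def full_backward(gammas, n_states):
    """Non-windowed backward recursion, storing the beta values of every stage."""
    beta = [0.0] * n_states               # far right of the trellis is set to zero
    betas = [None] * len(gammas) + [beta]
    for k in range(len(gammas) - 1, -1, -1):
        beta = backward_step(beta, gammas[k])
        betas[k] = beta
    return betas


def window_boundary_beta(gammas, window_end, w_p, n_states):
    """Estimate the beta values at the right edge of a window.

    The recursion starts w_p stages ahead of the window edge (or at the end of
    the trellis, whichever comes first) from all-zero beta values, and runs
    backwards until it reaches the edge at index window_end.
    """
    start = min(len(gammas), window_end + w_p)
    beta = [0.0] * n_states
    for k in range(start - 1, window_end - 1, -1):
        beta = backward_step(beta, gammas[k])
    return beta
```

When the pre-backward recursion reaches the end of the trellis, the boundary values are identical to those of the non-windowed recursion. For interior windows they are only an estimate, whose accuracy depends on the choice of w_p.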
G. Turbo Code Performance
When analyzing the performance of error correcting codes, the BER of the code is typically plotted against the SNR per bit $E_b^{rx}/N_0$, where $E_b^{rx}$ is the energy received per message bit. A TC's BER plot can be used for determining the minimum $E_b^{rx}/N_0$ required for reliable communication. Fig. 10 provides a BER plot for an R = 1/3 LTE turbo code, which uses the schematic shown in Fig. 4. Fig. 10 shows that the error correction performance improves with successive iterations of the decoder, until about 8 iterations have been completed. Beyond this convergence point, however, there are diminishing returns, resulting in very little further improvement.
A specific feature of turbo codes is that they perform better with the aid of longer interleavers. Fig. 11 shows the attainable BER performance for the message lengths of N = 40, 440 and 6144 bits, as well as for the uncoded BPSK case. While all of the turbo coded schemes offer an improved BER for $E_b^{rx}/N_0$ values above 0 dB, the longer frame lengths have a much steeper cliff than the shorter ones. Owing to this, shorter frame lengths N correspond to higher $E_b^{rx}/N_0$ requirements for achieving reliable communication. Fig. 11 shows that the LTE TC provides a coding gain $G_c$ of around 8 dB over the uncoded scheme, which equates to a corresponding transmission energy saving at the transmitter.
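As a worked conversion of the coding gain quoted above, the decibel figure can be expressed as a linear energy ratio. This is simple arithmetic under the assumption that all other link parameters are unchanged:

```latex
\[
G_c = 8~\text{dB}
\quad\Longrightarrow\quad
10^{G_c/10} = 10^{0.8} \approx 6.3 .
\]
```

In other words, a coding gain of about 8 dB corresponds to the coded scheme requiring roughly 6.3 times less energy per message bit than the uncoded scheme for the same target BER.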
Using (3), we can express the transmission energy per message bit $E_b^{tx}$ required to achieve a particular target BER as
where all quantities are expressed in dB, and $S_t$ is the minimum SNR per bit $E_b^{rx}/N_0$ that is required to achieve the target BER.
III. TURBO DECODER ARCHITECTURES
In this section, we will commence by reviewing the existing approaches to low-power design for turbo decoders. We shall then narrow our focus to three major areas for formulating design considerations. Firstly, Section III-A considers the most significant challenges in energy-efficient datapath design, as well as architectural solutions to these. Secondly, Section III-B considers the issues of algorithm control, where the scheduling of the decoder by the controller will be investigated, under the consideration of beneficial modifications to the algorithm that achieve a lower energy consumption at a minimal loss of error correction performance. This performance loss can then be considered during the holistic design stage, when minimizing the overall energy consumption of the system, as shown in Fig. 3. Finally, the various aspects of energy-efficient memory usage are discussed in Section III-C.
Table I shows a range of ASIC turbo decoder architectures disseminated in the literature, which have been designed for meeting a variety of design goals. In particular, the authors of [50] designed their low-dissipation architecture for low-throughput applications, where the energy consumption of the receiver is of primary concern. This architecture employs the LUT-Log-BCJR of [37], which provides a superior error correction performance and a reduced transmission energy compared to the faster, less complex Max-Log-BCJR [36] approximation. Low-throughput turbo decoders also tend to have a reduced chip area, which results in a reduced static energy consumption and a reduced cost, which is often a concern in these applications. This is in contrast to conventional turbo decoder architectures [48], [49], [54], which are typically designed for bandwidth-constrained applications, such as cellular telephony, WLAN and broadcast systems.
More specifically, these architectures are designed to have a high processing throughput, in order to match the high transmission throughputs that are sought in these applications. As a trade-off, these architectures use the Max-Log-BCJR, whose simpler approximation of the max* calculation supports higher throughputs, but comes at the expense of both a degraded BER performance and an increased transmission energy requirement. Section III-B below discusses this trade-off, as well as methods aimed at mitigating the performance loss. This section will concentrate on conventional decoders. However, alternative approaches have also been proposed for implementing the BCJR algorithm, which will be briefly discussed here. Firstly, stochastic decoders [55] represent each LLR as a series of bits, where the value of the sequence is represented by how many '1's or '0's there are in the sequence. In contrast to the conventional fixed-point binary representation, each bit in a stochastic sequence has the same significance. During decoding, each bit in these LLR sequences is processed sequentially by the stochastic decoder. The decoder only processes one bit of each LLR in each clock cycle, which results in a significant reduction of the number of gates required in the decoder. However, since long LLR sequences are required for a high error correction performance, stochastic decoders typically require many more clock cycles than a conventional decoder, hence resulting in lower throughputs.
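The trade-off between sequence length and precision in a stochastic representation can be illustrated with a short Python sketch. It assumes, purely for illustration, that the quantity carried by the bit sequence is a probability in [0, 1] encoded as the fraction of '1's. The mapping used by the decoders of [55] is not given in the text, and the function names are hypothetical.

```python
import random


def to_stochastic(p, length, rng):
    """Encode a value p in [0, 1] as a bit sequence whose fraction of 1s is about p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_stochastic(bits):
    """Recover the value as the fraction of 1s; every bit has the same significance."""
    return sum(bits) / len(bits)


if __name__ == "__main__":
    rng = random.Random(0)
    for length in (16, 256, 4096):
        estimate = from_stochastic(to_stochastic(0.3, length, rng))
        print(length, round(estimate, 3))
```

Since each bit carries equal weight, the precision improves only slowly with the sequence length, which is consistent with the large number of clock cycles mentioned above.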
Another alternative architecture is constituted by the family of analog turbo decoders [56]. In these architectures, soft information is represented with the aid of analog currents, while the various operations of the decoder are performed … For example, the difficulties in matching analog circuits on a large scale lead to a potential performance degradation [57]. Furthermore, accurately simulating the BER performance of the circuit before its fabrication is not feasible. In [58], an analog architecture that supports long frames is described, although this is associated with other challenges. In particular, a sampling circuit is required at each input of the decoder, which holds the analog value constant during decoding. However, these analog values cannot be readily maintained for extended periods of time, hence affecting the achievable error correction performance.
In the following subsections, we consider three salient aspects of conventional digital decoders, namely the design issues of the data path, of the controller and of the memory.
A. Data Path Considerations
Some of the designs listed in Table I rely on architectures that were designed for meeting the requirements of the latest telephony standards, resulting in optimizations for very high throughputs. These conventional architectures typically employ dedicated modules for each of the different steps in the LUT-Log-BCJR decoding algorithm. More specifically, they use separate hardware for calculating each of the α, β and δ values and the extrinsic LLRs. However, this can result in a long critical path in the hardware implementation, which precludes having a high processing energy efficiency for the following three reasons:
1) Firstly, a lengthening of the critical path implies a greater variety of data path lengths. The differences amongst the data path lengths in the circuit may impose significant energy wastage owing to spurious transitions (glitches) [59]. Indeed, spurious transitions may account for a significant part of the dynamic energy consumption of ASIC implementations [60]. Reducing spurious transitions requires the lengths of the paths that converge at each register in the circuit to be roughly equal.
2) Secondly, a long critical path prevents the decoder from employing a high clock frequency. In order to implement the conventional LUT-Log-BCJR architecture at a high clock frequency, it is necessary to employ additional hardware during the synthesis for the sake of shortening the critical path. This is achieved by employing more complex circuits, such as the 'look-ahead adder', for shortening the long datapaths. Unfortunately, this increases the chip area of the datapaths, hence resulting in a higher EC. On the other hand, operating at a lower clock frequency in order to avoid introducing this additional hardware would result in some of the hardware resources associated with shorter datapaths remaining idle for longer, hence increasing the static EC. The energy wasted by the static EC becomes more and more significant when the process technology is scaled down [61].
3) Thirdly, the high complexity of the conventional architecture imposed by its circuits dedicated to the different tasks increases the requirements imposed on the clock tree and on the buffers for multiple input signal loads [62]. Hence, this may impose a significant additional energy dissipation on the decoder.
On this basis, we shall now discuss a pair of techniques, which can be employed for mitigating the energy inefficiencies inherent in designs having a long critical path.
The first method we will discuss is pipelining, which is employed extensively within the architectures of [48], [51], [52]. Pipelining reduces the critical path between two registers by inserting additional registers in the middle of this path. This shortens the paths so that a higher clock frequency can be employed, but also adds latency to the circuit, since the number of clock cycles required before a result is available is increased for every pipeline stage that is added. This can therefore slow down a circuit's operation, if one part has to wait for a pipelined calculation to become available. Fig. 12 shows an example of pipelining in the turbo decoder of [51], which uses a similar decoder core to that proposed by the authors of [46], [49]. High-throughput turbo decoders, such as those proposed by [49], [52], [53], typically employ a multitude of these cores in parallel. The architecture of Fig. 12 employs separate hardware units for calculating the α (forward state-metrics) and β (reverse state-metrics), each having dedicated hardware for generating the γ values. Since this architecture utilizes windowing, a separate dummy state-metric-recursion unit is used for generating the boundary conditions of the windows, as described in Section II-F. This parallelization within each decoder core facilitates higher throughputs than the alternative approaches. To perform the pipelining, registers are placed between the branch metric computation units that are used for calculating the γ values, as well as between the ACS Units that are used for calculating the α or β values. Note that due to their recursive nature, no pipelining can take place within the ACS Units. This is because the values for one bit depend on those of an adjacent bit, which are calculated in the preceding clock cycle. Adding pipelining to the ACS Unit would therefore increase the number of cycles it takes for a new value to be calculated, hence slowing down the operation of the decoder, rather than speeding it up.
With careful pipelining, the critical paths in a design can be kept short and the path lengths can be kept similar, therefore mitigating the previously mentioned impediments. However, as mentioned above, pipelining cannot be used in the recursive parts of the BCJR algorithm, and the additional chip area as well as the EC associated with the pipeline registers must also be considered.
Building upon the pipelining philosophy, we shall now focus our attention on the turbo decoder architecture of [50], which is shown in Fig. 13. This architecture has been specifically designed for a low processing EC in energy-constrained wireless communication applications, such as WSNs and the IoT. The philosophy of this architecture is to redesign the timing of the conventional architecture into a series of small steps, each with the same length, in a similar manner to that which is achieved when adding pipeline stages. In contrast to the high-throughput architectures discussed previously, where all of the pipelined stages are performed at the same time, the architecture of [50] sequentially carries out the operations using small functional units. This produces an architecture comprising only a low number of inherently low-complexity functional units, which are collectively capable of implementing the entire LUT-Log-BCJR algorithm at a high hardware efficiency. Further wastage is avoided, since the critical paths of the functional units are naturally short and have a similar length, hence eliminating the requirement for additional hardware to manage them.
… [50], this ACS Unit can perform a LUT-max* operation over four clock cycles, when external control logic is used for correctly sequencing the ACS Unit. When calculating the LUT-max*, the status flags C[2:0] hold the result of the LUT operation, which is then used for selecting which of the four quantized values gleaned from Fig. 9 is added on to the result in the final cycle of the LUT-max* calculation.
Due to the short critical path and owing to the serial nature of this approach, it naturally results in a low chip area and a high clock frequency, which implies having a low static EC. The architecture is based on the fact that the LUT-Log-BCJR comprises only addition, subtraction and max* operations, which can be further decomposed into three fundamental operations, namely the ACS operations, as shown in Section II-E.
Fig. 15. Error correction performance of 6144-bit turbo decoders employing extrinsic scaling (Max-SE-Log-BCJR), the Max-Log-BCJR and the LUT-Log-BCJR, in relation to that offered by the exact Log-BCJR. A LUT comprising 8 entries was used for the LUT-Log-BCJR, and a scaling factor of 0.7 was employed for the Max-SE-Log-BCJR.
B. Algorithm Control
In this section, we consider the control of the architecture, where the controller instructs both the datapath and the memory to carry out a particular sequence of operations, in order to implement the algorithm.
In the LUT-Log-BCJR algorithm, the basic operation that imposes the highest computational overhead is the LUT-max* operation [37]. This is of particular concern in high-throughput decoders, where the max* calculation is used within the forward- and backward-recursive loops, preventing its pipelining for speeding up the decoder, as described in Section III-A. By contrast, the low-power, low-throughput architecture of Fig. 13 does not suffer from this problem, since it performs all algorithmic steps using the same set of functional units, which are all capable of performing the same tasks, rather than having dedicated hardware for each part of the Log-BCJR algorithm. Owing to this, there are no parts of the decoder that are required to wait while another part completes the operation of a slower task.
It is therefore desirable to favour the Max-Log-BCJR over the LUT-Log-BCJR in applications requiring a higher throughput. However, the naive employment of the Max-Log-BCJR results in a performance loss when compared to the LUT-Log-BCJR. This motivates the employment of a technique known as extrinsic LLR scaling, which is capable of mitigating some of this performance loss [46], [63]. Fig. 15 compares the error correction performance of the Log-BCJR, LUT-Log-BCJR, Max-Log-BCJR and Maximum with Scaled Extrinsic Log-BCJR (Max-SE-Log-BCJR) decoders. It can be seen that the extrinsic scaling technique improves the performance, bringing it within a small margin of 0.1 dB of that offered by the Log-BCJR algorithm. This is a typical margin that may be observed for other turbo code parameterizations designed for communicating over AWGN and Rayleigh fading [64] channels.
The Max-SE-Log-BCJR decoder relies on multiplying the extrinsic LLR output of the decoder blocks in the receiver by a constant value of less than 1. This represents a reduction of confidence in the extrinsic LLRs, which is due to the non-optimal implementation of the max* calculation. The authors of [65] discuss the optimal selection of this constant, which is found to be between 0.6 and 0.8, depending on the SNR at the receiver. However, practical implementations tend to use a fixed scaling value [64]. A typical choice for the extrinsic scaling factor is one that leads to a simple hardware implementation using just adders. For example, a scaling factor of 0.75 can be achieved using fixed-point arithmetic by simply adding the extrinsic output right-shifted once to the extrinsic output right-shifted twice.
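The shift-and-add scaling described above can be sketched in a few lines of Python, assuming that the extrinsic LLR is held as a signed integer in a fixed-point format. The function name is illustrative only.

```python
def scale_by_three_quarters(llr):
    """Approximate 0.75 * llr for an integer fixed-point LLR.

    Uses one right shift by one place (0.5 * llr), one right shift by two
    places (0.25 * llr) and a single addition. Python's >> operator on
    negative integers is an arithmetic shift, as in two's complement hardware.
    """
    return (llr >> 1) + (llr >> 2)


if __name__ == "__main__":
    for value in (8, 100, -8, 7):
        print(value, scale_by_three_quarters(value))
```

Since each shift discards the low-order bits, the result is only exact when the input is a multiple of four. For example, an input of 7 yields 4 rather than 5.25.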
Extrinsic LLR scaling is also used in the Max-Log-BCJR architecture of [46] , resulting in a 45% reduction in area and a 50% improvement in throughput, when compared to a similar architecture, which uses the LUT-Log-BCJR algorithm instead. The reduction in the number of logic gates required for the max * calculation also results in a reduced EC.
As described above, the use of extrinsic LLR scaling in conjunction with the Max-Log-BCJR results in an error correction performance loss relative to the LUT-Log-BCJR decoder. This equates to more transmit energy being required, but offers the advantage of requiring lower decoding energy. Note that the holistic design method discussed in Section V will address these conflicting design choices. This conflict demonstrates the importance of considering both the architecture and the algorithm jointly, since a holistic design approach facilitates striking the right balance between the algorithm and the architecture, resulting in the lowest overall EC and the best overall performance for the system.
Another beneficial technique for the implementation of turbo decoders is the radix-4 transformation of [44], [52], which combines two trellis stages into a single one. Owing to this, the decoder considers twice the number of a priori LLRs at once and the number of transitions emerging from each state of the radix-4 trellis is squared. However, this technique halves the number of state metrics that have to be calculated and stored, since it halves the number of stages in the trellis. In the most common case, where only two transitions emerge from each state, the total number of transitions per frame will remain constant. This leads to a moderate area increase for radix-4 decoders over radix-2 decoders [49], partly because more ACS operations per transition are required when considering several transitions at once. The main advantage of radix-4 decoders is that by traversing two trellis stages at once, the degree of parallelism can be doubled, hence facilitating higher throughputs.
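The counting argument above can be checked with a short Python sketch. It assumes a binary trellis in which two transitions emerge from each state, an even number of stages and, purely as an example, 8 states and a 6144-stage frame. The function name is hypothetical.

```python
def trellis_counts(n_stages, n_states, radix):
    """Return (stages, transitions per state, total transitions) of a trellis.

    ASSUMPTION: the radix-2 trellis has two transitions emerging from each
    state, so merging two stages squares this number to four.
    """
    if radix == 2:
        stages, per_state = n_stages, 2
    elif radix == 4:
        if n_stages % 2:
            raise ValueError("radix-4 merging needs an even number of stages")
        stages, per_state = n_stages // 2, 2 ** 2
    else:
        raise ValueError("only radix 2 and radix 4 are sketched here")
    return stages, per_state, stages * n_states * per_state


if __name__ == "__main__":
    print(trellis_counts(6144, 8, 2))
    print(trellis_counts(6144, 8, 4))
```

The number of stages, and hence the number of state metric sets, is halved, while the total number of transitions per frame is unchanged.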
There are a number of other techniques that may be employed in turbo decoder implementations, as follows.
• Early stopping [46] , [53] , which terminates the turbo decoding process early, if the correct bit-stream is unlikely to be found, thus saving energy. This technique considers the values of the LLRs, and detects if their quality no longer improves in successive decoding iterations, indicating that the remaining errors in the message will not be corrected. Furthermore, early stopping can also stop the iterative decoding process once the correct message is found, as verified using a Cyclic Redundancy Check (CRC).
• Modulo normalization [46] , [51] , which allows the state metrics to overflow, relying on the nature of the two's complement arithmetic to correct this overflow, instead of requiring a larger number of bits to represent these metrics. An additional logic gate is required for the max logic, in order to allow it to correctly process numbers, which have experienced an overflow.
• Voltage scaling [46], [49], which reduces the supply voltage when the throughput requirements are lower, or when fewer iterations are required because the SNR is higher, resulting in a reduced energy consumption.
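Of the techniques listed above, modulo normalization lends itself to a compact illustration. The following Python sketch emulates two's complement wrap-around for a hypothetical metric width and selects the larger of two wrapped metrics by inspecting the sign of their wrapped difference. It assumes that the true difference between the two metrics is smaller than half the representable range, and it is not taken from the designs of [46] or [51].

```python
def wrap(value, bits):
    """Wrap an integer into the two's complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        value -= 1 << bits
    return value


def modulo_max(a, b, bits):
    """Select the larger of two wrapped metrics.

    The decision uses the sign of the wrapped difference rather than a direct
    comparison, so it remains correct after one metric has overflowed, as long
    as the true difference is below 2**(bits - 1).
    """
    return a if wrap(a - b, bits) >= 0 else b


if __name__ == "__main__":
    bits = 8
    small, large = wrap(120, bits), wrap(135, bits)   # 135 overflows to -121
    print(max(small, large), modulo_max(small, large, bits))
```

A direct comparison would wrongly pick 120 here, whereas the modulo comparison picks the wrapped representation of 135.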
C. Memory Considerations
Turbo decoder architectures require a large amount of memory for their operation. This memory is required for storing the a priori LLRs, the extrinsic LLRs generated by each of the Log-BCJR decoders and the intermediate α or β values of the Log-BCJR decoder, as discussed in Section II-E. While Section II-F discussed beneficial techniques, such as windowing for reducing the required memory, frequent access will still be required of this memory. Since accessing this memory dissipates energy [66] , having an EC comparable to that of the datapath [50] , it is desirable to minimize the number of memory accesses, in order to reduce the overall EC of an architecture.
To address this issue, the architecture shown in Fig. 13 additionally employs two register banks, namely Regbank1 and Regbank2, which act as a cache memory between the main memory and the processing units. The combined usage of both the dedicated registers and of the register banks allows an entire Log-BCJR stage of the trellis to be processed without requiring access to the main memory. A similar approach is pursued in [67] , where a cache memory is employed between the LLR memories and the decoder. This reduces the required number of memory accesses, since each of the hardware blocks for the α, β and output LLR units access the cache rather than directly accessing the main memory.
As described in Section II-F, the Log-BCJR algorithm's data dependencies require an entire forward-recursion or backward-recursion to be carried out before any extrinsic LLRs can be generated. This gives rise to the memory requirement for storing the α or β values calculated during this recursion. The authors of [52] have proposed an additional method for reducing the storage requirement of state metrics during this initial forward- or backward-recursion. This 're-computation' method reduces the number of values stored in the memory during the initial recursion, which is achieved by storing only every nth set of state metrics. However, this requires the missing state metrics to be recalculated as and when needed, during the subsequent pass through the trellis. The implementation advocated in [52] opted for storing every 6th set of state metrics, since it was found that the extra hardware required for the re-computation circuit occupied a smaller area than the memory, which would otherwise have been required.
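The re-computation idea can be sketched generically in Python. The recursion is abstracted as a hypothetical `step` function that maps one set of state metrics and one stage input to the next set of state metrics. This is an illustration of the principle only, not the circuit of [52].

```python
def store_checkpoints(initial, step, inputs, n):
    """Run the recursion once, keeping only every n-th set of state metrics."""
    checkpoints = {0: initial}
    state = initial
    for k, x in enumerate(inputs, start=1):
        state = step(state, x)
        if k % n == 0:
            checkpoints[k] = state
    return checkpoints


def recompute(checkpoints, step, inputs, n, k):
    """Recover the state metrics at index k from the nearest earlier checkpoint."""
    base = (k // n) * n
    state = checkpoints[base]
    for j in range(base, k):           # at most n - 1 extra recursion steps
        state = step(state, inputs[j])
    return state
```

With n = 6, as in the example quoted above, only about one sixth of the state metric sets are stored, at the cost of up to five additional recursion steps whenever an unstored set is needed.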
For any design, the required amount of memory storage and the number of memory accesses can be traded off against the requirement of repeating the computation of unstored values in the decoder. However, as a minimum, the a priori LLRs have to be fetched from memory into the Log-BCJR decoder, while the extrinsic LLRs have to be stored from the Log-BCJR decoders into memory. Values that require minimal computation, such as the γ values, which typically necessitate no more than a single addition per transition, may be readily recomputed as and when required. Conversely, due to the data dependencies, memory will be required for at least some of the forward- or backward-recursion values, so that they can be stored until they are needed for the duration of a window.
In high-throughput decoders that employ parallelization by concurrently operating multiple decoder cores, accessing the shared LLR memories may cause contention. As described in Section II-B, the interleavers within the turbo decoder dictate the memory accesses of the decoder cores. In particular, the interleavers enforce the requirement for the a priori and extrinsic LLR memories to be shared between the decoder cores, rather than having independent LLR memories for each of the decoder cores. In the case where there are M decoder cores, it is desirable for the LLR memories to be split into M separate memory blocks, with the interleaver designed for ensuring that only one decoder core requires access to each memory block at a time. An interleaver that meets this criterion for some values of M is said to be contention-free [68]. However, an interleaver which is not contention-free will cause inefficiencies in the decoder, since some of the decoder cores will have to stall their operation while they wait for their turn to access the memory.
While contention-free interleavers allow the LLR memories to be broken into separate memory blocks, the address decoding logic has to be duplicated for each of these memory blocks, hence increasing both the chip area and the associated EC. It is therefore also desirable for each decoder core to fetch or store the LLRs using the same address within its corresponding block from the set of M memory blocks. This interleaver design allows contention-free memory accesses to be implemented using a single address decoding circuit, since each decoder core uses the same address. As shown in Fig. 16, each decoder core carries out its fetching or storing action using a different memory window, but the index used within each window is the same. A pair of specific interleaver designs that meet both of these criteria are the so-called ARP and QPP interleavers [68]. The QPP design was chosen for the LTE standard [1]. The specific interleaver design has a significant effect on the BER performance of a turbo code, hence requiring a careful design of the interleaver for meeting the contention-free implementation requirement, as well as the BER performance requirement [69]. The authors of [49], [70] demonstrated how to facilitate contention-free memory accesses, where a permutation network is employed for routing the LLRs between the memory and the decoder cores.
Fig. 16. Contention-free interleaver using the same indices within each window, where address i is interleaved to yield the address π(i). The interleaving pattern is shown for two sets of addresses.
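Both criteria discussed above can be checked numerically for a small QPP interleaver. The sketch below uses the quadratic permutation polynomial form π(i) = (f1·i + f2·i²) mod N, which is not spelled out in the text. The parameters N = 40, f1 = 3 and f2 = 10 are believed to match the shortest LTE block length, but they should be treated as an assumption of this example, as should the function names.

```python
def qpp(i, n, f1, f2):
    """Quadratic permutation polynomial: pi(i) = (f1*i + f2*i^2) mod n."""
    return (f1 * i + f2 * i * i) % n


def check_contention_free(n, f1, f2, m):
    """Check both memory-access criteria for m decoder cores.

    The n addresses are split into m windows (memory blocks) of length n // m.
    For every index i within a window, the m cores read positions i + p*w.
    After interleaving, the targets must (1) fall into m distinct memory
    blocks and (2) share the same index within their respective blocks.
    """
    w = n // m
    for i in range(w):
        targets = [qpp(i + p * w, n, f1, f2) for p in range(m)]
        blocks = {t // w for t in targets}
        offsets = {t % w for t in targets}
        if len(blocks) != m or len(offsets) != 1:
            return False
    return True


if __name__ == "__main__":
    n, f1, f2 = 40, 3, 10              # ASSUMED example parameters
    print([qpp(i, n, f1, f2) for i in range(n)])
    print({m: check_contention_free(n, f1, f2, m) for m in (2, 4, 5)})
```

For these parameters, both criteria hold for M = 2, 4 and 5 cores, so a single address decoder could in principle serve all M memory blocks.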
IV. PROCESSING ENERGY CONSUMPTION ESTIMATION
In this section, we will discuss various techniques invoked for characterizing the expected EC of a turbo decoder architecture. Referring to Fig. 3, accurate estimation of the processing EC is important for the holistic design process, since the processing EC typically makes a similar contribution to the overall EC as the transmission energy in energy-constrained scenarios. Indeed, in some applications, the processing energy can actually exceed the transmission energy. We commence by briefly considering the most common design characterization methods, before focusing our attention on the method of [27]. This work parameterizes the estimated EC per bit per iteration, therefore aiding the joint design of the architecture and the algorithm, as will be further discussed in Section V. The energy per bit per iteration is employed as the metric for comparing the EC of different architectures, since it is independent of the algorithm which is being operated. Furthermore, the EC per bit metric is preferred over the power consumption per bit, since the EC is independent of the decoder's throughput.
For the majority of the ASIC architectures proposed by the authors listed in Table I , the EC of the architecture is obtained from post-layout simulations. The EC per bit per iteration can then be readily derived by taking into consideration both the throughput and the number of iterations employed. However, when characterizing the EC as a function of the TC parameters, the above-mentioned approach has the disadvantage of having to modify the design and to rerun the post-layout simulations for each of the different parameters that are considered during the holistic optimization. Table I lists a range of architectures designed for a variety of applications, resulting in a diverse range of throughputs and EC figures. These EC results are obtained from simulations using only a single particular parameterization of the design.
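One plausible reading of this derivation is sketched below in Python: the energy per decoded bit is the power consumption divided by the decoded throughput, which is then divided by the number of iterations. The formula, the function name and the numerical values are assumptions made for illustration and are not taken from Table I.

```python
def energy_per_bit_per_iteration(power_w, throughput_bps, iterations):
    """Energy in joules per decoded bit per decoding iteration.

    ASSUMPTION: throughput_bps is the decoded bit rate achieved when the
    stated number of iterations is performed for every frame.
    """
    return power_w / (throughput_bps * iterations)


if __name__ == "__main__":
    # Hypothetical decoder: 100 mW at 10 Mbit/s using 5 iterations.
    e = energy_per_bit_per_iteration(0.1, 10e6, 5)
    print(f"{e * 1e9:.2f} nJ/bit/iteration")
```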
By contrast, a different framework was proposed in [27] for estimating the EC of a Log-BCJR decoder as a function of its parameters, which can be generalized to any turbo decoder. The objective of this framework is to quantify the EC during the TC design stage, in order to assist the designer in selecting appropriate parameters for the code.
In order to provide accurate EC predictions, the authors of [27] stipulate some assumptions, which are based on the later implementation stages. In particular, the Integrated Circuit (IC) fabrication process technology [49], the supply voltage and the clock frequency of the implemented circuit can all have a significant impact on the EC. When the designer wishes to consider a range of technology nodes or supply voltages ($V_{dd}$), the chip area, throughput and energy consumption can be scaled according to the scaling rules as follows [53].
where s is the scaling factor between the two technology nodes, $t_{pd}$ is the propagation delay, and $P_{dyn}$ is the dynamic power consumption. Reducing $t_{pd}$ increases the clock frequency the IC can operate at, which results in an increased throughput. The power consumption reduces with the technology node, which results in a corresponding reduction of $E_b^{dec}$. This allows the energy analysis to be performed only once, and then scaled to allow holistic design decisions to be taken. The specific parameters which affect the overall EC are summarized in Table II. When using the technique of [27] for estimating the Log-BCJR decoder's EC, the designer has the ability to change these parameters, in order to investigate their impact on the EC.
In order to derive an overall EC estimate for a turbo decoder, the EC is divided into three main components, which are discussed below. Each of the corresponding steps focuses on one of the three areas discussed in Section III, namely on the datapath, on the scheduling of the decoder by the controller and on the memories.
1) Datapath Functional Unit Characterization:
The first step of the technique conceived in [27] is to analyze the EC of each of the sub-blocks that comprise the datapath of the architecture. More specifically, the energy used by the different sub-blocks as they perform the tasks of addition, subtraction and max * is characterized as a function of the related parameters. It was found in [27] that the complexity of some sub-blocks varies according to some of the parameters of Table II. In particular, the number of states in the decoder and the number of bits used for number representations have a significant effect upon the EC. It is therefore suggested that the Register-Transfer Level (RTL) design [71] of the functional units should be written in a way that allows the parameters to be readily changed, in order to characterize a whole range of EC results.
2) Timing Analysis: Next, the base operations undertaken by the decoder as instructed by the controller are analyzed for each time-step. More specifically, for a given set of turbo code parameters and a given set of implementation parameters of Table II, the total number of addition, subtraction, max * and idle operations undertaken by each of the functional units of the decoder can be characterized. This characterizes how often each of the operational modes of the datapath is used during decoding. It also allows the designer to promptly characterize the effect of the different parameters of Table II, which can be used in the ensuing steps to examine how the EC is affected by changing the parameters.
The results from the previous two steps can be combined to estimate the EC of the datapath, when considering a set of given parameters and a particular scheduling of operations. By multiplying the energy used per operation from step 1) by the number of operations per bit from step 2), an accurate estimate of the overall EC can be made. It was shown in [27] that this method of estimating the EC has at most 7% error, when compared against the EC simulation of the entire decoder.
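The combination of steps 1) and 2) may be sketched as a weighted sum, as shown below. This is only an illustration of the multiply-and-accumulate principle described above, not the implementation of [27]. The operation names, the per-operation energies and the operation counts are hypothetical placeholders.

```python
def datapath_energy_per_bit(energy_per_op, ops_per_bit):
    """Sum (energy per operation) x (operations per decoded bit).

    energy_per_op: dict mapping an operation name to its energy (step 1).
    ops_per_bit:   dict mapping an operation name to its count per bit (step 2).
    The result has the same energy unit as the values of energy_per_op.
    """
    missing = set(ops_per_bit) - set(energy_per_op)
    if missing:
        raise KeyError(f"no energy characterization for: {sorted(missing)}")
    return sum(energy_per_op[op] * count for op, count in ops_per_bit.items())


# Hypothetical characterization results in picojoules per operation
ENERGY_PER_OP_PJ = {"add": 0.5, "sub": 0.5, "max_star": 1.0, "idle": 0.25}
# Hypothetical operation counts per decoded bit for one parameterization
OPS_PER_BIT = {"add": 20, "sub": 8, "max_star": 12, "idle": 4}

datapath_pj = datapath_energy_per_bit(ENERGY_PER_OP_PJ, OPS_PER_BIT)
print(f"datapath EC: {datapath_pj} pJ/bit")
```

Since only the two dictionaries change between parameterizations, a new candidate can be evaluated without rerunning a simulation of the entire decoder.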
3) Memory Power Usage: The databook provided for the memories by the standard library developer [72] provides specifications, which allow the EC to be calculated. For a technology scale of l = 90 nm, the Taiwan Semiconductor Manufacturing Company (TSMC) 90 nm databook [72] states that the power consumption of a particular memory module size can be estimated by considering both the accessing rate a in units of accesses per clock cycle, as well as the clock frequency f and the supply voltage v. In the standard cell library, the power consumption of the SRAM used in the architecture can be estimated using the reference table of [72]. Here, the typical memory access power consumption P a and leakage current I l are given for memory blocks having various sizes and operand widths. The power consumption P a can be used for calculating the dynamic EC, when the memory is being accessed. Similarly, the leakage current I l can be used for calculating the static EC of the memory, when it is idle.
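A minimal sketch of how the databook quantities might be turned into a per-bit memory EC is given below. It rests on several assumptions that are not stated in the source: the access power P a is taken to be the power drawn when the memory is accessed in every clock cycle at the chosen f and v, the dynamic contribution is scaled linearly by the accessing rate a, and the leakage is counted over the whole decoding time. The function name, its parameters and all numerical values are hypothetical.

```python
def memory_energy_per_bit(p_access_w, access_rate, leakage_a, vdd_v,
                          clock_hz, cycles_per_bit):
    """Return (dynamic, static) memory energy in joules per decoded bit.

    p_access_w:     access power P_a, assumed quoted for one access per cycle
    access_rate:    accessing rate a, in accesses per clock cycle
    leakage_a:      leakage current I_l in amperes
    vdd_v:          supply voltage v in volts
    clock_hz:       clock frequency f in hertz
    cycles_per_bit: clock cycles spent per decoded bit (assumed known)
    """
    if access_rate < 0 or clock_hz <= 0 or cycles_per_bit <= 0:
        raise ValueError("invalid access rate, clock frequency or cycle count")
    time_per_bit_s = cycles_per_bit / clock_hz
    dynamic_j = access_rate * p_access_w * time_per_bit_s  # energy spent on accesses
    static_j = leakage_a * vdd_v * time_per_bit_s          # leakage power x time
    return dynamic_j, static_j


# Hypothetical values, not taken from the databook of [72]
dynamic_j, static_j = memory_energy_per_bit(
    p_access_w=1e-3, access_rate=0.5, leakage_a=1e-6,
    vdd_v=1.0, clock_hz=100e6, cycles_per_bit=2)
print(f"dynamic: {dynamic_j:.3e} J/bit, static: {static_j:.3e} J/bit")
```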
Similarly to the EC of the datapath and of the memories discussed above, the EC of the interleaver and of the controller may also have to be considered. The authors of [27] provided the analysis of the EC of these components. However, it was found that their contribution is minor compared to that of the datapath and memories. Furthermore, their EC per bit per Log-BCJR decoder activation is unlikely to change between different parameterizations. Owing to this, when making comparisons between two candidate scheme parameterizations, any error in the interleaver or controller EC estimation will be common to both schemes, hence having little effect on the comparison.
V. HOLISTIC DESIGN CHARACTERIZATION
In this section, we explore a range of methods capable of characterizing and holistically parameterizing an overall wireless communications system, while investigating the energy efficiency of different TCs and the effect of their parameters. We shall explore the techniques outlined by the authors of [4] and [27], showing how these techniques can be applied to a specific scenario and architecture, in order to demonstrate the holistic design approach and to show the effect of the various system parameters on the overall EC. By considering the energy consumption in both the transmitter and the receiver, the candidate TCs may be evaluated holistically for employment in energy-constrained applications, such as WSNs and the IoT. More specifically, the transmitter's energy consumption is comprised of the turbo encoder's processing energy consumption E enc b , the modulator's energy consumption E mod b and the PA energy consumption E tx b . Likewise, the receiver's energy consumption is comprised of the demodulator's energy consumption E dem b and the turbo decoder's processing energy consumption E dec b . The techniques discussed in this section are similar to various other examples of holistic characterization that are available in the literature [73] [74] [75]. For example, the authors of [75] considered the holistic optimization of cellular networks, while the authors of [73], [74] investigated whether Multiple-Input Multiple-Output (MIMO)-based sensor networks can provide energy savings over conventional networks.
The conventional design method optimizes the algorithm and architecture separately, without considering the processing EC at the receiver alongside the transmission EC. By contrast, the methods explored in this section allow the TC to be used for reducing the overall EC of a wireless communication system. As highlighted in Fig. 3 , the holistic design characterization bridges the algorithm design and the implementation design, allowing the parameters of each to be combined, when considering the performance of the eligible schemes for a particular design scenario.
The objective of the design methods described is to determine the particular parameterization of the TC design that optimizes the overall EC of the system over the range of operating conditions expected in a particular scenario. The component encoder of the design is specified by the parameters k, m and n, as well as by the generator polynomial. Furthermore, different turbo coding schemes may be employed, which may use different arrangements of the component encoders. For example, Multiple-Component Turbo Codes (MCTCs) [76] employ multiple parallel component encoders, where the number of encoders employed also becomes a parameter of the scheme. Further parameters to be considered are those, which relate to the hardware implementation, such as which max * approximation to utilize, as well as the number of bits used for representing the LLRs and other internal variables. Additionally, the number of decoding iterations performed also affects both the decoding EC E dec b and the minimum required transmission EC E tx b quite significantly. The holistic design approaches of [4] and [27] go beyond the approaches proposed by the authors of [76] , [77] . In these contributions, the decoder complexity is quantified by the number of operations undertaken in the decoder, which is related both to the number of Log-BCJR decoder activations and to the number of states in the trellis. This measure of complexity is used for representing the relative energy consumption of different codes. However, as shown in Section III, the absolute energy consumption heavily depends on the architecture, as well as on factors such as the amount of memory in the design. As an example, [27] shows that two different schemes having the same operations-based complexity have a 45% difference in their processing EC E dec b . Furthermore, schemes having a and E dem b . 
This illustrates that while the complexity-based comparison of [76] is useful for comparing the relative processing EC of schemes where the Log-BCJR decoders are similar, it does not allow the overall EC to be optimized, since it does not facilitate a fair comparison between different architectural parameterizations. This is because it has no knowledge of how the architecture performs the decoding, wherein different parameterizations of the architecture will cause different activation of blocks in the decoder. During the holistic optimization, the designer may also wish to compare the performance of different architectural parameterizations, which is not provided by the approaches disseminated in [76] , [77] .
In order to demonstrate the holistic design techniques, this tutorial considers a scenario, which is representative of a low-power, relatively low-throughput receiver, as is typical in WSNs and in the IoT. By using the energy estimation techniques discussed in Section IV, we may obtain a reliable estimation of the processing EC for different turbo code parameters. We shall consider a Twin-Component Turbo Code (TCTC) as discussed in Section II-D, as well as two MCTCs [76] having three and four constituent codes, which we refer to as 3MCTC and 4MCTC, respectively. The TCTC and 3MCTC schemes are both R = 1/3-rate codes, while the 4MCTC is an R = 1/4-rate code. These TCs employ the generator polynomials (17, 15) and (2, 3) o , respectively. A fourth scheme considered is provided by the uncoded case, which is associated with no TC processing energy E dec b , allowing us to explore the specific situations where using TCs is the most energy efficient. A number of different parameters will also be considered for each of these codes. Furthermore, we will investigate the effect of employing various approximations of the Log-BCJR algorithm, namely the LUT-Log-BCJR, the Max-Log-BCJR and the Max-SE-Log-BCJR [46]. In particular, we will explore which approximations are most appropriate, when attempting to reduce the overall EC. The number of iterations in the receiver will also be considered in the holistic characterization. Table III shows two different operating scenarios, which will be considered in this tutorial, representing a range of environmental factors faced by energy-constrained systems. The power consumption figures P mod and P dem are representative of those achieved by a particular low-power state-of-the-art transceiver [78]. While naturally, only a limited number of parameterizations are considered in this tutorial, the designer of a real communications system may wish to consider a wider range of candidate schemes.
For example, error correction codes such as LDPC [17], Repeat Accumulate (RA) [79], or Reed-Solomon (RS) [80] codes may provide a lower overall energy consumption, depending on the scenario. In particular, TCs outperform LDPC codes at lower coding rates [81], while LDPC and RA codes lend themselves to be conveniently implemented in parallel, albeit at the expense of a large chip area. Likewise, the designer may wish to consider Hybrid Automatic Repeat Request (HARQ) [82] as an alternative method of reducing the EC. While these rate-compatible schemes have been shown to reduce the transmission EC of a scheme, usually the EC of the decoder is not considered. The techniques detailed in this section can be extended to both HARQ and to other similar techniques; however, for reasons of space-economy, they are not discussed here.
A. Methodology
To evaluate the transmission energy required for each candidate scheme, first the BER requirement must be specified based on the target application. Here, a BER of 10^-5 is assumed as the maximum tolerable BER. Table IV shows the E rx b /N 0 required for achieving a BER of 10^-5 for each of the candidates considered in this tutorial. The corresponding BER simulation results of these candidate schemes are provided in [26], while the relative performance of different approximations of the Log-BCJR algorithm is taken from [64]. Next, the path loss model given in Section II-C is used for calculating the transmission EC per information bit E tx b , by invoking Equation (9). This path loss model may be substituted by alternative channel models, such as a Rayleigh fading [83] channel, if this is more appropriate for the design scenario. The assumptions and specifications for the target scenario of Table III are applied to the specific path loss model, having the parameters defined in Section II. Furthermore, the decoding EC E dec b of the candidate schemes can be estimated using the techniques discussed in Section IV. The decoding EC E dec b for the example architecture of Fig. 13 is shown alongside the required E rx b /N 0 in Table IV. Using the coding rates of the candidate schemes, the modulation and demodulation energy consumption can be calculated according to
The encoding energy E enc b is typically considerably lower than the decoding EC [84], and therefore in this design example it is assumed to be negligible. Finally, the overall EC of the candidates can be calculated by summing these figures according to

$$E_b^{\mathrm{tx}} + E_b^{\mathrm{dec}} + E_b^{\mathrm{enc}} + E_b^{\mathrm{mod}} + E_b^{\mathrm{dem}},$$

in order to obtain the combined energy consumption per bit.
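The bookkeeping of this methodology may be sketched as follows. The sketch assumes that the modulator and demodulator draw a constant power while operating, so that their energy per information bit is the power divided by the information bit rate, which is the coding rate multiplied by the encoded bit rate. This relation is an assumption made for illustration and is not necessarily the expression used in the source. All function names and numerical values are hypothetical.

```python
def energy_per_info_bit(power_w, coding_rate, encoded_bit_rate_bps):
    """Energy per information bit of a block drawing constant power.

    Assumes the block is active for the whole transmission, so that each
    information bit costs power / (R x encoded bit rate) joules.
    """
    if not 0 < coding_rate <= 1 or encoded_bit_rate_bps <= 0:
        raise ValueError("invalid coding rate or bit rate")
    return power_w / (coding_rate * encoded_bit_rate_bps)


def total_energy_per_bit(e_tx, e_dec, e_mod, e_dem, e_enc=0.0):
    """Combined EC per bit; e_enc defaults to zero (assumed negligible)."""
    return e_tx + e_dec + e_enc + e_mod + e_dem


# Hypothetical example for a rate-1/4 scheme and a 1 Mbit/s encoded bit rate
e_mod = energy_per_info_bit(power_w=1e-3, coding_rate=0.25, encoded_bit_rate_bps=1e6)
e_dem = energy_per_info_bit(power_w=2e-3, coding_rate=0.25, encoded_bit_rate_bps=1e6)
total = total_energy_per_bit(e_tx=10e-9, e_dec=5e-9, e_mod=e_mod, e_dem=e_dem)
print(f"E_mod = {e_mod:.2e} J, E_dem = {e_dem:.2e} J, total = {total:.2e} J per bit")
```

Under this assumption, a lower coding rate increases the modulator and demodulator energy per information bit, since more encoded bits must be handled for each information bit.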
B. Results
Figs. 17 and 18 show the combined transmission and processing EC for the four candidate schemes, when operating in Scenarios 1 and 2, respectively. These graphs show how the combined EC increases with the transmission distance, allowing the designer to make decisions based on the range of required distances. It can be seen that for very short transmission distances, the uncoded candidate scheme has the lowest EC due to the processing overhead of the turbo coded schemes, as well as the additional modulator E mod b and demodulator E dem b energy required for transmitting the additional parity bits. As the distance increases, the turbo coded schemes overtake the uncoded one, since they facilitate a lower transmit energy.
The low-complexity TC schemes have an advantage for shorter transmission distances, while the transmit power dominates the EC over longer distances, where the best performing scheme becomes the one having the best BER performance. Table V summarizes the combined EC for a selection of the candidate schemes over a range of distances. It can be seen that the Max-SE-BCJR decoder offers an attractive tradeoff. At short distances, it offers an energy saving due to its lower processing energy E dec b compared to the LUT-Max-BCJR decoder, while at longer distances its slight BER performance degradation results in only a small increase of the overall EC. Compared to the Max-BCJR decoder, the Max-SE-BCJR decoder offers an improvement for the majority of distances considered. Indeed, the only distances at which the Max-BCJR decoder offers a lower combined EC than the Max-SE-BCJR decoder are those where the uncoded scheme provides the lowest EC.
The schemes offering the best BER performance are the 3MCTC and 4MCTC of Table IV; however, they also have the highest decoding EC E dec b . As a result, these schemes provide the lowest overall EC at longer distances, especially when compared to the TCTC schemes, which have a slightly worse BER performance.
The authors of [4] refer to the point at which using an error correcting code becomes beneficial over uncoded transmission as the critical distance d cr . This can be expressed in terms of the energy consumptions of the respective encoded and uncoded modulator and demodulator components. The critical distance depends on the particular error correction code used, as well as on all of the other factors shown in Table III. Fig. 19 shows how the critical distance varies both with the carrier frequency and with the path loss coefficient p for a variety of schemes.
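To illustrate the notion of a critical distance, the sketch below solves for the crossover point under a deliberately simplified transmission model, which is an assumption and not the path loss model of Section II-C. The transmission EC per bit is taken to be k times the required linear E b /N 0 times d to the power p, where the hypothetical constant k lumps together the noise density, the carrier frequency dependence, the antenna gains and the PA efficiency. All distance-independent energies are grouped into an overhead term for each scheme.

```python
def critical_distance(k, p, gamma_uncoded, gamma_coded,
                      overhead_uncoded, overhead_coded):
    """Distance at which the coded and uncoded total ECs per bit are equal.

    Assumed model: E_total(d) = k * gamma * d**p + overhead, where gamma is
    the required linear Eb/N0 and overhead is the distance-independent EC.
    """
    gain = gamma_uncoded - gamma_coded          # transmit energy saved by coding
    extra = overhead_coded - overhead_uncoded   # additional overhead of coding
    if gain <= 0:
        return float("inf")  # coding never pays off in this model
    if extra <= 0:
        return 0.0           # coding pays off at every distance
    return (extra / (k * gain)) ** (1.0 / p)


def total_ec(k, p, gamma, overhead, d):
    return k * gamma * d ** p + overhead


# Hypothetical parameters for illustration only
K, P = 1e-12, 3.0
GAMMA_UNCODED = 10 ** (9.6 / 10)   # required Eb/N0 of 9.6 dB, linear scale
GAMMA_CODED = 10 ** (1.0 / 10)     # required Eb/N0 of 1.0 dB, linear scale
OVERHEAD_UNCODED, OVERHEAD_CODED = 1e-9, 5e-9  # joules per bit

d_cr = critical_distance(K, P, GAMMA_UNCODED, GAMMA_CODED,
                         OVERHEAD_UNCODED, OVERHEAD_CODED)
print(f"critical distance: {d_cr:.2f} m")
```

In this simplified model, a larger path loss coefficient p or a larger coding gain reduces the critical distance, while a higher processing overhead increases it.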
The case study of [76] offers a simple example for demonstrating the philosophy of the proposed holistic design method. Naturally, numerous idealized simplifying assumptions of the environment and of the WSN specifications had to be stipulated here for avoiding distraction from the holistic design methodology. As a benefit, the design methodologies discussed here are capable of assisting the designer in holistically optimizing a TC design by considering numerous different design aspects. For example, apart from the basic parameters of TC schemes that were considered in our example, the longest block length N of a TC determines the memory requirement of the hardware implementation. The number of decoding iterations performed has a significant effect on both the BER performance and the decoding EC. Additionally, the number of hops employed in a multi-hop network determines the average transmission range and the sensor densities. All of these aspects directly affect both the transmission EC and the decoding EC. As a result, these design methods can be used for optimizing a wide variety of related specifications for improving the system's energy efficiency.
VI. CONCLUSION AND DESIGN GUIDELINES
In conclusion, energy-constrained scenarios such as WSNs and the IoT constitute emerging applications for TCs, where they can be employed for reducing the overall EC of communication systems. To achieve this goal, new design methodologies are required for TCs, which consider the EC throughout the entire design process. The key issue is to ensure that the potentially high-complexity turbo decoder has only a moderate EC E dec b , while also reducing the transmission energy E tx b . By achieving this goal, a holistically-considered overall EC (E tx b + E dec b + E enc b + E mod b + E dem b ) can be realized. In this tutorial, the parameters of TCs were detailed and turbo decoder architecture design techniques were presented, particularly for the case of TCs specifically designed for reducing the EC of the decoder, without impacting the error correcting performance. Furthermore, energy estimation methods were conceived for estimating E dec b during an early design stage. Based on these three topics, holistic TC design methods were proposed for reducing the overall EC. The selected design guidelines may be summarized as follows:
• Determine the environmental parameters of the target scenario, as exemplified in Table III.
• Establish the path loss model, as exemplified by the path loss model of Section II-C.
• Select the code design candidates, for example the candidates shown in Table IV including the specific design parameters of Table II and the architectural approximations discussed in Section III.
• Invoke the energy estimation framework of Section IV for estimating the processing EC E dec b of the candidates.
• Using BER simulations and the path loss model, estimate the transmission EC E tx b of the candidates, as demonstrated in Section V.
• Compare the overall ECs (E tx b + E dec b + E enc b + E mod b + E dem b ) to find the most energy-efficient design, as demonstrated in Section V.
• Implement this most energy efficient design.
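The guidelines above can be sketched as a simple selection procedure, as shown below. This is a hedged illustration rather than the authors' tool: the candidate records, their field names and the toy transmission energy model passed in as `tx_energy` are all hypothetical, and the encoding energy is assumed to be negligible.

```python
def select_most_efficient(candidates, tx_energy, distance_m):
    """Return the candidate with the lowest overall EC per bit.

    candidates: list of dicts with the keys "name", "ebn0_db", "e_dec",
                "e_mod" and "e_dem" (energies in joules per bit).
    tx_energy:  callable (ebn0_db, distance_m) -> transmission EC per bit,
                standing in for the path loss model chosen by the designer.
    """
    def overall_ec(c):
        return (tx_energy(c["ebn0_db"], distance_m)
                + c["e_dec"] + c["e_mod"] + c["e_dem"])
    return min(candidates, key=overall_ec)


def toy_tx_energy(ebn0_db, distance_m):
    """Hypothetical stand-in model: 1 pJ x linear Eb/N0 x distance cubed."""
    return 1e-12 * 10 ** (ebn0_db / 10) * distance_m ** 3


# Hypothetical candidates, not the values of Table IV
CANDIDATES = [
    {"name": "uncoded", "ebn0_db": 9.6, "e_dec": 0.0, "e_mod": 1e-9, "e_dem": 1e-9},
    {"name": "coded", "ebn0_db": 1.0, "e_dec": 5e-9, "e_mod": 3e-9, "e_dem": 3e-9},
]

for d in (1.0, 100.0):
    best = select_most_efficient(CANDIDATES, toy_tx_energy, d)
    print(f"d = {d} m: lowest overall EC is offered by {best['name']}")
```

Repeating the selection over the range of distances expected in the target scenario reveals where each candidate is preferable.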
Using the discussed holistic design method, a specific design example was presented. The results demonstrated that the design methods presented are capable of finding the most desirable TC design and architectural choices from an energy efficiency point of view. The conventional approach, which used the BER and computational complexity derived from the number of states and decoder activations, could not achieve this optimal design decision due to its separate consideration of the architecture and algorithm.
In a communications system there are a wide variety of conflicting design trade-offs. The holistic design techniques of this paper allow all of the relevant trade-offs to be considered together, for the sake of minimizing the overall energy consumption. In particular, a joint optimization of these trade-offs can be used for holistically improving the entire system. 
