Deep-space communications are characterized by extremely critical conditions; current standards foresee the usage of both turbo and low-density-parity-check (LDPC) codes to ensure recovery from received errors, but each of them displays significant drawbacks. Code concatenation is widely used in all kinds of communication to boost the error correction capabilities of single codes; serial concatenation of turbo and LDPC codes has recently been proven effective for deep-space communications, being able to overcome the shortcomings of both code types. This work extends the performance analysis of this scheme and proposes a novel hardware decoder architecture for concatenated turbo and LDPC codes based on the same decoding algorithm. This choice leads to a high degree of datapath and memory sharing; postlayout implementation results obtained with complementary metal-oxide semiconductor (CMOS) 90 nm technology show small area occupation (0.98 mm²) and very low power consumption (2.1 mW).
I. INTRODUCTION
The Consultative Committee for Space Data Systems (CCSDS) has produced over the years a de facto standard for all space-related communication systems. In the latest versions of the standard [1], the foreseen downlink throughput for deep-space communications has increased, reaching up to tens of megabits per second. Four channel coding schemes have been described in [2] and consequently assembled into application-wise forward-error-correction (FEC) schemes in [3]. Both turbo [4] and low-density-parity-check (LDPC) [5] codes are currently contemplated for deep-space communications [2]; while the suggested turbo codes target stricter bit error rate (BER) constraints, LDPC codes have been recently included in the standard and have higher rate, and they are currently subject to CCSDS experimentation [6]. Both turbo and LDPC codes are common in on-Earth wireless communication systems; there, however, throughput requirements are much higher than those of deep-space communications, while frame error rate (FER) constraints are more relaxed. In fact, spacecraft-to-Earth communications are characterized by limited amounts of available power and long transmission times, and a failed reception and consequent retransmission are often unacceptable. Thus, ad hoc powerful FEC schemes must be devised.
A FEC scheme relying on the serial concatenation of turbo and LDPC codes has been proposed in [7]; thanks to its very good error correction capabilities, it has been deemed suitable for the extremely critical deep-space communications. To the best of our knowledge, no implementation solution for the concatenated scheme has been proposed so far, but decoders for both turbo and LDPC codes are present in the state of the art, mainly targeting wireless communications. Multicode and multistandard decoders that make flexibility their primary concern have also been introduced recently [8]-[13]; they are characterized by different degrees of datapath and memory sharing.
This work proposes a decoder for concatenated turbo and LDPC codes targeting deep-space communications. The usage of the same decoding algorithm for both codes greatly reduces the area overhead of the concatenated scheme decoder with respect to a single LDPC or turbo code decoder. In fact, it allows one to exploit a high degree of datapath sharing and obtain very low power consumption and area occupation. In addition to deep-space communications, the proposed solution could also be useful in other applications where retransmission of lost packets is not allowed, such as broadcasting.
The rest of the paper is organized as follows: Section II introduces turbo and LDPC code decoding, while Section III describes the concatenated FEC schemes and their performance. The hardware structure of the proposed decoder is explained in Section IV, and Section V gives the results of the implementation. Finally, conclusions are drawn in Section VI.
II. TURBO AND LDPC DECODING
Turbo codes can be obtained by concatenating in parallel two convolutional code encoders. The dual encoding structure is reflected in the decoder, which is consequently made of two parts as well, known as soft-in-soft-out (SISO) decoders. These are connected by an interleaver $\Pi$ and a de-interleaver $\Pi^{-1}$. Each SISO implements the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [14], which produces extrinsic metrics from a priori information; each iteration of the algorithm can be divided into an in-order half-iteration and an interleaved-order half-iteration, due to the presence of two SISOs. The BCJR algorithm relies on the trellis representation of the constituent convolutional codes; let us define $k$ as a trellis step and $u$ as an uncoded symbol.
where $\tilde{u} \in U$ is an uncoded reference symbol (usually $\tilde{u} = 0$) and $u \in U \setminus \{\tilde{u}\}$, with $U$ the set of uncoded symbols; $e$ is a trellis transition and $u(e)$ is the corresponding uncoded symbol. According to the Max-Log-MAP approximation [15], the $\max^*\{x_i\}$ function can be approximated by $\max\{x_i\}$ at the cost of a small BER degradation. The term $b(e)$ in (1) can consequently be defined as:

$$b(e) = \alpha_{k-1}[s^S(e)] + \gamma_k[e] + \beta_k[s^E(e)] \qquad (2)$$

where $s^S(e)$ and $s^E(e)$ are the starting and ending states of $e$, $\alpha$ and $\beta$ are the forward and backward state metrics, and $\gamma_k[e]$ is the branch metric.

LDPC codes are identified by an $M \times N$ sparse parity check matrix $H$ that represents all the parity checks a codeword must satisfy, i.e. $H \cdot x = 0$, where $x$ is the codeword of length $N$. Various decoding approaches are possible, depending on the graph representation of $H$, but the best-performing one is the layered decoding approach [16]. It sees $H$ as a multipartite graph composed of different layers of parity check constraints; multiple updates of the bit error probabilities within a single iteration allow for a fast convergence of the decoding algorithm. It is particularly advantageous in the case of quasi-cyclic LDPC (QC-LDPC) codes, where the parity check layers are inherent to the structure of $H$. In fact, the parity check matrix is made of multiple instances of an $m \times m$ identity matrix, each cyclically shifted by a variable shift factor; each layer consequently consists of $m$ rows.
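The quasi-cyclic structure described above can be illustrated with a minimal sketch. The base matrix `BASE`, the circulant size `M`, the use of -1 for an all-zero block, and the right-shift convention are hypothetical choices made for illustration; they are not taken from the WiMAX or CCSDS code definitions.

```python
def circulant(m, shift):
    """m x m identity matrix whose rows are cyclically shifted right by `shift`."""
    return [[1 if c == (r + shift) % m else 0 for c in range(m)] for r in range(m)]


def expand_qc(base, m):
    """Expand a base matrix of shift factors (-1 = all-zero block) into H."""
    H = []
    for base_row in base:  # one base row corresponds to one layer of m checks
        blocks = [circulant(m, s) if s >= 0 else [[0] * m for _ in range(m)]
                  for s in base_row]
        for r in range(m):
            H.append([bit for blk in blocks for bit in blk[r]])
    return H


def is_codeword(H, x):
    """True when every parity check of H is satisfied by x (arithmetic modulo 2)."""
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)


# Hypothetical toy code: 2 layers, 4 block columns, circulant size m = 3.
BASE = [[0, 1, -1, 2],
        [2, -1, 0, 1]]
M = 3
H = expand_qc(BASE, M)
```

Within a layer, every column of `H` contains at most one nonzero entry, which is what allows the `m` parity checks of a layer to be processed without conflicts on the bit LLRs.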
Let us define as $\lambda[c]$ the logarithmic likelihood ratio (LLR) of symbol $c$. The bit LLR $\lambda_k[c]$ related to column $k$ of $H$ is initialized to the corresponding received soft value. For every parity constraint $l$ in a given layer, the following operations are performed and reiterated up to the desired level of reliability:

where $R^{new}_{lk}$ is the updated version of $R^{old}_{lk}$, which is initialized to zero and stored for the next iteration; a different $R_{lk}$ is identified for each nonzero entry of $H$ at column $k$ and row $l$. Several exact and approximated algorithms have been proposed to calculate $R^{new}_{lk}$: the most common algorithm used in LDPC decoding is the belief propagation (BP) algorithm, together with the min-sum approximation and its variations [17].
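The layered schedule can be sketched as follows, under stated assumptions: the check update uses the plain min-sum approximation rather than the BCJR-based update adopted later in this work, each row is treated as its own layer, a positive LLR is taken to favor bit 0, and every check is assumed to involve at least two bits. The names `layered_min_sum`, `H_HAMMING`, and `LLR_IN` are illustrative.

```python
def layered_min_sum(H, llr_in, max_iter=10):
    """Layered LDPC decoding with min-sum check updates (illustrative sketch)."""
    lam = list(llr_in)                                   # bit LLRs, one per column
    rows = [[k for k, h in enumerate(row) if h] for row in H]
    R = [{k: 0.0 for k in cols} for cols in rows]        # R_lk initialized to zero
    hard = [1 if v < 0 else 0 for v in lam]
    for _ in range(max_iter):
        for l, cols in enumerate(rows):
            # Remove the contribution this check gave in the previous iteration.
            Q = {k: lam[k] - R[l][k] for k in cols}
            for k in cols:
                others = [Q[j] for j in cols if j != k]
                sign = -1.0 if sum(q < 0 for q in others) % 2 else 1.0
                R[l][k] = sign * min(abs(q) for q in others)   # new R_lk
                lam[k] = Q[k] + R[l][k]                  # bit LLR updated at once
        hard = [1 if v < 0 else 0 for v in lam]
        if all(sum(hard[k] for k in cols) % 2 == 0 for cols in rows):
            break                                        # all parity checks satisfied
    return hard, lam


# (7,4) Hamming parity check matrix used as a toy example.
H_HAMMING = [[1, 0, 1, 0, 1, 0, 1],
             [0, 1, 1, 0, 0, 1, 1],
             [0, 0, 0, 1, 1, 1, 1]]
# All-zero codeword with one unreliable, wrongly signed sample at position 2.
LLR_IN = [2.0, 2.0, -1.0, 2.0, 2.0, 2.0, 2.0]
decoded, final_llrs = layered_min_sum(H_HAMMING, LLR_IN)
```

Because the bit LLRs are refreshed after every check rather than once per iteration, later checks in the same iteration already see the refined values, which is the source of the faster convergence mentioned above.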
The LDPC and turbo decoding processes share many characteristics: both are iterative, rely on soft information, are usually implemented in their logarithmic form, and are commonly represented through special kinds of graphs. A particularly interesting exploitation of these characteristics has been proposed in [18]. Every row of $H$ is seen as a turbo code with trellis length equal to the row weight: a direct link between turbo and LDPC codes is consequently drawn, and turbo decoding algorithms like BCJR can be applied to LDPC codes with minor adjustments. The BCJR-based LDPC decoding relies on the fact that binary LDPC codes can be represented with a 2-state trellis; state metrics can consequently be expressed as differences $\alpha[c]$ and $\beta[c]$, reducing the quantization noise. Considering the Max-Log-MAP approximation [19], the calculation of $R^{new}_{lk}$ becomes
where the operator (·) is defined as
and $\alpha[c]$ and $\beta[c]$ can be computed as

The concatenation of different codes targets the improvement of performance via careful code selection. The concatenation of codes A and B is in fact meaningful only when A + B performs better than both A and B. This means that the coupled codes must be in some way complementary, each one overcoming the shortcomings of the other. Outer codes (OCs) are often chosen among those with guaranteed performance, like Reed-Solomon (RS) [20] or Bose-Chaudhuri-Hocquenghem (BCH) [21] codes, thanks to their theoretically predictable error correction capabilities. They are associated with powerful inner codes (ICs) such as convolutional or LDPC codes, as used in worldwide interoperability for microwave access (WiMAX) and digital video broadcasting-second generation satellite (DVB-S2), which greatly reduce the number of errors the OC has to correct. The RS + convolutional FEC scheme used by CCSDS allows these codes to rival the more powerful LDPC and turbo codes. In [22], the performance of the common turbo + RS FEC scheme is analyzed in relation to its interleaver. Good results are observed with complex interleavers and at very high signal-to-noise ratio (SNR).

However, this is not the only possible criterion of choice. In [23], LDPC and recursive systematic convolutional codes are concatenated in parallel, with good performance and little additional complexity with respect to a standard LDPC code. Block turbo codes and LDPC codes have been concatenated in [24], where a FEC scheme for three-dimensional high-definition television (3D HDTV) is devised. The scheme outperforms the digital video broadcasting-second generation terrestrial (DVB-T2) standard serial concatenation of BCH and LDPC codes. Satellite communications are handled through the concatenation of Luby transform (LT) codes and nonbinary LDPC (NB-LDPC) codes in [25]. Thanks to the high error correction capabilities of NB-LDPC and the intrinsic flexibility of LT codes, the resulting system is very versatile.
Serial concatenation of codes is based on the concept that the output bits of an encoder are used as input bits for another encoder. Turbo and LDPC codes in particular have been considered for concatenation in [7], where deep-space communications were targeted; Fig. 1 shows the proposed idea. The performance of these two types of codes is somewhat complementary; while turbo codes guarantee much better performance than LDPC codes at low SNR, they suffer from higher error floors [26]. Consequently, the LDPC encoder is placed before the turbo encoder, while the decoders are in inverted order. The turbo code, working as an IC and being the first one to be decoded, can exploit its early waterfall region, while the outer LDPC code receives already refined error probabilities and can thus work at higher equivalent SNR. To prove the soundness of this choice, simulations were also run with the order of the encoders inverted. With an LDPC IC and a turbo OC, BER results in the waterfall region improve with respect to the use of the LDPC code alone, but they are much worse than those of the proposed concatenated scheme. Moreover, error floors can be noticed at BER levels only slightly lower than those of turbo codes alone. The encoders are connected by an optional padding block that, when the respective block sizes differ, adapts them by adding zeroes. This means that not all the IC input bits carry useful information, but experimentation with a wide variety of codes is possible. Correspondingly, a de-padding block is inserted between the decoders. The IC decoder receives an initial measure of the bit error probabilities from the channel estimator and performs a fixed number of iterations $Iter_{in}$. Afterwards, the potential padding bits are removed, and the bit-level output error probabilities $\lambda_k[u]$ are passed to the OC decoder, which interprets them as input $\lambda_k[c]$.
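The padding and de-padding steps, and the order in which the two decoders are invoked, can be sketched as below. All function names are hypothetical, the two decoders are passed in as placeholder callables, and padding zeroes are assumed to be appended at the end of the outer codeword.

```python
def pad_for_inner(oc_codeword, k_ic):
    """Append zeroes so the outer codeword fills the inner-code input block."""
    if len(oc_codeword) > k_ic:
        raise ValueError("outer codeword longer than inner-code input block")
    return list(oc_codeword) + [0] * (k_ic - len(oc_codeword))


def depad_llrs(ic_output_llrs, n_oc):
    """Drop the LLRs of the padding bits before outer decoding."""
    return list(ic_output_llrs[:n_oc])


def concatenated_decode(channel_llrs, ic_decode, oc_decode, n_oc, iter_in):
    """Inner (turbo) decoding first, then de-padding, then outer (LDPC) decoding."""
    bit_llrs = ic_decode(channel_llrs, iter_in)   # fixed number of inner iterations
    return oc_decode(depad_llrs(bit_llrs, n_oc))


# Placeholder decoders, used only to show the data flow.
def identity_inner(llrs, iterations):
    return list(llrs)


def hard_decision_outer(llrs):
    return [1 if v < 0 else 0 for v in llrs]
```

For example, `concatenated_decode([1.0, -2.0, 3.0, 0.0], identity_inner, hard_decision_outer, n_oc=3, iter_in=6)` discards the last (padding) value and takes hard decisions on the remaining three.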
Having gone through the turbo decoding, the $\lambda_k[c]$ at the input of the LDPC decoder cannot be considered channel-estimated LLRs. Those pertaining to correct bits in particular have diverged from zero and are suitable for a hard decision on bits. This could lead to poor performance on the part of the LDPC code, but the nature of the concatenated scheme prevents it. To exploit their low error floor, LDPC codes need to work at high SNR; the refined LLRs used as inputs guarantee the required general high level of reliability and avoid undesired bit flipping. The correction of the errors that could not be corrected by the turbo code is helped by the inherent interleaving effect brought by the structure of the LDPC $H$ matrix [27].
Comparing the performance of codes differing in both block size and rate is a complex task. To help a fair evaluation of the effectiveness of the concatenation, the distance of each concatenated scheme from the sphere packing bound (SPB) [28] has been considered. The SPB is an evolution of the channel capacity that ties the achievable performance of a code not only to its code rate, but also to the block size; different methods of calculation and refinements of this measure exist, but the authors refer to that proposed by Dolinar et al. in [28]. In particular, the asymptotic approximation devised for long blocks (≥100 symbols) is used. In Fig. 2, the performance of various concatenations, along with that of codes currently used in different standards, is evaluated in terms of $\Delta$SPB and $E_b/N_0$. The x axis represents the $E_b/N_0$ at which FER = $10^{-7}$, while the distance of the code from its SPB at that particular FER is shown on the y axis ($\Delta$SPB). These results, together with those shown in Fig. 3, have been obtained taking into account the performance degradation brought by the Max-Log-MAP approximation and the bit-level metric conversion addressed in Section IV-B. Together, they sum up to around 0.3 dB loss. Simulations have been run with a maximum of 10 iterations for both turbo and LDPC codes; $R^{new}_{lk}$ and $\lambda^{apo}_k[u]$ values have been quantized with 10 bits, three of which are assigned to the fractional part, while 9 bits are used for channel LLRs and state metrics. The selected quantization leads to negligible degradation with respect to floating point and guarantees a much higher level of precision than typical LDPC and turbo decoder quantizations, which can be as small as 4 bits [10]. The white symbols are codes taken from the current CCSDS standard, while full black symbols in Fig. 2 represent different choices of concatenations. Where padding bits have been used, the SPB has been computed by considering $K = K_{IC} \times rate_{OC}$.
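The 10-bit quantization with three fractional bits can be modelled in software with a minimal fixed-point sketch. The signed two's complement range, the round-to-nearest behavior, and the saturation on overflow are assumptions made for illustration; the text only specifies the total and fractional bit counts.

```python
def quantize(value, total_bits=10, frac_bits=3):
    """Map a real value onto a signed fixed-point grid, saturating at the range limits."""
    scale = 1 << frac_bits                       # 3 fractional bits -> step of 1/8
    lowest = -(1 << (total_bits - 1))            # most negative integer code
    highest = (1 << (total_bits - 1)) - 1        # most positive integer code
    code = max(lowest, min(highest, round(value * scale)))
    return code / scale
```

Under these assumptions the representable range is [-64, 63.875] with a resolution of 0.125, so a metric of 1.3 is stored as 1.25 while any value above the range is clipped to 63.875.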
At FER = $10^{-7}$, it is possible to observe the effect of the error floor on turbo codes in both the high $\Delta$SPB and the large $E_b/N_0$ (white circle and white cross in Fig. 2). CCSDS LDPC codes show good $\Delta$SPB, especially for K = 4096; however, they are characterized by quite large $E_b/N_0$. The results obtained with the largest CCSDS LDPC code are similar to those obtained by concatenated WiMAX LDPC and turbo codes. The CCSDS turbo codes, when used as IC in concatenation, give the best results, with the CCSDS turbo + WiMAX LDPC outperforming all the other solutions. The comparison with serially concatenated convolutional codes (SCCCs) [29], which are able to obtain lower error floors than parallel turbo codes, shows very similar performance, with the concatenated scheme yielding slightly better results in terms of both $\Delta$SPB and $E_b/N_0$.
The advantage of the concatenated scheme is particularly evident at low FER, as can be observed in Fig. 3, where FER curves are plotted alongside the SPB. While concatenated schemes exploit the very low error floors of LDPC codes and follow the behavior of the SPB closely, the FER of turbo codes alone suffers from an early divergence from the theoretical achievable performance. For example, while the FER curve of the CCSDS turbo code plotted in Fig. 3 displays $\Delta$SPB = 0.72 dB at FER = $10^{-5}$, it rises to 1.26 dB at FER = $10^{-7}$ due to the error floor.
IV. UNIFIED LDPC/TURBO DECODING ARCHITECTURE
Given the effectiveness of the concatenated FEC scheme presented in [7], the decoder architecture for turbo and LDPC code concatenation shown in Fig. 4 has been designed. The gray blocks represent the duplicated datapath described in Section IV-A, while the structure of the memory banks, with their alternative usage according to half-iterations, is detailed in Section IV-B. In [30], a study on memory and datapath sharing in the context of flexible decoders reveals unsatisfactory results for many code combinations. While memory sharing is often deemed disadvantageous, some exceptions arise, notably the turbo/LDPC combination when the block lengths are comparable. Moreover, the common turbo/LDPC decoding technique described in Section II paves the way for highly shared datapaths, in the wake of works like [8] and [9], as opposed to separate-datapath turbo/LDPC decoders like [10]-[12].
The proposed decoder relies on an innovative smart memory structure that allows one to increase the percentage of module reuse within the datapath and avoid complex interleaving mechanisms between the decoding modes.
A. Datapath
The structure of the designed LDPC/turbo datapath positions itself in between a completely shared approach and datapath separation. The turbo and LDPC datapaths have great disparities in terms of complexity, with the turbo datapath requiring more resources. As shown in this section and in Section V, the LDPC datapath is included within the turbo datapath, while constituting a limited percentage of its overall logic. The concatenated scheme can consequently be decoded at little more than the logic cost of a turbo decoder. Fig. 5 shows a block diagram of the designed datapath. It is characterized by a pipelined architecture, with registers represented by striped blocks. The turbo decoding process makes use of a butterfly structure; the datapath is duplicated into an α half and a β half, respectively entrusted with the concurrent forward and backward scanning of the trellis steps. They implement the modified sliding window technique described in Section II. Each half of the duplicated datapath receives as an input from the memories the $\lambda^{apr}_k[u(e)]$ and $\lambda_k[c(e)]$ relative to a trellis step; these are used by the branch metric units (BMUs) to perform (5) and obtain $\gamma_k[e]$. These are passed to the α and β units, which perform the computations of (3) and (4), respectively. The structure of the α and β units is similar. Along with the output of the BMU, the α unit receives $\alpha_{k-1}[s]$ either from the memory (when computing the first trellis step of a window) or from its own outputs (all other trellis steps), as shown by the feedback loop in Fig. 5. Together with the updated $\alpha_k[s]$, which are stored in the state metric memory α (Fig. 4), the α unit also produces the $\alpha_{k-1}[s^S(e)] + \gamma_k[e]$ partial sums needed by (2). These are passed to one of the extrinsic computation units (EXT-α in Fig. 5). EXT-α completes (2) by taking the $\beta_k[s^E(e)]$ stored in the state metric memory β by the β unit and finally performs (1). The same computation is concurrently carried out on another trellis step by EXT-β, which is given the $\alpha_{k-1}[s^S(e)]$ stored by the α unit in the state metric memory α and the $\gamma_k[e] + \beta_k[s^E(e)]$ partial sums calculated in the β unit.
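The flow through the BMU, the α and β units, and the extrinsic units can be mirrored in software with a short sketch. It is written under several assumptions: the recursions take the usual Max-Log-MAP form, the trellis is a hypothetical 2-state accumulator rather than the 8-state or 16-state codes considered here, LLRs are defined as log P(1)/P(0), the final state is unknown, and the output is an a posteriori metric with no sliding window and no subtraction of the a priori term.

```python
NEG_INF = float("-inf")

# Hypothetical toy trellis: edge = (start_state, u, end_state, coded_bit).
EDGES = [(s, u, s ^ u, s ^ u) for s in (0, 1) for u in (0, 1)]


def max_log_map(l_apr, l_ch):
    """Return one a posteriori LLR per trellis step for the toy trellis."""
    n = len(l_ch)
    # BMU: branch metric of every edge at every trellis step.
    gamma = [[u * l_apr[k] + c * l_ch[k] for (_, u, _, c) in EDGES] for k in range(n)]
    # Alpha unit: forward scan, trellis starting in state 0.
    alpha = [[0.0, NEG_INF]] + [[NEG_INF, NEG_INF] for _ in range(n)]
    for k in range(n):
        for i, (ss, _, se, _) in enumerate(EDGES):
            alpha[k + 1][se] = max(alpha[k + 1][se], alpha[k][ss] + gamma[k][i])
    # Beta unit: backward scan, final state assumed unknown.
    beta = [[NEG_INF, NEG_INF] for _ in range(n)] + [[0.0, 0.0]]
    for k in range(n - 1, -1, -1):
        for i, (ss, _, se, _) in enumerate(EDGES):
            beta[k][ss] = max(beta[k][ss], gamma[k][i] + beta[k + 1][se])
    # Extrinsic units: b(e) = alpha + gamma + beta, then max over u=1 minus max over u=0.
    out = []
    for k in range(n):
        b = [alpha[k][ss] + gamma[k][i] + beta[k + 1][se]
             for i, (ss, _, se, _) in enumerate(EDGES)]
        best1 = max(b[i] for i, e in enumerate(EDGES) if e[1] == 1)
        best0 = max(b[i] for i, e in enumerate(EDGES) if e[1] == 0)
        out.append(best1 - best0)
    return out
```

In the hardware the forward and backward scans run concurrently on different trellis steps, whereas this sketch runs them one after the other; the arithmetic performed per step is the same.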
The LDPC decoding process mostly reuses the turbo mode datapath. LDPC codes are characterized by 2-state binary trellises; since the turbo codes considered are either 16-state single-binary (SBTC) or 8-state duo-binary (DBTC), LDPC codes can easily exploit an additional parallelism factor, concurrently performing the computations associated with multiple parity checks. The BMU is not used and is consequently deactivated in LDPC mode, while both α and β units are shared with the turbo mode. In Fig. 6, the core components of the unified α unit are depicted. Adders and comparators are shared between the two operating modes. Both structures are equivalent when in turbo mode, while their operations differ in LDPC mode. Architecture 1 implements the max(x, y) operator, while architecture 2 implements max(x + y, 0). The EXT-α and EXT-β units perform (8) and rely on the same architectures used for the α and β units (Fig. 6). In this case too, they are shared with the turbo datapath.
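The role of the two architectures in LDPC mode can be sketched as follows. The way they are combined here, max(x, y) - max(x + y, 0), is an assumption based on the usual Max-Log-MAP form of a 2-state trellis step, and the function names and the forward/backward scan over one parity check are illustrative. The sketch assumes at least two bits per check.

```python
NEG_INF = float("-inf")


def arch1(x, y):
    """Architecture 1 in LDPC mode."""
    return max(x, y)


def arch2(x, y):
    """Architecture 2 in LDPC mode."""
    return max(x + y, 0)


def check_op(x, y):
    """Assumed combination of the two architectures for one 2-state trellis step."""
    return arch1(x, y) - arch2(x, y)


def check_extrinsics(q):
    """Forward/backward scan over one parity check; q holds the incoming bit LLRs."""
    n = len(q)
    alpha = [NEG_INF] * n          # alpha[j] combines q[0..j-1]
    beta = [NEG_INF] * n           # beta[j] combines q[j+1..n-1]
    for j in range(1, n):
        alpha[j] = check_op(alpha[j - 1], q[j - 1])
    for j in range(n - 2, -1, -1):
        beta[j] = check_op(beta[j + 1], q[j + 1])
    return [check_op(alpha[j], beta[j]) for j in range(n)]
```

With this combination, `check_op(x, y)` equals the min-sum magnitude min(|x|, |y|) with sign opposite to the product of the input signs, and minus infinity acts as the neutral element that starts both scans.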
B. Memory
As the occupied area in both turbo and LDPC decoders is dominated by storage components, efficient memory sharing is very important; for example, a scheme suitable for serial PEs with disjoint turbo and LDPC datapaths has been used in [10], resulting in large memory savings. However, a different approach is needed in this work. Since the sizing of memories strongly depends on the supported codes, the following analysis is carried out supposing the concatenation of a rate 5/6, N = 1920 WiMAX LDPC code with a rate 1/3, K = 960 DBTC taken from the same standard, decoded considering a window size w = 80. As already shown in [7] and in Section III, this FEC scheme guarantees performance comparable to that of more powerful codes. No padding bits are necessary, since the size of the input frame for the DBTC (960 symbols, i.e. 1920 bits) is equal to the size of the LDPC codeword. However, the following discussion on memory requirements also holds in the case of padding, as long as the padding bits are added at the end of the LDPC codeword. The memories necessary to support the designed decoder can be observed in Fig. 4; two sets of four memory banks serve the in-order and interleaved half-iterations, respectively, storing extrinsic and intrinsic information, while two memories are dedicated to the storage of state metrics.
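The sizes involved in this case study can be checked with a few lines of arithmetic. The figures are those quoted above; the overall rate and the number of windows are derived values, and the window size is assumed to be expressed in trellis steps (duo-binary symbols).

```python
from fractions import Fraction

N_LDPC, RATE_LDPC = 1920, Fraction(5, 6)       # outer WiMAX LDPC code
K_DBTC_SYMBOLS, RATE_DBTC = 960, Fraction(1, 3)  # inner WiMAX duo-binary turbo code
WINDOW = 80                                     # window size w, assumed in trellis steps

k_ldpc = int(N_LDPC * RATE_LDPC)                # LDPC information bits
turbo_input_bits = 2 * K_DBTC_SYMBOLS           # duo-binary: two bits per symbol
padding_bits = turbo_input_bits - N_LDPC        # zero here, so no padding block
coded_bits = int(turbo_input_bits / RATE_DBTC)  # bits sent over the channel
overall_rate = Fraction(k_ldpc, coded_bits)     # rate of the concatenated scheme
windows = K_DBTC_SYMBOLS // WINDOW              # trellis windows per half-iteration
```

With these numbers, 1600 information bits are protected by 5760 transmitted bits, giving an overall rate of 5/18, and each half-iteration spans 12 windows.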
In turbo mode, the duplication of the datapath required by the butterfly structure raises the need for concurrent data reading and writing. For the correct computation of a trellis step, the following metrics are necessary: (7) are to be updated once per trellis.
Since the chosen turbo code is duo-binary and has an eight-state trellis, its decoding process needs a much larger number of metrics than the LDPC code, whose decoding is similar to that of a single-binary, two-state turbo code. From the LDPC point of view, this translates into an internal level of parallelism in the datapath that is not, however, directly available. In fact, the structure of the $H$ matrix and the layered scheduling require the same LLR to be read and updated multiple times during a single LDPC decoding iteration, resulting in complex load and store patterns not found in turbo decoding. Careful planning of the memory structure is consequently necessary to maximize the level of memory sharing and to concurrently allow the LDPC datapath to exploit the internal parallelism. Fig. 7 shows in detail one of the two sets of memory banks depicted in Fig. 4. Memories are sized to accommodate the considered codes when two concurrent parity check computations are performed in LDPC decoding (parallelism factor × 2). They are dual-port, and the usage percentage of each memory is portrayed for both turbo and LDPC codes, along with its depth and width.
In turbo mode, two $\lambda_k[c(e)]$ metrics are stored at each address in the two 1920 × 16 bit intrinsic memories; both ports are always kept in read mode, except during initialization. In this way, four $\lambda_k[c(e)]$ are concurrently available to the α datapath, and four to the β datapath. These same intrinsic memories are used to store the R

The memories portrayed so far compose the in-order half-iteration memory banks (Fig. 4) and need to be kept, during turbo decoding, with both ports in read mode. The in-order half-iteration makes use of two additional extrinsic memories to store the newly computed $\lambda^{apo}_k[u(e)]$ that are used in the interleaved-order half-iteration (Fig. 4). The $\lambda_k[c(e)]$ needed by the interleaved half-iteration are also stored in two additional intrinsic memories. The same data could be retrieved by adding complexity to the address generation logic and reading in interleaved order the intrinsic memories used in the first half-iteration. However, the extra storage is useful for LDPC decoding. In fact, by having a total of four 1920 ×

Figure 7, Table I synthesizes the advantages of the devised memory structure. The first row gives the memory bits necessary for the decoder architecture described in Section IV-A to work in both turbo and LDPC mode, while considering separate memories. If smart metric allocation techniques are used ($R^{old}_{lk}$ coupling and reuse, bit-level $\lambda^{apo}_k[u(e)]$), the required bits are reduced by 22.5%. Moreover, by sharing the memories between the two modes, only 41.8% of the total bits is necessary, with LDPC mode being completely supported by the memories required by turbo mode.
C. Interleaving and Addressing
Address generation for the described memory structure is in most cases straightforward. In turbo mode, all read operations are sequential, either in forward or backward order, and are consequently handled by simple counters. Write operations to the following half-iteration memories are based on the permutation law associated with the turbo code encoding, and the memory addresses can be obtained via simple operations on the current half-iteration read address. In the considered case study, the interleaving rules are those associated with the WiMAX standard turbo codes; the interleaved addresses are obtained on-the-fly by dedicated logic implementing the WiMAX permutation function. Address generation for the intrinsic memories is sequential in both read and write operations when in LDPC mode, and the counters used in the turbo mode can be reused, but problems arise when dealing with the extrinsic memories. While sequential addressing in intrinsic memories is possible thanks to the local nature of $R^{old}_{lk}$ values, $\lambda_k[c]$ are read and updated multiple times and in variable order during an iteration. Address generation, however, can still exploit the regular structure of the $H$ matrix of QC-LDPC codes. By storing in a small memory the shift factors of the constituent $m \times m$ circulant identity matrices, together with the position of the nonzero entry of their first row, read and write addresses can be obtained with modulo-$m$ counters and adders. A single 160 × 36 bit memory is sufficient to also support the × 2 internal parallelism.
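The counter-and-adder address generation for the bit LLRs can be sketched in a few lines. The function name, the `(block_column, shift)` description of a layer, and the linear one-LLR-per-address layout are assumptions for illustration; the actual memories pack several metrics per address and support the × 2 parallelism, which is not modelled here.

```python
def llr_addresses(layer_entries, m):
    """Yield, for each row of a layer, the LLR addresses touched by its parity check.

    layer_entries: (block_column, shift) pairs of the nonzero circulants in the layer.
    """
    for r in range(m):                               # modulo-m row counter
        # Base address of the block column plus the shifted row index.
        yield [bc * m + (r + s) % m for bc, s in layer_entries]


# Hypothetical layer with circulants in block columns 0, 1, and 3, size m = 3.
EXAMPLE_LAYER = [(0, 0), (1, 1), (3, 2)]
addresses = list(llr_addresses(EXAMPLE_LAYER, 3))
```

Only the shift factors and block positions need to be stored; every address follows from a counter, a modulo-`m` wrap, and one addition, which is why a small table is enough.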
The devised memory structure is particularly advantageous when switching between turbo and LDPC decoding; after the last turbo iteration, the extrinsic memories relative to the in-order half-iteration contain the data needed by the LDPC decoding process in the correct order. The memories relative to the interleaved-order half-iteration in turbo mode, to be used in LDPC mode, will only need to have the read and write addresses pass through the permutation law circuit.
V. IMPLEMENTATION
The decoder architecture described in Section IV has been implemented in 90-nm CMOS technology; synthesis and power estimation have been carried out with Synopsys Design Compiler, while the switching activity has been analyzed with Mentor Graphics ModelSim.
Several design choices, in particular the sizing of the memories, depend on the set of codes to be supported. The largest codes considered for the implementation are taken from the WiMAX standard: an LDPC code with block size 1920 and rate 5/6, and a DBTC with information block size 1920 and rate 1/3. Soft metrics have been quantized with nine and eight bits, with two bits of fractional part; the maximum number of iterations has been set as $It_{OC}$ = 10 for LDPC and $It_{IC}$ = 6 for turbo. The CCSDS standard foresees in [1] a wide range of possible throughputs for spacecraft-to-Earth communications, depending on the modulation scheme, frequency band, and type of mission. Downlink data rates supported by spacecraft employed in current missions vary considerably from one another. For example, the Curiosity rover deployed on the surface of Mars can communicate directly with Earth at 32 Kb/s; however, it can exploit two different orbiters (the Mars Odyssey Orbiter and the Mars Reconnaissance Orbiter) to reach data rates of up to 110 Kb/s and 6 Mb/s, respectively. The Cassini orbiter around Saturn has a maximum downlink data rate of 248.85 Kb/s, and the Kepler planet search spacecraft communicates at 4.33 Mb/s. Higher rates (up to 28 Mb/s) will be considered by the James Webb Space Telescope.
The throughput of the Cassini orbiter can be achieved by the proposed decoder architecture at 12 MHz (252 Kb/s); with this target frequency, the total area occupation is 0.98 mm². Thanks to the shared datapath approach, more than 90% of the LDPC datapath is included in the larger turbo datapath, with very few LDPC-exclusive components. This is also reflected in the power consumption estimate, which is 2.1 mW at 12 MHz. Memories occupy 82.6% of the decoder area and account for 70.1% of the total power consumption. Pipeline stages account for 10.3% of the area and 13.4% of the power consumption, with the remaining 7.1% of the area and 16.5% of the power consumption being taken by processing, addressing, and control logic. The implementation results show that this decoder has a smaller area and lower power consumption than most LDPC and turbo decoders [10], [32]-[34]. Obviously, due to the limited throughput target, the obtained throughput-to-area ratio is low with respect to values typical of most recent decoders developed for wireless communications. It yields, however, an energy efficiency of 120 Mb/s per Watt, outperforming the majority of the state of the art.
The 6 Mb/s required by the Mars Reconnaissance Orbiter are obtained by targeting a frequency of 286 MHz, for which the occupied area is 1.03 mm² and the power consumption is 56.6 mW. Although the energy efficiency is reduced to 106 Mb/s per Watt, this implementation of the decoder complies with most current deep-space downlink throughput requirements. The proposed architecture, however, can sustain even higher throughputs: 10.5 Mb/s have been obtained by synthesizing the presented decoder without any modifications, targeting a frequency of 500 MHz. The implementation yields an area occupation of 1.06 mm² and a power consumption of 111.9 mW. To achieve even higher throughputs, it is possible to reduce the system critical path by adding a pipeline stage in the EXT-α and EXT-β modules and another in the α and β modules; with these straightforward modifications, up to 14.5 Mb/s can be obtained. Another possible approach is to increase the degree of parallelism of the decoder; by subdividing the current memory structure into a number of smaller banks, multiple instances of the datapath can work concurrently, virtually multiplying the achievable throughput.
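The energy efficiency figures quoted for the three operating points follow directly from the reported throughput and power values, as the short check below shows. The efficiency at 500 MHz and the bits-per-cycle ratio are derived here from the quoted numbers and are not stated in the text.

```python
# (clock in MHz, throughput in Mb/s, power in mW), as reported for each synthesis.
OPERATING_POINTS = [
    (12, 0.252, 2.1),
    (286, 6.0, 56.6),
    (500, 10.5, 111.9),
]


def energy_efficiency(throughput_mbps, power_mw):
    """Throughput delivered per unit of power, in Mb/s per watt."""
    return throughput_mbps / (power_mw / 1000.0)


def bits_per_cycle(throughput_mbps, clock_mhz):
    """Decoded bits per clock cycle."""
    return throughput_mbps / clock_mhz


efficiencies = [energy_efficiency(t, p) for _, t, p in OPERATING_POINTS]
per_cycle = [bits_per_cycle(t, f) for f, t, _ in OPERATING_POINTS]
```

The first two efficiencies round to 120 and 106 Mb/s per Watt, matching the text, and the nearly constant ratio of about 0.021 bits per cycle reflects that throughput scales linearly with the clock frequency when the architecture is left unchanged.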
The state of the art is currently lacking extensive information about decoders aimed at deep-space communications, making the comparison between the concatenated FEC scheme implementation and alternative solutions unfeasible. The work in [35] presents a field-programmable gate array (FPGA)-based LDPC decoder for space communications; however, the considered near-Earth transmissions involve codes and specifications very different from deep-space links. Turbo codes are a more mature technology in the deep-space field, and various CCSDS-compliant turbo decoders are available on the market [36, 37] . However, very few scientific papers have been written on the subject. The work in [38] discusses the implementation of a CCSDS-compliant turbo decoder, but it is based on multiple off-the-shelf digital signal processors, lacking area occupation and power consumption details. Also, evaluating the area, power and energy efficiency gain of the proposed solution with respect to similar architectures is problematic. Shared datapath LDPC and turbo decoders are present in the literature, for which complete implementation results are provided [8, 9] . Their target applications are wireless communication standards like Third Generation Partnership Project long-term evolution (3GPP-LTE), WiMAX, WiFi, and DVB, for which BER and throughput requirements are extremely different from deep-space communications. These decoder designs are often based on high levels of parallelism, favoring speed over performance, especially in video broadcasting. For example, the decoder presented in [8] relies on a completely shared datapath. Since the target throughput ranges between 450 and 600 Mb/s, the internal parallelism of each decoding core can be close to 100, while the frequency is set to 500 MHz. To give full support to high-throughput communication standards, multiple instances of parallel cores are used. 
Consequently, while the concept of datapath sharing and turbo/LDPC code decoding behind the presented work and [8] is similar, the difference in throughput requirements results in diverging design choices that lead, in [8], to a more than threefold area occupation and an estimated × 180 factor in power consumption. Moreover, existing turbo/LDPC decoders target a different concept of flexibility with respect to the proposed architecture. In the proposed decoder, each frame must pass through both decoding modes; switching between them is consequently a fast operation that involves no overhead. On the contrary, frames in state-of-the-art turbo/LDPC decoders are handled by one decoding mode only, and a large number of frames is foreseen between mode switches. This implies that, in these decoders, reconfiguration time is a much less critical constraint than for the decoder proposed here, which is required to run both turbo and LDPC decoding on every single frame. While it is clear that a fair comparison with the state of the art cannot be performed, it is possible to get a sense of where the proposed decoder stands. The CCSDS-compliant RS decoder [39] and Viterbi decoder [40] yield a total area, normalized to 90 nm CMOS technology, of 0.63 mm². The RS + CC FEC schemes are consequently cheaper to implement than the proposed turbo/LDPC concatenation, but their performance is much worse. An additional evaluation can be made thanks to the resource utilization data given for the Lattice Semiconductor FPGA-based CCSDS turbo decoder [36]. Approximately 8000 look-up tables (LUTs) and 4000 registers are necessary for different Lattice devices. This work, implemented on a Xilinx Virtex 6 FPGA, requires 6000 LUTs and 1000 registers, having better performance while at the same time occupying a smaller area than [36].
VI. CONCLUSION
This work presents a unified turbo/LDPC decoder architecture for concatenated LDPC and turbo codes aimed at deep-space communications. The performance of such a FEC scheme is compared to that of FEC schemes currently used by the CCSDS standard, extending the evaluation of previous works; this solution greatly outperforms both CCSDS LDPC and turbo codes. The architecture of the joint turbo/LDPC decoder is described; it yields a high percentage of datapath sharing (>90% in the LDPC case) and completely shared memories. The novelty of the solution and the lack of similar implementations in the state of the art make a fair comparison impossible. The proposed decoder has been implemented, obtaining postlayout results that show very small area occupation (0.98 mm²) and low power consumption (2.1 mW).
