ABSTRACT Applying rotated and cyclic Q delayed (RCQD) modulation on the transmitter side improves the performance of the receiver over fading channels with erasures. However, the complexity of demapping in the receiver increases significantly to achieve this improved performance. Many low complexity demapping solutions are available, but few have actually been implemented in hardware. Recently, new constellation rotation angles and an associated simplified demapping technique, with better performance over fading channels with erasures, have been proposed. In this paper, we propose a novel demapper architecture model that exploits these latest simplifications. Moreover, a joint demapper implementation is proposed, which can demap the symbols from different constellations using either the full or the simplified demapping algorithm to achieve the best throughput in terms of LLR/sec. Compared to state-of-the-art implementations of RCQD demappers, significant hardware reductions are demonstrated, encouraging the use of RCQD modulation in future wireless communication applications.
I. INTRODUCTION
The number of new techniques proposed to improve transmission quality in digital communication applications has increased significantly over the past years. This trend will strengthen further in the coming years due to the emergence of new scenarios and requirements, as foreseen in the next generation standards for wireless communications. Unfortunately, many of the proposed techniques are either not adopted in emerging standards, or selected for a very limited range of applications due to their high implementation complexity. Therefore, the investigation of low complexity algorithms and area-efficient implementations is increasingly crucial in order to encourage and broaden the adoption of these advanced digital communication techniques.
In this context, it has been shown that applying signal space diversity (SSD) through rotated and cyclic Q delayed (RCQD) quadrature amplitude modulations (QAM) significantly improves error rate performance in deep fading channel conditions [1]. The significant performance gain in erasure channel conditions convinced the Digital Video Broadcasting (DVB) project to adopt this technique in the DVB-T2 standard [2] despite the implied digital hardware complexity. In fact, at the receiver side, the demapper must compute the two-dimensional Euclidean distance to all constellation points. The number of these constellation points is $M = 2^m$, where m denotes the number of bits per symbol of the rotated constellation, i.e. the complexity order of the demapper is O(M). Hence, this order becomes 4, 16, 64, and 256 for QPSK, 16-QAM, 64-QAM, and 256-QAM constellations respectively. Scaling the demapper hardware architecture to support this wide range of complexity, from 4 to 256, constitutes another issue. Initial hardware implementation efforts for RCQD demappers include [3]-[5]. In [3], [4], by exploiting sub-region simplification, a single core demapper is proposed that covers the range from the lowest complexity QPSK demapping to the highest complexity 256-QAM. On the other hand, the solution provided in [5] is a lower complexity and flexible application-specific instruction-set processor (ASIP) based demapper for multi-standard applications. This demASIP can implement both full complexity demapping and demapping based on sub-region simplification. The demASIP is suitable for a scalable architecture in which multiple instances work together for high complexity demapping, as shown in [6]. Later on, several additional low complexity solutions for RCQD QAM demodulation were proposed [7]-[11]. In these solutions, the complexity reductions are explained theoretically along with their effects on performance.
The simplifications proposed in [8] and [9] reduce complexity by 50% and 78% respectively, compared with the full complexity solution. A comparison of the Complexity-Reduced Max-Log Demapper Design (CRML) approach used in [11] with the Per Dimension Demapper (PD-DEM) approach adopted in [7] is given in [11]. It is shown that the CRML approach is less complex than PD-DEM. Moreover, actual FPGA implementation results are also provided. In [10], soft demapping based on the $\sqrt{M}$ best candidates is proposed, providing performance equal to that of the full complexity solution. However, if we compare the arithmetic operations (multiplications, additions/subtractions and comparisons) from the formulas provided in [10] and [11] separately, it appears that the solution proposed in [10] is more complex than that proposed in [11].
Currently, the constellation rotation angle used for 256-QAM in the DVB-T2 standard is α = 3.57°, which can be seen as

$$\alpha = \tan^{-1}\left(\frac{1}{\sqrt{M}}\right)$$

where M = 256. On the other hand, for QPSK, 16-QAM and 64-QAM the rotation angles do not comply with $\alpha = \tan^{-1}(1/\sqrt{M})$. Recently in [12], it was proposed to use a rotation angle $\alpha = \tan^{-1}(1/\sqrt{M})$ for all constellations. It is shown that with these newly proposed angles a gain of up to 0.75 dB in error rate performance can be achieved for a channel with a 15% erasure rate, whereas for a Rayleigh fading channel the performance is almost the same. Moreover, the demapping complexity reduces to $O(2\sqrt{M})$. In this context, in order to explore the potential of this new RCQD demapping solution from a hardware implementation perspective, we propose and develop in this paper a high-throughput, area-efficient hardware architecture, namely RCQDdemASIP.
The ASIP-based design approach was adopted because it shortens design time and enables the designer to consider the exact application requirements and to achieve any target trade-off between performance and flexibility. This trade-off can be freely tuned in a language-based ASIP design approach [13]. We show in practice how a small program memory and a simple fetch mechanism can provide full control of a pipelined data path to achieve the target throughput.
The rest of the paper is organized as follows. The next section presents the system model along with the algorithmic requirements of the demapper. Section III details the proposed ASIP architecture. Section IV illustrates the designed instruction set. Section V is dedicated to synthesis and performance results. Finally, Section VI concludes the paper.
II. SYSTEM MODEL
A. SSD USING RCQD CONSTELLATION
The SSD principle is implemented firstly through rotation of the QAM constellation by an angle α and secondly by interleaving the I and Q components [1]. The latter is achieved by sending the Q component with a delay of one symbol time, i.e. d = 1, with respect to the I component of the rotated QAM constellation symbol; hence the name RCQD. Here, the rotation breaks the independence between the in-phase (I) and quadrature (Q) components of the constellation signal, whereas, due to the delay d, the two transmitted components experience different fading, so the degree of diversity is increased. The rotated symbols x are computed from the non-rotated constellation symbols $\tilde{x}$ as per the following expression:

$$x = \tilde{x}\, e^{j\alpha} \tag{1}$$
After rotation, the $x_t^Q$ component of the rotated symbol $x_t$ is delayed by one symbol period with respect to the $x_t^I$ component. This delay of the Q component becomes much larger than one symbol duration due to the presence of interleavers in later stages of baseband processing in DVB-T2.
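The rotation and cyclic Q delay described above can be sketched as follows. This is a minimal illustration, not the DVB-T2 reference chain: it assumes a counter-clockwise rotation by α, a wrap-around (cyclic) delay within one block of symbols, and a hypothetical function name `rcqd_modulate`.

```python
import cmath
import math


def rcqd_modulate(symbols, alpha_rad, delay=1):
    """Rotate each symbol by alpha, then cyclically delay the Q component.

    Assumption: the delay wraps around inside the block, so the I part of
    symbol t is transmitted together with the Q part of symbol t - delay.
    """
    rotated = [s * cmath.exp(1j * alpha_rad) for s in symbols]
    n = len(rotated)
    return [
        complex(rotated[t].real, rotated[(t - delay) % n].imag)
        for t in range(n)
    ]


# Example: four QPSK symbols rotated by atan(1/sqrt(M)) with M = 4.
qpsk = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
alpha = math.atan(1 / math.sqrt(4))
for z in rcqd_modulate(qpsk, alpha):
    print(f"{z.real:+.4f} {z.imag:+.4f}j")
```

After this step, the I and Q parts of one original symbol travel in different transmitted cells, which is what lets them experience different fading.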
At the time RCQD was proposed, the selected angles were based on criteria provided in [1]. However, recently in [12], new angles were proposed which simplify the demapping process without degrading the performance for DVB-T2 scenarios. The rotation angle α for a constellation is given by:

$$\alpha = \tan^{-1}\left(\frac{1}{\sqrt{M}}\right) \tag{2}$$
The current constellation rotation angles for DVB-T2 and the newly proposed angles are given in Table 1. The rotated 16-QAM constellation with the newly proposed α is shown in Fig. 1.
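As a quick numerical check, the short sketch below evaluates the rule of [12] read as α = arctan(1/√M) for the four constellations. The function name is hypothetical, and the printed values are computed from that formula rather than copied from Table 1.

```python
import math


def proposed_rotation_angle_deg(M):
    """Rotation angle in degrees for an M-point constellation,
    assuming alpha = atan(1 / sqrt(M))."""
    return math.degrees(math.atan(1.0 / math.sqrt(M)))


for name, M in [("QPSK", 4), ("16-QAM", 16), ("64-QAM", 64), ("256-QAM", 256)]:
    print(f"{name:8s} M={M:3d}  alpha={proposed_rotation_angle_deg(M):.3f} deg")
```

For M = 256 this gives about 3.58°, which agrees with the 3.57° already used for 256-QAM in DVB-T2.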
B. CONSTELLATION SOFT DEMAPPING
1) FULL COMPLEXITY SSD SOLUTION
The maximum likelihood (ML) demapper computes the log-likelihood ratio (LLR) $v_t^i$ of the $i$-th bit of a symbol x transmitted at time t from any type of constellation (i.e. whatever the constellation rotation angle). It is given by:
where $i = 0, 1, \ldots, m-1$, $X_c^i$ with $c \in \{0, 1\}$ are the symbol sets of the constellation whose symbols have their $i$-th bit equal to c, $\tilde{y}$ is the received constellation symbol, ρ is the fading coefficient and $\sigma^2$ is the variance of the additive white Gaussian noise. In an OFDM-based system where the received symbol $\tilde{y}$ is equalized using ρ to form y before demapping, the expression in (3) becomes:
Where
is called the Channel State Information (CSI). To use (4) in the case of 16-QAM, the 16 Euclidean distances between y and all individual x are first computed and then used to generate 4 LLRs. For example, in order to compute the LLR of the most significant bit from the received symbol y, as shown in Fig. 1, the Euclidean distances to symbol set $X_1^3$ (shown in black) and symbol set $X_0^3$ (shown in gray) are used to find the minimum Euclidean distance in each set.
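The full complexity procedure just described (M distances, then one pair of minima per bit) can be sketched as below. This is an illustration under stated assumptions, not the exact expression (4): it assumes a max-log form, per-component CSI weights applied to the I and Q differences before squaring, and the sign convention "minimum over the bit-1 set minus minimum over the bit-0 set". The function name and the label/symbol pairing are hypothetical.

```python
def max_log_llrs(y, csi, constellation, m):
    """Max-log LLRs for one equalized symbol.

    y             : complex equalized symbol
    csi           : (csi_i, csi_q) weights for the I and Q differences (assumed form)
    constellation : list of (label, x) pairs, label an int of m bits, x complex
    m             : number of bits per symbol
    """
    inf = float("inf")
    # best[i][c] holds the smallest distance seen among symbols whose bit i equals c.
    best = [[inf, inf] for _ in range(m)]
    for label, x in constellation:
        d = (csi[0] * (y.real - x.real)) ** 2 + (csi[1] * (y.imag - x.imag)) ** 2
        for i in range(m):
            c = (label >> i) & 1
            if d < best[i][c]:
                best[i][c] = d
    # Assumed sign convention: a positive value favours bit value 0.
    return [best[i][1] - best[i][0] for i in range(m)]


# Toy QPSK example: bit 0 is carried by the sign of I, bit 1 by the sign of Q.
qpsk = [(0, 1 + 1j), (1, -1 + 1j), (2, 1 - 1j), (3, -1 - 1j)]
print(max_log_llrs(1 + 1j, (1.0, 1.0), qpsk, 2))
```

The loop body runs once per constellation point, which is where the O(M) complexity order of the full solution comes from.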
2) LOW COMPLEXITY 2D SSD SOLUTION
In [12], it is shown that if the constellation rotation angle is computed as per (2), then the number of Euclidean distances needed to compute the LLRs is reduced to $2\sqrt{M}$. To achieve this reduction, certain transformations are required. First of all, the symbols in the constellation are linearly transformed using the following expression:
where
and
for QPSK, 16-QAM, 64-QAM and 256-QAM respectively. The resulting linearly transformed rotated constellation diagram is shown in Fig. 2.
The equalized symbol y is transformed to Y on the new constellation using the following expression:
Once Y is obtained, two search spaces $T(Y^k)$, where k can be I or Q, are computed from $Y^I$ and $Y^Q$ through the following expression: (8) and thus the search region will contain symbols having Q component values from 0 to 3, i.e. $X_3$, $X_7$, $X_{11}$ and $X_{15}$. Using Y and these 8 constellation symbols X in place of y and x in (4), the LLRs can be computed. Hence, the complexity order is reduced from M = 16 to $2\sqrt{M} = 8$.
In the case of QPSK, the complexity order remains the same, whereas for 64-QAM and 256-QAM it reduces from 64 to 16 and from 256 to 32 respectively.
VOLUME 5, 2017
III. RCQDdemASIP ARCHITECTURE
We started with the design of an ASIP model capable of executing the recent low complexity demapping algorithm. As explained in the previous section, the QPSK demapping complexity remains the same whether we use the full complexity or the low complexity 2D SSD solution, so we opted to demap QPSK with the full complexity solution. It will be shown later that this option gives better throughput in the case of QPSK. Hence, the RCQDdemASIP architecture is designed to support both the full and the low complexity 2D SSD demapping solutions. In fact, very few extra resources are required to provide this algorithmic flexibility. This is also evident from the differences between the two demapping algorithms. In the low complexity algorithm, we perform transformations and then compute a reduced number of Euclidean distances, whereas in the full complexity algorithm we perform no transformations but compute the Euclidean distances to all constellation points.
In making our architectural choice, we selected hardware resources for the computation of one Euclidean distance per clock cycle in RCQDdemASIP. Hence, when demapping tasks of different complexity levels are present, one can activate as many demappers as required from a cluster of multiple RCQDdemASIPs. The RCQDdemASIP architecture is composed of memory interfaces, registers, and a 7-stage pipelined data path unit, as shown in Fig. 3.
A. MEMORY INTERFACE
The ASIP is interfaced with 4 different memories as shown in Fig. 3 .
1) PROGRAM MEMORY
The instruction width is 10 bits, and the ASIP supports an 8-bit address width, which is enough to accommodate the application program for 256-QAM demapping using the selected low complexity algorithm.
2) EQUALIZED SYMBOL (Y ) MEMORY
This memory is 20 bits wide and contains the set of equalized symbols y received from the OFDM interface for demapping.
Each of the $y^I$ and $y^Q$ components is 10 bits wide. A total of 15 address lines are provided in order to support the 32,400 QPSK data cells (modulated symbols) of a long FECFRAME in DVB-T2 [2].
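The address width can be checked with a couple of lines. The helper name below is hypothetical; it simply computes the smallest number of address bits that can index a given number of memory words.

```python
def address_lines_needed(num_words):
    """Smallest address width (in bits) able to index num_words locations."""
    return max(1, (num_words - 1).bit_length())


# 32,400 QPSK cells per long FECFRAME fit within 2**15 = 32,768 locations.
print(address_lines_needed(32400))  # prints 15
```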
3) CHANNEL STATE INFORMATION (CSI) MEMORY
This is an 8 bit memory containing CSI in the form of
. Again there are 15 address lines to support 32,400 QPSK symbols in long FECFRAME. 
4) CONSTELLATION MEMORY
This memory contains the precomputed constellation symbol information required by RCQDdemASIP, i.e. µ, $x^I$ and $x^Q$ for the full complexity solution and $X^I$, $X^Q$ for the low complexity demapping solution. The eight most significant bits are used for µ to support 256-QAM, whereas $x^I$/$X^I$ and $x^Q$/$X^Q$ are each represented in 8 bits as well. Ten address lines are provided so that multiple constellations can be stored. The access method to this memory during demapping can be explained with the help of Fig. 4. As mentioned earlier, since the QPSK demapping complexity remains the same with either demapping scheme, it is efficient to demap a QPSK symbol with the full complexity method. Hence, the first four locations of the Constellation Memory are reserved for QPSK symbols. For 16-QAM low complexity demapping, locations 4 to 19 contain the constellation symbols of Fig. 2 in $X_0, X_1, X_2, \ldots, X_{15}$ order, to access the symbols in the $T(Y^I)$ search space. On the other hand, for the $T(Y^Q)$ search space, the symbols are saved in $X_3, X_7, X_{11}, X_{15}, \ldots, X_{12}$ order at memory locations 20 to 35. The same strategy is used for the 64-QAM and 256-QAM constellations. If we take the first location address of a constellation in memory as ''Offset1_modulation_k'', compute the second offset, i.e. ''Offset2_modulation_T($Y^k$)'', from (8), and run a counter over $0, 1, \ldots, \sqrt{M} - 1$ for each search space, then their sum generates the addresses of the constellation symbols to be used in (4). The Constellation Memory address generation for both search spaces when demapping Y of Fig. 2 is shown in Fig. 4. For the full complexity demapping solution, only ''Offset1_modulation_I'' is used, with Offset2_modulation_T($Y^I$) = 0, and the counter runs from 0 to M − 1. This generates the addresses of all the symbols in the constellation.
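The address generation described above (base offset plus search-region offset plus counter) can be sketched as follows. This is only an illustration: the function name is hypothetical, Offset2 is passed in as a plain number because its computation from (8) is not reproduced here, and the counter is assumed to produce √M addresses per search space (M addresses for full complexity demapping).

```python
import math


def constellation_addresses(offset1, offset2, M, low_complexity=True):
    """Addresses read from the Constellation Memory for one search space.

    offset1 : first memory location of the stored constellation ordering
    offset2 : start of the search region inside that ordering (0 for full complexity)
    M       : constellation size
    """
    count = math.isqrt(M) if low_complexity else M
    return [offset1 + offset2 + k for k in range(count)]


# Full complexity QPSK: the four reserved locations 0..3.
print(constellation_addresses(0, 0, 4, low_complexity=False))
# 16-QAM, T(Y^I) ordering stored from location 4, with a hypothetical Offset2 of 8.
print(constellation_addresses(4, 8, 16))
# 16-QAM, T(Y^Q) ordering stored from location 20, with a hypothetical Offset2 of 0.
print(constellation_addresses(20, 0, 16))
```

Storing the constellation twice, once per ordering, is what allows each search space to be read with a simple incrementing counter.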
B. REGISTERS
A total of 20 registers of different sizes are used. Certain registers serve to hold configuration, loop repetition and channel data whereas others are used as pipeline registers to hold intermediate values. 
C. DATA PATH
The data path is made up of FEC, DEC, OPF, MUL, SQR, DIST and MIN pipeline stages as shown in Fig. 5 . The fetch mechanism has two possibilities of Program Counter (PC) update. PC increments by one in normal situation and returns to loop start address in case of repeat instruction. Decode stage is used to decode an instruction. In OPF stage, operands are fetched from Equalized Symbol and CSI Memory. In subsequent pipeline stages arithmetic operators are placed to compute Euclidean distances. In the last pipeline stage, Min Finders of [5] are placed.
D. CONTROL PATH
The ASIP control unit is based on 7-stage pipeline as mentioned above. It controls the flow of the program over the designed datapath during different stages of the pipeline.
IV. INSTRUCTION SET
At the top level there are 9 instructions, which provide the flexibility required for demapping different constellations. The main instructions of RCQDdemASIP are described below.
A. SET CONFIGURATION
Parameters related to the intended modulation type and the demapping algorithm to be used are passed through this instruction. The configuration information is stored in the respective registers.
B. INPUT DATA
The purpose of this instruction is first to input data from the Equalized Symbol Memory and the CSI Memory. Hence, the address is sent in the DEC stage, and the data is read and stored in the respective registers in the OPF stage. If the configuration is set to the low complexity demapping algorithm, the transformation of y into Y is performed by implementing (7) in the next two pipeline stages, MUL and SQR. The results are stored in the Y_I and Y_Q registers. Offset2, as defined in Section III-A.4, is also computed during the execution of this instruction. Because of these two extra clock cycles for the preparation of Y and Offset2, two no-operation (NOP) instructions are required before the LLR computation instruction can start. If the configuration is set to the full complexity algorithm, the value of y is stored directly in the Y_I and Y_Q registers in the OPF stage, and Offset2 is set to 0.
C. PROCESS LLR
During this instruction, the address of the Constellation Memory, as described in Section III-A.4, is generated in the DEC stage. In the OPF stage, the constellation symbol (x or X) is read and subtracted from the received symbol (y or Y), depending on the configured algorithm. The results are stored in the Diff_I and Diff_Q registers. The value of µ read from the Constellation Memory is sent to a pipeline register, as it is used in the last pipeline stage.
In the MUL stage, the computed differences are multiplied by the CSI and stored in the CSI_Diff_I and CSI_Diff_Q registers. The products of CSI and differences are squared in the SQR stage, and the squares are summed in the DIST stage to obtain the Euclidean distance. Finally, this distance and the value of µ coming through the pipeline register are used to find the minimum values using the Min. Finders proposed in [5], placed in the last pipeline stage.
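A behavioural sketch of one Process LLR instruction travelling through the OPF, MUL, SQR, DIST and MIN stages is given below. It is not the RTL and not the Min. Finder design of [5]: the class and function names are hypothetical, µ is interpreted as the bit label of the constellation symbol, the CSI is assumed to be a pair of per-component weights, and one running minimum is kept per bit and per bit value.

```python
class MinFinders:
    """Running minima per bit position and per bit value (assumed behaviour)."""

    def __init__(self, m):
        self.m = m
        self.reset()

    def reset(self):
        self.min0 = [float("inf")] * self.m
        self.min1 = [float("inf")] * self.m

    def update(self, dist, mu):
        # mu is read as the m-bit label of the symbol that produced dist.
        for i in range(self.m):
            if (mu >> i) & 1:
                self.min1[i] = min(self.min1[i], dist)
            else:
                self.min0[i] = min(self.min0[i], dist)

    def llrs(self):
        # Assumed sign convention: min over bit-1 set minus min over bit-0 set.
        return [a - b for a, b in zip(self.min1, self.min0)]


def process_llr(y, csi, x, mu, finders):
    """One distance computation, written stage by stage."""
    diff_i, diff_q = y[0] - x[0], y[1] - x[1]                    # OPF: Diff_I, Diff_Q
    csi_diff_i, csi_diff_q = csi[0] * diff_i, csi[1] * diff_q    # MUL: CSI_Diff_I/Q
    sq_i, sq_q = csi_diff_i ** 2, csi_diff_q ** 2                # SQR
    dist = sq_i + sq_q                                           # DIST
    finders.update(dist, mu)                                     # MIN
    return dist


finders = MinFinders(2)
for mu, x in [(0, (1, 1)), (1, (-1, 1)), (2, (1, -1)), (3, (-1, -1))]:
    process_llr((1, 1), (1, 1), x, mu, finders)
print(finders.llrs())
```

In hardware the five steps overlap across consecutive instructions, so one distance enters the Min. Finders every clock cycle once the pipeline is full.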
D. OUTPUT LLR
On execution of this instruction two LLRs per cycle are sent on output lines depending upon the configured modulation type.
E. REPEAT
In order to repeat the LLR generation process, a Loop instruction is part of the instruction set. We have made separate instructions for the loop count, loop start address and loop end address in order to reduce the instruction size. The number of symbols to demap is an input to the ASIP, whereas 6 bits are used to define the start and end addresses of the loop, which brings the instruction size to 10 bits. In order to execute two-level nested loops, Push Loop and Pop Loop instructions are also provided.
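The resulting program counter behaviour (increment normally, jump back to the loop start at the loop end while iterations remain) can be sketched for a single loop level as below. The function names and the exact counting convention are assumptions made for illustration; the Push Loop and Pop Loop nesting mechanism is not modelled.

```python
def next_pc(pc, loop_start, loop_end, remaining):
    """Return (next_pc, remaining) for a single-level hardware loop.

    remaining counts how many times the loop body still has to run,
    including the pass currently in progress.
    """
    if pc == loop_end and remaining > 1:
        return loop_start, remaining - 1
    return pc + 1, remaining


def trace(program_len, loop_start, loop_end, count):
    """List the sequence of fetched addresses for a small program."""
    pc, remaining, fetched = 0, count, []
    while pc < program_len:
        fetched.append(pc)
        pc, remaining = next_pc(pc, loop_start, loop_end, remaining)
    return fetched


# A 6-instruction program whose body (addresses 2..4) runs three times.
print(trace(6, 2, 4, 3))
```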
F. SAMPLE PROGRAM
A sample program to demap a 16-QAM symbol using the reduced complexity algorithm is shown in Fig. 5.
Initially, 4 instructions are used to set the configuration and loop-related parameters. The Input instruction is then used to read y and the CSI. Two NOPs are added to wait until Offset2 and Y are ready. The next four LLR processing instructions execute (4) using X from the $T(Y^I)$ search space. In the fourth instruction, the ''Last'' argument is passed to reset the counter used in generating the Constellation Memory address. The next four LLR processing instructions use X from the $T(Y^Q)$ search space. After the last LLR processing instruction, six clock cycles are required until the final LLRs are ready in the internal registers of the Min. Finders. Here, rather than inserting NOPs, processing of the next symbol is started. Hence, after fetching five instructions for the second symbol, the OUTPUT instruction is launched to output the LLRs. After this, the 6 remaining instructions to process the second symbol are executed. The sequence is repeated as per the value in the Loop Count Register. With this optimal use of the data path through the ASIP implementation, the number of clock cycles per demapped symbol is reduced from 17 to 12. Hence, the cycle count per 16-QAM symbol drops by about 30%, with a corresponding gain in overall throughput, at the expense of only a simplified fetch mechanism and a small program memory. Moreover, for the 7 other possible configurations (8 in total, from 4 constellations and 2 algorithms), similar optimizations can be performed to achieve the best throughput/area ratios. For the last symbol, the program leaves the loop and five NOPs are added before generating the LLRs for the last equalized constellation symbol.
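The 17-to-12 cycle figure can be reproduced with simple bookkeeping. The sketch below is one reading of the sample program for 16-QAM with the low complexity algorithm (1 Input, 2 NOPs, 8 LLR processing instructions, 5 wait slots, 1 Output); the constant and function names are hypothetical.

```python
INPUT = 1        # Input Data instruction
PREP_NOPS = 2    # wait for Y and Offset2
LLR_INSTR = 8    # 2 * sqrt(16) Process LLR instructions
WAIT_SLOTS = 5   # slots needed before the LLRs are valid in the Min. Finders
OUTPUT = 1       # Output LLR instruction


def cycles_without_overlap():
    """Per-symbol cycles when the wait slots are filled with NOPs."""
    return INPUT + PREP_NOPS + LLR_INSTR + WAIT_SLOTS + OUTPUT


def cycles_with_overlap():
    """Per-symbol cycles when the wait slots hold the next symbol's instructions."""
    return INPUT + PREP_NOPS + LLR_INSTR + OUTPUT


before, after = cycles_without_overlap(), cycles_with_overlap()
print(before, after)
print(f"cycle reduction: {100 * (before - after) / before:.1f}%")
```

Under this reading the five wait slots are exactly filled by the Input, the two NOPs and the first two LLR instructions of the following symbol, so no cycle is wasted in steady state.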
V. SYNTHESIS AND PERFORMANCE RESULTS

A. SYNTHESIS RESULTS
Processor Designer framework from Synopsys was used to describe the proposed ASIP architecture in LISA Architecture Description Language. Corresponding VHDL code was automatically generated along with required software development tools. Prototyping flow presented in [5] was followed for verification on FPGA. Vivado Design Suite from Xilinx for FPGA implementation was used. Targeting a Virtex-7 xc7vx690t device, Table 2 summarizes the synthesis results. 
B. PERFORMANCE RESULTS
Due to our architectural choice of scaling the RCQDdemASIP for one Euclidean distance computation per clock cycle, the difference in complexity between the full and the low complexity 2D SSD demapping algorithms is reflected in time (i.e. execution speed or throughput) rather than in space (i.e. area). The throughput results in terms of MLLR/sec for RCQDdemASIP are summarized in Table 3. All four constellation types ranging from QPSK to 256-QAM are considered. The formula to compute throughput in MLLR/sec is given below, where m represents the number of bits per modulated symbol, f is the clock frequency in MHz and c is the number of clock cycles needed to demap a symbol:

$$\text{Throughput} = \frac{m \times f}{c}$$
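A small helper makes the relation between these three quantities concrete, reading the definitions above as throughput = m · f / c. The function name is hypothetical, and the 300 MHz clock in the example is a placeholder chosen for round numbers, not a value taken from Table 2 or Table 3.

```python
def throughput_mllr_per_s(m, f_mhz, c):
    """Throughput in MLLR/sec for m bits per symbol, a clock of f_mhz MHz
    and c clock cycles per demapped symbol."""
    return m * f_mhz / c


# 16-QAM (m = 4) at 12 cycles per symbol, with a hypothetical 300 MHz clock.
print(throughput_mllr_per_s(4, 300.0, 12))
# Same configuration at 17 cycles per symbol, i.e. without instruction overlap.
print(throughput_mllr_per_s(4, 300.0, 17))
```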
As the proposed RCQDdemASIP architecture executes both demapping algorithms (low and full complexity), we evaluate the gain in terms of throughput rather than area. A significant gain in throughput is achieved when executing the low complexity demapping algorithm associated with the newly proposed rotation angles, compared to executing the full complexity demapping algorithm, which works with any constellation rotation angle. Based on the results of Table 3, using the new rotation angles with the low complexity demapping algorithm, the gain in throughput is 50% for 16-QAM, whereas for 64-QAM the throughput is more than 3 times higher. It can further be noticed that demapping QPSK with the full complexity algorithm gives better performance than demapping it with the low complexity solution. This is because, although the complexity remains the same with either demapping algorithm (as explained earlier in Section II-B.2), two extra clock cycles are required to perform the transformations needed in the low complexity algorithm. Furthermore, if we analyse the scaling of throughput from 256-QAM to QPSK modulations, we notice an additional clear benefit of the low complexity solution: high throughput can be achieved for all modulations using a single ASIP core.
The comparison with the state of the art is summarized in Table 4. The significant performance improvement of the RCQDdemASIP implementation with respect to the ASIP implementation proposed in [5], [6] can be explained as follows. First of all, (4) is implemented in RCQDdemASIP in place of (3), owing to the presence of an OFDM interface providing equalized symbols. Secondly, RCQDdemASIP does not support iterative demapping (ID), because most of the performance gain in deep fading comes from SSD and not from ID. As mentioned in [4], a gain of about 6.4 dB compared with typical bit-interleaved coded modulation (BICM) comes from SSD, and only an additional 0.75 dB gain is due to ID, for a transmission of 16-QAM rotated symbols at r = 4/5 in a deep fading channel with 15% erasures. Due to these factors, fewer memories need to be interfaced with RCQDdemASIP, and there are fewer pipeline stages and hardware resources than in the ASIP proposed in [6]. Consequently, RCQDdemASIP uses only about 36% of the slice LUTs and 54% of the slice FFs of the ASIP proposed in [6], and achieves more than 10% gain in throughput for 256-QAM modulation.
On the other hand, the work presented in [3], [4] is based on sub-region simplification. That solution uses almost 5 times more slice LUTs and more than 3 times as many DSP48 slices as RCQDdemASIP while implementing sub-region simplified demapping. However, the LLR output rate is almost the same. The drawback of this architecture is that it assumes a fixed input symbol rate of 6.2 MSymbols/sec for all types of constellation. Hence the throughput/area ratio decreases for low order constellations such as QPSK and 16-QAM; e.g. with the same resources, the throughput for QPSK will be 12.4 MLLRs/sec.
Finally, the work in [11] uses 7637 LUTs, 32764 FFs and 16 DSP48 slices. Although for this implementation the theoretical complexity is in the order of O(2 √ M ) which is same as our selected demapping algorithm, it achieves twice the LLR/sec throughput as compared to our solution by using 8 times more slice LUTs, more than 37 times slice FFs and 2.7 times more DSP48 slices.
Based on the above conducted comparisons, the proposed demapping implementation of RCQDdemASIP provides the best throughput and area efficiency. Moreover, due to the devised architectural choices, RCQDdemASIP supports its integration in a scalable architecture. Hence, with hardware solution presented here, this RCQD solution with new rotation angles and low complexity demapping exhibits better performances in terms of error rate, throughput and used area simultaneously.
VI. CONCLUSION
Several simplified schemes have been proposed for RCQD demapping; however, only a few hardware implementations are available. Recently, new constellation rotation angles have been proposed for RCQD modulation to simplify the demapping task at the receiver end while preserving, or even improving, the error rate performance. Although the associated demapping reduction factor was computed theoretically, no practical solution was available. In this paper we have evaluated the hardware resources required to implement this simplified demapping through the design of RCQDdemASIP. The presented ASIP design is able to execute both the full and the low complexity demapping algorithms to achieve the best throughput for the different constellations proposed for future digital video broadcast applications.
