Abstract-This paper proposes a novel hardware implementation strategy to achieve low-cost design for digital predistortion of radio frequency power amplifiers (PAs) using a modified decomposed vector rotation-based behavioral model. To make the model hardware friendly, we first modify the model into a subdecomposed format, which significantly reduces the computational complexity in model extraction. We then reassemble the coefficients and propose a simple digital implementation structure for real-time signal processing in the transmit path. A new dual-direction coordinate rotation digital computer design is also proposed to simultaneously calculate both the magnitude and the $e^{j\theta_n}$ values to facilitate the model implementation. To validate the hardware implementation, a wideband signal is employed to evaluate the performance with a Doherty PA. Experimental results show that the proposed approach achieves performance comparable to that of the conventional approaches with much lower system complexity.
I. INTRODUCTION

BENEFITING from the consistent scaling of CMOS technology, digital circuits are achieving excellent performance with low-cost hardware and low power consumption. In the radio frequency (RF) area, digital predistortion (DPD) has been widely applied to linearization of RF power amplifiers (PAs) to achieve high power efficiency and simultaneously maintain linear signal amplification.
In the past decades, many advanced DPD models have been developed, and the majority of the models used today are more or less modified from the Volterra series, such as memory polynomial (MP) [1]-[3], envelope-MP [4], generalized MP (GMP) [5], dynamic deviation reduction Volterra model [6], [7], and many others [8], [9]. Each of these models can achieve excellent performance in its specific application domain. However, because their basis functions are polynomial-based, the Volterra models have some inherent limitations. For instance, they are best suited to modeling continuously smooth and relatively weak nonlinear systems. With the continuous push toward wider bandwidth and higher efficiency, more and more advanced PA architectures, e.g., multiway/multistage Doherty, coherent multiband, and various switch-mode PAs, will emerge. In these systems, the PA can exhibit much stronger nonlinearities, and the PA nonlinear behavior becomes far more complex. The existing Volterra models face significant challenges in modeling and linearizing these PAs.
Recently, a completely new behavioral model was proposed in [10] . This model was derived from a modified form of the canonical piecewise linear (CPWL) function [11] , [12] using a decomposed vector rotation (DVR) technique. In this model, the nonlinear operation is achieved by using the "absolute" value operation, which is completely different from the polynomials that are used in the Volterra models. Theoretical analysis has shown that this model is much more flexible in modeling RF PAs with non-Volterra-like behavior and experimental results confirmed that the new model can produce excellent performance with a relatively small number of coefficients compared with conventional models.
Zhu [10] mainly focused on deriving the model structures and presenting system characterization methodologies. The model equation was verified only by software implementation, i.e., in MATLAB. Digital implementation of DPD is straightforward, and the implementation cost of these digital units is usually low relative to that of the high-power RF amplifiers. In future systems, however, the cost and power consumption of digital circuits must be watched closely. As the operating bandwidths of wireless systems continue to increase, the nonlinear behavior of the PAs becomes more complicated, so more complex DPD models are required to maintain linearity, and the more complex numerical operations they need increase the cost and power consumption. More importantly, in future networks, more and more small-cell base stations will be deployed [13], [14]. In these small cells, the output power of the PA becomes much lower, and thus the relative power budget assigned for DPD must be much lower too. Therefore, in future DPD development, not only the performance but also the hardware implementation of DPD, including the complexity of model coefficient extraction, must be carefully considered.
0018-9480 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
In this paper, we introduce a power-efficient and low-cost hardware implementation structure for implementing the DVR DPD model in digital circuits. First, we modify the DVR model into a subdecomposed format by alternately selecting line segments in the model construction. This approach results in a significant reduction in computational complexity in model extraction. In the transmit (Tx) path, we reassemble the coefficients and propose a very simple implementation structure that can dramatically reduce the complexity of signal generation in the predistorter. A new dual-direction coordinate rotation digital computer (DD-CORDIC) design is also proposed, which simultaneously calculates both the magnitude and the $e^{j\theta_n}$ values to facilitate the model implementation.
This paper is organized as follows. Section II reviews the DVR model and current implementation complexity, followed by the derivation of the subdecomposed DVR (SD-DVR) model and details of implementation methodologies in Section III. Section IV introduces the hardware structure of DD-CORDIC. Section V shows experimental results with the conclusion in Section VI.
II. REVIEW OF THE DVR MODEL
The DVR model [10] can be expressed as
where $\tilde{x}(n)$ and $\tilde{u}(n)$ are the baseband input and output, respectively. The inner $|\cdot|$ returns the magnitude of $\tilde{x}(n)$, while the outer $|\cdot|$ is the normal real-valued "absolute" operation. $\theta_n$ represents the phase of $\tilde{x}(n)$. $K$ denotes the number of partitions and $\beta_k$ is the threshold that defines the boundary of the partitions. $M$ represents the memory length. $\tilde{a}_i$ and $\tilde{c}_{ki,j}$ are the model coefficients. Because the nonlinear functions are composed in a piecewise manner, the DVR model does not have any restrictions on the shapes of the nonlinear curves. This model is therefore much more flexible than the Volterra models [10].
The generic block diagram of a DPD system is shown in Fig. 1: the input baseband signal $\tilde{x}(n)$ is predistorted by the predistorter block before being upconverted to RF frequency and sent to the PA. In order to extract the coefficients of the predistorter, a small fraction of the Tx signal is transferred back to baseband via a feedback loop. The model extraction unit compares the input and the output data and finds the coefficients for the predistorter.
A. Complexity of Model Extraction
Since the model is linear-in-parameters, general linear system identification algorithms, such as least squares (LS), can be applied in the model extraction. Let us rewrite the DVR in matrix format as
$$\mathbf{U}_{N \times 1} = \mathbf{X}_{N \times Q}\,\mathbf{C}_{Q \times 1} \tag{2}$$
where matrix $\mathbf{X}$ includes all the linear terms and various DVR interaction products constructed by using the input signal $\tilde{x}(n)$. The subscript $N$ indicates the total number of input samples and $Q$ represents the number of coefficients. Vector $\mathbf{C}$ contains the predistorter coefficients. The result vector $\mathbf{U}$ is the predistorted signal $\tilde{u}(n)$, which is then sent through the PA. Two architectures are generally employed for model extraction: direct learning architecture (DLA) and indirect learning architecture (IDLA) [15], [16]. The DLA is usually used in closed-loop systems and compares the PA output with the original input directly. The IDLA estimates the post-inverse of the PA first and then copies the coefficients to the pre-inverse block. By applying IDLA, the PA baseband output $\tilde{y}(n)$ is fed into the model, appearing as the model input instead of $\tilde{x}(n)$, to build the matrix $\mathbf{Y}$. The predistorted output, $\tilde{u}(n)$, is specified as the model expected output. The coefficients can then be extracted by using the LS algorithm as
$$\mathbf{C} = (\mathbf{Y}^{H}\mathbf{Y})^{-1}\mathbf{Y}^{H}\mathbf{U} \tag{3}$$
The computational complexity of the matrix operations in (3) is listed in Table I, where only the complex multiplications are counted, since the hardware cost of additions is negligible. A detailed analysis of the computational complexity of each matrix operation can be found in [17]. When the size of the matrix is small, the computational complexity is not a problem. However, if the nonlinearity is quite strong, around 8192 samples and 100 coefficients are normally required, and thus approximately 165.6 million complex multiplications are needed, which leads to large computational complexity and occupies a large digital chip area.
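The multiplication count quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes the LS solution is evaluated as four naive matrix operations (Gram matrix, inversion, product with Y^H, product with U) and uses textbook cost estimates for each. That split and the function name `ls_complex_mults` are assumptions made here, not values taken from Table I.

```python
def ls_complex_mults(n_samples, n_coeffs):
    """Naive complex-multiplication counts for C = (Y^H Y)^-1 Y^H U.

    Assumed split (an illustration, not copied from Table I):
      S1 = Y^H Y        -> N * Q^2
      S2 = inverse(S1)  -> Q^3
      S3 = S2 Y^H       -> Q^2 * N
      S4 = S3 U         -> Q * N
    """
    n, q = n_samples, n_coeffs
    return {"S1": n * q * q, "S2": q ** 3, "S3": q * q * n, "S4": q * n}


counts = ls_complex_mults(8192, 100)
total = sum(counts.values())
print(counts)
print(f"total = {total / 1e6:.2f} million complex multiplications")
```

Under these assumptions the total lands within a fraction of a percent of the figure quoted in the text, and the two products that involve the data matrix dominate the cost.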
B. Complexity of Predistorter Implementation
In the Tx path, the main complexity lies in implementing the DPD model. For the DVR model, the main operation is the magnitude decomposition and phase restoration process. Let us take the first-order basis
$$\sum_{k=1}^{K}\sum_{i=0}^{M}\tilde{c}_{ki,1}\,\big|\,|\tilde{x}(n-i)|-\beta_k\big|\,e^{j\theta_{n-i}} \tag{4}$$
as an example. The first step is to generate the magnitude and phase from the complex data. This can normally be conducted by using a CORDIC.
The second step is to sum all the decomposed components after multiplying with the coefficients. Since different delayed terms can reuse the same hardware structure, and the threshold index $k$ is irrelevant to $e^{j\theta_n}$, the core unit to be implemented in the DVR model is
$$\sum_{k=1}^{K}\tilde{c}_{k}\,\big|\,|\tilde{x}(n)|-\beta_k\big| \tag{5}$$
Implementing (5) is straightforward, as shown in Fig. 2, but it is costly. The $K$ threshold values $\beta_k$ are subtracted separately from $|\tilde{x}(n)|$, and the outcomes then go through absolute operations. The output is the summation of the $K$ magnitude decompositions $||\tilde{x}(n)|-\beta_k|$ multiplied with the $K$ complex coefficients $\tilde{c}_k$. If the summation is implemented in a Xilinx FPGA, $2K$ DSP48 units (each DSP48 unit contains one multiplier and one adder) are required for the real and imaginary parts together. With strong memory effects and nonlinearities involved in the model, the total number of DSP48s increases by a factor of $M \times N_{\text{term}}$, where $M$ represents the memory delay and $N_{\text{term}}$ designates the number of interaction products in the DVR model. The DSP48 unit is an important component in FPGA design: it requires dedicated hardware resources, occupies a large die area, and consumes most of the chip power. Using a large number of DSP48 units is not an economical choice.
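To make the cost of the direct structure concrete, the following sketch evaluates the core unit sample by sample: one complex-by-real product per threshold, followed by a sum. The uniform threshold spacing and the coefficient values are placeholders chosen for illustration, and `dvr_core_direct` is a hypothetical helper name.

```python
def dvr_core_direct(x, coeffs, thresholds):
    """Direct form of the core unit: sum_k c_k * | |x| - beta_k |.

    Each threshold costs one complex-by-real product, which is what
    drives the multiplier count of the direct hardware structure.
    """
    mag = abs(x)
    return sum(c * abs(mag - b) for c, b in zip(coeffs, thresholds))


K = 4
thresholds = [k / K for k in range(1, K + 1)]        # assumed uniform spacing
coeffs = [0.5 + 0.1j, -0.2j, 0.3 + 0j, 0.1 - 0.4j]   # illustrative values only
print(dvr_core_direct(0.6 + 0.1j, coeffs, thresholds))
```

In this form every one of the K products is computed for every sample, regardless of where the magnitude falls.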
III. SUBDECOMPOSED DVR
In order to reduce the implementation complexity and reduce power consumption, in this paper, we modify the DVR model to make it more hardware friendly without degrading the performance.
A. Model Modification
The magnitude decomposition in the DVR can be represented as
where the threshold values are defined as
Owing to the property of the absolute value operation, it is easy to see from the geometrical construction that these pairs of line segments are symmetrical about the threshold value $\beta_k$. In other words, multiple pairs of line segments are symmetrically split at each threshold point. In the literature [18], a simplicial CPWL (SCPWL) was proposed for nonlinear modeling. The SCPWL functions can be represented as
$$G_1(|\tilde{x}(n)|,k)=\begin{cases}\beta_k-|\tilde{x}(n)|, & |\tilde{x}(n)|<\beta_k\\ 0, & |\tilde{x}(n)|\ge\beta_k\end{cases} \tag{7}$$
$$G_2(|\tilde{x}(n)|,k)=\begin{cases}0, & |\tilde{x}(n)|<\beta_k\\ |\tilde{x}(n)|-\beta_k, & |\tilde{x}(n)|\ge\beta_k\end{cases} \tag{8}$$
Comparing SCPWL with DVR, we can find that the SCPWL is simpler to implement, because half of the terms are zero, but the disadvantage is that it contains only half of the geometrical segments of the DVR. For example, in (7), although $G_1(|\tilde{x}(n)|,k)$ depicts a two-segment piecewise linear function, the right segment overlaps with the x-axis over $|\tilde{x}(n)| \in [\beta_k, +\infty)$, making no contribution to the model fitting. In the same manner, the function response of (8) describes the other half of the segments of the DVR model, i.e., the right-direction components. Using only the single-direction SCPWL cannot accurately model the nonlinearities in the entire input range.
To guarantee the performance, in this paper, we propose to employ components in both directions to cover the whole input range. We alternately choose the left-direction segments on the odd-numbered positions from (7) and the right-direction segments on the even-numbered positions from (8) to create a new segment construction
$$F(|\tilde{x}(n)|,k)=\begin{cases}G_1(|\tilde{x}(n)|,k), & k=2\rho-1\\ G_2(|\tilde{x}(n)|,k), & k=2\rho\end{cases} \tag{9}$$
where $\rho$ is an integer index starting from 1. In (9), the first part depicts the line segments indexed by $k = 2\rho - 1$ and the second part describes the line segments on the even-numbered positions, defined by $k = 2\rho$. Because the line segments in (7) and (8) can be considered as the subdecompositions of the magnitude decomposition in the DVR model, the new model is called the SD-DVR.
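The alternating selection can be sketched in a few lines. The code below assumes the left-direction segment is max(beta_k - |x|, 0) and the right-direction segment is max(|x| - beta_k, 0), which matches the behavior described in the text (each is flat on one side of the threshold), and it assumes uniformly spaced thresholds purely for the demonstration. All function names are illustrative.

```python
def g_left(mag, beta):
    """Left-direction segment: beta - mag below the threshold, else 0."""
    return max(beta - mag, 0.0)


def g_right(mag, beta):
    """Right-direction segment: mag - beta above the threshold, else 0."""
    return max(mag - beta, 0.0)


def f_sd(mag, k, thresholds):
    """SD-DVR segment for 1-based index k: left if k is odd, right if even."""
    beta = thresholds[k - 1]
    return g_left(mag, beta) if k % 2 == 1 else g_right(mag, beta)


K = 4
thresholds = [k / K for k in range(1, K + 1)]   # assumed uniform spacing
mag = 0.6
dvr_terms = [abs(mag - b) for b in thresholds]
sd_terms = [f_sd(mag, k, thresholds) for k in range(1, K + 1)]
print("DVR   :", dvr_terms)
print("SD-DVR:", sd_terms)
```

For any magnitude, the two one-sided segments add up to the full absolute-value term, which is why each of them can be seen as a subdecomposition of the DVR magnitude decomposition.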
B. Complexity Reduction of Model Extraction
Interestingly, the computational complexity of model extraction for the SD-DVR is dramatically reduced. As discussed in Section II, to extract model coefficients, we first need to gather a set of data samples and then build data vectors and matrices for the LS estimation. The key matrix to be built is the matrix $\mathbf{Y}$ in (3), where each row includes the linear and nonlinear model terms. For instance, the first row can include $[\tilde{y}(n), \ldots]$, the second row has the form of $[\ldots]$, and so on. As explained earlier, the nonlinearity composition via DVR is the superposition of multiple pairs of line segments symmetrically split at each threshold point. Each pair of segments splits the input range into two parts, and which segment is chosen depends on the magnitude of the input signal sample compared with the threshold value: the right-direction segment is chosen if the input magnitude is greater than the threshold value; otherwise, the left-direction segment is chosen. Either way, all the nonlinear elements in the matrix $\mathbf{Y}$ have nonzero values, e.g., $(|\tilde{y}(n)|-\beta_1)e^{j\theta_n}$ and $(\beta_3-|\tilde{y}(n)|)e^{j\theta_n}$. If we use $\times$ to indicate a nonzero value, the matrix $\mathbf{Y}$ in the DVR model can be written as
$$\mathbf{Y}=\begin{bmatrix}\times & \times & \times & \cdots & \times\\ \times & \times & \times & \cdots & \times\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \times & \times & \times & \cdots & \times\end{bmatrix} \tag{11}$$
In the SD-DVR, we alternately select one segment from each pair in the original DVR, "flattening" the other segment. Therefore, for the same input samples, the SD-DVR has only approximately half as many valid segments as the DVR. As a result, approximately half of the nonlinear elements in the matrix $\mathbf{Y}$ are zeros.
These zeros directly affect the matrix multiplication operations in $S_1$ and $S_3$, because every product that involves a zero element can be skipped. Reducing the complexity of these two operations directly reduces the overall system complexity, as shown in more detail in Section V.
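A small numerical experiment illustrates how the zeros translate into skipped multiplications. The sketch builds only the nonlinear magnitude terms of one delay tap for K = 7, assuming uniformly spaced thresholds and uniformly distributed input magnitudes (both assumptions made for the demonstration, not taken from the measurements), and counts the products in Y^H Y that have two nonzero factors.

```python
import random


def sd_row(mag, thresholds):
    """Nonlinear magnitude terms of one SD-DVR row (single delay tap)."""
    row = []
    for k, beta in enumerate(thresholds, start=1):
        left_segment = (k % 2 == 1)
        row.append(max(beta - mag, 0.0) if left_segment else max(mag - beta, 0.0))
    return row


def gram_mults(rows):
    """Multiplications needed for Y^H Y when products with a zero are skipped."""
    return sum(sum(1 for v in row if v != 0.0) ** 2 for row in rows)


random.seed(1)
K, N = 7, 8192
thresholds = [k / K for k in range(1, K + 1)]   # assumed uniform spacing
mags = [random.random() for _ in range(N)]      # assumed uniform magnitudes
rows = [sd_row(m, thresholds) for m in mags]

zero_fraction = sum(row.count(0.0) for row in rows) / (N * K)
reduction = 1.0 - gram_mults(rows) / (N * K * K)
print(f"zero elements: {zero_fraction:.1%}, skipped products in Y^H Y: {reduction:.1%}")
```

With seven thresholds each row keeps either three or four nonzero terms, so roughly half of the elements vanish and roughly three quarters of the Gram-matrix products can be skipped, in line with the reductions reported in Section V.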
C. Complexity Reduction of Predistorter Implementation
When it comes to the hardware implementation of the predistorter, the core component of SD-DVR is
$$\sum_{k=1}^{K}\tilde{c}_{k}\,F(|\tilde{x}(n)|,k) \tag{13}$$
where $F(|\tilde{x}(n)|, k)$ is the nonlinear function defined by (9). To simplify the following derivation, the memory and high-order terms are omitted here. As discussed in Section II, direct implementation of (13) requires a large number of dedicated DSP units. In this section, we propose to implement (13) in a much simpler way. In the Tx chain, the model coefficients $\tilde{c}_k$ of the DPD are known after model extraction, and the partition thresholds $\beta_k$ are also predetermined, so the summation (13) depends only on the magnitude of the input signal, i.e., $|\tilde{x}(n)|$. Consequently, a low-cost implementation strategy can be applied, similar to that discussed in [19]-[21]. The idea is to merge all the valid coefficients together before multiplying with $|\tilde{x}(n)|$, as explained in the following.
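Assume that $|\tilde{x}(n)|$ is greater than a particular threshold value $\beta_{k-1}$ and less than the upper adjacent threshold $\beta_k$, and take the left-direction segments, i.e., the function with odd index 1, 3, . . . in (9), as an example. Only the coefficients with index $k$ or greater multiply with nonzero values, while the rest of the coefficients can be ignored because their magnitude decompositions are zeros. Thus, the summation of the left-direction segments is
$$-|\tilde{x}(n)|\cdot\sum_{i=\lceil (k+1)/2\rceil}^{\lceil K/2\rceil}\tilde{c}_{2i-1}+\sum_{i=\lceil (k+1)/2\rceil}^{\lceil K/2\rceil}\tilde{c}_{2i-1}\beta_{2i-1}.$$
Applying the same rule to the right-direction segments, the summation is
$$|\tilde{x}(n)|\cdot\sum_{i=1}^{\lfloor (k-1)/2\rfloor}\tilde{c}_{2i}-\sum_{i=1}^{\lfloor (k-1)/2\rfloor}\tilde{c}_{2i}\beta_{2i}.$$
By merging them together, (13) can be rewritten as
$$A_k\,|\tilde{x}(n)|+B_k \tag{14}$$
with
$$A_k=\sum_{i=1}^{\lfloor (k-1)/2\rfloor}\tilde{c}_{2i}-\sum_{i=\lceil (k+1)/2\rceil}^{\lceil K/2\rceil}\tilde{c}_{2i-1},\qquad B_k=\sum_{i=\lceil (k+1)/2\rceil}^{\lceil K/2\rceil}\tilde{c}_{2i-1}\beta_{2i-1}-\sum_{i=1}^{\lfloor (k-1)/2\rfloor}\tilde{c}_{2i}\beta_{2i}$$
where $\lfloor\cdot\rfloor$ rounds the elements to the nearest integers toward minus infinity and $\lceil\cdot\rceil$ rounds the elements to the nearest integers toward infinity. $A_k$ and $B_k$ represent gain and offset, respectively. The index $k$ is an integer between 1 and $(K + 1)$, categorizing $K + 1$ coefficient groups for different magnitude zones.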
In this arrangement, the implementation complexity and power consumption can be significantly reduced compared with the direct implementation. In the direct implementation, every input sample is first compared with the thresholds to obtain the magnitude differences, then multiplied with each coefficient, and finally summed together. In the proposed approach, after being compared with the thresholds, the input samples are directed into separate zones, defined by the threshold values, depending on their magnitude values, as shown in Fig. 3. For instance, for the input samples 0.6 + 0.1j and 0.1 − 0.04j, 0.6 + 0.1j is directed into zone 4, while 0.1 − 0.04j, which has a smaller magnitude, is directed into a lower zone.
The hardware implementation is shown in Fig. 4, where the sets of $A_k$ and $B_k$ are built into the lookup table (LUT) and the signal magnitude $|\tilde{x}(n)|$ is used as the index to select which set of coefficients is used for the output calculation. In this implementation, the nonlinear signal processing requires only one complex multiplication and one complex addition (two DSP48 units). Compared with the hardware cost of $2K$ DSP48s in the traditional implementation in Fig. 2, the new design substantially cuts down the computational complexity and hardware cost. The detailed improvement in hardware efficiency is shown in Section V.
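The coefficient merging can be sketched as follows. The code precomputes one gain and one offset per magnitude zone and checks that the single multiply-add reproduces the direct summation over all segments. It assumes left-direction segments of the form max(beta_k - |x|, 0) on odd indices and right-direction segments max(|x| - beta_k, 0) on even indices, and uses K = 5 uniformly spaced thresholds with made-up coefficients; none of these numerical values come from the measurements, and the function names are hypothetical.

```python
def build_lut(coeffs, thresholds):
    """Precompute (A_z, B_z) for zones z = 1..K+1 so that
    sum_k c_k * F(|x|, k) == A_z * |x| + B_z inside zone z."""
    K = len(thresholds)
    lut = []
    for zone in range(1, K + 2):
        gain, offset = 0j, 0j
        for k in range(1, K + 1):
            c, beta = coeffs[k - 1], thresholds[k - 1]
            if k % 2 == 1 and k >= zone:     # left segment active: beta_k - |x|
                gain -= c
                offset += c * beta
            elif k % 2 == 0 and k < zone:    # right segment active: |x| - beta_k
                gain += c
                offset -= c * beta
        lut.append((gain, offset))
    return lut


def zone_index(mag, thresholds):
    """1-based zone: number of thresholds not exceeding mag, plus one."""
    return 1 + sum(1 for beta in thresholds if mag >= beta)


def sd_core_lut(x, lut, thresholds):
    """LUT form: one multiply and one add per sample."""
    mag = abs(x)
    gain, offset = lut[zone_index(mag, thresholds) - 1]
    return gain * mag + offset


def sd_core_direct(x, coeffs, thresholds):
    """Reference: explicit sum over all K segments."""
    mag = abs(x)
    total = 0j
    for k, (c, beta) in enumerate(zip(coeffs, thresholds), start=1):
        seg = max(beta - mag, 0.0) if k % 2 == 1 else max(mag - beta, 0.0)
        total += c * seg
    return total


K = 5
thresholds = [k / K for k in range(1, K + 1)]                      # assumed spacing
coeffs = [0.3 + 0.1j, -0.2 + 0.05j, 0.1j, 0.4 + 0j, -0.1 - 0.2j]   # illustrative
lut = build_lut(coeffs, thresholds)
for x in (0.6 + 0.1j, 0.1 - 0.04j):
    print(zone_index(abs(x), thresholds), sd_core_lut(x, lut, thresholds))
```

With this particular threshold choice the sample 0.6 + 0.1j lands in zone 4 and 0.1 − 0.04j in zone 1. The table holds only K + 1 complex pairs, so it is rebuilt cheaply whenever the coefficients are re-extracted.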
IV. DUAL-DIRECTION CORDIC
The overall model structure has been addressed. The next task is to obtain the magnitude $|\tilde{x}(n)|$ and the phase restoration information $e^{j\theta_n}$ to feed into the model. The general procedure is: 1) produce $|\tilde{x}(n)|$ and the angle $\theta_n$ from $\tilde{x}(n)$ by using the CORDIC algorithm [22], [23]; 2) obtain $e^{j\theta_n}$ according to the value of $\theta_n$: $e^{j\theta} = \cos\theta + j\sin\theta$; and 3) feed $|\tilde{x}(n)|$ and $e^{j\theta_n}$ into the DVR model to finish the remaining computation. This procedure is straightforward, but it involves multiple calculations.
If we look at the equation again carefully, we find that we do not actually need to calculate the phase $\theta_n$. What we need to know is $e^{j\theta_n}$, which is a unit complex number. Rewriting $\tilde{x}(n)$ as $|\tilde{x}(n)|e^{j\theta_n}$, we find that the only difference between $\tilde{x}(n)$ and $e^{j\theta_n}$ is their magnitude: $|\tilde{x}(n)|$ for $\tilde{x}(n)$ and one for $e^{j\theta_n}$. The two complex numbers share the same phase.
The basic idea of CORDIC is to iteratively rotate the vector to a new vector ending up with zero phase, so that the real part of the new vector is the magnitude, as the imaginary part is zero. For instance, as shown in Fig. 5(a), to generate the magnitude, the vector $\tilde{x}$ is rotated clockwise by phase $\theta$ to the Re-axis, so that the magnitude of $\tilde{x}$ can be obtained by reading the real part of the vector A. The same approach can be applied in reverse to obtain $e^{j\theta}$. The trick here is to add a unit vector, whose coordinate is E(1, 0), shown in Fig. 5(b). If we rotate it counterclockwise by the same phase with constant radius 1, we obtain the vector B, which points in the same direction as the vector $\tilde{x}$. The real and imaginary parts of $e^{j\theta}$ can then be read from the vector B; they are $\cos\theta$ and $\sin\theta$, respectively. The clockwise and counterclockwise rotations can be operated in a coordinated manner by using one set of shared digital logic, as described in the following.
A. Magnitude |x| Generation
First, a detailed demonstration of calculating the magnitude by using CORDIC is introduced. For example, for a complex numberx = I + j Q, to obtain its magnitude, the first step is to make sure its phase is located in the range of [−90°, 90°]. This can be conducted by checking the sign of Q. If Q is positive, rotate the vector clockwise by 90°, otherwise counterclockwise by 90°.
The following operation is to rotate the vector step by step, approaching the Re-axis. Let us define a complex number as "$1 + jR$," whose phase can be represented as $\tan^{-1}(R)$. Therefore, adding a phase of $\tan^{-1}(R)$ to the current vector (counterclockwise rotation) is equivalent to multiplying with "$1 + jR$," whereas subtracting it is equivalent to multiplying with "$1 - jR$." The sign of ± indicates the rotation direction. Importantly, the value of $R$ decreases in powers of two after each rotation [22], starting with $2^{0} = 1$ and then $2^{-1}$, $2^{-2}$, $2^{-3}$, $2^{-4}$, and so on. The corresponding rotation phase $\tan^{-1}(R)$ approaches 0 after a number of iterations.
The iterative rotation can be described as
$$\begin{bmatrix} I_{i+1}\\ Q_{i+1}\end{bmatrix}=\begin{bmatrix}1 & -\sigma_i 2^{-i}\\ \sigma_i 2^{-i} & 1\end{bmatrix}\begin{bmatrix} I_i\\ Q_i\end{bmatrix} \tag{15}$$
where $I_i$ and $Q_i$ represent the real and imaginary parts of the complex number, respectively, and $i$ designates the $i$th rotation. $\sigma_i$ takes the value of −1 or 1, which decides the direction of rotation. The gain of each rotation is $\sqrt{1+2^{-2i}}$, and the gains of all rotations can be compensated together. In digital logic, multiplying with powers of two can be easily conducted by using bit shifts instead of actual multipliers. The CORDIC therefore significantly reduces the implementation cost.
To further demonstrate how the CORDIC works, a step-by-step illustration is shown in Fig. 6. The following rotation details are listed in Table III. We take $\tilde{x}_1$ as an example to illustrate the iterative process in Fig. 6(c). Because $Q_1 < 0$, the next rotation adds the phase of $\tan^{-1}(2^{-1}) = 26.57°$ to the current vector $\tilde{x}_1$. We need to multiply "$1 + j \times 2^{-1}$" with "$I_1 + jQ_1$" to obtain the new vector $\tilde{x}_2$. According to (15), the real value of $\tilde{x}_2$ is $I_2 = I_1 - 2^{-1} \times Q_1$, which is computed by shifting $Q_1$ to the right by 1 bit and subtracting the result from $I_1$, without considering the error compensation $\delta_1$. The same binary arithmetic can be performed on the imaginary part: $Q_2 = 2^{-1} \times I_1 + Q_1$. In the second rotation, because $Q_2 > 0$, the phase of $\tan^{-1}(2^{-2}) = 14.04°$ is subtracted from the current vector $\tilde{x}_2$. The rest of the procedure is carried out in the same manner. After a certain number of rotations, the value on the Re-axis can be approximated as the magnitude of the vector, since the phase approaches zero.
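The elementary angles quoted in the walk-through, and the accumulated gain that has to be compensated, follow directly from the powers-of-two rule. The short sketch below tabulates both for nine rotations, the iteration count used later in the experiments. Floating-point arithmetic stands in for the fixed-point logic, and `cordic_tables` is a hypothetical helper name.

```python
import math


def cordic_tables(n_iter):
    """Elementary angles atan(2^-i) in degrees and the accumulated gain
    prod_i sqrt(1 + 2^(-2i)) for i = 0 .. n_iter - 1."""
    angles = [math.degrees(math.atan(2.0 ** -i)) for i in range(n_iter)]
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n_iter))
    return angles, gain


angles, gain = cordic_tables(9)
print([round(a, 2) for a in angles[:3]])
print(round(gain, 4))
```

Because the rotation directions change only the signs and not the magnitudes of the elementary steps, the accumulated gain is a constant for a given iteration count and can be folded into a single correction factor.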
B. Phase Restoration Information e j θ Generation
As discussed earlier, the value of $e^{j\theta}$ can be obtained by concurrently rotating a vector E by exactly the same phases as the vector $\tilde{x}$, but in the opposite direction. To illustrate this process, we add the other trace to the CORDIC trace of rotating $\tilde{x}$ and label its iteration steps with the "prime" sign, as shown in Fig. 7. When "S" moves to "0," "S′" moves to "0′." With the aid of the iteration trace, E(1, 0) can be rotated all the way back to obtain $e^{j\theta}$. When the vector $\tilde{x}$ reaches the Re-axis, vector E ends at the position indicating the same phase $\theta$ as the original vector $\tilde{x}$, directly offering the value of $e^{j\theta} = \cos\theta + j\sin\theta$.
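Based on the above-mentioned analysis, the iterative rotation for $e^{j\theta}$ can be described as
$$\begin{bmatrix} I'_{i+1}\\ Q'_{i+1}\end{bmatrix}=\begin{bmatrix}1 & \sigma_i 2^{-i}\\ -\sigma_i 2^{-i} & 1\end{bmatrix}\begin{bmatrix} I'_i\\ Q'_i\end{bmatrix} \tag{16}$$
where the difference between (15) and (16) is the sign of $\pm\sigma_i$, which sets the rotation direction. Therefore, the dual-direction iterative matrix can be built as
$$\begin{bmatrix} I_{i+1}\\ Q_{i+1}\\ I'_{i+1}\\ Q'_{i+1}\end{bmatrix}=\begin{bmatrix}1 & -\sigma_i 2^{-i} & 0 & 0\\ \sigma_i 2^{-i} & 1 & 0 & 0\\ 0 & 0 & 1 & \sigma_i 2^{-i}\\ 0 & 0 & -\sigma_i 2^{-i} & 1\end{bmatrix}\begin{bmatrix} I_i\\ Q_i\\ I'_i\\ Q'_i\end{bmatrix} \tag{17}$$
where the first two rows (process A) define the calculation of the magnitude $|\tilde{x}|$, starting from vector $\tilde{x}$, and the last two rows (process B) describe the generation of $e^{j\theta}$, rotating from vector E. After several iterations, e.g., $N$, the value of $Q_N$ will be close to 0 and the rotation stops. The magnitude $|\tilde{x}|$ and $e^{j\theta}$ can then be obtained from $I_N$ and $I'_N + jQ'_N$, respectively, once the accumulated rotation gain has been compensated.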
Two rotation traces can be implemented by using the same digital logic, as the rotation mechanisms are exactly the same, except that they add opposite phases in each rotation. Therefore, the proposed CORDIC is named Dual-direction CORDIC in this paper.
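The dual-direction idea can be prototyped in software before committing to hardware. The sketch below is a floating-point model under stated assumptions: the pre-rotation by 90° is applied to both vectors in opposite senses, the direction flag is taken from the sign of the imaginary part of the vector being driven to the real axis, and the gain is divided out at the end rather than folded into coefficients. The name `dd_cordic` and the prime-free variable names are illustrative.

```python
import math


def dd_cordic(x, n_iter=9):
    """Return (|x|, e^{j*theta}) estimates using shared rotation decisions.

    Process A rotates x toward the real axis; process B applies the opposite
    rotation to the unit vector E = (1, 0). Floats stand in for the
    shift-and-add fixed-point arithmetic of a hardware implementation.
    """
    i_a, q_a = x.real, x.imag
    i_b, q_b = 1.0, 0.0
    # Pre-rotation by 90 degrees so that the phase of A lies in [-90, 90].
    if q_a > 0:
        i_a, q_a = q_a, -i_a      # A: clockwise (multiply by -j)
        i_b, q_b = -q_b, i_b      # B: counterclockwise (multiply by +j)
    else:
        i_a, q_a = -q_a, i_a      # A: counterclockwise (multiply by +j)
        i_b, q_b = q_b, -i_b      # B: clockwise (multiply by -j)
    gain = 1.0
    for i in range(n_iter):
        r = 2.0 ** -i
        sigma = 1.0 if q_a < 0 else -1.0   # +1 rotates A counterclockwise
        i_a, q_a = i_a - sigma * r * q_a, q_a + sigma * r * i_a
        i_b, q_b = i_b + sigma * r * q_b, q_b - sigma * r * i_b
        gain *= math.sqrt(1.0 + r * r)
    return i_a / gain, complex(i_b, q_b) / gain


for x in (0.6 + 0.1j, -0.3 - 0.4j):
    mag, ejt = dd_cordic(x)
    print(mag, abs(x), ejt, x / abs(x))
```

Since both processes use the same shift amounts and the same direction flag with opposite sign, a single decision per iteration drives both traces, which is what allows the logic to be shared.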
C. Time Multiplexing Dual-Direction CORDIC Design
Time-division multiplexing has been widely used in digital design to reduce hardware cost at the expense of increased clock rates. Facilitated by the time-division multiplexer, processes A and B in (17) can be implemented by using only one dual-direction rotation structure, shown in Fig. 8. At the sampling frequency, the data stream $\tilde{x}(n)$ and E are fed into the digital block, and the multiplexer, cooperating with CS, alternately captures vectors $\tilde{x}(n)$ and E at twice the sampling rate.
Since the DD-CORDIC includes many iterative rotation units, which are pipelined in a sequence, the structure of one dual-direction rotation unit is highlighted by the dashed rectangle. The rotation phase of each rotation unit is preset and the imaginary value of vectorx(n) decides the rotation direction. The rotation process is accomplished with shifts and adders. In the next clock cycle, CS signal flips the direction of rotation, operating an opposite rotation upon vector E. The alternate data flow ofx(n) and E within DD-CORDIC is double the speed of the input sampling rate. To explain the variation of data flow, a time sequence for the DD-CORDIC is shown in Fig. 9 . In the end, the outputs |x(n)| and e j θ n are downsampled to the original sampling speed, in order to maintain a consistent communication rate between different blocks.
Note that only shifters and adders are involved in the dual-direction rotation unit, while the compensation gain $\delta_{\text{total}}$ is omitted here. This is because, normally, the gain $\delta_{\text{total}}$ either does not need to be compensated or can be compensated in another procedure. In our test, we multiply $\delta_{\text{total}}$ with the coefficients $\tilde{c}_{ki}$, compensating the gain error in the offline characterization process.
V. EXPERIMENTAL RESULTS
To validate the proposed approach, various experimental tests were conducted. Regarding model extraction, the computational complexity for the proposed SD-DVR model was discussed in comparison with that of the DVR model. After extracting the coefficients, the DPD block was implemented on the FPGA, and linearization performance was measured by normalized mean-square errors (NMSEs) and adjacent channel power ratio (ACPR). Comparisons were also given in terms of hardware resource utilization of the two models.
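For readers who want to reproduce the accuracy metric, a minimal NMSE helper is sketched below. It assumes the commonly used definition, the error energy normalized by the reference energy and expressed in decibels; the paper does not spell out its exact formula, so this definition and the name `nmse_db` are assumptions.

```python
import math


def nmse_db(reference, measured):
    """NMSE in dB: 10*log10( sum|measured - reference|^2 / sum|reference|^2 )."""
    err = sum(abs(m - r) ** 2 for r, m in zip(reference, measured))
    ref = sum(abs(r) ** 2 for r in reference)
    return 10.0 * math.log10(err / ref)


ref = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
meas = [r * 1.01 for r in ref]        # 1 % gain error on every sample
print(round(nmse_db(ref, meas), 2))
```

Under this definition, an NMSE of about −40 dB corresponds to a residual error of roughly 1% of the signal amplitude.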
The DPD test platform employed was the same as that used in [10] , which includes a PC, a baseband FPGA board, an RF board, and a PA, as shown in Fig. 10 . The baseband inphase and quadrature (I/Q) digital signal source was generated by using either software in MATLAB or FPGA hardware board. The baseband signal was then sent to the RF board for modulation and upconversion to RF frequency by the Tx chain and finally sent to the PA. To reduce peak-to-average power ratio (PAPR), a crest factor reduction block was applied on the input signal. For model extraction, a fraction of the output signal was downconverted to baseband and captured by the receiver (Rx) chain through the feedback path and sent back to the PC. The time alignment and model extraction were conducted off-line in MATLAB.
The PA under test was an in-house designed broadband Doherty PA operating at 2.14 GHz and excited by a 60-MHz 12-carrier UMTS signal with 6.5-dB PAPR and 32.36-dBm average output power. Approximately 16 000 I/Q samples were saved at a sampling rate of 368.64 MSPS. Recorded complex input and output samples were time aligned and normalized before training the model. The FPGA board employed for DPD hardware implementation was a Virtex-7 XC7VX485T, whose operating clock frequency was set to 260 MHz (and 520 MHz for the multiplexer clock of the DD-CORDIC). The same nonlinear terms, the second-order types-1, 2, and 3 in [10], of the DVR and the SD-DVR were selected from (1) and (10), respectively. $K$ was set to 7 and $M$ to 3, resulting in 74 coefficients in total.
A. Model Extraction
The LS algorithm was chosen for model extraction and 8192 training samples were used. As analyzed earlier, the LS process can be split into the $S_1$, $S_2$, $S_3$, and $S_4$ operations. Multiplications $S_2$ and $S_4$ were considered as the same matrix operations for both the DVR and the SD-DVR. Because the elements in the $\mathbf{Y}$ matrices of the two models have different values, the computational complexities of $S_1$ and $S_3$ are different.
The matrix $\mathbf{Y}$ of the SD-DVR contained roughly half zeros and half valid values. In contrast, the elements in the matrix $\mathbf{Y}$ of the DVR were all valid values. To compare the complexity, we first took only the nonlinear terms into account, which involved 70 coefficients. The number of required complex multiplications and the complexity reduction of each operation between the two models are summarized in Table IV. For $S_1$, each element in both $\mathbf{Y}$ and $\mathbf{Y}^H$ had an approximately 50% chance of being zero, leading to a 74.81% complexity reduction. Step $S_3$ was affected only by the matrix $\mathbf{Y}^H$, reducing computation by 49.96%. In total, a saving of approximately 62% can be made in the complex multiplications. Including all the terms, i.e., the complete 74 coefficients, the complexity comparison is shown in Table V. Even though the linear terms use the same resources in both models, the total resource saving of the complete model extraction is still around 60%, which is significant.
B. Comparisons of DPD Performances
The obtained coefficients were applied in the DPD hardware implementation. For comparison, we implemented the DVR and the SD-DVR differently. For the DVR, we used the standard CORDIC, where we first calculated both the magnitude and the phase $\theta_n$ and then obtained $e^{j\theta_n}$ according to the phase $\theta_n$. In the SD-DVR, we applied the DD-CORDIC to obtain the magnitude and $e^{j\theta_n}$ simultaneously. Regarding the overall structure of the DPD model, the straightforward strategy in Fig. 2 was deployed for the DVR, while the SD-DVR used the simplified structure (14), shown in Fig. 4, to save hardware resources.
For the purpose of fair comparison, some standards of hardware design were kept the same for both approaches. For example, each I/Q signal of both input and output was kept as 32-b data, in which the least significant 16 b indicated the real part and the most significant 16 b designated the imaginary part. The bit precision in the intermediate process, such as adders, multiplexers, and configuration of DSP48 units, remained the same for both approaches. To satisfy the desired accuracy, the rotation number of CORDIC and DD-CORDIC was 9.
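The 32-bit sample layout described above can be mimicked in software when exchanging test vectors with the FPGA. The sketch below assumes signed Q1.15 fixed-point scaling with saturation for each 16-bit field; the text specifies only the field positions (real part in the low half, imaginary part in the high half), so the number format and the helper names are assumptions.

```python
def to_q15(value):
    """Quantize a float in [-1, 1) to a signed 16-bit integer (assumed Q1.15)."""
    return max(-32768, min(32767, int(round(value * 32768))))


def pack_iq(real, imag):
    """Pack into 32 bits: real in the low 16 bits, imaginary in the high 16."""
    return ((to_q15(imag) & 0xFFFF) << 16) | (to_q15(real) & 0xFFFF)


def unpack_iq(word):
    """Inverse of pack_iq, returning (real, imag) as floats."""
    def signed16(v):
        return v - 0x10000 if v & 0x8000 else v
    return signed16(word & 0xFFFF) / 32768, signed16((word >> 16) & 0xFFFF) / 32768


word = pack_iq(0.5, -0.25)
print(hex(word), unpack_iq(word))
```

Masking with 0xFFFF before shifting keeps the two's complement bit pattern of a negative field from spilling into the other half of the word.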
To verify the linearization performance, we compared not only the performances of the two models, but also the performances of the software/MATLAB DPD implementations and the corresponding hardware/FPGA ones. As expected, after cascading the DPD and the PA chain, the PA nonlinearities were almost completely removed by both hardware and software predistorters. It can be seen from Table VI that the ACPRs (±5 MHz) of the 60-MHz signal in all the cases were suppressed from −28 to −51 dBc, and the NMSEs were reduced to around −40 dB. The same DPD performance was achieved in hardware as in software, which demonstrated the high accuracy of the hardware DPD implementation.
Moreover, the spectrum comparison between DVR and SD-DVR (hardware only) is shown in Fig. 11 . To illustrate performance, AM/AM and AM/PM characteristics after linearization via hardware DPDs are shown in Fig. 12 . Those results validated that both hardware implementations achieved the same linearization level.
C. Comparisons of DPD Resource Utilizations
The hardware implementation process was conducted in five steps. The first step was to calculate the magnitude and $e^{j\theta_n}$. The linear term, i.e., $\sum_{i=0}^{M}\tilde{a}_i\tilde{x}(n-i)$, was regarded as the second step, which was identical for both models. The third step was to implement the summation of nonlinear terms, e.g., $\sum_{k=1}^{K}\tilde{c}_{ki}F(|\tilde{x}(n-i)|,k)$. Phase restoration, the various interactions of present and past samples, memory delays, and the final nonlinear summation were carried out in step 4. The linear and nonlinear terms were summed in the last step. The other resources were used to synchronize the signal. The resource utilizations of the DVR and the SD-DVR are listed in Tables VII and VIII, respectively. Overall, step 1 used 60% fewer flip-flops and 34% fewer slice LUTs after the DD-CORDIC was applied. Importantly, the low-cost structure in step 3 saved 85.7% of the DSP48 units, 98.11% of the flip-flops, and 95.31% of the slice LUTs, compared with the traditional implementation.
VI. CONCLUSION
This paper has proposed the SD-DVR model for PA linearization, for which the computational complexity of model extraction is significantly reduced. Moreover, we have introduced an efficient hardware structure for the predistorter to further reduce implementation complexity and power dissipation, facilitating future small-cell applications. A DD-CORDIC has also been proposed to simultaneously calculate both the magnitude and the $e^{j\theta_n}$ values. The experimental results and hardware resource utilization have been provided to validate the functionality of the new model and its implementation methodologies.
