Abstract-This paper details the design and implementation of a low-power, hardware-efficient, adaptive self-calibrating image rejection receiver based on blind source separation that alleviates RF analog front-end impairments. A hybrid strength-reduced and re-scheduled data-flow, low-power implementation of the adaptive self-calibration algorithm is developed, and its efficiency is demonstrated through simulation case studies. A behavioral and structural model is developed in Matlab, together with a low-level architectural design in VHDL, providing test benches for measuring the performance of the detailed algorithms and structures.
I. INTRODUCTION
Image rejection receivers utilize In-phase and Quadrature (I/Q) signal processing in dealing with bandpass signals. However, analog implementations of I/Q signal processing are vulnerable to RF-impairments [1]-[11], resulting in imperfect image rejection, which is not sufficient for communications applications. With large M-QAM/PSK signal constellations, even modest RF-impairments result in detrimental performance degradation. Therefore, digital techniques that enhance image rejection and alleviate the I and Q channel mismatches play an important role in simplifying the analog front-ends of future high-performance, highly integrated single-chip wireless transceivers.
Conventional image rejection architectures are implemented by analog circuit techniques [9]-[11]. However, hybrid and digital solutions that attempt to improve the image rejection ratio (IRR) have also been reported in the literature [1]-[8]. An unsupervised adaptive self-calibrating image rejection receiver utilising the Digital Image Rejection Processor (DIRP) was proposed and its performance evaluated in [8]. This paper deals with an efficient low-complexity, low-power implementation of this adaptive self-calibrating image rejection receiver. A key contribution of this paper is the application of the strength reduction transformation at the algorithmic level to obtain a low-power implementation of the receiver. Furthermore, the algorithm has been rescheduled and pipelined for low-power implementation.
The paper is organized as follows: Section II gives a brief description of the adaptive image rejection receiver. Section III details the application of the strength reduction at the algorithmic level along with Time-Division-Multiplexed (TDM) architectural design, while concluding remarks are given in Section IV.
II. ADAPTIVE IMAGE REJECTION RECEIVER
The adaptive self-calibrating image rejection receiver is composed of a modified Weaver image rejection mixer and a DIRP. With this architecture the I/Q errors are eliminated in the DSP domain at baseband, without using any off-chip discrete components. Fig. 1 depicts the image rejection receiver incorporating the DIRP. The incoming signal, s(t), consists of the wanted signal u(t) at f_RF and the unwanted image signal i(t) at f_IMG, where f_IMG = f_RF - 2f_IF. Hence, the incoming signal s(t) can be expressed as:
where u(t) and i(t) are the complex envelopes of the wanted and image signals, respectively. The incoming signal is downconverted to an intermediate frequency (IF) by the image-rejection mixer, which is subject to RF-impairments. The signals are then digitised and digitally downconverted to baseband to yield two baseband signals, r_1(k) and r_2(k), which can be expressed as:
where g_1 = (1 + 0.5α_ε), g_2 = (1 - 0.5α_ε), ϕ_ε is the phase mismatch, and α_ε is the gain mismatch between the I and Q channels. Owing to the phase and gain errors, r_1(k) contains the desired signal corrupted by the image signal scaled by h_1, and r_2(k) contains the image signal corrupted by the desired signal scaled by h_2. This is demonstrated in the frequency domain in Fig. 1. The mixing coefficients h_1 and h_2 can be expressed as:
Signals r_1(k) and r_2(k) form the two inputs of the DIRP, with c_1(k) and c_2(k) representing the corrected desired channel and the adjacent channel, respectively. These can be expressed as [8]:
The idea behind the DIRP is that, in the absence of RF-impairments, the desired and image signals are not correlated with each other. This is no longer the case when RF-impairments exist. The DIRP acts as a decorrelator, separating the desired channel from the image channel. The detailed design and performance analysis of the DIRP are covered in [8].
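The decorrelation idea can be illustrated numerically. The sketch below is not the DIRP itself: it assumes a simplified leakage model, r_1(k) = u(k) + h_1 i(k) and r_2(k) = i(k) + h_2 u(k), with independent unit-power QPSK sequences standing in for u and i. The model, the function names, and the value h_1 = h_2 = 0.1 are illustrative assumptions, not values taken from [8].

```python
import random

def qpsk(n, rng):
    """Return n unit-power QPSK symbols drawn with the given RNG."""
    points = (1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j)
    return [rng.choice(points) / 2 ** 0.5 for _ in range(n)]

def cross_correlation(x, y):
    """Magnitude of the sample estimate of E[x(k) * conj(y(k))]."""
    acc = sum(a * b.conjugate() for a, b in zip(x, y))
    return abs(acc) / len(x)

def mix(u, i, h1, h2):
    """Assumed leakage model: each channel picks up a scaled copy of the other."""
    r1 = [a + h1 * b for a, b in zip(u, i)]
    r2 = [b + h2 * a for a, b in zip(u, i)]
    return r1, r2

rng = random.Random(0)
u = qpsk(4000, rng)   # stand-in for the wanted signal
i = qpsk(4000, rng)   # stand-in for the image signal

ideal = cross_correlation(*mix(u, i, 0.0, 0.0))
impaired = cross_correlation(*mix(u, i, 0.1, 0.1))
print(f"no impairment: {ideal:.3f}, with impairment: {impaired:.3f}")
```

Under this model the estimate without leakage stays close to zero and shrinks as more samples are averaged, whereas with leakage it settles near 0.2. A nonzero cross-correlation of this kind is what a decorrelator can drive its weights against.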
III. ARCHITECTURAL DESIGN
This section details the implementation of the low-power, reduced-complexity DIRP through strength reduction, rescheduling of the algorithm, and efficient pipelining. We start with the parallel brute-force implementation of the DIRP, followed by the description and application of the strength reduction transformation. We then reschedule the DIRP algorithm for low power and 100% resource utilisation, and describe its pipelined implementation.
A. Algorithmic Level Power Reduction Techniques
The parallel brute-force implementation of the DIRP is depicted in Fig. 2. Fig. 2(a) shows the filter section, whereas Fig. 2(b) details the adaptive weight-update section; together they make up the DIRP [8]. Algorithmic transformations are an important class of architectural-level transformations that have been proposed for high-speed and low-power operation [12]. These transformations rely on the fact that most linear DSP algorithms can be expressed in terms of multiply-add operations. In particular, the strength reduction transformation trades high-complexity multiply operations for low-complexity add operations, thus achieving low power [12]. The multiplication of two complex numbers, (a+jb) and (c+jd), is given as:
(a + jb)(c + jd) = (ac - bd) + j(ad + bc)    (5)
As can be observed from (5), a total of four real multiplications and two real additions are needed to compute the complex multiplication. Equation (5) can be strength reduced and reformulated as:
(a + jb)(c + jd) = [(a - b)d + a(c - d)] + j[(a - b)d + b(c + d)]    (6)
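Both forms of the complex product can be checked against each other with a short Python sketch. The three-multiplier grouping used here, built around the shared term (a - b)d, is one common arrangement and is assumed for illustration; the function names are hypothetical.

```python
def complex_mult_direct(a, b, c, d):
    """(a + jb)(c + jd) using 4 real multiplies and 2 real additions."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag

def complex_mult_strength_reduced(a, b, c, d):
    """Same product using 3 real multiplies and 5 real additions."""
    shared = (a - b) * d          # multiply 1, addition 1 (shared term)
    real = shared + a * (c - d)   # multiply 2, additions 2 and 3
    imag = shared + b * (c + d)   # multiply 3, additions 4 and 5
    return real, imag

print(complex_mult_direct(3, 2, 5, -4))            # (23, -2)
print(complex_mult_strength_reduced(3, 2, 5, -4))  # (23, -2)
```

In a fixed-point datapath the pre-additions (a - b), (c - d) and (c + d) can each grow by one bit, so the multiplier inputs may need one extra bit of word length compared with the direct form.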
As can be observed from (6), the number of real multiplications is three and the number of real additions is five, i.e. one multiplier is replaced with three adders. We now apply the strength reduction technique to the DIRP algorithm. The outputs of the filter block of the DIRP are given as:
where w_{1,2}(k) = w_{I(1,2)}(k) + j w_{Q(1,2)}(k) and r_{1,2}(k) = r_{I(1,2)}(k) + j r_{Q(1,2)}(k). Putting these into (7) we have:

The adaptive coefficient updates can be expressed as:

At this stage we can apply the strength reduction transformation to the filter output. For the filter output equation given in (7), the transformation that follows is (only y_1(k) is shown to prevent repetition):

For the DIRP case, the strength-reduced form of (7), following the derivations of equations (10) and (11), is given by:
Following a similar approach and applying the strength reduction technique to (9), we obtain:

From (12)-(14), we can now construct the structure of the strength-reduced DIRP. Fig. 3(a) depicts the "filter section" of the strength-reduced DIRP, whereas Fig. 3(b) depicts the "weight-update section". The corresponding operation counts are listed in Table I. If we assume that the effective capacitance of a two-operand multiplier is K_c times that of a two-operand adder [12], applying strength reduction to the implementation of the DIRP results in a power saving factor, PS, given by:
where P_{D,o} and P_{D,sr} are the dynamic power dissipations of the original and strength-reduced DIRP algorithms, respectively. Fig. 4 depicts PS as a function of K_c. As can be observed, a power saving is achieved for K_c > 2 in the DIRP application. Asymptotically, the power saving approaches 25% as K_c increases. For a typical K_c value of 8 [12], the power saving is 16.67%.
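The quoted figures can be reproduced with a short calculation. The sketch below assumes that dynamic power is proportional to total effective capacitance, that the original DIRP uses 16 real multipliers and 16 real adders, and that the strength-reduced version uses 12 and 24. These counts are not quoted from Table I; they are inferred as the values consistent with the break-even point K_c = 2, the 25% asymptote, the 16.67% saving at K_c = 8, and the exchange of four multipliers for eight adders reported in this paper.

```python
def power_saving(k_c, mult_o=16, add_o=16, mult_sr=12, add_sr=24):
    """Fractional dynamic power saving PS = (P_o - P_sr) / P_o.

    Each adder counts as one unit of effective capacitance and each
    multiplier as k_c units. The default operation counts are inferred
    assumptions, not figures quoted from Table I.
    """
    p_o = mult_o * k_c + add_o
    p_sr = mult_sr * k_c + add_sr
    return (p_o - p_sr) / p_o

for k_c in (2, 8, 16, 1000):
    print(f"K_c = {k_c:4d}: PS = {100 * power_saving(k_c):6.2f}%")
```

With the default counts the expression reduces to PS = (K_c - 2) / (4(K_c + 1)), which is zero at K_c = 2, negative below it, and tends to 1/4 for large K_c.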
B. Algorithmic Rescheduling and Pipelining for Low Power
The architectural design aims for 100% utilisation of each element. This is achieved by rescheduling the processes and using pipelining stages to incorporate the different sections of the design. In this section the architecture, process schedule and process cycles are analysed in detail, and the most favourable architecture and data flow are established. The TDM-based architecture is preferred for implementation because it uses the least hardware. The first step in designing the TDM-based architecture is to decide on the data flow over the structure diagram. The overall structure consists of the repeated use of three distinct sub-structures: the Complex Multiplication Block (CMB), the Filter Output Block (FOB), and the Weight Update Block (WUB). With the parallel implementation the data flow is straightforward. In the first clock cycle the filter outputs, c_1 and c_2, are calculated in parallel. In the following clock cycle the filter outputs are fed back to the adaptive coefficient update section to calculate the weight factors w_1(k+1) and w_2(k+1) for the next iteration. The TDM-based hybrid model, on the other hand, has to follow a different data flow. The most suitable data flow, in which each sub-block is 100% utilised at all times, is shown in Fig. 5.
The data flow starts with the CMB, where the inputs r_1I and r_1Q are multiplied by w_2I and w_2Q (Step 1). The next step uses this intermediate result to obtain the outputs c_2I and c_2Q in the FOB (Step 2a). The CMB would otherwise stand idle at this stage, so it is run concurrently with the FOB (Step 2b): while c_2I and c_2Q are calculated in the FOB, the multiplications of r_2I and r_2Q by w_1I and w_1Q are carried out in the CMB. When both the CMB and the FOB have finished, the next parallel use of these blocks starts. The calculated c_2I and c_2Q values from the FOB are multiplied by the r_2I and r_2Q inputs in the CMB (Step 3a). At the same time, the intermediate results from the previous use of the CMB are used in the FOB to calculate c_1I and c_1Q (Step 3b). Once the multiplications in Step 3a are finished, the resulting values are passed to the WUB (Step 4a). While the µ-scaling and weight updates are being processed (Step 4a), the CMB processes the FOB results from Step 3b (Step 4b). Once the WUB calculations of Step 4a are finished, the weight factor w_2(k+1) is available. This new weight factor is used to begin the same cycle of operations again in the CMB (Step 5a), which is possible because the CMB will by then have completed its processing for Step 4b. At the same time, the results of Step 4b are used to calculate w_1(k+1) in the WUB (Step 5b). Thus, in a total of five steps, all the outputs are calculated together with the weight updates for the next iteration. This use of the sub-blocks and the parallel processing scheme reduces the number of steps required to calculate the outputs and weight factors from eight to five.
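The five-step schedule can be written down as a small table and checked mechanically. The Python sketch below is one reading of the steps, with hypothetical operation and signal labels; in particular, the second operand of the Step 4b multiplication is assumed by symmetry with Step 3a. It checks that no sub-block is assigned twice in a step and that every operation consumes only results produced in an earlier step. It does not model Fig. 5 in detail.

```python
# Each entry: (step, block, operation label, inputs, output)
SCHEDULE = [
    (1, "CMB", "r1 * w2",      ("r1", "w2"),           "p_r1w2"),
    (2, "FOB", "c2",           ("r2", "p_r1w2"),       "c2"),
    (2, "CMB", "r2 * w1",      ("r2", "w1"),           "p_r2w1"),
    (3, "CMB", "c2 * r2",      ("c2", "r2"),           "p_c2r2"),
    (3, "FOB", "c1",           ("r1", "p_r2w1"),       "c1"),
    (4, "WUB", "w2 update",    ("w2", "p_c2r2"),       "w2_next"),
    (4, "CMB", "c1 * r1",      ("c1", "r1"),           "p_c1r1"),  # operand assumed
    (5, "CMB", "next r1 * w2", ("r1_next", "w2_next"), "p_next"),
    (5, "WUB", "w1 update",    ("w1", "p_c1r1"),       "w1_next"),
]

def check(schedule, available):
    """Validate the schedule and return how many steps each block is busy."""
    ready = set(available)          # values known before Step 1
    busy = {}
    for step in sorted({s for s, *_ in schedule}):
        ops = [op for op in schedule if op[0] == step]
        blocks = [block for _, block, *_ in ops]
        assert len(blocks) == len(set(blocks)), f"block clash in step {step}"
        for _, block, label, inputs, _ in ops:
            missing = [x for x in inputs if x not in ready]
            assert not missing, f"{label} needs {missing} in step {step}"
            busy[block] = busy.get(block, 0) + 1
        # Results become visible only after the step completes.
        ready.update(out for *_, out in ops)
    return busy

busy = check(SCHEDULE, ["r1", "r2", "w1", "w2", "r1_next"])
print(busy)  # CMB is busy in all five steps; FOB and WUB in two steps each
```

In this reading the FOB is active only in Steps 2 and 3 and the WUB only in Steps 4 and 5, so the two never overlap and could share adder hardware. Step 5a already belongs to the next iteration, which suggests that successive iterations overlap in steady state.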
C. Architectural Design
Following the data flow proposed in the previous section, the architecture of the hybrid model is designed. Fig. 6 depicts the proposed TDM-based DIRP architecture.
[Fig. 6: Proposed TDM-based DIRP architecture; labelled elements include latches A1, B1 and C1, each clocked by gated phi1.]
The architecture consists of four major parts: the CMB, the merged FOB and WUB section, the controller, and the storage block where the calculated output and weight values are stored. The controller is implemented as a 4-bit ring counter and generates control signals sel1, … sel7, add1, mu_sel etc.
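A ring counter circulates a single active bit, so each state directly marks one phase of the schedule. The Python sketch below models only this one-hot sequencing for a 4-bit counter. The reset state and shift direction are assumptions, and the decoding of individual control signals such as sel1 or mu_sel from the state is not given here and is left out.

```python
def ring_counter(width=4, start=0b0001):
    """Yield successive one-hot states of a ring counter (assumed left-rotating)."""
    mask = (1 << width) - 1
    state = start & mask
    while True:
        yield state
        # Rotate left by one bit, wrapping the top bit around to bit 0.
        state = ((state << 1) | (state >> (width - 1))) & mask

counter = ring_counter()
states = [next(counter) for _ in range(6)]
print([format(s, "04b") for s in states])
# ['0001', '0010', '0100', '1000', '0001', '0010']
```

Because exactly one bit is high in each state, control signals can be formed by simple OR-ing of state bits, without a separate state decoder.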
The performance of the proposed architecture was evaluated using 32-PSK modulated signals. Simulation results are shown in Fig. 7 for varying phase and gain errors, demonstrating the effective operation of the architecture.
IV. CONCLUDING REMARKS
The design and implementation of a low-power image-rejection receiver incorporating the DIRP to alleviate RF-impairments and improve IRR has been undertaken. Strength reduction, data re-scheduling and pipelining were used to reduce the power consumption. It has been shown that applying strength reduction at the algorithmic level results in a power saving of 16.67% for a typical K_c value of 8. The complexity of the algorithm is reduced by four real multipliers at the expense of eight real adders. The algorithm is also amenable to software DSP implementation, requiring only a small processing overhead.
