Abstract: In this paper, a novel built-in tuning technique to compensate for process variability-induced imperfections in RF circuits is proposed. The proposed yield improvement methodology is a generic and self-contained tuning method that requires neither a digital signal processor, as prior software-based methods do, nor a tester. The technique uses digital logic that can be synthesized on-chip along with the analog/RF tuning circuitry to perform self-tuning. An optimized digital bitstream (stimulus) is used to stimulate the RF device, and the response of the device is downconverted to the low-frequency domain using a sensor. The resulting signal is mapped to a digital signature in such a way that the Hamming distance between the observed and the reference signatures represents the degree by which the device specifications differ from the nominal specifications. A logic-driven algorithm minimizes this Hamming distance to optimize multiple RF specifications concurrently. The presented methodology incurs minimal area overhead, and the tuning time is on the order of milliseconds. Results obtained by tuning the power amplifier of a 2.4-GHz transmitter show up to 16% yield improvement. To validate the proposed yield improvement concept on hardware, results obtained from experimentation on an industrial transmitter are presented.
have made the goal of obtaining high yields at these nanometer nodes a greater challenge. At nodes of 45 nm and below, process variations in the manufacturing of systems-on-chip that integrate analog/RF with digital circuitry pose a significant challenge to designers [5]. At these nodes, multidimensional process variations affect the fidelity of these systems, resulting in highly unstable yields, in-field wear-out, and signal-integrity problems. Besides random defects, high parametric variations have a significant impact on the yield of the front-end systems. In [6], it has been reported that the intradie variation in threshold voltage has doubled as CMOS technology has scaled from 130 to 45 nm. The short-channel effect of drain-induced barrier lowering on the threshold voltage of CMOS transistors at nodes below 90 nm is discussed in [7]. Ghai et al. [8] showed that, due to process variations, the worst case specification variation in a voltage-controlled oscillator module is close to 43%.
To compensate for the increased process variations, new design approaches have been formulated. Design-centering approaches that try to compensate for the effect of process variations on a given design by using yield-driven heuristic algorithms to optimize the circuit design have been investigated in the past [9]. Another approach involves designing circuits based on the fab process corners. In this approach, the circuit is designed such that its specifications are within specified limits for extreme variations in process parameters [10]. However, such a technique leads to over-design of the circuit, usually at the cost of higher power consumption and area. There has also been research in the areas of design-for-yield and design-for-manufacturability [11] of digital and mixed-signal systems. These techniques generally focus on the yield statistics of a single circuit specification and can be computationally expensive. These methods are not generic enough for RF ICs manufactured in today's deeply scaled technologies, where variances in performance specifications (in relation to the mean) are significantly larger compared with older CMOS manufacturing processes.
To address the reduced yield problem in advanced nanometer CMOS processes, a digitally assisted built-in postmanufacture tuning technique for RF circuits is proposed in this paper. The overview of the proposed technique is shown in Fig. 1 . A digital signature obtained from the device in response to an optimized digital stimulus is used to determine the extent by which the performance of the device has been affected by the manufacturing process variations. Using a tuning/calibration algorithm implemented on the same die, the device is then iteratively tuned until the digital signature is reduced, and consequently, the device specifications are within the acceptable limits of the nominal specifications resulting in yield recovery. The advantage of this technique, as opposed to prior techniques, is that it does not require a digital signal processor (DSP) chip to perform self-tuning.
It is a self-contained solution with minimal additional circuitry added to the RF front-end chip for built-in tuning (BIT) purposes that tunes multiple specifications. The presented technique facilitates die-level self-calibration of the RF circuits, thereby increasing yield of known good die before packaging and system-level integration leading to cost savings and quicker time-to-market of the product. While the concept of digital signal generation and tuning is demonstrated on an RF transmitter, in theory, the concept can be extended to other analog/mixed-signal/wireline modules.
II. PRIOR WORK
Standard industry practices for performing postmanufacturing tuning of analog/RF devices involve performing trimming, such as laser and electrical trimming [12] , [13] . These techniques are expensive, and they control a specific voltage/current/resistance in the device for correcting a specific measurement. In general, the three essential components of an analog/RF postmanufacturing tuning or calibration scheme are as follows:
1) A sensing unit that captures the front-end nonidealities.
2) A control/monitoring unit that coordinates the generation of the input stimulus and runs algorithms to estimate and correct the imperfections in the device due to the process variations. This can be the system DSP, the on-chip circuitry for BIT, or the tester in a production environment.
3) Analog or digital knobs used for correcting the imperfections at various points of the system.
Depending on the type of control unit and knobs, the BIT techniques can be broadly classified as follows:
1) digital baseband (also known as the DSP)-based monitoring and tuning;
2) DSP-based monitoring with analog tuning (or analog and digital cotuning);
3) on-chip monitoring and tuning.
Digital baseband-based monitoring and tuning involves estimating the imperfections of the front-ends in the baseband, using measurements obtained from the sensors attached to the front end or in loopback mode, and correcting the imperfections in the baseband through amplitude and phase corrections. For nonlinearity correction in power amplifiers (PAs), digital baseband-based predistortion is a widely used linearization technique. In this technique, the amplifier characteristics are corrected using adaptive or nonadaptive methodologies in the baseband of the wireless system [14]. The PA output is fed back through the internal receiver chain. The correction technique, however, is affected by the nonlinearities of the receiver's LNA, downconversion mixer, and analog-to-digital converter (ADC). A similar technique for compensating the I/Q mismatch effects in RF transceivers is proposed in [15]. The drawback of DSP-based techniques is that they cannot correct large impairments due to the limitations in the dynamic range of the data converters in the system, the amplification of dc-offsets, and the saturation of front-end analog modules. Furthermore, these compensation techniques, while being accurate, take a long time to converge to their optimum values [4].
In the DSP-based monitoring with analog tuning technique, the DSP of the system runs the calibration for test/monitor and tune purposes and directs the analog knobs. Lee et al. [16] used current digital-to-analog converters (DACs) in the front-end to compensate for the local oscillator (LO) feedthrough distortion. In this technique, the output of the transmitter is envelope detected, and the output frequency spectrum calculated in the DSP is used for impairment correction. In today's RF modules, incorporating analog tuning knobs has become critical to achieving high yields. Elmala and Embabi [17] developed an online or offline technique to compensate for I/Q mismatches observed in typical RF receivers by making use of a variable delay gain circuit to feed back correction vectors to the system LO. In [18], a dual-mode 802.11b/Bluetooth radio with tunable bias current for the LNA, mixer, LO, and filters is presented. These can be controlled by system-level metrics computed in the DSP chip. In [19], a DSP-driven, one-time calibration technique that predicts the optimum knob settings for an RF LNA is presented. In [20], a technique for tuning the RF front-end modules by running a gradient algorithm on the tester/DSP is proposed. A fast, iterative, regression-based test-and-tune technique has been proposed in [21]. A DSP technique to tune an LNA based on oscillation principles is discussed in [22]. The techniques in [19]-[21] are generic methodologies that can concurrently tune multiple specifications of the circuit/system. However, the disadvantage of the regression-based techniques is that they require an expensive training phase for the calibration of the regression models. Furthermore, periodic recalibration of the regression models is required. Finally, the above tuning techniques rely on signal processing operations that can be performed only in the system DSP for BIT solutions.
Besides the fact that these techniques increase the system complexity, most existing wireless systems incorporate the RF front-end modules and the baseband processor on different chips/dies for signal-integrity reasons. In such a scenario, using the DSP for testing and tuning of the RF front-end is difficult and can be achieved only after system-level integration. As a result, one incurs the cost of packaging the chip before tuning. The above factors significantly affect the time-to-market of the product. If the DSP and the front end are manufactured by different vendors, pin compatibility issues for calibration purposes should also be considered. One approach that alleviates the DSP dependence and timeframe issues is the on-chip monitoring and tuning implementation. In this methodology, there is no requirement for interaction between the digital baseband and the front-end module.
In on-chip monitoring and tuning methodology, the impairments are sensed by the circuitry present on the front-end chip and from the estimated performance criteria, and the front-end parameters are tuned. These techniques can be analog or digital in nature. The advantage of using on-chip monitoring and tuning technique is that the short calibration time is achieved as opposed to DSP-based calibration schemes. In analog tuning or compensation, the calibration/tuning is performed in the analog domain by changing circuit parameters, such as bias, supply voltage, and passive components, such as capacitors, inductors, and resistors. One technique that falls in this category is circuit-level feedback that is specific to a given design. In [23] , analog predistortion is performed for PA using a unique matching network design that improves the nonlinearity specification. A completely analog scheme for tuning I /Q mismatches and the resulting image rejection ratio performance in a two-stage image-reject receiver is presented in [24] . The bandwidth of the analog feedback system generally decides the convergence/stability in this calibration methodology. An on-chip, self-calibration scheme for tuning the impedance mismatches of an LNA by using a variable inductor with taps is presented in [25] . The technique involves real-time current sensing whose magnitude varies with the input match of the LNA, which is then amplified and peak detected to obtain a dc signal corresponding to the input match and directs the tuning of the inductor.
For the purpose of diagnosis/testing, numerous digitally assisted built-in self-test (BIST) techniques have been proposed in the past [26] , [27] . In [28] , regression-based testing of analog modules is performed using digital signatures.
In the digital monitoring and tuning paradigm, on-chip digital logic is used to monitor and aid the compensation of mixed-signal/RF performance degradation due to process variations. By implementing such a technique, the focus is shifted from the analog circuitry to the digital circuitry. In the past, numerous digital techniques have been presented for the calibration of phase-locked loops, which consist predominantly of digital blocks [29]-[31]. A least mean square (LMS)-based technique for testing and calibration of pipelined ADCs is presented in [32] for go/no-go classification. An on-chip technique for tuning the PA is provided in [33]. This technique addresses gain control only, with no tuning for nonlinearity metrics.
The above research examples focus primarily on tuning a particular analog/RF specification and/or are specific to a circuit design/topology. This paper presents a generic methodology that is capable of concurrently sensing and tuning multiple design specifications of analog/RF front-end circuits using on-chip digital circuitry in the early stages of the production cycle (prepackaging) without extensive computational resources. Unlike the prior state-of-the-art techniques [19]-[22], the proposed on-chip technique tunes multiple RF specifications without the use of regression. A preliminary version of this paper was presented in [34]. In this paper, the work has been extended with in-depth analysis of the methodology and detailed discussion of the system implementation. Furthermore, the hardware experimentation presented in this paper validates the proposed theory in a comprehensive manner for the first time.
In the proposed technique, the tasks of monitoring and tuning the RF front-ends are performed by dedicated circuitry enabled by digital signatures called Hamming distance proportional (HDP) signatures. The proposed framework for sensing and tuning of an RF transmitter is shown in Fig. 2 . The shaded blocks are the additional hardware components used to perform the BIT/calibration. A low-frequency measurement of the device's high-frequency output to an optimized stimulus and a reference signal are used to obtain these HDP signatures. The input stimulus used to excite the front-end in this scheme can be generated using on-chip digital logic. In this paper, the stimulus is specially optimized to increase its sensitivity to the static specifications (gain, IIP3, and IP1 dB) of the RF transmitter. The transmitter response to the applied multitone stimulus is downconverted by an envelope detector that is attached to the output of the PA. The low-frequency analog measurement is converted into a digital signature using circuitry consisting of a comparator and a reference signal generator, which in this paper is a ramp-signal generator. The output of the comparator is a digital signature corresponding to the device output (see Fig. 2 ). The unique property of this digital signature is that the Hamming distance between the observed signature and the reference digital signature is proportional to how bad/skewed the device, under process variability, is relative to the nominal device. Due to this unique property, an on-chip tuning block can be used to tune the device performance with the objective of minimizing the Hamming distance without having to compute the specifications of the device explicitly. While the technique of obtaining HDP signatures is implemented in this paper to tune the static specifications (gain and nonlinearity metrics, such as IIP3) of the PA of a transmitter, the concept can be extended in theory to sense and tune other analog/RF circuits. 
The rest of this paper is organized as follows. First, the theoretical premise that discusses HDP signature generation is presented. The cost function formulation, the tuning methodology, and the input stimulus generation are discussed in Sections V-VII, respectively. Finally, the simulation results and the hardware results are presented.
III. THEORETICAL PREMISE
A. Device Calibration Strategy
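For theoretical purposes, let us consider that there exist j process parameters represented by the vector $P_j$, l tuning knobs represented by $K_l$, n measurements represented by $M_n$, and t device specifications represented by $S_t$.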
The parameters K l are assumed to be uniformly distributed across a predetermined range of values across which tuning is to be performed. In this paper, an optimum input stimulus is developed, such that a strong statistical correlation between the observed measurement M n and a specified set of device specifications S t is exhibited under process variations across the parameters P j for a range of tuning knob values K l . In the above, strong statistical correlation across simultaneous multiparameter perturbations in the vector [P j , K l ] is implied. If there exists such a strong correlation between the measurement/response and the specifications, then the following can be stated [20] .
1) If the observed response of the process-skewed device is different from the response of the nominal device (also called the golden/reference device), then one or more specifications of the process-skewed device need to be tuned.
2) If the response of the process-skewed device after tuning is identical to the response of the nominal device, then theoretically, the specifications of the process-skewed device, after tuning, are within the acceptable range of the nominal specifications.
As a result, the response can be directly used for tuning the knobs of the process-skewed device, without explicitly computing the specifications at each iteration. The measurement obtained from the process-skewed device can be compared with the reference measurement, and the difference between them can be reduced to tune the specifications. In this paper, to tune the specifications, the measurement comparison is performed on-chip using digital signatures called the HDP signatures (details in Section III-B). This tuning technique works under the assumption that the device suffers only from parametric variations. Catastrophic defects can be detected by a defect filter, such as the one described in [35], and these devices are discarded prior to tuning.
B. HDP Signature Generation
Consider that the device response Y(t) (low-frequency envelope detector output of the transmitter) is periodic with period N and is captured using a 1-bit comparator (see Fig. 2). The other input to the 1-bit comparator is a ramp-signal generator that produces a linear ramp signal R(t) with period M. The system can be designed such that, across all reasonable process variations and tuning knob settings of the device, the dynamic range of the ramp signal is greater than that of the device output response (extreme process variations will be filtered out by the defect filter). One period of the periodic device response signal is defined to consist of the N time samples $Y(0), Y(1), \ldots, Y(N-1)$, and one period of the ramp signal consists of the M time samples $R(0), R(1), \ldots, R(M-1)$.
A total of M * N comparisons at the comparator output are acquired as a digital signature. At any time t, R(t mod M) is compared against Y(t mod N). The signal from the envelope detector output Y(t) is connected to the positive input of the comparator. If Y(t) > R(t), then the output of the comparator is high (V_CC or 1); otherwise, the output is low (GND or 0). This sampling approach with offset frequencies is similar to the Vernier sampling technique used for measuring fractional distances.
For the purposes of interpretation of the comparator output signal, let X be defined as a M × N matrix with row indices going from M − 1 to 0 (top to bottom) and column indices going from 0 to N − 1 (left to right). The following observations can be obtained using the comparator output D(t).
Observation 1: If the comparisons of Y(t) against every point of R(t), obtained over M * N comparisons, are regrouped and arranged as columns of the matrix X, then the total number of 1s in X is directly proportional to the area under the curve of Y(t).
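Explanation: Consider a sample point Y(0). The sample point is compared against R(0) at t = 0, and the next comparison of this sample occurs N clock cycles later, at t = N, against R(N mod M). Over the M * N comparisons, Y(0) is compared against all M levels of the ramp, so the number of 1s in the corresponding column of X is proportional to the amplitude of the sample. The same holds for every other sample of Y(t).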
Integrating over all the values of t (columns of X) gives the stated observation result. This is analogous to comparing every amplitude point of Y(t) with M different levels (equivalent to a $\log_2 M$-bit ADC). However, in this case, the comparisons happen over the time cycles of the device output response and the ramp signal.
Example: A theoretical example that explains Observation 1 is shown in Fig. 3 . At each of the time steps t = 0 through t = 11 (12 steps corresponding to N = 3 and M = 4), the result of the comparison between Y (t) and R(t) is shown in D(t). In this case, for ease of illustration, it is assumed that both R(t) and Y (t) are ramp signals. The matrix X exhibits a ramp in its lower right corner with a level of quantization (0-3) provided by the signal Y (t). As shown in Fig. 3 , the total number of 1s in the first column is one, the second column is two, the third column is three, and the signal Y (t) can be reconstructed. While the example is provided using a ramp signal as the transmitter output response, the theory explained here holds good for any periodic signal with any number of multitones.
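The example above can be reproduced with a short behavioral sketch. The sample values chosen for Y(t) and R(t) below are assumptions made for illustration (the values in Fig. 3 are not listed in the text), and the helper names `hdp_signature` and `ones_per_column` are hypothetical. The sketch generates the comparator bitstream for N = 3 and M = 4 and regroups it by response sample, which corresponds to counting the 1s in each column of X.

```python
from math import gcd


def hdp_signature(y, ramp):
    """Return the comparator bitstream D(t) for t = 0 .. M*N - 1.

    y    : one period of the device response (N samples)
    ramp : one period of the reference ramp (M samples)
    """
    n, m = len(y), len(ramp)
    # The periods must be coprime so that each sample meets every ramp level.
    if gcd(m, n) != 1:
        raise ValueError("M and N must be coprime")
    # Comparator output is 1 when the response exceeds the ramp.
    return [1 if y[t % n] > ramp[t % m] else 0 for t in range(m * n)]


def ones_per_column(signature, n):
    """Count the 1s that belong to each response sample (each column of X)."""
    counts = [0] * n
    for t, bit in enumerate(signature):
        counts[t % n] += bit
    return counts


# Assumed illustrative values: both signals are ramps, N = 3 and M = 4.
y = [1, 2, 3]
ramp = [0, 1, 2, 3]

d = hdp_signature(y, ramp)
print(d)                           # 12 comparator bits
print(ones_per_column(d, len(y)))  # number of 1s per column of X
```

With these assumed values, the per-column counts recover the amplitudes of Y(t), and their total equals the sum of the samples, in line with Observation 1.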
Observation 2: If Y(t) corresponds to the process-skewed response and Y_nom(t) is the response of the nominal or reference device, then, using Observation 1, the absolute difference in the time-domain response between the two signals is proportional to the sum of the differences in their corresponding digital signatures (also known as the Hamming distance), as shown in

$$\sum_{t=0}^{N-1} \left| Y(t) - Y_{\mathrm{nom}}(t) \right| \;\propto\; \sum_{t=0}^{MN-1} D(t) \oplus D_{\mathrm{nom}}(t). \tag{1}$$

Explanation: The LHS of (1) is the area between the process-skewed response and the nominal response. The RHS is obtained by XORing the observed digital signature D(t) with the reference digital signature D_nom(t) (see Fig. 2). The total number of 1s in this signature is the Hamming distance between the observed and the reference digital bitstream and is called the error count metric. By (1), the larger the absolute difference between the two time-domain responses, the larger the error count metric, and vice versa. Based on the discussion in Section III-A, this error count metric, which quantifies the difference between the process-skewed response Y(t) and the reference response Y_nom(t), must be minimized by a tuning algorithm to the maximum extent possible in order to ensure that the device specifications are within the acceptable range of the nominal specifications.
On a per clock cycle basis, D(t) and the corresponding reference signature D_nom(t) are compared at the XOR gate input. At the end of the comparisons, the total number of 1s at the XOR gate output [counted by a counter (see Fig. 2)] is directly proportional to the area between Y(t) and Y_nom(t).
A key outcome of this is that the matrix X (shown in Fig. 3 ) need not be constructed by the hardware. Just counting the number of 1s at the XOR gate output in the order that the comparisons are performed gives the stated result and simplifies the hardware implementation. Therefore, for each tuning iteration, the input stimulus is applied, and the calculated error count metric directs the tuning algorithm.
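The XOR-and-count procedure described above can be sketched behaviorally as follows. This is a minimal model under stated assumptions, not the hardware implementation: the response and ramp values are invented integers, and the function names `comparator_bits` and `error_count` are hypothetical. The sketch counts the 1s at the XOR output in the order the comparisons occur, without building the matrix X.

```python
def comparator_bits(y, ramp):
    """Comparator output D(t) over the M*N comparison cycles."""
    n, m = len(y), len(ramp)
    return [1 if y[t % n] > ramp[t % m] else 0 for t in range(m * n)]


def error_count(y, y_nom, ramp):
    """Hamming distance between the observed and reference signatures."""
    count = 0
    for d, d_nom in zip(comparator_bits(y, ramp), comparator_bits(y_nom, ramp)):
        count += d ^ d_nom  # XOR gate followed by a counter
    return count


# Assumed illustrative values: M = 8 integer ramp levels, N = 3 samples.
ramp = list(range(8))
y_nom = [2, 5, 3]   # nominal (reference) response
y_skew = [4, 4, 1]  # process-skewed response

print(error_count(y_skew, y_nom, ramp))
print(sum(abs(a - b) for a, b in zip(y_skew, y_nom)))
```

For these integer-valued samples and coprime M and N, the error count equals the sum of absolute sample differences, which is the proportionality that Observation 2 relies on.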
Optimal Choice of M and N: As discussed in Observation 1, if f_s is the sampling frequency of the comparator clock, the frequency of Y(t) is f_s/N and the frequency of R(t) is f_s/M. The N samples of the response signal need to be compared against all the M different levels of the ramp signal. In order for such a condition to hold, the frequencies of Y(t) and R(t) should have a specific relationship given by (2). This condition ensures that every point of the device response is compared with all the M different levels

$$\frac{f_Y}{f_R} = \frac{M}{N}, \qquad \gcd(M, N) = 1. \tag{2}$$
If (2) does not hold, each of the N samples would repeatedly be compared against a small subset of the M ramp levels rather than all of them over the M * N comparison clock cycles. Hence, according to (2), M and N are coprime. Under this condition, the number of 1s in a column of X is proportional to the amplitude of the corresponding sample, and the integration over the columns is proportional to the area under the curve. Note that in the example presented, N = 3 and M = 4 satisfy (2), as any two consecutive numbers are coprime. Along with the condition provided in (2), the choice of M and N should also be based on the implementation flexibility.
IV. SYSTEM DESCRIPTION
In this section, a brief description of the various components of the BIT technique is presented.
A. RF Front-End and Envelope Detector
In this paper, the PA includes tunable elements (bias knobs) to compensate for the effects of process variations. In this work, the gain and IIP3 specifications of the PA are tuned. In theory, the technique can be extended if tuning knobs are implemented in the mixer as well, as shown in Fig. 2. The tuning knob present in the bandpass filter (BPF) is only to facilitate the input stimulus generation and is not for calibration purposes. The time-domain envelope of the high-frequency PA output signal contains information about both the nonlinearity and the gain of the RF front ends [36].
B. Ramp-Signal Generator
There has been a lot of literature in the past that discusses the generation of ramp signals on-chip for performing BIST for components, such as ADCs, where the frequencies of the ramp signal range from a few kilohertz to megahertz [37] , [38] . One such reference is [37] , where calibration schemes for the correction of the slope of the ramp signal have been presented as well. The basic operating principle of the ramp-signal generator is shown in Fig. 4 . The current source drives the capacitor according to a switch signal, which in this paper can be derived from the system clock using a divider circuit (Fig. 2) . The ramp-signal amplitude can be represented as shown in (3), where S is the amplitude scaling factor and M is the ramp period. While a ramp signal is used in this paper, other signals, such as the switch capacitor circuit output of a divided clock signal, can also be used as a reference signal [28] .
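For the ramp-signal amplitude described above, a discrete-time form consistent with the stated roles of S and M can be written as follows. This is a sketch under two assumptions made for illustration: the ramp restarts from zero at the beginning of each period, and t counts comparator clock cycles.

```latex
R(t) = S \cdot (t \bmod M), \qquad t = 0, 1, 2, \ldots
```

Under these assumptions, the ramp takes M distinct levels between 0 and S(M - 1) in each period, which are the M comparison levels used in Section III-B.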
V. COST FUNCTION FORMULATION
The error count metric is used to direct the tuning of the RF device knobs to tune the specifications of the device within the acceptable range of the nominal specifications. The difference in the gain and nonlinearity specifications between the process-skewed device and the nominal device contribute to the difference in their output response and, hence, to the corresponding computed error count metric. However, since multiple metrics (gain and nonlinearity in this case) need to be tuned, it is important to ensure that the final cost function metric used for tuning is sensitive to each specification. When computing the error count metric, it is observed that the contribution from the variation in gain specification takes dominance over the variation in nonlinearity. This can be intuitively explained as the gain variation is present at all power levels, while the variation due to the nonlinearity becomes significant only at higher power levels. In order to increase the sensitivity of the cost function to the nonlinearity of the device, multiple error count measurements are made at different power levels. For the purpose of explanation, let us consider that the transmitter is modeled as a third-order polynomial as
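$$y(t) = \alpha_1 x(t) + \alpha_2 x^2(t) + \alpha_3 x^3(t) \tag{4}$$

where α_1 is the gain term, and α_2 and α_3 are the nonlinearity terms. When the amplitude of the input signal x(t) is low, the higher order terms (namely, those weighted by α_2 and α_3, which correspond to the nonlinearity output) are small and can be ignored. Hence, error count E_1 (discussed in Observation 2) can be approximated as follows:

$$E_1 \approx \sum_{t} \left| \alpha_1 x(t) - \alpha_1^{\mathrm{nom}} x(t) \right| \tag{5}$$

where the superscript nom denotes the coefficients of the nominal device. Now, if the amplitude of x(t) is relatively higher (a times), corresponding to a higher input power level, then the error count E_2 is given as

$$E_2 \approx \sum_{t} \left| \left[ \alpha_1 a x(t) + \alpha_2 a^2 x^2(t) + \alpha_3 a^3 x^3(t) \right] - \left[ \alpha_1^{\mathrm{nom}} a x(t) + \alpha_2^{\mathrm{nom}} a^2 x^2(t) + \alpha_3^{\mathrm{nom}} a^3 x^3(t) \right] \right| \tag{6}$$

If the higher order terms are insignificant, then the above error E_2 can be stated as aE_1. However, at the higher input power level, the effect of the higher order nonlinearity terms α_2 and α_3 is significant, leading to distortion characteristics. Hence, the difference corresponds to the nonlinearity in the response and is given by the NLM

$$\mathrm{NLM} = \left| E_2 - a E_1 \right| \tag{7}$$

Using (5) and (7), the cost function used for tuning is formulated as follows:

$$cf = W_1 E_1 + W_2 \, \mathrm{NLM} \tag{8}$$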
where W_1 and W_2 are the weights of the two error terms that are determined during the characterization phase of the tuning methodology. The final values of the counter at the end of the HDP signature generation for low and high input power levels provide the error counts E_1 and E_2, respectively. While it is not possible to completely decouple the various impairments, a cost formulation such as the one explained above provides a better metric for capturing the effect of both the gain and the nonlinearity terms. To obtain the cost function, the device is excited twice using the stimulus at two different amplitudes. This can be achieved by programming the gain of the analog BPF that precedes the mixer (see Fig. 2). Furthermore, the reference digital signature (corresponding to the nominal device) at each input power level needs to be stored in the memory. The cost function is formulated in the form provided by (8) considering its ease of on-chip implementation. The NLM and cost function calculations involve addition and multiplication operations that can be efficiently implemented on-chip. Prior techniques [19]-[22] have utilized cost functions that are based on regressions, which would mandate the use of a DSP for computational purposes.
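The arithmetic behind the cost function can be illustrated with a small sketch. It assumes that the NLM is the magnitude of the difference between E_2 and aE_1 and that the cost is a weighted sum of E_1 and the NLM, which is one reading of the description above. The function names and the numeric values are hypothetical, and integers are used to mirror counter outputs.

```python
def nonlinearity_metric(e1, e2, a):
    """NLM: deviation of the high-power error count from its linear prediction a*E1."""
    return abs(e2 - a * e1)


def cost_function(e1, e2, a, w1, w2):
    """Weighted sum of the low-power error count and the NLM."""
    return w1 * e1 + w2 * nonlinearity_metric(e1, e2, a)


# Assumed illustrative counter values and weights.
e1 = 10   # error count at the low input power level
e2 = 26   # error count at the high input power level
a = 2     # amplitude ratio between the two power levels
w1, w2 = 1, 3

print(nonlinearity_metric(e1, e2, a))    # 26 - 2*10 = 6
print(cost_function(e1, e2, a, w1, w2))  # 1*10 + 3*6 = 28
```

Only subtraction, an absolute value, and two multiply-accumulate steps are needed, which is consistent with the claim that the computation suits on-chip logic.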
VI. TUNING METHODOLOGY
The tuning algorithm used in this paper is based on the principle of sign-sign least mean squares (SS-LMS) algorithm. The SS-LMS is a variant of the LMS class of algorithms that are commonly used for adaptive filtering. In this paper, the tuning algorithm is used to tune the knob settings of the tunable device based on the cost function [given by (8) ]. Let cf (iter) be defined as the cost function metric for a knob setting k 1 (iter) for iteration iter. The knob k 1 is then updated for iteration iter +1 to its new value as shown in the following equation:
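$$k_1(iter + 1) = k_1(iter) - \delta \cdot \mathrm{sgn}(cf) \cdot \mathrm{sgn}(k_1) \tag{9}$$

where δ is the knob step size, and the sign functions sgn(cf) and sgn(k_1) are defined as follows:

$$\mathrm{sgn}(cf) = \begin{cases} +1, & cf(iter) - cf(iter - 1) > 0 \\ -1, & \text{otherwise.} \end{cases} \tag{10}$$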
In (10), cf(iter − 1) is the cost function value for iteration iter − 1. Similarly, sgn(k_1) can be defined with respect to k_1(iter) and k_1(iter − 1) in the same manner as sgn(cf). The idea behind the approach is to change the direction of the search when a sign change is observed between the successive values of the cost function. When the knob approaches its optimum value, the product sgn(cf) · sgn(k_1) toggles between 1 and −1. This product can be monitored to indicate the end of iteration with respect to the knob, and the next knob can then be tuned. This cycle through the knobs can be repeated for a predetermined number of repetitions learned during the characterization phase. The order in which the knobs are tuned can be selected in the characterization phase by examining the variation of the specifications with respect to the knobs. To keep the implementation of the algorithm simple, there is no adaptive step size selection. As a result, the knob solution may converge to a local optimum. To address this problem, multiple starting locations for the tuning knobs can be used, and the final knob setting that has the lowest value of the cost function can be selected. The flowchart explaining the tuning algorithm is shown in Fig. 5. The algorithm presented above requires a low number of computations, making it more suitable for on-chip implementation. The tuning techniques used in prior methodologies involve a steepest-descent-based multidimensional gradient search algorithm [20], Lagrange multipliers-based constrained optimization [21], or regression mapping [19], [22], all of which are computationally intensive and require a DSP/external tester.
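The sign-sign search on a single knob can be sketched as follows. This is a minimal behavioral model under stated assumptions rather than the flowchart of Fig. 5: the knob update subtracts the step multiplied by the sign product, the search stops the first time the product toggles, and an initial probe step is taken so that the first differences are defined. The names `tune_knob`, `tune_multistart`, and `toy_cost` are hypothetical, and the quadratic cost stands in for the measured cost function.

```python
def sgn(delta):
    """Sign function: +1 for a positive change, -1 otherwise."""
    return 1 if delta > 0 else -1


def tune_knob(cost, k_start, step, k_min, k_max, max_iter=64):
    """Sign-sign search on one knob; returns (best knob, best cost) visited."""
    k_prev, cf_prev = k_start, cost(k_start)
    best_k, best_cf = k_prev, cf_prev
    # Initial probe step so the first differences are defined (assumption).
    k = k_prev + step if k_prev + step <= k_max else k_prev - step
    last_product = 0
    for _ in range(max_iter):
        cf = cost(k)
        if cf < best_cf:
            best_k, best_cf = k, cf
        product = sgn(cf - cf_prev) * sgn(k - k_prev)
        if last_product and product != last_product:
            break  # the product toggled: the optimum has been bracketed
        last_product = product
        # Move against the estimated slope, staying inside the knob range.
        k_next = max(k_min, min(k_max, k - step * product))
        if k_next == k:
            break  # knob saturated at a range limit
        k_prev, cf_prev, k = k, cf, k_next
    return best_k, best_cf


def tune_multistart(cost, starts, step, k_min, k_max):
    """Run the search from several starting knob values and keep the lowest cost."""
    results = [tune_knob(cost, s, step, k_min, k_max) for s in starts]
    return min(results, key=lambda r: r[1])


def toy_cost(k):
    """Placeholder cost with its minimum at knob setting 5."""
    return (k - 5) ** 2


print(tune_knob(toy_cost, 0, 1, 0, 15))
print(tune_multistart(toy_cost, [0, 12], 1, 0, 15))
```

Each iteration needs only two comparisons and one add or subtract, which is the property that makes this class of algorithm attractive for on-chip logic. The multi-start wrapper mirrors the suggestion of using several starting locations to reduce the risk of settling in a local optimum.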
VII. INPUT STIMULUS GENERATION
The basic principle of the proposed technique is that when the device's digital signature matches with the reference digital signature, the specifications of the device after tuning are within the acceptable range of the nominal specifications. Hence, it has to be ensured that the device response variation shows strong statistical correlation with the specification variation across process and tuning knob variation.
In order to satisfy this requirement, a one-time, heuristic input-stimulus optimization that aims to increase the statistical correlation of the device response variation to the variations in specifications is proposed. In this paper, an optimized digital bitstream (that can be generated on-chip) is obtained using a binary genetic algorithm (GA). The binary GA is a heuristic optimization algorithm that starts with a pool of solutions (called chromosomes) and evolves an optimized solution through iterations based on an optimization function [39] . In this paper, each chromosome in the population of the GA represents a digital bitstream. Each generation of GA consists of a set of chromosomes that are evaluated based on an optimization function. Based on the optimization function values for the chromosomes in one generation, using crossover and mutation operations, new chromosomes (new digital bitstreams) are created in the next generation. The GA converges after several generations and finds the digital bitstream that gives the optimum value. This optimum bitstream is then generated on-chip and used as an input digital bitstream during the tuning process.
In our methodology, each digital bitstream/chromosome is first bandpass filtered to get a multitone stimulus within the passband of the device. For each multitone stimulus, a response for the nominal device is obtained. Similarly, responses are collected for a set of representative devices from different process corners and different tuning knob settings that correspond to different specifications. Let the instances across different process corners and knob settings be collectively termed samples. The optimization function is formulated such that the difference between the responses from the samples (corresponding to different specifications) and the nominal response is maximized. The greater the difference, the greater the sensitivity of the response variations to the variations in device specifications, and the better the input stimulus candidate. The input stimulus that has the maximum value of the difference over the generations of the GA for all the samples is the final optimum stimulus. The difference between the nominal response and a process-skewed instance response for a given tuning knob setting and input stimulus is represented by the cost function (8) . Hence, the optimization function defined by (11) is used to direct the GA. In the formulation, a limiting function is used to prevent the contribution of any one sample from dominating the overall optimization function

$$\text{Optimization function} = \text{maximize}\,(\text{cost function for samples across generations}) \tag{11}$$
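A minimal Python sketch of a binary GA of this kind is given below. It is illustrative only: the paper does not specify the selection scheme, the crossover and mutation details, or the form of the limiting function, so truncation selection with elitism, single-point crossover, bit-flip mutation, and a simple cap (`limited`) are assumptions. The per-sample cost `toy_sample_cost` is a placeholder for the simulated response difference given by (8), which would in practice come from circuit simulation.

```python
import random

def limited(cost, cap):
    """Assumed limiting function: cap each sample's contribution."""
    return min(cost, cap)

def optimization_function(bits, samples, sample_cost, cap):
    """Sum of limited per-sample costs for one candidate bitstream."""
    return sum(limited(sample_cost(bits, s), cap) for s in samples)

def run_ga(score, n_bits, pop_size=20, generations=30, p_mut=0.02, seed=0):
    """Evolve a bitstream that maximizes score(bits); return (best, history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        parents = pop[: pop_size // 2]            # truncation selection
        children = [pop[0][:]]                    # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)        # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = children
    best = max(pop, key=score)
    history.append(score(best))
    return best, history

def toy_sample_cost(bits, deviation):
    """Placeholder for cost function (8): NOT a circuit model."""
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return abs(deviation) * transitions

samples = [-0.2, 0.1, 0.3]                        # hypothetical spec deviations
score = lambda bits: optimization_function(bits, samples, toy_sample_cost, 5.0)
best_bits, history = run_ga(score, n_bits=32)
```

Because the best chromosome is carried over unchanged, the best score recorded in `history` never decreases from one generation to the next, which mirrors the flat "best value" curve that signals convergence.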
VIII. SIMULATION RESULTS
In this section, the concepts described in Sections V-VII are demonstrated in the simulation environment. The front-end circuits are simulated in advanced design system (ADS), and the tuning scheme simulations are implemented in MATLAB using the circuit data. The proof-of-concept PA and mixer circuits of the transmitter are designed in the ADS environment in 0.18-μm CMOS technology. A differential Gilbert cell is used as the up-converting mixer. The PA is a two-stage design with tunable bias knobs. The selection of the tuning knobs is similar to the technique proposed in [40] , and the tuning knobs are selected to enable the tuning of the gain and IIP3 specifications relative to each other. Monte Carlo simulations are performed to generate the process-skewed instances. Process variations of 15%-20% (±3σ ) are injected by changing the threshold voltages, the length reduction factors, the oxide thickness, the gate-source capacitance, and the channel mobility of the transistors. For each process instance of the mixer and each process instance and tuning knob combination of the PA, the voltage transfer characteristics (V in versus V out ) are extracted from ADS. Using the V in versus V out characteristics obtained from the PA, the PA is modeled in MATLAB as a fifth-order polynomial function:

$$V_{out} = \alpha_0 + \alpha_1 V_{in} + \alpha_2 V_{in}^2 + \alpha_3 V_{in}^3 + \alpha_4 V_{in}^4 + \alpha_5 V_{in}^5$$

where $\alpha_0$ is the amplifier offset, $\alpha_1$ is the small signal gain, and $\alpha_2$-$\alpha_5$ are the nonlinearity coefficients. Similarly, the mixer is modeled in MATLAB as a third-order polynomial function:

$$V_{out} = \beta_0 + \beta_1 V_{in} + \beta_2 V_{in}^2 + \beta_3 V_{in}^3$$

where $\beta_0$ is the mixer offset, $\beta_1$ is the small signal gain, and $\beta_2$ and $\beta_3$ are its nonlinearity coefficients, which are computed from the V in versus V out characteristics obtained from the mixer. Using the above polynomial functions, the time-domain simulation for sensing and tuning is implemented in MATLAB. The process variations in the ramp signal are modeled in MATLAB as a variation in the scaling parameter (S) given by (3) . For the ramp signal, an offset error of 1% (mean) is modeled in different instances along with additive white Gaussian noise. A 15% Gaussian variation in the resistance and capacitance of the envelope detector is modeled in MATLAB to account for the process variations in the sensor. The nominal specifications of our transmitter setup are shown in Table I .
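The polynomial behavioral models described in this section can be sketched as follows. The paper uses MATLAB; Python is used here only for illustration, and all coefficient values below are hypothetical rather than extracted from the ADS designs. The mixer model is treated as a memoryless transfer curve feeding the PA model, which is an assumption about how the two are cascaded.

```python
def poly_eval(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Hypothetical coefficients (offset, small-signal gain, nonlinearity terms).
ALPHA = [0.0, 10.0, 0.0, -40.0, 0.0, 0.0]   # fifth-order PA model
BETA = [0.0, 2.0, 0.0, -1.0]                # third-order mixer model

def mixer(v_in, beta=BETA):
    return poly_eval(beta, v_in)

def pa(v_in, alpha=ALPHA):
    return poly_eval(alpha, v_in)

def transmitter(v_in):
    """Mixer followed by PA, both as memoryless polynomial models."""
    return pa(mixer(v_in))

small = transmitter(0.001)   # close to the linear gain: about 2 * 10 * 0.001
large = transmitter(0.1)     # compressed by the negative cubic terms
```

A process-skewed instance or a different knob setting would be represented by a different coefficient set, so the tuning loop only needs to swap `ALPHA` to emulate a new device state.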
The stimulus generation technique discussed in Section VII is performed using a set of 50 process-skewed instances whose gain and IIP3 specifications are varied from the nominal by Monte Carlo simulations. As our implementation of the GA tries to obtain the minimum value, a term called fitness value, which is defined as the inverse of the optimization function (11) , is used [shown in (12) ].
$$\text{Fitness value} = \frac{1}{\text{Optimization function}} \tag{12}$$
As the fitness value becomes smaller, the value of the optimization function increases. The fitness value progression over generations of the GA is shown in Fig. 6 . As can be seen from the graph, the optimized chromosome/digital bitstream is obtained in 15 generations. After the 15th generation, the mean value of the fitness value (blue curve) changes but the best value (black curve) remains constant indicating that there is no better chromosome/digital bitstream obtained in subsequent generations. The optimization algorithm stops (at 55th generation) as there is no change in the best fitness value for 40 generations (preset limit). The optimized digital pattern obtained from the GA is shown in Fig. 7(a) . This optimized digital pattern is bandpass filtered to obtain a multitone stimulus, as shown in Fig. 7(b) . In the simulation framework, a system/comparator clock of F s = 100 MHz is used. The ramp signal has a period of 311 T s (where T s is the system clock period). For the simulation framework, M = 311 and N = 128. This value of M is approximately equivalent to an 8-bit ADC. The envelope response obtained at the output of the PA along with the ramp signal is shown in Fig. 8 . A total of M * N bits are captured at the comparator output. The actual envelope outputs as well as the reconstructed signals (from their respective digital signatures using Observation 1) for three process-skewed instances are shown in Fig. 9 . The envelopes shown here correspond to the higher input power level, and the reconstructed envelopes have been scaled for comparison. The error count metrics, the specifications of the different process instances, and the rms errors between the reconstructed envelopes (scaled) and the original envelopes are shown in Table II . 
This error is due to the quantization effect arising due to the resolution of the ramp signal and can be reduced by increasing the comparator clock frequency or reducing the speed of input signals with respect to the clock frequency. However, the tradeoff is that a larger number of digital bits need to be used for an iteration leading to increased memory and tuning time. The cost function (8) variation for a few process-skewed instances across all tuning knobs is shown in Fig. 10 . As can be seen from Fig. 10 , the cost function value reduces when both the gain and IIP3 values are within their acceptable limits given in Table I . In computing the cost function, two input power levels of −40 dBm (low power level) and −20 dBm (high power level) are used to excite the RF front end to obtain its two error terms. For validating the tuning methodology, 305 instances are used for the yield study. The yield histograms before and after the transmitter tuning are shown in Fig. 11 . To avoid local optima in the cost function, two different starting points for the tuning knobs are used, and the knobs corresponding to the lower value of the final cost function are selected as the optimum knobs. The initial yield is 70.6%, and the final yield is 86.5%. A yield improvement of 15.9% is obtained using the proposed tuning methodology.
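The signature capture and regrouping used in this section can be sketched in Python under simplifying assumptions: an ideal sawtooth ramp normalized to [0, 1), an ideal comparator, a noiseless periodic envelope with N samples per period, and M and N coprime so that every envelope sample is compared against every ramp level exactly once over M × N clock cycles. The function names are illustrative, and the reconstruction shown is one plausible reading of Observation 1 rather than the authors' exact procedure.

```python
import math

M, N = 311, 128   # ramp period and envelope period in clock cycles

def signature(envelope, m=M):
    """Comparator output over m * len(envelope) clocks (1 if envelope > ramp)."""
    n = len(envelope)
    return [1 if envelope[t % n] > (t % m) / m else 0 for t in range(m * n)]

def reconstruct(bits, n, m=M):
    """Regroup bits by envelope phase; the fraction of 1s estimates the level."""
    ones = [0] * n
    for t, b in enumerate(bits):
        ones[t % n] += b
    return [c / m for c in ones]

def hamming(a, b):
    """Number of positions at which two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical nominal envelope and a device with 10% lower gain.
nominal = [0.5 + 0.4 * math.sin(2 * math.pi * j / N) for j in range(N)]
skewed = [0.9 * v for v in nominal]

ref_sig = signature(nominal)
dev_sig = signature(skewed)
recovered = reconstruct(ref_sig, N)
max_error = max(abs(r - v) for r, v in zip(recovered, nominal))
distance = hamming(ref_sig, dev_sig)
```

In this idealized model the reconstruction error is bounded by one ramp step, 1/M, which illustrates why a finer ramp resolution reduces the quantization error at the cost of a longer signature.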
IX. HARDWARE VALIDATION
For the purpose of experimental validation of the proposed concept, a PA/LNA RF module in production with Texas Instruments is used. The system (see Fig. 13 ) consists of a configured MAX2039 mixer to up-convert the baseband signal generated using an Agilent 33220A function generator. The mixer is powered by a Keithley 2400 power source. The signal is up-converted at an LO frequency of 2.2 GHz generated using an HP E8648D RF signal generator. The up-converted signal is fed to the PA that has tunable control knobs. The PA chip is fed through an RF socket on the Texas Instruments (TI) tester board, as shown in Fig. 13 . The output of the PA is downconverted using a custom-made envelope detector that has a cutoff frequency of 10 MHz. To prevent the comparator from loading the envelope detector, a custom designed analog buffer (designed using AD711) is used. The clocked comparator used is Hittite HMC874LC3C with an internal sample and hold circuit. The system/comparator clock signal of F s = 10 MHz is provided by an Agilent 81133A pulse pattern generator. The ramp signal is generated using an AFG320 function generator. The digital bitstream at the output of the clocked comparator is resampled by feeding the signal to the digitizer in the NI PXI 1073E DAQ chassis. The digitizer is interfaced through NI LabVIEW to the MATLAB simulation environment in a PC where the tuning algorithm is implemented. NI 488.2 general purpose interface bus (GPIB) controller is interfaced with MATLAB for controlling the PA tuning knobs using a HP E3631A programmable power supply. The TI PA module has pins to select its mode of operation (PA or LNA) and control its linearity. The tuning knobs used for process variation compensation are the PA supply (V cc : 2.4-3.9 V) and its linearity control (V control : 2.4-3.9 V). These knobs control the gain and IIP3 specifications of the device. The nominal specifications of the transmitter system setup are shown in Table III . 
The variation of device specifications with the tuning knobs for one instance is shown in Fig. 12 . As shown in Fig. 13 , the trigger signal (brown-colored signal path) generated from the DAC (present in the NI DAQ module) initiates the input stimulus to the mixer (blue-colored signal path) using an Agilent 33220A function generator as well as the ramp signal using an AFG320 function generator. A 100-kHz tone is used as the input stimulus to the mixer. The ramp signal has an amplitude of 1.38 V pp and a symmetry of 70%.
The ramp signal is generated with the value of M = 311 T s , and the envelope detector output has a period of N = 100 T s (where T s is the system clock period). The main synchronization signal is generated in the NI DAQ chassis (black-colored signal path in Fig. 13 ) and is fed to both the 81133A pulse pattern generator (that generates the clock signal) and the E8648D RF signal generator (LO signal). This synchronization clock is also used by the DAC that generates the trigger signal. Once the trigger signal is initiated, 100 cycles of the ramp signal and 311 cycles of input stimulus are generated. The digital bitstream obtained at the output of the comparator is collected for 31100 clock cycles and is transferred to the PC. In the software (MATLAB), the calculated cost function directs the tuning knob selection, which is applied to the RF module using the GPIB control. The entire tuning setup is automated to mimic the on-chip operation.
The digital signature obtained from a transmitter instance is regrouped to reconstruct the envelope-detected signal [see Fig. 14(a) ]. As shown in Fig. 14(b) , the reconstructed (scaled) signal and the original envelope (captured using the digitizer for comparison) track each other closely (error less than 1%), thereby validating Observation 1. The reconstructed signals (from digital signatures) obtained across different tuning knobs of a process-skewed instance are shown in Fig. 14(c) . From the graph, it can be inferred that the variations in specifications that cause the variations in the envelope signals are captured in their corresponding digital signatures. The error count metric variation across the tuning knobs of an instance, computed using the HDP signatures, is shown in Fig. 14(d) (left). The actual envelope outputs of the same instance across the knobs used to generate the HDP signatures are captured (using an ADC), and the difference between them and the envelope response of the nominal device is computed [according to (1) ]. The resulting surface is also shown in Fig. 14(d) (right). The similarity in the variation of the two surfaces validates Observation 2.
For validating the tuning methodology, eight fabricated instances are used for process emulation. The initial and the final measured values of gain and IIP3, before and after tuning with the proposed technique, are shown in Table IV . The average increase in the device power consumption due to tuning is 4%, and the average number of iterations performed to obtain the final tuned specifications is 18 in this setup. Table IV has been color-coded to indicate whether the measurements pass (green) or fail (red). Two different starting points for the knobs are used to avoid the local optimum points. For instances 7 and 8, after tuning, both the specifications could not be tuned within their acceptable bounds. While the cost function (8) reduces due to the tuning implemented, it cannot guarantee that the specifications will always be within the acceptable bounds. This is an artifact of the simplicity of the proposed tuning methodology. Note that the use of the proposed tuning approach can only improve yield, as it is always possible to revert to the original tuning knob settings or perform exhaustive search if the final (measured) test specification values are not acceptable.
X. DISCUSSION

A. Tuning Time
The tuning time is dominated by the HDP signature generation time. In current technology nodes, on-chip clocked comparators operating at 100 MHz are easily feasible. Using such a clock, the total time taken to obtain an HDP signature (in simulation environment, around 40 kbits are collected) would be ∼0.4 ms. In our simulation environment, the average number of iterations required for tuning a device is 20 (18 in hardware), and it is required that for each iteration a low and high amplitude input signal be used. Hence, the total time required for performing the tuning is in the order of 16 ms (0.4 × 20 × 2). If a greater resolution of the envelope response is required, for a given number of samples of the envelope response N, then the number of samples of the ramp signal can be increased by reducing its frequency and capturing a higher number of samples at the comparator output. Table V shows the variation of HDP signature generation time with ramp signal samples M for a clock frequency of 100 MHz. The resolution increases at the cost of increase in HDP signature generation time. In reality, the current state-of-the-art RF chips can have clocks in gigahertz, which would reduce the stated HDP signature generation time. Similarly, the number of sample points of the envelope response (N) can also be increased to improve the envelope response accuracy at the cost of increased tuning time. This is equivalent to increasing the sampling speed of the ADC. Note that the factors, such as the offset voltage of the comparator and linearity of the ramp, can eventually limit the accuracy with which the envelope response is sampled.
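The timing arithmetic above can be reproduced with a few lines of Python. The function names are illustrative; the values of M, N, the clock frequency, and the iteration count are the ones quoted in the text, while the alternative values of M in the loop are hypothetical and are not taken from Table V.

```python
def signature_time(m, n, f_clk):
    """Time in seconds to collect m * n comparator bits at clock f_clk (Hz)."""
    return m * n / f_clk

def tuning_time(m, n, f_clk, iterations, power_levels=2):
    """Total tuning time: one signature per power level per iteration."""
    return signature_time(m, n, f_clk) * iterations * power_levels

t_sig = signature_time(311, 128, 100e6)        # about 0.4 ms
t_total = tuning_time(311, 128, 100e6, 20)     # about 16 ms

# Signature time grows linearly with the number of ramp samples M.
for m in (311, 622, 1244):                     # hypothetical M values
    print(m, round(signature_time(m, 128, 100e6) * 1e3, 3), "ms")
```

Since the time scales as M × N divided by the clock frequency, a gigahertz clock would shorten each signature by the same factor by which the clock rate increases.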
B. Area Overhead
The memory required depends on the comparator, device-response, and ramp-signal frequencies and can be determined for a given setup. Instead of the actual bitstream, the count of consecutive 1s and 0s in the reference digital stream can be stored as a word. In the simulation environment (∼40 kbits), a register file of 500 words of eight bits each (seven bits for storing the count and one bit for storing whether the run consists of 1s or 0s) was sufficient to represent the reference digital signature, resulting in a memory of 8 kbits (two register files are required for the two input amplitudes). The area of such a register file is ∼0.005 mm 2 in 22-nm CMOS [41] . Assuming an equal area for the control and logic block, the total area overhead is 0.01 mm 2 . An on-chip envelope detector is implemented in [19] , and its area is dominated by the resistor and capacitor. The envelope detector implemented in this paper has a cutoff frequency of 50 MHz, which can be realized with a resistance of 1 kΩ and a capacitance of 3 pF. In today's state-of-the-art 22-nm process node, the resistor and capacitor areas can be estimated to be 195 and 1 μm 2 , respectively. Therefore, the overall area of the envelope detector can be estimated to be ∼200 μm 2 . A comparator circuit with offset compensation circuitry has been implemented in [42] . The area of such a comparator in the current state-of-the-art 22-nm process technology node is estimated to be ∼30 μm 2 . The components that dominate the area of a ramp-signal generator are the current source and the capacitor [37] . Considering the frequency of the ramp-signal generator used in this paper, it can be implemented with a combination of a current source of 5 μA and a capacitance of 1.5 pF. In a 22-nm process node, the area of such a capacitor is estimated to be ∼96 μm 2 and that of the current source ∼2 μm 2 , so that the total area can be estimated to be ∼100 μm 2 .
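The run-length storage format described above can be sketched in Python. The placement of the value bit in the most significant position and the splitting of runs longer than 127 into several words are assumptions; the text specifies only a seven-bit count and a one-bit value per eight-bit word.

```python
def rle_encode(bits, count_bits=7):
    """Encode a bitstream as words: 1 value bit (MSB) + count_bits run length."""
    max_run = (1 << count_bits) - 1          # 127 for a seven-bit count
    words = []
    i = 0
    while i < len(bits):
        value = bits[i]
        run = 1
        while i + run < len(bits) and bits[i + run] == value and run < max_run:
            run += 1
        words.append((value << count_bits) | run)
        i += run
    return words

def rle_decode(words, count_bits=7):
    """Expand the words back into the original bitstream."""
    mask = (1 << count_bits) - 1
    bits = []
    for w in words:
        bits.extend([w >> count_bits] * (w & mask))
    return bits

reference = [1] * 200 + [0] * 3 + [1]        # hypothetical signature fragment
stored = rle_encode(reference)               # 4 words instead of 204 bits
restored = rle_decode(stored)
```

The saving depends on how long the runs in the reference signature are: a comparator output that changes state only a few times per ramp period compresses well, whereas a rapidly toggling stream would not.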
The combined area of the envelope detector, comparator, and ramp-signal generator is small compared with that of the memory and tuning logic (∼3%). Considering the approximate area of a 2.4-GHz transceiver in a 32-nm technology [43] and a 2× decrease in area per technology node, the RF transceiver area in 22 nm is ∼12 mm 2 , and the overall area overhead is <0.1%. If the entire reference digital signature is stored instead of the proposed format, it would result in a memory requirement of 80 kbits (an area of 0.05 mm 2 ) and an area overhead of less than 0.5%. While only approximate values have been provided, it is evident that the area overhead is minimal. Table VI shows a qualitative comparison between this paper and prior techniques. The framework presented in this paper is capable of tuning key static performance metrics, such as gain and IIP3. To sense and tune a more diverse set of specifications in our setup (such as I /Q gain and phase mismatch), additional on-chip sensors will be needed, which is a topic of future research. With a large number of tuning knobs (scalability), the main issues with the proposed technique would be the order in which the knobs are tuned and the convergence to local optima. To address the order of tuning, the sensitivity of the specifications to each tuning knob should be studied at the circuit level, and the knobs should be tuned in decreasing order of their sensitivity. Multiple starting points for the tuning knobs can be used to prevent convergence to a local optimum.
XI. CONCLUSION
In this paper, an on-chip, digitally assisted tuning methodology that can be used to tune multiple design specifications concurrently for RF circuits is presented. Results obtained from tuning the PA of a transmitter show significant yield improvement. Hardware results on an industrial RF module validate the proposed technique. While other postmanufacture techniques that provide greater yield improvement exist, these techniques use complex testing/tuning algorithms that mandate the use of a DSP or an external tester. Hence, these techniques are not self-contained solutions and are feasible only after system-level integration or with the use of external computational resources. The proposed scheme can tune the same key static specifications as prior tuning schemes, such as gain, IIP3, and IP1 dB, with significantly reduced hardware. The digital stimulus generation, digital signature analysis, and digital logic-driven tuning technique presented in this paper provide a completely on-chip, built-in calibration framework with minimal area and time overhead.
