A hardware implementation of the Nth order complex-lag time-frequency distribution is proposed. The considered distribution provides an arbitrary high concentration for multicomponent signals with fast varying instantaneous frequency. Although the distribution form is quite complex, the proposed realization is very efficient and provides high-speed real-time processing. Further, it allows avoiding miscalculation errors that may appear in the numerical calculation of signal with complex-lag argument. The results of FPGAs (Field Programmable Gate Arrays) implementation are presented as well.
solved by a suitable hardware realization. Consequently, the favorable properties of the complex-lag distribution could be available in various real-time applications.
A flexible architecture of an arbitrary (Nth) order complex-lag distribution for multicomponent signals is proposed in this paper. Although it seems that the Nth order complex-lag distribution is difficult for realization, an efficient parallel configuration is provided, suitable for VLSI implementation that allows high-speed real-time processing. The serial realization that reduces number of components is considered as well. The proposed hardware overcomes the errors in the numerical calculations caused by the limited MATLAB software precision, which is an additional advantage. Namely, this problem appears in the calculation of analytic extension of the signal with complex-lag argument [9, 13] .
The proposed system consists of two main parts: realization of the S-method (proposed in [14] ) and realization of the concentration function (system of order N-2). The realization of the concentration function is performed throughout several subsystems providing calculation of the signal with complex-lag argument, concentration function calculation, and its time-frequency domain version. The parallel architecture is implemented on the FPGAs chips in order to verify the results. A simple solution for some intermediate functions (based on logarithmic and sine and 2 EURASIP Journal on Advances in Signal Processing cosine functions) that do not exist as standard components, is given as well.
The paper is organized as follows. The review of generalized time-frequency distribution with complex-lag argument is given in Section 2. Section 3 presents the parallel hardware realization of generalized time-frequency distribution with complex-lag argument, while in Section 4 the serial realization is proposed. Section 5 presents the analysis and comparison of the proposed system. The FPGA implementation and simulation results are given in Section 6. Concluding remarks are given in Section 7. The realizations of intermediate calculations are given in the appendix.
Theoretical Background
A general discrete form of the Nth order time-frequency distribution with complex-lag argument can be written as [13] GCD N (n, k) = 
where ( 
where w N,p = e j2π p/N = w rp + jw ip . Observe that the GCD N (n, k) can be expressed as
The convolution in frequency domain is denoted by * k , while FT m denotes the Fourier transform with respect to variable m. The concentration function (of order N-2) acts as a correction term that can arbitrarily improve the concentration of the Wigner distribution (WD) by increasing the order N. In order to provide a suitable representation for multicomponent signals, the Nth order complex-lag distribution has been modified [13] . The modifications are introduced in the calculation of the concentration function 
More details could be found in [13] . Calculation of signal with complex-lag argument x ap (n ± m(w N,p /N)) q is considered later in this section.
The time-frequency representations of the resulting functions c r (n, m) and c i (n, m) (for all signal components) are obtained as
where Q is the number of signal components.
The final form of the concentration function in the timefrequency domain is the convolution of C r (n, k) and C i (n, k):
Note that P(i) is a frequency domain window of the size 2L d + 1. The cross-terms free representation is obtained if the size of window is less than the minimal distance between the autoterms.
Finally, a general form of modified complex-lag timefrequency distribution is defined as
where, instead of the WD, the S-method (cross-terms free
, is used. The acronym STFT is used for the short time Fourier transform.
Calculation of signal with complex-lag argument. In order to calculate the signal with complex-lag argument, the signal components should be first separated by using the STFT [9] . Then the qth component of the signal with complexlag argument is calculated by using the analytic extension as follows [13] :
It is assumed that the qth signal component is within
} is the position of its maximum. The parameter W q is used to define the width (2W q + 1) of the qth signal component in the time-frequency plane. The crossterms will be avoided if 2W q + 1 is smaller than the distance between autoterms. The width 2W q + 1 could be adjusted for each signal component [14] .
Note that the real part of exponential function exp( jmkw N,p /N), for large values of mk, can exceed the computer (software) precision range, and it may cause errors in the numerical realization.
Parallel Architecture for Implementation of the Nth Order Complex-Lag Time-Frequency Distribution
A system for implementation of the Nth order complexlag time-frequency distribution for multicomponent signals is proposed in this section. The block scheme is given in Figure 1 . The starting block is the calculation of the STFT that is used to obtain the S-method (SM(n,k)). According to (8) , the STFT is also used to obtain the signal with complexlag argument for the computation of C(n,k) (by using (4), (5) , and (6)). The S-method is obtained at the output of the SM block, while the C(n,k) is obtained at the output of the BLOCK 4 in Figure 1 . The complex-lag distribution (MGCD(n,k)) is produced at the output of the BLOCK 5, as a convolution of the SM(n,k) and C(n,k).
Hardware Solutions for the STFT and the SM.
The architectures for the STFT and the SM implementation have been proposed in [14] . They are shown in Figures 2(a) and 2(b), respectively. Namely, by using the rectangular window, the STFT is realized as [15, 16] STFT Re (n, k
where, for a given k, the quantities c(k) = cos(2πk/N s ) and s(k) = sin(2πk/N s ) are constants. The STFT(n,k) represents the input for the SM realization. Thus, the SM is obtained according to the form presented in [14] : In the sequel, each block will be analyzed and presented separately.
Hardware Solution for the Concentration Function in

BLOCK 1: Separation of Signal Components.
The regions that contain signal components are separated within BLOCK 1, based on the outputs of the STFT block. In this sense, it is necessary first to allocate components of |STFT c (n, k)| that are higher than the reference value R. For instance, if the signal amplitudes are normalized, the STFT values are in the range [−1, 1] , and the reference value should be set to R = 1/λ, where λ is a scaling constant and λ ≥ 1 holds. Otherwise, if the signal range is unknown, the reference value is defined as a portion of the STFT's maximum at a given instant n,
The regions of STFT c (n, k) that are higher than R contain the signal components. Each of them is further processed to find the position of its maximal component denoted by k q , q = 1, . . . , Q, where Q is the number of allocated regions, that is, the number of signal components. The outputs STFT c (n, k) for k ∈ [k q − W q , k q + W q ] are passed to the inputs of BLOCK 2. 
BLOCK 2: Realization of the
A/D Clock R1 R2 RN Mult Mult Mult Mult V CC V CC V CC C in C in . . . . . . . . . STFT Re (n − 1, k) STFT Im (n − 1, k) k = 0 k = 1 clk clk + + + + S(k) C(k) (a) SM(n, k) STFT Re (n, k − L d ) STFT Re (n, k + L d ) STFT Re (n, k − L d + 1) STFT Re (n, k + L d − 1) STFT Re (n, k − 1) STFT Re (n, k + 1) STFT Re (n, k) STFT Im (n, k) STFT Im (n, k + 1) STFT Im (n, k − 1) STFT Im (n, k + L d − 1) STFT Im (n, k − L d + 1) STFT Im (n, k + L d ) STFT Im (n, k − L d ) Mult
A(m, k) B(m, k)
A(m, k) B(m, k) A(m, k) B(m, k) A(m, k) B(m, k) A(m, k) B(m, k) VCC VCC VCC VCC VCC (a) STFTcRe(n, kq − Wq ) STFTcIm(n, kq − Wq ) STFTcRe(n, kq − 1) STFTcIm(n, kq − 1) STFTcRe(n, kq ) STFTcIm(n, kq ) STFTcRe(n, kq + 1) STFTcIm(n, kq + 1) STFTcRe(n, kq + Wq ) STFTcIm(n, kq + Wq ) Re(xa p (n − wN ,p m/N)q ) Im(xa p (n − wN ,p m/N)q ) Mult
A1(m, k)
The constants A(m, k) and B(m, k) are defined as
Similarly, real and imaginary parts of x ap (n − w N,p m/N) q ( Figure 3 (b)) are obtained by using the constants: 
where, to simplify the expressions, the following notations are used:
The outputs of architecture in Figure 4 (defined by (14)), are further combined for all p : By taking all signal components, the resulting concentration functions c r (n, m) and c i (n, m) ( Figure 6 ) are obtained:
Im c r (n, m) q , EURASIP Journal on Advances in Signal Processing 
EURASIP Journal on Advances in Signal Processing
Re c i (n, m) q ,
BLOCK 3: Realization of Functions C r (n, k) and
The architecture in Figure 7 is used to obtain timefrequency domain functions: C r (n, k) = FFT{c r (n, m)}, and C i (n, k) = FFT{c i (n, m)} (FFT denotes the fast Fourier transform). Since the real and imaginary parts are treated separately, the outputs of FFT circuits in Figure 7 are obtained in the form:
Therefore, the real and imaginary parts of functions C r (n, k) and C i (n, k) (the outputs of adders in Figure 7 ) follow as 
BLOCK 4: The Final Form of the Concentration Function in the Time-Frequency Domain.
The concentration function C(n, k) is obtained as a convolution of functions C r (n, k) and
The architecture is shown in Figure 8 (a) Note that, the realization of the product term C r (n, k + l)C i (n, k − l), shown in Figure 8(b) , is done according to
In the final step, a convolution of SM(n, k) and C(n, k) is performed within BLOCK 5 ( Figure 1 ) and the channels of the Nth order complex-lag time-frequency distribution are obtained as
The corresponding architecture is given in Figure 9 . Note that the convolution of SM(n,k) and C(n,k) is realized in the same way as the convolution of STFT(n,k) and STFT(n, k) within the SM. 
Serial Architecture for Implementation of the Nth Order Complex-Lag Time-Frequency Distribution
As an alternative to the proposed system with parallel realization, a serial architecture for implementation of the Nth order complex-lag distribution is considered. The block scheme for serial architecture is given in Figure 10 . The STFT block is the same as in the parallel realization, while the serial architecture for the S-method (SM block) is given in [14] . The concentration function is obtained throughout the following blocks ( Figure 10 ): BLOCK1, BLOCK2, BLOCK3, and BLOCK4. BLOCK1 (the separation of signal components) and BLOCK3 (the FFT calculation) are the same as in the parallel realization. Therefore, the main modifications with respect to the parallel realization are made within BLOCK2. It consists of Block21, Block22, Block23, and Block24, that will be briefly discussed in the sequel.
Block21. This block is used for the calculation of signal with complex-lag argument, given by (11) . The realization is shown in Figure 11 . The LUT (Look-Up Table) contains the constants : A(m, k), B(m, k), A 1 (m, k), and B 1 (m, k) , defined by (12) and (13), respectively. Address 1 and Address 3 are used to provide synchronization between Block22. The outputs of Block21 are fed to the input of Block22 that is used to obtain concentration functions c r p (n, m) q and c i p (n, m) q . The serial realization of these functions is the same as in the parallel architecture (Figure 4) .
Block23. It calculates the resulting concentration functions c r (n, m) q and c i (n, m) q defined by (16) . The realization is given in Figure 12 . The Address 4 and Address 5 determine which pair of inputs should be multiplied within Mult circuit. One cycle of clk7 is set to the time interval required for one multiplication and one addition operation. After two cycles of clk7, RESET2 signal resets the clock ADD. The terms Re{c r (n, m) q }, Im{c r (n, m)q}, Re{c i (n, m) q }, and 
Re(MGCD(n, k))
Im(MGCD(n, k)) Figure 9 : Architecture for realization of convolution between SM(n,k) and C(n,k). BLOCK4 is used to perform the calculation of concentration function C(n, k) defined by (20) . The realization of this block is given in Figure 13 . One term in summation is calculated within one cycle of the clk9, while complete sum is obtained after 2L d + 1 cycle. The clock cycles and reset signals of this circuit are similar as in Block23. The resulting complex-lag distribution is obtained at the output of BLOCK5 in Figure 10 . The realization of this block is similar to the realization of BLOCK4. The only difference is that only one cumulative adder is required. Figure 11 : Serial architecture for producing of signal with complex-lag argument (Block21). 
Analysis of the Proposed System
In the sequel, some practical issues related to the realization of the proposed hardware realizations are addressed. The proposed system (either with parallel or serial architecture) can be implemented by using the hardware components with floating point format to provide satisfying precision for the calculation of signal with complex-lag argument. However, the floating point adders and multipliers introduce the output latency of few clock cycles. Thus, in order to decrease the output latency, the number of hardware components with floating point format should be reduced. in serial realization). The remaining part of the system can be realized by using the fixed point format. Note that starting from the output of the system shown in Figure 4 (the outputs of cos and sin circuits), all values will be in the range [−1, 1], and the fixed point notation could be used. Also, by normalization of input signal, the STFT and the SM can be calculated by using the fixed point format.
The floating point multipliers increment the exponent by 1, which should be corrected at the output. The fixed point multipliers result in a two sign-bit. Thus, to correct the result, the product has to be shifted left by one bit. This shifter can be included as a part of multiplier.
The total number of circuits needed for one channel of the proposed system for parallel realization is given in Table 1 . The longest path is given in Table 2 . It connects the register storing STFT(n − 1, k ± W q ) with the output MGCD(n, k). The length of this path determines the fastest sampling rate.
The number of required circuits and longest path for serial realization are given in Tables 3 and 4 , respectively.
In order to compare the parallel and serial realizations the following values are considered: N s = 128, N = 6, W q = 2, L d = 2, and Q = 2. The number of hardware components and the latency for parallel and serial realizations are given in Table 5 .
Note that the number of components required for serial realization is reduced with respect to parallel realization. However, the latency in serial realization is significantly increased. Therefore, in order to provide higher processing speed, we will consider parallel realization for the FPGA implementation.
FPGA Implementation of the Proposed Parallel Architecture
The proposed architecture can be implemented by using various hardware devices. As one of the possible choices, the digital signal processor could be used. However, it is not suitable for real-time processing, especially at very high speeds. Therefore, for high-speed processing, one might consider the ASIC implementation (application specific integrated circuit) or the FPGA implementation. Both of them allow a high degree of parallelism, as well. ASIC implementation can provide lower power consumption, design size optimization, and design flexibility that enables speed optimization. However, long production time and significant costs do not recommend ASIC device for prototype development. The main advantages of the FPGA implementation are reconfigurability, lower producing time and costs requirements, inbuilt special hardware such as RAM, and so forth. Thus, we use the FPGA to develop the prototype that, after testing and verifying the results, can be implemented on an ASIC. The FPGA chips from Altera [17] and the Quartus II v8.0 software are used in the proposed implementation.
The fourth-order (N = 4) complex-lag distribution is considered for FPGA implementation. A simplified block scheme of the system is shown in Figure 14 . The FPGA implementation of the S-method (SM Chip), including the chip parameters and performance, has been provided in [14] . Thus, we focus on designing the chip for the concentration function c(n, m) (C function Chip). The output of this chip is passed to the input of the FFT Chip (that produce C(n, k) ). Note that a solution for the FFT chip by using the FFT MegaCore function has recently been proposed by Altera [18] . For device EP3SE50F780C2 (Stratix III family) and the 16-bit fixed point format, this FFT chip requires 4290 ALUTs, 5116 memory bits, 20 DSP 18 × 18 multipliers, and Re(c i (n, m)) Figure 15 (the Block 1). The remaining components for the realization are given in Block 2, Figure 15 . The schematic diagram for FPGA realization of signal Re{x ap(n+ jm) } = a is given in Figure 16 . In order to avoid miscalculations of the analytic extension, the extended single precision floating point arithmetic is used (the number of samples is N s = 128). The floating point multiplier altfp mult0 has the output latency of five clock cycles, while the output latency of floating point adder altfp add sub0 is seven clock cycles. The output latency is determined by the number of summation terms in (11) . Here, one multiplier and W q +1 adder are used. Note that the multiplier altfp mult0 produces the result that is multiplied by two, which will be corrected in the circuit for cosine function calculation.This architecture is realized in the device EP3C40F780C6 from Cyclone III family fabricated in DDR2-SDRAM technology. The available chip characteristic and utilized resources are given in Table 6 .
The schematic diagram for FPGA realization of the second part (Block 2 from Figure 15 ) is given in Figure 17 Figure 17 is realized in the device EP3C16F484C6 from Cyclone III family fabricated in DDR2-SDRAM technology. The available chip characteristic and utilized resources are given in Table 7 .
The circuits that calculate natural logarithm and cosine and sine functions are not implemented in the Quartus II v8.0 software. Thus, they could be calculated by using polynomial approximations, CORDIC (COordiante Rotation DIgital Computer) algorithm or LUT. The approach that uses polynomial approximations requires floating point arithmetic and iterative calculations. This method is computationally very extensive, slow, and it is not suitable for high-speed processing. CORDIC is a recursive algorithm that uses small number of circuits to provide calculation of trigonometric, logarithmic, and exponential functions [18] [19] [20] [21] . If very high precision is required, the CORDIC approach will be more suitable and then the LUT. It has been shown that for the precision up to 16-bits, the LUT approach provides higher processing speeds [21] . The test we performed shows that 16 bit precision is sufficient for the calculation of the logarithmic and trigonometric functions and higher precision does not provide further improvement of results. Thus, for the calculation of logarithmic and trigonometric functions we use the LUT approach. Simple solutions for these functions (log 2 circuit and cos and sin function circuit in Figure 17 ) are given in the appendix.
Finally, we have compared the throughput rate and latency for software simulation and proposed hardware realization. The software simulation is performed by using MATLAB 7 running on Pentium IV with 3.2 GHz and 1 GB RAM. The throughput rate obtained in the software simulation is 4.6 Mbs, while the latency is 90 microseconds. The proposed hardware realization provides throughput rate 6.9 Gbs and latency of 180 nanoseconds. Therefore, as expected, the proposed hardware realization provides significant improvement for throughput rate and latency compared to the software simulation.
Conclusion
The proposed hardware provides an efficient implementation of the Nth order complex-lag time-frequency distribution for multicomponent signals. This flexible system is realized in parallel and serial configurations and combines fixed and floating point arithmetic. It provides satisfactory calculation precision and solves the problem of errors that could appear in numerical realizations of distributions with complex-lag argument. Furthermore, the FPGA implementation of the proposed parallel hardware solution is provided, including the parameters of FPGAs chips.
Appendix
Natural Logarithm. Since we deal with the binary arithmetic, the natural logarithm of number X can be written in terms of the logarithm with base 2: ln X = log 2 X/log 2 e. Let us consider the floating-point number in the form X = x 1 2 x2 , where x 1 and x 2 are mantissa and exponent, respectively. Thus, log 2 x 1 2 x2 = x 2 + log 2 x 1 holds. The schematic diagram for the calculation of logarithm with base 2 is shown in Figure 18 . The exponent x 2 has been incremented for the value 1026 within the preceding circuits (1023 comes from the extended single precision floating point format, while 3 is introduced by floating point multipliers). Thus, in order to correct the result, the exponent is reduced for the same value within the lpm add sub2 circuit in Figure 18 . The calculation of log 2 x 1 is performed by using the look-up table (LUT) that contains 255 values (logarithm circuit in Figure 18 ). The first 8 bits of mantissa are used as an input of LUT. In order to obtain satisfying precision, it is created as LUT(x 1 ) = round(2 15 Note that 2 15 x 2 is obtained at the output of circuit lpm mult2 in Figure 17 . Thus, the final output log2 [27 . . . 0] has the value (2 15 · x 2 + LUT(x 1 )).
Cosine and sine functions. The input value in the cos and sin function circuit, given in Figure 19 , is first corrected by scaling with constant value 1/(2 · 2 15 log 2 e) (const [20. .0] in Figures 17 and 19 ) in the circuit lpm divide2. Note that 1/2 results from (14), while 1/(2 15 log 2 e) remains from the calculation of natural logarithm. The input of this circuit is additionally divided by the period of cosine function 2π. Then the remainder (remain [20. .0] in Figure 19 ) is used as an input of LUT that provides the values of cosine and sine functions. Namely, in order to provide satisfying precision, the remainder is quantized to 512 values. The ten least significant bits, obtained by the quantization, are used as an input in the LUT that provides 16-bit values for cosine and sine functions.
