Introduction
The polynomial phase signal (PPS) is the standard model of the nonstationary signals motivated by the Weierstrass theorem that states that any continuous time function can be modeled as a polynomial. This model has been studied for more than three decades with numerous techniques designed for parameter estimation. Three main groups of the parameter estimators are [1] , [2] :
 Maximum likelihood (ML) estimators (accurate but with unacceptable complexity for PPS of order higher than 3);  Phase unwrapping (PU) estimators (accurate only for narrowband signals and high signal-to-noise ratio (SNR) but inaccurate for wideband signals and/or more emphatic noise) [3] ;  Phase differentiation (PD) estimators (efficient techniques but with limited accuracy especially for higher-order PPSs) [4] - [21] .
For a long time due to efficiency reasons, the PD estimators were a primary research topic. They are reviewed in [1] , [2] . However, their performance for higher-order PPS is not satisfactory. Recently an alternative approach has demonstrated excellent accuracy significantly improving results with respect to all three groups of estimators [22] - [29] . It is essentially a combination of the PU and the ML estimators with pre-processing. In the pre-processing stage, the rough estimate of the signal parameters is obtained by polynomial regression of the instantaneous frequency (IF) estimates. The IF estimates are obtained from the position of the short-time Fourier transform (STFT) maxima [22] , [30] , [31] . Note that there are special purpose hardware systems with different configurations for the STFT realization [32] - [39] . The STFT realizations are commonly used, as an initial stage, in more sophisticated systems dedicated for the realization of the advanced time-frequency (TF) representations [40] - [45] , as well as for realization of the advanced time-varying and space-varying filtering systems, [46] - [49] . However, multiple STFTs are evaluated in the QML what is one of the main difficulties in both software and hardware realizations of this algorithm. After initial/rough PPS coefficients' estimates are obtained by polynomial regression of the IF estimates, the noisy signal is demodulated using signal reconstructed by roughly estimated PPS coefficients. The resulting signal is the PPS of the same order as initial one but it is narrowband and low-pass suitable for the PU estimation. This signal is filtered with low-pass moving average filter reducing noise influence without impact to the signal content. In the fine stage, the phase coefficients are estimated from the unwrapped phase of demodulated and filtered signal. This process is referred as the O'Shea refinement strategy [18] . The fine estimate is an aggregation of results from the rough and fine stages. The final algorithm output is selected based on the ML criterion from estimates calculated with multiple STFTs for various window widths. In this case, the search is performed over the single parameter, i.e., window width in the STFT, and not as in traditional ML algorithm realization where the search is performed over all phase parameters that is infeasible for higher-order PPSs. Excellent results of the described technique qualify it for many practical applications, such as the radar/sonar systems, communications, sensor networks, or biomedical signals and systems. However, this technique also requires high calculation complexity that seriously reduces its applicability in real-time applications. Nevertheless, hardware implementation, if possible, can help in over-coming of this nuisance. Therefore, it is an urgent need to develop the hardware for the QML algorithm realization.
The QML algorithm realization is challenging with evaluation of the STFT, IF estimation (the maxima STFT detection), polynomial regression of the IF and phase, demodulation, and filtering. A mitigating circumstance is the fact that all the above listed procedures are repeated in the same manner but for the different STFT window width. Hence, a hardware implementation of these procedures for a particular STFT window width is the key task that will be solved and presented in this paper.
The paper is organized as follows. The QML algorithm is presented in Section 2. Hardware implementation of the algorithm, for a particular STFT window width, is developed and described in Section 3. In Section 4, the developed design is tested on a noisy PPS and verified through the implementation on a field programmable gate array (FPGA) device. Finally, corresponding conclusions are derived in Section 5.
PPS Estimation and QML Algorithm
The Mth order PPS can be described as 
where the first derivative is with respect to n. The problem of interest is to estimate signal parameters { ; , 0,.
The QML algorithm can be summarized with the following steps.
1. Calculate the STFT of x(n) given by (1) with different window widths ( , )
where window function
H is the set of window widths and S is the number of signal samples.
Estimate IFs for each window from the set
: 
Noise influence is attenuated with the moving average filter:
where filter length is P (assumed odd number). Then the phase of dechirped (low-pass) and filtered signal ˆ( ) x n is extracted and unwrapped
5. Now, residual phase parameters ˆ [1, , ] ,
can be estimated from ( ) v n using polynomial regression:
where elements of ( 1) 
with elements dependent on the window width ,
6. The matched correlator (ML criterion function) is used to determine the optimal window width ˆa rg max ( ),
7. The optimal window width determine final estimate from the ensemble of estimates
From the perspective of hardware implementation, the algorithm steps 1 to 5 are the essential ones, and they represent its core. Namely, those steps are repeating for each STFT window width and, therefore, require the same hardware implementation. Based on the simple criterion (11), it is decided which results (for what STFT window width realization) are the most accurate ones and they are chosen as the final estimates, (12) . Having in mind the above elaboration, a hardware implementation of the QML algorithm core presented in the next section is the main contribution of this paper.
Hardware Implementation
Hardware implementation capable to provide the QML algorithm-based estimation of the PPSs is given in Figs. 1-5 and Table 1 . It has three main functional modules. In accordance with their associated names, the first two modules provide the calculation of the rough and fine coefficients of the QML algorithm, respectively. The third module creates the final outputs of the implementation based on the input signal and the outputs of the first two modules (rough and fine coefficients). The final outputs are estimated signal coefficients, filtered signal, as well as estimated signal amplitude.
The STFT block from Fig. 1 represents one of the well-known and already available modules that are used to 
Parameters
Parameters' values
provide the TF signal representation based on the linear STFT. These modules are implemented either by using the available FFT chips, [32] , [33] , or by using approaches based on the recursive algorithms, [34] - [39] . Based on the calculated STFT samples and the peak detection along the frequency direction, [50] - [52] , the IF Estimation Block (see eq. (4) Table 1 . To this end, the control logic groups modules that consist of the variable length up/down binary counters and the binary magnitude comparators whose references are parameters from Configuration registers. In this way, each of these modules creates the control signal corresponding to the parameter from the Configuration registers.
Within the output signal calculation, the proposed implementation takes multiple and fixed number of clock cycles (CLKs). Calculations of the rough coefficients Rcoeffs and the fine coefficients Fcoeffs take N×S×M CLKs and S+(M+1)round(S/P)+1 CLKs, respectively, where operator round(x) denotes rounding of the real variable x to the nearest integer, whereas the Filtering Module requires S+1 CLKs to complete calculation of the output. Note that number of taken CLKs depends on the known algorithm parameters N, M, S, and P, which means that execution time of the proposed hardware implementation can be calculated in advance that can be of great importance in many practical applications.
Real and imaginary parts of input STFT data, numerically calculated in STFT Block, are imported to the proposed implementation on every CLK and are stored in Register file 1 and Register file 2 of the IF Estimation Block, Fig. 2 . After N CLKs, these register files contain all frequency-only-dependent STFT samples from the observed time instant. Over the absolute ( , ), /2,..., /2 1
, the COMP Block performs, in the already described manner, the peak detection. According to (4), IF is detected in the TF point correspondent in frequency to the detected peak STFT sample. Hence, the IF Estimation Block takes N×S CLKs to complete the IFs estimation for the windowed PPS.
For each N×CLK cycle, the estimated IFs are imported to the Register file 1 of the Matrix-Array Multiplication Block, Fig. 3 . Simultaneously, on every N×CLK cycle, and controlled by the Read_Matrix_1 signal, the Matrix 1 elements are imported to the Register file 2 of the same block. After N×S CLKs, all the estimated IFs are imported to the Register file 1, but only one Matrix 1 column is imported into the Register file 2. At that instant, by the multiplication of the corresponding elements from the Register file 1 and Register file 2, and afterward by the summation of the calculated products, one rough coefficient is produced at the output of the Matrix-Array Multiplication Block. This procedure is repeated M times (one time per a Matrix 1 column). As a result, after N×S×M CLKs all rough coefficients Rcoeffs are calculated and imported to the Shift_Memory_Buffer_1 block. Note that, controlled by the IF_dis signal, when all estimated IFs are imported to the Register file 1, registers from this file do not change their values. In the same manner, the SMB1_dis signal disables the Shift_Memory_Buffer_1 block to change its content after N×S×M CLKs, which allows the design to have the Rcoeffs (in the Shift_Memory_Buffer_1 block) until completion of the execution. Implementing (6) based on the calculated Rcoeffs, the Dechirp Coefficients Block (the simple combinational logic containing a set of the two-input dividers) performs the calculation of the dechirp coefficients Dcoeffs simultaneously imported to the Signal Modification Block. At the same time, on every CLK cycle and controlled by the Read_1 signal, the estimated signal samples (both their real and imaginary parts) are imported to the Signal Modification Block of the Fine Coefficients Calculation Module. After S CLKs, all signal samples are imported to this block, providing a base for the execution of the necessary operations (exponentiation, multiplication, summation, sine, and cosine). As a result, real and imaginary parts of the dechirped signal (Dechirped Signal_re and Dechirped Signal_im) are produced at the output.
Controlled by the MA_Filt_en signal, real and imaginary parts of the calculated dechirped signal are separately imported to the MA Filtering Block. To create a base for the moving average filtering (7) with no overlapping windows, the imported samples are grouped in the P length register files. Filtering (7) is implemented by the simple combinational logic containing a set of the Pinput adders and two-input dividers. The MA_Signal_re and MA_Signal_im samples, produced at the output of MA Filtering Block, are imported to the combinational Phase Extraction Block, which performs the arcus tangent and the phase unwrapping function (8) . This block requires only one CLK to finish the calculation and to produce the Phase signal at its output.
On every CLK, the calculated Phases are imported to the Register file 1 of the Matrix-Array Multiplication Block, shown in Fig. 3 . Simultaneously, on every CLK, and controlled by the Read_Matrix_2 signal, the Matrix 2 elements are imported to the Register file 2 of the same block. After round(S/P) CLKs, all the calculated Phases are imported to the Register file 1, but only one Matrix 2 column is imported into the Register file 2. At that instant, by the multiplication of the corresponding elements from the Register file 1 and Register file 2, and afterward by the summation of the calculated products, one fine coefficient is produced at the output of the Matrix-Array Multiplication Block. This procedure is repeated (M+1) times (one time per a Matrix 2 column). As a result, after round(S/P)(M+1) CLKs all fine coefficients Fcoeffs are calculated and imported to the Shift_Memory_Buffer_2 block. Note that, controlled by the PE_dis signal, when all the estimated Phases are imported to the Register file 1, registers from this file do not change their values. In the same manner, the SMB2_dis signal disables the Shift_Memory_Buffer_2 block to change its content after (N×S×M+S+(M+1)× round(S/P)+1) CLKs, which allows the design to have the Fcoeffs (in the Shift_Memory_Buffer_2 block) until completion of the execution.
The Estimated Signal Coefficients are calculated inside the Final Coefficients Block (the simple combinational logic containing a set of the two-input multipliers and a set of the two-input dividers), based on the already calculated rough and fine coefficients Rcoeffs and Fcoeffs. Simultaneously with the calculation, the Estimated Signal Coefficients are imported to the Signal Modification Block. At the same time, on every CLK cycle and controlled by the Read_2 signal, the estimated signal samples (both their real and imaginary parts) are imported to the Signal Modification Block. After S CLKs, all signal samples are imported to this block, providing a base for the execution of Table 2 Summarized resource utilization of the proposed design implemented on the Cyclone III EP3C16E144C7 FPGA device and determined by N=128, M=6, S=256, P=5, and input data length l=32. *Note that a total number of pins used by the complete design correspond to the sum of input pins of the first module and output pins of the third module. **Note that a total number of memory bits used by the complete design correspond to the sum of a total number of memory bits used by each module separately increased for a total number of memory bits necessary to locate input signal samples, Fig. 1 Total PLLs 0/4 (0%) 0/4 (0%) 0/4 (0%) 0/4 (0%) the necessary operations (exponentiation, multiplication, summation, sine, and cosine). As a result, real and imaginary parts of the filtered signal (Filtered Signal_re and Filtered Signal_im) are produced at the output.
Controlled by the Amplitude_en signal, real and imaginary parts of the filtered signal are separately imported to the Register file 1 and the Register file 2 of the Amplitude Block, as shown in Fig. 5 . Calculation of the Signal Amplitude signal (including summation, multiplication, dividing, and square root operation performed inside this block) requires only one CLK (the last CLK taken within the execution).
Testing and verification
To verify the developed hardware approach for the QML algorithm-based estimation of PPS, it is first implemented on an FPGA device and is tested, after that, on the PPS (1) of the sixth-order, M=6, within the time interval [-1,1), and with the sampling interval of Δ=1/128. Signal (1) is corrupted by the additive white noise. The implementation is done by considering different input SNR values (from 0 dB to 20 dB in a stepwise of 2 dB). Within the implementation, the Cyclone III EP3C16E144C7 device has been selected in accordance with the optimal resource occupation, Table 2 . Besides, as a key implementation detail, the maximum CLK rate of 50 MHz is achieved within the execution. Samples of the observed noisy signal are written in the signed 32-bit fixed-point notation including 8-bit fraction, as well as the numerically calculated and imported STFTs. Within the STFT numerical calculation, the rectangular lag-window width of N=128 is selected. For the moving average filter length of P=5 is chosen.
Results of the real-time implementation for different input SNRs are given in Table 3 (rows named Hardware). In addition, to check the accuracy of the achieved results, numerical results obtained from MATLAB simulation are also presented in Table 3 (rows named Software 1). To provide a fair comparison, numerical results are obtained for the same input noisy PPS and for the same numerically calculated STFTs, as in the hardware implementation case. The proposed hardware implementation provides high quality estimation and only small differences in the parameter estimation can be noticed. Typically, differences are from 0.03% (in the case of parameter a 0 estimation and for SNR in =20dB) up to 2.66% (in the case of parameter a 3 estimation and for SNR in =12dB). The inaccuracy is mainly caused by the approximate realization of some basic functions within the hardware implementation. For example, the trigonometric functions realization is based on Taylor series representation and within hardware implementation, only first two terms of the corresponding Taylor series are taken into account. On the other hand, calculation of these functions in MATLAB uses, by default, 250 terms of the Taylor series. To confirm this statement, default trigonometric functions realizations have been replaced in MATLAB by the corresponding Taylor series representations, but with the same accuracy as in the hardware implementation case (only first two terms of the corresponding Taylor series have been taken into account). Results achieved in this way are also presented in Table 3 (rows named Software 2), for each particular input SNR value. Now, comparing the corresponding Hardware and Software 2 rows from Table 2 , the accuracy of the proposed hardware implementation can easily be checked. Note that, in this case, differences in the parameters estimation (from 0.012% in the case of parameter a 5 estimation and for SNR in =20dB up to 0.51% in the case of parameter a 1 estimation and for SNR in =8dB) are caused by the finite register length influence [53] and are significantly smaller than differences noted within the comparison of the proposed hardware implementation and the MATLAB implementation which uses default realizations of the trigonometric functions.
Note that, if it would be required, the accuracy of the proposed hardware implementation can be improved by using more precise mathematical formulas for the mentioned basic functions realization. However, in this way, hardware complexity would be significantly increased, together with the increase in execution time. Therefore, the hardware implementation should be developed by making a trade-off between required precision from one side, and hardware complexity and execution time from the other side, as performed here.
To complete estimation of the observed 6-th order PPS, the developed hardware design requires a period of time depending on the number of taken CLKs, discussed for each used module in Section 3, and on the single CLK duration, where the single CLK duration of 20ns corresponds to the maximum CLK rate of 50 MHz. This period of time (about 3.95 ms) is about 10 times smaller than the period required by the MATLAB simulation, performed on a high performance computer (24GB RAM, i7 Intel processor).
Conclusion
Efficient multiple-clock-cycle hardware implementation of the QML estimator core, the recently proposed powerful alternative to the state-of-the-art PPS estimators, is developed, tested, and verified. Based on the fact that PPSs have great importance in many practical applications (radars, sonars, sensor networks, biomedicine, and communications), as well as that the QML significantly outperforms other corresponding algorithms (in terms of both characteristics: the SNR and the mean square error), the efficient hardware implementation of such estimation obviously is of great benefit and importance. In that way, this accurate and useful estimation algorithm will get its full practical valorization through possible real-time applications. The developed design is verified by implementation on an FPGA device and is tested on the real, noisy PPS, whereas the achieved very high accuracy is proven by comparison with the results obtained by the MATLAB simulation. It is shown that deviations in the estimation of signal coefficients are mainly caused by the limited precision in the realization of some basic mathematical functions, but also that their precision can be increased if required. A future research could be focused on the parallelization of the necessary number of QML algorithm cores, along with the implementation of conditional steps of that algorithm. However, due to the expected extensive increase in calculation complexity (that is already high in the realization of the single algorithm core), advanced methods for optimization should be required.
