Abstract. This paper considers the application of the pipelining technique to the estimation of highly nonstationary two-dimensional (2D) frequency-modulated (FM) signals. The considered design technique allows the implemented filter to overlap in execution the unconditional steps performed in neighboring space/spatial-frequency (S/SF) points and, consequently, to significantly improve the execution time of the real-time implementation of the 2D nonstationary Wiener filter, which is based on the 2D cross-terms-free Wigner distribution (2D CTFWD) and on the 2D CTFWD-related local frequency (LF) estimation in a noisy environment. In this way, an improvement in execution time of one clock cycle (CLK) per S/SF point is achieved, which means that the improvement per S/SF point can reach 50% in most S/SF points. The design and its improvement in execution time are tested on monocomponent and multicomponent highly nonstationary noisy signals.
INTRODUCTION
Conventional time-invariant or frequency-invariant analysis and processing of one-dimensional (1D) and two-dimensional (2D) nonstationary signals cannot produce satisfactory results, because these signals are usually spread across a wide range of frequencies. More efficient processing of these signals requires time-varying or space-varying approaches, which can be defined by using the common time-frequency (TF) domain tools based on TF distributions (TFDs) in the 1D signal case, or by using the space/spatial-frequency (S/SF) domain tools based on S/SF distributions (S/SFDs) in the 2D signal case.
Several time-varying and space-varying filtering solutions have been proposed so far and are usually referred to as classical ones. However, classical filtering solutions exhibit drawbacks that seriously limit their applicability [1]. The Rihaczek distribution-based (Zadeh) filter cannot be used for nonstationary signals, the STFT and the Gabor filters have limited resolution, whereas the Weyl filter is essentially restricted to half-band signals. In order to suppress the noted flaws, extended versions of the classical nonstationary filters have recently been defined [2]-[5].
All TFD-related and S/SFD-related nonstationary filters are numerically quite complex and require significant calculation time, which makes them unsuitable for real-time analysis in most practical applications. Hardware implementations, when possible, can overcome these problems, enabling the application of these 1D and 2D filters to numerous additional practical problems.
In [6], [7], a hardware design of an optimal 2D cross-terms-free Wigner distribution (2D CTFWD)-based filter, suitable for implementation on an integrated chip and capable of performing real-time estimation of highly nonstationary 2D FM noisy signals, was developed. In this paper, we improve the hardware design of the optimal 2D CTFWD-based filter from [6], [7] by applying the pipelining technique. In this way, we significantly reduce the execution time required by the existing design and substantially increase its processing speed.
THEORY BACKGROUND
The optimal nonstationary 2D filtering definition, obtained by extending the corresponding 1D filtering definition [1]-[3], [8]-[11], related to the 2D Wigner distribution (2D WD) [12], and used to overcome distortion of the filtered 2D FM signal [9], [12], is:
where $\vec{n} = (n_1, n_2)$. Following the procedure for the stationary Wiener filter design [13], the optimal (Wiener) nonstationary 2D filter, in the case of a signal not correlated with the additive noise, is defined by:
where [10] , [12] . Then, the optimal filtering problem of nonstationary 2D FM signals can be reduced to the LF estimation in a noisy environment.
In practice, the LF estimation has to be performed based on a single realization of noisy signals. In the S/SF analysis framework, this is done by determining the frequency-frequency (FF) points $\vec{k}_i$, $i = 1, \ldots, q$,
where the S/SFD of the noisy signal has a local maximum [14], [15]. The 2D CTFWD [16] preserves the optimal auto-terms representation of the 2D Wigner distribution, but without the presence of cross-terms. It additionally minimizes calculation complexity [17] and optimizes the LF estimation characteristics in comparison to the commonly used 2D S/SFDs [14], [15], [17]. Besides, the 2D CTFWD is defined based on the same 2D STFT elements used in the filtering definition (1) and has already been implemented in real time [17]. Therefore, it can be used as an optimal base for the development of an optimal nonstationary 2D filter.
PIPELINED IMPLEMENTATION
An optimal nonstationary 2D filter (1), based on the 2D CTFWD hardware implementation from [17] and on the real-time 2D CTFWD-based LF estimation, implemented through the sliding matrix operation, has been considered in [6], [7]. Here, the implementation from [6], [7] is improved by applying the pipelining technique, which allows the implemented filter to reduce the time required for execution and, consequently, to increase the processing speed. The pipelined hardware implementation of the 2D CTFWD-based optimal filter, following definition (1) in principle, is presented in Fig. 1. It consists of several main functional units. The functions of these units are described through the complete system operation, as follows.
Input 2D STFT (STFT_IN) elements, produced externally in 2D STFT or 2D FFT modules [13], [18], are imported to the input memory with a period of two clock cycles (CLKs). This period represents the minimal execution time per S/SF point $(\vec{n}, \vec{k})$, determined by the signal adaptive 2D CTFWD calculation, i.e. by the STFTLoad/CTFWDStore signal generation in the STFT-to-CTFWD gateway [19]. On each STFTLoad/CTFWDStore cycle, 2D STFT elements are moved to the convolution window operation file, which is used to provide movement through the 2D STFT elements and to create the address order of the STFT-to-CTFWD gateway's inputs, as shown in principle in Fig. 1(b).
In the STFT-to-CTFWD gateway [19], 2D CTFWD elements are produced based on the 2D STFT input elements and the 2D STFT-to-2D CTFWD relationship [16], [17]. On each STFTLoad/CTFWDStore cycle, the calculated 2D CTFWD elements are loaded to the FRS detection module, implemented to estimate the FRS $L_H(\vec{n}, \vec{k})$, to provide movement through the CTFWD samples, and to participate in the implementation of the real-time LF estimation procedure, which follows definition (3), as described in detail in [6], [7]. This procedure tests for the existence of an LF in the FF point $(k_1, k_2)$, for the observed signal point $(n_1, n_2)$, based on the frequency-only-dependent 2D CTFWD elements symmetrically distributed around the observed point $(k_1, k_2)$. These elements (see (3)) are held in the sliding matrix register block area, sized (2L+1)×(2L+1) locations. In line with the LF estimation procedure from [6], [7], the LF is detected in the FF point $(k_1, k_2)$ corresponding to the maximum element of the sliding matrix register block area, but only if this element is (i) the central element of the sliding matrix register block, (ii) greater than the spectral floor R, and if the size of the sliding register block area satisfies:
where $A_i$, $i = 1, \ldots, q$, are the different widths of the non-overlapping 2D CTFWD auto-terms.
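The detection rule described above (maximum of the sliding matrix, located at its centre, and above the spectral floor R) can be illustrated with a short software sketch. This is only a minimal model under stated assumptions, not the hardware design from [6], [7]: the function name `detect_local_frequencies`, the array layout, and the decision to skip border points are hypothetical choices made for illustration, and the window-size condition on L is assumed to be satisfied by the caller rather than checked.

```python
import numpy as np


def detect_local_frequencies(ctfwd, L, R):
    """Mark FF points where an LF is detected (hypothetical helper).

    ctfwd : 2D array of frequency-only-dependent CTFWD values for one
            signal point (rows index k1, columns index k2).
    L     : half-size of the (2L+1) x (2L+1) sliding matrix.
    R     : spectral floor.
    """
    n_rows, n_cols = ctfwd.shape
    detected = np.zeros((n_rows, n_cols), dtype=np.uint8)
    # Assumption: points closer than L to the border are skipped,
    # because the full sliding matrix does not fit around them.
    for k1 in range(L, n_rows - L):
        for k2 in range(L, n_cols - L):
            window = ctfwd[k1 - L:k1 + L + 1, k2 - L:k2 + L + 1]
            centre = ctfwd[k1, k2]
            # (i) the centre is the maximum of the window and
            # (ii) it exceeds the spectral floor R.
            # Note: on a flat plateau every tied point is marked.
            if centre > R and centre == window.max():
                detected[k1, k2] = 1
    return detected


if __name__ == "__main__":
    demo = np.zeros((9, 9))
    demo[2, 2] = 5.0   # strong peak, above the floor
    demo[6, 5] = 3.0   # second peak, above the floor
    demo[4, 4] = 0.5   # local maximum, but below the floor
    print(detect_local_frequencies(demo, L=1, R=1.0))
```

In this sketch the returned 0/1 map plays the role of the detected support for one signal point; the hardware performs the same test once per sliding step instead of recomputing each window from scratch.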
Fig. 1. Hardware implementation of (a) the 2D nonstationary Wiener filter, (b) the FRS detection module.
The sliding matrix register block area creates the address order of the COMP block's inputs. The COMP block tests the conditions of the LF estimation procedure and, if they are satisfied, detects an FRS in the S/SF point $(\vec{n}, \vec{k})$, determined by $L_H(\vec{n}, \vec{k}) = 1$. … locations, is used to hold the imported 2D STFT data, so that its output corresponds, in frequencies, to the central element of the sliding matrix register block area.
After the execution in the FF point $(k_1, k_2)$, this procedure is repeated for the next FF point $(k_1, k_2 + 1)$ from the same signal point $(n_1, n_2)$. The FIFO delays from the FRS detection module provide real-time sliding of the matrix area over the frequency-only-dependent 2D CTFWD elements and, therefore, real-time LF/FRS detection. For each signal point $(n_1, n_2)$, the real-time LF/FRS detection is performed by sliding the matrix area over all frequency-only-dependent input 2D CTFWD elements and by computing the LF/FRS according to the input elements and the described estimation procedure.
(a) Non-pipelined implementation [6], [7], (b) Pipelined implementation.
A multiple-clock-cycle implementation (MCI) of the optimal S/SF filter, presented in Fig. 1, has been designed. Per S/SF point, the proposed design performs noisy signal estimation in a signal-dependent number of $L(\vec{n}, \vec{k}) + 2$ CLKs, where each CLK corresponds to the related step in the filtering execution. In the first $L(\vec{n}, \vec{k}) + 1$ CLKs (the 0-th, 1-st, 2-nd, …, $L(\vec{n}, \vec{k})$-th one), the 2D CTFWD element is calculated in the STFT-to-CTFWD gateway. The $(L(\vec{n}, \vec{k}) + 1)$-st step/CLK is used for the nonstationary signal estimation. However, only two of these steps (the 0-th, SPEC execution step and the $(L(\vec{n}, \vec{k}) + 1)$-st, estimation step) are unconditional. They provide the 2D SPEC-based estimation. The remaining steps are conditional and depend on the estimated signal shape. They are used to improve the LF estimation quality up to the 2D CTFWD-based one and are taken only in S/SF points from the 2D STFT auto-terms' domains, which is determined by the signal adaptive period of the STFTLoad/CTFWDStore cycle.
The estimation in a signal point $(n_1, n_2)$ is completed by storing the final value of $(Hx)(\vec{n})$ in the OutRegister in the completion CLK. However, this can be done only after performing the described execution in each FF point of the observed signal point $(n_1, n_2)$. It means that the final value of $(Hx)(\vec{n})$ is obtained after performing the execution in the maximum FF point, i.e. when $CTFWD(\vec{n}, N/2, N/2)$ becomes the central element of the sliding matrix register block area. Therefore, the completion step from the observed signal point $(n_1, n_2)$ is overlapped in execution with the 2D SPEC execution step from the next S/SF point. In this way, the execution time of the implemented design is improved by one CLK per signal point. The design proposed here additionally allows overlapping in execution of the unconditional steps between the neighboring FF points $(k_1, k_2)$ and $(k_1, k_2 + 1)$, $k_1, k_2 = -N/2 + 1, \ldots, N/2$, as shown in Fig. 2. The remaining steps cannot be included in the overlapping, because they are conditional and need not be executed. The described pipelined procedure is controlled by the signals STFTLoad/CTFWDStore, Gateway_CLK, Gateway_RESET, CumADD_CLK, and CumADD_RESET. In an FF point $(k_1, k_2)$, they control the calculation in the STFT-to-CTFWD gateway and the summation in the output CumADD, as well as the filtering completion in a signal point $(n_1, n_2)$, as shown in the timing diagram given in Fig. 3.
By applying the described overlapping in execution, the proposed pipelined design improves the throughput of the implementation (the total amount of work done in a given time interval) by one CLK per FF point $(k_1, k_2)$, $k_1, k_2 = -N/2 + 1, \ldots, N/2$. This means that an improvement ranging from about 5% (depending on the convolution window register block area size $L_m$) in FF points around the LFs, up to 50% in FF points outside the 2D STFT auto-terms' domains, can be achieved by the design. Depending on the estimated signal shape, this can be a significant improvement over the partly pipelined design from [6], [7], because, in real situations, each signal point $(n_1, n_2)$ contains N×N FF points.
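The quoted range of improvements can be checked with simple arithmetic. The sketch below assumes that a non-pipelined FF point takes L + 2 CLKs (L conditional steps plus the two unconditional ones) and that pipelining hides exactly one CLK per FF point, ignoring the fill and drain of the pipeline at the ends of a sequence. The function names and the example list of L values are hypothetical and serve only as an illustration.

```python
def clks_non_pipelined(L):
    """CLKs per FF point without overlapping: L conditional steps + 2 unconditional."""
    return L + 2


def clks_pipelined(L):
    """CLKs per FF point when one unconditional step overlaps with the neighbor."""
    return L + 1


def improvement_percent(L):
    """Relative saving for a single FF point with L conditional steps."""
    saved = clks_non_pipelined(L) - clks_pipelined(L)
    return 100.0 * saved / clks_non_pipelined(L)


def total_improvement_percent(L_values):
    """Relative saving over a sequence of FF points (pipeline ends ignored)."""
    before = sum(clks_non_pipelined(L) for L in L_values)
    after = sum(clks_pipelined(L) for L in L_values)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    print(improvement_percent(0))    # point outside the auto-terms' domains
    print(improvement_percent(18))   # point with 18 conditional steps
    # Hypothetical mix: mostly points outside the auto-terms' domains.
    print(total_improvement_percent([0, 0, 0, 18]))
```

Under these assumptions a point with no conditional steps gives 50%, and a point with 18 conditional steps gives 5%, which matches the two ends of the range stated above. The overall figure for a signal point depends on how the N×N FF points are distributed between the two kinds.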
IMPLEMENTATION OF THE SLIDING MATRIX AREA FUNCTION
The basic part of the architecture shown in Fig. 1 is the hardware implementation of the sliding matrix area. Essentially, it is used twice in our design: the first time to provide movement through the 2D STFT samples and to create the address order of the STFT-to-CTFWD gateway's inputs, and the second time to provide movement through the 2D CTFWD elements, to create the address order of the COMP block's inputs, and to generate the base for the FRS estimation in the observed S/SF point. Other parts of the architecture from Fig. 1 are already implemented (the STFT-to-CTFWD gateway in [19]) or are well known from the literature (such as the STFT module, comparators, cumulative adders, etc.). This is the main reason why we test here only the implementation of the FRS detection module, as well as its function, through simulation in the ModelSim Altera 6.3g_p1 software.
Registers from the sliding matrix register block area contain the 2D CTFWD elements calculated in the STFT-to-CTFWD gateway. A 2D CTFWD input element (see Fig. 1(b)) is loaded in the first register of the sliding matrix register block area. At the same time, each element of the sliding matrix register block area row $k_1 + L$ (as well as of rows $k_1 + L - 1, \ldots, k_1 - L$) is shifted right, generating the 2D CTFWD elements at indices $(k_2 + L - 1, \ldots, k_2 - L)$. The 2L FIFO delays are used to generate the 2D CTFWD elements of the sliding matrix register block area column $k_2 + L$.
It means that, for each described step, a new 2D CTFWD element is shifted into the sliding matrix register block area, sliding the actual matrix position over the frequency-only-dependent 2D CTFWD elements one step to the right.
The results obtained by the proposed design for different input SNRs and different sliding matrix register block area sizes (2L+1)×(2L+1) are given in Tables 1 and 2 for the considered monocomponent and multicomponent signals (5) and (6), respectively. They demonstrate the robustness of the proposed design with respect to the sliding matrix register block area size (2L+1)×(2L+1). This conclusion follows from the wide FF range (4) obtained in the case of non-overlapping 2D FM signal components that are highly concentrated in the S/SF space. The distribution of CLKs taken by the proposed design per FF point, for the considered signal (6) in the signal point (0.25, 0.25), is shown in Fig. 8(a). It is compared to the distribution of CLKs taken by the non-pipelined design from [6], [7], shown in Fig. 8(b). The improvement in execution time achieved by the pipelined design can easily be noted. For the observed 2D test signal (6), the improvement over the corresponding non-pipelined design amounts to 33.08%.
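The sliding matrix register block with its FIFO delays, described in this section, behaves like a line-buffered streaming window: one new element enters per step and the whole matrix position advances by one. The following sketch models that behavior in software under explicit assumptions. The rows of registers and the 2L FIFO delays are collapsed into a single shift chain, the elements are assumed to arrive in raster order (row by row), and each FIFO delay is therefore assumed to span the N - (2L+1) elements between consecutive window rows. The name `stream_windows` is hypothetical, and the sketch does not reproduce the timing or control signals of the actual design.

```python
from collections import deque


def stream_windows(grid, L):
    """Yield ((k1, k2), window) for every full (2L+1) x (2L+1) window.

    grid is a list of equally long rows, consumed one element per step,
    as if it were streamed into the register block in raster order.
    """
    N = len(grid[0])                 # elements per row of the stream
    size = 2 * L + 1
    # Assumption: (2L+1) register rows and 2L FIFO delays form one chain
    # of 2L*N + (2L+1) storage locations; taps are N positions apart.
    chain = deque(maxlen=2 * L * N + size)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            chain.appendleft(value)  # one shift per incoming element
            if r >= 2 * L and c >= 2 * L:
                # Position p of the chain holds the element that arrived
                # p steps ago, so window rows are tapped N positions apart.
                window = [
                    [chain[(2 * L - a) * N + (2 * L - b)] for b in range(size)]
                    for a in range(size)
                ]
                yield (r - L, c - L), window   # centre of the window


if __name__ == "__main__":
    demo = [[10 * r + c for c in range(5)] for r in range(4)]
    for centre, window in stream_windows(demo, L=1):
        print(centre, window)
```

Each yielded window equals the block of `grid` centred on the reported point, so the detection test can be applied to it directly. The point of the single-chain model is that no element is ever re-read from memory: every window is available from fixed taps after one shift.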
COMPARATIVE ANALYSIS
The proposed pipelined design follows the same 2D CTFWD-based S/SF filter implementation principles as the 2D CTFWD signal adaptive design from [17] and the non-pipelined filter designs from [6], [7], [20]. Therefore, the proposed design retains the desirable characteristics of the signal adaptive MCI designs from [6], [7], [17], [20] regarding calculation and implementation complexity, as well as execution time. Besides, by applying the pipelining technique, the proposed design additionally improves the S/SF filtering execution time (by approximately 20-35%, depending on the estimated signal shape), outperforming the corresponding S/SFD-based filters in almost all critical design performances. Further, the design enables high-quality S/SF filtering in real time, based on the highest-quality, signal adaptive 2D CTFWD-related LF estimation. Designs based on non-adaptive algorithms, as well as stationary 2D filters, cannot produce results of such high quality [1], [2], [9], [10], [21].
CONCLUSION
A multiple-clock-cycle signal adaptive implementation of the optimal S/SF filter has been presented. By applying the pipelining technique, the proposed implementation enables overlapping in execution of the unconditional cycles between neighboring S/SF points. In this way, the proposed design significantly enhances the critical design performances related to the time required for execution.
