Abstract. For non-stationary processes, the conventional power spectrum does not reflect the time variation of the process characteristics. Higher-order cumulants or bispectrum has been applied in the analysis of nonminimum phase linear, time-invariant (LTI) systems under stationarity assumptions and restricting the signal to have non-symmetric distribution. It is clearly of interest to examine what type of analysis is available for those cases where the assumption of stationarity becomes untrue or unrealistic. In many practical applications such as seismic surveying, detection, system identification, sonar, communications, speech, and medical instrumentation, the signal analysts are faced with the task of processing signals whose spectral characteristics are time-dependent [1] and [2] . In this paper, a simple hardware architecture for computing the time-varying cumulants of non-stationary signals is proposed using systolic arrays.
Introduction
Due to a large number of applications in diverse fields, the problem of designing hardware implementations to deal with non-stationary signals is of utmost important. A common and well-known procedure for dealing with stationary stochastic processes is based on the application of the power spectrum. Although this procedure has a wide range of applications, it has certain limitations. The fact that the power spectrum carries only the magnitude information and no phase information implies that minimum-phase (or maximum-phase) characteristics had to be assumed a-priori. Since non-minimum phase systems are systems that have zeros inside and outside the unit circle, the analysis of non-minimum phase LTI systems cannot be done using the power spectral analysis. Furthermore, for non-stationary processes (processes which have at least one timevarying parameter), the conventional power spectrum does not reflect the time variation of the process characteristics. Higher-order cumulants or bispectrum has been applied in the analysis of non-minimum phase LTI systems under stationarity assumptions and restricting the signal to have non-symmetric distribution [3] , [4] , [5] .
System analysis when the input/output are non-stationary requires new techniques and a great deal of work is needed in this area. It is clearly of interest to examine what type of analysis is available for those cases where the assumption of stationarity becomes untrue or unrealistic. In many practical applications such as seismic surveying, detection, system identification, sonar, communications, speech, and medical instrumentation, the signal analysts [1] , [2] are faced with the task of processing signals whose spectral characteristics are time-dependent. Therefore, non-stationary signals need special mathematical treatment when estimating their spectrum or when identifying systems with non-stationary input. An accurate spectral analysis of these signals cannot be accomplished by the simple use of classical time-domain or frequency-domain representation. To deal with time-dependent spectrum, the concept of time-frequency distributions has been introduced. These methods represent an attempt to provide a general solution to the problem of representing non-stationary signals. In order to develop a useful theory we need to replace stationarity by a more general notion which still allows us to carry out meaningful statistical analysis and to develop a form of timedependent spectral analysis which shares many of the features of the spectral analysis of stationary processes [6] , [7] .
Spectrum and Polyspectrum
Let x(n) be a stationary random process up to order k. Then, its k th order cumulant is defined as [3] 
where k=r 1 +r 2 +...+r n and
which is called the cumulant generating function, and E{.} is the expected value operator. The first, second, and third cumulants of x(n) as functions of its moments are given by E{x(n)} (n) c
is a zero-mean stationary process, then its first, second, and third cumulants reduce to the first, second, and third order moments of x(n), respectively. Particular cases of the (k+1)st order spectrums are the power spectrum and the bispectrum defined as follows:
Let H(z) be the transfer function of an LTI exponentially stable system, (i.e., its impulse response decays exponentially with time), whose input is an independent, identically distributed (i.i.d.) non-Gaussian process with non-zero (k+1) order cumulant β k+1 δ(m 1 ,...,m k ). The (k+1)st order cumulant of the output process is related to the system impulse response h(n) by
In the frequency domain, the (k+1)st order spectrum of the output process is related to the system transfer function H(w) by
The above two equations are of fundamental importance because they serve as the bases for most non-minimum phase system identification methods.
Most of the standard parameter estimation and linear system identification algorithms available in the literature estimate only a spectrally equivalent minimum phase system because these techniques exploit only the second-order statistics which suppress all phase information of the underlying process; thus they are incapable of identifying the non-minimum phase structure of the system. Furthermore, for nonstationary processes, the conventional power spectrum does not reflect the time variation of the process characteristics. With the recent introduction of the bispectrum in digital signal processing, a new approach to the solution of non-minimum phase system identification problem has been devised. This approach exploits the fact that the bispectrum contains information regarding both the phase and the magnitude of the system. Although the bispectrum has been applied in the identification of non-minimum phase LTI systems, it requires the assumption of stationarity and restricts the process to have non-symmetric probability density. Therefore, when the input/output of the system are non-stationary or when they have symmetric probability densities, system analysis requires a new approach. In [8] a new method based on the evolutionary spectrum was proposed to solve some of these problems. In this paper, we propose a hardware implementation for evaluating the time-varying cumulants of a non-stationary signal.
Normal Equations for Time-varying Autoregressive Moving Average Systems Using the Time-varying Cumulants
Consider the time-varying autoregressive moving average (TVARMA) equation
where the residuals e(n) are a set of mutually independent stationary random variables, identically distributed with zero-mean and unit-variance. Then multiplying both sides of equation (3) by x(n+m)x(n) and taking the expected value of both sides and assuming that E{e(n)e(n+m)e(n+k)}=γ n e δ(m,k), we have
where c x n (m,k)=E{x(n)x(n+m)x(n+k)} is the time-varying third-order cumulant of x(n). Equation (4) can be written in a matrix form as x n C a n x n c − = (5) where for k=1,2,...,p and m=q+k=k when a time-varying AR model is considered, then we have
, and
T ,where T stands for matrix transpose. The above equations constitute a set of linear equations at each instant of time n.
Estimation of the time-varying cumulants
The time-varying cumulants (TVC) can be estimated from the evolutionary bispectrum [7] , [8] as follow:
where N is the data sequence length and M is an arbitrary constant depends on signal non-stationarity. Using the evolutionary bispectrum defined as
and w n (m) is a time-varying window. From equations (8) and (9), equation (7) becomes
which is an estimate of the third-order time-varying cumulant of x(n). In the case that x(n) is stationary, equation (10) reduces to
which is an estimate of the third-order cumulant of x(n). Modifying the algorithm proposed by [9] to the time-varying signals, let C n be a symmetric matrix defined as
then, by defining the upper triangular matrix U n to be
and D n as a diagonal matrix whose elements are d n (I,I)=w n (I)x(I), I=0,1,…,N-1 then C n can be written in a matrix form as
In the following section, a hardware architecture for implementing the time-varying cumulants defined in equation (11) is proposed using systolic arrays.
Parallel Architecture for Computing the Time-frequency Spectrum
In this paper we present a linear (one dimensional) systolic system for computing TVC. We exploit the formulation of the problem, as a series of matrix multiplication operations rather than the direct application of the original definition of the problem [10, 11] . The main advantage of doing so is to be able to ''re-use'' old work on systolic and processor-array design for matrix multiplication. Systolic array design for matrix multiplication has received a significant attention from researchers during the last two decades [12 -17] . Using such well known and established work as a building block in our proposed system makes it modular, easy to understand, easy to test and easy to enhance or modify. The other approach (which we will not address in this paper) is to design a systolic array starting from the original definition of the problem using standard systematic mapping techniques e.g., [18] . However, the systematic design techniques, often, yield complex Processing Elements (PEs) designs which are difficult to understand, test or modify. Our main goal in this paper is to map the computations onto a linear systolic array. The basic PEs used are extremely simple and are very suitable for VLSI implementations. We focus on linear array implementations since they have several advantages over two dimensional arrays. Specifically, they are more suitable for VLSI implementation and require lower I/O bandwidth; usually O(1) compared to two dimensional arrays which require O(N) bandwidth where N is the number of data points. Therefore, as the problem size becomes larger, linear arrays become more attractive [19, 20] . Furthermore, linear arrays make it easier to incorporate fault tolerance capabilities into the system. In this paper we propose a linear system for computing TVC in O(N to O(N) processing elements depending on the size of PEs local storage and choice of configuration. The proposed system uses, in part, a linear array for matrix multiplication presented in [19 -22] . The rest of this section covers the proposed architecture in detail.
Linear array design for TVC
As mentioned earlier, we are dealing with a non-stationary process so the matrix C n in equation (11) has to be computed at each instances of time (i.e., for n=0, 1, ... N-1). Nevertheless we will focus first on designing a system capable of computing a single C n and later we will show how to compute all instances of C n on the same system. Equation (11) is the core of our proposed architecture. By ignoring the scaling constant, we notice that the problem is redefined as a series of matrix multiplication operations of NxN matrices. All the elements of these matrices, except for D n , could be pre-computed (for a fixed value of w). Matrix D n is a diagonal matrix and its diagonal elements can be obtained "on the fly". Therefore, what is left is to carry out the multiplication operations. By re-examining (11) one can see that there are two types of matrix multiplication operations. Basically, the first type multiplies two regular matrices together. We call it non-trivial matrix multiplication. The second type multiplies a regular matrix with a diagonal matrix, which we call trivial matrix multiplication. Trivial matrix multiplication, which is multiplying a matrix A with a diagonal matrix B, is equivalent to multiplying each row of A with the corresponding diagonal element in B. Since no addition is involved, only multipliers are needed. If the data is fed serially, then a single multiplier (denoted as S-MUL module) will suffice as shown in the complete system design later. For non-trivial matrix multiplication, a full-fledged systolic array for matrix multiplication (denoted as M-MUL module) adopted from [19] is used and incorporated into the system. The over-all design consists of two building blocks of type S-MUL and M-MUL connected together in a serial fashion to form a linear system which implements the computations of equation (11) .
The proposed architecture consists of two parts S-MUL and M-MUL that can perform the matrix multiplication of (11) as depicted in Fig. 1 . S-MUL performs the first multiplication and feeds its results to M-MUL which is basically a matrix multiplication array based on the VMF design in [19] , [22] . M-MUL multiplies 
C=Y.U n (u 11 u 12 … u 1n ) (u 21 u 22 …u 2n ) (u 31 u 32 … u 3n ) … (u n1 u n2 … u nn ) Care was taken to make sure that the output data of S-MUL is in the format required by M-MUL (i.e., Column major Row major as specified in [19] ). The matrix D n is a U , the second diagonal element x(.) by the nonzero entries of the second column, and so on. As mentioned earlier, since data is fed serially, it was surprisingly sufficient to put S-MUL as a single multiplier to multiply the two input sequences that correspond to the elements of t n U and D n matrices.
In order for the reader to understand how the whole system works, we will explain how the module M-MUL works since it is the most complex component. Let us consider the VMF systolic array presented in [19] , [22] (assuming that the parameter S equals 1, where S is the internal storage capacity of Pes [19] ). M-MUL is a linear array that multiplies two mxm matrices A and B to obtain C. The linear array is obtained by mapping a 2-dimensional array onto a 1-dimensional array. As a simple example, assume that A, B and C are 4x4 arrays. The flow of input data in the original 2-dimensional array is shown in Fig. 2(a) . Each processor performs one multiply-andaccumulate operation and passes the A and B elements to its right and bottom neighbor respectively. All the operations for computing C ij are performed by PE (I-1)m+j . This particular 2-dimensional array is mapped onto 1-dimensional (linear) array by stretching each row in row major order and eliminating vertical links of the array. Therefore, the resultant linear array will have m 2 (in this case = 4
2 ) Pes as shown in Figure 2 (b). In the resultant one-dimensional array, the data flows into the array through the left most PE. The elements of the A matrix are fed in column major while those of the B matrix are fed in row major order, which correspond to the A and B bands in Figure 2 . Each element b 1j (1≤ j≤ 4) in the B band has to meet with the element a 1,1 in the A band. Since all data is fed serially at the leftmost PE, in order for a 1,1 to meet the elements of the B band, the B band data must travel faster than the A band data. This alignment of operands is achieved using two speed data channels. The general methodology used to design a linear array for matrix multiplication given in [19] assuming that we will partition the 2-D array into m-rows is as follows:
• The 2-dimensional array is partitioned into m rows: CROW 1 , CROW 2 ,..., CROW m .
• The linear array consists of m blocks each block having m Pes. The computation performed by the PE's in block i is the computations of Pes in CROW i , (1≤ I ≤ m).
• The Pes are selectively activated to perform a step of the matrix multiplication algorithm.
• Within each block, the elements of B matrix are saved in a slow channel that will be used by the PE's in the next block. At the end of each block, the B matrix data is switched from slow to fast channels so that they can commute with the elements of A matrix within the next block. The array is built using the basic PE shown in Fig. 3 that contains only one Multiply and Acummulate (MAC) unit. Data travels across the array through chains of registers (or channels). There are three major channels for moving data: one fast channel (BF) which carries the elements of B within one block, and two slow channels AS and BS which are used to carry the elements of A and B, respectively, as shown in Figs. 3 and 4. Control is accomplished using, one bit wide control lines I, J and ACT which connect neighboring Pes. Rows of the B matrix are moved from CROW i to CROW i+1 (i.e., from block i to the block i+1 ) using the slow channel BS. Moreover, the fast channel is used to move the data (of the B matrix) within a block of Pes. At the end of each block, data in the fast channel are discarded. Fig. 3 shows hardware details inside Pes to perform the channel switching at the end of each block by using multiplexes and the control signal φ. The control signal ACT flag is used to trigger a partial product computation and the flag φ is used to control data multiplexing. If (φ=1) then multiplexer M A selects data from AS.LR and M B selects data from BS.RR; otherwise M A selects data from AS.LR and M B selects data from BS.RR. A PE is activated to perform a multiply-andaccumulate operation C ij = C ij +a ik *b kj (a ik in the AS.LR of the slow channel is multiplied with b kj in the register BF.R of the fast channel (see Fig. 3 )) when the signal ACT=1. At the end of a block, the current element of matrix A is deactivated and the next element of A is activated. Concurrently with feeding data, φ is set in PE m *i (1≤ i ≤ m) using the I and J control signals. The data are input in every clock period continuously. Also, the control input ACT is set to 1 every time a 1k (1 ≤ k ≤ m) is inserted, into the array. For more details on the operation of the array see [22] . This array performs the multiplication of two mxm matrices using m 2 PEs in 3m 2 -m-1 clock cycles [22] (assuming that S=1). Recall that in our system, m equals N (the size of the matrices is NxN). Now that we have a linear array that can compute C n for a single value on n, what is left is to replicate the process to compute all C n for all possible values of n. In general one can replicate the architecture N times and let each linear system compute a C n matrix. This will guarantee that all computations will be performed in parallel so the whole system will have an O(N 2 ) computation time which is the time needed to compute an instance of C n . On the other hand, it is also possible to compute all C n matrices (i.e. for n=0,1,..., N-1) using the single liner system showing in Figure 1 sequentially and thus no replication of hardware is needed. The tradeoff between the two options will be discussed in the next section.
Performance analysis
As mentioned earlier, M-MUL has a computation time of N 2 +2N(N/S)-(N/S) +1 clock cycles, where S is the size of PEs local storage. Therefore, the over-all computation time involved in computing one matrix C n is given by: No previous work on the same subject is available for comparison. Table 1 , however, summarizes the main features of the proposed architecture for computing one C n matrix.
As for the over-all design for computing all instances of C n , the first option discussed in the previous section (that is replicating the system for each instance of C n ) is N times faster than the second option (that is using a single linear system to compute all instances of C n sequentially) but this performance boost comes at a cost. The second option will require N times the hardware and N times the I/O requirement of the first option which could make it undesirable if the system is to be realized in VLSI. It is up to the designer to judge according to the needs and constraints of his/her particular application and constraining environment. 
Conclusion
In this paper, a new approach for linear systolic array realization of time-varying cumulants (TVCs) is proposed. The underlying matrix multiplication formulation obviously simplifies the design process considerably. A very simple architecture that uses an existing matrix multiplication systolic array as a building block has been developed. The new design has the advantage of simplicity, flexibility and the suitability for VLSI implementation. An obvious advantage of the development presented in this paper is putting the problem in a formulation that can benefit from past (and possibly future) research in systolic architectures for matrix multiplication. A specific example from the literature [19] has been adopted in this paper because of its simplicity and efficiency, but other techniques (see [12] ) can be considered as well. The decision of which approach for matrix multiplication to choose would depend on the overall design objectives and constraints.
