Real-time cyclic spectral analysis is useful in many applications, but is difficult to achieve because of its computational complexity. This paper studies the distribution of complex multipliers in multiprocessor cyclic spectrum analyzers, with the objective of obtaining computational balance. Computationally balanced implementations efficiently use hardware so that computational bottlenecks are reduced and a smooth flow of data between computational sections of the analyzer is maintained. Tables are presented that give the number of complex multipliers required in each section of the analyzer to obtain computational balance.
INTRODUCTION
Cyclic spectral analysis is a useful signal analysis tool in a variety of applications [1]. In applications such as signal detection, signal recognition and parameter estimation, estimation of the cyclic spectrum is often a fundamental component of the algorithm. Unfortunately, the computational complexity involved in cyclic spectral analysis often prohibits its use in many real-time applications. Generally speaking, cyclic spectral analysis is far more computationally complex than conventional spectral analysis, and real-time cyclic spectral analysis is far more difficult than real-time spectral analysis [2].
The key to real-time cyclic spectral analysis is parallel processing. The parallel structure of several computationally efficient algorithms has been determined [2]. Still, many implementation issues such as processor memory requirements, interprocessor communications, processor allocation and computational balance are unresolved. Furthermore, while signal flow graph descriptions of the algorithms are useful in designing multiprocessor architectures, direct implementation of the graphs leads to redundancy. Efficient implementations can be obtained by computationally balancing the multiprocessor architectures suggested by the graphs. By computational balance, we mean distributing the computational resources in such a way as to maintain a smooth flow of data in the implementation. Analysis of the computational balance leads to an efficient distribution of arithmetic units and provides a means of analyzing implementation trade-offs.
Computational balance is achieved when each section of an implementation takes approximately the same amount of time to complete its computations. Data flows smoothly in a balanced implementation, and computational bottlenecks are minimized. In order to determine the number of arithmetic units needed to obtain computational balance, a performance metric is required. A useful metric for quantifying the notion of real-time performance is the factor-of-realtime, $FT$ [3], where $N_{ao}$ is the number of arithmetic operations to be performed, $N_{au}$ is the number of arithmetic units available to perform the operations, $T_c$ is the time interval between operand inputs to the pipelined arithmetic units, and $T_l$ is the latency or overall delay of the arithmetic unit. The latency term, $T_l$, of (2) is normally very small compared to the first term and is ignored in the rest of the paper. Address computations are ignored in this analysis, and it is assumed that memory and processors are connected so that operands are available to the processors every $T_c$ seconds.
In the cyclic spectral analysis algorithms discussed in this paper, the time required to collect data for a computation is $\Delta t = NT_s$, where $T_s$ is the sampling interval of the input sequences. Since we are interested in determining $N_{au}$ for a given $FT$ instead of finding $FT$ for a given $N_{au}$, we can drop the ceiling function in (2). Combining (1) and (2), and substituting $T_{cor} = NT_s$, we arrive at
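As a sketch of how a balance relation of this kind can be obtained, the derivation below assumes particular forms for the two definitions, since only their verbal descriptions are given here. It assumes that $FT$ is the ratio of computation time to data-collection time, and that the computation time (written $T_{cpt}$, a hypothetical symbol) is the number of operations per arithmetic unit times $T_c$. The symbols $N_{ao}$ (arithmetic operations), $N_{au}$ (arithmetic units), $T_c$, $N$, $T_s$ and $T_{cor} = NT_s$ follow the text.

```latex
\begin{align*}
FT &= \frac{T_{cpt}}{T_{cor}}, &
T_{cpt} &\approx \frac{N_{ao}}{N_{au}}\,T_c
  && \text{(assumed forms; ceiling and latency dropped)} \\
\Rightarrow\quad
N_{au} &= \frac{N_{ao}\,T_c}{FT \cdot N\,T_s}
  && \text{(using } T_{cor} = N T_s \text{)}
\end{align*}
```

Under these assumptions, setting $FT = 1$ and $T_c = T_s$ gives $N_{au} = N_{ao}/N$, that is, one arithmetic unit for every operation that must be performed per input sample.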
In cyclic spectrum analyzers, it is useful to balance the number of complex multipliers required for a particu-
In this paper, we analyze the computational balance of implementations of the FFT Accumulation Method (FAM) and the Strip Spectral Correlation Algorithm (SSCA). These computationally efficient algorithms have been described in a number of papers [2], [3], [4]. In particular, we consider application of these algorithms to two problems of interest: estimating the cyclic cross spectrum of two complex-valued signals and estimating the cyclic auto spectrum of a single real-valued signal. For brevity, we refer to the first problem as a Type C computation, and the second problem as a Type A computation. The resulting analysis describes the distribution of complex multipliers in parallel implementations and indicates how
COMPUTATIONAL BALANCE IN THE FAM
Before describing computational balance in the FFT Accumulation Method (FAM), we review the computational aspects of the algorithm. The first step in the FAM is to compute the complex demodulates of the input sequences. This computation is performed by an $N'$-channel channelizer and consists of multiplying a block of input data by a data-tapering window, Fourier transforming the block of data, and then frequency shifting the output streams of the Fourier transformer to baseband. In the FAM, the input samples are grouped into blocks of $N'$ samples, multiplied by a data-tapering window and Fourier transformed by an $N'$-point FFT. The output sequences from each channel of the FFT are then frequency shifted to baseband by multiplying the output streams by the appropriate complex exponential. To improve the efficiency of the computation, input samples are shifted into the channelizer $L$ samples at a time. Typically, the decimation factor is set to $L = N'/4$. In all, a total of $N'$ complex demodulate sequences are computed, and each sequence contains $P = N/L$ samples.
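The channelizer steps described above (taper, $N'$-point FFT, frequency shift to baseband, hop of $L$ samples) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation analyzed in the paper: the function name `channelize`, the Hamming window, and the choice to discard the incomplete trailing blocks (so that fewer than $P = N/L$ rows are produced) are all assumptions.

```python
import numpy as np

def channelize(x, n_prime, hop):
    """Complex demodulates of x: one row per block, one column per channel."""
    window = np.hamming(n_prime)                 # assumed data-tapering window
    num_blocks = (len(x) - n_prime) // hop + 1   # full blocks only (assumption)
    k = np.arange(n_prime)                       # channel (FFT bin) indices
    demods = np.empty((num_blocks, n_prime), dtype=complex)
    for m in range(num_blocks):
        start = m * hop                          # blocks advance by L = hop samples
        block = x[start:start + n_prime] * window
        spectrum = np.fft.fft(block)             # N'-point FFT
        # Shift each channel to baseband: remove the phase advance
        # that channel k accumulates over `start` samples.
        demods[m] = spectrum * np.exp(-2j * np.pi * k * start / n_prime)
    return demods

# Example: a complex tone centered on channel 3 with N' = 16 and L = N'/4.
n_prime, hop = 16, 4
n = np.arange(64)
x = np.exp(2j * np.pi * 3 * n / n_prime)
demods = channelize(x, n_prime, hop)
```

For this test tone, the demodulate in channel 3 is constant from block to block, which is what shifting the channel output to baseband is meant to achieve.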
After channelization, the complex demodulate sequences are cross multiplied to form product sequences, and each product sequence is Fourier transformed by a $P$-point FFT. For a Type C computation, $N'^2$ $P$-point FFTs are performed, and for a Type A computation, $N'^2/4$ $P$-point FFTs are performed. The computational complexity of the FAM, in terms of complex multiplications, is given in Table 1 for Type C and Type A computations. For reference, the frequency resolution of the FAM is $\Delta f \approx f_s/N'$, where $f_s = 1/T_s$, and the cycle frequency resolution is $\Delta\alpha = f_s/N$. The time-frequency resolution product of the FAM, a measure of the statistical reliability of the estimates, is $\Delta t\,\Delta f = PL/N'$.
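The cross multiplication and $P$-point FFT stage for a Type C computation can be sketched as follows. The function name `fam_smoothing` is hypothetical, the inputs are random stand-ins for channelizer outputs, and conjugating the second factor is an assumed (though common) convention; normalization and selection of the useful FFT bins are omitted.

```python
import numpy as np

def fam_smoothing(xd, yd):
    """Form all product sequences and FFT each one along the time axis.

    xd, yd: complex demodulates of shape (P, N').
    Returns an array of shape (P, N', N'): one P-point FFT per channel pair.
    """
    # products[n, k, l] = xd[n, k] * conj(yd[n, l])  (conjugation is assumed)
    products = xd[:, :, None] * np.conj(yd[:, None, :])
    return np.fft.fft(products, axis=0)

rng = np.random.default_rng(0)
P, n_prime = 16, 8
xd = rng.standard_normal((P, n_prime)) + 1j * rng.standard_normal((P, n_prime))
yd = rng.standard_normal((P, n_prime)) + 1j * rng.standard_normal((P, n_prime))

estimate = fam_smoothing(xd, yd)
num_ffts = estimate.shape[1] * estimate.shape[2]   # N'^2 P-point FFTs
```

Each of the $N'^2$ channel pairs yields one product sequence of $P$ samples, so this stage costs $N'^2 P$ complex multiplications before the FFTs are counted.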
Using Table 1 
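Table 1. Computational complexity of the FAM in complex multiplications (surviving entries).
$N'$-point FFT: $PN'\log_2 N'$ (Type C), $P(N'/2)\log_2 N'$ (Type A)
Other rows: Data Tapering, Frequency Shift

Table 2. Number of complex multipliers, $N_{CM}$, for the FAM (surviving entries).
Product sequences: $4N'$ (Type C), $N'$ (Type A)
$P$-point FFTs: $2N'\log_2 P$ (Type C), $(N'/2)\log_2 P$ (Type A)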
Multiplications for the initial data tapering are not included in this expression, since tapering calls for $N'P$ real multiplications. Simplifying (7) by substituting $P = N/L$ and $L = N'/4$, substituting the resulting equation into (3), and interpreting $N_{au}$ as $N_{CM}$ yields
The interpretation of equations (6) and (8) is summarized in Table 2. The table specifies the number of complex multipliers required to estimate the cyclic spectrum in real time ($FT = 1$) for $T_c = T_s$ with the FAM. Although Table 2 indicates that no complex multipliers are required to taper the input sequence for a Type A computation, it is easily shown that four real multipliers are required for computational balance: with $L = N'/4$, tapering takes $N'P = 4N$ real multiplications in the time needed to collect $N$ samples.
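The kind of per-section count that Table 2 summarizes can be reproduced with a short calculation. The sketch below is illustrative only: the function name `fam_multipliers`, the dictionary keys, and the assumed balance relation $N_{CM} = N_{ao} T_c / (FT \cdot N T_s)$ are not taken from the paper, and the per-section operation counts are assumptions (two channelizers for Type C and one for Type A, $(N'/2)\log_2 N'$ multiplications per $N'$-point FFT, one multiplication per channel output for the frequency shift).

```python
from math import log2

def fam_multipliers(n_prime, n, kind="C", ft=1.0, tc_over_ts=1.0):
    """Complex multipliers per FAM section under the assumed balance relation."""
    hop = n_prime // 4                       # L = N'/4
    p = n // hop                             # P = N/L
    channelizers = 2 if kind == "C" else 1   # assumption
    pairs = n_prime**2 if kind == "C" else n_prime**2 // 4
    ops = {
        "channelizer_fft": channelizers * p * (n_prime / 2) * log2(n_prime),
        "frequency_shift": channelizers * p * n_prime,
        "product_sequences": pairs * p,
        "smoothing_fft": pairs * (p / 2) * log2(p),
    }
    scale = tc_over_ts / (ft * n)            # N_CM = N_ao * T_c / (FT * N * T_s)
    return {section: count * scale for section, count in ops.items()}

type_c = fam_multipliers(n_prime=32, n=4096, kind="C")
type_a = fam_multipliers(n_prime=32, n=4096, kind="A")
```

With $N' = 32$ and $N = 4096$, the Type C product-sequence section comes out at 128 multipliers, which agrees with the $4N'$ figure quoted in the text for the balanced implementation.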
It may seem odd that a constant number of multipliers is used in the data-tapering and frequency-translation sections. In those calculations, the number of operations and the amount of time allotted to perform the computations scale proportionately, so the number of multipliers is constant. Table 2 can be used to illustrate the redundancy found in implementations based on the signal flow graphs presented in [2]. For example, implementations derived from the signal flow graph for a Type C computation require $(N')^2$ complex multipliers for the product sequence computation, whereas the balanced implementation requires only $4N'$. The reason for this discrepancy is that the signal flow graph of the algorithm indicates the connectivity of the data, while the computational balance accounts for the timing of data movement. Table 2 also indicates how implementation trade-offs can be made without affecting the overall performance of the implementation. For example, consider the product sequence calculation of a Type C computation. By replacing the multipliers in that part of the implementation with multipliers that run four times faster ($T_c^* = T_c/4$), the number of multipliers used in that section can be reduced by a factor of four.
COMPUTATIONAL BALANCE IN THE SSCA
The Strip Spectral Correlation Algorithm (SSCA) is computationally similar to the FAM, except that only one channelizer is required and the product sequences (and therefore the smoothing FFTs) are larger. Using a simple interpolation scheme, the same channelizer that is used in the FAM can be used in the SSCA [4]. The computational complexity of the SSCA is summarized in Table 3. For reference, the frequency resolution of the SSCA is $\Delta f = f_s/N'$, the cycle frequency resolution is $\Delta\alpha = f_s/N$, and the time-frequency resolution product is $\Delta t\,\Delta f = N/N'$.

Table 3. Computational complexity of the SSCA in complex multiplications (surviving entries).
$N$-point FFTs: $N'(N/2)\log_2 N$ (Type C), $(N'/2)(N/2)\log_2 N$ (Type A)

From Table 3, the number of complex multiplications for a Type C computation is

The distribution of complex multipliers for this architecture is given in Table 4 for $T_c = T_s$ and $FT = 1$.
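The strip computation that distinguishes the SSCA from the FAM (channelizer outputs multiplied by the full-rate second signal, then one $N$-point FFT per channel) can be sketched as follows. The function name `ssca_strips` is hypothetical, and sample-and-hold is used here only as one possible "simple interpolation scheme"; the scheme actually used in [4] is not specified in this text.

```python
import numpy as np

def ssca_strips(demods, y, hop):
    """One N-point FFT per channel of the interpolated demodulates times conj(y).

    demods: complex demodulates of shape (P, N'), decimated by `hop`.
    y:      full-rate signal with at least P * hop samples.
    """
    # Bring the demodulates back to the input rate by holding each value
    # for `hop` samples (assumed interpolation).
    held = np.repeat(demods, hop, axis=0)            # shape (P * hop, N')
    products = held * np.conj(y[:held.shape[0], None])
    return np.fft.fft(products, axis=0)              # N-point FFT per strip

# Toy check: constant demodulates and a complex tone in FFT bin q.
P, n_prime, hop = 8, 4, 2
N = P * hop
q = 3
demods = np.ones((P, n_prime), dtype=complex)
y = np.exp(2j * np.pi * q * np.arange(N) / N)
strips = ssca_strips(demods, y, hop)
```

Because each of the $N'$ strips is transformed by an $N$-point FFT rather than a $P$-point FFT, the smoothing FFTs are larger than in the FAM, as the text notes.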
Although Table 4 indicates that no complex multipliers are required to taper the input sequence, it is easily shown that four real multipliers are necessary for computational balance. The comments at the end of Section 2 regarding computational balance in the FAM also apply to computational balance in the SSCA.
SUMMARY
Parallel architectures are a key element of real-time cyclic spectral analysis. Using the tables presented here, together with the signal flow graphs in [2], efficient cyclic spectrum analyzers can be designed. The computational balancing technique illustrated here can be useful in analyzing design alternatives as well as determining performance trade-offs.
