the consistency of the estimate when the noise has an unknown color. In Fig. 3 , the observation lasts 600 symbol periods. The SNR varies from 0 to 18 dB.
The results thus obtained are satisfactory; however, the "best" choice of a sequence f(n) is currently under investigation.
IV. CONCLUSION
When cyclostationarity is induced at the transmitter, we have shown that some second-order cyclo-spectra provide a way of identifying the unknown channel. In contrast to the conventional approaches, the proposed method remains consistent when the observation is corrupted by interference whose second-order structure is unknown. We propose in Table IV 
I. INTRODUCTION
Modern digital communications systems require efficient data coding in order to reduce transmission and/or storage costs [1], [2]. These coding systems use algorithms of one of the following types: waveform coders, transform coders, and model coders [2]. Although transform coding systems generally use the cosine transform [3], [4], the Haar transform offers certain advantages over it due to the wavelet characteristics of Haar functions [5].
Three generic VLSI architectures have been proposed in [6] for the hardware implementation of the Haar transform. Each of these has different characteristics in terms of computation time, complexity, input/output format, and pipelinability. Other important characteristics not considered in [6] are the size of the required memory, the input ordering, and the complexity of control. The selection of an architecture depends on the application and the design framework available, so that in some cases the resulting architecture is a hybrid of other, more general structures. This is the case of the direct Haar transform processor described in [7], whose architecture consists of three pipelined stages, the last of which has a sequential queue architecture.
The on-line computing of the bidimensional Haar transform can be carried out directly by implementing the two-dimensional (2-D) fast transform [8], [9] or indirectly by applying the property of separability [2] using the unidimensional (1-D) transform. The three-processor parallel-pipeline architecture described in [8] has been designed for a raster ordering of the input, minimizing the internal memory. The simplicity of the architecture proposed by Albanesi and Ferreti [9] is due to the fact that the input data ordering is not a raster ordering and that they use a less general bidimensional Haar-like transform.
This correspondence presents an inverse Haar transform processor (1-D-IFHT) programmable for transform lengths between $N = 8$ and $N = 1024$. Its architecture, which is a variant of the sequential queue architecture proposed in [6], reduces the internal memory requirement from $N$ to $\log_2 N$. Moreover, the control logic has a modular structure whose complexity increases linearly with $\log_2 N$.
Manuscript received July 25, 1996; revised August 4, 1997. This work was supported by the Spanish Government Research Commission (CICYT). The associate editor coordinating the review of this paper and approving it for publication was Dr. Konstantin Konstantides.
The authors are with Departamento de Electrónica y Computadores, Facultad de Ciencias, Santander, Spain (e-mail: grr@dyvci.unican.es; jmm@dyvci.unican.es).
Publisher Item Identifier S 1053-587X(98)00527-3.
1053-587X/98$10.00 © 1998 IEEE
A prototype chip of the 1-D-IFHT covering an area of 11.7 mm² has been implemented using standard cells and a 1-µm CMOS technology. Its maximum data rate is close to 60 MHz, which means that it can be applied in the computing of the 2-D-IFHT at video rates.
II. ON-LINE COMPUTING OF THE 1-D-IFHT
Haar functions make up an orthonormal basis used in a wide variety of signal processing applications. The normalization of Haar functions [10], which limits their values to $\{0, \pm 1\}$, makes them easier to handle from a computing point of view. Thus, the normalized Haar functions of length $N$, a power of 2, are defined as
Using these functions, the direct and inverse Haar transforms of a sequence $x_i$ of length $N$ are defined as
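The defining equations are missing from this copy. The form below is a reconstruction, not necessarily the authors' notation: it is the standard $\{0, \pm 1\}$ Haar definition, with the scaling chosen so that it reproduces the worked example of Section III. The symbols $h_k$, $p$, and $q$ are assumptions introduced here, with $k = 2^p + q$ and $0 \le q < 2^p$.

```latex
\begin{aligned}
h_0(i) &= 1, \qquad 0 \le i < N, \\
h_{2^p+q}(i) &=
\begin{cases}
+1, & q\,\dfrac{N}{2^p} \le i < \left(q+\tfrac{1}{2}\right)\dfrac{N}{2^p}, \\[4pt]
-1, & \left(q+\tfrac{1}{2}\right)\dfrac{N}{2^p} \le i < (q+1)\,\dfrac{N}{2^p}, \\[4pt]
\phantom{+}0, & \text{otherwise},
\end{cases} \\
X_k &= \sum_{i=0}^{N-1} x_i\, h_k(i), \qquad k = 0, \ldots, N-1, \\
x_i &= \frac{1}{N}\left[ X_0 + \sum_{k=1}^{N-1} 2^{\lfloor \log_2 k \rfloor}\, X_k\, h_k(i) \right], \qquad i = 0, \ldots, N-1 .
\end{aligned}
```

Under this form, the factors $2^{\lfloor \log_2 k \rfloor}$ correspond to the left shifts by 1, 2, and 4, and the factor $1/N$ to the final right shift by 3 bits, described for $N = 8$ in Section III.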
As with other transforms, increased computing efficiency is achieved by means of fast algorithms. Fig. 1 shows the computing dataflow diagram corresponding to the algorithm of the inverse fast Haar transform for $N = 16$. As can be observed in this figure, the number of addition and subtraction operations required is 30. In general, the number of additions and subtractions needed to compute the 1-D-IFHT of length $N$ is $2(N - 1)$, so that its on-line implementation can be carried out using a single adder/subtracter, as is the case of the direct transform [6], [7].
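The operation count can be checked with a minimal Python sketch of a level-by-level inverse fast Haar transform. This is an illustration only, not the dataflow of Fig. 1 itself: the function name `ifht` is hypothetical, and the placement of the scaling (a left shift per level and a final division by N) is an assumption taken from the worked example of Section III.

```python
def ifht(coeffs):
    """Inverse fast Haar transform for {0, +1, -1} normalized Haar functions.

    Returns (samples, ops), where ops counts additions and subtractions.
    Scaling is assumed to follow the worked example: coefficients at level p
    are multiplied by 2**p and the final results are divided by N.
    """
    n = len(coeffs)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of 2, at least 2")
    partial = [coeffs[0]]          # one running value per output block
    ops = 0
    level = 0
    while len(partial) < n:
        width = len(partial)       # 2**level coefficients at this level
        nxt = []
        for q in range(width):
            scaled = coeffs[width + q] << level   # scaling by shift, not counted
            nxt.append(partial[q] + scaled)       # one addition
            nxt.append(partial[q] - scaled)       # one subtraction
            ops += 2
        partial = nxt
        level += 1
    # The division is exact when the coefficients come from integer samples.
    return [v // n for v in partial], ops


samples, ops = ifht([14, 20, -5, -15, 2, 5, 5, -16])
print(samples, ops)
```

For the length-8 sequence used later in the text, the count is 14, and for any length-16 input it is 30, in agreement with $2(N - 1)$.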
The input sequence ordering conditions the size of the internal memory required. The on-line implementation of the algorithm of Fig. 1 using normal input ordering requires the intermediate data to be stored in each computing level. The minimum internal memory needed is determined by the number of intermediate data generated in the penultimate computing level, that is, $N/2$. This internal memory can be reduced by using the nonnormal input ordering produced by preorder listing [11] of the binary tree of $X_i$ coefficients shown in Fig. 2(a). It can be seen from this tree that 1) any minimum path between the root node ($X_0$) and any terminal node ($X_{N/2}, \ldots, X_{N-1}$) includes the spectral coefficients needed to generate two points of the output sequence $x_i$, and 2) the number of nodes common to two such paths coincides with the number of common intermediate results in the calculation of the respective output points. Taking $N = 16$ as an example, the path $X_0$-$X_1$-$X_2$-$X_4$-$X_8$ contains the coefficients needed to calculate $x_0$ and $x_1$, whereas the path $X_0$-$X_1$-$X_2$-$X_4$-$X_9$ contains the coefficients needed to calculate $x_2$ and $x_3$. $X_0$-$X_1$-$X_2$-$X_4$ are common coefficients, so that all intermediate results obtained in the calculation of $x_0$ and $x_1$, except for the last one, are also used in the calculation of $x_2$ and $x_3$. By producing the preorder listing, the input sequence shown in Fig. 2(b) is obtained. Fig. 3 shows the 1-D-IFHT dataflow diagram resulting from this input ordering. Generally, the on-line implementation of the 1-D-IFHT with the proposed ordering requires $(\log_2 N) - 1$ subtractions to be stored, that is, the number of computing levels minus one, and requires the last intermediate addition to be maintained as well. The internal memory size is thus reduced to $\log_2 N$. Moreover, the output sequence is generated with normal ordering, and its latency ($\log_2 N$) is minimal. Hence, we can say that the sequence in Fig. 2(b) has a minimum latency ordering.
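The preorder listing can be sketched in a few lines of Python. The tree shape used here is an assumption inferred from the paths quoted above ($X_0$ has the single child $X_1$, and $X_k$ has the children $X_{2k}$ and $X_{2k+1}$), since Fig. 2(a) is not reproduced in this copy. The function name `minimum_latency_order` is hypothetical.

```python
def minimum_latency_order(n):
    """Coefficient indices in preorder listing of the assumed coefficient tree.

    Assumed tree: X_0 -> X_1, and X_k -> (X_2k, X_2k+1) while 2k < n.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of 2, at least 2")
    order = [0]
    pending = [1]                  # explicit stack of nodes still to visit
    while pending:
        k = pending.pop()
        order.append(k)
        if 2 * k < n:              # internal node: visit left subtree first
            pending.append(2 * k + 1)
            pending.append(2 * k)
    return order


print(minimum_latency_order(8))
print(minimum_latency_order(16))
```

For $N = 8$ this gives the index order 0, 1, 2, 4, 5, 3, 6, 7, which matches the reordered sequence used in the example of Section III.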
III. PROGRAMMABLE LENGTH 1-D-IFHT PROCESSOR
The 1-D-IFHT processor has been designed for input sequences with minimum latency ordering, with a transform length programmable between N = 8 and N = 1024. The output sequence is generated with normal ordering. Its sequential queue architecture (Fig. 4) is made up of
• one adder/subtracter (A/S);
• one STACK memory of nine registers (R0 to R8) to store the subtraction output;
• one register (RSUM) to store the addition output;
• two multiplexers;
• two output registers (ROUT1 and ROUT2);
• three scaling circuits.
The external addressing signals (ADD$_0$, $\ldots$, ADD$_9$) enable the input data with minimum latency ordering to be read from an external memory. The number of addressing lines that remain active is a function of the length of the transform; for example, in the case of $N = 8$, only the first three are used, the rest remaining inactive.
The control signals have the property of being generated recursively for any $N$ from the outputs of a counter driven by the system clock. As an example, the generation of the WRITE signal is now presented. The write signals WR$_N$ are generated from the counter outputs $a_r$ ($r = 0, 1, \ldots, 9$), using the circuit shown in Fig. 5(a). This circuit is made up of an initial stage [Fig. 5(b)] and seven elemental modules [Fig. 5(c)]. A multiplexer selects the WRITE signal corresponding to the programmed transform length. In general, the control signals corresponding to a transform length $N$ can be obtained from those of length $N/2$ in a similar way to that described above. Because of this property, the control logic has a regular structure made up of elemental modules connected in cascade, which reduces its complexity and facilitates its extension to lengths greater than 1024, the maximum considered here. Table I lists the number of standard cells of the control circuit and the STACK as a function of the transform length. As can be observed, the complexity of both circuits, measured in number of cells, increases linearly with $\log_2 N$.
In order to illustrate the operation of the processor, Fig. 6 shows the time diagram corresponding to the calculation of the 1-D-IFHT of the sequence $X_i = \{14, 20, -5, -15, 2, 5, 5, -16\}$, reordered in the form $\{14, 20, -5, 2, 5, -15, 5, -16\}$. This figure also illustrates the movement of data in the registers, as well as the operation of the arithmetic element. The transform is carried out in $N = 8$ clock cycles, and the output is generated with a latency of $\log_2 N = 3$ clock cycles. In the first cycle, the input data $X_0 = 14$ are stored in RSUM. In the second cycle, the processor operates with $\{X_0, X_1\}$, storing the addition (34) in RSUM and the subtraction ($-6$) in R0. In the third cycle, the contents of RSUM are processed with the input data, $X_2 = -5$, scaled by 2 (1 bit shifted to the left); the new addition (24) is stored in RSUM and the subtraction (44) in R0, shifting the previous content of R0 to R1. In the fourth cycle, the contents of RSUM are processed with the input data, $X_4 = 2$, scaled by 4 (2 bits shifted to the left); in this case, the results of the addition and subtraction, scaled by one eighth (3 bits shifted to the right), make up the first output data, $x_0 = 4$ and $x_1 = 2$ ($x_{e0}$ and $x_{o0}$). In the fifth cycle, the data of R0 are processed with the input data, $X_5 = 5$, scaled by 4 (2 bits shifted to the left), and the contents of R1 shift to R0; the results of the addition and subtraction, scaled by one eighth (3 bits shifted to the right), are now the output data $x_2 = 8$ and $x_3 = 3$ ($x_{o1}$ and $x_{e1}$), respectively. In the sixth cycle, the data of R0 are processed with the input data $X_3 = -15$, scaled by 2, storing the result of the addition ($-36$) in RSUM and the subtraction (24) in R0. In the seventh cycle, the data of RSUM are processed with the input data $X_6 = 5$, scaled by 4; in this case, the results of the addition and subtraction, scaled by one eighth, are, respectively, the output data $x_4 = -2$ and $x_5 = -7$ ($x_{o2}$ and $x_{e2}$). In the last cycle, the data of R0 are processed with the input data $X_7 = -16$, scaled by 4; the results of the addition and subtraction, scaled by one eighth, are, respectively, the last output data, $x_6 = -5$ and $x_7 = 11$ ($x_{o3}$ and $x_{e3}$).
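The cycle-by-cycle description above can be reproduced with a short behavioural model in Python. It is a sketch under stated assumptions, not the authors' circuit: the function name `simulate_ifht` is hypothetical, the operand selection rule (RSUM for $X_1$ and for even-indexed coefficients, top of the STACK otherwise) is inferred from the example, and multiplexers, output registers, and control signals are not modelled.

```python
def simulate_ifht(reordered):
    """Behavioural model of the sequential queue datapath (one input per cycle).

    `reordered` holds the coefficients in minimum latency (preorder) ordering.
    Returns (outputs, max_depth), where max_depth is the largest number of
    STACK registers in use at any time.
    """
    n = len(reordered)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of 2, at least 2")
    rsum = reordered[0]            # cycle 1: X_0 is stored in RSUM
    stack = []                     # stack[-1] plays the role of R0
    todo = [1]                     # indices of the coefficients still to arrive
    outputs, max_depth = [], 0
    for value in reordered[1:]:
        k = todo.pop()             # index of the coefficient read this cycle
        internal = 2 * k < n
        if internal:
            todo += [2 * k + 1, 2 * k]
        # Inferred rule: X_1 and left children use RSUM, right children pop R0.
        operand = rsum if (k == 1 or k % 2 == 0) else stack.pop()
        scaled = value << (k.bit_length() - 1)    # scale by 1, 2, 4, ...
        total, diff = operand + scaled, operand - scaled
        if internal:
            rsum = total           # addition goes to RSUM
            stack.append(diff)     # subtraction is pushed onto the STACK
            max_depth = max(max_depth, len(stack))
        else:
            outputs += [total // n, diff // n]    # final scaling by 1/N
    return outputs, max_depth


print(simulate_ifht([14, 20, -5, 2, 5, -15, 5, -16]))
```

For the example sequence the model yields the outputs 4, 2, 8, 3, -2, -7, -5, 11 and never holds more than two stored subtractions, which agrees with the $(\log_2 N) - 1$ figure given in Section II.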
IV. 1-D-IFHT CHIP
A prototype of the 1-D-IFHT processor has been implemented in a chip, using a standard cell design methodology in a 1-µm two-level metal CMOS technology. In order to limit both the size of the chip and the number of I/O pads, a single data output bus has been used, multiplexing the outputs of the adder/subtracter at double the frequency of the processor clock, as shown in Fig. 7. Moreover, the processor has two additional outputs: MS, which indicates the completion of the computing of a transform; and EO, which indicates valid data output. The chip occupies an area of 11.7 mm² and has been encapsulated in a 64-pin package. Sixty-three pads and 2621 standard cells have been used in the design. The functionality of the prototype, whose microphotograph is shown in Fig. 8, has been fully tested up to frequencies of 25 MHz (the limit frequency of the test equipment). The circuit has also passed simpler tests that place its maximum data rate close to 60 MHz. The adder/subtracter, whose structure is of the carry select type, has a propagation time of 13 ns, which limits the operating speed of the processor.
As an application of the 1-D-IFHT chip, the 2-D-IFHT processor shown in Fig. 9 has been designed and is capable of handling black and white images of 256 × 256 pixels with a coding of 8 bits. This processor is made up of two 1-D-IFHT chips, two RAM memories of 64k × 16, which are used to store the intermediate data, and an addressing circuit for these memories based on counters and multiplexers. The first 1-D-IFHT processes each of the 256 rows of the matrix of coefficients and writes the output data in one of the memories. Simultaneously, the second chip reads and processes the data stored in the other memory generated by the first processor, obtaining the output image with raster ordering and a latency of one image. The simulation results for the 2-D-IFHT processor, using the standard cell 1.0-µm CMOS library, enable its maximum data rate to be estimated at around 50 MHz.
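The row-column scheme can be illustrated with a small Python sketch based on the separability property. Several points are assumptions: the second pass is taken to operate on the columns of the intermediate matrix, the 1-D transforms use natural coefficient ordering rather than the minimum latency ordering of the chip, and the two-memory overlap between passes is not modelled. All function names are hypothetical.

```python
def fht_1d(x):
    """Direct transform by definition (assumed form: X_k = sum_i x_i h_k(i))."""
    n = len(x)
    out = [sum(x)]
    for k in range(1, n):
        p = k.bit_length() - 1
        block = n >> p                         # support length of h_k
        start = (k - (1 << p)) * block
        half = block // 2
        out.append(sum(x[start:start + half])
                   - sum(x[start + half:start + block]))
    return out


def ifht_1d(coeffs):
    """Inverse fast transform, level by level, natural coefficient ordering."""
    n = len(coeffs)
    partial, level = [coeffs[0]], 0
    while len(partial) < n:
        width = len(partial)
        nxt = []
        for q in range(width):
            scaled = coeffs[width + q] << level
            nxt += [partial[q] + scaled, partial[q] - scaled]
        partial, level = nxt, level + 1
    return [v // n for v in partial]           # exact for integer images


def rows_then_columns(transform, matrix):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform(list(r)) for r in matrix]          # first pass
    cols = [transform(list(c)) for c in zip(*rows)]      # second pass
    return [list(r) for r in zip(*cols)]                 # back to row order


image = [[(3 * r + 5 * c) % 7 for c in range(4)] for r in range(4)]
coefficients = rows_then_columns(fht_1d, image)
restored = rows_then_columns(ifht_1d, coefficients)
print(restored == image)
```

The sketch assumes a square matrix whose side is a power of 2. The intermediate matrix between the two passes has as many entries as the image, which for 256 × 256 pixels is 65 536 words, the depth of each 64k × 16 memory mentioned above.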
V. CONCLUSION
This correspondence describes an inverse fast Haar transform processor (1-D-IFHT) of programmable length between 8 and 1024 points. The only computing element of this processor is an adder/subtracter, and its data flow has been conceived with the purpose of minimizing the internal memory. One important characteristic is that the control logic and the internal memory have a modular structure that enables them to be extended to greater lengths with hardware that grows linearly with $\log_2 N$. A prototype of the processor has been implemented in a 1-µm CMOS process using a standard cell design methodology. The maximum data rate is close to 60 MHz.
