Abstract-We present an area-efficient parallel architecture that implements the constant-geometry, in-place Fast Fourier Transform. It consists of an array of specific-purpose processors interconnected by means of a perfect unshuffle network. For a radix-$r$ transform of $N = r^n$ data of size $D$ and a column of $P = r^p$ processors, each processor has only one local memory of $N/(rP)$ words of size $rD$, with only one read port and one write port that nevertheless make it possible to read the $r$ inputs of a butterfly and write $r$ intermediate results in each memory cycle. The address generating circuit that permits the in-place implementation is simple and the same for all the local memories. The data flow has been designed to efficiently exploit the pipelining of the processing section with no cycle loss. This architecture reduces the area by almost 50% with respect to other designs of similar performance.
of increased arithmetic complexity [1], [2], or partitioning the memory into $r$ banks that are accessed simultaneously, at the cost of complex addressing and a larger area [3]. Another approach is to restructure the memory so as to reduce the number of accesses without increasing complexity [4].
An optimal design would follow from an in-place algorithm (to minimize memory size and provide regular butterflies); the design would also require a simple interconnection network and a memory structured so that the FFT is computed with a reduced number of accesses. In this work we propose the design of a parallel architecture based on the constant-geometry FFT algorithm (CGFFT), which implements the in-place transform, minimizing memory requirements and optimizing the efficiency and simplicity of communications. The design is obtained by using the methodology proposed in [5], expressing each stage of the CGFFT algorithm as a string of operators that are easy to translate into hardware. This string of operators determines the internal structure of the processors and the interconnection network. The parallel architecture consists of an array of specific-purpose processors interconnected with local memories by means of a perfect unshuffle network. Each processor has only one local memory of $N/(rP)$ words of size $rD$ (for a radix-$r$ transform of $N$ data of size $D$ and an array of $P$ processors), with only one read port and one write port, and very little additional storage. The control unit permits the in-place implementation in a simple way, and it is the same for all the local memories. The data flow has been designed to efficiently exploit the pipelining of the processing section with no cycle loss. Furthermore, it permits partitioning the computations among an arbitrary number of processors in such a way that data are recirculated, thus optimizing both communications and the use of processors.
In recent literature, several parallel designs that implement the CGFFT algorithm have been proposed, but they do not use all the natural characteristics of this algorithm. In [5] the proposed parallel architecture implements the intermediate shuffles by means of two FIFO queues of size $N/P$ in each processor. In [6] a processor array is proposed that uses independent shuffle connections along the dimensional axes of the array, without requiring all the processors to have shuffle connectivity, although this increases the complexity of the design. In [7] an efficient design is presented, but the memory is partitioned and multiport elements are used, which limits the usable radix to 4. The methodology we use is simpler and clearer than those based on Kronecker products [6], [7]. Furthermore, it can be applied to regularize the data flow of groups of algorithms and thus treat them in a unified way. In [8] it has been applied to the design of specific architectures for the Fast Fourier and Hartley transforms, and in [9] to the design of a unified architecture for tridiagonal algorithms.
We have organized the rest of this work as follows. In Section II, we define several operators that we will use to describe each stage of the constant geometry FFT algorithm. In Section III, we obtain the design of a specific-purpose processor for the sequential computation of the FFT. In Section IV, we present some relationships and decomposition into operators and obtain the appropriate parallel architecture for the computation of the FFT, thus evaluating the efficiency of our implementation. Finally, in Section V, we establish our conclusions and suggest further work.
II. FAST FOURIER TRANSFORM WITH CONSTANT GEOMETRY
In this section we define the basic operators to be used in the rest of the sections, review basic ideas about the FFT with constant geometry, and describe each stage of the algorithm as an operator string.

1057-7130/99$10.00 © 1999 IEEE
A. Basic Operators
We will consider sequences of $N$ data $y(i)$, $0 \le i < N = r^n$, $2 \le r = 2^m$, and assume that $i_n \cdots i_1$ are the digits of the base-$r$ representation of $i$. In the following, we will denote as $S$ the ordered sequence of data.
Definition 1: The butterfly operator $B$ transforms a sequence $S$ into another sequence $S'$ of the same length. It performs the FFT butterfly operations on each subsequence of $r$ data of $S$ whose indices differ only in the least significant digit. Each butterfly accepts $r$ input data and produces $r$ outputs.
Definition 2: The perfect unshuffle (shuffle) operator $\Gamma$ ($\sigma$) transforms a sequence $S$ into another sequence $S'$ of the same length. It performs a cyclic rotation of order 1 to the right (left) in the $r$-ary representation of the index of each element of the sequence:
$$\Gamma(i) = [i_1, i_n, \cdots, i_2] \qquad (1)$$
$$\sigma(i) = [i_{n-1}, \cdots, i_1, i_n]. \qquad (2)$$
In some cases, it will be convenient to divide the base-$r$ representation of data into several ordered fields of digits, $i = (\alpha, \beta, \gamma)$. (5) Hereafter, the exchange operator $\epsilon_{\alpha(1),\beta(1)}$ will be denoted by $\epsilon_{\alpha,\beta}$ for simplicity.
We consider that operator strings are applied from left to right. The notation $\Omega^t$ indicates that operator $\Omega$ is applied $t$ times. (6) where $n > 1$. The proof is obtained by applying the definitions of both operators.
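The perfect unshuffle and perfect shuffle of Definition 2 act only on the digits of an index, so they are easy to try out in software. The following is a minimal sketch, assuming indices are plain integers and digits are listed most significant first; the function names `digits`, `from_digits`, `unshuffle`, and `shuffle` are hypothetical and do not come from the paper.

```python
def digits(i, r, n):
    """Base-r digits of i, most significant first: [i_n, ..., i_1]."""
    return [(i // r**k) % r for k in range(n - 1, -1, -1)]


def from_digits(ds, r):
    """Inverse of digits(): rebuild the integer from its base-r digits."""
    value = 0
    for d in ds:
        value = value * r + d
    return value


def unshuffle(i, r, n):
    """Perfect unshuffle: rotate the digits one place to the right,
    [i_n, ..., i_1] -> [i_1, i_n, ..., i_2]."""
    ds = digits(i, r, n)
    return from_digits(ds[-1:] + ds[:-1], r)


def shuffle(i, r, n):
    """Perfect shuffle: rotate the digits one place to the left,
    [i_n, ..., i_1] -> [i_{n-1}, ..., i_1, i_n]."""
    ds = digits(i, r, n)
    return from_digits(ds[1:] + ds[:1], r)
```

For example, with $r = 2$ and $n = 3$, `unshuffle(6, 2, 3)` maps binary 110 to 011, and `shuffle` undoes it, which illustrates that the two operators are inverses of each other.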
B. Constant Geometry FFT Algorithms
The central role in constant geometry algorithms is played by the perfect unshuffle and perfect shuffle permutations. The basic idea, due to Pease [10], is that a subsequence of $r$ elements of $S$ whose indices differ only in their $t$th digit (i.e., they are at a distance $r^{t-1}$, $t > 0$) is located by the perfect unshuffle operator $\Gamma$ at a distance of $r^{t-2}$.
Consequently, if we carry out a $\Gamma$ permutation of the output sequence in each stage of the ascendant FFT, a subsequence of $r$ elements initially at a distance of $r^{t-1}$ will occupy consecutive positions at the beginning of the $t$th stage. In this way, the inputs to the butterflies are at a distance of 1 at each stage. So, each stage consists in applying the operator string $B\Gamma$ to its input sequence. We can conclude that this algorithm (CGFFT hereafter) consists in applying the operator string $(B\Gamma)^n$ to the initial input sequence.
Let us consider the sequence $\{y_t(i) \mid 0 \le i \le N-1\}$ of initial data at the $t$th stage. Because of the unshuffling carried out in each step, $y_t(i)$ will be placed at position $\Gamma^{t-1}(i)$, so, given that the unshuffle $\Gamma$ and the shuffle $\sigma$ are inverse operators, the input sequence at the $t$th stage will be $\{y_t(\sigma^{t-1}(i)) \mid 0 \le i \le N-1\}$.
In order to better view the data evolution, we will assume that the input sequence at stage $t$ is distributed as a matrix $F_t$ with two rows and $2^{n-1}$ columns. The (column, row) position of a data item in matrix $F_t$ can be interpreted as (cycle, bus), cycle being the execution cycle and bus the path through which this data item accesses the processor.
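The constant-geometry data flow described above can be sketched in software for radix 2. The code below is an illustrative sketch rather than the authors' implementation: it uses two buffers (so it is not yet in place), it assumes the input is loaded in bit-reversed order so that the ascendant (decimation-in-time) stages produce a naturally ordered result, and the twiddle-factor indexing `k >> (n - t)` is derived for this sketch from the position-to-index relation stated above. The names `bit_reverse` and `cgfft_radix2` are hypothetical.

```python
import cmath


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i."""
    out = 0
    for _ in range(n):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out


def cgfft_radix2(x):
    """Radix-2 constant-geometry FFT: every stage applies the same
    butterfly-then-unshuffle pattern to its input sequence."""
    N = len(x)
    n = N.bit_length() - 1
    assert N == 1 << n, "length must be a power of two"
    # Assumption of this sketch: bit-reversed load, natural-order output.
    cur = [x[bit_reverse(i, n)] for i in range(N)]
    for t in range(1, n + 1):
        nxt = [0j] * N
        for k in range(N // 2):
            # Butterfly inputs are always at distance 1: positions 2k, 2k+1.
            w = cmath.exp(-2j * cmath.pi * (k >> (n - t)) / (1 << t))
            u, v = cur[2 * k], w * cur[2 * k + 1]
            # Perfect unshuffle of the results: 2k -> k, 2k+1 -> k + N/2.
            nxt[k] = u + v
            nxt[k + N // 2] = u - v
        cur = nxt
    return cur
```

Every stage reads from positions $2k$ and $2k+1$ and writes to positions $k$ and $k + N/2$, which is the fixed geometry that the architecture exploits; only the twiddle factors change from stage to stage.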
III. ARCHITECTURES FOR THE IN-PLACE CONSTANT GEOMETRY FFT
The operator string $B\Gamma$ that defines each stage of the CGFFT algorithm may be decomposed into a new string of operators that are easy to translate into hardware. There are several possible decompositions to choose from. The one chosen is the key point in any design, since it determines the performance of a processor, a column, or an array of processors. In this section, we will address the design of a uniprocessor system to compute the in-place CGFFT from a new decomposition of the perfect unshuffle $\Gamma$, which will also permit storing the data in an efficient way. In order to keep the presentation and figures simple, we will develop designs for a radix-2 transform, which are easily extended to any radix.
A. Design of an FFT Specific Uniprocessor
The computation of the $t$th stage of the CGFFT algorithm occurs in the following basic steps: read data from memory (read operator $R_t$), execute some operations on the data (operator $B$), and write the results in memory (write operator $W_t$). We will consider that operator $R_t$ gives the read cycle from the read address and, conversely, operator $W_t$ gives the write address from the write cycle. Then, since operator $B$ does not modify the position of data, the $t$th stage can be formulated as the operator string $R_t B W_t$ and the CGFFT algorithm as the product
$$\prod_{t=1}^{n} B\,W_t\,R_{t+1}$$
if we suppose that the initial input data are read from an external device (i.e., $R_1$ is not considered) and $R_{n+1} = I$. Since each stage of the CGFFT algorithm is also defined by the operator string $B\Gamma$, where $\Gamma$ is the perfect unshuffle, we obtain that $\Gamma = W_t R_{t+1}$. The simplest definition for the read and write CGFFT operators is $R_t = I$, $W_t = \Gamma$. These are the read and write functions proposed in [5]-[8]. However, these functions do not permit the algorithm to be computed in place, since the results of a stage are written in different locations than those where the data were read. That happens because the read and write operators chosen are not inverses of each other. This problem is solved by using new read and write functions to carry out the unshuffling permutation. Notice that the read and write operators $R_t = \Gamma^{t-1}$, $W_t = \sigma^{t-1}$, with $\sigma$ the perfect shuffle, also verify $\Gamma = W_t R_{t+1}$ and, furthermore, produce an in-place algorithm because $R_t W_t = I$. Then, the CGFFT is computed in place by performing the string
$$\prod_{t=1}^{n} B\,\sigma^{t-1}\,\Gamma^{t}.$$
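The two identities that justify the in-place read and write functions follow directly from the fact that the shuffle and the unshuffle are inverse rotations. As a short worked restatement, writing $\Gamma$ for the perfect unshuffle and $\sigma$ for the perfect shuffle, and applying operator strings from left to right:

```latex
\begin{align*}
R_t\,W_t     &= \Gamma^{t-1}\,\sigma^{t-1} = I, \\
W_t\,R_{t+1} &= \sigma^{t-1}\,\Gamma^{t}
              = \left(\sigma^{t-1}\,\Gamma^{t-1}\right)\Gamma
              = \Gamma .
\end{align*}
```

The first line says that the address written at cycle $j$ is the address that was read at cycle $j$; the second says that the net effect between consecutive stages is still one perfect unshuffle.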
A time-efficient processor should provide simultaneous access to read and write $r$ data, so $r$ read ports and $r$ write ports of length $D$ would be necessary. This means a memory with higher access time, area, and power consumption, as well as a limitation of the radix [7]. To avoid these overheads, we propose a memory with only one read port and only one write port of size $rD$, as follows from the next theorem. In every stage $t$ of the transform, we apply the string $B\,\epsilon_{\mathrm{cycle},\mathrm{bus}}\,(\sigma_{\mathrm{cycle}})^{t-1}\,(\Gamma_{\mathrm{cycle}})^{t}$ to the input matrix $F_t$, where $\epsilon$ denotes the exchange operator, $\sigma$ the perfect shuffle, and $\Gamma$ the perfect unshuffle, each restricted to the fields in its subscript. The interpretation of these operators is given by the following.
The exchange permutation $\epsilon_{\mathrm{cycle},\mathrm{bus}}$ exchanges the least significant bits of the cycle and bus data fields. We will denote as $H_{t+1}$ the matrix $\epsilon_{\mathrm{cycle},\mathrm{bus}}(G_{t+1})$. Observe in Fig. 1 the effect of operator $\epsilon_{\mathrm{cycle},\mathrm{bus}}$ on $G_3$: it rearranges the results of two even-odd butterflies [we mean two butterflies computed at the $2h$th and $(2h+1)$th cycles of the stage ($0 \le h < N/2$)] and provides inputs for two butterflies to be computed at the next stage. Thus, storing each column of $H_{t+1}$ in a memory address with words of size two data ensures that the inputs for each butterfly at the next stage can be accessed in one read cycle. It also determines the order in which the butterfly results must be written. Consequently, the (cycle, bus) fields of the indices in matrix $H_{t+1}$ may be interpreted as the data writing cycles and the word segment where they are written, respectively.
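The effect of the exchange step on a small result matrix can be sketched as follows for radix 2. This is an illustrative sketch under stated assumptions: the matrix is stored as two Python lists indexed as `G[bus][cycle]`, there are at least two columns, and the function names `exchange_cycle_bus` and `unshuffle_columns` are hypothetical. The second function models the column unshuffle that the read and write functions jointly perform.

```python
def exchange_cycle_bus(G):
    """Swap the least significant bit of the cycle field with the bus bit.
    G[bus][cycle] holds the output on `bus` of the butterfly run at `cycle`."""
    cols = len(G[0])
    H = [[None] * cols for _ in range(2)]
    for bus in range(2):
        for cycle in range(cols):
            new_cycle = (cycle & ~1) | bus   # bus bit becomes cycle LSB
            new_bus = cycle & 1              # old cycle LSB becomes bus
            H[new_bus][new_cycle] = G[bus][cycle]
    return H


def unshuffle_columns(H):
    """Perfect unshuffle of the columns: rotate the cycle field one bit right."""
    cols = len(H[0])
    bits = cols.bit_length() - 1
    assert cols == 1 << bits and bits >= 1, "needs 2**k columns, k >= 1"
    F = [[None] * cols for _ in range(2)]
    for c in range(cols):
        dest = (c >> 1) | ((c & 1) << (bits - 1))
        for bus in range(2):
            F[bus][dest] = H[bus][c]
    return F


# Outputs a_c (bus 0) and b_c (bus 1) of the butterflies run at cycles 0..3.
G = [["a0", "a1", "a2", "a3"],
     ["b0", "b1", "b2", "b3"]]
H = exchange_cycle_bus(G)    # columns: (a0,a1) (b0,b1) (a2,a3) (b2,b3)
F = unshuffle_columns(H)     # columns: (a0,a1) (a2,a3) (b0,b1) (b2,b3)
```

Each column of `H` is one memory word: it packs the two operands that a single butterfly of the next stage will need, which is why one read port of double width suffices.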
Operator $(\sigma_{\mathrm{cycle}})^{t-1}$ is the write function. It acts on the cycle fields of the indices in matrix $H_{t+1}$ (writing cycles) and provides the writing address in memory. Operator $(\Gamma_{\mathrm{cycle}})^{t}$ is the read function $R_{t+1}$. Its input is a memory address, and it provides the cycle in which the processor will read this address at the next stage.
The action of the last two operators, equivalent to the action of operator $\Gamma_{\mathrm{cycle}}$, provides the input matrix $F_{t+1}$ of the $(t+1)$th stage by unshuffling the columns of matrix $H_{t+1}$, as can be observed in Fig. 1.
The data flow generated by Theorem 1 is presented in Fig. 2(a), and the memory evolution at the different stages of the algorithm in Fig. 2(b), for the case of a 16-data transform.
B. Hardware Implementation
In this subsection, the hardware implementation of the CGFFT processor defined by Theorem 1 is obtained as follows.
Operator B determines the internal structure of the processing section (PS), hardware implementation of the FFT butterflies. In the radix 2 case, this section has two input buses and two output buses whose width is determined by the input data type (real or complex).
The hardware implementation of the exchange operator $\epsilon_{\mathrm{cycle},\mathrm{bus}}$ is shown in Fig. 3. It consists of a set of three registers and a mux. Register r0 is located at the output of bus 0 in the PS. Registers r1 and r2 are located at the output of bus 1 and act as a serial input parallel output (SIPO) queue of size 2 (see Fig. 3). This circuit works as follows.
Registers r0 and r1 store the results of the $2h$th butterfly during an execution cycle. In the next cycle, output 0 of the PS and r0 are stored in memory, and in the following cycle r1 and r2 are stored. The mux, controlled by the least significant bit of counter CC (see Fig. 3), selects the two data to be stored in memory on alternate cycles. We will call the hardware that implements $\epsilon_{\mathrm{cycle},\mathrm{bus}}$ Routing Section A (RSA).
The read and write functions control the memory operation. Since they produce an in-place algorithm, the address accessed at read cycle $j$ will also be accessed at write cycle $j$, with a delay determined by the PS (two cycles if the PS is not pipelined). So, only the read address generating mechanism needs to be implemented. On the other hand, the control should provide the read address from the read cycle, i.e., the inverse of the read function, $(\sigma_{\mathrm{cycle}})^{t-1}$.
The hardware implementation of this operator is made up of two counters and a cyclic bit rotation circuit, as shown in Fig. 3. A modulo-$N/2$ counter (CC) defines the read cycle of the butterflies. A modulo-$n$ counter (SC) defines the stage of the transform and controls the number of positions the CC bits must rotate in the cyclic rotation circuit (RC). The RC circuit output is the memory address that must be read in the current cycle (indicated by CC).
To summarize, the processor designed in this section permits computing an $N$-data radix-2 transform in $(N/2)n + s + 2$ cycles, $s$ being the depth of the PS pipelining, without any additional storage or idle cycles. The two basic ideas of the design are: 1) rearranging the results of two even-odd butterflies so that the inputs for two butterflies to be computed at the next stage are written in memory, and 2) choosing read and write functions that provide both the in-place and the unshuffling characteristics.
In the general case of a radix-$r$ transform, the design is obtained in a similar manner. The PS has $r$ input buses and $r$ output buses $(b_0, \cdots, b_{r-1})$ for the computation of radix-$r$ butterflies. The implementation of the exchange operator $\epsilon_{\mathrm{cycle},\mathrm{bus}}$ consists of a set of $3r(r-1)/2$ registers organized into $r$ SIPO queues connected to the output buses of the PS, as shown in Fig. 4 for $r = 4$. A SIPO queue of size $r - 1 + i$ is connected to bus $b_i$ (for $r = 2$ this gives the queues of sizes 1 and 2 described above, and the sizes add up to $3r(r-1)/2$). The outputs of the registers are directed to a multiplexer that in each cycle selects the suitable $r$ outputs under the control of the last $m$ bits of CC (as each base-$r$ digit has $m$ bits). The memory is organized into $N/r$ words of size $rD$. The implementation of the read and write functions is similar, but the RC circuit will carry out a cyclic rotation of $m(t-1)$ bits in the $t$th stage.
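The behaviour of the radix-2 uniprocessor can be mimicked in software: one memory of $N/2$ two-data words, an address obtained by rotating the cycle counter by $t-1$ bits, and results written back to the addresses just read. The sketch below rests on several assumptions that are not taken from the paper: the input is loaded in bit-reversed order, the twiddle index is `c >> (n - t)`, pipelining delays are ignored, and the final result is read back with the rotation that a hypothetical stage $n+1$ would use. All function names are hypothetical.

```python
import cmath


def rotl(v, k, bits):
    """Cyclic left rotation of the `bits`-bit value v by k positions."""
    k %= bits
    return ((v << k) | (v >> (bits - k))) & ((1 << bits) - 1)


def read_address(cc, t, bits):
    """Address read at cycle cc of stage t: CC rotated left by t-1 bits."""
    return rotl(cc, t - 1, bits)


def bit_reverse(i, n):
    out = 0
    for _ in range(n):
        out, i = (out << 1) | (i & 1), i >> 1
    return out


def inplace_cgfft(x):
    N = len(x)
    n = N.bit_length() - 1
    assert N == 1 << n and n >= 2, "sketch needs N = 2**n with n >= 2"
    bits = n - 1                          # width of cycle counter CC
    # Single memory of N/2 words; each word holds two data.
    mem = [[x[bit_reverse(2 * c, n)], x[bit_reverse(2 * c + 1, n)]]
           for c in range(N // 2)]
    for t in range(1, n + 1):             # stage counter SC
        for h in range(N // 4):           # even-odd cycles 2h and 2h+1
            addr = [read_address(2 * h + e, t, bits) for e in (0, 1)]
            out = []
            for e in (0, 1):
                c = 2 * h + e
                w = cmath.exp(-2j * cmath.pi * (c >> (n - t)) / (1 << t))
                u, v = mem[addr[e]][0], w * mem[addr[e]][1]
                out.append((u + v, u - v))
            # Exchange step, then in-place write to the addresses just read.
            mem[addr[0]] = [out[0][0], out[1][0]]   # both bus-0 results
            mem[addr[1]] = [out[0][1], out[1][1]]   # both bus-1 results
    result = []
    for c in range(N // 2):               # read-out order: an assumption
        result.extend(mem[read_address(c, n + 1, bits)])
    return result
```

No second buffer is allocated: every word is overwritten only after both butterflies of its even-odd pair have read their operands, which is the property the read and write functions were chosen to guarantee.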
IV. PARALLEL ARCHITECTURES
From the analysis of the CGFFT data flow, we deduce that the most appropriate parallel architecture for exploiting its inherent temporal and spatial parallelism is an array of processors (PE's). For simplicity, in Section IV-B we will design a column of PE's and outline the design for an array in Section IV-C. The methodology used to design both parallel architectures for the radix-$r$ CGFFT is based on the decomposition of the perfect unshuffle operator $\Gamma$ into a string of elementary operators, which we introduce in Section IV-A.
A. Decomposition into Elementary Operators
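We will borrow the notation from Section II-A: $i = (\alpha, \beta, \gamma)$, with $\alpha = [\alpha_a, \cdots, \alpha_1]$ and $\beta$, $\gamma$ defined analogously.
In the following lemma, we will consider that field $\gamma$ consists of a single digit $\gamma_1$. Lemma 2: The perfect unshuffle operator $\Gamma$ can be decomposed into two partial unshuffles, $\Gamma = \Gamma_{\alpha,\gamma}\,\Gamma_{\beta,\gamma}$.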
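Lemma 2 can be checked exhaustively for small sizes. The sketch below assumes a particular reading of the partial unshuffle, since its definition does not appear in this excerpt: the operator restricted to two fields rotates right, by one digit, the digits of those two fields taken together and leaves the remaining field untouched. Under that assumption, and with operators applied from left to right, the composition matches the full unshuffle. The function names are hypothetical.

```python
from itertools import product


def rot_right(ds):
    """Rotate a tuple of digits one place to the right."""
    return ds[-1:] + ds[:-1]


def full_unshuffle(alpha, beta, gamma):
    ds = rot_right(alpha + beta + gamma)
    a, b = len(alpha), len(beta)
    return ds[:a], ds[a:a + b], ds[a + b:]


def partial_alpha_gamma(alpha, beta, gamma):
    """Assumed definition: unshuffle fields alpha and gamma, keep beta."""
    ds = rot_right(alpha + gamma)
    return ds[:len(alpha)], beta, ds[len(alpha):]


def partial_beta_gamma(alpha, beta, gamma):
    """Assumed definition: unshuffle fields beta and gamma, keep alpha."""
    ds = rot_right(beta + gamma)
    return alpha, ds[:len(beta)], ds[len(beta):]


def lemma2_holds(r, a, b):
    """Check the decomposition for every index with a + b + 1 base-r digits."""
    for ds in product(range(r), repeat=a + b + 1):
        alpha, beta, gamma = ds[:a], ds[a:a + b], ds[a + b:]
        step1 = partial_alpha_gamma(alpha, beta, gamma)
        step2 = partial_beta_gamma(*step1)
        if step2 != full_unshuffle(alpha, beta, gamma):
            return False
    return True
```

Calling `lemma2_holds(2, 2, 2)` or `lemma2_holds(4, 1, 2)` exercises every index of a 32-point radix-2 or a 256-point radix-4 sequence, respectively.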
B. Design of an In-Place CGFFT Column of Processors
For a column of $P$ processors ($P = r^p$, $p < n$), the matrix $F_t$ of input data in stage $t$, $1 \le t \le n$, should be distributed among the PEs' local memories so that each PE evaluates $r^{n-p-1}$ butterflies (columns in $F_t$). We assume that a processor's local memory is the one where it writes the results of the butterflies. From a computational point of view, we are decomposing the indices of the data into three fields: cycle, processor, and bus, with sizes $n - p - 1$, $p$, and 1 digits, respectively. Depending on how the columns of matrix $F_t$ are distributed among the PE's, these fields can be ordered in two different ways: (cycle, PE, bus) and (PE, cycle, bus). The first one, called cyclic distribution, assigns consecutive butterflies to different PE's, whereas the second, called consecutive distribution, assigns a block of consecutive butterflies to each PE. In the following we will use cyclic distribution. With this distribution, in the $t$th stage of the radix-$r$ CGFFT, the processor with index $\beta$, PE($\beta$), operates with matrix $F_{t,\beta}$ of $r$ rows and $r^{n-p-1}$ columns.
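The difference between the cyclic and the consecutive distribution is simply which digits of the column index select the processor. A minimal sketch, assuming columns of $F_t$ are numbered from 0 and using hypothetical function names:

```python
def cyclic(col, P):
    """(cycle, PE, bus) ordering: the low digits of the column pick the PE.
    Returns (pe, cycle)."""
    return col % P, col // P


def consecutive(col, P, total_cols):
    """(PE, cycle, bus) ordering: the high digits of the column pick the PE.
    Returns (pe, cycle)."""
    per_pe = total_cols // P
    return col // per_pe, col % per_pe


# Radix 2, N = 16: F_t has 8 columns; a column of P = 2 processors.
cyclic_pes = [cyclic(c, 2)[0] for c in range(8)]
consecutive_pes = [consecutive(c, 2, 8)[0] for c in range(8)]
```

With the cyclic distribution consecutive butterflies alternate between the two processors, while the consecutive distribution hands the first four butterflies to one processor and the last four to the other.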
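Proof: If we denote fields $\alpha$ = cycle, $\beta$ = PE, $\gamma$ = bus, from Lemma 2 it follows that $B\Gamma = B\,\Gamma_{\mathrm{cycle},\mathrm{bus}}\,\Gamma_{\mathrm{PE},\mathrm{bus}}$, and using Theorem 1, $\Gamma_{\mathrm{cycle},\mathrm{bus}} = \epsilon_{\mathrm{cycle},\mathrm{bus}}\,(\sigma_{\mathrm{cycle}})^{t-1}\,(\Gamma_{\mathrm{cycle}})^{t}$.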
The architecture of the PE column is defined by string (13) given by Theorem 2. The string $\epsilon_{\mathrm{cycle},\mathrm{bus}}\,(\sigma_{\mathrm{cycle}})^{t-1}\,(\Gamma_{\mathrm{cycle}})^{t}$ carries out a partial perfect unshuffle internally in each PE's local memory (it does not modify the PE field) and determines the processor's routing section. This string is the same as that given by Theorem 1; therefore, each processor is analogous to the one designed in Section III, but now the memories consist of $N/(2P)$ words of size $2D$ (radix 2), and the read and write functions are the same for all the PE's, i.e., at each cycle all the processors will read and write at the same address of their local memory. Consequently, we only need one address generator for all of them, which is an interesting characteristic for VLSI implementation with regard to area and design time. [ 1 ] in the next step. The operation it carries out over the indices is equivalent to moving the data from segment [ 1 ]
C. Array Communications Regularization
From (14), operator $\Gamma_{\mathrm{row},\mathrm{column},\mathrm{bus}}$ determines the communications among processors in the array. Unfortunately, this operator generates an irregular network, with no defined global interconnection pattern. However, local patterns exist that can be extracted by factoring this operator into simpler operators. In fact, from Lemma 2 we get $\Gamma_{\mathrm{row},\mathrm{column},\mathrm{bus}} = \Gamma_{\mathrm{row},\mathrm{bus}}\,\Gamma_{\mathrm{column},\mathrm{bus}}$. The partial unshuffling operator $\Gamma_{\mathrm{row},\mathrm{bus}}$ is restricted to fields row and bus; thus it determines an identical communications pattern in all the columns. On the other hand, operator $\Gamma_{\mathrm{column},\mathrm{bus}}$ is restricted to fields column and bus and produces the same pattern in each row. From this point of view, we introduce modularity in communications among PE's, which is very interesting regarding reliability and design time. When one of the array dimensions is large, it is possible to decrease the interconnection complexity. In fact, using Lemma 1 we can decompose these operators as a product of exchanges. For example, in the case of operator $\Gamma_{\mathrm{row},\mathrm{bus}}$, the unshuffling is carried out in $n$ exchange steps in such a way that at the $j$th step the following exchange occurs:
[u; 1 1 1 ; j+1; j; ; j01; 1
As an example, the output lines in the last column of PE's in Fig. 6 show the exchange steps obtained from factoring operator $\Gamma_{\mathrm{row},\mathrm{bus}}$ for the case of a radix-2, four-row array. Observe the regularity and modularity of this solution.
To summarize, in this section we have designed a parallel architecture for the radix-$r$ CGFFT algorithm. This consists of a column of specific-purpose processors whose interconnection network is a perfect unshuffle of the $rP$ outputs of the local memory segments with the $rP$ input buses of the PE's. Each processor is similar to the one obtained for the uniprocessor implementation in Section III, and has one local memory of size $ND/P$ (for an $N$-data transform and a column of $P$ processors) and very little additional storage (SIPO registers).
A single address generation circuit serves all the local memories. This processor column can be extended to an array of processors in a straightforward manner. Further factorizations of the unshuffling operator lead to more regular and modular designs.
D. Evaluation of Design Efficiency
The routing network required by the architecture in Section IV-B belongs to the category known as area-efficient [11], because it uses a fixed interconnection with bus cross number $O(P^2 r^2)$ and evenly routes $Pr$ buses to the local memories of the $P$ processors at each time unit. However, the area-efficient concept would be more precise if the processor area, which is mainly determined by the memory size, were considered. In this sense, an architecture is area-efficient if each processor uses a memory of size $ND/P$, the minimum needed to store the data to be computed. We can conclude that our design is area-efficient in both senses.
However, we must point out that the area occupied by the memory is not only proportional to the number of data stored, but also proportional to the number of ports. Therefore, the most efficient memory (for a minimum number of computation cycles) would be the one used in our case, with a single read port and a single write port. This feature also allows us to improve the memory access time, which may be critical in order to establish the operation cycle in a pipelined processor. In addition, a memory organized with words of size $2D$ helps to reduce the memory access time. Table I shows the differences in area and access time between two memories with two read/write ports of sizes $16 \times 256$ and $16 \times 1024$, and two memories of sizes $32 \times 128$ and $32 \times 512$ with only one read port and only one write port.
In the following, we compare the area/time parameters of our design with those of two others [7], [8]. Let $A_{PS}$ be the area of the processing and internal routing sections, and $\kappa N$ the memory area, $\kappa$ being a parameter depending on technology and memory organization. In our design, the total amount of area would be $A_{PS} + \kappa N$, and $A_{PS} + 2\kappa N$ in the two others (we do not consider the differences that may occur in the values of $A_{PS}$ and $\kappa$). If we consider that the area occupied by the processors is, basically, the same as the area occupied by the local memories [12], we conclude that our design reduces the area by almost 50% with respect to these two other designs. On the other hand, the number of execution cycles needed to calculate a radix-$r$ FFT is $[(N/rP)\log_r N + s]/s$ in all the cases we consider, $s$ being the depth of the pipelining of the PS (notice that $r$ is constrained to 4 in [7]). As pointed out before, our data storage organization permits a higher operation frequency, which means a more significant reduction in the area $\times$ time parameter.
V. CONCLUSIONS
We have designed a specific parallel architecture for the computation of the radix-$r$ Fast Fourier Transform, which reduces the area by almost 50% with respect to other designs of similar performance. This reduction is mainly due to three factors. First, the algorithm implemented is a constant geometry algorithm that uses a fixed interconnection network. The second factor is that memory requirements are minimal due to the in-place implementation, since we only use the memory needed to store the data sequence to be transformed. The design uses a single local memory with a single read port and a single write port in each processor, and the address generation mechanism is very simple and shared by all the processors. Finally, the memory organization in words of length $rD$ reduces the access time and hence the processor cycle. On the other hand, the data flow is regular, efficiently exploits the pipelining of the processing section with no cycle loss, and provides an optimal load balance. The design characteristics make it especially suitable for VLSI integration. In fact, the uniprocessor system presented in Section III-A has been implemented with 0.7 micron CMOS technology in a DSP for real-time audio applications.
Novel Vector Quantization Based Algorithms for Low-Power Image Coding and Decoding
K. Masselos, P. Merakos, T. Stouraitis, and C. E. Goutis
Abstract-In this paper, a novel scheme for low-power image coding and decoding based on vector quantization is presented. The proposed scheme uses small codebooks, and block transformations are applied to the codewords during coding. Using small codebooks, the proposed scheme has reduced memory requirements in comparison to classical vector quantization. The transformations applied to the codewords extend computationally the small codebooks compensating for the quality degradation introduced by the small codebook size. Thus the coding task becomes computation-based rather than memory-based, leading to significant power savings since memory-related power consumption forms the major part of the total power consumption of a system. Since the parameters of the transformations depend on the image block under coding, the small codebooks are dynamically adapted to the specific block under coding leading to acceptable image qualities. The proposed scheme leads to power savings of a factor of 10 in coding and of a factor of 3 in decoding, at least in comparison to classical full-search vector quantization. The main factor affecting both image quality and power consumption is the size of the codebook that is used.
I. INTRODUCTION
Image and video coding form an integral part of information exchange. The number of computer systems incorporating multimedia capabilities for displaying and manipulating video data is continuously increasing. As the essential design consideration for portability is the reduction of power consumption [1], this interest in multimedia, combined with the great popularity of portable computers and phones, makes the development of low-power image and video coding/decoding schemes very important. A hardware implementation of a very low-power decoder based on vector quantization, for real-time video decompression on a portable terminal, is presented in [2]. Another low-power video compression-decompression system based on pyramid vector quantization of subband coefficients is described in [3].
Vector quantization [4] is an efficient image coding technique, achieving low bit rates, i.e., lower than 1 bit per pixel. Vector quantization is described as 
where $x$ is a $k$-dimensional input vector belonging to the $k$-dimensional space $R^k$, $C$ is the codebook of $N$ $k$-dimensional words $y_i$, and $d$ is the distortion criterion used. In vector quantization, a vector, which is a block of pixels, is approximated by a representative vector (codeword) of the codebook, which minimizes the distortion among all the codevectors in the codebook. Compression is achieved by transmitting or storing the codeword address (index) instead of the codeword itself.
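The encoding rule just described, classical full-search vector quantization, can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance as the distortion criterion $d$ and a tiny made-up codebook; the function names `sq_dist`, `vq_encode`, and `vq_decode` are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distortion between two k-dimensional vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def vq_encode(x, codebook, d=sq_dist):
    """Full search: return the index of the codeword minimizing d(x, y_i)."""
    return min(range(len(codebook)), key=lambda i: d(x, codebook[i]))


def vq_decode(index, codebook):
    """Decoding is a table lookup of the transmitted index."""
    return codebook[index]


# Illustrative codebook of N = 3 codewords for 2x2 pixel blocks (k = 4).
codebook = [(0, 0, 0, 0), (10, 10, 10, 10), (255, 255, 255, 255)]
block = (9, 12, 8, 11)
index = vq_encode(block, codebook)
approx = vq_decode(index, codebook)
```

Only `index` has to be transmitted or stored; the encoder's cost grows with the codebook size because every codeword is compared against the block, which is the memory-bound behaviour that smaller codebooks aim to reduce.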
