Signal processing applications of massively parallel charge domain computing devices by Barhen, Jacob et al.
US005952685A 
United States Patent [19] [11] Patent Number: 5,952,685 
Fijany et al. [45] Date of Patent: *Sep. 14,1999 
SIGNAL PROCESSING APPLICATIONS OF 
MASSIVELY PARALLEL CHARGE DOMAIN 
COMPUTING DEVICES 
Inventors: Amir Fijany, Granada Hills; Jacob Barhen, La Crescenta; 
Nikzad Toomarian, Encino, all of Calif. 
Assignee: California Institute of Technology, Pasadena, Calif. 
[58] Field of Search ..................................... 257/214, 215, 
231, 232, 236; 377/57, 58, 59, 60; 
364/606, 807, 862, 844 
Primary Examiner: Steven H. Loke 
Attorney, Agent, or Firm: Michaelson & Wallace 
Notice: This patent issued on a continued pros- 
ecution application filed under 37 CFR 
1.53(d), and is subject to the twenty year 
patent term provisions of 35 U.S.C. 
154(a)(2). 
Appl. No.: 08/598,900 
Filed: Feb. 9, 1996 
Related U.S. Application Data 
Division of application No. 08/161,908, Nov. 30, 1993, Pat. 
No. 5,508,538, which is a continuation-in-part of application 
No. 08/049,829, Apr. 19, 1993, Pat. No. 5,491,650. 
Int. Cl.6 ......................... H01L 29/76; H01L 27/148; 
G11C 19/18; G06G 7/00 
U.S. Cl. .......................... 257/214; 257/215; 257/231; 
257/232; 257/236; 377/57; 377/58; 377/59; 
377/60; 364/606; 364/807; 364/862; 364/844 
[57] ABSTRACT 
The present invention is embodied in a charge coupled 
device (CCD)/charge injection device (CID) architecture 
capable of performing a Fourier transform by simultaneous 
matrix vector multiplication (MVM) operations in respective 
plural CCD/CID arrays in parallel in O(1) steps. For 
example, in one embodiment, a first CCD/CID array stores 
charge packets representing a first matrix operator based 
upon permutations of a Hartley transform and computes the 
Fourier transform of an incoming vector. A second CCD/CID 
array stores charge packets representing a second 
matrix operator based upon different permutations of a 
Hartley transform and computes the Fourier transform of an 
incoming vector. The incoming vector is applied to the 
inputs of the two CCD/CID arrays simultaneously, and the 
real and imaginary parts of the Fourier transform are 
produced simultaneously in the time required to perform a 
single MVM operation in a CCD/CID array. 
12 Claims, 8 Drawing Sheets 
https://ntrs.nasa.gov/search.jsp?R=20080006936 2019-08-30T03:09:22+00:00Z
[Drawing Sheets 1 to 8 (FIGS. 1 to 13): figures not reproduced. Legible labels: Sheet 1, OUTPUT VECTOR; Sheet 3, FIG. 3A (PHASE 1 to PHASE 4) and FIG. 3B (PRIOR ART; DC COLUMN, DC ROW); Sheet 4, FIG. 4.]
SIGNAL PROCESSING APPLICATIONS OF 
MASSIVELY PARALLEL CHARGE DOMAIN 
COMPUTING DEVICES 

CROSS-REFERENCE TO RELATED APPLICATIONS 

This is a division of application Ser. No. 08/161,908, filed Nov. 30, 1993, now U.S. Pat. No. 5,508,538, which is a continuation-in-part of application Ser. No. 08/049,829, filed Apr. 19, 1993, now U.S. Pat. No. 5,491,650. 

ORIGIN OF THE INVENTION 

The invention described herein was made in the performance of work under a NASA contract, and is subject to the provisions of Public Law 96-517 (35 USC 202) in which the contractor has elected to retain title. 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to charge coupled device (CCD) and charge injection device (CID) hardware applied to higher precision parallel arithmetic processing devices particularly adapted to perform large numbers of multiply-accumulate operations with massive parallelism and to methods for performing signal processing including Fourier transforms and convolution. 

2. Background Art 

Many algorithms required for scientific modeling make frequent use of a few well defined, often functionally simple, but computationally very intensive data processing operations. Those operations generally impose a heavy burden on the computational power of a conventional general-purpose computer, and run much more efficiently on special-purpose processors that are specifically tuned to address a single intensive computation task only. A typical example among the important classes of demanding computations are vector and matrix operations such as multiplication of vectors and matrices, solving linear equations, matrix inversion, eigenvalue and eigenvector search, etc. Most of the computationally more complex vector and matrix operations can be reformulated in terms of basic matrix-vector and matrix-matrix multiplications. From a neural network perspective, the product of the synaptic matrix by the vector of neuron potentials is another good example. 

An innovative hybrid, analog-digital charge-domain technology, for the massively parallel VLSI implementation of certain large scale matrix-vector operations, has recently been developed, as disclosed in U.S. Pat. No. 5,054,040. It employs arrays of Charge Coupled/Charge Injection Device (CCD/CID) cells holding an analog matrix of charge, which process digital vectors in parallel by means of binary, non-destructive charge transfer operations. FIG. 1 shows a simplified schematic of the CCD/CID array 100. Each cell 110 in the array 100 connects to an input column line 120 and an output row line 130 by means of a column gate 140 and a row gate 150. The gates 140, 150 hold a charge packet 160 in the silicon substrate underneath them that represents an analog matrix element. The matrix charge packets 160 are initially stored under the column gates 140. In the basic matrix-vector multiplication (hereinafter referred to as "MVM") mode of operation, for binary input vectors, the matrix charge packets 160 are transferred from under the column gate 140 toward the row gates 150 only if the input bit of the column indicates a binary "one". The charge transferred under the row gates 150 is summed capacitively on each output row line 130, yielding an analog output vector which is the product of the binary input vector with the analog charge matrix. By virtue of the CCD/CID device physics, the charge sensing at the row output lines 130 is of a non-destructive nature, and each matrix charge packet 160 is restored to its original state simply by pushing the charge back under its column gate 140. 

FIG. 2 is an illustration of the binary-analog MVM computation cycle for a single row of the CCD/CID array. In FIG. 2A, the matrix charge packet 160 sits under the column gate 140. In FIG. 2B, the row line 130 is reset to a reference voltage. In FIG. 2C, if the column line 120 receives a logic one input bit, the charge packet 160 is transferred underneath the row gate 150. In FIG. 2D, the transferred charge packet 160 is sensed capacitively by a change in voltage on the output row line 130. In FIG. 2E the charge packet 160 is returned under the column gate 140 in preparation for the next cycle. A bit-serial digital-analog MVM can be obtained from a sequence of binary-analog MVM operations, by feeding in successive vector input bits sequentially, and adding the corresponding output contributions after scaling them with the appropriate powers of two. A simple parallel array of divide-by-two circuits at the output accomplishes this task. Further extensions of the basic MVM scheme of FIG. 1 support full digital outputs by parallel A/D conversion at the outputs, and four-quadrant operation by differential circuit techniques. 

FIG. 3 illustrates how the matrix charge packets are loaded into the array. In FIG. 3A, appropriate voltages are applied to each gate 170 in each cell 110 of the CCD/CID array 100 so as to configure each cell 110 as a standard 4-phase CCD analog shift register to load all of the cells 110 sequentially. In FIG. 3B, the same gates 170 are used for row and column charge transfer operations as described above with reference to FIG. 2. 

Signal Processing 

The foundation of conventional signal processing algorithms is the use of fast techniques for performing various discrete transformations such as the discrete Fourier transform (DFT), discrete sine transform (DST), discrete cosine transform (DCT), discrete Hartley transform (DHT), and others. Consider the discrete Fourier transform (DFT). The DFT can be represented by a matrix-vector multiplication (MVM) with a computational complexity of O(N^2). However, for both serial and parallel computation on conventional hardware, the Fast Fourier Transform (FFT) is always preferred. 

For serial computations, the FFT achieves a computational complexity of O(N log N). Also, for implementation on parallel and vector computer architectures, the FFT has been considered as the baseline algorithm. In particular, with O(N) processors, a time lower bound of O(log N) can be achieved in computing the FFT. Note, however, that this result is more of theoretical importance than practical importance since, particularly for large N, implementation of the algorithm to achieve the above time lower bound would require an architecture with an excessive number of processors, and, more importantly, a very complex processor interconnection structure. 

With conventional hardware technology, the time lower bound in computing a MVM is O(log N) by using O(N^2) processors. This result is more relevant to theory than to practice, since such an implementation of MVM requires a very complex parallel architecture. 

In contrast, a practical implementation of MVM on the CCD/CID chip can be performed in O(1) steps. This indicates that, for efficient implementation of signal processing 
applications on CCD/CID chips, a new algorithmic framework 
is required, which significantly differs from the conventional 
fast techniques framework. In particular, the DHT 
can be more efficiently implemented than the fast Hartley 
transform (FHT). In fact, while the DHT can be performed in 
O(1) steps with one CCD/CID chip, the implementation of the 
FHT requires O(log N) chips and takes O(log N) steps. 
Accordingly, there is a need for a massively parallel 
charge domain computing device and process which 
employs the DHT to achieve massive parallelism in signal 
processing in order to fully exploit the advantages of the 
CCD/CID architecture. In particular, there is a need for a 
process to perform a Fourier transform in a single MVM 
operation or plural simultaneous MVM operations in parallel 
in a CCD/CID architecture, each operation being performed 
in O(1) steps. There is also a need for a process to 
perform a convolution in a single MVM operation in a 
CCD/CID architecture in O(1) steps. 
SUMMARY OF THE DISCLOSURE 
The present invention is embodied in a CCD/CID architecture 
capable of performing a Fourier transform by simultaneous 
MVM operations in respective plural CCD/CID 
arrays in parallel in O(1) steps. 
In one embodiment, a first CCD/CID array stores charge 
packets representing a first matrix operator based upon 
permutations of a Hartley transform and computes 
the real part of the Fourier transform of an incoming vector. 
A second CCD/CID array stores charge packets representing 
a second matrix operator based upon different permutations 
of a Hartley transform and computes the imaginary part of the 
Fourier transform of the incoming vector. The incoming 
vector is applied to the inputs of the two CCD/CID arrays 
simultaneously, and the real and imaginary parts of the 
Fourier transform are produced simultaneously in the time 
required to perform a single MVM operation in a CCD/CID 
array. 
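
As a rough numerical illustration of this two-array embodiment, the sketch below builds two real matrices and checks that two plain matrix-vector products reproduce the real and imaginary parts of a DFT. It assumes the operators are (P+I)H/2 and (P-I)H/2 as in Eq. (12) of the detailed description, a DHT kernel normalized by 1/N, a DFT kernel exp(-2*pi*i*m*k/N)/N, and a real input vector. The function names are hypothetical, and ordinary floating-point arithmetic stands in for the charge-domain arrays.

```python
import cmath
import math


def cas(theta):
    """Hartley kernel cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht_matrix(n):
    """Normalized DHT matrix H with H[m][k] = cas(2*pi*m*k/n) / n (assumed convention)."""
    return [[cas(2 * math.pi * m * k / n) / n for k in range(n)] for m in range(n)]


def reversal_matrix(n):
    """Permutation P mapping [h0, h1, ..., h(n-1)] to [h0, h(n-1), ..., h1]."""
    return [[1.0 if k == (-m) % n else 0.0 for k in range(n)] for m in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Plain matrix-vector product; stands in for one array's MVM."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def dft_operators(n):
    """Return the two precomputable real matrices (P+I)H/2 and (P-I)H/2."""
    h = dht_matrix(n)
    ph = matmul(reversal_matrix(n), h)
    real_op = [[(ph[i][j] + h[i][j]) / 2 for j in range(n)] for i in range(n)]
    imag_op = [[(ph[i][j] - h[i][j]) / 2 for j in range(n)] for i in range(n)]
    return real_op, imag_op


def reference_dft(g):
    """Direct DFT with the same 1/n normalization, used only as a check."""
    n = len(g)
    return [sum(g[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n)) / n
            for m in range(n)]


g = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 4.0]
real_op, imag_op = dft_operators(len(g))
re_f = matvec(real_op, g)   # role of the first array
im_f = matvec(imag_op, g)   # role of the second array
ref = reference_dft(g)
max_err = max(abs(complex(r, i) - f) for r, i, f in zip(re_f, im_f, ref))
print(max_err < 1e-12)
```

Both operators depend only on N, which is why they could be stored once and reused for every incoming vector.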
In another embodiment, parallel MVM operations compute 
a Fourier transform using CCD/CID arrays of a fraction 
of the size of the transform itself. In this latter embodiment, 
the input signal vector is bifurcated into smaller vectors 
whose sums and differences are transformed by different 
permutations of a fractional-sized Hartley transform to 
produce respective real and imaginary parts of the desired 
Fourier transform. Each permutation of a Hartley transform 
is embedded in a different CCD/CID array or chip and all 
CCD/CID arrays receive their respective fractional portions 
of the input signal vector simultaneously and perform 
respective MVM operations simultaneously. 
In one implementation of this latter embodiment, the 
CCD/CID array size is half the size of the incoming signal 
vector, and the incoming signal vector is bifurcated into two 
equal portions. Two of four different permutations of a 
Hartley transform embedded in four respective CCD/CID 
arrays receive the sum of the bifurcated incoming signal 
portions to produce, respectively, the real and imaginary 
parts of the even terms of the desired Fourier transform. The 
remaining two permutations of the Hartley transform receive 
the difference between the two incoming signal portions to 
produce, respectively, the real and imaginary parts of the odd 
terms of the desired Fourier transform. 
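
The sum/difference split described above can be sketched numerically as follows. This is only an illustration under stated assumptions: the four half-size operators are written here directly in cosine and sine form (with the odd-term phase factors folded into the matrices) rather than as the Hartley permutations the specification refers to, the DFT carries a 1/N normalization, the input is real, and all function names are hypothetical.

```python
import cmath
import math


def matvec(a, v):
    """Plain matrix-vector product; stands in for one array's MVM."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]


def half_size_operators(n):
    """Four real (n/2 x n/2) operators for the even/odd, real/imaginary outputs."""
    half = n // 2

    def build(offset, trig, sign):
        # Row k produces output index 2*k + offset of the n-point transform.
        return [[sign * trig(2 * math.pi * (2 * k + offset) * j / n) / n
                 for j in range(half)] for k in range(half)]

    return {
        ("even", "re"): build(0, math.cos, 1.0),
        ("even", "im"): build(0, math.sin, -1.0),
        ("odd", "re"): build(1, math.cos, 1.0),
        ("odd", "im"): build(1, math.sin, -1.0),
    }


def dft_from_half_size_arrays(g):
    n = len(g)
    half = n // 2
    first, second = g[:half], g[half:]
    inputs = {
        "even": [x + y for x, y in zip(first, second)],  # sum feeds the even-term arrays
        "odd": [x - y for x, y in zip(first, second)],   # difference feeds the odd-term arrays
    }
    ops = half_size_operators(n)
    f = [0j] * n
    for offset, parity in enumerate(("even", "odd")):
        re = matvec(ops[(parity, "re")], inputs[parity])
        im = matvec(ops[(parity, "im")], inputs[parity])
        for k in range(half):
            f[2 * k + offset] = complex(re[k], im[k])
    return f


def reference_dft(g):
    """Direct n-point DFT with 1/n normalization, used only as a check."""
    n = len(g)
    return [sum(g[j] * cmath.exp(-2j * math.pi * m * j / n) for j in range(n)) / n
            for m in range(n)]


g = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 4.0]
f_half = dft_from_half_size_arrays(g)
f_ref = reference_dft(g)
half_err = max(abs(a - b) for a, b in zip(f_half, f_ref))
print(half_err < 1e-12)
```

The four matrix-vector products are independent of one another, so in principle they could all run at the same time on separate arrays.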
The present invention is also embodied in a CCD/CID 
architecture capable of performing a convolution of two 
signal vectors in a single MVM operation in a CCD/CID 
array in O(1) steps. In one implementation, one of the two 
signal vectors is known beforehand and even and odd 
permutations of its Hartley transform are precomputed. A 
convolution matrix operator is constructed by the matrix 
multiplication of a Hartley transform matrix by the sum of 
the even and odd permutations of the Hartley transform of 
the known signal vector, the result being matrix-multiplied 
in turn by a Hartley transform matrix. The resulting matrix 
is embedded in a single CCD/CID array of charge packets. 
The unknown signal vector is input to the CCD/CID array to 
produce a vector equal to the convolution of the two signal 
vectors in a single MVM operation. 
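
A minimal sketch of this construction, under stated assumptions, is given below. It uses the unnormalized kernel cas(2*pi*m*k/N) for both Hartley factors and a single division by N, so that the operator reproduces the circular convolution exactly; with the 1/N convention of Eq. (1) the same product differs only by a constant scale factor. The function names and test vectors are hypothetical, and the convolution is assumed to be circular.

```python
import math


def cas_matrix(n):
    """Unnormalized Hartley kernel K[m][k] = cas(2*pi*m*k/n)."""
    return [[math.cos(2 * math.pi * m * k / n) + math.sin(2 * math.pi * m * k / n)
             for k in range(n)] for m in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, v):
    """Plain matrix-vector product; stands in for the single array MVM."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]


def convolution_operator(known):
    """Precompute C = K Q K / n from the signal that is known beforehand."""
    n = len(known)
    kernel = cas_matrix(n)
    spectrum = matvec(kernel, known)  # Hartley transform of the known signal
    even = [(spectrum[m] + spectrum[(-m) % n]) / 2 for m in range(n)]
    odd = [(spectrum[m] - spectrum[(-m) % n]) / 2 for m in range(n)]
    # Q = diag(even) + diag(odd) * P, where P maps index m to (-m) mod n.
    q = [[0.0] * n for _ in range(n)]
    for m in range(n):
        q[m][m] += even[m]
        q[m][(-m) % n] += odd[m]
    product = matmul(matmul(kernel, q), kernel)
    return [[value / n for value in row] for row in product]


def circular_convolution(x, y):
    """Direct O(n^2) circular convolution, used only as a check."""
    n = len(x)
    return [sum(x[j] * y[(i - j) % n] for j in range(n)) for i in range(n)]


known = [0.5, -1.0, 2.0, 0.0, 1.5, 3.0]
signal = [1.0, 2.0, -0.5, 4.0, 0.0, -3.0]
c_op = convolution_operator(known)      # done once, ahead of time
via_mvm = matvec(c_op, signal)          # the only per-signal operation
direct = circular_convolution(signal, known)
conv_err = max(abs(a - b) for a, b in zip(via_mvm, direct))
print(conv_err < 1e-9)
```

All of the transform work is absorbed into the precomputed matrix, so each new input costs exactly one matrix-vector product.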
The present invention is preferably implemented in a 
CCD/CID MVM processor of the type described in the 
above-referenced parent application which stores each bit of 
each matrix element as a separate CCD charge packet. The 
bits of each input vector are separately multiplied by each bit 
of each matrix element in massive parallelism and the 
resulting products are combined appropriately to synthesize 
the correct product. In one embodiment, the CCD/CID 
MVM array is a single planar chip in which each matrix 
element occupies a single column of b bits, b being the bit 
resolution, there being N rows and N columns of such single 
columns in the array. In another embodiment, the array 
constitutes a stack of b chips, each chip being a bit-plane and 
storing a particular significant bit of all elements of the 
matrix. In this second embodiment, an output chip is 
connected edge-wise to the bit-plane chips and performs the 
appropriate arithmetic combination steps. 
In a preferred embodiment, the MVM processor of the 
invention includes an array of N rows and M columns of 
CCD matrix cell groups corresponding to a matrix of N rows 
and M columns of matrix elements, each of the matrix 
elements representable with b binary bits of precision, each 
of the matrix cell groups including a column of b CCD cells 
storing b CCD charge packets representing the b binary bits 
of the corresponding matrix element, the amount of charge 
in each packet corresponding to one of two predetermined 
amounts of charge. Each of the CCD cells includes a holding 
site and a charge sensing site, each charge packet initially 
residing at the respective holding site. The MVM processor 
further includes a device for sensing, for each row, an analog 
signal corresponding to a total amount of charge residing 
under all charge sensing sites of the CCD cells in the row, and 
an array of c rows and M columns of CCD vector cells 
corresponding to a vector of M elements representable with 
c binary bits of precision, each one of the M columns of 
CCD vector cells storing a plurality of c charge packets 
representing the c binary bits of the corresponding vector 
element, the amount of charge in each packet corresponding 
to one of two predetermined amounts of charge. A multiplying 
device operative for each one of the c rows of the CCD 
vector cells temporarily transfers to the charge sensing site 
the charge packet in each one of the M columns of matrix 
cells for which the charge packet in the corresponding one 
of the M columns and the one row of the CCD vector cells 
has an amount of charge corresponding to a predetermined 
binary value. 
The preferred embodiment further includes an arithmetic 
processor operative in synchronism with the multiplying 
device, including a device for receiving, for each row, the 
sensed signal, whereby to receive N×b signals in each one 
of c operations of the multiplying device, a device for 
converting each of the signals to a corresponding byte of 
output binary bits, and a device for combining the output 
binary bits of all the signals in accordance with 
appropriate powers of two to generate bits representing an 
N-element vector corresponding to the product of the vector 
and the matrix. 
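
The bit-level scheme just described can be mimicked in software as a sanity check. The sketch below is an assumption-laden model, not the device: it treats matrix and vector entries as unsigned integers of b and c bits, models each sensed row signal as a count of transferred unit charge packets, and combines the N×b signals of each of the c operations with powers of two. The function names are hypothetical.

```python
def bit(value, position):
    """Return the binary digit of `value` at `position` (0 = least significant)."""
    return (value >> position) & 1


def bitplane_mvm(matrix, vector, b, c):
    """Multiply an unsigned b-bit matrix by an unsigned c-bit vector bit by bit."""
    n = len(matrix)
    result = [0] * n
    for k in range(c):  # one array operation per vector bit
        input_bits = [bit(u, k) for u in vector]
        for i in range(n):
            for l in range(b):  # b cell rows per matrix row
                # Sensed signal: how many unit packets moved to the sensing sites.
                sensed = sum(bit(matrix[i][j], l) & input_bits[j]
                             for j in range(len(vector)))
                # Weight by the power of two of matrix bit l and vector bit k.
                result[i] += sensed << (k + l)
    return result


matrix = [[3, 5], [7, 1]]
vector = [2, 6]
bitwise = bitplane_mvm(matrix, vector, b=3, c=3)
direct = [sum(a * u for a, u in zip(row, vector)) for row in matrix]
print(bitwise == direct)
```

Signed (four-quadrant) operation is outside this sketch; the background section attributes it to differential circuit techniques.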
In one embodiment, the array of matrix CCD cells is distributed among a plurality of b integrated circuits containing sub-arrays of the M columns and N rows of the matrix CCD cells, each of the sub-arrays corresponding to a bit-plane of matrix cells representing bits of the same power of two for all of the matrix elements. A backplane integrated circuit connected edgewise to all of the b integrated circuits includes a device for associating respective rows of the vector CCD elements with respective rows of the matrix CCD elements, whereby the multiplying device operates on all the rows of the vector CCD elements in parallel. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram of a CCD/CID MVM processor array of the prior art. 

FIGS. 2A, 2B, 2C, 2D and 2E illustrate a sequence of matrix-vector multiply operations in a unit cell of the array of FIG. 1. 

FIGS. 3A and 3B illustrate, respectively, electronic loading of the matrix elements and arithmetic operations in a unit cell of the array of FIG. 1. 

FIG. 4 is a plan view of a preferred CCD/CID MVM processor employed in carrying out the present invention. 

FIG. 5 is a schematic diagram of a typical architecture for a higher precision arithmetic processor employed in combination with the CCD/CID processor of FIG. 4. 

FIG. 6 is a diagram of a three-dimensional embodiment of the CCD/CID MVM processor of FIG. 4. 

FIG. 7 is a block diagram illustrating a discrete Hartley transform process employed in the invention. 

FIG. 8 is a block diagram illustrating an inverse discrete Hartley transform employed in carrying out the invention. 

FIG. 9 is a diagram of a permutation matrix employed in carrying out the invention. 

FIG. 10 is a block diagram of a CCD/CID architecture for carrying out the discrete Fourier transform process of the present invention. 

FIG. 11 is a block diagram of a CCD/CID architecture for carrying out the convolution process of the present invention. 

FIG. 12 is a block diagram of a decimation-in-frequency fast Hartley transform process employed in a process of the invention. 

FIG. 13 is a block diagram of a CCD/CID architecture for carrying out an area-efficient discrete Fourier transform process of the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Preferred CCD MVM Processor Structure 

In order to achieve high precision in a CCD/CID MVM processor, the present invention can employ an advanced architecture described in the above-referenced parent application which is illustrated schematically in FIGS. 4 and 5. In the following, the architecture is described in its basic form. However, different architectures can be derived from this basic form which are not discussed here. A key element of the CCD/CID MVM array 200 of FIGS. 4 and 5 is the encoding in each CCD/CID processor cell 210 of one bit of the binary representation of each matrix element. As shown in FIG. 4, if a matrix A is to be specified with b bits of precision, each element $A_{ij}$ of the matrix occupies a single column 205 of b cells, namely cells corresponding to $[b(i-1)+l,\ j]$, for $l = 1, \ldots, b$, labelled in FIG. 4 as $A^{0}_{ij}, \ldots, A^{b-1}_{ij}$. Thus, the CCD/CID MVM processor array 200 of FIG. 4 is an array of N columns and N rows of matrix elements, each matrix element itself being a column 205 of b CCD/CID cells 210. There are, therefore, a total of N×N×b cells 210 in the array 200. A vector of N elements representable with a precision of b binary bits is stored in an array 230 of CCD/CID cells 235, the array 230 being organized in N columns 240 of b CCD/CID cells, each cell 235 in a given column 240 storing the corresponding one of the b binary bits of the corresponding vector element. Each CCD/CID cell 210 in FIG. 4 is of the type described above with reference to FIGS. 1-3 and is operated in the same manner with N column input lines (coupled to a successive row of N vector CCD/CID cells 235 in the vector CCD/CID array 230) and row output lines of the type illustrated in FIG. 1. 

Matrix-vector multiplication will now be described with reference to the example of a conventional matrix-vector product obtained by adding the products of the elements in the vector and each row of the matrix. Of course, the present invention is not confined to a particular type of matrix-vector product, and can be used to compute other types of matrix-vector products. One example of another well-known type of matrix-vector product is that obtained by adding the products of a respective vector element multiplied by all elements in a respective column of the matrix. 

Computation proceeds as follows. At clock cycle one, the matrix A, in its binary representation, is multiplied by the binary vector labelled $u^{0}_{1}, \ldots, u^{0}_{N}$, which contains the least significant bits of $u_1, \ldots, u_N$ (i.e., by the top row of charge packets in the array 230 of vector CCD/CID cells 235). By virtue of the charge transfer mechanism, analog voltages ${}^{(0)}v$ are sensed at the output of each one of the b×N rows of the matrix array 200. To keep track of the origin of this contribution to the result, a left superscript, as in ${}^{(0)}v$, is utilized in the notation employed herein. 

In the present example, all of the products computed in the array 200 are synthesized together in accordance with corresponding powers of two to form the N elements of the transformed (output) vector. This is accomplished in an arithmetic processor illustrated in FIG. 5. The arithmetic processor is dedicated to the computation of the particular type of matrix-vector product computed in the present example. Processors other than that illustrated in FIG. 5 could be employed in making these same computations from the products produced by the array 200, and the present invention is not confined to the type of arithmetic processor illustrated in FIG. 5. Of course, in order to compute matrix-vector products different from the type computed in the present example, an arithmetic processor different from that illustrated in FIG. 5 would be employed in combination with the array 200 of FIG. 4. 

At clock cycle two, the voltages sensed at each of the N rows are fed into respective pipelined A/D converters 300, each individual converter 300 having $b_d$ bits of precision (where $d = \log_2 N$, and N denotes the number of columns of A). Simultaneously, A is multiplied by $u^{1}_{1}, \ldots, u^{1}_{N}$ (i.e., by the second row of charge packets in the array 230 of vector CCD/CID cells), yielding ${}^{(1)}v$. 

At clock cycle three, the digital representations of ${}^{(0)}\bar{v}^{0}_{1}, \ldots, {}^{(0)}\bar{v}^{b-1}_{N}$ are bit mapped into a V register 305 with an appropriate bit-offset toward the most significant bit position. Specifically, the result or element ${}^{(k)}\bar{v}^{l}_{i}$ obtained during clock cycle k is offset in the appropriate V register by $l+k$ bits toward the most significant bit position (toward the leftmost bit) at clock cycle k. This offset is controlled by an offset counter 310 (which is incremented by one at each new clock cycle). Next, the voltages ${}^{(1)}v$ are fed into the A/D converters 300, and the vector $u^{2}_{1}, \ldots, u^{2}_{N}$ multiplies A to yield ${}^{(2)}v$. Elements ${}^{(k)}\bar{v}^{l}_{i}$ with the same row index i are then fed into cascaded sum circuits 315 shown in FIG. 5, in parallel for all i, and pipelined over k. The cascaded sum circuits 315 are connected in a binary tree architecture of the type well-known in the art. Hence, the components $v_i$ of the product $v = Au$ are obtained after $\log_2 b$ cycles, and the overall latency is $b + \log_2 b + 3$. If one needs to multiply a set of vectors u by the same matrix A, this pipelined architecture will output a new result every b clock cycles. Clearly, a far higher precision has been achieved than was previously available. An added benefit is that the refresh time overhead is significantly reduced, since in the matrix representation of FIG. 4 each electron charge packet only refers to a binary quantity. 
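
The weighting of the partial results can be checked against the binary expansions of the operands. The short derivation below is a worked illustration rather than text from the specification; it assumes unsigned b-bit matrix elements and vector elements, with $A^{l}_{ij}$ and $u^{k}_{j}$ denoting individual bits, and the value b = 8 in the last line is an arbitrary example.

```latex
A_{ij} = \sum_{l=0}^{b-1} 2^{l} A^{l}_{ij}, \qquad
u_{j} = \sum_{k=0}^{b-1} 2^{k} u^{k}_{j}
\quad\Longrightarrow\quad
v_{i} = \sum_{j=1}^{N} A_{ij} u_{j}
      = \sum_{k=0}^{b-1} \sum_{l=0}^{b-1} 2^{\,k+l} \sum_{j=1}^{N} A^{l}_{ij}\, u^{k}_{j}

% Example with b = 8: \log_2 b = 3, so the stated latency is
% b + \log_2 b + 3 = 8 + 3 + 3 = 14 clock cycles,
% with a new result every b = 8 clock cycles once the pipeline is full.
```

Each inner sum over j is one sensed row signal, so the partial result for matrix bit l and vector bit k carries the weight $2^{k+l}$.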
While FIG. 4 indicates that the N×b×N cells 210 of the array 200 are formed on a single planar silicon chip, the array 200 can instead be implemented using many chips connected together, using Z-plane technology, as illustrated in FIG. 6. In such an embodiment, it is preferable to have each chip 400 assigned to a particular bit plane, in which each chip is an N-by-N CCD/CID array of CCD/CID cells 210 of the type illustrated in FIG. 1, which stores N×N bits or charge packets, which in the present invention, however, represent binary values only. The first chip 400-0 stores the least significant bits of all matrix elements of the N×N matrix, the second chip 400-1 stores the next least significant bits of all matrix elements, and so forth, and the last chip 400-(b-1) stores the most significant bits of all matrix elements. A backplane chip 405 implements the array 230 of N columns and b rows of vector CCD cells 235 of FIG. 4. The backplane chip is connected edge-wise to the column input lines of all of the bit-plane chips 400. This permits every one of the b rows of vector CCD/CID cells 235 to be input to column input lines of respective ones of the b bit-plane chips 400, greatly enhancing performance and reducing the latency of a given matrix-vector multiplication operation. The arithmetic processor of FIG. 5 could also be implemented on the backplane chip 405. The architecture of the arithmetic processor would depend upon the type of matrix-vector product to be computed. The Z-plane embodiment of FIG. 6 permits all b bits of every matrix element to be multiplied by a given vector element, and therefore is potentially much faster than the embodiment of FIGS. 4 and 5. 
SIGNAL PROCESSING CCD/CID DEVICES 
Overview 
The remainder of the specification first describes how a 
complex DFT can be obtained from the DHT by using 
CCD/CID chips. Then, a more interesting result is presented, 
regarding the convolution of two signals. It is shown that, by 
using CCD/CID chips, the convolution can be presented as 
a single MVM, and hence can be performed with the highest 
area and time efficiency. A process is presented for area and 
time efficient DFT where the size of transform is much larger 
than the size of the CCD/CID chip. Finally, the specification 
outlines possible architectures that should enable high 
precision computing with charge-domain devices. 
Computing DFT from DHT 
In its preferred embodiment, the signal processing CCD/CID 
architecture of the present invention employs the DHT 
as the main kernel for implementation on the CCD/CID 
chip. The DHT and its inverse are given as 
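$$H_{mn} = \frac{1}{N}\,\mathrm{cas}\!\left(\frac{2\pi mn}{N}\right) \tag{1}$$
and 
$$H^{-1}_{mn} = \mathrm{cas}\!\left(\frac{2\pi mn}{N}\right) \tag{2}$$
The kernels for the DHT and inverse DHT are shown in FIGS. 7 
and 8, respectively. In the above expressions 
$$\mathrm{cas}(\theta) = \cos(\theta) + \sin(\theta) \tag{3}$$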
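
A quick numerical check of the transform pair is sketched below. It assumes that the forward kernel carries the factor 1/N and the inverse kernel carries none, which is the convention suggested by the relation H^{-1} = NH used in Eq. (24); the function names and the test signal are hypothetical.

```python
import math


def cas(theta):
    """Hartley kernel cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht(g):
    """Forward DHT with the assumed 1/N normalization."""
    n = len(g)
    return [sum(g[k] * cas(2 * math.pi * m * k / n) for k in range(n)) / n
            for m in range(n)]


def inverse_dht(h):
    """Inverse DHT: same kernel, no normalization factor."""
    n = len(h)
    return [sum(h[m] * cas(2 * math.pi * m * k / n) for m in range(n))
            for k in range(n)]


g = [1.0, -2.0, 0.5, 3.0, 0.0, 4.0, -1.5]
roundtrip = inverse_dht(dht(g))
roundtrip_err = max(abs(a - b) for a, b in zip(roundtrip, g))
print(roundtrip_err < 1e-12)
```

Because the kernel is real and symmetric, the forward and inverse transforms use the same matrix up to the scale factor N.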
The first issue is the computation of the DFT from the DHT. Let the DFT and DHT of a signal g be given as 
$$f = F[g] \quad\text{and}\quad h = H[g]$$
where F and H are the DFT and DHT operators. In terms of the real and imaginary parts of f it follows that 
$$f = \mathrm{Re}\,f + i\,\mathrm{Im}\,f = E[h] - i\,O[h] \tag{4}$$
where E and O are even and odd operators defined as 
$$E[h] = \frac{h^{+} + h^{-}}{2} \;\rightarrow\; E[h_n] = \frac{h_n + h_{-n}}{2} = \frac{h_n + h_{N-n}}{2} \tag{5}$$
$$O[h] = \frac{h^{+} - h^{-}}{2} \;\rightarrow\; O[h_n] = \frac{h_n - h_{-n}}{2} = \frac{h_n - h_{N-n}}{2} \tag{6}$$
Let $h^{+}$ denote the vector h with normal ordering of the elements, i.e., 
$$h^{+} = [h_0, h_1, \ldots, h_{N-1}] \tag{7}$$
Also, let $h^{-}$ denote the permuted vector h as 
$$h^{-} = [h_0, h_{N-1}, \ldots, h_1] \tag{8}$$
We note that $h^{+}$ and $h^{-}$ can be related through a permutation matrix P as 
$$h^{-} = P h^{+} \tag{9}$$
where P is illustrated in FIG. 9. From Eqs. (5), (6) and (9), the expressions for E[h] and O[h] are obtained as 
$$E[h] = \frac{I + P}{2}\,h^{+} \quad\text{and}\quad O[h] = \frac{I - P}{2}\,h^{+} \tag{10}$$
From Eqs. (4) and (10), f is obtained as 
$$f = \frac{P + I}{2}\,h + i\,\frac{P - I}{2}\,h \tag{11}$$
or 
5,952,685 
10 9 
-continued 
P C l  p - 1  
f = -Hg+i-Hg 
2 2 
The matrices

$$\frac{P + I}{2}\, H \quad \text{and} \quad \frac{P - I}{2}\, H$$

are constant and can be precomputed. Therefore, Eq. (12)
leads to a direct realization of a DFT, as illustrated in FIG.
10. In FIG. 10, the incoming signal g is multiplied simultaneously
in two CCD/CID arrays 500, 510 of the type
illustrated in FIGS. 4 and 5, corresponding, respectively, to
the matrices

$$\frac{P + I}{2}\, H \quad \text{and} \quad \frac{P - I}{2}\, H.$$

The CCD/CID array 500 stores charge packets corresponding
to binary values of

$$\frac{P + I}{2}\, H$$

in respective CCD/CID cells 235 arranged as illustrated in
FIG. 4.
The CCD/CID array 510 stores charge packets corresponding
to binary values of

$$\frac{P - I}{2}\, H$$

in respective CCD/CID cells 235 arranged as illustrated in
FIG. 4. Then, binary values of the vector g are input to each
of the CCD/CID arrays 500, 510 in the manner described
above with reference to FIGS. 4 and 5. The resulting output
of the CCD/CID array 500 is a vector Re[f] which is the real
part of the Fourier transform of g while the resulting output
of the CCD/CID array 510 is a vector Im[f] which is the
imaginary part of the Fourier transform of g. Thus, a
complete Fourier transform is computed in the time required
to perform a single MVM operation.

Fast Area-Efficient Convolution Using DHT

The convolution problem, that is, the computation of a
signal g is defined as:

$$g = {}^1g * {}^2g \qquad (13)$$

where it is assumed that $^1g$ is the input and $^2g$ is known
beforehand, and hence its DHT can be precomputed. Let

$${}^1f = F[{}^1g] \quad \text{and} \quad {}^1h = H[{}^1g] \qquad (14)$$

$${}^2f = F[{}^2g] \quad \text{and} \quad {}^2h = H[{}^2g] \qquad (15)$$

$$f = F[g] \quad \text{and} \quad h = H[g] \qquad (16)$$

where F and H denote the DFT and DHT operators, respectively.
In accordance with Parseval's Theorem,

$$g = F^{-1}[{}^1f \odot {}^2f] \qquad (17)$$

where $F^{-1}$ is the inverse DFT operator and $\odot$ indicates
component-by-component multiplication of two vectors.
The above expression forms the basis of conventional convolution
algorithms. Assuming $^2f$ to be known a priori, a first
FFT is used to compute $^1f$ in O(N log N) steps, then the
product $^1f \odot {}^2f$ is obtained in O(N) steps, and finally an
inverse FFT is used to compute g in O(N log N) steps.

However, Eq. (17) is not suitable for implementation on
CCD/CID chips. This is because it involves more than just
a single MVM operation. Specifically, it requires a first
MVM operation to compute the forward Fourier transform,
followed by a component-by-component multiplication of
two vectors, followed by a second MVM operation to
compute the inverse Fourier transform. Hence, a new formulation
is needed to take into account the fact that the only
operation that can be efficiently performed on a CCD/CID is
a MVM. In terms of DHT, h can be expressed as

$$h = \tfrac{1}{2}\left({}^1h^+ \odot {}^2h^+ - {}^1h^- \odot {}^2h^- + {}^1h^+ \odot {}^2h^- + {}^1h^- \odot {}^2h^+\right) \qquad (18)$$

One can also express h in terms of even and odd operators
as

$$h = E[{}^2h^+] \odot {}^1h^+ + O[{}^2h^+] \odot {}^1h^- \qquad (19)$$

or

$$h = E[{}^2h^+] \odot {}^1h^+ + O[{}^2h^+] \odot P\, {}^1h^+$$

The $\odot$ operation of two vectors can be described by a
diagonal matrix-vector multiplication as

$$v = {}^1v \odot {}^2v = {}^1V\, {}^2v \qquad (20)$$

where $^1V$ is a diagonal matrix whose diagonal elements are
those of the vector $^1v$. Using such a representation, Eq. (18)
can now be expressed as

$$h = \{\mathcal{E}[{}^2h^+] + \mathcal{O}[{}^2h^+]\, P\}\, {}^1h^+ \qquad (21)$$

A matrix Q is now defined in terms of the matrices $\mathcal{E}$ and $\mathcal{O}$:

$$Q = \mathcal{E}[{}^2h^+] + \mathcal{O}[{}^2h^+]\, P \qquad (22)$$

Note that, since it is assumed that $^2g$ is known a priori, the
matrix Q is also known a priori. From Eqs. (19) and (20), it
follows that

$$h = Q H\, {}^1g \qquad (23)$$

and finally
Also, 
$$g = H^{-1} h = N H Q H\, {}^1g \qquad (24)$$
Eq. (24) is a convolution that can be performed in terms of
a simple matrix-vector multiplication, by defining an appropriate
convolution operator. Specifically, we set

$$C = N H Q H \qquad (25)$$

as the convolution operator. As illustrated in FIG. 11, a
CCD/CID array 600 of the type illustrated in FIGS. 4 and 5
stores charge packets corresponding to the binary values of
the matrix C, arranged in individual CCD/CID cells 235
as illustrated in FIG. 4. Binary values of a vector $^1g$ are input
to the CCD/CID array 600 in the manner described with
reference to FIGS. 4 and 5. The CCD/CID array 600
produces a vector g which is the convolution $^1g * {}^2g$ of
Equation 13.
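The single-MVM convolution can be sketched as follows. This is a minimal illustration in plain Python, assuming circular convolution, an unnormalized DHT matrix with entries cas(2πkn/N), and P taken as the permutation sending index k to (N - k) mod N. Under that unnormalized convention the inverse DHT is H/N, so the scalar in front of HQH in the sketch is 1/N; the scalar written in Eq. (25) depends on how H is normalized, which this excerpt does not show. The function names are hypothetical.

```python
import math


def dht_matrix(n):
    """Unnormalized DHT matrix H with H[k][j] = cas(2*pi*k*j/n) (assumed convention)."""
    return [[math.cos(2 * math.pi * k * j / n) + math.sin(2 * math.pi * k * j / n)
             for j in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def convolution_operator(g2):
    """Build C so that matvec(C, g1) is the circular convolution of g1 with g2."""
    n = len(g2)
    h = dht_matrix(n)
    h2 = matvec(h, g2)                         # DHT of the signal known a priori
    h2_rev = [h2[(-k) % n] for k in range(n)]  # P applied to h2
    even = [(a + b) / 2 for a, b in zip(h2, h2_rev)]
    odd = [(a - b) / 2 for a, b in zip(h2, h2_rev)]
    # Q = Diag(even) + Diag(odd) P, with P sending index k to (-k) mod n.
    q = [[0.0] * n for _ in range(n)]
    for k in range(n):
        q[k][k] += even[k]
        q[k][(-k) % n] += odd[k]
    hqh = matmul(matmul(h, q), h)
    # With the unnormalized H used here, the inverse DHT is H / n.
    return [[x / n for x in row] for row in hqh]


def circular_convolution(g1, g2):
    """Reference circular convolution by direct summation, for comparison only."""
    n = len(g1)
    return [sum(g1[j] * g2[(k - j) % n] for j in range(n)) for k in range(n)]
```

Because C depends only on the known signal, all of the matrix products above would be done offline; at run time the only operation left is `matvec(C, g1)`.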
Computing a Large DFT with Small Size CCD/CID Chips
We now consider the case wherein the size of the desired
transform, N, is larger than the size, M, of the CCD/CID
chip. A direct ("brute force") approach would be to build an
N×N DHT matrix by using M×M CCD/CID chips. From the
foregoing discussion, it follows that $2(N/M)^2$ CCD/CID
chips of size M×M would be required for computation of a
DFT of size N, since each of the two N×N matrices of Eq. (12)
must be tiled with $(N/M)^2$ chips.
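The chip count of the direct approach can be illustrated with a short sketch. It assumes that M divides N and that each M×M tile of an N×N matrix occupies one chip, with partial row sums accumulated outside the chips; the helper names `direct_chip_count` and `tiled_matvec` are hypothetical.

```python
def direct_chip_count(n, m):
    """Chips of size m x m needed to tile the two n x n matrices of Eq. (12)."""
    if n % m != 0:
        raise ValueError("this sketch assumes m divides n")
    return 2 * (n // m) ** 2


def tiled_matvec(a, v, m):
    """Multiply an n x n matrix by v one m x m tile at a time.

    Returns the product and the number of tiles (chips) that were used.
    """
    n = len(v)
    out = [0.0] * n
    tiles = 0
    for r0 in range(0, n, m):
        for c0 in range(0, n, m):
            tiles += 1  # one m x m chip per tile
            for r in range(r0, r0 + m):
                # Partial sums from different tiles are accumulated off-chip.
                out[r] += sum(a[r][c] * v[c] for c in range(c0, c0 + m))
    return out, tiles
```

For M = N/2 the count is 2 × 2² = 8 chips, which is the figure the hybrid scheme is compared against.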
The main issues in devising a better approach for performing
a large DFT with small size CCD/CID chips can be
summarized as follows:
1. Preserve the computational efficiency; 
2. Reduce the number of chips; 
3. Reduce the complexity of additional hardware. 
The invention is based on reducing the area-time product by 
using hybrid algorithms based on a combination of Fast 
Hartley Transform (FHT) and DHT, i.e., by using higher 
radix FHT. To see this, consider again the DHT as 
h=Hg 
where 
Also consider the case where M=N/2. The Decimation-in-Frequency
(DIF) of a FHT of size N in terms of DHT of size
N/2 is given as
where 
and 
$$h_e = [h(0), h(2), h(4), \ldots, h(N-2)]^T$$

$$h_o = [h(1), h(3), h(5), \ldots, h(N-1)]^T$$
$$S(N/2) = H(N/2)\, K(N/2) \qquad (28)$$

where K(N/2) is a matrix defined as

$$K(N/2) = \mathrm{Diag}\{\cos(2\pi k/N)\} + \mathrm{Diag}\{\sin(2\pi k/N)\}\, P \qquad (29)$$

in which k is an index that assumes an integer value from 1
to N-1 beginning at the lowest order matrix element to the
highest order matrix element and the symbol Diag connotes
a diagonal matrix. In the above expressions H(N/2) represents
a DHT of size N/2, and P is the permutation matrix as
defined before. Then, Eq. (27) can be written as
or, equivalently 
The process of Equation 31 is illustrated in the block flow
diagram of FIG. 12. In FIG. 12, an adder 610 and subtractor
620 combine the lower and higher order portions of the
incoming signal, gp and g', respectively, to provide a sum
and a difference. The sum from the adder 610 is transformed
by a processor 640 in accordance with the transform H(N/2)
(corresponding to the top vector element of Equation 31)
while the difference from the subtractor 620 is transformed
by processors 630 and 650 in accordance with the transforms
K(N/2) and H(N/2), respectively (corresponding to
the bottom vector element of Equation 31).
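The data flow of FIG. 12 can be sketched as follows: the even-indexed DHT outputs come from H(N/2) applied to the sum of the two halves, and the odd-indexed outputs from H(N/2)K(N/2) applied to their difference. This is a minimal sketch assuming an unnormalized DHT, a diagonal index k running from 0 to N/2 - 1 in K(N/2), and P taken as the permutation sending index k to (N/2 - k) mod N/2; these conventions and the function names are assumptions made for illustration.

```python
import math


def dht(g):
    """Unnormalized DHT: h[k] = sum_j g[j] * cas(2*pi*k*j/n) (assumed convention)."""
    n = len(g)
    return [sum(g[j] * (math.cos(2 * math.pi * k * j / n)
                        + math.sin(2 * math.pi * k * j / n)) for j in range(n))
            for k in range(n)]


def apply_k(d, n):
    """Apply K = Diag{cos(2*pi*k/n)} + Diag{sin(2*pi*k/n)} P to a vector of length n/2."""
    m = n // 2
    return [math.cos(2 * math.pi * k / n) * d[k]
            + math.sin(2 * math.pi * k / n) * d[(-k) % m]
            for k in range(m)]


def dif_fht(g):
    """Split a size-n DHT into two size-n/2 DHTs (decimation in frequency)."""
    n = len(g)
    m = n // 2
    lower, upper = g[:m], g[m:]
    total = [a + b for a, b in zip(lower, upper)]  # adder 610
    diff = [a - b for a, b in zip(lower, upper)]   # subtractor 620
    h_even = dht(total)                            # processor 640: H(N/2)
    h_odd = dht(apply_k(diff, n))                  # processors 630, 650: K(N/2) then H(N/2)
    return h_even, h_odd
```

In a matrix realization the product H(N/2)K(N/2) would be formed once in advance, so the odd branch costs a single MVM just like the even branch.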
Now, substituting Equation 31 into Equation 12 yields a
DFT process which is a time-efficient and area-efficient
realization of a DFT of size N. This process is illustrated in
FIG. 13 and may be expressed as a pair of
equations (Equations 32 and 33 below) defining,
respectively, the real and imaginary parts (Re and Im) of the
even and odd components ($f_e$ and $f_o$) of the desired Fourier
transform:
and 
(33) 
FIG. 13 illustrates a CCD/CID architecture for carrying
out the processes of Equations 32 and 33. In FIG. 13, the
incoming signal vector g of size N is bifurcated into two
half-sized (N/2) portions, namely a portion gp containing the lower
order terms of g and a portion g' containing the higher order
terms of g. A bit-serial adder 710 computes the sum of the
two portions of g while a bit-serial adder 720 computes the
difference between the two portions of g. The sum computed
by the first bit-serial adder 710 is applied to the inputs of two
N/2-size CCD/CID arrays 730, 740 of the type described
above with reference to FIG. 4. The difference computed by
the second bit-serial adder 720 is applied to the inputs of two
N/2-size CCD/CID arrays 750, 760 of the type described
above with reference to FIG. 4.
The CCD/CID array 730 stores charge packets corresponding
to the binary values of the matrix

$$\frac{P + I}{2}\, [H(N/2)]$$

in respective CCD/CID cells 235 as illustrated in
FIG. 4. The CCD/CID array 740 stores charge packets
corresponding to the binary values of the matrix

$$\frac{P - I}{2}\, [H(N/2)]$$

in respective CCD/CID cells 235 as illustrated in
FIG. 4. The CCD/CID array 750 stores charge packets
corresponding to the binary values of the matrix

$$\frac{P + I}{2}\, [H(N/2)\, K(N/2)]$$

in respective CCD/CID cells 235 as illustrated
in FIG. 4. The CCD/CID array 760 stores charge
packets corresponding to the binary values of the matrix

$$\frac{P - I}{2}\, [H(N/2)\, K(N/2)]$$

in respective CCD/CID cells 235 as illustrated
in FIG. 4.
The CCD/CID array 730 produces the real part of the even
terms of the Fourier transform of g. The CCD/CID array 740
produces the imaginary part of the even terms of the Fourier
transform of g. The CCD/CID array 750 produces the real
part of the odd terms of the Fourier transform of g. The
CCD/CID array 760 produces the imaginary part of the
odd terms of the Fourier transform of g. These four products
together constitute the complete Fourier transform of g.
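The four-array arrangement can be sketched end to end in plain Python. The sketch assumes an unnormalized DHT, a DFT with kernel exp(-2πikn/N), and a diagonal index running from 0 to N/2 - 1 in K(N/2). One further assumption deserves emphasis: for the even-indexed outputs the reflection used is m to (N/2 - m) mod N/2, while for the odd-indexed outputs it is m to N/2 - 1 - m, which is what f(N - k) requires when k is odd. How these two reflections map onto the permutation P drawn in FIG. 9 is not recoverable from this excerpt. The names `a730` to `a760` merely echo the array reference numerals.

```python
import math


def dht_matrix(n):
    """Unnormalized DHT matrix with entries cas(2*pi*k*j/n) (assumed convention)."""
    return [[math.cos(2 * math.pi * k * j / n) + math.sin(2 * math.pi * k * j / n)
             for j in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def perm_matrix(index_map, m):
    """Permutation matrix R with (R x)[k] = x[index_map(k)]."""
    return [[1.0 if j == index_map(k) else 0.0 for j in range(m)] for k in range(m)]


def half_combine(perm, mat, sign):
    """Return (perm @ mat + sign * mat) / 2."""
    pm = matmul(perm, mat)
    return [[(x + sign * y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(pm, mat)]


def build_arrays(n):
    """Precompute the four (n/2) x (n/2) real operators held by arrays 730 to 760."""
    m = n // 2
    h = dht_matrix(m)
    neg = perm_matrix(lambda k: (-k) % m, m)    # reflection for even-indexed outputs
    rev = perm_matrix(lambda k: m - 1 - k, m)   # reflection for odd-indexed outputs
    # K = Diag{cos(2*pi*r/n)} + Diag{sin(2*pi*r/n)} P, with P = neg.
    k_mat = [[math.cos(2 * math.pi * r / n) * (1.0 if c == r else 0.0)
              + math.sin(2 * math.pi * r / n) * neg[r][c]
              for c in range(m)] for r in range(m)]
    hk = matmul(h, k_mat)
    return (half_combine(neg, h, +1), half_combine(neg, h, -1),
            half_combine(rev, hk, +1), half_combine(rev, hk, -1))


def large_dft(g):
    """DFT of size n from four MVMs of size n/2 plus one add and one subtract."""
    n = len(g)
    m = n // 2
    total = [a + b for a, b in zip(g[:m], g[m:])]  # bit-serial adder 710
    diff = [a - b for a, b in zip(g[:m], g[m:])]   # bit-serial adder 720
    a730, a740, a750, a760 = build_arrays(n)
    re_even, im_even = matvec(a730, total), matvec(a740, total)
    re_odd, im_odd = matvec(a750, diff), matvec(a760, diff)
    f = [0j] * n
    for i in range(m):
        f[2 * i] = complex(re_even[i], im_even[i])
        f[2 * i + 1] = complex(re_odd[i], im_odd[i])
    return f
```

The four operators depend only on N, so they would be loaded once; each incoming vector then costs one addition, one subtraction and four half-size MVMs that can run in parallel.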
In summary, note that, compared to the direct approach,
the number of required CCD/CID chips has been reduced
from 8 to 4 at the cost of only two additional simple bit-serial
adders. Also, since the CCD/CID chip has a bit-serial data
input format, performing bit-serial addition on the data will
increase the computation time by only a cycle. Generalization
to different M-to-N ratios is straightforward.
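The one-cycle overhead of bit-serial addition can be illustrated with a small sketch. It assumes unsigned operands presented least significant bit first, one bit pair per clock cycle; the subtraction performed by adder 720 would additionally need a signed representation, which is not modeled here. The function names are hypothetical.

```python
def to_bits(value, width):
    """Unsigned integer to a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """List of bits (least significant first) back to an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equal-width bit streams, emitting one output bit per cycle."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        out.append(total & 1)
        carry = total >> 1
    out.append(carry)  # the single extra cycle flushes the final carry
    return out
```

A c-bit input stream therefore yields a (c + 1)-bit sum stream, so feeding the sum to a bit-serial array lengthens the input by one cycle.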
While the massively parallel processes of the invention
have been described with reference to a preferred implementation
employing the high-precision CCD/CID arrays of
FIGS. 4 and 5, these processes may also be implemented
using the prior types of CCD/CID arrays illustrated
in FIGS. 1-3.
While the invention has been described in detail by 
specific reference to preferred embodiments, it is understood 
that variations and modifications thereof may be made 
without departing from the true spirit and scope of the 
invention. 
What is claimed is: 
1. A charge domain computing device for performing a 
Fourier transform by simultaneous matrix-vector multipli- 
cation operations in respective plural charge coupled device/ 
charge injection device arrays comprising: 
a plurality of charge coupled device/charge injection
device arrays storing charge coupled device charge
packets in respective charge coupled device cells
arrayed in rows and column, said charge packets having
amounts of charge corresponding to the values of 
corresponding elements of respective matrix operators; 
a Hartley transform processor for each of said arrays 
having a signal processing operation in a matrix vector 
multiplication operation for producing real and imagi- 
nary parts of said Fourier transform using a Hartley 
transform; and 
a device for simultaneously applying an incoming vector
to said plurality of charge coupled device/charge injection
device arrays;
wherein each matrix-vector multiplication operation in
respective plural charge coupled device/charge injection
device arrays simultaneously produces said real
and imaginary parts of said Fourier transform.
2. The device of claim 1 wherein each of said charge
coupled device/charge injection device arrays comprises:
an array of N rows and M columns of charge coupled
device matrix cell groups corresponding to a matrix of 
N rows and M columns of matrix elements, each of said 
matrix elements representable with b binary bits of 
precision, each of said matrix cell groups comprising a 
column of b charge coupled device cells storing b 
charge coupled device charge packets representing the 
b binary bits of the corresponding matrix element, the 
amount of charge in each packet corresponding to one 
of two predetermined amounts of charge; 
each of said charge coupled device cells comprising a 
holding site and a charge sensing site, each charge 
packet initially residing at the respective holding site; 
a device for sensing, for each row, an analog signal 
corresponding to a total amount of charge residing 
under all charge sensing sites of the charge coupled 
device cells in the row; 
an array of c rows and M columns of charge coupled 
device vector cells corresponding to a vector of M 
elements representable with c binary bits of precision, 
each one of said M columns of charge coupled device 
vector cells storing a plurality of c charge packets 
representing the c binary bits of the corresponding 
vector element, the amount of charge in each packet 
corresponding to one of two predetermined amounts of 
charge; and 
a multiplying device operative for each one of said c rows
of said charge coupled device vector cells for temporarily
transferring to said charge sensing site the charge
packet in each one of said M columns of matrix cells for
which the charge packet in the corresponding one of
said M columns and said one row of said charge
coupled device vector cells has an amount of charge
corresponding to a predetermined binary value.
3. The device of claim 2 further comprising arithmetic
means operative in synchronism with said multiplying
device, comprising:
means for receiving, for each row, the signal sensed by
said sensing device, whereby to receive N×b signals in
each one of c operations of said multiplying means;
means for converting each of said signals to a corresponding
byte of output binary bits; and
means for combining the output binary bits of all of said
signals in accordance with appropriate powers of two to
generate bits representing an N-element vector corresponding
to the product of said vector and said matrix.
4. The device of claim 1 wherein said array of matrix
charge coupled device cells is distributed among a plurality
of b integrated circuits containing sub-arrays of said M 
columns and N rows of said matrix charge coupled device 
cells, each of said sub-arrays corresponding to a bit-plane of 
matrix cells representing bits of the same power of two for 
all of said matrix elements. 
5. The device as set forth in claim 1, wherein the Hartley 
transform is a combination of fast Hartley transform and a 
discrete Hartley transform. 
6. A charge domain computing device for performing a 
Fourier transform by simultaneous matrix vector multipli- 
cation operations in respective plural charge coupled device/ 
charge injection device arrays comprising: 
a first charge coupled device/charge injection device array
storing charge packets; 
a first Hartley transform processor for computing a first 
matrix operator represented by said charge packets 
derived from a first permutation of a Hartley transform 
and wherein said first array having a first matrix vector 
multiplication operation which produces the real part of 
the Fourier transform of an incoming vector; 
a second charge coupled device/charge injection device
array storing charge packets; 
a second Hartley transform processor for computing a 
second matrix operator represented by charge packets 
derived from a second permutation of said Hartley 
transform and having a second matrix vector multipli- 
cation operation which produces the imaginary part of 
the Fourier transform of said incoming vector; and 
a device for simultaneously applying said incoming vec- 
tor to said first and second charge coupled device/ 
charge injection device arrays; 
wherein said first and second matrix-vector multiplication 
operations in respective plural charge coupled device/ 
charge injection device arrays simultaneously produce 
said respective real and said imaginary parts of said 
Fourier transform. 
7. The device of claim 6 wherein said first permutation 
corresponds to a sum of permutated and unpermutated 
version of said Hartley transform and said second permu- 
tation corresponds to a difference between said permutated 
and unpermutated versions of said Hartley transform. 
8. The device of claim 6 wherein said Hartley transforms
and said charge coupled device/charge injection device
arrays are each of a size m which is a fraction of the size N
of the Fourier transform and wherein said incoming vector
is divisible into smaller vectors combinable into respective
sums and differences and wherein said device for applying
applies respective ones of said sums and differences to
respective ones of said charge coupled device/charge injection
device arrays.
9. The device of claim 8 wherein said incoming vector is
divisible into two equal portions, each one of said first and
second charge coupled device/charge injection device arrays
comprising a pair of individual charge coupled device/
charge injection device arrays of size M, four different
permutations of said Hartley transform being embedded in
the respective M-size charge coupled device/charge injection
device arrays.
10. The device of claim 6 wherein each of said charge
coupled device/charge injection device arrays comprises:
an array of N rows and M columns of charge coupled 
device matrix cell groups corresponding to a matrix of 
N rows and M columns of matrix elements, each of said 
matrix elements representable with b binary bits of 
precision, each of said matrix cell groups comprising a 
column of b charge coupled device cells storing b 
charge coupled device charge packets representing the 
b binary bits of the corresponding matrix element, the 
amount of charge in each packet corresponding to one 
of two predetermined amounts of charge; 
each of said charge coupled device cells comprising a 
holding site and a charge sensing site, each charge 
packet initially residing at the respective holding site; 
a device for sensing, for each row, an analog signal 
corresponding to a total amount of charge residing 
under all charge sensing sites of the charge coupled 
device cells in the row; 
an array of c rows and M columns charge coupled device 
vector cells corresponding to a vector of M elements 
representable with c binary bits of precision, each one 
of M columns of charge coupled device vector cells 
storing a plurality of c charge packets representing c 
binary bits of the corresponding vector element, the 
amount of charge in each packet corresponding to one 
of two predetermined amounts of charge; and 
a multiplying device operative on each one of said c rows 
of said charge coupled device vector cells for tempo- 
rarily transferring to said charge sensing site the charge 
packet in each one of said M columns of matrix cells 
for which the charge packet in the corresponding one of 
said M columns and said one row of said charge 
coupled device vector cells has an amount of charge 
corresponding to a predetermined binary value. 
11. The device of claim 10 further comprising arithmetic 
means operative in synchronism with said multiplying 
device, comprising: 
means for receiving, for each row, the signal sensed by said
sensing device, whereby to receive Nxb signals in each 
one of c operations of said multiplying means; 
means for converting each of said signals to a correspond- 
ing byte of output binary bits; and 
means for combining the output binary bits of all of said 
signals in accordance with appropriate powers of two to 
generate bits representing an N-element vector corre- 
sponding to the product of said vector and said matrix. 
12. The device as set forth in claim 6, wherein the Hartley 
transform is a combination of a fast Hartley transform and 
a discrete Hartley transform. 
* * * * *  
