Demonstration and architectural analysis of complementary metal-oxide
semiconductor/multiple-quantum-well smart-pixel array cellular logic
processors for single-instruction multiple-data parallel-pipeline processing
Jen-Ming Wu, Charles B. Kuznia, Bogdan Hoanca, Chih-Hao Chen, and
Alexander A. Sawchuk
We present an optoelectronic-VLSI system that integrates complementary metal-oxide
semiconductor/multiple-quantum-well smart pixels for high-throughput computation and
signal processing. The system uses 5 × 10 cellular smart-pixel arrays with intrachip
electrical mesh interconnections and interchip optical point-to-point interconnections.
Each smart pixel is a fine-grain microprocessor that executes binary image algebra
instructions. There is one dual-rail optical modulator output and one dual-rail optical
detector input in each pixel. These optical input–output arrays provide chip-to-chip
optical interconnects. Cascading these smart-pixel array chips permits direct transfer of
two-dimensional data or images in parallel. We present laboratory demonstrations of the
system for digital image edge detection and digital video motion estimation. We also
analyze the performance of the system compared with that of conventional
single-instruction–multiple-data processors. © 1999 Optical Society of America
OCIS codes: 200.2610, 200.4650, 200.4690, 100.2000.
1. Introduction
As the digital age evolves, various media such as images, videos, audio, and data are
digitized for storage, processing, and transmission. The digitized information creates
the need for processing huge amounts of data in real time. User applications are moving
to sophisticated features such as multimedia, video conferencing, three-dimensional (3D)
graphics rendering, and high-resolution images, resulting in systems that require high
data bandwidths [1]. These applications require the system to transfer a large amount of
data rapidly to perform signal processing at high speed. Significant advancements in
complementary metal-oxide semiconductor (CMOS) technology have made extremely fast
microprocessors possible. By the year 2001 the integration density of CMOS logic is
expected to be more than 40 × 10^6 transistors per chip, and the projected frequency is
expected to be 1.4 GHz [2]. Performance limits in many systems today are due not to
processor clock speed but rather to input/output (I/O) bottlenecks and system
architectures. The signal I/O's or interconnections exist between processors and input
devices, between processors for multiprocessor systems, and between processors and
storage devices.

When this research was performed, the authors were with the Signal and Image Processing
Institute, Department of Electrical Engineering, University of Southern California, Los
Angeles, California 90089-2564. J.-M. Wu is now with Sun Microsystems, Inc., Palo Alto,
California 94303. A. A. Sawchuk's e-mail address is sawchuk@sipi.usc.edu.
Received 1 April 1998; revised manuscript received 21 September 1998.
0003-6935/99/112270-12$15.00/0
© 1999 Optical Society of America
2270 APPLIED OPTICS / Vol. 38, No. 11 / 10 April 1999

With recent progress in smart-pixel technologies and development in bump-bonding
techniques, it has become possible to attach large numbers of optical I/O devices to
foundry-grade CMOS VLSI's [3]. With this method small multiple-quantum-well (MQW)
modulators and detectors are attached to CMOS chips by flip-chip bonding with subsequent
substrate removal. This technology has made possible many optical I/O's normal to the
surface of a VLSI chip [4,5]. It thus creates large two-dimensional (2D) information
transfer capabilities between VLSI chips and potentially alleviates the integrated
circuit I/O communication bottleneck.
In this paper we present an n-stage smart-pixel array cellular logic (nSPARCL) processor
system that attempts to overcome the I/O bottleneck by using free-space digital optical
interconnects. The SPARCL chip is a single-instruction–multiple-data (SIMD) processor
element (PE) array in which all PE's are identical and execute the same instruction set
on multiple data elements in lock step. They efficiently execute so-called data-level
parallel applications, which are programs in which the same algorithm or instruction
sequence is applied to a large data set. Matrix-vector multiplication as well as image
convolution and filtering operations are some examples of data parallel operations. In
the SPARCL chip each PE is implemented with one smart pixel. We designed this
optoelectronic chip and had it fabricated through the CO-OP program at George Mason
University sponsored by the Defense Advanced Research Projects Agency [6]. The 0.8-μm
CMOS circuitry was fabricated by a Hewlett-Packard process through the Metal-Oxide
Semiconductor Implementation Service (MOSIS). Then AlGaAs/GaAs MQW p–i–n structures were
bonded onto it by Bell Laboratories/Lucent Technologies with its optoelectronic VLSI
process [6]. The 1.95 mm × 1.95 mm area contains 200 MQW diodes that can operate as
either optical detectors or modulators. The SPARCL chip contains a 5 × 10 array of smart
pixels, each with an area of 125 μm × 250 μm. Each smart pixel contains 182 transistors
to execute binary image algebra (BIA) operations with a 3-bit local memory. Each pixel
also detects or transmits one optical data bit on each clock cycle. The smart-pixel array
operates as a mesh-connected SIMD processor. Operation of the SPARCL chip was simulated
at more than 100 MHz and has been tested at 90 MHz.
We constructed a demonstration system that interconnects as many as three SPARCL chips in
a 2D pipeline processing array. Data flow unidirectionally through the SPARCL pipeline on
a 5 × 10 array of digital optical free-space channels. The system is packaged on a
10 in. × 14 in. (25.4 cm × 35.56 cm) slotted base plate that houses
polarization-sensitive and diffractive optical components. A host computer sends
instructions to the SPARCL chips to perform data processing routines. We have
successfully verified operation of the SPARCL prototype system. Specifically, we tested
several image and data processing routines, such as parallel numerical processing
(10-bit-wide addition, subtraction, and multiplication), image edge detection, noise
filtering, and digital video motion estimation.
We describe the nSPARCL system architecture in Section 2 and the SPARCL chip architecture
in Section 3. In Section 4 we present the experimental test results of the SPARCL system
and demonstrate some applications of the system for digital image processing and digital
video motion estimation. Finally, in Section 5 we analyze the system architecture and
show the advantage of the SPARCL system compared with conventional SIMD systems.
2. System Architecture
The system integrates several SPARCL chips in parallel pipelines, using free-space
digital optics technology, as shown in Fig. 1. Each chip has a 5 × 10 array of PE's that
are electrically mesh connected. The processing elements are optically interconnected
point to point between the chip planes, thus creating a 3D massively parallel processing
system for data processing and communication. The prototype system uses a host computer
as a controller to send instructions as well as data blocks to SPARCL chips. The input
datum, e.g., an image, is usually much larger than the 5 × 10 SPARCL array size and
therefore is partitioned into 5 × 10 blocks. These blocks are pipelined into the system
from the electrical data input pads of the first-stage chip. Each chip in this multistage
system can be programmed to carry out a different set of instructions. A processing
routine usually contains a sequence of instructions written as a BIA sequence [7]. The
host computer analyzes the instructions and shares the computation load among the SPARCL
stages to optimize the computation efficiency. The processed blocks leave the system from
the last stage of the SPARCL system. The host computer then collects the processed blocks
and assembles the result. Thus the SPARCL system loads, processes, and unloads the data
blocks in a pipelined fashion.
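The block partitioning and reassembly performed by the host can be sketched as follows. This is a minimal illustration, not the authors' host software: the function names `partition` and `assemble`, the reading of 5 × 10 as 5 rows by 10 columns, and the zero padding of partial border blocks are all assumptions.

```python
# Sketch only: split a binary image into 5 x 10 blocks and put them back.
# Zero padding at the image border is an assumption; the paper does not say
# how partial blocks are handled.

def partition(image, rows=5, cols=10):
    """Yield ((block_row, block_col), block) tiles of size rows x cols."""
    height, width = len(image), len(image[0])
    for top in range(0, height, rows):
        for left in range(0, width, cols):
            block = [
                [image[r][c] if r < height and c < width else 0
                 for c in range(left, left + cols)]
                for r in range(top, top + rows)
            ]
            yield (top // rows, left // cols), block


def assemble(blocks, height, width, rows=5, cols=10):
    """Inverse of partition: rebuild a height x width image from its tiles."""
    image = [[0] * width for _ in range(height)]
    for (block_row, block_col), block in blocks:
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                r, c = block_row * rows + i, block_col * cols + j
                if r < height and c < width:   # drop the padding again
                    image[r][c] = value
    return image


image = [[(r * 7 + c) % 2 for c in range(23)] for r in range(12)]
blocks = list(partition(image))
print(len(blocks))                          # 9 tiles (3 block rows x 3 block columns)
print(assemble(blocks, 12, 23) == image)    # True
```

In a pipelined system each tile would be processed between the two calls; here the tiles are passed through unchanged to show that the round trip preserves the image.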
Fig. 1. SPARCL system prototype.

In general, free-space digital optics technologies offer a promising solution for I/O
bottlenecks in SIMD systems. Figure 2 shows a comparison of the conventional SIMD
architecture and two types of SPARCL system, a one-dimensional (1D) parallel data access
nSPARCL and a 2D parallel data access nSPARCL, where the prefix n represents the number
of cascaded SPARCL stages. The 1D nSPARCL system reads and writes data with the same bus
bandwidth as a conventional SIMD machine. The 2D nSPARCL system permits reading from
input devices and writing to output devices optically in 2D parallel and hence with much
larger I/O bandwidth. All three systems assume the same total number of processing
elements. In later sections of this paper we analyze the system performance and make
comparisons among these three systems.
3. Binary Image Algebra and SPARCL Chip
Architecture
The SPARCL chip is designed to execute binary im-
age algebra. Each SPARCL pixel is a 1-bit processor
for binary image processing. In this section we
briefly describe the BIA and show how to implement
the BIA into the chip architecture.
A. Binary Image Algebra
BIA, derived from mathematical morphology, is a systematic mathematical tool for general
morphological image processing and data manipulation [7,8]. It defines three fundamental
operations:
• Complement, $\bar{X}$:
$\bar{X} = \{(x, y) \mid (x, y) \in W,\ (x, y) \notin X\} = W - X$; (1)
• Union, $\cup$:
$X \cup Y = \{(x, y) \mid (x, y) \in X \text{ or } (x, y) \in Y\}$; (2)
• Dilation, $\oplus$:
$X \oplus R = \{(p, q) \mid R_{p,q} \cap X \neq \emptyset\}$, (3)
in which X and Y denote the raw data sets, W denotes the universal set in which all
pixels have the value 1, and $R_{p,q}$ denotes the translation of the structuring element
R such that its origin is located at (p, q).
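The three definitions translate almost literally into operations on sets of pixel coordinates. The sketch below is illustrative only: the function names, the use of (row, column) pairs for (x, y), and the clipping of the dilation result to the universe W are assumptions, and the two-pixel structuring element is a made-up example.

```python
from itertools import product


def universe(height, width):
    """W: every pixel position of a height x width image."""
    return frozenset(product(range(height), range(width)))


def complement(X, W):
    """Eq. (1): the pixels of W that are not in X."""
    return W - X


def union(X, Y):
    """Eq. (2): the pixels that are in X or in Y."""
    return X | Y


def dilate(X, R, W):
    """Eq. (3): positions (p, q) at which R, translated to (p, q), meets X."""
    return frozenset(
        (p, q) for (p, q) in W
        if any((p + i, q + j) in X for (i, j) in R)
    )


W = universe(3, 3)
X = frozenset({(1, 1)})
R = frozenset({(0, 0), (0, 1)})      # hypothetical two-pixel structuring element
print(sorted(dilate(X, R, W)))       # [(1, 0), (1, 1)]
```

Because the membership test follows Eq. (3) directly, a structuring element that is not symmetric about its origin spreads X in the direction opposite to its offsets, as the printed result shows.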
Fig. 2. Architectural comparison of (a) a conventional SIMD machine, (b) a 1D nSPARCL
with electrically loaded input and output, and (c) a 2D nSPARCL with optically loaded
input and output. All systems are assumed to have the same total number of processing
elements.

It has been proved that any binary morphological image processing routine can be
decomposed into these three fundamental BIA operations [7]. By the combination and
repetition of these three operations, any arithmetic or symbolic function of a binary
data array can be synthesized. For more information about BIA, please refer to Refs. 7
and 8.
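As a small illustration of such a decomposition, intersection and set difference can be built from complement and union alone by De Morgan's law. The helper names and the toy one-dimensional universe below are assumptions chosen for brevity, not part of the BIA definition.

```python
def complement(X, W):
    """Eq. (1): W - X."""
    return W - X


def union(X, Y):
    """Eq. (2)."""
    return X | Y


def intersection(X, Y, W):
    """X ∩ Y as the complement of (complement(X) ∪ complement(Y))."""
    return complement(union(complement(X, W), complement(Y, W)), W)


def difference(X, Y, W):
    """X with Y removed, as the complement of (complement(X) ∪ Y)."""
    return complement(union(complement(X, W), Y), W)


W = frozenset(range(8))              # toy one-dimensional universe
X = frozenset({1, 2, 3, 4})
Y = frozenset({3, 4, 5})
print(sorted(intersection(X, Y, W)))   # [3, 4]
print(sorted(difference(X, Y, W)))     # [1, 2]
```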
B. SPARCL Chip Architecture
To implement the electronics of the cellular image processor, a VLSI architecture has
been developed that maps the three fundamental BIA operations into each smart pixel.
Figure 3 is a block diagram of the BIA smart pixel. Each pixel contains a 3-bit local
memory (M1–M3), a union section, and a dilation section. At the input port, a multiplexer
(MUX) selects the input either from the optical receiver or from the electrical feedback,
permitting recursive operations. The input data bit is then routed into one of the three
available local memories under control of a 3-bit memory-select command. Each memory is
made of a flip-flop register that outputs both the value of the data and the complement
value of the data. A 6-bit union command chooses outputs from the memory modules and
performs a union operation among selected values. The result of the union operation is
sent to the dilation section and then distributed to north, west, south, and east
neighbor smart pixels as a local interconnection for dilation. The dilation section takes
a reference image pattern from the control unit and performs dilation with values from
the local neighbor pixels.
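The union section can be pictured with a toy behavioral model of one pixel. This is a sketch under stated assumptions, not the chip's logic: the function name `union_section` is hypothetical, and the ordering of the six select bits (Q1, its complement, Q2, its complement, Q3, its complement) is assumed for illustration because the text does not give the bit assignment.

```python
def union_section(memory, command):
    """OR together the memory outputs selected by a 6-bit union command.

    memory  -- the three stored bits (M1, M2, M3), each 0 or 1
    command -- six select bits, one per true or complemented memory output
               (the bit order used here is an assumption)
    """
    outputs = []
    for bit in memory:
        outputs.extend([bit, 1 - bit])   # each register exposes Q and its complement
    return int(any(out and sel for out, sel in zip(outputs, command)))


# M1 = 1, M2 = 0, M3 = 1
print(union_section((1, 0, 1), (0, 1, 0, 0, 0, 1)))   # 0: complements of M1 and M3 are both 0
print(union_section((1, 0, 1), (0, 0, 0, 1, 0, 0)))   # 1: complement of M2 is 1
```

Selecting a complemented output costs nothing extra in this model, which mirrors the statement that each flip-flop register provides both the data value and its complement.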
Optical signals transmitted from or received by each pixel are encoded as two separate
channels (dual-rail encoding). The power ratio of the two spatial channels determines the
0 and 1 logic levels. A schematic of a single smart pixel with one dual-rail optical
receiver and one dual-rail optical transmitter is shown in Fig. 4. The receiver contains
GaAs MQW self-electro-optical effect device detectors and a CMOS transimpedance receiver.
Similarly, the transmitter contains a CMOS modulator driver and MQW modulators. The GaAs
and CMOS chips are fabricated separately and flip-chip bonded, and then the MQW GaAs
substrate is removed. The receiver and modulator driver circuitry are standard cells
designed by Bell Labs/Lucent Technologies [6].
The silicon chip is fabricated by 0.8-μm HP CMOS26G technology at the MOSIS foundry.
Figure 5 shows the physical layout of the SPARCL chip design. Each chip contains a total
of 12,863 transistors. The chip contains an array of 5 × 10 pixels within a
1.95 mm × 1.95 mm die. Each SPARCL pixel is a relatively simple processing element that
uses only 182 transistors, and it can implement a large number of operations. This chip
is adequate for prototyping and testing purposes. A companion GaAs chip containing a
10 × 20 array of MQW diodes (which operate as either detectors or modulators) was
fabricated at the Bell Laboratories/Lucent Technologies MQW foundry and flip-chip bonded
to the CMOS chip. The operating wavelength of the MQW's is 850 nm. Table 1 summarizes the
specifications of the prototype SPARCL chip. With smaller CMOS feature sizes and larger
chips, many more pixels per array can be fabricated.

Fig. 3. Block diagram of a BIA smart-pixel design: Q and $\bar{Q}$, a memory output and
its complement, respectively.
Fig. 4. Optical I/O of a smart pixel.
Fig. 5. Physical layout of the 5 × 10 SPARCL chip.

Table 1. Prototype Chip Parameters
Parameter                            Description
Application                          Digital cellular logic parallel processing
Algorithm                            BIA
Architecture                         Intrachip mesh connected SIMD processor array
CMOS process                         0.8-μm HP CMOS26G process through MOSIS foundry
GaAs process (optical I/O devices)   Bell Labs/Lucent Technologies MQW foundry
Flip-chip bonding                    Bell Labs/Lucent Technologies Service
Total number of transistors          12,863
Die size                             1.95 mm × 1.95 mm
Array size                           5 × 10 cells
Number of pads                       40
Throughput rate                      20 Mbytes/s per cell × 50 cells = 1 Gbyte/s
Number of optical I/O's              50 dual-rail inputs, 50 dual-rail outputs
Operation wavelength                 850 nm

4. Experimental Results and Demonstration
We constructed a testbed for optoelectronic testing and demonstration purposes. The
system is packaged upon a 10 in. × 14 in. slotted base plate housing
polarization-sensitive and diffractive optical elements. The demonstrator that we
designed is able to house three SPARCL chips; at present, two chips have been built. A
host computer sends instructions to the SPARCL chip to perform data processing routines.
A 4-kbyte first-in–first-out buffer is used to interface a slower (100-kbyte/s) data
acquisition board on the host computer to the SPARCL chip's input data rate of
20 Mbytes/s. Therefore, with five parallel electrical data input pads and a 5 × 10 array
of optical I/O's, each SPARCL chip achieved a 100-Mbyte/s electrical I/O data rate or a
1-Gbyte/s optical I/O data rate.
The MQW modulator contrast ratio had a measured average of 1.96 and 2.13 for logic levels
0 and 1, respectively. The optical switching power is the minimum difference in optical
power between the dual-rail detectors that switches the logic states. The optical
switching power of the detector MQW is approximately 1.5 mW per diode at 20 MHz. The chip
consumed approximately 400 mW of static power dissipation at 5-V operation voltage
because of the transimpedance receivers [5]. Dynamic power dissipation was measured at
approximately 100 mW at 20-MHz operation. The total chip power dissipation was measured
to be approximately 500 mW. References 9 and 10 contain many additional details about the
chip design and its optoelectronic characterization.
SPARCL is a programmable cellular logic processor for general morphological operations.
It has a wide range of applications, including
• Mathematical morphological processing: basic operations (e.g., dilation, erosion,
closing, opening, thinning, skeleton), image feature extraction (e.g., edge detection,
shape, size and location verification), image enhancement (e.g., salt and pepper noise
removal), parallel pattern recognition (e.g., hit–miss transform, template matching),
• Parallel numerical computation (addition, subtraction, multiplication),
• Combinatorial logic functions, and
• Serial-to-parallel or parallel-to-serial data format conversion and buffering.
Classic linear operators have been powerful in var-
ious numerical analysis and signal processing applica-
tions. However, when they are applied to image
analysis they do not directly address the fundamental
issues of how to quantify image shape or geometrical
structures. In contrast, mathematical morphology,
which is a set-theoretical methodology for image anal-
ysis, can rigorously quantify many aspects of the geo-
metrical structure in a way that agrees with human
intuition and perception. Morphological image analysis is done by operating
on images with some structuring elements [11]. Different structural information
is extracted by interaction with selected structuring
elements and different combinations of operators.
Here we demonstrate two examples of image analysis
that use SPARCL instructions for image edge detec-
tion and digital video motion estimation.
A. Image Edge Detection
Figure 6 shows an example of the application of the SPARCL system for image edge
detection. Although the current version of the SPARCL chip utilizes binary image algebra,
it is also possible to process a gray-level image as a set of binary images by use of
top-surface and umbra encoding [10]. Here we simply binarize the gray image by 64 × 64
block quantization at the mean value of the block. The host computer then partitions the
binarized 256 × 256 image into 5 × 10 blocks and pipelines these blocks into the SPARCL
system. Let X be the image input to the SPARCL chip and R be the reference image, which is
$\begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
in the example. Then the resultant edge detected image is
$Z = \overline{\bar{X} \cup \overline{\bar{X} \oplus R}}$, (4)
where $\bar{X}$ represents the complement of X, $\cup$ represents the union operation,
and $\oplus$ represents the dilation operation. The edge detection routine takes only
three clock cycles for operation. The same routine repeats for every block in the
pipeline.
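An edge detector built only from complement, union, and dilation can be sketched on a small binary grid. The composition used here is an assumption about how the complements are arranged: it marks the object pixels that touch the background, X ∩ (complement(X) ⊕ R), rewritten with De Morgan's law so that only the three BIA primitives appear. Function names and the clipped border handling are also assumptions.

```python
CROSS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]   # offsets of the 1s in R


def complement(X):
    return [[1 - v for v in row] for row in X]


def union(X, Y):
    return [[a | b for a, b in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]


def dilate(X, offsets=CROSS):
    """Set a pixel if any offset position around it is set in X (borders clipped)."""
    h, w = len(X), len(X[0])
    return [[int(any(0 <= r + i < h and 0 <= c + j < w and X[r + i][c + j] == 1
                     for i, j in offsets))
             for c in range(w)]
            for r in range(h)]


def edge(X):
    """Object pixels adjacent to background, using complement and union only."""
    not_x = complement(X)
    return complement(union(not_x, complement(dilate(not_x))))


X = [[0, 0, 0, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 1, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 0, 0, 0]]
for row in edge(X):
    print(row)
# [0, 0, 0, 0, 0]
# [0, 1, 1, 1, 0]
# [0, 1, 0, 1, 0]
# [0, 1, 1, 1, 0]
# [0, 0, 0, 0, 0]
```

Only the center pixel of the 3 × 3 square is removed, because it is the only object pixel whose four neighbors all belong to the object.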
Fig. 6. Demonstration of the SPARCL system for image edge detection.

B. Digital Video Motion Estimation
Digital video sequences contain highly redundant information, and there is considerable
correlation between adjacent frames. Most of the change from frame to frame occurs to the
moving objects in the picture, and most of the background information remains unchanged
or little changed. Instead of transmitting the whole current frame and wasting precious
channel bandwidth, the MPEG encoding algorithm (shown in block diagram form in Fig. 7)
transmits the difference between the frames along with the motion vectors by the
following procedure: The frames are partitioned into small blocks, and a search is made
in the previous frame for the block that best matches a block in the current frame. When
the best-matched block is found, the index offset is coded as a motion vector. Collecting
these best-matched blocks, we obtain a motion-compensated image frame. Subtracting the
motion-compensated image frame from the current frame, we can obtain the difference
image. The resultant difference image is compressed with a JPEG encoder and transmitted
along with the motion vectors [12]. At the receiver site, the recovery of the current
frame is straightforward. The motion-compensated image frame is reconstructed from the
motion vectors and the previous frame. The current frame is then recovered from the
addition of the motion-compensated image frame and the JPEG decoded difference image. In
the overall digital video transmission system, searching for the best-matched block is
highly computation intensive, especially when real-time operation is desired. For
example, an MPEG system with a frame size of 288 × 322 and 16 × 16 blocks that runs a
full search in a 48 × 48 neighborhood area at 30 frames/s requires more than 3 × 10^9
operations per second.

Fig. 7. SPARCL for motion estimation. (a) Encoding and transmission, (b) receiving and
frame recovery. At the transmitter site, two consecutive video frames (at the left) and
the difference image and motion vectors (at the right) are computed by the SPARCL system.
Player number 18 moves upward in the image frames, leaving his imprint in the difference
image. The current frame can be recovered easily at the receiver site.
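The size of the full-search workload can be checked with a short calculation, followed by a bare-bones exhaustive matcher. Both are sketches under stated assumptions: one operation is counted per pixel comparison, a 16 × 16 block is assumed to have (48 − 16 + 1)² candidate positions inside a 48 × 48 area, and the matcher uses a mismatch count on binary pixels as its cost. The function names are hypothetical.

```python
def full_search_ops_per_second(frame_h, frame_w, block, search, fps):
    """Pixel comparisons per second for exhaustive block matching (assumed cost model)."""
    positions = (search - block + 1) ** 2              # candidate offsets per block
    blocks = (frame_h * frame_w) / (block * block)     # blocks per frame
    ops_per_candidate = block * block                  # one comparison per pixel
    return blocks * positions * ops_per_candidate * fps


print(full_search_ops_per_second(288, 322, 16, 48, 30))   # 3029685120.0


def best_offset(current_block, previous, top, left, radius):
    """Exhaustive search around (top, left); returns ((dy, dx), mismatch count)."""
    n, m = len(current_block), len(current_block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + n > len(previous) or c0 + m > len(previous[0]):
                continue                               # candidate leaves the frame
            cost = sum(current_block[i][j] != previous[r0 + i][c0 + j]
                       for i in range(n) for j in range(m))
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best


previous = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (3, 4):
        previous[r][c] = 1                             # 2 x 2 object in the previous frame
block = [[1, 1], [1, 1]]                               # same object, one row lower now
print(best_offset(block, previous, top=3, left=3, radius=2))   # ((-1, 0), 0)
```

Under these assumptions the count comes to about 3.03 × 10^9 comparisons per second, in line with the figure quoted in the text.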
A parallel-pipeline smart-pixel array system such as SPARCL offers a method for running
digital video motion estimation efficiently. Figure 7 shows a system level simulation in
which the SPARCL chip performs the digital video motion estimation functions. A necessary
step in motion estimation is computation of the difference between two video frames and
image block matching. First the current frame is partitioned into data blocks of 5 × 10
pixels that match the array size of the SPARCL chip. The SPARCL system searches the
neighborhood area of 15 × 30 pixels in the previous frame to find a block that is best
matched to the data block in the current frame. To perform this search we load the
current frame into the SPARCL chip in the second stage of the SPARCL system. Then we
scroll the search area data through the first chip, one column at a time. For every new
column the search data are updated in the first chip and transmitted optically to the
second chip in 2D parallel. The second chip receives the search data optically and
compares them with the data block that already resides in its memory. The second chip
then performs the difference operation
$D = \overline{\bar{B}_0 \cup B_1}$ (5)
to match the data block and the search data, where $B_0$ is the search block in the
previous frame, $B_1$ is the data block in the current frame, and D is the difference
between $B_0$ and $B_1$. The search block that is least different from the data block is
chosen as the best-matched block. This block is used as an estimate of the current block,
and the index offset is coded as a motion vector. The system also subtracts the
best-matched block from the current block and obtains the difference block. By collecting
these difference blocks and motion offsets we can then create, encode, and transmit the
difference image and motion vector for digital video applications.
5. Architecture Analysis and Performance Scaling
A. SIMD I/O Problem
SIMD systems contain two types of I/O traffic between the PE's and external devices:
instructions and data elements. The system delivers identical copies of the instruction
to every PE, and each PE executes the same instruction at the same time. There are
several methods for delivering the instructions (e.g., sequential loading and broadcast).
The most efficient method is simply to broadcast the instruction to every PE
simultaneously.
On the other hand, the data elements delivered to every PE are different. Thus we cannot
simply broadcast the data to every PE as we do in instruction delivery. Because of the
limited number of I/O pads on a chip, the system has to load the data block from the
border of the chip, and the data elements flow through the PE array interconnection
network (e.g., mesh) step by step until the data block is registered with the PE array.
Moreover, the size of the data set is usually much larger than that of the instruction
set, e.g., in image processing. The same instruction set applies to a large number of
data blocks repeatedly. Here, the data element's I/O becomes the most critical bottleneck
for the SIMD system. Also, after processing, the system has to unload the processed data
elements from the PE array by the slow step-by-step method again. Therefore there are
usually separate I/O channels for loading and unloading data elements. However, the I/O
bottleneck still exists.
Performing an image or data processing routine on a computing system requires three
distinct steps: (1) loading the data from the input device (such as memory or digital
camera) to the processor(s), (2) executing the instructions for the application routine,
and (3) unloading the data from the processor(s) and storing them to an output device
(memory or display device) [12]. To evaluate the performance of the system we define the
processing speed as the number of data elements or pixels that are processed over the
total processing time, described by
$S_{pr} = \dfrac{N^2}{T_{load} + T_{exe} + T_{unload}}$, (6)
where $T_{load}$, $T_{exe}$, and $T_{unload}$ are the times required for loading,
executing, and unloading the N × N data field or image [13]. The value of $T_{exe}$ is
the number of instructions required by an SIMD algorithm multiplied by the SIMD clock
period. Here we assume that each instruction requires one clock cycle.
A SIMD machine with P × Q PE's processes an N × N image by sequentially processing image
blocks of size P × Q, where N is usually much larger than P and Q. The computing system
addresses the external device to load input image blocks through a 1D parallel bus. In
all current SIMD architectures the time required for loading and unloading each of the
P × Q blocks depends on this I/O bus's bandwidth and can easily dominate the total
processing time. This I/O bottleneck occurs for two reasons: The first is that data enter
the P × Q processing array through one of its borders on a 1D column parallel data bus.
If the data bus is P bits wide, Q clock cycles are required for loading or unloading the
processor array. The processing speed in such a system grows with the bus's width rather
than with the number of processing elements. The second reason is that when the SIMD
array is fully loaded and operating on a data block, the data I/O lines are idle; this
results in an underutilized data bus.
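The first of these two effects can be made concrete with Eq. (6) applied to a single block. The sketch below is illustrative only: it measures time in clock cycles, assumes a P-bit-wide bus that loads a P × Q block in Q cycles, and sets the unloading time to zero on the assumption that unloading overlaps with loading over a separate channel. The function names are hypothetical.

```python
def processing_speed(n_pixels, t_load, t_exe, t_unload=0):
    """Eq. (6) with times in clock cycles: pixels processed per cycle."""
    return n_pixels / (t_load + t_exe + t_unload)


def simd_block_speed(P, Q, t_exe):
    """One P x Q block loaded through a P-bit-wide border bus (Q load cycles)."""
    return processing_speed(P * Q, t_load=Q, t_exe=t_exe)


# Fixed bus width P = 8 and an 8-instruction routine; only Q (array depth) grows.
for Q in (8, 64, 512):
    print(Q, round(simd_block_speed(8, Q, t_exe=8), 3))
# 8 4.0
# 64 7.111
# 512 7.877
```

Adding processing elements by deepening the array raises the speed only toward the bus width P, which is the saturation described in the text.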
Faced with this I/O bottleneck problem, architecture designers have developed a prefetch
technique [14–16]. The elapsed time associated with loading data from memory is called
memory latency. The system deals with the memory latency by adding an extra register
within each PE and adding on-chip circuitry that performs data I/O and registration in
the background. These registers are interconnected with a network similar to the PE
array, e.g., mesh, and match the size and location of the PE array. Instead of loading
data through the PE array, the system prefetches data elements through the register
array, while the PE's are dedicated to instruction execution. The technique hides the
memory latency through data caching. The designers optimize the SIMD chip by balancing
the amount of VLSI real estate used for PE circuitry versus registers and background I/O
circuitry to maximize the processing speed [13]. The trend is revealed when we consider
current devices for SIMD image processing, such as the video signal processor [17], the
integrated memory array processor [18], and the GLiTCH [19]. These systems use PE array
sizes of only 16 × 16 or fewer per chip on chips of size greater than 1 cm^2. Because the
PE's are simple 1-bit processors, the processing array itself uses only a small portion
of the chip area. The majority of the chip area is used for memory and data I/O.
The two architectures based on an nSPARCL processor system [20,21], 1D nSPARCL and 2D
nSPARCL, that we are examining in this paper are shown in Figs. 2(b) and 2(c) and
compared with a conventional SIMD machine. With the smart-pixel optical detectors and
transmitters, a SPARCL chip can optically transfer its entire data block to another
SPARCL chip in a single clock cycle. The 1D nSPARCL uses this feature in the system shown
in Fig. 2(b), which has the first SPARCL stage as a dedicated input device, n − 2
intermediate SPARCL's for processing, and the last SPARCL stage as an output device. The
2D nSPARCL assumes that the I/O devices are implemented with smart-pixel technology that
permits 2D parallel optical I/O. The I/O devices can be a photonically accessed
page-oriented memory [22,23], a video camera, a display device, or a network connection.
In this case, data enter the SPARCL chip in a 2D parallel format and the I/O bottleneck
is eliminated. When a SIMD system has no I/O bottleneck, the processing speed scales
linearly with the number of processing elements.
The SPARCL chip itself is a SIMD system. Cascading these SPARCL chips to a multistage
nSPARCL system makes a multiple-instruction multiple-data stream system in the sense that
different SPARCL stages can execute different instruction sets simultaneously. By
scheduling the instruction phases among the nSPARCL stages we can improve the efficiency
of the system. Figure 8 shows a timing chart of data block processing for a SIMD system,
a four-stage 1D nSPARCL system, and a four-stage 2D nSPARCL system. All these systems
contain the same total number of PE's. The SIMD system contains 8 × 8 PE's on a single
chip. Both SPARCL systems have four stages of SPARCL chips that have 4 × 4 PE's. In this
example, the conventional SIMD system takes 16 clock cycles to load one data block and
executes the processing in 8 clock cycles. It loads a data block and executes the
processing separately. Also, we assume that the system has separate buses for data load
and unload so that data unloading and loading occur simultaneously in a pipeline manner.
The 1D nSPARCL system partitions the data into
smaller blocks, four times smaller than the conven-
tional SIMD block in this example. The system
loads the block into the first stage in only four clock
cycles because the block size is smaller. After the
data block is loaded in the first SPARCL stage, it
takes another clock cycle to transfer the block from
chip 1 to chip 2. The system also shares the execu-
tion commands evenly between chip 2 and chip 3, so
each chip takes four cycles for execution. The pro-
cessed block is then transferred to the last chip for
unloading, and the unloading again takes four clock
cycles. The system overlaps the loading time, the
execution time, and the unloading time between
blocks. For example, when the fourth chip is unloading
data block 1, chip 3 is executing commands on
block 2, chip 2 is executing commands on block 3, and
chip 1 is busy loading block 4 from the input device.
The resultant total processing time is reduced be-
cause of the pipeline processing.
Fig. 8. Timing chart of data block processing for a SIMD system, a four-stage 1D
nSPARCL, and a four-stage 2D nSPARCL.

The 2D nSPARCL loads data blocks in 2D parallel in a single clock cycle. For the
four-stage 2D nSPARCL, for example, the system shares the eight execution commands evenly
over the four stages. Each stage uses two clock cycles to finish the operation. Again
these operations are done in pipeline fashion. While chip 4 is executing commands on
block 1, chip 3 is executing block 2, chip 2 is executing block 3, and chip 1 is
executing block 4. The system uses all the PE resources for operation and therefore needs
the fewest clock cycles of the three systems.
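The cycle counts quoted in this example can be compared on a per-pixel basis. The sketch below uses only the numbers stated in the text and rests on several assumptions: steady-state operation with pipeline fill and drain ignored, the one-cycle optical transfer of the 1D system counted together with its four-cycle load, and the two cycles per stage of the 2D system taken as the interval between successive blocks.

```python
def cycles_per_pixel(cycles_per_block, block_pixels):
    """Steady-state cost of one pixel, in clock cycles."""
    return cycles_per_block / block_pixels


simd = cycles_per_pixel(16 + 8, 8 * 8)                  # load, then execute, one 8 x 8 block
one_d = cycles_per_pixel(max(4 + 1, 4, 4, 4), 4 * 4)    # slowest of the four pipeline stages
two_d = cycles_per_pixel(2, 4 * 4)                      # two instructions per stage
print(simd, one_d, two_d)   # 0.375 0.3125 0.125
```

On these assumptions the 1D pipeline is limited by its input stage and gains only modestly over the SIMD machine, whereas the 2D system is limited by execution alone.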
B. Performance Comparison of SIMD and 1D nSPARCL
Systems
We compare the performance of nSPARCL and con-
ventional SIMD systems given that both have the
same number of total processors in the system.
Here we compare and discuss the 1D nSPARCL and
conventional SIMD architectures in terms of process-
ing time, scalability, bus utilization, flexible multiple
speeds, and unbalanced bandwidth applications.
Because loading and unloading occur simultaneously,
we can hide the unloading latency and set
T_unload to 0 for our simulations.
1. Comparison of Processing Times
Assume that each of the n chips in the 1D nSPARCL
system has an array size of p × q. The equivalent
SIMD system has a total size of npq (we assume that
this is equivalent in size to the P × Q blocks discussed
above). Both SIMD and 1D nSPARCL systems have
the same bus bandwidth of p bits per clock cycle. The
total processing time needed for the SIMD system is
$$T_{\mathrm{SIMD}} = \left(\frac{N^2}{npq}\right)(nq + T_{\mathrm{exe}}), \qquad (7)$$
where (N²/npq) represents the number of blocks to
be processed and (nq + T_exe) represents the loading
time and the execution time required for each block.
Note that the SIMD requires separate time slots for
loading and execution.
For the 1D nSPARCL system we normally use the
first and last chips as input and output devices for
data I/O. The intermediate n − 2 chips execute the
data processing instructions. The total processing
time needed for the 1D nSPARCL system is then
$$T_{\text{1-D nSPARCL}} =
\begin{cases}
\dfrac{N^2}{pq}\,(q + 1), & \dfrac{T_{\mathrm{exe}}}{n - 2} \le q, \\[2ex]
\dfrac{N^2}{pq}\left(\dfrac{T_{\mathrm{exe}} + 2q + n}{n}\right), & \dfrac{T_{\mathrm{exe}}}{n - 2} > q.
\end{cases} \qquad (8)$$
There are two cases of 1D nSPARCL system operation
when tasks with different lengths of instructions are
applied. For the first case, or T_exe/(n − 2) ≤ q, the
intermediate SPARCL’s finish processing before the
first chip finishes loading data. Thus the total processing
time is dominated by the loading time and is
independent of T_exe. For the other case, or
T_exe/(n − 2) > q, all n stages are used to run the processing
routine after the first stage is finished loading. Thus
the workloads are spread properly over n stages.
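As a consistency check on Eq. (8), the two cases should agree at the boundary T_exe/(n − 2) = q. The short derivation below assumes the reading of Eq. (8) in which the second case is (N²/pq)(T_exe + 2q + n)/n, and substitutes T_exe = q(n − 2) into it:

```latex
\frac{N^2}{pq}\cdot\frac{q(n-2) + 2q + n}{n}
  = \frac{N^2}{pq}\cdot\frac{n(q+1)}{n}
  = \frac{N^2}{pq}\,(q+1)
```

This is the value given by the first case, so under that reading the total processing time is continuous at the point where the loading time and the distributed execution time balance.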
To compare the performance between 1D
nSPARCL and conventional SIMD systems we investigate
the ratio of total processing time for 1D
nSPARCL systems to that of the equivalent SIMD
systems for the same data I/O bandwidth at different
lengths of instruction sets, as shown in Fig. 9. 1D
nSPARCL’s with stage numbers n = 3, 4, 9, 25 are
simulated. For the simulation, the image size is
512 × 512 pixels and each SPARCL system is a 5 ×
10 processing array (p = 5, q = 10). When T_exe is
small and T_exe ≪ q(n − 2) the loading time dominates
the total processing time, and both systems
perform roughly the same. In fact, when the instruction
set is very small (T_exe < number of
nSPARCL stages), the extra steps of optically transferring
data between SPARCL stages reduce the
SPARCL system efficiency. As T_exe increases to be
approximately equal to the loading time, the SIMD data
I/O bus must stop frequently when the array is
loaded with the data block and is busy executing
instructions. On the other hand, the nSPARCL system
moves loaded data blocks down the multistage
system pipeline for processing instead of halting data
I/O. The nSPARCL performance is optimized when
the distributed execution time equals the system
depth, because the loading time and the execution
time are balanced. For T_exe > q(n − 2) the execution
time plays an increasingly critical role in the
system performance compared with data loading. When T_exe
becomes much greater than q(n − 2), both systems
are dominated by the time required for executing
processing instructions, and the loading and unloading
times become insignificant.
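The ratio plotted in Fig. 9 can be reproduced in outline from Eqs. (7) and (8). The sketch below is a minimal illustration and not the authors' simulation code. The function names are hypothetical, time is counted in clock cycles, n is assumed to be at least 3, and the second case of Eq. (8) is taken to be (N²/pq)(T_exe + 2q + n)/n.

```python
def t_simd(N, n, p, q, t_exe):
    """Eq. (7): total cycles for a SIMD array holding n*p*q PEs."""
    blocks = (N * N) / (n * p * q)        # number of blocks in an N x N image
    return blocks * (n * q + t_exe)       # load, then execute, for each block


def t_1d_nsparcl(N, n, p, q, t_exe):
    """Eq. (8): total cycles for an n-stage 1D nSPARCL of p-by-q chips (n >= 3)."""
    blocks = (N * N) / (p * q)
    if t_exe / (n - 2) <= q:
        # Loading-dominated case: execution hides behind the next block load.
        return blocks * (q + 1)
    # Execution-dominated case: the work is spread over all n stages.
    return blocks * (t_exe + 2 * q + n) / n


def time_ratio(N, n, p, q, t_exe):
    """Ratio of 1D nSPARCL time to equivalent SIMD time (below 1 favors nSPARCL)."""
    return t_1d_nsparcl(N, n, p, q, t_exe) / t_simd(N, n, p, q, t_exe)
```

With the parameters quoted in the text (N = 512, p = 5, q = 10), `time_ratio(512, 4, 5, 10, 20)` evaluates to 44/60, about 0.73, while `time_ratio(512, 4, 5, 10, 0)` is 1.1, reflecting the extra optical transfer step when almost no instructions are executed.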
2. Scalability of the 1D nSPARCL System
Here we compare the processing time for 1D
nSPARCL systems with 3, 4, 9, and 25 stages, as
shown in Fig. 9. The optimum ratio of processing
time decreases as the number of stages increases.
In general, the performance of the 1D nSPARCL system
scales up as the size of the system increases,
where the system size is defined as the number of
PE’s of the system. As the problem size (= number
of instructions per SIMD algorithm) increases, we
can improve the system performance by increasing
the size of the nSPARCL tailored to the size of the
problem. However, a larger nSPARCL system is not
always better than a smaller nSPARCL system for
any size of problem. For a fixed-size problem the
matched size of the 1D nSPARCL system would optimize
the efficiency. From the example shown in
Fig. 9, for problems that need fewer than 10 instruction
cycles, n = 3 nSPARCL is better than any n ≥ 4
nSPARCL. For problems that need more than 10
and fewer than 20 instruction cycles, n = 4 nSPARCL
is better than n = 3 and n ≥ 5 nSPARCL systems.
In summary, for a problem that needs an instruction
set of T_exe cycles, the best number of 1D nSPARCL
stages n_opt that optimizes the efficiency is
$$n_{\mathrm{opt}} = \left\lceil \frac{T_{\mathrm{exe}}}{T_{\mathrm{load}}} \right\rceil + 2, \qquad (9)$$
where ⌈·⌉ represents the next-larger integer.
Fig. 9. Comparison of ratio of total processing times of n = 3, 4,
9, 25 nSPARCL’s and equivalent SIMD systems plotted against
the time needed for processing operations, T_exe.
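Equation (9) is simple enough to evaluate directly. The helper below is a minimal sketch, assuming Eq. (9) reads n_opt = ⌈T_exe/T_load⌉ + 2 with both times in clock cycles and T_exe greater than zero. The name `optimal_stages` is hypothetical.

```python
import math


def optimal_stages(t_exe, t_load):
    """Eq. (9): ceil(T_exe / T_load) processing stages plus two I/O stages."""
    if t_exe <= 0 or t_load <= 0:
        raise ValueError("t_exe and t_load must be positive")
    return math.ceil(t_exe / t_load) + 2
```

With T_load = q = 10 as in the Fig. 9 example, `optimal_stages(5, 10)` gives 3 and `optimal_stages(15, 10)` gives 4, in line with the ranges quoted in the text.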
On the other hand, in a parallel-processing system
with multiple users it is also desirable for users to be
able to share the processors.15 Because of the single-
instruction nature of SIMD systems, complicated
mechanisms are needed to handle the scheduling. In
contrast, the multistage 1D nSPARCL is indeed a
multiple-instruction multiple-data system in that dif-
ferent SPARCL stages are able to execute different
sets of instructions. It is easy to partition the multi-
stage 1D nSPARCL system into two or more sub-
systems in terms of SPARCL stages. Each
subsystem is an independent SIMD system, running
application programs from different end users. With
predictions of problem size and instruction length, we
can also assign the optimum number of SPARCL
stages to a subsystem dynamically according to Eq. (8)
and optimize the processing efficiency individually.
3. Comparison of Bus Utilization
We can also approach the comparison of 1D
nSPARCL and conventional SIMD systems from bus
utilization of the two systems for different cases of
T_exe and T_load. The bus utilization is defined as the
volume of data flowing through the bus interface of
processor array and external devices over a period of
time. In steady state the utilization of the
input bus and that of the output bus should be equal. Bus
utilization represents the data throughput rate of the
system and is therefore a good measure of the system
performance. The bus utilization ratio of 1D
nSPARCL is compared in Fig. 10 with that of the
SIMD system at different relative values of T_exe/T_load.
The nSPARCL has the greatest advantage
over the SIMD architecture when T_exe ≈ T_load. This
illustrates the ability of the 1D nSPARCL system to
utilize the bus bandwidth better by moving loaded
blocks to open SPARCL processor arrays in the pipeline.
It also shows the scalability advantage of the
nSPARCL system over its equivalent SIMD machine.
Given a task with a certain length of instructions, we
can scale up the stages of the SPARCL system properly
such that T_exe ≈ q(n − 2) and the data throughput
rate is optimized.
4. Hybrid Speeds with the 1D nSPARCL System
At the system level, a multistage nSPARCL also offers
opportunities for high-speed data I/O. In a
VLSI chip, electrical signals enter and leave through
electrical I/O pads at the side of the chip. In practice,
because of the off-chip parasitics from the package
and the printed circuit board, the off-chip clock
suffers from limited signal bandwidth. To overcome
this problem it is common practice to have VLSI
chips designed with slow (tens of megahertz) off-chip
clocks synchronized to the high-speed on-chip clocks
with on-chip phase-locked loop circuitry. However, for
the SIMD array I/O bottleneck that we have discussed,
doing this helps only to shorten the execution
time on-chip but not the loading–unloading time.
The fundamental problem of data element traffic still
exists. Although some special VLSI components
fabricated in GaAs can have higher-speed I/O, the
design of such VLSI’s may be more difficult and less
dense than that of CMOS. On the other hand,
SPARCL with optical interconnects offers opportunities
at the system level to avoid these problems. It is
possible for the multistage nSPARCL system
to have multiple off-chip speeds. In the system
we can dedicate high-speed chips (e.g., GaAs) for the
first and the last stages of the system for data I/O.
The loaded data elements are then transferred to the
following stages optically down the pipe for processing
at a high-speed on-chip clock.
5. 1D nSPARCL for Bandwidth Unbalanced
Applications
The 1D nSPARCL has a basic internal chip-to-chip
bandwidth of O(N²) and an external I/O bandwidth
of O(N). Because of this bandwidth mismatch, applying
the system to general-purpose problems takes
a certain amount of effort. However, the system fits
nicely the problems that require only modest external
bandwidths [O(N)] and internal bandwidths of
O(N²). For example, matrix-vector multiplication of
an on-chip N × N matrix and an off-chip N-element
vector is an application that requires only O(N) bandwidth
externally and O(N²) bandwidth internally.
To do this multiplication we have the matrix residing
in the second chip and load the N-element vector from
the first chip in column parallel. Every data element
of the vector is then broadcast to a row of the
matrix in 1-to-N fanout. Another special example
meeting these conditions is the video motion estimation
described above. In the search for the best-matched
block, the desired data block resides on the second
chip and the search area scrolls over the first chip in
column parallel. Every time that one column of the
search area is loaded to the first chip, every column in
the chip shifts laterally one column to the side and
creates a new array of O(N²) internally for the matching
operation. There are other systems (e.g., neural
networks) that have this type of unbalanced
external–internal traffic and are suited to the 1D
nSPARCL architecture.
Fig. 10. Comparison of bus utilization of n = 3, 4, 9, 25
nSPARCL’s and equivalent SIMD systems.
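The matrix-vector case mentioned above shows where the O(N) and O(N²) figures come from. The sketch below is purely illustrative and makes several assumptions the text does not state: element i of the vector is broadcast across row i of the matrix, the partial products are then summed down each column, and arbitrary numbers stand in for the binary data handled by the actual PE's. The function name `broadcast_matvec` is hypothetical.

```python
def broadcast_matvec(matrix, vector):
    """Compute y[j] = sum_i vector[i] * matrix[i][j] by row-wise broadcast.

    Returns (y, external_words, internal_products) so that the O(N)
    external and O(N^2) internal traffic can be compared.
    """
    n = len(vector)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError("expected an N x N matrix and an N-element vector")
    # External traffic: only the N vector elements cross the chip boundary.
    external_words = n
    # Internal traffic: element i fans out 1-to-N across row i of the matrix.
    products = [[vector[i] * matrix[i][j] for j in range(n)] for i in range(n)]
    internal_products = sum(len(row) for row in products)
    # Assumed reduction step: sum the partial products down each column.
    y = [sum(products[i][j] for i in range(n)) for j in range(n)]
    return y, external_words, internal_products
```

For an N × N matrix the function reports N external words and N² internal products, which is the mismatch that this class of application tolerates.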
C. Performance Comparison of SIMD, 1D nSPARCL, and
2D nSPARCL Systems
Integrated with input and output devices that support
2D parallel I/O’s, the nSPARCL system can load
and unload an entire p × q data block in a single clock
cycle. The same technology used to create SPARCL
can be used to make dense memory chips, data buffers,
video relay systems, and network interface
devices.22–24 For this system the total processing
time becomes
$$T_{\text{2-D nSPARCL}} = \left(\frac{N^2}{pq}\right)\left(1 + \frac{T_{\mathrm{exe}}}{n}\right). \qquad (10)$$
For each block the loading time is always a constant
of 1 because it requires only one single clock cycle for
loading and unloading, and the execution time is
T_exe/n because the execution instructions are shared
evenly over n stages. Figure 11 compares the processing
speed S_pr of 1D nSPARCL, 2D nSPARCL, and
SIMD systems when a 256 × 256 image is processed
over various numbers of PE’s up to 256 (= 16 × 16
array). In this simulation the same number of PE’s
are used for all three systems, and they exercise the
same task with an instruction length of 20 clock cycles.
The four-stage case is assumed for both 1D and 2D
nSPARCL’s. In the simulation result, both the conventional
SIMD and the 1D nSPARCL processing
speeds S_pr tend to saturate as the number of PE’s
increases. This is so because the loading time dominates
the system as the array size grows too large. In
contrast, the processing speed of a 2D nSPARCL increases
linearly with the number of PE’s because the
I/O bottleneck is eliminated and all the processors are
dedicated to performing the application routine.
Fig. 11. Comparison of processing speed (in terms of pixels/clock
cycle) with the number of processing elements for SIMD, 1D
nSPARCL, and 2D nSPARCL systems. The 2D nSPARCL eliminates
the data I/O bottleneck by performing 2D parallel data I/O
with input and output devices.
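The trend described for Fig. 11 can be read off Eqs. (7), (8), and (10). As a sketch, assume that the processing speed is S_pr = N²/T in pixels per clock cycle, write M = npq for the total number of PE's, and take the loading-dominated case of Eq. (8). These readings are assumptions made here for illustration and are not the authors' simulation.

```latex
S_{\mathrm{SIMD}} = \frac{N^2}{T_{\mathrm{SIMD}}}
  = \frac{npq}{nq + T_{\mathrm{exe}}}
  = \frac{M}{M/p + T_{\mathrm{exe}}},
\qquad
S_{\text{1-D}} = \frac{pq}{q + 1},
\qquad
S_{\text{2-D}} = \frac{pq}{1 + T_{\mathrm{exe}}/n}
  = \frac{M}{n + T_{\mathrm{exe}}}
```

For a fixed bus width p, the first two expressions approach p as the array grows, whereas the third is proportional to M. With n = 4 and T_exe = 20, the 2D expression reduces to M/24 pixels per clock cycle.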
A commonly cited advantage of SIMD systems is
their scaling properties. The larger the SIMD array,
the more data elements can be processed simultaneously.
Decreasing VLSI feature sizes allows for
higher-density PE implementation and thus for larger
processing arrays per chip. Ideally the processing
speed per chip, defined as the number of data elements
processed divided by the processing time, increases
linearly with the processing array size. However, because
the processing time includes the time required
for loading the data into the processing array, processing
the data, and then unloading the data, the processing
speed is also sensitive to the data I/O bandwidth of
the chip. The fundamental problem of data I/O in
conventional SIMD systems is the 2D nature of the
processing array and the 1D nature of the data I/O
ports of electronic buses. Ideally the computation
bandwidth increases proportionally to the processor
array size. However, 2D data fields enter the processing
array in a row-parallel format along the edge of the
array and flow into the array on the mesh network.
As a result, as the PE array size grows in O(N²), the
I/O bandwidth grows only in O(N). This causes an
I/O bottleneck as the PE array size grows. Consequently,
it greatly reduces the overall system throughput
and limits the SIMD system array size. The 1D
nSPARCL deals with the problem by hiding the memory
latency by prefetching. However, this helps only
when the lengths of loading–unloading cycles and the
execution cycles are comparable. As the PE array
size grows, the length of the loading–unloading cycle
becomes much larger than that of the execution cycle.
The data I/O overwhelms the system, and the memory
latency dominates the system performance as well.
This occurs because the I/O bandwidth is fundamentally
limited. On the other hand, the 2D
nSPARCL I/O bandwidth grows in O(N²) and thus scales
well with the size of the PE array.
So far we have compared the systems under the
assumption of the same number of PE’s. On the
other hand, considering the fact that the yield of a
VLSI chip decreases as the die size increases, it would
be difficult to build a large SIMD chip. As the
SPARCL system decomposes a large SIMD array into
multiple SIMD stages, it presents an opportunity
for building a multiprocessor system with a large
number of PE’s distributed over several stages.
6. Conclusion
We have described an optoelectronic VLSI architecture
for a SIMD computing system, the SPARCL.
The device uses novel hybrid CMOS–MQW smart-pixel
technology. We constructed an experimental
system for testing the devices as well as for demonstrating
the system. This prototype system utilizes
BIA for general-purpose morphological image processing.
We have demonstrated applications of the
system to image edge detection and estimation of
digital video motion. We compared the performance
of the conventional SIMD machine and 1D and 2D
nSPARCL systems under the assumption that the
total number of PE’s in the systems was the same.
The results illustrate that, given the same task, the
nSPARCL system outperforms the SIMD system in
terms of processing time, bus utilization, and processing
speed. The nSPARCL system also has several
system-level advantages: scalability that optimizes
the computation efficiency, flexibility in hybrid
speeds and multiple-instruction operation, and utility
for constructing systems with large numbers of PE’s. The
optoelectronic VLSI technology has the potential to
improve the performance of multiprocessor computing
systems significantly. However, major efforts
are still needed for the integration of efficient, reliable,
and cost-effective systems.
The authors thank Lily Cheng for help with chip
design; Allan G. Weber for help with circuit board
packaging; and Matt Derstine and Sue Wakelin of
Optical Networks, Inc., for help with optomechanical
packaging and tunable external-cavity laser diode
sources. This study was supported by the Joint Services
Electronics Program through the U.S. Air Force
Office of Scientific Research under contract F49620-97-10238;
by the National Center for Integrated Photonic
Technology program funded by the Defense
Advanced Research Projects Agency under contract
MDA972-94-1-0001; and by the Integrated Media
Systems Center, a National Science Foundation Engineering
Research Center, with additional support
from the Annenberg Center for Communication of the
University of Southern California and the California
Trade and Commerce Agency.
References
1. A. A. Sawchuk, “Smart pixel devices and free-space digital
optics applications,” in Proceedings of 1995 IEEE/LEOS Annual
Meeting (Institute of Electrical and Electronics Engineers,
Piscataway, N.J., 1995), pp. 268–269.
2. Semiconductor Industry Association, The National Technology
Roadmap for Semiconductors (Sematech, Inc., San Jose, Calif.,
1997).
3. K. W. Goossen, J. A. Walker, L. A. D’Asaro, S. P. Hui, B. Tseng,
R. Leibenguth, D. Kossives, D. D. Bacon, D. Dahringer,
L. M. F. Chirovsky, A. L. Lentine, and D. A. B. Miller, “GaAs
MQW modulators integrated with silicon CMOS,” IEEE Photon.
Technol. Lett. 7, 360–362 (1995).
4. A. V. Krishnamoorthy, A. L. Lentine, K. W. Goossen, J. A.
Walker, T. K. Woodward, J. E. Ford, G. F. Aplin, L. A. D’Asaro,
S. P. Hui, and B. Tseng, “3-D integration of MQW modulators
over active sub-micron CMOS circuits: 375 Mb/s transimpedance
receiver–transmitter circuit,” IEEE Photon. Technol.
Lett. 7, 1288–1290 (1995).
5. T. K. Woodward, A. V. Krishnamoorthy, A. L. Lentine, K. W.
Goossen, J. A. Walker, J. E. Cunningham, W. Y. Jan, L. A.
D’Asaro, and L. M. F. Chirovsky, “1-Gb/s two-beam transimpedance
smart pixel optical receivers made from hybrid GaAs
MQW modulators bonded to 0.8 μm silicon CMOS,” IEEE
Photon. Technol. Lett. 8, 422–424 (1996).
6. A. V. Krishnamoorthy and K. W. Goossen, “Progress in
optoelectronic-VLSI smart pixel technology based on GaAs/AlGaAs
MQW modulators,” Int. J. Optoelectron. 11, 181–198
(1997).
7. K.-S. Huang, B. K. Jenkins, and A. A. Sawchuk, “Image algebra
representation of parallel optical binary arithmetic,” Appl.
Opt. 28, 1263–1278 (1989).
8. K.-S. Huang, A. A. Sawchuk, B. K. Jenkins, P. Chavel, J.-M.
Wang, A. G. Weber, C.-H. Wang, and I. Glaser, “Digital optical
cellular image processor (DOCIP): experimental implementation,”
Appl. Opt. 32, 166–173 (1993).
9. C. B. Kuznia, J.-M. Wu, C.-H. Chen, A. A. Sawchuk, and
L. Cheng, “Hybrid CMOS/SEED smart pixel array for 2D
parallel pipeline operations,” in Digest IEEE/LEOS 1996
Summer Topical Meetings: Smart Pixels (Institute of Electrical
and Electronics Engineers, Piscataway, N.J., 1996), pp.
80–81.
10. C. B. Kuznia, J.-M. Wu, C.-H. Chen, B. Hoanca, L. Cheng, A. G.
Weber, and A. A. Sawchuk, “Two-dimensional parallel pipeline
processing with smart pixel array cellular logic (SPARCL) processors:
system implementation,” submitted to J. Lightwave
Technol.
11. P. Maragos and R. Shafer, “Morphological systems for multidimensional
signal processing,” Proc. IEEE 78, 690–709
(1990).
12. D. Le Gall, “MPEG: a video compression standard for multimedia
applications,” Commun. ACM 34, 46–58 (1991).
13. A. Broggi and F. Gregoretti, “Performance evaluation and optimization
in low-cost cellular SIMD systems,” Microprocess.
Microprogramm. 41, 659–678 (1996).
14. K. Hwang, Advanced Computer Architecture: Parallelism,
Scalability, Programmability (McGraw-Hill, New York, 1994).
15. J. M. del Rosario and A. K. Choudhary, “High-performance I/O
for massively parallel computers: problems and prospects,”
IEEE Comput. 27(3), 59–68 (1994).
16. J. D. Allen and D. E. Schimmel, “Issues in the design of high
performance SIMD architectures,” IEEE Trans. Parallel Distr.
Syst. 7, 818–829 (1996).
17. J. Goodenough, R. J. Meacham, J. D. Morris, N. L. Seed, and
P. A. Ivey, “A single chip video signal processing architecture
for image processing, coding, and computer vision,” IEEE
Trans. Circ. Syst. Video Technol. 5, 436–445 (1995).
18. S. Okazaki, Y. Fujita, and N. Yamashita, “A compact real-time
vision system using integrated memory array processor architecture,”
IEEE Trans. Circ. Syst. Video Technol. 5, 446–452
(1995).
19. H. D. Santos, J. C. Ramalho, J. M. Fernandes, and A. J.
Proenca, “A heterogeneous computer vision architecture: implementation
issues,” Comput. Syst. Eng. 6, 401–408 (1995).
20. J.-M. Wu, C. B. Kuznia, B. Hoanca, C.-H. Chen, L. Cheng, A. G.
Weber, and A. A. Sawchuk, “Smart pixel array cellular logic
(SPARCL) processor for eliminating SIMD I/O bottlenecks:
system demonstration and performance scaling,” in Optics in
Computing, Vol. 8 of 1997 OSA Technical Digest Series (Optical
Society of America, Washington, D.C., 1997), pp. 152–154.
21. J.-M. Wu, C. B. Kuznia, B. Hoanca, C.-H. Chen, and A. A.
Sawchuk, “Integration of CMOS/MQW smart pixel array cellular
logic (SPARCL) processors for SIMD parallel pipeline
processing,” presented at the 1997 North American Chinese
Photonics Technology Conference, Los Angeles, Calif., 17–19
October 1997.
22. A. A. Sawchuk, “Optoelectronic memory applications for
VCSEL-based smart pixels,” in Proceedings, IEEE Lasers and
Electro-Optics Society 1997 Annual Meeting (Institute of Electrical
and Electronics Engineers, Piscataway, N.J., 1997), pp.
149–150.
23. A. V. Krishnamoorthy, R. G. Rozier, J. E. Ford, and F. E.
Kiamilev, “Demonstration of a CMOS static RAM chip with
high-speed optical read and write,” in Spatial Light Modulators,
G. Burdge and S. Esener, eds., Vol. 14 of OSA Trends in
Optics and Photonics Series (Optical Society of America,
Washington, D.C., 1997), pp. 23–26.
24. F. E. Kiamilev and R. G. Rozier, “Design of optoelectronic-VLSI
ICs for optically accessed SRAMs,” in Spatial Light Modulators,
G. Burdge and S. Esener, eds., Vol. 14 of OSA Trends
in Optics and Photonics Series (Optical Society of America,
Washington, D.C., 1997), pp. 11–13.
