1000 frame/sec Stereo Matching VLSI Processor with Adaptive Window-Size Control by Hariyama Masanori et al.
3-8
1000 frame/sec Stereo Matching VLSI Processor
with Adaptive Window-Size Control
Masanori Hariyama, Naoto Yokoyama, and Michitaka Kameyama
Graduate School of Information Sciences, Tohoku University
Aoba 6-6-05, Aramaki, Aoba, Sendai, Miyagi, 980-8579, Japan
Email: {hariyama@, yokoyama@kameyama., kameyama@}ecei.tohoku.ac.jp
Abstract- This paper presents a 1000-frame/sec stereo-matching
VLSI processor with adaptive window-size control. To reduce
the amount of computation, the algorithm divides the reference
image into non-overlapping regions. The matching result is
iteratively refined by reducing the window size. A window-parallel
and pixel-parallel architecture is proposed to exploit the
potential parallelism.
I. INTRODUCTION
Acquisition of reliable three-dimensional (3-D) images of a
real scene plays an essential role in intelligent systems such as
intelligent robots and intelligent vehicles, and next-generation
multimedia applications such as 3-D television and virtual
reality. Stereo vision is a well-known method to acquire 3-D
information. The most important problem in stereo vision is to
establish reliable correspondence between images. The major
difficulty in correspondence matching is that the window size
must be large enough to avoid ambiguity but small enough to
avoid the effects of projective distortion [1]. To solve this
problem, the work in [3] proposed a VLSI processor for stereo
matching with variable window size. However, its processing time
does not meet the requirements of high-speed intelligent systems
such as highly safe vehicles.
This paper presents a new VLSI-oriented stereo matching
algorithm that achieves both high performance and high
reliability. To reduce the computational complexity, the stereo
matching is executed using a reference image that is divided
into non-overlapping regions. By using non-overlapping regions,
the redundancy in computation is completely removed.
Moreover, a window-parallel and pixel-parallel architecture
is presented to fully exploit the parallelism for any window
size. The test chip is designed in a 0.18 µm CMOS technology.
Its performance is estimated to be more than 1000 frames/sec
at 100 MHz. This performance is more than 1000 times higher
than that of the previous version [3].
II. STEREO MATCHING ALGORITHM
Finding the correspondence between images is the most
essential processing step in stereo vision. Once the correspondence
is established, a 3-D point in the real scene can be found
by triangulation as shown in Fig. 1. Let us consider a reference
window of size $W \times W$ centered at $R = (X_R, Y_R)$ in the
reference (left) image, and a candidate window centered at
$C = (X_C, Y_C)$ in the candidate (right) image. The possible
candidate windows exist on the same horizontal line (called the
"epipolar line") as the reference window.

Fig. 1. Search for a corresponding point: (a) reference (left) image, (b) candidate (right) image, (c) 3-D plot.

Fig. 2. SAD graphs for adaptive window control: (a) global search, (b) local search.

Then, the SAD (sum of absolute differences) between the reference
and candidate windows is given by

$$\sum_{i=-\frac{W-1}{2}}^{\frac{W-1}{2}} \sum_{j=-\frac{W-1}{2}}^{\frac{W-1}{2}} \left| I_R(X_R + i, Y_R + j) - I_C(X_C + i, Y_C + j) \right| \qquad (1)$$

where $I_R$ and $I_C$ are the intensity values in the reference and
candidate images, respectively. If a candidate window exactly
matches the reference window, then the SAD becomes 0. A
0-7803-9735-5/061$20.00 ©2006 IEEE 123
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on February 5, 2009 at 23:37 from IEEE Xplore.  Restrictions apply.
pixel where the SAD curve reaches its minimum is called a
"matching" pixel (Fig. 2(a)). The accuracy of the correspondence
matching is strongly affected by the window size. If
the window size is too small, there are several possible choices
for the corresponding pixel (Fig. 3(b)). On the
other hand, if the window size is too large,
the matching pixel may not be the correct corresponding pixel
due to different projective distortions in the reference and the
candidate images.
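
As an illustration of Eq. (1) and of the search along the epipolar line, the following is a minimal Python sketch, not the processor's implementation. It assumes images stored as lists of rows, an odd window size so that the window is symmetric about its center, and a candidate window on the same row as the reference window. The function names `sad` and `matching_pixel` and the synthetic test images are hypothetical.

```python
def sad(ref, cand, xr, yr, xc, yc, w):
    """Sum of absolute differences between two w x w windows, as in Eq. (1).

    ref and cand are images stored as lists of rows; (xr, yr) and (xc, yc)
    are the window centers. w is assumed to be odd.
    """
    h = (w - 1) // 2
    total = 0
    for j in range(-h, h + 1):
        for i in range(-h, h + 1):
            total += abs(ref[yr + j][xr + i] - cand[yc + j][xc + i])
    return total


def matching_pixel(ref, cand, xr, yr, w):
    """Scan row yr of the candidate image and return (x, SAD) at the minimum."""
    h = (w - 1) // 2
    width = len(cand[0])
    best_x, best_sad = None, None
    for xc in range(h, width - h):
        s = sad(ref, cand, xr, yr, xc, yr, w)
        if best_sad is None or s < best_sad:
            best_x, best_sad = xc, s
    return best_x, best_sad


# Synthetic example: the candidate image is the reference image shifted
# left by SHIFT pixels, so the expected disparity X_R - X_C equals SHIFT.
WIDTH, HEIGHT, SHIFT = 12, 5, 2
ref = [[(x * x + 3 * y) % 17 for x in range(WIDTH)] for y in range(HEIGHT)]
cand = [[ref[y][x + SHIFT] if x + SHIFT < WIDTH else 0 for x in range(WIDTH)]
        for y in range(HEIGHT)]
xc, best = matching_pixel(ref, cand, xr=6, yr=2, w=3)
disparity = 6 - xc
```

In this example the minimum SAD is 0 at `xc = 4`, which gives a disparity of 2. With a less distinctive image pattern, several candidates could tie, which is the ambiguity that a larger window is meant to avoid.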
To solve this problem, we propose a stereo matching algorithm
that adaptively selects a window size for each pixel. The
algorithm is given as follows.
Step 1 (Global search): Initially, set the window size $W$ to
$W_{\max}$. $W_{\max}$ is the maximum window size, determined
empirically such that the minimum of the SAD graph is unique.
Divide the reference image into non-overlapping regions of size
$W \times W$ as shown in Fig. 3. For each reference window RW,
the corresponding window is searched for in the candidate image
by computing SADs between RW and all the possible candidate
windows. Note that, from the epipolar geometry, the candidate
windows are limited to those with the same vertical coordinate
as the reference window. As a result, the disparity is determined
for each reference window, where the disparity is defined
as the difference between the horizontal coordinates of the
reference and candidate windows ($X_{RW} - X_{CW}$). The disparity
corresponds to the distance from the origin of the camera
coordinate system.
Step 2 (Local search): Halve the window size, that
is, set $W \leftarrow W/2$ as shown in Fig. 4. Divide the reference
image into non-overlapping regions of size $W \times W$. The
corresponding window of each reference window RW is found
as follows.
Step 2-1 Let $D_i$ ($1 \le i \le 4$) be the disparities of the
regions around the reference window RW, estimated at the
larger window size, as shown in Fig. 4. For example,
$D_1$, $D_2$, $D_3$ and $D_4$ are $D_{1,2}$, $D_{1,3}$, $D_{2,2}$
and $D_{2,3}$, respectively. According to the smoothness
constraint [2], the true disparity of RW is likely to be
one of the $D_i$ ($1 \le i \le 4$).
Step 2-2 Compute SADs only for candidate
windows that are within $d$ of the location
corresponding to $D_i$ ($1 \le i \le 4$) (Fig. 4). The search
area is limited by using the result at the larger
window size to avoid ambiguity at the smaller
window size. Take the center pixel $X_C$ of the
candidate window with the minimum SAD value
as the matching pixel of the center $X_R$ of the
reference window. Using the coordinates of the
matching pixel, the disparity is computed as
$X_R - X_C$.
Step 3: If $W \ne 1$, then set $W \leftarrow W/2$ and go back to Step 2.
Otherwise, the latest matching pixel is taken as the
corresponding pixel.
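
Steps 1 to 3 can be sketched in Python as follows. This is one interpretation under stated assumptions, not the authors' implementation: windows are addressed by their top-left corner rather than their center, the image height and width are multiples of $W_{\max}$, $W_{\max}$ is a power of two, and the four disparities $D_i$ are taken from the parent region and the three coarse regions adjacent to the child window's quadrant (clamped at the image border). The window size is halved once per pass. All function and parameter names (`block_sad`, `best_disparity`, `hierarchical_match`, `max_disp`) are hypothetical.

```python
def block_sad(ref, cand, x, y, w, disp):
    """SAD between the w x w reference block with top-left corner (x, y) and
    the candidate block on the same rows, shifted left by disp pixels."""
    return sum(abs(ref[y + j][x + i] - cand[y + j][x + i - disp])
               for j in range(w) for i in range(w))


def best_disparity(ref, cand, x, y, w, candidates):
    """Return the candidate disparity with minimum SAD (ties: first listed).

    Only disparities that keep the candidate block inside the image are tried.
    """
    valid = [dsp for dsp in candidates if 0 <= dsp <= x]
    return min(valid, key=lambda dsp: block_sad(ref, cand, x, y, w, dsp))


def hierarchical_match(ref, cand, w_max, max_disp, d):
    """Coarse-to-fine disparity map with one disparity per pixel at the end."""
    height, width = len(ref), len(ref[0])
    w = w_max
    # Step 1 (global search): full search for each W_max x W_max region.
    disp = [[best_disparity(ref, cand, x, y, w, range(max_disp + 1))
             for x in range(0, width, w)]
            for y in range(0, height, w)]
    # Steps 2 and 3 (local search): halve W until it reaches 1.
    while w > 1:
        coarse, w = disp, w // 2
        disp = []
        for r in range(height // w):
            row = []
            for c in range(width // w):
                pr, pc = r // 2, c // 2  # parent region at the larger size
                nr = min(max(pr + (1 if r % 2 else -1), 0), len(coarse) - 1)
                nc = min(max(pc + (1 if c % 2 else -1), 0), len(coarse[0]) - 1)
                seeds = {coarse[pr][pc], coarse[pr][nc],
                         coarse[nr][pc], coarse[nr][nc]}
                # Search only within d of each seed disparity D_i.
                candidates = sorted({s + k for s in seeds
                                     for k in range(-d, d + 1)})
                row.append(best_disparity(ref, cand, c * w, r * w, w, candidates))
            disp.append(row)
    return disp


# Synthetic example: a candidate image shifted by a constant disparity.
HEIGHT, WIDTH, TRUE_DISP = 8, 16, 3
ref = [[1 + (7 * x + 13 * y) % 31 for x in range(WIDTH)] for y in range(HEIGHT)]
cand = [[ref[y][x + TRUE_DISP] if x + TRUE_DISP < WIDTH else 0
         for x in range(WIDTH)] for y in range(HEIGHT)]
disparity_map = hierarchical_match(ref, cand, w_max=4, max_disp=5, d=1)
```

For this synthetic pair, every pixel with $x \ge 3$ ends with disparity 3. Pixels closer to the left border cannot reach that disparity because the candidate block would fall outside the image.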
Fig. 3. Matching with a larger window size. Disparity of reference windows:

D1,1  D1,2  D1,3  D1,4
D2,1  D2,2  D2,3  D2,4
D3,1  D3,2  D3,3  D3,4
D4,1  D4,2  D4,3  D4,4
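Fig. 4. Limiting the search area in the local search and matching with a
smaller window size. For the reference window RW of size $W/2$,
$D_1 = D_{1,2}$, $D_2 = D_{1,3}$, $D_3 = D_{2,2}$ and $D_4 = D_{2,3}$;
each candidate window CW is searched within $\pm d$ of the
corresponding location.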
The proposed algorithm reduces the amount of computation
to less than 0.5% of that of the variable-window-size
algorithm without a hierarchical approach [3], without degrading
the matching quality.
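
To see where the saving comes from, the number of absolute-difference (AD) operations can be counted. With non-overlapping regions, an image of height $H$ and width $I_W$ contains $(H/W)(I_W/W)$ windows of $W^2$ pixels each, so one candidate per window costs $H \times I_W$ ADs regardless of $W$. The sketch below counts ADs under assumptions that are not taken from the paper: a fixed number of candidates in the global search (`search_range`), at most $4(2d+1)$ candidates per window in each local search, and a hypothetical exhaustive baseline that matches every pixel over the full search range at every window size. That baseline is not necessarily the algorithm of [3].

```python
def hierarchical_ad_count(height, width, w_max, search_range, d):
    """Upper bound on AD operations for Steps 1 to 3.

    Assumes w_max is a power of two, search_range candidate windows per
    reference window in the global search, and at most 4 * (2 * d + 1)
    candidates per reference window in each local search.
    """
    levels = w_max.bit_length() - 1  # number of halvings down to W = 1
    per_pixel = search_range + levels * 4 * (2 * d + 1)
    return height * width * per_pixel


def exhaustive_ad_count(height, width, window_sizes, search_range):
    """AD operations when every pixel is matched over the full search range
    at every window size (a hypothetical non-hierarchical baseline)."""
    return sum(height * width * search_range * w * w for w in window_sizes)
```

The ratio of the two counts depends strongly on the assumed search range, $d$, and set of window sizes, so it should not be read as a reproduction of the figure quoted above.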
III. PROCESSOR ARCHITECTURE
There are two types of parallelism in stereo matching:
window-level parallelism and pixel-level parallelism.
Pixel-level parallelism: The ADs (absolute differences) in an
SAD (Eq. (1)) can be computed in parallel across pixels.
Window-level parallelism: SADs can be computed in
parallel across reference and candidate windows.
From the viewpoint of hardware implementation, the degree
of parallelism should be constant to achieve a high
utilization ratio of hardware resources. However, the degrees of
both types of parallelism change depending on the window size in the
Fig. 5. Design concept of the SAD unit based on WPPP scheduling (panels (a) and (b), including the SAD for a 4x4 window).
proposed algorithm. To solve this problem, pixel-level
parallelism is combined with window-level parallelism.
Figure 5 shows the design concept of the SAD unit based on
window-parallel and pixel-parallel (WPPP) scheduling for
$W_{\max} = 4$. A single SAD with 16 ADs is computed for
the window size $W = 4$, whereas 4 SADs, each of which
has 4 ADs, are computed for the window size $W = 2$. The
gray adders indicate unused adders. This scheduling keeps
the total number of ADs constant even though the window size
changes, because the window size is iteratively halved.
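
The WPPP idea can be modeled in software as a binary adder tree over a fixed set of AD values that is tapped at different levels depending on the window size. The following sketch is an interpretation of Fig. 5, not a description of the actual circuit. It assumes the number of ADs is a power of two and that the ADs are ordered so that each consecutive group of $W^2$ values belongs to one window. The names `adder_tree` and `wppp_sads` are hypothetical.

```python
def adder_tree(ads):
    """Return all levels of a binary adder tree over the AD values.

    levels[0] holds the ADs themselves and levels[-1] the single root sum.
    """
    levels = [list(ads)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[k] + prev[k + 1] for k in range(0, len(prev), 2)])
    return levels


def wppp_sads(ads, w):
    """Tap the tree at the level where each node sums w * w ADs."""
    level = (w * w).bit_length() - 1  # log2(w * w) for power-of-two w
    return adder_tree(ads)[level]


# 16 AD units, as for W_max = 4.
ads = list(range(1, 17))
one_sad = wppp_sads(ads, 4)    # a single SAD over 16 ADs
four_sads = wppp_sads(ads, 2)  # four SADs over 4 ADs each
```

Here `one_sad` is `[136]` and `four_sads` is `[10, 26, 42, 58]`. The same 16 ADs are used in both cases; for $W = 2$ the adders above the tapped level are simply idle, which corresponds to the unused adders mentioned above.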
Figure 6 shows the overall architecture of the SAD unit. The
SAD unit consists of memory modules, line buffers, and processing
elements (PEs). The candidate image is distributed among memory
modules C-MEMs, whereas the reference image is distributed among
memory modules R-MEMs. The number of C-MEMs and R-MEMs is
$W_{\max}$. The number of registers in a line buffer is the same
as the image size $I_W$.
The PEs are classified into 3 types: PE1, PE2, and PE3. The
detailed structure of each PE is shown in Fig. 7. The PE1s are
used to compute ADs. The PE2s and PE3s are used to add ADs
and sum the results. Each PE has a pipeline register so that
AD computation and addition are overlapped in execution by
PE-level pipelining. Each PE has a search area controller
and a minimum value detector to generate control signals
locally. As a result, the delays for distributing control signals
are significantly reduced. The major drawback of the proposed
architecture is its large amount of hardware. The number of PE1s
is given by Eq. (2).

$$I_W \times W_{\max} \qquad (2)$$

For example, the number of PE1s is 4096 for $W_{\max} = 16$ and
$I_W = 256$. Moreover, the number of I/O pins is large for
parallel data transfer. To overcome these problems, the SAD unit
is designed based on a bit-serial pipeline architecture. The
bit-serial architecture is also useful for multi-chip implementation
Fig. 6. Overall architecture of the SAD unit ($W_{\max} = 4$).
since it reduces the number of I/O pins between chips. The
SAD unit has such a large number of PEs in order to
balance performance and area. The number
of PEs could be reduced by lowering the degree of parallelism,
namely, by computing SADs sequentially. However, this would not
reduce the total area of the SAD unit, since serial computation of
SADs requires significant additional hardware, such as register
files to store intermediate results and image data, while the
execution time greatly increases.
Figure 8 shows a die micrograph and the features of the stereo
matching VLSI, which is designed in a 0.18 µm CMOS technology.
Its performance is estimated to be 81 µs per 3-D image
at 100 MHz. That is, the stereo matching VLSI produces
more than 1000 3-D images per second. The image size and
the maximum window size can be extended by using multiple
chips without decreasing performance, owing to the scalability of
the proposed architecture. The performance increases in
proportion to the number of chips.
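
As a quick check of these figures, the frame rate and the cycle count per frame follow from the stated 81 µs per frame and 100 MHz clock. The PE1 count follows from Eq. (2) with the chip parameters listed in Fig. 8 (32 x 32 input image, maximum window size 4 x 4), assuming that $I_W = 32$ and that each AD unit corresponds to one PE1:

$$\frac{1}{81\ \mu\text{s}} \approx 1.23 \times 10^{4}\ \text{frames/s}, \qquad 81\ \mu\text{s} \times 100\ \text{MHz} = 8100\ \text{clock cycles per frame}, \qquad I_W \times W_{\max} = 32 \times 4 = 128$$

The first value is consistent with the claim of more than 1000 frames/sec, and the last matches the 32 x 4 AD units listed in Fig. 8.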
IV. CONCLUSION
A parallel stereo vision VLSI processor with a simple
interconnection network is proposed. The keys to its performance
are the hierarchical approach using non-overlapping regions on the
reference image, and the scheduling that exploits the parallelism
at the reference-and-candidate-window level.
REFERENCES
[1] S. T. Barnard, "Stereo vision," in Encyclopedia of Artificial Intelligence.
New York: John Wiley, pp. 1083-1090 (1987).
Fig. 7. Block diagram of PEs: (a) PE1, (b) PE2, (c) PE3. MVD: Minimum Value Detector; SAC: Search Area Controller.
[2] M. Hariyama, T. Takeuchi, and M. Kameyama, "VLSI Processor for Reliable
Stereo Matching Based on Adaptive Window-Size Selection," in
Proc. International Conference on Robotics and Automation,
pp. 1168-1173 (2001).
[3] M. Hariyama and M. Kameyama, "VLSI Processor for Reliable Stereo
Matching Based on Window-Parallel Logic-in-Memory Architecture,"
Digest of Technical Papers, 2004 Symposium on VLSI Circuits,
pp. 166-169 (2004).
Technology      0.18 µm CMOS
Die size        2.8 mm x 2.8 mm
Input image     32 x 32 pixels with 16 gray levels
Wmax            4 x 4
Performance     81 µs/frame at 100 MHz
# of AD units   32 x 4

Fig. 8. Die micrograph and features.
