Optimal Periodic Memory Allocation for Image Processing With Multiple Windows by 亀山  充隆
Author 亀山  充隆
journal or publication title IEEE Transactions on Very Large Scale Integration (VLSI) Systems
volume 17
number 3
page range 403-416
year 2009
URL http://hdl.handle.net/10097/46854
doi: 10.1109/TVLSI.2008.2004547
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH 2009 403
Optimal Periodic Memory Allocation for Image
Processing With Multiple Windows
Yasuhiro Kobayashi, Member, IEEE, Masanori Hariyama, Member, IEEE, and Michitaka Kameyama, Fellow, IEEE
Abstract—One major issue in designing image processors is
to design a memory system that supports parallel access with a
simple interconnection network. This paper presents an efficient
memory allocation to minimize the number of memory modules
and processing elements with a parallel access capability when
multiple windows with arbitrary shapes are specified. This paper
also presents an efficient search method based on regularity of
window-type image processing. We give some practical exam-
ples including a stereo-matching processor for acquiring 3-D
information, and an optical-flow processor for motion estimation.
These examples show that the numbers of memory modules are
reduced to 2.7% and 10%, respectively, in comparison with a
basic approach. It is also shown that the search time is less than 1
ms for practical image sizes and window sizes.
Index Terms—Image processing, memory design, optimization,
parallel processors.
I. INTRODUCTION
HIGHLY parallel image processors require a complex interconnection network between memory modules
and processing elements (PEs) for parallel memory access. One
typical image processor consists of memory modules, PEs, in-
terconnection network between memory modules and PEs, and
an inter-PE network. The complexity of the interconnection net-
work between the memory modules and PEs increases with
the number of memory modules, and it causes significant over-
head in delay and power in deep-submicrometer and more ad-
vanced technologies since the delay and the power of inter-
connection units are more dominant than those of logic units.
The interconnection problem is also serious in image proces-
sors using field-programmable gate arrays (FPGAs) that have
large interconnection delays because of complex programmable
switch blocks. To solve the problem, we introduce an architec-
ture where a memory module is connected to a single processing
element. This architecture model can be thought of as a simplified form of recent image processors based on single instruction
multiple data (SIMD) architecture [1], [2]. In the architecture,
the total hardware amount linearly increases with the number
of memory modules. Memory allocation has a great impact on
the number of memory modules. Therefore, this paper presents
memory allocation to minimize the number of memory modules
with a parallel access capability.
Manuscript received September 20, 2007; revised February 06, 2008. First
published January 20, 2009; current version published February 19, 2009.
Y. Kobayashi is with Oyama National College of Technology, Oyama 323-
0806, Japan (e-mail: y-kobayashi@oyama-ct.ac.jp).
M. Hariyama and M. Kameyama are with Graduate School of Information
Sciences, Tohoku University, Sendai 980-8579, Japan.
Digital Object Identifier 10.1109/TVLSI.2008.2004547
This paper targets window-type image processing. The
window-type image processing is widely used in practical
applications. Its examples include filtering, template matching
and morphology. An application usually requires several types
of windows. The memory allocation must support parallel
access for all types of windows that are given as a specification.
There are a number of research works which have ad-
dressed memory allocation [3]–[11]. They are classified into
two groups: [3]–[8] and [9]–[11]. The first group handles
array-variable clustering whereby one or more array variables
are stored in the same memory module based on cost and
performance considerations. Reference [3] minimizes the
area. Reference [4] considers memory hierarchy to handle
the trade-off between performance and cost. Reference [5]
minimizes page misses in a bank while respecting data depen-
dences. Reference [6] minimizes energy and/or area based on
instruction-level parallelism (ILP)-based memory allocation.
Reference [8] also minimizes energy based on scheduling as
well as memory allocation. In the array-variable clustering,
data in an array are basically allocated to the same memory
module/bank. Therefore, it is difficult to exploit parallelism
between data in an array.
The second group divides an array into several sub-arrays,
and allocates sub-arrays to memory modules. This type of
memory allocation makes it possible to exploit parallelism between data
in an array by distributing the concurrently accessed sub-arrays
among different memory modules. This paper targets
window-type image processing using multiple windows. In
practical cases, an entire image processing is made up of
several sub-tasks such as filtering, template matching, segmen-
tation and so on. Window-type image processing is frequently
encountered in filtering and template matching. Filtering is
used as preprocessing for noise removal, smoothing, edge
detection, and so on, each of which requires different windows.
Even a single task such as edge detection may use several windows.
For example, Sobel edge detection, Marr–Hildreth
edge detection [12], and K-forms edge detection [13]
require two, two, and eight different windows, respectively.
Mathematical morphology [14]–[18] is another kind of
window-type image processing; it is frequently used for
filtering and geometric analysis by means of structuring
elements, which correspond to windows. In many applications,
it is advantageous to use different structuring elements with
various sizes and shapes. In image processing, exploiting
pixel-level parallelism plays an essential role in speeding up
computation. Hence, this paper focuses on the
second group. This type of memory allocation takes into ac-
count regularity of target processing to reduce the larger search
1063-8210/$25.00 © 2009 IEEE
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on March 10,2010 at 02:29:01 EST from IEEE Xplore.  Restrictions apply. 
TABLE I
COMPARISON BETWEEN THIS WORK AND PREVIOUS WORKS
IN TERMS OF PROBLEM DEFINITION
space than the array-variable clustering. The method of [10] was originally
developed for motion stereo and can be applied to window-type image
processing with a single square window at a single resolution. Its
objective is to minimize the number of memory modules. The
whole image is equally divided into square regions of the same
size as the window. The pixels of a square region are allocated
to different memory modules. The memory allocation in a
square region is repeated horizontally and vertically throughout
the image. As a result, the pixels in a window are distributed
among different memory modules. Reference [9] addresses the
problem of system power reduction through transition count
minimization on the memory address bus when arrays in
behavioral specifications are accessed from memory modules. It
targets memory-intensive applications, such as digital signal
processing and image processing, that exhibit regular access
patterns. The authors exploit regularity and spatial locality in
the memory accesses. Because an image can be regarded as a 2-D array,
the method is applicable to image processing. The 2-D array is equally
divided into square regions of the same size called “tiles”.
The memory allocation for a tile is repeated horizontally and
vertically. The objective function is to minimize the power
consumed by address bus. This method is partially applicable
to the minimization of the number of memory modules. For the
minimization of the number of memory modules, it provides
the same result as [10], because it exploits horizontal and
vertical regularity. The method of [11] was originally developed for
stereo matching with a hierarchical matching approach that reduces the
amount of computation. However, unlike our method, it handles only a
single square window. Table I summarizes the features of the previous
works and this work. In terms of image resolution, only this work
and [11] handle multiple resolutions. In terms of window shape and
number, only this work handles multiple windows of arbitrary shape.
In terms of periodicity, only this work handles arbitrary
directions with arbitrary periods, which allows us to reduce the
number of memory modules compared with the previous works.
Although considering arbitrary directions and periods enlarges the
search space, this work is guaranteed to find an optimal solution.
Given multiple windows, the simplest way of finding a
memory allocation for parallel access is to apply [10] (or [9]) to
the minimum rectangular window that includes all the given windows.
The concept behind this method is to approximate the
multiple windows by a single rectangular window. However, this
method can require a large number of memory modules when
the approximation is poor. In this paper, to further reduce
the required number of memory modules, the whole image
is equally divided into parallelograms, and the memory allocation
for one parallelogram is repeated along the sides of the
parallelogram. A parallelogram is formed by a pair of vectors that
may have different lengths and different directions. Since a parallelogram
is a generalization of a square, its use results in a further
reduction of the required memory modules. Its disadvantage compared with
the rectangular-window-based approach is the larger search space.
To solve this problem, this paper proposes an efficient search
method based on the regularity of window-type processing. To
reduce the search space, the search over several vector pairs is
replaced with the search for a single vector pair, called an “equivalent
vector pair,” which provides the same memory allocation as all of them.
This paper is organized as follows. In Section II, we formu-
late the memory allocation problem. In Section III, we show
an efficient search method based on the regularity of a periodic
memory allocation. In Section IV, we give some practical exam-
ples including a stereo-matching processor for acquiring 3-D in-
formation, and an optical-flow processor for motion estimation.
These examples show that the numbers of memory modules are
reduced to 2.7% and 10%, respectively, in comparison with the
conventional method [10]. They also show that the search time
is less than 1 ms on a PC (Pentium4 at 2 GHz) for practical
image sizes and window sizes. In Section V, we discuss the ex-
tension of the proposed method to more practical problems. In
Section VI, we state our conclusions.
II. PROBLEM FORMULATION
A. Target Processing
We consider a window-type image processing. Let us begin
with image processing using a single window. In this type of
processing, the output/intermediate output depends on a small
neighborhood of an input image, where the neighborhood size
is fixed and given as a window. Algorithms of this type
frequently appear in practical situations, for example in spatial
filtering, morphology, and image matching. Moreover, they usually have
a high degree of parallelism and are well suited to VLSI
implementation.
We use window-serial-and-pixel-parallel scheduling as
shown in Fig. 1. In this scheduling, operations are performed
in parallel on the pixels within a window, whereas the windows
themselves are processed serially. Fig. 1(a) shows the
location of the window at each step. The thick line denotes a
window. Fig. 1(b) shows the scheduled data-flow graph (SDFG)
corresponding to Fig. 1(a). A node in the SDFG denotes an
operation, and the label on a node denotes its operation type;
two operation types appear in this example. There is an edge
between two nodes when the output of one node is used as an input
of the other. At Step 1 in Fig. 1, the pixels (0,0), (0,1), (0,2),
and (1,1) are used as inputs of operations of the first type. These
pixels must be accessed in parallel since the operations of the
first type are performed in parallel. Their results are used as
inputs of the operation of the second type. As the
location of the window changes, the input pixels change.
An image-processing system usually requires various win-
dows for filtering, edge detection, morphology, and so on.
KOBAYASHI et al.: OPTIMAL PERIODIC MEMORY ALLOCATION FOR IMAGE PROCESSING WITH MULTIPLE WINDOWS 405
Fig. 1. DFG of a window-serial-and-pixel-parallel schedule for a window-type
processing with a single window denoted by a set of gray pixels. (a) Image plane.
(b) SDFG.
Fig. 2. Target architecture.
Hence, the image-processing system requires a memory alloca-
tion that enables parallel access for such various windows.
B. Target Architecture
Fig. 2 shows our target architecture. A single PE is connected
to each memory module. Single-port memories are used as
memory modules. Multi-port memories are also one efficient
method for parallel access. Although an N-port memory pro-
vides more flexibility than N single-port memories, the use
of a multi-port memory imposes a larger hardware amount
because of additional bit-lines and decoders. Multi-port mem-
ories are efficient especially in the case when access patterns
are not predetermined. Window-type processing has a regular
access pattern. Hence, it is possible that pixels in a window
are distributed among multiple single-port memory modules by
appropriate memory allocation. As a result, use of single-port
memories is suitable for window-type processing because of the
smaller area. All pixels of an input image are distributed among
the memory modules. The PEs perform the pixel-parallel operations
(the first operation type in Fig. 1(b)) according to the
window-serial-and-pixel-parallel scheduling. Their outputs are used
as inputs of the unit that performs the operations of the second
type. The architecture model can be extended by adding inputs to
the PEs as required. In the target architecture, the hardware
amount is determined by the memory allocation, since the number of
PEs is determined by the number of memory modules. Therefore,
memory allocation plays an essential role in minimizing the
hardware amount of the target architecture.
Fig. 3. Memory allocation.
Fig. 4. Memory allocation for two windows. (a) Windows. (b) Allocation1. (c)
Allocation2.
C. Memory Allocation for Image Processing
Memory allocation is a task that assigns pixels to memory
modules. Fig. 3 shows an example of memory allocation for the
window shown in Fig. 1(a). The label on each pixel denotes the
memory module to which the pixel is assigned. For example,
pixels: (0,0), (0,1), (1,0) and (1,1) are assigned to memory mod-
ules: M1, M2, M3 and M4, respectively.
To meet the timing constraint of the window-serial-and-
pixel-parallel scheduling, all the pixels in a window must be
accessed in parallel for all possible locations of the window.
In other words, all the pixels in a window must be distributed
among different memory modules for all possible locations of
the window. Readers can verify that the memory allocation
shown in Fig. 3 enables parallel memory access for any
location of the window.
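The parallel access condition can be checked mechanically. The following Python sketch is illustrative only: the function name `supports_parallel_access`, the 8 × 8 image size, and the allocation `(2 * x + y) % 4` are assumptions of this example. That allocation was chosen because, with modules numbered from 0, it reproduces the labels quoted above for the pixels (0,0), (0,1), (1,0), and (1,1); the pixel tuples are read as (x, y), which is also an assumption.

```python
from itertools import product

def supports_parallel_access(module_of, window, width, height):
    """Return True if, for every placement of `window` inside a
    width x height image, all window pixels fall in distinct modules."""
    max_dx = max(dx for dx, _ in window)
    max_dy = max(dy for _, dy in window)
    for x0, y0 in product(range(width - max_dx), range(height - max_dy)):
        modules = {module_of(x0 + dx, y0 + dy) for dx, dy in window}
        if len(modules) != len(window):
            return False
    return True

# Window of Fig. 1 at Step 1, written as non-negative offsets.
WINDOW = [(0, 0), (0, 1), (0, 2), (1, 1)]

def candidate(x, y):
    # Assumed allocation with four modules, numbered 0..3.
    return (2 * x + y) % 4

def naive(x, y):
    # Counterexample with four modules: (0,2) and (1,1) collide.
    return (x + y) % 4

print(supports_parallel_access(candidate, WINDOW, 8, 8))  # True
print(supports_parallel_access(naive, WINDOW, 8, 8))      # False
```

The check visits every window location, which is exactly the exhaustive reading of the condition stated above.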
For a memory allocation for image processing with multiple
windows, all the pixels in each window must be accessed in par-
allel for all possible locations of the window. Fig. 4 shows an
example of memory allocation for two windows. Fig. 4(a) shows
the two windows, $W_1$ and $W_2$. Fig. 4(b) shows a memory allocation
that allows parallel access for both windows: all pixels
in each window are distributed among different memory modules
for all possible locations of the window. Fig. 4(c) shows
a memory allocation that is not capable of parallel access. For one
of the two windows, all pixels are distributed among different memory
modules for all possible locations, so this memory allocation
enables parallel memory access for that window. For the other
window, however, all pixels are allocated to the same memory module,
so the memory allocation does not enable parallel memory
access for it.
Fig. 5. Example of a periodic memory allocation.
D. Periodic Memory Allocation
From a practical point of view, a memory allocation should have a
simple addressing function to determine which memory module
stores each pixel. If the addressing function is complex, the area
and delay of the addressing circuit become large. In the worst
case, a lookup table covering all the pixels is required for addressing.
The periodic memory allocation has a simple addressing function
because of its regularity. Let $N$ be the number of memory
modules. Let $n_i$ be the number of pixels that are allocated to
the memory module $M_i$ for $i = 1, \ldots, N$. Let $(x_{i,j}, y_{i,j})$ be the
coordinates of a pixel that is allocated to the memory module $M_i$
for $j = 1, \ldots, n_i$. Then, we define a periodic memory allocation
as a memory allocation where $(x_{i,j}, y_{i,j})$ is expressed as
$$(x_{i,j}, y_{i,j}) = (x_i^{0}, y_i^{0}) + a\,\mathbf{p} + b\,\mathbf{q} \tag{1}$$
where $\mathbf{p} = (p_x, p_y)$ and $\mathbf{q} = (q_x, q_y)$ are vectors that
represent periods (called period vectors); the variables $a$ and $b$ are
integers; and $(x_i^{0}, y_i^{0})$ are the coordinates of the
reference pixel allocated to $M_i$. Note that an arbitrary pixel among
those allocated to the same memory module can be selected as the
reference pixel.
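One consequence of the definition is that, when every module uses the same period vectors, two pixels share a memory module exactly when their difference is an integer combination of the two period vectors. A minimal Python sketch of that test follows; the function names and the example vectors (2,0) and (1,2) are assumptions of this sketch, not values taken from Fig. 5.

```python
def in_lattice(d, p, q):
    """True if d = a*p + b*q for some integers a and b (Cramer's rule)."""
    det = p[0] * q[1] - p[1] * q[0]
    if det == 0:
        raise ValueError("period vectors must be linearly independent")
    a_num = d[0] * q[1] - d[1] * q[0]   # a = a_num / det
    b_num = p[0] * d[1] - p[1] * d[0]   # b = b_num / det
    return a_num % det == 0 and b_num % det == 0

def same_module(pixel_a, pixel_b, p, q):
    """Two pixels share a module iff their difference is a lattice vector."""
    d = (pixel_b[0] - pixel_a[0], pixel_b[1] - pixel_a[1])
    return in_lattice(d, p, q)

def pixels_of_module(reference, p, q, coeff_range):
    """Enumerate the right-hand side of (1) for a, b in coeff_range."""
    return sorted({(reference[0] + a * p[0] + b * q[0],
                    reference[1] + a * p[1] + b * q[1])
                   for a in coeff_range for b in coeff_range})

P, Q = (2, 0), (1, 2)   # hypothetical period vectors
print(pixels_of_module((0, 0), P, Q, range(2)))
# [(0, 0), (1, 2), (2, 0), (3, 2)]
print(same_module((0, 0), (3, 2), P, Q))  # True
print(same_module((0, 0), (1, 0), P, Q))  # False
```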
Fig. 5(a) shows an example of a periodic memory allocation
for the SDFG shown in Fig. 1. The label on a pixel
denotes which memory module stores that pixel. For example,
the pixels (0,0), (0,1), (1,0), and (1,1) are allocated to $M_1$, $M_2$,
$M_3$, and $M_4$, respectively. From Fig. 5(b), the coordinates of
the pixels allocated to $M_1$ are given by
(2)
where the coordinates of the reference pixel and the period vec-
tors are , , and , respectively. The
pixels with label 1 in Fig. 5(b) are given by ,
(1,0), (2,0), (0,1), (1,1) and (2,1), respectively. Figs. 5(c)–(e)
show the pixels allocated to $M_2$, $M_3$, and $M_4$, respectively.
These figures show that the same period vectors as for $M_1$ are used
for $M_2$, $M_3$, and $M_4$. The memory allocation shown in Fig. 5
therefore satisfies (1), that is, the definition of a periodic memory
allocation.
As mentioned at the beginning of this section, the advantage
of the periodic memory allocation is its simple addressing func-
tion. For the periodic memory allocation shown in Fig. 5(a),
the addressing function , which allocates pixel to
memory module , is given by
(3)
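To illustrate why a periodic allocation admits a cheap addressing function, the following Python sketch gives a generic one; it is not necessarily the function in (3). It assumes period vectors of the special form p = (p_x, 0) and q = (q_x, q_y) with p_x > 0 and q_y > 0, and the name `module_index` and the example values 2, 1, 2 are hypothetical.

```python
def module_index(x, y, p_x, q_x, q_y):
    """Map pixel (x, y) to a module index in range(p_x * q_y).

    Assumes period vectors p = (p_x, 0) and q = (q_x, q_y) with
    p_x > 0 and q_y > 0 (an assumption of this sketch).
    """
    k, row = divmod(y, q_y)      # y = k*q_y + row with 0 <= row < q_y
    col = (x - k * q_x) % p_x    # subtract k copies of q, then reduce by p
    return row * p_x + col

# Module labels for a 4 x 4 corner of the image, indexed as labels[y][x].
labels = [[module_index(x, y, 2, 1, 2) for x in range(4)] for y in range(4)]
for row in labels:
    print(row)
# [0, 1, 0, 1]
# [2, 3, 2, 3]
# [1, 0, 1, 0]
# [3, 2, 3, 2]
```

Only two modulo operations, one multiplication, and a few additions are needed, which is the kind of simplicity the text attributes to periodic allocations.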
E. Optimal Memory Allocation
Let us minimize the number of memory modules when windows
$W_1, W_2, \ldots, W_n$ are specified. The optimal memory
allocation is defined as the memory allocation that satisfies the
following conditions.
C1) For any location of each window, all pixels in the
window can be retrieved in parallel. In other words, all
pixels in each window are allocated to different memory
modules.
C2) Each pixel is allocated to a single memory module. This
condition ensures that the total memory capacity is min-
imized.
C3) The number of memory modules is minimized. This
condition ensures that the hardware amount is min-
imized in the target architecture as mentioned in
Section II-B.
C4) The memory allocation is a periodic one. This condition
ensures that the hardware for the addressing function is
small.
Fig. 4(b) gives an example of the optimal memory allocation
for multiple windows shown in Fig. 4(a). The condition C1 is
satisfied for and . The condition C2 is satisfied since
each pixel has a single label. The condition C3 is satisfied since
the number of memory modules is 3 and is equal to the min-
imum number of memory modules required for parallel access,
i.e., the number of pixels in each window. The condition C4 is
satisfied since the memory allocation is obtained from (1) with
and . Fig. 5(a) also gives an example of
the optimal memory allocation for a single window shown in
Fig. 1(a). Readers can verify this in the same manner as for
Fig. 4(b).
III. SEARCH METHOD
A. Estimation of the Number of Memory Modules
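Given the period vectors $\mathbf{p} = (p_x, p_y)$ and $\mathbf{q} = (q_x, q_y)$,
the number of memory modules is estimated by the area of the
parallelogram made by the period vectors $\mathbf{p}$ and $\mathbf{q}$.
The area $S$ of the parallelogram is given by
$$S = |p_x q_y - p_y q_x| \tag{4}$$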
For the example shown in Fig. 5
(5)
and this area is exactly the same as the number of memory modules.
This is because the memory allocation for a whole image is
given by repeating the memory allocation for the parallelogram,
and the parallelogram must be filled with pixels allocated to
different memory modules. From these observations, finding the
optimal memory allocation reduces to finding the period vectors
that form the smallest parallelogram while still satisfying the
parallel access condition.
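The relation between the parallelogram area and the number of memory modules can be checked numerically. The Python sketch below is an illustration under assumptions: the function names, the block size, and the example period vectors are hypothetical. It counts distinct modules by collecting one representative pixel per module and compares the count with the area.

```python
def parallelogram_area(p, q):
    """Area of the parallelogram spanned by integer vectors p and q."""
    return abs(p[0] * q[1] - p[1] * q[0])

def count_modules(p, q, size=8):
    """Count distinct modules among the pixels of a size x size block.

    `size` must be large enough for the block to contain every module;
    8 is sufficient for the small examples used here.
    """
    det = p[0] * q[1] - p[1] * q[0]
    reps = []   # one representative pixel per module found so far
    for x in range(size):
        for y in range(size):
            shares = any(
                ((x - rx) * q[1] - (y - ry) * q[0]) % det == 0
                and (p[0] * (y - ry) - p[1] * (x - rx)) % det == 0
                for rx, ry in reps
            )
            if not shares:
                reps.append((x, y))
    return len(reps)

for p, q in [((2, 0), (1, 2)), ((1, 2), (0, 4)), ((3, 0), (1, 1))]:
    print(p, q, parallelogram_area(p, q), count_modules(p, q))
```

For each pair the two printed numbers agree, and the first two pairs, which generate the same set of integer combinations, give the same count.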
B. Basic Search Algorithm
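We suppose that windows $W_1, W_2, \ldots, W_n$ are specified.
Period vectors $\mathbf{p}$ and $\mathbf{q}$ are treated as a pair (called a vector
pair $(\mathbf{p}, \mathbf{q})$).
We explain the variables used in this algorithm. For
$i = 1, \ldots, n$, let $w_i$ and $h_i$ be the width and height of the
minimum bounding rectangle of $W_i$, respectively. Let $w_{\max}$ and $h_{\max}$ be
the width and height of the rectangular window that includes
all the windows, respectively. The values $w_{\max}$ and $h_{\max}$ are given by
$$w_{\max} = \max_{1 \le i \le n} w_i, \qquad h_{\max} = \max_{1 \le i \le n} h_i \tag{6}$$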
For example, let us consider two windows: and
shown in Fig. 6(a) and (b), respectively. Fig. 6(c) and (d)
show the minimum bounding rectangles of and , re-
spectively. From (6), ,
. As a result, we obtain
the minimum bounding rectangle of and as shown in
Fig. 6(e). Let $S$ be the area of the parallelogram made by a
vector pair $(\mathbf{p}, \mathbf{q})$, that is, the number of memory modules of
the memory allocation obtained by the vector pair. Let $S_{\min}$ be
the current minimum number of memory modules. Let $V$ be
the set of vector pairs to be checked.
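The bounding-rectangle quantities are easy to compute when each window is stored as a set of pixel offsets. The following Python sketch assumes that representation; the function names and the two example windows are hypothetical and are not the windows of Fig. 6.

```python
def bounding_size(window):
    """Width and height of the minimum bounding rectangle of one window."""
    xs = [x for x, _ in window]
    ys = [y for _, y in window]
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1

def enclosing_size(windows):
    """Width and height of the rectangle that can contain every window,
    taken as the maxima of the individual widths and heights."""
    sizes = [bounding_size(w) for w in windows]
    return max(w for w, _ in sizes), max(h for _, h in sizes)

w1 = {(0, 0), (1, 0), (2, 0), (1, 1)}   # hypothetical T-shaped window
w2 = {(0, 0), (0, 1), (0, 2)}           # hypothetical vertical bar
print(bounding_size(w1))          # (3, 2)
print(bounding_size(w2))          # (1, 3)
print(enclosing_size([w1, w2]))   # (3, 3)
```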
Fig. 7 shows the outline of the search algorithm. Lines 1–3
initialize variables. The initial value of the current minimum
number of memory modules, $S_{\min}$, is determined
by the rectangular memory allocation [10]. As mentioned in
the previous paragraph, all the windows are approximated by
the rectangular window of width $w_{\max}$ and height $h_{\max}$ that
includes them. For such a rectangular window, the rectangular memory
allocation is mathematically guaranteed to provide
a memory allocation with the minimum number of memory
modules and the capability of completely parallel access. To
obtain an allocation for a whole image, the rectangular memory
allocation regularly repeats the allocation for the rectangular
region. For example, Fig. 8 shows the result of the rectangular
memory allocation for the rectangular window shown
Fig. 6. Rectangular window. (a) $W_1$; (b) $W_2$; (c) bounding rectangle for $W_1$; (d) bounding rectangle for $W_2$; (e) bounding rectangle for $W_1$ and $W_2$.
Fig. 7. Basic algorithm.
Fig. 8. Result of rectangular memory allocation.
in Fig. 6(e). Hence, the initial value of $S_{\min}$ is defined by the
number of memory modules for the rectangular window of size
$w_{\max} \times h_{\max}$, that is
$$S_{\min} = w_{\max} h_{\max} \tag{7}$$
For the example of Fig. 6
(8)
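The initial values of the current optimal period vectors
$\mathbf{p}_{\mathrm{opt}}$ and $\mathbf{q}_{\mathrm{opt}}$ are also determined by the
rectangular memory allocation. As shown in Fig. 8, the rectangular
memory allocation is a periodic one. Hence, the initial
values of $\mathbf{p}_{\mathrm{opt}}$ and $\mathbf{q}_{\mathrm{opt}}$ are given by the period
vectors of the rectangular memory allocation for the enclosing
rectangular window of width $w_{\max}$ and height $h_{\max}$, that is
$$\mathbf{p}_{\mathrm{opt}} = (w_{\max}, 0), \qquad \mathbf{q}_{\mathrm{opt}} = (0, h_{\max}) \tag{9}$$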
For the example shown in Fig. 6
(10)
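The rectangular memory allocation used for initialization can be sketched in a few lines of Python. The function names and the 3 × 2 rectangle are assumptions of this example, and the row-major numbering of modules inside the rectangle is one possible choice rather than the labeling of Fig. 8.

```python
def rectangular_module(x, y, rect_w, rect_h):
    """Module index for the allocation that tiles the image with a
    rect_w x rect_h rectangle; period vectors are (rect_w, 0), (0, rect_h)."""
    return (y % rect_h) * rect_w + (x % rect_w)

def window_is_conflict_free(x0, y0, rect_w, rect_h):
    """True if the rect_w x rect_h window at (x0, y0) hits distinct modules."""
    modules = {rectangular_module(x0 + dx, y0 + dy, rect_w, rect_h)
               for dx in range(rect_w) for dy in range(rect_h)}
    return len(modules) == rect_w * rect_h

RECT_W, RECT_H = 3, 2   # hypothetical enclosing rectangle
print(all(window_is_conflict_free(x0, y0, RECT_W, RECT_H)
          for x0 in range(-5, 6) for y0 in range(-5, 6)))  # True
```

Because every specified window fits inside the enclosing rectangle, any allocation that is conflict-free for the rectangle is also conflict-free for each window, which is why it serves as a safe starting point.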
The initial value of is expressed as
(11)
where and are the maximum values of and
coordinates, respectively. Let be the current op-
timal vector pair.
The “for” loop in Fig. 7 finds, among all vector pairs in $V$, a vector
pair $(\mathbf{p}, \mathbf{q})$ that gives the optimal memory allocation.
For each candidate pair, one function returns the
number of memory modules obtained by (4), and a second function,
the parallel access check, returns true if the
memory allocation obtained by $(\mathbf{p}, \mathbf{q})$ satisfies the condition
C1, that is, the parallel access condition. In each iteration, the
algorithm maintains the current optimal vector
pair and the area of the parallelogram made by it.
These two values are updated if the number of memory modules
obtained by $(\mathbf{p}, \mathbf{q})$ is less than the current minimum
and the parallel access check succeeds.
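The loop described above can be sketched in Python as follows. This is an illustration, not a transcription of Fig. 7: the function names, the restriction of vector coordinates to `0..x_max` and `0..y_max`, and the 4 × 4 range of window locations are assumptions. The parallel access check is deliberately the exhaustive one that visits every window location.

```python
from itertools import product

def area(p, q):
    return abs(p[0] * q[1] - p[1] * q[0])

def same_module(a, b, p, q):
    """True if pixels a and b differ by an integer combination of p and q."""
    det = p[0] * q[1] - p[1] * q[0]
    dx, dy = b[0] - a[0], b[1] - a[1]
    return ((dx * q[1] - dy * q[0]) % det == 0
            and (p[0] * dy - p[1] * dx) % det == 0)

def check_parallel_access(p, q, windows, width, height):
    """Exhaustive check: every window, every location, every pixel pair."""
    for window in windows:
        for x0, y0 in product(range(width), range(height)):
            placed = [(x0 + dx, y0 + dy) for dx, dy in window]
            for i, a in enumerate(placed):
                if any(same_module(a, b, p, q) for b in placed[i + 1:]):
                    return False
    return True

def basic_search(windows, rect_w, rect_h, x_max, y_max, width=4, height=4):
    # Start from the rectangular allocation, then try every vector pair.
    best_pair, best_area = ((rect_w, 0), (0, rect_h)), rect_w * rect_h
    coords = list(product(range(x_max + 1), range(y_max + 1)))
    for p, q in product(coords, repeat=2):
        s = area(p, q)
        if 0 < s < best_area and check_parallel_access(p, q, windows, width, height):
            best_pair, best_area = (p, q), s
    return best_pair, best_area

WINDOW = [(0, 0), (0, 1), (0, 2), (1, 1)]   # window of Fig. 1 as offsets
best_pair, best_area = basic_search([WINDOW], 2, 3, 4, 4)
print(best_pair, best_area)
```

For a periodic allocation the loop over window locations is redundant, because shifting the window does not change which pixel differences occur; this is the redundancy that the check of Section III-C4 removes.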
We explain a straightforward way to implement the parallel access
check. First, the memory allocation for a whole image is made by a
vector pair $(\mathbf{p}, \mathbf{q})$. When $(\mathbf{p}, \mathbf{q})$ is given, the pixels stored in
each memory module are determined by (1) as described in
Section II-D. Next, for each of the specified windows,
we check whether all the pixels in the window are allocated to
different memory modules for all the locations of the window.
This straightforward check is time-consuming. An
efficient algorithm based on the periodicity of the periodic
memory allocation is presented in Section III-C4.
C. Improving Search Efficiency
1) Equivalent Vector Pair to Reduce the Search Space:
The search over several vector pairs can be replaced with the search
for a single vector pair called an “equivalent vector pair.” Let us
consider a vector pair $(\mathbf{p}', \mathbf{q}')$ for vectors $\mathbf{p}' = (p'_x, p'_y)$ and
$\mathbf{q}' = (q'_x, q'_y)$, where the elements of $\mathbf{p}'$ and $\mathbf{q}'$ are integers.
We call the vector pair $(\mathbf{p}', \mathbf{q}')$ the equivalent vector pair of
$(\mathbf{p}, \mathbf{q})$ if $(\mathbf{p}', \mathbf{q}')$ satisfies the following conditions.
D1) The linear combination of $\mathbf{p}$ and $\mathbf{q}$ with integer
coefficients can be given by the linear combination of $\mathbf{p}'$ and
$\mathbf{q}'$ with integer coefficients, or vice versa. This can be
expressed as
$$a\,\mathbf{p} + b\,\mathbf{q} = c\,\mathbf{p}' + d\,\mathbf{q}' \tag{12}$$
where $a$, $b$, $c$, and $d$ are integer scalars.
D2) The area of the parallelogram made by $(\mathbf{p}', \mathbf{q}')$ is equal
to the area of the parallelogram made by $(\mathbf{p}, \mathbf{q})$.
D3) The vector $\mathbf{p}'$ is parallel to the $x$-axis. In other words,
$p'_y = 0$.
The conditions D1 and D2 denote that a memory allocation
obtained by a vector pair $(\mathbf{p}', \mathbf{q}')$ is equal to a memory allocation
obtained by a vector pair $(\mathbf{p}, \mathbf{q})$. There are some vector pairs
which satisfy the conditions D1 and D2. The condition D3 is used
to uniquely select the vector pair from them.
Fig. 9. Deriving an equivalent vector pair.
Fig. 9 shows a method of deriving an equivalent vector pair
and . Lines 1 and 2 initialize vari-
ables. Lines 3–12 basically calculate
using the Euclidean algorithm, where is the
greatest common measure of and . Line 12 calculates .
Since there are several choices of , lines 14 to 21 uniquely
select with the minimum positive -coordinate. For ex-
ample, Fig. 10(a) shows an equivalent vector pair of
vector pairs shown in Figs. 10(b)–(d). Gray pixels de-
note linear combinations of and with integer coefficients,
and they are stored in the same memory modules. Let us prove
that the vector pair shown in Fig. 10(a) is the equivalent vector
pair of the vector pair shown in Fig. 10(b). The two vector
pairs and satisfy the condition D1 since gray
pixels in Fig. 10(b) conform to those in Fig. 10(a). The area of
parallelogram obtained by the vector pair is given by
(13)
The area of parallelogram obtained by the vector pair is
given by
(14)
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on March 10,2010 at 02:29:01 EST from IEEE Xplore.  Restrictions apply. 
KOBAYASHI et al.: OPTIMAL PERIODIC MEMORY ALLOCATION FOR IMAGE PROCESSING WITH MULTIPLE WINDOWS 409
Fig. 10. Equivalent vector pair.
Therefore, the two vector pairs $(\mathbf{p}, \mathbf{q})$ and $(\mathbf{p}', \mathbf{q}')$ satisfy the
condition D2. The vector pair $(\mathbf{p}', \mathbf{q}')$ satisfies the condition D3
since the $y$-coordinate of $\mathbf{p}'$ is 0. Similarly, readers can prove that
the vector pair shown in Fig. 10(a) is the equivalent vector pair
of the vector pairs shown in Fig. 10(c) and (d). By using the equivalent
vector pair, the search for three vector pairs can be replaced with
the search for a single equivalent vector pair. As a result, we
can reduce the search space for vector pairs.
We omit the proof that an equivalent vector pair $(\mathbf{p}', \mathbf{q}')$ exists for
an arbitrary vector pair $(\mathbf{p}, \mathbf{q})$ because the length of this
paper is limited.
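One way to compute a vector pair with these properties is to run the extended Euclidean algorithm on the y-coordinates of the two vectors. The Python sketch below follows that idea; it is not a transcription of Fig. 9. It assumes that the first vector of the result lies on the x-axis and that the x-coordinate of the second vector is reduced to the range from 0 to p'_x − 1; the function names and example pairs are hypothetical.

```python
def extended_gcd(a, b):
    """Return (g, s, t) with s*a + t*b == g, where g = gcd(a, b) >= 0."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        k = old_r // r
        old_r, r = r, old_r - k * r
        old_s, s = s, old_s - k * s
        old_t, t = t, old_t - k * t
    if old_r < 0:
        old_r, old_s, old_t = -old_r, -old_s, -old_t
    return old_r, old_s, old_t

def equivalent_pair(p, q):
    """Return ((p_x, 0), (q_x, q_y)) generating the same integer
    combinations as (p, q), with q_y > 0 and 0 <= q_x < p_x."""
    det = p[0] * q[1] - p[1] * q[0]
    if det == 0:
        raise ValueError("vectors must be linearly independent")
    g, s, t = extended_gcd(p[1], q[1])   # g = gcd of the y-coordinates
    p_x = abs(det) // g                  # keeps the parallelogram area
    q_x = (s * p[0] + t * q[0]) % p_x    # x-coordinate of s*p + t*q, reduced
    return (p_x, 0), (q_x, g)

# Three hypothetical pairs that describe the same allocation.
for pair in [((1, 2), (0, 4)), ((-1, 2), (1, 2)), ((2, 0), (1, 2))]:
    print(pair, "->", equivalent_pair(*pair))
# each line ends with ((2, 0), (1, 2))
```

Because all three inputs map to the same result, a search only needs to visit that single representative.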
2) Reduction of Computational Complexity Based on Constraint
of the Area of Parallelogram: We can reduce the search
space by using a constraint on the area of the parallelogram
obtained by a vector pair. For an equivalent vector pair $(\mathbf{p}', \mathbf{q}')$
with $\mathbf{p}' = (p'_x, 0)$ and $\mathbf{q}' = (q'_x, q'_y)$,
the area $S$ of the parallelogram made by $(\mathbf{p}', \mathbf{q}')$ is given by
$$S = |p'_x q'_y| \tag{15}$$
from (4). Note that $S$ also means the number of memory
modules for the memory allocation obtained by $(\mathbf{p}', \mathbf{q}')$. For a
window with $m$ pixels, at least $m$ memory modules must be
used to enable parallel access. This condition is expressed as
$$|p'_x q'_y| \ge m \tag{16}$$
We search for a vector pair with a smaller area than the current
minimum $S_{\min}$ to find the optimal memory allocation, as shown in
Lines 8–10 of Fig. 7. From (4), this condition is expressed as
$$|p'_x q'_y| < S_{\min} \tag{17}$$
Fig. 11. Improved algorithm.
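Hence, when the $x$-coordinate $p'_x$ of $\mathbf{p}'$ and the current minimum
$S_{\min}$ are given, the search space of the $y$-coordinate $q'_y$ of $\mathbf{q}'$ is
limited to
$$\frac{m}{|p'_x|} \le |q'_y| < \frac{S_{\min}}{|p'_x|} \tag{18}$$
where $m$ is the number of pixels in the window.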
3) Improved Search Algorithm: Based on the methods men-
tioned in Sections III-C1 and III-C2, we obtain an improved
algorithm shown in Fig. 11. The differences between the im-
proved algorithm and the basic one are as follows. First, we use
an equivalent vector pair in place of a vector pair .
Second, the number of elements of is reduced. When initial-
izing at the third Line, Fig. 11, is set to
(19)
from the condition D3. Hence, the number of elements of is
in the improved algorithm from (19). On the other
hand, the number of elements of is from (11) in
the basic algorithm. Third, the number of vector pairs to perform
is reduced as shown in
Line 7 of Fig. 11.
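The combined effect of the two reductions can be sketched in Python. This is a simplified variant written for illustration, not a transcription of Fig. 11: it enumerates only vector pairs of the form ((p_x, 0), (q_x, q_y)), visits areas in increasing order between the largest window size and the rectangular bound, and stops at the first pair that passes the check. The function names and example windows are hypothetical.

```python
def lattice_contains(d, p_x, q_x, q_y):
    """True if d is an integer combination of (p_x, 0) and (q_x, q_y)."""
    k, rem = divmod(d[1], q_y)
    return rem == 0 and (d[0] - k * q_x) % p_x == 0

def improved_search(windows, rect_w, rect_h):
    """Return (vector pair, number of modules) with the fewest modules."""
    min_modules = max(len(w) for w in windows)   # lower bound on the area
    # Differences between two distinct pixels of the same window.
    diffs = {(a[0] - b[0], a[1] - b[1])
             for w in windows for a in w for b in w if a != b}
    for s in range(min_modules, rect_w * rect_h):   # candidate areas
        for q_y in range(1, s + 1):
            if s % q_y:
                continue
            p_x = s // q_y
            for q_x in range(p_x):
                if not any(lattice_contains(d, p_x, q_x, q_y) for d in diffs):
                    return ((p_x, 0), (q_x, q_y)), s
    # Nothing smaller works: fall back to the rectangular allocation.
    return ((rect_w, 0), (0, rect_h)), rect_w * rect_h

t_window = [(0, 0), (0, 1), (0, 2), (1, 1)]   # window of Fig. 1 as offsets
h_bar = [(0, 0), (1, 0), (2, 0)]              # hypothetical horizontal bar
v_bar = [(0, 0), (0, 1), (0, 2)]              # hypothetical vertical bar
print(improved_search([t_window], 2, 3))      # (((2, 0), (1, 2)), 4)
print(improved_search([h_bar, v_bar], 3, 3))  # (((3, 0), (1, 1)), 3)
```

In the second example the rectangular allocation would need nine modules, while three suffice.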
4) Reduction of Computational Complexity of Parallel
Access Check: Let us improve the parallel access check
function to reduce the computational amount. Let us consider an
arbitrary pixel $P$. We define the
“parallel access pattern” of $P$ as the set of pixels that are
possibly accessed in parallel together with $P$. By using the
parallel access pattern, the condition C1 is rewritten as “The
pixels in the parallel access pattern of $P$ are allocated to
memory modules different from the memory module of $P$.”
Let us make the parallel access pattern of $P$ when a window
is specified. The pixel $P$ can take an arbitrary location in the
window. When the window shown in Fig. 6(a) is specified, the
possible locations of $P$ are shown in Fig. 12(a)–(d), respectively.
Gray pixels in each figure are accessed in parallel
together with $P$. When the coordinates of $P$ are $(x, y)$, the gray
pixels shown in Fig. 12(a)–(d) are expressed as (20)–(23),
respectively
(20)
(21)
(22)
(23)
Fig. 12. Method of making a parallel access pattern.
Fig. 13. Parallel access check for a single window. (a) Allocation1. (b) Allo-
cation2.
By finding the union of these sets, the parallel access pattern of
$P$ is expressed as
(24)
This parallel access pattern is shown in Fig. 12(e).
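The union described above can be computed directly from the window offsets: if the pixel under consideration occupies offset b of the window, every other offset a contributes the pixel displaced by a − b. The Python sketch below assumes this offset representation and excludes the pixel itself from its pattern; the function name and the example windows are hypothetical and are not the window of Fig. 6(a).

```python
def parallel_access_pattern(window, pixel=(0, 0)):
    """Pixels that can be accessed together with `pixel` for some
    placement of `window` (the pixel itself is excluded)."""
    x, y = pixel
    return {(x + ax - bx, y + ay - by)
            for (ax, ay) in window
            for (bx, by) in window
            if (ax, ay) != (bx, by)}

square = [(0, 0), (1, 0), (0, 1), (1, 1)]       # hypothetical 2 x 2 window
t_window = [(0, 0), (0, 1), (0, 2), (1, 1)]     # window of Fig. 1 as offsets
print(sorted(parallel_access_pattern(square)))  # 3 x 3 block minus the center
print(len(parallel_access_pattern(t_window)))   # 10
```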
By using the parallel access pattern, we improve
. Let us consider the
function when the
window shown in Fig. 6(a) is given. First, let us make a parallel
access pattern for an arbitrary pixel as shown in Fig. 12.
Next, let us make a memory allocation using a vector pair
in the minimum rectangle which includes the parallel
access pattern. Fig. 13(a) shows a result of a memory allocation
using and . Fig. 13(b) shows a result of a
memory allocation using and . The pixels
labeled as “ ” are allocated to the same memory module as
. A memory allocation enables parallel access if these pixels
are not included in the parallel access pattern. The memory
allocation shown in Fig. 13(a) is capable of parallel access since
all pixels labeled as “ ” are not included in the parallel access
pattern. On the other hand, the memory allocation shown in
Fig. 13(b) is not capable of parallel access since two pixels
labeled as “ ” are included in the parallel access pattern, that
is, gray pixels.
Fig. 14. Parallel access pattern for multiple windows. (a)  . (b)  . (c) Parallel access pattern for   . (d) Parallel access pattern for   .
Fig. 15. Parallel access check for multiple windows. (a) Parallel access check for     ,    . (b) Parallel access check for     ,   .
For multiple windows, we make a parallel access pattern for
each window. As in the case of a single window, the parallel
access check for each window is done using its parallel access
pattern. The memory allocation enables parallel access if
parallel access is possible for every window. We consider
the example of two windows shown in Fig. 14(a) and (b).
Fig. 14(c) and (d) show the parallel access patterns for the
two windows, respectively. Fig. 15(a) and (b) show the results
of two memory allocations, each using a different vector pair,
for these parallel access patterns. The memory allocation
shown in Fig. 15(a) is not capable of parallel access since
two pixels allocated to the same memory module as the pixel of
interest are included in the parallel access pattern of one of
the windows. The memory allocation shown in Fig. 15(b)
is capable of parallel access since no such pixel is
included in either parallel access pattern.
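For several windows, the same test is simply repeated per window. The sketch below reports the conflicting offsets for each window, which makes it easy to see which window rules out a vector pair. It rests on the same assumption as before (pixels sharing a module differ by integer combinations of the period vectors); the row and column windows and all function names are hypothetical and do not reproduce Fig. 14.

```python
def access_pattern(window):
    """Nonzero differences between window positions (the parallel access pattern)."""
    return {(ax - bx, ay - by)
            for ax, ay in window
            for bx, by in window
            if (ax, ay) != (bx, by)}


def on_period_lattice(offset, v1, v2):
    """True if offset is an integer combination of the period vectors v1, v2."""
    det = v1[0] * v2[1] - v1[1] * v2[0]
    if det == 0:
        raise ValueError("period vectors must be linearly independent")
    a_num = offset[0] * v2[1] - offset[1] * v2[0]
    b_num = v1[0] * offset[1] - v1[1] * offset[0]
    return a_num % det == 0 and b_num % det == 0


def conflicts_per_window(windows, v1, v2):
    """For each window, the pattern offsets that land in the same module."""
    return [sorted(d for d in access_pattern(w) if on_period_lattice(d, v1, v2))
            for w in windows]


def enables_parallel_access(windows, v1, v2):
    """The allocation is usable only if every window is conflict-free."""
    return all(not c for c in conflicts_per_window(windows, v1, v2))


# Hypothetical windows: three pixels in a row and three pixels in a column.
row = [(0, 0), (1, 0), (2, 0)]
col = [(0, 0), (0, 1), (0, 2)]
print(conflicts_per_window([row, col], (3, 0), (0, 1)))
# [[], [(0, -2), (0, -1), (0, 1), (0, 2)]]
print(enables_parallel_access([row, col], (3, 0), (1, 1)))  # True
```

The first vector pair suits the row window but places every pixel of a column in one module; the second pair serves both windows with three modules.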
By using the previous method, the range to be checked is
limited to around the parallel access pattern for a single arbitrary
pixel . Therefore, we can reduce the time complexity of the
.
Fig. 16. Search for a corresponding point.
The proposed algorithm can always find a vector pair for
multiple windows, for the following reason. Given multiple
windows, the simplest way of finding a memory allocation for
parallel access is to apply the rectangular memory allocation
[10] (or [9]) to the minimum rectangular window that includes
all the given windows. The concept behind this method is to
approximate the multiple windows by a single circumscribing
rectangular window. Although this method can require a large
number of memory modules when the approximation is not good,
it guarantees parallel access. A larger rectangular window
intuitively corresponds to a larger equivalent vector pair in
our algorithm. Our algorithm begins with the largest possible
equivalent vector pair, that is, the largest possible
rectangular window. Then, our algorithm iteratively improves
the current best solution by shortening the length of the
equivalent vector pair. Therefore, our algorithm can always
find a memory allocation for parallel access with multiple
windows.
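The guarantee above can be illustrated with a small exhaustive search. The sketch below is not the search procedure of the paper: instead of shrinking a vector pair, it enumerates candidate period-vector pairs of the form (a, 0), (b, c) with 0 <= b < a (a normal form that covers every periodic allocation with a*c modules) in increasing order of module count, and it uses the circumscribing rectangle only as the upper bound that guarantees termination. All names are assumptions introduced for illustration.

```python
def difference_set(window):
    """Nonzero differences between window positions (parallel access pattern)."""
    return {(ax - bx, ay - by)
            for ax, ay in window
            for bx, by in window
            if (ax, ay) != (bx, by)}


def lattice_contains(offset, a, b, c):
    """True if offset = m*(a, 0) + k*(b, c) for some integers m, k."""
    x, y = offset
    if y % c != 0:
        return False
    return (x - b * (y // c)) % a == 0


def search_min_modules(windows):
    """Return (modules, v1, v2) with the fewest modules among periodic allocations."""
    forbidden = set().union(*(difference_set(w) for w in windows))
    lower = max(len(w) for w in windows)  # a window needs one module per pixel
    width = max(max(x for x, _ in w) - min(x for x, _ in w) + 1 for w in windows)
    height = max(max(y for _, y in w) - min(y for _, y in w) + 1 for w in windows)
    upper = width * height  # circumscribing rectangle: (width, 0), (0, height)
    for n in range(lower, upper + 1):
        for a in range(1, n + 1):
            if n % a:
                continue
            c = n // a
            for b in range(a):
                if not any(lattice_contains(d, a, b, c) for d in forbidden):
                    return n, (a, 0), (b, c)
    raise AssertionError("unreachable: the circumscribing rectangle is feasible")


# Hypothetical windows: four pixels in a row and four pixels in a column.
row = [(i, 0) for i in range(4)]
col = [(0, j) for j in range(4)]
print(search_min_modules([row, col]))  # (4, (4, 0), (1, 1))
```

The pair (width, 0), (0, height) is always accepted because every offset in the forbidden set is smaller than the rectangle in both coordinates, so the loop cannot run past the upper bound. Running the same function on three 3 × 3 windows sampled with periods 1, 2, and 3 returns 11 modules.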
IV. DESIGN EXAMPLES
We show two examples where multiple windows are effi-
ciently used and frequently appear in real cases. In these ex-
amples, pixels used at the same time are sparsely arranged, and
the conventional memory allocation methods require a large
number of memory modules.
Section IV-A describes stereo matching using multi-resolution
images. This type of processing is not limited to stereo
matching; it frequently appears in other image processing tasks
such as recognition.
Section IV-B describes optical flow extraction using a single
square window. Use of a single square window is very popular
in image processing such as filtering. Image processing using
a square window reduces to image processing using multiple
windows when pixel data in an overlapping area are reused.
A. Stereo Matching VLSI Processor Using
Multi-Resolution Images
Stereo matching is an efficient method to obtain 3-D
information of a real scene. It uses two images taken by two
different cameras at the same time. After correspondence between
the two images is established, the 3-D information of the scene
is computed by triangulation. Given a reference window, SADs
are computed for all the possible candidate windows as shown in
Fig. 16. Note that the reference and candidate windows have a
square shape. The candidate window with the minimum SAD is
determined to be the corresponding one.
Multi-resolution images are used to reduce the
computational amount. Given sampling periods
, reduced images are
made by sampling the original images every pixels.
Beginning with the lowest-resolution image ,
the resolution is iteratively increased until . The
possible locations of candidate windows at a higher resolution
are limited by using the matching result at a lower resolution.
Hence, the amount of computation is reduced [11].
Fig. 17. Optimal memory allocation for multi-resolution images (     and    ). (a) Original image    . (b) Reduced image    . (c) Reduced image    .
Fig. 17 shows the resulting optimal memory allocation for
and . The period vectors are and
. The number of memory modules is 11. The label
on a pixel denotes which memory module the
pixel is allocated to. Figs. 17(a)–(c) correspond to 1, 2,
and 3, respectively. Gray pixels are accessed in parallel at each
sampling period. It can be seen that all the pixels in the window
are distributed among different memory modules for arbitrary
locations of the window. In other words, parallel memory access
for a window is enabled at each sampling period.
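The property stated above, that every window location hits pairwise different modules at each sampling period, can be checked mechanically. The sketch below assumes a 3 × 3 window sampled with periods 1, 2, and 3 (inferred from Fig. 17(a)–(c) and the 7 × 7 circumscribing rectangle mentioned in the comparison that follows) and a hypothetical labelling rule, module = (x + 3y) mod 11. This rule is one 11-module allocation that satisfies the condition; it is not claimed to be the allocation drawn in Fig. 17.

```python
def sampled_window(size, period):
    """Offsets of a size x size window whose pixels are `period` pixels apart."""
    return [(i * period, j * period) for j in range(size) for i in range(size)]


def module_of(x, y, modules, skew):
    """Hypothetical periodic labelling: (x + skew*y) mod modules."""
    return (x + skew * y) % modules


def conflict_free(window, modules, skew):
    """True if the window hits pairwise different modules at every location.

    The labelling repeats every `modules` pixels in x and in y, so testing
    origins inside one modules x modules tile covers all window locations.
    """
    for oy in range(modules):
        for ox in range(modules):
            labels = {module_of(ox + dx, oy + dy, modules, skew)
                      for dx, dy in window}
            if len(labels) != len(window):
                return False
    return True


windows = [sampled_window(3, s) for s in (1, 2, 3)]
print(all(conflict_free(w, modules=11, skew=3) for w in windows))  # True
print(all(conflict_free(w, modules=9, skew=3) for w in windows))   # False
```

The 9-module variant works for the unsampled 3 × 3 window but fails at sampling period 3, where two vertically adjacent sampled pixels fall into the same module.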
Let us compare this result with that of three conventional
methods.
The first method is the rectangular memory allocation [10].
This is originally used for image processing with a single
resolution. The rectangular memory allocation maps pixels in a
rectangular window onto different memory modules, where the
rectangular window is defined as the minimum rectangular one
that can include all the parallel-accessed pixels. In the example
shown in Fig. 17, the rectangular window is determined such
that it includes all the gray pixels shown in Fig. 17(c). Hence,
the rectangular memory allocation requires a 7 × 7 rectangular
window with 7 × 7 = 49 memory modules as shown in
Fig. 18(a). The gray pixels in Fig. 18(a) correspond to the gray
ones in Fig. 17(c).
The second one is the tile-based memory allocation [9]. This
is originally used for image processing with a single resolution.
Note that the tile-based allocation [9] provides the same result
as the rectangular one [10].
Fig. 18. Conventional memory allocation (     and    ). (a) Rect-
angular memory allocation (original image). (b) Memory allocation with limited
period vectors (original image).
The third one [11] is used, like the proposed method, for image
processing with multi-resolution images. This method differs
from the proposed method in that the period vectors are limited
in terms of length and direction as follows.
• The two period vectors have the same length.
• One period vector is horizontal and the other is vertical.
We call this method memory allocation with limited period
vectors (MALP). Fig. 18(b) shows the result of MALP for
and . This memory allocation requires
memory modules. The period vectors and
for this allocation are given by
(25)
As a result, the number of memory modules of the proposed
method is reduced to of the rectangular memory
allocation, of the tile-based one, and
of MALP.
For the practical case, the window size , and the
maximum sampling period [11]. The proposed
memory allocation, the rectangular one [10], the tile-based one
[9] and MALP [11] require 17, , , and
memory modules, respectively. The search time
of the proposed method is less than 1 ms on a PC (Pentium 4
at 2 GHz, 1.2 GB main memory, OS: Windows XP). The total
memory capacity of the memory modules is constant regardless
of the number of memory modules, from condition C3. The
number of AD units and adders depends on the number of
memory modules. The total number of AD units and adders of
the proposed memory allocation is reduced to 2.7%, 2.7%, and
14% of that of the three conventional allocation methods,
respectively. Table II summarizes the comparison results.
The image size is 500 × 500 pixels with 256-level grayscale.
The upper and lower parts of the table denote the architectural
results and the results of FPGA-based implementations,
respectively. The StratixIII (EP3SE260F1152C3) is used for the
implementations. Altera’s Quartus II is used for simulation,
mapping, and power analysis. The processing time is evaluated
at the maximum frequency. The comparison does not include
the operation steps to load pixels into the memory modules;
these steps are the same in the proposed and the conventional
methods since the number of pixels is the same. The power
dissipation is evaluated at 60 MHz since the processor based on
the rectangular allocation runs at less than 64 MHz. Quartus II
simulates the real configuration data using real image inputs.
Therefore, the resulting processing time and power dissipation
should be very close to the real values.
TABLE II
COMPARISON RESULTS FOR THE STEREO-MATCHING PROCESSOR
StratixIII (EP3SE260F1152C3) FPGA is used.
B. Optical Flow
Given an image sequence, matching between two consecutive
images is frequently used in many applications such as motion
estimation for data compression and optical flow. Fig. 19
illustrates the typical situation of image matching between
two consecutive images. Figs. 19(a) and (b) are images at times
and , respectively. The image at time is called a ref-
erence image, and the image at time a candidate image.
We consider a square window called a reference window of a
predetermined size in the reference image. In order to
find the corresponding window in the candidate image, a
similarity measure such as the SAD is computed between the
reference window and a candidate one for all possible locations
of the candidate window within a search area. The sequence of
the locations has a great impact on how many pixels must
be accessed in parallel. If consecutive candidate windows
do not overlap each other, all the pixels of the window must be
newly retrieved. To minimize the number of newly-retrieved
pixels, consecutive candidate windows must have a maximum
overlap. Fig. 19(b) shows a sequence of the locations of the
candidate windows that meets this requirement. In this
square-wave-like sequence, the candidate window moves
horizontally or vertically by one pixel. For example, Fig. 20
shows the newly-retrieved pixels for a candidate window of
size 3 × 3. When the candidate window moves horizontally from
L1 to L2 as shown in
Figs. 20(b) and (c), pixels , , , , , and
can be reused by storing them into registers. Pixels , ,
and must be newly retrieved from memory modules. Sim-
ilarly, when the candidate window moves vertically from L1 to
L3 as shown in Figs. 20(b) and (d), pixels , , , ,
, and can be reused. Pixels , , and must
be newly retrieved from memory modules. In general, the
square-wave-like sequence requires parallel access only for the
pixels in a single column or a single row of the candidate
window, although all the pixels of the window are used
for matching.
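The reuse argument above can be reproduced with plain set arithmetic. The sketch below is illustrative only: the coordinate convention, the function names, and the particular serpentine scan are assumptions, and the scan is just one possible square-wave-like order, not necessarily the one drawn in Fig. 19(b).

```python
def window_pixels(x0, y0, n):
    """Pixels covered by an n x n candidate window whose corner is at (x0, y0)."""
    return {(x0 + i, y0 + j) for i in range(n) for j in range(n)}


def newly_retrieved(prev_origin, next_origin, n):
    """Pixels of the next window that were not part of the previous window."""
    return window_pixels(*next_origin, n) - window_pixels(*prev_origin, n)


def serpentine_scan(cols, rows):
    """Window origins visited left-to-right, then right-to-left on the next row."""
    for r in range(rows):
        xs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in xs:
            yield (c, r)


print(sorted(newly_retrieved((0, 0), (1, 0), 3)))  # [(3, 0), (3, 1), (3, 2)]
print(sorted(newly_retrieved((0, 0), (0, 1), 3)))  # [(0, 3), (1, 3), (2, 3)]
print(len(newly_retrieved((0, 0), (3, 0), 3)))     # 9

# Every step of the scan moves the window by one pixel, so 3 pixels are new.
origins = list(serpentine_scan(4, 3))
print({len(newly_retrieved(p, q, 3)) for p, q in zip(origins, origins[1:])})  # {3}
```

A horizontal step brings in one new column and a vertical step one new row, whereas a jump to a non-overlapping location would require all nine pixels at once.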
Fig. 19. Optical flow. (a) Image at time   (reference image). (b) Image at time       (candidate image).
Fig. 20. Newly-retrieved pixel for optical flow computation. (a) Reference window. (b) Candidate window at location L1. (c) Candidate window at location L2. (d) Candidate window at location L3.
Fig. 21. Windows for memory allocation. (a) Window W1. (b) Window W2.
Fig. 22. Optimal memory allocation for optical flow.
TABLE III
COMPARISON RESULT FOR THE OPTICAL-FLOW-EXTRACTION PROCESSOR
StratixIII (EP3SL200F1152C2) is used.
The windows for memory allocation, W1 and W2, are a single row
and a single column of pixels, as shown in Fig. 21(a) and (b).
Note that the windows for memory allocation are smaller
than the window for optical flow since the pixels are reused as
mentioned before. Fig. 22 shows the resulting optimal memory
allocation for W1 and W2 when each window has 10 pixels. The
optimal memory allocation requires 10 memory modules. Pixels in
a diagonal line are allocated to the same memory module, and
this memory allocation is called “diagonal memory allocation”.
The resulting number of memory modules is minimum since each of
W1 and W2 has 10 pixels. The search time of the proposed method
is less than 1 ms on a PC (Pentium 4 at 2 GHz, 1.2 GB main
memory, OS: Windows XP). Let us compare the diagonal memory
allocation with the conventional rectangular memory allocation
[10] and the tile-based one [9]. The rectangular memory
allocation requires one memory module per pixel of the minimum
rectangle including both W1 and W2, which is a 10 × 10
rectangle in this example. The number of memory modules for the
diagonal memory allocation is therefore reduced to
10/100 = 10% of that for the rectangular one, and likewise to
10% of that for the tile-based one. Table III summarizes the
comparison results. The image size is 500 × 500 pixels with
256-level grayscale. The search area is 10 × 10 pixels. Note
that MALP [11] is not compared with the proposed method since
MALP can be used only for multi-resolution images. The upper
and lower parts of the table denote the architectural results
and the results of FPGA-based implementations, respectively.
The StratixIII (EP3SL200F1152C2) is used for the
implementations. This FPGA is smaller than the one used in the
stereo-matching processor. The processing time is evaluated at
the maximum frequency. The comparison does not include the
operation steps to load pixels into the memory modules; these
steps are the same in the proposed and the conventional methods
since the number of pixels is the same. The power dissipation
is evaluated at 75 MHz since the processor based on the
rectangular allocation runs at less than 76 MHz.
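A diagonal memory allocation of the kind described in this subsection can be written as a one-line labelling rule and checked directly. The sketch below assumes the rule module = (x + y) mod n, so that pixels on one anti-diagonal share a module; the direction of the diagonal in Fig. 22 may differ, and the function names are illustrative.

```python
def diagonal_module(x, y, n):
    """Assumed diagonal rule: pixels with equal (x + y) mod n share a module."""
    return (x + y) % n


def row_window(n):
    return [(i, 0) for i in range(n)]


def column_window(n):
    return [(0, j) for j in range(n)]


def parallel_accessible(window, n):
    """True if the window hits pairwise different modules at every location."""
    for oy in range(n):
        for ox in range(n):
            labels = {diagonal_module(ox + dx, oy + dy, n) for dx, dy in window}
            if len(labels) != len(window):
                return False
    return True


n = 10
print(parallel_accessible(row_window(n), n))     # True
print(parallel_accessible(column_window(n), n))  # True
print(parallel_accessible([(0, 0), (1, 0), (0, 1), (1, 1)], n))  # False
```

Ten modules suffice for a 10-pixel row and a 10-pixel column, but the same rule cannot serve a 2 × 2 square, whose two anti-diagonal pixels collide. This is why the saving depends on the reuse of overlapping pixels described above.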
V. DISCUSSION
Fig. 23. Window and SDFG for single-step operation. (a) Window W1. (b) SDFG for W1.
Fig. 24. Memory allocation result for W1. (a) Proposed memory allocation. (b) Rectangular memory allocation [10]/tile-based memory allocation [9].
From the results of Section IV, there is a large variation
among the reduction ratios of the numbers of memory modules
for the two examples: 2.7% for the stereo-matching processor
and 10% for the optical-flow extraction processor. These
reduction ratios are defined as (the number of memory
modules of our method)/(the number of memory modules
of the rectangular memory allocation). The reduction ratio
intuitively denotes how well the rectangular window
approximates the given window or windows: the better the
approximation, the larger the reduction ratio. In the examples
of the stereo-matching processor and the optical-flow
extraction processor, the circumscribing rectangular window
does not approximate the given windows well, and is sparsely
occupied by the pixels of the given windows. By exploiting the
sparseness, our algorithm shrinks the equivalent vector pair to
reduce the number of memory modules.
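As a worked check of this definition, the ratios can be computed for the two cases whose module counts are stated explicitly in Section IV. The symbols r, M_prop, and M_rect are notation introduced here for illustration and do not appear in the paper. The optical-flow figure assumes that the circumscribing rectangle of the two 10-pixel windows is 10 × 10.

```latex
% r: reduction ratio; M_prop, M_rect: numbers of memory modules of the
% proposed and the rectangular allocation (assumed notation).
\[
  r = \frac{M_{\mathrm{prop}}}{M_{\mathrm{rect}}}
\]
\[
  r_{\text{Fig.\,17}} = \frac{11}{7 \times 7} = \frac{11}{49} \approx 22.4\%,
  \qquad
  r_{\text{optical flow}} = \frac{10}{10 \times 10} = 10\%.
\]
```

The 2.7% quoted for the stereo-matching processor refers to the practical case with 17 memory modules, not to the small example of Fig. 17.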
We can relax the constraint of the window-serial-and-pixel-
parallel scheduling where a window-operation is executed in a
single step, and all the pixels in a window must be retrieved in
parallel. For more practical cases, we should consider multi-step
operation where a window-operation is executed in multiple
steps, and pixels in a window are also retrieved in multiple steps.
Fig. 23 shows a window and the SDFG for single-step operation
based on the window-serial-and-pixel-parallel scheduling. All
the pixels in W1 must be accessed in parallel to meet the time
constraint imposed by the SDFG. Fig. 24(a) and (b), respec-
tively, show the proposed memory allocation and the rectangular
memory allocation for the window W1, which respectively re-
quire 6 and 8 memory modules. Note that the tile-based allo-
cation [9] provides the same result as the rectangular memory
allocation [10] as mentioned before.
Fig. 25. Window and SDFG for multi-step operation (Type 1). (a) Windows. (b) SDFG for W2 and W3.
Fig. 26. Memory allocation result for multi-step operation (Type 1).
Fig. 25 shows windows and the SDFG for multi-step operation
(Type 1). To relax the window-serial-and-pixel-parallel
scheduling, W1 is divided into W2 and W3, and the pixels in
either W2 or W3 are accessed in parallel at a single step. The
label “Sub-A” denotes a partial operation of an operation of
type “A”. At steps S1 and S2, Sub-A is performed for W2 and W3,
respectively. At step S3, an additional operation is required
to merge the results of the Sub-A operations. For example, if
an operation of type A means 6-input addition, Sub-A and
Merge-A mean 3-input addition and 2-input addition,
respectively. The memory allocation problem for the multi-step
operation (Type 1) shown in Fig. 25 can be considered to be the
memory allocation problem for the multiple windows W2 and W3.
This is because the pixels in W2 or W3 must be accessed in
parallel for any location of these windows. Hence, the memory
allocation for multi-step operation (Type 1) can be solved by
the proposed method for multiple windows. Fig. 26 shows the
optimal memory allocation for the windows W2 and W3, which
requires four memory modules.
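The decomposition of operation A into Sub-A and Merge-A can be sketched for the 6-input addition mentioned above. The assignment of pixel values to W2 and W3 below is hypothetical, since the shapes in Fig. 25 are not reproduced here, and the function names are illustrative.

```python
def operation_a(pixels):
    """Single-step operation: 6-input addition over all pixels of W1."""
    assert len(pixels) == 6
    return sum(pixels)


def sub_a(pixels):
    """Partial operation: 3-input addition over the pixels of one sub-window."""
    assert len(pixels) == 3
    return sum(pixels)


def merge_a(partial_w2, partial_w3):
    """Merge step: 2-input addition of the two partial results."""
    return partial_w2 + partial_w3


def multi_step(pixels_w2, pixels_w3):
    s1 = sub_a(pixels_w2)   # step S1: only W2 is read in parallel
    s2 = sub_a(pixels_w3)   # step S2: only W3 is read in parallel
    return merge_a(s1, s2)  # step S3: no memory access needed


w1_values = [5, 1, 4, 2, 8, 3]  # hypothetical pixel values of W1
print(operation_a(w1_values))                     # 23
print(multi_step(w1_values[:3], w1_values[3:]))   # 23
```

Only three pixels are read per step in the multi-step form, so the memory allocation has to separate the pixels within W2 and within W3 but not across the two sub-windows.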
Fig. 27. Window and SDFG for multi-step operation (Type 2). (a) Windows. (b) SDFG for W4 and W5.
Fig. 28. Memory allocation results for multi-step operation (Type 2). (a) Proposed memory allocation. (b) Rectangular memory allocation [10]/tile-based memory allocation [9].
The reduction ratio for a multi-step operation depends not
only on the memory allocation method but also on how the window
is divided. For the example of multi-step operation (Type 1)
shown in Fig. 25, the same memory allocation shown in Fig. 26
is obtained by either the proposed memory allocation or the
rectangular one/tile-based one. On the other hand, for the
example of multi-step operation (Type 2) shown in Fig. 27, the
proposed memory allocation is more efficient, that is, it
provides a smaller reduction ratio. In this case, W1 is divided
into W4 and W5, and the pixels in either W4 or W5 are accessed
in parallel at a single step. Fig. 28(a) and (b) respectively
show the proposed memory allocation and the rectangular
one/tile-based one for the windows W4 and W5, which
respectively require 3 and 6 memory modules. Note that the
number of memory modules required by the proposed memory
allocation does not exceed that required by the rectangular
one/tile-based one even in the worst case.
VI. CONCLUSION
This paper presents an optimal memory allocation method to
minimize the number of memory modules under a time constraint.
The method is also useful for FPGA implementations, where the
interconnect overhead in delay is significantly large.
The architecture model is a simplified form of recent
image processors [1], [2]. These processors are based on a
highly-parallel SIMD architecture where each PE has its own
local memory module, like our architecture model. The
processors in [1] and [2] have 128 and 256 pairs of a PE and a
memory module, respectively. The major difference between our
architecture and theirs is that they have a more complex
inter-PE network such as a linear array. They are designed for
power-aware embedded applications such as digital cameras,
mobile phones, and advanced safety vehicles. Their power
consumption is less than 3 W.
Nowadays, the use of graphics processing units (GPUs) is also
an efficient solution to accelerate image processing because
of their high degree of parallelism [19], [20].1 For example,
the NVIDIA Geforce6800 [21] has 6 vertex processors operating
in a MIMD manner, 12 fragment processors operating in a SIMD
manner, 4 memory modules, and a crossbar network between
the vertex and fragment processors. More advanced GPUs such
as RADEON X1900XT, GeForce7800, and RADEON X800
have more vertex and fragment processors. State-of-the-art
GPUs can be considered as multi-processors where each
processor has a local memory. Such an architecture is similar
to our target architecture. Therefore, the proposed memory
allocation can be applied to image processing using
state-of-the-art GPUs. GPUs are mainly used not for
implementing final products but for evaluating algorithms,
because of their large power consumption, which ranges from 30
to 100 W. To exploit the parallelism of recent image processors
and GPUs, memory allocation for parallel access is essential,
and our memory allocation can be applied to them. In such
cases, the advantage of minimizing the number of memory modules
is to save the dynamic power consumed by unused PEs and
memory modules, not to minimize the chip area. If power
gating is available, it is also possible to save the static
power consumption of the unused PEs and memory modules.
REFERENCES
[1] S. Kyo, T. Koga, S. Okazaki, and I. Kuroda, “A 51.2-GOPS scalable
video recognition processor for intelligent cruise control based on a
linear array of 128 four-way VLIW processing elements,” IEEE J.
Solid-State Circuits, vol. 38, no. 11, pp. 1992–2000, Nov. 2003.
[2] M. Nakajima, H. Noda, K. Dosaka, K. Nakata, M. Higashida, O. Ya-
mamoto, K. Mizumoto, H. Kondo, Y. Shimazu, K. Arimoto, K. Saitoh,
and T. Shimizu, “A 40GOPS 250 mW massively parallel processor
based on matrix architecture,” in Proc. ISSCC Dig. Tech. Papers, Feb.
2006, pp. 410–411.
[3] L. Ramachandran, D. Gajski, and V. Chaiyakul, “An algorithm for
array variable clustering,” in Proc. Euro. Des. Autom. Conf., 1994, pp.
262–266.
[4] N. Holmes and D. Gajski, “Architectural exploration for datapaths
with memory hierarchy,” in Proc. Euro. Des. Autom. Conf., 1994, pp.
340–344.
[5] P. R. Panda, “Memory bank customization and assignment in behav-
ioral synthesis,” in Proc. Int. Conf. Comput.-Aided Des., 1999, pp.
477–481.
[6] W. T. Shiue, S. Tadas, and C. Chakrabarti, “Low power multi-mod-
ules, multi-port memory design for embedded systems,” in Proc. Signal
Process. Syst., 2000, pp. 529–538.
[7] S. Wuytack, F. Catthoor, D. De Jong, and H. De Man, “Minimizing the
required memory bandwidth in VLSI system realizations,” IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 4, pp. 433–441, Dec.
1999.
[8] J. Seo, T. Kim, and P. R. Panda, “Memory allocation and mapping
in high-level synthesis—An integrated approach,” IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 11, no. 5, pp. 928–938, May 2003.
[9] P. R. Panda and N. D. Dutt, “Low-power memory mapping
through reducing address bus activity,” IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 3, no. 3, pp. 309–320, Sep. 1999.
[10] M. Hariyama, S. Lee, and M. Kameyama, “Highly-parallel stereo
vision VLSI processor based on an optimal parallel memory access
scheme,” IEICE Trans. Electron., vol. E84-C, no. 3, pp. 382–389,
2001.
[11] M. Hariyama, H. Sasaki, and M. Kameyama, “Architecture of a stereo
matching VLSI processor based on hierarchically parallel memory ac-
cess,” IEICE Trans. Inf. Syst., vol. E88-D, no. 7, pp. 1486–1491, 2005.
[12] D. Marr and E. Hildreth, “Theory of edge detection,” Proc. Royal So-
ciety London B, vol. 207, pp. 187–217, 1980.
[13] A. Kaced, “The K-forms: A new technique and its applications in dig-
ital image processing,” in Proc. IEEE 5th Int. Conf. Pattern Recogni-
tion, 1980, pp. 933–936.
1[Online]. Available: http://www.gpgpu.org
[14] H. Minkowski, “Volumen und Oberfläche,” Math. Annalen, vol. 57, pp.
447–495, 1903.
[15] J. Klein and J. Serra, “The texture analyzer,” J. Microscopy, vol. 95,
pp. 349–356, 1972.
[16] R. Haralick, S. Sternberg, and X. Zhuang, “Image analysis using math-
ematical morphology: Part I,” IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 9, no. 4, pp. 532–550, Jul. 1987.
[17] K. Sivakumar and J. Goutsias, “Morphologically constrained GRFs:
Applications to texture synthesis and analysis,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 21, no. 2, pp. 99–131, Feb. 1999.
[18] H. Heijmans, “Theoretical aspects of gray-level morphology,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 568–582, Jun.
1991.
[19] N. Cornelis, “Real-time connectivity constrained depth map computa-
tion using programmable graphics hardware,” in Proc. CVPR, 2005,
vol. 1, pp. 1099–1104.
[20] S. Sinha, J.-M. Frahm, and M. Pollefeys, “GPU-based video feature
tracking and matching,” Univ. North Carolina, Chapel Hill, Tech. Rep.
TR06-012, May 2006.
[21] J. Montrym and H. Moreton, “The GeForce 6800,” IEEE Micro, vol.
25, no. 2, pp. 41–51, Feb. 2005.
Yasuhiro Kobayashi (M’06) received the B.E. de-
gree in electronic engineering from Tohoku Univer-
sity, Sendai, Japan, in 1997.
He is currently a technical staff member with Oyama
National College of Technology, Oyama, Japan. His
research interests include reconfigurable VLSIs for
computer vision.
Masanori Hariyama (M’02) received the B.E. degree
in electronic engineering, the M.S. degree in infor-
mation sciences, and the Ph.D. degree in information
sciences from Tohoku University, Sendai, Japan, in
1992, 1994, and 1997, respectively.
He is currently an Associate Professor with the
Graduate School of Information Sciences, Tohoku
University, Sendai, Japan. His research interests
include VLSI computing for real-world applications
such as robots, high-level design methodology for
VLSIs and reconfigurable computing.
Michitaka Kameyama (M’79–F’97) received the
B.E., M.E., and D.E. degrees in electronic engi-
neering from Tohoku University, Sendai, Japan, in
1973, 1975, and 1978, respectively.
He is currently a Professor with the Graduate
School of Information Sciences, Tohoku University.
His general research interests include intelligent
integrated systems for real-world applications
and robotics, advanced VLSI architecture, and
new-concept VLSI including multiple-valued VLSI
computing.
Prof. Kameyama was a recipient of the Outstanding Paper Awards at the 1984,
1985, 1987, and 1989 IEEE International Symposiums on Multiple-Valued
Logic, the Technically Excellent Award from the Society of Instrument and
Control Engineers of Japan in 1986, the Outstanding Transactions Paper Award
from the IEICE in 1989, the Technically Excellent Award from the Robotics
Society of Japan in 1990, and the Special Award at the 9th LSI Design of the
Year in 2002.
