Abstract-The main contribution of this paper is to present an image retrieval system using FPGAs. Given a template image T and a database of a number of images I_1, I_2, . . .
I. INTRODUCTION
Suppose that an image database containing a number of gray-scale images {I_1, I_2, . . .} and a template image T are given. We assume that T is small, say, 32 x 32 pixels, while each I_i is large, say, 1024 x 1024 pixels or larger. We are interested in the task of listing all images in the database that contain a subimage similar to T. This task has many applications in areas such as object recognition, vehicle tracking, and finding a particular pattern in VLSI masks, among others [1]. The main contribution of this paper is to present an FPGA-based instance-specific hard- Let T be a template image with m x m pixels and I be an image with n x n pixels. We assume that T has e effective pixels (e <= m^2) that are taken into account for image match- and log L block RAMs with mn bits each for L-level gray-scale images. Thus, from the theoretical point of view, our FPGA-based instance-specific solution is much faster than the conventional software solution.
We evaluate the performance of our hardware using a timing analyzer for the Xilinx VirtexII series FPGA XC2V8000.
Further, we test our hardware using the Spartan2 (XC2S150) PCI card Strathnuey [3]. Since the generated hardware attains a speed-up factor of up to 3000 over the sequential algorithm, our approach is a promising solution. An image matching hardware using an FPGA has been proposed [2]. That hardware is not instance-specific, does not support gray-scale images, and runs in O(n^2) clock cycles. Thus, our hardware is a significant improvement on the FPGA-based image matching hardware. Also, an instance-specific solution for image matching has been shown [6]. However, it supports neither gray-scale images nor parallel matching. We assume that pixel (1, 1) is the top of the leftmost column of
II. THE IMAGE DIFFERENCE
An m x m template image T is an image with "don't care" pixels, that is, an m x m two-dimensional array in which each element is either an integer in [0, L - 1] or a special value d. An
Let e denote the number of effective pixels, which are the non-"don't care" pixels in T. The value of e, which depends on the application, can be much smaller than m^2.
Let D be the function that returns an integer for a template image T and an image I such that
$$D(T, I) = \sum_{T_{i,j} \neq d} |T_{i,j} - I_{i,j}|. \tag{1}$$
Suppose that an image I is larger than a template image T.
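As a small illustration of formula (1), the following Python sketch computes the difference between a template and an image of the same size. It assumes that the sum runs over the effective pixels only; the function name `image_difference` and the use of `None` for the don't-care value d are choices made for this sketch, not part of the original text.

```python
from typing import List, Optional

# None stands for the special "don't care" value d (an assumption of this sketch).
Template = List[List[Optional[int]]]
Image = List[List[int]]


def image_difference(template: Template, image: Image) -> int:
    """Sum |T[i][j] - I[i][j]| over the effective (non-don't-care) pixels."""
    m = len(template)
    total = 0
    for i in range(m):
        for j in range(m):
            t = template[i][j]
            if t is not None:  # don't-care pixels contribute nothing
                total += abs(t - image[i][j])
    return total


T = [[0, None],
     [3, 1]]
I = [[2, 7],
     [3, 0]]
print(image_difference(T, I))  # |0-2| + |3-3| + |1-0| = 3
```

Here the template has e = 3 effective pixels, so the pixel value 7 in the image is never inspected.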
III. AN IMAGE MATCHING HARDWARE FOR BINARY IMAGE RETRIEVAL
In this section, we show our FPGA-based instance-specific hardware that computes D_T(I) (= D(T, I)) for a fixed template T and various images I. We start with a binary template T and a binary image I, and later extend our hardware to support gray-scale images. It consists of parallel inverters and the Muller-Preparata circuit [4, 5], which computes the number of 1's in the input bits. For every effective pixel of template image T, the corresponding register bit is connected to the Muller-Preparata circuit directly if the template pixel is 0, and through an inverter if it is 1, so that each connected bit equals |T_{i,j} - I_{i,j}|. Since T has e effective pixels, the Muller-Preparata circuit computes the sum of e bits, which is equal to the value of D_T(I[x, y]).
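The instance-specific idea can be modelled in software as a small generator that, for one fixed binary template, decides once which window bits are taken directly and which are inverted, and afterwards only counts ones. This is an illustrative model and not the authors' circuit: the names `build_matcher` and `matcher` are hypothetical, `None` stands for the don't-care value d, and Python's `sum` stands in for the ones-counting circuit. The polarity (direct for a template 0, inverted for a template 1) is an assumption chosen so that each counted bit equals |T_{i,j} - I_{i,j}|.

```python
from typing import Callable, List, Optional


def build_matcher(template: List[List[Optional[int]]]) -> Callable[[List[List[int]]], int]:
    """Specialise a difference counter to one fixed binary template."""
    m = len(template)
    # "Wiring" is fixed once per template, like the generated hardware.
    direct = [(i, j) for i in range(m) for j in range(m) if template[i][j] == 0]
    inverted = [(i, j) for i in range(m) for j in range(m) if template[i][j] == 1]

    def matcher(window: List[List[int]]) -> int:
        # Direct wires carry the window bit; inverted wires carry its complement.
        bits = [window[i][j] for i, j in direct]
        bits += [1 - window[i][j] for i, j in inverted]
        return sum(bits)  # stand-in for the circuit that counts 1's

    return matcher


match = build_matcher([[1, None],
                       [0, 1]])
print(match([[1, 1],
             [1, 0]]))  # differs at (1,0) and (1,1) -> 2
```

Because the template is baked into the two index lists, no template pixel is read while matching; only the e selected window bits are touched.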
To compute the minimum of D_T(I[x, y]) over all x and y, a comparator and a (log e)-bit register are used. The comparator computes the minimum of two (log e)-bit integers. The register stores the minimum value of D_T(I[x, y]) found so far. If the current value of D_T(I[x, y]) is smaller, then it is stored in the register. It should be clear that, after every pixel in image I is supplied to this circuit, the (log e)-bit register stores D_T(I).
Next, let us evaluate the performance and the hardware resources used by our hardware. As we discussed, our hardware computes D_T(I) in less than n^2 clock cycles. The Muller-Preparata circuit [4] that counts the number of 1's in e bits has O(e) gates. Further, the (log e)-bit comparator has no more than O(
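The comparator-and-register scheme described in this section can be mimicked by a sequential reference model. The sketch below is a software analogue under stated assumptions, not the hardware itself: it visits every placement of a binary template in a binary image, and the variable `register` plays the role of the minimum register, initialised to e (the largest possible difference for binary images). The name `min_difference` is hypothetical and `None` again denotes the don't-care value d.

```python
from typing import List, Optional


def min_difference(template: List[List[Optional[int]]], image: List[List[int]]) -> int:
    """Minimum difference over all placements of the template in the image."""
    m, n = len(template), len(image)
    effective = [(i, j, template[i][j])
                 for i in range(m) for j in range(m)
                 if template[i][j] is not None]
    register = len(effective)  # start at e, the worst case for binary pixels
    for y in range(n - m + 1):
        for x in range(n - m + 1):
            current = sum(image[y + i][x + j] != t for i, j, t in effective)
            if current < register:  # comparator: keep the smaller value
                register = current
    return register


T = [[1, 0],
     [None, 1]]
I = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
print(min_difference(T, I))  # exact match at row 1, column 1 -> 0
```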
IV. PARALLEL IMAGE MATCHING FOR BINARY IMAGES
This section shows our parallel image matching architecture for further acceleration. Thus, we use m registers with (2m - 1) bits each to store a subimage of (2m - 1) x m pixels. Again, we assume that m pixels of an image I are supplied in every local clock cycle.
Hence, the (2m - 1) vertical pixels cannot be transferred to the rightmost (2m - 1)-bit register in every local clock cycle. To supply the (2m - 1) pixels to the register in every local clock cycle, we use an (m - 1) x n-bit cache, that is, a cache with (m - 1)-bit data and a (log n)-bit address.
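One way to read the register-and-cache arrangement is sketched below as a cycle-level Python model. Several details are assumptions of this sketch rather than statements from the text: the image is taken to be processed in horizontal strips of m rows, one column of m new pixels arrives per cycle, the cache entry for column x holds the bottom m - 1 pixels of the previous strip, and the m vertically shifted placements inside the (2m - 1) x m window are evaluated together. The name `parallel_min_difference` is hypothetical, and n is assumed to be a multiple of m.

```python
from typing import List, Optional, Tuple


def parallel_min_difference(template: List[List[Optional[int]]],
                            image: List[List[int]]) -> Tuple[int, int]:
    """Return (minimum difference, simulated cycle count) for binary images."""
    m, n = len(template), len(image)
    assert n % m == 0, "this sketch assumes n is a multiple of m"
    effective = [(i, j, template[i][j])
                 for i in range(m) for j in range(m)
                 if template[i][j] is not None]
    cache: List[Optional[List[int]]] = [None] * n  # (m-1) pixels per column
    best = len(effective)
    cycles = 0
    for top in range(0, n, m):          # one strip of m rows at a time
        window: List[list] = []         # up to m columns of (2m-1) pixels
        for x in range(n):              # one column per simulated cycle
            new = [image[top + r][x] for r in range(m)]
            old = cache[x] if cache[x] is not None else [None] * (m - 1)
            column = old + new          # the (2m-1) vertical pixels
            cache[x] = new[1:]          # keep bottom m-1 pixels for next strip
            window = (window + [column])[-m:]
            cycles += 1
            if len(window) < m:
                continue
            for k in range(m):          # m vertical offsets, evaluated together
                if top == 0 and k < m - 1:
                    continue            # rows above the image do not exist
                d = sum(window[j][k + i] != t for i, j, t in effective)
                best = min(best, d)
    return best, cycles


T = [[1, 0],
     [None, 1]]
I = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
print(parallel_min_difference(T, I))  # (0, 8)
```

In this model every vertical placement is covered exactly once across the strips, and the loop body runs (n / m) * n times, which is where the cycle count of 8 for n = 4 and m = 2 comes from.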
In what follows, we describe how our parallel image matching hardware illustrated in Figure 3 works. An image I is transferred to the registers as follows. (m - 1)n-bit block RAM. We have also constructed a parallel version of our gray-scale image matching hardware in the same way as for binary images. However, due to the page limitation, we omit the description of the parallel gray-scale image matching hardware.
V. THE PERFORMANCE EVALUATION
The main purpose of this section is to evaluate the performance of our image matching hardware in computing D_T(I) for templates T with 32 x 32 pixels and images I with 1024 x 1024 pixels. Before we evaluate the performance of our hardware, we show the computing time of the software approach for comparison. Table I I[x, y]) is evaluated using formula (1), the computing time is proportional to e. For L = 2, we use the bitwise XOR operation on 32-bit words to evaluate formula (2). Also, we accelerate the computation of
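The word-level XOR idea for L = 2 can be sketched as follows. Since formula (2) is not reproduced here, the exact form used by the authors is unknown; this sketch assumes that each template row is packed into one machine word, that a second word per row masks out the don't-care positions, and that the difference is the number of 1's in the masked XOR. The names `pack_row` and `xor_difference` are hypothetical, and 4-bit rows are used only to keep the example short; the same code applies to 32-bit rows.

```python
from typing import List


def pack_row(bits: List[int]) -> int:
    """Pack a row of binary pixels into one integer, leftmost pixel first."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word


def xor_difference(t_words: List[int], mask_words: List[int],
                   window_words: List[int]) -> int:
    """Count differing effective pixels, one word (row) at a time."""
    total = 0
    for t, mask, w in zip(t_words, mask_words, window_words):
        diff = (t ^ w) & mask          # 1 where an effective pixel differs
        total += bin(diff).count("1")  # population count of the word
    return total


t_row = pack_row([1, 0, 1, 1])
mask_row = pack_row([1, 1, 0, 1])   # third pixel is a don't care
w_row = pack_row([0, 0, 1, 0])
print(xor_difference([t_row], [mask_row], [w_row]))  # 2
```

With this packing, one XOR, one AND, and one population count replace up to 32 per-pixel comparisons per template row.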
We have also estimated the performance of our image matching hardware using the VirtexII series FPGA XC2V8000. Again, we assume that a template image T has 32 x 32 pixels and an image I has 1024 x 1024 pixels. We have estimated the performance of our hardware for randomly generated templates T of size 32 x 32 with e = 128, 256, 512, and 768 effective pixels. Due to the stringent page limitation, we omit the performance of our non-parallel image matching hardware.
