Abstract-The main contribution of this paper is to present an image retrieval system using FPGAs. Given a template image and a database of images, our system lists all images that contain a subimage similar to the template. More specifically, a hardware generator in our system creates the Verilog HDL source of hardware that determines, for a particular template and any image, whether the image has a subimage similar to the template. The created Verilog HDL source is embedded in an FPGA using the design tool provided by the FPGA vendor. Since the hardware embedded in the FPGA is designed for a particular template, it is instance-specific hardware that allows us to achieve extreme acceleration. We evaluate the performance of our image matching hardware using a PCI-connected Xilinx FPGA and a timing analyzer. Since the generated hardware attains a speedup factor of up to 3000 over the software solution, our approach is promising.
or larger. We are interested in the task of listing all images in the database that contain a subimage similar to the template. This task has many applications in areas such as object recognition, vehicle tracking, and finding a particular pattern in VLSI masks, among others [1]. The main contribution of this paper is to present an FPGA-based instance-specific hardware solution for this task. More precisely, let . The source program is compiled using a design tool provided by an FPGA vendor. The created hardware is embed- ) that are taken into account for image matching. As we are going to show later, the evaluation of We evaluate the performance of our hardware using a timing analyzer for the Xilinx VirtexII series FPGA XC2V8000. Further, we test our hardware using the Spartan2 (XC2S150) PCI card Strathnuey [3]. Since the generated hardware attains a speedup factor of up to 3000 over the sequential algorithm, our approach is a promising solution. An image matching hardware using an FPGA has been proposed [2]. That hardware is not instance-specific, does not support gray-scale images, and runs in S C U " H clock cycles. Thus, our hardware is a significant improvement on FPGA-based image matching hardware. Also, an instance-specific solution for image matching has been shown [6]. However, it does not support gray-scale images or parallel matching.
II. THE IMAGE DIFFERENCE FUNCTION

An
The number of effective pixels is the number of pixels of the template that are not "don't care" pixels. Intuitively, the difference between a template and an image is the sum of the differences in brightness over all effective pixels. Clearly, it takes a larger value if the two are less similar. Note that, for a binary template and a binary image, the difference at each effective pixel is the exclusive OR of the template pixel and the image pixel.
Suppose that an image is larger than a template image. Clearly, the image difference function for this case is small if the image has a subimage similar to the template.
denote a function such that
in turn, we can retrieve all images in a database of images
, which have a similar subimage to , the value of
time. Therefore, the task of computing the image difference
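The definitions of this section can be illustrated with a minimal software sketch. It is not the paper's notation or implementation: the function names `difference`, `image_difference`, and `retrieve` are hypothetical, "don't care" pixels are encoded as `None` by assumption, the per-pixel difference is assumed to be the absolute difference in brightness, and the value for a larger image is assumed to be the minimum over all subimage positions, which is consistent with the function being small when a similar subimage exists. The retrieval threshold is likewise an assumed parameter.

```python
from typing import List, Optional

Template = List[List[Optional[int]]]  # None marks a "don't care" pixel (assumed encoding)
Image = List[List[int]]


def difference(template: Template, window: Image) -> int:
    """Sum of absolute brightness differences over the effective template pixels."""
    total = 0
    for t_row, w_row in zip(template, window):
        for t, w in zip(t_row, w_row):
            if t is not None:          # skip "don't care" pixels
                total += abs(t - w)    # equals XOR when both pixels are 0 or 1
    return total


def image_difference(template: Template, image: Image) -> int:
    """Smallest difference over all template-sized subimages (assumed definition)."""
    m_rows, m_cols = len(template), len(template[0])
    n_rows, n_cols = len(image), len(image[0])
    best = None
    for r in range(n_rows - m_rows + 1):
        for c in range(n_cols - m_cols + 1):
            window = [row[c:c + m_cols] for row in image[r:r + m_rows]]
            d = difference(template, window)
            if best is None or d < best:
                best = d
    return best


def retrieve(template: Template, images: List[Image], threshold: int) -> List[int]:
    """Indices of images whose image difference is at most the assumed threshold."""
    return [k for k, img in enumerate(images)
            if image_difference(template, img) <= threshold]
```

For binary pixels the absolute difference coincides with the exclusive OR, so the same sketch covers both the binary and the gray-scale case.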
III. AN IMAGE MATCHING HARDWARE FOR BINARY IMAGE RETRIEVAL
In this section, we show our FPGA-based instance-specific hardware that computes the image difference function for binary images. We use a combinational circuit, the Muller-Preparata circuit [4, 5], that computes the number of 1's in the input bits. For every pixel of the template image, the corresponding register bit or its inversion is connected to the Muller-Preparata circuit if it is 1 or 0, respectively.
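The instance-specific idea can be mimicked in software as a sketch: the template is consumed once to decide, per effective pixel, whether a register bit is used directly or inverted, and the resulting matcher only counts 1's. The name `build_matcher` and the `None` encoding for "don't care" pixels are hypothetical. The polarity is also an assumption: the sketch inverts the bit where the template pixel is 1, so that the count equals the XOR difference; with the opposite polarity the same structure would count matching pixels instead.

```python
from typing import Callable, List, Optional


def build_matcher(template_bits: List[Optional[int]]) -> Callable[[List[int]], int]:
    """Specialize a matcher to one template, mimicking instance-specific wiring.

    template_bits is a flat list of 1, 0, or None ("don't care", assumed encoding).
    The returned function maps register bits to the binary difference.
    """
    # One wire per effective pixel: (register index, inverted?).
    # Assumed polarity: invert where the template pixel is 1, so that each
    # wire carries (template pixel XOR register bit).
    wires = [(i, t == 1) for i, t in enumerate(template_bits) if t is not None]

    def matcher(register_bits: List[int]) -> int:
        # Software stand-in for the ones-counting circuit: count wires carrying 1.
        return sum((1 - register_bits[i]) if inverted else register_bits[i]
                   for i, inverted in wires)

    return matcher
```

A matcher built this way contains no comparison against the template at run time, which is the property the generated hardware exploits.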
IV. PARALLEL IMAGE MATCHING FOR BINARY IMAGES
This section shows our parallel image matching architecture for further acceleration. In what follows, we describe how our parallel image matching hardware illustrated in Figure 3 works. An image is transferred to the registers as follows. -bit block RAM. We have also constructed a parallel version of our gray-scale image matching hardware, similar to that for binary images. However, due to the page limitation, we omit the description of the parallel gray-scale image matching hardware.
V. THE PERFORMANCE EVALUATION
The main purpose of this section is to evaluate the performance of our image matching hardware to compute pixels. Before we evaluate the performance of our hardware, we show the computing time of the software approach as a counterpart. Table I shows the computing time of H is evaluated using formula (1), the computing time is proportional to . For V 0 , we use the bitwise XOR operation on words of 32-bit data to evaluate formula (2). Also, we accelerate the computation of the sum in formula (2) using a look-up table storing the number of 1's in each 16-bit value. More precisely, let be a
We have tested our image matching hardware using the Spartan2 (XC2S150) PCI card Strathnuey [3]. This PCI card is connected to the host PC through the 33-MHz 32-bit PCI bus. Table II illustrates the performance of our image matching hardware, which includes the clock frequency given by the timing analyzer, the actual time to evaluate the image difference function, the speedup over the software, the number of used slices out of 1728 available slices, and the number of used slice flip-flops out of 3456 available flip-flops. Unfortunately, due to the small capacity of the XC2S150, we could test our non-parallel hardware for binary images with words of 32-bit data are transferred in 75 msec, the PCI bus sends images at 434 Mbit/s, which is close to the actual maximum bandwidth of the 33-MHz 32-bit PCI bus.
We have also estimated the performance of our image matching hardware using the VirtexII series FPGA XC2V8000. Again, we assume that a template image pixels. We have estimated our hardware for randomly generated templates and 768. Due to the stringent page limitation, we omit the performance of our non-parallel image matching hardware. Table III shows the performance of the parallel image matching hardware. For
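The software baseline for binary images described in this section, a bitwise XOR on 32-bit words followed by a look-up of the number of 1's in each 16-bit half, might be sketched as follows. The names `ONES_16`, `popcount32`, and `binary_difference` are hypothetical, and the `mask_words` argument marking effective pixels is an assumption, since the text does not say how the software handles "don't care" pixels.

```python
from typing import List

# Look-up table storing the number of 1 bits in every 16-bit value.
ONES_16 = [0] * (1 << 16)
for value in range(1, 1 << 16):
    ONES_16[value] = ONES_16[value >> 1] + (value & 1)


def popcount32(word: int) -> int:
    """Number of 1 bits in a 32-bit word, using two 16-bit table look-ups."""
    return ONES_16[word & 0xFFFF] + ONES_16[(word >> 16) & 0xFFFF]


def binary_difference(template_words: List[int],
                      image_words: List[int],
                      mask_words: List[int]) -> int:
    """Binary difference over packed 32-bit words.

    mask_words has a 1 bit for each effective pixel (assumed representation).
    """
    total = 0
    for t, w, m in zip(template_words, image_words, mask_words):
        # XOR marks disagreeing pixels; the mask drops "don't care" positions.
        total += popcount32((t ^ w) & m)
    return total
```

Splitting each word into two 16-bit halves keeps the table at 65536 entries, whereas a direct 32-bit table would be impractically large.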
