Because of the growing use of multimedia content over Internet, Content-Based Image Retrieval (CBIR) has recently received a lot of interest. While accurate search techniques based on local image descriptors exist, they suffer from very long execution times. We propose to accelerate CBIR on the RDISK machine, a cluster of FPGA-enhanced hard drives that follows the philosophy of smart disks. Our platform combines coarse and fine grain parallelism thanks to the concurrent use of the cluster nodes and of a programmable logic device. The implementation of the CBIR application on this mixed hardware/software platform follows a strict methodology that was validated on a realistic data-set (an image database of more than 30 000 images). This methodology allows us to adapt the original algorithm to suit a hardware implementation and to select the values of some key design parameters to maximize global performance. Our preliminary results indicate that speed-ups between 120 and 200 could be obtained for a cluster of 32 nodes compared with a software implementation running on a standard desktop PC.

1. Introduction

Content Based Image Retrieval (CBIR) is a technique that allows one to find the images of a database that are (at least) partly similar to a given reference image. CBIR is drawing increasing interest due to its potential application to problems such as image copyright enforcement. Checking copyright is therefore a concern for image owners, who must be able to identify undue use of images. This identification process relies upon precise and fast image-comparison algorithms, as Internet is a rapidly changing medium, and such algorithms need to be run on a daily basis.

1.1. CBIR Systems

CBIR is mainly based on the comparison of the image descriptors of a reference image with those of a descriptor database. Descriptors may be either global, i.e. they represent some global feature of an image (e.g. a grey-level histogram), or local, in which case they describe special points of interest in the image (e.g. corners, color changes, etc.). Recent research has shown that local descriptors provide a more robust approach, since they are less dependent on image variations.

Retrieving an image consists first in associating a set of descriptors with the reference image (typically a few hundred vectors of 24 real components), then in computing the distance between each one of these descriptors and those of the database images (distance calculation stage). For each reference descriptor, a k-nearest neighbor sorting selects the k database descriptors whose distances are the smallest (selection stage). Finally, votes are assigned to the images depending on their occurrences in the k-nearest neighbor lists (election stage): the image that has the largest number of votes is considered to be the best match. The whole pro [...] In [18] [...]
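The three stages described in this subsection (distance calculation, selection, election) can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the function names `l2_squared` and `retrieve`, the layout of the database as `(image_id, descriptor)` pairs, and the use of the squared Euclidean distance (which yields the same ranking as the Euclidean distance) are assumptions made for the example.

```python
from collections import Counter
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance; it ranks descriptors like the L2 distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve(query_descriptors, database, k):
    """Return (image_id, votes) pairs sorted by decreasing number of votes.

    `database` is a list of (image_id, descriptor) pairs (hypothetical layout).
    """
    votes = Counter()
    for q in query_descriptors:
        # Distance calculation stage: one distance per database descriptor.
        scored = ((l2_squared(q, d), image_id) for image_id, d in database)
        # Selection stage: keep the k smallest distances.
        nearest = heapq.nsmallest(k, scored)
        # Election stage: each selected descriptor votes for its image.
        for _, image_id in nearest:
            votes[image_id] += 1
    return votes.most_common()


# Toy data: 2-component descriptors instead of the 24 components used in the paper.
database = [("a", [0, 0]), ("a", [0, 1]), ("b", [5, 5]), ("c", [9, 9])]
print(retrieve([[0, 0], [0, 1]], database, k=2))  # [('a', 4)]
```

The cost of the sketch is dominated by the inner distance loop, which runs once per (query descriptor, database descriptor) pair; this is the stage the paper targets with hardware.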
[...] [5], only limited efforts were put in that direction. While a few papers report parallel implementations of CBIR on shared memory machines and PC clusters [118], there has been almost no work on special purpose hardware for this application domain. Moreover, none of the few noticeable exceptions address the problem in the context of a real-life hardware system (e.g. with its communication interface, I/O bandwidth constraints, etc.) [11, 19, 14].

In this paper, we approach this problem by combining both parallelism and special-purpose hardware. Our model architecture is that of a so-called smart disk, whose prototype is RDISK [9], implemented and available at IRISA. In short, a smart disk is a collection of disk drives, each one controlled by a reconfigurable, FPGA-based processor, called a filter. Filter processors handle on the fly the data coming out from the disk that they control, and transmit those data that are relevant to a given request to a common host.

The remainder of this paper is organized as follows. Section 2 presents the RDISK platform, from both the system level and architectural points of view. Section 3 briefly describes the problems that needed to be addressed to obtain an efficient implementation, and the methodology that we used to solve them. Section 4 presents some of the algorith [...]

2. The RDISK Cluster Architecture

The RDISK system [9] was designed to provide high performance with low-cost hardware components. Our initial goal was to build a system whose nodes would cost no more than one tenth of the cost of a PC cluster node (i.e. approximately $200). RDISK consists (see Fig. 1) of a number of nodes, typically a few tens. A node contains a 40 GB IDE disk, a 100 Mbps Ethernet controller, 16 MB of SDRAM, an 8051 micro-controller, and an FPGA chip that serves as the filter processor.

Figure 1. The system level view of an RDISK cluster

The filter processor allows data coming from the disk to be processed on the fly in order to select those data that are relevant for some application algorithm, e.g. a search query. Filtered data are then sent to a post-processing host through [...] the hardware filter component. This filter is designed according to the template interface described in Fig. 2. Its role is to process in real-time the data stream coming from the disk drive and to send its output to the embedded CPU. This embedded CPU then performs a post-filtering process and sends the final results to the host through the Ethernet network. Programming an RDISK node thus consists in designing an application dependent hardware filter (in VHDL, Verilog, or higher level languages such as Handel-C), in writing a post-filtering C program to be run on the embedded CPU, and in writing a post-processing program for the host processor.

Figure 2. The RDISK System on Chip Internal Organization.

An important point of RDISK is its reconfigurability: at any time, upon receipt of a specific command from the host, an RDISK node can switch to another hardware configuration and its associated hardware filter. This reconfiguration is handled by the 8051 micro-controller, which reconfigures the FPGA from a bit-stream file stored on the hard disk. Each node can store up to 256 hardware configurations, and a custom file system allows configurations to be added or removed from the drive by the host.

The RDISK cluster thus takes advantage of two levels of parallelism: coarse grain parallelism through the concurrent use of the nodes, and fine grain parallelism within each hardware filter. We shall indeed see that a large amount of fine grain parallelism can be used when designing a hardware filter for the FPGA processor.

Our RDISK cluster prototype is fully functional and has already proved to be a very promising platform; two applications of computing biology have been successfully implemented and speed-ups of two orders of magnitude have been reported [9].

3. CBIR Implementation on RDISK: Problems and Methodology

The basic mapping scheme of CBIR on the RDISK n-node cluster is quite straightforward. The database descriptors are equally distributed among the n nodes of the cluster. Each filter gets a fraction $q_0$ of the $q$ descriptors associated to the query image and computes the distance between each one of the $q_0$ descriptors and a subset $b_0$ of the database descriptors. As the distances are computed, they are also sorted and only the k smallest distances are kept. These distances are then sent to the host processor where the final election stage is performed. This simple scheme however raises several questions, which we list now.

1. As the filter processor is based on the FPGA technology, efficient implementations cannot be obtained using floating-point calculations, i.e. by a mere translation of the initial software description. Therefore, an analysis [...] We do it in Section 4.

2. The second question is how to efficiently implement the distance computation step on the FPGA. Given that most FPGA designs operate at frequencies below 100 MHz, the only way to reach good performance is to take advantage of fine grain parallelism within the FPGA. Section 5 will show how this can be done efficiently.

3. [...] at the system level. We do so by providing a detailed performance model in Section 6. This allows us to estimate global performance and to optimize the hardware filter design.

4. Optimizations for Hardware Implementation

The software implementation of an algorithm often requires important modifications when it is to be implemented as application specific hardware. In this section we present the two main transformations that we applied on the initial CBIR specification: the conversion from floating-point to fixed-point arithmetics, and the use of the L1 distance as an alternative to the Euclidian (L2) distance.
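As a rough illustration of the two transformations named above, the following Python sketch quantizes a descriptor component with scaling and saturation, and contrasts the L1 (sum of absolute differences) and squared L2 accumulations. The function names, the symmetric interval `[-bound, bound]`, the signed `bits`-wide integer code, and the rounding mode are all assumptions made for the example; the paper selects its actual encoding by simulation.

```python
def to_fixed_point(x, bound, bits):
    """Saturate x to [-bound, bound], then scale it to a signed integer code.

    ASSUMPTION: symmetric saturation and a code range of
    [-(2**(bits-1) - 1), 2**(bits-1) - 1]; not taken from the paper.
    """
    clamped = max(-bound, min(bound, x))   # saturation
    max_code = (1 << (bits - 1)) - 1       # e.g. 127 for 8 bits
    return round(clamped / bound * max_code)  # scaling


def l1_distance(a, b):
    """Sum of absolute differences: a subtract-accumulate per component."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_squared(a, b):
    """Squared Euclidean distance: a square-accumulate per component."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


print(to_fixed_point(2.0, 1.0, 8))        # 127 (saturated)
print(l1_distance([1, 2, 3], [2, 0, 3]))  # 3
print(l2_squared([1, 2, 3], [2, 0, 3]))   # 5
```

The L1 form needs no multiplier per component, which is the resource saving that motivates the change of metric; as the two distances are not equivalent, the ranking they produce can differ.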
Using Fixed-Point Arithmetics

Conversion Methodology
Moving from floating-point to fixed-point is not straightforward. Such a conversion generally induces a loss of precision in the computations (due to quantization and rounding errors), which may in turn impact the Quality of Results (QoR) of the algorithm. Using analytical models, it is possible to quantify, in terms of Signal to Quantization Noise Ratio (SQNR), the impact of a conversion to a fixed-point format [13]. However, this modeling is useful only when it is possible to determine an upper bound of the acceptable SQNR for the application at hand (this is very often the case in signal processing applications).

Unfortunately, in the context of CBIR, there is no way to directly relate the SQNR to the quality of the search results. The only solution is hence to use extensive simulation and to observe experimentally the impact of a conversion scheme on the search results. Ultimately, we expect to determine the narrowest fixed-point encoding that will preserve the CBIR accuracy. To do so, we first defined a metric for quantifying the CBIR answer accuracy, and then we explored various possibilities of fixed-point encoding to efficiently combine scaling and saturation.

Validation Methodology

According to the CBIR algorithm, the election step results in a list of pairs $(I_i, s_i)$ in which $I_i$ corresponds to an image of the database and $s_i$ to its score (i.e. the number of votes received by the image). This list is sorted according to the number of votes, $I_1$ being the image with the highest score. The accuracy of the search is then defined by: [...]

From the initial data distribution histogram, the descriptor components appear to be concentrated within a very narrow interval; this suggests that mapping the initial dataset to a short interval will have limited impact on accuracy. In the following, we chose the mapping interval such that 97% of the initial descriptor components would belong to it. Components that fall outside the interval are saturated; for instance, a component $x$ such that $x < -B$ is mapped to $-B$.

4.3. Changing the Distance Metric

Another typical transformation when dealing with a hardware implementation is strength reduction. It consists in replacing a costly operation (say multiplication or division) by a simpler one (usually addition or shift) that is functionally equivalent to the initial operation. In this work, we performed a somewhat similar transformation: we proposed to substitute the Sum of Absolute Differences (SAD, L1) distance for the standard Euclidian (L2) distance. This allows the square-accumulate operation to be replaced by a simple subtract-accumulate with much lower resource usage. Note that this transformation does not lead to a functionally equivalent implementation. Therefore, its effect had to be checked by simulation.

Figure 4 shows the relative accuracy obtained for various fixed-point bit-width formats using the L1 distance, compared to the original floating-point implementation using the L2 distance. For bit widths of 8 and above we obtain results which are almost identical to those of the original software implementation (whose global accuracy is 85%). We can hence take advantage of this information during the hardware filter design stage to reduce the resource usage and increase the implementation performance.

In this Section, we turn to the problem of efficiently implementing distance calculations using the fine-grain parallelism that is available on the FPGA.

5.1. Deriving a Parallel Architecture

Our approach is inspired by systolic-array design methodologies [16], and especially by partitioning techniques. The parallelization methodology in itself is out of the scope of this paper: we will therefore limit ourselves to showing its results (e.g. the parallel processor arrays on which the dis [...]

Figure 5. A 2D systolic array for distance computations

A quick evaluation showed that this architecture is not suited to RDISK. First, it uses too much logic resource to be considered for the actual Spartan II FPGA chip. Second, the bandwidth necessary to feed this processor array is about 600 M descriptors per second for a hardware filter running at 25 MHz, which is far beyond the 15 MBps available from the disk drive.

Partitioning the Architecture

A partitioning transformation of this architecture allows one to adjust the resources and the bandwidth of the hardware filter to the RDISK architecture. The Locally Sequential Globally Parallel (LSGP) partitioning scheme (see [6]) consists in grouping processors together into so-called tiles, and in merging these tiles. Computations are then executed sequentially for each tile by a unique processor. For example, Fig. 6 shows the architecture of Fig. 5 after LSGP partitioning using a 24 x 3 tile. This new architecture has therefore N physical processors: one processor takes care of 3 successive columns of the initial network. Thus, it reads an average of one word of a reference descriptor every 3 cycles. As a counterpart, it needs to store three complete descriptors, which increases its memory resource cost.

Figure 6. Equivalent 24 x 3 partitioned processor array

Note that this partitioning results in a linear array, since the first dimension is equal to the height (24) of the array. Doing so has several advantages: it simplifies the hardware interface with the hard-disk FIFO, and it allows further optimizations that will be described in Subsection 5.3. From now on we will hence only consider tiles of the form 24 x $\sigma$.

In addition to this transformation, we also pipelined the internal structure of each processor using cut-set retiming, as shown in Fig. 7: combining this with partitioning provides very efficient implementations [6].

Figure 7. The filter processing element architecture and its 3 stage pipelined datapath (control signals are not represented)

5.3. Sorting and Thresholding

As seen in Section 5, the goal of the hardware filter is to find, for each reference descriptor, a list of k database descriptors that provide the best k distances. This requires the distance calculation to be followed by a sorting. Implementing sorting as a fine grain parallel architecture would be possible (for example, using a sorting systolic array), but [...]

We therefore made the choice to implement this sorting stage as part of the post-filtering step on the embedded CPU. In this procedure, the data for post-filtering are transmitted to the post-processing host only once the node has finished scanning its local database.

One important assumption of this design choice is that we expect the post-filtering not to be a performance bottleneck. In our case, since no communication occurs between the host and the node during the scan, the network processing workload on the embedded CPU is very limited. This leaves almost all its processing power for performing this post-filtering step. We also know (from benchmarking) that our embedded CPU is able to sustain the sorting and insertion [...]

To reduce this throughput, we use the fact that the distance of the k-th element of the distance list constitutes a natural threshold for the distance calculation processors; any partial accumulated distance that exceeds this threshold will not appear on the final distance list, and can therefore be discarded. As a consequence, the filter only outputs those database descriptors whose scores are below the threshold of at least one of the query descriptors (we call such a descriptor a match).

Using execution traces, we observed that the average probability for a database descriptor to be a match for a given query descriptor is $p = 1.89 \times 10^{-5}$. From there, we derived an estimation of the actual filter selectivity, which depends on the number of query descriptors handled by the filter (the more query descriptors, the higher the chance to have a match). This led to an average of $10.67\, n_d$ matches per second (with $n_d$ being the number of query descriptors handled by the array). In other words, the CPU is able to sustain post-processing for a hardware filter handling up to several thousand query descriptors.

Additionally, if a processor, say P, while computing the distance between reference descriptor number n and some database descriptor, detects that the accumulated distance exceeds this threshold, it can ignore the remainder of the database descriptor and immediately jump to the next one, hence saving computation time. In the rest of this paper we call such an event a threshold overflow.

The threshold overflow optimization is currently used in the software implementation of the CBIR application, and it has proved to be very efficient, since it reduces the computation volume by a factor of 8 and the execution time by a factor of 4. Unfortunately, this optimization is not that efficient when used in our parallel architecture. Because it behaves as an SIMD architecture, the array cannot proceed to the next descriptor unless all processors have overflowed their corresponding threshold. This observation has a severe impact on the actual efficiency of our architecture: the more processors we add to the array, the longer we have to wait, as synchronization is done on the worst case.

A quantitative analysis of this phenomenon can be derived using a simple probabilistic model. Fig. 8(a) shows the probability $P_{th}(i)$ that, during a distance computation, a threshold overflow occurs at iteration $i$, that is to say, after reading the $i$-th component of the descriptor. One can observe that this probability is concentrated in the very small values of $i$. Let now $P_{th}(i, p)$ be the probability, for an array of $p$ processors, that a threshold overflow occurs at iteration $i$, then: [...]

This suggests that for very small values of $p$, adding a processor to the array brings almost no performance benefit, since the positive effect of this additional processing power on the execution time is annihilated by the synchronization overhead it induces.

6. System Level Optimization

Although we now have a relatively accurate architectural model, we still need to determine the set of design parameters that will allow optimal performance. Such a model is to be established at the system level, and must integrate several [...]

6.1. Modeling RDISK Node Execution Time

In this Subsection, we model the execution time by taking into account both computation and I/O timing information.

Let us call $T_{io}$ the time required to read a descriptor of size $S_{desc}$ from the hard drive, and $T_{byte}$ the average access time for a byte on the disk; we can write $T_{io} = S_{desc} \cdot T_{byte}$. On the RDISK prototype, we have a sustained hard-drive I/O bandwidth of 15 MBps, which leads to $T_{byte} = 66$ ns, and a descriptor size of $S_{desc} = 26$ bytes (24 bytes for its components plus one 16-bit word for its associated image index). We thus have $T_{io} = 1716$ ns.

Let now $T_{prc}$ be the time required by the hardware filter to process a single database descriptor. This value depends on several parameters: the filter clock speed $T_{clk}$, the partitioning parameter $\sigma$, the processor pipeline depth $L$, and the average number of useful iterations, which is a function of $p$ [...] (4)

The execution time per database descriptor is then governed by $\max(T_{io}, T_{prc})$. [...] working in parallel in the array, each one computing a distance score, the average time for a distance computation is then given by equation (6). [...]

6.2. Hardware Implementation Optimization

We have to take into consideration the resource constraints of our target FPGA: the number of logic cells available for implementing the elementary processor datapath, and the number of memory blocks (BlockRam) available to store the query descriptor components along with their corresponding list threshold.

Our System on Chip implementation leaves 9 BlockRam and roughly 3,300 logic cells available for the hardware filter.
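The I/O arithmetic of Subsection 6.1 and the lock-step penalty discussed in Subsection 5.3 can be checked numerically with the short Python sketch below. It is a hedged illustration only: the function names are hypothetical, and `expected_useful_iterations` assumes that processors overflow their thresholds independently of one another, which is a simplification introduced here and not necessarily the model used in the paper. The processing time $T_{prc}$ is left as an input.

```python
def read_time_ns(descriptor_bytes, byte_time_ns):
    """T_io = S_desc * T_byte, in nanoseconds."""
    return descriptor_bytes * byte_time_ns


def expected_useful_iterations(p_th, processors):
    """Mean iteration at which a lock-step array can move to the next descriptor.

    p_th[i - 1] is the probability that one processor overflows its
    threshold at iteration i (the list is assumed to sum to 1).
    ASSUMPTION: processors overflow independently, so the array waits for
    the slowest one and P(all done by i) = cdf(i) ** processors.
    """
    expected = 0.0
    cdf = 0.0
    previous = 0.0
    for iteration, probability in enumerate(p_th, start=1):
        cdf += probability
        all_done = cdf ** processors
        expected += iteration * (all_done - previous)
        previous = all_done
    return expected


def descriptor_time_ns(t_io_ns, t_prc_ns):
    """Per-descriptor time is bounded by the slower of I/O and processing."""
    return max(t_io_ns, t_prc_ns)


print(read_time_ns(26, 66))                        # 1716
print(expected_useful_iterations([0.5, 0.5], 1))   # 1.5
print(expected_useful_iterations([0.5, 0.5], 2))   # 1.75
```

With this toy distribution, going from one to two processors raises the expected number of iterations from 1.5 to 1.75, which illustrates why adding processors to a synchronized array yields less than a proportional gain.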
Figure 9. Performance model for various parameters (axis: partitioning parameter $\sigma$)

Using the model given in (6), we can now derive an estimate of the global performance as a function of the partitioning parameter $\sigma$ and of the filter clock speed $T_{clk}$. This estimate is shown in Fig. 9. We can observe that the filter clock speed has little impact on the overall performance. This suggests that our implementation performance is rather limited by other factors: (i) the hard-drive [...]

We have successfully implemented an RDISK SoC system that includes a hardware filter consisting of 36 elementary processors. We used SynplifyPro 7.3 as synthesis tool and Xilinx ISE 7.1 for placing and routing. This design was specified in VHDL; it uses 2350 out of 2352 available slices and supports operating frequencies up to 25 MHz. Any attempt to fit a higher number of PEs failed due to FPGA resource overuse (in terms of memory blocks).

The hardware filter in itself occupies approximately 65% of the chip area, and the place and route software reports indicate that the filter alone can be clocked above 50 MHz. Thanks to the results summarized in Fig. 8, we realized that it was not worth decoupling the hardware filter clock from the rest of the system (which operates at 25 MHz): while requiring an important design effort, such a modification would only bring a very limited improvement (approximately 2%).

So far the design has not been completely tested in a real-life situation; however, the hardware filter was functionally validated at the register transfer level (using a VHDL simulator), and we expect to provide detailed performance results in the forthcoming months.

7.2. Comparing with PC Implementation

As mentioned in the introduction, a direct software implementation of the CBIR application on a 2.4 GHz Pentium 4 processor requires an average processing time of 1500 seconds for each query (with an average of 693 descriptors per query). We considered as a comparison a cluster of 32 nodes, processing queries ranging from 250 to 1500 descriptors (these are typical bounds for images). Using our performance model, we computed some estimates of the expected speed-up, which are represented in Fig. 10. To summarize our results, we obtain speed-ups varying between 150 and 200 depending on the query size for the 32 node cluster. For a single RDISK node, the speed-ups vary between 4 and 6. Given that the cost of a 32-RDISK cluster(1) can be estimated at $12,000 (including Ethernet switches, power supplies, and rack cabinet), and assuming a cost of $400,000 for a 200-PC cluster, we can roughly estimate the price/performance ratio at 40 in favor of the RDISK cluster.

(1) Our actual RDISK cluster has 48 nodes; its cost is approximately $15,000.

Figure 10. Speed-up as a function of the number of nodes and the number of query descriptors

On the other hand, most recent processors now integrate SIMD instruction sets targeted to multimedia applications. These instruction sets operate on short fixed-point data types (8, 16 or 32-bit). As an example, the MMX instruction set provides instructions that can perform up to 8 x 8-bit operations in parallel. It can therefore be objected that the performance results presented in this work for the RDISK platform should be compared with software implementations that would take advantage of such architectural improvements, since they are very likely to significantly boost software performance.

However, these instructions were reported to provide only limited performance improvement [4], with speed-ups only seldom above 2. In our case, as described in Section 8, this limited efficiency is worsened by the computation volume overhead induced by the SIMD execution model. For example, when using MMX instructions to compute 8 distance scores in parallel, the average number of useful iterations grows by a factor of almost 3, hence annihilating all the performance improvements due to parallelism.

7.3. Impact of Technology Improvement

As mentioned in Section 2, the RDISK prototype was designed using a low-cost FPGA of year 2001. However, FPGA technology is known to evolve very quickly, with ever increasing density and prices dropping significantly every year. To understand the impact of this evolution, we considered a hypothetical implementation of RDISK using a 2004 FPGA (namely the Spartan-3 FPGA). For the same cost, this new family offers five times the logic density of the current RDISK FPGA, with clock speed improvements in the range of 50%. According to our modeling, the op-
