A BASELINE FOR VISUAL INSTANCE RETRIEVAL WITH DEEP CONVOLUTIONAL NETWORKS

Abstract

This work presents simple pipelines for visual instance retrieval exploiting image representations based on convolutional networks (ConvNets), and demonstrates that ConvNet image representations outperform other state-of-the-art image representations on six standard image retrieval datasets. ConvNet-based image features have increasingly permeated the field of computer vision and are replacing hand-crafted features in many established application domains. Much recent work has illuminated how to design and train ConvNets to maximize performance.

Besides performance, another issue for visual instance retrieval is the dimensionality and memory requirement of the image representation. Two categories are usually considered, and we report results for both: small footprint representations, which encode each image with less than 1 kB, and medium footprint representations, whose dimensionality lies between 10k and 100k. The small regime is required when the number of images is huge and memory is a bottleneck, while the medium regime is more useful when the number of images is below 50k. In our pipeline for the small footprint we extract features from 576×576 images; for the medium footprint we combine those features with the spatial search method described in … (sketches of both steps follow the acknowledgment below). Furthermore, inspired by the recent work of …

Results Summary

To evaluate our model, we used two networks. The first is the one we refer to as AlexNet (Krizhevsky et al., 2012) … (Oxford5k performance drops to 82.6 while Paris6k performance increases to 87.5.) 2) Our pipelines are the first that work for both texture-less items (e.g. sculptures) and highly textured items (e.g. buildings) using exactly the same settings. 3) Previous methods are often specialized: they learn their parameters on similar datasets and can then suffer from domain shift. Our pipeline, in contrast, does not rely on the bias of the dataset, yet it can still be specialized to a high degree (e.g. by fine-tuning the OxfordNet with a landmark dataset …).

In sum, the work shows that ConvNet image representations outperform other state-of-the-art image representations for visual instance retrieval if one selects the appropriate responses from a generic deep ConvNet. Our result should only be viewed as a baseline, and by no means do we claim that our method is optimal yet. Even simple additions, such as concatenating representations from different architectures, give a boost in performance (e.g. 87.2 for Oxford5k).

Acknowledgment

We would like to thank NVIDIA for the generous donation of K40 GPUs.
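The sketches below illustrate the main computational steps described in this summary. They are minimal reconstructions under stated assumptions, not the authors' released code.

First, the small-footprint extraction: a plausible reading of the pipeline is to push a 576×576 image through a pretrained network and max-pool the last convolutional feature maps over all spatial locations. torchvision's pretrained AlexNet stands in here for the networks used in the paper.

```python
# Hedged sketch: small-footprint descriptor from a 576x576 input.
# Assumption: max-pooling the last conv layer's activations; torchvision's
# AlexNet is a stand-in, not necessarily the exact model used in the paper.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import alexnet

model = alexnet(weights="IMAGENET1K_V1").eval()

preprocess = transforms.Compose([
    transforms.Resize((576, 576)),             # fixed input size from the text
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def small_footprint_descriptor(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = model.features(x)               # (1, 256, 17, 17) conv activations
    desc = fmap.amax(dim=(2, 3)).squeeze(0)    # max over spatial locations -> 256-d
    return desc / desc.norm()                  # L2-normalize for cosine matching
```

A 256-dimensional float32 descriptor occupies about 1 kB, consistent with the small-footprint budget stated above.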
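For the medium footprint, the summary combines these features with a spatial search. A common formulation, and the assumption made here, is to extract one descriptor per sub-window (over several scales and positions) for both query and reference images and to score the pair by the best-matching window pair; whether the paper uses this exact reduction is not stated in this summary.

```python
# Hedged sketch: cross-matching step of spatial search. Window extraction is
# assumed to reuse small_footprint_descriptor on overlapping crops; the
# min-over-pairs reduction is an assumption, not confirmed by this summary.
import numpy as np

def spatial_search_distance(query_windows: np.ndarray,
                            ref_windows: np.ndarray) -> float:
    """query_windows: (m, d), ref_windows: (n, d); rows are L2-normalized."""
    sims = query_windows @ ref_windows.T        # (m, n) cosine similarities
    d2 = np.clip(2.0 - 2.0 * sims, 0.0, None)   # ||a - b||^2 = 2 - 2*cos(a, b)
    return float(np.sqrt(d2).min())             # distance of best window pair
```

With m and n windows of dimension d this costs O(m·n·d) per image pair, which fits the summary's note that the medium regime suits collections below 50k images.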
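Finally, the concatenation mentioned at the end of the summary (the 87.2 Oxford5k figure) admits a simple reading: L2-normalize the descriptor from each architecture, concatenate, and re-normalize. The normalization details are assumptions.

```python
# Hedged sketch: fusing descriptors from two architectures (e.g. AlexNet and
# OxfordNet) by normalized concatenation; the exact normalization scheme is
# an assumption, not confirmed by this summary.
import numpy as np

def concat_descriptors(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    a = desc_a / np.linalg.norm(desc_a)
    b = desc_b / np.linalg.norm(desc_b)
    fused = np.concatenate([a, b])
    return fused / np.linalg.norm(fused)       # unit length for cosine matching
```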
