Textual Query Based Image Retrieval
As digital cameras and mobile phones have become widespread, the number of consumer photos has grown rapidly, and retrieving the appropriate image with content-based or text-based image retrieval techniques has become a large-scale problem. Content-based image retrieval, a technique that uses visual content to search large-scale image databases according to users' interests, has been an active and fast-advancing research area, but it still faces the semantic gap between low-level visual features and high-level semantic concepts. We present a real-time textual query-based personal photo retrieval system that leverages millions of Web images and their associated rich textual descriptions. After the user provides a textual query, our system uses an inverted file to automatically find the positive Web images that are related to the query as well as the negative Web images that are irrelevant to it. Based on these images, we use k-Nearest Neighbor (kNN), decision stumps, and linear SVM to rank personal photos. To further improve photo retrieval performance, we use two relevance feedback methods via cross-domain learning, which effectively utilize both the Web images and personal images.
DOI: 10.17762/ijritcc2321-8169.15032
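The kNN ranking step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes every image is already represented by a numeric feature vector, that Web images carry a label of 1 (positive for the query) or 0 (negative), and the function names `knn_relevance` and `rank_photos` are hypothetical.

```python
import math


def knn_relevance(photo, web_feats, web_labels, k=3):
    """Score a photo as the fraction of positive images among its k nearest Web images.

    Assumption: features are plain lists of floats and labels are 1 (relevant
    to the textual query) or 0 (irrelevant).
    """
    # Euclidean distance from the personal photo to every labelled Web image.
    dists = [(math.dist(photo, feat), label)
             for feat, label in zip(web_feats, web_labels)]
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(label for _, label in nearest) / len(nearest)


def rank_photos(photos, web_feats, web_labels, k=3):
    """Return photo indices ordered from most to least relevant."""
    scores = [knn_relevance(p, web_feats, web_labels, k) for p in photos]
    return sorted(range(len(photos)), key=lambda i: scores[i], reverse=True)


# Toy data: three positive Web images near the origin, three negatives far away.
web_feats = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
web_labels = [1, 1, 1, 0, 0, 0]
photos = [[5.2, 5.1], [0.1, 0.2]]
ranking = rank_photos(photos, web_feats, web_labels)
```

In this toy setup the second photo sits among the positive Web images, so it is ranked first. The decision stump and linear SVM rankers mentioned in the abstract would replace `knn_relevance` with a different scoring function.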
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
With the success of large-scale visual-language pretraining models and the
wide application of image-text retrieval in industry areas, reducing the model
size and streamlining their terminal-device deployment have become urgently
necessary. The mainstream model structures for image-text retrieval are
single-stream and dual-stream, both aiming to close the semantic gap between
visual and textual modalities. Dual-stream models excel at offline indexing and
fast inference, while single-stream models achieve more accurate cross-modal
alignment by employing adequate feature fusion. We propose a multi-teacher
cross-modality alignment distillation (MCAD) technique to integrate the
advantages of single-stream and dual-stream models. By incorporating the fused
single-stream features into the image and text features of the dual-stream
model, we formulate new modified teacher features and logits. Then, we conduct
both logit and feature distillation to boost the capability of the student
dual-stream model, achieving high retrieval performance without increasing
inference complexity. Extensive experiments demonstrate the remarkable
performance and high efficiency of MCAD on image-text retrieval tasks.
Furthermore, we implement a mobile CLIP model on Snapdragon chips with only 93M
running memory and 30ms search latency, without apparent performance
degradation of the original large CLIP.
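The logit distillation mentioned in this abstract can be illustrated with a generic temperature-scaled KL loss between teacher and student image-text similarity logits. This sketch is an assumption about the general technique only: it does not reproduce how MCAD builds its modified teacher logits from single-stream and dual-stream features, and the function names and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one row of similarity logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def logit_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over rows of image-to-text similarity logits.

    Each row holds the similarities of one image to every text in the batch.
    """
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


teacher = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
aligned_student = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
misaligned_student = [[0.0, 1.0, 4.0], [3.0, 0.5, 0.2]]
loss_aligned = logit_distillation_loss(teacher, aligned_student)
loss_misaligned = logit_distillation_loss(teacher, misaligned_student)
```

The loss is zero when the student reproduces the teacher's similarity distribution and grows as the two rankings diverge. Because only the training loss changes, the student keeps the same dual-stream inference cost.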
Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization
Many robotics applications require precise pose estimates despite operating
in large and changing environments. This can be addressed by visual
localization, using a pre-computed 3D model of the surroundings. The pose
estimation then amounts to finding correspondences between 2D keypoints in a
query image and 3D points in the model using local descriptors. However,
computational power is often limited on robotic platforms, making this task
challenging in large-scale environments. Binary feature descriptors
significantly speed up this 2D-3D matching, and have become popular in the
robotics community, but also strongly impair the robustness to perceptual
aliasing and changes in viewpoint, illumination and scene structure. In this
work, we propose to leverage recent advances in deep learning to perform an
efficient hierarchical localization. We first localize at the map level using
learned image-wide global descriptors, and subsequently estimate a precise pose
from 2D-3D matches computed in the candidate places only. This restricts the
local search and thus allows us to efficiently exploit powerful non-binary
descriptors usually dismissed on resource-constrained devices. Our approach
results in state-of-the-art localization performance while running in real-time
on a popular mobile platform, enabling new prospects for robotics research.
Comment: CoRL 2018 Camera-ready (fix typos and update citations)
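The coarse-to-fine scheme in this abstract can be sketched as two steps: retrieve candidate places with global descriptors, then match local descriptors only within those candidates. The code below is a simplified illustration under stated assumptions: descriptors are short lists of floats, the map is a dictionary keyed by place id, the names `retrieve_places` and `match_local` are hypothetical, and the final pose estimation from the 2D-3D matches is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_places(query_global, place_globals, top_k=2):
    """Coarse step: return the top_k place ids most similar to the query image."""
    ranked = sorted(place_globals,
                    key=lambda pid: cosine(query_global, place_globals[pid]),
                    reverse=True)
    return ranked[:top_k]


def match_local(query_locals, place_locals, candidates, max_dist=0.5):
    """Fine step: match each query keypoint to its nearest 3D point descriptor,
    searching only the candidate places."""
    matches = []
    for q_idx, q_desc in enumerate(query_locals):
        best = None
        for pid in candidates:
            for point_id, p_desc in place_locals[pid]:
                d = math.dist(q_desc, p_desc)
                if d <= max_dist and (best is None or d < best[0]):
                    best = (d, point_id)
        if best is not None:
            matches.append((q_idx, best[1]))
    return matches


place_globals = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.9, 0.1]}
place_locals = {
    "A": [(10, [0.0, 0.0]), (11, [1.0, 1.0])],
    "B": [(20, [0.5, 0.5])],
    "C": [(30, [2.0, 2.0])],
}
candidates = retrieve_places([1.0, 0.05], place_globals, top_k=2)
matches = match_local([[0.1, 0.0], [0.5, 0.5], [2.0, 2.1]], place_locals, candidates)
```

Place "B" is discarded by the coarse step, so its 3D point is never compared against the query keypoints. This restriction of the local search is what makes the more expensive non-binary descriptors affordable.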