    Conditional Attention for Content-based Image Retrieval

    Deep learning-based feature extraction combined with a visual attention mechanism is shown to provide good results in content-based image retrieval (CBIR). Ideally, CBIR should rely on regions which contain the objects of interest that appear in the query image. However, most existing attention models simply predict the most likely region of interest based on knowledge learned from the training dataset, regardless of the content of the query image. As a result, they may attend to context outside the object of interest, especially when there are multiple potential objects of interest in a given image. In this paper, we propose a conditional attention model which is sensitive to the content of the input query image and can generate more accurate attention maps. A key-point detection and description based method is proposed for training-data generation, so our model does not require any additional attention labels for training. The proposed attention model enables the spatial pooling feature extraction method (generalized mean pooling) to improve the image feature representation and leads to better image retrieval performance. The proposed framework is tested on a series of databases, where it is shown to perform well in challenging situations.
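    As a rough illustration of the spatial pooling mentioned in this abstract: generalized mean (GeM) pooling raises activations to a power p, averages over the spatial grid, and takes the p-th root, interpolating between average pooling (p = 1) and max pooling (large p). The sketch below is a minimal PyTorch version; the function name and default p are placeholders, not the paper's code.

```python
import torch

def gem_pool(feat_map: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    """Generalized mean (GeM) pooling over the spatial dimensions.

    feat_map: CNN features of shape (B, C, H, W).
    Returns a (B, C) global descriptor per image.
    """
    return feat_map.clamp(min=eps).pow(p).mean(dim=(-2, -1)).pow(1.0 / p)

# Example: pool a batch of 2 feature maps with 512 channels on a 7x7 grid.
descriptors = gem_pool(torch.rand(2, 512, 7, 7))  # shape (2, 512)
```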

    From Known to the Unknown: Transferring Knowledge to Answer Questions about Novel Visual and Semantic Concepts

    © 2020 Elsevier B.V. Current Visual Question Answering (VQA) systems can answer intelligent questions about ‘known’ visual content. However, their performance drops significantly when questions about visually and linguistically ‘unknown’ concepts are presented during inference (the ‘Open-world’ scenario). A practical VQA system should be able to deal with novel concepts in real-world settings. To address this problem, we propose an exemplar-based approach that transfers learning (i.e., knowledge) from previously ‘known’ concepts to answer questions about the ‘unknown’. We learn a highly discriminative joint embedding (JE) space, where visual and semantic features are fused to give a unified representation. Once novel concepts are presented to the model, it looks for the closest match from an exemplar set in the JE space. This auxiliary information is used alongside the given Image-Question pair to refine visual attention in a hierarchical fashion. Our novel attention model is based on a dual-attention mechanism that combines the complementary effect of spatial and channel attention. Since handling the high-dimensional exemplars on large datasets can be a significant challenge, we introduce an efficient matching scheme that uses a compact feature description for search and retrieval. To evaluate our model, we propose a new dataset for VQA, separating unknown visual and semantic concepts from the training set. Our approach shows significant improvements over state-of-the-art VQA models on the proposed Open-World VQA dataset and other standard VQA datasets.
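    The exemplar lookup described in this abstract amounts to a nearest-neighbour search in the joint embedding space. The sketch below shows a generic cosine-similarity match against a matrix of exemplar embeddings; the names, shapes, and the assumption that the fused image-question embedding is a single vector are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nearest_exemplars(query_emb: torch.Tensor, exemplar_embs: torch.Tensor, k: int = 5):
    """Return the indices of the k exemplars closest to the query in the joint space.

    query_emb:     (D,) fused image-question embedding.
    exemplar_embs: (N, D) embeddings of known-concept exemplars.
    """
    q = F.normalize(query_emb, dim=0)
    e = F.normalize(exemplar_embs, dim=1)
    sims = e @ q                    # cosine similarity to every exemplar
    return sims.topk(k).indices     # indices of the best matches

# Example with random placeholders standing in for learned embeddings.
idx = nearest_exemplars(torch.rand(256), torch.rand(1000, 256), k=5)
```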

    Evaluating Text-to-Image Matching using Binary Image Selection (BISON)

    Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision. Tasks such as text-based image retrieval and image captioning were designed to test this ability, but come with evaluation measures that have high variance or are difficult to interpret. We study an alternative task for systems that match text and images: given a text query, the system is asked to select the image that best matches the query from a pair of semantically similar images. The system's accuracy on this Binary Image SelectiON (BISON) task is interpretable, eliminates the reliability problems of retrieval evaluations, and focuses on the system's ability to understand fine-grained visual structure. We gather a BISON dataset that complements the COCO dataset and use it to evaluate modern text-based image retrieval and image captioning systems. Our results provide novel insights into the performance of these systems. The COCO-BISON dataset and corresponding evaluation code are publicly available from http://hexianghu.com/bison/.
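    Evaluation on a task like BISON reduces to a simple comparison: for each item, the system scores both candidate images against the caption and is counted correct when the ground-truth image scores higher. A minimal sketch under that assumption follows; the item format and scoring function are placeholders, not the released evaluation code.

```python
def binary_selection_accuracy(items, score):
    """items: iterable of (caption, image_a, image_b, truth) with truth in {'a', 'b'}.
    score(caption, image) -> float, higher meaning a better text-image match."""
    correct = 0
    total = 0
    for caption, image_a, image_b, truth in items:
        pick = "a" if score(caption, image_a) >= score(caption, image_b) else "b"
        correct += (pick == truth)
        total += 1
    return correct / max(total, 1)
```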

    Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch

    In this work we introduce a cross-modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus attention on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs best in both single- and multiple-object image retrieval on standard datasets. Comment: Accepted at ICPR 201
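    A common way to learn such a shared embedding space (not necessarily the exact objective used in this paper) is a triplet ranking loss that pulls a query embedding toward its matching image and away from a non-matching one. The PyTorch sketch below uses placeholder linear encoders and random features purely to show the training step; the real model would use a CNN for images/sketches and a text encoder.

```python
import torch
import torch.nn as nn

D = 128
query_encoder = nn.Linear(300, D)    # placeholder: encodes a text or sketch query feature
image_encoder = nn.Linear(2048, D)   # placeholder: encodes an image feature

triplet = nn.TripletMarginLoss(margin=0.2)

query_feat = torch.rand(32, 300)     # batch of query features (assumed shapes)
pos_img = torch.rand(32, 2048)       # images that match the queries
neg_img = torch.rand(32, 2048)       # images that do not match

loss = triplet(query_encoder(query_feat),
               image_encoder(pos_img),
               image_encoder(neg_img))
loss.backward()                      # gradients flow into both encoders
```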