    Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch

    In this work we introduce a cross-modal image retrieval system that accepts both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus on the different objects in the image, allowing retrieval with multiple objects in the query. Experiments show that the proposed method performs best in both single- and multiple-object image retrieval on standard datasets. Comment: Accepted at ICPR 2018
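    The abstract does not include code; as a rough illustration of the retrieval step it describes, the sketch below ranks gallery images by cosine similarity to a query embedding, assuming the text, sketch, and image encoders already map into one shared space. All names, dimensions, and the numpy-only setup are illustrative assumptions, not the paper's implementation.

        import numpy as np

        def l2_normalize(x, axis=-1, eps=1e-8):
            # Unit-normalize so that a dot product equals cosine similarity.
            return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

        def retrieve(query_embedding, image_embeddings, top_k=5):
            # query_embedding: output of a (hypothetical) text or sketch encoder.
            # image_embeddings: gallery images encoded into the same shared space.
            q = l2_normalize(query_embedding)
            g = l2_normalize(image_embeddings)
            scores = g @ q                       # cosine similarity to every image
            order = np.argsort(-scores)[:top_k]  # indices of the best matches
            return order, scores[order]

        # Toy usage with random stand-ins for a 256-d shared embedding space.
        gallery = np.random.randn(1000, 256)
        query = np.random.randn(256)
        print(retrieve(query, gallery))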

    Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval

    Free-hand sketch-based image retrieval (SBIR) is a specific cross-view retrieval task, in which queries are abstract and ambiguous sketches while the retrieval database is formed of natural images. Work in this area mainly focuses on extracting representative and shared features for sketches and natural images. However, such features can neither cope well with the geometric distortion between sketches and images nor scale to large SBIR databases, because of the heavy continuous-valued distance computation. In this paper, we speed up SBIR by introducing a novel binary coding method, named Deep Sketch Hashing (DSH), in which a semi-heterogeneous deep architecture is proposed and incorporated into an end-to-end binary coding framework. Specifically, three convolutional neural networks are used to encode free-hand sketches, natural images and, especially, the auxiliary sketch-tokens that are adopted as bridges to mitigate the sketch-image geometric distortion. The learned DSH codes can effectively capture the cross-view similarities as well as the intrinsic semantic correlations between different categories. To the best of our knowledge, DSH is the first hashing work specifically designed for category-level SBIR with an end-to-end deep architecture. The proposed DSH is comprehensively evaluated on two large-scale datasets, TU-Berlin Extension and Sketchy, and the experiments consistently show DSH's superior SBIR accuracy over several state-of-the-art methods, together with a significantly reduced retrieval time and memory footprint. Comment: This paper will appear as a spotlight paper in CVPR 2017
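    The efficiency claim above rests on binary codes replacing continuous-valued distances with Hamming distances. The fragment below is a minimal sketch of that retrieval stage only, assuming a trained network has already produced real-valued outputs that are sign-thresholded into codes; the random data and function names are placeholders, not the DSH model.

        import numpy as np

        def binarize(outputs):
            # Sign-threshold real-valued network outputs into 0/1 hash bits
            # (a stand-in for the codes a trained DSH encoder would emit).
            return (outputs > 0).astype(np.uint8)

        def hamming_rank(query_code, db_codes):
            # Rank database images by Hamming distance to the sketch query code.
            dists = np.count_nonzero(db_codes != query_code, axis=1)
            return np.argsort(dists), dists

        # Toy usage: 64-bit codes for 10,000 "images" and one "sketch" query.
        db = binarize(np.random.randn(10000, 64))
        q = binarize(np.random.randn(64))
        order, dists = hamming_rank(q, db)
        print(order[:5], dists[order[:5]])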

    AMC: Attention guided Multi-modal Correlation Learning for Image Search

    Given a user's query, traditional image search systems rank images according to their relevance to a single modality (e.g., image content or surrounding text). Nowadays, an increasing number of images on the Internet come with associated metadata in rich modalities (e.g., titles, keywords, tags), which can be exploited for a better similarity measure with queries. In this paper, we leverage visual and textual modalities for image search by learning their correlation with the input query. Depending on the intent of the query, an attention mechanism can be introduced to adaptively balance the importance of the different modalities. We propose a novel Attention guided Multi-modal Correlation (AMC) learning method, which consists of a jointly learned hierarchy of intra- and inter-attention networks. Conditioned on the query's intent, the intra-attention networks (i.e., a visual intra-attention network and a language intra-attention network) attend to informative parts within each modality, while a multi-modal inter-attention network promotes the importance of the most query-relevant modalities. In experiments, we evaluate AMC models on the search logs of two real-world image search engines and show a significant boost in the ranking of user-clicked images in the search results. Additionally, we extend AMC models to the caption ranking task on the COCO dataset and achieve competitive results compared with recent state-of-the-art methods. Comment: CVPR 2017
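    As a rough sketch of the inter-attention idea described above (not the paper's architecture), the snippet below fuses per-modality similarity scores with query-conditioned softmax weights. The shared 128-d space, the projection Wq, and all function names are hypothetical.

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def query_weighted_score(q, visual_feat, text_feat, Wq):
            # Cosine similarity of the query to each modality's feature,
            # fused with attention weights derived from the query itself.
            def cos(a, b):
                return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            sims = np.array([cos(q, visual_feat), cos(q, text_feat)])
            weights = softmax(Wq @ q)  # one logit per modality, softmax-normalized
            return float(weights @ sims)

        # Toy usage: 128-d embeddings and a hypothetical (2 x 128) projection Wq.
        rng = np.random.default_rng(0)
        q, v, t = rng.standard_normal((3, 128))
        Wq = rng.standard_normal((2, 128))
        print(query_weighted_score(q, v, t, Wq))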

    Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

    Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge. Comment: Accepted to EMNLP 2016
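    Compact bilinear pooling itself is well defined: each vector is projected with a Count Sketch, and the two sketches are combined by circular convolution, computed as an element-wise product in the FFT domain. The numpy sketch below shows that combination step only; the output dimension d and the per-call random hashes are illustrative (in practice the hashes are sampled once and kept fixed for training).

        import numpy as np

        def count_sketch(x, h, s, d):
            # Project x to d dimensions via Count Sketch: y[h[i]] += s[i] * x[i].
            y = np.zeros(d)
            np.add.at(y, h, s * x)
            return y

        def mcb_pool(v, t, d=16000, seed=0):
            # Approximate the outer product of visual vector v and text vector t:
            # count-sketch both, then circular-convolve via FFT (element-wise
            # product in the frequency domain), as in compact bilinear pooling.
            rng = np.random.default_rng(seed)
            hv, sv = rng.integers(0, d, v.size), rng.choice([-1, 1], v.size)
            ht, st = rng.integers(0, d, t.size), rng.choice([-1, 1], t.size)
            psi_v = count_sketch(v, hv, sv, d)
            psi_t = count_sketch(t, ht, st, d)
            return np.real(np.fft.ifft(np.fft.fft(psi_v) * np.fft.fft(psi_t)))

        # Toy usage: fuse a 2048-d visual feature with a 300-d question embedding.
        fused = mcb_pool(np.random.randn(2048), np.random.randn(300))
        print(fused.shape)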