Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch
In this work we introduce a cross-modal image retrieval system that accepts
both text and sketch as query modalities. A cross-modal deep network
architecture is formulated to jointly model the sketch and text input
modalities as well as the image output modality, learning a common embedding
between text and images and between sketches and images. In addition, an
attention model is used to selectively focus on the different objects of the
image, allowing for retrieval with multiple objects in the query. Experiments
show that the proposed method performs best on both single- and multi-object
image retrieval on standard datasets.
Comment: Accepted at ICPR 201
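To make the architecture concrete, below is a minimal PyTorch-style sketch of a shared embedding space for text, sketch, and image inputs, with region-level attention and a triplet-style alignment loss. All module names, dimensions, and the loss choice are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: project text/sketch queries and attended image regions
# into one embedding space; a triplet loss aligns matching pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    def __init__(self, txt_dim=300, sk_dim=512, img_dim=2048, emb_dim=256):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # text query branch
        self.sk_proj = nn.Linear(sk_dim, emb_dim)     # sketch query branch
        self.img_proj = nn.Linear(img_dim, emb_dim)   # image branch
        # simple additive attention over image regions, so a query can
        # focus on one object among several in the image
        self.attn = nn.Linear(emb_dim, 1)

    def embed_image(self, regions):
        # regions: (B, N, img_dim) region-level CNN features
        r = self.img_proj(regions)                    # (B, N, emb_dim)
        w = F.softmax(self.attn(torch.tanh(r)), dim=1)
        return F.normalize((w * r).sum(dim=1), dim=-1)

    def embed_query(self, x, modality):
        proj = self.txt_proj if modality == "text" else self.sk_proj
        return F.normalize(proj(x), dim=-1)

def triplet_loss(q, pos, neg, margin=0.2):
    # pull the matching image toward the query, push a non-match away
    return F.relu(margin - (q * pos).sum(-1) + (q * neg).sum(-1)).mean()
```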
Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection
Multi-label image classification is a fundamental but challenging task
towards general visual understanding. Existing methods have found that
region-level cues (e.g., features from RoIs) can facilitate multi-label
classification. Nevertheless, such methods usually require laborious
object-level annotations (i.e., object labels and bounding boxes) for
effective learning of the object-level visual features. In this paper, we
propose a novel and efficient deep framework to boost multi-label
classification by distilling knowledge from a weakly-supervised detection
task without bounding box annotations.
Specifically, given the image-level annotations, (1) we first develop a
weakly-supervised detection (WSD) model, and then (2) construct an end-to-end
multi-label image classification framework augmented by a knowledge
distillation module that guides the classification model by the WSD model
according to the class-level predictions for the whole image and the
object-level visual features for object RoIs. The WSD model is the teacher
model and the classification model is the student model. After this cross-task
knowledge distillation, the performance of the classification model is
significantly improved and the efficiency is maintained since the WSD model can
be safely discarded in the test phase. Extensive experiments on two large-scale
datasets (MS-COCO and NUS-WIDE) show that our framework outperforms
state-of-the-art methods in both accuracy and efficiency.
Comment: accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 tables
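A compact sketch of the distillation objective this describes is given below: a frozen weakly-supervised detection (WSD) teacher guides the classification student both through its soft class-level predictions for the whole image and through object-level RoI features. The specific losses, temperature, and weights here are illustrative assumptions.

```python
# Hedged sketch of cross-task knowledge distillation from a WSD teacher
# to a multi-label classification student; only the student runs at test
# time, so the teacher is discarded after training.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,
                      labels, t=2.0, alpha=0.5, beta=0.1):
    # supervised multi-label loss on image-level annotations
    # (labels: float tensor of 0/1 per class)
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # class-level distillation: match the teacher's softened predictions
    soft = F.mse_loss(torch.sigmoid(student_logits / t),
                      torch.sigmoid(teacher_logits.detach() / t))
    # feature-level distillation: match pooled object-level RoI features
    feat = F.mse_loss(student_feats, teacher_feats.detach())
    return bce + alpha * soft + beta * feat
```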
Multi-layered Semantic Representation Network for Multi-label Image Classification
Multi-label image classification (MLIC) is a fundamental and practical task,
which aims to assign multiple possible labels to an image. In recent years,
many deep convolutional neural network (CNN) based approaches have been
proposed which model label correlations to discover semantics of labels and
learn semantic representations of images. This paper advances this research
direction by improving both the modeling of label correlations and the learning
of semantic representations. On the one hand, besides the local semantics of
each label, we propose to further explore global semantics shared by multiple
labels. On the other hand, existing approaches mainly learn the semantic
representations at the last convolutional layer of a CNN. But it has been noted
that the image representations at different layers of a CNN capture different
levels or scales of features and have different discriminative abilities. We
thus propose to learn semantic representations at multiple convolutional
layers. To this end, this paper designs a Multi-layered Semantic Representation
Network (MSRN) which discovers both local and global semantics of labels
through modeling label correlations and utilizes the label semantics to guide
semantic representation learning at multiple layers through an attention
mechanism. Extensive experiments on four benchmark datasets, including VOC
2007, COCO, NUS-WIDE, and Apparel, show competitive performance of the
proposed MSRN against state-of-the-art models.
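The core mechanism, label semantics guiding representations at several convolutional layers via attention, can be sketched roughly as below. The layer choices, dimensions, and fusion strategy are assumptions for illustration, not the authors' exact design; in practice one module of this kind would be attached to each selected CNN layer.

```python
# Illustrative sketch of label-guided attention at one CNN layer, in the
# spirit of MSRN: label embeddings attend over spatial features to yield
# one semantic representation per label at that layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGuidedLayer(nn.Module):
    def __init__(self, feat_dim, label_dim, emb_dim=256):
        super().__init__()
        self.f = nn.Conv2d(feat_dim, emb_dim, kernel_size=1)
        self.l = nn.Linear(label_dim, emb_dim)

    def forward(self, fmap, label_emb):
        # fmap: (B, C, H, W) layer features; label_emb: (L, label_dim)
        f = self.f(fmap).flatten(2)                   # (B, E, H*W)
        q = self.l(label_emb)                         # (L, E)
        attn = F.softmax(torch.einsum('le,bep->blp', q, f), dim=-1)
        # (B, L, E): one attended semantic representation per label
        return torch.einsum('blp,bep->ble', attn, f)
```

The per-layer outputs would then be fused (e.g., concatenated or averaged) before the final classifier, which is where the multi-layer aspect of the network enters.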
Graph Attention Transformer Network for Multi-Label Image Classification
Multi-label classification aims to recognize multiple objects or attributes
from images. However, it is challenging to learn proper label graphs that
effectively characterize such inter-label correlations or dependencies. Current
methods often use the co-occurrence probability of labels based on the training
set as the adjacency matrix to model this correlation, which is greatly limited
by the dataset and affects the model's generalization ability. In this paper,
we propose a Graph Attention Transformer Network (GATN), a general framework
for multi-label image classification that can effectively mine complex
inter-label relationships. First, we use the cosine similarity based on the
label word embedding as the initial correlation matrix, which can represent
rich semantic information. Subsequently, we design a graph attention
transformer layer that adapts this adjacency matrix to the current domain.
Extensive experiments demonstrate that the proposed method achieves
state-of-the-art performance on three datasets.
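The two ingredients described here, a cosine-similarity adjacency built from label word embeddings and an attention layer that adapts it, can be sketched as follows. The exact parameterization and how the prior and learned attention are combined are assumptions, not the paper's precise formulation.

```python
# Rough sketch of the GATN ingredients: a semantic prior adjacency from
# label word embeddings, refined by a learned attention over label nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_adjacency(label_emb):
    # label_emb: (L, dim) label word embeddings (e.g., GloVe vectors)
    e = F.normalize(label_emb, dim=-1)
    return e @ e.t()                        # (L, L) cosine similarities

class GraphAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (L, dim) label node features; adj: (L, L) prior adjacency
        scores = self.q(x) @ self.k(x).t() / x.size(-1) ** 0.5
        # blend learned attention with the cosine prior before propagating
        a = F.softmax(scores, dim=-1) * adj
        return a @ x                        # updated node features
```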