131 research outputs found
Multi-Label Image Classification via Knowledge Distillation from Weakly-Supervised Detection
Multi-label image classification is a fundamental but challenging task
towards general visual understanding. Existing methods have found that region-level
cues (e.g., features from RoIs) can facilitate multi-label classification.
Nevertheless, such methods usually require laborious object-level annotations
(i.e., object labels and bounding boxes) for effective learning of the
object-level visual features. In this paper, we propose a novel and efficient
deep framework to boost multi-label classification by distilling knowledge from
a weakly-supervised detection task without bounding box annotations.
Specifically, given the image-level annotations, (1) we first develop a
weakly-supervised detection (WSD) model, and then (2) construct an end-to-end
multi-label image classification framework augmented by a knowledge
distillation module that guides the classification model by the WSD model
according to the class-level predictions for the whole image and the
object-level visual features for object RoIs. The WSD model is the teacher
model and the classification model is the student model. After this cross-task
knowledge distillation, the performance of the classification model is
significantly improved and the efficiency is maintained since the WSD model can
be safely discarded in the test phase. Extensive experiments on two large-scale
datasets (MS-COCO and NUS-WIDE) show that our framework outperforms the
state-of-the-art methods in both performance and efficiency.
Comment: accepted by ACM Multimedia 2018, 9 pages, 4 figures, 5 tables
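The class-level part of the distillation described above can be illustrated with a minimal sketch. It assumes a binary cross-entropy term on the ground-truth labels plus a second cross-entropy term that pulls the student's per-class probabilities toward the teacher's. The function names, the `alpha` weight, and the exact form of the loss are assumptions made for illustration, not the paper's formulation, and the RoI feature-level term is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, t, eps=1e-7):
    # Binary cross-entropy for one class; t may be a hard label or a soft target.
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

def distillation_loss(student_logits, teacher_probs, labels, alpha=0.5):
    # alpha (assumed hyperparameter) balances ground truth against the teacher.
    n = len(labels)
    hard = sum(bce(sigmoid(z), y) for z, y in zip(student_logits, labels)) / n
    soft = sum(bce(sigmoid(z), q) for z, q in zip(student_logits, teacher_probs)) / n
    return (1.0 - alpha) * hard + alpha * soft

# Hypothetical two-class example: the teacher (WSD model) is confident about class 0.
teacher = [0.9, 0.1]
labels = [1, 0]
agreeing = distillation_loss([2.0, -2.0], teacher, labels)
disagreeing = distillation_loss([-2.0, 2.0], teacher, labels)
print(agreeing < disagreeing)  # True
```

Because the teacher only contributes targets to the loss, nothing from it is needed at inference time, which is consistent with the abstract's remark that the WSD model can be discarded in the test phase.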
End-to-End Supervised Multilabel Contrastive Learning
Multilabel representation learning is recognized as a challenging problem
that can be associated with either label dependencies between object categories
or data-related issues such as the inherent imbalance of positive/negative
samples. Recent advances address these challenges from model- and data-centric
viewpoints. In model-centric approaches, the label correlation is obtained by an
external model design (e.g., graph CNN) that incorporates an inductive bias for
training. However, these approaches do not provide an end-to-end training
framework, leading to high computational complexity. In contrast, data-centric
approaches consider the realistic nature of the dataset to improve classification
while ignoring the label dependencies. In this paper, we propose a new end-to-end
training framework -- dubbed KMCL (Kernel-based Multilabel Contrastive
Learning) -- to address the shortcomings of both model- and data-centric
designs. The KMCL first transforms the embedded features into a mixture of
exponential kernels in Gaussian RKHS. This is followed by an objective loss
that comprises (a) a reconstruction loss to reconstruct the kernel
representation, (b) an asymmetric classification loss to address the
inherent imbalance problem, and (c) a contrastive loss to capture label
correlation. The KMCL models the uncertainty of the feature encoder while
maintaining a low computational footprint. Extensive experiments are conducted
on image classification tasks to showcase the consistent improvements of KMCL
over the SOTA methods. A PyTorch implementation is provided at
https://github.com/mahdihosseini/KMCL
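To make the role of an asymmetric classification loss concrete, here is a minimal sketch of one widely used form: positives and negatives get different focusing exponents, and negative probabilities are shifted by a margin so that easy negatives contribute nothing. Whether KMCL uses exactly this form is not stated in the abstract; the function name, the default values of `gamma_pos`, `gamma_neg` and `margin`, and the example inputs are assumptions for illustration.

```python
import math

def asymmetric_loss(probs, labels, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    # probs: predicted per-class probabilities; labels: 0/1 ground truth.
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Positive term: focal-style weighting with exponent gamma_pos.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative term: shift the probability by the margin, then down-weight
            # easy negatives with the larger exponent gamma_neg.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total

# A hard negative (p = 0.9) is penalised far more than an easy one (p = 0.2).
print(asymmetric_loss([0.9], [0]) > asymmetric_loss([0.2], [0]))  # True
```

The design intent is that the many negative labels in a multilabel dataset do not swamp the gradient from the few positives, which is the imbalance problem the abstract refers to.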
Multimodal sequential fashion attribute prediction
We address multimodal product attribute prediction of fashion items based on product images and titles. The product attributes, such as type, sub-type, cut or fit, are in a chain format, with previous attribute values constraining the values of the next attributes. We propose to address this task with a sequential prediction model that can learn to capture the dependencies between the different attribute values in the chain. Our experiments on three product datasets show that the sequential model outperforms two non-sequential baselines on all experimental datasets. Compared to other models, the sequential model is also better able to generate sequences of attribute chains not seen during training. We also measure the contributions of both image and textual input and show that while text-only models always outperform image-only models, only the multimodal sequential model combining both image and text improves over the text-only model on all experimental datasets.
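The chain structure, in which each attribute value restricts what may follow, can be illustrated with a small constrained greedy decoder. This is only a sketch: the paper's model learns the dependencies, whereas here a hand-written `allowed_next` table stands in for them, and the function name, attribute values and scores are all hypothetical.

```python
def decode_chain(step_scores, allowed_next, start="<start>"):
    # step_scores: one dict per chain position mapping attribute value -> score.
    # allowed_next: maps a previous value to the set of values allowed after it.
    chain = []
    prev = start
    for scores in step_scores:
        candidates = {v: s for v, s in scores.items() if v in allowed_next.get(prev, ())}
        if not candidates:
            break  # no valid continuation, so the chain ends here
        prev = max(candidates, key=candidates.get)
        chain.append(prev)
    return chain

# Hypothetical scores for a type -> cut chain.
step_scores = [
    {"dress": 0.6, "trousers": 0.4},
    {"maxi": 0.5, "skinny": 0.7},
]
allowed_next = {
    "<start>": {"dress", "trousers"},
    "dress": {"maxi"},
    "trousers": {"skinny"},
}
print(decode_chain(step_scores, allowed_next))  # ['dress', 'maxi']
```

Although "skinny" has the higher raw score at the second step, it is excluded because it cannot follow "dress", which is the kind of dependency a non-sequential classifier predicting each attribute independently would miss.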
Graph Networks for Multi-Label Image Recognition
Providing machines with a robust visualization of multiple objects in a scene has a myriad of applications in the physical world. This research addresses the task of multi-label image recognition using a deep learning approach. For most multi-label image recognition datasets, there are multiple objects within a single image and a single label can be seen many times throughout the dataset. Therefore, it is not efficient to classify each object in isolation; rather, it is important to infer the inter-dependencies between the labels. To extract a latent representation of the pixels from an image, this work uses a convolutional network approach, evaluating three different image feature extraction networks. In order to learn the label inter-dependencies, this work proposes a graph convolution network approach, in contrast to previous approaches such as probabilistic graph models or recurrent neural networks. In the graph neural network approach, the image labels are first encoded into word embeddings. These serve as nodes on a graph. The correlations between these nodes are learned using graph neural networks. We investigate how to create the adjacency matrix without manual calculation of the label correlations in the respective datasets. This proposed approach is evaluated on the widely-used PASCAL VOC, MSCOCO, and NUS-WIDE multi-label image recognition datasets. The main evaluation metrics are mean average precision and overall F1 score, which show that the learned adjacency matrix method for labels, along with the addition of visual attention for image features, is able to achieve similar performance to manually calculating the label adjacency matrix.
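The idea of treating label embeddings as graph nodes and propagating information along an adjacency matrix can be sketched as a single graph convolution step, H' = ReLU(Â H W). The sketch below assumes a row-normalised adjacency with self-loops and uses plain Python lists; the helper names, the normalisation choice and the toy matrices are assumptions, and the abstract's learned adjacency matrix is replaced here by a fixed one.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    # Add self-loops, then divide each row by its sum (assumes non-negative entries).
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def gcn_layer(adj, features, weight):
    # One propagation step: average neighbour features, project, apply ReLU.
    propagated = matmul(normalize_adjacency(adj), features)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]

# Two labels that co-occur, with one-hot stand-ins for their word embeddings.
adj = [[0, 1], [1, 0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(gcn_layer(adj, embeddings, identity))  # [[0.5, 0.5], [0.5, 0.5]]
```

After one step each label's representation is a mixture of its own embedding and its neighbour's, which is how co-occurrence information can reach the per-label classifiers.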
- …