Binary Representation Learning for Large Scale Visual Data
The exponential growth of modern media has created large amounts of multimodal and multidomain visual data, which usually reside in a high-dimensional space, and it is crucial to provide not only effective but also efficient understanding of these data. In this dissertation, we focus on learning binary representations of visual datasets, whose primary use has been as hash codes for retrieval. At the same time, a binary representation serves as a multifunctional feature that can be used for various computer vision tasks. Essentially, this is achieved by discriminative learning that preserves the supervision information in the binary representation. By using deep networks such as convolutional neural networks (CNNs) as backbones, together with an effective binary embedding algorithm that is seamlessly integrated into the learning process, we achieve state-of-the-art performance in several settings. First, we study supervised binary representation learning that uses label information directly instead of pairwise similarity or triplet losses. By considering images and their associated textual information, we then study cross-modal representation learning: CNNs are used for both image and text embedding, and we are able to perform retrieval and prediction across these modalities. Furthermore, by utilizing unlabeled images from a different domain, we propose to use adversarial learning to connect the domains. Finally, we also consider progressive learning for more efficient training, and instance-level representation learning for finer-grained understanding. This dissertation demonstrates that binary representations are versatile and powerful across a variety of circumstances and tasks.
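The retrieval use of binary codes mentioned above can be pictured with a minimal sketch: random vectors stand in for CNN features, and thresholding at zero stands in for the dissertation's learned binary embedding, so this illustrates only the hash-and-rank idea, not the actual method.

```python
import numpy as np

def binarize(embeddings):
    """Map real-valued embeddings to {0, 1} binary codes by thresholding at zero."""
    return (embeddings > 0).astype(np.uint8)

def hamming_distances(query_code, database_codes):
    """Hamming distance from one binary code to every code in the database."""
    return np.count_nonzero(database_codes != query_code, axis=1)

rng = np.random.default_rng(0)
db_emb = rng.standard_normal((1000, 64))             # stand-in for CNN features
q_emb = db_emb[42] + 0.1 * rng.standard_normal(64)   # noisy copy of item 42

db_codes = binarize(db_emb)
q_code = binarize(q_emb)

# The query's binary code differs from item 42's in only a few bits,
# while unrelated codes differ in roughly half of the 64 bits.
nearest = int(np.argmin(hamming_distances(q_code, db_codes)))
```

Because the codes are short bit vectors, ranking by Hamming distance is far cheaper than comparing real-valued features, which is the efficiency argument the abstract makes.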
Multimodal sequential fashion attribute prediction
We address multimodal attribute prediction of fashion items based on product images and titles. The product attributes, such as type, sub-type, cut, or fit, form a chain, with previous attribute values constraining the values of subsequent attributes. We propose to address this task with a sequential prediction model that learns to capture the dependencies between the different attribute values in the chain. Our experiments on three product datasets show that the sequential model outperforms two non-sequential baselines on all experimental datasets. Compared to the other models, the sequential model is also better able to generate attribute chains not seen during training. We also measure the contributions of the image and textual inputs and show that while text-only models always outperform image-only models, only the multimodal sequential model combining both image and text improves over the text-only model on all experimental datasets.
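The chained constraint described above can be illustrated with a toy greedy decoder; the attribute names, values, and probabilities below are invented for illustration and are not from the paper's datasets or model.

```python
# Toy conditional tables for a two-step attribute chain: type -> sub-type.
# A real model would produce these distributions from image and title encoders.
p_type = {"dress": 0.6, "shirt": 0.4}
p_subtype_given_type = {
    "dress": {"maxi": 0.7, "mini": 0.3},
    "shirt": {"tee": 0.8, "polo": 0.2},
}

def predict_chain():
    """Greedily decode the attribute chain; the predicted type constrains
    which sub-type values are even considered at the next step."""
    t = max(p_type, key=p_type.get)
    sub_dist = p_subtype_given_type[t]   # previous value gates the next step
    s = max(sub_dist, key=sub_dist.get)
    return [t, s]

chain = predict_chain()   # ["dress", "maxi"] under these toy tables
```

The point of the sketch is that "shirt" sub-types such as "tee" are never candidates once "dress" is chosen, which is exactly the kind of dependency a non-sequential model cannot enforce.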
The Emerging Trends of Multi-Label Learning
Exabytes of data are generated daily by humans, leading to a growing need
for new efforts to address the grand challenges that big data poses for
multi-label learning. For example, extreme multi-label classification is an
active and rapidly growing research area that deals with classification tasks
involving an extremely large number of classes or labels, and utilizing
massive data with limited supervision to build multi-label classification
models has become valuable for practical applications. Beyond these, there
are tremendous efforts to harness the strong learning capability of deep
learning to better capture label dependencies in multi-label learning, which
is key for deep learning to address real-world classification tasks. However,
there has been a lack of systematic studies that focus explicitly on
analyzing the emerging trends and new challenges of multi-label learning in
the era of big data. It is imperative to provide a comprehensive survey that
fulfills this mission and delineates future research directions and new applications.
Comment: Accepted to TPAMI 202
PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification
Multi-label image classification is a prediction task that aims to identify
more than one label from a given image. This paper considers the semantic
consistency of the latent space between the visual patch and linguistic label
domains and introduces the conditional transport (CT) theory to bridge the
acknowledged gap. While recent cross-modal attention-based studies have
attempted to align these two representations and achieved impressive
performance, they require carefully designed alignment modules and extra
complex operations in the attention computation. We find that by formulating
the multi-label classification as a CT problem, we can exploit the interactions
between the image and label efficiently by minimizing the bidirectional CT
cost. Specifically, after feeding the images and textual labels into the
modality-specific encoders, we view each image as a mixture of patch embeddings
and a mixture of label embeddings, which capture the local region features and
the class prototypes, respectively. CT is then employed to learn and align
those two semantic sets by defining the forward and backward navigators.
Importantly, the navigators defined in the CT distance model the similarities
between patches and labels, which provides an interpretable tool to visualize
the learned prototypes. Extensive experiments on three public image benchmarks
show that the proposed model consistently outperforms the previous methods. Our
code is available at https://github.com/keepgoingjkg/PatchCT.
Comment: accepted by ICCV2
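The bidirectional CT cost described above can be sketched numerically. This is a simplified stand-in, not the authors' implementation: it assumes squared Euclidean distance as the point-to-point cost and plain softmax distributions over negative cost as the forward and backward navigators, whereas the paper's navigators are learned.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_ct_cost(patches, labels):
    """Bidirectional conditional-transport cost between patch embeddings
    (m, d) and label embeddings (n, d), with softmax navigators over
    negative squared-Euclidean cost (an assumed, simplified choice)."""
    cost = ((patches[:, None, :] - labels[None, :, :]) ** 2).sum(-1)   # (m, n)
    forward = (softmax(-cost, axis=1) * cost).sum(axis=1).mean()    # patch -> label
    backward = (softmax(-cost, axis=0) * cost).sum(axis=0).mean()   # label -> patch
    return forward + backward

x = np.arange(12.0).reshape(4, 3)               # four toy embeddings, d = 3
matched = bidirectional_ct_cost(x, x)           # coinciding sets: near-zero cost
mismatched = bidirectional_ct_cost(x, x + 5.0)  # shifted labels: higher cost
```

Minimizing this quantity pulls each patch toward its most compatible label prototype and vice versa, which is the alignment effect the abstract attributes to the forward and backward navigators.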