The iMaterialist Fashion Attribute Dataset
Large-scale image databases such as ImageNet have significantly advanced
image classification and other visual recognition tasks. However, most of
these datasets are constructed only for single-label, coarse object-level
classification. Real-world applications often need multiple labels and
fine-grained categories, yet very few such datasets are publicly available,
especially at large scale and high quality. In this work, we contribute
to the community a new dataset called iMaterialist Fashion Attribute
(iFashion-Attribute) to address this problem in the fashion domain. The dataset
is constructed from over one million fashion images with a label space of
228 fine-grained attributes organized into 8 groups. Each image is
annotated by experts with multiple high-quality fashion attributes. The result
is the first known million-scale multi-label, fine-grained image dataset. We
conduct extensive experiments and provide baseline results with modern deep
Convolutional Neural Networks (CNNs). Additionally, we demonstrate that models
pre-trained on iFashion-Attribute achieve superior transfer-learning
performance on fashion-related tasks compared with pre-training on ImageNet
or other fashion datasets. Data is available at:
https://github.com/visipedia/imat_fashion_com
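The multi-label setup described above differs from the usual single-label
softmax classification: each attribute gets its own independent sigmoid and
binary cross-entropy term. A minimal sketch of that loss and the per-attribute
thresholding, using a toy batch with made-up logits (the real dataset has 228
attribute slots; 5 are used here for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_label_bce(logits, targets, eps=1e-12):
    """Binary cross-entropy averaged over all attribute slots, treating each
    fine-grained attribute as an independent binary label."""
    p = sigmoid(logits)
    return float(-np.mean(targets * np.log(p + eps)
                          + (1 - targets) * np.log(1 - p + eps)))

# toy batch: 2 images, 5 hypothetical attribute slots
logits = np.array([[ 2.0, -1.0,  0.5, -3.0, 1.5],
                   [-0.5,  2.5, -2.0,  0.0, 1.0]])
targets = np.array([[1, 0, 1, 0, 1],
                    [0, 1, 0, 0, 1]], dtype=float)
loss = multi_label_bce(logits, targets)
# unlike softmax classification, each attribute is thresholded independently,
# so an image can carry any number of predicted attributes
preds = (sigmoid(logits) > 0.5).astype(int)
```

In practice this head would sit on top of the CNN backbones used for the
baselines, but the loss itself is independent of the backbone choice.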
Rethinking Few-Shot Object Detection on a Multi-Domain Benchmark
Most existing works on few-shot object detection (FSOD) focus on a setting
where both pre-training and few-shot learning datasets are from a similar
domain. However, few-shot algorithms are important in many domains; hence,
evaluation needs to reflect this breadth of applications. We propose a Multi-dOmain
Few-Shot Object Detection (MoFSOD) benchmark consisting of 10 datasets from a
wide range of domains to evaluate FSOD algorithms. We comprehensively analyze
the impacts of freezing layers, different architectures, and different
pre-training datasets on FSOD performance. Our empirical results show several
key factors that have not been explored in previous works: 1) contrary to
previous belief, on a multi-domain benchmark, fine-tuning (FT) is a strong
baseline for FSOD, performing on par or better than the state-of-the-art (SOTA)
algorithms; 2) utilizing FT as the baseline allows us to explore multiple
architectures, and we find that they have a significant impact on downstream
few-shot tasks, even with similar pre-training performance; 3) by decoupling
pre-training and few-shot learning, MoFSOD allows us to explore the impact of
different pre-training datasets, and the right choice can significantly boost
downstream performance. Based on these findings, we list
possible avenues of investigation for improving FSOD performance and propose
two simple modifications to existing algorithms that lead to SOTA performance
on the MoFSOD benchmark. The code is available at
https://github.com/amazon-research/few-shot-object-detection-benchmark.
Comment: Accepted at ECCV 202
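The freezing-versus-fine-tuning question the benchmark analyzes can be
sketched with a toy two-layer model standing in for a real detector: the
pre-trained "backbone" weights are kept frozen and only the task "head" is
updated on the few-shot data. All shapes, data, and the learning rate below
are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
backbone_W = rng.normal(size=(8, 4))    # pre-trained weights, frozen
head_W = 0.1 * rng.normal(size=(4, 3))  # few-shot head, fine-tuned

def forward(x):
    feats = np.maximum(x @ backbone_W, 0.0)  # frozen ReLU feature extractor
    return feats @ head_W                    # trainable few-shot head

x = rng.normal(size=(5, 8))   # toy "few-shot" support examples
y = rng.normal(size=(5, 3))   # toy regression targets

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

frozen_before = backbone_W.copy()
loss_before = mse(forward(x), y)
lr = 0.01
for _ in range(300):
    feats = np.maximum(x @ backbone_W, 0.0)
    # MSE gradient with respect to the head only; the backbone is simply
    # excluded from the update, which is all "freezing" amounts to
    grad_head = 2.0 * feats.T @ (feats @ head_W - y) / len(x)
    head_W -= lr * grad_head
loss_after = mse(forward(x), y)
```

Full fine-tuning (FT), the strong baseline in the paper, corresponds to also
updating `backbone_W`; the benchmark's point is that which layers to unfreeze
is an empirical choice that depends on the domain gap.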
OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
The inexorable growth of online shopping and e-commerce demands scalable and
robust machine learning-based solutions to accommodate customer requirements.
In the context of automatic tagging, classification, and multimodal retrieval,
prior works either defined supervised learning approaches with limited
generalization or more reusable CLIP-based techniques that were, however,
trained on closed-source data. In this work, we propose OpenFashionCLIP, a
vision-and-language contrastive learning method that adopts only open-source
fashion data stemming from diverse domains and characterized by varying
degrees of specificity. Our approach is extensively validated across several
tasks and benchmarks, and experimental results highlight a significant
out-of-domain generalization capability and consistent improvements over
state-of-the-art methods in terms of both accuracy and recall. Source code and
trained models are publicly available at:
https://github.com/aimagelab/open-fashion-clip
Comment: International Conference on Image Analysis and Processing (ICIAP) 202
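The CLIP-style contrastive objective underlying this line of work can be
sketched in a few lines: embed images and texts, L2-normalize, and apply a
symmetric cross-entropy over the similarity matrix so that matched pairs (the
diagonal) score higher than mismatched ones. The embeddings and temperature
below are toy assumptions, not the model's actual encoders:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings;
    the i-th image is matched with the i-th text (diagonal of the logits)."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # scaled cosine similarities
    idx = np.arange(len(logits))

    def xent(lg):  # row-wise softmax cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(img, img)                         # correct pairing
mismatched = clip_contrastive_loss(img, np.roll(img, 1, axis=0))  # wrong pairing
```

Correctly paired embeddings yield a lower loss than shuffled ones, which is
the signal that drives the encoders toward a shared image-text space.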
Task2Vec: Task Embedding for Meta-Learning
We introduce a method to generate vectorial representations of visual classification tasks that can be used to reason about the nature of those tasks and their relations. Given a dataset with ground-truth labels and a loss function, we process images through a "probe network" and compute an embedding based on estimates of the Fisher information matrix associated with the probe network parameters. This provides a fixed-dimensional embedding of the task that is independent of details such as the number of classes and requires no understanding of the class label semantics. We demonstrate that this embedding is capable of predicting task similarities that match our intuition about semantic and taxonomic relations between different visual tasks. We also demonstrate the practical value of this framework for the meta-task of selecting a pre-trained feature extractor for a novel task: we present a simple meta-learning framework for learning a metric on embeddings that predicts which feature extractors will perform well on which task. Selecting a feature extractor with task embedding yields performance close to the best available feature extractor, with substantially less computational effort than exhaustively training and evaluating all available models.
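The core idea, a fixed-dimensional task embedding from the diagonal of the
Fisher information of a fixed probe model, can be sketched with logistic
regression standing in for the probe network. The data, probe weights, and
the two toy tasks below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def task_embedding(X, y, probe_w):
    """Diagonal Fisher-information embedding of a binary task (X, y) as seen
    through a fixed probe model: the mean squared per-example gradient of the
    log-likelihood, one entry per probe parameter. The embedding dimension is
    set by the probe, not by the task's label space."""
    p = sigmoid(X @ probe_w)
    grads = (y - p)[:, None] * X        # per-example log-likelihood gradients
    return np.mean(grads ** 2, axis=0)  # Fisher diagonal estimate

rng = np.random.default_rng(0)
probe_w = rng.normal(size=5)            # shared, fixed probe parameters
X = rng.normal(size=(500, 5))
task_a = (X[:, 0] > 0).astype(float)    # task A: label driven by feature 0
task_b = (X[:, 1] > 0).astype(float)    # task B: label driven by feature 1
emb_a = task_embedding(X, task_a, probe_w)
emb_b = task_embedding(X, task_b, probe_w)
```

Because the probe is shared, embeddings of different tasks live in the same
space and can be compared directly, which is what makes distances between them
meaningful for model selection.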
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Many visual recognition models are evaluated only on their classification
accuracy, a metric for which they obtain strong performance. In this paper, we
investigate whether computer vision models can also provide correct rationales
for their predictions. We propose a "doubly right" object recognition
benchmark, where the metric requires the model to simultaneously produce both
the right labels and the right rationales. We find that state-of-the-art
visual models, such as CLIP, often provide incorrect rationales for their
categorical predictions. However, by transferring the rationales from language
models into visual representations through a tailored dataset, we show that we
can learn a "why prompt," which adapts large visual representations to
produce correct rationales. Visualizations and empirical experiments show that
our prompts significantly improve performance on doubly right object
recognition, in addition to zero-shot transfer to unseen tasks and datasets.
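A minimal sketch of the "doubly right" metric described above: a prediction
scores only when both the label and the rationale are correct, separating
"right for the right reason" from "right for the wrong reason". The class
names and rationale strings here are illustrative assumptions, not items from
the paper's dataset:

```python
def doubly_right_accuracy(preds, ground_truth):
    """Fraction of examples where both the predicted label and the predicted
    rationale match the ground truth (the 'RR' bucket)."""
    rr = sum(
        1
        for p, g in zip(preds, ground_truth)
        if p["label"] == g["label"] and p["rationale"] == g["rationale"]
    )
    return rr / len(ground_truth)

ground_truth = [
    {"label": "zebra", "rationale": "black and white stripes"},
    {"label": "tiger", "rationale": "orange fur with dark stripes"},
]
preds = [
    {"label": "zebra", "rationale": "black and white stripes"},  # right, right
    {"label": "tiger", "rationale": "lives in the jungle"},      # right label, wrong rationale
]
score = doubly_right_accuracy(preds, ground_truth)
```

Under ordinary top-1 accuracy both predictions would count; under the doubly
right metric only the first does, which is exactly the gap the why-prompt
approach aims to close.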