The iMaterialist Fashion Attribute Dataset
Large-scale image databases such as ImageNet have significantly advanced
image classification and other visual recognition tasks. However, most of
these datasets are constructed only for single-label, coarse object-level
classification. Real-world applications often need multiple labels and
fine-grained categories, yet very few such datasets are publicly available,
especially at large scale and high quality. In this work, we contribute
to the community a new dataset called iMaterialist Fashion Attribute
(iFashion-Attribute) to address this problem in the fashion domain. The dataset
is constructed from over one million fashion images with a label space of
228 fine-grained attributes organized into 8 groups. Each image is
annotated by experts with multiple high-quality fashion attributes. The result
is the first known million-scale multi-label, fine-grained image dataset. We
conduct extensive experiments and provide baseline results with modern deep
Convolutional Neural Networks (CNNs). Additionally, we demonstrate that models
pre-trained on iFashion-Attribute achieve superior transfer-learning
performance on fashion-related tasks compared with pre-training on ImageNet
or other fashion datasets. Data is available at:
https://github.com/visipedia/imat_fashion_com
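The multi-label setup described above differs from the usual single-label
softmax classification: each attribute gets its own independent sigmoid and
binary cross-entropy term. A minimal sketch of that loss and the per-attribute
thresholding, using a toy batch with made-up logits (the real dataset has 228
attribute slots; 5 are used here for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_label_bce(logits, targets, eps=1e-12):
    """Binary cross-entropy averaged over all attribute slots, treating each
    fine-grained attribute as an independent binary label."""
    p = sigmoid(logits)
    return float(-np.mean(targets * np.log(p + eps)
                          + (1 - targets) * np.log(1 - p + eps)))

# toy batch: 2 images, 5 hypothetical attribute slots
logits = np.array([[ 2.0, -1.0,  0.5, -3.0, 1.5],
                   [-0.5,  2.5, -2.0,  0.0, 1.0]])
targets = np.array([[1, 0, 1, 0, 1],
                    [0, 1, 0, 0, 1]], dtype=float)
loss = multi_label_bce(logits, targets)
# unlike softmax classification, each attribute is thresholded independently,
# so an image can carry any number of predicted attributes
preds = (sigmoid(logits) > 0.5).astype(int)
```

In practice this head would sit on top of the CNN backbones used for the
baselines, but the loss itself is independent of the backbone choice.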
Rethinking Few-Shot Object Detection on a Multi-Domain Benchmark
Most existing works on few-shot object detection (FSOD) focus on a setting
where both pre-training and few-shot learning datasets are from a similar
domain. However, few-shot algorithms are important in many domains; hence,
evaluation needs to reflect this breadth of applications. We propose a Multi-dOmain
Few-Shot Object Detection (MoFSOD) benchmark consisting of 10 datasets from a
wide range of domains to evaluate FSOD algorithms. We comprehensively analyze
the impacts of freezing layers, different architectures, and different
pre-training datasets on FSOD performance. Our empirical results show several
key factors that have not been explored in previous works: 1) contrary to
previous belief, on a multi-domain benchmark, fine-tuning (FT) is a strong
baseline for FSOD, performing on par or better than the state-of-the-art (SOTA)
algorithms; 2) utilizing FT as the baseline allows us to explore multiple
architectures, and we find that they have a significant impact on downstream
few-shot tasks, even with similar pre-training performance; 3) by decoupling
pre-training and few-shot learning, MoFSOD allows us to explore the impact of
different pre-training datasets, and the right choice can significantly boost
downstream performance. Based on these findings, we list
possible avenues of investigation for improving FSOD performance and propose
two simple modifications to existing algorithms that lead to SOTA performance
on the MoFSOD benchmark. The code is available at
https://github.com/amazon-research/few-shot-object-detection-benchmark.
Comment: Accepted at ECCV 202
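The freezing-versus-fine-tuning question the benchmark analyzes can be
sketched with a toy two-layer model standing in for a real detector: the
pre-trained "backbone" weights are kept frozen and only the task "head" is
updated on the few-shot data. All shapes, data, and the learning rate below
are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
backbone_W = rng.normal(size=(8, 4))    # pre-trained weights, frozen
head_W = 0.1 * rng.normal(size=(4, 3))  # few-shot head, fine-tuned

def forward(x):
    feats = np.maximum(x @ backbone_W, 0.0)  # frozen ReLU feature extractor
    return feats @ head_W                    # trainable few-shot head

x = rng.normal(size=(5, 8))   # toy "few-shot" support examples
y = rng.normal(size=(5, 3))   # toy regression targets

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

frozen_before = backbone_W.copy()
loss_before = mse(forward(x), y)
lr = 0.01
for _ in range(300):
    feats = np.maximum(x @ backbone_W, 0.0)
    # MSE gradient with respect to the head only; the backbone is simply
    # excluded from the update, which is all "freezing" amounts to
    grad_head = 2.0 * feats.T @ (feats @ head_W - y) / len(x)
    head_W -= lr * grad_head
loss_after = mse(forward(x), y)
```

Full fine-tuning (FT), the strong baseline in the paper, corresponds to also
updating `backbone_W`; the benchmark's point is that which layers to unfreeze
is an empirical choice that depends on the domain gap.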
OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
The inexorable growth of online shopping and e-commerce demands scalable and
robust machine learning-based solutions to accommodate customer requirements.
In the context of automatic tagging, classification, and multimodal retrieval,
prior works either defined supervised learning approaches with limited
generalization or more reusable CLIP-based techniques that were, however,
trained on closed-source data. In this work, we propose OpenFashionCLIP, a
vision-and-language contrastive learning method that adopts only open-source
fashion data stemming from diverse domains and characterized by varying
degrees of specificity. Our approach is extensively validated across several
tasks and benchmarks, and experimental results highlight a significant
out-of-domain generalization capability and consistent improvements over
state-of-the-art methods in terms of both accuracy and recall. Source code and
trained models are publicly available at:
https://github.com/aimagelab/open-fashion-clip
Comment: International Conference on Image Analysis and Processing (ICIAP) 202
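The CLIP-style contrastive objective underlying this line of work can be
sketched in a few lines: embed images and texts, L2-normalize, and apply a
symmetric cross-entropy over the similarity matrix so that matched pairs (the
diagonal) score higher than mismatched ones. The embeddings and temperature
below are toy assumptions, not the model's actual encoders:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings;
    the i-th image is matched with the i-th text (diagonal of the logits)."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # scaled cosine similarities
    idx = np.arange(len(logits))

    def xent(lg):  # row-wise softmax cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(img, img)                         # correct pairing
mismatched = clip_contrastive_loss(img, np.roll(img, 1, axis=0))  # wrong pairing
```

Correctly paired embeddings yield a lower loss than shuffled ones, which is
the signal that drives the encoders toward a shared image-text space.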
Task2Vec: Task Embedding for Meta-Learning
We introduce a method to generate vectorial representations of visual classification tasks that can be used to reason about the nature of those tasks and their relations. Given a dataset with ground-truth labels and a loss function, we process images through a "probe network" and compute an embedding based on estimates of the Fisher information matrix associated with the probe network parameters. This provides a fixed-dimensional embedding of the task that is independent of details such as the number of classes and requires no understanding of the class label semantics. We demonstrate that this embedding is capable of predicting task similarities that match our intuition about semantic and taxonomic relations between different visual tasks. We also demonstrate the practical value of this framework for the meta-task of selecting a pre-trained feature extractor for a novel task: we present a simple meta-learning framework for learning a metric on embeddings that predicts which feature extractors will perform well on which task. Selecting a feature extractor with task embedding yields performance close to the best available feature extractor, with substantially less computational effort than exhaustively training and evaluating all available models.
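The core idea, a fixed-dimensional task embedding from the diagonal of the
Fisher information of a fixed probe model, can be sketched with logistic
regression standing in for the probe network. The data, probe weights, and
the two toy tasks below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def task_embedding(X, y, probe_w):
    """Diagonal Fisher-information embedding of a binary task (X, y) as seen
    through a fixed probe model: the mean squared per-example gradient of the
    log-likelihood, one entry per probe parameter. The embedding dimension is
    set by the probe, not by the task's label space."""
    p = sigmoid(X @ probe_w)
    grads = (y - p)[:, None] * X        # per-example log-likelihood gradients
    return np.mean(grads ** 2, axis=0)  # Fisher diagonal estimate

rng = np.random.default_rng(0)
probe_w = rng.normal(size=5)            # shared, fixed probe parameters
X = rng.normal(size=(500, 5))
task_a = (X[:, 0] > 0).astype(float)    # task A: label driven by feature 0
task_b = (X[:, 1] > 0).astype(float)    # task B: label driven by feature 1
emb_a = task_embedding(X, task_a, probe_w)
emb_b = task_embedding(X, task_b, probe_w)
```

Because the probe is shared, embeddings of different tasks live in the same
space and can be compared directly, which is what makes distances between them
meaningful for model selection.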
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Many visual recognition models are evaluated only on their classification
accuracy, a metric for which they obtain strong performance. In this paper, we
investigate whether computer vision models can also provide correct rationales
for their predictions. We propose a "doubly right" object recognition
benchmark, where the metric requires the model to simultaneously produce both
the right labels and the right rationales. We find that state-of-the-art
visual models, such as CLIP, often provide incorrect rationales for their
categorical predictions. However, by transferring the rationales from language
models into visual representations through a tailored dataset, we show that we
can learn a "why prompt," which adapts large visual representations to
produce correct rationales. Visualizations and empirical experiments show that
our prompts significantly improve performance on doubly right object
recognition, in addition to zero-shot transfer to unseen tasks and datasets.
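A minimal sketch of the "doubly right" metric described above: a prediction
scores only when both the label and the rationale are correct, separating
"right for the right reason" from "right for the wrong reason". The class
names and rationale strings here are illustrative assumptions, not items from
the paper's dataset:

```python
def doubly_right_accuracy(preds, ground_truth):
    """Fraction of examples where both the predicted label and the predicted
    rationale match the ground truth (the 'RR' bucket)."""
    rr = sum(
        1
        for p, g in zip(preds, ground_truth)
        if p["label"] == g["label"] and p["rationale"] == g["rationale"]
    )
    return rr / len(ground_truth)

ground_truth = [
    {"label": "zebra", "rationale": "black and white stripes"},
    {"label": "tiger", "rationale": "orange fur with dark stripes"},
]
preds = [
    {"label": "zebra", "rationale": "black and white stripes"},  # right, right
    {"label": "tiger", "rationale": "lives in the jungle"},      # right label, wrong rationale
]
score = doubly_right_accuracy(preds, ground_truth)
```

Under ordinary top-1 accuracy both predictions would count; under the doubly
right metric only the first does, which is exactly the gap the why-prompt
approach aims to close.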