OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data
The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging, classification, and multimodal retrieval, prior works either defined supervised learning approaches with limited generalization or relied on more reusable CLIP-based techniques that were, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in terms of both accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip
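For reference, the training objective behind CLIP-style vision-and-language contrastive learning is the symmetric InfoNCE loss over paired image and text embeddings. The sketch below (PyTorch, with an illustrative temperature value rather than the paper's exact configuration) shows this standard form; OpenFashionCLIP's contribution lies in the open-source fashion data it trains on, not in a new loss.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(len(logits), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Given a batch of fashion image embeddings and their caption embeddings, the loss pulls matching pairs together and pushes mismatched pairs apart.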
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely adopted technique. However, existing methods either exhibit limited performance or suffer from an excessive number of learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants, a training-free APE and a training-required APE-T. We explore the trilateral affinities between the test image, prior cache model, and textual representations, and only enable a lightweight category-residual module to be trained. For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results and respectively outperform the second-best method by +1.59% and +1.99% under 16 shots with ×30 fewer learnable parameters. Code is available at: https://github.com/yangyangyang127/AP
Cross-Domain Fine-Grained Classification: A Review
Fine-grained classification is an interesting but challenging task due to the large amount of data needed to achieve high accuracy. However, the high specificity of the classes makes it difficult to collect a large number of samples. Thus, cross-domain learning is an appealing option, since abundant data exists for some domains, such as web images. In this review, current works on cross-domain fine-grained classification are summarized and potential areas for future work are highlighted. Although first works exist, the variety of methods is still small and interesting cross-domain settings are rarely considered. Thus, the field of cross-domain fine-grained classification leaves large room for future research.
Memories are One-to-Many Mapping Alleviators in Talking Face Generation
Talking face generation aims at generating photo-realistic video portraits of a target person driven by input audio. Due to the one-to-many nature of the mapping from input audio to output video (e.g., one speech content may have multiple feasible visual appearances), learning a deterministic mapping as in previous works introduces ambiguity during training and thus causes inferior visual results. Although this one-to-many mapping can be partially alleviated by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it is still insufficient since the prediction is produced without enough information (e.g., emotions, wrinkles, etc.). In this paper, we propose MemFace, which complements the missing information with an implicit memory and an explicit memory that correspond to the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that our proposed MemFace surpasses all the state-of-the-art results across multiple scenarios consistently and significantly. Project page: https://memoryface.github.i
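One common way to realize an implicit memory of the kind described above is a learnable memory bank queried by cross-attention. The sketch below is only a generic illustration under that assumption; the slot count, feature size, and residual combination are placeholders, not MemFace's actual architecture.

```python
import torch
import torch.nn as nn

class ImplicitMemory(nn.Module):
    """Learnable memory bank queried by cross-attention (illustrative sketch).

    Slot count and feature size are placeholder values, not MemFace's settings.
    """
    def __init__(self, num_slots=512, dim=256, num_heads=4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries):
        # queries: (batch, seq_len, dim), e.g. per-frame audio features.
        memory = self.slots.unsqueeze(0).expand(queries.size(0), -1, -1)
        # Each query attends over the shared memory slots and retrieves a
        # complementary feature that is added back to the original query.
        retrieved, _ = self.attn(queries, memory, memory)
        return queries + retrieved
```

The retrieved features supply information that the audio alone does not determine, which is the role the abstract ascribes to the implicit memory in the audio-to-expression stage.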