DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Zero-shot learning (ZSL) aims to predict unseen classes whose samples have
never appeared during training. One of the most effective and widely used
forms of semantic information for zero-shot image classification is
attributes, i.e., annotations of class-level visual characteristics. However,
current methods often fail to discriminate subtle visual distinctions between
images, owing not only to the shortage of fine-grained annotations but also to
attribute imbalance and co-occurrence. In this paper, we present a
transformer-based end-to-end ZSL method named DUET, which integrates latent
semantic knowledge from pre-trained language models (PLMs) via a
self-supervised multi-modal learning paradigm. Specifically, we (1) develop a
cross-modal semantic grounding network to investigate the model's capability of
disentangling semantic attributes from images; (2) apply an attribute-level
contrastive learning strategy to further enhance the model's discrimination of
fine-grained visual characteristics against attribute co-occurrence and
imbalance; and (3) propose a multi-task learning policy for considering
multiple model objectives. We find that DUET achieves state-of-the-art
performance on three standard ZSL benchmarks and a knowledge-graph-equipped
ZSL benchmark, that its components are effective, and that its predictions are
interpretable.

Comment: AAAI 2023 (Oral). Repository: https://github.com/zjukg/DUE
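The attribute-level contrastive idea can be pictured with a small sketch. The
following is a minimal, hypothetical illustration, not DUET's actual
implementation: an InfoNCE-style loss that pulls each image-derived embedding
toward its ground-truth attribute embedding and pushes it away from all other
attributes, which scores attributes individually rather than as co-occurring
bundles. All tensor shapes, names, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(region_emb, attr_emb, labels, temperature=0.07):
    """InfoNCE-style loss over attributes (hypothetical sketch).

    region_emb: (B, D) image-region features from the vision encoder
    attr_emb:   (A, D) attribute embeddings from the PLM text encoder
    labels:     (B,)   index of the ground-truth attribute per region
    """
    region_emb = F.normalize(region_emb, dim=-1)
    attr_emb = F.normalize(attr_emb, dim=-1)
    logits = region_emb @ attr_emb.t() / temperature  # (B, A) similarities
    # Cross-entropy over attributes = contrast the true attribute
    # against every other attribute for each region.
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors (B=8 regions, A=312 attributes, D=256).
loss = attribute_contrastive_loss(
    torch.randn(8, 256), torch.randn(312, 256),
    torch.randint(0, 312, (8,)))
```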
Context-Aware Embeddings for Automatic Art Analysis
Automatic art analysis aims to classify and retrieve artistic representations
from a collection of images by using computer vision and machine learning
techniques. In this work, we propose to enhance visual representations from
neural networks with contextual artistic information. Whereas visual
representations are able to capture information about the content and the style
of an artwork, our proposed context-aware embeddings additionally encode
relationships between different artistic attributes, such as author, school, or
historical period. We design two different approaches for using context in
automatic art analysis. In the first one, contextual data is obtained through a
multi-task learning model, in which several attributes are trained together to
find visual relationships between elements. In the second approach, context is
obtained through an art-specific knowledge graph, which encodes relationships
between artistic attributes. An exhaustive evaluation of both of our models on
several art analysis problems, such as author identification, type
classification, and cross-modal retrieval, shows that performance improves by
up to 7.3% in art classification and 37.24% in retrieval when context-aware
embeddings are used.
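The first approach (context via multi-task learning) admits a compact sketch.
The code below is a hypothetical illustration under assumptions, not the
paper's actual architecture: a shared visual backbone trained jointly on
several attribute-classification heads, so the pooled feature itself becomes
the context-aware embedding. Head names, class counts, and the choice of
ResNet-50 are all assumptions.

```python
import torch.nn as nn
from torchvision import models

class MultiTaskArtNet(nn.Module):
    """Shared backbone with one head per artistic attribute (sketch).

    Training the heads jointly encourages the shared features to encode
    relationships between attributes such as author, school, or period.
    """

    def __init__(self, n_authors=350, n_types=10, n_schools=25, n_periods=15):
        super().__init__()
        backbone = models.resnet50(weights=None)
        dim = backbone.fc.in_features
        backbone.fc = nn.Identity()          # keep the pooled feature vector
        self.backbone = backbone
        self.heads = nn.ModuleDict({
            "author": nn.Linear(dim, n_authors),
            "type":   nn.Linear(dim, n_types),
            "school": nn.Linear(dim, n_schools),
            "period": nn.Linear(dim, n_periods),
        })

    def forward(self, images):
        feats = self.backbone(images)        # context-aware embedding
        return feats, {k: head(feats) for k, head in self.heads.items()}

def multitask_loss(logits, targets):
    # Joint objective: sum of per-task cross-entropies.
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits[k], targets[k]) for k in targets)
```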
Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects
Human vision greatly benefits from the information about sizes of objects.
The role of size in several visual reasoning tasks has been thoroughly explored
in human perception and cognition. However, the impact of size information has
yet to be determined in AI. We postulate that this is mainly due to the lack of
a comprehensive repository of size information. In this paper, we introduce a
method to automatically infer object sizes, leveraging visual and textual
information from the web. By maximizing the joint likelihood of textual and
visual observations, our method learns reliable relative size estimates with no
explicit human supervision. We introduce the relative size dataset and show
that our method outperforms competitive textual and visual baselines in
reasoning about size comparisons.

Comment: To appear in AAAI 201
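To make the likelihood-maximization step concrete: under a Gaussian-noise
assumption on observed log size ratios, maximizing the likelihood of pairwise
observations reduces to a least-squares fit of per-object log sizes. The sketch
below illustrates only that reduction; it is not the paper's actual model
(which jointly fuses textual and visual observations), and all numbers and
names are hypothetical.

```python
import numpy as np

def fit_log_sizes(pairs, n_objects, iters=500, lr=0.1):
    """Least-squares fit of per-object log sizes from noisy pairwise
    observations. Each triple (i, j, r) means "object i is exp(r) times
    larger than object j". Gaussian noise on r makes this the MLE.
    """
    s = np.zeros(n_objects)                  # log-size estimates
    for _ in range(iters):
        grad = np.zeros(n_objects)
        for i, j, r in pairs:
            err = (s[i] - s[j]) - r          # residual of observed ratio
            grad[i] += err
            grad[j] -= err
        s -= lr * grad / max(len(pairs), 1)  # gradient step on squared error
    return s - s.mean()                      # sizes are relative; zero-center

# Toy usage: elephant(0), dog(1), butterfly(2) with noisy ratio observations.
obs = [(0, 1, np.log(20.0)), (1, 2, np.log(15.0)), (0, 2, np.log(280.0))]
log_sizes = fit_log_sizes(obs, n_objects=3)
print(log_sizes[0] > log_sizes[2])           # True: elephants > butterflies
```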