6 research outputs found
Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information
Visual Word Sense Disambiguation (VWSD) is a task to find the image that most
accurately depicts the correct sense of the target word for the given context.
Previously, image-text matching models often struggled to recognize
polysemous words. This paper introduces an unsupervised VWSD approach that uses
gloss information from an external lexical knowledge base, especially the sense
definitions. Specifically, we suggest employing Bayesian inference to
incorporate the sense definitions when sense information of the answer is not
provided. In addition, to ameliorate the out-of-dictionary (OOD) issue, we
propose context-aware definition generation with GPT-3. Experimental results
show that VWSD performance increased significantly with our Bayesian
inference-based approach. In addition, our context-aware definition generation
achieved a notable performance improvement on OOD examples, outperforming
the existing definition generation method. We will publish
the source code as soon as possible. Comment: ACL 202
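One way to read the Bayesian step described above is as a marginalization over candidate senses: the score of an image given the context is the sum, over senses, of the probability of the sense given the context times the probability of the image given that sense's definition. The following is a minimal sketch under that assumption, not the authors' released code. The function names and the similarity scores are hypothetical; in practice the scores would come from an image-text matching model.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_images(sense_scores, image_scores_per_sense):
    """Marginalize over senses to score each candidate image.

    sense_scores[s]: hypothetical context-definition similarity for sense s.
    image_scores_per_sense[s][i]: hypothetical similarity between image i
    and the definition of sense s.
    """
    p_sense = softmax(sense_scores)  # P(sense | context)
    n_images = len(image_scores_per_sense[0])
    posterior = [0.0] * n_images
    for ps, img_scores in zip(p_sense, image_scores_per_sense):
        p_img = softmax(img_scores)  # P(image | sense definition)
        for i, p in enumerate(p_img):
            posterior[i] += ps * p  # sum over senses
    return posterior

# Toy example: two senses, two candidate images (all numbers are made up).
posterior = rank_images([2.0, 0.0], [[3.0, 0.0], [0.0, 3.0]])
best_image = max(range(len(posterior)), key=posterior.__getitem__)
```

Because the sense is summed out rather than chosen up front, no gold sense label for the answer is needed, which matches the unsupervised setting the abstract describes.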
Cross-lingual Visual Verb Sense Disambiguation
Recent work has shown that visual context improves cross-lingual sense
disambiguation for nouns. We extend this line of work to the more challenging
task of cross-lingual verb sense disambiguation, introducing the MultiSense
dataset of 9,504 images annotated with English, German, and Spanish verbs. Each
image in MultiSense is annotated with an English verb and its translation in
German or Spanish. We show that cross-lingual verb sense disambiguation models
benefit from visual context, compared to unimodal baselines. We also show that
the verb sense predicted by our best disambiguation model can improve the
results of a text-only machine translation system when used for a multimodal
translation task. Comment: NAACL 2019; fix typo in author nam
Visual context for verb sense disambiguation and multilingual representation learning
Every day billions of images are uploaded to the web. To process images at such a large
scale it is important to build automatic image understanding systems. An important
step towards understanding the content of the images is to be able to understand all the
objects, scenes and actions depicted in the image. These systems should be capable of
integrating with natural language or text to be able to query and interact with humans
for tasks such as image retrieval.
Verbs play a key role in the understanding of sentences and scenes. Verbs express
the semantics of an action as well as the interactions between objects participating in
an event. Thus understanding verbs is central to both language and image understanding.
However, verbs are known for their variability in meaning with context. Many studies in
psychology have shown that contextual information plays an important role in semantic
understanding and processing in the human visual system. We use this as intuition
and investigate the role of textual or visual context in tasks that combine language and
vision.
Our research presented in this thesis focuses on the problems of integrating visual
and textual contexts for: (i) automatically identifying verbs that denote actions depicted
in the images; (ii) fine-grained analysis of how visual context can help disambiguate
different meanings of verbs in a language or across languages; (iii) the role played by
the visual and multilingual context in learning representations that allow us to query
information across modalities and languages.
First, we propose the task of visual sense disambiguation, an alternative way of
addressing the action recognition task. Instead of identifying the actions directly, we
develop a two-step process: identifying the verb that denotes the action being depicted
in an image and then disambiguating the meaning of the verb based on the visual and
textual context associated with the image. We first build an image-verb classifier based
on the weak signal from image description data and analyse the specific regions that
the model focuses on while predicting the verb. We then disambiguate the meaning of
the verb shown in the image using image features and sense-inventories. We test the
hypothesis that visual and textual context associated with the image contribute to the
disambiguation task.
Second, we ask whether the predictions made by such models correspond to human
intuitions about visual verbs or actions. We analyse whether the image regions a verb
prediction model identifies as salient for a given verb correlate with the regions fixated
by human observers performing an action classification task. We also compare the
correlation of human fixations against visual saliency and center bias models.
Third, we propose the crosslingual verb disambiguation task: identifying the correct
translation of the verb in a target language based on visual context. This task has the
potential to resolve lexical ambiguity in machine translation when the visual context
is available. We propose a series of models and show that multimodal models that
fuse textual information with visual features have an edge over text-only or visual-only
models. We then demonstrate how visual sense disambiguation can be combined with
lexically constrained decoding to improve the performance of a standard unimodal machine
translation system on image descriptions.
Finally, we move on to learn joint representations for images and text in multiple
languages. We test the hypothesis that context provided as visual information or text
in another language contributes to better representation learning. We propose models to
map text from multiple languages and images into a common space and evaluate
the usefulness of the second language in multimodal search and the usefulness of images
in crosslingual search. Our experiments suggest that exploiting multilingual and
multimodal resources can help in learning better semantic representations that are useful
for various multimodal natural language understanding tasks.
Our experiments on visual sense disambiguation, sense disambiguation across languages,
multimodal and cross-lingual search demonstrate that visual context alone or
combined with textual context is useful for enhancing multimodal and crosslingual
applications.
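To illustrate what search in a common space means here, the sketch below assumes that text in any language and images have already been mapped to vectors of the same dimension, and ranks images by cosine similarity to a text query. The embeddings, item names, and the `search` helper are hypothetical placeholders rather than the thesis's models.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_vec, index):
    """Return item ids sorted by descending similarity to the query."""
    return sorted(index, key=lambda k: cosine(query_vec, index[k]), reverse=True)

# Hypothetical image embeddings in the shared space.
image_index = {
    "img_ride_horse": [0.9, 0.1, 0.0],
    "img_ride_bike": [0.1, 0.9, 0.0],
}

# Hypothetical embedding of a German caption; an English caption mapped into
# the same space would be queried in exactly the same way.
query_de = [0.8, 0.2, 0.1]
ranking = search(query_de, image_index)
```

Since every modality and language lands in one space, the same nearest-neighbour routine serves both multimodal search (text to image) and crosslingual search (text to text).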
Learning Transferable Representations for Hierarchical Relationship Exploration
Visual scenes are composed of basic elements, such as objects, parts, and other semantic regions. It is well acknowledged that humans perceive the world in a compositional and hierarchical way, in which visual scenes are treated as a layout of distinct semantic objects, attributes, and parts. These separate objects, attributes, and parts are linked together via different relationships, including visual relationships and semantic relationships. In particular, the parts, attributes, and objects of visual concepts (objects, visual relationships) are shared and thus transferable among different visual concepts. Humans can easily imagine a new composite concept from the shared parts of different concepts, whereas an important shortcoming of current deep neural networks is their limited compositional perception ability, which means they require large-scale data to optimize. From the perspective of compositional perception, this thesis argues that one of the limitations of typical neural networks is that their factor representations are not sharable and transferable among different concepts. Therefore, the thesis introduces various techniques, including a compositional learning framework, compositional invariant learning, and the BatchFormer module, to make the factor representations of deep neural networks sharable and transferable among different concepts for hierarchical relationship exploration, involving human-object interaction, 3D human-object interaction, and sample relationships.