310 research outputs found

    Disambiguating Visual Verbs


    Transductive Visual Verb Sense Disambiguation

    Verb Sense Disambiguation is a well-known task in NLP whose aim is to find the correct sense of a verb in a sentence. Recently, this problem has been extended to a multimodal scenario by exploiting both textual and visual features of ambiguous verbs, leading to a new problem, Visual Verb Sense Disambiguation (VVSD). Here, the sense of a verb is assigned considering the content of an image paired with it rather than a sentence in which the verb appears. Annotating a dataset for this task is more complex than for textual disambiguation, because assigning the correct sense to an (image, verb) pair requires both non-trivial linguistic and visual skills. In this work, unlike in the literature, the VVSD task is performed in a transductive semi-supervised learning (SSL) setting, in which only a small amount of labeled information is required, greatly reducing the need for annotated data. The disambiguation process is based on a graph-based label propagation method which takes into account mono- or multimodal representations for (image, verb) pairs. Experiments have been carried out on the recently published dataset VerSe, the only available dataset for this task. The achieved results outperform the current state of the art by a large margin while using only a small fraction of labeled samples per sense.

    Multimodal Grounding for Language Processing

    This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs, which play a crucial role for the compositional power of language. Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197

    Visual context for verb sense disambiguation and multilingual representation learning

    Every day billions of images are uploaded to the web. To process images at such a large scale it is important to build automatic image understanding systems. An important step towards understanding the content of the images is to be able to understand all the objects, scenes and actions depicted in the image. These systems should be capable of integrating with natural language or text to be able to query and interact with humans for tasks such as image retrieval. Verbs play a key role in the understanding of sentences and scenes. Verbs express the semantics of actions as well as the interactions between objects participating in an event. Thus understanding verbs is central to both language and image understanding. However, verbs are known for their variability in meaning with context. Many studies in psychology have shown that contextual information plays an important role in semantic understanding and processing in the human visual system. We use this as intuition and seek to understand the role of textual or visual context in tasks that combine language and vision. Our research presented in this thesis focuses on the problems of integrating visual and textual contexts for: (i) automatically identifying verbs that denote actions depicted in the images; (ii) fine-grained analysis of how visual context can help disambiguate different meanings of verbs in a language or across languages; (iii) the role played by the visual and multilingual context in learning representations that allow us to query information across modalities and languages. First, we propose the task of visual sense disambiguation, an alternative way of addressing the action recognition task. Instead of identifying the actions directly, we develop a two-step process: identifying the verb that denotes the action being depicted in an image and then disambiguating the meaning of the verb based on the visual and textual context associated with the image. 
We first build an image-verb classifier based on the weak signal from image description data and analyse the specific regions that the model focuses on while predicting the verb. We then disambiguate the meaning of the verb shown in the image using image features and sense inventories. We test the hypothesis that visual and textual context associated with the image contribute to the disambiguation task. Second, we ask whether the predictions made by such models correspond to human intuitions about visual verbs or actions. We analyse whether the image regions a verb prediction model identifies as salient for a given verb correlate with the regions fixated by human observers performing an action classification task. We also compare the correlation of human fixations against visual saliency and center bias models. Third, we propose the crosslingual verb disambiguation task: identifying the correct translation of the verb in a target language based on visual context. This task has the potential to resolve lexical ambiguity in machine translation when the visual context is available. We propose a series of models and show that multimodal models that fuse textual information with visual features have an edge over text-only or visual-only models. We then demonstrate how visual sense disambiguation can be combined with lexically constrained decoding to improve the performance of a standard unimodal machine translation system on image descriptions. Finally, we move on to learn joint representations for images and text in multiple languages. We test the hypothesis that context provided as visual information or text in another language contributes to better representation learning. We propose models to map text from multiple languages and images into a common space and evaluate the usefulness of the second language in multimodal search and the usefulness of images in crosslingual search. 
Our experiments suggest that exploiting multilingual and multimodal resources can help in learning better semantic representations that are useful for various multimodal natural language understanding tasks. Our experiments on visual sense disambiguation, sense disambiguation across languages, and multimodal and crosslingual search demonstrate that visual context, alone or combined with textual context, is useful for enhancing multimodal and crosslingual applications.

    From Word to Sense Embeddings: A Survey on Vector Representations of Meaning

    Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality. Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research.

    Cross-lingual Visual Verb Sense Disambiguation

    Recent work has shown that visual context improves cross-lingual sense disambiguation for nouns. We extend this line of work to the more challenging task of cross-lingual verb sense disambiguation, introducing the MultiSense dataset of 9,504 images annotated with English, German, and Spanish verbs. Each image in MultiSense is annotated with an English verb and its translation in German or Spanish. We show that cross-lingual verb sense disambiguation models benefit from visual context, compared to unimodal baselines. We also show that the verb sense predicted by our best disambiguation model can improve the results of a text-only machine translation system when used for a multimodal translation task. Comment: NAACL 2019; fix typo in author name.

    MultiSubs: A Large-scale Multimodal and Multilingual Dataset

    This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a text fragment and a sentence; (iii) the sentences are free-form and real-world like; (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for humans to evaluate the quality of the automatic image selection process of our dataset. We show the utility of the dataset on two automatic tasks: (i) fill-in-the-blank; (ii) lexical translation. Results of the human evaluation and automatic models demonstrate that images can be a useful complement to the textual context. The dataset will benefit research on visual grounding of words, especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence. Comment: Manuscript update: (i) Added links to the dataset and evaluation toolkit; (ii) Section 6.1.4: Added random and n-gram baselines to the fill-in-the-blank task, and added further discussion at the end of the section; (iii) Section 6.2.3: Further elaboration on the ALI metric; (iv) Section 6.2.4: Corrected results for the lexical translation task (Table 8), and updated the discussions accordingly.