506 research outputs found

    Knowledge-rich Image Gist Understanding Beyond Literal Meaning

    Full text link
    We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or news articles. To this end, we propose a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that has previously been shown to be highly effective for text understanding. Our method identifies the connotation of objects beyond their denotation: where most approaches to image understanding focus on the denotation of objects, i.e., their literal meaning, our work addresses the identification of connotations, i.e., iconic meanings of objects, to understand the message of images. We view image understanding as the task of representing an image-caption pair on the basis of a wide-coverage vocabulary of concepts such as the one provided by Wikipedia, and cast gist detection as a concept-ranking problem with image-caption pairs as queries. To enable a thorough investigation of the problem of gist understanding, we produce a gold standard of over 300 image-caption pairs and over 8,000 gist annotations covering a wide variety of topics at different levels of abstraction. We use this dataset to experimentally benchmark the contribution of signals from heterogeneous sources, namely image and text. The best result with a Mean Average Precision (MAP) of 0.69 indicate that by combining both dimensions we are able to better understand the meaning of our image-caption pairs than when using language or vision information alone. We test the robustness of our gist detection approach when receiving automatically generated input, i.e., using automatically generated image tags or generated captions, and prove the feasibility of an end-to-end automated process

    Transductive Visual Verb Sense Disambiguation

    Get PDF
    Verb Sense Disambiguation is a well-known task in NLP, the aim is to find the correct sense of a verb in a sentence. Recently, this problem has been extended in a multimodal scenario, by exploiting both textual and visual features of ambiguous verbs leading to a new problem, the Visual Verb Sense Disambiguation (VVSD). Here, the sense of a verb is assigned considering the content of an image paired with it rather than a sentence in which the verb appears. Annotating a dataset for this task is more complex than textual disambiguation, because assigning the correct sense to a pair of requires both non-trivial linguistic and visual skills. In this work, differently from the literature, the VVSD task will be performed in a transductive semi-supervised learning (SSL) setting, in which only a small amount of labeled information is required, reducing tremendously the need for annotated data. The disambiguation process is based on a graph-based label propagation method which takes into account mono or multimodal representations for pairs. Experiments have been carried out on the recently published dataset VerSe, the only available dataset for this task. The achieved results outperform the current state-of-the-art by a large margin while using only a small fraction of labeled samples per sens

    Disambiguating Visual Verbs

    Get PDF

    Image Based Model for Document Search and Re-ranking

    Full text link
    Traditional Web search engines do not use the images in the web pages to search relevant documents for a given query. Instead, they are typically operated by computing a measure of agreement between the keywords provided by the user and only the text portion of each web page. This project describes whether the image content appearing in a Web page can be used to enhance the semantic description of Web page and accordingly improve the performance of a keyword-based search engine. A Web-scalable system is presented in such a way that exploits a pure text-based search engine that finds an initial set of candidate documents as per given query. Then, by using visual information extracted from the images contained in the pages, the candidate set will be re-ranked. The computational efficiency of traditional text-based search engines will be maintained by the resulting system with only a small additional storage cost that will be needed to predetermine the visual information

    Meaningful results for Information Retrieval in the MEANING project

    Get PDF
    The goal of the MEANING project (IST-2001-34460) is to develop tools for the automatic acquisition of lexical knowledge that will help Word Sense Disambiguation (WSD). The acquired lexical knowledge from various sources and various languages is stored in the Multilingual Central Repository (MCR) (Atserias et al 04), which is based on the design of the EuroWordNet database. The MCR holds wordnets in various languages (English, Spanish, Italian, Catalan and Basque), which are interconnected via an Inter-Lingual-Index (ILI). In addition, the MCR holds a number of ontologies and domain labels related to al

    Video-Helpful Multimodal Machine Translation

    Full text link
    Existing multimodal machine translation (MMT) datasets consist of images and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, making visual information ineffective in generating appropriate translations. Recent work has constructed an ambiguous subtitles dataset to alleviate this problem but is still limited to the problem that videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous, and videos are guaranteed helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods: Frame attention loss and Ambiguity augmentation, aiming to use videos in EVA for disambiguation fully. Experiments on EVA show that visual information and the proposed methods can boost translation performance, and our model performs significantly better than existing MMT models. The EVA dataset and the SAFA model are available at: https://github.com/ku-nlp/video-helpful-MMT.git.Comment: Accepted by EMNLP 2023 Main Conference (long paper

    Training a Naive Bayes Classifier via the EM Algorithm with a Class Distribution Constraint

    Get PDF
    Combining a naive Bayes classifier with the EM algorithm is one of the promising approaches for making use of unlabeled data for disambiguation tasks when using local context features including word sense disambiguation and spelling correction. However, the use of unlabeled data via the basic EM algorithm often causes disastrous performance degradation instead of improving classification performance, resulting in poor classification performance on average. In this study, we introduce a class distribution constraint into the iteration process of the EM algorithm. This constraint keeps the class distribution of unlabeled data consistent with the class distribution estimated from labeled data, preventing the EM algorithm from converging into an undesirable state. Experimental results from using 26 confusion sets and a large amount of unlabeled data show that our proposed method for using unlabeled data considerably improves classification performance when the amount of labeled data is small
    • …
    corecore