Visual context for verb sense disambiguation and multilingual representation learning
Every day, billions of images are uploaded to the web. To process images at such a large
scale, it is important to build automatic image understanding systems. An important
step towards understanding the content of the images is to be able to understand all the
objects, scenes and actions depicted in the image. These systems should be capable of
integrating with natural language or text to be able to query and interact with humans
for tasks such as image retrieval.
Verbs play a key role in the understanding of sentences and scenes. Verbs express
the semantics of actions as well as the interactions between objects participating in
an event. Thus understanding verbs is central to both language and image understanding.
However, verbs are known for their variability in meaning with context. Many studies in
psychology have shown that contextual information plays an important role in semantic
understanding and processing in the human visual system. We take this as our intuition
and investigate the role of textual or visual context in tasks that combine language and
vision.
Our research presented in this thesis focuses on the problems of integrating visual
and textual contexts for: (i) automatically identifying verbs that denote actions depicted
in the images; (ii) fine-grained analysis of how visual context can help disambiguate
different meanings of verbs in a language or across languages; (iii) the role played by
the visual and multilingual context in learning representations that allow us to query
information across modalities and languages.
First, we propose the task of visual sense disambiguation, an alternative way of
addressing the action recognition task. Instead of identifying the actions directly, we
develop a two-step process: identifying the verb that denotes the action depicted in
an image, and then disambiguating the meaning of that verb based on the visual and
textual context associated with the image. We first build an image-verb classifier based
on the weak signal from image description data and analyse the specific regions the
model focuses on while predicting the verb. We then disambiguate the meaning of
the verb shown in the image using image features and sense inventories. We test the
hypothesis that the visual and textual context associated with the image contribute to
the disambiguation task.
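As a minimal sketch of the second step, one could score each candidate sense from the inventory against both contexts and pick the best-scoring one. The `disambiguate` function, the equal weighting `alpha`, and the toy sense names below are illustrative assumptions, not the actual model used in the thesis:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def disambiguate(image_feat, text_feat, sense_vectors, alpha=0.5):
    """Pick the verb sense whose inventory embedding best matches a
    weighted mix of visual-context and textual-context similarity.
    `sense_vectors` maps sense id -> embedding (hypothetical inventory)."""
    scores = {
        sense: alpha * cosine(image_feat, vec) + (1 - alpha) * cosine(text_feat, vec)
        for sense, vec in sense_vectors.items()
    }
    return max(scores, key=scores.get)
```

The hypothesis from the abstract corresponds to comparing this combined scorer against visual-only (`alpha=1`) and textual-only (`alpha=0`) variants.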
Second, we ask whether the predictions made by such models correspond to human
intuitions about visual verbs or actions. We analyse whether the image regions a verb
prediction model identifies as salient for a given verb correlate with the regions fixated
by human observers performing an action classification task. We also compare the
correlation of human fixations against visual saliency and center bias models.
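One standard way to quantify this kind of agreement is a Pearson correlation between a model's saliency map and a human fixation map, both flattened to vectors; this is only a generic sketch of that metric, not the exact evaluation protocol of the study:

```python
import math

def fixation_correlation(saliency_map, fixation_map):
    """Pearson correlation between two flattened attention maps
    (lists of per-pixel values of equal length)."""
    n = len(saliency_map)
    ms = sum(saliency_map) / n
    mf = sum(fixation_map) / n
    cov = sum((s - ms) * (f - mf) for s, f in zip(saliency_map, fixation_map))
    var_s = sum((s - ms) ** 2 for s in saliency_map)
    var_f = sum((f - mf) ** 2 for f in fixation_map)
    return cov / math.sqrt(var_s * var_f)
```

A center-bias baseline would simply substitute a centered Gaussian for the model's saliency map in the same computation.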
Third, we propose the crosslingual verb disambiguation task: identifying the correct
translation of the verb in a target language based on visual context. This task has the
potential to resolve lexical ambiguity in machine translation when the visual context
is available. We propose a series of models and show that multimodal models that
fuse textual information with visual features have an edge over text-only or visual-only
models. We then demonstrate how visual sense disambiguation can be combined with
lexical constraint decoding to improve the performance of a standard unimodal machine
translation system on image descriptions.
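A simple way to realise such multimodal fusion is late fusion: combine a text-only translation score for each candidate target verb with a visual compatibility score. The function, the weighting `beta`, and the French candidates below are hypothetical illustrations rather than the models proposed in the thesis:

```python
def best_translation(candidates, text_score, visual_score, beta=0.7):
    """Late fusion over candidate target-language verbs: mix a
    text-only score with a visual compatibility score and return
    the highest-scoring candidate."""
    fused = {
        c: beta * text_score[c] + (1 - beta) * visual_score[c]
        for c in candidates
    }
    return max(fused, key=fused.get)
```

In a constraint-decoding setup, the winning verb would then be passed to the translation decoder as a lexical constraint it must include in its output.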
Finally, we move on to learning joint representations for images and text in multiple
languages. We test the hypothesis that context provided as visual information or as text
in another language contributes to better representation learning. We propose models
that map text from multiple languages and images into a common space, and evaluate
the usefulness of the second language in multimodal search and of images in
crosslingual search. Our experiments suggest that exploiting multilingual and
multimodal resources can help in learning better semantic representations that are useful
for various multimodal natural language understanding tasks.
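Joint-embedding models of this kind are typically trained with a max-margin ranking objective: in the common space, a matching image-caption (or caption-translation) pair should outscore mismatched pairs by some margin. A minimal sketch of that loss, with an assumed margin value and cosine scoring:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def ranking_loss(anchor, positive, negatives, margin=0.2):
    """Sum of margin violations: the matching pair's similarity should
    exceed every mismatched pair's similarity by at least `margin`."""
    pos = cosine(anchor, positive)
    return sum(max(0.0, margin - pos + cosine(anchor, neg)) for neg in negatives)
```

Because the loss only compares similarities, the same objective applies whether the anchor is an image and the candidates are sentences, or the anchor is a sentence in one language and the candidates are sentences in another.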
Our experiments on visual sense disambiguation, sense disambiguation across languages,
and multimodal and crosslingual search demonstrate that visual context, alone or
combined with textual context, is useful for enhancing multimodal and crosslingual
applications.
Language-based multimedia information retrieval
This paper describes various methods and approaches for language-based multimedia information retrieval, which have been developed in the projects POP-EYE and OLIVE and which will be developed further in the MUMIS project. All of these projects aim at supporting automated indexing of video material by use of human language technologies. Thus, in contrast to image or sound-based retrieval methods, where both the query language and the indexing methods build on non-linguistic data, these methods attempt to exploit advanced text retrieval technologies for the retrieval of non-textual material. While POP-EYE was building on subtitles or captions as the prime language key for disclosing video fragments, OLIVE is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which then serve as the basis for text-based retrieval functionality.
Multimedia search without visual analysis: the value of linguistic and contextual information
This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content, and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features.
Metadata Augmentation for Semantic- and Context- Based Retrieval of Digital Cultural Objects
Cultural objects are increasingly stored and generated in digital form, yet effective methods for their indexing and retrieval still remain an open area of research. The main problem arises from the disconnection between the content-based indexing approach used by computer scientists and the description-based approach used by information scientists. There is also a lack of representational schemes that allow the alignment of the semantics and context with keywords and low-level features that can be automatically extracted from the content of these cultural objects. This paper presents an integrated approach to address these problems, taking advantage of both computer science and information science approaches. The focus is on the rationale and conceptual design of the system and its various components. In particular, we discuss techniques for augmenting commonly used metadata with visual features and domain knowledge to generate high-level abstract metadata which in turn can be used for semantic and context-based indexing and retrieval. We use a sample collection of Vietnamese traditional woodcuts to demonstrate the usefulness of this approach.
Mind the Gap: Another look at the problem of the semantic gap in image retrieval
This paper attempts to review and characterise the problem of the semantic gap in image retrieval and the attempts being made to bridge it. In particular, we draw from our own experience in user queries, automatic annotation and ontological techniques. The first section of the paper describes a characterisation of the semantic gap as a hierarchy between the raw media and full semantic understanding of the media's content. The second section discusses real users' queries with respect to the semantic gap. The final sections of the paper describe our own experience in attempting to bridge the semantic gap. In particular we discuss our work on auto-annotation and semantic-space models of image retrieval in order to bridge the gap from the bottom up, and the use of ontologies, which capture more semantics than keyword object labels alone, as a technique for bridging the gap from the top down.