
    Movie Description

    Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs that are temporally aligned to full-length movies. In addition, we collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total, the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First, we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)" at ICCV 2015.
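    Video description benchmarks of this kind are commonly scored by comparing generated sentences against the reference descriptions with n-gram overlap metrics such as BLEU. A minimal sketch, with made-up clip IDs and sentences standing in for the LSMDC parallel corpus:

```python
# Minimal sketch: scoring generated clip descriptions against reference ADs
# with corpus-level BLEU. Clip IDs and sentences are illustrative stand-ins;
# the official evaluation also used other metrics, not shown here.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical parallel data: one reference AD sentence per clip.
references = {
    "clip_0001": "someone opens the door and steps into the dark hallway",
    "clip_0002": "she looks out of the window at the rain",
}
generated = {
    "clip_0001": "a man opens the door and walks into the hallway",
    "clip_0002": "she stares out of the window",
}

# corpus_bleu expects, per hypothesis, a list of tokenized references.
refs = [[references[cid].split()] for cid in sorted(generated)]
hyps = [generated[cid].split() for cid in sorted(generated)]

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
print(f"BLEU-4: {corpus_bleu(refs, hyps, smoothing_function=smooth):.3f}")
```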

    Text-to-Video: Image Semantics and NLP

    When aiming to automatically translate an arbitrary text into a visual story, the main challenge is to find a semantically close visual representation whose displayed meaning remains the same as in the given text. Moreover, the appearance of an image itself largely influences how its meaningful information is conveyed to an observer. This thesis demonstrates that investigating both image semantics and the semantic relatedness between visual and textual sources enables us to tackle the challenging semantic gap and to find a semantically close translation from natural language to a corresponding visual representation. In recent years, social networking has become highly popular, leading to an enormous and still increasing amount of data available online. Photo-sharing sites like Flickr allow users to associate textual information with their uploaded imagery. This thesis exploits this huge knowledge source of user-generated data, which provides initial links between images, words, and other meaningful data. To approach visual semantics, this work presents various methods for analyzing the visual structure as well as the appearance of images in terms of meaningful similarities, aesthetic appeal, and emotional effect on an observer. In detail, our GPU-based approach efficiently finds visual similarities between images in large datasets across visual domains and identifies various meanings for ambiguous words by exploring similarity in online search results. Further, we investigate the highly subjective aesthetic appeal of images and use deep learning to directly learn aesthetic rankings from a broad diversity of user reactions in online social behavior. To gain even deeper insights into the influence of visual appearance on an observer, we explore how simple image processing can actually change emotional perception, and derive a simple but effective image filter. To identify meaningful connections between written text and visual representations, we employ methods from Natural Language Processing (NLP). Extensive textual processing allows us to create semantically relevant illustrations for simple text elements as well as complete storylines. More precisely, we present an approach that resolves dependencies in textual descriptions to arrange 3D models correctly. Further, we develop a method that finds semantically relevant illustrations for texts of different types based on a novel hierarchical querying algorithm. Finally, we present an optimization-based framework that is capable of generating not only semantically relevant but also visually coherent picture stories in different styles.
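    The GPU-based similarity search described above reduces, at its core, to dense comparison of image feature vectors. Below is a minimal, hedged sketch of batched all-pairs cosine similarity in PyTorch; the random 512-dimensional descriptors are stand-in assumptions, not the thesis's actual features or indexing scheme.

```python
# Minimal sketch of GPU-accelerated visual similarity, assuming images have
# already been reduced to fixed-length feature vectors.
import torch

def pairwise_cosine(features: torch.Tensor, batch: int = 4096) -> torch.Tensor:
    """All-pairs cosine similarity, computed in row batches on the device."""
    f = torch.nn.functional.normalize(features, dim=1)
    out = torch.empty(len(f), len(f), device=f.device)
    for i in range(0, len(f), batch):
        out[i:i + batch] = f[i:i + batch] @ f.T  # dot products of unit vectors
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
feats = torch.randn(5_000, 512, device=device)  # stand-in for real descriptors
sim = pairwise_cosine(feats)
# Five most similar images to image 0 (excluding itself).
print(sim[0].topk(6).indices[1:].tolist())
```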

    ImageNet Large Scale Visual Recognition Challenge

    The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
    Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references.
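    ILSVRC reports classification performance as top-5 error: a prediction counts as correct if the ground-truth label appears among the model's five highest-scoring classes. A minimal sketch of the metric, with random stand-in scores:

```python
# Top-5 classification error, the standard ILSVRC metric. Scores and labels
# below are random stand-ins for real model outputs.
import numpy as np

def top5_error(scores: np.ndarray, labels: np.ndarray) -> float:
    # Sort classes by descending score and keep the five best per image.
    top5 = np.argsort(-scores, axis=1)[:, :5]
    hits = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((1000, 1000))          # 1000 images x 1000 classes
labels = rng.integers(0, 1000, size=1000)  # ground-truth class per image
# A random classifier hits the top 5 with probability 5/1000, so the
# expected error here is about 0.995.
print(f"top-5 error: {top5_error(scores, labels):.3f}")
```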

    Image Annotation and Topic Extraction Using Super-Word Latent Dirichlet Allocation

    This research presents a multi-domain solution that uses text and images to iteratively improve automated information extraction. Stage I uses the local text surrounding an embedded image to provide clues that help rank-order possible image annotations. These annotations are forwarded to Stage II, where they are used as highly relevant super-words to improve the extraction of topics. The model probabilities from the super-words in Stage II are forwarded to Stage III, where they are used to refine the automated image annotation developed in Stage I. All stages demonstrate improvement over existing equivalent algorithms in the literature.
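    A hedged sketch of the Stage II "super-word" idea, using scikit-learn's LDA as a stand-in for the paper's model: annotations from Stage I are injected into each document as repeated, heavily weighted tokens before topic extraction. The corpus, annotations, and weight below are illustrative assumptions, not the paper's data.

```python
# Super-word injection before topic modeling: image annotations are repeated
# so they dominate the word counts and steer the inferred topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the engine stalled on the runway before the morning flight",
    "pilots reviewed the flight plan and checked the fuel gauges",
    "the recipe calls for flour eggs and a pinch of salt",
]
image_annotations = [["airplane"], ["cockpit"], ["cake"]]  # from Stage I
SUPER_WEIGHT = 5  # illustrative boost factor for super-words

boosted = [
    doc + " " + " ".join(ann * SUPER_WEIGHT)
    for doc, ann in zip(docs, image_annotations)
]
X = CountVectorizer(stop_words="english").fit_transform(boosted)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic mixtures, steered by super-words
```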

    Apprentissage d'espaces sémantiques (Learning Semantic Spaces)

    In this dissertation, we present several techniques for learning semantic spaces for multiple domains, for example words and images, but also at the intersection of different domains. A representation space is called semantic if entities judged similar by a human have their similarity preserved in that space. The first publication presents a pipeline of learning methods, including several unsupervised learning techniques, that allowed us to win the Unsupervised and Transfer Learning Challenge in 2011. The second article presents a way to extract information from a structured context (177 object detectors at different positions and scales). We show that using the structure of the data combined with unsupervised learning reduces dimensionality by 97% while improving scene-recognition performance by +5% to +11% depending on the dataset. The third work examines the structure learned by the deep neural networks used in the two previous publications. Several hypotheses are presented and tested experimentally, showing that the learned space has better mixing properties (making it easier to explore different classes during the sampling process). The fourth publication tackles a syntactic and semantic parsing problem with recurrent neural network architectures trained on context windows of word embeddings. In the fifth work, we propose a way to perform "augmented" image search by learning a joint semantic space in which a search for an image containing an object also returns images of the object's parts; for example, a search returning images of "car" would also return images of "windshields", "trunks", and "wheels" in addition to the initial images.
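    A minimal sketch of the "augmented search" idea from the fifth work: in a joint semantic space, part images are indexed alongside whole-object images, so the neighbors of a "car" query can include "wheel" or "windshield" images. The labels, 128-dimensional vectors, and random embeddings below are illustrative stand-ins for a learned joint space.

```python
# Nearest-neighbor retrieval in a (stand-in) joint semantic space. In a real
# learned space, parts lie close to their whole object, so a "car" query
# surfaces part images too; random vectors here only illustrate the mechanics.
import numpy as np

rng = np.random.default_rng(1)
labels = ["car_1", "car_2", "wheel_1", "windshield_1", "dog_1"]
index = rng.normal(size=(len(labels), 128))
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit vectors

def augmented_search(query_vec: np.ndarray, k: int = 3) -> list[str]:
    sims = index @ (query_vec / np.linalg.norm(query_vec))
    return [labels[i] for i in np.argsort(-sims)[:k]]

# Query with a slightly perturbed copy of the "car_1" embedding.
print(augmented_search(index[0] + 0.1 * rng.normal(size=128)))
```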

    Closing the gap in WSD: supervised results with unsupervised methods

    Word-Sense Disambiguation (WSD) holds promise for many NLP applications requiring broad-coverage language understanding, such as summarization (Barzilay and Elhadad, 1997) and question answering (Ramakrishnan et al., 2003). Recent studies have also shown that WSD can benefit machine translation (Vickrey et al., 2005) and information retrieval (Stokoe, 2005). Much work has focused on the computational treatment of sense ambiguity, primarily using data-driven methods. The most accurate WSD systems to date are supervised and rely on the availability of sense-labeled training data. This restriction poses a significant barrier to the widespread use of WSD in practice, since such data is extremely expensive to acquire for new languages and domains. Unsupervised WSD holds the key to enabling such applications, as it does not require sense-labeled data. However, unsupervised methods fall far behind supervised ones in terms of accuracy and ease of use. In this thesis we explore the reasons for this, and present solutions to remedy the situation. We hypothesize that one of the main problems with unsupervised WSD is its lack of a standard formulation and of the general-purpose tools common to supervised methods. As a first step, we examine existing approaches to unsupervised WSD, with the aim of detecting independent principles that can be utilized in a general framework. We investigate ways of leveraging the diversity of existing methods using ensembles, a common tool in the supervised learning framework. This approach allows us to achieve accuracy beyond that of the individual methods, without the need for extensive modification of the underlying systems. Our examination of existing unsupervised approaches highlights the importance of using the predominant sense in case of uncertainty, and the effectiveness of statistical similarity methods as a tool for WSD. However, it also serves to emphasize the need for a way to merge and combine learning elements, and the potential of a supervised-style approach to the problem. Relying on existing methods does not take full advantage of the insights gained from the supervised framework. We therefore present an unsupervised WSD system which circumvents the question of the actual disambiguation method, the main source of discrepancy in unsupervised WSD, and deals directly with the data. Our method uses statistical and semantic similarity measures to produce labeled training data in a completely unsupervised fashion. This allows the training and use of any standard supervised classifier for the actual disambiguation. Classifiers trained with our method significantly outperform those using other methods of data generation, and represent a big step in bridging the accuracy gap between supervised and unsupervised methods. Finally, we address a major drawback of classical unsupervised systems: their reliance on a fixed sense inventory and lexical resources. This dependence represents a substantial setback for unsupervised methods in cases where such resources are unavailable. Unfortunately, these are exactly the areas in which unsupervised methods are most needed. Unsupervised sense discrimination, which does not share those restrictions, presents a promising solution to the problem. We therefore develop an unsupervised sense-discrimination system. We base our system on a well-studied probabilistic generative model, Latent Dirichlet Allocation (Blei et al., 2003), which has many of the advantages of supervised frameworks.
    The model's probabilistic nature lends itself to easy combination and extension, and its generative aspect is well suited to linguistic tasks. Our model achieves state-of-the-art performance on the unsupervised sense-induction task while remaining independent of any fixed sense inventory, and thus represents a fully unsupervised, general-purpose WSD tool.
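    A minimal sketch of LDA-based sense induction in the spirit of the final system, using scikit-learn's LDA as a stand-in for the thesis's generative model: the contexts of one ambiguous word are treated as documents, and each induced topic plays the role of a sense. The contexts and topic count below are illustrative assumptions.

```python
# Sense induction without a sense inventory: cluster the contexts of an
# ambiguous word ("bank") with LDA and read each dominant topic as a sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical context windows around occurrences of "bank".
contexts = [
    "deposited money at the bank before the loan interest rose",
    "the bank approved the mortgage and the account transfer",
    "fished from the muddy bank of the river after the flood",
    "the river bank eroded where the water current was strongest",
]
X = CountVectorizer(stop_words="english").fit_transform(contexts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each occurrence gets its dominant topic as an induced sense id.
print(lda.transform(X).argmax(axis=1))
```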
