Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015.
Semantics and statistics for automated image annotation
Automated image annotation consists of a number of techniques that aim to find the correlation between words and image features such as colour, shape, and texture to provide correct annotation words to images. In particular, approaches based on Bayesian theory use machine-learning techniques to learn statistical models from a training set of pre-annotated images and apply them to generate annotations for unseen images.
The focus of this thesis lies in demonstrating that an approach, which goes beyond learning the statistical correlation between words and visual features and also exploits information about the actual semantics of the words used in the annotation process, is able to improve the performance of probabilistic annotation systems. Specifically, I present three experiments. Firstly, I introduce a novel approach that automatically refines the annotation words generated by a non-parametric density estimation model using semantic relatedness measures. Initially, I consider semantic measures based on co-occurrence of words in the training set. However, this approach can exhibit limitations, as its performance depends on the quality and coverage provided by the training data. For this reason, I devise an alternative solution that combines semantic measures based on knowledge sources, such as WordNet and Wikipedia, with word co-occurrence in the training set and on the web, to achieve statistically significant results over the baseline. Secondly, I investigate the effect of using semantic measures inside an evaluation measure that computes the performance of an automated image annotation system, whose annotation words adopt the hierarchical structure of an ontology. This is the case of the ImageCLEF2009 collection. Finally, I propose a Markov Random Field that exploits the semantic context dependencies of the image. The best result obtains a mean average precision of 0.32, which is consistent with the state-of-the-art in automated image annotation for the Corel 5k dataset.
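As a toy illustration of the first experiment, the refinement step can be sketched as follows. The Dice-style co-occurrence measure and the mixing weight `alpha` are illustrative assumptions, not the thesis's exact formulation:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(captions):
    """Count how often two annotation words appear in the same training caption."""
    pair_counts = Counter()
    word_counts = Counter()
    for words in captions:
        uniq = sorted(set(words))
        word_counts.update(uniq)
        for a, b in combinations(uniq, 2):
            pair_counts[(a, b)] += 1
    return word_counts, pair_counts

def relatedness(a, b, word_counts, pair_counts):
    """Dice-style coefficient over caption co-occurrence (0 = never together)."""
    pair = pair_counts[tuple(sorted((a, b)))]
    denom = word_counts[a] + word_counts[b]
    return 2.0 * pair / denom if denom else 0.0

def refine(candidates, word_counts, pair_counts, alpha=0.5):
    """Rescore each candidate word by mixing its visual-model probability
    with its average relatedness to the other candidate words."""
    rescored = {}
    for w, p in candidates.items():
        others = [relatedness(w, v, word_counts, pair_counts)
                  for v in candidates if v != w]
        sem = sum(others) / len(others) if others else 0.0
        rescored[w] = alpha * p + (1 - alpha) * sem
    return sorted(rescored, key=rescored.get, reverse=True)
```

In this sketch a candidate that co-occurs with the other candidates in the training captions (e.g. "sky" with "clouds") is promoted over an unrelated one (e.g. "car"), mirroring the idea of refining statistically generated annotations with semantic relatedness.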
Text-to-Video: Image Semantics and NLP
When aiming at automatically translating an arbitrary text into a visual story, the main challenge consists in finding a semantically close visual representation whereby the displayed meaning should remain the same as in the given text. Moreover, the appearance of an image itself largely influences how its meaningful information is transported to an observer. This thesis demonstrates that investigating both image semantics and the semantic relatedness between visual and textual sources enables us to tackle the challenging semantic gap and to find a semantically close translation from natural language to a corresponding visual representation.
In recent years, social networking has attracted great interest, leading to an enormous and still growing amount of data available online. Photo sharing sites like Flickr allow users to associate textual information with their uploaded imagery. This thesis therefore exploits this huge source of user-generated data, which provides initial links between images, words, and other meaningful data.
In order to approach visual semantics, this work presents various methods to analyze the visual structure as well as the appearance of images in terms of meaningful similarities, aesthetic appeal, and emotional effect on an observer. In detail, our GPU-based approach efficiently finds visual similarities between images in large datasets across visual domains and identifies various meanings for ambiguous words by exploring similarity in online search results. Further, we investigate the highly subjective aesthetic appeal of images and make use of deep learning to directly learn aesthetic rankings from a broad diversity of user reactions in social online behavior. To gain even deeper insights into the influence of visual appearance on an observer, we explore how simple image processing is capable of actually changing the emotional perception, and derive a simple but effective image filter.
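The all-pairs similarity search underlying this can be sketched on the CPU as follows (the thesis offloads exactly this computation to the GPU over large feature sets; the feature vectors here are placeholders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two image feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(query, gallery, k=2):
    """Return the indices of the k gallery vectors most similar to the query
    (brute force; a GPU implementation parallelizes this over all pairs)."""
    scored = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    return scored[:k]
```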
To identify meaningful connections between written text and visual representations, we employ methods from Natural Language Processing (NLP). Extensive textual processing allows us to create semantically relevant illustrations for simple text elements as well as complete storylines. More precisely, we present an approach that resolves dependencies in textual descriptions to arrange 3D models correctly. Further, we develop a method that finds semantically relevant illustrations for texts of different types based on a novel hierarchical querying algorithm. Finally, we present an optimization-based framework that is capable of generating not only semantically relevant but also visually coherent picture stories in different styles.
ImageNet Large Scale Visual Recognition Challenge
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in
object category classification and detection on hundreds of object categories
and millions of images. The challenge has been run annually from 2010 to
present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances
in object recognition that have been possible as a result. We discuss the
challenges of collecting large-scale ground truth annotation, highlight key
breakthroughs in categorical object recognition, provide a detailed analysis of
the current state of the field of large-scale image classification and object
detection, and compare the state-of-the-art computer vision accuracy with human
accuracy. We conclude with lessons learned in the five years of the challenge,
and propose future directions and improvements.

Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL VOC (per-category comparisons in Table 3, distribution of localization difficulty in Fig 16), a list of queries used for obtaining object detection images (Appendix C), and some additional references.
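The headline ILSVRC classification metric, which the human-versus-machine comparison above is based on, can be sketched as follows (the prediction lists here are placeholders):

```python
def top5_error(predictions, labels):
    """Fraction of images whose true label is absent from the model's five
    highest-scoring guesses -- the standard ILSVRC classification error."""
    misses = sum(1 for guesses, y in zip(predictions, labels)
                 if y not in guesses[:5])
    return misses / len(labels)
```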
Image Annotation and Topic Extraction Using Super-Word Latent Dirichlet Allocation
This research presents a multi-domain solution that uses text and images to iteratively improve automated information extraction. Stage I uses local text surrounding an embedded image to provide clues that help rank-order possible image annotations. These annotations are forwarded to Stage II, where the image annotations from Stage I are used as highly relevant super-words to improve extraction of topics. The model probabilities from the super-words in Stage II are forwarded to Stage III, where they are used to refine the automated image annotation developed in Stage I. All stages demonstrate improvement over existing equivalent algorithms in the literature.
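The first two stages can be sketched roughly as follows; the word-to-annotation affinity table and the count `boost` are hypothetical stand-ins for the paper's learned models, not its actual formulation:

```python
def stage1_rank(local_text, candidates, context_affinity):
    """Stage I sketch: rank candidate image annotations by how strongly the
    words in the surrounding text vote for them (affinity table is assumed)."""
    scores = {c: sum(context_affinity.get((w, c), 0.0) for w in local_text)
              for c in candidates}
    return sorted(candidates, key=scores.get, reverse=True)

def stage2_superword_counts(document, superwords, boost=3):
    """Stage II sketch: replicate super-words so they dominate the topic
    model's word counts -- a crude stand-in for raising their weight in LDA."""
    counts = {}
    for w in document:
        counts[w] = counts.get(w, 0) + (boost if w in superwords else 1)
    return counts
```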
Parallelizing support vector machines for scalable image annotation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.

Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively due to their generalization properties. However, SVM training is a notably computationally intensive process, especially when the training dataset is large.
In this thesis, distributed computing paradigms have been investigated to speed up SVM training by partitioning a large training dataset into small data chunks and processing each chunk in parallel, utilizing the resources of a cluster of computers. A resource-aware parallel SVM algorithm is introduced for large-scale image annotation using a cluster of computers. A genetic-algorithm-based load balancing scheme is designed to optimize the performance of the algorithm in heterogeneous computing environments.
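A minimal sketch of the chunk-and-train idea, using a toy nearest-centroid learner in place of a real SVM solver and threads in place of a cluster of machines:

```python
from concurrent.futures import ThreadPoolExecutor

def train_centroid(chunk):
    """Stand-in for per-chunk SVM training: learn one centroid per class."""
    sums, counts = {}, {}
    for x, y in chunk:
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(models, x):
    """Combine the per-chunk models by nearest-centroid voting."""
    votes = {}
    for m in models:
        best = min(m, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, m[y])))
        votes[best] = votes.get(best, 0) + 1
    return max(votes, key=votes.get)

def parallel_train(data, n_chunks=4):
    """Partition the training set and train one model per chunk in parallel."""
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor() as ex:
        return list(ex.map(train_centroid, chunks))
```

The resource-aware and load-balancing aspects of the actual algorithm (sizing chunks to each node's capacity via a genetic algorithm) are omitted here; the sketch only shows the partition-train-combine skeleton.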
SVM was initially designed for binary classification. However, most classification problems arising in domains such as image annotation usually involve more than two classes. A resource-aware parallel multiclass SVM algorithm for large-scale image annotation using a cluster of computers is therefore introduced.
The combination of classifiers leads to a substantial reduction of classification error in a wide range of applications. Among them, SVM ensembles with bagging are shown to outperform a single SVM in terms of classification accuracy. However, SVM ensemble training is a notably computationally intensive process, especially when the number of replicated samples generated by bootstrapping is large. A distributed SVM ensemble algorithm for image annotation is introduced, which re-samples the training data by bootstrapping and trains an SVM on each sample in parallel using a cluster of computers.
The above algorithms are evaluated in both experimental and simulation environments, showing that the distributed SVM algorithm, the distributed multiclass SVM algorithm, and the distributed SVM ensemble algorithm reduce the training time significantly while maintaining a high level of classification accuracy.
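The bootstrap resampling at the heart of the ensemble algorithm can be sketched as follows (each replica would then be shipped to a cluster node for SVM training):

```python
import random

def bootstrap_samples(data, n_models, seed=0):
    """Draw n_models bootstrap replicas (sampling with replacement, each the
    size of the original set), one per ensemble member, as in bagging."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(len(data))]
            for _ in range(n_models)]
```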
Apprentissage d'espaces sémantiques (Learning Semantic Spaces)
In this work, we focus on learning semantic spaces for multiple domains, but also at the intersection of different domains. The semantic space is where the learned representation lives. This space is called semantic if similar entities from a human perspective have their similarity preserved in this space. We use different machine learning algorithms to learn representations with interesting intrinsic properties.
The first article presents a pipeline including many different unsupervised learning techniques used to win the Unsupervised and Transfer Learning Challenge in 2011.
In the second article, we present a pipeline that takes advantage of the structure of the data (177 object detectors at different positions and scales) for a scene classification problem. Combining this structure with unsupervised learning allows us to reduce the dimensionality by 97% while improving scene recognition accuracy by +5% to +11% depending on the dataset.
The third article focuses on the space structure learned by deep representations. Several hypotheses are presented and tested experimentally, showing that the learned space has better mixing properties: sampling from deeper levels of the representation space explores more of the different classes.
In the fourth article, we tackle a semantic parsing problem with several Recurrent Neural Network architectures taking as input context windows of word embeddings.
In the fifth article, an investigation on learning a single semantic space at the intersection of words and images is presented. We propose a way to perform "augmented search", where a search for an image containing an object also returns images of the object's parts; for example, a search for "car" would also return images of windshields, trunks, and wheels in addition to the initial results.
Closing the gap in WSD: supervised results with unsupervised methods
Word-Sense Disambiguation (WSD) holds promise for many NLP applications requiring
broad-coverage language understanding, such as summarization (Barzilay and
Elhadad, 1997) and question answering (Ramakrishnan et al., 2003). Recent studies
have also shown that WSD can benefit machine translation (Vickrey et al., 2005) and
information retrieval (Stokoe, 2005). Much work has focused on the computational
treatment of sense ambiguity, primarily using data-driven methods. The most accurate
WSD systems to date are supervised and rely on the availability of sense-labeled
training data. This restriction poses a significant barrier to widespread use of WSD
in practice, since such data is extremely expensive to acquire for new languages and
domains.
Unsupervised WSD holds the key to enabling such applications, as it does not require
sense-labeled data. However, unsupervised methods fall far behind supervised ones
in terms of accuracy and ease of use. In this thesis we explore the reasons for this,
and present solutions to remedy this situation. We hypothesize that one of the main
problems with unsupervised WSD is its lack of a standard formulation and general
purpose tools common to supervised methods. As a first step, we examine existing approaches
to unsupervised WSD, with the aim of detecting independent principles that
can be utilized in a general framework. We investigate ways of leveraging the diversity
of existing methods, using ensembles, a common tool in the supervised learning
framework. This approach allows us to achieve accuracy beyond that of the individual
methods, without need for extensive modification of the underlying systems.
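A minimal sketch of combining several unsupervised WSD systems by voting, with the corpus-predominant sense as a tie-breaker (the sense labels and the tie-breaking policy here are illustrative assumptions, not the thesis's exact ensemble scheme):

```python
from collections import Counter

def ensemble_wsd(votes, predominant_sense):
    """Combine sense predictions from several unsupervised WSD systems by
    majority vote, backing off to the corpus-predominant sense on ties."""
    counts = Counter(votes)
    (best, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:   # tie between the top-voted senses
        return predominant_sense
    return best
```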
Our examination of existing unsupervised approaches highlights the importance of
using the predominant sense in case of uncertainty, and the effectiveness of statistical
similarity methods as a tool for WSD. However, it also serves to emphasize the need for
a way to merge and combine learning elements, and the potential of a supervised-style
approach to the problem. Relying on existing methods does not take full advantage of
the insights gained from the supervised framework.
We therefore present an unsupervised WSD system which circumvents the question
of actual disambiguation method, which is the main source of discrepancy in unsupervised
WSD, and deals directly with the data. Our method uses statistical and semantic
similarity measures to produce labeled training data in a completely unsupervised fashion.
This allows the training and use of any standard supervised classifier for the actual
disambiguation. Classifiers trained with our method significantly outperform those using
other methods of data generation, and represent a big step in bridging the accuracy
gap between supervised and unsupervised methods.
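The pseudo-labeling idea can be sketched as follows; the gloss-overlap score is a crude stand-in for the statistical and semantic similarity measures the thesis actually uses, and the sense inventory here is hypothetical:

```python
def pseudo_label(contexts, sense_glosses):
    """Label each occurrence with the sense whose gloss overlaps its context
    words most, producing training data with no manual annotation. Any
    standard supervised classifier can then be trained on the output."""
    labeled = []
    for ctx in contexts:
        ctx_set = set(ctx)
        best = max(sense_glosses,
                   key=lambda s: len(ctx_set & set(sense_glosses[s])))
        labeled.append((ctx, best))
    return labeled
```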
Finally, we address a major drawback of classical unsupervised systems: their reliance
on a fixed sense inventory and lexical resources. This dependence represents
a substantial setback for unsupervised methods in cases where such resources are unavailable.
Unfortunately, these are exactly the areas in which unsupervised methods are
most needed. Unsupervised sense-discrimination, which does not share those restrictions,
presents a promising solution to the problem. We therefore develop an unsupervised
sense discrimination system. We base our system on a well-studied probabilistic
generative model, Latent Dirichlet Allocation (Blei et al., 2003), which has many of
the advantages of supervised frameworks. The model's probabilistic nature lends itself
to easy combination and extension, and its generative aspect is well suited to linguistic
tasks. Our model achieves state-of-the-art performance on the unsupervised sense
induction task, while remaining independent of any fixed sense inventory, and thus
represents a fully unsupervised, general-purpose WSD tool.
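A minimal collapsed Gibbs sampler for plain LDA illustrates the machinery; the thesis's model extends LDA for sense induction, so this sketch (with assumed hyperparameters) only shows the base idea of treating each occurrence's context window as a document and its dominant topic as an induced sense:

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA. Each 'document' is the context
    window of one occurrence of the ambiguous word; the dominant topic of a
    document serves as its induced sense."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                       # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the token, resample its topic, re-add it
                ndk[d][k] -= 1; nkw[k][wid[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wid[w]] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    return [max(range(n_topics), key=lambda t: ndk[d][t])
            for d in range(len(docs))]
```

On well-separated context windows the sampler tends to assign the two clusters of occurrences to distinct topics, i.e. distinct induced senses, without reference to any fixed sense inventory.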