Jointly Modeling Embedding and Translation to Bridge Video and Language
Automatically describing video content with natural language is a fundamental
challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence
dynamics, have attracted increasing attention for visual interpretation. However,
most existing approaches generate each word locally from the previous words and
the visual content, so the relationship between the semantics of the entire
sentence and the visual content is not holistically exploited. As a result, the
generated sentences may be contextually correct while their semantics (e.g.,
subjects, verbs, or objects) are inaccurate.
This paper presents a novel unified framework, named Long Short-Term Memory
with visual-semantic Embedding (LSTM-E), which jointly learns an LSTM and a
visual-semantic embedding. The former locally maximizes the probability of
generating the next word given the previous words and the visual content, while
the latter creates a visual-semantic embedding space that enforces the
relationship between the semantics of the entire sentence and the visual
content. The proposed LSTM-E consists of three components: a 2-D and/or 3-D
deep convolutional neural network for learning a powerful video representation,
a deep RNN for generating sentences, and a joint embedding model for exploring
the relationships between visual content and sentence semantics. Experiments on
the YouTube2Text dataset show that LSTM-E achieves the best performance
reported to date for generating natural sentences: 45.3% BLEU@4 and 31.0%
METEOR. We also demonstrate that LSTM-E is superior to several
state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
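The two-term objective sketched in this abstract (a local word-generation likelihood plus a whole-sentence embedding distance) can be illustrated as follows. The helper names, the squared-distance relevance term, and the trade-off weight `lam` are illustrative assumptions, not the paper's exact notation:

```python
import numpy as np

def relevance_loss(video_emb, sent_emb):
    # Squared Euclidean distance between the video and sentence
    # embeddings in the shared space (the "relevance" term).
    return float(np.sum((video_emb - sent_emb) ** 2))

def coherence_loss(word_probs):
    # Negative log-likelihood of the ground-truth words under the
    # LSTM's predictive distributions (the "coherence" term).
    return float(-np.sum(np.log(word_probs)))

def lstm_e_objective(video_emb, sent_emb, word_probs, lam=0.5):
    # Weighted sum of the two terms; lam trades off whole-sentence
    # embedding relevance against local generation coherence.
    return (lam * relevance_loss(video_emb, sent_emb)
            + (1.0 - lam) * coherence_loss(word_probs))
```

Minimizing the combined loss pushes the generated sentence to be both locally fluent and globally consistent with the video's embedding.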
Image Parsing with a Wide Range of Classes and Scene-Level Context
This paper presents a nonparametric scene parsing approach that improves the
overall accuracy, as well as the coverage of foreground classes in scene
images. We first improve the label likelihood estimates at superpixels by
merging likelihood scores from different probabilistic classifiers. This boosts
the classification performance and enriches the representation of
less-represented classes. Our second contribution consists of incorporating
semantic context in the parsing process through global label costs. Our method
does not rely on image retrieval sets but rather assigns a global likelihood
estimate to each label, which is plugged into the overall energy function. We
evaluate our system on two large-scale datasets, SIFTflow and LMSun. We achieve
state-of-the-art performance on the SIFTflow dataset and near-record results on
LMSun. Comment: Published at the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015.
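The global label costs described above can be sketched as an extra term in the labeling energy: each label that appears anywhere in the image pays a scene-level cost on top of the per-superpixel data term. The function and variable names here are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def parsing_energy(unary, labeling, global_cost):
    # unary: (n_superpixels, n_labels) negative log-likelihoods,
    #   merged from the different probabilistic classifiers.
    # labeling: chosen label index for each superpixel.
    # global_cost: per-label cost, paid once for every label that
    #   appears in the labeling (the scene-level context term).
    data_term = sum(unary[i, l] for i, l in enumerate(labeling))
    label_term = sum(global_cost[l] for l in set(labeling))
    return data_term + label_term
```

Because the label term is paid per distinct label rather than per superpixel, it discourages semantically implausible labels from entering the scene at all.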
Context-Aware Embeddings for Automatic Art Analysis
Automatic art analysis aims to classify and retrieve artistic representations
from a collection of images by using computer vision and machine learning
techniques. In this work, we propose to enhance visual representations from
neural networks with contextual artistic information. Whereas visual
representations are able to capture information about the content and the style
of an artwork, our proposed context-aware embeddings additionally encode
relationships between different artistic attributes, such as author, school, or
historical period. We design two approaches for using context in automatic art
analysis. In the first, contextual data is obtained through a multi-task
learning model in which several attributes are trained jointly to find visual
relationships between elements. In the second, context is obtained through an
art-specific knowledge graph that encodes relationships between artistic
attributes. An exhaustive evaluation of both models on several art analysis
problems, such as author identification, type classification, and cross-modal
retrieval, shows that performance improves by up to 7.3% in art classification
and 37.24% in retrieval when context-aware embeddings are used.
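One minimal way to realize such a context-aware embedding is to fuse the visual vector with a context vector (from the multi-task model or the knowledge-graph encoder) and renormalize so retrieval distances stay comparable. This sketch is an assumption about the fusion step, not the paper's exact architecture:

```python
import numpy as np

def context_aware_embedding(visual_emb, context_emb):
    # Concatenate the CNN visual embedding with a context vector
    # encoding attributes such as author, school, or period, then
    # L2-normalize the joint vector for cosine-style retrieval.
    joint = np.concatenate([visual_emb, context_emb])
    return joint / np.linalg.norm(joint)
```

The normalization keeps every embedded artwork on the unit sphere, so nearest-neighbor retrieval is unaffected by the raw magnitudes of the two parts.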
Semantic Concept Co-Occurrence Patterns for Image Annotation and Retrieval.
Describing visual image contents by semantic concepts is an effective and straightforward way to facilitate various high-level applications. Inferring semantic concepts from low-level pictorial feature analysis is challenging due to the semantic gap problem, while manually labeling concepts is impractical given the large number of images in both online and offline collections. In this paper, we present a novel approach that automatically generates intermediate image descriptors by exploiting concept co-occurrence patterns in the pre-labeled training set, making it possible to depict complex scene images semantically. Our work is motivated by the fact that multiple concepts that frequently co-occur across images form patterns which can provide contextual cues for individual concept inference. We discover the co-occurrence patterns as hierarchical communities by graph modularity maximization in a network whose nodes and edges represent concepts and co-occurrence relationships, respectively. A random walk process operating on the inferred concept probabilities with the discovered co-occurrence patterns is applied to acquire the refined concept signature representation. Through experiments in automatic image annotation and semantic image retrieval on several challenging datasets, we demonstrate the effectiveness of the proposed concept co-occurrence patterns, as well as the concept signature representation, in comparison with state-of-the-art approaches.
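The random-walk refinement step described above can be sketched as iterative propagation of the initial concept probabilities over a column-normalized co-occurrence matrix. The restart weight `alpha` and the fixed iteration count are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def refine_concept_scores(p0, cooccurrence, alpha=0.85, iters=50):
    # p0: initial per-concept probabilities from the classifiers.
    # cooccurrence: symmetric concept co-occurrence counts; each
    #   column is normalized so the walk propagates probability
    #   mass along frequently co-occurring concepts.
    W = cooccurrence / cooccurrence.sum(axis=0, keepdims=True)
    p = p0.copy()
    for _ in range(iters):
        # Random walk with restart: mix propagated scores with
        # the original evidence so the walk stays anchored to p0.
        p = alpha * W @ p + (1.0 - alpha) * p0
    return p
```

At the fixed point, each concept's score blends its own classifier evidence with support propagated from concepts it frequently co-occurs with.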