318 research outputs found
A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata
Images represent a commonly used form of visual communication among people.
Nevertheless, image classification may be a challenging task when dealing with
unclear or non-common images needing more context to be correctly annotated.
Metadata accompanying images on social-media represent an ideal source of
additional information for retrieving proper neighborhoods easing image
annotation task. To this end, we blend visual features extracted from neighbors
and their metadata to jointly leverage context and visual cues. Our models use
multiple semantic embeddings to achieve the dual objective of being robust to
vocabulary changes between train and test sets and decoupling the architecture
from the low-level metadata representation. Convolutional and recurrent neural
networks (CNNs-RNNs) are jointly adopted to infer similarity among neighbors
and query images. We perform comprehensive experiments on the NUS-WIDE dataset
showing that our models outperform state-of-the-art architectures based on
images and metadata, and decrease both sensory and semantic gaps to better
annotate images
Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking
Public speaking is an important aspect of human communication and
interaction. The majority of computational work on public speaking concentrates
on analyzing the spoken content, and the verbal behavior of the speakers. While
the success of public speaking largely depends on the content of the talk, and
the verbal behavior, non-verbal (visual) cues, such as gestures and physical
appearance also play a significant role. This paper investigates the importance
of visual cues by estimating their contribution towards predicting the
popularity of a public lecture. For this purpose, we constructed a large
database of more than TED talk videos. As a measure of popularity of the
TED talks, we leverage the corresponding (online) viewers' ratings from
YouTube. Visual cues related to facial and physical appearance, facial
expressions, and pose variations are extracted from the video frames using
convolutional neural network (CNN) models. Thereafter, an attention-based long
short-term memory (LSTM) network is proposed to predict the video popularity
from the sequence of visual features. The proposed network achieves
state-of-the-art prediction accuracy indicating that visual cues alone contain
highly predictive information about the popularity of a talk. Furthermore, our
network learns a human-like attention mechanism, which is particularly useful
for interpretability, i.e. how attention varies with time, and across different
visual cues by indicating their relative importance
Heritage image annotation via collective knowledge
© 2019 Elsevier Ltd The automatic image annotation can provide semantic illustrations to understand image contents, and builds a foundation to develop algorithms that can search images within a large database. However, most current methods focus on solving the annotation problem by modeling the image visual content and tag semantic information, which overlooks the additional information, such as scene descriptions and locations. Moreover, the majority of current annotation datasets are visually consistent and only annotated by common visual objects and attributes, which makes the classic methods vulnerable to handle the more diverse image annotation. To address above issues, we propose to annotate images via collective knowledge, that is, we uncover relationships between the image and its neighbors by measuring similarities among metadata and conduct the metric learning to obtain the representations of image contents, we also generate semantic representations for images given collective semantic information from their neighbors. Two representations from different paradigms are embedded together to train an annotation model. We ground our model on the heritage image collection we collected from the library online open data. Annotations on the heritage image collection are not limited to common visual objects, and are highly relevant to historical events, and the diversity of the heritage image content is much larger than the current datasets, which makes it more suitable for this task. Comprehensive experimental results on the benchmark dataset indicate that the proposed model achieves the best performance compared to baselines and state-of-the-art methods
First impressions: A survey on vision-based apparent personality trait analysis
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. From the past few years, it also became an attractive research area in visual computing. From the computational point of view, by far speech and text have been the most considered cues of information for analyzing personality. However, recently there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use these information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of methods could have in society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting edge works on the subject, discussing and comparing their distinctive features and limitations. Future venues of research in the field are identified and discussed. Furthermore, aspects on the subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push the research on the field are reviewed.Peer ReviewedPostprint (author's final draft
Confluence of Vision and Natural Language Processing for Cross-media Semantic Relations Extraction
In this dissertation, we focus on extracting and understanding semantically meaningful relationships between data items of various modalities; especially relations between images and natural language. We explore the ideas and techniques to integrate such cross-media semantic relations for machine understanding of large heterogeneous datasets, made available through the expansion of the World Wide Web. The datasets collected from social media websites, news media outlets and blogging platforms usually contain multiple modalities of data. Intelligent systems are needed to automatically make sense out of these datasets and present them in such a way that humans can find the relevant pieces of information or get a summary of the available material. Such systems have to process multiple modalities of data such as images, text, linguistic features, and structured data in reference to each other. For example, image and video search and retrieval engines are required to understand the relations between visual and textual data so that they can provide relevant answers in the form of images and videos to the users\u27 queries presented in the form of text. We emphasize the automatic extraction of semantic topics or concepts from the data available in any form such as images, free-flowing text or metadata. These semantic concepts/topics become the basis of semantic relations across heterogeneous data types, e.g., visual and textual data. A classic problem involving image-text relations is the automatic generation of textual descriptions of images. This problem is the main focus of our work. In many cases, large amount of text is associated with images. Deep exploration of linguistic features of such text is required to fully utilize the semantic information encoded in it. A news dataset involving images and news articles is an example of this scenario. We devise frameworks for automatic news image description generation based on the semantic relations of images, as well as semantic understanding of linguistic features of the news articles
- …