Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation
Image annotation aims to annotate a given image with a variable number of
class labels corresponding to diverse visual concepts. In this paper, we
address two main issues in large-scale image annotation: 1) how to learn a rich
feature representation suitable for predicting a diverse set of visual concepts
ranging from objects and scenes to abstract concepts; 2) how to annotate an image
with the optimal number of class labels. To address the first issue, we propose
a novel multi-scale deep model for extracting rich and discriminative features
capable of representing a wide range of visual concepts. Specifically, we propose a two-branch deep neural network architecture comprising a very deep main network branch and a companion feature fusion network branch designed
for fusing the multi-scale features computed from the main branch. The deep
model is also made multi-modal by taking noisy user-provided tags as model
input to complement the image input. To tackle the second issue, we introduce a label quantity prediction task, auxiliary to the main label prediction task, to explicitly estimate the optimal number of labels for a given
image. Extensive experiments are carried out on two large-scale image
annotation benchmark datasets and the results show that our method
significantly outperforms the state of the art.
Comment: Submitted to IEEE TI
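The two-branch design described in the abstract lends itself to a compact skeleton. Below is a minimal PyTorch sketch of the idea: intermediate feature maps from a deep main branch feed a companion fusion branch, noisy user tags enter as a second modality, and an auxiliary head estimates the label count. The concrete choices here (a ResNet-50 trunk, a bag-of-tags embedding, the layer widths) are illustrative assumptions, not the authors' exact design.

```python
# Sketch: two-branch multi-scale model with a tag input and a label-count head.
# Trunk, widths, and tag encoding are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiScaleAnnotator(nn.Module):
    def __init__(self, num_labels, tag_vocab=10000, dim=256):
        super().__init__()
        trunk = models.resnet50(weights=None)
        # Main branch exposed as stages so multi-scale features can be tapped.
        self.stem = nn.Sequential(trunk.conv1, trunk.bn1, trunk.relu, trunk.maxpool)
        self.stages = nn.ModuleList([trunk.layer1, trunk.layer2, trunk.layer3, trunk.layer4])
        # Companion branch: project each stage to a common width before fusing.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Noisy user tags enter as a mean-pooled bag-of-tags embedding (assumed).
        self.tag_embed = nn.EmbeddingBag(tag_vocab, dim, mode="mean")
        fused_dim = dim * len(self.stages) + dim
        self.label_head = nn.Linear(fused_dim, num_labels)  # main task: label scores
        self.count_head = nn.Linear(fused_dim, 1)           # auxiliary: label quantity

    def forward(self, image, tag_ids):
        x = self.stem(image)
        feats = []
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(self.pool(proj(x)).flatten(1))  # one vector per scale
        fused = torch.cat(feats + [self.tag_embed(tag_ids)], dim=1)
        return self.label_head(fused), self.count_head(fused)
```

At inference one could rank the label scores and keep the top round(count) of them, which is one plausible way to use the predicted label quantity.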
Fusing Audio, Textual and Visual Features for Sentiment Analysis of News Videos
This paper presents a novel approach to perform sentiment analysis of news
videos, based on the fusion of audio, textual and visual clues extracted from
their contents. The proposed approach aims to contribute to the semiodiscoursive study of the construction of the ethos (identity) of this media universe, which has become a central part of the modern-day lives of
millions of people. To achieve this goal, we apply state-of-the-art
computational methods for (1) automatic emotion recognition from facial
expressions, (2) extraction of modulations in the participants' speeches and
(3) sentiment analysis of the closed captions associated with the videos of interest. More specifically, we compute features such as visual intensities
of recognized emotions, field sizes of participants, voicing probability, sound
loudness, speech fundamental frequencies and the sentiment scores (polarities)
from text sentences in the closed caption. Experimental results with a dataset
containing 520 annotated news videos from three popular Brazilian TV newscasts and one popular American newscast show that our approach achieves an accuracy of up to 84% in the sentiment (tension level) classification task, demonstrating its high potential to be used by media analysts in several applications, especially in the journalistic domain.
Comment: 5 pages, 1 figure, International AAAI Conference on Web and Social Media
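Concretely, the fusion step can be as simple as concatenating the per-video clues named in the abstract into one vector before classification. The sketch below assumes the audio, visual, and textual features are precomputed by upstream tools; the random-forest classifier and the exact feature layout are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: concatenate precomputed audio/visual/textual clues and classify.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fuse_features(emotions, voicing_prob, loudness, f0_stats, caption_polarity):
    """Concatenate visual, audio, and textual clues into one feature vector."""
    return np.concatenate([
        emotions,                 # e.g. intensities of recognized facial emotions
        [voicing_prob, loudness], # audio modulation clues
        f0_stats,                 # e.g. mean/std of speech fundamental frequency
        [caption_polarity],       # sentiment score of the closed-caption text
    ])

# X: one fused vector per annotated news segment; y: tension-level labels.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
# clf.fit(X_train, y_train); predictions = clf.predict(X_test)
```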
Detecting Sarcasm in Multimodal Social Platforms
Sarcasm is a peculiar form of sentiment expression, where the surface
sentiment differs from the implied sentiment. Sarcasm detection on social media platforms has in the past been applied mainly to textual utterances, where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow users to create multimodal messages in which audiovisual content is integrated with the text, so that analyzing any single modality in isolation gives only a partial picture. In our work, we first study the relationship between the textual and
visual aspects in multimodal posts from three major social media platforms,
i.e., Instagram, Tumblr and Twitter, and we run a crowdsourcing task to
quantify the extent to which images are perceived as necessary by human
annotators. Moreover, we propose two different computational frameworks to
detect sarcasm that integrate the textual and visual modalities. The first
approach exploits visual semantics learned on an external dataset and concatenates these semantic features with state-of-the-art textual features. The
second method adapts a visual neural network initialized with parameters
trained on ImageNet to multimodal sarcastic posts. Results show the positive
effect of combining modalities for the detection of sarcasm across platforms
and methods.
Comment: 10 pages, 3 figures, final version published in the Proceedings of ACM Multimedia 201
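To make the first framework concrete, here is a minimal sketch of feature-level fusion: an ImageNet-pretrained CNN supplies visual-semantics features, which are concatenated with simple textual features before a standard classifier. The specific extractors shown (ResNet-18 embeddings, TF-IDF n-grams) are stand-in assumptions, not the authors' exact components.

```python
# Sketch: concatenate CNN visual features with textual features for sarcasm detection.
import numpy as np
import torch
import torchvision.models as models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

cnn = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
cnn.fc = torch.nn.Identity()   # drop the ImageNet head, keep the 512-d embedding
cnn.eval()

def visual_features(image_batch):           # image_batch: (N, 3, 224, 224), normalized
    with torch.no_grad():
        return cnn(image_batch).numpy()     # (N, 512) visual-semantics features

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
# text_feats = vectorizer.fit_transform(post_texts).toarray()
# X = np.hstack([visual_features(post_images), text_feats])
# LogisticRegression(max_iter=1000).fit(X, sarcasm_labels)
```

The paper's second method, fine-tuning an ImageNet-initialized network on multimodal sarcastic posts, would instead keep the CNN trainable and attach the classifier end-to-end.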
Multimodal Classification of Urban Micro-Events
In this paper we seek methods to effectively detect urban micro-events. Urban micro-events are events that occur in cities, have limited geographical coverage, and typically affect only a small group of citizens. Because of their small scale, they are difficult to identify in most data sources. However, by using
citizen sensing to gather data, detecting them becomes feasible. The data
gathered by citizen sensing is often multimodal and, as a consequence, the
information required to detect urban micro-events is distributed over multiple
modalities. This makes it essential to have a classifier capable of combining
them. In this paper we explore several methods of creating such a classifier,
including early fusion, late fusion, hybrid fusion, and representation learning using multimodal graphs. We evaluate performance on a real-world dataset obtained
from a live citizen reporting system. We show that a multimodal approach yields
higher performance than unimodal alternatives. Furthermore, we demonstrate that
our hybrid combination of early and late fusion with multimodal embeddings
performs best in the classification of urban micro-events.
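For reference, the early- and late-fusion baselines compared in the paper reduce to a few lines each. The sketch below assumes per-modality feature matrices (e.g., report text and attached photos) are already extracted; the logistic-regression models and simple probability averaging are illustrative choices, not the paper's exact configuration.

```python
# Sketch: early vs. late fusion over two modalities.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Early fusion: concatenate modality features, train a single classifier.
def early_fusion(text_feats, image_feats, labels):
    X = np.hstack([text_feats, image_feats])
    return LogisticRegression(max_iter=1000).fit(X, labels)

# Late fusion: one classifier per modality, average their class probabilities
# (both classifiers must be trained on the same label set, in the same order).
def late_fusion_predict(clf_text, clf_image, text_feats, image_feats):
    probs = clf_text.predict_proba(text_feats) + clf_image.predict_proba(image_feats)
    return (probs / 2.0).argmax(axis=1)
```

A hybrid scheme, as in the paper, combines both strategies, here additionally mixing in multimodal embeddings.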
On the Evolution of (Hateful) Memes by Means of Multimodal Contrastive Learning
The dissemination of hateful memes online has adverse effects on social media
platforms and the real world. Detecting hateful memes is challenging, in part because of the evolutionary nature of memes: new hateful memes can emerge by fusing hateful connotations with other cultural ideas or symbols. In
this paper, we propose a framework that leverages multimodal contrastive
learning models, in particular OpenAI's CLIP, to identify targets of hateful
content and systematically investigate the evolution of hateful memes. We find
that semantic regularities exist in CLIP-generated embeddings that describe
semantic relationships within the same modality (images) or across modalities
(images and text). Leveraging this property, we study how hateful memes are
created by combining visual elements from multiple images or fusing textual
information with a hateful image. We demonstrate the capabilities of our
framework for analyzing the evolution of hateful memes by focusing on
antisemitic memes, particularly the Happy Merchant meme. Using our framework on
a dataset extracted from 4chan, we find 3.3K variants of the Happy Merchant
meme, with some linked to specific countries, persons, or organizations. We
envision that our framework can be used to aid human moderators by flagging new
variants of hateful memes so that moderators can manually verify them and
mitigate the problem of hateful content online.
Comment: To Appear in the 44th IEEE Symposium on Security and Privacy, May 22-25, 202
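The nearest-neighbor intuition behind such a framework can be sketched in a few lines: embed meme images with CLIP and surface variants of a seed meme as its closest neighbors in embedding space. This sketch uses OpenAI's open-source clip package; the file names and the similarity threshold are hypothetical, and the paper's full pipeline does considerably more than this.

```python
# Sketch: find candidate variants of a seed meme via CLIP image embeddings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_images(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = model.encode_image(batch.to(device))
    return feats / feats.norm(dim=-1, keepdim=True)   # unit-normalize for cosine sim

# seed = embed_images(["seed_meme.png"])              # hypothetical seed image
# corpus = embed_images(corpus_paths)                 # candidate meme images
# sims = (corpus @ seed.T).squeeze(1)                 # cosine similarities
# variants = [p for p, s in zip(corpus_paths, sims.tolist()) if s > 0.8]
```

Because CLIP places images and text in the same space, the same trick extends across modalities, which is what lets the framework relate a hateful image to the textual ideas fused into its variants.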
Exploiting multimedia in creating and analysing multimedia Web archives
The data contained on the web and the social web are inherently multimedia and consist of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU-funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.