Event Specific Multimodal Pattern Mining with Image-Caption Pairs
In this paper we describe a novel framework and algorithms for discovering
image patch patterns from a large corpus of weakly supervised image-caption
pairs generated from news events. Whereas current pattern mining techniques attempt to find patterns that are representative and discriminative, we stipulate that our discovered patterns must also be recognizable by humans, preferably with meaningful names. We propose a new multimodal pattern mining approach that
leverages the descriptive captions often accompanying news images to learn
semantically meaningful image patch patterns. The multimodal patterns are then
named using words mined from the associated image captions for each pattern. A
novel evaluation framework is provided that demonstrates our patterns are 26.2%
more semantically meaningful than those discovered by the state-of-the-art vision-only pipeline, and that we can provide tags for the discovered image patches with 54.5% accuracy with no direct supervision. Our methods also
discover named patterns beyond those covered by the existing image datasets
like ImageNet. To the best of our knowledge, this is the first algorithm developed to automatically mine image patch patterns that have strong semantic meaning specific to high-level news events, and then to evaluate these patterns on those criteria.
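To make the naming step concrete, the following is a minimal sketch, not the paper's exact algorithm, of how a mined patch pattern might be named from the captions of the images it occurs in: caption words are scored with a tf-idf-style weight and the top-scoring word is taken as the name. The inputs (`pattern_images`, `captions`, `all_images`) are hypothetical stand-ins.

```python
import math
from collections import Counter

def name_pattern(pattern_images, captions, all_images):
    """Score caption words with a tf-idf-like weight; return the top word."""
    # Term frequency over captions of images that contain the pattern.
    tf = Counter(w for img in pattern_images for w in captions[img].lower().split())
    # Document frequency of each word over the whole corpus.
    df = Counter(w for img in all_images for w in set(captions[img].lower().split()))
    n = len(all_images)
    scores = {w: c * math.log(n / (1 + df[w])) for w, c in tf.items()}
    return max(scores, key=scores.get)

captions = {1: "protesters wave flags in the square",
            2: "flags at the protest rally",
            3: "the stock market falls",
            4: "heavy rain floods the city",
            5: "the mayor speaks downtown"}
print(name_pattern([1, 2], captions, [1, 2, 3, 4, 5]))  # -> "flags"
```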
Tensor Fusion Network for Multimodal Sentiment Analysis
Multimodal sentiment analysis is an increasingly popular research area, which
extends the conventional language-based definition of sentiment analysis to a
multimodal setup where other relevant modalities accompany language. In this
paper, we pose the problem of multimodal sentiment analysis as modeling
intra-modality and inter-modality dynamics. We introduce a novel model, termed
Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed
approach is tailored for the volatile nature of spoken language in online
videos as well as accompanying gestures and voice. In the experiments, our
model outperforms state-of-the-art approaches for both multimodal and unimodal
sentiment analysis.
Comment: Accepted as a full paper at EMNLP 2017
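For concreteness, the fusion step of the paper's Tensor Fusion layer takes the three-way outer product of the per-modality embeddings, each augmented with a constant 1 so that unimodal and bimodal terms survive alongside the trimodal interactions. Below is a minimal sketch of that computation only; the embedding subnetworks are omitted and all dimensions are arbitrary.

```python
import numpy as np

def tensor_fusion(h_lang, h_visual, h_acoustic):
    # Append a constant 1 to each embedding so the outer product keeps
    # unimodal and bimodal terms in addition to trimodal interactions.
    zl = np.concatenate([h_lang, [1.0]])
    zv = np.concatenate([h_visual, [1.0]])
    za = np.concatenate([h_acoustic, [1.0]])
    # Three-way outer product: shape (dl+1, dv+1, da+1).
    z = np.einsum('i,j,k->ijk', zl, zv, za)
    return z.ravel()  # flattened fusion tensor for a downstream classifier

fused = tensor_fusion(np.random.randn(128), np.random.randn(32), np.random.randn(32))
print(fused.shape)  # (129 * 33 * 33,) = (140481,)
```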
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
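The two families can be contrasted in a minimal sketch: real-valued methods rank by similarity in a learned common space, while binary methods binarize the same representations and rank by Hamming distance. The random projection matrices `W_img` and `W_txt` below are stand-ins for whatever mappings a concrete method would learn.

```python
import numpy as np

def to_common_space(x, W):
    z = W @ x
    return z / np.linalg.norm(z)           # real-valued common representation

def retrieve(query_vec, gallery):           # cosine-similarity ranking
    return np.argsort(-(gallery @ query_vec))

def to_hash(z):
    return z > 0                            # binary codes in Hamming space

def hamming_retrieve(q_bits, gallery_bits):
    return np.argsort((gallery_bits != q_bits).sum(axis=1))  # Hamming distance

rng = np.random.default_rng(0)
W_img, W_txt = rng.normal(size=(64, 512)), rng.normal(size=(64, 300))
text_query = to_common_space(rng.normal(size=300), W_txt)
images = np.stack([to_common_space(rng.normal(size=512), W_img) for _ in range(100)])
print(retrieve(text_query, images)[:5])                            # real-valued
print(hamming_retrieve(to_hash(text_query), to_hash(images))[:5])  # binary
```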
A Survey on Food Computing
Food is essential to human life and fundamental to the human experience. Food-related studies can support a wide range of applications and services, such as guiding human behavior, improving human health, and understanding culinary culture. With the rapid development of social
networks, mobile networks, and Internet of Things (IoT), people commonly
upload, share, and record food images, recipes, cooking videos, and food
diaries, leading to large-scale food data. Large-scale food data offers rich
knowledge about food and can help tackle many central issues of human society.
Therefore, it is time to group several disparate food-related issues under the umbrella of food computing. Food computing acquires and analyzes heterogeneous food data from diverse sources for the perception, recognition, retrieval, recommendation, and monitoring of food. In food computing, computational approaches are applied to address food-related issues in medicine, biology, gastronomy, and agronomy. Both
large-scale food data and recent breakthroughs in computer science are
transforming the way we analyze food data. Accordingly, a vast amount of work has been conducted in the food area, targeting different food-oriented tasks and applications. However, very few systematic reviews shape this area well, provide a comprehensive and in-depth summary of current efforts, or detail the open problems in this area. In this paper, we formalize food
computing and present such a comprehensive overview of various emerging
concepts, methods, and tasks. We summarize key challenges and future directions
ahead for food computing. This is the first comprehensive survey that targets
the study of computing technology for the food area and also offers a
collection of research studies and technologies to benefit researchers and
practitioners working in different food-related fields.
Comment: Accepted by ACM Computing Surveys
Deep Learning for Sentiment Analysis: A Survey
Deep learning has emerged as a powerful machine learning technique that
learns multiple layers of representations or features of the data and produces
state-of-the-art prediction results. Along with its success in many other application domains, deep learning has also been widely applied to sentiment analysis in recent years. This paper first gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.
Comment: 34 pages, 9 figures, 2 tables
Multimodal matching using a Hybrid Convolutional Neural Network
In this work, we propose a novel Convolutional Neural Network (CNN)
architecture for the joint detection and matching of feature points in images
acquired by different sensors using a single forward pass. The resulting
feature detector is tightly coupled with the feature descriptor, in contrast to classical approaches (SIFT, etc.), where the detection phase precedes and is separate from descriptor computation. Our approach utilizes two CNN
subnetworks, the first being a Siamese CNN and the second, consisting of dual
non-weight-sharing CNNs. This allows simultaneous processing and fusion of the
joint and disjoint cues in the multimodal image patches. The proposed approach
is experimentally shown to outperform contemporary state-of-the-art schemes
when applied to multiple datasets of multimodal images, reducing matching errors by 50%-70% compared with previous works. It is also shown to provide repeatable feature point detections across multi-sensor images, outperforming state-of-the-art detectors such as SIFT and ORB. To the best of our knowledge, it is the first unified approach for the detection and matching of such images.
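The hybrid structure can be sketched schematically as follows; this is our simplification, not the authors' exact architecture. A weight-sharing Siamese branch processes the joint cues of a patch pair, two non-weight-sharing branches process the sensor-specific cues, and a small head fuses both into a single matching score.

```python
import torch
import torch.nn as nn

def conv_branch(out_dim=128):
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )

class HybridMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = conv_branch()        # Siamese: same weights for both inputs
        self.branch_a = conv_branch()      # sensor-A-specific weights
        self.branch_b = conv_branch()      # sensor-B-specific weights
        self.head = nn.Linear(4 * 128, 1)  # fuses joint and disjoint cues

    def forward(self, patch_a, patch_b):
        feats = torch.cat([self.shared(patch_a), self.shared(patch_b),
                           self.branch_a(patch_a), self.branch_b(patch_b)], dim=1)
        return self.head(feats)            # matching logit for the patch pair

score = HybridMatcher()(torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32))
print(score.shape)  # torch.Size([8, 1])
```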
Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks
There has been an explosion of multimodal content generated on social media
networks in the last few years, which has necessitated a deeper understanding
of social media content and user behavior. We present a novel
content-independent content-user-reaction model for social multimedia content
analysis. Compared to prior works that generally tackle semantic content
understanding and user behavior modeling in isolation, we propose a generalized
solution to these problems within a unified framework. We embed users, images
and text drawn from open social media in a common multimodal geometric space,
using a novel loss function designed to cope with distant and disparate
modalities, and thereby enable seamless three-way retrieval. Our model not only
outperforms unimodal embedding based methods on cross-modal retrieval tasks but
also shows improvements stemming from jointly solving the two tasks on Twitter
data. We also show that the user embeddings learned within our joint multimodal
embedding model are better at predicting user interests compared to those
learned with unimodal content on Instagram data. Our framework thus goes beyond
the prior practice of using explicit leader-follower link information to
establish affiliations by extracting implicit content-centric affiliations from
isolated users. We provide qualitative results to show that the user clusters
emerging from learned embeddings have consistent semantics and the ability of
our model to discover fine-grained semantics from noisy and unstructured data.
Our work reveals that social multimedia content is inherently multimodal and possesses a consistent structure, because in social networks meaning is created through interactions between users and content.
Comment: Preprint submitted to IJC
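A minimal sketch of such a joint embedding objective is given below, assuming user, image, and text encoders already produce vectors in the common space; the margin ranking loss is a generic stand-in for the paper's loss for distant modalities, not a reproduction of it.

```python
import torch
import torch.nn.functional as F

def ranking_loss(anchor, positive, negatives, margin=0.2):
    # Pull matched cross-modal pairs together; push the hardest sampled
    # mismatch at least `margin` further away than the match.
    pos = F.cosine_similarity(anchor, positive)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1).max(dim=1).values
    return F.relu(margin - pos + neg).mean()

# Batch of (user, image, text) embeddings already in the common space.
u = torch.randn(16, 256, requires_grad=True)
v = torch.randn(16, 256, requires_grad=True)
t = torch.randn(16, 256, requires_grad=True)
negs = torch.randn(16, 5, 256)  # sampled non-matching items per anchor
# Summing the pairwise losses ties all three modalities to one geometry,
# which is what enables seamless three-way retrieval.
loss = ranking_loss(u, v, negs) + ranking_loss(u, t, negs) + ranking_loss(v, t, negs)
loss.backward()
```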
Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories
belonging to the same basic-level category, which is a highly challenging task
due to the quite subtle and local visual distinctions among similar
subcategories. Most existing methods generally learn part detectors to discover
discriminative regions for better categorization performance. However, not all
parts are beneficial or indispensable for visual categorization, and the choice of the number of part detectors relies heavily on prior knowledge as well as
experimental validation. When we describe the object in an image via a textual description, we mainly focus on its pivotal characteristics and rarely mention common characteristics or the background. This involuntary transfer from human visual attention to textual attention means that textual attention indicates how many and which parts are discriminative and significant for categorization; textual attention can therefore help us discover visual attention in images. Inspired by
this, we propose a fine-grained visual-textual representation learning (VTRL)
approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining is devoted to discovering discriminative visual-textual pairwise information that boosts categorization performance by jointly modeling vision and text with generative adversarial networks (GANs), automatically and adaptively discovering discriminative parts. (2) Visual-textual
representation learning jointly combines visual and textual information, preserving the intra-modality and inter-modality information to generate complementary fine-grained representations and further improve categorization performance.
Comment: 12 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
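The intuition that textual attention can guide visual attention admits a very small sketch; this is our illustration, not the paper's GAN-based pipeline. Caption words, weighted by textual attention, select the image regions whose features align with them.

```python
import numpy as np

def textual_attention_over_regions(region_feats, word_feats, word_weights):
    """region_feats: (R, d); word_feats: (W, d); word_weights: (W,) attention."""
    sim = region_feats @ word_feats.T   # (R, W) region-word alignment scores
    region_scores = sim @ word_weights  # weight alignments by textual attention
    e = np.exp(region_scores - region_scores.max())
    return e / e.sum()                  # soft selection over candidate parts

rng = np.random.default_rng(1)
regions, words = rng.normal(size=(6, 64)), rng.normal(size=(4, 64))
attn = np.array([0.7, 0.1, 0.1, 0.1])   # e.g., high attention on one pivotal word
print(textual_attention_over_regions(regions, words, attn))
```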
Multimodal Emotion Recognition for One-Minute-Gradual Emotion Challenge
Continuous dimensional emotion, modelled by arousal and valence, can depict complex changes in emotion. In this paper, we present our work on arousal and valence prediction for the One-Minute-Gradual (OMG) Emotion Challenge. Multimodal representations are first extracted from videos using a variety of acoustic, video, and textual models, and a support vector machine (SVM) is then used to fuse the multimodal signals into final predictions. Our solution achieves
Concordance Correlation Coefficient (CCC) scores of 0.397 and 0.520 on arousal and valence, respectively, on the validation dataset, outperforming by a large margin the baseline systems, whose best CCC scores are 0.15 and 0.23 on arousal and valence.
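A minimal sketch of this recipe with synthetic stand-in features: concatenate per-modality representations, regress arousal (or valence) with an SVM, and score with the Concordance Correlation Coefficient. The abstract does not specify how the SVM fuses the signals, so the simple concatenation here is an assumption.

```python
import numpy as np
from sklearn.svm import SVR

def ccc(y_true, y_pred):
    # Concordance Correlation Coefficient:
    # 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)
    mt, mp = y_true.mean(), y_pred.mean()
    cov = ((y_true - mt) * (y_pred - mp)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)

rng = np.random.default_rng(0)
acoustic, visual, textual = (rng.normal(size=(200, d)) for d in (40, 30, 20))
X = np.hstack([acoustic, visual, textual])  # early fusion by concatenation
y = rng.uniform(-1, 1, size=200)            # arousal (or valence) targets
model = SVR(kernel="rbf").fit(X[:150], y[:150])
print(ccc(y[150:], model.predict(X[150:])))  # held-out CCC
```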
A visual approach for age and gender identification on Twitter
The goal of Author Profiling (AP) is to identify demographic aspects (e.g.,
age, gender) from a given set of authors by analyzing their written texts.
Recently, the AP task has gained interest in many problems related to computer forensics, psychology, and marketing, but especially in those related to social media exploitation. Social media data is shared through a wide range of modalities (e.g., text, images, and audio) and represents valuable information from which insights about users can be extracted. Nevertheless, most current work on AP using social media data has been devoted to analyzing textual information only, and very few works have started exploring gender identification using visual information. In contrast,
this paper focuses on exploiting the visual modality to perform both age and gender identification in social media, specifically on Twitter. Our goal is to evaluate the pertinence of visual information for solving the AP task.
Accordingly, we have extended the Twitter corpus from PAN 2014, incorporating
posted images from all the users, making a distinction between tweeted and
retweeted images. The experiments we performed provide interesting evidence on the usefulness of visual information in comparison with traditional textual representations for the AP task.
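In the same spirit, a visual author-profiling baseline can be sketched with synthetic data: represent each user by the mean of the features extracted from their posted images, then train an off-the-shelf classifier for gender (or an analogous one for age). The 512-d features stand in for the output of any pre-trained image model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def user_vector(image_feats):
    """Aggregate a user's posted-image features into one profile vector."""
    return np.mean(image_feats, axis=0)

rng = np.random.default_rng(0)
# Hypothetical data: 100 users, each with a variable number of 512-d image features.
users = [rng.normal(size=(rng.integers(3, 20), 512)) for _ in range(100)]
gender = rng.integers(0, 2, size=100)  # 0/1 labels, for illustration only
X = np.stack([user_vector(u) for u in users])
clf = LogisticRegression(max_iter=1000).fit(X[:80], gender[:80])
print((clf.predict(X[80:]) == gender[80:]).mean())  # held-out accuracy
```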