Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Audio-visual representation learning is an important task from the
perspective of designing machines with the ability to understand complex
events. To this end, we propose a novel multimodal framework that instantiates
multiple instance learning. We show that the learnt representations are useful
for classifying events and localizing their characteristic audio-visual
elements. The system is trained using only video-level event labels without any
timing information. An important feature of our method is its capacity to learn
from unsynchronized audio-visual events. We achieve state-of-the-art results on
a large-scale dataset of weakly-labeled audio event videos. Visualizations of
localized visual regions and audio segments substantiate our system's efficacy,
especially when dealing with noisy situations where modality-specific cues
appear asynchronously.
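As a rough illustration of the multiple-instance-learning setup this abstract describes, the sketch below pools per-segment scores into a single video-level prediction, so training needs only video-level labels and no timing information. The module names, feature dimensions, max pooling, and averaging fusion are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal MIL sketch: per-segment/region scores are pooled into one
# video-level prediction, trained from video-level labels alone.
import torch
import torch.nn as nn

class MILEventClassifier(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, n_events=10):
        super().__init__()
        self.audio_head = nn.Linear(audio_dim, n_events)
        self.visual_head = nn.Linear(visual_dim, n_events)

    def forward(self, audio_segs, visual_regs):
        # audio_segs: (B, Ta, audio_dim) audio segment features
        # visual_regs: (B, Tv, visual_dim) visual region features
        a = self.audio_head(audio_segs)    # per-segment event scores
        v = self.visual_head(visual_regs)  # per-region event scores
        # Max-pool each modality independently, so audio and visual
        # evidence may come from different (unsynchronized) instants.
        a_vid = a.max(dim=1).values
        v_vid = v.max(dim=1).values
        return (a_vid + v_vid) / 2         # fused video-level logits

model = MILEventClassifier()
loss_fn = nn.BCEWithLogitsLoss()           # multi-label video tags
logits = model(torch.randn(4, 30, 128), torch.randn(4, 20, 512))
loss = loss_fn(logits, torch.randint(0, 2, (4, 10)).float())
```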
DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection
This paper presents a novel two-phase method for audio representation,
Discriminative and Compact Audio Representation (DCAR), and evaluates its
performance at detecting events in consumer-produced videos. In the first phase
of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that
includes several components to capture the variability within that track. The
second phase takes into account both global structure and local structure. In
this phase, the components are rendered more discriminative and compact by
formulating an optimization problem on Grassmannian manifolds, which we found
represents the structure of audio effectively.
Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus
based on YFCC100M), which includes ten events. The results show that the
proposed DCAR representation consistently outperforms state-of-the-art audio
representations. DCAR's advantage over i-vector, mv-vector, and GMM
representations is significant for both easier and harder discrimination tasks.
We discuss how these performance differences across easy and hard cases follow
from how each type of model leverages (or doesn't leverage) the intrinsic
structure of the data. Furthermore, DCAR shows a particularly notable accuracy
advantage on events where humans have more difficulty classifying the videos,
i.e., events with lower mean annotator confidence.
Comment: An abbreviated version of this paper will be published in ACM Multimedia 201
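A minimal sketch of DCAR's first phase as described: each audio track is modeled by its own GMM whose components capture within-track variability. The MFCC-style frame features, the component count, and the diagonal covariance are illustrative assumptions; the second (Grassmannian) phase is omitted.

```python
# Phase one of DCAR, sketched: fit one GMM per audio track.
import numpy as np
from sklearn.mixture import GaussianMixture

def track_components(frame_features, n_components=8):
    """frame_features: (n_frames, dim) array, e.g. MFCCs for one track."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0)
    gmm.fit(frame_features)
    # The per-track components (means, covariances, weights) become the
    # raw representation that the second phase renders discriminative
    # and compact via optimization on Grassmannian manifolds.
    return gmm.means_, gmm.covariances_, gmm.weights_

means, covs, weights = track_components(np.random.randn(500, 20))
```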
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common Hamming
space. Then, we introduce several multimodal datasets in the community, and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods, which is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
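The survey's two families can be contrasted in a few lines. In this sketch, the random projection matrices stand in for whatever learned mappings a particular method uses; the dimensions and the sign-quantization scheme for binary codes are illustrative assumptions.

```python
# Real-valued common embeddings scored by cosine similarity, versus
# binary codes scored by fast Hamming distance in a common space.
import numpy as np

rng = np.random.default_rng(0)
W_txt = rng.standard_normal((300, 64))    # text: 300-d -> common space
W_img = rng.standard_normal((2048, 64))   # image: 2048-d -> common space

def embed(x, W):
    z = x @ W                             # real-valued common representation
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

query = embed(rng.standard_normal(300), W_txt)             # a text query
gallery = embed(rng.standard_normal((1000, 2048)), W_img)  # image gallery
ranked = np.argsort(-(gallery @ query))   # cosine-similarity ranking

# Binary variant: sign-quantize into a common Hamming space for speed.
q_code, g_codes = np.sign(query), np.sign(gallery)
hamming = np.count_nonzero(q_code != g_codes, axis=-1)
fast_ranked = np.argsort(hamming)         # smallest Hamming distance first
```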
Decoding Brain Representations by Multimodal Learning of Neural Activity and Visual Features
This work presents a novel method of exploring human brain-visual
representations, with a view towards replicating these processes in machines.
The core idea is to learn plausible computational and biological
representations by correlating human neural activity and natural images. Thus,
we first propose a model, EEG-ChannelNet, to learn a brain manifold for EEG
classification. After verifying that visual information can be extracted from
EEG data, we introduce a multimodal approach that uses deep image and EEG
encoders, trained in a siamese configuration, for learning a joint manifold
that maximizes a compatibility measure between visual features and brain
representations. We then carry out image classification and saliency detection
on the learned manifold. Performance analyses show that our approach
satisfactorily decodes visual information from neural signals. This, in turn,
can be used to effectively supervise the training of deep learning models, as
demonstrated by the high performance of image classification and saliency
detection on out-of-training classes. The obtained results show that the
learned brain-visual features lead to improved performance and simultaneously
bring deep models more in line with cognitive neuroscience work related to
visual perception and attention.
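A minimal sketch of the siamese joint-manifold idea: separate image and EEG encoders project into one space, trained so matched pairs score higher than mismatched ones. The encoder shapes, the dot-product compatibility measure, and the hinge margin are assumptions, not the paper's exact configuration.

```python
# Siamese image/EEG encoders trained with a ranking (hinge) loss on a
# compatibility score over the shared manifold.
import torch
import torch.nn as nn

img_enc = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 128))
eeg_enc = nn.Sequential(nn.Linear(1280, 256), nn.ReLU(), nn.Linear(256, 128))

def compatibility(img_feat, eeg_feat):
    # Dot product in the joint space as the compatibility measure.
    return (img_enc(img_feat) * eeg_enc(eeg_feat)).sum(-1)

img = torch.randn(8, 2048)        # image features
eeg = torch.randn(8, 1280)        # matching EEG features
eeg_wrong = torch.randn(8, 1280)  # mismatched EEG features
# Matched pairs should beat mismatched pairs by a margin of 1.0.
loss = torch.relu(1.0 - compatibility(img, eeg)
                  + compatibility(img, eeg_wrong)).mean()
loss.backward()
```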
VideoStory Embeddings Recognize Events when Examples are Scarce
This paper aims for event recognition when video examples are scarce or even
completely absent. The key in such a challenging setting is a semantic video
representation. Rather than building the representation from individual
attribute detectors and their annotations, we propose to learn the entire
representation from freely available web videos and their descriptions using an
embedding between video features and term vectors. In our proposed embedding,
which we call VideoStory, the correlations between the terms are utilized to
learn a more effective representation by optimizing a joint objective balancing
descriptiveness and predictability. We show how learning the VideoStory using a
multimodal predictability loss, including appearance, motion and audio
features, results in a more predictable representation. We also propose a
variant of VideoStory to recognize an event in video from just the important
terms in a text query by introducing a term sensitive descriptiveness loss. Our
experiments on three challenging collections of web videos from the NIST
TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets
demonstrate: i) the advantages of VideoStory over representations using
attributes or alternative embeddings, ii) the benefit of fusing video
modalities by an embedding over common strategies, iii) the complementarity of
term sensitive descriptiveness and multimodal predictability for event
recognition without examples. By its ability to improve the predictability of
any underlying video feature while at the same time maximizing semantic
descriptiveness, VideoStory leads to state-of-the-art accuracy for both few-
and zero-example recognition of events in video.
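The joint objective described above can be sketched as follows: each training video gets a latent "story" embedding that both reconstructs its term vector (descriptiveness) and is predictable from its video features (predictability). The dimensions, squared-error losses, and balance weight are illustrative assumptions, not the paper's exact solver.

```python
# Joint descriptiveness + predictability objective, sketched in torch.
import torch

n, d_vid, d_term, d_story = 100, 512, 1000, 64
X = torch.randn(n, d_vid)             # appearance/motion/audio features
Y = torch.rand(n, d_term)             # bag-of-terms from video descriptions
S = torch.randn(n, d_story, requires_grad=True)       # story embeddings
A = torch.randn(d_story, d_term, requires_grad=True)  # S -> term vectors
W = torch.randn(d_vid, d_story, requires_grad=True)   # video features -> S

opt = torch.optim.Adam([S, A, W], lr=1e-2)
for _ in range(200):
    descriptive = ((S @ A - Y) ** 2).mean()  # terms recoverable from S
    predictable = ((X @ W - S) ** 2).mean()  # S recoverable from video
    loss = descriptive + 1.0 * predictable   # balance the two goals
    opt.zero_grad()
    loss.backward()
    opt.step()
```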
Event Specific Multimodal Pattern Mining with Image-Caption Pairs
In this paper we describe a novel framework and algorithms for discovering
image patch patterns from a large corpus of weakly supervised image-caption
pairs generated from news events. While current pattern mining techniques
attempt to find patterns that are representative and discriminative, we
stipulate that our discovered patterns must also be recognizable by humans,
preferably with meaningful names. We propose a new multimodal pattern mining approach that
leverages the descriptive captions often accompanying news images to learn
semantically meaningful image patch patterns. The multimodal patterns are then
named using words mined from the associated image captions for each pattern. A
novel evaluation framework is provided that demonstrates our patterns are 26.2%
more semantically meaningful than those discovered by the state-of-the-art
vision-only pipeline, and that we can provide tags for the discovered image
patches with 54.5% accuracy with no direct supervision. Our methods also
discover named patterns beyond those covered by the existing image datasets
like ImageNet. To the best of our knowledge, this is the first algorithm
developed to automatically mine image patch patterns that have strong semantic
meaning specific to high-level news events, and then to evaluate these patterns
against those criteria.
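The naming step can be illustrated with a toy example: given patch clusters and the caption words that co-occur with each cluster's patches, pick the most cluster-distinctive word as the pattern's name. The plain tf-idf scoring here is a stand-in for the paper's actual procedure.

```python
# Toy pattern-naming sketch: score each caption word by within-cluster
# frequency times inverse cluster frequency, take the top word as name.
from collections import Counter
import math

clusters = {0: ["handshake", "leaders", "summit", "handshake"],
            1: ["podium", "speech", "podium", "flag"]}  # toy caption words

# Number of clusters whose captions mention each word.
doc_freq = Counter(w for words in clusters.values() for w in set(words))
for cid, words in clusters.items():
    tf = Counter(words)
    name = max(tf, key=lambda w: tf[w] * math.log(
        (1 + len(clusters)) / doc_freq[w]))
    print(f"pattern {cid} -> '{name}'")
```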
Automatic Spatially-aware Fashion Concept Discovery
This paper proposes an automatic spatially-aware concept discovery approach
using weakly labeled image-text data from shopping websites. We first fine-tune
GoogleNet by jointly modeling clothing images and their corresponding
descriptions in a visual-semantic embedding space. Then, for each attribute
(word), we generate its spatially-aware representation by combining its
semantic word vector representation with its spatial representation derived
from the convolutional maps of the fine-tuned network. The resulting
spatially-aware representations are further used to cluster attributes into
multiple groups to form spatially-aware concepts (e.g., the neckline concept
might consist of attributes like v-neck, round-neck, etc.). Finally, we
decompose the visual-semantic embedding space into multiple concept-specific
subspaces, which facilitates structured browsing and attribute-feedback product
retrieval by exploiting multimodal linguistic regularities. We conducted
extensive experiments on our newly collected Fashion200K dataset, and results
on clustering quality evaluation and attribute-feedback product retrieval task
demonstrate the effectiveness of our automatically discovered spatially-aware
concepts.
Comment: ICCV 201
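A minimal sketch of the spatially-aware representation step as described: concatenate each attribute's semantic word vector with a spatial signature pooled from convolutional activation maps, then cluster attributes into concepts. The vector sizes, random stand-in data, and KMeans are illustrative assumptions.

```python
# Build spatially-aware attribute representations, then cluster them
# into concepts (e.g. neckline vs. sleeve groups on real data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
attributes = ["v-neck", "round-neck", "long-sleeve", "short-sleeve"]
word_vecs = rng.standard_normal((4, 300))      # semantic word embeddings
conv_maps = rng.random((4, 7, 7))              # per-attribute activation maps

spatial = conv_maps.reshape(4, -1)             # flatten 7x7 heatmap
spatial /= spatial.sum(axis=1, keepdims=True)  # normalize to a distribution
feats = np.hstack([word_vecs, spatial])        # spatially-aware representation

concepts = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
print(dict(zip(attributes, concepts)))         # attribute -> concept id
```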
Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos
Single modality action recognition on RGB or depth sequences has been
extensively explored recently. It is generally accepted that each of these two
modalities has different strengths and limitations for the task of action
recognition. Therefore, analysis of the RGB+D videos can help us to better
study the complementary properties of these two types of modalities and achieve
higher levels of performance. In this paper, we propose a new deep autoencoder
based shared-specific feature factorization network to separate input
multimodal signals into a hierarchy of components. Further, based on the
structure of the features, a structured sparsity learning machine is proposed
which utilizes mixed norms to apply regularization within components and group
selection between them for better classification performance. Our experimental
results show the effectiveness of our cross-modality feature analysis framework
by achieving state-of-the-art accuracy for action classification on five
challenging benchmark datasets.
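One common instance of the mixed-norm regularization the abstract invokes is an l2,1 penalty: l2 within each component's weight group and l1 across groups, which encourages selecting or discarding whole components. The grouping and sizes below are illustrative assumptions, not the paper's exact learning machine.

```python
# l2,1 group regularizer over classifier weights, one group per
# shared/specific feature component; add `reg` to the task loss.
import torch

def l21_penalty(weight, groups):
    """weight: (n_classes, n_features); groups: list of column-index
    tensors, one per feature component."""
    return sum(weight[:, g].norm(p=2) for g in groups)

W = torch.randn(10, 60, requires_grad=True)      # 10 actions, 60 features
groups = [torch.arange(i, i + 20) for i in (0, 20, 40)]  # 3 components
reg = l21_penalty(W, groups)
reg.backward()
```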
Multimodal sparse representation learning and applications
Unsupervised methods have proven effective for discriminative tasks in a
single-modality scenario. In this paper, we present a multimodal framework for
learning sparse representations that can capture semantic correlation between
modalities. The framework can model relationships at a higher level by forcing
the shared sparse representation. In particular, we propose the use of a joint
dictionary learning technique for sparse coding and formulate the joint
representation for concision, for cross-modal representation (in case of a
missing modality), and for the union of the cross-modal representations. Given the accelerated
growth of multimodal data posted on the Web such as YouTube, Wikipedia, and
Twitter, learning good multimodal features is becoming increasingly important.
We show that the shared representations enabled by our framework substantially
improve the classification performance under both unimodal and multimodal
settings. We further show how deep architectures built on the proposed
framework are effective for the case of highly nonlinear correlations between
modalities. The effectiveness of our approach is demonstrated experimentally in
image denoising, multimedia event detection and retrieval on the TRECVID
dataset (audio-video), category classification on the Wikipedia dataset
(image-text), and sentiment classification on PhotoTweet (image-text).
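The shared-sparse-representation idea can be sketched by stacking both modalities' features and learning one dictionary over the stack, so each sample gets a single sparse code shared across modalities; a missing modality can then be reconstructed from the code inferred from the other. The sizes and sklearn's solvers are stand-ins for the paper's formulation.

```python
# Joint dictionary learning sketch: one shared sparse code per sample.
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

rng = np.random.default_rng(0)
X_img = rng.standard_normal((50, 40))          # image-modality features
X_txt = rng.standard_normal((50, 30))          # text-modality features
X = np.hstack([X_img, X_txt])                  # joint multimodal samples

dl = DictionaryLearning(n_components=16, alpha=1.0, random_state=0)
codes = dl.fit_transform(X)                    # one shared code per sample
D_img, D_txt = dl.components_[:, :40], dl.components_[:, 40:]

# Cross-modal use: infer codes from images alone, reconstruct the
# missing text modality from the shared dictionary.
coder = SparseCoder(dictionary=D_img, transform_algorithm="lasso_lars",
                    transform_alpha=1.0)
txt_hat = coder.transform(X_img) @ D_txt
```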
Predicting Human Intentions from Motion Only: A 2D+3D Fusion Approach
In this paper, we address the new problem of the prediction of human intents.
There is neuro-psychological evidence that actions performed by humans are
anticipated by peculiar motor acts which are discriminant of the type of action
going to be performed afterwards. In other words, an actual intent can be
forecast by looking at the kinematics of the immediately preceding movement. To
prove it in a computational and quantitative manner, we devise a new
experimental setup where, without using contextual information, we predict
human intents all originating from the same motor act. We posit the problem as
a classification task and we introduce a new multi-modal dataset consisting of
a set of motion capture marker 3D data and 2D video sequences, where, by only
analysing very similar movements in both training and test phases, we are able
to predict the underlying intent, i.e., the future, never observed action. We
also present an extensive experimental evaluation as a baseline, customizing
state-of-the-art techniques for both 3D and 2D data analysis. Realizing that
video processing methods lead to inferior performance but show complementary
information with respect to 3D data sequences, we developed a 2D+3D fusion
analysis where we achieve better classification accuracies, attesting to the
superiority of the multimodal approach for the context-free prediction of human
intents.
Comment: accepted as poster at the 25th ACM Multimedia (ACM MM) 2017, Mountain View, California, US
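A minimal late-fusion sketch in the spirit of this 2D+3D analysis: separate classifiers on video (2D) and motion-capture (3D) features, with their class probabilities averaged. The random stand-in features, logistic-regression classifiers, and equal fusion weights are illustrative assumptions, not the paper's customized pipelines.

```python
# Late fusion of 2D and 3D modality classifiers by probability averaging.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X2d = rng.standard_normal((120, 64))          # 2D video features
X3d = rng.standard_normal((120, 32))          # 3D motion-capture features
y = rng.integers(0, 4, 120)                   # 4 underlying intents

clf2d = LogisticRegression(max_iter=1000).fit(X2d[:100], y[:100])
clf3d = LogisticRegression(max_iter=1000).fit(X3d[:100], y[:100])

# Fuse modalities by averaging predicted class probabilities.
proba = (clf2d.predict_proba(X2d[100:]) + clf3d.predict_proba(X3d[100:])) / 2
pred = proba.argmax(axis=1)
print("fused accuracy:", (pred == y[100:]).mean())
```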