104,260 research outputs found
HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities
Multimodal datasets contain an enormous amount of relational information,
which grows exponentially with the introduction of new modalities. Learning
representations in such a scenario is inherently complex due to the presence of
multiple heterogeneous information channels. These channels can encode both (a)
inter-relations between the items of different modalities and (b)
intra-relations between the items of the same modality. Encoding multimedia
items into a continuous low-dimensional semantic space such that both types of
relations are captured and preserved is extremely challenging, especially if
the goal is a unified end-to-end learning framework. The two key challenges
that need to be addressed are: 1) the framework must be able to merge complex
intra- and inter-relations without losing any valuable information, and 2) the
learning model should be invariant to the addition of new and potentially very
different modalities. In this paper, we propose a flexible framework which can
scale to data streams from many modalities. To that end, we introduce a
hypergraph-based model for data representation and deploy Graph Convolutional
Networks to fuse relational information within and across modalities. Our
approach provides an efficient solution for distributing otherwise extremely
computationally expensive or even infeasible training processes across
multiple GPUs, without sacrificing accuracy. Moreover, adding a new modality
to our model requires only an additional GPU unit while keeping the
computational time unchanged, which brings representation learning to truly
multimodal datasets. We demonstrate the feasibility of our approach in
experiments on multimedia datasets featuring second-, third- and fourth-order
relations.
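A minimal sketch (not the authors' released code) of the kind of hypergraph convolution the abstract builds on: multimedia items are nodes, intra- and inter-modal relations become hyperedges, and features are propagated through the incidence matrix. The layer sizes and the binary, unweighted incidence representation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, incidence):
        # x:         (num_nodes, in_dim) embeddings of multimedia items
        # incidence: (num_nodes, num_edges) binary incidence matrix H of the hypergraph
        d_v = incidence.sum(dim=1).clamp(min=1)               # node degrees
        d_e = incidence.sum(dim=0).clamp(min=1)               # hyperedge degrees
        # X' = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Theta (unweighted hyperedges)
        norm_x = x / d_v.sqrt().unsqueeze(1)
        edge_msg = incidence.t() @ norm_x / d_e.unsqueeze(1)       # aggregate nodes into hyperedges
        node_msg = incidence @ edge_msg / d_v.sqrt().unsqueeze(1)  # scatter back to nodes
        return self.theta(node_msg)
```

In the distributed setting the abstract describes, one such module per modality could plausibly be placed on its own GPU and the per-modality outputs fused afterwards, which would match the claim that a new modality only requires an additional GPU.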
Multimodal Convolutional Neural Networks for Matching Image and Sentence
In this paper, we propose multimodal convolutional neural networks (m-CNNs)
for matching image and sentence. Our m-CNN provides an end-to-end framework
with convolutional architectures to exploit image representation, word
composition, and the matching relations between the two modalities. More
specifically, it consists of one image CNN encoding the image content, and one
matching CNN learning the joint representation of image and sentence. The
matching CNN composes words into different semantic fragments and learns the
inter-modal relations between the image and the composed fragments at different
levels, thus fully exploiting the matching relations between image and sentence.
Experimental results on benchmark databases of bidirectional image and sentence
retrieval demonstrate that the proposed m-CNNs can effectively capture the
information necessary for image and sentence matching. Specifically, our
proposed m-CNNs achieve state-of-the-art performance on bidirectional image and
sentence retrieval on the Flickr30K and Microsoft COCO databases.
Comment: Accepted by ICCV 201
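A hypothetical sketch of the two-branch design the abstract outlines: a pretrained image CNN supplies a global image vector, a 1-D convolution composes word embeddings into semantic fragments, and a small MLP scores the fused pair. All layer sizes and the max-pooling readout are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MatchingCNN(nn.Module):
    def __init__(self, word_dim=300, img_dim=2048, hidden=256):
        super().__init__()
        self.compose = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)  # word composition
        self.img_proj = nn.Linear(img_dim, hidden)
        self.score = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, words, img_feat):
        # words:    (batch, seq_len, word_dim) word embeddings of the sentence
        # img_feat: (batch, img_dim) global feature from a pretrained image CNN
        frags = self.compose(words.transpose(1, 2)).max(dim=2).values  # composed fragments
        joint = torch.cat([frags, torch.relu(self.img_proj(img_feat))], dim=1)
        return self.score(joint).squeeze(1)   # matching score for each image-sentence pair
```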
Unsupervised Learning of Long-Term Motion Dynamics for Videos
We present an unsupervised representation learning approach that compactly
encodes the motion dependencies in videos. Given a pair of images from a video
clip, our framework learns to predict the long-term 3D motions. To reduce the
complexity of the learning framework, we propose to describe the motion as a
sequence of atomic 3D flows computed from the RGB-D modality. We use a Recurrent
Neural Network based Encoder-Decoder framework to predict these sequences of
flows. We argue that in order for the decoder to reconstruct these sequences,
the encoder must learn a robust video representation that captures long-term
motion dependencies and spatial-temporal relations. We demonstrate the
effectiveness of our learned temporal representations on activity
classification across multiple modalities and datasets such as NTU RGB+D and
MSR Daily Activity 3D. Our framework is generic to any input modality, i.e.,
RGB, depth, and RGB-D videos.
Comment: CVPR 201
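A minimal sketch (under assumed shapes, not the paper's implementation) of the encoder-decoder idea: features of a frame pair are encoded into an initial hidden state, and an LSTM decoder unrolls it into a fixed-length sequence of atomic flow vectors that a reconstruction loss would supervise. The frame encoder, flow dimensionality, and horizon are placeholders.

```python
import torch
import torch.nn as nn

class MotionEncoderDecoder(nn.Module):
    def __init__(self, frame_dim=1024, hidden=512, flow_dim=64, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.flow_dim = flow_dim
        self.encode = nn.Linear(2 * frame_dim, hidden)            # fuse the frame pair
        self.decoder = nn.LSTM(flow_dim, hidden, batch_first=True)
        self.to_flow = nn.Linear(hidden, flow_dim)                # hidden state -> atomic flow

    def forward(self, frame_a, frame_b):
        # frame_a, frame_b: (batch, frame_dim) features of the two input frames
        h0 = torch.tanh(self.encode(torch.cat([frame_a, frame_b], dim=1)))
        state = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        step = frame_a.new_zeros(frame_a.size(0), 1, self.flow_dim)  # start token
        flows = []
        for _ in range(self.horizon):                             # autoregressive unrolling
            out, state = self.decoder(step, state)
            step = self.to_flow(out)                              # next atomic 3D flow vector
            flows.append(step)
        return torch.cat(flows, dim=1)                            # (batch, horizon, flow_dim)
```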
Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions
Modern Review Helpfulness Prediction systems are dependent upon multiple
modalities, typically text and images. Unfortunately, contemporary approaches
pay scant attention to polishing the representations of cross-modal relations
and tend to suffer from inferior optimization, which can harm the model's
predictions in numerous cases. To overcome these issues, we propose Multimodal
Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP)
problem, concentrating on the mutual information
between input modalities to explicitly elaborate cross-modal relations. In
addition, we introduce an Adaptive Weighting scheme for our contrastive learning
approach in order to increase flexibility in optimization. Lastly, we propose a
Multimodal Interaction module to address the unaligned nature of multimodal
data, thereby assisting the model in producing more reasonable multimodal
representations. Experimental results show that our method outperforms prior
baselines and achieves state-of-the-art results on two publicly available
benchmark datasets for the MRHP problem.
Comment: Accepted to the main EMNLP 2022 conference
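A hedged sketch of a cross-modal InfoNCE-style contrastive objective of the kind the abstract concentrates on; the per-sample weighting shown here is a simple illustrative stand-in, not the paper's Adaptive Weighting scheme, and the pairing of text and image embeddings per review is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_infonce(text_emb, image_emb, weights=None, temperature=0.07):
    # text_emb, image_emb: (batch, dim) paired representations of the same review item
    # weights: optional (batch,) per-sample weights; uniform if omitted
    text_emb = F.normalize(text_emb, dim=1)
    image_emb = F.normalize(image_emb, dim=1)
    logits = text_emb @ image_emb.t() / temperature                    # scaled cosine similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)   # true pairs on the diagonal
    loss = F.cross_entropy(logits, targets, reduction="none")
    if weights is None:
        weights = torch.ones_like(loss)
    return (weights * loss).mean()                                     # weighted contrastive loss
```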
Modeling Intra- and Inter-Modal Relations: Hierarchical Graph Contrastive Learning for Multimodal Sentiment Analysis
Existing research efforts in Multimodal Sentiment Analysis (MSA) have focused on developing the expressive ability of neural networks to fuse information from different modalities. However, these approaches lack a mechanism to understand the complex relations within and across modalities, since some sentiments may be scattered across different modalities. To this end, in this paper we propose a novel hierarchical graph contrastive learning (HGraph-CL) framework for MSA, aiming to explore the intricate relations of intra- and inter-modal representations for sentiment extraction. Specifically, at the intra-modal level, we build a unimodal graph for each modality representation to account for modality-specific sentiment implications. Based on these graphs, a graph contrastive learning strategy is adopted to explore the potential relations via unimodal graph augmentations. Furthermore, we construct a multimodal graph for each instance based on the unimodal graphs to grasp the sentiment relations between different modalities. Then, in light of the multimodal graph augmentations, a graph contrastive learning strategy at the inter-modal level is proposed to further seek the possible graph structures for precisely learning sentiment relations. This essentially allows the framework to learn the appropriate graph structures for capturing intricate relations among different modalities. Experimental results on two benchmark datasets show that the proposed framework outperforms the state-of-the-art baselines in MSA.
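An illustrative sketch (not the HGraph-CL code) of two basic ingredients of such a graph contrastive learning strategy: an edge-dropping graph augmentation and a symmetric NT-Xent loss between the embeddings of two augmented views. How the unimodal and multimodal graphs are built and encoded is left out here and would follow the paper.

```python
import torch
import torch.nn.functional as F

def drop_edges(edge_index, p=0.2):
    # edge_index: (2, num_edges) graph connectivity; keep each edge with probability 1 - p
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]

def nt_xent(view_a, view_b, temperature=0.1):
    # view_a, view_b: (batch, dim) graph-level embeddings of two augmented views
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric loss: each view should retrieve its own counterpart within the batch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```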
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International
Conference on Computational Linguistics. Please refer to this version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197