Multimodal Network Alignment
A multimodal network encodes relationships between the same set of nodes in
multiple settings, and network alignment is a powerful tool for transferring
information and insight between a pair of networks. We propose a method for
multimodal network alignment that computes the matrix indicating the alignment but produces the result directly as a low-rank factorization. We then propose new methods for computing approximate maximum-weight matchings of low-rank matrices to produce an alignment. We evaluate our approach by applying it to synthetic networks and use it to de-anonymize a multimodal transportation network.
Comment: 14 pages, 6 figures, SIAM Data Mining 201
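The abstract describes reading an approximate maximum-weight matching off a low-rank alignment matrix. Below is a minimal sketch of that general idea (not the paper's algorithm): given hypothetical factor matrices U and V with alignment scores X ≈ U V^T, a greedy matching is extracted one row at a time without ever materializing the full matrix.

```python
# A minimal sketch (not the paper's algorithm): greedily extract an approximate
# maximum-weight matching from a low-rank alignment score matrix X ~= U @ V.T,
# using only the factors U and V so the full n x n matrix is never materialized.
import numpy as np

def greedy_lowrank_matching(U, V):
    """Greedy matching on scores X[i, j] = U[i] @ V[j]; returns a list of (i, j)."""
    n, m = U.shape[0], V.shape[0]
    unmatched_cols = np.ones(m, dtype=bool)
    matches = []
    for i in range(n):
        scores = U[i] @ V.T                # one row of X, computed on the fly
        scores[~unmatched_cols] = -np.inf  # skip columns already used
        j = int(np.argmax(scores))
        if np.isfinite(scores[j]):
            matches.append((i, j))
            unmatched_cols[j] = False
    return matches

# Toy usage with random rank-4 factors for two 100-node networks.
rng = np.random.default_rng(0)
U, V = rng.random((100, 4)), rng.random((100, 4))
print(greedy_lowrank_matching(U, V)[:5])
```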
Deep Impression: Audiovisual Deep Residual Networks for Multimodal Apparent Personality Trait Recognition
Here, we develop an audiovisual deep residual network for multimodal apparent
personality trait recognition. The network is trained end-to-end for predicting
the Big Five personality traits of people from their videos. That is, the
network does not require any feature engineering or visual analysis such as face detection, face landmark alignment, or facial expression recognition. Recently, the network placed third in the ChaLearn First Impressions Challenge with a test accuracy of 0.9109.
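As a rough illustration of the kind of end-to-end audiovisual regressor the abstract describes, here is a minimal two-branch sketch (not the authors' architecture): each modality passes through a small residual encoder, the embeddings are fused, and a linear head predicts the five trait scores. The feature dimensions and layer sizes are placeholders.

```python
# A minimal two-branch sketch of an audiovisual regressor for the Big Five traits,
# loosely following the abstract (not the authors' architecture).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return torch.relu(x + self.net(x))   # residual connection

class AudioVisualTraits(nn.Module):
    def __init__(self, audio_dim=64, visual_dim=512, hidden=128):
        super().__init__()
        self.audio = nn.Sequential(nn.Linear(audio_dim, hidden), ResidualBlock(hidden))
        self.visual = nn.Sequential(nn.Linear(visual_dim, hidden), ResidualBlock(hidden))
        self.head = nn.Linear(2 * hidden, 5)  # five personality traits
    def forward(self, audio_feats, visual_feats):
        z = torch.cat([self.audio(audio_feats), self.visual(visual_feats)], dim=-1)
        return torch.sigmoid(self.head(z))    # traits scored in [0, 1]

# Toy forward pass on per-clip feature vectors (dimensions are placeholders).
model = AudioVisualTraits()
print(model(torch.randn(8, 64), torch.randn(8, 512)).shape)  # torch.Size([8, 5])
```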
SMAN: Stacked Multi-Modal Attention Network for cross-modal image-text retrieval
This article focuses on tackling the task of cross-modal image-text retrieval, which has been an interdisciplinary topic in both the computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portions of images and texts, while local representation alignment schemes suffer from a huge computational burden when exhaustively aggregating the similarity of visual fragments and textual words. In this article, we propose a stacked multimodal attention network (SMAN) that uses a stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence, which contributes to measuring cross-modal similarity more precisely. Moreover, we present a novel bidirectional ranking loss that pulls paired multimodal instances closer together. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods.
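The bidirectional ranking loss mentioned above is commonly implemented as a hinge loss over an image-text similarity matrix. The sketch below shows one such formulation (not necessarily the exact SMAN loss): matched pairs on the diagonal must outscore mismatched pairs in both retrieval directions by a margin.

```python
# A minimal sketch of a bidirectional hinge ranking loss over an image-text
# similarity matrix (a common formulation; not necessarily the exact SMAN loss).
import torch

def bidirectional_ranking_loss(sim, margin=0.2):
    """sim[i, j] = similarity of image i and text j; matched pairs on the diagonal."""
    pos = sim.diag().unsqueeze(1)                      # positive pair scores
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    cost_i2t = (margin + sim - pos).clamp(min=0)       # image as anchor
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)   # text as anchor
    cost_i2t = cost_i2t.masked_fill(mask, 0)           # ignore the positives themselves
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()

# Toy usage with cosine similarities of random embeddings.
img = torch.nn.functional.normalize(torch.randn(16, 256), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(16, 256), dim=-1)
print(bidirectional_ranking_loss(img @ txt.t()))
```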
Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging
Multimodal named entity recognition (MNER) and multimodal relation extraction
(MRE) are two fundamental subtasks in the multimodal knowledge graph
construction task. However, existing methods usually handle the two tasks independently, ignoring the bidirectional interaction between them. This paper is the first to propose performing MNER and MRE jointly as a joint multimodal entity-relation extraction task (JMERE). Moreover, current MNER and MRE models only consider aligning visual objects with textual entities in the visual and textual graphs, ignoring entity-entity and object-object relationships. To address these challenges, we propose an edge-enhanced graph alignment network with word-pair relation tagging (EEGA) for the JMERE task. Specifically, we first design a word-pair relation tagging scheme that exploits the bidirectional interaction between MNER and MRE and avoids error propagation. Then, we propose an edge-enhanced graph alignment network that strengthens the JMERE task by aligning nodes and edges across the two graphs. Compared with previous methods, the proposed method can leverage edge information to aid the alignment between objects and entities and to find correlations between entity-entity and object-object relationships. Experiments are conducted to show the effectiveness of our model.
Comment: accepted at AAAI-202
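To illustrate the word-pair relation tagging idea in isolation, here is a toy sketch (a generic illustration, not the exact EEGA tag scheme): entity types and inter-entity relations are written into one word-by-word grid, so both can be decoded jointly from a single structure rather than piping one task's output into the other. The sentence and label set are invented for the example.

```python
# A minimal sketch of word-pair relation tagging (a generic illustration of the
# idea, not the exact EEGA tag scheme): labels live in a word-by-word grid, so
# entity spans and inter-entity relations can be decoded from one matrix.
import numpy as np

words = ["Steve", "Jobs", "founded", "Apple"]
labels = {0: "O", 1: "ENT-PER", 2: "ENT-ORG", 3: "REL-founder_of"}  # invented tag set

n = len(words)
grid = np.zeros((n, n), dtype=int)
grid[0, 1] = 1        # cell (span start, span end) marks "Steve Jobs" as a PER entity
grid[3, 3] = 2        # single-word span "Apple" tagged as an ORG entity
grid[1, 3] = 3        # relation cell linking the two entity spans

for i in range(n):
    for j in range(n):
        if grid[i, j]:
            print(f"({words[i]}, {words[j]}) -> {labels[grid[i, j]]}")
```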
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
The multimedia community has shown significant interest in perceiving and representing the physical world with multimodal pretrained neural network models, and among them, vision-language pretraining (VLP) is currently the most captivating topic. However, there have been few endeavors dedicated to exploring 1) whether essential linguistic knowledge (e.g., semantics and syntax) can be extracted during VLP, and 2) how such linguistic knowledge impacts or enhances multimodal alignment. In response, here we aim to elucidate the impact of comprehensive linguistic knowledge, including semantic expression and syntactic structure, on multimodal alignment. Specifically, we design and release SNARE, the first large-scale multimodal alignment probing benchmark, to detect vital linguistic components (e.g., lexical, semantic, and syntactic knowledge) through four tasks: Semantic structure, Negation logic, Attribute ownership, and Relationship composition. Based on the proposed probing benchmark, our holistic analyses of five advanced VLP models show that these models: i) are insensitive to complex syntactic structures and rely on content words for sentence comprehension; ii) demonstrate limited comprehension of combinations of sentences and negations; iii) face challenges in determining the presence of actions or spatial relationships within visual information and struggle to verify the correctness of triple combinations. We make our benchmark and code available at \url{https://github.com/WangFei-2019/SNARE/}.
Comment: [TL;DR] We design and release SNARE, the first large-scale multimodal alignment probing benchmark for current vision-language pretrained models.
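A probing setup in the spirit of the negation-logic task can be approximated with an off-the-shelf VLP model: score an image against a caption and its negated counterpart and compare the two scores. The sketch below (not the SNARE harness itself) uses Hugging Face's CLIP as a stand-in; the model name, image path, and captions are placeholders.

```python
# A minimal probing-style sketch in the spirit of the abstract (not the SNARE
# harness itself): score one image against a caption and its negated variant with
# an off-the-shelf CLIP model; a model insensitive to negation ranks them similarly.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # placeholder path to any probe image
captions = ["a dog is running on the grass",
            "a dog is not running on the grass"]        # negated counterpart

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image           # shape (1, 2)
print(dict(zip(captions, logits.squeeze(0).tolist())))
```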
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (vision), and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it remains understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.
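As a rough sketch of the kind of multimodal humor classifier the dataset is intended to support (an illustration, not the paper's baseline), the following late-fusion model projects per-utterance text, visual, and acoustic feature vectors, concatenates them, and outputs a binary humor prediction. All feature dimensions are placeholders.

```python
# A minimal late-fusion sketch of a text/visual/acoustic humor classifier
# (an illustration only, not the paper's baseline; feature sizes are placeholders).
import torch
import torch.nn as nn

class HumorClassifier(nn.Module):
    def __init__(self, text_dim=300, visual_dim=75, acoustic_dim=81, hidden=64):
        super().__init__()
        self.text = nn.Linear(text_dim, hidden)
        self.visual = nn.Linear(visual_dim, hidden)
        self.acoustic = nn.Linear(acoustic_dim, hidden)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(3 * hidden, 1))
    def forward(self, t, v, a):
        z = torch.cat([self.text(t), self.visual(v), self.acoustic(a)], dim=-1)
        return self.head(z).squeeze(-1)      # one humor logit per utterance

# Toy usage on a batch of four utterances.
model = HumorClassifier()
logits = model(torch.randn(4, 300), torch.randn(4, 75), torch.randn(4, 81))
print(torch.sigmoid(logits))                 # predicted probability of humor
```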