A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Expressed sentiment and emotion are two crucial factors in understanding
human multimodal language. This paper describes a Transformer-based
joint-encoding (TBJE) for the tasks of Emotion Recognition and Sentiment
Analysis. In addition to using the Transformer architecture, our approach relies
on a modular co-attention and a glimpse layer to jointly encode one or more
modalities. The proposed solution has also been submitted to the ACL20: Second
Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI
dataset. The code to replicate the presented experiments is open-source:
https://github.com/jbdel/MOSEI_UMONS.
Comment: Winner of the ACL20: Second Grand-Challenge on Multimodal Language.
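As a rough sketch of what joint encoding with modular co-attention and a glimpse layer can look like, the PyTorch fragment below fuses two modalities through cross-attention and pools the result with attention-weighted glimpses. All module names, dimensions, and the fusion order are illustrative assumptions, not the authors' implementation; the linked repository contains the real one.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Illustrative co-attention block: one modality self-attends, then
    cross-attends to another. A simplification, not the exact TBJE cell."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, other):
        x = self.norm1(x + self.self_attn(x, x, x)[0])           # intra-modality
        x = self.norm2(x + self.cross_attn(x, other, other)[0])  # inter-modality
        return self.norm3(x + self.ffn(x))

class Glimpse(nn.Module):
    """Glimpse layer: attention-weighted pooling of a sequence into a
    fixed number of summary vectors."""
    def __init__(self, dim=512, glimpses=2):
        super().__init__()
        self.score = nn.Linear(dim, glimpses)

    def forward(self, x):                       # x: (B, T, dim)
        w = self.score(x).softmax(dim=1)        # (B, T, glimpses)
        return torch.einsum('btg,btd->bgd', w, x).flatten(1)

text = torch.randn(4, 20, 512)    # e.g. projected token features
audio = torch.randn(4, 50, 512)   # e.g. projected acoustic frames
fused = CoAttentionBlock()(text, audio)   # text attends to audio
joint = Glimpse()(fused)                  # (4, 1024) joint representation
```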
Extending Compositional Attention Networks for Social Reasoning in Videos
We propose a novel deep architecture for the task of reasoning about social
interactions in videos. We leverage the multi-step reasoning capabilities of
Compositional Attention Networks (MAC), and propose a multimodal extension
(MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level
fusion of input modalities (visual, auditory, text) over multiple reasoning
steps, by use of a temporal attention mechanism. We then combine MAC-X with
LSTMs for temporal input processing in an end-to-end architecture. Our ablation
studies show that the proposed MAC-X architecture can effectively leverage
multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the
task of Social Video Question Answering in the Social IQ dataset and obtain a
2.5% absolute improvement in binary accuracy over the current
state of the art.
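A minimal sketch of a recurrent cell performing mid-level fusion over multiple reasoning steps is given below, under stated assumptions: the layer names, the fusion rule (concatenate attended summaries, then a recurrent update), and all dimensions are placeholders rather than the MAC-X cell itself.

```python
import torch
import torch.nn as nn

class MidLevelFusionCell(nn.Module):
    """MAC-X-flavored sketch of a recurrent reasoning cell: a memory state
    attends over each modality's timeline (temporal attention) and the
    retrieved summaries are fused mid-level before a recurrent update."""
    def __init__(self, dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(3 * dim, dim)    # visual + audio + text
        self.update = nn.GRUCell(dim, dim)

    def attend(self, memory, seq):
        # Temporal attention: score each time step against the memory state.
        q = self.query(memory).unsqueeze(1)      # (B, 1, dim)
        w = (q * seq).sum(-1).softmax(dim=1)     # (B, T)
        return (w.unsqueeze(-1) * seq).sum(1)    # (B, dim)

    def forward(self, memory, visual, audio, text):
        summaries = [self.attend(memory, m) for m in (visual, audio, text)]
        fused = torch.tanh(self.fuse(torch.cat(summaries, dim=-1)))
        return self.update(fused, memory)        # next memory state

B, dim = 2, 256
visual, audio, text = (torch.randn(B, T, dim) for T in (30, 80, 15))
cell, memory = MidLevelFusionCell(dim), torch.zeros(B, dim)
for _ in range(4):                               # multi-step reasoning
    memory = cell(memory, visual, audio, text)
```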
Visual Question Answering for Cultural Heritage
Technology and the enjoyment of cultural heritage are becoming increasingly
entwined, especially with the advent of smart audio guides, virtual and
augmented reality, and interactive installations. Machine learning and computer
vision are important components of this ongoing integration, enabling new
interaction modalities between user and museum. Nonetheless, the most frequent
way of interacting with paintings and statues remains taking pictures.
Yet images alone can only convey the aesthetics of the artwork; they lack the
information that is often required to fully understand and appreciate it.
Usually this additional knowledge comes both from the artwork itself (and
therefore the image depicting it) and from an external source of knowledge,
such as an information sheet. While the former can be inferred by computer
vision algorithms, the latter needs more structured data to pair visual content
with relevant information. Regardless of its source, this information still
must be effectively transmitted to the user. A popular emerging trend in
computer vision is Visual Question Answering (VQA), in which users can interact
with a neural network by posing questions in natural language and receiving
answers about the visual content. We believe that this will be the evolution of
smart audio guides for museum visits and simple image browsing on personal
smartphones. This will turn the classic audio guide into a smart personal
instructor with which the visitor can interact by asking for explanations
focused on specific interests. The advantages are twofold: on the one hand the
cognitive burden of the visitor will decrease, limiting the flow of information
to what the user actually wants to hear; on the other hand, it offers the
most natural way of interacting with a guide, favoring engagement.
Comment: accepted at FlorenceHeritech 2020.
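To make the interaction pattern concrete, here is a deliberately generic VQA sketch: image region features and a question are encoded, fused under question-guided attention, and classified over a fixed answer vocabulary. Every component is a placeholder baseline, not any particular museum-guide system.

```python
import torch
import torch.nn as nn

class MinimalVQA(nn.Module):
    """Generic VQA baseline sketch: encode the question, pool image region
    features with question-guided attention, classify over fixed answers."""
    def __init__(self, vocab=10000, answers=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.q_enc = nn.LSTM(dim, dim, batch_first=True)
        self.attn = nn.Linear(2 * dim, 1)
        self.classify = nn.Linear(2 * dim, answers)

    def forward(self, regions, question):        # regions: (B, R, dim)
        _, (h, _) = self.q_enc(self.embed(question))
        q = h[-1]                                 # question summary (B, dim)
        pair = torch.cat([regions, q.unsqueeze(1).expand_as(regions)], -1)
        w = self.attn(pair).softmax(dim=1)        # question-guided weights
        v = (w * regions).sum(1)                  # attended image summary
        return self.classify(torch.cat([v, q], -1))   # answer logits

model = MinimalVQA()
logits = model(torch.randn(2, 36, 512),           # e.g. 36 region features
               torch.randint(0, 10000, (2, 12)))  # tokenized question
```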
AttnGrounder: Talking to Cars with Attention
We propose Attention Grounder (AttnGrounder), a single-stage end-to-end
trainable model for the task of visual grounding. Visual grounding aims to
localize a specific object in an image based on a given natural language text
query. Unlike previous methods that use the same text representation for every
image region, we use a visual-text attention module that relates each word in
the given query with every region in the corresponding image for constructing a
region-dependent text representation. Furthermore, to improve the
localization ability of our model, we use our visual-text attention module to
generate an attention mask around the referred object. The attention mask is
trained as an auxiliary task using a rectangular mask generated with the
provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car
dataset and show an improvement of 3.26% over existing methods.
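The core mechanism the abstract describes, relating every word to every region, can be sketched roughly as below; the dot-product scaling, the mask target, and the auxiliary loss form are assumptions for illustration, not AttnGrounder's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAttention(nn.Module):
    """Sketch of a visual-text attention module: every word attends over
    every region, giving a region-dependent text representation plus a
    spatial attention map usable as an auxiliary mask prediction."""
    def __init__(self, dim=256):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, regions, words):
        # regions: (B, R, dim) grid features; words: (B, W, dim)
        scores = words @ regions.transpose(1, 2) * self.scale   # (B, W, R)
        # Per-region mixture of word features -> region-dependent text.
        region_text = scores.softmax(dim=1).transpose(1, 2) @ words
        # Per-word distribution over regions, averaged into one map.
        attn_map = scores.softmax(dim=2).mean(dim=1)            # (B, R)
        return region_text, attn_map

B, R, W, dim = 2, 49, 8, 256                 # e.g. a 7x7 feature grid
vta = VisualTextAttention(dim)
region_text, attn_map = vta(torch.randn(B, R, dim), torch.randn(B, W, dim))

# Auxiliary objective (illustrative): push attention toward grid cells
# inside the rectangular ground-truth box mask.
gt_mask = torch.zeros(B, R)
gt_mask[:, :10] = 1.0                        # placeholder rectangle
target = gt_mask / gt_mask.sum(dim=1, keepdim=True)
mask_loss = F.kl_div(attn_map.clamp_min(1e-8).log(), target,
                     reduction='batchmean')
```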
Developing ChatGPT for Biology and Medicine: A Complete Review of Biomedical Question Answering
ChatGPT lays out a strategic blueprint for question answering (QA) in
delivering medical diagnoses, treatment recommendations, and other healthcare
support. This is achieved through the increasing incorporation of medical
domain data via natural language processing (NLP) and multimodal paradigms. By
transitioning the distribution of text, images, videos, and other modalities
from the general domain to the medical domain, these techniques have expedited
the progress of medical domain question answering (MDQA). They bridge the gap
between human natural language and sophisticated medical domain knowledge or
expert manual annotations, handling large-scale, diverse, unbalanced, or even
unlabeled data analysis scenarios in medical contexts. Central to our focus is
the use of language models and multimodal paradigms for medical question
answering, aiming to guide the research community in selecting appropriate
mechanisms for their specific medical research requirements. Specialized tasks
such as unimodal question answering, reading comprehension, reasoning,
diagnosis, relation extraction, probability modeling, and others, as well as
multimodal tasks like visual question answering, image captioning,
cross-modal retrieval, and report summarization and generation, are discussed in
detail. Each section delves into the intricate specifics of the respective
method under consideration. This paper highlights the structures and
advancements of medical domain explorations against general domain methods,
emphasizing their applications across different tasks and datasets. It also
outlines current challenges and opportunities for future medical domain
research, paving the way for continued innovation and application in this
rapidly evolving field.
Comment: 50 pages, 3 figures, 3 tables.
Recent, rapid advancement in visual question answering architecture: a review
Understanding visual question answering is going to be crucial for numerous
human activities. However, it presents major challenges at the heart of the
artificial intelligence endeavor. This paper presents an update on the rapid
advancements in visual question answering using images that have occurred in
the last couple of years. A tremendous volume of research on improving visual
question answering system architectures has been published recently, showing the
importance of multimodal architectures. Several points on the benefits of
visual question answering are mentioned in the review paper by Manmadhan et al.
(2020), on which the present article builds, including subsequent updates in
the field.
Comment: 11 pages.
VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in
an untrimmed video that is specified by a natural language query. For VMR,
several methods
that require full supervision for training have been proposed. Unfortunately,
acquiring a large number of training videos with labeled temporal boundaries
for each query is a labor-intensive process. This paper explores methods for
performing VMR in a weakly-supervised manner (wVMR): training is performed
without temporal moment labels but only with the text query that describes a
segment of the video. Existing methods on wVMR generate multi-scale proposals
and apply query-guided attention mechanisms to highlight the most relevant
proposal. To leverage the weak supervision, contrastive learning is used, which
predicts higher scores for correct video-query pairs than for incorrect
pairs. It has been observed that a large number of candidate proposals, coarse
query representation, and one-way attention mechanism lead to blurry attention
maps which limit the localization performance. To handle this issue,
Video-Language Alignment Network (VLANet) is proposed that learns sharper
attention by pruning out spurious candidate proposals and applying a
multi-directional attention mechanism with fine-grained query representation.
The Surrogate Proposal Selection module selects a proposal based on the
proximity to the query in the joint embedding space, and thus substantially
reduces candidate proposals which leads to lower computation load and sharper
attention. Next, the Cascaded Cross-modal Attention module considers dense
feature interactions and multi-directional attention flow to learn the
multi-modal alignment. VLANet is trained end-to-end using a contrastive loss
that pulls semantically similar videos and queries together in the joint
embedding space. The
experiments show that the method achieves state-of-the-art performance on
Charades-STA and DiDeMo datasets.
Comment: 16 pages, 6 figures, European Conference on Computer Vision, 2020.
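As a rough illustration of the contrastive objective described above, the hinge-style loss below makes each matched video-query pair outscore the mismatched pairs in the batch by a margin; the margin value, in-batch negatives, and cosine scoring are assumptions, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, query_emb, margin=0.2):
    """Bidirectional margin ranking loss over in-batch video-query pairs:
    matched pairs sit on the diagonal of the similarity matrix."""
    video_emb = F.normalize(video_emb, dim=-1)
    query_emb = F.normalize(query_emb, dim=-1)
    sim = video_emb @ query_emb.t()                 # (B, B) cosine scores
    pos = sim.diagonal().unsqueeze(1)               # matched-pair scores
    cost_v = (margin + sim - pos).clamp_min(0)      # wrong query for a video
    cost_q = (margin + sim - pos.t()).clamp_min(0)  # wrong video for a query
    off_diag = 1.0 - torch.eye(sim.size(0))
    return ((cost_v + cost_q) * off_diag).sum() / sim.size(0)

video = torch.randn(8, 256, requires_grad=True)  # stand-ins for encoder outputs
query = torch.randn(8, 256, requires_grad=True)
loss = contrastive_loss(video, query)
loss.backward()
```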