Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs, which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics. Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197
Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation
Multi-modal recommendation systems, which integrate diverse types of
information, have gained widespread attention in recent years. However,
compared to traditional collaborative filtering-based multi-modal
recommendation systems, research on multi-modal sequential recommendation is
still in its nascent stages. Unlike traditional sequential recommendation
models that solely rely on item identifier (ID) information and focus on
network structure design, multi-modal recommendation models need to emphasize
item representation learning and the fusion of heterogeneous data sources. This
paper investigates the impact of item representation learning on downstream
recommendation tasks and examines the disparities in information fusion at
different stages. Empirical experiments are conducted to demonstrate the need
to design a framework suitable for collaborative learning and fusion of diverse
information. Based on this, we propose a new model-agnostic framework for
multi-modal sequential recommendation tasks, called Online
Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature
interaction and mutual learning among multi-source input (ID, text, and image),
while avoiding conflicts among different features during training, thereby
improving recommendation accuracy. To be specific, we first introduce an
ID-aware Multi-modal Transformer module in the item representation learning
stage to facilitate information interaction among different features. Secondly,
we employ an online distillation training strategy in the prediction
optimization stage to make multi-source data learn from each other and improve
prediction robustness. Experimental results on a video content recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the two proposed modules, which together yield approximately a 10% improvement in performance over baseline models.
Comment: 11 pages, 7 figures
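The online distillation strategy can be pictured as each modality-specific prediction head (ID, text, image) learning both from the ground-truth labels and from the soft ensemble of all heads. The sketch below is a minimal, generic PyTorch illustration of that mutual-learning pattern; the function and argument names are ours and are not taken from the ODMT implementation.

```python
# Hedged sketch of online distillation across modality-specific heads.
# Names, temperature, and the loss weighting are illustrative assumptions,
# not the ODMT paper's released code.
import torch
import torch.nn.functional as F

def online_distillation_loss(logits_per_source, targets, temperature=2.0, alpha=0.5):
    """logits_per_source: list of [batch, num_items] tensors (ID, text, image heads)."""
    # Standard next-item classification loss for every head.
    task_loss = sum(F.cross_entropy(logits, targets) for logits in logits_per_source)

    # Soft ensemble "teacher": average of all heads' tempered predictions.
    with torch.no_grad():
        teacher = torch.stack(
            [F.softmax(logits / temperature, dim=-1) for logits in logits_per_source]
        ).mean(dim=0)

    # Each head is distilled toward the ensemble so the sources learn from each other.
    distill_loss = sum(
        F.kl_div(F.log_softmax(logits / temperature, dim=-1), teacher,
                 reduction="batchmean") * temperature ** 2
        for logits in logits_per_source
    )
    return task_loss + alpha * distill_loss
```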
Leveraging Historical Medical Records as a Proxy via Multimodal Modeling and Visualization to Enrich Medical Diagnostic Learning
Simulation-based Medical Education (SBME) has been developed as a
cost-effective means of enhancing the diagnostic skills of novice physicians
and interns, thereby mitigating the need for resource-intensive
mentor-apprentice training. However, feedback provided in most SBME is often
directed towards improving the operational proficiency of learners, rather than
providing summative medical diagnoses that result from experience and time.
Additionally, the multimodal nature of medical data during diagnosis poses
significant challenges for interns and novice physicians, including the
tendency to overlook or over-rely on data from certain modalities, and
difficulties in comprehending potential associations between modalities. To
address these challenges, we present DiagnosisAssistant, a visual analytics
system that leverages historical medical records as a proxy for multimodal
modeling and visualization to enhance the learning experience of interns and
novice physicians. The system employs elaborately designed visualizations to
explore different modality data, offer diagnostic interpretive hints based on
the constructed model, and enable comparative analyses of specific patients.
Our approach is validated through two case studies and expert interviews, demonstrating its effectiveness in enhancing medical training.
Comment: Accepted by IEEE VIS 202
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to better understanding video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 202
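As a rough illustration of sentence-level late fusion over the three modalities, the sketch below concatenates per-sentence text, audio, and video feature vectors and predicts an aspect label and a sentiment orientation with two small heads. Feature dimensions, layer sizes, and the two-head layout are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of sentence-level multimodal fusion for fine-grained opinion
# mining. All sizes below are assumed for illustration.
import torch
import torch.nn as nn

class SentenceFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, video_dim=35,
                 hidden=256, num_aspects=10, num_sentiments=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(text_dim + audio_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.aspect_head = nn.Linear(hidden, num_aspects)        # which aspect is discussed
        self.sentiment_head = nn.Linear(hidden, num_sentiments)  # orientation toward it

    def forward(self, text_feat, audio_feat, video_feat):
        # Late fusion: concatenate per-sentence features from all modalities.
        fused = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        h = self.backbone(fused)
        return self.aspect_head(h), self.sentiment_head(h)
```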
NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems
In this paper, we present nmtpy, a flexible Python toolkit based on Theano
for training Neural Machine Translation and other neural sequence-to-sequence
architectures. nmtpy decouples the specification of a network from the training
and inference utilities to simplify the addition of a new architecture and
reduce the amount of boilerplate code to be written. nmtpy has been used for
LIUM's top-ranked submissions to WMT Multimodal Machine Translation and News
Translation tasks in 2016 and 2017.
Comment: 10 pages, 3 figures
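The decoupling idea can be pictured as a trainer that depends only on a tiny model interface, so a new sequence-to-sequence architecture can be added without touching the training or inference loop. The sketch below is ours (written in PyTorch for brevity) and does not reflect nmtpy's actual Theano-based API.

```python
# Illustrative sketch (not nmtpy's API) of decoupling network specification
# from the generic training utilities.
import torch

class Seq2SeqModel(torch.nn.Module):
    """Minimal interface a new architecture has to implement."""
    def compute_loss(self, batch):
        # Architecture-specific forward pass returning a scalar training loss.
        raise NotImplementedError

def train(model, data_loader, epochs=1, lr=1e-3):
    # The training loop knows nothing about the concrete architecture.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            optimizer.zero_grad()
            loss = model.compute_loss(batch)  # only the interface is assumed
            loss.backward()
            optimizer.step()
```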
Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images that are visually similar to the reference image while integrating the modifications expressed by the caption. Given
that recent research has demonstrated the efficacy of large-scale vision and
language pre-trained (VLP) models in various tasks, we rely on features from
the OpenAI CLIP model to tackle the considered task. We initially perform a
task-oriented fine-tuning of both CLIP encoders using the element-wise sum of
visual and textual features. Then, in the second stage, we train a Combiner
network that learns to combine the image-text features integrating the bimodal
information and providing combined features used to perform the retrieval. We
use contrastive learning in both stages of training. Starting from the bare
CLIP features as a baseline, experimental results show that the task-oriented
fine-tuning and the carefully crafted Combiner network are highly effective and
outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two
popular and challenging datasets for composed image retrieval. Code and
pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
Comment: Accepted in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
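The two ingredients named above, combining CLIP image and text features and training with a contrastive objective, can be sketched as follows. Dimensions, layer sizes, and the exact Combiner layout are assumptions; this is not the authors' released CLIP4Cir code, which is linked above.

```python
# Hedged sketch: a Combiner over pre-extracted CLIP features plus a batch-wise
# contrastive (InfoNCE-style) loss against the target-image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, clip_dim=640, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden), nn.ReLU(), nn.Linear(hidden, clip_dim)
        )

    def forward(self, image_feat, text_feat):
        # Element-wise sum keeps the bare CLIP signal; the MLP adds a learned
        # bimodal correction on top of it.
        combined = image_feat + text_feat + self.mlp(
            torch.cat([image_feat, text_feat], dim=-1)
        )
        return F.normalize(combined, dim=-1)

def contrastive_loss(query_feat, target_feat, temperature=0.07):
    # Matching (combined query, target image) pairs sit on the diagonal.
    logits = query_feat @ target_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```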