User Diverse Preference Modeling by Multimodal Attentive Metric Learning
Most existing recommender systems represent a user's preference with a
feature vector, which is assumed to be fixed when predicting this user's
preferences for different items. However, the same vector cannot accurately
capture a user's varying preferences on all items, especially when considering
the diverse characteristics of various items. To tackle this problem, in this
paper, we propose a novel Multimodal Attentive Metric Learning (MAML) method to
model users' diverse preferences for various items. In particular, for each
user-item pair, we propose an attention neural network, which exploits the
item's multimodal features to estimate the user's special attention to
different aspects of this item. The obtained attention is then integrated into
a metric-based learning method to predict the user's preference for this item. The
advantage of metric learning is that it can naturally overcome the problem of
dot product similarity, which is adopted by matrix factorization (MF) based
recommendation models but does not satisfy the triangle inequality property. In
addition, it is worth mentioning that the attention mechanism not only helps
model users' diverse preferences towards different items, but also overcomes the
geometric restrictiveness caused by collaborative metric learning.
Extensive experiments on large-scale real-world datasets show that our model
can substantially outperform the state-of-the-art baselines, demonstrating the
potential of modeling users' diverse preferences for recommendation. Comment:
Accepted by ACM Multimedia 2019 as a full paper.
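To make the core idea concrete, here is a minimal sketch, not the authors' released architecture, of how an attention network over precomputed multimodal item features can re-weight each dimension of a user-item distance, so a single user vector expresses different preferences for different items. Dimension sizes, layer choices, and the feature input are illustrative assumptions.

```python
# Sketch of attentive metric learning (assumed layout, not the MAML paper's code).
import torch
import torch.nn as nn

class AttentiveMetricScorer(nn.Module):
    def __init__(self, n_users, n_items, dim=64, mm_dim=128):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # hypothetical attention net over concatenated multimodal item features
        self.attn = nn.Sequential(nn.Linear(mm_dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim), nn.Softmax(dim=-1))

    def forward(self, users, items, mm_feats):
        u = self.user_emb(users)       # (B, dim)
        v = self.item_emb(items)       # (B, dim)
        a = self.attn(mm_feats)        # (B, dim), per-dimension attention weights
        # attention-weighted squared Euclidean distance; smaller = better match
        dist = (a * (u - v) ** 2).sum(-1)
        return -dist                   # higher score = stronger predicted preference
```

In a ranking setup, such scores would typically be trained with a hinge or pairwise loss over observed versus unobserved items; that part is omitted here.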
Multi-Behavior Hypergraph-Enhanced Transformer for Sequential Recommendation
Learning dynamic user preference has become an increasingly important
component for many online platforms (e.g., video-sharing sites, e-commerce
systems) to make sequential recommendations. Previous works have made many
efforts to model item-item transitions over user interaction sequences, based
on various architectures, e.g., recurrent neural networks and self-attention
mechanism. Recently emerged graph neural networks also serve as useful backbone
models to capture item dependencies in sequential recommendation scenarios.
Despite their effectiveness, existing methods have so far focused on item
sequence representation with a single type of interaction, and are thus limited
in capturing the dynamic heterogeneous relational structures between users and
items (e.g., page view, add-to-favorite, purchase). To tackle this challenge, we
design a Multi-Behavior Hypergraph-enhanced Transformer framework (MBHT) to
capture both short-term and long-term cross-type behavior dependencies.
Specifically, a multi-scale Transformer is equipped with low-rank
self-attention to jointly encode behavior-aware sequential patterns from
fine-grained and coarse-grained levels. Additionally, we incorporate the global
multi-behavior dependency into the hypergraph neural architecture to capture
the hierarchical long-range item correlations in a customized manner.
Experimental results demonstrate the superiority of our MBHT over various
state-of-the-art recommendation solutions across different settings. Further
ablation studies validate the effectiveness of our model design and benefits of
the new MBHT framework. Our implementation code is released at:
https://github.com/yuh-yang/MBHT-KDD22. Comment: Published as a KDD'22 full paper.
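The low-rank self-attention component can be illustrated with a rough sketch in the style of Linformer-type sequence compression; this is an assumption for exposition, not the released MBHT code at the URL above. Keys and values are projected onto a small number of latent positions so that attention over long behavior sequences stays cheap.

```python
# Sketch of low-rank (length-compressed) self-attention; all sizes are illustrative.
import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    def __init__(self, dim=64, n_heads=4, rank=16, max_len=200):
        super().__init__()
        self.h, self.dk = n_heads, dim // n_heads
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        # project the sequence axis of K and V down to `rank` landmark positions
        self.proj_k = nn.Linear(max_len, rank, bias=False)
        self.proj_v = nn.Linear(max_len, rank, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, L, dim), L must equal max_len
        B, L, _ = x.shape
        q = self.q(x).view(B, L, self.h, self.dk).transpose(1, 2)   # (B, h, L, dk)
        k = self.k(x).view(B, L, self.h, self.dk).transpose(1, 2)
        v = self.v(x).view(B, L, self.h, self.dk).transpose(1, 2)
        k = self.proj_k(k.transpose(-1, -2)).transpose(-1, -2)      # (B, h, rank, dk)
        v = self.proj_v(v.transpose(-1, -2)).transpose(-1, -2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.dk ** 0.5, dim=-1)
        z = (attn @ v).transpose(1, 2).reshape(B, L, -1)            # (B, L, dim)
        return self.out(z)
```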
Formalizing Multimedia Recommendation through Multimodal Deep Learning
Recommender systems (RSs) offer personalized navigation experiences on online
platforms, but recommendation remains a challenging task, particularly in
specific scenarios and domains. Multimodality can help tap into richer
information sources and construct more refined user/item profiles for
recommendations. However, existing literature lacks a shared and universal
schema for modeling and solving the recommendation problem through the lens of
multimodality. This work aims to formalize a general multimodal schema for
multimedia recommendation. It provides a comprehensive literature review of
multimodal approaches for multimedia recommendation from the last eight years,
outlines the theoretical foundations of a multimodal pipeline, and demonstrates
its rationale by applying it to selected state-of-the-art approaches. The work
also conducts a benchmarking analysis of recent algorithms for multimedia
recommendation within Elliot, a rigorous framework for evaluating recommender
systems. The main aim is to provide guidelines for designing and implementing
the next generation of multimodal approaches in multimedia recommendation.
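As a purely illustrative complement to the pipeline discussion, and not the paper's formal schema, the following sketch shows one common way a multimodal item representation is built and scored: pretrained visual and textual features (dimensions assumed) are projected into a shared space, fused by concatenation, and matched against user embeddings.

```python
# Generic multimodal item fusion and scoring; an assumed sketch, not the paper's schema.
import torch
import torch.nn as nn

class MultimodalItemFusion(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768, dim=64):
        super().__init__()
        self.vis = nn.Linear(visual_dim, dim)
        self.txt = nn.Linear(text_dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)    # simple fusion by concatenation

    def forward(self, visual_feats, text_feats):
        v = torch.relu(self.vis(visual_feats))
        t = torch.relu(self.txt(text_feats))
        return self.fuse(torch.cat([v, t], dim=-1))   # fused item representation

def score(user_emb, item_repr):
    # inner-product scoring of user embeddings against fused item representations
    return (user_emb * item_repr).sum(-1)
```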
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation
In recent years, audio-driven 3D facial animation has gained significant
attention, particularly in applications such as virtual reality, gaming, and
video conferencing. However, accurately modeling the intricate and subtle
dynamics of facial expressions remains a challenge. Most existing studies
treat the facial animation task as a single regression problem and thus often
fail to capture the intrinsic inter-modal relationship between speech signals
and 3D facial animation, overlooking their inherent consistency. Moreover, due
to the limited availability of 3D audio-visual datasets, approaches trained on
small samples generalize poorly, which degrades performance. To address these
issues, in this study, we propose a cross-modal dual-learning framework, termed
DualTalker, which aims to improve data usage efficiency and to capture
cross-modal dependencies. The framework is
trained jointly with the primary task (audio-driven facial animation) and its
dual task (lip reading) and shares common audio/motion encoder components. Our
joint training framework facilitates more efficient data usage by leveraging
information from both tasks and explicitly capitalizing on the complementary
relationship between facial motion and audio to improve performance.
Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate
the potential over-smoothing underlying the cross-modal complementary
representations, enhancing the mapping of subtle facial expression dynamics.
Through extensive experiments and a perceptual user study conducted on the VOCA
and BIWI datasets, we demonstrate that our approach outperforms current
state-of-the-art methods both qualitatively and quantitatively. We have made
our code and video demonstrations available at
https://github.com/sabrina-su/iadf.git
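A conceptual sketch of the dual-task setup described above is given below; all module choices (GRU encoders, linear decoders, loss weights) are assumptions for illustration rather than the released implementation. The primary branch maps audio to facial motion, the dual branch maps motion to a lip-reading target, and an auxiliary consistency loss ties the two latent spaces together.

```python
# Dual-task training with shared latent spaces; hypothetical modules, not DualTalker's code.
import torch
import torch.nn as nn

class DualModel(nn.Module):
    def __init__(self, audio_dim=80, motion_dim=70, latent=128, vocab=40):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, latent, batch_first=True)
        self.motion_enc = nn.GRU(motion_dim, latent, batch_first=True)
        self.motion_dec = nn.Linear(latent, motion_dim)   # primary: audio -> motion
        self.lip_dec = nn.Linear(latent, vocab)           # dual: motion -> phonemes

    def forward(self, audio, motion):
        za, _ = self.audio_enc(audio)          # (B, T, latent)
        zm, _ = self.motion_enc(motion)        # (B, T, latent)
        return self.motion_dec(za), self.lip_dec(zm), za, zm

def training_losses(model, audio, motion, phoneme_targets):
    pred_motion, pred_phon, za, zm = model(audio, motion)
    l_primary = nn.functional.mse_loss(pred_motion, motion)
    l_dual = nn.functional.cross_entropy(pred_phon.transpose(1, 2), phoneme_targets)
    # auxiliary cross-modal consistency between audio and motion latents
    l_consist = nn.functional.mse_loss(za, zm)
    return l_primary + l_dual + 0.1 * l_consist   # 0.1 is an assumed weight
```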
On Popularity Bias of Multimodal-aware Recommender Systems: a Modalities-driven Analysis
Multimodal-aware recommender systems (MRSs) exploit multimodal content (e.g.,
product images or descriptions) as items' side information to improve
recommendation accuracy. While most of such methods rely on factorization
models (e.g., MFBPR) as base architecture, it has been shown that MFBPR may be
affected by popularity bias, meaning that it inherently tends to boost the
recommendation of popular (i.e., short-head) items to the detriment of niche
(i.e., long-tail) items from the catalog. Motivated by this assumption, in this
work, we provide one of the first analyses on how multimodality in
recommendation could further amplify popularity bias. Concretely, we evaluate
the performance of four state-of-the-art MRSs algorithms (i.e., VBPR, MMGCN,
GRCN, LATTICE) on three datasets from Amazon by assessing, along with
recommendation accuracy metrics, performance measures accounting for the
diversity of recommended items and the portion of retrieved niche items. To
better investigate this aspect, we decide to study the separate influence of
each modality (i.e., visual and textual) on popularity bias in different
evaluation dimensions. Results, which demonstrate how a single modality may
amplify the negative effect of popularity bias, shed light on the importance of
providing a more rigorous analysis of the performance of such models.
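One of the measures such an analysis relies on, the share of recommended items that fall in the long tail, can be sketched as follows; the 20% head cutoff and the data structures are assumptions for illustration, not the paper's evaluation code.

```python
# Hypothetical helper: fraction of recommended items outside the most-popular head.
from collections import Counter

def long_tail_share(recommendations, train_interactions, head_fraction=0.2):
    """recommendations: {user: [item, ...]}; train_interactions: list of (user, item)."""
    counts = Counter(item for _, item in train_interactions)
    ranked = [item for item, _ in counts.most_common()]
    head = set(ranked[: int(len(ranked) * head_fraction)])   # short-head items
    rec_items = [i for recs in recommendations.values() for i in recs]
    return sum(i not in head for i in rec_items) / max(len(rec_items), 1)
```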
Online Distillation-enhanced Multi-modal Transformer for Sequential Recommendation
Multi-modal recommendation systems, which integrate diverse types of
information, have gained widespread attention in recent years. However,
compared to traditional collaborative filtering-based multi-modal
recommendation systems, research on multi-modal sequential recommendation is
still in its nascent stages. Unlike traditional sequential recommendation
models that solely rely on item identifier (ID) information and focus on
network structure design, multi-modal recommendation models need to emphasize
item representation learning and the fusion of heterogeneous data sources. This
paper investigates the impact of item representation learning on downstream
recommendation tasks and examines the disparities in information fusion at
different stages. Empirical experiments are conducted to demonstrate the need
to design a framework suitable for collaborative learning and fusion of diverse
information. Based on this, we propose a new model-agnostic framework for
multi-modal sequential recommendation tasks, called Online
Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature
interaction and mutual learning among multi-source input (ID, text, and image),
while avoiding conflicts among different features during training, thereby
improving recommendation accuracy. To be specific, we first introduce an
ID-aware Multi-modal Transformer module in the item representation learning
stage to facilitate information interaction among different features. Secondly,
we employ an online distillation training strategy in the prediction
optimization stage to make multi-source data learn from each other and improve
prediction robustness. Experimental results on a video content recommendation
dataset and three e-commerce recommendation datasets demonstrate the
effectiveness of the two proposed modules, which together yield an improvement
of approximately 10% in performance over baseline models. Comment: 11 pages, 7
figures.
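The online distillation idea can be illustrated with a schematic loss, assumed rather than taken from the authors' implementation: each modality-specific head (ID, text, image) is trained on the task loss and additionally pulled toward the softened ensemble of all heads' predictions.

```python
# Schematic online (mutual) distillation loss over multiple prediction heads; assumed form.
import torch
import torch.nn.functional as F

def online_distillation_loss(head_logits, targets, temperature=2.0, alpha=0.5):
    """head_logits: list of (B, n_items) logits from ID / text / image heads."""
    task = sum(F.cross_entropy(lg, targets) for lg in head_logits)
    # soft ensemble teacher: average of softened probabilities over all heads
    with torch.no_grad():
        teacher = torch.stack([F.softmax(lg / temperature, dim=-1)
                               for lg in head_logits]).mean(0)
    distill = sum(F.kl_div(F.log_softmax(lg / temperature, dim=-1), teacher,
                           reduction='batchmean') * temperature ** 2
                  for lg in head_logits)
    return task + alpha * distill   # alpha is an assumed trade-off weight
```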