Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation
Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task, which
aims to generate a natural language sentence for a multimodal social post (an
image as well as its caption) to explain why it contains sarcasm. Although the
pioneering study has achieved great success with the BART backbone, it
overlooks the gap between the visual feature space and the decoder semantic
space, the object-level metadata of the image, and potential external
knowledge. To address these limitations, in this work, we propose a
novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme,
named TEAM. In particular, TEAM extracts object-level semantic metadata
from the input image instead of the traditional global visual features.
Meanwhile, TEAM resorts to ConceptNet to obtain related external knowledge
concepts for the input text and the extracted object metadata. Thereafter,
TEAM introduces a multi-source semantic graph that comprehensively characterizes
the multi-source (i.e., caption, object metadata, external knowledge) semantic
relations to facilitate sarcasm reasoning. Extensive experiments on the
publicly released MORE dataset verify the superiority of our model over
cutting-edge methods. Comment: Accepted to the ACL 2023 main conference.
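As a rough illustration of the multi-source idea, the following Python sketch (not the authors' code) assembles a toy graph whose node groups come from caption tokens, detected object labels, and external knowledge concepts; retrieve_concepts is a hypothetical stand-in for a ConceptNet lookup, and only object terms are expanded here for brevity.

import networkx as nx

def retrieve_concepts(term):
    # Placeholder for a ConceptNet query returning related concepts.
    toy_kb = {"rain": ["weather", "wet"], "umbrella": ["shelter"]}
    return toy_kb.get(term, [])

def build_semantic_graph(caption_tokens, object_labels):
    g = nx.Graph()
    # Caption tokens form one node group; adjacent tokens are linked.
    for i, tok in enumerate(caption_tokens):
        g.add_node(("caption", tok))
        if i > 0:
            g.add_edge(("caption", caption_tokens[i - 1]), ("caption", tok))
    # Object-level metadata forms a second group, linked to matching caption tokens.
    for obj in object_labels:
        g.add_node(("object", obj))
        for tok in caption_tokens:
            if tok == obj:
                g.add_edge(("object", obj), ("caption", tok))
        # External knowledge concepts form a third group, linked to their source term.
        for concept in retrieve_concepts(obj):
            g.add_node(("concept", concept))
            g.add_edge(("object", obj), ("concept", concept))
    return g

graph = build_semantic_graph(["what", "lovely", "rain"], ["rain", "umbrella"])
print(graph.number_of_nodes(), graph.number_of_edges())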
Discrete Factorization Machines for Fast Feature-based Recommendation
User and item features from side information are crucial for accurate
recommendation. However, the large number of feature dimensions, e.g., usually
larger than 10^7, results in expensive storage and computational cost. This
prohibits fast recommendation, especially on mobile applications where
computational resources are very limited. In this paper, we develop a generic
feature-based recommendation model, called Discrete Factorization Machine
(DFM), for fast and accurate recommendation. DFM binarizes the real-valued
model parameters (e.g., float32) of every feature embedding into binary codes
(e.g., boolean), and thus supports efficient storage and fast user-item score
computation. To avoid the severe quantization loss caused by binarization, we
propose a convergent updating rule that resolves the challenging discrete
optimization of DFM. Through extensive experiments on two real-world datasets,
we show that 1) DFM consistently outperforms state-of-the-art binarized
recommendation models, and 2) DFM shows very competitive performance compared
to its real-valued version (FM), demonstrating the minimized quantization loss.
Comment: Appeared in IJCAI 2018.
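A minimal sketch of why binary codes make scoring cheap (sign-based binarization is an assumption here; DFM learns the codes directly): the inner product of two {-1, +1} codes reduces to an XOR followed by a popcount.

def binarize(embedding):
    # Pack a sign-based binary code into a Python int (one bit per dimension).
    code = 0
    for i, v in enumerate(embedding):
        if v >= 0:
            code |= 1 << i
    return code

def hamming_similarity(code_a, code_b, dim):
    # Inner product of {-1, +1} codes recovered from the Hamming distance.
    dist = bin(code_a ^ code_b).count("1")
    return dim - 2 * dist

# Two active features (e.g., a user id and an item id) with 8-bit codes.
user_code = binarize([0.3, -1.2, 0.7, 0.1, -0.4, 0.9, -0.2, 0.5])
item_code = binarize([0.2, -0.8, 0.6, -0.3, -0.1, 1.1, 0.4, 0.2])
print(hamming_similarity(user_code, item_code, dim=8))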
Explicit Interaction Model towards Text Classification
Text classification is one of the fundamental tasks in natural language
processing. Recently, deep neural networks have achieved promising performance
in the text classification task compared to shallow models. Despite the
significance of deep models, they ignore fine-grained classification clues
(i.e., the matching signals between words and classes), since their
classifications mainly rely on text-level representations. To address this problem, we
introduce the interaction mechanism to incorporate word-level matching signals
into the text classification task. In particular, we design a novel framework,
EXplicit interAction Model (dubbed as EXAM), equipped with the interaction
mechanism. We evaluate the proposed approach on several benchmark datasets
including both multi-label and multi-class text classification tasks. Extensive
experimental results demonstrate the superiority of the proposed method. As a
byproduct, we have released the code and parameter settings to facilitate
other researchers. Comment: 8 pages.
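The snippet below is an illustrative rendering of the interaction mechanism (not the released code): each word representation is matched against every class representation, and the word-level matching signals are aggregated into class logits; the plain mean aggregation is a simplifying assumption.

import numpy as np

rng = np.random.default_rng(0)
num_words, num_classes, dim = 6, 4, 16
word_repr = rng.normal(size=(num_words, dim))     # encoder outputs for one text
class_repr = rng.normal(size=(num_classes, dim))  # learnable class embeddings

# Word-class interaction matrix: one matching signal per (word, class) pair.
interaction = word_repr @ class_repr.T            # shape (num_words, num_classes)

# Aggregate the fine-grained signals into text-level logits (EXAM uses a
# learned aggregation layer; a mean is used here only for illustration).
logits = interaction.mean(axis=0)
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)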
Simple to Complex Cross-modal Learning to Rank
The heterogeneity-gap between different modalities brings a significant
challenge to multimedia information retrieval. Some studies formalize the
cross-modal retrieval tasks as a ranking problem and learn a shared multi-modal
embedding space to measure the cross-modality similarity. However, previous
methods often establish the shared embedding space based on linear mapping
functions which might not be sophisticated enough to reveal more complicated
inter-modal correspondences. Additionally, current studies assume that the
rankings are of equal importance, and thus all rankings are used
simultaneously, or a small number of rankings are selected randomly to train
the embedding space at each iteration. Such strategies, however, always suffer
from outliers as well as reduced generalization capability due to their lack of
insight into the procedure of human cognition. In this paper, we
incorporate self-paced learning with diversity into cross-modal
learning to rank and learn an optimal multi-modal embedding space based on
non-linear mapping functions. This strategy enhances the model's robustness to
outliers and achieves better generalization via training the model gradually
from easy rankings by diverse queries to more complex ones. An efficient
alternating algorithm is employed to solve the proposed challenging problem
with fast convergence in practice. Extensive experimental results on several
benchmark datasets indicate that the proposed method achieves significant
improvements over state-of-the-art methods in the literature. Comment: 14 pages; Accepted by Computer Vision and Image Understanding.
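As a hedged illustration of the self-paced ingredient, the sketch below uses the common hard-weighting scheme: rankings whose loss falls below an age parameter are selected for training, and the parameter grows so that harder rankings enter later; the diversity term over queries described in the paper is omitted.

import numpy as np

def self_paced_weights(losses, age):
    # 1 for rankings that are currently "easy enough", 0 otherwise.
    return (losses <= age).astype(float)

rng = np.random.default_rng(1)
losses = rng.uniform(0.0, 2.0, size=10)  # per-ranking losses under the current model
age = 0.5
for step in range(4):
    weights = self_paced_weights(losses, age)
    # In training, only the selected rankings would contribute to the gradient.
    print(f"age={age:.2f}, selected={int(weights.sum())}/10")
    age *= 1.5                            # gradually admit more complex rankings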
General Debiasing for Multimodal Sentiment Analysis
Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal
information for prediction yet unavoidably suffers from fitting the spurious
correlations between multimodal features and sentiment labels. For example, if
most videos with a blue background have positive labels in a dataset, the model
will rely on such correlations for prediction, while "blue background" is not
a sentiment-related feature. To address this problem, we define a general
debiasing MSA task, which aims to enhance the Out-Of-Distribution (OOD)
generalization ability of MSA models by reducing their reliance on spurious
correlations. To this end, we propose a general debiasing framework based on
Inverse Probability Weighting (IPW), which adaptively assigns small weights to
the samples with larger bias (i.e., more severe spurious correlations). The key
to this debiasing framework is to estimate the bias of each sample, which is
achieved by two steps: 1) disentangling the robust features and biased features
in each modality, and 2) utilizing the biased features to estimate the bias.
Finally, we employ IPW to reduce the effects of large-biased samples,
facilitating robust feature learning for sentiment prediction. To examine the
model's generalization ability, we keep the original testing sets on two
benchmarks and additionally construct multiple unimodal and multimodal OOD
testing sets. The empirical results demonstrate the superior generalization
ability of our proposed framework. We have released the code and data to
facilitate reproduction.
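A minimal sketch of the IPW step under the assumption that a per-sample bias score is already available (e.g., from a bias-only predictor): larger estimated bias yields a smaller training weight.

import numpy as np

def ipw_weights(bias_scores, eps=1e-6):
    # Larger estimated bias -> smaller weight; normalized to keep the loss scale.
    w = 1.0 / (bias_scores + eps)
    return w / w.mean()

bias_scores = np.array([0.9, 0.2, 0.5, 0.8, 0.1])     # assumed bias estimates
per_sample_loss = np.array([0.7, 1.2, 0.9, 0.6, 1.5])
weighted_loss = (ipw_weights(bias_scores) * per_sample_loss).mean()
print(weighted_loss)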
Building Emotional Support Chatbots in the Era of LLMs
The integration of emotional support into various conversational scenarios
presents profound societal benefits in settings such as social interaction, mental health
counseling, and customer service. However, there are unsolved challenges that
hinder real-world applications in this field, including limited data
availability and the absence of well-accepted model training paradigms. This
work endeavors to navigate these challenges by harnessing the capabilities of
Large Language Models (LLMs). We introduce an innovative methodology that
synthesizes human insights with the computational prowess of LLMs to curate an
extensive emotional support dialogue dataset. Our approach is initiated with a
meticulously designed set of dialogues spanning diverse scenarios as generative
seeds. By utilizing the in-context learning potential of ChatGPT, we
recursively generate an ExTensible Emotional Support dialogue dataset, named
ExTES. Following this, we deploy advanced tuning techniques on the LLaMA model,
examining the impact of diverse training strategies, ultimately yielding an LLM
meticulously optimized for emotional support interactions. An exhaustive
assessment of the resultant model showcases its proficiency in offering
emotional support, marking a pivotal step in the realm of emotional support
bots and paving the way for subsequent research and implementations.
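A rough sketch of the recursive, seed-driven expansion loop described above (illustrative only; call_llm is a hypothetical stand-in for a ChatGPT request, and the prompt wording is an assumption):

import random

def call_llm(prompt):
    # Placeholder for an actual chat-completion request.
    return f"[generated dialogue conditioned on] {prompt[:60]}..."

def expand_dataset(seed_dialogues, scenarios, target_size, examples_per_prompt=2):
    pool = list(seed_dialogues)
    while len(pool) < target_size:
        # Sample in-context examples from the growing pool (recursive expansion).
        examples = random.sample(pool, k=min(examples_per_prompt, len(pool)))
        scenario = random.choice(scenarios)
        prompt = ("Given the example emotional support dialogues:\n"
                  + "\n---\n".join(examples)
                  + f"\nWrite a new support dialogue for the scenario: {scenario}")
        pool.append(call_llm(prompt))
    return pool

seeds = ["Seeker: I failed my exam... Supporter: That sounds really stressful..."]
data = expand_dataset(seeds, scenarios=["job loss", "break-up"], target_size=5)
print(len(data))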
Deep Convolutional Pooling Transformer for Deepfake Detection
Recently, Deepfake has drawn considerable public attention due to security
and privacy concerns in social media digital forensics. As the widely spreading
Deepfake videos on the Internet become more realistic, traditional detection
techniques have failed to distinguish between real and fake content. Most existing
deep learning methods mainly focus on local features and relations within the
face image using convolutional neural networks as a backbone. However, local
features and relations are insufficient for the model to learn enough
general information for Deepfake detection. Therefore, the existing Deepfake
detection methods have reached a bottleneck to further improve the detection
performance. To address this issue, we propose a deep convolutional Transformer
to incorporate the decisive image features both locally and globally.
Specifically, we apply convolutional pooling and re-attention to enrich the
extracted features and enhance efficacy. Moreover, we employ the barely
discussed image keyframes in model training for performance improvement and
visualize the feature quantity gap between the key and normal image frames
caused by video compression. We finally illustrate the transferability with
extensive experiments on several Deepfake benchmark datasets. The proposed
solution consistently outperforms several state-of-the-art baselines on both
within- and cross-dataset experiments. Comment: Accepted for publication in ACM TOMM.
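As a simplified illustration (a sketch under stated assumptions, not the paper's exact architecture), the block below mixes convolutional pooling with self-attention: a strided depthwise convolution shortens the key/value token sequence, injecting local structure and reducing attention cost.

import torch
import torch.nn as nn

class ConvPoolAttention(nn.Module):
    def __init__(self, dim, heads=4, pool_stride=2):
        super().__init__()
        # Depthwise conv over the sequence axis pools neighbouring tokens.
        self.pool = nn.Conv1d(dim, dim, kernel_size=3, stride=pool_stride,
                              padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: (batch, tokens, dim)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)
        # Queries keep full resolution; keys/values come from the pooled tokens.
        out, _ = self.attn(x, pooled, pooled)
        return out + x                        # residual connection

tokens = torch.randn(2, 196, 64)              # e.g., patch tokens of a face frame
print(ConvPoolAttention(64)(tokens).shape)    # torch.Size([2, 196, 64])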
- …