Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering
Multi-modal keyphrase generation aims to produce a set of keyphrases that
represent the core points of the input text-image pair. In this regard,
dominant methods mainly focus on multi-modal fusion for keyphrase generation.
Nevertheless, two main drawbacks remain: 1) only a limited number of sources,
such as image captions, can be utilized to provide auxiliary information, and
they may not be sufficient for the subsequent keyphrase generation; 2) the
input text and image are often not perfectly matched, so the image may
introduce noise into the model. To address these
limitations, in this paper, we propose a novel multi-modal keyphrase generation
model, which not only enriches the model input with external knowledge, but
also effectively filters image noise. First, we introduce external visual
entities of the image as the supplementary input to the model, which benefits
the cross-modal semantic alignment for keyphrase generation. Second, we
simultaneously calculate an image-text matching score and image region-text
correlation scores to perform multi-granularity image noise filtering.
In particular, we introduce correlation scores between image regions and
ground-truth keyphrases to refine the calculation of the aforementioned
correlation scores. To demonstrate the effectiveness of our model, we conduct
several groups of experiments on the benchmark dataset.
Experimental results and in-depth analyses show that our model achieves
state-of-the-art performance. Our code is available at
https://github.com/DeepLearnXMU/MM-MKP.
Comment: Accepted in Proceedings of the 31st ACM International Conference on Multimedia (MM '23)
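
The abstract does not detail the filtering mechanism, so the following PyTorch sketch is only a rough illustration of the multi-granularity idea it describes: a global image-text matching score gates the whole image contribution, while region-text correlation scores reweight individual region features. All names (MultiGranularityFilter, text_vec, region_feats) are hypothetical and not taken from the paper.

import torch
import torch.nn as nn

class MultiGranularityFilter(nn.Module):
    # Illustrative sketch (not the paper's implementation): gate image
    # information by (i) a global image-text matching score and
    # (ii) per-region image-text correlation scores.
    def __init__(self, dim: int):
        super().__init__()
        self.global_score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.region_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, text_vec, region_feats):
        # text_vec: (B, D) pooled text representation
        # region_feats: (B, R, D) visual region features
        img_vec = region_feats.mean(dim=1)  # (B, D) pooled image vector
        s_global = torch.sigmoid(
            self.global_score(torch.cat([text_vec, img_vec], dim=-1)))  # (B, 1)
        coarse = region_feats * s_global.unsqueeze(1)  # coarse-grained gating
        q = self.text_proj(text_vec).unsqueeze(1)      # (B, 1, D)
        k = self.region_proj(coarse)                   # (B, R, D)
        s_region = torch.softmax(
            (q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)  # (B, R) region scores
        filtered = coarse * s_region.unsqueeze(-1)     # fine-grained reweighting
        return filtered, s_global, s_region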
Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to segment the target object
in a video sequence described by a language expression. Typical multimodal
Transformer-based RVOS approaches process the video sequence in a
frame-independent manner to reduce the high computational cost; however, this
restricts performance due to the lack of inter-frame interaction for temporal
coherence modeling and spatio-temporal representation learning of the referred
object.
Besides, the absence of sufficient cross-modal interactions results in weak
correlation between the visual and linguistic features, which increases the
difficulty of decoding the target information and limits the performance of the
model. In this paper, we propose a bidirectional correlation-driven inter-frame
interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight and plug-and-play inter-frame interaction
module in the Transformer decoder to efficiently learn the spatio-temporal
features of the referred object, so as to decode the object information in the
video sequence more precisely and generate more accurate segmentation results.
Moreover, a bidirectional vision-language interaction module is implemented
before the multimodal Transformer to enhance the correlation between the visual
and linguistic features, thus facilitating the language queries to decode more
precise object information from visual features and ultimately improving the
segmentation performance. Extensive experimental results on four benchmarks
validate the superiority of our BIFIT over state-of-the-art methods and the
effectiveness of our proposed modules.
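
As a non-authoritative sketch of the kind of lightweight, plug-and-play inter-frame interaction the abstract describes (module and tensor names here are assumed, not from the paper), per-frame object queries can exchange temporal information through a single attention layer inserted into the decoder:

import torch
import torch.nn as nn

class InterFrameInteraction(nn.Module):
    # Illustrative sketch: let each object query attend to its counterparts
    # in the other frames to model temporal coherence.
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries):
        # queries: (B, T, Q, D) object queries for T frames
        b, t, q, d = queries.shape
        x = queries.permute(0, 2, 1, 3).reshape(b * q, t, d)  # attend over time
        out, _ = self.attn(x, x, x)
        out = self.norm(x + out)  # residual connection
        return out.reshape(b, q, t, d).permute(0, 2, 1, 3)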
The role of HG in the analysis of temporal iteration and interaural correlation
Learning models for semantic classification of insufficient plantar pressure images
Establishing a reliable and stable model to predict a target from insufficient
labeled samples is feasible and effective, particularly for sensor-generated
data sets. Inspired by learning algorithms for insufficient data sets, such as
metric-based methods, prototype networks and meta-learning, we propose a
transfer-model learning method for insufficient data sets. Firstly, two basic
models for transfer learning are introduced, followed by a classification
system and calculation criteria. Secondly, a data set of plantar pressure for
comfort shoe design is acquired and preprocessed through a foot scan system;
using a pre-trained convolutional neural network (CNN) employing AlexNet and
CNN-based transfer modeling, the classification accuracy on the plantar
pressure images exceeds 93.5%. Finally, the proposed method is compared with
the current classifiers VGG, ResNet, AlexNet and a pre-trained CNN. Our work
is also compared with the known-scaling and shifting (SS) and unknown-plain
slot (PS) partition methods on the public test databases SUN, CUB, AWA1, AWA2
and aPY, using indices of precision (tr, ts, H) and time (training and
evaluation). The proposed method shows high performance on most indices for
the plantar pressure classification task when compared with other methods. The
transfer learning-based method can be applied to other insufficient data sets
in sensor imaging fields.
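
As a brief illustration of the CNN-based transfer modeling described above (the exact training setup in the paper may differ), a pre-trained AlexNet can be adapted to a small plantar pressure data set by freezing its convolutional features and retraining only the final classification layer; build_transfer_model and num_classes are assumed names:

import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes: int) -> nn.Module:
    # Illustrative sketch: reuse ImageNet-pretrained AlexNet features and
    # retrain only the last fully connected layer on the small data set.
    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)  # torchvision >= 0.13
    for p in model.features.parameters():
        p.requires_grad = False  # freeze the convolutional feature extractor
    model.classifier[6] = nn.Linear(4096, num_classes)  # new task-specific head
    return model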
Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching
Automatic emotion recognition is an active research topic with a wide range of
applications. Due to the high manual annotation cost and inevitable label
ambiguity, the development of emotion recognition datasets is limited in both
scale and quality. Therefore, one of the key challenges is how to build
effective models with limited data resources. Previous works have explored
different approaches to tackle this challenge, including data enhancement,
transfer learning, and semi-supervised learning. However, these existing
approaches suffer from weaknesses such as training instability, large
performance loss during transfer, or marginal improvement.
In this work, we propose a novel semi-supervised multi-modal emotion
recognition model based on cross-modality distribution matching, which
leverages abundant unlabeled data to enhance the model training under the
assumption that the inner emotional status is consistent at the utterance level
across modalities.
We conduct extensive experiments to evaluate the proposed model on two
benchmark datasets, IEMOCAP and MELD. The experimental results demonstrate
that the proposed semi-supervised learning model can effectively utilize
unlabeled data and combine multiple modalities to boost emotion recognition
performance, outperforming other state-of-the-art approaches under the same
conditions. The proposed model also achieves competitive performance compared
with existing approaches that take advantage of additional auxiliary
information such as speaker and interaction context.
Comment: 10 pages, 5 figures, to be published in ACM Multimedia 202
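
The abstract does not specify the matching objective, so the following is only a stand-in sketch of a cross-modal distribution-matching term: on unlabeled utterances, the emotion distributions predicted from two modalities are pulled together with a symmetric KL divergence (function and argument names are hypothetical):

import torch.nn.functional as F

def cross_modal_consistency(logits_audio, logits_text):
    # Illustrative stand-in for distribution matching on unlabeled data:
    # encourage per-utterance predictions of two modalities to agree.
    p = F.log_softmax(logits_audio, dim=-1)
    q = F.log_softmax(logits_text, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(audio || text)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(text || audio)
    return 0.5 * (kl_pq + kl_qp)

# Assumed usage: total loss = cross-entropy on labeled utterances
# + lambda_u * cross_modal_consistency(...) on unlabeled utterances.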
On Transforming Reinforcement Learning by Transformer: The Development Trajectory
Transformer, originally devised for natural language processing, has also
achieved significant success in computer vision. Thanks to its strong
expressive power, researchers are investigating ways to deploy transformers in
reinforcement learning (RL), and transformer-based models have demonstrated
their potential on representative RL benchmarks. In this paper, we collect and
dissect recent advances on transforming RL by transformer (transformer-based
RL, or TRL), in order to explore its development trajectory and future trends.
We group existing developments into two categories: architecture enhancement and
trajectory optimization, and examine the main applications of TRL in robotic
manipulation, text-based games, navigation and autonomous driving. For
architecture enhancement, these methods consider how to apply the powerful
transformer structure to RL problems under the traditional RL framework, which
model agents and environments much more precisely than deep RL methods, but
they are still limited by the inherent defects of traditional RL algorithms,
such as bootstrapping and "deadly triad". For trajectory optimization, these
methods treat RL problems as sequence modeling and train a joint state-action
model over entire trajectories under the behavior cloning framework, which are
able to extract policies from static datasets and fully use the long-sequence
modeling capability of the transformer. Given these advancements, extensions
and challenges in TRL are reviewed, and proposals for future directions are
discussed. We hope that this survey can provide a detailed introduction to TRL
and motivate future research in this rapidly developing field.
Comment: 26 page
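
To make the "trajectory optimization" category more concrete, the sketch below shows a minimal Decision-Transformer-style sequence model in PyTorch: return-to-go, state and action tokens are interleaved and a causal Transformer predicts actions by behavior cloning. Dimensions and names are illustrative assumptions, not drawn from any specific surveyed method.

import torch
import torch.nn as nn

class TrajectorySequenceModel(nn.Module):
    # Illustrative sketch: treat RL as sequence modeling over
    # (return-to-go, state, action) tokens and predict actions causally.
    def __init__(self, state_dim, act_dim, dim=128, layers=3, heads=4, max_len=60):
        super().__init__()
        self.embed_rtg = nn.Linear(1, dim)
        self.embed_state = nn.Linear(state_dim, dim)
        self.embed_act = nn.Linear(act_dim, dim)
        self.pos = nn.Embedding(3 * max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim), T <= max_len
        b, t, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_act(actions)],
            dim=2).reshape(b, 3 * t, -1)  # interleave (R, s, a) per timestep
        tokens = tokens + self.pos(torch.arange(3 * t, device=tokens.device))
        causal = torch.triu(
            torch.full((3 * t, 3 * t), float("-inf"), device=tokens.device), diagonal=1)
        h = self.encoder(tokens, mask=causal)  # causal self-attention
        return self.head(h[:, 1::3])           # action predictions from state tokens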
Plug-and-Play Regulators for Image-Text Matching
Exploiting fine-grained correspondence and visual-semantic alignments has
shown great potential in image-text matching. Generally, recent approaches
first employ a cross-modal attention unit to capture latent region-word
interactions, and then integrate all the alignments to obtain the final
similarity. However, most of them adopt one-time forward association or
aggregation strategies with complex architectures or additional information,
while ignoring the regulation ability of network feedback. In this paper, we
develop two simple but quite effective regulators which efficiently encode the
message output to automatically contextualize and aggregate cross-modal
representations. Specifically, we propose (i) a Recurrent Correspondence
Regulator (RCR) which facilitates the cross-modal attention unit progressively
with adaptive attention factors to capture more flexible correspondence, and
(ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation
weights repeatedly to increasingly emphasize important alignments and dilute
unimportant ones. Notably, RCR and RAR are plug-and-play: both can be
incorporated into many frameworks based on cross-modal interaction to obtain
significant benefits, and their cooperation achieves further improvements.
Extensive experiments on the MSCOCO and Flickr30K datasets validate that they
bring an impressive and consistent R@1 gain on multiple models, confirming the
general effectiveness and generalization ability of the proposed methods. Code
and pre-trained models are available at: https://github.com/Paranioar/RCAR
Comment: 13 pages, 9 figures, Accepted by TIP202
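
As a rough, non-authoritative sketch of the recurrent regulation idea (names and details are assumptions, not the released RCAR code), an aggregation regulator can repeatedly re-weight alignment features and feed the aggregated result back into the next round of weighting:

import torch
import torch.nn as nn

class RecurrentAggregationRegulator(nn.Module):
    # Illustrative sketch: iteratively adjust aggregation weights so that
    # important alignments are emphasized and unimportant ones diluted.
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, alignments):
        # alignments: (B, N, D) features of region-word alignments
        state = alignments.mean(dim=1)  # initial aggregate
        for _ in range(self.steps):
            w = torch.softmax(self.score(alignments + state.unsqueeze(1)), dim=1)  # (B, N, 1)
            agg = (w * alignments).sum(dim=1)  # re-weighted aggregation
            state = self.update(agg, state)    # feedback into the next round
        return state  # final representation used to compute similarity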