20 research outputs found
MIR-GAN: Refining Frame-Level Modality-Invariant Representations with Adversarial Network for Audio-Visual Speech Recognition
Towards Balanced Active Learning for Multimodal Classification
Training multimodal networks requires a vast amount of data due to their
larger parameter space compared to unimodal networks. Active learning is a
widely used technique for reducing data annotation costs by selecting only
those samples that could contribute to improving model performance. However,
current active learning strategies are mostly designed for unimodal tasks, and
when applied to multimodal data, they often result in biased sample selection
from the dominant modality. This unfairness hinders balanced multimodal
learning, which is crucial for achieving optimal performance. To address this
issue, we propose three guidelines for designing a more balanced multimodal
active learning strategy. Following these guidelines, a novel approach is
proposed to achieve fairer data selection by modulating the gradient
embedding with the dominance degree among modalities. Our studies demonstrate
that the proposed method achieves more balanced multimodal learning by avoiding
greedy sample selection from the dominant modality. Our approach outperforms
existing active learning strategies on a variety of multimodal classification
tasks. Overall, our work highlights the importance of balancing sample
selection in multimodal active learning and provides a practical solution for
achieving more balanced active learning for multimodal classification. Comment: 12 pages, accepted by ACMMM 2023
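The abstract does not spell out how the gradient embeddings are modulated, so the following is a minimal NumPy sketch of one plausible reading: BADGE-style per-modality gradient embeddings are scaled by one minus each modality's dominance (estimated here from unimodal confidences) before greedy farthest-first selection. All names (`dominance_degree`, `modulated_embeddings`, `farthest_first_selection`) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only -- not the authors' implementation.
import numpy as np

def dominance_degree(conf_a: np.ndarray, conf_b: np.ndarray) -> np.ndarray:
    """Hypothetical per-sample dominance of modality A over modality B,
    estimated from unimodal prediction confidences in [0, 1]."""
    return conf_a / (conf_a + conf_b + 1e-8)

def modulated_embeddings(grad_a, grad_b, conf_a, conf_b):
    """Scale each modality's gradient embedding by one minus its dominance,
    so the dominant modality contributes less to sample selection."""
    dom_a = dominance_degree(conf_a, conf_b)[:, None]
    dom_b = 1.0 - dom_a
    return np.concatenate([(1.0 - dom_a) * grad_a, (1.0 - dom_b) * grad_b], axis=1)

def farthest_first_selection(emb: np.ndarray, budget: int) -> list:
    """Greedy k-center selection on the modulated embeddings."""
    selected = [int(np.argmax(np.linalg.norm(emb, axis=1)))]
    dists = np.linalg.norm(emb - emb[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

# Toy usage on random data.
rng = np.random.default_rng(0)
ga, gb = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
ca, cb = rng.uniform(size=100), rng.uniform(size=100)
print(farthest_first_selection(modulated_embeddings(ga, gb, ca, cb), budget=10))
```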
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Audio-visual speech recognition (AVSR) has achieved remarkable success in
improving the noise-robustness of speech recognition. Mainstream methods
focus on fusing audio and visual inputs to obtain modality-invariant
representations. However, such representations are prone to over-reliance on
the audio modality, as it is much easier to recognize than the video modality
in clean conditions. As a result, the AVSR model underestimates the importance
of the visual stream in the face of noise corruption. To this end, we leverage visual
modality-specific representations to provide stable complementary information
for the AVSR task. Specifically, we propose a reinforcement learning (RL) based
framework called MSRL, where the agent dynamically harmonizes
modality-invariant and modality-specific representations in the auto-regressive
decoding process. We customize a reward function directly related to
task-specific metrics (i.e., word error rate), which encourages the MSRL to
effectively explore the optimal integration strategy. Experimental results on
the LRS3 dataset show that the proposed method achieves state-of-the-art
performance in both clean and various noisy conditions. Furthermore, we demonstrate
that the MSRL system generalizes better than other baselines when the test set
contains unseen noises. Comment: Accepted by AAAI 2023
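As a rough illustration of a reward "directly related to word error rate," the Python sketch below computes WER from Levenshtein distance and rewards the agent for improving over a decode that uses only modality-invariant representations. The reward shaping and the function names are assumptions made for this example, not the MSRL implementation.

```python
# Minimal WER-based reward sketch -- not the authors' MSRL code.
def edit_distance(ref, hyp):
    """Standard Levenshtein distance over word lists."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def reward(ref: str, fused_hyp: str, invariant_hyp: str) -> float:
    """Reward the agent for lowering WER relative to decoding with
    modality-invariant representations alone (an assumed shaping)."""
    return wer(ref, invariant_hyp) - wer(ref, fused_hyp)

print(reward("the cat sat", "the cat sat", "the bat sat"))  # positive reward
```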
Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection
Audio-visual deepfake detection scrutinizes manipulations in public videos
using complementary multimodal cues. Current methods, which train on fused
multimodal data for multimodal targets, face challenges due to uncertainties and
inconsistencies in the learned representations caused by independent modality
manipulations in deepfake videos. To address this, we propose cross-modality
and within-modality regularization to preserve modality distinctions during
multimodal representation learning. Our approach includes an audio-visual
transformer module for modality correspondence and a cross-modality
regularization module to align paired audio-visual signals, preserving modality
distinctions. Simultaneously, a within-modality regularization module refines
unimodal representations with modality-specific targets to retain
modality-specific details. Experimental results on the public audio-visual
dataset, FakeAVCeleb, demonstrate the effectiveness and competitiveness of our
approach. Comment: Accepted by ICASSP 2024
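A minimal PyTorch sketch of the two regularization ideas described in the abstract is given below; the cosine-based cross-modal term, the per-stream real/fake targets, and the `align_weight` hyperparameter are illustrative assumptions, not the released method.

```python
# Illustrative sketch of cross- and within-modality regularization terms.
import torch
import torch.nn.functional as F

def cross_modality_regularization(audio_emb, visual_emb):
    """Pull paired audio/visual embeddings together (1 - cosine similarity)."""
    return (1.0 - F.cosine_similarity(audio_emb, visual_emb, dim=-1)).mean()

def within_modality_regularization(audio_logits, visual_logits,
                                   audio_labels, visual_labels):
    """Supervise each stream with its own modality-specific real/fake label."""
    return (F.cross_entropy(audio_logits, audio_labels)
            + F.cross_entropy(visual_logits, visual_labels))

def total_loss(fused_logits, labels, audio_emb, visual_emb,
               audio_logits, visual_logits, audio_labels, visual_labels,
               align_weight=0.1):
    return (F.cross_entropy(fused_logits, labels)
            + align_weight * cross_modality_regularization(audio_emb, visual_emb)
            + within_modality_regularization(audio_logits, visual_logits,
                                             audio_labels, visual_labels))

# Toy usage with random tensors.
B, D = 4, 128
fused_logits, audio_logits, visual_logits = (torch.randn(B, 2) for _ in range(3))
audio_emb, visual_emb = torch.randn(B, D), torch.randn(B, D)
labels = torch.randint(0, 2, (B,))
print(total_loss(fused_logits, labels, audio_emb, visual_emb,
                 audio_logits, visual_logits, labels, labels).item())
```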
Paeonol Ameliorates Diabetic Renal Fibrosis Through Promoting the Activation of the Nrf2/ARE Pathway via Up-Regulating Sirt1
Diabetic nephropathy (DN) is rapidly becoming the leading cause of end-stage renal disease worldwide and a major cause of morbidity and mortality in patients with diabetes. The main pathological change in DN is renal fibrosis. Paeonol (PA), a single phenolic compound extracted from the root bark of Cortex Moutan, has been demonstrated to have many potential pharmacological activities. However, the effects of PA on DN have not been fully elucidated. In this study, high glucose (HG)-treated glomerular mesangial cells (GMCs) and streptozotocin (STZ)-induced diabetic mice were used to explore the potential mechanisms of PA in DN. The in vitro results showed that: (1) PA inhibited HG-induced overexpression of fibronectin (FN) and ICAM-1; (2) PA exerted a renoprotective effect through activating the Nrf2/ARE pathway; (3) Sirt1 mediated the effects of PA on the activation of the Nrf2/ARE pathway. Moreover, in accordance with the in vitro results, significantly elevated levels of Sirt1, Nrf2 and downstream proteins related to Nrf2 were observed in the kidneys of the PA treatment group compared with the model group. Taken together, our study shows that PA delays the progression of diabetic renal fibrosis, and the underlying mechanism is probably associated with regulation of the Nrf2 pathway. The effect of PA on Nrf2 is at least partially dependent on Sirt1 activation.
Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition
Audio-visual speech recognition (AVSR) research has achieved great success
recently by improving the noise-robustness of audio-only automatic speech
recognition (ASR) with noise-invariant visual information. However, most
existing AVSR approaches simply fuse the audio and visual features by
concatenation, without explicit interactions to capture the deep correlations
between them, which results in sub-optimal multimodal representations for
the downstream speech recognition task. In this paper, we propose a cross-modal
global interaction and local alignment (GILA) approach for AVSR, which captures
the deep audio-visual (A-V) correlations from both global and local
perspectives. Specifically, we design a global interaction model to capture the
A-V complementary relationship on modality level, as well as a local alignment
approach to model the A-V temporal consistency on frame level. Such a holistic
view of cross-modal correlations enables better multimodal representations for
AVSR. Experiments on public benchmarks LRS3 and LRS2 show that our GILA
outperforms the supervised learning state-of-the-art. Comment: 12 pages, 5 figures, Accepted by IJCAI 2023
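For intuition, here is a small PyTorch sketch of a modality-level interaction block plus a frame-level consistency loss in the spirit of the abstract; the cross-attention design, the tensor dimensions, and the cosine loss are assumptions for illustration and do not reproduce the GILA architecture.

```python
# Sketch of global (modality-level) interaction and local (frame-level) alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInteraction(nn.Module):
    """Modality-level interaction: each stream attends to the other."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):            # (B, T, D) each
        audio_ctx, _ = self.a2v(audio, visual, visual)
        visual_ctx, _ = self.v2a(visual, audio, audio)
        return audio + audio_ctx, visual + visual_ctx

def local_alignment_loss(audio, visual):
    """Frame-level A-V temporal consistency via cosine similarity."""
    return (1.0 - F.cosine_similarity(audio, visual, dim=-1)).mean()

# Toy usage.
audio, visual = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
a, v = GlobalInteraction()(audio, visual)
print(local_alignment_loss(a, v).item())
```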
UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning
Multimodal learning aims to imitate human beings in acquiring complementary
information from multiple modalities for various downstream tasks. However,
traditional aggregation-based multimodal fusion methods ignore the
inter-modality relationship, treat each modality equally, suffer from sensor noise,
and thus degrade multimodal learning performance. In this work, we propose a
novel multimodal contrastive method to explore more reliable multimodal
representations under the weak supervision of unimodal prediction.
Specifically, we first capture task-related unimodal representations and the
unimodal predictions from the introduced unimodal prediction task. Then the
unimodal representations are aligned with the more effective one by the
designed multimodal contrastive method under the supervision of the unimodal
predictions. Experimental results with fused features on two image-text
classification benchmarks, UPMC-Food-101 and N24News, show that our proposed
Unimodality-Supervised MultiModal Contrastive (UniS-MMC) learning method
outperforms current state-of-the-art multimodal methods. The detailed ablation
study and analysis further demonstrate the advantage of our proposed method. Comment: ACL 2023 Findings
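The PyTorch sketch below illustrates one simplified reading of "aligning the unimodal representations with the more effective one under the supervision of the unimodal predictions": the modality whose unimodal prediction is more confident on the true label serves as a detached anchor, and only the weaker modality's embedding is pulled toward it. This is an illustrative assumption, not the published UniS-MMC loss.

```python
# Simplified unimodality-supervised alignment sketch -- not the UniS-MMC code.
import torch
import torch.nn.functional as F

def unis_align_loss(img_emb, txt_emb, img_probs, txt_probs, labels):
    """Treat the modality with the higher true-label probability as a
    detached anchor; pull the weaker modality's embedding toward it."""
    img_conf = img_probs.gather(1, labels[:, None]).squeeze(1)
    txt_conf = txt_probs.gather(1, labels[:, None]).squeeze(1)
    img_is_anchor = (img_conf >= txt_conf).float()[:, None]
    anchor = (img_is_anchor * img_emb + (1 - img_is_anchor) * txt_emb).detach()
    weaker = (1 - img_is_anchor) * img_emb + img_is_anchor * txt_emb
    return (1.0 - F.cosine_similarity(weaker, anchor, dim=-1)).mean()

# Toy usage with random tensors.
B, D, C = 4, 32, 5
img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
img_probs = torch.softmax(torch.randn(B, C), dim=1)
txt_probs = torch.softmax(torch.randn(B, C), dim=1)
labels = torch.randint(0, C, (B,))
print(unis_align_loss(img_emb, txt_emb, img_probs, txt_probs, labels).item())
```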