20 research outputs found

    Towards Balanced Active Learning for Multimodal Classification

    Full text link
    Training multimodal networks requires a vast amount of data due to their larger parameter space compared to unimodal networks. Active learning is a widely used technique for reducing data annotation costs by selecting only those samples that could contribute to improving model performance. However, current active learning strategies are mostly designed for unimodal tasks, and when applied to multimodal data, they often result in biased sample selection from the dominant modality. This unfairness hinders balanced multimodal learning, which is crucial for achieving optimal performance. To address this issue, we propose three guidelines for designing a more balanced multimodal active learning strategy. Following these guidelines, a novel approach is proposed to achieve fairer data selection by modulating the gradient embedding with the dominance degree among modalities. Our studies demonstrate that the proposed method achieves more balanced multimodal learning by avoiding greedy sample selection from the dominant modality. Our approach outperforms existing active learning strategies on a variety of multimodal classification tasks. Overall, our work highlights the importance of balancing sample selection in multimodal active learning and provides a practical solution for achieving more balanced active learning for multimodal classification. Comment: 12 pages, accepted by ACMMM 202
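
    As a rough illustration of the selection idea described above, the sketch below modulates per-modality gradient embeddings by a dominance score before running a greedy farthest-point batch selection. Everything here is an assumption for illustration: the dominance measure, the (1 - dominance) scaling, and the selection routine are stand-ins, not the paper's exact formulation.

```python
import torch

def modality_dominance(unimodal_losses):
    # Illustrative dominance degree: a modality that already achieves low loss
    # is treated as dominant; dominance is the inverse loss, normalized to sum to 1.
    losses = torch.stack(unimodal_losses)              # list of scalar loss tensors -> (num_modalities,)
    inv = 1.0 / (losses + 1e-8)
    return inv / inv.sum()

def balanced_gradient_embedding(per_modality_grad_embeds, dominance):
    # Down-weight the gradient embedding of the dominant modality so that greedy
    # selection stops favouring it (a sketch of the modulation idea, not the paper's rule).
    scaled = [g * (1.0 - d) for g, d in zip(per_modality_grad_embeds, dominance)]
    return torch.cat(scaled, dim=-1)                   # (num_unlabeled, total_dim)

def select_batch(embeddings, k):
    # Greedy farthest-point selection on the modulated embeddings, a simple
    # stand-in for the k-means++-style seeding used by BADGE-like strategies.
    chosen = [torch.randint(len(embeddings), (1,)).item()]
    min_d = torch.cdist(embeddings, embeddings[chosen]).squeeze(-1)
    while len(chosen) < k:
        nxt = torch.argmax(min_d).item()
        chosen.append(nxt)
        min_d = torch.minimum(min_d, torch.cdist(embeddings, embeddings[[nxt]]).squeeze(-1))
    return chosen
```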

    Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

    Full text link
    Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises. Comment: Accepted by AAAI 202
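
    A minimal sketch of the kind of mechanism the abstract describes: a WER-tied reward and a tiny policy that mixes modality-invariant and visual modality-specific features at each decoding step. The policy architecture, the sigmoid mixing weight, and the compute_wer helper are hypothetical; only the idea of a metric-based reward and dynamic harmonization comes from the abstract.

```python
import torch
import torch.nn as nn

def wer_reward(hyp_tokens, ref_tokens, compute_wer):
    # Reward tied directly to the task metric: lower word error rate -> higher reward.
    # `compute_wer` is a hypothetical edit-distance-based helper supplied by the caller.
    return 1.0 - compute_wer(hyp_tokens, ref_tokens)

class FusionAgent(nn.Module):
    # Tiny policy that, at each auto-regressive decoding step, emits a mixing weight
    # between the modality-invariant (fused) features and the visual modality-specific features.
    def __init__(self, dim):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, invariant_feat, specific_feat):
        # invariant_feat, specific_feat: (batch, dim) for the current decoding step
        alpha = torch.sigmoid(self.policy(torch.cat([invariant_feat, specific_feat], dim=-1)))
        return alpha * invariant_feat + (1.0 - alpha) * specific_feat, alpha
```

    A REINFORCE-style update would then scale the log-probabilities of the sampled hypothesis by this reward (minus a baseline), which is one common way to train such an agent.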

    Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

    Full text link
    Audio-visual deepfake detection scrutinizes manipulations in public videos using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and inconsistencies in the learned representations caused by independent modality manipulations in deepfake videos. To address this, we propose cross-modality and within-modality regularization to preserve modality distinctions during multimodal representation learning. Our approach includes an audio-visual transformer module for modality correspondence and a cross-modality regularization module that aligns paired audio-visual signals while preserving modality distinctions. Simultaneously, a within-modality regularization module refines unimodal representations with modality-specific targets to retain modality-specific details. Experimental results on the public audio-visual dataset FakeAVCeleb demonstrate the effectiveness and competitiveness of our approach. Comment: Accepted by ICASSP 202
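
    To make the two regularizers concrete, here is a minimal sketch assuming a symmetric InfoNCE loss for the cross-modality term and per-modality fake/real targets for the within-modality term; both choices, as well as the loss weights in the closing comment, are illustrative assumptions rather than the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def cross_modality_reg(audio_emb, video_emb, temperature=0.07):
    # Align paired audio-visual embeddings with a symmetric InfoNCE-style loss
    # (an illustrative choice of alignment objective).
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                   # (B, B) pairwise similarities
    targets = torch.arange(len(a), device=a.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def within_modality_reg(audio_logits, video_logits, audio_is_fake, video_is_fake):
    # Keep modality-specific detail by giving each unimodal branch its own target:
    # whether that particular modality was manipulated (illustrative target design).
    return (F.binary_cross_entropy_with_logits(audio_logits, audio_is_fake.float()) +
            F.binary_cross_entropy_with_logits(video_logits, video_is_fake.float()))

# Total objective (sketch): multimodal classification loss plus weighted regularizers,
# e.g. loss = ce_multimodal + lambda_x * cross_reg + lambda_w * within_reg
```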

    Paeonol Ameliorates Diabetic Renal Fibrosis Through Promoting the Activation of the Nrf2/ARE Pathway via Up-Regulating Sirt1

    No full text
    Diabetic nephropathy (DN) is rapidly becoming the leading cause of end-stage renal disease worldwide and a major cause of morbidity and mortality in patients with diabetes. The main pathological change in DN is renal fibrosis. Paeonol (PA), a single phenolic compound extracted from the root bark of Cortex Moutan, has been demonstrated to have many potential pharmacological activities. However, the effects of PA on DN have not been fully elucidated. In this study, high glucose (HG)-treated glomerular mesangial cells (GMCs) and streptozotocin (STZ)-induced diabetic mice were analyzed to explore the potential mechanisms of PA in DN. The in vitro results showed that: (1) PA inhibited HG-induced overexpression of fibronectin (FN) and ICAM-1; (2) PA exerted a renoprotective effect through activating the Nrf2/ARE pathway; (3) Sirt1 mediated the effects of PA on the activation of the Nrf2/ARE pathway. Moreover, consistent with the in vitro results, significantly elevated levels of Sirt1, Nrf2 and downstream proteins related to Nrf2 were observed in the kidneys of the PA treatment group compared with the model group. Taken together, our study shows that PA delays the progression of diabetic renal fibrosis, and the underlying mechanism is probably associated with regulation of the Nrf2 pathway. The effect of PA on Nrf2 is at least partially dependent on Sirt1 activation.

    Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

    Full text link
    Audio-visual speech recognition (AVSR) research has recently achieved great success by improving the noise-robustness of audio-only automatic speech recognition (ASR) with noise-invariant visual information. However, most existing AVSR approaches simply fuse the audio and visual features by concatenation, without explicit interactions to capture the deep correlations between them, which results in sub-optimal multimodal representations for the downstream speech recognition task. In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. Specifically, we design a global interaction model to capture the A-V complementary relationship at the modality level, as well as a local alignment approach to model the A-V temporal consistency at the frame level. Such a holistic view of cross-modal correlations enables better multimodal representations for AVSR. Experiments on the public benchmarks LRS3 and LRS2 show that our GILA outperforms the supervised learning state-of-the-art. Comment: 12 pages, 5 figures, Accepted by IJCAI 2023
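
    The sketch below illustrates one plausible reading of the two components: modality-level cross-attention in both directions for the global interaction, and a frame-level cosine objective for the local alignment. It assumes the audio and visual streams are already sampled to the same number of frames; the actual GILA architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInteraction(nn.Module):
    # Modality-level interaction via cross-attention in both directions
    # (a sketch of the global interaction idea, not the paper's exact design).
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):
        # audio, video: (batch, frames, dim)
        audio_ctx, _ = self.a2v(audio, video, video)   # audio queries attend to video
        video_ctx, _ = self.v2a(video, audio, audio)   # video queries attend to audio
        return audio + audio_ctx, video + video_ctx

def local_alignment_loss(audio, video):
    # Frame-level temporal consistency: pull time-aligned audio/video frames together
    # (illustrative cosine objective; assumes equal frame counts in both streams).
    a = F.normalize(audio, dim=-1)
    v = F.normalize(video, dim=-1)
    return 1.0 - (a * v).sum(dim=-1).mean()
```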

    UniS-MMC: Multimodal Classification via Unimodality-supervised Multimodal Contrastive Learning

    Full text link
    Multimodal learning aims to imitate human beings in acquiring complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore the inter-modality relationship, treat each modality equally, suffer from sensor noise, and thus reduce multimodal learning performance. In this work, we propose a novel multimodal contrastive method to explore more reliable multimodal representations under the weak supervision of unimodal prediction. Specifically, we first capture task-related unimodal representations and unimodal predictions from the introduced unimodal prediction task. Then the unimodal representations are aligned with the more effective one via the designed multimodal contrastive method under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks, UPMC-Food-101 and N24News, show that our proposed Unimodality-Supervised MultiModal Contrastive learning method (UniS-MMC) outperforms current state-of-the-art multimodal methods. A detailed ablation study and analysis further demonstrate the advantage of our proposed method. Comment: ACL 2023 Findings
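
    As a rough sketch of unimodality-supervised alignment, the function below pulls each modality towards the other only where the other modality's unimodal prediction is correct; the correctness-based weighting and the InfoNCE form are illustrative assumptions, not the exact UniS-MMC loss.

```python
import torch
import torch.nn.functional as F

def unimodality_supervised_contrastive(img_emb, txt_emb, img_logits, txt_logits,
                                       labels, temperature=0.1):
    # A modality whose unimodal prediction is correct acts as the anchor;
    # the other modality is pulled towards it (illustrative weighting scheme).
    img_ok = (img_logits.argmax(dim=-1) == labels).float()   # (B,) 1 where the image branch is correct
    txt_ok = (txt_logits.argmax(dim=-1) == labels).float()

    i = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    logits = i @ t.t() / temperature
    targets = torch.arange(len(i), device=i.device)

    # Pull image towards text only where the text branch is correct, and vice versa.
    img_to_txt = (F.cross_entropy(logits, targets, reduction='none') * txt_ok).mean()
    txt_to_img = (F.cross_entropy(logits.t(), targets, reduction='none') * img_ok).mean()
    return img_to_txt + txt_to_img
```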