26 research outputs found
Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition
Speech Emotion Recognition (SER) is a challenging task due to limited data
and blurred boundaries of certain emotions. In this paper, we present a
comprehensive approach to improving SER performance throughout the model
lifecycle, including pre-training, fine-tuning, and inference stages. To
address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0.
During fine-tuning, we propose a novel loss function that combines
cross-entropy loss with supervised contrastive learning loss to improve the
model's discriminative ability. This approach increases the inter-class
distances and decreases the intra-class distances, mitigating the issue of
blurred boundaries. Finally, to leverage the improved distances, we propose an
interpolation method at the inference stage that combines the model prediction
with the output from a k-nearest neighbors model. Our experiments on IEMOCAP
demonstrate that our proposed methods outperform current state-of-the-art
results.

Comment: Accepted by Interspeech 2023, poster
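The fine-tuning objective described above can be sketched in NumPy. This is a minimal illustration, not the paper's exact formulation: the temperature `temp`, the mixing weight `alpha`, and all function names are illustrative assumptions.

```python
import numpy as np

def supcon_loss(feats, labels, temp=0.1):
    """Supervised contrastive loss over a batch of L2-normalized features."""
    sims = feats @ feats.T / temp          # pairwise similarities
    n = len(labels)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        denom = sum(np.exp(sims[i, j]) for j in range(n) if j != i)
        # pull same-class features together, push different classes apart
        total += -sum(np.log(np.exp(sims[i, j]) / denom)
                      for j in positives) / len(positives)
    return total / n

def combined_loss(ce_loss, feats, labels, alpha=0.5):
    """Weighted sum of cross-entropy and supervised contrastive terms."""
    return alpha * ce_loss + (1 - alpha) * supcon_loss(feats, labels)
```

With well-separated same-class features the contrastive term is near zero, so the combined loss rewards exactly the larger inter-class and smaller intra-class distances the abstract describes.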
G2C: A Generator-to-Classifier Framework Integrating Multi-Stained Visual Cues for Pathological Glomerulus Classification
Pathological glomerulus classification plays a key role in the diagnosis of
nephropathy. As the difference between different subcategories is subtle,
doctors often refer to slides from different staining methods to make
decisions. However, creating correspondence across various stains is
labor-intensive, bringing major difficulties in collecting data and training a
vision-based algorithm to assist nephropathy diagnosis. This paper provides an
alternative solution for integrating multi-stained visual cues for glomerulus
classification. Our approach, named generator-to-classifier (G2C), is a
two-stage framework. Given an input image from a specified stain, several
generators are first applied to estimate its appearances in other staining
methods, and a classifier follows to combine visual cues from different stains
for prediction (whether it is pathological, or which type of pathology it has).
We optimize these two stages in a joint manner. To provide a reasonable
initialization, we pre-train the generators in an unlabeled reference set under
an unpaired image-to-image translation task, and then fine-tune them together
with the classifier. We conduct experiments on a glomerulus type classification
dataset collected by ourselves (there are no publicly available datasets for
this purpose). Although joint optimization slightly harms the authenticity of
the generated patches, it boosts classification performance, suggesting more
effective visual cues are extracted in an automatic way. We also transfer our
model to a public dataset for breast cancer classification, and significantly
outperform the state of the art.

Comment: Accepted by AAAI 201
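The two-stage data flow of G2C can be sketched as follows; `generators` and `classifier` are placeholder callables standing in for the trained networks, so only the pipeline structure is illustrated, not the actual models.

```python
import numpy as np

def g2c_forward(image, generators, classifier):
    """Two-stage G2C pass: synthesize other-stain views, then classify.

    `generators` and `classifier` are hypothetical callables; only the
    generator-to-classifier data flow from the abstract is shown.
    """
    # Stage 1: estimate the input's appearance under each other staining method
    views = [image] + [g(image) for g in generators]
    # Stage 2: the classifier fuses visual cues from all stains into one prediction
    return classifier(views)
```

In the paper both stages are optimized jointly after the generators are pre-trained on unpaired image-to-image translation; this sketch only fixes the inference-time wiring.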
RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting
Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the
perceptual quality of synthetic speech. While recent approaches using
pre-trained self-supervised learning (SSL) models have shown promising results,
they only partly address the data scarcity issue for the feature extractor,
leaving the issue unresolved for the decoder and causing suboptimal
performance. To address this challenge, we propose a
retrieval-augmented MOS prediction method, dubbed {\bf RAMP}, to enhance the
decoder's ability against the data scarcity issue. A fusing network is also
proposed to dynamically adjust the retrieval scope for each instance and the
fusion weights based on the predictive confidence. Experimental results show
that our proposed method outperforms the existing methods in multiple
scenarios.

Comment: Accepted by Interspeech 2023, oral
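The confidence-based fusion idea can be sketched as below. All names are illustrative assumptions: `decoder_conf` in [0, 1] stands in for the predictive confidence that drives the fusion weight, and the retrieval branch is reduced to a similarity-weighted average of retrieved MOS labels.

```python
import numpy as np

def ramp_predict(decoder_mos, decoder_conf, neighbor_mos, neighbor_sims):
    """Fuse a decoder's MOS prediction with a retrieval-based estimate."""
    sims = np.asarray(neighbor_sims, dtype=float)
    # retrieval branch: similarity-weighted average of retrieved MOS labels
    retrieval_mos = float(np.dot(sims, neighbor_mos) / sims.sum())
    # confidence-based weighting: trust the decoder more when it is confident
    return decoder_conf * decoder_mos + (1.0 - decoder_conf) * retrieval_mos
```

When confidence is high the decoder dominates; when it is low the prediction falls back toward the retrieved examples, which is how retrieval compensates for the decoder's data scarcity.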
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances
Despite the great progress of Visual Question Answering (VQA), current VQA
models heavily rely on the superficial correlation between the question type
and its corresponding frequent answers (i.e., language priors) to make
predictions, without really understanding the input. In this work, we define
the training instances with the same question type but different answers as
\textit{superficially similar instances}, and attribute the language priors to
the confusion of VQA model on such instances. To solve this problem, we propose
a novel training framework that explicitly encourages the VQA model to
distinguish between the superficially similar instances. Specifically, for each
training instance, we first construct a set that contains its superficially
similar counterparts. Then we exploit the proposed distinguishing module to
increase the distance between the instance and its counterparts in the answer
space. In this way, the VQA model is forced to further focus on the other parts
of the input beyond the question type, which helps to overcome the language
priors. Experimental results show that our method achieves the state-of-the-art
performance on VQA-CP v2. Codes are available at
\href{https://github.com/wyk-nku/Distinguishing-VQA.git}{Distinguishing-VQA}.

Comment: Published in COLING 202
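The distinguishing module's effect of increasing distances in the answer space can be sketched with a hinge-style penalty; the margin value and function name are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def distinguishing_loss(anchor, counterparts, margin=1.0):
    """Hinge-style penalty pushing superficially similar instances apart.

    Penalizes counterparts that lie within `margin` of the anchor in the
    answer space; a zero loss means all counterparts are already far enough.
    """
    losses = [max(0.0, margin - float(np.linalg.norm(anchor - c)))
              for c in counterparts]
    return sum(losses) / len(losses)
```

Minimizing such a penalty forces the model to separate instances that share a question type but have different answers, countering the language-prior shortcut.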
kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels
The success of retrieval-augmented language models in various natural
language processing (NLP) tasks has been constrained in automatic speech
recognition (ASR) applications due to challenges in constructing fine-grained
audio-text datastores. This paper presents kNN-CTC, a novel approach that
overcomes these challenges by leveraging Connectionist Temporal Classification
(CTC) pseudo labels to establish frame-level audio-text key-value pairs,
circumventing the need for precise ground truth alignments. We further
introduce a skip-blank strategy, which strategically ignores CTC blank frames,
to reduce datastore size. By incorporating a k-nearest neighbors retrieval
mechanism into pre-trained CTC ASR systems and leveraging this fine-grained,
pruned datastore, kNN-CTC achieves substantial performance improvements under
various experimental settings. Our code is available at
https://github.com/NKU-HLT/KNN-CTC.

Comment: Accepted by ICASSP 202
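The datastore construction with the skip-blank strategy, and a retrieval step over it, can be sketched as follows. The blank index, the voting rule, and all names are illustrative assumptions; the real system works on neural frame features and fuses retrieval scores with CTC posteriors rather than taking a hard vote.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token

def build_datastore(frame_feats, ctc_pseudo_labels):
    """Build frame-level key-value pairs, skipping CTC blank frames."""
    keys, values = [], []
    for feat, lab in zip(frame_feats, ctc_pseudo_labels):
        if lab != BLANK:          # skip-blank strategy shrinks the datastore
            keys.append(feat)
            values.append(lab)
    return np.array(keys), np.array(values)

def knn_lookup(keys, values, query, k=3):
    """Retrieve the k nearest keys and vote over their pseudo labels."""
    dists = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    return int(np.bincount(values[idx]).argmax())
```

Because the pseudo labels come from CTC itself, no ground-truth frame alignment is needed, which is the point of the approach.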