26 research outputs found
Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition
Speech Emotion Recognition (SER) is a challenging task due to limited data
and blurred boundaries of certain emotions. In this paper, we present a
comprehensive approach to improving SER performance throughout the model
lifecycle, including pre-training, fine-tuning, and inference stages. To
address the data scarcity issue, we utilize a pre-trained model, wav2vec2.0.
During fine-tuning, we propose a novel loss function that combines
cross-entropy loss with supervised contrastive learning loss to improve the
model's discriminative ability. This approach increases the inter-class
distances and decreases the intra-class distances, mitigating the issue of
blurred boundaries. Finally, to leverage the improved distances, we propose an
interpolation method at the inference stage that combines the model prediction
with the output from a k-nearest neighbors model. Our experiments on IEMOCAP
demonstrate that our proposed methods outperform current state-of-the-art
results.

Comment: Accepted by Interspeech 2023, poster
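The fine-tuning objective described above can be sketched in NumPy. This is a minimal illustration, not the paper's exact formulation: the temperature `temp`, the mixing weight `alpha`, and all function names are illustrative assumptions.

```python
import numpy as np

def supcon_loss(feats, labels, temp=0.1):
    """Supervised contrastive loss over a batch of L2-normalized features."""
    sims = feats @ feats.T / temp          # pairwise similarities
    n = len(labels)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        denom = sum(np.exp(sims[i, j]) for j in range(n) if j != i)
        # pull same-class features together, push different classes apart
        total += -sum(np.log(np.exp(sims[i, j]) / denom)
                      for j in positives) / len(positives)
    return total / n

def combined_loss(ce_loss, feats, labels, alpha=0.5):
    """Weighted sum of cross-entropy and supervised contrastive terms."""
    return alpha * ce_loss + (1 - alpha) * supcon_loss(feats, labels)
```

With well-separated same-class features the contrastive term is near zero, so the combined loss rewards exactly the larger inter-class and smaller intra-class distances the abstract describes.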
G2C: A Generator-to-Classifier Framework Integrating Multi-Stained Visual Cues for Pathological Glomerulus Classification
Pathological glomerulus classification plays a key role in the diagnosis of
nephropathy. As the difference between different subcategories is subtle,
doctors often refer to slides from different staining methods to make
decisions. However, creating correspondence across various stains is
labor-intensive, bringing major difficulties in collecting data and training a
vision-based algorithm to assist nephropathy diagnosis. This paper provides an
alternative solution for integrating multi-stained visual cues for glomerulus
classification. Our approach, named generator-to-classifier (G2C), is a
two-stage framework. Given an input image from a specified stain, several
generators are first applied to estimate its appearances in other staining
methods, and a classifier follows to combine visual cues from different stains
for prediction (whether it is pathological, or which type of pathology it has).
We optimize these two stages in a joint manner. To provide a reasonable
initialization, we pre-train the generators in an unlabeled reference set under
an unpaired image-to-image translation task, and then fine-tune them together
with the classifier. We conduct experiments on a glomerulus type classification
dataset collected by ourselves (there are no publicly available datasets for
this purpose). Although joint optimization slightly harms the authenticity of
the generated patches, it boosts classification performance, suggesting more
effective visual cues are extracted in an automatic way. We also transfer our
model to a public dataset for breast cancer classification, and significantly
outperform the state of the art.

Comment: Accepted by AAAI 201
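The two-stage data flow of G2C can be sketched as follows; `generators` and `classifier` are placeholder callables standing in for the trained networks, so only the pipeline structure is illustrated, not the actual models.

```python
import numpy as np

def g2c_forward(image, generators, classifier):
    """Two-stage G2C pass: synthesize other-stain views, then classify.

    `generators` and `classifier` are hypothetical callables; only the
    generator-to-classifier data flow from the abstract is shown.
    """
    # Stage 1: estimate the input's appearance under each other staining method
    views = [image] + [g(image) for g in generators]
    # Stage 2: the classifier fuses visual cues from all stains into one prediction
    return classifier(views)
```

In the paper both stages are optimized jointly after the generators are pre-trained on unpaired image-to-image translation; this sketch only fixes the inference-time wiring.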
RAMP: Retrieval-Augmented MOS Prediction via Confidence-based Dynamic Weighting
Automatic Mean Opinion Score (MOS) prediction is crucial to evaluate the
perceptual quality of synthetic speech. While recent approaches using
pre-trained self-supervised learning (SSL) models have shown promising results,
they only partly address the data scarcity issue for the feature extractor,
leaving the issue unresolved for the decoder and causing suboptimal
performance. To address this challenge, we propose a
retrieval-augmented MOS prediction method, dubbed {\bf RAMP}, to enhance the
decoder's ability against the data scarcity issue. A fusing network is also
proposed to dynamically adjust the retrieval scope for each instance and the
fusion weights based on the predictive confidence. Experimental results show
that our proposed method outperforms the existing methods in multiple
scenarios.

Comment: Accepted by Interspeech 2023, oral
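The confidence-based fusion idea can be sketched as below. All names are illustrative assumptions: `decoder_conf` in [0, 1] stands in for the predictive confidence that drives the fusion weight, and the retrieval branch is reduced to a similarity-weighted average of retrieved MOS labels.

```python
import numpy as np

def ramp_predict(decoder_mos, decoder_conf, neighbor_mos, neighbor_sims):
    """Fuse a decoder's MOS prediction with a retrieval-based estimate."""
    sims = np.asarray(neighbor_sims, dtype=float)
    # retrieval branch: similarity-weighted average of retrieved MOS labels
    retrieval_mos = float(np.dot(sims, neighbor_mos) / sims.sum())
    # confidence-based weighting: trust the decoder more when it is confident
    return decoder_conf * decoder_mos + (1.0 - decoder_conf) * retrieval_mos
```

When confidence is high the decoder dominates; when it is low the prediction falls back toward the retrieved examples, which is how retrieval compensates for the decoder's data scarcity.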
Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances
Despite the great progress of Visual Question Answering (VQA), current VQA
models heavily rely on the superficial correlation between the question type
and its corresponding frequent answers (i.e., language priors) to make
predictions, without really understanding the input. In this work, we define
the training instances with the same question type but different answers as
\textit{superficially similar instances}, and attribute the language priors to
the confusion of VQA model on such instances. To solve this problem, we propose
a novel training framework that explicitly encourages the VQA model to
distinguish between the superficially similar instances. Specifically, for each
training instance, we first construct a set that contains its superficially
similar counterparts. Then we exploit the proposed distinguishing module to
increase the distance between the instance and its counterparts in the answer
space. In this way, the VQA model is forced to further focus on the other parts
of the input beyond the question type, which helps to overcome the language
priors. Experimental results show that our method achieves the state-of-the-art
performance on VQA-CP v2. Codes are available at
\href{https://github.com/wyk-nku/Distinguishing-VQA.git}{Distinguishing-VQA}.

Comment: Published in COLING 202
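The distinguishing module's effect of increasing distances in the answer space can be sketched with a hinge-style penalty; the margin value and function name are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def distinguishing_loss(anchor, counterparts, margin=1.0):
    """Hinge-style penalty pushing superficially similar instances apart.

    Penalizes counterparts that lie within `margin` of the anchor in the
    answer space; a zero loss means all counterparts are already far enough.
    """
    losses = [max(0.0, margin - float(np.linalg.norm(anchor - c)))
              for c in counterparts]
    return sum(losses) / len(losses)
```

Minimizing such a penalty forces the model to separate instances that share a question type but have different answers, countering the language-prior shortcut.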
kNN-CTC: Enhancing ASR via Retrieval of CTC Pseudo Labels
The success of retrieval-augmented language models in various natural
language processing (NLP) tasks has been constrained in automatic speech
recognition (ASR) applications due to challenges in constructing fine-grained
audio-text datastores. This paper presents kNN-CTC, a novel approach that
overcomes these challenges by leveraging Connectionist Temporal Classification
(CTC) pseudo labels to establish frame-level audio-text key-value pairs,
circumventing the need for precise ground truth alignments. We further
introduce a skip-blank strategy, which strategically ignores CTC blank frames,
to reduce datastore size. By incorporating a k-nearest neighbors retrieval
mechanism into pre-trained CTC ASR systems and leveraging this fine-grained,
pruned datastore, kNN-CTC achieves substantial performance improvements under
various experimental settings. Our code is available at
https://github.com/NKU-HLT/KNN-CTC.

Comment: Accepted by ICASSP 202
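The datastore construction with the skip-blank strategy, and a retrieval step over it, can be sketched as follows. The blank index, the voting rule, and all names are illustrative assumptions; the real system works on neural frame features and fuses retrieval scores with CTC posteriors rather than taking a hard vote.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank token

def build_datastore(frame_feats, ctc_pseudo_labels):
    """Build frame-level key-value pairs, skipping CTC blank frames."""
    keys, values = [], []
    for feat, lab in zip(frame_feats, ctc_pseudo_labels):
        if lab != BLANK:          # skip-blank strategy shrinks the datastore
            keys.append(feat)
            values.append(lab)
    return np.array(keys), np.array(values)

def knn_lookup(keys, values, query, k=3):
    """Retrieve the k nearest keys and vote over their pseudo labels."""
    dists = np.linalg.norm(keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    return int(np.bincount(values[idx]).argmax())
```

Because the pseudo labels come from CTC itself, no ground-truth frame alignment is needed, which is the point of the approach.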