Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification
Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to directly employ small-scale domain-specific data
to train complex neural network architectures due to domain mismatch and loss
in performance. Alternatively, cross-domain speaker verification for multiple
acoustic environments is a challenging task which could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
using the VoxCeleb data, followed by an approach which fine-tunes part of the
high-level network layers with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set while simultaneously generalizing the
model to other acoustic domains. From the results, we demonstrate that diverse
acoustic environments affect the speaker verification performance, and that our
proposed approach of cross-domain adaptation can significantly improve the
results in this scenario.
Comment: To appear in INTERSPEECH 202
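The abstract above aligns domain-specific embedding distributions using maximum mean discrepancy (MMD). As a rough illustration only, the sketch below computes the standard biased estimate of squared MMD between two small sets of embeddings. The RBF kernel, the `gamma` value, the function names, and the toy vectors are all assumptions made for this example; the abstract does not specify the kernel or how the loss is wired into training.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""

    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, gamma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D "embeddings": one clean set and two other-domain sets.
clean = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
noisy_near = [[0.1, 0.1], [0.2, 0.1], [0.0, 0.2]]
noisy_far = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

print("near domain:", mmd_squared(clean, noisy_near))
print("far domain: ", mmd_squared(clean, noisy_far))
```

A domain whose embeddings sit far from the clean set yields a larger value, which is the quantity an MMD-based alignment term would push down during adaptation.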
An Investigation of Distribution Alignment in Multi-Genre Speaker Recognition
Multi-genre speaker recognition is becoming increasingly popular due to its
ability to better represent the complexities of real-world applications.
However, a major challenge is the significant shift in the distribution of
speaker vectors across different genres. While distribution alignment is a
common approach to address this challenge, previous studies have mainly focused
on aligning a source domain with a target domain, and their performance on
multi-genre data is unknown.
This paper presents a comprehensive study of mainstream distribution
alignment methods on multi-genre data, where multiple distributions need to be
aligned. We analyze various methods both qualitatively and quantitatively. Our
experiments on the CN-Celeb dataset show that within-between distribution
alignment (WBDA) performs relatively better. However, we also found that none
of the investigated methods consistently improved performance in all test
cases. This suggests that solely aligning the distributions of speaker vectors
may not fully address the challenges posed by multi-genre speaker recognition.
Further investigation is necessary to develop a more comprehensive solution.
Comment: submitted to ICASSP 202
Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification
In real-world applications, speaker recognition models often face various
domain-mismatch challenges, leading to a significant drop in performance.
Although numerous domain adaptation techniques have been developed to address
this issue, almost all present methods focus on a simple configuration where
the model is trained in one domain and deployed in another. However, real-world
environments are often complex and may contain multiple domains, making the
methods designed for one-to-one adaptation suboptimal. In our paper, we propose
a self-supervised learning method to tackle this multi-domain adaptation
problem. Building upon the basic self-supervised adaptation algorithm, we
designed three strategies to make it suitable for multi-domain adaptation: an
in-domain negative sampling strategy, a MoCo-like memory bank scheme, and a
CORAL-like distribution alignment. We conducted experiments using VoxCeleb2 as
the source domain dataset and CN-Celeb1 as the target multi-domain dataset. Our
results demonstrate that our method clearly outperforms the basic
self-supervised adaptation method, which simply treats the data of CN-Celeb1 as
a single domain. Importantly, the improvement is consistent in nearly all
in-domain tests and cross-domain tests, demonstrating the effectiveness of our
proposed method.
Comment: submitted to ICASSP 202
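One of the three strategies named in the abstract above is a "CORAL-like distribution alignment". The sketch below shows the standard CORAL loss (squared Frobenius distance between two feature covariance matrices, scaled by 1/(4d^2)) as a point of reference. It is not the authors' variant, which the abstract does not detail; the function names and toy data are hypothetical.

```python
def covariance(rows):
    """Unbiased sample covariance matrix of a list of feature vectors."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Standard CORAL loss: ||C_s - C_t||_F^2 / (4 d^2)."""
    d = len(source[0])
    cs = covariance(source)
    ct = covariance(target)
    frob_sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return frob_sq / (4.0 * d * d)


# Hypothetical features: a source set, a shifted copy, and a rescaled copy.
source = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
shifted = [[x + 5.0 for x in row] for row in source]
scaled = [[3.0 * x for x in row] for row in source]

print("shifted:", coral_loss(source, shifted))
print("scaled: ", coral_loss(source, scaled))
```

Because the loss compares only second-order statistics, a pure shift of the features leaves it at essentially zero, while a change in scale or correlation is penalized.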
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) has recently become an important challenge
when using deep learning (DL). It requires large-scale training datasets and
high computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data
come from the same domain, with the same input feature space and data
distribution characteristics. This assumption, however, is not applicable in
some real-world artificial intelligence (AI) applications. Moreover, there are
situations where real data are challenging or expensive to gather, or occur
only rarely, so the data requirements of DL models cannot be met. Deep transfer
learning (DTL) has been introduced to overcome these issues; it helps
develop high-performing models using real datasets that are small or slightly
different but related to the training data. This paper presents a comprehensive
survey of DTL-based ASR frameworks to shed light on the latest developments and
help academics and professionals understand current challenges. Specifically,
after presenting the DTL background, a well-designed taxonomy is adopted to
inform the state-of-the-art. A critical analysis is then conducted to identify
the limitations and advantages of each framework. Moving on, a comparative
study is introduced to highlight the current challenges before deriving
opportunities for future research.
A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts
Machine learning methods strive to acquire a robust model during training
that can generalize well to test samples, even under distribution shifts.
However, these methods often suffer from a performance drop due to unknown test
distributions. Test-time adaptation (TTA), an emerging paradigm, has the
potential to adapt a pre-trained model to unlabeled data during testing, before
making predictions. Recent progress in this paradigm highlights the significant
benefits of utilizing unlabeled data for training self-adapted models prior to
inference. In this survey, we divide TTA into several distinct categories,
namely, test-time (source-free) domain adaptation, test-time batch adaptation,
online test-time adaptation, and test-time prior adaptation. For each category,
we provide a comprehensive taxonomy of advanced algorithms, followed by a
discussion of different learning scenarios. Furthermore, we analyze relevant
applications of TTA and discuss open challenges and promising areas for future
research. A comprehensive list of TTA methods can be found at
\url{https://github.com/tim-learn/awesome-test-time-adaptation}.
Comment: Discussions, comments, and questions are all welcome at
\url{https://github.com/tim-learn/awesome-test-time-adaptation}
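The survey above describes adapting a pre-trained model to unlabeled test data before making predictions. One simple and widely used idea in that family is to re-estimate feature normalization statistics from the test batch itself. The sketch below illustrates that idea only; it is not taken from the survey, and the function names, the `momentum` blending parameter, and the toy numbers are assumptions.

```python
import math


def batch_stats(batch):
    """Per-feature mean and (population) variance of a batch."""
    n = len(batch)
    d = len(batch[0])
    means = [sum(r[j] for r in batch) / n for j in range(d)]
    variances = [sum((r[j] - means[j]) ** 2 for r in batch) / n for j in range(d)]
    return means, variances


def adapt_stats(train_stats, test_batch, momentum=1.0):
    """Blend stored training statistics with unlabeled test-batch statistics.

    momentum=1.0 uses only the test batch; momentum=0.0 keeps training stats.
    """
    train_means, train_vars = train_stats
    test_means, test_vars = batch_stats(test_batch)
    means = [(1 - momentum) * a + momentum * b for a, b in zip(train_means, test_means)]
    variances = [(1 - momentum) * a + momentum * b for a, b in zip(train_vars, test_vars)]
    return means, variances


def normalize(batch, means, variances, eps=1e-5):
    """Standardize each feature with the given statistics."""
    return [
        [(r[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(len(r))]
        for r in batch
    ]


# Hypothetical stored training statistics and a shifted, unlabeled test batch.
train_stats = ([0.0, 0.0], [1.0, 1.0])
test_batch = [[4.0, 9.0], [6.0, 11.0], [5.0, 10.0]]

adapted = adapt_stats(train_stats, test_batch)
normalized = normalize(test_batch, *adapted)
print(normalized)
```

With the adapted statistics, the shifted test features are re-centered before they reach the rest of the model, and no labels are needed.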