A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
While audio-visual speech models can yield superior performance and
robustness compared to audio-only models, their development and adoption are
hindered by the lack of labeled and unlabeled audio-visual data and the cost to
deploy one model per modality. In this paper, we present u-HuBERT, a
self-supervised pre-training framework that can leverage both multimodal and
unimodal speech with a unified masked cluster prediction objective. By
utilizing modality dropout during pre-training, we demonstrate that a single
fine-tuned model can achieve performance on par with or better than the
state-of-the-art modality-specific models. Moreover, our model fine-tuned only
on audio can perform well with audio-visual and visual speech input, achieving
zero-shot modality generalization for speech recognition and speaker
verification. In particular, our single model yields 1.2%/1.4%/27.2% speech
recognition word error rate on LRS3 with audio-visual/audio/visual input.
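The modality dropout mentioned above has a simple core: during pre-training, randomly discard one input stream so the fused representation remains usable when a modality is missing at test time. The function below is a hypothetical toy illustration (list-based features, additive fusion), not the authors' implementation:

```python
import random

def fuse_with_modality_dropout(audio, visual, p_audio_only=0.25,
                               p_visual_only=0.25, rng=random):
    """Toy sketch of modality dropout: with some probability keep only the
    audio stream, with some probability keep only the visual stream,
    otherwise keep both.  `audio` and `visual` are equal-length feature
    lists; a dropped stream is replaced by zeros before fusion."""
    r = rng.random()
    if r < p_audio_only:
        visual = None                      # train on audio alone
    elif r < p_audio_only + p_visual_only:
        audio = None                       # train on video alone
    dim = len(audio if audio is not None else visual)
    a = audio if audio is not None else [0.0] * dim
    v = visual if visual is not None else [0.0] * dim
    return [x + y for x, y in zip(a, v)]   # simple additive fusion
```

Because the model regularly sees single-modality batches during training, a single set of weights learns to handle audio-only, visual-only, and audio-visual input.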
Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck
Recent advances in sophisticated synthetic speech generated from
text-to-speech (TTS) or voice conversion (VC) systems cause threats to the
existing automatic speaker verification (ASV) systems. Since such synthetic
speech is generated by diverse algorithms, generalization ability with
limited training data is indispensable for a robust anti-spoofing system. In
this work, we propose a transfer learning scheme based on the wav2vec 2.0
pretrained model with variational information bottleneck (VIB) for the speech
anti-spoofing task. Evaluation on the ASVspoof 2019 logical access (LA)
database shows that our method improves the performance of distinguishing
unseen spoofed and genuine speech, outperforming current state-of-the-art
anti-spoofing systems. Furthermore, we show that the proposed system improves
performance significantly in low-resource and cross-dataset anti-spoofing
settings, demonstrating that our system is also robust to variations in data
size and distribution. Comment: Submitted to Interspeech 202
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) is at the origin of unprecedented improvements
in many different domains including computer vision and natural language
processing. Speech processing drastically benefitted from SSL as most of the
current domain-related tasks are now being approached with pre-trained models.
This work introduces LeBenchmark 2.0, an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale, heterogeneous corpora with up to 14,000 hours of
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training. Comment: Under submission at Computer Speech and Language. Preprint allowed
Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing
Audio anti-spoofing for automatic speaker verification aims to safeguard
users' identities from spoofing attacks. Although state-of-the-art spoofing
countermeasure (CM) models perform well on specific datasets, they lack
generalization when evaluated with different datasets. To address this
limitation, previous studies have explored large pre-trained models, which
require significant resources and time. We aim to develop a compact but
well-generalizing CM model that can compete with large pre-trained models. Our
approach involves multi-dataset co-training and sharpness-aware minimization,
which has not been investigated in this domain. Extensive experiments reveal
that the proposed method yields competitive results across various datasets while
using 4,000 times fewer parameters than the large pre-trained models. Comment: Interspeech 202
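Sharpness-aware minimization has a compact update rule: take a gradient ascent step of radius rho to the locally worst nearby point, then apply the gradient measured there to the original weights. A scalar toy version (the real method operates on full network parameters and batched losses) might look like:

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step on a single scalar
    parameter (sketch).  `grad_fn` returns the loss gradient at a point."""
    g = grad_fn(w)
    eps = rho * g / (abs(g) + 1e-12)   # ascend to the locally worst point
    g_adv = grad_fn(w + eps)           # gradient at the perturbed weights
    return w - lr * g_adv              # descend from the ORIGINAL weights

# minimizing f(w) = w**2, whose gradient is 2*w
w = 1.0
for _ in range(100):
    w = sam_step(w, lambda x: 2.0 * x)
```

Because the update uses the gradient of the perturbed loss, minima that are flat within a rho-ball are preferred, which is the property these co-training studies associate with cross-dataset generalization.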
Mic2Mic: Using Cycle-Consistent Generative Adversarial Networks to Overcome Microphone Variability in Speech Systems
Mobile and embedded devices are increasingly using microphones and
audio-based computational models to infer user context. A major challenge in
building systems that combine audio models with commodity microphones is to
guarantee their accuracy and robustness in the real world. Besides many
environmental dynamics, a primary factor that impacts the robustness of audio
models is microphone variability. In this work, we propose Mic2Mic -- a
machine-learned system component -- which resides in the inference pipeline of
audio models and reduces, in real time, the variability in audio data caused by
microphone-specific factors. Two key considerations for the design of Mic2Mic
were: a) to decouple the problem of microphone variability from the audio task,
and b) to put a minimal burden on end-users to provide training data. With these
in mind, we apply the principles of cycle-consistent generative adversarial
networks (CycleGANs) to learn Mic2Mic using unlabeled and unpaired data
collected from different microphones. Our experiments show that Mic2Mic can
recover between 66% and 89% of the accuracy lost to microphone variability
for two common audio tasks. Comment: Published at ACM IPSN 201
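The cycle-consistency idea behind this CycleGAN training is easy to state: translating a recording from microphone A's characteristics to microphone B's and back should reproduce the original, even without paired data. A minimal sketch of that loss (per-sample scalar "audio" and stand-in generator functions, purely illustrative):

```python
def cycle_consistency_loss(x_a, g_ab, g_ba):
    """L1 cycle-consistency loss as used in CycleGAN-style training
    (sketch): `g_ab` maps mic-A samples toward mic-B characteristics and
    `g_ba` maps back; both stand in for learned generator networks."""
    reconstructed = [g_ba(g_ab(v)) for v in x_a]
    return sum(abs(r - v) for r, v in zip(reconstructed, x_a)) / len(x_a)
```

In full CycleGAN training this term is combined with adversarial losses in both directions; the cycle term is what lets the generators learn from unpaired recordings.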
Self-supervised Speaker Recognition with Loss-gated Learning
In self-supervised learning for speaker recognition, pseudo labels serve as
the supervision signal. However, a speaker recognition model does not always
benefit from pseudo labels because they can be unreliable. In this
work, we observe that a speaker recognition network tends to model the data
with reliable labels faster than those with unreliable labels. This motivates
us to study a loss-gated learning (LGL) strategy, which extracts the reliable
labels through the fitting ability of the neural network during training. With
the proposed LGL, our speaker recognition model obtains a performance
gain over the system without it. Further, the proposed self-supervised speaker
recognition with LGL trained on the VoxCeleb2 dataset without any labels
achieves an equal error rate of … on the VoxCeleb1 original test set.
Code has been made available at:
https://github.com/TaoRuijie/Loss-Gated-Learning. Comment: 5 pages, 3 figures
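The gating described above amounts to filtering the training batch by per-sample loss. A hypothetical minimal version (the threshold value and names are invented; the released code linked above is the authoritative implementation):

```python
def loss_gate(samples, losses, gate=1.0):
    """Toy loss-gated learning filter: keep only the samples whose current
    loss falls below `gate`, on the premise that low-loss pseudo-labels
    are the reliable ones the network fit first."""
    return [(s, l) for s, l in zip(samples, losses) if l < gate]
```

In practice the gate would be applied each iteration, so the set of "trusted" pseudo-labeled samples grows as the network's fit improves.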
A Review of Voice-Based Person Identification: State-of-the-Art
Automated person identification and authentication systems are useful for national security, the integrity of electoral processes, the prevention of cybercrime and many access control applications. They are a critical component of information and communication technology, which is central to national development. Biometric identification systems are fast replacing traditional methods such as names, personal identification numbers, codes and passwords, since nature endows individuals with distinct personal imprints and signatures. Different measures have been put in place for person identification, ranging from face to fingerprint and so on. This paper highlights the key approaches and schemes developed over the last five decades for voice-based person identification systems. Voice-based recognition has gained interest due to its non-intrusive data acquisition and its ability to continually study and adapt to changes in a person's voice. Information on the benefits and challenges of various biometric systems is also presented in this paper. The present and prominent voice-based recognition methods are discussed. It was observed that the application areas of these systems cover intelligent monitoring, surveillance, population management, election forensics, and immigration and border control.
Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings
The adoption of advanced deep learning architectures in stuttering detection
(SD) tasks is challenging due to the limited size of the available datasets. To
this end, this work introduces the application of speech embeddings extracted
from pre-trained deep learning models trained on large audio datasets for
different tasks. In particular, we explore audio representations obtained using
emphasized channel attention, propagation, and aggregation time delay neural
network (ECAPA-TDNN) and Wav2Vec2.0 models trained on VoxCeleb and LibriSpeech
datasets respectively. After extracting the embeddings, we benchmark with
several traditional classifiers, such as the K-nearest neighbour (KNN),
Gaussian naive Bayes, and neural network, for the SD tasks. In comparison to
the standard SD systems trained only on the limited SEP-28k dataset, we obtain
relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average
recall (UAR) over the baselines. Finally, we have shown that combining two
embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve
the UAR by up to 2.60% and 6.32%, respectively. Comment: Accepted in International Journal of Speech Technology, Springer 2023;
substantial overlap with arXiv:2204.0156
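The classification stage described above, concatenating two embeddings and feeding them to a simple classifier, can be sketched as follows; the tiny KNN and toy vectors are illustrative stand-ins for real ECAPA-TDNN and Wav2Vec2.0 features:

```python
import math

def fuse(e1, e2):
    """Concatenate two embedding vectors before classification."""
    return list(e1) + list(e2)

def knn_predict(train, query, k=3):
    """Plain k-nearest-neighbour vote over (embedding, label) pairs,
    using Euclidean distance (sketch of one of the benchmarked
    traditional classifiers)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda tl: dist(tl[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

Keeping the classifier this simple is the point of the embedding approach: the heavy lifting is done by the pre-trained feature extractors, so even KNN or naive Bayes can work with limited labeled data.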