A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
While audio-visual speech models can yield superior performance and
robustness compared to audio-only models, their development and adoption are
hindered by the lack of labeled and unlabeled audio-visual data and the cost to
deploy one model per modality. In this paper, we present u-HuBERT, a
self-supervised pre-training framework that can leverage both multimodal and
unimodal speech with a unified masked cluster prediction objective. By
utilizing modality dropout during pre-training, we demonstrate that a single
fine-tuned model can achieve performance on par with or better than the
state-of-the-art modality-specific models. Moreover, our model fine-tuned only
on audio can perform well with audio-visual and visual speech input, achieving
zero-shot modality generalization for speech recognition and speaker
verification. In particular, our single model yields 1.2%/1.4%/27.2% speech
recognition word error rate on LRS3 with audio-visual/audio/visual input.
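The modality dropout mentioned above has a simple core: during pre-training, randomly discard one input stream so the fused representation remains usable when a modality is missing at test time. The function below is a hypothetical toy illustration (list-based features, additive fusion), not the authors' implementation:

```python
import random

def fuse_with_modality_dropout(audio, visual, p_audio_only=0.25,
                               p_visual_only=0.25, rng=random):
    """Toy sketch of modality dropout: with some probability keep only the
    audio stream, with some probability keep only the visual stream,
    otherwise keep both.  `audio` and `visual` are equal-length feature
    lists; a dropped stream is replaced by zeros before fusion."""
    r = rng.random()
    if r < p_audio_only:
        visual = None                      # train on audio alone
    elif r < p_audio_only + p_visual_only:
        audio = None                       # train on video alone
    dim = len(audio if audio is not None else visual)
    a = audio if audio is not None else [0.0] * dim
    v = visual if visual is not None else [0.0] * dim
    return [x + y for x, y in zip(a, v)]   # simple additive fusion
```

Because the model regularly sees single-modality batches during training, a single set of weights learns to handle audio-only, visual-only, and audio-visual input.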
Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck
Recent advances in sophisticated synthetic speech generated from
text-to-speech (TTS) or voice conversion (VC) systems cause threats to the
existing automatic speaker verification (ASV) systems. Since such synthetic
speech is generated by diverse algorithms, generalization ability with
limited training data is indispensable for a robust anti-spoofing system. In
this work, we propose a transfer learning scheme based on the wav2vec 2.0
pretrained model with variational information bottleneck (VIB) for the speech
anti-spoofing task. Evaluation on the ASVspoof 2019 logical access (LA)
database shows that our method improves the performance of distinguishing
unseen spoofed and genuine speech, outperforming current state-of-the-art
anti-spoofing systems. Furthermore, we show that the proposed system improves
performance significantly in low-resource and cross-dataset anti-spoofing
settings, demonstrating that our system is also robust to variations in data
size and distribution. Comment: Submitted to Interspeech 202
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) is at the origin of unprecedented improvements
in many different domains including computer vision and natural language
processing. Speech processing drastically benefitted from SSL as most of the
current domain-related tasks are now being approached with pre-trained models.
This work introduces LeBenchmark 2.0, an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale, heterogeneous corpora with up to 14,000 hours of
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training. Comment: Under submission at Computer Speech and Language. Preprint allowed
Multi-Dataset Co-Training with Sharpness-Aware Optimization for Audio Anti-spoofing
Audio anti-spoofing for automatic speaker verification aims to safeguard
users' identities from spoofing attacks. Although state-of-the-art spoofing
countermeasure (CM) models perform well on specific datasets, they lack
generalization when evaluated with different datasets. To address this
limitation, previous studies have explored large pre-trained models, which
require significant resources and time. We aim to develop a compact but
well-generalizing CM model that can compete with large pre-trained models. Our
approach involves multi-dataset co-training and sharpness-aware minimization,
which has not been investigated in this domain. Extensive experiments reveal
that the proposed method yields competitive results across various datasets while
using 4,000 times fewer parameters than the large pre-trained models. Comment: Interspeech 202
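Sharpness-aware minimization has a compact update rule: take a gradient ascent step of radius rho to the locally worst nearby point, then apply the gradient measured there to the original weights. A scalar toy version (the real method operates on full network parameters and batched losses) might look like:

```python
def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step on a single scalar
    parameter (sketch).  `grad_fn` returns the loss gradient at a point."""
    g = grad_fn(w)
    eps = rho * g / (abs(g) + 1e-12)   # ascend to the locally worst point
    g_adv = grad_fn(w + eps)           # gradient at the perturbed weights
    return w - lr * g_adv              # descend from the ORIGINAL weights

# minimizing f(w) = w**2, whose gradient is 2*w
w = 1.0
for _ in range(100):
    w = sam_step(w, lambda x: 2.0 * x)
```

Because the update uses the gradient of the perturbed loss, minima that are flat within a rho-ball are preferred, which is the property these co-training studies associate with cross-dataset generalization.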
Mic2Mic: Using Cycle-Consistent Generative Adversarial Networks to Overcome Microphone Variability in Speech Systems
Mobile and embedded devices are increasingly using microphones and
audio-based computational models to infer user context. A major challenge in
building systems that combine audio models with commodity microphones is to
guarantee their accuracy and robustness in the real world. Besides many
environmental dynamics, a primary factor that impacts the robustness of audio
models is microphone variability. In this work, we propose Mic2Mic -- a
machine-learned system component -- which resides in the inference pipeline of
audio models and reduces, in real time, the variability in audio data caused by
microphone-specific factors. Two key considerations for the design of Mic2Mic
were: a) to decouple the problem of microphone variability from the audio task,
and b) to put a minimal burden on end-users to provide training data. With these
in mind, we apply the principles of cycle-consistent generative adversarial
networks (CycleGANs) to learn Mic2Mic using unlabeled and unpaired data
collected from different microphones. Our experiments show that Mic2Mic can
recover between 66% and 89% of the accuracy lost to microphone variability
for two common audio tasks. Comment: Published at ACM IPSN 201
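The cycle-consistency idea behind this CycleGAN training is easy to state: translating a recording from microphone A's characteristics to microphone B's and back should reproduce the original, even without paired data. A minimal sketch of that loss (per-sample scalar "audio" and stand-in generator functions, purely illustrative):

```python
def cycle_consistency_loss(x_a, g_ab, g_ba):
    """L1 cycle-consistency loss as used in CycleGAN-style training
    (sketch): `g_ab` maps mic-A samples toward mic-B characteristics and
    `g_ba` maps back; both stand in for learned generator networks."""
    reconstructed = [g_ba(g_ab(v)) for v in x_a]
    return sum(abs(r - v) for r, v in zip(reconstructed, x_a)) / len(x_a)
```

In full CycleGAN training this term is combined with adversarial losses in both directions; the cycle term is what lets the generators learn from unpaired recordings.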
Self-supervised Speaker Recognition with Loss-gated Learning
In self-supervised learning for speaker recognition, pseudo labels serve as
the supervision signal. However, a speaker recognition model does not always
benefit from pseudo labels because they can be unreliable. In this
work, we observe that a speaker recognition network tends to model the data
with reliable labels faster than those with unreliable labels. This motivates
us to study a loss-gated learning (LGL) strategy, which extracts the reliable
labels through the fitting ability of the neural network during training. With
the proposed LGL, our speaker recognition model obtains a performance
gain over the system without it. Further, the proposed self-supervised speaker
recognition with LGL trained on the VoxCeleb2 dataset without any labels
achieves an equal error rate of … on the VoxCeleb1 original test set.
Code has been made available at:
https://github.com/TaoRuijie/Loss-Gated-Learning. Comment: 5 pages, 3 figures
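The gating described above amounts to filtering the training batch by per-sample loss. A hypothetical minimal version (the threshold value and names are invented; the released code linked above is the authoritative implementation):

```python
def loss_gate(samples, losses, gate=1.0):
    """Toy loss-gated learning filter: keep only the samples whose current
    loss falls below `gate`, on the premise that low-loss pseudo-labels
    are the reliable ones the network fit first."""
    return [(s, l) for s, l in zip(samples, losses) if l < gate]
```

In practice the gate would be applied each iteration, so the set of "trusted" pseudo-labeled samples grows as the network's fit improves.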
A Review of Voice-Based Person Identification: State-of-the-Art
Automated person identification and authentication systems are useful for national security, the integrity of electoral processes, the prevention of cybercrime and many access control applications. They are a critical component of information and communication technology, which is central to national development. Biometric identification systems are fast replacing traditional methods such as names, personal identification numbers, codes and passwords, since nature endows individuals with distinct personal imprints and signatures. Different measures have been put in place for person identification, ranging from face to fingerprint and so on. This paper highlights the key approaches and schemes developed over the last five decades for voice-based person identification systems. Voice-based recognition has gained interest due to its non-intrusive data acquisition and its ability to continually study and adapt to changes in a person's voice. Information on the benefits and challenges of various biometric systems is also presented in this paper. The present and prominent voice-based recognition methods are discussed. It was observed that the application areas of these systems cover intelligent monitoring, surveillance, population management, election forensics, and immigration and border control.
Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings
The adoption of advanced deep learning architectures in stuttering detection
(SD) tasks is challenging due to the limited size of the available datasets. To
this end, this work introduces the application of speech embeddings extracted
from pre-trained deep learning models trained on large audio datasets for
different tasks. In particular, we explore audio representations obtained using
emphasized channel attention, propagation, and aggregation time delay neural
network (ECAPA-TDNN) and Wav2Vec2.0 models trained on VoxCeleb and LibriSpeech
datasets respectively. After extracting the embeddings, we benchmark with
several traditional classifiers, such as the K-nearest neighbour (KNN),
Gaussian naive Bayes, and neural network, for the SD tasks. In comparison to
the standard SD systems trained only on the limited SEP-28k dataset, we obtain
relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average
recall (UAR) over the baselines. Finally, we have shown that combining two
embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve
the UAR by up to 2.60% and 6.32%, respectively. Comment: Accepted in International Journal of Speech Technology, Springer 2023;
substantial overlap with arXiv:2204.0156
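The classification stage described above, concatenating two embeddings and feeding them to a simple classifier, can be sketched as follows; the tiny KNN and toy vectors are illustrative stand-ins for real ECAPA-TDNN and Wav2Vec2.0 features:

```python
import math

def fuse(e1, e2):
    """Concatenate two embedding vectors before classification."""
    return list(e1) + list(e2)

def knn_predict(train, query, k=3):
    """Plain k-nearest-neighbour vote over (embedding, label) pairs,
    using Euclidean distance (sketch of one of the benchmarked
    traditional classifiers)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda tl: dist(tl[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

Keeping the classifier this simple is the point of the embedding approach: the heavy lifting is done by the pre-trained feature extractors, so even KNN or naive Bayes can work with limited labeled data.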