Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022
Different speaker recognition challenges have been held to assess speaker verification systems in the wild and probe their performance limits. The VoxCeleb Speaker Recognition Challenge (VoxSRC), based on the VoxCeleb dataset, is the most popular. Another challenge, the CN-Celeb Speaker Recognition Challenge (CNSRC), based on the Chinese celebrity multi-genre dataset CN-Celeb, was also held this year. Our team participated in the speaker verification closed tracks of both CNSRC 2022 and VoxSRC 2022, achieving 1st place and 3rd place respectively. Most system reports only describe the submitted systems and lack an effective analysis of the methods used. In this paper, we outline how to build a strong speaker verification challenge system and give a detailed analysis of each method, comparing it with other popular techniques.
Speech processing with deep learning for voice-based respiratory diagnosis : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
Voice-based respiratory diagnosis research aims at automatically screening and diagnosing respiratory-related symptoms (e.g., smoking status, COVID-19 infection) from human-generated sounds (e.g., breath, cough, speech). It has the potential to be used as an objective, simple, reliable, and less time-consuming method than traditional biomedical diagnosis methods. In this thesis, we conduct one comprehensive literature review and propose three novel deep learning methods to enrich voice-based respiratory diagnosis research and improve its performance.
Firstly, we conduct a comprehensive investigation of the effects of voice features on the detection of smoking status. Secondly, we propose a novel method that uses a combination of high-level and low-level acoustic features along with deep neural networks for smoking status identification. Thirdly, we investigate various feature extraction/representation methods and propose a SincNet-based CNN method for feature representation to further improve the performance of smoking status identification. To the best of our knowledge, this is the first systematic study that applies speech processing with deep learning to voice-based smoking status identification.
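The abstract does not spell out the SincNet formulation, but in the standard SincNet design (which the thesis builds on) each learned filter is a band-pass FIR filter parameterized only by its two cutoff frequencies: the difference of two windowed sinc low-pass filters. A minimal numpy sketch of one such filter, with illustrative (not the thesis's) parameter values:

```python
import numpy as np

def sinc_bandpass(f1, f2, length=101, fs=16000):
    """Band-pass FIR filter parameterized only by its cutoffs
    f1 < f2 (Hz), as in SincNet-style learnable front ends.
    Built as the difference of two low-pass sinc filters,
    Hamming-windowed to reduce ripple."""
    n = np.arange(length) - (length - 1) / 2   # centered time axis (samples)
    # np.sinc(x) = sin(pi*x)/(pi*x), so np.sinc(2*f*n/fs) is an
    # ideal low-pass with cutoff f
    low1 = 2 * f1 / fs * np.sinc(2 * f1 * n / fs)
    low2 = 2 * f2 / fs * np.sinc(2 * f2 * n / fs)
    return (low2 - low1) * np.hamming(length)

# A 300-3400 Hz band-pass applied to two test tones:
fs = 16000
t = np.arange(fs) / fs
in_band = np.sin(2 * np.pi * 1000 * t)    # 1 kHz, inside the band
out_band = np.sin(2 * np.pi * 6000 * t)   # 6 kHz, outside the band
h = sinc_bandpass(300, 3400)
y_in = np.convolve(in_band, h, mode="same")
y_out = np.convolve(out_band, h, mode="same")
print(np.std(y_in) > 10 * np.std(y_out))  # in-band tone passes, out-of-band is attenuated
```

In SincNet proper, f1 and f2 are trainable, so the network learns which bands matter for the task instead of using fixed filterbanks.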
Moreover, we propose a novel transfer learning scheme and a task-driven feature representation method for diagnosing respiratory diseases (e.g., COVID-19) from human-generated sounds. We find those transfer learning methods using VGGish, wav2vec 2.0 and PASE+, and our proposed task-driven method Sinc-ResNet have achieved competitive performance compared with other work. The findings of this study provide a new perspective and insights for voice-based respiratory disease diagnosis.
The experimental results demonstrate the effectiveness of our proposed methods and show that they achieve better performance than other existing methods.
Transducer-based language embedding for spoken language identification
Acoustic and linguistic features are both important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features and lack explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental results showed the proposed method significantly improves performance on LID tasks, with 12% to 59% and 16% to 24% relative improvements on in-domain and cross-domain datasets, respectively.

Comment: This paper was submitted to Interspeech 202
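The abstract does not give the fusion details, but a common way to turn variable-length frame-level features (such as the RNN transducer's encoder outputs) into a fixed-size utterance-level language embedding is statistics pooling: concatenate the per-dimension mean and standard deviation over time, then project. A sketch under that assumption, with random stand-ins for the trained components:

```python
import numpy as np

rng = np.random.default_rng(0)

def stats_pooling_embedding(frames, W, b):
    """Collapse variable-length frame-level features (T, D) into a
    fixed embedding: concatenate per-dimension mean and std over
    time, then apply a linear projection (an illustrative stand-in
    for a trained embedding layer)."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    stats = np.concatenate([mu, sigma])       # (2D,)
    return W @ stats + b                      # (E,) language embedding

D, E = 256, 64                                # feature / embedding dims (illustrative)
W = rng.standard_normal((E, 2 * D)) / np.sqrt(2 * D)
b = np.zeros(E)

# Two utterances of different lengths map to same-size embeddings:
utt_a = rng.standard_normal((137, D))         # e.g. frame-level encoder outputs
utt_b = rng.standard_normal((512, D))
emb_a = stats_pooling_embedding(utt_a, W, b)
emb_b = stats_pooling_embedding(utt_b, W, b)
print(emb_a.shape, emb_b.shape)               # both (64,)
```

The fixed-size embedding can then be scored by a language classifier regardless of utterance length.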
DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification
Data augmentation is vital to the generalization ability and robustness of deep neural network (DNN) models. Existing augmentation methods for speaker verification manipulate the raw signal, which is time-consuming, and the augmented samples lack diversity. In this paper, we present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification, which can generate diversified training samples in the speaker embedding space with negligible extra computing cost. Firstly, we augment training samples by perturbing speaker embeddings along semantic directions, which are obtained from speaker-wise covariance matrices. Secondly, since accurate covariance matrices can only be estimated from robust speaker embeddings during training, we introduce difficulty-aware additive margin softmax (DAAM-Softmax) to obtain optimal speaker embeddings. Finally, we assume the number of augmented samples goes to infinity and derive a closed-form upper bound of the expected loss with DASA, which achieves both compatibility and efficiency. Extensive experiments demonstrate that the proposed approach achieves a remarkable performance improvement. The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.

Comment: Accepted by ICASSP 202
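The abstract's core idea (perturbing an embedding along semantic directions drawn from that speaker's covariance) can be sketched directly; note the paper ultimately avoids explicit sampling by deriving a closed-form loss bound, so this is only an illustration of the sampling view, with made-up dimensions and data:

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_augment(emb, cov, strength=0.5):
    """Perturb a speaker embedding along a semantic direction drawn
    from that speaker's embedding covariance. A sketch of the idea
    described in the abstract, not the paper's exact procedure."""
    direction = rng.multivariate_normal(np.zeros(len(emb)), cov)
    return emb + strength * direction

D = 8                                        # embedding dim (small for illustration)
# Speaker-wise covariance estimated from that speaker's training embeddings:
spk_embs = rng.standard_normal((200, D)) * 0.1 + 1.0
cov = np.cov(spk_embs, rowvar=False)

emb = spk_embs[0]
aug = semantic_augment(emb, cov)
print(aug.shape == emb.shape, np.allclose(aug, emb))  # same shape, but perturbed
```

Because the perturbation happens in embedding space rather than on the waveform, each augmented sample costs only a vector operation, which is what makes the "negligible extra computing cost" claim plausible.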