Speech Frame Selection for Spoofing Detection with an Application to Partially Spoofed Audio-Data
In this paper, we introduce a frame selection strategy for improved detection of spoofed speech. A countermeasure (CM) system typically uses a Gaussian mixture model (GMM) based classifier to compute log-likelihood scores. The average log-likelihood ratio over all speech frames of a test utterance is used as the score for decision making. As opposed to this standard approach, we propose to use only selected speech frames of the test utterance for scoring. We present two simple and computationally efficient frame selection strategies based on the log-likelihood ratios of the individual frames. The performance is evaluated with constant-Q cepstral coefficients as the front-end features and a two-class GMM as the back-end classifier. We conduct the experiments using the speech corpora from the ASVspoof 2015, 2017, and 2019 challenges. The experimental results show that the proposed scoring techniques substantially outperform the conventional scoring technique on both the development and evaluation sets of the ASVspoof 2015 corpus. We did not observe a noticeable performance gain on the ASVspoof 2017 and ASVspoof 2019 corpora. We further conducted experiments with partially spoofed data, where the spoofed data is created by augmenting natural and spoofed speech. In this scenario, the proposed methods demonstrate considerable performance improvement over the baseline.
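The contrast between conventional scoring and frame-selected scoring can be sketched as follows. This is only an illustration: the abstract does not specify the two selection strategies, so the rule used here (keep the frames with the largest absolute log-likelihood ratio) and the names `frame_llrs`, `average_score`, `selected_score`, and `keep_fraction` are assumptions, not the authors' method. The per-frame log-likelihoods are assumed to come from two already trained GMMs, one for natural and one for spoofed speech.

```python
import numpy as np


def frame_llrs(loglik_natural, loglik_spoof):
    """Per-frame log-likelihood ratio: log p(x | natural) - log p(x | spoof)."""
    return np.asarray(loglik_natural, dtype=float) - np.asarray(loglik_spoof, dtype=float)


def average_score(llr):
    """Conventional utterance score: mean LLR over all speech frames."""
    return float(np.mean(llr))


def selected_score(llr, keep_fraction=0.5):
    """Hypothetical selection rule: average only the frames with the largest |LLR|.

    keep_fraction is an illustrative parameter, not a value from the paper.
    """
    llr = np.asarray(llr, dtype=float)
    k = max(1, int(round(keep_fraction * llr.size)))
    keep = np.argsort(np.abs(llr))[-k:]  # indices of the k most decisive frames
    return float(np.mean(llr[keep]))
```

With a selection rule of this kind, frames whose ratio is close to zero contribute nothing to the utterance score, which is one plausible reason selection could matter when only part of an utterance is spoofed.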
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
The performance of most speaker diarization systems with x-vector embeddings
is vulnerable to noisy environments and lacks domain robustness. Earlier
work on speaker diarization using generative adversarial network (GAN) with an
encoder network (ClusterGAN) to project input x-vectors into a latent space has
shown promising performance on meeting data. In this paper, we extend the
ClusterGAN network to improve diarization robustness and enable rapid
generalization across various challenging domains. To this end, we fetch the
pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical
loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments
are conducted on CALLHOME telephonic conversations, AMI meeting data, the
DIHARD II development set, a challenging multi-domain corpus, and two
child-clinician interaction corpora (ADOS, BOSCC) related to the autism
spectrum disorder domain. Extensive analyses of the experimental data are done
to investigate the effectiveness of the proposed ClusterGAN and MCGAN
embeddings over x-vectors. The results show that the proposed embeddings with
normalized maximum eigengap spectral clustering (NME-SC) back-end consistently
outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we
employ embedding fusion with x-vectors to provide further improvement in
diarization performance. We achieve a relative diarization error rate (DER)
improvement of 6.67% to 53.93% on the aforementioned datasets using the
proposed fused embeddings over x-vectors. In addition, the MCGAN embeddings
estimate the number of speakers and diarize short speech segments more
accurately than x-vectors and ClusterGAN embeddings on telephonic data.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
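The prototypical loss mentioned in the abstract above can be illustrated with a minimal NumPy sketch. It shows the standard prototypical-network formulation (class prototypes as mean support embeddings, softmax over negative squared Euclidean distances), which is assumed here rather than taken from the paper. The function name `prototypical_loss` and its arguments are hypothetical, and the embeddings are assumed to be the outputs of some encoder.

```python
import numpy as np


def prototypical_loss(support, support_labels, query, query_labels):
    """Mean negative log-probability of the true class for each query embedding.

    support, query: arrays of shape (n, d) holding embeddings (assumed encoder outputs).
    Every query label is assumed to occur among the support labels.
    """
    support = np.asarray(support, dtype=float)
    query = np.asarray(query, dtype=float)
    support_labels = np.asarray(support_labels)
    query_labels = np.asarray(query_labels)

    classes = np.unique(support_labels)  # sorted class ids
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack([support[support_labels == c].mean(axis=0) for c in classes])

    # Squared Euclidean distance from each query to each prototype.
    dist = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)

    # Log-softmax over negative distances, shifted for numerical stability.
    logits = -dist
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    target = np.searchsorted(classes, query_labels)  # column index of each true class
    return float(-log_prob[np.arange(len(query)), target].mean())
```

In a meta-learning setup, a loss of this form would be computed per episode on a small support and query set, and its gradient used to fine-tune the encoder; the sketch covers only the loss value, not the training loop.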