Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification
Text-to-speech and voice conversion systems are improving to the point where
they can produce synthetic speech that is almost indistinguishable from bona
fide human speech. In this regard, countermeasures (CM) against synthetic voice
attacks on automatic speaker verification (ASV) systems are becoming
increasingly important. Nonetheless, most end-to-end spoofing detection
networks are black-box systems, and it remains unclear which representations
are effective for finding artifacts. In this paper, we examine which feature
space of wav2vec 2.0 can effectively represent synthetic artifacts, and study
which architecture can effectively utilize that space. Our study allows us to
analyze which attributes of speech signals are advantageous for CM systems.
The proposed CM system achieved a 0.31% equal error rate (EER) on the ASVspoof
2019 LA evaluation set for the spoof detection task. We further propose a
simple yet effective spoofing-aware speaker verification (SASV) methodology
that takes advantage of the disentangled representations from our
countermeasure system. Evaluation on the SASV Challenge 2022 database shows an
SASV EER of 1.08%. Quantitative analysis shows that using the explored wav2vec
2.0 feature space benefits both the spoofing CM and SASV.
Comment: Submitted to Interspeech 202
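Both results above are reported as equal error rates. As a point of reference, here is a minimal numpy sketch of how EER is typically computed from a set of detection scores by sweeping a decision threshold; the function name and the threshold sweep are illustrative, not the challenge's official scoring tool:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection
    rate (FRR) equals the false-acceptance rate (FAR)."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([tar, non])
    labels = np.concatenate([np.ones_like(tar), np.zeros_like(non)])
    labels = labels[np.argsort(scores)]
    # Sweep the threshold upward through every observed score:
    # FRR = fraction of targets at or below the threshold,
    # FAR = fraction of nontargets still above it.
    frr = np.cumsum(labels) / tar.size
    far = 1.0 - np.cumsum(1.0 - labels) / non.size
    i = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[i] + far[i])
```

With perfectly separated score distributions the function returns 0.0; fully overlapping distributions push it toward 0.5 (chance level).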
The automatic detection of heart failure using speech signals
Heart failure (HF) is a major global health concern and is increasing in prevalence. It affects the larynx and breathing, and thereby the quality of speech. In this article, we propose an approach for the automatic detection of people with HF using the speech signal. The proposed method explores mel-frequency cepstral coefficient (MFCC) features, glottal features, and their combination to distinguish HF from healthy speech. The glottal features were extracted from the voice source signal estimated using glottal inverse filtering. Four machine learning algorithms, namely support vector machine, Extra Tree, AdaBoost, and feed-forward neural network (FFNN), were trained separately on the individual features and their combination. It was observed that the MFCC features yielded higher classification accuracies than the glottal features. Furthermore, the complementary nature of the glottal features was investigated by combining them with the MFCC features. Our results show that the FFNN classifier trained using a reduced set of glottal + MFCC features achieved the best overall performance in both speaker-dependent and speaker-independent scenarios. (C) 2021 The Author(s). Published by Elsevier Ltd.
Peer reviewed
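The MFCC features central to this approach follow a standard pipeline: pre-emphasis, framing and windowing, power spectrum, mel filterbank, log compression, and a DCT. A compact numpy-only sketch is below; the frame sizes, filter counts, and 0.97 pre-emphasis coefficient are common defaults, not values taken from the article:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # Pre-emphasis boosts high frequencies before spectral analysis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(sig) - n_fft) // hop
    frames = np.stack([sig[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank, evenly spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv_mel(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T
```

The resulting per-frame coefficient vectors (here 13 per frame) are the features that would then be concatenated with glottal features and fed to the classifiers.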
Paralinguistic Privacy Protection at the Edge
Voice user interfaces and digital assistants are rapidly entering our lives
and becoming singular touch points spanning our devices. These always-on
services capture and transmit our audio data to powerful cloud services for
further processing and subsequent actions. Our voices and raw audio signals
collected through these devices contain a host of sensitive paralinguistic
information that is transmitted to service providers regardless of deliberate
or false triggers. Because our emotional patterns and sensitive attributes such
as identity, gender, and mental well-being are easily inferred using deep
acoustic models, using these services exposes us to a new generation of privacy
risks.
One approach to mitigate the risk of paralinguistic-based privacy breaches is
to exploit a combination of cloud-based processing with privacy-preserving,
on-device paralinguistic information learning and filtering before transmitting
voice data. In this paper we introduce EDGY, a configurable, lightweight,
disentangled representation learning framework that transforms and filters
high-dimensional voice data to identify and contain sensitive attributes at the
edge prior to offloading to the cloud. We evaluate EDGY's on-device performance
and explore optimization techniques, including model quantization and knowledge
distillation, to enable private, accurate and efficient representation learning
on resource-constrained devices. Our results show that EDGY runs in tens of
milliseconds, with a 0.2% relative improvement in ABX score or minimal
performance penalties in learning linguistic representations from raw voice
signals, using a CPU and a single-core ARM processor without specialized
hardware.
Comment: 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.1506
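Knowledge distillation, one of the optimization techniques mentioned above, trains a small on-device student to match a larger teacher's softened output distribution. A minimal numpy sketch of the classic soft-target distillation loss is below; the temperature T and mixing weight alpha are illustrative defaults, not EDGY's actual settings:

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 softens the distribution, exposing "dark knowledge"
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: cross-entropy between teacher and student
    # distributions at temperature T, scaled by T^2 to keep gradient
    # magnitudes comparable to the hard term.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean() * T * T
    # Hard-target term: ordinary cross-entropy with the true labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1.0 - alpha) * hard
```

In an edge setting such as the one described, this loss would be minimized with respect to the student's parameters, and the distilled student could then be quantized further for deployment.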