
    Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification

    Text-to-speech and voice conversion studies are constantly improving, to the extent that they can produce synthetic speech almost indistinguishable from bona fide human speech. In this regard, countermeasures (CM) against synthetic voice attacks on automatic speaker verification (ASV) systems are becoming increasingly important. Nonetheless, most end-to-end spoofing detection networks are black-box systems, and the answer to what constitutes an effective representation for finding artifacts remains veiled. In this paper, we examine which feature space of wav2vec 2.0 can effectively represent synthetic artifacts, and study which architecture can effectively utilize that space. Our study allows us to analyze which attributes of speech signals are advantageous for CM systems. The proposed CM system achieved a 0.31% equal error rate (EER) on the ASVspoof 2019 LA evaluation set for the spoof detection task. We further propose a simple yet effective spoofing-aware speaker verification (SASV) methodology, which takes advantage of the disentangled representations from our countermeasure system. Evaluation on the SASV Challenge 2022 database shows an SASV EER of 1.08%. Quantitative analysis shows that using the explored wav2vec 2.0 feature space benefits both the spoofing CM and SASV tasks. Comment: Submitted to Interspeech 202
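    The abstract does not spell out the CM architecture, so the following is only a minimal sketch of the general idea it describes: probing the layer-wise feature spaces of a pretrained wav2vec 2.0 model as candidate representations for spoof detection. The public facebook/wav2vec2-base checkpoint, the mean pooling, and the linear probe are illustrative assumptions, not the authors' setup; ranking such probes by EER per layer would be one way to find which feature space best separates bona fide from spoofed speech.

```python
# Minimal sketch (not the paper's exact CM): probe wav2vec 2.0 hidden layers
# as candidate feature spaces for bona fide vs. spoof classification.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
w2v.eval()

def layer_embeddings(waveform_16k: torch.Tensor) -> list[torch.Tensor]:
    """Return one mean-pooled utterance embedding per wav2vec 2.0 layer."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = w2v(inputs.input_values, output_hidden_states=True)
    # hidden_states: tuple of (batch, frames, dim) tensors, one per layer
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Hypothetical linear probe per layer: bona fide vs. spoof logits.
probe = nn.Linear(w2v.config.hidden_size, 2)

dummy = torch.randn(16000)     # 1 s of placeholder audio at 16 kHz
embs = layer_embeddings(dummy)
logits = probe(embs[6])        # probe, e.g., the 6th layer's feature space
print(logits.shape)            # torch.Size([2])
```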

    The automatic detection of heart failure using speech signals

    Heart failure (HF) is a major global health concern and is increasing in prevalence. It affects the larynx and breathing, and thereby the quality of speech. In this article, we propose an approach for the automatic detection of people with HF using the speech signal. The proposed method explores mel-frequency cepstral coefficient (MFCC) features, glottal features, and their combination to distinguish HF from healthy speech. The glottal features were extracted from the voice source signal estimated using glottal inverse filtering. Four machine learning algorithms, namely, support vector machine, Extra Tree, AdaBoost, and feed-forward neural network (FFNN), were trained separately for the individual features and their combination. It was observed that the MFCC features yielded higher classification accuracies compared to the glottal features. Furthermore, the complementary nature of the glottal features was investigated by combining them with the MFCC features. Our results show that the FFNN classifier trained using a reduced set of glottal + MFCC features achieved the best overall performance in both speaker-dependent and speaker-independent scenarios. (C) 2021 The Author(s). Published by Elsevier Ltd. Peer reviewed
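    As a rough illustration of the pipeline described above (not the authors' implementation), the sketch below extracts utterance-level MFCC statistics and trains the four classifier families named in the abstract. The glottal features, which come from glottal inverse filtering, are omitted because they need a dedicated estimation toolkit; they would simply be concatenated to the MFCC vector. File names, labels, and hyperparameters are hypothetical placeholders.

```python
# Sketch of an MFCC + classifier pipeline for HF vs. healthy speech.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

def mfcc_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Utterance-level MFCC statistics (mean and std over frames)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical dataset: (wav_path, label) with 1 = heart failure, 0 = healthy.
dataset = [("hf_001.wav", 1), ("healthy_001.wav", 0)]  # ... more recordings
X = np.stack([mfcc_features(path) for path, _ in dataset])
y = np.array([label for _, label in dataset])

classifiers = {
    "SVM": SVC(kernel="rbf"),
    "Extra Trees": ExtraTreesClassifier(n_estimators=200),
    "AdaBoost": AdaBoostClassifier(),
    "FFNN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
}
# Training accuracy only; a real evaluation would use speaker-dependent and
# speaker-independent splits as in the study.
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(f"{name}: training accuracy = {clf.score(X, y):.3f}")
```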

    Paralinguistic Privacy Protection at the Edge

    Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. As our emotional patterns and sensitive attributes such as our identity, gender, and mental well-being are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using these services. One approach to mitigating the risk of paralinguistic privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering before transmitting voice data. In this paper we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge prior to offloading to the cloud. We evaluate EDGY's on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with a 0.2% relative improvement in ABX score, or minimal performance penalties, in learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware. Comment: 14 pages, 7 figures. arXiv admin note: text overlap with arXiv:2007.1506
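    EDGY's internal architecture is not described in the abstract, so the sketch below only illustrates the general pattern it refers to: an on-device encoder whose output keeps linguistic content while a separate paralinguistic branch is filtered out before upload, with dynamic quantization applied so the model suits low-latency CPU/ARM inference. The GRU backbone and all layer sizes are hypothetical assumptions.

```python
# Illustrative sketch only: on-device content/paralinguistic split + quantization.
import torch
import torch.nn as nn

class EdgeEncoder(nn.Module):
    def __init__(self, n_mels: int = 40, content_dim: int = 64, sensitive_dim: int = 16):
        super().__init__()
        self.backbone = nn.GRU(n_mels, 128, batch_first=True)
        self.content_head = nn.Linear(128, content_dim)      # linguistic part (uploaded)
        self.sensitive_head = nn.Linear(128, sensitive_dim)  # paralinguistic part (kept on device)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        h, _ = self.backbone(mel_frames)
        # Only the content representation leaves the device; the sensitive
        # branch exists so training can disentangle the two factors.
        return self.content_head(h)

model = EdgeEncoder().eval()

# Dynamic quantization of the GRU and linear layers for CPU/ARM inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.GRU}, dtype=torch.qint8
)

mel = torch.randn(1, 100, 40)   # (batch, frames, mel bins) placeholder input
upload_safe = quantized(mel)    # representation that would be offloaded
print(upload_safe.shape)        # torch.Size([1, 100, 64])
```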