On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection
With advances seen in deep learning, voice-based applications are burgeoning,
ranging from personal assistants and affective computing to remote disease
diagnostics. As the voice contains both linguistic and paralinguistic
information (e.g., vocal pitch, intonation, speech rate, loudness), there is
growing interest in voice anonymization to preserve speaker privacy and
identity. Voice privacy challenges have emerged over the last few years and
focus has been placed on removing speaker identity while keeping linguistic
content intact. For affective computing and disease monitoring applications,
however, the paralinguistic content may be more critical. Unfortunately, the
effects that anonymization may have on these systems are still largely unknown.
In this paper, we fill this gap and focus on one particular health monitoring
application: speech-based COVID-19 diagnosis. We test two popular anonymization
methods and their impact on five different state-of-the-art COVID-19 diagnostic
systems using three public datasets. We validate the effectiveness of the
anonymization methods, compare their computational complexity, and quantify the
impact across different testing scenarios for both within- and across-dataset
conditions. Lastly, we show the benefits of anonymization as a data
augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss
seen with anonymized data.
Comment: 11 pages, 10 figures
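A minimal sketch of the augmentation idea described in the abstract, assuming a placeholder anonymizer and feature extractor (neither is the paper's actual pipeline): each training utterance contributes both its original and its anonymized version, carrying the same COVID-19 label.

```python
# Hedged sketch, not the paper's code: anonymized copies of training
# utterances used as data augmentation for a speech-based COVID-19 classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def anonymize_waveform(wav: np.ndarray) -> np.ndarray:
    """Placeholder for a real anonymization method (assumption); a real
    system (e.g. McAdams- or x-vector-based) would re-synthesise the signal."""
    return wav

def extract_features(wav: np.ndarray) -> np.ndarray:
    """Placeholder acoustic front-end (assumption): simple waveform statistics."""
    return np.array([wav.mean(), wav.std(), np.abs(wav).max()])

def build_augmented_set(waveforms, labels):
    # Each utterance yields two training examples: the original recording
    # and its anonymized counterpart, both with the same diagnostic label.
    feats, targets = [], []
    for wav, y in zip(waveforms, labels):
        feats.append(extract_features(wav))
        feats.append(extract_features(anonymize_waveform(wav)))
        targets.extend([y, y])
    return np.stack(feats), np.array(targets)

# Toy usage with random "audio"
rng = np.random.default_rng(0)
waveforms = [rng.standard_normal(16000) for _ in range(8)]
labels = [0, 1] * 4
X, y = build_augmented_set(waveforms, labels)
clf = LogisticRegression(max_iter=1000).fit(X, y)
```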
Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Human speech can be characterized by different components, including semantic
content, speaker identity and prosodic information. Significant progress has
been made in disentangling representations for semantic content and speaker
identity in Automatic Speech Recognition (ASR) and speaker verification tasks
respectively. However, extracting prosodic information remains an open and
challenging research question, both because of the intrinsic association among
attributes such as timbre and rhythm, and because robust, large-scale,
speaker-independent ASR still relies on supervised training schemes. The
aim of this paper is to address the disentanglement of emotional prosody from
speech based on unsupervised reconstruction. Specifically, we identify, design,
implement and integrate three crucial components in our proposed speech
reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech
signals into discrete units for semantic content, (2) a pretrained speaker
verification model to generate speaker identity embeddings, and (3) a trainable
prosody encoder to learn prosody representations. We first pretrain the
Prosody2Vec representations on unlabelled emotional speech corpora, then
fine-tune the model on specific datasets to perform Speech Emotion Recognition
(SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and
unweighted accuracies) and subjective (mean opinion score) evaluations on the
EVC task suggest that Prosody2Vec effectively captures general prosodic
features that can be smoothly transferred to other emotional speech. In
addition, our SER experiments on the IEMOCAP dataset reveal that the prosody
features learned by Prosody2Vec are complementary and beneficial for the
performance of widely used speech pretraining models and surpass the
state-of-the-art methods when combining Prosody2Vec with HuBERT
representations.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing
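A rough PyTorch sketch of the three-component reconstruction setup described above; layer choices, dimensions, and names are assumptions for illustration, not the authors' Prosody2Vec implementation. Content units and speaker embeddings are treated as fixed inputs; only the prosody encoder and decoder are trained to reconstruct the mel spectrogram.

```python
# Illustrative sketch (assumptions throughout), not the published model.
import torch
import torch.nn as nn

class ProsodyReconstruction(nn.Module):
    def __init__(self, n_units=100, unit_dim=128, spk_dim=192,
                 prosody_dim=128, n_mels=80):
        super().__init__()
        # (1) lookup table for discrete content units (e.g. clustered SSL units)
        self.unit_embed = nn.Embedding(n_units, unit_dim)
        # (3) trainable prosody encoder over the mel spectrogram
        self.prosody_encoder = nn.GRU(n_mels, prosody_dim, batch_first=True)
        # decoder maps [content ; speaker ; prosody] back to mel frames
        self.decoder = nn.Sequential(
            nn.Linear(unit_dim + spk_dim + prosody_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_mels),
        )

    def forward(self, units, spk_emb, mels):
        # units: (B, T) int64; spk_emb: (B, spk_dim) from a frozen pretrained
        # speaker-verification model (2); mels: (B, T, n_mels)
        content = self.unit_embed(units)                  # (B, T, unit_dim)
        prosody, _ = self.prosody_encoder(mels)           # (B, T, prosody_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, units.size(1), -1)
        recon = self.decoder(torch.cat([content, spk, prosody], dim=-1))
        return recon, prosody

# Toy forward pass and reconstruction loss
model = ProsodyReconstruction()
units = torch.randint(0, 100, (2, 50))
spk_emb = torch.randn(2, 192)
mels = torch.randn(2, 50, 80)
recon, prosody = model(units, spk_emb, mels)
loss = nn.functional.l1_loss(recon, mels)
```

Because the content and speaker inputs already account for semantics and identity, the reconstruction objective pushes the remaining variation, including emotional prosody, into the trainable prosody branch.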
Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification
Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to directly employ small-scale domain-specific data
to train complex neural network architectures due to domain mismatch and loss
in performance. Alternatively, cross-domain speaker verification for multiple
acoustic environments is a challenging task which could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
using the VoxCeleb data, followed by an approach which fine-tunes part of the
high-level network layers with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set, while simultaneously generalizing the
model to other acoustic domains. From the results, we demonstrate that diverse
acoustic environments affect the speaker verification performance, and that our
proposed approach of cross-domain adaptation can significantly improve the
results in this scenario.
Comment: To appear in INTERSPEECH 202
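A small PyTorch sketch of an RBF-kernel maximum mean discrepancy (MMD) term of the kind mentioned above; the single-kernel choice, bandwidth, and loss weighting are assumptions, not the paper's exact configuration.

```python
# Hedged sketch: MMD^2 between batches of speaker embeddings from two
# acoustic domains, usable as an auxiliary alignment loss during fine-tuning.
import torch

def rbf_mmd(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with a single Gaussian (RBF) kernel."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then Gaussian kernel values.
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2 * kernel(source, target).mean())

# Toy usage: embeddings from clean speech vs. another acoustic environment.
src = torch.randn(16, 256)
tgt = torch.randn(16, 256)
mmd_loss = rbf_mmd(src, tgt)
# total_loss = classification_loss + lambda_mmd * mmd_loss   # lambda_mmd assumed
```

Minimizing this term pulls the embedding distributions of the two domains together, which is the intuition behind keeping clean-set performance while generalizing to other acoustic conditions.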
It's not what you say but the way that you say it: an fMRI study of differential lexical and non-lexical prosodic pitch processing
Background: This study aims to identify the neural substrate involved in prosodic pitch processing. Functional magnetic resonance imaging was used to test the premise that prosody pitch processing is primarily subserved by the right cortical hemisphere. Two experimental paradigms were used: firstly, pairs of spoken sentences, where the only variation was a single internal phrase pitch change, and secondly, a matched condition utilizing pitch changes within analogous tone-sequence phrases. This removed the potential confounder of lexical evaluation. fMRI images were obtained using these paradigms.
Results: Activation was significantly greater within the right frontal and temporal cortices during the tone-sequence stimuli relative to the sentence stimuli.
Conclusion: This study showed that pitch changes, stripped of lexical information, are mainly processed by the right cerebral hemisphere, whilst the processing of analogous, matched, lexical pitch change is preferentially left sided. These findings, showing hemispherical differentiation of processing based on stimulus complexity, are in accord with a 'task dependent' hypothesis of pitch processing.