459 research outputs found

    On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection

    Full text link
    With advances seen in deep learning, voice-based applications are burgeoning, ranging from personal assistants, affective computing, to remote disease diagnostics. As the voice contains both linguistic and paralinguistic information (e.g., vocal pitch, intonation, speech rate, loudness), there is growing interest in voice anonymization to preserve speaker privacy and identity. Voice privacy challenges have emerged over the last few years and focus has been placed on removing speaker identity while keeping linguistic content intact. For affective computing and disease monitoring applications, however, the paralinguistic content may be more critical. Unfortunately, the effects that anonymization may have on these systems are still largely unknown. In this paper, we fill this gap and focus on one particular health monitoring application: speech-based COVID-19 diagnosis. We test two popular anonymization methods and their impact on five different state-of-the-art COVID-19 diagnostic systems using three public datasets. We validate the effectiveness of the anonymization methods, compare their computational complexity, and quantify the impact across different testing scenarios for both within- and across-dataset conditions. Lastly, we show the benefits of anonymization as a data augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss seen with anonymized data.Comment: 11 pages, 10 figure

    Towards Cross-Lingual Emotion Transplantation

    Get PDF

    Disentangling Prosody Representations with Unsupervised Speech Reconstruction

    Full text link
    Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks respectively. However, it is still an open challenging research question to extract prosodic information because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial for the performance of widely used speech pretraining models and surpass the state-of-the-art methods when combining Prosody2Vec with HuBERT representations.Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processin

    Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification

    Full text link
    Forensic audio analysis for speaker verification offers unique challenges due to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. The lack of real naturalistic forensic audio corpora with ground-truth speaker identity represents a major challenge in this field. It is also difficult to directly employ small-scale domain-specific data to train complex neural network architectures due to domain mismatch and loss in performance. Alternatively, cross-domain speaker verification for multiple acoustic environments is a challenging task which could advance research in audio forensics. In this study, we introduce a CRSS-Forensics audio dataset collected in multiple acoustic environments. We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics. Based on this fine-tuned model, we align domain-specific distributions in the embedding space with the discrepancy loss and maximum mean discrepancy (MMD). This maintains effective performance on the clean set, while simultaneously generalizes the model to other acoustic domains. From the results, we demonstrate that diverse acoustic environments affect the speaker verification performance, and that our proposed approach of cross-domain adaptation can significantly improve the results in this scenario.Comment: To appear in INTERSPEECH 202

    It's not what you say but the way that you say it: an fMRI study of differential lexical and non-lexical prosodic pitch processing

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>This study aims to identify the neural substrate involved in prosodic pitch processing. Functional magnetic resonance imaging was used to test the premise that prosody pitch processing is primarily subserved by the right cortical hemisphere.</p> <p>Two experimental paradigms were used, firstly pairs of spoken sentences, where the only variation was a single internal phrase pitch change, and secondly, a matched condition utilizing pitch changes within analogous tone-sequence phrases. This removed the potential confounder of lexical evaluation. fMRI images were obtained using these paradigms.</p> <p>Results</p> <p>Activation was significantly greater within the right frontal and temporal cortices during the tone-sequence stimuli relative to the sentence stimuli.</p> <p>Conclusion</p> <p>This study showed that pitch changes, stripped of lexical information, are mainly processed by the right cerebral hemisphere, whilst the processing of analogous, matched, lexical pitch change is preferentially left sided. These findings, showing hemispherical differentiation of processing based on stimulus complexity, are in accord with a 'task dependent' hypothesis of pitch processing.</p
    • …
    corecore