1,266 research outputs found

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady-State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
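    The paper's categorization itself is done by cortical strip maps and Adaptive Resonance Theory circuits; as a toy illustration of the underlying task only, the sketch below categorizes steady-state vowels by their first two formants with a simple nearest-centroid rule. The (F1, F2) centroids are hypothetical values, loosely in the range Peterson and Barney report for adult male speakers.

```python
# Toy illustration only: nearest-centroid vowel categorization in (F1, F2)
# space. The paper's actual model uses cortical strip maps and ART circuits.
import numpy as np

# Hypothetical mean (F1, F2) values in Hz, roughly in the range of the
# Peterson & Barney (1952) measurements for adult male speakers.
VOWEL_CENTROIDS = {
    "iy": (270, 2290),   # as in "heed"
    "ae": (660, 1720),   # as in "had"
    "uw": (300, 870),    # as in "who'd"
}

def categorize(f1: float, f2: float) -> str:
    """Return the vowel whose centroid is nearest in (F1, F2) space."""
    return min(
        VOWEL_CENTROIDS,
        key=lambda v: np.hypot(f1 - VOWEL_CENTROIDS[v][0],
                               f2 - VOWEL_CENTROIDS[v][1]),
    )

print(categorize(280, 2250))  # -> "iy"
```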

    Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques

    The growing use of voice user interfaces has led to a surge in the collection and storage of speech data. While data collection allows for the development of efficient tools powering most speech services, it also poses serious privacy issues for users, as centralized storage makes private personal speech data vulnerable to cyber threats. With the increasing use of voice-based digital assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the increasing ease with which personal speech data can be collected, the risk of malicious use of voice cloning and of speaker, gender, or pathological-condition recognition has increased. This thesis proposes solutions for anonymizing speech and evaluating the degree of anonymization. In this work, anonymization refers to making personal speech data unlinkable to an identity while maintaining the usefulness (utility) of the speech signal (e.g., access to linguistic content). We start by identifying several challenges that evaluation protocols need to consider to evaluate the degree of privacy protection properly. We clarify how anonymization systems must be configured for evaluation purposes and highlight that many practical deployment configurations do not permit privacy evaluation. Furthermore, we examine the most common voice conversion-based anonymization system and identify its weak points before suggesting new methods to overcome some limitations. We isolate all components of the anonymization system to evaluate the degree of speaker PPI (personally identifiable information) associated with each of them. Then, we propose several transformation methods for each component to reduce speaker PPI as much as possible while maintaining utility. We promote anonymization algorithms based on quantization-based transformation as an alternative to the widely used noise-based approach. Finally, we devise a new attack method that attempts to invert the anonymization. Comment: PhD thesis, Pierre Champion | Université de Lorraine - INRIA Nancy | for associated source code, see https://github.com/deep-privacy/SA-toolki
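    The quantization idea can be sketched in a few lines. This is not the thesis's implementation; it is a minimal illustration, with random stand-in features, of why snapping continuous frame-level features to a small learned codebook discards the fine-grained detail that carries speaker identity while preserving coarser (e.g., phonetic) structure.

```python
# Minimal sketch of quantization-based transformation for anonymization.
# Features are random stand-ins for frame-level speech representations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 64))   # stand-in frame-level features

# Learn a codebook; fewer centroids = stronger anonymization, at the
# cost of utility (less linguistic detail survives).
codebook = KMeans(n_clusters=48, n_init=10, random_state=0).fit(features)

def anonymize(frames: np.ndarray) -> np.ndarray:
    """Replace each frame by its nearest codebook centroid."""
    return codebook.cluster_centers_[codebook.predict(frames)]

anon = anonymize(features)
print(anon.shape)  # (1000, 64): same shape, quantized content
```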

    Advances in Probabilistic Deep Learning

    This thesis is concerned with methodological advances in probabilistic inference and their application to core challenges in machine perception and AI. Inferring a posterior distribution over the parameters of a model given some data is a central challenge that occurs in many fields ranging from finance and artificial intelligence to physics. Exact calculation is impossible in all but the simplest cases, and a rich field of approximate inference has been developed to tackle this challenge. This thesis develops both an advance in approximate inference and an application of these methods to the problem of speech synthesis. In the first section of this thesis, we develop a novel framework for constructing Markov Chain Monte Carlo (MCMC) kernels that can efficiently sample from high-dimensional distributions such as the posteriors that frequently occur in machine perception. We provide a specific instance of this framework and demonstrate that it can match or exceed the performance of Hamiltonian Monte Carlo without requiring gradients of the target distribution. In the second section of the thesis, we focus on the application of approximate inference techniques to the task of synthesising human speech from text. By using advances in neural variational inference we are able to construct a state-of-the-art speech synthesis system in which it is possible to control aspects of prosody, such as emotional expression, from significantly less supervised data than previous state-of-the-art methods.
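    For contrast with the thesis's gradient-free kernel framework (which is more sophisticated), the simplest gradient-free MCMC kernel is random-walk Metropolis. The sketch below shows only the interface such a kernel satisfies: propose, then accept or reject using the target log-density, with no gradient ever evaluated.

```python
# Illustrative baseline only: random-walk Metropolis, a gradient-free
# MCMC kernel. The thesis proposes a different, more efficient family.
import numpy as np

def rw_metropolis(log_prob, x0, n_steps=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x, lp = np.asarray(x0, float), log_prob(x0)
    samples = []
    for _ in range(n_steps):
        prop = x + step * rng.normal(size=x.shape)   # symmetric proposal
        lp_prop = log_prob(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis test
            x, lp = prop, lp_prop
        samples.append(x.copy())
    return np.array(samples)

# Sample a 2-D standard normal; log_prob's gradient is never needed.
chain = rw_metropolis(lambda z: -0.5 * np.sum(z**2), x0=np.zeros(2))
print(chain.mean(axis=0), chain.var(axis=0))  # ~[0, 0], ~[1, 1]
```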

    Speech and natural language processing for the assessment of customer satisfaction and neuro-degenerative diseases

    ABSTRACT: Interest in the automatic analysis of speech and text in different scenarios has been increasing. Acoustic analysis is frequently used to extract non-verbal information related to para-linguistic aspects such as articulation and prosody. Linguistic analysis focuses on capturing verbal information from written sources, which is suitable for evaluating customer satisfaction, or for health-care applications that assess patients with depression or other cognitive conditions. In call centers, many of the collected speech recordings capture customer opinions across different industry sectors. Only a small proportion of these calls is evaluated, so these processes can be automated using acoustic and linguistic analysis. In neuro-degenerative diseases such as Alzheimer's Disease (AD) and Parkinson's Disease (PD), the symptoms are progressive and directly linked to dementia, cognitive decline, and motor impairments. This calls for continuous evaluation of the neurological state, since patients become dependent, need intensive care, and show a declining ability to perform activities of daily living independently. This thesis proposes methodologies for acoustic and linguistic analyses in different scenarios related to customer satisfaction, cognitive disorders in AD, and depression in PD. The experiments include the evaluation of customer satisfaction, the assessment of genetic AD, linguistic analysis to discriminate PD, depression assessment in PD, and user state modeling based on the arousal plane for the evaluation of customer satisfaction, AD, and depression in PD. The acoustic features are mainly focused on articulation and prosody analyses, while the linguistic features are based on natural language processing techniques. Deep learning approaches based on convolutional and recurrent neural networks are also considered in this thesis.
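    As a rough, hedged sketch of the kind of prosody features such analyses build on (details differ from the thesis's actual feature set), one can extract a fundamental-frequency (F0) contour and a short-time energy contour, then summarize them with utterance-level statistics that feed a classifier. The librosa example clip below is a generic audio stand-in, not speech.

```python
# Hedged sketch: utterance-level prosody statistics from F0 and energy.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # stand-in for a recording

f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # F0 contour
energy = librosa.feature.rms(y=y)[0]                           # energy contour

# Simple summary statistics; real systems use much richer feature sets.
feats = {
    "f0_mean": np.nanmean(f0),        # NaNs mark unvoiced frames
    "f0_std": np.nanstd(f0),
    "voiced_fraction": np.mean(voiced),
    "energy_std": energy.std(),
}
print(feats)
```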

    Individual Differences in Speech Production and Perception

    Inter-individual variation in speech is a topic of increasing interest both in the human sciences and in speech technology. It can yield important insights into biological, cognitive, communicative, and social aspects of language. Written by specialists in psycholinguistics, phonetics, speech development, speech perception, and speech technology, this volume presents experimental and modeling studies that provide the reader with a deep understanding of interspeaker variability and its role in speech processing, speech development, and interspeaker interactions. It discusses how theoretical models take into account individual behavior, explains why interspeaker variability enriches speech communication, and summarizes the limitations of the use of speaker information in forensics.

    Statistical Parametric Methods for Articulatory-Based Foreign Accent Conversion

    Foreign accent conversion seeks to transform utterances from a non-native speaker (L2) to appear as if they had been produced by the same speaker but with a native (L1) accent. Such accent-modified utterances have been suggested to be effective in pronunciation training for adult second language learners. Accent modification involves separating the linguistic gestures and voice-quality cues from the L1 and L2 utterances, then transposing them across the two speakers. However, because of the complex interaction between these two sources of information, their separation in the acoustic domain is not straightforward. As a result, vocoding approaches to accent conversion result in a voice that is different from both the L1 and L2 speakers. In contrast, separation in the articulatory domain is straightforward, since linguistic gestures are readily available via articulatory data. However, because of the difficulty in collecting articulatory data, conventional synthesis techniques based on unit selection are ill-suited for accent conversion, given the small size of articulatory corpora and the inability to interpolate missing native sounds in the L2 corpus. To address these issues, this dissertation presents two statistical parametric methods for accent conversion that operate in the acoustic and articulatory domains, respectively. The acoustic method uses a cross-speaker statistical mapping to generate L2 acoustic features from the trajectories of L1 acoustic features in a reference utterance. Our results show significant reductions in the perceived non-native accents compared to the corresponding L2 utterance. The results also show a strong voice-similarity between accent conversions and the original L2 utterance. Our second (articulatory-based) approach consists of building a statistical parametric articulatory synthesizer for a non-native speaker, then driving the synthesizer with articulatory data from the reference L1 speaker. This statistical approach not only has low data requirements but also has the flexibility to interpolate missing sounds in the L2 corpus. In a series of listening tests, articulatory accent conversions were rated more intelligible and less accented than their L2 counterparts. In the final study, we compare the two approaches, acoustic and articulatory. Our results show that the articulatory approach, despite the direct access to the native linguistic gestures, is less effective in reducing perceived non-native accents than the acoustic approach.
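    The core of a cross-speaker statistical mapping can be illustrated with a toy version: the dissertation uses a statistical parametric mapping, but the sketch below substitutes a plain least-squares linear map trained on hypothetical time-aligned parallel features, just to show the train-then-drive pattern.

```python
# Toy cross-speaker mapping: fit W so L1 features map to L2 features,
# then drive conversion with an L1 reference utterance. The actual
# dissertation method is statistical parametric, not this linear map.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical parallel, frame-aligned acoustic features:
# rows are frames, columns are feature dimensions.
l1_frames = rng.normal(size=(2000, 24))                      # L1 speaker
true_map = rng.normal(size=(24, 24))
l2_frames = l1_frames @ true_map + 0.1 * rng.normal(size=(2000, 24))

# Ordinary least squares: l1_frames @ W ~= l2_frames.
W, *_ = np.linalg.lstsq(l1_frames, l2_frames, rcond=None)

# Conversion keeps the L1 reference's trajectories in L2 feature space.
converted = l1_frames @ W
print(np.mean((converted - l2_frames) ** 2))  # small residual error
```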

    Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings

    The adoption of advanced deep learning architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted from pre-trained deep learning models that were trained on large audio datasets for different tasks. In particular, we explore audio representations obtained using the emphasized channel attention, propagation, and aggregation time delay neural network (ECAPA-TDNN) and Wav2Vec2.0 models, trained on the VoxCeleb and LibriSpeech datasets, respectively. After extracting the embeddings, we benchmark several traditional classifiers, such as k-nearest neighbours (KNN), Gaussian naive Bayes, and a neural network, on the SD task. In comparison to standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines. Finally, we show that combining the two embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve the UAR by up to 2.60% and 6.32%, respectively. Comment: Accepted in International Journal of Speech Technology, Springer 2023; substantial overlap with arXiv:2204.0156
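    The downstream setup of embeddings-plus-traditional-classifier can be sketched as follows; the embeddings here are random stand-ins with plausible dimensionalities (192 for ECAPA-TDNN, 768 for Wav2Vec2.0 base), and the labels are hypothetical, not SEP-28k annotations.

```python
# Minimal sketch: pre-extracted speech embeddings (random stand-ins for
# ECAPA-TDNN and Wav2Vec2.0 outputs) fed to a KNN classifier for SD.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
ecapa = rng.normal(size=(500, 192))      # stand-in ECAPA-TDNN embeddings
w2v2 = rng.normal(size=(500, 768))       # stand-in Wav2Vec2.0 embeddings
X = np.hstack([ecapa, w2v2])             # combine the two embeddings
y = rng.integers(0, 2, size=500)         # 1 = stuttering event, 0 = fluent

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```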