
    Jitter and Shimmer measurements for speaker diarization

    Jitter and shimmer voice quality features have been successfully used to characterize speaker voice traits and detect voice pathologies. Jitter and shimmer measure variations in the fundamental frequency and amplitude of the speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of the Augmented Multiparty Interaction (AMI) corpus, a set of multi-party, spontaneous speech recordings. Both sets of features are independently modeled using mixtures of Gaussians and fused at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements into the baseline spectral features decreases the diarization error rate in most of the recordings.
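    As a rough illustration of the two measures, the sketch below computes relative jitter and shimmer from a sequence of pitch periods and peak amplitudes. It is a minimal sketch only: the period and amplitude values are assumed to come from an external pitch tracker, and the normalization follows the common "local" definitions rather than the paper's exact implementation.

        import numpy as np

        def local_jitter(periods):
            """Mean absolute difference between consecutive pitch periods,
            normalized by the mean period (relative local jitter)."""
            periods = np.asarray(periods, dtype=float)
            return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

        def local_shimmer(amplitudes):
            """Mean absolute difference between consecutive peak amplitudes,
            normalized by the mean amplitude (relative local shimmer)."""
            amplitudes = np.asarray(amplitudes, dtype=float)
            return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

        # Example values standing in for the output of a hypothetical pitch tracker
        periods = [0.0100, 0.0102, 0.0099, 0.0101]   # pitch periods in seconds
        amps = [0.80, 0.78, 0.82, 0.79]              # cycle peak amplitudes
        print(local_jitter(periods), local_shimmer(amps))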

    The use of long-term features for GMM- and i-vector-based speaker diarization systems

    Several factors contribute to the performance of speaker diarization systems. For instance, the appropriate selection of speech features is one of the key aspects that affect speaker diarization systems; other factors include the techniques employed to perform both segmentation and clustering. While static mel frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting them with additional features. In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with the state-of-the-art short-term cepstral and long-term prosodic features. Additionally, the use of delta dynamic features is also explored separately for both the segmentation and bottom-up clustering sub-tasks. The combination of the different feature sets is carried out at several levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are independently modeled and fused at the score likelihood level. Various feature combinations have been applied both for Gaussian mixture modeling and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate, is reported by using i-vector-based cosine-distance clustering together with a signal parameterization consisting of a combination of static cepstral coefficients, delta, voice-quality, and prosodic features. The best result shows about a 24% relative diarization error rate improvement compared to the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
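    As a hedged sketch of the score-level fusion described above, the snippet below models the short-term cepstral stream and the long-term (voice-quality plus prosodic) stream with two independent Gaussian mixture models and combines their per-frame log-likelihoods with a fusion weight. The scikit-learn GMMs, the weight value, and the random stand-in data are illustrative assumptions, not the paper's actual setup.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def fused_log_likelihood(gmm_cepstral, gmm_longterm, X_cep, X_lt, alpha=0.9):
            """Weighted score-level fusion of two independently trained GMMs.
            alpha weights the short-term cepstral stream; (1 - alpha) weights
            the long-term (voice-quality + prosodic) stream."""
            ll_cep = gmm_cepstral.score_samples(X_cep)  # per-frame log-likelihoods
            ll_lt = gmm_longterm.score_samples(X_lt)
            return alpha * ll_cep + (1.0 - alpha) * ll_lt

        # Illustrative use with random data standing in for real features
        rng = np.random.default_rng(0)
        gmm_cep = GaussianMixture(n_components=8).fit(rng.normal(size=(500, 13)))
        gmm_lt = GaussianMixture(n_components=4).fit(rng.normal(size=(500, 3)))
        scores = fused_log_likelihood(gmm_cep, gmm_lt,
                                      rng.normal(size=(100, 13)),
                                      rng.normal(size=(100, 3)))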

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks whose purpose is to extract one target speech signal or several, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, since they are generally used to compare different systems and determine their performance.
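    One common fusion pattern covered by such surveys is to concatenate per-frame audio and visual embeddings and estimate a time-frequency mask for the target speaker. The PyTorch sketch below illustrates that pattern only; the layer sizes, names, and frame-rate assumptions are hypothetical and do not correspond to any specific system in the paper.

        import torch
        import torch.nn as nn

        class AVMaskEstimator(nn.Module):
            """Toy audio-visual fusion: concatenate per-frame audio and visual
            embeddings, then predict a time-frequency mask for the target speaker."""
            def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
                super().__init__()
                self.fusion = nn.LSTM(audio_dim + visual_dim, hidden, batch_first=True)
                self.mask = nn.Sequential(nn.Linear(hidden, audio_dim), nn.Sigmoid())

            def forward(self, audio_feats, visual_feats):
                # audio_feats: (batch, frames, audio_dim) magnitude spectrogram
                # visual_feats: (batch, frames, visual_dim) lip-region embeddings,
                # assumed already upsampled to the audio frame rate
                fused, _ = self.fusion(torch.cat([audio_feats, visual_feats], dim=-1))
                return self.mask(fused)  # mask in [0, 1], applied to the mixture

        model = AVMaskEstimator()
        mask = model(torch.randn(2, 100, 257), torch.randn(2, 100, 512))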

    Multimodal spoofing and adversarial examples countermeasure for speaker verification

    Authentication mechanisms have always been prevalent in our society, even as far back as Ancient Mesopotamia in the form of seals. Since the advent of the digital age, the need for good digital authentication techniques has soared, stemming from the widespread adoption of online platforms and digitized content. Audio-based authentication such as speaker verification has been explored as another mechanism for achieving this goal. Specifically, an audio template belonging to the authorized user is stored with the authentication system; this template is later compared with the current input voice to authenticate the current user. Audio spoofing refers to attacks used to fool the authentication system to gain access to restricted resources, and such attacks have been proven to effectively degrade the performance of a variety of audio-authentication methods. In response, spoofing countermeasures for the task of anti-spoofing have been developed that can detect and successfully thwart these attacks. The advent of deep learning techniques and their usage in real-life applications has also led to the research and development of techniques for purposes ranging from exploiting weaknesses in deep learning models to stealing confidential information. One way in which a deep learning-based audio authentication model can be evaded is a class of attacks known as adversarial attacks, which add a carefully crafted perturbation to the input to elicit a wrong inference from the model.

    We first explore the performance that multimodality brings to the anti-spoofing task. We aim to augment a unimodal spoofing countermeasure with visual information to identify whether it can improve performance. Since visuals can serve as an additional domain of information, we examine whether the existing paradigm of using unimodal spoofing countermeasures for anti-spoofing can benefit from this new information. Our results indicate that augmenting an existing unimodal countermeasure with visual information does not provide any performance benefits. Future work can explore more tightly coupled multimodal models that use objectives like contrastive loss.

    We then study the vulnerability of deep learning-based multimodal speaker verification to adversarial attacks, which has not previously been established. We find that the multimodal models are heavily reliant on the visual modality and that attacking both modalities leads to a higher attack success rate. Future work can move on to stronger attacks that apply adversarial perturbations to bypass both the spoofing countermeasure and speaker verification.

    Finally, we investigate the feasibility of a generic evasion detector that can block both adversarial and spoofing attacks. Since both spoofing and adversarial attacks target speaker verification models, we add an adversarial attack detection mechanism, feature squeezing, onto the spoofing countermeasure to achieve this. We find that such a detector is feasible but involves a significant reduction in the identification of genuine samples. Future work can explore combining adversarial training as a defense for attacks that target the complete spoofing countermeasure and speaker verification pipeline.
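    Feature squeezing, the detection mechanism mentioned above, compares a model's output on the original input with its output on a "squeezed" (for example, bit-depth-reduced) copy and flags the input if the two disagree by more than a threshold. The sketch below is a minimal, generic illustration; the squeezing operation, the threshold, and the scoring interface are assumptions rather than the thesis's configuration.

        import numpy as np

        def squeeze_bit_depth(x, bits=8):
            """Reduce the effective bit depth of a waveform in [-1, 1]."""
            levels = 2 ** (bits - 1)
            return np.round(x * levels) / levels

        def is_adversarial(model_scores, waveform, threshold=0.1, bits=8):
            """Feature-squeezing check: flag the input if the model's scores on
            the original and on the squeezed waveform differ by more than a
            threshold. `model_scores` maps a waveform to class probabilities."""
            p_orig = model_scores(waveform)
            p_squeezed = model_scores(squeeze_bit_depth(waveform, bits))
            return np.abs(p_orig - p_squeezed).sum() > threshold

        # Illustrative use with a stand-in scoring function and random audio
        dummy_model = lambda w: np.array([np.tanh(np.abs(w).mean()),
                                          1 - np.tanh(np.abs(w).mean())])
        x = np.random.default_rng(0).uniform(-1, 1, 16000)
        print(is_adversarial(dummy_model, x))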

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, such as mobile communication services and smart homes.
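    As a small, hedged example of the feature-extraction step mentioned above, the snippet below computes static MFCCs and their deltas with librosa; the library choice, the 16 kHz sample rate, and the 13-coefficient setting are illustrative, not prescriptions from the book.

        import numpy as np
        import librosa

        # Load a bundled example clip and extract 13 static MFCCs plus their deltas,
        # a typical front-end for the recognizers described above.
        y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        delta = librosa.feature.delta(mfcc)
        features = np.vstack([mfcc, delta])  # shape: (26, num_frames)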