
    Multimodal Diarization Systems by Training Enrollment Models as Identity Representations

    This paper describes a post-evaluation analysis of the system developed by the ViVoLAB research group for the IberSPEECH-RTVE 2020 Multimodal Diarization (MD) Challenge. This challenge focuses on the study of multimodal systems for the diarization of audiovisual files and the assignment of an identity to each segment where a person is detected. In this work, we implemented two different subsystems to address this task using the audio and the video from audiovisual files separately. To develop our subsystems, we used state-of-the-art speaker and face verification embeddings extracted from publicly available deep neural networks (DNN). Different clustering techniques were also employed in combination with the tracking and identity assignment process. Furthermore, we included a novel back-end approach in the face verification subsystem that trains an enrollment model for each identity, which we have previously shown to improve results compared to averaging the enrollment data. Using this approach, we trained a learnable vector to represent each enrollment character. The loss function employed to train this vector was an approximated version of the detection cost function (aDCF), inspired by the DCF, a metric widely used to measure performance in verification tasks. In this paper, we also focus on exploring and analyzing the effect of training this vector with several configurations of this objective loss function. This analysis allows us to assess the impact of the configuration parameters of the loss on the amount and type of errors produced by the system.
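
    A minimal PyTorch sketch of the aDCF idea described above: the hard decision threshold of the DCF is relaxed with a steep sigmoid so that the miss and false-alarm rates become differentiable, and a learnable enrollment vector is optimized directly against the resulting cost. The embedding dimension, cost weights, sigmoid slope, and cosine scoring below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of an approximated detection cost function (aDCF) loss used to
# train a learnable enrollment vector. Hyperparameters are assumptions.
import torch

def adcf_loss(scores, labels, threshold=0.0,
              c_miss=1.0, c_fa=1.0, p_target=0.01, alpha=10.0):
    """scores: (N,) similarity scores; labels: (N,) 1 = target, 0 = non-target."""
    # Replace the hard step function of the DCF with a steep sigmoid so the
    # miss and false-alarm rates become differentiable in the scores.
    soft_accept = torch.sigmoid(alpha * (scores - threshold))
    target = labels.float()
    p_miss = ((1.0 - soft_accept) * target).sum() / target.sum().clamp(min=1.0)
    p_fa = (soft_accept * (1.0 - target)).sum() / (1.0 - target).sum().clamp(min=1.0)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Toy usage: learn one enrollment vector per identity against fixed embeddings.
emb_dim = 192
enroll = torch.randn(emb_dim, requires_grad=True)          # learnable identity model
test_emb = torch.nn.functional.normalize(torch.randn(32, emb_dim), dim=1)
labels = torch.randint(0, 2, (32,))
opt = torch.optim.Adam([enroll], lr=1e-2)
for _ in range(100):
    scores = test_emb @ torch.nn.functional.normalize(enroll, dim=0)
    loss = adcf_loss(scores, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

    Varying alpha, the cost weights, and p_target in such a loss shifts the balance between misses and false alarms, which is the kind of configuration analysis the abstract describes.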

    Data augmentation enhanced speaker enrollment for text-dependent speaker verification

    Data augmentation is commonly used to generate additional data from the available training data, in order to achieve a robust estimation of the parameters of complex models such as those for speaker verification (SV), especially for under-resourced applications. SV involves training speaker-independent (SI) models and speaker-dependent models, where each speaker is represented by a model derived from an SI model using that speaker's training data during the enrollment phase. While data augmentation for training SI models is well studied, data augmentation for speaker enrollment is rarely explored. In this paper, we propose the use of data augmentation methods to generate extra data for strengthening speaker enrollment. Each data augmentation method generates a new data set. Two strategies for using these data sets are explored: the first is to train separate systems and fuse them at the score level, and the other is to conduct multi-conditional training. Furthermore, we study the effect of data augmentation under noisy conditions. Experiments are performed on the RedDots challenge 2016 database, and the results validate the effectiveness of the proposed methods.
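
    A minimal sketch of the two enrollment-augmentation strategies contrasted above, score-level fusion versus multi-conditional pooling. The extract_embedding and augment callables are hypothetical placeholders for a speaker-embedding extractor and augmentation methods such as added noise or reverberation, not the paper's implementation.

```python
# Two ways to use augmentation-specific enrollment sets, assuming cosine
# scoring against a mean enrollment embedding.
import numpy as np

def cosine_score(enroll_emb, test_emb):
    return float(np.dot(enroll_emb, test_emb) /
                 (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))

def fused_score(enroll_wavs, test_emb, augmenters, extract_embedding):
    # Strategy 1: build one enrollment model per augmentation method and
    # fuse the per-system scores at the score level (equal-weight average).
    scores = []
    for augment in augmenters:
        embs = [extract_embedding(augment(w)) for w in enroll_wavs]
        scores.append(cosine_score(np.mean(embs, axis=0), test_emb))
    return float(np.mean(scores))

def multicondition_score(enroll_wavs, test_emb, augmenters, extract_embedding):
    # Strategy 2: pool all augmented copies into a single enrollment set
    # (multi-conditional use of the data) and score against the pooled model.
    embs = [extract_embedding(augment(w))
            for augment in augmenters for w in enroll_wavs]
    return cosine_score(np.mean(embs, axis=0), test_emb)
```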

    Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

    Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data, and found that training on simulated data that is more consistent with real data in this respect improves performance. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved new state-of-the-art diarization error rate (DER) performance on all of the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model. (Comment: under review at IEEE/ACM Transactions on Audio, Speech and Language Processing.)
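
    A minimal sketch of the iterative, speaker-by-speaker decoding idea: the decoder emits one speaker representation (attractor) per step, conditioned on those already decoded, and decoding stops when no further active speaker is found. The shapes, the dot-product activity score, and the stop criterion are illustrative assumptions; the paper's decoder and Enhancer module involve details omitted here.

```python
# Sketch of iterative decoding for an attention-based encoder-decoder
# diarization model; `encoder` and `decoder` are user-supplied modules.
import torch

@torch.no_grad()
def iterative_decode(encoder, decoder, features, max_speakers=8, stop_thresh=0.05):
    """features: (T, F) acoustic features -> list of (T,) activity probabilities."""
    frame_emb = encoder(features)                 # (T, D) frame-level embeddings
    decoded, activities = [], []
    for _ in range(max_speakers):
        # One attractor per step, conditioned on the attractors decoded so
        # far (during training, teacher forcing fixes the speaker order).
        attractor = decoder(frame_emb, decoded)       # (D,)
        probs = torch.sigmoid(frame_emb @ attractor)  # (T,) per-frame activity
        if probs.max() < stop_thresh:                 # no remaining active speaker
            break
        decoded.append(attractor)
        activities.append(probs)
    return activities
```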

    Analysis of audio data to measure social interaction in the treatment of autism spectrum disorder using speaker diarization and identification

    Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects communication and behavior in social environments. Common characteristics of a person with ASD include difficulty with communication or interaction with others, restricted interests paired with repetitive behaviors, and other symptoms that may affect the person's overall social life. People with ASD endure a lower quality of life due to their inability to navigate their daily social interactions. Autism is referred to as a spectrum disorder due to the variation in the type and severity of symptoms. As a result, measurement of the social interaction of a person with ASD in a clinical setting is inaccurate because the tests are subjective, time-consuming, and not naturalistic. The goal of this study is to lay the foundation for passively collecting continuous audio data from people with ASD through a voice recorder application that runs in the background of their mobile device, and to propose a methodology for understanding and analyzing the collected audio data while maintaining minimal human intervention. Speaker Diarization and Speaker Identification are the two methods explored to answer essential questions that arise when processing unlabeled audio data: who spoke when, and to whom does a certain speaker label belong? Speaker Diarization is the process of partitioning an audio signal that involves multiple people into homogeneous segments associated with each person; it answers the question "who spoke when?". The implemented Speaker Diarization algorithm utilizes state-of-the-art d-vector embeddings, which are produced by neural networks trained on large datasets so that variation in speech, accent, and acoustic conditions of the audio signal can be better accounted for. Furthermore, the algorithm uses a non-parametric, graph-based clustering algorithm known as spectral clustering, which is applied to the extracted d-vector embeddings to determine the number of unique speakers and assign each portion of the audio file to a specific cluster, as sketched below. Through various experiments and trials, we chose Microsoft Azure Cognitive Services due to the robust algorithms and models that are available to identify speakers in unlabeled audio data. The Speaker Identification API from Microsoft Azure Cognitive Services provides a state-of-the-art service to identify human voices through RESTful API calls. A simple web interface was implemented to send audio data to the Speaker Identification API, which returned data in JSON format; this returned data answers the question "to whom does a certain speaker label belong?". The proposed methods were tested extensively on numerous audio files containing various numbers of speakers who emulate a realistic conversational exchange. The results support our goal of digitally measuring the social interaction of people with ASD through the analysis of audio data while maintaining minimal human intervention. We were able to identify our target speaker and differentiate them from others given an audio signal, which could ultimately unlock valuable insights such as creating a biomarker to measure response to treatment.
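
    A minimal sketch of the diarization back end described above: window-level d-vector embeddings are clustered with spectral clustering on a cosine affinity matrix, with the number of speakers estimated from the largest eigengap. The affinity construction and eigengap heuristic are common choices in d-vector diarization pipelines and may differ from the exact pipeline used in this study.

```python
# Spectral clustering of d-vector embeddings into per-speaker clusters,
# assuming the embeddings were already extracted per sliding window.
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(embeddings, max_speakers=8):
    """embeddings: (N, D) window-level d-vectors -> (N,) speaker labels."""
    # Cosine affinity, rescaled to [0, 1] so it is a valid similarity matrix.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = (norm @ norm.T + 1.0) / 2.0
    # Estimate the number of speakers from the largest eigengap of the affinity.
    eigvals = np.sort(np.linalg.eigvalsh(affinity))[::-1]
    gaps = eigvals[:max_speakers - 1] - eigvals[1:max_speakers]
    n_speakers = int(np.argmax(gaps)) + 1
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return labels
```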