84 research outputs found
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
Neural network approaches have increasingly achieved considerable
improvements in the submodules of speaker diarization systems, including speaker
change detection and segment-wise speaker embedding extraction. Still, in the
clustering stage, traditional algorithms like probabilistic linear discriminant
analysis (PLDA) are widely used for scoring the similarity between two speech
segments. In this paper, we propose a supervised method to measure the
similarity matrix between all segments of an audio recording with sequential
bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is
applied on top of the similarity matrix to further improve the performance.
Experimental results show that our system significantly outperforms the
state-of-the-art methods and achieves a diarization error rate of 6.63% on the
NIST SRE 2000 CALLHOME database. Comment: Accepted for INTERSPEECH 201
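The clustering step described in this abstract (spectral clustering applied to a segment-by-segment similarity matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spectral_cluster` is hypothetical, the similarity matrix is assumed to be given (the Bi-LSTM scoring is not reproduced), the number of speakers is assumed to be known, and the normalized-affinity variant with a simple k-means step is one common choice among several.

```python
import numpy as np

def spectral_cluster(similarity, n_speakers, n_iter=50):
    """Cluster segments from a pairwise similarity matrix (illustrative sketch)."""
    s = np.asarray(similarity, dtype=float)
    s = (s + s.T) / 2.0               # enforce symmetry
    np.fill_diagonal(s, 0.0)          # ignore self-similarity
    degree = s.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetrically normalized affinity D^{-1/2} S D^{-1/2}
    norm_aff = s * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the top n_speakers vectors
    _, eigvecs = np.linalg.eigh(norm_aff)
    emb = eigvecs[:, -n_speakers:]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialisation, then plain k-means
    centers = [emb[0]]
    for _ in range(1, n_speakers):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels

# Toy similarity matrix: segments 0-2 and 3-5 belong to two different speakers.
sim = np.full((6, 6), 0.1)
sim[:3, :3] = 0.9
sim[3:, 3:] = 0.9
labels = spectral_cluster(sim, n_speakers=2)
```

On the toy matrix, the first three segments receive one label and the last three the other; which cluster gets which integer label is arbitrary.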
TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge
This paper describes the TSUP team's submission to the ISCSLP 2022
conversational short-phrase speaker diarization (CSSD) challenge which
particularly focuses on short-phrase conversations with a new evaluation metric
called conversational diarization error rate (CDER). In this challenge, we
explore three typical kinds of speaker diarization systems: spectral
clustering (SC) based diarization, target-speaker voice activity
detection (TS-VAD), and end-to-end neural diarization (EEND). Our
major findings are summarized as follows. First, the SC approach is
favored over the other two approaches under the new CDER metric. Second,
hyperparameter tuning is essential to CDER for all three types of speaker
diarization systems. Specifically, CDER decreases as the sub-segment
length is set longer. Finally, multi-system fusion through DOVER-LAP
worsens the CDER on the challenge data. Our submitted SC system
ultimately ranked third in the challenge.
Analysis of audio data to measure social interaction in the treatment of autism spectrum disorder using speaker diarization and identification
Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects communication and behavior in social environments. Some common characteristics of a person with ASD include difficulty with communication or interaction with others, restricted interests paired with repetitive behaviors, and other symptoms that may affect the person's overall social life. People with ASD endure a lower quality of life due to their inability to navigate their daily social interactions. Autism is referred to as a spectrum disorder due to the variation in type and severity of symptoms. Therefore, measurement of the social interaction of a person with ASD in a clinical setting is inaccurate because the tests are subjective, time-consuming, and not naturalistic. The goal of this study is to lay the foundation to passively collect continuous audio data of people with ASD through a voice recorder application that runs in the background of their mobile device and to propose a methodology to understand and analyze the collected audio data while maintaining minimal human intervention. Speaker Diarization and Speaker Identification are two methods that are explored to answer essential questions when processing unlabeled audio data, such as "who spoke when?" and "to whom does a certain speaker label belong?". Speaker Diarization is the process of partitioning an audio signal that involves multiple people into homogeneous segments associated with each person. It provides an answer to the question "who spoke when?". The implemented Speaker Diarization algorithm utilizes state-of-the-art d-vector embeddings that take advantage of neural networks by using large datasets for training so that variation in speech, accent, and acoustic conditions of the audio signal can be better accounted for. Furthermore, the algorithm uses a non-parametric, connection-based clustering algorithm commonly known as spectral clustering.
The spectral clustering algorithm is applied to these previously extracted d-vector embeddings to determine the number of unique speakers and assign each portion of the audio file to a specific cluster. Through various experiments and trials, we chose Microsoft Azure Cognitive Services due to the robust algorithms and models that are available to identify speakers in unlabeled audio data. The Speaker Identification API from Microsoft Azure Cognitive Services provides a state-of-the-art service to identify human voices through RESTful API calls. A simple web interface was implemented to send audio data to the Speaker Identification API, which returned data in JSON format. This returned data provides an answer to the question "to whom does a certain speaker label belong?". The proposed methods were tested extensively on numerous audio files which contain various numbers of speakers who emulate a realistic conversational exchange. The results support our goal of digitally measuring social interaction of people with ASD through the analysis of audio data while maintaining minimal human intervention. We were able to identify our target speaker and differentiate them from others given an audio signal, which could ultimately unlock valuable insights such as creating a biomarker to measure response to treatment.
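As a rough illustration of how per-segment cluster labels answer "who spoke when?", the sketch below merges consecutive segments with the same label into speaker turns and totals the talk time per cluster. It is not taken from the study: the function name `summarize_turns`, the fixed segment duration, and the integer cluster labels are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby

def summarize_turns(segment_labels, segment_seconds=1.0):
    """Merge consecutive same-label segments into turns and total talk time."""
    turns = []                      # (label, start_seconds, end_seconds)
    talk_time = defaultdict(float)  # label -> total seconds spoken
    start = 0.0
    for label, run in groupby(segment_labels):
        duration = len(list(run)) * segment_seconds
        turns.append((label, start, start + duration))
        talk_time[label] += duration
        start += duration
    return turns, dict(talk_time)

# Hypothetical clustering output: six 2-second segments, two clusters.
turns, talk_time = summarize_turns([0, 0, 1, 1, 1, 0], segment_seconds=2.0)
```

Here `turns` lists three turns (cluster 0, then cluster 1, then cluster 0 again) and `talk_time` gives the seconds attributed to each cluster. Mapping a cluster label to an actual person is the separate identification step the abstract describes.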
The DKU-DukeECE Diarization System for the VoxCeleb Speaker Recognition Challenge 2022
This paper describes the DKU-DukeECE submission to the 4th track of the
VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22). Our system contains a
fused voice activity detection model, a clustering-based diarization model, and
a target-speaker voice activity detection-based overlap detection model.
Overall, the submitted system is similar to our previous year's system in
VoxSRC-21. The difference is that we use a much better speaker embedding and a
fused voice activity detection, which significantly improves the performance.
Finally, we fuse 4 different systems using DOVER-lap and achieve a diarization
error rate of 4.75, which ranks 1st in track 4. Comment: arXiv admin note: substantial text overlap with arXiv:2109.0200
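For readers unfamiliar with the diarization error rate quoted in these abstracts, the following is a simplified frame-level sketch of the idea: errors are missed speech, false-alarm speech, and speaker confusion, divided by the amount of reference speech, under the best one-to-one mapping between reference and hypothesis speakers. It assumes one label per frame (no overlapping speech, no scoring collar) and uses a hypothetical function name `frame_der`. Official challenge scoring tools work on timed segments and differ in detail.

```python
from itertools import permutations

def frame_der(reference, hypothesis):
    """Frame-level DER sketch: one label per frame, None marks non-speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    ref_speakers = sorted({r for r in reference if r is not None})
    hyp_speakers = sorted({h for h in hypothesis if h is not None})
    ref_speech = sum(r is not None for r in reference)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")
    # Pad with placeholders so each reference speaker can also stay unmatched.
    padded = hyp_speakers + [object() for _ in ref_speakers]
    best_errors = None
    # Brute-force search over one-to-one speaker mappings (small cases only).
    for assignment in permutations(padded, len(ref_speakers)):
        mapping = dict(zip(ref_speakers, assignment))
        errors = 0
        for r, h in zip(reference, hypothesis):
            if r is None and h is None:
                continue            # correctly detected non-speech
            if r is None or h is None or mapping[r] != h:
                errors += 1         # false alarm, miss, or speaker confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / ref_speech

# Five reference speech frames; one confusion and one false alarm.
der = frame_der(["A", "A", "A", "B", "B", None],
                ["x", "x", "y", "y", "y", "y"])
```

In this toy case the best mapping pairs "A" with "x" and "B" with "y", leaving two erroneous frames out of five reference speech frames.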