Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminant analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance its performance: 1) adding neighbor windows to extract more
speaker information for each short segment; 2) using a hidden Markov model to
avoid frequent speaker change points; and 3) using agglomerative hierarchical
clustering for initialization and to provide hard and soft priors, in order to
overcome sensitivity to initialization. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance.
NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization
This paper details our speaker diarization system designed for multi-domain,
multi-microphone casual conversations. The proposed diarization pipeline uses
weighted prediction error (WPE)-based dereverberation as a front end, then
applies end-to-end neural diarization with vector clustering (EEND-VC) to each
channel separately. It integrates the diarization result obtained from each
channel using diarization output voting error reduction plus overlap
(DOVER-LAP). To harness the knowledge from the target domain and results
integrated across all channels, we apply self-supervised adaptation for each
session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP.
The proposed system was incorporated into NTT's submission for the distant
automatic speech recognition task in the CHiME-7 challenge. Our system achieved
65% and 62% relative improvements on the development and eval sets compared to
the organizer-provided VC-based baseline diarization system, securing third
place in diarization performance. Comment: 5 pages, 5 figures, Submitted to ICASSP 202
ECAPA-TDNN Embeddings for Speaker Diarization
Learning robust speaker embeddings is a crucial step in speaker diarization.
Deep neural networks can accurately capture speaker discriminative
characteristics and popular deep embeddings such as x-vectors are nowadays a
fundamental component of modern diarization systems. Recently, some
improvements over the standard TDNN architecture used for x-vectors have been
proposed. The ECAPA-TDNN model, for instance, has shown impressive performance
in the speaker verification domain, thanks to a carefully designed neural
model.
In this work, we extend, for the first time, the use of the ECAPA-TDNN model
to speaker diarization. Moreover, we improved its robustness with a powerful
augmentation scheme that concatenates several contaminated versions of the same
signal within the same training batch. The ECAPA-TDNN model turned out to
provide robust speaker embeddings under both close-talking and distant-talking
conditions. Our results on the popular AMI meeting corpus show that our system
significantly outperforms recently proposed approaches.
Speaker diarization of multi-party conversations using participants role information: political debates and professional meetings
Speaker diarization aims at inferring who spoke when in an audio stream and involves two simultaneous unsupervised tasks: (1) the estimation of the number of speakers, and (2) the association of speech segments to each speaker. Most of the recent efforts in the domain have addressed the problem using machine learning techniques or statistical methods (for a review see [11]), ignoring the fact that the data consists of instances of human conversations.
Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of localization information to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We demonstrate that joining the ICC and TDOA features improves the robustness of the localization features and reduces the diarization error rate (DER) of the complete system (using localization and spectral features). With this new localization feature, we achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data, and a 7.9% relative DER improvement on last year's RT09 evaluation data.
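As a rough illustration of the ICC definition in the abstract above (not the authors' implementation), the following Python sketch computes, for one analysis window, each channel's energy as a fraction of the summed energy over all channels. The function name, the window layout (one list of samples per channel), and the uniform fallback for an all-silent window are assumptions made for this example.

```python
from typing import List, Sequence


def intensity_channel_contribution(window: Sequence[Sequence[float]]) -> List[float]:
    """Return one ICC value per channel for a single analysis window.

    `window` holds one sequence of samples per microphone channel
    (hypothetical layout chosen for this sketch).
    """
    # Energy of each channel: sum of squared samples in the window.
    energies = [sum(x * x for x in channel) for channel in window]
    total = sum(energies)
    if total == 0.0:
        # Assumption: with no energy on any channel, split evenly.
        return [1.0 / len(energies)] * len(energies)
    # Relative energy of each channel with respect to all channels.
    return [e / total for e in energies]


if __name__ == "__main__":
    # Three channels, two samples each; energies are 2, 8 and 0.
    print(intensity_channel_contribution([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]))
```

By construction the values for one window are non-negative and sum to 1, so a speaker closer to one microphone shows up as a larger share on that channel.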
A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc. Towards this, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus called CRSS-PLTL is established that
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed G 3 algorithm in conjunction with
Hausdorff-distance based clustering to provide improved detection accuracy.
Additionally, we also exploit cross channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker. Comment: 5 Pages, 2 Figures, 2 Tables, Proceedings of INTERSPEECH 2016, San
Francisco, US
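The higher-level analysis mentioned in the abstract above (conversational turns and per-speaker speaking time) can be sketched from a diarization output as follows. This is a minimal illustration, not the paper's method: the segment format `(start_s, end_s, speaker)`, the function name, and the definition of a turn as a change of speaker between consecutive segments are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical diarization segment: (start in seconds, end in seconds, speaker label).
Segment = Tuple[float, float, str]


def participation_stats(segments: List[Segment]) -> Tuple[Dict[str, float], int]:
    """Return (speaking time per speaker, number of conversational turns)."""
    speaking_time: Dict[str, float] = defaultdict(float)
    turns = 0
    previous_speaker = None
    # Process segments in time order.
    for start, end, speaker in sorted(segments):
        speaking_time[speaker] += end - start
        # Assumption: a new turn starts whenever the speaker label changes.
        if speaker != previous_speaker:
            turns += 1
            previous_speaker = speaker
    return dict(speaking_time), turns


if __name__ == "__main__":
    demo = [(0.0, 2.0, "A"), (2.0, 3.0, "B"), (3.0, 4.0, "B"), (4.0, 6.0, "A")]
    print(participation_stats(demo))
```

Overlapping speech is not handled specially here; each segment simply adds its full duration to its speaker.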
Data Fusion based on Game Theory for Speaker Diarization
A novel algorithm based on bimatrix game theory has been developed to improve
the accuracy and reliability of a speaker diarization system. This algorithm
fuses the output data of two open-source speaker diarization programs, LIUM and
SHoUT, taking advantage of the best properties of each one. The performance of
this new system has been tested on audio streams from several movies. In
preliminary results on fragments of five movies, improvements of 63% in false
alarms and missed-speech errors have been achieved with respect to the LIUM and
SHoUT systems working alone. Moreover, the number of recognized speakers also
improves by 20%, getting close to the real number of speakers in the audio
stream.