555 research outputs found
Speaker Diarization with Lexical Information
This work presents a novel approach for speaker diarization to leverage
lexical information provided by automatic speech recognition. We propose a
speaker diarization system that can incorporate word-level speaker turn
probabilities with speaker embeddings into a speaker clustering process to
improve the overall diarization accuracy. To integrate lexical and acoustic
information in a comprehensive way during clustering, we introduce an adjacency
matrix integration for spectral clustering. Since words and word boundary
information for word-level speaker turn probability estimation are provided by
a speech recognition system, our proposed method works without any human
intervention for manual transcriptions. We show that the proposed method
improves diarization performance on various evaluation datasets compared to the
baseline diarization system using acoustic information only in speaker
embeddings
TSUP Speaker Diarization System for Conversational Short-phrase Speaker Diarization Challenge
This paper describes the TSUP team's submission to the ISCSLP 2022
conversational short-phrase speaker diarization (CSSD) challenge which
particularly focuses on short-phrase conversations with a new evaluation metric
called conversational diarization error rate (CDER). In this challenge, we
explore three kinds of typical speaker diarization systems, which are spectral
clustering(SC) based diarization, target-speaker voice activity
detection(TS-VAD) and end-to-end neural diarization(EEND) respectively. Our
major findings are summarized as follows. First, the SC approach is more
favored over the other two approaches under the new CDER metric. Second, tuning
on hyperparameters is essential to CDER for all three types of speaker
diarization systems. Specifically, CDER becomes smaller when the length of
sub-segments setting longer. Finally, multi-system fusion through DOVER-LAP
will worsen the CDER metric on the challenge data. Our submitted SC system
eventually ranks the third place in the challenge
Jitter and Shimmer measurements for speaker diarization
Jitter and shimmer voice quality features have been successfully
used to characterize speaker voice traits and detect voice pathologies.
Jitter and shimmer measure variations in the fundamental frequency
and amplitude of speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is addressed in the framework of Augmented Multiparty Interaction (AMI) corpus, a multi-party and spontaneous speech set of recordings. Both sets of features are independently modeled using mixture of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.Peer ReviewedPostprint (published version
Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of information (localization) to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC) based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We have demonstrated that by joining the ICC features and the TDOA features, the robustness of the localization features is improved and that the diarization error rate (DER) of the complete system (using localization and spectral features) has been reduced. By using this new localization feature, we have been able to achieve a 5.2% DER relative improvement in our development data, a 3.6% DER relative improvement in the RT07 evaluation data and a 7.9% DER relative improvement in the last year's RT09 evaluation data
A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc.. Towards this, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus is established called CRSS-PLTL, that
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed G 3 algorithm in conjunction with
Hausdorff-distance based clustering to provide improved detection accuracy.
Additionally, we also exploit cross channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.Comment: 5 Pages, 2 Figures, 2 Tables, Proceedings of INTERSPEECH 2016, San
Francisco, US
- …