Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of localization information to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We demonstrate that joining the ICC features and the TDOA features improves the robustness of the localization features and reduces the diarization error rate (DER) of the complete system (using localization and spectral features). With this new localization feature, we achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data, and a 7.9% relative DER improvement on last year's RT09 evaluation data.
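The ICC definition in the abstract above (each channel's energy relative to the summed energy of all channels) can be illustrated with a minimal sketch. The function name, the per-frame input layout, and the uniform fallback for silent frames are assumptions made for illustration; the paper's actual framing and normalization may differ.

```python
from typing import List, Sequence


def intensity_channel_contribution(frame: Sequence[Sequence[float]]) -> List[float]:
    """Return one ICC value per channel for a single analysis frame.

    frame[c] holds the samples of microphone channel c (hypothetical layout).
    """
    # Energy of each channel: sum of squared samples in the frame.
    energies = [sum(s * s for s in channel) for channel in frame]
    total = sum(energies)
    if total == 0.0:
        # Assumption: a silent frame gets a uniform contribution.
        return [1.0 / len(frame)] * len(frame)
    # Relative energy of each channel with respect to all channels.
    return [e / total for e in energies]


icc = intensity_channel_contribution([[1.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
```

By construction the values are non-negative and sum to one, so a speaker close to one microphone shows up as a larger share on that channel.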
Jitter and Shimmer measurements for speaker diarization
Jitter and shimmer voice quality features have been successfully
used to characterize speaker voice traits and detect voice pathologies.
Jitter and shimmer measure variations in the fundamental frequency
and amplitude of a speaker's voice, respectively. Due to their nature, they can be used to assess differences between speakers. In this paper, we investigate the usefulness of these voice quality features in the task of speaker diarization. The combination of voice quality features with the conventional spectral features, Mel-Frequency Cepstral Coefficients (MFCC), is evaluated on the Augmented Multiparty Interaction (AMI) corpus, a set of multi-party, spontaneous speech recordings. Both sets of features are independently modeled using mixtures of Gaussians and fused together at the score likelihood level. The experiments carried out on the AMI corpus show that adding jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.
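As a rough illustration of the two ideas in this abstract, the sketch below computes jitter and shimmer with one common "local" definition (mean absolute difference between consecutive cycles divided by the mean value) and fuses two per-stream log-likelihoods with a weighted sum. The function names, the choice of the local definition, and the fusion weight are assumptions; the abstract does not specify which variants the authors used.

```python
from typing import Sequence


def relative_perturbation(values: Sequence[float]) -> float:
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods: Sequence[float]) -> float:
    """Jitter: perturbation of consecutive pitch periods (seconds)."""
    return relative_perturbation(periods)


def local_shimmer(peak_amplitudes: Sequence[float]) -> float:
    """Shimmer: perturbation of consecutive cycle peak amplitudes."""
    return relative_perturbation(peak_amplitudes)


def fuse_log_likelihoods(ll_spectral: float, ll_voice_quality: float,
                         weight: float = 0.9) -> float:
    """Score-level fusion as a weighted sum of per-stream log-likelihoods.

    The weight value is a hypothetical default, not taken from the paper.
    """
    return weight * ll_spectral + (1.0 - weight) * ll_voice_quality


jitter = local_jitter([0.010, 0.011, 0.010, 0.011])
shimmer = local_shimmer([0.50, 0.50, 0.50])
fused = fuse_log_likelihoods(-10.0, -20.0, weight=0.5)
```

A perfectly periodic voice gives zero for both measures, which is why the values carry speaker-specific information only through their deviations.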
The Blame Game: Performance Analysis of Speaker Diarization System Components
In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI at the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).
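For readers unfamiliar with the metric, the diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time divided by the total reference speech time. The sketch below shows that arithmetic only; the function name and the example durations are hypothetical and are not figures from the paper, whose percentages come from oracle experiments rather than this direct decomposition.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           speaker_error: float, total_speech: float) -> float:
    """DER as a fraction: all error time over total reference speech time.

    All arguments are durations in seconds.
    """
    if total_speech <= 0.0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


# Hypothetical durations, chosen only to illustrate the computation.
der = diarization_error_rate(missed=30.0, false_alarm=20.0,
                             speaker_error=50.0, total_speech=1000.0)
```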
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
We propose a modular pipeline for the single-channel separation, recognition,
and diarization of meeting-style recordings and evaluate it on the Libri-CSS
dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet
separation architecture, followed by a speaker-agnostic speech recognizer, we
achieve state-of-the-art recognition performance in terms of Optimal Reference
Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization
module is employed to extract speaker embeddings from the enhanced signals and
to assign the CSS outputs to the correct speaker. Here, we propose a
syntactically informed diarization using sentence- and word-level boundaries of
the ASR module to support speaker turn detection. This results in a
state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for
the full meeting recognition pipeline.
Comment: Submitted to ICASSP 202
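The cpWER metric named in this abstract can be sketched as follows: concatenate each speaker's utterances, try every assignment of hypothesis speakers to reference speakers, and keep the assignment with the fewest word errors. This is a minimal brute-force sketch under stated assumptions (whitespace tokenization, empty padding when speaker counts differ, exhaustive permutations); it is not the evaluation code used by the authors.

```python
from itertools import permutations
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker: List[List[str]], hyp_by_speaker: List[List[str]]) -> float:
    """Concatenated minimum-permutation WER over per-speaker utterance lists."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker]
    # Assumption: pad the shorter side with empty speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best_errors = min(
        sum(word_edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words


score = cp_wer(
    [["hello world", "how are you"], ["fine thanks"]],
    [["fine thanks"], ["hello world", "how are we"]],
)
```

The exhaustive search grows factorially with the number of speakers, so a practical implementation would typically solve the assignment with a polynomial-time matching algorithm instead.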
Evaluation of spoken document retrieval for historic speech collections
The re-use of spoken word audio collections maintained by audiovisual archives is severely hindered by their generally limited access. The CHoral project, which is part of the CATCH program funded by the Dutch Research Council, aims to provide users of speech archives with online, instead of on-location, access to relevant fragments, instead of full documents. To meet this goal, a spoken document retrieval framework is being developed. This paper presents the evaluation efforts undertaken so far to assess and improve various aspects of the framework. These efforts include (i) evaluation of the automatically generated textual representations of the spoken word documents that enable word-based search, (ii) the development of measures to estimate the quality of the textual representations for use in information retrieval, and (iii) studies to establish the potential user groups of the to-be-developed technology, and the first versions of the user interface supporting online access to spoken word collections.