Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of localization information to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the energy of the signal arriving at each channel relative to the summed energy of all channels. We demonstrate that combining the ICC and TDOA features improves the robustness of the localization features and reduces the diarization error rate (DER) of the complete system (using localization and spectral features). With this new localization feature, we achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data, and a 7.9% relative DER improvement on last year's RT09 evaluation data.
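As an illustration of the ICC definition above (per-channel energy divided by the summed energy of all channels), the following is a minimal sketch and not the authors' implementation. The function name, the framing parameters `frame_len` and `hop`, and the uniform fallback for silent frames are assumptions made for this example.

```python
def intensity_channel_contribution(channels, frame_len, hop):
    """Return one ICC vector per frame.

    channels  : list of equal-length sample lists, one per microphone
    frame_len : samples per analysis frame (assumed parameter)
    hop       : samples between frame starts (assumed parameter)
    """
    n_channels = len(channels)
    n_samples = len(channels[0])
    icc = []
    for start in range(0, n_samples - frame_len + 1, hop):
        # Energy of each channel within the current frame.
        energies = [
            sum(x * x for x in ch[start:start + frame_len])
            for ch in channels
        ]
        total = sum(energies)
        if total == 0.0:
            # Assumption: a silent frame gets a uniform contribution.
            icc.append([1.0 / n_channels] * n_channels)
        else:
            # Each channel's share of the total energy; shares sum to 1.
            icc.append([e / total for e in energies])
    return icc


if __name__ == "__main__":
    # Two hypothetical channels; the second has twice the amplitude.
    near = [2.0] * 8
    far = [1.0] * 8
    for frame in intensity_channel_contribution([far, near], frame_len=4, hop=4):
        print(frame)
```

Because the values are normalized per frame, they sum to one across channels, so the feature reflects how energy is distributed over the microphones and not the overall loudness.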
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vectors (Ivec),
probabilistic linear discriminant analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance performance: 1) adding neighbor windows to extract more
speaker information for each short segment; 2) using a hidden Markov model to
avoid frequent speaker change points; and 3) using agglomerative hierarchical
clustering for initialization and to provide hard and soft priors, in order to
overcome the problem of sensitivity to initialization. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance.
Detection and handling of overlapping speech for speaker diarization
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken
language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings,
compared to other domains, exhibit an increased complexity due to the spontaneity of speech, reverberation effects, and also
due to the presence of overlapping speech.
Overlapping speech refers to situations when two or more speakers are speaking simultaneously. In meeting data, a
substantial portion of errors of the conventional speaker diarization systems can be ascribed to speaker overlaps, since usually
only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually
lead to corrupt single-speaker models and thus to a worse segmentation.
This thesis concerns the detection of overlapping speech segments and its further application for the improvement of speaker
diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on
distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component
analysis, linear discriminant analysis, or by a multi-layer perceptron.
In addition, we also investigate the possibility of employing long-term prosodic information. The most suitable subset from a set
of candidate prosodic features is determined in two steps. First, a ranking according to the minimum-redundancy maximum-relevance (mRMR) criterion is obtained, and then
a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features.
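The two-step selection described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the relevance scores and pairwise redundancy matrix are assumed to be precomputed (for example from mutual information estimates), the difference form of the mRMR score is one common variant, and `evaluate` is a hypothetical callback that returns a detection score for a feature subset.

```python
def mrmr_rank(relevance, redundancy):
    """Greedy mRMR ranking of feature indices.

    relevance  : relevance[i] is the relevance of feature i to the target
    redundancy : redundancy[i][j] is the redundancy between features i and j
    Both inputs are assumed to be precomputed.
    """
    selected = []
    remaining = list(range(len(relevance)))
    while remaining:
        def score(j):
            if not selected:
                return relevance[j]
            mean_red = sum(redundancy[j][s] for s in selected) / len(selected)
            # Difference form: relevance minus mean redundancy.
            return relevance[j] - mean_red

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def hill_climb_size(ranking, evaluate):
    """Grow the subset along the ranking until the score stops improving."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranking) + 1):
        current = evaluate(ranking[:k])
        if current <= best_score:
            break  # first step without improvement ends the climb
        best_k, best_score = k, current
    return ranking[:best_k]


if __name__ == "__main__":
    # Hypothetical scores for three features; 0 and 1 are highly redundant.
    relevance = [0.9, 0.8, 0.3]
    redundancy = [[1.0, 0.9, 0.1],
                  [0.9, 1.0, 0.1],
                  [0.1, 0.1, 1.0]]
    ranking = mrmr_rank(relevance, redundancy)
    # Hypothetical wrapper scores indexed by subset size.
    scores = {1: 0.60, 2: 0.70, 3: 0.65}
    print(ranking, hill_climb_size(ranking, lambda subset: scores[len(subset)]))
```

The ranking step is cheap because it never trains a detector, while the wrapper step needs only one evaluation per candidate subset size.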
The novel spatial as well as prosodic parameters are used in combination with spectral-based features suggested previously in
the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the
detection of overlapping speech, especially on data originating from a single recording site.
In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments
are also discarded from the model training. The proposed overlap labeling technique is integrated in Viterbi decoding, a part of
the diarization algorithm. During the system development it was discovered that it is favorable to do an independent
optimization of overlap exclusion and labeling with respect to the overlap detection system.
We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments
with NIST RT data show DER improvement on the RT '09 meeting recordings as well.
The addition of beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the
clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the
overlap exclusion behavior reveals large contrasts in improvement between individual meeting recordings as well as between
various settings of the overlap detection operating point. However, high performance variability across different recordings is
also typical of the baseline diarization system without any overlap handling.
Speaker Diarization Features: The UPM Contribution to the RT09 Evaluation
The Universidad Politécnica de Madrid proposed and used two new features in the Rich Transcription 2009 Evaluation that outperform the baseline system. The first is the intensity channel contribution, a feature related to the location of the speaker. The second is the logarithm of the interpolated fundamental frequency. This is the first time that both features have been applied to the clustering stage of diarization for meetings recorded with multiple distant microphones. Including both features improves the baseline results by 15.36% relative on the development set and 16.71% relative on the RT09 set. If we consider speaker errors only, the relative improvement is 23% on the development set and 32.83% on the RT09 set.
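For readers comparing the figures quoted in these abstracts, the sketch below shows how DER and a relative improvement are conventionally computed: DER is the sum of missed speech, false alarm, and speaker error time divided by the total scored speech time. The function names and the numbers in the usage example are hypothetical and do not come from any of the listed papers.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER as a fraction of the total scored speech time (same time unit throughout)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical error durations in seconds, for illustration only.
    base = diarization_error_rate(30.0, 10.0, 160.0, total_speech=1000.0)
    new = diarization_error_rate(30.0, 10.0, 110.0, total_speech=1000.0)
    print("baseline DER:", base)
    print("new DER:", new)
    print("relative improvement (%):", relative_improvement(base, new))
```

Applying `relative_improvement` to the speaker error term alone gives the "speaker errors only" style of figure, which is larger than the DER figure whenever missed speech and false alarms stay unchanged.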