565 research outputs found
Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of information (localization) to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC) based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We have demonstrated that by joining the ICC features and the TDOA features, the robustness of the localization features is improved and that the diarization error rate (DER) of the complete system (using localization and spectral features) has been reduced. By using this new localization feature, we have been able to achieve a 5.2% DER relative improvement in our development data, a 3.6% DER relative improvement in the RT07 evaluation data and a 7.9% DER relative improvement in the last year's RT09 evaluation data
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminative analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance its performance. 1) Adding neighbor windows to extract more
speaker information for each short segment. 2) Using a hidden Markov model to
avoid frequent speaker change points. 3) Using an agglomerative hierarchical
cluster to do initialization and present hard and soft priors, in order to
overcome the problem of initial sensitivity. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight to the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved
A Speaker Diarization System for Studying Peer-Led Team Learning Groups
Peer-led team learning (PLTL) is a model for teaching STEM courses where
small student groups meet periodically to collaboratively discuss coursework.
Automatic analysis of PLTL sessions would help education researchers to get
insight into how learning outcomes are impacted by individual participation,
group behavior, team dynamics, etc.. Towards this, speech and language
technology can help, and speaker diarization technology will lay the foundation
for analysis. In this study, a new corpus is established called CRSS-PLTL, that
contains speech data from 5 PLTL teams over a semester (10 sessions per team
with 5-to-8 participants in each team). In CRSS-PLTL, every participant wears a
LENA device (portable audio recorder) that provides multiple audio recordings
of the event. Our proposed solution is unsupervised and contains a new online
speaker change detection algorithm, termed G 3 algorithm in conjunction with
Hausdorff-distance based clustering to provide improved detection accuracy.
Additionally, we also exploit cross channel information to refine our
diarization hypothesis. The proposed system provides good improvements in
diarization error rate (DER) over the baseline LIUM system. We also present
higher level analysis such as the number of conversational turns taken in a
session, and speaking-time duration (participation) for each speaker.Comment: 5 Pages, 2 Figures, 2 Tables, Proceedings of INTERSPEECH 2016, San
Francisco, US
Automatic Segmentation of Broadcast News Audio using Self Similarity Matrix
Generally audio news broadcast on radio is com- posed of music, commercials,
news from correspondents and recorded statements in addition to the actual news
read by the newsreader. When news transcripts are available, automatic
segmentation of audio news broadcast to time align the audio with the text
transcription to build frugal speech corpora is essential. We address the
problem of identifying segmentation in the audio news broadcast corresponding
to the news read by the newsreader so that they can be mapped to the text
transcripts. The existing techniques produce sub-optimal solutions when used to
extract newsreader read segments. In this paper, we propose a new technique
which is able to identify the acoustic change points reliably using an acoustic
Self Similarity Matrix (SSM). We describe the two pass technique in detail and
verify its performance on real audio news broadcast of All India Radio for
different languages.Comment: 4 pages, 5 image
- …