79 research outputs found
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved
Development of a Speaker Diarization System for Speaker Tracking in Audio Broadcast News: a Case Study
A system for speaker tracking in broadcast-news audio data is presented, and the impact of the system's main components on overall speaker-tracking performance is evaluated. The process of speaker tracking in continuous audio streams involves several processing tasks and is therefore treated as a multistage process. The main building blocks of such a system include components for audio segmentation, speech detection, speaker clustering and speaker identification. The aim of the first three processes is to find homogeneous regions in continuous audio streams that belong to one speaker and to join all regions of the same speaker together. The task of organizing the audio data in this way is known as speaker diarization and plays an important role in various speech-processing applications. In our case the impact of speaker diarization was assessed in a speaker-tracking system by performing a comparative study of how each component influenced the overall speaker-detection results. The evaluation experiments were performed on broadcast-news audio data with a speaker-tracking system that was capable of detecting 41 target speakers. We implemented several different approaches in each component of the system and compared their performance by inspecting the final speaker-tracking results. The evaluation results indicate the importance of the audio-segmentation and speech-detection components, while no significant improvement in the overall results was achieved by additionally including a speaker-clustering component in the speaker-tracking system.
An Information Theoretic Approach to Speaker Diarization of Meeting Recordings
In this thesis we investigate a non-parametric approach to speaker diarization for meeting recordings based on an information-theoretic framework. The problem is formulated using the Information Bottleneck (IB) principle. Unlike other approaches where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. The distance between speech segments is selected as the Jensen-Shannon divergence, as it arises from the optimization of the IB objective function. In the first part of the thesis, we explore IB-based diarization with Mel frequency cepstral coefficients (MFCC) as input features. We study issues related to IB-based speaker diarization, such as optimizing the IB objective function and criteria for inferring the number of speakers. Furthermore, we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (Rich Transcription) meeting data for speaker diarization. The IB-based system achieves a similar speaker error rate (16.8%) to a baseline HMM/GMM system (17.0%). Because the approach relies on non-parametric clustering, it performs diarization six times faster than real time, while the baseline is slower than real time. The second part of the thesis proposes a novel feature combination system in the context of IB diarization. Both the speaker clustering and speaker realignment steps are discussed. In contrast to current systems, the proposed method avoids combining features by averaging log-likelihood scores. Two different sets of features were considered: (a) a combination of MFCC features with time delay of arrival (TDOA) features, and (b) a four-feature-stream combination of MFCC, TDOA, modulation spectrum and frequency domain linear prediction features. Experiments show that the proposed system achieves a 5% absolute improvement over the baseline with the two-feature combination, and 7% with the four-feature combination. The increase in algorithmic complexity of the IB system is minimal as more features are added. The system with four-feature input runs in real time, which is ten times faster than the GMM-based system.
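As a rough illustration of the distance measure named in this abstract, the following minimal Python sketch computes a weighted Jensen-Shannon divergence between two discrete distributions. The function names, the weights, and the toy distributions are assumptions made for illustration; they are not taken from the thesis, and the sketch does not reproduce the IB clustering procedure itself.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence between p and q.

    w_p and w_q are mixing weights that must sum to 1 (assumed equal here).
    """
    mixture = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl_divergence(p, mixture) + w_q * kl_divergence(q, mixture)


# Hypothetical distributions of two speech segments over three relevance variables.
segment_a = [0.7, 0.2, 0.1]
segment_b = [0.1, 0.3, 0.6]
print(js_divergence(segment_a, segment_b))
```

The divergence is zero for identical distributions and reaches its maximum of log 2 (in nats, with equal weights) when the two distributions do not overlap at all, so smaller values suggest segments that are better candidates for merging.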
Detection and handling of overlapping speech for speaker diarization
For the last several years, speaker diarization has been attracting substantial research attention as one of the spoken language technologies applied for the improvement, or enrichment, of recording transcriptions. Recordings of meetings, compared to other domains, exhibit increased complexity due to the spontaneity of speech, reverberation effects, and the presence of overlapping speech.
Overlapping speech refers to situations in which two or more speakers are speaking simultaneously. In meeting data, a substantial portion of the errors of conventional speaker diarization systems can be ascribed to speaker overlaps, since usually only one speaker label is assigned per segment. Furthermore, simultaneous speech included in training data can eventually lead to corrupt single-speaker models and thus to a worse segmentation.
This thesis concerns the detection of overlapping speech segments and its further application to the improvement of speaker diarization performance. We propose the use of three spatial cross-correlation-based parameters for overlap detection on distant microphone channel data. Spatial features from different microphone pairs are fused by means of principal component analysis, linear discriminant analysis, or a multi-layer perceptron.
In addition, we investigate the possibility of employing long-term prosodic information. The most suitable subset from a set of candidate prosodic features is determined in two steps. First, a ranking according to the mRMR criterion is obtained; then a standard hill-climbing wrapper approach is applied in order to determine the optimal number of features.
The novel spatial and prosodic parameters are used in combination with spectral-based features suggested previously in the literature. In experiments conducted on AMI meeting data, we show that the newly proposed features do contribute to the detection of overlapping speech, especially on data originating from a single recording site.
In speaker diarization, for segments including detected speaker overlap, a second speaker label is picked, and such segments are also discarded from the model training. The proposed overlap labeling technique is integrated into Viterbi decoding, a part of the diarization algorithm. During system development it was discovered that it is favorable to optimize overlap exclusion and overlap labeling independently with respect to the overlap detection system.
We report improvements over the baseline diarization system on both single- and multi-site AMI data. Preliminary experiments with NIST RT data show DER improvement on the RT '09 meeting recordings as well.
The addition of beamforming and a TDOA feature stream to the baseline diarization system, which was aimed at improving the clustering process, results in slightly higher effectiveness of the overlap labeling algorithm. A more detailed analysis of the overlap exclusion behavior reveals large differences in improvement between individual meeting recordings as well as between various settings of the overlap detection operating point. However, high performance variability across different recordings is also typical of the baseline diarization system without any overlap handling.
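The two uses of detected overlap described in this abstract (assigning a second speaker label and excluding overlap from model training) can be sketched in a few lines of Python. This is a simplified frame-level illustration under stated assumptions: the thesis integrates labeling into Viterbi decoding, which is not reproduced here, and the function names, the per-frame score format, and the example values are all hypothetical.

```python
def label_with_overlap(scores, overlap):
    """Return a tuple of speaker indices for each frame.

    scores:  one list of per-speaker scores per frame (higher is better).
    overlap: one boolean per frame, True where overlap was detected.
    Frames flagged as overlap receive the two best-scoring speakers.
    """
    labels = []
    for frame_scores, is_overlap in zip(scores, overlap):
        ranked = sorted(range(len(frame_scores)),
                        key=lambda s: frame_scores[s], reverse=True)
        if is_overlap and len(ranked) > 1:
            labels.append((ranked[0], ranked[1]))  # second label added
        else:
            labels.append((ranked[0],))
    return labels


def training_frames(overlap):
    """Indices of frames kept for speaker-model training (overlap excluded)."""
    return [i for i, is_overlap in enumerate(overlap) if not is_overlap]


# Hypothetical scores for three frames and three speakers.
scores = [[0.9, 0.1, 0.0], [0.4, 0.5, 0.1], [0.2, 0.3, 0.5]]
overlap = [False, True, False]
print(label_with_overlap(scores, overlap))
print(training_frames(overlap))
```

Keeping labeling and exclusion as separate functions mirrors the abstract's observation that the two steps are best tuned independently, since each may prefer a different operating point of the overlap detector.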
Advances in Subspace-based Solutions for Diarization in the Broadcast Domain
The motivation for this thesis is the need for robust solutions to the diarization problem. Diarization techniques should add value to the growing amount of available multimedia data by accurately discriminating between the speakers present in the audio signal. Unfortunately, until recently this type of technology was only viable under restricted conditions and therefore remained far from a general solution. There are multiple reasons behind the limited performance of diarization systems. The first is the high complexity of human voice production, in particular of the physiological processes needed to embed speaker-discriminative characteristics in the speech signal. This complexity makes the inverse process, estimating those characteristics from the audio, an inefficient task for current state-of-the-art techniques. Consequently, approximations must be considered instead. Modeling efforts have produced increasingly elaborate models, although these do not seek the ultimate physiological explanation of the speech signal. Instead, they learn relationships between acoustic signals from a large set of training data. The development of approximate models in turn gives rise to a second reason: domain variability. Because the relationships are learned from a specific training set, any domain change that alters the acoustic conditions with respect to the training data affects the assumed relationships and can cause consistent failures in the systems.
Our contribution to diarization technologies has focused on the broadcast domain. This domain is still a complex environment for diarization systems, in which no simplification of the task can be assumed. Therefore, efficient modeling of the audio must be developed in order to extract speaker information and infer the corresponding labeling. In addition, the presence of multiple acoustic conditions, due to the different programs and/or genres in the domain, requires techniques capable of adapting the knowledge acquired in a scenario where information is available to environments where such information is limited or simply unavailable.
To this end, the work carried out throughout the thesis has focused on three subtasks: speaker characterization, clustering and model adaptation. The first subtask seeks to model an audio fragment so as to obtain accurate representations of the speakers involved, highlighting their discriminative properties. In this area, a study of current modeling strategies has been carried out, paying special attention to the limitations of the extracted representations and to the types of errors they can produce. In addition, alternatives based on neural networks have been proposed that make use of the knowledge gained. The second subtask is clustering, which is responsible for developing strategies that seek the optimal labeling of speakers. The research carried out during this thesis has proposed new strategies for estimating the best speaker partition based on subspace techniques, especially PLDA. Finally, the model adaptation subtask seeks to transfer the knowledge obtained from a training set to alternative domains where no data is available from which to extract it. To this end, efforts have focused on the unsupervised extraction of speaker information from the very audio to be diarized, which is subsequently used to adapt the models involved.
Unsupervised methods for speaker diarization
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-95). Given a stream of unlabeled audio data, speaker diarization is the process of determining "who spoke when." We propose a novel approach to solving this problem by taking advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features and exploiting the inherent variabilities in the data through the use of unsupervised methods. Upon initial evaluation, our system achieves state-of-the-art results of 0.9% Diarization Error Rate in the diarization of two-speaker telephone conversations. The approach is then generalized to the problem of K-speaker diarization, for which we take measures to address issues of data sparsity and experiment with the use of the von Mises-Fisher distribution for clustering on a unit hypersphere. Our extended system performs competitively on the diarization of conversations involving two or more speakers. Finally, we present promising initial results obtained from applying variational inference on our front-end speaker representation to estimate the unknown number of speakers in a given utterance. By Stephen Shum. S.M.
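To give a feel for clustering on a unit hypersphere, the sketch below implements spherical k-means, which assigns length-normalized vectors to the centroid with the highest cosine similarity. This is a swapped-in, simpler technique and not the method of the thesis: it corresponds to hard assignments in a von Mises-Fisher mixture with equal concentration for every cluster. The function names, the fixed initial centroids, and the toy two-dimensional vectors are assumptions for illustration only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_clusters(points, centroids):
    """Assign each unit vector to the centroid with the highest cosine similarity."""
    return [max(range(len(centroids)), key=lambda k: dot(p, centroids[k]))
            for p in points]


def spherical_kmeans(vectors, init_centroids, iterations=10):
    """Cluster vectors by direction; returns (assignments, unit centroids)."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in init_centroids]
    for _ in range(iterations):
        assignments = assign_clusters(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                mean = [sum(column) for column in zip(*members)]
                centroids[k] = normalize(mean)
    return assign_clusters(points, centroids), centroids


# Hypothetical speaker representations pointing in two rough directions.
vectors = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
assignments, centroids = spherical_kmeans(vectors, [[1.0, 0.0], [0.0, 1.0]])
print(assignments)
```

Because every vector is normalized first, only direction matters: the second and fourth example vectors are longer than their neighbors but still fall into the same clusters.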