Speaker Diarization Based on Intensity Channel Contribution
The time delay of arrival (TDOA) between multiple microphones has been used since 2006 as a source of information (localization) to complement the spectral features for speaker diarization. In this paper, we propose a new localization feature, the intensity channel contribution (ICC), based on the relative energy of the signal arriving at each channel compared to the sum of the energy of all the channels. We have demonstrated that by joining the ICC features and the TDOA features, the robustness of the localization features is improved and the diarization error rate (DER) of the complete system (using localization and spectral features) is reduced. By using this new localization feature, we have been able to achieve a 5.2% relative DER improvement on our development data, a 3.6% relative DER improvement on the RT07 evaluation data and a 7.9% relative DER improvement on last year's RT09 evaluation data.
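The ICC definition in the abstract above (each channel's energy relative to the summed energy of all channels) can be illustrated with a minimal sketch. The function name, the per-frame input layout and the handling of silent frames are assumptions made for illustration; the abstract does not specify the windowing, smoothing or normalization used in the paper.

```python
def intensity_channel_contribution(frame_channels):
    """Return each channel's share of the total energy for one analysis frame.

    frame_channels: list of channels, each a list of samples for the same frame.
    (Hypothetical layout; the paper's actual framing is not given in the abstract.)
    """
    # Energy of each channel: sum of squared samples in the frame.
    energies = [sum(s * s for s in channel) for channel in frame_channels]
    total = sum(energies)
    if total == 0.0:
        # Assumption: a fully silent frame gets a uniform contribution.
        return [1.0 / len(energies)] * len(energies)
    # Relative energy of each channel compared to the sum over all channels.
    return [e / total for e in energies]


# Three microphones, two samples per frame.
icc = intensity_channel_contribution([[0.5, -0.5], [1.0, -1.0], [0.0, 0.0]])
```

In this toy frame the second microphone carries most of the energy, so its contribution is the largest; the contributions always sum to 1.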
Data Fusion based on Game Theory for Speaker Diarization
A novel algorithm based on bimatrix game theory has been developed to improve the accuracy and reliability of a speaker diarization system. This algorithm fuses the output data of two open-source speaker diarization programs, LIUM and SHoUT, taking advantage of the best properties of each one. The performance of this new system has been tested by means of audio streams from several movies. In preliminary results on fragments of five movies, improvements of 63% in false alarms and missed speech mistakes have been achieved with respect to the LIUM and SHoUT systems working alone. Moreover, we also improve the number of recognized speakers by 20%, getting close to the real number of speakers in the audio stream.
New experiments on speaker diarization for unsupervised speaking style voice building for speech synthesis
The universal use of speech synthesis in different applications would require easy development of new voices with little manual intervention. Given the amount of multimedia data available on the Internet and in the media, an interesting goal is the development of tools and methods to automatically build voices in several of their speaking styles. In previous work, a methodology for building this kind of tool was outlined, and preliminary experiments with a multi-style database were presented. In this paper we investigate this task further and propose several improvements based on selecting the appropriate number of initial speakers, the use or not of noise reduction filters, the use of F0, and the use of a music detection algorithm. We have shown that the best system, which uses a music detection algorithm, decreases the precision error by 22.36% relative for the development set and 39.64% relative for the test set compared with the baseline system, without degrading the figure of merit. The average precision for the test set is 90.62%, ranging from 76.18% for reports to 99.93% for weather reports.
Influence of transition cost in the segmentation stage of speaker diarization
In any speaker diarization system there is a segmentation phase and a clustering phase. Our system combines them in a single step in which segmentation and clustering are applied iteratively until a certain condition is met. In this paper we propose an improvement of the segmentation method that cancels a penalization that had been applied in previous works to any transition between speakers. We also study the performance when transitions between speakers are favoured instead of penalized. This last option achieves better results both for the development set (21.65% relative speaker error (SER) improvement) and for the test set (4.60% relative speaker error improvement).
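The three transition settings discussed in this abstract (penalized, cancelled, favoured) can be illustrated with a generic Viterbi decoder over speaker states. This is only a sketch under assumptions: the function name, the `switch_score` parameter and the toy log-likelihoods are hypothetical, and the abstract does not say that the authors' segmentation is implemented this way (for example, it may impose minimum-duration constraints not modelled here).

```python
def viterbi_segment(loglik, switch_score):
    """Assign one speaker per frame by maximizing the total path score.

    loglik[t][k]: log-likelihood of frame t under speaker model k.
    switch_score: term added whenever the speaker changes between frames
                  (negative penalizes transitions, 0 cancels the penalty,
                  positive favours transitions). Hypothetical parameter.
    """
    n_speakers = len(loglik[0])
    score = list(loglik[0])  # best path score ending in each speaker
    back = []                # back-pointers, one list per frame after the first
    for frame in loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_speakers):
            # Best previous speaker for landing in speaker k at this frame.
            best_prev = max(
                range(n_speakers),
                key=lambda j: score[j] + (0.0 if j == k else switch_score),
            )
            best = score[best_prev] + (0.0 if best_prev == k else switch_score)
            new_score.append(best + frame[k])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final speaker.
    state = max(range(n_speakers), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy log-likelihoods: the middle frame slightly prefers speaker 1.
toy = [[-1.0, -2.0], [-2.0, -1.5], [-1.0, -2.0]]
penalized = viterbi_segment(toy, switch_score=-5.0)
neutral = viterbi_segment(toy, switch_score=0.0)
```

With the penalty in place, the decoder keeps a single speaker for all three frames; with the penalty cancelled, it follows the per-frame evidence and switches to speaker 1 for the middle frame.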