
    Constrained speaker linking

    In this paper we study speaker linking (a.k.a. partitioning) given constraints on the distribution of speaker identities over speech recordings. Specifically, we show that the intractable partitioning problem becomes tractable when the constraints pre-partition the data into smaller cliques with non-overlapping speakers. The surprisingly common case where the speakers in telephone conversations are known, but the assignment of channels to identities is unspecified, is treated in a Bayesian way. We show that for the Dutch CGN database, where this channel assignment task is at hand, a lightweight speaker recognition system can quite effectively solve the channel assignment problem, with 93% of the cliques solved. We further show that the posterior distribution over channel assignment configurations is well calibrated. Comment: Submitted to Interspeech 2014, some typos fixed.
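
As a rough illustration of treating channel assignment in a Bayesian way, the sketch below computes a posterior over the two possible channel-to-speaker assignments of a two-channel conversation. It is not the paper's implementation: it assumes a flat prior, assumes the two channels are scored independently, and assumes a speaker recognition system that supplies a log-likelihood for each (channel, speaker) pair. The function name, the score layout, and the numbers are hypothetical.

```python
import math

def assignment_posterior(loglik, prior_straight=0.5):
    """Posterior over the two channel-to-speaker assignments.

    loglik[c][s] is an assumed log-likelihood that channel c was spoken by
    known speaker s (c, s in {0, 1}). "straight" maps channel 0 -> speaker 0
    and channel 1 -> speaker 1; "swapped" is the other permutation.
    """
    log_straight = math.log(prior_straight) + loglik[0][0] + loglik[1][1]
    log_swapped = math.log(1.0 - prior_straight) + loglik[0][1] + loglik[1][0]
    # Normalise in the log domain for numerical stability.
    m = max(log_straight, log_swapped)
    z = math.exp(log_straight - m) + math.exp(log_swapped - m)
    p_straight = math.exp(log_straight - m) / z
    return {"straight": p_straight, "swapped": 1.0 - p_straight}

# Hypothetical scores: channel 0 fits speaker 0 best, channel 1 fits speaker 1 best.
scores = [[-10.0, -14.0],
          [-13.0, -9.0]]
posterior = assignment_posterior(scores)
```

With these made-up scores the "straight" assignment receives almost all of the posterior mass; when all four scores are equal, the posterior falls back to the prior.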

    Restricted Boltzmann Machine vectors for speaker clustering

    Restricted Boltzmann Machines (RBMs) have been used in both the front-end and the back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. In this work, we perform the traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms, respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% compared to i-vectors for the single linkage algorithm. Peer reviewed. Postprint (published version).
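
The bottom-up clustering with cosine scoring mentioned above can be sketched as follows. This is a minimal, generic AHC in plain Python, not the authors' system: the toy two-dimensional vectors stand in for RBM vectors or i-vectors, and the stopping threshold and function names are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def agglomerative_clustering(vectors, threshold, linkage="average"):
    """Bottom-up clustering; returns clusters as lists of vector indices.

    Merging stops once the best cluster-to-cluster similarity falls below
    the (assumed) threshold.
    """
    clusters = [[i] for i in range(len(vectors))]
    sim = [[cosine_similarity(u, v) for v in vectors] for u in vectors]

    def cluster_similarity(a, b):
        pair_sims = [sim[i][j] for i in a for j in b]
        if linkage == "single":
            return max(pair_sims)  # most similar pair decides
        return sum(pair_sims) / len(pair_sims)  # average linkage

    while len(clusters) > 1:
        best, best_pair = -2.0, None  # any cosine similarity exceeds -2
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_similarity(clusters[x], clusters[y])
                if s > best:
                    best, best_pair = s, (x, y)
        if best < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy vectors: indices 0 and 1 point one way, indices 2 and 3 another.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
result = agglomerative_clustering(toy_vectors, threshold=0.8)
```

Passing `linkage="single"` switches the cluster score from the mean pairwise similarity to the best pairwise similarity, which corresponds to the two linkage variants compared in the abstract.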

    The 2015 Sheffield System for Longitudinal Diarisation of Broadcast Media

    Speaker diarisation is the task of answering "who spoke when" within a multi-speaker audio recording. Diarisation of broadcast media typically operates on individual television shows, and is a particularly difficult task, due to a high number of speakers and challenging background conditions. Using prior knowledge, such as that from previous shows in a series, can improve performance. Longitudinal diarisation makes it possible to use knowledge from previous audio files to improve performance, but requires finding matching speakers across consecutive files. This paper describes the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge. The challenge required longitudinal diarisation of data from BBC archives, under very constrained resource settings. Our system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows. The final result on the development set of 19 shows from five different television series was a Diarisation Error Rate of 50.77% in the diarisation and linking task.

    The Domain Mismatch Problem in the Broadcast Speaker Attribution Task

    The demand for high-quality metadata for the available multimedia content requires the development of new techniques able to correctly identify more and more information, including speaker information. The task known as speaker attribution aims at identifying all or part of the speakers in the audio under analysis. In this work, we carry out a study of the speaker attribution problem in the broadcast domain. Through our experiments, we illustrate the positive impact of diarization on the final performance. Additionally, we show the influence of the variability present in broadcast data, depicting the broadcast domain as a collection of subdomains with particular characteristics. Taking these two factors into account, we also propose alternative approximations that are robust against domain mismatch. These approximations include a semisupervised alternative as well as a new, totally unsupervised hybrid solution fusing diarization and speaker assignment. Thanks to these two approximations, our performance improves by around 50% relative. The analysis has been carried out using the corpus for the Albayzín 2020 challenge, a diarization and speaker attribution evaluation working with broadcast data. These data, provided by Radio Televisión Española (RTVE), the Spanish public Radio and TV Corporation, include multiple shows and genres to analyze the impact of new speech technologies in real-world scenarios.

    Factorized Sub-Space Estimation for Fast and Memory Effective I-vector Extraction

    Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as an i-vector. Since the "standard" i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that are able to obtain either accurate solutions at the expense of an increase in the computational load, or fast approximate solutions, which are traded for lower memory costs. We propose a new approach particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the memory needs for i-vector extraction, but is also fast and accurate compared to recently proposed approaches. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves the performance with respect to the fastest but inaccurate eigen-decomposition approach, using much less memory than other methods.

    Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering

    Speaker clustering is an important task in many applications such as Speaker Diarization as well as Speech Recognition. Speaker clustering can be done within a single multi-speaker recording (Diarization) or for a set of different recordings. In this work we are interested in the former case and we propose a simple iterative Mean Shift (MS) algorithm to deal with this problem. Traditionally, the MS algorithm is based on the Euclidean distance. We propose to use the cosine distance in order to build a new version of the MS algorithm. We report results as measured by speaker and cluster impurities on NIST SRE 2008 datasets.
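
To make the idea of replacing the Euclidean distance with the cosine distance concrete, here is a minimal sketch of a flat-kernel mean shift on length-normalised vectors. It is a generic textbook-style variant, not the iterative algorithm of the paper, whose details the abstract does not give. The bandwidth value (assumed to be below 1), the mode-merging rule, and all function names are assumptions for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosine_distance(u, v):
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return 1.0 - sum(a * b for a, b in zip(u, v))

def cosine_mean_shift(vectors, bandwidth, max_iter=50, tol=1e-6):
    """Flat-kernel mean shift on the unit sphere; returns one label per vector."""
    data = [normalize(v) for v in vectors]
    modes = []
    for start in data:
        point = start
        for _ in range(max_iter):
            neighbours = [d for d in data if cosine_distance(point, d) <= bandwidth]
            if not neighbours:  # defensive guard against rounding at the boundary
                break
            mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
            shifted = normalize(mean)  # project the mean back onto the sphere
            moved = cosine_distance(point, shifted)
            point = shifted
            if moved < tol:
                break
        modes.append(point)
    # Vectors whose modes lie within one bandwidth of each other share a label.
    labels, centres = [], []
    for mode in modes:
        for k, centre in enumerate(centres):
            if cosine_distance(mode, centre) <= bandwidth:
                labels.append(k)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return labels

# Toy data: two vectors near the x-axis, two near the y-axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = cosine_mean_shift(toy, bandwidth=0.1)
```

The number of clusters is not fixed in advance; it follows from the bandwidth, which is why mean shift is attractive when the number of speakers is unknown.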

    Diarización de locutores en señales de audio de radiotelevisión

    The main goal of this bachelor's degree thesis is to implement and analyze different techniques used to develop a speaker diarization system in the context of the Albayzín 2016 speaker diarization evaluation. Speaker diarization consists in determining the time intervals in which different speakers take part in a given recording, without any additional information besides the audio signal, and independently of the channel type and of the presence of any kind of background noise. This work is divided into two parts. The first one, conditioned by the Albayzín evaluation, which was held between September and October 2016, focuses on adapting a reference system to the training and development data provided for the evaluation, and on analyzing alternative techniques in order to improve the performance of that reference system. The second part focuses on incorporating i-vector-based techniques into specific stages of the speaker diarization process, developing a new system and measuring its performance under the same conditions as those defined by the Albayzín 2016 evaluation. Thus, this bachelor's degree thesis, using the Albayzín 2016 evaluation as its general framework, allows us to study several techniques currently used in the different stages of a speaker diarization system: feature extraction, activity detection, segmentation and clustering. Furthermore, it provides a comparative analysis of these techniques in terms of performance and of the execution time each one requires.