17 research outputs found

Constrained speaker linking

Author: Brümmer Niko
van Leeuwen David A.
Publication date: 01/01/2014

In this paper we study speaker linking (a.k.a. partitioning) given constraints on the distribution of speaker identities over speech recordings. Specifically, we show that the intractable partitioning problem becomes tractable when the constraints pre-partition the data into smaller cliques with non-overlapping speakers. The surprisingly common case where speakers in telephone conversations are known, but the assignment of channels to identities is unspecified, is treated in a Bayesian way. We show that for the Dutch CGN database, where this channel assignment task is at hand, a lightweight speaker recognition system can quite effectively solve the channel assignment problem, with 93% of the cliques solved. We further show that the posterior distribution over channel assignment configurations is well calibrated. Comment: Submitted to Interspeech 2014, some typos fixed.

arXiv.org e-Print Archive

Radboud Repository
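
The Bayesian treatment of channel assignment summarised in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical score matrix `llr`, where `llr[i][j]` is a calibrated log-likelihood ratio (against a shared background model) that channel `i` was spoken by known speaker `j`, that channels are scored independently, and that every assignment is equally likely a priori unless a `prior` is supplied. The function name `assignment_posterior` is an illustrative choice.

```python
import math
from itertools import permutations


def assignment_posterior(llr, prior=None):
    """Posterior over channel-to-speaker assignments within one clique.

    llr[i][j] is assumed to be a log-likelihood ratio that channel i was
    spoken by speaker j (hypothetical input, not from the paper).
    Returns a dict mapping each permutation to its posterior probability.
    """
    n = len(llr)
    perms = list(permutations(range(n)))
    if prior is None:
        # Assumption: uniform prior over all assignments.
        prior = [1.0 / len(perms)] * len(perms)
    # Log joint: log prior plus the sum of per-channel scores.
    log_joint = [
        math.log(p) + sum(llr[i][perm[i]] for i in range(n))
        for perm, p in zip(perms, prior)
    ]
    # Normalise in a numerically stable way (subtract the maximum first).
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return {perm: w / total for perm, w in zip(perms, weights)}


# Two channels, two known speakers: the scores favour channel 0 -> speaker 0.
example_scores = [[2.0, -1.0], [-1.5, 3.0]]
posterior = assignment_posterior(example_scores)
```

With these example scores the unswapped assignment `(0, 1)` receives a posterior of about 0.9994. Enumerating permutations is only feasible because each clique is small, which is the point the abstract makes about pre-partitioned data.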

Restricted Boltzmann Machine vectors for speaker clustering

Author: Hernando Pericás Francisco Javier
Khan Umair
Safari Pooyan
Publication venue: 'International Speech Communication Association'
Publication date: 01/01/2018

Restricted Boltzmann Machines (RBMs) have been used in both the front-end and the back-end of speaker verification systems. In this work, we apply RBMs as a front-end in the context of speaker clustering. Speakers' utterances are transformed into a vector representation by means of RBMs. These vectors, referred to as RBM vectors, have been shown to preserve speaker-specific information and are used for the task of speaker clustering. In this work, we perform the traditional bottom-up Agglomerative Hierarchical Clustering (AHC). Using the RBM vector representation of speakers, the performance of speaker clustering is improved. The evaluation has been performed on audio recordings of Catalan TV broadcast shows. The experimental results show that our proposed system outperforms the baseline i-vector system in terms of Equal Impurity (EI). Using cosine scoring, relative improvements of 11% and 12% are achieved for the average and single linkage clustering algorithms, respectively. Using PLDA scoring, the RBM vectors achieve a relative improvement of 11% compared to i-vectors for the single linkage algorithm. Peer Reviewed. Postprint (published version)

Crossref

UPCommons. Portal del coneixement obert de la UPC
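
The bottom-up AHC with cosine scoring mentioned in the abstract above can be sketched as follows. This is a generic illustration rather than the paper's system: the input vectors stand in for RBM vectors or i-vectors, the stopping `threshold` is an assumed parameter, and the function name `ahc_cosine` is hypothetical. Single linkage scores two clusters by their most similar pair of members, average linkage by the mean pairwise similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc_cosine(vectors, threshold, linkage="average"):
    """Merge the two most similar clusters until no pair reaches threshold.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def score(c1, c2):
        sims = [cosine_similarity(vectors[i], vectors[j]) for i in c1 for j in c2]
        if linkage == "single":
            return max(sims)           # closest pair of members
        return sum(sims) / len(sims)   # average over all cross pairs

    while len(clusters) > 1:
        best, (a, b) = max(
            (score(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: three vectors near the x-axis, three near the y-axis.
toy_vectors = [[1.0, 0.05], [0.9, 0.0], [1.0, -0.1],
               [0.0, 1.0], [0.1, 2.0], [-0.05, 1.0]]
toy_clusters = ahc_cosine(toy_vectors, threshold=0.5, linkage="single")
```

On the toy data both linkage options recover the two groups. The pairwise search is quadratic per merge, which is acceptable for a sketch but not for large collections.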

The 2015 Sheffield System for Longitudinal Diarisation of Broadcast Media

Author: Deena S.
Doulaty M.
Hain T.
Milner R.
Ng R.
Saz O.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2015

Speaker diarisation is the task of answering "who spoke when" within a multi-speaker audio recording. Diarisation of broadcast media typically operates on individual television shows, and is a particularly difficult task, due to a high number of speakers and challenging background conditions. Using prior knowledge, such as that from previous shows in a series, can improve performance. Longitudinal diarisation allows knowledge from previous audio files to be used to improve performance, but requires finding matching speakers across consecutive files. This paper describes the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge. The challenge required longitudinal diarisation of data from BBC archives, under very constrained resource settings. Our system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows. The final result on the development set of 19 shows from five different television series was a Diarisation Error Rate of 50.77% in the diarisation and linking task.

Crossref

White Rose Research Online

Speaker diarisation and longitudinal linking in multi-genre broadcast data

Author: Gales MJF
Karanasou P
Lanchantin P
Liu X
Qian Y
Wang L
Woodland PC
Zhang C
Publication venue: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
Publication date: 01/01/2015

This paper presents a multi-stage speaker diarisation system with longitudinal linking developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non-speech segmenter. A newly developed linking stage is then added to the basic diarisation output, aiming to identify speakers across multiple episodes of the same series. The longitudinal constraint imposes incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question and those broadcast earlier in time. The nature of the data, as well as the longitudinal linking constraint, positions this diarisation task as a new open research topic, and a particularly challenging one. Different linking clustering metrics are compared and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set. This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485

Apollo (Cambridge)

The Domain Mismatch Problem in the Broadcast Speaker Attribution Task

Author: Lleida Eduardo
Miguel Antonio
Ortega Alfonso
Viñals Ignacio
Publication venue: 'MDPI AG'
Publication date: 01/01/2021

The demand for high-quality metadata for the available multimedia content requires the development of new techniques able to correctly identify more and more information, including speaker information. The task known as speaker attribution aims at identifying all or part of the speakers in the audio under analysis. In this work, we carry out a study of the speaker attribution problem in the broadcast domain. Through our experiments, we illustrate the positive impact of diarization on the final performance. Additionally, we show the influence of the variability present in broadcast data, depicting the broadcast domain as a collection of subdomains with particular characteristics. Taking these two factors into account, we also propose alternative approximations that are robust against domain mismatch. These approximations include a semisupervised alternative as well as a totally unsupervised new hybrid solution fusing diarization and speaker assignment. Thanks to these two approximations, our performance is boosted by around 50% relative. The analysis has been carried out using the corpus for the Albayzín 2020 challenge, a diarization and speaker attribution evaluation working with broadcast data. These data, provided by Radio Televisión Española (RTVE), the Spanish public radio and TV corporation, include multiple shows and genres to analyze the impact of new speech technologies in real-world scenarios.

Multidisciplinary Digital Publishing Institute

Repositorio Universidad de Zaragoza

Directory of Open Access Journals

Factorized Sub-Space Estimation for Fast and Memory Effective I-vector Extraction

Author: Cumani Sandro
Laface Pietro
Publication venue: IEEE - INST ELECTRICAL ELECTRONICS ENGINEERS INC
Publication date: 01/01/2013

Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as an i-vector. Since the "standard" i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that are able to obtain either accurate solutions at the expense of an increase of the computational load, or fast approximate solutions, which are traded for lower memory costs. We propose a new approach particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the memory needs for i-vector extraction, but is also fast and accurate compared to recently proposed approaches. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves the performance with respect to the fastest but inaccurate eigen-decomposition approach, using much less memory than other methods.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering

Author: Mohammed Senoussaoui
Patrick Kenny
Pierre Dumouchel
Themos Stafylakis
Publication date: 03/04/2020

Speaker clustering is an important task in many applications such as speaker diarization and speech recognition. Speaker clustering can be done within a single multi-speaker recording (diarization) or for a set of different recordings. In this work we are interested in the former case and we propose a simple iterative Mean Shift (MS) algorithm to deal with this problem. Traditionally, the MS algorithm is based on Euclidean distance. We propose to use the cosine distance in order to build a new version of the MS algorithm. We report results as measured by speaker and cluster impurities on NIST SRE 2008 datasets.

CiteSeerX
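
The idea of running Mean Shift with cosine rather than Euclidean distance, as summarised in the abstract above, can be sketched as follows. This is a simplified illustration and not the authors' algorithm: it assumes a flat kernel, a `bandwidth` below 1 expressed as a cosine distance, and a final merging rule (modes closer than half the bandwidth share a label) chosen only for this example. The function name `cosine_mean_shift` is hypothetical.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_mean_shift(vectors, bandwidth=0.3, max_iter=50, tol=1e-6):
    """Return one cluster label per input vector."""
    points = [normalize(v) for v in vectors]
    modes = []
    for start in points:
        mode = start
        for _ in range(max_iter):
            # Flat kernel: every point within the bandwidth counts equally.
            neighbours = [p for p in points if cosine_distance(mode, p) <= bandwidth]
            if not neighbours:
                break  # guard against floating-point edge cases
            mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
            new_mode = normalize(mean)  # project the mean back onto the unit sphere
            shift = cosine_distance(mode, new_mode)
            mode = new_mode
            if shift < tol:
                break
        modes.append(mode)
    # Assumed merging rule: modes within half a bandwidth share one label.
    labels, centres = [], []
    for mode in modes:
        for k, centre in enumerate(centres):
            if cosine_distance(mode, centre) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return labels


# Toy data: three vectors near the x-axis, three near the y-axis.
toy_points = [[1.0, 0.05], [0.9, 0.0], [1.0, -0.1],
              [0.0, 1.0], [0.1, 2.0], [-0.05, 1.0]]
toy_labels = cosine_mean_shift(toy_points)
```

Unlike the agglomerative approach, the number of clusters is not fixed in advance and no merge threshold on pairwise scores is needed. The bandwidth plays that role instead.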

Speaker Diarization and Linking of Large Corpora

Author: Bourlard Hervé
Ferras Marc
Publication date: 19/12/2013

Infoscience - École polytechnique fédérale de Lausanne

Diarización de locutores en señales de audio de radiotelevisión (Speaker diarization in radio and television audio signals)

Author: Ramírez Hereza Pablo
Publication date: 01/06/2017

The main goal of this bachelor's thesis is to implement and analyze different techniques for developing a speaker diarization system in the context of the Albayzín 2016 speaker diarization evaluation. Speaker diarization consists of determining the time intervals in which different speakers take part in a given recording, without any additional information besides the audio signal, and independently of the channel type and of any background noise. The work is divided into two stages. The first, conditioned by the Albayzín evaluation held in September and October 2016, focuses on adapting a reference system to the training and development data provided, and on analyzing alternative techniques to improve its performance. The second focuses on incorporating i-vector-based techniques into specific stages of the diarization process, developing a new system and measuring its performance under the same conditions as the Albayzín 2016 evaluation. Thus, using the Albayzín 2016 evaluation as a general framework, this thesis allows us to study several techniques currently used in the different stages of a diarization system: feature extraction, activity detection, segmentation and clustering. It also provides a comparison between these techniques in terms of performance and execution time.

Biblos-e Archivo