47 research outputs found

Towards a better integration of written names for unsupervised speakers identification in videos

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Poignant Johann
Quénot Georges
Publication venue: HAL CCSD
Publication date: 01/01/2013

International audience

Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call it "late naming". In this setting, written names extracted from title blocks tend to lead to high-precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged. While "integrated naming" yields identification performance similar to that of "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server
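The "early naming" constraint can be illustrated with a minimal sketch of agglomerative clustering that refuses to merge two clusters carrying different written names. This is not the authors' implementation: the `Cluster` type, the Euclidean distance between centroids, the single name per cluster, and the fixed `threshold` stopping criterion are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    # Hypothetical representation: mean feature vector of the cluster's speech turns.
    centroid: List[float]
    # Number of speech turns merged so far (weights the centroid update).
    size: int = 1
    # Written name (e.g. read by OCR in a title block) attached to the cluster, if any.
    name: Optional[str] = None


def can_merge(a: Cluster, b: Cluster) -> bool:
    # Cannot-link constraint: two clusters with different written names never merge.
    return a.name is None or b.name is None or a.name == b.name


def constrained_ahc(clusters: List[Cluster], threshold: float) -> List[Cluster]:
    clusters = list(clusters)
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest admissible pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not can_merge(clusters[i], clusters[j]):
                    continue
                d = math.dist(clusters[i].centroid, clusters[j].centroid)
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Stop when no admissible pair is left or the closest one is too far apart.
        if best is None or best[0] > threshold:
            break
        _, i, j = best
        a, b = clusters[i], clusters[j]
        n = a.size + b.size
        merged = Cluster(
            centroid=[(a.size * x + b.size * y) / n for x, y in zip(a.centroid, b.centroid)],
            size=n,
            name=a.name or b.name,  # the merged cluster inherits the compatible name
        )
        del clusters[j]  # j > i, so deleting j first keeps index i valid
        del clusters[i]
        clusters.append(merged)
    return clusters


turns = [
    Cluster([0.0, 0.0], name="Alice"),
    Cluster([0.1, 0.0]),
    Cluster([0.3, 0.0], name="Bob"),
    Cluster([5.0, 5.0]),
]
result = constrained_ahc(turns, threshold=1.0)
```

In this toy run the unnamed turn joins "Alice" and three clusters remain. Without the `can_merge` check, "Bob" would also be absorbed at this threshold, which is the kind of error the constraint is meant to prevent.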

LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

Author: Barras Claude
Bredin Hervé
Li Ming
Lin Qingjian
Yin Ruiqing
Publication venue: International Speech Communication Association
Publication date: 23/07/2019

More and more neural network approaches have achieved considerable improvements on submodules of speaker diarization systems, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms such as probabilistic linear discriminant analysis (PLDA) are widely used to score the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.

Comment: Accepted for INTERSPEECH 201

arXiv.org e-Print Archive

Crossref
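The spectral clustering step can be sketched generically, assuming the similarity matrix has already been produced (by the Bi-LSTM in the paper; here it is simply an input array). The symmetric normalized Laplacian, the eigengap heuristic for choosing the number of speakers, and the small k-means with farthest-point initialization are standard textbook choices and assumptions on our part, not details taken from the paper.

```python
from typing import Optional

import numpy as np


def _kmeans(x: np.ndarray, k: int, n_iter: int = 100) -> np.ndarray:
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centers = [x[0]]
    for _ in range(1, k):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[int(np.argmax(d2))])
    centers = np.array(centers, dtype=float)
    labels = np.full(len(x), -1)
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def spectral_clustering(similarity: np.ndarray, n_clusters: Optional[int] = None,
                        max_clusters: int = 10) -> np.ndarray:
    s = np.asarray(similarity, dtype=float)
    s = 0.5 * (s + s.T)       # enforce symmetry (new array, input left untouched)
    np.fill_diagonal(s, 0.0)  # ignore self-similarity
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(s.sum(axis=1), 1e-12))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} S D^{-1/2}
    laplacian = np.eye(len(s)) - d_inv_sqrt[:, None] * s * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    if n_clusters is None:
        # Eigengap heuristic: k sits just before the largest jump among the smallest eigenvalues.
        m = min(max_clusters, len(s) - 1)
        n_clusters = int(np.argmax(np.diff(eigvals[: m + 1]))) + 1
    u = eigvecs[:, :n_clusters]
    u = u / np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1e-12)  # row-normalize
    return _kmeans(u, n_clusters)


# Toy similarity matrix: segments 0-2 from one speaker, segments 3-4 from another.
sim = np.full((5, 5), 0.1)
sim[:3, :3] = 0.9
sim[3:, 3:] = 0.9
labels = spectral_clustering(sim)
```

On this block-structured matrix the eigengap selects two clusters and the segments split as {0, 1, 2} and {3, 4}. Real similarity matrices are noisier, so practical systems often add refinement steps that this sketch leaves out.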

Collaborative Annotation for Person Identification in TV Shows

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Bruneau Pierrick
Budnik Mateusz
Poignant Johann
Stefas Mickael
Tamisier Thomas
Publication venue: HAL CCSD
Publication date: 06/09/2015

International audience

This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the code for annotation is made available on GitHub. The tool can also be used in a crowd-sourcing environment.

Hal - Université Grenoble Alpes

Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Le Viet-Bac
Poignant Johann
Quénot Georges
Publication venue: HAL CCSD
Publication date: 09/09/2012

Poster Session: Speaker Recognition III

International audience

We propose an approach for unsupervised speaker identification in TV broadcast videos, by combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for propagating the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR, and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, which contains 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data provided only a 57.5% F-measure when considering all the speakers and 45.7% without anchor speakers.

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server
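The name propagation step can be illustrated with a plain TF-IDF weighting of co-occurrence durations. The paper uses a task-adapted variant whose exact form is not given in the abstract, so the formulas below (term frequency as a name's share of a cluster's name co-occurrence time, inverse document frequency computed over speaker clusters) and the function name `propagate_names` are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def propagate_names(
    clusters: List[str],
    cooccurrence: Dict[Tuple[str, str], float],
) -> Dict[str, Optional[str]]:
    # cooccurrence maps (speaker cluster, OCR name) -> co-occurrence duration in seconds.
    per_cluster: Dict[str, Dict[str, float]] = defaultdict(dict)
    clusters_with_name: Dict[str, set] = defaultdict(set)
    for (cluster, name), seconds in cooccurrence.items():
        if seconds > 0:
            per_cluster[cluster][name] = per_cluster[cluster].get(name, 0.0) + seconds
            clusters_with_name[name].add(cluster)

    assignment: Dict[str, Optional[str]] = {}
    for cluster in clusters:
        names = per_cluster.get(cluster, {})
        total = sum(names.values())
        best_name, best_score = None, 0.0
        for name, seconds in names.items():
            tf = seconds / total  # share of this cluster's name co-occurrence time
            idf = math.log(len(clusters) / len(clusters_with_name[name]))  # rarity across clusters
            score = tf * idf
            if score > best_score:
                best_name, best_score = name, score
        assignment[cluster] = best_name  # None when no name scores above zero
    return assignment


clusters = ["spk1", "spk2", "spk3"]
cooccurrence = {("spk1", "Alice"): 30.0, ("spk2", "Alice"): 5.0, ("spk2", "Bob"): 20.0}
names = propagate_names(clusters, cooccurrence)
```

Here "spk2" overlaps with both names but is labeled "Bob", whose co-occurrence is longer and more specific, while "spk3" stays unnamed. A name that co-occurs with every cluster gets an IDF of zero and is never assigned by this sketch.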

The CAMOMILE Collaborative Annotation Platform for Multi-modal, Multi-lingual and Multi-media Documents

Author: Adda Gilles
Barras Claude
Besacier Laurent
Bredin Hervé
Bruneau Pierrick
Budnik Mateusz
Ekenel Hazim
Francopoulo Gil
Hernando Javier
Mariani Joseph
Morros Ramon
Poignant Johann
Quénot Georges
Rosset Sophie
Stefas Mickael
Tamisier Thomas
Publication venue: HAL CCSD
Publication date: 01/05/2016

International audience

In this paper, we describe the organization and implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analyses that can be performed on 3M data, the structure of the server was intentionally kept simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored to the task at hand can then be developed in an agile way, relying on simple but reliable services for managing the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry-run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.

Hal - Université Grenoble Alpes

QCompere @ REPERE 2013

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Ekenel Hazim Kemal
Fortier Guillaume
Hua Gao
Le Viet-Bac
Mignon Alexis
Poignant Johann
Quénot Georges
Rosset Sophie
Roy Anindya
Sarkar Achintya
Stiefelhagen Rainer
Tapaswi Makarand
Verbeek Jakob
Yang Qian
Publication venue: HAL CCSD
Publication date: 22/08/2013

International audience

We describe the QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims at gathering four communities (face recognition, speaker identification, optical character recognition and named entity detection) towards the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each of these communities), constituting the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on the REPERE 2013 test set and their advantages and limitations are discussed.

Hal - Université Grenoble Alpes

HAL - Normandie Université

INRIA a CCSD electronic archive server

HAL-Rennes 1

The Speed Submission to DIHARD II: Contributions & Lessons Learned

Author: Barras Claude
Bredin Hervé
Brutti Alessio
Cornell Samuele
Evans Nicholas
Korshunov Pavel
Marcel Sébastien
Patino Jose
Sahidullah Md
Serizel Romain
Sivasankaran Sunit
Squartini Stefano
Vincent Emmanuel
Yin Ruiqing
Publication venue: HAL CCSD
Publication date: 01/01/2019

This paper describes the speaker diarization systems developed by the Speed team for the Second DIHARD Speech Diarization Challenge (DIHARD II). Besides describing the system, which considerably outperformed the challenge baselines, we also focus on the lessons learned from the numerous approaches we tried for single- and multi-channel systems. We present several components of our diarization system, including domain categorization, speech enhancement, speech activity detection, speaker embeddings, clustering methods, resegmentation, and system fusion. We analyze and discuss the effect of each component on the overall diarization performance within the realistic settings of the challenge.

INRIA a CCSD electronic archive server
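Diarization performance in DIHARD II, as in most of the diarization work listed here, is measured with the diarization error rate (DER). Its standard NIST definition is recalled below for reference; the exact scoring settings used in the paper (collar, overlap handling) are not stated in the abstract and are not assumed here.

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{FA}} + T_{\text{conf}}}{T_{\text{ref}}}
```

Here $T_{\text{miss}}$ is reference speech time attributed to no speaker, $T_{\text{FA}}$ is system speech time where the reference has none, $T_{\text{conf}}$ is speech time attributed to the wrong speaker after the optimal one-to-one mapping between reference and system speakers, and $T_{\text{ref}}$ is the total reference speech time.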
