Search CORE

Corpus selection

Author: Adda Gilles
Barras Claude
Hernando Pericás Francisco Javier
Ekenel Hazim Kemal
Morros Rubió Josep Ramon
Publication date: 01/01/2013

Deliverable of the project Collaborative Annotation of multi-MOdal, MultI-Lingual and multi-mEdia documents. This document describes the different corpora that will be used during the Camomile project. Peer Reviewed. Preprint.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

QCompere @ REPERE 2013

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Ekenel Hazim Kemal
Fortier Guillaume
Hua Gao
Le Viet-Bac
Mignon Alexis
Poignant Johann
Quénot Georges
Rosset Sophie
Roy Anindya
Sarkar Achintya
Stiefelhagen Rainer
Tapaswi Makarand
Verbeek Jakob
Yang Qian
Publication venue: HAL CCSD
Publication date: 22/08/2013

International audience. We describe the QCompere consortium submissions to the REPERE 2013 evaluation campaign. The REPERE challenge aims to gather four communities (face recognition, speaker identification, optical character recognition and named entity detection) around the same goal: multimodal person recognition in TV broadcast. First, four mono-modal components are introduced (one for each of these communities); they constitute the elementary building blocks of our various submissions. Then, depending on the target modality (speaker or face recognition) and on the task (supervised or unsupervised recognition), four different fusion techniques are introduced: they can be summarized as propagation-, classifier-, rule- or graph-based approaches. Finally, their performance is evaluated on the REPERE 2013 test set and their advantages and limitations are discussed.

Hal - Université Grenoble Alpes

Towards a complete Binary Key System for the Speaker Diarization Task

Author: Delgado Héctor
Fredouille Corinne
Serrano Javier
Publication venue: HAL CCSD
Publication date: 14/09/2014

International audience. Speaker diarization is the task of partitioning an audio stream into homogeneous segments according to speaker identity. Today, state-of-the-art speaker diarization systems achieve very competitive performance. However, any small improvement in Diarization Error Rate (DER) usually comes at the cost of very long processing times (real-time factor above one), which makes these systems unsuitable for some time-critical, real-life applications. Recently, a novel fast speaker diarization technique based on speaker modeling using binary keys was presented. The proposed technique runs up to ten times faster than real time with little increase in DER. Although the approach shows great potential, the presented results are still preliminary. The goal of this paper is to further investigate this technique, in order to move towards a complete binary-key-based system for the speaker diarization task. Preliminary experiments in Speech Activity Detection (SAD) based on binary keys show the feasibility of the binary key modeling approach for this task. Furthermore, the system has been tested on two different kinds of test data: meeting audio recordings and TV shows. The experiments carried out on the NIST RT05 and REPERE databases show promising results and indicate that there is still room for further improvement.
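The two quantities this abstract trades off can be written down in a few lines. The sketch below uses the standard definitions (real-time factor as processing time over audio duration, DER as the sum of missed speech, false alarm and speaker confusion time over total reference speech time); the function names and the numbers in the usage note are illustrative assumptions, not figures from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER as a fraction of the total reference speech time.

    All arguments are durations in seconds.
    """
    return (missed + false_alarm + confusion) / total_speech
```

Under these definitions, a system that is ten times faster than real time has a real-time factor of about 0.1; for example, `real_time_factor(360, 3600)` describes one hour of audio processed in six minutes.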

Ways of Forgetting and Remembering the Eloquence of the 19th Century: Editors of Romanian Political Speeches

Author: Patraș Roxana
Publication date: 01/01/2016

The paper presents a critical evaluation of the existing anthologies of Romanian oratory and analyzes the pertinence of a new research line: how to trace back the foundations of Romanian versatile political memory, both from a lexical and from an ideological point of view. As I argue in the first part of the paper, collecting and editing the great speeches of Romanian orators seems crucial for today’s understanding of politics (politicians’ speaking/actions as well as voters’ behavior/electoral habits). In the second part, I focus on the particularities generated by a dramatic change of media support (in the context of Romania’s high rates of illiteracy at the end of the 19th century): from “writing” information on the slippery surface of memory (declaimed political texts such as “proclamations,” “petitions,” and “appeals”) to “writing” as such (transcribed political speeches). The last part of the paper problematizes the making of a new canon of Romanian eloquence as well as the opportunity of a new assemblage of oratorical texts, illustrative of 19th-century politics, and endeavors to settle a series of virtual editing principles.

PhilPapers

QCompere @ REPERE 2013

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Ekenel Hazim Kemal
Fortier Guillaume
Hua Gao
Le Viet-Bac
Mignon Alexis
Poignant Johann
Quénot Georges
Rosset Sophie
Roy Anindya
Sarkar Achintya
Stiefelhagen Rainer
Tapaswi Makarand
Verbeek Jakob
Yang Qian
Publication venue: HAL CCSD
Publication date: 22/08/2013

HAL - Normandie Université

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1

UPC multimodal speaker diarization system for the 2018 Albayzin challenge

Author: Hernando Pericás Francisco Javier
India Massana Miquel Àngel
Morros Rubió Josep Ramon
Palau Puigdevall Ponç
Sagastiberri Itziar
Sayrol Clols Elisa
Publication venue: International Speech Communication Association
Publication date: 01/01/2018

This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals individually. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet loss DNN that uses i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities from the face modality, a face detector followed by a face tracker is run on the videos. For each cropped face, a feature vector is obtained using a Deep Neural Network based on the ResNet 34 architecture, trained using a metric learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features obtained for each of the frames of that track. Then, this feature vector is compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database. Peer Reviewed. Postprint (published version).
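The track-level matching described in this abstract (average the per-frame features of a track, then compare the result with enrollment features) can be sketched as follows. This is a minimal illustration, not the UPC implementation: the abstract names cosine distance only for the speech embeddings, so using it for faces as well is an assumption, and the function names, the `threshold` parameter and its default value are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average equal-length feature vectors (one per frame of a track)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identify_track(frame_features, enrollment, threshold=0.5):
    """Return (identity or None, distance) for one track.

    enrollment maps an identity name to its reference feature vector.
    threshold is a hypothetical rejection threshold: tracks whose best
    distance exceeds it are left unlabeled.
    """
    track_vector = mean_vector(frame_features)
    best_name, best_distance = None, float("inf")
    for name, reference in enrollment.items():
        distance = cosine_distance(track_vector, reference)
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance > threshold:
        return None, best_distance
    return best_name, best_distance
```

Averaging before comparison keeps the cost at one distance computation per track and enrollment identity, rather than one per frame.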

UPCommons. Portal del coneixement obert de la UPC

UPC multimodal speaker diarization system for the 2018 Albayzin challenge

Author: India Massana Miquel Àngel
Sagastiberri Itziar
Palau Puigdevall Ponç
Sayrol Clols Elisa
Morros Rubió Josep Ramon
Hernando Pericás Francisco Javier
Publication venue: International Speech Communication Association (ISCA)
Publication date: 01/01/2018

UPCommons. Portal del coneixement obert de la UPC

University of Maine

Speaker diarisation and longitudinal linking in multi-genre broadcast data

Author: Gales MJF
Karanasou P
Lanchantin P
Liu X
Qian Y
Wang L
Woodland PC
Zhang C
Publication venue: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings
Publication date: 01/01/2015

This paper presents a multi-stage speaker diarisation system with longitudinal linking developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non-speech segmenter. A newly developed linking stage is then added to the basic diarisation output, aiming to identify speakers across multiple episodes of the same series. The longitudinal constraint imposes incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question and from those broadcast earlier in time. The nature of the data, as well as the longitudinal linking constraint, positions this diarisation task as a new open research topic, and a particularly challenging one. Different linking clustering metrics are compared and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set. This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485

Apollo (Cambridge)

Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast

Author: Barras Claude
Besacier Laurent
Bredin Hervé
Le Viet-Bac
Poignant Johann
Quénot Georges
Publication venue: HAL CCSD
Publication date: 09/09/2012

Poster Session: Speaker Recognition III. International audience. We propose an approach for unsupervised speaker identification in TV broadcast videos, combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for propagating the overlaid names to the speech turns are compared; they take into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR, and use a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models, trained on matching development data and additional TV and radio data, only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchors.
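The idea of propagating overlaid names to speaker clusters by co-occurrence duration can be illustrated with a small sketch. It is not one of the three methods compared in the paper: the abstract does not give the task-adapted TF-IDF formula, so the weighting below (share of a cluster's co-occurrence time, multiplied by a log term that penalizes names co-occurring with many clusters) is an assumption, as are the function names and the input tuple layout.

```python
import math
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Duration in seconds shared by two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def propagate_names(speech_turns, ocr_names):
    """Assign one name (or None) to each speaker cluster.

    speech_turns: list of (start, end, cluster_id) from diarization.
    ocr_names:    list of (start, end, name) from video OCR.
    """
    # Accumulate co-occurrence duration per (cluster, name) pair.
    cooccurrence = defaultdict(lambda: defaultdict(float))
    for t_start, t_end, cluster in speech_turns:
        for n_start, n_end, name in ocr_names:
            duration = overlap(t_start, t_end, n_start, n_end)
            if duration > 0:
                cooccurrence[cluster][name] += duration

    clusters = {cluster for _, _, cluster in speech_turns}

    # Count how many clusters each name co-occurs with (IDF-like term).
    cluster_count = defaultdict(int)
    for names in cooccurrence.values():
        for name in names:
            cluster_count[name] += 1

    labels = {}
    for cluster in clusters:
        names = cooccurrence.get(cluster, {})
        total = sum(names.values())
        best_name, best_score = None, 0.0
        for name, duration in names.items():
            tf = duration / total
            idf = math.log(1 + len(clusters) / cluster_count[name])
            if tf * idf > best_score:
                best_name, best_score = name, tf * idf
        labels[cluster] = best_name
    return labels
```

Clusters that never co-occur with any overlaid name stay unlabeled (`None`), which matches the unsupervised setting: no speaker model is available to fall back on.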

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server