115 research outputs found

    Towards a better integration of written names for unsupervised speakers identification in videos

    No full text
    Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call it "late naming". Hence, written names extracted from title blocks tend to lead to high precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged together. While "integrated naming" yields similar identification performance to "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.

    LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

    Full text link
    Neural network approaches have brought considerable improvements to the submodules of speaker diarization systems, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms like probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method to measure the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database. Comment: Accepted for INTERSPEECH 201

    The CAMOMILE collaborative annotation platform for multi-modal, multi-lingual and multi-media documents

    Get PDF
    In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated to a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed in open source. Peer Reviewed. Postprint (author's final draft)

    Analyse acoustique de la voix émotionnelle de locuteurs lors d'une interaction humain-robot

    Get PDF
    This thesis deals with emotional voices during human-robot interaction. In a natural interaction, we define at least four kinds of variability: the environment (room, microphone); the speaker, their physical characteristics (gender, age, voice type) and personality; their emotional states; and finally the kind of interaction (game scenario, emergency, everyday life). From audio signals collected in different conditions, we sought, by means of acoustic features, to combine the characterisation of a speaker and of their emotional state while taking these variabilities into account. Determining which features are essential and which should be avoided is a difficult challenge, because it requires working with a large number of variabilities and therefore having rich and diverse corpora at one's disposal. The main results concern both the collection and annotation of realistic emotional corpora recorded with varied speakers (children, adults, elderly people) in several environments, and the robustness of acoustic features across the four kinds of variability. Two interesting results follow from this acoustic analysis: the audio characterisation of a corpus and the establishment of a "black list" of highly variable features. Emotions are only one part of the paralinguistic cues carried by the audio signal; personality and stress in the voice were also studied. We also built an automatic emotion recognition and speaker characterisation module, which was tested during realistic human-robot interactions. An ethical reflection on this work was also carried out.

    Corpus selection

    Get PDF
    Deliverable of the project Collaborative Annotation of multi-MOdal, MultI-Lingual and multi-mEdia documents. This document describes the different corpora that will be used during the Camomile project. Peer Reviewed. Preprint

    Transcriber: Development and use of a tool for assisting speech corpora production

    Get PDF
    We present "Transcriber", a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conversational speech. With the recently formalized annotation graphs framework, adapting the tool to new tasks and supporting different data formats will become easier. © 2001 Elsevier Science B.V. All rights reserved.

    Collaborative Annotation for Person Identification in TV Shows

    No full text
    This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the code for annotation is made available on GitHub. The tool can also be used in a crowd-sourcing environment.

    Towards a better integration of written names for unsupervised speakers identification in videos

    Get PDF
    Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call it "late naming". Hence, written names extracted from title blocks tend to lead to high precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged together. While "integrated naming" yields similar identification performance to "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and stability of the clustering stopping criterion.

    Unsupervised Speaker Identification using Overlaid Texts in TV Broadcast

    Get PDF
    Poster Session: Speaker Recognition III. We propose an approach for unsupervised speaker identification in TV broadcast videos, by combining acoustic speaker diarization with person names obtained via video OCR from overlaid texts. Three methods for the propagation of the overlaid names to the speech turns are compared, taking into account the co-occurrence duration between the speaker clusters and the names provided by the video OCR and using a task-adapted variant of the TF-IDF information retrieval coefficient. These methods were tested on the REPERE dry-run evaluation corpus, containing 3 hours of annotated videos. Our best unsupervised system reaches an F-measure of 70.2% when considering all the speakers, and 81.7% if anchor speakers are left out. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only provided a 57.5% F-measure when considering all the speakers and 45.7% without anchors.

    Pengaruh Kesadaran Politik terhadap Pasrtisipasi Politik Masyarakat Kecamatan Tenayan Raya dalam Pemilihan Umum Walikota Pekanbaru Tahun 2017

    Get PDF
    One manifestation of political consciousness is political participation in the election of the mayor. Political participation based on political consciousness encourages people to exercise their right to vote rationally and in accordance with their aspirations. In reality, however, voter participation in the 2017 Pekanbaru mayoral election was still lacking. Voter participation in Tenayan Raya sub-district reached only 43.72%, which means that 56.28% of voters did not participate. This study aims to explain the effect of political awareness on political participation. It uses a quantitative descriptive approach, distributing questionnaires to 100 respondents in four urban villages in Tenayan Raya sub-district, and the data are then analyzed using logistic regression. Based on the results, it can be concluded that political awareness has a significant influence on political participation, because the computed chi-square value (13.105) exceeds the chi-square table value (3.841). The pseudo R-square indicates that political awareness contributes 16.4% to political participation; in other words, the remaining 83.6% is explained by factors outside the model. Based on the data analysis, political awareness in Tenayan Raya sub-district is categorized as low (68%), and the political participation of the Tenayan Raya community in the Pekanbaru mayoral election is also categorized as low (94%).