72 research outputs found
EUMSSI team at the MediaEval Person Discovery Challenge 2016
We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task. The goal is to identify all people who simultaneously appear and speak in a video corpus. In the proposed system, besides improving each modality, we emphasize the ranking of multiple results from both the audio and the visual stream.
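As a rough illustration of the kind of cross-modal ranking the abstract mentions, the sketch below fuses per-person confidence scores from the audio and visual streams by weighted sum and ranks the result. The function, the weighting scheme, and the score format are assumptions for illustration, not the EUMSSI system itself:

```python
def fuse_scores(audio, visual, w=0.5):
    """Hypothetical late fusion: combine per-person confidence scores
    from the audio and visual streams by weighted sum, then return the
    people ranked by fused score (highest first)."""
    people = set(audio) | set(visual)
    fused = {p: w * audio.get(p, 0.0) + (1 - w) * visual.get(p, 0.0)
             for p in people}
    return sorted(fused, key=fused.get, reverse=True)
```

A person strongly supported by one stream can still outrank one weakly supported by both, which is why the weighting between modalities matters.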
CRF-Based Context Modeling for Person Identification in Broadcast Videos
No abstract available.
Joint speech and overlap detection: a benchmark over multiple audio setups and speech domains
Voice activity detection (VAD) and overlapped speech detection (OSD) are key pre-processing tasks for speaker diarization, and the final segmentation performance depends heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, leaving open how well the systems generalize. This paper proposes a complete new benchmark of VAD and OSD models across multiple audio setups (single- and multi-channel) and speech domains (e.g. media, meetings). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that joint training of the two tasks matches the F1-scores of two dedicated VAD and OSD systems while reducing the training cost, and that this single architecture can be used for both single- and multi-channel speech processing.
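A minimal sketch of how a 3-class model's frame posteriors can serve both tasks at once: each frame is assigned to non-speech, single-speaker speech, or overlap, and the VAD and OSD decisions are read off the same labels. The posterior values below are invented for illustration; they stand in for what a trained model (e.g. a TCN over acoustic features) would output:

```python
import numpy as np

# Hypothetical frame posteriors over the 3 classes
# (0 = non-speech, 1 = single-speaker speech, 2 = overlapped speech).
posteriors = np.array([
    [0.90, 0.08, 0.02],   # silence
    [0.20, 0.70, 0.10],   # one speaker
    [0.10, 0.30, 0.60],   # overlap
    [0.05, 0.90, 0.05],   # one speaker
    [0.80, 0.15, 0.05],   # silence
])

classes = posteriors.argmax(axis=1)

# VAD fires on any speech (single or overlapped); OSD on overlap only.
vad = classes >= 1
osd = classes == 2
```

This is why a single multi-class network can replace two dedicated binary systems: both segmentations are deterministic functions of one frame-level decision.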
Speaker recognition and speaker segmentation: application over the Internet
No abstract available.
LIUM SpkDiarization: an open source toolkit for diarization
This paper presents an open-source diarization toolkit, mostly dedicated to speaker diarization and developed by LIUM. The toolkit includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR. Two applications of the toolkit are presented: one for broadcast news using the ESTER 2 data, and one for telephone conversations using the MEDIA corpus. Index Terms — Speaker, Diarization, Toolkit
Improving recognition of proper nouns in ASR through generating and filtering phonetic transcriptions
Accurate phonetic transcription of proper nouns can be an important resource for commercial applications that embed speech technologies, such as audio indexing and vocal phone directory lookup. However, an accurate phonetic transcription is more difficult to obtain for proper nouns than for regular words: the phonetic transcription of a proper noun depends on both the origin of the speaker pronouncing it and the origin of the proper noun itself. This work proposes a method for extracting phonetic transcriptions of proper nouns from actual utterances of those nouns, thus yielding transcriptions based on practical use rather than mere pronunciation rules. The method first extracts phonetic transcriptions and then iteratively filters them. To initialize the process, an alignment dictionary is used to detect word boundaries. A rule-based grapheme-to-phoneme generator (LIA PHON), a knowledge-based approach (JSM), and a Statistical Machine Translation based system were evaluated for this alignment. As a result, compared to our reference dictionary (BDLEX supplemented by LIA PHON for missing words) on the ESTER 1 French broadcast news corpus, we were able to significantly decrease the Word Error Rate (WER) on segments of speech containing proper nouns, without negatively affecting the WER on the rest of the corpus.
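One pass of the iterative filtering step might be sketched, in much-simplified form, as keeping only the phonetic variants that account for a minimum share of a proper noun's observed utterances. The function name, the threshold, and the phone strings are hypothetical; the paper's actual filter operates inside a larger extraction loop:

```python
from collections import Counter

def filter_transcriptions(candidates, min_ratio=0.2):
    """Hypothetical filtering pass: keep the phonetic variants of a
    proper noun that cover at least min_ratio of its observed
    utterances, discarding rare (likely erroneous) extractions."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {variant for variant, c in counts.items()
            if c / total >= min_ratio}
```

Repeating such a pass after re-extraction lets frequent, utterance-grounded pronunciations survive while one-off recognition errors are pruned.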
Automatic hypothesis reordering for computer-assisted speech transcription
Large-vocabulary automatic speech recognition (ASR) technologies perform well in known, controlled contexts, but some mistakes still have to be corrected: human intervention is needed to check and correct the results of such systems in order to make the ASR output understandable. We propose a method for computer-assisted transcription of speech based on automatically reordered confusion networks, which significantly reduces the number of actions needed to correct the ASR output. WER computed before and after each network reordering shows an absolute gain of about 3.4%.
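A much-simplified sketch of why reordering confusion-network slots reduces correction effort: if the correct word sits nearer the top of each slot's candidate list, a human corrector scans past fewer alternatives. Sorting by posterior below is an illustrative stand-in for the paper's reordering criterion, and the network and reference are invented:

```python
def reorder_slots(confusion_network):
    """Sort each slot's (word, posterior) hypotheses by descending
    posterior, so likelier candidates are shown first."""
    return [sorted(slot, key=lambda h: h[1], reverse=True)
            for slot in confusion_network]

def correction_actions(network, reference):
    """Positions the corrector must scan past, summed over slots
    (0 when the top hypothesis is already the correct word).
    Assumes the correct word appears somewhere in each slot."""
    return sum(next(i for i, (w, _) in enumerate(slot) if w == ref)
               for slot, ref in zip(network, reference))
```

Measuring `correction_actions` before and after reordering is one simple way to quantify the saved effort that the abstract reports as a WER gain.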