72 research outputs found
EUMSSI team at the MediaEval Person Discovery Challenge 2016
We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task. The goal is to identify all people who simultaneously appear and speak in a video corpus. In the proposed system, besides improving each modality, we emphasize the ranking of multiple results from both the audio and the visual stream.
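As a rough illustration of the kind of cross-modal ranking the abstract mentions, the sketch below fuses per-person confidence scores from the audio and visual streams by weighted sum and ranks the result. The function, the weighting scheme, and the score format are assumptions for illustration, not the EUMSSI system itself:

```python
def fuse_scores(audio, visual, w=0.5):
    """Hypothetical late fusion: combine per-person confidence scores
    from the audio and visual streams by weighted sum, then return the
    people ranked by fused score (highest first)."""
    people = set(audio) | set(visual)
    fused = {p: w * audio.get(p, 0.0) + (1 - w) * visual.get(p, 0.0)
             for p in people}
    return sorted(fused, key=fused.get, reverse=True)
```

A person strongly supported by one stream can still outrank one weakly supported by both, which is why the weighting between modalities matters.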
CRF-Based Context Modeling for Person Identification in Broadcast Videos
No abstract available.
Joint speech and overlap detection: a benchmark over multiple audio setups and speech domains
Voice activity detection (VAD) and overlapped speech detection (OSD) are key pre-processing tasks for speaker diarization, and the final segmentation performance depends heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain, leaving open how well the systems generalize. This paper proposes a complete new benchmark of VAD and OSD models across multiple audio setups (single- and multi-channel) and speech domains (e.g. media, meetings). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that joint training of the two tasks matches the F1-scores of two dedicated VAD and OSD systems while reducing the training cost, and that this single architecture can be used for both single- and multi-channel speech processing.
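A minimal sketch of how a 3-class model's frame posteriors can serve both tasks at once: each frame is assigned to non-speech, single-speaker speech, or overlap, and the VAD and OSD decisions are read off the same labels. The posterior values below are invented for illustration; they stand in for what a trained model (e.g. a TCN over acoustic features) would output:

```python
import numpy as np

# Hypothetical frame posteriors over the 3 classes
# (0 = non-speech, 1 = single-speaker speech, 2 = overlapped speech).
posteriors = np.array([
    [0.90, 0.08, 0.02],   # silence
    [0.20, 0.70, 0.10],   # one speaker
    [0.10, 0.30, 0.60],   # overlap
    [0.05, 0.90, 0.05],   # one speaker
    [0.80, 0.15, 0.05],   # silence
])

classes = posteriors.argmax(axis=1)

# VAD fires on any speech (single or overlapped); OSD on overlap only.
vad = classes >= 1
osd = classes == 2
```

This is why a single multi-class network can replace two dedicated binary systems: both segmentations are deterministic functions of one frame-level decision.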
Speaker recognition and speaker segmentation: application over the Internet
No abstract available.
LIUM SpkDiarization: an open source toolkit for diarization
This paper presents an open-source diarization toolkit, mostly dedicated to speaker diarization and developed by LIUM. The toolkit includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR. Two applications of the toolkit are presented: one for broadcast news using the ESTER 2 data, and one for telephone conversations using the MEDIA corpus. Index Terms — Speaker, Diarization, Toolkit
Improving recognition of proper nouns in ASR through generating and filtering phonetic transcriptions
Accurate phonetic transcription of proper nouns can be an important resource for commercial applications that embed speech technologies, such as audio indexing and vocal phone directory lookup. However, an accurate phonetic transcription is more difficult to obtain for proper nouns than for regular words: the phonetic transcription of a proper noun depends on both the origin of the speaker pronouncing it and the origin of the proper noun itself. This work proposes a method for extracting phonetic transcriptions of proper nouns from actual utterances of those nouns, thus yielding transcriptions based on practical use rather than mere pronunciation rules. The method first extracts phonetic transcriptions and then iteratively filters them. To initialize the process, an alignment dictionary is used to detect word boundaries. A rule-based grapheme-to-phoneme generator (LIA PHON), a knowledge-based approach (JSM), and a Statistical Machine Translation based system were evaluated for this alignment. As a result, compared to our reference dictionary (BDLEX supplemented by LIA PHON for missing words) on the ESTER 1 French broadcast news corpus, we were able to significantly decrease the Word Error Rate (WER) on segments of speech containing proper nouns, without negatively affecting the WER on the rest of the corpus.
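One pass of the iterative filtering step might be sketched, in much-simplified form, as keeping only the phonetic variants that account for a minimum share of a proper noun's observed utterances. The function name, the threshold, and the phone strings are hypothetical; the paper's actual filter operates inside a larger extraction loop:

```python
from collections import Counter

def filter_transcriptions(candidates, min_ratio=0.2):
    """Hypothetical filtering pass: keep the phonetic variants of a
    proper noun that cover at least min_ratio of its observed
    utterances, discarding rare (likely erroneous) extractions."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {variant for variant, c in counts.items()
            if c / total >= min_ratio}
```

Repeating such a pass after re-extraction lets frequent, utterance-grounded pronunciations survive while one-off recognition errors are pruned.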
Automatic hypothesis reordering for computer-assisted speech transcription
Large-vocabulary automatic speech recognition (ASR) technologies perform well in known, controlled contexts, but some mistakes still have to be corrected: human intervention is needed to check and correct the results of such systems in order to make the ASR output understandable. We propose a method for computer-assisted transcription of speech based on automatically reordered confusion networks, which significantly reduces the number of actions needed to correct the ASR output. WER computed before and after each network reordering shows an absolute gain of about 3.4%.
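A much-simplified sketch of why reordering confusion-network slots reduces correction effort: if the correct word sits nearer the top of each slot's candidate list, a human corrector scans past fewer alternatives. Sorting by posterior below is an illustrative stand-in for the paper's reordering criterion, and the network and reference are invented:

```python
def reorder_slots(confusion_network):
    """Sort each slot's (word, posterior) hypotheses by descending
    posterior, so likelier candidates are shown first."""
    return [sorted(slot, key=lambda h: h[1], reverse=True)
            for slot in confusion_network]

def correction_actions(network, reference):
    """Positions the corrector must scan past, summed over slots
    (0 when the top hypothesis is already the correct word).
    Assumes the correct word appears somewhere in each slot."""
    return sum(next(i for i, (w, _) in enumerate(slot) if w == ref)
               for slot, ref in zip(network, reference))
```

Measuring `correction_actions` before and after reordering is one simple way to quantify the saved effort that the abstract reports as a WER gain.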