
    EUMSSI team at the MediaEval Person Discovery Challenge 2016

    We present the results of the EUMSSI team's participation in the Multimodal Person Discovery task. The goal is to identify all people who simultaneously appear and speak in a video corpus. In the proposed system, besides improving each modality, we emphasize the ranking of multiple results from both the audio stream and the visual stream.
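    As an illustration of the last point, here is a minimal sketch of how per-person scores from the two streams could be fused into a single ranking. The linear weighting, the score values, and the fuse_rankings helper are assumptions for illustration, not the EUMSSI system's actual fusion method.

```python
# Minimal sketch of late fusion for multimodal person ranking.
# The linear weighting and the toy scores are illustrative assumptions,
# not the actual EUMSSI fusion method.

def fuse_rankings(audio_scores, visual_scores, alpha=0.5):
    """Combine per-person confidence scores from two modalities.

    audio_scores, visual_scores: dicts mapping person id -> score in [0, 1].
    alpha: weight given to the audio stream (assumed hyperparameter).
    """
    people = set(audio_scores) | set(visual_scores)
    fused = {
        p: alpha * audio_scores.get(p, 0.0) + (1 - alpha) * visual_scores.get(p, 0.0)
        for p in people
    }
    # Rank people by fused score, best first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranking = fuse_rankings({"alice": 0.9, "bob": 0.4}, {"alice": 0.7, "carol": 0.6})
print(ranking)  # [('alice', 0.8), ('carol', 0.3), ('bob', 0.2)]
```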

    Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

    Voice activity detection and overlapped speech detection (VAD and OSD, respectively) are key pre-processing tasks for speaker diarization, and the final segmentation performance relies heavily on the robustness of these sub-tasks. Recent studies have shown that VAD and OSD can be trained jointly using a multi-class classification model. However, these works are often restricted to a specific speech domain and say little about the generalization capacities of the systems. This paper proposes a new, comprehensive benchmark of VAD and OSD models over multiple audio setups (single- and multi-channel) and speech domains (e.g. media, meetings). Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results. We show that joint training of the two tasks matches the F1-scores of two dedicated VAD and OSD systems while reducing the training cost, and that this single architecture can be used for both single- and multi-channel speech processing.
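    A minimal sketch of how a single 3-class frame classifier can serve both tasks at once, assuming the class convention 0 = non-speech, 1 = one speaker, 2 = overlapping speech (the TCN front end and feature extraction are abstracted away here):

```python
import numpy as np

# Sketch of deriving both VAD and OSD decisions from 3-class posteriors.
# Class indices (0 = non-speech, 1 = one speaker, 2 = overlap) are an
# assumed convention; the paper's TCN classifier is abstracted away.

def vad_osd_from_posteriors(posteriors):
    """posteriors: (num_frames, 3) array of class probabilities per frame."""
    labels = posteriors.argmax(axis=1)
    vad = labels >= 1          # speech if at least one speaker is active
    osd = labels == 2          # overlap if more than one speaker is active
    return vad, osd

# Toy posteriors for 4 frames: silence, speech, overlap, speech.
post = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.3, 0.6],
                 [0.2, 0.6, 0.2]])
vad, osd = vad_osd_from_posteriors(post)
print(vad)  # [False  True  True  True]
print(osd)  # [False False  True False]
```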

    Speaker recognition and speaker segmentation: application over the Internet

    No abstract available.

    LIUM SpkDiarization: an open source toolkit for diarization

    This paper presents an open-source diarization toolkit, mostly dedicated to speaker diarization and developed by LIUM. The toolkit includes hierarchical agglomerative clustering methods using well-known measures such as BIC and CLR. Two applications of the toolkit are presented: one on broadcast news using the ESTER 2 data, the other on telephone conversations using the MEDIA corpus.
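    A minimal sketch of the delta-BIC merging criterion commonly used in such BIC-based agglomerative clustering (following the standard Chen and Gopalakrishnan formulation); the sign convention and penalty weight are assumptions, and the toolkit's exact implementation may differ:

```python
import numpy as np

# Sketch of the delta-BIC criterion for deciding whether two clusters of
# acoustic frames belong to the same speaker. A negative value favors
# merging. Lambda and the sign convention are assumptions.

def delta_bic(x, y, lam=1.0):
    """x, y: (n_frames, dim) feature matrices for two clusters."""
    n1, n2, d = len(x), len(y), x.shape[1]
    n = n1 + n2
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # Likelihood loss of modeling both clusters with one Gaussian ...
    gain = 0.5 * (n * logdet(np.vstack([x, y])) - n1 * logdet(x) - n2 * logdet(y))
    # ... minus the BIC complexity penalty for the extra model parameters.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n) * lam
    return gain - penalty

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (200, 12))   # same-speaker-like segments
b = rng.normal(0.0, 1.0, (200, 12))
c = rng.normal(5.0, 1.0, (200, 12))   # different-speaker-like segment
print(delta_bic(a, b) < delta_bic(a, c))  # True: a and b look more mergeable
```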

    Improving recognition of proper nouns in ASR through generating and filtering phonetic transcriptions

    Accurate phonetic transcription of proper nouns can be an important resource for commercial applications that embed speech technologies, such as audio indexing and vocal phone-directory lookup. However, an accurate phonetic transcription is more difficult to obtain for proper nouns than for regular words: the phonetic transcription of a proper noun depends both on the origin of the speaker pronouncing it and on the origin of the noun itself. This work proposes a method for extracting phonetic transcriptions of proper nouns from actual utterances of those nouns, yielding transcriptions based on practical use rather than on pronunciation rules alone. The method first extracts candidate phonetic transcriptions and then iteratively filters them. To initialize the process, an alignment dictionary is used to detect word boundaries; a rule-based grapheme-to-phoneme generator (LIA PHON), a knowledge-based approach (JSM), and a statistical machine translation based system were evaluated for this alignment. Compared to our reference dictionary (BDLEX supplemented by LIA PHON for missing words) on the ESTER 1 French broadcast-news corpus, we significantly decreased the Word Error Rate (WER) on speech segments containing proper nouns without negatively affecting the WER on the rest of the corpus.
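    A hypothetical sketch of the extract-then-filter loop described above: candidate pronunciations harvested from utterances are iteratively pruned by relative frequency. The threshold, the number of rounds, and the filter_pronunciations helper are illustrative assumptions, not the paper's actual criteria:

```python
from collections import Counter

# Hypothetical sketch of iteratively filtering candidate pronunciations
# of one proper noun by relative frequency. Threshold and round count
# are illustrative assumptions, not the paper's actual criteria.

def filter_pronunciations(candidates, min_share=0.2, rounds=3):
    """candidates: list of phone strings observed for one proper noun.
    Keeps variants whose share of the remaining observations stays above
    min_share, re-normalizing after each pruning round."""
    pool = list(candidates)
    for _ in range(rounds):
        counts = Counter(pool)
        total = sum(counts.values())
        kept = {p for p, c in counts.items() if c / total >= min_share}
        pool = [p for p in pool if p in kept]
        if len(kept) == len(counts):  # nothing pruned: converged
            break
    return sorted(set(pool))

obs = ["p a R i"] * 6 + ["b a R i"]   # toy observed variants of "Paris"
print(filter_pronunciations(obs))     # ['p a R i']
```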

    Automatic hypothesis reordering for computer-assisted speech transcription (Réordonnancement automatique d'hypothèses pour l'assistance à la transcription de la parole)

    No full text
    Large-vocabulary automatic speech recognition (ASR) technologies perform well in known, controlled contexts, but some mistakes still have to be corrected: human intervention is needed to check and correct the output of such systems to make it understandable. We propose a method for computer-assisted transcription of speech based on automatic reordering of confusion networks. It significantly reduces the number of actions needed to correct ASR output: the WER computed before and after each network reordering shows an absolute gain of about 3.4%.
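    A minimal sketch of reordering a confusion network so that, in each slot, the corrector sees the alternatives sorted best-first; the data layout (a list of slots of (word, posterior) pairs) is an assumed representation, not the paper's exact one:

```python
# Minimal sketch of reordering a confusion network so that each slot
# presents its alternatives sorted by posterior, best first. The data
# layout is an assumed representation, not the paper's exact one.

def reorder_confusion_network(cn):
    """cn: list of slots; each slot is a list of (word, posterior) pairs.
    Returns the network with every slot sorted best-first."""
    return [sorted(slot, key=lambda wp: wp[1], reverse=True) for slot in cn]

cn = [
    [("the", 0.6), ("a", 0.4)],
    [("cat", 0.3), ("hat", 0.5), ("bat", 0.2)],
]
for slot in reorder_confusion_network(cn):
    # Top candidate shown first; remaining alternatives one click away.
    print(slot[0][0], [w for w, _ in slot[1:]])
# the ['a']
# hat ['cat', 'bat']
```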