
    Improving lightly supervised training for broadcast transcription

    This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses decoding hypotheses derived automatically with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be poor. To address this issue, word- and segment-level combination approaches are applied between the lightly supervised transcripts and the original programme scripts, yielding improved transcriptions. Experimental results show that systems trained on these improved transcriptions consistently outperform those trained on the original lightly supervised decoding hypotheses alone. This holds for both maximum likelihood and minimum phone error trained systems. The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This is the accepted manuscript version; the final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_2187.html
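    The word-level combination described above can be sketched as aligning the biased-LM decoding hypothesis against the programme script and merging the two word streams. The selection rule below (prefer the script on substitutions, keep decoder-only words, drop script-only words) is an illustrative assumption, not the paper's exact method:

```python
from difflib import SequenceMatcher

def combine_word_level(hyp_words, script_words):
    """Merge an ASR hypothesis with a programme script (illustrative sketch)."""
    merged = []
    sm = SequenceMatcher(None, hyp_words, script_words)
    for op, h1, h2, s1, s2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(hyp_words[h1:h2])       # both sources agree
        elif op == "replace":
            merged.extend(script_words[s1:s2])    # prefer the scripted wording
        elif op == "delete":
            merged.extend(hyp_words[h1:h2])       # speech not in the script
        # "insert": script-only words the speaker skipped -> drop them
    return merged
```

    With a hypothesis "the quick brown focks jumped" and a script "the quick brown fox jumped over", this keeps the agreed words, repairs the substitution from the script, and drops the unspoken script word.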

    Distant Speech Recognition for Home Automation: Preliminary Experimental Results in a Smart Home

    This paper presents a study that is part of the Sweet-Home project, which aims at developing a new home automation system based on voice commands. The study focused on two tasks: distant speech recognition and sentence spotting (i.e., recognition of home automation orders). For the first task, different combinations of ASR systems, language models and acoustic models were tested. Fusion of ASR outputs by consensus and with a triggered language model (using a priori knowledge) was investigated. For the sentence spotting task, an algorithm based on distance evaluation between the current ASR hypotheses and the predefined set of keyword patterns was introduced, in order to retrieve the correct sentences in spite of ASR errors. The techniques were assessed on real daily-living data collected in a 4-room smart home fully equipped with standard tactile commands and with 7 wireless microphones set in the ceiling. Thanks to Driven Decoding Algorithm techniques, a classical ASR system reached 7.9% WER, against 35% WER in the standard configuration and 15% with MLLR adaptation only. The best keyword pattern classification result obtained in distant speech conditions was 7.5% CER.
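    The sentence-spotting step described above can be sketched as picking, among the predefined keyword patterns, the one with the smallest word-level edit distance to the ASR hypothesis, rejecting it if the distance exceeds a threshold. The function names and the `max_dist` threshold are illustrative assumptions:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def spot_sentence(hypothesis, patterns, max_dist=1):
    """Return the pattern closest to the ASR hypothesis, or None if too far."""
    words = hypothesis.split()
    best = min(patterns, key=lambda p: edit_distance(words, p.split()))
    return best if edit_distance(words, best.split()) <= max_dist else None
```

    For example, the erroneous hypothesis "turn on the lights" is still mapped to the pattern "turn on the light", while an unrelated utterance is rejected.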

    Automatic speech recognition driven by a priori transcripts

    Robustness in speech recognition refers to the need to maintain high recognition accuracy even when the quality of the input speech is degraded. In the last decade, several papers proposed using relevant metadata to enhance the recognition process. In many cases, an imperfect a priori transcript can be associated with the speech signal: movie subtitles, scenarios and theatrical plays, summaries of radio broadcasts. This thesis addresses the issue of using such imperfect transcripts to improve the performance of automatic speech recognition (ASR) systems. Unfortunately, these a priori transcripts seldom correspond to the exact word utterances and suffer from a lack of temporal information. In spite of their varying quality, we show how to use them to improve ASR systems. In the first part of the document, we propose to integrate the imperfect transcripts inside the ASR search algorithm. We propose a method that allows us to drive an automatic speech recognition system using prompts or subtitles. This Driven Decoding Algorithm (DDA) relies on on-demand synchronization and on linguistic rescoring of ASR hypotheses. In order to handle transcript excerpts, we suggest a method for extracting the matching segments in large corpora. The second part applies the DDA approach to combining several ASR systems: the search algorithm of a primary ASR system is guided by the one-best hypotheses of auxiliary systems. Our work suggests using auxiliary information directly inside an ASR system. The driven decoding algorithm enhances the baseline system and improves the a priori transcription. Moreover, the new combination schemes based on generalized DDA significantly outperform state-of-the-art combinations.
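    The linguistic-rescoring idea behind driven decoding can be sketched, in a much-simplified form, as boosting the score of each N-best hypothesis in proportion to its word-level similarity with the a priori transcript. The `alpha` weight and the linear interpolation are illustrative assumptions; the thesis integrates this inside the search algorithm itself rather than as N-best post-processing:

```python
from difflib import SequenceMatcher

def rescore(nbest, a_priori_transcript, alpha=2.0):
    """Rescore N-best ASR hypotheses against an imperfect a priori transcript.

    nbest: list of (hypothesis_text, asr_log_score) pairs.
    Returns the hypothesis with the best combined score (sketch only).
    """
    ref = a_priori_transcript.split()
    rescored = []
    for text, asr_score in nbest:
        sim = SequenceMatcher(None, text.split(), ref).ratio()  # in [0, 1]
        rescored.append((text, asr_score + alpha * sim))
    return max(rescored, key=lambda x: x[1])[0]
```

    Here a slightly lower-scoring hypothesis that better matches the transcript can overtake the decoder's first choice.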

    Structuring audio-visual content for automatic summarization

    In recent years, with the advent of sites such as Youtube, Dailymotion or Blip TV, the number of videos available on the Internet has increased considerably. The size of these collections and their lack of structure limit content-based access to the data. Summarization is one way to produce snippets that extract the essential content and present it as concisely as possible. In this work, we focus on extraction methods for video summarization, based on audio analysis.
We treat the various scientific problems related to this objective: content extraction, document structuring, definition and estimation of objective functions, and summary composition algorithms. On each of these aspects, we make concrete proposals that are evaluated. On content extraction, we present a fast spoken-term detection method. The main novelty of this approach is that it relies on the construction of a detector tailored to the search terms. We show that this self-organization strategy improves the robustness of the detector, which significantly exceeds that of the classical approach based on automatic speech recognition. We then present an acoustic filtering method for automatic speech recognition based on Gaussian mixture models and factor analysis as used recently in speaker identification. The originality of our contribution is the use of factor-analysis decompositions for the supervised estimation of filters operating in the cepstral domain. We then discuss the issues of structuring video collections. We show that using different levels of representation and different sources of information makes it possible to characterize the editorial style of a video based principally on audio analysis, whereas most previous work suggested that the bulk of the genre-related information was contained in the image. Another contribution concerns the identification of the type of discourse; we propose low-level models for detecting spontaneous speech that significantly improve on the state of the art for this kind of approach. The third focus of this work concerns the summary itself. In the context of video summarization, we first try to define what a synthetic view is. Is it what characterizes the document as a whole, or what a user would remember (for example an emotional or funny moment)? This question is discussed and we make concrete proposals for the definition of objective functions corresponding to three different criteria: salience, expressiveness and significance. We then propose an algorithm for finding the summary of maximal interest, derived from one introduced in previous work, based on integer linear programming.
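    The budgeted selection behind the maximal-interest summary can be sketched as a 0/1 knapsack: pick segments maximizing total interest under a duration budget. The thesis formulates this as an integer linear program; the dynamic program below is an equivalent stand-in for the two-constraint-free case, and the segment data are hypothetical (durations assumed to be integer seconds):

```python
def select_summary(segments, budget):
    """Pick segments maximizing total interest within a duration budget.

    segments: list of (name, duration, interest) triples.
    Returns (chosen segment names, total interest) via a knapsack DP.
    """
    n = len(segments)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, dur, score) in enumerate(segments, 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                     # skip segment i
            if dur <= b:                                    # or take it
                best[i][b] = max(best[i][b], best[i - 1][b - dur] + score)
    # Backtrack to recover which segments were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, dur, _ = segments[i - 1]
            chosen.append(name)
            b -= dur
    return list(reversed(chosen)), best[n][budget]
```

    With a 30-second budget over four hypothetical segments, the selection trades the single highest-interest segment against combinations that fill the budget better.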