
    Improving lightly supervised training for broadcast transcription

    This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses decoding hypotheses derived automatically with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be poor. To address this issue, word- and segment-level combination approaches are applied between the lightly supervised transcripts and the original programme scripts, yielding improved transcriptions. Experimental results show that systems trained on these improved transcriptions consistently outperform those trained on the original lightly supervised decoding hypotheses alone. This holds for both maximum likelihood and minimum phone error trained systems. The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). This is the accepted manuscript version; the final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_2187.html
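    The word-level combination described above can be sketched as aligning the biased-LM decoding hypothesis against the programme script and merging the two word streams. The selection rule below (prefer the script on substitutions, keep decoder-only words, drop script-only words) is an illustrative assumption, not the paper's exact method:

```python
from difflib import SequenceMatcher

def combine_word_level(hyp_words, script_words):
    """Merge an ASR hypothesis with a programme script (illustrative sketch)."""
    merged = []
    sm = SequenceMatcher(None, hyp_words, script_words)
    for op, h1, h2, s1, s2 in sm.get_opcodes():
        if op == "equal":
            merged.extend(hyp_words[h1:h2])       # both sources agree
        elif op == "replace":
            merged.extend(script_words[s1:s2])    # prefer the scripted wording
        elif op == "delete":
            merged.extend(hyp_words[h1:h2])       # speech not in the script
        # "insert": script-only words the speaker skipped -> drop them
    return merged
```

    With a hypothesis "the quick brown focks jumped" and a script "the quick brown fox jumped over", this keeps the agreed words, repairs the substitution from the script, and drops the unspoken script word.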

    Distant Speech Recognition for Home Automation: Preliminary Experimental Results in a Smart Home

    This paper presents a study that is part of the Sweet-Home project, which aims at developing a new home automation system based on voice commands. The study focused on two tasks: distant speech recognition and sentence spotting (i.e., recognition of home automation orders). For the first task, different combinations of ASR systems, language models and acoustic models were tested. Fusion of ASR outputs by consensus and with a triggered language model (using a priori knowledge) was investigated. For the sentence spotting task, an algorithm based on distance evaluation between the current ASR hypotheses and the predefined set of keyword patterns was introduced, in order to retrieve the correct sentences in spite of ASR errors. The techniques were assessed on real daily-living data collected in a 4-room smart home fully equipped with standard tactile commands and with 7 wireless microphones set in the ceiling. Thanks to Driven Decoding Algorithm techniques, a classical ASR system reached 7.9% WER, against 35% WER in the standard configuration and 15% with MLLR adaptation only. The best keyword pattern classification result obtained in distant speech conditions was 7.5% CER.
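    The sentence-spotting step described above can be sketched as picking, among the predefined keyword patterns, the one with the smallest word-level edit distance to the ASR hypothesis, rejecting it if the distance exceeds a threshold. The function names and the `max_dist` threshold are illustrative assumptions:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def spot_sentence(hypothesis, patterns, max_dist=1):
    """Return the pattern closest to the ASR hypothesis, or None if too far."""
    words = hypothesis.split()
    best = min(patterns, key=lambda p: edit_distance(words, p.split()))
    return best if edit_distance(words, best.split()) <= max_dist else None
```

    For example, the erroneous hypothesis "turn on the lights" is still mapped to the pattern "turn on the light", while an unrelated utterance is rejected.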

    Automatic speech recognition driven by a priori transcripts

    Robustness in speech recognition refers to the need to maintain high recognition accuracy even when the quality of the input speech is degraded. In the last decade, several papers proposed using relevant metadata to enhance the recognition process. In many cases, an imperfect a priori transcript can be associated with the speech signal: movie subtitles, scenarios and theatrical plays, summaries of radio broadcasts. This thesis addresses the issue of using such imperfect transcripts to improve the performance of automatic speech recognition (ASR) systems. Unfortunately, these a priori transcripts seldom correspond to the exact word utterances and suffer from a lack of temporal information. In spite of their varying quality, we show how to use them to improve ASR systems. In the first part of the document, we propose to integrate the imperfect transcripts inside the ASR search algorithm. We propose a method that allows us to drive an automatic speech recognition system using prompts or subtitles. This Driven Decoding Algorithm (DDA) relies on on-demand synchronization and on linguistic rescoring of ASR hypotheses. In order to handle transcript excerpts, we suggest a method for extracting the matching segments in large corpora. The second part applies the DDA approach to combining several ASR systems: the search algorithm of a primary ASR system is guided by the one-best hypotheses of auxiliary systems. Our work suggests using auxiliary information directly inside an ASR system. The driven decoding algorithm enhances the baseline system and improves the a priori transcription. Moreover, the new combination schemes based on generalized DDA significantly outperform state-of-the-art combinations.
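    The linguistic-rescoring idea behind driven decoding can be sketched, in a much-simplified form, as boosting the score of each N-best hypothesis in proportion to its word-level similarity with the a priori transcript. The `alpha` weight and the linear interpolation are illustrative assumptions; the thesis integrates this inside the search algorithm itself rather than as N-best post-processing:

```python
from difflib import SequenceMatcher

def rescore(nbest, a_priori_transcript, alpha=2.0):
    """Rescore N-best ASR hypotheses against an imperfect a priori transcript.

    nbest: list of (hypothesis_text, asr_log_score) pairs.
    Returns the hypothesis with the best combined score (sketch only).
    """
    ref = a_priori_transcript.split()
    rescored = []
    for text, asr_score in nbest:
        sim = SequenceMatcher(None, text.split(), ref).ratio()  # in [0, 1]
        rescored.append((text, asr_score + alpha * sim))
    return max(rescored, key=lambda x: x[1])[0]
```

    Here a slightly lower-scoring hypothesis that better matches the transcript can overtake the decoder's first choice.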

    Structuring audio-visual content for automatic summarization

    In recent years, with the advent of sites such as Youtube, Dailymotion or Blip TV, the number of videos available on the Internet has increased considerably. The size of these collections and their lack of structure limit content-based access to the data. Summarization is one way to produce snippets that extract the essential content and present it as concisely as possible. In this work, we focus on extraction methods for video summarization, based on audio analysis.
We treat the various scientific problems related to this objective: content extraction, document structuring, definition and estimation of objective functions, and summary composition algorithms. On each of these aspects, we make concrete proposals that are evaluated. On content extraction, we present a fast spoken-term detection method. The main novelty of this approach is that it relies on the construction of a detector tailored to the search terms. We show that this self-organization strategy improves the robustness of the detector, which significantly exceeds that of the classical approach based on automatic speech recognition. We then present an acoustic filtering method for automatic speech recognition based on Gaussian mixture models and factor analysis as used recently in speaker identification. The originality of our contribution is the use of factor-analysis decompositions for the supervised estimation of filters operating in the cepstral domain. We then discuss the issues of structuring video collections. We show that using different levels of representation and different sources of information makes it possible to characterize the editorial style of a video based principally on audio analysis, whereas most previous work suggested that the bulk of the genre-related information was contained in the image. Another contribution concerns the identification of the type of discourse; we propose low-level models for detecting spontaneous speech that significantly improve on the state of the art for this kind of approach. The third focus of this work concerns the summary itself. In the context of video summarization, we first try to define what a synthetic view is. Is it what characterizes the document as a whole, or what a user would remember (for example an emotional or funny moment)? This question is discussed and we make concrete proposals for the definition of objective functions corresponding to three different criteria: salience, expressiveness and significance. We then propose an algorithm for finding the summary of maximal interest, derived from one introduced in previous work, based on integer linear programming.
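    The budgeted selection behind the maximal-interest summary can be sketched as a 0/1 knapsack: pick segments maximizing total interest under a duration budget. The thesis formulates this as an integer linear program; the dynamic program below is an equivalent stand-in for the two-constraint-free case, and the segment data are hypothetical (durations assumed to be integer seconds):

```python
def select_summary(segments, budget):
    """Pick segments maximizing total interest within a duration budget.

    segments: list of (name, duration, interest) triples.
    Returns (chosen segment names, total interest) via a knapsack DP.
    """
    n = len(segments)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, dur, score) in enumerate(segments, 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                     # skip segment i
            if dur <= b:                                    # or take it
                best[i][b] = max(best[i][b], best[i - 1][b - dur] + score)
    # Backtrack to recover which segments were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, dur, _ = segments[i - 1]
            chosen.append(name)
            b -= dur
    return list(reversed(chosen)), best[n][budget]
```

    With a 30-second budget over four hypothetical segments, the selection trades the single highest-interest segment against combinations that fill the budget better.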