8 research outputs found

    Automatic Quality Estimation for ASR System Combination

    Recognizer Output Voting Error Reduction (ROVER) has been widely used for system combination in automatic speech recognition (ASR). In order to select the most appropriate words to insert at each position in the output transcriptions, some ROVER extensions rely on critical information such as confidence scores and other ASR decoder features. This information, which is not always available, strongly depends on the decoding process and sometimes tends to overestimate the real quality of the recognized words. In this paper we propose a novel variant of ROVER that takes advantage of ASR quality estimation (QE) for ranking the transcriptions at "segment level", instead of: i) relying on confidence scores, or ii) feeding ROVER with randomly ordered hypotheses. We first introduce an effective set of features to compensate for the absence of ASR decoder information. Then, we apply QE techniques to perform accurate hypothesis ranking at segment level before starting the fusion process. The evaluation is carried out on two different tasks, in which we respectively combine hypotheses coming from independent ASR systems and from multi-microphone recordings. In both tasks, it is assumed that the ASR decoder information is not available. The proposed approach significantly outperforms standard ROVER and is competitive with two strong oracles that exploit prior knowledge about the real quality of the hypotheses to be combined. Compared to standard ROVER, the absolute WER improvements in the two evaluation scenarios range from 0.5% to 7.3%.
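
    To make the idea concrete, here is a minimal sketch (not the paper's implementation) of confidence-independent quality estimation used to rank segment-level hypotheses before fusion; the features, the regressor and the training data below are illustrative assumptions rather than the feature set used in the paper.

```python
# Minimal sketch: rank ASR hypotheses per segment with a confidence-independent
# quality-estimation model, so that ROVER can be fed in best-first order.
# Features, model and training data are illustrative, not the paper's setup.
from sklearn.ensemble import RandomForestRegressor

def qe_features(hypothesis: str) -> list[float]:
    """Toy decoder-independent features for one segment hypothesis."""
    words = hypothesis.split()
    n = len(words)
    return [
        n,                                        # segment length in words
        sum(len(w) for w in words) / max(n, 1),   # average word length
        len(set(words)) / max(n, 1),              # lexical diversity
    ]

# Hypothetical training data: (hypothesis, reference WER) pairs.
train_hyps = ["the cat sat on the mat", "the cat sat on on mat", "cat sat mat"]
train_wer  = [0.0, 0.17, 0.5]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit([qe_features(h) for h in train_hyps], train_wer)

def rank_segment_hypotheses(hyps: list[str]) -> list[str]:
    """Order hypotheses by predicted WER (best first) before ROVER fusion."""
    predicted = model.predict([qe_features(h) for h in hyps])
    return [h for _, h in sorted(zip(predicted, hyps))]

print(rank_segment_hypotheses(["the cat sat on on mat", "the cat sat on the mat"]))
```

    The point of the sketch is that the ranking uses only properties of the hypothesis text, so it remains applicable when decoder confidence scores are unavailable.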

    Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project

    In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected “hyper-events” (a notion inspired by hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing; speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events.
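
    As a rough illustration of the hyper-event notion described above, the following sketch shows one possible way to represent a hyper-event as a container of facets plus links to related events; the field names are hypothetical and do not reflect the inEvent project's actual schema.

```python
# Hypothetical sketch of a hyper-event record: a container of facets
# (recording, ASR transcript, diarization output, metadata) plus links to
# related hyper-events. Field names are illustrative, not inEvent's schema.
from dataclasses import dataclass, field

@dataclass
class Facet:
    kind: str          # e.g. "video", "transcript", "speaker_segments"
    uri: str           # where the facet's data lives
    metadata: dict = field(default_factory=dict)

@dataclass
class HyperEvent:
    event_id: str
    title: str
    facets: list[Facet] = field(default_factory=list)
    links: list[str] = field(default_factory=list)   # ids of related hyper-events

lecture = HyperEvent(
    event_id="ted-0042",
    title="Example TED talk",
    facets=[
        Facet("transcript", "asr/ted-0042.txt", {"source": "ASR"}),
        Facet("speaker_segments", "diar/ted-0042.rttm"),
    ],
    links=["ted-0017"],   # e.g. same speaker detected across recordings
)
print(len(lecture.facets))
```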

    Overview of the IWSLT 2012 Evaluation Campaign

    We report on the ninth evaluation campaign organized by the IWSLT workshop. This year, the evaluation offered multiple tracks on lecture translation based on the TED corpus, and one track on dialog translation from Chinese to English based on the Olympic trilingual corpus. In particular, the TED tracks included a speech transcription track in English, a speech translation track from English to French, and text translation tracks from English to French and from Arabic to English. In addition to the official tracks, ten unofficial MT tracks were offered that required translating TED talks into English from either Chinese, Dutch, German, Polish, Portuguese (Brazilian), Romanian, Russian, Slovak, Slovene, or Turkish. 16 teams participated in the evaluation and submitted a total of 48 primary runs. All runs were evaluated with objective metrics, while runs of the official translation tracks were also ranked by crowd-sourced judges. In particular, subjective ranking for the TED task was performed on a progress test which permitted direct comparison of the results from this year against the best results from the 2011 round of the evaluation campaign. Authors: Marcello Federico; Mauro Cettolo; Luisa Bentivogli; Michael Paul; Sebastian Stüker.

    Driving ROVER with Segment-based ASR Quality Estimation

    ROVER is a widely used method to combine the output of multiple automatic speech recognition (ASR) systems. Though effective, the basic approach and its variants suffer from potential drawbacks: i) their results depend on the order in which the hypotheses are used to feed the combination process, ii) when applied to combine long hypotheses, they disregard possible differences in transcription quality at the local level, iii) they often rely on word confidence information. We address these issues by proposing a segment-based ROVER in which hypothesis ranking is obtained from a confidence-independent ASR quality estimation method. Our results on English data from the IWSLT2012 and IWSLT2013 evaluation campaigns significantly outperform standard ROVER and approach the results of two strong oracles.
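
    For context, the following toy sketch illustrates the word-level voting that ROVER performs once hypotheses are aligned; real ROVER builds a word transition network by aligning hypotheses incrementally, starting from the first one, which is why the feeding order matters and why segment-level ranking helps. The word-by-word alignment is assumed to be given here, and the example is illustrative rather than the NIST implementation.

```python
# Toy illustration of ROVER-style voting (not the NIST implementation):
# given hypotheses that are already aligned word-by-word, pick the most
# frequent token in each slot; "@" marks an empty (insertion/deletion) slot.
from collections import Counter

def vote(aligned_hypotheses: list[list[str]]) -> list[str]:
    output = []
    for slot in zip(*aligned_hypotheses):        # one column per word position
        word, _count = Counter(slot).most_common(1)[0]
        if word != "@":                          # drop empty-slot winners
            output.append(word)
    return output

hyps = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "@",  "the", "mat"],
    ["a",   "cat", "sat", "on", "the", "bat"],
]
print(" ".join(vote(hyps)))   # -> "the cat sat on the mat"
```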

    Getting Past the Language Gap: Innovations in Machine Translation

    In this chapter, we will be reviewing state-of-the-art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable speech-to-speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.

    Metadiscourse Tagging in Academic Lectures

    This thesis presents a study into the nature and structure of academic lectures, with a special focus on metadiscourse phenomena. Metadiscourse refers to a set of linguistic expressions that signal specific discourse functions such as Introduction: “Today we will talk about...” and Emphasising: “This is an important point”. These functions are important because they are part of the strategies lecturers use to support understanding of what happens in a lecture. Knowledge of their presence and identity could serve as an initial step toward downstream applications that require functional analysis of lecture content, such as a browser for lecture archives, summarisation, or an automatic minute-taker for lectures. One challenging aspect of metadiscourse detection and classification is that the set of expressions is semi-fixed, meaning that different phrases can indicate the same function. To that end, a four-stage approach is developed to study metadiscourse in academic lectures. First, a corpus of metadiscourse for academic lectures from Physics and Economics courses is built by adapting an existing scheme that describes functional-oriented metadiscourse categories. Second, because producing reference transcripts is time-consuming and prone to errors due to the manual effort required, an automatic speech recognition (ASR) system is built specifically to produce transcripts of lectures. Since the reference transcripts lack time-stamp information, an alignment system is applied to the reference transcripts so that the ASR system can be evaluated. Then, a model is developed using Support Vector Machines (SVMs) to classify metadiscourse tags using both textual and acoustic features. The results show that n-grams are the most indicative features for the task; however, due to data sparsity the model does not generalise to unseen n-grams. This limits its ability to solve the variation issue in metadiscourse expressions. Continuous Bag-of-Words (CBOW) embeddings provide a promising solution, as they can capture both the syntactic and semantic similarities between words and are thus able to solve the generalisation issue. However, CBOW ignores word order completely, which is important to retain when classifying metadiscourse tags. The final stage aims to address the issue of sequence modelling by developing a joint CBOW and Convolutional Neural Network (CNN) model. CNNs can work with continuous features such as word embeddings in an elegant and robust fashion by producing a fixed-size feature vector that is able to identify indicative local information for the tagging task. The results show that metadiscourse tagging using CNNs significantly outperforms the SVM model, even on ASR outputs, owing to its ability to identify a sequence of words that is more representative for the task regardless of its position in the sentence. In addition, the inclusion of other features such as part-of-speech (POS) tags and prosodic cues improved the results further. These findings are consistent in both disciplines. The final contribution of this thesis is to investigate the suitability of using metadiscourse tags as discourse features in a lecture structure segmentation model, despite the fact that the task is approached as a classification model and most of the state-of-the-art models are unsupervised. In general, the obtained results show remarkable improvements over the state-of-the-art models in both disciplines.
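
    As an illustration of the embedding-plus-CNN approach described above, the following minimal sketch (in PyTorch) classifies a sentence into a metadiscourse tag via word embeddings, a 1-D convolution and max-over-time pooling; the vocabulary size, tag set and layer sizes are illustrative assumptions, not the thesis's configuration.

```python
# Minimal sketch of the kind of model described above: word embeddings fed to
# a 1-D convolution with max-over-time pooling, then a linear layer that
# predicts a metadiscourse tag for the sentence. Sizes are illustrative.
import torch
import torch.nn as nn

class MetadiscourseCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, n_filters=64,
                 kernel_size=3, n_tags=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)
        self.out = nn.Linear(n_filters, n_tags)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))               # local n-gram-like features
        x = x.max(dim=2).values                    # max-over-time pooling
        return self.out(x)                         # (batch, n_tags) tag scores

model = MetadiscourseCNN()
dummy = torch.randint(0, 5000, (2, 12))            # two 12-token sentences
print(model(dummy).shape)                          # torch.Size([2, 8])
```

    The convolution plays the role that n-grams played for the SVM, but over continuous word vectors, which is what allows generalisation to unseen but semantically similar expressions.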
