35 research outputs found
Automatic Quality Estimation for ASR System Combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for
system combination in automatic speech recognition (ASR). In order to select
the most appropriate words to insert at each position in the output
transcriptions, some ROVER extensions rely on critical information such as
confidence scores and other ASR decoder features. This information, which is
not always available, highly depends on the decoding process and sometimes
tends to over estimate the real quality of the recognized words. In this paper
we propose a novel variant of ROVER that takes advantage of ASR quality
estimation (QE) for ranking the transcriptions at "segment level" instead of:
i) relying on confidence scores, or ii) feeding ROVER with randomly ordered
hypotheses. We first introduce an effective set of features to compensate for
the absence of ASR decoder information. Then, we apply QE techniques to perform
accurate hypothesis ranking at segment-level before starting the fusion
process. The evaluation is carried out on two different tasks, in which we
respectively combine hypotheses coming from independent ASR systems and
multi-microphone recordings. In both tasks, it is assumed that the ASR decoder
information is not available. The proposed approach significantly outperforms
standard ROVER and it is competitive with two strong oracles that e xploit
prior knowledge about the real quality of the hypotheses to be combined.
Compared to standard ROVER, the abs olute WER improvements in the two
evaluation scenarios range from 0.5% to 7.3%
Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project
In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected “hyper-events ” (a notion inspired from hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing: speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events. 1
Overview of the IWSLT 2012 Evaluation Campaign
open5siWe report on the ninth evaluation campaign organized by
the IWSLT workshop. This year, the evaluation offered multiple
tracks on lecture translation based on the TED corpus, and
one track on dialog translation from Chinese
to English based on the Olympic trilingual corpus.
In particular, the TED tracks included a speech transcription
track in English, a speech translation track from English to French,
and text translation tracks from English to French and from Arabic
to English. In addition to the official tracks, ten unofficial
MT tracks were offered that required translating TED talks into English
from either Chinese, Dutch, German, Polish, Portuguese (Brazilian), Romanian, Russian, Slovak,
Slovene, or Turkish.
16 teams participated in the evaluation and submitted a total of 48 primary runs.
All runs were evaluated with objective metrics, while runs of the official translation
tracks were also ranked by crowd-sourced judges.
In particular, subjective ranking for the TED task was performed on a progress test which permitted
direct comparison of the results from this year against the best results from the 2011 round of the evaluation campaign.Marcello Federico; Mauro Cettolo; Luisa Bentivogli; Michael Paul; Sebastian StükerFederico, Marcello; Cettolo, Mauro; Bentivogli, Luisa; Michael, Paul; Sebastian, Stüke
Transductive data-selection algorithms for fine-tuning neural machine translation
Machine Translation models are trained to translate a variety of documents from one language into another. However, models specifically trained for a particular characteristics of the documents tend to perform better. Fine-tuning is a technique for adapting an NMT model to some domain. In this work, we want to use this technique to adapt the model to a given test set. In particular, we are using transductive data selection algorithms which take advantage the information of the test set to retrieve sentences from a larger parallel set