Effects of Transcription Errors on Supervised Learning in Speech Recognition
Supervised learning using Hidden Markov Models has been used to train acoustic models for automatic speech recognition for several years. Typically, clean transcriptions form the basis for this training regimen. However, results have shown that using readily available transcriptions, which can be erroneous at times (e.g., closed captions), does not degrade performance significantly. This work analyzes the effects of mislabeled data on recognition accuracy. For this purpose, training is performed using manually corrupted training data and the results are observed on three different databases: TIDigits, Alphadigits and SWITCHBOARD. For Alphadigits, with 16% of the data mislabeled, the performance of the system degrades by 12% relative to the baseline results. For a complex task like SWITCHBOARD, with 16% of the training data mislabeled, the performance of the system degrades by 8.5% relative to the baseline results. The training process is more robust to mislabeled data because the Gaussian mixtures that are used to model the underlying distribution tend to cluster around the majority of the correct data. The outliers (incorrect data) do not contribute significantly to the reestimation process.
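The experimental setup described in this abstract (corrupting a fixed share of training labels, then reporting degradation relative to a baseline) can be illustrated with a minimal sketch. The function names, the uniform choice of a wrong label, and the reading of "relative degradation" as a relative increase in error rate are all assumptions made for illustration; the abstract does not specify how the corruption was carried out.

```python
import random

def corrupt_labels(labels, vocabulary, fraction, seed=0):
    """Return a copy of `labels` with `fraction` of the entries mislabeled.

    Assumption: a wrong label is drawn uniformly from the other entries
    of `vocabulary`, which must contain at least two distinct labels.
    """
    rng = random.Random(seed)
    n_corrupt = round(fraction * len(labels))
    corrupted = list(labels)
    # Pick distinct positions so exactly n_corrupt labels are changed.
    for i in rng.sample(range(len(labels)), n_corrupt):
        alternatives = [v for v in vocabulary if v != labels[i]]
        corrupted[i] = rng.choice(alternatives)
    return corrupted

def relative_degradation(baseline_error, corrupted_error):
    """Relative increase in error rate over the baseline (assumed metric)."""
    return (corrupted_error - baseline_error) / baseline_error

# Hypothetical usage: mislabel 16% of a toy digit transcription set.
digits = ["one", "two", "three", "four", "five"]
clean = [digits[i % len(digits)] for i in range(100)]
noisy = corrupt_labels(clean, digits, fraction=0.16)
n_changed = sum(a != b for a, b in zip(clean, noisy))
```

Under this reading, a baseline error rate of 10.0% that rises to 11.2% after corruption would correspond to a 12% relative degradation.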
DNN adaptation by automatic quality estimation of ASR hypotheses
In this paper we propose to exploit the automatic Quality Estimation (QE) of
ASR hypotheses to perform the unsupervised adaptation of a deep neural network
modeling acoustic probabilities. Our hypothesis is that significant
improvements can be achieved by: i) automatically transcribing the evaluation
data we are currently trying to recognise, and ii) selecting from it a subset
of "good quality" instances based on the word error rate (WER) scores predicted
by a QE component. To validate this hypothesis, we run several experiments on
the evaluation data sets released for the CHiME-3 challenge. First, we operate
in oracle conditions in which manual transcriptions of the evaluation data are
available, thus allowing us to compute the "true" sentence WER. In this
scenario, we perform the adaptation with variable amounts of data, which are
characterised by different levels of quality. Then, we move to realistic
conditions in which the manual transcriptions of the evaluation data are not
available. In this case, the adaptation is performed on data selected according
to the WER scores "predicted" by a QE component. Our results indicate that: i)
QE predictions allow us to closely approximate the adaptation results obtained
in oracle conditions, and ii) the overall ASR performance based on the proposed
QE-driven adaptation method is significantly better than the strong, most
recent, CHiME-3 baseline. Comment: Computer Speech & Language December 201
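The selection step in this abstract (keeping only automatically transcribed utterances whose predicted WER marks them as "good quality") can be sketched as follows. The `Hypothesis` record, the `select_for_adaptation` function, and the threshold value are hypothetical names and values chosen for illustration; the abstract does not state how the subset is chosen beyond its reliance on predicted WER.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    utterance_id: str
    text: str             # automatic transcription of the utterance
    predicted_wer: float  # WER estimated by a QE component, in [0, 1]

def select_for_adaptation(hypotheses, max_predicted_wer):
    """Keep hypotheses whose predicted WER is at or below the threshold."""
    return [h for h in hypotheses if h.predicted_wer <= max_predicted_wer]

# Hypothetical usage with made-up QE scores.
pool = [
    Hypothesis("utt1", "turn the volume down", 0.05),
    Hypothesis("utt2", "turn of the lights", 0.40),
    Hypothesis("utt3", "what time is it", 0.10),
]
selected = select_for_adaptation(pool, max_predicted_wer=0.10)
selected_ids = [h.utterance_id for h in selected]
```

In the oracle condition described above, the same filter would simply be applied to the true sentence WER instead of the predicted one.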
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
Investigating the Effects of Word Substitution Errors on Sentence Embeddings
A key initial step in several natural language processing (NLP) tasks
involves embedding phrases of text to vectors of real numbers that preserve
semantic meaning. To that end, several methods have been recently proposed with
impressive results on semantic similarity tasks. However, all of these
approaches assume that perfect transcripts are available when generating the
embeddings. While this is a reasonable assumption for analysis of written text,
it is limiting for analysis of transcribed text. In this paper we investigate
the effects of word substitution errors, such as those arising from automatic
speech recognition (ASR), on several state-of-the-art sentence embedding
methods. To do this, we propose a new simulator that allows the experimenter to
induce ASR-plausible word substitution errors in a corpus at a desired word
error rate. We use this simulator to evaluate the robustness of several
sentence embedding methods. Our results show that pre-trained neural sentence
encoders are both robust to ASR errors and perform well on textual similarity
tasks after errors are introduced. Meanwhile, unweighted averages of word
vectors perform well with perfect transcriptions, but their performance
degrades rapidly on textual similarity tasks for text with word substitution
errors. Comment: 4 pages, 2 figures. Copyright IEEE 2019. Accepted and to appear in
the Proceedings of the 44th International Conference on Acoustics, Speech,
and Signal Processing 2019 (IEEE-ICASSP-2019), May 12-17 in Brighton, U.K.
Personal use of this material is permitted. However, permission to
reprint/republish this material must be obtained from the IEE
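The idea of inducing word substitution errors at a desired word error rate, as described in this abstract, can be sketched in a few lines. This is not the authors' simulator: the `induce_substitutions` function and the hand-written `confusions` table of similar-sounding words are assumptions made for illustration, and the abstract does not say how plausible substitutes are chosen.

```python
import random

def induce_substitutions(words, confusions, target_wer, seed=0):
    """Substitute words so that substitutions / len(words) is near target_wer.

    `confusions` maps a word to a list of hypothetical confusable words.
    Only words with at least one listed confusion can be substituted, so
    the achieved rate may fall short of the target.
    """
    rng = random.Random(seed)
    candidates = [i for i, w in enumerate(words) if confusions.get(w)]
    n_sub = min(round(target_wer * len(words)), len(candidates))
    corrupted = list(words)
    for i in rng.sample(candidates, n_sub):
        corrupted[i] = rng.choice(confusions[words[i]])
    return corrupted

# Hypothetical usage with a toy confusion table.
sentence = "the cat sat on the mat".split()
confusions = {"cat": ["cap"], "sat": ["set"], "mat": ["map"]}
noisy_sentence = induce_substitutions(sentence, confusions, target_wer=0.5)
achieved_wer = sum(a != b for a, b in zip(sentence, noisy_sentence)) / len(sentence)
```

Because only substitutions are introduced, the reference and corrupted sentences stay the same length and the error rate reduces to the fraction of changed words.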
Weakly-Supervised Temporal Localization via Occurrence Count Learning
We propose a novel model for temporal detection and localization which allows
the training of deep neural networks using only counts of event occurrences as
training labels. This powerful weakly-supervised framework alleviates the
burden of the imprecise and time-consuming process of annotating event
locations in temporal data. Unlike existing methods, in which localization is
explicitly achieved by design, our model learns localization implicitly as a
byproduct of learning to count instances. This unique feature is a direct
consequence of the model's theoretical properties. We validate the
effectiveness of our approach in a number of experiments (drum hit and piano
onset detection in audio, digit detection in images) and demonstrate
performance comparable to that of fully-supervised state-of-the-art methods,
despite much weaker training requirements. Comment: Accepted at ICML 201