240 research outputs found
Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English
normally require hand-written pronunciations. In this paper, we describe a
system for automatically obtaining pronunciations of words for which
pronunciations are not available, but for which transcribed data exists. Our
method integrates information from the letter sequence and from the acoustic
evidence. The novel aspect of the problem that we address is how to prune
entries from such a lexicon (since, empirically, lexicons with too many
entries do not tend to be good for ASR performance). Experiments on
various ASR tasks show that, with the proposed framework, starting with an
initial lexicon of several thousand words, we are able to learn a lexicon which
performs close to a full expert lexicon in terms of WER performance on test
data, and is better than lexicons built using G2P alone or with a pruning
criterion based on pronunciation probability.
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Likelihood-based semi-supervised model selection with applications to speech processing
In conventional supervised pattern recognition tasks, model selection is
typically accomplished by minimizing the classification error rate on a set of
so-called development data, subject to ground-truth labeling by human experts
or some other means. In the context of speech processing systems and other
large-scale practical applications, however, such labeled development data are
typically costly and difficult to obtain. This article proposes an alternative
semi-supervised framework for likelihood-based model selection that leverages
unlabeled data by using trained classifiers representing each model to
automatically generate putative labels. The errors that result from this
automatic labeling are shown to be amenable to results from robust statistics,
which in turn provide for minimax-optimal censored likelihood ratio tests that
recover the nonparametric sign test as a limiting case. This approach is then
validated experimentally using a state-of-the-art automatic speech recognition
system to select between candidate word pronunciations using unlabeled speech
data that only potentially contain instances of the words under test. Results
provide supporting evidence for the utility of this approach, and suggest that
it may also find use in other applications of machine learning.
Comment: 11 pages, 2 figures; submitted for publication
Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?
Hidden Markov models (HMMs) have been successfully applied to automatic
speech recognition for more than 35 years in spite of the fact that a key HMM
assumption -- the statistical independence of frames -- is obviously violated
by speech data. In fact, this data/model mismatch has inspired many attempts to
modify or replace HMMs with alternative models that are better able to take
into account the statistical dependence of frames. However, it is fair to say
that in 2010 the HMM is the consensus model of choice for speech recognition
and that HMMs are at the heart of both commercially available products and
contemporary research systems. In this paper we present a preliminary
exploration aimed at understanding how speech data depart from HMMs and what
effect this departure has on the accuracy of HMM-based speech recognition. Our
analysis uses standard diagnostic tools from the field of statistics --
hypothesis testing, simulation and resampling -- which are rarely used in the
field of speech recognition. Our main result, obtained by novel manipulations
of real and resampled data, demonstrates that real data have statistical
dependency and that this dependency is responsible for significant numbers of
recognition errors. We also demonstrate, using simulation and resampling, that
if we 'remove' the statistical dependency from data, then the resulting
recognition error rates become negligible. Taken together, these results
suggest that a better understanding of the structure of the statistical
dependency in speech data is a crucial first step towards improving HMM-based
speech recognition.
