5,814 research outputs found
DNN adaptation by automatic quality estimation of ASR hypotheses
In this paper we propose to exploit the automatic Quality Estimation (QE) of
ASR hypotheses to perform the unsupervised adaptation of a deep neural network
modeling acoustic probabilities. Our hypothesis is that significant
improvements can be achieved by: i)automatically transcribing the evaluation
data we are currently trying to recognise, and ii) selecting from it a subset
of "good quality" instances based on the word error rate (WER) scores predicted
by a QE component. To validate this hypothesis, we run several experiments on
the evaluation data sets released for the CHiME-3 challenge. First, we operate
in oracle conditions in which manual transcriptions of the evaluation data are
available, thus allowing us to compute the "true" sentence WER. In this
scenario, we perform the adaptation with variable amounts of data, which are
characterised by different levels of quality. Then, we move to realistic
conditions in which the manual transcriptions of the evaluation data are not
available. In this case, the adaptation is performed on data selected according
to the WER scores "predicted" by a QE component. Our results indicate that: i)
QE predictions allow us to closely approximate the adaptation results obtained
in oracle conditions, and ii) the overall ASR performance based on the proposed
QE-driven adaptation method is significantly better than the strong, most
recent, CHiME-3 baseline.Comment: Computer Speech & Language December 201
Blind Normalization of Speech From Different Channels
We show how to construct a channel-independent representation of speech that
has propagated through a noisy reverberant channel. This is done by blindly
rescaling the cepstral time series by a non-linear function, with the form of
this scale function being determined by previously encountered cepstra from
that channel. The rescaled form of the time series is an invariant property of
it in the following sense: it is unaffected if the time series is transformed
by any time-independent invertible distortion. Because a linear channel with
stationary noise and impulse response transforms cepstra in this way, the new
technique can be used to remove the channel dependence of a cepstral time
series. In experiments, the method achieved greater channel-independence than
cepstral mean normalization, and it was comparable to the combination of
cepstral mean normalization and spectral subtraction, despite the fact that no
measurements of channel noise or reverberations were required (unlike spectral
subtraction).Comment: 25 pages, 7 figure
Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State vowel Categorization
Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624
An integrated theory of language production and comprehension
Currently, production and comprehension are regarded as quite distinct in accounts of language processing. In rejecting this dichotomy, we instead assert that producing and understanding are interwoven, and that this interweaving is what enables people to predict themselves and each other. We start by noting that production and comprehension are forms of action and action perception. We then consider the evidence for interweaving in action, action perception, and joint action, and explain such evidence in terms of prediction. Specifically, we assume that actors construct forward models of their actions before they execute those actions, and that perceivers of others' actions covertly imitate those actions, then construct forward models of those actions. We use these accounts of action, action perception, and joint action to develop accounts of production, comprehension, and interactive language. Importantly, they incorporate well-defined levels of linguistic representation (such as semantics, syntax, and phonology). We show (a) how speakers and comprehenders use covert imitation and forward modeling to make predictions at these levels of representation, (b) how they interweave production and comprehension processes, and (c) how they use these predictions to monitor the upcoming utterances. We show how these accounts explain a range of behavioral and neuroscientific data on language processing and discuss some of the implications of our proposal
Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Identification
Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. Such a transformation enables speech to be understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitchindependent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624
放送ニュースの自動音声認識とインデクシングに関する研究
制度:新 ; 文部省報告番号:乙2056号 ; 学位の種類:博士(工学) ; 授与年月日:2006/12/21 ; 早大学位記番号:新435
Modeling DNN as human learner
In previous experiments, human listeners demonstrated that they had the ability to adapt to
unheard, ambiguous phonemes after some initial, relatively short exposures. At the same time,
previous work in the speech community has shown that pre-trained deep neural network-based
(DNN) ASR systems, like humans, also have the ability to adapt to unseen, ambiguous phonemes
after retuning their parameters on a relatively small set. In the first part of this thesis, the time-course
of phoneme category adaptation in a DNN is investigated in more detail. By retuning the
DNNs with more and more tokens with ambiguous sounds and comparing classification accuracy
of the ambiguous phonemes in a held-out test across the time-course, we found out that DNNs, like
human listeners, also demonstrated fast adaptation: the accuracy curves were step-like in almost
all cases, showing very little adaptation after seeing only one (out of ten) training bins. However,
unlike our experimental setup mentioned above, in a typical
lexically guided perceptual learning
experiment, listeners are trained with individual words instead of individual phones, and thus to truly
model such a scenario, we would require a model that could take the context of a whole utterance
into account. Traditional speech recognition systems accomplish this through the use of hidden
Markov models (HMM) and WFST decoding. In recent years, bidirectional long short-term memory (Bi-LSTM) trained under connectionist temporal classification (CTC) criterion has also attracted
much attention. In the second part of this thesis, previous experiments on ambiguous phoneme
recognition were carried out again on a new Bi-LSTM model, and phonetic transcriptions of words
ending with ambiguous phonemes were used as training targets, instead of individual sounds that
consisted of a single phoneme. We found out that despite the vastly different architecture, the
new model showed highly similar behavior in terms of classification rate over the time course of
incremental retuning. This indicated that ambiguous phonemes in a continuous context could also
be quickly adapted by neural network-based models. In the last part of this thesis, our pre-trained
Dutch Bi-LSTM from the previous part was treated as a Dutch second language learner and was
asked to transcribe English utterances in a self-adaptation scheme. In other words, we used the
Dutch model to generate phonetic transcriptions directly and retune the model on the transcriptions
it generated, although ground truth transcriptions were used to choose a subset of all self-labeled
transcriptions. Self-adaptation is of interest as a model of human second language learning, but also
has great practical engineering value, e.g., it could be used to adapt speech recognition to a lowr-resource
language. We investigated two ways to improve the adaptation scheme, with the first being
multi-task learning with articulatory feature detection during training the model on Dutch and self-labeled
adaptation, and the second being first letting the model adapt to isolated short words before
feeding it with longer utterances.Ope
Recommended from our members
Speaker diarisation and longitudinal linking in multi-genre broadcast data
This paper presents a multi-stage speaker diarisation system with longitudinal linking developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non speech segmenter. A newly developed linking stage is next added to the basic diarisation output aiming at the identification of speakers across multiple episodes of the same series. The longitudinal constraint imposes an incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question, and those broadcast earlier in time. The nature of the data as well as the longitudinal linking constraint position this diarisation task as a new open-research topic, and a particularly challenging one. Different linking clustering metrics are compared and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set.This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust.This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485
User-Adaptive A Posteriori Restoration for Incorrectly Segmented Utterances in Spoken Dialogue Systems
Ideally, the users of spoken dialogue systems should be able to speak at their own tempo. Thus, the systems needs to interpret utterances from various users correctly, even when the utterances contain pauses. In response to this issue, we propose an approach based on a posteriori restoration for incorrectly segmented utterances. A crucial part of this approach is to determine whether restoration is required. We use a classification-based approach, adapted to each user. We focus on each user’s dialogue tempo, which can be obtained during the dialogue, and determine the correlation between each user’s tempo and the appropriate thresholds for classification. A linear regression function used to convert the tempos into thresholds is also derived. Experimental results show that the proposed user adaptation approach applied to two restoration classification methods, thresholding and decision trees, improves classification accuracies by 3.0% and 7.4%, respectively, in cross validation
- …