I-vector estimation using informative priors for adaptation of deep neural networks
This is the author accepted manuscript. The final version is available from ISCA via http://www.isca-speech.org/archive/interspeech_2015/i15_2872.html
Supporting data for this paper is available from the http://www.repository.cam.ac.uk/handle/1810/248387 data repository.
I-vectors are a well-known low-dimensional representation of speaker space and are becoming increasingly popular in the adaptation of state-of-the-art deep neural network (DNN) acoustic models. One advantage of i-vectors is that they can be estimated from very little data, for example a single utterance. However, to improve the robustness of i-vector estimates with limited data, a prior is often used. Traditionally, a standard normal prior is applied to i-vectors, which is nevertheless not well suited to the increased variability of short utterances. This paper proposes a more informative prior, derived from the training data. As well as aiming to reduce the non-Gaussian behaviour of the i-vector space, it allows prior information at different levels, for example gender, to be used. Experiments on a US English Broadcast News (BN) transcription task with speaker- and utterance-level i-vector adaptation show that more informative priors reduce the sensitivity to the quantity of data used to estimate the i-vector. The best configuration for this task was utterance-level test i-vectors enhanced with informative priors, which gave a 13% relative reduction in word error rate over the baseline (no i-vectors) and a 5% relative reduction over utterance-level test i-vectors with a standard prior.
This work was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
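The shrinkage effect of a prior on an estimate from scarce data can be illustrated with a minimal sketch of MAP estimation in a linear-Gaussian model. This is an illustration only: the names `T`, `Sigma` and the simplified statistics are assumptions, not the paper's actual i-vector extractor.

```python
import numpy as np

def map_estimate(stats, T, Sigma, prior_mean, prior_cov):
    """MAP point estimate of a latent vector w for observations m = T w + e,
    e ~ N(0, Sigma), under the prior w ~ N(prior_mean, prior_cov).
    The standard normal prior is recovered with prior_mean=0, prior_cov=I."""
    Si = np.linalg.inv(Sigma)
    P0i = np.linalg.inv(prior_cov)
    A = T.T @ Si @ T + P0i                    # posterior precision
    b = T.T @ Si @ stats + P0i @ prior_mean
    return np.linalg.solve(A, b)

# With very noisy (i.e. scarce) data the estimate shrinks towards the
# prior mean, which is the robustness effect exploited for short utterances.
T, Sigma = np.eye(2), 100.0 * np.eye(2)
w = map_estimate(np.zeros(2), T, Sigma, prior_mean=np.ones(2), prior_cov=np.eye(2))
```

With an informative (e.g. gender-dependent) prior mean, short-utterance estimates are pulled towards a plausible region of speaker space rather than towards the origin.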
Language Model Combination and Adaptation Using Weighted Finite State Transducers
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. These can be used either to represent different genres or tasks found in diverse text sources, or to capture the stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal change to decoding tools is needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
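The simplest form of combination that a WFST framework can express generically is linear interpolation of component LMs. A minimal sketch follows; the dict-based LM representation is an assumption for illustration, not the WFST machinery itself.

```python
def interpolate_lms(lm_a, lm_b, lam):
    """Linear interpolation P(w) = lam * Pa(w) + (1 - lam) * Pb(w).
    lm_a and lm_b map words to probabilities; absent words get probability 0."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}

# Combine a two-word LM with a one-word LM at equal weight.
combined = interpolate_lms({"a": 0.5, "b": 0.5}, {"a": 1.0}, lam=0.5)
```

Broadly, a WFST formulation expresses such combinations through standard operations such as weighted union and composition, which is why only minimal change to the decoder is needed.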
Improving lightly supervised training for broadcast transcription
This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses hypotheses automatically derived by decoding with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be poor. To address this issue, word- and segment-level combination approaches are used between the lightly supervised transcripts and the original programme scripts, which yield improved transcriptions. Experimental results show that systems trained using these improved transcriptions consistently outperform those trained using only the original lightly supervised decoding hypotheses. This is shown to be the case for both maximum likelihood and minimum phone error trained systems.
The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
This is the accepted manuscript version. The final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_2187.html
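One possible word-level combination policy can be sketched as a toy illustration (this is not the paper's actual combination scheme): align the decoding hypothesis against the programme script, keep words where the two sources agree, and fall back to the decoder output where they diverge.

```python
import difflib

def combine_transcripts(script, hyp):
    """Word-level combination of a programme script and a lightly
    supervised decoding hypothesis, both given as word lists."""
    out = []
    matcher = difflib.SequenceMatcher(a=script, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(script[i1:i2])   # both sources agree on these words
        else:
            out.extend(hyp[j1:j2])      # deviation from script: trust the audio
    return out

merged = combine_transcripts("the cat sat on the mat".split(),
                             "the cat sat under the mat".split())
```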
Speaker diarisation and longitudinal linking in multi-genre broadcast data
This paper presents a multi-stage speaker diarisation system with longitudinal linking, developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non-speech segmenter. A newly developed linking stage is then added to the basic diarisation output, aiming at the identification of speakers across multiple episodes of the same series. The longitudinal constraint imposes incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question and those broadcast earlier in time. The nature of the data, as well as the longitudinal linking constraint, positions this diarisation task as a new open research topic, and a particularly challenging one. Different linking clustering metrics are compared, and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set.
This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust.
This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485
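The incremental cross-episode linking constraint can be sketched with a toy clustering step. The cosine-distance rule and threshold below are assumptions for illustration, not the system's actual linking metrics.

```python
import numpy as np

def link_episode(centroids, embeddings, threshold=0.4):
    """Assign speaker labels for one episode, linking each speaker embedding
    to the closest cluster from earlier episodes (cosine distance below
    `threshold`) or starting a new cluster. Only past material is consulted,
    honouring the longitudinal constraint."""
    labels = []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        best, best_d = None, threshold
        for label, c in centroids.items():
            d = 1.0 - float(e @ (c / np.linalg.norm(c)))  # cosine distance
            if d < best_d:
                best, best_d = label, d
        if best is None:                    # no cluster close enough: new speaker
            best = f"spk{len(centroids)}"
            centroids[best] = e.copy()
        labels.append(best)
    return labels

centroids = {"spk0": np.array([1.0, 0.0])}          # from earlier episodes
labels = link_episode(centroids, [np.array([1.0, 0.01]),
                                  np.array([0.0, 1.0])])
```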
Effect of Sitting Posture on Development of Scoliosis in Duchenne Muscular Dystrophy Cases
Background: Scoliosis is a frequent association in boys with Duchenne Muscular Dystrophy (DMD) when the ability to walk is lost, around nine to 12 years of age. This study assessed the contribution of physical factors, including lumbar posture, to scoliosis in non-ambulatory youth with DMD in Nepal. Methods: Linear regression was used to assess the effects of time since loss of ambulation, muscle strength, functional severity and lumbar angle (as a binary variable) on coronal Cobb angle; logistic regression was used to assess the effects of muscle strength and cross-legged sitting on the presence of a lordotic lumbar posture in 22 non-ambulant boys and young men. Results: The boys and young men had a mean (SD) age of 15.1 (4.0) years, had been non-ambulant for 48.6 (33.8) months and used a median of 3.5 (range 2 to 7) postures a day. The mean Cobb angle was 15.1 (range 0 to 70) degrees. Optimal accuracy in predicting scoliosis was obtained with a lumbar angle cut-off of -6° as measured by skin markers, and both a lumbar angle ≤ -6° (P=0.112) and better functional ability (P=0.102) were associated with less scoliosis. Use of cross-legged sitting postures during the day was associated with a lumbar angle ≤ -6° (OR 0.061; 95% CI 0.005 - 0.672; P=0.022). Conclusions: Use of a cross-legged sitting posture was associated with an increase in lumbar lordosis. A higher angle of lumbar lordosis and better functional ability were associated with a lesser degree of scoliosis.
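An odds ratio with a Wald confidence interval, of the kind reported in the Results, can be computed from a 2x2 table as below. The cell counts used here are invented purely to show the calculation; the study's actual counts are not given in the abstract.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a, b = exposed with / without the outcome;
    c, d = unexposed with / without the outcome."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # SE of the log odds ratio
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Invented example counts, purely illustrative.
or_, lo, hi = odds_ratio_ci(1, 2, 3, 6)
```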
All Politics is Local: The Renminbi's Prospects as a Future Global Currency
In this article we describe methods for improving the RWTH German speech recognizer used within the VERBMOBIL project. In particular, we present acceleration methods for the search based on both within-word and across-word phoneme models. We also study incremental methods to reduce the response time of the online speech recognizer. Finally, we present experimental off-line results for the three VERBMOBIL scenarios. We report word error rates and real-time factors for both speaker-independent and speaker-dependent recognition.
1 Introduction
The goal of the VERBMOBIL project is to develop a speech-to-speech translation system that performs close to real time. In this system, speech recognition is followed by subsequent VERBMOBIL modules (such as syntactic analysis and translation) which depend on the recognition result. Therefore, in this application it is particularly important to keep the recognition time as short as possible. There are VERBMOBIL modules which are capable of working ...
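A core ingredient of this kind of search acceleration is beam pruning. A minimal sketch (heavily simplified, and not the decoder's actual implementation):

```python
def prune_beam(hyps, beam=10.0):
    """Beam pruning: keep only hypotheses whose log-probability score is
    within `beam` of the current best, discarding unpromising paths early."""
    best = max(hyps.values())
    return {state: score for state, score in hyps.items()
            if score >= best - beam}

# Hypothesis h3 falls more than 10.0 below the best and is discarded.
survivors = prune_beam({"h1": -1.0, "h2": -5.0, "h3": -20.0}, beam=10.0)
```

Tightening the beam trades recognition accuracy for speed, which is the central tension in bringing such a system close to real time.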
The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition
This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered the whole range of seven weeks of BBC TV output across four channels, resulting in about 1,600 hours of broadcast audio. In addition, several hundred million words of BBC subtitle text were provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting, i.e. recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered systems the opportunity to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.
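The recognition tasks in such evaluations are scored by word error rate. A self-contained sketch of the standard edit-distance computation follows; real scoring pipelines typically also normalise the text first, which is omitted here.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over word lists."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                           # delete all reference words
    for j in range(m + 1):
        d[0][j] = j                           # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

# Two deletions against a six-word reference.
score = wer("the cat sat on the mat".split(), "the cat sat mat".split())
```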