I-vector estimation using informative priors for adaptation of deep neural networks
This is the author accepted manuscript. The final version is available from ISCA via http://www.isca-speech.org/archive/interspeech_2015/i15_2872.html
Supporting data for this paper is available from the http://www.repository.cam.ac.uk/handle/1810/248387 data repository.
I-vectors are a well-known low-dimensional representation of speaker space and are becoming increasingly popular in the adaptation of state-of-the-art deep neural network (DNN) acoustic models. One advantage of i-vectors is that they can be estimated from very little data, for example a single utterance. However, to improve the robustness of i-vector estimates with limited data, a prior is often used. Traditionally, a standard normal prior is applied to i-vectors, which is nevertheless not well suited to the increased variability of short utterances. This paper proposes a more informative prior, derived from the training data. As well as aiming to reduce the non-Gaussian behaviour of the i-vector space, it allows prior information at different levels, for example gender, to be used. Experiments on a US English Broadcast News (BN) transcription task with speaker- and utterance-level i-vector adaptation show that more informative priors reduce the sensitivity to the quantity of data used to estimate the i-vector. The best configuration for this task was utterance-level test i-vectors enhanced with informative priors, which gave a 13% relative reduction in word error rate over the baseline (no i-vectors) and a 5% relative reduction over utterance-level test i-vectors with a standard prior.
This work was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
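The shrinkage effect of a prior on an estimate from scarce data can be illustrated with a minimal sketch of MAP estimation in a linear-Gaussian model. This is an illustration only: the names `T`, `Sigma` and the simplified statistics are assumptions, not the paper's actual i-vector extractor.

```python
import numpy as np

def map_estimate(stats, T, Sigma, prior_mean, prior_cov):
    """MAP point estimate of a latent vector w for observations m = T w + e,
    e ~ N(0, Sigma), under the prior w ~ N(prior_mean, prior_cov).
    The standard normal prior is recovered with prior_mean=0, prior_cov=I."""
    Si = np.linalg.inv(Sigma)
    P0i = np.linalg.inv(prior_cov)
    A = T.T @ Si @ T + P0i                    # posterior precision
    b = T.T @ Si @ stats + P0i @ prior_mean
    return np.linalg.solve(A, b)

# With very noisy (i.e. scarce) data the estimate shrinks towards the
# prior mean, which is the robustness effect exploited for short utterances.
T, Sigma = np.eye(2), 100.0 * np.eye(2)
w = map_estimate(np.zeros(2), T, Sigma, prior_mean=np.ones(2), prior_cov=np.eye(2))
```

With an informative (e.g. gender-dependent) prior mean, short-utterance estimates are pulled towards a plausible region of speaker space rather than towards the origin.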
Language Model Combination and Adaptation Using Weighted Finite State Transducers
In speech recognition systems, language models (LMs) are often constructed by training and combining multiple n-gram models. These can be used either to represent different genres or tasks found in diverse text sources, or to capture the stochastic properties of different linguistic symbol sequences, for example syllables and words. Unsupervised LM adaptation may also be used to further improve robustness to varying styles or tasks. When using these techniques, extensive software changes are often required. In this paper an alternative and more general approach based on weighted finite state transducers (WFSTs) is investigated for LM combination and adaptation. As it is entirely based on well-defined WFST operations, minimal change to decoding tools is needed. A wide range of LM combination configurations can be flexibly supported. An efficient on-the-fly WFST decoding algorithm is also proposed. Significant error rate gains of 7.3% relative were obtained on a state-of-the-art broadcast audio recognition task using a history-dependently adapted multi-level LM modelling both syllable and word sequences.
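The simplest form of combination that a WFST framework can express generically is linear interpolation of component LMs. A minimal sketch follows; the dict-based LM representation is an assumption for illustration, not the WFST machinery itself.

```python
def interpolate_lms(lm_a, lm_b, lam):
    """Linear interpolation P(w) = lam * Pa(w) + (1 - lam) * Pb(w).
    lm_a and lm_b map words to probabilities; absent words get probability 0."""
    vocab = set(lm_a) | set(lm_b)
    return {w: lam * lm_a.get(w, 0.0) + (1 - lam) * lm_b.get(w, 0.0)
            for w in vocab}

# Combine a two-word LM with a one-word LM at equal weight.
combined = interpolate_lms({"a": 0.5, "b": 0.5}, {"a": 1.0}, lam=0.5)
```

Broadly, a WFST formulation expresses such combinations through standard operations such as weighted union and composition, which is why only minimal change to the decoder is needed.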
Improving lightly supervised training for broadcast transcription
This paper investigates improving lightly supervised acoustic model training for an archive of broadcast data. Standard lightly supervised training uses hypotheses automatically derived by decoding with a biased language model. However, as the actual speech can deviate significantly from the original programme scripts that are supplied, the quality of standard lightly supervised hypotheses can be poor. To address this issue, word- and segment-level combination approaches are used between the lightly supervised transcripts and the original programme scripts, which yield improved transcriptions. Experimental results show that systems trained using these improved transcriptions consistently outperform those trained using only the original lightly supervised decoding hypotheses. This is shown to be the case for both maximum likelihood and minimum phone error trained systems.
The research leading to these results was supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).
This is the accepted manuscript version. The final version is available at http://www.isca-speech.org/archive/interspeech_2013/i13_2187.html
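One possible word-level combination policy can be sketched as a toy illustration (this is not the paper's actual combination scheme): align the decoding hypothesis against the programme script, keep words where the two sources agree, and fall back to the decoder output where they diverge.

```python
import difflib

def combine_transcripts(script, hyp):
    """Word-level combination of a programme script and a lightly
    supervised decoding hypothesis, both given as word lists."""
    out = []
    matcher = difflib.SequenceMatcher(a=script, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(script[i1:i2])   # both sources agree on these words
        else:
            out.extend(hyp[j1:j2])      # deviation from script: trust the audio
    return out

merged = combine_transcripts("the cat sat on the mat".split(),
                             "the cat sat under the mat".split())
```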
Speaker diarisation and longitudinal linking in multi-genre broadcast data
This paper presents a multi-stage speaker diarisation system with longitudinal linking, developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non-speech segmenter. A newly developed linking stage is then added to the basic diarisation output, aiming at the identification of speakers across multiple episodes of the same series. The longitudinal constraint imposes incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question and those broadcast earlier in time. The nature of the data, as well as the longitudinal linking constraint, positions this diarisation task as a new open research topic, and a particularly challenging one. Different linking clustering metrics are compared, and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set.
This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust.
This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485
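The incremental cross-episode linking constraint can be sketched with a toy clustering step. The cosine-distance rule and threshold below are assumptions for illustration, not the system's actual linking metrics.

```python
import numpy as np

def link_episode(centroids, embeddings, threshold=0.4):
    """Assign speaker labels for one episode, linking each speaker embedding
    to the closest cluster from earlier episodes (cosine distance below
    `threshold`) or starting a new cluster. Only past material is consulted,
    honouring the longitudinal constraint."""
    labels = []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        best, best_d = None, threshold
        for label, c in centroids.items():
            d = 1.0 - float(e @ (c / np.linalg.norm(c)))  # cosine distance
            if d < best_d:
                best, best_d = label, d
        if best is None:                    # no cluster close enough: new speaker
            best = f"spk{len(centroids)}"
            centroids[best] = e.copy()
        labels.append(best)
    return labels

centroids = {"spk0": np.array([1.0, 0.0])}          # from earlier episodes
labels = link_episode(centroids, [np.array([1.0, 0.01]),
                                  np.array([0.0, 1.0])])
```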
Effect of Sitting Posture on Development of Scoliosis in Duchenne Muscular Dystrophy Cases
Background: Scoliosis is a frequent association in boys with Duchenne Muscular Dystrophy (DMD) when the ability to walk is lost, around nine to 12 years of age. This study assessed the contribution of physical factors, including lumbar posture, to scoliosis in non-ambulatory youth with DMD in Nepal. Methods: Linear regression was used to assess the effects of time since loss of ambulation, muscle strength, functional severity and lumbar angle (as a binary variable) on coronal Cobb angle; logistic regression was used to assess the effects of muscle strength and cross-legged sitting on the presence of a lordotic lumbar posture in 22 non-ambulant boys and young men. Results: The boys and young men had a mean (SD) age of 15.1 (4.0) years, had been non-ambulant for 48.6 (33.8) months and used a median of 3.5 (range 2 to 7) postures a day. The mean Cobb angle was 15.1 (range 0 to 70) degrees. Optimal accuracy in predicting scoliosis was obtained with a lumbar angle cut-off of -6° as measured by skin markers, and both a lumbar angle ≤ -6° (P=0.112) and better functional ability (P=0.102) were associated with less scoliosis. Use of cross-legged sitting postures during the day was associated with a lumbar angle ≤ -6° (OR 0.061; 95% CI 0.005 - 0.672; P=0.022). Conclusions: Use of a cross-legged sitting posture was associated with an increase in lumbar lordosis. A higher angle of lumbar lordosis and better functional ability were associated with a lesser degree of scoliosis.
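An odds ratio with a Wald confidence interval, of the kind reported in the Results, can be computed from a 2x2 table as below. The cell counts used here are invented purely to show the calculation; the study's actual counts are not given in the abstract.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a, b = exposed with / without the outcome;
    c, d = unexposed with / without the outcome."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)    # SE of the log odds ratio
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Invented example counts, purely illustrative.
or_, lo, hi = odds_ratio_ci(1, 2, 3, 6)
```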
All Politics is Local: The Renminbi's Prospects as a Future Global Currency
In this article we describe methods for improving the RWTH German speech recognizer used within the VERBMOBIL project. In particular, we present acceleration methods for the search based on both within-word and across-word phoneme models. We also study incremental methods to reduce the response time of the online speech recognizer. Finally, we present experimental off-line results for the three VERBMOBIL scenarios. We report word error rates and real-time factors for both speaker-independent and speaker-dependent recognition.
1 Introduction
The goal of the VERBMOBIL project is to develop a speech-to-speech translation system that performs close to real time. In this system, speech recognition is followed by subsequent VERBMOBIL modules (such as syntactic analysis and translation) which depend on the recognition result. Therefore, in this application it is particularly important to keep the recognition time as short as possible. There are VERBMOBIL modules which are capable of working ...
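A core ingredient of this kind of search acceleration is beam pruning. A minimal sketch (heavily simplified, and not the decoder's actual implementation):

```python
def prune_beam(hyps, beam=10.0):
    """Beam pruning: keep only hypotheses whose log-probability score is
    within `beam` of the current best, discarding unpromising paths early."""
    best = max(hyps.values())
    return {state: score for state, score in hyps.items()
            if score >= best - beam}

# Hypothesis h3 falls more than 10.0 below the best and is discarded.
survivors = prune_beam({"h1": -1.0, "h2": -5.0, "h3": -20.0}, beam=10.0)
```

Tightening the beam trades recognition accuracy for speed, which is the central tension in bringing such a system close to real time.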
The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition
This paper describes the Multi-Genre Broadcast (MGB) Challenge at ASRU 2015, an evaluation focused on speech recognition, speaker diarization, and "lightly supervised" alignment of BBC TV recordings. The challenge training data covered the whole range of seven weeks of BBC TV output across four channels, resulting in about 1,600 hours of broadcast audio. In addition, several hundred million words of BBC subtitle text were provided for language modelling. A novel aspect of the evaluation was the exploration of speech recognition and speaker diarization in a longitudinal setting, i.e. recognition of several episodes of the same show, and speaker diarization across these episodes, linking speakers. The longitudinal tasks also offered systems the opportunity to make use of supplied metadata including show title, genre tag, and date/time of transmission. This paper describes the task data and evaluation process used in the MGB challenge, and summarises the results obtained.
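The recognition tasks in such evaluations are scored by word error rate. A self-contained sketch of the standard edit-distance computation follows; real scoring pipelines typically also normalise the text first, which is omitted here.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over word lists."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                           # delete all reference words
    for j in range(m + 1):
        d[0][j] = j                           # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m] / n

# Two deletions against a six-word reference.
score = wer("the cat sat on the mat".split(), "the cat sat mat".split())
```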