
    Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations

    This paper proposes a framework for adapting to complex, non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply, for every input frame, the feature transform that best compensates for the background. It is implemented with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. With this topology, the adaptation requires neither ground truth nor prior knowledge of the background in each frame, as it maximises the overall log-likelihood of the decoded utterance. The aCMLLR transforms can be further improved by retraining the models in an aNAT fashion and by applying speaker-based MLLR transforms in cascade, which models background and speaker effects efficiently. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark for the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, the same selection of techniques was applied to the transcription of multi-genre media broadcasts, where aNAT, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.
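
    The following is a minimal sketch (not taken from the article) of the per-frame transform-selection idea that aCMLLR builds on: each condition-specific CMLLR transform x' = A x + b is scored by its Jacobian-adjusted log-likelihood under a background GMM, and the best-scoring transform is kept for that frame. The GMM, transform set and function names are hypothetical; in the article the selection is embedded in the expanded HMM topology and resolved jointly during decoding rather than frame by frame in isolation.

```python
# Hypothetical sketch of the per-frame transform-selection criterion behind
# aCMLLR: apply every condition-specific CMLLR transform (x' = A x + b) to a
# frame and keep the one with the highest Jacobian-adjusted log-likelihood
# under a background GMM. The article resolves this choice inside an expanded
# HMM topology during decoding; this loop only illustrates the criterion.
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    # per-component log N(x; mu, diag(var)), then log-sum-exp over components
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)
    log_comp = np.log(weights) + log_norm
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())

def select_transforms(frames, transforms, gmm):
    """For each frame pick the CMLLR transform (A, b) with the best
    Jacobian-adjusted log-likelihood; return chosen indices and features."""
    weights, means, variances = gmm
    chosen, adapted = [], []
    for x in frames:
        best = max(
            range(len(transforms)),
            key=lambda c: diag_gmm_loglik(transforms[c][0] @ x + transforms[c][1],
                                          weights, means, variances)
                          + np.log(abs(np.linalg.det(transforms[c][0]))),
        )
        A, b = transforms[best]
        chosen.append(best)
        adapted.append(A @ x + b)
    return np.array(chosen), np.stack(adapted)
```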

    Computer, Speech and Language - Experiment results for paper "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations"

    The files in the dataset correspond to results generated for the Computer, Speech and Language article "Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations" (http://dx.doi.org/10.1016/j.csl.2016.06.008).

    The zip file contains three file types:
    - .ctm: the output of the automatic speech recognition system; the columns include segment information as well as the recognised transcripts.
    - .sys: the scoring of the automatic speech recognition system, including the overall word error rate and the numbers of insertions, deletions and substitutions.
    - .lur: a more detailed decomposition of the word error rate across different tags.

    The file naming convention is as follows:
    - TableX-LineY: the recognition and scoring output corresponding to Line Y of Table X in the article.
    - FigureX-BarY: the recognition and scoring output corresponding to Bar Y (counting from the left) of Figure X in the article.

    All three file types are standard outputs recognised by the automatic speech recognition community and can be opened with any text editor.
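
    As an illustration of how the .ctm outputs might be read, the sketch below assumes the standard NIST/sclite CTM column order (recording id, channel, start time, duration, word, optional confidence); the dataset description does not spell the columns out, so this layout and the example file name are assumptions.

```python
# Hypothetical reader for the .ctm outputs, assuming the standard NIST/sclite
# CTM column order: recording id, channel, start time, duration, word,
# optional confidence. Groups hypothesised words per recording id.
from collections import defaultdict

def read_ctm(path):
    """Return {recording_id: [(start, duration, word), ...]} sorted by time."""
    hyps = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;"):   # skip blanks and comments
                continue
            fields = line.split()
            rec, _chan, start, dur, word = fields[:5]
            hyps[rec].append((float(start), float(dur), word))
    for rec in hyps:
        hyps[rec].sort()
    return dict(hyps)

# Usage example with a hypothetical file name following the TableX-LineY
# convention described above.
if __name__ == "__main__":
    for rec, words in read_ctm("Table3-Line2.ctm").items():
        print(rec, " ".join(w for _, _, w in words))
```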