
    Hidden Neural Networks: A Framework for HMM/NN Hybrids

    This paper presents a general framework for hybrids of Hidden Markov models (HMMs) and neural networks (NNs). In the new framework, called Hidden Neural Networks (HNNs), the usual HMM probability parameters are replaced by neural network outputs. To ensure a probabilistic interpretation, the HNN is normalized globally, as opposed to the local normalization enforced on parameters in standard HMMs. Furthermore, all parameters in the HNN are estimated simultaneously according to the discriminative conditional maximum likelihood (CML) criterion. The HNNs show clear performance gains compared to standard HMMs on TIMIT continuous speech recognition benchmarks. On the task of recognizing five broad phoneme classes an accuracy of 84% is obtained, compared to 76% for a standard HMM. Additionally, we report a preliminary result of 69% accuracy on the TIMIT 39-phoneme task.
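
    The core mechanics can be illustrated in a few lines: a globally normalised sequence model whose state scores come from a neural network, trained by conditional maximum likelihood. The sketch below is a minimal toy version of this idea, not the paper's exact architecture; the network shape, the zero transition scores and all parameter values are placeholder assumptions.

```python
# Minimal sketch of a globally normalised HMM/NN hybrid: per-state
# scores come from a small network, and log P(y | X) is the reference
# path score minus a log-sum over all paths (log-space forward pass).
# Toy shapes and parameters; not the paper's exact model.
import numpy as np

def log_sum_exp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def nn_scores(x, W1, b1, W2, b2):
    """Tiny MLP mapping one frame to an unnormalised score per state."""
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2                                     # (n_states,)

def log_cond_likelihood(X, y, trans, params):
    """log P(y | X): numerator is the reference path score, the
    denominator sums over all state sequences (global normalisation)."""
    T, S = len(X), trans.shape[0]
    scores = np.array([nn_scores(x, *params) for x in X])  # (T, S)
    num = scores[0, y[0]] + sum(trans[y[t-1], y[t]] + scores[t, y[t]]
                                for t in range(1, T))
    alpha = scores[0].copy()
    for t in range(1, T):
        alpha = np.array([log_sum_exp(alpha + trans[:, j])
                          for j in range(S)]) + scores[t]
    return num - log_sum_exp(alpha)

rng = np.random.default_rng(0)
d, h, S, T = 4, 8, 3, 6
params = (rng.normal(size=(h, d)), np.zeros(h),
          rng.normal(size=(S, h)), np.zeros(S))
X, y = rng.normal(size=(T, d)), rng.integers(0, S, T)
print(log_cond_likelihood(X, y, np.zeros((S, S)), params))  # CML objective
```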

    Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations

    This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates the background for every input frame. The implementation is done with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. Using this, the proposed adaptation does not require ground truth or prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of background and speaker effects. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark for the proposed aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.
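
    To make the frame-level idea concrete, the sketch below scores each frame under several candidate CMLLR feature transforms, one per background condition, and picks the best. This is only an illustration: the actual method lets the decoder choose branches through the expanded HMM topology so as to maximise the utterance-level likelihood, and all transforms, means and covariances here are toy values.

```python
# Hypothetical illustration of per-frame background compensation with
# CMLLR feature transforms x_hat = A x + b; the log|det A| term keeps
# the transformed density properly normalised. Greedy per-frame branch
# selection stands in for the paper's utterance-level HMM decoding.
import numpy as np
from scipy.stats import multivariate_normal

def cmllr_loglik(x, A, b, mean, cov):
    """Log-likelihood of frame x under one CMLLR feature transform."""
    return (multivariate_normal.logpdf(A @ x + b, mean, cov)
            + np.linalg.slogdet(A)[1])

def best_branch_per_frame(frames, transforms, mean, cov):
    """For each frame, pick the background branch with highest score."""
    return [int(np.argmax([cmllr_loglik(x, A, b, mean, cov)
                           for A, b in transforms])) for x in frames]

frames = np.random.default_rng(1).normal(size=(5, 2))
transforms = [(np.eye(2), np.zeros(2)),          # e.g. "clean" branch
              (0.5 * np.eye(2), np.ones(2))]     # e.g. "noisy" branch
print(best_branch_per_frame(frames, transforms, np.zeros(2), np.eye(2)))
```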

    The 2015 Sheffield System for Transcription of Multi-Genre Broadcast Media

    We describe the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge task of transcribing multi-genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art in automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi-genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker-based cepstral normalisation, another hybrid stage with input features normalised by speaker feature-MLLR transformations, and finally a bottleneck-based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi-genre shows.
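
    The final combination step can be pictured with a small word-level voting routine in the spirit of ROVER. The sketch below assumes the three hypotheses are already aligned word-for-word, which the real procedure has to establish first; it is illustrative only, not the MGB system's code.

```python
# Word-level majority voting over multiple system outputs (ROVER-like).
# Assumes pre-aligned, equal-length hypotheses for simplicity.
from collections import Counter

def combine_aligned(hyps):
    """hyps: list of equal-length word lists; returns the majority vote."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*hyps)]

hyp_cmn    = "the cat sat on the mat".split()   # hybrid DNN + CMN pass
hyp_fmllr  = "the cat sat in the mat".split()   # hybrid DNN + fMLLR pass
hyp_tandem = "a cat sat on the mat".split()     # tandem bottleneck pass
print(" ".join(combine_aligned([hyp_cmn, hyp_fmllr, hyp_tandem])))
# -> the cat sat on the mat
```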

    Speaker adaptive training using model agnostic meta-learning

    Speaker adaptive training (SAT) of neural network acoustic models learns models in a way that makes them more suitable for adaptation to test conditions. Conventionally, model-based speaker adaptive training is performed by having a set of speaker-dependent parameters that are jointly optimised with speaker-independent parameters in order to remove speaker variation. However, this does not scale well if all neural network weights are to be adapted to the speaker. In this paper we formulate speaker adaptive training as a meta-learning task, in which an adaptation process using gradient descent is encoded directly into the training of the model. We compare our approach with test-only adaptation of a standard baseline model and a SAT-LHUC model with a learned speaker adaptation schedule, and demonstrate that the meta-learning approach achieves comparable results.
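
    The meta-learning formulation can be sketched with a second-order MAML loop: an inner gradient step adapts the model to one speaker's adaptation data, and the outer update optimises the post-adaptation loss, so that adaptability itself is trained. The model, data and hyperparameters below are toy placeholders, not the paper's setup.

```python
# MAML-style speaker adaptive training, minimal sketch in PyTorch:
# inner step = speaker adaptation by gradient descent, outer step =
# meta-update through the adapted parameters (second-order).
import torch

model = torch.nn.Linear(40, 10)            # stand-in acoustic model
meta_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()
inner_lr = 0.1

def speaker_batches(n_speakers=100):
    """Yield (adaptation, held-out) batches per speaker; random toy data."""
    for _ in range(n_speakers):
        yield ((torch.randn(16, 40), torch.randint(0, 10, (16,))),
               (torch.randn(16, 40), torch.randint(0, 10, (16,))))

for (xa, ya), (xt, yt) in speaker_batches():
    params = list(model.parameters())
    # Inner loop: one adaptation step on this speaker's data,
    # keeping the graph so gradients flow through the update.
    grads = torch.autograd.grad(loss_fn(model(xa), ya), params,
                                create_graph=True)
    w, b = [p - inner_lr * g for p, g in zip(params, grads)]
    # Outer loop: evaluate the *adapted* model on held-out data
    # and update the speaker-independent initialisation.
    meta_opt.zero_grad()
    loss_fn(torch.nn.functional.linear(xt, w, b), yt).backward()
    meta_opt.step()
```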

    Probabilistic Modelling for Unsupervised Analysis of Human Behaviour in Smart Cities

    The growth of urban areas in recent years has motivated a large number of new sensor applications in smart cities. At the centre of many new applications stands the goal of gaining insights into human activity. Scalable monitoring of urban environments can facilitate better-informed city planning, efficient security, regular transport and commerce. A large part of this monitoring capability has already been deployed; however, most of it relies on expensive motion imagery and privacy-invading video cameras. Low-cost sensor alternatives, such as Global Positioning System (GPS) data, can enable a deep understanding of population behaviour. However, the automated analysis of such low-dimensional sensor data requires new flexible and structured techniques that can describe the generative distribution and time dynamics of the observation data, while accounting for external contextual influences such as time of day or the difference between weekend and weekday trends. In this paper, we propose a novel time series analysis technique that allows for multiple different transition matrices depending on the data's contextual realisations, all following shared adaptive observational models that govern the global distribution of the data given a latent sequence. The proposed approach, which we name the Adaptive Input Hidden Markov model (AI-HMM), is tested on two datasets from different sensor types: GPS trajectories of taxis and derived vehicle counts in populated areas. We demonstrate that our model can group different categories of behavioural trends and identify time-specific anomalies.
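
    The central mechanism, a transition matrix indexed by an external contextual input while all contexts share the same observation models, can be written as a small variant of the scaled forward algorithm. The sketch below uses toy Gaussian emissions and made-up weekday/weekend matrices; it illustrates the AI-HMM idea rather than reproducing the paper's model.

```python
# Forward algorithm where the transition matrix at each step is chosen
# by a contextual input (e.g. weekday vs. weekend), while the Gaussian
# observation models are shared across contexts. Toy parameters.
import numpy as np
from scipy.stats import norm

def forward_loglik(obs, contexts, trans_by_ctx, init, means, stds):
    """log p(obs | contexts); trans_by_ctx maps context -> (S, S) matrix."""
    emis = lambda t: norm.pdf(obs[t], means, stds)       # (S,) likelihoods
    alpha = init * emis(0)
    loglik = np.log(alpha.sum()); alpha /= alpha.sum()   # scaled forward
    for t in range(1, len(obs)):
        alpha = (alpha @ trans_by_ctx[contexts[t]]) * emis(t)
        loglik += np.log(alpha.sum()); alpha /= alpha.sum()
    return loglik

trans = {"weekday": np.array([[0.9, 0.1], [0.2, 0.8]]),
         "weekend": np.array([[0.5, 0.5], [0.5, 0.5]])}
obs = np.array([0.1, 0.2, 3.1, 2.9, 0.0])                # e.g. vehicle counts
ctx = ["weekday"] * 3 + ["weekend"] * 2
print(forward_loglik(obs, ctx, trans, np.array([0.5, 0.5]),
                     np.array([0.0, 3.0]), np.array([1.0, 1.0])))
```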

    Statistical models for noise-robust speech recognition

    A standard way of improving the robustness of speech recognition systems to noise is model compensation. This replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech. For each clean speech component, model compensation techniques usually approximate the corrupted speech distribution with a diagonal-covariance Gaussian distribution. This thesis looks into improving on this approximation in two ways: firstly, by estimating full-covariance Gaussian distributions; secondly, by approximating corrupted-speech likelihoods without any parameterised distribution.

    The first part of this work is about compensating for within-component feature correlations under noise. For this, the covariance matrices of the computed Gaussians should be full instead of diagonal. The estimation of off-diagonal covariance elements turns out to be sensitive to approximations. A popular approximation is the one that state-of-the-art compensation schemes, like VTS compensation, use for dynamic coefficients: the continuous-time approximation. Standard speech recognisers contain both per-time-slice, static, coefficients, and dynamic coefficients, which represent signal changes over time and are normally computed from a window of static coefficients. To remove the need for the continuous-time approximation, this thesis introduces a new technique. It first compensates a distribution over the window of statics, and then applies the same linear projection that extracts dynamic coefficients. It introduces a number of methods that address the correlation changes that occur in noise within this framework. The next problem is decoding speed with full covariances. This thesis re-analyses the previously-introduced predictive linear transformations, and shows how they can model feature correlations at low and tunable computational cost.

    The second part of this work removes the Gaussian assumption completely. It introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. For this, it transforms the integral in the likelihood expression, and then applies sequential importance resampling. Though it is too slow to use for recognition, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence to the ideal compensation for one component. The KL divergence proves to predict the word error rate well. This technique also makes it possible to evaluate the impact of approximations that standard compensation schemes make.

    This work was supported by Toshiba Research Europe Ltd., Cambridge Research Laboratory.
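
    For orientation, the sketch below shows the standard diagonal VTS compensation this thesis starts from, using the log-spectral mismatch function y = x + log(1 + exp(n - x)), followed by a Monte-Carlo reference in the spirit of the thesis's sampling-based analysis. Values are illustrative; dynamic coefficients and the full-covariance extensions are omitted.

```python
# Diagonal VTS compensation of one clean-speech Gaussian for additive
# noise, via a first-order expansion of y = x + log(1 + exp(n - x))
# around the clean and noise means. Toy parameter values.
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    G = 1.0 / (1.0 + np.exp(mu_n - mu_x))         # Jacobian dy/dx at means
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))   # mismatch at the means
    var_y = G**2 * var_x + (1.0 - G)**2 * var_n   # propagated variance
    return mu_y, var_y

mu_x, var_x = np.array([5.0, 4.0]), np.array([1.0, 0.8])
mu_n, var_n = np.array([3.0, 4.5]), np.array([0.3, 0.3])
print(vts_compensate(mu_x, var_x, mu_n, var_n))

# Monte-Carlo reference: push (clean, noise) samples through the exact
# mismatch function and compare empirical moments with VTS above.
rng = np.random.default_rng(0)
xs = rng.normal(mu_x, np.sqrt(var_x), size=(100_000, 2))
ns = rng.normal(mu_n, np.sqrt(var_n), size=(100_000, 2))
ys = xs + np.log1p(np.exp(ns - xs))
print(ys.mean(axis=0), ys.var(axis=0))
```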

    Markov models of biomolecular systems


    Registration of ultrasound and computed tomography for guidance of laparoscopic liver surgery

    Laparoscopic Ultrasound (LUS) imaging is a standard tool used for image guidance during laparoscopic liver resection, as it provides real-time information on the internal structure of the liver. However, LUS probes are difficult to handle and their resulting images hard to interpret. Additionally, some anatomical targets such as tumours are not always visible, making the LUS guidance less effective. To solve this problem, registration between the LUS images and a pre-operative Computed Tomography (CT) scan using information from blood vessels has been previously proposed. By merging these two modalities, the relative position between the LUS images and the anatomy of the CT is obtained and both can be used to guide the surgeon. The problem of LUS to CT registration is especially challenging: besides being a multi-modal registration, the field of view of LUS is significantly smaller than that of CT. Therefore, this problem becomes poorly constrained and typically an accurate initialisation is needed. Also, the liver is highly deformed during laparoscopy, complicating the problem further. So far, the methods presented in the literature are not clinically feasible as they depend on manually set correspondences between both images. In this thesis, a solution for this registration problem that may be more transferable to the clinic is proposed. Firstly, traditional registration approaches comprised of manual initialisation and optimisation of a cost function are studied. Secondly, it is demonstrated that a globally optimal registration without a manual initialisation is possible. Finally, a new globally optimal solution that does not require commonly used tracking technologies is proposed and validated. The resulting approach provides clinical value as it does not require manual interaction in the operating room or tracking devices. Furthermore, the proposed method could potentially be applied to other image-guidance problems that require registration between ultrasound and a pre-operative scan.
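
    As background for the vessel-based registration discussed above, the sketch below shows the classical locally converging building block: ICP-style rigid alignment of two point sets (e.g. vessel centreline points from LUS and CT) with a Kabsch least-squares fit. The thesis's contribution is precisely to avoid the manual initialisation such methods need, so this is context, not the proposed method; all point sets here are synthetic.

```python
# ICP-style rigid point-set registration: alternate nearest-neighbour
# matching with a closed-form (Kabsch) least-squares rigid fit.
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(src, dst, n_iter=50):
    """Iteratively match each source point to its nearest target point,
    refit the rigid transform, and apply it."""
    cur = src.copy()
    for _ in range(n_iter):
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        R, t = rigid_fit(cur, dst[d2.argmin(axis=1)])
        cur = cur @ R.T + t
    return cur

rng = np.random.default_rng(2)
ct_pts = rng.normal(size=(200, 3))            # synthetic "CT" vessel points
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0], [0, 0, 1]])
lus_pts = ct_pts[:80] @ Rz.T + np.array([0.1, -0.2, 0.05])  # smaller FoV
aligned = icp(lus_pts, ct_pts)
# RMS distance to the true correspondences after alignment.
print(np.sqrt(((aligned - ct_pts[:80]) ** 2).sum(-1)).mean())
```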

    A Case Study of Algorithms for Morphosyntactic Tagging of Polish Language

    The paper presents an evaluation of several part-of-speech taggers, representing the main tagging algorithms, applied to the corpus of the frequency dictionary of contemporary Polish. We report our results for two tagging schemes: the IPI PAN positional tagset and its simplified version. Tagging accuracy is calculated for different training sets and broken down into several subcategories (accuracy on known and unknown tokens, word segments, sentences, etc.). Results are compared with those reported for other inflecting and analytic languages. Performance aspects (time demands) of the tagging tools used are also discussed.
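
    The reported breakdown can be reproduced with a few lines of bookkeeping. The sketch below computes overall token accuracy, known/unknown-token accuracy against a lexicon, and whole-sentence accuracy; the data layout and the toy Polish example are hypothetical, not the paper's evaluation harness.

```python
# Tagging-accuracy breakdown: overall, known vs. unknown tokens
# (relative to a training lexicon), and whole-sentence accuracy.
def tagging_report(sentences, lexicon):
    """sentences: list of [(token, gold_tag, predicted_tag), ...] lists."""
    known = unknown = known_ok = unknown_ok = sent_ok = 0
    for sent in sentences:
        sent_ok += all(g == p for _, g, p in sent)
        for token, gold, pred in sent:
            if token in lexicon:
                known += 1; known_ok += gold == pred
            else:
                unknown += 1; unknown_ok += gold == pred
    return {"token_acc": (known_ok + unknown_ok) / (known + unknown),
            "known_acc": known_ok / max(known, 1),
            "unknown_acc": unknown_ok / max(unknown, 1),
            "sentence_acc": sent_ok / len(sentences)}

sents = [[("Ala", "subst", "subst"), ("ma", "fin", "fin"),
          ("kota", "subst", "adj")]]
print(tagging_report(sents, lexicon={"Ala", "ma"}))
```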