
    Revisiting Hybrid and GMM-HMM system combination techniques

    In this paper we investigate techniques to combine hybrid HMM-DNN (hidden Markov model – deep neural network) and tandem HMM-GMM (hidden Markov model – Gaussian mixture model) acoustic models using: (1) model averaging, and (2) lattice combination with Minimum Bayes Risk decoding. We have performed experiments on the “TED Talks” task following the protocol of the IWSLT-2012 evaluation. Our experimental results suggest that DNN-based and GMM-based acoustic models are complementary, with error rates being reduced by up to 8% relative when the DNN and GMM systems are combined at the model level in a multi-pass automatic speech recognition (ASR) system. Additionally, further gains were obtained by combining model-averaged lattices with those obtained from the baseline systems. Index Terms — deep neural networks, tandem, hybrid, system combination, TED
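    The model-averaging half of this combination can be illustrated with a minimal sketch: interpolating the per-frame state log-likelihoods of the two systems before decoding. The function name, interpolation weight, and array shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def average_models(dnn_loglik, gmm_loglik, weight=0.5):
    """Log-linear interpolation of two acoustic models at the frame level.

    dnn_loglik, gmm_loglik: (num_frames, num_states) log-likelihood
    matrices from the hybrid (DNN) and tandem (GMM) systems.
    weight: interpolation weight given to the DNN system (assumed here).
    """
    # Weighted sum in the log domain; the result feeds the decoder.
    return weight * dnn_loglik + (1.0 - weight) * gmm_loglik

# Illustrative usage with random posteriors for 10 frames and 4 states.
rng = np.random.default_rng(0)
dnn = np.log(rng.dirichlet(np.ones(4), size=10))
gmm = np.log(rng.dirichlet(np.ones(4), size=10))
combined = average_models(dnn, gmm, weight=0.7)
```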

    Generalized Hidden Filter Markov Models Applied to Speaker Recognition

    Classification of time series has wide Air Force, DoD, and commercial interest, from automatic target recognition systems on munitions to recognition of speakers in diverse environments. The ability to effectively model the temporal information contained in a sequence is of paramount importance. Toward this goal, this research develops theoretical extensions to a class of stochastic models and demonstrates their effectiveness on the problem of text-independent (language-constrained) speaker recognition. Specifically, within the hidden Markov model architecture, additional constraints are implemented which better incorporate observation correlations and context, where standard approaches fail. Two methods of modeling correlations are developed, and their mathematical properties of convergence and reestimation are analyzed. These differ in modeling correlation present in the time samples and correlation present in the processed features, such as Mel frequency cepstral coefficients. The system models speaker-dependent phonemes, making use of word dictionary grammars, and recognition is based on normalized log-likelihood Viterbi decoding. Both closed-set identification and speaker verification using cohorts are performed on the YOHO database. YOHO is the only large-scale, multiple-session, high-quality speech database for speaker authentication and contains over one hundred speakers stating combination locks. Equal error rates of 0.21% for males and 0.31% for females are demonstrated. A critical error analysis using a hypothesis test formulation provides the maximum number of errors observable while still meeting the goal error rates of 1% False Reject and 0.1% False Accept. Our system achieves this goal.
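    Recognition here is based on normalized log-likelihood Viterbi decoding; the following is a minimal textbook sketch of such a decoder, where the names and the length-normalization convention are our assumptions rather than the thesis's code.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely HMM state path; all inputs are in the log domain.

    log_init: (S,) initial state log-probabilities.
    log_trans: (S, S) transition log-probabilities (rows = source state).
    log_obs: (T, S) per-frame observation log-likelihoods.
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # candidate scores per transition
        back[t] = scores.argmax(axis=0)       # best predecessor for each state
        delta = scores.max(axis=0) + log_obs[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):             # backtrace the best path
        path[t - 1] = back[t, path[t]]
    # Length-normalized log-likelihood, one common normalization convention.
    return path, delta.max() / T
```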

    Robot environment learning with a mixed-linear probabilistic state-space model

    This thesis proposes the use of a probabilistic state-space model with mixed-linear dynamics for learning to predict a robot's experiences. It is motivated by a desire to bridge the gap between traditional models with predefined objective semantics on the one hand, and the biologically-inspired "black box" behavioural paradigm on the other. A novel EM-type algorithm for the model is presented, which is less computationally demanding than the Monte Carlo techniques developed for use in (for example) visual applications. The algorithm's E-step is slightly approximative, but an extension is described which would in principle make it asymptotically correct. Investigation using synthetically sampled data shows that the uncorrected E-step can in any case make correct inferences about quite complicated systems. Results collected from two simulated mobile robot environments support the claim that mixed-linear models can capture both discontinuous and continuous structure in the world in an intuitively natural manner; while they proved to perform only slightly better than simpler autoregressive hidden Markov models on these simple tasks, it is possible to claim tentatively that they might scale more effectively to environments in which trends over time play a larger role. Bayesian confidence regions, which the mixed-linear model supplies easily, proved to be an effective guard against over-confident predictions outside the model's area of competence. A section on future extensions discusses how the model's easy invertibility could be harnessed to the ultimate aim of choosing actions, from a continuous space of possibilities, which maximise the robot's expected payoff over several steps into the future.
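    The per-regime building block of such a mixed-linear E-step is ordinary linear-Gaussian filtering. Below is a minimal sketch assuming a single linear regime with known parameters; the thesis's model additionally switches between regimes, which this sketch omits.

```python
import numpy as np

def kalman_filter(A, C, Q, R, mu0, P0, ys):
    """Forward filtering for x_t = A x_{t-1} + w_t, y_t = C x_t + v_t.

    Q and R are the process and observation noise covariances; (mu0, P0)
    is the Gaussian prior on the initial state. Returns filtered means.
    """
    mu, P = mu0, P0
    means = []
    for y in ys:
        # Predict the next state from the linear dynamics.
        mu, P = A @ mu, A @ P @ A.T + Q
        # Update with the new observation.
        S = C @ P @ C.T + R                    # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu = mu + K @ (y - C @ mu)
        P = P - K @ C @ P
        means.append(mu)
    return np.array(means)
```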

    NOVEL GRAPHICAL MODEL AND NEURAL NETWORK FRAMEWORKS FOR AUTOMATED SEIZURE DETECTION, TRACKING, AND LOCALIZATION IN FOCAL EPILEPSY

    Epilepsy is a heterogeneous neurological disorder characterized by recurring and unprovoked seizures. It is estimated that 60% of epilepsy patients suffer from focal epilepsy, where seizures originate from one or more discrete locations within the brain. After onset, focal seizure activity spreads, involving more regions in the cortex. Diagnosis and therapeutic planning for patients with focal epilepsy crucially depend on being able to detect epileptic activity as it starts and localize its origin. Due to the subtlety of seizure activity and its complex spatio-temporal propagation patterns, detection and localization of seizures by visual inspection is time-consuming and must be done by highly trained neurologists. In this thesis, we detail modeling approaches to identify and capture the spatio-temporal ictal propagation of focal epileptic seizures. Through novel multi-scale frameworks, information fusion between signal paths, and hybrid architectures, models that capture the underlying seizure propagation phenomena are developed. The first half relies on graphical modeling approaches to detect seizures and track their activity through the space of EEG electrodes. A coupled hidden Markov model approach to seizure propagation is described. This model is subsequently improved through the addition of convolutional neural network based likelihood functions, removing the reliance on hand-designed feature extraction. Through the inclusion of a hierarchical switching chain and localization variables, the model is revised to capture multi-scale seizure onset and spreading information. In the second half of this thesis, end-to-end neural network architectures for seizure detection and localization are developed. First, combined convolutional and recurrent neural networks are used to identify seizure activity at the level of individual EEG channels. Through novel aggregation, the network is trained to recognize seizure activity, track its evolution, and coarsely localize seizure onset from lower-resolution labels. Next, a multi-scale network capable of analyzing the global and electrode-level signals is developed for the challenging task of end-to-end seizure localization. Onset location maps are defined for each patient and an ensemble of weakly supervised loss functions is used in a multi-task learning framework to train the architecture.
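    The channel-level detector from the second half pairs a convolutional front end with a recurrent layer. A minimal PyTorch sketch follows, with illustrative layer sizes rather than the thesis's actual architecture; aggregating the per-channel outputs (e.g. a max over electrodes) would then give a recording-level decision.

```python
import torch
import torch.nn as nn

class ChannelDetector(nn.Module):
    """Per-channel seizure detector: a 1-D CNN feeding a GRU.

    Input: (batch, 1, time) single-channel EEG windows.
    Output: (batch, time) per-step seizure logits.
    """
    def __init__(self, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(               # local waveform features
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        feats = self.cnn(x).permute(0, 2, 1)    # (batch, time, features)
        seq, _ = self.rnn(feats)                # temporal context via the GRU
        return self.head(seq).squeeze(-1)

model = ChannelDetector()
logits = model(torch.randn(2, 1, 256))          # two windows of 256 samples
```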

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. Especially the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided of the complete framework, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
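    The core factorisation step can be sketched with the classic multiplicative updates for KL-divergence NMF on a magnitude spectrogram. This is a textbook baseline, not the thesis's full long-context framework:

```python
import numpy as np

def nmf_kl(V, rank, iters=200, eps=1e-9, seed=0):
    """Factorise V ~= W @ H with multiplicative updates minimising the
    generalized KL divergence, the variant commonly used for spectrograms.

    V: (freq, time) non-negative magnitude spectrogram.
    Returns W (spectral atoms) and H (time-varying activations).
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```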

    Causal inference and interpretable machine learning for personalised medicine

    In this thesis, we discuss the importance of causal knowledge in healthcare for tailoring treatments to a patient's needs. We propose three different causal models for reasoning about the effects of medical interventions on patients with HIV and sepsis, based on observational data. Both application areas are challenging as a result of patient heterogeneity and the existence of confounding that influences patient outcomes. Our first contribution is a treatment policy mixture model that combines nonparametric, kernel-based learning with model-based reinforcement learning to reason about a series of treatments and their effects. These methods each have their own strengths: nonparametric methods can accurately predict treatment effects where there are overlapping patient instances or where data is abundant, while model-based reinforcement learning generalises better in outlier situations by learning a belief state representation of confounding. The overall policy mixture model learns a partition of the space of heterogeneous patients such that we can personalise treatments accordingly. Our second contribution incorporates knowledge from kernel-based reasoning directly into a reinforcement learning model by learning a combined belief state representation. In doing so, we can use the model to simulate counterfactual scenarios to reason about what would happen to a patient if we intervened in a particular way and how their specific outcomes would change. As a result, we may tailor therapies according to patient-specific scenarios. Our third contribution is a reformulation of the information bottleneck problem for learning an interpretable, low-dimensional representation of confounding for medical decision-making. The approach uses the relevance of information to perform a sufficient reduction of confounding. Based on this reduction, we learn equivalence classes among groups of patients, such that we may transfer knowledge to patients with incomplete covariate information at test time. By conditioning on the sufficient statistic, we can accurately infer treatment effects on both a population and a subgroup level. Our final contribution is the development of a novel regularisation strategy that can be applied to deep machine learning models to enforce clinical interpretability. We specifically train deep time-series models such that their predictions have high accuracy while being closely modelled by small decision trees that can be audited easily by medical experts. Broadly, our tree-based explanations can be used to provide additional context in scenarios where reasoning about treatment effects may otherwise be difficult. Importantly, each of the models we present is an attempt to bring about more understanding in medical applications to inform better decision-making overall.
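    The kernel-based half of the first contribution can be sketched as a kernel-weighted average of outcomes among similar patients who received the same treatment; where no similar instances overlap, the mixture would defer to the model-based reinforcement learning arm. The function, Gaussian kernel, and variable names below are our assumptions:

```python
import numpy as np

def kernel_outcome_estimate(x, X, A, Y, action, bandwidth=1.0):
    """Nonparametric estimate of the expected outcome of `action` at
    patient state `x`, from kernel-weighted outcomes of similar patients.

    X: (n, d) patient states, A: (n,) treatments given, Y: (n,) outcomes.
    """
    mask = A == action
    d2 = ((X[mask] - x) ** 2).sum(axis=1)     # squared distances to x
    w = np.exp(-d2 / (2 * bandwidth ** 2))    # Gaussian kernel weights
    if w.sum() < 1e-8:
        return None                           # no overlap: defer to the model-based arm
    return float(w @ Y[mask] / w.sum())
```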

    Incremental semi-supervised learning for anomalous trajectory detection

    The acquisition of a scene-specific normal behaviour model underlies many existing approaches to the problem of automated video surveillance. Since it is unrealistic to acquire a comprehensive set of labelled behaviours for every surveyed scenario, modelling normal behaviour typically corresponds to modelling the distribution of a large collection of unlabelled examples. In general, however, it would be desirable to be able to filter an unlabelled dataset to remove potentially anomalous examples. This thesis proposes a simple semi-supervised learning framework that could allow a human operator to efficiently filter the examples used to construct a normal behaviour model by providing occasional feedback: specifically, the classification output of the model under construction is used to filter the incoming sequence of unlabelled examples so that human approval is requested before incorporating any example classified as anomalous, while all other examples are automatically used for training. A key component of the proposed framework is an incremental one-class learning algorithm which can be trained on a sequence of normal examples while allowing new examples to be classified at any stage during training. The proposed algorithm represents an initial set of training examples with a kernel density estimate, before using merging operations to incrementally construct a Gaussian mixture model while minimising an information-theoretic cost function. This algorithm is shown to outperform an existing state-of-the-art approach without requiring off-line model selection. Throughout this thesis, behaviours are considered in terms of whole motion trajectories: in order to apply the proposed algorithm, trajectories must be encoded as fixed-length vectors. To determine an appropriate encoding strategy, an empirical comparison is conducted to determine the relative class-separability afforded by several different trajectory representations across a range of datasets. The results obtained suggest that the choice of representation makes a small but consistent difference to class separability, indicating that cubic B-spline control points (fitted using least-squares regression) provide a good choice for use in subsequent experiments. The proposed semi-supervised learning framework is tested on three different real trajectory datasets. In all cases the rate of human intervention requests drops steadily, reaching a usefully low level of 1% in one case. A further experiment indicates that once a sufficient number of interventions has been provided, a high level of classification performance can be achieved even if subsequent requests are ignored. The automatic incorporation of unlabelled data is shown to improve classification performance in all cases, while a high level of classification performance is maintained even when unlabelled data containing a high proportion of anomalous examples is presented.
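    The merging operation at the heart of the incremental learner can be sketched as a moment-preserving merge of two weighted Gaussian components; the information-theoretic criterion for choosing which pair to merge is omitted here, and the formulas are the standard ones rather than code from the thesis.

```python
import numpy as np

def merge_components(w1, mu1, S1, w2, mu2, S2):
    """Merge two weighted Gaussian components into one that preserves
    the pair's total weight, mean, and covariance (first two moments)."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    d1, d2 = mu1 - mu, mu2 - mu
    # Within-component spread plus between-mean spread.
    S = (w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))) / w
    return w, mu, S
```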

    A survey of the application of soft computing to investment and financial trading
