
    Multitask Learning with CTC and Segmental CRF for Speech Recognition

    Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define the transcription probability by marginalizing over latent segmentation alternatives: the former uses a globally normalized joint model of segment labels and durations, and the latter classifies each frame as either an output symbol or a "continuation" of the previous label. In this paper, we train a recognition model by optimizing an interpolation between the SCRF and CTC losses, where the same recurrent neural network (RNN) encoder is used for feature extraction for both outputs. We find that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models. Additionally, we show that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
    Comment: 5 pages, 2 figures, camera-ready version at Interspeech 2017.
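    A minimal PyTorch sketch of this kind of interpolated multitask objective, with one shared RNN encoder feeding two output heads. The CTC branch uses PyTorch's nn.CTCLoss; the SCRF term is replaced here by a simple framewise cross-entropy placeholder, since a real segmental CRF loss (marginalizing over segmentations under a globally normalized segment model) is far more involved. The layer sizes, label count, and interpolation weight lam are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class MultitaskSpeechModel(nn.Module):
    """Shared BiLSTM encoder with two output heads (CTC and segmental)."""
    def __init__(self, n_feats=40, n_hidden=256, n_labels=48):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, n_hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # +1 output for the CTC blank symbol
        self.ctc_head = nn.Linear(2 * n_hidden, n_labels + 1)
        self.scrf_head = nn.Linear(2 * n_hidden, n_labels)

    def forward(self, x):
        h, _ = self.encoder(x)                  # (batch, time, 2*n_hidden)
        return self.ctc_head(h), self.scrf_head(h)

ctc_criterion = nn.CTCLoss(blank=0)

def scrf_criterion(logits, frame_labels):
    # Placeholder: a real SCRF loss marginalizes over segmentations with a
    # globally normalized segment-level model. Framewise cross-entropy is
    # used here only to keep the sketch self-contained and runnable.
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), frame_labels.reshape(-1))

def multitask_loss(model, feats, targets, frame_labels,
                   input_lens, target_lens, lam=0.5):
    """Interpolation of the two losses over one shared encoder."""
    ctc_logits, scrf_logits = model(feats)
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, C)
    l_ctc = ctc_criterion(log_probs, targets, input_lens, target_lens)
    l_scrf = scrf_criterion(scrf_logits, frame_labels)
    return lam * l_ctc + (1.0 - lam) * l_scrf
```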

    A latent discriminative model-based approach for classification of imaginary motor tasks from EEG data

    We consider the problem of classification of imaginary motor tasks from electroencephalography (EEG) data for brain-computer interfaces (BCIs) and propose a new approach based on hidden conditional random fields (HCRFs). HCRFs are discriminative graphical models that are attractive for this problem because they (1) exploit the temporal structure of EEG; (2) include latent variables that can be used to model different brain states in the signal; and (3) involve learned statistical models matched to the classification task, avoiding some of the limitations of generative models. Our approach involves spatial filtering of the EEG signals and estimation of power spectra based on auto-regressive modeling of temporal segments of the EEG signals. Given this time-frequency representation, we select certain frequency bands that are known to be associated with execution of motor tasks. These selected features constitute the data that are fed to the HCRF, whose parameters are learned from training data. Inference algorithms on the HCRFs are used for classification of motor tasks. We experimentally compare this approach to the best performing methods in BCI competition IV as well as a number of more recent methods and observe that our proposed method yields better classification accuracy.
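    The feature-extraction step described above (auto-regressive modeling of EEG segments followed by band selection) can be sketched in a few lines of NumPy/SciPy. This is not the authors' code; the AR order, sampling rate, and the mu/beta band edges are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def ar_power_spectrum(segment, order=6, fs=250.0, n_freqs=128):
    """Fit an AR model to one EEG segment (Yule-Walker equations) and
    return its power spectral density on a grid of frequencies."""
    x = segment - segment.mean()
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) / len(x)
                  for k in range(order + 1)])
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # AR coeffs
    sigma2 = r[0] - np.dot(a, r[1:order + 1])        # innovation variance
    freqs = np.linspace(0, fs / 2, n_freqs)
    z = np.exp(-2j * np.pi * np.outer(freqs / fs, np.arange(1, order + 1)))
    return freqs, sigma2 / np.abs(1 - z @ a) ** 2

def band_power(freqs, psd, lo, hi):
    """Average spectral power inside one frequency band."""
    mask = (freqs >= lo) & (freqs <= hi)
    return psd[mask].mean()

# Example: mu (8-12 Hz) and beta (18-26 Hz) band powers for one channel,
# bands commonly associated with motor imagery (band edges assumed here).
rng = np.random.default_rng(0)
eeg_segment = rng.standard_normal(500)       # 2 s at fs = 250 Hz
freqs, psd = ar_power_spectrum(eeg_segment)
features = [band_power(freqs, psd, 8, 12), band_power(freqs, psd, 18, 26)]
```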

    Discriminative methods for classification of asynchronous imaginary motor tasks from EEG data

    In this work, two methods based on statistical models that take into account the temporal changes in the electroencephalographic (EEG) signal are proposed for asynchronous brain-computer interfaces (BCIs) based on imaginary motor tasks. Unlike the current approaches to asynchronous BCI systems that make use of windowed versions of the EEG data combined with static classifiers, the methods proposed here are based on discriminative models that allow sequential labeling of data. In particular, the two methods we propose for asynchronous BCI are based on conditional random fields (CRFs) and latent dynamic CRFs (LDCRFs), respectively. We describe how the asynchronous BCI problem can be posed as a classification problem based on CRFs or LDCRFs, by defining appropriate random variables and their relationships. CRFs allow modeling the extrinsic dynamics of the data, making it possible to model the transitions between classes, which in this context correspond to distinct tasks in an asynchronous BCI system. LDCRFs go beyond this by incorporating latent variables that model the intrinsic structure within each class while still capturing the extrinsic dynamics. We apply our proposed methods to the publicly available BCI competition III dataset V as well as a data set recorded in our laboratory. Results obtained are compared to the top algorithm in the BCI competition as well as to methods based on hierarchical hidden Markov models (HHMMs), hierarchical hidden CRFs (HHCRFs), neural networks based on particle swarm optimization (IPSONN), and a recently proposed approach based on neural networks and fuzzy theory, the S-dFasArt. Our experimental analysis demonstrates the improvements provided by our proposed methods in terms of classification accuracy.
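    To illustrate the sequential-labeling view that distinguishes these models from windowed static classifiers, here is a minimal NumPy sketch of Viterbi decoding for a trained linear-chain CRF: per-frame unary scores plus a transition matrix yield a globally best label sequence rather than independent per-window decisions. The toy scores and the "sticky" transition matrix are stand-ins, not learned parameters from the paper.

```python
import numpy as np

def viterbi_decode(unary, transition):
    """Most likely label sequence under a linear-chain CRF.

    unary:      (T, K) per-frame label scores
    transition: (K, K) score of moving from label i to label j
    Returns the arg-max label path of length T.
    """
    T, K = unary.shape
    delta = np.zeros((T, K))            # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = unary[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + transition + unary[t][None, :]
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # trace back the best predecessors
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy example: 3 mental-task classes; sticky transitions discourage
# rapid class switching, as expected in an asynchronous BCI.
rng = np.random.default_rng(0)
unary = rng.standard_normal((200, 3))
transition = np.full((3, 3), -2.0) + 4.0 * np.eye(3)
labels = viterbi_decode(unary, transition)
```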

    Discriminative Segmental Cascades for Feature-Rich Phone Recognition

    Discriminative segmental models, such as segmental conditional random fields (SCRFs) and segmental structured support vector machines (SSVMs), have had success in speech recognition via both lattice rescoring and first-pass decoding. However, such models suffer from slow decoding, hampering the use of computationally expensive features, such as segment neural networks or other high-order features. A typical solution is to use approximate decoding, either by beam pruning in a single pass or by beam pruning to generate a lattice followed by a second pass. In this work, we study discriminative segmental models trained with a hinge loss (i.e., segmental structured SVMs). We show that beam search is not suitable for learning rescoring models in this approach, though it gives good approximate decoding performance when the model is already well-trained. Instead, we consider an approach inspired by structured prediction cascades, which use max-marginal pruning to generate lattices. We obtain a high-accuracy phonetic recognition system with several expensive feature types: a segment neural network, a second-order language model, and second-order phone boundary features.
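    A rough NumPy sketch of the max-marginal pruning idea borrowed from structured prediction cascades: compute, for every candidate segment, the score of the best complete segmentation passing through it (its max-marginal), and keep only segments within a margin of the Viterbi score. The fixed margin and the (start, end, label, score) lattice representation are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def prune_by_max_marginals(segments, T, margin=5.0):
    """Prune a segment lattice by max-marginals.

    segments: list of (start, end, label, score), 0 <= start < end <= T.
    Keeps only segments whose max-marginal is within `margin` of the
    best full-segmentation (Viterbi) score.
    """
    starts, ends = defaultdict(list), defaultdict(list)
    for seg in segments:
        starts[seg[0]].append(seg)
        ends[seg[1]].append(seg)

    # alpha[t]: best score of any segmentation covering frames [0, t)
    alpha = np.full(T + 1, -np.inf)
    alpha[0] = 0.0
    for t in range(1, T + 1):
        for s, e, lab, sc in ends[t]:
            alpha[t] = max(alpha[t], alpha[s] + sc)

    # beta[t]: best score of any segmentation covering frames [t, T)
    beta = np.full(T + 1, -np.inf)
    beta[T] = 0.0
    for t in range(T - 1, -1, -1):
        for s, e, lab, sc in starts[t]:
            beta[t] = max(beta[t], sc + beta[e])

    best = alpha[T]
    # Max-marginal of a segment = alpha[start] + score + beta[end]
    return [(s, e, lab, sc) for s, e, lab, sc in segments
            if alpha[s] + sc + beta[e] >= best - margin]
```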

    Porting concepts from DNNs back to GMMs

    Deep neural networks (DNNs) have been shown to outperform Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from the DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
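    A small scikit-learn sketch of the "wide" part of this idea only: several GMM sub-models trained in parallel and combined by averaging per-frame log-likelihoods. The deep layering and parameter sharing from the paper are not reproduced here, and the random stand-in features, component counts, and combination rule are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.standard_normal((1000, 13))   # stand-in for MFCC frames of one state

# "Wide": parallel sub-models trained with different initializations
# (different feature streams would also work). Each sub-model stays a
# plain maximum-likelihood GMM, so techniques like speaker adaptation
# still apply unchanged.
sub_models = [GaussianMixture(n_components=8, random_state=k).fit(frames)
              for k in range(4)]

def combined_loglik(x):
    """Average per-frame log-likelihood over the parallel sub-models."""
    return np.mean([gm.score_samples(x) for gm in sub_models], axis=0)

scores = combined_loglik(frames[:10])
```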