21 research outputs found

    Fast vocabulary acquisition in an NMF-based self-learning vocal user interface

    Get PDF
    In command-and-control applications, a vocal user interface (VUI) is useful for hands-free control of various devices, especially for people with a physical disability. The spoken utterances are usually restricted to a predefined list of phrases or to a restricted grammar, and the acoustic models work well for normal speech. While some state-of-the-art methods allow for user adaptation of the predefined acoustic models and lexicons, we pursue a fully adaptive VUI by learning both vocabulary and acoustics directly from interaction examples. A learning curve usually has a steep rise in the beginning and an asymptotic ceiling at the end. To limit tutoring time and to guarantee good performance in the long run, the word learning rate of the VUI should be fast and the learning curve should level off at a high accuracy. To address these performance indicators, we propose a multi-level VUI architecture and investigate the effectiveness of alternative processing schemes. In the low-level layer, we explore the use of MIDA (Mutual Information Discrimination Analysis) features against conventional MFCC features. In the mid-level layer, we enhance the acoustic representation by means of phone posteriorgrams and clustering procedures. In the high-level layer, we use NMF (Non-negative Matrix Factorization), which has been demonstrated to be an effective approach for word learning. We evaluate and discuss the performance and the feasibility of our approach in a realistic experimental setting of the VUI-user learning context.
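
    As an illustration of the high-level layer, the sketch below shows grounded NMF word learning on synthetic data. It is a hypothetical example, not the authors' implementation: label indicators are stacked with non-negative acoustic features (random stand-ins for the posteriorgram-based representations used in the paper), so the learned dictionary links recurring acoustic patterns to vocabulary items. All shapes and names are illustrative.

    import numpy as np
    from scipy.optimize import nnls
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    n_utt, n_labels, n_acoustic, n_atoms = 200, 10, 300, 10

    # Synthetic stand-in for training data: one row per utterance.
    labels = rng.integers(0, n_labels, size=n_utt)
    Y = np.eye(n_labels)[labels]                           # label-indicator block
    patterns = rng.random((n_labels, n_acoustic))          # hidden acoustic pattern per word
    X = Y @ patterns + 0.05 * rng.random((n_utt, n_acoustic))  # non-negative acoustic block

    V = np.hstack([Y, X])                                  # joint (labels | acoustics) data matrix
    model = NMF(n_components=n_atoms, init="nndsvda", max_iter=500)
    H = model.fit_transform(V)                             # per-utterance activations
    D = model.components_                                  # learned dictionary
    D_lab, D_ac = D[:, :n_labels], D[:, n_labels:]         # label / acoustic parts of each atom

    # Decode a new utterance from its acoustic part only, then read label
    # scores through the label part of the dictionary.
    test = patterns[3] + 0.05 * rng.random(n_acoustic)
    act, _ = nnls(D_ac.T, test)
    print("predicted word:", int(np.argmax(D_lab.T @ act)))  # should typically print 3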

    CNN Architectures for Large-Scale Audio Classification

    Full text link
    Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
    Comment: Accepted for publication at ICASSP 2017. Changes: added definitions of mAP, AUC, and d-prime; updated mAP/AUC/d-prime numbers for Audio Set based on changes in the latest Audio Set revision; changed wording to fit the 4-page limit with the new additions.
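
    For orientation, the sketch below shows the general shape of such a model: a small VGG-style convolutional network over log-mel spectrogram patches with a sigmoid output for multi-label audio tagging (PyTorch). It is a hypothetical, much smaller stand-in, not one of the architectures evaluated in the paper; the class count and input size are only examples.

    import torch
    import torch.nn as nn

    class SmallAudioCNN(nn.Module):
        def __init__(self, n_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),             # global pooling -> fixed-size embedding
            )
            self.classifier = nn.Linear(128, n_classes)

        def forward(self, logmel):                   # logmel: (batch, 1, n_mels, n_frames)
            emb = self.features(logmel).flatten(1)   # (batch, 128) embedding
            return self.classifier(emb)              # logits; sigmoid for multi-label tagging

    model = SmallAudioCNN(n_classes=527)             # e.g. the 527 Audio Set classes
    logits = model(torch.randn(4, 1, 64, 96))        # 4 example log-mel patches
    probs = torch.sigmoid(logits)
    print(probs.shape)                               # torch.Size([4, 527])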

    Exemplar-based joint channel and noise compensation

    No full text
    In this paper two models for channel estimation in exemplar-based noise-robust speech recognition are proposed. Building on a compositional model that models noisy speech as a combination of noise and speech atoms, the first model iteratively estimates a filter to best compensate the mismatch with the observed noisy speech. The second model estimates separate filters for the noise and speech atoms. We show that both models enable noise-robust ASR even if the channel characteristics of the noisy speech do not match those of the exemplars in the dictionary. Moreover, the second model, which is able to estimate separate filters for speech and noise, is shown to be robust even in the presence of bandwidth-limited sources.
    Index Terms — Speech recognition, source separation, matrix factorization, noise robustness, channel compensation
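
    A rough sketch of the compositional idea behind the first model is given below. This is a hypothetical reconstruction with synthetic data and a simple least-squares alternating scheme, not the update rules from the paper: a noisy magnitude spectrogram is approximated as a channel-filtered combination of speech exemplars plus a combination of noise exemplars, alternating between activation estimation and per-frequency filter re-estimation.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    F, T, Ks, Kn = 40, 50, 20, 10             # frequency bins, frames, speech / noise atoms
    A_s = rng.random((F, Ks))                 # speech exemplar dictionary (assumed given)
    A_n = rng.random((F, Kn))                 # noise exemplar dictionary (assumed given)
    true_c = 0.5 + rng.random(F)              # unknown channel acting on the speech part
    Y = true_c[:, None] * (A_s @ rng.random((Ks, T))) + A_n @ rng.random((Kn, T))

    c = np.ones(F)                            # start from a flat channel estimate
    for _ in range(10):
        # 1) channel fixed: estimate non-negative activations frame by frame
        D = np.hstack([c[:, None] * A_s, A_n])
        H = np.column_stack([nnls(D, Y[:, t])[0] for t in range(T)])
        S = A_s @ H[:Ks]                      # uncompensated speech reconstruction
        N = A_n @ H[Ks:]                      # noise reconstruction
        # 2) activations fixed: re-estimate the channel per frequency bin
        c = np.clip(np.sum((Y - N) * S, axis=1) / (np.sum(S * S, axis=1) + 1e-12), 1e-3, None)

    rel_err = np.linalg.norm(Y - (c[:, None] * S + N)) / np.linalg.norm(Y)
    print("relative reconstruction error:", round(float(rel_err), 4))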

    Detecting irregular orbits in gravitational N-body simulations

    No full text