
    Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations

    Full text link
    Speech recognition in noisy and channel-distorted scenarios is often challenging because current acoustic modeling schemes are not adaptive to changes in the signal distribution in the presence of noise. In this work, we develop a novel acoustic modeling framework for noise-robust speech recognition based on a relevance weighting mechanism. The relevance weighting is achieved using a sub-network approach that performs feature selection. A relevance sub-network is applied to the output of the first layer of a convolutional network model operating on raw speech signals, while a second relevance sub-network is applied to the output of the second convolutional layer. The relevance weights for the first layer correspond to an acoustic filterbank selection, while the relevance weights in the second layer perform modulation filter selection. The model is trained for a speech recognition task on noisy and reverberant speech. Speech recognition experiments on multiple datasets (Aurora-4, CHiME-3, VOiCES) reveal that incorporating relevance weighting into the neural network architecture significantly improves speech recognition word error rates (average relative improvement of 10% over the baseline systems).
    Comment: arXiv admin note: text overlap with arXiv:2001.0706
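    As a rough illustration of the idea, the sketch below applies a small relevance sub-network to the output of a first convolutional layer operating on raw waveforms. The layer sizes, the mean-pooled summary statistics, and the softmax-based weighting are assumptions for illustration, not the authors' exact architecture.

    ```python
    import torch
    import torch.nn as nn

    class RelevanceWeightedConv(nn.Module):
        """First conv layer on raw waveforms followed by a relevance sub-network
        that soft-selects filters (a sketch; sizes and weighting are assumptions)."""
        def __init__(self, n_filters=80, kernel_size=129, hidden=64):
            super().__init__()
            self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=10)
            # relevance sub-network: maps per-filter summary statistics to a weight
            self.relevance = nn.Sequential(
                nn.Linear(n_filters, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_filters),
                nn.Softmax(dim=-1),
            )

        def forward(self, wav):                       # wav: (batch, 1, samples)
            feats = torch.relu(self.conv(wav))        # (batch, n_filters, frames)
            summary = feats.mean(dim=-1)              # per-filter average activation
            weights = self.relevance(summary)         # (batch, n_filters)
            return feats * weights.unsqueeze(-1)      # soft filterbank selection

    # usage: relevance-weighted first-layer features for a downstream acoustic model
    x = torch.randn(4, 1, 16000)                      # stand-in for 1 s of 16 kHz audio
    y = RelevanceWeightedConv()(x)
    ```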

    Optimization of data-driven filterbank for automatic speaker verification

    Get PDF
    Most speech processing applications use triangular filters spaced on the mel scale for feature extraction. In this paper, we propose a new data-driven filter design method which optimizes filter parameters from given speech data. First, we introduce a frame-selection-based approach for developing a speech-signal-based frequency warping scale. Then, we propose a new method for computing the filter frequency responses using principal component analysis (PCA). The main advantage of the proposed method over recently introduced deep-learning-based methods is that it requires a very limited amount of unlabeled speech data. We demonstrate that the proposed filterbank has more speaker-discriminative power than the commonly used mel filterbank as well as an existing data-driven filterbank. We conduct automatic speaker verification (ASV) experiments on different corpora using various classifier back-ends. We show that acoustic features created with the proposed filterbank outperform existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In experiments with VoxCeleb1 and the popular i-vector back-end, we observe a 9.75% relative improvement in equal error rate (EER) over MFCCs. Similarly, the relative improvement is 4.43% with the recently introduced x-vector system. We obtain further improvement using a fusion of the proposed method with the standard MFCC-based approach.
    Comment: Published in Digital Signal Processing (Elsevier)
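    A minimal sketch of the PCA step: deriving filter frequency responses from the principal components of selected frame power spectra, then applying the learned filterbank MFCC-style. The energy-based frame-selection rule, STFT settings, and the use of absolute component values are illustrative assumptions, not the paper's exact procedure.

    ```python
    import numpy as np
    from scipy.signal import stft
    from scipy.fft import dct
    from sklearn.decomposition import PCA

    def pca_filterbank(speech, sr=16000, n_filters=20, n_fft=512):
        """Derive a data-driven filterbank from high-energy frame spectra (sketch)."""
        _, _, Z = stft(speech, fs=sr, nperseg=n_fft // 2, nfft=n_fft)
        power = np.abs(Z.T) ** 2                        # (frames, bins)
        # frame selection: keep the higher-energy frames (assumed criterion)
        energy = power.sum(axis=1)
        selected = power[energy > np.median(energy)]
        # principal components of the selected log spectra act as filter responses
        pca = PCA(n_components=n_filters)
        pca.fit(np.log(selected + 1e-10))
        filters = np.abs(pca.components_)               # (n_filters, bins), non-negative
        return filters / filters.sum(axis=1, keepdims=True)

    def filterbank_cepstra(speech, sr=16000, n_ceps=13):
        """Apply the learned filterbank and take a DCT, MFCC-style (sketch)."""
        fb = pca_filterbank(speech, sr)
        _, _, Z = stft(speech, fs=sr, nperseg=256, nfft=512)
        fbe = np.log(fb @ (np.abs(Z) ** 2) + 1e-10)     # (n_filters, frames)
        return dct(fbe, axis=0, norm="ortho")[:n_ceps]
    ```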

    Acoustic model adaptation from raw waveforms with Sincnet

    Get PDF
    Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction and the potential to find better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling by restricting the filter functions, rather than learning every tap of each filter. We study the adaptation of the SincNet filter parameters from adults' to children's speech and show that the parameterisation of the SincNet layer is well suited to adaptation in practice: we can adapt efficiently with a very small number of parameters, producing error rates comparable to techniques using orders of magnitude more parameters.
    Comment: Accepted to IEEE ASRU 201
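    The sketch below shows the core SincNet idea of parameterising each convolutional filter by two learnable band edges, so adaptation can fine-tune only those edges while the rest of the network stays frozen. Initialisation values, kernel size, and normalisation are assumptions chosen for illustration, not the authors' exact configuration.

    ```python
    import torch
    import torch.nn as nn

    class SincConv(nn.Module):
        """Sinc-parameterised 1-D convolution: each filter is a band-pass defined
        by two learnable cutoff parameters (a sketch of the SincNet idea)."""
        def __init__(self, n_filters=40, kernel_size=101, sample_rate=16000):
            super().__init__()
            # learnable band edges, initialised on a simple linear spread (assumption)
            low = torch.linspace(30.0, sample_rate / 2 - 200.0, n_filters)
            self.low_hz = nn.Parameter(low)
            self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
            n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
            self.register_buffer("n", n / sample_rate)              # time axis (s)
            self.register_buffer("window", torch.hamming_window(kernel_size))

        def forward(self, x):                                        # x: (batch, 1, time)
            low = torch.abs(self.low_hz)
            high = low + torch.abs(self.band_hz)
            # band-pass impulse response = difference of two sinc low-passes
            def lp(fc):                                              # low-pass, cutoff fc (Hz)
                return 2 * fc.unsqueeze(1) * torch.sinc(2 * fc.unsqueeze(1) * self.n)
            filters = (lp(high) - lp(low)) * self.window
            filters = filters / filters.abs().sum(dim=1, keepdim=True)
            return nn.functional.conv1d(x, filters.unsqueeze(1))

    # adaptation: fine-tune only the band edges, freeze everything else
    model = nn.Sequential(SincConv(), nn.ReLU())     # stand-in for a full acoustic model
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith(("low_hz", "band_hz"))
    ```

    With this split, only 2 parameters per filter are updated during adaptation, which is what makes adapting with a very small parameter budget possible.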

    A Hierarchy-Based Acoustic Framework for Auditory Scene Analysis

    Get PDF
    The acoustic environment surrounding us is extremely dynamic and unstructured in nature. Humans exhibit a remarkable ability to navigate these complex acoustic environments and can parse a complex acoustic scene into its perceptually meaningful objects, a process referred to as "auditory scene analysis". Current neuro-computational strategies developed for auditory scene analysis tasks are primarily based on prior knowledge of the acoustic environment and hence fail to match human performance under realistic settings, i.e. when the acoustic environment is dynamic in nature and multiple competing auditory objects are present in the same scene. In this thesis, we explore hierarchy-based computational frameworks that not only address different auditory scene analysis paradigms but also explain the processes driving these paradigms from physiological, psychophysical and computational viewpoints.

    In the first part of the thesis, we explore computational strategies that can extract varying degrees of detail from a complex acoustic scene, with the aim of capturing non-trivial commonalities within a sound class as well as differences across sound classes. We specifically demonstrate that a rich feature space of spectro-temporal modulation representations, complemented with Markovian temporal dynamics, captures the fine and subtle changes in the spectral and temporal structure of sound events in a complex and dynamic acoustic environment. We further extend this computational model to incorporate a biologically plausible network capable of learning a rich hierarchy of localized spectro-temporal bases and their corresponding long-term temporal regularities from natural soundscapes in a data-driven fashion. We demonstrate that the unsupervised nature of the network yields physiologically and perceptually meaningful tuning functions that drive the organization of the acoustic scene into distinct auditory objects.

    Next, we explore computational models based on hierarchical acoustic representations in the context of bottom-up salient event detection. We demonstrate that a rich hierarchy of local and global cues captures the salient details upon which the bottom-up saliency mechanisms operate to make a "new" event pop out in a complex acoustic scene. We further show that top-down, event-specific knowledge gathered by a scene classification framework biases bottom-up computational resources towards events of "interest" rather than any new event. We further extend the top-down framework to the modeling of a broad and heterogeneous acoustic class. We demonstrate that when an acoustic scene comprises multiple events, modeling the global details in the hierarchy as a mixture of temporal trajectories helps to capture its semantic categorization and provides a detailed understanding of the scene.

    Overall, the results of this thesis improve our understanding of how a rich hierarchy of acoustic representations drives various auditory scene analysis paradigms and how to integrate multiple theories of scene analysis into a unified strategy, hence providing a platform for further development of computational scene analysis research.
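    As a small illustration of a spectro-temporal modulation representation, the sketch below computes a log spectrogram and takes a 2-D Fourier transform over local time-frequency patches, yielding temporal (rate) and spectral (scale) modulation axes. The patch size and STFT settings are assumptions for illustration, not the thesis's specific model.

    ```python
    import numpy as np
    from scipy.signal import spectrogram

    def modulation_features(x, sr=16000, patch_frames=32, patch_bins=32):
        """2-D modulation spectra over local patches of a log spectrogram (sketch)."""
        f, t, S = spectrogram(x, fs=sr, nperseg=400, noverlap=240)   # 25 ms / 10 ms hop
        logS = np.log(S + 1e-10)                                     # (bins, frames)
        feats = []
        for b in range(0, logS.shape[0] - patch_bins + 1, patch_bins):
            for fr in range(0, logS.shape[1] - patch_frames + 1, patch_frames):
                patch = logS[b:b + patch_bins, fr:fr + patch_frames]
                # 2-D FFT magnitude: spectral (scale) vs. temporal (rate) modulations
                feats.append(np.abs(np.fft.fft2(patch)))
        return np.stack(feats)           # (n_patches, patch_bins, patch_frames)

    # usage: per-patch modulation features, e.g. as observations for a Markov model
    feats = modulation_features(np.random.randn(16000))
    ```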

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
    Comment: 15 pages, 2 PDF figures
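    Since log-mel spectra are singled out as a dominant feature representation, the short sketch below computes them with torchaudio as a front end for any of the reviewed model families; the parameter values are typical defaults chosen here for illustration.

    ```python
    import torch
    import torchaudio

    # log-mel spectrogram front end (typical settings, chosen for illustration)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    waveform = torch.randn(1, 16000)          # stand-in for 1 s of 16 kHz audio
    log_mel = to_db(mel(waveform))            # (1, 80, frames), input to a CNN/RNN
    ```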