Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations
Speech recognition in noisy and channel-distorted scenarios is challenging because current acoustic modeling schemes are not adaptive to changes in the signal distribution in the presence of noise. In this work, we develop a novel acoustic modeling framework for noise-robust speech recognition based on a relevance weighting mechanism. The relevance weighting is achieved using a sub-network approach that performs feature selection. A relevance sub-network is applied to the output of the first layer of a convolutional network model operating on raw speech signals, while a second relevance sub-network is applied to the output of the second convolutional layer. The relevance weights for the first layer correspond to an acoustic filterbank selection, while the relevance weights in the second layer perform modulation filter selection. The model is trained for a speech recognition task on noisy and reverberant speech. Speech recognition experiments on multiple datasets (Aurora-4, CHiME-3, VOiCES) reveal that incorporating relevance weighting in the neural network architecture significantly improves speech recognition word error rates (average relative improvement of 10% over the baseline systems).
Comment: arXiv admin note: text overlap with arXiv:2001.0706
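The relevance weighting idea can be illustrated with a toy numpy sketch: a small scoring layer rates each channel of a convolutional layer's output, and a softmax turns the scores into per-channel weights that softly select features. All names, shapes, and the linear scoring layer are illustrative assumptions here, not the paper's exact sub-network architecture.

```python
import numpy as np

def relevance_weights(features, W, b):
    """Hypothetical relevance sub-network: score each filterbank channel
    from its mean activation, then softmax-normalise the scores into
    per-channel relevance weights."""
    # features: (channels, frames) output of a convolutional layer
    channel_stats = features.mean(axis=1)          # (channels,)
    scores = W @ channel_stats + b                 # linear scoring layer
    scores = scores - scores.max()                 # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights

def apply_relevance(features, weights):
    # Scale each channel by its relevance weight (soft feature selection).
    return features * weights[:, None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((40, 100))             # 40 filters, 100 frames
W = rng.standard_normal((40, 40)) * 0.1            # toy, untrained weights
b = np.zeros(40)
w = relevance_weights(feats, W, b)
gated = apply_relevance(feats, w)
```

In the paper this gating is learned jointly with the recognizer; the sketch only shows the mechanics of weighting one layer's channels.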
Optimization of data-driven filterbank for automatic speaker verification
Most speech processing applications use triangular filters spaced on the mel scale for feature extraction. In this paper, we propose a new data-driven filter design method that optimizes filter parameters from given speech data. First, we introduce a frame-selection-based approach for developing a speech-signal-based frequency warping scale. Then, we propose a new method for computing the filter frequency responses using principal component analysis (PCA). The main advantage of the proposed method over recently introduced deep learning based methods is that it requires only a very limited amount of unlabeled speech data. We demonstrate that the proposed filterbank has more speaker-discriminative power than the commonly used mel filterbank as well as an existing data-driven filterbank. We conduct automatic speaker verification (ASV) experiments on different corpora using various classifier back-ends. We show that the acoustic features created with the proposed filterbank outperform existing mel-frequency cepstral coefficients (MFCCs) and speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In experiments with VoxCeleb1 and the popular i-vector back-end, we observe a 9.75% relative improvement in equal error rate (EER) over MFCCs. Similarly, the relative improvement is 4.43% with the recently introduced x-vector system. We obtain further improvement by fusing the proposed method with the standard MFCC-based approach.
Comment: Published in Digital Signal Processing journal (Elsevier)
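A minimal numpy sketch of a PCA-derived filterbank: the leading principal components of frame-wise log spectra are taken as the filter frequency responses. This is an assumed simplification for illustration; the paper additionally applies a frame-selection step and a learned frequency warping scale, which are omitted here.

```python
import numpy as np

def pca_filterbank(log_spectra, num_filters):
    """Sketch: use the top principal components of centred log spectra
    as filter frequency responses (simplified, not the paper's exact
    pipeline)."""
    X = log_spectra - log_spectra.mean(axis=0)     # centre frames
    cov = X.T @ X / (X.shape[0] - 1)               # (bins, bins) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # strongest first
    return eigvecs[:, order[:num_filters]].T       # (filters, bins)

rng = np.random.default_rng(1)
frames = rng.standard_normal((500, 64))            # 500 frames, 64 freq bins
fb = pca_filterbank(frames, 20)
features = frames @ fb.T                           # filterbank outputs
```

Because only second-order statistics of unlabeled frames are needed, a filterbank like this can be estimated from very little data, which is the advantage the abstract highlights over deep-learning-based filter learning.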
Acoustic model adaptation from raw waveforms with SincNet
Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling by restricting the filter functions, rather than having to learn every tap of each filter. We study the adaptation of the SincNet filter parameters from adults' to children's speech, and show that the parameterisation of the SincNet layer is well suited for adaptation in practice: we can efficiently adapt with a very small number of parameters, producing error rates comparable to techniques using orders of magnitude more parameters.
Comment: Accepted to IEEE ASRU 201
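The parameter saving that SincNet exploits can be sketched in numpy: each kernel is an ideal band-pass filter defined entirely by two learnable cutoff frequencies (plus a fixed window), computed as the difference of two sinc low-pass responses. The window choice and normalisation below are illustrative assumptions, not SincNet's exact formulation.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size, sample_rate):
    """SincNet-style kernel: a band-pass filter parameterised only by
    its two cutoff frequencies, instead of learning every tap."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    # Difference of two scaled sinc low-pass responses gives a band-pass.
    def lowpass(fc):
        return 2 * fc * np.sinc(2 * fc * t)
    kernel = lowpass(f_high) - lowpass(f_low)
    kernel *= np.hamming(kernel_size)              # smooth the band edges
    return kernel / np.abs(kernel).max()           # simple normalisation

# Adapting such a layer means updating just two scalars per filter,
# e.g. shifting adult-trained cutoffs to suit children's speech.
k = sinc_bandpass(300.0, 3000.0, 251, 16000)
```

With two parameters per filter, adapting an entire SincNet layer touches orders of magnitude fewer values than adapting a free-form convolutional front end, which is the adaptation advantage the abstract reports.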
A Hierarchy Based Acoustic Framework for Auditory Scene Analysis
The acoustic environment surrounding us is extremely dynamic and unstructured in nature. Humans exhibit a great ability to navigate these complex acoustic environments, and can parse a complex acoustic scene into its perceptually meaningful objects, a process referred to as "auditory scene analysis". Current neuro-computational strategies developed for auditory scene analysis tasks are primarily based on prior knowledge of the acoustic environment and hence fail to match human performance under realistic settings, i.e., when the acoustic environment is dynamic and multiple competing auditory objects are present in the same scene. In this thesis, we explore hierarchy-based computational frameworks that not only solve different auditory scene analysis paradigms but also explain the processes driving these paradigms from physiological, psychophysical, and computational viewpoints.
In the first part of the thesis, we explore computational strategies that can extract varying degrees of detail from a complex acoustic scene, with the aim of capturing non-trivial commonalities within a sound class as well as differences across sound classes. We specifically demonstrate that a rich feature space of spectro-temporal modulation representations, complemented with Markovian temporal dynamics information, captures the fine and subtle changes in the spectral and temporal structure of sound events in a complex and dynamic acoustic environment. We further extend this computational model to incorporate a biologically plausible network capable of learning a rich hierarchy of localized spectro-temporal bases and their corresponding long-term temporal regularities from natural soundscapes in a data-driven fashion. We demonstrate that the unsupervised nature of the network yields physiologically and perceptually meaningful tuning functions that drive the organization of the acoustic scene into distinct auditory objects.
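One common way to obtain a spectro-temporal modulation representation, sketched here as an assumed stand-in for the thesis's richer feature space, is the 2-D Fourier magnitude of a log-spectrogram patch, whose axes correspond to temporal modulation rate and spectral scale.

```python
import numpy as np

def modulation_spectrum(log_spec):
    """Toy spectro-temporal modulation representation: the 2-D Fourier
    magnitude of a (freq bins, time frames) log-spectrogram patch."""
    patch = log_spec - log_spec.mean()             # remove the DC offset
    # fftshift centres zero modulation rate/scale in the output.
    return np.abs(np.fft.fftshift(np.fft.fft2(patch)))

rng = np.random.default_rng(2)
spec = rng.standard_normal((48, 64))               # (freq bins, time frames)
mod = modulation_spectrum(spec)
```

Slow versus fast temporal modulations and coarse versus fine spectral structure land in different regions of this plane, which is what lets such features separate sound classes with similar average spectra.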
Next, we explore computational models based on hierarchical acoustic representations in the context of bottom-up salient event detection. We demonstrate that a rich hierarchy of local and global cues captures the salient details upon which the bottom-up saliency mechanisms operate to make a "new" event pop out in a complex acoustic scene. We further show that top-down, event-specific knowledge gathered by a scene classification framework biases bottom-up computational resources towards events of "interest" rather than any new event. We then extend the top-down framework to model a broad and heterogeneous acoustic class. We demonstrate that when an acoustic scene comprises multiple events, modeling the global details in the hierarchy as a mixture of temporal trajectories helps capture its semantic categorization and provides a detailed understanding of the scene.
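A minimal sketch of the bottom-up pop-out intuition, under the simplifying assumption (mine, not the thesis's model) that an event is salient when it deviates strongly from a running estimate of recent scene statistics:

```python
import numpy as np

def bottom_up_saliency(features, alpha=0.95):
    """Toy bottom-up salient-event detector: saliency at time t is the
    deviation of the current frame from an exponentially decaying
    running mean of the preceding context."""
    mean = np.zeros(features.shape[1])
    saliency = np.zeros(features.shape[0])
    for t, frame in enumerate(features):
        saliency[t] = np.linalg.norm(frame - mean) # deviation from context
        mean = alpha * mean + (1 - alpha) * frame  # update running context
    return saliency

rng = np.random.default_rng(3)
scene = rng.standard_normal((200, 32)) * 0.1       # quiet background
scene[120] += 5.0                                  # inject a "new" event
sal = bottom_up_saliency(scene)
```

A top-down bias, in this picture, would amount to reweighting the feature dimensions so that deviations matching an event of "interest" score higher than arbitrary novelty.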
Overall, the results of this thesis improve our understanding of how a rich hierarchy of acoustic representations drives various auditory scene analysis paradigms, and of how to integrate multiple theories of scene analysis into a unified strategy, hence providing a platform for further development of computational scene analysis research.
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
Comment: 15 pages, 2 pdf figure