352 research outputs found

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Get PDF
    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that stills remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks

    Models for learning reverberant environments

    Get PDF
    Reverberation is present in all real life enclosures. From our workplaces to our homes and even in places designed as auditoria, such as concert halls and theatres. We have learned to understand speech in the presence of reverberation and also to use it for aesthetics in music. This thesis investigates novel ways enabling machines to learn the properties of reverberant acoustic environments. Training machines to classify rooms based on the effect of reverberation requires the use of data recorded in the room. The typical data for such measurements is the Acoustic Impulse Response (AIR) between the speaker and the receiver as a Finite Impulse Response (FIR) filter. Its representation however is high-dimensional and the measurements are small in number, which limits the design and performance of deep learning algorithms. Understanding properties of the rooms relies on the analysis of reflections that compose the AIRs and the decay and absorption of the sound energy in the room. This thesis proposes novel methods for representing the early reflections, which are strong and sparse in nature and depend on the position of the source and the receiver. The resulting representation significantly reduces the coefficients needed to represent the AIR and can be combined with a stochastic model from the literature to also represent the late reflections. The use of Finite Impulse Response (FIR) for the task of classifying rooms is investigated, which provides novel results in this field. The aforementioned issues related to AIRs are highlighted through the analysis. This leads to the proposal of a data augmentation method for the training of the classifiers based on Generative Adversarial Networks (GANs), which uses existing data to create artificial AIRs, as if they were measured in real rooms. The networks learn properties of the room in the space defined by the parameters of the low-dimensional representation that is proposed in this thesis.Open Acces

    Deep Learning for Distant Speech Recognition

    Full text link
    Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among the other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. The latter disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses the latter scenario and proposes some novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. We then investigate on approaches for better exploiting speech contexts, proposing some original methodologies for both feed-forward and recurrent neural networks. Lastly, inspired by the idea that cooperation across different DNNs could be the key for counteracting the harmful effects of noise and reverberation, we propose a novel deep learning paradigm called network of deep neural networks. The analysis of the original concepts were based on extensive experimental validations conducted on both real and simulated data, considering different corpora, microphone configurations, environments, noisy conditions, and ASR tasks.Comment: PhD Thesis Unitn, 201

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Infant Cry Signal Processing, Analysis, and Classification with Artificial Neural Networks

    Get PDF
    As a special type of speech and environmental sound, infant cry has been a growing research area covering infant cry reason classification, pathological infant cry identification, and infant cry detection in the past two decades. In this dissertation, we build a new dataset, explore new feature extraction methods, and propose novel classification approaches, to improve the infant cry classification accuracy and identify diseases by learning infant cry signals. We propose a method through generating weighted prosodic features combined with acoustic features for a deep learning model to improve the performance of asphyxiated infant cry identification. The combined feature matrix captures the diversity of variations within infant cries and the result outperforms all other related studies on asphyxiated baby crying classification. We propose a non-invasive fast method of using infant cry signals with convolutional neural network (CNN) based age classification to diagnose the abnormality of infant vocal tract development as early as 4-month age. Experiments discover the pattern and tendency of the vocal tract changes and predict the abnormality of infant vocal tract by classifying the cry signals into younger age category. We propose an approach of generating hybrid feature set and using prior knowledge in a multi-stage CNNs model for robust infant sound classification. The dominant and auxiliary features within the set are beneficial to enlarge the coverage as well as keeping a good resolution for modeling the diversity of variations within infant sound and the experimental results give encouraging improvements on two relative databases. We propose an approach of graph convolutional network (GCN) with transfer learning for robust infant cry reason classification. Non-fully connected graphs based on the similarities among the relevant nodes are built to consider the short-term and long-term effects of infant cry signals related to inner-class and inter-class messages. With as limited as 20% of labeled training data, our model outperforms that of the CNN model with 80% labeled training data in both supervised and semi-supervised settings. Lastly, we apply mel-spectrogram decomposition to infant cry classification and propose a fusion method to further improve the infant cry classification performance

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016)

    Get PDF
    corecore