858 research outputs found

    Sound Classification Using Convolutional Neural Network and Tensor Deep Stacking Network

    In every aspect of human life, sound plays an important role. From personal security to critical surveillance, sound is a key element in developing automated systems for these fields. Few systems are already on the market, and their efficiency remains a concern for deployment in real-life scenarios. The learning capabilities of deep learning architectures can be used to develop sound classification systems that overcome the efficiency issues of traditional systems. Our aim in this paper is to use deep learning networks to classify environmental sounds based on spectrograms generated from these sounds. We used the spectrogram images of environmental sounds to train a convolutional neural network (CNN) and a tensor deep stacking network (TDSN). We used two datasets for our experiment: ESC-10 and ESC-50. Both systems were trained on these datasets; the achieved accuracy was 77% and 49% for the CNN (on ESC-10 and ESC-50, respectively) and 56% for the TDSN trained on ESC-10. From this experiment, we conclude that the proposed approach, which classifies sounds from their spectrogram images, can be used to develop sound classification and recognition systems.
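The abstract above does not say how the spectrogram images were produced. As a rough illustration only, the following sketch computes a log-magnitude spectrogram with a short-time Fourier transform, using nothing beyond the Python standard library. The frame length, hop size, Hann window, and the synthetic 1 kHz test tone are all assumptions chosen for the example, not values from the paper; a real pipeline would normally use an FFT library instead of the naive DFT shown here.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return one row per frame; each row holds log-magnitudes (dB)
    for frequency bins 0 .. frame_len // 2."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Window the current frame to reduce spectral leakage.
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT for bin k (O(N^2); fine for a small illustration).
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(20 * math.log10(abs(acc) + eps))
        frames.append(row)
    return frames


# Synthetic input (assumed): a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

spec = log_spectrogram(tone)
# Bin spacing is sample_rate / frame_len = 125 Hz, so 1 kHz falls in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is one time frame, so stacking the rows gives the two-dimensional time-frequency image that a CNN could take as input.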

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 pdf figures
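The review above names log-mel spectra as a dominant feature representation. As a hedged sketch of the underlying frequency warping, the code below converts between hertz and mels using the widely used formula mel = 2595 * log10(1 + f / 700) and spaces filter centre frequencies evenly on the mel scale. The abstract does not state which mel variant the review has in mind (other definitions exist), and the function names and parameter values here are illustrative assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one common variant of the scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies (Hz) of n_filters bands spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Assumed example settings: 10 bands between 0 Hz and 8 kHz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

The gaps between successive centres grow with frequency, which is the point of the mel scale: finer resolution at low frequencies and coarser resolution at high ones. A log-mel spectrum then sums spectrogram energy within each band and takes the logarithm.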

    Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

    Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performance over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it to a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events. Comment: Accepted for IEEE Transactions on Audio, Speech and Language Processing, Special Issue on Sound Scene and Event Analysis

    Speech Mode Classification using the Fusion of CNNs and LSTM Networks

    Speech mode classification is an area that has not been as widely explored in the field of sound classification as others such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode is defined as the way or manner in which something occurs or is expressed or done, speech mode is defined as the style in which speech is delivered by a person. There are some reports on speech mode classification (for example, whispering versus talking with a normal phonetic sound) using conventional methods. However, to the best of our knowledge, deep learning-based methods have not been reported in the open literature for this classification scenario. Specifically, in this work we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the use of pre-trained deep neural networks, namely AlexNet, ResNet18 and SqueezeNet. We thus compare the classification efficiency of a set of deep learning (DL)-based classifiers, and we also assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are generated from the original audio signals and used as input to the networks. Next, we compare the accuracy of the DL-based classifiers to that of a set of machine learning (ML) classifiers that take Mel-Frequency Cepstral Coefficient (MFCC) features as input. Then, after determining the most efficient sampling rate for our classification problem (i.e. 32 kHz), we study the performance of our proposed method of combining CNNs with LSTM (Long Short-Term Memory) networks. For this purpose, we use the features extracted from the deep networks of the previous step. We conclude our study by evaluating the role of sampling rate on classification accuracy, generating two sets of 2D image representations: one with 32 kHz sampling and the other with 16 kHz sampling.
    Experimental results show that, after cross validation, the accuracy of DL-based approaches is 15% higher than that of ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32 kHz, whether we use transfer learning, feature-level fusion or score-level fusion (92.5%). Our proposed method using LSTMs further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.
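The abstract above mentions score-level fusion without describing the rule used. One common choice, shown here purely as an assumption, is to average the per-class scores produced by several networks and pick the class with the highest fused score. The class labels and all score values below are made up for illustration and are not results from the paper.

```python
def fuse_scores(score_lists, weights=None):
    """Weighted average of per-class scores from several classifiers.

    score_lists: one list of class scores per classifier, all the same length.
    weights: optional per-classifier weights; defaults to a plain average.
    """
    n_models = len(score_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(score_lists[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, score_lists))
        for c in range(n_classes)
    ]


def predict(fused):
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical softmax outputs for one utterance over three placeholder classes.
classes = ["mode_a", "mode_b", "mode_c"]
alexnet_scores = [0.6, 0.3, 0.1]
resnet18_scores = [0.2, 0.5, 0.3]
squeezenet_scores = [0.3, 0.6, 0.1]

fused = fuse_scores([alexnet_scores, resnet18_scores, squeezenet_scores])
label = classes[predict(fused)]
```

In this made-up case the first network alone would choose `mode_a`, but the fused scores favour `mode_b`. Feature-level fusion, by contrast, would combine the networks' intermediate feature vectors (for example by concatenation) before a single classifier is applied.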