    Environmental Sound Classification with Parallel Temporal-spectral Attention

    Convolutional neural networks (CNNs) are among the best-performing neural network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNNs to capture useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of sound events are not annotated. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we propose a novel parallel temporal-spectral attention mechanism for CNNs to learn discriminative sound representations, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands. Parallel branches are constructed so that temporal attention and spectral attention are applied separately, in order to mitigate interference from segments without the presence of sound events. Experiments on three ESC datasets and two acoustic scene classification (ASC) datasets show that our method improves classification performance and also exhibits robustness to noise. (Comment: submitted to INTERSPEECH 2020)
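
    As an illustration of the kind of block described above, the following minimal PyTorch sketch re-weights time frames and frequency bands in parallel branches. The pooling choices, the 1-D convolutions and the additive fusion of the two branches are assumptions for illustration; the paper's exact architecture may differ.

        import torch
        import torch.nn as nn

        class ParallelTemporalSpectralAttention(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # one attention weight per time frame / per frequency band
                self.temporal_conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)
                self.spectral_conv = nn.Conv1d(channels, 1, kernel_size=3, padding=1)
                self.sigmoid = nn.Sigmoid()

            def forward(self, x):                    # x: (batch, channels, time, freq)
                t_desc = x.mean(dim=3)               # pool over frequency -> (B, C, T)
                f_desc = x.mean(dim=2)               # pool over time      -> (B, C, F)
                t_att = self.sigmoid(self.temporal_conv(t_desc)).unsqueeze(3)  # (B, 1, T, 1)
                f_att = self.sigmoid(self.spectral_conv(f_desc)).unsqueeze(2)  # (B, 1, 1, F)
                # parallel branches: re-weight time frames and frequency bands, then fuse
                return x * t_att + x * f_att

        feature_map = torch.randn(4, 64, 128, 40)                  # an intermediate CNN feature map
        out = ParallelTemporalSpectralAttention(64)(feature_map)   # same shape as the input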

    Environment Sound Classification using Multiple Feature Channels and Attention based Deep Convolutional Neural Network

    In this paper, we propose a model for the environment sound classification (ESC) task that feeds multiple feature channels into a deep convolutional neural network (CNN) with an attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-Transform (CQT) and the Chromagram; such a combination of features has not been used before for signal or audio processing. We also employ a deeper CNN (DCNN) than previous models, consisting of spatially separable convolutions that operate on the time and feature dimensions separately, together with attention modules that perform channel and spatial attention jointly. Data augmentation techniques are used to further boost performance. Our model achieves state-of-the-art performance on all three benchmark environment sound classification datasets: UrbanSound8K (97.52%), ESC-10 (95.75%) and ESC-50 (88.50%). To the best of our knowledge, this is the first time a single environment sound classification model achieves state-of-the-art results on all three datasets. For the ESC-10 and ESC-50 datasets, the accuracy of the proposed model exceeds the human accuracy of 95.7% and 81.3%, respectively. (Comment: re-checking result)
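
    The sketch below shows one way such multiple feature channels could be assembled with librosa. GFCC is omitted because librosa has no gammatone front end, and the bin counts and hop length are illustrative assumptions rather than the paper's settings.

        import numpy as np
        import librosa

        def multi_channel_features(path, sr=22050, n_bins=60, hop=512):
            y, sr = librosa.load(path, sr=sr)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_bins, hop_length=hop)
            cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, hop_length=hop))
            chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=n_bins, hop_length=hop)
            # crop to a common number of frames and stack as channels: (3, n_bins, frames)
            frames = min(f.shape[1] for f in (mfcc, cqt, chroma))
            return np.stack([mfcc[:, :frames], cqt[:, :frames], chroma[:, :frames]])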

    A Comprehensive Survey of Automated Audio Captioning

    Automated audio captioning, a task that mimics human perception and innovatively links audio processing with natural language processing, has seen much progress over the last few years. Audio captioning requires recognizing the acoustic scene, the primary audio events, and sometimes the spatial and temporal relationships between events in an audio clip, and describing these elements in a fluent and vivid sentence. Deep learning-based approaches are widely adopted to tackle this problem. This paper presents a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.

    A Machine Learning-Based Approach for Audio Signals Classification using Chebychev Moments and Mel-Coefficients

    This paper proposes a machine learning-based architecture for audio signal classification based on the joint exploitation of Chebychev moments and Mel-Frequency Cepstrum Coefficients. The procedure starts with the computation of the Mel-spectrogram of the recorded audio signals; Chebychev moments are then obtained by projecting the Cadence Frequency Diagram derived from the Mel-spectrogram onto the Chebychev moment basis. These moments are concatenated with the Mel-Frequency Cepstrum Coefficients to form the final feature vector. By doing so, the architecture exploits the peculiarities of the discrete Chebychev moments, such as their symmetry properties. The effectiveness of the procedure is assessed on two challenging datasets, UrbanSound8K and ESC-50.
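
    A minimal sketch of this pipeline is given below, assuming the cadence frequency diagram is the magnitude DFT of the Mel-spectrogram along the time axis and using NumPy's classical Chebyshev polynomials as a stand-in for the paper's discrete Chebychev moments; the order and frame settings are illustrative.

        import numpy as np
        import librosa
        from numpy.polynomial import chebyshev as C

        def chebychev_mfcc_features(path, order=8, n_mfcc=20):
            y, sr = librosa.load(path, sr=None)
            mel = librosa.feature.melspectrogram(y=y, sr=sr)            # (n_mels, frames)
            cfd = np.abs(np.fft.rfft(mel, axis=1))                      # cadence frequency diagram (assumed form)
            # project each mel band of the CFD onto Chebyshev polynomials of degree < order
            x = np.linspace(-1.0, 1.0, cfd.shape[1])
            basis = np.stack([C.Chebyshev.basis(k)(x) for k in range(order)])  # (order, cadence bins)
            moments = (cfd @ basis.T).mean(axis=0)                      # (order,), averaged over mel bands
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
            return np.concatenate([moments, mfcc])                      # final feature vector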

    Visualization and categorization of ecological acoustic events based on discriminant features

    Although sound classification in soundscape studies is generally performed by experts, the rapid growth of acoustic data presents a major challenge for performing such a task. At the same time, identifying more discriminating features becomes crucial when analyzing soundscapes, because natural and anthropogenic sounds are very complex, particularly in Neotropical regions, where the level of biodiversity is very high. In this scenario, research addressing the discriminatory capability of acoustic features is of utmost importance for automating these processes. In this study we present a method to identify the most discriminant features for categorizing sound events in soundscapes; such identification is key to the classification of sound events. Our experimental findings validate the method, showing the high discriminatory capability of certain features extracted from the sound data, reaching an accuracy of 89.91% for the simultaneous classification of frogs, birds and insects. An extension of these experiments to binary classification reached accuracies of 82.64%, 100.0% and 99.40% for the combinations frogs-birds, frogs-insects and birds-insects, respectively.
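
    The general recipe of ranking acoustic features by discriminative power and then classifying frog, bird and insect events can be sketched with scikit-learn as below, here on a placeholder feature matrix; the paper's actual features, selection criterion and classifier are not specified in the abstract and may differ.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 40))                 # placeholder: events x acoustic features
        y = rng.integers(0, 3, size=300)               # 0 = frog, 1 = bird, 2 = insect

        # keep the k features with the highest ANOVA F-score, then classify
        model = make_pipeline(SelectKBest(f_classif, k=10),
                              RandomForestClassifier(n_estimators=200, random_state=0))
        print("cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())

        # inspect which features were judged most discriminant
        model.fit(X, y)
        print("selected feature indices:", model.named_steps["selectkbest"].get_support(indices=True))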

    Voice Spoofing Countermeasures: Taxonomy, State-of-the-art, experimental analysis of generalizability, open challenges, and the way forward

    Malicious actors may use different voice-spoofing attacks to fool automatic speaker verification (ASV) systems and even use them to spread misinformation. Various countermeasures have been proposed to detect these spoofing attacks. Given the extensive work on spoofing detection for ASV systems in the last 6-7 years, there is a need to classify the research and to perform qualitative and quantitative comparisons of state-of-the-art countermeasures. Additionally, no existing survey has reviewed integrated solutions for voice spoofing evaluation and speaker verification, adversarial/anti-forensics attacks on spoofing countermeasures and on ASV itself, or unified solutions that detect multiple attacks with a single model. Further, no work has provided an apples-to-apples comparison of published countermeasures in order to assess their generalizability by evaluating them across corpora. In this work, we review the literature on spoofing detection using hand-crafted features, deep learning, end-to-end, and universal spoofing countermeasure solutions to detect speech synthesis (SS), voice conversion (VC), and replay attacks. We also review integrated solutions for voice spoofing evaluation and speaker verification, as well as adversarial and anti-forensics attacks on voice countermeasures and on ASV. The limitations and challenges of the existing spoofing countermeasures are also presented. We report the performance of these countermeasures on several datasets and evaluate them across corpora. For the experiments, we employ the ASVspoof2019 and VSDC datasets along with GMM, SVM, CNN, and CNN-GRU classifiers. (For reproducibility of the results, the code of the test bed can be found in our GitHub repository.)
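
    As an example of the classifier families listed above, the following sketch outlines a GMM-based countermeasure baseline: one mixture model trained on bona fide speech, one on spoofed speech, with the log-likelihood ratio as the score. MFCCs stand in for the LFCC/CQCC front ends commonly paired with GMMs, and all sizes are illustrative assumptions.

        import numpy as np
        import librosa
        from sklearn.mixture import GaussianMixture

        def frame_features(path, n_mfcc=20):
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T    # (frames, n_mfcc)

        def train_gmms(bonafide_paths, spoof_paths, components=64):
            gmm_bona = GaussianMixture(components, covariance_type="diag").fit(
                np.vstack([frame_features(p) for p in bonafide_paths]))
            gmm_spoof = GaussianMixture(components, covariance_type="diag").fit(
                np.vstack([frame_features(p) for p in spoof_paths]))
            return gmm_bona, gmm_spoof

        def llr_score(path, gmm_bona, gmm_spoof):
            feats = frame_features(path)
            # average per-frame log-likelihood ratio; higher means more likely bona fide
            return gmm_bona.score(feats) - gmm_spoof.score(feats)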