
    Investigation into the Perceptually Informed Data for Environmental Sound Recognition

    Environmental sound is a rich source of information that can be used to infer context. With the rise of ubiquitous computing, the demand for environmental sound recognition is growing rapidly. This research aims to recognize environmental sounds using perceptually informed data. The initial study concentrates on understanding the current state-of-the-art techniques in environmental sound recognition, which are then evaluated through a critical review of the literature. The study extracts three sets of features: Mel-frequency cepstral coefficients, Mel-spectrograms, and sound texture statistics. Two kinds of machine learning algorithms are paired with appropriate sound features, and the resulting models are compared against a low-level baseline model as well as against human listeners. The sound texture statistics model performed best, achieving 45.1% classification accuracy with a support vector machine using a radial basis function kernel. A Mel-spectrogram model based on a convolutional neural network also produced satisfactory results, exceeding the benchmark.
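    The abstract names a feature-plus-classifier pipeline without spelling it out. The snippet below is a minimal sketch of that kind of pipeline, assuming librosa for MFCC and Mel-spectrogram extraction and scikit-learn for the RBF-kernel support vector machine; the file names, feature dimensions, and simple time-averaging are illustrative assumptions rather than the study's actual configuration.

```python
# Hedged sketch: clip-level MFCC + Mel-spectrogram features fed to an RBF-kernel SVM.
# Assumed tools (librosa, scikit-learn) and toy file names; not the study's exact setup.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path, sr=22050, n_mfcc=13, n_mels=64):
    """Return a fixed-length vector: time-averaged MFCCs and log-Mel bands."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, frames)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (n_mels, frames)
    # Average each coefficient/band over time to get one vector per clip.
    return np.concatenate([mfcc.mean(axis=1), log_mel.mean(axis=1)])

# Hypothetical training clips and labels; replace with a real environmental-sound dataset.
train_files = ["rain.wav", "street.wav", "birds.wav"]
train_labels = ["rain", "street", "birds"]
X_train = np.stack([extract_features(f) for f in train_files])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, train_labels)
print(clf.predict(X_train))
```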

    Where and When: Space-Time Attention for Audio-Visual Explanations

    Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the dynamics of a complex multi-modal model. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model is capable of predicting audio-visual video events while justifying its decision by localizing where the relevant visual cues appear and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets, comparing extensively with multiple recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate the clearly superior performance of our model over existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study that analyzes the explainability of our model through robustness analysis via perturbation tests and pointing games using human annotations.
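    The abstract describes attention over both space and time but gives no architectural detail. The sketch below is a generic audio-queried space-time attention module in PyTorch, written as an assumption about how such a mechanism could look; the feature dimensions, single-query design, and classifier head are illustrative and not the paper's actual network.

```python
# Hedged sketch: an audio-conditioned attention over space-time visual features,
# illustrating the general idea of localising "where" and "when"; dimensions and
# design choices are assumptions, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeAttention(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, att_dim=256, n_classes=28):
        super().__init__()
        self.q = nn.Linear(aud_dim, att_dim)   # audio embedding -> attention query
        self.k = nn.Linear(vis_dim, att_dim)   # visual features -> attention keys
        self.v = nn.Linear(vis_dim, att_dim)   # visual features -> attention values
        self.cls = nn.Linear(att_dim + aud_dim, n_classes)

    def forward(self, vis, aud):
        # vis: (B, T, S, vis_dim) space-time visual features (S = H*W spatial cells)
        # aud: (B, aud_dim) clip-level audio embedding
        B, T, S, _ = vis.shape
        q = self.q(aud).unsqueeze(1)                    # (B, 1, att_dim)
        k = self.k(vis).reshape(B, T * S, -1)           # (B, T*S, att_dim)
        v = self.v(vis).reshape(B, T * S, -1)
        att = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, 1, T*S)
        ctx = (att @ v).squeeze(1)                      # attended visual context
        logits = self.cls(torch.cat([ctx, aud], dim=-1))
        # The attention map, reshaped to (B, T, S), indicates when and where
        # the model found visual evidence for its prediction.
        return logits, att.reshape(B, T, S)

# Example: 8 frames of 7x7 spatial features plus a 128-d audio embedding.
model = SpaceTimeAttention()
logits, att_map = model(torch.randn(2, 8, 49, 512), torch.randn(2, 128))
```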

    Zero-Shot Audio Classification via Semantic Embeddings

    In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on the small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models, and their hybrid concatenations improve the results further. Comment: Submitted to Transactions on Audio, Speech and Language Processing.
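    The linear compatibility function mentioned in the abstract can be written as a bilinear form F(a, s) = aᵀWs between an acoustic embedding a and a semantic class embedding s. The sketch below illustrates only the scoring and zero-shot prediction step; the random projection matrix stands in for a learned one, and the 128-d (VGGish-style) and 300-d (Word2Vec/GloVe-style) dimensions are assumptions.

```python
# Hedged sketch of bilinear acoustic-semantic compatibility scoring for zero-shot
# prediction. W is random here; in practice it would be learned on seen classes.
import numpy as np

rng = np.random.default_rng(0)
d_acoustic, d_semantic, n_classes = 128, 300, 50

W = rng.normal(scale=0.01, size=(d_acoustic, d_semantic))    # stand-in for learned projection
class_embeddings = rng.normal(size=(n_classes, d_semantic))  # one semantic embedding per class

def compatibility(acoustic_emb, semantic_emb, W):
    """F(a, s) = a^T W s: how compatible an audio clip is with a sound class."""
    return acoustic_emb @ W @ semantic_emb

def zero_shot_predict(acoustic_emb, class_embeddings, W):
    # The predicted class is the one whose semantic embedding scores highest,
    # even if that class had no audio examples during training.
    scores = class_embeddings @ (W.T @ acoustic_emb)
    return int(np.argmax(scores))

audio_clip = rng.normal(size=d_acoustic)   # stand-in for a VGGish clip embedding
print(zero_shot_predict(audio_clip, class_embeddings, W))
```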

    DNN Transfer Learning based Non-linear Feature Extraction for Acoustic Event Classification

    Recent acoustic event classification research has focused on training suitable filters to represent acoustic events. However, due to the limited availability of target event databases and the linearity of conventional filters, there is still room for improving performance. By exploiting the non-linear modeling of deep neural networks (DNNs) and their ability to learn beyond pre-trained environments, this letter proposes a DNN-based feature extraction scheme for the classification of acoustic events. The effectiveness and robustness to noise of the proposed method are demonstrated using a database of indoor surveillance environments.
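    The letter's core idea is to reuse a DNN trained elsewhere as a non-linear feature extractor. The sketch below shows the generic pattern of stripping the source-task classifier head and taking hidden activations as features; the layer sizes and the stand-in "pre-trained" network are assumptions for illustration, not the letter's actual architecture.

```python
# Hedged sketch: DNN transfer-learning feature extraction. A network pre-trained
# on a source task (stand-in weights here) provides non-linear hidden features
# that a target-domain acoustic event classifier can then consume.
import torch
import torch.nn as nn

# Stand-in pre-trained DNN; in practice its weights come from a large source task.
pretrained = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),   # input: e.g. 64 log-mel bands per frame
    nn.Linear(256, 128), nn.ReLU(),  # hidden "bottleneck" layer used as features
    nn.Linear(128, 10),              # source-task classifier head (discarded below)
)

# Keep everything except the final classifier as a fixed feature extractor.
feature_extractor = nn.Sequential(*list(pretrained.children())[:-1]).eval()

with torch.no_grad():
    frames = torch.randn(100, 64)         # 100 frames of log-mel input
    features = feature_extractor(frames)  # (100, 128) non-linear DNN features

# 'features' would then be fed to a classifier trained on the target event database.
print(features.shape)
```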

    An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition

    Convolutional neural networks (CNNs) with log-mel audio representation and CNN-based end-to-end learning have both been used for environmental event sound recognition (ESC). However, log-mel features can be complemented by features learned from the raw audio waveform with an effective fusion method. In this paper, we first propose a novel stacked CNN model with multiple convolutional layers of decreasing filter sizes to improve the performance of CNN models with either log-mel feature input or raw waveform input. These two models are then combined using the Dempster–Shafer (DS) evidence theory to build the ensemble DS-CNN model for ESC. Our experiments on three public datasets show that our method achieves much higher performance in environmental sound recognition than other CNN models with the same types of input features. This is achieved by exploiting the complementarity of the model based on log-mel feature input and the model based on learning features directly from raw waveforms.
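    The Dempster–Shafer fusion step can be illustrated compactly if each CNN's softmax output is treated as a mass function over singleton class hypotheses, in which case Dempster's rule reduces to a normalized element-wise product. That simplification, and the toy probability vectors below, are assumptions; the paper's actual mass construction may differ.

```python
# Hedged sketch: Dempster-Shafer combination of two classifiers' outputs, assuming
# each softmax vector is a mass function over singleton class hypotheses only.
import numpy as np

def ds_combine(m1, m2):
    """Dempster's rule for masses restricted to singleton classes."""
    joint = m1 * m2                  # agreement mass for each class
    conflict = 1.0 - joint.sum()     # mass assigned to conflicting class pairs
    return joint / (1.0 - conflict)  # renormalise the agreeing mass

# Hypothetical outputs of the log-mel CNN and the raw-waveform CNN for one clip.
p_logmel = np.array([0.6, 0.3, 0.1])
p_raw    = np.array([0.5, 0.1, 0.4])

fused = ds_combine(p_logmel, p_raw)
print(fused, fused.argmax())   # fused belief favours the class both models support
```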