
    Investigation into the Perceptually Informed Data for Environmental Sound Recognition

    Environmental sound is a rich source of information that can be used to infer context. With the rise of ubiquitous computing, the demand for environmental sound recognition is growing rapidly. This research aims to recognize environmental sounds using perceptually informed data. The initial study concentrates on understanding the current state-of-the-art techniques in environmental sound recognition, which are then evaluated through a critical review of the literature. The study extracts three sets of features: Mel-frequency cepstral coefficients, Mel-spectrograms, and sound texture statistics. Two kinds of machine learning algorithms are paired with appropriate sound features, and the resulting models are compared against a low-level baseline model as well as against human listeners. The sound texture statistics model performed best, achieving 45.1% classification accuracy with a support vector machine using a radial basis function kernel. A Mel-spectrogram model based on a convolutional neural network also produced satisfactory results, exceeding the benchmark.
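    The abstract names a feature-plus-classifier pipeline without spelling it out. The snippet below is a minimal sketch of that kind of pipeline, assuming librosa for MFCC and Mel-spectrogram extraction and scikit-learn for the RBF-kernel support vector machine; the file names, feature dimensions, and simple time-averaging are illustrative assumptions rather than the study's actual configuration.

```python
# Hedged sketch: clip-level MFCC + Mel-spectrogram features fed to an RBF-kernel SVM.
# Assumed tools (librosa, scikit-learn) and toy file names; not the study's exact setup.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path, sr=22050, n_mfcc=13, n_mels=64):
    """Return a fixed-length vector: time-averaged MFCCs and log-Mel bands."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)          # (n_mfcc, frames)
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))  # (n_mels, frames)
    # Average each coefficient/band over time to get one vector per clip.
    return np.concatenate([mfcc.mean(axis=1), log_mel.mean(axis=1)])

# Hypothetical training clips and labels; replace with a real environmental-sound dataset.
train_files = ["rain.wav", "street.wav", "birds.wav"]
train_labels = ["rain", "street", "birds"]
X_train = np.stack([extract_features(f) for f in train_files])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, train_labels)
print(clf.predict(X_train))
```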

    Where and When: Space-Time Attention for Audio-Visual Explanations

    Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the dynamics of a complex multi-modal model. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model is capable of predicting audio-visual video events while justifying its decision by localizing where the relevant visual cues appear and when the predicted sounds occur in videos. We benchmark our model on three audio-visual video event datasets, comparing extensively with multiple recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate the clearly superior performance of our model over existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study that analyzes the explainability of our model through robustness analysis via perturbation tests and pointing games using human annotations.
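    The abstract describes attention over both space and time but gives no architectural detail. The sketch below is a generic audio-queried space-time attention module in PyTorch, written as an assumption about how such a mechanism could look; the feature dimensions, single-query design, and classifier head are illustrative and not the paper's actual network.

```python
# Hedged sketch: an audio-conditioned attention over space-time visual features,
# illustrating the general idea of localising "where" and "when"; dimensions and
# design choices are assumptions, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeAttention(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, att_dim=256, n_classes=28):
        super().__init__()
        self.q = nn.Linear(aud_dim, att_dim)   # audio embedding -> attention query
        self.k = nn.Linear(vis_dim, att_dim)   # visual features -> attention keys
        self.v = nn.Linear(vis_dim, att_dim)   # visual features -> attention values
        self.cls = nn.Linear(att_dim + aud_dim, n_classes)

    def forward(self, vis, aud):
        # vis: (B, T, S, vis_dim) space-time visual features (S = H*W spatial cells)
        # aud: (B, aud_dim) clip-level audio embedding
        B, T, S, _ = vis.shape
        q = self.q(aud).unsqueeze(1)                    # (B, 1, att_dim)
        k = self.k(vis).reshape(B, T * S, -1)           # (B, T*S, att_dim)
        v = self.v(vis).reshape(B, T * S, -1)
        att = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)  # (B, 1, T*S)
        ctx = (att @ v).squeeze(1)                      # attended visual context
        logits = self.cls(torch.cat([ctx, aud], dim=-1))
        # The attention map, reshaped to (B, T, S), indicates when and where
        # the model found visual evidence for its prediction.
        return logits, att.reshape(B, T, S)

# Example: 8 frames of 7x7 spatial features plus a 128-d audio embedding.
model = SpaceTimeAttention()
logits, att_map = model(torch.randn(2, 8, 49, 512), torch.randn(2, 128))
```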

    Zero-Shot Audio Classification via Semantic Embeddings

    In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound classes, i.e., acoustic embeddings and semantic embeddings. We use VGGish to extract deep acoustic embeddings from audio clips, and pre-trained language models (Word2Vec, GloVe, BERT) to generate either label embeddings from textual labels or sentence embeddings from sentence descriptions of sound classes. Audio classification is performed by a linear compatibility function that measures how compatible an acoustic embedding and a semantic embedding are. We evaluate the proposed method on the small balanced dataset ESC-50 and a large-scale unbalanced audio subset of AudioSet. The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training. Meanwhile, we demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning. Classification performance is improved by concatenating label/sentence embeddings generated with different language models, and their hybrid concatenations improve the results further. Comment: Submitted to Transactions on Audio, Speech and Language Processing.
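    The linear compatibility function mentioned in the abstract can be written as a bilinear form F(a, s) = aᵀWs between an acoustic embedding a and a semantic class embedding s. The sketch below illustrates only the scoring and zero-shot prediction step; the random projection matrix stands in for a learned one, and the 128-d (VGGish-style) and 300-d (Word2Vec/GloVe-style) dimensions are assumptions.

```python
# Hedged sketch of bilinear acoustic-semantic compatibility scoring for zero-shot
# prediction. W is random here; in practice it would be learned on seen classes.
import numpy as np

rng = np.random.default_rng(0)
d_acoustic, d_semantic, n_classes = 128, 300, 50

W = rng.normal(scale=0.01, size=(d_acoustic, d_semantic))    # stand-in for learned projection
class_embeddings = rng.normal(size=(n_classes, d_semantic))  # one semantic embedding per class

def compatibility(acoustic_emb, semantic_emb, W):
    """F(a, s) = a^T W s: how compatible an audio clip is with a sound class."""
    return acoustic_emb @ W @ semantic_emb

def zero_shot_predict(acoustic_emb, class_embeddings, W):
    # The predicted class is the one whose semantic embedding scores highest,
    # even if that class had no audio examples during training.
    scores = class_embeddings @ (W.T @ acoustic_emb)
    return int(np.argmax(scores))

audio_clip = rng.normal(size=d_acoustic)   # stand-in for a VGGish clip embedding
print(zero_shot_predict(audio_clip, class_embeddings, W))
```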

    DNN Transfer Learning based Non-linear Feature Extraction for Acoustic Event Classification

    Recent acoustic event classification research has focused on training suitable filters to represent acoustic events. However, due to the limited availability of target event databases and the linearity of conventional filters, there is still room for improving performance. By exploiting the non-linear modeling of deep neural networks (DNNs) and their ability to learn beyond pre-trained environments, this letter proposes a DNN-based feature extraction scheme for the classification of acoustic events. The effectiveness and robustness to noise of the proposed method are demonstrated using a database of indoor surveillance environments.
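    The letter's core idea is to reuse a DNN trained elsewhere as a non-linear feature extractor. The sketch below shows the generic pattern of stripping the source-task classifier head and taking hidden activations as features; the layer sizes and the stand-in "pre-trained" network are assumptions for illustration, not the letter's actual architecture.

```python
# Hedged sketch: DNN transfer-learning feature extraction. A network pre-trained
# on a source task (stand-in weights here) provides non-linear hidden features
# that a target-domain acoustic event classifier can then consume.
import torch
import torch.nn as nn

# Stand-in pre-trained DNN; in practice its weights come from a large source task.
pretrained = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),   # input: e.g. 64 log-mel bands per frame
    nn.Linear(256, 128), nn.ReLU(),  # hidden "bottleneck" layer used as features
    nn.Linear(128, 10),              # source-task classifier head (discarded below)
)

# Keep everything except the final classifier as a fixed feature extractor.
feature_extractor = nn.Sequential(*list(pretrained.children())[:-1]).eval()

with torch.no_grad():
    frames = torch.randn(100, 64)         # 100 frames of log-mel input
    features = feature_extractor(frames)  # (100, 128) non-linear DNN features

# 'features' would then be fed to a classifier trained on the target event database.
print(features.shape)
```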

    An Ensemble Stacked Convolutional Neural Network Model for Environmental Event Sound Recognition

    Convolutional neural networks (CNNs) with log-mel audio representation and CNN-based end-to-end learning have both been used for environmental event sound recognition (ESC). However, log-mel features can be complemented by features learned from the raw audio waveform with an effective fusion method. In this paper, we first propose a novel stacked CNN model with multiple convolutional layers of decreasing filter sizes to improve the performance of CNN models with either log-mel feature input or raw waveform input. These two models are then combined using the Dempster–Shafer (DS) evidence theory to build the ensemble DS-CNN model for ESC. Our experiments on three public datasets show that our method achieves much higher performance in environmental sound recognition than other CNN models with the same types of input features. This is achieved by exploiting the complementarity of the model based on log-mel feature input and the model based on learning features directly from raw waveforms.
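    The Dempster–Shafer fusion step can be illustrated compactly if each CNN's softmax output is treated as a mass function over singleton class hypotheses, in which case Dempster's rule reduces to a normalized element-wise product. That simplification, and the toy probability vectors below, are assumptions; the paper's actual mass construction may differ.

```python
# Hedged sketch: Dempster-Shafer combination of two classifiers' outputs, assuming
# each softmax vector is a mass function over singleton class hypotheses only.
import numpy as np

def ds_combine(m1, m2):
    """Dempster's rule for masses restricted to singleton classes."""
    joint = m1 * m2                  # agreement mass for each class
    conflict = 1.0 - joint.sum()     # mass assigned to conflicting class pairs
    return joint / (1.0 - conflict)  # renormalise the agreeing mass

# Hypothetical outputs of the log-mel CNN and the raw-waveform CNN for one clip.
p_logmel = np.array([0.6, 0.3, 0.1])
p_raw    = np.array([0.5, 0.1, 0.4])

fused = ds_combine(p_logmel, p_raw)
print(fused, fused.argmax())   # fused belief favours the class both models support
```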