21 research outputs found
Environmental Sound Classification with Parallel Temporal-spectral Attention
Convolutional neural networks (CNNs) are among the best-performing neural network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNNs to capture useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not annotated. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we propose a novel parallel temporal-spectral attention mechanism for CNNs to learn discriminative sound representations, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands. Parallel branches are constructed so that temporal attention and spectral attention can be applied separately, in order to mitigate interference from segments without the presence of sound events. Experiments on three ESC datasets and two acoustic scene classification (ASC) datasets show that our method improves classification performance and also exhibits robustness to noise.
Comment: submitted to INTERSPEECH202
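The parallel temporal and spectral attention branches can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' architecture: the learned score functions are replaced by simple averages over channels (an assumption for illustration), and the two re-weighted feature maps are merged by summation.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def parallel_temporal_spectral_attention(x):
    """x: (channels, time, freq) deep feature map."""
    # Temporal branch: one weight per time frame (channel/freq average
    # stands in for a learned scoring function).
    t = softmax(x.mean(axis=(0, 2)), axis=0)        # shape: (time,)
    # Spectral branch: one weight per frequency band.
    f = softmax(x.mean(axis=(0, 1)), axis=0)        # shape: (freq,)
    # Re-weight the features in parallel branches, then merge by summation.
    return x * t[None, :, None] + x * f[None, None, :]
```

Because each branch normalizes over only one axis, frames and bands without sound activity receive low weights independently, which is the interference-mitigation idea the abstract describes.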
Environment Sound Classification using Multiple Feature Channels and Attention based Deep Convolutional Neural Network
In this paper, we propose a model for the Environment Sound Classification (ESC) task that consists of multiple feature channels given as input to a Deep Convolutional Neural Network (DCNN) with an attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-Transform (CQT) and the Chromagram; to the best of our knowledge, this combination of features has not been used before for audio processing. In addition, we employ a deeper CNN than previous models, consisting of spatially separable convolutions that operate on the time and feature domains separately. Alongside these, we use attention modules that perform channel and spatial attention together, and we apply data augmentation techniques to further boost performance. Our model achieves state-of-the-art performance on all three benchmark environment sound classification datasets: UrbanSound8K (97.52%), ESC-10 (95.75%) and ESC-50 (88.50%). To the best of our knowledge, this is the first time a single environment sound classification model has achieved state-of-the-art results on all three datasets. For ESC-10 and ESC-50, the accuracy achieved by the proposed model exceeds human accuracy (95.7% and 81.3%, respectively).
Comment: Re-checking result
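The spatially separable convolutions mentioned above factor a full kt x kf kernel into a (kt x 1) pass over the time axis followed by a (1 x kf) pass over the feature axis, cutting the weight count from kt * kf to kt + kf. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def conv1d_along(x, kernel, axis):
    # 'valid' correlation of a 1-D kernel along one axis of a 2-D map
    return np.apply_along_axis(
        lambda v: np.correlate(v, kernel, mode="valid"), axis, x)

def separable_conv(x, k_time, k_freq):
    """Spatially separable conv: a (kt,1) pass over time (axis 0),
    then a (1,kf) pass over the feature axis (axis 1)."""
    return conv1d_along(conv1d_along(x, k_time, axis=0), k_freq, axis=1)
```

For a rank-1 (outer-product) kernel the result matches a full 2-D convolution; in general the factorization trades some expressiveness for fewer parameters.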
A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception and innovatively links audio processing and natural language processing, has seen much progress over the last few years. Audio captioning requires recognizing the acoustic scene, the primary audio events and, sometimes, the spatial and temporal relationships between events in an audio clip, and describing these elements in a fluent and vivid sentence. Deep learning-based approaches are widely adopted to tackle this problem. This paper provides a comprehensive review covering the benchmark datasets, existing deep learning techniques and evaluation metrics in automated audio captioning.
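Among the evaluation metrics such a review covers, BLEU and related caption metrics build on clipped n-gram precision between a generated caption and its references. A simplified unigram version, for illustration only (not a full BLEU implementation):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: each candidate word counts at most as
    often as it appears in the reference."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    clipped = sum(min(c, ref[w]) for w, c in Counter(cand).items())
    return clipped / max(len(cand), 1)
```

Full metrics extend this with higher-order n-grams, a brevity penalty, and multiple references.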
A Machine Learning-Based Approach for Audio Signals Classification using Chebychev Moments and Mel-Coefficients
This paper proposes a machine learning-based architecture for audio signal classification based on the joint exploitation of Chebychev moments and Mel-Frequency Cepstrum Coefficients. The procedure starts with the computation of the Mel-spectrogram of the recorded audio signals; Chebychev moments are then obtained by projecting the Cadence Frequency Diagram derived from the Mel-spectrogram onto the basis of Chebychev polynomials. These moments are concatenated with the Mel-Frequency Cepstrum Coefficients to form the final feature vector. By doing so, the architecture exploits the peculiarities of the discrete Chebychev moments, such as their symmetry characteristics. The effectiveness of the procedure is assessed on two challenging datasets, UrbanSound8K and ESC-50.
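The projection step can be sketched as follows. This sketch approximates the discrete Chebychev moments by a least-squares projection onto Chebyshev polynomials sampled on [-1, 1], and uses an averaged DFT magnitude of the band envelopes as a simplified stand-in for the Cadence Frequency Diagram; both simplifications are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def chebyshev_moments(profile, order):
    """Project a 1-D profile onto Chebyshev polynomials T_0..T_order."""
    x = np.linspace(-1.0, 1.0, profile.size)
    V = np.polynomial.chebyshev.chebvander(x, order)   # (N, order+1) basis
    coeffs, *_ = np.linalg.lstsq(V, profile, rcond=None)
    return coeffs

def feature_vector(mel_spectrogram, mfcc, order=8):
    """Concatenate Chebyshev-moment features with an MFCC vector."""
    # Simplified cadence profile: DFT magnitude of each band's temporal
    # envelope, averaged over the mel bands.
    cadence = np.abs(np.fft.rfft(mel_spectrogram, axis=1)).mean(axis=0)
    return np.concatenate([chebyshev_moments(cadence, order), mfcc])
```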
Visualization and categorization of ecological acoustic events based on discriminant features
Although sound classification in soundscape studies is generally performed by experts, the rapid growth of acoustic data presents a major challenge for performing such a task. At the same time, identifying more discriminative features becomes crucial when analyzing soundscapes, because natural and anthropogenic sounds are very complex, particularly in Neotropical regions, where the level of biodiversity is very high. In this scenario, research addressing the discriminatory capability of acoustic features is of utmost importance for automating these processes. In this study we present a method to identify the most discriminant features for categorizing sound events in soundscapes; such identification is key to the classification of sound events. Our experimental findings validate the method, showing the high discriminatory capability of certain features extracted from sound data, and reaching an accuracy of 89.91% for the simultaneous classification of frogs, birds and insects. An extension of these experiments to binary classification reached accuracies of 82.64%, 100.0% and 99.40% for frogs-birds, frogs-insects and birds-insects, respectively.
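One common way to score the discriminatory capability of individual acoustic features is the Fisher score, the ratio of between-class to within-class variance; the paper's exact criterion may differ, so treat this as an illustrative sketch:

```python
import numpy as np

def fisher_scores(X, y):
    """Rank features by between-class vs within-class variance.
    X: (samples, features), y: (samples,) class labels."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)   # higher = more discriminative
```

Features can then be ranked by score, and the top-ranked ones kept for classifying, e.g., frog, bird and insect events.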
Voice Spoofing Countermeasures: Taxonomy, State-of-the-art, experimental analysis of generalizability, open challenges, and the way forward
Malicious actors may seek to use different voice-spoofing attacks to fool automated speaker verification (ASV) systems and even use them for spreading misinformation. Various countermeasures have been proposed to detect these spoofing attacks. Due to the extensive work done on spoofing detection in ASV systems over the last 6-7 years, there is a need to classify the research and perform qualitative and quantitative comparisons of state-of-the-art countermeasures.
Additionally, no existing survey paper has reviewed integrated solutions to
voice spoofing evaluation and speaker verification, adversarial/anti-forensics
attacks on spoofing countermeasures, and ASV itself, or unified solutions to
detect multiple attacks using a single model. Further, no work has been done to
provide an apples-to-apples comparison of published countermeasures in order to
assess their generalizability by evaluating them across corpora. In this work,
we conduct a review of the literature on spoofing detection using hand-crafted
features, deep learning, end-to-end, and universal spoofing countermeasure
solutions to detect speech synthesis (SS), voice conversion (VC), and replay
attacks. Additionally, we also review integrated solutions to voice spoofing
evaluation and speaker verification, adversarial and anti-forensics attacks on
voice countermeasures, and ASV. The limitations and challenges of the existing
spoofing countermeasures are also presented. We report the performance of these
countermeasures on several datasets and evaluate them across corpora. For the
experiments, we employ the ASVspoof2019 and VSDC datasets along with GMM, SVM,
CNN, and CNN-GRU classifiers. (For reproducibility of the results, the code of the test bed can be found in our GitHub repository.)
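Countermeasures on ASVspoof-style benchmarks are typically compared by their equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch, assuming higher detection scores mean "more likely genuine":

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores):
    """EER: find the threshold where false-acceptance rate (spoofed audio
    accepted) equals false-rejection rate (genuine audio rejected)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))     # closest crossing point
    return (far[i] + frr[i]) / 2.0
```

Evaluating the same scoring function across corpora (e.g., training on one dataset and computing the EER on another) is the kind of cross-corpus generalizability check the survey performs.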