A Two-Stage Approach to Device-Robust Acoustic Scene Classification
To improve device robustness, a key requirement for a competitive data-driven acoustic scene classification (ASC) system, we propose a novel two-stage system based on fully convolutional neural networks (CNNs). Our two-stage system leverages an ad-hoc score combination of two CNN
classifiers: (i) the first CNN classifies acoustic inputs into one of three
broad classes, and (ii) the second CNN classifies the same inputs into one of
ten finer-grained classes. Three different CNN architectures are explored to
implement the two-stage classifiers, and a frequency sub-sampling scheme is
investigated. Moreover, novel data augmentation schemes for ASC are also
investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
Comment: Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task
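To make the score-combination idea concrete, here is a minimal Python sketch. The mapping from the ten fine-grained classes to the three broad classes and the fusion weight `alpha` are illustrative assumptions, not the authors' exact recipe:

```python
import numpy as np

# Hypothetical mapping from the 10 fine-grained scene classes to the
# 3 broad classes (e.g. indoor / outdoor / transportation); the exact
# grouping used by the paper is an assumption here.
FINE_TO_COARSE = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

def two_stage_scores(coarse_probs, fine_probs, alpha=0.5):
    """Fuse a 3-class and a 10-class classifier's softmax outputs.

    Each fine-class score is boosted by the probability assigned to its
    parent broad class; alpha is an assumed fusion weight.
    """
    parent = coarse_probs[FINE_TO_COARSE]       # broadcast coarse prob onto fine classes
    fused = (1 - alpha) * fine_probs + alpha * fine_probs * parent
    return fused / fused.sum()                  # renormalise to a distribution

coarse = np.array([0.7, 0.2, 0.1])              # e.g. softmax of the 3-class CNN
fine = np.full(10, 0.1)                         # e.g. softmax of the 10-class CNN
print(two_stage_scores(coarse, fine).round(3))  # classes under broad class 0 get boosted
```

Boosting each fine-class score by its parent coarse-class probability is one plausible way to combine the two stages; the paper's actual fusion may differ.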
Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet
We present a work on low-complexity acoustic scene classification (ASC) with
multiple devices, namely the subtask A of Task 1 of the DCASE2021 challenge.
This subtask focuses on classifying audio samples of multiple devices with a
low-complexity model, where two main difficulties need to be overcome. First,
the audio samples are recorded by different devices, so there is a device mismatch across recordings. We reduce the negative impact of this mismatch with several effective strategies, including data augmentation (e.g., mix-up, spectrum correction, pitch shift), a multi-patch network structure, and channel attention. Second, the model size
should be smaller than a threshold (e.g., 128 KB required by the DCASE2021
challenge). To meet this condition, we adopt a ResNet with both depthwise
separable convolution and channel attention as the backbone network, and
perform model compression. In summary, we propose a low-complexity ASC method
using data augmentation and a lightweight ResNet. Evaluated on the official
development and evaluation datasets, our method obtains classification accuracies of 71.6% and 66.7% and log-loss scores of 1.038 and 1.136, respectively. Our final model size is 110.3 KB, below the 128 KB maximum.
Comment: 5 pages, 5 figures, 4 tables. Accepted for publication in the 16th IEEE International Conference on Signal Processing (IEEE ICSP
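As a rough illustration of how a depthwise separable convolution with channel attention keeps the parameter count small, here is a hedged PyTorch sketch; the squeeze-and-excitation-style attention, channel widths, and reduction ratio are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (an assumed variant)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze over frequency/time
        return x * w[:, :, None, None]          # re-weight the channels

class DSConvBlock(nn.Module):
    """Depthwise separable convolution followed by channel attention."""
    def __init__(self, cin, cout):
        super().__init__()
        self.depthwise = nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False)
        self.pointwise = nn.Conv2d(cin, cout, 1, bias=False)
        self.bn = nn.BatchNorm2d(cout)
        self.attn = ChannelAttention(cout)

    def forward(self, x):
        return self.attn(torch.relu(self.bn(self.pointwise(self.depthwise(x)))))

block = DSConvBlock(32, 64)
out = block(torch.randn(1, 32, 64, 100))        # (batch, channels, freq, time)
n_params = sum(p.numel() for p in block.parameters())
print(out.shape, f"{n_params} params, about {n_params * 2 / 1024:.1f} KB at float16")
```

Splitting a 3x3 convolution into depthwise and pointwise parts is what keeps such a block far below a full convolution's parameter count, which is how a whole backbone can fit a budget like 128 KB.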
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
In this paper, we propose a domain adaptation framework that addresses the device mismatch issue in acoustic scene classification by leveraging neural label embedding (NLE) and relational teacher-student learning (RTSL). Our framework captures the structural relationships between acoustic scene classes, which are intrinsically device-independent. In the training stage, transferable knowledge is condensed
in NLE from the source domain. Next in the adaptation stage, a novel RTSL
strategy is adopted to learn adapted target models without using paired
source-target data often required in conventional teacher-student learning. The proposed framework is evaluated on the DCASE 2018 Task 1b dataset. Experimental results based on AlexNet-L deep classification models confirm the effectiveness of our proposed approach in mismatch situations. NLE-alone adaptation compares favourably with conventional device adaptation and teacher-student-based adaptation techniques. NLE with RTSL further improves the classification accuracy.
Comment: Accepted by Interspeech 202
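A minimal sketch of the relational idea behind RTSL: rather than matching per-sample teacher outputs (which would require paired source-target data), the student matches the teacher's pairwise relations within a batch. The Euclidean distance matrices and MSE loss below are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def relational_ts_loss(teacher_emb, student_emb):
    """Match pairwise relations between samples in a batch, rather than
    per-sample teacher outputs, so no paired source-target data is needed.
    """
    t_rel = torch.cdist(teacher_emb, teacher_emb)   # teacher pairwise distances
    s_rel = torch.cdist(student_emb, student_emb)   # student pairwise distances
    return F.mse_loss(s_rel, t_rel)

teacher = torch.randn(8, 128)   # e.g. source-domain (NLE-informed) embeddings
student = torch.randn(8, 128)   # target-device student embeddings
print(relational_ts_loss(teacher, student).item())
```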
An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
In this paper, we propose a sub-utterance unit selection framework to remove
acoustic segments in audio recordings that carry little information for
acoustic scene classification (ASC). Our approach is built upon a universal set
of acoustic segment units covering the overall acoustic scene space. First,
those units are modeled with acoustic segment models (ASMs) used to tokenize
acoustic scene utterances into sequences of acoustic segment units. Next,
paralleling the idea of stop words in information retrieval, stop ASMs are
automatically detected. Finally, acoustic segments associated with the stop
ASMs are blocked because of their low indexing power for retrieving most acoustic scenes. In contrast to building scene models with whole utterances, the ASM-removed sub-utterances, i.e., acoustic utterances without stop acoustic segments, are then used as inputs to the AlexNet-L back-end for final classification. On the DCASE 2018 dataset, scene classification accuracy increases from 68% with whole utterances to 72.1% with segment selection. This is a competitive accuracy achieved without any data augmentation or ensemble strategy. Moreover, our approach compares favourably to AlexNet-L with attention.
Comment: Accepted by Interspeech 202
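The stop-ASM detection parallels stop-word removal in text retrieval; here is a toy sketch under an assumed frequency threshold (the paper's actual selection criterion may differ):

```python
from collections import Counter

def find_stop_asms(tokenized_utts, top_frac=0.1):
    """Flag the acoustic segment units occurring in the most utterances as
    'stop' units, paralleling stop words; top_frac is an assumed threshold."""
    df = Counter(u for utt in tokenized_utts for u in set(utt))  # document frequency
    k = max(1, int(top_frac * len(df)))
    return {u for u, _ in df.most_common(k)}

def strip_stop_asms(utt, stop_asms):
    """Drop segments tied to stop ASMs before the CNN back-end."""
    return [u for u in utt if u not in stop_asms]

utts = [["asm1", "asm3", "asm7"], ["asm1", "asm5"], ["asm1", "asm2", "asm3"]]
stops = find_stop_asms(utts)
print(stops, [strip_stop_asms(u, stops) for u in utts])
```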
Continual Learning For On-Device Environmental Sound Classification
Continuously learning new classes without catastrophic forgetting is a
challenging problem for on-device environmental sound classification given the
restrictions on computation resources (e.g., model size, running memory). To
address this issue, we propose a simple and efficient continual learning
method. Our method selects historical data for training by measuring per-sample classification uncertainty. Specifically, we measure uncertainty by observing how a sample's classification probability fluctuates under parallel perturbations added to the classifier embedding. In this way, the computation cost can be significantly reduced compared with perturbing the raw data. Experimental results on the DCASE 2019 Task 1 and ESC-50 datasets show that our proposed method outperforms baseline continual learning methods in classification accuracy and computational efficiency, indicating that it can efficiently and incrementally learn new classes without catastrophic forgetting in on-device environmental sound classification.
Comment: The first two authors contributed equally, 5 pages, one figure, submitted to DCASE2022 Workshop
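A minimal sketch of the uncertainty measure as described: perturb the classifier embedding several times in parallel and score how much the class posterior fluctuates. The noise scale, number of perturbations, and variance statistic are assumptions:

```python
import torch
import torch.nn.functional as F

def embedding_uncertainty(classifier, emb, n_perturb=8, sigma=0.1):
    """Score one sample by how much its class posterior fluctuates when
    Gaussian noise is added to its embedding in parallel; sigma, n_perturb,
    and the variance statistic are assumptions."""
    noisy = emb.unsqueeze(0) + sigma * torch.randn(n_perturb, emb.numel())
    probs = F.softmax(classifier(noisy), dim=-1)    # one posterior per perturbation
    return probs.var(dim=0).sum().item()            # high variance = high uncertainty

clf = torch.nn.Linear(64, 10)                       # stand-in classifier head
emb = torch.randn(64)                               # stand-in classifier embedding
print(embedding_uncertainty(clf, emb))
# The highest-scoring historical samples would be kept for rehearsal.
```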
Machine Learning for Human Activity Detection in Smart Homes
Recognizing human activities in domestic environments from audio and active power consumption sensors is a challenging task: on the one hand, environmental sound signals are multi-source, heterogeneous, and time-varying; on the other hand, active power consumption varies significantly across electrical appliances of similar type.
Many systems have been proposed to process environmental sound signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings: 1) which features and which classifiers are most suitable in the presence of background noise? 2) what is the effect of signal duration on recognition accuracy? 3) how do the SNR and the distance between the microphone and the audio source affect recognition accuracy in an environment in which the system was not trained? We show that for systems that use traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D CNNs using mel-spectrogram energies and mel-spectrogram images as inputs, respectively, and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems and validated the performance of our algorithms on public datasets (the Google Brain/TensorFlow Speech Recognition Challenge and the 2017 Detection and Classification of Acoustic Scenes and Events Challenge).
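As a hedged illustration of the traditional-classifier pipeline just described, the sketch below concatenates GFCC and DWT feature vectors and trains a gradient boosting classifier; the extractors are placeholders (real GFCCs come from a gammatone filterbank and DWT coefficients from a wavelet decomposition, e.g. via pywt):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder extractors standing in for real GFCC and DWT front-ends.
def extract_gfcc(signal):
    return np.random.randn(20)

def extract_dwt(signal):
    return np.random.randn(16)

signals = [np.random.randn(16000) for _ in range(100)]   # dummy 1 s clips
X = np.stack([np.concatenate([extract_gfcc(s), extract_dwt(s)]) for s in signals])
y = np.random.randint(0, 5, size=100)                    # dummy kitchen-sound labels

clf = GradientBoostingClassifier().fit(X, y)             # combined feature vector in, class out
print(clf.score(X, y))                                   # training accuracy only
```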
For the problem of energy-based human activity recognition in a household environment, we apply machine learning techniques to infer the states of household appliances from their energy consumption data and use rule-based scenarios that exploit these states to detect human activity. Since most activities within a house involve the operation of an electrical appliance, this unimodal approach has a significant advantage: it uses only inexpensive smart plugs and smart meters for each appliance. This part of the thesis proposes the use of unobtrusive and easy-to-install tools (smart plugs) for data collection and a decision engine that combines energy signal classification using dominant classifiers (compared in advance via grid search) with a probabilistic measure of appliance usage. It also helps preserve the resident's privacy, since all activities are stored in a local database.
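A toy sketch of this unimodal energy-based pipeline: appliance state inference followed by a rule-based decision engine. The fixed threshold and the rules are illustrative assumptions; the thesis uses trained classifiers selected via grid search plus a probabilistic usage measure:

```python
def appliance_state(power_watts, on_threshold=10.0):
    """Infer on/off from a smart-plug reading; the fixed threshold is an
    assumption standing in for a trained classifier."""
    return "on" if power_watts > on_threshold else "off"

def detect_activities(states):
    """Toy rule-based decision engine mapping appliance states to activities."""
    rules = {("kettle", "on"): "making a hot drink",
             ("oven", "on"): "cooking",
             ("tv", "on"): "watching TV"}
    return [rules[(app, st)] for app, st in states.items() if (app, st) in rules]

readings = {"kettle": 1800.0, "oven": 2.0, "tv": 95.0}    # watts per smart plug
states = {app: appliance_state(p) for app, p in readings.items()}
print(states, detect_activities(states))
```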
DNNs have received great research interest in the field of computer vision. In this thesis, we adapted different architectures to the problem of human activity recognition. We analyze the quality of the extracted features, and more specifically how model architectures and parameters affect the ability of features automatically extracted by DNNs to separate activity classes in the final feature space. Additionally, the architectures we applied to our main problem were also applied to text classification, in which we treat the input text as an image and apply 2D CNNs to learn the local and global semantics of sentences from the variations in the visual patterns of words. This work serves as a first step toward creating a dialogue agent that would not require any natural-language preprocessing.
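A hedged sketch of the text-as-image idea: render a sentence as a crude 2D grid and feed it to a small 2D CNN. The rendering scheme and the tiny network are stand-ins; the summary above does not specify them:

```python
import torch
import torch.nn as nn

def text_to_image(sentence, height=16, width=32):
    """Render a sentence as a 2D grid of normalised character codes;
    a toy stand-in for the unspecified text-as-image rendering."""
    grid = torch.zeros(height, width)
    for i, ch in enumerate(sentence[: height * width]):
        grid[i // width, i % width] = ord(ch) / 255.0
    return grid

cnn = nn.Sequential(                                # tiny 2D CNN over the "image"
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))

img = text_to_image("turn on the kitchen lights").unsqueeze(0).unsqueeze(0)
print(cnn(img).shape)                               # logits for 2 assumed classes
```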
Finally, since in many domestic environments human speech is present alongside other environmental sounds, we developed a Convolutional Recurrent Neural Network to separate the sound sources and applied novel post-processing filters, yielding an end-to-end noise-robust system. Our algorithm ranked first in the Apollo-11 Fearless Steps Challenge.
Funding: Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 676157, project ACROSSIN
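For the source-separation component in the last paragraph above, here is a minimal sketch of the CRNN pattern (convolutional front-end, recurrent temporal modelling, per-bin mask output); all dimensions and the mask-based formulation are assumptions:

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Convolutional front-end, recurrent temporal modelling, and a per-bin
    mask output; dimensions and the mask formulation are assumptions."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.ReLU())
        self.gru = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, spec):                        # spec: (batch, time, mels)
        h = self.conv(spec.unsqueeze(1)).squeeze(1) # local spectro-temporal features
        h, _ = self.gru(h)                          # temporal context
        return torch.sigmoid(self.head(h))          # mask for the speech source

mask = TinyCRNN()(torch.randn(2, 100, 64))          # 2 clips, 100 frames, 64 mel bins
print(mask.shape)                                   # torch.Size([2, 100, 64])
```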
ORCA-SPOT: An Automatic Killer Whale Sound Detection Toolkit Using Deep Learning
Large bioacoustic archives of wild animals are an important source for identifying reappearing communication patterns, which can then be related to recurring behavioral patterns to advance the current understanding of intra-specific communication in non-human animals. A main challenge remains that most large-scale bioacoustic archives contain only a small percentage of animal vocalizations and a large amount of environmental noise, which makes it extremely difficult to manually retrieve sufficient vocalizations for further analysis; this is particularly problematic for species with advanced social systems and complex vocalizations. In this study, deep neural networks were trained on 11,509 killer whale (Orcinus orca) signals and 34,848 noise segments. The resulting toolkit, ORCA-SPOT, was tested on a large-scale bioacoustic repository, the Orchive, comprising roughly 19,000 hours of killer whale underwater recordings. An automated segmentation of the entire Orchive recordings (about 2.2 years of audio) took approximately 8 days. It achieved a time-based precision, or positive predictive value (PPV), of 93.2% and an area under the curve (AUC) of 0.9523. This approach enables an automated annotation procedure for large bioacoustic databases to extract killer whale sounds, which are essential for the subsequent identification of significant communication patterns. The code will be publicly available in October 2019 to support the application of deep learning to bioacoustic research. ORCA-SPOT can be adapted to other animal species.
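A minimal sketch of the kind of sliding-window segmentation such a detector enables on long recordings; the window length, hop, and threshold are assumptions, and the stand-in detector replaces ORCA-SPOT's trained network:

```python
import numpy as np

def segment_recording(audio, sr, detector, win_s=2.0, hop_s=1.0, thresh=0.5):
    """Slide a fixed window over a long recording and keep windows the
    detector scores above a threshold; the real pipeline may additionally
    smooth or merge adjacent detections."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    hits = []
    for start in range(0, len(audio) - win + 1, hop):
        if detector(audio[start:start + win]) > thresh:
            hits.append((start / sr, (start + win) / sr))  # (onset, offset) in s
    return hits

sr = 16000
audio = np.random.randn(sr * 10)                     # 10 s of dummy audio
detector = lambda x: float(np.abs(x).mean())         # stand-in for the trained CNN
print(segment_recording(audio, sr, detector))        # dummy noise fires every window
```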