5 research outputs found
Domestic Activities Classification from Audio Recordings Using Multi-scale Dilated Depthwise Separable Convolutional Network
Domestic activities classification (DAC) from audio recordings aims to
classify audio recordings into pre-defined categories of domestic
activities, an effective way to estimate the daily activities performed in
a home environment. In this paper, we propose a method for DAC from audio
recordings using a multi-scale dilated depthwise separable convolutional
network (DSCN). The DSCN is a lightweight neural network with a small
number of parameters and is thus suitable for deployment on portable
terminals with limited computing resources. To expand the receptive field
without increasing the number of parameters, dilated convolution is used in
the DSCN in place of normal convolution, further improving the DSCN's
performance. In addition, the embeddings
of various scales learned by the dilated DSCN are concatenated as a multi-scale
embedding for representing property differences among various classes of
domestic activities. Evaluated on a public dataset from Task 5 of the 2018
challenge on Detection and Classification of Acoustic Scenes and Events
(DCASE-2018), the results show that both dilated convolution and multi-scale
embedding contribute to the performance improvement of the proposed method, and
the proposed method outperforms methods based on state-of-the-art
lightweight networks in terms of classification accuracy.
Comment: 5 pages, 2 figures, 4 tables. Accepted for publication in IEEE
MMSP202
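The parameter savings of depthwise separable convolution, and the receptive-field gain from dilation, can be sketched with simple counting. The channel and kernel sizes below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical sketch: weight counts of a standard 2-D convolution versus a
# depthwise separable one, plus the kernel extent gained by dilation.

def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ds_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise (one k x k filter per input channel) + pointwise (1 x 1)."""
    return c_in * k * k + c_in * c_out

def receptive_field(k: int, dilation: int) -> int:
    """Effective kernel extent of a dilated convolution along one axis."""
    return dilation * (k - 1) + 1

standard = conv_params(64, 128, 3)      # 73,728 weights
separable = ds_conv_params(64, 128, 3)  # 8,768 weights, roughly 8x fewer
print(standard, separable, receptive_field(3, 2))  # dilation 2 -> 5x5 extent
```

This is why the same kernel size covers a wider context when dilated: the weights stay fixed while the taps spread apart.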
Relational Teacher Student Learning with Neural Label Embedding for Device Adaptation in Acoustic Scene Classification
In this paper, we propose a domain adaptation framework to address the device
mismatch issue in acoustic scene classification, leveraging neural label
embedding (NLE) and relational teacher-student learning (RTSL). Taking into
account the structural relationships between acoustic scene classes, our
proposed framework captures relationships that are intrinsically
device-independent. In the training stage, transferable knowledge is condensed
into the NLE from the source domain. In the adaptation stage, a novel RTSL
strategy is adopted to learn adapted target models without the paired
source-target data often required in conventional teacher-student learning. The
proposed framework is evaluated on the DCASE 2018 Task1b data set. Experimental
results based on AlexNet-L deep classification models confirm the effectiveness
of our proposed approach in mismatch situations. NLE-alone adaptation compares
favourably with conventional device adaptation and teacher-student based
adaptation techniques. NLE with RTSL further improves the classification
accuracy.
Comment: Accepted by Interspeech 202
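The appeal of learning from relational structure rather than paired samples can be sketched as follows. This is a generic relational-distillation loss in the spirit of RTSL, not the paper's exact formulation, and the embeddings are toy values:

```python
# Hypothetical sketch of a relational loss: instead of matching teacher and
# student outputs sample by sample (which needs paired data), it matches the
# pairwise distance structure of their embedding batches.
import math

def pairwise_dists(embs):
    """All pairwise Euclidean distances within one batch of embeddings."""
    n = len(embs)
    return [math.dist(embs[i], embs[j]) for i in range(n) for j in range(i + 1, n)]

def relational_loss(teacher_embs, student_embs):
    """Mean squared difference between the two distance structures."""
    t, s = pairwise_dists(teacher_embs), pairwise_dists(student_embs)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)

teacher = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
student = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # identical structure
print(relational_loss(teacher, student))  # 0.0
```

Because only within-batch distances are compared, the teacher batch and the student batch need not contain the same recordings.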
Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet
We present our work on low-complexity acoustic scene classification (ASC) with
multiple devices, namely subtask A of Task 1 of the DCASE2021 challenge.
This subtask focuses on classifying audio samples from multiple devices with a
low-complexity model, where two main difficulties need to be overcome. First,
the audio samples are recorded by different devices, so there is a mismatch
of recording devices across the audio samples. We reduce the negative impact
of this mismatch using several effective strategies, including data
augmentation (e.g., mix-up, spectrum correction, pitch shift), a multi-patch
network structure, and channel attention. Second, the model size
should be smaller than a threshold (e.g., 128 KB required by the DCASE2021
challenge). To meet this condition, we adopt a ResNet with both depthwise
separable convolution and channel attention as the backbone network, and
perform model compression. In summary, we propose a low-complexity ASC method
using data augmentation and a lightweight ResNet. Evaluated on the official
development and evaluation datasets, our method obtains classification accuracy
scores of 71.6% and 66.7%, respectively, and log-loss scores of 1.038
and 1.136, respectively. Our final model size is 110.3 KB, which is smaller
than the maximum of 128 KB.
Comment: 5 pages, 5 figures, 4 tables. Accepted for publication in the 16th
IEEE International Conference on Signal Processing (IEEE ICSP
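Of the augmentation strategies listed, mix-up is the simplest to sketch: two feature vectors and their one-hot labels are blended with a weight lambda. The fixed lambda below is an illustrative assumption; in practice it is usually drawn from a Beta distribution:

```python
# Hypothetical mix-up sketch: a convex combination of two training samples
# and of their one-hot labels, producing a soft-labeled virtual sample.

def mixup(x1, x2, y1, y2, lam=0.7):
    """Blend two feature vectors and their one-hot labels with weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

xa, ya = [1.0, 0.0, 2.0], [1, 0]   # sample from class 0
xb, yb = [0.0, 4.0, 0.0], [0, 1]   # sample from class 1
x, y = mixup(xa, xb, ya, yb, lam=0.5)
print(x, y)  # [0.5, 2.0, 1.0] [0.5, 0.5]
```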
An Acoustic Segment Model Based Segment Unit Selection Approach to Acoustic Scene Classification with Partial Utterances
In this paper, we propose a sub-utterance unit selection framework to remove
acoustic segments in audio recordings that carry little information for
acoustic scene classification (ASC). Our approach is built upon a universal set
of acoustic segment units covering the overall acoustic scene space. First,
those units are modeled with acoustic segment models (ASMs) used to tokenize
acoustic scene utterances into sequences of acoustic segment units. Next,
paralleling the idea of stop words in information retrieval, stop ASMs are
automatically detected. Finally, acoustic segments associated with the stop
ASMs are blocked because of their low indexing power in retrieving most
acoustic scenes. In contrast to building scene models with whole utterances,
the ASM-removed sub-utterances, i.e., acoustic utterances without stop acoustic
segments, are then used as inputs to the AlexNet-L back-end for final
classification. On the DCASE 2018 dataset, scene classification accuracy
increases from 68%, with whole utterances, to 72.1%, with segment selection.
This represents a competitive accuracy without any data augmentation or
ensemble strategy. Moreover, our approach compares favourably to AlexNet-L with
attention.
Comment: Accepted by Interspeech 202
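The stop-word analogy can be sketched as frequency-based filtering of tokenized unit sequences. The unit names and the document-frequency threshold below are illustrative assumptions, not the paper's ASM inventory or selection rule:

```python
# Hypothetical sketch of stop-unit removal: units that occur in nearly every
# tokenized sequence carry little discriminative power (like stop words in
# information retrieval) and are filtered out.
from collections import Counter

def stop_units(token_seqs, max_doc_frac=0.9):
    """Units appearing in more than max_doc_frac of sequences are 'stop' units."""
    doc_freq = Counter(u for seq in token_seqs for u in set(seq))
    n = len(token_seqs)
    return {u for u, c in doc_freq.items() if c / n > max_doc_frac}

def remove_stop(seq, stops):
    """Keep only the segments not associated with a stop unit."""
    return [u for u in seq if u not in stops]

scenes = [["a", "b", "x"], ["a", "c", "x"], ["a", "d", "x"], ["a", "e", "y"]]
stops = stop_units(scenes)  # 'a' appears in every sequence, so it is dropped
print([remove_stop(s, stops) for s in scenes])
```

The surviving sub-sequences would then play the role of the ASM-removed sub-utterances fed to the classifier back-end.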
Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network
Automatic estimation of domestic activities from audio can be used to solve
many problems, such as reducing the labor cost of caring for elderly people.
This study focuses on the problem of domestic activity clustering from audio:
grouping audio clips that belong to the same category of domestic activity
into one cluster in an unsupervised way. In this paper, we propose a method
of domestic activity
clustering using a depthwise separable convolutional autoencoder network. In
the proposed method, initial embeddings are learned by the depthwise separable
convolutional autoencoder, and a clustering-oriented loss is designed to
jointly optimize embedding refinement and cluster assignment. Different methods
are evaluated on a public dataset (a derivative of the SINS dataset) used in
the challenge on Detection and Classification of Acoustic Scenes and Events
(DCASE) in 2018. Our method obtains a normalized mutual information (NMI)
score of 54.46% and a clustering accuracy (CA) score of 63.64%, outperforming
state-of-the-art methods in terms of both NMI and CA. In addition, both the
computational complexity and the memory requirement of our method are lower
than those of previous deep-model-based methods. Code:
https://github.com/vinceasvp/domestic-activity-clustering-from-audio
Comment: 6 pages, 5 figures, 4 tables. Accepted by IEEE MMSP 202
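A clustering-oriented loss that jointly sharpens cluster assignments can be sketched in the DEC style; the Student's t soft assignment and the squared target distribution are a common choice and an assumption here, as are the toy embeddings and centres:

```python
# Hypothetical DEC-style sketch: soft-assign each embedding to cluster
# centres with a Student's t kernel, then build a sharpened target
# distribution that the network would be trained to match.
import math

def soft_assign(z, centres):
    """Student's t similarity (1 dof) of embedding z to each centre, normalized."""
    q = [1.0 / (1.0 + math.dist(z, c) ** 2) for c in centres]
    s = sum(q)
    return [v / s for v in q]

def target_dist(q_rows):
    """Sharpened target P: square q, normalize per cluster, then per sample."""
    f = [sum(row[j] for row in q_rows) for j in range(len(q_rows[0]))]
    p_rows = []
    for row in q_rows:
        p = [row[j] ** 2 / f[j] for j in range(len(row))]
        s = sum(p)
        p_rows.append([v / s for v in p])
    return p_rows

centres = [[0.0, 0.0], [5.0, 5.0]]
q = [soft_assign(z, centres) for z in [[0.1, 0.0], [4.9, 5.0]]]
p = target_dist(q)  # p is sharper (more confident) than q
```

Minimizing the divergence between q and p is what ties embedding refinement and cluster assignment together in one objective.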