Acoustic Scene Clustering Using Joint Optimization of Deep Embedding Learning and Clustering Iteration
Recent efforts have been made on acoustic scene classification in the audio
signal processing community. In contrast, few studies have been conducted on
acoustic scene clustering, which is a newly emerging problem. Acoustic scene
clustering aims at merging the audio recordings of the same class of acoustic
scene into a single cluster without using prior information or training
classifiers. In this study, we propose a method for acoustic scene clustering
that jointly optimizes the procedures of feature learning and clustering
iteration. In the proposed method, the learned feature is a deep embedding that
is extracted from a deep convolutional neural network (CNN), while the
clustering algorithm is the agglomerative hierarchical clustering (AHC). We
formulate a unified loss function for integrating and optimizing these two
procedures. Various features and methods are compared. The experimental results
demonstrate that the proposed method outperforms other unsupervised methods in
terms of the normalized mutual information and the clustering accuracy. In
addition, the deep embedding outperforms many state-of-the-art features.
Comment: 9 pages, 6 figures, 11 tables. Accepted for publication in IEEE TM
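The clustering side of the pipeline described above can be sketched with off-the-shelf tools: agglomerative hierarchical clustering (AHC) applied to fixed-length embeddings, scored with normalized mutual information (NMI). The synthetic Gaussian blobs below stand in for CNN-derived deep embeddings; in the paper the embeddings would additionally be refined jointly with the clustering loss.

```python
# Minimal sketch: AHC over embedding vectors, evaluated with NMI.
# The "embeddings" here are synthetic stand-ins, not CNN features.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n_classes, per_class, dim = 3, 50, 16

# One well-separated Gaussian blob per (hypothetical) acoustic-scene class.
centers = rng.normal(scale=5.0, size=(n_classes, dim))
X = np.vstack([centers[k] + rng.normal(size=(per_class, dim))
               for k in range(n_classes)])
labels = np.repeat(np.arange(n_classes), per_class)

# AHC merges the closest clusters bottom-up until n_clusters remain.
pred = AgglomerativeClustering(n_clusters=n_classes,
                               linkage="ward").fit_predict(X)

nmi = normalized_mutual_info_score(labels, pred)
print(f"NMI = {nmi:.3f}")  # close to 1.0 on well-separated blobs
```

On real embeddings the NMI reflects how well the learned feature space groups recordings of the same scene, which is exactly what the joint optimization targets.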
Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network
Automatic estimation of domestic activities from audio can be used to solve
many problems, such as reducing the labor cost of caring for elderly people.
This study focuses on solving the problem of domestic activity clustering from
audio. The target of domestic activity clustering is to cluster audio clips
which belong to the same category of domestic activity into one cluster in an
unsupervised way. In this paper, we propose a method of domestic activity
clustering using a depthwise separable convolutional autoencoder network. In
the proposed method, initial embeddings are learned by the depthwise separable
convolutional autoencoder, and a clustering-oriented loss is designed to
jointly optimize embedding refinement and cluster assignment. Different methods
are evaluated on a public dataset (a derivative of the SINS dataset) used in
the challenge on Detection and Classification of Acoustic Scenes and Events
(DCASE) in 2018. Our method obtains the normalized mutual information (NMI)
score of 54.46%, and the clustering accuracy (CA) score of 63.64%, and
outperforms state-of-the-art methods in terms of NMI and CA. In addition, both
the computational complexity and the memory requirement of our method are
lower than those of previous deep-model-based methods. Code:
https://github.com/vinceasvp/domestic-activity-clustering-from-audio
Comment: 6 pages, 5 figures, 4 tables. Accepted by IEEE MMSP 202
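A clustering-oriented loss for jointly refining embeddings and cluster assignments, as described above, is commonly built in the style of Deep Embedded Clustering: soft assignments from a Student's t-kernel to centroids, a sharpened target distribution, and a KL divergence between the two. The sketch below shows that scheme in numpy; the function and variable names are illustrative, not taken from the paper.

```python
# Sketch of a DEC-style clustering-oriented loss (illustrative names).
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    """q[i, j]: soft similarity of embedding z[i] to centroid mu[j]
    via a Student's t-kernel, normalized per row."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(1, keepdims=True)

def target_dist(q):
    """p[i, j]: sharpen q to emphasize high-confidence assignments."""
    w = q ** 2 / q.sum(0)
    return w / w.sum(1, keepdims=True)

def kl_loss(p, q):
    """KL(p || q), the quantity minimized w.r.t. embeddings and centroids."""
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(1)
z = rng.normal(size=(8, 4))   # embeddings from an autoencoder (stand-in)
mu = rng.normal(size=(3, 4))  # cluster centroids
q = soft_assign(z, mu)
p = target_dist(q)
loss = kl_loss(p, q)
print(round(loss, 4))
```

Minimizing this loss pulls embeddings toward their most confident centroids, which is what "jointly optimize embedding refinement and cluster assignment" amounts to.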
Data-Driven Representation Learning in Multimodal Feature Fusion
Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction.
We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described that supports both multiple sensors and multiple descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision and related fields. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm that uses deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems.
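The core MKL idea referenced above is that the fused kernel is a convex combination of per-modality base kernels, and the combination weights are what the algorithm learns. The sketch below fixes the weights for illustration; the modality names and feature sizes are assumptions, not the dissertation's setup.

```python
# Sketch of kernel-level fusion: a convex combination of per-modality
# RBF kernels. MKL would optimize the weights w; here they are fixed.
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X_audio = rng.normal(size=(6, 5))  # modality 1 features (illustrative)
X_video = rng.normal(size=(6, 8))  # modality 2 features (illustrative)

K1 = rbf_kernel(X_audio, gamma=0.1)
K2 = rbf_kernel(X_video, gamma=0.05)

w = np.array([0.7, 0.3])       # nonnegative kernel weights summing to 1
K = w[0] * K1 + w[1] * K2      # fused kernel: still symmetric and PSD
print(np.allclose(K, K.T))     # True
```

Because a nonnegative combination of positive semidefinite kernels is itself positive semidefinite, the fused kernel can be dropped into any standard kernel machine (e.g. an SVM).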
In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently a special design of the learning architecture is needed. To improve temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time-series analysis model is proposed for several critical problems in healthcare. Another model, coupled with a triplet ranking loss as a metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, to perform community detection on multilayer graphs, a fusion algorithm is described that derives node embeddings from word embedding techniques and exploits the complementary relational information contained in each layer of the graph.
Doctoral Dissertation, Electrical Engineering, 201
Domestic Activities Classification from Audio Recordings Using Multi-scale Dilated Depthwise Separable Convolutional Network
Domestic activities classification (DAC) from audio recordings aims at
classifying audio recordings into pre-defined categories of domestic
activities, which is an effective way to estimate the daily activities
performed in a home environment. In this paper, we propose a method for DAC from
audio recordings using a multi-scale dilated depthwise separable convolutional
network (DSCN). The DSCN is a lightweight neural network with a small number
of parameters, and is thus suitable for deployment on portable terminals with
limited computing resources. To expand the receptive field without increasing
the number of parameters, dilated convolution is used in place of normal
convolution to further improve the DSCN's performance. In addition, the embeddings
of various scales learned by the dilated DSCN are concatenated as a multi-scale
embedding for representing property differences among various classes of
domestic activities. The proposed method is evaluated on a public dataset from
Task 5 of the 2018 challenge on Detection and Classification of Acoustic
Scenes and Events (DCASE-2018). The results show that both dilated convolution
and multi-scale embedding contribute to the performance improvement of the
proposed method, and that it outperforms methods based on state-of-the-art
lightweight networks in terms of classification accuracy.
Comment: 5 pages, 2 figures, 4 tables. Accepted for publication in IEEE MMSP 202
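The two efficiency claims above (depthwise separable convolutions shrink the parameter count; dilation grows the receptive field at no parameter cost) can be checked with back-of-the-envelope arithmetic. The layer sizes below are illustrative, not taken from the paper.

```python
# Parameter counts for standard vs. depthwise separable convolutions,
# and the effective receptive field of a dilated kernel.

def standard_conv_params(c_in, c_out, k):
    """A standard k x k convolution mixing c_in -> c_out channels."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise (one k x k filter per input channel) + pointwise (1 x 1)."""
    return c_in * k * k + c_in * c_out

def receptive_field(k, dilation):
    """Effective kernel extent of a single dilated convolution layer."""
    return dilation * (k - 1) + 1

c_in, c_out, k = 64, 128, 3  # illustrative layer sizes
print(standard_conv_params(c_in, c_out, k))        # 73728
print(depthwise_separable_params(c_in, c_out, k))  # 8768
print(receptive_field(3, 1), receptive_field(3, 2))  # 3 5
```

For this layer the separable form uses roughly 8x fewer parameters, and doubling the dilation widens the receptive field from 3 to 5 with the identical parameter count, which is the trade-off the DSCN exploits.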
Single Channel auditory source separation with neural network
Although distinguishing different sounds in a noisy environment is a relatively easy task for humans, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound types, the abundance of mixing conditions, and the unclear mechanism by which sources are distinguished, especially for similar sounds.
In recent years, neural network based methods have achieved impressive successes on various problems, including speech enhancement, where the task is to separate the clean speech out of the noisy mixture. However, current deep learning based source separators do not perform well on real recorded noisy speech and, more importantly, are not applicable in more general source separation scenarios such as overlapped speech.
In this thesis, we first propose extensions to the current mask-learning network for the problem of speech enhancement, to fix the scale-mismatch problem that commonly occurs in real recorded audio. We solve this problem by adding two restoration layers to the existing mask-learning network. We also propose a residual learning architecture for speech enhancement, further improving the network's generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields a 25.13% relative WER reduction on real data and 34.03% on simulated data.
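The mask-learning setup underlying this chapter can be illustrated compactly: a network predicts a time-frequency mask that is applied to the mixture magnitude spectrogram to recover the clean-speech estimate. Below, an ideal ratio mask on synthetic spectrograms stands in for the network output; the shapes and additivity assumption are illustrative only.

```python
# Sketch of time-frequency masking for speech enhancement.
# An ideal ratio mask (oracle) stands in for a learned network output.
import numpy as np

rng = np.random.default_rng(5)
speech = rng.uniform(0.0, 1.0, size=(64, 100))  # |S|: clean magnitude (toy)
noise = rng.uniform(0.0, 0.5, size=(64, 100))   # |N|: noise magnitude (toy)
mixture = speech + noise                        # crude additivity assumption

# Ideal ratio mask: per-bin fraction of energy belonging to speech, in [0, 1].
irm = speech / (speech + noise + 1e-8)
estimate = irm * mixture                        # masked mixture ~ clean speech

err_masked = np.abs(estimate - speech).mean()
err_mixture = np.abs(mixture - speech).mean()
print(err_masked < err_mixture)  # True: masking recovers the target
```

A learned mask can only approximate this oracle; the scale-mismatch problem mentioned above arises when the mixture's absolute level at test time differs from training, which the restoration layers are designed to compensate for.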
Then we propose a novel neural network based model called "deep clustering" for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker and three-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of over 10 dB.
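The affinity-based objective described above can be sketched directly: embeddings V are trained so that VV^T approximates the ideal affinity YY^T built from binary source-assignment labels Y. Expanding the Frobenius norm lets the loss be computed from small Gram matrices, without materializing the (bins x bins) affinity matrices; the sizes below are illustrative.

```python
# Sketch of the deep clustering loss ||VV^T - YY^T||_F^2, computed
# efficiently via Gram matrices (V^T V, V^T Y, Y^T Y).
import numpy as np

def dc_loss(V, Y):
    """||V V^T - Y Y^T||_F^2 without forming the n x n affinity matrices."""
    return (np.linalg.norm(V.T @ V) ** 2
            - 2.0 * np.linalg.norm(V.T @ Y) ** 2
            + np.linalg.norm(Y.T @ Y) ** 2)

rng = np.random.default_rng(3)
n, d, s = 100, 20, 2                   # T-F bins, embedding dim, sources
Y = np.eye(s)[rng.integers(0, s, n)]   # one-hot ideal assignments
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings

print(dc_loss(V, Y) >= 0.0)  # True: it is a squared Frobenius norm
```

The Gram-matrix expansion is what makes training tractable: the cost scales with n·d² and n·d·s rather than n², which matters when n is the number of time-frequency bins in an utterance.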
We then propose an extension of deep clustering named the "deep attractor" network, which allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are first created from the acoustic signals by finding the centroids of the sources in the embedding space; the attractors pull together the time-frequency bins corresponding to each source, and are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We show that this framework achieves even better results.
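The attractor mechanism just described reduces to two small operations: computing per-source centroids of the embedded time-frequency bins, then turning embedding-to-attractor similarities into soft masks. The shapes and names below are illustrative, not the thesis's implementation.

```python
# Sketch of the deep attractor computation: centroids in embedding space,
# then softmax similarity of each T-F bin's embedding to each attractor.
import numpy as np

def attractors(V, Y):
    """Per-source centroids: column j averages the embeddings of the
    bins assigned to source j. Returns a d x s matrix."""
    return (V.T @ Y) / Y.sum(0)

def masks(V, A):
    """Soft masks from the softmax similarity of each bin to each attractor."""
    logits = V @ A                               # n x s similarity scores
    e = np.exp(logits - logits.max(1, keepdims=True))
    return e / e.sum(1, keepdims=True)

rng = np.random.default_rng(4)
n, d, s = 50, 10, 2                    # T-F bins, embedding dim, sources
Y = np.eye(s)[rng.integers(0, s, n)]   # ideal binary assignments (train time)
V = rng.normal(size=(n, d))            # embeddings from the network (stand-in)
A = attractors(V, Y)
M = masks(V, A)
print(M.shape, np.allclose(M.sum(1), 1.0))  # (50, 2) True
```

At training time the masks feed a reconstruction loss on each source, so gradients flow end to end through the attractor computation back into the embeddings.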
Lastly, we introduce two applications of the proposed models: singing-voice separation and a smart hearing-aid device. For the former, a multi-task architecture is proposed that combines deep clustering with a classification-based network, achieving a new state-of-the-art separation result: the signal-to-noise ratio is improved by 11.1 dB on music and 7.9 dB on singing voice. For the smart hearing-aid device, we combine neural decoding with the separation network. The system first decodes the user's attention, which is then used to guide the separator toward the target source. Both objective and subjective studies show that the proposed system can accurately decode attention and significantly improve the user experience.