An empirical study of Conv-TasNet
Conv-TasNet is a recently proposed waveform-based deep neural network that
achieves state-of-the-art performance in speech source separation. Its
architecture consists of a learnable encoder/decoder and a separator that
operates on top of this learned space. Various improvements have been proposed
to Conv-TasNet. However, they mostly focus on the separator, leaving its
encoder/decoder as a (shallow) linear operator. In this paper, we conduct an
empirical study of Conv-TasNet and propose an enhancement to the
encoder/decoder that is based on a (deep) non-linear variant of it. In
addition, we experiment with the larger and more diverse LibriTTS dataset and
investigate the generalization capabilities of the studied models when trained
on this much larger dataset. We propose a cross-dataset evaluation that includes
assessing separations on the WSJ0-2mix, LibriTTS and VCTK databases. Our
results show that enhancements to the encoder/decoder can improve average
SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the
generalization capabilities of Conv-TasNet and the potential value of
improvements to the encoder/decoder.
Comment: In proceedings of ICASSP202
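The SI-SNR (scale-invariant signal-to-noise ratio) improvements reported above refer to a standard separation metric. A minimal NumPy sketch of that metric (an illustration for context, not the authors' implementation; the function name is ours) is:

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a target signal."""
    # Zero-mean both signals so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target; rescaling the estimate
    # does not change the result, hence "scale-invariant".
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

Because of the projection step, a perfectly separated source scores very high regardless of its gain, while residual interference lowers the score.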
COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations
Audio representation learning based on deep neural networks (DNNs) emerged as
an alternative approach to hand-crafted features. For achieving high
performance, DNNs often need a large amount of annotated data which can be
difficult and costly to obtain. In this paper, we propose a method for learning
audio representations by aligning the learned latent representations of audio
and its associated tags. Alignment is done by maximizing the agreement between
the latent representations of audio and tags using a contrastive loss. The result is an
audio embedding model which reflects acoustic and semantic characteristics of
sounds. We evaluate the quality of our embedding model, measuring its
performance as a feature extractor on three different tasks (namely, sound
event recognition, and music genre and musical instrument classification), and
investigate what type of characteristics the model captures. Our results are
promising, sometimes on par with the state of the art in the considered tasks,
and the embeddings produced with our method correlate well with some
acoustic descriptors.
Comment: 8 pages, 1 figure, workshop on Self-supervision in Audio and Speech
at the 37th International Conference on Machine Learning (ICML), 2020,
Vienna, Austria
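The alignment objective described above, maximizing agreement between matched audio and tag embeddings with a contrastive loss, can be sketched as follows. This is a generic InfoNCE-style formulation, not the paper's exact implementation; the function name and temperature value are illustrative:

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, tag_emb, temperature=0.1):
    """Contrastive loss over a batch: row i of audio_emb matches row i of tag_emb."""
    # L2-normalize so the dot product becomes cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    # Pairwise similarities; the diagonal holds the matched audio-tag pairs.
    logits = a @ t.T / temperature
    # Cross-entropy that pushes each matched pair above the mismatched ones.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

The loss is near zero when matched audio/tag pairs are far more similar than mismatched ones, and grows as the two latent spaces drift out of alignment.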
CnnSound: Convolutional Neural Networks for the Classification of Environmental Sounds
The classification of environmental sounds (ESC) has been studied increasingly in recent years. The main reason is that environmental sounds are part of our daily life, and associating them with the environment we live in matters in several respects: ESC is used in areas such as managing smart cities, determining location from environmental sounds, surveillance systems, machine hearing, and environment monitoring. ESC is, however, more difficult than other sound classification tasks, because environmental recordings contain many sources of background noise, which makes the sounds harder to model and classify. The main aim of this study is therefore to develop a more robust convolutional neural network (CNN) architecture. For this purpose, 150 different CNN-based models were designed by varying the number of layers and the values of their tuning parameters. To test the accuracy of the models, the UrbanSound8K environmental sound database was used. The sounds in this dataset were first converted into an image format of 32x32x3. The proposed CNN model yielded an accuracy of up to 82.5%, higher than its classical counterparts. Given that little fine-tuning was performed, the obtained accuracy compares favorably with other studies on UrbanSound8K when both accuracy and computational complexity are considered. The results also suggest that further improvement is possible, given the low complexity of the proposed CNN architecture and its applicability in real-world settings.
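A single convolution-plus-ReLU layer of the kind stacked in such CNN models, applied to the 32x32x3 image inputs described above, might be sketched as follows (a plain NumPy illustration of the operation, not the study's actual architecture or hyperparameters):

```python
import numpy as np

def conv2d_relu(x, kernels, stride=1):
    """Valid 2-D convolution followed by ReLU.

    x: input of shape (H, W, C_in), e.g. a 32x32x3 sound "image".
    kernels: filter bank of shape (K, K, C_in, C_out).
    """
    h, w, _ = x.shape
    k = kernels.shape[0]
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w, kernels.shape[-1]))
    for i in range(out_h):
        for j in range(out_w):
            # Each output pixel is the dot product of a KxKxC_in patch
            # with every filter in the bank.
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)  # ReLU non-linearity
```

Varying the number of such layers and their tuning parameters (kernel size, stride, filter count) is precisely the search space the 150 candidate models explore.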