Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design?
When convolutional neural networks are used to tackle learning problems based on music or, more generally, time-series data, the raw one-dimensional data are commonly pre-processed into spectrogram or mel-spectrogram coefficients, which are then used as input to the actual neural network. In this contribution, we investigate, both theoretically and experimentally, the influence of this pre-processing step on the network's performance, and ask whether replacing it with adaptive or learned filters applied directly to the raw data can improve learning success. The theoretical results show that approximately reproducing mel-spectrogram coefficients by applying adaptive filters and subsequent time-averaging is in principle possible. We also conducted extensive experimental work on the task of singing voice detection in music. The results of these experiments show that, for classification based on convolutional neural networks, features obtained from adaptive filter banks followed by time-averaging outperform the canonical Fourier-transform-based mel-spectrogram coefficients. Alternative adaptive approaches with center frequencies or time-averaging lengths learned from training data perform equally well.
Comment: Completely revised version; 21 pages, 4 figures
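As a concrete illustration of the adaptive-filter front-end this abstract describes, here is a minimal PyTorch sketch: a bank of learnable 1-D filters applied to raw audio, followed by squared magnitude and time-averaging, standing in for mel-spectrogram coefficients. The filter count, kernel size and pooling window are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

class AdaptiveFilterbankFrontEnd(nn.Module):
    def __init__(self, n_filters=80, kernel_size=512, hop=256):
        super().__init__()
        # One learnable FIR filter per output "band".
        self.filters = nn.Conv1d(1, n_filters, kernel_size,
                                 stride=1, padding=kernel_size // 2, bias=False)
        # Time-averaging over a window plays the role of the spectrogram frame.
        self.pool = nn.AvgPool1d(kernel_size=hop * 2, stride=hop)

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1))   # (batch, n_filters, samples)
        x = x.pow(2)                         # band energy, analogous to |STFT|^2
        x = self.pool(x)                     # subsampled, time-averaged energies
        return torch.log(x + 1e-6)           # log compression, as for log-mel

frontend = AdaptiveFilterbankFrontEnd()
feats = frontend(torch.randn(4, 16000))     # e.g. 1 s of 16 kHz audio
print(feats.shape)                           # (4, 80, n_frames)
```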
Utilizing Domain Knowledge in End-to-End Audio Processing
End-to-end neural network approaches to audio modelling are generally outperformed by models trained on high-level data representations. In this paper we present preliminary work showing that the first layers of a deep convolutional neural network (CNN) model can be trained to learn the commonly used log-scaled mel-spectrogram transformation. We then demonstrate that, upon initializing the first layers of an end-to-end CNN classifier with the learned transformation, convergence and performance on the ESC-50 environmental sound classification dataset are similar to those of a CNN-based
model trained on the highly pre-processed log-scaled mel-spectrogram features.
Comment: Accepted at the ML4Audio workshop at NIPS 2017
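A hedged sketch of the pre-training idea above: fit the first layer of a CNN so its output matches log-scaled mel-spectrogram targets computed by a standard DSP pipeline, after which those weights could initialize a classifier. Shapes, layer choices and the log-energy proxy are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import librosa
import numpy as np

sr, n_mels, n_fft, hop = 22050, 64, 1024, 512
wav = librosa.tone(440.0, sr=sr, duration=1.0).astype(np.float32)

# Target: the "highly pre-processed" representation the abstract refers to.
target = librosa.power_to_db(
    librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                   hop_length=hop, n_mels=n_mels))

# Learnable stand-in for the first layers of the end-to-end model.
frontend = nn.Conv1d(1, n_mels, n_fft, stride=hop,
                     padding=n_fft // 2, bias=False)
opt = torch.optim.Adam(frontend.parameters(), lr=1e-3)
x = torch.from_numpy(wav)[None, None, :]
y = torch.from_numpy(target)[None, :, :]

for step in range(100):                        # brief fitting loop
    pred = torch.log1p(frontend(x).pow(2))     # crude log-energy proxy
    T = min(pred.shape[-1], y.shape[-1])       # align frame counts
    loss = nn.functional.mse_loss(pred[..., :T], y[..., :T])
    opt.zero_grad()
    loss.backward()
    opt.step()
```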
A sinusoidal signal reconstruction method for the inversion of the mel-spectrogram
The synthesis of sound via deep learning methods has recently received much attention. Some problems for deep learning approaches to sound synthesis relate to the amount of data needed to specify an audio signal and the necessity of preserving both the long- and short-time coherence of the synthesised signal. Visual time-frequency representations such as the log-mel-spectrogram have gained in popularity. The log-mel-spectrogram is a perceptually informed representation of audio that greatly compresses the amount of information required to describe the sound. However, because of this compression, the representation is not directly invertible. Both signal processing and machine learning techniques have previously been applied to the inversion of the log-mel-spectrogram, but both caused audible distortions in the synthesised sounds due to issues of temporal and spectral coherence. In this paper, we outline the application of a sinusoidal model to the 'inversion' of the log-mel-spectrogram for pitched musical instrument sounds, outperforming state-of-the-art deep learning methods. The approach could later be used as a general decoding step from spectral to time representations in neural applications.
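A very reduced sketch of the spectral-to-time decoding step: per frame, pick the strongest mel band, treat its center frequency as a sinusoid frequency, and synthesize with that band's amplitude. The actual paper fits a proper multi-partial sinusoidal model; this single-partial version only illustrates the mechanics, and all parameter choices here are assumptions.

```python
import numpy as np
import librosa

sr, n_mels, hop = 22050, 128, 512
wav = librosa.tone(440.0, sr=sr, duration=1.0)
logmel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=wav, sr=sr, hop_length=hop,
                                   n_mels=n_mels))

centers = librosa.mel_frequencies(n_mels=n_mels, fmax=sr / 2)  # Hz per band
amp = librosa.db_to_power(logmel) ** 0.5                       # magnitudes

out = np.zeros(logmel.shape[1] * hop)
phase = 0.0
for t in range(logmel.shape[1]):
    k = int(np.argmax(amp[:, t]))            # dominant band in this frame
    f = centers[k]
    n = np.arange(hop)
    out[t * hop:(t + 1) * hop] = amp[k, t] * np.sin(
        phase + 2 * np.pi * f * n / sr)
    phase += 2 * np.pi * f * hop / sr        # keep phase continuous across frames
```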
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, comprising an acoustic model (AM) that predicts acoustic features from the input transcript and a vocoder that generates the waveform from the given acoustic features. However, the acoustic feature in current TTS systems is typically the mel-spectrogram, which is highly correlated along both the time and frequency axes in a complicated way, making it difficult for the AM to predict. Although recent neural vocoders can generate high-fidelity audio from the ground-truth (GT) mel-spectrogram, the gap between the GT and the mel-spectrogram predicted by the AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM, txt2vec, and a vocoder, vec2wav, which uses self-supervised vector-quantized (VQ) acoustic features rather than the mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec becomes essentially a classification model rather than a traditional regression model, while vec2wav uses an additional feature encoder before the HifiGAN generator to smooth the discontinuous quantized features. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic features. Moreover, our entire TTS system VQTTS achieves state-of-the-art naturalness among all currently publicly available TTS systems.
Comment: This version has been removed by arXiv administrators because the submitter did not have the authority to assign the license at the time of submission.
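A hedged sketch of the VQ acoustic feature idea: quantize frame-level self-supervised features against a codebook, so an AM in the style of txt2vec can predict discrete codebook indices (classification) instead of regressing continuous mel frames. K-means here is a stand-in quantizer; the codebook size, feature dimension and random stand-in features are illustrative, and the paper's actual SSL model and vocoder are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(5000, 256))       # stand-in SSL features (T, D)

# Learn a codebook over the continuous feature space.
codebook = KMeans(n_clusters=512, n_init=4, random_state=0).fit(ssl_feats)

def quantize(frames):
    """Map continuous frames to (indices, quantized vectors)."""
    idx = codebook.predict(frames)             # classification targets for the AM
    return idx, codebook.cluster_centers_[idx] # inputs for the vocoder

indices, vq_feats = quantize(ssl_feats[:100])
print(indices.shape, vq_feats.shape)           # (100,) (100, 256)
```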
Evaluation of 2D Acoustic Signal Representations for Acoustic-Based Machine Condition Monitoring
Acoustic-based machine condition monitoring (MCM) provides an improved alternative to conventional MCM approaches, including vibration analysis and lubrication monitoring. Several challenges arise in anomalous machine operating sound classification, as it requires an effective 2D acoustic signal representation. This paper explores this question. A baseline convolutional neural network (CNN) is implemented and trained with rolling-element bearing acoustic fault data. Three representations are considered: the log-spectrogram, the short-time Fourier transform and the log-Mel spectrogram. The results establish the log-Mel spectrogram and the log-spectrogram as promising candidates for further exploration.
Peer reviewed
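The three candidate representations can be reproduced with standard librosa transforms, as in the sketch below. This is a plain illustration of the transforms, not the authors' exact parameterization; window sizes and mel band count are assumptions.

```python
import numpy as np
import librosa

sr = 22050
y = librosa.chirp(fmin=200, fmax=4000, sr=sr, duration=2.0)  # any mono signal

stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))   # STFT magnitude
log_spec = librosa.amplitude_to_db(stft)                     # log-spectrogram
log_mel = librosa.power_to_db(                               # log-Mel spectrogram
    librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=512, n_mels=64))

# Each is a 2-D array (bands x frames) usable as CNN input.
print(stft.shape, log_spec.shape, log_mel.shape)
```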
DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP
Recent development of neural vocoders based on the generative adversarial network (GAN) has shown their advantages in generating raw waveforms conditioned on the mel-spectrogram, with fast inference speed and lightweight networks. However, it is still challenging to train a universal neural vocoder that can synthesize high-fidelity speech across scenarios with unseen speakers, languages, and speaking styles. In this paper, we propose DSPGAN, a GAN-based universal vocoder for high-fidelity speech synthesis that applies time-frequency domain supervision from digital signal processing (DSP). To eliminate the mismatch between the ground-truth spectrograms used in the training phase and the predicted spectrograms seen in the inference phase, we use the mel-spectrogram extracted from the waveform generated by a DSP module, rather than the mel-spectrogram predicted by the text-to-speech (TTS) acoustic model, as the time-frequency domain supervision for the GAN-based vocoder. We also use sine excitation as time-domain supervision to improve harmonic modeling and eliminate various artifacts of the GAN-based vocoder. Experimental results show that DSPGAN significantly outperforms the compared approaches and can generate high-fidelity speech based on diverse data
in TTS.
Comment: Submitted to ICASSP 2023
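A minimal sketch of the two supervision signals described above, under assumed shapes: a mel-spectrogram loss computed against the waveform produced by a DSP module rather than against the acoustic model's prediction, and a sine excitation built from an F0 track for time-domain harmonic supervision. The DSP module, GAN vocoder and all hyperparameters are placeholders.

```python
import torch
import torchaudio

sr, hop = 22050, 256
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024,
                                           hop_length=hop, n_mels=80)

def mel_supervision_loss(generated_wav, dsp_wav):
    # Time-frequency supervision: mels extracted from the DSP module's
    # waveform, sidestepping the train/inference spectrogram mismatch.
    return torch.nn.functional.l1_loss(
        torch.log(mel(generated_wav) + 1e-5),
        torch.log(mel(dsp_wav) + 1e-5))

def sine_excitation(f0, sr=sr, hop=hop):
    # f0: per-frame fundamental frequency in Hz, 0 where unvoiced.
    f0_samples = f0.repeat_interleave(hop)         # frame rate -> sample rate
    phase = 2 * torch.pi * torch.cumsum(f0_samples / sr, dim=0)
    return torch.sin(phase) * (f0_samples > 0)     # silence unvoiced regions

f0 = torch.full((50,), 220.0)                      # toy constant-F0 track
excitation = sine_excitation(f0)                   # (50 * hop,) samples
loss = mel_supervision_loss(torch.randn(1, sr), torch.randn(1, sr))
```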
Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals
The aim of latent variable disentanglement is to infer the multiple informative latent representations behind a data generation process; it is a key factor in controllable data generation. In this paper, we propose a deep neural network-based self-supervised learning method to infer disentangled rhythmic and harmonic representations behind music audio generation. We train a variational autoencoder that generates an audio mel-spectrogram from two latent features representing the rhythmic and harmonic content. During training, the variational autoencoder is trained to reconstruct the input mel-spectrogram given its pitch-shifted version. At each forward computation in the training phase, a vector rotation operation is applied to one of the latent features, under the assumption that the dimensions of the feature vectors are related to pitch intervals. Consequently, in the trained variational autoencoder, the rotated latent feature represents the pitch-related information of the mel-spectrogram, and the unrotated latent feature represents the pitch-invariant information, i.e., the rhythmic content. The proposed method was evaluated using a predictor-based disentanglement metric on the learned features. Furthermore, we demonstrate its application to
the automatic generation of music remixes.
Comment: Accepted to DAFx 2023
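A hedged sketch of the vector rotation operation described above: if latent dimensions are assumed to correspond to pitch intervals, a pitch shift of k semitones on the input can be mirrored by cyclically rotating the pitch-related latent vector by k positions, while the rhythm latent is left untouched. Dimensions and names here are illustrative.

```python
import torch

def rotate_pitch_latent(z_pitch, semitones):
    # Cyclic shift along the feature axis stands in for the pitch interval.
    return torch.roll(z_pitch, shifts=semitones, dims=-1)

z_pitch = torch.randn(1, 24)    # assumed pitch-related latent (e.g. two octaves)
z_rhythm = torch.randn(1, 24)   # pitch-invariant latent, never rotated

z_pitch_shifted = rotate_pitch_latent(z_pitch, semitones=3)
# The decoder would then be trained to reconstruct the pitch-shifted
# mel-spectrogram from (z_pitch_shifted, z_rhythm).
```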