Signal Reconstruction from Mel-spectrogram Based on Bi-level Consistency of Full-band Magnitude and Phase
We propose an optimization-based method for reconstructing a time-domain
signal from a low-dimensional spectral representation such as a
mel-spectrogram. Phase reconstruction has been studied to reconstruct a
time-domain signal from the full-band short-time Fourier transform (STFT)
magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it
relies only on the redundancy of STFT and is applicable to various audio
signals. In this paper, we jointly reconstruct the full-band magnitude and
phase by considering the bi-level relationships among the time-domain signal,
its STFT coefficients, and its mel-spectrogram. The proposed method is
formulated as a rigorous optimization problem and estimates the full-band
magnitude based on the criterion used in GLA. Our experiments demonstrate the
effectiveness of the proposed method on speech, music, and environmental
signals.
Comment: Accepted to IEEE WASPAA 202
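For readers unfamiliar with the Griffin-Lim algorithm mentioned in the abstract above, the following is a minimal NumPy sketch of the classical GLA baseline only, not of the proposed bi-level method. It alternates between enforcing a given full-band STFT magnitude and projecting onto the set of consistent spectrograms via an inverse/forward STFT pair. The function names (`stft`, `istft`, `griffin_lim`, `spectral_error`), the window, the hop size, and the iteration count are all illustrative assumptions and do not come from the paper.

```python
import numpy as np


def stft(x, win, hop):
    """Windowed frames -> one-sided spectra, shape (n_frames, n_bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    frames = np.stack([x[s:s + n] * win for s in starts])
    return np.fft.rfft(frames, axis=1)


def istft(spec, win, hop):
    """Least-squares overlap-add inverse of `stft` above."""
    n = len(win)
    frames = np.fft.irfft(spec, n=n, axis=1)
    length = n + (len(frames) - 1) * hop
    x = np.zeros(length)
    norm = np.zeros(length)
    for k, frame in enumerate(frames):
        s = k * hop
        x[s:s + n] += frame * win
        norm[s:s + n] += win ** 2
    # Guard against division by zero where the window sum vanishes.
    return x / np.maximum(norm, 1e-12)


def griffin_lim(magnitude, win, hop, n_iter=50, seed=0):
    """Estimate a signal whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from a random phase (an assumed, common initialization).
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    x = istft(magnitude * phase, win, hop)
    for _ in range(n_iter):
        # Project onto consistent spectrograms, then restore the magnitude.
        projected = stft(x, win, hop)
        phase = np.exp(1j * np.angle(projected))
        x = istft(magnitude * phase, win, hop)
    return x


def spectral_error(x, magnitude, win, hop):
    """Relative mismatch between |STFT(x)| and the target magnitude."""
    diff = np.abs(stft(x, win, hop)) - magnitude
    return np.linalg.norm(diff) / np.linalg.norm(magnitude)
```

A typical call would pass `win = np.hanning(512)` and `hop = 128` together with a magnitude array of shape `(n_frames, 257)`. Note that this sketch assumes the full-band magnitude is already known; recovering it from a mel-spectrogram is the part the paper addresses.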
Neural Fast Full-Rank Spatial Covariance Analysis for Blind Source Separation
This paper describes an efficient unsupervised learning method for a neural
source separation model that utilizes a probabilistic generative model of
observed multichannel mixtures proposed for blind source separation (BSS). For
this purpose, amortized variational inference (AVI) has been used for directly
solving the inverse problem of BSS with full-rank spatial covariance analysis
(FCA). Although this unsupervised technique, called neural FCA, is in principle
free from the domain mismatch problem, it is computationally demanding because
the spatial model is full-rank, a cost paid in exchange for robustness against
relatively short reverberations. To reduce the model complexity without
sacrificing performance, we propose neural FastFCA based on the
jointly-diagonalizable yet full-rank spatial model. Our neural separation model,
introduced for AVI, alternates between neural network blocks and single steps
of an efficient iterative algorithm called iterative source steering. This
alternating architecture enables the separation model to quickly separate the
mixture spectrogram by leveraging both the deep neural network and the
multichannel optimization algorithm. The training objective with AVI is derived
to maximize the marginalized likelihood of the observed mixtures. An
experiment on mixture signals of two to four sound sources shows that neural
FastFCA outperforms conventional BSS methods and reduces the computational time
to about 2% of that of neural FCA.
Comment: 5 pages, 2 figures, accepted to EUSIPCO 202
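To illustrate what "jointly-diagonalizable yet full-rank" means in the abstract above, here is a small NumPy sketch under stated assumptions. It builds source spatial covariance matrices of the form R_n = Q^{-1} diag(g_n) Q^{-H}, which share one diagonalizer Q while remaining full-rank, and shows that the inverse of a weighted mixture covariance then reduces to an elementwise reciprocal in the diagonalized domain. The function names, dimensions, and random values are hypothetical and are not taken from the paper; this is not the authors' implementation, and it does not cover iterative source steering or the neural network.

```python
import numpy as np


def make_jointly_diagonalizable_scms(n_mics=3, n_src=2, seed=0):
    """Return (Q, g, R) with R[n] = Q^{-1} diag(g[n]) Q^{-H} for each source n."""
    rng = np.random.default_rng(seed)
    # Shared diagonalizer Q (random here, purely for illustration).
    Q = (rng.standard_normal((n_mics, n_mics))
         + 1j * rng.standard_normal((n_mics, n_mics)))
    # Strictly positive diagonal entries keep every R[n] full-rank.
    g = rng.uniform(0.1, 1.0, size=(n_src, n_mics))
    Q_inv = np.linalg.inv(Q)
    R = np.stack([Q_inv @ np.diag(g_n) @ Q_inv.conj().T for g_n in g])
    return Q, g, R


def mixture_cov_inverse(Q, g, powers):
    """Invert sum_n powers[n] * R[n] using only an elementwise reciprocal."""
    d = powers @ g  # diagonal of the mixture covariance in the Q domain
    return Q.conj().T @ np.diag(1.0 / d) @ Q
```

Comparing `mixture_cov_inverse(Q, g, powers)` with `np.linalg.inv` applied to the explicitly summed covariance gives the same matrix up to rounding, which hints at why a shared diagonalizer can lower the cost relative to an unconstrained full-rank model.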