BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network
Generative adversarial network (GAN)-based vocoders have been intensively
studied because they can synthesize high-fidelity audio waveforms faster than
real-time. However, it has been reported that most GANs fail to obtain the
optimal projection for discriminating between real and fake data in the feature
space. In the literature, it has been demonstrated that slicing adversarial
network (SAN), an improved GAN training framework that can find the optimal
projection, is effective in the image generation task. In this paper, we
investigate the effectiveness of SAN in the vocoding task. For this purpose, we
propose a scheme to modify least-squares GAN, which most GAN-based vocoders
adopt, so that their loss functions satisfy the requirements of SAN. Through
our experiments, we demonstrate that SAN can improve the performance of
GAN-based vocoders, including BigVGAN, with small modifications. Our code is
available at https://github.com/sony/bigvsan.
Comment: Submitted to ICASSP 202
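The loss decomposition described above can be sketched numerically. Below is a minimal NumPy illustration, not the paper's implementation: the discriminator is split into a feature extractor and a normalized linear direction, the feature part keeps a least-squares GAN objective, and the direction part is trained with a Wasserstein-like sliced objective. All names, shapes, and the single-layer "network" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, W):
    """Stand-in feature extractor: one tanh layer in place of a deep network."""
    return np.tanh(x @ W)

def san_losses(real, fake, W, v):
    """SAN splits the discriminator into features f and a normalized
    direction w = v / ||v||.  Here the feature part keeps a least-squares
    GAN objective, while the direction part uses a Wasserstein-like
    sliced objective so the projection can become optimal."""
    w = v / np.linalg.norm(v)
    d_real = features(real, W) @ w
    d_fake = features(fake, W) @ w
    # least-squares GAN loss for the feature part (real -> 1, fake -> 0)
    loss_feature = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    # the direction maximizes the sliced mean discrepancy (minimize its negative)
    loss_direction = -(np.mean(d_real) - np.mean(d_fake))
    return loss_feature, loss_direction

real = rng.normal(1.0, 0.5, size=(64, 8))   # stand-in "real" inputs
fake = rng.normal(-1.0, 0.5, size=(64, 8))  # stand-in generator outputs
W = rng.normal(size=(8, 4))
v = rng.normal(size=4)
loss_f, loss_d = san_losses(real, fake, W, v)
```

In a full training loop, the two losses would update disjoint parameter groups: the features with `loss_feature` and the last-layer direction with `loss_direction`.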
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Taking long-term spectral and temporal dependencies into account is essential
for automatic piano transcription, especially for determining the precise
onset and offset of each note in polyphonic piano content. For this, we can
rely on the capability of the self-attention mechanism in Transformers to
capture these long-term dependencies along the frequency and time
axes. In this work, we propose hFT-Transformer, which is an automatic music
transcription method that uses a two-level hierarchical frequency-time
Transformer architecture. The first hierarchy includes a convolutional block in
the time axis, a Transformer encoder in the frequency axis, and a Transformer
decoder that converts the dimensionality along the frequency axis. The output is then
fed into the second hierarchy which consists of another Transformer encoder in
the time axis. We evaluated our method with the widely used MAPS and MAESTRO
v3.0.0 datasets, and it demonstrated state-of-the-art F1 scores across the
Frame, Note, Note with Offset, and Note with Offset and Velocity metrics.
Comment: 8 pages, 6 figures, to be published in ISMIR202
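The two-level hierarchy can be followed at the level of tensor shapes. In the sketch below, simple projections and a tanh stand in for the convolutional block and the Transformer layers; only the flow of shapes follows the description, and the 88-pitch output dimension and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hft_shapes(spec, d_model=16, n_pitch=88):
    """Shape-level sketch of the two-level hierarchy.  Simple projections
    and a tanh stand in for the convolutional block and the Transformer
    layers; only the tensor shapes follow the description."""
    rng = np.random.default_rng(0)
    T, F = spec.shape
    # 1st hierarchy: convolutional block along the time axis
    # (a per-bin embedding stands in for the convolution)
    x = spec[:, :, None] * rng.normal(size=d_model)      # (T, F, d_model)
    # Transformer encoder along the frequency axis (stand-in mixing)
    x = x + np.tanh(x)
    # Transformer decoder converting the frequency dimension F -> n_pitch
    dec = rng.normal(size=(F, n_pitch)) / np.sqrt(F)
    x = np.einsum('tfd,fp->tpd', x, dec)                 # (T, n_pitch, d_model)
    # 2nd hierarchy: Transformer encoder along the time axis (stand-in)
    x = x + np.tanh(x)
    # per-pitch output head (frame/onset/offset logits would go here)
    head = rng.normal(size=d_model) / np.sqrt(d_model)
    return x @ head                                      # (T, n_pitch)

spec = np.random.default_rng(1).normal(size=(100, 256))  # (time, freq) input
out = hft_shapes(spec)
```

The key structural point is the decoder between the two hierarchies: it converts the frequency axis (here 256 bins) into a per-pitch axis before the time-axis encoder runs.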
On the Equivalence of Consistency-Type Models: Consistency Models, Consistent Diffusion Models, and Fokker-Planck Regularization
The emergence of various notions of ``consistency'' in diffusion models has
garnered considerable attention and helped achieve improved sample quality,
likelihood estimation, and accelerated sampling. Although similar concepts have
been proposed in the literature, the precise relationships among them remain
unclear. In this study, we establish theoretical connections between three
recent ``consistency'' notions designed to enhance diffusion models for
distinct objectives. Our insights offer the potential for a more comprehensive
and encompassing framework for consistency-type models.
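For concreteness, the self-consistency condition from the consistency models literature, one of the three notions compared, requires a model $f_\theta$ to map every point of a single probability-flow ODE trajectory $\{x_t\}$ to the same output:

\[
f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same trajectory}, \qquad f_\theta(x_\epsilon, \epsilon) = x_\epsilon .
\]

The other two notions impose related constraints, which the paper relates theoretically; their definitions are not reproduced here.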
Unsupervised vocal dereverberation with diffusion-based generative models
Removing reverb from reverberant music is a necessary step in cleaning up
audio for downstream music manipulation. Reverberation in music falls into two
categories: natural reverb and artificial reverb. Artificial reverb is more
diverse than natural reverb because of its varied parameter setups and
reverberation types. However, recent supervised dereverberation methods may
fail because, to generalize to unseen observations at inference time, they
rely on sufficiently diverse and numerous pairs of reverberant observations
and retrieved data for training. To resolve these problems, we propose an
unsupervised method that can remove general artificial reverb from music
without requiring paired data for training. The
proposed method is based on diffusion models, where it initializes the unknown
reverberation operator with a conventional signal processing technique and
simultaneously refines the estimate with the help of diffusion models. We show
through objective and perceptual evaluations that our method outperforms the
current leading vocal dereverberation benchmarks.
Comment: 6 pages, 2 figures, submitted to ICASSP 202
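The initialize-then-refine idea can be illustrated on a toy blind deconvolution. In the sketch below, plain gradient steps stand in for the diffusion-model updates, the reverb operator is a short FIR filter, and `h_init` plays the role of the signal-processing initialization; the names and optimization details are illustrative assumptions, not the paper's method.

```python
import numpy as np

def conv_mat(h, n):
    """n x n lower-triangular Toeplitz matrix C with C @ x equal to the
    first n samples of np.convolve(x, h)."""
    C = np.zeros((n, n))
    for i, hi in enumerate(h):
        C += np.diag(np.full(n - i, hi), -i)
    return C

def delay_cols(x, L):
    """n x L matrix of delayed copies of x, so delay_cols(x, L) @ h is
    also the truncated convolution of x with h."""
    n = len(x)
    return np.column_stack(
        [np.concatenate([np.zeros(i), x[:n - i]]) for i in range(L)])

def blind_dereverb(wet, h_init, n_iter=400, lr=5e-4):
    """Toy alternating refinement of the dry signal x and a short FIR
    reverb filter h by joint gradient descent on ||h * x - wet||^2.
    In the proposed method the x-update would be carried out with a
    diffusion model; a plain gradient step stands in here."""
    n, L = len(wet), len(h_init)
    x, h = wet.copy(), np.asarray(h_init, dtype=float).copy()
    for _ in range(n_iter):
        resid = delay_cols(x, L) @ h - wet
        gx = conv_mat(h, n).T @ resid    # gradient w.r.t. the dry signal
        gh = delay_cols(x, L).T @ resid  # gradient w.r.t. the filter
        x -= lr * gx
        h -= lr * gh
    return x, h

rng = np.random.default_rng(0)
dry = rng.normal(size=64)
h_true = np.array([1.0, 0.5, 0.25])
wet = np.convolve(dry, h_true)[:64]
x_est, h_est = blind_dereverb(wet, h_init=[1.0, 0.2, 0.0])
```

The point of the structure is that a rough operator estimate is enough to start, because the signal update and the operator update correct each other as iterations proceed.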
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
Pre-trained diffusion models have been successfully used as priors in a
variety of linear inverse problems, where the goal is to reconstruct a signal
from noisy linear measurements. However, existing approaches require knowledge
of the linear operator. In this paper, we propose GibbsDDRM, an extension of
Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the
linear measurement operator is unknown. GibbsDDRM constructs a joint
distribution of the data, measurements, and linear operator by using a
pre-trained diffusion model for the data prior, and it solves the problem by
posterior sampling with an efficient variant of a Gibbs sampler. The proposed
method is problem-agnostic, meaning that a pre-trained diffusion model can be
applied to various inverse problems without fine-tuning. In experiments, it
achieved high performance on both blind image deblurring and vocal
dereverberation tasks, despite the use of simple generic priors for the
underlying linear operators.
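The alternating posterior-sampling idea behind the Gibbs sampler can be illustrated on a toy blind problem with a scalar gain. The sketch below uses conjugate Gaussian updates in place of both the diffusion prior and the partial collapsing, so it shows only the two-block structure, not GibbsDDRM itself; all names and the model are illustrative.

```python
import numpy as np

def gibbs_blind_gain(y, sigma=0.3, n_iter=1500, seed=0):
    """Two-block Gibbs sampler for the toy blind model y = a * x + noise,
    with a scalar unknown gain a and standard-normal priors on a and on
    the entries of x.  GibbsDDRM replaces the conjugate x-update below
    with posterior sampling under a pre-trained diffusion prior and
    partially collapses the sampler; both refinements are omitted here."""
    rng = np.random.default_rng(seed)
    n = len(y)
    a, samples = 1.0, []
    for t in range(n_iter):
        # sample x | a, y (conjugate Gaussian; a diffusion prior in the paper)
        prec_x = a ** 2 / sigma ** 2 + 1.0
        x = (a / sigma ** 2) * y / prec_x + rng.normal(size=n) / np.sqrt(prec_x)
        # sample a | x, y (conjugate Gaussian update for the linear operator)
        prec_a = x @ x / sigma ** 2 + 1.0
        a = (x @ y / sigma ** 2) / prec_a + rng.normal() / np.sqrt(prec_a)
        if t >= n_iter - 500:
            samples.append(a)
    return float(np.mean(samples))

rng = np.random.default_rng(1)
x_true = rng.normal(size=200)
y = 2.0 * x_true + 0.3 * rng.normal(size=200)
a_est = gibbs_blind_gain(y)
```

Because both the signal and the operator are sampled rather than point-estimated, uncertainty about the operator is carried through the whole reconstruction instead of being fixed after an initial estimate.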
SAN: Inducing Metrizability of GAN with Discriminative Normalized Linear Layer
Generative adversarial networks (GANs) learn a target probability
distribution by optimizing a generator and a discriminator with minimax
objectives. This paper addresses the question of whether such optimization
actually provides the generator with gradients that make its distribution close
to the target distribution. We derive metrizable conditions, sufficient
conditions for the discriminator to serve as the distance between the
distributions by connecting the GAN formulation with the concept of sliced
optimal transport. Furthermore, by leveraging these theoretical results, we
propose a novel GAN training scheme, called slicing adversarial network (SAN).
With only simple modifications, a broad class of existing GANs can be converted
to SANs. Experiments on synthetic and image datasets support our theoretical
results and SAN's effectiveness compared with ordinary GANs. Furthermore, we
also apply SAN to StyleGAN-XL, which achieves a state-of-the-art FID score
among GANs for class-conditional generation on ImageNet 256×256.
Comment: 24 pages with 12 figures
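The sliced-optimal-transport quantity the paper connects GANs to can be written, in the standard notation of max-sliced discrepancies over discriminator features (not necessarily the paper's own notation), as

\[
\max_{\|w\|=1} \Bigl( \mathbb{E}_{x \sim p_{\mathrm{data}}} \bigl[ \langle w, f_\phi(x) \rangle \bigr] - \mathbb{E}_{x \sim p_g} \bigl[ \langle w, f_\phi(x) \rangle \bigr] \Bigr),
\]

where the discriminator takes the form $h(x) = \langle w, f_\phi(x) \rangle$ and the discriminative normalized linear last layer supplies the unit direction $w$.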
Manifold Preserving Guided Diffusion
Despite the recent advancements, conditional image generation still faces
challenges of cost, generalizability, and the need for task-specific training.
In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a
training-free conditional generation framework that leverages pretrained
diffusion models and off-the-shelf neural networks with minimal additional
inference cost for a broad range of tasks. Specifically, we leverage the
manifold hypothesis to refine the guided diffusion steps and introduce a
shortcut algorithm in the process. We then propose two methods for on-manifold
training-free guidance using pre-trained autoencoders and demonstrate that our
shortcut inherently preserves the manifolds when applied to latent diffusion
models. Our experiments show that MPGD is efficient and effective for solving a
variety of conditional generation applications in low-compute settings, and can
consistently offer up to 3.8x speed-ups with the same number of diffusion steps
while maintaining high sample quality compared to the baselines.
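The on-manifold guidance idea can be illustrated with a toy manifold. Below, normalization onto the unit circle stands in for the decode(encode(·)) round trip of a pre-trained autoencoder, and a quadratic loss stands in for the guidance network; all names and the two-dimensional setting are illustrative assumptions.

```python
import numpy as np

def project(x):
    """Stand-in for the autoencoder round trip decode(encode(x)):
    projection onto the data manifold, here the unit circle."""
    return x / np.linalg.norm(x)

def guided_step(x, target, step=0.3):
    """One manifold-preserving guidance step: descend the guidance loss
    ||x - target||^2, then map the result back onto the manifold so the
    guided trajectory never leaves it."""
    grad = 2.0 * (x - target)      # gradient of the quadratic guidance loss
    return project(x - step * grad)

x = project(np.array([1.0, 0.2]))  # start on the manifold
target = np.array([0.0, 1.0])      # guidance target (happens to lie on it)
for _ in range(50):
    x = guided_step(x, target)
```

The projection after every gradient step is what keeps guidance from pushing samples off the data manifold, which is the failure mode MPGD's on-manifold guidance is designed to avoid.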