C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds
Flow-based generative models have highly desirable properties, such as exact
log-likelihood evaluation and exact latent-variable inference; however, they are
still in their infancy and have not received as much attention as alternative
generative models. In this paper, we introduce C-Flow, a novel conditioning
scheme that brings normalizing flows to an entirely new scenario with great
possibilities for multi-modal data modeling. C-Flow is based on a parallel
sequence of invertible mappings in which a source flow guides the target flow
at every step, enabling fine-grained control over the generation process. We
also devise a new strategy for modeling unordered 3D point clouds that, in
combination with the conditioning scheme, makes it possible to address 3D
reconstruction from a single image and its inverse problem of rendering an
image given a point cloud. We demonstrate that our conditioning method is highly
adaptable, being applicable to image manipulation, style transfer, and
multi-modal image-to-image mapping across a diversity of domains, including RGB
images, segmentation maps, and edge masks.
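The core idea of the conditioning scheme — a source flow's features steering the invertible transform of a target flow — can be illustrated with a toy affine coupling step. The sketch below is a minimal, hypothetical NumPy construction, not the paper's architecture: the weight matrix `W` and the 2-D variables are illustrative placeholders. The key property it demonstrates is that conditioning on the source features preserves exact invertibility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weights for the scale/shift network: maps the
# concatenation [x_a (1) ; source features (2)] to [log_s, t].
W = rng.normal(size=(3, 2)) * 0.1

def coupling_forward(x, src):
    """One affine coupling step whose scale/shift is conditioned on `src`."""
    x_a, x_b = x[:1], x[1:]                 # split the target variable
    h = np.concatenate([x_a, src]) @ W      # source flow guides the transform
    log_s, t = h[0], h[1]
    y_b = x_b * np.exp(log_s) + t           # affine map of the second half
    return np.concatenate([x_a, y_b]), log_s  # log-det-Jacobian = log_s

def coupling_inverse(y, src):
    """Exact inverse: recompute log_s, t from the untouched half and `src`."""
    y_a, y_b = y[:1], y[1:]
    h = np.concatenate([y_a, src]) @ W
    log_s, t = h[0], h[1]
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b])

x = rng.normal(size=2)
src = rng.normal(size=2)
y, log_det = coupling_forward(x, src)
x_rec = coupling_inverse(y, src)
```

Because the conditioning features enter only through the scale and shift of the coupled half, the forward map stays triangular and the inverse is exact, which is what makes exact log-likelihood evaluation possible even with a conditioning signal.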
Enhancing gappy speech audio signals with generative adversarial networks
Gaps, dropouts, and short clips of corrupted audio are a common problem and are particularly annoying when they occur in speech. This paper uses machine learning to regenerate gaps of up to 320 ms in a speech audio signal. Audio regeneration is translated into image regeneration by transforming the audio into a Mel-spectrogram and using image in-painting to regenerate the gaps. The full Mel-spectrogram is then converted back to audio using the Parallel-WaveGAN vocoder and integrated into the audio stream. Using a sample of 1,300 spoken audio clips of between 1 and 10 seconds taken from the publicly available LJSpeech dataset, our results show regeneration of audio gaps in close to real time using GANs on a GPU-equipped system.
As expected, the smaller the gap in the audio, the better the quality of the filled gap. On a gap of 240 ms, the average mean opinion score (MOS) for the best-performing models was 3.737 on a scale of 1 (worst) to 5 (best), which is sufficient for a human to perceive the output as close to uninterrupted human speech.
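The pipeline's frame arithmetic and masking step can be sketched concretely. The snippet below assumes LJSpeech-style settings (22050 Hz sample rate, hop length 256, 80 mel bins) — these are common defaults, not values stated in the abstract — and uses naive linear interpolation as a stand-in for the GAN in-painter, just to show how a 320 ms gap maps onto Mel-spectrogram frames.

```python
import numpy as np

SR, HOP, N_MELS = 22050, 256, 80        # assumed LJSpeech-style settings
gap_ms = 320
# A 320 ms gap covers roughly this many mel frames to regenerate.
gap_frames = int(round(gap_ms / 1000 * SR / HOP))

def inpaint_linear(mel, start, n):
    """Naive stand-in for the GAN in-painter: linearly interpolate each
    mel bin across the n masked frames using the valid boundary frames."""
    out = mel.copy()
    left, right = mel[:, start - 1], mel[:, start + n]
    for i in range(n):
        w = (i + 1) / (n + 1)
        out[:, start + i] = (1 - w) * left + w * right
    return out

# Toy spectrogram: each frame's value equals its frame index.
mel = np.ones((N_MELS, 1)) * np.arange(100)[None, :]
masked = mel.copy()
masked[:, 40:40 + gap_frames] = 0.0     # simulate the dropout
filled = inpaint_linear(masked, 40, gap_frames)
```

In the real system the interpolation step is replaced by a trained image in-painting GAN, and the completed Mel-spectrogram is passed through the Parallel-WaveGAN vocoder to synthesize the waveform for the gap.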