Currently, applying diffusion models in the pixel space of high-resolution images
is difficult. Instead, existing approaches focus on diffusion in lower-dimensional
spaces (latent diffusion), or employ multiple super-resolution levels of generation,
referred to as cascades. The downside is that these approaches add complexity to
the diffusion framework.
This paper aims to improve denoising diffusion for high-resolution images
while keeping the model as simple as possible. It is centered around the
research question: how can one train a standard denoising diffusion model on
high-resolution images, and still obtain performance comparable to these
alternative approaches?
The four main findings are: 1) the noise schedule should be adjusted for
high-resolution images (a sketch of one such adjustment follows below), 2) it
is sufficient to scale only a particular part of the architecture, 3) dropout
should be added at specific locations in the architecture, and 4) downsampling
is an effective strategy to avoid high-resolution feature maps. Combining these
simple yet effective techniques, we achieve state-of-the-art image generation
among diffusion models without sampling modifiers on ImageNet.
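
To make the first finding concrete, here is a minimal sketch of one way to shift
a noise schedule so that larger images see more noise at every timestep. The
log-SNR parameterization of the cosine schedule, the reference resolution of 64,
and the shift of 2*log(base_size / image_size) are assumptions chosen for
illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def shifted_cosine_logsnr(t, image_size, base_size=64):
    """Log-SNR of a cosine schedule, shifted for image resolution.

    Illustrative sketch (not the paper's verbatim definition): the
    standard cosine schedule can be written as
        logSNR(t) = -2 * log(tan(pi * t / 2)),
    and adding 2 * log(base_size / image_size) lowers the SNR at every
    timestep for images larger than the reference resolution, so the
    model is trained on proportionally noisier inputs.
    """
    t = np.clip(t, 1e-5, 1 - 1e-5)  # avoid tan() singularities at t = 0 and t = 1
    base = -2.0 * np.log(np.tan(np.pi * t / 2.0))
    shift = 2.0 * np.log(base_size / image_size)
    return base + shift

# Example: at 512x512 the shift is 2 * log(64 / 512) ~= -4.16, i.e. every
# timestep carries substantially more noise than it would at 64x64.
t = np.linspace(0.0, 1.0, 5)
print(shifted_cosine_logsnr(t, image_size=512))
```

The intuition behind shifting rather than redesigning the schedule is that at
high resolutions, neighboring pixels are highly correlated, so the same nominal
noise level destroys less information; lowering the log-SNR uniformly restores
a comparable training signal across resolutions.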