Recent years have witnessed the remarkable performance of diffusion models in
various vision tasks. However, in image restoration, which aims to recover clear
images with sharp details from degraded observations, diffusion-based
methods may produce unsatisfactory results because of inaccurate noise
estimation. Moreover, simply constraining the estimated noise does not let the
model learn complex degradation information effectively, which in turn limits
model capacity.
To solve the above problems, we propose a coarse-to-fine diffusion Transformer
(C2F-DFT) for image restoration. Specifically, our C2F-DFT contains diffusion
self-attention (DFSA) and a diffusion feed-forward network (DFN) within a new
coarse-to-fine training scheme. The DFSA captures long-range diffusion
dependencies and the DFN learns hierarchical diffusion representations
to facilitate better restoration. In the coarse training stage, our C2F-DFT
estimates the noise and then generates the final clean image with a sampling
algorithm. To further improve the restoration quality, we propose a simple yet
effective fine training scheme. It first uses the coarse-trained diffusion
model with a fixed number of sampling steps to generate restoration results,
which are then constrained against the corresponding ground-truth images to
optimize the model and remedy the errors caused by inaccurate noise estimation.
Extensive experiments show that C2F-DFT significantly outperforms the
diffusion-based restoration method IR-SDE and achieves competitive performance
compared with Transformer-based state-of-the-art methods on three tasks,
including deraining, deblurring, and real denoising.

Comment: 9 pages, 8 figures
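
The two training stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear noise schedule, the deterministic DDIM-style sampler, the MSE and L1 losses, and every function and parameter name (`make_schedule`, `coarse_loss`, `sample`, `fine_loss`, `eps_model`, `num_sample_steps`) are assumptions chosen for illustration. The coarse objective supervises the estimated noise, while the fine objective runs a fixed number of sampling steps and compares the restored image with the ground truth.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficients for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def coarse_loss(eps_model, clean, degraded, alpha_bars, rng):
    """Coarse stage (assumed form): MSE between true and estimated noise."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(clean.shape)
    a = alpha_bars[t]
    # Forward diffusion of the clean image to step t.
    x_t = np.sqrt(a) * clean + np.sqrt(1.0 - a) * eps
    eps_hat = eps_model(x_t, degraded, t)
    return float(np.mean((eps_hat - eps) ** 2))


def sample(eps_model, degraded, alpha_bars, num_sample_steps, rng):
    """Deterministic DDIM-style sampler with a fixed number of steps (assumed)."""
    last = len(alpha_bars) - 1
    steps = np.linspace(last, 0, num_sample_steps).round().astype(int)
    x = rng.standard_normal(degraded.shape)
    for i, t in enumerate(steps):
        a_t = alpha_bars[t]
        eps_hat = eps_model(x, degraded, t)
        # Estimate of the clean image implied by the current noise estimate.
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        if i + 1 < len(steps):
            a_prev = alpha_bars[steps[i + 1]]
            x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
        else:
            x = x0_hat
    return x


def fine_loss(eps_model, clean, degraded, alpha_bars, num_sample_steps, rng):
    """Fine stage (assumed form): L1 between sampled result and ground truth."""
    restored = sample(eps_model, degraded, alpha_bars, num_sample_steps, rng)
    return float(np.mean(np.abs(restored - clean)))
```

Here `eps_model(x_t, degraded, t)` stands in for the conditional noise estimator. In an actual training loop the fine-stage loss would be backpropagated through the sampling steps with an automatic-differentiation framework; this sketch only computes the loss values.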