Improving the Stability of Diffusion Models for Content Consistent Super-Resolution

1. The Hong Kong Polytechnic University, 2. OPPO Research Institute
arXiv | Code

Abstract

The generative priors of pre-trained latent diffusion models have demonstrated great potential to enhance the perceptual quality of image super-resolution (SR) results. Unfortunately, the existing diffusion prior-based SR methods encounter a common problem, i.e., they tend to generate rather different outputs for the same low-resolution image with different noise samples. Such stochasticity is desired for text-to-image generation tasks but problematic for SR tasks, where the image contents are expected to be well preserved. To improve the stability of diffusion prior-based SR, we propose to employ diffusion models to refine image structures, while employing generative adversarial training to enhance image fine details. Specifically, we propose a non-uniform timestep learning strategy to train a compact diffusion network, which can efficiently and stably reproduce the main structures of the image, and we finetune the pre-trained decoder of the variational auto-encoder (VAE) by adversarial training for detail enhancement. Extensive experiments show that our proposed method, namely content consistent super-resolution (CCSR), can significantly reduce the stochasticity of diffusion prior-based SR, improving the content consistency of SR outputs and speeding up the image generation process.

Motivation

Visual comparisons of two super-resolution outputs generated from the same input LR image with different starting points (noise samples) in the diffusion process. One can see that the images generated by StableSR exhibit noticeable differences in textures, as well as large variations in PSNR and LPIPS indices. In contrast, our CCSR method produces more stable and content-consistent results.

To improve the stability of diffusion priors so that they can better assist SR tasks, we investigate in depth how diffusion priors help SR at different diffusion timesteps. Compared with GAN priors, diffusion priors are more powerful and flexible in generating realistic and visually pleasing image content. This capability is especially useful when the LR image suffers from significant information loss and heavy degradation, for which GAN-based models may fail. However, if the structural contents of the image can be well reproduced, for example by the diffusion model, a GAN network can subsequently enhance the details with low stochasticity. Therefore, we propose to employ diffusion models to refine image structures, while employing generative adversarial training to enhance image fine details.

Method

Framework of CCSR.

There are two training stages in CCSR: structure refinement (top left) and detail enhancement (top right). In the first stage, a non-uniform sampling strategy (bottom) is proposed, which applies one timestep for information extraction from the LR image and several other timesteps for image structure generation. The diffusion process is then stopped, and the truncated output is fed into the second stage, where details are enhanced by finetuning the VAE decoder with adversarial training.
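The inference pipeline described above can be sketched as follows. This is an illustrative skeleton only: `denoise_step` and `vae_decode` are toy placeholders standing in for CCSR's compact diffusion network and adversarially finetuned VAE decoder, and the timestep values are hypothetical, not the schedule used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(z_t, t, lr_cond):
    # Placeholder for one reverse step of the compact diffusion network,
    # conditioned on the LR latent; the real model predicts a cleaner latent.
    return 0.9 * z_t + 0.1 * lr_cond  # toy update, not the actual network

def vae_decode(z):
    # Placeholder for the adversarially finetuned VAE decoder that
    # enhances fine details in the second stage.
    return np.tanh(z)

def ccsr_sample(lr_cond, timesteps=(999, 666, 333)):
    """Truncated, few-step sampling: refine structures, then decode."""
    z = rng.standard_normal(lr_cond.shape)   # start from Gaussian noise
    for t in timesteps:                      # non-uniform, few-step schedule
        z = denoise_step(z, t, lr_cond)      # structure refinement stage
    # Diffusion is truncated here; the finetuned decoder adds details.
    return ccsr_output(z)

def ccsr_output(z):
    return vae_decode(z)
```

The key design choice this sketch reflects is that the diffusion loop runs for only a handful of non-uniform timesteps and is then truncated, leaving detail synthesis to the deterministic decoder, which is what reduces the output stochasticity.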

New Stability Measures

Most existing diffusion prior-based SR methods suffer from the stability problem. It is therefore necessary to design measures that evaluate the stability of diffusion model-based SR models. We propose two new stability metrics, global standard deviation (G-STD) and local standard deviation (L-STD), to measure the image-level and pixel-level variations of the SR results of diffusion-based methods, respectively. We run each SR model N times (N = 10 in our experiments) on each test image within each test benchmark. For each SR image, we compute its quality metrics (except for FID) and then calculate the STD values over the N runs for each metric. By averaging the STD values over all test images in a benchmark, we obtain the G-STD value of one metric, which reflects the stability of an SR model at the image level. To measure stability at the local pixel level, we define L-STD, which computes the STD of the pixels at the same location across the N SR images.
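The two measures can be computed with a few lines of NumPy. This is a minimal sketch following the definitions above; the final averaging of L-STD over all pixels (and, across a benchmark, over all test images) is our assumption about the aggregation, as the text only specifies the per-location STD.

```python
import numpy as np

def g_std(metric_values):
    """Global STD of an image-level metric (e.g. PSNR or LPIPS).

    metric_values: array of shape (num_images, N), one metric value
    per test image per run. Returns the per-image STD over the N runs,
    averaged over all test images in the benchmark.
    """
    metric_values = np.asarray(metric_values, dtype=float)
    return np.std(metric_values, axis=1).mean()

def l_std(sr_stack):
    """Local STD of SR outputs for one LR input.

    sr_stack: array of shape (N, H, W) or (N, H, W, C), the N SR
    results of the same LR image. Computes the STD of the pixels at
    each location across the N runs, then averages over all locations
    (averaging is our assumed aggregation step).
    """
    sr_stack = np.asarray(sr_stack, dtype=float)
    return np.std(sr_stack, axis=0).mean()
```

A perfectly stable model (identical outputs across runs) yields G-STD and L-STD of exactly zero, so lower values indicate better content consistency.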

Results

Visual comparisons between our CCSR and state-of-the-art GAN-based and diffusion model-based methods, including RealESRGAN, BSRGAN, DASR, LDL, LDM-SR, StableSR, ResShift, DiffBIR and PASD. For the diffusion model-based methods, the two restored images with the best and worst PSNR values over 10 runs are shown for a more comprehensive and fair comparison. Our proposed CCSR works the best in reconstructing more accurate structures and more realistic, content-consistent, and stable details.

CCSR achieves outstanding fidelity and perceptual measures among all the diffusion model-based methods. Meanwhile, it demonstrates much better stability in synthesizing image details, as evidenced by its outstanding G-STD and L-STD measures.

BibTeX

@article{sun2023ccsr,
title={Improving the Stability of Diffusion Models for Content Consistent Super-Resolution},
author={Sun, Lingchen and Wu, Rongyuan and Zhang, Zhengqiang and Yong, Hongwei and Zhang, Lei},
journal={arXiv preprint arXiv:2401.00877},
year={2024},
}