Diffusion models have achieved remarkable advances in various image generation tasks. However, their performance declines notably when generating images at resolutions higher than those used during training. Although numerous methods exist for producing high-resolution images, they are either inefficient or rely on complex operations. In this paper, we propose RectifiedHR, an efficient and straightforward solution for training-free high-resolution image generation. Specifically, we introduce a noise refresh strategy, which in principle requires only a few lines of code to unlock the model's high-resolution generation ability and improve efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during high-resolution image generation. To address this issue, we propose an energy rectification strategy, in which adjusting the hyperparameters of classifier-free guidance effectively improves generation performance. Our method is entirely training-free and simple to implement. Extensive comparisons with numerous baseline methods demonstrate the superior effectiveness and efficiency of RectifiedHR.
Visualization of the predicted \( x_0 \) at different timesteps t, abbreviated as \( {p_{x0}^t} \). The figure shows how \( {p_{x0}^t} \) changes over the sampling steps, with the x-axis representing the sampling timestep. The 11 images shown are sampled at even intervals from 50 images. In the first half of the process, sampling mainly generates the global structure of \( {p_{x0}^t} \), while the second half mainly generates local details.
Trends of the predicted \( x_0 \) at different timesteps t, abbreviated as \( {p_{x0}^t} \), over 100 random prompts. (a) Average CLIP Score between \( {p_{x0}^t} \) and the prompt at each timestep. The x-axis represents the sampling timestep, and the y-axis represents the average CLIP Score. (b) Average MSE between \( {p_{x0}^t} \) and \( {p_{x0}^{t-1}} \). The x-axis represents the sampling timestep, and the y-axis represents the average MSE. After approximately 30 steps, the change in \( {p_{x0}^t} \) slows down.
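The quantity plotted in panel (b) can be sketched as follows. This is a minimal illustration, assuming the predictions for one prompt are available as same-shaped NumPy arrays in sampling order; the function name `consecutive_mse` is hypothetical and not taken from the paper's code.

```python
import numpy as np


def consecutive_mse(preds):
    """MSE between each predicted x0 and the one from the previous step.

    preds: sequence of same-shaped arrays, ordered by sampling step.
    Returns a list with len(preds) - 1 values.
    """
    preds = [np.asarray(p, dtype=np.float64) for p in preds]
    return [
        float(np.mean((cur - prev) ** 2))
        for prev, cur in zip(preds[:-1], preds[1:])
    ]
```

Averaging these per-step values over all prompts would give one curve like the one in panel (b); values close to zero indicate that the prediction has largely stopped changing.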
Energy decay in our noise refresh sampling process compared with the original sampling process, over 100 random prompts. The x-axis represents the sampling timestep, and the y-axis represents the average latent energy. The blue line shows the average latent energy of the original sampling process when generating 1024 x 1024 images. The red line shows our noise refresh sampling process, in which noise refresh is performed at the 30th and 40th sampling timesteps and the resolution increases from 1024 x 1024 to 2048 x 2048 and then to 3072 x 3072. Noise refresh causes a significant decay in the latent energy. The images on the right show that image details become more prominent after energy rectification.
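A small sketch of how such an energy curve could be computed. It assumes, since the caption does not define it, that "latent energy" means the mean of the squared latent elements; the names `latent_energy` and `energy_curve` are hypothetical.

```python
import numpy as np


def latent_energy(latent):
    """Mean of the squared elements of one latent (assumed definition)."""
    latent = np.asarray(latent, dtype=np.float64)
    return float(np.mean(np.square(latent)))


def energy_curve(latents):
    """Energy at each sampling step.

    latents: one array per step. Shapes may differ between steps,
    for example after the resolution is increased.
    """
    return [latent_energy(x) for x in latents]
```

Because the mean is taken over all elements, the value stays comparable when the latent resolution changes, which is what allows a single curve to span 1024 x 1024, 2048 x 2048, and 3072 x 3072.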
Overview of our method. (a) The original sampling process and its pseudocode. (b) The sampling process and pseudocode of our method. The orange parts of the pseudocode and modules correspond to Noise Refresh, while the purple parts correspond to Energy Rectification. ε is Gaussian random noise whose shape matches that of \( {p_{x0}^t}_\text{resize} \).
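The two highlighted components might look roughly like the sketch below. It is not the authors' pseudocode, and it rests on several assumptions:

- The predicted \( x_0 \) follows the standard DDPM relation for a noise-prediction model.
- Noise refresh re-noises the resized prediction with the usual forward-process formula.
- Resizing is stood in for by nearest-neighbor upsampling with an integer factor. A step such as 2048 to 3072 would need a real interpolation routine.
- Energy rectification is represented only by the `guidance_scale` argument of the standard classifier-free guidance formula.

All function and parameter names are hypothetical.

```python
import numpy as np


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Standard DDPM estimate of x0 from a noisy latent and predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def upsample_nearest(x, factor):
    """Stand-in resize: repeat pixels along the last two axes (integer factor)."""
    return np.repeat(np.repeat(x, factor, axis=-2), factor, axis=-1)


def noise_refresh(p_x0, factor, alpha_bar_t, rng):
    """Resize the predicted x0, then add fresh Gaussian noise of the new shape."""
    resized = upsample_nearest(p_x0, factor)
    eps = rng.standard_normal(resized.shape)  # shape follows the resized prediction
    return np.sqrt(alpha_bar_t) * resized + np.sqrt(1.0 - alpha_bar_t) * eps


def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance; energy rectification adjusts guidance_scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

A usage sketch: at a refresh step, compute `predict_x0`, call `noise_refresh(p_x0, 2, alpha_bar_t, np.random.default_rng(0))`, and continue sampling from the returned latent at the higher resolution. Which guidance values counteract the energy decay is not specified in this section, so `guidance_scale` is left as a free parameter.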
Our method comprises two components: (i) noise refresh and (ii) energy rectification. To validate the effectiveness of these components, we performed experiments on all possible combinations, as illustrated in the figure. The first and second rows show images generated directly at resolutions of 1024 x 1024 and 2048 x 2048, respectively. When the 1024 x 1024 image is enlarged, local blurring is visible. The 2048 x 2048 image in the second row exhibits repeated patterns and also suffers from blurring caused by energy decay. The third row does not use noise refresh; it only adds energy rectification in the last 15 steps of direct inference. Compared with the second row, the image becomes clearer, although the repeated pattern problem is not resolved. The fourth row introduces noise refresh but does not use energy rectification. Noise refresh resolves the repeated pattern problem seen in the second and third rows, but some blurring remains. The fifth row shows our full method, which both resolves the repeated pattern problem and produces clearer details.
@misc{yang2025rectifiedhr,
title={RectifiedHR: Enable Efficient High Resolution Image Generation via Energy Rectification},
author={Zhen Yang and Guibao Shen and Liang Hou and Mushui Liu and Luozhou Wang and Xin Tao and Pengfei Wan and Di Zhang and Ying-Cong Chen},
journal={arXiv preprint arXiv:2503.02537},
year={2025}
}