Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation.
Although effective, CFG's reliance on high guidance scales presents notable drawbacks.
In response, we introduce a simple solution to this seemingly inherent limitation: CFG++.
This innovation addresses the off-manifold issue inherent in CFG, thereby enabling effective utilization
of small guidance scales (0 < λ < 1).
Aspect | CFG 😓 | CFG++ 😁 |
T2I Generation | Mode Collapse and Saturation | Better Sample Quality & Adherence to text |
DDIM Inversion w/ CFG(++) | Breakdown | Improves and enables better image editing |
PF-ODE trajectory | Unnatural, Curved | Smooth, Straighter |
In CFG++, the renoising process after applying Tweedie’s formula should utilize the unconditional noise $\hat{\epsilon}_{\varnothing}$ instead of the guided noise $\hat{\epsilon}_{c}^{\omega}$. This surprisingly simple fix to the original CFG algorithm leads to a smoother generation trajectory. The improvement is also demonstrated in the following visualization of the discrete evolution of the posterior mean.
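The difference between the two update rules can be sketched as a single deterministic DDIM step. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `use_cfgpp` flag, and the `alpha_bar_t` / `alpha_bar_prev` parameters (cumulative noise-schedule products at the current and next timestep) are assumptions made for the example, and the two noise predictions are taken as given arrays rather than computed by a real network.

```python
import numpy as np


def guided_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions with a guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def tweedie_denoise(x_t, eps, alpha_bar_t):
    """Posterior-mean estimate of the clean sample via Tweedie's formula."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)


def ddim_step(x_t, eps_uncond, eps_cond, scale, alpha_bar_t, alpha_bar_prev, use_cfgpp):
    """One deterministic DDIM step (hypothetical helper for illustration).

    Both variants denoise with the guided noise. They differ only in the
    renoising term: CFG reuses the guided noise, CFG++ uses the
    unconditional noise.
    """
    eps_guided = guided_noise(eps_uncond, eps_cond, scale)
    x0_hat = tweedie_denoise(x_t, eps_guided, alpha_bar_t)
    renoise_eps = eps_uncond if use_cfgpp else eps_guided
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * renoise_eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_t = rng.standard_normal((4, 4))
    eps_u = rng.standard_normal((4, 4))  # stand-in for the unconditional prediction
    eps_c = rng.standard_normal((4, 4))  # stand-in for the text-conditional prediction

    x_cfg = ddim_step(x_t, eps_u, eps_c, 0.6, 0.5, 0.7, use_cfgpp=False)
    x_cfgpp = ddim_step(x_t, eps_u, eps_c, 0.6, 0.5, 0.7, use_cfgpp=True)
    print("max abs difference between CFG and CFG++ step:", np.abs(x_cfg - x_cfgpp).max())
```

For the same scale, the two outputs differ only by the renoising term, that is, by `sqrt(1 - alpha_bar_prev) * scale * (eps_cond - eps_uncond)`. In practice CFG is typically run with scales well above 1, while CFG++ uses a scale between 0 and 1.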
As demonstrated by the teaser images, our CFG++ method results in a smoother generation trajectory and superior quality. Additionally, we visualize multiple images generated by CFG++ as we increase the guidance scale γ. The visualization shows a smooth transition from unconditional sampling towards highly conditional sampling.
We find that the gain from CFG++ is even more pronounced for distilled diffusion models such as SDXL-{turbo, lightning}. The quality of the generated images improves significantly, as the figure above shows.
We demonstrate the effect of CFG++ on the image inversion task. (a) Notably, DDIM inversion with CFG++ consistently reconstructs the source image across all guidance scales, whereas DDIM inversion with CFG fails to do so. (b) Quantitative results, including PSNR and RMSE, show a consistent improvement in reconstruction performance.
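As a rough illustration of the two reconstruction metrics mentioned above, the sketch below computes RMSE and PSNR between a source image and its reconstruction. It assumes images are NumPy arrays scaled to [0, 1] (the `data_range` parameter); the exact evaluation protocol used for the reported numbers is not stated here, so treat this as a generic reference computation.

```python
import numpy as np


def rmse(source, recon):
    """Root-mean-square error between two images of the same shape."""
    diff = np.asarray(source, dtype=np.float64) - np.asarray(recon, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))


def psnr(source, recon, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the assumed maximum pixel value."""
    err = rmse(source, recon)
    if err == 0.0:
        return float("inf")  # identical images
    return float(20.0 * np.log10(data_range / err))


if __name__ == "__main__":
    source = np.zeros((8, 8))
    recon = np.full((8, 8), 0.1)  # constant error of 0.1 per pixel
    print("RMSE:", rmse(source, recon))
    print("PSNR (dB):", psnr(source, recon))
```

Lower RMSE and higher PSNR both indicate that the inverted-then-resampled image is closer to the source.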
The figures above compare image editing results obtained with CFG and CFG++ after image inversion. During the editing stage, a word in the source text is swapped with the target concept, and this modified text is used as the condition for sampling. Our algorithm works for both synthetic and real images.
We show that CFG++ enables the incorporation of text prompts into standard diffusion inverse solvers. Specifically, we compare the performance of PSLD combined with CFG and with CFG++ in solving linear inverse problems. CFG++ consistently delivers high-quality reconstructions across all tasks.