Journal of Graphics ›› 2025, Vol. 46 ›› Issue (4): 727-738.DOI: 10.11996/JG.j.2095-302X.2025040727
• Image Processing and Computer Vision •
LEI Songlin1, ZHAO Zhengpeng1, YANG Qiuxia1, PU Yuanyuan1,2, GU Jinjing1, XU Dan1
Received: 2024-10-05
Accepted: 2025-01-15
Online: 2025-08-30
Published: 2025-08-11
Contact: ZHAO Zhengpeng
About author: LEI Songlin (2000-), master student. His main research interests cover image style transfer. E-mail: leisonglin@stu.ynu.edu.cn
LEI Songlin, ZHAO Zhengpeng, YANG Qiuxia, PU Yuanyuan, GU Jinjing, XU Dan. Zero-shot style transfer based on decoupled diffusion models[J]. Journal of Graphics, 2025, 46(4): 727-738.
URL: http://www.txxb.com.cn/EN/10.11996/JG.j.2095-302X.2025040727
Fig. 2 The overall structure of the dual-branch method ((a) Overall framework diagram of the model; (b) Feature modulation module; (c) Style guidance module; (d) Content guidance module)
Fig. 3 Comparative experiments on the ImageNet dataset ((a) Source image; (b) Style prompt text; (c) ZeCon; (d) DiffusionCLIP; (e) DiffuseIT; (f) InST; (g) FreeStyle; (h) Ours)
Fig. 4 Comparative experiments on the FFHQ dataset ((a) Source image; (b) Style prompt text; (c) ZeCon; (d) DiffusionCLIP; (e) DiffuseIT; (f) InST; (g) StyleGAN-NADA; (h) Ours)
| Method | SSIM↑ | LPIPS↓ | CLIP score↑ | FID↓ | Preference↑/% | Time/s |
|---|---|---|---|---|---|---|
| ZeCon | 0.696 | 0.467 | 26.94 | 262.20 | 26 | 38 |
| DiffuseIT | 0.602 | 0.507 | 24.40 | 180.32 | 2 | 42 |
| DiffusionCLIP | 0.668 | 0.536 | 28.64 | 256.19 | 6 | 462 |
| InST | 0.557 | 0.489 | 26.34 | 220.13 | 22 | 816 |
| Ours | 0.750 | 0.401 | 27.54 | 204.89 | 44 | 46 |

Table 1 Quantitative comparison with other style transfer methods
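For reference, the SSIM column in Table 1 measures structural similarity between the stylized output and the source image by comparing luminance, contrast, and structure statistics. A minimal single-window (global) SSIM sketch in NumPy is shown below; practical evaluations use the sliding-window variant, so this is illustrative only:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over whole images (illustrative; real SSIM is windowed)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

np.random.seed(0)
img = np.random.rand(64, 64)
print(global_ssim(img, img))        # identical images -> 1.0
print(global_ssim(img, 1.0 - img))  # inverted image -> much lower (negative covariance)
```

Higher is better, which is why the table marks the column SSIM↑.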
Fig. 5 Ablation experiment of the FMM module and content losses ((a) Source image; (b) No FMM; (c) No content loss; (d) VGG loss; (e) MSE loss; (f) CUT loss; (g) VGG+MSE; (h) VGG+MSE+CUT)
| Model | SSIM↑ | LPIPS↓ | CLIP score↑ |
|---|---|---|---|
| Baseline | 0.707 | 0.431 | 27.37 |
| w/o FMM (dual-branch) | 0.773 | 0.353 | 25.73 |
| w/o VGG | 0.771 | 0.397 | 27.66 |
| w/o MSE | 0.754 | 0.409 | 27.21 |
| w/o CUT | 0.737 | 0.454 | 27.42 |
| w/o content loss | 0.704 | 0.483 | 27.94 |
| Ours | 0.819 | 0.321 | 27.34 |

Table 2 Ablation results for the content losses and the FMM module
Fig. 6 Ablation experiment of the style losses ((a) Source image; (b) Style text prompt; (c) No style loss; (d) Directional loss only; (e) Global loss only; (f) Full set)
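The directional ("dir") loss ablated in Fig. 6 is the directional CLIP loss introduced by StyleGAN-NADA [10]: it aligns the CLIP-embedding direction from the source image to the stylized image with the direction from the source text to the style text, rather than matching the style text globally. A hedged sketch with placeholder embedding vectors (in real use all four embeddings come from a CLIP encoder):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def directional_loss(e_src_img, e_out_img, e_src_txt, e_sty_txt):
    """1 - cosine similarity between the image-space and text-space edit directions."""
    d_img = normalize(e_out_img - e_src_img)
    d_txt = normalize(e_sty_txt - e_src_txt)
    return 1.0 - float(d_img @ d_txt)

rng = np.random.default_rng(0)
e_src_img = rng.normal(size=512)
e_src_txt = rng.normal(size=512)
shift = rng.normal(size=512)
# If the image moves along the same CLIP direction the text prescribes, the loss -> 0.
print(directional_loss(e_src_img, e_src_img + shift, e_src_txt, e_src_txt + shift))
```

The global loss, by contrast, would simply maximize the cosine similarity between the stylized image embedding and the style text embedding; Fig. 6 shows the effect of each term in isolation.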
| Parameter | SSIM↑ | LPIPS↓ | CLIP score↑ |
|---|---|---|---|
| n=0 | 0.785 | 0.347 | 26.59 |
| n=3 | 0.802 | 0.277 | 26.83 |
| n=6 | 0.782 | 0.298 | 26.16 |
| n=9 | 0.728 | 0.334 | 25.30 |
| n=12 | 0.672 | 0.372 | 25.30 |
| n=15 | 0.681 | 0.356 | 25.55 |
| n=18 | 0.689 | 0.347 | 25.48 |

Table 3 Quantitative ablation results for hyper-parameter n
Fig. 8 The ablation experiment of hyper-parameters s and b ((a) Source image; (b) Text prompt; (c) s=1.0, b=1.0; (d) s=1.0, b=1.5; (e) s=1.0, b=2.0; (f) s=1.0, b=2.5; (g) s=1.0, b=3.0; (h) s=0.5, b=2.0; (i) s=0.8, b=2.0; (j) s=1.0, b=2.0; (k) s=1.2, b=2.0; (l) s=1.5, b=2.0)
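Fig. 8 sweeps two scaling factors s and b. In FreeU [22], which this setup resembles, b rescales backbone feature maps and s rescales skip-connection features where they are fused in the diffusion U-Net decoder. The toy sketch below shows only that scalar-scaling step; shapes and names are illustrative, and FreeU itself additionally restricts b to a subset of channels and damps skips in the Fourier domain:

```python
import numpy as np

def fuse(backbone, skip, b=1.0, s=1.0):
    """FreeU-style fusion: amplify backbone features by b, rescale skips by s,
    then concatenate along the channel axis as a U-Net decoder block would."""
    return np.concatenate([b * backbone, s * skip], axis=0)

backbone = np.ones((4, 8, 8))  # (channels, H, W) toy feature maps
skip = np.ones((4, 8, 8))
out = fuse(backbone, skip, b=2.0, s=0.5)
print(out.shape)                   # (8, 8, 8)
print(out[0, 0, 0], out[4, 0, 0])  # 2.0 0.5
```

Increasing b strengthens the denoising backbone's (often style-dominated) contribution, while s controls how much high-frequency source detail the skips reinject, which is why the two are swept jointly in Fig. 8.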
Fig. 9 Experiment on disentanglement of content and style ((a) Source image; (b) Text prompt; (c) α=0.4; (d) α=0.6; (e) α=0.8; (f) α=1.0; (g) α=1.2; (h) α=1.4; (i) α=1.6)
[1] CHENG B, LIU Z H, PENG Y B, et al. General image-to-image translation with one-shot image guidance[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 22736-22746.
[2] GATYS L A, ECKER A S, BETHGE M. Image style transfer using convolutional neural networks[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2016: 2414-2423.
[3] WANG C C, WANG Y L, GE Z Q, et al. Convolutional neural network-based Chinese ink-painting artistic style extraction[J]. Journal of Graphics, 2017, 38(5): 754-759 (in Chinese).
[4] LI X, PU Y Y, ZHAO Z P, et al. Content semantics and style features match consistent artistic style transfer[J]. Journal of Graphics, 2023, 44(4): 699-709 (in Chinese).
[5] HUANG X, BELONGIE S. Arbitrary style transfer in real-time with adaptive instance normalization[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 1501-1510.
[6] JING Y C, LIU X, DING Y K, et al. Dynamic instance normalization for arbitrary style transfer[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 4369-4376.
[7] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// The 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2672-2680.
[8] PARK T, EFROS A A, ZHANG R, et al. Contrastive learning for unpaired image-to-image translation[C]// The 16th European Conference on Computer Vision. Cham: Springer, 2020: 319-345.
[9] ZHU J Y, PARK T, ISOLA P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]// 2017 IEEE International Conference on Computer Vision. New York: IEEE Press, 2017: 2223-2232.
[10] GAL R, PATASHNIK O, MARON H, et al. StyleGAN-NADA: CLIP-guided domain adaptation of image generators[J]. ACM Transactions on Graphics (TOG), 2022, 41(4): 141.
[11] KWON G, YE J C. CLIPstyler: image style transfer with a single text condition[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 18062-18071.
[12] SAHARIA C, HO J, CHAN W, et al. Image super-resolution via iterative refinement[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4713-4726.
[13] KIM G, KWON T, YE J C. DiffusionCLIP: text-guided diffusion models for robust image manipulation[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 2426-2435.
[14] KWON G, YE J C. Diffusion-based image translation using disentangled style and content representation[EB/OL]. [2024-05-04]. https://dblp.uni-trier.de/db/conf/iclr/iclr2023.html#KwonY23.
[15] YANG S, HWANG H, YE J C. Zero-shot contrastive loss for text-guided diffusion image style transfer[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 22873-22882.
[16] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2022: 10684-10695.
[17] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-05-04]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#RadfordKHRGASAM21.
[18] MOKADY R, HERTZ A, ABERMAN K, et al. Null-text inversion for editing real images using guided diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 6038-6047.
[19] EVERAERT M N, BOCCHIO M, ARPA S, et al. Diffusion in style[C]// 2023 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2023: 2251-2261.
[20] CHUNG J, HYUN S, HEO J P. Style injection in diffusion: a training-free approach for adapting large-scale diffusion models for style transfer[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 8795-8805.
[21] HERTZ A, MOKADY R, TENENBAUM J, et al. Prompt-to-prompt image editing with cross-attention control[EB/OL]. [2024-05-04]. https://dblp.uni-trier.de/db/conf/iclr/iclr2023.html#HertzMTAPC23.
[22] SI C Y, HUANG Z Q, JIANG Y M, et al. FreeU: free lunch in diffusion U-net[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 4733-4743.
[23] JEONG J, KWON M, UH Y. Training-free content injection using h-space in diffusion models[C]// 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2024: 5151-5161.
[24] ZHANG H S, YIN X Q, YU J H. Real-time rendering of 3D Chinese painting effects[J]. Journal of Computer-Aided Design & Computer Graphics, 2004, 16(11): 1485-1489 (in Chinese).
[25] QIAN X Y, XIAO L, WU H Z. Fast style transfer[J]. Computer Engineering, 2006, 32(21): 15-17, 46 (in Chinese).
[26] LI X T, LIU S F, KAUTZ J, et al. Learning linear transformations for fast image and video style transfer[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 3809-3817.
[27] PARK D Y, LEE K H. Arbitrary style transfer with style-attentional networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 5880-5888.
[28] SONG C J, WU Z J, ZHOU Y, et al. ETNet: error transition network for arbitrary style transfer[C]// The 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 61.
[29] WU Z J, SONG C J, ZHOU Y, et al. EFANet: exchangeable feature alignment network for arbitrary style transfer[C]// The 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12305-12312.
[30] LIU S H, LIN T W, HE D L, et al. AdaAttN: revisit attention mechanism in arbitrary neural style transfer[C]// 2021 IEEE/CVF International Conference on Computer Vision. New York: IEEE Press, 2021: 6649-6658.
[31] KARRAS T, LAINE S, AILA T. A style-based generator architecture for generative adversarial networks[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2019: 4401-4410.
[32] KARRAS T, AITTALA M, HELLSTEN J, et al. Training generative adversarial networks with limited data[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1015.
[33] ZHOU Y, CHEN Z C, HUANG H. Deformable one-shot face stylization via DINO semantic guidance[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 7787-7796.
[34] ZHANG Y X, DONG W M, TANG F, et al. ProSpect: prompt spectrum for attribute-aware personalization of diffusion models[EB/OL]. [2024-05-04]. https://arxiv.org/abs/2305.16225.
[35] ZHANG Y X, HUANG N S, TANG F, et al. Inversion-based style transfer with diffusion models[C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2023: 10146-10156.
[36] DENG Y Y, HE X Y, TANG F, et al. Z*: zero-shot style transfer via attention reweighting[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 6934-6944.
[37] SOHN K, RUIZ N, LEE K, et al. StyleDrop: text-to-image generation in any style[EB/OL]. [2024-05-04]. https://arxiv.org/abs/2306.00983.
[38] AHN N, LEE J, LEE C, et al. DreamStyler: paint by style inversion with text-to-image diffusion models[C]// The 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 674-681.
[39] QI T H, FANG S C, WU Y Z, et al. DEADiff: an efficient stylization diffusion model with disentangled representations[C]// 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2024: 8693-8702.
[40] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]// The 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 574.
[41] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[C]// The 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 672.
[42] NICHOL A Q, DHARIWAL P. Improved denoising diffusion probabilistic models[EB/OL]. [2024-05-04]. https://dblp.uni-trier.de/db/conf/icml/icml2021.html#NicholD21.
[43] SONG J M, MENG C L, ERMON S. Denoising diffusion implicit models[EB/OL]. [2024-05-04]. https://dblp.uni-trier.de/db/conf/iclr/iclr2021.html#SongME21.
[44] PAN Z H, ZHOU X, TIAN H. Arbitrary style guidance for enhanced diffusion-based text-to-image generation[C]// 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. New York: IEEE Press, 2023: 4461-4471.
[45] HE F H, LI G, SUN F H, et al. FreeStyle: free lunch for text-guided style transfer using diffusion models[EB/OL]. [2024-05-04]. https://arxiv.org/abs/2401.15636.
[46] TOV O, ALALUF Y, NITZAN Y, et al. Designing an encoder for StyleGAN image manipulation[J]. ACM Transactions on Graphics (TOG), 2021, 40(4): 133.
[47] WANG Z, BOVIK A C, SHEIKH H R, et al. Image quality assessment: from error visibility to structural similarity[J]. IEEE Transactions on Image Processing, 2004, 13(4): 600-612.
[48] ZHANG R, ISOLA P, EFROS A A, et al. The unreasonable effectiveness of deep features as a perceptual metric[C]// 2018 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE Press, 2018: 586-595.