Semantix: An Energy-Guided Sampler for Semantic Style Transfer

Huiang He1 Minghui Hu2 Chuanxia Zheng3 Chaoyue Wang4 Tat-Jen Cham2
1South China University of Technology   2Nanyang Technological University
3University of Oxford    4The University of Sydney

ICLR 2025

Paper | GitHub


Given a visual context and a reference image (top examples), Semantix performs Semantic Style Transfer based on semantic correspondence. Moreover, Semantix can be directly applied to videos (bottom examples) without any additional modification.

Abstract

Recent advances in style and appearance transfer are impressive, but most methods isolate global style and local appearance transfer, neglecting semantic correspondence. Additionally, image and video tasks are typically handled in isolation, with little focus on integrating them for video transfer. To address these limitations, we introduce a novel task, Semantic Style Transfer, which involves transferring style and appearance features from a reference image to a target visual content based on semantic correspondence. We subsequently propose a training-free method, Semantix, an energy-guided sampler designed for Semantic Style Transfer that simultaneously guides both style and appearance transfer based on the semantic understanding capacity of pre-trained diffusion models. Additionally, as a sampler, Semantix can be seamlessly applied to both image and video models, enabling semantic style transfer to be generic across various visual media. Specifically, after inverting both reference and context images or videos into the noise space via SDEs, Semantix utilises a meticulously crafted energy function to guide the sampling process, comprising three key components: Style Feature Guidance, Spatial Feature Guidance, and Semantic Distance as a regularisation term. Experimental results demonstrate that Semantix not only effectively accomplishes the task of semantic style transfer across images and videos, but also surpasses existing state-of-the-art solutions in both fields.

Method


Given a reference image Iref and a context image Ic or video Vc, we first invert them to the latent x_T through an edit-friendly DDPM inversion. During the denoising process, we then modify x_t^out via the designed energy gradient at every sampling step. Our proposed energy function comprises three terms: i) Style Feature Guidance, to align the style features with the reference image; ii) Spatial Feature Guidance, to maintain spatial coherence with the context; and iii) Semantic Distance, to regularise the whole function.
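The energy-guided sampling loop above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the three terms are stood in by simple quadratic distances on feature vectors (the real method computes them from pre-trained diffusion-model features), and the function names `energy_and_grad` and `guided_sampling_step` are hypothetical.

```python
import numpy as np

def energy_and_grad(x, ref_feat, ctx_feat,
                    lam_style=1.0, lam_spatial=1.0, lam_sem=0.1):
    """Toy energy with three terms mirroring the paper's structure:
    a style term pulling toward reference features, a spatial term
    pulling toward context features, and a semantic-distance regulariser.
    Returns the scalar energy and its analytic gradient w.r.t. x."""
    e_style = 0.5 * np.sum((x - ref_feat) ** 2)    # Style Feature Guidance (stand-in)
    e_spatial = 0.5 * np.sum((x - ctx_feat) ** 2)  # Spatial Feature Guidance (stand-in)
    e_sem = 0.5 * np.sum(x ** 2)                   # Semantic Distance regulariser (stand-in)
    energy = lam_style * e_style + lam_spatial * e_spatial + lam_sem * e_sem
    grad = (lam_style * (x - ref_feat)
            + lam_spatial * (x - ctx_feat)
            + lam_sem * x)
    return energy, grad

def guided_sampling_step(x_t, ref_feat, ctx_feat, step_size=0.1):
    """One energy-guided update: nudge the latent x_t against the
    energy gradient before the next denoising step."""
    _, grad = energy_and_grad(x_t, ref_feat, ctx_feat)
    return x_t - step_size * grad
```

In the actual sampler this gradient correction is applied at every denoising step of the inverted latent; here a single step simply lowers the toy energy.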

Video Semantic Style Transfer Results

Image and Video Results for Semantic Style Transfer

(a) Image Examples w/ Semantix
(b) Video Examples w/ Semantix

Image Style Transfer Results

(a) Image semantic style transfer (style) results on given context and reference image pairs.

(b) Image semantic style transfer (style) results on given context and reference image pairs.

Image Appearance Transfer Results

(a) Image semantic style transfer (appearance) results on given context and reference image pairs.

(b) Image semantic style transfer (appearance) results on given context and reference image pairs.

Comparison of Style Transfer

Comparison with baselines on style transfer task.

Comparison of Appearance Transfer

Comparison with baselines on appearance transfer task.

BibTex

@inproceedings{he2025semantix,
  title={Semantix: An Energy-guided Sampler for Semantic Style Transfer},
  author={Huiang He and Minghui Hu and Chuanxia Zheng and Chaoyue Wang and Tat-Jen Cham},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=si37wk8U5D}
}

Acknowledgements

We appreciate the contributions of related work in this field. The website template is borrowed from DreamBooth.