Diff-TST: Diffusion model for one-shot text-image style transfer

  • Sizhe Pang
  • Xinyuan Chen
  • Yangchen Xie
  • Hongjian Zhan
  • Bing Yin
  • Yue Lu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

In recent years, there have been significant breakthroughs in style transfer that show impressive results, enabling the automatic creation of high-quality and diverse images. However, these models struggle to generate accurate and realistic text images, especially for languages with complex glyphs. Moreover, existing methods often depend on a large amount of multimodally labeled data for supervision, restricting their applicability to languages with extensive labeled resources. In this work, we propose a universal one-shot text-image style transfer framework via the diffusion model, Diff-TST, which requires only one word-level content label and one style reference image for any language. To this end, Diff-TST decomposes the content and style features of text images and generates style-transferred images by incorporating content and style guidance into the diffusion process. To further address the inaccuracy of text generation in the diffusion model, we propose character-wise encoding to encode the text at the character level and introduce positional encoding and cross-attention to align the input text with the content of the generated image. Extensive qualitative and quantitative experimental results show that our method is applicable to multiple languages (i.e., English, Thai, and Kazakh) and outperforms existing methods.
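To make the content-conditioning idea in the abstract concrete, the sketch below illustrates one plausible reading of character-wise encoding with positional encoding and cross-attention: each character of the target word is embedded separately, given a positional encoding, and then attended to by image features from the denoising network. All class names, dimensions, and the character vocabulary are illustrative assumptions, not the authors' actual Diff-TST implementation.

```python
import math
import torch
import torch.nn as nn


class CharacterEncoder(nn.Module):
    """Embed a word character-by-character and add sinusoidal positional encoding."""

    def __init__(self, vocab_size: int = 128, dim: int = 256, max_len: int = 32):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        # Fixed sinusoidal positional encoding (a common choice; an assumption here).
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, num_chars) integer indices of the target word's characters
        x = self.char_embed(char_ids)                 # (batch, num_chars, dim)
        return x + self.pe[: x.size(1)].unsqueeze(0)  # add per-character positions


class ContentCrossAttention(nn.Module):
    """Cross-attention letting image features query the character embeddings."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats: torch.Tensor, char_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) from the diffusion denoiser
        # char_feats:  (batch, num_chars, dim) from CharacterEncoder
        out, _ = self.attn(query=image_feats, key=char_feats, value=char_feats)
        return image_feats + out  # residual connection keeps the image pathway intact


if __name__ == "__main__":
    encoder, cross_attn = CharacterEncoder(), ContentCrossAttention()
    chars = torch.randint(0, 128, (2, 5))   # two toy 5-character words
    patches = torch.randn(2, 64, 256)       # toy flattened image feature map
    fused = cross_attn(patches, encoder(chars))
    print(fused.shape)                      # torch.Size([2, 64, 256])
```

In this reading, the per-character positions give the attention map an explicit notion of character order, which is what lets the generated glyphs line up with the input text.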

Original language: English
Article number: 125747
Journal: Expert Systems with Applications
Volume: 263
DOIs
State: Published - 5 Mar 2025

Keywords

  • Diffusion model
  • One-shot style transfer
  • Scene text generation
