TY - JOUR
T1 - Diff-TST: Diffusion model for one-shot text-image style transfer
AU - Pang, Sizhe
AU - Chen, Xinyuan
AU - Xie, Yangchen
AU - Zhan, Hongjian
AU - Yin, Bing
AU - Lu, Yue
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2025/3/5
Y1 - 2025/3/5
AB - In recent years, significant breakthroughs in style transfer have shown impressive results, enabling the automatic creation of high-quality and diverse images. However, these models struggle to generate accurate and realistic text images, especially for languages with complex glyphs. Moreover, existing methods often depend on large amounts of multimodally labeled data for supervision, restricting their applicability to languages with extensive labeled resources. In this work, we propose Diff-TST, a universal one-shot text-image style transfer framework based on the diffusion model that requires only one word-level content label and one style reference image for any language. To this end, Diff-TST decomposes the content and style features of text images and generates style-transferred images by incorporating content and style guidance into the diffusion process. To further address the inaccuracy of text generation with diffusion models, we propose character-wise encoding, which encodes the text at the character level, and introduce positional encoding and cross-attention to align the input text with the content of the generated image. Extensive qualitative and quantitative experiments show that our method is applicable to multiple languages (i.e., English, Thai, and Kazakh) and outperforms existing methods.
KW - Diffusion model
KW - One-shot style transfer
KW - Scene text generation
UR - https://www.scopus.com/pages/publications/85209396293
DO - 10.1016/j.eswa.2024.125747
M3 - Article
AN - SCOPUS:85209396293
SN - 0957-4174
VL - 263
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 125747
ER -