ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text

  • Haifeng Ni*
  • Ming Xu
  • *Corresponding author for this work
  • East China Normal University

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
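Two ideas in the abstract can be illustrated with a minimal Python sketch: joining the garment and person images along the width dimension, and limiting training to attention parameters. This is not the authors' implementation. The images are toy nested lists, the garment-on-the-left ordering is an assumption, and the function names, parameter names, and the `attn` keyword are hypothetical.

```python
def concat_along_width(garment, person):
    """Join two images side by side; each image is a list of rows of pixels."""
    if len(garment) != len(person):
        raise ValueError("images must have the same height")
    # Row i of the result is garment row i followed by person row i.
    # Putting the garment on the left is an assumption, not stated in the abstract.
    return [g_row + p_row for g_row, p_row in zip(garment, person)]


def select_trainable(param_names, keyword="attn"):
    """Return the parameter names to train; all other parameters stay frozen."""
    return [name for name in param_names if keyword in name]


# Toy single-channel "images": a 2 x 2 garment and a 2 x 3 person.
garment = [[1, 1], [1, 1]]
person = [[0, 0, 0], [0, 0, 0]]
joint = concat_along_width(garment, person)

# Hypothetical parameter names, not taken from the paper or from any library.
names = [
    "block.attn.to_q.weight",
    "block.attn.to_k.weight",
    "block.mlp.fc1.weight",
    "block.norm.weight",
]
trainable = select_trainable(names)
```

In this sketch the joint image keeps the original height and has a width equal to the sum of the two input widths, and only the two names containing `attn` are selected for training.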

Original language: English
Title of host publication: Pattern Recognition and Computer Vision - 8th Chinese Conference, PRCV 2025, Proceedings
Editors: Josef Kittler, Hongkai Xiong, Jian Yang, Xilin Chen, Jiwen Lu, Weiyao Lin, Jingyi Yu, Weishi Zheng
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 460-474
Number of pages: 15
ISBN (Print): 9789819556786
DOIs
State: Published - 2026
Event: 8th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2025 - Shanghai, China
Duration: 15 Oct 2025 to 18 Oct 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16277 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2025
Country/Territory: China
City: Shanghai
Period: 15/10/25 to 18/10/25

Keywords

  • diffusion transformer
  • parameter training
  • virtual try-on
