Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Jinhao Li, Zijian Chen, Runze Jiang, Tingzhu Chen*, Changbo Wang*, Guangtao Zhai
*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, existing OBI datasets suffer from a long-tail distribution, which leads to uneven performance of OBI recognition models between majority and minority classes. With recent advances in generative models, synthesis-based data augmentation has become a promising way to expand the sample size of minority classes. However, current OBI datasets lack large-scale structure-aligned image pairs for training generative models. To address these problems, we first present Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo-OBI generator, called OBIDiff, for realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it transfers the noise style of the original rubbing to the glyph image. Extensive experiments on downstream OBI tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff accurately preserves inherent glyph structures while transferring authentic rubbing styles. The dataset, code, and pre-trained models are available at https://github.com/LJHolyGround/Oracle-P15K.

Original language: English
Title of host publication: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
Publisher: Association for Computing Machinery, Inc
Pages: 7729-7738
Number of pages: 10
ISBN (Electronic): 9798400720352
State: Published - 27 Oct 2025
Event: 33rd ACM International Conference on Multimedia, MM 2025 - Dublin, Ireland
Duration: 27 Oct 2025 to 31 Oct 2025

Publication series

Name: MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025

Conference

Conference: 33rd ACM International Conference on Multimedia, MM 2025
Country/Territory: Ireland
City: Dublin
Period: 27/10/25 to 31/10/25

Keywords

  • dataset
  • diffusion model
  • image denoising
  • oracle bone inscriptions
  • oracle character recognition
