Unsupervised temporal action segmentation with sample discrimination training and alignment-based boundary refinement

  • Feng Huang
  • Xiao Diao Chen*
  • Hongyu Chen
  • Haichuan Song

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Unsupervised temporal action segmentation (UTAS) addresses the task of partitioning untrimmed videos into coherent action segments without manual annotations. While boundary-detection-based approaches have demonstrated superior performance, they exhibit two critical limitations. First, these methods typically treat all frames uniformly during training, resulting in over-segmentation and suboptimal performance. Second, they rely primarily on intra-video features while neglecting potentially valuable inter-video correlations within the dataset. To address these challenges, we present a comprehensive UTAS framework with three key innovations: (1) a discriminative training mechanism that differentiates between boundary and non-boundary frames in the temporal domain and between motion and background pixels in the spatial domain, employing weighted training strategies alongside modeling at multiple temporal scales; (2) a self-validation mechanism for cross-verifying predictions across different input sequences; and (3) a boundary refinement approach based on video alignment, which constructs reference video sets according to feature distributions and establishes inter-video correspondences to improve boundary localization. Extensive evaluations on three benchmark datasets (Breakfast, 50Salads, and YouTube Instructions) show that our approach achieves state-of-the-art performance, with significant quantitative improvements over existing methods.

Original language: English
Article number: 131636
Journal: Neurocomputing
Volume: 658
State: Published - 28 Dec 2025

Keywords

  • Action boundary refinement
  • Action segmentation boundaries
  • Optimal transport
  • Sample discrimination training
  • Unsupervised action segmentation
  • Video alignment
