TY - JOUR
T1 - Unsupervised temporal action segmentation with sample discrimination training and alignment-based boundary refinement
AU - Huang, Feng
AU - Chen, Xiao Diao
AU - Chen, Hongyu
AU - Song, Haichuan
N1 - Publisher Copyright:
© 2025
PY - 2025/12/28
Y1 - 2025/12/28
N2 - Unsupervised temporal action segmentation (UTAS) addresses the task of partitioning untrimmed videos into coherent action segments without manual annotations. While boundary-detection-based approaches have demonstrated superior performance, they exhibit two critical limitations. First, these methods often treat all frames uniformly during training, resulting in over-segmentation and suboptimal performance. Second, they primarily rely on intra-video features while neglecting potentially valuable inter-video correlations within the dataset. To address these challenges, we present a comprehensive UTAS framework with three key innovations: (1) A discriminative training mechanism that differentiates between boundary/non-boundary frames in the temporal domain and motion/background pixels in the spatial domain, employing weighted training strategies alongside multiple temporal-scale modeling. (2) A self-validation mechanism for cross-verifying predictions across different input sequences. (3) A boundary refinement approach based on video alignment, which constructs reference video sets according to feature distributions and establishes inter-video correspondences to improve boundary localization. Extensive evaluations on three benchmark datasets, i.e., Breakfast, 50Salads, and YouTube Instructions, demonstrate that our approach achieves state-of-the-art performance, with quantitative results showing significant improvements over existing methods.
AB - Unsupervised temporal action segmentation (UTAS) addresses the task of partitioning untrimmed videos into coherent action segments without manual annotations. While boundary-detection-based approaches have demonstrated superior performance, they exhibit two critical limitations. First, these methods often treat all frames uniformly during training, resulting in over-segmentation and suboptimal performance. Second, they primarily rely on intra-video features while neglecting potentially valuable inter-video correlations within the dataset. To address these challenges, we present a comprehensive UTAS framework with three key innovations: (1) A discriminative training mechanism that differentiates between boundary/non-boundary frames in the temporal domain and motion/background pixels in the spatial domain, employing weighted training strategies alongside multiple temporal-scale modeling. (2) A self-validation mechanism for cross-verifying predictions across different input sequences. (3) A boundary refinement approach based on video alignment, which constructs reference video sets according to feature distributions and establishes inter-video correspondences to improve boundary localization. Extensive evaluations on three benchmark datasets, i.e., Breakfast, 50Salads, and YouTube Instructions, demonstrate that our approach achieves state-of-the-art performance, with quantitative results showing significant improvements over existing methods.
KW - Action boundary refinement
KW - Action segmentation boundaries
KW - Optimal transport
KW - Sample discrimination training
KW - Unsupervised action segmentation
KW - Video alignment
UR - https://www.scopus.com/pages/publications/105017614608
U2 - 10.1016/j.neucom.2025.131636
DO - 10.1016/j.neucom.2025.131636
M3 - Article
AN - SCOPUS:105017614608
SN - 0925-2312
VL - 658
JO - Neurocomputing
JF - Neurocomputing
M1 - 131636
ER -