Self-supervised Compressed Video Action Recognition via Temporal-Consistent Sampling

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Compressed video action recognition aims to classify actions directly in compressed video rather than in decoded (standard) video. It benefits from fast training and inference because it reduces the use of redundant information. However, off-the-shelf methods still rely on costly labels for training. In this paper, we propose a self-supervised compressed video action recognition method based on Momentum Contrast (MoCo) and temporal-consistent sampling. We incorporate temporal-consistent sampling into MoCo to improve the feature representation of each input modality of compressed video. Modality-oriented fine-tuning is introduced to apply the pretrained model to downstream compressed video action recognition. Extensive experiments demonstrate the effectiveness of our method on different datasets with different backbones. Compared with SOTA self-supervised learning methods for decoded videos on the HMDB51 dataset, our method achieves the highest accuracy of 57.8%.

Original language: English
Title of host publication: Neural Information Processing - 28th International Conference, ICONIP 2021, Proceedings
Editors: Teddy Mantoro, Minho Lee, Media Anugerah Ayu, Kok Wai Wong, Achmad Nizar Hidayanto
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 237-249
Number of pages: 13
ISBN (Print): 9783030922726
State: Published - 2021
Event: 28th International Conference on Neural Information Processing, ICONIP 2021 - Virtual, Online
Duration: 8 Dec 2021 - 12 Dec 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13111 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International Conference on Neural Information Processing, ICONIP 2021
City: Virtual, Online
Period: 8/12/21 - 12/12/21

Keywords

  • Action recognition
  • Compressed video
  • Contrastive learning
  • Temporal-consistent sampling
