
End-to-End Video Object Detection with Spatial-Temporal Transformers

  • Lu He
  • Qianyu Zhou
  • Xiangtai Li
  • Li Niu
  • Guangliang Cheng
  • Xiao Li
  • Wenxuan Liu
  • Yunhai Tong
  • Lizhuang Ma
  • Liqing Zhang*
  • *Corresponding author for this work
  • Shanghai Jiao Tong University
  • Peking University
  • SenseTime Group Limited
  • University of California at Los Angeles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing comparably to earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline by removing the need for many hand-crafted feature-aggregation components, e.g., optical flow, recurrent neural networks, and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve on the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope TransVOD can provide a new perspective for video object detection.
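
To make the idea of fusing object queries across frames more concrete, the following is a minimal sketch and not the TransVOD implementation. It assumes that fusion can be illustrated with plain single-head scaled dot-product cross-attention plus a residual connection, with no learned projections and no deformable attention. The function name fuse_queries and the toy vectors are hypothetical and are used only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current_queries, reference_queries):
    """Hypothetical helper: each current-frame query attends to the queries
    gathered from reference frames, and the attended vector is added back
    to the original query (residual connection).

    Assumption: keys and values are the raw reference queries, because this
    sketch has no learned projection matrices.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Attention weights of this query over all reference queries.
        weights = softmax([dot(query, key) * scale for key in reference_queries])
        # Weighted sum of the reference queries.
        attended = [
            sum(w * ref[i] for w, ref in zip(weights, reference_queries))
            for i in range(dim)
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy example: 2 current-frame queries, 3 reference-frame queries, dimension 2.
current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, reference)
```

The output has one fused vector per current-frame query, so the number of detections per frame is unchanged. A real model would add learned projections, multiple heads, and normalization layers.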

Original language: English
Title of host publication: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 1507-1516
Number of pages: 10
ISBN (electronic): 9781450386517
DOI
Publication status: Published - 17 Oct 2021
Externally published
Event: 29th ACM International Conference on Multimedia, MM 2021 - Virtual, Online, China
Duration: 20 Oct 2021 to 24 Oct 2021

Publication series

Name: MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia

Conference

Conference: 29th ACM International Conference on Multimedia, MM 2021
Country/Territory: China
Virtual, Online
Period: 20/10/21 to 24/10/21
