基于动态采样对偶可变形网络的实时视频实例分割

Translated title of the contribution: Dynamic sampling dual deformable network for online video instance segmentation
Yiran Song, Qianyu Zhou, Zhiwen Shao, Ran Yi, Lizhuang Ma*
*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

The dynamic sampling dual deformable network (DSDDN) is proposed to speed up inference in video instance segmentation by making better use of temporal information across video frames. A dynamic sampling strategy adjusts the sampling policy according to the similarity between consecutive frames. When the current frame is highly similar to the preceding one, inference on the current frame is skipped and the preceding frame's segmentation results are transferred to it through a lightweight computation. When similarity is low, frames across a larger temporal span are dynamically aggregated to enrich the information available for the current frame. Two deformable operations are incorporated into the Transformer structure to avoid the quadratic computational cost of attention-based methods. The resulting network is optimized with carefully designed tracking heads and loss functions. On the YouTube-VIS dataset, the method achieves 39.1% mAP at 40.2 frames per second, indicating a favorable balance between accuracy and speed for real-time video instance segmentation.
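
The dynamic sampling decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity measure, the `threshold` and `span` values, and the names `cosine_similarity` and `plan_sampling` are all assumptions made for illustration, since the abstract does not say how similarity is computed or which frames are aggregated.

```python
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def plan_sampling(
    frame_features: List[Sequence[float]],
    threshold: float = 0.9,  # hypothetical similarity cut-off
    span: int = 3,           # hypothetical temporal span for aggregation
) -> List[Tuple[str, List[int]]]:
    """Return one (action, reference_frame_indices) pair per frame.

    "infer":     first frame, no predecessor to compare against.
    "propagate": frame is similar to its predecessor, so full inference
                 is skipped and the previous result is transferred.
    "aggregate": frame differs from its predecessor, so up to `span`
                 earlier frames are gathered as extra context.
    """
    if not frame_features:
        return []
    plan: List[Tuple[str, List[int]]] = [("infer", [])]
    for t in range(1, len(frame_features)):
        sim = cosine_similarity(frame_features[t - 1], frame_features[t])
        if sim >= threshold:
            plan.append(("propagate", [t - 1]))
        else:
            plan.append(("aggregate", list(range(max(0, t - span), t))))
    return plan


if __name__ == "__main__":
    # Toy per-frame feature vectors; frame 2 is a sharp scene change.
    features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
    for index, (action, refs) in enumerate(plan_sampling(features)):
        print(index, action, refs)
```

In this sketch a single threshold separates the two cases named in the abstract. A real system would compare learned frame features and would need the propagation and aggregation steps themselves, which are omitted here.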

Original language: Chinese (Traditional)
Pages (from-to): 247-256
Number of pages: 10
Journal: Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science)
Volume: 58
Issue number: 2
State: Published - Feb 2024
Externally published: Yes
