
Holistic Prototype Attention Network for Few-Shot Video Object Segmentation


Abstract:

Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images that contain pixel-level object annotations. Existing methods have demonstrated that the domain agent-based attention mechanism is effective in FSVOS by learning the correlation between support images and query frames. However, the agent frame contains redundant pixel information and background noise, resulting in inferior segmentation performance. Moreover, existing methods tend to ignore inter-frame correlations in query videos. To alleviate the above dilemma, we propose a holistic prototype attention network (HPAN) for advancing FSVOS. Specifically, HPAN introduces a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), transferring informative knowledge from seen to unseen classes. PGAM generates local prototypes from all foreground features and then utilizes their internal correlations to enhance the representation of the holistic prototypes. BPAM exploits the holistic information from support images and video frames by fusing co-attention and self-attention to achieve support-query semantic consistency and inter-frame temporal consistency. Extensive experiments on YouTube-FSVOS demonstrate the effectiveness and superiority of our proposed HPAN method. Our source code and models are available anonymously at https://github.com/NUST-Machine-Intelligence-Laboratory/HPAN.
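
For readers who want a concrete picture of the two modules, the following PyTorch sketch shows one plausible realization of the ideas named above: local prototypes obtained by masked average pooling over foreground features, refined by attention over the prototype set (in the spirit of PGAM), and query-frame tokens updated by fusing prototype co-attention with self-attention (in the spirit of BPAM). The pooling grid, layer sizes, and single-head attention are assumptions made for illustration; they are not taken from the official HPAN implementation linked above.

import torch
import torch.nn as nn
import torch.nn.functional as F


def local_prototypes(feat: torch.Tensor, mask: torch.Tensor, grid: int = 4) -> torch.Tensor:
    """Masked average pooling over a grid of regions to obtain local foreground
    prototypes (an assumed stand-in for PGAM's prototype generation).
    feat: (C, H, W) features, mask: (H, W) binary foreground mask."""
    fg = feat * mask.float()                                         # zero out background pixels
    cells_f = F.adaptive_avg_pool2d(fg.unsqueeze(0), grid)           # (1, C, g, g)
    cells_m = F.adaptive_avg_pool2d(mask[None, None].float(), grid)  # (1, 1, g, g)
    protos = cells_f / cells_m.clamp(min=1e-6)                       # normalize by foreground area
    return protos.squeeze(0).flatten(1).T                            # (g*g, C) local prototypes


class PrototypeGraphAttention(nn.Module):
    """Refine local prototypes with attention over the prototype set,
    a rough stand-in for PGAM's graph attention among prototypes."""

    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, protos: torch.Tensor) -> torch.Tensor:
        # protos: (P, dim); attention weights encode pairwise prototype correlations.
        attn = torch.softmax(self.q(protos) @ self.k(protos).T / protos.size(-1) ** 0.5, dim=-1)
        return protos + attn @ self.v(protos)  # residual keeps the original prototype information


class BidirectionalPrototypeAttention(nn.Module):
    """Fuse prototype-to-frame co-attention with frame self-attention,
    a rough stand-in for BPAM's fusion of the two consistencies."""

    def __init__(self, dim: int):
        super().__init__()
        self.co_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frame_tokens: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (T*H*W, dim) tokens from all query frames, protos: (P, dim).
        q = frame_tokens.unsqueeze(0)
        p = protos.unsqueeze(0)
        co, _ = self.co_attn(q, p, p)    # support-query semantic consistency
        sa, _ = self.self_attn(q, q, q)  # temporal consistency across query frames
        return self.fuse(torch.cat([co, sa], dim=-1)).squeeze(0)

In the full model the fused frame tokens would then be decoded into per-frame masks; the sketch stops at the token representation to keep the two attention paths explicit.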
Page(s): 6699 - 6709
Date of Publication: 18 July 2023


I. Introduction

With rapid advances in CNN-based [1] and attention-based [2] models, neural networks have achieved significant performance gains on various vision tasks, such as object detection [3], [4], [5], captioning [6], [7], [8], and semantic segmentation [9], [10], [11], [12], [13]. Among these tasks, video object segmentation (VOS), which aims to segment objects in video frames, has attracted increasing attention from researchers in recent years. VOS comprises two subtasks: (1) unsupervised VOS (UVOS), which segments salient objects without any annotation cues, and (2) semi-supervised VOS (SVOS), which uses the first-frame mask to segment specific entities. However, existing VOS models rely on large amounts of training data and can hardly recognize unseen classes. Consequently, few-shot video object segmentation (FSVOS), which uses a small set of annotated support images to discover objects of unseen classes in query videos, has been proposed to address these issues.
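
To make the episodic setting concrete, the sketch below shows the inputs and output of one FSVOS episode, assuming a K-shot support set for a single unseen class and a query video of T frames; the class and field names, tensor shapes, and the segment_episode helper are illustrative assumptions rather than the actual benchmark or HPAN interface.

from dataclasses import dataclass

import torch


@dataclass
class FSVOSEpisode:
    support_images: torch.Tensor  # (K, 3, H, W) annotated images of one unseen class
    support_masks: torch.Tensor   # (K, H, W) pixel-level foreground masks
    query_frames: torch.Tensor    # (T, 3, H, W) video frames to be segmented


def segment_episode(model, episode: FSVOSEpisode) -> torch.Tensor:
    """Predict a foreground mask for the support class in every query frame."""
    # The model consumes the support set and the whole query video at once,
    # returning one (T, H, W) mask tensor for the episode.
    return model(episode.support_images, episode.support_masks, episode.query_frames)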
