
Video OWL-ViT: Temporally-consistent open-world localization in video


Abstract:

We present an architecture and a training recipe that adapt pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pretraining, can be transferred successfully to open-world localization across diverse videos.
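To make the recurrent query-propagation idea from the abstract concrete, the sketch below illustrates it with stand-in components: the decoder's output tokens for frame t are reused as the object queries for frame t+1, so each query slot can follow the same object over time. This is a minimal illustration, not the authors' implementation; all names (encode_frame, decoder, track_video) and sizes are hypothetical placeholders.

```python
# Minimal sketch of recurrent query propagation across frames.
# Assumptions: a stand-in image encoder and a single cross-attention
# step play the roles of the OWL-ViT backbone and the video decoder.
import numpy as np

rng = np.random.default_rng(0)
NUM_QUERIES, DIM = 8, 16  # illustrative sizes only


def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the image encoder: maps a frame to patch tokens."""
    num_patches = 32
    return rng.standard_normal((num_patches, DIM))


def decoder(queries: np.ndarray, patch_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer decoder: one round of query-to-patch
    cross-attention, returning updated object tokens."""
    attn = queries @ patch_tokens.T / np.sqrt(DIM)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return queries + attn @ patch_tokens


def track_video(frames):
    """Output tokens of frame t seed the queries for frame t+1, which is
    what ties a query slot to the same object across the video."""
    queries = rng.standard_normal((NUM_QUERIES, DIM))  # learned init in practice
    per_frame_tokens = []
    for frame in frames:
        patch_tokens = encode_frame(frame)
        queries = decoder(queries, patch_tokens)  # updated object tokens
        per_frame_tokens.append(queries)          # would feed box/class heads
    return per_frame_tokens


if __name__ == "__main__":
    video = [np.zeros((224, 224, 3)) for _ in range(4)]
    outputs = track_video(video)
    print(len(outputs), outputs[0].shape)  # 4 frames, (8, 16) object tokens each
```

In a real model the per-frame object tokens would additionally be decoded into boxes and open-vocabulary class embeddings, as in the OWL-ViT detection heads; the loop structure above only captures the temporal recurrence.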
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France

1. Introduction

A central goal in computer vision is to develop models that can understand diverse and novel scenarios in the visual world. While this has been difficult for methods developed on datasets with closed label spaces, web-scale image-text pretraining has recently led to dramatic improvements in open-world performance on a range of image-level vision tasks [12], [26], [19].
