EMTrack: Efficient Multimodal Object Tracking

Abstract:

Multimodal object tracking has received increasing attention, given the limited representation ability of the single RGB modality in certain challenging scenarios. Recent prompt-tuning techniques enable multimodal tracking to effectively inherit knowledge from foundation models trained with a large amount of RGB tracking data and to achieve parameter-efficient training. However, few works focus on efficient inference for multimodal trackers that handle multiple RGB-X (RGB-Thermal, RGB-Depth, RGB-Event, etc.) tracking tasks simultaneously, especially on resource-limited devices such as CPUs. In this work, we propose an efficient multimodal tracker named EMTrack. EMTrack follows a concise and unified multimodal tracking framework with simple knowledge distillation. The RGB modality and the auxiliary modality are added after the patch-embedding layer for fusion, reducing the computational complexity of multimodal tracking to a level comparable with that of single-modality tracking. Before the fusion operation, we introduce a modal-specific spatial modulation module to realize adaptive spatial adjustment of different modality features. Multiple modal-specific experts are adopted to capture specific information for different RGB-X tracking tasks, which helps a unified model handle these tasks with joint training. EMTrack achieves competitive performance on various RGB-X tracking benchmarks while reaching a good balance of performance and speed on different platforms. In particular, on an Intel Core i9-10850K CPU, EMTrack runs at 29.1 fps, a real-time speed, with only 2.0G MACs of computation.
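
The abstract states that the two modalities are added after the patch-embedding layer. A rough way to see why this keeps the cost low is to count multiply-accumulates (MACs) for a standard transformer block as a function of token count. The sketch below is illustrative only: the token count, embedding width, and depth are hypothetical values rather than EMTrack's configuration, and token concatenation is shown merely as a common alternative fusion scheme, not as a baseline taken from the paper. Softmax, normalization, and bias terms are ignored.

```python
def block_macs(num_tokens: int, dim: int, mlp_ratio: int = 4) -> int:
    """Approximate MACs of one standard transformer block (attention + MLP)."""
    qkv = 3 * num_tokens * dim * dim             # query, key, value projections
    scores = num_tokens * num_tokens * dim       # Q @ K^T, summed over all heads
    weighted = num_tokens * num_tokens * dim     # attention weights @ V
    proj = num_tokens * dim * dim                # attention output projection
    mlp = 2 * mlp_ratio * num_tokens * dim * dim # two linear layers of the MLP
    return qkv + scores + weighted + proj + mlp


def backbone_macs(num_tokens: int, dim: int, depth: int) -> int:
    """MACs of a stack of `depth` identical blocks."""
    return depth * block_macs(num_tokens, dim)


# Hypothetical settings chosen only for illustration (not from the paper).
TOKENS_PER_MODALITY = 320  # e.g. template tokens plus search-region tokens
DIM = 192
DEPTH = 12

# Single RGB modality: the backbone sees one token sequence.
single_macs = backbone_macs(TOKENS_PER_MODALITY, DIM, DEPTH)

# Addition fusion: element-wise sum of RGB and X tokens keeps the token count,
# so the backbone cost matches the single-modality case.
added_macs = backbone_macs(TOKENS_PER_MODALITY, DIM, DEPTH)

# Concatenation fusion: the backbone sees twice as many tokens.
concat_macs = backbone_macs(2 * TOKENS_PER_MODALITY, DIM, DEPTH)

print(f"single modality : {single_macs / 1e9:.3f} G MACs")
print(f"addition fusion : {added_macs / 1e9:.3f} G MACs")
print(f"concat fusion   : {concat_macs / 1e9:.3f} G MACs "
      f"({concat_macs / single_macs:.2f}x single)")
```

Under these assumptions, addition fusion leaves the backbone cost unchanged relative to a single modality, while concatenation more than doubles it because the attention terms grow quadratically with the number of tokens. The second patch-embedding pass and any modulation or expert modules add cost on top of this and are not modeled here.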

Published in: IEEE Transactions on Circuits and Systems for Video Technology ( Volume: 35, Issue: 3, March 2025)

Page(s): 2202 - 2214

Date of Publication: 08 November 2024

DOI: 10.1109/TCSVT.2024.3494725

I. Introduction

RGB-based tracking, the main research branch of visual tracking, has developed greatly in recent years and achieved excellent performance on many benchmarks. However, RGB-only tracking may struggle in complicated scenes, such as those with extreme illumination or occlusion, which limits its application in fields that require high tracking robustness. Multimodal fusion has received considerable attention in visual perception fields such as segmentation [4], [5], [6], [7], detection [8], and image restoration [9]. In the tracking field, multimodal fusion obtains valuable information from auxiliary modalities, achieving complementary and comprehensive information extraction and integration for robust tracking.
