EMTrack: Efficient Multimodal Object Tracking


Abstract:

Multimodal object tracking has received increasing attention, given the limited representation ability of the single RGB modality in certain challenging scenarios. Recent prompt-tuning techniques enable multimodal tracking to effectively inherit knowledge from foundation models trained on large amounts of RGB tracking data and to achieve parameter-efficient training. However, few works focus on efficient inference for multimodal tracking that handles multiple RGB-X (RGB-Thermal, RGB-Depth, RGB-Event, etc.) tracking tasks simultaneously, especially on resource-limited devices such as CPUs. In this work, we propose an efficient multimodal tracker named EMTrack. EMTrack follows a concise and unified multimodal tracking framework with simple knowledge distillation. The RGB modality and the auxiliary modality are fused by addition after the patch-embedding layer, keeping the computational complexity of multimodal tracking close to that of single-modality tracking. Before fusion, we introduce a modal-specific spatial modulation module that performs adaptive spatial adjustment of the features of each modality. Multiple modal-specific experts capture task-specific information for the different RGB-X tracking tasks, which helps handle these tasks in a unified, jointly trained model. EMTrack achieves competitive performance on various RGB-X tracking benchmarks while reaching a good balance of performance and speed on different platforms. Notably, on an Intel Core i9-10850K CPU, EMTrack runs in real time at 29.1 fps with only 2.0G MACs of computation.
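A minimal sketch (not the authors' released code) of the pipeline the abstract describes: both modalities are patch-embedded, each passes through a modal-specific spatial modulation step, the tokens are fused by addition so the shared backbone runs only once, and per-task experts specialize the backbone for each RGB-X task. All module names, layer sizes, and the exact gating and fusion rules below are illustrative assumptions.

# Hedged sketch of early fusion with modal-specific modulation and experts.
import torch
import torch.nn as nn


class SpatialModulation(nn.Module):
    """Per-modality gating that adaptively re-weights each token before fusion."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, tokens):             # tokens: (B, N, C)
        return tokens * self.gate(tokens)  # adaptive spatial adjustment


class ExpertBlock(nn.Module):
    """Shared attention plus one FFN expert per RGB-X task (e.g. thermal, depth, event)."""
    def __init__(self, dim, num_tasks=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_tasks)
        )

    def forward(self, x, task_id):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.experts[task_id](self.norm2(x))  # modal-specific expert
        return x


class ToyMultimodalTracker(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, num_tasks=3):
        super().__init__()
        self.embed_rgb = nn.Conv2d(3, dim, patch, stride=patch)
        self.embed_aux = nn.Conv2d(1, dim, patch, stride=patch)  # thermal/depth/event map
        self.mod_rgb = SpatialModulation(dim)
        self.mod_aux = SpatialModulation(dim)
        self.blocks = nn.ModuleList(ExpertBlock(dim, num_tasks) for _ in range(depth))

    def forward(self, rgb, aux, task_id):
        t_rgb = self.embed_rgb(rgb).flatten(2).transpose(1, 2)  # (B, N, C)
        t_aux = self.embed_aux(aux).flatten(2).transpose(1, 2)
        # Fuse directly after patch embedding so the backbone runs once
        # on the fused tokens rather than once per modality.
        x = self.mod_rgb(t_rgb) + self.mod_aux(t_aux)
        for blk in self.blocks:
            x = blk(x, task_id)
        return x


if __name__ == "__main__":
    net = ToyMultimodalTracker()
    rgb = torch.randn(1, 3, 128, 128)
    thermal = torch.randn(1, 1, 128, 128)
    print(net(rgb, thermal, task_id=0).shape)  # torch.Size([1, 64, 256])

Under these assumptions, the backbone depth is paid once for the fused token sequence instead of once per modality, which is the intuition behind fusing immediately after patch embedding to keep CPU-side cost low.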
Page(s): 2202 - 2214
Date of Publication: 08 November 2024


I. Introduction

RGB-based tracking, the main research branch of visual tracking, has developed greatly in recent years and achieved excellent performance on many different benchmarks. However, RGB-only tracking may struggle in complicated scenes such as extreme illumination and occlusion, which limits its applications in fields that require high tracking robustness. Multimodal fusion has received considerable attention in visual perception tasks such as segmentation [4], [5], [6], [7], detection [8], and image restoration [9]. In the tracking field, multimodal fusion draws additional valuable information from auxiliary modalities, achieving complementary and comprehensive information extraction and integration for robust tracking.

References
1. C. Li et al., "LasHeR: A large-scale high-diversity benchmark for RGBT tracking", IEEE Trans. Image Process., vol. 31, pp. 392-404, 2021.
2. B. Ye, H. Chang, B. Ma, S. Shan and X. Chen, "Joint feature learning and relation modeling for tracking: A one-stream framework", Proc. 17th Eur. Conf. Comput. Vis. (ECCV), pp. 341-357, Oct. 2022.
3. J. Zhu, S. Lai, X. Chen, D. Wang and H. Lu, "Visual prompt multi-modal tracking", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 9516-9526, Jun. 2023.
4. B. Yin, X. Zhang, Z. Li, L. Liu, M.-M. Cheng and Q. Hou, "DFormer: Rethinking RGBD representation learning for semantic segmentation", arXiv:2309.09668, 2023.
5. J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu and R. Stiefelhagen, "CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers", IEEE Trans. Intell. Transp. Syst., vol. 24, no. 12, pp. 14679-14694, Dec. 2023.
6. G. Li, Y. Wang, Z. Liu, X. Zhang and D. Zeng, "RGB-T semantic segmentation with location activation and sharpening", IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 3, pp. 1223-1235, Mar. 2023.
7. Y. Lv, Z. Liu and G. Li, "Context-aware interaction network for RGB-T semantic segmentation", IEEE Trans. Multimedia, vol. 26, pp. 6348-6360, 2024.
8. W. Zhou, Y. Zhu, J. Lei, R. Yang and L. Yu, "LSNet: Lightweight spatial boosting network for detecting salient objects in RGB-thermal images", IEEE Trans. Image Process., vol. 32, pp. 1329-1340, 2023.
9. X. Deng, J. Xu, F. Gao, X. Sun and M. Xu, "Deep M²CDL: Deep multi-scale multi-modal convolutional dictionary learning network", IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 2770-2787, May 2024.
10. M. Jia et al., "Visual prompt tuning", Proc. Eur. Conf. Comput. Vis., pp. 709-727, Oct. 2022.
11. X. Hou et al., "SDSTrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking", arXiv:2403.16002, 2024.
12. X. Lu, C. Ma, B. Ni and X. Yang, "Adaptive region proposal with channel regularization for robust object tracking", IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 4, pp. 1268-1282, Apr. 2021.
13. B. Yan, H. Peng, J. Fu, D. Wang and H. Lu, "Learning spatio-temporal transformer for visual tracking", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 10448-10457, Oct. 2021.
14. Z. Wu et al., "Single-model and any-modality for video object tracking", arXiv:2311.15851, 2023.
15. S. Wang, J. Gao, Z. Li, X. Zhang and W. Hu, "A closer look at self-supervised lightweight vision transformers", Proc. Int. Conf. Mach. Learn., vol. 202, pp. 35624-35641, 2023.
16. H. Zhang, J. Wang, J. Zhang, T. Zhang and B. Zhong, "One-stream vision-language memory network for object tracking", IEEE Trans. Multimedia, vol. 26, pp. 1720-1730, 2024.
17. Y. Zheng, B. Zhong, Q. Liang, G. Li, R. Ji and X. Li, "Toward unified token learning for vision-language tracking", IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 4, pp. 2125-2135, Apr. 2024.
18. L. Zhou, Z. Zhou, K. Mao and Z. He, "Joint visual grounding and tracking with natural language specification", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 23151-23160, Jun. 2023.
19. Z. Li, R. Tao, E. Gavves, C. G. M. Snoek and A. W. M. Smeulders, "Tracking by natural language specification", Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 7350-7358, Jul. 2017.
20. P. Zhang, D. Wang and H. Lu, "Multi-modal visual tracking: Review and experimental comparison", Comput. Vis. Media, vol. 10, no. 2, pp. 193-214, Apr. 2024.
21. L. Zhang, M. Danelljan, A. Gonzalez-Garcia, J. van de Weijer and F. S. Khan, "Multi-modal fusion for end-to-end RGB-T tracking", Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), pp. 2252-2261, Oct. 2019.
22. G. Bhat, M. Danelljan, L. Van Gool and R. Timofte, "Learning discriminative model prediction for tracking", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 6182-6191, Oct. 2019.
23. H. Zhao, J. Chen, L. Wang and H. Lu, "ARKitTrack: A new diverse dataset for tracking using mobile RGB-D data", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5126-5135, Jun. 2023.
24. C. Tang et al., "Revisiting color-event based tracking: A unified network, dataset, and metric", arXiv:2211.11010, 2022.
25. S. Yan, J. Yang, J. Käpylä, F. Zheng, A. Leonardis and J.-K. Kämäräinen, "DepthTrack: Unveiling the power of RGBD tracking", Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pp. 10705-10713, Oct. 2021.
26. J. Yang, S. Gao, Z. Li, F. Zheng and A. Leonardis, "Resource-efficient RGBD aerial tracking", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 13374-13383, Sep. 2023.
27. C. Li, H. Cheng, S. Hu, X. Liu, J. Tang and L. Lin, "Learning collaborative sparse representation for grayscale-thermal tracking", IEEE Trans. Image Process., vol. 25, no. 12, pp. 5743-5756, Dec. 2016.
28. C. Li, X. Liang, Y. Lu, N. Zhao and J. Tang, "RGB-T object tracking: Benchmark and baseline", Pattern Recognit., vol. 96, Dec. 2019.
29. X. Wang et al., "VisEvent: Reliable object tracking via collaboration of frame and event flows", IEEE Trans. Cybern., vol. 54, no. 3, pp. 1997-2010, Mar. 2024.
30. J. Yang, Z. Li, F. Zheng, A. Leonardis and J. Song, "Prompting for multi-modal tracking", Proc. 30th ACM Int. Conf. Multimedia, pp. 3492-3500, Oct. 2022.
