
Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training


Abstract:

The rapidly evolving field of robotics calls for methods that can fuse multiple sensory modalities. In particular, when interacting with tangible objects, effectively combining visual and tactile sensory data is key to understanding and navigating the complex dynamics of the physical world, enabling a more nuanced and adaptable response to changing environments. Nevertheless, much of the earlier work on merging these two modalities has relied on supervised methods with human-labeled datasets. This paper introduces MViTac, a novel methodology that leverages contrastive learning to integrate vision and touch in a self-supervised fashion. Drawing on both sensory inputs, MViTac employs intra- and inter-modality losses to learn representations, yielding improved material property classification and more accurate grasping prediction. Through a series of experiments, we showcase the effectiveness of our method and its superiority over existing state-of-the-art self-supervised and supervised techniques. In evaluating our methodology, we focus on two distinct tasks: material classification and grasping success prediction. Our results indicate that MViTac facilitates the development of improved modality encoders, yielding more robust representations as evidenced by linear probing assessments. https://sites.google.com/view/mvitac/home
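As a rough illustration of the intra- and inter-modality contrastive objectives mentioned in the abstract, the sketch below pairs vision and tactile embeddings with an InfoNCE-style loss. This is a minimal sketch, not the authors' implementation: the encoder names, the use of two augmented views per modality, the temperature value, and the equal weighting of the loss terms are all illustrative assumptions.

# Minimal sketch of intra- and inter-modality contrastive pre-training
# (illustrative only; encoder names and hyperparameters are assumptions).
import torch
import torch.nn.functional as F


def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss where z_a[i] and z_b[i] form the positive pair within a batch."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature              # (B, B) cosine-similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


def visual_tactile_loss(vision_enc, tactile_enc, vis_v1, vis_v2, tac_v1, tac_v2):
    """Sum intra-modality terms (two augmented views of the same modality) and
    inter-modality terms (paired vision/touch observations of the same contact)."""
    zv1, zv2 = vision_enc(vis_v1), vision_enc(vis_v2)
    zt1, zt2 = tactile_enc(tac_v1), tactile_enc(tac_v2)

    intra = info_nce(zv1, zv2) + info_nce(zt1, zt2)   # vision<->vision, touch<->touch
    inter = info_nce(zv1, zt1) + info_nce(zt1, zv1)   # vision<->touch, symmetric
    return intra + inter

In practice, the two encoders could be standard convolutional backbones with small projection heads; after pre-training on paired vision-tactile observations, the frozen encoders would be evaluated by linear probing, as in the experiments described above.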
Date of Conference: 13-17 May 2024
Date Added to IEEE Xplore: 08 August 2024
Conference Location: Yokohama, Japan

Funding Agency:

Cyber-Physical-Systems Lab, Montanuniversität Leoben, Austria

I. INTRODUCTION

In robotics, visual perception has traditionally served as a central modality for acquiring nuanced environmental representations, a role emphasized in a range of studies [1], [2]. However, vision alone has intrinsic limitations in fully capturing the dynamic and intricate state of the surrounding environment [3]. Tactile sensing, by contrast, excels at capturing fine-grained attributes, subtleties that evade visual systems.
