1. Introduction
Hand gesture recognition is a crucial tool for touchless user interfaces, which are widely employed in games and other entertainment applications. We previously developed a touchless visualization system for medical applications [1]. Recently, multi-modal information combining color and depth images has been used for accurate gesture recognition [2]–[6]. In multi-modal gesture recognition, the fusion of color and depth images is a crucial issue. Thus far, early fusion [6] and late fusion [4] approaches have been extensively employed to fuse color and depth images. However, the achievable performance improvement is limited by the gap between the modalities.

In this study, we propose a modality-invariant fusion approach to address this modality gap. To reduce the gap between the feature distributions of the color and depth images, we add a similarity loss. We applied the proposed approach to a public dataset and our private dataset and verified its effectiveness.
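The introduction only names the similarity loss; its exact form is not specified here. The following is a minimal sketch, not the authors' implementation: it assumes a cosine-similarity alignment term between paired color and depth embeddings and a hypothetical weighting factor `lam`, both of which are illustrative choices (an L2 or MMD term would be structured the same way).

```python
# Sketch of a similarity loss aligning color and depth feature distributions.
# The cosine form and the weight `lam` are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def similarity_loss(feat_color: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between paired color and depth features.

    feat_color, feat_depth: (batch, dim) embeddings from the two modality streams.
    """
    # Minimize (1 - cosine similarity) so paired embeddings are pulled together.
    return (1.0 - F.cosine_similarity(feat_color, feat_depth, dim=1)).mean()

def total_loss(logits, labels, feat_color, feat_depth, lam=0.1):
    # Hypothetical training objective: task loss plus weighted alignment term.
    return F.cross_entropy(logits, labels) + lam * similarity_loss(feat_color, feat_depth)
```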