I. Introduction
End-to-end robot motion-learning methods based on deep learning have been used successfully to execute various manipulation tasks [1], [2], [3], [4]. As with humans, contact-rich tasks are realized by combining multiple modalities, such as vision and tactile sensation, with kinesthesia [3], [5], [6], [7]. The robot’s behavior and task-success rate with these methods depend on the modalities provided and the type of demonstration data used for learning. Robots can use various modalities, such as vision (camera), tactile sense (tactile sensor), force sense (force/torque sensor), and audition (microphone). Merely adding modalities is insufficient; an adept learning model must also be developed [8]. Furthermore, given the spatial constraints on sensor placement, judicious selection and use of appropriate modalities is a reasoned approach. Depending on the specific task at hand, the choice of modalities significantly affects task success or failure.

Demonstration data are frequently collected through remote human control [3], [9]. To enhance the generalizability of the learning model, it is preferable that the demonstration data consist of multiple motion data. However, variations stemming from human operation introduce uncertainty into the learning model’s outcomes [10]. Operators also play a crucial role in deciding how to instruct the robot to perform a particular task, and differences in teaching methods among individual operators may also have a discernible effect.