I. Introduction
The three basic characteristics of virtual reality (VR) technology are immersion, interactivity, and imagination; interactivity refers to the way users communicate with the virtual world through specific means. VR technology cannot be implemented without hardware support, and VR devices fall into two main types. The first is mobile head-mounted displays, such as the Gear VR, which must be plugged into a mobile phone for use. The second is tethered head-mounted displays, such as the HTC Vive, which must be driven by an external computer. Both types achieve interaction primarily through handheld controllers. There are three main ways for humans to interact with virtual scenes: the first is virtual interaction, controlled through virtual buttons in the virtual scene; the second is physical interaction, which uses physical devices such as control handles to interact with the virtual scene; the third is direct interaction, in which action commands issued through data gloves or gesture recognition control objects in the virtual scene. At present, most human-computer interaction in VR is achieved through virtual or physical interaction. With the continuing development of VR devices and human-computer interaction, direct interaction, and gesture interaction in particular, has become a friendlier and more natural interaction method in VR. Gesture recognition is the precondition of gesture interaction. Currently, gesture recognition is realized in two main ways. The first is sensor-based gesture recognition. Taking gesture recognition and VR human-computer interaction as research objects, Li et al. [1] proposed and designed a new prototype system to assist VR devices in human-computer interaction.
The second is vision-based gesture recognition, which collects data at low cost with ordinary cameras and uses machine learning or neural networks to recognize gesture images. Su et al. [2] proposed a human-computer interaction method for gesture recognition based on an improved YOLOv3. Vision-based deep learning gesture recognition is a hotspot of current research, but deep learning requires a large amount of gesture sample data for training; when only a few samples are available, deep learning is prone to overfitting.
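The few-sample overfitting risk can be illustrated with a minimal sketch. The gesture classes, feature values, and the reduction of each gesture image to a 2-D feature vector below are all hypothetical, not taken from the cited methods; the point is only that a model which memorizes a handful of training samples scores perfectly on them while misjudging a slightly unusual new input.

```python
import math

# Hypothetical toy data: each "gesture" is reduced to a 2-D feature vector
# (e.g. fingertip spread and palm orientation, both normalized to [0, 1]).
# Class labels and values are invented for illustration.
TRAIN = [
    ((0.10, 0.90), "open_palm"),
    ((0.90, 0.10), "fist"),
    ((0.15, 0.85), "open_palm"),
]

def predict(x):
    """1-nearest-neighbour: pure memorization of the training samples."""
    return min(TRAIN, key=lambda sample: math.dist(x, sample[0]))[1]

# Perfect accuracy on the memorized training set ...
train_acc = sum(predict(x) == y for x, y in TRAIN) / len(TRAIN)
print(train_acc)                 # 1.0

# ... but an atypical open palm lands nearer the lone "fist" sample,
# so the memorizing model misclassifies it.
print(predict((0.55, 0.35)))     # fist
```

With so few samples the decision boundary is dictated entirely by where those samples happen to fall, which is the same failure mode a high-capacity deep network exhibits when trained on too little gesture data.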