Journals & Magazines >IEEE Access >Volume: 9

SqueezeNet and Fusion Network-Based Accurate Fast Fully Convolutional Network for Hand Detection and Gesture Recognition

The proposed network consists of two parts. The SqueezeNet hand feature extraction network initially extracts hand features to produce a map with rich hand features. The ...

Abstract:

Accurate fast hand detection and gesture recognition for hand understanding are still challenging tasks that are influenced by the diversity of hands and the complexity o...Show More

Metadata

Abstract:

Accurate fast hand detection and gesture recognition for hand understanding are still challenging tasks that are influenced by the diversity of hands and the complexity of the scene in color images. To address the above problem, we propose a novel SqueezeNet and fusion network-based fully convolutional network (SF-FCNet) to accurately and quickly perform hand detection and gesture recognition in color images. First, we introduce the first 17-layer structure in the lightweight SqueezeNet as the hand feature extraction network to accelerate the detection and recognition speed by greatly compressing the network parameters. Second, a precise hand prediction fusion network is designed by adding a residual structure to the deconvolutional network to integrate high- and low-level features of hands, and hand detection and gesture recognition are performed on a single convolutional layer at multiple scales to improve the precision and reduce the computational costs. The verification results on the Oxford hand dataset show that SF-FCNet can reach a precision of 84.1% and a speed of 32 FPS. The experimental results show that SF-FCNet can substantially enhance the precision and speed of hand detection and gesture recognition on three benchmark datasets and has a strong generalization ability on a homemade test set.

The proposed network consists of two parts. The SqueezeNet hand feature extraction network initially extracts hand features to produce a map with rich hand features. The ...

Published in: IEEE Access ( Volume: 9)

Page(s): 77661 - 77674

Date of Publication: 11 May 2021

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2021.3079337

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Human hand detection and recognition are regarded as a way for computers to understand human language, enabling people to communicate with machines and interact naturally without any mechanical equipment. Human hands and gestures have applications in many computer fields, such as human-computer interaction (HCI) [1], rehabilitation medicine [2], anomaly detection [3], sign language recognition [4], gesture interaction [5], virtual reality [6], etc. Hand detection is a problem that detects the locations of all hands in an image. Gesture recognition is used to detect both the location and category of the gesture in an image. Hand as a special communication tool makes us have higher requirements for the accuracy and speed of hand detection and gesture recognition. In recent years, the development of deep learning has greatly improved hand detection and gesture recognition technology. However, how to continue to improve the accuracy and detection speed is still a major challenge due to the diversity of hands and the clutter of scenes.

Hand detection and gesture recognition are divided into conventional methods and deep learning-based methods. In some conventional methods [7]–[9], artificial features such as skin color and image shape are extracted, and then the hands and gestures are detected and recognized through modeling and a support vector machines (SVMs) classifier. However, these methods usually have great limitations due to the complexity of the hand, the challenge of modeling, and the inability to perform end-to-end training. Compared with conventional methods, deep learning-based methods have stronger feature expression capabilities due to the automatic extraction of more abstract features using a series of deep convolutional neural networks (CNNs), and the end-to-end training method of deep learning reduces the hand detection and gesture recognition costs. Therefore, the domain of hand detection and gesture recognition has recently been dominated by deep learning.

Encouraged through the use of deep learning networks [10]–[15] for classification and object detection, many methods have been applied for hand detection and gesture recognition, such as region-based CNNs (R-CNNs) [12], Faster R-CNNs [13], Mask R-CNNs [14], and the RefineDet-based method [15]. However, because the detection of hands and gestures belongs to fine-grained detection and hands are small, the accuracy of these object detection networks is not very high. Subsequently, some other methods, such as the multiple scale region-based fully convolutional network (MS-RFCN) [16], region proposal networks (RPN) [17], hand-CNN [18] and without generative adversarial network (GAN) [19], were proposed. These methods improve the hand detection accuracy by improving the RPN, region-based fully convolutional network (R-FCN) [20], Faster R-CNN [13] and Mask R-CNN [14]. There are some approaches, such as RetinaNet based [21] and ResNet50+highlight feature fusion (HFF)+Auxiliary [22], that use ResNet50 [23] as the backbone combined with other networks for detection and recognition. In [24] and [25], basic convolutional pooling layers were used to construct new detection and recognition models to improve the gesture recognition accuracy via end-to-end training on datasets. In [26], a first-person perspective dataset and a CNN-based method, which can distinguish between one’s own hands and the hands of others, were proposed. However, these methods still have the following problems. First, little research has been conducted on the detection and recognition speeds of most of these methods, or the speed is slow. Second, the accuracy of most methods has yet to be improved due to the complexity and variability of hands and highocclusion.

To address the above problems, in this study, we investigated hand detection and gesture recognition on the Oxford hand dataset [7], EgoHands dataset [26], and National University of Singapore (NUS) hand posture dataset [8] and proposed a new method named the SqueezeNet and fusion network-based fully convolutional network (SF-FCNet) to accurately and quickly perform hand detection and gesture recognition on images. The main contributions of this study are as follows:

We propose a fully convolutional network for hand detection and gesture recognition in complex and unconstrained environments and reduced the computational costs.
We construct a SqueezeNet hand feature extraction network using a lightweight SqueezeNet to reduce the weight parameters, simplify the network structure, and improve the hand detection and gesture recognition speed.
We design a precise hand prediction fusion network that fuses a deconvolution network and residual structure and includes multiscale feature processing to improve the hand detection and gesture recognition accuracy.
We show the experimental data and visualization results of SF-FCNet in terms of hand detection and gesture recognition on public datasets and the in-house-built test set.

The remainder of this paper is structured as follows. Section II reviews the related work. The proposed method is presented in Section III, and the experimental results are shown in Section IV. Finally, we draw conclusions in Section V.

SECTION II.

Related Work

According to the feature extraction method, hand detection and gesture recognition methods are divided into conventional methods and deep learning-based methods. In this section, we review the related work on using traditional methods and deep learning-based methods to solve two problems.

A. Hand Detection

In the early stages of hand detection research, some conventional hand detection methods were proposed to detect hands by manually extracting features. Utsumi et al. [27] constructed a hand tracking system that recognizes and tracks the appearance of hands with multiple cameras using a geometrical structure-based hand statistical detection method. Xu et al. [28] proposed a dynamic hand detection algorithm, which used self-organizing map to realize hand detection and segmentation on the HSV color space. Some workers have proposed skin color-based methods [29]–[31], which can first directly perform hand detection based on skin color or extract the hand according to skin color and then realize noncontact human-computer interactions. Mittal et al. [7] used a two-stage method to generate hand bounding boxes based on skin color in the first stage, but the detection accuracy needs to be improved. Zhao et al. [32] proposed a histogram of oriented gradient (HOG)–based hand detection method, which used HOG features for hand detection. Guo et al. [33] combined HOG features and SVM classifiers for hand detection. Conventional hand detection methods rely on manual design to extract features, and feature extraction is insufficient and is easily affected by the environment.

In recent years, deep learning-based hand detection methods have begun to attract attention because they can automatically extract features. Initially, some object detection networks [12]–[14] were applied to realize hand detection, but the accuracy needs to be improved. Then, some improved networks were proposed. Dibia [34] proposed a single-shot multibox detector (SSD) [35]-based real-time hand detection method, and testing on the EgoHands dataset showed that the method can achieve real-time hand detection. Chen et al. [36] proposed a new deep learning framework that integrated human hand detection and pose estimation and achieved reliable human hand detection through shared convolutional layers. Le et al. [37] proposed a cross-resolution feature fusion method, which used two modules to obtain context and semantic information to achieve fast hand detection. Wang and Ye [38] proposed a multiscale Faster R-CNN method, which uses the Faster R-CNN [13] as the basic architecture and combines multiscale integrated features to achieve hand detection. Gao et al. [39] proposed a deep CNN model for hand detection, which improved the SSD [35] by combining deep and shallow networks to achieve spatial human-computer interaction. Deep learning-based hand detection methods have better robustness because they can dig deeply into image features, and the learning features are not restricted by the environment.

B. Gesture Recognition

In the early days, gesture recognition was achieved by wearing sensor gloves or making hand tags. Davis and Shah [40] used hand tags to capture the location and angle information of the hand joints of users to realize gesture recognition. Due to the poor flexibility of sensor gloves and tags, gesture recognition methods for designing artificial hand features have been studied. Van der Bergh et al. [41] proposed an average neighborhood margin maximization (ANMM)-based detection system, which used Haarlet coefficients to calculate the degree of matching between hand and sample datasets. Pisharady et al. [8] used image shape, texture, and color descriptors to recognize gestures through SVM and obtained high accuracy. Dardas and Georganas [42] proposed a real-time recognition system, which detects and tracks the hand region by subtracting the face color from the skin color and then uses a multiclass SVM to recognize gestures. Yeo et al. [43] proposed a method that combined skin color segmentation with Haarlike features, which can effectively remove the interference of the skin color of other parts of the body to improve the accuracy. Ikegami et al. [9] proposed a human-computer interaction system that extracts the user’s skin color component through face detection and performs gesture detection according to the skin color, which has good robustness.

Currently, deep learning-based gesture recognition methods are widely used. In [24] and [25], CNN-based methods were proposed, and the basic CNN architecture was used to construct deep learning networks for gesture recognition, which can achieve good recognition accuracy. Wan et al. [44] proposed a GAN-based model for the augmentation of hand datasets to improve the gesture recognition accuracy. Chevtchenko et al. [45] proposed a feature fusion-based convolutional neural network that combined a CNN with a traditional method and used depth cameras to perform gesture recognition. Si et al. [46] proposed a model for detecting raised hands that combines the R-FCN [20] with a feature pyramid and uses an adaptive template selection algorithm to detect raised hands in the in-house-built raised hands dataset. Rouast and Adam [47] proposed a video-based gesture recognition method that used a deep learning architecture to detect video-based gestures and collected a large amount of video data of dining occasions. Neethu et al. [48] proposed a CNN-based classification method that used region segmentation, finger segmentation and image normalization to process gestures and finally detected and recognized gestures using the CNN classifier.

In this paper, we mainly study deep learning-based hand detection and gesture recognition methods and propose a SqueezeNet and fusion network-based fully convolutional network, which combines a deconvolution network, a residual structure and multiscale feature processing.

SECTION III.

Methodology

The architecture of the proposed network, including a SqueezeNet hand feature extraction network and a precise hand prediction fusion network, is shown in Fig. 1. The input image is first processed by a SqueezeNet hand feature extraction network to produce a map with rich hand features. Then, the feature map with gradually decreasing resolution is obtained by the precise hand prediction fusion network, and the feature map is expanded by the convolutional layer composed of the deconvolution layer and the residual structure. Finally, hand detection and gesture recognition are performed by fusing multiple feature maps on a single convolutional layer. In this section, we introduce two parts of the network: the loss function and the training algorithm of the proposed network.

FIGURE 1.

Illustration of the proposed network (SF-FCNet) architecture.

Show All

A. SqueezeNet Hand Feature Extraction Network

To achieve both precision and speed in hand detection and gesture recognition, the choice of the initial feature extraction layer is critical, and it usually involves a trade-off between speed and precision. SqueezeNet was designed to reduce the number of model parameters and the model size by Iandola et al. [49]. SqueezeNet can ensure recognition precision while compressing the parameters to approximately 1/50 of AlexNet [50], making the model size only 4.8 MB. SqueezeNet utilizes the strategy of convolutional separation to convert the standard $3\times 3$ convolution into a fire module by replacing part of the $3\times 3$ convolution kernel with a $1\times 1$ convolution kernel, as shown in Fig. 2. The fire module includes a squeeze layer and an expand layer, and each module includes a rectified linear unit (ReLU) activation function to improve the network depth. The squeeze layer contains $1 \times 1$ convolution kernels, and the expand layer contains $1 \times 1$ and $3 \times 3$ convolution kernels. The $1 \times 1$ convolution kernel can reduce the weight parameters, and the $3 \times 3$ convolution kernel can ensure the network precision. Because SqueezeNet has the advantages of being a small model and high precision, it is selected as the hand feature extraction network to shorten the feature extraction time and speed up detection and recognition.

$FIGURE 2. - Illustration of the $3\times 3$ convolution expanded to a fire module in SqueezeNet.$

FIGURE 2.

Illustration of the $3\times 3$ convolution expanded to a fire module in SqueezeNet.

Show All

To make the network have a certain depth, we deleted the last convolution and average pooling layers of SqueezeNet and retained the first 17 layers as the SqueezeNet hand feature extraction network, as shown in Fig. 1. The input image first passes through “Conv1” and “Max1”, then passes through “Fire2-Fire4” and “Max4”, then passes through “Fire5-Fire8” and “Max8”, and finally passes through “Fire9”. The hand feature map of the image is extracted through a series of convolutions.

The structure and parameter settings of each layer of the SqueezeNet hand feature extraction network, including 1 convolutional layer, 8 fire modules and 3 max pooling layers with a stride of 2, are shown in Table 1. In the SqueezeNet hand feature extraction network, each fire module has the same structure, including a squeeze layer and an expand layer; and the network depth is 2. On the expand layer, the feature maps of the $1\times 1$ and $3\times 3$ convolutional outputs are spliced together in the channel as the channel of this fire module. The number of convolution kernels in the squeeze layer and expand layer satisfy the following equation: $\begin{equation*} X< Y_{1}+Y_{2}\tag{1}\end{equation*}$ View Source where $X$ is the number of $1\times 1$ convolution kernels in the squeeze layer; and $Y_{1}$ and $Y_{2}$ are the number of $1\times 1$ convolution kernels and the number of $3\times 3$ convolution kernels in the expand layer, respectively.

TABLE 1 The Architecture Parameters of the SqueezeNet Hand Feature Extraction Network

The input size of the SqueezeNet hand feature extraction network is set to $300 \times 300\times 3$ , and the size of the feature maps is reduced to half of the original size by a $3 \times 3$ max pooling layer with a stride of 2. Finally, the $19 \times 19 \times 512$ hand feature map was obtained through “Fire9”. In [35], it was proven that a feature map with a large size is beneficial for the detection of small objects while a feature map with a small size is beneficial for the detection of large objects. To enhance the detection of large and small hands, we pass the $19 \times 19 \times 512$ feature map through a $3\times 3$ convolutional layer with 1024 channels and a step size of 2 to obtain a $10 \times 10 \times 1024$ feature map as the input of the precise hand e prediction fusion network to ensure that the size of the subsequent feature map is appropriate.

Table 1 shows that the number of $1 \times 1$ convolution kernels in the fire module is greater than the number of $3\times 3$ convolution kernels. The $1\times 1$ convolution kernel can reduce the network dimension with less information loss. Therefore, the SqueezeNet hand feature extraction network can keep more hand information and improve the hand feature extraction speed.

The superior performance of the SqueezeNet hand feature extraction network will be demonstrated in the experiment. It is this fast performance that allows our architecture to be well used for hand detection and gesture recognition that require higher real-time performance.

B. Precise Hand Prediction Fusion Network

Inspired by [51], a precise hand prediction fusion network was constructed, as shown in Fig. 1, to supplement the lack of contextual information in the convolution process and obtain better detection performance.

The precise hand prediction fusion network is constructed using deconvolution and combines a residual structure and multiscale detection, as shown in Fig. 1. First, the output of the SqueezeNet hand feature extraction network is used as the input of the fusion network to produce a series of feature maps (“Conv10”, “Conv11”, and “Conv12”) with a gradually decreasing resolution via multiple convolutional layers. Second, the high- and low-level features of hands are fused to obtain feature maps (“Conv14” and “Conv16”) with a gradually increasing resolution via the combination of deconvolution layers and residual structures. Finally, three feature maps (“Conv12”, “Conv14”, and “Conv16”) are provided to the detection and classification layer via multiscale detection for the detection and classification of hands.

In the precise hand prediction fusion network, “Conv11” is obtained through a set of $1\times 1$ convolution kernels with a step size of 1 and $3\times 3$ convolution kernels with a step size of 2, and “Conv12” is obtained through a set of $1\times 1$ and $3\times 3$ convolution kernels with a step size of 1. The residual structure contains a set of $1\times 1, 3\times 3$ , and $1\times 1$ convolution kernels. The fusion method of the deconvolution layer and the residual structure is shown in Fig. 3. First, the “Conv12” input feature maps of the residual structure are upsampled to the size of “Conv11” via bilinear interpolation and then combined with “Conv11” and passed through the “Res13” residual structure. The final output feature map “Conv14” is the sum of the output of the residual structure and the upsampling of “Conv12”. Similarly, “Conv16” is obtained through the fusion of “Conv14” and “Res15”. This fusion method utilizes the feature map with rich information in the early stage to supplement the detailed information that is gradually missing due to the deep convolution, which ensures that the feature map has a larger receptive field, ensures the integrity of the context information, and effectively reduces the missed detection.

FIGURE 3.

The fusion fashion of the deconvolution layer and the residual structure.

Show All

Since hand detection and classification are realized through convolutional layers, the entire network is a fully convolutional network, and most of the weights are shared. The final hand detection and classification layers are converted from the fully connected layer passing through several convolution kernels with the same size as the feature map. The three feature maps (“Conv12”, “Conv14”, and “Conv16”) obtained by the fusion network are provided to the final classification layer, and the NMS algorithm is applied to the feature maps to determine the final detection bounding box of the hand.

The precise hand prediction fusion network uses multiscale prediction and adds residual structures to deconvolution layers to improve the hand detection and gesture recognition accuracy. Multiscale prediction performs the detection and classification of hands according to different sizes of hands to improve the detection accuracy. The contextual information is integrated by fusing residual structures to deconvolution layers, which can increase the detailed information of the hand and simplify the learning process. In addition, the location and classification of hands is performed by the convolutional layers. This fully convolutional network can not only better identify and detect both large and small hands but can also reduce repeated calculations and model complexity.

Compared with object detection, the object of hand is relatively small. Our network integrates high- and low-level features of the large and small feature maps by the residual structure, and utilizes multiscale feature maps with $10 \times 10, 5 \times 5$ and $3 \times 3$ to predict different sizes of hands, which improves the utilization of feature mapping with large size. The feature maps with large size are more conducive to the detection of small object [35]. Therefore, our architecture can be well worked for hand detection and gesture recognition, thus the benefits of the precise hand prediction fusion network will be justified in the experiment.

C. Loss Function

The locating and classification of hands is achieved by searching bounding boxes. According to different scales and aspect ratios [35], a series of different-sized default bounding boxes will be produced at each pixel position of the feature map extracted by the precise hand prediction fusion network. When a $3 \times 3 \times s$ convolutional kernel is applied to the feature map with $s$ channels, each location of the feature map produces either an output value of category score $z_{c}$ or an output value of the location offset relative to the default bounding box. The location offset contains 4 offsets relative to the center coordinates, width and height of the default bounding box. For each location of the feature map, we calculate the $C$ category scores and 4 offsets. Assuming that each location in the feature map produces $f$ default bounding boxes, Cf category scores and $4f$ location scores can be obtained through $(C + 4)\,f 3 \times 3\times s$ convolutional filters. The confidence that each default bounding box matches category $c$ of the hand is calculated as follows: $\begin{equation*} C(z_{c})=\frac {e^{z_{c}}}{\sum \nolimits _{c}^ {e^{z_{c}}}}\tag{2}\end{equation*}$ View Source where $z_{c}$ is the score of the hand for category $c$ .

For each ground truth bounding box, we select some default bounding boxes for matching and use the selected boxes for network training. The intersection over union (IoU) is a metric for evaluating whether the default bounding box and ground truth bounding box match, and its formula is as follows: $\begin{equation*} IoU=\frac {A_{pre}\bigcap A_{gt}}{A_{pre}\bigcup A_{gt}}\tag{3}\end{equation*}$ View Source where $A_{pre}$ is the area of the default bounding box of the hand, and $A_{gt}$ is the area of the hand ground truth bounding box of the hand. If IoU is higher than a certain threshold, the default bounding box matches the ground truth bounding box of the hand, and the default bounding box is classified as a positive sample; otherwise, it is a negative sample. Hard-negative mining is used to solve the imbalance between the positive and negative samples by selecting the top-n negative samples with the highest confidence as the negative samples for training and ensuring that the ratio of positive and negative samples is approximately 1:3.

The overall hand detection loss function is the average of the hand confidence loss (handconf) and the hand localization loss (handloc), which is shown as follows: $\begin{equation*} Loss_{hand}=\frac {1}{N}\left ({Loss_{handcoff}+Loss_{handloc}}\right)\tag{4}\end{equation*}$ View Source where $N$ is the number of default bounding boxes matching ground truth bounding boxes. If $N =0$ , the loss function is 0. The hand confidence loss is calculated for positive and negative samples as follows: $\begin{align*} Loss_{handcoff}=-\left ({\sum _{i\in ps}^{N}x_{i,j}^{c}log\left ({C\left ({z_{i}^{c}}\right)}\right) +\sum _{i\in ng}^{N}log\left ({C\left ({z_{i}^{0}}\right)}\right)}\right)\!\!\!\!\! \\{}\tag{5}\end{align*}$ View Source where ${}^{x_{i,j}^{c}=1}$ represents that the $i$ th default bounding box matches the $j$ th ground truth bounding box of the hand for category $c$ ; otherwise, ${}^{x_{i,j}^{c}=0}$ . $c=0$ represents that the category is background. ps is a positive sample, and ng is a negative sample. ${}^{C_{i}^{c}}$ denotes the confidence that the $i$ th default bounding box is category $c$ of the hand.

The hand localization loss is calculated in positive samples as follows: $\begin{equation*} Loss_{handloc}=\sum _{i\in ps}^{N}\sum _{k\in \{x,y,w,h\}} x_{i,j}^{q}smooth_{L1}\left ({p_{i}^{k}-\hat {g}_{j}^{k}}\right)\tag{6}\end{equation*}$ View Source where $p$ represents the predicted bounding box, and $g$ represents the ground truth bounding box. We regress the location offset relative to the center $(x,y)$ , width $w$ , and height $h$ of the default bounding box $b$ as follows: $\begin{equation*} \hat {g}_{j}^{x} =\frac {g_{j}^{x} -b_{i}^{x}}{b_{i}^{w}}\, \hat {g}_{j}^{y} =\frac {g_{j}^{y} -b_{i}^{y}}{b_{i}^{h}}\, \hat {g}_{j}^{w} =\log \left({\frac {g_{j}^{w}}{b_{i}^{w}}}\right)\, \hat {g}_{j}^{h} =\log \left({\frac {g_{j}^{h}}{b_{i}^{h}}}\right)\tag{7}\end{equation*}$ View Source where $b_{i}$ is the ith default bounding box.

In the training process of our network, the parameters of the network model are constantly updated by minimizing the overall loss function to achieve better hand detection and gesture recognition results.

D. Proposed Hand Detection and Gesture Recognition Algorithm

The details of the training processing of the proposed network are described in Algorithm 1. First, the images are input to the network in batches. Second, the default bounding boxes are generated and divided into positive and negative samples. Finally, the Adam algorithm [52] is used to optimize the loss function in the positive and negative samples by updating the weights until the loss function converges.

Algorithm 1 The Training Process of the Proposed Network

Input: Images.

Output: Weight parameters $k$ of network.

Global parameters:

Ground truth bounding box g.

Number of default bounding box f.

Default bounding box b.

Positive sample ps, negative sample ng.

Weight parameters of network k.

Begin

Randomly load 32 images and corresponding $g$ .

Produce feature maps through network.

Produce $f$ default bounding boxes on the feature maps.

Compute whether $b$ matches $g$ using (3).

If IoU > 0.5, $b$ is ps.

Otherwise, $b$ is ng.

Select top-n ng, keep ps: ng = 1:3.

Calculate confidence loss using (2) and (5).

Calculate localization loss in ps using (6) and (7).

10:

Calculate $Loss_{hand}$ using (4).

11:

Optimize $Loss_{hand}$ using Adam.

12:

Update $k$ of network.

13:

If convergence, exit the loop.

14:

Otherwise, jump to 1.

End

The method of this paper is trained on the hand dataset for hand detection and gesture recognition and focuses on improving the accuracy and speed performance using a lightweight SqueezeNet and fusion network-based fully convolutional network.

SECTION IV.

Experiments

We show experiments conducted on the Oxford hand dataset [7], EgoHands dataset [26], NUS hand posture dataset [8] and the in-house-built test set. The first part presents the public dataset and the in-house-built test set in detail. The second part introduces the training parameter settings and the metrics. The last two parts discuss the experimental results and the performance of the network.

A. Datasets

The Oxford hand dataset [7] is a public comprehensive dataset that contains rich hand images from different public image datasets collected without any restrictions. The dataset has 13,050 hand instances with complex backgrounds. All hands that can be clearly seen by humans are marked with bounding rectangles. There are 4069 images for training, 813 images for testing and 444 images for validation.

The EgoHands dataset [26] includes 48 complex first-person interactive videos, which are recorded by 4 actors performing 4 activities in 3 real locations. The dataset has 4800 images with multiple hands and 15053 labeled hand instances. The EgoHands dataset contains four hand categories: “own left”, “own right”, “other left”, and “other right”. There are 3600 images for training, 795 images for testing and 405 images for validation.

The NUS hand posture dataset [8] is shot in and near the National University of Singapore, and the dataset contains 10 classes of gestures of different sizes and shapes with complex backgrounds. Forty volunteers of different races made these gestures to form 2000 different gesture images for gesture recognition. In addition, there are 750 gesture images with human skin color background in the dataset. In order to increase the number of the dataset, we also added 240 images from NUS-I [53]. There are 2990 images in total, including 15 gesture categories. There are 1575 images for training, 1184 images for testing and 231 images for validation.

In the Oxford hand dataset, only the location of the hand that appears in the image is annotated, and the categories of gestures are not distinguished. In the EgoHands dataset, one’s own hand or another’s hand are also annotated except for the location of the hand, but the gesture category is still not labeled. In the NUS hand posture dataset, the location and category of gestures are both annotated. The Oxford hand dataset and EgoHands dataset are used for hand detection, and the NUS hand posture dataset is used for gesture recognition.

Based on the requirements of the Guangxi Key Research and Development Project and testing of the generalization capabilities of the network, we produced two groups of test sets in the laboratory, and some examples are shown in Table 2. In the second group, we randomly selected 6 classes of gestures in the NUS hand posture dataset as the gesture categories. The members of the laboratory as volunteers showed hands and gestures without restriction and used a high-definition (HD) camera to shoot the images. Each group contains 72 images. The first group is used for hand detection, and the second group is used for gesture recognition. The hand and gesture in the in-house-built test set are far from and near the camera. The last column of the second group in Table 2 shows the gesture images that are far from the camera.

TABLE 2 The Different Categories and Some Examples From the In-House-Built Test Set

B. Training Setup and Metrics

The Adam algorithm is used to optimize the loss function of the proposed network. The initial learning rate is set to 0.0001, the batch size is 32 images, the weight delay parameter is set to 0.0005, and the size of the input images is $300 \times 300$ . To reduce the training time and enhance precision, a fine-tuning strategy that loads the weights of SqueezeNet to train other classification tasks in the network is used.

The experiment condition is the TensorFlow framework with 32 GB memory and GTX 1080 Ti with a 3584 CUDA core GPU. The operating system is 64-bit Ubuntu 16.04. The metrics for evaluating hand detection and gesture recognition are the mean average precision (mAP) and frames per second (FPS). The mAP represents the accuracy, and the FPS is the detection speed. The threshold of the IoU between the ground truth bounding box and the predicted bounding box can be set from 0.5–0.95.

C. Results

1) Oxford Hand Dataset

We conducted comparative experiments of mAP and FPS on the Oxford hand dataset, and gave the detection results of hands from South Asian, Africa and far away from the camera to verify the performance of our method in hand detection.

The performance of the proposed network (SF-FCNet) is verified on the Oxford hand dataset. The comparison of the hand detection performance in terms of the mAP and FPS for the state-of-the-art methods [15]–[18], [21], [22], R-CNN [12], Faster R-CNN [13], Mask R-CNN [14] and multiple proposals [7] is shown in Table 3. Table 3 shows that SF-FCNet trained on the Oxford hand dataset can reach an mAP of 84.1%, which outperforms the state-of-the-art methods [15]–[18], [21], [22]. The superior performance of the method is due to the fusion of a residual structure and a deconvolution network, which can combine the high- and low-level features of hands and detect hand in a multiscale fashion to improve the detection accuracy. The results in Table 3 show the effectiveness of the precise hand prediction fusion network in SF-FCNet in terms of its detection precision.

TABLE 3 Comparison of Hand Detection Performance in Terms of the Mean Average Precision (mAP) With State-of-the-Art Methods [15]–[18], [21], [22], R-CNN [12], Faster R-CNN [13], Mask R-CNN [14] and Multiple Proposals [7] on the Oxford Hand Dataset

Tables 4 and 5 show the comparison of SF-FCNet and the state-of-the-art methods [16], [17], [19], R-CNN [12], Faster R-CNN [13] and multiple proposals [7] on running time and FPS on the Oxford hand dataset. The running time is the detection time of each image in seconds. Due to the limitations of the current laboratory hardware environment, we only have GTX 1080 Ti GPU devices. To increase the credibility of our experiment, we use Faster-RCNN as the evaluation medium for the detection speed between two different GPUs.

TABLE 4 Comparison of Running Time (s) and FPS of State-of-the-Art Methods on the Oxford Hand Dataset and Titan X GPU

TABLE 5 Comparison of the Running Times (s) and FPS With State-of-the-Art Methods [19], R-CNN [12] and Faster R-CNN [13] on the Oxford Hand Dataset and GTX 1080 Ti GPU

Table 4 shows the comparison of other methods on the Titan X GPU. It can be seen from Table 4 that Faster R-CNN has the fastest detection speed compared with the other state-of-the-art methods on a Titan X GPU, indicating that Faster-RCNN has the best performance. Table 5 shows the comparison of our methods with other methods on the GTX 1080 Ti GPU. It can be seen from Table 5 that SF-FCNet can achieve a detection speed of 32 FPS, which is almost 2.1 times faster than the Faster-RCNN on a GTX 1080 Ti GPU, which indicates that our method has a fast. Combining Tables 4 and 5 shows that SF-FCNet has the fastest detection speed compared with the other methods. This method mainly benefits from the reduction of the weight parameters in the SqueezeNet hand feature extraction network, and the results show the effectiveness of the method in improving the detection speed.

Hand detection in SF-FCNet on the Oxford hand dataset is shown in Figs. 4 and Fig. 5. Fig. 4 shows the hand detection results of SF-FCNet on multiple hand images and hands far away from the camera. Fig. 5 shows the results of hand detection on images from South Asian, Africa, and areas with dark-skinned people. In the figures, green is the ground truth bounding box, and yellow is the bounding box of the hand predicted by SF-FCNet. The hand detection results in Figs. 4 and 5 show that SF-FCNet can accurately detect the locations of multiple hands, including hands far from the camera and hands from South Asian, African and darker skinned people, which shows the effectiveness of SF-FCNet. The hand far from the camera corresponds to a smaller size in the image, which suggest that our method has better detection results in terms of small-sized hands.

FIGURE 4.

Hand detection results of the proposed network (SF-FCNet) for multiple hands and hands far away from the camera on the Oxford hand dataset.

Show All

FIGURE 5.

Hand detection results of the proposed network (SF-FCNet) on South Asian, African and darker skinned hands on the Oxford hand dataset.

Show All

2) EgoHands Dataset

We conducted comparison experiment of mAP, and drew a test accuracy curve on the EgoHands dataset, and gave the results of hand detection under the first-person perspective to verify the performance of our method in hand detection.

The comparison of performance in terms of the mAP and method [26] on the EgoHands dataset is shown in Table 6. Table 6 shows that SF-FCNet has higher detection precision on 4 categories of hands, including “own left”, “own right”, “other left”, and “other right”; and it can achieve 89.4% precision on all hands, which outperforms the other methods. The SF-FCNet has a 10.8% higher mAP than [26]. The results show that SF-FCNet has a greater advantage than other state-of-the-art methods in terms of detection precision on the EgoHands dataset.

TABLE 6 Comparison With the State-of-the-Art Method [26] in Terms of the mAP on the EgoHands Dataset

The red curve in Fig. 6 shows the relationship between the test accuracy of SF-FCNet after training on the EgoHands dataset and steps. The total number of iterations is 45k. The red curve in Fig. 6 shows that the test accuracy of the network gradually converges after the number of steps reaches 10k, which shows that SF-FCNet has a faster convergence rate.

FIGURE 6.

The result of hand detection of SF-FCNet under the first-person perspective on the EgoHands dataset.

Show All

Fig. 7 shows the result of hand detection of SF-FCNet for hands with darker skin and hands far from the camera under the first-person perspective on the EgoHands dataset, where yellow represents the ground truth bounding box, orange represents the predicted bounding box of “other right”, cyan represents the predicted bounding box of “other left”, red represents the predicted bounding box of “own left”, and green represents the predicted bounding box of “own right”. Fig. 7 contains some hands far from the camera, and the second picture in the second row has hands with darker skin. The detection results in Fig. 7 show that these hands have better detection results, which indicates that SF-FCNet can achieve accurate hand detection with darker skin and far from the camera under the first-person perspective.

FIGURE 7.

The test accuracy curve on the EgoHands dataset and NUS hand posture dataset.

Show All

3) NUS Hand Posture Dataset

We conducted comparative experiments and ablation experiments on the NUS hand posture dataset, and drew a test accuracy curve, and gave the results of gesture recognition in a complex background to verify the performance of our method in gesture recognition.

The comparison of SF-FCNet in terms of the mAP performance with the state-of-the-art methods [24], [25] and [8] on the NUS hand posture dataset is shown Table 7. Table 7 shows that SF-FCNet can reach a mAP of 99.3%, which is higher than those of the state-of-the-art methods [24], [25]. The SF-FCNet attains better gesture recognition precision, which shows the effectiveness of SF-FCNet in gesture recognition.

TABLE 7 Comparison of Gesture Recognition Performance With the State-of-the-Art Methods of [24], [25] and [8] in Terms of the mAP on the NUS Hand Posture Dataset

The blue curve in Fig. 6 shows the relationship between the test accuracy of SF-FCNet after training on the NUS hand posture dataset and steps. The blue curve in Fig. 6 shows that the test accuracy of the network gradually converges after the number of steps reaches 5k, which shows that SF-FCNet has a faster convergence rate.

Table 8 shows the mAP and FPS of SF-FCNet on the EgoHands dataset and NUS hand posture dataset when the threshold of the IoU is 0.5 and 0.75. Table 8 shows that an increase in the threshold will reduce the average precision, while the impact on FPS is not great.

TABLE 8 mAP and FPS of SF-FCNet on the EgoHands Dataset and NUS Hand Posture Dataset When the Threshold of

$IoU$ is 0.5 and 0.75

To demonstrate the effectiveness of the multiscale features and residual structure of SF-FCNet, we establish the following experiment on the NUS hand posture dataset: while keeping the original network structure unchanged, we only use one or two of the multiscale features and residual structure or use neither. The experimental results Table 9 show that the accuracy of SF-FCNet with multiscale features and the residual structure is the highest, which indicates that the multiscale and residual structure of SF-FCNet can promote performance improvement to a certain extent.

TABLE 9 The Effect of Multiscale Features and Residual Structure on Performance Measured on the NUS Hand Posture Dataset

To verify the effectiveness of the SqueezeNet hand feature extraction network, we conduct an experiment on the NUS hand posture dataset: we replace SqueezeNet with ResNet50 as the feature extraction network of SF-FCNet.

The experimental results are shown in Table 10. Table 10 shows that the accuracy of SF-FCNet with SqueezeNet is roughly the same as that with ResNet50, but the FPS is greatly improved compared with ResNet50. This is mainly due to the design of the fire module in the SqueezeNet hand feature extraction network, which retains the depth of the network and reduces the weight parameters. The experimental results show that the SqueezeNet hand feature extraction network can improve the efficiency and speed without scarifying accuracy.

TABLE 10 The Effect of the SqueezeNet Module on Performance Measured Using the NUS Hand Posture Dataset

Fig. 8 shows recognition results of SF-FCNet for 10 categories of gestures on the NUS hand posture dataset. In Fig. 8, yellow is the ground truth bounding box, other colors are the bounding boxes predicted by SF-FCNet, and a color represents a category. The images in Fig. 8 contain complex backgrounds such as human faces and cluttered objects. The results show that SF-FCNet can accurately detect the location and category of gestures for images containing a complex background, which shows that SF-FCNet can achieve better gesture recognition when there is interference from other skin colors or cluttered objects.

FIGURE 8.

The gesture recognition results for 10 categories of gestures with complex backgrounds on the NUS hand posture dataset.

Show All

4) In-House-Built Test Set

To evaluate the effectiveness and generalization ability of SF-FCNet, we conducted hand detection and gesture recognition tests on an in-house-built test set.

Fig. 9 shows the detection results of SF-FCNet on our in-house-built test set. In our in-house-built test set, the camera-hand distance range for shooting hands and gestures is 0.5m-1.5m, so the range of the selected hand and gesture is 0.5m-1.5m. The first row shows the hand detection using SF-FCNet trained on the Oxford hand dataset. The second row shows the gesture recognition using SF-FCNet trained on the NUS hand posture dataset. Fig. 9 shows that the detection results of SF-FCNet for hands and gestures with a camera-hand distance between 0.5m-1.5m are basically above 95%, indicating that SF-FCNet has better effectiveness and generalization. These experiments also indicate that our method has better detection results for small hands far away from the camera in terms of the in-house-built test set.

FIGURE 9.

The hand detection and gesture recognition results of SF-FCNet on the homemade test set. The first row is hand detection, and the second row is gesture recognition.

Show All

In the Guangxi Key Research and Development Project, SF-FCNet is used for real-time gesture recognition. Fig. 10 shows some of the frames captured during real-time gesture recognition by SF-FCNet on the video. The recognition result on the video demonstrated the real-time performance of SF-FCNet. All the work proves that SF-FCNet has excellent practicability.

FIGURE 10.

Some frames captured during real-time gesture recognition through SF-FCNet on the video.

Show All

D. Discussion

On three benchmark datasets, SF-FCNet achieves a higher mAP than the other state-of-the-art methods. The main reason is that we combine the deconvolution network and the residual structure to increase the detailed information of hand in the precise hand prediction fusion network of SF-FCNet and use multiscale features to improve the accuracy on small hands. In addition, the speed of SF-FCNet is better than those of other state-of-the-art methods on the Oxford hand dataset. This is mainly due to the SqueezeNet hand feature extraction network greatly reducing the weight parameters of the entire network via model compression.

In general, the mAP and FPS of SF-FCNet are superior to those of other state-of-the-art methods, which show that SF-FCNet has state-of-the-art hand detection and gesture recognition performance. The results on the in-house-built test set show the strong generalization ability of SF-FCNet. The detection results of hands and gestures far away from the camera on the four datasets reflects that our method has a better detection effect on small object hands. In addition, from the analysis of the experimental results of the Oxford hand dataset and EgoHands dataset, our method is more suitable for hand detection with a simple background.

SECTION V.

Conclusion

In this work, we propose a new efficient network (SF-FCNet) for hand detection and gesture recognition in images. The SqueezeNet hand feature extraction network is built to improve the detection speed. A deconvolution network, a residual structure and multiscale processing are introduced to the precise hand prediction fusion network to improve the precision and share weights. The experimental results show that SF-FCNet is competitive and generalizable, and it outperforms other state-of-the-art methods on the three benchmark datasets, which shows that SF-FCNet can achieve accurate and fast hand detection and gesture recognition.

The successful application of SF-FCNET in actual engineering shows that the method has certain validity and practicability. In addition, our work in this paper on different datasets not only provides a new method in the field of hand detection and gesture recognition, but also provides new experimental data for the research in the field of detection speed. The research on the EgoHands dataset also provides a new method for research work under the first-person perspective.

References is not available for this document.

MIT Libraries

MIT Libraries

SqueezeNet and Fusion Network-Based Accurate Fast Fully Convolutional Network for Hand Detection and Gesture Recognition

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction