Introduction
Human action recognition (HAR), or activity recognition, is an important area of research in signal and image processing. HAR mainly involves the automatic detection, localization, recognition, and analysis of human actions from data obtained from different types of sensors, including RGB cameras, depth sensors, range sensors, and inertial sensors. Action detection involves determining the presence of an action of interest in a continuous data stream, whereas action localization estimates when and where an action of interest appears. The goal of action recognition or classification is to determine which action appears in the data. In the past few years, research on HAR has gained significant popularity and is becoming increasingly vital in a variety of disciplines. Detecting and recognizing human activities is at the core of many human-computer interaction (HCI) applications, including visual surveillance, video analytics, assistive living, intelligent driving, robotics, telemedicine, sports annotation, and health monitoring [1]–[6]. Various sensor modalities have been utilized to monitor human beings and their activities. HAR approaches can generally be classified into two main categories depending on the type of sensor used: vision-based HAR and inertial sensor-based HAR.
Earlier vision-based action recognition studies used RGB video sequences captured by conventional RGB cameras to recognize human activities [7], [8]. These studies are mostly based on template-based or model-based approaches [9]–[11], space-time trajectories [12], motion encoding [13], and key pose extraction [14]. Numerous feature extraction methods have been proposed for HAR using RGB video data and have achieved successful recognition results; in particular, these include a 3D gradient-based spatiotemporal descriptor [15], the spatiotemporal interest point (STIP) detector [16], and motion-energy images (MEIs) and motion history images (MHIs) [17], [18]. The evolution of deep learning schemes, i.e., deep convolutional neural networks (CNNs) and Long Short-Term Memory (LSTM) networks, has motivated researchers to explore their application to action recognition from RGB videos [19]–[22]. HAR using RGB cameras has also been surveyed extensively in recent years [23]–[26]; these papers provide a comprehensive discussion of the different features and algorithms used in the literature for efficient HAR. Despite their benefits, there are limitations to using RGB cameras for monitoring human activities. For example, conventional RGB images lack 3D action data, which ultimately affects recognition performance.
Advances in image acquisition technology have made it possible to capture 3D action data using depth sensors. The depth images obtained from these sensors are insensitive to changes in illumination compared to conventional RGB images. Moreover, depth images also provide a way to obtain 3D information about a person's skeleton, enabling better recognition of human actions. Therefore, many researchers have put effort into recognizing human actions based on depth imagery [27]–[31]. Several feature extraction, description, and representation techniques have been developed for depth sensor-based HAR, including depth motion maps (DMMs) [32], bags of 3D points [33], projected depth maps [34], space-time occupancy patterns [35], spatiotemporal depth cuboids [36], surface normals [37], and skeleton joints [38]. Recently, a few research studies have proposed deep learning based methods for HAR using depth cameras and skeleton joints [39]–[42]. In [43], the authors utilized CNN and LSTM networks for skeleton-based activity recognition. The authors in [44] proposed a deep bilinear learning method for RGB-D action recognition, and a comprehensive study of RGB-D based human motion recognition using deep learning approaches is presented in [45]. Although vision-based HAR is continuously progressing, it is hindered by factors such as camera position, a limited angle of view, subject disparities in carrying out different actions, occlusion, and background clutter. Furthermore, camera-based HAR systems require extensive hardware resources to run computationally complex computer vision algorithms. These limitations are addressed by low-cost, computationally efficient, and miniaturized inertial sensors.
Wearable inertial sensors can deal with a much broader field of view and changing illumination conditions compared to RGB and depth sensors. They are attached directly to the human body or embedded into clothing, smartphones, footwear, and wrist watches to track human activities, and they generate 3D acceleration and rotation signals corresponding to human motion. Hence, like depth sensors, inertial sensors also capture 3D action data, comprising 3-axis acceleration in the case of an accelerometer and 3-axis angular velocity in the case of a gyroscope. Many researchers have utilized smartphones, smart watches, and wearable inertial sensors incorporating an accelerometer and gyroscope for human activity recognition [46]–[48]. In [49], [50], the authors detected complex human activities by utilizing the built-in inertial sensors of a smartphone along with wrist-worn motion sensors. With the growth of deep learning applications in vision-based action recognition systems, deep learning has also been applied to sensor-based activity recognition. In [51], the authors used deep learning for smartphone-sensor based activity recognition, whereas the authors in [52] used body sensor data for recognizing human activities. These studies achieved successful results in detecting and recognizing human activities. However, as wearable sensors are continuously driven toward lower power consumption, computationally heavy deep learning based approaches become impractical for unobtrusive human activity monitoring. Moreover, sensor-based activity recognition approaches have other limitations as well. For instance, sensor readings are sensitive to the sensor's orientation and location on the body. Also, wearing or placing these sensors on the body makes it inconvenient for users to carry out their tasks in a natural way. Table 1 summarizes the pros and cons of using different sensing modalities (i.e., RGB camera, depth camera, and inertial sensors) for HAR.
A conventional HAR system typically makes use of a single sensor modality, i.e., either a vision-based sensing modality or a wearable inertial sensor. However, under realistic operational settings, no single sensor modality can handle all the varying conditions that may arise in real time. The RGB and depth images from an RGB-D camera and the 3D inertial signals from a wearable sensor offer complementary information: vision-based sensors provide global motion features, whereas inertial signals give 3D information about local body movement. Hence, by fusing data from two complementary sensing modalities, the performance of HAR systems can be improved. A few existing studies [53]–[56] utilized the fusion of depth and inertial sensors to increase the accuracy of action recognition, and their results revealed significant improvements. Some authors have also applied deep learning to multiple sensing modalities for robust action recognition [57]–[59]. In [60], the authors utilized deep learning based decision-level fusion for action recognition using a depth camera and wearable inertial sensors: CNN-based features are extracted for the depth camera, whereas CNN and LSTM networks are used for the inertial sensors. Recently, in [61], the authors used skeleton-based LSTM and spatial CNN models to extract temporal and spatial features, respectively, for action recognition. The results of this study revealed that the fusion of multiple sensing modalities achieves a significant performance improvement compared to single-modality action recognition. Therefore, in this research work, we propose a multimodal HAR framework that utilizes the combination of multiple sensing modalities (i.e., a wearable inertial sensor, an RGB camera, and a depth camera) for action classification.
The fusion of multiple sensors can be performed at the base level (descriptor level), feature level (representation level), or decision level (score level) [12]. Each fusion type has its own merits and demerits, and the selection of the fusion method generally depends on the type of features and descriptors. Existing studies on multimodal HAR mostly focus on decision-level fusion because it is independent of the type, length, and numerical scale of the different features extracted from multiple sensing modalities. Moreover, decision-level fusion does not require any post-processing of the extracted features and reduces the dimensions of the final feature vector for classification. The major drawback of decision-level fusion is that independent, stand-alone classification decisions are made for each sensing modality and are only then combined using some soft rule to make the final decision. Hence, to better exploit the complementary information of the different modalities before the classification decision is made, this work focuses on feature-level fusion.
The key contributions of this research work are as follows:
A robust scheme is presented for HAR that emphasizes the feature-level fusion of RGB, depth, and inertial sensors to improve the accuracy of human action classification. Moreover, a detailed analysis is provided of the individual performance of these sensing modalities, as well as their combinations, using two common machine learning classifiers, i.e., K-Nearest Neighbor and Support Vector Machine.
Existing approaches for RGB and depth sensor-based HAR use different types of features for RGB and depth videos, which makes feature-level fusion infeasible. The proposed HAR method addresses this issue by using RGB-D features based on densely extracted Histograms of Oriented Gradients (HOG). The obtained features are then normalized to achieve the best recognition performance.
The proposed HAR method is evaluated on the publicly available benchmark dataset, the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD) [53], which covers a wide-ranging set of 27 different human actions. The results achieved by the proposed scheme are better than state-of-the-art results. To demonstrate the effectiveness of the proposed feature-level fusion over decision-level fusion, the obtained results are also compared with decision-level fusion results on UTD-MHAD.
The remainder of the paper is organized as follows. Section II provides an in-depth discussion of the proposed method. Section III discusses the results of different experiments designed to measure the performance of the proposed HAR method and compares its performance against different machine learning algorithms for HAR. Finally, Section IV concludes the outcomes of this research work and provides recommendations for future work.
Methodology of Research
The proposed methodology for HAR is shown in Fig. 1, which consists of three main steps: feature extraction and description, feature fusion, and action classification. These steps are explained in detail in the following sub-sections.
A. Feature Extraction and Description
As this research work focuses on the feature-level fusion of multiple sensor modalities for robust HAR, we extracted different sets of features for the inertial sensor data and the RGB/depth videos. This is done because these features provide the best recognition rates when each modality is used individually for HAR. The following sections provide the details of the feature extraction process for the inertial sensor data and the RGB/depth video sequences.
1) Feature Extraction for Inertial Sensor
The raw data obtained from wearable inertial sensors is orientation sensitive and often degraded by unwanted noise produced either by the instrument or by unanticipated movements of the participant. Hence, it is crucial to preprocess the raw data obtained from wearable inertial sensors before any further processing. For this purpose, the raw signals are first de-noised using an average smoothing filter. Afterwards, three time-domain statistical features are computed from each signal $s(n)$ of length $N$: the mean, the mean absolute first-order difference, and the mean absolute second-order difference, as given in Eqs. (1)–(3).
\begin{align*} \mu &= \frac {1}{N}\sum {s\left ({n }\right)}\tag{1}\\ \mu _{\nabla } &= \frac {1}{N}\sum \left |{ s\left ({n }\right)-s(n-1) }\right |\tag{2}\\ \mu _{\Delta } &= \frac {1}{N}\sum \left |{ s\left ({n+1 }\right)-2s\left ({n }\right)+s(n-1) }\right |\tag{3}\end{align*}
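To make the inertial feature computation concrete, the following minimal Python/NumPy sketch (our illustration, not the authors' code) applies an average smoothing filter and then computes the three statistical features of Eqs. (1)–(3) for each axis of a 3-axis signal; the window length, the random placeholder data, and the per-axis arrangement of features are assumptions made only for illustration.

```python
# Illustrative sketch: de-noise a 1-D inertial signal with an average (moving-mean)
# filter and compute the three statistical features of Eqs. (1)-(3).
# The window length of 5 is a hypothetical choice, not taken from the paper.
import numpy as np

def smooth(signal: np.ndarray, window: int = 5) -> np.ndarray:
    """Average smoothing filter applied via convolution."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def inertial_features(s: np.ndarray) -> np.ndarray:
    """Mean (Eq. 1), mean absolute first difference (Eq. 2),
    and mean absolute second difference (Eq. 3) of a signal s(n)."""
    n = len(s)
    mu = np.sum(s) / n
    mu_first = np.sum(np.abs(np.diff(s, n=1))) / n
    mu_second = np.sum(np.abs(np.diff(s, n=2))) / n
    return np.array([mu, mu_first, mu_second])

# Example: 3 features per axis of a 3-axis accelerometer sequence (samples x 3).
acc = np.random.randn(180, 3)                     # placeholder data
features = np.concatenate([inertial_features(smooth(acc[:, i])) for i in range(3)])
print(features.shape)                             # (9,)
```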
2) Feature Extraction for RGB/Depth Sensor
For the RGB and depth video data, we employed the general Bag-of-Words (BoWs) pipeline for HAR, which is visualized in Fig. 2. The BoWs method [63] has been successfully adapted from static images to motion clips and videos through local space-time descriptors and has many successful applications in HAR [15], [64], [65]. For human action clips, BoWs can be viewed as a bag of action patches that occur many times across the action frames. We used the BoWs approach to transform the locally extracted feature descriptors of an action clip into the fixed-size vector needed for classification.
Fig. 2. General pipeline for BoWs representation of dense HOG features extracted from RGB and depth video sequences.
The proposed BoWs-based approach for HAR consists of the following steps:
Local Feature Description: For extracting features from RGB and depth videos, we utilized dense sampling of local visual descriptors, since densely sampled descriptors are more accurate than keypoint-based sampling [66], [67]. As the local visual descriptor, we used densely extracted 3D volumes of HOG [68]. To compute dense HOG, the gradient magnitude response is first computed in both the horizontal and vertical directions, which results in a 2D vector field per frame. Haar features are used to compute the gradient magnitude response, as they are faster and obtain better results for HOG [62]. Next, we divided the input video into dense blocks of size $15\times15$ pixels $\times\,20$ frames. For every block, the gradient magnitude is quantized into $O$ orientation bins (where $O = 8$) by dividing each response magnitude linearly over the two neighboring orientation bins. After that, we concatenated the responses of multiple adjacent blocks in both the spatial and temporal directions: the descriptors of $3\times3$ blocks in the spatial domain and two blocks in the temporal domain, resulting in a 144-dimensional HOG descriptor. The size of each HOG descriptor is then reduced to half using Principal Component Analysis (PCA), which leads to a 72-dimensional descriptor. Finally, L1-normalization followed by a square root is applied to obtain the final descriptor representation.

Visual Codebook Construction: The number of significant interest points and densely extracted HOG descriptors may change from video to video, which results in feature sets of different sizes. However, to train a classifier, a fixed-size feature vector is required for all data sequences. For this purpose, we clustered the descriptors extracted from all training videos into $k$ clusters using k-means clustering. The center of each cluster is considered a visual word, and together these visual words make up the visual vocabulary or codebook.

Histogram of Words Generation: After constructing the visual vocabulary/codebook from the training videos, the next step is to quantize the HOG descriptors of each training/testing video into a fixed-size vector known as a histogram of words, which records the frequency of each visual word present in a video sequence. For a given video, each HOG descriptor is compared with all visual words and a vote is cast for the best-matching visual word, resulting in a histogram of the visual words for that video. In this manner, all training and testing videos are quantized into $k$-dimensional vectors referred to as Bag-of-Words. After computing BoWs for the training and testing video data, classifiers are applied for learning and recognition of human actions.
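As a concrete illustration of the codebook and histogram-of-words steps, the short Python sketch below (ours, under the assumption that the dense HOG descriptors have already been extracted and PCA-reduced to 72 dimensions per block) builds a k-means codebook from training descriptors and quantizes a video into a $k$-dimensional BoWs vector; the value $k=50$ and the histogram normalization are illustrative choices only.

```python
# Sketch of the BoWs stage, assuming dense HOG descriptors (num_blocks x 72 per video)
# are already available. Function names and k = 50 are illustrative, not from the paper.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=50):
    """Cluster all training descriptors into k visual words (the codebook)."""
    stacked = np.vstack(train_descriptors)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(stacked)

def bow_histogram(descriptors, codebook):
    """Quantize one video's descriptors into a k-dimensional histogram of visual words."""
    words = codebook.predict(descriptors)                 # best-matching word per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                    # normalization is our addition

# Usage: a fixed-size vector per training/testing video.
train_desc = [np.random.randn(300, 72) for _ in range(4)]  # placeholder descriptors
codebook = build_codebook(train_desc, k=50)
video_vector = bow_histogram(train_desc[0], codebook)      # shape (50,)
```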
B. Feature Fusion
After extracting features from the inertial sensors and the RGB/depth videos, we fused them for HAR. For this purpose, we independently computed a feature vector for the data obtained from each sensing modality (i.e., RGB/depth sensor and inertial sensor) and concatenated the individual feature vectors obtained from the multimodal data belonging to the same action at the same time, which resulted in a new, higher-dimensional feature vector. This fused feature vector carries more information for recognizing human actions than the feature vector obtained from a single sensing modality.
For feature-level fusion, it is necessary to balance the different feature sets obtained from the data of the different sensing modalities, i.e., the concatenated features must have the same numerical scale and similar lengths. Hence, we applied the min-max normalization technique [69] to the feature sets obtained for the RGB/depth and inertial sensors before concatenating them into a single resultant vector. The purpose of feature normalization is to adjust the numerical ranges and scaling parameters of the individual feature sets so that their values are mapped into a new feature domain with a common numerical scale. The min-max normalization scheme preserves the original score distribution and maps the values of a feature set $F_{x}$ into the standard range [0, 1] according to Eq. (4).
\begin{equation*} x^{\prime} = \frac{x - \min\left({F_{x}}\right)}{\max\left({F_{x}}\right) - \min\left({F_{x}}\right)}\tag{4}\end{equation*}
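The sketch below illustrates this fusion step under our own assumptions (per-dimension min-max statistics learned from training data, placeholder feature dimensions): each modality's feature matrix is normalized with Eq. (4) and the normalized vectors are concatenated into one fused vector per action sequence.

```python
# Minimal sketch of feature-level fusion: min-max normalize each modality's feature set
# (Eq. 4) and concatenate. Per-dimension statistics and the dimensions are assumptions.
import numpy as np

def min_max_fit(train_features):
    """Learn per-dimension minimum and range from the training feature set."""
    lo = train_features.min(axis=0)
    rng = train_features.max(axis=0) - lo
    rng[rng == 0] = 1.0                       # guard against constant features
    return lo, rng

def min_max_apply(features, lo, rng):
    """Map feature values into the [0, 1] range of the training data (Eq. 4)."""
    return (features - lo) / rng

# Fusion of inertial features and RGB/depth BoWs vectors (rows = action sequences).
inertial = np.random.randn(100, 18)           # placeholder inertial feature matrix
rgb_bow = np.random.rand(100, 50)             # placeholder BoWs feature matrix
fused = np.hstack([min_max_apply(inertial, *min_max_fit(inertial)),
                   min_max_apply(rgb_bow, *min_max_fit(rgb_bow))])
print(fused.shape)                            # (100, 68): one fused vector per sequence
```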
The size of the feature vector obtained in the case of inertial sensor data is fixed for each data sequence, i.e., [
C. Action Recognition
After feature extraction and fusion from multiple sensor modalities, the next step is to choose a suitable classifier for training and testing the proposed HAR framework. Two popular classifiers, i.e., K-Nearest Neighbors (K-NN) and Support Vector Machine (SVM), are used for this purpose because of their efficient recognition performance in existing state-of-the-art studies [8], [70]–[72]. Moreover, we also wanted to compare their recognition performance when the fusion of different sensing modalities is used for HAR.
Experimental Results
In this section, we first briefly describe the dataset used for experimentation along with experimental design and evaluation metrics. We then provide information regarding the implementation of our proposed framework. After that, we compare our algorithm with existing state-of-the-art HAR methods. Finally, we discuss the qualitative results to provide essential intuitions of the proposed method.
A. Dataset and Implementation Details
We evaluated the proposed method on a publicly accessible multimodal HAR dataset, UTD-MHAD, which comprises 27 human actions carried out by eight subjects (four females and four males). Fig. 3 provides a list of these actions with example images. Each subject repeated every action four times. Hence, there were overall 864 trimmed data sequences (8 subjects × 27 actions × 4 repetitions).
To implement the proposed HAR method, the K-NN and SVM classifiers are trained and tested on UTD-MHAD. For the K-NN classifier, the parameter 'K' is set to 1, and an equal-weight Euclidean distance metric is used as the similarity measure. The nearest-neighbor parameter 'K' is different from the parameter 'k' used earlier to denote the number of clusters (visual words) in the BoWs codebook.
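For illustration, the following sketch (ours, with several stated assumptions) instantiates the two classifiers as described above, i.e., a 1-nearest-neighbor classifier with an equal-weight Euclidean distance and an SVM whose kernel and C are placeholder choices, and evaluates them with a subject-grouped 8-fold cross-validation; grouping the folds by subject is our reading of a subject-generic protocol rather than a confirmed detail of the paper.

```python
# Hedged sketch of the classification stage on fused feature vectors.
# The linear kernel, C = 1.0, and grouping folds by subject are our assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GroupKFold, cross_val_score

X = np.random.rand(864, 68)                        # placeholder fused feature vectors
y = np.random.randint(0, 27, size=864)             # 27 action classes
subjects = np.repeat(np.arange(8), 108)            # 8 subjects x (27 actions x 4 trials)

knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean", weights="uniform")
svm = SVC(kernel="linear", C=1.0)

cv = GroupKFold(n_splits=8)                        # one fold per subject
for name, clf in [("K-NN", knn), ("SVM", svm)]:
    scores = cross_val_score(clf, X, y, cv=cv, groups=subjects)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```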
B. Action Recognition Results and Analysis
For feature-level fusion, we concatenated the individual feature sets extracted from the inertial sensor data and the corresponding RGB and/or depth video sequence after min-max normalization. Although feature-level fusion seems simple and straightforward, it suffers from several deficiencies. First, the increase in the dimensionality of the fused feature vector raises the computational complexity of classification. Second, the dimensionality of the RGB/depth features is typically much higher than that of the features extracted from inertial sensor data, which ultimately defeats the purpose of fusion. We address these issues by using a variable-length feature vector for the RGB and depth data sequences. The size of the feature vector obtained from the inertial sensor data is equal to
1) Performance Analysis of Inertial Sensor-Based HAR
This section discusses the HAR results obtained using only the inertial sensors. Table 2 summarizes these results for different combinations of sensors; the results are provided for each inertial sensor individually as well as for their feature-level fusion. It can be observed that the K-NN classifier performs better than the SVM classifier in recognizing human actions based on a single inertial sensor or their combination. The accuracy achieved by the K-NN classifier using the accelerometer and the gyroscope individually is 78.5% and 76.6%, respectively. These accuracy rates are 1.9% and 3.8% better than the values achieved by the SVM classifier when using these sensors individually. The overall performance of the accelerometer in recognizing human actions is better than that of the gyroscope. Moreover, it can be observed that the fusion of these inertial sensors improves the overall recognition accuracy to 91.6% and 90.5% when classified using the K-NN and SVM classifiers, respectively. Overall, the K-NN classifier provides better results than the SVM classifier in classifying human actions based on the feature-level fusion of the inertial sensors.
2) Performance Analysis of RGB and Depth Sensor-Based HAR
This section provides the detailed results obtained for HAR using the depth and RGB sensors individually as well as in combination. These results are computed for different values of the codebook size $k$.
It can be observed from Table 3 that the K-NN classifier achieves the maximum accuracy for HAR using the depth and RGB sensors individually, i.e., 81.5% and 85.2%, respectively, for
3) Performance Analysis of HAR Based on Feature-Level Fusion of RGB, Depth and Inertial Sensors
This section analyzes the performance of HAR when the feature-level fusion of the RGB/depth and inertial sensors is performed. The statistical features computed from the inertial sensor data differ from the dense HOG-based features extracted from the RGB/depth video data and have different dimensions. Feature-level fusion is only practical when the dimensions of the feature vectors being fused are not too different. In the case of the inertial sensor data, the feature vector size is
Table 4 presents the detailed results of HAR based on the feature-level fusion of the RGB/depth and inertial sensors. It can be observed that the K-NN classifier provides better results than the SVM classifier. When using only the accelerometer with the depth sensor, the maximum accuracy achieved for HAR using the K-NN classifier is 94.8% (for
The recognition results for the feature-level fusion of the RGB and inertial sensors are also presented in Table 4. When the accelerometer and the gyroscope are individually combined with the RGB sensor, the maximum accuracy achieved for HAR using the K-NN classifier is 96.1% (for
4) Analysis of Feature-Level Fusion Results for HAR Using K-NN Classifier
This section compares the best performance achieved by the proposed HAR method using the K-NN classifier when different sensing modalities are used. Table 6 compares the average accuracy attained using different sensors along with the final feature vector length and average processing time. It can be observed that the feature-level fusion of different sensors increases the length of the final feature vector, which in turn increases the average computational time. The processing time for the proposed HAR method is computed using MATLAB on a laptop with a 2.3 GHz Intel Core-i5 CPU and 8 GB of RAM. For each sensor or set of sensors, the average times taken for feature extraction and classification can be added to obtain the overall average computational time. For the RGB/depth sensor, the average time is calculated per frame, whereas for the inertial sensors it is computed per sample.
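As an aside, the timing accounting described above can be reproduced with a simple measurement loop; the sketch below is our own illustration with a hypothetical extract_features stand-in, not the MATLAB code used to obtain the reported numbers.

```python
# Illustrative timing sketch: average per-frame feature-extraction time plus the
# per-sequence classification time gives the overall processing-time estimate.
# extract_features() is a hypothetical stand-in for the dense HOG pipeline.
import time
import numpy as np

def extract_features(frame):
    return frame.mean()                                   # placeholder computation

def average_time(fn, inputs):
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    return (time.perf_counter() - start) / len(inputs)

frames = [np.random.rand(240, 320) for _ in range(32)]    # placeholder video frames
t_extract_per_frame = average_time(extract_features, frames)
t_classify_per_sequence = 1.2e-4                          # placeholder measured value (s)
total = len(frames) * t_extract_per_frame + t_classify_per_sequence
print(f"approx. per-sequence processing time: {total:.6f} s")
```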
From Table 6, it can be seen that the accuracy achieved for HAR with the accelerometer only is 78.5%, whereas for the gyroscope it is 76.6%. The fusion of the accelerometer and gyroscope provides an accuracy of 91.6% at the expense of an increase of around 46% (53 microseconds ($\mu$s)) in the average processing time.
The best accuracy obtained for the proposed HAR approach is 98.3%, which is achieved by the feature-level fusion of four different sensors: RGB, depth, accelerometer, and gyroscope. However, in this case, the average classification time increases by 13.5% compared with the average classification time taken for the fusion of the inertial sensors with the RGB or depth sensor. The average time required for feature extraction also increases, by a factor of up to approximately 2.6, as can be observed from Table 6. On the other hand, adding the inertial sensors to only the RGB or depth sensor results in a slight increase in the average processing time and provides an accuracy comparable to the maximum. The accuracy achieved by the fusion of the RGB and inertial sensors is 97.6%, which is 8.3% higher than that obtained by the feature-level fusion of the RGB and depth sensors using the K-NN classifier. Hence, considering the trade-off between accuracy and computational time, the overall performance of the proposed HAR method is best for the feature-level fusion of the inertial sensors with only the RGB or depth sensor. In particular, since the RGB sensor provides rich texture information and the inertial sensors capture 3D motion information, their feature-level fusion provides the overall best performance for the proposed HAR framework.
Fig. 4 provides the confusion matrix of the best overall results achieved for the feature-level fusion (using RGB and inertial sensors) to demonstrate the per-class recognition accuracy of all 27 actions in UTD-MHAD. It can be observed from the figure that most of the actions are recognized with very high individual accuracy. The lowest individual recognition accuracy achieved is 87.5% for action 20, i.e., catching an object with the right hand.
Fig. 4. Confusion matrix of the HAR results obtained for the feature-level fusion of RGB and inertial sensors.
5) Comparison of Feature-Level Fusion and Decision-Level Fusion Results for Proposed HAR Method
Our proposed method for HAR relies on the feature-level fusion of multiple sensors for robust action recognition. However, most existing studies on multimodal action recognition [53], [54] have focused on decision-level fusion to achieve effective recognition results, since the features extracted from different sensors are independent. Feature-level fusion, in contrast, requires the numerical scales and dimensions of the fused feature vectors to be similar, which is not possible with the types of features extracted from RGB and depth video sequences in the existing studies. Also, the dimensions of the RGB and depth features are often much higher than those of the inertial sensor features, which makes feature-level fusion infeasible. Consequently, the results reported for feature-level fusion in the literature are not as consistent and accurate as the decision-level fusion results. In our proposed study, we first balanced the dimensions and numerical scales of the RGB-D features (densely extracted HOG) and the statistical signal attributes computed from the inertial sensors, and then performed multimodal feature-level fusion to achieve the desired HAR results.
To validate the effectiveness of our feature-level fusion approach, we also computed the decision-level fusion results for the proposed scheme and compared both results. For the decision-level fusion, we followed the same approach as proposed by the authors in [53], [54]. For the fusion of
Comparison of the maximum accuracy rate achieved for the proposed HAR framework with the feature-level and decision-level fusion of different sensors using K-NN classifier. For any combination of sensors, the feature-level fusion outperforms the decision-level fusion. * Here, ‘A’ represents the accelerometer sensor, ‘G’ represents the gyroscope, ‘D’ is the depth sensor, and ‘RGB’ represents the RGB sensor.
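For contrast, a minimal sketch of a decision-level (score-level) fusion baseline is shown below; it is our own illustration, not the exact procedure of [53], [54]. Each modality gets its own classifier, and the per-class scores are merged with an unweighted average (a soft rule); a logistic-regression scorer is used here only because it yields smooth class probabilities, whereas the classifiers in this paper are K-NN and SVM.

```python
# Sketch of decision-level fusion: one classifier per modality, soft-rule (average) of
# their class-probability scores. The classifier choice and equal weights are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def decision_level_fusion(train_sets, y_train, test_sets):
    """Train a classifier per modality and average their predicted class probabilities."""
    probs = []
    for X_train, X_test in zip(train_sets, test_sets):
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        probs.append(clf.predict_proba(X_test))        # per-modality class scores
    return np.mean(probs, axis=0).argmax(axis=1)       # fused decision per test sample

# Placeholder data: two modalities with different feature dimensions.
y_train = np.random.randint(0, 27, size=200)
train_sets = [np.random.rand(200, 18), np.random.rand(200, 50)]
test_sets = [np.random.rand(40, 18), np.random.rand(40, 50)]
print(decision_level_fusion(train_sets, y_train, test_sets).shape)   # (40,)
```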
6) Performance Comparison of the Proposed HAR Scheme With State-of-the-Art Methods
This section provides a performance comparison of the proposed scheme for HAR with the existing techniques. The proposed HAR scheme, based on the feature-level fusion of RGB and inertial sensors, provides superior recognition performance on UTD-MHAD compared to existing methods as shown in Table 7. Chen et al. [53] presented UTD-MHAD in their study and utilized the decision-level fusion of depth and inertial sensors (accelerometer and gyroscope) for HAR. They computed three statistical features for inertial sensor data and extracted DMMs for depth video sequences. The authors partitioned the dataset into two equal splits for training and testing. The data corresponding to four different users was utilized for training whereas the data from the rest of the users was used for testing, which resulted in an average accuracy rate of 79.1%. The authors modified their existing methodology in [54] to incorporate real-time HAR, which achieved recognition accuracy of 91.5% using an 8-fold cross-validation scheme for subject-generic experiments. The authors also conducted experiments using subject-specific training and testing, which achieved an average accuracy rate of 97.2%. Ben Mahjoub and Atri [8] proposed an RGB sensor-based scheme that utilized the STIP for detecting significant changes in an action clip. Moreover, they used the HOG and Histogram of Optical Flow (HOF) as feature descriptors and achieved an accuracy rate of 70.37% using SVM classifier. Wang et al. [40] used CNN for HAR and utilized the skeleton information from the Kinect sensor to achieve an overall recognition accuracy of 88.1% on UTD-MHAD. Kamel et al. [41] applied deep CNN for HAR using depth maps and skeleton information and achieved an accuracy rate of 87.9% on UTD-MHAD dataset. The research work in [39] proposed the skeleton optical spectra (SOS) method based on CNNs to recognize human actions. The authors encoded the skeleton sequence information into color texture images for HAR and achieved an accuracy rate of 86.9% on UTD-MHAD. The authors in [60] utilized the decision-level fusion for HAR using depth camera and wearable inertial sensors. They extracted CNN based features for depth sensor and used CNN and LSTM networks for inertial sensors. Their study achieved an accuracy of 89.2% on UTD-MHAD. Recently, Cui et al. [61] used the skeletal data to extract the temporal and spatial features for action recognition using LSTM and spatial CNN models respectively. They achieved a maximum accuracy rate of 87.0% on UTD-MHAD.
Our proposed scheme combines the color and rich texture information from the RGB sensor with 3D motion information obtained from inertial sensors for robust HAR. The proposed scheme for HAR, based on the feature-level fusion of RGB and inertial sensors, obtained the maximum recognition accuracy of 97.6% using 8-fold cross-validation, which is better than the reported results of existing techniques. Furthermore, the proposed scheme is computationally efficient as the overall length of the fused feature vector is very small, i.e., 49 (
Conclusion
In this paper, a feature-level fusion method has been proposed for human action recognition that utilizes data from two different sensing modalities: vision and inertial. The proposed system merges the features extracted from the individual sensing modalities to recognize an action using a supervised machine learning approach. The detailed experimental results indicate that our proposed method is more robust in classifying human actions than settings where each sensor modality is used individually. Also, the feature-level fusion of the time-domain features computed from the inertial sensors and the densely extracted HOG features from the depth/RGB videos reduces the computational complexity and improves the recognition accuracy of the system compared to state-of-the-art deep CNN methods. Regarding classifier performance, the K-NN classifier provides better results for the proposed HAR system than the SVM classifier.
The proposed HAR method also has some limitations. For example, it works with pre-segmented actions, which are not available in practice. Moreover, it does not incorporate multi-view HAR, and it assumes that the orientation of the person whose action is being recognized remains fixed with respect to the camera. In the future, we plan to extend the proposed HAR method to address these limitations. Furthermore, we aim to investigate specific applications of the proposed fusion framework using an RGB-D camera and wearable inertial sensors.