Introduction
Inertial measurement units (IMUs) are a prominent option for analyzing human motion. IMUs measure 3D acceleration, angular velocity, and magnetic field, and they calculate their 3D orientation. Body-worn IMUs can be used to estimate the rotational and, sometimes, translational motion of the attached segment, which helps estimate the required motion parameters. As the sensors operate at a high frame rate with low latency, they can be used in real-time applications for motion analysis, such as full-body motion capture [1]–[3] and navigation [4], [5]. Furthermore, recent technological advances have dramatically reduced the size and price of IMUs, making them one of the most promising technologies for the continuous tracking of human movements in daily life [6]–[8]. Because recent improvements have made configuration easier, non-expert (but trained) users can collect motion data with IMUs. A clinic's doctors or their assistants can use inertial sensors to track patients' motions to assist in rehabilitation or disease diagnosis [9]–[11]. Some studies have collected data from many participants wearing IMUs during everyday life for action recognition tasks [12]–[14].
For a detailed and robust motion analysis, many IMU-based applications derive data from multiple sensors mounted on multiple body segments. The conventional approach to gait analysis attaches six IMUs to the upper and lower legs and feet [15]. Some IMU-based full-body motion analyses require more than 10 inertial sensors to track one subject [2], [16], [17]. Such configurations are prone to errors because each sensor must be attached to a predefined body segment. If an IMU is mounted on the wrong segment, remeasurement is required. This problem can be an obstacle for general users who want to measure motion with IMUs. Hence, a technique to identify the segment to which each sensor is attached based on the sensor signals is desired, as it would make IMU attachment easier and quicker. This identification task is called an IMU-to-segment (I2S) assignment [18].
In this paper, we address the I2S assignment: the task of classifying IMU data into classes corresponding to the body segments on which the IMUs are mounted. With the assignment framework proposed in this paper, only one IMU needs to be attached to a predetermined segment; the other IMUs can be mounted on arbitrary segments because our framework automatically assigns the sensors to the segments to which they are attached based on the sensors' measurements during a few seconds of walking. Classical approaches to I2S assignment manually design features for discriminating IMU placements [19]–[21]. Recent work has proposed extracting features with deep neural networks (DNNs) [18]. Although these approaches achieved high assignment accuracy in well-controlled settings (e.g., when the approximate angle of the sensor to the segment in the test set is the same as that in the training set), their accuracy decreases in trials that do not meet these conditions.
To mitigate these limitations and robustly perform the I2S assignment, we propose an approach that merges features across all body-worn IMUs and learns the global dependencies between these IMUs. Unlike conventional methods that classify sensors one by one, our approach assigns locations to all body-worn IMUs at once through a DNN. The proposed model classifies each IMU based on a global feature that represents the motions of all sensor-attached segments of a body. Additionally, the model learns the dependency relationships between IMUs, which enables it to perform assignments based on the data from relevant IMUs (e.g., IMUs attached to the adjacent segment). To implement this feature fusion and dependency learning, we present a new DNN architecture that incorporates a global feature generation module and an attention-based mechanism.
We experimentally evaluated our method using synthetic and real datasets in three sensor configurations. The results demonstrated that the proposed approach significantly outperformed the conventional work and the baselines in assignment accuracy. Also, the ablation studies and attention maps generated by the intermediate layer of the proposed model suggested that our model captured the dependency relationships between IMUs. The results obtained with the real IMU dataset validated the robustness of our method. Our contributions are summarized as follows:
We propose a novel I2S assignment model that generates a global feature representing the motion of all body segments to which IMUs are attached and learns pairwise dependencies between the IMUs.
We demonstrate that merging features extracted from multiple body-worn IMUs can benefit the identification of a segment where each IMU is mounted.
We show that the proposed method outperforms the conventional and baseline methods in three sensor configurations on synthetic and real public datasets.
Related Work
A. IMU-to-Segment Assignment
A line of research on placement recognition of inertial sensors has aimed to define effective feature representations based on signals from IMUs. Early work applied hand-crafted feature descriptors, such as the root mean square and amplitudes of accelerations, and classical classification algorithms, including support vector machines and decision trees [19]–[21]. The feature descriptors of these approaches are designed based on the intuition and experience of the researchers, with no agreement regarding the most suitable features for I2S assignments.
A recent study on I2S assignment proposed an approach that combines convolutional neural networks (CNNs) and recurrent networks [18]. This combined network was trained in an end-to-end manner without the need to manually design features. However, this approach assumes that the IMUs are attached to the lower limbs and assigns the IMUs one by one, ignoring the signals from the other IMUs. In contrast, the proposed method assigns IMUs mounted on segments across the full body by using the signals from all body-worn IMUs. Our method generates a global feature that represents multi-segment motion, which allows the model to assign an IMU of interest based on its motion relative to all the IMUs. In addition, the proposed model learns the dependency relationships between the IMUs. Intuitively, when assigning a sensor on the left tibia, the model should also pay attention to the data from the IMUs attached to the left femur and the left foot. Global feature extraction and dependency learning are incorporated into the proposed model using the techniques introduced in the following Secs. II-B and II-C, respectively. To the best of our knowledge, our work is the first deep learning approach that assigns each IMU to a segment using an aggregated global feature and the sensor interdependencies.
B. Global Feature Extraction
The proposed module to generate a global feature that represents the motion of all segments to which IMUs are attached is inspired by a technique used in point cloud semantic segmentation: the task of separating a point cloud into multiple regions according to the semantic meanings of points [22]. Because a 3D point in a point cloud, which has only positional data, has little information, recently developed approaches have successfully handled point clouds by aggregating local features and obtaining global features [23]–[25]. The feature aggregation module incorporated in the proposed model allows the model to use the global motion of the body segments for the assignment of IMUs.
PointNet [23] is the pioneering work in applying neural networks to learn over general point sets. It takes raw point clouds as input and obtains a global feature through a pooling layer that follows individual feature extractors composed of a simple multi-layer perceptron (MLP). The pooling aggregator is widely used across various tasks and data structures [26]–[28] due to its simple implementation and its permutation invariance with respect to the inputs. The proposed assignment model generates a global feature using the pooling aggregator to merge individual features from IMU data that are input in a random order.
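As a minimal illustration of this permutation invariance (the feature values below are arbitrary placeholders, not features from our model):

```python
import numpy as np

# Per-IMU feature vectors (placeholder values), one row per IMU.
features = np.array([[0.2, 1.5, -0.3],
                     [0.9, 0.1,  0.7],
                     [-0.4, 2.0, 0.5]])

# Element-wise max over the set of IMUs yields the global feature.
global_feature = features.max(axis=0)

# Permuting the IMUs (i.e., shuffling the rows) leaves the result unchanged.
shuffled = features[np.random.permutation(len(features))]
assert np.allclose(global_feature, shuffled.max(axis=0))
```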
C. Attention Mechanism
Attention-based neural networks have been successfully applied to a wide variety of fields, such as natural language [29], [30], image [31], [32], and speech processing [33]. The studies report that learning the dependencies among the intermediate features through the attention mechanism improves recognition accuracy. The learned attention also helps interpret the reasoning behind the machine prediction and improves the explainability of the DNN models [34], [35].
Transformer is one of the most promising approaches for learning global dependencies using the attention mechanism [29]. Transformer was originally proposed for natural language processing and has been quickly adopted for a variety of tasks, such as image classification [32] and object detection [31]. The self-attention operator in Transformer explores the dependencies of input feature vectors. We incorporate the Transformer encoder into our model to obtain the dependency relationships between body-worn IMUs. We expect the attention mechanism to capture the pairwise dependencies of the sensors, which enables the assignment of an IMU to rely on the features extracted from the dependent IMUs.
Methods
A. Problem Setting
We address the I2S assignment, which involves identifying the segment to which each IMU is mounted, based only on the IMU signals without relying on external sensors. We construct a DNN-based model to learn discriminant features and classify the IMUs according to the segments to which they are attached. In the proposed framework, a user performs the assignment following the three steps below:
The user selects a root IMU from a set of IMUs to be mounted and attaches it to the predetermined root segment of a subject.
The user mounts the remaining IMUs on the arbitrary body segments of the subject.
The proposed model provides assignment predictions using the data from all body-worn IMUs while the subject walks for a few seconds.
B. Method Overview
The proposed I2S assignment framework, as illustrated in Fig. 1, consists of data preprocessing, IMU-wise feature extraction, global feature generation, and attention learning modules. Our model takes as input the accelerations and angular velocities of all body-worn IMUs.
In the data preprocessing module, accelerations and angular velocities in the sensor-local coordinates are converted to the root sensor coordinates, and noise is added to the accelerations for data augmentation. Then, the discriminant features are extracted from the IMU signals in a one-by-one manner, and these features are merged in the global feature generation module. In the final step, global dependencies are learned in the Transformer encoder [29], and the model then provides classification scores through linear transformation with softmax activation.
C. Data Preprocessing
Coordinate transformation and data augmentation are performed in the data preprocessing module for better generalization and convergence of the proposed assignment model. In this section and Fig. 2, the accelerations, angular velocities, and orientations refer to the values at a specific time step; the time index is omitted for notational simplicity.
First, the raw sensor signals w.r.t. the sensor-local coordinates are converted into the root sensor coordinates. Using the orientations provided by the IMUs, the rotation of the $i$-th sensor frame relative to the root sensor frame is \begin{equation*} {\mathbf R}_{RS}^{i} = {\mathbf R}_{WR}^{\mathsf {T}} {\mathbf R}_{WS}^{i},\tag{1}\end{equation*}
where ${\mathbf R}_{WR}$ and ${\mathbf R}_{WS}^{i}$ denote the orientations of the root IMU and the $i$-th IMU w.r.t. the world frame, respectively. The acceleration of the $i$-th IMU is then expressed in the root sensor coordinates as \begin{equation*} {\mathbf a}_{i} = {\mathbf R}_{RS}^{i}{\mathbf a}_{i}^{l},\tag{2}\end{equation*}
where ${\mathbf a}_{i}^{l}$ is the acceleration measured in the sensor-local coordinates. The angular velocities are transformed in the same manner.
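A minimal sketch of this transformation, assuming each IMU provides its orientation w.r.t. a common world frame as a 3×3 rotation matrix (function and variable names are illustrative):

```python
import numpy as np

def to_root_frame(R_WR, R_WS_i, acc_local_i, gyro_local_i):
    """Express the i-th IMU's readings in the root sensor frame (Eqs. (1)-(2)).

    R_WR    : (3, 3) orientation of the root IMU w.r.t. the world frame.
    R_WS_i  : (3, 3) orientation of the i-th IMU w.r.t. the world frame.
    acc_local_i, gyro_local_i : (3,) readings in the i-th sensor-local frame.
    """
    R_RS_i = R_WR.T @ R_WS_i            # Eq. (1)
    acc_root = R_RS_i @ acc_local_i     # Eq. (2)
    gyro_root = R_RS_i @ gyro_local_i   # angular velocity transformed analogously
    return acc_root, gyro_root
```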
Data augmentation is executed to avoid over-fitting and to stabilize the performance of the trained model. Following the methods of successful studies that have applied DNNs to IMU data [18], [37], we augment the sensor signals by adding zero-mean Gaussian noise to the accelerations.
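A minimal sketch of this augmentation step (the noise standard deviation below is an illustrative placeholder, not the value used in our experiments):

```python
import numpy as np

def augment_accelerations(acc, noise_std=0.1, seed=None):
    """Add zero-mean Gaussian noise to the acceleration channels.

    acc : array of shape (frames, 3). noise_std is a placeholder value.
    """
    rng = np.random.default_rng(seed)
    return acc + rng.normal(0.0, noise_std, size=acc.shape)
```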
D. IMU-Wise Feature Extraction and Feature Aggregation
The proposed DNN-based assignment model starts with IMU-wise feature extraction. Inspired by the conventional architectures applied to IMU accelerations and angular velocities [18], [37], we construct the feature extractor with CNN layers and a recurrent network layer.
The main difference between the previous work and ours is the step-by-step change in kernel size for each CNN layer. As shown in Fig. 3, the kernel size and stride of the first convolution along the height are three. This operator explicitly extracts features from the accelerations and angular velocities separately, and the next convolution layer, whose kernel spans the resulting feature maps along the height, fuses the features of the two modalities.
Illustration of the proposed convolution operator. The orange boxes in the blue blocks represent the convolution kernels. The kernel size changes for each operation.
We incorporate recurrent units after the convolution layers. We adopt gated recurrent units (GRUs) [40], following the results presented in the previous work on I2S assignment [18]. The feature map from the last CNN layer is fed into the GRU layer, which summarizes the temporal dynamics into an IMU-wise feature vector.
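A rough sketch of this IMU-wise extractor is given below. The stacking of the six channels (3 acceleration + 3 angular-velocity rows) as image rows, the filter counts, kernel widths, GRU size, and the number of frames are illustrative assumptions; only the first kernel/stride height of three follows the description above.

```python
import tensorflow as tf

def imu_feature_extractor(num_frames=120, feat_dim=128):
    """Illustrative IMU-wise extractor: CNNs with shrinking kernel height, then a GRU."""
    inp = tf.keras.Input(shape=(6, num_frames, 1))        # 3 acc + 3 gyro rows
    x = tf.keras.layers.Conv2D(32, (3, 5), strides=(3, 1), padding="same",
                               activation="relu")(inp)    # acc / gyro handled separately
    x = tf.keras.layers.Conv2D(64, (2, 5), strides=(2, 1), padding="same",
                               activation="relu")(x)      # fuse the two modalities
    x = tf.keras.layers.Conv2D(64, (1, 3), padding="same", activation="relu")(x)
    x = tf.keras.layers.Reshape((num_frames, 64))(x)      # (time, channels)
    x = tf.keras.layers.GRU(feat_dim)(x)                  # temporal summary feature
    return tf.keras.Model(inp, x)
```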
The IMU-wise features individually extracted by the CNNs and the recurrent layer are aggregated to generate a global feature that represents the global motion of the segments to which the IMUs are attached. The architecture chosen for feature merging follows the recent success of the pooling aggregator proposed in [23]. The aggregated feature ${\mathbf g}$ is obtained by an element-wise max operation over the IMU-wise features: \begin{equation*} {\mathbf g}(p, q) = {\max }({\mathbf u}_{r}(p,q), {\mathbf u}_{1}(p,q), \cdots, {\mathbf u}_{n}(p,q)),\tag{3}\end{equation*}
where ${\mathbf u}_{r}$ is the feature of the root IMU, ${\mathbf u}_{i}$ ($i = 1, \cdots, n$) are the features of the remaining $n$ IMUs, and $(p, q)$ indexes the elements of the feature maps.
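A minimal sketch of this aggregation and of the subsequent concatenation with the global feature (tensor shapes and names are illustrative):

```python
import tensorflow as tf

def aggregate_and_concat(features):
    """Max-pool the IMU-wise features into a global feature (Eq. (3)) and
    concatenate it back onto every IMU-wise feature.

    features: tensor of shape (batch, n + 1, feat_dim) holding the root feature
    u_r and the n IMU-wise features u_1 ... u_n along axis 1.
    """
    g = tf.reduce_max(features, axis=1, keepdims=True)   # global feature g, Eq. (3)
    g = tf.broadcast_to(g, tf.shape(features))           # repeat g for every IMU
    return tf.concat([features, g], axis=-1)             # (batch, n + 1, 2 * feat_dim)
```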
E. Attention-Based Architecture
The Transformer learns the dependency relationships between feature vectors and obtains discriminant feature representations [29]. The IMU-wise features, each concatenated with the global feature ${\mathbf g}$, are fed into the Transformer encoder to learn the pairwise dependencies between the IMUs.
The architecture of the transformer encoder layer. The differences from the original are the position of the normalization operator and the lack of position embeddings.
The architecture within the attention learning layer is designed to be similar to that of the original Transformer encoder [29]; however, there are two differences between the original and ours. One is the position at which layer normalizations (LNs) are applied. LNs are applied before the multi-head attention module and before MLP, following the method used by recent works that modified the Transformer and improved its recognition accuracy [32], [41]. Another difference is the lack of position embeddings because our model solves an assignment problem that assumes the order of the input is unknown.
A given input is linearly projected into queries ${\mathbf Q}_{h}$, keys ${\mathbf K}_{h}$, and values ${\mathbf V}_{h}$ for each attention head $h$, and the self-attention matrix is computed as \begin{equation*} {\mathbf A}_{h} = \text {softmax}\left ({\frac {{\mathbf Q}_{h}{\mathbf K}_{h}^{\mathsf {T}}}{\sqrt {d}}}\right),\tag{4}\end{equation*}
where $d$ is the dimension of the keys. The attention matrices weight the value vectors, and the outputs of all the heads are concatenated and linearly transformed, as in the original Transformer [29].
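A minimal sketch of one such pre-LN encoder block, assuming Keras layers and illustrative head counts and dimensions (the built-in MultiHeadAttention layer applies the scaled dot-product attention of Eq. (4) internally; no position embeddings are added, so the input order does not matter):

```python
import tensorflow as tf

def encoder_block(x, num_heads=4, key_dim=32, mlp_dim=256):
    """Pre-LN Transformer encoder block without position embeddings.

    x: tensor of shape (batch, n + 1, d_model) holding the per-IMU token features;
    the last dimension must be statically known.
    """
    # Multi-head self-attention with pre-layer-normalization.
    h = tf.keras.layers.LayerNormalization()(x)
    h = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)
    x = x + h                                            # residual connection
    # Position-wise MLP, also with pre-layer-normalization.
    h = tf.keras.layers.LayerNormalization()(x)
    h = tf.keras.layers.Dense(mlp_dim, activation="relu")(h)
    h = tf.keras.layers.Dense(x.shape[-1])(h)
    return x + h
```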
In the training phase, we use the cross-entropy loss between the predicted classification scores and the ground-truth segment labels.
In the test phase, we found that determining the assignments jointly from the predicted probability distributions, rather than independently for each IMU, improves the accuracy, because each segment can hold at most one IMU. The classification scores of the $n$ IMUs are stacked into a score matrix \begin{equation*} {\mathbf Y}^{\mathsf {T}} = (\hat {\mathbf y}_{1}, \hat {\mathbf y}_{2}, \cdots, \hat {\mathbf y}_{n}),\tag{5}\end{equation*}
where $\hat {\mathbf y}_{i}$ is the predicted probability vector for the $i$-th IMU.
Let ${\mathbf B}$ be a binary assignment matrix whose entry ${\mathbf B}(i,j)$ is 1 if the $i$-th IMU is assigned to the $j$-th segment and 0 otherwise, with at most one nonzero entry in each row and column. The final assignment $\hat {\mathbf B}$ is obtained by maximizing the total score: \begin{equation*} \hat {\mathbf B} = \mathop {\mathrm {arg\,max}} _{\mathbf B} \sum _{i=1}^{n}\sum _{j=1}^{n}{\mathbf B}(i,j){\mathbf Y}(i,j).\tag{6}\end{equation*}
We solved the optimization using the 2D rectangle assignment algorithm [42] implemented in the SciPy library [43]. In the experiments, this optimization was applied to the proposed method and all comparison approaches, which contributed to the improved accuracy of all methods, including the conventional method.
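A minimal sketch of this step using SciPy's rectangular linear sum assignment solver (the score matrix below contains placeholder values):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def resolve_assignment(Y):
    """Turn per-IMU class scores into a one-to-one assignment (Eq. (6)).

    Y: (n_imus, n_segments) matrix of softmax scores stacked as in Eq. (5).
    """
    rows, cols = linear_sum_assignment(Y, maximize=True)   # rectangular assignment [42]
    return dict(zip(rows, cols))                            # IMU index -> segment index

# Example with placeholder scores for 3 IMUs and 3 segments.
Y = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.1, 0.8]])
print(resolve_assignment(Y))   # a greedy per-IMU argmax would assign two IMUs to segment 0
```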
Experimental Setup
A. Implementation Details
The left three blocks in Fig. 5 illustrate the architecture and hyperparameters of the proposed model. The architecture of each block is detailed in Sec. III. An algorithm based on the Tree-structured Parzen Estimator (TPE) was used to search for the hyperparameter values, such as the learning rate, the batch size, and the numbers of kernels and GRU nodes. We divided the dataset into training, validation, and test sets (see Appendix A for details); the validation set was used for parameter tuning, and the selected values are described in Appendix B. The parameters were fixed throughout all the experiments.
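As an illustration, TPE-based tuning can be run, for example, with Optuna's TPESampler (whether this particular library was used is not stated here; the search space and the `train_and_validate` stub below are placeholders, not the actual training routine):

```python
import optuna

def train_and_validate(lr, batch_size, gru_nodes):
    """Placeholder for the real training loop; here it just returns a dummy loss."""
    return abs(lr - 1e-3) + 0.001 * abs(batch_size - 64) + 0.0001 * gru_nodes

def objective(trial):
    """Illustrative search space; ranges and names are placeholders."""
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    gru_nodes = trial.suggest_categorical("gru_nodes", [64, 128, 256])
    return train_and_validate(lr, batch_size, gru_nodes)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```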
The architecture and the hyperparameters of the networks. The three blocks on the left show the proposed method, and the two on the right depict the implemented baselines.
B. Baselines
The assignment accuracy of the proposed model was compared with that of the conventional method [18], referred to as one-by-one, which applies a DNN to identify the IMU placement and infer the I2S orientation alignment of each IMU in a one-by-one manner. Since our work focuses on the I2S assignment, the branch layers for the alignment in one-by-one were pruned.
To validate the contributions of the feature aggregation module and the attention-based mechanism, we implemented two baseline methods. The two models, Global and Attention, are depicted as the right two blocks in Fig. 5. Global consists of IMU-wise feature extraction and global feature aggregation by the max-pooling layer; it is obtained by removing the attention learning module from the proposed architecture. In contrast, Attention handles the features extracted from each IMU to learn the dependency relationships without aggregating the IMU-wise features. The hyperparameters, the dataset division, and the coordinate frame of the input are consistent across the proposed, conventional, and baseline models in all the experiments.
C. Dataset
We quantitatively evaluate the performance of our approach on a synthetic and a real IMU dataset: CMU-MoCap [44] and TotalCapture [45]. The sensor arrangement for CMU-MoCap is shown in Fig. 6. Assuming that the proposed framework is utilized not only for full-body motion analysis but also for the measurement of individual body parts, we evaluated the model in lower-, upper-, and full-body configurations. The sensor placements are defined as follows:
lower body (7): lower back, l-femur, r-femur, l-tibia, r-tibia, l-foot, and r-foot
upper body (9): head, thorax, lower back, l-humerus, r-humerus, l-radius, r-radius, l-wrist, and r-wrist
full body (15): all segments in both the lower- and upper-body configurations (the lower back, which appears in both, is counted once)
CMU-MoCap is a public human motion dataset captured with a marker-based optical motion capture (MoCap) system [44]. We generated synthetic IMU data assuming that IMUs were attached to the segments of the bodies measured in CMU-MoCap. The generation algorithm is described in Appendix C. We selected the same scenes used in [18] (42 subjects performing different walking styles). The models were trained with IMU signals from 26 subjects in the training set and 7 subjects in the validation set, and they were tested with the remaining 9 subjects' data (detailed in Appendix A).
TotalCapture is a public dataset providing synchronized IMU data, HD videos, and ground-truth human poses measured by optical MoCap at 60 frames per second (fps) [45]. Since our approach uses only IMU signals for the I2S assignment, the real IMU data were utilized for the training and evaluation of the models. The number of IMUs is 13, and the sensor arrangement is the same as that for CMU-MoCap with the l-wrist and r-wrist sensors removed. TotalCapture contains five subjects performing a variety of motions. We selected the walking scenes and used three subjects' data for training, one subject's data for validation, and the rest for testing. The periods during which the subjects took a calibration pose (the first and last two seconds) or walked backward were manually removed from the dataset. TotalCapture is a challenging dataset in three respects. First, the number of subjects in the training data is small, which easily causes over-fitting. Second, it contains a variety of walking styles, including many twists and turns and slow and fast walking. Finally, the positions and angles of the sensors attached to the body vary slightly between subjects because TotalCapture is intended not for evaluating I2S assignment but for pose estimation. Through the experiments on TotalCapture, we evaluated the versatility of the proposed method.
The window size of the input IMU data was two seconds.
Results
A. Assignment Accuracy
The experimental results obtained using the setup described in Sec. IV are shown in Table 1. As seen in this table, the proposed method outperformed the other methods on both datasets for three configurations of sensor attachment, showing that I2S assignment training in the proposed approach yields better feature representations to discriminate the segment to which each IMU is attached.
The assignment results on the CMU-MoCap [44] are visualized using confusion matrices in Fig. 7. The matrices show that the assignment errors are caused by two main types of mistakes: left/right switch (l/r switch) and intra-limb misassignment (intra-misassignment). The l/r switch indicates an incorrect assignment to the opposite side of the actually attached segments (e.g., the IMU mounted on the l-wrist is classified into the r-wrist class). The intra-misassignment denotes that the IMU attached to a part of the limb is misclassified to another part of the same limb (e.g., the IMU mounted on the l-wrist is assigned to the l-radius or the l-humerus class). We highlighted some of the l/r switches and intra-misassignments in the confusion matrix at the lower left part of Fig. 7 with red and blue squares, respectively. The figure shows that the proposed method reduced both mistakes and significantly improved the assignment accuracy.
Some results on CMU-MoCap [44] in terms of confusion matrices. The left column represents the assignment results of the conventional work [18], and the right column represents the results of the proposed method. The red and blue rectangles on the lower-left confusion matrix highlight the left/right switches and intra-limb misassignments, respectively.
B. Ablation Studies
To analyze the contribution of each module in our model to mitigate the l/r switch and intra-misassignment problems, we visualized the confusion matrices of Global and Attention in Fig. 8 and computed the error rate caused by each mistake. On the CMU-MoCap dataset, the average l/r switch rates (the number of l/r switches divided by the total number of assignments) and the intra-misassignment rates for all three configurations were 2.2% and 5.5%, respectively, for Global and 3.1% and 2.4% for Attention. The lower l/r switch rates of Global and the lower intra-misassignment rate of Attention can be observed in the confusion matrices shown in Fig. 8 as well.
Comparison between the two baselines on the CMU-MoCap [44] in the full-body configuration. The major cause of incorrect assignment was intra-limb misassignments in (a) Global. On the other hand, (b) Attention suffered from left/right switches.
The assignment accuracy of the proposed method on the TotalCapture dataset [45] in terms of confusion matrices.
The results suggest that the global feature aggregation alleviates the l/r switch problem. This could be because the aggregation allows the network to model the motion of all body segments and capture the motion of each IMU relative to the global body motion, thus enabling the model to discriminate between left and right. The results also suggest that the attention module reduces intra-misassignment errors. This could be because the model with the attention learning architecture classifies the IMU data with consideration of the information from the relevant IMUs, such as IMUs attached to adjacent and opposite segments. For example, as can be seen in Fig. 10(b) (see Sec. II-C for an explanation of the figure), when assigning an IMU mounted on l-tibia, the self-attention architecture devotes much attention to l-femur, l-foot, and r-tibia. The assignment prediction relying on the IMUs on the segments in the same limb should prevent intra-misassignment.
C. Results on a Challenging Dataset
The results on the TotalCapture dataset [45], as presented in Table 1 and Fig. 9, revealed that the proposed approach is robust to different walking styles and to slight subject-dependent changes in the IMU positions. Although our model took a fixed-length window of data as input regardless of the walking speed, it achieved high accuracy in all the sensor configurations.
The accuracy in assigning the arm segments was lower than that of the other segments for two main reasons. One is that the test data contain arm movements not found in a normal gait in the training dataset, such as touching the head or face and raising clenched fists. The other is that the subject in the test set walked without moving his arms for a few seconds. The trained model could not distinguish between the IMU movements on the arms, head, and chest in that scene. Specifically, the mean assignment accuracy in the three seconds of the test scene in which the subject walked slowly without swinging his arms (from 55 to 58 seconds in S5-W2 in TotalCapture [45]) was 61.1% in the full-body setting.
Discussion
A. Attention Maps Visualization
An attention mechanism can be used to improve the explainability of deep learning models [35], [46], [47]. Explainability, in this context, refers to a better understanding for humans of why the models behave as they do. The explainability of a model helps users make decisions based on the model and allows researchers to understand what input and intermediate features affect the results of the model. The attention learning architecture used in our model can capture the pairwise relationships between the IMUs and explain what dependencies the predicted assignments rely highly on.
To visualize the dependencies between the IMUs, we calculated the mean attention matrix, i.e., the self-attention matrix (calculated by Eq. (4)) averaged over all the attention heads.
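A minimal sketch of such a visualization, using random placeholder attention weights and illustrative sensor labels:

```python
import numpy as np
import matplotlib.pyplot as plt

# attn: stacked self-attention matrices from Eq. (4), e.g. of shape
# (num_heads, n + 1, n + 1); the values below are random placeholders.
attn = np.random.rand(4, 8, 8)
mean_attn = attn.mean(axis=0)                       # average over the heads

labels = ["root", "imu1", "imu2", "imu3", "imu4", "imu5", "imu6", "imu7"]  # illustrative
plt.imshow(mean_attn, cmap="viridis")
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.colorbar(label="attention weight")
plt.title("Mean self-attention between IMUs")
plt.tight_layout()
plt.show()
```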
B. Root Segment Selection
The accuracy of the I2S assignment according to the root segment is shown in Fig. 11. The results suggest that the segments that stably and faithfully follow the body orientation (e.g., lower back, thorax, head, and femur) are suitable for the root. In contrast, when we chose the segments on the arms that have great freedom of movement during walking, the assignment accuracy decreased significantly.
Conclusion
We have presented an approach that identifies the segment on which each IMU is mounted by merging the features of all the body-worn IMUs and by learning the dependency relationships between the sensors. A pooling aggregator was incorporated to obtain a feature that represents the global motion of the body. In addition, a self-attention learning architecture was implemented to allow the model to perform an IMU assignment relying on the signals from the relevant IMUs. The proposed model was quantitatively evaluated on simulated and real IMU datasets, which validated our method, showing that it accurately and robustly performed the I2S assignment. Ablation studies suggested that the global feature fusion and attention mechanism reduced left/right switches and intra-limb misassignments.
Our I2S assignment framework assumes that the sensor configuration is known a priori and that one of the sensors is placed on the predetermined segment. These limitations do not significantly impair practicality; however, further studies to relax them are needed.
Appendix A: Division of the Dataset
The data in each dataset are divided into training, validation, and test sets. In this paper, both the synthetic dataset CMU-MoCap [44] and the real dataset TotalCapture [45] are divided into the three sets on the basis of the subject (i.e., all the trials (scenes) of a subject are put into one of the three sets). The specific division of each dataset is summarized in Table 2. For CMU-MoCap [44], we selected subjects who performed simple "walking" and had at least 600 frames in every scene for the test set.
Appendix B: Hyperparameters for Model Training
We adopted the hyperparameters described in this section. Fig. 5 also visualizes the architecture and parameters of the network.
In the IMU-wise feature extraction, three CNN layers with step-by-step changes in kernel size (the first with a kernel size and stride of three along the height, as described in Sec. III-D) are applied, followed by the GRU layer; the remaining kernel sizes and the numbers of kernels and GRU nodes are shown in Fig. 5.
Our network was implemented in TensorFlow [48] and trained for 1000 epochs with a batch size of 128. Early stopping with a patience of 400 was performed, and the model that achieved the lowest loss on the validation set was used for testing. RMSProp with a fixed learning rate of 0.001 was applied to optimize the model.
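A minimal sketch of this training setup in Keras, assuming `model`, `x_train`, `y_train`, `x_val`, and `y_val` have been prepared beforehand (the one-hot label encoding is an assumption):

```python
import tensorflow as tf

# `model` is the assembled assignment network; `x_train`, `y_train`, `x_val`, `y_val`
# are the preprocessed IMU windows and one-hot segment labels (all assumed here).
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=400,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=1000, batch_size=128, callbacks=[early_stop])
```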
Appendix C: Simulated Data Generation
The public human motion dataset CMU-MoCap [44] provides a large amount of 3D kinematic data measured using optical MoCap. We used the joint positions ${\mathbf p}_{WJ}^{t}$ and joint orientations ${\mathbf R}_{WJ}^{t}$ at time step $t$ to compute the position and orientation of a virtual IMU w.r.t. the world frame: \begin{align*} {\mathbf p}_{WS}^{t}=&{\mathbf p}_{WJ}^{t} + {\mathbf R}_{WJ}^{t}{\mathbf t}_{JS} \tag{7}\\ {\mathbf R}_{WS}^{t}=&{\mathbf R}_{WJ}^{t}{\mathbf R}_{JS},\tag{8}\end{align*}
where ${\mathbf t}_{JS}$ and ${\mathbf R}_{JS}$ denote the fixed translation and rotation from the joint to the sensor, respectively.
The angular velocity of the IMU is derived from the temporal difference of its orientation, and the acceleration w.r.t. the world frame is obtained by the second-order finite difference of the sensor position: \begin{equation*} {\mathbf a}_{WS}^{t} = \frac {{\mathbf p}_{WS}^{(t+1)} -2 {\mathbf p}_{WS}^{t} + {\mathbf p}_{WS}^{(t-1)}}{\Delta t^{2}},\tag{9}\end{equation*}
where $\Delta t$ denotes the sampling interval.
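A minimal sketch of this generation step (the gravity handling is an assumption about the accelerometer convention, the angular-velocity computation is omitted for brevity, and variable names are illustrative):

```python
import numpy as np

def simulate_imu(p_WJ, R_WJ, t_JS, R_JS, dt, gravity=np.array([0.0, 0.0, 9.81])):
    """Generate virtual IMU signals from MoCap joint trajectories (Eqs. (7)-(9)).

    p_WJ : (T, 3) joint positions, R_WJ : (T, 3, 3) joint orientations,
    t_JS, R_JS : fixed joint-to-sensor offset and rotation, dt : sampling interval.
    """
    p_WS = p_WJ + np.einsum("tij,j->ti", R_WJ, t_JS)   # Eq. (7)
    R_WS = R_WJ @ R_JS                                  # Eq. (8)

    # Second-order finite difference of the sensor position (Eq. (9)).
    a_WS = (p_WS[2:] - 2.0 * p_WS[1:-1] + p_WS[:-2]) / dt ** 2

    # Express acceleration (plus assumed gravity term) in the sensor-local frame.
    a_local = np.einsum("tij,ti->tj", R_WS[1:-1], a_WS + gravity)
    return p_WS, R_WS, a_local
```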