Introduction
Massage is a common way to physically stimulate the human body; it can accelerate tissue metabolism, promote wound repair, and relieve fatigue. However, the number of experienced physical therapists is limited, and the physical exertion of massage makes it difficult for them to apply force consistently; consequently, there is an urgent demand for massage robots. Accurate force application is a difficult problem in both theoretical research and practical applications of massage robots [1]. Too light a force cannot achieve the desired therapeutic effect, while too heavy a force may cause tissue damage. Ideally, the massage recipient feels a fully bearable soreness without becoming flustered, dizzy, or nauseated, and the overall massage technique must be firm, powerful, and uniform [2]. Therefore, the robot's force application strategy has an important impact on the overall massage effect.
To keep a robot’s massage force within a set range, many researchers have used compliance control to adjust the state of the robot’s contact with the skin. For example, Song et al. [3] explored the influence of different impedance control parameters on the massage force exerted by a robot on a human back; Zhai et al. [4] proposed a robot force controller based on self-tuning impedance control to adapt to the uncertain skin environment; Tu et al. [5] designed a contact force transition strategy based on an active disturbance rejection controller, which addresses the problem of excessive instantaneous contact force; Maqsood et al. [6] used an adaptive impedance controller to compensate for force tracking errors and prevent the robot from deviating from the reference trajectory; Stephens et al. [7] proposed an impedance controller with an adaptive law for the environmental parameters, enabling interaction with a soft environment; Luo and Hsieh [8] studied the tapping action of robot massage with impedance control and verified the feasibility of robot tapping massage therapy through experiments; Dong et al. [9] proposed a hybrid force/position control algorithm for a parallel massage robot with compliant joints, which can realize pressing, rolling, pushing, and other actions; Sheng et al. [10] proposed an adjustable variable impedance control to deal with the nonlinearity and dynamic stiffness of soft tissues and achieve ideal force tracking; Li et al. [11] proposed a variable impedance control algorithm based on a Gaussian process model to enhance the flexibility and adaptability of the system. Although the abovementioned controllers achieve a good massage force within a certain range, they still face challenges when massaging different individuals. Because of anatomical and physiological differences, skin stiffness varies across body regions and between individuals, which leads to continuous changes in the parameters of the skin mechanics model [12]. Surface characteristics of human skin, such as dryness and humidity, also vary between individuals [13] and may affect the robot's force control. Therefore, static modeling errors of the skin and external disturbances in the contact environment mean that traditional controllers inevitably exhibit tracking errors. To ensure the comfort of robot massage, the robot force controller requires not only precision but also a certain degree of versatility.
Reinforcement learning (RL) can discover optimal behavior autonomously, reduce the complexity of manual tuning, and thus achieve better control results [14], [15], [16]. Compared with traditional control algorithms, reinforcement learning has clear advantages in robot control and has achieved good results in robot force control scenarios. For example, Roveda et al. [17] proposed a predictive variable impedance control based on Q-learning; Peng et al. [18], Perrusquía et al. [19], and Luo et al. [20] used a linear quadratic regulator to obtain the expected force; Zhao et al. [21] used an actor-critic algorithm and Bogdanovic et al. [22] used a deep deterministic policy gradient algorithm to optimize impedance parameters. In these tasks, RL from scratch remains data-inefficient or intractable, whereas learning a residual on top of an initial controller can yield substantial improvements. Therefore, Johannink et al. [23] proposed a residual reinforcement learning framework in which the RL agent learns an additional residual policy that modifies the behavior of a predefined hand-crafted controller to simplify exploration; Rana et al. [24] proposed residual skill policies, which achieve effective skill reuse for adaptation and efficient learning; Xie et al. [25] and Silver et al. [26] made substantial improvements over the initial controller. However, in robot massage scenarios, how to combine the two control methods to achieve safe exploration, efficient sample initialization, and fast learning remains an unsolved problem.
Compared to traditional control for robot massage, the main contributions of this work are as follows:
A robot massage force controller based on residual reinforcement learning is proposed, which combines a traditional robot force controller and reinforcement learning algorithm;
An environmental dynamics model for reinforcement learning is constructed to simulate the contact process between the robot and the skin, which accelerates the search for the residual compensation strategy;
The number of real interactions required by reinforcement learning is reduced, which at the same time improves the practicability of the algorithm.
The remainder of this paper is organized as follows: the second part details the initial strategy of the robot massage process; the third part presents a robot massage force control algorithm based on residual reinforcement learning that compensates for the residual term of impedance control; and the fourth and fifth parts describe the experimental platform and the experiments conducted to verify the feasibility of the algorithm, respectively.
Initial Strategy Via Impedance Control
In a robot massage scenario, the robot’s end effector is equipped with a massage head. When the robot imitates the rubbing technique of a masseur, the massage head is in contact with the skin and moves along a set trajectory, and the massage force is determined by the reference force. To ensure that the robot follows the reference force, a force controller is required to adjust the contact state of the robot. Impedance control simplifies the contact between the robot and the human into a linear second-order model with inertia, damping, and stiffness characteristics; the robot displacement is adjusted according to the difference between the measured end force and the reference force. The characteristics of the model can be tuned by changing the inertia, damping, or stiffness parameters [27], so impedance control is well suited to robot massage. In the Cartesian coordinate system, the analysis is performed only along the normal direction of the contact between the robot and the skin, where the robot position and contact force satisfy [28]:\begin{equation*} m_{d} \Delta \ddot {x}+b_{d} \Delta \dot {x}+k_{d} \Delta x=f_{r} -f_{e}, \tag{1}\end{equation*} where m_d, b_d, and k_d are the desired inertia, damping, and stiffness parameters, Δx is the position correction of the massage head along the normal direction, f_r is the reference force, and f_e is the measured contact force.
In an actual system, the following difference equations can be used to approximate the velocity and acceleration changes [3]:\begin{align*} \Delta \dot {x}(k)&=\frac {\Delta x(k)-\Delta x(k-1)}{T_{s}}, \tag{2}\\ \Delta \ddot {x}(k)&=\frac {\Delta \dot {x}(k)-\Delta \dot {x}(k-1)}{T_{s}}, \tag{3}\end{align*} where T_s is the sampling period.
Substituting (2) and (3) into (1) and solving for the position correction at step k gives \begin{equation*} \Delta x(k)=\frac {e T_{s}^{2}+b_{d} T_{s} \Delta x(k-1)+m_{d} (2\Delta x(k-1)-\Delta x(k-2))}{m_{d} +b_{d} T_{s} +k_{d} T_{s}^{2}}, \tag{4}\end{equation*} where e = f_r − f_e is the force tracking error.
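For concreteness, the discrete impedance update of (2)-(4) can be implemented in a few lines of code. The following minimal sketch is illustrative (the class name and interface are not part of the controller implementation); only the recursion reproduces (4):

```python
# Minimal sketch of the discrete impedance update in Eqs. (2)-(4).
class DiscreteImpedance:
    def __init__(self, m_d, b_d, k_d, Ts):
        self.m_d, self.b_d, self.k_d, self.Ts = m_d, b_d, k_d, Ts
        self.dx_prev = 0.0   # Δx(k-1)
        self.dx_prev2 = 0.0  # Δx(k-2)

    def step(self, f_r, f_e):
        """Return the position correction Δx(k) for the force error e = f_r - f_e."""
        e = f_r - f_e
        num = (e * self.Ts ** 2
               + self.b_d * self.Ts * self.dx_prev
               + self.m_d * (2.0 * self.dx_prev - self.dx_prev2))
        den = self.m_d + self.b_d * self.Ts + self.k_d * self.Ts ** 2
        dx = num / den
        self.dx_prev2, self.dx_prev = self.dx_prev, dx
        return dx
```

With m_d, b_d, and k_d tuned for the massage scenario, step() would be called once per control cycle (for example, T_s = 0.02 s if the controller runs at the 50 Hz communication rate reported in the experimental setup).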
Construction of Residual Reinforcement Learning
Skin texture varies with age, skin quality, body mass, and other factors; even in the same individual, different skin areas exhibit different mechanical properties because of the differing distribution of bone, muscle, and fat [29]. Although the impedance control parameters for a robot massage scenario are easy to set, the appropriate values fluctuate across individuals and skin regions. Residual reinforcement learning can start from a good but imperfect force controller and improve accuracy by compensating for the residual error of traditional impedance control.
The flow chart of robot force control based on residual reinforcement learning is shown in Figure 1. The total offset displacement applied when the robot contacts human skin is \begin{equation*} {u}'=\Delta x+f(u), \tag{5}\end{equation*} where Δx is the offset produced by the impedance controller and f(u) is the residual compensation derived from the reinforcement learning action u.
Manually tuning the residual term would be cumbersome, but reinforcement learning can find the optimal policy autonomously. At the same time, the traditional force controller accelerates reinforcement learning by limiting the search range. Combining the two approaches therefore improves performance, so the residual policy is built using reinforcement learning.
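The structure of (5) amounts to summing the outputs of the two controllers. A minimal sketch, with both callables as placeholders:

```python
# Minimal sketch of the residual structure in Eq. (5): the impedance offset and
# the learned residual are simply summed; both callables are placeholders.
def combined_offset(impedance_offset, residual_policy, state):
    delta_x = impedance_offset(state)   # initial strategy (impedance control)
    residual = residual_policy(state)   # compensation learned by the RL agent
    return delta_x + residual           # total offset displacement u'
```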
A. Reinforcement Learning to Construct the Residual Policy
Reinforcement learning is based on the Markov decision process, which consists of a tuple (S, A, P, r, γ), where S is the state space, A is the action space, P is the state transition probability, r is the reward function, and γ is the discount factor.
During the robot massage process, the environment state characterizes the contact between the massage head and the skin, and the reward at each time step penalizes the force tracking error:\begin{equation*} r_{t} =-k_{r} \ast \left |{ {e_{t}} }\right |, \tag{6}\end{equation*}
where k_r is a positive weighting coefficient and e_t is the error between the measured contact force and the reference force. The discounted return from time step t is \begin{equation*} R_{t} =\sum \nolimits _{i=t}^{T} {\gamma ^{(i-t)}} r_{i}, \tag{7}\end{equation*} where γ ∈ (0, 1] is the discount factor and T is the length of the episode.
The ultimate goal of the agent is to maximize the expected return, so the objective function can be set as\begin{equation*} J(\theta)=E_{\tau \sim \rho _{\theta } (\tau)} [R(\tau)], \tag{8}\end{equation*} where τ denotes a trajectory generated by the policy with parameters θ and ρ_θ(τ) is the corresponding trajectory distribution.
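As a concrete illustration of (6) and (7), the reward and return can be computed as follows (the values of k_r and γ here are placeholders, not the settings used in the experiments):

```python
# Sketch of the reward in Eq. (6) and the discounted return in Eq. (7).
def reward(force_error, k_r=1.0):
    return -k_r * abs(force_error)

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_i gamma**(i - t) * r_i over the remaining steps of the episode."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```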
Since the displacement of the robot is a continuous variable, the output of the deep deterministic policy gradient (DDPG) algorithm is not a probability over actions but a specific action [31], so it can be used for continuous action prediction. Therefore, DDPG is used to construct the robot's residual policy. DDPG is based on an actor-critic (AC) framework: the actor is a policy network that outputs an action, while the critic is a value network that estimates the expected return of each state-action pair [32]. The training goal of the actor is to maximize the Q-value output by the critic, because this Q-value reflects the expected benefit of taking this action under the current policy.
The training goal of the critic network is to minimize the error between the predicted return and the actual return, i.e., to minimize the TD error. The critic guides the update of the network parameters by comparing the expected return with the actual return of taking an action in the current state. Since DDPG adopts a deterministic policy, the robot's offset displacement can be selected directly; the value network does not control the agent but only scores the compensation displacement. To encourage exploration, the action is chosen by adding noise to the actor output:\begin{equation*} u=\mu (s_{t} \vert \theta ^{\mu })+{\mathcal{ N}}_{t}, \tag{9}\end{equation*} where μ(·|θ^μ) is the actor network with parameters θ^μ and N_t is exploration noise.
The critic parameters θ^Q are updated by minimizing the loss over a mini-batch of K transitions:\begin{equation*} L=\frac {1}{K}\sum \limits _{i} {(y_{i} -Q(s_{i},u_{i} \vert \theta ^{Q}))} ^{2}, \tag{10}\end{equation*} where y_i is the target value computed from the immediate reward and the outputs of the target networks.
The actor parameters θ^μ are updated along the deterministic policy gradient of the objective:\begin{equation*} \nabla _{\theta ^{\mu }} J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{u} Q(s,u\vert \theta ^{Q})} \vert _{s=s_{i},u=\mu (s_{i})} \nabla _{\theta ^{\mu }} \mu (s\vert \theta ^{\mu })\vert _{s_{i}}. \tag{11}\end{equation*}
Finally, the target networks are updated: the parameters of the actor and critic networks are copied with a soft update, so that the target parameters change only slightly at each step [34]:\begin{align*} \theta ^{Q'}&\leftarrow \alpha \theta ^{Q}+(1-\alpha)\theta ^{Q'} \\ \theta ^{\mu '}&\leftarrow \alpha \theta ^{\mu }+(1-\alpha)\theta ^{\mu '}, \tag{12}\end{align*} where α ≪ 1 is the soft update rate.
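The updates in (10)-(12) correspond to one DDPG training step. The sketch below is an illustrative PyTorch implementation; the actor, critic, target networks, and optimizers are assumed to be standard torch modules built elsewhere, and the critic is assumed to take a state-action pair:

```python
import torch
import torch.nn.functional as F

# One DDPG update step corresponding to Eqs. (10)-(12).
def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, alpha=0.01):
    s, u, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic: minimize the TD error of Eq. (10).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, u), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient of Eq. (11).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, Eq. (12).
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - alpha).add_(alpha * p.data)
```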
B. Dynamic Model for Residual Reinforcement Learning
The reinforcement learning agent improves its policy through trial and error, so multiple experiments are usually required during actual interaction to achieve the desired result [35]. A repeated trial-and-error process on a person would not only cause discomfort but could also lead to pain and skin damage, so rapid convergence is very important in robot massage scenarios. DDPG is a model-free algorithm based on the AC framework and requires a considerable number of experiments to gather enough data. To accelerate convergence in the actual massage process, a dynamics model of the environment is therefore constructed for reinforcement learning. This enables DDPG to be trained iteratively in the constructed virtual environment, reducing the number of real training interactions and improving the practicability of the algorithm [36], [37], [38].
When the robot is in state s_t and takes action u_t, the next contact state is given by the transition function\begin{equation*} s_{t+1} =\varphi (s_{t},u_{t}), \tag{13}\end{equation*} where φ denotes the state transition of the contact environment, which is approximated here by a backpropagation (BP) neural network trained on the recorded interaction data.
The BP neural network has a single hidden layer. The output of hidden node j is\begin{equation*} z_{j} =\phi \left({\sum \limits _{i=1}^{n} {w_{ij} y_{i} +b^{1}_{j}} }\right), \tag{14}\end{equation*} where y_i are the network inputs, w_ij are the input-to-hidden weights, and b^1_j is the bias of hidden node j.
The activation φ(·) is the sigmoid function\begin{equation*} \phi (a)=\frac {1}{1+e^{-a}}, \tag{15}\end{equation*}
and the output of output node h is a linear combination of the hidden-layer outputs:\begin{equation*} O_{h} =\sum \limits _{j=1}^{n} {w_{jh} z_{j}} +b^{2}_{h}, \tag{16}\end{equation*} where w_jh are the hidden-to-output weights and b^2_h is the bias of output node h. The network output predicts the next contact state s_{t+1}.
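A minimal sketch of this single-hidden-layer dynamics model is given below; the layer width, initialization, and the omitted training step (standard backpropagation on the recorded contact data) are assumptions for illustration:

```python
import numpy as np

# Sketch of the single-hidden-layer BP dynamics model of Eqs. (13)-(16).
class BPDynamicsModel:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # w_ij
        self.b1 = np.zeros(n_hidden)                              # b^1_j
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))   # w_jh
        self.b2 = np.zeros(n_out)                                 # b^2_h

    def predict(self, state, action):
        """Approximate the next contact state s_{t+1} = phi(s_t, u_t), Eq. (13)."""
        y = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
        z = 1.0 / (1.0 + np.exp(-(y @ self.W1 + self.b1)))  # sigmoid hidden layer, Eqs. (14)-(15)
        return z @ self.W2 + self.b2                          # linear output layer, Eq. (16)
```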
C. Residual Policy Fusion
The policy obtained via reinforcement learning is not a smooth curve, and an unsmooth compensation signal may cause the robot to shake during the movement. Therefore, in actual robot massage scenarios, the residual policy is smoothed to improve comfort: a mean filter is applied to the output offset displacement of the residual reinforcement learning framework, so the robot offset displacement becomes\begin{equation*} {u}'_{t} =\sum \limits _{i=0}^{d-1} {u'_{t-i} /d}, \tag{17}\end{equation*} where d is the length of the filter window.
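In code, the mean filter of (17) is a sliding window over the last d residual outputs; the sketch below is illustrative, and the choice of d is a tuning decision made in the experiments:

```python
from collections import deque

# Sketch of the mean filter in Eq. (17) used to smooth the residual output u'_t.
class MeanFilter:
    def __init__(self, d):
        self.window = deque(maxlen=d)   # holds the last d residual outputs

    def smooth(self, u):
        """Return the average of the most recent residual outputs."""
        self.window.append(u)
        return sum(self.window) / len(self.window)
```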
The pseudocode of the algorithm is as follows:
Algorithm 1 Residual Reinforcement Learning for Robot Massage
Require: empirical data from the impedance control strategy, i.e., the robot contact forces and displacements recorded while massaging with the chosen impedance parameters;
Train a BP neural network as the dynamic model of the robot's contact state;
for each training episode do
Initialize a random process N for action exploration;
Set the initial contact state of the residual policy;
for each time step do
Select the policy action with exploration noise according to (9);
Execute the action in the BP neural network model of the robot massage contact;
Obtain the next state and the reward (6);
Store the transition (state, action, reward, next state) in the replay buffer;
Sample a mini-batch from the replay buffer and update the actor and critic networks using (10)-(12);
end for
end for
In the actual experiment, execute the fused robot massage residual strategy;
Use formula (17) to smooth the residual strategy output.
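A compact Python rendering of this offline loop, reusing the earlier sketches (BPDynamicsModel, reward, ddpg_update), is given below; the agent object, its replay buffer, and the helpers initial_state() and force_error() are illustrative placeholders:

```python
import numpy as np

# Sketch of the offline training loop of Algorithm 1: all rollouts happen in the
# BP dynamics model, so no real contact with the skin is needed during training.
def train_offline(agent, model, episodes, steps, noise_scale=0.1):
    for _ in range(episodes):
        s = initial_state()                                      # recorded contact state
        for _ in range(steps):
            u = agent.act(s) + noise_scale * np.random.randn()   # Eq. (9)
            s_next = model.predict(s, u)                         # virtual rollout
            r = reward(force_error(s_next))                      # Eq. (6)
            agent.buffer.add((s, u, r, s_next))
            if len(agent.buffer) > agent.batch_size:
                ddpg_update(agent.buffer.sample(agent.batch_size),
                            agent.actor, agent.critic,
                            agent.target_actor, agent.target_critic,
                            agent.actor_opt, agent.critic_opt)   # Eqs. (10)-(12)
            s = s_next
```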
Experimental Setup
The massage robot is a UR robot with a massage head installed at its end. A force sensor is mounted between the massage head and the robot, and a Beckhoff module collects the force signal and transmits it to the host computer. At the same time, the control box transmits the robot's position and velocity signals to the host computer. The host computer and the control box communicate at a frequency of 50 Hz. A schematic diagram of the experiment is shown in Figure 2. The robot presses the skin vertically along the Z-direction at a speed of 2 mm/s; when the contact force reaches the reference value f_r, the rubbing massage begins.
The residual reinforcement learning experimental process is shown in Figure 3. The initial strategy is used to obtain the robot contact states and displacements in the Z-direction. A BP neural network is then used to construct the state transition model: its inputs are the robot contact state and the offset displacement, and its output is the contact state at the next moment.
Experimental Results and Analysis of Robot Massage
The robot exerts force on the skin surface by rubbing. To ensure the safety of the massage recipient, a gentle force application strategy is adopted, and the robot's reference force is set to 5 N, i.e., f_r = 5 N.
In the initial strategy, the impedance control parameters are manually adjusted to
Massage results comparison between the initial strategy, model-based reinforcement learning algorithm and residual reinforcement learning (volunteer A).
After executing the initial strategy, the difference between the robot force and the reference force
In the DDPG network, the input of the actor network and the target actor network is 2-dimensional, the output is 1-dimensional, and the number of hidden nodes is set to 30. The input of the critic network and the target critic network is 3-dimensional (the 2-dimensional state concatenated with the 1-dimensional action), and the output is the 1-dimensional Q-value.
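Under these stated dimensions, the networks could be sketched as follows; the activation functions are assumptions, since they are not reported in the text:

```python
import torch
import torch.nn as nn

# Sketch of actor/critic architectures with the stated sizes:
# 2-D state input, 1-D action output, 30 hidden nodes, 3-D critic input.
actor = nn.Sequential(
    nn.Linear(2, 30), nn.ReLU(),
    nn.Linear(30, 1), nn.Tanh(),   # bounded offset-displacement output
)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 30), nn.ReLU(), nn.Linear(30, 1))

    def forward(self, s, u):
        # 3-D input: 2-D contact state concatenated with the 1-D action.
        return self.net(torch.cat([s, u], dim=-1))

critic = Critic()
```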
The iteration process of the DDPG algorithm under the BP neural network dynamics model is shown in Figure 5. At the beginning of training, the DDPG networks are not yet sufficiently trained, so the policy obtained is not effective and the return value of reinforcement learning stays below −9100. After approximately 80 iterations, the return value improves rapidly to approximately −8950, the learned policy is clearly better than in the initial stage, and the steady final return value shows that the algorithm converges. The residual strategy obtained offline is then fused with the initial strategy, and the filter window of (17) is used to smooth the output.
The robot offset displacements under the initial strategy and the residual reinforcement learning algorithm are compared in Figure 6. Under the initial strategy, the offset displacement is small; when the environment changes suddenly, the offset strategy is not sufficient for the contact force to reach the reference force quickly. Compared with the initial strategy, the residual reinforcement learning strategy responds quickly to changes in the external environment by rapidly increasing or decreasing the robot's offset displacement. The residual reinforcement learning algorithm handles such problems well because it is trained effectively in the dynamics model, which yields an efficient strategy. During the first 80 iterations in Figure 5, the reinforcement learning algorithm collects environmental information by exploring the environment and trying different actions. When faced with an uncertain contact environment, the algorithm tends to try a variety of actions, although the reward values of these actions may not be ideal. At the same time, the algorithm updates the neural networks in DDPG to better reflect the actual situation. As the number of iterations increases, the DDPG networks continue to improve and output more accurate strategies. In Figure 6, this is reflected in the more rapid response of the robot's offset displacement strategy, which quickly increases or decreases the offset displacement and thereby allows the robot to adapt to changes in the external environment. Because the residual reinforcement learning algorithm improves on the initial strategy, the output offset displacement fluctuates around the initial strategy, which improves the force application effect while also ensuring the safety of the contact.
Massage results comparison between initial policy and residual reinforcement learning (volunteer A).
To verify the generality of the algorithm, the residual reinforcement learning algorithm is used to massage a different person's arm, with the same parameter settings as in the first experiment. The comparison of the initial strategy and residual reinforcement learning is shown in Figure 7, and the result is similar to that for volunteer A. Under the initial strategy, the massage force measured by the robot also fluctuates. After the reinforcement learning strategy is obtained offline, the massage force obtained by residual reinforcement learning is noticeably smoother than that obtained by the initial strategy, the error with respect to the reference force is stable within a certain range, and the control effect is significantly improved. The return value, which represents the total cumulative reward obtained by the agent in one robot massage, is shown in Figure 8 and converges after approximately 60 iterations. The robot offset displacements under the initial strategy and the residual reinforcement learning algorithm are compared in Figure 9: under residual reinforcement learning, the offset displacement again fluctuates around the initial policy's offset displacement, and a good control effect is obtained on a different volunteer's arm.
Massage results comparison between the initial strategy, model-based reinforcement learning algorithm and residual reinforcement learning (volunteer B).
Robot offset displacement comparison between the initial strategy and the residual reinforcement learning algorithm (volunteer B).
In the experiments on different volunteers' arms, the residual reinforcement learning policy is trained in the BP neural network model without interacting with the actual environment, which reduces the cost of using the algorithm; only two real interactions are needed across the two massage experiments. Compared with a traditional reinforcement learning algorithm, residual reinforcement learning obtains good control parameters quickly. The error comparison between the initial policy and the residual reinforcement learning algorithm is shown in Table 1: the force errors of residual reinforcement learning, measured by the maximum absolute force error, the average absolute force error, and the mean square force error, are all clearly smaller than those of the initial strategy.
The algorithm chosen for comparison is a model-based reinforcement learning algorithm; the force it obtains is shown as the black dotted line in Figure 4 and Figure 7. In the experiment on volunteer A in Figure 4, both reinforcement learning algorithms achieve good results. However, in the second half of the force tracking on volunteer B in Figure 7, the force signal of the model-based reinforcement learning algorithm clearly exceeds the threshold, whereas the residual reinforcement learning algorithm is more stable and shows better versatility. Because impedance control provides better initial search conditions, it gives residual reinforcement learning a better initial strategy, and residual reinforcement learning needs to repeat the experiment only once to obtain good results. At the same time, the impedance control experience narrows the search range of residual reinforcement learning and improves its search efficiency. Compared with the model-based reinforcement learning algorithm, the strategy obtained by residual reinforcement learning is better, with a smaller absolute force error.
Conclusion and Future Work
In the process of contact between the robot and the skin, the initial strategy for the robot's massage force is established using impedance control. Because impedance control has difficulty adapting to changes in the skin environment, the control process is learned via reinforcement learning, which compensates for the residual term of the controller. To reduce the number of online interactions when reinforcement learning is actually used, a neural network is employed to construct a dynamic model of the robot's contact with the environment from the relationship between the robot's displacement and the resulting contact state, and the learned model is used to train the residual strategy offline. To fuse the residual strategy with the initial strategy, the robot offset displacement is smoothed with a mean filter.
Experiments were carried out on different volunteers' arms. The results show that the robot massage force algorithm based on residual reinforcement learning converges quickly: after approximately 80 offline iterations, a displacement compensation policy can be selected. When the initial strategy and the learned policy are combined, the robot massage force quickly converges to the reference force, and the force error is stable within ±0.2 N. Compared with the initial strategy, the maximum absolute force error, the average absolute force error, and the mean square force error are all significantly reduced by residual reinforcement learning, and the average error is reduced by 82.3% and 75.4% for the two volunteers, respectively. Compared with the model-based reinforcement learning algorithm, the average absolute error is reduced by 27.7% and 12.5%, which further demonstrates the stability of the algorithm.
In the current work, the robot massage force control algorithm needs one offline iteration. In future work, we will simplify the DDPG model so that the reinforcement learning algorithm can learn online and the force control strategy can be iterated and applied in real time. We also plan to explore a wider variety of reinforcement learning algorithms. In addition, we will further study how to personalize the massage intensity according to the needs and preferences of different users.