Introduction
Massage is a common way to physically stimulate the human body; it can accelerate tissue metabolism, promote wound repair, and relieve fatigue. However, the number of experienced physical therapists is limited, and the physical exertion of massage makes it difficult for them to apply force consistently; consequently, there is an urgent demand for massage robots. Accurate force application is a difficult problem in both theoretical research and practical applications of massage robots [1]. Too light a force cannot achieve the desired therapeutic effect, while too heavy a force may cause tissue damage. Ideally, the massage recipient feels a fully bearable soreness without becoming flustered, dizzy, or nauseated, and the overall massage technique must be firm, powerful, and uniform [2]. Therefore, the robot's force application strategy has an important impact on the overall massage effect.
To keep a robot’s massage force within a set range, many researchers have used compliance control to adjust the state of the robot’s contact with the skin. For example, Song et al. [3] explored the influence of different impedance control parameters on the massage force exerted by a robot on a human back; Zhai et al. [4] proposed a robot force controller based on self-tuning impedance control to adapt to the uncertain skin environment; Tu et al. [5] designed a contact force transition strategy based on an active disturbance rejection controller, which addresses the problem of excessive instantaneous contact force; Maqsood et al. [6] used an adaptive impedance controller to compensate for force tracking errors and prevent the robot from deviating from the reference trajectory; Stephens et al. [7] proposed an impedance controller with an adaptive law for the environmental parameters, enabling interaction with a soft environment; Luo and Hsieh [8] studied the tapping action of robot massage with impedance control and verified the feasibility of robot tapping massage therapy through experiments; Dong et al. [9] proposed a hybrid force/position control algorithm for a parallel massage robot with compliant joints, which can realize pressing, rolling, pushing, and other actions; Sheng et al. [10] proposed an adjustable variable impedance control to deal with the nonlinearity and dynamic stiffness of soft tissues and achieve ideal force tracking; Li et al. [11] proposed a variable impedance control algorithm based on a Gaussian process model to enhance the flexibility and adaptability of the system. Although the abovementioned controllers achieve a good massage force within a certain range, they still face challenges when massaging different individuals. Because of anatomical and physiological differences, skin stiffness varies across body regions and between individuals, which leads to continuous changes in the parameters of the skin mechanics model [12]. Surface characteristics of human skin, such as dryness and humidity, also vary between individuals [13] and may affect the robot's force control. Therefore, static modeling errors of the skin and external disturbances in the contact environment mean that traditional controllers inevitably exhibit tracking errors. To ensure the comfort of robot massage, the robot force controller requires not only precision but also a certain degree of versatility.
Reinforcement learning (RL) can discover optimal behavior autonomously, reduce the complexity of manual tuning, and thus achieve better control results [14], [15], [16]. Compared with traditional control algorithms, reinforcement learning has clear advantages in robot control and has achieved good results in robot force control scenarios. For example, Roveda et al. [17] proposed a predictive variable impedance control based on Q-learning; Peng et al. [18], Perrusquía et al. [19], and Luo et al. [20] used a linear quadratic regulator to obtain the expected force; Zhao et al. [21] used an actor-critic algorithm and Bogdanovic et al. [22] used a deep deterministic policy gradient algorithm to optimize impedance parameters. In these tasks, RL from scratch remains data-inefficient or intractable, whereas learning a residual on top of an initial controller can yield substantial improvements. Therefore, Johannink et al. [23] proposed a residual reinforcement learning framework in which the RL agent learns an additional residual policy that modifies the behavior of a predefined hand-crafted controller to simplify exploration; Rana et al. [24] proposed residual skill policies, which achieve effective skill reuse for adaptation and efficient learning; Xie et al. [25] and Silver et al. [26] made substantial improvements over the initial controller. However, in robot massage scenarios, how to combine the two control methods to achieve safe exploration, efficient sample initialization, and fast learning remains an unsolved problem.
Compared to traditional control for robot massage, the main contributions of this work are as follows:
A robot massage force controller based on residual reinforcement learning is proposed, which combines a traditional robot force controller and reinforcement learning algorithm;
An environmental dynamics model for reinforcement learning is constructed to simulate the contact process between the robot and the skin, which accelerates the search for the residual compensation strategy;
The number of real interactions required by reinforcement learning is reduced, which at the same time improves the practicability of the algorithm.
The remainder of this paper is organized as follows: the second part details the initial strategy of the robot massage process; the third part presents a robot massage force control algorithm based on residual reinforcement learning that compensates for the residual term of impedance control; and the fourth and fifth parts describe the experimental platform and the experiments conducted to verify the feasibility of the algorithm, respectively.
Initial Strategy Via Impedance Control
In a robot massage scenario, the robot’s end effector is equipped with a massage head. When the robot imitates the rubbing technique of a masseur, the massage head is in contact with the skin and moves along a set trajectory, and the massage force is determined by the reference force. To ensure that the robot follows the reference force, a force controller is required to adjust the contact state of the robot. Impedance control simplifies the contact between the robot and the human into a linear second-order model with inertia, damping, and stiffness characteristics; the robot displacement is adjusted according to the difference between the measured end force and the reference force. The characteristics of the model can be tuned by changing the inertia, damping, or stiffness parameters [27], so impedance control is well suited to robot massage. In the Cartesian coordinate system, the analysis is performed only along the normal direction of the contact between the robot and the skin, where the robot position and contact force satisfy [28]:\begin{equation*} m_{d} \Delta \ddot {x}+b_{d} \Delta \dot {x}+k_{d} \Delta x=f_{r} -f_{e}, \tag{1}\end{equation*} where m_d, b_d, and k_d are the desired inertia, damping, and stiffness parameters, Δx is the position correction of the massage head along the normal direction, f_r is the reference force, and f_e is the measured contact force.
In an actual system, the following difference equations can be used to approximate the velocity and acceleration changes [3]:\begin{align*} \Delta \dot {x}(k)&=\frac {\Delta x(k)-\Delta x(k-1)}{T_{s}}, \tag{2}\\ \Delta \ddot {x}(k)&=\frac {\Delta \dot {x}(k)-\Delta \dot {x}(k-1)}{T_{s}}, \tag{3}\end{align*} where T_s is the sampling period.
Substituting (2) and (3) into (1) and solving for the position correction at step k gives \begin{equation*} \Delta x(k)=\frac {e T_{s}^{2}+b_{d} T_{s} \Delta x(k-1)+m_{d} (2\Delta x(k-1)-\Delta x(k-2))}{m_{d} +b_{d} T_{s} +k_{d} T_{s}^{2}}, \tag{4}\end{equation*} where e = f_r − f_e is the force tracking error.
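For concreteness, the discrete impedance update of (2)-(4) can be implemented in a few lines of code. The following minimal sketch is illustrative (the class name and interface are not part of the controller implementation); only the recursion reproduces (4):

```python
# Minimal sketch of the discrete impedance update in Eqs. (2)-(4).
class DiscreteImpedance:
    def __init__(self, m_d, b_d, k_d, Ts):
        self.m_d, self.b_d, self.k_d, self.Ts = m_d, b_d, k_d, Ts
        self.dx_prev = 0.0   # Δx(k-1)
        self.dx_prev2 = 0.0  # Δx(k-2)

    def step(self, f_r, f_e):
        """Return the position correction Δx(k) for the force error e = f_r - f_e."""
        e = f_r - f_e
        num = (e * self.Ts ** 2
               + self.b_d * self.Ts * self.dx_prev
               + self.m_d * (2.0 * self.dx_prev - self.dx_prev2))
        den = self.m_d + self.b_d * self.Ts + self.k_d * self.Ts ** 2
        dx = num / den
        self.dx_prev2, self.dx_prev = self.dx_prev, dx
        return dx
```

With m_d, b_d, and k_d tuned for the massage scenario, step() would be called once per control cycle (for example, T_s = 0.02 s if the controller runs at the 50 Hz communication rate reported in the experimental setup).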
Construction of Residual Reinforcement Learning
Skin texture varies with age, skin quality, body mass, and other factors; even in the same individual, different skin areas exhibit different mechanical properties because of the differing distribution of bone, muscle, and fat [29]. Although the impedance control parameters for a robot massage scenario are easy to set, the appropriate values fluctuate across individuals and skin regions. Residual reinforcement learning can start from a good but imperfect force controller and improve accuracy by compensating for the residual error of traditional impedance control.
The flow chart of robot force control based on residual reinforcement learning is shown in Figure 1. The total offset displacement applied when the robot contacts human skin is \begin{equation*} {u}'=\Delta x+f(u), \tag{5}\end{equation*} where Δx is the offset produced by the impedance controller and f(u) is the residual compensation derived from the reinforcement learning action u.
Manually tuning the residual term would be cumbersome, but reinforcement learning can find the optimal policy autonomously. At the same time, the traditional force controller accelerates reinforcement learning by limiting the search range. Combining the two approaches therefore improves performance, so the residual policy is built using reinforcement learning.
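The structure of (5) amounts to summing the outputs of the two controllers. A minimal sketch, with both callables as placeholders:

```python
# Minimal sketch of the residual structure in Eq. (5): the impedance offset and
# the learned residual are simply summed; both callables are placeholders.
def combined_offset(impedance_offset, residual_policy, state):
    delta_x = impedance_offset(state)   # initial strategy (impedance control)
    residual = residual_policy(state)   # compensation learned by the RL agent
    return delta_x + residual           # total offset displacement u'
```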
A. Reinforcement Learning to Construct the Residual Policy
Reinforcement learning is based on the Markov decision process, which consists of a tuple (S, A, P, r, γ), where S is the state space, A is the action space, P is the state transition probability, r is the reward function, and γ is the discount factor.
During the robot massage process, the environment state characterizes the contact between the massage head and the skin, and the reward at each time step penalizes the force tracking error:\begin{equation*} r_{t} =-k_{r} \ast \left |{ {e_{t}} }\right |, \tag{6}\end{equation*}
where k_r is a positive weighting coefficient and e_t is the error between the measured contact force and the reference force. The discounted return from time step t is \begin{equation*} R_{t} =\sum \nolimits _{i=t}^{T} {\gamma ^{(i-t)}} r_{i}, \tag{7}\end{equation*} where γ ∈ (0, 1] is the discount factor and T is the length of the episode.
The ultimate goal of the agent is to maximize the expected return, so the objective function can be set as\begin{equation*} J(\theta)=E_{\tau \sim \rho _{\theta } (\tau)} [R(\tau)], \tag{8}\end{equation*} where τ denotes a trajectory generated by the policy with parameters θ and ρ_θ(τ) is the corresponding trajectory distribution.
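As a concrete illustration of (6) and (7), the reward and return can be computed as follows (the values of k_r and γ here are placeholders, not the settings used in the experiments):

```python
# Sketch of the reward in Eq. (6) and the discounted return in Eq. (7).
def reward(force_error, k_r=1.0):
    return -k_r * abs(force_error)

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_i gamma**(i - t) * r_i over the remaining steps of the episode."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```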
Since the displacement of the robot is a continuous variable, the output of the deep deterministic policy gradient (DDPG) algorithm is not a probability over actions but a specific action [31], so it can be used for continuous action prediction. Therefore, DDPG is used to construct the robot's residual policy. DDPG is based on an actor-critic (AC) framework: the actor is a policy network that outputs an action, while the critic is a value network that estimates the expected return of each state-action pair [32]. The training goal of the actor is to maximize the Q-value output by the critic, because this Q-value reflects the expected benefit of taking this action under the current policy.
The training goal of the critic network is to minimize the error between the predicted return and the actual return, i.e., to minimize the TD error. The critic guides the update of the network parameters by comparing the expected return with the actual return of taking an action in the current state. Since DDPG adopts a deterministic policy, the robot's offset displacement can be selected directly; the value network does not control the agent but only scores the compensation displacement. To encourage exploration, the action is chosen by adding noise to the actor output:\begin{equation*} u=\mu (s_{t} \vert \theta ^{\mu })+{\mathcal{ N}}_{t}, \tag{9}\end{equation*} where μ(·|θ^μ) is the actor network with parameters θ^μ and N_t is exploration noise.
The critic parameters θ^Q are updated by minimizing the loss over a mini-batch of K transitions:\begin{equation*} L=\frac {1}{K}\sum \limits _{i} {(y_{i} -Q(s_{i},u_{i} \vert \theta ^{Q}))} ^{2}, \tag{10}\end{equation*} where y_i is the target value computed from the immediate reward and the outputs of the target networks.
The actor parameters θ^μ are updated along the deterministic policy gradient of the objective:\begin{equation*} \nabla _{\theta ^{\mu }} J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{u} Q(s,u\vert \theta ^{Q})} \vert _{s=s_{i},u=\mu (s_{i})} \nabla _{\theta ^{\mu }} \mu (s\vert \theta ^{\mu })\vert _{s_{i}}. \tag{11}\end{equation*}
Finally, the target networks are updated: the parameters of the actor and critic networks are copied with a soft update, so that the target parameters change only slightly at each step [34]:\begin{align*} \theta ^{Q'}&\leftarrow \alpha \theta ^{Q}+(1-\alpha)\theta ^{Q'} \\ \theta ^{\mu '}&\leftarrow \alpha \theta ^{\mu }+(1-\alpha)\theta ^{\mu '}, \tag{12}\end{align*} where α ≪ 1 is the soft update rate.
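The updates in (10)-(12) correspond to one DDPG training step. The sketch below is an illustrative PyTorch implementation; the actor, critic, target networks, and optimizers are assumed to be standard torch modules built elsewhere, and the critic is assumed to take a state-action pair:

```python
import torch
import torch.nn.functional as F

# One DDPG update step corresponding to Eqs. (10)-(12).
def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, alpha=0.01):
    s, u, r, s_next = batch  # tensors sampled from the replay buffer

    # Critic: minimize the TD error of Eq. (10).
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, u), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient of Eq. (11).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, Eq. (12).
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - alpha).add_(alpha * p.data)
```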
B. Dynamic Model for Residual Reinforcement Learning
The reinforcement learning agent improves its policy through trial and error, so multiple experiments are usually required during actual interaction to achieve the desired result [35]. A repeated trial-and-error process on a person would not only cause discomfort but could also lead to pain and skin damage, so rapid convergence is very important in robot massage scenarios. DDPG is a model-free algorithm based on the AC framework and requires a considerable number of experiments to gather enough data. To accelerate convergence in the actual massage process, a dynamics model of the environment is therefore constructed for reinforcement learning. This enables DDPG to be trained iteratively in the constructed virtual environment, reducing the number of real training interactions and improving the practicability of the algorithm [36], [37], [38].
When the robot is in state s_t and takes action u_t, the next contact state is given by the transition function\begin{equation*} s_{t+1} =\varphi (s_{t},u_{t}), \tag{13}\end{equation*} where φ denotes the state transition of the contact environment, which is approximated here by a backpropagation (BP) neural network trained on the recorded interaction data.
The BP neural network has a single hidden layer. The output of hidden node j is\begin{equation*} z_{j} =\phi \left({\sum \limits _{i=1}^{n} {w_{ij} y_{i} +b^{1}_{j}} }\right), \tag{14}\end{equation*} where y_i are the network inputs, w_ij are the input-to-hidden weights, and b^1_j is the bias of hidden node j.
The activation φ(·) is the sigmoid function\begin{equation*} \phi (a)=\frac {1}{1+e^{-a}}, \tag{15}\end{equation*}
and the output of output node h is a linear combination of the hidden-layer outputs:\begin{equation*} O_{h} =\sum \limits _{j=1}^{n} {w_{jh} z_{j}} +b^{2}_{h}, \tag{16}\end{equation*} where w_jh are the hidden-to-output weights and b^2_h is the bias of output node h. The network output predicts the next contact state s_{t+1}.
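A minimal sketch of this single-hidden-layer dynamics model is given below; the layer width, initialization, and the omitted training step (standard backpropagation on the recorded contact data) are assumptions for illustration:

```python
import numpy as np

# Sketch of the single-hidden-layer BP dynamics model of Eqs. (13)-(16).
class BPDynamicsModel:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))   # w_ij
        self.b1 = np.zeros(n_hidden)                              # b^1_j
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))   # w_jh
        self.b2 = np.zeros(n_out)                                 # b^2_h

    def predict(self, state, action):
        """Approximate the next contact state s_{t+1} = phi(s_t, u_t), Eq. (13)."""
        y = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
        z = 1.0 / (1.0 + np.exp(-(y @ self.W1 + self.b1)))  # sigmoid hidden layer, Eqs. (14)-(15)
        return z @ self.W2 + self.b2                          # linear output layer, Eq. (16)
```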
C. Residual Policy Fusion
The policy obtained via reinforcement learning is not a smooth curve, and an unsmooth compensation signal may cause the robot to shake during the movement. Therefore, in actual robot massage scenarios, the residual policy is smoothed to improve comfort: a mean filter is applied to the output offset displacement of the residual reinforcement learning framework, so the robot offset displacement becomes\begin{equation*} {u}'_{t} =\sum \limits _{i=0}^{d-1} {u'_{t-i} /d}, \tag{17}\end{equation*} where d is the length of the filter window.
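In code, the mean filter of (17) is a sliding window over the last d residual outputs; the sketch below is illustrative, and the choice of d is a tuning decision made in the experiments:

```python
from collections import deque

# Sketch of the mean filter in Eq. (17) used to smooth the residual output u'_t.
class MeanFilter:
    def __init__(self, d):
        self.window = deque(maxlen=d)   # holds the last d residual outputs

    def smooth(self, u):
        """Return the average of the most recent residual outputs."""
        self.window.append(u)
        return sum(self.window) / len(self.window)
```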
The pseudocode of the algorithm is as follows:
Algorithm 1 Residual Reinforcement Learning for Robot Massage
Require: empirical data from the impedance control strategy, i.e., the robot contact forces and displacements recorded while massaging with the chosen impedance parameters;
Train a BP neural network as the dynamic model of the robot's contact state;
for each training episode do
Initialize a random process N for action exploration;
Set the initial contact state of the residual policy;
for each time step do
Select the policy action with exploration noise according to (9);
Execute the action in the BP neural network model of the robot massage contact;
Obtain the next state and the reward (6);
Store the transition (state, action, reward, next state) in the replay buffer;
Sample a mini-batch from the replay buffer and update the actor and critic networks using (10)-(12);
end for
end for
In the actual experiment, execute the fused robot massage residual strategy;
Use formula (17) to smooth the residual strategy output.
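A compact Python rendering of this offline loop, reusing the earlier sketches (BPDynamicsModel, reward, ddpg_update), is given below; the agent object, its replay buffer, and the helpers initial_state() and force_error() are illustrative placeholders:

```python
import numpy as np

# Sketch of the offline training loop of Algorithm 1: all rollouts happen in the
# BP dynamics model, so no real contact with the skin is needed during training.
def train_offline(agent, model, episodes, steps, noise_scale=0.1):
    for _ in range(episodes):
        s = initial_state()                                      # recorded contact state
        for _ in range(steps):
            u = agent.act(s) + noise_scale * np.random.randn()   # Eq. (9)
            s_next = model.predict(s, u)                         # virtual rollout
            r = reward(force_error(s_next))                      # Eq. (6)
            agent.buffer.add((s, u, r, s_next))
            if len(agent.buffer) > agent.batch_size:
                ddpg_update(agent.buffer.sample(agent.batch_size),
                            agent.actor, agent.critic,
                            agent.target_actor, agent.target_critic,
                            agent.actor_opt, agent.critic_opt)   # Eqs. (10)-(12)
            s = s_next
```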
Experimental Setup
The massage robot is a UR robot with a massage head installed at its end. A force sensor is mounted between the massage head and the robot, and a Beckhoff module collects the force signal and transmits it to the host computer. At the same time, the control box transmits the robot's position and velocity signals to the host computer. The host computer and the control box communicate at a frequency of 50 Hz. A schematic diagram of the experiment is shown in Figure 2. The robot presses the skin vertically along the Z-direction at a speed of 2 mm/s; when the contact force reaches the reference value f_r, the rubbing massage begins.
The residual reinforcement learning experimental process is shown in Figure 3. The initial strategy is used to obtain the robot contact states and displacements in the Z-direction. A BP neural network is then used to construct the state transition model: its inputs are the robot contact state and the offset displacement, and its output is the contact state at the next moment.
Experimental Results and Analysis of Robot Massage
The robot exerts force on the skin surface by rubbing. To ensure the safety of the massage recipient, a gentle force application strategy is adopted, and the robot's reference force is set to 5 N, i.e., f_r = 5 N.
In the initial strategy, the impedance control parameters are manually adjusted to
Massage results comparison between the initial strategy, model-based reinforcement learning algorithm and residual reinforcement learning (volunteer A).
After executing the initial strategy, the difference between the robot force and the reference force
In the DDPG network, the input of the actor network and the target actor network is 2-dimensional, the output is 1-dimensional, and the number of hidden nodes is set to 30. The input of the critic network and the target critic network is 3-dimensional (the 2-dimensional state concatenated with the 1-dimensional action), and the output is the 1-dimensional Q-value.
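Under these stated dimensions, the networks could be sketched as follows; the activation functions are assumptions, since they are not reported in the text:

```python
import torch
import torch.nn as nn

# Sketch of actor/critic architectures with the stated sizes:
# 2-D state input, 1-D action output, 30 hidden nodes, 3-D critic input.
actor = nn.Sequential(
    nn.Linear(2, 30), nn.ReLU(),
    nn.Linear(30, 1), nn.Tanh(),   # bounded offset-displacement output
)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 30), nn.ReLU(), nn.Linear(30, 1))

    def forward(self, s, u):
        # 3-D input: 2-D contact state concatenated with the 1-D action.
        return self.net(torch.cat([s, u], dim=-1))

critic = Critic()
```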
The iteration process of the DDPG algorithm under the BP neural network dynamics model is shown in Figure 5. At the beginning of training, the DDPG networks are not yet sufficiently trained, so the policy obtained is not effective and the return value of reinforcement learning stays below −9100. After approximately 80 iterations, the return value improves rapidly to approximately −8950, the learned policy is clearly better than in the initial stage, and the steady final return value shows that the algorithm converges. The residual strategy obtained offline is then fused with the initial strategy, and the filter window of (17) is used to smooth the output.
The robot offset displacements under the initial strategy and the residual reinforcement learning algorithm are compared in Figure 6. Under the initial strategy, the offset displacement is small; when the environment changes suddenly, the offset strategy is not sufficient for the contact force to reach the reference force quickly. Compared with the initial strategy, the residual reinforcement learning strategy responds quickly to changes in the external environment by rapidly increasing or decreasing the robot's offset displacement. The residual reinforcement learning algorithm handles such problems well because it is trained effectively in the dynamics model, which yields an efficient strategy. During the first 80 iterations in Figure 5, the reinforcement learning algorithm collects environmental information by exploring the environment and trying different actions. When faced with an uncertain contact environment, the algorithm tends to try a variety of actions, although the reward values of these actions may not be ideal. At the same time, the algorithm updates the neural networks in DDPG to better reflect the actual situation. As the number of iterations increases, the DDPG networks continue to improve and output more accurate strategies. In Figure 6, this is reflected in the more rapid response of the robot's offset displacement strategy, which quickly increases or decreases the offset displacement and thereby allows the robot to adapt to changes in the external environment. Because the residual reinforcement learning algorithm improves on the initial strategy, the output offset displacement fluctuates around the initial strategy, which improves the force application effect while also ensuring the safety of the contact.
Massage results comparison between initial policy and residual reinforcement learning (volunteer A).
To verify the generality of the algorithm, the residual reinforcement learning algorithm is used to massage a different person's arm, with the same parameter settings as in the first experiment. The comparison of the initial strategy and residual reinforcement learning is shown in Figure 7, and the result is similar to that for volunteer A. Under the initial strategy, the massage force measured by the robot also fluctuates. After the reinforcement learning strategy is obtained offline, the massage force obtained by residual reinforcement learning is noticeably smoother than that obtained by the initial strategy, the error with respect to the reference force is stable within a certain range, and the control effect is significantly improved. The return value, which represents the total cumulative reward obtained by the agent in one robot massage, is shown in Figure 8 and converges after approximately 60 iterations. The robot offset displacements under the initial strategy and the residual reinforcement learning algorithm are compared in Figure 9: under residual reinforcement learning, the offset displacement again fluctuates around the initial policy's offset displacement, and a good control effect is obtained on a different volunteer's arm.
Massage results comparison between the initial strategy, model-based reinforcement learning algorithm and residual reinforcement learning (volunteer B).
Robot offset displacement comparison between the initial strategy and the residual reinforcement learning algorithm (volunteer B).
In the experiments on different volunteers' arms, the residual reinforcement learning policy is trained in the BP neural network model without interacting with the actual environment, which reduces the cost of using the algorithm; only two real interactions are needed across the two massage experiments. Compared with a traditional reinforcement learning algorithm, residual reinforcement learning obtains good control parameters quickly. The error comparison between the initial policy and the residual reinforcement learning algorithm is shown in Table 1: the force errors of residual reinforcement learning, measured by the maximum absolute force error, the average absolute force error, and the mean square force error, are all clearly smaller than those of the initial strategy.
The algorithm chosen for comparison is a model-based reinforcement learning algorithm; the force it obtains is shown as the black dotted line in Figure 4 and Figure 7. In the experiment on volunteer A in Figure 4, both reinforcement learning algorithms achieve good results. However, in the second half of the force tracking on volunteer B in Figure 7, the force signal of the model-based reinforcement learning algorithm clearly exceeds the threshold, whereas the residual reinforcement learning algorithm is more stable and shows better versatility. Because impedance control provides better initial search conditions, it gives residual reinforcement learning a better initial strategy, and residual reinforcement learning needs to repeat the experiment only once to obtain good results. At the same time, the impedance control experience narrows the search range of residual reinforcement learning and improves its search efficiency. Compared with the model-based reinforcement learning algorithm, the strategy obtained by residual reinforcement learning is better, with a smaller absolute force error.
Conclusion and Future Work
In the process of contact between the robot and the skin, the initial strategy for the robot's massage force is established using impedance control. Because impedance control has difficulty adapting to changes in the skin environment, the control process is learned via reinforcement learning, which compensates for the residual term of the controller. To reduce the number of online interactions when reinforcement learning is actually used, a neural network is employed to construct a dynamic model of the robot's contact with the environment from the relationship between the robot's displacement and the resulting contact state, and the learned model is used to train the residual strategy offline. To fuse the residual strategy with the initial strategy, the robot offset displacement is smoothed with a mean filter.
Experiments were carried out on different volunteers' arms. The results show that the robot massage force algorithm based on residual reinforcement learning converges quickly: after approximately 80 offline iterations, a displacement compensation policy can be selected. When the initial strategy and the learned policy are combined, the robot massage force quickly converges to the reference force, and the force error is stable within ±0.2 N. Compared with the initial strategy, the maximum absolute force error, the average absolute force error, and the mean square force error are all significantly reduced by residual reinforcement learning, and the average error is reduced by 82.3% and 75.4% for the two volunteers, respectively. Compared with the model-based reinforcement learning algorithm, the average absolute error is reduced by 27.7% and 12.5%, which further demonstrates the stability of the algorithm.
In the current work, the robot massage force control algorithm needs one offline iteration. In future work, we will simplify the DDPG model so that the reinforcement learning algorithm can learn online and the force control strategy can be iterated and applied in real time. We also plan to explore a wider variety of reinforcement learning algorithms. In addition, we will further study how to personalize the massage intensity according to the needs and preferences of different users.