
Mobile Service Robot Path Planning Using Deep Reinforcement Learning



Abstract:

A mobile service robot operates in a constantly changing environment with other robots and humans. The service environment is usually vast and unknown, and the robot is expected to operate continuously for a long period. The environment can be dynamic, leading to the generation of new routes or the permanent blocking of old routes. A traditional path planner that relies on static maps will not suffice for a dynamic environment. This work focuses on developing a reinforcement learning-based path planner for a dynamic environment. The proposed system uses the deep Q-learning algorithm to learn the initial paths using a topological map of the environment. When the environment changes, the proposed $\beta$-decay transfer learning algorithm trains the agent in the new environment. This algorithm chooses between experience-, exploration-, and exploitation-based training depending on the similarity of the old and new environments. The system is implemented on the Robot Operating System framework and tested using a Turtlebot3 mobile robot in the Gazebo simulator. The experimental results show that the reinforcement learning system learns all the routes based on the initial topological map of different service environments with an accuracy of over 98%. A comparative analysis of the $\beta$-decay transfer learning and non-transfer learning agents is performed based on various evaluation metrics. The transfer learning agent converges twice as fast as the non-transfer learning agent.
Proposed RL-based DQN Path Planner's Initial Learning, Beta-Decay Lifelong Learning and Execution Process Flow.
Published in: IEEE Access (Volume: 11)
Page(s): 100083 - 100096
Date of Publication: 04 September 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

The field of mobile robotics is one of the most popular domains among robotics researchers. An essential mobile robot subsystem is the navigation system, which perceives the environment and enables the robot to navigate autonomously to perform its tasks [1]. Mobile robots are used in a vast range of applications, such as surveillance, industrial automation, museum guidance, elderly care, hospital care, home assistance, serving in restaurants, and so on. Mobile robots are classified based on their form factor, such as wheeled, legged, flying, and so on [2]. Wheeled mobile robots, in particular, are effective and simple to use on flat indoor terrain.

A mobile service robot performs autonomous tasks in service environments like homes, restaurants, hospitals, etc. [3]. A service environment is unlike other environments in a few respects: vast square footage, moving obstacles (robots and humans) [4], susceptibility to environmental changes over time, and the impossibility of augmenting the environment (adding extra sensors) to support autonomous operation. A path planner is essential for an autonomous mobile robot to plan an efficient path given the source and destination.

A path planner is only as good as its understanding of the environment. Initially, the main focus of the path-planning research community was to achieve optimal paths given different levels of environmental awareness; these works are categorized based on the environmental information available to the path planner [5]. One of the main concerns for a service robot's path planner is environmental change, which hinders its performance. A path planner uses the initial environment configuration to identify efficient paths, and this configuration will no longer be valid if the environment changes. Hence, a path planner that can dynamically plan paths based on the latest configuration of the environment is necessary. This type of path planner is classified as a dynamic path planner, which considers environmental changes throughout its lifetime.

A dynamic path planner continuously monitors environmental changes and updates itself to provide efficient paths. Using machine learning (ML) techniques for dynamic path planning has recently gained traction among researchers. In particular, Reinforcement Learning (RL) [6] is a machine-learning technique that uses the Markov Decision Process (MDP) to learn to interact with an environment. The main reasons to use RL for path planning are that the problem can be modeled intuitively as an MDP and that RL does not need massive labeled datasets for training. In most work on RL-based path planning, the agent's state space consists of images or raw sensory information. While this is a reasonable way to define the state space of an RL agent, its shortcoming is a larger state space, and the larger the state space, the longer the learning phase takes to converge.

Lifelong learning [7] keeps an artificial intelligent agent learning and evolving continuously in response to changes over a long period. Transfer Learning (TL) is an ML technique that uses previous experience to learn and perform well in a similar situation. A TL algorithm can efficiently make the RL agent dynamic, with lifelong learning ability. Given the variations in the environment configuration, the transfer learning approach can be used to achieve lifelong learning. In this context, the source states are the old environment's topological information, and the target states are the new, altered environment's topological information. Only the state space differs; all the other RL-agent parameters remain the same. TL approaches that consider only the state-space difference between the source and target tasks are scarce, so a novel transfer learning technique that considers only this difference is proposed in this article.

This work aims to design and develop a Deep Reinforcement Learning (DRL) based path-planning framework. The proposed framework uses a deep Q-learning algorithm to learn the initial paths based on the topological map of the environment. A novel $\beta$-decay TL algorithm is proposed to achieve lifelong learning. This algorithm uses an incremental learning approach to update the RL agent dynamically, based on changes in the environment, throughout its lifetime. The key contributions of the work are:

  • Design and development of a scalable DQL path planning framework for a mobile service robot.

  • Incorporating the $\beta$-decay TL algorithm to continuously evolve the RL agent based on environmental changes and achieve lifelong learning.

  • The RL agent’s scalability and the TL algorithm’s learning efficiency are tested and demonstrated.

The remainder of the paper is organized as follows: a summary of related work on path planning is given in Section II; the problem formulation is defined in Section III; Section IV describes the proposed RL and TL framework in detail; results and analysis of various test cases are presented in Section V; and the article is concluded, with future directions identified, in the last section.

SECTION II.

Recent and Relevant Work

The path-planning problem can be categorized into two classes: global and local path planning [8]. Global path planning is planning a path from a source to a destination, given the environment map. Local planning is planning the path when an unforeseen scenario (an obstacle) occurs while traversing the global path. Rapid development in artificial intelligence and machine learning has led to their use in solving path-planning problems [8], [9], [10]. Artificial intelligence techniques like fuzzy logic [11], neural networks [12], and hybrid solutions like neuro-fuzzy inference systems [13] are used for local path planning, as they are very efficient in dynamic problem-solving. Evolutionary techniques like particle swarm optimization [14] and genetic algorithms [15] are used for global path planning to optimize the path length and planning time.

One of the recent advancements in path-planning research is the use of reinforcement learning-based techniques. Unlike other machine learning algorithms, an RL agent can be trained with a minimal dataset, which is one primary reason to use RL in the path-planning domain. Q-learning (QL) and Deep Q-Learning (DQL) are two of the most used RL algorithms for path planning. In QL, the Q-value is calculated based on the Bellman equation. In [16] and [17], a QL agent is used for planning the path. The drawback of Q-learning is scalability: as the environment size increases, the memory complexity also increases. In DQL, the Q-value is predicted using a neural network, which makes the system scalable. In [18], [19], and [20], planning a path in a grid-world environment using DQL is presented; the obstacle locations are available, and the RL agent learns to avoid all the static obstacles and plan the path efficiently. DQL-based algorithms for dynamic-environment path planning are proposed in [21] and [22], where a dynamic environment is considered to have moving obstacles whose locations are unknown. Another approach to exploring an unknown environment is to use sensor values to train the RL agent [23], [24], [25].

One critical factor in any machine-learning algorithm is convergence time, i.e., the time the agent takes to learn the task. There have been considerable efforts to reduce the convergence time of DRL-based path-planning algorithms. In [26], convergence time is reduced by pre-training the RL agent using a 2D simulator environment and later using this experience to train in a real-time 3D environment. A Q-learning system with general information about the goal and current states is proposed in [27]. This knowledge is used efficiently while training to reduce the training time significantly. In article [28], the authors propose the use of RL and particle swarm optimization to increase the agent’s convergence rate.

Researchers have made numerous attempts to address lifelong learning in various domains. In [29], a notion similar to lifelong learning called concept drift is described, and a solution is proposed that uses previous experience to learn new, similar concepts. Correspondence learning is yet another attempt to reuse old experience efficiently in a new task; in [30], correspondence learning uses previously learned skills to solve new tasks in an Atari ping-pong game environment. Transfer learning (TL) is a technique applied to evolve machine learning models toward lifelong learning: a TL algorithm transfers the source-task knowledge to a target model. In [31], RL- and TL-based tracking in stationary environments is proposed. Key points to consider while applying transfer knowledge are the similarity of the source and target tasks, the mapping between source states and target states, and the type of knowledge to be transferred [32], [33]. When applying TL between source and target tasks with different states and actions, the challenge lies in mapping the source states to the target states. In [34] and [35], the authors propose a simple approach in which subject experts provide the mappings manually. While this is a simple solution, obtaining expert mappings can be impractical in many applications. To address this shortcoming, inter-task mapping can either be made unnecessary or learned automatically. If no explicit mapping is required, the agent attempts to learn an abstraction of the MDP that is invariant even though actions and state variables change [36], [37]. Another approach is to learn the inter-task mapping automatically [38], [39] using novel mapping-learning methods. Applying heuristics in transfer learning is proven to provide reasonable acceleration of target-task learning. In [40], heuristics are obtained by exploring the environment. Obtaining the heuristics as different cases based on source-task learning and using them in target learning is proposed in [41] and [42].

SECTION III.

Problem Definition

Consider a wheeled mobile robot performing autonomous tasks in an indoor service environment. To perform tasks autonomously, the robot should be capable of navigating from one location to another. A typical service robot is required to work in the environment for a long period, during which the environment configuration can change. The environment configuration is the location of all the static obstacles, doorways, pathways, etc. To cope with the environmental changes, the mobile robot should be able to perform lifelong learning. This enables the robot to plan a path based on the current environment configuration rather than the old one, and it can be achieved using reinforcement learning.

The RL agent is the mobile robot, and the environment is the service environment. The proposed path planner is based on the deep Q-learning algorithm. The DQL agent is pre-trained based on the topological map of the service environment. A topological map is a high-level connectivity tree containing all the critical locations in the environment. The topological map can be generated manually or automatically, as proposed in [43]; the techniques to generate a topological map are out of the scope of this work. There are many techniques in the literature to obtain a topological map of an environment, and a few of them are explored in our previous works [43], [44], [45]. Fig. 1 depicts a sample environment with paths and its topological map. The Home node refers to the robot's charging point, the "Vx" nodes are waypoint nodes, and the "Dx" nodes are the destination locations. The map is represented as a ternary tree: each node in the map is a location, and all the destination locations are leaf nodes. Each node in the tree can have a maximum of three branches (left, right, forward), and the direction to reach a child node from its parent is encoded as the branch. For example, if a parent node has a left child, that child is reached by taking the left direction from the parent node.

FIGURE 1. Sample environment and its topological map.
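As an illustration of this representation, the following minimal Python sketch (the class and field names are ours, not the paper's) encodes a ternary-tree node with left, right, and forward children:

from dataclasses import dataclass
from typing import Optional

@dataclass
class MapNode:
    """A location in the topological map, represented as a ternary-tree node."""
    name: str                           # e.g. "Home", "V1", "D3"
    left: Optional["MapNode"] = None    # child reached by turning left at this node
    right: Optional["MapNode"] = None   # child reached by turning right at this node
    forward: Optional["MapNode"] = None # child reached by moving forward from this node

    def is_destination(self) -> bool:
        # Destination locations are the leaf nodes of the tree.
        return self.left is None and self.right is None and self.forward is None

# Example: Home --forward--> V1 --left--> D1
d1 = MapNode("D1")
v1 = MapNode("V1", left=d1)
home = MapNode("Home", forward=v1)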

Lifelong learning can be achieved using transfer learning in the proposed DQL agent. In a practical scenario, the changes in the environment will be gradual; for example, furniture can be added or removed, which might block a current path. Such changes lead to environment configuration changes, and the topological map can be updated based on the new configuration [46]. The process of obtaining the new topological map is not in the scope of this article. Consider a permanent obstacle placed between the "Home" and "V1" nodes in the sample environment shown in Fig. 1; this affects the path between these two nodes. The sample environment with the obstacle and its topological map is depicted in Fig. 2, with dotted lines indicating the change in path. Given this new topological map, the RL agent should re-learn the new environment. Training the RL agent to learn the new environment configuration from scratch requires more time than using the TL technique to jump-start the training process by efficiently transferring the old knowledge while training the new agent.

FIGURE 2. Sample environment and its topological map with obstacle.

SECTION IV.

Design of the Framework

The proposed system is a path planner for a mobile service robot with lifelong learning ability. The overall system consists of two modules: the DRL module and the TL module. The DRL module trains an RL agent with the initial environment configuration to plan paths efficiently. The TL module re-trains the RL agent if there is any configuration change in the environment.

A. Deep Reinforcement Learning Framework

RL is a class of ML algorithms that uses an agent to learn a task by interacting with the environment through its actions. An agent with its current state $s$ takes action $a$ and reaches another state $s'$ . The agent receives a reward of $r$ for every action in the environment, which can be positive or negative. The reward will be based on the action and the agent’s current state. The agent’s ultimate goal is to maximize the reward that will indirectly train the agent to perform the task. A Deep Q-learning-based RL algorithm implements the path planner, and the mobile robot is the RL agent. DQL uses a neural network to predict the best action given the current state. The parameters of the DQL framework are initialized based on the topological map used to train the RL agent.

1) States - S

States are all the possible configurations of an RL agent. In the mobile robot domain, a state can be the location of the mobile robot. The state space $S$ of the agent is defined as the set of all the nodes in the topological map. The current state is defined as $s \in S$, where $s$ is the robot's current location, one element of the set of all possible locations.

2) Actions - A

Actions are the agent's interactions with the environment; the result of such an interaction is the agent's transition from one state to another. An action is defined as $a \in A$. The action space $A$ is the set of all the possible actions a mobile robot can take, $A = \{left, right, forward, backward\}$.

3) Reward - R

The reward system is the key ingredient of RL training that can make or break an agent. The reward space $R(S, A)$ is the set of all possible rewards based on the current state and action. Equation (1) shows the reward function of the proposed RL agent. While the reward values are constants, they can be changed as long as the differences between the rewards are maintained. A positive reward motivates the agent to repeat a similar action in the future, and a negative reward discourages it. The highest reward $r_{1}$ is given for reaching the goal state $g \in G$, where $G \subset S$ is the set of all goal states. The highest negative reward $r_{4}$ is given for reaching an unknown state $U \notin S$. A reward $r_{3}$ is given to discourage looping. A reward $r_{2}$ is given for every ordinary step; this drives the agent to find the shortest possible path.\begin{align*} r(s, a) = \begin{cases} r_{1}, & s = g \\ r_{2}, & s \in S \setminus G \\ r_{3}, & s \text { is a previously visited location} \\ r_{4}, & s = U\end{cases} \tag{1}\end{align*}
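A minimal Python sketch of the reward rule in (1), using the concrete values later adopted in (8) ($r_{1}=100$, $r_{2}=-1$, $r_{3}=-10$, $r_{4}=-100$); the function signature and set representations are illustrative assumptions, not the authors' code:

def reward(s, goal, valid_states, visited, r1=100, r2=-1, r3=-10, r4=-100):
    """Reward for arriving in state s, following Eq. (1)/(8).

    s            -- state reached after the action (a node name, or "U" for unknown)
    goal         -- the goal state g for this episode
    valid_states -- set of all valid states (nodes of the topological map)
    visited      -- set of states already visited in this episode
    """
    if s == goal:
        return r1          # reached the goal
    if s not in valid_states:
        return r4          # unknown state U: largest penalty
    if s in visited:
        return r3          # looping back to an already visited location
    return r2              # ordinary step penalty, encourages short paths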

4) Q-Value - Q(s, a)

The Q-value determines the quality of an action given the current state. The DQN neural network is trained to predict the Q-values of all possible actions for a given state, $Q(s, A)$. The best action is chosen as $a = \arg\max_{a} Q(s, a)$.

5) DQN Architecture

The DQN architecture consists of an input layer, two hidden layers, and an output layer, trained with the Adam optimizer, as shown in Fig. 3. The neural network configuration depends on the states and actions of the RL agent, as shown in (2) and (3). The total number of input neurons $n$ depends on the number of states $S$ and the number of goals $G$; i.e., the input to the neural network encodes the current state $s$ and the desired goal $g$. The goal states $G \subset S$ denote all possible destination locations obtained from the topological map. Including the goal state in the input layer enables the agent to find the optimal path to every possible destination. The number of output neurons $m$ depends on the number of actions $A$; the outputs represent the Q-values of all actions for the current state, and the best action corresponds to the maximum Q-value in the output layer.\begin{align*} n &= \lvert S \rvert + \lvert G \rvert \tag{2}\\ m &= \lvert A \rvert \tag{3}\end{align*}

FIGURE 3. Proposed DQN architecture.
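A sketch of the network sizing in (2) and (3); PyTorch is used purely for illustration (the paper does not state its deep-learning library), and the hidden width of roughly 1.5-2x the input size follows the discussion in Section V-A2:

import torch
import torch.nn as nn

def build_dqn(num_states: int, num_goals: int, num_actions: int = 4,
              hidden_scale: float = 1.5) -> nn.Module:
    n = num_states + num_goals   # Eq. (2): one-hot state plus one-hot goal
    m = num_actions              # Eq. (3): one Q-value per action
    h = int(hidden_scale * n)    # hidden width, roughly 1.5-2x the input size (Sec. V-A2)
    return nn.Sequential(
        nn.Linear(n, h), nn.ReLU(),
        nn.Linear(h, h), nn.ReLU(),
        nn.Linear(h, m),         # Q(s, a) for a in {left, right, forward, backward}
    )

# House layout: 15 states and 8 goal states -> n = 23 inputs, m = 4 outputs.
model = build_dqn(num_states=15, num_goals=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam with alpha = 0.001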

6) Training

The neural network is trained using an experience replay buffer. An experience $e = (s, a, s', r)$ is stored in the experience memory after every iteration, and a random batch of experiences is used for training; the batch size $\eta$ is a hyper-parameter. Equation (4) shows the loss, which is the Bellman error derived from the Bellman equation for Q-learning. $Q^{*}(s, a)$ is the predicted Q-value, and $\max_{a'} Q'(s', a')$ is the maximum expected future reward based on the new state $s'$ over all possible actions. $\gamma$ is the discount rate, a hyper-parameter controlling the weight of the future reward. An RL agent at the early stages of training should explore more to understand the environment better; after exploring enough, it should rely on the trained network, i.e., exploitation. The hyper-parameter $\epsilon$ manages the exploration-exploitation trade-off. To start with a high $\epsilon$ (exploration) and reduce it gradually toward exploitation, the $\epsilon$-decay algorithm is used. The decay function in (5) is an exponential schedule determined by the current training episode number, $\epsilon_{min}$, and $\epsilon_{initial}$; it gradually reduces (decays) $\epsilon$ from $\epsilon_{initial}$ to $\epsilon_{min}$. A training episode terminates based on the current state $s$ of the agent; (6) lists all the terminating conditions.\begin{align*} Loss &= \left (Q^{*}(s, a) - \left (r + \gamma \max _{a'} Q' (s', a')\right )\right )^{2}, \quad e = (s, a, s', r) \tag{4}\\ \epsilon _{new} &= \epsilon _{old} \cdot d^{\,current\_episode}, \quad \text {where } d = \sqrt [total\_episodes]{\frac {\epsilon _{min}}{\epsilon _{initial}}} \tag{5}\\ s &= \begin{cases} \text {terminal\_state}, & s = g \\ \text {terminal\_state}, & s = U \\ \text {terminal\_state}, & \text {total visited states} = \lvert S \rvert \\ \text {non-terminal\_state}, & \text {otherwise} \end{cases} \tag{6}\end{align*}
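The following Python sketch ties (4) and (5) together: an exponential $\epsilon$ schedule and one replay-batch update of the squared Bellman error. It assumes the PyTorch model sketched above, uses the same network for the prediction and the bootstrap target (no separate target network is described in the paper), and omits the environment-stepping code; the buffer size and all names are illustrative assumptions.

import random
from collections import deque
import torch
import torch.nn.functional as F

def epsilon_decay_factor(eps_initial: float, eps_min: float, total_episodes: int) -> float:
    # Eq. (5): decay base d chosen so that eps_initial * d**total_episodes == eps_min.
    return (eps_min / eps_initial) ** (1.0 / total_episodes)

replay = deque(maxlen=10_000)   # experience memory e = (s, a, s_next, r, done); size is illustrative

def train_on_batch(model, optimizer, batch_size=32, gamma=0.95):
    """One gradient step on the Bellman error of Eq. (4) over a random replay batch."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s      = torch.stack([e[0] for e in batch])              # encoded (state, goal) vectors
    a      = torch.tensor([e[1] for e in batch])              # action indices
    s_next = torch.stack([e[2] for e in batch])
    r      = torch.tensor([e[3] for e in batch], dtype=torch.float32)
    done   = torch.tensor([e[4] for e in batch], dtype=torch.float32)

    q_pred = model(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q*(s, a)
    with torch.no_grad():
        q_next = model(s_next).max(dim=1).values              # max_a' Q'(s', a')
    target = r + gamma * q_next * (1.0 - done)                # future term dropped at terminal states
    loss = F.mse_loss(q_pred, target)                         # squared Bellman error, Eq. (4)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()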

B. $\beta$-Decay Lifelong Learning Algorithm

The proposed $\beta$-decay algorithm is a transfer learning algorithm applied when a configuration change in the environment affects the trained DQL path planner. The non-TL approach would be to retrain a new RL agent from scratch. The configuration changes are likely to be minor and gradual, affecting only the number of nodes in the topological tree, i.e., only the state space of the RL agent changes. Considering this, the non-TL approach would be redundant and time-consuming, as the old agent holds useful information that can be used to upgrade the new agent to the new environment configuration. The TL approach re-trains the new agent using the knowledge of the old agent, hence reducing training time; it treats the old agent's path planner as the source task and the new agent's path planner as the target task.

The proposed $\beta$-decay TL algorithm transfers the knowledge of the source task to the target task based on the $\beta$ factor. The target task agent learns either from the source task knowledge or through the $\epsilon$-decay learning process described in Section IV-A6.

The higher the $\beta$ value, the higher the probability of transferring the source task knowledge. $\beta$ is initialized based on (7), where $S_{target}$ is the state space of the target task and $S_{source}$ is the state space of the source task; it is the probability that a state chosen at random from the target state space also belongs to the source state space. The decay factor is calculated as in (5). Based on $\epsilon$ and $\beta$, the proposed TL algorithm chooses between Experience vs. Exploration vs. Exploitation (EEE), as shown in Algorithm 1.\begin{equation*} \beta = \frac {\lvert S_{target} \cap S_{source} \rvert }{\lvert S_{target} \rvert } \tag{7}\end{equation*}
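A small sketch of the $\beta$ initialization in (7), assuming states are identified by their node names in the old and new topological maps (consistent with the name-based similarity noted in the conclusion); the node names below are illustrative, not taken from the paper:

def initial_beta(source_states: set, target_states: set) -> float:
    # Eq. (7): probability that a random target state also exists in the source task.
    return len(target_states & source_states) / len(target_states)

# Illustrative only: the new map keeps the old nodes and adds a few around the obstacle.
old_nodes = {"Home", "V1", "V2", "V3", "D1", "D2"}
new_nodes = old_nodes | {"V1a", "V1b"}          # hypothetical detour nodes
beta0 = initial_beta(old_nodes, new_nodes)       # 6/8 = 0.75 for this toy example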

Algorithm 1 Experience vs. Exploration vs. Exploitation Pseudocode

1: Initialize the Source and Target DQN networks
2: Initialize parameters: $\epsilon$, $\beta$, episode_count
3: for $episode \le episode\_count$ do
4:   $s = random(S)$
5:   while $s \ne terminal\_state$ do
6:     if $random(0\text{-}1) < \beta$ then   ▷ Experience: reuse source knowledge
7:       $Q\_value(s_{source}) = Source.Predict(s_{source})$
8:       $a = \arg\max (Q\_value(s_{source}))$
9:     else if $random(0\text{-}1) < \epsilon$ then   ▷ Exploration: random action
10:      $a = randint(1\text{-}4)$
11:    else   ▷ Exploitation: use the target network
12:      $Q\_value(s_{target}) = Target.Predict(s_{target})$
13:      $a = \arg\max (Q\_value(s_{target}))$
14:    end if
15:    Execute $a$, observe $s'$ and $r$; store the experience and train the Target network; $s = s'$
16:  end while
17:  Decay $\beta$ and $\epsilon$ per (5)
18: end for
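A minimal Python sketch of the EEE choice in Algorithm 1, assuming the PyTorch networks sketched earlier and treating $\epsilon$ as the exploration probability as described in Section IV-A6; all function and variable names are illustrative, not the authors' implementation:

import random
import torch

def select_action(s_source, s_target, source_net, target_net,
                  beta, epsilon, num_actions=4):
    """Experience vs. Exploration vs. Exploitation choice (cf. Algorithm 1).

    s_source / s_target are the state encodings in the source- and target-task
    input spaces (the two networks have different input sizes).
    """
    if random.random() < beta:                      # Experience: reuse the source policy
        with torch.no_grad():
            return int(source_net(s_source).argmax())
    if random.random() < epsilon:                   # Exploration: random action
        return random.randrange(num_actions)
    with torch.no_grad():                           # Exploitation: target network
        return int(target_net(s_target).argmax())

After every episode, both $\beta$ and $\epsilon$ are multiplied by their respective decay factors from (5), so the agent gradually shifts from reused source experience toward exploiting its own target network.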

SECTION V.

Results and Discussions

The DRL path planner is implemented in two stages: pre-training and lifelong learning. The pre-training stage trains the RL agent to plan efficient paths based on the topological map. The lifelong learning stage uses the $\beta$-decay transfer learning algorithm to keep the RL agent in continuous learning mode and update its training based on environmental changes. The system is implemented in the ROS2 framework and tested using a Turtlebot3 mobile robot in the Gazebo simulation environment. Two service environments and one benchmark maze environment, which differ in size, are chosen to implement the proposed path planner. The house layout has 8 destinations and 5 waypoints, the hospital layout has 28 destinations and 23 waypoints, and the benchmark maze environment has 18 destinations and 88 waypoints. The environments and the corresponding topological maps are depicted in Figs. 4 and 5. A virtual node, "U", is added to the topological map; it represents the unknown state that the agent reaches as a result of an action leading to a location not represented by the topological map. The performance of the DRL framework is measured based on its training and testing success rates. The TL module's performance is measured based on three parameters: jumpstart, time-to-threshold, and time-to-converge.

FIGURE 4. House layout and its topological map.

FIGURE 5. Hospital layout and its topological map.

A. DQL Path Planner Pre-Training

The DQL planner is pre-trained using the topological map of the environment. The system is trained and tested in two service environments to prove the path planner’s scalability and generality.

1) Selection of Hyper-Parameters

One important step in training any RL agent is tuning the hyper-parameters, which enables the agent to learn the task efficiently. The hyper-parameters are the learning rate $\alpha$, the discount factor $\gamma$, the exploration factor range $\epsilon$, and the replay memory batch size. The pace of the agent's learning is set by the learning rate, which is usually chosen between 0 and 1. A higher learning rate results in less learning time but at the cost of sub-optimal learning quality, whereas a lower learning rate results in optimal learning quality at the cost of increased learning time. Since the proposed RL agent's training for a service environment is carried out offline, learning quality is preferred over learning time; hence $\alpha$ is chosen as 0.001 to achieve stable, high-quality learning. The discount factor weighs future rewards; $\gamma$ is chosen as 0.95 to give importance to future rewards while learning. The other factors, the $\epsilon$ range and the batch size, are determined by performing single-goal training in the house layout.

The single-goal DQL network differs only in input layer size compared to the proposed DQL network: its input layer size $n = |S|$ equals the total number of nodes in the topological map. In the house layout, the RL agent is trained to reach two destinations, D1 and D8, for 500 episodes. Batch sizes of 32 and 16 were chosen, as the network is trained on numerical values. The $\epsilon$ range is given as (initial-min), a range of values decayed exponentially with the factor calculated from (5); the ranges chosen are (1-0.1) and (0.1-0.001). The evaluation parameter is the training success rate, which indicates the agent's convergence and is calculated as the ratio of successful training episodes to total training episodes. Table 1 shows the training success rates of all the training cases, which vary with batch size and $\epsilon$ range. The test case with batch size 32 and an $\epsilon$ range of (1-0.1) yielded the highest success rate (marked in red). The final hyper-parameter values are tabulated in Table 2.
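For reference, the selected values can be collected into a single configuration; the dictionary form below is ours, while the values are those reported in the text and Table 2:

HYPER_PARAMS = {
    "alpha": 0.001,           # learning rate (Adam)
    "gamma": 0.95,            # discount factor
    "epsilon_initial": 1.0,   # start of the selected (1-0.1) epsilon range
    "epsilon_min": 0.1,       # end of the epsilon range
    "batch_size": 32,         # replay batch size
}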

TABLE 1. $\epsilon$-Range and Batch Size Success Rate

TABLE 2. Hyper-Parameters

2) House Layout Training

The RL agent is trained to learn paths to all the destinations in the house layout based on the topological map. The DQL architecture shown in Fig. 3 is used in the proposed system. The architecture has two hidden layers whose number of neurons is up to twice the input layer size; this is determined by experimenting with different configurations. Table 3 shows the three architectures considered to determine the ideal hidden layer size and their success rates after training for 2500 episodes. The difference in success rate is marginal, which may be because the agent is trained on numerical data with few features. To gain more insight, the success rate is plotted in Fig. 6; this plot can be considered the learning curve of the agent. From the plot, it can be concluded that case-2 performs better than the other cases. The case-2 configuration is 48 neurons with two layers. Based on (2), the input layer size for the house layout is $n = 23$ (15 states plus 8 goal states), so the minimum hidden layer size is 1.5 times the input layer size, i.e., 34. This is consistent with case-2, validating the architecture configuration proposed in IV-A5. The rewards used for the training are shown in (8).\begin{align*} r(s, a) = \begin{cases} 100, & s = g \\ -1, & s \in S \setminus G \\ -10, & s \text { is a previously visited location} \\ -100, & s = U\end{cases} \tag{8}\end{align*}

TABLE 3. Success Rate of Different DQL Architecture Configurations

FIGURE 6. Different DQN architectures training success rate.

The episodic reward collected during the case-2 training process is shown in Fig. 7, along with the zero-reward and threshold-reward lines. The threshold reward is calculated from the difference between the goal reward and the critical path, where the critical path is the longest (in node count) of the efficient paths to the destinations. Analyzing the reward plot reveals two factors for estimating the minimum number of training episodes required to train the agent effectively. First, the minimum number of episodes is indicated by the reduction in the frequency of negative rewards relative to the zero-reward line. Second, training efficiency is judged by comparing the episodic reward against the threshold-reward line. The reduction of negative-reward frequency and the increased frequency of episodic rewards around the threshold are achieved within 2200 episodes; hence, 2200 episodes are considered the minimum number of training episodes to effectively train the RL agent in the house layout.
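One plausible reading of this threshold (our interpretation, not stated explicitly in the paper), taking the goal reward $r_{1} = 100$ and step reward $r_{2} = -1$ from (8) and writing $L_{crit}$ for the node count of the critical path, is\begin{equation*} r_{threshold} = r_{1} - L_{crit}\,\lvert r_{2} \rvert = 100 - L_{crit}.\end{equation*}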

FIGURE 7. House layout case-2 episodic reward.

3) Hospital Layout Training

To verify the scalability of the proposed DQL path planner, it is used to train in the hospital layout, another and larger service environment. The DQL architecture is the same case-2 type used for the house layout. The input layer size for the hospital layout is $n = 81$ (53 states plus 28 goal states); two hidden layers of 128 neurons and an output layer of size $m = 4$ are chosen. The training hyper-parameters are the same as in Table 2. The episodic reward during training is plotted in Fig. 8. The total number of training episodes required is over 10,000, so the full plot on the left is too compressed to analyze; a scaled version showing the last 600 episodes of training data is depicted on the right. The reduction of negative-reward frequency and the frequency of episodic rewards reaching the threshold are achieved within 11850 episodes; hence, the nearest rounded-up number, 12000, is considered the minimum number of training episodes to effectively train the RL agent in the hospital layout.

FIGURE 8. Hospital layout episodic reward (left); hospital layout episodic reward, last 2000 episodes (right).

In both service environments, the $\epsilon$ decay exponent is calculated based on (5). For the hospital layout with 12000 episodes of training, the proposed exponent value is 0.9998. This choice is verified by training the agent in the hospital layout with various exponents close to the proposed value. Fig. 9 depicts the training success rate for three exponent values; from the plot, it can be concluded that learning with an exponent value of 0.9998 is more efficient than with the other values.
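As a numeric check of (5), with the $\epsilon$ range of Table 2 ($\epsilon_{initial} = 1$, $\epsilon_{min} = 0.1$) and 12000 training episodes, the per-episode decay base is\begin{equation*} d = \sqrt [12000]{\frac {0.1}{1}} = 0.1^{1/12000} \approx 0.9998,\end{equation*} which matches the proposed exponent value.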

FIGURE 9. Reasoning for exponent.

4) Generalized Parameters

The DQN architecture parameters are chosen based on the topological map properties, allowing the RL agent to be trained successfully in two different service environments. Table 4 tabulates all the model parameters, training success rates, and generalized parameters for both environment models. The generalized parameter for each DQN parameter is derived as a function of the topological map properties.

TABLE 4. Proposed DQL Model Parameters and Generalized Parameters

5) Testing

The proposed DQL path planner is successfully trained in two service environments. To show that the model is generic and scalable, it is essential to test its ability to plan paths to various destinations given a source. The test cases use every destination node in the topological map as both a source and a destination for path planning. The model is evaluated based on the testing success rate, calculated as the ratio of successful test cases to the total number of test cases. The testing success rate is more than 98% for both environments.

To further verify the correctness of the generalized parameters in Table 4, the proposed RL agent is tested using a 2D pathfinding maze benchmark map from [47]. These benchmark maps have been used by researchers worldwide for evaluating path-planning algorithms [48], [49], [50], [51]. Fig. 10 depicts the benchmark maze environment with the paths and random destinations. The topological map for the benchmark maze is shown in Fig. 11, with 18 destinations and 88 waypoints, including the home and unknown nodes. There are 108 nodes in total, which is roughly 7 and 2 times larger than the house and hospital layouts, respectively. The topological map is too big to fit on a single page and is split into three parts. The DQN architecture used to train the RL agent for the maze environment is based on the generalized parameters tabulated in Table 4. The input layer size for the maze layout is $n = 126$ (108 states plus 18 goal states); two hidden layers of 190 neurons and an output layer of size $m = 4$ are chosen. The training hyper-parameters are the same as in Table 2. The RL agent was trained for 56000 episodes with an $\epsilon$-decay factor of 0.999956, calculated based on (5). The episodic reward plot is shown in Fig. 12; the episodic reward improved steadily as training approached 56000 episodes. The testing success rate is 98.3% for the maze layout, on par with the hospital and house layouts. Based on the generalized parameters in Table 4, it can be concluded that the proposed DQL system is general and scalable.

FIGURE 10. 2D pathfinding benchmark maze environment with paths.

FIGURE 11. Benchmark maze environment topological map (in four parts).

FIGURE 12. Benchmark maze layout episodic reward (left); benchmark maze layout episodic reward, last 1200 episodes (right).

In [48], a modified Q-learning approach is proposed to overcome the increased convergence time of the traditional Q-learning approach. The authors used the benchmark maze environment to compare their approach with traditional Q-learning and other path-planning approaches. Their system was trained from a fixed starting point to a fixed ending point over 30 runs, and the averaged comparison metrics were evaluated. The starting point considered is the same as the Home location, and the ending point is D18. Even though the RL agent in this work is trained with a topological map containing 19 different destinations, the final path it generates from Home to D18, shown in Fig. 13, is the same as that of the modified Q-learning approach [48].

FIGURE 13. Path generated from Home to D18 in the benchmark maze environment.

The house layout DQL agent is also tested on a Turtlebot3 mobile robot using the Robot Operating System (ROS) framework and the Gazebo simulator. Fig. 14 depicts the two paths, Home to D5 and Home to D4, autonomously generated by the path planner agent.

FIGURE 14. Simulator paths (in red) for Home to D5 (left) and Home to D4 (right).

B. Transfer Learning and Testing

The proposed $\beta$-decay transfer learning algorithm is tested by adding an obstacle to the house layout in the Gazebo environment. The obstacle is placed between the V1 and V2 nodes of the house layout. Fig. 15 depicts the house layout with the cuboid obstacle in the Gazebo simulator environment and its topological map. The number of nodes in the topological map increases, and all the new nodes are highlighted in red with dotted edges for their connections. This changes the environment's configuration, making the pre-trained DQL agent inefficient in the new environment.

FIGURE 15. Obstacle added between V1 and V2 in the house layout and its topological map.

The $\beta$-decay transfer learning algorithm transfers the knowledge of the source path planner (the pre-trained agent) to the target path planner (the new agent). To validate the proposed initialization of $\beta = 0.78$, calculated from (7), and its decay exponent of 0.9984 for 1300 episodes, calculated from (5) with $\epsilon$ replaced by $\beta$, the DQL agent is trained using three different TL strategies: without $\beta$, constant $\beta$, and the proposed $\beta$-decay. Table 5 shows the training success rate of the three TL algorithms after 500 training episodes; the proposed $\beta$-decay TL algorithm outperforms the other two.

TABLE 5. Testing Success Rate Comparison of TL Algorithms


The DQL agent is trained with and without transfer learning to quantitatively evaluate the proposed transfer learning approach. The evaluation metrics are jumpstart, time-to-threshold, and time-to-converge. To evaluate the first two metrics, the training success rate for each training episode is analyzed; the success rate plot is shown in Fig. 16. The overall view of the plot shows that training with transfer outperforms non-transfer training. Jumpstart is the agent's success rate in the first few training episodes, calculated as the average success rate of the first 100 training episodes. The time-to-threshold metric evaluates the agent's learning performance by the time it takes to reach a threshold performance. The threshold value is chosen based on the agent's training in the old environment; as the new environment differs from the old one by only a small incremental change, this choice is reasonable. The threshold success rate value after 1800 training episodes is 42.5%. The time-to-converge metric indicates the minimum number of training episodes required to learn the environment for path planning, evaluated based on the testing success rate. Testing is performed by checking the agent's ability to plan paths using the destination locations as both sources and destinations. Table 6 shows the success rate comparison between transfer and non-transfer learning. The transfer learning approach attains a 100% success rate after 1300 episodes, whereas non-transfer learning attains a 78.5% success rate after 2500 episodes.
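For illustration, the first two metrics can be computed directly from the per-episode success record; this small helper (names are ours, not the paper's) assumes success_flags is a list of 0/1 episode outcomes and success_rate_curve a running success-rate series:

def jumpstart(success_flags, window=100):
    """Average success rate over the first `window` training episodes."""
    return sum(success_flags[:window]) / window

def time_to_threshold(success_rate_curve, threshold=0.425):
    """First episode at which the running success rate reaches the 42.5% threshold."""
    for episode, rate in enumerate(success_rate_curve, start=1):
        if rate >= threshold:
            return episode
    return None   # threshold never reached within the recorded episodes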

TABLE 6. Testing Success Rate of Transfer and Non-Transfer Training

FIGURE 16. Training success rate for transfer vs. non-transfer training.

Table 7 compares all the evaluation metrics between the transfer and non-transfer DQL agents, along with performance improvement factors. On all the metrics, the DQL agent with the proposed $\beta$-decay transfer learning outperforms the non-transfer agent; the transfer agent converges more than twice as fast. The DQL agent with the proposed system is tested in the Gazebo simulator on two paths, Home to D5 and Home to D4. Fig. 17 depicts the paths planned by the path planner, successfully avoiding the obstacle.

TABLE 7. Transfer Learning Evaluation Metrics Comparison

FIGURE 17. $\beta$-decay agent simulator paths (in red) for Home to D5 (left) and D4 (right) in the new environment (with obstacle).

SECTION VI.

Conclusion and Future Scope

A generic and scalable RL-based mobile service robot path planner with transfer learning is proposed in this work. Once a service robot is deployed, it is expected to work for its whole lifetime. A typical path planner considers only the initial configuration of the environment to plan the path, but a service environment is prone to small, gradual changes over time. A deep Q-learning-based path planner with novel $\beta$-decay transfer learning for lifelong learning is proposed and implemented. The DQL path planner is initially pre-trained in a service environment using its topological map. If there is any change in the environment's configuration, the proposed TL algorithm trains a new DQL agent by efficiently transferring the knowledge of the old agent. The DQL agent is trained in two service environments, the house and hospital layouts, with testing success rates of 100% and 98%, respectively. The hospital environment is almost three times larger than the house environment in terms of the number of nodes and destinations, which shows that the proposed path planner is scalable. The DQL agent's generality is established by deriving all its architecture parameters and training hyper-parameters as values that are either fixed or dependent only on the environment's topological map. The path planner is successfully tested in the ROS framework with the Gazebo simulator using the Turtlebot3 mobile robot.

The lifelong learning ability is tested by adding an obstacle in the house environment. The proposed TL algorithm uses the Experience vs. Exploration vs. Exploitation (EEE) logic based on the $\beta$-decay factor. The correctness of the proposed TL algorithm's initial $\beta$ value and decay factor is evaluated by comparing it with a TL variant without $\beta$ and a constant-$\beta$ TL variant; it is found to be better than both. The efficiency of the TL algorithm is evaluated based on the metrics jumpstart, time-to-threshold, and time-to-converge, in comparison with the non-TL agent. The proposed TL method gives a 20x higher jumpstart testing success rate, reaches the threshold more than three times faster, and converges more than twice as fast.

Implementing and testing the proposed system on a real service robot is one future extension of this work. Methods to improve the efficiency of the TL algorithm can also be explored. The $\beta$ factor is a stochastic factor initialized based on the total number of similar states in the two environments, where similarity is identified based on the state names obtained from the topological map. Identifying similarity based on state variables instead could increase the efficiency of the TL algorithm.
