Introduction
Mobile robotics is one of the most popular domains among robotics researchers. An essential subsystem of a mobile robot is its navigation system, which perceives the environment and enables the robot to move autonomously while performing its tasks [1]. Mobile robots are used in a vast range of applications, such as surveillance, industrial automation, museum guidance, elderly care, hospital care, home assistance, and restaurant service. Mobile robots are classified based on their form factor as wheeled, legged, flying, and so on [2]. Wheeled mobile robots, in particular, are effective and simple to use on flat indoor terrain.
A mobile service robot performs autonomous tasks in service environments like homes, restaurants, hospitals, etc. [3]. A service environment differs from other environments in a few key respects: a large floor area, moving obstacles (robots and humans) [4], susceptibility to changes over time, and the impracticality of augmenting the environment with extra sensors to support autonomous operation. A path planner is essential for an autonomous mobile robot to plan an efficient path between a given source and destination.
A path planner is only as good as its understanding of the environment. Initially, the main focus of the path-planning research community was to achieve optimal paths given different levels of environmental awareness, and these works are categorized based on the environmental information available to the planner [5]. One of the main concerns for a service robot’s path planner is that environmental changes hinder its performance. A path planner uses the initial environment configuration to identify efficient paths, and this configuration is no longer valid once the environment changes. Hence, a path planner that can dynamically plan paths based on the latest configuration of the environment is necessary. This type of planner is classified as a dynamic path planner, which accounts for environmental changes throughout its lifetime.
A dynamic path planner continuously monitors environmental changes and updates itself to provide efficient paths. Using machine learning (ML) techniques for dynamic path planning has recently gained traction among researchers. In particular, Reinforcement Learning (RL) [6] is a class of ML techniques that uses the Markov Decision Process (MDP) to learn to interact with an environment. RL is attractive for path planning because the problem can be modeled intuitively as an MDP and because RL does not need massive labeled datasets for training. In most RL-based path-planning work, the agent’s state space consists of images or raw sensory information. While this is a reasonable way to define the state space, its shortcoming is a large state space, and the larger the state space, the longer the learning phase takes to converge.
Lifelong learning [7] keeps an artificial intelligent agent learning and evolving continuously in response to changes over a long period. Transfer Learning (TL) is an ML technique that uses previous experience to learn and perform well in a similar situation, and a TL algorithm can efficiently give the RL agent this lifelong learning ability. Given variations in the environment configuration, transfer learning can be used to achieve lifelong learning: the source states are the old environment’s topological information, and the target states are the altered environment’s topological information. Only the state space differs; all the other RL-agent parameters remain the same. TL approaches that consider only the state-space difference between the source and target tasks are scarce, so a novel transfer learning technique that handles exactly this case is proposed in this article.
This work aims to design and develop a Deep Reinforcement Learning (DRL) based path-planning framework. The proposed framework uses a Deep Q-Learning (DQL) algorithm to learn the initial paths based on the topological map of the environment, and a novel $\pmb \beta$-decay transfer learning algorithm to keep the agent up to date as the environment changes. The main contributions of this work are:
Design and development of a scalable DQL path planning framework for a mobile service robot.
Incorporating a $\pmb \beta$-decay TL algorithm to continuously evolve the RL agent based on environmental changes and achieve lifelong learning.
The RL agent’s scalability and the TL algorithm’s learning efficiency are tested and proved.
The remainder of the paper is organized as follows: a summary of the related work in path planning is given in Section II; the problem formulation is defined in Section III; Section IV describes the proposed RL and TL framework in detail; results and analysis of various test cases are presented in Section V; the article is concluded, and future directions are identified, in the last section.
Recent and Relevant Work
The path-planning problem can be divided into two classes, global and local path planning [8]. Global path planning is planning a path from a source to a destination given the environment map. Local path planning handles unforeseen scenarios (obstacles) that occur while traversing the global path. Rapid development in artificial intelligence and machine learning has led to their use in solving path-planning problems [8], [9], [10]. Artificial intelligence techniques like fuzzy logic [11], neural networks [12], and hybrid solutions like neuro-fuzzy inference systems [13] are used for local path planning, as they are very efficient in dynamic problem-solving. Evolutionary techniques like particle swarm optimization [14] and genetic algorithms [15] are used for global path planning to optimize the path length and planning time.
One of the recent advancements in path-planning research is the use of reinforcement learning-based techniques. Unlike other machine learning algorithms, an RL agent can be trained with a minimal dataset, which is one primary reason to use RL in the path-planning domain. Q-Learning (QL) and Deep Q-Learning (DQL) are two of the most used RL algorithms for path planning. In QL, the Q-value is calculated based on the Bellman equation; in [16] and [17], a QL agent is used for planning the path. The drawback of Q-learning is scalability: as the environment size increases, the memory complexity also increases. In DQL, the Q-value is predicted using a neural network, which makes the system scalable. In [18], [19], and [20], planning a path in a grid-world environment using DQL is presented; the obstacle locations are known, and the RL agent learns to avoid all the static obstacles and plan the path efficiently. DQL-based algorithms for dynamic-environment path planning are proposed in [21] and [22], where a dynamic environment is considered to have moving obstacles whose locations are unknown. Another approach to exploring an unknown environment is to train the RL agent on sensor values [23], [24], [25].
One critical factor in any machine-learning algorithm is convergence time, i.e., the time the agent takes to learn the task. There have been considerable efforts to reduce the convergence time of DRL-based path-planning algorithms. In [26], convergence time is reduced by pre-training the RL agent in a 2D simulator environment and later using this experience to train in a real-time 3D environment. A Q-learning system with general information about the goal and current states is proposed in [27]; this knowledge is used during training to reduce the training time significantly. In [28], the authors combine RL with particle swarm optimization to increase the agent’s convergence rate.
Researchers have made numerous attempts to achieve lifelong learning in various domains. In [29], a notion similar to lifelong learning called concept drift is described, and a solution is proposed that uses previous experience to learn new, similar concepts. Correspondence learning is another attempt to reuse old experience in a new task; in [30], it uses previously learned skills to solve new tasks in an Atari ping-pong environment. Transfer learning (TL) is a technique applied to evolve machine learning models toward lifelong learning by transferring source-task knowledge to a target model. In [31], RL with TL for tracking in stationary environments is proposed. Key points to consider when transferring knowledge are the similarity of the source and target tasks, the mapping between source and target states, and the type of knowledge to be transferred [32], [33]. When applying TL between source and target tasks with different states and actions, the challenge lies in mapping the source states to the target states. In [34] and [35], the authors propose a simple approach in which subject experts provide the mappings manually. While this is a simple solution, obtaining expert mappings can be impractical in many applications. To address this shortcoming, inter-task mapping can either be made unnecessary or learned automatically. If no explicit mapping is required, the agent attempts to learn abstractions of the MDP that are invariant even though actions and state variables change [36], [37]. Another approach is to learn the inter-task mapping automatically [38], [39] using novel mapping-learning methods. Applying heuristics in transfer learning has been shown to accelerate target-task learning considerably. In [40], heuristics are obtained by exploring the environment; obtaining heuristics as different cases based on source-task learning and using them in target learning is proposed in [41], [42].
Problem Definition
Consider a wheeled mobile robot performing autonomous tasks in an indoor service environment. To perform tasks autonomously, the robot should be capable of navigating from one location to another. A typical service robot is required to work in the environment for a long period, during which the environment configuration can change. The environment configuration is the location of all the static obstacles, doorways, pathways, etc. To cope with environmental changes, the mobile robot should be able to perform lifelong learning, which enables it to plan paths based on the current environment configuration rather than the old one. This can be achieved using reinforcement learning.
The RL agent is the mobile robot, and the environment is the service environment. The proposed path planner is based on the Deep Q-Learning algorithm. The DQL agent is pre-trained on the topological map of the service environment: a high-level connectivity tree of all the critical locations in the environment. The topological map can be generated manually or automatically, as proposed in [43]; the techniques to generate a topological map are out of the scope of this work, and a few of them are explored in our previous works [43], [44], [45]. Fig. 1 depicts a sample environment with its paths and topological map. The Home node is the robot’s charging point, the “Vx” nodes are waypoints, and the “Dx” nodes are destination locations. The map is represented as a ternary tree: each node is a location, all the destination locations are leaf nodes, and each node can have at most three branches (left, right, forward). The branch encodes the direction taken from the parent node to reach the child node; for example, if a parent node has a left child, that child is reached by turning left at the parent.
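To make the map representation concrete, the following minimal Python sketch (illustrative only; the node class, node names, and helper are not from the original implementation) encodes a node of the ternary tree and recovers the direction sequence from a parent node to a destination:

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class TopoNode:
        name: str
        # Each branch key encodes the direction taken from this node to reach the child.
        children: Dict[str, "TopoNode"] = field(default_factory=dict)  # 'left', 'right', 'forward'

    def find_directions(node: TopoNode, goal: str, path: Optional[List[str]] = None) -> Optional[List[str]]:
        # Depth-first search returning the direction sequence from `node` to `goal`.
        path = path or []
        if node.name == goal:
            return path
        for direction, child in node.children.items():
            result = find_directions(child, goal, path + [direction])
            if result is not None:
                return result
        return None

    # Example: Home --forward--> V1 --left--> D1
    home = TopoNode("Home", {"forward": TopoNode("V1", {"left": TopoNode("D1")})})
    print(find_directions(home, "D1"))  # ['forward', 'left']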
Lifelong learning can be achieved by applying transfer learning to the proposed DQL agent. In a practical scenario, the changes in the environment are gradual; for example, furniture can be added or removed, blocking a current path. Such changes alter the environment configuration, and the topological map can be updated based on the new configuration [46]; the process of obtaining the new topological map is out of the scope of this article. Consider a permanent obstacle placed between the “Home” and “V1” nodes in the sample environment shown in Fig. 1; this affects the path between these two nodes. The modified environment and its topological map are depicted in Fig. 2, with dotted lines indicating the changed path. Given this new topological map, the RL agent should re-learn the environment. Training the RL agent from scratch on the new configuration would require more time than using the TL technique to jump-start the training process, which is achieved by efficiently transferring the old knowledge while training the RL agent.
Design of the Framework
The proposed system is a path planner for a mobile service robot with lifelong learning ability. The overall system consists of two modules: the DRL module and the TL module. The DRL module trains an RL agent on the initial environment configuration to plan paths efficiently. The TL module re-trains the RL agent whenever the environment configuration changes.
A. Deep Reinforcement Learning Framework
RL is a class of ML algorithms in which an agent learns a task by interacting with the environment through its actions. At each step, the agent in its current state selects an action, receives a reward from the environment, and transitions to the next state. The components of the proposed RL formulation are defined below.
1) States - S
States are all the possible configurations of an RL agent; in the mobile robot domain, a state is the location of the mobile robot. The state space S is the set of all nodes in the topological map (the home, waypoint, and destination nodes), together with an unknown state U representing an invalid location.
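Since the input layer size later defined in (2) is |S| + |G|, one natural state encoding, assumed here purely for illustration, is a one-hot vector of the current node concatenated with a one-hot vector of the commanded goal:

    from typing import List

    def encode_state(current: str, goal: str, states: List[str], goals: List[str]) -> List[float]:
        # One-hot of the current node followed by a one-hot of the goal node;
        # the resulting vector length is |S| + |G|, matching the input layer size in (2).
        return [1.0 if s == current else 0.0 for s in states] + \
               [1.0 if g == goal else 0.0 for g in goals]

    states = ["Home", "V1", "V2", "D1", "D2"]   # illustrative node names
    goals = ["D1", "D2"]
    print(encode_state("V1", "D2", states, goals))  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]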
2) Actions - A
Actions are the interactions of the agent with the environment; the result of an interaction is the agent’s transition from one state to another. An action is defined as the direction taken from the current node, so the action space A consists of the three branch directions of the topological map: left, right, and forward.
3) Reward - R
The reward system is the key ingredient of RL training that can make or break an agent. The reward space R assigns a reward r(s, a) to the state s reached by taking action a, as defined in (1), where g is the commanded goal, G is the set of destination nodes, and U is the unknown state:\begin{align*} r(s, a) = \begin{cases} \displaystyle r_{1}, & s = g \\ \displaystyle r_{2}, & s = \{x \mid x \notin (S \cap G) \} \\ \displaystyle r_{3}, & s \text { is a previously visited location} \\ \displaystyle r_{4}, & s = U\end{cases} \tag{1}\end{align*}
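A direct Python reading of (1) is sketched below, using the numeric values later given in (8) as illustrative defaults; the order in which the cases are checked resolves overlaps between them and is an assumption:

    def reward(state, goal, visited, unknown_state="U",
               r1=100.0, r2=-1.0, r3=-10.0, r4=-100.0):
        # r4: unknown/invalid location, r1: goal reached,
        # r3: revisiting an already visited node, r2: any other non-goal node.
        if state == unknown_state:
            return r4
        if state == goal:
            return r1
        if state in visited:
            return r3
        return r2

    print(reward("V2", goal="D1", visited={"Home", "V1"}))  # -1.0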
4) Q-Value - Q(s, a)
The Q-value determines the quality of an action given the current state. The DQN neural network is trained to predict the Q-values of all possible actions for a given state, and the action with the highest predicted Q-value is selected during exploitation.
5) DQN Architecture
The DQN architecture consists of an input layer, two hidden layers, and an output layer, trained with the Adam optimizer, as shown in Fig. 3. The neural network configuration depends on the states and actions of the RL agent, as shown in (2) and (3): the number of input neurons n is the sum of the number of states and destinations, and the number of output neurons m is the number of actions.\begin{align*} n &= \lvert S \rvert + \lvert G \rvert \tag{2}\\ m &= \lvert A \rvert \tag{3}\end{align*}
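The paper does not name a deep-learning framework; the sketch below builds the network of Fig. 3 in PyTorch under that assumption, with the hidden-layer width set to twice the input size (as used later for the house layout) and illustrative layer sizes and learning rate:

    import torch
    import torch.nn as nn

    def build_dqn(num_states: int, num_goals: int, num_actions: int) -> nn.Sequential:
        n = num_states + num_goals   # input neurons, eq. (2)
        m = num_actions              # output neurons, eq. (3)
        return nn.Sequential(
            nn.Linear(n, 2 * n), nn.ReLU(),
            nn.Linear(2 * n, 2 * n), nn.ReLU(),
            nn.Linear(2 * n, m),     # one Q-value per action
        )

    dqn = build_dqn(num_states=20, num_goals=6, num_actions=3)   # illustrative sizes
    optimizer = torch.optim.Adam(dqn.parameters(), lr=1e-3)      # assumed learning rate
    q_values = dqn(torch.rand(1, 26))                            # 26 = 20 + 6 input features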
6) Training
The neural network is trained using an experience replay buffer. An experience e = (s, a, s', r) stores the current state, the action taken, the next state, and the reward received. Mini-batches sampled from the buffer are used to minimize the loss in (4); the exploration rate ε is decayed every episode according to (5); and an episode terminates when one of the conditions in (6) is met.\begin{align*} Loss &= \left (Q^{*}(s, a) - \left (r + \gamma \max _{a'} Q' (s', a')\right )\right )^{2} \Big \vert \; e = (s, a, s', r) \tag{4}\\ \epsilon _{new} &= \epsilon _{old} \cdot exp^{\,current\_episode}, \quad \text {where } exp = \sqrt [total\_episodes]{\frac {\epsilon _{min}}{\epsilon _{initial}}} \tag{5}\\ s &= \begin{cases} \displaystyle \text {terminal\_state}, & s = g \\ \displaystyle \text {terminal\_state}, & s = U \\ \displaystyle \text {terminal\_state}, & \text {total visited states} = \lvert S \rvert \\ \displaystyle \text {non-terminal\_state}, & \text {otherwise} \end{cases} \tag{6}\end{align*}
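The ε schedule in (5) is read here as a per-episode multiplicative decay whose factor is the total_episodes-th root of ε_min/ε_initial, so that ε reaches ε_min at the final episode; the numeric values in this sketch are assumptions:

    def epsilon_decay_factor(eps_initial: float, eps_min: float, total_episodes: int) -> float:
        # total_episodes-th root of eps_min / eps_initial, as in (5)
        return (eps_min / eps_initial) ** (1.0 / total_episodes)

    eps_initial, eps_min, total_episodes = 1.0, 0.05, 2500        # assumed hyper-parameters
    decay = epsilon_decay_factor(eps_initial, eps_min, total_episodes)
    epsilon = eps_initial
    for episode in range(total_episodes):
        # ... run one episode, store experiences, sample the replay buffer,
        # and minimise the loss in (4) on the sampled mini-batch ...
        epsilon = max(eps_min, epsilon * decay)
    print(round(epsilon, 3))  # approximately 0.05 after the last episode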
B. $\pmb \beta$-Decay Lifelong Learning Algorithm
The proposed $\pmb \beta$-decay TL algorithm enables the pre-trained DQL agent (the source agent) to learn a changed environment configuration (the target task) without starting from scratch. The knowledge transferred is the source agent’s learned policy, which guides action selection while the target agent is trained on the new topological map.
The proposed algorithm balances Experience vs. Exploration vs. Exploitation (EEE), as outlined in Algorithm 1: the agent either reuses the action suggested by the source network (experience), takes a random action (exploration), or follows the action suggested by the target network being trained (exploitation). The reliance on the source knowledge is governed by the transfer parameter $\beta$, which decays over the training episodes so that the target agent gradually becomes independent of the source.
The transfer parameter $\beta$ is defined in (7) as the fraction of target states that are also present in the source state space. The higher the $\beta$ value, the greater the similarity between the source and target environments, and the more the source experience can be exploited during target training.\begin{equation*} \beta = \frac {|S_{target} \cap S_{source}|}{|S_{target}|} \tag{7}\end{equation*}
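Equation (7) can be computed directly from the node sets of the two topological maps; the node names below are illustrative:

    def transfer_ratio(source_states: set, target_states: set) -> float:
        # Fraction of target states already present in the source task, eq. (7)
        return len(target_states & source_states) / len(target_states)

    source = {"Home", "V1", "V2", "D1", "D2"}   # old topological map
    target = {"Home", "V1", "V3", "D1", "D2"}   # updated topological map
    print(transfer_ratio(source, target))       # 0.8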
Algorithm 1 Experience vs. Exploration vs. Exploitation Pseudocode
Initialize the source and target DQN networks
Initialize parameters - β, ε, total_episodes
for each training episode do
    while the episode is not terminated do
        if rand() < β then
            a = action predicted by the source DQN (experience)
        else if rand() < ε then
            a = random action (exploration)
        else
            a = action predicted by the target DQN (exploitation)
        end if
        Execute a, store the experience, and train the target DQN
    end while
    Decay β and ε
end for
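A Python reading of the action selection in Algorithm 1 is sketched below; the branch order (experience first, then exploration, then exploitation) and the use of the two networks follow the EEE description above, and any detail the pseudocode leaves open is an assumption:

    import random
    import torch

    def select_action(state_vec: torch.Tensor, source_dqn, target_dqn,
                      beta: float, epsilon: float, num_actions: int) -> int:
        if random.random() < beta:              # experience: reuse the source policy
            with torch.no_grad():
                return int(torch.argmax(source_dqn(state_vec)))
        if random.random() < epsilon:           # exploration: random action
            return random.randrange(num_actions)
        with torch.no_grad():                   # exploitation: current target policy
            return int(torch.argmax(target_dqn(state_vec)))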
Results and Discussions
The DRL path planner is implemented in two stages: pre-training and lifelong learning. The pre-training stage trains the RL agent to plan efficient paths based on the topological map. The lifelong learning stage uses the $\pmb \beta$-decay TL algorithm to re-train the agent whenever the environment configuration changes.
A. DQL Path Planner Pre-Training
The DQL planner is pre-trained using the topological map of the environment. The system is trained and tested in two service environments to prove the path planner’s scalability and generality.
1) Selection of Hyper-Parameters
One important step in training any RL agent is tuning the hyper-parameters, which enables the agent to learn the task efficiently. The hyper-parameters of the proposed DQL network include the discount factor γ, the initial and minimum exploration rates (ε_initial, ε_min) used in (5), the learning rate, the replay buffer and batch sizes, and the total number of training episodes.
A single-goal DQL network differs from the proposed DQL network only in the input layer size: since the goal is fixed, it does not need to be encoded in the input, and the input layer size reduces to |S|.
2) House Layout Training
The RL agent is trained to learn paths to all the destinations in the house layout based on its topological map. The DQL architecture shown in Fig. 3 is used, with two hidden layers whose width is twice the input layer size; this was determined by experimenting with different configurations. Table 3 shows the three architectures considered for the hidden layers and their success rates after training for 2500 episodes. The difference in success rate is marginal, which may be because the agent is trained on numerical data with few features. To gain more insight, the success rate is plotted in Fig. 6 and can be read as the learning curve of the agent. From the plot, it can be concluded that case-2 (48 neurons with two layers) performs better than the other cases. Based on (2), the input layer size n for the house layout is computed from the number of nodes and destinations in its topological map, and the reward values used for training are given in (8).\begin{align*} r(s, a) = \begin{cases} \displaystyle 100, & s = g \\ \displaystyle -1, & s = \{x \mid x \notin (S \cap G) \} \\ \displaystyle -10, & s \text { is a previously visited location} \\ \displaystyle -100, & s = U\end{cases} \tag{8}\end{align*}
The episodic reward collected during the case-2 training process is shown in Fig. 7, along with the zero and threshold reward lines. The threshold reward is calculated from the difference between the goal reward and the cost of the critical path, where the critical path is the longest path (in terms of node count) among the efficient paths to the destinations. Analyzing the reward plot reveals two factors for estimating the minimum number of training episodes required to train the agent effectively: first, the reduction in the frequency of negative rewards relative to the zero-reward line; second, the frequency with which the episodic reward exceeds the threshold-reward line. Both are achieved within 2200 episodes, so 2200 episodes are considered the minimum required to effectively train the RL agent in the house layout.
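Under the reward values of (8), the threshold reward is read here as the goal reward reduced by the per-step penalty accumulated along the critical path; the critical-path length below is illustrative, not the value used in the experiments:

    goal_reward = 100          # r1 in (8)
    step_penalty = -1          # r2 in (8)
    critical_path_length = 7   # assumed node count of the longest efficient path
    threshold_reward = goal_reward + step_penalty * critical_path_length
    print(threshold_reward)    # 93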
3) Hospital Layout Training
To verify the scalability of the proposed DQL path planner, it is also trained on the hospital layout, another service environment that is larger than the house layout. The DQL architecture is the same as the case-2 configuration used for the house layout, and the input layer size n for the hospital layout is again calculated using (2) from its topological map.
Hospital layout episodic reward (left); hospital layout episodic reward for the last 2000 episodes (right).
In both service environments, the proposed DQL path planner is trained successfully using the same architecture, reward structure, and training procedure.
4) Generalized Parameters
The DQN architecture parameters are chosen based on the topological map properties, enabling the RL agent to be trained successfully in two different service environments. Table 4 tabulates the model parameters, the training success rate, and the generalized parameters for both environments. Each generalized parameter is expressed as a function of the topological map properties.
5) Testing
The proposed DQL path planner is successfully trained in two service environments. To prove that the model is generic and scalable, it is essential to test its ability to plan paths to various destinations from a given source. The test cases use every destination node in the topological map as both a source and a destination for path planning. The model is evaluated using the testing success rate, calculated as the ratio of successful test cases to the total number of test cases; the testing success rate is above 98% for both environments.
To further verify the correctness of the generalized parameters mentioned in Table 4, the proposed RL agent is tested using a 2D pathfinding maze benchmark map from [47]. These benchmark maps have been used by researchers worldwide for evaluating path-planning algorithms [48], [49], [50], [51]. Fig. 10 depicts the benchmark maze environment with the paths and random destinations. The topological map for the benchmark maze is shown in Fig. 11, with 18 destinations and 88 waypoints along with the home and unknown nodes, giving 108 nodes in total, roughly seven and two times larger than the house and hospital layouts, respectively. The topological map is too big to fit on a single page and is split into three parts. The DQN architecture used to train the RL agent for the maze environment is based on the generalized parameters tabulated in Table 4, and the input layer size n for the maze layout is calculated using (2) from its topological map.
Benchmark maze layout episodic reward (left); benchmark maze layout episodic reward for the last 1200 episodes (right).
In [48], a modified Q-learning approach is proposed to overcome the increased convergence time of traditional Q-learning. The authors used the benchmark maze environment to compare their approach with traditional Q-learning and other path-planning approaches: the system was trained between a fixed starting point and a fixed ending point over 30 runs, and the averaged comparison metrics were evaluated. The starting point corresponds to the Home location and the ending point to D18, whereas in this work the RL agent is trained on a topological map with 19 different destinations. The final path generated by the RL agent from Home to D18 is shown in Fig. 13, and it is the same as the path produced by the modified Q-learning approach [48].
The house-layout DQL agent is also tested on the Turtlebot3 mobile robot using the Robot Operating System (ROS) framework and the Gazebo simulator. Fig. 14 depicts two paths, Home to D5 and Home to D4, autonomously generated by the path-planner agent.
B. Transfer Learning and Testing
The proposed $\pmb \beta$-decay TL algorithm is evaluated on the modified house environment of Fig. 2, in which an obstacle blocks the path between the Home and V1 nodes. The source task is the DQL agent trained on the original topological map, and the target task is learning the paths in the updated topological map.
The $\beta$ value is computed from the source and target state spaces using (7) and decays as the target training progresses.
The $\beta$ and $\epsilon$ values drive the Experience vs. Exploration vs. Exploitation action selection of Algorithm 1 while the target agent is re-trained.
The DQL agent is trained with and without transfer learning to quantitatively evaluate the proposed transfer learning approach. The evaluation metrics are jumpstart, time-to-threshold, and time-to-convergence. The first two metrics are evaluated from the per-episode training success rate, plotted in Fig. 16; an overall view of the plot shows that training with transfer outperforms non-transfer training. Jumpstart is the agent’s success rate in the initial training episodes, calculated as the average success rate over the first 100 episodes. Time-to-threshold measures learning performance by the time taken to reach a threshold performance; the threshold is chosen from the agent’s training in the old environment, which is reasonable because the new environment differs from the old one only by a small incremental change. The threshold success rate after 1800 training episodes is 42.5%. Time-to-convergence indicates the minimum number of training episodes required to learn the environment for path planning and is evaluated from the testing success rate, where testing measures the agent’s ability to plan paths between all pairs of destination locations. Table 6 compares the success rates of transfer and non-transfer learning: the transfer learning approach attains a 100% success rate after 1300 episodes, whereas non-transfer learning attains a 78.5% success rate after 2500 episodes.
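The jumpstart and time-to-threshold metrics can be computed from the per-episode success-rate history; the sketch below assumes that history is available as a list of values between 0 and 1, with the 42.5% threshold taken from the text:

    from typing import List, Optional

    def jumpstart(success_rate: List[float], window: int = 100) -> float:
        # Average success rate over the first `window` training episodes
        return sum(success_rate[:window]) / window

    def time_to_threshold(success_rate: List[float], threshold: float = 0.425) -> Optional[int]:
        # First episode index at which the success rate reaches the threshold
        for episode, rate in enumerate(success_rate):
            if rate >= threshold:
                return episode
        return None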
Table 7 compares all the evaluation metrics between the transfer and non-transfer DQL agents along with performance improvement factors. In all the metrics, the DQL agent with the proposed $\pmb \beta$-decay TL algorithm outperforms the non-transfer agent.
Conclusion and Future Scope
A generic and scalable RL-based mobile service robot path planner with transfer learning is proposed in this work. Once a service robot is deployed, it is expected to operate for its whole lifetime, but a typical path planner only considers the initial configuration of the environment, while a service environment is prone to small, gradual changes over time. A deep Q-learning-based path planner with a novel $\pmb \beta$-decay transfer learning algorithm is therefore proposed to plan paths based on the latest environment configuration. The path planner is pre-trained on the topological map of the environment and is shown to be generic and scalable across two service environments and a benchmark maze.
The ability to perform lifelong learning is tested by adding an obstacle in the house environment. The proposed TL algorithm uses the Experience vs. Exploration vs. Exploitation (EEE) logic, driven by the $\beta$ and $\epsilon$ values, to re-train the agent efficiently on the changed environment, outperforming training from scratch in jumpstart, time-to-threshold, and time-to-convergence.
Implementing and testing the proposed system on a real-time service robot is one future extension of this work. Methods to further improve the efficiency of the proposed $\pmb \beta$-decay TL algorithm can also be explored.