I. Introduction
Integrating Unmanned Aerial Vehicles (UAVs) into cellular communication systems as User Equipments (UEs) is envisioned as an effective way to support the UAVs' mission-specific, rate-demanding data communication while improving the robustness of UAV navigation [1]. Cellular-connected UAV communication, however, poses new research challenges due to its significant differences from conventional communication systems. UAV-UEs typically operate at higher altitudes and with higher mobility than their ground counterparts, and face more stringent power and operational-time constraints [2]. In addition, existing cellular networks operating at sub-6 GHz are bandwidth-limited and perform poorly at high UAV altitudes, owing to the interference received from the down-tilted antennas of ground Base Stations (BSs) [3]. As a consequence, a UAV is very likely to experience radio link failures along its trajectory.

The UAVs' mobility and flexibility offer a degree of freedom to circumvent these issues. UAV path design that aims to satisfy a quality-of-connectivity constraint while minimizing travel time is known as communication-aware trajectory design. Several works have optimized the UAV trajectory under connectivity constraints using graph-based [4] or dynamic-programming-based [5] solutions. These traditional optimization solutions are, however, time-consuming and computationally complex. For this reason, Reinforcement Learning (RL) approaches have recently been investigated: unlike traditional optimization, RL makes decisions by interacting iteratively with the environment. A double Q-learning approach is proposed in [6] to solve a joint trajectory design and outage-time-constraint problem, while a Temporal Difference (TD) learning method is used in [7] to design the UAV-UE trajectory so as to minimize both the mission completion time and the disconnection duration.
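As background, the standard tabular Q-learning update underlying such approaches can be sketched as follows (generic notation, not the specific formulation of [6] or [7]):
\[
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big],
\]
where $s$ denotes the UAV state (e.g., its position), $a$ the action (e.g., the flight direction), $r$ the instantaneous reward (e.g., penalizing travel time and disconnection), $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor.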