1. Introduction
Overlay networks are virtual or logical networks built on top of a physical network (called the underlay network). They provide flexible and dynamic traffic routing between nodes that are connected not by physical links but by virtual or logical links, each of which corresponds to a path in the underlying network. These virtual links can be established using different technologies, such as Generic Routing Encapsulation (GRE), Virtual Private Networks (VPNs), or network virtualization. The underlay topology is managed by a third party, typically one or more network operators. One particular example of an overlay network is the Software-Defined Wide Area Network (SD-WAN) [1], which fully utilizes the bandwidth of all available transport networks serving one location, such as a Multiprotocol Label Switching (MPLS) fabric, the Internet, and 5G, treating each of them as an overlay link.
In the context of overlay networks, routing traffic across the overlay links becomes challenging, especially in multi-hop scenarios, since the underlay routing policies are unknown and can involve different protocols, such as Open Shortest Path First (OSPF), Border Gateway Protocol (BGP), and others. The absence of information about the underlay network topology and routing policies gives rise to Triangle Inequality Violations (TIVs) [2]: the delay between two nodes can exceed the total delay of a path through an intermediate node, so it is often possible to find a path relayed by cloud servers that has a much lower delay than the shortest path in the overlay topology. Classical routing protocols can be used to route in the overlay, such as Cisco’s Overlay Management Protocol (OMP), a control protocol that operates similarly to BGP. This class of protocols depends heavily on predefined metrics and does not handle multi-hop overlay networks.
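To make the triangle inequality violation mentioned above concrete, the following minimal sketch checks a small delay matrix for cases where relaying through a third node beats the direct overlay link. The function name `find_tivs`, the node labels, and the delay values are hypothetical and chosen only for illustration; they are not taken from [2] or from any measurement.

```python
from itertools import permutations


def find_tivs(delay):
    """Return (src, relay, dst, direct, relayed) for every triple of nodes
    where the relayed path src -> relay -> dst is faster than src -> dst."""
    violations = []
    for src, relay, dst in permutations(delay, 3):
        direct = delay[src][dst]
        relayed = delay[src][relay] + delay[relay][dst]
        if relayed < direct:
            violations.append((src, relay, dst, direct, relayed))
    return violations


# Hypothetical delays in milliseconds between three overlay nodes (assumed values).
delay_ms = {
    "A": {"B": 100, "C": 30},
    "B": {"A": 100, "C": 40},
    "C": {"A": 30, "B": 40},
}

for src, relay, dst, direct, relayed in find_tivs(delay_ms):
    print(f"{src}->{dst}: direct {direct} ms, via {relay} {relayed} ms")
```

With these assumed values, the direct link between A and B (100 ms) is slower than the two-hop path through C (70 ms), which is exactly the situation a single-hop, metric-based protocol would miss.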
To deal with the above limitations and challenges, a promising approach is to use Machine Learning (ML) methods, especially Reinforcement Learning (RL) [3], which enables an agent (typically a network device) to learn from its environment and to adapt its policy to meet dynamically changing demand. In the context of overlay networks, the agent can exploit the information gathered from the network to overcome the lack of knowledge about the underlay network.