Introduction
The fifth-generation cellular network (5G) is expected to be the key infrastructure provider for the next decade, by means of profound changes in both radio technologies and network architecture design [1]–[4]. Besides pure performance metrics like rate, reliability and the number of allowed connections, the scope of 5G also incorporates the transformation of the mobile network ecosystem and the accommodation of heterogeneous services on one infrastructure. In order to achieve such a goal, 5G will fully exploit recent advances in network virtualization and programmability [1], [2], and provide a novel technique named network slicing [1], [5]–[7]. Network slicing tries to get rid of the current, relatively monolithic architecture of the fourth-generation cellular networks (4G) and slices the whole network into different parts, each of which is tailored to meet specific service requirements. Therefore, network slicing emerges as a new business for operators and allows them to sell customized network slices to various tenants at different prices. In a word, network slicing could be offered as a service (NSaaS) [5]. NSaaS is quite similar to the mature business of "infrastructure as a service (IaaS)", whose benefits service providers like Amazon and Microsoft have enjoyed for a while. However, in order to provide better-performing and cost-efficient services, network slicing involves more challenging technical issues, even for real-time resource management on existing slices, since (a) for radio access networks, spectrum is a scarce resource and it is essential to guarantee spectrum efficiency (SE) [8], while for core networks, virtualized functionalities are limited by computing resources; (b) the service level agreements (SLAs) with slice tenants usually impose stringent requirements on the quality of experience (QoE) perceived by users [9]; and (c) the actual demand of each slice heavily depends on the request patterns of mobile users. Hence, in the 5G era, it is critical to investigate how to intelligently respond to the dynamics of service requests from mobile users [7], so as to obtain satisfactory QoE in each slice at the cost of acceptable spectrum or computing resources [4]. There have been several works on resource management for network slicing, particularly in specific scenarios like edge computing [10] and the Internet of things [11]. However, it is still appealing to discuss an approach for generalized scenarios. In that regard, [12] proposes to adopt a genetic algorithm as an evolutionary means for inter-slice resource management. However, [12] does not reflect the explicit relationship that one slice might require more resources due to its more stringent SLA.
On the other hand, partially inspired by the psychology of human learning, the learning agent in a reinforcement learning (RL) algorithm focuses on how to interact with the environment (represented by states) by trying alternative actions and reinforcing the tendency to take actions that produce more rewarding consequences [13]. Besides, reinforcement learning also embraces the theory of optimal control and adopts ideas like value functions and dynamic programming. However, reinforcement learning faces difficulties in dealing with large state spaces, since it is challenging to traverse every state and obtain a value function or model for every state-action pair in a direct and explicit manner. Hence, benefiting from the advances in graphics processing units (GPUs) and the reduced concern over computing power, some researchers have proposed to sample only a fraction of the states and further apply neural networks (NN) to train a sufficiently accurate value function or model. Following this idea, Google DeepMind has pioneered the combination of NN with one typical RL algorithm (i.e., $\mathcal{Q}$-learning) and proposed deep $\mathcal{Q}$-learning (DQL), laying the foundation of deep reinforcement learning (DRL) [15].
The well-known success of AlphaGo [14] and subsequent encouraging results of applying DRL to resource allocation issues in specific fields like power control [16], green communications [17], cloud radio access networks [18], and mobile edge computing and caching [19]–[21] have aroused research interest in applying DRL to the field of network slicing. However, given the challenging technical issues in resource management on existing slices, it is critical to carefully investigate the performance of applying DRL in the following aspects:
The basic concern is whether or not the application of DRL is feasible. More specifically, does DRL produce satisfactory QoE results while consuming acceptable network resources (e.g., spectrum)?
The research community has proposed some schemes for resource management in network slicing scenarios. For example, resource management could be conducted either by following a meticulously designed prediction algorithm, or by equally dividing the available resources among the slices. The former represents a reasonable option, while the latter saves a lot of computational cost. Hence, a comparison between DRL and these schemes is also necessary.
In this paper, we strive to address these issues.
The remainder of the paper is organized as follows. Section II starts with the fundamentals of RL and talks about the motivation to evolve towards DRL from RL. As the main part of the paper, Section III addresses two resource management issues in network slicing and highlights the advantages of DRL by extensive simulation analyses. Section IV concludes the paper and points out some research directions to apply DRL in a general manner.
From Reinforcement Learning to Deep Reinforcement Learning
In this section, we give a brief introduction to RL and, more specifically, $\mathcal{Q}$-learning, and then explain the motivation to evolve from RL towards DRL.
A. Reinforcement Learning
RL learns how to interact with the environment to achieve the maximum cumulative return (or average return), and has been successfully applied to fields like robot control, self-driving, and chess playing for years. Mathematically, RL follows the typical concept of the Markov decision process (MDP), which is a generalized framework for modeling decision-making problems in cases where the outcome is partially random and partially affected by the applied decision. An MDP can be formulated by a 5-tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $P(s'|s,a)$ is the state transition probability, $R(s,a)$ is the immediate reward, and $\gamma \in [0,1)$ is the discount factor. Under a policy $\pi$, which maps each state to an action, the value function of a state $\hat{s}$ satisfies the Bellman equation \begin{align*} V^{\pi}(\hat{s}) &= E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R\big(s^{(k)},\pi(s^{(k)})\big) \,\Big|\, s^{(0)}=\hat{s}\right] \\ &= R\big(\hat{s},\pi(\hat{s})\big) + \gamma \sum_{s' \in \mathcal{S}} P\big(s'|\hat{s},\pi(\hat{s})\big) V^{\pi}(s').\tag{1}\end{align*}
Dynamic programming could be exploited to solve the Bellman equation when the state transition probability $P(s'|s,a)$ is known a priori. In many practical cases, however, the transition probability is unknown, and the agent has to learn the value function from sampled interactions with the environment. RL algorithms can be characterized along the following dimensions.
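As an aside, the following minimal sketch illustrates the dynamic-programming route mentioned above by running value iteration on a toy MDP with a known transition matrix; the two-state environment, its rewards, and the convergence threshold are made-up values chosen only for demonstration and are not part of this paper's setup.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions (all numbers are hypothetical, for illustration only).
# P[a, s, s'] is the transition probability, R[s, a] the immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],      # transitions under action 0
              [[0.5, 0.5],
               [0.0, 1.0]]])     # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator until convergence.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V).T          # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V_new = Q.max(axis=1)              # greedy improvement over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print("optimal values:", V, "greedy policy:", Q.argmax(axis=1))
```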
Model-based versus Model-free: Model-based algorithms imply that the agent tries to learn a model of how the environment works from its observations and then plans a solution using that model. Once the agent has gained an adequately accurate model, it can use a planning algorithm with the learned model to find a policy. Model-free algorithms mean that the agent does not directly learn to model the environment. Instead, as in the classical example of $\mathcal{Q}$-learning, the agent estimates the Q-values (or roughly the value function) of each state-action pair and derives the optimal policy by choosing the action yielding the largest Q-value in the current state. Different from a model-based algorithm, a well-learnt model-free algorithm like $\mathcal{Q}$-learning cannot predict the next state and value before taking the action.

Monte-Carlo Update versus Temporal-Difference Update: Generally, the value function update could be conducted in two ways, that is, the Monte-Carlo update and the temporal-difference (TD) update. A Monte-Carlo update means the agent updates its estimate for a state-action pair by calculating the mean return from a collection of episodes. A TD update instead refines the estimate by comparing estimates at two consecutive steps. For example, $\mathcal{Q}$-learning updates its Q-value by the TD update $Q(s,a) \leftarrow Q(s,a) + \alpha \big(R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$, where $\alpha$ is the learning rate. Specifically, the term $R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)$ is also named the TD error, since it captures the difference between the current (sampled) estimate $R(s,a) + \gamma \max_{a'} Q(s',a')$ and the previous one $Q(s,a)$.

On-policy versus Off-policy: The value function update is also coupled with the policy used to gather experience. Before updating the value function, the agent needs to sample and learn the environment by performing some possibly non-optimal policy. If the update policy is independent of the sampling policy, the agent is said to perform an off-policy update. Taking the example of $\mathcal{Q}$-learning, this off-policy agent updates the Q-value by choosing the action corresponding to the best Q-value, while it could explore the environment by adopting sampling policies like $\epsilon$-greedy or the Boltzmann distribution to balance the "exploration and exploitation" trade-off [13]. $\mathcal{Q}$-learning provably converges regardless of the chosen sampling policy. On the contrary, the SARSA agent is on-policy, since it updates the value function by $Q(s,a) \leftarrow Q(s,a) + \alpha \big(R(s,a) + \gamma Q(s',a') - Q(s,a)\big)$, where $a$ and $a'$ need to be chosen according to the same policy. (A minimal sketch contrasting these two update rules is given after this list.)
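To make the off-policy/on-policy distinction concrete, the sketch below implements the two tabular update rules side by side; the table sizes, learning rate, and discount factor are arbitrary illustrative choices, not values used in this paper.

```python
import numpy as np

alpha, gamma = 0.1, 0.9          # illustrative learning rate and discount factor
Q = np.zeros((10, 4))            # toy Q-table: 10 states x 4 actions

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy TD update: bootstrap with the greedy action in s_next."""
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy TD update: bootstrap with the action actually taken in s_next."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
```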
B. From $\mathcal{Q}$-Learning to Deep $\mathcal{Q}$-Learning
We first summarize the details of $\mathcal{Q}$-learning as follows:

1) The agent chooses an action $a$ under state $s$ according to some policy like $\epsilon$-greedy. Here, the $\epsilon$-greedy policy means the agent chooses the action with the largest Q-value $Q(s,a)$ with a probability of $\epsilon$, and equally chooses the other actions with a probability of $\frac{1-\epsilon}{|A|}$, where $|A|$ denotes the size of the action space.

2) The agent obtains the reward $R(s,a)$ from the environment, and the state transitions to the next state $s'$.

3) The agent updates the Q-value function in a TD manner as $Q(s,a) \leftarrow Q(s,a) + \alpha \big(R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big)$.
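The following sketch mirrors one pass through the three steps above, using the convention of the text that $\epsilon$ is the probability of taking the greedy action; the toy environment function and table sizes are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma, eps = 0.1, 0.9, 0.9      # illustrative hyper-parameters
Q = np.zeros((10, 4))                  # toy Q-table: 10 states x 4 actions

def epsilon_greedy(Q, s, eps):
    # Greedy action with probability eps, otherwise a uniformly random non-greedy action,
    # following the convention described in the text above.
    greedy = int(Q[s].argmax())
    if rng.random() < eps:
        return greedy
    others = [a for a in range(Q.shape[1]) if a != greedy]
    return int(rng.choice(others))

def toy_env_step(s, a):
    # Placeholder environment: random reward and next state, for illustration only.
    return float(rng.normal()), int(rng.integers(Q.shape[0]))

s = 0
a = epsilon_greedy(Q, s, eps)                                  # step 1: choose action
r, s_next = toy_env_step(s, a)                                 # step 2: observe reward and next state
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])     # step 3: TD update
```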
Classical RL algorithms usually rely on two different ways (i.e., an explicit table or function approximation) to store the estimated value functions. For table storage, the RL algorithm uses an array or hash table to store the learnt results for each state-action pair. For a large state space, this not only requires intensive storage, but also makes it impossible to quickly traverse all state-action pairs. Due to this curse of dimensionality, function approximation is more appealing.
The most straightforward way for function approximation is a linear approach. Taking the example of $\mathcal{Q}$-learning, the Q-value can be approximated as $Q(s,a) \approx \boldsymbol{\theta}^{T} \boldsymbol{\psi}(s,a)$, where $\boldsymbol{\psi}(s,a)$ is a feature vector of the state-action pair and $\boldsymbol{\theta}$ is the parameter vector to be learnt. Denoting by $Q^{+}(s,a)$ the target value (e.g., the TD target $R(s,a) + \gamma \max_{a'} Q(s',a')$), the learning objective is to minimize the loss function \begin{align*} L(\boldsymbol{\theta}) &= \frac{1}{2} \big(Q^{+}(s,a) - Q(s,a)\big)^{2} \\ &= \frac{1}{2} \big(Q^{+}(s,a) - \boldsymbol{\theta}^{T} \boldsymbol{\psi}(s,a)\big)^{2}.\tag{2}\end{align*}
The parameter $\boldsymbol{\theta}$ can then be updated by gradient descent as \begin{align*} \boldsymbol{\theta}^{(i+1)} &\leftarrow \boldsymbol{\theta}^{(i)} - \alpha \nabla L(\boldsymbol{\theta}^{(i)}) \\ &= \boldsymbol{\theta}^{(i)} + \alpha \big(Q^{+}(s,a) - {\boldsymbol{\theta}^{(i)}}^{T} \boldsymbol{\psi}(s,a)\big) \boldsymbol{\psi}(s,a).\tag{3}\end{align*}
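A minimal sketch of the update in (3) follows, assuming a hand-rolled feature map $\boldsymbol{\psi}(s,a)$ and illustrative hyper-parameters; the feature construction here is a placeholder and not one used in the paper.

```python
import numpy as np

alpha, gamma = 0.01, 0.9
n_features = 8
theta = np.zeros(n_features)

def psi(s, a):
    # Placeholder feature map: a fixed random embedding of the (state, action) pair.
    rng = np.random.default_rng(hash((s, a)) % (2**32))
    return rng.normal(size=n_features)

def linear_q(theta, s, a):
    return theta @ psi(s, a)

def gradient_step(theta, s, a, r, s_next, actions):
    # TD target Q+(s,a) = r + gamma * max_a' Q(s',a'), then one step of (3).
    q_target = r + gamma * max(linear_q(theta, s_next, a2) for a2 in actions)
    td_error = q_target - linear_q(theta, s, a)
    return theta + alpha * td_error * psi(s, a)

theta = gradient_step(theta, s=0, a=1, r=1.0, s_next=2, actions=range(4))
```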
Apparently, the linear function approximation may not accurately model the estimated value function. Hence, researchers have proposed to replace the linear approximator with a deep neural network $Q(s,a;\boldsymbol{\theta})$, which gives rise to deep $\mathcal{Q}$-learning (DQL). In order to stabilize the learning process, DQL further adopts the following two techniques [15]:
Experience Replay [15]: The agent stores the past experience (i.e., the tuple $e_{t} = \langle s_{t},a_{t}, s'_{t}, R(s_{t},a_{t})\rangle$ at episode $t$) into a dataset $D_{t} = (e_{1},\cdots,e_{t})$ and uniformly samples some (mini-batch) items from the dataset to update the Q-value neural network $Q(s,a;\boldsymbol{\theta})$.

Network Cloning: The agent uses a separate network $\hat{Q}$ to guide how to select an action $a$ in state $s$, and the network $\hat{Q}$ is replaced by $Q$ every $C$ episodes. Simulation results demonstrate that this network cloning enhances the learning stability [15].
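The sketch below gives one possible implementation of these two stabilizers: a bounded replay memory plus a periodic copy of the evaluation parameters into a target copy. The buffer capacity, batch size, and the representation of the network as a dictionary of numpy arrays are illustrative assumptions.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["s", "a", "s_next", "r"])

class ReplayMemory:
    """Fixed-capacity buffer of past experiences e_t = <s_t, a_t, s'_t, R(s_t, a_t)>."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append(Experience(s, a, s_next, r))

    def sample(self, batch_size=32):
        # Uniformly sample a mini-batch for the Q-network update.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def clone_network(evaluation_params):
    """Network cloning: copy the evaluation parameters (numpy arrays) into the target network every C episodes."""
    return {name: value.copy() for name, value in evaluation_params.items()}
```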
Finally, we illustrate the general steps of DQL, i.e., the deep reinforcement learning considered in this article, in Algorithm 1.

Algorithm 1 The General Steps of Deep Reinforcement Learning
An evaluation network $Q(s,a;\boldsymbol{\theta})$ is initialized with random weights $\boldsymbol{\theta}$ and cloned to the target network $\hat{Q}$.
A replay memory dataset $D$ is initialized as empty.
repeat
  At episode $t$, the agent observes the current state $s_{t}$.
  The agent chooses action $a_{t}$ for state $s_{t}$ according to a sampling policy such as $\epsilon$-greedy.
  After executing the action $a_{t}$, the agent obtains the reward $R(s_{t},a_{t})$ and observes the next state $s'_{t}$.
  The agent stores the episode experience $e_{t} = \langle s_{t},a_{t}, s'_{t}, R(s_{t},a_{t})\rangle$ into the replay memory $D_{t}$.
  The agent samples a minibatch of experiences from $D_{t}$ and computes the corresponding TD targets using the cloned network $\hat{Q}$.
  The agent updates the weights $\boldsymbol{\theta}$ of the evaluation network by a gradient step that reduces the TD error over the minibatch.
  The agent clones the evaluation network $Q$ to the target network $\hat{Q}$ every $C$ episodes.
  The episode index is updated by $t \leftarrow t+1$.
until A predefined stopping condition (e.g., the gap between two successive estimates of the Q-value function becomes sufficiently small) is satisfied.
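For concreteness, the following self-contained sketch mirrors the loop of Algorithm 1. To stay framework-free, the "deep" network is replaced by a per-action linear approximator, and the toy environment, feature dimension, and hyper-parameters (including the cloning period $C$) are illustrative stand-ins rather than the setup used in the simulations of Section III.

```python
import random
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
alpha, gamma, eps, C = 0.01, 0.9, 0.9, 50                     # illustrative hyper-parameters
theta = rng.normal(scale=0.1, size=(n_actions, n_features))   # evaluation "network" (linear, for brevity)
theta_target = theta.copy()                                   # cloned target network
memory = deque(maxlen=5000)

def toy_env_step(state, action):
    # Placeholder environment: random next state, reward favoring action 0 (illustration only).
    return rng.normal(size=n_features), (1.0 if action == 0 else 0.0)

def q_values(params, state):
    return params @ state                    # one Q-value per action

state = rng.normal(size=n_features)
for t in range(1, 5001):
    # Choose an action: greedy with probability eps, otherwise uniformly random
    # (matching the convention used earlier in the text).
    if rng.random() < eps:
        action = int(np.argmax(q_values(theta, state)))
    else:
        action = int(rng.integers(n_actions))
    next_state, r = toy_env_step(state, action)
    memory.append((state, action, next_state, r))             # store e_t in the replay memory

    # Sample a minibatch and take a gradient step on the TD error for each experience.
    for s, a, s_next, rew in random.sample(memory, min(32, len(memory))):
        target = rew + gamma * np.max(q_values(theta_target, s_next))
        theta[a] += alpha * (target - q_values(theta, s)[a]) * s

    if t % C == 0:                           # clone the evaluation network every C episodes
        theta_target = theta.copy()
    state = next_state
```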
Resource Management for Network Slicing
Resource management is a perennial topic in the evolution of wireless communications. Intuitively, resource management for network slicing can be considered from several different perspectives.
Radio Resource and Virtualized Network Functions: As depicted in Fig. 2, resource management for network slicing involves both the radio access part and the core network part with slightly different optimization goals. Due to the limited spectrum resource, resource management for the radio access puts considerable effort into allocating resource blocks (RBs) to each slice, so as to maintain acceptable SE while trying to provide an appealing rate and small delay. The widely adopted optical transmission in core networks has shifted the optimization of the core network towards designing common or dedicated virtualized network functions (VNFs), so as to appropriately forward the packets of one specific slice with minimal scheduling delay. By balancing the relative importance of resource utilization (e.g., SE) and the QoE satisfaction ratio, the resource management problem could be formulated as maximizing the reward $R = \zeta \cdot \text{SE} + \beta \cdot \text{QoE}$, where $\zeta$ and $\beta$ denote the importance of SE and QoE, respectively.

Equal or Prioritized Scheduling: As part of the control plane, IETF [23] has defined the common control network function (CCNF) for all or several slices. The CCNF includes the access and mobility management function (AMF) as well as the network slice selection function (NSSF), which is in charge of selecting core network slice instances. Hence, besides equally treating flows from different slices, the CCNF might differentiate flows. For example, flows from the ultra-reliable low-latency communications (URLLC) service can be scheduled and provisioned with higher priority, so as to experience as little latency as possible. In this case, in order to balance the resource utilization (RU) and the waiting time (WT) of flows, the objective could be similarly written as a weighted summation of RU and WT.
Based on the aforementioned discussions, we can safely conclude that the objective of resource management for network slicing should take account of several variables, and a weighted summation of these variables can be considered as the reward for the learning agent.
A. Radio Resource Slicing
In this part, we address how to apply DRL to radio resource slicing. Mathematically, given a list of $N$ existing slices sharing a total bandwidth $W$, with $\boldsymbol{w}$ denoting the bandwidth allocated to each slice and $\boldsymbol{d}$ the traffic demand of each slice, the radio resource slicing problem can be formulated as \begin{align*}&\hspace{-2pc}\arg\max_{\boldsymbol{w}} \mathbb{E}\{R(\boldsymbol{w},\boldsymbol{d})\} \\ &= \arg\max_{\boldsymbol{w}} \mathbb{E}\big\{\zeta \cdot \text{SE}(\boldsymbol{w},\boldsymbol{d}) + \beta \cdot \text{QoE}(\boldsymbol{w},\boldsymbol{d})\big\} \\ &\text{s.t.:}~ \boldsymbol{w}=(w_{1}, \cdots, w_{N}) \\ &\hphantom{\text{s.t.:}~} w_{1}+ \cdots + w_{N} = W \\ &\hphantom{\text{s.t.:}~} \boldsymbol{d}=(d_{1}, \cdots, d_{N}) \\ &\hphantom{\text{s.t.:}~} d_{i} \sim \text{Certain Traffic Model}, \quad \forall i \in [1, \cdots, N]. \tag{4}\end{align*}
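As one possible way to cast (4) into the DQL framework (the exact mapping of Table 1(a) is not reproduced here), the sketch below discretizes the per-slice bandwidth shares as the action space and evaluates the weighted reward; the slice names, candidate shares, demand values, and the SE/QoE functions are hypothetical placeholders.

```python
from itertools import product

ZETA, BETA = 0.1, 5000.0          # SE and QoE weights, as exemplified later in the text
W = 10e6                          # total bandwidth in Hz (illustrative)
SLICES = ["VoIP", "video", "URLLC"]

# Candidate per-slice shares; actions are all combinations that sum to one.
SHARES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ACTIONS = [c for c in product(SHARES, repeat=len(SLICES)) if abs(sum(c) - 1.0) < 1e-9]

def reward(action, demand, se_fn, qoe_fn):
    """Weighted reward R = zeta*SE + beta*QoE for a given bandwidth split and demand."""
    alloc = {s: share * W for s, share in zip(SLICES, action)}
    return ZETA * se_fn(alloc, demand) + BETA * qoe_fn(alloc, demand)

def se_placeholder(alloc, demand):
    # Placeholder spectrum-efficiency proxy, for illustration only.
    return sum(min(alloc[s], demand[s]) for s in SLICES) / W

def qoe_placeholder(alloc, demand):
    # Placeholder QoE proxy: fraction of slices whose demand is met.
    return sum(alloc[s] >= demand[s] for s in SLICES) / len(SLICES)

demand = {"VoIP": 1e6, "video": 6e6, "URLLC": 2e6}    # made-up demand in bit/s
best = max(ACTIONS, key=lambda a: reward(a, demand, se_placeholder, qoe_placeholder))
```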
We evaluate the performance of adopting DQL to solve (4) by simulating a scenario containing one single BS with three types of services (i.e., VoIP, video, URLLC). There exist 100 registered subscribers randomly located within a 40-meter-radius circle surrounding the BS. These subscribers generate traffic according to the service models summarized in Table 1(b). The VoIP and video services exactly take the parameter settings of the VoLTE and video streaming models, while the URLLC service takes the parameter settings of the FTP 2 model [24]. It can be observed from Table 1(b) that URLLC has less frequent packets compared with the others, while VoLTE requires the smallest bandwidth for its packets.
We consider DQL by using the mapping in Table 1(a) to optimize the weighted summation of system SE and slice QoE. Specifically, we perform round-robin scheduling within each slice at a granularity of 0.5 ms. In other words, we sequentially allocate the bandwidth of each slice to the active users within that slice every 0.5 ms. Besides, we adjust the bandwidth allocation to each slice once per second. Therefore, the DQL agent updates its Q-value neural network every second. We compare the simulation results with the following three methods, so as to explain the importance of DQL.
Demand-prediction based method: This method tries to estimate the possible demand by using long short-term memory (LSTM) to predict the number of active users requesting VoIP, video and URLLC, respectively. Afterwards, the bandwidth is allocated in two ways (a small sketch of both allocation rules is given after this list). (1) DP-No allocates the whole bandwidth to each slice in proportion to the number of predicted packets. In particular, assuming that the total bandwidth is $B$ and the predicted numbers of packets for VoIP, video and URLLC are $N_{\text{VoIP}}$, $N_{\text{Video}}$ and $N_{\text{URLLC}}$, the allocated bandwidth to these three slices (i.e., VoIP, video and URLLC) is $\frac{B\cdot N_{\text{VoIP}}}{N_{\text{VoIP}}+ N_{\text{Video}} + N_{\text{URLLC}}}$, $\frac{B\cdot N_{\text{Video}}}{N_{\text{VoIP}}+ N_{\text{Video}} + N_{\text{URLLC}}}$, and $\frac{B\cdot N_{\text{URLLC}}}{N_{\text{VoIP}}+ N_{\text{Video}} + N_{\text{URLLC}}}$, respectively. (2) DP-BW performs the allocation by multiplying the number of predicted packets by the least required rate in Table 1(b) and then computing the proportion. In this regard, assuming that the required rates for the three slices are $R_{\text{VoIP}}$, $R_{\text{Video}}$ and $R_{\text{URLLC}}$, the allocated bandwidth to VoIP, video and URLLC is \begin{align*} \frac{B N_{\text{VoIP}} R_{\text{VoIP}}}{N_{\text{VoIP}}R_{\text{VoIP}}+ N_{\text{Video}} R_{\text{Video}}+ N_{\text{URLLC}}R_{\text{URLLC}}},\\ \frac{B N_{\text{Video}} R_{\text{Video}}}{N_{\text{VoIP}}R_{\text{VoIP}}+ N_{\text{Video}} R_{\text{Video}}+ N_{\text{URLLC}}R_{\text{URLLC}}},\\ \frac{B N_{\text{URLLC}} R_{\text{URLLC}}}{N_{\text{VoIP}}R_{\text{VoIP}}+ N_{\text{Video}} R_{\text{Video}}+ N_{\text{URLLC}}R_{\text{URLLC}}},\end{align*} respectively. Round-robin is conducted within each slice.

Hard slicing: Hard slicing means that each service slice is always allocated $\frac{1}{3}$ of the whole bandwidth, since there exist 3 types of services in total. Again, round-robin is conducted within each slice.

No slicing: Irrespective of the related SLAs, all users are scheduled equally. Round-robin is conducted among all users.
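As referenced above, here is a minimal sketch of the two demand-prediction allocation rules; the predicted packet counts and required rates are placeholder numbers, and the LSTM predictor itself is not shown.

```python
def dp_no(total_bw, predicted_packets):
    """DP-No: split the bandwidth in proportion to the predicted number of packets."""
    total = sum(predicted_packets.values())
    return {slc: total_bw * n / total for slc, n in predicted_packets.items()}

def dp_bw(total_bw, predicted_packets, required_rate):
    """DP-BW: weight each slice by predicted packets times its least required rate."""
    weights = {slc: predicted_packets[slc] * required_rate[slc] for slc in predicted_packets}
    total = sum(weights.values())
    return {slc: total_bw * w / total for slc, w in weights.items()}

# Placeholder inputs (not the simulation's actual values).
packets = {"VoIP": 40, "video": 25, "URLLC": 5}
rates = {"VoIP": 51e3, "video": 5e6, "URLLC": 10e6}     # bit/s, illustrative
print(dp_no(10e6, packets))
print(dp_bw(10e6, packets, rates))
```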
Fig. 3. The performance of DQL for radio resource slicing w.r.t. the learning steps (QoE weight = 5000).
Fig. 3 presents the learning process of DQL in radio resource management. In particular, Fig. 3(a)~3(f) give the initial performance of DQL when the QoE weight is 5000 and the SE weight is 0.1. Fig. 3(g)~3(l) provide the performance during the last 50 of 50000 learning updates. From these sub-figures, it can be observed that DQL could not learn the user activities well at the very beginning and the allocated bandwidth fluctuates heavily. But after nearly 50000 updates, DQL has gained better knowledge of the user activities and yields a stable bandwidth-allocation strategy. Besides, Fig. 3(m) and Fig. 3(n) show the variations of SE and QoE along with each learning epoch. From both subfigures, a larger QoE weight produces policies with superior QoE performance while bringing a certain loss in system SE performance.
Fig. 4 provides a detailed performance comparison among the candidate techniques, where the results for DQL are obtained after 50000 learning updates. Fig. 4(a)~4(f) give the percentage of the total bandwidth allocated to each slice using pie charts and highlight the QoE satisfaction ratio with the surrounding text. From Fig. 4(a)~4(b), a reduction in transmission antennas from 64 to 16, which implies a decrease in network capability and an increase in potential collisions across slices, leads to a re-allocation of network bandwidth towards the bandwidth-consuming yet activity-limited URLLC slice. Also, it can be observed from Fig. 4(f) that, when the downlink transmission uses 64 antennas, "no slicing" performs the best, since the transmission capability is sufficient and the scheduling period is 0.5 ms, while the bandwidth allocated to each slice is adjusted only once per second and is thus slower to catch the demand variations. When the number of downlink antennas turns to 32, the DQL-driven scheme produces an 81% QoE satisfaction ratio for URLLC, while the "no slicing" and "hard slicing" schemes only provision 15% and 41% of satisfied URLLC packets, respectively. Notably, applying DQL mainly leads to the QoE gain of URLLC. The reason is that, as summarized in Table 1(b), the packet size of URLLC follows a truncated lognormal distribution with a mean value of 2 MByte, which is far larger than those of the VoLTE and video services. Given the larger transmission volume and the strictly lower latency requirement, it is far more difficult to satisfy the QoE of URLLC. In this case, it is still satisfactory that DQL outperforms the other competitive schemes and renders a higher QoE gain for URLLC at a slight cost in spectrum efficiency (SE). Meanwhile, Fig. 4(d) and Fig. 4(e) demonstrate the allocation results for the demand-prediction based schemes and show significantly inferior performance, since Fig. 3(a)~3(c) and Fig. 3(g)~3(i) show that the number of video packets dominates the transmission and a simple packet-number based prediction could not capture the complicated relationship between demand and QoE. On the other hand, Fig. 4(g) illustrates that this QoE advantage of DQL comes at the cost of a decrease in SE. Recalling the definition of the reward in DQL, if we decrease the QoE weight from 5000 to 1, DQL could learn another bandwidth allocation policy (in Fig. 4(c)) yielding a larger SE yet a lower QoE. Fig. 4(g)~4(j) further summarize the performance comparison in terms of SE and QoE satisfaction ratios, where the vertical error bars show the standard deviation. These subfigures validate DQL's flexibility and its advantage in resource-limited scenarios to ensure the QoE per user.
Fig. 4. The performance comparison among different schemes for radio resource slicing. (a) DQL. (b) DQL. (c) DQL. (d) DP-BW. (e) DP-No. (f) No slicing. (g) System SE. (h) VoLTE QoE. (i) Video QoE. (j) URLLC QoE.
B. Priority-Based Scheduling in Common VNFs
Section III-A has discussed how to apply DRL to radio resource slicing. Similarly, if we virtualize the computation resources as VNFs for each slice, the problem of allocating computation resources to each VNF could be solved in a manner similar to the radio resource slicing case. Therefore, in this part, we discuss another important issue, that is, priority-based core network slicing for common VNFs. Specifically, we simulate a scenario where there exist 3 service function chains (SFCs) possessing the same basic capability but working at the expense of different numbers of central processing units (CPUs) and yielding different provisioning results (e.g., waiting time). Also, based on the commercial value or the related SLA, flows can be classified into 3 categories (i.e., Category A, B, and C) with decreasing priority from Category A to Category C, and the priority-based scheduling rule is defined such that SFC I prioritizes Category A flows over the others, SFC II treats Category A and B flows equally but serves Category C flows with lower priority, and SFC III treats all flows equally. Besides, SFCs process flows of equal priority according to their arrival time. The eventually utilized CPUs of each SFC depend on the number of flows it processes. Besides, SFC I, II and III cost 2, 1.5, and 1 CPU(s), but incur 10, 15, and 20 ms of processing time regardless of the flow size, respectively. Hence, subject to a limited number of CPUs, flows of each category should be scheduled to an appropriate SFC, so as to incur acceptable waiting time. Therefore, the scheduling of flows should match and learn the arrival pattern of flows in the three categories, and DQL is considered as a promising solution.
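The scenario above can be captured in a few data structures; the sketch below encodes the per-SFC CPU cost, processing time, and priority rule described in the text (the queue model is simplified and the helper function is a hypothetical illustration).

```python
# Per-SFC CPU cost and processing time, as stated in the text.
SFCS = {
    "I":   {"cpu": 2.0, "proc_ms": 10},
    "II":  {"cpu": 1.5, "proc_ms": 15},
    "III": {"cpu": 1.0, "proc_ms": 20},
}

# Priority rank of each flow category at each SFC (lower rank = served earlier);
# ties are broken by arrival time.
PRIORITY = {
    "I":   {"A": 0, "B": 1, "C": 1},   # SFC I prioritizes Category A over the others
    "II":  {"A": 0, "B": 0, "C": 1},   # SFC II treats A and B equally, C with lower priority
    "III": {"A": 0, "B": 0, "C": 0},   # SFC III treats all flows equally
}

def service_order(sfc, queued_flows):
    """Sort queued flows, given as (category, arrival_time_ms) pairs, by priority, then arrival."""
    return sorted(queued_flows, key=lambda f: (PRIORITY[sfc][f[0]], f[1]))

# Example: pending flows at SFC I.
print(service_order("I", [("C", 1.0), ("A", 3.0), ("B", 2.0)]))
```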
Similarly, it is critical to design an appropriate mapping of the DRL elements to this slicing issue. As Table 1(a) implies, we use a mapping slightly different from that for radio resource slicing, so as to manifest the flexibility of DQL. In particular, we abstract the state of DQL as a summary of the categories and arrival times of the last 5 flows together with the category of the newly arrived flow, while the reward is defined as the weighted summation of the processing and queuing time of this flow, where a larger weight in this summation reflects the importance of flows with higher priority. Also, we first pre-train the NN by emulating flows with lognormally distributed inter-arrival times from the three categories of users.
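One possible encoding of this state and reward abstraction is sketched below; the feature layout, the per-category weights, and the sign of the reward are illustrative assumptions rather than the exact definitions used in the simulation.

```python
import numpy as np

CATEGORIES = ["A", "B", "C"]
# Illustrative priority weights: higher-priority flows weigh more in the reward.
WEIGHTS = {"A": 3.0, "B": 2.0, "C": 1.0}

def encode_state(last_flows, new_category):
    """State = (category one-hot, inter-arrival time) of the last 5 flows + new flow's category."""
    features = []
    for category, inter_arrival_ms in last_flows[-5:]:
        one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
        features.extend(one_hot + [inter_arrival_ms])
    features.extend([1.0 if new_category == c else 0.0 for c in CATEGORIES])
    return np.array(features)

def reward(category, processing_ms, queuing_ms):
    """Negative weighted delay, so that shorter waits for high-priority flows score higher."""
    return -WEIGHTS[category] * (processing_ms + queuing_ms)

state = encode_state([("A", 2.0), ("B", 5.0), ("C", 1.0), ("A", 3.0), ("B", 4.0)], "A")
```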
We compare the DQL scheme with an intuitive "no priority" solution, which allocates each flow to the SFC yielding the minimum waiting time. Fig. 5 provides the related performance obtained by randomly generating 10000 flows and provisioning them accordingly, where the vertical and horizontal axes represent the number of utilized CPUs and the waiting time of flows, respectively. Specifically, the bi-dimensional shading color reflects the number of flows corresponding to a specific waiting time and CPU usage; a darker color implies a larger number. Compared with the "no priority" solution, the DQL-empowered slicing provisions flows with a smaller average waiting time (i.e., 10.5% lower than "no priority") and fuller CPU usage (i.e., 27.9% larger than "no priority"). In other words, DQL could support alternative solutions to better exploit the computing resources and reduce the waiting time by first serving the users with higher commercial value.
Fig. 5. Performance comparison between DQL-based priority scheduling and no-priority scheduling for core network slicing. (a) DQL-based Prioritized Scheduling. (b) No Priority Scheduling.
Conclusion and Future Directions
From the discussions in this article, we find that matching the resources allocated to slices with the users' activity demand is the most critical challenge for effectively realizing network slicing, and DRL could be a promising solution. Starting with an introduction to the fundamental concepts of DQL, one typical type of DRL, we explained the working mechanism of DQL and the motivation to apply it to this problem. We further demonstrated the advantage of DQL in managing this demand-aware resource allocation in two typical slicing scenarios, namely radio resource slicing and priority-based core network slicing, through extensive simulations. Our results showed that, compared with demand-prediction based and other intuitive solutions, DQL could implicitly incorporate a deeper relationship between demand (i.e., user activities) and supply (i.e., resource allocation) in resource-constrained scenarios, and enhance the effectiveness and agility of network slicing. Finally, in order to foster the application of DQL in a broader sense, we pointed out some noteworthy issues. We believe DRL could play a crucial role in network slicing in the future.
However, network slicing involves many aspects and a successful application of DQL needs some careful considerations: (a) Slice admission control on incoming requests for new slices: the success of network slicing implies a dynamic and agile slice management scheme. Therefore, if requests for new slices emerge, how to apply DQL is also an interesting problem, since the defined state and action spaces need to adapt to the changes in the "slice" space. (b) Abstraction of states and actions: Section III has provided two ways to abstract states and actions. Both methods sound practical in the related scenarios and reflect the flexibility of DQL. Hence, for new scenarios, it becomes an important issue to choose an appropriate abstraction of states and actions, so as to better model the problem and save the learning cost. To date, it remains an open question how to give abstraction guidelines. (c) Latency and accuracy in retrieving rewards: The simulations in Section III have assumed the instantaneous and accurate acquisition of the reward for a state-action pair. However, such an assumption no longer holds in a practical, complex wireless environment, since it takes time for user equipment to report the information and the network may not successfully receive the feedback. Also, similar to the case of states and actions, the abstraction of the reward might be difficult and the defined reward should be as simple as possible. (d) Policy learning cost: The time-varying nature of wireless channels and user activities requires a fast policy-learning scheme. However, the current cost of policy training still lacks the necessary learning speed. For example, our pre-training for the priority-based network slicing policy takes two days on an Intel Core i7-4712MQ processor for the Q-value function to converge. Though GPUs could speed up the training process, the learning cost is still heavy. Therefore, there are still a lot of interesting questions to be addressed.
ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to Chen Yu and Yuxiu Hua of Zhejiang University for the valuable discussions on implementing part of the simulation code.