In Orthogonal Frequency Division Multiple Access (OFDMA) radio access networks, the throughput-fairness tradeoff optimization problem aims to maximize the total cell throughput while maintaining a certain level of fairness between user throughputs. One way to maximize the total system throughput subject to fairness constraints is to use opportunistic schedulers, such as the channel-aware Generalized Proportional Fair (GPF) scheduling rule, which exploit the multi-user diversity principle. Different tradeoff levels can therefore be obtained by properly parameterizing the GPF scheduling scheme [1].
The Next Generation Mobile Networks (NGMN) fairness requirement [2] is adopted as the fairness criterion; it requires a predefined user throughput distribution to be achieved. According to the NGMN concept, the scheduler is considered fair if and only if the Cumulative Distribution Function (CDF) of the normalized user throughputs (NUTs) satisfies a predefined percentile requirement. Based on the instantaneous scheduler state (channel conditions, user throughputs, and traffic loads), the GPF rule should be adapted such that the obtained CDF curve of NUTs respects the NGMN fairness condition. By assuming that the NGMN optimality criterion depends only on the previous GPF parameterization, the scheduling procedure can be modeled as a Markov Decision Process (MDP) that respects the Markov property.
The novelty of the current work lies in exploring the unknown behavior of the scheduler states in order to learn the optimal policy of GPF parameters such that the NGMN fairness requirement is satisfied at each Transmission Time Interval (TTI). The Continuous Actor-Critic Learning Automaton (CACLA) reinforcement learning (RL) algorithm is proposed to solve the resulting MDP problem by selecting optimal actions. The quality of applying different continuous GPF parameters in different continuous scheduler states is approximated by a nonlinear Multilayer Perceptron Neural Network (MLPNN). The rest of the paper is organized as follows: Section II formulates the fairness-throughput tradeoff optimization problem. Section III presents the related work. Section IV analyzes the elements of the proposed controller. Section V presents the simulation results, and Section VI concludes the paper.
SECTION II.
User Fairness and System Throughput Tradeoff Optimization Problem
In LTE packet scheduling, a set {\cal U}_{t} of preselected users is scheduled at each TTI t in the frequency domain by using a set {\cal B} of grouped OFDMA sub-carriers denoted as resource blocks (RBs). The resource allocation procedure in the time-frequency domain follows, at each TTI t, the integer linear programming optimization problem shown in Eq. (1):

\begin{align}
\max_{b_{i,j}}\quad & \sum_{i\in{\cal U}_{t}}\sum_{j\in{\cal B}} b_{i,j}[t]\cdot\frac{r_{i,j}[t]}{\left(\overline{T}_{i}[t]\right)^{\alpha_{k}[t]}}\nonumber\\
\text{s.t.}\quad & \sum_{i\in{\cal U}_{t}} b_{i,j}[t]=1,\ \forall j\in{\cal B}\nonumber\\
& b_{i,j}[t]\in\{0,1\},\ \forall i\in{\cal U}_{t},\ \forall j\in{\cal B}\tag{1}
\end{align}
where b_{i,j} is the allocation indicator, r_{i,j} is the achievable rate of user i on RB j, and \overline{T}_{i} denotes the throughput of user i averaged over a number of TTIs using an exponential moving average filter [1]. The fairness-throughput tradeoff is tuned by the parameter \alpha_{k}[t], which can be adapted TTI-by-TTI in order to meet the objective function. When \alpha_{k}=0 for the entire transmission, the obtained GPF rule maximizes the throughput (MT). If \alpha_{k}=1, the obtained scheme is entitled Proportional Fair (PF), and when \alpha_{k} is very large (\alpha_{k}\rightarrow\infty), the scheduler maximizes the fairness between users at the expense of system throughput and the obtained rule is entitled Maximum Fairness (MF). The optimal scheduler state, in which the NGMN requirement is respected, can be achieved by setting \alpha_{k} at each TTI t such that:

\begin{equation}
\alpha_{k}[t]=\alpha_{k}[t-1]+\Delta\alpha_{k}\tag{2}
\end{equation}
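For illustration only, a minimal sketch of the per-TTI GPF allocation in Eq. (1) is given below; since the objective decouples across RBs, each RB can be assigned greedily to the user with the largest GPF metric. The array names, sizes, and random inputs are assumptions, not part of the original system.

```python
import numpy as np

def gpf_allocate(rates, avg_thr, alpha):
    """Greedy per-RB GPF allocation for one TTI (sketch of Eq. (1)).

    rates   : (num_users, num_rbs) achievable rates r_{i,j}[t]
    avg_thr : (num_users,) exponentially averaged user throughputs T_i[t]
    alpha   : GPF fairness parameter alpha_k[t] (0 = MT, 1 = PF, large = MF)
    Returns a 0/1 allocation matrix b_{i,j}[t] with one user per RB.
    """
    metric = rates / (avg_thr[:, None] ** alpha)      # GPF metric for every user/RB pair
    winners = np.argmax(metric, axis=0)               # best user on each RB
    b = np.zeros_like(rates, dtype=int)
    b[winners, np.arange(rates.shape[1])] = 1         # enforces sum_i b_{i,j} = 1 per RB
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = gpf_allocate(rng.uniform(0.1, 1.0, (10, 25)), rng.uniform(0.1, 1.0, 10), alpha=1.0)
    print(b.sum(axis=0))   # exactly one allocation per RB
```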
where \Delta\alpha_{k} is the \alpha_{k}[t] parameter step. Let us define {\cal A}_{k}[t]\in{\cal A}=\{\Delta\alpha_{k}\}, k=1,\ldots,\vert{\cal A}\vert, as the decision vector of the action taken at TTI t in order to bring the scheduler near or into the optimal state. The action set {\cal A} can be discrete or continuous. Obviously, the current action {\cal A}_{k}[t] should be chosen such that the tradeoff objective function \Phi[t+1] in the next state is maximized. By using the estimation operator \mathbb{E}\{\cdot\}, the tradeoff action can be seen as the decision vector of a second linear programming optimization problem, which should be solved before Eq. (1):

\begin{align}
\max_{{\cal A}_{k}}\quad & \sum_{k\in{\cal A}} {\cal A}_{k}[t]\cdot\mathbb{E}\left\{\Phi\left[{\cal S}_{t+1}^{S}\right]\right\}\nonumber\\
\text{s.t.}\quad & \sum_{k\in{\cal A}} {\cal A}_{k}[t]=1,\nonumber\\
& {\cal A}_{k}[t]\in\{0,1\},\ \forall k=1,\ldots,\vert{\cal A}\vert\tag{3}
\end{align}
where {\cal S}_{t+1}^{S} represents the scheduler state at the next TTI t+1. The NGMN objective function \Phi[{\cal S}_{t}^{S}] is calculated from the CDF of the NUT observation set \{\hat{\overline{T}}_{i}[t]\}\subset{\cal S}_{t}^{S}, i=1,\ldots,\vert{\cal U}_{t}\vert, as shown in Fig. 1. The NGMN fairness requirement is the continuous oblique line. If the CDF curve is located on the left side of the NGMN requirement (the MT rule case), the system is considered unfair ({\cal S}_{t+1}^{S}\in{\cal UF}), and when the CDF curve lies on the right side (the MF and PF rule cases), the system is declared fair ({\cal S}_{t+1}^{S}\in{\cal F}). In order to determine the optimal or feasible region in the CDF domain, a superior limit of the NGMN requirement is imposed (the dotted oblique line). In this sense, the fair area is divided into two sub-regions: feasible ({\cal S}_{t+1}^{S}\in{\cal FA}) and over-fairness ({\cal S}_{t+1}^{S}\in{\cal OF}), where \{{\cal F}\}=\{{\cal FA}\}\cup\{{\cal OF}\}. As seen in Fig. 1, only the green curve respects the feasibility condition. Therefore, at each TTI the action {\cal A}_{k}[t] should be chosen such that {\cal S}_{t+1}^{S}\in{\cal FA}. Based on Fig. 1, the NGMN objective function is calculated according to Eq. (4):

\begin{equation}
\Phi\left[{\cal S}_{t}^{S}\right]=\frac{1}{\vert{\cal U}_{t}\vert}\sum_{i\in{\cal U}_{t}}\left[\Psi\left(\hat{\overline{T}}_{i}\right)-\Psi^{Req}\left(\hat{\overline{T}}_{i}\right)\right]\leq 0\tag{4}
\end{equation}
where \Psi and \Psi^{Req} represent the CDF of the NUTs and the NGMN fairness requirement, respectively. Equation (4) is the cost function which appraises the quality of the action \Delta\alpha_{k} taken in the previous state. Due to the noisy characteristic of Eq. (4), the CACLA RL algorithm requires an additional value function in order to learn optimal policies of GPF parameters such that {\cal S}_{t}^{S}\in{\cal FA}.
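As an illustration of Eq. (4), the sketch below computes \Phi from the empirical CDF of the NUTs; the linear requirement used for psi_req is only a placeholder for the NGMN curve of Fig. 1, and all names are illustrative.

```python
import numpy as np

def ngmn_objective(nut, psi_req):
    """Sketch of Eq. (4): mean gap between the empirical CDF of the normalized
    user throughputs (NUTs) and the NGMN requirement. A value <= 0 means the
    fairness constraint is met.

    nut     : (num_users,) NUT observations
    psi_req : callable returning the required CDF value Psi^Req(x) at NUT x
    """
    nut = np.sort(nut)
    psi = np.arange(1, nut.size + 1) / nut.size          # empirical CDF Psi at each observation
    return float(np.mean(psi - np.array([psi_req(x) for x in nut])))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    req = lambda x: float(np.clip(x - 0.5, 0.0, 1.0))    # placeholder oblique requirement
    print(ngmn_objective(rng.lognormal(mean=0.0, sigma=0.3, size=50), req))
```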
SECTION III.
Related Work
The parameterization of the GPF scheduler for system throughput maximization under the NGMN requirement is discussed in [3]. The impact of the traffic load and of user rate constraints is considered when the CDF distribution of \hat{T}_{k,t} is determined. Unfortunately, the adaptation process is performed at different time scales in order to make the proposal suitable for real-time scheduling, leading to inflexible behavior when severe changes in the network conditions occur. In [4], an off-line procedure for adapting the \alpha parameter subject to different temporal fairness index constraints is proposed. The expected user throughput is calculated at the beginning of each TTI in order to predict the current state of the average user throughput before the scheduling decision. However, the traffic load is not considered, and the method cannot be applied to real systems due to its high complexity cost when the number of active flows increases. In this study, the method from [4] is slightly modified in the sense that \alpha_{k}[t] is adapted based on the NGMN constraint, where the CDF is calculated from the predicted throughput. The balance of the system throughput and user fairness tradeoff is analyzed in [1], where the traffic load is categorized based on the CQI reports. The normalized system throughput and the Jain Fairness Index are considered as part of the input state. The Q-Learning algorithm is used to learn different policies that converge very well to different tradeoff levels. However, the concept is not extended to a dynamic fairness requirement.
SECTION IV.
Proposed Architecture
The interaction between the controller and the scheduler shown in Fig. 2 is modeled in two stages: exploration and exploitation. In the first stage, the LTE controller receives a new state {\cal S}_{t}^{C}, which is an aggregated version of {\cal S}_{t}^{S}. Based on the trial-and-error principle, the controller takes random actions that are mapped into scheduling decisions by the scheduler. At the beginning of the next TTI, the scheduler ranks the previous scheduling decision based on the reward function {\cal RW}_{t}({\cal S}_{t-1}^{C},{\cal A}_{t-1}^{a}). Basically, the reward function {\cal RW}_{t} indicates how much closer to, or farther from, its objective the function \Phi[{\cal S}_{t}^{S}] is compared with the previous state when action {\cal A}_{t-1}^{a} is applied. The target of the exploration stage is to form a policy of scheduling decisions that follows those actions which maximize the sum of future rewards for every initial state. The exploitation stage applies the learned policy TTI-by-TTI. In order to learn the optimal policy, an MLPNN non-linear function is required to map the continuous and multidimensional state {\cal S}_{t}^{C} into optimal continuous GPF parameters. In this sense, the MLPNN weights are trained using the gradient descent algorithm with forward propagation (FP) and backward propagation (BP). The BP step minimizes the error between the target output and the output obtained through the FP procedure. The way the error and the target values are calculated determines the type of RL algorithm used for the optimal GPF parameterization.
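The sketch below shows a minimal one-hidden-layer MLPNN trained by gradient descent with FP and BP, of the kind the controller could use to map {\cal S}_{t}^{C} to a scalar output (a GPF parameter or a state value); the layer size, activation, and learning rate are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

class MLP:
    """Minimal one-hidden-layer MLPNN with scalar output, trained by gradient
    descent using forward propagation (FP) and backward propagation (BP)."""

    def __init__(self, n_in, n_hidden=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        """FP: tanh hidden layer followed by a linear output."""
        self.x = np.asarray(x, dtype=float)
        self.h = np.tanh(self.x @ self.W1 + self.b1)
        self.y = float(self.h @ self.W2 + self.b2)
        return self.y

    def backward(self, target):
        """BP: one gradient-descent step on the squared error (y - target)^2."""
        err = self.y - target
        dh = err * self.W2 * (1.0 - self.h ** 2)     # gradient w.r.t. hidden activations
        self.W2 -= self.lr * err * self.h
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(self.x, dh)
        self.b1 -= self.lr * dh

if __name__ == "__main__":
    net = MLP(n_in=6)                                # six state features, as in Eq. (5)
    for _ in range(200):                             # fit a toy constant target
        net.forward([0.5, 0.1, 0.2, 0.05, 0.3, 1.0])
        net.backward(0.7)
    print(round(net.forward([0.5, 0.1, 0.2, 0.05, 0.3, 1.0]), 3))
```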
A. Controller State Space
The controller state space contains the relevant information, including the previous GPF parameter, a compact representative state of the NUTs, and an indication of how close the objective function \Phi[{\cal S}_{t}^{S}] is to its optimal value.
Therefore, the continuous input state space of the controller is represented by the following set of normalized elements:

\begin{equation}
{\cal S}_{t}^{C}=\left\{\alpha_{t-1},\ \mu_{t}^{\hat{T}},\ \sigma_{t}^{\hat{T}},\ d_{t}^{R},\ \vert{\cal U}_{t}\vert,\ \rho_{t}\right\}\tag{5}
\end{equation}
where \mu_{t}^{\hat{T}} and \sigma_{t}^{\hat{T}} represent the mean and the standard deviation, respectively, of the log-normal distribution of the NUTs, and \rho_{t} is the controller flag, which indicates that {\cal S}_{t}^{C}\in{\cal UF} when \rho_{t}=-1, {\cal S}_{t}^{C}\in{\cal OF} when \rho_{t}=0, and that the controller is feasible ({\cal S}_{t}^{C}\in{\cal FA}) when \rho_{t}=1. The flag \rho_{t} is determined based on d_{t}^{R}, the representative CDF distance calculated according to Eq. (6), where d_{t}^{i,R}=\Psi_{i}-\Psi_{i}^{Req}:

\begin{equation}
d_{t}^{R}=\begin{cases}
\max_{i\in{\cal U}_{t}} d_{t}^{i,R}, & \text{if } \exists\, i\in{\cal U}_{t}: d_{t}^{i,R}>0\\
-\min_{i\in{\cal U}_{t}} d_{t}^{i,R}, & \text{if } \nexists\, i\in{\cal U}_{t}: d_{t}^{i,R}>0
\end{cases}\tag{6}
\end{equation}
Basically, if any d_{t}^{i,R}>0 in the CDF representation, then {\cal S}_{t}^{C}\in{\cal UF}; when all percentiles are on the right side of the requirement, such that \nexists d_{t}^{i,R}>0, then {\cal S}_{t}^{C}\in{\cal F}. When d_{t}^{R}\in[0,\zeta], then {\cal S}_{t}^{C}\in{\cal FA}, where \zeta is the superior limit of the feasible region.
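A minimal sketch of Eqs. (5)-(6) is given below; it builds the controller state from the NUT observations under the assumption that \rho_{t} is derived by first testing for the unfair region and then comparing d_{t}^{R} against \zeta. The function and variable names, and the default value of zeta, are illustrative only.

```python
import numpy as np

def controller_state(alpha_prev, nut, psi_req, zeta=0.05):
    """Sketch of Eqs. (5)-(6): controller state S_t^C and the flag rho_t.

    alpha_prev : previous GPF parameter alpha_{t-1}
    nut        : (num_users,) normalized user throughputs (NUTs)
    psi_req    : callable giving the required CDF Psi^Req(x) at NUT x
    zeta       : assumed superior limit of the feasible region
    """
    nut = np.sort(nut)
    psi = np.arange(1, nut.size + 1) / nut.size                 # empirical CDF Psi_i
    d = psi - np.array([psi_req(x) for x in nut])               # distances d_t^{i,R}
    if (d > 0).any():
        d_r, rho = d.max(), -1                                  # unfair region UF
    else:
        d_r = -d.min()                                          # Eq. (6), fair case
        rho = 1 if d_r <= zeta else 0                           # feasible FA vs over-fairness OF
    log_nut = np.log(nut)
    return {"alpha_prev": alpha_prev,
            "mu": log_nut.mean(),                               # log-normal NUT mean
            "sigma": log_nut.std(),                             # log-normal NUT standard deviation
            "d_r": d_r,
            "num_users": nut.size,
            "rho": rho}
```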
B. Reward Function
The reward function is computed from the perspective of the state transition between two consecutive TTIs. When {\cal S}_{t}^{C}\in{\cal OF} (Fig. 1), any increase of \alpha_{k}[t] moves the scheduler further away from the optimal region. Conversely, when {\cal S}_{t}^{C}\in{\cal UF}, it is undesirable to decrease the \alpha_{k}[t] parameter.
Therefore, for the GPF parameterization, \alpha_{k} should be increased when {\cal S}_{t}^{C}\in{\cal UF} and decreased when {\cal S}_{t}^{C}\in{\cal OF}, until the feasible state is reached. Based on these characteristics, the reward function for the GPF parameterization case becomes:

\begin{equation}
{\cal RW}_{t}=\begin{cases}
\alpha_{t-1}-\alpha_{t-2}, & \text{if } \alpha_{t-1}\geq\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal UF},\ {\cal S}_{t}^{C}\in{\cal UF}\\
-1, & \text{if } \alpha_{t-1}<\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in\{{\cal UF},{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal UF}\\
0, & \text{if } \alpha_{t-1}>\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal UF},\ {\cal S}_{t}^{C}\in{\cal OF}\\
1, & \text{if } {\cal S}_{t-1}^{C}\in\{{\cal UF},{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal FA}\\
-1, & \text{if } \alpha_{t-1}\geq\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in\{{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal OF}\\
\alpha_{t-1}-\alpha_{t-2}, & \text{if } \alpha_{t-1}<\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal OF},\ {\cal S}_{t}^{C}\in{\cal OF}
\end{cases}\tag{7}
\end{equation}
The goal of the LTE controller is to find, at each TTI, the optimal policy \pi^{\ast}({\cal S}_{t}^{C},{\cal A}_{t}^{a}) that selects the best action, i.e., the one returning the maximum reward within {\cal S}_{t+1}^{C}. The CACLA RL algorithm performs the trained policy as an actor and improves it, when necessary, as a critic.
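The conditional structure of Eq. (7) is easy to mis-read, so a direct transcription is sketched below; transition combinations not explicitly listed in Eq. (7) default to -1, which is an assumption of this sketch rather than part of the original definition.

```python
def reward(alpha_prev, alpha_prev2, region_prev, region_now):
    """Sketch of Eq. (7); regions are "UF", "FA" or "OF" for the states at
    TTI t-1 (region_prev) and TTI t (region_now)."""
    d_alpha = alpha_prev - alpha_prev2
    if region_now == "FA":
        return 1.0                                  # any transition into the feasible region
    if region_now == "UF":
        if d_alpha >= 0 and region_prev == "UF":
            return d_alpha                          # kept increasing alpha while still unfair
        if d_alpha < 0:
            return -1.0                             # decreased alpha and ended up unfair
    if region_now == "OF":
        if d_alpha > 0 and region_prev == "UF":
            return 0.0                              # overshoot from unfair to over-fairness
        if d_alpha >= 0 and region_prev in ("FA", "OF"):
            return -1.0                             # pushed further into over-fairness
        if d_alpha < 0 and region_prev == "OF":
            return d_alpha                          # still over-fair, but alpha is decreasing
    return -1.0                                     # assumed default for unlisted combinations
```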
C. The CACLA RL Algorithm
The CACLA RL algorithm uses one-dimensional continuous actions {\cal A}_{t}\in\mathbb{R}_{[0,1]}, which implies \Delta\alpha_{k}\in\mathbb{R}_{[0,1]}. In order to find optimal policies, one MLPNN is used to approximate the continuous action A_{t}^{F}({\cal S}_{t}^{C}) and another MLPNN to forward the state value V_{t}^{F}({\cal S}_{t}^{C}). The state value represents the approximated accumulated reward for a given state under the learned policy. The principle of CACLA is to update the action value only if the state target value V_{t}^{T}({\cal S}_{t-1}^{C}) exceeds the current state-value estimate [5]:

\begin{equation}
A_{t}^{T}\left({\cal S}_{t-1}^{C}\right)\ \overset{\eta_{A}}{\longleftarrow}\ A_{t}^{F}\left({\cal S}_{t-1}^{C}\right)\quad \text{if } V_{t}^{T}\left({\cal S}_{t-1}^{C}\right)>V_{t}^{F}\left({\cal S}_{t-1}^{C}\right)\tag{8}
\end{equation}
where V_{t}^{T}({\cal S}_{t-1}^{C})={\cal RW}_{t}({\cal S}_{t-1}^{C},{\cal A}_{t-1})+\gamma\cdot V_{t}^{F}({\cal S}_{t}^{C}), \gamma is the discount factor and \eta_{A} is the action-value learning rate. Despite its very simple architecture, CACLA can locate the optimal state faster than other RL algorithms that use predefined GPF parameter steps, such as Q-learning, QV-learning, SARSA or ACLA [6], [7], [8].
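A minimal sketch of one CACLA step implementing Eq. (8) is shown below; it reuses the illustrative MLP class from the sketch in Section IV, and the discount factor, Gaussian exploration noise, and clipping of the action to [0, 1] are assumptions rather than the authors' settings.

```python
import numpy as np

def cacla_step(actor, critic, state_prev, action_prev, rw, state_now,
               gamma=0.95, sigma=0.05, rng=None):
    """One CACLA update (sketch of Eq. (8)) plus selection of the next action.

    actor, critic : approximators with forward()/backward(), e.g. the MLP sketch above
    state_prev    : controller state S_{t-1}^C in which action_prev was applied
    rw            : reward RW_t(S_{t-1}^C, A_{t-1})
    state_now     : current controller state S_t^C
    """
    rng = rng or np.random.default_rng()
    v_forward = critic.forward(state_prev)             # V_t^F(S_{t-1}^C)
    v_target = rw + gamma * critic.forward(state_now)  # V_t^T(S_{t-1}^C)
    critic.forward(state_prev)                         # re-run FP so BP uses state_prev
    critic.backward(v_target)                          # critic always moves toward the target
    if v_target > v_forward:                           # Eq. (8): update the actor only on improvement
        actor.forward(state_prev)
        actor.backward(action_prev)                    # pull A^F(S_{t-1}^C) toward the explored action
    # Gaussian exploration around the actor output, kept inside the continuous action range [0, 1]
    return float(np.clip(actor.forward(state_now) + rng.normal(0.0, sigma), 0.0, 1.0))
```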
SECTION V.
Simulation Results
We consider a dynamic scenario with a fluctuating traffic load in the interval of [10, 120] active data flows/users with infinite buffers. The analyzed scheduling policies run on parallel schedulers that use the same shadowing, path loss, multi-path, and interference models. In order to test the impact of the proposed algorithms on the performance metrics, the number of active users is randomly switched every 1 s, revealing the generality of the proposed scheduling policy. The remaining parameters are listed in Table I.
The scheduling policy obtained with CACLA RL is compared against the methods proposed in [4] (MT) and [5] (AS), and against policies obtained by exploring with discrete-action RL algorithms. For all RL algorithms, the exploration is performed using \varepsilon-greedy actions. Figure 3 shows that the CACLA policy outperforms the other policies in terms of the number of TTIs spent in the {\cal UF} and {\cal FA} regions.
SECTION VI.
Conclusions
In this paper, the CACLA RL algorithm is used to adapt and apply the best fairness parameter in a dynamic radio environment for LTE-Advanced networks. We showed that CACLA minimizes the number of TTIs in which the system is declared unfair while, at the same time, speeding up the convergence by minimizing the number of punishments ({\cal RW}_{t}=-1) when the number of active users changes dramatically (Fig. 4).