In Orthogonal Frequency Division Multiple Access (OFDMA) radio access networks, the throughput-fairness tradeoff optimization problem aims to maximize the total cell throughput while maintaining a certain level of fairness between user throughputs. One way to maximize the total system throughput subject to fairness constraints is to use opportunistic schedulers, such as the channel-aware Generalized Proportional Fair (GPF) scheduling rule, which exploit the multi-user diversity principle. Different tradeoff levels can therefore be obtained by properly parameterizing the GPF scheduling scheme [1].
The Next Generation Mobile Networks (NGMN) fairness requirement [2] is adopted as the fairness criterion; it requires a predefined user throughput distribution to be achieved. According to the NGMN concept, the scheduler is considered fair if and only if the Cumulative Distribution Function (CDF) of the normalized user throughputs (NUTs) satisfies a predefined percentile requirement. Based on the instantaneous scheduler state (channel conditions, user throughputs, and traffic loads), the GPF rule should be adapted such that the obtained CDF curve of NUTs respects the NGMN fairness condition. By assuming that the NGMN optimality criterion depends only on the previous GPF parameterization, the scheduling procedure can be modeled as a Markov Decision Process (MDP) that respects the Markov property.
The novelty of the current work lies in exploring the unknown behavior of the scheduler states in order to learn the optimal policy of GPF parameters such that the NGMN fairness requirement is satisfied at each Transmission Time Interval (TTI). The Continuous Actor-Critic Learning Automaton (CACLA) reinforcement learning (RL) algorithm is proposed to solve the resulting MDP problem by selecting optimal actions. The quality of applying different continuous GPF parameters in different continuous scheduler states is approximated by a nonlinear Multilayer Perceptron Neural Network (MLPNN). The rest of the paper is organized as follows: Section II formulates the fairness-throughput tradeoff optimization problem. Section III presents the related work. Section IV analyzes the elements of the proposed controller. Section V presents the simulation results, and Section VI concludes the paper.
SECTION II.
User Fairness and System Throughput Tradeoff Optimization Problem
In LTE packet scheduling, a set {\cal U}_{t} of preselected users is scheduled at each TTI t in the frequency domain by using a set {\cal B} of grouped OFDMA sub-carriers denoted as resource blocks (RBs). The resource allocation procedure in the time-frequency domain follows, at each TTI t, the integer linear programming optimization problem shown in Eq. (1):

\begin{align}
\max_{b_{i,j}}\quad & \sum_{i\in{\cal U}_{t}}\sum_{j\in{\cal B}} b_{i,j}[t]\cdot\frac{r_{i,j}[t]}{\left(\overline{T}_{i}[t]\right)^{\alpha_{k}[t]}}\nonumber\\
\text{s.t.}\quad & \sum_{i\in{\cal U}_{t}} b_{i,j}[t]=1,\ \forall j\in{\cal B}\nonumber\\
& b_{i,j}[t]\in\{0,1\},\ \forall i\in{\cal U}_{t},\ \forall j\in{\cal B}\tag{1}
\end{align}
where b_{i,j} is the allocation indicator, r_{i,j} is the achievable rate of user i on RB j, and \overline{T}_{i} denotes the throughput of user i averaged over a number of TTIs using an exponential moving average filter [1]. The fairness-throughput tradeoff is tuned by the parameter \alpha_{k}[t], which can be adapted TTI-by-TTI in order to meet the objective function. When \alpha_{k}=0 for the entire transmission, the obtained GPF rule maximizes the throughput (MT). If \alpha_{k}=1, the obtained scheme is entitled Proportional Fair (PF), and when \alpha_{k} is very large (\alpha_{k}\rightarrow\infty), the scheduler maximizes the fairness between users at the expense of system throughput and the obtained rule is entitled Maximum Fairness (MF). The optimal scheduler state, in which the NGMN requirement is respected, can be achieved by setting \alpha_{k} at each TTI t such that:

\begin{equation}
\alpha_{k}[t]=\alpha_{k}[t-1]+\Delta\alpha_{k}\tag{2}
\end{equation}
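For illustration only, a minimal sketch of the per-TTI GPF allocation in Eq. (1) is given below; since the objective decouples across RBs, each RB can be assigned greedily to the user with the largest GPF metric. The array names, sizes, and random inputs are assumptions, not part of the original system.

```python
import numpy as np

def gpf_allocate(rates, avg_thr, alpha):
    """Greedy per-RB GPF allocation for one TTI (sketch of Eq. (1)).

    rates   : (num_users, num_rbs) achievable rates r_{i,j}[t]
    avg_thr : (num_users,) exponentially averaged user throughputs T_i[t]
    alpha   : GPF fairness parameter alpha_k[t] (0 = MT, 1 = PF, large = MF)
    Returns a 0/1 allocation matrix b_{i,j}[t] with one user per RB.
    """
    metric = rates / (avg_thr[:, None] ** alpha)      # GPF metric for every user/RB pair
    winners = np.argmax(metric, axis=0)               # best user on each RB
    b = np.zeros_like(rates, dtype=int)
    b[winners, np.arange(rates.shape[1])] = 1         # enforces sum_i b_{i,j} = 1 per RB
    return b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = gpf_allocate(rng.uniform(0.1, 1.0, (10, 25)), rng.uniform(0.1, 1.0, 10), alpha=1.0)
    print(b.sum(axis=0))   # exactly one allocation per RB
```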
where \Delta\alpha_{k} is the \alpha_{k}[t] parameter step. Let us define {\cal A}_{k}[t]\in{\cal A}=\{\Delta\alpha_{k}\}, k=1,\ldots,\vert{\cal A}\vert, as the decision vector of the action taken at TTI t in order to bring the scheduler near or into the optimal state. The action set {\cal A} can be discrete or continuous. Obviously, the current action {\cal A}_{k}[t] should be chosen such that the tradeoff objective function \Phi[t+1] in the next state is maximized. By using the estimation operator \mathbb{E}\{\cdot\}, the tradeoff action can be seen as the decision vector of a second linear programming optimization problem, which should be solved before Eq. (1):

\begin{align}
\max_{{\cal A}_{k}}\quad & \sum_{k\in{\cal A}} {\cal A}_{k}[t]\cdot\mathbb{E}\left\{\Phi\left[{\cal S}_{t+1}^{S}\right]\right\}\nonumber\\
\text{s.t.}\quad & \sum_{k\in{\cal A}} {\cal A}_{k}[t]=1,\nonumber\\
& {\cal A}_{k}[t]\in\{0,1\},\ \forall k=1,\ldots,\vert{\cal A}\vert\tag{3}
\end{align}
where {\cal S}_{t+1}^{S} represents the scheduler state at the next TTI t+1. The NGMN objective function \Phi[{\cal S}_{t}^{S}] is calculated from the CDF of the NUT observation set \{\hat{\overline{T}}_{i}[t]\}\subset{\cal S}_{t}^{S}, i=1,\ldots,\vert{\cal U}_{t}\vert, as shown in Fig. 1. The NGMN fairness requirement is the continuous oblique line. If the CDF curve is located on the left side of the NGMN requirement (the MT rule case), the system is considered unfair ({\cal S}_{t+1}^{S}\in{\cal UF}), and when the CDF curve lies on the right side (the MF and PF rule cases), the system is declared fair ({\cal S}_{t+1}^{S}\in{\cal F}). In order to determine the optimal or feasible region in the CDF domain, a superior limit of the NGMN requirement is imposed (the dotted oblique line). In this sense, the fair area is divided into two sub-regions: feasible ({\cal S}_{t+1}^{S}\in{\cal FA}) and over-fairness ({\cal S}_{t+1}^{S}\in{\cal OF}), where \{{\cal F}\}=\{{\cal FA}\}\cup\{{\cal OF}\}. As seen in Fig. 1, only the green curve respects the feasibility condition. Therefore, at each TTI the action {\cal A}_{k}[t] should be chosen such that {\cal S}_{t+1}^{S}\in{\cal FA}. Based on Fig. 1, the NGMN objective function is calculated according to Eq. (4):

\begin{equation}
\Phi\left[{\cal S}_{t}^{S}\right]=\frac{1}{\vert{\cal U}_{t}\vert}\sum_{i\in{\cal U}_{t}}\left[\Psi\left(\hat{\overline{T}}_{i}\right)-\Psi^{Req}\left(\hat{\overline{T}}_{i}\right)\right]\leq 0\tag{4}
\end{equation}
where \Psi and \Psi^{Req} represent the CDF of the NUTs and the NGMN fairness requirement, respectively. Equation (4) is the cost function which appraises the quality of the action \Delta\alpha_{k} taken in the previous state. Due to the noisy characteristic of Eq. (4), the CACLA RL algorithm requires an additional value function in order to learn optimal policies of GPF parameters such that {\cal S}_{t}^{S}\in{\cal FA}.
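As an illustration of Eq. (4), the sketch below computes \Phi from the empirical CDF of the NUTs; the linear requirement used for psi_req is only a placeholder for the NGMN curve of Fig. 1, and all names are illustrative.

```python
import numpy as np

def ngmn_objective(nut, psi_req):
    """Sketch of Eq. (4): mean gap between the empirical CDF of the normalized
    user throughputs (NUTs) and the NGMN requirement. A value <= 0 means the
    fairness constraint is met.

    nut     : (num_users,) NUT observations
    psi_req : callable returning the required CDF value Psi^Req(x) at NUT x
    """
    nut = np.sort(nut)
    psi = np.arange(1, nut.size + 1) / nut.size          # empirical CDF Psi at each observation
    return float(np.mean(psi - np.array([psi_req(x) for x in nut])))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    req = lambda x: float(np.clip(x - 0.5, 0.0, 1.0))    # placeholder oblique requirement
    print(ngmn_objective(rng.lognormal(mean=0.0, sigma=0.3, size=50), req))
```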
SECTION III.
Related Work
The parameterization of the GPF scheduler for system throughput maximization under the NGMN requirement is discussed in [3]. The impact of the traffic load and of user rate constraints is considered when the CDF distribution of \hat{T}_{k,t} is determined. Unfortunately, the adaptation process is performed at different time scales in order to make the proposal suitable for real-time scheduling, leading to inflexible behavior when severe changes in the network conditions occur. In [4], an off-line procedure for adapting the \alpha parameter subject to different temporal fairness index constraints is proposed. The expected user throughput is calculated at the beginning of each TTI in order to predict the current state of the average user throughput before the scheduling decision. However, the traffic load is not considered, and the method cannot be applied to real systems due to its high complexity cost when the number of active flows increases. In this study, the method from [4] is slightly modified in the sense that \alpha_{k}[t] is adapted based on the NGMN constraint, where the CDF is calculated from the predicted throughput. The balance of the system throughput and user fairness tradeoff is analyzed in [1], where the traffic load is categorized based on the CQI reports. The normalized system throughput and the Jain Fairness Index are considered as part of the input state. The Q-Learning algorithm is used to learn different policies that converge very well to different tradeoff levels. However, the concept is not extended to a dynamic fairness requirement.
SECTION IV.
Proposed Architecture
The interaction between the controller and the scheduler shown in Fig. 2 is modeled in two stages: exploration and exploitation. In the first stage, the LTE controller receives a new state {\cal S}_{t}^{C}, which is an aggregated version of {\cal S}_{t}^{S}. Based on the trial-and-error principle, the controller takes random actions that are mapped into scheduling decisions by the scheduler. At the beginning of the next TTI, the scheduler ranks the previous scheduling decision based on the reward function {\cal RW}_{t}({\cal S}_{t-1}^{C},{\cal A}_{t-1}^{a}). Basically, the reward function {\cal RW}_{t} indicates how much closer to, or farther from, its objective the function \Phi[{\cal S}_{t}^{S}] is compared with the previous state when action {\cal A}_{t-1}^{a} is applied. The target of the exploration stage is to form a policy of scheduling decisions that follows those actions which maximize the sum of future rewards for every initial state. The exploitation stage applies the learned policy TTI-by-TTI. In order to learn the optimal policy, an MLPNN non-linear function is required to map the continuous and multidimensional state {\cal S}_{t}^{C} into optimal continuous GPF parameters. In this sense, the MLPNN weights are trained using the gradient descent algorithm with forward propagation (FP) and backward propagation (BP). The BP step minimizes the error between the target output and the output obtained through the FP procedure. The way the error and the target values are calculated determines the type of RL algorithm used for the optimal GPF parameterization.
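The sketch below shows a minimal one-hidden-layer MLPNN trained by gradient descent with FP and BP, of the kind the controller could use to map {\cal S}_{t}^{C} to a scalar output (a GPF parameter or a state value); the layer size, activation, and learning rate are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

class MLP:
    """Minimal one-hidden-layer MLPNN with scalar output, trained by gradient
    descent using forward propagation (FP) and backward propagation (BP)."""

    def __init__(self, n_in, n_hidden=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        """FP: tanh hidden layer followed by a linear output."""
        self.x = np.asarray(x, dtype=float)
        self.h = np.tanh(self.x @ self.W1 + self.b1)
        self.y = float(self.h @ self.W2 + self.b2)
        return self.y

    def backward(self, target):
        """BP: one gradient-descent step on the squared error (y - target)^2."""
        err = self.y - target
        dh = err * self.W2 * (1.0 - self.h ** 2)     # gradient w.r.t. hidden activations
        self.W2 -= self.lr * err * self.h
        self.b2 -= self.lr * err
        self.W1 -= self.lr * np.outer(self.x, dh)
        self.b1 -= self.lr * dh

if __name__ == "__main__":
    net = MLP(n_in=6)                                # six state features, as in Eq. (5)
    for _ in range(200):                             # fit a toy constant target
        net.forward([0.5, 0.1, 0.2, 0.05, 0.3, 1.0])
        net.backward(0.7)
    print(round(net.forward([0.5, 0.1, 0.2, 0.05, 0.3, 1.0]), 3))
```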
A. Controller State Space
The controller state space contains the relevant information, including the previous GPF parameter, a compact representative state of the NUTs, and an indication of how close the objective function \Phi[{\cal S}_{t}^{S}] is to its optimal value.
Therefore, the continuous input state space of the controller is represented by the following set of normalized elements:

\begin{equation}
{\cal S}_{t}^{C}=\left\{\alpha_{t-1},\ \mu_{t}^{\hat{T}},\ \sigma_{t}^{\hat{T}},\ d_{t}^{R},\ \vert{\cal U}_{t}\vert,\ \rho_{t}\right\}\tag{5}
\end{equation}
where \mu_{t}^{\hat{T}} and \sigma_{t}^{\hat{T}} represent the mean and the standard deviation, respectively, of the log-normal distribution of the NUTs, and \rho_{t} is the controller flag, which indicates that {\cal S}_{t}^{C}\in{\cal UF} when \rho_{t}=-1, {\cal S}_{t}^{C}\in{\cal OF} when \rho_{t}=0, and that the controller is feasible ({\cal S}_{t}^{C}\in{\cal FA}) when \rho_{t}=1. The flag \rho_{t} is determined based on d_{t}^{R}, the representative CDF distance calculated according to Eq. (6), where d_{t}^{i,R}=\Psi_{i}-\Psi_{i}^{Req}:

\begin{equation}
d_{t}^{R}=\begin{cases}
\max_{i\in{\cal U}_{t}} d_{t}^{i,R}, & \text{if } \exists\, i\in{\cal U}_{t}: d_{t}^{i,R}>0\\
-\min_{i\in{\cal U}_{t}} d_{t}^{i,R}, & \text{if } \nexists\, i\in{\cal U}_{t}: d_{t}^{i,R}>0
\end{cases}\tag{6}
\end{equation}
Basically, if any d_{t}^{i,R}>0 in the CDF representation, then {\cal S}_{t}^{C}\in{\cal UF}; when all percentiles are on the right side of the requirement, such that \nexists d_{t}^{i,R}>0, then {\cal S}_{t}^{C}\in{\cal F}. When d_{t}^{R}\in[0,\zeta], then {\cal S}_{t}^{C}\in{\cal FA}, where \zeta is the superior limit of the feasible region.
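A minimal sketch of Eqs. (5)-(6) is given below; it builds the controller state from the NUT observations under the assumption that \rho_{t} is derived by first testing for the unfair region and then comparing d_{t}^{R} against \zeta. The function and variable names, and the default value of zeta, are illustrative only.

```python
import numpy as np

def controller_state(alpha_prev, nut, psi_req, zeta=0.05):
    """Sketch of Eqs. (5)-(6): controller state S_t^C and the flag rho_t.

    alpha_prev : previous GPF parameter alpha_{t-1}
    nut        : (num_users,) normalized user throughputs (NUTs)
    psi_req    : callable giving the required CDF Psi^Req(x) at NUT x
    zeta       : assumed superior limit of the feasible region
    """
    nut = np.sort(nut)
    psi = np.arange(1, nut.size + 1) / nut.size                 # empirical CDF Psi_i
    d = psi - np.array([psi_req(x) for x in nut])               # distances d_t^{i,R}
    if (d > 0).any():
        d_r, rho = d.max(), -1                                  # unfair region UF
    else:
        d_r = -d.min()                                          # Eq. (6), fair case
        rho = 1 if d_r <= zeta else 0                           # feasible FA vs over-fairness OF
    log_nut = np.log(nut)
    return {"alpha_prev": alpha_prev,
            "mu": log_nut.mean(),                               # log-normal NUT mean
            "sigma": log_nut.std(),                             # log-normal NUT standard deviation
            "d_r": d_r,
            "num_users": nut.size,
            "rho": rho}
```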
B. Reward Function
The reward function is computed from the perspective of the state transition between two consecutive TTIs. When {\cal S}_{t}^{C}\in{\cal OF} (Fig. 1), any increase of \alpha_{k}[t] moves the scheduler further away from the optimal region. Conversely, when {\cal S}_{t}^{C}\in{\cal UF}, it is undesirable to decrease the \alpha_{k}[t] parameter.
Therefore, for the GPF parameterization, \alpha_{k} should be increased when {\cal S}_{t}^{C}\in{\cal UF} and decreased when {\cal S}_{t}^{C}\in{\cal OF}, until the feasible state is reached. Based on these characteristics, the reward function for the GPF parameterization case becomes:

\begin{equation}
{\cal RW}_{t}=\begin{cases}
\alpha_{t-1}-\alpha_{t-2}, & \text{if } \alpha_{t-1}\geq\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal UF},\ {\cal S}_{t}^{C}\in{\cal UF}\\
-1, & \text{if } \alpha_{t-1}<\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in\{{\cal UF},{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal UF}\\
0, & \text{if } \alpha_{t-1}>\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal UF},\ {\cal S}_{t}^{C}\in{\cal OF}\\
1, & \text{if } {\cal S}_{t-1}^{C}\in\{{\cal UF},{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal FA}\\
-1, & \text{if } \alpha_{t-1}\geq\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in\{{\cal FA},{\cal OF}\},\ {\cal S}_{t}^{C}\in{\cal OF}\\
\alpha_{t-1}-\alpha_{t-2}, & \text{if } \alpha_{t-1}<\alpha_{t-2},\ {\cal S}_{t-1}^{C}\in{\cal OF},\ {\cal S}_{t}^{C}\in{\cal OF}
\end{cases}\tag{7}
\end{equation}
The goal of the LTE controller is to find, at each TTI, the optimal policy \pi^{\ast}({\cal S}_{t}^{C},{\cal A}_{t}^{a}) that selects the best action, i.e., the one returning the maximum reward within {\cal S}_{t+1}^{C}. The CACLA RL algorithm performs the trained policy as an actor and improves it, when necessary, as a critic.
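The conditional structure of Eq. (7) is easy to mis-read, so a direct transcription is sketched below; transition combinations not explicitly listed in Eq. (7) default to -1, which is an assumption of this sketch rather than part of the original definition.

```python
def reward(alpha_prev, alpha_prev2, region_prev, region_now):
    """Sketch of Eq. (7); regions are "UF", "FA" or "OF" for the states at
    TTI t-1 (region_prev) and TTI t (region_now)."""
    d_alpha = alpha_prev - alpha_prev2
    if region_now == "FA":
        return 1.0                                  # any transition into the feasible region
    if region_now == "UF":
        if d_alpha >= 0 and region_prev == "UF":
            return d_alpha                          # kept increasing alpha while still unfair
        if d_alpha < 0:
            return -1.0                             # decreased alpha and ended up unfair
    if region_now == "OF":
        if d_alpha > 0 and region_prev == "UF":
            return 0.0                              # overshoot from unfair to over-fairness
        if d_alpha >= 0 and region_prev in ("FA", "OF"):
            return -1.0                             # pushed further into over-fairness
        if d_alpha < 0 and region_prev == "OF":
            return d_alpha                          # still over-fair, but alpha is decreasing
    return -1.0                                     # assumed default for unlisted combinations
```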
C. The CACLA RL Algorithm
The CACLA RL algorithm uses one-dimensional continuous actions {\cal A}_{t}\in\mathbb{R}_{[0,1]}, which implies \Delta\alpha_{k}\in\mathbb{R}_{[0,1]}. In order to find optimal policies, one MLPNN is used to approximate the continuous action A_{t}^{F}({\cal S}_{t}^{C}) and another MLPNN to forward the state value V_{t}^{F}({\cal S}_{t}^{C}). The state value represents the approximated accumulated reward for a given state under the learned policy. The principle of CACLA is to update the action value only if the state target value V_{t}^{T}({\cal S}_{t-1}^{C}) exceeds the current state-value estimate [5]:

\begin{equation}
A_{t}^{T}\left({\cal S}_{t-1}^{C}\right)\ \overset{\eta_{A}}{\longleftarrow}\ A_{t}^{F}\left({\cal S}_{t-1}^{C}\right)\quad \text{if } V_{t}^{T}\left({\cal S}_{t-1}^{C}\right)>V_{t}^{F}\left({\cal S}_{t-1}^{C}\right)\tag{8}
\end{equation}
where V_{t}^{T}({\cal S}_{t-1}^{C})={\cal RW}_{t}({\cal S}_{t-1}^{C},{\cal A}_{t-1})+\gamma\cdot V_{t}^{F}({\cal S}_{t}^{C}), \gamma is the discount factor and \eta_{A} is the action-value learning rate. Despite its very simple architecture, CACLA can locate the optimal state faster than other RL algorithms that use predefined GPF parameter steps, such as Q-learning, QV-learning, SARSA or ACLA [6], [7], [8].
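A minimal sketch of one CACLA step implementing Eq. (8) is shown below; it reuses the illustrative MLP class from the sketch in Section IV, and the discount factor, Gaussian exploration noise, and clipping of the action to [0, 1] are assumptions rather than the authors' settings.

```python
import numpy as np

def cacla_step(actor, critic, state_prev, action_prev, rw, state_now,
               gamma=0.95, sigma=0.05, rng=None):
    """One CACLA update (sketch of Eq. (8)) plus selection of the next action.

    actor, critic : approximators with forward()/backward(), e.g. the MLP sketch above
    state_prev    : controller state S_{t-1}^C in which action_prev was applied
    rw            : reward RW_t(S_{t-1}^C, A_{t-1})
    state_now     : current controller state S_t^C
    """
    rng = rng or np.random.default_rng()
    v_forward = critic.forward(state_prev)             # V_t^F(S_{t-1}^C)
    v_target = rw + gamma * critic.forward(state_now)  # V_t^T(S_{t-1}^C)
    critic.forward(state_prev)                         # re-run FP so BP uses state_prev
    critic.backward(v_target)                          # critic always moves toward the target
    if v_target > v_forward:                           # Eq. (8): update the actor only on improvement
        actor.forward(state_prev)
        actor.backward(action_prev)                    # pull A^F(S_{t-1}^C) toward the explored action
    # Gaussian exploration around the actor output, kept inside the continuous action range [0, 1]
    return float(np.clip(actor.forward(state_now) + rng.normal(0.0, sigma), 0.0, 1.0))
```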
SECTION V.
Simulation Results
We consider a dynamic scenario with a fluctuating traffic load in the interval of [10, 120] active data flows/users with infinite buffers. The analyzed scheduling policies run on parallel schedulers that use the same shadowing, path loss, multi-path, and interference models. In order to test the impact of the proposed algorithms on the performance metrics, the number of active users is randomly switched every 1 s, revealing the generality of the proposed scheduling policy. The remaining parameters are listed in Table I.
The scheduling policy obtained with CACLA RL is compared against the methods proposed in [4] (MT) and [5] (AS), and against policies obtained by exploring with discrete-action RL algorithms. For all RL algorithms, the exploration is performed using \varepsilon-greedy actions. Figure 3 shows that the CACLA policy outperforms the other policies in terms of the number of TTIs spent in the {\cal UF} and {\cal FA} regions.
SECTION VI.
Conclusions
In this paper, the CACLA RL algorithm is used to adapt and apply the best fairness parameter in a dynamic radio environment for LTE-Advanced networks. We showed that CACLA minimizes the number of TTIs in which the system is declared unfair while, at the same time, speeding up the convergence by minimizing the number of punishments ({\cal RW}_{t}=-1) when the number of active users changes dramatically (Fig. 4).