Introduction
The ever-growing number of vehicles worldwide is challenging current traffic systems. Especially during morning and evening rush hours, congestion easily occurs due to insufficient road capacity. Traffic jams not only increase commute times for individuals, but also have negative impacts on society, including environmental, economic, and health issues caused by the large amount of emissions and the loss of productive time. Constructing new lanes and expanding the freeway network can alleviate these issues. However, this is not always feasible due to spatial, financial, or environmental restrictions. Efficient management of traffic on the existing infrastructure is a promising alternative to improve traffic efficiency and safety. Among freeway control measures, ramp metering (RM) [1] and variable speed limits (VSLs) [2] are the most widely used strategies, and they have been shown to substantially decrease travel delays in various real-world implementations [3], [4]. These two control measures can be used either independently or in a coordinated manner within a control method, such as model predictive control (MPC) [5] or deep reinforcement learning (DRL) [6]. MPC and DRL are two powerful control techniques that have been studied extensively in the literature. Both have been applied successfully to freeway traffic control, but they also come with their own shortcomings.
MPC is a model-based optimal control approach, and its theory on stability and feasibility has matured since the 1990s [7]. It is widely applied in industry and many other fields because of its robustness and its ability to explicitly handle input and state constraints, and thus to satisfy safety requirements, which is a crucial concern in many real-world applications. However, an accurate mathematical model is usually required for MPC to guarantee the closed-loop performance, while acquiring such a model is often not possible in practice. In particular, large-scale and complex systems, such as freeway networks, lead to highly nonlinear and non-convex optimization problems with many variables, which are difficult to solve in real time [8]. Even though several approaches have been developed to improve the computational efficiency of MPC [9], [10], [11], [12], optimality and satisfaction of state constraints cannot always be guaranteed in the presence of model mismatches and external disturbances. Robust and stochastic MPC methods [13], including tube-based MPC [14] and scenario-based MPC [15], can address uncertainties to some extent. However, these methods require assumptions on or descriptions of the uncertainties that are often difficult to validate.
DRL is a more recent technique that has shown its success and potential in the field of control, including intelligent traffic control. Unlike conventional reinforcement learning algorithms, DRL deploys artificial neural networks to deal with high-dimensional state and action spaces, which addresses the so-called curse of dimensionality [16]. Nevertheless, DRL still faces several challenges in real-world applications [17]. For example, safety constraints are of significant importance in the operation of real-world systems, while satisfaction of constraints cannot be guaranteed during the learning phase and the implementation of DRL. In addition, the low sample efficiency and the delayed rewards for large-scale systems (e.g., traffic networks) remain considerable challenges for DRL and are still active research topics [17].
Both MPC and DRL have their advantages and disadvantages, and they complement each other well (see Table I). On the one hand, MPC suffers from degraded performance due to model uncertainties and external disturbances. Moreover, large-scale systems introduce many variables and long prediction horizons, which can make MPC computationally intractable in real time. On the other hand, DRL can naturally cope with uncertainties and tackle infinite prediction horizons with negligible online computational effort. However, it is usually time-consuming to train a well-performing DRL agent from scratch, especially for complex systems. Although there are clear potential benefits in combining MPC and DRL, very limited work has been done to explore the synergy between these two methods. In addition, very little work has been done to apply combined MPC-(D)RL algorithms in the field of traffic management. One of the representative studies is [18], which applied a model-reference framework that utilizes MPC and a deep Q-network algorithm to urban traffic signal control.
The current paper contributes to the state-of-the-art by proposing a novel framework for combining MPC and DRL, and by applying it to traffic management of freeway networks. To be more specific:
Different from the previous work [18], the newly proposed MPC-DRL framework adopts a hierarchical structure in order to incorporate the advantages of both MPC and DRL. The combined framework can learn from the environment, while providing a basic level of control performance. By taking advantage of the dynamic model knowledge and of environment information, the framework can deal with uncertainties and improve the sample efficiency of the learning process. In particular, an efficient MPC controller operates at the upper control level with a low control frequency to provide initial optimality while explicitly incorporating the constraints. Meanwhile, a DRL agent works at the lower control level with a high control frequency in order to modify the MPC outputs and to compensate for model mismatches that affect MPC. Because of the hierarchical structure and the multi-frequency control strategy, the proposed framework achieves a good balance between computational tractability and control performance.
The resulting MPC-DRL framework is implemented on a benchmark freeway network, and the results validate the effectiveness of the proposed method. In particular, the objective function of MPC and the reward function of DRL are designed such that the two components complement each other. In addition to MPC, the DRL agent addresses the state and input constraints by penalizing constraint violations in its reward function. Simulation results show that the combined MPC-DRL framework outperforms the other controllers in terms of control performance, constraint satisfaction, and computational efficiency.
The rest of this paper is organized as follows: Section II summarizes related work about MPC and DRL and their application in freeway traffic management, as well as the latest MPC-DRL algorithms and their applications. Section III presents and provides details on the novel MPC-DRL framework that is proposed in this paper. Section IV gives a case study that implements MPC, DRL, and the proposed MPC-DRL framework on the same benchmark network. Finally, Section V concludes the paper and proposes topics for future work.
Related Work
A large number of studies about traffic management of freeway networks exist in the literature, and a recent comprehensive survey is given in [19]. Among all the traffic control approaches, MPC and DRL have drawn significant attention because of their appealing features. MPC and DRL have been developed and applied for both freeway and urban traffic networks. As the case study in Section IV involves a freeway traffic network, we mainly focus on MPC and DRL for freeway traffic control in this section. After that, current research gaps regarding combined MPC-DRL methods are analyzed.
A. MPC for Freeway Traffic Control
The idea of utilizing rolling-horizon optimization in traffic signal control was first introduced by Gartner [20], after which the suggestion of adopting MPC in traffic signal control was formally made by De Schutter and De Moor in [21]. Since then, extensive studies on MPC have been carried out in the field of traffic control, including railway [22], urban [23], and freeway traffic networks [8]. In particular, an RM strategy and a VSL strategy were adopted in MPC for freeway traffic control in [24] and [25], respectively. These two control measures were first coordinated within MPC in the work of Hegyi et al. [8].
As an online optimization-based control method, MPC struggles with computational complexity, especially when the scale of the freeway network is large. Therefore, a large amount of effort has been devoted to alleviating this issue. One major direction is to reduce the complexity of the dynamic model of the freeway network, and many efficient mathematical models have been developed to describe traffic flow dynamics, such as METANET [26] and the cell transmission model (CTM) [27].
The other direction to improve the computational efficiency of MPC is to simplify the problem by linearizing it, or by adopting efficient optimization techniques. For example, Zegeye et al. [9] employed a parameterized MPC technique to reduce the number of decision variables of the optimization problem. By introducing state-feedback control laws, the control inputs can be described as a function of the states and several function parameters; thus, only the parameters need to be optimized to obtain the control inputs. Jeschke et al. further extended this approach by using a grammatical evolution method to generate the state-feedback laws automatically, and applied it to urban traffic control [12]. Ferrara et al. [28] incorporated an event-triggered mechanism into the MPC framework to reduce the frequency of solving the optimization problems. In addition, the finite-horizon optimization problem within their MPC scheme is formulated as a mixed-integer linear programming problem that can be solved efficiently, thanks to the revised linear model obtained from the CTM.
Despite the success that efficient MPC algorithms have achieved, MPC still suffers from issues caused by uncertainties, since it relies heavily on the prediction model. Mismatches between the (macroscopic) prediction model and the real-world traffic system, as well as external disturbances, are inevitable and deteriorate the closed-loop performance of MPC.
To address these issues, a few studies have considered robust MPC for freeway traffic control. For example, Liu et al. [29] utilized a scenario-based approach [15] to describe the uncertainties as a set of scenarios with their corresponding probabilities, including global uncertainties (e.g., global weather conditions) and local uncertainties (e.g., local weather conditions, local traffic compositions, and local demands at the origins). Coordinated with distributed MPC (DMPC), the resulting scenario-based DMPC improves the control performance for a large-scale freeway network considering some uncertainties. Nevertheless, current robust MPC algorithms for freeway traffic control require assumptions and simplifications about the uncertainties and disturbances that are usually hard to satisfy in practice. Moreover, the extra computational burden introduced by robust MPC methods is another issue. Therefore, developing efficient and uncertainty-resistant MPC algorithms for traffic management remains a challenging and urgent task.
B. DRL for Freeway Traffic Control
Reinforcement learning (RL) [30] is a machine learning technique that usually follows a two-stage procedure. In the first stage, the RL agent learns how to take actions by interacting with the environment/system (or a model of it), in order to maximize a notion of cumulative reward. After that, the trained RL agent is implemented for control. RL is attracting more and more interest from the systems and control community, since it can naturally deal with uncertainties and automatically learn a long-term optimal policy through interaction with the environment. However, conventional RL has an obvious drawback, namely the curse of dimensionality [16]. As a result, existing studies that use conventional RL for traffic management can only deal with small traffic networks [31], [32].
The emergence of DRL algorithms significantly broadens the applicability of RL and unlocks great potential in various fields. The neural networks introduced in DRL can handle more complex state and action spaces [31]. DRL has also been studied for intelligent traffic signal control [33], and a recent survey is offered in [34]. In addition, DRL has been successfully applied to other problems. For example, Zhang et al. [35] used DRL to solve a dynamic travelling salesman problem, and achieved substantial improvements within a very short computation time compared with other baseline approaches.
Despite the great progress in DRL techniques, the limitations of DRL mentioned in Section I still apply. Moreover, current work mainly focuses on urban traffic signal control, while research on DRL for freeway traffic control is still limited [36], [37], [38]. To the best of our knowledge, [39] is the only paper that coordinates VSLs and RM with a DRL algorithm; there, both the DDPG and TD3 [40] algorithms are implemented and their performance is compared. It is also shown that a centralized DRL agent can handle a large freeway network with multiple hybrid VSL-RM controllers. In addition, very few studies consider the state constraints. For example, the queue length of the on-ramps should be constrained, since otherwise it interferes with the connected urban road network and safety issues may occur. Moreover, although a lot of research has studied how to improve the practicability of learning-based methods, such as by training with real-world data or by pre-training (i.e., before implementation), there is still a huge gap between real-world deployment and simulator-based applications.
DRL methods have the potential to deal with uncertain environments, but they also suffer from the requirement of a prolonged training process (i.e., low sample efficiency), as well as the lack of performance and safety guarantees. How to maintain the positive features of DRL, while circumventing the drawbacks remains an interesting and relevant research topic.
C. Current Research Gaps in MPC-(D)RL Methods
Considering the features of both MPC and DRL, the idea of merging these two methods to exploit their complementary advantages is promising. Although a few studies have investigated this topic, the current methods have their corresponding drawbacks, and MPC-DRL methods have hardly been applied in the field of traffic management, in particular for freeway traffic control. Therefore, in this subsection, the latest work relevant to combined MPC-(D)RL methods is analyzed, which further motivates this paper.
The paper [41] is the earliest work that utilizes a value function to approximate the infinite-horizon objective function of MPC, where a Markov Decision Process (MDP), i.e., a discrete-time stochastic control process, is used as the prediction model. Moreover, the prediction horizon is reduced to look only one step ahead, while accounting for the long-term value of the performance criteria. The value function can be learned gradually online using RL techniques, and meanwhile MPC operates with a simplified optimization problem to provide data samples. This work opened up a research direction for combining MPC and RL algorithms, and inspired subsequent research. The method was extended to more general dynamics in [42], where two different value function approximations are used and implemented for various control examples, including the inverted pendulum, the double pendulum, and the acrobot. However, the learning process still struggles with low sample efficiency and unsafe exploration. Arroyo et al. [43] further extended the method given in [41] to a realistic scenario for building energy management, by encoding domain knowledge. The initially complex MPC optimization problem is then reformulated as an optimization problem with a prediction horizon of one step. In [43] a simulation model is extracted from the simulator via system identification, and is used as the prediction model for MPC, as well as for pre-training of the DRL agent. The simulation results show that the proposed RL-MPC approach can meet the state constraints and provide satisfactory performance. However, it is not demonstrated in [43] whether or not the RL-MPC approach outperforms MPC in uncertain environments.
The above MPC-DRL combined algorithms can be categorized as objective function truncating methods. This can reduce the on-line computational complexity of MPC, while RL is used to handle the uncertain environment. Nevertheless, these algorithms still suffer from several issues. First, one-step ahead MPC optimizes the control input only for the next time step, and therefore can only guarantee the short-term safety constraints. Second, although the value function can include the constraints by introducing a penalty on constraint violations in the reward function, in this way the constraints become soft constraints that do not necessarily provide guarantees. Third, an inaccurate system model is still used for the one-step ahead optimization of MPC, which influences the optimality of the performance. Fourth, optimizing the joint objective function and value function can be quite challenging, due to the nonlinearity and non-convexity introduced by the neural networks.
Another direction for connecting MPC with RL has been developed by Gros and Zanon. In [44] they proposed to use a parameterized MPC scheme instead of deep neural networks to approximate the value function and policy of the RL agent. It is shown that the MPC scheme can guarantee the optimality of the learned policy by adjusting the objective function of MPC, even with an inaccurate system model. Furthermore, they extended the algorithm by utilizing robust MPC techniques to address the safety issue of RL [45]. The method is implemented with a Q-learning algorithm and the results show that the constraints are well handled. In fact, Gros and Zanon [44], [45] essentially use RL tools to solve the MPC problem by exploiting the connection between parameterized MPC and RL. However, how to parameterize the cost function of MPC is not considered in a structured way.
A different trend is to directly combine the control inputs of MPC and RL. The paper [46] proposed a framework that contains independent MPC and DDPG agents, in which the overall output is a weighted sum of the control inputs generated by MPC and DRL. The idea is to let MPC play a guiding role by applying its control action directly to the system to obtain more effective data samples for training the DDPG agent, thus improving the sample efficiency. However, the weight parameter needs to be tuned by trial and error for the various tasks, and the synergies between MPC and DRL are neither considered nor analyzed. The state and input constraints are not considered either.
There is not yet an extensive comparative study of the MPC-RL algorithms discussed above, so it is still an open question which approach surpasses the others, and in which cases. However, each algorithm is designed to address a specific issue or a particular task. The current paper develops a novel framework that combines MPC and DRL in a flexible way, i.e., it allows the designers to freely choose the detailed MPC and DRL schemes. The framework is also designed with a hierarchical structure and multiple operation rates, such that MPC and DRL coordinate well with each other, making the framework applicable to various complex applications. The proposed framework is tested on a freeway traffic control problem from [8], and its performance is compared with standard MPC, DRL methods, and advanced MPC methods.
Combined MPC-DRL Framework
This section presents the proposed MPC-DRL control framework. Section III-A gives an intuitive description of the framework from a high-level point-of-view. Section III-B defines the MPC and the DRL modules. Section III-C details the learning algorithm of the framework. The mathematical notations used in this section are presented and defined in Table II.
A. MPC-DRL Framework
As illustrated in Figure 1, the proposed MPC-DRL framework has a hierarchical structure. The MPC module operates at the high level to provide a basic control input that is optimized over the prediction window based on the objective function of MPC with the associated nominal model and the predicted traffic demands. The objective function is defined according to the control purpose (e.g., minimizing the total time spent (TTS)), and the state and input constraints are considered explicitly during the optimization. In practice, the MPC output $\boldsymbol{u}_{\text{b}}(k_{\text{c}})$ is kept constant over the corresponding MPC control step and is passed to the low-level DRL module.
Block diagram of the hierarchical MPC-DRL control framework.
In order to improve the optimality of the MPC output and to avoid severe constraint violations, the DRL module works at the lower level to modify the MPC output $\boldsymbol{u}_{\text{b}}(k_{\text{c}})$ according to (2).
Assume that the dynamics of the freeway network are described by a discrete-time model with simulation sampling time $T_{\text{s}}$. The MPC module operates with control sampling time $T_{\text{c}}$ and the DRL module with control sampling time $T_{\text{d}}$, which are related by \begin{equation*} T_{\text {c}}=m_{1}\cdot T_{\text {d}}=m_{1}\cdot m_{2}\cdot T_{\text {s}},\quad m_{1},m_{2}\in \mathbb {N}^{+},m_{1}>1. \tag{1}\end{equation*}
The combined control input applied to the freeway network at DRL control step $k_{\text{d}}$ is given by \begin{equation*} \boldsymbol {u}_{\text {c}}(k_{\text {d}})=\text {sat}(\boldsymbol {u}_{\text {rl}}(k_{\text {d}})+ \boldsymbol {u}_{\text {b}}(k_{\text {c}})), \tag{2}\end{equation*} where the saturation function is applied element-wise and is defined as \begin{align*} {\text {sat}(u)} = \begin{cases} \displaystyle u_{\text {max}},&{\text {if}}~u>u_{\text {max}} \\ \displaystyle u_{\text {min}},&{\text {if}}~u < u_{\text {min}} \\ \displaystyle u,&{\text {otherwise,}} \end{cases} \tag{3}\end{align*} with $u_{\text{max}}$ and $u_{\text{min}}$ the upper and lower bounds of the corresponding control input.
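As an illustration, the combination (2)-(3) amounts to an element-wise clipping of the summed inputs. A minimal sketch (in Python, with placeholder bounds that are not taken from the paper) could look as follows:

```python
import numpy as np

# Placeholder input bounds; the actual bounds depend on the control measures
# (e.g., ramp metering rates and speed limits) and are not specified here.
u_min = np.array([0.0, 20.0, 20.0])
u_max = np.array([1.0, 120.0, 120.0])

def combine(u_rl, u_b):
    """Combined control input u_c(k_d) = sat(u_rl(k_d) + u_b(k_c)), cf. (2)-(3).
    u_b is refreshed once per MPC control step, u_rl once per DRL control step."""
    return np.clip(u_rl + u_b, u_min, u_max)  # element-wise saturation (3)
```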
B. Detailed Description of the Framework
The details of the MPC and DRL modules are provided in this section.
1) MPC Module:
A standard MPC procedure is performed within the MPC module, where a nominal model of the freeway network is used to predict the evolution of the traffic states. At MPC control step $k_{\text{c}}$, the control input is kept constant over the corresponding simulation steps \begin{equation*} \left \{{k_{\text {c}}m,k_{\text {c}}m+1,\ldots,k_{\text {c}}m+m-1}\right \}, \tag{4}\end{equation*} where $m$ denotes the number of simulation steps within one MPC control step.
At each MPC control step $k_{\text{c}}$, the following finite-horizon optimization problem is solved: \begin{align*} &\min _{\tilde {\boldsymbol {u}}_{\text {b}}(k_{\text {c}}),\tilde { \boldsymbol {x}}(k_{\text {c}})}\sum _{\ell =1}^{N_{\text {p,s}}}J(k_{\text {c}}m+\ell) \\ &\quad \text {s.t.}\quad \mathrm {(A.1)-(A.4)}, \tag{5}\end{align*} where the constraints (A.1)-(A.4) are detailed in the Appendix.
Due to the nonlinearity and non-smoothness of the traffic model, the resulting optimization problem is, in general, nonlinear and non-convex. Therefore, a nonlinear optimization solver, such as multi-start sequential quadratic programming (SQP), simulated annealing, or a genetic algorithm [47], is required. After the above optimization problem is solved, the first element of the optimized control input sequence, $\boldsymbol{u}_{\text{b}}(k_{\text{c}})$, is passed to the low-level DRL module.
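For illustration only, a single-shooting transcription of (5) could be set up as in the following sketch; the nominal model, stage cost, dimensions, bounds, and the use of SciPy's SLSQP solver are assumptions made for this example and do not correspond to the actual implementation used in the paper:

```python
import numpy as np
from scipy.optimize import minimize

def f_nominal(x, u, d):
    # Placeholder one-step nominal prediction model, standing in for F in (A.1)
    return x + 0.0 * u[0] + 0.0 * d

def stage_cost(x, u, u_prev):
    # Placeholder stage cost, standing in for J in (5)
    return float(np.sum(x ** 2)) + 0.01 * float(np.sum((u - u_prev) ** 2))

def solve_mpc(x0, d_pred, u_prev, n_u=3, Np_c=2, m=30,
              u_min=0.0, u_max=1.0, queue_max=100.0):
    """Optimize a blocked control sequence of Np_c moves, each held for m simulation steps."""
    def rollout(u_flat):
        u_seq = u_flat.reshape(Np_c, n_u)
        x, u_last, cost, worst_queue = np.asarray(x0, dtype=float), u_prev, 0.0, -np.inf
        for k in range(Np_c):
            for ell in range(m):                        # control move held over m steps, cf. (A.4)
                x = f_nominal(x, u_seq[k], d_pred[k * m + ell])
                cost += stage_cost(x, u_seq[k], u_last)
                worst_queue = max(worst_queue, x[-1])   # assume the last state entry is a queue length
            u_last = u_seq[k]
        return cost, worst_queue

    cons = [{"type": "ineq", "fun": lambda u: queue_max - rollout(u)[1]}]  # state constraint (A.2)
    bounds = [(u_min, u_max)] * (Np_c * n_u)                               # input constraint (A.3)
    u0 = np.full(Np_c * n_u, 0.5 * (u_min + u_max))
    res = minimize(lambda u: rollout(u)[0], u0, method="SLSQP",
                   bounds=bounds, constraints=cons)
    return res.x.reshape(Np_c, n_u)[0]                  # first control move is passed to the DRL module
```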
2) DRL Module:
Considering the freeway network as an MDP, it can be represented by a five-tuple consisting of the state space, the action space, the transition probabilities, the reward function, and the discount factor. At each DRL control step $k_{\text{d}}$, the DRL action is kept constant over the corresponding simulation steps \begin{equation*} \left \{{k_{\text {d}}m_{2},k_{\text {d}}m_{2}+1,\ldots,k_{\text {d}}m_{2}+m_{2}-1}\right \}. \tag{6}\end{equation*} The state, action, and reward are defined as follows.
State: the DRL state at control step $k_{\text{d}}$ is defined as \begin{align*} \boldsymbol {x}_{\text {rl}}(k_{\text {d}})=[\bar{\boldsymbol {x}}^{\top} (k_{\text {d}}m_{2}), \bar{\boldsymbol {u}}_{\text {s}}^{\top} (k_{\text {d}}m_{2}),\bar{\boldsymbol {d}}^{\top} (k_{\text {d}}m_{2}), \bar{\boldsymbol {u}}_{\text {c}}^{\top} (k_{\text {d}}-1)]^{\top}. \tag{7}\end{align*}
Action: the DRL action is the correction term $\boldsymbol{u}_{\text{rl}}(k_{\text{d}})$ that is added to the MPC output in (2).
Note that the action is bounded as \begin{equation*} -w_{u}\Delta \boldsymbol {U}\leq \boldsymbol {u}_{\text {rl}}\leq w_{u}\Delta \boldsymbol {U}, \tag{8}\end{equation*} where $w_{u}$ is a weight that limits the magnitude of the DRL correction, and thus the exploration space of the DRL agent.
Reward: the reward at DRL control step $k_{\text{d}}$ is defined as \begin{align*} r(\boldsymbol {x}_{\textrm {rl}}(k_{\text {d}}),\boldsymbol {u}_{\text {rl}}(k_{\text {d}}))\!=\!\sum _{k=1}^{m_{2}}\big (-J(k_{\text {d}}m_{2}+k) \!-\!w_{p}P_{s}(k_{\text {d}}m_{2}+k)\big), \tag{9}\end{align*} where $J$ is the same stage cost as used in the MPC objective, $P_{s}$ penalizes violations of the state constraints, and $w_{p}$ is the corresponding weight.
Deep actor-critic algorithms are considered to train the framework; among these, the Deep Deterministic Policy Gradient (DDPG) algorithm [48] is chosen for the DRL agent. DDPG is an off-policy, model-free algorithm that can deal with continuous state and action spaces and has been implemented successfully in many freeway traffic studies (see, e.g., [38], [39], [49]).
Remark 1:
The standard MPC procedure within the high-level MPC module can be replaced with any efficient MPC variants, such as parameterized MPC or DMPC for large-scale freeway networks. The DDPG agent can easily be extended to arbitrary off-policy DRL algorithms that can deal with continuous state and action spaces.
C. Algorithm for Training the Framework
The goal of learning is to train a policy $\pi$ that maximizes the expected cumulative discounted reward. The corresponding action-value function satisfies \begin{align*} &\hspace {-.1pc}Q^{\pi} \left ({\boldsymbol {x}_{\text {rl}}(k_{\text {d}}),\pi (\boldsymbol {x}_{\text {rl}}(k_{\text {d}}))}\right)\\ &=\mathbb {E}_{r, \boldsymbol {x}_{\text {rl}}\sim E}\left [{\sum _{k=0}^{\infty }\gamma ^{k}r(\boldsymbol {x}_{\text {rl}}(k_{\text {d}}+k),\boldsymbol {u}_{\text {rl}}(k_{\text {d}}+k))}\right]\\ &=\mathbb {E}_{r, \boldsymbol {x}_{\text {rl}}\sim E}\big [r(\boldsymbol {x}_{\text {rl}}(k_{\text {d}}),\boldsymbol {u}_{\text {rl}}(k_{\text {d}}))\\ &\quad +\gamma Q^{\pi} \left ({\boldsymbol {x}_{\text {rl}}(k_{\text {d}}+1),\pi (\boldsymbol {x}_{\text {rl}}(k_{\text {d}}+1))}\right)\big], \tag{10}\end{align*} where $\gamma \in [0,1)$ is the discount factor and $E$ denotes the environment.
Instead of the traditional one-step temporal-difference (TD) target of DDPG, we use the $n$-step TD target \begin{equation*} y_{k_{\text {d}}}=r_{n}(k_{\text {d}})+\gamma ^{n}Q^{\pi }_{\phi '}(\boldsymbol {x}_{\text {rl}}(k_{\text {d}}+n),\boldsymbol {u}'_{\text {rl}}(k_{\text {d}}+n)), \tag{11}\end{equation*} where $Q^{\pi}_{\phi'}$ denotes the target critic network, $\boldsymbol{u}'_{\text{rl}}(k_{\text{d}}+n)$ is the action generated by the target actor network for state $\boldsymbol{x}_{\text{rl}}(k_{\text{d}}+n)$, and the discounted $n$-step return is \begin{equation*} r_{n}(k_{\text {d}})=\sum _{k=0}^{n-1}\gamma ^{k}r(\boldsymbol {x}_{\text {rl}}(k_{\text {d}}+k),\boldsymbol {u}_{\text {rl}}(k_{\text {d}}+k)). \tag{12}\end{equation*}
The critic network $Q^{\pi}_{\phi}$ is updated by minimizing the mean-squared TD error over a mini-batch of $N$ sampled transitions, \begin{equation*} L\left ({\phi }\right)=\frac {1}{N}\sum _{i}\left ({y_{i}-Q^{\pi} _{\phi} (\boldsymbol {x}_{\text {rl}}(i),\boldsymbol {u}_{\text {rl}}(i))}\right)^{2}, \tag{13}\end{equation*} where $i$ indexes the samples in the mini-batch.
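A minimal sketch of how the target (11)-(12) and the loss (13) could be computed is given below; the target critic and target actor are represented by placeholder callables (`q_target`, `pi_target`), which are illustrative names only:

```python
import numpy as np

gamma, n = 0.99, 10   # discount factor and TD horizon (n = 10 is used in the case study)

def n_step_target(rewards, x_after_n, q_target, pi_target):
    """n-step TD target (11)-(12): discounted sum of the next n rewards plus the
    bootstrapped target-critic value of the state reached after n steps."""
    r_n = sum(gamma ** k * rewards[k] for k in range(n))                 # (12)
    return r_n + gamma ** n * q_target(x_after_n, pi_target(x_after_n))  # (11)

def critic_loss(targets, q_values):
    """Mean-squared TD error (13) over a mini-batch of N samples."""
    t, q = np.asarray(targets), np.asarray(q_values)
    return float(np.mean((t - q) ** 2))
```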
The benefits of using the $n$-step TD target are as follows:
A freeway network is a large-scale system with time delays, which means that the control measures only take effect after a period of time. Thus, looking $n$ steps into the future can better evaluate the quality of the actions taken.
The optimization of the DDPG agent considers the reward for $n$ future steps, which coincides with the predicted objective function in MPC. In practice, taking $n=N_{\text {p,s}}/m_{2}$ makes the look-ahead time of DDPG and MPC the same, and thus these two modules cooperate better.
Looking $n$ steps ahead makes the learning process more efficient than the one-step TD method of the conventional DDPG algorithm, where the update is only based on bootstrapping from the value of the state one step later [30].
By introducing future rewards in (11), there is no need to predict future demand information as the MPC module does. Therefore, the state space definition (7) is simpler and has a smaller dimension.
One advantage of DDPG as an off-policy algorithm is that its exploration policy is independent from the learning process, which means that stochastic exploration is allowed. In this context, the Ornstein-Uhlenbeck process [51] is used to produce the exploration noise that is added to the actions during training.
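For reference, a commonly used discretization of the Ornstein-Uhlenbeck process that could generate such exploration noise is sketched below; the parameter values are illustrative only:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise for the actor output (parameters are placeholders)."""
    def __init__(self, dim=3, theta=0.15, sigma=0.2, dt=1.0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # Mean-reverting update: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, I), with mu = 0
        self.x += self.theta * (-self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x.copy()
```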
Algorithm 1 Hierarchical MPC-DRL Framework Algorithm for Freeway Traffic Control
Initialize critic and actor networks
Initialize target network
Initialize experience replay buffer
for episode from 1 to
Initialize the empty traffic network with initial traffic demands for
for
Observe current traffic state
Perform high-level MPC with freeway model
Pass the optimized MPC output
for
Receive state
Select action
Combine the output of MPC and RL with a saturation function using (2)
for
Execute action
end for
Observe reward
Store transition
Sample a mini-batch of
Update the critic network
Update the actor network
Update the target networks:
end for
end for
end for
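To make the multi-rate structure of Algorithm 1 concrete, the nested loops could be organized as in the following sketch; the environment, MPC solver, agent, replay buffer, and all hyperparameter values are placeholders (illustrative names, not the actual implementation):

```python
import numpy as np

n_episodes, Nc, m1, m2 = 3, 30, 5, 6        # placeholder episode count and rates, cf. (1)

class StubAgent:
    def act(self, x): return np.zeros(3)     # actor output u_rl (placeholder)
    def update(self, batch): pass            # DDPG update with n-step targets (placeholder)

def solve_mpc(x): return np.full(3, 0.5)     # high-level MPC output u_b (placeholder)
def env_step(u_c): return np.zeros(30), 0.0  # one simulation step: next observation and reward (placeholder)

agent, replay = StubAgent(), []
for episode in range(n_episodes):
    x_rl = np.zeros(30)                      # initial DRL state, cf. (7)
    for k_c in range(Nc):                    # high-level MPC steps (every T_c)
        u_b = solve_mpc(x_rl)                # basic control input from MPC
        for _ in range(m1):                  # low-level DRL steps (every T_d)
            u_rl = agent.act(x_rl) + 0.1 * np.random.randn(3)   # exploration noise
            u_c = np.clip(u_rl + u_b, 0.0, 1.0)                 # combination (2)-(3)
            reward = 0.0
            for _ in range(m2):              # simulation steps (every T_s)
                x_obs, r = env_step(u_c)
                reward += r                  # accumulated reward, cf. (9)
            x_next = x_obs
            replay.append((x_rl, u_rl, reward, x_next))         # store transition
            if len(replay) >= 64:
                idx = np.random.randint(len(replay), size=64)
                agent.update([replay[i] for i in idx])          # critic/actor/target updates
            x_rl = x_next
```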
Case Study
The proposed MPC-DRL framework is now implemented and evaluated on a benchmark freeway network from [8]. METANET is adopted to model this network; the readers are referred to [8] and [26] for details. Model uncertainties and external disturbances are introduced into the model to represent the real-world system, as described in Section IV-A. Furthermore, the proposed MPC-DRL framework is compared with standalone MPC and DRL methods, and with one advanced MPC method (i.e., parameterized MPC [9]). In this case study, the performance criteria consist of the TTS of all the vehicles in the entire traffic network, the total waiting time (TWT) of all the queues, the minimum traffic speed during the total simulation time, the constraint violations of the queue lengths, and the online computation time. All the simulations were conducted in Matlab version 2022a running on a PC with an Intel Xeon Quad-Core E5-1620 V3 CPU with a clock speed of 3.5 GHz.
A. Setup
1) Freeway Traffic Network:
A benchmark network is taken from [8]. Note that this benchmark network has also been used in other freeway traffic studies [52], [53], [54]. As shown in Figure 3, the network consists of two origins (i.e., one mainstream and one on-ramp) and one destination. The length of the main stretch is 6 km, which is divided into 6 segments of 1 km each. The mainstream has two lanes with a capacity of 2000 veh/h each, and its maximum allowed queue length is 200 veh. The on-ramp has one lane with a capacity of 2000 veh/h, and the maximum on-ramp queue length is 100 veh. The network parameters are taken from [8] and the same mathematical notations are used here. The real parameter values are assumed unknown in this case study, and estimated values for these parameters are used in the prediction model. Both the real and the estimated parameter values are given in Table III.
The benchmark freeway network with one metered on-ramp and two segments with speed limits (marked in red) used for the case study.
2) Demand Scenario:
Two typical demand scenarios, similar to [8] and shown in Figure 4, are considered in order to evaluate the controllers. These two demand scenarios have the same profile for the mainstream. Without control, both of them cause severe traffic congestion, so they are suitable for examining the control effectiveness of both ramp metering and variable speed limits in this freeway network. The freeway network is initially empty, and is first simulated with a constant demand of 3000 veh/h for the mainstream and 500 veh/h for the on-ramp for a period of 10 min, before the control simulations start.
3) Noises:
To reproduce the stochastic phenomena of the traffic network, random noise with a Gaussian distribution is added to the demands of both mainstream and on-ramp. To fully evaluate the ability of the controllers to resist uncertainties, we consider three noise levels, i.e., low-level noise, medium-level noise, and high-level noise. More specifically, the noise levels have the following distributions:
Low-level noise: $\mathcal {N}(0,75)$ for the mainstream demand and $\mathcal {N}(0,30)$ for the on-ramp demand;
Medium-level noise: $\mathcal {N}(0,150)$ for the mainstream demand and $\mathcal {N}(0,60)$ for the on-ramp demand;
High-level noise: $\mathcal {N}(0,225)$ for the mainstream demand and $\mathcal {N}(0,90)$ for the on-ramp demand.
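For illustration, the noisy demands could be generated as sketched below; here the second argument of $\mathcal{N}(0,\cdot)$ is interpreted as the standard deviation in veh/h, which is an assumption:

```python
import numpy as np

# Standard deviations (mainstream, on-ramp) in veh/h for the three noise levels
noise_std = {"low": (75.0, 30.0), "medium": (150.0, 60.0), "high": (225.0, 90.0)}
_rng = np.random.default_rng()

def noisy_demands(d_main, d_ramp, level="medium"):
    """Add zero-mean Gaussian noise to the nominal demands and keep them non-negative."""
    s_main, s_ramp = noise_std[level]
    return (max(d_main + _rng.normal(0.0, s_main), 0.0),
            max(d_ramp + _rng.normal(0.0, s_ramp), 0.0))
```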
B. Controllers
In this case study, the following controllers are implemented and compared: a standalone MPC controller, a standalone DRL controller (with $n$-step TD), a high-frequency standalone MPC controller, a parameterized MPC controller, and the combined MPC-DRL framework (with $n$-step TD).
1) Standalone MPC Controller:
The objective function (5) used in the MPC controller is written as \begin{equation*} J(k_{\text {s}})= w_{\text {TTS}}J_{\text {TTS}}(k_{\text {s}})+w_{\mathcal {D}}\left \|{ \boldsymbol {u}_{\text {s}}(k_{\text {s}})- \boldsymbol {u}_{\text {s}}(k_{\text {s}}-1) }\right \|_{2}^{2},\end{equation*} where $w_{\text{TTS}}$ and $w_{\mathcal{D}}$ are non-negative weights, $J_{\text{TTS}}(k_{\text{s}})$ denotes the TTS contribution at simulation step $k_{\text{s}}$, and the second term penalizes variations of the control inputs.
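As an illustration, the stage cost could be evaluated as in the following sketch, assuming the standard METANET expression for the TTS contribution (vehicles on the mainstream segments plus vehicles in the origin queues, multiplied by the simulation time step, cf. [8]); the weights and parameter values are placeholders:

```python
import numpy as np

T = 10.0 / 3600.0   # simulation time step [h]; a placeholder value

def stage_cost(rho, queues, u_s, u_s_prev, L_m=1.0, lam_m=2, w_tts=1.0, w_d=0.01):
    """Stage cost J(k_s) = w_TTS * J_TTS(k_s) + w_D * ||u_s(k_s) - u_s(k_s - 1)||_2^2."""
    j_tts = T * (np.sum(rho) * L_m * lam_m + np.sum(queues))  # assumed TTS contribution
    du = np.asarray(u_s) - np.asarray(u_s_prev)
    return w_tts * j_tts + w_d * float(du @ du)
```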
For simplicity, the control problem is transcribed into an optimization problem by single-shooting [56]. The resulting optimization problem is nonlinear and non-convex, so the Matlab
2) Standalone DRL (With $n$-Step TD):
The standalone DRL agent in this case study shares the same definition as the DRL module in Section III-B.2. Therefore, the dimensions of the state space and action space are 30 and 3, respectively. The actor network contains one input layer of size 30, one output layer of size 3, and two hidden layers with 256 neurons each. Accordingly, the critic network has two input layers: one layer that corresponds to the states is of size 30 and is followed by a layer with 256 neurons, and the other input layer that corresponds to the actions is of size 3 and is followed by a layer with 128 neurons. Both branches are connected to two consecutive hidden layers with 256 and 128 neurons, respectively. The size of the output layer, which generates the Q-values of the state-action pairs, is 1. ReLU activation functions are used in all the neural networks. Moreover, the reward function consists of the objective function defined for the MPC controller and a penalty for constraint violations with weight $w_{p}$, as in (9).
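A possible realization of the described actor and critic architectures (e.g., in PyTorch) is sketched below; how the actor output is subsequently bounded (cf. (8)) is not specified here and only indicated in a comment:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: 30-dimensional state in, 3-dimensional action out,
    two hidden layers of 256 neurons with ReLU activations."""
    def __init__(self, state_dim=30, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),   # output can then be bounded as in (8), e.g., by clipping
        )

    def forward(self, x):
        return self.net(x)

class Critic(nn.Module):
    """Critic: the state passes through a 256-neuron layer and the action through a
    128-neuron layer; the merged features pass through 256- and 128-neuron layers
    before the scalar Q-value output."""
    def __init__(self, state_dim=30, action_dim=3):
        super().__init__()
        self.state_branch = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.action_branch = nn.Sequential(nn.Linear(action_dim, 128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(256 + 128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        h = torch.cat([self.state_branch(state), self.action_branch(action)], dim=-1)
        return self.head(h)
```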
3) High-Frequency Standalone MPC Controller:
This is a high-frequency version of the standalone MPC controller, in which the main differences are that the control sampling time is
4) Parameterized MPC Controller:
The parameterized MPC (PMPC) controller developed for integrated VSL-RM freeway traffic control [9] is also used in this case study. PMPC is an efficient MPC variant: since the number of optimization variables is reduced because of the introduced parameterized control law, PMPC can reduce the online computation time significantly while achieving control performance comparable to that of a conventional MPC controller. For more details, the reader is referred to [9] and the references therein. In this case study, the PMPC controller shares its settings with the high-frequency standalone MPC controller, including the objective function, prediction window, control sampling time, and optimization parameters.
5) Combined MPC-DRL Framework (With $n$-Step TD):
The combined MPC-DRL framework (with $n$-step TD) is implemented as described in Section III.
C. Results for the Learning Process
The standalone DRL (with 10-step TD) and the combined MPC-DRL framework (with 10-step TD) are both trained independently over the stochastic environment (i.e., the benchmark freeway network with stochastic demands), with 3000 episodes for each run. Each episode contains a simulation interval of 9000 s with the mentioned stochastic demands. In the plots, the episode rewards have first been smoothed by a moving average filter of size 21 to better present the learning progress. The learning performance is presented in Figure 5. There are 6 scenarios in total, which are:
Scenario 1: Low-level noise & demand 1;
Scenario 2: Medium-level noise & demand 1;
Scenario 3: High-level noise & demand 1;
Scenario 4: Low-level noise & demand 2;
Scenario 5: Medium-level noise & demand 2;
Scenario 6: High-level noise & demand 2.
Learning performance of standalone DRL and combined MPC-DRL with and without 10-step TD for different scenarios.
Figure 5 shows that the learning curves of the combined MPC-DRL framework start with higher rewards and converge faster than those of the standalone DRL methods for all the scenarios. This indicates that the proposed framework has a better sample efficiency than the conventional DRL methods. The reason is that the MPC module within the MPC-DRL framework generates basic control inputs that provide a baseline control performance and guide the learning of the DRL agent, while the DRL module within the framework only explores a limited action space with smaller bounds and thus requires fewer data samples, compared with the standalone DRL agent.
Furthermore, the methods with 10-step TD show a better learning performance than the ones without 10-step TD, which validates the advantages of the $n$-step TD target discussed in Section III-C.
D. Results for the Implementations
According to the learning performance, only the trained standalone DRL controller with 10-step TD and the combined MPC-DRL controller with 10-step TD are implemented on the benchmark freeway network, and their control performance is evaluated in terms of TTS, which represents the global traffic efficiency, constraint violations, and online CPU time. In addition, the total waiting time (TWT) of all the vehicles in a queue is considered for comparison, which indicates the congestion degree of the traffic network. The minimum traffic speed on the links during the total simulation time is also compared, which represents the worst congestion degree. The standalone MPC controller, the high-frequency MPC controller, the PMPC controller, and the no-control case (i.e., no ramp metering or speed limits) are included for comparison. Because of the stochastic nature of the network, the experiments for each controller are repeated 10 times independently with random demands in order to evaluate the control performance. The quantitative simulation results are presented in Table V, in which the constraint violation is the ratio of the maximal queue length exceedance with respect to the maximum allowed queue length, the mean computation time is the average time required for the optimization process per control step (every 300 s), and the max computation time corresponds to the maximum computation time per control step over all the control steps (every 300 s).
As shown in Table V, all the controllers improve the traffic control performance with regard to the no-control case in terms of TTS, except for the standalone DRL controller (Scenarios 1, 5, and 6). This is due to the insufficient learning process and the low sample efficiency of the standalone DRL agent. The standalone MPC controller improves the control performance compared to the no-control case with limited online computational complexity, in terms of TTS, TWT, minimum speed, and constraint satisfaction. The high-frequency MPC controller further improves the performance of the standalone MPC controller, which, however, comes at the cost of a substantially higher online computational complexity (more than 20 times higher). The PMPC controller reduces the online computational complexity of the high-frequency MPC controller significantly, and further reduces the TTS for several scenarios (Scenarios 2, 3, and 6). However, the PMPC controller performs worse in terms of constraint satisfaction, TWT, and minimum speed.
The proposed MPC-DRL framework outperforms both the standalone MPC controller and the standalone DRL controller in terms of TTS, TWT, and constraint satisfaction, with a similar online computational complexity, for all the considered scenarios. The minimum speed of the proposed framework is slightly lower than that of the standalone MPC or standalone DRL controller in some scenarios. This is because more vehicles in the queues on both the mainstream and the on-ramp are allowed to enter the network in order to avoid constraint violations, thus reducing the flow speed. In general, the results show that the framework has learned from interacting with the environment, and that the DRL module within the framework can compensate for the model uncertainties and external disturbances, and in this way provides extra optimality.
Although the high-frequency MPC controller achieves the best TTS, TWT, and minimum speed performance for most scenarios, it still suffers from the uncertainties and the noise in the demand, which is reflected in the constraint violations for all the scenarios. For the scenarios with traffic demand 2 (i.e., Scenarios 4, 5, and 6), constraint violations can be avoided (see the MPC-DRL framework controller). However, the high-frequency MPC controller still exhibits constraint violations, while the MPC-DRL framework controller guarantees constraint satisfaction. For the scenarios with traffic demand 1 (i.e., Scenarios 1, 2, and 3), the traffic congestion is more severe, and constraint violations are inevitable. In this case, the MPC-DRL framework controller further reduces the constraint violations compared to the high-frequency MPC controller, at the price of a slightly higher TTS and TWT and a lower minimum speed, while requiring significantly less online computation. This indicates that, in addition to the smaller action space and the high sample efficiency of the combined MPC-DRL framework, the penalty on constraint violations within the reward function of the DRL module, which coincides with the state constraints within the optimization problem of the MPC module, also contributes to avoiding or relieving constraint violations. So in the proposed MPC-DRL framework, the MPC and DRL modules complement each other during both the learning process and the implementation stage, which results in a better sample efficiency and a better control performance in terms of TTS, TWT, minimum speed, and constraint violations, with a very limited online computational complexity.
Conclusion and Topics for Future Work
This paper has developed a novel framework combining MPC and DRL for freeway traffic control. Since MPC and DRL each suffer from their own shortcomings and their characteristics complement each other well, it is beneficial to merge these two methods. The proposed MPC-DRL framework inherits the ability of DRL to learn from the environment in order to deal with uncertainties, and the ability of MPC to use model information to provide a basic level of performance. Specifically, the novel framework has a hierarchical structure, in which an MPC controller works at the high level with a low frequency, while the DRL agent operates at the low level with a high frequency. An additional advantage of the proposed framework is that it requires less computational effort than conventional MPC, thanks to the lower control frequency of the MPC module.
A simulation study has been conducted on a benchmark freeway network with model uncertainties and stochastic traffic demands. The proposed MPC-DRL framework (with $n$-step TD) outperforms both the standalone MPC controller and the standalone DRL controller in terms of TTS, TWT, and constraint satisfaction, with a similar online computational effort.
ACKNOWLEDGMENT
The authors would like to thank Dr. Qingrui Zhang of Sun Yat-sen University and Yun Li of the Delft University of Technology for the useful discussions and their valuable suggestions on the DRL implementation.
Appendix
MPC Module Formulation
The mathematical details of the MPC module within the combined MPC-DRL framework are given below: \begin{align*} &\min _{\tilde {\boldsymbol {u}}_{\text {b}}(k_{\text {c}}),\tilde { \boldsymbol {x}}(k_{\text {c}})}\sum _{\ell =1}^{N_{\text {p,s}}}J(k_{\text {c}}m+\ell) \\ &\quad \text {s.t.}\quad \hat { \boldsymbol {x}}(k_{\text {c}}m+\ell +1)=\\ &\hphantom {\quad \text {s.t.}\quad }F(\hat { \boldsymbol {x}}(k_{\text {c}}m+\ell),\boldsymbol {u}_{\text {s}}(k_{\text {c}}m+\ell),\hat{\boldsymbol {d}}(k_{\text {c}}m+\ell)),\\ &\hphantom {\quad \text {s.t.}\quad }\text {for}~\ell =0,\ldots,N_{\text {p,s}}-1, \tag{A.1}\\ &\hphantom {\quad \text {s.t.}\quad }\hat{\boldsymbol {x}}(k_{\text {c}}m+\ell)\in \mathcal {X},\quad \text {for}~\ell =1,\ldots,N_{\text {p,s}}, \tag{A.2}\\ &\hphantom {\quad \text {s.t.}\quad }\boldsymbol {u}_{\text {b}}(k_{\text {c}}+k)\in \mathcal {U},\quad \text {for}~k=0,1,\ldots,N_{\text {p,c}}-1, \tag{A.3}\\ &\hphantom {\quad \text {s.t.}\quad } \boldsymbol {u}_{\text {s}}((k_{\text {c}}+k)m+\ell)=\boldsymbol {u}_{\text {b}}(k_{\text {c}}+k), \\ &\hphantom {\quad \text {s.t.}\quad }\text {for}~\ell =0,1,\ldots,m-1, k=0,1,\ldots,N_{\text {p,c}}-1, \tag{A.4}\end{align*} where $\hat{\boldsymbol{x}}$ denotes the predicted traffic state, $\hat{\boldsymbol{d}}$ the predicted demand, $F(\cdot)$ the nominal prediction model, $\mathcal{X}$ and $\mathcal{U}$ the state and input constraint sets, and $N_{\text{p,s}}$ and $N_{\text{p,c}}$ the prediction horizons expressed in simulation steps and in MPC control steps, respectively.
Freeway Model: METANET
METANET [26], [58] is a macroscopic freeway traffic model that achieves a good trade-off between efficiency and accuracy and has been widely used [8], [53], [55], [59]. The model used in this paper is taken from [8]. In the METANET model, a freeway link $m$ is divided into several segments, each of which is described by a traffic density $\rho_{m,i}(k)$ and a mean speed $v_{m,i}(k)$, with the corresponding traffic flow given by \begin{equation*} q_{m,i}(k)=\rho _{m,i}(k)v_{m,i}(k)\lambda _{m}, \tag{A.5}\end{equation*} where $\lambda_{m}$ denotes the number of lanes of link $m$.
The density of segment $i$ of link $m$ evolves according to the conservation equation \begin{align*} \rho _{m,i}(k+1)=\rho _{m,i}(k)+\frac {T}{L_{m}\lambda _{m}}(q_{m,i-1}(k)-q_{m,i}(k)), \tag{A.6}\end{align*} where $T$ is the simulation time step and $L_{m}$ the segment length.
The mean speed is updated as \begin{align*} v_{m,i}(k+1)&=v_{m,i}(k)+\frac {T}{\tau }\left ({V(\rho _{m,i}(k))-v_{m,i}(k)}\right)\\ &\quad +\frac {T}{L_{m}}v_{m,i}(k)\left ({v_{m,i-1}(k)-v_{m,i}(k)}\right)\\ &\quad -\frac {\eta T}{\tau L_{m}}\frac {\rho _{m,i+1}(k)-\rho _{m,i}(k)}{\rho _{m,i}(k)+\kappa }, \tag{A.7}\end{align*} where $\tau$, $\eta$, and $\kappa$ are model parameters,
and the desired speed is given by \begin{equation*} V\left ({\rho _{m,i}(k)}\right)=v_{\text {free},m}\exp \left [{-\frac {1}{a_{m}}\left ({\frac {\rho _{m,i}(k)}{\rho _{\text {crit},m}}}\right)^{a_{m}}}\right], \tag{A.8}\end{equation*} where $v_{\text{free},m}$ is the free-flow speed, $\rho_{\text{crit},m}$ the critical density, and $a_{m}$ a model parameter of link $m$.
Each origin $o$ is modeled as a queue with length $w_{o}(k)$, which evolves as \begin{equation*} w_{o}(k+1)=w_{o}(k)+T\left ({d_{o}(k)-q_{o}(k)}\right), \tag{A.9}\end{equation*} where $d_{o}(k)$ is the demand and $q_{o}(k)$ the outflow of origin $o$.
The outflow of an on-ramp origin is limited by the demand plus the current queue, the ramp metering rate $r_{o}(k)$, and the available space on the mainstream: \begin{align*} q_{o}(k)&=\min \biggl [d_{o}(k)+\frac {w_{o}(k)}{T},C_{o}r_{o}(k),\\ &\quad C_{o}(k)\left ({\frac {\rho _{\max,\mu }-\rho _{\mu,1}(k)}{\rho _{\max,\mu }-\rho _{\text {crit},\mu }}}\right)\biggr], \tag{A.10}\end{align*} where $C_{o}$ is the on-ramp capacity, $\rho_{\max,\mu}$ the maximum density, and $\rho_{\mu,1}(k)$ the density of the first segment of the link $\mu$ to which the on-ramp is connected.
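To make the model equations concrete, a minimal sketch of one simulation step based on (A.5)-(A.10) is given below; all parameter values are placeholders (the actual values are listed in Table III), the boundary conditions are simplified, and the on-ramp merging terms of the full model in [8] are omitted for brevity:

```python
import numpy as np

# Placeholder METANET parameters (illustrative only; see Table III for the actual values)
T = 10.0 / 3600.0                 # simulation time step [h]
L_m, lam_m = 1.0, 2               # segment length [km], number of lanes
tau, eta, kappa = 18.0 / 3600.0, 60.0, 40.0
v_free, a_m, rho_crit = 102.0, 1.867, 33.5
rho_max, C_o = 180.0, 2000.0

def V(rho):
    """Desired speed (A.8)."""
    return v_free * np.exp(-(1.0 / a_m) * (rho / rho_crit) ** a_m)

def metanet_step(rho, v, w_o, d_o, r_o, q_up, v_up, rho_down):
    """Advance segment densities rho, speeds v, and on-ramp queue w_o by one step.
    q_up, v_up: flow and speed entering the first segment; rho_down: density downstream."""
    q = rho * v * lam_m                                    # (A.5)
    q_prev = np.concatenate(([q_up], q[:-1]))              # upstream flow for each segment
    v_prev = np.concatenate(([v_up], v[:-1]))              # upstream speed for each segment
    rho_next = np.concatenate((rho[1:], [rho_down]))       # downstream density for each segment

    rho_new = rho + T / (L_m * lam_m) * (q_prev - q)       # (A.6)
    v_new = (v + T / tau * (V(rho) - v)                    # (A.7): relaxation
             + T / L_m * v * (v_prev - v)                  #        convection
             - eta * T / (tau * L_m) * (rho_next - rho) / (rho + kappa))  # anticipation

    q_o = min(d_o + w_o / T,                               # (A.10): demand plus queue
              C_o * r_o,                                   #         metering rate limit
              C_o * (rho_max - rho[0]) / (rho_max - rho_crit))  #    mainstream space (first segment assumed)
    w_o_new = w_o + T * (d_o - q_o)                        # (A.9)
    return rho_new, v_new, w_o_new
```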