Introduction
The year 2023 was the hottest in 174 years of observational records, a year of extremes in which global average near-surface temperatures were 1.45 ± 0.12 °C above the pre-industrial baseline [1]. Simultaneously, record-breaking heatwaves, wildfires, droughts, and floods wreaked havoc worldwide, upending everyday life for millions and inflicting many billions of dollars in economic losses [2], [3], [4], [5], [6], [7]. If greenhouse gases continue to be emitted at current levels, the global average temperature is projected to rise by more than 1.5 °C by 2040. Consequently, extreme heatwaves are expected to become 8.6 times more frequent, intense rainfall 1.5 times, and droughts twice as frequent owing to climate change [8]. Research has shown that “approximately 75% of extreme abnormal climate events are currently related to climate change caused by carbon emissions” and that “by 2030, the world will face about 560 severe disasters per year, or an average of about 1.5 per day” [9].
Recognizing the gravity of these environmental problems, the EU announced plans to reduce net greenhouse gas emissions by more than 55% from 1990 levels by 2030 and to achieve carbon neutrality by 2050 [10]. Countries worldwide are focusing on developing carbon-reduction plans. Korea has announced Net-Zero scenarios for each sector and is making significant efforts toward carbon neutrality [11]. Korea’s greenhouse gas emissions in 2022 were 725.744 Mt CO2 eq/year, with carbon dioxide accounting for 87.6% [12]. Korea ranks 11th in carbon dioxide emissions among 210 countries, and emissions from its transportation sector amount to 107.365 Mt CO2, or 17% of the total. Traffic congestion has become a daily occurrence in Korea’s urban areas, where the number of cars is unusually high relative to the land area, resulting in enormous economic and time losses as well as serious traffic accidents and air pollution problems. Congestion at signalized intersections causes vehicle idling, in which engines keep running without movement, producing air pollutants at rates up to four times higher than during normal driving [13], [14]. Reducing road congestion would reduce vehicle idling, yielding an estimated annual reduction of 1.96 tons of pollutant emissions [15]. Solving congestion through capacity increases such as road expansion requires significant time and financial resources [16]. Therefore, the intersection traffic signal control problem (ITSCP) has been emphasized as a means of reducing urban traffic congestion and making efficient use of the limited capacity of existing roads [17], [18].
Currently, most roads in the Republic of Korea use fixed-time signal control, an operating scheme that repeats preplanned signal patterns over a set cycle [19]. However, this fixed-signal model cannot respond flexibly to real-time traffic changes [20], [21]. To address these limitations, recent research has focused on adaptive traffic signal control (ATSC) [22], [23], [24], [25], [26], [27]. Traffic signal control research using multi-agent reinforcement learning (MARL) consists primarily of independent and cooperative signal control approaches. Although independent signal control has been studied actively, applying independent methods to real-world traffic networks, which are complex and influenced by adjacent intersections, has limited problem-solving ability [28], [29]. Therefore, recent research on traffic signal control has considered adjacent intersections.
Several studies have shown that cooperative signal control considering adjacent intersections achieves good results in solving traffic problems. Most signal control studies aim to reduce vehicle waiting time or queue length. Reference [30] learned at a
The remainder of this paper is organized into four major sections. Section II explains DQN, one of the most widely used algorithms in adaptive traffic signal control research. Section III describes the construction environment and cooperative method used in this study. Section IV presents the results of an experiment that involved configuring a real intersection environment with the SUMO micro-traffic simulator. Finally, Section V provides concluding remarks.
Reinforcement Learning
Q-learning is a reinforcement learning algorithm based on temporal differences, which searches for an optimal policy using an action-value function [35]. Q-learning iteratively updates the action-value estimate as \begin{align*} Q(s_{t}, a_{t}) &\leftarrow Q(s_{t}, a_{t}) \\ &\quad + \alpha \left [{ r_{t} + \gamma \max _{a^{\prime }} Q(s_{t+1}, a^{\prime }) - Q(s_{t}, a_{t}) }\right ] \tag {1}\end{align*}
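As a concrete illustration, the update in (1) can be sketched in a few lines. This is a minimal tabular sketch; the state/action indices, learning rate, and discount factor below are illustrative, not the paper's settings:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, following Eq. (1):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move toward the TD target
    return Q

# Toy example: 3 states, 2 actions (all values chosen for illustration).
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With zero-initialized values, the single update above moves Q(0, 1) from 0 to alpha * r = 0.1, exactly the TD correction of (1).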
Tabular Q-learning performs well on problems with small-scale discrete states and actions; however, it generalizes poorly to real-world problems with large-scale continuous states and actions, where computation time increases rapidly [38], [39]. To address this problem, Q-learning with neural networks was proposed, leading to the deep Q-network (DQN) algorithm [40], [41]. The DQN is a popular reinforcement learning algorithm, and numerous studies have applied it to adaptive traffic signal control [42], [43], [44]. In a DQN, instead of estimating the Q-value of each state–action pair individually, a deep neural network serves as a function approximator that maps states to Q-values. This function approximation allows the use of larger, continuous state spaces [45]. Equation (2) expresses the loss function of the DQN:\begin{align*} MSE(\theta _{i}) &= \frac {1}{m}\sum _{t=1}^{m} \Big (r_{t} + \gamma \max _{a^{\prime }} Q(s_{t+1}, a^{\prime }; \theta _{i}^{\ast }) \\ &\quad - Q(s_{t}, a_{t}; \theta _{i})\Big)^{2} \tag {2}\end{align*}
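The loss in (2) can be sketched with plain NumPy arrays standing in for network outputs (shapes and values are illustrative; a real DQN would compute these with a deep-learning framework):

```python
import numpy as np

def dqn_mse_loss(q_online, q_target, actions, rewards, gamma=0.99):
    """Mini-batch MSE loss of Eq. (2): the online network's Q(s_t, a_t)
    is regressed toward r_t + gamma * max_a' Q(s_{t+1}, a'; theta*).

    q_online: (m, n_actions) Q-values of s_t from the online network.
    q_target: (m, n_actions) Q-values of s_{t+1} from the frozen target network.
    """
    m = len(actions)
    targets = rewards + gamma * q_target.max(axis=1)   # r + gamma * max Q_target
    predicted = q_online[np.arange(m), actions]        # Q(s_t, a_t; theta_i)
    return np.mean((targets - predicted) ** 2)
```

The frozen parameters theta* correspond to the `q_target` array: they are held fixed while the gradient flows only through `q_online`.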
After the DQN, extended methods such as the double DQN, dueling DQN, and prioritized experience replay were introduced. The double DQN selects the action using the online Q-network but evaluates it using the corresponding state–action value from the target Q-network. This reduces the overestimation bias that occurs in the expected-reward predictions of the traditional Q-function [47], [48]. The dueling DQN divides the neural network into value and advantage streams; because the advantage stream focuses only on the relative value of actions, it achieves faster learning [49], [50]. In this study, an extended DQN combining the double DQN, dueling DQN, and prioritized experience replay was applied to traffic signal control.
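A minimal sketch of the first two extensions, assuming batched NumPy arrays of Q-values (the function names are ours, for illustration only):

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, gamma=0.99):
    """Double DQN target: the ONLINE network picks the argmax action for
    s_{t+1}, and the TARGET network evaluates that action, which reduces
    the overestimation bias of plain max-based targets."""
    best_actions = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated

def dueling_aggregate(value, advantage):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    Subtracting the mean advantage keeps V and A identifiable."""
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)
```

In the extended DQN, the dueling aggregation shapes the network output, while the double-DQN rule replaces the max term of Eq. (2) when computing targets.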
Methodology
A. State
In reinforcement learning, an agent recognizes the state from the information it observes in the environment, which is the basis for selecting its actions [51]. TSC problems largely use two state–space representations: vector-based representations [52], [53], [54] and snapshot representations [55], [56]. In this study, we used a vector-based state representation built from information that can now be collected thanks to advances in sensing and V2X technologies. Specifically, we used the number of vehicles and the average speed, which are variables commonly used to characterize the situation at an intersection, as well as a green-signal indicator and the elapsed time of the current green phase, which describe the traffic lights.
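The vector-based state described above might be assembled as follows. This is a sketch only; the field names, ordering, and normalization are illustrative assumptions, as the paper's exact encoding is defined by its equation:

```python
def build_state(n_vehicles, avg_speeds, green_active, green_elapsed):
    """Concatenate the four observed quantities into one flat state vector:
    per-lane vehicle counts, per-lane average speeds, a flag per phase for
    whether its green is currently shown, and the elapsed green time."""
    return list(n_vehicles) + list(avg_speeds) + list(green_active) + [green_elapsed]

state = build_state(
    n_vehicles=[4, 7, 2],          # vehicles detected on each approach lane
    avg_speeds=[8.3, 2.1, 11.0],   # mean speed per lane (m/s), values illustrative
    green_active=[1, 0],           # which phase currently holds green
    green_elapsed=12.0,            # seconds the current green has been shown
)
```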
Thus, the state expression of the intersection is
B. Action
Because reinforcement learning agents control signals through actions, action design is important for creating a model applicable to real-world scenarios [59]. Existing reinforcement learning-based TSC studies largely use two action representations: directly selecting the most appropriate phase from the set of signal patterns (dynamic phase selection), or deciding whether to advance to the next phase or keep the current one while preserving the specified phase sequence (binary action selection). Whereas both approaches have been actively studied for independent intersections, cooperative signal control research is dominated by dynamic phase selection. Although a method that does not preserve the phase sequence can achieve better results because it provides the most appropriate signal for the intersection situation, it can confuse drivers familiar with the existing signal patterns, increasing the possibility of accidents [34]. Therefore, in this study, the phase sequence was maintained, and based on the state, the phase was maintained (
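The binary action scheme with green-time bounds can be sketched as follows. The bound values and function name are illustrative assumptions, not the paper's parameters (the study's minimum/maximum green constraints are discussed with the comparison models):

```python
def apply_binary_action(action, elapsed, min_green=10, max_green=60):
    """Binary action selection: action 0 keeps the current phase, action 1
    advances to the next phase in the fixed sequence. Green-time bounds
    override the agent so phases are neither flashed too briefly nor
    held indefinitely."""
    if elapsed < min_green:
        return "keep"       # too early to switch, ignore the agent
    if elapsed >= max_green:
        return "advance"    # forced switch at the maximum green time
    return "advance" if action == 1 else "keep"
```

Because the phase sequence is fixed, "advance" always moves to the next predefined phase, preserving the patterns drivers already know.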
C. Reward
A reward is the value an agent receives when choosing an action; therefore, defining the reward appropriately is important [60]. In reinforcement learning, the agent aims to maximize long-term accumulated rewards through continuous interaction with the environment, so learning results may vary with how the reward is defined [61], [62]. The goal of this study was to reduce carbon dioxide emissions from vehicles. However, when we attempted to minimize total carbon dioxide emissions directly, many cars ended up sitting on the roads: because a vehicle’s emissions are related to its acceleration, the agent learned that withholding green signals and keeping more vehicles stopped yielded lower emissions than letting them run. Keeping many cars waiting on the road does not solve the problem and is not the desired outcome. Therefore, the main goal of this study was to reduce carbon dioxide emissions without unrealistically impeding traffic at intersections. According to [63] and [64], vehicle carbon dioxide emissions are affected by acceleration and deceleration; consequently, emissions on roads with signalized intersections are greater than on roads without them, and emissions are particularly high around intersections where more vehicles stop. We therefore defined the reward as minimizing the carbon dioxide emissions of waiting vehicles. In other words, our reward was
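A minimal sketch of such a reward, assuming per-vehicle CO2 readings and halted flags are available (e.g., obtainable from SUMO via TraCI; the units and aggregation here are illustrative, not the paper's exact definition):

```python
def reward(co2_per_vehicle, halted_flags):
    """Reward sketch: the negative CO2 emitted by vehicles currently
    waiting (halted) at the intersection. Penalizing only stopped
    vehicles rewards the agent for clearing queues, rather than for
    stopping traffic outright as a naive total-emission objective did.

    co2_per_vehicle: CO2 emitted by each vehicle this step (e.g., mg).
    halted_flags:    True for vehicles that are stopped/waiting.
    """
    return -sum(e for e, halted in zip(co2_per_vehicle, halted_flags) if halted)
```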
D. Cooperation of Adjacent Intersection
Traffic flow at an intersection is affected not only by the traffic conditions at the intersection itself but also by those at adjacent intersections. Interconnecting signals can reduce vehicles’ greenhouse gas emissions [65]. Therefore, cooperative signal control that considers neighbors can solve traffic congestion and carbon dioxide emission problems more effectively [66], [67]. In this study, the cooperation mechanism was improved by incorporating state-based integration into the existing Q-function-based approach. In the proposed approach, as shown in Figure 2, the agent’s action selection is influenced not only by its own state, action, and reward but also by the states, actions, and rewards of adjacent intersections. ATSC-based systems can thus equalize traffic flow between adjacent intersections while improving the overall performance of the road network. In conclusion, we update the Q-function of each agent by considering the previous reward values of adjacent intersections as follows:\begin{align*} Q_{t+1}^{i}\left ({{ s_{t}^{i}, a_{t}^{i} }}\right) &\leftarrow Q_{t}^{i}\left ({{ s_{t}^{i}, a_{t}^{i};\theta _{i} }}\right) + \alpha (t)\Big [r_{t} \\ &\quad + \gamma \max _{a^{\prime }} Q_{t}^{i}\left ({{ s_{t+1}^{i},a^{\prime };\theta _{i}^{\ast } }}\right) - Q_{t}^{i}\left ({{ s_{t}^{i}, a_{t}^{i};\theta _{i} }}\right)\Big ] \\ &\quad + \frac {1}{|N_{adj}|}\sum \limits _{j\in N_{adj}} r_{t-1}^{j} \tag {3}\end{align*}
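Equation (3) can be illustrated with a tabular sketch; the full model replaces the table with the extended DQN, and the hyperparameter values here are illustrative:

```python
import numpy as np

def cooperative_q_update(Q, s, a, r, s_next, neighbor_prev_rewards,
                         alpha=0.1, gamma=0.99):
    """Eq. (3) in tabular form: the standard TD update plus the mean of
    the neighbors' previous-step rewards, so each agent's value estimate
    is also nudged by how the adjacent intersections fared."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    neighbor_term = (np.mean(neighbor_prev_rewards)
                     if len(neighbor_prev_rewards) > 0 else 0.0)
    Q[s, a] = Q[s, a] + alpha * td_error + neighbor_term
    return Q
```

With two neighbors whose previous rewards were 0.2 and 0.4, the neighbor term adds their mean (0.3) on top of the local TD correction, matching the final summation term of (3).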
\begin{equation*} \hat {S}_{t}^{i}=\{S_{t}^{i},S^{\prime,j}_{t},a_{t-1}^{j}\} \tag {4}\end{equation*}
At the intersection shown in Fig. 3, lanes 1 and 3 are right-turn lanes, lanes 2 and 8 are left-turn lanes, and lanes 4, 5, 6, and 7 are straight lanes. Among these, the only lanes heading toward the intersection in Fig. 1 are lanes 1, 4, and 5. Lane 1 was excluded from the analysis because it was a right turn and not controlled by traffic lights. Therefore, when the intersection in Fig. 3 is considered for its own signal, it corresponds to
Experiments and Results
A. Experimental Environments
This study conducted experiments using Simulation of Urban Mobility (SUMO), an open-source simulator widely used in traffic signal research [68], [69]. With SUMO, we can compute the movements of individual vehicles and implement dynamic traffic signal control. As shown in Fig. 4, the real-time traffic situation implemented in SUMO was transmitted to the Python-based reinforcement learning model, which determined an action (signal control) from the received state; the action was then transmitted back to SUMO to change the traffic flow.
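The closed loop of Fig. 4 can be sketched as follows. Here `sim` stands in for a TraCI-like interface (with real SUMO, these methods would wrap calls such as `traci.simulationStep()` and `traci.trafficlight.setPhase()`), and the agent's `act` method is an assumption for illustration:

```python
def control_loop(sim, agent, tls_id, steps):
    """Minimal SUMO <-> agent exchange: observe the traffic state,
    let the RL model choose a signal action, apply it, then advance
    the simulation one step."""
    for _ in range(steps):
        state = sim.get_state(tls_id)   # real-time traffic situation from SUMO
        action = agent.act(state)       # RL model determines the action
        sim.set_phase(tls_id, action)   # send signal control back to SUMO
        sim.step()                      # advance the simulation one step
```

Structuring the loop around a thin `sim` interface keeps the learning code independent of SUMO itself, which also makes the loop easy to test with a stub simulator.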
Reinforcement learning-based signal optimization model learning process using SUMO.
B. Scenarios
The experiment was conducted at six partially contiguous intersections on Gyeongchung-daero in Icheon-si, Gyeonggi-do, Republic of Korea, as shown in Fig. 5(a); the sites consist of three- and four-way intersections, as shown in Fig. 5(b). This road passes through the downtown area of Icheon-si, where commercial and residential areas are concentrated, and signal cycles differ between off-peak and peak times. Therefore, we divided the simulation into two cases: off-peak hours (15:00-16:00) and peak hours (18:00-19:00). Table 1 shows the traffic volume during off-peak hours, and Table 2 shows the traffic volume during peak hours. When evaluating the trained model, experiments were based on observed real-world traffic. The traffic volumes show that Gyeongchung-daero, which runs from intersections 1 to 6, carries the main traffic stream.
Although the numbers vary by time of day, in both off-peak and peak hours, between 60% and more than 90% of the vehicles at each intersection pass straight through on the mainstream signal.
C. Results
The proposed model was trained for 150 episodes. The termination condition for each episode was the passage of approximately 8,500 vehicles. Batch size
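The prioritized experience replay used in the extended DQN samples training mini-batches in proportion to transition priorities. A sketch of the proportional variant follows; the alpha/beta values are common defaults, not the paper's settings:

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Proportional prioritized replay: transition i is drawn with
    probability p_i^alpha / sum_k p_k^alpha, and importance weights
    (N * P(i))^(-beta), normalized by their maximum, correct the
    sampling bias in the loss."""
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = np.asarray(priorities, dtype=float) ** alpha
    probs /= probs.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```

In training, the sampled indices select transitions from the replay buffer and the weights scale their per-sample losses.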
Fig. 6 shows the learning process at each traffic light over the 150 training episodes. As the number of episodes increased, the reward earned by each traffic light increased and gradually converged, indicating that the model learned a stable control policy.
In this experiment, the proposed model was compared against the off-peak/peak fixed signals currently applied on the road, a cooperative model that considers adjacent signals only through the Q-function, and a model without constraints on the signal sequence or on minimum and maximum green times. Considering the convergence speed and stability of the reinforcement learning algorithm, appropriate information should be selected to define the states [34], [70]. Cooperative approaches that incorporate adjacent intersections through the state, such as the proposed model, have the disadvantage of complicating the state representation. However, advances in DQNs allow more complex states to be considered than ever before, and this state-based cooperation lets each agent consider only the flows entering its intersection from neighbors. Therefore, to evaluate the proposed method, a model that considers neighbors only in the Q-function, not in the state, was adopted as comparison model 1; its Q-function update equation was the same as that of the proposed model. Comparison model 2 was adopted to evaluate the effects of the signal constraints. This model uses dynamic phase selection, directly determining signal allocation without preserving the predefined sequence, and the constraints on minimum and maximum phase durations were removed. Although this is a common approach in cooperative traffic signal control research and is expected to achieve higher performance, it has limitations in real-world applications. We used cumulative waiting time and carbon dioxide emissions as metrics to evaluate the performance of the proposed algorithm.
1) Off-Peak
Fig. 7 shows a comparison of the cumulative waiting times during off-peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights except the second. In particular, the proposed model performs noticeably better than the fixed signals. The proposed approach reduced the waiting time by approximately 54% on average (250,779 in total) relative to the fixed method and approximately 18% on average (123,190 in total) relative to the comparison model, which indicates that existing fixed signals are poorly suited to controlling dynamic vehicle flows.
Fig. 8 presents a comparison of carbon dioxide emissions during off-peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights except the second: at traffic light 2 it recorded slightly higher emissions than comparison model 1, but it performed better at all the other traffic lights. The proposed model reduced carbon dioxide emissions by an average of approximately 23% (829,369,598 in total) compared with the fixed method and approximately 9% (325,579,461 in total) compared with the comparison model.
In both cumulative waiting time and carbon dioxide emissions, comparison model 2 outperformed the proposed model because the removal of constraints allowed immediate signal allocation to roads with higher traffic volumes. However, its arbitrary signal sequences may confuse drivers, increasing the risk of accidents, and may cause indefinite delays for certain lanes.
2) Peak
Fig. 9 shows a comparison of the cumulative waiting times during peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights. The proposed approach reduced the waiting time by approximately 30% on average (360,627 in total) relative to the fixed method and approximately 41% on average (416,632 in total) relative to the comparison model.
Fig. 10 presents a comparison of carbon dioxide emissions during peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model yielded the best results at all traffic lights. The proposed model reduced carbon dioxide emissions by an average of approximately 21% (1,560,811,796 in total) compared with the fixed method and approximately 19% (1,023,801,766 in total) compared with the comparison model. Comparison model 2 again performed better than the proposed model in terms of cumulative waiting time and carbon dioxide emissions; nonetheless, its arbitrary signal sequence is impractical for real-world application. Therefore, the proposed model provides better signal control than the fixed-signal and comparison 1 models in both off-peak and peak hours.
Fig. 11 shows the average speeds of all the vehicles in the fixed and proposed models during peak hours. Since the velocity comparison graph shows similar shapes in off-peak and peak hours, peak hours are used as a representative example. As shown in Fig. 11, the average speed of the proposed model was similar to that of the fixed model; however, the speed variation was relatively small. Because carbon dioxide emissions are highly related to the acceleration and deceleration of vehicles, it is expected that the proposed model, with relatively less variation in vehicle speed, will perform well in terms of carbon dioxide emissions.
Change in average speed of vehicles over simulation time in fixed and proposed models.
Fig. 12 shows the average speeds of all vehicles in the comparison 1 and proposed models during peak hours. The proposed model achieves a higher average vehicle speed, and its episodes ended earlier than those of the comparison model. The evaluation in this study is based on the number of vehicles observed at real-world intersections; therefore, unlike previous studies that use a fixed duration as the episode termination condition, this study terminates each episode when a specified number of vehicles has exited the roadway. As a result, the time required to run an episode may vary between models. Because the comparison model takes longer for a given number of vehicles to pass through the road, carbon dioxide emissions accumulate over that additional time. The proposed model, in which vehicles exit the road in a relatively short time, can therefore be expected to perform well in terms of carbon dioxide emissions.
Change in average speed of vehicles over simulation time in comparison and proposed models.
Fig. 13 shows the smoothed average speed of all vehicles for the three models. Although the factors expected to drive the performance differences from the proposed model differ between the two baselines, the common thread is that the episodes in the proposed model ended earlier than in both. Indeed, when we compared travel times by route, we found that, although not every route was faster, the proposed model allowed people to reach their destinations on average approximately 42 s faster than the fixed model and approximately 48 s faster than the comparison model. This is not a small difference, considering that traversing intersections 1 to 6 takes approximately 5 minutes when traffic is flowing freely. In other words, from the vehicles' perspective, people reached their destinations faster under the proposed model than under the comparison models.
Smoothing representation of the change in average speed of vehicles over simulation time in three models.
Conclusion
In this study, we proposed a multi-intersection signal control model that uses a novel cooperative approach to reduce traffic congestion and carbon dioxide emissions. In the proposed model, agents at adjacent intersections improve overall performance by sharing their states, actions, and rewards: each agent's action is determined by considering not only the state of its own intersection but also the states and actions of its neighbors, and the estimated Q-value incorporates the last rewards received from the neighbors. Experiments on six contiguous intersections in Icheon City show that our method outperforms the fixed-signal and comparison 1 models in terms of cumulative waiting time and carbon dioxide emissions. Accumulated over a month or a year, the emissions saved by the proposed method are expected to amount to a substantial reduction in carbon dioxide.