Introduction
The year 2023 was the hottest in 174 years of observational records, a year of extremes in which global average near-surface temperatures were 1.45 ± 0.12 °C above the pre-industrial baseline [1]. Simultaneously, record-breaking heatwaves, wildfires, droughts, and floods wreaked havoc worldwide, upending everyday life for millions and inflicting many billions of dollars in economic losses [2], [3], [4], [5], [6], [7]. If greenhouse gases continue to be emitted at current levels, the global average temperature is projected to rise by more than 1.5 °C by 2040. Consequently, extreme heatwaves are expected to become 8.6 times more frequent, intense rainfall 1.5 times, and droughts twice as frequent owing to climate change [8]. Research has shown that “approximately 75% of extreme abnormal climate events are currently related to climate change caused by carbon emissions” and that “by 2030, the world will face about 560 severe disasters per year, or an average of about 1.5 per day” [9].
Recognizing the gravity of these environmental problems, the EU announced plans to reduce net greenhouse gas emissions by more than 55% from 1990 levels by 2030 and to achieve carbon neutrality by 2050 [10]. Countries worldwide are focusing on developing carbon-reduction plans. Korea has announced Net-Zero scenarios for each sector and is making significant efforts toward carbon neutrality [11]. Korea’s greenhouse gas emissions in 2022 were 725.744 Mt CO2 eq/year, with carbon dioxide accounting for 87.6% [12]. Korea ranks 11th in carbon dioxide emissions among 210 countries, and emissions from its transportation sector amount to 107.365 Mt CO2, or 17% of the total. Traffic congestion has become a daily occurrence in Korea’s urban areas, where the number of cars is unusually high relative to the land area, resulting in enormous economic and time losses as well as serious traffic accidents and air pollution problems. Congestion at signalized intersections causes vehicle idling, in which engines keep running without movement, producing air pollutants at rates up to four times higher than during normal driving [13], [14]. Reducing road congestion would reduce vehicle idling, yielding an estimated annual reduction of 1.96 tons of pollutant emissions [15]. Solving congestion through capacity increases such as road expansion requires significant time and financial resources [16]. Therefore, the intersection traffic signal control problem (ITSCP) has been emphasized as a means of reducing urban traffic congestion and making efficient use of the limited capacity of existing roads [17], [18].
Currently, most roads in the Republic of Korea use fixed-time signal control, an operating scheme that repeats preplanned signal patterns over a set cycle [19]. However, this fixed-signal model cannot respond flexibly to real-time traffic changes [20], [21]. To address these limitations, recent research has focused on adaptive traffic signal control (ATSC) [22], [23], [24], [25], [26], [27]. Traffic signal control research using multi-agent reinforcement learning (MARL) consists primarily of independent and cooperative signal control approaches. Although independent signal control has been studied actively, applying independent methods to real-world traffic networks, which are complex and influenced by adjacent intersections, has limited problem-solving ability [28], [29]. Therefore, recent research on traffic signal control has considered adjacent intersections.
Several studies have shown that cooperative signal control considering adjacent intersections achieves good results in solving traffic problems. Most signal control studies aim to reduce vehicle waiting time or queue length. Reference [30] learned at a
The remainder of this paper is organized into four major sections. Section II explains DQN, one of the most widely used algorithms in adaptive traffic signal control research. Section III describes the construction environment and cooperative method used in this study. Section IV presents the results of an experiment that involved configuring a real intersection environment with the SUMO micro-traffic simulator. Finally, Section V provides concluding remarks.
Reinforcement Learning
Q-learning is a reinforcement learning algorithm based on temporal differences, which searches for an optimal policy using an action-value function [35]. Q-learning iteratively updates the action-value estimate as \begin{align*} Q(s_{t}, a_{t}) &\leftarrow Q(s_{t}, a_{t}) \\ &\quad + \alpha \left [{ r_{t} + \gamma \max _{a^{\prime }} Q(s_{t+1}, a^{\prime }) - Q(s_{t}, a_{t}) }\right ] \tag {1}\end{align*}
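As a concrete illustration, the update in (1) can be sketched in a few lines. This is a minimal tabular sketch; the state/action indices, learning rate, and discount factor below are illustrative, not the paper's settings:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step, following Eq. (1):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * np.max(Q[s_next])   # r_t + gamma * max_a' Q(s_{t+1}, a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move toward the TD target
    return Q

# Toy example: 3 states, 2 actions (all values chosen for illustration).
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

With zero-initialized values, the single update above moves Q(0, 1) from 0 to alpha * r = 0.1, exactly the TD correction of (1).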
Tabular Q-learning performs well on problems with small-scale discrete states and actions; however, it generalizes poorly to real-world problems with large-scale continuous states and actions, where computation time increases rapidly [38], [39]. To address this problem, Q-learning with neural networks was proposed, leading to the deep Q-network (DQN) algorithm [40], [41]. The DQN is a popular reinforcement learning algorithm, and numerous studies have applied it to adaptive traffic signal control [42], [43], [44]. In a DQN, instead of estimating the Q-value of each state–action pair individually, a deep neural network serves as a function approximator that maps states to Q-values. This function approximation allows the use of larger, continuous state spaces [45]. Equation (2) expresses the loss function of the DQN:\begin{align*} MSE(\theta _{i}) &= \frac {1}{m}\sum _{t=1}^{m} \Big (r_{t} + \gamma \max _{a^{\prime }} Q(s_{t+1}, a^{\prime }; \theta _{i}^{\ast }) \\ &\quad - Q(s_{t}, a_{t}; \theta _{i})\Big)^{2} \tag {2}\end{align*}
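The loss in (2) can be sketched with plain NumPy arrays standing in for network outputs (shapes and values are illustrative; a real DQN would compute these with a deep-learning framework):

```python
import numpy as np

def dqn_mse_loss(q_online, q_target, actions, rewards, gamma=0.99):
    """Mini-batch MSE loss of Eq. (2): the online network's Q(s_t, a_t)
    is regressed toward r_t + gamma * max_a' Q(s_{t+1}, a'; theta*).

    q_online: (m, n_actions) Q-values of s_t from the online network.
    q_target: (m, n_actions) Q-values of s_{t+1} from the frozen target network.
    """
    m = len(actions)
    targets = rewards + gamma * q_target.max(axis=1)   # r + gamma * max Q_target
    predicted = q_online[np.arange(m), actions]        # Q(s_t, a_t; theta_i)
    return np.mean((targets - predicted) ** 2)
```

The frozen parameters theta* correspond to the `q_target` array: they are held fixed while the gradient flows only through `q_online`.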
After the DQN, extended methods such as the double DQN, dueling DQN, and prioritized experience replay were introduced. The double DQN selects the action using the online Q-network but evaluates it using the corresponding state–action value from the target Q-network. This reduces the overestimation bias that occurs in the expected-reward predictions of the traditional Q-function [47], [48]. The dueling DQN divides the neural network into value and advantage streams; because the advantage stream focuses only on the relative value of actions, it achieves faster learning [49], [50]. In this study, an extended DQN combining the double DQN, dueling DQN, and prioritized experience replay was applied to traffic signal control.
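A minimal sketch of the first two extensions, assuming batched NumPy arrays of Q-values (the function names are ours, for illustration only):

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, gamma=0.99):
    """Double DQN target: the ONLINE network picks the argmax action for
    s_{t+1}, and the TARGET network evaluates that action, which reduces
    the overestimation bias of plain max-based targets."""
    best_actions = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated

def dueling_aggregate(value, advantage):
    """Dueling head: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).
    Subtracting the mean advantage keeps V and A identifiable."""
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)
```

In the extended DQN, the dueling aggregation shapes the network output, while the double-DQN rule replaces the max term of Eq. (2) when computing targets.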
Methodology
A. State
In reinforcement learning, an agent recognizes the state from the information it observes in the environment, which is the basis for selecting its actions [51]. TSC problems largely use two state–space representations: vector-based representations [52], [53], [54] and snapshot representations [55], [56]. In this study, we used a vector-based state representation built from information that can now be collected thanks to advances in sensing and V2X technologies. Specifically, we used the number of vehicles and the average speed, which are variables commonly used to characterize the situation at an intersection, as well as a green-signal indicator and the elapsed time of the current green phase, which describe the traffic lights.
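The vector-based state described above might be assembled as follows. This is a sketch only; the field names, ordering, and normalization are illustrative assumptions, as the paper's exact encoding is defined by its equation:

```python
def build_state(n_vehicles, avg_speeds, green_active, green_elapsed):
    """Concatenate the four observed quantities into one flat state vector:
    per-lane vehicle counts, per-lane average speeds, a flag per phase for
    whether its green is currently shown, and the elapsed green time."""
    return list(n_vehicles) + list(avg_speeds) + list(green_active) + [green_elapsed]

state = build_state(
    n_vehicles=[4, 7, 2],          # vehicles detected on each approach lane
    avg_speeds=[8.3, 2.1, 11.0],   # mean speed per lane (m/s), values illustrative
    green_active=[1, 0],           # which phase currently holds green
    green_elapsed=12.0,            # seconds the current green has been shown
)
```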
Thus, the state expression of the intersection is
B. Action
Because reinforcement learning agents control signals through actions, action design is important for creating a model applicable to real-world scenarios [59]. Existing reinforcement learning-based TSC studies largely use two action representations: directly selecting the most appropriate phase from the set of signal patterns (dynamic phase selection), or deciding whether to advance to the next phase or keep the current one while preserving the specified phase sequence (binary action selection). Whereas both approaches have been actively studied for independent intersections, cooperative signal control research is dominated by dynamic phase selection. Although a method that does not preserve the phase sequence can achieve better results because it provides the most appropriate signal for the intersection situation, it can confuse drivers familiar with the existing signal patterns, increasing the possibility of accidents [34]. Therefore, in this study, the phase sequence was maintained, and based on the state, the phase was maintained (
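The binary action scheme with green-time bounds can be sketched as follows. The bound values and function name are illustrative assumptions, not the paper's parameters (the study's minimum/maximum green constraints are discussed with the comparison models):

```python
def apply_binary_action(action, elapsed, min_green=10, max_green=60):
    """Binary action selection: action 0 keeps the current phase, action 1
    advances to the next phase in the fixed sequence. Green-time bounds
    override the agent so phases are neither flashed too briefly nor
    held indefinitely."""
    if elapsed < min_green:
        return "keep"       # too early to switch, ignore the agent
    if elapsed >= max_green:
        return "advance"    # forced switch at the maximum green time
    return "advance" if action == 1 else "keep"
```

Because the phase sequence is fixed, "advance" always moves to the next predefined phase, preserving the patterns drivers already know.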
C. Reward
A reward is the value an agent receives when choosing an action; therefore, defining the reward appropriately is important [60]. In reinforcement learning, the agent aims to maximize long-term accumulated rewards through continuous interaction with the environment, so learning results may vary with how the reward is defined [61], [62]. The goal of this study was to reduce carbon dioxide emissions from vehicles. However, when we attempted to minimize total carbon dioxide emissions directly, many cars ended up sitting on the roads: because a vehicle’s emissions are related to its acceleration, the agent learned that withholding green signals and keeping more vehicles stopped yielded lower emissions than letting them run. Keeping many cars waiting on the road does not solve the problem and is not the desired outcome. Therefore, the main goal of this study was to reduce carbon dioxide emissions without unrealistically impeding traffic at intersections. According to [63] and [64], vehicle carbon dioxide emissions are affected by acceleration and deceleration; consequently, emissions on roads with signalized intersections are greater than on roads without them, and emissions are particularly high around intersections where more vehicles stop. We therefore defined the reward as minimizing the carbon dioxide emissions of waiting vehicles. In other words, our reward was
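A minimal sketch of such a reward, assuming per-vehicle CO2 readings and halted flags are available (e.g., obtainable from SUMO via TraCI; the units and aggregation here are illustrative, not the paper's exact definition):

```python
def reward(co2_per_vehicle, halted_flags):
    """Reward sketch: the negative CO2 emitted by vehicles currently
    waiting (halted) at the intersection. Penalizing only stopped
    vehicles rewards the agent for clearing queues, rather than for
    stopping traffic outright as a naive total-emission objective did.

    co2_per_vehicle: CO2 emitted by each vehicle this step (e.g., mg).
    halted_flags:    True for vehicles that are stopped/waiting.
    """
    return -sum(e for e, halted in zip(co2_per_vehicle, halted_flags) if halted)
```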
D. Cooperation of Adjacent Intersection
Traffic flow at an intersection is affected not only by the traffic conditions at the intersection itself but also by those at adjacent intersections. Interconnecting signals can reduce vehicles’ greenhouse gas emissions [65]. Therefore, cooperative signal control that considers neighbors can solve traffic congestion and carbon dioxide emission problems more effectively [66], [67]. In this study, the cooperation mechanism was improved by incorporating state-based integration into the existing Q-function-based approach. In the proposed approach, as shown in Figure 2, the agent’s action selection is influenced not only by its own state, action, and reward but also by the states, actions, and rewards of adjacent intersections. ATSC-based systems can thus equalize traffic flow between adjacent intersections while improving the overall performance of the road network. In conclusion, we update the Q-function of each agent by considering the previous reward values of adjacent intersections as follows:\begin{align*} Q_{t+1}^{i}\left ({{ s_{t}^{i}, a_{t}^{i} }}\right) &\leftarrow Q_{t}^{i}\left ({{ s_{t}^{i}, a_{t}^{i};\theta _{i} }}\right) + \alpha (t)\Big [r_{t} \\ &\quad + \gamma \max _{a^{\prime }} Q_{t}^{i}\left ({{ s_{t+1}^{i},a^{\prime };\theta _{i}^{\ast } }}\right) - Q_{t}^{i}\left ({{ s_{t}^{i}, a_{t}^{i};\theta _{i} }}\right)\Big ] \\ &\quad + \frac {1}{|N_{adj}|}\sum \limits _{j\in N_{adj}} r_{t-1}^{j} \tag {3}\end{align*}
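Equation (3) can be illustrated with a tabular sketch; the full model replaces the table with the extended DQN, and the hyperparameter values here are illustrative:

```python
import numpy as np

def cooperative_q_update(Q, s, a, r, s_next, neighbor_prev_rewards,
                         alpha=0.1, gamma=0.99):
    """Eq. (3) in tabular form: the standard TD update plus the mean of
    the neighbors' previous-step rewards, so each agent's value estimate
    is also nudged by how the adjacent intersections fared."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    neighbor_term = (np.mean(neighbor_prev_rewards)
                     if len(neighbor_prev_rewards) > 0 else 0.0)
    Q[s, a] = Q[s, a] + alpha * td_error + neighbor_term
    return Q
```

With two neighbors whose previous rewards were 0.2 and 0.4, the neighbor term adds their mean (0.3) on top of the local TD correction, matching the final summation term of (3).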
\begin{equation*} \hat {S}_{t}^{i}=\{S_{t}^{i},S^{\prime,j}_{t},a_{t-1}^{j}\} \tag {4}\end{equation*}
At the intersection shown in Fig. 3, lanes 1 and 3 are right-turn lanes, lanes 2 and 8 are left-turn lanes, and lanes 4, 5, 6, and 7 are straight lanes. Among these, the only lanes heading toward the intersection in Fig. 1 are lanes 1, 4, and 5. Lane 1 was excluded from the analysis because it was a right turn and not controlled by traffic lights. Therefore, when the intersection in Fig. 3 is considered for its own signal, it corresponds to
Experiments and Results
A. Experimental Environments
This study conducted experiments using Simulation of Urban Mobility (SUMO), an open-source simulator widely used in traffic signal research [68], [69]. With SUMO, we can compute the movements of individual vehicles and implement dynamic traffic signal control. As shown in Fig. 4, the real-time traffic situation implemented in SUMO was transmitted to the Python-based reinforcement learning model, which determined an action (signal control) from the received state; the action was then transmitted back to SUMO to change the traffic flow.
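The closed loop of Fig. 4 can be sketched as follows. Here `sim` stands in for a TraCI-like interface (with real SUMO, these methods would wrap calls such as `traci.simulationStep()` and `traci.trafficlight.setPhase()`), and the agent's `act` method is an assumption for illustration:

```python
def control_loop(sim, agent, tls_id, steps):
    """Minimal SUMO <-> agent exchange: observe the traffic state,
    let the RL model choose a signal action, apply it, then advance
    the simulation one step."""
    for _ in range(steps):
        state = sim.get_state(tls_id)   # real-time traffic situation from SUMO
        action = agent.act(state)       # RL model determines the action
        sim.set_phase(tls_id, action)   # send signal control back to SUMO
        sim.step()                      # advance the simulation one step
```

Structuring the loop around a thin `sim` interface keeps the learning code independent of SUMO itself, which also makes the loop easy to test with a stub simulator.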
Reinforcement learning-based signal optimization model learning process using SUMO.
B. Scenarios
The experiment was conducted at six partially contiguous intersections on Gyeongchung-daero in Icheon-si, Gyeonggi-do, Republic of Korea, as shown in Fig. 5(a); the sites consist of three- and four-way intersections, as shown in Fig. 5(b). This road passes through the downtown area of Icheon-si, where commercial and residential areas are concentrated, and signal cycles differ between off-peak and peak times. Therefore, we divided the simulation into two cases: off-peak hours (15:00-16:00) and peak hours (18:00-19:00). Table 1 shows the traffic volume during off-peak hours, and Table 2 shows the traffic volume during peak hours. When evaluating the trained model, experiments were based on observed real-world traffic. The traffic volumes show that Gyeongchung-daero, which runs from intersections 1 to 6, carries the main traffic stream.
Although the numbers vary by time of day, in both off-peak and peak hours, between 60% and more than 90% of the vehicles at each intersection pass straight through on the mainstream signal.
C. Results
The proposed model was trained for 150 episodes. The termination condition for each episode was the passage of approximately 8,500 vehicles. Batch size
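The prioritized experience replay used in the extended DQN samples training mini-batches in proportion to transition priorities. A sketch of the proportional variant follows; the alpha/beta values are common defaults, not the paper's settings:

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Proportional prioritized replay: transition i is drawn with
    probability p_i^alpha / sum_k p_k^alpha, and importance weights
    (N * P(i))^(-beta), normalized by their maximum, correct the
    sampling bias in the loss."""
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = np.asarray(priorities, dtype=float) ** alpha
    probs /= probs.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights
```

In training, the sampled indices select transitions from the replay buffer and the weights scale their per-sample losses.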
Fig. 6 shows the learning process at each traffic light over the 150 training episodes. As the number of episodes increased, the reward earned by each traffic light increased and gradually converged, indicating that the model learned a stable control policy.
In this experiment, the proposed model was compared against the off-peak/peak fixed signals currently applied on the road, a cooperative model that considers adjacent signals only through the Q-function, and a model without constraints on the signal sequence or on minimum and maximum green times. Considering the convergence speed and stability of the reinforcement learning algorithm, appropriate information should be selected to define the states [34], [70]. Cooperative approaches that incorporate adjacent intersections through the state, such as the proposed model, have the disadvantage of complicating the state representation. However, advances in DQNs allow more complex states to be considered than ever before, and this state-based cooperation lets each agent consider only the flows entering its intersection from neighbors. Therefore, to evaluate the proposed method, a model that considers neighbors only in the Q-function, not in the state, was adopted as comparison model 1; its Q-function update equation was the same as that of the proposed model. Comparison model 2 was adopted to evaluate the effects of the signal constraints. This model uses dynamic phase selection, directly determining signal allocation without preserving the predefined sequence, and the constraints on minimum and maximum phase durations were removed. Although this is a common approach in cooperative traffic signal control research and is expected to achieve higher performance, it has limitations in real-world applications. We used cumulative waiting time and carbon dioxide emissions as metrics to evaluate the performance of the proposed algorithm.
1) Off-Peak
Fig. 7 shows a comparison of the cumulative waiting times during off-peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights except the second. In particular, the proposed model performs noticeably better than the fixed signals. The proposed approach reduced the waiting time by approximately 54% on average (250,779 in total) relative to the fixed method and approximately 18% on average (123,190 in total) relative to the comparison model, which indicates that existing fixed signals are poorly suited to controlling dynamic vehicle flows.
Fig. 8 presents a comparison of carbon dioxide emissions during off-peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights except the second: at traffic light 2 it recorded slightly higher emissions than comparison model 1, but it performed better at all the other traffic lights. The proposed model reduced carbon dioxide emissions by an average of approximately 23% (829,369,598 in total) compared with the fixed method and approximately 9% (325,579,461 in total) compared with the comparison model.
In both cumulative waiting time and carbon dioxide emissions, comparison model 2 outperformed the proposed model because the removal of constraints allowed immediate signal allocation to roads with higher traffic volumes. However, its arbitrary signal sequences may confuse drivers, increasing the risk of accidents, and may cause indefinite delays for certain lanes.
2) Peak
Fig. 9 shows a comparison of the cumulative waiting times during peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model showed the best results at all traffic lights. The proposed approach reduced the waiting time by approximately 30% on average (360,627 in total) relative to the fixed method and approximately 41% on average (416,632 in total) relative to the comparison model.
Fig. 10 presents a comparison of carbon dioxide emissions during peak hours. Compared with the fixed-signal and comparison 1 models, the proposed model yielded the best results at all traffic lights. The proposed model reduced carbon dioxide emissions by an average of approximately 21% (1,560,811,796 in total) compared with the fixed method and approximately 19% (1,023,801,766 in total) compared with the comparison model. Comparison model 2 again performed better than the proposed model in terms of cumulative waiting time and carbon dioxide emissions; nonetheless, its arbitrary signal sequence is impractical for real-world application. Therefore, the proposed model provides better signal control than the fixed-signal and comparison 1 models in both off-peak and peak hours.
Fig. 11 shows the average speeds of all the vehicles in the fixed and proposed models during peak hours. Since the velocity comparison graph shows similar shapes in off-peak and peak hours, peak hours are used as a representative example. As shown in Fig. 11, the average speed of the proposed model was similar to that of the fixed model; however, the speed variation was relatively small. Because carbon dioxide emissions are highly related to the acceleration and deceleration of vehicles, it is expected that the proposed model, with relatively less variation in vehicle speed, will perform well in terms of carbon dioxide emissions.
Change in average speed of vehicles over simulation time in fixed and proposed models.
Fig. 12 shows the average speeds of all vehicles in the comparison 1 and proposed models during peak hours. The proposed model achieves a higher average vehicle speed, and its episodes ended earlier than those of the comparison model. The evaluation in this study is based on the number of vehicles observed at real-world intersections; therefore, unlike previous studies that use a fixed duration as the episode termination condition, this study terminates each episode when a specified number of vehicles has exited the roadway. As a result, the time required to run an episode may vary between models. Because the comparison model takes longer for a given number of vehicles to pass through the road, carbon dioxide emissions accumulate over that additional time. The proposed model, in which vehicles exit the road in a relatively short time, can therefore be expected to perform well in terms of carbon dioxide emissions.
Change in average speed of vehicles over simulation time in comparison and proposed models.
Fig. 13 shows the smoothed average speed of all vehicles for the three models. Although the factors expected to drive the performance differences from the proposed model differ between the two baselines, the common thread is that the episodes in the proposed model ended earlier than in both. Indeed, when we compared travel times by route, we found that, although not every route was faster, the proposed model allowed people to reach their destinations on average approximately 42 s faster than the fixed model and approximately 48 s faster than the comparison model. This is not a small difference, considering that traversing intersections 1 to 6 takes approximately 5 minutes when traffic is flowing freely. In other words, from the vehicles' perspective, people reached their destinations faster under the proposed model than under the comparison models.
Smoothing representation of the change in average speed of vehicles over simulation time in three models.
Conclusion
In this study, we proposed a multi-intersection signal control model that uses a novel cooperative approach to reduce traffic congestion and carbon dioxide emissions. In the proposed model, agents at adjacent intersections improve overall performance by sharing their states, actions, and rewards: each agent's action is determined by considering not only the state of its own intersection but also the states and actions of its neighbors, and the estimated Q-value incorporates the last rewards received from the neighbors. Experiments on six contiguous intersections in Icheon City show that our method outperforms the fixed-signal and comparison 1 models in terms of cumulative waiting time and carbon dioxide emissions. Accumulated over a month or a year, the emissions saved by the proposed method are expected to amount to a substantial reduction in carbon dioxide.