Introduction
Vehicular communication, commonly referred to as vehicle-to-everything (V2X) communication, is envisioned to transform connected vehicles and intelligent transportation services in various aspects, such as road safety, traffic efficiency, and ubiquitous Internet access [1], [2]. More recently, the 3rd Generation Partnership Project (3GPP) has been looking to support V2X services in long-term evolution (LTE) and future 5G cellular networks [3]–[5]. Cross-industry consortia, such as the 5G Automotive Association (5GAA), have been founded by the telecommunication and automotive industries to push the development, testing, and deployment of cellular V2X technologies.
A. Problem Statement and Motivation
This paper considers spectrum access design in vehicular networks, which in general comprise both vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) connectivity. As illustrated in Fig. 1, the V2I links connect each vehicle to the base station (BS) or BS-type road side unit (RSU) while V2V links provide direct communications among neighboring vehicles. We focus on the cellular based V2X architecture discussed within the 3GPP [3], where V2I and V2V connections are supported through cellular (Uu) and sidelink (PC5) radio interfaces, respectively. A wide array of new use cases and requirements have been proposed and analyzed for 5G V2X enhancements in Release 15 [4], [5]. For example, 5G cellular V2X networks are required to simultaneously support mobile high data rate entertainment and advanced driving services. The entertainment applications require a high bandwidth V2I connection to the BS (and further the Internet) for, e.g., video streaming. Meanwhile, the advanced driving service needs to periodically disseminate safety messages among neighboring vehicles (e.g., 10, 20, or 50 packets per second depending on vehicle mobility [5]) through V2V communications with high reliability. The safety messages usually include such information as vehicle position, speed, and heading to increase “co-operative awareness” of the local driving environment for all vehicles.
An illustrative structure of vehicular networks, where V2I and V2V links are indexed by $m$ and $k$, respectively.
This work is based on Mode 4 defined in the 3GPP cellular V2X architecture, where the vehicles have a pool of radio resources that they can autonomously select from for V2V communications [5]. To fully use available resources, we propose that such sidelink V2V connections share spectrum with Uu (V2I) links under appropriate interference management. We simplify the treatment of the V2I communications by assuming that they have already been assigned orthogonal spectrum sub-bands and transmit with fixed power. The resource optimization is therefore left to the design of the V2V connections, which need effective strategies for sharing spectrum with the V2I links, including the selection of spectrum sub-band and proper control of transmission power, to meet the diverse service requirements of both V2I and V2V links. Such an architecture provides more opportunities for the coexistence of V2I and V2V connections on limited frequency spectrum, but also complicates interference management in the network and hence motivates this work.
While there exists a rich body of literature applying conventional optimization methods to solve similarly formulated V2X resource allocation problems, such methods struggle to fully address them in several respects. On one hand, fast changing channel conditions in vehicular environments cause substantial uncertainty for resource allocation, e.g., in terms of performance loss induced by inaccuracy of the acquired channel state information (CSI). On the other hand, increasingly diverse service requirements are being brought up to support new V2X applications, such as simultaneously maximizing throughput and reliability for a mix of V2X traffic, as discussed earlier in the motivational example. Such requirements are sometimes hard to model in a mathematically exact way, not to mention finding a systematic approach to obtain optimal solutions. Fortunately, reinforcement learning (RL) has been shown effective in addressing decision making under uncertainty [6]. In particular, the recent success of deep RL in human-level video game play [7] and AlphaGo [8] has sparked a flurry of interest in applying RL techniques to solve problems from a wide variety of areas, and remarkable progress has been made ever since [9]–[11]. RL provides a robust and principled way to treat environment dynamics and perform sequential decision making under uncertainty, thus representing a promising method to handle the unique and challenging V2X dynamics. In addition, hard-to-optimize objectives can also be nicely addressed in an RL framework by designing training rewards that correlate with the final objective; the learning algorithm can then figure out a clever strategy to approach the ultimate goal by itself. Another potential advantage of using RL for resource allocation is that distributed algorithms are made possible, as demonstrated in [12], which treats each V2V link as an agent that learns to refine its resource sharing strategy through interacting with the unknown vehicular environment. As a result, we investigate the use of multi-agent RL tools to solve the V2X spectrum access problem in this work.
B. Related Work
To address the challenges caused by fleeting channel conditions in vehicular environments, a heuristic spatial spectrum reuse scheme has been developed in [13] for device-to-device (D2D) based vehicular networks, relieving requirements on full CSI. In [14], V2X resource allocation, which maximizes throughput of V2I links, adapts to slowly-varying large-scale channel fading and hence reduces network signaling overhead. Further in [15], similar strategies have been adopted while spectrum sharing is allowed not only between V2I and V2V links but also among peer V2V links. A proximity and QoS-aware resource allocation scheme for V2V communications has been developed in [16] that minimizes the total transmission power of all V2V links while satisfying latency and reliability requirements using a Lyapunov-based stochastic optimization framework. Sum ergodic capacity of V2I links has been maximized with V2V reliability guarantee using large-scale fading channel information in [17] or CSI from periodic feedback in [18]. A novel graph-based approach has been further developed in [19] to deal with a generic V2X resource allocation problem.
Apart from the traditional optimization methods, RL based approaches have been developed in several recent works to address resource allocation in V2X networks [20], [21]. In [22], RL algorithms have been applied to address the resource provisioning problem in vehicular clouds such that dynamic resource demands and stringent quality of service requirements of various entities in the clouds are met with minimal overhead. The radio resource management problem for transmission delay minimization in software-defined vehicular networks has been studied in [23], which is formulated as an infinite-horizon partially observed Markov decision process (MDP) and solved with an online distributed learning algorithm based on an equivalent Bellman equation and stochastic approximation. In [24], a deep RL based method has been proposed to jointly manage the networking, caching, and computing resources in virtualized vehicular networks with information-centric networking and mobile edge computing capabilities. The developed deep RL based approach efficiently solves the highly complex joint optimization problem and improves total revenues for the virtual network operators. In [25], the downlink scheduling has been optimized for battery-charged roadside units in vehicular networks using RL methods to maximize the number of fulfilled service requests during a discharge period, where Q learning is employed to obtain the highest long-term returns. The framework has been further extended in [26], where a deep RL based scheme has been proposed to learn a scheduling policy with high dimensional continuous inputs using end-to-end learning. A distributed user association approach based on RL has been developed in [27] for vehicular networks with heterogeneous BSs. The proposed method leverages the
This work differentiates itself from existing studies in at least two aspects. First, we explicitly model and solve the problem of improving the V2V payload delivery rate, i.e., the success probability of delivering packets of size
C. Contribution
In this paper, we consider the spectrum sharing problem in high mobility vehicular networks, where multiple V2V links attempt to share the frequency spectrum preoccupied by V2I links. To support diverse service requirements in vehicular networks, we design V2V spectrum and power allocation schemes that maximize the capacity of V2I links for high bandwidth content delivery while improving the payload delivery reliability of V2V links for periodic safety-critical message sharing. The major contributions of this work are summarized as follows.
We model the spectrum access of the multiple V2V links as a multi-agent problem and exploit recent progress of multi-agent RL [29], [30] to develop a distributed spectrum and power allocation algorithm that simultaneously improves performance of both V2I and V2V links.
We provide a direct treatment of reliability guarantee for periodic safety message sharing of V2V links that adjusts V2V spectrum sub-band selection and power control in response to small-scale channel fading within the message generation period.
We show that through a proper reward design and training mechanism, the V2V transmitters can learn from interactions with the communication environment and figure out a clever strategy of working cooperatively with each other in a distributed way to optimize system level performance based on local information.
D. Paper Organization
The rest of the paper is organized as follows. The system model is described in Section II. We present the proposed multi-agent RL based V2X resource sharing design in Section III. Section IV provides our experimental results and concluding remarks are finally made in Section V.
System Model
We consider the cellular based vehicular communication network shown in Fig. 1, with $M$ V2I links and $K$ V2V links.
We focus on Mode 4 defined in the cellular V2X architecture, where vehicles have a pool of radio resources that they can autonomously select for V2V communications [5]. Such resource pools can overlap with those of the cellular V2I interfaces for better spectrum utilization provided necessary interference management design is in place, which is investigated in this work. We further assume that the $M$ V2I links have been preassigned orthogonal spectrum sub-bands with fixed transmission power, i.e., the $m$th V2I link occupies the $m$th sub-band, so that the remaining design freedom lies in the spectrum sub-band selection and power control of the $K$ V2V links.
Orthogonal frequency division multiplexing (OFDM) is exploited to convert the frequency selective wireless channels into multiple parallel flat channels over different subcarriers. Several consecutive subcarriers are grouped to form a spectrum sub-band and we assume channel fading is approximately the same within one sub-band and independent across different sub-bands. During one coherence time period, the channel power gain, $g_{k}[m]$, of the $k$th V2V link over the $m$th sub-band is modeled as \begin{equation*} g_{k}[m] = \alpha _{k} h_{k}[m],\tag{1}\end{equation*} where $\alpha_{k}$ captures the frequency-independent large-scale fading, including path loss and shadowing, and $h_{k}[m]$ is the frequency-dependent small-scale fading power component. All other channels in the network are modeled in the same way.
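For concreteness, the following sketch (an illustration of (1), not code from this work) samples the per-link, per-sub-band channel power gains; Rayleigh small-scale fading is assumed so that $h_{k}[m]$ is exponentially distributed with unit mean, and the function and argument names are our own.

```python
import numpy as np

def sample_channel_gains(alpha, num_subbands, rng=None):
    """Sample per-sub-band channel power gains g_k[m] = alpha_k * h_k[m], Eq. (1).

    alpha: array of large-scale fading power gains (path loss and shadowing),
    one entry per link. Rayleigh small-scale fading is assumed here, so the
    small-scale power gain h_k[m] is exponential with unit mean, drawn
    independently for each sub-band.
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = np.asarray(alpha, dtype=float)                      # shape (K,)
    h = rng.exponential(scale=1.0, size=(alpha.size, num_subbands))
    return alpha[:, None] * h                                   # g[k, m]
```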
The received signal-to-interference-plus-noise ratio (SINR) of the $m$th V2I link over the $m$th sub-band is given by \begin{equation*} \gamma _{m}^{c}[m] = \frac {P_{m}^{c} \hat {g}_{m,B}[m]}{\sigma ^{2} + \sum \limits _{k}\rho _{k}[m] P_{k}^{d}[m] g_{k,B}[m]},\tag{2}\end{equation*} where $P_{m}^{c}$ and $P_{k}^{d}[m]$ denote the transmission powers of the $m$th V2I transmitter and the $k$th V2V transmitter over the $m$th sub-band, respectively, $\sigma^{2}$ is the noise power, $\hat{g}_{m,B}[m]$ is the channel gain of the $m$th V2I link to the BS, $g_{k,B}[m]$ is the interfering channel gain from the $k$th V2V transmitter to the BS, and $\rho_{k}[m]\in\{0,1\}$ indicates whether the $k$th V2V link occupies the $m$th sub-band.
Similarly, the received SINR of the $k$th V2V link over the $m$th sub-band is \begin{equation*} \gamma _{k}^{d}[m] = \frac {P_{k}^{d}[m] g_{k}[m]}{\sigma ^{2} + I_{k}[m]},\tag{3}\end{equation*}
where the interference power received by the $k$th V2V link over the $m$th sub-band is \begin{equation*} I_{k}[m] = P_{m}^{c} \hat {g}_{m,k}[m] + \sum \limits _{k'\ne k}\rho _{k'}[m] P_{k'}^{d}[m] g_{k',k}[m],\tag{4}\end{equation*} with $\hat{g}_{m,k}[m]$ denoting the interfering channel gain from the $m$th V2I transmitter to the $k$th V2V receiver and $g_{k',k}[m]$ the interfering channel gain from the $k'$th V2V transmitter to the $k$th V2V receiver.
Capacities of the $m$th V2I link and the $k$th V2V link over the $m$th sub-band are then \begin{equation*} C_{m}^{c}[m] = W\log (1+\gamma _{m}^{c}[m]),\tag{5}\end{equation*}
and \begin{equation*} C_{k}^{d}[m] = W\log (1+\gamma _{k}^{d}[m]),\tag{6}\end{equation*} respectively, where $W$ is the bandwidth of each spectrum sub-band.
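The SINR and capacity expressions in (2)–(6) can be evaluated directly; the sketch below is a minimal illustration, assuming all quantities are given in linear scale, a base-2 logarithm for capacity, and hypothetical function and argument names.

```python
import numpy as np

def v2i_capacity(P_c, g_mB, rho, P_d, g_kB, W, noise_power):
    """Capacity of the m-th V2I link on its sub-band, following Eqs. (2) and (5).

    P_c: V2I transmit power; g_mB: its channel gain to the BS.
    rho, P_d, g_kB: length-K arrays of sub-band indicators, V2V transmit powers,
    and V2V-to-BS interfering channel gains on this sub-band.
    """
    interference = np.sum(np.asarray(rho) * np.asarray(P_d) * np.asarray(g_kB))
    sinr = P_c * g_mB / (noise_power + interference)            # Eq. (2)
    return W * np.log2(1.0 + sinr)                              # Eq. (5)

def v2v_capacity(P_d_k, g_k, P_c, g_mk, rho_other, P_d_other, g_other_k, W, noise_power):
    """Capacity of the k-th V2V link on sub-band m, following Eqs. (3), (4), and (6)."""
    interference = P_c * g_mk + np.sum(
        np.asarray(rho_other) * np.asarray(P_d_other) * np.asarray(g_other_k))  # Eq. (4)
    sinr = P_d_k * g_k / (noise_power + interference)           # Eq. (3)
    return W * np.log2(1.0 + sinr)                              # Eq. (6)
```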
As described earlier, the V2I links are designed to support mobile high data rate entertainment services and hence an appropriate design objective is to maximize their sum capacity, $\sum_{m} C_{m}^{c}[m]$. In contrast, the V2V links need to periodically disseminate safety-critical messages of payload size $B$, which should be delivered within a time constraint of $T$ time slots, each of duration $\Delta_{T}$. We therefore characterize the reliability of the $k$th V2V link by its payload delivery rate, i.e., the probability \begin{equation*} \text {Pr}\left \{{ \sum _{t=1}^{T}\sum \limits _{m=1}^{M} \rho _{k}[m] C_{k}^{d}[m,t] \ge B/\Delta _{T}}\right \},\quad k\in \mathcal {K},\tag{7}\end{equation*} where $C_{k}^{d}[m,t]$ denotes the capacity of the $k$th V2V link over the $m$th sub-band in time slot $t$.
To this end, the resource allocation problem investigated in this work is formally stated as: to design the V2V spectrum allocation, expressed through the binary variables $\rho_{k}[m]$, $k\in\mathcal{K}$, $m\in\mathcal{M}$, together with the V2V transmission power control $P_{k}^{d}[m]$, such that the sum capacity of the V2I links is maximized while the V2V payload delivery probability defined in (7) is improved.
High mobility in a vehicular environment precludes collection of accurate full CSI at a central controller, hence making distributed V2V resource allocation preferable. Then, how to coordinate the actions of multiple V2V links such that they do not act selfishly in their own interests and compromise the performance of the system as a whole remains challenging. In addition, the packet delivery rate for V2V links, defined in (7), involves sequential decision making across multiple coherence time slots within the time constraint $T$, which is difficult to capture with conventional one-shot optimization. These observations motivate the multi-agent RL based approach developed in the next section.
Multi-Agent RL Based Resource Allocation
In the resource sharing scenario illustrated in Fig. 1, multiple V2V links attempt to access limited spectrum occupied by V2I links, which can be modeled as a multi-agent RL problem. Each V2V link acts as an agent and interacts with the unknown communication environment to gain experiences, which are then used to direct its own policy design. Multiple V2V agents collectively explore the environment and refine spectrum allocation and power control strategies based on their own observations of the environment state. While the resource sharing problem may appear to be a competitive game, we turn it into a fully cooperative one by using the same reward for all agents, in the interest of global network performance.
The proposed multi-agent RL based approach is divided into two phases, i.e., the learning (training) and the implementation phases. We focus on settings with centralized learning and distributed implementation. This means in the learning phase, the system performance-oriented reward is readily accessible to each individual V2V agent, which then adjusts its actions toward an optimal policy through updating its deep Q-network (DQN). In the implementation phase, each V2V agent receives local observations of the environment and then selects an action according to its trained DQN on a time scale on par with the small-scale channel fading. Key elements of the multi-agent RL based resource sharing design are described below in detail.
A. State and Observation Space
In the multi-agent RL formulation of the resource sharing problem, each V2V link $k$ acts as an agent, and everything beyond the agent itself, including the behaviors of all other V2V agents, is treated as part of the environment, as illustrated in Fig. 2.
The agent-environment interaction in multi-agent RL formulation of the investigated resource sharing in vehicular networks.
The true environment state, $S_{t}$, which characterizes the complete network status including global channel conditions and all agents' behaviors, is not directly observable by any individual V2V agent. Instead, each V2V agent $k$ acquires a local observation of the environment at each time step $t$, given by \begin{equation*} O(S_{t}, k) = \left \{{ B_{k}, T_{k}, \{I_{k}[m]\}_{m\in \mathcal {M}}, \{G_{k}[m]\}_{m\in \mathcal {M}}}\right \},\tag{8}\end{equation*} where $B_{k}$ and $T_{k}$ denote the remaining V2V payload and the remaining time budget of link $k$, respectively, $\{I_{k}[m]\}_{m\in\mathcal{M}}$ is the interference power received over each sub-band, and $\{G_{k}[m]\}_{m\in\mathcal{M}}$ collects the local channel information over each sub-band.
Independent Q-learning [32] is among the most popular methods to solve multi-agent RL problems, where each agent learns a decentralized policy based on its own action and observation, treating other agents as part of the environment. However, naively combining DQN with independent Q-learning is problematic since each agent would face a nonstationary environment while other agents are also learning to adjust their behaviors. The issue grows even more severe with experience replay, which is the key to the success of DQN, in that sampled experiences no longer reflect current dynamics and thus destabilize learning. To address this issue, we adopt the fingerprint-based method developed in [30]. The idea is that while the action-value function of an agent is nonstationary with other agents changing their behaviors over time, it can be made stationary conditioned on other agents’ policies. This means we can augment each agent’s observation space with an estimate of other agents’ policies to avoid nonstationarity, which is the essential idea of hyper Q-learning [33]. However, it is undesirable for the action-value function to include as input all parameters of other agents’ neural networks; instead, a low-dimensional fingerprint that tracks the trajectory of the other agents' policy changes suffices. Since each agent's policy evolves with the training iteration number $e$ and the probability of random exploration $\epsilon$, we include these two scalars in each agent's observation, so that the input to the Q-network of agent $k$ becomes \begin{equation*} Z_{t}^{(k)} = \left \{{O(S_{t}, k), e, \epsilon }\right \}.\tag{9}\end{equation*}
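For illustration, the sketch below assembles the augmented input of (8) and (9) into a flat vector suitable as DQN input; the shapes, normalization constants, and function names are assumptions rather than specifications from this work.

```python
import numpy as np

def build_dqn_input(remaining_payload, remaining_time, interference,
                    channel_info, train_iter, epsilon,
                    payload_scale, time_scale):
    """Assemble Z_t^(k) = {O(S_t, k), e, epsilon} per Eqs. (8) and (9).

    The local observation O(S_t, k) contains the remaining payload B_k, the
    remaining time budget T_k, per-sub-band interference, and per-sub-band
    channel information; the fingerprint (training iteration e, exploration
    rate epsilon) is appended to mitigate nonstationarity during training.
    Scaling constants are illustrative only.
    """
    obs = np.concatenate([
        [remaining_payload / payload_scale,     # B_k, normalized
         remaining_time / time_scale],          # T_k, normalized
        np.asarray(interference, dtype=float).ravel(),
        np.asarray(channel_info, dtype=float).ravel(),
        [float(train_iter), float(epsilon)],    # fingerprint (e, epsilon)
    ])
    return obs.astype(np.float32)
```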
B. Action Space
The resource sharing design of vehicular links comes down to the spectrum sub-band selection and transmission power control for V2V links. While the spectrum naturally breaks into $M$ disjoint sub-bands, we also restrict the V2V transmission power to a finite set of discrete levels so that the action space remains discrete and finite. Each action thus corresponds to a particular combination of one spectrum sub-band and one power level, and selecting an action jointly determines the spectrum and power allocation of the corresponding V2V link.
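As an illustration, a joint (sub-band, power level) choice can be encoded as a single integer index, as in the sketch below; the listed power levels are placeholders for a discrete set, not values prescribed here.

```python
# Minimal sketch of a discrete joint action space: each action index maps to one
# (sub-band, power level) pair. The power levels below are assumed examples.
POWER_LEVELS_DBM = [23, 10, 5, -100]   # -100 dBm effectively means "keep silent"

def num_actions(num_subbands):
    """Total number of discrete actions: sub-bands x power levels."""
    return num_subbands * len(POWER_LEVELS_DBM)

def decode_action(action_index, num_subbands):
    """Map a flat action index to (sub-band index, transmit power in dBm)."""
    subband = action_index % num_subbands
    power_dbm = POWER_LEVELS_DBM[action_index // num_subbands]
    return subband, power_dbm
```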
C. Reward Design
What makes RL particularly appealing for solving problems with hard-to-optimize objectives is the flexibility in its reward design. The system performance can be improved when the designed reward signal at each step correlates with the desired objective. In the investigated V2X spectrum sharing problem described in Section II, our objectives are twofold: maximizing the sum V2I capacity while increasing the success probability of V2V payload delivery within the time constraint $T$.
In response to the first objective, we simply include the instantaneous sum capacity of all V2I links, $\sum_{m} C_{m}^{c}[m,t]$, in the reward. In response to the second objective, for each V2V agent $k$ we set the per-step reward component $L_{k}(t)$ equal to its effective transmission rate until the payload is delivered, after which a constant reward is granted: \begin{equation*} L_{k}(t) = \begin{cases} \displaystyle \sum \limits _{m=1}^{M}\rho _{k}[m] C_{k}^{d}[m,t], & \text {if $B_{k} \ge 0$},\\ \beta, & \text {otherwise}, \end{cases}\tag{10}\end{equation*} where $B_{k}$ is the remaining payload of the $k$th V2V link, which decreases as delivery progresses and drops below zero once the payload has been fully delivered. The constant $\beta$ is chosen to be no smaller than the largest achievable per-step V2V rate so that finishing delivery early is always encouraged.
The goal of learning is to find an optimal policy that maximizes the expected return from each state, where the return $G_{t}$ is the cumulative discounted reward \begin{equation*} G_{t} = \sum \limits _{k=0}^{\infty } \gamma ^{k} R_{t+k+1},\quad 0 \le \gamma \le 1,\tag{11}\end{equation*} with $\gamma$ denoting the discount rate; the summation index here runs over future time steps and should not be confused with the V2V link index $k$.
In practice,
To this end, we set the reward at each time step $t$ as \begin{equation*} R_{t+1} = \lambda _{c} \sum \limits _{m} C_{m}^{c}[m,t] + \lambda _{d} \sum \limits _{k} L_{k}(t),\tag{12}\end{equation*} where $\lambda_{c}$ and $\lambda_{d}$ are positive weights that balance the V2I and V2V objectives.
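A minimal sketch of how the shared reward in (12), built from the per-link terms in (10), could be computed at each step is given below; the function and argument names are illustrative.

```python
import numpy as np

def step_reward(v2i_capacities, v2v_rates, remaining_payload,
                beta, lambda_c, lambda_d):
    """Shared reward R_{t+1} of Eq. (12), using the per-link V2V term of Eq. (10).

    v2i_capacities: per-V2I-link capacities C_m^c[m, t] at this step.
    v2v_rates: per-V2V-link achieved rates sum_m rho_k[m] C_k^d[m, t].
    remaining_payload: per-V2V-link remaining payload B_k (negative once done).
    beta, lambda_c, lambda_d: reward constant and weights (design choices).
    """
    v2v_rates = np.asarray(v2v_rates, dtype=float)
    remaining_payload = np.asarray(remaining_payload, dtype=float)
    # Eq. (10): effective rate while payload remains, constant beta after delivery.
    L = np.where(remaining_payload >= 0, v2v_rates, beta)
    return lambda_c * np.sum(v2i_capacities) + lambda_d * np.sum(L)
```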
D. Learning Algorithm
We focus on an episodic setting with each episode spanning the V2V payload delivery time constraint $T$, where each step within an episode corresponds to one channel coherence time slot.
1) Training Procedure:
We leverage deep Q-learning with experience replay [7] to train the multiple V2V agents for effective learning of spectrum access policies. Q-learning [34] is based on the concept of the action-value function, $q_{\pi}(s,a)$, defined as the expected return when starting from state $s$, taking action $a$, and thereafter following policy $\pi$: \begin{equation*} q_{\pi }(s,a) = \mathbb {E}_{\pi }\left [{G_{t} | S_{t} = s, A_{t} = a }\right].\tag{13}\end{equation*} In deep Q-learning, the action-value function is approximated by a deep neural network, i.e., the DQN, with parameters $\theta$.
Each V2V agent maintains its own DQN and a replay memory $\mathcal{D}$ that stores the transitions $\left(Z_{t}, A_{t}, R_{t+1}, Z_{t+1}\right)$ collected during training. At each training iteration, a mini-batch is sampled uniformly from $\mathcal{D}$ and the Q-network parameters $\theta$ are updated by minimizing the sum-squared error \begin{equation*} \sum \limits _{\mathcal {D}} \left [{R_{t+1} + \gamma \max \limits _{a'}Q(Z_{t+1}, a';\theta ^{-}) - Q(Z_{t}, A_{t}; \theta) }\right]^{2},\tag{14}\end{equation*} where $\theta^{-}$ denotes the parameters of a target Q-network that are periodically copied from the trained Q-network to stabilize learning.
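A minimal PyTorch-style sketch of the mini-batch loss in (14) is given below; the tensor layout and dictionary keys are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Sum-squared TD error of Eq. (14) over a sampled mini-batch.

    batch: dict of tensors with (illustrative) keys 'obs', 'action', 'reward',
    'next_obs'; q_net and target_net map observations to per-action Q-values,
    with target_net holding the periodically copied parameters theta^-.
    """
    q_all = q_net(batch["obs"])                                   # (B, num_actions)
    q_taken = q_all.gather(1, batch["action"].long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                         # no gradient through the target
        next_max = target_net(batch["next_obs"]).max(dim=1).values
        target = batch["reward"] + gamma * next_max
    return F.mse_loss(q_taken, target, reduction="sum")           # sum over the mini-batch
```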
Algorithm 1 Resource Sharing With Multi-Agent RL
Start environment simulator, generating vehicles and links
Initialize Q-networks for all agents randomly
for each episode do
    Update vehicle locations and large-scale fading
    Reset the remaining payload $B_{k} = B$ and the remaining time budget $T_{k} = T$ for all V2V agents
    for each step $t$ do
        for each V2V agent $k$ do
            Observe $Z_{t}^{(k)}$
            Choose action $A_{t}^{(k)}$ according to the $\epsilon$-greedy policy based on the current Q-network
        end for
        All agents take actions and receive the shared reward $R_{t+1}$
        Update channel small-scale fading
        for each V2V agent $k$ do
            Observe $Z_{t+1}^{(k)}$
            Store $\left(Z_{t}^{(k)}, A_{t}^{(k)}, R_{t+1}, Z_{t+1}^{(k)}\right)$ in the replay memory $\mathcal{D}_{k}$
        end for
    end for
    for each V2V agent $k$ do
        Uniformly sample mini-batches from the replay memory $\mathcal{D}_{k}$
        Optimize the error between the Q-network and the learning targets, defined in (14), using a variant of stochastic gradient descent
    end for
end for
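The following skeleton mirrors the structure of Algorithm 1; the `env` and `agents` objects are hypothetical stand-ins for the vehicular network simulator and the per-link DQN agents, and their method names are illustrative rather than part of this work.

```python
# Structural sketch of Algorithm 1; `env` and `agents` are hypothetical stand-ins
# for the vehicular-network simulator and the per-link DQN agents.
def train(env, agents, num_episodes, steps_per_episode, epsilon_schedule):
    for episode in range(num_episodes):
        env.update_vehicles_and_large_scale_fading()    # new topology / slow fading
        env.reset_payloads_and_time_budget()            # B_k = B, T_k = T
        eps = epsilon_schedule(episode)                 # exploration rate (fingerprint)
        for t in range(steps_per_episode):
            obs = [env.local_observation(k, episode, eps) for k in range(len(agents))]
            actions = [ag.act_epsilon_greedy(o, eps) for ag, o in zip(agents, obs)]
            reward = env.step(actions)                  # all agents act, shared reward
            env.update_small_scale_fading()
            next_obs = [env.local_observation(k, episode, eps) for k in range(len(agents))]
            for ag, o, a, o2 in zip(agents, obs, actions, next_obs):
                ag.replay.store(o, a, reward, o2)       # per-agent replay memory
        for ag in agents:                               # centralized training step
            ag.update_from_replay()                     # minimize Eq. (14) via SGD
```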
2) Distributed Implementation:
During the implementation phase, at each time step $t$, each V2V agent $k$ observes its local environment to form $Z_{t}^{(k)}$ and then selects the action with the largest value predicted by its trained Q-network, i.e., it chooses its spectrum sub-band and transmission power in a fully distributed manner based only on locally available information.
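A minimal sketch of this distributed decision step, assuming a trained PyTorch Q-network, is given below.

```python
import torch

def select_action(q_net, observation):
    """Distributed execution: pick the action with the highest Q-value predicted
    by the agent's own trained network, using only its local observation."""
    with torch.no_grad():
        obs = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
        q_values = q_net(obs).squeeze(0)                # one value per action
    return int(torch.argmax(q_values).item())
```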
Note that the computation intensive training procedure in Algorithm 1 can be performed offline for many episodes over different channel conditions and network topology changes while the inexpensive implementation procedure is executed online for network deployment. The trained DQNs for all agents only need to be updated when the environment characteristics have experienced significant changes, say, once a week or even a month, depending on environment dynamics and network performance requirements.
Simulation Results
In this section, simulation results are presented to validate the proposed multi-agent RL based resource sharing scheme for vehicular networks. We custom-built our simulator following the evaluation methodology for the urban case defined in Annex A of 3GPP TR 36.885 [3], which describes in detail vehicle drop models, densities, speeds, direction of movement, vehicular channels, V2V data traffic, etc. The
The DQN for each V2V agent consists of 3 fully connected hidden layers, containing 500, 250, and 120 neurons, respectively. The rectified linear unit (ReLU), $f(x) = \max(0, x)$, is adopted as the activation function for the hidden layers.
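A minimal PyTorch sketch of this network architecture is given below; the input and output dimensions are left as parameters since they depend on the scenario.

```python
import torch.nn as nn

def build_dqn(obs_dim, num_actions):
    """Q-network of each V2V agent: three fully connected hidden layers with
    500, 250, and 120 neurons and ReLU activations, mapping the observation
    (plus fingerprint) to one Q-value per (sub-band, power level) action."""
    return nn.Sequential(
        nn.Linear(obs_dim, 500), nn.ReLU(),
        nn.Linear(500, 250), nn.ReLU(),
        nn.Linear(250, 120), nn.ReLU(),
        nn.Linear(120, num_actions),
    )
```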
We compare in Figs. 3 and 4 the proposed multi-agent RL based resource sharing scheme, termed MARL, against the following two baseline methods that are executed in a distributed manner.
The single-agent RL based algorithm in [12], termed SARL, where at each moment only one V2V agent updates its action, i.e., spectrum sub-band selection and power control, based on locally acquired information and a trained DQN, while other agents' actions remain unchanged. A single DQN is shared across all V2V agents.
The random baseline, which chooses the spectrum sub-band and transmission power for each V2V link in a random fashion at each time step.
We further benchmark the proposed MARL method in Algorithm 1 against the theoretical performance upper bounds of the V2I and V2V links, derived from the following two idealistic (and extreme) schemes.
We disable the transmission of all V2V links to obtain the upper bound of V2I performance, hence the name upper bound without V2V. In this case, the packet delivery rates for all V2V links are exactly zero, thus not shown in Fig. 4.
We exclusively focus on improving V2V performance while ignoring the requirement of V2I links. Such an assumption breaks the sequential decision making of delivering $B$ bytes over multiple steps within the time constraint $T$ into separate optimization of sum V2V rates over each step. Then, we exhaustively search the action space of all $K$ V2V agents in each step to maximize sum V2V rates. Apart from the complexity due to exhaustive search, this scheme needs to be performed in a centralized way with accurate global CSI available, hence the name centralized maxV2V.
Fig. 3 shows the V2I performance with respect to increasing V2V payload sizes $B$.
Fig. 4 shows the success probability of V2V payload delivery against growing payload sizes $B$.
We also observe from Fig. 4 that the proposed MARL method achieves highly desirable V2V performance for the low payload cases and suffers from noticeable degradation when the payload size grows beyond
We show in Fig. 5 the cumulative rewards per training episode with increasing training iterations to study the convergence behavior of the proposed multi-agent RL method. From the figure, the cumulative rewards per episode improve as training continues, demonstrating the effectiveness of the proposed training algorithm. When the number of training episodes reaches approximately 2,000, the performance gradually converges despite some fluctuations due to mobility-induced channel fading in vehicular environments. Based on such an observation, we train each agent’s Q-network for 3,000 episodes when evaluating the performance of V2I and V2V links in Figs. 3 and 4, which provides a safe convergence margin.
Return for each training episode with increasing iterations. The V2V payload size is 2,120 bytes.
To understand why the proposed multi-agent RL based method achieves better performance compared with the random baseline, we select an episode in which the proposed method enables all V2V links to successfully deliver the payload of 2,120 bytes while the random baseline fails. We plot in Fig. 6 the change of the remaining V2V payload within the time constraint, i.e., the remaining payload $B_{k}$ of each V2V link at each step of the episode, under both schemes.
The change of the remaining V2V payload of the proposed MARL and the random baseline resource sharing schemes within the time constraint $T$.
In Fig. 7, we further show the instantaneous rates of all V2V links under the two different resource allocation schemes at each step in the same episode as Fig. 6. Several valuable observations can be made from comparing Figs. 7(a) and (b) that demonstrate the effectiveness of the proposed method in encouraging cooperation among multiple V2V agents. From Fig. 7(a), with the proposed method, V2V Link 4 gets very high transmission rates at the beginning to finish transmission early such that the good channel condition of this link is fully exploited and no interference will be generated toward other links at later stages of the episode. V2V Link 1 keeps low transmission rates at first such that the vulnerable V2V Links 2 and 3 can get relatively good transmission rates to deliver payload, and then jumps to high data rates to deliver its own data when Links 2 and 3 almost finish transmission. Moreover, a closer examination of the rates of Links 2 and 3 reveals that the two links figure out a clever strategy to take turns to transmit such that both of their payloads can be delivered quickly. To summarize, the proposed multi-agent RL based method learns to leverage good channels of some V2V links and meanwhile provides protection for those with bad channel conditions. The success probability of V2V payload transmission is thus significantly improved. In contrast, Fig. 7(b) shows that the random baseline method fails to provide such protection for vulnerable V2V links, leading to high probability of failed payload delivery for them.
Conclusion
In this paper, we have developed a distributed resource sharing scheme based on multi-agent RL for vehicular networks with multiple V2V links reusing the spectrum of V2I links. A fingerprint-based method has been exploited to address the nonstationarity issue of independent Q-learning for multi-agent RL problems when combined with DQN and experience replay. The proposed multi-agent RL based method is divided into a centralized training stage and a distributed implementation stage. We demonstrate that through such a mechanism, the proposed resource sharing scheme is effective in encouraging cooperation among V2V links to improve system level performance although decision making is performed locally at each V2V transmitter. Future work will include an in-depth analysis and comparison of the robustness of both single-agent and multi-agent RL based algorithms to gain better understanding of when the trained Q-networks need to be updated and how to efficiently perform such updates. Extension of the proposed multi-agent RL based resource allocation method to the multiple-input multiple-output (MIMO) and millimeter wave scenarios for vehicular communications is also an interesting direction worth further investigation.