Journals & Magazines >Intelligent and Converged Net... >Volume: 5 Issue: 4

A Reposition Algorithm for E-Hailing Based on Quantum Annealing and Intuitive Reasoning

Abstract:

Currently, the challenge lies in the traditional intelligent algorithm's ability to effectively address the e-hailing repositioning issue. Accurately identifying the unde...Show More

Metadata

Abstract:

Currently, the challenge lies in the traditional intelligent algorithm's ability to effectively address the e-hailing repositioning issue. Accurately identifying the underlying characteristics in extensive traffic data within a limited timeframe is difficult, ultimately preventing the achievement of the most optimal solution. This paper suggests a hybrid computing architecture involving reinforcement learning and quantum annealing based on intuitive reasoning. Intuitive reasoning aims to enhance performance in scenarios with poor system robustness, complex tasks, and diverse goals. A deep learning model is constructed, trained to extract scene features, and combined with expert knowledge, then transformed into a quantum annealable form. The final strategy is obtained using a D-wave quantum computer with quantum tunneling effect, which helps in finding optimal solutions by jumping out of local suboptimal solutions. Based on 400 000 real data, four algorithms are compared: minimum-cost flow, sequential markov decision process, hot-dot strategy, and driver-prefer strategy. The average total revenue increases by about 10% and vehicle utilization by about 15% in various scenarios. In summary, the proposed architecture effectively solves the e-hailing reposition problem, offering new directions for robust artificial intelligence in big data decision problems.

Published in: Intelligent and Converged Networks ( Volume: 5, Issue: 4, December 2024)

Page(s): 317 - 335

Date of Publication: December 2024

Electronic ISSN: 2708-6240

DOI: 10.23919/ICN.2024.0020

Figures are not available for this document.

Contents

SECTION 1

Introduction

The intuitive reasoning theory is an artificial intelligence computing framework based on neurobiology and cognitive science. It attempts to enable machines to have the ability to intuitively understand the physical world and make common sense reasoning, thereby achieving more robust and flexible artificial intelligence systems. Intuitive reasoning theory holds that human thinking predicts, classifies patterns, and learns based on historical experience and knowledge, and predictive ability is the core of human intelligence[1].

The swift advancement of mobile Internet and the proliferation of smartphones have led to the rapid emergence of e-hailing as a burgeoning transportation mode. This phenomenon not only caters to individual demands for customized and effective travel but also presents substantial challenges and prospects for urban transportation systems. The e-hailing platform is required to fulfill tasks like swift passenger-driver matching and route planning to enhance travel satisfaction and service effectiveness. The appropriate repositioning of e-hailing vehicles, implementation of effective route planning, enhancement of vehicle utilization, and service efficiency constitute a significant area of study. Concurrently, e-hailing platforms have amassed substantial volumes of order and trajectory data, which offer opportunities for thorough analysis and extraction of valuable insights to enhance e-hailing service strategies. This area also represents a focal point in contemporary research on intelligent transportation. Currently, the hot applications of e-hailing data mining include order allocation[2], demand forecasting[3], [4], and vehicle reposition[5], [6], etc.

The issue of e-hailing repositioning entails the strategic allocation of vehicles and route adjustments aiming at maximizing passenger satisfaction and optimizing the overall repositioning impact. To enhance vehicle utilization and boost total order revenue, drivers are assigned to routes with a high likelihood of passenger access, while also considering the supply-demand imbalance arising from driver competition for orders in popular areas.

Currently, conventional intelligent algorithms face limitations in both accuracy and time efficiency when addressing e-hailing reposition problems within the context of big data scenarios. The intricate nature of the global e-hailing reposition problem presents a multifaceted optimization challenge, demanding thorough consideration of various factors and constraints like passenger wait times, vehicle travel distances, and road congestion. Coping with vast data volumes places increased pressure on algorithm performance, necessitating the ability to process larger datasets in shorter time periods while ensuring precise decision-making. The prompt identification of optimal or near-optimal solutions may prove challenging for traditional optimization algorithms, underscoring the need for the integration of more sophisticated algorithms and technologies to tackle these complexities. Hence, the outcomes of the model pose a challenge in jumping out of local suboptimal solutions to attain global optimal solutions, in addition to being a time-intensive process.

To address these challenges, this paper proposes a hybrid computing architecture involving reinforcement learning and quantum annealing based on intuitive reasoning. Intuitive reasoning represents an advanced form of intelligence that combines existing knowledge or representations with algorithms and applications, offering robustness and interpretability. It effectively addresses the uncertainties, vulnerabilities, and openness issues typically encountered by conventional intelligent algorithms. E-hailing reposition entails navigating complex real-time traffic conditions, evolving customer demands, and variable road conditions. Intuitive reasoning swiftly integrates these diverse inputs to make informed decisions that adapt to dynamic circumstances. The presence of numerous uncertain factors in vehicle scheduling can significantly impact the efficacy of scheduling strategies. Leveraging its robustness, intuitive reasoning ensures system stability and reliability amid uncertainties and risks. Furthermore, intuitive reasoning not only delivers decision outcomes but also elucidates the underlying reasoning process and rationale. This transparency aids users in comprehending the operational logic of the system, facilitating necessary adjustments and enhancements.

This paper integrates reinforcement learning with quantum annealing under the guidance of intuitive reasoning as a theoretical framework. Reinforcement learning is particularly adept at tackling decision-making challenges within intricate environments. Through iterative exploration of actions in dynamic environments, reinforcement learning identifies reward signals, optimizing action, and value functions to develop effective behavioral strategies across diverse states. Thus, reinforcement learning exhibits robustness and adaptability when navigating complex environments.

E-hailing reposition problems typically entail numerous complex optimization variables and constraints. Traditional algorithms often struggle with large search spaces and the propensity for local optima, which impedes efficiency and the discovery of global optima. In contrast, quantum annealing leverages the quantum tunneling effect, enabling qubits to traverse energy barriers in a non-classical manner. This capability enhances the efficiency of navigating potential search space obstacles and facilitates the discovery of superior solutions. Quantum tunneling empowers quantum annealing to conduct more profound and extensive searches in large-scale optimization challenges, surpassing the limitations of local optima inherent in classical methods. This attribute is particularly advantageous for vehicle scheduling due to its involvement with intricate spatio-temporal constraints and dynamic environmental factors. Furthermore, the ability of quantum annealing to exploit quantum tunneling enhances solution speed, especially in managing substantial-scale issues. Its parallel computing and quantum parallelism further expedite the identification of viable solutions. As quantum computing technology continues to advance and mature, quantum annealing is poised to unlock greater potential and broader application in addressing vehicle scheduling problems, offering fresh avenues and innovative strategies for optimizing scheduling efficiencies.

By integrating intuitive reasoning, quantum annealing, and reinforcement learning, these approaches leverage their respective strengths to enhance the intelligence and optimization capabilities of vehicle scheduling systems. Intuitive reasoning aims at clear decision logic and comprehensible decision processes. Reinforcement learning continuously refines decision strategies through interaction with the environment, adjusting to dynamic needs and conditions to derive robust and effective reposition strategies. Quantum annealing contributes global optimization capabilities and rapid solutions to complex problems. The strategy derived from the reinforcement learning model, along with expert knowledge from various traffic scenarios, serves as multiple optimization objectives, which are sent to D-wave for quantum annealing for global optimization. This integrated framework not only enhances system decision-making efficiency and quality but also reduces operational costs and elevates service levels, paving the way for future advancements in intelligent traffic management systems.

In summary, studying the application of intuitive reasoning and quantum annealing in the field of e-hailing is beneficial for improving the overall efficiency of e-hailing operations and enhancing the quality of e-hailing from the perspective of e-hailing platforms. From the perspective of smart city management, it can alleviate urban road traffic congestion and promote the development of future smart transportation.

SECTION 2

Related Work

The objective of e-hailing repositioning is to strategically deploy idle e-hailing vehicles to areas with potential passengers, optimizing traffic flow, maximizing resource utilization, and balancing supply and demand between drivers and passengers. In order to better understand how to optimize the e-hailing reposition algorithm, some researchers have conducted in-depth research on e-hailing reposition. Research on supply-demand equilibrium was initially established by Yang et al.[7], who investigated the movement patterns of empty and occupied taxis and passenger behavior, and developed an equilibrium model and solution algorithm for urban taxi services. They analyzed factors such as congestion effects and demand elasticity, predicting metrics such as the relationship between supply and demand, customer waiting times, and taxi utilization rates. Building upon this foundation, Yang et al.[8] explored the impact of taxi fare structures and fleet sizes in their model construction. Wong et al.[9] expanded upon the taxi service model in congested networks by incorporating multiple user categories, taxi operation modes, and hierarchical travel choices. In 2019, Di et al.[10] introduced a comprehensive theoretical framework for transportation network equilibrium, unifying various shared mobility services and establishing a basis for modeling shared transportation services within congested urban road networks.

Markov Decision Process (MDP) was adept at capturing long-term rewards, as demonstrated by Rong et al.[11], who introduced an MDP approach to enhance income efficiency for taxi drivers. Their method incorporates considerations such as the influence of destinations on future passenger pickups and the impact of fuel costs on earnings. By learning MDP parameters from historical data across different time periods, the model suggests optimal directions for empty taxis. Verma et al.[12] and Gao et al.[13], have utilized reinforcement learning techniques to derive optimal strategies from real-world data. However, these approaches typically optimize revenue for individual vehicles rather than maximizing global revenue. Subsequent models focus on resolving multi-agent decision conflicts. Lin et al.[14] proposed a multi-agent deep reinforcement learning framework encompassing Deep Q-Network (DQN) and contextual multi-agent actor-critic methods. This framework aims to foster coordination and adaptive learning among a large fleet of taxi drivers under varying conditions. Oda et al.[15] introduced a DQN-based framework that updates policies through environmental interaction, though it assumes each vehicle independently learns its optimal strategy, thereby sacrificing some inter-vehicle coordination. Shou et al.[5] employed MDP to model driver behaviors, using a sequential Markov method to dynamically adjust order matching probabilities to achieve repositioning balance. Meanwhile, Xu et al.[6] proposed a three-stage repositioning framework, which includes ensuring driver satisfaction, promoting multi-driver collaboration, and predicting task success probabilities. Their approach involved generating candidate repositioning tasks, scoring tasks based on response rates and supply-demand ratios, and planning by transforming optimization into a Minimum Cost Flow (MCF) problem. Implemented in Didi Chuxing operational environment, their strategy resulted in a revenue increase of 2%.

SECTION 3

Theoretical Foundation

3.1 Intuitive Reasoning

Intuitive reasoning integrates existing knowledge, representations, algorithms, and applications to form a sophisticated intelligence characterized by robustness and interpretability. It addresses uncertainties, vulnerabilities, and openness challenges typically encountered by conventional intelligent algorithms, thereby approaching human-level intelligence. Intuitive reasoning operates on cognitive maps[16], cost maps, and decision searches, as illustrated in Fig. 1. During intuitive reasoning, decisions are made by searching through a cognitive map enriched with knowledge and experience. This search utilizes a cost map, where the decision minimizing losses are selected. The outcome updates the cognitive map, improving solutions for subsequent decisions.

The classic case of containing intuitive reasoning is AlphaGo[17]. AlphaGo integrates neural networks and reinforcement learning to develop value and policy networks, enabling sophisticated gameplay simulation. Initially, AlphaGo derives an initial strategy based on game scores, refining its strategy network through feedback garnered from self-play outcomes. In real-world applications, AlphaGo employs Monte Carlo tree search to explore multiple potential paths, subsequently selecting the optimal path after evaluating the value function and strategy network. This approach efficiently narrows down the search space within complex solution landscapes, achieving optimal solutions through iterative refinement[18].

Fig. 1

Intuitive reasoning based on cognitive maps.

Show All

The findings above underscore the efficacy of intuitive reasoning in tackling intricate problems. Intuitive reasoning engages with practical application scenarios, leveraging interaction outcomes to optimize strategies swiftly and enhance decision-making capabilities in complex environments. This process effectively reduces the solution space, enhances decision search efficiency, and enables adaptation to challenges in real-world settings. The e-hailing repositioning problem examined in this study is characterized by a complex traffic environment, large-scale issues, and diverse scenarios. Therefore, this study adopts intuitive reasoning as its guiding principle, proposing a hybrid architecture that integrates reinforcement learning and quantum annealing. This approach builds upon the cognitive map framework of intuitive reasoning and incorporates feedback mechanisms to update reward and punishment, aiming to optimize solution strategies effectively.

3.2 Reinforcement Learning for Intuitive Reasoning

Reinforcement learning is used to solve the problem of intelligent agents learning and optimizing decision choices based on the interaction results, ultimately maximizing the objective function. The deep reinforcement learning model can effectively deal with high-dimensional state space and complex decision problems. For intuitive reasoning, the process of reinforcement learning can be likened to the process of the human brain searching for the most profitable decision in a cognitive map and updating the cognitive map through feedback. In practical applications, the environment can be constructed through real data or platforms, and the model can be updated by simulating the interaction between vehicles and the environment. The results of the reinforcement learning model can be combined with the results of other machine learning algorithms and explicit traffic rules to form a cognitive map. Therefore, reinforcement learning can autonomously learn and optimize in complex environments, thereby solving more complex and realistic problems.

Reinforcement learning algorithms are divided into policy-based algorithm and value-function-based algorithm. The actor-critic algorithm combines the advantages of policy-based algorithm and value-function-based algorithm, allowing for updates based on single step states, actions, and rewards, resulting in fast learning speed. The advantage function can be used to reduce variance and bias. The advantage function is obtained by subtracting the state value function $V(s)$ from the state action value function $Q(s,a)$, thereby eliminating the influence of the state value function on gradient calculation and reducing variance and bias[19].

Based on real e-hailing order data collected from various scenarios, this study designs a rational action space and reward mechanism. It constructs an interactive environment tailored for e-hailing repositioning and simulates scenarios to optimize repositioning strategies through iterative interaction.

3.3 Quantum Annealing for Intuitive Reasoning in Solving Combination Optimization Problems

Human intuitive reasoning can be likened to navigating the solution space of a problem to find its global optimum. Traditional machine algorithms often encounter local minima when tackling complex problems, posing challenges in practical applications where exact solutions are not feasible within polynomial time. Current mainstream approaches resort to approximation or heuristic algorithms, yet these methods come with inherent limitations in their solution outcomes.

In response to these challenges, quantum computing emerges as a promising paradigm capable of overcoming classical computing constraints, particularly the tendency to settle for local optima. Quantum computing harnesses the superposition property of qubits to achieve quantum parallelism, thereby enhancing search efficiency and precision. It is anticipated that quantum computing will address longstanding technical hurdles in fields such as combinatorial optimization, quantum chemistry, cryptography, and machine learning. Key issues in quantum computing encompass the physical realization of quantum computers and the development of quantum algorithms

This paper uses the D-wave quantum computer based on quantum annealing algorithm to find the optimal solution of the problem by establishing interactions between quantum bits. Its latest quantum computing system, D-wave advantage, has over 5000 quantum bits and implements a quantum annealing algorithm based on the Ising model through superconducting circuit operation under absolute zero temperature conditions[20]. The D-wave quantum computer is shown in Fig. 2. The cost function of the Ising model is as follows: \begin{equation*}E_{\text{Ising}}(s_{1},s_{2}, \ldots,s_{N})=-\sum\limits_{i=1}^{N}h_{i}s_{i}+\sum\limits_{(i,j)\in E}J_{i,j}s_{i}s_{j},s_{i}=\pm 1\tag{1}\end{equation*}View Source where $h$ represents the external magnetic field, $s_{i}$ represents the $i-\text{th}$ Ising spin, $N$ represents the total number of spin variable, and the coupling constant $J$ represents the interaction between adjacent spins.

Fig. 2

D-wave quantum computer.

Show All

To solve combinatorial optimization problems by using D-wave quantum computers, the problem must be expressed as a Quadratic Unconstrained Binary Optimization (QUBO) form as the input to the quantum computer. By using formula $s_{i}=1-2z_{i}$, the Ising model can be transformed into an equivalent QUBO model, which is defined as follows: \begin{equation*}E_{\text{QUBO}} = \sum\limits_{i}Q_{ii}z_{i}+\sum\limits_{i < j}Q_{ij}z_{i}z_{j}=z^{\mathrm{T}}Qz\tag{2}\end{equation*}View Source where $z$ represents a binary variable vector of length $L$, and $Q$ is the coupling matrix of $L\times L$. The implementation process of quantum annealing for solving combinatorial optimization problems is to transform the research problem into a QUBO expression, minimize it through quantum annealing, and finally obtain the quantum annealing result $z_{i}$.

In contrast to traditional intelligent algorithms, quantum annealing leverages the quantum tunneling effect to escape local suboptimal solutions with greater likelihood, thereby approaching the global optimum more effectively. This approach is particularly well-suited for addressing large-scale combinatorial optimization problems and offers the potential for exponential acceleration in time complexity. At present, quantum annealing has been applied in fields such as public transportation ^[21]–[23], quantum chemistry[24], and cryptography[25]. Neukart et al.[26] successfully utilized quantum annealing to balance the taxi traffic in Beijing in 2017. In 2019, Volkswagen, in collaboration with D-wave, unveiled a traffic management system based on quantum annealing technology. The system is able to optimize the routes of nine public transit buses, demonstrating the potential of quantum computing to solve traffic congestion at the Lisbon Global Network Summit.

SECTION 4

Hybrid Computing Architecture of Reinforcement Learning and Quantum Annealing Guided By Intuitive Reasoning

While D-wave quantum computers have advanced rapidly, significant limitations remain regarding their applicability, size, and qubit connectivity. Handling real-world problems of considerable scale solely with quantum computers remains challenging. Therefore, a combined approach leveraging both classical and quantum computing strengths is often necessary for practical problem-solving. Classical computers excel at intricate logical operations, while quantum computers are adept at tackling large-scale combinatorial optimization problems. This paper proposes a hybrid computing architecture integrating reinforcement learning and quantum annealing guided by intuitive reasoning. Implementation and training of deep reinforcement learning models occur on classical computers. A QUBO model is constructed based on these models, considering constraints and optimization criteria specific to the e-hailing repositioning problem. The QUBO model undergoes quantum annealing on the D-wave quantum computer to derive the optimal solution.

Figure 3 illustrates the methodology employed in this study, which utilizes extensive real-world e-hailing order data across various scenarios to establish a reinforcement learning environment. The actor-critic algorithm is employed to learn localized repositioning strategies, which are subsequently integrated into the QUBO model for quantum annealing. Temporal Difference (TD) error is to quantify the difference between the predicted reward and the actual reward received over time. This model incorporates constraints and optimization objectives related to issues such as order acquisition challenges and traffic congestion arising from supply-demand imbalances in e-hailing repositioning scenarios. The outcome of quantum annealing yields a global repositioning strategy.

In essence, the framework of this paper is guided by intuitive reasoning, aiming to integrate models, algorithms, and applications into a sophisticated intelligence with common-sense cognitive abilities, robustness, and interpretability. This approach addresses limitations observed in traditional artificial intelligence algorithms, particularly their susceptibility to poor robustness, complexity, diverse objectives, and incomplete data samples. Guided by intuitive reasoning, this study introduces a deep reinforcement learning model that incorporates established traffic rules and experiential knowledge to construct a cognitive map characterized by robustness and interpretability. The decisions derived from this cognitive map are then integrated into quantum annealing, enhancing the capability to efficiently and probabilistically navigate beyond local suboptimal solutions towards achieving global optima in large-scale data scenarios. This framework not only applies to solving the e-hailing repositioning problem but also holds potential for addressing decision-making challenges in other extensive data contexts in the future.

Fig. 3

Hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning.

Show All

4.1 Description of e-Hailing Reposition Problem

The e-hailing reposition can be explained by five core parts, namely the region, time slot, dispatchable fleet, reposition center, and orders.

Definition 1

Regions $G$. Dividing $G$ into hexagonal grids with a centroid distance of 1.1 km (this parameter is given by the real data set of Didi Chuxing, where the hexagon side length is about 0.64 km, and the centroid distance between regions is 1.1 km), $G= \sum\limits_{k=1}^{n}g_{k}$, where $g_{k}$ represents each hexagonal grid in region $G$. Each reposition is based on a hexagonal grid, and the reposition strategy can choose to remain within the current grid or to be repositioned to adjacent six grids.

Definition 2

Time slots. The reposition step is triggered every five minutes, and vehicles and requests are matched once every five minutes (the order duration in the Didi Chuxing dataset is calculated in units of 5 min each).

Definition 3

Dispatchable fleet. Assuming that the set of idle vehicles in the current time slot is $I$, and the set of vehicles currently in the order process is $O$. In region $G, U$ vehicles in the dispatchable fleet $X$ are in a dispatchable state, which can be expressed as follows: \begin{equation*}X=\{x_{u}^{t},u\in\{1,2,\ldots,U\},\ t\in\{0,1,\ldots,287\},x_{u}^{t}\in I\}\tag{3}\end{equation*}View Source where $x_{u}^{t}$ represents the vehichle $u$ at time slot $t$.

Definition 4

Reposition center. The reposition center provides reposition strategies for dispatchable fleet at each time slot based on the model results.

Definition 5

Orders $\varOmega$. Assuming there is a collection of $O$ orders in the region $G$, which can be expressed as follows: \begin{equation*}\varOmega=\{w_{o}^{t},o\in\{1,2,\ldots,O\},\ t\in\{0,1,\ldots,287\}\}\tag{4}\end{equation*}View Source where each order $w_{o}^{t}=(s,d,r)$ in the time slot $t$ contains three pieces of information, where $o$ represents the order number, $s$ represents the starting pick-up location of the order, $d$ represents the passenger disembarking location, and $r$ represents the revenue from completing this order. The order matching rules include: In the same time slot, the prerequisite for successful matching is that the vehicle and the starting address of the order are in the same hexagonal grids; if an order is not matched to a vehicle within a time slot after it is initiated, the order request will be rejected by the reposition center.

The reposition scenario can be visualized as shown in Fig. 4.

4.2 Design of the Reinforcement Learning Model in e-Hailing Reposition Problem

State $\boldsymbol{S}$: $S=\{s_{t}^{u},t\in\{0,1, \ldots,287\},w\in I\}$ where $s_{t}^{u}$ represents the state of vehicle $u$ at time slot $t$.

Action $\boldsymbol{A}$: The actions that each vehicle can choose from include heading to the six surrounding grids and staying within this current area. The action space $A$ can be expressed as follows: \begin{equation*}A=\{a_{t}^{u},t\in\{0,1,\ldots,287\},w\in I\}, \vert a_{t}^{u}\vert \leqslant 7\tag{5}\end{equation*}View Source where $a_{t}^{u}$ represents the actions that vehicle $u$ can take during the time slot $t$. When the surrounding six grids are within the region, it is equal to 7. If they are not within the region, they need to be excluded.

Reward $\boldsymbol{R}$: When the vehicle is repositioned to grid $g$, the reward is the weighted average of the reward of all orders in the current time slot of the current grid, $R_{t}^{g}= \sum\limits_{O}\gamma^{I}r/n$, discount factor $\gamma\in[0,\ 1],\ I$ represents the number of time slots required from the current grid to the destination and $n$ is the total number of orders in the current region.

Fig. 4

Diagram of the reposition scenario.

Show All

4.3 Construction and Training of Reinforcement Learning Model in e-Hailing Reposition Problem

This section establishes a deep neural network model based on the actor-critic algorithm to implement intuitive reasoning theory. The implementation of the actor-critic algorithm consists of the actor policy network and the critical value network. The actor policy network approximates policies through the policy function $\pi_{\theta}(a\vert s)$, responsible for generating actions and interacting with the environment. The critical value network estimates the value function of the current policy using the value function $V_{\pi}(s)$, evaluates the actions generated by the actor, and guides the actor to the next stage of actions.

The actor policy network introduces an advantage function. The advantage function is to optimize strategies, helping agents to have a clearer understanding of which actions are advantageous in the current state, while also improving learning efficiency, reducing variance, and making learning more stable. The advantage function can be approximated by the temporal difference $\text{td}_{\text{error}}$, which can be obtained from the Bellman equation. Therefore, the advantage function can be expressed as \begin{gather*}A(s_{t}, a_{t})=\text{td}_{\text{error}}\tag{6}\\ \text{td}_{\text{error}}= r_{t}+\gamma V(s_{t+1}; \theta_{\mathrm{c}}^{\prime})-V(s_{t}; \theta_{\mathrm{c}})\tag{7}\end{gather*}View Source where $\theta_{\mathrm{c}}$ is the parameter of actor-critic value network, $\theta_{\mathrm{c}}^{\prime}$ is the copy of $\theta_{\mathrm{c}}$ and it is updated every $E$ episodes to ensure the stability of the training process, which $E$ represents the number of training iterations. the gradient of actor policy network can be expressed as follows: \begin{equation*}\nabla_{\theta_{\mathrm{a}}}J(\theta_{\mathrm{a}})=\nabla_{\theta_{\mathrm{a}}}\ln\pi(a_{t}\vert s_{t})A(s_{t},a_{t})\tag{8}\end{equation*}View Source where $\theta_{\mathrm{a}}$ is the parameter of actor network and the gradient update formula for actor network can be expressed as \begin{equation*}\theta_{\mathrm{a}}\leftarrow\theta_{\mathrm{a}}+\alpha_{\mathrm{a}}\nabla_{\theta_{\mathrm{a}}}J(\theta_{\mathrm{a}})\tag{9}\end{equation*}View Source where $\alpha_{\mathrm{a}}$ is the learning rate of actor policy network.

The input of the actor policy network consists of two parts, as shown in Fig. 5. The first part is the agent state, which includes the current grid and the current time slot of the agent, and the second part is the supply-demand of the current grid. Rectified Linear Unit (ReLU) in Fig. 5 is a commonly used activation function. The output of the actor policy network is $\pi(s_{t},a_{t})$, which is the policy function.

Actor-critic value network takes the square of the temporal difference as the loss function and it can be expressed as follows: \begin{equation*}L(\theta_{\mathrm{c}})=\text{td}_{\text{error}}^{2}\tag{10}\end{equation*}View Source where discount factor $\gamma\in[0,1]$, gradient update formula can be expressed as follows: \begin{equation*}\theta_{\mathrm{c}}\leftarrow\theta_{\mathrm{c}}+\alpha_{\mathrm{c}}\nabla_{\theta_{\mathrm{c}}}L(\theta_{\mathrm{c}})\tag{11}\end{equation*}View Source where $\alpha_{\mathrm{c}}$ is the learning rate of actor-critic value network.

The input of the actor-critic value network consists of two parts, as shown in Fig. 5, which is the same as the input of the actor policy network. The output of the actor-critic value network is $V(s_{t},\theta_{\mathrm{c}})$, which represents the current state value.

The construction and training of the neural network model for the actor-critic algorithm are carried out on the MindSpore platform. MindSpore supports an end-to-end development process, which can complete the entire Artificial Intelligence (AI) application development and deployment process from data processing, model construction, training and inference to deployment on the MindSpore platform. Deep learning based on the MindSpore framework has strong spatiotemporal feature learning ability and strong data fitting ability.

Fig. 5

Policy and value network.

Show All

The training of the network firstly initializes a memory library $M$ to store state transition tuples, including the current state $s_{t}$, current action $a_{t}$, reward $r_{t}$, and next state $s_{t+1}$, as well as the network parameters of the actor policy network $\theta_{\mathrm{a}}$ and the actor-critic value network $\theta_{\mathrm{c}}$. During training process, passenger requests are received based on real order data. actor policy network within each time slot based on $\varepsilon$ -greedy strategy generates the current action and the next state, which are stored in the memory library $M$ Then, a portion of data is randomly extracted from memory library $M$ to calculate temporal errors, loss functions, and gradients for updating the actor-critic networks. In order to ensure the full utilization of training data and reduce data correlation, offline experience replay is used for parameter updates. The specific training process is shown in Algorithm 1.

Algorithm 1 Training process

Input: Current state $s_{t}$, supply-demand sd

Output: The output of actor policy network is $\pi(s_{t},a_{t})$, and the output of actor-critic network is $V(s_{t})$

Initialize the state of e-hailing, parameters $\theta_{\mathrm{a}}$ and $\theta_{\mathrm{c}}$ of the network being random values

for $i=1$ to episodes do

for $t=1$ to $\vert T\vert$ do

Input the ride hailing status to the actor policy network and generate a storage state transition $(s_{t},a_{t},r_{t},s_{t+1})$ tuple to $M$

end for

Randomly select small batch samples from memory library $M(s_{t},a_{t},r_{t},s_{t+1})$

for each $(s_{t},a_{t},r_{t},s_{t+1})$ do

Calculate $\text{td}_{\text{error}}=r_{t}+\gamma V(s_{t+1};\theta_{\mathrm{c}}^{\prime})-V(s_{t};\theta_{\mathrm{c}})$

Calculate loss function

10:

end for

11:

Batch update $\theta_{\mathrm{c}}$

12:

Batch update $\theta_{\mathrm{a}}$

13:

if $i\ \%\ E=0$:

14:

update $\theta_{\mathrm{c}}^{\prime}\leftarrow\theta_{\mathrm{c}}$

15:

end if

16:

end for

Note: The variable $T$ in Algorithm 1 means the maximum time slots. % means modulo in mathematics.

4.4 Design of Quantum Annealing Model

This section focuses on utilizing quantum annealing to formulate a global repositioning strategy based on local strategies generated by a reinforcement learning model. In the context of the e-hailing repositioning problem, each driver is assigned a comparable repositioning score within the same geographical area, emphasizing the collective experience of all e-hailing vehicles. While using the state value function derived from the reinforcement learning model suffices for repositioning individual vehicles, it tends to attract all vehicles towards states with high value, resulting in congestion and intensified competition for orders from a global perspective. To mitigate these challenges, this study integrates the outcomes of the reinforcement learning model into QUBO model. This approach balances the distribution of vehicles and state values, aiming to maximize both the overall vehicle occupancy rate and total revenue on a global scale.

Quantum annealing necessitates the initial conversion of the problem into a QUBO model suitable for quantum annealing. In addressing the constraints of the e-hailing repositioning problem, each vehicle is limited to selecting only one repositioning strategy at a time, with the overarching objective being the maximization of global expected revenue and vehicle occupancy rates. Consequently, the QUBO model developed in this study comprises three main components: constraints ensuring each vehicle adheres to a single strategy, optimization for mitigating vehicle congestion, and prioritization of high state values.

Vehicle single strategy constraint means that each e-hailing vehicle should choose one from the given reposition strategy, which can improve satisfaction of passengers. In the vehicle single strategy constraint, a binary variable $q_{ug}$ is defined to represent whether the vehicle $u$ is repositioned to region $g$, where $q_{ug}=1$ means accepting the reposition, and $q_{ug}=0$ means rejecting the reposition. The vehicle single strategy constraint can be expressed as \begin{equation*}\text{Constraint}_{\text{single}}(u)=\left(\sum\limits_{g=1}^{n}q_{ug}-1\right)^{2}=0\tag{12}\end{equation*}View Source where $\sum\limits_{g=1}^{n}q_{ug}$ represents the sum of reposition strategy of vehicle $u$, If vehicle $u$ only chooses one reposition strategy, then this constraint function should be zero. Expanding and simplifying this equation can be expressed as follows: \begin{align*}& 2 q_{u1} q_{u2}+2 q_{u3}(q_{u1}+ q_{u2})+\cdots+\\ &\quad 2 q_{un}(q_{u1}+ q_{u2}+\cdots+ q_{u(n-1)})-\\ &\quad (q_{u1}+ q_{u2}+\cdots+ q_{un})+1=0\tag{13}\end{align*}View Source where vehicle congestion optimization refers to avoiding the supply-demand imbalance caused by all dispatchable vehicles being dispatched to the same region in the same time slot due to the high state value. Therefore, in order to limit the number of drivers dispatched to region $g$, the following expression can be obtained: \begin{equation*}\text{Cost}_{\text{congestion}}(g)=\left(\sum\limits_{u=1}^{U}q_{ug}\right)^{2}\tag{14}\end{equation*}View Source where $q_{ug}$ represents that the reposition strategy of vehicle $u$ is to go to region $g$, and $\sum\limits_{u=1}^{U}q_{ug}$ represents the number of vehicles in grid $g$. Expanding and simplifying this equation, we can get the expression as follows: \begin{align*}& \text{Cost}_{\text{congestion}}(g)=2 q_{1g} q_{2g}+2 q_{3g}(q_{1g}+ q_{2g})+\cdots+\\ &\quad 2(q_{1g}+q_{2g}+\cdots+q_{(U-1)g})+q_{1g}+q_{2g}+\cdots+q_{Ug}\tag{15}\end{align*}View Source

High state value priority means that the reposition strategy of the dispatch vehicle in the current time slot will tend to go to the reposition area with more orders and high value. The reinforcement learning model established before is trained according to the real order data in the past. In the actual reposition scenario, the actor-critic model will choose the highest state value among all optional reposition strategies according to the current grid and time slot of the vehicle. Therefore, the QUBO expression of high state value priority can be obtained as follows: \begin{equation*}\text{Cost}_{\text{value}}(u)=-\sum\limits_{g=1}^{k}C(s)U(s)q_{ug}\tag{16}\end{equation*}View Source where $C(s)$ represents the vehicle capacity of state $s$, which is obtained from the previous real order data, $V(s)$ represents the state value of state $s$, which is the result of the trained reinforcement learning model, and $V(s)$ is scaled to be in the same order of magnitude as other optimization objectives. Since quantum annealing is to find the minimum value of the QUBO expression, and the final goal is to maximize $\text{Cost}_{\text{value}}$, a negative sign is needed before this expression.

The form of the QUBO expression is obtained by integrating the vehicle single strategy constraint, vehicle congestion optimization, and high state value priority: \begin{align*}& \text{Object}_{\text{total}}= \lambda_{1}\text{Constraint}_{\text{single}}+ \lambda_{2} \text{Cost}_{\text{congestion}}+\\ &\quad \lambda_{3}\text{Cost}_{\text{value}}= \lambda_{1} \sum\limits_{u=1}^{U}\left(\sum\limits_{g=1}^{n} q_{ug}-1\right)^{2}+\\ &\quad \lambda_{2} \sum\limits_{g=1}^{n}\left(\sum\limits_{u=1}^{U} q_{ug}\right)^{2} - \lambda_{3} \sum\limits_{u=1}^{U} \sum\limits_{g=1}^{n}C(s)U(s) q_{ug}\tag{17}\end{align*}View Source

In Eq. (17), $\lambda_{1},\ \lambda_{2}$, and $\lambda_{3}$ are Lagrange multipliers, which determine the weight relationship between the vehicle single strategy constraint, vehicle congestion optimization, and high state value priority. Since the vehicle single strategy constraint should be satisfied, the optimization of vehicle congestion and high state value cannot cause the vehicle single strategy constraint to be unsatisfied. Therefore, this paper sets $\lambda_{1}$ to 10, $\lambda_{2}$ to 1, and $\lambda_{3}$ to 1. The final result obtained by quantum annealing can be expressed as \begin{equation*}\{q_{ug},u\in\{0,1,\ldots,n\},\ g\in\{0,1,\ldots,V\}\}\tag{18}\end{equation*}View Source

The above formula can obtain the final reposition result of each vehicle, that is, whether vehicle $u$ is repositioned to grid $g,\ q_{ug}=1$ means adopting this reposition strategy, and $q_{ug}=0$ means not adopting this reposition strategy.

This section introduces the results obtained by the reinforcement learning model into the quantum annealing QUBO expression, and completes the application implementation of the intuition reasoning theory in the e-hailing reposition problem. The actor-critic value model in the reinforcement learning model learns the characteristics of the real order data. Benefiting from the quantum tunneling effect of quantum annealing, the global reposition strategy can efficiently be optimized. Based on quantum annealing, vehicle single strategy constraint, vehicle congestion optimization, and high state value priority participate into the optimization of the global reposition strategy.

SECTION 5

Experimental Result and Analysis

This section states the process of performing experiment based on the real order of online vehicles in Chengdu, conducting experimental analysis, verifying the robustness and effectiveness of the model. The simulation environment runs on Python 3.7 environment, the system is Windows 11, the CPU is 12th Gen Intel® Core™ i7-12700H, except for quantum annealing, and quantum annealing is implemented by calling the remote interface of D-wave Leap platform. The training of neural network is based on MindSpore platform. The architecture of this paper is compared with four repositioning algorithms: MCF, Sequential Markov Decision Processes (S-MDPs), hot-dot and diver-prefer to prove the effectiveness of the architecture in solving the e-hailing reposition problem.

5.1 Experimental Data

The experimental dataset of this paper is the online vehicles in the local area of Chengdu Second Ring. This paper mainly uses the order data in this dataset to train and simulate the model, and constructs the comparative experiments with travel order data and the probability data of order cancellation.

The order data is in units of days, which contains about 20 million orders per day on average from November 1st to November 30th, 2016. The specific fields of the order data are shown in Table 1.

5.2 Simulation Environment Construction

This paper selects the order data of e-hailing vehicles on November 1st (Tuesday) and November 5th (Saturday) for constructing the simulation environment. November 1st is taken as the representative scenario of working days, and November 5th is taken as the representative scenario of rest days.

In the data preparation phase of the simulation environment construction, the first step is to divide all order data into time slots of five minutes each, totaling 288 time slots. The second step is to map the latitude and longitude of the order data to the grids.

In Algorithm 2, experiments are carried out to verify the effect of each algorithm by simulating the real vehicle repositioning scene. $N$ vehicles are randomly initialized to the grids, and the order data of the current time slot is taken as the current passenger demand. If in the current time slot, the starting area of the vehicle and the order data is the same, it will immediately match successfully from the empty car state to the order processing state. If it is not the same, the match will fail and go to the next grid according to the reposition result. The failed orders will not appear in the next time slot.

Table 1 Travel order data of didi chuxing dataset.

Algorithm 2 Simulation process

Input: Order information, $T=287$

Output: Cumulative reward and average vehicle occupancy rate

Enter order information

Initialize two tables: “idle” for vehicles in idle state, “ontrip” for vehicles processing orders

Randomly initialize the positions of $N$ vehicles and add them to the idle table

for $t=0$ to $T$ do

Add vehicles that have completed orders in ontrip to the idle table

Select all orders at scheduling slot $t$

for each $e\in$ idle do

Match an order for this vehicle

if $e$ matches an order:

10:

Remove from the “idle” table and add to the ontrip table

11:

Add the reward of this order to reward

12:

end for

13:

Invoke the text model to generate the reposition strategy for the next time slot

14:

Calculate the vehicle occupancy rate of the current time slot and add to the vehicle occupancy rate table

15:

end for

16:

Obtain the cumulative reward and calculate the average of occupancy rate from the occupancy table

5.3 Comparison Algorithms and Evaluation Metrics

5.3.1 Introduction To Comparison Algorithms

Ever since the advent of e-hailing platforms, numerous studies have been conducted on e-hailing reposition problem. This paper presents a comparative analysis of two recently developed, high-performing e-hailing reposition solutions: MCF and S-MDPs. Additionally, two fundamental strategies are considered: hot-dot and driver-prefer. These are compared with a repositioning strategy proposed in this paper. The following statement provides an introduction to these five strategies:

MCF[6]: The MCF strategy transforms the e-hailing reposition problem into a graph network, where each node represents a dispatchable area, each edge represents a route between areas, and each node and edge has a capacity and a cost, representing the capacity limit and revenue of that area. Then, the minimum cost flow solver is used to find a set of optimal reposition tasks, i.e., to reposition idle drivers from areas where supply exceeds demand to areas where demand exceeds supply, maximizing total flow and total revenue. In 2020, Xu et al.[6] used the MCF strategy to reposition e-hailing vehicles and successfully implemented it in Didi online production environment.
S-MDPs: In 2020, Shou et al.[5] used the learning of value functions or policy functions to evaluate or optimize the expected return of an intelligent agent taking different actions in different states. This was applied to the multi-agent decision-making process to dynamically change the matching probability and achieve optimal sequential decisions.
Hot-dot[11]: This refers to all vehicles choosing the area with the highest state value without considering the mutual influence between dispatch vehicles globally.
Driver-prefer: This strategy involves statistical analysis of driver reposition data, reflecting the reposition situation of e-hailing service in real traffic scenarios.
Quantum Annealing (QA): This paper proposes a hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning.

5.3.2 Metric for Measuring Experimental Results

Total revenue of all vehicles: The cumulative sum of order revenues for all vehicles, reflecting the profitability of the algorithm. This metric is from the driver's perspective, as drivers tend to maximize revenue. The total revenue is the sum of the revenues of orders matched in each dispatch slot.
Occupancy rate of all vehicles: The average occupancy rate of e-hailing vehicles carrying passengers in each time slot, reflecting the ability to utilize vehicles. This metric is from the perspective of the reposition platform, as the reposition platform tends to reposition all vehicles to meet more passengers' needs.

5.4 Experimental Result and Analysis

In order to reflect the generalization ability of the architecture, this paper selects both large and small experimental backgrounds and two experimental scenarios of rest days and working days, and compared the algorithm in this paper with four algorithms of hot-dot, Driver-prefer, S-MDPs, and MCF from two metrics of total revenue and vehicle occupancy rate in two experimental scenarios of working day and rest day. In the experimental scene of small area and small data set, $35\ \text{km}^{2}$ around Chengdu city center is used as the experimental grids, and 100 e-hailing vehicles are initialized and randomly assigned to the grids for repositioning. In the experiment scenario of large area big data set, $84.5\ \text{km}^{2}$ around Chengdu city center is used as the experimental grids, 200, 300, and 400 e-hailing vehicles are respectively initialized on weekdays 300, 400, and 500 e-hailing vehicles are randomly assigned to the grids on rest days. The experiment started from 0 time slot. The experimental result on weekdays in small experimental background is showed in Table 2 and in large experimental background is showed in Table 3. The variation trends of vehicle utilization across different time slots for each algorithm on weekdays are shown in Fig. 6.

It can be seen from the experimental results that in the small data scenario, compared with the other four algorithms, the total profit of this architecture increases by 34% and the vehicle occupancy rate increases by 30%. Compared with the total profit of MCF and S-MDPs, the total profit increases by 12% and the vehicle occupancy rate increases by 2%. In the big data scenario, compared with the other four algorithms, the proposed architecture increases by 31% in terms of total profit and 28% in the growth rate of vehicle occupancy. Compared with the total profit of MCF and S-MDPs, it increases by 12% and vehicle occupancy increases by 6%. Based on the analysis of vehicle occupancy rates in various time periods, the advantages of the architecture of this paper are obvious whether it is in the time period with high demand for taxi services such as the morning and evening rush hour, or in the early morning when the demand is not high but it is easy to have difficulties in calling e-hailing services.

The experimental result at weekends in small experimental background is showed in Table 4 and in large experimental background is showed in Table 5. The variation trends of vehicle utilization across different time slots for each algorithm at weekends are shown in Fig. 7.

$Fig. 6 - Vehicle utilization rate of each algorithm with a region size of $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$ and $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$ and the vehicle count of 100 (corresponding to $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$), 200, 300, and 400 (corresponding to $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$) at different time slots.$

Fig. 6

Vehicle utilization rate of each algorithm with a region size of $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$ and $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$ and the vehicle count of 100 (corresponding to $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$), 200, 300, and 400 (corresponding to $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$) at different time slots.

Show All

Table 2 Total revenue and average vehicle occupancy rate in small data volume scenarios on weekdays.

Table 3 Total revenue and average vehicle occupancy rate in big data volume scenarios on weekdays.

Table 4 Total revenue and average vehicle occupancy rate in small data volume scenarios at weekends.

In the rest day scenario, it can be seen from the experimental results that in the small data scenario, compared with the other four algorithms, the total profit of this architecture increases by 7% and the vehicle occupancy rate increases by 5%. Compared with the total profit of MCF and S-MDPs, the vehicle occupancy rate increases by 2%. In the big data scenario, compared with the other four algorithms, the architecture of this paper increases by 10% in terms of total profit and 9% in the growth rate of vehicle occupancy. Compared with the total profit of MCF and S-MDPs, it increases by 10% and vehicle occupancy increases by 2%. Based on the analysis of vehicle occupancy rates in various time periods, the advantages of the architecture of this paper are obvious whether it is in the period of high demand for e-hailing services such as morning and evening rush hour, or in the case of low demand but easy to have difficulties in e-hailing services in the early morning.

$Fig. 7 - Vehicle utilization rate of each algorithm with a region size of $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$ and $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$ and the vehicle count of 100 (corresponding to $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$), 300,400, and 500 (corresponding to $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$) at different time slots.$

Fig. 7

Vehicle utilization rate of each algorithm with a region size of $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$ and $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$ and the vehicle count of 100 (corresponding to $\mathbf{35}\ \mathbf{km}^{\mathbf{2}}$), 300,400, and 500 (corresponding to $\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}$) at different time slots.

Show All

Table 5 Total revenue and average vehicle occupancy rate in big data volume scenarios at weekends.

From the experimental results, it can be drawn that the QA algorithm is the optimal strategy according to the total revenue and the occupancy rate. Hot-dot has the worst performance in each experimental scenario because the hot-dot strategy does not optimize the vehicle aggregation problem, which will cause regional congestion and low order acceptance rate. The driver-prefer strategy simulates the real reposition strategy of drivers according to the dataset. In the real scenario, drivers will consider more factors. However, due to the limitations of individual information, the performance is better than hot-dot but inferior to the other three intelligent algorithms. Compared with driver-prefer, the other three intelligent algorithms all have a reinforcement learning actor-critic model to participate in the decision, but the global strategy after is different. MCF algorithm has high time complexity and poor performance in big data scenarios. S-MDPs statistics the supply and demand situation and performs a linear fitting, so it can capture the impact of supply and demand ratio on probability matching in the e-hailing reposition process, so as to guide the e-hailing to the region with high matching probability. However, S-MDPs usually requires a large amount of data and computing resources, especially when the state space and action space are large. Therefore, it is difficult to run efficiently on large-scale and complex problems, and there are problems of local optimum and overfitting. QA algorithm can find the optimal solution efficiently in the huge solution space due to its unique quantum tunneling effect. It can be seen that the more vehicles are repositioned, the more obvious the advantage is. Therefore, it has obvious advantages in application scenarios with severe vehicle conflicts.

Since the e-hailing reposition algorithm has certain requirements for real-time performance, the average running time of the algorithm in four scenario, respectively $35\ \text{km}^{2}$ for 100 vehicles, $84.5\ \text{km}^{2}$ for 300 vehicles, $84.5\ \text{km}^{2}$ for 400 vehicles, and $84.5\ \text{km}^{2}$ for 500 vehicles are recorded as Table 6.

The following analysis can be drawn from Table 6. The number of edges and points in MCF algorithm will increase with the increase of the number of regions and vehicles, and the time complexity is related to its product. In the S-MDPs algorithm, the running time will change dynamically as the vehicle successfully receives the order, and in general, the running time shows a linear relationship with the number of vehicles. Hot-dot and driver-prefer algorithms have simple logic and the running speed varies little with the scene, so the average running time is shorter. The quantum annealing operation speed is relatively stable, due to the need to call the D-wave real quantum computer remote interface, the time is slightly slower, but for the five-minute reposition task, the running time of three seconds meets the real-time requirements.

In addition to the architecture proposed in this paper, other comparative methods in the experiment rely on traditional computers. Based on the experimental results, it is evident that the scheduling outcomes achieved using the D-wave quantum computer surpass those obtained through traditional methods in terms of total revenue and vehicle utilization across both small and large geographical areas. Moreover, traditional computers exhibit highly variable repositioning times, which escalate significantly with larger datasets and increased repositioning volumes. In contrast, quantum computers maintain relatively stable repositioning efficiency and show potential for high efficiency as data scales increase.

Table 6 Average algorithm running time for e-hailing reposition in four scenarios.

SECTION 6

Conclusion

In recent years, the e-hailing industry has risen rapidly and become one of the important fields of the sharing economy. At present, the traditional intelligent algorithms for solving the e-hailing reposition problem have limitations in terms of solution accuracy and time consumption in big data scenarios. As the global e-hailing reposition problem poses a complex optimization challenge, it necessitates consideration of numerous factors and constraints, such as passenger waiting times, vehicle travel distances, road congestion, among others. The processing of vast amounts of data imposes heightened demands on algorithm performance, requiring algorithms to handle more data within shorter timeframes while making more precise decisions. So the model results are not only difficult to escape from the local suboptimal solution to reach the global optimal solution, but also time-consuming.

Therefore, this paper proposes a hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning. The main contributions of this paper can be summed up as follows: In theory, this paper proposes to take intuitive reasoning as the core idea to solve e-hailing reposition problem. At present, traditional artificial intelligence generally has a good implementation effect under strongly supervised learning, differentiable and closed static systems, but there is little suitable solution for poor system robustness, complex tasks, diverse goals, and imperfect sample data. However, intuitive reasoning fits the requirements of a new generation of artificial intelligence, which aims to correlate models, algorithms, and applications to form advanced intelligence with common sense cognition, strong robustness, and interpretability. Therefore, in some complex scenarios such as traffic, the adaptability and robustness of traditional artificial intelligence algorithms are not good, and the introduction of intuitive reasoning ideas can make up for this deficiency.

Based on the intuitive reasoning, quantum annealing model is introduced in this paper. In some complex production environments such as traffic, many problems are NP-complete, that is, exact solutions cannot be found in polynomial time. At present, the mainstream solution is to use some approximate or heuristic algorithms, but the solution results have some limitations. In order to solve the above problems, quantum annealing algorithm is introduced. Quantum annealing algorithm not only has the parallelism of quantum computing to improve the search efficiency and accuracy, but also has the quantum tunneling effect, which can jump out of the local suboptimal solution and reach the global optimal solution. In this paper, we design the state space, action space and reward mechanism of reinforcement learning in the e-hailing reposition problem, and build an interactive environment with real scene data. In terms of technology selection, this paper chooses to build and train a deep reinforcement learning model on the MindSpore platform with efficient kernel algorithm, and updates the strategy of single agent by interacting with the constructed environment to maximize long-term rewards. The strategy, expert knowledge and common sense are introduced into the quantum annealing model as constrained and optimization conditions in form of QUBO. In the multi-scenario experiment, the architecture in this paper has achieved ideal results, proving the effectiveness and robustness of the architecture.

In terms of portability, the hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning proposed in this paper can be applied to decision-making problems in other complex big data scenarios. Utilizing deep reinforcement learning models to interactively learn complex scene features, and integrating existing expert experiential knowledge within the application scenario to collaboratively form a cognitive map. This cognitive map is then incorporated into the quantum annealing model, leveraging the quantum tunneling effect to efficiently and with high probability search for the globally optimal strategy. Therefore, the architecture of this paper not only optimizes the e-hailing reposition problem, promotes the development of smart cities, but provides new ideas for robust artificial intelligence.

In future research, the hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning proposed in this paper can be applied in decision-making problems in other complex big data scenarios. Reinforcement learning models can be used to learn decision-making strategies in complex scenarios, combined with existing expert experience knowledge in the application scenario to form a cognitive map. The strategies of the cognitive map can be introduced into quantum annealing models to obtain the global optimal strategy. Because of its quantum tunneling effect, quantum annealing can jump out of the local suboptimal solution with a greater probability to reach the global optimal, so in the future quantum annealing can be applied to address challenges in other big data scenarios.

ACKNOWLEDGMENT

This study was sponsored by the Chinese Association for Artificial Intelligence-Huawei MindSpore Open Fund.

Intuitive reasoning based on cognitive maps.

Figure 1 of 7

Fig. 1

Intuitive reasoning based on cognitive maps.

Show in Context

Fig. 2

D-wave quantum computer.

Show in Context

Fig. 3

Hybrid computing architecture of reinforcement learning and quantum annealing guided by intuitive reasoning.

Show in Context

Fig. 4

Diagram of the reposition scenario.

Show in Context

Fig. 5

Policy and value network.

Show in Context

Fig. 6

Vehicle utilization rate of each algorithm with a region size of \$\mathbf{35}\ \mathbf{km}^{\mathbf{2}}\$ and \$\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}\$ and the vehicle count of 100 (corresponding to \$\mathbf{35}\ \mathbf{km}^{\mathbf{2}}\$ ), 200, 300, and 400 (corresponding to \$\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}\$ ) at different time slots.

Show in Context

Fig. 7

Vehicle utilization rate of each algorithm with a region size of \$\mathbf{35}\ \mathbf{km}^{\mathbf{2}}\$ and \$\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}\$ and the vehicle count of 100 (corresponding to \$\mathbf{35}\ \mathbf{km}^{\mathbf{2}}\$ ), 300,400, and 500 (corresponding to \$\mathbf{84.5}\ \mathbf{km}^{\mathbf{2}}\$ ) at different time slots.

Show in Context

References is not available for this document.

MIT Libraries

MIT Libraries

A Reposition Algorithm for E-Hailing Based on Quantum Annealing and Intuitive Reasoning

Alerts

Abstract:

Metadata

Abstract:

Introduction

Related Work

Theoretical Foundation

3.1 Intuitive Reasoning

3.2 Reinforcement Learning for Intuitive Reasoning

3.3 Quantum Annealing for Intuitive Reasoning in Solving Combination Optimization Problems

Hybrid Computing Architecture of Reinforcement Learning and Quantum Annealing Guided By Intuitive Reasoning

4.1 Description of e-Hailing Reposition Problem

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

4.2 Design of the Reinforcement Learning Model in e-Hailing Reposition Problem

4.3 Construction and Training of Reinforcement Learning Model in e-Hailing Reposition Problem

Algorithm 1 Training process

4.4 Design of Quantum Annealing Model

Experimental Result and Analysis

5.1 Experimental Data

5.2 Simulation Environment Construction

Algorithm 2 Simulation process

5.3 Comparison Algorithms and Evaluation Metrics

5.3.1 Introduction To Comparison Algorithms

5.3.2 Metric for Measuring Experimental Results

5.4 Experimental Result and Analysis

Conclusion

ACKNOWLEDGMENT

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Fig. 6

Fig. 7

References