Introduction
Although object detection and tracking methods are widely used for understanding traffic scenes, the relationships between actors within traffic scenes, such as “vehicle waiting for a pedestrian near an intersection”, are not widely utilized. When we drive, while we might not be able to accurately perceive the intentions of every nearby actor in the traffic environment, nor compute an accurate distance to them, we are able to see and predict these relationships. This ability to predict the actions of other vehicles and pedestrians is regarded as a fundamental component of safe driving. Thus, the ability to accurately model the relationships between traffic actors (vehicles, pedestrians, obstacles, etc.) will play an important role in the development of next-generation intelligent vehicles. For example, when driving on a highway, if a following vehicle quickly “approaches” the leading vehicle, a significant signal is that it may have a tendency to overtake in the next moment. Another example is that when a vehicle senses the possibility of a “collision” semantic relationship between two vehicles ahead, it may slow down in advance to avoid a potential triple car accident. Furthermore, understanding a visual scene involves more than recognizing individual objects in isolation. By examining the interactions among the actors in traffic scenes, we may be able to discover specific patterns that indicate potential risks, disruptions, or overly-aggressive driving. The ability to model such relationships and patterns would also benefit related research. For example, the recognition of semantic relationships can benefit the following practical applications:
Simulation-based automatic evaluation: In simulation-based evaluation of self-driving systems such as [1], it is often necessary to generate multiple traffic scenes that are similar to, but distinct from, a given scenario. By constructing a road scene graph (RSG) that contains the semantic relationships described in this paper and combining it with a graph autoencoder network, we can generate multiple such scenes and run them in a simulator. The relevant work is detailed in our paper [2].
Large-scale scene retrieval: Current traffic datasets include hundreds or thousands of traffic scenes, and we often want to retrieve them based on certain conditions, for example, “a vehicle parked at an intersection waiting for two pedestrians to cross the road.” In our research, we construct a graph representation that includes semantic relationships, enabling us to quickly retrieve similar scenes.
Natural language description generation: Generating natural language descriptions of traffic scenes is in broad demand in multiple fields. However, traditional video captioning networks often perform poorly because they do not encode prior knowledge related to autonomous vehicles and driving scenes. By constructing an intermediate representation using a topological graph that contains such knowledge, we expect to improve the performance of natural language description tasks.
Bringing this level of semantic relationship reasoning into the traffic scene domain would be a significant leap forward, but doing so involves two primary challenges:
Traffic actor relationship data is non-Euclidean. In other words, unlike image and text data, relationship data is difficult to process using standard convolutional neural networks (CNNs). Fortunately, recent developments in graph neural networks (GNNs) have brought significant improvements in the training of models on non-Euclidean data by arranging the data into graph structures.
There is insufficient actor relationship data within conventional datasets. To solve this data insufficiency problem, we have created the Road Scene Graph (RSG) Dataset, based on the nuScenes dataset [3]. In addition, we provide graph-structured representations of traffic scenes, where nodes in the RSG correspond to actor status, and where the edges (or nexus) of these nodes correspond to their pairwise relationships [4].
Compared to existing, large-scale scene graph datasets, which are used for common scene graph prediction tasks, our dataset is much smaller due to the difficulty of annotating all the actor relationships, which brings new challenges for our proposed model. First, the number of learned parameters must be smaller. Second, domain knowledge, prior knowledge and geolocation information must be fully exploited to achieve state-of-the-art prediction accuracy with such a small dataset.
Our proposed RSG prediction network, shown in Fig. 1, was inspired by previous studies focused on generating scene graph depictions of common traffic scene images. Currently, many scene graph prediction methods [5], [6], [7], [8], [9] follow an end-to-end structure. That is to say, these models first extract features (for example, using Faster R-CNN [10] or Mask R-CNN [11]), and then simultaneously predict the actor’s class, bounding box and corresponding relationships. However, in the case of intelligent vehicles, object detection results are obtained from the fusion of data from multiple sensors, such as LIDAR, camera, mmWave radar, etc., so it is not necessary for our model to perform object detection again. Therefore, by separating the front-end (object detection and tracking) from the back-end (relationship inference), our model is much more compact and efficient.
Overview of Road Scene Graph (RSG) generation and proposed method: A traffic scene (A) and its corresponding RSG (B). The colors of the nodes indicate node categories (Vehicle, Pedestrian, Road or Lane), while the colors of their edges (i.e., the relationship-defining lines linking the nodes) represent “layers”, such as “Actor-to-Actor”, “Actor-to-Map” or “Map-to-Map”. Our goal in this research is to predict the “Actor-to-Actor” relationships, represented by the red edges in sub-figure (B). Sub-figure (C) shows an overview of the proposed method, including graph generation and the inference process.
As illustrated in Fig. 1, the goal of our research is to transform a traffic scene (A) into a graph-based representation (B) which accurately describes the relationships between actors. The nodes in the representation, which represent vehicles, pedestrians, lanes and intersections, are connected with explainable relationships, such as “passing-by” and “driving-on”. To automate this process, we propose using a GNN-based model to predict the unknown relationships (red edges) in the graph, based on the hierarchical nature of RSG. The nodes in the graph are divided into an actor set and a map component set, therefore the relationships between the nodes can be divided into three categories: “actor-to-actor”, “actor-to-map” and “map-to-map”. The “actor-to-actor” relationships are unknown, and determining these relationships is the goal of our research, while the other two sets of relationships can be easily obtained from an HD map and geometry-based rules. As a consequence, the edge/relationship-prediction problem can be transformed into a graph completion problem, which is easier to solve. Furthermore, our experimental results show that prior knowledge about these relationships significantly improves prediction accuracy.
The RSG-GCN method proposed in this paper is illustrated in Fig. 1 (C). Here, the actors’ 2D bounding boxes and HD map data are used as inputs. Then, both the actor and map component data are transformed into graph nodes. Next, based on the prior relationships in the “actor-to-map” and “map-to-map” sets, these graph nodes can be arranged into a semi-graph
In addition to providing a relationship-based representation of traffic scenes, such road scene graphs can also help the self-driving community in tasks such as traffic scene retrieval (finding a specific traffic scene in a dataset) and synthetic traffic scene generation (generating near-realistic, simulator-friendly traffic scenes), which will assist in the automatic evaluation of autonomous driving systems.
The contributions of our work are as follows:
We introduce a gated recurrent neural network (i.e., a gated recurrent unit or GRU) model for semantic relationship prediction tasks. To improve prediction accuracy, we utilize geographic relationships as prior knowledge. Experimental results indicate that such knowledge greatly benefits the relationship prediction task, allowing our model to outperform baseline models in both accuracy and model efficiency.
We introduce a novel Road Scene Graph (RSG) dataset consisting of 20,000 road scene graphs from 500 traffic scenes. This dataset includes not only actor annotation (from nuScenes), but also pairwise relationships among actors and map components.
We introduce a scene retrieval method for finding specific scenes in RSG datasets, which finds and clusters similar scenes in order to predict potentially risky situations. In addition, the proposed RSG-based traffic scene generator can generate near-realistic traffic scenes for various applications.
The remainder of this work is structured as follows: Section II reviews related work on applications of graph neural networks for autonomous vehicles, actor relationships in traffic scenes, and road scene graph prediction. In Section III, we state the problem definition for road scene graph modeling. In Section IV, we discuss the methodology of our work, as well as possible downstream applications. In Section V, we describe several experiments conducted to validate our proposed method through comparison with other popular traffic scene prediction models. Finally, in Section VI we summarize our study’s findings and conclusions.
Related Work
A. Graph-based Methods Applied in Intelligent Vehicle
The decision-making systems of autonomous vehicles are expected to achieve a high level of driving safety [34], but generating appropriate driving behavior requires the integration of a broad range of data sources. For modern autonomous driving systems such as Autoware [35], [36], the data sources are highly heterogeneous, from tire pressure and battery status to camera video, GPS data, LIDAR point clouds and HD maps.
As a result of the recent, rapid development of graph neural networks (GNNs) and their variants [37], researchers have proposed many graph-based applications for intelligent vehicles, such as traffic prediction and forecasting [27], vehicle control [38], trajectory prediction [13], [14], [39], CAN bus attack (cyberattack) detection [40], traffic scene captioning [41], [42], driving behavior prediction [16], [43] and synthetic traffic scene generation [19], [20]. There are several reasons for the widespread adoption of GNN-based systems. First, heterogeneous data can be utilized more efficiently by the vehicle’s decision-making system, since graphs readily accommodate heterogeneous data formats. For example, unlike standard convolutional networks, GNNs can natively process inputs of varying size. Regarding actor trajectory prediction, before GNNs and their variants were applied, much research focused on the use of carefully rendered bird’s-eye-view (BEV) images of the traffic environment as input [44], [43], since the number of vehicles and other map components varies from scene to scene. CNN variants such as convolutional LSTM (ConvLSTM) [45] were then used to learn the visual features of those images. The use of GNNs allows the direct processing of raw data without rasterization, which makes the design of decision-making models much more straightforward. A second advantage of using GNNs is the ability to capture information in both nodes and edges. In traffic speed estimation [29], for example, nodes in the graph represent intersections while the edges represent the roads between the intersections. Moreover, as in Meta-Sim [19], [20], graph nodes can be used to represent objects (cars, people, trees, roads), while edges represent their hierarchical relationships. A third advantage of GNNs is that the graph structure itself can also convey crucial information, in some cases information that is as important as the nodes. For example, Meta-Sim uses the graph structure to maintain generative rules such as “lane belongs to road” and “car driving in the lane”.
These graph-based data representations allow a great deal of flexibility, as the relationships between nodes can vary depending on the type of prior information to be learned for a particular task. As shown in Table 1, the meanings of the nodes and edge data can be varied depending on the task. As illustrated by the previously developed applications mentioned above [29], [19], [20], there are many kinds of relationships which can be captured and graphic data representations that can be constructed.
The interaction graphs [12], [13], [14], [16] noted in Table 1 capture possible interactions among vehicles in traffic scenes, however they do not define the categories of these interactions. This is because predicting the categories of edge data is not an easy task. But many tasks, such as vehicle trajectory prediction [13], [14], [39], [46] and behavior prediction [16], [43], [47], [48], [49], benefit from graph-based data representation, since it provides an easy way to learn from a given scene without using image representation data.
Another interesting way to build graph-based representations of driving environments is the Lane Graph [17]. The purpose of these graphs is to learn HD maps without rendered image input. Instead, the lane graph connects nearby waypoints in the map to build a graph of the map. Liang et al. [17] first adopted this method for motion forecasting, and proposed LaneGCN to learn map features from HD maps. In their study, lane graphs were directly generated from the HD map. Zürn et al. [50] proposed LaneGraphNet to estimate such graphs from BEV images.
Expanding the scene understanding task from lanes to all nearby objects, Meta-Sim [19], [20] uses a hierarchical tree for arranging these objects, based on a set of rules, such as “lane belongs to the road”, “vehicle on the lane”, etc. In this way, the graph captures the status of all nearby objects, as well as the hierarchical structure of the scene.
The proposed Road Scene Graph (RSG) method is a relational graph [51] based on our previous work [31], which included map components, actors and the semantic relationships among these actors. RSG itself is described in detail in Section III of this paper.
All these methods transform spatial and other kinds of information into various types of relationships, and then build a very compact, non-Euclidean, learnable graph to describe the scene surrounding the ego-vehicle. The preferred graph format varies according to the chosen task. For behavior prediction tasks, interaction graphs [16] are commonly used, while for motion prediction, lane graphs work better since they provide a fine-grained description of HD maps. Structured object trees (Meta-Sim) [19], [20] are commonly used for traffic scene generation tasks. And the RSG approach proposed in this paper is likely a better solution for predicting semantic relationships.
B. Scene Graph Prediction
Scene graphs [52], [53] were originally proposed as a method of describing the relationships between objects detected in an image [4]. Currently, the majority of scene graph research focuses on describing common images, to meet the increasing demand for image retrieval [4], image/scene captioning [41], [42], image generation and image-based querying [54], [55].
The rapid growth in scene graph generation tasks is a result of the creation of large-scale, relational datasets of common Web images. Since Johnson et al. first proposed this concept [4], many large-scale relational datasets have been created. The Real-World Scene Graphs Dataset (RW-SGD) [4] was among the first, containing 5,000 images from the YFCC100m [56] and MS COCO datasets [57]. In addition, the Visual Relationship Dataset (VRD) [6], Visual Genome Dataset (VG) [58], Visually-Relevant Relationships Dataset (VrR-VG) [59], UnRel Dataset [60], HCVRD dataset [61] and others have appeared, with increasing numbers of images, object annotations and relationship annotations. Within the self-driving community, many excellent, large-scale datasets have also emerged [62], [63], [64], [65], [66], [67], [68]. However, our Road Scene Graph dataset would be the first focused on the semantic relationships among vehicles, pedestrians and other actors in traffic scenes.
Currently, the majority of scene graph generation (SGG) models follow a similar framework: (1) a region proposal predictor, which commonly uses Faster R-CNN [10]; (2) a region feature extractor [6]; and (3) iterative feature fine-tuning, using CRF or GNN models. Using probability distributions from natural language tasks as prior knowledge has also been proposed. In contrast, SGG for traffic scenes does not rely on a region proposal predictor, thus the complexity of the model can be significantly reduced. In Section IV-C we discuss this difference in more detail.
The Road Scene Graph Generation (RSGG) task is similar to the graph generation approaches proposed in previous studies, however the region proposal predictor has been removed as we can easily obtain highly accurate scene perception using Autoware or other sensing systems. But a hand-crafted region feature extractor is used for integrating information from traffic actors and the HD map. Finally, RSGG’s iterative feature fine-tuning model was borrowed from the Iterative Message Passing (IMP) method [8], and then modified.
Problem Definition
In this section, we first provide a formal definition of our Road Scene Graph (RSG) method, and then explain the semantic relationship prediction problem.
As shown in Fig. 2, a Road Scene Graph, which can be represented as
To transform the relationships between a set of actors into a fixed-length node feature vector
Likewise, we transform map components, such as roads, lanes and intersections, into node feature vectors, which share a similar representation to actor node
As illustrated in Fig. 2, the edges in the RSG can represent three kinds of relationships, based on the type of nodes they connect: “actor-to-actor”, “actor-to-map-component”, and “map-component-to-map-component”. The categories of these possible semantic relationships are listed in Table 3. The actor-to-actor edges capture interactions between actors in the scene, while the actor-to-map edges capture spatial relationships between actors and map components, such as “vehicle driving on the lane”. The edges linking map components represent topographical relationships between map components, such as “lane belongs to road” or “road is predecessor to intersection”. These map component-to-map component relationships are not predicted by our model, because they can be easily obtained from the geolocation database, and rarely change over time. In our proposed model, these relationships are fundamental to the message-passing mechanism [8] of graph neural networks, as they can significantly increase the connectivity of a graph. At each iteration, these links allow actors to aggregate information from nearby map components, as well as from other actors.
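The three edge layers described above can be illustrated with a small data-structure sketch. It is only a minimal illustration, not the RSG dataset format: the class `RoadSceneGraph`, the node kinds in `ACTOR_KINDS` and `MAP_KINDS`, and the relation strings are hypothetical names chosen for the example. The layer of an edge is derived from the kinds of its two endpoints, and the actor pairs without a known edge are the ones a prediction model would have to fill in.

```python
from dataclasses import dataclass, field

# Assumed node categories, for illustration only.
ACTOR_KINDS = {"vehicle", "pedestrian"}
MAP_KINDS = {"road", "lane", "intersection"}


@dataclass
class RoadSceneGraph:
    nodes: dict = field(default_factory=dict)   # node id -> kind
    edges: list = field(default_factory=list)   # (src, dst, relation)

    def add_node(self, node_id, kind):
        if kind not in ACTOR_KINDS | MAP_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

    def layer(self, src, dst):
        """Return the edge layer implied by the kinds of the two endpoints."""
        n_actors = sum(self.nodes[n] in ACTOR_KINDS for n in (src, dst))
        return {2: "actor-to-actor", 1: "actor-to-map", 0: "map-to-map"}[n_actors]

    def unknown_actor_pairs(self):
        """Actor pairs with no edge yet, i.e. the relationships to predict."""
        actors = sorted(n for n, k in self.nodes.items() if k in ACTOR_KINDS)
        known = {frozenset((s, d)) for s, d, _ in self.edges}
        return [(a, b) for i, a in enumerate(actors) for b in actors[i + 1:]
                if frozenset((a, b)) not in known]


def build_example():
    """A semi-graph holding only map-to-map and actor-to-map edges."""
    g = RoadSceneGraph()
    g.add_node("road_1", "road")
    g.add_node("lane_1", "lane")
    g.add_node("car_1", "vehicle")
    g.add_node("car_2", "vehicle")
    g.add_node("ped_1", "pedestrian")
    g.add_edge("lane_1", "road_1", "belongs-to")
    g.add_edge("car_1", "lane_1", "driving-on")
    g.add_edge("car_2", "lane_1", "driving-on")
    return g
```

In this sketch, `build_example().unknown_actor_pairs()` lists the three actor pairs whose relationships are still missing, which mirrors the graph completion view of the prediction task.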
The goal of semantic relationship prediction is to infer pairwise relationships among all actors in a traffic scene, given actor node set
For each pair of actors or map component nodes (see Table 3), we formulate the relationship prediction problem as finding the optimal \begin{align*} Pr\left ({r|A, \mathcal {M}}\right)=&\prod _{ \boldsymbol {\alpha } \in A}\prod _{i \neq j} Pr\left ({r_{i \rightarrow j} | \boldsymbol {\alpha }_{i}, \boldsymbol {\alpha }_{j}, \mathcal {M}}\right)\tag{1}\\ Pr\left ({r|A, \mathcal {M}}\right)=&\prod _{ \boldsymbol {\alpha } \in A}\prod _{i \neq j} Pr\left ({r_{i \rightarrow j} | \boldsymbol {\alpha }_{i}, \boldsymbol {\alpha }_{j}}\right) Pr\left ({\boldsymbol {\alpha }_{i}, \boldsymbol {\alpha }_{j} | \mathcal {M}}\right)\tag{2}\end{align*}
In contrast to scene graph prediction tasks based on common images [8], in traffic scene graph prediction the predicted relationships are undirected. That is to say, our proposed model does not distinguish between the “object” and “subject” of a relationship. One reason for this is that not all relationships have clearly defined objects and subjects. Relationships such as “Following” or “Approaching” do, but relationships such as “Grouping” do not. Moreover, since we can obtain an actor’s position and velocity, as well as information about other nearby actors, it is easy to build a rule-based system to determine the “object” and “subject” of a particular relationship, without increasing the complexity of our graph inference model. The detailed definitions of the notations used in this study can be found in Table 2.
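To make the idea of a rule-based post-processing step concrete, the following sketch assigns a subject and an object to an undirected “Following” edge. The rule itself is an assumption made for this example (the follower is the actor that lies behind the other along the pair’s mean direction of travel), and `ActorState` and `order_following` are hypothetical names. The paper does not specify the rules it actually uses.

```python
from dataclasses import dataclass


@dataclass
class ActorState:
    name: str
    x: float    # position in metres
    y: float
    vx: float   # velocity in metres per second
    vy: float


def order_following(a, b):
    """Return (subject, object) for an undirected "Following" edge.

    Assumed rule: the subject (the follower) is the actor that lies behind
    the other one along the mean velocity direction of the pair.
    """
    hx, hy = (a.vx + b.vx) / 2.0, (a.vy + b.vy) / 2.0
    if hx == 0.0 and hy == 0.0:
        raise ValueError("mean velocity is zero; direction of travel is undefined")
    # Offset of b relative to a, projected onto the direction of travel.
    offset = (b.x - a.x) * hx + (b.y - a.y) * hy
    # A positive offset means b is ahead of a, so a is the follower.
    return (a, b) if offset > 0 else (b, a)
```

Because the rule only depends on the two actor states, the result is the same regardless of the order in which the undirected edge lists its endpoints.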
RSG-GCN: Relationship prediction network for intelligent transportation systems
A. Data Setup
Here we introduce our Road Scene Graph dataset and explain how this dataset was constructed. The data format of our dataset is similar to that of other graph-based datasets in the common image domain, such as Visual Genome [58] and VrR-VG [59]. However, since the RSG dataset is used to model traffic scenes, different object and relationship categories are used. Also, as Table 4 and Fig. 3 illustrate, the distribution of labels in the RSG dataset is unique and more balanced than that of image domain datasets. The main reason for this is that the number of node and relationship categories used in RSG is significantly smaller than the number used in common scene graph datasets. The RSG dataset contains only 6 unique objects and 48 relationships, compared with 75,729 objects and 40,480 relationships in the graph-based, image domain Visual Genome dataset [58], as the goal of RSG is limited to predicting semantic relationships in traffic scenes. When applying semantic labels to describe traffic scene data, the most important thing is to define these labels carefully so that they cover as many traffic scenes as possible. Despite this effort, some cases inevitably remain uncovered or ambiguous. The relationship categories used in this study are shown in Table 3. We use three methods to describe traffic scenes as comprehensively as possible:
Distribution of cross-label annotations in RSG dataset, showing the distribution of entity-to-entity annotations (left) and entity-to-relationship annotations (right).
For the relationships between map components, shown in green font in the top-left part of Table 3, we only describe the most basic relationships in the road network. These hierarchical and universal relationships cover all 500 traffic scenes in the RSG dataset. In some rare cases, such as underground parking lots, construction sites, or wilderness areas, it is not possible to obtain an HD map and create relationships between road elements. We have excluded these cases from the RSG dataset. To obtain such relationships, we first convert the nuScenes map to the ASAM OpenDrive [69] format. This process is described in detail in our previous work [2]. By constructing high-precision maps for the nuScenes dataset, we can obtain the connections and “Belongs-to” relationships between map components.
The orange-labeled blocks on the bottom-left (also implied in the top-right boxes) represent cross-layer “actor-to-map component” relationships. For these relationships, we use a set of rules based on the object state to roughly determine them, ensuring that for any actor in the traffic scene, exactly one corresponding relationship to a road map component is generated.
The most important semantic relationships in our research are those at the “Actor-to-Actor” level, listed in the bottom-right part of Table 3. Before annotating these relationships, two methods were used to ensure that the semantic label categories cover the majority of relationships in various traffic scenarios: questionnaire surveys and prototype annotations. First, questionnaires were distributed to participants, who were asked to observe 20 traffic scenes from the nuScenes dataset, each lasting 20 seconds, and then to provide detailed descriptions of all potential relationship categories in those scenes. The participants included 4 faculty members and 8 doctoral and master’s students. Four of the student participants had no driving experience and could therefore provide observations from a pedestrian’s point of view. Except for fatigue, the experiment posed no significant risks to the participants. We managed the data carefully and protected the privacy of the participants. In this way, we obtained an initial relationship category list. Additionally, we referenced the relationships from the HDD HRI Driving Dataset [70], which focuses on the behavior of the driver.
During the prototype annotation process, to ensure that our list includes the majority of semantic relationships that appear in the scene, we collected feedback and reports from annotators and revised the semantic relationship category list based on this feedback. We revised the relationship list when annotating 50, 200, and 300 scenes, and re-annotated previous scenes to include the newly added semantic labels. In addition, to make the semantic labels more generalizable, we aggregated some semantic relationship labels, such as the “cut-in” and “cut-out” relationships, which we merged into “overtaking.”
The final issue is consistency, which here covers both the consistency of semantic labels across different traffic scenarios and the consistency of labels produced by different annotators. To this end, the following four strategies were used: (1) Using program-generated semantic labels: For “Actor-to-Map” and “Map-to-Map” relationships (green and orange parts in Table 3), we infer the semantic relationships with a program that considers object status and road network information. This maintains consistency across traffic scenes. (2) Using standardized semantic labels: For relationships between objects (blue part in Table 3), we define each relationship in detail, indicate special cases, and distribute an annotation manual. We also train annotators before annotation to ensure label consistency. (3) Using multiple annotators: Each scene is annotated by multiple annotators, and conflicts between annotators are resolved. (4) Using rule-based checks: For the “Actor-to-Actor” relationships, building on the definitions in (2), we used a set of rules based on the objects’ positions and velocities to assist in the annotation process. Although this mechanism cannot automatically generate semantic relationships, it can automatically detect some obvious errors.
B. Graph Pre-processing
The aim of the graph-preprocessing is to generate the semi-graph
As Fig. 4 illustrates, during this stage three matrices are generated: feature matrix
Pipeline of our proposed Road Scene Graph generation framework. Given a traffic scene based on actor status set
C. RSG-GCN: Semantic Relationship Prediction Network
Here we propose our novel model, Road Scene Graph-based Convolutional Neural Network (RSG-GCN). The RSG-GCN decomposes the probability of “actor located in HD map” \begin{align*} Pr\left ({\boldsymbol {\alpha _{i}}, \boldsymbol {\alpha _{j}} | \mathcal {M}}\right)=&\prod _{m_{p,q} \in \mathcal {M}} Pr\left ({\boldsymbol {\alpha _{i}}, \boldsymbol {\alpha _{j}} | \boldsymbol {m_{p}}, \boldsymbol {m_{q}}}\right) \\&Pr\left ({r_{ \boldsymbol {\alpha _{i}} \rightarrow \boldsymbol {m_{p}} } r_{ \boldsymbol {\alpha _{j}} \rightarrow \boldsymbol {m_{q}}} | \boldsymbol {\alpha _{i}}, \boldsymbol {\alpha _{j}}, \boldsymbol {m_{p}}, \boldsymbol {m_{q}} }\right) \\&Pr\left ({r_{ \boldsymbol {m_{p}} \rightarrow \boldsymbol {m_{q}} } | \boldsymbol {m_{p}}, \boldsymbol {m_{q}} }\right) \\&Pr\left ({\boldsymbol {m_{p}}, \boldsymbol {m_{q}} | \mathcal {M}}\right)\tag{3}\end{align*}
As demonstrated in our previous work [31], such decomposition allows us to integrate prior relationships such as “vehicle driving on the road” or “road next to intersection” into our graph inference model. Here, we model both the actor-to-map relationships
Our goal here is to refine semi-graph
Using a method based on Iterative Message Passing [8], we also apply gated recurrent unit networks (GRU) during our graph refinement process. GRU is a popular technique which has been used in several graph network generation methods to propagate node messages in graphs [72], [73], [74]. However, in contrast to these models, our proposed method does not simultaneously predict the status of objects and their relationships. This is because the bounding boxes of objects can be obtained from the perception module in a self-driving system, using LIDAR auxiliary bounding box regression. Therefore, we can focus entirely on prediction of the actor relationship edges using the highly accurate, intelligent sensing system of the vehicle (which is achieved through sensor fusion). To fully utilize this capability, we remove the node information update step from the original Iterative Message Passing model.
The new graph inference model is shown on the far right of Fig. 4. Here, we first transform
To enable message propagation on the graphs, we used a GRU [75], [72] model for the edge data, which is a simple but effective method that also solves the vanishing gradient problem caused by stacking graph convolutional layers. GRUs also outperform LSTMs when the scale of the dataset is limited [72]. For each node in the dual graph, we created a vector \begin{align*} r_{t}=&\sigma \left ({W_{xr} f_{t} + W_{hr} \hat {h}_{t-1}}\right) \\ z_{t}=&\sigma \left ({W_{xz} f_{t} + W_{hz} \hat {h}_{t-1}}\right) \\ \widetilde {h}_{t}=&\tanh \left ({W_{xh} f_{t} + W_{hh} \left ({r_{t} \odot \hat {h}_{t-1} }\right)}\right) \\ h_{t}=&\left ({z_{t} \odot \hat {h}_{t-1} }\right) + \left ({1 - z_{t}}\right) \odot \widetilde {h}_{t} \\ a_{t}=&\sigma \left ({W_{l} h_{t} + b_{l}}\right) \tag{4}\end{align*}
Here,
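For readers who prefer code to notation, one update step of Eq. (4) can be written out as below. This is a dependency-free sketch, not the authors’ implementation: the dictionary keys mirror the weight names in Eq. (4), the previous state `h_prev` stands for \hat{h}_{t-1} and is simply passed in (how it is aggregated from neighboring nodes is not shown), and `init_params` with its `scale` and `seed` arguments is a hypothetical helper.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def gru_step(f_t, h_prev, p):
    """One update following Eq. (4); p maps weight names to matrices/vectors."""
    r = sigmoid(add(matvec(p["W_xr"], f_t), matvec(p["W_hr"], h_prev)))  # reset gate
    z = sigmoid(add(matvec(p["W_xz"], f_t), matvec(p["W_hz"], h_prev)))  # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                       # r_t ⊙ h_{t-1}
    h_tilde = [math.tanh(x)
               for x in add(matvec(p["W_xh"], f_t), matvec(p["W_hh"], gated))]
    # Convex combination of the previous state and the candidate state.
    h = [zi * hp + (1.0 - zi) * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]
    a = sigmoid(add(matvec(p["W_l"], h), p["b_l"]))                      # output layer
    return h, a


def init_params(in_dim, hid_dim, out_dim, scale=0.5, seed=0):
    """Random weights in [-scale, scale]; purely illustrative initialization."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

    p = {name: mat(hid_dim, in_dim) for name in ("W_xr", "W_xz", "W_xh")}
    p.update({name: mat(hid_dim, hid_dim) for name in ("W_hr", "W_hz", "W_hh")})
    p["W_l"] = mat(out_dim, hid_dim)
    p["b_l"] = [0.0] * out_dim
    return p
```

A call such as `gru_step(f_t, h_prev, init_params(4, 3, 2))` returns the new hidden state and the sigmoid output. With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the new state is exactly half of the previous one, which is a convenient sanity check.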
D. Application of RSG: Traffic Scene Retrieval
In addition to revealing the semantic relationships in traffic scenes, a key strength of our graph-based representation method is that it also enables a variety of downstream applications. Fig. 5 shows schematic diagrams of two of these possible applications. One straightforward application is scene retrieval, which involves querying the dataset using a specific condition, such as “find scenes where two vehicles are waiting at an intersection, three pedestrians are crossing the intersection, and there are barriers nearby”. To provide reliable query results, we transform the scene retrieval task into a subgraph isomorphism problem. As Fig. 5 (A) shows, the user first manually translates the query into a road scene graph. Then, a modified VF2 graph searching algorithm is used to find appropriate scenes and frames in the dataset, which are then compared with the original RSG. As shown in Algorithm 1, we need to check the categories of both the nodes and the edges.
Algorithm 1 Road Scene Graph Matching Algorithm $M(s)$
Input: Intermediate
Output: Mapping
Set the intermediate state
if
if Foreach edge
OUTPUT
else
Continue
end if
else
Compute the set P(s) of the pairs’ candidate for inclusion in
while Each
if
Compute the next state
CALL Match
end if
end while
Restore data structures
end if
Schematic of two possible downstream applications of RSGs: (A) Traffic scene retrieval, and (B) Synthetic traffic scene generation.
A subgraph isomorphism problem is a computational task in which two graphs,
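The category checks on nodes and edges can be illustrated with a plain backtracking search. This sketch is not the modified VF2 algorithm of Algorithm 1: it omits VF2’s candidate-pair pruning and simply tries every unused scene node. The function name `find_matches`, the graph encoding (dictionaries keyed by node id and by unordered node pair), and the relation strings in the example data are assumptions made for illustration. Every query edge must appear in the scene with the same category, while extra scene edges are allowed.

```python
def find_matches(query_nodes, query_edges, scene_nodes, scene_edges):
    """Yield mappings (query node -> scene node) that preserve categories.

    *_nodes: dict mapping node id -> node category
    *_edges: dict mapping frozenset({u, v}) -> relationship category
    """
    order = list(query_nodes)

    def consistent(q, s, mapping):
        # Node categories must agree.
        if query_nodes[q] != scene_nodes[s]:
            return False
        # Every query edge to an already-mapped node must exist in the
        # scene with the same relationship category.
        for q_prev, s_prev in mapping.items():
            q_rel = query_edges.get(frozenset((q, q_prev)))
            if q_rel is not None and scene_edges.get(frozenset((s, s_prev))) != q_rel:
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        q = order[len(mapping)]
        used = set(mapping.values())
        for s in scene_nodes:
            if s not in used and consistent(q, s, mapping):
                mapping[q] = s
                yield from extend(mapping)
                del mapping[q]  # restore the state before the next candidate

    yield from extend({})


# Hypothetical query: a vehicle waiting at an intersection that a pedestrian crosses.
QUERY_NODES = {"v": "vehicle", "p": "pedestrian", "i": "intersection"}
QUERY_EDGES = {frozenset(("v", "i")): "waiting-at",
               frozenset(("p", "i")): "crossing"}

# Hypothetical scene graph for one frame.
SCENE_NODES = {"car_1": "vehicle", "car_2": "vehicle", "ped_1": "pedestrian",
               "int_1": "intersection", "lane_1": "lane"}
SCENE_EDGES = {frozenset(("car_1", "int_1")): "waiting-at",
               frozenset(("ped_1", "int_1")): "crossing",
               frozenset(("car_2", "lane_1")): "driving-on"}
```

Calling `list(find_matches(QUERY_NODES, QUERY_EDGES, SCENE_NODES, SCENE_EDGES))` on this toy data yields a single mapping, since only `car_1` has the required edge to the intersection. A frame would be returned by the retrieval step whenever at least one mapping exists.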
E. Application of RSG: Synthetic Traffic Scene Generation
Here, we briefly introduce our previous work on RSG-based synthetic traffic scene generation [71]. The purpose of that work is to automatically generate digital twins of open-source traffic scene datasets, simulate them in the CARLA [78] and SUMO [79] simulators, and then generate multiple traffic scenes that are similar to a given scene for testing purposes. The work can be divided into two parts. In the first part, we designed a graph autoencoder to learn and generate synthetic RSGs. In the second part, because traffic scenes in CARLA or SUMO rely on quantitative information (speed, location, pose, etc.) for all actors, we developed a grounding mechanism that generates these scenes from semantic relationship information by placing each object node at the correct geometric position and assigning it an initial status.
The grounding mechanism works as follows. First, the generated RSG contains a certain number of nodes corresponding to map elements, and we define an HD map set in advance (taken from nuScenes, with the maps converted to OpenDRIVE format). Using a topological graph matching method (VF2), we find a map (or a series of maps) that contains all map nodes in the traffic scene. Then, a set of random generation mechanisms places objects on the map based on the semantic relationships between the objects and the map components. For example, if there is a “driving-on” relationship between a vehicle and a certain lane, the program randomly places the vehicle in a legal position on the lane and then assigns it an initial velocity in the Frenet coordinate system. At the same time, a “destination” is assigned to each object, i.e., the position the object will be in after a certain period of time. Finally, we use SUMO to simulate the generated initial status of the traffic scene, obtain the trajectory of each object, and resolve issues such as traffic lights, pedestrian avoidance, and object interactions. The simulation results contain all quantitative information about each object at each moment (speed, position, orientation, etc.), which can be passed into CARLA through CARLA-rosbridge for simulation (with maps loaded using the OpenDRIVE loading function in the CARLA dev branch).
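As an illustration of the “driving-on” placement step, the following sketch samples a random Frenet position (arc length $s$, lateral offset $d$) on a lane and converts it to Cartesian coordinates. It is only a sketch under stated assumptions: the lane is represented as a polyline centerline, the lateral offset is limited to a quarter of the lane width, and the names `frenet_to_cartesian` and `place_on_lane` are hypothetical rather than part of our released code.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def frenet_to_cartesian(centerline: List[Point], s: float, d: float) -> Tuple[float, float, float]:
    """Convert arc length s and left-positive lateral offset d to (x, y, heading)."""
    travelled = 0.0
    for (x0, y0), (x1, y1) in zip(centerline, centerline[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg > 0.0 and s <= travelled + seg:
            t = (s - travelled) / seg
            heading = math.atan2(y1 - y0, x1 - x0)
            # Point on the centerline, shifted along the left-hand normal.
            x = x0 + t * (x1 - x0) - d * math.sin(heading)
            y = y0 + t * (y1 - y0) + d * math.cos(heading)
            return x, y, heading
        travelled += seg
    raise ValueError("s lies beyond the end of the centerline")


def place_on_lane(centerline: List[Point], lane_width: float,
                  speed_range: Tuple[float, float],
                  rng: random.Random) -> Dict[str, float]:
    """Sample a pose on the lane and an initial velocity along the lane direction."""
    length = sum(math.hypot(x1 - x0, y1 - y0)
                 for (x0, y0), (x1, y1) in zip(centerline, centerline[1:]))
    s = rng.uniform(0.0, length)
    d = rng.uniform(-0.25, 0.25) * lane_width   # assumed safety margin
    x, y, heading = frenet_to_cartesian(centerline, s, d)
    return {"x": x, "y": y, "heading": heading,
            "speed_s": rng.uniform(*speed_range),  # longitudinal speed
            "speed_d": 0.0}                        # no initial lateral speed


lane = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
state = place_on_lane(lane, lane_width=3.5, speed_range=(5.0, 15.0), rng=random.Random(0))
```

Expressing the initial velocity as a longitudinal and a lateral component keeps the sampled state valid regardless of the lane's curvature.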
Experiments
In this section, we describe our evaluation of the proposed method, an RSG (gated GRU + dual graph) model. We do this by comparing its performance when predicting semantic relationships between traffic scene actors, in the form of graph edge data, with that of other graph prediction models, as well as with non-graph-based learning models, as listed in Table 5. We conducted this evaluation using the RSG dataset introduced in Section IV-A.
A. Baseline Models
Here, we compare our proposed RSG-GCN model, described in Section IV, with the following methods: (1) Vanilla GNN model with simple GNN stack; (2) Vanilla CNN model with simple CNN stack; (3) Iterative Message Passing model [8]; (4) Graph VAE [80] model with autoencoder, without prior geometric information (our previously proposed road scene graph generation model); and (5) Pairwise prediction model.
1) Vanilla GNN
In this model, we used the simple, 3-layer GNN model from [18] to learn the graph embedding of the given semi-RSG: \begin{align*} H^{(l+1)}=&\sigma \left ({\widetilde {D}^{-0.5}\widetilde {A}\widetilde {D}^{-0.5}H^{(l)}W^{(l)}}\right) \\ \widetilde {A}=&A+I_{N} \\ \widetilde {D}_{ii}=&\sum _{j}\widetilde {A}_{ij} \tag{5}\end{align*}
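The propagation rule in Eq. (5) can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the training code used in our experiments: it assumes ReLU as the activation $\sigma$, treats $\widetilde{D}$ as the diagonal degree matrix of $\widetilde{A}$, and uses plain nested lists in place of tensors.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One propagation step: ReLU(D~^-0.5 A~ D~^-0.5 H W)."""
    n = len(adj)
    # A~ = A + I adds a self-loop to every node.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # D~ is diagonal, with the row sums of A~ on the diagonal.
    deg = [sum(row) for row in a_tilde]
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Two connected nodes with one feature each and an identity weight.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weights = [[1.0]]
embedding = gcn_layer(adjacency, features, weights)
```

With self-loops added, each node in this toy graph averages its own feature with that of its neighbour, so both rows of `embedding` become 2.0.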
2) Vanilla CNN
Similar to the approach used in [44], we used ConvLSTM [45], an extension of the classic LSTM architecture, to learn the features of pre-rasterized, rendered data. The input for this model was 2D rasterized images, as shown in Fig. 6 (D). We also used a fully-connected layer to predict both the adjacency matrix and edge data category matrix of
Various methods of integrating prior HD map information. (A) Simple bounding boxes; (B) Minimum surrounding rectangle (MSR); (C) Representation as an OpenDRIVE format map, generated using our previously proposed Real-to-Synthetic project [71], where ID indicates the node label; (D) Rendered image for ConvLSTM [45].
3) Iterative Message Passing
Here we used a full version of the IMP (Iterative Message Passing) model [8], [81]. Compared to our proposed RSG model described in Section IV-C, this model contains both a node data message-pooling model and an edge data message-pooling model. One challenge when using this approach is that the node features are obtained from learned feature representations of small pieces of images contained within bounding boxes, while edge data features are obtained from the ROI-pooling layer of the images. As these features are not accessible for our RSG prediction task, we use node/edge data features from
4) Graph Autoencoder
GraphVAE [80] is another popular method for generating small graphs. The key idea of this approach is to train an encoder to generate a latent representation \begin{align*} \mathcal {L}=&- \log p\left ({G|z}\right) = -\lambda _{A} \log p\left ({A^{\prime }|z}\right) - \lambda _{F} \log p\left ({V|z}\right) \\&{}- \lambda _{E} \log p\left ({E|z}\right)\tag{6}\end{align*}
The three losses were defined as follows: \begin{align*} \log p\left ({A^{\prime }|z}\right)=&\frac {1}{k} \sum _{a}\widehat {A}_{a,a} \log A^{\prime }_{a,a} + \left ({1-\widehat {A}_{a,a}}\right)\log \left ({1-A^{\prime }_{a,a}}\right) \\&{}+\frac {1}{k(k-1)} \sum _{a \neq b} \widehat {A}_{a,b} \log A^{\prime }_{a,b} + \left ({1-\widehat {A}_{a,b}}\right) \log \left ({1-A^{\prime }_{a,b}}\right) \tag{7}\\ \log p\left ({V|z}\right)=&\frac {1}{n} \sum _{i} \log F_{i}^{T} \widehat {F^{\prime }}_{i,.}\tag{8}\\ \log p\left ({E|z}\right)=&\frac {1}{||A||_{1} - n} \sum _{i \neq j} \log E_{i,j,.}^{T}\widehat {E^{\prime }}_{i,j,.}\tag{9}\end{align*}
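The adjacency term in Eq. (7) is a binary cross-entropy that is averaged separately over diagonal entries (node existence) and off-diagonal entries (edge existence). The sketch below is illustrative only: it assumes that the off-diagonal sum carries the same $\log A^{\prime}_{a,b}$ factor as the diagonal sum, clips probabilities by a small hypothetical `eps` to avoid $\log 0$, and uses the argument names `a_hat` and `a_prime` for $\widehat{A}$ and $A^{\prime}$.

```python
import math
from typing import List

Matrix = List[List[float]]


def adjacency_log_likelihood(a_hat: Matrix, a_prime: Matrix, eps: float = 1e-12) -> float:
    """Evaluate the adjacency term of Eq. (7) for k x k matrices with k >= 2."""
    k = len(a_hat)
    if k < 2:
        raise ValueError("at least two nodes are required")

    def term(weight: float, prob: float) -> float:
        # weight * log(prob) + (1 - weight) * log(1 - prob), with clipping.
        prob = min(max(prob, eps), 1.0 - eps)
        return weight * math.log(prob) + (1.0 - weight) * math.log(1.0 - prob)

    diagonal = sum(term(a_hat[a][a], a_prime[a][a]) for a in range(k)) / k
    off_diagonal = sum(term(a_hat[a][b], a_prime[a][b])
                       for a in range(k) for b in range(k) if a != b) / (k * (k - 1))
    return diagonal + off_diagonal


# An uninformative A' (all entries 0.5) against a two-node graph without edges.
value = adjacency_log_likelihood([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])
```

Normalizing the two sums separately keeps the many off-diagonal entries from dominating the few diagonal ones.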
When training this model, we first use a form of complete loss, as shown in Eq. (6). We then remove
5) Gated GRU + Dual Graph
This is the model proposed in this paper, which is described in detail in Section IV-C.
6) Pairwise Prediction
A non-graph method which is entirely different from the other models, this method takes every possible pairing of actors as its input and predicts the categories of the potential relationships. Due to the very limited parameter space of both the input and output (even smaller than MNIST), we used a stack of 3 fully-connected layers for the model structure. Nodes representing HD map components are learned in the same manner as the pairs of traffic actors. As shown in Table 7, this method is the only one which saw a drop in performance when prior geo-relationship information was used.
Evaluation results for the six methods described above are shown in Table 7. The proposed model outperformed all other methods, with or without prior geo-relationship information.
In this research, we not only represent actors as graph nodes, but also treat HD map components as nodes. However, these map component nodes differ from the actor nodes, whose status is represented geometrically by a 3D bounding box. Because the shapes of roads, lanes and intersections are quite unique, they are difficult to represent using fixed-length feature vectors. Therefore, we evaluated several methods of integrating map assets into our inference model, such as using rasterized images, component IDs or bounding boxes to represent the draft version of the map assets. Fig. 6 shows examples of the qualitative results for different map feature learning methods.
Figure 7 shows the prediction accuracy of our proposed model when trained with different HD map data integration methods. The performance of our model peaked when using the minimum surrounding rectangle (MSR) method. Compared to a normal bounding box, an MSR provides an additional degree of freedom and matches the shape of most map components much more closely.
(Left) Distribution of the vertices among roads, lanes, intersections and sidewalks. (Right) Intersection over Union (IOU) of polygon features for bounding box vs. minimum surrounding rectangle (MSR) integration methods.
Figure 7 (left) shows the reason for the poor performance of the polygon GNNs; the number of vertices of the various map components follows a long-tail distribution. Since 100 polygon vertices are quite a lot, it is difficult for a simple RNN to learn the features of these polygons. However, the polygon GNN outperforms the other, simpler methods like the bounding box and MSR methods. Figure 7 (right) shows the IOUs for the ground-truth polygon with bounding boxes vs. MSR. The majority of the IOUs of the minimum surrounding rectangles are in the 0.75 to 1.0 range; thus, MSR is a simple and appropriate way to generate HD map representations. Model recall performance, as shown in Table 6, also supports this hypothesis.
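The IOU comparison between the two rectangle types can be reproduced for a single polygon with the sketch below. It is a simplified illustration, not the evaluation script behind Figure 7: it relies on the fact that both rectangles enclose the polygon, so the IOU reduces to the polygon area divided by the rectangle area, and it finds the MSR by testing one candidate rectangle per convex hull edge. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(pts: List[Point]) -> float:
    """Shoelace formula for a simple polygon given in vertex order."""
    return abs(sum(x0 * y1 - x1 * y0
                   for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]))) / 2.0


def convex_hull(pts: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns the hull in counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(points: List[Point]) -> List[Point]:
        chain: List[Point] = []
        for p in points:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(pts[::-1])


def bbox_area(pts: List[Point]) -> float:
    """Area of the axis-aligned bounding box."""
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def msr_area(pts: List[Point]) -> float:
    """Area of the minimum surrounding rectangle (one side lies on a hull edge)."""
    hull = convex_hull(pts)
    best = math.inf
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        length = math.hypot(x1 - x0, y1 - y0)
        ux, uy = (x1 - x0) / length, (y1 - y0) / length
        along = [x * ux + y * uy for x, y in hull]
        across = [-x * uy + y * ux for x, y in hull]
        best = min(best, (max(along) - min(along)) * (max(across) - min(across)))
    return best


# A lane-like rectangle rotated by 45 degrees.
lane_polygon = [(0.0, 0.0), (4.0, 4.0), (3.0, 5.0), (-1.0, 1.0)]
iou_bbox = polygon_area(lane_polygon) / bbox_area(lane_polygon)
iou_msr = polygon_area(lane_polygon) / msr_area(lane_polygon)
```

For this rotated lane segment the axis-aligned box covers far more area than the polygon, whereas the MSR fits it almost exactly.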
B. Evaluation Criteria
Top-$K$ recall (R@$K$) is computed as \begin{equation*} R\text{@}K = \frac {|\text {Top}_{K} \cap \text {GT}|}{|\text {GT}|}\tag{10}\end{equation*}
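A minimal sketch of the R@K computation in Eq. (10) is given below. It assumes that each prediction is a (subject, relationship, object) triple with a confidence score and that the denominator is the number of ground-truth triples; the function name and the example triples are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def recall_at_k(scored: List[Tuple[Triple, float]], ground_truth: Set[Triple], k: int) -> float:
    """Fraction of ground-truth triples found among the K highest-scoring predictions."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    top_k = {triple for triple, _ in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


predictions = [
    (("car1", "waiting-for", "ped1"), 0.9),
    (("car2", "following", "car1"), 0.4),
    (("car1", "passing-by", "car2"), 0.7),
]
truth = {("car1", "waiting-for", "ped1"), ("car2", "following", "car1")}
r_at_2 = recall_at_k(predictions, truth, k=2)
r_at_3 = recall_at_k(predictions, truth, k=3)
```

Because only the top K predictions count, a correct relationship with a low score (here the "following" triple) is recovered only once K is large enough.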
We also used an ego-centric R@K metric. Because both the nuScenes and RSG datasets are recorded from an ego-centric point of view, annotation quality appears to be related to the Euclidean or topological distance between the targeted actor and the ego-vehicle. To evaluate this effect, Table 8 shows our results when we extract a subgraph of relationship predictions based on topological distance (1, 3 or 5 hops) and Euclidean distance (10, 20 or 50 meters) from the ego-vehicle. We then used the R@K metric to evaluate prediction accuracy on these extracted subgraphs.
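The two cropping criteria can be sketched as follows. This is an illustrative example under stated assumptions: edges are treated as undirected when counting hops, node positions are 2D points in meters, and the helper names and the toy scene are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def nodes_within_hops(edges: List[Edge], ego: str, max_hops: int) -> Set[str]:
    """Breadth-first search from the ego node, ignoring edge direction."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    depth = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue                      # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def nodes_within_radius(positions: Dict[str, Tuple[float, float]],
                        ego: str, radius: float) -> Set[str]:
    """Nodes whose Euclidean distance to the ego node is at most radius."""
    ex, ey = positions[ego]
    return {n for n, (x, y) in positions.items() if math.hypot(x - ex, y - ey) <= radius}


def crop_edges(edges: List[Edge], keep: Set[str]) -> List[Edge]:
    """Keep only the edges whose two endpoints both survive the crop."""
    return [(a, b) for a, b in edges if a in keep and b in keep]


scene_edges = [("ego", "lane1"), ("lane1", "car1"), ("car1", "ped1"), ("ped1", "sidewalk1")]
scene_positions = {"ego": (0.0, 0.0), "lane1": (0.0, 2.0), "car1": (8.0, 0.0),
                   "ped1": (18.0, 5.0), "sidewalk1": (40.0, 5.0)}
one_hop = crop_edges(scene_edges, nodes_within_hops(scene_edges, "ego", 1))
near = crop_edges(scene_edges, nodes_within_radius(scene_positions, "ego", 10.0))
```

The R@K metric is then evaluated on the cropped edge list instead of the full graph.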
C. Qualitative Results
Figure 8 shows three examples of actual road scene graphs generated using our proposed method. A graph of an entire road scene is often composed of tens of nodes and edges. To simplify our performance evaluation, we cropped the road scene graphs and generated ego-vehicle-oriented subgraphs. As the samples shown in Figure 8 illustrate, our model can generate good quality scene graphs for traffic scenes. When using the R@K performance metric instead of mAP, the predicted results have excellent structural consistency with the ground truth of these scenes.
Sample of road scene graphs as predicted by our proposed RSG Gated GRU + Dual Graph model, using bounding boxes with orientation. To improve readability, these graphs have been cropped, and are presented from an ego-centric viewpoint. The predicted semantic relationships have been labeled with names, while relationships based on prior knowledge are represented simply as edges without text labels.
Samples of road scene graphs converted into CARLA simulations, using the experimental CARLA-OpenDRIVE support function.
Furthermore, most prediction errors occur among ambiguous relationships, such as “waiting for” and “passing by”, or in situations where the geometry of the relationships changes drastically. For example, in Scene 0061 at
D. Quantitative Analysis for Various Model Structures
In this subsection, we quantitatively analyze the prediction performance of our proposed model by comparing prediction error and R@K recall of our method with those of the baseline methods described in Section V-A1. As noted previously, all models were trained using the same Road Scene Graph (RSG) Dataset.
Table 7 shows the prediction performance of our model and the baseline models with and without prior geographic relationship information. The performance of the vanilla baselines indicates that even simple GNN stacks can learn the features of semi-graph
Our proposed model (RSG with Gated GRU + Dual Graph) achieved the best performance, outperforming the graph autoencoder model by 10% using the R@50 metric, when prior geographic relationship information was available. However, iterative message passing (IMP), the original version of our proposed model, was not as accurate as our proposed, cropped model. The goal of our original IMP model was to simultaneously predict each actor’s status (position, velocity, etc.) and the semantic relationships of the traffic scene. In contrast, the actor status inference module has been removed from our proposed RSG model. Note that the IMP model performs 11.4% better in terms of R@50 when using the dataset without prior geo-relationship information, in comparison to its performance when this information is provided. This is likely because, with the integration of this prior geo-relationship information, the graph’s diameter increases significantly. Furthermore, although the IMP model benefits from the additional information, its accuracy suffers from the increased output. On the other hand, the accuracy of the pairwise prediction model does not suffer, as it does not receive the whole graph as input, so it neither benefits nor suffers from prior knowledge.
E. Evaluation of Ego-centric Accuracy
We also measured the accuracy and recall of these models using the ego-centric metric defined in Section V-B. As mentioned previously in this section, we assumed that our dataset has an ego-centric bias due to the method used to generate and annotate the RSG dataset. The results of this experiment, shown in Table 8, support this hypothesis. We can see that relationship prediction accuracy significantly increases for actors closer to the ego-vehicle for all models. Even if we cropped the subgraph to create a wider range (50 meters or five hops), prediction accuracy is still significantly higher than when the full graph prediction results are used. This confirms that the ego-centric bias is a result of the RSG dataset’s ground truth for relationships and bounding boxes.
There are two possible reasons for this ego-centric bias: (1) Since the dataset is composed of drive recordings made by different vehicles, the quality of the relationship annotations may change dramatically at the periphery of the lidar’s or camera’s field of view (FOV), as overlapping, appearance and vanishing occur more often in the peripheral areas of traffic scenes. (2) The viewpoint of our relationship annotation system itself is ego-centric. Since we only provide the view from the camera mounted on the ego-vehicle, the annotator may unconsciously label more relationships between the ego-vehicle and its surrounding objects than relationships between the non-ego actors. A simple experiment supports the second hypothesis. When the ego-vehicle view was changed in 10 scenes to a random vehicle view (but not too far from the ego-vehicle, to avoid eliminating all the surrounding actors), annotations for the selected non-ego vehicle increased by 21.1%. This result suggests that a fixed bird’s-eye-view (BEV) would be fairer and more objective. However, our ego-centric dataset may be more appropriate for ego-vehicle-related research, such as identifying potential risks around the ego-vehicle or predicting the ego-vehicle’s trajectory.
F. Ablation Experiments
These experiments evaluated how prior knowledge of the topological map benefits the relationship prediction task. Table 9 shows the results of removing specific layers of the road scene graph (left), and of randomly removing various amounts of prior knowledge (right). These results indicate that the IMP, autoencoder (graphVAE) and proposed models are the most affected by the removal of prior knowledge. Although unaffected by the removal of this topological data, the simple GNN model is still unable to outperform these three models.
Among the three layers of RSG data, the results shown in Table 9 indicate that removal of the “map-to-actor” layer degrades prediction performance the most. When the prior relationship information is randomly removed, the performance of most of the models rapidly decreases to a level even lower than the “without prior geo-relationship information” condition in Table 7. This decrease in performance suggests that the grounding of actors to the HD map greatly boosts prediction accuracy. However, some models, such as the IMP model, suffer less than others when prior relationship information is removed. Note also that when we randomly removed just a small amount of relationship information (5%), the accuracy of the prediction results for our proposed model increased slightly (1.05%). This suggests that the random removal of a small amount of prior knowledge could be a safe method of graph data augmentation.
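The random-removal setting can be sketched as a simple augmentation routine. This is only an illustration of the idea, assuming that prior relationships are stored as a list of (source, target) edges; the function name `drop_prior_edges` and the toy edge list are hypothetical.

```python
import random
from typing import List, Tuple

Edge = Tuple[str, str]


def drop_prior_edges(prior_edges: List[Edge], drop_ratio: float,
                     rng: random.Random) -> List[Edge]:
    """Remove a random fraction of prior-knowledge edges, preserving edge order."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must lie in [0, 1]")
    n_drop = round(len(prior_edges) * drop_ratio)
    dropped = set(rng.sample(range(len(prior_edges)), n_drop))
    return [edge for i, edge in enumerate(prior_edges) if i not in dropped]


# 100 hypothetical "map-to-actor" edges, 5% of which are removed.
prior = [("lane%d" % i, "car%d" % i) for i in range(100)]
augmented = drop_prior_edges(prior, 0.05, random.Random(0))
```

Passing an explicit `random.Random` instance keeps each augmented graph reproducible across training runs.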
G. Applications of RSG
Here we provide some qualitative results on synthetic traffic scene generation from our previous study, which proposed a Real-to-Synthetic [71] traffic scene generation method that generates near-realistic, simulator-friendly traffic scenes from graph-based scene representations. Detailed results on digital twin generation and synthetic scene generation can be found in our previous work [2].
Conclusion and Future Work
The goal of this paper is to improve understanding of urban traffic scenes by accurately predicting semantic relationships among traffic actors. To accomplish this task, we first created and annotated a Road Scene Graph dataset containing traffic scenes with multiple semantic relationship annotations linking each pair of actors. We then proposed an RSG-GCN model as a method to predict this graph. Our model first generates traffic actor and HD map node features. These node features are then integrated into a semi-graph by determining the geometric relationships among the actors and map components. Finally, a graph refinement model is proposed to leverage actor status information and prior HD map information to predict semantic relationships among the actors. The proposed Road Scene Graph traffic scene modeling approach provides a novel way to describe traffic scenes at both the geometric and semantic levels. Our experimental results indicate that our proposed relationship prediction model outperforms other popular methods. Future work will be focused on how recent developments in the performance of common SGG (Scene Graph Generation) tasks can be used to improve RSG’s prediction domain, and how RSG can be used to benefit other tasks, such as traffic scene captioning and risk detection.
ACKNOWLEDGMENT
The authors would like to thank Prof. Ruifeng Li, Prof. Ke Wang, Prof. Lijun Zhao, other faculty members and graduate students, and their study participants, for their assistance with data collection, facilities, questionnaires, relationship category determination and RSG data annotation.