Introduction
As artificial intelligence (AI) advances, it is being applied across various industrial fields, and vehicle AI systems are emerging as one of the most prominent applications. Onboard AIs in vehicles assist autonomous driving and protect the safety of drivers and pedestrians by integrating with the control system [1]. In particular, the EU and the United States have introduced mandatory requirements for various driver assistance systems through regulations on general automotive safety to ensure safe road traffic [2]. The primary objective of these regulations is to enhance the protection of elderly drivers, vehicle occupants, pedestrians, and cyclists. Given that research shows human error causes 95% of accidents, implementing these regulations is projected to save more than 25,000 lives and prevent at least 140,000 serious injuries by 2038. This has led to an urgent need to research and develop advanced driver assistance systems (ADAS) using AI.
Conventional AI vision systems for autonomous vehicles train on large datasets using AI platforms, after which the learned information is transferred to individual vehicles. These systems are typically implemented on GPU- or AI accelerator-based hardware for training and inference, which results in high power consumption, slow response times, and challenges in real-time prediction [3]. The exponential increase in neural network computations required for massive autonomous-driving training datasets has exacerbated these limitations. To address these challenges, a new generation of low-power, high-performance, and highly efficient AI vision systems is expected to become central to AI implementation in autonomous driving.
Neuromorphic AI vision systems, in particular, represent an increasingly significant way to support AI technology for autonomous vehicles [4]. Event cameras, also called neuromorphic cameras, and neuromorphic processors, which mimic the functionality of the human eye and brain, offer ultra-low power consumption, high efficiency, and rapid responsiveness. Consequently, these technologies are actively applied in autonomous vehicles, where they serve as core components for sensing and processing [5], [6], [7].
In response to these demands, we propose N-DriverMotion, an event-based dataset and vision system for neuromorphic learning and prediction of driver motions on an efficient convolutional spiking neural network (CSNN). This driver motion recognition research comprises a neuromorphic framework for configuring CSNNs efficiently in terms of energy and latency, and implements driver safety assistance with a high-resolution event-based camera. The contributions of the proposed driver motion recognition system are as follows:
We create an event-based dataset, N-DriverMotion (720 × 720 resolution), for large-scale driver motion recognition using an event-based camera. To the best of our knowledge, this is the first study of driver motion recognition using a high-resolution event camera and spiking neural networks. The event-based dataset presents 13 driver motion categories classified by direction (front, side), illumination (bright, moderate, dark), and participant.
We present a novel simplified four-layer convolutional spiking neural network (CSNN), directly trained with the event dataset without any time-consuming preprocessing. This enables efficient adaptation to on-device spiking neural networks (SNNs) for real-time inference over event-based camera streams. Furthermore, the proposed neuromorphic vision system achieves a competitive accuracy of 94.04% in the 13-class classification task and 97.24% in the unexpected abnormal driver motion classification task with the CSNN architecture, supporting safer and more efficient driver monitoring systems for autonomous vehicles or edge devices requiring a low-power and efficient neural network architecture.
We design an efficient CSNN for driver motion recognition on one of the widely used practical neuromorphic frameworks, the Intel Lava neuromorphic framework. We also propose a version of our CSNN for the Loihi 2 processor [8], [9]. We observed that the energy-delay product (EDP) of the proposed model on Loihi 2 was 20,721 times more efficient than that of a non-edge GPU and 541 times more efficient than that of an edge-purpose GPU, which is one of the most important aspects of edge-device AI.
We designed an optimized convolutional spiking neural network (CSNN) for energy-efficient deployment in on-device applications and systems such as Loihi 2. To train the proposed CSNN, we used a surrogate gradient descent-based backpropagation method, SLAYER [10], for direct training with our N-DriverMotion dataset. Additionally, instead of resizing the high-resolution event frames directly, we added a pooling layer to enable adaptive deployment on GPUs and the Loihi 2 system, depending on memory requirements.
The remainder of this paper is organized as follows: Section II briefly reviews related work, Section III describes the proposed driver motion recognition system, Section IV presents the implementation results and performance evaluations, and Section V concludes this paper.
Related Work
Event-based cameras have advantages over frame-based cameras, such as high dynamic range, high temporal resolution, low latency, and low power consumption, because they emit sparse event streams only when intensity changes [11]. Therefore, they have been increasingly adopted in machine learning vision systems for gesture recognition and learning. Hand-gesture recognition [12], gesture and facial expression recognition [13], DECOLLE [14] for deep continuous local learning, EDDD [15] for event-based drowsiness driving detection, TORE [16] for time-ordered recent events, event-based asynchronous sparse convolutional networks [17], and EST [18] for end-to-end learning of representations for asynchronous event-based data have been developed on deep neural networks (DNNs) and event representations, exploiting the sparsity and temporal dispersion of event-based gestures. Riccardo et al. [6] trained a deep neural network on a preprocessed DVS hand gesture dataset for recognizing hand gestures on a Loihi neuromorphic processor. The trained DNN was converted into the spike domain for deployment on Intel Loihi [19] for real-time gesture recognition.
Unlike DNN-based approaches, which require converting event-based datasets into static patterns for processing, recent research has shifted toward direct training and learning on SNNs using event-based datasets. This is because events are inherently suited for SNNs operating in continuous time. Moreover, SNNs directly exploit the temporal information of events for energy-efficient computation. SLAYER proposes an error backpropagation mechanism for offline training of SNNs. It receives events and leverages the backpropagation method to train synaptic weights and axonal delays directly. Amir et al. [20] proposed an end-to-end event-based gesture recognition system using an event-based camera and a TrueNorth processor configured with a convolutional neural network (CNN). It recognized hand gestures in real time from gesture event streams captured by a Dynamic Vision Sensor (DVS).
Event-based vision systems have been widely used in various applications, including vision systems for autonomous driving [21], [22] and object detection [23], [24]. These applications exploit event-based cameras to capture the spatial and temporal event data of driving and objects.
Loihi 2 is the second generation of Intel's neuromorphic chip designed for energy-efficient AI applications [9]. The Loihi chip uses the current-based leaky integrate-and-fire (CUBA LIF) neuron as its main neuron model. The overall behavior of the CUBA LIF model is close to that of the LIF model, but it is more biologically plausible, as the CUBA LIF model includes the temporal dynamics of the postsynaptic input current [25]. A single Loihi chip consists of 128 neurocores, which are groups of neurons organized into compartments. Each neurocore communicates with other neurocores through an asynchronous Network-on-Chip (NoC), where messages are packetized for sending and receiving. Although a single chip provides only 128 neurocores, Loihi can exchange messages across multiple chips without increasing latency. The simplified model of Loihi is shown in Fig. 1, which illustrates the asynchronous neuron model mechanism along with the implementation of online or continual learning. The Loihi chip is used with the Lava software framework to support the development of agile neuromorphic software and applications [26], [27], [28], [29], [30], [31].
Simplified structure of Loihi, illustrating how a neurocore is structured and operates.
While event-based systems have shown promise in previous research for autonomous driving and gesture recognition, they still face challenges in achieving real-time performance with low power consumption in highly dynamic environments. Additionally, existing systems may struggle to handle the diverse and unpredictable nature of recognition scenarios, particularly when dealing with rapid motions or changing lighting conditions, which can generate massive event streams. Our proposed system with N-DriverMotion addresses these challenges by leveraging the energy-efficient CSNN architecture on Loihi 2. This enables fast, real-time processing of large-scale event data while maintaining low power consumption. Furthermore, the system ensures that event data from event cameras is processed promptly without bottlenecks, which is important for safety-critical applications like autonomous driving.
Proposed Event-Based Driver Motion Recognition System
In this section, we introduce an efficient CSNN architecture and a driver motion learning and prediction system employing a direct training mechanism and event-based data streams. As event-based data differ from static images in that they include time to represent events, we adopt direct spiking training methods and a 3D spike convolution operation to build an efficient CSNN model.
For the direct training of spiking neural networks, we used a gradient-based training method. The discrete nature of spike events is generally known to hinder differentiation in SNNs. To address this problem, various approximations of the derivative of the spike function have been proposed [32], [33], [34]. However, none of these considered the temporal dispersion between spikes. A differentiable approximation (i.e., a surrogate gradient) of the spike function is introduced based on the probability of a change in the spiking state: the derivative of the spike function is taken to represent the probability density function (PDF) for a change of state of a spiking neuron.
With an expected value of the derivative of the spike function, the estimated backpropagation error and the gradient terms for the weights and delays in layer $l$ are given as follows:
\begin{align*}
\nabla _{W^{(l)}} E =& \int _{0}^{T} \delta ^{(l+1)}(t) \cdot \left(a^{(l)}(t) \right)^{T} \, dt \tag{1}\\
\nabla _{d^{(l)}} E =& - \int _{0}^{T} \dot{a}^{(l)}(t) \cdot e^{(l)}(t) \, dt \tag{2}\\
e^{(l)} =& {\begin{cases}\frac{\partial L(t)}{\partial a^{(n_{l})}} & \text{if } l = n_{l} \\
\left(W^{(l)} \right)^{T} \delta ^{(l+1)}(t) & \text{otherwise} \end{cases}} \tag{3}\\
\delta ^{(l)}(t) =& \rho ^{(l)}(t) \cdot (\varepsilon _{d} \odot e^{(l)})(t) \tag{4}
\end{align*}
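To make the surrogate gradient concrete, the following minimal PyTorch sketch replaces the non-differentiable Heaviside spike derivative with a PDF-shaped approximation in the backward pass; the exponential shape and the THRESHOLD/TAU_GRAD/SCALE_GRAD constants are illustrative assumptions rather than the exact SLAYER kernels.

```python
import torch

THRESHOLD = 1.0    # spiking threshold (illustrative value)
TAU_GRAD = 1.0     # width of the surrogate PDF (illustrative value)
SCALE_GRAD = 1.0   # height of the surrogate PDF (illustrative value)

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, PDF-shaped surrogate derivative in the backward pass."""

    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        # A neuron spikes whenever its membrane potential crosses the threshold.
        return (membrane_potential >= THRESHOLD).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane_potential,) = ctx.saved_tensors
        # Replace the ill-defined derivative of the spike function with an exponential
        # PDF centered at the threshold: the probability of a change in spiking state.
        surrogate = SCALE_GRAD * torch.exp(
            -torch.abs(membrane_potential - THRESHOLD) / TAU_GRAD
        )
        return grad_output * surrogate


# Spikes flow forward as 0/1 values; gradients flow backward through the surrogate.
u = torch.randn(4, requires_grad=True)
SurrogateSpike.apply(u).sum().backward()
print(u.grad)
```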
In our model, the event-based video streams are defined as a 3D tensor with spatial (x, y) and temporal dimensions.
A convolutional neural network (CNN) is a feed-forward network composed of multiple layers, where filters convolve around the input or a single layer so that neurons collect meaningful features or patterns [36]. However, as the event-based video contains not only spatial information but also time, the sampled input sequence is processed with a 3D spiking convolution over both the spatial and temporal dimensions.
The 3D spiking convolution operation is performed by convolving the spike trains in the kernel window with a spike response kernel and applying the threshold function. Each spike train is converted into a spike response signal and subsequently transformed into the membrane potential by integrating the signal with the refractory response of the neuron:
\begin{align*}
a_{u,v}(t) =& S_{u,v}(t) * \varepsilon _{d}(t) \tag{5}\\
u_{j,k}(t) =& \sum _{m=1}^{K} \sum _{n=1}^{K} W_{m,n} a_{m+(j-1),n+(k-1)}(t) + (S_{j,k}(t) * \nu (t)) \tag{6}\\
S_{j,k}(t) =& 1 \ \text{and} \ u_{j,k}(t) = 0 \quad \text{when} \quad u_{j,k}(t) \geq V_{\text{thr}} \tag{7}
\end{align*}
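For illustration, a discrete-time reading of Eqs. (5)–(7) with CUBA LIF dynamics can be sketched as follows; the exponential current and voltage decay constants are assumptions consistent with the neuron model described in Section II, while the reset-to-zero behavior follows Eq. (7).

```python
import torch

def cuba_lif_step(in_spikes, weight, current, voltage,
                  current_decay=0.25, voltage_decay=0.03, v_thr=1.0):
    """One discrete time step of a current-based LIF layer.

    in_spikes: binary input spikes, shape (num_inputs,)
    weight:    synaptic weights, shape (num_neurons, num_inputs)
    current, voltage: neuron state from the previous step, shape (num_neurons,)
    """
    # Synaptic current integrates weighted input spikes and decays over time (spike response, Eq. 5).
    current = (1.0 - current_decay) * current + weight @ in_spikes
    # Membrane potential integrates the current and decays (Eq. 6).
    voltage = (1.0 - voltage_decay) * voltage + current
    # A threshold crossing emits a spike and resets the membrane potential (Eq. 7).
    out_spikes = (voltage >= v_thr).float()
    voltage = voltage * (1.0 - out_spikes)
    return out_spikes, current, voltage
```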
Our proposed model is developed based on existing models [10], [20] for DVS gesture recognition, which implement SNN training and a 3D spike convolution operation. The SLAYER and TrueNorth models consist of 8 and 16 layers, respectively. Fig. 3 presents our proposed network model and system, where the network is simplified and comprises only 4 layers. When testing a more complex model, we observed that it not only occupied more memory but also degraded overall performance, as additional layers hindered neurons from firing spikes, leading to a lower event rate because spikes could not be passed to consecutive neuron layers.
Proposed Event-based Driver Gesture Recognition System. The event stream is captured with the EVK4 camera, where each event stream is 3 seconds long. The stream is then fed to the proposed model, where the inputs are sampled down to 2 seconds. The model classifies the stream based on the event rates, selecting the class with the most spikes as the classification result.
The 720 × 720 resolution event streams for driver motion recognition were recorded by an event-based camera for 3 seconds and delivered to the first 3D pooling layer. Unlike conventional convolutional neural networks, where convolution layers are processed first, our model places a pooling layer before the convolution layer. The main reason for this is to down-sample feature maps to reduce the large memory usage, as well as to extract features. Compared to classical pooling layers, SNN pooling layers are considered trainable layers, as they not only down-sample the event streams but also take temporal aspects into account, enabling the learning of spatio-temporal features from the input. The pooling layer uses a kernel of size 8 × 8 and passes data to the convolution layer, which uses a 5 × 5 kernel to produce 16 channels of features. The features are then fed to 2 final fully connected layers, which provide 13 outputs representing the spike rates for the given event data.
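For reference, this layer sequence can be written with Lava-DL's SLAYER CUBA blocks roughly as in the sketch below; the class name, neuron parameter values, padding, and the 512-neuron hidden layer are illustrative assumptions rather than the exact configuration of Tables 3 and 4.

```python
import torch
import lava.lib.dl.slayer as slayer

class DriverMotionCSNN(torch.nn.Module):
    """4-layer CSNN sketch: pooling -> convolution -> two fully connected layers."""

    def __init__(self):
        super().__init__()
        neuron_params = {              # CUBA LIF parameters (illustrative values)
            'threshold': 1.25,
            'current_decay': 0.25,
            'voltage_decay': 0.03,
            'requires_grad': False,
        }
        self.blocks = torch.nn.ModuleList([
            # 8x8 spiking pooling down-samples the 720x720 event frames to 90x90.
            slayer.block.cuba.Pool(neuron_params, 8),
            # 5x5 spiking convolution producing 16 feature channels.
            slayer.block.cuba.Conv(neuron_params, 2, 16, 5, padding=2, weight_norm=True),
            # Flatten is not a spiking layer; it only reshapes features for the dense layers.
            slayer.block.cuba.Flatten(),
            # Two fully connected spiking layers; the last one outputs 13 class neurons.
            slayer.block.cuba.Dense(neuron_params, 16 * 90 * 90, 512, weight_norm=True),
            slayer.block.cuba.Dense(neuron_params, 512, 13, weight_norm=True),
        ])

    def forward(self, spike):          # spike: (batch, 2, 720, 720, time)
        for block in self.blocks:
            spike = block(spike)
        return spike                   # output spike trains for the 13 classes
```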
We note that existing spiking neural networks designed for other applications are not optimized for autonomous driving, which requires low power consumption, real-time inference, and efficient deployment on actual neuromorphic chips. To effectively implement driver motion recognition for autonomous driving, it is important to have a neural network optimized for neuromorphic hardware (simple, but without performance degradation) as well as an event-based dataset for training such networks. We addressed both by designing, implementing, and testing a network that takes these requirements into consideration.
Experiments and Results
A. Event-Based Camera and Data Conversion
Prophesee's Metavision EVK4 camera is one of the latest event-based vision sensors, supporting up to 1280 × 720 pixel resolution. Generally, when a change in pixel intensity exceeds a user-defined threshold, the event-based camera asynchronously detects the change in brightness and generates an event for that pixel [11]. Each event contains the pixel's x and y coordinates, indicating where a brightness change occurred, and a timestamp for the occurrence of the event. The device offers a high dynamic range (86 dB, which can exceed 120 dB based on a low-light cutoff measurement of 0.08 lx) and a maximum event rate of 1.06 giga-events per second (Geps).
In the experiment, to allow direct use of the spike data obtained from the high-resolution event-based camera without requiring additional resources, we minimized modifications to the extractor so that it retrieves the coordinate, polarity, origin (a newly implemented feature in the EVK4), and timestamp information. Moreover, we observed that the most meaningful motion information was contained in the central 720 × 720 pixels, so the input events were cropped to 720 × 720 pixels.
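As a simple illustration of this cropping step, the sketch below keeps only events whose coordinates fall inside the central 720 × 720 window of the 1280 × 720 sensor and shifts them to a local origin; the structured-array field names (x, y, p, t) are assumptions about the raw event format.

```python
import numpy as np

def crop_center_events(events, sensor_width=1280, sensor_height=720, crop=720):
    """Keep events inside the central crop x crop window and re-origin their coordinates.

    events: structured array with fields 'x', 'y', 'p' (polarity), 't' (timestamp).
    """
    x0 = (sensor_width - crop) // 2           # horizontal offset of the central window
    y0 = (sensor_height - crop) // 2          # zero here, since the sensor is 720 px tall
    mask = (
        (events['x'] >= x0) & (events['x'] < x0 + crop) &
        (events['y'] >= y0) & (events['y'] < y0 + crop)
    )
    cropped = events[mask].copy()
    cropped['x'] -= x0                        # shift coordinates so the crop starts at (0, 0)
    cropped['y'] -= y0
    return cropped
```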
B. Dataset
Over the past years, a profusion of gesture datasets captured with simple frame-based sensors has been presented. However, to stimulate improvements in event-based computer vision, Hu et al. [37] strongly argue for the importance of DVS datasets. Serrano et al. [38] introduced the first labeled event-based neuromorphic vision sensor dataset, which originated from the classical MNIST digit recognition dataset by moving the images along certain directions within the screen. This dataset was further developed by Orchard et al. [39], who removed several frame artifacts using a pan-tilt unit. Generating artificial movement for static images enabled the production of event outputs, which became important for object recognition in neuromorphic systems such as spiking neural networks [7]. However, static image recognition with event-based vision sensors has been observed to be ineffective, as their main purpose is dynamic scenes. Recognizing these drawbacks, some datasets consisting of dynamic scenes have been presented. Berner et al. [40] introduced novel datasets that converted existing visual video benchmarks for object tracking and action/object recognition into spiking neuromorphic datasets using the DAVIS camera's output.
Even though these datasets show dynamic movements in a spiking neuromorphic form, according to Tan et al. [41], a DVS camera produces output with microsecond temporal resolution. In contrast, datasets produced by conventional vision tools such as video recorders or color cameras have temporal resolutions of tens of milliseconds, which causes a loss of high temporal frequencies during conversion. Moreover, they add that there is a high chance of unwanted artifacts being introduced during conversion. To address these problems, Amir et al. [20] present the DvsGesture dataset, which directly captured scenes with the DVS128 camera. Additionally, the DvsGesture recordings were captured under different lighting conditions, as DVS sensors are less affected by brightness and introduce some meaningful noise. However, the DvsGesture dataset is mainly utilized for simple classification tasks, which may be considered less practical.
The N-DriverMotion dataset comprises 1,239 instances of a set of 13 driver motions (1 normal driving motion and 12 driving motions that are considered possibly dangerous), as shown in Fig. 4 and Table 1. These motions include head falling (front, sides, and back), holding a cellphone (right/left hand), looking back (to the right/left side), and the body going down (to the right/left side), as well as the same motions captured from the side. The dataset was collected from 23 subjects under three different lighting conditions. All subjects were within a specific age range and had no disabilities or other conditions that could affect driving. Each subject performed all 13 motions in a single recording, where each motion was recorded for around 3 seconds under the same brightness condition. The three brightness conditions were classified as “bright,” “moderate,” and “dark” to test how much brightness would affect the recordings and to examine the robustness of event data under harsh illumination conditions. We note that informed consent was obtained from all participants during data collection. To evaluate the classifier's performance, we randomly selected 992 motions for the training set, and the remaining 247 motions were designated as the test set.
Due to limitations in recruiting participants, we were unable to include experiments involving drivers with medical conditions or disabilities. This will be expanded in future work. Instead, to address unpredictable scenarios, we introduced five new types of event data that were not used during training. These additional test cases include abnormal reactions such as seizures during driving, sudden standing or rising movements, collisions where the driver hits the windshield, hand-swipe gestures, and abnormal camera angles caused by accidents or issues with the camera's fixed position. These newly created event-based video sequences were specifically designed to test how the system responds to other abnormal cases not seen during training.
C. Experimental Setup and Tool Flow
The experiment consists of three sub-parts that measure different aspects of our proposed model's performance: the accuracy of the 13-class driver motion classification, the accuracy of classifying unexpected abnormal driver motions versus normal driver motions, and the energy efficiency with respect to throughput on various AI accelerators.
1) Multiple Driver Motion Classification
The aim of this experiment is to observe how accurately our proposed system can classify various driver motions (13 classes in total). With the model developed based on the Lava-DL API, the proposed network was loaded onto the RTX 3080 GPU and trained for 200 epochs. During each training epoch, the model was evaluated with the test set, and accuracy was computed from the spike rates of the output spikes. Whenever the model produced the highest test accuracy so far, its weights were stored as a PyTorch state dictionary file. Further details about the configuration of the system are given in Section IV-D.
2) Inference on Unseen Abnormal Driving Motions
While it is important to see how the model performs classification on the provided 13 motion classes, it is also important to check the model's capability in classifying unexpected motions. Here, we run inference on a separate dataset composed of abnormal driver motions not included in the 13 classes, as well as normal motions, which the model was trained to classify as label 3. In this experiment, we set the objective as binary classification: abnormal driver motion versus normal driving motion. Given such abnormal driver motions, if the model produces predictions other than label 3 (normal driving motion), we regard the model as classifying them as abnormal motion, and vice versa. We used the fully trained CSNN model with the same event-data configuration (e.g., sampling time) that was used for measuring accuracy on the 13-class classification task. The test data for this task is composed of 82 abnormal driver motions (10 participants performing 5 different unexpected abnormal driving motions, as shown in Fig. 5 and Table 2) along with 63 normal driver motions from the original test set. The evaluation process was conducted in the same way as the multiple driver motion classification, except in a binary classification fashion.
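The binary decision is derived directly from the 13-class output, as the following minimal sketch illustrates; any prediction other than the normal-driving label (3) is treated as abnormal, and the function and variable names are hypothetical.

```python
import numpy as np

NORMAL_LABEL = 3  # label of the normal-driving class in the 13-class setup

def to_binary_prediction(spike_counts):
    """spike_counts: per-class output spike totals for one event stream (length 13)."""
    predicted_class = int(np.argmax(spike_counts))      # Winner-Takes-All over 13 classes
    return 'normal' if predicted_class == NORMAL_LABEL else 'abnormal'
```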
Sample shots taken of abnormal driver gestures. Note that hand-swipe gestures are not simple hand-waving actions but are actions to protect oneself from various attacks.
3) Energy Efficiency in Various AI Accelerators
Energy efficiency is one of the most promising aspects of SNNs, and this experiment is designed to show the overall efficiency of the model with respect to 3 AI accelerators: the Nvidia RTX 3080, the Nvidia Jetson Xavier NX, and Intel Loihi 2. For both Nvidia GPUs, we loaded the models directly onto the devices and measured various inference costs (latency, total energy consumption, etc.). Due to the fast inference speed of the GPUs, we averaged each of the total costs over the number of samples processed. For Loihi 2, we designed a specific pipeline that enables us to load the model into the system and feed the event data directly to the model, along with the measurement tools provided by the Lava API. Detailed information about the experimental flows and settings for the Jetson Xavier NX and Loihi 2 is described in Section IV-E.
D. Network Configuration and Training
To enable adequate training, the dataset was converted from binary format into tensors consisting of x-coordinates, y-coordinates, polarities, and timestamps. We also set the sampling time to 2.0 seconds, as the core movements were mostly within this range. More importantly, we wanted to check whether our proposed model is capable of classifying driver motions in a short amount of time, as in real-time applications, such systems should detect abnormal activities rapidly. In this form, the data can now be used as input to the CSNN during runtime.
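A rough sketch of this conversion, assuming microsecond timestamps and fixed-width time bins (the 5 ms bin width used later for training is taken as an example), is shown below.

```python
import numpy as np

def events_to_tensor(events, resolution=720, sample_time_s=2.0, bin_width_s=0.005):
    """Convert an event stream into a dense (polarity, y, x, time-bin) spike tensor."""
    num_bins = int(sample_time_s / bin_width_s)
    tensor = np.zeros((2, resolution, resolution, num_bins), dtype=np.float32)
    t0 = events['t'].min()
    for x, y, p, t in zip(events['x'], events['y'], events['p'], events['t']):
        bin_idx = int((t - t0) * 1e-6 / bin_width_s)      # timestamps assumed in microseconds
        if bin_idx < num_bins:                             # drop events beyond the sampling window
            tensor[int(p), int(y), int(x), bin_idx] = 1.0
    return tensor
```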
All of the applications were implemented and mapped to GPU/CPU to be executed as a simulation of Loihi 2, using the Lava version 0.4.4 software framework and the Lava-DL library for training. Our CSNN models were built with the neuron models and trained with the event-based dataset using the SLAYER API, all provided by Lava-DL.
We trained multiple SNN configurations to compare accuracy results. Each network was trained for 200 epochs with a batch size of one due to the dataset's high resolution. We selected CUBA LIF as the base neuron model for the spiking CNN models. For optimization, the learning rate was set to 3 × 10⁻³ and ADAM was used as the optimizer. For the loss function, we chose the spike rate loss. The network configuration for the 4-layered CSNN and the neuron parameters are shown in Tables 3 and 4.
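A training loop matching this configuration, written with Lava-DL's SLAYER utilities, might look roughly as follows; the SpikeRate target rates, the data loaders, and the checkpoint file name are assumptions, and the network class is the sketch from Section III.

```python
import torch
import lava.lib.dl.slayer as slayer

net = DriverMotionCSNN().to('cuda')                     # CSNN sketch from Section III
optimizer = torch.optim.Adam(net.parameters(), lr=3e-3)
# Spike-rate loss: push the true class toward a high firing rate, others toward a low one.
error = slayer.loss.SpikeRate(true_rate=0.2, false_rate=0.03, reduction='sum').to('cuda')

best_accuracy = 0.0
for epoch in range(200):
    net.train()
    for events, label in train_loader:                  # train_loader: assumed DataLoader, batch size 1
        output = net(events.to('cuda'))
        loss = error(output, label.to('cuda'))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    net.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for events, label in test_loader:               # test_loader: assumed DataLoader
            output = net(events.to('cuda'))
            prediction = slayer.classifier.Rate.predict(output)   # class with the highest spike rate
            correct += (prediction.cpu() == label).sum().item()
            total += label.numel()
    accuracy = correct / total
    if accuracy > best_accuracy:                        # keep the best-performing weights
        best_accuracy = accuracy
        torch.save(net.state_dict(), 'n_drivermotion_csnn.pt')
```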
E. Edge Device Configurations and Inference
In this section, we provide a detailed description of the AI accelerators specialized for use in edge devices. We utilized two different edge AI accelerators, the Jetson Xavier NX and Intel Loihi 2, to compare overall energy efficiency with respect to image processing delays.
1) Jetson Xavier NX
The Jetson Xavier NX is a fully featured development board equipped with a 6-core ARM CPU and a 48-tensor-core processing unit supporting a clock frequency of up to 1.1 GHz, along with 16 GB of memory. To minimize power usage, we operated in the standard mode, where the GPU's clock frequency is set to 0.67 GHz. All inference configurations were the same as with the RTX 3080, except for the library version. Due to restrictions of the Python version, inference on the Jetson Xavier NX was executed with a lower version of the lava-dl library (0.4.0).
2) Intel Loihi 2
A neuromorphic chip is an edge AI accelerator that particularly targets the implementation of SNNs. Among various neuromorphic chips, we used Loihi 2 to observe the overall power consumption of the benchmark three-fully-connected-layer SNN model and the proposed CSNN model, as Loihi 2 adopts CUBA LIF as the base neuron model, best fitting all of the proposed models [26]. For this experiment, we used Oheo Gulch, one of the Loihi 2-based neuromorphic systems, which contains a single-socketed Loihi 2 chip. Due to the limited availability of Loihi 2 chips, along with the large input and model size, some adjustments to the input and models were made. We downsampled the input from 720 × 720 to 40 × 40 with max pooling and truncated the motion stream from a total of 3.0 seconds to 1.7 seconds. For the three-fully-connected-layer model, there were no changes. For the proposed CSNN model, we excluded the CUBA pooling layer, since the input had already been downsampled, and reduced the number of extracted features from 16 to 4. The description of the network model on Loihi 2 is shown in Table 5. We were able to load the three-fully-connected-layer model onto a single Loihi 2 chip, whereas the proposed CSNN model required a total of 4 Loihi 2 chips.
To enable inference on Loihi 2, it is necessary to create a pipeline that can process spatio-temporal data for classification tasks on Loihi 2. The pipeline we created is shown in Fig. 6. The event data was stored in a buffer block named RingBuffer, which passed the frame data one at a time to Loihi 2. As each frame was passed to Loihi 2, the last layer in the model produced spikes that contributed to the classification result for that frame. The spikes were cumulatively stored in the OutputProcess block until every frame in the event data was processed. The final classification was conducted in a Winner-Takes-All (WTA) fashion, where the class with the most spikes was the predicted value.
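The accumulation and Winner-Takes-All step can be summarized by the following sketch, where the Loihi 2 execution is abstracted into a per-frame callable; the function and variable names are hypothetical.

```python
import numpy as np

def classify_event_stream(model_step, frames, num_classes=13):
    """Accumulate output spikes frame by frame, then classify by Winner-Takes-All.

    model_step: callable that takes one event frame and returns the output-layer
                spike vector for that time step (stand-in for the Loihi 2 run).
    frames:     iterable of per-timestep event frames from the RingBuffer.
    """
    total_spikes = np.zeros(num_classes)
    for frame in frames:
        total_spikes += model_step(frame)     # output spikes accumulate, as in OutputProcess
    return int(np.argmax(total_spikes))       # the class with the most spikes wins
```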
Pipeline for inference of the CSNN on the Loihi 2 processor. The blocks that are part of the main process are colored gray, while the green blocks are not part of the main process.
A full description of how the models were converted and mapped onto the Oheo Gulch neuromorphic hardware is shown in Tables 6 and 7, where each column indicates the percentage of total usage of the input axons, neuron groups, neurons, synapses, axonal mapping, and axon memory, respectively.
F. Results
1) Accuracy on 13-Class Prediction
The results for the 13 motion classifications from the trained models are shown in Fig. 7 and Table 8. As a benchmark dataset for evaluating our models, we used the DVS Gesture recognition dataset, comprising 11 hand gesture categories from 29 subjects under 3 illumination conditions [20]. We tested three different SNN configurations: 1) all layers composed of dense connections, 2) a convolutional spiking layer with 2 fully connected layers, with the pooling kernel size set to 8, and 3) the same model as in 2), except with the pooling kernel size set to 18. For all models, we selectively used only the first 2.0 seconds out of the 3.0 seconds of motion streams for each class to classify the actions. Additionally, the temporal resolution was set to 5 milliseconds to increase training efficiency. For each experiment, we recorded the best loss and accuracy of the SNNs for both the training and inference steps, as well as the confusion matrix for the corresponding inference results. The confusion matrix in Fig. 7 shows that the normal situation (label 3) is distinguished from the hazardous situations. Fig. 8 illustrates the classification results with output spikes captured on the output layer, classifying 13 driver motions.
Results (accuracy, loss, and confusion matrix) of the proposed CSNN with pooling kernel sizes 8 and 18 (PK=8 and PK=18) vs. the three fully connected layer SNN (3FC). The color of the confusion matrix approaches white as the number of predictions for the label increases. Compared to the 3FC model, the proposed CSNN achieves a faster convergence rate and higher overall accuracy.
Classification result with respect to the output spikes produced at each timestamp. The model demonstrates a sound classification process by generating the highest number of spikes for the ground-truth label of the event stream (Winner-Takes-All).
The proposed 4-layer CSNN model achieved the highest accuracy, reaching 94.04% on the test set. The inclusion of the pooling and convolution layers improved accuracy by almost 8% over the 3-layered model composed only of fully connected layers. Even though our model is simplified, with restrictions applied to resolve the out-of-memory problem caused by the high-resolution event data during training and testing, it was able to demonstrate comparable results. We note that our proposed dataset has more categories for classification and does not include repetitive actions, unlike the DVS Gesture recognition dataset.
2) Accuracy on Unexpected Abnormal Driver Motion Classification Task
The results for binary classification of unexpected abnormal motions versus normal motions are displayed in Fig. 9. Except for 4 normal motions that were classified as abnormal driving motions, the model was highly capable of distinguishing between the two categories, reaching an accuracy of 97.24%. One significant point is that no abnormal motions were misclassified as normal motions, which demonstrates the robustness of our proposed CSNN model in detecting such unexpected motions.
Classification result for unexpected abnormal driver motion. The model detects all the abnormal actions correctly, demonstrating high sensitivity in detecting these actions, and also classifies normal driver motions with high accuracy.
3) Latency and Energy
To measure the power consumption of the Nvidia RTX 3080, we used the NVIDIA Management Library (NVML), which allows us to extract the throughput and energy consumption for single-event inference. Due to the fast inference time of the GPU, we calculated latency by dividing the overall inference time by the total number of inference samples. Energy consumption was likewise measured by averaging the total energy over the total inference time.
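A minimal sketch of this NVML-based measurement using the pynvml bindings is given below; the inference loop, the `net` model handle, and the sample list are placeholders for the actual test harness.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)             # RTX 3080 assumed at index 0

energy_j = 0.0
start = time.time()
last_t = start
for events in inference_samples:                           # inference_samples: assumed test set
    _ = net(events.to('cuda'))                             # single-sample inference
    now = time.time()
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # NVML reports milliwatts
    energy_j += power_w * (now - last_t)                   # integrate power over elapsed time
    last_t = now

elapsed = time.time() - start
num_samples = len(inference_samples)
print(f'latency per sample: {elapsed / num_samples:.4f} s')
print(f'energy per sample:  {energy_j / num_samples:.4f} J')
pynvml.nvmlShutdown()
```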
The Nvidia Jetson Xavier NX does not support the NVIDIA Management Library, unlike the RTX 3080 GPU. To resolve this issue, we utilized jetson
In the case of Loihi 2, the Lava framework provides profiler tools that enable us to obtain various measurements related to power, execution time, counts of neurocore activities, and neurocore SRAM utilization rates. To get precise latency and energy dissipation results, we ran the model 1 million times, which provided enough time for the profiler to measure the overall costs. All latency and energy comparisons between the GPUs and Loihi 2 were conducted with equal models, both adjusted for Loihi 2. The results for throughput and various energy-related measurements for single-sample inference are shown in Table 9.
The overall energy consumption of Loihi 2 for 1 million inferences was 1.82 J, which compared very favorably to both GPUs, where the RTX 3080 required over 500 J and the Jetson Xavier NX required 1.13 J for a single inference on the CSNN model. In terms of throughput, however, Loihi 2 processed 40.835 samples per second, while the RTX 3080 processed 66.214 samples per second. We believe this is due to the inference method of Loihi 2, which uses only addition, rather than matrix multiplication, for the input spike data. The Jetson Xavier NX performed very poorly on throughput, with only 5.46 samples processed per second.
To compare the balance of low energy and fast runtimes across the devices, we calculated the energy-delay product (EDP), in which the energy consumption and the performance of the devices are weighted equally [42]. The EDP was calculated by multiplying the total energy by the latency for a single-sample inference. In the EDP comparison between Loihi 2 and the two GPUs, we observed that Loihi 2 had a better EDP than the GPUs for both models: the 3FCN model was 1.8 million times and the CSNN model 20,721 times more efficient than on the RTX 3080, and the 3FCN model was 16,278 times and the CSNN model 541 times more efficient than on the Jetson Xavier NX.
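In equation form, the metric compared here is simply the product of the per-sample inference energy and latency, with smaller values indicating a better energy-performance balance:
\begin{align*}
\mathrm{EDP} = E_{\text{inference}} \times t_{\text{latency}}
\end{align*}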
G. Future Work
While our proposed driver motion recognition system achieves significant improvements in energy efficiency and accuracy, there are several directions for future work to further enhance its performance and applicability. First, we plan to extend the N-DriverMotion dataset by increasing the diversity of driving scenarios, including more complex driving conditions, varied motion patterns, and a larger pool of participants. This will enable the model to generalize more effectively to real-world driving environments. Additionally, we plan to explore approaches for enhancing safety for pedestrians and vehicles in autonomous driving conditions. Another important area of exploration is the integration of online learning mechanisms within the neuromorphic system, allowing the model to adapt dynamically to changing driving conditions and new motion patterns in real time. To achieve this, we will explore unsupervised learning methods based on the spike-timing-dependent plasticity (STDP) learning rule. Moreover, we will evaluate the system's scalability and performance across different hardware platforms, including more neuromorphic processors and edge devices. Lastly, we aim to optimize the neuromorphic pipeline to address challenges such as latency reduction and increased robustness in highly dynamic and unpredictable driving scenarios, further enhancing the system's practicality in real-time autonomous driving applications.
Conclusion
In this paper, we propose N-DriverMotion, a newly collected event-based dataset designed to learn and predict driver motions using a neuromorphic vision system. Our system consists of an event-based camera that captures driver movements as spike inputs and an optimized convolutional spiking neural network (CSNN) capable of efficient inference on 720 × 720 event streams without the need for computationally expensive preprocessing. The dataset includes 13 driver motion categories, classified by direction, lighting conditions, and participants, including challenging environments like low-light conditions and tunnels.
Our experiments show that the system achieved a high accuracy of 94.04% in recognizing driver motions, even under difficult conditions such as varying light and directional inputs. This demonstrates that our proposed four-layer CSNN, designed for training and inference on energy- and resource-constrained platforms, can meet performance requirements. In addition, our experiments demonstrate the capability of distinguishing between unexpected abnormal driver motions and normal driver motions with an accuracy of 97.24%. Notably, our system classified all the abnormal driver motions correctly, indicating high reliability in sensing various abnormal driver motions when deployed in real applications. We also demonstrated that our system was comparable to other applications that directly use event vision systems with SNNs, such as the DVS Gesture recognition task, where our proposed system achieved similar accuracy with a less complex model structure. When deployed on Intel's Loihi 2 neuromorphic processor, the system demonstrated significant energy efficiency, with an energy-delay product (EDP) over 20,721 times more efficient than the RTX 3080 and 541 times more efficient than the Jetson Xavier NX. This level of efficiency is critical for real-time, on-device AI applications where power consumption is a key constraint.
By eliminating the need for time-consuming preprocessing and significantly reducing power consumption while maintaining high accuracy, our system can advance the safety and efficiency of autonomous driving systems using neuromorphic vision technology. Furthermore, the N-DriverMotion dataset provides a valuable resource for future research in driver behavior prediction, particularly in AI-driven systems requiring a high-resolution event dataset.