Introduction
In recent years, the deployment of computer vision applications using small edge devices has gained substantial momentum, driven by the increasing demand for real-time and efficient processing of visual data. For example, as the importance of autonomous crash detection and autonomous driving assistance increases in modern automobiles, there is a growing need for these vehicles to accurately process large volumes of data (i.e., real-time videos from cameras). This involves detecting objects (e.g., cars, pedestrians, traffic lights, roads, etc.) in real-time through computer vision algorithms [12], [13], [14], [15], [16]. In addition, modern robots are equipped with cameras and need to perform similar object detection to recognize objects and make decisions based on the detected objects [17], [18], [19].
To achieve real-time object detection, it is important to choose an appropriate model that carefully balances computational efficiency and detection performance. Compared with existing two-stage detectors such as the Faster Region Convolutional Neural Network (Faster R-CNN) [20] and the Spatial Pyramid Pooling network (SPPNet) [21], YOLO [22] and Single Shot Detectors (SSD) [23] perform detection with a single forward pass of the neural network. Instead of processing the entire input image at once, YOLO divides the image into multiple cells and conducts detection within each small region. This approach enables faster processing while maintaining reasonable accuracy, making YOLO well suited for real-time object detection.
In addition to adopting lightweight detection models, real-time object detection is often performed in IoT and edge computing environments, such as moving vehicles or robots, where computational resources are limited [24], [25], [26]. To overcome this limitation, cloud computing is widely adopted: the input image or video is uploaded, processed on large cloud nodes, and the result is downloaded to the IoT and edge devices. However, while such a process can be adopted during model training, which does not require a fast response time, it cannot be adopted during inference. This is because inference is a latency-sensitive task where a real-time response to the input data is crucial, and the communication overhead between cloud and edge is often intolerable. Thus, tightly integrating lightweight detection models with resource-constrained IoT and edge computing environments is both inevitable and important for real-time object detection.
However, deploying the models on resource-constrained edge devices presents unique challenges, as edge devices are constrained by limited hardware resources such as computation, memory, and storage and require more careful management of those resources. For example, the latest server-grade CPU from Intel has 32 cores [27], while the latest edge device from Nvidia has 6 cores [28] with a significantly lower base clock speed. In addition, edge devices are constrained by a limited power supply. This is because they often operate on restricted electricity sources such as batteries, rather than the wall power of server systems. Therefore, deploying object detection models in edge environments requires a careful understanding of the model and the limited hardware resources such as computation and power [29], [30].
Many previous studies, as shown in Table 1, have improved machine learning (ML) models and their performance on resource-constrained edge devices. Some studies [3], [5], [6], [8], [11] have focused on compressing ML models to maintain maximum performance on edge devices, dynamically adapting the model during ML task execution to efficiently perform inference within available memory resources. However, these methods require real-time monitoring of resources and continuous adjustment of ML layers and weights, introducing additional overhead. Other studies [1], [2], [4], [7], [9], [10] have aimed to distribute and optimize ML tasks in alignment with the resource availability of edge devices, maximizing inference runtime performance by fully leveraging computational units and optimizing memory utilization. Nevertheless, these approaches involve parallelizing the computation of ML layers for individual data instances, leading to substantial synchronization overhead. In contrast to these efforts, our research focuses on grouping incoming data based on the available resources of the edge device and distributing these data groups across accessible cores to enable independent, parallel processing using the preloaded original model.
In this paper, we propose a concurrent processing scheme for the real-time object detection model on IoT and edge devices to reduce runtime and memory consumption and improve power efficiency. To achieve this, we first analyze a representative real-time object detection model (e.g., YOLO) using a state-of-the-art edge device, the Nvidia Jetson Orin Nano. Our analysis shows that simply adopting YOLO on an edge device does not fully utilize the device, and a significant portion of computational resources remains idle as input data is processed serially. To overcome this, our scheme first divides the input, such as image sets and videos, into smaller datasets and groups the subsets according to a predetermined number of processes. Then, our scheme allocates a process to each dataset, allowing parallel execution of the object detection among multiple processes without synchronization. To verify the effectiveness of our scheme, we implement it on the SOTA YOLO object detection model and utilize state-of-the-art Nvidia Jetson edge devices and real-world datasets such as animal videos, car-traffic videos, MS-COCO [31], ImageNet [32], Pascal VOC [33], and DOTA [34]. Our evaluation shows that the proposed scheme can improve the runtime by up to 445%, reduce memory usage by up to 69%, and decrease power consumption by up to 73%.
The contributions of our paper are as follows:
We analyze the YOLO architecture on the edge device and demonstrate that the device is underutilized due to serial processing.
We propose a parallel processing scheme that groups frames with a predetermined number of processes and distributes frames using one core per frame.
We evaluate the proposed scheme using a state-of-the-art edge device with real-world datasets and demonstrate that our scheme can improve runtime by up to 445%.
Background and Motivation
A. YOLO (You Only Look Once)
1) YOLO Architecture
YOLO is a one-stage object detection model based on Convolutional Neural Networks (CNNs). Compared with two-stage detectors such as Faster R-CNN and SPPNet, it achieves relatively high accuracy and fast detection time. This is because YOLO performs both localization and classification tasks simultaneously, allowing for efficient processing of frames in real-time applications. Figure 1 shows the overall object detection procedure of YOLOv8, which is the latest YOLO model used throughout this paper. As shown in the figure, before performing the detection, the input video or image file (❶) and the YOLO model (❷) are loaded from storage to memory. Then, each frame from the video or image is fed into the model and processed through three steps (❸-❺).
First, during preprocessing, each frame undergoes resizing, padding to match network input dimensions, and normalization of pixel values for consistent and effective processing within the network (❸).
Second, the output tensor from preprocessing undergoes an inference stage, during which the extraction and refinement of features occur. Bounding box coordinates (locations) and class probabilities are derived from these features (❹).
Finally, after inference, the output tensor undergoes post-processing to retain the bounding box with the highest confidence score while eliminating overlaps, resulting in the final output of object detection (❺).
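To make the serial pipeline concrete, the following minimal sketch (our illustration, not code from the paper) reproduces the per-frame detection loop described above using the ultralytics and opencv-python packages; the file names are placeholders, and a single model(frame) call internally covers steps ❸-❺.

import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # (❷) load the YOLO model into memory
cap = cv2.VideoCapture("input.mp4")  # (❶) open the input video

while cap.isOpened():
    success, frame = cap.read()      # read one frame at a time
    if not success:
        break
    # model() performs (❸) preprocessing, (❹) inference, and
    # (❺) postprocessing (e.g., non-maximum suppression) on the frame
    results = model(frame)

cap.release()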
2) Performance of YOLO on an Edge Device
To analyze the performance of the YOLO model, we conducted a motivational evaluation. For evaluation purposes, we utilized a state-of-the-art edge device (Nvidia Jetson Orin Nano), since the lightweight design of YOLO makes it suitable for adoption in mobile platforms such as IoT and edge computing. The Jetson Orin Nano is a System-on-Chip (SoC) with a 6-core CPU (grouped as four and two ARM Cortex-A78 clusters), a GPU with eight Streaming Multiprocessors (SMs) containing 1024 CUDA cores and 32 Tensor cores, and AI accelerators for video, image, and audio processing [35]. During the YOLO inference task, the CPU is primarily used for processing.
Table 2 shows the time analysis result using YOLO on the SOTA edge device. For the dataset, we use polar_bear.mp4, which is the default video input file for YOLO. It is 2.5 MB in size and contains 191 frames with a single object (a polar bear) moving through the background (a glacier). As depicted in the table, the total processing time for the video is 58.72 seconds, with the majority of the time consumed by the inference (❹). The other components are comparatively small: frame_read (❶) is short owing to the compact video file size, and preprocessing (❸) is light owing to the low resolution of the input video, which makes pixel preprocessing less computationally intensive. In contrast, during inference, each frame is processed serially, meaning that the next frame can enter the model only after the prior frame has finished processing. Thus, these motivational evaluation results suggest that the runtime of the YOLO model is dominated by the inference time, and improving inference efficiency is critical to enhancing the overall performance of the YOLO model on edge devices.
B. Parallelism in ML
As inference is the most critical component of the overall runtime, converting serial model execution to parallel execution can improve the overall YOLO performance on the edge. To this end, Figure 2 shows two methods to enhance the parallelism of the model: data parallelism (Figure 2a) and model parallelism (Figure 2b).
1) Data Parallelism
As shown in Figure 2a, when handling large inputs, such as high-resolution images, the data can be partitioned into multiple segments and processed in parallel by replicating the model and feeding each replica a different segment. After parallel execution, the calculated values (gradients) from the divided segments must be synchronized so that the result is the same as if the data were never divided. If the input data is excessively or incorrectly partitioned, it can negatively impact the accuracy of the entire model. In addition, the synchronization overhead can vary with the shape of the partitioned segments, and poorly partitioned segments can lead to substantial communication overhead between them.
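For illustration, the sketch below (our own, with a toy model, tensor shapes, and port as assumptions) shows data parallelism with PyTorch's DistributedDataParallel: each worker holds a full model replica, consumes its own data segment, and gradients are all-reduced during the backward pass.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 2)          # each worker holds a full replica
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    data, target = torch.randn(32, 10), torch.randn(32, 2)  # this worker's segment
    loss = F.mse_loss(ddp_model(data), target)
    loss.backward()                         # gradients are all-reduced (synchronized) here
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)   # two replicas on two data segments

The all-reduce in the backward pass is exactly the synchronization overhead discussed above.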
2) Model Parallelism
As shown in Figure 2b, if the model size is large due to the computational complexity of the model, it can be divided into multiple sub-models. Then, the divided models can be processed in parallel. Once the computation of a single sub-model is completed, the results of the model are passed to the adjacent and disjoint sub-models for processing the next layer. Since the computational complexity of each sub-model can vary, leading to unbalanced runtime and idle time, it is important to understand the model and determine the optimal method for dividing the model into computationally equivalent sub-models.
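A minimal sketch of this idea is shown below (our illustration; the toy two-stage network and device placement are assumptions, and on a CPU-only edge device the stages could instead be pinned to different cores). Activations are handed from one sub-model to the next.

import torch
import torch.nn as nn

dev0 = "cpu"                                           # device for the first sub-model
dev1 = "cuda" if torch.cuda.is_available() else "cpu"  # device for the second

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(10, 10).to(dev0)  # first sub-model
        self.stage2 = nn.Linear(10, 2).to(dev1)   # second, disjoint sub-model
    def forward(self, x):
        h = self.stage1(x.to(dev0))     # compute the first half
        return self.stage2(h.to(dev1))  # hand activations to the next sub-model

out = SplitModel()(torch.randn(4, 10))  # a single input flows through both stages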
Thus, while reducing the inference overhead is key, it requires a deep understanding of the model, data, and hardware to reduce communication overhead and improve the overall performance of the model on edge devices.
Design
In this section, we present our proposed scheme to improve the performance of YOLO on edge devices by enhancing parallelism.
A. Overall Procedure
Figure 3 illustrates the overall procedure of the YOLO model on state-of-the-art edge devices, showcasing both the existing and proposed schemes. As shown in Figure 3a, first, the input file (i.e., video 1-N and Image Dataset 1-N) is read from the storage device and loaded into the DRAM inside the edge device. Then, the YOLO model is loaded from storage into DRAM. Finally, each frame/image is processed by a single CPU core, and this process is repeated until all the input is processed. While a frame/image is being processed inside a single core (i.e., Core 1), other cores (i.e., Core 2-N) remain idle and simply wait for the input to be processed by Core 1. Additionally, after the execution of F1 in Core 1, the next frame (i.e., F2) is processed in Core N, causing unnecessary data movement from Core 1 to Core N. This not only degrades the performance of the model but also increases the power consumption of the device, as idle CPUs still consume power even if they are not performing any computations.
To overcome this inefficient core utilization in the existing scheme, we propose a parallel model execution scheme. Figure 3b shows the overall procedure of the proposed scheme. As shown in the figure, the procedure of reading a video/image set (❷-❸) is similar to the original scheme. Then, the video or image dataset is divided into frames or images and grouped according to a predetermined number of cores. In addition, our scheme loads the YOLO model onto all available cores to perform inference on all available cores (i.e., Core 1-N) (❹). This is to exploit data parallelism, as YOLO is a lightweight model that does not require model parallelism, which is suitable for large models. As seen in Table 2, the increased model load time does not significantly impact the overall object detection time.
With models loaded on all cores, the previously grouped data (i.e., F1-N) are allocated to each core for parallel object detection. Since typical videos consist of multiple frames, frames can be processed simultaneously using multiple cores. From the perspective of the cores, all cores simultaneously process different frames without idle time, increasing the overall efficiency of the edge device. During this process, each core executes its object detection task asynchronously, and if one core finishes processing its data before the others, it can start working on data from the next group. Since each frame is allocated to a different core and processed individually with data parallelism, there is no need for communication between the frames, eliminating synchronization overhead.
B. Parallel Processing Algorithm
Algorithm 1 presents the detailed procedure of the proposed scheme, which allocates multiple frames to multiple cores for parallel execution of the model to expedite model processing on edge devices. First, our scheme loads the YOLO model and the input dataset, such as a video file or an image directory, from the storage device into memory (Lines 1-4). Then, it sets multiple configuration parameters for parallel execution. To do this, the number of frames in the video or the number of images in the image dataset directory is first read. Then, the number of processes to be utilized is set by fetching the number of CPUs from the device or using a predefined number set by the user. Finally, the number of groups, which denotes the number of frames/images that need to be executed by each process, is calculated (Lines 7-8). Rather than enforcing strict ordering between processes, our scheme simply ensures that each process handles an identical number of inputs, ensuring load balancing between the processes. Through this, our scheme can determine the end condition for each CPU, eliminating the need for inter-process communication.
Algorithm 1 Procedure of the Proposed Scheme
import cv2
import torch.multiprocessing as multiprocessing
from ultralytics import YOLO

inputPath = "video.mp4"  # path to the dataset (i.e., video.mp4, image.jpg)
cap = cv2.VideoCapture(inputPath)
numFrame = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
numProcess = multiprocessing.cpu_count()  # or a user-defined process count
numGroup = numFrame // numProcess  # frames assigned to each process

def ObjectDetection(rank):
    loopCount = 0
    while cap.isOpened():  # ensure the input stream is still open
        model = YOLO("YOLOv8n.pt")  # load the model (i.e., YOLOv8n.pt) in this process
        while loopCount < numGroup:  # stop after this process's share of frames
            success, frame = cap.read()  # fetch the next available frame
            if success:
                model(frame)  # run object detection on the frame
                loopCount += 1
            else:
                cap.release()
                break
        break  # this process has finished its group

def Multiprocessing():
    pool = multiprocessing.Pool(numProcess)  # spawn one worker per process
    pool.map(ObjectDetection, range(numProcess))
    pool.close()
    pool.join()
With the model, input, and process group, each process performs the object detection. To do this, the process checks whether the input is still open and ensures that the state of the input has not been changed by another process (Line 13). This is necessary because another process might have finished processing all the inputs or accidentally closed the input stream while the current process handles the previous frame/image. Then, our scheme selects the model (i.e., YOLOv8, YOLOv5) and checks whether the current process has executed all of its previously allocated frames (Lines 14-15). If not, and there are still more frames to be processed, our scheme reads the next available frame from the input stream. This read is a safe operation that prevents multiple processes from accessing an identical input under concurrent access. Additionally, by fetching the next available frame, it eliminates the communication overhead of strict ordering, where a process waits for another process to complete the previous frame. After fetching the input, each process processes the fetched input simultaneously but independently of other processes, increasing the parallelism of the scheme (Lines 17-22).
In terms of process management, our scheme follows the standard process pool approach, which spawns workers according to the configured number of processes. Each process executes the object detection function, and the main process waits for all processes to finish their execution and join the main process (Lines 25-29). As shown throughout the algorithm, our scheme requires minimal code modification, and we modified fewer than 100 lines of code. This allows for easy adoption of the proposed scheme in various models and edge devices.
Evaluation
In this section, we evaluate the original and proposed schemes from the perspectives of runtime, CPU utilization, memory usage, and power consumption as the number of cores increases.
Our evaluation of the proposed scheme focuses on answering the following questions:
How do runtime, memory usage, and power consumption change as the number of processes increases in a video benchmark (IV-B1 and IV-B2)?
Can our proposed scheme improve performance on video benchmarks (IV-C)?
How do image benchmarks differ in performance compared to video benchmarks (IV-D)?
How does the ratio of runtime and memory space change as the number of processes increases (IV-E)?
Can our proposed scheme demonstrate performance improvements compared to the existing state-of-the-art optimized YOLO model (IV-F)?
A. Experimental Setup
For evaluation, we utilize a SOTA edge device, the Jetson Orin Nano, as mentioned in Section II. During object detection, only the CPU is utilized to perform the inference task, and we conducted the evaluation with an increasing number of CPU processes. The original scheme refers to running with only one process, while the proposed scheme uses from 2 to 12 processes, utilizing all six available cores. For storage, we utilized a Samsung EVO Plus MicroSD 128 GB for the OS and a Samsung SSD PM9A1 NVMe 256 GB for storing the datasets.
In terms of software, the Jetson Orin Nano operates on Ubuntu 20.04.5 LTS with the 5.10.104-tegra Linux kernel. Among the various versions of YOLO models, we use YOLOv8 (YOLOv8n.pt) and YOLOv5. For the dataset, we used ‘Polar-bear.mp4’ (191 frames) as the reference input file. In addition, for large realistic video datasets, we utilized ‘Elephant.mp4’ (1064 frames), ‘Car-traffic(1).mp4’ (600 frames), and ‘Car-traffic(2).mp4’ (693 frames) from Kaggle. For image datasets, we employed standard large-scale image datasets such as MS-COCO, Pascal VOC, ImageNet, and DOTA. In terms of performance metrics, we present overall runtime, CPU utilization, power consumption, and memory usage.
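As an illustration of how CPU utilization and memory usage could be sampled alongside a run, the hedged sketch below polls them with psutil; the sampling interval and function name are our assumptions, and power on the Jetson is read separately with on-device tooling.

import time
import psutil

def monitor(duration_s, interval_s=1.0):
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        cpu = psutil.cpu_percent(interval=interval_s)     # average % over the interval
        mem = psutil.virtual_memory().used / (1024 ** 2)  # memory in use, in MB
        samples.append((cpu, mem))
    return samples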
B. Reference Video With an Increasing Number of Processes
1) Runtime
For a reference video, we use the ‘polar-bear.mp4’ video file, which is 2.5 MB and consists of 191 frames. Figure 4 shows the runtime of YOLO models using a varying number of processes, from 1, which is the existing scheme, to 12, which is twice the number of CPU cores, using the proposed scheme. As shown in the figure, the runtime is the slowest with the existing scheme (i.e., P1), and it reaches its fastest point with 6 processes, improving the runtime by $4.45\times$ compared to a single process.
2) Resource Utilization
Figure 5 shows the resource utilization trends with YOLOv8 and an identical video. In the case of CPU, as shown in Figure 5a, the CPU utilization increases as the number of processes increases, suggesting efficient model processing even with an increased number of processes. While the CPU utilization is 21.39% with 1 process, it rises to 97.21% with 6 processes, indicating full CPU utilization with 6 processes. In the case of memory, as shown in Figure 5b, 1639 MB of memory is used for a single process, and 2650 MB for six processes, showing a similar pattern to CPU utilization. While the average memory usage is higher as the number of processes increases, using 6 processes reduces memory usage by 509.88 MB (63%) per frame. This indicates that, although average memory usage during runtime grows, per-frame memory efficiency improves. Additionally, unlike CPU and power, memory usage significantly increases when using 12 processes compared to 6 processes. This is due to the additional memory (around 220 MB per process) required for the process contexts. These results show that there is an optimal level of concurrency due to the limited resources available on the edge device and highlight the need for careful configuration to maximize performance.
Resource utilization using a reference video (Polar-bear) with an increasing number of processes (P denotes the number of processes).
In the case of power consumption, as shown in Figure 5c, the power consumption increases as the number of processes increases, similar to CPU and memory. To further analyze the power consumption and efficiency as the number of processes increases, Table 3 shows the total power consumption, average power consumption, and power consumption per frame. Similar to memory consumption, the average power consumption during the runtime increases as the number of processes increases. This is because more frames are processed simultaneously when a higher number of processes is used. Although this results in a higher average power draw, potentially increasing heat generation, the reduction in runtime ultimately decreases both the total power consumed and the power consumption per frame. Using six processes saves 110,141 mW in total power and 576.66 mW (70.22%) in power per frame compared to using a single process. Thus, these resource consumption results show that while increasing the number of processes raises the instantaneous resource consumption compared to the existing scheme, parallel execution of the model is far more efficient from the perspective of runtime, CPU, memory, and power utilization.
C. Video
To evaluate the proposed scheme with realistic videos, we perform an identical evaluation using the ‘Elephant’, ‘Car-traffic(1)’, and ‘Car-traffic(2)’ videos. Figure 6 shows the runtime, CPU utilization, memory utilization, and power consumption of the existing and proposed schemes. As shown in Figure 6a, the runtime of all videos is significantly reduced, by up to $2.98\times$.
Runtime and resource utilization using video datasets (Elephant, Car-traffic videos) with the original and proposed schemes.
In the case of resource utilization, while the proposed scheme utilizes more resources such as CPU (Figure 6b), memory (Figure 6c), and power (Figure 6d) during the runtime, the difference is minimal. In addition, since the runtime of the proposed scheme is lower, the overall resource consumption and consumption per frame are reduced using the proposed scheme. Specifically, in the case of Car-traffic(2), which has high resolution and memory consumption, the proposed scheme reduces memory by only 293 MB (30%) per frame. In contrast, in the case of Elephant, which has a low resolution but a high number of frames, the proposed scheme reduces the memory consumption per frame by 585 MB (69%). Furthermore, in the case of power consumption, while the average power consumption of the proposed scheme during the runtime is 336 mW higher, the power per frame is reduced by 308 mW. Specifically, Car-traffic(2) shows the most significant power saving, reducing power consumption per frame by 323 mW (55%). These results show that the proposed scheme can reduce the runtime of complex videos, and while the average resource consumption during runtime increases, the proposed scheme improves the efficiency per frame.
D. Image Dataset
To evaluate the proposed scheme with images, we perform the evaluation using the MS-COCO, ImageNet, and Pascal VOC datasets. These are widely used image datasets with various characteristics and are used to evaluate model performance with images. Figure 7 shows the runtime, CPU utilization, memory utilization, and power consumption of the existing and proposed schemes with the image datasets. As shown in Figure 7a, similar to the results using the video datasets, the proposed scheme reduces the overall runtime compared with the existing scheme. In particular, the proposed scheme with MS-COCO improves the runtime the most, achieving up to a $4.45\times$ improvement.
Resource utilization using image datasets (MS-COCO, ImageNet, and Pascal VOC) with the original and proposed schemes.
In the case of resource utilization, the trends of all resources are similar to those with the video datasets. For CPU, as shown in Figure 7b, the proposed scheme fully utilizes the CPU throughout the execution, while that of the existing scheme is 87 to 90%. The CPU utilization with images is 28% lower than that with videos. This indicates that the original scheme does not efficiently utilize all CPU resources, especially when processing large amounts of non-contiguous, independent data, such as image sets. For memory, as shown in Figure 7c, compared to the video datasets, the difference in average memory usage during runtime between the proposed and original schemes becomes more pronounced on image datasets. However, owing to the significant reduction in runtime, the total memory usage per frame of the proposed scheme is reduced by up to 173 MB (55%) in the case of MS-COCO.
In terms of power consumption, as shown in Figure 7d, the proposed scheme significantly improves power efficiency compared with the existing scheme. This is because the CPU utilization of the existing scheme is considerably lower than with videos, making the edge device less efficient and increasing the overall runtime. Specifically, in the case of MS-COCO, the proposed scheme reduces the power consumption per frame by 1173 mW (73%) compared with the existing scheme. Considering that the original scheme exhibits an average power consumption of approximately 1600 mW per frame, the reduction achieved by the proposed scheme is noteworthy. Thus, these results suggest that the proposed scheme can achieve even better runtime and resource utilization efficiency when processing image sets compared to videos.
E. Advanced Analysis of Runtime and Memory Space
To analyze the runtime performance of the YOLO model, we present the proportion of each subcomponent relative to the total runtime as the number of processes increases from 1 to 12, as shown in Figure 8a. We use the ‘Elephant.mp4’ video as the dataset and ‘yolov8n.pt’ as the model. In the case of a single process (i.e., the original scheme), we only present the components for the inference, frame_read, model_load, preprocess, and postprocess time. As shown in the figure, inference time accounts for 311.02 seconds (96.75%), representing the largest share of the total runtime. This is because the inference process is the most compute-intensive part of the pipeline. While the percentage remains relatively consistent as the number of processes increases, the inference time proportion is lowest in the original scheme. For the other components, model_load, frame_read, preprocess, and postprocess contribute 0.12%, 1.02%, 1.31%, and 0.8% of the total runtime, respectively. In the case of 6 processes (i.e., the proposed scheme), inference time is reduced to about one-third of the original, comprising 110.66 seconds (97.32%) of the total runtime. The model_load, frame_read, preprocess, and postprocess components account for 0.87%, 0.5%, 0.76%, and 0.63% of the total runtime, respectively.
Runtime and memory space ratio of YOLOv8n using the elephant video with an increasing number of processes.
Comparing the original and proposed schemes, we observe an increase in the proportion of inference and model_load, while the proportions of frame_read, preprocess, and postprocess decrease. Specifically, preprocess and postprocess benefit from parallelization: while the original scheme processes frames sequentially, the proposed scheme processes frames concurrently, reducing overall runtime and resulting in a lower percentage of preprocess and postprocess times. Similarly, the frame_read time in the original scheme is attributed to reading one frame at a time, whereas the proposed scheme groups frames based on the number of cores, decreasing the time allocated to frame reading. On the other hand, model_load time increases slightly in the proposed scheme since more models are loaded as the number of cores increases. However, the additional model_load time in the proposed scheme (about 0.2 seconds) has a negligible impact on total runtime. Because our input data is not exceptionally large, these time proportions remain nearly consistent, but with larger data sizes, the proportion of frame_read is expected to increase, further amplifying the difference in inference time between the proposed and original schemes.
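For reference, per-stage timings of this kind can be accumulated from the speed dictionary (milliseconds per image for preprocess, inference, and postprocess) that ultralytics attaches to each result; the accumulation loop and file names below are our own illustration.

from collections import defaultdict
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
cap = cv2.VideoCapture("Elephant.mp4")
totals = defaultdict(float)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break
    results = model(frame)
    for stage, ms in results[0].speed.items():
        totals[stage] += ms  # accumulate preprocess/inference/postprocess time

cap.release()
print({k: round(v / 1000, 2) for k, v in totals.items()})  # seconds per stage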
In terms of the memory allocation patterns of the running YOLO model, we present the proportion of each subcomponent relative to the total main memory as the number of processes increases from 1 to 12, as shown in Figure 8b. We used the same dataset and model as in the above experiment. In the case of a single process, we only present the components for the initial, model, data, and ETC memory. As shown in the figure, free memory accounts for 4615 MB (71.22%), which occupies the majority of the total memory space (6480 MB). Initial memory (e.g., the OS kernel) occupies 1475 MB (22.76%) and remains nearly constant across different process numbers. Model memory occupies 374 MB (5.77%) and increases as the number of processes grows. Data memory occupies only 16 MB (0.25%) due to the relatively small size of the video dataset. In the case of 6 processes, model memory increases to 1149 MB, occupying 17.73% of the total memory. This increase is due to the six models loaded in parallel, resulting in a greater number of intermediate values, such as feature maps and tensors, generated during the inference process. With 12 processes, model memory further expands to 2194 MB (33.86%), attributed, as mentioned above, to the additional models and the extra process contexts introduced by running more processes than physical cores. Consequently, as the number of processes increases, model memory consumes a larger portion of the total memory space. This highlights the need for efficient memory management when handling large models or input data on memory-constrained edge devices. Further details on this topic are discussed in Section VI-B.
F. Comparison With SOTA
Figure 9 shows the FPS and runtime inference performance of YOLO-Parallel [7] and the proposed scheme. To provide a fair comparison, we executed the same model on each respective device using the DOTA dataset, as the two schemes employ distinct hardware environments. YOLO-Parallel introduces parallel block sections in the training process, enabling parallel extraction of multi-scale features in the YOLO model. This design allows the trained YOLO-Parallel model to enhance inference performance while maintaining accuracy. In contrast, the proposed scheme focuses on computational optimization of the edge device itself, without modifying the original YOLO model. Consequently, we evaluate the performance of each optimized model and edge device configuration based on the same YOLO model.
As shown in Figure 9a, YOLO-Parallel achieves an FPS of 116.28, while the proposed scheme achieves $2.10\times$ better inference performance using the same model.
To summarize the results of our evaluation:
When using 6 processes on a video benchmark, the runtime showed a $4.45\times$ speedup compared to using a single process. Although the average memory usage and power consumption increased, they decreased by 63% and 70.22% per frame, respectively (IV-B).
The proposed scheme improved runtime, memory usage, and power consumption on other real-world video datasets by up to $2.98\times$, 69%, and 55%, respectively (IV-C).
For a large-scale image dataset, the proposed scheme demonstrated even greater performance improvements than on video datasets, achieving up to $4.45\times$ in runtime, 73% in power consumption, and 55% in memory usage (IV-D).
In terms of runtime, inference time occupied the majority of the total time, and its proportion was lowest in the original scheme. The proportions of frame_read, preprocess, and postprocess were reduced in the proposed scheme through frame grouping and parallel inference. For memory space, the model memory proportion increased as the number of processes increased (IV-E).
Compared to the SOTA optimized YOLO model, the proposed scheme achieved $2.10\times$ better performance using the same model (IV-F).
Related Work
A. Improving Performance of ML Models
Several studies have focused on optimizing ML for resource-constrained edge devices. Previous studies [3], [11], [36], [37], [38], [39] have aimed at model compression and computational efficiency to retain performance on edge devices while reducing resource demands. In addition, other studies [5], [6], [8], [40], [41], [42] have focused on real-time resource monitoring in edge environments, dynamically adapting models during execution to enhance efficiency.
Kwon et al. [3] proposed YONO, which utilizes product quantization to compress multiple heterogeneous neural networks, enabling edge devices to perform in-memory execution and model switching, thereby enhancing memory efficiency and allowing simultaneous handling of multiple tasks while maintaining accuracy. Gao et al. [7] proposed YOLO-Parallel, which optimizes the YOLO model by employing parallel blocks during training to extract multi-scale features and using a positive gradient loss function to enhance the positive gradients of tail classes, thereby improving accuracy for tail categories. Wen et al. [5] proposed AdaptiveNet, a post-deployment adaptation framework that dynamically adjusts ML architectures for edge environments based on real-time resource monitoring, utilizing pretraining-assisted model elastification and an edge-friendly architecture search method to generate optimal models and maintain runtime performance.
Our proposed scheme is consistent with these studies in terms of optimizing ML for edge devices. In contrast, we focus on grouping input data for data parallelism to optimize resource usage on edge devices and on loading the original ML model onto each available core to process the grouped data without modification.
B. Improving Performance of Edge Devices
There have been many approaches to improving the performance of ML on edge devices. In terms of edge device optimization, previous studies [1], [2], [3], [4], [43], [44], [45] focused on maximizing the utilization of edge device resources to optimize inference performance. Additionally, other studies [9], [10], [46], [47], [48], [49], [50] focused on distributing data and ML models efficiently across multiple edge devices to improve performance.
Wang et al. [2] proposed L-PIC, which enables efficient concurrent inference of multiple DNN models on CPU-only edge devices by employing block-based model partitioning for multi-pipeline execution and reduces inference latency by distributing CPU load through dynamic resource allocation. Zhao et al. [4] employ fine-grained resource allocation, layer-wise task parallelization, and optimized memory access patterns to streamline neural computations under strict latency constraints, thereby maximizing throughput and predictability in real-time applications such as autonomous systems and robotics. Li et al. [49] proposed a two-stage execution for multi-DNN inference on edge devices, where the GPU handles the highly parallel and computationally intensive tasks of DNN processing while the CPU manages the less compute-intensive tasks. It dynamically controls GPU contention through DNN partitioning and tail offloading to the CPU, optimizing resource utilization, reducing latency, and ensuring real-time performance.
Our paper aligns with these studies in terms of utilizing edge device resources to improve inference performance. However, unlike previous studies that partition and distribute the model through model parallelism, we focused on grouping and distributing data according to the available cores on the edge device. By deploying the model on each core and executing inference tasks independently in parallel, our approach effectively enhances performance while minimizing overhead from synchronization and data communication.
Limitation and Future Work
Our proposed scheme loads the model on each core and processes input data in parallel across these cores, improving runtime, memory usage, and power consumption. However, while the per-frame memory usage and power consumption improved, the average measurements during runtime were higher than those of the original scheme. Handling large models and high-volume input data may therefore lead to OOM (Out of Memory) issues. To address this, we propose two future directions involving the distribution of both the model and the workload.
A. Core-Level Parallelism
In the proposed scheme, a single model is loaded onto each core, and a single input item is processed per core. However, as the size of the model and input data increases, memory-constrained edge devices may encounter OOM issues. Our future plan is to implement model and data parallelism at the core level. For large-scale models, the model will be partitioned across cores, enabling parallel processing of input data. For large input data, the data will be split to distribute memory usage efficiently across cores. This approach allows each core to handle memory-intensive data, effectively mitigating the limitations of edge devices.
B. Storage Offloading for Edge Devices
On memory-constrained edge devices, handling high-capacity models (e.g., Large Language Models (LLMs)) or large input data (e.g., high-resolution images, video, or large-scale text data) presents significant challenges. Our plan is to offload data from memory to storage, using memory as a cache for essential data. However, such offloading may introduce significant delays due to increased access time. To address this, we will implement a prefetching strategy that analyzes data access patterns during ML execution. This approach includes a novel caching policy that replaces conventional LRU or LFU strategies, dynamically adjusting memory allocation to prioritize essential data, thereby keeping essential data available in memory. This enables edge devices to efficiently handle high-capacity data and improve processing efficiency.
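For reference, the sketch below shows a minimal LRU cache of the kind that the proposed policy would replace; the capacity and key/value semantics are illustrative.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
    def get(self, key):
        if key not in self.store:
            return None                     # miss: fetch from storage instead
        self.store.move_to_end(key)         # mark as most recently used
        return self.store[key]
    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry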
Conclusion
In this paper, we propose parallel processing of the YOLO object detection model on resource-constrained edge devices to improve performance and resource utilization. Based on our analysis, we found that most of the runtime is consumed by the inference task of the model. To improve this, our proposed scheme (1) divides the video frames or images from the image dataset into groups according to a predetermined number of cores, and (2) allocates the grouped frames or images to each core individually and executes object detection asynchronously on each core. Our evaluation using a SOTA edge device and real-world video/image datasets shows that runtime can be improved by up to $4.45\times$, memory usage reduced by up to 69%, and power consumption reduced by up to 73%.