Introduction
Vehicle recognition in aerial images is crucial for both military and civilian applications, such as military target strikes and traffic control. Researchers have proposed various techniques for object recognition in daytime aerial photos with sufficient lighting, producing remarkable results [1]. However, vehicle detection in low-light conditions remains a challenging and significant issue for surveillance camera applications. In low-illumination conditions, less information is available and it is difficult to extract enough useful features: at night there is background light interference, objects are underexposed, and brightness and contrast are poor, which results in low image quality [2].
Generic object detection methods have poor accuracy and a limited ability to extract the intended objects in such scenes. In low brightness, capturing every detail of a scene is impossible, and much detailed information, such as colour and texture, is lost. This is especially true in distant-view or aerial images, where the objects are frequently far away and small and have low contrast against the background [3]. One solution to this problem is to use specialized and improved hardware, which can be very costly. Researchers have therefore focused on algorithmic solutions.
Recent studies on low-light imagery concentrate on image enhancement to improve basic visual properties in the pre-processing steps [4], [5], [6]. These methods include global and local enhancement techniques. Global enhancement techniques, when applied to night-time images, may over-expose already bright parts of the image, while local contrast enhancement methods focus on image details but increase noise when the contrast gain is high [4]. Deep learning models can produce better contrast enhancement results. Applying a simple histogram equalization method or gamma correction increases the contrast of road and vehicle headlight regions, which decreases the image quality for vehicle detection. Therefore, we applied the low-light enhancement model MIRNet to nighttime traffic sequences, as it produces robust results and prevents over-exposure of car and road lights. Our proposed model consists of the following steps: all the extracted nighttime image sequences are first subjected to defogging and then fed into the MIRNet model. The enhanced images are then passed to the YOLOv5 object detection algorithm to locate vehicles in each image frame. For each detection, SIFT features and templates are extracted, based on which an ID is assigned to each vehicle. The extracted templates are used to find possible matches in succeeding image frames, which are filtered by feature matching to obtain the best possible match. Finally, the trajectories of all tracked vehicles are drawn by plotting their centroid points.
The main contribution of our work is as follows:
An efficient and computationally lightweight vehicle detection and tracking algorithm for night-time aerial image sequences is established.
We used the deep learning model YOLOv5 to detect objects in night-time aerial images containing dense scenes to enhance the detection rate.
A simple and efficient multi-vehicle tracking approach is proposed that uses SIFT features, which are robust to noise, illumination changes, and viewing angle, for identifier allocation, together with a template-matching model.
The remainder of the paper is structured as follows: Section II presents related work. Our proposed architecture is thoroughly explained in Section III. Benchmark datasets and the experimental findings are described in Section IV. Section V presents the conclusion and suggested next steps.
Literature Review
Numerous researchers have focused on object detection in low-light conditions. In most cases, the images are first subjected to pre-processing to enhance brightness levels. Also, traffic monitoring work has been done, including vehicle detection and tracking as the core steps [7], [8], [9], [10]. Therefore, this section is divided into two categories: object detection in low-light conditions and vehicle detection and tracking methodologies.
A. Object Detection in Low-Light Conditions
One of the most extensively used machine learning approaches for image enhancement is histogram equalization, which is easy to implement and consumes little computational power [11]. However, due to excessive gray-level merging, gray levels are easily lost in the enhanced image.
B. Vehicle Detection and Tracking
Several studies focus on vehicle detection and tracking methods [16]. In [18], an effective Gaussian Mixture Model (GMM) based image segmentation technique is applied that can identify the frontal views of different automobiles. The Canny edge detector and Hough transform are used to spot lanes and determine the vehicles’ driving area. To further increase the efficacy of the proposed technique, a Support Vector Machine (SVM) classifier is trained on Histogram of Oriented Gradients (HOG), colour, and Haar features of automobiles. In addition, an improved You Only Look Once version 3 (YOLOv3) algorithm is developed to detect vehicles: the dataset is first clustered using a cluster analysis approach, and the network topology is then optimized to increase the number of final output grids and improve the relatively weak vehicle prediction capacity. In another study [20], an intelligent transport system has been proposed that uses a Kalman filter and a YOLO detector; the model also generates track IDs and uses the Hungarian algorithm to associate them across frames. Similarly, [21] builds a Simple Online and Realtime Tracking (SORT) technique based on the Kalman filter and the Hungarian matching algorithm, using the Faster Region-based Convolutional Neural Network (Faster R-CNN) algorithm as the target detector to track multiple targets concurrently. One drawback of the SORT algorithm is that it does not incorporate appearance features. Another study [22] presents a vehicle recognition and monitoring method using the Gaussian Mixture Model (GMM) and a blob extraction method. First, the background was estimated and subtracted from frames to extract foreground objects; morphological operations were then employed for further noise removal, and tracking was enhanced using the GMM algorithm. Ait Abdelali et al. [23] developed a vision-based traffic monitoring model in which vehicles are detected with the deep learning-based YOLO detector and tracked with a particle filter. Mou et al. [24] proposed a detection method based on segmenting the aerial image into similar regions using a Convolutional Neural Network (CNN); a trained SVM classifier was then used to track and classify vehicles. Training two different classifiers increases the complexity and computational cost of the model, which limits its applicability to large datasets.
In this paper, we aim to introduce a lightweight vehicle detection and tracking approach which requires limited training. Also, we have combined deep learning and machine learning techniques to increase the model efficiency.
The Proposed Framework
Fig. 1 depicts the general architecture of the proposed model. The model mainly consists of five modules: (i) pre-processing steps to enhance the brightness level of nighttime images; (ii) vehicle detection using the deep learning model YOLOv5; (iii) SIFT feature extraction of each detected vehicle and identifier assignment; (iv) vehicle tracking using the template matching algorithm; and (v) drawing trajectories of each tracked vehicle. Each module of the framework is explained in detail in the following subsections.
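To make the data flow concrete, the following high-level Python-style sketch shows how the five modules could be chained; every helper name (defog, enhance_mirnet, detect_yolov5, assign_ids, track_by_templates) is a placeholder for the corresponding module detailed in the subsections below.

```python
# High-level sketch of the proposed pipeline (helper functions are placeholders
# for the modules described in Sections III-A to III-E).
def process_sequence(frames):
    tracks = {}          # id -> list of centroids (trajectory)
    templates = {}       # id -> appearance template
    features = {}        # id -> SIFT descriptors
    for burst_start in range(0, len(frames), 5):
        burst = frames[burst_start:burst_start + 5]
        # (i) pre-processing: defogging followed by MIRNet low-light enhancement
        enhanced = [enhance_mirnet(defog(f)) for f in burst]
        # (ii) detect vehicles on the first frame of each burst with YOLOv5
        detections = detect_yolov5(enhanced[0])
        # (iii) assign identifiers via SIFT feature matching
        assign_ids(detections, enhanced[0], features, templates, tracks)
        # (iv) track on the remaining frames by template matching
        for frame in enhanced[1:]:
            track_by_templates(frame, templates, features, tracks)
    # (v) trajectories are plotted from the recorded centroids
    return tracks
```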
A. Image Pre-Processing
1) Defogging
The input images extracted from nighttime traffic videos at a rate of 8 FPS were first resized to a fixed resolution.
To denoise the images, we applied the defogging method of [25], [26], which estimates the amount of haze (noise) in each image pixel and then removes it according to the scattering model:\begin{equation*} I\left ({{ x }}\right)=U\left ({{ x }}\right)t\left ({{ x }}\right)+K\left ({{1-t\left ({{ x }}\right)}}\right) \tag {1}\end{equation*} where $I(x)$ is the observed foggy image, $U(x)$ is the underlying clean image to be recovered, $t(x)$ is the transmission map, and $K$ is the ambient (atmospheric) light.
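As an illustration, the model in (1) can be inverted once $t(x)$ and $K$ have been estimated. The sketch below uses a dark-channel style estimate for both quantities; the exact estimators of [25], [26] may differ, so this is a minimal sketch rather than the implementation used in this work.

```python
import cv2
import numpy as np

def defog(img_bgr, patch=15, omega=0.95, t0=0.1):
    """Recover U(x) from the observed frame I(x) by inverting Eq. (1).

    A dark-channel style estimate is used here for the ambient light K and
    the transmission t(x); the estimators of [25], [26] may differ in detail.
    """
    I = img_bgr.astype(np.float32) / 255.0
    kernel = np.ones((patch, patch), np.uint8)
    # Dark channel: minimum over colour channels, then over a local patch.
    dark = cv2.erode(I.min(axis=2), kernel)
    # Ambient light K: mean colour of the brightest 0.1% dark-channel pixels.
    idx = np.argsort(dark.ravel())[-max(1, dark.size // 1000):]
    K = I.reshape(-1, 3)[idx].mean(axis=0)
    # Transmission map t(x), clipped away from zero to avoid amplifying noise.
    t = 1.0 - omega * cv2.erode((I / K).min(axis=2), kernel)
    t = np.clip(t, t0, 1.0)
    # Invert Eq. (1): U(x) = (I(x) - K) / t(x) + K.
    U = (I - K) / t[..., None] + K
    return (np.clip(U, 0.0, 1.0) * 255).astype(np.uint8)
```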
Defogging process over UAVDT and VisDrone datasets (a) original images and (b) defogged images.
2) Low-Light Enhancement Using MIRNet
After denoising the images, the next step is to enhance their brightness level so that objects can be located easily. For this purpose, we used the pre-trained model MIRNet, and all images are passed to this contrast enhancement module. MIRNet is a pre-trained fully convolutional deep learning architecture that retains spatially precise high-resolution representations through the whole network while receiving rich contextual information from the low-resolution representations [28]. The model consists of a feature extraction module that maintains the high-resolution original features to preserve fine spatial details while computing a complementary collection of features at various spatial scales [29]. In addition, the features from the multi-resolution branches are gradually integrated for better representation learning using a recurring information exchange mechanism [30]. It uses a technique for fusing features from different scales that correctly maintains the original information of the features at each spatial level while dynamically combining varying receptive fields. To simplify the learning process, the recursive residual design gradually decomposes the input signal, enabling deep networks to be built [31].
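As a concrete illustration, the enhancement step can be wrapped as a single per-frame function. The weight file name and the Keras-style loading below are assumptions made for this sketch; any MIRNet implementation exposing a per-image forward pass would fit the same slot in the pipeline.

```python
import numpy as np
import tensorflow as tf

# Hypothetical file holding the pre-trained MIRNet weights used for enhancement.
mirnet = tf.keras.models.load_model("mirnet_lowlight.h5", compile=False)

def enhance_mirnet(img_bgr):
    """Brighten one defogged BGR frame with the pre-trained MIRNet model."""
    rgb = img_bgr[..., ::-1].astype(np.float32) / 255.0    # BGR -> RGB in [0, 1]
    out = mirnet.predict(rgb[None, ...], verbose=0)[0]     # add / drop batch dim
    return (np.clip(out, 0.0, 1.0)[..., ::-1] * 255).astype(np.uint8)
```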
The output of the MIRNet image enhancement is shown in Fig. 3. Also, the overall architecture of MIRNet is seen in Fig. 4.
Architecture of MIRNet for image enhancement where RRG = Recursive Residual Group, MRB = Multiscale Residual Block, DAU = Dual Attention Unit, SKFF = Selective Kernel Feature Fusion.
B. YOLOv5-Based Vehicle Detection
Because of their high-performance capabilities, YOLO algorithms are frequently used in object detection systems, especially for vehicle detection tasks. YOLO treats detection as a single regression problem over the image, which makes it fast [32]. During training, YOLO takes the entire image as input, paying more attention to global information for target detection, and returns the position of the object bounding box [33], [34], [35].
YOLOv5 is a single-stage detector that considerably reduces the processing time of deeper networks. Since the images have already been processed to make them viable for detection, we used YOLOv5 to keep the system lightweight as well as efficient. It also performs better in small-target detection [35]. The YOLOv5 architecture consists of four primary parts: input, backbone, neck, and head.
1) Backbone
The backbone extracts the important components of the input image for further analysis. YOLOv5 uses spatial pyramid pooling (SPP) and cross-stage partial networks (CSP) as its main building blocks to extract rich, significant information from input images. SPP allows the same object to be detected at multiple sizes and scales, enhancing the model’s generalization.
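For illustration, a minimal PyTorch sketch of an SPP block in the spirit of the YOLOv5 backbone is given below; the pooling kernel sizes (5, 9, 13) and the channel reduction are the commonly used choices and are assumed here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pools at several kernel sizes,
    concatenated so the same object is represented at multiple scales."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, kernel_size=1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        self.fuse = nn.Conv2d(c_hidden * (len(kernels) + 1), c_out, kernel_size=1, bias=False)

    def forward(self, x):
        x = self.reduce(x)
        # Concatenate the un-pooled features with every pooled version.
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```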
2) Neck
It consists of a Path Aggregation Network (PANet) and a Feature Pyramid Network (FPN), whose primary function is to generate feature pyramids. The FPN structure propagates strong semantic features top-down, while the PAN structure adds a bottom-up path that improves the propagation of low-level localization features from lower feature maps to higher feature maps.
3) Head
The output layer consists of three convolution layers that predict the location of the object bounding box and its scores. YOLOv5 uses the Sigmoid Linear Unit (SiLU) activation function in the hidden layers, while the sigmoid activation function is used in the convolution operation of the output layer; these are calculated as follows [36]:\begin{equation*} SiLU\left ({{ x }}\right)=x\times \sigma \left ({{ x }}\right) \tag {2}\end{equation*}\begin{equation*} \sigma \left ({{ x }}\right)=\frac {1}{1+e^{-x}} \tag {3}\end{equation*}
The loss function of the overall structure is calculated as given below:\begin{equation*} Loss=\lambda _{1}L_{cls}+\lambda _{2}L_{obj}+\lambda _{3}L_{loc} \tag {4}\end{equation*} where $L_{cls}$ is the classification loss, $L_{obj}$ is the objectness (confidence) loss, $L_{loc}$ is the bounding-box localization loss, and $\lambda _{1}$, $\lambda _{2}$, $\lambda _{3}$ are the corresponding weighting coefficients.
The architecture of the YOLOv5 algorithm is shown in Fig. 5.
The image frames were divided into bursts of five images. Detection was performed on the first image of each burst, while tracking was done on the next four images. The vehicle detection results obtained with the YOLOv5 algorithm are visualized in Fig. 6.
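For illustration, the detection step on the first frame of each burst can be run through the public Ultralytics torch.hub interface; the stock yolov5s weights and the confidence threshold below are placeholders for this sketch (a checkpoint fine-tuned on the enhanced aerial data could be substituted).

```python
import numpy as np
import torch

# Load YOLOv5 through the public Ultralytics hub interface.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
model.conf = 0.25  # confidence threshold, chosen here only for illustration

def detect_yolov5(frame_bgr):
    """Return [x1, y1, x2, y2, confidence, class] rows for one enhanced frame."""
    rgb = np.ascontiguousarray(frame_bgr[..., ::-1])  # the model expects RGB input
    results = model(rgb)
    return results.xyxy[0].cpu().numpy()
```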
C. Identifier Number Assignment
Since each image frame contains multiple vehicles to be tracked in the succeeding frames, an identifier was required to locate each vehicle separately, and it should remain the same for a particular vehicle throughout the tracking. For this purpose, every detected vehicle was subjected to SIFT feature extraction [37], [38], and based on these features a unique identifier number was assigned to each car.
The SIFT features are local making them robust against occlusion and clutter [35], [36], [37]. The feature extraction algorithm consists of the following four steps.
1) Scale Space
This step selects potential areas in the image in which to find features [38], [39]. The input image is convolved with a Gaussian kernel at various scales to produce the scale-space function\begin{equation*} L\left ({{ x,y,\sigma }}\right)=G\left ({{ x,y,\sigma }}\right)\ast I\left ({{ x,y }}\right) \tag {5}\end{equation*}
where\begin{equation*} G\left ({{ x,y,\sigma }}\right)=\frac {1}{2\pi \sigma ^{2}}\, e^{-\frac {x^{2}+y^{2}}{2\sigma ^{2}}} \tag {6}\end{equation*}
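A small OpenCV sketch of the scale-space construction in (5)-(6) is shown below; the base $\sigma$, the scale multiplier, and the number of scales are illustrative values only.

```python
import cv2
import numpy as np

def difference_of_gaussians(gray, sigma0=1.6, k=2 ** 0.5, num_scales=5):
    """Build L(x, y, sigma) by Gaussian blurring at increasing sigma (Eqs. 5-6)
    and return the difference-of-Gaussian images used to locate extrema."""
    gray = gray.astype(np.float32)
    scales = [cv2.GaussianBlur(gray, (0, 0), sigma0 * (k ** i))
              for i in range(num_scales)]
    return [scales[i + 1] - scales[i] for i in range(num_scales - 1)]
```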
2) Key Point Localization
To refine the keypoint location, a Taylor series expansion of the scale space is used to locate the extrema with greater accuracy; if the intensity at an extremum is less than a certain threshold, the keypoint is rejected.
3) Orientation Assignment
Each keypoint was assigned an orientation to make the extracted keypoint invariant to rotation. The neighbourhood around the keypoint position is chosen depending on the scale, and the gradient’s amplitude and direction are defined as follows:\begin{align*}\left |{{ I }}\right |& =\sqrt {I_{x}^{2}+ I_{y}^{2 }} \tag {7}\\ \Theta & ={tan}^{-1}\left ({{ \frac {I_{y}}{I_{x}} }}\right) \tag {8}\end{align*}
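The gradient magnitude and orientation in (7)-(8) can be computed per pixel with standard derivative filters, for example:

```python
import cv2
import numpy as np

def gradient_magnitude_orientation(gray):
    """Per-pixel gradient magnitude |I| and orientation theta (Eqs. (7)-(8))."""
    Ix = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
    Iy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative
    magnitude = np.sqrt(Ix ** 2 + Iy ** 2)
    theta = np.degrees(np.arctan2(Iy, Ix))            # orientation in degrees
    return magnitude, theta
```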
4) Key Point Descriptor
To calculate the local image descriptor, a 16×16 neighbourhood around the keypoint is taken and divided into 4×4 sub-blocks; for each sub-block, an 8-bin orientation histogram is computed, and the concatenated histograms form a 128-dimensional descriptor vector.
5) Key Point Matching
The matching between two images is obtained by identifying the nearest neighbour of each keypoint using the Euclidean distance between descriptor vectors:\begin{equation*} d\left ({{ u,v }}\right)=\sqrt {\sum \nolimits _{i=1}^{n} {\left ({{ u_{i}-v_{i} }}\right)^{2}}} \tag {9}\end{equation*}
If the number of matches exceeds the threshold value of 6, the corresponding vehicle’s identifier number is retrieved and assigned to the matched vehicle in the succeeding frame. If no match is found, the vehicle is added as a new entry with a unique number. The identifier assignment for each detected vehicle is shown in Fig. 7.
Identifier assignment to each detected vehicle based on SIFT feature extraction over UAVDT and VisDrone datasets.
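The identifier assignment can be sketched with OpenCV's SIFT implementation as follows; the match-count threshold of 6 follows the text, while the Lowe ratio value of 0.75 and the brute-force matcher are assumptions of this sketch.

```python
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def count_good_matches(desc_a, desc_b, ratio=0.75):
    """Number of nearest-neighbour matches passing Lowe's ratio test
    (the 0.75 ratio is an assumption; the paper thresholds the count at 6)."""
    if desc_a is None or desc_b is None:
        return 0
    pairs = matcher.knnMatch(desc_a, desc_b, k=2)
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)

def assign_id(roi_gray, registry, next_id, thresh=6):
    """Return the ID of a registered vehicle matching this ROI, or register a new one.

    registry maps identifier -> stored SIFT descriptors of that vehicle.
    """
    _, desc = sift.detectAndCompute(roi_gray, None)
    for vid, stored in registry.items():
        if count_good_matches(desc, stored) > thresh:
            return vid, next_id                  # existing vehicle found
    registry[next_id] = desc                     # new vehicle: register and assign a new ID
    return next_id, next_id + 1
```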
D. Template Matching-Based Vehicle Tracking
To lower the computational complexity of the model, we used a template-matching algorithm to avoid unnecessary feature extraction. A template model is generated for every new vehicle registered in the system [43], [44]. This model is used to locate all the possible locations of the vehicle in the following frame. The template-matching algorithm slides the template across the entire image, and a similarity score is calculated between the area covered by the window and the template [45], [46], [47], [48]. The matching is implemented through a 2-dimensional convolution:\begin{equation*} l\left ({{ x,y }}\right)=f\left ({{ x,y }}\right)\circ g\left ({{ x,y }}\right) \tag {10}\end{equation*} where $f(x,y)$ is the image frame being searched, $g(x,y)$ is the vehicle template, and $l(x,y)$ is the resulting similarity map.
The extracted templates contain texture and appearance information, which helps to find each vehicle’s match. If more than one possible location is detected in an image, the candidates are subjected to SIFT feature matching to obtain the best match and the associated identifier number [49], [50]. Vehicle templates that are not found in the succeeding frames are retained and matched for the next five frames before deletion, to handle occlusion within the tracked images. The tracking results are shown in Fig. 8. The steps involved in template-based matching are given in Algorithm 1.
Vehicle Tracking using Template Matching (a) vehicle model extracted from detection (b) number of template matchings is greater than 1 (c) SIFT feature extraction and matching with the template matches (d) best possible location retained across the image frames.
Algorithm 1 Vehicle Detection and Tracking
Input: vehicle detections V = {v1, v2, v3, ..., vn}, where vi = ((x1, y1), (x2, y2)); input image I; frames F = {f1, f2, ..., fn}
Output: The tracking results (vehicle identifiers and locations per frame)
Initialize feature_list = [], thresh = 6, vehicle_model = []
for i in range(V) do
(x, y, w, h) ← bounding box of vi
ROI = Extract_Region_of_Interest(I(x, y, x+w, y+h))
feature_list[i] ← SIFT_Features(ROI)
vehicle_model[i] ← Template(ROI)
end for
while F do
for j in range(vehicle_model) do
matches = Template_Matching(vehicle_model[j], F)
if matches > 1 then
fm = Feature_Matching(matches, feature_list)
if fm > thresh then
Retrieve and assign the corresponding ID and discard the other matched templates
end if
else
Retrieve the corresponding ID and assign it to the matched vehicle
end if
end for
end while
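Algorithm 1 could be realised in Python roughly as sketched below, reusing the SIFT helpers from the previous sketch; the normalized-correlation scoring, the 0.8 similarity threshold, and the explicit miss counter are assumptions of this sketch, while the five-frame retention follows Section III-D.

```python
import cv2
import numpy as np

SIM_THRESH = 0.8      # template similarity threshold (assumed value)
MAX_MISSES = 5        # frames a lost template is retained (Section III-D)

def track_frame(frame_gray, tracks, sift, count_good_matches, feat_thresh=6):
    """One tracking step: locate every registered template in the new frame.

    tracks maps id -> {"template", "descriptors", "misses", "centroids"}.
    """
    for vid, t in list(tracks.items()):
        res = cv2.matchTemplate(frame_gray, t["template"], cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(res >= SIM_THRESH)           # candidate locations
        candidates = list(zip(xs, ys))
        h, w = t["template"].shape
        best = None
        if len(candidates) == 1:
            best = candidates[0]
        elif len(candidates) > 1:                      # disambiguate with SIFT
            scores = []
            for (x, y) in candidates:
                _, desc = sift.detectAndCompute(frame_gray[y:y + h, x:x + w], None)
                scores.append(count_good_matches(desc, t["descriptors"]))
            if max(scores) > feat_thresh:
                best = candidates[int(np.argmax(scores))]
        if best is None:
            t["misses"] += 1                           # keep for up to MAX_MISSES frames
            if t["misses"] > MAX_MISSES:
                del tracks[vid]
        else:
            x, y = best
            t["misses"] = 0
            t["centroids"].append(((x + x + w) / 2, (y + y + h) / 2))
```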
E. Trajectories Approximation
Each tracked vehicle’s path was recorded and plotted against the video frames to understand the traffic flow conditions and routes. To estimate the trajectories [51], the final match obtained from the tracking algorithm for each vehicle was recorded by calculating the rectangular centroid of its bounding box, with the frame number taken as a reference for the time stamp. The centroids are calculated as:\begin{equation*} centroid_{vehicle}=\left ({{\frac {x_{1}+x_{2}}{2},\frac {y_{1}+y_{2}}{2}}}\right) \tag {11}\end{equation*} where $(x_{1},y_{1})$ and $(x_{2},y_{2})$ are the top-left and bottom-right corners of the vehicle’s bounding box.
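For example, the recorded centroids from (11) can be plotted per identifier with matplotlib (the tracks structure follows the tracking sketch above):

```python
import matplotlib.pyplot as plt

def plot_trajectories(tracks):
    """Plot the centroid sequence of every tracked vehicle against its ID."""
    for vid, t in tracks.items():
        if not t["centroids"]:
            continue
        xs, ys = zip(*t["centroids"])
        plt.plot(xs, ys, marker="o", markersize=2, label=f"vehicle {vid}")
    plt.gca().invert_yaxis()       # image coordinates: y grows downwards
    plt.legend(fontsize=6)
    plt.title("Approximated vehicle trajectories")
    plt.show()
```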
The points were plotted and joined with time information incorporated as shown in Fig. 9.
Vehicle trajectories approximation is estimated by joining the centroid of each vehicle location against the identifier number ID and Frame number.
Experiments and Results
The experiments were conducted in Python on a laptop with an Intel Core i5-8550U 1.80 GHz processor, 6 GB of Random Access Memory (RAM), and Windows 10 (x64). In addition, to compare CPU and GPU performance, we ran the experiment on a Tesla K80 GPU, which is available free on Google Colab. The training time on the CPU was 1.3 hrs, whereas it took 0.86 hrs on the GPU; however, there was no difference in the precision values. The proposed model produces remarkable results when tested on two benchmark datasets: UAVDT and VisDrone.
A. Datasets
1) VisDrone Dataset
The Vision Meets Drones (VisDrone) dataset contains 288 video clips with a total of 261,908 frames, plus 10,209 still images, captured by several camera-equipped drones covering a variety of places. We used traffic image sequences taken at nighttime to test our model. Some sample images from the VisDrone dataset are displayed in Fig. 10.
2) UAVDT Dataset
The other dataset is the Unmanned Aerial Vehicle Detection and Tracking (UAVDT) benchmark dataset. It consists of traffic sequences recorded from a UAV platform in various urban settings, with each frame stored in JPG format.
B. Evaluation of Detection and Tracking Algorithm
We used three performance metrics to assess our proposed detection and tracking algorithm designed for low-illumination conditions: Precision, Recall, and F1-score. These metrics are calculated as follows:\begin{align*} Precision& = \frac {\text {True Positive}}{\text {True Positive}+\text {False Positive}} \tag {12}\\ Recall& = \frac {\text {True Positive}}{\text {True Positive}+\text {False Negative}} \tag {13}\\ F1& = \frac {2\times Precision\times Recall}{Precision + Recall} \tag {14}\end{align*}
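For completeness, the three metrics in (12)-(14) follow directly from the confusion counts, e.g.:

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, and F1-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```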
C. Comparison with Other Methods
We compared our proposed model with other methods in terms of precision score. Our model outperforms other techniques for both vehicle detection and tracking. Table 4 presents the comparison of our proposed detection model with other methodologies.
Table 5 shows the comparison of our proposed tracking algorithm with other methodologies. It can be seen that our model produces efficient results.
A comparison of detection and tracking techniques with state-of-the-art techniques has been demonstrated in Tables 6 and 7.
Limitations
The proposed method performs well for nighttime surveillance of road traffic; however, the model still has some limitations. The system can detect vehicles under partial occlusion or clutter, but a separate method is required to handle full occlusion or background clutter caused by low contrast, as shown in Fig. 12. Moreover, the model does not take pedestrians, bicycles, or motorbikes into account. Finally, diverse weather conditions, such as images taken in cloudy, foggy, or rainy weather, require other pre-processing methodologies that are beyond the scope of our model.
Conclusion and Future Work
In this study, we propose a lightweight and efficient vehicle detection and tracking algorithm specially designed for low-illumination conditions. First, we pre-processed the nighttime traffic scenes by defogging them and adjusting their brightness level. For detection, we used YOLOv5, which can detect small objects precisely. We assigned identifiers based on SIFT features to track multiple vehicles within a single image frame. Template matching was then employed to obtain each vehicle’s possible locations, and the corresponding identifier was retrieved by SIFT feature matching. The evaluation experiments on public datasets demonstrate that our proposed framework can efficiently detect and track automobiles and outperforms other methods. In the future, we aim to extend the vehicle monitoring technique to adapt to more complex traffic scenarios.