Introduction
Camera and video technology is continuously improving, and it is increasingly common to find FullHD, 2K, 4K or even 8K images used as input for training convolutional neural networks (CNNs) [1]. Computing capacity has also increased significantly [2], and a great deal of effort is being devoted to developing hardware capable of running neural networks in real time [3]. This hardware is becoming increasingly compact, efficient, and affordable, enabling embedded or distributed training systems for the construction of distributed object detection and surveillance systems [4], [5], [6].
However, limited progress has been made in CNN training itself [7]. While neural networks are, in theory, trained only once and thereafter used only for inference, in practice they are continuously retrained, either with new datasets or with modifications to the parameters of the training algorithms.
Given the current size of images [8] and the need for increasingly precise detection of objects within them, training times are growing [9] as classic methods of optimizing training become less effective [10].
There are two commonly used methods to reduce training times for deep neural networks:
Image size reduction [11]. This is an effective method if the objects to be detected or classified occupy a sufficiently large part of the total image so that, even after reduction, they still provide enough information for the training algorithm [9], [13].
Partition of the original image into a mosaic of images [7], [14], [15]. This method reduces the size of the image by dividing it into several parts of a predefined size (usually $3\times 3$ or $4\times 4$) with equal dimensions (length and width), maintaining the same proportions as the original image.
Both methods reduce the size of the images so that they can be processed using more modest hardware, particularly when memory is the principal limitation to processing large images. Both methods, however, have certain drawbacks:
Image size reduction [16]. If objects are small, the loss of resolution may mean these objects become undetectable.
Partition of the original image into a mosaic of images. Each image being processed is smaller, but there are more images to process. Additionally, objects may be split between two sub-images. Overlapping the regions can mitigate this, although it does not solve the problem: the overlap area must be very large, resulting in an even greater reduction of the object and thus of the effectiveness of this solution.
In this study we describe a method that significantly reduces processing times without diminishing the effectiveness of the trained network.
Training Optimization Algorithm
This algorithm is designed to pre-process the labelled images of a dataset before they are used in the usual training process of a deep neural network. The dataset must be labelled in the format of a YOLO-type network [18]. Thus, the input of the algorithm is a dataset, and its output is another dataset constructed from the original images but optimized for training (also in YOLO format). For datasets in other formats, the labels can be translated; the method is therefore replicable and extendable to other dataset formats.
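For reference, a YOLO-format label file contains one object per line as `class x_center y_center width height`, with all coordinates normalized to the image dimensions. The following minimal parser is an illustrative sketch, not part of the published tool:

```python
def load_yolo_labels(path):
    """Parse a YOLO-format label file: one object per line in the form
    'class x_center y_center width height', with coordinates normalized
    to [0, 1] relative to the image width and height."""
    boxes = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed or empty lines
            cls = int(parts[0])
            cx, cy, w, h = map(float, parts[1:])
            boxes.append((cls, cx, cy, w, h))
    return boxes
```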
A. Terms Used in Algorithm Definition
Target (Object) or BoundingBox: A labelled element in the image that the neural network should detect. This may be any of the types of objects that the future neural network will detect by inference.
Selected object: An object labelled in the image that has been selected as input for the neural network. This object was chosen to be part of the set of objects used to train the neural network.
Discarded object: An object labelled to be discarded as input to the neural network. This object may be duplicated, cut, etc., and is discarded for training purposes.
Cropped region: Portion of the image surrounding a “selected object”. The size of this region is a configurable parameter of the algorithm. The region is the piece of the image fed into the neural network for training, and it contains at least one selected object.
Key image: An image on which the object discarding process is not applied. This is established every “N” images. This “N” parameter is configurable in the algorithm.
B. Algorithm
This algorithm, in contrast to the methods described in the literature, consists of two phases:
Discard of objects and reduction of the training set.
Cropping of the training regions and new labelling of objects.
In the first phase, discarding, all targets are checked against the objects in the previous image. The first image of the dataset is considered a “key” image, so no target is discarded and this phase is omitted. Targets that show relatively little movement with respect to the previous image are discarded: they will not become selectable objects. This “relatively small distance” parameter is configurable; the values that gave the best results range from 1% to 3% of the total image, which in 2K or 4K images corresponds to approximately 5 to 15 pixels. A sketch of this phase is given after the following list. The principal factors affecting the selection of this parameter are:
Type of recorded scene, from very static scenes to scenes with a great deal of movement. The more the objects move, the greater the discrimination distance.
Number of frames per second (FPS) at input. When the images are very close in time, objects show smaller displacements between frames. The higher the FPS, the smaller the discrimination distance.
Rotation of objects. If the target objects rotate around a central axis within their BoundingBox rather than moving across the image, the object may be lost for training. In this case, pre-processing is simply not recommended.
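The discard phase can be sketched as follows. This is a simplified illustration that assumes objects have already been matched across consecutive frames (for example, by nearest center of the same class); `threshold_px` corresponds to the 1%-3% displacement parameter and `key_every` to the key-image interval N, both hypothetical names:

```python
import math

def discard_static(frames, threshold_px=10, key_every=7):
    """Phase 1 sketch: discard objects whose center moved less than
    threshold_px since the previous frame. Every key_every-th frame
    is a key image and keeps all of its objects."""
    selected = []
    prev = {}  # object id -> (cx, cy) center in the previous frame
    for i, frame in enumerate(frames):  # frame: list of (obj_id, cx, cy)
        keep = []
        for oid, cx, cy in frame:
            moved = True
            if oid in prev:
                px, py = prev[oid]
                moved = math.hypot(cx - px, cy - py) >= threshold_px
            if i % key_every == 0 or moved:  # key images keep everything
                keep.append((oid, cx, cy))
            prev[oid] = (cx, cy)
        selected.append(keep)
    return selected
```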
In the case studies, most of the recorded images are scenes from highways at an average of 2 FPS, although there are also some images of crowds of people in pedestrian streets or at sports events.
The second phase uses the set of objects that have not been discarded and are therefore selectable. Each selectable object is delimited by a cropped region that is labelled in the image for training purposes. This region is configurable in size and position, but all regions are the same size, with the same ratio or proportion as the original image. The size of the region depends on the grouping of the objects used for training as well as on the size of the image: larger regions encompass more of the image, reducing the total number of regions but increasing the computational cost of training. It is important that the region is sufficiently large for the selected object to be contained entirely within it. The cropped region must have the same length-width proportions as the images used for training and later for inference; this is a critical factor in the effectiveness of convolutional neural networks. It is also important that the region does not extend beyond the limits of the image while maintaining the same proportions and size.
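A cropped region with these properties might be computed as below: a fixed-size window with the image's aspect ratio, centered on the selected object and shifted (rather than shrunk) when it would extend past the image borders. The region width is an assumed, configurable parameter:

```python
def crop_region(img_w, img_h, cx, cy, region_w=640):
    """Return (x0, y0, w, h) of a training region centered on the object
    at (cx, cy), keeping the image's length-width proportions and staying
    entirely inside the image by shifting instead of resizing."""
    region_h = round(region_w * img_h / img_w)  # same aspect ratio as the image
    x0 = round(cx - region_w / 2)
    y0 = round(cy - region_h / 2)
    x0 = max(0, min(x0, img_w - region_w))  # clamp horizontally
    y0 = max(0, min(y0, img_h - region_h))  # clamp vertically
    return x0, y0, region_w, region_h
```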
Each cropped region is then checked for other objects, including those discarded in the first phase. For each object identified within the region, one of the following options is applied:
The Object Is Entirely Within the Cropped Region: This object is labelled to be part of the training. If the object is selectable, that is, not discarded in the first phase, it will now be marked as “not selectable” as it is now part of a training region.
The Object Is Partially Within the Cropped Region:
If more than 50% of the object is within the region (this value is configurable and was set at 50% for the training process), it is labelled but not marked as “not selectable”, and so remains eligible to generate its own training region.
If less than 50% of the object is within the region, the object is not labelled and is deleted (by blurring that part of the image with a Gaussian filter), and it is not marked as “not selectable”. So as not to pollute the training process, these areas are blurred rather than painted over with a background color; this prevents the network from learning that a specific color (the background color) has any particular meaning and incorporating it into its training criteria.
If an object has been deleted, the labelled and selectable objects are rechecked to verify that the deletion has not damaged any complete object whose area of interest (BoundingBox) overlaps the deleted one. If it has, the image is restored in that area to ensure that the selectable object used in training is complete.
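The decision logic above can be sketched as follows, with `inside_fraction` computing the portion of a bounding box's area that falls within the region and a Gaussian blur standing in for the deletion step; the restore check described above is omitted, and all names and the blur kernel size are illustrative:

```python
import cv2

def inside_fraction(box, region):
    """Fraction of box area inside region; both are (x0, y0, w, h) in pixels."""
    bx, by, bw, bh = box
    rx, ry, rw, rh = region
    ix = max(0, min(bx + bw, rx + rw) - max(bx, rx))
    iy = max(0, min(by + bh, ry + rh) - max(by, ry))
    return (ix * iy) / (bw * bh)

def process_object(img, box, region, labels, keep_threshold=0.5):
    """Label or blur one object relative to a cropped region.
    Assumes the box lies within the image bounds."""
    f = inside_fraction(box, region)
    if f >= keep_threshold:
        labels.append(box)  # labelled; if fully inside, also mark "not selectable"
    else:
        # Mostly outside: blur rather than paint over, so the network
        # does not learn to associate a background color with anything.
        x, y, w, h = box
        roi = img[y:y + h, x:x + w]
        img[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (31, 31), 0)
```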
Figure 2 shows how two regions or sub-images (green boxes) are generated from the original image, with the training targets marked in red. These targets generate the regions or sub-images; the blue targets are ROIs included for training that do not generate their own regions or sub-images. Figure 3 shows how four regions or sub-images (green boxes) containing the training targets are generated from an original, full-sized image. In this example, one of the labelled targets is partially blurred (marked in white) because it was marked as discardable in one of the regions or sub-images, having less than a 50% overlap.
Example of the pre-processor in operation. Using the original image, two regions or sub-images are generated (green squares), centered on the red target objects. The blue targets are ROI included for training which do not generate their own sub-image.
By creating a region centered on a selected object, the network is always trained with a centrally placed labelled object. This might cause the network to learn to always expect objects in the center of the region or image. This is not a problem for this project, given that we have used YOLO as the CNN, which divides the image into a grid of sections and analyzes each of them, so the centering of training objects does not bias detection toward the image center.
It is important to note that this method is not suitable for all situations or all datasets. For this article, we conducted tests using three different datasets, all public and verifiable, which allowed us to determine which factors are most beneficial for this algorithm. From these case studies, it was determined that there is no single optimum configuration for all the configurable parameters of the application: the complete set of images in the dataset determines the configuration and effectiveness of the algorithm. The key factors determining the effectiveness of the algorithm (as shown in Figure 4 and Figure 5) are:
Large images, such as FullHD, 2K, 4K or even larger, containing objects or “targets” that are small relative to the size of the image. Examples are images taken from a distance, where the elements to be detected are far away.
Images taken at short intervals, that is, video images. The interval between images need not be very short (1 FPS is optimum), but they must be sequential and taken from a relatively static camera.
Few objects in the image, or objects that are not evenly distributed but grouped in zones. Such images have large areas with no objects to detect, and these areas can be eliminated from the training process.
Static objects of interest in the image. Static objects need not predominate in the scene; it is enough that some are present. This factor reduces the set of images used for training, reducing the size of the dataset and making the process faster.
Frame 1 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
Frame 2 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
In summary, the types of datasets that best fit these factors are those consisting of video images from drones or high-resolution static cameras. In such datasets, the images are chronological and usually of high resolution. Examples are drone videos observing beaches, roads, parks, or large crowds of people or animals, as well as security or surveillance cameras on highways, streets, or buildings, where many objects of interest are concentrated in specific areas of the image, such as cars on a highway or doorways and entrances for surveillance cameras. These images also generally contain static objects of interest, such as people lying on a beach or cars parked on a city street.
In this case, we used the YOLO neural network, which has a series of limitations that make it particularly well suited to the pre-processing algorithm. YOLO divides the image into regions for analysis, and each region is assigned a maximum number of objects [19], [20]; YOLO is thus limited to a specific number of objects per region. By cropping the image around groups of objects, these are spread across the new image, permitting a greater number of detections, since there are objects in each of the regions created by YOLO.
It is important to note that this limitation is not critical, but it must be considered, since the maximum number of objects can be configured in the YOLO network [16], [17], [18]. However, the higher this number, the slower the process becomes and the greater the memory consumption. This limit applies uniformly to all the regions into which YOLO divides the original image, so the number of objects must be adjusted to the region with the most objects rather than to the average. A sketch of this computation follows.
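To illustrate why the cap must cover the densest region rather than the average, the following sketch counts labelled objects per cell of a hypothetical $S\times S$ YOLO grid over normalized labels:

```python
from collections import Counter

def max_objects_per_cell(labels, s=7):
    """Count objects per cell of an s x s grid over normalized YOLO labels
    (class, cx, cy, w, h) and return the busiest cell's count; the
    per-cell capacity must cover this maximum, not the average."""
    counts = Counter()
    for _, cx, cy, _, _ in labels:
        cell = (min(int(cx * s), s - 1), min(int(cy * s), s - 1))
        counts[cell] += 1
    return max(counts.values()) if counts else 0
```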
Evaluated Datasets
To determine the effectiveness of the dataset pre-processing algorithm, we experimented with three different publicly accessible datasets: Drone [21], Roundabout [22] and VisDrone [24]. The following sections describe their principal characteristics.
A. “Drone” Dataset
This dataset consists of images of road traffic in Spain [21], with 12 video sequences recorded by a UAV (Unmanned Aerial Vehicle), or drone, and by static cameras. These are principally images of critical traffic points such as intersections and roundabouts. The videos were recorded at 1 frame per second in 4K resolution. The dataset consists of 17,570 images with labelled object types such as “car” and “motorcycle”. In total there are over 155,000 labelled objects: 137,000 cars (88.6%) and 18,000 motorcycles (11.4%). Three frames extracted from the dataset are presented in Figure 6.
Frames extracted from the dataset, corresponding to a section of interurban roadway and a split roundabout.
B. “Roundabout” Dataset
This dataset consists of aerial images of roundabouts in Spain taken with a drone [22], along with their annotations in XML (PASCAL VOC) files indicating the positions of the vehicles. In total, the dataset consists of 54 drone video sequences with a central view of roundabouts, comprising over 65,000 images.
Frames extracted from the dataset, corresponding to three different roundabouts with light traffic, heavy traffic and very light traffic.
C. “VisDrone” Dataset
This dataset is a large-scale benchmark with carefully annotated data for computer vision on drone imagery. The VisDrone 2019 dataset was compiled by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University, China [24]. The complete dataset consists of 288 video clips with a total of 261,908 frames and 10,209 static images, captured by various drone-mounted cameras and covering a wide range of characteristics: location (14 different cities), setting (urban and rural), objects (pedestrians, vehicles, bicycles, etc.) and density (from sparse to very congested scenes).
It should be noted that the dataset was compiled using several different drones in various scenarios and under diverse weather and lighting conditions. The frames were manually annotated with specific objects of interest such as pedestrians, cars, bicycles, and tricycles. Other important attributes are also provided, such as scene visibility, object type and occlusion, for better use of the data. Three sample frames from this dataset are shown in Figure 8.
Frames extracted from the dataset, corresponding to a parking lot, an intersection, and a roundabout with different intensities of traffic.
For our study, we used only 79 sequences of video consisting of 33,600 frames. There are a total of over 1.5 million labelled items in the dataset, distributed as shown in Table 1.
Pre-Processing of the Datasets
The three datasets were pre-processed using the algorithm discussed in this study, on the following equipment: a ninth-generation Intel i7 processor with 64 GB of RAM, an SSD drive and an RTX 2060 graphics card with 8 GB of RAM. For software, the study used Microsoft Visual C++ and the OpenCV v4.5 library, chosen for the ease of generating builds for both Windows and Linux.
A. Processing the “Drone” Dataset
The dataset was processed as follows (an illustrative configuration sketch follows the list):
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
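Collected together, the settings above might look like the following; the parameter names are hypothetical, not the actual interface of the pre-processor:

```python
# Illustrative configuration for pre-processing the "Drone" dataset.
PREPROCESS_CONFIG = {
    "region_size": (640, 360),    # cropped-region size, same ratio as the source images
    "discard_threshold_px": 10,   # objects moving less than this are discarded
    "keep_area_fraction": 0.5,    # objects with less area inside a region are blurred
    "key_image_interval": 7,      # every 7th frame keeps all of its objects
}
```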
Evolution of the number of images and labels after pre-processing of the “Drone” dataset. There is a slight decrease in images and a significant decrease in labels.
Evolution of the number of labels assigned to each type after pre-processing of the “Drone” dataset.
Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network, evaluated with the mAP metric at a threshold of 0.5 (mAP_0.5). The training results over the epochs are shown in Figure 11.
Training results for mAP_0.5 of the original and pre-processed images of the “Drone” dataset.
For this dataset, consisting of 17K images in 2K quality, the training time using the YOLO algorithm and the “Yolov5m” network for 20 epochs was 14 hours and 46 minutes, while the training time on the same computer for the pre-processed dataset was 1 hour and 35 minutes. If we redraw the mAP_0.5 graph against training time rather than epochs (Figure 12), we see a time reduction of some 89.3%.
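The quoted reduction follows directly from the two wall-clock times:

$$1 - \frac{1\,\mathrm{h}\,35\,\mathrm{min}}{14\,\mathrm{h}\,46\,\mathrm{min}} = 1 - \frac{95}{886} \approx 0.893 \quad (89.3\%)$$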
mAP_0.5 graph of the time differences in training. The hours of training are indicated on the horizontal axis.
There was a significant reduction in training time. The additional time used for pre-processing, 14 minutes for this dataset, is largely insignificant compared to the total training time. For pre-processing, in contrast to network training, the most important resource is not the graphics card but storage, since the algorithm loads a large number of images. In our case, we used an SSD with a read/write speed of 600 MB/s.
B. Processing the “Roundabout” Dataset
The dataset was processed as follows:
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
Evolution of the number of images and labels after pre-processing of the “Roundabout” dataset. There are significantly more images and labels.
Evolution of the number of labels assigned to each type after pre-processing of the “Roundabout” dataset. Cars are the most affected type with a significant increase in the number of labels.
Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network. The training results in the different epochs are shown in Figure 15.
Training results for mAP_0.5 of the original and pre-processed images of the “Roundabout” dataset.
If we redraw the mAP_0.5 graph against training time rather than epochs on the horizontal axis (Figure 16), we see a time reduction of some 43.0%. To this must be added an additional 30 minutes of pre-processing time for this dataset.
mAP_0.5 graph of the time differences in training. The hours of training for the “Roundabout” dataset are indicated on the horizontal axis.
C. Processing the “VisDrone” Dataset
The dataset was processed as follows:
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
Evolution of the number of images and labels after pre-processing of the “Visdrone” dataset. There is a slight increase in images and a significant decrease in labels.
In this case, the number of images increased by a factor of 1.543 (154.3%), while the number of labelled objects fell to 38.8% of the original. Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network. Training results over the epochs are shown in Figure 18.
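From the figures quoted above, the pre-processed dataset contains approximately:

$$33{,}600 \times 1.543 \approx 51{,}800 \ \text{images}, \qquad 1{,}500{,}000 \times 0.388 \approx 582{,}000 \ \text{labels}.$$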
Training results for mAP_0.5 of the original and pre-processed images of the “Visdrone” dataset.
If we redraw the mAP_0.5 graph against training time rather than epochs on the horizontal axis (Figure 19), we see a time reduction of some 75.0%. To this must be added an additional 25 minutes of pre-processing time for this dataset.
mAP_0.5 graph of the time differences in training. The hours of training for the “Visdrone” dataset are indicated on the horizontal axis.
Results
The results were validated using two networks with different training procedures: first, a network trained using the original images, neither reduced nor cropped; and second, a network trained using images pre-processed with the algorithm discussed in this study.
The validation was not conducted to determine the quality of the model, since each network was validated against the same dataset with which it was trained. The purpose of this article is not to determine the success of the training itself but whether the algorithm reduces training times without any loss of effectiveness. The results in themselves are not significant; what matters is the difference between the results when the network is trained with the pre-processed dataset versus the original. Thus, both training results were validated in order to compare them. The terms used in this comparison are:
Network A: Network resulting from the training based on the original dataset.
Network B: Network resulting from the training based on the pre-processed dataset generated using the algorithm discussed in this study.
A. “Drone” Case
Both networks were validated against the original images, generating the confusion matrices shown in Figure 20.
These matrices show, in the validation of Network B (the network trained on pre-processed images), a slight increase in the number of “false positives”, especially for the type “car”. A closer analysis, however, shows that this reading is misleading: the network in fact has a higher success rate than the original labelling. In the original images, small and distant objects of interest are not labelled, to avoid adding noise to the training process. In training with the original images these objects are learned as true negatives, while with the cropped images they are simply not included in the training process (neither as positives nor as negatives).
But when validating against the original images, these “true negatives” are detected as positives by the network trained with the pre-processed dataset. That is, Network B has a greater sensitivity to objects that are small and real but unlabelled in the original images.
Figure 21 shows an original frame from the video without any labelled objects, as they are very far from the camera. This image was analyzed by both neural networks (Network A and Network B). In the case of Network A, the objects were correctly learned as true negatives and were not marked (Figure 22). In the case of Network B, these distant objects were never fed into the network; that is, they were never marked as “selectable objects” and so were never learned as “true negatives” to be discarded. Thus, in processing this image, Network B will detect these objects as targets if the resolution of the image permits.
Upper right corner magnification of Figure 21, showing objects undetected by Network A (trained with the original dataset).
Advantages Obtained During Training: In line with the above, we found that both datasets produce very similar trained networks, even for this dataset. The network generated from the pre-processed dataset may be said to be slightly better, detecting smaller objects of interest with fewer false negatives.
Thus far, we have demonstrated that the training results are similar: the two networks are equivalent. But this is not the principal advantage of the algorithm, which lies in the training process itself, where far better times are obtained.
B. “Roundabout” Case
Both networks were validated against the original images, generating the confusion matrices shown in Figure 24.
Upper right corner magnification of Figure 21, showing objects detected by Network B (trained with the pre-processed dataset).
Confusion matrices of the original images (above) and the pre-processed images (below).
Advantages Obtained During Training: For this dataset, consisting of 65K images in 2K quality, the training time using the YOLO algorithm and the “Yolov5m” network for 30 epochs was 3 days, 4 hours, and 3 minutes, while the training time on the same computer for the pre-processed dataset was 1 day, 8 hours and 46 minutes.
This is a perfect example of network training where the results are virtually the same, with very small differences between them. The greatest difference, though minimal, is for the label “car”, where there was slight confusion with “truck”.
C. “VisDrone” Case
Both networks were validated using the original images, generating the confusion matrices shown in Figure 25.
Advantages Obtained During Training: For this dataset, consisting of 33.6K images in FullHD quality, the training time using the YOLO algorithm and the “Yolov5m” network for 30 epochs was 14 hours and 26 minutes, while the training time on the same computer for the pre-processed dataset was 3 hours and 36 minutes.
It is important to note that this training exercise presented the largest differences, although they are not significant if we consider that the network was not trained effectively: the results in both cases, for the original dataset and the pre-processed dataset, were approximately 0.3 in the mAP_0.5 metric, a very poor result.
We will explain the reasons for this poor performance, although it is important to note that these results also validate the algorithm, which is designed exclusively to reduce training times rather than to improve the training process itself.
The reason for this poor training result is that the network was trained using the labels downloaded from the repository without any prior cleaning of the dataset. The original labelling of this dataset (not in YOLO format) includes special types and attributes: a “type 0” indicating “regions to ignore” (see Figure 30), attributes indicating whether the labelled object is occluded (as shown in Figure 27 and Figure 28) or truncated, and even a confidence score for each labelled object.
Magnification of Figure 27 showing targets (cars and motorcycles) that are perfectly identifiable yet not labelled in the dataset.
Sample frame from the labelled dataset. The upper part of the image is marked as not labelled (red box), although many objects in it are perfectly recognizable.
To improve the training results, the dataset should first be cleaned: filtering out hidden, highly distorted or cut objects and dubious labels, and relabeling objects that are unlabelled but perfectly recognizable in the images (see Figure 27, Figure 28, Figure 29 and Figure 30). This was not done here for two reasons: first, the purpose of this article is not to evaluate the quality of training on known datasets but to evaluate the training-time reductions provided by the algorithm; and second, a clean dataset with fewer labels can itself optimize the training process, which would be further evidence of the effectiveness of our pre-processing system. Regardless, the algorithm reduced the training time to one quarter of the original.
This improvement in training times is particularly noteworthy given that this dataset is not ideal for pre-processing. Figure 26 shows how the images fail to meet some of the conditions for optimum effectiveness of the algorithm, such as the concentration of objects in specific zones of the image: the labelled objects are distributed throughout the frame. In contrast, the dataset does meet other conditions that allow the algorithm to be effective, such as limited movement of objects between frames and many objects remaining immobile across many frames.
These images also illustrate the causes of the poor training results which, while not a problem for pre-processing, should be taken into consideration. Certain objects are labelled but totally hidden (cars under trees, for example), mislabelled, or unlabelled (motorcycles, for example); oddly, these same motorcycles are labelled in other frames of the video. There are also zones of the image that are perfectly recognizable but marked to be ignored.
Discussion
How is it possible that partitioning an image into smaller images produces a dataset that is smaller than the original? In other words, how can the resulting dataset consist not only of smaller images but, in some cases, of fewer images and labels?
The explanation lies in the first criterion, the discarding phase: cropped images whose only objects of interest do not move, for example parked cars, are not generated. In many frames the only cars present are parked, with no other vehicles circulating. These cars are labelled only in the “key” frames, which the configuration established at every 7 images (a 7-to-1 reduction).
The result is that the pre-processed dataset is not only smaller but also more balanced. A parked car appears in every frame of the video, giving it greater weight in the training process, while a car driving past the camera appears in the sequence for only a few seconds. A false positive on an object appearing in all the images is thus penalized far more than a false positive on an object labelled in only 5 or 10 frames, which means the network can “overlearn” some objects to the detriment of others.
It is important to note that static objects of interest (parked cars, for example) are not labelled only once in the pre-processed dataset, as shown in Figure 31, with all other appearances discarded because the object does not move; they are also labelled in every “key” image. By adjusting this value in the algorithm's configuration, the weight of static objects, which are very abundant in the dataset, can be balanced against that of objects appearing in only a limited number of frames.
The two key effects achieved by the algorithm are the following (a back-of-the-envelope check follows the list):
Reducing the dataset's storage footprint to a fraction, roughly 20% or less, of the original. As mentioned earlier, the original and processed image sets do not differ vastly in number; in our tests, even in the worst case, the image count does not double. The images themselves, however, are much smaller, going from around 1.5 MB per original image (in jpg format) to about 100 KB per processed image, which translates to a significant reduction. The labelling files are negligible in these calculations, accounting for less than 0.01% of the total dataset size.
With smaller images, a larger number of images can be loaded in parallel into the graphics card's memory. In our case, we went from loading 4 images in parallel to loading 42, making the training process more efficient.
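A back-of-the-envelope check of both points, using the per-image sizes and batch counts quoted above:

```python
# Rough check using the figures quoted in the text (not measured here).
orig_mb = 1.5           # ~1.5 MB per original jpg image
proc_mb = 0.1           # ~100 KB per processed image
worst_case_growth = 2   # worst case observed: image count does not even double
storage_ratio = (proc_mb * worst_case_growth) / orig_mb
print(f"storage vs. original: {storage_ratio:.0%}")  # ~13% even in the worst case
print(f"parallel batch gain: {42 / 4:.1f}x")         # 4 -> 42 images in GPU memory
```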
Conclusion
An analysis of the results shows that the image pre-processing algorithm is efficient in terms of time and computation, and can be executed on standard equipment without any exceptional characteristics. Very significant improvements were seen in training times, with reductions ranging from at least 50% up to 80%, depending on the dataset. If, for example, we focus on a success score of 0.95 in the mAP_0.5 metric, very significant time reductions were achieved, as shown in Figure 32:
Drone Dataset. Training without improvement: 3 hours and 36 minutes; with pre-processing: 30 minutes. A reduction in training time of 87%.
Roundabout Dataset. Training without improvement: 21 hours and 11 minutes; with pre-processing: 6 hours and 34 minutes. A reduction in training time of 72%.
Visdrone Dataset. A success score of 0.95 for the metric was never achieved for this dataset. The highest success score was reached in epoch 9, after 4 hours and 5 minutes for the original dataset and 1 hour for the pre-processed dataset. A reduction in training time of 76%.
As shown in Figure 33, similar results can be obtained if the aim is simply a specific number of epochs.
Additionally, it was found that pre-processing does not alter the quality of the training. If the dataset is clean and well formatted, training succeeds in both cases, as seen with the Drone and Roundabout datasets; if the dataset is poorly labelled, the network trains with the same failures as with the original.
To conclude, it is worth noting the added benefit that a network trained with a pre-processed dataset tends to be more sensitive to distant, unlabelled objects, as can be seen in Figure 5, Figure 21 and Figure 22. With the complete images these objects are learned as true negatives, while with the pre-processed dataset they are not part of the training at all. Thus, these objects are detected in the image by the trained network but, in validation, they are counted as false positives, since they are not marked in the original dataset.
ACKNOWLEDGMENT
(Sergio Bemposta Rosende and Javier Sánchez-Soriano contributed equally to this work.) The authors would like to thank Universidad Francisco de Vitoria and the European University of Madrid for their support. They are especially grateful to the translation service of Universidad Francisco de Vitoria for their help in translating and revising the manuscript.