Introduction
Camera and video technology is continuously improving, and it is increasingly common to find FullHD, 2K, 4K or even 8K images used as input for training convolutional neural networks (CNNs) [1]. Computing capacity has also increased significantly [2], and a great deal of effort is being devoted to developing hardware capable of running neural networks in real time [3]. This hardware is becoming increasingly compact, efficient, and affordable, enabling embedded or distributed training systems for the construction of distributed object detection and surveillance systems [4], [5], [6].
However, limited progress has been made in CNN training itself [7]. While neural networks are, in theory, trained only once and thereafter used only for inference, in practice they are continuously retrained, either with new datasets or with modifications to the parameters of the training algorithms.
Given the current size of images [8] and the need for increasingly precise detection of objects within them, training times are growing [9] as classic methods of optimizing training become less effective [10].
There are two commonly used methods to reduce training times for deep neural networks:
Image size reduction [11]. This is an effective method if the objects to be detected or classified occupy a sufficiently large part of the total image so that, even after reduction, they still provide enough information for the training algorithm [9], [13].
Partition of the original image into a mosaic of images [7], [14], [15]. This method reduces the size of the image by dividing it into several parts of a predefined size (usually $3\times 3$ or $4\times 4$) with equal dimensions (length and width), maintaining the same proportions as the original image.
Both methods reduce the size of the images so that they can be processed using more modest hardware, particularly when memory is the principal limitation to processing large images. Both methods, however, have certain drawbacks:
Image size reduction [16]. If objects are small, the loss of resolution may mean these objects become undetectable.
Partition of the original image into a mosaic of images. Each image being processed is smaller, but there are more images to process. Additionally, objects may be split between two sub-images. Overlapping the regions can mitigate this, although it does not solve the problem: the overlap area must be very large, resulting in an even greater reduction of the object and thus of the effectiveness of this solution.
In this study we describe a method that significantly reduces processing times without diminishing the effectiveness of the trained network.
Training Optimization Algorithm
This algorithm is designed to pre-process the labelled images of a dataset before they are used in the usual training process of a deep neural network. The dataset must be labelled in the format of a YOLO-type network [18]. Thus, the input of the algorithm is a dataset, and its output is another dataset constructed from the original images but optimized for training (also in YOLO format). For datasets in other formats, the labels can be translated; the method is therefore replicable and extendable to other dataset formats.
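For reference, a YOLO-format label file contains one object per line as `class x_center y_center width height`, with all coordinates normalized to the image dimensions. The following minimal parser is an illustrative sketch, not part of the published tool:

```python
def load_yolo_labels(path):
    """Parse a YOLO-format label file: one object per line in the form
    'class x_center y_center width height', with coordinates normalized
    to [0, 1] relative to the image width and height."""
    boxes = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed or empty lines
            cls = int(parts[0])
            cx, cy, w, h = map(float, parts[1:])
            boxes.append((cls, cx, cy, w, h))
    return boxes
```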
A. Terms Used in Algorithm Definition
Target (Object) or BoundingBox: A labelled element in the image that the neural network should detect. This may be any of the types of objects that the future neural network will detect by inference.
Selected object: An object labelled in the image that has been selected as input for the neural network. This object was chosen to be part of the set of objects used to train the neural network.
Discarded object: An object labelled to be discarded as input to the neural network. This object may be duplicated, cut, etc., and is discarded for training purposes.
Cropped region: Portion of the image surrounding a “selected object”. The size of this region is a configurable parameter of the algorithm. The region is the piece of the image fed into the neural network for training, and it contains at least one selected object.
Key image: An image on which the object discarding process is not applied. This is established every “N” images. This “N” parameter is configurable in the algorithm.
B. Algorithm
This algorithm, in contrast to the methods described in the literature, consists of two phases:
Discard of objects and reduction of the training set.
Cropping of the training regions and new labelling of objects.
In the first phase, discarding, all targets are checked against the objects in the previous image. The first image of the dataset is considered a “key” image, so no target is discarded and this phase is omitted. Targets that show relatively little movement with respect to the previous image are discarded: they will not become selectable objects. This “relatively small distance” parameter is configurable; the values that gave the best results range from 1% to 3% of the total image, which in 2K or 4K images corresponds to approximately 5 to 15 pixels. A sketch of this phase is given after the following list. The principal factors affecting the selection of this parameter are:
Type of recorded scene, from very static scenes to scenes with a great deal of movement. The more the objects move, the greater the discrimination distance.
Number of frames per second (FPS) at input. When the images are very close in time, objects show smaller displacements between frames. The higher the FPS, the smaller the discrimination distance.
Rotation of objects. If the target objects rotate around a central axis within their BoundingBox rather than moving across the image, the object may be lost for training. In this case, pre-processing is simply not recommended.
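The discard phase can be sketched as follows. This is a simplified illustration that assumes objects have already been matched across consecutive frames (for example, by nearest center of the same class); `threshold_px` corresponds to the 1%-3% displacement parameter and `key_every` to the key-image interval N, both hypothetical names:

```python
import math

def discard_static(frames, threshold_px=10, key_every=7):
    """Phase 1 sketch: discard objects whose center moved less than
    threshold_px since the previous frame. Every key_every-th frame
    is a key image and keeps all of its objects."""
    selected = []
    prev = {}  # object id -> (cx, cy) center in the previous frame
    for i, frame in enumerate(frames):  # frame: list of (obj_id, cx, cy)
        keep = []
        for oid, cx, cy in frame:
            moved = True
            if oid in prev:
                px, py = prev[oid]
                moved = math.hypot(cx - px, cy - py) >= threshold_px
            if i % key_every == 0 or moved:  # key images keep everything
                keep.append((oid, cx, cy))
            prev[oid] = (cx, cy)
        selected.append(keep)
    return selected
```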
In the case studies, most of the recorded images are scenes from highways at an average of 2 FPS, although there are also some images of crowds of people in pedestrian streets or at sports events.
The second phase uses the set of objects that have not been discarded and are therefore selectable. Each selectable object is delimited by a cropped region that is labelled in the image for training purposes. This region is configurable in size and position, but all regions are the same size, with the same ratio or proportion as the original image. The size of the region depends on the grouping of the objects used for training as well as on the size of the image: larger regions encompass more of the image, reducing the total number of regions but increasing the computational cost of training. It is important that the region is sufficiently large for the selected object to be contained entirely within it. The cropped region must have the same length-width proportions as the images used for training and later for inference; this is a critical factor in the effectiveness of convolutional neural networks. It is also important that the region does not extend beyond the limits of the image while maintaining the same proportions and size.
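A cropped region with these properties might be computed as below: a fixed-size window with the image's aspect ratio, centered on the selected object and shifted (rather than shrunk) when it would extend past the image borders. The region width is an assumed, configurable parameter:

```python
def crop_region(img_w, img_h, cx, cy, region_w=640):
    """Return (x0, y0, w, h) of a training region centered on the object
    at (cx, cy), keeping the image's length-width proportions and staying
    entirely inside the image by shifting instead of resizing."""
    region_h = round(region_w * img_h / img_w)  # same aspect ratio as the image
    x0 = round(cx - region_w / 2)
    y0 = round(cy - region_h / 2)
    x0 = max(0, min(x0, img_w - region_w))  # clamp horizontally
    y0 = max(0, min(y0, img_h - region_h))  # clamp vertically
    return x0, y0, region_w, region_h
```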
Each cropped region is then checked for other objects, including those discarded in the first phase. For each object identified within the region, one of the following options is applied:
The Object Is Entirely Within the Cropped Region: This object is labelled to be part of the training. If the object is selectable, that is, not discarded in the first phase, it will now be marked as “not selectable” as it is now part of a training region.
The Object Is Partially Within the Cropped Region:
If more than 50% of the object is within the region (this value is configurable and was set at 50% for the training process), it is labelled but not marked as “not selectable”, and so remains eligible to generate its own training region.
If less than 50% of the object is within the region, the object is not labelled and is deleted (by blurring that part of the image with a Gaussian filter), and it is not marked as “not selectable”. So as not to pollute the training process, these areas are blurred rather than painted over with a background color; this prevents the network from learning that a specific color (the background color) has any particular meaning and incorporating it into its training criteria.
If an object has been deleted, the labelled and selectable objects are rechecked to verify that the deletion has not damaged any complete object whose area of interest (BoundingBox) overlaps the deleted one. If it has, the image is restored in that area to ensure that the selectable object used in training is complete.
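The decision logic above can be sketched as follows, with `inside_fraction` computing the portion of a bounding box's area that falls within the region and a Gaussian blur standing in for the deletion step; the restore check described above is omitted, and all names and the blur kernel size are illustrative:

```python
import cv2

def inside_fraction(box, region):
    """Fraction of box area inside region; both are (x0, y0, w, h) in pixels."""
    bx, by, bw, bh = box
    rx, ry, rw, rh = region
    ix = max(0, min(bx + bw, rx + rw) - max(bx, rx))
    iy = max(0, min(by + bh, ry + rh) - max(by, ry))
    return (ix * iy) / (bw * bh)

def process_object(img, box, region, labels, keep_threshold=0.5):
    """Label or blur one object relative to a cropped region.
    Assumes the box lies within the image bounds."""
    f = inside_fraction(box, region)
    if f >= keep_threshold:
        labels.append(box)  # labelled; if fully inside, also mark "not selectable"
    else:
        # Mostly outside: blur rather than paint over, so the network
        # does not learn to associate a background color with anything.
        x, y, w, h = box
        roi = img[y:y + h, x:x + w]
        img[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (31, 31), 0)
```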
Figure 2 shows how two regions or sub-images (green boxes) are generated from the original image, with the training targets marked in red. These targets generate the regions or sub-images; the blue targets are ROIs included for training that do not generate their own regions or sub-images. Figure 3 shows how four regions or sub-images (green boxes) containing the training targets are generated from an original, full-sized image. In this example, one of the labelled targets is partially blurred (marked in white) because it was marked as discardable in one of the regions or sub-images, having less than a 50% overlap.
Example of the pre-processor in operation. Using the original image, two regions or sub-images are generated (green squares), centered on the red target objects. The blue targets are ROI included for training which do not generate their own sub-image.
By creating a region centered on a selected object, the network is always trained with a centrally placed labelled object. This might cause the network to learn to always expect objects in the center of the region or image. This is not a problem for this project, given that we have used YOLO as the CNN, which divides the image into a grid of sections and analyzes each of them, so the centering of training objects does not bias detection toward the image center.
It is important to note that this method is not suitable for all situations or all datasets. For this article, we conducted tests using three different datasets, all public and verifiable, which allowed us to determine which factors are most beneficial for this algorithm. From these case studies, it was determined that there is no single optimum configuration for all the configurable parameters of the application: the complete set of images in the dataset determines the configuration and effectiveness of the algorithm. The key factors determining the effectiveness of the algorithm (as shown in Figure 4 and Figure 5) are:
Large images, such as FullHD, 2K, 4K or even larger, containing objects or “targets” that are small relative to the size of the image. Examples are images taken from a distance, where the elements to be detected are far away.
Images taken at short intervals, that is, video images. The interval between images need not be very short (1 FPS is optimum), but they must be sequential and taken from a relatively static camera.
Few objects in the image, or objects that are not evenly distributed but grouped in zones. Such images have large areas with no objects to detect, and these areas can be eliminated from the training process.
Static objects of interest in the image. Static objects need not predominate in the scene; it is enough that some are present. This factor reduces the set of images used for training, reducing the size of the dataset and making the process faster.
Frame 1 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
Frame 2 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
In summary, the types of datasets that best fit these factors are those consisting of video images from drones or high-resolution static cameras. In such datasets, the images are chronological and usually of high resolution. Examples are drone videos observing beaches, roads, parks, or large crowds of people or animals, as well as security or surveillance cameras on highways, streets, or buildings, where many objects of interest are concentrated in specific areas of the image, such as cars on a highway or doorways and entrances for surveillance cameras. These images also generally contain static objects of interest, such as people lying on a beach or cars parked on a city street.
In this case, we used the YOLO neural network, which has a series of limitations that make it particularly well suited to the pre-processing algorithm. YOLO divides the image into regions for analysis, and each region is assigned a maximum number of objects [19], [20]; YOLO is thus limited to a specific number of objects per region. By cropping the image around groups of objects, these are spread across the new image, permitting a greater number of detections, since there are objects in each of the regions created by YOLO.
It is important to note that this limitation is not critical, but it must be considered, since the maximum number of objects can be configured in the YOLO network [16], [17], [18]. However, the higher this number, the slower the process becomes and the greater the memory consumption. This limit applies uniformly to all the regions into which YOLO divides the original image, so the number of objects must be adjusted to the region with the most objects rather than to the average. A sketch of this computation follows.
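To illustrate why the cap must cover the densest region rather than the average, the following sketch counts labelled objects per cell of a hypothetical $S\times S$ YOLO grid over normalized labels:

```python
from collections import Counter

def max_objects_per_cell(labels, s=7):
    """Count objects per cell of an s x s grid over normalized YOLO labels
    (class, cx, cy, w, h) and return the busiest cell's count; the
    per-cell capacity must cover this maximum, not the average."""
    counts = Counter()
    for _, cx, cy, _, _ in labels:
        cell = (min(int(cx * s), s - 1), min(int(cy * s), s - 1))
        counts[cell] += 1
    return max(counts.values()) if counts else 0
```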
Evaluated Datasets
To determine the effectiveness of the dataset pre-processing algorithm, we experimented with three different publicly accessible datasets: Drone [21], Roundabout [22] and VisDrone [24]. The following sections describe their principal characteristics.
A. “Drone” Dataset
This dataset consists of images of road traffic in Spain [21], with 12 video sequences recorded by a UAV (Unmanned Aerial Vehicle), or drone, and by static cameras. These are principally images of critical traffic points such as intersections and roundabouts. The videos were recorded at 1 frame per second in 4K resolution. The dataset consists of 17,570 images with labelled object types such as “car” and “motorcycle”. In total there are over 155,000 labelled objects: 137,000 cars (88.6%) and 18,000 motorcycles (11.4%). Three frames extracted from the dataset are presented in Figure 6.
Frames extracted from the dataset, corresponding to a section of interurban roadway and a split roundabout.
B. “Roundabout” Dataset
This dataset consists of aerial images of roundabouts in Spain taken with a drone [22], along with their annotations in XML (PASCAL VOC) files indicating the positions of the vehicles. In total, the dataset consists of 54 drone video sequences with a central view of roundabouts, comprising over 65,000 images.
Frames extracted from the dataset, corresponding to three different roundabouts with light traffic, heavy traffic and very light traffic.
C. “VisDrone” Dataset
This dataset is a large-scale benchmark with carefully annotated data for computer vision on drone imagery. The VisDrone 2019 dataset was compiled by the AISKYEYE team at the Machine Learning and Data Mining Laboratory of Tianjin University, China [24]. The complete dataset consists of 288 video clips with a total of 261,908 frames and 10,209 static images, captured by various drone-mounted cameras and covering a wide range of characteristics: location (14 different cities), setting (urban and rural), objects (pedestrians, vehicles, bicycles, etc.) and density (from sparse to very congested scenes).
It should be noted that the dataset was compiled using several different drones in various scenarios and under diverse weather and lighting conditions. The frames were manually annotated with specific objects of interest such as pedestrians, cars, bicycles, and tricycles. Other important attributes are also provided, such as scene visibility, object type and occlusion, for better use of the data. Three sample frames from this dataset are shown in Figure 8.
Frames extracted from the dataset, corresponding to a parking lot, an intersection, and a roundabout with different intensities of traffic.
For our study, we used only 79 sequences of video consisting of 33,600 frames. There are a total of over 1.5 million labelled items in the dataset, distributed as shown in Table 1.
Pre-Processing of the Datasets
The three datasets were pre-processed using the algorithm discussed in this study, on the following equipment: a ninth-generation Intel i7 processor with 64 GB of RAM, an SSD drive and an RTX 2060 graphics card with 8 GB of RAM. For software, the study used Microsoft Visual C++ and the OpenCV v4.5 library, chosen for the ease of generating builds for both Windows and Linux.
A. Processing the “Drone” Dataset
The dataset was processed as follows (an illustrative configuration sketch follows the list):
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
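Collected together, the settings above might look like the following; the parameter names are hypothetical, not the actual interface of the pre-processor:

```python
# Illustrative configuration for pre-processing the "Drone" dataset.
PREPROCESS_CONFIG = {
    "region_size": (640, 360),    # cropped-region size, same ratio as the source images
    "discard_threshold_px": 10,   # objects moving less than this are discarded
    "keep_area_fraction": 0.5,    # objects with less area inside a region are blurred
    "key_image_interval": 7,      # every 7th frame keeps all of its objects
}
```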
Evolution of the number of images and labels after pre-processing of the “Drone” dataset. There is a slight decrease in images and a significant decrease in labels.
Evolution of the number of labels assigned to each type after pre-processing of the “Drone” dataset.
Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network, evaluated with the mAP metric at a threshold of 0.5 (mAP_0.5). The training results over the epochs are shown in Figure 11.
Training results for mAP_0.5 of the original and pre-processed images of the “Drone” dataset.
For this dataset, consisting of 17K images in 2K quality, the training time using the YOLO algorithm and the “Yolov5m” network for 20 epochs was 14 hours and 46 minutes, while the training time on the same computer for the pre-processed dataset was 1 hour and 35 minutes. If we redraw the mAP_0.5 graph against training time rather than epochs (Figure 12), we see a time reduction of some 89.3%.
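The quoted reduction follows directly from the two wall-clock times:

$$1 - \frac{1\,\mathrm{h}\,35\,\mathrm{min}}{14\,\mathrm{h}\,46\,\mathrm{min}} = 1 - \frac{95}{886} \approx 0.893 \quad (89.3\%)$$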
mAP_0.5 graph of the time differences in training. The hours of training are indicated on the horizontal axis.
There was a significant reduction in training time. The additional time used for pre-processing, 14 minutes for this dataset, is largely insignificant compared to the total training time. For pre-processing, in contrast to network training, the most important resource is not the graphics card but storage, since the algorithm loads a large number of images. In our case, we used an SSD with a read/write speed of 600 MB/s.
B. Processing the “Roundabout” Dataset
The dataset was processed as follows:
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
Evolution of the number of images and labels after pre-processing of the “Roundabout” dataset. There are significantly more images and labels.
Evolution of the number of labels assigned to each type after pre-processing of the “Roundabout” dataset. Cars are the most affected type with a significant increase in the number of labels.
Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network. The training results in the different epochs are shown in Figure 15.
Training results for mAP_0.5 of the original and pre-processed images of the “Roundabout” dataset.
If we redraw the mAP_0.5 graph against training time rather than epochs on the horizontal axis (Figure 16), we see a time reduction of some 43.0%. To this must be added an additional 30 minutes of pre-processing time for this dataset.
mAP_0.5 graph of the time differences in training. The hours of training for the “Roundabout” dataset are indicated on the horizontal axis.
C. Processing the “VisDrone” Dataset
The dataset was processed as follows:
Initial region (sub-image) size of $640\times 360$ pixels, maintaining the same proportions as the images in the original dataset.
Objects of interest were discarded when their position varied by less than 10 px between images.
Deletion of cut objects when less than 50% of their area lies within the region.
Key image every 7 frames.
Evolution of the number of images and labels after pre-processing of the “Visdrone” dataset. There is a slight increase in images and a significant decrease in labels.
In this case, the number of images increased by a factor of 1.543 (154.3%), while the number of labelled objects fell to 38.8% of the original. Both datasets, the original and the pre-processed, were used to train a “medium-sized” YoloV5 neural network. Training results over the epochs are shown in Figure 18.
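From the figures quoted above, the pre-processed dataset contains approximately:

$$33{,}600 \times 1.543 \approx 51{,}800 \ \text{images}, \qquad 1{,}500{,}000 \times 0.388 \approx 582{,}000 \ \text{labels}.$$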
Training results for mAP_0.5 of the original and pre-processed images of the “Visdrone” dataset.
If we redraw the mAP_0.5 graph against training time rather than epochs on the horizontal axis (Figure 19), we see a time reduction of some 75.0%. To this must be added an additional 25 minutes of pre-processing time for this dataset.
mAP_0.5 graph of the time differences in training. The hours of training for the “Visdrone” dataset are indicated on the horizontal axis.
Results
The results were validated using two networks with different training procedures: first, a network trained using the original images, neither reduced nor cropped; and second, a network trained using images pre-processed with the algorithm discussed in this study.
The validation was not conducted to determine the quality of the model, since each network was validated against the same dataset with which it was trained. The purpose of this article is not to determine the success of the training itself but whether the algorithm reduces training times without any loss of effectiveness. The results in themselves are not significant; what matters is the difference between the results when the network is trained with the pre-processed dataset versus the original. Thus, both training results were validated in order to compare them. The terms used in this comparison are:
Network A: Network resulting from the training based on the original dataset.
Network B: Network resulting from the training based on the pre-processed dataset generated using the algorithm discussed in this study.
A. “Drone” Case
Both networks were validated against the original images, generating the confusion matrices shown in Figure 20.
These matrices show, in the validation of Network B (the network trained on pre-processed images), a slight increase in the number of “false positives”, especially for the type “car”. A closer analysis, however, shows that this reading is misleading: the network in fact has a higher success rate than the original labelling. In the original images, small and distant objects of interest are not labelled, to avoid adding noise to the training process. In training with the original images these objects are learned as true negatives, while with the cropped images they are simply not included in the training process (neither as positives nor as negatives).
But when validating against the original images, these “true negatives” are detected as positives by the network trained with the pre-processed dataset. That is, Network B has a greater sensitivity to objects that are small and real but unlabelled in the original images.
Figure 21 shows an original frame from the video without any labelled objects, as they are very far from the camera. This image was analyzed by both neural networks (Network A and Network B). In the case of Network A, the objects were correctly learned as true negatives and were not marked (Figure 22). In the case of Network B, these distant objects were never fed into the network; that is, they were never marked as “selectable objects” and so were never learned as “true negatives” to be discarded. Thus, in processing this image, Network B will detect these objects as targets if the resolution of the image permits.
Upper right corner magnification of Figure 21, showing objects undetected by Network A (trained with the original dataset).
Advantages Obtained During Training: In line with the above, we found that both datasets produce very similar trained networks, even for this dataset. The network generated from the pre-processed dataset may be said to be slightly better, detecting smaller objects of interest with fewer false negatives.
Thus far, we have demonstrated that the training results are similar: the two networks are equivalent. But this is not the principal advantage of the algorithm, which lies in the training process itself, where far better times are obtained.
B. “Roundabout” Case
Both networks were validated against the original images, generating the confusion matrices shown in Figure 24.
Upper right corner magnification of Figure 21, showing objects detected by Network B (trained with the pre-processed dataset).
Confusion matrices of the original images (above) and the pre-processed images (below).
Advantages Obtained During Training: For this dataset, consisting of 65K images in 2K quality, the training time using the YOLO algorithm and the “Yolov5m” network for 30 epochs was 3 days, 4 hours, and 3 minutes, while the training time on the same computer for the pre-processed dataset was 1 day, 8 hours and 46 minutes.
This is a perfect example of network training where the results are virtually the same, with very small differences between them. The greatest difference, though minimal, is for the label “car”, where there was slight confusion with “truck”.
C. “VisDrone” Case
Both networks were validated using the original images, generating the confusion matrices shown in Figure 25.
Advantages Obtained During Training: For this dataset, consisting of 33.6K images in FullHD quality, the training time using the YOLO algorithm and the “Yolov5m” network for 30 epochs was 14 hours and 26 minutes, while the training time on the same computer for the pre-processed dataset was 3 hours and 36 minutes.
It is important to note that this training exercise presented the largest differences, although they are not significant if we consider that the network was not trained effectively: the results in both cases, for the original dataset and the pre-processed dataset, were approximately 0.3 in the mAP_0.5 metric, a very poor result.
We will explain the reasons for this poor performance, although it is important to note that these results also validate the algorithm, which is designed exclusively to reduce training times rather than to improve the training process itself.
The reason for this poor training result is that the network was trained using the labels downloaded from the repository without any prior cleaning of the dataset. The original labelling of this dataset (not in YOLO format) includes special types and attributes: a “type 0” indicating “regions to ignore” (see Figure 30), attributes indicating whether the labelled object is occluded (as shown in Figure 27 and Figure 28) or truncated, and even a confidence score for each labelled object.
Magnification of Figure 27 showing targets (cars and motorcycles) that are perfectly identifiable yet not labelled in the dataset.
Sample frame from the labelled dataset. The upper part of the image is marked as not labelled (red box), although many objects in it are perfectly recognizable.
To improve the training results, the dataset should first be cleaned: filtering out hidden, highly distorted or cut objects and dubious labels, and relabeling objects that are unlabelled but perfectly recognizable in the images (see Figure 27, Figure 28, Figure 29 and Figure 30). This was not done here for two reasons: first, the purpose of this article is not to evaluate the quality of training on known datasets but to evaluate the training-time reductions provided by the algorithm; and second, a clean dataset with fewer labels can itself optimize the training process, which would be further evidence of the effectiveness of our pre-processing system. Regardless, the algorithm reduced the training time to one quarter of the original.
This improvement in training times is particularly noteworthy given that this dataset is not ideal for pre-processing. Figure 26 shows how the images fail to meet some of the conditions for optimum effectiveness of the algorithm, such as the concentration of objects in specific zones of the image: the labelled objects are distributed throughout the frame. In contrast, the dataset does meet other conditions that allow the algorithm to be effective, such as limited movement of objects between frames and many objects remaining immobile across many frames.
These images also illustrate the causes of the poor training results which, while not a problem for pre-processing, should be taken into consideration. Certain objects are labelled but totally hidden (cars under trees, for example), mislabelled, or unlabelled (motorcycles, for example); oddly, these same motorcycles are labelled in other frames of the video. There are also zones of the image that are perfectly recognizable but marked to be ignored.
Discussion
How is it possible that partitioning an image into smaller images produces a dataset that is smaller than the original? In other words, how can the resulting dataset consist not only of smaller images but, in some cases, of fewer images and labels?
The explanation lies in the first criterion, the discarding phase: cropped images whose only objects of interest do not move, for example parked cars, are not generated. In many frames the only cars present are parked, with no other vehicles circulating. These cars are labelled only in the “key” frames, which the configuration established at every 7 images (a 7-to-1 reduction).
The result is that the pre-processed dataset is not only smaller but also more balanced. A parked car appears in every frame of the video, giving it greater weight in the training process, while a car driving past the camera appears in the sequence for only a few seconds. A false positive on an object appearing in all the images is thus penalized far more than a false positive on an object labelled in only 5 or 10 frames, which means the network can “overlearn” some objects to the detriment of others.
It is important to note that static objects of interest (parked cars, for example) are not labelled only once in the pre-processed dataset, as shown in Figure 31, with all other appearances discarded because the object does not move; they are also labelled in every “key” image. By adjusting this value in the algorithm's configuration, the weight of static objects, which are very abundant in the dataset, can be balanced against that of objects appearing in only a limited number of frames.
The two key effects achieved by the algorithm are the following (a back-of-the-envelope check follows the list):
Reducing the dataset's storage footprint to a fraction, roughly 20% or less, of the original. As mentioned earlier, the original and processed image sets do not differ vastly in number; in our tests, even in the worst case, the image count does not double. The images themselves, however, are much smaller, going from around 1.5 MB per original image (in jpg format) to about 100 KB per processed image, which translates to a significant reduction. The labelling files are negligible in these calculations, accounting for less than 0.01% of the total dataset size.
With smaller images, a larger number of images can be loaded in parallel into the graphics card's memory. In our case, we went from loading 4 images in parallel to loading 42, making the training process more efficient.
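A back-of-the-envelope check of both points, using the per-image sizes and batch counts quoted above:

```python
# Rough check using the figures quoted in the text (not measured here).
orig_mb = 1.5           # ~1.5 MB per original jpg image
proc_mb = 0.1           # ~100 KB per processed image
worst_case_growth = 2   # worst case observed: image count does not even double
storage_ratio = (proc_mb * worst_case_growth) / orig_mb
print(f"storage vs. original: {storage_ratio:.0%}")  # ~13% even in the worst case
print(f"parallel batch gain: {42 / 4:.1f}x")         # 4 -> 42 images in GPU memory
```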
Conclusion
An analysis of the results shows that the image pre-processing algorithm is efficient in terms of time and computation, and can be executed on standard equipment without any exceptional characteristics. Very significant improvements were seen in training times, with reductions ranging from at least 50% up to 80%, depending on the dataset. If, for example, we focus on a success score of 0.95 in the mAP_0.5 metric, very significant time reductions were achieved, as shown in Figure 32:
Drone Dataset. Training without improvement: 3 hours and 36 minutes; with pre-processing: 30 minutes. A reduction in training time of 87%.
Roundabout Dataset. Training without improvement: 21 hours and 11 minutes; with pre-processing: 6 hours and 34 minutes. A reduction in training time of 72%.
Visdrone Dataset. A success score of 0.95 for the metric was never achieved for this dataset. The highest success score was reached in epoch 9, after 4 hours and 5 minutes for the original dataset and 1 hour for the pre-processed dataset. A reduction in training time of 76%.
As shown in Figure 33, similar results can be obtained if the aim is simply a specific number of epochs.
Additionally, it was found that pre-processing does not alter the quality of the training. If the dataset is clean and well formatted, training succeeds in both cases, as seen with the Drone and Roundabout datasets; if the dataset is poorly labelled, the network trains with the same failures as with the original.
To conclude, it is worth noting the added benefit that a network trained with a pre-processed dataset tends to be more sensitive to distant, unlabelled objects, as can be seen in Figure 5, Figure 21 and Figure 22. With the complete images these objects are learned as true negatives, while with the pre-processed dataset they are not part of the training at all. Thus, these objects are detected in the image by the trained network but, in validation, they are counted as false positives, since they are not marked in the original dataset.
ACKNOWLEDGMENT
(Sergio Bemposta Rosende and Javier Sánchez-Soriano contributed equally to this work.) The authors would like to thank Universidad Francisco de Vitoria and the European University of Madrid for their support. They are especially grateful to the translation service of Universidad Francisco de Vitoria for their help in translating and revising the manuscript.