I. Introduction
Deep learning models are significantly affected by both the quality [1] and quantity [2] of the training dataset. Supervised learning with labeled datasets is the most common approach to training deep learning models. In particular, object detection tasks that demand high accuracy require high-quality labeled datasets, because accuracy degrades when labels contain errors such as localization errors, classification errors, and false annotations [3]. However, obtaining large, high-quality labeled datasets is challenging owing to the high cost of annotation [4], [5]. Moreover, a small training dataset also reduces accuracy because it does not represent the true data distribution [2].

Although abundant unlabeled data are readily available in the real world, they cannot be used directly for supervised learning. If the large amounts of new data generated on mobile devices could be exploited, a model optimally adapted to each user's environment could be obtained. Recently, a common approach to using the unlabeled data generated on each user's device has been to transmit the data to a data center and process it on a cloud server [6], [7]. However, this approach faces many obstacles, including data privacy concerns [8], transmission issues [9], cloud computing load, data center maintenance costs [10], and annotation costs [4].

If optimized training were possible on personal devices (e.g., mobile/edge devices) using large unlabeled datasets, it would provide benefits such as lower annotation cost, lower cloud processing cost, and improved model accuracy for personal applications [11]. Therefore, semi-supervised learning for object detection (SSOD) [12], [13], [14], which trains networks using large and readily available unlabeled datasets, is attracting increasing attention.