I. Introduction
In recent years deep learning has been the core of artificial intelligence (AI), especially in computer vision fields. In computer vision technology, convolutional neural network (CNN) is the popular choice in making the model for object detection because of its performance in training many images. However, this model is mostly used in supervised learning, in which images are trained to identify a set of objects into classes. The disadvantage of supervised learning is the cost and time involved in selecting the training data. This includes pre-defining and labelling the data into different classes, which limits the categories and class selection. In comparison, unsupervised learning does not require the training data to be labelled, which reduces the cost and time for classification [1].