1. Introduction
ImageNet [2] is a representative benchmark dataset to verify the visual feature learning effects of deep learning models in the vision domain. However, each image has only one label, which cannot fully explain the various features of real objects. For example, a car can be identified with various attributes such as category, color, and length, as in Figure 1. As shown in Figure 1 (a), the general method of forming embeddings for objects’ various attributes involves constructing neural networks equal to the number of specific attributes, and creating multiple embeddings for vision tasks such as image classification [6], [22], [8] and retrieval [10], [20]. Unlike conventional methods, this study presents a technique that embeds various attributes into a single network. We refer to this technique as multi-space attribute-specific embedding Figure 1 (b).