I. INTRODUCTION
With the rapid development of contemporary information technology, images and text in multimedia have become key elements of effective communication and interaction in people's daily lives. Accordingly, the categorization of images and text has attracted significant interest in recent years. However, categorization remains challenging, particularly when a text document or an image instance is associated with multiple concepts [1]–[14].