1 Introduction
Learning disentangled structures of observations [1], [2] is a fundamental problem for controlling modern deep models and understanding the world. Conceptual understanding requires a disentangled representation that separates the underlying explanatory factors and makes the important attributes of real-world data explicit [3], [4]. For instance, given an image dataset of human faces, a disentangled representation can separate the face's appearance attributes, such as color, light source, identity, and gender, from its geometric attributes, such as face shape and viewing angle. Such disentangled representations are not only semantically meaningful for building more transparent and interpretable generative models, but also useful for a wide variety of downstream AI tasks, such as transfer learning and zero-shot inference, where humans excel but machines struggle [5]. It has also been shown that disentangled representations are more generalizable and robust against adversarial attacks [6].