I. Introduction
Tracking the 6-DOF pose of a rigid object in monocular video is an essential problem in computer vision [1]. It is a core technology in applications such as augmented reality (AR), robotic perception, and human-computer interaction [2]–[4]. Recent research has demonstrated the advantages of region-based methods over other traditional approaches for real-time 3D object pose tracking [5]–[7], especially in difficult situations. The underlying statistical formulation of region-based methods makes them robust to certain kinds of pixel-level outliers, such as moderate lighting variation, background clutter, and minor occlusions. However, two main challenges remain for region-based methods and have limited their application in more complex configurations: (1) dealing with heterogeneous objects and backgrounds; (2) dealing with partial occlusions. We discuss these two challenges in detail below.