I. Introduction
Motion segmentation aims to identify and separate different motion patterns in video sequences, enabling the analysis and understanding of individual object motions or activities. It is an essential task in computer vision and video analysis, with applications in fields such as vehicle safety [1], surveillance [2], and autonomous driving [3]. Motion segmentation methods can be roughly divided into two categories: pixel-wise methods and keypoint-based methods. Pixel-wise motion segmentation [4], [5], [6], [7], also referred to as Video Object Segmentation (VOS) [8], takes dense correspondences of objects over a video as input and predicts a dense segmentation of the moving objects. In contrast, keypoint-based motion segmentation aims to partition a set of sparse points on the moving objects into non-overlapping clusters. This paper focuses on the latter, as we are interested in handling sparse, unordered data without any temporal component, motivated by the multi-model structures that arise from motion. Although numerous keypoint-based motion segmentation methods have been developed in recent years, their performance still falls short of practical requirements due to the complexity of object motions, occlusions, and noise.
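To make the problem setting concrete, the following minimal sketch (not the method proposed in this paper) poses keypoint-based motion segmentation as clustering of sparse point trajectories: each keypoint is described by its stacked (x, y) positions over F frames, and spectral clustering on a simple velocity-based affinity assigns each point to one of the motion groups. The trajectory representation, the affinity choice, and the use of scikit-learn are illustrative assumptions only.

```python
# Minimal sketch: keypoint-based motion segmentation as clustering of sparse
# point trajectories into non-overlapping motion groups (illustrative only).
# Assumes trajectories of shape (N, 2F): N keypoints tracked over F frames,
# with (x, y) coordinates stacked per frame.
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_keypoints(trajectories: np.ndarray, n_motions: int) -> np.ndarray:
    """Cluster N sparse keypoint trajectories into n_motions groups.

    trajectories: (N, 2F) array of stacked (x, y) positions over F frames.
    Returns an (N,) array of integer motion labels.
    """
    n = len(trajectories)
    # Frame-to-frame velocities approximate each point's motion pattern.
    velocities = np.diff(trajectories.reshape(n, -1, 2), axis=1).reshape(n, -1)
    # Pairwise distances between velocity profiles -> Gaussian affinity.
    d = np.linalg.norm(velocities[:, None, :] - velocities[None, :, :], axis=-1)
    affinity = np.exp(-d**2 / (2.0 * (np.median(d) + 1e-8) ** 2))
    labels = SpectralClustering(
        n_clusters=n_motions, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    return labels

# Toy example: two point sets undergoing different translations.
rng = np.random.default_rng(0)
F = 10
base = rng.uniform(0, 100, size=(40, 1, 2))
motion_a = base[:20] + np.arange(F)[None, :, None] * np.array([1.0, 0.0])
motion_b = base[20:] + np.arange(F)[None, :, None] * np.array([0.0, -1.5])
traj = np.concatenate([motion_a, motion_b]).reshape(40, -1)
print(segment_keypoints(traj, n_motions=2))
```

In this toy setting, points sharing a translation have identical velocity profiles, so the affinity matrix is block-structured and spectral clustering recovers the two motions; real sequences require more robust affinity models.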