1. Introduction
As the expressivity of deep learning models increases rapidly, so do their complexity and resource requirements [13]. In particular, vision transformers have demonstrated remarkable results for 3D point cloud semantic segmentation [56], [37], [16], [23], [32], but their high computational cost makes them challenging to train effectively. Additionally, these models rely on regular grids or point samplings, which do not adapt to the varying complexity of 3D data: the same computational effort is allocated everywhere, regardless of the local geometry or radiometry of the point cloud. This uniform allocation leads to needlessly high memory consumption, limits the number of points that can be processed simultaneously, and hinders the modeling of long-range interactions.