1. INTRODUCTION
Speaker diarisation is the task of determining ‘who spoke when’, and is essential to a range of applications, including speech dictation systems [1], [2]. It is commonly used as a pre-processing step for speech recognition, dividing long recordings into short speaker-homogeneous segments [3], [4]. The field is advancing rapidly, driven by progress in deep learning, and diarisation error rate (DER) is widely adopted as the primary evaluation metric.
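For concreteness, DER is conventionally defined as the sum of missed speech, false alarm speech and speaker confusion time, divided by the total reference speech time. The following is a minimal sketch of that ratio only; the function name `diarisation_error_rate` and the example durations are illustrative assumptions, and the sketch assumes the three error durations have already been obtained by aligning a system output against a reference annotation.

```python
def diarisation_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds. The error durations are
    assumed to come from a prior alignment of hypothesis and reference;
    this sketch does not perform that alignment.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations for a 10-minute recording (illustrative values only).
der = diarisation_error_rate(
    missed=12.0, false_alarm=6.0, confusion=18.0, total_speech=600.0
)
print(f"DER = {der:.1%}")
```

Because false alarm time is counted in the numerator but not in the denominator, DER can exceed 100% for a system that labels a large amount of non-speech as speech.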