Recent advances in text-to-speech and voice-conversion technologies have reached such high quality that humans now find it difficult to distinguish synthetic from natural speech [1]. Rising concerns about potential misuse, identified for example by Europol [2], have been a focal point of the ASVspoof challenge [3], which has led to numerous AI-driven solutions [4]–[9].