1. Introduction
Audio-video (AV) synchronization is a basic expectation for anyone who consumes video, whether through streaming, social media, cable television, theaters, or any other form of media. From the lens of the camera to the eye of the consumer, there are many stages where errors can be introduced, such as content mastering, third-party modifications, content encoding, or client playback. Studies show that the viewer experience can be negatively affected by a discrepancy of a mere 45 milliseconds (ms) [7]; this is roughly the duration of a single frame (40 ms) at 25 frames per second (fps), or about one frame out of the 135,000 in a 90-minute film. While commercial solutions [1] exist, their scale and capabilities are insufficient for production. As a result, detecting and identifying the origin of synchronization issues remains a significant burden for quality control teams, as it is largely a manual process. Thus, there is a pressing need for an automated detection system that can accurately identify and resolve AV synchronization issues before they reach the viewer.