1. INTRODUCTION
Video moment retrieval (VMR) [1], [2], [3], [4], [5], which aims to locate the moment in a long video that best corresponds to a given natural-language query sentence, has become one of the most actively studied topics in the video understanding literature. Although promising results have been achieved, almost all existing VMR methods are designed for centralized data [6], [7], [8], [9], which limits their use in real-world applications. In practice, videos are often created and stored on personal cameras, CCTV systems, or other distributed devices. In such scenarios, aggregating data from different devices or datasets to enable large-scale VMR training incurs high transmission and storage costs. Furthermore, because videos may contain sensitive information, sharing them could leak that information, leading to a serious data privacy problem.