1. Introduction
We address natural language-based temporal localization in untrimmed videos. Given an untrimmed video and a natural language query, natural language-based temporal localization aims to determine the start and end times of the activities in the query. Compared to the activity temporal localization task [25], [12], [33], [34], [3], the natural language-based temporal localization has two unique characteristics. First, the natural language-based temporal localization is not constrained by a pre-defined activity label list; second, the language query can be more complex and may contain multiple activities. e.g. “person begin opening the refrigerator to find more food”.