1. Introduction
Human pose forecasting aims to predict future human motion based on observed past motion [20], [31], [32], [35], [37], [53], [86]. Humans instinctively perform such tasks, allowing them to naturally navigate crowded areas or identify and circumvent potential dangers. For this reason, human pose forecasting plays an important role in various computer vision tasks [21], [23], [27], [54], [85], [91]. Indeed, recent years have seen a proliferation of work on multi-agent motion forecasting that aims to model complex multi-agent interaction [20], [47], [53], [71], [74].
Although various methods have been proposed, they share two major limitations. The first is a limitation on long-term prediction, as previous studies predicted up to 3 seconds at most [4], [47], [74], [75]. However, a sufficiently long forecast horizon is essential to fully leverage human pose forecasting for diverse downstream tasks such as identifying potential danger or understanding human behavior. The second is that multi-person interactions are not learned efficiently. Existing methods treat the joints of all people at once as the objects of interaction [47], [65], [74], resulting in excessive complexity with respect to the number of joints. Because of this inefficient modeling, these approaches perform poorly in long-term (3s+), multi-agent (6+) settings, which limits their practicality in complex real-world environments.
Human motion is goal-directed and influenced by other entities. Therefore, global intention contains hints for local intention, allowing us to infer local pose from global trajectories. Our method first forecasts global trajectories and then forecasts local poses conditioned on those trajectories. Inter-agent interactions are considered at both the trajectory level and the pose level.
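To make the coarse-to-fine ordering concrete, the following is a minimal single-person sketch of the decomposition only, not the proposed model. It assumes poses are stored as a `(T, J, 3)` array with joint 0 as the root, and it substitutes trivial placeholders for both learned stages: constant-velocity extrapolation for the global trajectory and a held last pose for the root-relative local pose. All function names are hypothetical, and inter-agent interaction is omitted.

```python
import numpy as np


def split_global_local(poses, root=0):
    """Split (T, J, 3) poses into a root trajectory (T, 3) and root-relative poses (T, J, 3)."""
    traj = poses[:, root, :]
    local = poses - traj[:, None, :]
    return traj, local


def forecast_trajectory(past_traj, horizon):
    """Placeholder for a learned trajectory model: constant-velocity extrapolation."""
    velocity = past_traj[-1] - past_traj[-2]
    steps = np.arange(1, horizon + 1)[:, None]  # (horizon, 1)
    return past_traj[-1] + steps * velocity     # (horizon, 3)


def forecast_local_pose(past_local, future_traj):
    """Placeholder for a learned pose model that receives the forecast trajectory.

    This stand-in only reads the horizon length from future_traj and holds the
    last observed root-relative pose; a learned model would condition on it.
    """
    horizon = future_traj.shape[0]
    return np.repeat(past_local[-1:], horizon, axis=0)  # (horizon, J, 3)


def forecast(poses, horizon):
    """Stage 1: global trajectory. Stage 2: local pose given that trajectory."""
    traj, local = split_global_local(poses)
    future_traj = forecast_trajectory(traj, horizon)
    future_local = forecast_local_pose(local, future_traj)
    # Recompose global joint positions from the two stages.
    return future_traj[:, None, :] + future_local
```

Calling `forecast(poses, horizon)` on an observed `(T, J, 3)` sequence returns a `(horizon, J, 3)` array of global joint positions. The point of the sketch is the interface: the pose stage takes the trajectory forecast as an input, so the second stage only has to model root-relative motion.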