1. Introduction
Human-object interaction (HOI) detection is essential for high-level understanding of human-centric scenes. Given an image, HOI detection aims to localize human-object pairs and recognize the interactions between them, i.e., to predict a set of <human, object, action> triplets. Recently, vision Transformers [41], especially the DEtection TRansformer (DETR) [1], have started to revolutionize the HOI detection task. Two-stage methods use an off-the-shelf detector, e.g., DETR, to localize humans and objects concurrently, and then predict interaction classes from the localized region features. One-stage methods usually leverage the architecture of DETR, together with its pre-trained or fine-tuned weights, to predict HOI triplets from the global image context in an end-to-end manner.
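To make the output format concrete, the following is a minimal sketch of how a <human, object, action> triplet might be represented, and of the pair-enumeration step that a two-stage method performs after detection. All names here (`Detection`, `HOITriplet`, `enumerate_pairs`) and the `"person"` label convention are illustrative assumptions, not taken from any particular method or library; the interaction classifier itself is omitted.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

# Box format (x1, y1, x2, y2) is an assumption made for this sketch.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Detection:
    """One detector output: class label, bounding box, confidence."""
    label: str
    box: Box
    score: float


@dataclass(frozen=True)
class HOITriplet:
    """One <human, object, action> prediction."""
    human: Detection
    obj: Detection
    action: str
    score: float


def enumerate_pairs(detections: List[Detection]) -> List[Tuple[Detection, Detection]]:
    """Pair every detected human with every other detection.

    In a two-stage pipeline, each pair would then be passed to an
    interaction classifier (hypothetical, not shown here).
    """
    humans = [d for d in detections if d.label == "person"]
    return [(h, o) for h, o in product(humans, detections) if h is not o]


# Hypothetical detections for a single image.
detections = [
    Detection("person", (10.0, 20.0, 110.0, 220.0), 0.98),
    Detection("bicycle", (60.0, 120.0, 200.0, 260.0), 0.91),
    Detection("person", (250.0, 30.0, 330.0, 210.0), 0.87),
]
pairs = enumerate_pairs(detections)

# A triplet as it might look after interaction classification.
example = HOITriplet(human=pairs[0][0], obj=pairs[0][1], action="ride", score=0.8)
```

A one-stage method, by contrast, would predict such triplets directly from the image without an explicit pair-enumeration step.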