Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries | IEEE Conference Publication | IEEE Xplore

Abstract:

Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (the triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Unlike existing Transformers, in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.
Date of Conference: 10-17 October 2021
Date Added to IEEE Xplore: 28 February 2022
Conference Location: Montreal, QC, Canada
1. Introduction

In this paper, we study problems such as visual relationship detection (VRD) [29], [21] and human-object interaction (HOI) [11], [35], [4], where a composite set with a two-level (part-and-sum) hierarchy is to be detected and localized in an image. In both VRD and HOI, the output consists of a set of entities. Each entity, referred to as a "sum", represents a triplet structure composed of parts: the parts are (subject, object, predicate) in VRD and (human, interaction, object) in HOI. The sum-and-parts structure naturally forms a two-level hierarchical output, with the sum at the root level and the parts at the leaf level. In the general setting for composite set detection, the hierarchy consists of two levels, but the number of parts can be arbitrary.
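The two-level output described above can be pictured as a small data structure. The following is only an illustrative sketch, not the paper's implementation: the class names `Part` and `Sum`, the `role`/`label`/`box` fields, and the `(x_min, y_min, x_max, y_max)` box format are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Part:
    """Leaf level of the hierarchy: one part of a composite entity."""
    role: str                  # e.g. "subject", "object", "predicate"
    label: str                 # class name predicted for this part
    box: Optional[Box] = None  # a predicate may have no box of its own


@dataclass(frozen=True)
class Sum:
    """Root level of the hierarchy: the composite entity as a whole."""
    parts: Tuple[Part, ...]    # any number of parts is allowed

    def labels(self) -> Tuple[str, ...]:
        # Part labels in the order the parts were given.
        return tuple(p.label for p in self.parts)


# A composite set is an unordered collection of sums; a list is used here.
detections: List[Sum] = [
    Sum(parts=(
        Part("subject", "person", (10.0, 20.0, 110.0, 220.0)),
        Part("object", "horse", (40.0, 90.0, 260.0, 300.0)),
        Part("predicate", "riding"),
    )),
]

for entity in detections:
    print(entity.labels())
```

In VRD and HOI every `Sum` holds exactly three parts, but nothing in this sketch fixes that number, which mirrors the general setting in which the number of parts can be arbitrary.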
