Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries

Abstract:

Computer vision applications such as visual relationship detection and human-object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (the triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Unlike existing Transformers, in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations, which are absent in standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human-object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.

Published in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Date of Conference: 10-17 October 2021

Date Added to IEEE Xplore: 28 February 2022

DOI: 10.1109/ICCV48922.2021.00353

Conference Location: Montreal, QC, Canada

1. Introduction

In this paper, we study problems such as visual relationship detection (VRD) [29], [21] and human-object interaction (HOI) [11], [35], [4], where a composite set with a two-level (part-and-sum) hierarchy is to be detected and localized in an image. In both VRD and HOI, the output consists of a set of entities. Each entity, referred to as a "sum", represents a triplet structure composed of parts: the parts are (subject, object, predicate) in VRD and (human, interaction, object) in HOI. The sum-and-parts structure naturally forms a two-level hierarchical output, with the sum at the root level and the parts at the leaf level. In the general setting for composite set detection, the hierarchy consists of two levels, but the number of parts can be arbitrary.
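The two-level output described above can be pictured as a small data structure. The following Python sketch is illustrative only and is not taken from the paper: the names `Part`, `Sum`, and `make_vrd_triplet`, the `(x_min, y_min, x_max, y_max)` box convention, and the choice to leave the predicate without a box are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Part:
    """A leaf-level element of the hierarchy (hypothetical type)."""
    role: str                  # e.g. "subject", "object", "predicate"
    label: str                 # class name predicted for this part
    box: Optional[Box] = None  # assumption: a part may carry no box


@dataclass(frozen=True)
class Sum:
    """A root-level entity that groups its parts (hypothetical type)."""
    parts: Tuple[Part, ...]    # any number of parts; three for VRD and HOI

    def labels(self) -> Tuple[str, ...]:
        """Return the part labels in the order the parts are stored."""
        return tuple(p.label for p in self.parts)


def make_vrd_triplet(subject: str, subject_box: Box,
                     obj: str, obj_box: Box,
                     predicate: str) -> Sum:
    """Build one VRD entity with parts ordered (subject, object, predicate)."""
    return Sum(parts=(
        Part("subject", subject, subject_box),
        Part("object", obj, obj_box),
        Part("predicate", predicate),
    ))


# The detector output is a set of sums; a list stands in for the set here.
detections: List[Sum] = [
    make_vrd_triplet("person", (10.0, 20.0, 110.0, 220.0),
                     "horse", (5.0, 90.0, 200.0, 300.0),
                     "riding"),
]
```

Storing the parts in a tuple of arbitrary length reflects the general setting mentioned above, in which the hierarchy has two levels but the number of parts per sum is not fixed.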
