1 Introduction
Convolutional Neural Networks (CNNs) [1], [2], [3], [4], [5], [6], [7] demonstrate a high capability of learning discriminative visual representations and generalize convincingly to a series of Computer Vision (CV) tasks, e.g., image recognition, object detection, and semantic segmentation. The de-facto recipe of CNN architecture design is based on discrete convolutional operators (e.g., 3×3 or 5×5 convolution), which effectively impose spatial locality and translation equivariance. However, the limited receptive field of convolution hinders the modeling of global/long-range dependencies, and such long-range interaction benefits numerous CV tasks [8], [9]. Recently, the Natural Language Processing (NLP) field has witnessed the rise of the Transformer with self-attention in powerful language modeling architectures [10], [11], which enables long-range interaction in a scalable manner. Inspired by this, there has been a steady momentum of breakthroughs [12], [13], [14], [15], [16], [17], [18] that push the limits of CV tasks by integrating CNN-based architectures with Transformer-style modules. For example, ViT [14] and DETR [13] directly process image patches or CNN outputs with self-attention, as in the Transformer. The works of [17], [18] present stand-alone designs of local self-attention modules that can completely replace the spatial convolutions in ResNet architectures. Nevertheless, previous designs mainly hinge on independent pairwise query-key interactions to compute the attention matrix, as in the conventional self-attention block (Fig. 1a), thereby ignoring the rich contexts among neighboring keys.
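To make this limitation concrete, the following is a minimal PyTorch-style sketch of the conventional self-attention block described above (function and variable names are illustrative, not drawn from any cited implementation): every attention weight arises from an isolated query-key dot product, so the keys never interact with their neighbors before the attention matrix is formed.

```python
import torch
import torch.nn.functional as F

def pairwise_self_attention(x, w_q, w_k, w_v):
    """Conventional self-attention over a flattened feature map.

    x:             (N, C) tensor of N spatial positions with C channels
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical, for illustration)
    """
    q = x @ w_q                      # queries, (N, C)
    k = x @ w_k                      # keys,    (N, C)
    v = x @ w_v                      # values,  (N, C)
    # Attention matrix from independent pairwise query-key products:
    # entry (i, j) depends only on q_i and k_j, never on the keys
    # neighboring k_j -- this is the context the text says is ignored.
    attn = F.softmax(q @ k.t() / (q.shape[-1] ** 0.5), dim=-1)  # (N, N)
    return attn @ v                  # aggregated values, (N, C)

# Usage example with random features (16 positions, 64 channels)
x = torch.randn(16, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = pairwise_self_attention(x, w_q, w_k, w_v)
print(out.shape)                     # torch.Size([16, 64])
```

As the comments indicate, each entry of `attn` is computed from a single query-key pair in isolation; no operation in this block aggregates information across neighboring keys before the attention weights are produced, which is precisely the gap Fig. 1a illustrates.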