1. Introduction
Although deep neural networks (DNNs) have achieved success in many research areas, they are vulnerable to attacks [1]. A backdoor attack, or Trojan, is an important type of attack in which a DNN classifier predicts the attacker's target class whenever a test sample from one or more source classes is embedded with the attacker's backdoor pattern [2]–[4]. A backdoor attack is typically launched by poisoning the classifier's training set with samples originally from the source classes, embedded with the same backdoor pattern that will be used during inference, and labeled to the target class [5]. Because a successful backdoor attack does not degrade the classifier's accuracy on clean test samples, it cannot be easily detected, e.g., by inspecting validation-set accuracy [6].
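The poisoning procedure described above can be sketched as follows. This is a minimal illustration, not a specific attack from the literature: the function names (`embed_backdoor`, `poison_training_set`) are hypothetical, and a simple patch-style backdoor pattern with a binary mask is assumed.

```python
import numpy as np

def embed_backdoor(image, pattern, mask):
    """Overlay a backdoor pattern onto an image where the mask is nonzero.

    image, pattern: float arrays in [0, 1] with shape (H, W, C).
    mask: binary array of shape (H, W, 1) selecting the patch location.
    """
    return image * (1.0 - mask) + pattern * mask

def poison_training_set(images, labels, source_class, target_class,
                        pattern, mask, poison_rate, rng):
    """Embed the backdoor pattern into a fraction of source-class samples
    and relabel them to the target class, as in training-set poisoning."""
    images = images.copy()
    labels = labels.copy()
    source_idx = np.flatnonzero(labels == source_class)
    n_poison = int(poison_rate * len(source_idx))
    chosen = rng.choice(source_idx, size=n_poison, replace=False)
    for i in chosen:
        images[i] = embed_backdoor(images[i], pattern, mask)
        labels[i] = target_class
    return images, labels

# Toy example: 20 images of two classes; poison half of class 0 -> class 1.
rng = np.random.default_rng(0)
images = rng.random((20, 8, 8, 3))
labels = np.array([0] * 10 + [1] * 10)
pattern = np.ones((8, 8, 3))          # all-white pattern
mask = np.zeros((8, 8, 1))
mask[:3, :3, 0] = 1.0                 # 3x3 patch in the top-left corner
poisoned_x, poisoned_y = poison_training_set(
    images, labels, source_class=0, target_class=1,
    pattern=pattern, mask=mask, poison_rate=0.5, rng=rng)
```

At inference time, the same `pattern` and `mask` would be applied to a clean source-class sample to trigger the target-class prediction; the classifier, trained on the poisoned set, behaves normally on samples without the patch.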