Abstract
In this paper, we present a novel defense against backdoor attacks on deep neural networks (DNNs), wherein adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses that operate independently of how the model was generated. Our proposed defense is built upon an intriguing concept: given a backdoored model, we reverse engineer it to directly extract its backdoor functionality to a backdoor expert model. To accomplish this, we finetune the backdoored model over a small set of intentionally mislabeled clean samples, such that it unlearns the normal functionality while still preserving the backdoor functionality, and thus resulting in a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising robust backdoor input detectors that filter out the backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 SOTA backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB, and ImageNet) across multiple model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer). Our code is integrated into our research toolbox: https://github.com/vtu81/backdoor-toolbox.
Original language | English (US) |
---|---|
State | Published - 2024 |
Event | 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria Duration: May 7 2024 → May 11 2024 |
Conference
Conference | 12th International Conference on Learning Representations, ICLR 2024 |
---|---|
Country/Territory | Austria |
City | Hybrid, Vienna |
Period | 5/7/24 → 5/11/24 |
All Science Journal Classification (ASJC) codes
- Language and Linguistics
- Computer Science Applications
- Education
- Linguistics and Language