Unsupervised Detection of Adversarial Examples with Model Explanations
Ko, G., & Lim, G. (2021). Unsupervised Detection of Adversarial Examples with Model Explanations. arXiv preprint arXiv:2107.10480.
In the last few years, adversarial attacks have become one of the main security threats to deep learning systems. An adversarial attack alters the behavior of a deep neural network by feeding it data samples that have been subtly modified. Even simple adversarial perturbations can cause a model to produce incorrect results and compromise the system that relies on it. The following example of an adversarial attack on a panda image illustrates what adversarial examples look like: a small perturbation is applied to the original image, and the attacker successfully causes it to be misclassified as a gibbon with high confidence.
Figure 1: An adversarial perturbation can manipulate a classifier to misclassify a panda as a gibbon.
To identify which inputs may have undergone adversarial perturbation, the proposed technique uses deep learning explainability methods. The idea comes from the observation that adding even small amounts of noise to an input changes its explanation considerably. As a consequence, a perturbed image produces an abnormal result when it is run through an explainability algorithm.
Many existing detection-based defenses detect adversarial attacks with supervised methods or by modifying the network itself, which often requires a great deal of computation and can reduce accuracy on clean examples. Several previous works rely on pre-generated adversarial examples, which leads to subpar performance against unknown attacks, and the high dimensionality of model explanations makes them computationally expensive. Other methods require less computation, but their transformations do not generalize and may only work for a specific dataset.
In contrast to many previous attempts, the proposed method detects attacks in an unsupervised manner. It does not rely on pre-generated adversarial samples, making it a simple yet effective way to detect adversarial examples.
In this method, a saliency map is used as the explanation map for detecting adversarial examples. For image inputs, each pixel is scored by its contribution to the final output of the deep learning model, and the scores are shown as a heatmap.
Figure 2: Examples of saliency map based on importance or contribution of each pixel.
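Concretely, such a saliency map can be computed in a few lines. Below is a minimal PyTorch sketch of an input-gradient saliency, assuming some trained classifier `model`; taking the absolute value of the gradient is a visualization choice, not something stated in the paper.

```python
import torch

def saliency_map(model, x, target_class):
    """Input-gradient saliency: how much each pixel of x influences the class score."""
    model.eval()
    x = x.clone().detach().requires_grad_(True)   # e.g. shape (1, 1, 28, 28) for MNIST
    score = model(x)[0, target_class]             # logit of the class of interest
    score.backward()                              # d(score)/d(pixel) for every pixel
    return x.grad.detach().abs().squeeze()        # magnitude heatmap, same size as the image
```

For example, `heatmap = saliency_map(cnn, image, predicted_label)` can then be plotted as a heatmap like the ones in Figure 2.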
There are three steps in this method:
Generating input explanations
Using an explainability technique, the inspector creates saliency maps for the data examples that were used to train the original model (the target classifier). With Φ𝑐 denoting the set of input explanations for output label 𝑐, we get:

Φ𝑐 = { 𝜙𝑥 : 𝑥 ∈ 𝑋𝑐 },

where 𝜙𝑥 is the explanation (saliency map) of input 𝑥 and 𝑋𝑐 is the set of training inputs belonging to class 𝑐.
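A minimal sketch of how the sets Φ𝑐 could be assembled, assuming a trained target classifier and its training set; grouping by the classifier's predicted label and using the raw (unsigned) input gradient as the explanation map are my assumptions here, not details confirmed by the paper.

```python
import torch
from collections import defaultdict

def input_gradient(model, x, c):
    """Explanation map: gradient of the class-c logit w.r.t. the input pixels."""
    x = x.clone().detach().requires_grad_(True)
    model(x.unsqueeze(0))[0, c].backward()
    return x.grad.detach().flatten()              # flattened saliency map (784 for MNIST)

def build_explanation_sets(model, dataset):
    """Phi[c]: explanations of training inputs that the classifier assigns to class c."""
    model.eval()
    phi = defaultdict(list)
    for x, _ in dataset:                          # ground-truth labels are not needed here
        c = model(x.unsqueeze(0)).argmax(dim=1).item()
        phi[c].append(input_gradient(model, x, c))
    return {c: torch.stack(maps) for c, maps in phi.items()}
```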
Training reconstructor networks
Using these saliency maps, the inspector trains reconstructor networks (autoencoders) capable of recreating each class's explanations, one network per class. In the handwritten digit case, for example, ten reconstructor networks are needed. When an input image is classified by the target classifier as a “1”, its saliency map is computed and fed into the class-“1” reconstructor network, which attempts to reconstruct it.
The training process is done by optimizing:

𝜃*𝑐 = arg min𝜃 LΦ𝑐(𝜃; 𝑔),

where LΦ(𝜃; ·) is the reconstruction loss of the parameterized network 𝑔(𝜃; ·) on the explanation set Φ.
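A sketch of the per-class training step, under the single-hidden-layer autoencoder setup described in the evaluation; the hidden size, Adam optimizer, and MSE reconstruction loss are my choices for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplanationAE(nn.Module):
    """Single-hidden-layer autoencoder over flattened explanation maps."""
    def __init__(self, dim=784, hidden=128):      # hidden size is an assumption
        super().__init__()
        self.encoder = nn.Linear(dim, hidden)
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, phi):
        return self.decoder(torch.relu(self.encoder(phi)))

def train_reconstructors(explanation_sets, epochs=50, lr=1e-3):
    """Train one reconstructor g(theta_c; .) per class by minimizing the loss on Phi_c."""
    reconstructors = {}
    for c, phis in explanation_sets.items():      # phis: (N_c, dim) tensor of explanations
        ae = ExplanationAE(dim=phis.shape[1])
        opt = torch.optim.Adam(ae.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = F.mse_loss(ae(phis), phis)     # L_Phi_c(theta; g) as an MSE loss
            loss.backward()
            opt.step()
        reconstructors[c] = ae
    return reconstructors
```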
Separating adversarial examples
The reconstructor networks are trained only on unperturbed examples. Hence, when presented with an adversarial example, whose explanation is abnormal, the reconstructor produces a poor reconstruction, which allows the inspector to detect adversarially perturbed images. If the reconstruction error of the explanation 𝜙′ of a given input 𝑥′ is higher than a given threshold 𝑡′𝑐, the input is flagged as an adversarial example.
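Putting the pieces together, detection reduces to thresholding the reconstruction error of the input's explanation. The sketch below reuses the `input_gradient` helper and the reconstructors from the earlier sketches; the quantile-based rule for picking the thresholds is one plausible choice, not necessarily the paper's.

```python
import torch

def fit_thresholds(reconstructors, explanation_sets, q=0.95):
    """Pick t_c as a high quantile of reconstruction errors on clean explanations."""
    thresholds = {}
    with torch.no_grad():
        for c, phis in explanation_sets.items():
            errors = torch.norm(reconstructors[c](phis) - phis, dim=1)
            thresholds[c] = torch.quantile(errors, q).item()
    return thresholds

def is_adversarial(model, reconstructors, thresholds, x):
    """Flag x if the explanation of its predicted class reconstructs poorly."""
    c = model(x.unsqueeze(0)).argmax(dim=1).item()
    phi = input_gradient(model, x, c)             # explanation of the (possibly perturbed) input
    with torch.no_grad():
        error = torch.norm(reconstructors[c](phi) - phi).item()
    return error > thresholds[c]
```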
The method is evaluated on the MNIST dataset. A simple CNN is used as the target classifier, saliency maps are generated with the input-gradients method, and each reconstructor network is a single-hidden-layer autoencoder. Adversarial examples are generated using the FGSM, PGD, and MIM attacks, and detection performance is measured by the Area Under the ROC Curve (AUC).
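For context, FGSM is the simplest of the three attacks: a single step in the direction of the sign of the loss gradient. The sketch below is a minimal illustration; the commented AUC computation assumes reconstruction errors are used as the detection score, consistent with the method described above.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def fgsm(model, x, y, eps=0.1):
    """Fast Gradient Sign Method: one signed-gradient step of size eps, clipped to [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

# Detection is scored by reconstruction error: clean inputs should score low, adversarial ones high.
# errors_clean, errors_adv = reconstruction errors for clean / FGSM-perturbed test inputs
# auc = roc_auc_score([0] * len(errors_clean) + [1] * len(errors_adv), errors_clean + errors_adv)
```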
Effect of input perturbations on explanations
Figure 3: Input, gradient, and reconstruction of an example MNIST image and of adversarial examples crafted from it. For each attack, an adversarial example with 𝜖 = 0.1 is created.
Adversarial perturbation of the input leads to a clear alteration of its explanation. The figure above shows that reconstructions of adversarial explanations contain noticeably more noise than those of non-adversarial explanations.
Adversarial detection performance
Figure 4: Area Under the Receiver Operating Characteristic (ROC) Curve as a function of attack severity (parameterized by 𝜖) for (a) FGSM, (b) PGD, and (c) MIM attacks. For each class label, the detector’s performance is recorded on adversarial examples created with the given (attack, 𝜖) pair. Grey areas show the min-max range of AUC across class labels, and black lines show the average AUC.
Overall, the method has difficulty detecting adversarial examples with low noise levels (𝜖 < 0.1). However, in the standard setting for the MNIST dataset (𝜖 = 0.1), the method performs well, with an average AUC of 0.9583 for FGSM, 0.9942 for PGD, and 0.9944 for MIM.
Quantitative comparison to previous approaches
Table 1: Comparison of adversarial detection accuracy between the proposed method (Ko & Lim) and existing approaches. The best and second-best results are highlighted in boldface and underlined text, respectively. All benchmarks are on the MNIST dataset.
The table shows that the proposed method achieves accuracy better than or on par with previously existing works.
As a means of securing deep learning models, the paper proposes using model explanations to address a key vulnerability of deep neural networks. It introduces a method that identifies inputs which may have undergone adversarial perturbation based on model explainability: even a small adversarial perturbation greatly changes the model explanation and produces an abnormal reconstruction.
In the experiments on the MNIST dataset, abnormal explanation maps appear for all of the evaluated adversarial attacks. This indicates that the method is attack-agnostic: it does not require pre-generated adversarial samples and generalizes to unseen attacks. The unsupervised detector was also found to detect various adversarial examples with performance comparable to or better than existing methods. Moreover, the approach is efficient, since the reconstructor networks only need to be trained once.
Despite these advantages, MNIST is a rather simple benchmark for evaluating the method. The dataset may fail to capture the complexity of real-world adversarial attacks, which leaves the method's applicability to more complex cases an open question. Further evaluation on more complex and realistic datasets and attacks is needed.
Gihyuk Ko
Carnegie Mellon University
Formal methods, security and privacy, and machine learning
Gyumin Lim
CSRC, KAIST
AI, cybersecurity
Github code: None