Understanding Adversarial Examples and Defence Mechanisms

April 16, 2024


Adversarial examples are inputs to Machine Learning (ML) models that are intentionally designed to fool the model. They are quite easy to generate: small, intentional feature perturbations are added to legitimate inputs, and as a result the model is pushed into making false predictions. In this article, we will cover adversarial examples and the defence mechanisms against them.

Adversarial examples are often indistinguishable from legitimate inputs to the human eye. Studying them matters because they expose vulnerabilities in ML models, and in some scenarios these attacks can cause serious security violations. It is therefore important to understand such inputs and to make ML models robust to adversarial attacks.


This article covers the following key topics:

  1. What is Adversarial Machine Learning?
  2. Common methods of generating adversarial examples
  3. Common defence mechanisms against adversarial attacks
  4. Why it is hard to defend against adversarial attacks

If you are interested in learning more about generative learning and Generative Adversarial Networks, do check out my book:

Let’s get started.


What is Adversarial Machine Learning?

Adversarial Machine Learning is the study of adversarial examples and of the defence mechanisms against them. Adversarial examples are deceptive inputs that are intentionally generated to trick ML models. Nowadays, ML models are used extensively in security- and safety-critical applications, where adversarial attacks can be seriously dangerous. It is therefore important to study such attacks and to prepare defence mechanisms against them.

Based on the intentions of the attacker, adversarial attacks can be classified into two categories: targeted attacks and untargeted attacks.

Targeted and Untargeted Adversarial Attacks

In an untargeted adversarial attack, the attacker's only intention is to make the ML model produce a wrong prediction, irrespective of what that prediction is. In a targeted attack, on the other hand, the attacker has a specific target class in mind: the goal is not just to make the model make a mistake, but to make it produce a wrong prediction that belongs to the desired output class.

Based on the information available to the attacker, adversarial attacks can also be classified into two categories: white-box attacks and black-box attacks.

White Box and Black Box Adversarial Attacks

A white-box attack is a scenario where the attacker has full access to the ML model. The attacker is already aware of the model’s architecture and its parameters.

In a black-box attack, the attacker can only observe the output of the ML model and doesn’t have any other information about it.

ML has become an essential part of almost every organisation and of its key business decisions. Many critical tools built for security and safety purposes are developed using ML algorithms today. Thus, the need to protect these ML systems is also growing rapidly.

In the next sections, we will learn about some common methods of generating adversarial
examples and defence mechanisms against them.


What are common methods of Adversarial Attacks?

As discussed earlier, adversarial examples are intentionally crafted to make the ML model produce a wrong prediction. These examples appear normal to humans, but they cause the target ML models to misclassify them. Many different methods of generating adversarial examples have been studied.

Following are some common methods of generating adversarial examples:

  1. Fast Gradient Sign Method (FGSM)
  2. Limited-Memory BFGS (L-BFGS)
  3. Jacobian-based Saliency Map Attack (JSMA)
  4. Deep Fool Attack
  5. Carlini & Wagner Attack
  6. Generative Adversarial Networks (GANs)
  7. Zeroth-Order Optimisation Attack (ZOO)

Let’s learn about these methods.


Fast Gradient Sign Method (FGSM)

Fast Gradient Sign Method, or FGSM, works by using the gradients of the neural network. Because FGSM needs access to the model's gradients, it falls under the white-box attack mechanisms. In this method, the input image pixels are perturbed using the gradient of the loss function with respect to the input, in such a way that the model's loss is increased. The newly generated input image is called the adversarial image.
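
To make the idea concrete, here is a minimal FGSM sketch in PyTorch. It assumes a differentiable classifier `model` that outputs logits, an input batch `x` with pixel values in [0, 1], integer labels `y`, and an illustrative `epsilon` of 0.03; these names are assumptions for the example, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    # Clone the input and track gradients with respect to its pixels.
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step every pixel by epsilon in the direction that increases the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Keep the result inside the valid pixel range [0, 1].
    return x_adv.clamp(0.0, 1.0).detach()
```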

The panda image shown in the following figure is a very popular adversarial example generated using the FGSM method. The image is taken from the research paper "Explaining and Harnessing Adversarial Examples" (2015) by Ian J. Goodfellow and fellow researchers.

Figure: Generating an adversarial example using FGSM, taken from the 2015 paper by Ian Goodfellow et al.

Let’s learn about L-BFGS now.


Limited-Memory BFGS (L-BFGS)

Limited-Memory BFGS, or L-BFGS, is a non-linear, gradient-based optimisation method that is used to minimise the perturbation added to an image while still causing a misclassification. It is quite effective at generating adversarial examples, but it is very computationally expensive.
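
As a rough sketch of the idea (not the original formulation), the perturbation can be optimised with PyTorch's built-in `torch.optim.LBFGS` optimiser, trading off the perturbation size against a targeted misclassification loss. The `model`, the batch `x` with pixels in [0, 1], the tensor of target labels `target_label`, and the constant `c` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lbfgs_attack(model, x, target_label, c=0.1, rounds=10):
    # The perturbation r is the variable being optimised.
    r = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.LBFGS([r], lr=0.01, max_iter=20)

    def closure():
        optimizer.zero_grad()
        # Trade off perturbation size against reaching the target class.
        loss = c * r.pow(2).sum() + F.cross_entropy(model((x + r).clamp(0, 1)), target_label)
        loss.backward()
        return loss

    for _ in range(rounds):
        optimizer.step(closure)
    return (x + r).clamp(0, 1).detach()
```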

Next, let’s learn about JSMA.


Jacobian-based Saliency Map Attack (JSMA)

The JSMA method uses feature selection to minimise the number of features that need to be perturbed to make the model misclassify an input. This method is also very computationally intensive.
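
The full JSMA saliency map is more involved, but the following simplified, greedy sketch captures the core idea: use the gradient (Jacobian) of the target-class score with respect to the input as a saliency map, and perturb one highly salient pixel at a time. It assumes a batch of one image with pixels in [0, 1] and a `model` that returns logits; `n_pixels` and `theta` are illustrative values.

```python
import torch

def greedy_saliency_attack(model, x, target_class, n_pixels=20, theta=0.2):
    # Simplified, greedy one-pixel-at-a-time variant of the saliency-map idea.
    x_adv = x.clone().detach()
    for _ in range(n_pixels):
        x_adv.requires_grad_(True)
        target_score = model(x_adv)[0, target_class]
        # The gradient of the target-class score with respect to every input
        # pixel acts as a crude saliency map.
        grad, = torch.autograd.grad(target_score, x_adv)
        x_adv = x_adv.detach()
        # Increase the single pixel that most raises the target-class score.
        idx = grad.view(-1).argmax()
        flat = x_adv.view(-1)
        flat[idx] = (flat[idx] + theta).clamp(0, 1)
    return x_adv
```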

Let’s now learn about deep fool attack.


Deep Fool Attack

The Deep Fool attack works by minimising the Euclidean distance between the perturbed input and the original input. Perturbations are added iteratively, based on the model's decision boundaries, until the prediction flips. This method of generating adversarial examples is also computationally expensive.
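
For the special case of a linear binary classifier f(x) = w·x + b, the minimal perturbation has a closed form; this is the building block that Deep Fool applies iteratively to a local linearisation of a deep network's decision boundary. A minimal sketch, with an illustrative overshoot factor:

```python
import numpy as np

def deepfool_linear_binary(w, b, x, overshoot=0.02):
    # Signed distance-like score of x relative to the boundary w.x + b = 0.
    f = np.dot(w, x) + b
    # Smallest Euclidean perturbation that reaches the boundary: move along
    # the boundary normal w, scaled so that w.(x + r) + b = 0.
    r = -(f / np.dot(w, w)) * w
    # A small overshoot pushes the point just past the boundary so the
    # predicted label actually flips.
    return x + (1 + overshoot) * r
```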

Let’s move to the next method.


Carlini & Wagner Attack

This method is based on the L-BFGS attack but is more efficient at generating adversarial examples, and it has been able to defeat certain defence mechanisms. Again, this attack is also very computationally expensive.
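
A heavily simplified sketch of the Carlini & Wagner objective for a targeted attack is shown below: it balances the squared size of the perturbation `delta` against a margin term that becomes inactive once the target-class logit beats every other logit by at least `kappa`. The full method also uses a change of variables to keep pixels in range and a binary search over `c`; those details are omitted here, and all names are illustrative.

```python
import torch

def cw_objective(logits, target_class, delta, c=1.0, kappa=0.0):
    # Best logit among all classes other than the target.
    others = logits.clone()
    others[:, target_class] = float("-inf")
    # Margin term: clamped at -kappa once the target logit beats every other
    # logit by at least kappa, i.e. once the targeted attack has succeeded.
    margin = others.max(dim=1).values - logits[:, target_class]
    # Total objective: keep the perturbation small while driving the margin down.
    return delta.pow(2).sum() + c * torch.clamp(margin, min=-kappa).sum()
```

In practice this objective would be minimised over `delta` with a gradient-based optimiser such as Adam.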

Let’s learn about the GAN-based attack now.


Generative Adversarial Networks (GANs)

Generative Adversarial Networks, or GANs for short, have also been used to generate adversarial examples. GANs can generate any desired number of plausible examples that resemble the training dataset. However, training a GAN is computationally intensive and can be highly unstable.

Let’s now learn about the ZOO attack.


Zeroth-Order Optimisation Attack (ZOO)

The ZOO attack is a black-box adversarial attack that works without any knowledge of the underlying ML algorithm. The method gathers gradient information indirectly, by querying the model with individually modified features and observing how the output changes. One common disadvantage of the ZOO technique is that it requires a large number of queries to the underlying ML model.
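
A minimal sketch of the query-based gradient estimation behind ZOO is shown below. It assumes a black-box `predict_fn` that returns class probabilities for an input, and it estimates the gradient of a cross-entropy-style loss coordinate by coordinate using symmetric finite differences; the coordinate subsampling and step size are illustrative choices.

```python
import numpy as np

def zoo_gradient_estimate(predict_fn, x, label, h=1e-4, n_coords=128):
    # Loss computed purely from the model's output probabilities,
    # i.e. without any access to its internals.
    loss = lambda inp: -np.log(predict_fn(inp)[label] + 1e-12)
    grad = np.zeros_like(x, dtype=float)
    # Estimating every coordinate is too expensive, so sample a random subset.
    coords = np.random.choice(x.size, size=n_coords, replace=False)
    for i in coords:
        e = np.zeros_like(x, dtype=float)
        e.flat[i] = h
        # Symmetric finite difference along one coordinate (two model queries).
        grad.flat[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return grad
```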

Now that we have learned about some common adversarial attack methodologies, let’s learn about some of the defence mechanisms against them.


How to defend against Adversarial Attacks?

Traditional regularisation techniques such as weight decay, dropout and so on generally don't provide a defence against adversarial attacks. The paper "Adversarial Attacks and Defences: A Survey" (2018) by Anirban Chakraborty and team describes some common adversarial attacks and defence mechanisms.

Following are some of the common defence mechanisms against adversarial attacks, as per their paper:

  1. Adversarial Training
  2. Defensive Distillation
  3. Gradient Hiding
  4. Feature Squeezing
  5. Blocking and Transferability
  6. Defense-GAN
  7. MagNet

Let’s discuss these mechanisms.


Adversarial Training

Adversarial training is a simple, brute-force way to defend ML models against adversarial attacks. In this method, a large number of adversarial examples are generated and the ML model is explicitly trained on them, so that it is not fooled by them.

In this way, the ML model becomes robust to the known types of adversarial examples. However, an attacker might still break this defence with a new attack mechanism that the model has not seen during training.
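
A minimal sketch of one adversarial training step is shown below. It reuses the `fgsm_attack` helper from the earlier FGSM sketch to craft adversarial versions of each batch and then trains on an even mix of clean and adversarial examples; the 50/50 weighting and `epsilon` are illustrative choices.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    # Craft adversarial versions of this batch; any attack could be plugged in.
    x_adv = fgsm_attack(model, x, y, epsilon)
    optimizer.zero_grad()
    # Train on an even mix of clean and adversarial examples.
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```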

Let’s move to the second technique now.


Defensive Distillation

This method is based on the concept of knowledge distillation for neural networks, where a small ML model is trained to imitate a much larger ML model in order to obtain computational savings. The small model, trained on the probabilistic (softened) outputs of the larger model, is more robust to adversarial attacks because its loss surface is smoothed in the directions an adversary would try to exploit. This makes it hard for the attacker to tweak adversarial examples into producing a wrong prediction from the ML model.
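
A minimal sketch of the distillation step is shown below: the teacher's logits are softened with a high softmax temperature, and the student is trained to match those soft labels. The temperature of 20 is an illustrative value, and `teacher` stands for any pre-trained classifier.

```python
import torch
import torch.nn.functional as F

def soft_labels(teacher, x, temperature=20.0):
    # Soften the teacher's predictions with a high softmax temperature.
    with torch.no_grad():
        return F.softmax(teacher(x) / temperature, dim=1)

def distillation_loss(student_logits, teacher_probs, temperature=20.0):
    # Cross-entropy between the student's softened predictions and the
    # teacher's soft labels; the student learns a smoother decision surface.
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return -(teacher_probs * log_probs).sum(dim=1).mean()
```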

Let’s now learn about Gradient Hiding method.


Gradient Hiding

As discussed earlier, some attack mechanisms, such as FGSM, rely on gradient information from the ML model. Gradient hiding is one way to make those attacks ineffective, by using non-differentiable models such as Decision Trees, KNN, Random Forests and so on.
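
As a simple illustration, swapping in a non-differentiable model such as a random forest means there are no input gradients for an attack like FGSM to follow. The scikit-learn example below uses a synthetic dataset purely for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A random forest exposes no gradients, so gradient-based attacks such as
# FGSM cannot be applied to it directly.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```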

Let’s now move to the next defence mechanism.


Feature Squeezing

Feature squeezing is a model-hardening method. In this method, the ML models are made less complex by reducing, or squeezing, the input features. Simpler ML models are more robust to small feature perturbations and noise. One disadvantage of this method is that such ML models are often less accurate because of their simpler nature.
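
One common squeezing operation is reducing the colour bit depth of the input, sketched below under the assumption that pixel values lie in [0, 1]; the choice of 4 bits is illustrative.

```python
import numpy as np

def squeeze_bit_depth(x, bits=4):
    # Map pixel values in [0, 1] onto a coarser grid of 2**bits levels,
    # wiping out the tiny perturbations an adversary relies on.
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```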

Let’s now learn about Blocking and Transferability method.


Blocking and Transferability

The main reason most well-known defence mechanisms are defeated is the strong transferability property of neural networks: adversarial examples generated against one classifier are likely to cause another classifier to make the same mistake. This transferability property holds even if the classifiers have different architectures or have been trained on disjoint datasets.

Hence, the key to protecting against a black-box attack is to block the transferability of the adversarial examples. Let’s now learn about the next mechanism.


Defense-GAN

The Defense-GAN mechanism leverages the power of Generative Adversarial Networks to reduce the effectiveness of adversarial perturbations. The central idea is to project the input image onto the range of the generator by minimising the reconstruction error, before feeding the image to the classifier. Because of this extra step, legitimate samples lie closer to the range of the generator than adversarial samples do, which substantially reduces any potential adversarial perturbation.
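
A minimal sketch of the projection step is shown below: starting from a random latent code, gradient descent searches for the code whose generated image best reconstructs the input, and the classifier is then fed that reconstruction. It assumes a pre-trained `generator` that maps a latent vector of size `z_dim` to an image with the same shape as `x`; the step count and learning rate are illustrative, and the original method additionally uses several random restarts.

```python
import torch

def project_to_generator(generator, x, z_dim=100, steps=200, lr=0.05):
    # Search for the latent code whose generated image best reconstructs x.
    z = torch.randn(x.size(0), z_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Minimise the reconstruction error between G(z) and the input image.
        loss = (generator(z) - x).pow(2).mean()
        loss.backward()
        optimizer.step()
    # The classifier is then fed G(z*) instead of the raw, possibly
    # adversarial, input.
    with torch.no_grad():
        return generator(z)
```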

Let’s now learn about MagNet.


MagNet

MagNet is an adversarial defence framework that treats the classifier as a black box: it reads only the outputs of the classifier's last layer, without reading the data at any internal layer or modifying the classifier. It uses detectors to distinguish normal samples from adversarial examples: a detector measures the distance between a given test example and the manifold of legitimate data, and rejects the sample if that distance exceeds a threshold. MagNet also uses a reformer, built from auto-encoders, to reform an adversarial example into a similar legitimate example.
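
A minimal sketch of the detector/reformer idea is shown below. It assumes an `autoencoder` trained only on legitimate examples and a `threshold` calibrated on held-out data; both are illustrative here.

```python
import torch

def magnet_detect_and_reform(autoencoder, x, threshold=0.01):
    with torch.no_grad():
        reconstruction = autoencoder(x)
        # Per-example reconstruction error, used as the distance between the
        # test example and the manifold of legitimate data.
        error = (reconstruction - x).pow(2).flatten(1).mean(dim=1)
        # Samples whose error exceeds the threshold are rejected as adversarial;
        # the rest are passed on in their reformed (reconstructed) version.
        accept = error <= threshold
    return accept, reconstruction
```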

Although MagNet is quite successful at thwarting a range of black-box attacks, its performance degrades significantly against white-box attacks, where the attacker is assumed to know the parameters of MagNet itself. To address this, the authors proposed using a variety of auto-encoders and randomly picking one at a time, making it difficult for the adversary to predict which auto-encoder is actually sitting behind the defence.

We have now covered quite a few defence mechanisms against adversarial attacks. Although these mechanisms are good at catching adversarial attacks, they can still be defeated by an attacker with sufficiently large computational resources. An absolute defence against adversarial attacks therefore remains a challenge.

Let’s now learn about the common challenges of defending against the adversarial attacks.


Why is it hard to defend against Adversarial Attacks?

Adversarial examples are hard to defend against because it is difficult to build a theoretical model of the process that creates them. Adversarial examples are solutions to an optimisation problem that is non-linear and non-convex for many neural networks.

ML models are also expected to produce good outputs for every possible input. Modifying a model considerably to make it robust against adversarial examples may change the model's elementary objective, and this can result in a badly performing model.

The defence strategies discussed earlier work well only against certain types of attacks and fail against new kinds of attacks, because they are not adaptive. If an attacker knows the defence strategies used in a system, that in itself becomes a major vulnerability. Moreover, implementing such defence strategies can introduce performance overhead as well as degrade model accuracy on the actual, intended inputs.

Thus, it is important to design powerful defence mechanisms that are adaptive, to protect ML systems from adversarial attacks. This is a growing area of research, and it will be interesting to see what the future holds.


Conclusion

In this article, we learned about adversarial examples, the common types of adversarial attacks, the common defence mechanisms against them, and the importance of designing appropriate defence mechanisms to protect ML-based intelligent systems.

Specifically, we learned:

  1. What adversarial examples and Adversarial Machine Learning are
  2. The difference between targeted and untargeted attacks, and between white-box and black-box attacks
  3. Common methods of generating adversarial examples, such as FGSM, L-BFGS, JSMA, Deep Fool, the Carlini & Wagner attack, GANs and ZOO
  4. Common defence mechanisms, including adversarial training, defensive distillation, gradient hiding, feature squeezing, Defense-GAN and MagNet
  5. Why it is hard to build an absolute defence against adversarial attacks

I hope this article was helpful. Do share your feedback by commenting below. See you in the next article!


Read Next>>>

  1. How Does a Generative Learning Model Work?
  2. Building Blocks of Deep Generative Models
  3. Generative Learning and its Differences from the Discriminative Learning
  4. Image Synthesis using Pixel CNN based Autoregressive Generative Models
  5. What are Autoregressive Generative Models?
  6. Best Practices for training stable GANs
  7. Understanding Failure Modes of GAN Training
