Building Blocks of Deep Generative Models

January 23, 2024


In this article, we will learn about some concepts that are essential if we want to thoroughly understand how a deep generative learning model works. In particular, we will look at the probabilistic ideas that allow generative learning frameworks to learn data distributions. These concepts are the basic building blocks of deep generative models. Specifically, we will cover Maximum Likelihood Estimation (MLE), KL Divergence and JS Divergence.

This article covers the following key topics:

  1. What are Deep Generative Models?
  2. Maximum Likelihood Estimation (MLE)
  3. Kullback-Leibler or KL Divergence
  4. Jensen-Shannon or JS Divergence

Let’s get started.

Check out my article on How Does a Generative Learning Model Work?


What are Deep Generative Models?

Recent advances in the field of deep learning have led to the development of complex generative models that are capable of generating high-quality content in the form of text, audio, pictures, videos and so on. Generative models that make use of deep learning architectures to tackle the task of learning distributions are known as deep generative models. Due to the flexibility and scalability of neural networks, deep generative models have become one of the most exciting and swiftly evolving areas of ML and AI. Deep generative modelling techniques have helped in developing modern AI agents that constantly generate and process vast amounts of data.

Check out my introductory article on “Generative Learning and its Differences from the Discriminative Learning”


Generative learning with deep neural networks is producing remarkable results today. A handful of deep generative learning frameworks remain a very active area of research. The following are three popular types of deep generative learning frameworks:

  • Autoregressive Generative Models
  • Variational Autoencoders
  • Generative Adversarial Networks

We will learn about these deep generative learning frameworks in my subsequent articles.

Before jumping right into the generative modelling frameworks, it's important to understand a few things about probability distributions. This knowledge will help us understand, and also discover, new ways of comparing distributions. We will now talk about the following concepts related to probability distributions:

  • Maximum Likelihood Estimation (MLE)
  • Kullback-Leibler or KL Divergence
  • Jensen-Shannon or JS Divergence

Let's learn about each of these concepts now.

If you are interested in learning more about generative learning and Generative Adversarial Networks, do check out my book.


Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation, or MLE for short, is a method of estimating (or learning) a probability distribution. Suppose that we are given a dataset and we want to learn its underlying distribution. In other words, we want our generative model to learn a distribution P_Model that best describes the given dataset. This means that any given sample x from the dataset must be very likely under the model, i.e. the likelihood value P_Model(x) should be high for all the samples in the dataset.


As the name suggests, the method 'Maximum Likelihood Estimation' takes this notion very seriously and provides an objective function that makes the given dataset (or observed data) most likely under the model distribution. The setting of the model parameters that makes the observed dataset most likely is known as the maximum likelihood estimate.

Statistical Definition

In statistical terms, the likelihood function is defined over the model parameter space, and the objective is to find the point in this space (a subset of Euclidean space defined by the possible ranges of the model parameters) that maximises the value of the likelihood function for the observed data. If the likelihood function is differentiable, we can use gradient-based optimisation (for example, gradient descent on the negative log-likelihood) to learn the optimal parameters. Let's now understand the likelihood function in mathematical terms.


Suppose we have a likelihood function f (or a model) defined over the parameter space θ. We can then define the log-likelihood objective ℓ̂ for the given data samples (x1, x2, ..., xn) by the following equation:

$$\hat{\ell}(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

This equation assumes that the data samples are independent and identically distributed (i.i.d.).


Quick question: why do we have a log in this equation?


Answer: It's often convenient to work with the log-likelihood; since the logarithm is a monotonically increasing function, the optimal solution θ remains the same.
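
In symbols, this monotonicity argument (a standard identity, stated here only for completeness) is simply:

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i \mid \theta)$$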


We have now defined a simple likelihood function. We can maximise this function for the given dataset to obtain the maximum likelihood estimate, i.e. a trained model that captures the dataset distribution. Now that we have a good idea of how MLE works, let's look at the second concept: Kullback-Leibler Divergence, or KL Divergence.
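
To make this concrete, here is a minimal sketch in Python (my own illustration; NumPy, SciPy and all the variable names are assumptions of mine, not something used in this article) that fits a simple Gaussian model to a toy dataset by minimising the negative log-likelihood, which is equivalent to maximising the likelihood.

```python
import numpy as np
from scipy.optimize import minimize

# Toy dataset: 1,000 samples whose underlying distribution we want to learn.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

def negative_log_likelihood(params, x):
    """Negative log-likelihood of a Gaussian model; params = (mu, log_sigma)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # log-parameterisation keeps sigma positive
    log_probs = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    return -np.sum(log_probs)  # sum over i.i.d. samples, negated for minimisation

# The maximum likelihood estimate is the parameter setting that minimises the NLL.
result = minimize(negative_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE estimates: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```

The recovered mean and standard deviation land close to the values used to generate the toy data, which is exactly what we expect from a maximum likelihood estimate on an i.i.d. sample.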


Kullback-Leibler or KL Divergence

Kullback-Leibler Divergence, also written as KL divergence or D_KL for short, is also known as relative entropy. KL divergence is a statistical way of quantifying the distance between two probability distributions. A relative entropy of zero confirms that the two distributions are identical.


Given any two discrete probability distributions P and Q defined over the same probability space S, the relative entropy can be calculated using the following equation:

$$D_{KL}(P \,\|\, Q) = \sum_{x \in S} P(x) \log \frac{P(x)}{Q(x)}$$

The relative entropy D_KL(P || Q) can be understood as the relative entropy of distribution P with respect to Q (or the divergence of P from Q). KL divergence is an asymmetric measure and does not obey the triangle inequality, and thus it does not qualify as a true statistical distance metric.


In simpler terms,

๐ท๐พ๐ฟ(๐‘ƒ || ๐‘„) โ‰  ๐ท๐พ๐ฟ(๐‘„ || ๐‘ƒ)

Similar to KL divergence, Jensen-Shannon Divergence (or JSD) is a way of comparing distributions. Let's see how it works.


Jensen-Shannon Divergence

Jensen-Shannon Divergence, or JSD for short, is again a method of measuring the similarity between two probability distributions, very similar to KL divergence. JSD is actually a clever modification of KL divergence that makes it symmetric, smooth and always finite. The square root of JSD is a true statistical metric, known as the Jensen-Shannon distance.


We can write the JSD between two probability distributions P and Q as:

$$JSD(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \qquad \text{where } M = \frac{1}{2}(P + Q)$$

As we can see from the above equation, JSD very cleverly makes use of KL divergence to arrive at a symmetric measure. We now have a good understanding of three popular probabilistic concepts: MLE, KL divergence and JSD. With this knowledge, we are ready to jump right into deep generative modelling frameworks. We will talk about these frameworks in my subsequent articles.
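
Here is one more short Python sketch (again my own illustration, reusing the same toy distributions and a copy of the KL helper from the previous snippet) that builds JSD exactly as in the equation above and verifies that it is symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q); same helper as in the KL divergence sketch above."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of P and Q from their mixture M."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution M = (P + Q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.4, 0.3])

print(js_divergence(p, q), js_divergence(q, p))  # equal values: JSD is symmetric
print(np.sqrt(js_divergence(p, q)))              # Jensen-Shannon distance (a true metric)
```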


Conclusion

In this article, we learned about some important concepts for learning and comparing distributions. These probabilistic concepts are key building blocks of deep generative models. In my subsequent articles, we will use these concepts to understand, develop and train models based on some popular generative learning frameworks.

Thanks for reading!

I hope this article was helpful and cleared some of your doubts. If you found it useful, kindly share it; if you spot any mistakes, please let me know by leaving your valuable feedback in the comments below. Until then, see you in the next article!


  1. Generative Learning and its Differences from the Discriminative Learning
  2. How Does a Generative Learning Model Work?
  3. Deep Learning with PyTorch: Introduction
  4. Deep Learning with PyTorch: First Neural Network
  5. Autoencoders in Keras and Deep Learning (Introduction)
  6. Optimizers explained for training Neural Networks
  7. Optimizing TensorFlow models with Quantization Techniques