From high-school physics to GANs: essentials for mastering generative machine learning [2/2]

Alex Honchar
Towards Data Science
8 min read · Jan 15, 2021


Wave motion illustration http://animatedphysics.com/insights/modelling-photon-phase/

In the previous article, we built experiments in which we learned how to approximate physical-law models with machine learning algorithms, as a preamble to a "real" data generation process. In this article, we won't simply approximate the dependency between a time step and the exact position of the object; instead, we will generate whole trajectories as objects coming from a data distribution, and we will try to control this process and its variables as we do in classic mathematical models.

This article concludes the story of the evolution of mathematical modeling from human-designed-first to data-driven-first, and I hope it clarifies why we need generative modeling today, so that you can implement it in your R&D and product activities. As always, the source code is on my Github.

Learning a pure “generation” formula from data

Maximizing the likelihood of the model

How can we learn a distribution of complex data (instead of an approximating function) and then sample from it? The most common approach to building such a model is maximum likelihood estimation (MLE). The likelihood is a mathematical function that measures the "goodness" of fit of the mathematical model to the empirical data. For independent samples, the likelihood can be expressed as:

L(\theta \mid X) = p(X \mid \theta) = \prod_i p(x_i \mid \theta)

The likelihood function depends on the parameters theta of the model and the data X

To find the theta that maximizes this likelihood, we solve an optimization problem with various algorithms (for example, gradient-based methods or Markov chain Monte Carlo):

\theta^* = \arg\max_\theta L(\theta \mid X)

The optimal theta of the model maximizes the above-mentioned likelihood of the model given the data

Maximizing the likelihood is the same as minimizing the cross-entropy between the model and the data distributions, and this is what you can find in the literature and most of the tutorials on machine learning. But there is also another point of view on this task.
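To make this concrete, here is a minimal sketch (not from the article's notebook) of maximum likelihood estimation for a simple Gaussian model with SciPy; the toy data and the model are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy data: noisy observations from an unknown Gaussian.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=0.5, size=1000)

def negative_log_likelihood(theta):
    """Negative log-likelihood of a Gaussian model p(x | mu, sigma)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)  # parameterize via log to keep sigma positive
    return -np.sum(norm.logpdf(X, loc=mu, scale=sigma))

# Maximizing the likelihood == minimizing the negative log-likelihood.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # should be close to 2.0 and 0.5
```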

Minimizing the distance between model and data

Let's recall a few bits of information theory. The self-information of some event x and the corresponding Shannon entropy can be expressed as:

I(x) = -\log p(x), \qquad H(X) = \mathbb{E}[I(x)] = -\sum_x p(x) \log p(x)

Self-information formula / Shannon entropy formula

One of the typical tasks that information theory solves is comparing two distributions with some distance measure to determine how "far" they are from each other. Within our MLE framework, the likelihood already measures the "goodness" of the fit, and we aim to maximize it. Within the information-theoretic approach, we instead want to minimize the distance between the model and the data distributions, and we can use the Kullback-Leibler divergence to describe this distance:

D_{KL}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

Kullback-Leibler (KL) divergence formula between distributions P and Q
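As a quick sanity check (a toy example, not from the article's notebook), the KL divergence between two discrete distributions can be computed directly:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.7, 0.2, 0.1]  # "data" distribution
q = [0.5, 0.3, 0.2]  # "model" distribution
print(kl_divergence(p, q))  # positive: the distributions differ
print(kl_divergence(p, p))  # 0.0: the divergence vanishes only when P == Q
```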

Since the entropy of the data distribution does not depend on the model parameters theta, minimizing this divergence reduces to optimizing a single term:

\arg\min_\theta D_{KL}(p_{data} \parallel p_\theta) = \arg\max_\theta \mathbb{E}_{x \sim p_{data}}[\log p_\theta(x)]

The solution of the KL divergence minimization coincides with the maximization of the likelihood function

As we can see, minimizing KL divergence is the same as maximizing the likelihood function!

All of this looks a lot like what we are used to doing with neural networks, so why don't we try to apply them here?

Generative adversarial networks 101

Generative adversarial networks (GAN) training procedure. Picture from https://it.mathworks.com/help/deeplearning/ug/train-generative-adversarial-network.html

Unfortunately, in real life we don't know the data generation process exactly, and hence we can't use likelihood maximization / KL minimization directly to train such a model. A modern alternative for learning a data distribution is a min-max game between two neural networks: a generator and a discriminator. However, we still want to keep the connection to the basics and to keep minimizing some distance between distributions.

Let's assume that instead of having some analytical formula for the distance, we can have a learned divergence, which is an optimization problem by itself. Pay attention, because we're going very meta here. We want to learn one function (a divergence between distributions) that will at the same time be used to learn another function (a generating network that minimizes that divergence):

  • the “outer” function, which minimizes the “learned divergence”, we will call the generator:

G_\theta(z), \quad z \sim p_z(z)

The generator function depends on the parameters theta that have to be optimized

  • the second, “inner” function is trained to represent the probability that a data sample belongs to our data distribution. This “inner” function we will call the discriminator:

V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

The function V of the discriminator has to be maximized towards correct recognition of both real data samples and fake generated ones: maximizing it penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.

For the optimal discriminator, this min-max objective can be shown to be equivalent to minimizing the Jensen-Shannon divergence between the data distribution and the generator distribution. Altogether, the total loss function that should be optimized to find the weights of both neural networks (represented by the parameters theta and phi) looks like the following:

\min_\theta \max_\phi \; \mathbb{E}_{x \sim p_{data}}[\log D_\phi(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\phi(G_\theta(z)))]

The GAN optimization objective is a min-max game between 1) the discriminator, which is trained to tell whether a new data sample belongs to the target distribution, and 2) the generator, which learns to sample new data points that are as close as possible to the target distribution

The main idea here is that we still minimize some divergence between distributions with respect to the parameters of the generator, but since we don't know p_data(x), we learn it alongside in the form of a separate discriminator. It's very straightforward to implement in your favorite deep learning framework, as you can see on my Github or in many other tutorials.
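For orientation, here is a minimal sketch of the adversarial training loop in PyTorch. This is a simplified illustration, not the notebook's exact architecture: the trajectory length, noise dimension, network sizes, and hyperparameters are all assumptions.

```python
import torch
import torch.nn as nn

# Assumed shapes: each trajectory is a vector of 100 time steps,
# and the generator input is 16-dimensional noise.
TRAJ_LEN, Z_DIM = 100, 16

generator = nn.Sequential(
    nn.Linear(Z_DIM, 64), nn.ReLU(),
    nn.Linear(64, TRAJ_LEN), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(TRAJ_LEN, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones, zeros = torch.ones(batch_size, 1), torch.zeros(batch_size, 1)

    # Discriminator step: push D(x) -> 1 on real data, D(G(z)) -> 0 on fakes.
    z = torch.randn(batch_size, Z_DIM)
    fake_batch = generator(z).detach()  # no generator gradients in this step
    d_loss = bce(discriminator(real_batch), ones) + \
             bce(discriminator(fake_batch), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (non-saturating variant): push D(G(z)) -> 1.
    z = torch.randn(batch_size, Z_DIM)
    g_loss = bce(discriminator(generator(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

So, will the GAN learn our pendulum trajectories? Let's check some visualizations: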

The two right-most images show samples drawn from the GAN. You can find the detailed architecture in the corresponding notebook on my Github

Not the greatest results quality-wise, but these samples clearly come from our distribution and show good variety; let's leave the smoothing and prettification job to the advances covered in great GAN-related tutorials and courses like this one. We need to focus on making the generation controllable, as in transparent mathematical models where every variable has its own physical meaning. As of now, we only have random input noise, which has nothing to do with control.

Learning interpretable physical laws from the data

Most GAN inputs are random vectors from which random objects are generated. In mathematical modeling, the inputs can of course also be parts of a random vector, but each of those inputs is responsible for a single output property, which is not true for regular GANs (see why in another article of mine).

Schematic illustration of a variational autoencoder, image source: https://theailearner.com/2018/11/10/variational-autoencoders/

However, we have variational autoencoders (VAEs), which:

  • compress the data to a latent vector with a neural network called an encoder
  • add some noise to this vector and generate new objects from it (see the structure above) with another network called a decoder

This noisy vector could be just as uninterpretable as in GANs, but we can influence it with a regularization term in the loss function. This term (the beta-weighted KL term in the formula below) pushes all elements of the latent vector to be maximally independent of each other, which results in learning a different property of the data for each element of the latent vector (as opposed to the input random vector in GANs):

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta \, D_{KL}(q_\phi(z \mid x) \parallel p(z))

The beta-VAE objective: a reconstruction term plus a beta-weighted KL term that pushes the approximate posterior q_\phi(z|x) towards the factorized prior p(z), encouraging independent latent dimensions
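To make this concrete, here is a minimal beta-VAE sketch in PyTorch. It is an illustration under assumed shapes (trajectory length 100, latent size 6, as in the experiment described below), not the notebook's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TRAJ_LEN, LATENT_DIM = 100, 6  # assumed shapes matching the description

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(TRAJ_LEN, 64), nn.ReLU())
        self.to_mu = nn.Linear(64, LATENT_DIM)
        self.to_log_var = nn.Linear(64, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, TRAJ_LEN),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.to_mu(h), self.to_log_var(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling
        # differentiable with respect to the encoder parameters.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    # Reconstruction term + beta-weighted KL(q(z|x) || N(0, I));
    # a larger beta pushes the latent dimensions to be more independent.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```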

Conducting the same experiments as above, but using a VAE this time, we can achieve the following generation results:

The two right-most images show samples drawn from a VAE with beta = 1. You can find the detailed architecture in the corresponding notebook on my Github

As we can see, the samples are much cleaner compared to the GAN, which is indeed very nice! Now let's retrain this model with beta = 10 and manipulate the latent dimensions (I've set the latent size to 6, since we varied omega, theta, rope length, and the mass of the object to create the dataset, plus 2 additional dimensions that I expect to stay blank). What we want to see is that making one latent dimension bigger or smaller changes a single physical property of the object:

Samples drawn from a VAE with beta = 10. Each picture represents changes in one of the dimensions of the latent vector. The blue line is an original sampled trajectory, the green line shows an increase of the latent dimension (while keeping the others fixed), and the red one shows a decrease of it
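A traversal like the one above takes just a few lines. The sketch below assumes a trained `decoder` (for example, the one from the VAE sketch earlier) and the 6-dimensional latent space described above:

```python
import torch

LATENT_DIM = 6  # 4 varied physical properties + 2 extra dimensions

@torch.no_grad()
def traverse_latent(decoder, z, dim, delta=2.0):
    """Decode a base latent code and two copies with one dimension shifted."""
    z_up, z_down = z.clone(), z.clone()
    z_up[dim] += delta    # "green" trajectory: increased latent dimension
    z_down[dim] -= delta  # "red" trajectory: decreased latent dimension
    return [decoder(v.unsqueeze(0)) for v in (z, z_up, z_down)]

# Usage: pick a random latent code and sweep each dimension in turn.
# z = torch.randn(LATENT_DIM)
# for d in range(LATENT_DIM):
#     base, up, down = traverse_latent(vae.decoder, z, d)
```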

What we see is rather tricky… We can see that the last 3 dimensions almost don't change anything and act as "blank" codes in our VAE. However, it's hard to interpret the first three dimensions clearly. What we can state is that the first one changes the frequency of the generated wave, while the second and the third change the amplitude. It's not exactly what we planned while generating the data, but this is what the neural network sees! I recommend trying this technique out on other datasets. For example, in one of my previous experiments, I got a nice representation of heartbeats with respect to the pulse speed and some anomalies:

See more details on this experiment here: https://towardsdatascience.com/gans-vs-odes-the-end-of-mathematical-modeling-ec158f04acb9

Takeaways

In these two articles, we have built up the idea of generative machine learning from the first principles of mathematical modeling. With a simple example of oscillatory pendulum motion, we have implemented:

  • a classic human-designed mathematical model that samples trajectories from physical properties and time steps
  • a machine learning model that approximates the dynamics from noisy empirical observations and doesn't require human-designed formulas
  • a probabilistic machine learning model that can generate different trajectories based on the uncertainty at each time step
  • a generative adversarial network that generates a whole trajectory without needing the physical properties and time steps
  • a variational autoencoder that generates a whole trajectory and allows controlling its properties (although not the ones we expected)

Now it's time to work with more complex data! Let me know if this approach to explaining generative modeling was useful and whether I should write more on this: how to generate images, texts, sounds, and even tabular data, keeping in mind the idea of treating our fancy GANs as mathematical modeling tools.

P.S.
If you found this content useful and insightful, you can support me on Bitclout. Follow me also on Facebook for AI articles that are too short for Medium, on Instagram for personal stuff, and on LinkedIn! Contact me if you want to collaborate on interpretable AI applications or other ML projects.
