There are only so many AI-generated cat videos you can watch before you start wondering how this technology works and, more importantly, why it even exists. As someone familiar with the basics of supervised and unsupervised machine learning, I found this curiosity a natural segue into learning how image generation works. Understanding the inner workings of these models is challenging, but I assure you that once the pieces of the puzzle start to fit and you see the beauty behind the math, it feels almost like watching the Mona Lisa being painted. Let me take you through my journey of understanding image generation models, focusing primarily on variational autoencoders.
The first step is always to look up a video on how image generation works, but to be honest I could comprehend very little by jumping straight to (what I now know is) the fourth step of understanding VAEs. So after conferring with Claude, here is how I approached it:
1. Refresh the Math
A solid foundation comes from understanding the basics first. A few math concepts are used heavily in these models: Bayes' theorem (best revisited via the famous 3Blue1Brown YouTube channel), along with refreshers on Markov chains, the chain rule of probability, and logarithm properties. Once these are fresh in your mind, you can jump to the more esoteric KL divergence. This is how I understood it:
Entropy: the measure of surprise.
For example, if you have a biased coin where the probability of getting heads is 0.9, you would be significantly more surprised if you got tails.
Therefore:

$$H(P) = -\sum_x p(x) \log p(x)$$
Cross Entropy: cross entropy measures how well Q approximates P. Suppose you think the coin is biased, with distribution Q, but in reality it's a normal coin with distribution P; then you can measure the cross entropy as:

$$H(P, Q) = -\sum_x p(x) \log q(x)$$
So finally, KL divergence measures how inefficient it is to encode samples from P using a code optimized for Q. Put simply, the difference between the probability distributions P and Q is:

$$D_{KL}(P \parallel Q) = H(P, Q) - H(P) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

i.e. cross entropy minus entropy: the surprise of viewing the true system through the believed one, minus the surprise of the true system on its own.
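The definitions above can be checked numerically. Here is a small sketch (plain Python, standard library only) computing entropy, cross entropy, and KL divergence for the coin example: the true coin P is fair, while our belief Q says heads with probability 0.9.

```python
import math

def entropy(p):
    """Average surprise of distribution p: H(P) = -sum p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Average surprise when reality follows p but we encode using q."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P); always non-negative."""
    return cross_entropy(p, q) - entropy(p)

# Reality: a fair coin. Belief: a coin biased 0.9 towards heads.
P = [0.5, 0.5]
Q = [0.9, 0.1]

print(f"H(P)     = {entropy(P):.4f}")
print(f"H(P, Q)  = {cross_entropy(P, Q):.4f}")
print(f"KL(P||Q) = {kl_divergence(P, Q):.4f}")
```

The further your belief Q drifts from the true P, the larger the gap between cross entropy and entropy grows.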
For more information please check out these resources:
Intuitively Understanding the KL Divergence
The key Equation behind Probability
KL Divergence blog
2. Understand the principles behind Autoencoders
Autoencoders are essentially a compression–decompression mechanism.
x -> Encoder -> z -> Decoder -> x'
Where z is a bottleneck and a point in latent space.
Say you take an image of a cat; this image has many pixels. You train a neural network to encode this image of a cat as a vector, a point in latent space. Let's say this image of a cat is [0.13, 0.13] (though in practice the latent space usually has dozens or hundreds of dimensions).
That [0.13, 0.13] is the z. The bottleneck.
The decoder then works the exact opposite way, taking z and generating a similar cat x'.
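To make the bottleneck concrete, here is a deliberately tiny sketch in plain Python. There is no neural network here; the encode/decode maps are hand-picked for illustration (a real autoencoder learns them by minimising reconstruction error). A 4-pixel "image" is squeezed through a 2-number z and expanded back.

```python
def encode(x):
    """Compress a 4-pixel image into a 2-dimensional latent point z."""
    z1 = 0.5 * (x[0] + x[1])   # average of the first two pixels
    z2 = 0.5 * (x[2] + x[3])   # average of the last two pixels
    return [z1, z2]

def decode(z):
    """Expand the bottleneck z back into a 4-pixel reconstruction x'."""
    return [z[0], z[0], z[1], z[1]]

x = [0.12, 0.14, 0.12, 0.14]   # our "cat", 4 pixels
z = encode(x)                  # the bottleneck: roughly [0.13, 0.13]
x_prime = decode(z)            # a similar, slightly blurred "cat"

print("z  =", z)
print("x' =", x_prime)
```

Notice the information loss: four numbers became two, so x' can only ever be an approximation of x. That compression is the whole point of the bottleneck.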
Resources:
https://youtu.be/hZ4a4NgM3u0?si=iuSxElY8t3tEHF0m
https://youtu.be/qiUEgSCyY5o?si=JrW1ed8WihDaw_FS
3. The issue with Autoencoders and need for Variational Autoencoders
Continuing the above example of our cat, say you give it an x2, an image of sunglasses, which might be encoded as [0.15, 0.14].
Now, if I check what [0.14, 0.14] decodes to in this latent space, it will be nothing? Or frogs, maybe.
But ideally it should be an image of a cat in sunglasses. So we do not encode z as a point but instead as a distribution with a mean and standard deviation.
Standard autoencoders do not enforce any structure on the latent space. As a result, sampling or interpolating between points may produce meaningless outputs. Variational autoencoders solve this by learning a probability distribution over the latent space (typically Gaussian), which ensures smooth and meaningful generation.
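A quick way to see why distributions help: if the cat is encoded as a Gaussian centred at [0.13, 0.13] and the sunglasses at [0.15, 0.14] (the standard deviation below is a made-up number for illustration), the in-between point [0.14, 0.14] has appreciable density under both, so during training the decoder also sees samples from that region instead of leaving it empty.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def latent_density(point, mu, sigma):
    """Density of an independent 2-D Gaussian at a latent point."""
    return gaussian_pdf(point[0], mu[0], sigma) * gaussian_pdf(point[1], mu[1], sigma)

cat_mu, glasses_mu = [0.13, 0.13], [0.15, 0.14]
sigma = 0.02                 # illustrative spread of each encoding
between = [0.14, 0.14]       # the point that was "nothing, or frogs" before

print(latent_density(between, cat_mu, sigma))      # non-zero
print(latent_density(between, glasses_mu, sigma))  # also non-zero
```

Because both encodings spread probability mass over the gap, the latent space stays smooth and interpolation becomes meaningful.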
Resources:
https://youtu.be/qJeaCHQ1k2w?si=ApWkgTLz2ew8UEl_
4. Understanding ELBO
This is the core mathematics behind variational autoencoders and even other probabilistic models such as diffusion models.
It starts with finding p(x), the probability of the image. But we don't know what z is because it is a latent (hidden) variable. The brute-force way would be to marginalise over it, integrating over all z:
$$p(x) = \int p(x \mid z)\, p(z)\, dz$$
But this is intractable :( so we do something else: we find the ELBO, i.e. the Evidence Lower BOund.

$$\text{ELBO} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \parallel p(z))$$
ELBO = Reconstruction term - Regularisation term
OR
ELBO = how well the decoder reconstructs x from z - how close the encoder's distribution is to the standard normal prior
The training objective of a VAE is to maximise the ELBO, the lower bound on the evidence log p(x), which pushes the evidence up and the KL gap down.
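The two ELBO terms translate directly into a loss function. Here is a standard-library-only sketch (the pixel values and latent parameters below are made up) computing the negative ELBO as a binary cross-entropy reconstruction term plus the closed-form KL between the encoder's diagonal Gaussian N(mu, sigma^2) and the standard normal prior.

```python
import math

def bce(x, x_recon, eps=1e-7):
    """Reconstruction term: binary cross-entropy between input and decoder output."""
    return -sum(
        xi * math.log(ri + eps) + (1 - xi) * math.log(1 - ri + eps)
        for xi, ri in zip(x, x_recon)
    )

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) for a diagonal Gaussian:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1 for m, lv in zip(mu, logvar))

# Made-up numbers for a 4-pixel image and a 2-D latent.
x       = [1.0, 0.0, 1.0, 0.0]
x_recon = [0.9, 0.1, 0.8, 0.2]   # decoder output
mu      = [0.13, 0.13]           # encoder mean
logvar  = [-2.0, -2.0]           # encoder log-variance

neg_elbo = bce(x, x_recon) + kl_to_standard_normal(mu, logvar)
print(f"loss (negative ELBO) = {neg_elbo:.4f}")
```

Minimising this loss maximises the ELBO: the BCE term pulls reconstructions toward the input, while the KL term pulls the encoder's distribution toward the standard normal.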
5. Code it!
Now you are finally ready to get your hands on the keyboard. I recommend starting by building a VAE on the basic MNIST dataset with PyTorch. Aladdin Persson has a great and very easy-to-follow video on this topic. You can check out my GitHub for more comments and explanations.
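Before starting, it helps to see the skeleton you are aiming for. Below is a minimal, untrained sketch of a VAE in PyTorch (the layer sizes are my own choices for MNIST's 28x28 = 784 flattened pixels, not from any particular tutorial), including the reparameterisation trick z = mu + sigma * eps that keeps the sampling step differentiable.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=200, latent_dim=20):
        super().__init__()
        # Encoder: image -> hidden -> (mu, logvar)
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: z -> hidden -> reconstructed image in [0, 1]
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps lets gradients flow through the sampling step
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + std * eps

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

# One untrained forward pass on a fake batch of 16 flattened images
x = torch.rand(16, 784)
x_recon, mu, logvar = VAE()(x)
print(x_recon.shape, mu.shape)
```

Training is then a loop of forward pass, negative-ELBO loss (reconstruction + KL), and an optimiser step; the video walks through exactly that.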
6. Explore
Now you can explore other models like DDPM, DDIM, etc.
To conclude, the whole process took me around 10 days of dedicated effort, and the one thing that stood out was the importance of the Gaussian curve in image generation. This is a beautiful piece of scientific achievement. For computers to be able to “understand” and generate images so smoothly is an extraordinary feat. And as rewarding as the process was, I still don’t understand the need for this technology to be easily accessible to the general public. The risks of deepfakes and non-consensual image generation far outweigh the benefits of having image generation software for easy and quick prototypes. At the very least it solves Bad Bunny's need to have more photos, or in Spanish, Debí tirar más fotos. Hope you had a good read!