Lesson 9B - the math of diffusion - YouTube Transcript

Hello everyone, my name is Wasim. I am an entrepreneur in residence at fast.ai, and I'm currently at the fast.ai headquarters in Australia, although I'm originally from Cape Town, South Africa. I'm joined here today by Tanishq. Tanishq works at Stability AI, and we've been working together with a couple of other people on diffusion models and generative modeling, which has been super fun. Tanishq, do you want to introduce yourself as well?

Yeah, so my name is Tanishq. I am a PhD student at UC Davis, but I also work at Stability AI, and I've been exploring and playing around with diffusion models for the past several months. It's been great to explore that with the fast.ai community in these last few weeks as well.

Awesome, cool. So this talk is me trying to understand the math behind diffusion. If you've done the fast.ai courses before, you know that you don't need to understand the math to be effective with any of these models. In fact, you don't even need the math to do novel research and contribute to these. But for me it came out of interest. I think it's kind of beautiful how diffusion models were discovered, and a large part of that was thanks to some really clever math, so I wanted to understand that. I don't have a math background, so I want to help describe how I think about it and how you can interpret all of these notations and things.

So I can just dive into it, I think. The first bit of math that we see in this paper is q of x superscript 0, written q(x^(0)),
and they call this the data distribution.

Do you want to mention exactly which paper this is?

Right, good question. This is the 2015 paper. Do you remember the authors of that paper, Tanishq?

I think it's Jascha Sohl-Dickstein, who I think now works at Google, and it's from Surya Ganguli's lab.

Cool, yeah. So as far as I understand, this was the paper that introduced this idea of diffusion, in 2015, by those authors. They start out by defining this data distribution, and they use this notation, which a lot of people, myself included, find quite confusing. But let's go through what's described here. They have an x, and in math x is often used as the input variable, much like y is often used as the output variable. The fact that it has a superscript also implies something: x superscript zero implies that there might be a sequence of x's. I think it's useful to get comfortable with this idea of simple, compact notation implying a lot more than might be obvious at first glance. So x tells us something about this quantity, that it's an input variable, and the zero implies that there might be other things: you might have an x1, an x2 and so on. We'll see that later. The third part is q, and q is what we call a probability density function. So what does q have to do with probabilities? Well, usually we use the letter p to describe probability density functions of interest, and because q comes right after p, it's another common one. It's kind of like how you use x and y; we use p and q. The fact that we use q here instead of p suggests that there might be a p that we'll introduce, and maybe p is the thing that we're modeling and q is kind of
supplementary to that. Does that sound right, Tanishq?

Yeah. And I think it's also helpful to think about x0 in a more practical, concrete way. If we're working with images, then x0 is what's representing the images, so it's useful to think about it from that concrete, practical angle as well.

Right, so x0 might be an MNIST digit. And then we've got q. q is some function, so we look at it as a box: it takes in x0 and it gives us the probability that this x0, which is an image, looks like an MNIST digit. So in this case this would be 0.9, or maybe even, yeah, let's go 0.98. So this is quite a high probability that this is an MNIST digit.

Hi, this is Jeremy, can I jump in for a moment?

Please do.

Thank you. I just wanted to double check: this looks a lot like the magic API that we had at the start of lesson one, where you feed in a digit and it gives you back a probability. Is that basically what q is doing here?

Absolutely, yeah, it's a magic API. That's a good way to think of it. We couldn't write down what q is, but we imagine that somebody has it somewhere. So this is a concrete example, and if you did something to this image, you might get a smaller number.

Another thing worth mentioning here is probability density functions. These are the magic APIs that give us a number telling us how likely the thing is. You don't often see them make it all the way to your code; in fact they very rarely appear in your code. But it turns out that they are very useful tools for working with random quantities, because they allow you to represent random quantities as just ordinary functions, and because they're functions you have centuries' worth of math to analyze and understand them. So you'll often find probability density functions in
papers, and eventually they work out to really simple equations or formulas that end up in your code. Do you want to add anything, Tanishq?

I think that sounds correct. I think you will probably go over some examples of probability density functions, especially ones relevant to this, but yeah, it's useful to think about the sorts of functions you may have in a simplified case, and that's what we're probably going to talk about next, right?

Yeah, that's exactly it. So we have this q(x0), and then we introduce another one, and like you said, this is going to turn out to have a really nice simple form. The next thing we define is q of x_t given x_{t-1}, written q(x_t | x_{t-1}). We will say what we define this to be, but to begin with, this is another probability density function, and this bar over here means it's a conditional probability density function, which you can think of as: you are given the thing on the right to calculate probabilities over the thing on the left. In this case you can think of it as something that takes images, so maybe another magic API, and produces other images. We don't know what these look like yet because we haven't defined it. This one we would call x_{t-1}, which could be x0, and this would be x_t, which in the x0 case would be x1.

Something worth noting here is that this notation can be a little bit confusing, because we said q is one thing earlier and now we say q is another thing. Here I'm going to need your help, Tanishq. I think in the stricter sense people would usually define the first one like this, maybe, and the second one with a subscript, and the notation that we see here on the left is just a shortcut where they wanted to save the space of writing that and implied it by what was in the parentheses. Is that true?

Yeah. I think here they use the q first, and later on we'll see p, to kind of describe, as
we'll see, different aspects of the diffusion model, the different processes of the diffusion model. So they use the same letter to show that this corresponds to one process, and then the other letter corresponds to the other process of the diffusion model. We'll obviously go over that. So I think that's why those letters are being used in that manner, but if you do want to make it more specific and more clear, that notation is fine as well.

All right, okay, that makes sense. So let's describe what this q does to the image on the left to produce the one on the right. I'll start over here so we have more space. I'll write it out first and then we can go into the details. [writes q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) x_{t-1}, I beta_t)]

So, kind of like the bar, you can think of this semicolon as grouping things together, so you have the thing on the left and the things on the right. My understanding is that these two things on the right are the parameters of the model, sorry, of the probability distribution. The thing on the left is, actually, Tanishq, could you help me understand what the thing on the left is?

Right. So this is again a probability distribution, and the thing on the left is saying which variable this is a probability distribution for. Then the stuff on the right are the parameters of this probability distribution. Any time you have a normal distribution describing some variable, you'll have that sort of notation: it's the normal distribution of some variable, and then these are the parameters that describe that normal distribution.

Right, so just to clarify, the bit after the semicolon is the bit that we're used to seeing to describe a normal distribution, which
is the mean and variance of the normal distribution. So we're going to be sampling random numbers from that normal distribution according to that mean and that variance. Is that right?

Yes, that's correct.

Yeah, so we need to say a bit more about the normal distribution; we kind of skipped past that. We have this fancy N, and fancy letters in math usually refer to well-known distributions. The N here stands for normal, which is also known as a Gaussian distribution. It's probably the most well-known probability distribution you can find, and when I say well known, I mean that these things pop up everywhere. In all sorts of fields, measuring all sorts of things, it turns out that they follow roughly something that looks like this distribution. Because they pop up so much, people have studied all of their properties and we understand them really well. The reason they're used often in cases like this is that they have really useful properties and they're easy to work with. One reason is that they are described by just two parameters, called the mean and the covariance. Another property is that they have what people would call thin tails, which means that you only need to describe their behavior in a small region of space; you can kind of just ignore the rest.

Do you mind drawing a quick example of a normal distribution?

That's a good point. Let's say our random variable is just one-dimensional, so just a single number, a float. This is sort of what the normal distribution would look like. In this case that would be our mean, and the variance would describe the spread over here, for which you'd use a small sigma because you're dealing with a single variable. In our case we use a capital sigma, which is the symbol for multiple variables or multiple dimensions. And yeah,
I also didn't say that this is the Greek letter mu. So it's capital sigma, mu, and lowercase sigma.

I just wanted to note that typically the lowercase sigma represents the standard deviation, which is the square root of the variance. So for example, in papers you may sometimes see sigma squared, and that's just the variance: sigma is the standard deviation, and sigma squared is then the variance.

Cool. We can also show with our example what this would look like. So we start out with an MNIST digit, put it through this magic API, and what would we get out? Okay, so, oh, something we didn't describe is what this I means.

Did you want me to talk about that, Wasim?

Ah, yes please.

Okay, sure. Actually, can I borrow your pen? This came up in the lesson we were doing in kind of an interesting way. Oh, okay, I'm in the video now. Hi Tanishq, nice to see you.

Hello.

Yeah, so in the lesson we did this thing for CLIP, I don't know if you remember, where we had the various pictures down here (I'm so embarrassed, you're better at the graphics tablet than I am, and it's my graphics tablet) and we had the various sentences along here. And we said it'd be kind of cool to take the dot product of their embeddings, because if their dot products are high, that means they're similar to each other. And if we subtracted the means from those first, then you've got the dot product. Now, instead of having images down here, what if we had the exact same vectors on each side? Then what you've got down here is basically x minus its average, if you subtract that first, squared, and that is the variance. So that's the variance for each one of these vectors. But what's
interesting, as you pointed out, is that normally at high school when we look at a normal distribution it looks like this. But you're not just doing one normal distribution, you've got a whole bunch of normal distributions for all of your different pixels. They're the pixels, right, Tanishq? It's the distribution of every pixel. So there's a whole bunch of them: one of them might have a normal distribution that's there, another might have one that's here, and another might have one that's like here. And it's more than that, because it's possible that one pixel tends to be higher when another pixel tends to be higher, or one pixel tends to be higher when another pixel is lower. So it actually has to create this kind of surface in an n-dimensional space, where n is the number of pixels. So now what happens if we multiply this by this, just like we did in CLIP? If this number is high, then it's saying that when this pixel is high, this other pixel tends to be high, and vice versa. If it's low, it's saying that when this pixel tends to be high, this one tends to be low. Or, interestingly for us (oopsie daisy, sorry about that), what happens if this is zero? That says that if this is high then this could be anything: there's no relationship between them. Statistically we would say that these two pixels are independent. And we could do that for all of these: we could say these are all zeros, and what that says is that every pixel is independent of every other pixel. Now of course, that's not how real pixels work in real pictures, but that's the assumption we're making. We start with a very special matrix called I, which has ones along the diagonal and zeros everywhere else. If we take this very special matrix,
it's very special because if I multiply it by a matrix I get back the original matrix. If I multiply it by a scalar, say beta, I'm going to get beta, beta, beta along the diagonal and lots of zeros, so if I multiply something by this matrix then I'm just multiplying it by beta. What's interesting about this is that this is what Wasim wrote: I times beta_t. So what he's saying is that we've now got a covariance matrix where for each individual pixel (pixel number one, pixel number two and so on) beta_t is the variance, and the covariances, the relationships between the pixels, are zero: they're expected to be independent. So that's where we're kind of going from the statistics you do in high school to the statistics you do at university: suddenly covariances are matrices, not individual numbers. Does that sound about right to you, Tanishq?

Yeah, that's a great explanation of it.

Awesome, cool. So now let's try to describe what this would do to MNIST digits. Let's put back our mean equation and our covariance, whoops, our covariance. So that's our mean and our covariance, and let's look at how this behaves at the edges. It's really hard to understand this directly; I don't think anybody can just look at this and know what it means. What we typically do is try to describe it at the edges, so we'll start with what happens if beta is zero. We'll also work with x0 here instead of x_{t-1}, which would mean an MNIST digit. If beta is zero, we get our x0 times the square root of one minus zero, which is one, and the square root of one is one, so that falls away and we just have a mean of our previous image. And the covariance is just zero. So we have a normal distribution with a mean of our previous image and a variance of zero, which means we have the same image.
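
As a rough illustration of the transition with mean sqrt(1 - beta_t) x_{t-1} and covariance I times beta_t, here is a minimal Python sketch of one forward step. It assumes the image is a flat list of pixel values; the names `forward_step`, `x_prev` and `beta` are made up for this example and do not come from the paper or the lesson code. Because the covariance is a multiple of I, each pixel gets its own independent noise.

```python
import math
import random

def forward_step(x_prev, beta, rng):
    """Draw x_t from N(sqrt(1 - beta) * x_prev, beta * I), pixel by pixel.

    x_prev: flat list of pixel values (a stand-in for an image).
    beta:   variance of the noise added at this step, between 0 and 1.
    rng:    a random.Random instance, so results are reproducible.
    """
    mean_scale = math.sqrt(1.0 - beta)   # shrinks the previous image
    noise_scale = math.sqrt(beta)        # standard deviation = sqrt(variance)
    return [mean_scale * pixel + noise_scale * rng.gauss(0.0, 1.0)
            for pixel in x_prev]

rng = random.Random(0)
x0 = [0.0, 0.5, 1.0, 0.25]             # hypothetical four-pixel "image"

same = forward_step(x0, 0.0, rng)       # beta = 0: mean is x0, variance is 0
noisy = forward_step(x0, 0.02, rng)     # small beta: a slightly noisy copy
```

With beta set to zero the function hands back the same pixel values, which is the edge case just described; any positive beta mixes in some noise.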
Yeah, just to clarify, when you have a variance of zero, that means there's really no noise or anything. It's all at that mean, and your distribution is just saying that's the only point you can get from it. So that's why it just becomes the same image: there's no noise because the variance is zero.

Yeah, exactly. And then when our beta is one, we have the square root of one minus one, which is zero, so this whole mean becomes zero, and the covariance becomes I times beta_t, which is just I. And if it's just I, then as Jeremy described, that implies a variance of one, so our image through this function would just be pure noise: mean of zero, standard deviation of one, just a bunch of noise. Somewhere in between, say over here, what would it produce? It would be some mixture, so maybe the lighter pixels of the eight plus some noise, maybe a bit darker. We can draw this, and you would have seen this in the previous lecture: you can draw the sequence of things that become progressively more noisy in very small steps, all the way until it becomes pure noise. This is what we call the forward diffusion process.

We can now describe some of these things. This would be a sample from our data distribution q(x0). This would be the conditional probability density function q of x1 given x0, and so on. The terminology that mathematicians use to describe this is a Markov process with Gaussian transitions. This can sound quite scary, but we've just described exactly what it is. When we say process, it usually means something where there's a sequence involved. When we say Markov, it means that the thing at time t depends only on the thing at t minus one. The transition is this function,
how you actually go from t minus one to t, and Gaussian is the fact that that transition is a normal distribution. Does that sound right?

Yes. Just to also clarify a couple of things: when we say that we're sampling from the data distribution, what that is referring to is trying to find some random data point that has a high likelihood. So we're looking at that magic API we were talking about, and we're trying to get some data points that have a high value from that API. For some distributions that are very simple, where we know how they work, like a Gaussian distribution, if you know the parameters it's very easy to do that sampling. In other cases it's quite difficult to do that sampling, so we have to figure out alternative ways of doing it. But that's why in this case, with the forward process, we just have these simple Gaussian transitions, and we already know the parameters of those Gaussian transitions, so we can easily do that sampling.

Going back to that, I think it's worthwhile to also show how this is done practically. One of the nice properties of Gaussian distributions is that you can simply take some normal noise with a mean of zero and a variance of one (I think they typically call that a unit normal, just normal of zero, one), and then, if you want to get to some other distribution with whatever mean and variance you specify, you can simply take that sample, multiply it by the standard deviation (the square root of the variance), and then add your mean. So there's a simple equation that you can use to get
at any particular mean and variance. That's how you would get the samples for these other distributions that we have defined throughout the forward process. So for example, when you're coding this up, a lot of these software libraries will have a way of getting a sample from the normal distribution of zero, one, and then you just use that equation to get it to the desired mean and variance. That's how it happens under the hood when you write this in code.

That's really helpful, yeah. And this idea that we can't really sample from this thing is exactly the problem that generative modeling is trying to solve: how do you represent this in such a way that you can easily sample from it? It turns out that if you have one of these processes where you have many, many steps, let's say a thousand of these steps going to the right, and they're all very small steps that eventually go to noise, somebody, I think maybe in the 1950s, discovered that you can represent the process of going backwards in exactly the same functional form, with just different parameters. What that means is that if we say p is the thing that goes back, so the previous one given the current one, this p has the same functional form. So the transitions are also normal, but the mean is some unknown, so we'll use a square, and the variance is some unknown, so we'll use a triangle. Is that correct?

Yeah, that's correct. And just going back to our previous point about p versus q, here we can see that q was describing the forward process, these steps that we're doing, and p is describing when we're going the reverse way. So that's why these papers use q for one process and p for the other. That's what they're kind of
indicating, at least in the diffusion model literature.

And p is kind of like x, it's the one we want to figure out. So q is kind of like y and p is kind of like x, is how I like to think of that. So we have this functional form, and the next question is how we can use it. We just don't know what these parameters are, so how can we figure them out? This goes back to early statistics literature, where you fit a model by maximizing what's called the likelihood function: we can try different parameters until we have ones that maximize the likelihood. It turns out that we can't quite do this exactly, because you would need to calculate some integral, and that integral is over very high-dimensional continuous values, so you can't actually calculate it.

I think you can think of it this way: we have these thousands of steps that we're trying to go through in this reverse process, and there are going to be many possible values for each step, so it's hard to evaluate over all these thousands of steps and all the possible values for each of them. That's where the challenge arises.

Mm-hmm. And so you might see people talk not about the likelihood function but about the log likelihood. Correct me if I'm wrong here, Tanishq, but I think the log is a bit of a computational trick, and it has a few useful properties. The first is that it's always increasing; people would call this monotonic. Because it's always increasing, you get the same
parameters whether you optimize the log likelihood or the likelihood. It also turns products into sums, and that's helpful because we have joint distributions, which turn out to be products. So we have a lot of products here and they become sums, which are easier to work with. The last thing is that this normal distribution has exponential functions, and those disappear with the log. So this is a much friendlier thing to optimize.

Yep, that's correct.

Cool. And then there's one more step. We still can't optimize the log likelihood of the thing that this eventually describes, but again, and this is kind of the beauty of math, somebody figured out a long time ago that there's a way to optimize some other quantity called the ELBO for short, which stands for evidence lower bound. The evidence is just another name for the likelihood, and the lower bound means it's a lower bound on the evidence. If you optimize that, it's almost as good as optimizing the thing that we really want to, but this one we can calculate very easily. So you can use this as a loss function to train two neural networks that predict our square from earlier, which was our mean, and our triangle, which is our variance, of this reverse process. Once you have that, you can go all the way back: you start with pure noise and keep calling these neural networks and sampling from those normal distributions, applying that iteratively over many steps, and you recover this data distribution.

One thing that's important to clarify here is that you can recover the whole distribution, but you can't necessarily take a single image, convert it to pure noise, and then convert it back. This operates at the distribution level. So you can take this magic API and reconstruct that whole API, and if you can do that, then you
can generate images: digits or cats or dogs or whatever you want.

I want to just clarify one thing about the loss function. With this evidence lower bound loss function, the approach is this: we have the forward process, where we can go from the original images and figure out these intermediate distributions going all the way to noise. With the evidence lower bound loss function, what we're really doing is trying to match the distributions that we're trying to optimize to those distributions that we saw in the forward process. And there's a specific type of function that is able to do that, called the KL divergence. That's a function that compares probability distributions, and again, because we're dealing with Gaussians, you can calculate it analytically and a lot of the math becomes very simple. We know Gaussians quite well, so that allows us to do this comparison between distributions very easily and optimize it. So we want to minimize the difference between the distributions we see in the forward process and the distributions we're trying to determine for the reverse process.

Perfect. Then there's one more major step to get closer to the form that you would have seen in Jeremy's lesson. There was the 2020 paper; the initials of that model are DDPM. Tanishq, do you know what this stands for?

Yeah, that stands for denoising diffusion probabilistic models.

Okay, cool. What they did was they said, let's assume that this variance is just a constant, so we don't learn it, and we also assume that the step size from earlier, the variance of the noise that
we add at each step, is also a constant, so we don't learn that. So we're just predicting the mean. When these are set to some really convenient values, the loss turns out to be that you predict the noise. So you can restructure this whole thing: you need to train a network that takes in images, so here's your network, and it tells you what part of this image is noise. That's thanks to these simplifying assumptions, and even with those assumptions, it turns out you can train models that produce much better images. Now I think this relates to something from the lesson that Jeremy gave. Tanishq, do you remember there was something about the gradient?

Yes. So, this idea of adding noise and learning to remove noise. Sorry, let me think about the best way to say this. Okay, let me start over. As Jeremy says in the lesson, what we want to do is figure out the gradient of this likelihood function. This is just a different way of thinking about it. If we had some information about this gradient, then we could use that information to produce images with high likelihood, like the optimization we talked about. The idea is that we can add noise to the images that we have, which are samples, and that takes us away from the regular images and decreases the likelihood. So we have those images, we're adding noise that decreases the likelihood, and we want to learn how to get back to high-likelihood images and use that to provide some sort of estimate of our gradient. So this sort of denoising
process actually allows us to do that. There are theorems, also I think from the 1950s, that demonstrate that, especially in the case of the Gaussian noise that we're working with, this denoising process is equivalent to learning what is known as the score function. The score function is the gradient of the log of the likelihood. So again they have this log here, which makes the math nicer and easier to work with, but the general idea is the same because, as we talked about, log is a monotonic function. So this denoising process allows us to learn the score function, and that's what we're doing with this noise prediction. We've had this whole probabilistic framework using the likelihood, and it came back down to just predicting the noise, which is what the DDPM paper showed in 2020. But it turns out that this is equivalent to calculating the score function and using that information to sample from our distribution. So that's how these two approaches connect. There's a lot of literature about the probabilistic likelihood perspective on diffusion models, and there's also a lot of literature about this score-based perspective, and hopefully this helps you think about the similarities and how the two approaches connect with each other.

Yeah, awesome. And that's the beauty, I think, of the math side of things: you find all of these relationships between different fields, and also between different centuries basically, and that allows you to do really powerful and unexpected things.

Okay, so let's do a quick recap of where we got to. We started out with our data distribution, which we want to model. We said we'll define this forward diffusion
process which is a way of kind of adding noise to this model and because we add it in this specific way thanks to you know some discovery in the 1950s the reverse process has the same form and then you know we already know how to train a neural network for this using the ELBO and then a couple years later came the discovery that with some simplifying assumptions in the end all we do is predict the noise and I just remembered we actually take the MSE of this noise prediction the mean squared error which is a nice very simple framing of the model and Tanishq spoke about another way to derive all of this which is the score function approach the gradient of the log likelihood okay cool um yeah I highly recommend checking out the course lesson as well if you haven't um you know if you don't understand this there's no need to be intimidated you can still be very effective without ever using math you can be very effective at deep learning as fast AI has shown us and you can do novel research as well for me it's interesting and you know it's even beautiful in a way so I recommend checking it out but don't feel intimidated you can find the course lesson links in the fast AI forum we'll add those links as well in the description of this video we'll also have a topic in the forum for this lesson you can have discussions there post any comments add any relevant links to the math and then we have another lesson video by Johno which I really recommend checking out he's a great teacher and I think he was the first person to do a full course on stable diffusion yeah Johno's video is kind of a deep dive into some of the code a little bit more and into some of the concepts a little bit more so I feel like between these three videos it's a good overview you know I think uh I mean just to clarify you don't need to understand all the math that was being described in this video that's not to say you won't need to
understand math we'll be covering lots of math in these lessons but we'll be covering just the math you need to understand and build on the code and we'll be covering it over many many more hours than this rather rapid overview perfect cool and yeah thank you so much Tanishq I had a lot of fun and thank you so much question that was awesome cool bye-bye
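
A small sketch of the recap above may help tie the noise-prediction loss to the score function. It is only an illustration under stated assumptions, not code from the video or from any of the papers mentioned: the names forward_noise, mse, score_from_noise and the value alpha_bar (the fraction of signal variance kept after noising) are hypothetical, the "image" is a toy list of four numbers, and no network is trained. It shows that the MSE on the noise is zero for a perfect predictor, and that for Gaussian noise the gradient of the log density of the noised sample is just the noise rescaled by a constant.

```python
import math
import random

def forward_noise(x0, alpha_bar, rng):
    """Noise x0 in one shot: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, 1), one value per pixel
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def mse(pred, target):
    """Mean squared error, the simple loss used for noise prediction."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def score_from_noise(eps, alpha_bar):
    """Score of the Gaussian q(x_t | x0) written in terms of the noise."""
    return [-e / math.sqrt(1.0 - alpha_bar) for e in eps]

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]  # toy "image" (assumption: four pixels)
alpha_bar = 0.6              # hypothetical noise level
xt, eps = forward_noise(x0, alpha_bar, rng)

# A perfect noise predictor returns eps itself, so its loss is zero.
perfect_loss = mse(eps, eps)
# A predictor that always answers "no noise" pays a positive loss.
zero_loss = mse([0.0] * len(eps), eps)

# Gradient of log N(x_t; sqrt(alpha_bar) * x0, (1 - alpha_bar)) with respect to x_t.
direct_score = [-(x_t - math.sqrt(alpha_bar) * x) / (1.0 - alpha_bar)
                for x_t, x in zip(xt, x0)]
score = score_from_noise(eps, alpha_bar)
```

In this toy setting score and direct_score agree up to floating point error, which is the sense in which predicting the noise and estimating the score carry the same information.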
