Lesson 22: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

JEREMY: All right, hi gang, and here we are in Lesson 22, joined by the 

legends themselves, Johno and Tanishq. Hello. TANISHQ: Hello. JEREMY: And today you'll be shocked to hear 

that we are going to look at a Jupyter Notebook. Amazing, right? We're going to look at notebook 

22. This is a pretty quick, just, you know, improvement, fairly simple improvement to our 

DDPM - DDIM implementation for Fashion-MNIST. And this is all the same so far, but what I've 

done is I've made some, one quite significant change and some of the changes we'll be making 

today are all about making life simpler. And they're kind of reflecting the way the 

papers have been taking things. And it's interesting to see how the papers have not only 

made things better, they've made things simpler. And so one of the things that I've noticed in recent papers is 

that there's no longer a concept of n steps, which is something we've always had before and 

always bothered me a bit, this capital T thing. You know, this t/T, it's basically saying this 

is time step number, say 500 out of 1,000, so it's time step 0.5. Why don't I just call 

it 0.5? And the answer is, well, we can. So we talked last time about the cosine scheduler. 

We didn't end up using it because I came up with an idea which was, you know, simpler and nearly 

the same, which is just to change our betamax. But in this next notebook, let's use the cosine 

scheduler, but let's try to get rid of the n_steps thing and the capital T thing. So here is 

abar again. And now I've got rid of the capital T. So now I'm going to assume that your time step 

is between 0 and 1. And it basically represents what percentage of the way through 

the diffusion process are you. So 0 would be all noise and 1 would 

be all, no sorry, other way around. 0 would be all clean and 1 would be all noise. 
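
As a rough sketch of the idea (not the notebook's actual code): assuming the cosine schedule from the earlier lesson has the form abar(t) = cos(t·π/2)², it can be written as a plain function of a float t in [0, 1]. The inverse, which the discussion turns to next, just undoes each operation in reverse order. The function names `abar` and `inv_abar` are illustrative, and this version uses scalar floats where the course would use tensors.

```python
import math

def abar(t):
    """Cosine schedule: fraction of signal left at time t in [0, 1] (0 = clean, 1 = all noise)."""
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    """Recover t from alpha bar: square root, then inverse cosine, then scale by 2/pi."""
    return math.acos(math.sqrt(a)) * 2 / math.pi

# Walk through a few time steps and check that the round trip gives t back.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    a = abar(t)
    print(f"t={t:.2f}  abar={a:.4f}  inv_abar(abar)={inv_abar(a):.2f}")
```

Because alpha bar is now computed rather than looked up, there is no table length (the old capital T) to keep in sync anywhere.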

So how far through the forward diffusion process. So other than that, this is exactly the 

same equation we've already seen. And I realized something else, which is kind of fun, 

which is you can take the inverse of that. So you can calculate t. So we would 

basically first take the square root and we would then take the inverse cos and we 

would then divide by pi over 2, or times 2 over pi. So we can both, so it's interesting now, we 

don't, the alpha bar is not something we look up in a list, it's something we calculate with a 

function from a float. And so yeah, interestingly, that means we can also calculate t from an alpha 

bar. So noisify has changed a little. So now when we get the alpha bar for our time step, we don't 

look it up, we just call it, call the function. And now the time step is a random float 

between 0 and 1. Actually between 0 and 0.999, which actually I'm sure there's a function I 

could have chosen to do a float in this range, but I just clamped it because I was 

lazy, couldn't be bothered looking it up. Other than that noisify is exactly the 

same. Right, so we're still returning the xt, the time step, which is now a float, and 

the noise. That's the thing we're going to try and predict, dependent variable, this 

tuple there as our inputs to the model. All right, so here is what that looks like. 
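
The noisify change just described can be sketched as follows. This is a minimal stand-in, not the notebook's code: it handles one image as a flat list of floats instead of a batch of tensors, and it assumes the cosine form of `abar` for the schedule. The structure is what matters here: draw a float t, clamp it to at most 0.999, compute alpha bar by calling a function, and return the (noised image, t) pair as the input with the noise as the target.

```python
import math
import random

def abar(t):
    # Assumed cosine schedule on t in [0, 1].
    return math.cos(t * math.pi / 2) ** 2

def noisify(x0, rng=random):
    """Noise one image (a flat list of pixel values) at a random continuous time step."""
    t = min(rng.random(), 0.999)                # random float, clamped to at most 0.999
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # the noise the model will try to predict
    a = abar(t)                                 # computed from t, not looked up in a list
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return (xt, t), eps                         # model inputs, then the target

rng = random.Random(0)
(xt, t), eps = noisify([0.1, -0.2, 0.3, 0.0], rng)
print(f"t={t:.3f}", [round(v, 3) for v in xt])
```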

So now when we look at our input to our UNet training process, you can see, you know, we've 

got a t of 0.05, so 5% of the way through the forward diffusion process, it looks like 

this, and 65% through it looks like this. TANISHQ: So now the time step, and 

basically the process is more of a kind of a continuous time step 

and a continuous process, rather, before we were having these discrete time 

steps, here we could, it's just any random value, it could be between 0 and 1. 

I think, yeah, that's also something. JEREMY: Yeah, which I found is more convenient, 

you know, to have a function to call. Yeah, I find this life a little bit easier. So 

the model's the same, the callbacks are the same, the fitting process is the same. And so something 

which is kind of fun is that we could now, we do now, create a little denoise function, 

so we can take, you know, this batch of data that we generated, the noisified data, so 

here it is again, and we can denoise it. So we know the t for each element, obviously, 

so remember t is different for each element now. And we can therefore calculate 

the alpha bar for each element, and then we can just undo the noisification to 

get the denoised version. And so if we do that, there's what we get. And so this is great, right? 
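
A hedged sketch of that one-step denoise: given the noised image, its t, and a noise prediction, solve the noisify equation for the clean image. The names `add_noise` and `denoise` are illustrative, the schedule is assumed to be the cosine `abar`, and the true noise stands in for a model's prediction here so the round trip can be checked exactly.

```python
import math

def abar(t):
    # Assumed cosine schedule on t in [0, 1].
    return math.cos(t * math.pi / 2) ** 2

def add_noise(x0, eps, t):
    """Forward step: xt = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    a = abar(t)
    return [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]

def denoise(xt, t, noise_pred):
    """Undo the noisification in one step, using a prediction of the noise."""
    a = abar(t)
    return [(x - math.sqrt(1 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise_pred)]

x0 = [0.5, -0.25, 0.0]
eps = [1.0, -0.5, 0.3]
xt = add_noise(x0, eps, t=0.65)
# With a perfect noise prediction the clean image comes back exactly (up to float error).
print(denoise(xt, 0.65, eps))
```

With a real model the prediction is imperfect, and the division by sqrt(abar) amplifies the error as t approaches 1, which is one way to see why the heavily noised examples come back as a mess.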

It shows you what actually happens when we run a single step of the model on varyingly partially 

noised images. And this is something you don't see very often, because I guess not many people 

are working in these kind of interactive notebook environments where it's really easy to do this 

kind of thing, but I think this is really helpful to get a sense of like, okay, if you're 25% of 

the way through the forward diffusion process, this is what it looks like when you undo that. If you're 95% of the way through it, this is what 

happens when you undo that. So you can see here, it's basically like, oh, I don't really know what 

the hell's going on, so it leaves a noisy mess. Yeah, I guess my feeling from looking 

at this is I'm impressed, you know, like this 45% noise thing, it looks all 

noise to me. It's found the long-sleeved top. And yeah, it's actually pretty close to the real 

one. I looked it up, or you might see it later, it's a little bit more of a pattern here, but it 

even gives a sense of the pattern. So it shows you how impressive this is. So this is 35%, you can 

kind of see there's a shoe there, but it's really picked up the shoe nicely. So it's, these are 

very impressive models in one step, in my opinion. So okay, so sampling is basically the 

same, except now, rather than using the range function to create our time 

steps, we use linspace to create our time steps. So our time steps start at, you know, 

if we did 1,000, it would be 0.999, and they end at 0, and then they're just 

linearly spaced with this number of steps. So other than that, you know, abar we now calculate, and the next abar is going to be 

whatever the current step is minus one over steps. So if you're doing 100 steps, then you'd be minus 

0.01. So this is just stepping through linearly. And yeah, that's actually it for changes. 
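
The sampling schedule just described can be sketched like this. It is an illustration under assumptions, not the notebook's sampler: `timesteps` is a pure-Python stand-in for a linspace call, the schedule is assumed to be the cosine `abar`, and clipping the next time step at 0 is my own choice for handling the final step.

```python
import math

def abar(t):
    # Assumed cosine schedule on t in [0, 1].
    return math.cos(t * math.pi / 2) ** 2

def timesteps(steps):
    """Linearly spaced times from 1 - 1/steps down to 0 (stand-in for linspace)."""
    start = 1 - 1 / steps
    if steps == 1:
        return [start]
    return [start * (1 - i / (steps - 1)) for i in range(steps)]

def abar_pairs(steps):
    """(current abar, next abar) for each sampling step; the next t is the current t minus 1/steps."""
    pairs = []
    for t in timesteps(steps):
        t_next = max(t - 1 / steps, 0.0)   # assumption: clip at 0 on the last step
        pairs.append((abar(t), abar(t_next)))
    return pairs

ts = timesteps(100)
print(ts[0], ts[1], ts[-1])        # starts just below 1 and ends at 0
print(abar_pairs(100)[:2])
```

The spacing of the time steps works out to exactly 1/steps, which is why subtracting 1/steps lands on the next time step in the list.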

So if we just do DDIM for 100 steps, you know, that works really well, we 

get a FID of three, which is actually quite a bit better than we had on 

100 steps for our previous DDIM. So this definitely seems like a good sampling approach. And I know Johno is 

going to talk a bit more shortly about, you know, some of the things that can make 

better sampling approaches. But yeah, definitely, we can see it making a difference here. Did you guys have anything you wanted to 

say about this before we move on? JOHNO: No, but it is a nice transition towards 

some of the other things we'll be looking at to start thinking about how do we frame this. And 

it's also good, like the idea… So the original DDPM paper has a thousand time steps and a lot 

of people follow that. But the idea that you don't have to be bound to that, and maybe it is worth 

breaking that convention. I know Tanishq made that meme about, you know, this 15 competing 

different standards for notation. But yeah, sometimes it's helpful to reframe it, okay, time 

goes from zero to one, that can simplify some things. It may complicate others, but yeah, it's 

nice to think how you can reframe stuff sometimes. JEREMY: Yeah, and in fact, where we will 

head today, by the time we get to notebook 23, we will see, you know, even simpler 

notation. And yes, simpler notation generally comes. I think what happens is over time, 

people understand better what's the essence of the problem and the approach, and 

then that gets reflected in the notation. So, okay, so the next one I 

wanted to share is something which is an idea we've been working on for a 

while, and it's some new research. So partly, I guess this is an interesting 

insight into how we do research. This is 22_noise-spread. And the basic idea of 

this was, well, actually, I'm going to take you through it to see what the basic idea is. So 

what I'm going to do is I'm going to create, okay, so Fashion-MNIST as before, but 

I'm going to create a different kind of model. I'm not going to create 

a model that predicts the noise, given the noised image and t. Instead, I'm 

going to try to create a model which predicts t, given the noised image. So why did I want to do 

that? Well, partly, well, entirely, because I was curious. I felt like when I looked at something 

like this, I thought it was pretty obvious, roughly, how much noise each image had. 

And so I thought, when we call the model, why are we passing in 

the noised image and the amount of noise or the t, given that I would have thought the model 

could figure out how much noise there is. So I wanted to check my contention, which is 

that the model could figure out how much noise there is. So I thought, okay, well, let's create a 

model that would try and figure out how much noise here is. So I created a different noisify now, 

and this noisify grabs an alpha bar t randomly. And it's just a random number between 0 and 1, 

you know, you want one per item in the batch. And so then after just randomly grabbing an alpha 

bar t, we then noisify in the usual way. But now our independent variable is the noised image, 

and the dependent variable is alpha bar t. And so we're going to try to create a model that 

can predict alpha bar t, given a noised image. Okay, so everything else is the same 

as usual. And so we can see an example… JOHNO: …you've got alpha 

bar t dot squeeze dot logit. JEREMY: Oh yeah, that's true. So the 

alpha bar t goes between naught and one. So we've got a choice. Like I mean, we don't have 

to do anything, but you know, normally if you've got say between zero and one, you might consider 

putting a sigmoid at the end of your model. But I felt like the difference between 0.999 and 

0.99 is very significant, you know. So if we do logit, then we don't need the sigmoid at the end 

anymore. It'll naturally cover the full range of kind of, you know, it'll be centered at zero, 

it'll cover all the normal kind of range of numbers, and it also will treat equal ratios as 

equally important at both ends of the spectrum. So that was my hypothesis was that using logit 

would be better. I did test it and it was actually very dramatically better. So without this logit 

here, my model didn't work well at all. And so this is like an example of where thinking about 

these details is really important. Because if I hadn't have done this, then I would have come 

away from this bit of research thinking like, oh, I was wrong. We can't predict noise, noise amount. 
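
To make the logit argument concrete, here is a small sketch using the standard definitions logit(p) = log(p / (1 − p)) and its inverse, the sigmoid. The helper names are illustrative. It shows that 0.5 maps to 0, and that the step from 0.99 to 0.999 is about as large in logit space as the step from 0.9 to 0.99, even though it looks tiny on the raw 0 to 1 scale.

```python
import math

def logit(p):
    """Log-odds of p; maps (0, 1) onto the whole real line, with 0.5 mapping to 0."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps a model output back to an alpha bar in (0, 1)."""
    return 1 / (1 + math.exp(-z))

for a in (0.001, 0.01, 0.1, 0.5, 0.9, 0.99, 0.999):
    print(f"abar={a:<6}  logit={logit(a):+.3f}")

# Gaps that look very different on the raw scale are comparable in logit space.
print("0.9  -> 0.99 :", round(logit(0.99) - logit(0.9), 3))
print("0.99 -> 0.999:", round(logit(0.999) - logit(0.99), 3))
```

Training against the logit means the model's raw output can be used as-is, and `sigmoid` recovers alpha bar when it is needed.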

Yeah, so thanks for pointing that out, Johno. Yeah, so that's why in this example of a 

mini-batch, you can see that the numbers can be negative or positive. So 0 would represent 

an alpha bar of 0.5. So here 3.05 is not very noised at all, whereas negative one 

is pretty noisy. So the idea is that, yeah, given this image, you would have to try to predict 

3.05. So one thing I was kind of curious about is, like, it's always useful to know is like, what's 

the baseline? Like, what counts as good? You know, because often people will say to me like, 

oh, I created a model and the MSE was 2.6. I'll be like, well, is that good? Well, 

it's the best I can do, but is it good? Or is it better than random? Or is it 

better than predicting the average? So in this case, I was just like, okay, well, what 

if we just predicted, actually, this is slightly out of date, I should have said 0 here, rather 

than 0.5, but nevermind close enough. So this is before I did the logit thing. So I basically 

was looking at like, what's the, you know, loss if you just always predicted a constant, 

which as I said, I should have put zero here, haven't updated it. And so it's like, oh, that 

would give you a loss of 3.5. Or another way to do it is you could just, just put MSE here 

and then look at the MSE loss between 0.5 and your various, just a single mini batch, which we, 

yeah, mini batch of alpha bar t’s, logits. Yeah, so, you know, we wanted to get some, you know, 

if we're getting something that's about three, then we basically haven't done any better 

than random. And so this case, this model, it doesn't actually have anything 

to learn. It always returns the same thing. So we can just call fit with 

train equals false just to find the loss. So these are just a couple of ways of 

quickly finding a loss for a baseline naive model. One thing that thankfully PyTorch will warn you 

about is if you try to use MSE and your inputs and targets have different shapes, it will broadcast 

and give you probably not the result you would expect, and it will give you a warning. So one way 

to avoid that is just to use dot flatten on each. So this kind of flattened MSE is useful to avoid 

both, avoid the warning and also avoid getting weird errors or weird, sorry, weird results. 
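
The baseline and the flattened loss can be sketched together. This is a pure-Python stand-in under stated assumptions: nested lists play the role of tensors, `flatten` and `flat_mse` are illustrative names, and the targets are synthetic (the logit of a uniformly random alpha bar, as in the noisify described above). In PyTorch the same idea would be to call `.flatten()` on both tensors before computing the MSE.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def flatten(x):
    """Flatten arbitrarily nested lists into one flat list of numbers."""
    if isinstance(x, (list, tuple)):
        return [v for item in x for v in flatten(item)]
    return [x]

def flat_mse(pred, targ):
    """MSE after flattening both sides, so an (n, 1) prediction lines up with an (n,) target."""
    p, t = flatten(pred), flatten(targ)
    assert len(p) == len(t), "prediction and target must hold the same number of values"
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)

rng = random.Random(0)
# Synthetic targets: logit of a uniform alpha bar, kept away from exactly 0 or 1.
targets = [logit(min(max(rng.random(), 1e-6), 1 - 1e-6)) for _ in range(20000)]
# A "model" that always predicts 0, shaped (n, 1) rather than (n,).
preds = [[0.0] for _ in targets]
print(f"constant-prediction MSE: {flat_mse(preds, targets):.2f}")
```

For reference, the logit of a uniform random number follows a standard logistic distribution, whose variance is π²/3 ≈ 3.29, so a constant prediction of 0 should score roughly that. Predicting 0.5 instead adds 0.25, which is consistent with the figure of about 3.5 mentioned above.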

So we use that for our loss. So the models, the model that we always use, so it's kind of nice. 

We just use our same old model. Nothing changes, even though we're doing something 

totally different. Oh, well, okay, that's not quite true. One difference is 

that our output, we just have one output now, because this is now a regression model. 

It's just trying to predict a single number. And so our Learner now uses MSE as the loss. 

Everything else is the same as usual. So we can go ahead and train it and you can see, okay, 

the loss is already much better than three, so we're definitely learning something and we end 

up with a 0.075 mean squared error. That's pretty good considering, you know, there's a pretty wide 

range of numbers we're trying to predict here. So I've got to save that as noise prediction 

on sigma. So save that model. And so we can take a look at how it's doing by 

grabbing our one batch of noise images, putting it through our tmodel. Actually, it's 

really an alpha bar model, but never mind, call it a tmodel. And then we can take a look 

to see what it's predicted for each one. And we can compare it to the actual for each 

one. And so you can see here, it said, oh, I think this is about 0.91. And actually it is 

0.91. So now here it looks like about 0.36. And yeah, it is actually 0.36. So you know, you 

can see overall 0.72, it's actually 0.72. Or it's actually right, this one's 0.02 off. But 

yeah, my hypothesis was correct, which is that we, you know, we can predict the thing that 

we were putting in manually as input. So there's a couple of reasons I was interested in 

checking this out. The first was just like, well, yeah, wouldn't it be simpler if we 

weren't passing in the t each time? You know, why pass in the t each time? But 

it also felt like it would open up a wider range of kind of how we can do sampling. The idea of 

doing sampling by like precisely controlling the amount of noise that you try to remove each 

time, and then assuming you can remove exactly that amount of noise each time feels limited to 

me. So I want to try to remove this constraint. So having built this model, I thought, okay, well, 

you know, which is basically like, okay, I think we don't need to pass t in. Let's try it. So what 

I then did is I replicated the 22_cosine notebook, I just copied it, pasted it in here. But I made 

a couple of changes. The first is that noisify doesn't return t anymore. So there's no 

way to cheat. We don't know what t is. And so that means that the UNet now doesn't 

have t, so it's actually going to pass 0 every time. So it has no ability to learn from t because it doesn't get t. So it 

doesn't really matter what we pass in. We could have changed the UNet to like remove 

the conditioning on t. But for research, this is just as good, you know, for finding 

out. And it's good to be lazy when doing research. There's no point doing something 

a fancy way when you can do it a quick and easy way before you even know if it's going 

to work. So yeah, that's the only change. So we can then train the model and we can 

check the loss. So the loss here is 0.034. And previously it was 0.033. So 

interestingly, you know, maybe it's a tiny bit worse at that, 

you know, but it's very close. Okay, so we'll save that 

model. And then for sampling, I've got exactly the same DDIM step as usual. 

And my sampling is exactly the same as usual, except now when I call the 

model, I have no t to pass in. So we just pass in this. I mean, I still 

know t because I'm still using the usual sampling approach, but I'm not passing it 

to the model. And yeah, we can sample and what happens is actually pretty garbage. 

22 is our FID. And as you can see here, you know, some of the images are still really 

noisy. So I totally failed. And so that's always a little discouraging when you think 

something's going to work and it doesn't. But my reaction to that is like, if I think 

something's going to work and it doesn't, is to think, well, I'm just going to have to do a 

better job of it. You know, it ought to work. So I tried something different, which is I thought 

like, okay, since we're not passing in the t, then we're basically saying like, how much noise 

should you be removing? It doesn't know exactly. So it might remove a little bit more noise than 

we want or a little bit less noise than we want. And we know from the, you know, testing we did, 

that sometimes it's out by like this case 0.02. And I guess if you're out consistently, 

sometimes it's, yeah, got to end up not removing all the noise. So the change I made 

was to the DDIM step, which is here. And… let me just copy this and unindent 

that section just to make it a bit easier to read. Okay. So the DDIM step, this is the normal 

DDIM step. Okay. And so step one is the same. So don't worry about that. Cause it's 

the same as we've seen before. But what I did was I actually used my tmodel. So I 

passed the noised image into my tmodel, which is actually an alpha bar model 

to get the predicted alpha bar. And this is, remember, the predicted alpha bar 

for each image, because we know from here that sometimes, so sometimes it did a pretty 

good job, right? But sometimes it didn't. So I felt like, okay, we need a 

predicted alpha bar for each image. What I then discovered is, sometimes that could 

be like really too low, right? So what I wanted to make sure of is it wasn't too crazy. So 

I then found the median for a mini batch of all the predicted alpha bars, and I clamped 

it to not be too far away from the median. And so then what I did when I did my 

x_0_hat is rather than using alpha bar t, I used the estimated alpha bar t for each 

image, clamped to be not too far away from the median. And so this way it was updating it 

based on the amount of noise that actually seems to be left behind rather than the assumed 

amount of noise that should be left behind. You know, if we assume it's 

removed the correct amount. And then everything else is 

the same. So when I did that, so whoa, made all the difference. And here it is. 

They are beautiful pieces of clothing. So 3.88 versus 3.2. That's possibly close enough. 
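
A minimal sketch of the modified step described above, under loudly stated assumptions: the per-image alpha bar predictions are taken to be already converted from logits to the 0 to 1 range, the clamp is done multiplicatively around the batch median, and the `factor` of 2 is a made-up value for illustration, not the one used in the notebook. `clamp_to_median` and `x0_hat` are hypothetical names.

```python
import math
import statistics

def clamp_to_median(abar_preds, factor=2.0):
    """Keep each predicted alpha bar within a factor of the batch median (hypothetical rule)."""
    med = statistics.median(abar_preds)
    lo, hi = med / factor, min(med * factor, 1.0)
    return [min(max(a, lo), hi) for a in abar_preds]

def x0_hat(xt, noise_pred, a):
    """Estimate the clean image using an estimated alpha bar instead of the scheduled one."""
    return [(x - math.sqrt(1 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise_pred)]

# One prediction in this batch is far too low; clamping pulls it back toward the rest.
preds = [0.50, 0.52, 0.48, 0.0001, 0.51]
safe = clamp_to_median(preds)
print(safe)

# Per-image estimate of x0 using that image's own clamped alpha bar.
print(x0_hat([0.3, -0.1], [0.2, 0.4], safe[3]))
```

Without the clamp, the outlier would divide by the square root of 0.0001 and blow the estimate up by a factor of 100, which matches the exploding min and max values described in the debugging discussion.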

Like I'd have to run it a few times, you know, my guess is maybe it's a 

tiny bit worse. But it's pretty close. But like this definitely gives 

me some encouragement that, you know, even though this is like something 

I just did in a couple of days, whereas the kind of with-t approaches have been 

developed since 2015 and we're now in 2023. I, you know, I would expect it's quite likely 

that these kind of like no, no t approaches could eventually surpass the t based approaches. And like one thing that definitely makes me 

think there's room to improve is if I plot the FID or the KID for each sample 

during the reverse diffusion process, it actually gets worse for a while. I'm 

like, okay, well that's, that's a bad sign. I have no idea why that's 

happening, but it's a sign that, you know, if we could improve each step, then 

one would assume we could get better than 3.8. So yeah, Tanishkq, John, I don't have any thoughts 

about that or questions or comments or… JOHNO: Maybe to just like to highlight the 

research process a little bit, it wasn't like this linear thing of like, oh, here's this issue not 

performing as well as we thought, oh, here's the fix. We just kept this. You know, this was like 

multiple days of like discussing and like Jeremy saying like, you know, I'm tearing my hair out. You 

guys have any ideas? And, oh, what about this? And, oh, and they're just in the DDIM paper, 

they do this clamping, maybe that'll help, you know? So there's a lot of back 

and forth and also a lot of like, you saw the code that was commented out there, 

print xt.min, xt.max, alpha bar, pred, you know, just like seeing, oh, okay. You know, my 

average prediction is about what I would expect, but sometimes the min or the max goes, you 

know, 2, 3, 8, 16, 150, 212 million, infinity, you know, maybe like one or two little 

baddies that would just skyrocket out. Yeah. And so that kind of like debugging 

and exploring and printing the results. JEREMY: And actually our initial discussions 

about this idea, I kind of said to you guys before Lesson 1 of Part 2, I said, like, it 

feels to me like we shouldn't need the t thing. And so it's actually been like bumbling 

away in the background for months. Yeah. JOHNO: And I guess, I mean, we should also 

mention, we have tried this, like a friend of ours trained a no-t version of stable diffusion for 

us. And we did the same sort of thing. I trained a pretty bad t predictor and it sort of generates 

samples. So we're not like focusing on that large scale stuff yet, but it is fun to like, every 

now and again, got this idea from Fashion-MNIST, we are trying these out on some bigger models 

and seeing, okay, this does seem like maybe it'll work. And so down the line that future 

plan is to say, let's actually, you know, spend the time train a proper model and see, 

yeah, see how well that does. If it's interesting. JEREMY: You say a friend of ours, we can be more 

specific. It's Robin, one of the two lead authors of the Stable Diffusion paper who yeah, actually 

has been fine tuning a real stable diffusion model which is without t and it's looking super 

encouraging. So yeah, that'll be fun to play with, with this new, you know, we'll have to train 

a t predictor for that. See how it looks. Yep. All right. So I guess the other area we've been 

talking about kind of doing some research on is this weird thing that came up over the last two 

weeks with our bug in the DDPM implementation, where we accidentally weren't doing it 

from minus 1 to 1 for the input range, it turned out that actually being from minus 

one to one wasn't a very good idea anyway. And so we ended up centering it as 

being from minus 0.5 to 0.5. And Johno and Tanishq have managed to actually 

find a paper. Well I say find a paper. A paper has come out in the last 24 hours which has coincidentally cast some light on this 

and has also cited a paper that we weren't aware of which was not released in the last 24 hours. So 

Johno, are you going to tell us a bit about that? JOHNO: Yeah, sure I can do that. So it's 

funny, this was such perfect timing because I actually got up early this morning planning 

to run with the different input scalings and the cosine schedule that Jeremy was showing 

and some of the other schedulers we look at, I thought it might be nice for the lesson 

to have a little plot of like what is the FID with these different solvers and input 

scalings, but it was going to be a lot of work. I was like, I'm not looking forward to doing 

the groundwork. And then Tanishq sent me this paper which AK [@_akhaliq] had just tweeted 

out because he reviews everything that comes up on arXiv every day: “On the Importance 

of Noise Scheduling for Diffusion Models”. And this is by a researcher at the Google Brain 

team who's also done a really cool recent paper on something called a Recurrent Interface 

Network outside of the scope of this lesson, but also worth checking out. Yeah, 

so this paper they're hoping to study this noise scheduling and the strategies that 

you take for that and they want to show that number 1) noise scheduling is crucial for 

performance and the optimal one depends on the task. When increasing the image size, 

the noise scheduling that you want changes and scaling the input data by some factor 

is a good strategy for working with this. JEREMY: And that's the bit 

we've been talking about, right? JOHNO: Yeah, that's what we've been doing 

where we said, oh, do we scale from minus 0.5 to 0.5 or minus 1 to 1 or do we normalize? 

And so they demonstrate the effectiveness by training a really good high resolution 

model on ImageNet. So a class-conditioned model. JEREMY: It looks great. JOHNO: Yeah, amazing samples. 

I'll show them later. So I really liked this paper. It's very short 

and concise and it just gets all the information across. And so they introduced us here. We have 

this noising process our noisify function where we have square root of something times X plus 

square root of one minus that something times the noise. And here they use gamma, gamma 

of t, which is often used for the continuous time case. So instead of the alpha bar and the 

beta bar schedule for a thousand time steps, there'll be some function gamma of t that 

tells you what your alpha bar should be. JEREMY: Okay. So that's how our function is 

actually called abar, but it's the same thing. JOHNO: Yeah. Same thing. Takes in a time step from 

0 to 1 and then that's used to noise the image. JEREMY: Interestingly, what they're showing here 

actually is something that we had discovered and I've been complaining about that my DDIMs with 

an eta of less than one weren't working, which is to say when I added extra noise to the image, it 

wasn't working. And what they're showing here is like, oh yeah, duh, if you use a small image, then 

adding extra noise is probably not a good idea.
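
As a hedged illustration of the paper's framing: the noising step uses a continuous function gamma(t) in place of alpha bar, and the input scaling factor that comes up a little later in the discussion multiplies the clean image before noise is added. The sketch below assumes the form x_t = sqrt(γ(t))·b·x0 + sqrt(1 − γ(t))·ε, uses the cosine schedule as a stand-in for γ, and computes the signal-to-noise ratio assuming the data has unit variance. The function names are illustrative.

```python
import math

def gamma_cosine(t):
    # Stand-in for the paper's gamma(t); here the same cosine form as abar.
    return math.cos(t * math.pi / 2) ** 2

def noise_with_scale(x0, eps, t, b=1.0, gamma=gamma_cosine):
    """x_t = sqrt(gamma) * b * x0 + sqrt(1 - gamma) * eps, with input scale b."""
    g = gamma(t)
    return [math.sqrt(g) * b * x + math.sqrt(1 - g) * e for x, e in zip(x0, eps)]

def snr(t, b=1.0, gamma=gamma_cosine):
    """Signal-to-noise ratio at time t > 0, assuming x0 has unit variance."""
    g = gamma(t)
    return (g * b * b) / (1 - g)

for t in (0.25, 0.5, 0.75):
    print(f"t={t}: SNR at b=1 is {snr(t):.3f}, at b=0.5 is {snr(t, b=0.5):.3f}")
```

Halving b divides the signal-to-noise ratio by four at every time step, so scaling the input is another way of shifting the whole noise schedule without touching gamma.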

JOHNO: Yeah. And so they use a lot of reference in this paper to like information 

being destroyed and signal to noise ratios, and that's really helpful for thinking about it 

because it's not something that's obvious, but at 64 by 64 pixels adjacent pixels might have 

much less in common versus the same amount of noise added at a much higher resolution, the 

noise kind of averages out and you can still see a lot of the image. So yeah, that's one 

thing they highlight is that the same noise level for different image sizes might 

be a harder or easier task. And so they investigate some strategies for 

this. They look at the different noise schedule functions. So we've seen the original version from 

the DDPM paper. We've seen the cosine schedule and we've seen, I think we might look at, or 

the next thing that Jeremy's going to show us, a sigmoid based schedule. And so they show the 

continuous time versions of that and they plot how you can change various parameters to get 

these different gamma functions or in our case, the alpha bar where we're starting at all image, 

no noise at t equals zero, moving to all noise, no image at t equals one, but the path that 

you take, it's going to be different for these different classes of functions and 

parameters and the signal to noise ratio, that's what this, or the log signal to noise 

ratio is going to change over that time as well. And so that's one of the knobs we can tweak. We're 

saying our diffusion model isn't training that well. We think it might be related to the noise 

schedule and so on. One of the things you could do is try different noise schedules, either 

changing the parameters in one class of noise schedule or switching from a linear to a cosine 

to a sigmoid. And then the second strategy is kind of what we were doing in those experiments, 

which is just to add some scaling factor to x0. JEREMY: Well we were accidentally using b of 0.5. JOHNO: Exactly. And so that's a second dial 

that you can tweak is to say keeping your noise schedule fixed, maybe just scale x0, which 

is going to change the ratio of signal to noise. JEREMY: And so that's what Figure 4 in c) 

there is what we were accidentally doing. JOHNO: Yes. Yeah, exactly. And so see if 

we can get to, oh yeah, so that again, changes the signal to noise for different scalings 

you get. So that's fine. So they have a compound, they have a strategy that combines some of those 

things. And this is the important part. They do their experiments. And so they have a nice 

table of investigating different schedules, cosine schedules and sigmoid schedules, and in 

bold are the best results. And you can see for 64 by 64 images versus 128 versus 256, the best 

schedule is not necessarily always the same. And so that's like important finding number one, 

depending on what your data looks like using a different noise schedule might be optimal. There's 

no one true best schedule. There's no one value of beta min and beta max that's just magically 

the best. Likewise for this input scaling at different sizes with whatever schedules they 

tested and different values were kind of optimal. And so, yeah, it's just a really great 

illustration, I guess, that this is another design choice that's implicitly or explicitly part 

of your diffusion model training and sampling is how are you dealing with this noise schedule? 

What schedule are you following? What scaling are you doing with your inputs? And by using 

this thinking and doing these experiments, and they come up with a kind of rule of thumb 

for how to scale the image based on image size, they show that they can, as they increase the 

resolution, they can still maintain really good performance. Where previously it was quite 

hard to train a really large resolution pixel space model and they're able to 

do that. They get some advantage from their fancy Recurrent Interface Network, but 

still it's kind of cool that they can say, look, we get state of the art high 

quality and 512 by 512 or 1024 by 1024 samples on class conditioned ImageNet and 

using this approach to really like consider how well do you train, how many steps do we need to 

take? One of the other things in this table is that they compare it to previous approaches. Oh, 

we used, you know, a third of the training steps and for the same other settings 

and we get better performance. And just because we've chosen 

that input scaling better. And yeah, so that's the paper. Really 

nice, great work to the team. And that was- JEREMY: I love that you got up in the morning and 

thought, oh, it's going to be a hassle training all these different models I need to train 

for different input scalings and different sampling approaches. I might just look at Twitter first. And then you looked at Twitter 

and there was a paper saying like, Hey!, we just did a bunch of experiments for 

different noise schedules and input scaling. JOHNO: Yeah. JEREMY: Does your life always work that 

way, Johno? That seems quite blessed. JOHNO: Yeah. It's very lucky like that. Yeah. If 

you wait long enough, someone else will do it. TANISHQ: That's why it's always that the time 

when AK [@_akhaliq] starts posting on Twitter, it's like my favorite hour of the day. 

It's just for all the papers to be posted. JEREMY: Oh, well thank you for that. 

So let me switch to notebook 23 because this notebook is actually largely an 

implementation of some ideas from this paper that everybody tends to just call it Karras, 

unfair ‘cause there's other people. But I will do it anyway, Karras paper. And the 

reason we're going to look at this is because in this paper, the authors actually take a 

much more explicit look at the question of input scaling. Their approach was not apparently to accidentally put a bug in their code and 

then take it out, find it worked worse and then just put it back in again. Their approach 

was actually to think, how should things be? So that's an interesting approach to doing things 

and I guess it works for them. So that's fine. TANISHQ: I think our approach is more exciting. JEREMY: Yeah, exactly. Our approach is much more 

fun because you never quite know what's going to happen. And so, yeah, in their approach, 

they actually tried to say like, okay, given all the things that are coming into our 

model, how can we have them all nicely balanced? So we will skip back and forth 

between the notebook and the paper. So the start of this is all the same, 

except now we are actually going to do it minus 1 to 1 because we're not going to rely 

on accidental bugs anymore, but instead we're going to rely on the Karras paper's carefully 

designed scaling. I say that except that I put a bug in the notebook as well. One of 

the things that's in the Karras paper is what is the standard deviation of the actual 

data which I calculated for a batch. However, this used to say minus 0.5. I used 

to do the minus 0.5 to 0.5 thing. And so this is actually the standard deviation 

of the data before when it was still minus 0.5. So this is actually half the real standard 

deviation. For reasons I don't yet understand, this is giving me better scaled results. 

So this actually should be 0.66. So there's still a bug here and the bug still 

seems to work better. So we've still got some mysteries involved. So we're going to leave this. 

So it's actually not 0.33, it's actually 0.66. Okay, so the basic idea of this 

paper, actually I'll come back. Well let me have a little think. Yeah, okay. Now 

we're going to start here. So the basic idea of this paper is to say, you know what, sometimes 

maybe predicting the noise is a bad idea. So like you can either try and predict the noise 

or you can try and predict the clean image and each of those can be a better 

idea in different situations. If you're given something 

which is nearly pure noise, you know, the model's given something which 

is nearly pure noise and is then asked to predict the noise, that's basically a waste

of time because the whole thing's noise. If you do the opposite, which is you try to get 

it to predict the clean image, well then if you give it an image that's nearly clean and try to 

predict the clean image, that's nearly a waste of time as well. So you want something which is like, 

regardless of how noisy the image is, you want it to be kind of like an equally difficult problem 

to solve. And so what Karras do is they basically use this new thing called Cskip, which is a 

number which is basically saying like, you know what we should do for the training target, 

is not just predict the noise all the time, not just predict the clean image all the time, 

but predict kind of a lerped version of one or the other depending on how noisy it is. So here y is 

the plain image and n is the noise. So y plus n is the noised image. And so if Cskip was zero, 

then we would be predicting the clean image. And if Cskip was one, we would be predicting 

y minus (y plus n), so we would be predicting the noise (up to sign). And so you can decide by picking a different Cskip whether you're predicting 

the clean image or the noise. And so, as you can see from the way they've 

written it, they make this a function. They make it a function of sigma. Now this is 

where we've got to a point now where we've kind of got a fairly much simpler notation. 

There's no more alpha bars, no more alphas, no more betas, no more beta bars. There's just 

a single thing called sigma. Unfortunately, sigma is the same thing as alpha bar used to be. 

Right. So we've simplified it, but we've also made things more confusing by using an existing symbol for 

something totally different. So this is alpha bar. Okay. So there's going to be a function that says, 

depending on how much noise there is, we'll either predict the noise or we'll predict the clean 

image or we'll predict something between the two. So in the paper, they showed this chart where 

they basically said like, okay, let's look at the loss to see how good are we with a trained 

model at predicting when sigma is really low. So when there's very small alpha bar or when sigma 

is in the middle or when sigma is really high. And they basically said, you know what, when 

it's nearly all noise or nearly no noise, you know, we're basically not 

able to do anything at all. You know, we're basically good at doing 

things when there's a medium amount of noise. So when deciding, okay, what sigmas 

are we going to send to this thing? The first thing we need to do is to figure out 

some sigmas. And they said, okay, well let's pick a distribution of sigmas that matches this 

red curve here, right, as you can see. And so this is a normally distributed curve where this is on a 

log scale. So this is actually a log normal curve. So to get the sigmas that they're going to use, 

they picked a normally distributed random number and then they exp'd it. And this is 

called a log normal distribution. And so they used a mean of minus 1.2 and a 

standard deviation of 1.2. So that means that about one sixth of the time they're going to be 

getting a number that's bigger than zero here, since zero is one standard deviation above the mean. And e to the zero is one. So about one sixth 

of the time they're going to be picking sigmas that are bigger than one. 
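
As a rough check of the sampling rule just described, here is a minimal sketch in plain Python (standard library only, not the notebook's PyTorch code). The names `P_MEAN`, `P_STD` and `sample_sigma` are made up for illustration; the values -1.2 and 1.2 are the ones quoted above.

```python
import math
import random

P_MEAN, P_STD = -1.2, 1.2  # mean and std of log(sigma), as quoted above

def sample_sigma(rng):
    """Draw one noise level: a normal random number, then exponentiated."""
    return math.exp(rng.gauss(P_MEAN, P_STD))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sigmas = [sample_sigma(rng) for _ in range(100_000)]

# Empirical fraction of draws with sigma > 1, i.e. log(sigma) > 0.
frac_above_one = sum(s > 1 for s in sigmas) / len(sigmas)

# Closed form: P(Z > z) for a standard normal Z, with z = (0 - mean) / std.
z = (0 - P_MEAN) / P_STD
exact = 0.5 * math.erfc(z / math.sqrt(2))

# The median of a log-normal is exp(mean of the log).
median = sorted(sigmas)[len(sigmas) // 2]

print(f"fraction > 1: {frac_above_one:.3f} (closed form {exact:.3f})")
print(f"median sigma: {median:.3f} (exp(P_MEAN) = {math.exp(P_MEAN):.3f})")
```

Because zero sits exactly one standard deviation above the mean of -1.2, the closed-form fraction is P(Z > 1), roughly 0.16, and the median sigma is exp(-1.2), roughly 0.3.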

And so here's a histogram I drew of the sigmas that we're going to 

be using. And so it's nearly always, you know, less than five. But sometimes it's way 

out here. And so it's quite hard to read these histograms. So this really nice library called 

seaborn, which is built on top of Matplotlib, has some more sophisticated and often nicer 

looking plots. And one of them they have is called a kdeplot, which is a kernel density plot. It's 

a histogram, but it's smooth. And so I clipped it at 10 so that you could see it better. So you can 

basically see that the vast majority of the time it's going to be somewhere, you know, about 0.4 

or 0.5. But sometimes it's going to be really big. So our noisify is going to pick a sigma 

using that log normal distribution. And then we're going to get the noise as 

usual. But now we're going to calculate c_skip, right? Because we're going to do that 

thing we just saw. We're going to find something between the plain image and the noised input. 

So what do we use for c_skip? We calculate it here. And so what we do is we say, what's the 

total amount of variance at some level of sigma? Well, it's going to be sigma squared. That's the 

definition of the variance of the noise. But we also have the sigma of the data itself, 

right? So if we add those two together, we'll get the total variance. And 

so what the Karras paper said to do is to do the variance of the data divided by 

the total variance and use that for c_skip. So that means that if your total variance is really 

big, so in other words, it's got a lot of noise, then c_skip is going to be really 

small. So if you've got a lot of noise, then this bit here will be really small. 

So that means if there's a lot of noise, try to predict the original image, right? That 

makes sense because predicting the noise would be too easy. If there's hardly any noise, then 

this will be, total variance will be really small, right? So c_skip will be really big. And 

so if there's hardly any noise, then try to predict the noise. And so 

that's basically what this c_skip does. So it's a kind of slightly weird idea is that our 

target, the thing we're trying to do actually is not the input image, sorry, the original image. 

It's not the noise, but it's somewhere between the two. And I've found the easiest way to 

understand that is to draw a picture of it. So here are some examples of noised 

input, right? With various amounts of… with various sigmas. Remember sigma 

is alpha bar, right? So here's an example with very little noise, 0.06. 

And so in this case, the target is predict the noise, right? So that's the 

hard thing to do, is predict the noise. Or else here's an example, 4.53, which is nearly 

all noise. So for nearly all noise, the target is predict the image, right? And then for something 

which is a little bit between the two, like here, 0.64, the target is predict some 

of the noise and some of the image. So that's the idea of Karras. And so what this does is it's making the 

problem to be solved by the UNet equally difficult, regardless of what sigma is. 

It doesn't solve our input scaling problem. It solves our kind of difficulty scaling problem. 
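
The interpolated target can be sketched with scalars in plain Python. This is an illustration under assumptions, not the notebook's code: `scalings`, `make_target` and `target_weights` are hypothetical names, `SIG_DATA = 0.66` is the value the discussion above calls the correct one (the notebook itself is said to keep 0.33), and the `c_out` expression is taken from the Karras et al. paper rather than from anything stated in the transcript.

```python
import math

SIG_DATA = 0.66  # assumed data std for images scaled to [-1, 1]

def scalings(sig, sig_data=SIG_DATA):
    """Return (c_skip, c_out, c_in) for a noise level `sig`."""
    total_var = sig**2 + sig_data**2            # noise variance + data variance
    c_skip = sig_data**2 / total_var            # near 1 for low noise, near 0 for high noise
    c_out = sig * sig_data / math.sqrt(total_var)
    c_in = 1 / math.sqrt(total_var)             # inverse of the total standard deviation
    return c_skip, c_out, c_in

def make_target(x0, noise, sig):
    """Training target for a clean value `x0` and unit-variance `noise`."""
    c_skip, c_out, _ = scalings(sig)
    noised = x0 + noise * sig
    return (x0 - c_skip * noised) / c_out

def target_weights(sig):
    """The target is linear in (x0, noise); return its two coefficients."""
    return make_target(1.0, 0.0, sig), make_target(0.0, 1.0, sig)

for sig in (0.06, 0.64, 4.53):                  # the three noise levels pictured above
    w_image, w_noise = target_weights(sig)
    print(f"sigma={sig:5.2f}  image weight={w_image:6.3f}  noise weight={w_noise:6.3f}")
```

At sigma 0.06 the target is almost entirely (negated) noise, at 4.53 it is almost entirely the image, and at 0.64 it is a mix. The squared weights, with the image weight scaled by the data variance, sum to one, which is what gives the target unit variance when the data really has standard deviation `SIG_DATA`.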

To solve the input scaling problem, they do it. TANISHQ: I just want to make one quick note. And 

so this idea of interpolating between the noise and the image is similar to what's called the 

V-Objective as well. So there's also a similar kind of, it's quite similar to what Karras 

et al. has. But that's also now been used in a lot of different models. For example, 

Stable Diffusion 2.0 was trained with this sort of V-Objective. So people are using this 

sort of methodology and getting good results. So it's an actual practical thing that people are 

doing. So yeah, just want to make a note of that. JEREMY: Yeah. As is the case of basically 

all papers created by NVIDIA researchers, of which this is one, it flies under 

the radar and everybody ignores it. The V-Objective paper came from the senior 

author, was Tim Salimans, which is Google, right? So anything from Google and OpenAI, everybody 

listens to. So yeah, although Karras, I think has done the more complete version of this. 

And in fact, the V-Objective was almost like mentioned in passing in the 

distillation paper. But yeah, that's the one that everybody has ended up 

looking at. But I think this is the more complete… TANISHQ: I think what happened with 

the V-Objective is not many people paid attention to it. I think folks like Kat 

and Robin, these sorts of folks are actually paying attention to that V-Objective 

in that Google Brain paper. But then also this paper did a much more principled 

analysis of this sort of thing. So yeah, I think it's very interesting how, yeah. Sometimes 

even these sort of side notes in papers that maybe people don't pay much attention to, 

they can actually be quite important. JEREMY: Yeah. Yeah. So okay, so the noised input 

as usual is the input image plus the noise times the sigma. But then, and then as we discussed, we 

decide what our target is. But then we actually take that noised input 

and we scale it up or down by this number. And the target, we also scale up or down by this 

number. And those are both calculated in this thing as well. So here's c_out and here's c_in. Now I just wanted to show one 

example of where these numbers come from because for a while they all seem pretty mysterious to 

me and I felt like I'd never be smart enough to understand them, particularly because they 

were explained in the mathematical appendix of this paper, which are always the bits I 

don't understand, until I actually try to, and then it tends to turn out they're not so bad 

after all, which was certainly the case here. TANISHQ: I think it was up, it was B something 

I think. So the B6 I think, is the other one. JEREMY: Oh, yeah. So in appendix B6, 

which does look pretty terrifying, but if you actually look at, for example, 

what we were just talking about, c_in, it's like, how do they calculate? So c_in is 

this. Now this is the variance of the noise, this is the variance of the data, add them 

together to get the total variance, square roots, the total standard deviation. So it's just the 

inverse of the total standard deviation, which is what we have here. Where does that come 

from? Well they just said, you know what? The inputs for a model should have unit variance. 

Now we know that. We've done that to death in this course. So they just said, all right, 

so well the inputs to the model is the clean data plus the noise times some number we're 

going to calculate, and we want that to be one. Okay, so the variance of the clean images plus the 

noise is equal to the variance of the clean images plus the variance of the noise. 

Okay, so if we want that to be, if we want variance to be one, then divide 

both sides by this and take the square root, and that tells us that our multiplier has to be 

one over this. That's it. So it's like literally, you know, classical math. The only bit you have 

to know is that the variance of two independent things added together is the sum of their 

variances, which is not rocket science either. JOHNO: And in this context, like why we want to 

do this, when we looked at those sigmas that you're plotting, like the distribution, you've 

got some that are fairly low, but you've also got some where the standard deviation sigma is 

like 40, right? So the variance is super high. JEREMY: Yes. JOHNO: And so we don't want to feed something with 

standard deviation 40 into our model. You would like it to be closer to unit variance. So we're 

thinking, okay, well, if you divide by roughly 40, that would scale it down. But then we've also got 

some extra variance from our data. So it's like 40 plus a little bit from the variance of the data. We want to scale back down by 

that to get unit variance. JEREMY: Yeah. I mean, I love this paper 

because it's basically just doing what we spent weeks doing. I feel like everything that 

we've done that's improved every model has always been one thing, which is, can we get 

mean zero, variance one inputs to our model and for all of our activations? And then the 

only other thing is include enough compute by adding enough layers and enough activations. 
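
Written out, the c_in argument from appendix B.6 discussed above takes only a few lines. The symbols are assumptions chosen to match the discussion: y is the clean image with variance sigma_data squared, n is independent noise with variance sigma squared, and c_in(sigma) is the multiplier applied to the noised input.

```latex
\begin{aligned}
\operatorname{Var}\bigl[c_\text{in}(\sigma)\,(y + n)\bigr] &= 1 \\
c_\text{in}(\sigma)^2 \,\operatorname{Var}[y + n] &= 1 \\
c_\text{in}(\sigma)^2 \,\bigl(\sigma_\text{data}^2 + \sigma^2\bigr) &= 1
  \qquad \text{(variances of independent terms add)} \\
c_\text{in}(\sigma) &= \frac{1}{\sqrt{\sigma^2 + \sigma_\text{data}^2}}
\end{aligned}
```

With sigma around 40 the denominator is dominated by the noise term, so the input is divided by roughly 40 plus a small correction from the data.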

Those two things seem to be all that matters. Basically, well, I guess ResNets added an extra 

cool little thing to that, which is to make it even smoother by giving this kind of like identity 

path. So yeah, basically trying to make things as smooth as possible and as equal 

everywhere as possible. So yeah, this is what they've done. So they did that 

for the inputs and then they've also done it for the outputs. And for the 

outputs, you know, it's basically the same idea, you know, and they have basically 

the same kind of analysis to show that. And so with this, so now, yeah, we've 

basically, we've got our noised input. We've got the, you know, kind of lerped version 

somewhere between x nought and the noised input. We've got the scaling of the output and we've got 

the scaling of the input. So now for the inputs to our model, we're going to have the scaled noise. 

We're going to have the sigma and we're going to have the target, which is somewhere between 

the image and the noise. And so, yeah, so I've, you know, never seen anybody draw a picture 

of this before. So it was really cool when, you know, being in a notebook, being able to see 

like, oh, that's what they're doing, you know? So yeah, have a good look at this notebook 

to see exactly what's going on. Cause I think it gives you a really good intuition 

around what problem it's trying to solve. So then I actually checked the noised input has 

a standard deviation of one, the mean's not zero. 

do anything, you know, the only thing Karras cared about was having the variance one. We could 

easily adjust the input and output to have a mean of zero as well. That's something I think we or 

somebody should try. Cause I think it does seem to help a bit as we saw with that generalized 

ReLU stuff we did. But it's less important than the variance. And so same with the target. 

It's got the one and yeah, this is where if I changed this to the correct value, which is 0.66, 

then actually it's slightly further away from one both here and here, quite a lot further away. And 

maybe that's because actually the data's, well, we know the data's not Gaussian distributed. Pixel 

data definitely isn't Gaussian distributed. So this bug turned out better. Okay. So the UNet's the same, the 

initialization's the same. This is all the same. Train it for a while. We can't compare the 

losses, right? Because our target's different. So, but what we can do is we can create 

a denoise that just takes the thing that as, per usual, the thing 

we had in noisify, right? And so for x naught, so you're going to multiply by 

c_out and then add c_skip by noised_input. Here it is, multiply by c_out, add noised_input, 

c_skip. Okay. So we can denoise. So let's grab our sigmas from the actual batch we had. Let's 

calculate c_skip, c_out and c_in for the sigmas in our mini batch. Let's use the model to predict 

the target given the noised input and the sigmas and then denoise it. And so here's our noised 

input, which we've already seen, and here are our predictions. And these are 

absolutely remarkable in my opinion. Yeah. Like this one here, I can barely see it. 
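
The denoise step described above (multiply the model output by c_out, add c_skip times the noised input) can be checked with a small scalar sketch. Everything here is illustrative: `oracle_model` is a hypothetical stand-in for a trained UNet that returns the exact training target, and `SIG_DATA = 0.66` is an assumed value.

```python
import math

SIG_DATA = 0.66  # assumed data std, as in the discussion above

def scalings(sig, sig_data=SIG_DATA):
    total_var = sig**2 + sig_data**2
    return (sig_data**2 / total_var,                # c_skip
            sig * sig_data / math.sqrt(total_var),  # c_out
            1 / math.sqrt(total_var))               # c_in

def denoise(model, noised, sig):
    """Undo noisify: scale the input, call the model, rescale, add the skip term."""
    c_skip, c_out, c_in = scalings(sig)
    return model(noised * c_in, sig) * c_out + noised * c_skip

def oracle_model(x0):
    """Hypothetical stand-in for a trained UNet: returns the exact training target."""
    def model(scaled_input, sig):
        c_skip, c_out, c_in = scalings(sig)
        noised = scaled_input / c_in                # recover the unscaled noised input
        return (x0 - c_skip * noised) / c_out
    return model

x0, noise = 0.3, -1.7                               # one clean value, one noise draw
recovered = [denoise(oracle_model(x0), x0 + noise * sig, sig)
             for sig in (0.06, 0.64, 4.53, 80.0)]
print(recovered)
```

With a perfect prediction of the target, `denoise` returns the clean value at every noise level, which is the sense in which it undoes noisify. A real model only approximates the target, so its denoised images get blurrier as sigma grows.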

You know, it's really found. Look at the shirt. There's the shirt here. It's actually really 

finding the little thing on the front and let me show you. Here's what it should look like. And in 

cases where the sigma is pretty high, like here, you can see it's really like saying like, I 

don't know, maybe it's shoes, but it could be something else. Is it shoes? Yeah, it wasn't 

shoes, but at least it's kind of got the, you know, the bulk of the pixels in 

the right spot. Yeah. Something like this one is 4.5. Has no idea what it is. It's 

like, oh, maybe it's shoes. Maybe it's pants. You know, it turns out it is shoes. Yeah. So 

I think that's fascinating how well it can do. And then the other thing I did, which I 

thought was fun was I just created, so I just did a sigma of 80, which is actually what they 

do when they're doing sampling from pure noise. That's what they consider the pure noise level. 

So I just created some pure noise and denoised it just for one step. And so here's what happens 

when you denoise it for one step. And you can see it's kind of overlaid all the possibilities. 

It's like, I can see a pair of shoes here, a pair of pants here, a top here. And sometimes 

it's kind of like more confident that the noise is actually a pair of pants. And sometimes 

it's more confident that it's actually shoes. But you can really get a sense of how like from 

pure noise, it starts to make a call about like what this noise is actually covering up. And 

this is also the bit which I feel is like, I'm the least convinced about when it comes to 

diffusion models. This first step of going from like pure noise to something and like trying to 

have a good mix of all the possible somethings. I don't know, it feels a bit handwavy to me. 

It clearly works quite well, but I'm not sure if it's like we're getting the full range of 

possibilities. And I feel like some of the papers we're starting to see is starting to say 

like, you know what, maybe this is not quite the right approach. Then maybe later in the course, 

we'll look at some of the ones that look at what we call VQ models and tokenized stuff. 

Anyway, I thought this was pretty interesting to see these pictures, which I don't think, 

yeah, I've never seen any pictures like this before. So I think this is a fun result from 

doing all this stuff in notebooks step by step. Okay, so sampling. So one of the nice things 

with this is the sampling becomes much, much, much simpler. And so, and I should mention 

a lot of the code that I'm using, particularly in the sampling section is heavily inspired by, 

and some of it's actually copied and pasted from Kat's k-diffusion repo, which is, I think 

I mentioned before, some of the nicest generative modeling code or maybe the nicest 

generative modeling code I've ever seen. It's really great. So before we talk about the actual 

sampling, the first thing we need to talk about is what sigma do we use at each reverse time step. 

And in the past, we've always, well, nearly always done something, which I've always 

felt is sketchy as all hell, which is we've just linearly gone down the sigmas or the alpha bars 

or the t’s. So here, when we're sampling in the previous notebook, we used linspace. So I always 

felt like that was questionable. And I felt like at the start, you probably like it was just 

noise anyway. So who cared? Who cares? So I, in DDPM_v3, I experimented with something 

that I thought intuitively made more sense. I don't know if you remember this one, but I 

actually said, oh, let's, for the first hundred time steps, let's actually only run the model 

every 10 steps. And then for the next hundred, let's run it every nine steps. The next one hundred, 

let's run it every eight steps. So basically at the start, be much less careful. And so Karras 

actually ran a whole bunch of experiments. And they said, yeah, you know what, at the start of 

sampling, you know, you can start with a high sigma, but then like step to a much lower sigma 

in the next step and then a much lower sigma in the next step. And then the further along you 

get, step by smaller and smaller steps so that you spend a lot more time fine tuning carefully 

at the end and not very much time at the start. Now, this has its own problems. And in 

fact, a paper just came out today, which we probably won't talk about today, but maybe 

another time, which talked about the problems is that in these very early steps, this is the 

bit where you're trying to create a composition that makes sense. Now for Fashion-MNIST, we 

don't have much composing to do. It's just a piece of clothing. But if you're trying to 

do an astronaut riding a horse, you know, you've got to think about how all those pieces fit 

together. And this is where that happens. And so I do worry that the Karras approach is maybe not 

giving that enough time. But as I've said, that's really the same as this step; 

that whole piece feels a bit wrong to me. But aside from that, I think this makes a 

lot of sense, which is that, yeah, the sampling, you should jump, you know, by big steps early 

on and small steps later on and make sure that the fine details are just so. So that's what 

this function does, is it creates this plot. Now it's this schedule of reverse diffusion 

sigma steps. It's a bit of a weird function in that it's the, the rho-th root of sigma, 

where rho is seven. So the seventh root of sigma is basically what it's scaling on. But 

the answer to why it's that is because they tried it and it turned out to work pretty 

well. Do you guys remember where this was? TANISHQ: This is the 

truncation error analysis, D1. JEREMY: Nice memory. So this image here (thanks 

to Tanishq for reminding me where this is) shows FID as a function of rho. So it's basically 

what root are we taking. And they basically said, like, if you take the fifth root or up, 

it seems to work well, basically. So yeah, so that's a perfectly good way to do 

things is just to try things and see what works. And you'll notice they tried things, 

just like we love, on small datasets, not as small as us because we're the king of small 

datasets, but smallish, CIFAR-10, ImageNet-64. That's the way to do things. So I saw like –it 

might've even been the CEO of Hugging Face the other day– tweets something saying only people 

with huge amounts of GPUs can do research now. And I think it totally misunderstands how research 

is done, which is research is done on very small datasets. That's the actual research. And then 

when you're all done, you scale it up at the end. I think we're kind of pushing the 

envelope in terms of like, yeah, how much can you do? And yeah, we've like 

re-covered this kind of main substantive path of diffusion models history step-by-step showing 

every improvement and seeing clear improvements across all the papers using nothing but 

Fashion-MNIST running on a single GPU in like 15 minutes of training or something per model. 
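
Returning to the sampling schedule described above (spacing the steps evenly in the rho-th root of sigma, with rho equal to seven), a minimal sketch in plain Python looks like this. The function name `sigmas_karras` and the default `sigma_min=0.002` are assumptions for illustration; the 80 and the 7 are the values mentioned in the discussion.

```python
def sigmas_karras(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return n decreasing noise levels plus a final 0, evenly spaced in sigma**(1/rho).

    Requires n >= 2.
    """
    max_inv = sigma_max ** (1 / rho)
    min_inv = sigma_min ** (1 / rho)
    ramp = [i / (n - 1) for i in range(n)]          # 0, ..., 1
    sigs = [(max_inv + t * (min_inv - max_inv)) ** rho for t in ramp]
    return sigs + [0.0]                             # finish at zero noise

sigs = sigmas_karras(20)
steps = [a - b for a, b in zip(sigs, sigs[1:])]     # size of each downward jump

print([round(s, 3) for s in sigs[:5]])              # big jumps at the start
print([round(s, 4) for s in sigs[-5:]])             # tiny jumps at the end
```

The jumps shrink monotonically, so most of the sampling steps are spent at low noise levels where the fine details are settled.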

So yeah, definitely don't need lots of models. Anyway. Okay. So this is the sigma we're going 

to jump to. So the denoising is going to involve calculating the c_skip, c_out and c_in and calling 

our model with the c_in scaled data and the sigma and then scaling it with c_out and then doing the 

c_skip. Okay. So that's just undoing the noisify. So check this out. There's all that's required to 

do one step of denoising for the simplest kind of scheduler, which is sorry, the simplest 

kind of sampler, which is called Euler. So we basically say, okay, 

what's the sigma at time step i, what's sigma two at time step i plus one. And 

now when I'm talking about time step, I'm really talking about like the step from 

this function, right? So this is, this is – JOHNO: Sampling step. JEREMY: Sampling step, yeah. Okay. 

So then denoise –using the function– and then we say, okay, well just 

send back whatever you were given, plus move a little bit in the direction of the 

denoised image. So the direction is x minus denoised. So that's the noise, that's the gradient 

as we discussed right back in the first lesson of this part. So we'll take the noise. If we divide 

it by sigma, we get a slope. That's how much noise is there per sigma. And then the amount 

that we're stepping is sigma two minus sigma one. So take that slope and multiply it by the 

change, right? So that's the distance to travel towards the noise, that this fraction, you 

know, or you could also think of it this way. And I know this is a very obvious 

algebraic change, but if we move this over here, you could also think of this as 

being, oh, of the total amount of noise, the change in sigma we're doing, what 

percentage is that? Okay, well that's the amount we should step. Right? So there's two 

ways of thinking about the same thing. So again, this is just, you know, 

high school math. Well I mean, actually my seven year old daughter has done all 

these things. It's plus minus dividing times. So we're going to need to do this once per 

sampling step. So here's a thing called sample, which does that. That's going to 

go through each sampling step, call our sampler, which initially we're going to 

do sample Euler, right? With that information, add it to our list of results and do it again. 
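
The Euler step and the sampling loop just described can be sketched with scalars in plain Python. This is an illustration under assumptions, not the notebook's code: `toy_denoise` is a hypothetical stand-in for the trained model that always answers with one fixed clean value, and the sigma list is a simple geometric stand-in rather than the schedule from the paper.

```python
import random

def sample_euler(x, sigs, i, denoise_fn):
    """One Euler step from noise level sigs[i] down to sigs[i + 1]."""
    sig, sig2 = sigs[i], sigs[i + 1]
    denoised = denoise_fn(x, sig)
    # (x - denoised) / sig is the slope: noise per unit of sigma.
    # Equivalently: (x - denoised) * ((sig2 - sig) / sig), a fraction of the total noise.
    return x + (x - denoised) / sig * (sig2 - sig)

def sample(sampler, denoise_fn, x, sigs):
    """Call the sampler once per sampling step and keep every intermediate result."""
    preds = []
    for i in range(len(sigs) - 1):
        x = sampler(x, sigs, i, denoise_fn)
        preds.append(x)
    return preds

X0 = 0.25                                            # the only "image" in the toy dataset

def toy_denoise(x, sig, x0=X0):
    """Hypothetical perfect denoiser for a dataset containing only x0."""
    return x0

n = 20
sigs = [80.0 * (0.002 / 80.0) ** (i / (n - 1)) for i in range(n)] + [0.0]
x_start = random.Random(0).gauss(0, 1) * sigs[0]     # pure noise at sigma = 80
preds = sample(sample_euler, toy_denoise, x_start, sigs)
final = preds[-1]
print(final)
```

With this toy denoiser every step shrinks the remaining noise by the ratio sig2 / sig, so the final step to sigma zero lands on the clean value. A real model's answer changes with x and sigma, which is why the number and spacing of steps matters.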

So that's it. That's all the sampling is. And of course we need to grab 

our list of sigmas to start with. So I think that's pretty cool. And at the very 

start we need to create our pure noise image. And so the amount of noise we 

start with has got a sigma of 80. Okay, so if we call sample using sample Euler, 

and we get back some very nice looking images and, believe it or not, our FID is 1.98. So this 

extremely simple sampler, three lines of code plus a loop has given us a FID of 1.98, which is 

clearly substantially better than our baseline. Now we can improve it from there. So one 

potential improvement is to, you might've noticed we added no new noise at all, right? This 

is a deterministic sampler, right? There's no rand anywhere here. So we can do something called 

an Ancestral Euler Sampler, which does add rand, right? So we basically do the denoising in the 

usual way, but then we also add some rand. And so what we do need to make sure is given that 

we're adding a certain amount of randomness, we need to remove that amount of randomness 

from the step that we take. So I won't go into the details, but basically there's our way of 

calculating how much new randomness and how much just going back in the existing direction do we 

do. And so there's the amount in the existing direction and there's the amount in the new 

random direction. And you can just pass in eta, which is just going to, when we pass it into here, 

is going to scale that. So if we scale it by half, so basically half of it is new noise and 

half of it is going in the direction that we thought we should go, that makes it better 

still. Again with a hundred steps and just make sure I'm comparing to the same. Yep. A hundred 

steps. Okay. So it's fair. Like with like. Okay. So that's adding a bit of extra noise. Now then the… something that I think we might've 

mentioned back in the first lesson of this part is something called Heun's method. And Heun's method does something which we can 

pictorially see here to decide where to go, which is basically we say, okay, where 

are we right now? What's the, you know, at our current point, what's the direction? So we 

take the tangent line, the slope, right? That's basically all it does is it takes a slope. So 

it's not, here's a slope, you know? Okay. And so if we take that slope and that 

would take us to a new spot and then at that new spot, we can then 

calculate a slope at the new spot as well. And at the new spot, the slope is something else. 

So that's it here, right? And then you say like, okay, well, let's go halfway between the two 

and let's actually follow that line. And so basically it's saying like, okay, each of 

these slopes is going to be inaccurate. But what we could do is calculate the slope of 

where we are, the slope of where we're going and then go halfway between the two. I actually 

find it easier to look at in code personally. I just kind of delete a whole bunch of stuff 

that's totally irrelevant to this conversation. So take a look at this compared to Euler. So here's our Euler, right? So we're going 

to do the same first line exactly the same, right? Then the denoising is exactly the same, 

right? And then this step here is exactly the same. I've actually just done it in multiple 

steps for no particular reason. And then say, okay, well, if this is the last step, then we're 

done. So actually the last step is Euler. But then what we do is we then say, well, that's okay 

for an Euler step, this is where we'd go. Well, what does that look like if we denoise 

it? So this calls the model the second time, right? And where would that take us if we took 

an Euler step there? And so here, if we took an Euler step there, what's the slope? And so what 

we then do is we say, oh, okay, well, it's just, just like in the picture, let's take the average. 

Okay, so let's take the average and then use that, the step. So that's all the Heun sampler 

does is it just takes the average of the slope where we're at and the slope where 

the Euler method would have taken us. And so if we now, so notice that it called the 

model twice for a single step. So to be fair, since we've been taking a hundred steps with 

Euler, we should take 50 steps with Heun, right? Because it's going to call the model twice. 
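
A scalar sketch of the Heun step described above, again in plain Python and under assumptions: `toy_denoise` is a hypothetical stand-in for the model that records each call, and the five-step sigma list is made up. Because this toy denoiser gives the same slope everywhere along the path, the sketch only checks the bookkeeping (two model calls per step, plain Euler on the last step), not the accuracy gain reported above.

```python
X0 = 0.25
calls = []                                           # one entry per model evaluation

def toy_denoise(x, sig, x0=X0):
    """Hypothetical perfect denoiser for a dataset containing only x0."""
    calls.append(sig)
    return x0

def sample_heun(x, sigs, i, denoise_fn):
    """One Heun step: average the slope here with the slope where Euler would land."""
    sig, sig2 = sigs[i], sigs[i + 1]
    d = (x - denoise_fn(x, sig)) / sig               # slope where we are
    x_euler = x + d * (sig2 - sig)                   # where an Euler step would take us
    if sig2 == 0:
        return x_euler                               # last step: plain Euler
    d2 = (x_euler - denoise_fn(x_euler, sig2)) / sig2  # slope at that new spot
    return x + (d + d2) / 2 * (sig2 - sig)           # follow the average slope

sigs = [80.0, 20.0, 5.0, 1.0, 0.2, 0.0]              # made-up schedule, five steps
x = 1.3 * sigs[0]                                    # a fixed "pure noise" start
for i in range(len(sigs) - 1):
    x = sample_heun(x, sigs, i, toy_denoise)
final = x
print(final, len(calls))
```

Five steps cost nine model evaluations here: two per step, except the last step, which falls back to Euler. That is the reason for halving the step count when comparing against Euler.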

And still that is now, whoa, we beat 1, which is pretty amazing. And so we could keep going, 

check this out. We can even go down to 20. This is actually doing 40 model evaluations and this is 

better than our best Euler, which is pretty crazy. Now something which you might've noticed is kind 

of weird about this or kind of silly about this is we're cat–, we're calling the model twice just 

in order to average them, but we already have two model results like without calling it twice, 

because we could have just looked at the previous time step. And so something called the LMS 

sampler does that instead. And so the LMS sampler, if I call it with 20, it actually literally 

does 20 evaluations and actually it beats Euler with a hundred evaluations. And so 

LMS, I won't go into the details too much. It didn't actually fit into my little sampling very 

well. So I basically largely copied and pasted Kat's code. But the key thing it does is,

it gets the current sigma. It does the denoising, it calculates the slope and it stores

the slope in a list, right? And then it grabs the first one from the list. So it's kind of

keeping a list of up to, in this case, four at a time. And so it then uses up to the last four to basically

estimate, kind of, the curvature of this and take the next step. So that's pretty smart. And yeah, so

I think if you wanted to do super fast sampling, it seems like a pretty good way to do it. 
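The idea of reusing earlier slopes instead of calling the model a second time can be sketched as follows. This is deliberately not the LMS sampler itself: it is a simplified second-order multistep update that keeps only one previous slope and extrapolates linearly from it, whereas the sampler discussed here keeps up to four. The toy Gaussian denoiser, the schedule, and the names `sample_euler` and `sample_multistep` are assumptions for illustration.

```python
import math

SIGMA_DATA = 1.0  # assumed spread of the toy 1-D "dataset" (not from the lesson)

def ideal_denoise(x, sigma):
    """Stand-in for the trained model: exact denoiser when clean data ~ N(0, SIGMA_DATA^2)."""
    return x * SIGMA_DATA**2 / (SIGMA_DATA**2 + sigma**2)

def sample_euler(x, sigmas, denoise):
    for sig, sig_next in zip(sigmas[:-1], sigmas[1:]):
        slope = (x - denoise(x, sig)) / sig
        x = x + slope * (sig_next - sig)
    return x

def sample_multistep(x, sigmas, denoise):
    """One model call per step; the previous step's slope supplies the curvature."""
    calls = 0
    prev_slope, prev_h = None, None
    for sig, sig_next in zip(sigmas[:-1], sigmas[1:]):
        slope = (x - denoise(x, sig)) / sig      # the only model call in this step
        calls += 1
        h = sig_next - sig
        if prev_slope is None:
            x = x + h * slope                    # no history yet: plain Euler step
        else:
            r = h / prev_h                       # ratio of this step size to the last one
            # Extend the line through the last two slopes across this step
            # and integrate it (a variable-step Adams-Bashforth-style update).
            x = x + h * ((1 + r / 2) * slope - (r / 2) * prev_slope)
        prev_slope, prev_h = slope, h
    return x, calls

sigmas = [2.0, 1.5, 1.0, 0.5, 0.0]               # hypothetical schedule, 4 steps
# Start on the exact trajectory, so the true answer at sigma = 0 is SIGMA_DATA.
x_start = math.sqrt(SIGMA_DATA**2 + sigmas[0]**2)
exact = SIGMA_DATA

x_euler = sample_euler(x_start, sigmas, ideal_denoise)
x_multi, multi_calls = sample_multistep(x_start, sigmas, ideal_denoise)

print("Euler:    ", x_euler)
print("Multistep:", x_multi, "model calls:", multi_calls)
```

Here the multistep version uses four model calls for four steps, the same budget as Euler, and on this toy problem it finishes closer to the exact value.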

And I think Johno, you were telling me that, or maybe it's Pedro was saying that 

currently people have started to move away, that this was very popular, but people 

started to move towards a new sampler, which is a bit similar called the 

DPM++ sampler, something like that. TANISHQ: Yeah. Yeah. Yeah. JOHNO: Yeah. JEREMY: But I think it's 

the same idea. So it kind of keeps a list of recent results and uses

that. I'll have to check it more closely. JOHNO: The similar idea is like, if it's 

done more than one step, then it's using some history to predict the next thing. JEREMY: Yeah. [unintelligible] in Heun

doesn't make a huge amount of sense, I guess, from that perspective. I mean, still works very 

well. This makes more sense. So then, we can compare if we use an actual mini-batch of data,

we get about 0.5. So yeah, I feel like this is quite a stunning result to get very close to real data, at least in terms of 

FID. You know, really with 40 model evaluations. And the entire, nearly the entire thing here is 

by making sure we've got unit variance inputs, unit variance outputs, and kind of equally 

difficult problems to solve in our loss function. JOHNO: Yeah. Plus having that different schedule 

for sampling, that's completely unrelated to the training schedule. So one of the big things 

with Karras et al.'s paper was they also could apply this to like, oh, existing diffusion

models that have been trained by other papers, we can use our sampler and in fewer steps 

get better results without any of the other changes. And yeah, I mean, they do a 

little bit of rearranging equations to get the other papers versions into 

their c_skip, c_in, c_out framework. But then, yeah, it's really nice that these 

ideas can be applied to. So for example, I think Stable Diffusion, especially version one, was

trained with DDPM-style training, epsilon objective, whatever. But you can now get these different

samplers and different sampling schedules and things like that and use that to sample it and do

it in 15, 20 steps and get pretty nice samples. JEREMY: Yeah. You know, and another 

nice thing about this paper is they, you know, in fact, it's the name of 

the paper, “Elucidating the Design Space of Diffusion-Based …” models. You know, they looked 

at various different papers and approaches and tried to say, like, oh, you know what? These

are all doing the same thing when we kind of parameterize things in this way. And if you fill 

in these parameters, you get this paper and these parameters, you get that paper, you know? And 

then, so, we found a better set of parameters, which… it was very nice to code because, you know, 

it really actually ended up simplifying things a whole lot. And so if you look through the 

notebook carefully, which I hope everybody will, you'll see, you know, that the code 

is really clear and simple compared to previous, all the previous ones, in

my opinion. Like I feel like every notebook we've done from DDPM onwards, 

the code's got easier to understand and the results… TANISHQ: And just to, again, clarify, like, 

how this connects with some of the previous papers that we've looked at. So like, for example, 

with the DDIM, the deterministic, that's again, the deterministic approach, that's similar to the 

Euler method sampler that we were just looking at, which was completely deterministic. And then 

something like the Euler Ancestral sampler that we were looking at is similar to

the standard DDPM approach with the, that was kind of a more stochastic approach. 

So, again, there's just all these sorts of connections that then are kind of nice to 

see, again, the sorts of connections between the different papers and how they change it, how 

they can be expressed in this common framework. JEREMY: Yeah. Thanks, Tanishq. So we definitely 

now are at the point where we can show you the UNet next time. And so I think we're, unless any 

of us come up with interesting new insights on the unconditional diffusion sampling, training 

and sampling process, we might be putting that aside for a while. And instead we're going 

to be looking at creating a good quality UNet from scratch. And we're going to look at a 

different data set to do that. This is starting to scale things up a bit as Johno mentioned in

the last lesson. So we're going to be using a 64 by 64 pixel ImageNet subset, called Tiny ImageNet. 

So we'll start looking at some 3-channel images. So I'm sure we're all sick of looking at black 

and white shoes. So now we get to look at cliff dwellings and trolley buses and koala bears and

yeah, 200 different things. So that'll be nice. Yeah. All right. Well, thank you, Johno. Thank 

you, Tanishq. That was fun as always. And next time will be Lesson 22. Bye. JOHNO: This was Lesson 22. JEREMY: Oh, no way. Okay. See you.
