Lesson 22: Deep Learning Foundations to Stable Diffusion ...

JEREMY: All right, hi gang, and here we are in Lesson 22, joined by the

legends themselves, Johno and Tanishq. Hello. TANISHQ: Hello. JEREMY: And today you'll be shocked to hear

that we are going to look at a Jupyter Notebook. Amazing, right? We're going to look at notebook

22. This is a pretty quick, just, you know, improvement, fairly simple improvement to our

DDPM/DDIM implementation for Fashion-MNIST. And this is all the same so far, but what I've

done is I've made some, one quite significant change and some of the changes we'll be making

today are all about making life simpler. And they're kind of reflecting the way the

papers have been taking things. And it's interesting to see how the papers have not only

made things better, they've made things simpler. And so one of the things that I've noticed in recent papers is

that there's no longer a concept of n steps, which is something we've always had before and

always bothered me a bit, this capital T thing. You know, this t/T, it's basically saying this

is time step number, say 500 out of 1,000, so it's time step 0.5. Why don't I just call

it 0.5? And the answer is, well, we can. So we talked last time about the cosine scheduler.

We didn't end up using it because I came up with an idea which was, you know, simpler and nearly

the same, which is just to change our betamax. But in this next notebook, let's use the cosine

scheduler, but let's try to get rid of the n_steps thing and the capital T thing. So here is

abar again. And now I've got rid of the capital T. So now I'm going to assume that your time step

is between 0 and 1. And it basically represents what percentage of the way through

the diffusion process are you. So 0 would be all noise and 1 would

be all, no sorry, other way around. 0 would be all clean and 1 would be all noise.
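
A minimal sketch of this continuous-time schedule, assuming the cosine form described in the lesson (alpha bar is the squared cosine of t times pi over 2). The name `abar` follows the lesson; plain Python floats stand in for tensors here, purely for illustration.

```python
import math

def abar(t):
    """Cosine schedule: fraction of signal remaining at time t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

# t = 0 is the clean image, t = 1 is pure noise
for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  abar={abar(t):.3f}")
```

Because alpha bar is now a function of a float, there is no lookup table and no fixed number of steps baked into the schedule.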

So how far through the forward diffusion process. So other than that, this is exactly the

same equation we've already seen. And I realized something else, which is kind of fun,

which is you can take the inverse of that. So you can calculate t. So we would

basically first take the square root and we would then take the inverse cos and we

would then divide by pi over 2, or times 2 over pi. So we can both, so it's interesting now, we

don't, the alpha bar is not something we look up in a list, it's something we calculate with a

function from a float. And so yeah, interestingly, that means we can also calculate t from an alpha

bar. So noisify has changed a little. So now when we get the alpha bar for our time step, we don't

look it up, we just call it, call the function. And now the time step is a random float

between 0 and 1. Actually between 0 and 0.999, which actually I'm sure there's a function I

could have chosen to do a float in this range, but I just clamped it because I was

lazy, couldn't be bothered looking it up. Other than that noisify is exactly the

same. Right, so we're still returning the xt, the time step, which is now a float, and

the noise. That's the thing we're going to try and predict, dependent variable, this

tuple there as our inputs to the model. All right, so here is what that looks like.
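
The inverse and the updated noisify described above might look roughly like this. It is a sketch under assumptions, not the notebook's code: it works on a single flattened image (a list of floats) rather than a batch of tensors, and the `rng` parameter is a hypothetical addition so the randomness can be controlled.

```python
import math
import random

def abar(t):
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    # undo abar: square root, then inverse cosine, then scale by 2/pi
    return math.acos(math.sqrt(a)) * 2 / math.pi

def noisify(x0, rng=random):
    """Noise one flattened image x0 at a random continuous time step t."""
    t = min(rng.random(), 0.999)          # random float, clamped into [0, 0.999]
    a = abar(t)                           # calculated, not looked up
    eps = [rng.gauss(0, 1) for _ in x0]   # the noise the model must predict
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return (xt, t), eps                   # inputs (xt, t) and target eps

(xt, t), eps = noisify([0.5, -0.5, 0.0, 0.25])
print(t, inv_abar(abar(t)))               # the round trip recovers t
```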

So now when we look at our input to our UNet training process, you can see, you know, we've

got a t of 0.05, so 5% of the way through the forward diffusion process, it looks like

this, and 65% through it looks like this. TANISHQ: So now the time step, and

basically the process is more of a kind of a continuous time step

and a continuous process, rather, before we were having these discrete time

steps, here we could, it's just any random value, it could be between 0 and 1.

I think, yeah, that's also something. JEREMY: Yeah, which I found is more convenient,

you know, to have a function to call. Yeah, I find this life a little bit easier. So

the model's the same, the callbacks are the same, the fitting process is the same. And so something

which is kind of fun is that we could now, we do now, create a little denoise function,

so we can take, you know, this batch of data that we generated, the noisified data, so

here it is again, and we can denoise it. So we know the t for each element, obviously,

so remember t is different for each element now. And we can therefore calculate

the alpha bar for each element, and then we can just undo the noisification to

get the denoised version. And so if we do that, there's what we get. And so this is great, right?
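
A minimal sketch of that one-step denoise, assuming the usual noising formula is simply solved for the clean image. The function name `denoise` matches the lesson, but the signature and the list-based data here are illustrative assumptions; in practice `noise_pred` would come from the model.

```python
import math

def abar(t):
    return math.cos(t * math.pi / 2) ** 2

def denoise(xt, noise_pred, t):
    """Invert xt = sqrt(abar)*x0 + sqrt(1-abar)*noise for x0 in one step."""
    a = abar(t)
    return [(x - math.sqrt(1 - a) * n) / math.sqrt(a) for x, n in zip(xt, noise_pred)]

# With a perfect noise prediction the clean image comes back (up to rounding).
x0, eps, t = [0.5, -0.25], [1.2, -0.7], 0.45
a = abar(t)
xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
print(denoise(xt, eps, t))
```

The division by the square root of alpha bar is why errors in the predicted noise are amplified so strongly when t is close to 1.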

It shows you what actually happens when we run a single step of the model on varyingly partially

noised images. And this is something you don't see very often, because I guess not many people

are working in these kind of interactive notebook environments where it's really easy to do this

kind of thing, but I think this is really helpful to get a sense of like, okay, if you're 25% of

the way through the forward diffusion process, this is what it looks like when you undo that. If you're 95% of the way through it, this is what

happens when you undo that. So you can see here, it's basically like, oh, I don't really know what

the hell's going on, so at least a noisy mess. Yeah, I guess my feeling from looking

at this is I'm impressed, you know, like this 45% noise thing, it looks all

noise to me. It's found the long-sleeved top. And yeah, it's actually pretty close to the real

one. I looked it up, or you might see it later, it's a little bit more of a pattern here, but it

even gives a sense of the pattern. So it shows you how impressive this is. So this is 35%, you can

kind of see there's a shoe there, but it's really picked up the shoe nicely. So it's, these are

very impressive models in one step, in my opinion. So okay, so sampling is basically the

same, except now, rather than starting with using the range function to create our time

steps, we use linspace to create our time steps. So our time steps start at, you know,

if we did 1,000, it would be 0.999, and they end at 0, and then they're just

linearly spaced with this number of steps. So other than that, you know, abar we now calculate, and the next abar is going to be

whatever the current step is minus one over steps. So if you're doing 100 steps, then you'd be minus

0.01. So this is just stepping through linearly. And yeah, that's actually it for changes.
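
The time-step bookkeeping for sampling could be sketched as follows. The helper names `linspace` and `schedule` are hypothetical, and clamping the final step at t = 0 is an assumption made here so the last "next" alpha bar is exactly 1.

```python
import math

def abar(t):
    return math.cos(t * math.pi / 2) ** 2

def linspace(start, stop, n):
    """n evenly spaced floats from start to stop inclusive."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n - 1)] + [stop]

def schedule(steps):
    """Yield (t, abar_t, abar_next) for each sampling step, noisy to clean."""
    for t in linspace(1 - 1 / steps, 0.0, steps):
        t_next = max(t - 1 / steps, 0.0)   # assumption: clamp the final step at t = 0
        yield t, abar(t), abar(t_next)

for t, a, a_next in list(schedule(100))[:3]:
    print(f"t={t:.2f}  abar_t={a:.4f}  abar_next={a_next:.4f}")
```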

So if we just do DDIM for 100 steps, you know, that works really well, we

get a FID of three, which is actually quite a bit better than we had on

100 steps for our previous DDIM. So this definitely seems like a good sampling approach. And I know Johno is

going to talk a bit more shortly about, you know, some of the things that can make

better sampling approaches. But yeah, definitely, we can see it making a difference here. Did you guys have anything you wanted to

say about this before we move on? JOHNO: No, but it is a nice transition towards

some of the other things we'll be looking at to start thinking about how do we frame this. And

it's also good, like the idea… So the original DDPM paper has a thousand time steps and a lot

of people follow that. But the idea that you don't have to be bound to that, and maybe it is worth

breaking that convention. I know Tanishq made that meme about, you know, this 15 competing

different standards for notation. But yeah, sometimes it's helpful to reframe it, okay, time

goes from zero to one, that can simplify some things. It may complicate others, but yeah, it's

nice to think how you can reframe stuff sometimes. JEREMY: Yeah, and in fact, where we will

head today, by the time we get to notebook 23, we will see, you know, even simpler

notation. And yes, simpler notation generally comes. I think what happens is over time,

people understand better what's the essence of the problem and the approach, and

then that gets reflected in the notation. So, okay, so the next one I

wanted to share is something which is an idea we've been working on for a

while, and it's some new research. So partly, I guess this is an interesting

insight into how we do research. This is 22_noise-pred. And the basic idea of

this was, well, actually, I'm going to take you through it to see what the basic idea is. So

what I'm going to do is I'm going to create, okay, so Fashion-MNIST as before, but

I'm going to create a different kind of model. I'm not going to create

a model that predicts the noise, given the noised image and t. Instead, I'm

going to try to create a model which predicts t, given the noised image. So why did I want to do

that? Well, partly, well, entirely, because I was curious. I felt like when I looked at something

like this, I thought it was pretty obvious, roughly, how much noise each image had.

And so I thought, why are we passing noise, when we call the model, why are we passing in

the noised image and the amount of noise or the t, given that I would have thought the model

could figure out how much noise there is. So I wanted to check my contention, which is

that the model could figure out how much noise there is. So I thought, okay, well, let's create a

model that would try and figure out how much noise there is. So I created a different noisify now,

and this noisify grabs an alpha bar t randomly. And it's just a random number between 0 and 1,

you know, you want one per item in the batch. And so then after just randomly grabbing an alpha

bar t, we then noisify in the usual way. But now our independent variable is the noised image,

and the dependent variable is alpha bar t. And so we're going to try to create a model that

can predict alpha bar t, given a noised image. Okay, so everything else is the same

as usual. And so we can see an example… JOHNO: …you've got alpha

bar t dot squeeze dot logit. JEREMY: Oh yeah, that's true. So the

alpha bar t goes between naught and one. So we've got a choice. Like I mean, we don't have

to do anything, but you know, normally if you've got say between zero and one, you might consider

putting a sigmoid at the end of your model. But I felt like the difference between 0.999 and

0.99 is very significant, you know. So if we do logit, then we don't need the sigmoid at the end

anymore. It'll naturally cover the full range of kind of, you know, it'll be centered at zero,

it'll cover all the normal kind of range of numbers, and it also will treat equal ratios as

equally important at both ends of the spectrum. So that was my hypothesis was that using logit

would be better. I did test it and it was actually very dramatically better. So without this logit

here, my model didn't work well at all. And so this is like an example of where thinking about

these details is really important. Because if I hadn't have done this, then I would have come

away from this bit of research thinking like, oh, I was wrong. We can't predict noise, noise amount.
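
To make the logit argument concrete, here is a small sketch in plain Python. The helper `noisify_target` and its `eps` clamp are hypothetical, added here only to keep the logit finite; they are not taken from the notebook.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Equal ratios get equal spacing in logit space, at both ends of the range.
print(logit(0.5))                    # 0.0: an alpha bar of 0.5 maps to zero
print(logit(0.999) - logit(0.99))    # a large gap despite a raw difference of 0.009
print(logit(0.01) - logit(0.001))    # the mirrored gap at the other end

def noisify_target(rng=random, eps=1e-6):
    """Draw alpha bar uniformly in (0, 1); return it with its logit regression target."""
    abar_t = min(max(rng.random(), eps), 1 - eps)  # assumption: avoid logit(0) or logit(1)
    return abar_t, logit(abar_t)
```

A prediction in logit space can be mapped back to an alpha bar with `sigmoid`, so the model itself needs no final sigmoid layer.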

Yeah, so thanks for pointing that out, Johno. Yeah, so that's why in this example of a

mini-batch, you can see that the numbers can be negative or positive. So 0 would represent

an alpha bar of 0.5. So here 3.05 is not very noised at all, whereas negative one

is pretty noisy. So the idea is that, yeah, given this image, you would have to try to predict

3.05. So one thing I was kind of curious about is, like, it's always useful to know is like, what's

the baseline? Like, what counts as good? You know, because often people will say to me like,

oh, I created a model and the MSE was 2.6. I'll be like, well, is that good? Well,

it's the best I can do, but is it good? Or is it better than random? Or is it

better than predicting the average? So in this case, I was just like, okay, well, what

if we just predicted, actually, this is slightly out of date, I should have said 0 here, rather

than 0.5, but nevermind close enough. So this is before I did the logit thing. So I basically

was looking at like, what's the, you know, loss if you just always predicted a constant,

which as I said, I should have put zero here, haven't updated it. And so it's like, oh, that

would give you a loss of 3.5. Or another way to do it is you could just, just put MSE here

and then look at the MSE loss between 0.5 and your various, just a single mini batch, which we,

yeah, mini batch of alpha bar t’s, logits. Yeah, so, you know, we wanted to get some, you know,

if we're getting something that's about three, then we basically haven't done any better

than random. And so this case, this model, it doesn't actually have anything

to learn. It always returns the same thing. So we can just call fit with

train equals false just to find the loss. So these are just a couple of ways of getting

quickly finding a loss for a baseline naive model. One thing that thankfully PyTorch will warn you

about is if you try to use MSE and your inputs and targets have different shapes, it will broadcast

and give you probably not the result you would expect, and it will give you a warning. So one way

to avoid that is just to use dot flatten on each. So this kind of flattened MSE is useful to avoid

both, avoid the warning and also avoid getting weird errors or weird, sorry, weird results.
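
The constant-prediction baseline discussed above can be checked without any model. This sketch assumes the targets are the logit of a uniformly random alpha bar, as described in the lesson; in that case the targets follow a standard logistic distribution, whose variance is pi squared over 3 (about 3.29), so always predicting 0 should score close to that. The seed and sample size are arbitrary choices.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

rng = random.Random(42)                      # fixed seed, for repeatability only
# targets: logit of a uniform alpha bar, clamped away from 0 and 1
targets = [logit(min(max(rng.random(), 1e-6), 1 - 1e-6)) for _ in range(100_000)]

baseline_0 = mse([0.0] * len(targets), targets)    # always predict 0 (alpha bar 0.5)
baseline_05 = mse([0.5] * len(targets), targets)   # always predict 0.5
print(f"constant 0:   {baseline_0:.2f}")
print(f"constant 0.5: {baseline_05:.2f}")
print(f"pi^2 / 3:     {math.pi ** 2 / 3:.2f}")
```

Predicting 0.5 instead of 0 adds roughly 0.25 to the loss, which is consistent with the figure of about 3.5 quoted for the out-of-date constant.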

So we use that for our loss. So the models, the model that we always use, so it's kind of nice.

We just use our same old model. Nothing changes, even though we're doing something

totally different. Oh, well, okay, that's not quite true. One difference is

that our output, we just have one output now, because this is now a regression model.

It's just trying to predict a single number. And so our Learner now uses MSE as the loss.

Everything else is the same as usual. So we can go ahead and train it and you can see, okay,

the loss is already much better than three, so we're definitely learning something and we end

up with a 0.075 mean squared error. That's pretty good considering, you know, there's a pretty wide

range of numbers we're trying to predict here. So I've got to save that as noise prediction

on sigma. So save that model. And so we can take a look at how it's doing by

grabbing our one batch of noise images, putting it through our tmodel. Actually, it's

really an alpha bar model, but never mind, call it a tmodel. And then we can take a look

to see what it's predicted for each one. And we can compare it to the actual for each

one. And so you can see here, it said, oh, I think this is about 0.91. And actually it is

0.91. So now here it looks like about 0.36. And yeah, it is actually 0.36. So you know, you

can see overall 0.72, it's actually 0.72. Or it's actually right, this one's 0.02 off. But

yeah, my hypothesis was correct, which is that we, you know, we can predict the thing that

we were putting in manually as input. So there's a couple of reasons I was interested in

checking this out. The first was just like, well, yeah, wouldn't it be simpler if we

weren't passing in the t each time? You know, why pass in the t each time? But

it also felt like it would open up a wider range of kind of how we can do sampling. The idea of

doing sampling by like precisely controlling the amount of noise that you try to remove each

time, and then assuming you can remove exactly that amount of noise each time feels limited to

me. So I want to try to remove this constraint. So having built this model, I thought, okay, well,

you know, which is basically like, okay, I think we don't need to pass t in. Let's try it. So what

I then did is I replicated the 22_cosine notebook, I just copied it, pasted it in here. But I made

a couple of changes. The first is that noisify doesn't return t anymore. So there's no

way to cheat. We don't know what t is. And so that means that the UNet now doesn't

have t, so it's actually going to pass 0 every time. So it has no ability to learn from t because it doesn't get t. So it

doesn't really matter what we pass in. We could have changed the UNet to like remove

the conditioning on t. But for research, this is just as good, you know, for finding

out. And it's good to be lazy when doing research. There's no point doing something

a fancy way when you can do it a quick and easy way before you even know if it's going

to work. So yeah, that's the only change. So we can then train the model and we can

check the loss. So the loss here is 0.034. And previously it was 0.033. So

interestingly, you know, maybe it's a tiny bit worse at that,

you know, but it's very close. Okay, so we'll save that

model. And then for sampling, I've got exactly the same DDIM step as usual.

And my sampling is exactly the same as usual, except now when I call the

model, I have no t to pass in. So we just pass in this. I mean, I still

know t because I'm still using the usual sampling approach, but I'm not passing it

to the model. And yeah, we can sample and what happens is actually pretty garbage.

22 is our FID. And as you can see here, you know, some of the images are still really

noisy. So I totally failed. And so that's always a little discouraging when you think

something's going to work and it doesn't. But my reaction to that is like, if I think

something's going to work and it doesn't, is to think, well, I'm just going to have to do a

better job of it. You know, it ought to work. So I tried something different, which is I thought

like, okay, since we're not passing in the t, then we're basically saying like, how much noise

should you be removing? It doesn't know exactly. So it might remove a little bit more noise than

we want or a little bit less noise than we want. And we know from the, you know, testing we did,

that sometimes it's out by like this case 0.02. And I guess if you're out consistently,

sometimes it's, yeah, got to end up not removing all the noise. So the change I made

was to the DDIM step, which is here. And… let me just copy this and get rid of the, unindent

that section just to make it a bit easier to read. Okay. So the DDIM step, this is the normal

DDIM step. Okay. And so step one is the same. So don't worry about that. Cause it's

the same as we've seen before. But what I did was I actually used my tmodel. So I

passed the noised image into my tmodel, which is actually an alpha bar model

to get the predicted alpha bar. And this is, remember, the predicted alpha bar

for each image, because we know from here that sometimes, so sometimes it did a pretty

good job, right? But sometimes it didn't. So I felt like, okay, we need a

predicted alpha bar for each image. What I then discovered is, sometimes that could

be like really too low, right? So what I wanted to make sure of is it wasn't too crazy. So

I then found the median for a mini batch of all the predicted alpha bars, and I clamped

it to not be too far away from the median. And so then what I did when I did my

x_0_hat is rather than using alpha bar t, I used the estimated alpha bar t for each

image, clamped to be not too far away from the median. And so this way it was updating it

based on the amount of noise that actually seems to be left behind rather than the assumed

amount of noise that should be left behind. You know, if we assume it's

removed the correct amount. And then everything else is

the same. So when I did that, so whoa, made all the difference. And here it is.
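
The median clamp and the modified clean-image estimate might be sketched like this. The clamp window `width`, the lower floor, and both function names are hypothetical choices for illustration; the lesson only says the predictions were kept "not too far away from the median".

```python
import math
from statistics import median

def clamp_to_median(abar_preds, width=0.1):
    """Clamp each predicted alpha bar to within `width` of the batch median.
    `width` is a hypothetical setting, not a value taken from the lesson."""
    med = median(abar_preds)
    lo, hi = max(med - width, 1e-6), min(med + width, 1.0)
    return [min(max(a, lo), hi) for a in abar_preds]

def x0_hat(xt, noise_pred, abar_est):
    """Estimate the clean image using the estimated (not the assumed) alpha bar."""
    return [(x - math.sqrt(1 - abar_est) * n) / math.sqrt(abar_est)
            for x, n in zip(xt, noise_pred)]

preds = [0.50, 0.52, 0.01, 0.48, 0.90]   # per-image predictions with two outliers
print(clamp_to_median(preds))
```

Without the clamp, a prediction near zero would make the division in `x0_hat` blow up, which matches the runaway min and max values mentioned later in the discussion.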

They are beautiful pieces of clothing. So 3.88 versus 3.2. That's possibly close enough.

Like I'd have to run it a few times, you know, my guess is maybe it's a

tiny bit worse. But it's pretty close. But like this definitely gives

me some encouragement that, you know, even though this is like something

I just did in a couple of days, whereas the kind of with-t approaches have been

developed since 2015 and we're now in 2023. I, you know, I would expect it's quite likely

that these kind of like no, no t approaches could eventually surpass the t based approaches. And like one thing that definitely makes me

think there's room to improve is if I plot the FID or the KID for each sample

during the reverse diffusion process, it actually gets worse for a while. I'm

like, okay, well that's, that's a bad sign. I have no idea why that's

happening, but it's a sign that, you know, if we could improve each step, then

one would assume we could get better than 3.8. So yeah, Tanishq, Johno, I don't know if you have any thoughts

about that or questions or comments or… JOHNO: Maybe to just like to highlight the

research process a little bit, it wasn't like this linear thing of like, oh, here's this issue not

performing as well as we thought, oh, here's the fix. We just kept this. You know, this was like

multiple days of like discussing and like Jeremy saying like, you know, I'm tearing my hair out. You

guys have any ideas? And, oh, what about this? And, oh, and they're just in the DDIM paper,

they do this clamping, maybe that'll help, you know? So there's a lot of back

and forth and also a lot of like, you saw the code that was commented out there,

print xt.min, xt.max, alpha bar, pred, you know, just like seeing, oh, okay. You know, my

average prediction is about what I would expect, but sometimes the min or the max goes, you

know, 2, 3, 8, 16, 150, 212 million, infinity, you know, maybe like one or two little

baddies that would just skyrocket out. Yeah. And so that kind of like debugging

and exploring and printing the results. JEREMY: And actually our initial discussions

about this idea, I kind of said to you guys before Lesson 1 of Part 2, I said, like, it

feels to me like we shouldn't need the t thing. And so it's actually been like bumbling

away in the background for months. Yeah. JOHNO: And I guess, I mean, we should also

mention, we have tried this, like a friend of ours trained a no-t version of stable diffusion for

us. And we did the same sort of thing. I trained a pretty bad t predictor and it sort of generates

samples. So we're not like focusing on that large scale stuff yet, but it is fun to like, every

now and again, got this idea from Fashion-MNIST, we are trying these out on some bigger models

and seeing, okay, this does seem like maybe it'll work. And so down the line that future

plan is to say, let's actually, you know, spend the time train a proper model and see,

yeah, see how well that does. If it's interesting. JEREMY: You say a friend of ours, we can be more

specific. It's Robin, one of the two lead authors of the stable diffusion paper who yeah, actually

has been fine tuning a real stable diffusion model which is without t and it's looking super

encouraging. So yeah, that'll be fun to play with, with this new, you know, we'll have to train

a t predictor for that. See how it looks. Yep. All right. So I guess the other area we've been

talking about kind of doing some research on is this weird thing that came up over the last two

weeks with our bug in the DDPM implementation where we accidentally weren't doing it

from minus 1 to 1 for the input range, it turned out that actually being from minus

one to one wasn't a very good idea anyway. And so we ended up centering it as

being from minus 0.5 to 0.5. And Johno and Tanishq have managed to actually

find a paper. Well I say find a paper. A paper has come out in the last 24 hours which has coincidentally cast some light on this

and has also cited a paper that we weren't aware of which was not released in the last 24 hours. So

Johno, are you going to tell us a bit about that? JOHNO: Yeah, sure I can do that. So it's

funny, this was such perfect timing because I actually got up early this morning planning

to run with the different input scalings and the cosine schedule that Jeremy was showing

and some of the other schedulers we look at, I thought it might be nice for the lesson

to have a little plot of like what is the FID with these different solvers and input

scalings, but it was going to be a lot of work. I was like, I'm not looking forward to doing

the groundwork. And then Tanishq sent me this paper which AK [@_akhaliq] had just tweeted

out because he reviews everything that comes up on arXiv every day: “On the Importance

of Noise Scheduling for Diffusion Models”. And this is by a researcher at the Google Brain

team who's also done a really cool recent paper on something called a Recurrent Interface

Network outside of the scope of this lesson, but also worth checking out. Yeah,

so this paper they're hoping to study this noise scheduling and the strategies that

you take for that and they want to show that number 1) noise scheduling is crucial for

performance and the optimal one depends on the task. When increasing the image size,

the noise scheduling that you want changes and scaling the input data by some factor

is a good strategy for working with this. JEREMY: And that's the bit

we've been talking about, right? JOHNO: Yeah, that's what we've been doing

where we said, oh, do we scale from minus 0.5 to 0.5 or minus 1 to 1 or do we normalize?

And so they demonstrate the effectiveness by training a really good high resolution

model on ImageNet. So class condition model. JEREMY: It looks great. JOHNO: Yeah, amazing samples.

I’ll show them later. So I really liked this paper. It's very short

and concise and it just gets all the information across. And so they introduce this here. We have

this noising process our noisify function where we have square root of something times X plus

square root of one minus that something times the noise. And here they use gamma, gamma

of t, which is often used for the continuous time case. So instead of the alpha bar and the

beta bar schedule for a thousand time steps, there'll be some function gamma of t that

tells you what your alpha bar should be. JEREMY: Okay. So that's how our function is

actually called abar, but it's the same thing. JOHNO: Yeah. Same thing. Takes in a time step from

0 to 1 and then that's used to noise the image. JEREMY: Interestingly, what they're showing here

actually is something that we had discovered and I've been complaining about that my DDIMs with

an eta of less than one weren't working, which is to say when I added extra noise to the image, it

wasn't working. And what they're showing here is like, oh yeah, duh, if you use a small image, then

adding extra noise is probably not a good idea.

JOHNO: Yeah. And so they, they, they use a lot of reference in this paper to like information

being destroyed and signal to noise ratios, and that's really helpful for thinking about it

because it's not something that's obvious, but at 64 by 64 pixels adjacent pixels might have

much less in common versus the same amount of noise added at a much higher resolution, the

noise kind of averages out and you can still see a lot of the image. So yeah, that's one

thing they highlight is that the same noise level for different image sizes might have

it, it might be a harder or easier task. And so they investigate some strategies for

this. They look at the different noise schedule functions. So we've seen the original version from

the DDPM paper. We've seen the cosine schedule and we've seen, I think we might look at, or

the next thing that Jeremy's going to show us, a sigmoid based schedule. And so they show the

continuous time versions of that and they plot how you can change various parameters to get

these different gamma functions or in our case, the alpha bar where we're starting at all image,

no noise at t equals zero, moving to all noise, no image at t equals one, but the path that

you take, it's going to be different for these different classes of functions and

parameters and the signal to noise ratio, that's what this, or the log signal to noise

ratio is going to change over that time as well. And so that's one of the knobs we can tweak. We're
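
As a rough illustration of how the log signal-to-noise ratio depends on the schedule: with the noising process written as the square root of gamma(t) times the image plus the square root of one minus gamma(t) times the noise, and assuming unit-variance data, the signal-to-noise ratio is gamma over one minus gamma. The `linear_gamma` schedule below is an illustrative comparison chosen here, not necessarily the exact parameterisation used in the paper.

```python
import math

def cosine_gamma(t):
    return math.cos(t * math.pi / 2) ** 2

def linear_gamma(t):
    return 1 - t            # illustrative comparison schedule (assumption)

def log_snr(gamma):
    """log(signal variance / noise variance), assuming unit-variance data."""
    return math.log(gamma / (1 - gamma))

for t in (0.1, 0.5, 0.9):
    c, l = log_snr(cosine_gamma(t)), log_snr(linear_gamma(t))
    print(f"t={t}: cosine {c:.2f}, linear {l:.2f}")
```

Both schedules run from all image to all noise, but they spend different amounts of time at each noise level, which is the knob being described.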

saying our diffusion model isn't training that well. We think it might be related to the noise

schedule and so on. One of the things you could do is try different noise schedules, either

changing the parameters in one class of noise schedule or switching from a linear to a cosine

to a sigmoid. And then the second strategy is kind of what we were doing in those experiments,

which is just to add some scaling factor to x0. JEREMY: Well we were accidentally using b of 0.5. JOHNO: Exactly. And so that's a second dial

that you can tweak is to say keeping your noise schedule fixed, maybe just scale x0, which

is going to change the ratio of signal to noise. JEREMY: And so that's what Figure 4 in c)

there is what we were accidentally doing. JOHNO: Yes. Yeah, exactly. And so see if

we can get to, oh yeah, so that again, changes the signal to noise for different scalings

you get. So that's fine. So they have a compound, they have a strategy that combines some of those

things. And this is the important part. They do their experiments. And so they have a nice

table of investigating different schedules, cosine schedules and sigmoid schedules, and in

bold are the best results. And you can see for 64 by 64 images versus 128 versus 256, the best

schedule is not necessarily always the same. And so that's like important finding number one,

depending on what your data looks like using a different noise schedule might be optimal. There's

no one true best schedule. There's no one value of beta min and beta max that's just magically

the best. Likewise for this input scaling at different sizes with whatever schedules they

tested and different values were kind of optimal. And so, yeah, it's just a really great
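
The effect of the input scaling factor b on the signal-to-noise ratio can be sketched the same way. This assumes the scaled noising process is the square root of gamma times b times the image, plus the square root of one minus gamma times the noise, with unit-variance data; the value of `gamma` below is an arbitrary example.

```python
import math

def log_snr(gamma, b=1.0):
    """log SNR of sqrt(gamma)*b*x0 + sqrt(1-gamma)*eps, assuming unit-variance x0."""
    return math.log(b ** 2 * gamma / (1 - gamma))

gamma = 0.7
for b in (1.0, 0.5, 0.25):
    print(f"b={b}: log SNR = {log_snr(gamma, b):.3f}")

# Scaling by b shifts the whole log-SNR curve by 2*log(b), whatever the schedule.
shift = log_snr(gamma, 0.5) - log_snr(gamma, 1.0)
print(shift, 2 * math.log(0.5))
```

So halving the input range, as in the accidental b of 0.5, makes every time step noisier by the same amount in log-SNR terms without touching the schedule itself.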

illustration, I guess, that this is another design choice that's implicit or explicitly part

of your diffusion model training and sampling is how are you dealing with this noise schedule?

What schedule are you following? What scaling are you doing with your inputs? And by using

this thinking and doing these experiments, and they come up with a kind of rule of thumb

for how to scale the image based on image size, they show that they can, as they increase the

resolution, they can still maintain really good performance. Where previously it was quite

hard to train a really large resolution pixel space model and they're able to

do that. They get some advantage from their fancy Recurrent Interface Network, but

still it's kind of cool that they can say, look, we get state of the art high

quality and 512 by 512 or 1024 by 1024 samples on class conditioned ImageNet and

using this approach to really like consider how well do you train, how many steps do we need to

take? One of the other things in this table is that they compare it to previous approaches. Oh,

we used, you know, a third of the training steps and for the same other settings

and we get better performance. And just because we've chosen

that input scaling better. And yeah, so that's the paper. Really

nice, great work to the team. And that was- JEREMY: I'd love to, you got up in the morning and

thought, oh, it's going to be a hassle training all these different models I need to train

for different input scalings and different sampling approaches. I might just look at Twitter first. And then you looked at Twitter

and there was a paper saying like, Hey!, we just did a bunch of experiments for

different noise schedules and input scaling. JOHNO: Yeah. JEREMY: Does your life always work that

way, Johno? That seems quite blessed. JOHNO: Yeah. It's very lucky like that. Yeah. If

you wait long enough, someone else will do it. TANISHQ: That's why it's always that the time

when AK [@_akhaliq] starts posting on Twitter, it's like my favorite hour of the day.

It's just for all the papers to be posted. JEREMY: Oh, well thank you for that.

So let me switch to notebook 23 because this notebook is actually largely an

implementation of some ideas from this paper that everybody tends to just call it Karras,

unfair ‘cause there's other people. But I will do it anyway, Karras paper. And the

reason we're going to look at this is because in this paper, the authors actually take a

much more explicit look at the question of input scaling. Their approach was not apparently to accidentally put a bug in their code and

then take it out, find it worked worse and then just put it back in again. Their approach

was actually to think, how should things be? So that's an interesting approach to doing things

and I guess it works for them. So that's fine. TANISHQ: I think our approach is more exciting. JEREMY: Yeah, exactly. Our approach is much more

fun because you never quite know what's going to happen. And so, yeah, in their approach,

they actually tried to say like, okay, given all the things that are coming into our

model, how can we have them all nicely balanced? So we will skip back and forth

between the notebook and the paper. So the start of this is all the same,

except now we are actually going to do it minus 1 to 1 because we're not going to rely

on accidental bugs anymore, but instead we're going to rely on the Karras paper's carefully

designed scaling. I say that except that I put a bug in the notebook as well. One of

the things that's in the Karras paper is what is the standard deviation of the actual

data which I calculated for a batch. However, this used to say minus 0.5. I used

to do the minus 0.5 to 0.5 thing. And so this is actually the standard deviation

of the data before when it was still minus 0.5. So this is actually half the real standard

deviation. For reasons I don't yet understand, this is giving me better scaled results.

So this actually should be 0.66. So there's still a bug here and the bug still

seems to work better. So we've still got some mysteries involved. So we're going to leave this.

So it's actually not 0.33, it's actually 0.66. Okay, so the basic idea of this

paper, actually I'll come back. Well let me have a little think. Yeah, okay. Now

we're going to start here. So the basic idea of this paper is to say, you know what, sometimes

maybe predicting the noise is a bad idea. So like you can either try and predict the noise

or you can try and predict the clean image and each of those can be a better

idea in different situations. If you're given something

which is nearly pure noise, you know, the model's given something which

is nearly pure noise and is then asked to predict the noise, that's basically a waste

of time because the whole thing's noise. If you do the opposite, which is you try to get

it to predict the clean image, well then if you give it an image that's nearly clean and try to

predict the clean image, that's nearly a waste of time as well. So you want something which is like,

regardless of how noisy the image is, you want it to be kind of like an equally difficult problem

to solve. And so what Karras do is they basically use this new thing called Cskip, which is a

number which is basically saying like, you know what we should do for the training target,

is not just predict the noise all the time, not just predict the clean image all the time,

but predict a kind of lerp'd version of one or the other depending on how noisy it is. So here y is

the plain image and n is the noise. So y plus n is the noised image. And so if Cskip was zero,

then we would be predicting the clean image. And if Cskip was one, we would be predicting

y minus (y plus n), which is just the noise with its sign flipped, so we would be predicting the noise. And so you can decide by picking a different Cskip whether you're predicting

the clean image or the noise. And so, as you can see from the way they've

written it, they make this a function. They make it a function of sigma. Now this is

where we've got to a point now where we've kind of got a fairly much simpler notation.

There's no more alpha bars, no more alphas, no more betas, no more beta bars. There's just

a single thing called sigma. Unfortunately, sigma is the same thing as alpha bar used to be.

Right. So we've simplified it, but we've also made things more confusing by using an existing symbol for

something totally different. So this is alpha bar. Okay. So there's going to be a function that says,

depending on how much noise there is, we'll either predict the noise or we'll predict the clean

image or we'll predict something between the two. So in the paper, they showed this chart where

they basically said like, okay, let's look at the loss to see how good are we with a trained

model at predicting when sigma is really low. So when there's very small alpha bar or when sigma

is in the middle or when sigma is really high. And they basically said, you know what, when

it's nearly all noise or nearly no noise, you know, we're basically not

able to do anything at all. You know, we're basically good at doing

things when there's a medium amount of noise. So when deciding, okay, what sigmas

are we going to send to this thing? The first thing we need to do is to figure out

some sigmas. And they said, okay, well let's pick a distribution of sigmas that matches this

red curve here, right, as you can see. And so this is a normally distributed curve where this is on a

log scale. So this is actually a log normal curve. So to get the sigmas that they're going to use,

they picked a normally distributed random number and then they exp'd it. And this is

called a log normal distribution. And so they used a mean of minus 1.2 and a

standard deviation of 1.2. So zero is one standard deviation above the mean, which means that about 16% of the time they're going to be

getting a number that's bigger than zero here. And e to the zero is one. So about one sixth

of the time they're going to be picking sigmas that are bigger than one.
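
As a rough sketch of the sampling just described (plain Python rather than the notebook's PyTorch; `sample_sigma` and `frac_above` are illustrative names, not from the notebook or the paper), the following draws sigmas by exponentiating a normal sample with mean -1.2 and standard deviation 1.2. It also computes the exact fraction of sigmas above 1 from the normal CDF, so that figure can be checked directly against the sampled one.

```python
import math
import random

P_MEAN, P_STD = -1.2, 1.2  # mean and std of log(sigma), as quoted above

def sample_sigma(rng):
    """Draw a normal random number, then exp it: a log-normal sigma."""
    return math.exp(rng.gauss(P_MEAN, P_STD))

def frac_above(threshold):
    """Exact P(sigma > threshold), via the normal CDF of log(sigma)."""
    z = (math.log(threshold) - P_MEAN) / P_STD
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

rng = random.Random(0)  # fixed seed so the run is repeatable
sigmas = sorted(sample_sigma(rng) for _ in range(100_000))
median = sigmas[len(sigmas) // 2]
sampled_frac = sum(s > 1.0 for s in sigmas) / len(sigmas)

print(f"median sigma: {median:.3f}")
print(f"fraction above 1: sampled {sampled_frac:.3f}, exact {frac_above(1.0):.3f}")
print(f"largest sigma drawn: {sigmas[-1]:.1f}")
```

Most draws are well below 1, but the long right tail occasionally produces very large sigmas, which is why the histogram is hard to read without clipping.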

And so here's a histogram I drew of the sigmas that we're going to

be using. And so it's nearly always, you know, less than five. But sometimes it's way

out here. And so it's quite hard to read these histograms. So this really nice library called

seaborn, which is built on top of Matplotlib, has some more sophisticated and often nicer

looking plots. And one of them they have is called a kdeplot, which is a kernel density plot. It's

a histogram, but it's smooth. And so I clipped it at 10 so that you could see it better. So you can

basically see that the vast majority of the time it's going to be somewhere, you know, about 0.4

or 0.5. But sometimes it's going to be really big. So our noisify is going to pick a sigma

using that log normal distribution. And then we're going to get the noise as

usual. But now we're going to calculate c_skip, right? Because we're going to do that

thing we just saw. We're going to find something between the plain image and the noised input.

So what do we use for c_skip? We calculate it here. And so what we do is we say, what's the

total amount of variance at some level of sigma? Well, it's going to be sigma squared. That's the

definition of the variance of the noise. But we also have the variance of the data itself (its sigma squared),

right? So if we add those two together, we'll get the total variance. And

so what the Karras paper said to do is to do the variance of the data divided by

the total variance and use that for c_skip. So that means that if your total variance is really

big, so in other words, it's got a lot of noise, then c_skip is going to be really

small. So if you've got a lot of noise, then this bit here will be really small.

So that means if there's a lot of noise, try to predict the original image, right? That

makes sense because predicting the noise would be too easy. If there's hardly any noise, then

this will be, total variance will be really small, right? So c_skip will be really big. And

so if there's hardly any noise, then try to predict the noise. And so

that's basically what this c_skip does. So it's a kind of slightly weird idea is that our

target, the thing we're trying to do actually is not the input image, sorry, the original image.

It's not the noise, but it's somewhere between the two. And I've found the easiest way to

understand that is to draw a picture of it. So here is some examples of noised

input, right? With various amounts of… with various sigmas. Remember sigma

is alpha bar, right? So here's an example with very little noise, 0.06.

And so in this case, the target is predict the noise, right? So that's the

hard thing to do, is predict the noise. Or else here's an example, 4.53, which is nearly

all noise. So for nearly all noise, the target is predict the image, right? And then for something

which is a little bit between the two, like here, 0.64, the target is predict some

of the noise and some of the image. So that's the idea of Karras. And so what this does is it's making the

problem to be solved by the UNet equally difficult, regardless of what sigma is.

It doesn't solve our input scaling problem. It solves our kind of difficulty scaling problem.
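
To put numbers on the three examples above, here is a small sketch (plain Python; `target_mix` is an illustrative helper, and `SIG_DATA = 0.66` is an assumed data standard deviation) that computes c_skip as data variance over total variance. It then expands the unscaled target y - c_skip * (y + n) into its image and noise parts, leaving out the output rescaling that is applied on top of this.

```python
SIG_DATA = 0.66  # assumed std of the data; the talk quotes 0.66 as the real value

def c_skip(sig, sig_data=SIG_DATA):
    """Variance of the data divided by the total variance."""
    return sig_data**2 / (sig**2 + sig_data**2)

def target_mix(sig):
    """Weights (w_y, w_n) such that target = w_y * y + w_n * n.

    With noised = y + n, the unscaled target y - c_skip * noised
    expands to (1 - c_skip) * y - c_skip * n.
    """
    cs = c_skip(sig)
    return 1.0 - cs, -cs

for sig in (0.06, 0.64, 4.53):  # the three sigmas called out above
    w_y, w_n = target_mix(sig)
    print(f"sigma={sig:<5} c_skip={c_skip(sig):.3f}  target = {w_y:.3f}*y + ({w_n:.3f})*n")
```

At sigma 0.06 nearly all of the weight is on the noise, at 4.53 nearly all of it is on the image, and at 0.64 the two are roughly balanced.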

To solve the input scaling problem, they do it. TANISHQ: I just want to make one quick note. And

so this idea of interpolating between the noise and the image is similar to what's called the

V-Objective as well. So there's also a similar kind of, it's quite similar to what Karras

et al. has. But that's also now been used in a lot of different models. For example,

Stable Diffusion 2.0 was trained with this sort of V-Objective. So people are using this

sort of methodology and getting good results. So it's an actual practical thing that people are

doing. So yeah, just want to make a note of that. JEREMY: Yeah. As is the case of basically

all papers created by NVIDIA researchers, of which this is one, it flies under

the radar and everybody ignores it. The V-Objective paper came from the senior

author, was Tim Salimans, which is Google, right? So anything from Google and OpenAI, everybody

listens to. So yeah, although Karras, I think has done the more complete version of this.

And in fact, the V-Objective was almost like mentioned in passing in the

distillation paper. But yeah, that's the one that everybody has ended up

looking at. But I think this is the more complete… TANISHQ: I think what happened with

the V-Objective is not many people paid attention to it. I think folks like Kat

and Robin, these sorts of folks are actually paying attention to that V-Objective

in that Google Brain paper. But then also this paper did a much more principled

analysis of this sort of thing. So yeah, I think it's very interesting how, yeah. Sometimes

even these sort of side notes in papers that maybe people don't pay much attention to,

they can actually be quite important. JEREMY: Yeah. Yeah. So okay, so the noised input

as usual is the input image plus the noise times the sigma. And then, as we discussed, we

decide what our target is. But then we actually take that noised input

and we scale it up or down by this number. And the target, we also scale up or down by this

number. And those are both calculated in this thing as well. So here's c_out and here's c_in. Now I just wanted to show one

example of where these numbers come from because for a while they all seem pretty mysterious to

me and I felt like I'd never be smart enough to understand them, particularly because they

were explained in the mathematical appendix of this paper, which are always the bits I

don't understand, until I actually try to, and then it tends to turn out they're not so bad

after all, which was certainly the case here. TANISHQ: I think it was up, it was B something

I think. So the B6 I think, is the other one. JEREMY: Oh, yeah. So in appendix B6,

which does look pretty terrifying, but if you actually look at, for example,

what we were just talking about, c_in, it's like, how do they calculate? So c_in is

this. Now this is the variance of the noise, this is the variance of the data, add them

together to get the total variance, square roots, the total standard deviation. So it's just the

inverse of the total standard deviation, which is what we have here. Where does that come

from? Well they just said, you know what? The inputs for a model should have unit variance.

Now we know that. We've done that to death in this course. So they just said, all right,

so well the input to the model is the clean data plus the noise, times some number we're

going to calculate, and we want its variance to be one. Okay, so the variance of the clean images plus the

noise is equal to the variance of the clean images plus the variance of the noise.

Okay, so if we want that to be, if we want variance to be one, then divide

both sides by this and take the square root, and that tells us that our multiplier has to be

one over this. That's it. So it's like literally, you know, classical math. The only bit you have

to know is that the variance of two independent things added together is the sum of the variances of the

two things, which is not rocket science either. JOHNO: And in this context, like why we want to

do this: when we looked at those sigmas that you're plotting, like the distribution, you've

got some that are fairly low, but you've also got some where the standard deviation sigma is

like 40, right? So the variance is super high. JEREMY: Yes. JOHNO: And so we don't want to feed something with

standard deviation 40 into our model. You would like it to be closer to unit variance. So we're

thinking, okay, well, if you divide by roughly 40, that would scale it down. But then we've also got

some extra variance from our data. So it's like 40 plus a little bit from the variance of the data. We want to scale back down by

that to get unit variance. JEREMY: Yeah. I mean, I love this paper

because it's basically just doing what we spent weeks doing. I feel like everything that

we've done that's improved every model has always been one thing, which is, can we get

mean zero, variance one inputs to our model and for all of our activations? And then the

only other thing is include enough compute by adding enough layers and enough activations.

Those two things seem to be all that matters. Basically, well, I guess ResNets added an extra

cool little thing to that, which is to make it even smoother by giving this kind of like identity

path. So yeah, basically trying to make things as smooth as possible and as equal

everywhere as possible. So yeah, this is what they've done. So they did that

for the inputs and then they've also done it for the outputs. And for the

outputs, you know, it's basically the same idea, you know, and they have basically

the same kind of analysis to show that. And so with this, so now, yeah, we've

basically, we've got our noised input. We've got the, you know, kind of lerp'd version

somewhere between x nought and the noised input. We've got the scaling of the output and we've got

the scaling of the input. So now for the inputs to our model, we're going to have the scaled noised input.

We're going to have the sigma and we're going to have the target, which is somewhere between

the image and the noise. And so, yeah, so I've, you know, never seen anybody draw a picture

of this before. So it was really cool when, you know, being in a notebook, being able to see

like, oh, that's what they're doing, you know? So yeah, have a good look at this notebook

to see exactly what's going on. Cause I think it gives you a really good intuition

around what problem it's trying to solve. So then I actually checked the noised input has

a standard deviation of one, the mean's not zero. And of course, why would it be? We didn't

do anything, you know, the only thing Karras cared about was having the variance one. We could

easily adjust the input and output to have a mean of zero as well. That's something I think we or

somebody should try. Cause I think it does seem to help a bit as we saw with that generalized

ReLU stuff we did. But it's less important than the variance. And so same with the target.
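
The unit-variance claim can be checked numerically. This is a sketch under stated assumptions, not the notebook's code: the "pixels" are toy Gaussian samples with standard deviation `SIG_DATA` (real pixel data is not Gaussian), `check_unit_std` is an illustrative name, and the c_skip, c_out and c_in formulas are the ones given in the Karras et al. paper.

```python
import math
import random

SIG_DATA = 0.66  # assumed data std

def scalings(sig, sig_data=SIG_DATA):
    totvar = sig**2 + sig_data**2               # noise variance + data variance
    c_skip = sig_data**2 / totvar
    c_out = sig * sig_data / math.sqrt(totvar)
    c_in = 1.0 / math.sqrt(totvar)              # inverse of the total std
    return c_skip, c_out, c_in

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def check_unit_std(sig, n=50_000, seed=0):
    """Return (std of scaled input, std of scaled target) at noise level sig."""
    rng = random.Random(seed)
    c_skip, c_out, c_in = scalings(sig)
    inputs, targets = [], []
    for _ in range(n):
        x0 = rng.gauss(0.0, SIG_DATA)           # toy Gaussian "pixel"
        noised = x0 + sig * rng.gauss(0.0, 1.0)
        inputs.append(c_in * noised)
        targets.append((x0 - c_skip * noised) / c_out)
    return std(inputs), std(targets)

for sig in (0.06, 0.64, 40.0):
    s_in, s_tgt = check_unit_std(sig)
    print(f"sigma={sig:<5} input std={s_in:.3f}  target std={s_tgt:.3f}")
```

With Gaussian toy data both numbers come out close to one at every sigma, including a sigma of 40. On real images the match depends on how well a single standard deviation summarizes the pixel distribution.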

It's got a standard deviation of about one, and yeah, this is where if I changed this to the correct value, which is 0.66,

then actually it's slightly further away from one both here and here, quite a lot further away. And

maybe that's because actually the data's, well, we know the data's not Gaussian distributed. Pixel

data definitely isn't Gaussian distributed. So this bug turned out better. Okay. So the UNet's the same, the

initialization's the same. This is all the same. Train it for a while. We can't compare the

losses, right? Because our target's different. So, but what we can do is we can create

a denoise function that just undoes, as per usual, the thing

we had in noisify, right? And so to get x nought, you're going to multiply the target by

c_out and then add c_skip times noised_input. Here it is: multiply by c_out, add noised_input times

c_skip. Okay. So we can denoise. So let's grab our sigmas from the actual batch we had. Let's

calculate c_skip, c_out and c_in for the sigmas in our mini batch. Let's use the model to predict

the target given the noised input and the sigmas and then denoise it. And so here's our noised

input, which we've already seen, and here's our predictions. And these are

absolutely remarkable in my opinion. Yeah. Like this one here, I can barely see it.
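
The denoise step that produced these predictions is just noisify run backwards. A minimal sketch on single numbers instead of image tensors (the function signatures here are assumptions for illustration, not the notebook's exact ones): if the model predicted the target perfectly, multiplying by c_out and adding c_skip times the noised input would recover the clean value exactly.

```python
import math

SIG_DATA = 0.66  # assumed data std

def scalings(sig, sig_data=SIG_DATA):
    totvar = sig**2 + sig_data**2
    return sig_data**2 / totvar, sig * sig_data / math.sqrt(totvar), 1.0 / math.sqrt(totvar)

def noisify(x0, sig, eps):
    """Return (scaled noised input, training target) for clean value x0."""
    c_skip, c_out, c_in = scalings(sig)
    noised = x0 + sig * eps
    target = (x0 - c_skip * noised) / c_out
    return c_in * noised, target

def denoise(target_pred, scaled_input, sig):
    """Undo noisify: x0 = c_out * target + c_skip * noised."""
    c_skip, c_out, c_in = scalings(sig)
    noised = scaled_input / c_in              # undo the input scaling
    return c_out * target_pred + c_skip * noised

x0, eps = 0.3, -1.2                           # arbitrary clean value and noise draw
for sig in (0.06, 0.64, 4.53, 80.0):
    scaled_input, target = noisify(x0, sig, eps)
    recovered = denoise(target, scaled_input, sig)   # pretend the model is perfect
    print(f"sigma={sig:<5} target={target:+.3f} recovered x0={recovered:.6f}")
```

A real model only approximates the target, so the recovered image gets blurrier and less certain as sigma grows, which is what the pictures show.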

You know, it's really found. Look at the shirt. There's the shirt here. It's actually really

finding the little thing on the front and let me show you. Here's what it should look like. And in

cases where the sigma is pretty high, like here, you can see it's really like saying like, I

don't know, maybe it's shoes, but it could be something else. Is it shoes? Yeah, it wasn't

shoes, but at least it's kind of got the, you know, the bulk of the pixels in

the right spot. Yeah. Something like this one is 4.5. Has no idea what it is. It's

like, oh, maybe it's shoes. Maybe it's pants. You know, it turns out it is shoes. Yeah. So

I think that's fascinating how well it can do. And then the other thing I did, which I

thought was fun, was I just used a sigma of 80, which is actually what they

do when they're doing sampling from pure noise. That's what they consider the pure noise level.

So I just created some pure noise and denoised it just for one step. And so here's what happens

when you denoise it for one step. And you can see it's kind of overlaid all the possibilities.

It's like, I can see a pair of shoes here, a pair of pants here, a top here. And sometimes

it's kind of like more confident that the noise is actually a pair of pants. And sometimes

it's more confident that it's actually shoes. But you can really get a sense of how like from

pure noise, it starts to make a call about like what this noise is actually covering up. And

this is also the bit which I feel is like, I'm the least convinced about when it comes to

diffusion models. This first step of going from like pure noise to something and like trying to

have a good mix of all the possible somethings. I don't know, it feels a bit handwavy to me.

It clearly works quite well, but I'm not sure if it's like we're getting the full range of

possibilities. And I feel like some of the papers we're starting to see is starting to say

like, you know what, maybe this is not quite the right approach. Then maybe later in the course,

we'll look at some of the ones that look at what we call VQ models and tokenized stuff.

Anyway, I thought this was pretty interesting to see these pictures, which I don't think,

yeah, I've never seen any pictures like this before. So I think this is a fun result from

doing all this stuff in notebooks step by step. Okay, so sampling. So one of the nice things

with this is the sampling becomes much, much, much simpler. And so, and I should mention

a lot of the code that I'm using, particularly in the sampling section is heavily inspired by,

and some of it's actually copied and pasted from Kat's k-diffusion repo, which is, I think

I mentioned before, some of the nicest generative modeling code or maybe the nicest

generative modeling code I've ever seen. It's really great. So before we talk about the actual

sampling, the first thing we need to talk about is what sigma do we use at each reverse time step.

And in the past, we've always, well, nearly always done something which I've always

felt is sketchy as all hell, which is we've just linearly gone down the sigmas or the alpha bars

or the t’s. So here, when we're sampling in the previous notebook, we used linspace. So I always

felt like that was questionable. And I felt like at the start, it was probably just

noise anyway. So who cares? So I, in DDPM_v3, I experimented with something

that I thought intuitively made more sense. I don't know if you remember this one, but I

actually said, oh, let's, for the first hundred time steps, let's actually only run the model

every 10 times. And then for the next hundred, let's run it every nine times. The next one hundred,

let's run it every eight times. So basically at the start, be much less careful. And so Karras

actually ran a whole bunch of experiments. And they said, yeah, you know what, at the start of

sampling, you know, you can start with a high sigma, but then like step to a much lower sigma

in the next step and then a much lower sigma in the next step. And then the further you go, the more you

step by smaller and smaller steps so that you spend a lot more time fine tuning carefully

at the end and not very much time at the start. Now, this has its own problems. And in

fact, a paper just came out today, which we probably won't talk about today, but maybe

another time, which talked about the problems is that in these very early steps, this is the

bit where you're trying to create a composition that makes sense. Now for Fashion-MNIST, we

don't have much composing to do. It's just a piece of clothing. But if you're trying to

do an astronaut riding a horse, you know, you've got to think about how all those pieces fit

together. And this is where that happens. And so I do worry that the Karras approach is maybe not

giving that enough time. But as I've said, that's really the same as this first step,

that whole piece feels a bit wrong to me. But aside from that, I think this makes a

lot of sense, which is that, yeah, the sampling, you should jump, you know, by big steps early

on and small steps later on and make sure that the fine details are just so. So that's what

this function does, is it creates this plot. Now it's this schedule of reverse diffusion

sigma steps. It's a bit of a weird function in that it's the, the rho-th root of sigma,

where rho is seven. So the seventh root of sigma is basically what it's scaling on. But

the answer to why it's that is because they tried it and it turned out to work pretty

well. Do you guys remember where this was? TANISHQ: This is the

truncation error analysis, D1. JEREMY: Nice memory. So this image here –so thanks

to Tanishq for reminding me where this is– shows FID as a function of rho. So it's basically

what root we are taking. And they basically said, like, if you take the fifth root or higher,

it seems to work well, basically. So yeah, so that's a perfectly good way to do

things is just to try things and see what works. And you'll notice they tried things,

just like we love, on small datasets, not as small as us because we're the king of small

datasets, but smallish, CIFAR-10, ImageNet-64. That's the way to do things. So I saw like –it

might've even been the CEO of Hugging Face the other day– tweeted something saying only people

with huge amounts of GPUs can do research now. And I think it totally misunderstands how research

is done, which is research is done on very small datasets. That's the actual research. And then

when you're all done, you scale it up at the end. I think we're kind of pushing the

envelope in terms of like, yeah, how much can you do? And yeah, we've like

re-covered this kind of main substantive path of diffusion models history step-by-step showing

every improvement and seeing clear improvements across all the papers using nothing but

Fashion-MNIST running on a single GPU in like 15 minutes of training or something per model.
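
Coming back to the schedule of sigmas for the reverse steps, here is a sketch of the rho-th-root spacing described above, in plain Python. The function name follows the k-diffusion convention; `sigma_max=80` and `rho=7` are the values mentioned in the talk, while `sigma_min=0.01` and the trailing zero are assumptions made here for illustration.

```python
def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    """n sigmas spaced linearly in sigma**(1/rho), from sigma_max down to sigma_min, then 0."""
    max_inv_rho = sigma_max ** (1.0 / rho)
    min_inv_rho = sigma_min ** (1.0 / rho)
    sigs = []
    for i in range(n):
        ramp = i / (n - 1)                      # goes 0 -> 1 across the steps
        sigs.append((max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho)
    return sigs + [0.0]                         # final step lands on the clean image

sigs = sigmas_karras(100)
print("first five sigmas:", [round(s, 2) for s in sigs[:5]])
print("last five sigmas: ", [round(s, 4) for s in sigs[-5:]])
print("first step size:", round(sigs[0] - sigs[1], 2))
print("step size near the end:", round(sigs[-3] - sigs[-2], 5))
```

The steps are large while sigma is high and become progressively smaller as sigma approaches the minimum, which is the "big steps early, small steps later" behaviour.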

So yeah, definitely don't need lots of GPUs. Anyway. Okay. So this is the sigma we're going

to jump to. So the denoising is going to involve calculating the c_skip, c_out and c_in and calling

our model with the c_in scaled data and the sigma and then scaling it with c_out and then doing the

c_skip. Okay. So that's just undoing the noisify. So check this out. Here's all that's required to

do one step of denoising for the simplest kind of scheduler, which is sorry, the simplest

kind of sampler, which is called Euler. So we basically say, okay,

what's the sigma at time step i, and what's sigma two, the sigma at the next time step. And

now when I'm talking about time step, I'm really talking about like the step from

this function, right? So this is, this is – JOHNO: Sampling step. JEREMY: Sampling step, yeah. Okay.

So then denoise –using the function– and then we say, okay, well just

send back whatever you were given, plus move a little bit in the direction of the

denoised image. So the direction is x minus denoised. So that's the noise, that's the gradient

as we discussed right back in the first lesson of this part. So we'll take the noise. If we divide

it by sigma, we get a slope. That's how much noise is there per sigma. And then the amount

that we're stepping is sigma two minus sigma one. So take that slope and multiply it by the

change, right? So that's the distance to travel towards the noise, that this fraction, you

know, or you could also think of it this way. And I know this is a very obvious

algebraic change, but if we move this over here, you could also think of this as

being, oh, of the total amount of noise, the change in sigma we're doing, what

percentage is that? Okay, well that's the amount we should step. Right? So there's two

ways of thinking about the same thing. So again, this is just, you know,

high school math. Well I mean, actually my seven year old daughter has done all

these things. It's plus minus dividing times. So we're going to need to do this once per

sampling step. So here's a thing called sample, which does that. That's going to

go through each sampling step, call our sampler, which initially we're going to

do sample Euler, right? With that information, add it to our list of results and do it again.
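
Here is a self-contained sketch of the Euler step and the sampling loop, in plain Python on a single number. Since there is no trained UNet here, `toy_denoise` is a stand-in: it is the exact denoiser for 1-D Gaussian "data" with standard deviation `SIG_DATA`, which is an assumption made purely so the result can be compared with a known answer. `sigma_min=0.01` is also an assumption.

```python
import math
import random

SIG_DATA = 0.66  # assumed data std for the toy problem

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def toy_denoise(x, sig):
    """Stand-in for the model: ideal denoiser when the data is N(0, SIG_DATA**2)."""
    return x * SIG_DATA**2 / (SIG_DATA**2 + sig**2)

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    denoised = denoise(x, sig)
    slope = (x - denoised) / sig               # noise per unit of sigma
    return x + slope * (sig2 - sig)            # move by the change in sigma

def sample(sampler, denoise, steps=100, sigma_max=80.0, seed=0):
    rng = random.Random(seed)
    sigs = sigmas_karras(steps, sigma_max=sigma_max)
    x = rng.gauss(0.0, 1.0) * sigma_max        # start from pure noise at sigma_max
    preds = [x]
    for i in range(len(sigs) - 1):             # once per sampling step
        x = sampler(x, sigs, i, denoise)
        preds.append(x)
    return preds

preds = sample(sample_euler, toy_denoise)
# For this toy denoiser the exact answer is known in closed form.
exact = preds[0] * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
print(f"start={preds[0]:.3f}  euler result={preds[-1]:.4f}  exact={exact:.4f}")
```

The loop never divides by the final sigma of zero, because each step only divides by the sigma it is leaving.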

So that's it. That's all the sampling is. And of course we need to grab

our list of sigmas to start with. So I think that's pretty cool. And at the very

start we need to create our pure noise image. And so the amount of noise we

start with has got a sigma of 80. Okay, so if we call sample using sample Euler,

we get back some very nice looking images and, believe it or not, our FID is 1.98. So this

extremely simple sampler, three lines of code plus a loop has given us a FID of 1.98, which is

clearly substantially better than our baseline. Now we can improve it from there. So one

potential improvement is to, you might've noticed we added no new noise at all, right? This

is a deterministic sampler, right? There's no rand anywhere here. So we can do something called

an Ancestral Euler Sampler, which does add rand, right? So we basically do the denoising in the

usual way, but then we also add some rand. And so what we do need to make sure is given that

we're adding a certain amount of randomness, we need to remove that amount of randomness

from the step that we take. So I won't go into the details, but basically there's our way of

calculating how much new randomness and how much just going back in the existing direction do we

do. And so there's the amount in the existing direction and there's the amount in the new

random direction. And you can just pass in eta, which is just going to, when we pass it into here,

is going to scale that. So if we scale it by half, so basically half of it is new noise and

half of it is going in the direction that we thought we should go, that makes it better

still. Again with a hundred steps and just make sure I'm comparing to the same. Yep. A hundred

steps. Okay. So it's fair. Like with like. Okay. So that's adding a bit of extra noise. Now then the… something that I think we might've

mentioned back in the first lesson of this part is something called Heun's method. And Heun's method does something which we can

pictorially see here to decide where to go, which is basically we say, okay, where

are we right now? What's the, you know, at our current point, what's the direction? So we

take the tangent line, the slope, right? That's basically all it does is it takes a slope. So

it's not, here's a slope, you know? Okay. And so if we take that slope and that

would take us to a new spot and then at that new spot, we can then

calculate a slope at the new spot as well. And at the new spot, the slope is something else.

So that's it here, right? And then you say like, okay, well, let's go halfway between the two

and let's actually follow that line. And so basically it's saying like, okay, each of

these slopes is going to be inaccurate. But what we could do is calculate the slope of

where we are, the slope of where we're going and then go halfway between the two. I actually

find it easier to look at in code personally. I just kind of delete a whole bunch of stuff

that's totally irrelevant to this conversation. So take a look at this compared to Euler. So here's our Euler, right? So we're going

to do the same first line exactly the same, right? Then the denoising is exactly the same,

right? And then this step here is exactly the same. I've actually just done it in multiple

steps for no particular reason. And then say, okay, well, if this is the last step, then we're

done. So actually the last step is Euler. But then what we do is we then say, well, that's okay

for an Euler step, this is where we'd go. Well, what does that look like if we denoise

it? So this calls the model the second time, right? And where would that take us if we took

an Euler step there? And so here, if we took an Euler step there, what's the slope? And so what

we then do is we say, oh, okay, well, it's just, just like in the picture, let's take the average.

Okay, so let's take the average and then use that for the step. So that's all the Heun sampler

does is it just takes the average of the slope where we're at and the slope where

the Euler method would have taken us. And so if we now, so notice that it called the

model twice for a single step. So to be fair, since we've been taking a hundred steps with

Euler, we should take 50 steps with Heun, right? Because it's going to call the model twice.
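
The Heun step can be sketched next to Euler on the same toy problem (plain Python, single numbers; the Gaussian `toy` denoiser, `sigma_min=0.01`, and the helper names `make_counting_denoiser` and `run` are all assumptions for illustration). A counter on the denoiser shows how many model evaluations each sampler uses.

```python
import math

SIG_DATA = 0.66  # assumed data std for the toy problem

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def make_counting_denoiser():
    """Toy ideal denoiser for N(0, SIG_DATA**2) data that counts its calls."""
    count = {"calls": 0}
    def denoise(x, sig):
        count["calls"] += 1
        return x * SIG_DATA**2 / (SIG_DATA**2 + sig**2)
    return denoise, count

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    return x + (x - denoise(x, sig)) / sig * (sig2 - sig)

def sample_heun(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    d = (x - denoise(x, sig)) / sig                    # slope where we are
    x_euler = x + d * (sig2 - sig)                     # where Euler would take us
    if sig2 == 0:
        return x_euler                                 # last step is plain Euler
    d2 = (x_euler - denoise(x_euler, sig2)) / sig2     # slope at that new spot
    return x + (d + d2) / 2 * (sig2 - sig)             # follow the average slope

def run(step_fn, steps, x_start=80.0):
    denoise, count = make_counting_denoiser()
    sigs = sigmas_karras(steps)
    x = x_start
    for i in range(len(sigs) - 1):
        x = step_fn(x, sigs, i, denoise)
    return x, count["calls"]

exact = 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
for name, fn, steps in (("euler", sample_euler, 100), ("heun", sample_heun, 50)):
    x, calls = run(fn, steps)
    print(f"{name}: {calls} model calls, result {x:.5f} (exact {exact:.5f})")
```

In this sketch 50 Heun steps cost 99 model calls rather than 100, because the final step falls back to Euler. How the two compare on a toy problem says nothing definitive about FID on real images.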

And still that is now, whoa, we beat 1, which is pretty amazing. And so we could keep going,

check this out. We can even go down to 20. This is actually doing 40 model evaluations and this is

better than our best Euler, which is pretty crazy. Now something which you might've noticed is kind

of weird about this or kind of silly about this is we're calling the model twice just

in order to average them, but we already have two model results like without calling it twice,

because we could have just looked at the previous time step. And so something called the LMS

sampler does that instead. And so the LMS sampler, if I call it with 20, it actually literally

does 20 evaluations and actually it beats Euler with a hundred evaluations. And so

LMS, I won't go into the details too much. It didn't actually fit into my little sampling very

well. So I basically largely copied and pasted Kat's code. But the key thing it does is,

it gets the current sigma. It does the denoising, it calculates the slope and it stores

the slope in a list, right? And then it pops the first one off the list. So it's kind of

keeping a list of up to, in this case, four at a time. And so then it uses up to the last four to basically

guess, kind of, the curvature of this and take the next step. So that's pretty smart. And yeah, so

I think if you wanted to do super fast sampling, it seems like a pretty good way to do it.
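
A rough sketch of the linear multistep idea, in plain Python on single numbers. It keeps up to four recent slopes and weights them by the integral of each Lagrange basis polynomial over the next step. This is not Kat's code: the integral here is done with Simpson's rule (exact for the cubic polynomials that arise when the order is at most 4), which is a substitution made to keep the example free of dependencies, and the Gaussian toy denoiser and `sigma_min=0.01` are assumptions.

```python
import math

SIG_DATA = 0.66  # assumed data std for the toy problem

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    hi, lo = sigma_max ** (1 / rho), sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def toy_denoise(x, sig):
    return x * SIG_DATA**2 / (SIG_DATA**2 + sig**2)

def lms_coeff(order, t, i, j):
    """Integral over [t[i], t[i+1]] of the j-th Lagrange basis polynomial
    through the last `order` sigmas (Simpson's rule; requires order <= 4)."""
    def basis(tau):
        prod = 1.0
        for k in range(order):
            if k != j:
                prod *= (tau - t[i - k]) / (t[i - j] - t[i - k])
        return prod
    a, b = t[i], t[i + 1]
    return (b - a) / 6 * (basis(a) + 4 * basis((a + b) / 2) + basis(b))

def sample_lms(x, sigs, denoise, order=4):
    ds = []                                            # recent slopes, oldest first
    for i in range(len(sigs) - 1):
        ds.append((x - denoise(x, sigs[i])) / sigs[i]) # one model call per step
        if len(ds) > order:
            ds.pop(0)                                  # forget the oldest slope
        cur = min(i + 1, order)                        # fewer slopes exist early on
        coeffs = [lms_coeff(cur, sigs, i, j) for j in range(cur)]
        x = x + sum(c * d for c, d in zip(coeffs, reversed(ds)))
    return x

sigs = sigmas_karras(20)
result = sample_lms(80.0, sigs, toy_denoise)
exact = 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
print(f"lms with {len(sigs) - 1} model calls: {result:.5f} (exact {exact:.5f})")
```

With `order=1` the single coefficient is just the change in sigma, so the method reduces to the Euler step. The higher orders reuse slopes that have already been computed instead of calling the model again.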

And I think Johno, you were telling me that, or maybe it's Pedro was saying that

currently people have started to move away, that this was very popular, but people

started to move towards a new sampler, which is a bit similar called the

DPM++ sampler, something like that. TANISHQ: Yeah. Yeah. Yeah. JOHNO: Yeah. JEREMY: But I think it's

the same idea. So it kind of keeps a list of recent results and uses

that. I'll have to check it more closely. JOHNO: The similar idea is like, if it's

done more than one step, then it's using some history to the next thing. JEREMY: Yeah. [unintelligible] in Heun

doesn't make a huge amount of sense, I guess, from that perspective. I mean, still works very

well. This makes more sense. So then, we can compare if we use an actual mini-batch of data,

we get about 0.5. So yeah, I feel like this is quite a stunning result to get very close to real data, at least in terms of

FID. You know, really with 40 model evaluations. And the entire, nearly the entire thing here is

by making sure we've got unit variance inputs, unit variance outputs, and kind of equally

difficult problems to solve in our loss function. JOHNO: Yeah. Plus having that different schedule

for sampling, that's completely unrelated to the training schedule. So one of the big things

with Karras et al's paper was they also could apply this to like, oh, existing diffusion

models that have been trained by other papers, we can use our sampler and in fewer steps

get better results without any of the other changes. And yeah, I mean, they do a

little bit of rearranging equations to get the other papers versions into

their c_skip, c_in, c_out framework. But then, yeah, it's really nice that these

ideas can be applied too. So for example, I think Stable Diffusion, especially version one, was

trained with DDPM-style training, epsilon objective, whatever. But you can now get these different

samplers and different sampling schedules and things like that and use that to sample it and do

it in 15, 20 steps and get pretty nice samples. JEREMY: Yeah. You know, and another

nice thing about this paper is they, you know, in fact, it's the name of

the paper, “Elucidating the Design Space of Diffusion-Based …” models. You know, they looked

at various different papers and approaches and tried to say, like, oh, you know what? These

are all doing the same thing when we kind of parameterize things in this way. And if you fill

in these parameters, you get this paper and these parameters, you get that paper, you know? And

then, so, we found a better set of parameters, which… it was very nice to code because, you know,

it really actually ended up simplifying things a whole lot. And so if you look through the

notebook carefully, which I hope everybody will, you'll see, you know, that the code

is really clear and simple compared to previous, all the previous ones, in

my opinion. Like I feel like every notebook we've done from DDPM onwards,

the code's got easier to understand and the results… TANISHQ: And just to, again, clarify, like,

how this connects with some of the previous papers that we've looked at. So like, for example,

with the DDIM, the deterministic, that's again, the deterministic approach, that's similar to the

Euler method sampler that we were just looking at, which was completely deterministic. And then

something like the Euler Ancestral that we were looking at is similar to

the standard DDPM approach with the, that was kind of a more stochastic approach.
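
To make the deterministic versus stochastic contrast concrete, here is a minimal sketch of one Euler step next to one Euler Ancestral step on scalars. The function names and the `rng` argument are assumptions for illustration, and the variance split shown is the commonly used one, which may differ in detail from what the notebook does. The ancestral step moves deterministically down to a slightly lower noise level `sig_down`, then adds fresh noise of size `sig_up`, chosen so that `sig_down**2 + sig_up**2 == sig_next**2`.

```python
import math
import random

def ancestral_sigmas(sig_from, sig_to, eta=1.0):
    # Split the target noise level into a deterministic part (sig_down)
    # and freshly added noise (sig_up); eta=0 recovers the plain Euler step.
    sig_up = min(sig_to, eta * math.sqrt(
        sig_to**2 * (sig_from**2 - sig_to**2) / sig_from**2))
    sig_down = math.sqrt(sig_to**2 - sig_up**2)
    return sig_down, sig_up

def euler_step(x, denoised, sig, sig_next):
    # Deterministic: the same inputs always give the same output.
    d = (x - denoised) / sig
    return x + d * (sig_next - sig)

def euler_ancestral_step(x, denoised, sig, sig_next, rng):
    # Stochastic: step down to sig_down, then add new noise of size sig_up.
    sig_down, sig_up = ancestral_sigmas(sig, sig_next)
    d = (x - denoised) / sig
    x = x + d * (sig_down - sig)
    return x + rng.gauss(0.0, 1.0) * sig_up

# Hypothetical values: current sample 3.0, model's denoised guess 1.0.
print(euler_step(3.0, 1.0, 2.0, 1.0))
print(euler_ancestral_step(3.0, 1.0, 2.0, 1.0, random.Random(0)))
print(euler_ancestral_step(3.0, 1.0, 2.0, 1.0, random.Random(1)))
```

When the next sigma is 0, both `sig_up` and `sig_down` are 0, so the final ancestral step adds no noise and simply returns the denoised estimate.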

So, again, there's just all these sorts of connections that then are kind of nice to

see, again, the sorts of connections between the different papers and how they change it, how

they can be expressed in this common framework. JEREMY: Yeah. Thanks, Tanishq. So we definitely

now are at the point where we can show you the UNet next time. And so I think we're, unless any

of us come up with interesting new insights on the unconditional diffusion sampling, training

and sampling process, we might be putting that aside for a while. And instead we're going

to be looking at creating a good quality UNet from scratch. And we're going to look at a

different data set to do that. This was starting to scale things up a bit as Johno mentioned in

the last lesson. So we're going to be using a 64 by 64 pixel ImageNet subset, called Tiny ImageNet.

So we'll start looking at some 3-channel images. So I'm sure we're all sick of looking at black

and white shoes. So now we get to look at cliff dwellings and trolley buses and koala bears and

yeah, 200 different things. So that'll be nice. Yeah. All right. Well, thank you, Johno. Thank

you, Tanishq. That was fun as always. And next time will be Lesson 22. Bye. JOHNO: This was Lesson 22. JEREMY: Oh, no way. Okay. See you.
