Lesson 19: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

JEREMY: Okay.

Hi everybody. And this is Lesson 19 with extremely 

special guests, Tanishq and Johno. Hi guys.

How are you? TANISHQ: Hello.

JOHNO: Hey Jeremy. Good to be here.

JEREMY: And it's New Year's Eve, 2022, finishing off 2022 with a bang, or at least a really cool

lesson. And most of this lesson is 

going to be Tanishq and Johno, but I'm going to start with a quick

update from the last lesson. What I wanted to show you is that Christopher 

Thomas on the forum came up 

with a better winning result for our challenge, the Fashion-MNIST challenge, 

which we are tracking here. And be sure to check out this forum 

thread for the latest results. And he found that he was able to 

get better results with Dropout. Then Piotr on the forum 

noticed I had a bug in my code. And the bug in my code for ResNets, actually 

I won't show you, I'll just tell you, is that in the ResBlock, I was not passing 

along the BatchNorm parameter. And as a result, all the results 

I had were without BatchNorm. So then when I fixed BatchNorm and added 

Dropout at Christopher's suggestion, I got better results still.

And then Christopher came up with a better Dropout and got better results still for 50

epochs. So let me show you the 93.2 

for 5 epochs improvement. I won't show the change to BatchNorm because 

that's actually, that'll just be in the repo now.

So the BatchNorm is already fixed. So I'm going to tell you about what 

Dropout is and then show that to you. So Dropout is a simple but powerful idea where 

what we do with some particular probability, so here that's a probability of 0.1, 

we randomly delete some activations. And when I say delete, what I actually 

mean is we change them to zero. So one easy way to do this is to 

create a binomial distribution object where the probabilities

are 1-p and then sample from that. And that will give you a 0.1 probability.

So in this case, oh, this is perfect. I have exactly one 0.

Of course, randomly, that's not always going to be the case.

But since I asked for 10 samples and 0.1 of the time it should be zero, I so happened

to get, yeah, exactly one of them. And so if we took a tensor like this 

and multiplied it by our activations, that will set about

a 10th of them to zero because multiplying by zero gives you zero.

So here's a Dropout class. So you pass it and you say what probability 

of Dropout there is, store it away. Now we're only going to do 

this during training time. So at evaluation time, we're not 

going to randomly delete activations. But during training time, we will 

create our binomial distribution object. We will pass in the 1-p probability. And then you say, how many 

binomial trials do you want to run? So how many coin tosses or dice 

rolls or whatever each time? And so it's just one.

And this is a cool little trick. If you put that one onto your accelerator, you 

know, GPU or MPS or whatever, it's actually going to create a binomial 

distribution that runs on the GPU. That's a really cool trick that 

not many people know about. And so then if I sample and I make a sample 

exactly the same size as my input, then that's going to give me a bunch of ones and zeros 

and a tensor, the same size as my activations. And then another cool trick is this is going 

to result in activations that are on average about one tenth smaller.

So if I multiply by 1/(1-0.1), so in this case multiply by 1/0.9, then that's going

to scale up my activations to undo that difference. JOHNO: Jeremy.

JEREMY: Yeah. JOHNO: In the line above where you have 

probs equals 1-p, should that be 1-self.p? JEREMY: Oh, it absolutely should.

Thank you very much, Johno. Not that it matters too much because, yeah, you can always just use nn.Dropout at this

point and I only have to use 0.1, which is why I didn't even see that.

So as you can see, I'm not even bothering to export this because I'm just showing how

to repeat what's already available in PyTorch. So yeah, thanks, Johno.

That's a good fix. Yeah, so if we're in evaluation mode, 

it's just going to return the original. If p=0, then these are all 

going to be just ones anyway. So we'll be multiplying by 1 divided 

by 1, so there's nothing to change. So with p of 0, it does nothing in effect.

Yeah, and otherwise it's going to kind of zero out some of our activations.
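
A minimal sketch of the Dropout class described above, written in plain Python over a flat list of activations so that it runs without PyTorch. This is an illustration under assumptions, not the notebook's code: the `seed` argument and the list-based interface are made up for the example, and in practice you would just use `nn.Dropout`.

```python
import random

class Dropout:
    """Plain-Python sketch of dropout on a flat list of activations."""
    def __init__(self, p=0.1, seed=None):
        self.p = p                    # probability of zeroing an activation
        self.training = True          # set to False at evaluation time
        self.rng = random.Random(seed)

    def __call__(self, x):
        # In evaluation mode (or with p == 0) the input passes through unchanged.
        if not self.training or self.p == 0:
            return list(x)
        keep = 1 - self.p             # note: 1 - self.p, as Johno pointed out
        # One Bernoulli trial per activation: 1 with probability keep, else 0.
        mask = [1.0 if self.rng.random() < keep else 0.0 for _ in x]
        # Scale the survivors by 1 / (1 - p) so the average activation is unchanged.
        return [xi * m / keep for xi, m in zip(x, mask)]

drop = Dropout(p=0.1, seed=0)
acts = [1.0] * 10_000
out = drop(acts)
zero_frac = out.count(0.0) / len(out)   # roughly 0.1
mean_out = sum(out) / len(out)          # roughly 1.0 thanks to the rescaling
```

About a tenth of the values come back as zero, and the rescaling keeps the mean close to the original.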

So we can, a pretty common place to add dropout is before your last linear layer.

So that's what I've done here. So yeah, if I run the exact same epochs, I 

get 93.2, which is a very slight improvement. And so the reason for that is that it's not 

going to be able to kind of memorize the data or the activations, you know, because 

there's a little bit of randomness. So it's going to force it to try to identify 

just the actual underlying differences. There's a lot of different 

ways of thinking about this. You can almost think of it as a bagging 

thing, a bit like a random forest. You know, it's each time it's giving a 

slightly different kind of random subset. Yeah, but that's what it does.

I also added a Dropout2d layer right at the start, which is not particularly common.

I was just kind of like showing it. This is also what Christopher Thomas tried 

as well, although he didn't use Dropout2d. What's the difference between 

Dropout2d and Dropout? So this is actually something I'd like you 

to do to implement yourself as an exercise, is to implement Dropout2d.

The difference is that with Dropout2d, rather than using x.size() 

as our tensor of ones and zeros, so in other words, potentially dropping 

out every single batch, every single channel, every single x, y independently.

Instead, we want to drop out an entire kind of grid area, all of the channels together.

So if any of them are zero, then they're all zero. So you can look up the docs for Dropout2d for 

more details about exactly what that looks like.
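
One possible starting point for this exercise, sketched in plain Python. It assumes the behaviour given in the PyTorch docs for `nn.Dropout2d`, where each channel of each sample (a whole 2D feature map) is kept or zeroed as a unit, so there is one coin toss per (sample, channel) pair rather than one per element. The nested-list layout `[batch][channel][height][width]` and the `rng` argument are choices made for this illustration only.

```python
import random

def dropout2d(x, p=0.1, training=True, rng=random):
    """x is a nested list shaped [batch][channel][height][width]."""
    if not training or p == 0:
        return x
    keep = 1 - p
    out = []
    for sample in x:
        new_sample = []
        for channel in sample:
            # One coin toss per (sample, channel); the whole grid shares the result.
            m = (1.0 if rng.random() < keep else 0.0) / keep
            new_sample.append([[v * m for v in row] for row in channel])
        out.append(new_sample)
    return out

# A batch of 16 samples, 8 channels each, 4x4 grids filled with ones.
x = [[[[1.0] * 4 for _ in range(4)] for _ in range(8)] for _ in range(16)]
y = dropout2d(x, p=0.5, rng=random.Random(0))

# A simple check: every channel must be entirely zero or entirely 1 / (1 - p).
channel_values = [{v for row in ch for v in row} for s in y for ch in s]
```

Feeding in a tensor of ones, as here, is one way to test it: a correct implementation never produces a channel that mixes zeros and non-zeros.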

But yeah, so the exercise is to try and implement that from scratch and come up with a way to

test it. So like actually check that 

it's working correctly, because it's a very easy thing to think that

it's working and then realize it's not. So then, yeah, Christopher Thomas actually 

found that if you remove this entirely and only keep this, then you end up 

with better results for 50 epochs. And so he's the first to break 95%.

So I feel like we should insert some kind of animation or trumpet sounds or something

at this point. I'm not sure if I'm clever enough to do that 

in the video editor, but I'll see how I go. Hooray!

Okay. So that's about it for me.

Did you guys have any other things to add about Dropout, how to understand it or what

it does or interesting things? Oh, I did have one more thing before.

But you go ahead if you've got anything to mention.

JOHNO: So I was going to ask just because I think the standard is to set it, like remove the

dropout before you do inference. But I was wondering if there's anyone you 

know of, or if it works to use it for some sort of test time augmentation.

JEREMY: Oh, dude! Thank you.

Because I wrote a callback for that. Did you see this or are you just like (JOHNO: 

no), okay, just a test time dropout callback. Nice.

So yeah, before_epoch, if you remember, in 

Learner, we put it into training mode. Which actually what it does is it puts 

every individual layer into training mode. So that's why for the module itself, we can 

check whether that module's in training mode. So what we can actually do is after that's 

happened, we can then go back in this callback and apply a lambda that says 

if this is a Dropout, then… wait, this is, yeah, then put

it in training mode all the time, including at evaluation.

And so then you can run it multiple times, just like we did for TTA, but with this callback.

Now that's very unlikely to give you a better result because it's not kind of showing it

different versions or anything like that, like TTA does that are kind of meant to be

the same. But what it does do is it gives 

you a sense of how confident it is. If it kind of has no idea, then that little 

bit of dropout's quite often going to lead to different predictions.

So this is a way of kind of doing some kind of confidence measure.
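
A rough sketch of the test-time dropout idea, in plain Python rather than as a miniai callback. The toy logistic model, its weights, and the function names are all hypothetical. The only point is to show dropout left switched on at inference, the same input run many times, and the spread of the predictions used as an uncalibrated confidence signal.

```python
import math
import random
import statistics

def predict(features, weights, p, rng):
    """One forward pass of a toy logistic model with dropout left ON."""
    keep = 1 - p
    dropped = [f * (1.0 if rng.random() < keep else 0.0) / keep for f in features]
    z = sum(w * f for w, f in zip(weights, dropped))
    return 1 / (1 + math.exp(-z))

def mc_dropout(features, weights, p=0.1, n_passes=100, seed=0):
    """Run the same input many times and summarise the predictions."""
    rng = random.Random(seed)
    preds = [predict(features, weights, p, rng) for _ in range(n_passes)]
    return statistics.mean(preds), statistics.pstdev(preds)

weights = [2.0, -1.0, 0.5, 1.5]            # hypothetical trained weights
inputs = [1.0, 0.2, -0.3, 0.8]             # hypothetical input features
mean_pred, spread = mc_dropout(inputs, weights)
_, spread_without_dropout = mc_dropout(inputs, weights, p=0.0)
```

A larger `spread` means the prediction is more sensitive to which activations get dropped. As noted above, you would still need to calibrate what counts as "large" on examples the model should and should not be confident about.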

You'd have to calibrate it by, kind of, looking at things 

that it should be confident about and not confident about and seeing how 

that dropout, test time dropout changes. But the basic idea, it's been 

used in medical models before. I wouldn't say it's totally popular, which 

is why I didn't even bother to show it being used, but I just want to add it here because 

I think it's an interesting idea and maybe could be more used than it is, or at 

least more studied than it has been. A lot of stuff that gets used in the medical 

world is less well known out in the rest of the world.

So maybe that's part of the problem. Cool.

All right. So I will stop my sharing and we're going to 

switch to Tanishq, who's going to do something much more exciting, which is to show that 

we are now at a point where we can do DDPM from scratch or at least 

everything except the model. And so to remind you, DDPM doesn't have the 

latent VAE thing and we're not going to do conditional. So it's not going to be like, we're not 

going to get to tell it what to draw. And the U-Net model itself is the 

one bit we're not going to do today. We're going to do that next lesson.

But, but other than the U-Net, it's going to be unconditional DDPM from scratch.

So Tanishq, take it away. Okay.

Hi, welcome back. Sorry for the slight continuity problem.

You may notice people look a little bit different. That's because we had some Zoom issues.

So we have a couple of days have passed and we're back again.

And then Johno recorded his bit before we do Tanishq's bit, and then we're going

to post them in backwards. So hopefully there's not too many confusing 

continuity problems as a result and it all goes smoothly, but it's time to turn 

it over to Tanishq to talk about DDPM. TANISHQ: So we've reached the point where we have 

this miniai framework and I guess it's time to now start using it to build more, 

I guess, sophisticated models. And as we'll see here, we can start putting 

together a diffusion model from scratch using the miniai library, and we'll see 

how it makes our life a lot easier. And also it'd be very nice to see how the 

equations in the papers correspond to the code.

I have here, of course, the notebook 

that we'll be working from. The paper, which we have the diffusion model 

paper, “Denoising Diffusion Probabilistic Models”, which is the paper 

that was published in 2020. It was one of the original diffusion model 

papers that set off the entire trend of diffusion models and is a good starting point 

as we delve into this topic further. And also I have some diagrams and 

drawings that I will also show later on. But yeah, basically let's just get started 

with the code here and of course the paper. So just to provide some context with this 

paper, this paper was published from this group in UC Berkeley, I think a few of 

them have gone on now to work at Google. And this is Pieter Abbeel, he 

has a big lab at UC Berkeley. And so diffusion models were actually originally 

introduced in 2015, but this paper in 2020 greatly simplified the diffusion models and 

made it a lot easier to work with and got these amazing results as you can see here 

when they trained on faces and in this case CIFAR-10 and this really was very, kind 

of a big leap in terms of the progress of diffusion

models. And so just to kind of briefly 

provide, I guess, kind of an overview. JEREMY: If I could just quickly step 

in and just mention something, which is, when we started this course, we

talked a bit about how perhaps the diffusion part of diffusion models is not actually all

that. Everybody's been talking about 

diffusion models because that's, particularly because that's

the open source thing we have that works really well.

But this week, actually a model that appears to be quite a lot better than Stable Diffusion

was released that doesn't use diffusion at all. Having said that, the basic ideas, like most 

of the stuff that Tanishq talks about today, will still appear in some kind of form, 

but a lot of the details will be different. But strictly speaking, actually, I don't even 

know if we've got a word anymore for the kind of like modern generative 

model things we're doing. So in some ways, when we're talking about 

diffusion models, you should maybe replace it in your head with some other word, which 

is more general and includes this paper that Tanishq is looking at here.

JOHNO: Iterative Refinement, perhaps? That's what I'd like.

JEREMY: Yeah, that's not bad, iterative refinement.

I'm sure by the time people watch this video, probably, you know, somebody will have decided

on something. We will keep our course website up to date.

TANISHQ: Yeah. Yeah.

This is the paper that Jeremy was talking about and yeah, every week there seems to

be another state of the art model. But yeah, like Jeremy said, a lot of the 

principles are the same, but the details can be different

for each paper. And I just want to again, also, like Jeremy 

was saying, zoom back a little bit and talk a little bit about what, just to provide 

a review of what we're trying to do here. So let me just right next to him here.

Yeah. So with this task, we were trying to, in this 

case, we're trying to do image generation. Of course, it could be other forms of 

generation, like text generation or whatever. And the general idea is that of 

course we have some data points. In this case, we have some images of dogs 

and we want to produce more like the data points that we're given.

So in this case, maybe the dog image generation or something like this.

And so the overall idea that a lot of these approaches take for some sort of generative

modeling task is they try to... Not over there, I’m going to mark here.

They try to... Oops, what happened here?

Maybe it might... Yeah.

So let me use it in a bit. p(x), which is basically the likelihood...

What's going to happen here? Likelihood of data point x.

So let's say x is some image. Then p(x) tells us what is the probability 

that you would see that image in real life. And we can take a simpler example, which may 

be easier to think about, of a one-dimensional data point like height, for example. And if we were to look at height, of course 

we know we have a data distribution that's kind of a bell curve.

And you have maybe some mean height, which is something like 5'9", 5'10".

I guess 5'10", or something like that, or 5'9", whatever.

And then of course we have some more unlikely points, but they are still possible.

Like for example, we have 7'8", or we have something that's maybe not as likely, which

is like 3', or something like this. JEREMY: So here's the X axis is height, and the 

Y axis is the probability of some random person you meet being that tall.

TANISHQ: Exactly. So this is basically the probability.

And so of course you have this sort of peak, which is where you have higher probability.

And so those are the sorts of values that you would see more often.

So this is what we would call our p(x). And the important part about p(x) is that 

you can use this now to sample new values if you know what p(x) is, or if you have 

some sort of information about p(x). So for example, here you can think of, if 

you were to say, maybe have some, let's say you have some game and you have some human 

characters in the game, and you just want to randomly generate a height for this human 

character, you wouldn't want to of course select a random height between 3 and 7, 

that's kind of uniformly distributed. You would instead maybe want to have the height 

dependent on this sort of function, where you would more likely sample values in the 

middle and less likely sample these sorts of extreme points.

So it's dependent on this function p(x). So having some information about p(x) 

will allow you to sample more data points. And so that's kind of the overall goal of 

generative modeling is to get some information about p(x) that then allows us to sample 

new points and create new generations. So that's kind of a high level kind of description 

of what we're trying to do when we're doing generative modeling.
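
The game-character example can be sketched in a few lines of plain Python. The mean of 69 inches (5'9") and the standard deviation of 3 inches are assumed numbers for illustration, and a Gaussian is only a stand-in for the true p(x).

```python
import random

rng = random.Random(0)
MEAN_IN, STD_IN = 69.0, 3.0   # assumed mean (5'9") and spread, in inches

def sample_height_uniform():
    # Every height between 3' and 7' is equally likely: not how real heights look.
    return rng.uniform(36, 84)

def sample_height_from_p():
    # Heights near the mean are drawn far more often than the extremes.
    return rng.gauss(MEAN_IN, STD_IN)

def frac_near_mean(heights, window=6):
    """Fraction of samples within `window` inches of the mean height."""
    return sum(abs(h - MEAN_IN) <= window for h in heights) / len(heights)

uniform_heights = [sample_height_uniform() for _ in range(10_000)]
p_heights = [sample_height_from_p() for _ in range(10_000)]

uniform_frac = frac_near_mean(uniform_heights)   # about a quarter of samples
p_frac = frac_near_mean(p_heights)               # the large majority of samples
```

Sampling according to p(x) concentrates the generated heights where real heights are common, which is the property we want from a generative model.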

And of course there are many different approaches. We have our famous GANs, which used to be the 

common method back in the day before diffusion models.

We have VAEs, which I think we'll probably talk a 

little bit more about that later as well.

JEREMY: We'll be talking about both of those techniques later.

Yeah. TANISHQ: Yeah.

So there are many different other techniques. There are also some niche techniques 

that are out there as well. But of course now the popular one is are these 

diffusion models or as we talked about, maybe a better term might be, iterative 

refinement or whatever the term ends to be. But yeah, so there are many different techniques.

And yeah, so this is kind of the general diagram that shows what diffusion models are.

And if we can look at the paper here, which let's pull up the paper.

Yeah, you see here, this is the sort of, they call it directed graphical model.

It's a very complicated term. It's just kind of showing 

what's going on in this process. There's a lot of complicated math here, but 

we'll highlight some of the key variables and equations here.

So basically the idea is that, okay, so let's see here.

This is an image that we want to generate, right? And so x0 is basically, these are 

actually the samples that we want. So we want to, x0 is what we want to generate.

And these would be, yeah, these are images. And we start out with pure noise.

So that's what xT is, pure noise. And the whole idea is that we have two processes.

We have this process where we're going from pure noise to our image.

And we have this process from our image to pure noise.

So the process where we're going from our image to pure noise, this is called the forward

process. Forward, sorry, my typing is still, 

my handwriting is not so good in it. So hopefully it's clear enough.

Let me know if it's not. So we have the forward process, which 

is mostly just used for training. Then we also have our reverse process.

This is the reverse process, which I will write up here.

Reverse process. JEREMY: So this is a bit of a summary, I guess, 

of what you and Wasim talked about in Lesson TANISHQ: And just, it's just mostly to highlight 

now what are the different variables as we look at the code and see the 

different variables in the code. JEREMY: Okay, so we'll be focusing today on the 

code, but the code will be referring to things by name and those names won't make sense very 

much unless we see what they're used for in the math.

Okay. TANISHQ: Yeah.

And I won't dive too much into the math. I just want to focus on these sorts of 

variables and equations that we see in the code. So basically the general idea is that 

we do these in multiple different steps. We have here from time step 0 all the way to 

time step uppercase T. And so there's some fixed number of steps, but then we have this 

intermediate process where we're going from some particular time step.

We have this time step lowercase t, which is some noisy image.

And yes, we're transitioning between these two different noisy images.

So we have this, what is sometimes called the transition.

We have this one here. This is like something that's 

called the transition kernel or yeah, whatever it is, it basically

is just telling us, you know, how do we go from, you know, one, in this case, we're going

from a less noisy image to a more noisy image and then going backwards, it's going from

a more noisy image to a less noisy image. So let's look at the equations.

JEREMY: So the forward direction is trivially easy: to make something more noisy,

you just add a bit more noise to it. And the reverse direction is incredibly difficult, 

and in particular to go from the far left to the far right is strictly 

speaking impossible because none of that person's face

exists anymore. But somewhere in between you could certainly 

go from something that's partially noisy to less noisy by a learned model.

TANISHQ: Exactly. And that's like one of the little things I've 

done right now in terms of, you know, in terms of I guess the symbols and the math. So yeah, basically I'm just trying to pull 

out the, just to write down the equations here.

So we have, let me zoom in a bit. So we have our two, let's see here.

q of xt, x(t minus 1). Or actually, you know what, maybe it's 

just better if I just snipped it from here. So the one that is going from our 

forward process is this equation here. So I'll just make that a 

little smaller for you guys. So right there.

So that is going, and basically to explain, we have this sort of script, a little bit

of a, maybe a little bit confusing notation here, but basically this is referring to a

normal distribution or a Gaussian distribution. And this is just saying, okay, this is a Gaussian 

distribution that's describing this particular variable.

So it's just saying, okay, N is our normal or Gaussian 

distribution, and it's representing this variable x of t, or x, sorry, 

xt. And then we have here is the mean, and this is

the variance. So just to again, clarify, I think we've talked 

about this before as well, but this is a, this is of course a bad drawing of a Gaussian, 

but our mean is just, our mean is just this, the middle point here is the mean, and the 

variance just kind of describes the sort of spread of the Gaussian distribution.

So if you think about it a little further, you have this beta, which is one of the important variables that kind of describes 

the diffusion process, beta-t. So you'll see the beta-t in the code.

And basically beta-t increases as t increases. So basically your beta-t will be 

greater than your beta-(t minus 1). So if you think about that a little bit more 

carefully, you can see that, okay, so at t-1, at this time point here, and then you're 

going to the next time point, you're going to increase your beta-t.

So you're increasing the variance, but then you have this 1 - beta-t and take the

square root of that and multiply it by x(t minus 1).

So as your t is increasing, this term actually decreases.

So your mean is actually decreasing and you're getting less of the original image,

because the original image is going to be part of x(t minus 1).
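
For reference, the forward transition being described here can be written out as follows. This is the single-step Gaussian from the DDPM paper, with mean sqrt(1 - beta-t) times x(t minus 1) and variance beta-t:

```latex
\[
q(x_t \mid x_{t-1}) \;=\; \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
\]
```

As beta-t grows with t, the factor multiplying x(t minus 1) shrinks and the variance grows, which is the behaviour described above.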

JEREMY: And just to let you know, Tanishq, we can't see your pointer.

So if you want to point at things, you would need to highlight them or something.

TANISHQ: So yeah, I'll just, let's see. Yeah.

Basically, I wasn't pointing at anything in specific. I was just saying that, yeah, basically 

if we have our xt here, as the time step increases, you're getting less 

contribution from your x(t minus 1). And so that means your mean is going towards zero.

And so you're going to have a mean of 0 and the variance keeps increasing and basically

you just have a Gaussian distribution and you lose any contribution from the original

image as your time step increases. So that's why when we start out from 

x0 and go all the way to our xT here, this becomes pure noise.

It's because we're doing this iterative process where we keep adding noise.

We lose that contribution from the original image and that leads to the image having pure

noise at the end of the process. JEREMY: Something I find useful here is to 

consider one extreme, which is to consider x1. So at x1, the mean is going to 

be root 1 - beta-t times x0. The reason that's interesting 

is x0 is the original image. So we're taking the original 

image and at this point 1 - beta-t will be pretty

close to 1. So at x1, we're going to have something 

that's the mean is very close to the image and the variance will be very small.

And so that's why we will have an image that just has a tiny bit of noise.
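
A small simulation can make both extremes concrete. This sketch applies the forward step repeatedly to a single scalar "pixel" in plain Python. The linear schedule from 1e-4 to 0.02 over 1000 steps is an assumption for illustration, as are the function names.

```python
import math
import random

T = 1000
# Assumed linear beta schedule, increasing with t.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

def forward_chain(x0, rng):
    """Return [x_1, ..., x_T] for one scalar 'pixel', adding noise step by step."""
    xs, x = [], x0
    for beta in betas:
        # Mean is sqrt(1 - beta) * previous value, variance is beta.
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * rng.gauss(0, 1)
        xs.append(x)
    return xs

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
chains = [forward_chain(1.0, rng) for _ in range(500)]
x1 = [c[0] for c in chains]       # after one step: still very close to x0 = 1.0
xT = [c[-1] for c in chains]      # after T steps: roughly mean 0, variance 1
xT_var = mean([(v - mean(xT)) ** 2 for v in xT])
```

After one step every chain is still almost exactly the original value, while after all T steps the values look like draws from a standard normal with no trace of the starting pixel.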

TANISHQ: Right, right. And then another thing that sometimes it's 

easier to write out is sometimes you can write out, in this case, you can write out q(xt) 

directly because these are all independent, in the sense that q(xt) is only 

dependent on x(t minus 1). And then x(t minus 1) is only 

dependent on x(t minus 2). And each of these steps is independent.

So based on the different laws of probability, you can get your q(xt) in closed form.

So, yeah, that's what's shown here: q of xt given the original image.

So this is also another way of kind of seeing this more clearly where you can see that.

Anyways, so I'm going back here. So this is another way to see here more directly.

So this is, of course, our clean image. And this is our clear, our noisy image.

And so you can also see again, now alpha bar t is dependent on beta t.

Basically it's like one minus the cumulative. JEREMY: I mean, we'll see 

the code for it, I guess. So maybe.

TANISHQ: Yes, yes. So it might be clear to see that this 

is alpha bar t or something like this. But basically, basically the idea is that 

alpha bar t is going to be, again, less. This is what is going to be 

less than alpha bar t minus one. So basically alpha, this keeps decreasing, right?

This decreases as time step increases. And on the other hand, this is going to 

be increasing as time step increases. But again, you can see the contribution from 

the original image decreases as time step increases while the noise, as shown by the 

variance, is increasing while the time step is increasing.

Anyway, so that hopefully clarifies the forward process.
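
To make alpha bar t concrete, here is a plain-Python sketch. In the DDPM paper, alpha-t is 1 - beta-t and alpha bar t is the cumulative product of the alphas up to step t. The closed form then has mean sqrt(alpha bar t) times x0 and variance 1 - alpha bar t. The linear schedule values and the name `q_sample` are assumptions for illustration.

```python
import math
import random

T = 1000
# Assumed linear beta schedule, increasing with t.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1 - b for b in betas]

# alpha_bar[t] is the cumulative product of (1 - beta) up to and including step t.
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

def q_sample(x0, t, rng):
    """Jump straight from x0 to x_t in one go (t is a 0-based schedule index)."""
    eps = rng.gauss(0, 1)
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1 - alpha_bar[t]) * eps

rng = random.Random(0)
early = q_sample(1.0, 0, rng)        # almost all signal
late = q_sample(1.0, T - 1, rng)     # almost all noise
```

Because every factor is below 1, alpha bar t only ever decreases, so the weight on the original image falls while the noise variance 1 - alpha bar t rises.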

And then the reverse process is basically a neural network, as Jeremy had mentioned.

And yeah, screenshot this.

That's our reverse process. And basically the idea is, well, this is a 

neural network and this is also a neural network. Neural network.

And we learn it during the training of the model. But the nice thing about this particular diffusion 

model paper that made it so simple was actually, we completely ignored this and actually set 

it to constants just based on, you know, big numbers.

JEREMY: We can't see what you're pointing at. So I think it's important to 

mention what this is here. TANISHQ: This term here.

So this one, we just kind of ignore and it's just a constant dependent on beta-t.

So you only have one neural network that you need to train, which is basically referring

to this mean. And the nice thing about this diffusion 

model process is that it also reparameterizes the mean into this easier form where 

you do a lot of complicated math, which we'll not

get into here. But basically you get this kind of simplified 

training objective where, let's see here. Yeah, you see the simplified training objective.

You instead have this epsilon-theta function. And let me just screenshot that again. This is our loss function that we train 

and we have this epsilon-theta function. You can see it's a very 

simple loss function, right? This is just a, let me just write this down.

This is just an MSE loss. And we have this epsilon-theta function here.

That is our- JEREMY: …maybe for those of us who are less mathy, it 

might not be obvious that it's a simple thing, because it looks quite complicated 

to me, but once we see it in code, it'll be simple.

TANISHQ: Yes, yes. Basically you're just doing like, and 

you'll see it in code how simple it is. But this is like just an MSE loss.

So we've seen MSE loss before, but you'll see how, yeah, this is basically MSE. So the nice, so just to kind of take a step 

back again, what is this epsilon-theta? Because this is like a new thing that 

like seems a little bit confusing. Basically epsilon, you can see here, basically, 

yeah, so this here is saying, this is actually equivalent to this equation here.

These two are equivalent. This is just another way of saying that, 

because basically it's saying, that's xt. So this is giving xt just in a different way.

But epsilon is actually this normal distribution with a 

mean of 0 and a variance of 1. And then you have all these scaling terms 

that change the mean to be the same as this equation that we have over here.

So this is our xt. And so what epsilon is, it's actually the noise that we're adding

to our image to make it into a noisy image. And what this neural network is doing 

is trying to predict that noise. So what this is actually doing is this is 

actually a noise predictor, and it is predicting the noise in the image.
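
Written out, the two expressions being discussed are the reparameterized noisy image and the simplified training objective from the DDPM paper, where epsilon is standard Gaussian noise and epsilon-theta is the network's prediction of it:

```latex
\[
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\]
\[
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\,\bigl\lVert \epsilon - \epsilon_\theta(x_t, t) \bigr\rVert^2\,\right]
\]
```

The loss is the squared difference between the noise that was actually added and the noise the model predicts, which is why it reduces to an MSE loss in code.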

And why is that important? Basically the general idea is, 

if we were to think about our distribution of data, let's

just think about it in a 2D space. Just here, each data point here represents an 

image, and they're in this blob area, which represents a distribution.

So this is in-distribution, and this is out of the distribution.

Out-of-distribution. And basically the idea is that, okay, if we 

take an image and we want to generate some random image, if we were to take a random 

data point, it would most likely be noisy images, right?

So if we take some random data point, it would be more, 

the way to generate random data point, it's going to be just noise.

But we want to keep adjusting this data point to make it look more like an image from your

distribution. That's kind of the whole idea of this iterative 

process that we're doing in our diffusion model.

So the way to get that information is actually to take images from your dataset and actually

add noise to it. So that's what we try to do in this process.

So we have an image here and we add noise to it. And then what we do is we try to train 

a neural network to predict the noise. And by predicting the noise and subtracting 

it out, we are going back to the distribution. So adding the noise takes you away from the 

distribution and then predicting the noise brings you back to the distribution.

So then if we know at any given point in this space how much noise to remove, that tells

you how to keep going towards the data distribution and get a point that 

lies within the distribution. So that's why we have noise prediction.

And that's the importance of doing this noise prediction is to be 

able to then do this iterative process, where we can start out at a random point, 

which would be, for example, pure noise and keep predicting and removing that noise 

and walking towards the data distribution. Okay.
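
A toy illustration of why predicting the noise is enough to get back towards the data. This is a deliberately simplified sketch and not the sampler from the paper: it uses a made-up four-value "image", a single fixed alpha bar value, and an oracle that hands back the true noise in place of a trained model. Real sampling takes many small steps because the model's prediction is imperfect.

```python
import math
import random

def add_noise(x0, alpha_bar_t, eps):
    """Noisy image: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar_t) * x + math.sqrt(1 - alpha_bar_t) * e
            for x, e in zip(x0, eps)]

def remove_noise(xt, alpha_bar_t, eps_pred):
    """Invert the noising formula using the predicted noise."""
    return [(x - math.sqrt(1 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
            for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [0.0, 0.5, 1.0, 0.25]               # a tiny made-up "image"
eps = [rng.gauss(0, 1) for _ in x0]      # the noise we add
xt = add_noise(x0, 0.5, eps)             # moved away from the data
x0_hat = remove_noise(xt, 0.5, eps)      # oracle prediction: back to the data
```

With a perfect noise prediction the original values are recovered exactly (up to floating-point error). A trained network only approximates this, which is why the process is repeated iteratively.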

Okay. So yeah, let's get started with the code.

And so here we of course have our imports and we're going to load our dataset.

We're going to work with our Fashion-MNIST dataset, which is what we've been working with

for a while already. And yeah, this is just basically the same 

code that we've seen from before in terms of loading the dataset. And then we have our model.

So to remove the noise from an image, our model is going to take in, it's 

going to take in the previous image, the noisy image, and predict the noise.

So the shapes of the input and the output are the same.

They're going to be in the shape of an image. So what we use is we use a U-Net neural 

network, which takes in kind of an input image. JEREMY: And we do see your 

pointer now, by the way. So feel free to point at things.

TANISHQ: Yeah. So yeah, it takes in an input image.

And in this case, that's the purpose of the U-Net, but they can also 

be used for any sort of image to image task, where we're going from an input 

image and then outputting some other image of some sort.

And we'll talk about... JEREMY: So this is a new architecture, which we 

haven't learned about yet, and we will be learning about in the next lesson.

But broadly speaking, those gray arrows going from left to right are a lot like ResNet,

very much like ResNet skip connections. But they're being used in a different way.

Everything else is stuff that we've seen before. So it's basically, we can pretend 

those don't exist for now. It's a neural network that the output is the 

same size or a similar size to the input. And therefore you can use it to learn how 

to go from one image to a different image. TANISHQ: Yeah.

So that's where the U-Net is. And yeah, like Jeremy said, 

we'll talk about it more. The sort of U-Net that are used for diffusion 

models also tend to have some additional tricks, which again, we'll talk 

about them later on as well. But yeah, for the time being, we will just 

import a U-Net from the Diffusers library, which is the Hugging Face 

library for diffusion models. So they have a U-Net implementation 

and we'll just be using that for now. And so, yeah, of course, 

JEREMY: strictly speaking, we're cheating at this point because we're

using something we haven't written from scratch, but we're only cheating temporarily because

we will be writing it from scratch. TANISHQ: Yeah.

And yeah, so, and then of course we're working with one channel images, our Fashion-MNIST

images are one channel images. So we just have to specify that.

And then of course the channels of the different blocks within the U-Net are also specified.

And then let's go into the training process. So basically the general idea of course 

is we want to train with this MSE loss. What we do is we select a random timestep 

and then we add noise to our image based on that timestep.

So of course, if we have a very high timestep, we're adding a lot of noise.

If we have a lower timestep, then we're adding very little noise.

So we're going to randomly choose a timestep. And then, yeah, we add the 

noise accordingly to the image. And then we pass the noisy image 

to a model as well as the timestep. And we are trying to predict the amount of 

noise that was in the image and we predict it with the MSE loss.

So we can see all the... JEREMY: I have some pictures of some of these 

variables I could share if that would be useful. So I have a version.

So I think Tanishq is sharing notebook number 15. Is that right?

And I've got here notebook number 17. And so I took Tanishq's notebook and just, 

as I was starting to understand it, I like to draw pictures for myself 

to understand what's going on. So I took the things which are in Tanishq's 

class and just put them into a cell. So I just copied and pasted them, although 

I replaced the Greek letters with English written out versions.

And then I just plotted them to see what they look like.

So in Tanishq's class, he has this thing called beta, which is just linspace.

So that's just literally a line. So beta, there's going to be a thousand of 

them and they're just going to be equally spaced from 0.0001 to 0.02.

And then there's something called sigma, which is the square root of that.

So that's what sigma is going to look like. And then he's also got alphabar, which 

is a cumulative product of 1 minus this. And there's what alphabar looks like.

So you can see here, as Tanishq was describing earlier, that when t is higher, this is t

on the x-axis, beta is higher, and when t is higher, alphabar is lower.

So yeah, so if you want to remind yourself, so each of these things, beta, sigma, alphabar,

they're each, they've each got a thousand things in them.

And this is the shape of those thousand things. So this is the amount of variance, 

I guess, added at each step. This is the square root of that.

So it's the standard deviation added at each step. And then if we do 1 minus that, 

it's just the exact opposite. And then this is what happens if you 

multiply them all together up to that point. And the reason you do that is because if you add 

noise to something, you add noise to something that you add noise to something that you add 

noise to something, then you have to multiply together all that amount of noise 

to say how much noise you would get. So yeah, those are my pictures, if that's helpful.
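
A minimal, dependency-free sketch of the three curves described here. It assumes the values stated above (a thousand steps, beta equally spaced from 0.0001 to 0.02). The notebook builds beta with torch.linspace; the plain-Python `linspace` helper and the intermediate name `alpha` below are stand-ins written only for this illustration.

```python
import math
from itertools import accumulate
from operator import mul

def linspace(start, stop, n):
    """Return n equally spaced values from start to stop, inclusive."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]

n_steps = 1000
beta = linspace(0.0001, 0.02, n_steps)     # variance added at each step
sigma = [math.sqrt(b) for b in beta]       # standard deviation added at each step
alpha = [1.0 - b for b in beta]            # "1 minus this"
alphabar = list(accumulate(alpha, mul))    # cumulative product up to each step

print(f"beta:     {beta[0]:.4f} -> {beta[-1]:.4f}")
print(f"sigma:    {sigma[0]:.4f} -> {sigma[-1]:.4f}")
print(f"alphabar: {alphabar[0]:.4f} -> {alphabar[-1]:.6f}")
```

Printing the first and last entries shows the same trend as the plots: beta rises with t while alphabar falls towards zero.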

TANISHQ: Yep. Yep.

Good to see the diagram and see the actual values and how they change over time.

So yeah. Let's see here.

Sorry. Yeah.

So like Jeremy was showing, we have our linspace for our beta.

In this case, we're using kind of more of the Greek letters.

So you can see the Greek letters that we see in the paper as well as...

Now we have it here in the code as well. And we have our linspace from our 

minimum value to our maximum value. And we have some number of steps.

So this is the number of timesteps. So here we use a thousand timesteps, but 

that can depend on the type of model that you're training.

And that's one of the parameters of your model or hyperparameters of your model.

JEREMY: And this is the callback you've got here. So this callback is going to be used to set 

up the data, I guess, so that you're going to be using this to add the noise so that the 

model's then got the data that we're trying to get it to learn to then denoise.

TANISHQ: Yeah. So the callback of course makes life a lot 

easier in terms of, yeah, setting up everything and still being able to use, I guess, the 

miniai learner with maybe some of these more complicated and maybe a little 

bit more unique training loops. So yeah, in this case, we're just able to 

use the callback in order to set up the batch that we are passing into our learner.

JEREMY: I just want to mention, when you first did this, you wrote out the Greek letters in English,

alpha and beta and so forth. And at least for my brain, I was finding it 

difficult to read because they were literally going off the edge of the page 

and I couldn't see it all at once. And so we did a search and replace to 

replace it with the actual Greek letters. I still don't know how I feel about it.

I'm finding it easier to read because I can see it all at once.

I don't have to scroll and I don't get overwhelmed.

But when I need to edit the code, I kind of just tend to copy 

and paste the Greek letters, which is why we use the actual word beta in 

the init parameter list so that somebody using this never has to type a Greek letter.

But I don't know, Johno or Tanishq, if you had any thoughts over the last week or two,

since we made that change about whether you guys like having the Greek letters in there

or not. JOHNO: I like it for this demo in particular.

I don't know that I do this in my code, but because we're looking back and forth between

the paper and the implementation here, I think it works in this case just fine.

TANISHQ: Yeah, I agree. I think it's good for when you're trying to 

study something or try to implement something, having the Greek letters is very useful to be 

able to, I guess, match the math more closely and it's just easy just to pick the equation 

and put it into code or, vice versa, look at the code and try 

to match it to the equation. So I think for educational purpose, I 

tend to like, I guess, the Greek letters. So yeah. Yeah, so we have our initialization where 

we're just defining all these variables. We'll get to the predict in just a moment, but 

first I just want to go over the before_batch where we're setting up our 

batch to pass into the model. So remember that the model is taking 

in our noisy image and the timestep. And of course the target is the actual amount 

of noise that we are adding to the image. So basically we generate that noise.

So that's what… JEREMY: So epsilon is that target.

So epsilon is the amount of noise, not the amount of, is the actual noise.

TANISHQ: Yes. Epsilon is the actual noise that we're adding.

And that's the target as well because our model is a noise predicting model.

It's predicting the noise in the image. And so our target should be the noise 

[unintelligible] that we're adding to the image during

training. So we have our epsilon and we're just 

generating it with this random function. The random normal distribution 

with a mean of 0, variance of 1. So that's what that's doing and adding 

the appropriate shape and device. Then the batch that we get originally 

will contain the clean images. These are the original images from our dataset.

So that's x0. And then what we want to 

do is we want to add noise. So we have our alphabar and we have 

a random timestep that we select. And then we just simply follow that equation, 

which I can, I'll just show in just a moment. JEREMY: That equation, you can 

make a tiny bit easier to read. I think if you were to double click on that 

first alphabar underscore t, cut it and then paste it, sorry, in the xt equals torch dot 

square root, take the thing inside the square root, double click it and paste 

it over the top of the word torch. That would be a little bit easier to read. And then you'll do the same for the next one.

TANISHQ: There we go.

Put those parentheses. Yep.

TANISHQ: Yeah, so basically, yeah, so yeah, I 

guess I'll just pull up the equation. So let's see, there's then, so there's a section 

in the paper that has the nice algorithm. See if I can find it.

No, no, here. It's, I think earlier.

Yes, training. Right, so this, we're just following these 

same sort of training steps here, right? Where we select a clean image 

that we take from our data set. This fancy kind of equation here is just saying, 

take an image from your data set, take a random timestep between this range.

Then this is our epsilon that we're getting, just saying, get some epsilon value.

And then we have our equation for Xt, right? This is the equation here.
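
Written out, the noising equation being pointed to reads as follows, using the symbols from the discussion (x_0 is the clean image, alphabar_t the cumulative product at timestep t, epsilon the sampled noise):

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```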

You can see that it is square root of alphabar t times x0, plus square root of one minus

alphabar t times epsilon. So that's the same equation 

that we have right here, right? And then what we need to do is we 

need to pass this into our model. So we have xt and t. So we 

set up our batch accordingly. So this is the two things 

that we pass into our model. And of course we also have our 

target, which is our epsilon. And so that's what this is showing here.

We passed in our Xt as well as our t here, right? And we pass that into a model.

The model is represented here as epsilon theta. And theta is often used to represent, like, this 

is a neural network with some parameters and the parameters are represented by theta.

So epsilon theta is just representing our noise predicting model.

So this is our neural network. So we have passed in our Xt and our t into 

a neural network, and we are comparing it to our target here, 

which is the actual epsilon. And so that's what we're doing here.

We have our batch where we have our xt and t and epsilon.
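
The batch setup just described can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's before_batch: the "image" is a flat list of pixel values, `toy_alphabar` is a made-up four-step schedule, and the function name `noisify` is borrowed from later in the lesson.

```python
import math
import random

def noisify(x0, alphabar, rng=random):
    """Return ((x_t, t), eps) for one image given as a flat list of pixels."""
    t = rng.randrange(len(alphabar))               # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # the actual noise (the target)
    a = alphabar[t]
    # x_t = sqrt(alphabar_t) * x_0 + sqrt(1 - alphabar_t) * eps
    xt = [math.sqrt(a) * p + math.sqrt(1.0 - a) * e for p, e in zip(x0, eps)]
    return (xt, t), eps

toy_alphabar = [0.99, 0.9, 0.5, 0.1]   # hypothetical stand-in for the real schedule
x0 = [0.0, 0.25, 0.5, 1.0]             # hypothetical clean "image"
(xt, t), eps = noisify(x0, toy_alphabar, random.Random(0))
```

The return value has the shape discussed above: the model inputs (x_t, t) as a tuple, followed by the target epsilon.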

And then here we have our prediction function. And because we actually have, I guess in this 

case, we have two things that are in a tuple that we need to pass into our model.

So we just kind of get those elements from our tuple with this.

We get the elements from the tuple, pass it into the model, and then Hugging Face has

its own API in terms of getting the output. So you'd need to call .sample in order 

to get the predictions from your model. So we just do that.

And then we do, we have learn.preds and that's what's going to be used later then when we're

trying to do our loss function calculation. JEREMY: So the, just so, I mean, it's just worth 

looking at that a little bit more since we haven't quite seen something like this before.

And it's something which I'm not aware of any other framework that would let you do

this, you know, literally replace how prediction works.

And miniai is kind of really fun for this. So because you're inherited from TrainCB, 

TrainCB has predict already defined, and you've defined a new version.

So it's not going to use the TrainCB version anymore.

It's going to use your version. And what you're doing is instead of passing 

learn.batch[0] to the model, you're, you've got a * in front of it.

So the key thing is that * is going to, you know, and is, we know that actually learn.batch[0]

has two things in it because that learn.batch that you showed at the end of the before_batch

method has two things in learn.batch[0]. So that star will unpack them and send 

each one of those as a separate argument. So our model needs to take two things, which 

the diffusers U-Net does take two things. So that's the main interesting point.

And then something I find a bit awkward honestly, about a lot of Hugging Face stuff, including

diffusers is that generally their models don't just return the result, but they put it inside

some name. And so that's what happens here.

They put it in something inside something called sample.

So that's why Tanishq added .sample at the end of the predict because of this somewhat

awkward thing, which Hugging Face like to do for some reason.

But yeah, now that you know, I mean, this is something that people often get stuck on.

I see on Kaggle and stuff like that. It's like, how on earth do I use these models?

Because they take things in weird forms and they give back things with weird forms.

Well, this is how. You know, if you inherit from TrainCB, you 

can change predict to do whatever you want, which I think is quite sweet.
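
To make the star-unpacking and the .sample access concrete, here is a small sketch that runs without miniai or diffusers. `ToyModel` and the `learn` namespace are hypothetical stand-ins invented for this example; only the batch layout ((x_t, t), epsilon) and the `model(*batch[0]).sample` pattern come from the discussion.

```python
from types import SimpleNamespace

class ToyModel:
    """Hypothetical stand-in for a model that takes (x, t) and wraps its output."""
    def __call__(self, x, t):
        # Pretend prediction: scale every input value by the timestep.
        return SimpleNamespace(sample=[v * t for v in x])

# Batch laid out as in the discussion: ((x_t, t), epsilon).
learn = SimpleNamespace(model=ToyModel(), batch=(([1.0, 2.0], 3), [0.5, -0.5]))

def predict(learn):
    # *learn.batch[0] unpacks (x_t, t) into two positional arguments;
    # .sample pulls the prediction out of the wrapper object.
    learn.preds = learn.model(*learn.batch[0]).sample

predict(learn)
print(learn.preds)
```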

TANISHQ: Yep. So yeah, that's the training loop.

And then of course you have your regular training loop that's implemented in miniai where you

are going to have. Yeah.

So you have your loss function calculation, I mean, and the predictions, learn.preds.

And of course the target is our learn.batch[1], which is our epsilon.

So we have those and we pass it into the loss function.

It calculates the loss function and does the back propagation.
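
As a reminder of what the loss step computes, a tiny plain-Python MSE between a prediction and the target noise. The numbers are made up for illustration; in the notebook the same comparison is made between learn.preds and learn.batch[1].

```python
def mse(preds, target):
    """Mean of the squared element-wise differences."""
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(preds)

preds = [0.1, -0.3, 0.8]   # hypothetical model output
eps = [0.0, -0.5, 1.0]     # the noise that was actually added (the target)
loss = mse(preds, eps)
print(loss)
```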

So I'll just go over that. We'll get back to the sampling in just a moment.

But just to show the training loop. JEREMY: Most of this is copied from 

our, I think it's 14_augment notebook, the way you've got the

tmax and the sched. The only thing I think you've added 

here is the DDPM callback, right? TANISHQ: Yes.

The DDPM callback. JEREMY: And you change the loss function.

TANISHQ: Yes. So basically we have to initialize our DDPM 

callback with the appropriate arguments. So like the number of timesteps and 

the minimum beta and maximum beta. And then yeah, obviously, and then of course 

we're using an MSE loss as we talked about. It just becomes a regular training 

loop and everything else is from before. Yeah.

So we have your scheduler, your progress bar, all of that we've seen before.

JEREMY: I think that's really cool that we're using basically the same code to train a diffusion model as we've used to train a 

classifier just with one extra callback. TANISHQ: Yeah.

Yeah. Yeah.

That's why I think callbacks are very powerful for allowing us to do such things.

It's like pretty, you can take all this code and now 

we have a diffusion training loop and we can just call learn.fit and yeah, 

you can see we've got a nice training loop, nice loss curve.

We can save our model with torch's saving functionality 

to be able to save our model and we could

load it in. But now that we have our trained model, then 

the question is, what can we do to use it to sample the dataset?

So the basic idea of course was that we have, like basically we're here, right?

We have, let's see here. Okay.

So we have, the basic idea is that we start out with a random data point and of course that's

not going to be within the distribution at first, but now we've learned how to move from

one point towards the data distribution. That's what our noise prediction, 

predicting function does. It basically tells you how, you know, 

in what direction and how much to, so the basic idea

is that, yeah, I guess I'll start from maybe a new drawing here.

Again, we have, distribution is, and we have a random point and we use our noise predicting

model that we have trained to tell us which direction to move.

So it tells us some direction. Or I guess what's the exact, other area.

Okay. So like here, okay.

So it tells us some direction to move. At first that direction is not going to be 

like, you cannot follow that direction all the way to get the correct data point.

Because basically what we were doing is we're trying to reverse the path that we were following

when we were adding noise. So like, cause we had originally data point 

and we kept adding noise to the data point and maybe, you know, it 

followed some path like this. And we want to reverse that path to get to.

So our noise predicting function will give us an original direction, which, you know,

would be, you know, some kind of, it's going to be kind of tangential to the actual path

at that location. So what we would do is, you know, we would maybe 

follow that data point all the way towards, you know, we're just going to 

keep following that data point. You know, we're going to try to predict the 

fully denoised image by following this noise prediction.

But our fully denoised image is also not going to be a real image.

So what we, so let me, I'll show an example of that over here in the paper, where they

show this a little bit more carefully. So x0, it's there.

So basically you can see the different data...

You can see the different data points here.

It's not going to look anything like our real image.

So you can see all these points, you know, it doesn't look anything… what we would do

is we actually add a little bit of noise back to it.

And we start, we have a new point where then we could maybe estimate a better, get a better

estimate of which direction to move, follow that all the way again, we follow a new point.

And then I can add back a little bit of noise. You get a new estimate, you make a new estimate 

of, you know, this noise prediction and removing the noise, you know, follow that 

all the way again, completely, and add a little bit of noise again to the

image and converge onto an image. So that's kind of what we're showing here.

JEREMY: That's a lot like SGD, with SGD we don't take the gradient and jump all the way.

We use a learning rate to go some of the way because each of those estimates of where we

want to go is, you know, not that great, but we just do it slowly.

TANISHQ: Exactly. And at the end of the day, that's what 

we're doing with this noise prediction. We are predicting the sort of gradient of 

this p(x), but of course we need to keep making estimates of that 

gradient as we're progressing. So we have to keep evaluating our noise prediction 

function to get updated and better estimates of our gradient in order to 

finally converge onto our image. So and then you can see that here, you know, we 

have this, maybe this fully predicted denoised image which at the beginning doesn't look 

anything like a real image, but then as we continue throughout the sampling 

process, we finally converge on something that looks

like an actual image. Again, these are CIFAR-10 images and it's 

still a little bit maybe unclear about how realistic these images, these very small images 

look, but that's kind of the general principle I would say.

And so that's what I can show in the code. This idea of we're going to start out 

basically with a random image, right? And this random image is going to be like 

a pure noise image and it's not going to be part of the data distribution.

You know, it’s not anything like a real image, it's just a random image.

And so this is going to be our x, I guess, x uppercase t [x_t], right?

That's what we start out with. And we want to go from x 

uppercase t all the way to x0. So what we do is we go through each of the 

timesteps and we create, we have to put it in this sort of batch format because 

that's what our neural network expects. So we just have to format it appropriately.

And we'll get to z in just a moment. I'll explain that in just a moment, but of 

course we just take it have similar… alphabar, betabar, which is getting 

those variables that we need. JEREMY: And we faked beta bar because 

we couldn't figure out how to type it. So we used bbar instead.

TANISHQ: Yeah. So yeah, we weren't 

really able to get betabar to work, I guess. But anyway, at each step, what we're trying 

to do is to try to predict what direction we need to go.

And that direction is given by our noise predicting model, right?

So what we do is we pass in x_t and our current timestep into our model.

And we get this noise prediction and that's the direction that we need to move in.

So basically we take x_t. We first attempt to completely remove the noise, right?

That's what this is doing. That's what x_0_hat is.

That's completely removing the noise. And of course, as we said, that estimate 

at the beginning won't be very accurate. And so now what we do is we have some coefficients 

here where we have a coefficient for how much we keep of this estimate of 

our denoised image and how much of the originally noisy

image we keep. And on top of that, we're going 

to add in some additional noise. So that's what we do here.

We have x_0_hat. And so, and we multiply by its coefficient 

and we have x_t we multiply it by some coefficient and we also add some additional noise.

That's what the z is. It's just-

JEREMY: That's basically a weighted average of the two plus the noise…

TANISHQ: Exactly. And then the whole idea is that as we get 

closer and closer to a timestep equals to 0 our estimate of x0 will be more and more accurate.

So our x0_coeff will get closer to 1 as we're going through the process and then our xt_coeff 

will get closer and closer to 0. So basically we're going to be weighting more 

and more of the x_0_hat estimate and less and less of the x_t as we're getting 

closer and closer to our final timestep. And so at the end of the day, we will 

have our estimated generated image. So that's kind of an overview 

of the sampling process. So yeah.
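
The sampling loop just described can be sketched end to end in plain Python on a single number instead of an image. Everything here is an illustration under stated assumptions: the "dataset" is one hypothetical point (`TARGET`), and `oracle_noise` stands in for the trained model by returning the exact noise for that point. The coefficients are the standard DDPM posterior-mean weights, with sigma taken as the square root of beta as above.

```python
import math
import random
from itertools import accumulate
from operator import mul

n_steps = 1000
beta = [0.0001 + i * (0.02 - 0.0001) / (n_steps - 1) for i in range(n_steps)]
alpha = [1.0 - b for b in beta]
alphabar = list(accumulate(alpha, mul))
sigma = [math.sqrt(b) for b in beta]

TARGET = 0.7  # hypothetical one-point "dataset": every clean sample equals 0.7

def oracle_noise(x_t, t):
    """Stand-in for the trained model: the exact noise, given x_0 == TARGET."""
    return (x_t - math.sqrt(alphabar[t]) * TARGET) / math.sqrt(1.0 - alphabar[t])

def sample(noise_model, rng):
    x_t = rng.gauss(0.0, 1.0)                      # start from pure noise
    preds = []
    for t in reversed(range(n_steps)):
        z = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no fresh noise on the last step
        abar_prev = alphabar[t - 1] if t > 0 else 1.0
        bbar_t = 1.0 - alphabar[t]
        bbar_prev = 1.0 - abar_prev
        # Attempt to remove the noise completely.
        x0_hat = (x_t - math.sqrt(bbar_t) * noise_model(x_t, t)) / math.sqrt(alphabar[t])
        x0_coeff = math.sqrt(abar_prev) * (1.0 - alpha[t]) / bbar_t
        xt_coeff = math.sqrt(alpha[t]) * bbar_prev / bbar_t
        # Weighted average of the estimate and the current point, plus noise.
        x_t = x0_coeff * x0_hat + xt_coeff * x_t + sigma[t] * z
        preds.append(x_t)
    return preds

preds = sample(oracle_noise, random.Random(0))
print(preds[-1])
```

On the final step the weight on the denoised estimate is 1 and the weight on x_t is 0, so with a perfect noise predictor the last entry lands on the data point. A real model only approximates the noise, which is why the intermediate re-noising steps matter.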

So yeah, basically the way I implemented it here was I had the sample function that's

part of our callback and it will take in the model and the kind of shape that you want

for your images that you're producing. So like if you want to specify how many images 

you produce, that's going to be part of your batch size or whatever.

And you'll just see that in a moment. But yeah, it's just part of the callback.

So then we basically have our DDPM callback and then we could just call the sample method

of our DDPM callback and we pass in our model. And then here you can see we're going to produce, 

for example, 16 images and it just has to be a 1 channel image of shape 

32 by 32 and we get our samples. And one thing I forgot to note was that I 

am collecting each of the timestep, the x_t. So the predictions here, you can see 

that there are a thousand of them. We want the last one because 

that is our final generation. So we want the last one and that's what we should-

JEREMY: They're not bad, actually. TANISHQ: Yeah.

So this is- JEREMY: We've come a long way since DDPM.

So this is, like, slower and less great than it could be.

But considering that, except for U-Net, we've done this from scratch, you know, actually

from matrix multiplication, I think those are pretty decent.

TANISHQ: Yeah. And we only trained for about five epochs.

It took like, you know, maybe like four minutes to train this model, something like that.

It's pretty quick. And this is what we get with very little training.

And it's pretty decent. You can see, of course, some clear shirts 

and shoes and pants and whatever else. JEREMY: Yeah.

And you can see fabric and it's got texture and things have buckles and-

TANISHQ: Yeah. JEREMY: You know, something to compare, like, we 

did generative modeling in the first time we did Part 2 back in the days when Wasserstein 

GAN was just new, which was actually created by the same guy that created 

PyTorch or one of the two guys, Soumith. And we trained for hours and hours and hours 

and got things that I'm not sure were any better than this.

So things have come a long way. TANISHQ: Yeah.

Yeah. And of course, then, yeah, so we can see then 

like how this sampling progresses over time, over the multiple timesteps.

So that's what I'm showing here because I collected, during the sampling process, we

are collecting at each timestep what that estimate looks like.

And you can kind of see here. And so this is an estimate of the 

noisy image over the timesteps. Oops.

And I guess I had to pause. Yeah.

You can kind of see. But you'll notice that actually, so we actually, 

what we did is like, okay, so we selected an image, which is like the ninth image.

So that's, that's this image here. So we're looking at this image 

particularly here and we're going over. Yeah.

We have a function here that's showing the i-th timestep during the sampling process of

that image. And we're just getting the images.

And what we are doing is we're only showing basically from timestep 800 to a 1,000.

And here we're just, we're just having it like where it's like, okay, we're looking

at like maybe every 5 steps and we're going from 800 to 990.

And this time it would make it visually easier to see the transition.

But what you'll notice is I didn't start all the way from 0.

I started from 800. And the reason we do that is because actually 

between 0 and 800 there's very little change in terms of like, it's just mostly a noisy image.

And it turns out, yeah, as I make a note of here, it's actually

a limitation of the noise schedule that is used in the original DDPM paper.

And especially when applied to some of these smaller images, when we're working with images

of like size 32 by 32 or whatever. And so there are some other papers like the 

improved DDPM paper that propose other sorts of noise schedules.

And what I mean by noise schedule is basically how beta is defined basically.

So we had this definition of torch.linspace for our beta, but people have different ways

of defining beta that lead to different properties.

So things like that, people have come up with different improvements and those sorts of

improvements work well when we're working with these smaller images.

And basically the point is like, if we are working from 0 to 800 and it's just mostly

just noise that entire time, we're not actually making full use of all these timesteps.

So it would be nice if we could actually make full use of those timesteps and actually

have it do something during that time period. So all these, there are some papers that examine 

this a little bit more carefully and it would be kind of interesting for maybe some of you 

folks to also look at these papers and see if you can try to implement those sorts of 

models with this notebook as a starting point. And it should be a fairly simple change in 

terms of like noise schedule or something like that.
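
As one concrete starting point, here is a plain-Python sketch of the cosine schedule from the Improved DDPM paper. The transcript only says that paper proposes other schedules; picking the cosine one, the offset s=0.008, and the 0.999 clip follow that paper's description and are not taken from the notebook. The function names are made up for this example.

```python
import math

def cosine_alphabar(n_steps, s=0.008):
    """alphabar_t for t = 1..n_steps under the cosine schedule."""
    def f(t):
        return math.cos((t / n_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, n_steps + 1)]

def betas_from_alphabar(alphabar, max_beta=0.999):
    """Recover the per-step beta from consecutive alphabar ratios, clipped."""
    betas, prev = [], 1.0
    for a in alphabar:
        betas.append(min(1.0 - a / prev, max_beta))
        prev = a
    return betas

alphabar = cosine_alphabar(1000)
beta = betas_from_alphabar(alphabar)
print(f"alphabar halfway through: {alphabar[499]:.3f}")
```

Halfway through the steps this schedule still keeps roughly half of the signal, whereas the linear schedule has already dropped alphabar below 0.1 by then, which is the "mostly noise for most of the timesteps" problem described above.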

JEREMY: So I actually think this is the start of our next journey, which is our previous journey

was going from being totally rubbish at Fashion-MNIST classification 

to being really good at it. I'd say now we're like a little bit 

rubbish at doing Fashion-MNIST generation. And yeah, I think we should all now work from 

here over the next few lessons and so forth and people trying things at home and all of 

us trying to make better and better generative models, initially on Fashion-MNIST and hopefully 

we'll get to the point where we're so good at that, that we're like, oh, this is too easy.

And then we'll pick something harder. And eventually that'll take us to Stable Diffusion and beyond.

I imagine. That's cool.

I got some stuff to show you guys. If you're interested, I tried to better understand 

what was going on in Tanishq's notebook and tried doing it in a thousand different ways 

and also see if I could just start to make it a bit faster.

So that's what's in notebook 17, which I will share.

So we've already seen the start of notebook 17. Well, the one thing I did just do is just 

drew a picture for myself, partly just to remind myself what they, 

what the real ones look like. And they definitely have more detail than 

the samples that Tanishq was showing. But they're not, you know, they're just 28 by 28.

I mean, they're not super amazing images and they're just black and white.

So even if we're fantastic at this, they're never going to look great because we're using

a small, simple data set. As you always should, when you're doing 

any kind of R&D or experiments, you should always use a small and simple data set up 

until you're so good at it that it's not challenging

anymore. And even then when you're exploring new ideas, 

you should explore them on small, simple data sets first.

Yeah. So after I drew the various things, what I 

like to do is one thing I found challenging about working with your class Tanishq is I 

find when stuff is inside a class, it's harder for me to explore.

So I copied and pasted it, the before_batch contents and called it noisify.

And so one of the things that's fun about doing that is it forces you to figure out what are

the actual parameters to it. And so now that I, rather than putting in the 

class, now that I've got all of my, you know, various things to do with, so these are the 

three parameters to the DDPM callback's init.

And then these things we can calculate from that. So with those then actually 

all we need is yeah, what's the image that we're going to

noisify and then what's the, what's the alphabar, which I mean, we can get from here, but

it'd sort of be more general if you can pass in your alphabar.

So yeah, this is just copying and pasting from the class, but the nice thing is then

I could experiment with it. So I can call noisify on my first 25 images 

and with, with a random t, each one's got a different random t, and so I can print out 

the t and then I could actually use those as titles.

And so this lets me, I thought this is quite nice. I might actually rerun this cause actually 

none of these look like anything because as it turns out in this particular 

case, all of the t's are over 200. And as Tanishq mentioned, once you're over 

200, it's almost impossible to see anything. So let me just rerun this 

and see if we get a better, there we go.

There's a better one. So with a t of 7, right?

So remember t naught, t equals naught is the pure image.

So t equals 7, it's just a slightly speckledy image.

And by 67, it's a pretty bad image. And by 94, it's very hard 

to see what it is at all. And by 293, maybe I can see a pair of pants.

I'm not sure I can see anything. So yeah.

By the way, there's a handy little, so we've, I think 

we've looked at map before in the course; there's an extended 

version of map in fastcore. And one of the nice things is you can pass 

it a string and it basically just calls this format string if you pass it a 

string rather than a function. And so this is going to stringify 

everything using its representations. This is how I got the titles 

out of it just by the way. So yeah, I found this useful to be 

able to draw a picture of everything. And then I wanted to, yeah, look at what, 

what, what else, what else can I do? So then, next, 

you won't be surprised to see, I took the sample method and 

turned that into a function. And I actually decided to 

pass everything that it needs. Even, I mean, you could actually 

calculate pretty much all of these. But I thought since I've calculated 

them before, just pass them in. So this is all copied and 

pasted from Tanishq's version. And so that means the callback now is tiny, right? Because before_batch is just noisify and the 

sample method just calls the sample function. Now, what I did do is I decided just to, yeah, 

I wanted to try like as many different ways of doing this as possible.

Partly it's an exercise to help everybody like see all the different ways we can work

with our framework, you know? So I decided not to inherit from TrainCB, 

but instead I inherited from Callback. So that means I can't use Tanishq's 

nifty trick of replacing predict. So instead I now need some way to pass in 

the two parts of the first element of the tuple as separate things to the 

model and return the sample. So how else could we do that?

Well what we could do is we could actually inherit from UNet2DModel, which is what

Tanishq used directly, and we could replace the model.

And so we could replace specifically the forward function.

That's the thing that gets called. And we could just call the original forward 

function, but rather than passing an x we’re passing a *x, and rather than 

returning that, we'll return that .sample. Okay.
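The pattern of subclassing a model so that it accepts one tuple and returns the bare prediction can be sketched without any deep learning library. `BaseModel` and `Output` below are hypothetical stand-ins for the real U-Net and its output object, and the arithmetic inside `forward` is a placeholder, so treat this as an illustration of the interface change only.

```python
from dataclasses import dataclass

@dataclass
class Output:
    """Stand-in for a library output object that wraps the prediction in `.sample`."""
    sample: list

class BaseModel:
    """Hypothetical stand-in for a model whose forward takes (noised input, timestep)."""
    def forward(self, x, t):
        return Output(sample=[xi + t for xi in x])   # placeholder computation

    def __call__(self, *args):
        return self.forward(*args)

class UnpackingModel(BaseModel):
    """Same model, but it accepts one tuple and returns the bare prediction."""
    def forward(self, x):
        return super().forward(*x).sample

model = UnpackingModel()
pred = model(([1, 2, 3], 5))   # one argument: the (input, timestep) tuple
print(pred)
```

With the model adapted this way, the training loop can keep calling the model with a single input and using the result directly.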

So if we do that, then we don't need the TrainCB anymore and we don't need the predict.

And so if you're not working with something as beautifully flexible as miniai, you can

always do this, you know, to make, to replace your model so that it has the interface that

you need it to have. So now again, we do the same as 

Tanishq did and create the callback. And now when we create the model, we'll 

reuse our UNet class, which we just created. I wanted to see if I can make things faster.

I tried dividing all of Tanishq's channels by two and I found it worked just as well.

One thing I noticed is that it uses group norm in the U-Net, which we have briefly learned

about before and in group norm, it splits the channels up into a certain number of groups.

And I needed to make sure that those groups had more than one thing in.

So you can actually pass in how many groups do you want to use in the normalization.

So that's what this is for. You gotta be a little bit careful of these 

things, I didn't think of it at first and I ended up, I think the num groups might've 

been 32 and I got an error saying you can't split 16 things into 32 groups.

But it also made me realize actually, even in Tanishq's, you probably had 32 channels in

the first block with 32 groups. And so maybe the group norm 

wouldn't have been working as well. So they're little subtle things to look out for.
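The two pitfalls just described (a group count that does not divide the channels, and groups that end up holding a single channel) can be checked with a small helper. `channels_per_group` is a hypothetical function written for this sketch, and the group count of 8 is an illustrative choice.

```python
def channels_per_group(num_channels, num_groups):
    """Return how many channels land in each group, or raise if they cannot be split evenly."""
    if num_channels % num_groups != 0:
        raise ValueError(f"cannot split {num_channels} channels into {num_groups} groups")
    return num_channels // num_groups

for channels, groups in [(16, 32), (32, 32), (16, 8)]:
    try:
        per = channels_per_group(channels, groups)
        note = "each group normalizes a single channel" if per == 1 else "ok"
        print(f"{channels} channels / {groups} groups -> {per} per group ({note})")
    except ValueError as e:
        print(e)
```

Running a check like this over every block's channel count before training is a cheap way to catch both cases.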

So now that we're not using anything inherited from TrainCB, that means we either need to

use TrainCB itself or just use our TrainLearner, and everything else is the same as what

Tanishq had. So then I wanted to look at the results of 

noisify here and we've seen this trick before, which is we call fit, but don't call the training 

part of the fit and use the SingleBatchCB callback that we created way back 

when we first created Learner. And now learn.batch will contain the tuple 

of tuples, which we can then use that trick to show.

So I mean, obviously we'd expect it to look 

the same as before, but it's nice. I always like to draw pictures 

of everything all along the way. Cause, I mean,

the first six to seven times I do anything, I do it wrong.

So given that I know that I might as well draw a picture to try and see how it's wrong

until it's fixed. It also tells me when it's not wrong.

TANISHQ: Isn't there a show_batch function now that does something similar?

JEREMY: Um, yes, you wrote that show_image_batch, didn't you?

I can't quite remember. Yeah.

We should, uh, remind ourselves how that worked.

That's a good point. Thanks for the reminder.

Okay. So then, um, I'll just go ahead and 

do the same thing that Tanishq did. And um, uh, but then the next thing I looked 

at was I looked at the, you know, how am I going to make this train faster?

I want a bigger, I want a higher learning rate. Um, and I realized, oddly enough, the diffusers 

code does not initialize anything at all. They use the defaults, um, which just goes 

to show like even, you know, the experts at Hugging Face don't necessarily really 

think like, oh, maybe the PyTorch defaults aren't, you know, perfect for my model.

Of course they're not because they depend on what activation function do you have and

what res blocks do you have and so forth. Um, so I wasn't exactly sure how to initialize it.

Um, I, um, partly by chatting to, um, Kat Crowson, who's the author of K-diffusion,

um, and partly by looking at papers and partly by thinking about my own experience, I ended

up doing a few things. One is I did do the thing that we talked about 

a while ago, which is to take every second convolutional layer and zero it out.

You could do the same thing with using batch norm, which is what we tried.

And since we've got quite a deep network, you know, that seemed like it might, you know,

it helps basically by having the non-identity paths in the ResNets do nothing at

first so they can't cause problems. Um, we haven't talked about, um, orthogonalized 

weights before, and we probably won't because you would need to take our, um, computational 

linear algebra course to learn about that, which is a great course, Rachel 

Thomas did a fantastic job of it. I highly recommend it, but I don't want to 

make it a prerequisite, but, um, Kat mentioned, she thought that using orthogonal weights 

for the downsamplers was a good idea. Um, and then, well, the up_blocks, they also 

set the second convs to zero and something Kat mentioned, she found useful, which is 

also from, um, I think it's from the Dhariwal Google paper is to also zero out the 

weights of basically the very last layer. Um, and so it's going to start by predicting 

zero as the noise, which is, you know, something that can't hurt.

Um, so that was, that's how I initialized the weights.
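Why zeroing the second layer of a residual branch is harmless at the start can be seen in a tiny pure-Python sketch. Plain matrix multiplies stand in for the convolutions here, and all names and weights are made up for illustration; this is not the `init_ddpm` code itself.

```python
def linear(x, w):
    """Multiply vector x by weight matrix w (a list of rows)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def relu(x):
    return [max(0.0, v) for v in x]

def res_block(x, w1, w2):
    """Identity path plus a two-layer branch; w2 plays the role of the 'second conv'."""
    branch = linear(relu(linear(x, w1)), w2)
    return [xi + bi for xi, bi in zip(x, branch)]

x = [1.0, -2.0, 3.0]
w1 = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.7], [-0.4, 0.1, 0.9]]  # arbitrary first-layer weights
w2_zero = [[0.0] * 3 for _ in range(3)]                       # zeroed second layer

out = res_block(x, w1, w2_zero)
print(out)   # the block passes its input straight through
```

Whatever the first layer's weights are, the branch contributes nothing until training moves the second layer away from zero, so a deep stack of such blocks starts out as the identity.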

Um, so call init_ddpm on my model. Uh, something that I found made a huge difference is I replaced

the normal Adam optimizer with one that has an epsilon of 1e-5. The default,

I think, is 1e-8. And so to remind you, when we divide by the square root of the 

exponentially weighted moving average of the squared gradients, 

if that's a very, very small number, um, then it makes 

the effective learning rate huge. And so we add this to it to make it not too huge.

And it's nearly always a good idea to make this bigger than the default.

I don't know why the default is so small. And I found, until I did this, anytime I tried 

to use a reasonably large learning rate somewhere around the middle of the 1-Cycle 

training, it would explode. Um, uh, so that makes a big difference. Um, so this way, yeah.

Uh, I could train, I could get 0.016 after 5 epochs.
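The effect of epsilon on the step size can be sketched with a few lines of arithmetic. This ignores Adam's bias correction and momentum, and the learning rate and gradient magnitude are illustrative assumptions; `adam_multiplier` is a name made up for this sketch.

```python
import math

def adam_multiplier(avg_sq_grad, lr, eps):
    """Factor that scales the averaged gradient in an Adam-style step: lr / (sqrt(v) + eps)."""
    return lr / (math.sqrt(avg_sq_grad) + eps)

lr = 1e-3                      # illustrative learning rate, not from the lesson
tiny_v = (1e-7) ** 2           # squared-gradient average when gradients are around 1e-7

for eps in (1e-8, 1e-5):
    print(f"eps={eps:g}  multiplier={adam_multiplier(tiny_v, lr, eps):.1f}")
```

When the squared-gradient average is tiny, the larger epsilon dominates the denominator and caps the multiplier, which is the damping effect described above.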

Um, and then sampling, so it looks all pretty similar.

We've got some pretty nice textures, I think. So then I was thinking, how do I get faster?

So one way we can make it faster is we can take advantage of, um, something called

mixed precision. Um, so currently we're using 

32 bit floating point values. Um, that's the default and 

also known as single precision. And um, GPUs are pretty fast at doing 32 bit 

floating point values, but they're much, much, much, much faster doing 16 

bit floating point values. Now, 16 bit floating point values aren't able to represent a very, you know, wide 

range of numbers or much precision in the difference between numbers.
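The limited range and precision of 16 bit floats can be seen directly with Python's standard library, which can pack IEEE half-precision values. This is a minimal sketch; `to_half` is a helper written for this example.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest 16-bit float and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(to_half(1.0001))    # spacing between values near 1.0 is about 0.001
print(to_half(1e-8))      # far below the smallest positive 16-bit value
print(to_half(65504.0))   # largest finite 16-bit value

try:
    to_half(70000.0)
    overflowed = False
except (OverflowError, struct.error):
    overflowed = True       # too large to store in 16 bits
print(overflowed)
```

Small differences near 1.0 disappear, very small values collapse to zero, and anything much above 65504 does not fit at all.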

And so they're quite difficult to use, but if you can, you'll get a huge benefit because,

um, modern GPUs, modern NVIDIA GPUs specifically, have special units that do matrix multiplies

of 16 bit values extremely quickly. Um, you can't just cast everything to 16 bit 

because then you, there's not enough precision to calculate gradients and stuff properly. So we have to use something 

called Mixed Precision. Um, depending on how enthusiastic I'm feeling, 

I guess we ought to do this from scratch as well.

Um, we'll, we'll see. Um, we do have an implementation from scratch 

cause we actually implemented this before NVIDIA implemented it, um, in 

an earlier version of fastai. Um, um, anyway, we'll see.

So basically the idea is that we use 32 bit for things where we need 32 bit and we use

16 bit for things where 16 bit is enough.

Um, so that's what we're going to do is we're going to use this mixed precision.
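One place where 32 bit is commonly kept is the stored copy of the weights, because small updates vanish when added to a 16 bit value. The sketch below emulates both precisions with the standard library; it illustrates that one idea under assumed numbers and is not a description of what NVIDIA's code does or of what Lesson 20 implements.

```python
import struct

def to_half(x):
    """Round to the nearest 16-bit float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_single(x):
    """Round to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4            # illustrative per-step weight update
w_half_only = 1.0        # weight stored and updated in 16 bit
w_master = 1.0           # weight stored and updated in 32 bit

for _ in range(1000):
    w_half_only = to_half(w_half_only + update)
    w_master = to_single(w_master + update)

w_for_compute = to_half(w_master)   # 16-bit copy used for the fast matrix multiplies
print(w_half_only, w_master, w_for_compute)
```

The 16-bit-only weight never moves because each update is smaller than the gap between neighbouring 16 bit values, while the 32 bit copy accumulates the updates and can still be cast down for the fast computation.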

Um, but for now we're going to use, um, NVIDIA's, you know, semi-automatic or fairly automatic

code to do that for us. Actually we had a slight change of plan at 

this point when we realized, uh, this lesson was going to be over three hours in length 

and we should actually split it into two. So we're going to wrap up this lesson here and 

we're going to, um, come back and implement this mixed precision thing in Lesson 20.

So we'll see you then.
