Lesson 19: Deep Learning Foundations to Stable Diffusion ...

JEREMY: Okay.

Hi everybody. And this is Lesson 19 with extremely

special guests, Tanishq and Johno. Hi guys.

How are you? TANISHQ: Hello.

JOHNO: Hey Jeremy. Good to be here.

JEREMY: And it's New Year's Eve, 2022, finishing off 2022 with a bang, or at least a really cool

lesson. And most of this lesson is

going to be Tanishq and Johno, but I'm going to start with a quick

update from the last lesson. What I wanted to show you is that Christopher

Thomas on the forum came up

with a better winning result for our challenge, the Fashion-MNIST challenge,

which we are tracking here. And be sure to check out this forum

thread for the latest results. And he found that he was able to

get better results with Dropout. Then Piotr on the forum

noticed I had a bug in my code. And the bug in my code for ResNets, actually

I won't show you, I'll just tell you, is that in the ResBlock, I was not passing

along the BatchNorm parameter. And as a result, all the results

I had were without BatchNorm. So then when I fixed BatchNorm and added

Dropout at Christopher's suggestion, I got better results still.

And then Christopher came up with a better Dropout and got better results still for 50

epochs. So let me show you the 93.2

for 5 epochs improvement. I won't show the change to BatchNorm because

that's actually, that'll just be in the repo now.

So the BatchNorm is already fixed. So I'm going to tell you about what

Dropout is and then show that to you. So Dropout is a simple but powerful idea where

what we do with some particular probability, so here that's a probability of 0.1,

we randomly delete some activations. And when I say delete, what I actually

mean is we change them to zero. So one easy way to do this is to

create a binomial distribution object where the probabilities

are 1-p and then sample from that. And that will give you zeros with a probability of 0.1.

So in this case, oh, this is perfect. I have exactly one 0.

Of course, randomly, that's not always going to be the case.

But since I asked for 10 samples and 0.1 of the time it should be zero, I so happened

to get, yeah, exactly one of them. And so if we took a tensor like this

and multiplied it by our activations, that will set about

a 10th of them to zero because multiplying by zero gives you zero.
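As a minimal sketch of the masking idea just described (the seed and the random stand-in activations are assumptions for illustration, not values from the lesson), sampling from a binomial with probability 1-p gives a tensor of ones and zeros that can be multiplied into the activations:

```python
import torch
from torch import distributions

torch.manual_seed(0)  # fixed only so the sketch is repeatable

p = 0.1                                               # dropout probability
dist = distributions.binomial.Binomial(probs=1 - p)   # one trial per sample by default
mask = dist.sample((10,))                             # each entry is 1 with prob 0.9, 0 with prob 0.1

acts = torch.randn(10)                                # hypothetical activations
dropped = acts * mask                                 # wherever mask is 0, the activation becomes 0
```

How many zeros appear in `mask` varies from run to run; on average it is about one in ten.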

So here's a Dropout class. So you pass it and you say what probability

of Dropout there is, store it away. Now we're only going to do

this during training time. So at evaluation time, we're not

going to randomly delete activations. But during training time, we will

create our binomial distribution object. We will pass in the 1-p probability. And then you say, how many

binomial trials do you want to run? So how many coin tosses or dice

rolls or whatever each time? And so it's just one.

And this is a cool little trick. If you put that one onto your accelerator, you

know, GPU or MPS or whatever, it's actually going to create a binomial

distribution that runs on the GPU. That's a really cool trick that

not many people know about. And so then if I sample and I make a sample

exactly the same size as my input, then that's going to give me a bunch of ones and zeros

in a tensor the same size as my activations. And then another cool trick: this is going

to result in activations that are on average about one tenth smaller.

So if I multiply by 1/(1-0.1), so in this case by 1/0.9, then that's going

to scale them back up to undo that difference. JOHNO: Jeremy.

JEREMY: Yeah. JOHNO: In the line above where you have

probs equals 1-p, should that be 1-self.p? JEREMY: Oh, it absolutely should.

Thank you very much, Johno. Not that it matters too much because, yeah, you can always just use nn.Dropout at this

point, and I only happened to use 0.1, which is why I didn't even see that.

So as you can see, I'm not even bothering to export this because I'm just showing how

to repeat what's already available in PyTorch. So yeah, thanks, Johno.

That's a good fix. Yeah, so if we're in evaluation mode,

it's just going to return the original. If p=0, then these are all

going to be just ones anyway. So we'll be multiplying by 1 divided

by 1, so there's nothing to change. So with p of 0, it does nothing in effect.

Yeah, and otherwise it's going to kind of zero out some of our activations.
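Putting the pieces from this discussion together, a from-scratch layer might look roughly like the sketch below. It follows the description above, including the 1-self.p fix, but it is a reconstruction rather than the notebook's exact code; in practice nn.Dropout does the same job.

```python
import torch
from torch import nn, distributions

class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p  # probability of zeroing each activation

    def forward(self, x):
        # At evaluation time, return the input untouched
        if not self.training:
            return x
        # The trial count (1) is created on x's device, so sampling happens there too
        dist = distributions.binomial.Binomial(
            torch.tensor(1.0, device=x.device), probs=1 - self.p)
        mask = dist.sample(x.size())       # ones and zeros, same shape as x
        return x * mask / (1 - self.p)     # rescale so the expected activation is unchanged
```

With p=0 the mask is all ones and the scale factor is 1, so the layer does nothing, as noted above.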

So we can, a pretty common place to add dropout is before your last linear layer.

So that's what I've done here. So yeah, if I run the exact same epochs, I

get 93.2, which is a very slight improvement. And so the reason for that is that it's not

going to be able to kind of memorize the data or the activations, you know, because

there's a little bit of randomness. So it's going to force it to try to identify

just the actual underlying differences. There's a lot of different

ways of thinking about this. You can almost think of it as a bagging

thing, a bit like a random forest. You know, it's each time it's giving a

slightly different kind of random subset. Yeah, but that's what it does.

I also added a Dropout2d layer right at the start, which is not particularly common.

I was just kind of like showing it. This is also how Christopher Thomas tried

it as well, although he didn't use Dropout2d. What's the difference between

Dropout2d and Dropout? So this is actually something I'd like you

to do to implement yourself as an exercise, is to implement Dropout2d.

The difference is that with Dropout2d, rather than using x.size()

as the shape of our tensor of ones and zeros, so in other words, potentially dropping

out every single batch item, every single channel, every single x, y independently,

we instead want to drop out an entire grid area at once: a whole channel's x, y grid goes together.

So if any of it is zero, then it's all zero. You can look up the docs for Dropout2d for

more details about exactly what that looks like.
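Without giving away a from-scratch implementation, one way to see the behaviour is to inspect PyTorch's built-in nn.Dropout2d on a 4D input. The shapes and seed below are arbitrary choices for this sketch; the same kind of check could later be pointed at your own version.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop2d = nn.Dropout2d(p=0.5)
drop2d.train()                      # dropout is only active in training mode

x = torch.ones(8, 16, 4, 4)         # (batch, channels, height, width)
y = drop2d(x)

# Flatten each feature map and check that every value in it matches the first one,
# i.e. each map is either zeroed as a whole or kept (and rescaled) as a whole.
per_map = y.flatten(start_dim=2)                        # (batch, channels, height*width)
all_same = (per_map == per_map[..., :1]).all(dim=-1)    # (batch, channels) booleans
```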

But yeah, so the exercise is to try and implement that from scratch and come up with a way to

test it. So like actually check that

it's working correctly, because it's a very easy thing to think that

it's working and then realize it's not. So then, yeah, Christopher Thomas actually

found that if you remove this entirely and only keep this, then you end up

with better results for 50 epochs. And so he's the first to break 95%.

So I feel like we should insert some kind of animation or trumpet sounds or something

at this point. I'm not sure if I'm clever enough to do that

in the video editor, but I'll see how I go. Hooray!

Okay. So that's about it for me.

Did you guys have any other things to add about Dropout, how to understand it or what

it does or interesting things? Oh, I did have one more thing before.

But you go ahead if you've got anything to mention.

JOHNO: So I was going to ask just because I think the standard is to set it, like remove the

dropout before you do inference. But I was wondering if there's anyone you

know of, or if it works to use it for some sort of test time augmentation.

JEREMY: Oh, dude! Thank you.

Because I wrote a callback for that. Did you see this or are you just like (JOHNO:

no), okay, just a test time dropout callback. Nice.

So yeah, in before_epoch, if you remember, in

Learner, we put the model into training mode, which actually puts

every individual layer into training mode. So that's why for the module itself, we can

check whether that module's in training mode. So what we can actually do is after that's

happened, we can then go back in this callback and apply a lambda that says

if this is a Dropout, then… wait, this is, yeah, then put

it in training mode all the time, including at evaluation.

And so then you can run it multiple times, just like we did for TTA, but with this callback.

Now that's very unlikely to give you a better result because it's not kind of showing it

different versions or anything like that, like TTA does that are kind of meant to be

the same. But what it does do is it gives

you a sense of how confident it is. If it kind of has no idea, then that little

bit of dropout's quite often going to lead to different predictions.

So this is a way of kind of doing some kind of confidence measure.
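The same idea can be sketched in plain PyTorch, outside the miniai callback system. The helper names `enable_test_time_dropout` and `mc_dropout_predict`, and the toy model, are hypothetical and not part of the lesson's code: the model goes into evaluation mode, only the dropout layers are switched back to training mode, and the spread across repeated predictions serves as a rough, uncalibrated confidence signal.

```python
import torch
from torch import nn

def enable_test_time_dropout(model):
    """Put the model in eval mode, then switch only dropout layers back to train mode."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()
    return model

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    """Run several stochastic forward passes; return mean and std of class probabilities."""
    enable_test_time_dropout(model)
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

# Toy usage with a made-up classifier and random inputs
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 3))
x = torch.randn(4, 8)
mean, std = mc_dropout_predict(model, x, n_samples=10)
```

A large `std` for an input suggests the prediction changes a lot under dropout; how large counts as "unsure" would need calibrating on examples the model should and should not be confident about.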

You'd have to calibrate it by, kind of, looking at things

that it should be confident about and not confident about and seeing how

that dropout, test time dropout changes. But the basic idea, it's been

used in medical models before. I wouldn't say it's totally popular, which

is why I didn't even bother to show it being used, but I just want to add it here because

I think it's an interesting idea and maybe could be more used than it is, or at

least more studied than it has been. A lot of stuff that gets used in the medical

world is less well known out in the rest of the world.

So maybe that's part of the problem. Cool.

All right. So I will stop my sharing and we're going to

switch to Tanishq, who's going to do something much more exciting, which is to show that

we are now at a point where we can do DDPM from scratch or at least

everything except the model. And so to remind you, DDPM doesn't have the

latent VAE thing and we're not going to do conditional. So it's not going to be like, we're not

going to get to tell it what to draw. And the U-Net model itself is the

one bit we're not going to do today. We're going to do that next lesson.

But, but other than the U-Net, it's going to be unconditional DDPM from scratch.

So Tanishq, take it away. Okay.

Hi, welcome back. Sorry for the slight continuity problem.

You may notice people look a little bit different. That's because we had some Zoom issues.

So we have a couple of days have passed and we're back again.

And Johno recorded his bit before we do Tanishq's bit, and then we're going

to post them in reverse order. So hopefully there aren't too many confusing

continuity problems as a result and it all goes smoothly, but it's time to turn

it over to Tanishq to talk about DDPM. TANISHQ: So we've reached the point where we have

this miniai framework and I guess it's time to now start using it to build more,

I guess, sophisticated models. And as we'll see here, we can start putting

together a diffusion model from scratch using the miniai library, and we'll see

how it makes our life a lot easier. And also it'd be very nice to see how the

equations in the papers correspond to the code.

I have here, of course, the notebook

that we'll be working from. Then we have the paper, the diffusion model

paper, “Denoising Diffusion Probabilistic Models”, which is the paper

that was published in 2020. It was one of the original diffusion model

papers that set off the entire trend of diffusion models and is a good starting point

as we delve into this topic further. And also I have some diagrams and

drawings that I will also show later on. But yeah, basically let's just get started

with the code here and of course the paper. So just to provide some context with this

paper, this paper was published from this group in UC Berkeley, I think a few of

them have gone on now to work at Google. And this is Pieter Abbeel, he

has a big lab at UC Berkeley. And so diffusion models were actually originally

introduced in 2015, but this paper in 2020 greatly simplified the diffusion models and

made it a lot easier to work with and got these amazing results as you can see here

when they trained on faces and in this case CIFAR-10 and this really was very, kind

of a big leap in terms of the progress of diffusion

models. And so just to kind of briefly

provide, I guess, kind of an overview. JEREMY: If I could just quickly step in and

just mention something, which is, when we started this course, we

talked a bit about how perhaps the diffusion part of diffusion models is not actually all

that. Everybody's been talking about

diffusion models because that's, particularly because that's

the open source thing we have that works really well.

But this week, actually a model that appears to be quite a lot better than Stable Diffusion

was released that doesn't use diffusion at all. Having said that, the basic ideas, like most

of the stuff that Tanishq talks about today, will still appear in some kind of form,

but a lot of the details will be different. But strictly speaking, actually, I don't even

know if we've got a word anymore for the kind of like modern generative

model things we're doing. So in some ways, when we're talking about

diffusion models, you should maybe replace it in your head with some other word, which

is more general and includes this paper that Tanishq is looking at here.

JOHNO: Iterative Refinement, perhaps? That's what I'd like.

JEREMY: Yeah, that's not bad, iterative refinement.

I'm sure by the time people watch this video, probably, you know, somebody will have decided

on something. We will keep our course website up to date.

TANISHQ: Yeah. Yeah.

This is the paper that Jeremy was talking about and yeah, every week there seems to

be another state of the art model. But yeah, like Jeremy said, a lot of the

principles are the same, but the details can be different

for each paper. And I just want to again, also, like Jeremy

was saying, zoom back a little bit and talk a little bit about what, just to provide

a review of what we're trying to do here. So let me just right next to him here.

Yeah. So with this task, we were trying to, in this

case, we're trying to do image generation. Of course, it could be other forms of

generation, like text generation or whatever. And the general idea is that of

course we have some data points. In this case, we have some images of dogs

and we want to produce more like the data points that we're given.

So in this case, maybe the dog image generation or something like this.

And so the overall idea that a lot of these approaches take for some sort of generative

modeling task is they try to... Not over there, I’m going to mark here.

They try to... Oops, what happened here?

Maybe it might... Yeah.

So let me use it in a bit. p(x), which is basically the likelihood...

What's going to happen here? Likelihood of data point x.

So let's say x is some image. Then p(x) tells us what is the probability

that you would see that image in real life. And we can take a simpler example, which may

be easier to think about, of a one-dimensional data point like height, for example. And if we were to look at height, of course

we know we have a data distribution that's kind of a bell curve.

And you have maybe some mean height, which is something like 5'9", 5'10".

I guess 5'10", or something like that, or 5'9", whatever.

And then of course we have some more unlikely points that are still possible.

Like for example, we have 7'8", or we have something that's maybe not as likely, which

is like 3', or something like this. JEREMY: So here's the X axis is height, and the

Y axis is the probability of some random person you meet being that tall.

TANISHQ: Exactly. So this is basically the probability.

And so of course you have this sort of peak, which is where you have higher probability.

And so those are the sorts of values that you would see more often.

So this is what we would call our p(x). And the important part about p(x) is that

you can use this now to sample new values if you know what p(x) is, or if you have

some sort of information about p(x). So for example, here you can think of, if

you were to say, maybe have some, let's say you have some game and you have some human

characters in the game, and you just want to randomly generate a height for this human

character, you wouldn't want to of course select a random height between 3 and 7,

that's kind of uniformly distributed. You would instead maybe want to have the height

dependent on this sort of function, where you would more likely sample values in the

middle and less likely sample these sorts of extreme points.

So it's dependent on this function p(x). So having some information about p(x)

will allow you to sample more data points. And so that's kind of the overall goal of

generative modeling is to get some information about p(x) that then allows us to sample

new points and create new generations. So that's kind of a high level kind of description

of what we're trying to do when we're doing generative modeling.
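The game-character example can be made concrete with a small sketch. The mean of 69 inches (5'9") and the 3 inch standard deviation are illustrative assumptions, and a Gaussian is only a stand-in for the real p(x):

```python
import random

rng = random.Random(0)            # seeded only for repeatability
MEAN_IN, STD_IN = 69.0, 3.0       # hypothetical bell-curve parameters, in inches

def sample_uniform_height():
    """Naive approach: every height from 3' to 7' is equally likely."""
    return rng.uniform(36.0, 84.0)

def sample_height():
    """Sample according to p(x): values near the mean are much more likely."""
    return rng.gauss(MEAN_IN, STD_IN)

heights = [sample_height() for _ in range(10_000)]
# Fraction of sampled heights within one standard deviation of the mean
near_mean = sum(abs(h - MEAN_IN) <= STD_IN for h in heights) / len(heights)
```

Most samples from `sample_height` land near the middle of the curve, whereas `sample_uniform_height` would produce three-foot and seven-foot characters as often as average ones.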

And of course there are many different approaches. We have our famous GANs, which used to be the

common method back in the day before diffusion models.

We have VAEs, which I think we'll probably talk a

little bit more about that later as well.

JEREMY: We'll be talking about both of those techniques later.

Yeah. TANISHQ: Yeah.

So there are many different other techniques. There are also some niche techniques

that are out there as well. But of course now the popular one is are these

diffusion models or as we talked about, maybe a better term might be, iterative

refinement or whatever the term ends to be. But yeah, so there are many different techniques.

And yeah, so this is kind of the general diagram that shows what diffusion models are.

And if we can look at the paper here, which let's pull up the paper.

Yeah, you see here, this is the sort of, they call it directed graphical model.

It's a very complicated term. It's just kind of showing

what's going on in this process. There's a lot of complicated math here, but

we'll highlight some of the key variables and equations here.

So basically the idea is that, okay, so let's see here.

This is an image that we want to generate, right? And so x0 is basically, these are

actually the samples that we want. So we want to, x0 is what we want to generate.

And these would be, yeah, these are images. And we start out with pure noise.

So that's xT (uppercase T): pure noise. And the whole idea is that we have two processes.

We have this process where we're going from pure noise to our image.

And we have this process from our image to pure noise.

So the process where we're going from our image to pure noise, this is called the forward

process. Forward... sorry, my typing, or rather

my handwriting, is not so good here. So hopefully it's clear enough.

Let me know if it's not. So we have the forward process, which

is mostly just used for training. Then we also have our reverse process.

This is the reverse process, which I will write up here.

Reverse process. JEREMY: So this is a bit of a summary, I guess,

of what you and Wasim talked about in Lesson TANISHQ: And just, it's just mostly to highlight

now what are the different variables as we look at the code and see the

different variables in the code. JEREMY: Okay, so we'll be focusing today on the

code, but the code will be referring to things by name and those names won't make sense very

much unless we see what they're used for in the math.

Okay. TANISHQ: Yeah.

And I won't dive too much into the math. I just want to focus on these sorts of

variables and equations that we see in the code. So basically the general idea is that

we do these in multiple different steps. We have here from time step 0 all the way to

time step uppercase T. And so there's some fixed number of steps, but then we have this

intermediate process where we're going from some particular time step.

We have this time step lowercase t, which is some noisy image.

And yes, we're transitioning between these two different noisy images.

So we have this, what is sometimes called the transition.

We have this one here. This is like something that's

called the transition kernel or yeah, whatever it is, it basically

is just telling us, you know, how do we go from, you know, one, in this case, we're going

from a less noisy image to a more noisy image and then going backwards, it's going from

a more noisy image to a less noisy image. So let's look at the equations.

JEREMY: So the forward direction is really easy: to make something more noisy,

you just add a bit more noise to it. And the reverse direction is incredibly difficult.

In particular, to go from the far left to the far right is strictly

speaking impossible because none of that person's face

exists anymore. But somewhere in between you could certainly

go from something that's partially noisy to less noisy by a learned model.

TANISHQ: Exactly. And that's what I'd like to

write down right now in terms of, I guess, the symbols and the math. So yeah, basically I'm just trying to pull

out the, just to write down the equations here.

So we have, let me zoom in a bit. So we have our two, let's see here.

q of xt given x(t minus 1)... Or actually, you know what, maybe it's

just better if I just snipped it from here. So the one that is going from our

forward process is this equation here. So I'll just make that a

little smaller for you guys. So right there.

So that is the forward step, and basically to explain, we have this sort of script N, which is maybe a little bit

of confusing notation here, but basically this is referring to a

normal distribution or a Gaussian distribution. And this is just saying, okay, this is a Gaussian

distribution that's describing this particular variable.

So it's just saying, okay, N is our normal or Gaussian

distribution, and it's representing this variable x of t, or x, sorry,

xt. And then what we have here is the mean, and this is

the variance. So just to again, clarify, I think we've talked

about this before as well, but this is a, this is of course a bad drawing of a Gaussian,

but our mean is just, our mean is just this, the middle point here is the mean, and the

variance just kind of describes the sort of spread of the Gaussian distribution.

So if you think about it a little further, you have this beta, which is one of the important variables that kind of describes

the diffusion process, beta-t. So you'll see the beta-t in the code.

And basically beta-t increases as t increases. So basically your beta-t will be

greater than your beta-(t minus 1). So if you think about that a little bit more

carefully, you can see that, okay, so at t-1, at this time point here, and then you're

going to the next time point, you're going to increase your beta-t.

So you're increasing the variance, but then you have this 1 - beta-t, and you take the

square root of that and multiply it by x(t minus 1).

So as your t is increasing, this term actually decreases.

So your mean is actually decreasing and you're getting less of the original image,

because the original image is going to be part of x(t minus 1).

JEREMY: And just to let you know, Tanishq, we can't see your pointer.

So if you want to point at things, you would need to highlight them or something.

TANISHQ: So yeah, I'll just, let's see. Yeah.

Basically, I haven't pointed anything in specific. I was just saying that, yeah, basically

if we have our xt here, as the time step increases, you're getting less

contribution from your x(t minus 1). And so that means your mean is going towards zero.

And so you're going to have a mean of 0, and the variance keeps increasing, and basically

you just have a Gaussian distribution and you lose any contribution from the original

image as your time step increases. So that's why when we start out from

x0 and go all the way to our xT here, this becomes pure noise.

It's because we're doing this iterative process where we keep adding noise.

We lose that contribution from the original image and that leads to the image having pure

noise at the end of the process. JEREMY: Something I find useful here is to

consider one extreme, which is to consider x1. So at x1, the mean is going to

be the square root of 1 - beta-t, times x0. The reason that's interesting

is x0 is the original image. So we're taking the original

image and at this point 1 - beta-t will be pretty

close to 1. So at x1, we're going to have something

that's the mean is very close to the image and the variance will be very small.

And so that's why we will have an image that just has a tiny bit of noise.
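For reference, the forward transition being described here can be written out as follows. This is the form given in the DDPM paper, using the spoken names above (beta-t as \beta_t, x(t minus 1) as x_{t-1}):

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad \beta_t > \beta_{t-1}

% First step (t = 1): beta_1 is small, so the mean stays close to the image x_0
q(x_1 \mid x_0) = \mathcal{N}\!\left(x_1;\ \sqrt{1-\beta_1}\,x_0,\ \beta_1 \mathbf{I}\right)
```

The first argument after the semicolon is the mean and the second is the variance, matching the two terms pointed out in the discussion.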

TANISHQ: Right, right. And then another thing that sometimes it's

easier to write out is, in this case, you can write out q(xt)

directly, because q(xt) is only

dependent on x(t minus 1), and then x(t minus 1) is only

dependent on x(t minus 2), and the noise added at each of these steps is independent.

So based on the different laws of probability, you can get your q(xt) in closed form.

So, yeah, that's what's shown here: q of xt given the original image.

So this is also another way of kind of seeing this more clearly where you can see that.

Anyways, so I'm going back here. So this is another way to see here more directly.

So this is, of course, our clean image. And this is our clear, our noisy image.

And so you can also see again, now alpha bar t is dependent on beta t.
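Written out, the closed form referred to here is the following, as given in the DDPM paper, with "alpha bar t" as \bar{\alpha}_t:

```latex
\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s

q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)
```

Since every \alpha_s is below 1, \bar{\alpha}_t shrinks as t grows, so the mean moves away from x_0 while the variance 1-\bar{\alpha}_t grows towards 1.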

Basically it's like one minus the cumulative. JEREMY: I mean, we'll see

the code for it, I guess. So maybe.

TANISHQ: Yes, yes. So it might be clear to see that this

is alpha bar t or something like this. But basically, basically the idea is that

alpha bar t is going to be, again, less. This is what is going to be

less than alpha bar t minus one. So basically alpha, this keeps decreasing, right?

This decreases as time step increases. And on the other hand, this is going to

be increasing as time step increases. But again, you can see the contribution from

the original image decreases as time step increases while the noise, as shown by the

variance, is increasing while the time step is increasing.

Anyway, so that hopefully clarifies the forward process.
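A few lines of PyTorch make these trends visible. The number of steps (1000) and the linear schedule from 1e-4 to 0.02 are the values used in the DDPM paper and are assumed here; this part of the discussion does not state them.

```python
import torch

n_steps = 1000                      # assumption: T = 1000
beta_min, beta_max = 1e-4, 0.02     # assumption: linear schedule endpoints

beta = torch.linspace(beta_min, beta_max, n_steps)   # beta_t increases with t
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)              # running product of alpha, decreases with t

signal_scale = alpha_bar.sqrt()     # how much of the original image survives in the mean
noise_var = 1.0 - alpha_bar         # variance of the accumulated noise
```

At the first step `signal_scale` is almost 1 and `noise_var` almost 0; by the last step the image contribution has all but vanished and the variance is close to 1, which is the pure-noise end of the process.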

And then the reverse process is basically a neural network, as Jeremy had mentioned.

And yeah, screenshot this.

That's our reverse process. And basically the idea is, well, this is a

neural network and this is also a neural network. Neural network.

And we learn it during the training of the model. But the nice thing about this particular diffusion

model paper that made it so simple was actually, we completely ignored this and actually set

it to constants just based on, you know, big numbers.

JEREMY: We can't see what you're pointing at. So I think it's important to

mention what this is here. TANISHQ: This term here.

So this one, we just kind of ignore and it's just a constant dependent on beta-t.

So you only have one neural network that you need to train, which is basically referring

to this mean. And the nice thing about this diffusion

model paper is that it also reparameterizes the mean into this easier form where

you do a lot of complicated math, which we'll not

get into here. But basically you get this kind of simplified

training objective where, let's see here. Yeah, you see the simplified training objective.

You instead have this epsilon-theta function. And let me just screenshot that again. This is our loss function that we train

and we have this epsilon-theta function. You can see it's a very

simple loss function, right? This is just a, let me just write this down.

This is just an MSE loss. And we have this epsilon-theta function here.

That is our- JEREMY: ...maybe to those of us who are less mathy, it

might not be obvious that it's a simple thing, because it looks quite complicated

to me, but once we see it in code, it'll be simple.

TANISHQ: Yes, yes. Basically you're just doing like, and

you'll see it in code how simple it is. But this is like just an MSE loss.

So we've seen MSE loss before, but you'll see how, yeah, this is basically MSE. So the nice, so just to kind of take a step

back again, what is this epsilon-theta? Because this is like a new thing that

like seems a little bit confusing. Basically epsilon, you can see here, basically,

yeah, so this here is saying, this is actually equivalent to this equation here.

These two are equivalent. This is just another way of saying that,

because basically it's saying, that's xt. So this is giving xt just in a different way.

But epsilon is actually drawn from this normal distribution with a

mean of 0 and a variance of 1. And then you have all these scaling terms

that change the mean and variance to be the same as in this equation that we have over here.

So this is our xt. And so what epsilon is, it's actually the noise that we're adding

to our image to make it into a noisy image. And what this neural network is doing

is trying to predict that noise. So what this is actually doing is this is

actually a noise predictor, and it is predicting the noise in the image.
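The equivalence described here, between sampling xt from the Gaussian and scaling a unit-variance noise sample, can be written as:

```latex
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})

% The network is trained so that its output approximates the noise that was added
\epsilon_\theta(x_t, t) \approx \epsilon
```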

And why is that important? Basically the general idea is,

if we were to think about our distribution of data, let's

just think about it in a 2D space. Just here, each data point here represents an

image, and they're in this blob area, which represents a distribution.

So this is in-distribution, and this is out of the distribution.

Out-of-distribution. And basically the idea is that, okay, if we

take an image and we want to generate some random image, if we were to take a random

data point, it would most likely be noisy images, right?

So if we just generate

some random data point, it's going to be just noise.

But we want to keep adjusting this data point to make it look more like an image from your

distribution. That's kind of the whole idea of this iterative

process that we're doing in our diffusion model.

So the way to get that information is actually to take images from your dataset and actually

add noise to it. So that's what we try to do in this process.

So we have an image here and we add noise to it. And then what we do is we try to train

a neural network to predict the noise. And by predicting the noise and subtracting

it out, we are going back to the distribution. So adding the noise takes you away from the

distribution and then predicting the noise brings you back to the distribution.

So then if we know at any given point in this space how much noise to remove, that tells

you how to keep going towards the data distribution and get a point that

lies within the distribution. So that's why we have noise prediction.

And that's the importance of doing this noise prediction is to be

able to then do this iterative process, where we can start out at a random point,

which would be, for example, pure noise and keep predicting and removing that noise

and walking towards the data distribution. Okay.

Okay. So yeah, let's get started with the code.

And so here we of course have our imports and we're going to load our dataset.

We're going to work with our Fashion-MNIST dataset, which is what we've been working with

for a while already. And yeah, this is just basically the same

code that we've seen from before in terms of loading the dataset. And then we have our model.

So to remove the noise from an image, our model is

going to take in the previous image, the noisy image, and predict the noise.

So the shapes of the input and the output are the same.

They're going to be in the shape of an image. So what we use is we use a U-Net neural

network, which takes in kind of an input image. JEREMY: And we do see your

pointer now, by the way. So feel free to point at things.

TANISHQ: Yeah. So yeah, it takes in an input image.

And in this case, that's the purpose of the U-Net, but they can also

be used for any sort of image to image task, where we're going from an input

image and then outputting some other image of some sort.

And we'll talk about... JEREMY: So this is a new architecture, which we

haven't learned about yet, and we will be learning about in the next lesson.

But broadly speaking, those gray arrows going from left to right are a lot like ResNet,

very much like ResNet skip connections. But they're being used in a different way.

Everything else is stuff that we've seen before. So it's basically, we can pretend

those don't exist for now. It's a neural network that the output is the

same size or a similar size to the input. And therefore you can use it to learn how

to go from one image to a different image. TANISHQ: Yeah.

So that's where the U-Net is. And yeah, like Jeremy said,

we'll talk about it more. The sort of U-Net that are used for diffusion

models also tend to have some additional tricks, which again, we'll talk

about them later on as well. But yeah, for the time being, we will just

import a U-Net from the Diffusers library, which is the Hugging Face

library for diffusion models. So they have a U-Net implementation

and we'll just be using that for now. And so, yeah, of course,

JEREMY: strictly speaking, we're cheating at this point because we're

using something we haven't written from scratch, but we're only cheating temporarily because

we will be writing it from scratch. TANISHQ: Yeah.

And yeah, so, and then of course we're working with one channel images, our Fashion-MNIST

images are one channel images. So we just have to specify that.

And then of course the channels of the different blocks within the U-Net are also specified.

And then let's go into the training process. So basically the general idea of course

is we want to train with this MSE loss. What we do is we select a random timestep

and then we add noise to our image based on that timestep.

So of course, if we have a very high timestep, we're adding a lot of noise.

If we have a lower timestep, then we're adding very little noise.

So we're going to randomly choose a timestep. And then, yeah, we add the

noise accordingly to the image. And then we pass the noisy image

to a model as well as the timestep. And we are trying to predict the amount of

noise that was in the image and we predict it with the MSE loss.

So we can see all the... JEREMY: I have some pictures of some of these

variables I could share if that would be useful. So I have a version.

So I think Tanishq is sharing notebook number 15. Is that right?

And I've got here notebook number 17. And so I took Tanishq's notebook and just,

as I was starting to understand it, I like to draw pictures for myself

to understand what's going on. So I took the things which are in Tanishq's

class and just put them into a cell. So I just copied and pasted them, although

I replaced the Greek letters with English written out versions.

And then I just plotted them to see what they look like.

So in Tanishq's class, he has this thing called beta, which is just linspace.

So that's just literally a line. So beta, there's going to be a thousand of

them and they're just going to be equally spaced from 0.0001 to 0.02.

And then there's something called sigma, which is the square root of that.

So that's what sigma is going to look like. And then he's also got alphabar, which

is a cumulative product of 1 minus this. And there's what alphabar looks like.

So you can see here, as Tanishq was describing earlier, that when t is higher, this is t

on the x-axis, beta is higher, and when t is higher, alphabar is lower.

So yeah, so if you want to remind yourself, so each of these things, beta, sigma, alphabar,

they're each, they've each got a thousand things in them.

And this is the shape of those thousand things. So this is the amount of variance,

I guess, added at each step. This is the square root of that.

So it's the standard deviation added at each step. And then if we do 1 minus that,

it's just the exact opposite. And then this is what happens if you

multiply them all together up to that point. And the reason you do that is because if you add

noise to something, you add noise to something that you add noise to something that you add

noise to something, then you have to multiply together all that amount of noise

to say how much noise you would get. So yeah, those are my pictures, if that's helpful.
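
A minimal sketch of the three curves described here, assuming PyTorch. The values (a thousand steps, beta from 0.0001 to 0.02) are the ones quoted above; the variable names are written out in English and are illustrative rather than copied from either notebook.

```python
import torch

# Values quoted in the lesson: 1000 timesteps, beta equally spaced from 0.0001 to 0.02.
n_steps = 1000
beta_min, beta_max = 0.0001, 0.02

beta = torch.linspace(beta_min, beta_max, n_steps)  # variance added at each step
sigma = beta.sqrt()                                  # standard deviation added at each step
alpha = 1.0 - beta                                   # "1 minus that"
alphabar = alpha.cumprod(dim=0)                      # product of all alphas up to step t

# beta rises with t, alphabar falls with t
print(beta[0].item(), beta[-1].item())
print(alphabar[0].item(), alphabar[-1].item())
```

Plotting `beta`, `sigma` and `alphabar` against the step index should give pictures of the kind described above.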

TANISHQ: Yep. Yep.

Good to see the diagram and see the actual values and how they change over time.

So yeah. Let's see here.

Sorry. Yeah.

So like Jeremy was showing, we have our linspace for our beta.

In this case, we're using kind of more of the Greek letters.

So you can see the Greek letters that we see in the paper as well as...

Now we have it here in the code as well. And we have our linspace from our

minimum value to our maximum value. And we have some number of steps.

So this is the number of timesteps. So here we use a thousand timesteps, but

that can depend on the type of model that you're training.

And that's one of the parameters of your model or hyperparameters of your model.

JEREMY: And this is the callback you've got here. So this callback is going to be used to set

up the data, I guess, so that you're going to be using this to add the noise so that the

model's then got the data that we're trying to get it to learn to then denoise.

TANISHQ: Yeah. So the callback of course makes life a lot

easier in terms of, yeah, setting up everything and still being able to use, I guess, the

miniai learner with maybe some of these more complicated and maybe a little

bit more unique training loops. So yeah, in this case, we're just able to

use the callback in order to set up the batch that we are passing into our learner.

JEREMY: I just want to mention, when you first did this, you wrote out the Greek letters in English,

alpha and beta and so forth. And at least for my brain, I was finding it

difficult to read because they were literally going off the edge of the page

and I couldn't see it all at once. And so we did a search and replace to

replace it with the actual Greek letters. I still don't know how I feel about it.

I'm finding it easier to read because I can see it all at once.

I don't have to scroll and I don't get overwhelmed.

But when I need to edit the code, I kind of just tend to copy

and paste the Greek letters, which is why we use the actual word beta in

the init parameter list so that somebody using this never has to type a Greek letter.

But I don't know, Johno or Tanishq, if you had any thoughts over the last week or two,

since we made that change about whether you guys like having the Greek letters in there

or not. JOHNO: I like it for this demo in particular.

I don't know that I do this in my code, but because we're looking back and forth between

the paper and the implementation here, I think it works in this case just fine.

TANISHQ: Yeah, I agree. I think it's good for when you're trying to

study something or try to implement something, having the Greek letters is very useful to be

able to, I guess, match the math more closely and it's just easy just to pick the equation

and put it into code or, vice versa, look at the code and try

to match it to the equation. So I think for educational purpose, I

tend to like, I guess, the Greek letters. So yeah. Yeah, so we have our initialization where

we're just defining all these variables. We'll get to the predict in just a moment, but

first I just want to go over the before_batch where we're setting up our

batch to pass into the model. So remember that the model is taking

in our noisy image and the timestep. And of course the target is the actual amount

of noise that we are adding to the image. So basically we generate that noise.

So that's what… JEREMY: So epsilon is that target.

So epsilon is the amount of noise, not the amount of, is the actual noise.

TANISHQ: Yes. Epsilon is the actual noise that we're adding.

And that's the target as well because our model is a noise predicting model.

It's predicting the noise in the image. And so our target should be the noise

[unintelligible] that we're adding to the image during

training. So we have our epsilon and we're just

generating it with this random function. The random normal distribution

with a mean of 0, variance of 1. So that's what that's doing and adding

the appropriate shape and device. Then the batch that we get originally

will contain the clean images. These are the original images from our dataset.

So that's x0. And then what we want to

do is we want to add noise. So we have our alphabar and we have

a random timestep that we select. And then we just simply follow that equation,

which I can, I'll just show in just a moment. JEREMY: That equation, you can

make a tiny bit easier to read. I think if you were to double click on that

first alphabar underscore t, cut it and then paste it, sorry, in the xt equals torch dot

square root, take the thing inside the square root, double click it and paste

it over the top of the word torch. That would be a little bit easier to read. And then you'll do the same for the next one.

TANISHQ: There we go.

Put those parentheses. Yep.

TANISHQ: Yeah, so basically, yeah, so yeah, I

guess I'll just pull up the equation. So let's see, there's then, so there's a section

in the paper that has the nice algorithm. See if I can find it.

No, no, here. It's, I think earlier.

Yes, training. Right, so this, we're just following these

same sort of training steps here, right? Where we select a clean image

that we take from our data set. This fancy kind of equation here is just saying,

take an image from your data set, take a random timestep between this range.

Then this is our epsilon that we're getting, just saying, get some epsilon value.

And then we have our equation for Xt, right? This is the equation here.

You can see that it is square root of alphabar t times x0 plus square root of one minus

alphabar t times epsilon. So that's the same equation

that we have right here, right? And then what we need to do is we

need to pass this into our model. So we have xt and t. So we

set up our batch accordingly. So this is the two things

that we pass into our model. And of course we also have our

target, which is our epsilon. And so that's what this is showing here.
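
The batch setup described here can be sketched as a standalone function, assuming PyTorch. The function name `make_noisy_batch` is hypothetical, and the random images stand in for a real Fashion-MNIST batch; only the equation for x_t and the `((x_t, t), epsilon)` layout come from the discussion above.

```python
import torch

def make_noisy_batch(x0, alphabar):
    """Return ((x_t, t), eps) for a batch of clean images x0 of shape (n, c, h, w)."""
    n, n_steps = len(x0), len(alphabar)
    # one random timestep per image
    t = torch.randint(0, n_steps, (n,), dtype=torch.long, device=x0.device)
    # the actual noise, drawn from a normal distribution with mean 0 and variance 1
    eps = torch.randn_like(x0)
    # reshape so each image's alphabar_t broadcasts over channel, height and width
    abar_t = alphabar.to(x0.device)[t].reshape(-1, 1, 1, 1)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return (xt, t), eps

alphabar = (1 - torch.linspace(0.0001, 0.02, 1000)).cumprod(dim=0)
x0 = torch.rand(4, 1, 32, 32)          # placeholder for a batch of clean images
(xt, t), eps = make_noisy_batch(x0, alphabar)
print(xt.shape, t, eps.shape)
```

The first element of the returned pair is what goes into the model, and the second is the target.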

We passed in our Xt as well as our t here, right? And we pass that into a model.

The model is represented here as epsilon theta. And theta is often used to represent, like, this

is a neural network with some parameters and the parameters are represented by theta.

So epsilon theta is just representing our noise predicting model.

So this is our neural network. So we have passed in our Xt and our t into

a neural network, and we are comparing it to our target here,

which is the actual epsilon. And so that's what we're doing here.
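
Written out, the training step being read from the paper here amounts to the following, using the same symbols as the spoken version (a restatement, with the MSE loss written as a squared norm):

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2
```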

We have our batch where we have our xt and t and epsilon.

And then here we have our prediction function. And because we actually have, I guess in this

case, we have two things that are in a tuple that we need to pass into our model.

So we just kind of get those elements from our tuple with this.

We get the elements from the tuple, pass it into the model, and then Hugging Face has

its own API in terms of getting the output. So you'd need to call .sample in order

to get the predictions from your model. So we just do that.

And then we do, we have learn.preds and that's what's going to be used later then when we're

trying to do our loss function calculation. JEREMY: So the, just so, I mean, it's just worth

looking at that a little bit more since we haven't quite seen something like this before.

And it's something which I'm not aware of any other framework that would let you do

this, you know, literally replace how prediction works.

And miniai is kind of really fun for this. So because you've inherited from TrainCB,

TrainCB has predict already defined, and you've defined a new version.

So it's not going to use the TrainCB version anymore.

It's going to use your version. And what you're doing is instead of passing

learn.batch[0] to the model, you're, you've got a * in front of it.

So the key thing is that * is going to, you know, and is, we know that actually learn.batch[0]

has two things in it because that learn.batch that you showed at the end of the before_batch

method has two things in learn.batch[0]. So that star will unpack them and send

each one of those as a separate argument. So our model needs to take two things, which

the diffusers U-Net does take two things. So that's the main interesting point.

And then something I find a bit awkward honestly, about a lot of Hugging Face stuff, including

diffusers is that generally their models don't just return the result, but they put it inside

some name. And so that's what happens here.

They put it in something inside something called sample.

So that's why Tanishq added .sample at the end of the predict because of this somewhat

awkward thing, which Hugging Face like to do for some reason.

But yeah, now that you know, I mean, this is something that people often get stuck on.

I see on Kaggle and stuff like that. It's like, how on earth do I use these models?

Because they take things in weird forms and they give back things with weird forms.

Well, this is how. You know, if you inherit from TrainCB, you

can change predict to do whatever you want, which I think is quite sweet.
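
The mechanics of that override can be shown without miniai or Diffusers. Everything below is a hypothetical stand-in (`FakeUNet`, `FakeTrainCB`, `FakeDDPMCB`, and a plain namespace as the learner); it only illustrates the two points made above: the `*` unpacks the two items in `learn.batch[0]`, and `.sample` pulls the result out of the wrapper object.

```python
from types import SimpleNamespace

class FakeUNet:
    """Hypothetical model: takes two arguments and wraps its result in `.sample`."""
    def __call__(self, x, t):
        return SimpleNamespace(sample=f"noise_pred(x={x}, t={t})")

class FakeTrainCB:
    """Hypothetical base callback whose default predict passes batch[0] as one argument."""
    def predict(self, learn):
        learn.preds = learn.model(learn.batch[0])

class FakeDDPMCB(FakeTrainCB):
    def predict(self, learn):
        # batch[0] is the tuple (x_t, t): unpack it with *, then unwrap .sample
        learn.preds = learn.model(*learn.batch[0]).sample

learn = SimpleNamespace(model=FakeUNet(), batch=(("xt", "t"), "eps"), preds=None)
FakeDDPMCB().predict(learn)
print(learn.preds)

# The inherited version would fail here, because the model needs two arguments.
try:
    FakeTrainCB().predict(learn)
    base_failed = False
except TypeError:
    base_failed = True
```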

TANISHQ: Yep. So yeah, that's the training loop.

And then of course you have your regular training loop that's implemented in miniai where you

are going to have. Yeah.

So you have your loss function calculation, I mean, and the predictions, learn.preds.

And of course the target is our learn.batch[1], which is our epsilon.

So we have those and we pass it into the loss function.

It calculates the loss function and does the back propagation.

So I'll just go over that. We'll get back to the sampling in just a moment.

But just to show the training loop. JEREMY: Most of this is copied from

our, I think it's 14_augment notebook, the way you've got the

tmax and the sched. The only thing I think you've added

here is the DDPM callback, right? TANISHQ: Yes.

The DDPM callback. JEREMY: And you change the loss function.

TANISHQ: Yes. So basically we have to initialize our DDPM

callback with the appropriate arguments. So like the number of timesteps and

the minimum beta and maximum beta. And then yeah, obviously, and then of course

we're using an MSE loss as we talked about. It just becomes a regular training

loop and everything else is from before. Yeah.

So we have your scheduler, your progress bar, all of that we've seen before.
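
For readers without miniai to hand, one training step can be sketched in plain PyTorch. This is an assumption-laden illustration, not the notebook's code: `TinyNoisePredictor` is a made-up stand-in for the U-Net, the batch is random data, and the optimizer settings are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TinyNoisePredictor(nn.Module):
    """Hypothetical stand-in for the U-Net: image and timestep in, same-shaped image out."""
    def __init__(self, n_steps):
        super().__init__()
        self.t_emb = nn.Embedding(n_steps, 1)
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, t):
        return self.conv(x + self.t_emb(t).reshape(-1, 1, 1, 1))

torch.manual_seed(0)
n_steps = 1000
alphabar = (1 - torch.linspace(0.0001, 0.02, n_steps)).cumprod(dim=0)
model = TinyNoisePredictor(n_steps)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x0 = torch.rand(8, 1, 32, 32)              # placeholder batch of clean images
t = torch.randint(0, n_steps, (8,))        # random timestep per image
eps = torch.randn_like(x0)                 # the noise, which is also the target
abar_t = alphabar[t].reshape(-1, 1, 1, 1)
xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps

pred = model(xt, t)                        # predict the noise from (x_t, t)
loss = F.mse_loss(pred, eps)               # compare against the actual noise
loss.backward()
opt.step()
opt.zero_grad()
print(loss.item())
```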

JEREMY: I think that's really cool that we're using basically the same code to train a diffusion model as we've used to train a

classifier just with one extra callback. TANISHQ: Yeah.

Yeah. Yeah.

That's why I think callbacks are very powerful for allowing us to do such things.

It's like pretty, you can take all this code and now

we have a diffusion training loop and we can just call learn.fit and yeah,

you can see we've got a nice training loop, nice loss curve.

We can save our model using the torch saving functionality

to be able to save our model and we could

load it in. But now that we have our trained model, then

the question is, what can we do to use it to sample the dataset?

So the basic idea of course was that we have, like basically we're here, right?

We have, let's see here. Okay.

So we have, the basic idea is that we start out with a random data point and of course that's

not going to be within the distribution at first, but now we've learned how to move from

one point towards the data distribution. That's what our noise prediction,

predicting function does. It basically tells you how, you know,

in what direction and how much to, so the basic idea

is that, yeah, I guess I'll start from maybe a new drawing here.

Again, we have, distribution is, and we have a random point and we use our noise predicting

model that we have trained to tell us which direction to move.

So it tells us some direction. Or I guess what's the exact, other area.

Okay. So like here, okay.

So it tells us some direction to move. At first that direction is not going to be

like, you cannot follow that direction all the way to get the correct data point.

Because basically what we were doing is we're trying to reverse the path that we were following

when we were adding noise. So like, cause we had originally data point

and we kept adding noise to the data point and maybe, you know, it

followed some path like this. And we want to reverse that path to get to.

So our noise predicting function will give us an original direction, which, you know,

would be, you know, some kind of, it's going to be kind of tangential to the actual path

at that location. So what we would do is, you know, we would maybe

follow that data point all the way towards, you know, we're just going to

keep following that data point. You know, we're going to try to predict the

fully denoised image by following this noise prediction.

But our fully denoised image is also not going to be a real image.

So what we, so let me, I'll show an example of that over here in the paper on why they

show this a little bit more carefully. So x0, it's there.

So basically you can see the different data...

You can see the different data points here.

It's not going to look anything like our real image.

So you can see all these points, you know, it doesn't look anything… what we would do

is we actually add a little bit of noise back to it.

And we start, we have a new point where then we could maybe estimate a better, get a better

estimate of which direction to move, follow that all the way again, we follow a new point.

And then I can add back a little bit of noise. You get a new estimate, you make a new estimate

of, you know, this noise prediction and removing the noise, you know, follow that

all the way again, completely, and add a little bit of noise again to the

image and converge onto an image. So that's kind of what we're showing here.

JEREMY: That's a lot like SGD, with SGD we don't take the gradient and jump all the way.

We use a learning rate to go some of the way because each of those estimates of where we

want to go, you know, not that great, but we just do it slowly.

TANISHQ: Exactly. And at the end of the day, that's what

we're doing with this noise prediction. We are predicting the sort of gradient of

this p(x), but of course we need to keep making estimates of that

gradient as we're progressing. So we have to keep evaluating our noise prediction

function to get updated and better estimates of our gradient in order to

finally converge onto our image. So and then you can see that here, you know, we

have this, maybe this fully predicted denoised image which at the beginning doesn't look

anything like a real image, but then as we continue throughout the sampling

process, we finally converge on something that looks

like an actual image. Again, these are CIFAR-10 images and it's

still a little bit maybe unclear about how realistic these images, these very small images

look, but that's kind of the general principle I would say.

And so that's what I can show in the code. This idea of we're going to start out

basically with a random image, right? And this random image is going to be like

a pure noise image and it's not going to be part of the data distribution.

You know, it’s not anything like a real image, it's just a random image.

And so this is going to be our x, I guess, x uppercase t [x_t], right?

That's what we start out with. And we want to go from x

uppercase t all the way to x0. So what we do is we go through each of the

timesteps and we create, we have to put it in this sort of batch format because

that's what our neural network expects. So we just have to format it appropriately.

And we'll get to z in just a moment. I'll explain that in just a moment, but of

course we just take it have similar… alphabar, betabar, which is getting

those variables that we need. JEREMY: And we faked beta bar because

we couldn't figure out how to type it. So we used bbar instead.

TANISHQ: Yeah. So yeah, we weren't

really able to get betabar to work, I guess. But anyway, at each step, what we're trying

to do is to try to predict what direction we need to go.

And that direction is given by our noise predicting model, right?

So what we do is we pass in x_t and our current timestep into our model.

And we get this noise prediction and that's the direction that we need to move in.

So basically we take x_t. We first attempt to completely remove the noise, right?

That's what this is doing. That's what x_0_hat is.

That's completely removing the noise. And of course, as we said, that estimate

at the beginning won't be very accurate. And so now what we do is we have some coefficients

here where we have a coefficient of how much we keep of this estimate of

our denoised image and how much of the originally noisy

image we keep. And on top of that, we're going

to add in some additional noise. So that's what we do here.

We have x_0_hat. And so, and we multiply by its coefficient

and we have x_t we multiply it by some coefficient and we also add some additional noise.

That's what the z is. It's just-

JEREMY: That's basically a weighted average of the two plus the noise…

TANISHQ: Exactly. And then the whole idea is that as we get

closer and closer to a timestep equals to 0 our estimate of x0 will be more and more accurate.

So our x0_coeff will get closer and closer to 1 as we're going through the process, and our xt_coeff

will get closer and closer to 0. So basically we're going to be weighting more

and more of the x_0_hat estimate and less and less of the x_t as we're getting

closer and closer to our final timestep. And so at the end of the day, we will

have our estimated generated image. So that's kind of an overview

of the sampling process. So yeah.
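
The sampling loop just described can be sketched as follows, assuming PyTorch. The names `x_0_hat`, `x0_coeff`, `xt_coeff`, `z` and `bbar` follow the discussion above; the coefficient formulas are the standard DDPM posterior ones, written here as a best reading of what the notebook does rather than a copy of it. The model is a dummy that always predicts zero noise, so the output is not a meaningful image.

```python
import torch

@torch.no_grad()
def sample(model, shape, alpha, alphabar, sigma, n_steps):
    x_t = torch.randn(shape)                          # start from pure noise
    preds = []
    for t in reversed(range(n_steps)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        # extra noise to add back in; none on the very last step
        z = torch.randn(shape) if t > 0 else torch.zeros(shape)
        abar_t = alphabar[t]
        abar_t1 = alphabar[t - 1] if t > 0 else torch.tensor(1.0)
        bbar_t, bbar_t1 = 1 - abar_t, 1 - abar_t1
        noise_pred = model(x_t, t_batch)              # direction to move in
        # attempt to remove the noise completely
        x_0_hat = (x_t - bbar_t.sqrt() * noise_pred) / abar_t.sqrt()
        # weighted average of the estimate and the current noisy image, plus noise
        x0_coeff = abar_t1.sqrt() * (1 - alpha[t]) / bbar_t
        xt_coeff = alpha[t].sqrt() * bbar_t1 / bbar_t
        x_t = x0_coeff * x_0_hat + xt_coeff * x_t + sigma[t] * z
        preds.append(x_t)
    return preds

n_steps = 50                                          # shortened so the sketch runs quickly
beta = torch.linspace(0.0001, 0.02, n_steps)
alpha, sigma = 1 - beta, beta.sqrt()
alphabar = alpha.cumprod(dim=0)
dummy_model = lambda x, t: torch.zeros_like(x)        # hypothetical stand-in for a trained model
preds = sample(dummy_model, (16, 1, 32, 32), alpha, alphabar, sigma, n_steps)
print(len(preds), preds[-1].shape)
```

On the final step `x0_coeff` is 1 and `xt_coeff` is 0, so the last entry of `preds` is just the final denoised estimate.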

So yeah, basically the way I implemented it here was I had the sample function that's

part of our callback and it will take in the model and the kind of shape that you want

for your images that you're producing. So like if you want to specify how many images

you produce, that's going to be part of your batch size or whatever.

And you'll just see that in a moment. But yeah, it's just part of the callback.

So then we basically have our DDPM callback and then we could just call the sample method

of our DDPM callback and we pass in our model. And then here you can see we're going to produce,

for example, 16 images and it just has to be a 1 channel image of shape

32 by 32 and we get our samples. And one thing I forgot to note was that I

am collecting each of the timestep, the x_t. So the predictions here, you can see

that there are a thousand of them. We want the last one because

that is our final generation. So we want the last one and that's what we should-

JEREMY: They're not bad, actually. TANISHQ: Yeah.

So this is- JEREMY: We've come a long way since DDPM.

So this is, like, slower and less great than it could be.

But considering that, except for U-NET, we've done this from scratch, you know, actually

from matrix multiplication, I think those are pretty decent.

TANISHQ: Yeah. And we're only trained for about five epochs.

It took like, you know, maybe like four minutes to train this model, something like that.

It's pretty quick. And this is what we get with very little training.

And it's pretty decent. You can see, of course, some clear shirts

and shoes and pants and whatever else. JEREMY: Yeah.

And you can see fabric and it's got texture and things have buckles and-

TANISHQ: Yeah. JEREMY: You know, something to compare, like, we

did generative modeling in the first time we did Part 2 back in the days when Wasserstein

GAN was just new, which was actually created by the same guy that created

PyTorch or one of the two guys, Soumith. And we trained for hours and hours and hours

and got things that I'm not sure were any better than this.

So things have come a long way. TANISHQ: Yeah.

Yeah. And of course, then, yeah, so we can see then

like how this sampling progresses over time, over the multiple timesteps.

So that's what I'm showing here because I collected, during the sampling process, we

are collecting at each timestep what that estimate looks like.

And you can kind of see here. And so this is an estimate of the

noisy image over the timesteps. Oops.

And I guess I had to pause. Yeah.

You can kind of see. But you'll notice that actually, so we actually,

what we did is like, okay, so we selected an image, which is like the ninth image.

So that's, that's this image here. So we're looking at this image

particularly here and we're going over. Yeah.

We have a function here that's showing the i-th timestep during the sampling process of

that image. And we're just getting the images.

And what we are doing is we're only showing basically from timestep 800 to a 1,000.

And here we're just, we're just having it like where it's like, okay, we're looking

at like maybe every 5 steps and we're going from 800 to 990.

And this time it would make it visually easier to see the transition.

But what you'll notice is I didn't start all the way from 0.

I started from 800. And the reason we do that is because actually

between 0 and 800 there's very little change in terms of like, it's just mostly a noisy image.

And it turns out, as I make a note of here, it's actually

a limitation of the noise schedule that is used in the original DDPM paper.

And especially when applied to some of these smaller images, when we're working with images

of like size 32 by 32 or whatever. And so there are some other papers like the

improved DDPM paper that propose other sorts of noise schedules.

And what I mean by noise schedule is basically how beta is defined basically.

So we had this definition of torch.linspace for our beta, but people have different ways

of defining beta that lead to different properties.

So things like that, people have come up with different improvements and those sorts of

improvements work well when we're working with these smaller images.

And basically the point is like, if we are working from 0 to 800 and it's just mostly

just noise that entire time, we're not actually making full use of all these timesteps.

So it would be nice if we could actually make full use of those time steps and actually

have it do something during that time period. So all these, there are some papers that examine

this a little bit more carefully and it would be kind of interesting for maybe some of you

folks to also look at these papers and see if you can try to implement those sorts of

models with this notebook as a starting point. And it should be a fairly simple change in

terms of like noise schedule or something like that.
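
As one possible starting point for that exercise, here is a sketch of the cosine schedule proposed in the improved DDPM paper, assuming PyTorch. The offset `s=0.008` and the cap of 0.999 on beta are the values given in that paper; the function names are illustrative, and nothing here is taken from the lesson notebooks.

```python
import math
import torch

def cosine_alphabar(n_steps, s=0.008):
    """alphabar defined directly as a squared cosine, normalised to 1 at step 0."""
    steps = torch.arange(n_steps + 1, dtype=torch.float64) / n_steps
    f = torch.cos((steps + s) / (1 + s) * math.pi / 2) ** 2
    return f / f[0]

def cosine_beta(n_steps, s=0.008, max_beta=0.999):
    """Recover per-step betas from consecutive alphabar values, capped near the end."""
    abar = cosine_alphabar(n_steps, s)
    beta = 1 - abar[1:] / abar[:-1]
    return beta.clamp(max=max_beta).float()

beta_cos = cosine_beta(1000)
alphabar_cos = (1 - beta_cos).cumprod(dim=0)
alphabar_lin = (1 - torch.linspace(0.0001, 0.02, 1000)).cumprod(dim=0)

# Halfway through, the cosine schedule has kept much more of the original signal.
print(alphabar_lin[500].item(), alphabar_cos[500].item())
```

Swapping `beta_cos` in where the `torch.linspace` beta is defined would be the kind of small change suggested above.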

JEREMY: So I actually think this is the start of our next journey, which is our previous journey

was going from being totally rubbish at Fashion-MNIST classification

to being really good at it. I heard you say now we're like a little bit

rubbish at doing Fashion-MNIST generation. And yeah, I think we should all now work from

here over the next few lessons and so forth and people trying things at home and all of

us trying to make better and better generative models, initially on Fashion-MNIST and hopefully

we'll get to the point where we're so good at that, that we're like, oh, this is too easy.

And then we'll pick something harder. And eventually that'll take us to Stable Diffusion and beyond.

I imagine. That's cool.

I got some stuff to show you guys. If you're interested, I tried to better understand

what was going on in Tanishq's notebook and tried doing it in a thousand different ways

and also see if I could just start to make it a bit faster.

So that's what's in notebook 17, which I will share.

So we've already seen the start of notebook 17. Well, the one thing I did just do is just

drew a picture for myself, partly just to remind myself what they,

what the real ones look like. And they definitely have more detail than

the samples that Tanishq was showing. But they're not, you know, they're just 28 by 28.

I mean, they're not super amazing images and they're just black or white.

So even if we're fantastic at this, they're never going to look great because we're using

a small, simple data set. As you always should, when you're doing

any kind of R&D or experiments, you should always use a small and simple data set up

until you're so good at it that it's not challenging

anymore. And even then when you're exploring new ideas,

you should explore them on small, simple data sets first.

Yeah. So after I drew the various things, what I

like to do is one thing I found challenging about working with your class Tanishq is I

find when stuff is inside a class, it's harder for me to explore.

So I copied and pasted it, the before_batch contents and called it noisify.

And so one of the things that's fun to do that is it forces you to figure out what are

the actual parameters to it. And so now that I, rather than putting in the

class, now that I've got all of my, you know, various things to do with, so these are the

three parameters to the DDPM callback's init.

And then these things we can calculate from that. So with those then actually

all we need is yeah, what's the image that we're going to

noisify and then what's the, what's the alphabar, which I mean, we can get from here, but

it's sort of be more general if you can pass in your alphabar.

So yeah, this is just copying and pasting from the class, but the nice thing is then

I could experiment with it. So I can call noisify on my first 25 images

and with, with a random t, each one's got a different random t, and so I can print out

the t and then I could actually use those as titles.

And so this lets me, I thought this is quite nice. I might actually rerun this cause actually

none of these look like anything because as it turns out in this particular

case, all of the t's are over 200. And as Tanishq mentioned, once you're over

200, it's almost impossible to see anything. So let me just rerun this

and see if we get a better, there we go.

There's a better one. So with a t of 7, right?

So remember t naught, t equals naught is the pure image.

So t equals 7, it's just a slightly speckledy image.

And by 67, it's a pretty bad image. And by 94, it's very hard

to see what it is at all. And by 293, maybe I can see a pair of pants.

I'm not sure I can see anything. So yeah.
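
To put rough numbers on those four examples, the weights applied to the image and to the noise at each t can be computed from alphabar. This is a small sketch assuming PyTorch and the linear schedule from earlier; the list names are illustrative.

```python
import torch

alphabar = (1 - torch.linspace(0.0001, 0.02, 1000)).cumprod(dim=0)

ts = [7, 67, 94, 293]                                       # the timesteps mentioned above
image_weights = [alphabar[t].sqrt().item() for t in ts]     # multiplies the clean image
noise_weights = [(1 - alphabar[t]).sqrt().item() for t in ts]  # multiplies the noise

for t, iw, nw in zip(ts, image_weights, noise_weights):
    print(f"t={t:3d}  image weight={iw:.2f}  noise weight={nw:.2f}")
```

The image weight shrinks and the noise weight grows as t increases, which matches the pictures getting harder to read.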

By the way, there's a handy little, so we've, I think

we've looked at map before in the course; there's an extended

version of map in fastcore. And one of the nice things is you can pass

it a string and it basically just calls this format string if you pass it a

string rather than a function. And so this is going to stringify

everything using its representations. This is how I got the titles

out of it just by the way. So yeah, I found this useful to be

able to draw a picture of everything. And then I wanted to, yeah, look at what,

what, what else, what else can I do? So then what I took next,

you won't be surprised to see, I took the sample method and

turn that into a function. And I actually decided to

pass everything that it needs. Even, I mean, you could actually

calculate pretty much all of these. But I thought since I've calculated

them before, just pass them in. So this is all copied and

pasted from Tanishq's version. And so that means the callback now is tiny, right? Because before_batch is just noisify and the

sample method just calls the sample function. Now, what I did do is I decided just to, yeah,

I wanted to try like as many different ways of doing this as possible.

Partly it's an exercise to help everybody like see all the different ways we can work

with our framework, you know? So I decided not to inherit from TrainCB,

but I instead I inherited from Callback. So that means I can't use Tanishq's

nifty trick of replacing predict. So instead I now need some way to pass in

the two parts of the first element of the tuple as separate things to the

model and return the sample. So how else could we do that?

Well what we could do is we could actually inherit from UNet2DModel, which is what

Tanishq used directly, and we could replace the model.

And so we could replace specifically the forward function.

That's the thing that gets called. And we could just call the original forward

function, but rather than passing an x we’re passing a *x, and rather than

returning that, we'll return that .sample. Okay.

So if we do that, then we don't need the TrainCB anymore and we don't need the predict.
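As a minimal sketch of that wrapping pattern: in the lesson the parent class is UNet2DModel from diffusers, but here `BaseModel` and `Output` are hypothetical stand-ins so the snippet runs without any dependencies. Only the shape of the `forward` override is the point.

```python
from dataclasses import dataclass

@dataclass
class Output:
    """Stand-in for a library output object that wraps the prediction."""
    sample: list

class BaseModel:
    """Hypothetical stand-in for a third-party model such as UNet2DModel."""
    def forward(self, sample, timestep):
        # Pretend prediction: add the timestep to every value.
        return Output(sample=[v + timestep for v in sample])

class UNet(BaseModel):
    def forward(self, x):
        # x is the (noised_image, timestep) tuple from the batch;
        # unpack it for the parent and return only the .sample field.
        return super().forward(*x).sample

model = UNet()
pred = model.forward(([0.5, 1.5], 3))
print(pred)   # [3.5, 4.5]
```

With this interface the model takes one argument and returns a plain prediction, so the training loop does not need to know about the tuple or the wrapper object.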

And so if you're not working with something as beautifully flexible as miniai, you can

always do this, you know, to make, to replace your model so that it has the interface that

you need it to have. So now again, we do the same as

Tanishq did and create the callback. And now when we create the model, we'll

reuse our UNet class, which we just created. I wanted to see if I can make things faster.

I tried dividing all of Tanishq's channels by two and I found it worked just as well.

One thing I noticed is that it uses group norm in the U-Net, which we have briefly learned

about before and in group norm, it splits the channels up into a certain number of groups.

And I needed to make sure that those groups had more than one thing in them.

So you can actually pass in how many groups you want to use in the normalization.

So that's what this is for. You gotta be a little bit careful of these

things. I didn't think of it at first, and I think the num groups might've

been 32 and I got an error saying you can't split 16 things into 32 groups.
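The arithmetic behind that error can be sketched like this. `channels_per_group` is a hypothetical helper, not a library function; PyTorch's `nn.GroupNorm` enforces a similar rule that the channel count must be divisible by the number of groups. The channel counts are illustrative.

```python
def channels_per_group(num_channels, num_groups):
    """Return how many channels each normalisation group would contain."""
    if num_channels % num_groups != 0:
        raise ValueError(
            f"cannot split {num_channels} channels into {num_groups} groups")
    return num_channels // num_groups

print(channels_per_group(32, 8))    # 4: four channels share statistics
print(channels_per_group(32, 32))   # 1: each channel is normalised on its own
try:
    channels_per_group(16, 32)      # halved channels, unchanged group count
except ValueError as e:
    print("error:", e)
```

The middle case runs without an error, which is why it is easy to miss: every group holds a single channel, so nothing is shared across channels.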

But it also made me realize that, actually, even in Tanishq's, you probably had 32 channels in

the first block with 32 groups. And so maybe the group norm

wouldn't have been working as well. So there are little subtle things to look out for.

So now that we're not using anything inherited from TrainCB, that means we either need to

use TrainCB itself or just use our TrainLearner, and everything else is the same as what

Tanishq had. So then I wanted to look at the results of

noisify here and we've seen this trick before, which is we call fit, but don't call the training

part of the fit and use the SingleBatchCB callback that we created way back

when we first created Learner. And now learn.batch will contain the tuple

of tuples, which we can then show using that same trick.

So I mean, obviously we'd expect it to look

the same as before, but it's nice. I always like to draw pictures

of everything all along the way. Because very, very often..

I mean, the first six or seven times I do anything, I do it wrong.

So given that I know that I might as well draw a picture to try and see how it's wrong

until it's fixed. It also tells me when it's not wrong.

TANISHQ: Isn't there a show_batch function now that does something similar?

JEREMY: Um, yes, you wrote that show_image_batch, didn't you?

I can't quite remember. Yeah.

We should, uh, remind ourselves how that worked.

That's a good point. Thanks for the reminder.

Okay. So then, um, I'll just go ahead and

do the same thing that Tanishq did. And um, uh, but then the next thing I looked

at was I looked at the, you know, how am I going to make this train faster?

I want a bigger, I want a higher learning rate. Um, and I realized, oddly enough, the diffusers

code does not initialize anything at all. They use the defaults, um, which just goes

to show like even, you know, the experts at Hugging Face that don't necessarily really

think like, oh, maybe the PyTorch defaults aren't, you know, perfect for my model.

Of course they're not because they depend on what activation function do you have and

what res blocks do you have and so forth. Um, so I wasn't exactly sure how to initialize it.

Um, I, um, partly by chatting to, um, Kat Crowson, who's the author of K-diffusion,

um, and partly by looking at papers and partly by thinking about my own experience, I ended

up doing a few things. One is I did do the thing that we talked about

a while ago, which is to take every second convolutional layer and zero it out.

You could do the same thing using batch norm, which is what we tried.

And since we've got quite a deep network, you know, that seemed like it might help.

It helps basically by having the non-identity paths in the ResNets do nothing at

first so they can't cause problems. Um, we haven't talked about, um, orthogonalized

weights before, and we probably won't because you would need to take our, um, computational

linear algebra course to learn about that, which is a great course, Rachel

Thomas did a fantastic job of it. I highly recommend it, but I don't want to

make it a prerequisite, but, um, Kat mentioned, she thought that using orthogonal weights

for the downsamplers was a good idea. Um, and then, well, the up_blocks, they also

set the second convs to zero and something Kat mentioned, she found useful, which is

also from, um, I think it's from the Dhariwal Google paper is to also zero out the

weights of basically the very last layer. Um, and so it's going to start by predicting

zero as the noise, which is, you know, something that can't hurt.

Um, so that was, that's how I initialized the weights.
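Here is a tiny dependency-free sketch of why zeroing those layers is safe. It uses small dense layers on a list of numbers rather than convolutions, and all the names (`res_block`, `W1`, `W2_zero`, `W_out_zero`) and weight values are made up for illustration; it is not the lesson's init code. Orthogonal initialization is not shown.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def res_block(x, W1, W2):
    """Identity path plus a two-layer residual path."""
    residual = matvec(W2, relu(matvec(W1, x)))
    return [a + r for a, r in zip(x, residual)]

x = [0.5, -1.0, 2.0]
W1 = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.5], [0.7, 0.1, 0.2]]  # arbitrary values
W2_zero = [[0.0] * 3 for _ in range(3)]                     # zeroed second layer

# With the second layer zeroed, the block passes its input through unchanged.
block_out = res_block(x, W1, W2_zero)
print(block_out == x)   # True

# Zeroing the final layer makes the initial prediction all zeros.
W_out_zero = [[0.0] * 3]
final_out = matvec(W_out_zero, block_out)
print(final_out)        # [0.0]
```

So at the start of training every residual block acts as an identity, and the network's first guess for the noise is zero.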

Um, so I call init_ddpm on my model. Uh, something that I found made a huge difference is I replaced

the normal Adam optimizer with one that has an epsilon of 1e-5; the default,

I think, is 1e-8. And so to remind you, this is:

when we divide by the exponentially weighted moving average of the squared gradients,

if that's a very, very small number, um, then it makes

the effective learning rate huge. And so we add this to it to make it not too huge.

And it's nearly always a good idea to make this bigger than the default.

I don't know why the default is so small. And I found, until I did this, anytime I tried

to use a reasonably large learning rate somewhere around the middle of the 1-Cycle

training, it would explode. Um, uh, so that makes a big difference. Um, so this way, yeah.

Uh, I could train, I could get 0.016 after 5 epochs.
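To make the epsilon point concrete, here is a small arithmetic sketch of a single Adam step. The moment values are invented for illustration, and `adam_update` is a hypothetical helper rather than any library's API. In PyTorch the corresponding setting is the `eps` argument of `torch.optim.Adam`.

```python
import math

def adam_update(m_hat, v_hat, lr, eps):
    """Size of one Adam step given the bias-corrected moment estimates."""
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 1.0        # so the result reads as a multiple of the learning rate
m_hat = 1e-6    # average gradient (illustrative value)
v_hat = 1e-16   # average squared gradient (illustrative); its sqrt is 1e-8

small_eps = adam_update(m_hat, v_hat, lr, eps=1e-8)
large_eps = adam_update(m_hat, v_hat, lr, eps=1e-5)
print(f"eps=1e-8: step is {small_eps:.1f} x lr")
print(f"eps=1e-5: step is {large_eps:.2f} x lr")
```

With these made-up numbers the smaller epsilon lets the step grow to roughly fifty times the learning rate, while the larger epsilon keeps it around a tenth of it.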

Um, and then sampling, so it looks all pretty similar.

We've got some pretty nice textures, I think. So then I was thinking, how do I get faster?

So one way we can make it faster is we can take advantage of, um, something called

mixed precision. Um, so currently we're using

32 bit floating point values. Um, that's the defaults and

also known as single precision. And um, GPUs are pretty fast at doing 32 bit

floating point values, but they're much, much, much, much faster during 16

bit floating point values. So I'm 16 bit floating point values. I'm able to represent a very, you know, wide

range of numbers or much precision at the difference between numbers.
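Those two limits, range and precision, can be checked with Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision. `to_half` and `fits_in_half` are hypothetical helper names for this sketch.

```python
import struct

def to_half(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

def fits_in_half(x):
    """True if x is within the finite range of half precision."""
    try:
        struct.pack("e", x)
        return True
    except (OverflowError, struct.error):
        return False

print(to_half(1.0001))        # 1.0: the small difference is lost
print(to_half(1e-8))          # 0.0: too small, underflows to zero
print(fits_in_half(65504.0))  # True: the largest finite half value
print(fits_in_half(70000.0))  # False: out of range
```

Small values such as tiny gradients collapsing to zero is one reason you can't just cast everything to 16 bit.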

And so they're quite difficult to use, but if you can, you'll get a huge benefit because,

um, modern GPUs, modern NVIDIA GPUs specifically, have special units that do matrix multiplies

of 16 bit values extremely quickly. Um, you can't just cast everything to 16 bit

because then there's not enough precision to calculate gradients and stuff properly. So we have to use something

called Mixed Precision. Um, depending on how enthusiastic I'm feeling,

I guess we ought to do this from scratch as well.

Um, we'll, we'll see. Um, we do have an implementation from scratch

cause we actually implemented this before NVIDIA implemented it, um, in

an earlier version of fastai. Um, um, anyway, we'll see.

So basically the idea is that we use 32 bit for things where we need 32 bit and we use

16 bit for things where 16 bit is enough. Um, so that's what we're going to do: we're going to use this mixed precision.

Um, but for now we're going to use, um, NVIDIA's, you know, semi-automatic or fairly automatic

code to do that for us. Actually we had a slight change of plan at

this point when we realized, uh, this lesson was going to be over three hours in length

and we should actually split it into two. So we're going to wrap up this lesson here and

we're going to, um, come back and implement this mixed precision thing in Lesson 20.

So we'll see you then.
