JEREMY: Okay.
Hi everybody. And this is Lesson 19 with extremely
special guests, Tanishq and Johno. Hi guys.
How are you? TANISHQ: Hello.
JOHNO: Hey Jeremy. Good to be here.
JEREMY: And it's New Year's Eve, 2022, finishing off 2022 with a bang, or at least a really cool
lesson. And most of this lesson is
going to be Tanishq and Johno, but I'm going to start with a quick
update from the last lesson. What I wanted to show you is that Christopher
Thomas on the forum came up
with a better winning result for our challenge, the Fashion-MNIST challenge,
which we are tracking here. And be sure to check out this forum
thread for the latest results. And he found that he was able to
get better results with Dropout. Then Piotr on the forum
noticed I had a bug in my code. And the bug in my code for ResNets, actually
I won't show you, I'll just tell you, is that in the ResBlock, I was not passing
along the BatchNorm parameter. And as a result, all the results
I had were without BatchNorm. So then when I fixed BatchNorm and added
Dropout at Christopher's suggestion, I got better results still.
And then Christopher came up with a better Dropout and got better results still for 50
epochs. So let me show you the improvement to 93.2
for 5 epochs. I won't show the change to BatchNorm because
that's actually, that'll just be in the repo now.
So the BatchNorm is already fixed. So I'm going to tell you about what
Dropout is and then show that to you. So Dropout is a simple but powerful idea where
what we do with some particular probability, so here that's a probability of 0.1,
we randomly delete some activations. And when I say delete, what I actually
mean is we change them to zero. So one easy way to do this is to
create a binomial distribution object where the probabilities
are 1-p and then sample from that. And that will give you a 0.1 probability.
So in this case, oh, this is perfect. I have exactly one 0.
Of course, randomly, that's not always going to be the case.
But since I asked for 10 samples and 0.1 of the time it should be zero, I so happened
to get, yeah, exactly one of them. And so if we took a tensor like this
and multiplied it by our activations, that will set about
a 10th of them to zero because multiplying by zero gives you zero.
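As a rough sketch of the mask trick just described (not the notebook's exact cell; the stand-in activations `acts` are made up for illustration), sampling from a binomial distribution with one trial and probability 1-p gives a tensor of ones and zeros that can be multiplied into the activations:

```python
import torch
from torch import distributions

p = 0.1  # dropout probability used in the lesson

# One trial per element: each sample is 1 with probability 1-p (keep)
# and 0 with probability p (drop).
dist = distributions.binomial.Binomial(total_count=1, probs=1 - p)
mask = dist.sample((10,))   # ten independent keep/drop decisions

acts = torch.randn(10)      # stand-in activations (illustrative only)
dropped = acts * mask       # wherever mask is 0, the activation becomes 0
```

Because the mask is random, the number of zeros changes from run to run; on average about one in ten entries is dropped.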
So here's a Dropout class. So you pass it and you say what probability
of Dropout there is, store it away. Now we're only going to do
this during training time. So at evaluation time, we're not
going to randomly delete activations. But during training time, we will
create our binomial distribution object. We will pass in the 1-p probability. And then you say, how many
binomial trials do you want to run? So how many coin tosses or dice
rolls or whatever each time? And so it's just one.
And this is a cool little trick. If you put that one onto your accelerator, you
know, GPU or MPS or whatever, it's actually going to create a binomial
distribution that runs on the GPU. That's a really cool trick that
not many people know about. And so then if I sample and I make a sample
exactly the same size as my input, then that's going to give me a bunch of ones and zeros
in a tensor the same size as my activations. And then another cool trick: this is going
to result in activations that are on average about one tenth smaller.
So if I multiply by 1/(1-0.1), so in this case 1/0.9, then that's going
to scale them back up to undo that difference. JOHNO: Jeremy.
JEREMY: Yeah. JOHNO: In the line above where you have
probs equals 1-p, should that be 1-self.p? JEREMY: Oh, it absolutely should.
Thank you very much, Johno. Not that it matters too much because, yeah, you can always just use nn.Dropout at this
point and I only have to use 0.1, which is why I didn't even see that.
So as you can see, I'm not even bothering to export this because I'm just showing how
to repeat what's already available in PyTorch. So yeah, thanks, Johno.
That's a good fix. Yeah, so if we're in evaluation mode,
it's just going to return the original. If p=0, then these are all
going to be just ones anyway. So we'll be multiplying by 1 divided
by 1, so there's nothing to change. So with p of 0, it does nothing in effect.
Yeah, and otherwise it's going to kind of zero out some of our activations.
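Putting the pieces together, a minimal sketch of a Dropout module along the lines described, with Johno's `1-self.p` fix applied, might look like this. It is a reconstruction from the description rather than a copy of the notebook, and in practice `nn.Dropout` does the same job:

```python
import torch
from torch import nn, distributions

class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p  # probability of zeroing each activation

    def forward(self, x):
        # At evaluation time, return the input untouched.
        if not self.training:
            return x
        # Creating total_count on x's device makes sampling happen on that device.
        dist = distributions.binomial.Binomial(
            total_count=torch.tensor(1.0, device=x.device), probs=1 - self.p)
        mask = dist.sample(x.size())      # ones and zeros, same shape as x
        # Divide by (1-p) so the expected activation is unchanged.
        return x * mask / (1 - self.p)
```

With p=0 the mask is all ones and the scale factor is 1, so the layer does nothing, as described above. This sketch does not guard against p=1, which would divide by zero.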
So a pretty common place to add dropout is before your last linear layer.
So that's what I've done here. So yeah, if I run the exact same epochs, I
get 93.2, which is a very slight improvement. And so the reason for that is that it's not
going to be able to kind of memorize the data or the activations, you know, because
there's a little bit of randomness. So it's going to force it to try to identify
just the actual underlying differences. There's a lot of different
ways of thinking about this. You can almost think of it as a bagging
thing, a bit like a random forest. You know, it's each time it's giving a
slightly different kind of random subset. Yeah, but that's what it does.
I also added a Dropout2d layer right at the start, which is not particularly common.
I was just kind of like showing it. This is also how Christopher Thomas tried
it, although he didn't use Dropout2d. What's the difference between
Dropout2d and Dropout? So this is actually something I'd like you
to implement yourself as an exercise: implement Dropout2d.
The difference is that with Dropout2d, rather than using x.size()
as the shape of our tensor of ones and zeros, so in other words, potentially dropping
out every single batch, every single channel, every single x, y independently.
Instead, we want to drop out an entire kind of grid area, all of the channels together.
So if any of them are zero, then they're all zero. So you can look up the docs for Dropout2d for
more details about exactly what that looks like.
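One way to build a test for the exercise is to first look at what PyTorch's own `nn.Dropout2d` does. In PyTorch, the unit that gets zeroed is a whole channel (an entire height-by-width feature map) for a given item in the batch. The check below is only a sketch of such a test, with made-up tensor sizes; the same check could then be pointed at a from-scratch version:

```python
import torch
from torch import nn

torch.manual_seed(0)
drop2d = nn.Dropout2d(p=0.5)
drop2d.train()                      # dropout is only active in training mode

x = torch.ones(4, 8, 5, 5)          # (batch, channels, height, width)
y = drop2d(x)

# Collapse each feature map to its min and max. If whole maps are dropped
# together, min == max for every map: either all zeros or all 1/(1-p).
flat = y.flatten(2)                 # (batch, channels, height*width)
per_map_min = flat.min(dim=2).values
per_map_max = flat.max(dim=2).values
all_or_nothing = bool((per_map_min == per_map_max).all())
```

Using an input of all ones makes the result easy to read: every surviving value is exactly 1/(1-p), here 2.0.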
But yeah, so the exercise is to try and implement that from scratch and come up with a way to
test it. So like actually check that
it's working correctly, because it's a very easy thing to think that
it's working and then realize it's not. So then, yeah, Christopher Thomas actually
found that if you remove this entirely and only keep this, then you end up
with better results for 50 epochs. And so he's the first to break 95%.
So I feel like we should insert some kind of animation or trumpet sounds or something
at this point. I'm not sure if I'm clever enough to do that
in the video editor, but I'll see how I go. Hooray!
Okay. So that's about it for me.
Did you guys have any other things to add about Dropout, how to understand it or what
it does or interesting things? Oh, I did have one more thing before.
But you go ahead if you've got anything to mention.
JOHNO: So I was going to ask just because I think the standard is to set it, like remove the
dropout before you do inference. But I was wondering if there's anyone you
know of, or if it works to use it for some sort of test time augmentation.
JEREMY: Oh, dude! Thank you.
Because I wrote a callback for that. Did you see this or are you just like (JOHNO:
no), okay, just a test time dropout callback. Nice.
So yeah, before_epoch: if you remember, in
Learner, we put it into training mode,
every individual layer into training mode. So that's why for the module itself, we can
check whether that module's in training mode. So what we can actually do is after that's
happened, we can then go back in this callback and apply a lambda that says
if this is a Dropout, then… wait, this is, yeah, then put
it in training mode all the time, including at evaluation.
And so then you can run it multiple times, just like we did for TTA, but with this callback.
Now that's very unlikely to give you a better result because it's not kind of showing it
different versions or anything like that, like TTA does that are kind of meant to be
the same. But what it does do is it gives
you a sense of how confident it is. If it kind of has no idea, then that little
bit of dropout's quite often going to lead to different predictions.
So this is a way of kind of doing some kind of confidence measure.
You'd have to calibrate it by, kind of, looking at things
that it should be confident about and not confident about and seeing how
that dropout, test time dropout changes. But the basic idea, it's been
used in medical models before. I wouldn't say it's totally popular, which
is why I didn't even bother to show it being used, but I just want to add it here because
I think it's an interesting idea and maybe could be more used than it is, or at
least more studied than it has been. A lot of stuff that gets used in the medical
world is less well known out in the rest of the world.
So maybe that's part of the problem. Cool.
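The callback itself is not reproduced in the transcript, so here is a plain PyTorch sketch of the same idea, without miniai. The helper name `enable_test_time_dropout` and the toy model are assumptions made for illustration: put the model in evaluation mode, switch only the dropout layers back to training mode, run several passes, and look at how much the predictions move.

```python
import torch
from torch import nn

def enable_test_time_dropout(model):
    """Hypothetical helper: eval mode everywhere, except dropout layers stay active."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()
    return model

torch.manual_seed(0)
# Toy classifier, only here to make the sketch runnable.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 10))
enable_test_time_dropout(model)

x = torch.randn(5, 20)
with torch.no_grad():
    # 20 stochastic passes over the same inputs -> (passes, batch, classes)
    preds = torch.stack([model(x).softmax(dim=-1) for _ in range(20)])

mean_pred = preds.mean(dim=0)   # averaged prediction
spread = preds.std(dim=0)       # larger spread suggests less confidence
```

As noted above, the spread is only a raw signal; it would need calibrating against cases the model should and should not be confident about.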
All right. So I will stop my sharing and we're going to
switch to Tanishq, who's going to do something much more exciting, which is to show that
we are now at a point where we can do DDPM from scratch or at least
everything except the model. And so to remind you, DDPM doesn't have the
latent VAE thing and we're not going to do conditional. So it's not going to be like, we're not
going to get to tell it what to draw. And the U-Net model itself is the
one bit we're not going to do today. We're going to do that next lesson.
But, but other than the U-Net, it's going to be unconditional DDPM from scratch.
So Tanishq, take it away. Okay.
Hi, welcome back. Sorry for the slight continuity problem.
You may notice people look a little bit different. That's because we had some Zoom issues.
So a couple of days have passed and we're back again.
And Johno already recorded his bit before we did Tanishq's bit, and then we're going
to post them in backwards. So hopefully there's not too many confusing
continuity problems as a result and it all goes smoothly, but it's time to turn
it over to Tanishq to talk about DDPM. TANISHQ: So we've reached the point where we have
this miniai framework and I guess it's time to now start using it to build more,
I guess, sophisticated models. And as we'll see here, we can start putting
together a diffusion model from scratch using the miniai library, and we'll see
how it makes our life a lot easier. And also it'd be very nice to see how the
equations in the papers correspond to the code.
I have here, of course, the notebook
that we'll be working from, and the paper, the diffusion model
paper, “Denoising Diffusion Probabilistic Models”, which
was published in 2020. It was one of the original diffusion model
papers that set off the entire trend of diffusion models and is a good starting point
as we delve into this topic further. And also I have some diagrams and
drawings that I will also show later on. But yeah, basically let's just get started
with the code here and of course the paper. So just to provide some context with this
paper, this paper was published from this group in UC Berkeley, I think a few of
them have gone on now to work at Google. And this is Pieter Abbeel, he
has a big lab at UC Berkeley. And so diffusion models were actually originally
introduced in 2015, but this paper in 2020 greatly simplified the diffusion models and
made it a lot easier to work with and got these amazing results as you can see here
when they trained on faces and in this case CIFAR-10 and this really was very, kind
of a big leap in terms of the progress of diffusion
models. And so just to kind of briefly
provide, I guess, kind of an overview. JEREMY: If I could just quickly step
in and mention something, which is, when we started this course, we
talked a bit about how perhaps the diffusion part of diffusion models is not actually all
that important. Everybody's been talking about
diffusion models because that's, particularly because that's
the open source thing we have that works really well.
But this week, actually a model that appears to be quite a lot better than Stable Diffusion
was released that doesn't use diffusion at all. Having said that, the basic ideas, like most
of the stuff that Tanishq talks about today, will still appear in some kind of form,
but a lot of the details will be different. But strictly speaking, actually, I don't even
know if we've got a word anymore for the kind of like modern generative
model things we're doing. So in some ways, when we're talking about
diffusion models, you should maybe replace it in your head with some other word, which
is more general and includes this paper that Tanishq is looking at here.
JOHNO: Iterative Refinement, perhaps? That's what I'd like.
JEREMY: Yeah, that's not bad, iterative refinement.
I'm sure by the time people watch this video, probably, you know, somebody will have decided
on something. We will keep our course website up to date.
TANISHQ: Yeah. Yeah.
This is the paper that Jeremy was talking about and yeah, every week there seems to
be another state of the art model. But yeah, like Jeremy said, a lot of the
principles are the same, but the details can be different
for each paper. And I just want to again, also, like Jeremy
was saying, zoom back a little bit and talk a little bit about what, just to provide
a review of what we're trying to do here. So let me just right next to him here.
Yeah. So with this task, we were trying to, in this
case, we're trying to do image generation. Of course, it could be other forms of
generation, like text generation or whatever. And the general idea is that of
course we have some data points. In this case, we have some images of dogs
and we want to produce more like the data points that we're given.
So in this case, maybe the dog image generation or something like this.
And so the overall idea that a lot of these approaches take for some sort of generative
modeling task is they try to learn something about p(x), which is basically the
likelihood of a data point x.
So let's say x is some image. Then p(x) tells us what is the probability
that you would see that image in real life. And we can take a simpler example, which may
be easier to think about, of a one-dimensional data point like height, for example. And if we were to look at height, of course
we know we have a data distribution that's kind of a bell curve.
And you have maybe some mean height, which is something like 5'9", 5'10".
I guess 5'10", or something like that, or 5'9", whatever.
And then of course we have some more unlikely points that are still possible.
Like for example, we have 7'8", or we have something that's maybe not as likely, which
is like 3', or something like this. JEREMY: So here the X axis is height, and the
Y axis is the probability of some random person you meet being that tall.
TANISHQ: Exactly. So this is basically the probability.
And so of course you have this sort of peak, which is where you have higher probability.
And so those are the sorts of values that you would see more often.
So this is what we would call our p(x). And the important part about p(x) is that
you can use this now to sample new values if you know what p(x) is, or if you have
some sort of information about p(x). So for example, here you can think of, if
you were to say, maybe have some, let's say you have some game and you have some human
characters in the game, and you just want to randomly generate a height for this human
character, you wouldn't want to of course select a random height between 3 and 7,
that's kind of uniformly distributed. You would instead maybe want to have the height
dependent on this sort of function, where you would more likely sample values in the
middle and less likely sample these sorts of extreme points.
So it's dependent on this function p(x). So having some information about p(x)
will allow you to sample more data points. And so that's kind of the overall goal of
generative modeling is to get some information about p(x) that then allows us to sample
new points and create new generations. So that's kind of a high level kind of description
of what we're trying to do when we're doing generative modeling.
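To make the game-character example concrete, here is a small sketch comparing the two ways of picking a height. The bell curve's mean of 69 inches (5'9") follows the example above; the standard deviation of 3 inches is an assumption chosen purely for illustration.

```python
import torch

torch.manual_seed(0)
n = 10_000

# Assumed p(x): a bell curve over height in inches.
mean_in, std_in = 69.0, 3.0          # 5'9" mean; the 3" spread is made up
from_px = torch.randn(n) * std_in + mean_in

# The naive alternative: every height between 3' and 7' equally likely.
low_in, high_in = 36.0, 84.0
uniform = torch.rand(n) * (high_in - low_in) + low_in

# Fraction of generated characters within 6 inches of the mean height.
near_px = ((from_px - mean_in).abs() < 6).float().mean().item()
near_uniform = ((uniform - mean_in).abs() < 6).float().mean().item()
```

Sampling from p(x) puts most characters near typical heights, while the uniform sampler produces implausibly short and tall characters just as often as typical ones.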
And of course there are many different approaches. We have our famous GANs, which used to be the
common method back in the day before diffusion models.
We have VAEs, which I think we'll probably talk a
little bit more about that later as well.
JEREMY: We'll be talking about both of those techniques later.
Yeah. TANISHQ: Yeah.
So there are many different other techniques. There are also some niche techniques
that are out there as well. But of course now the popular ones are these
diffusion models or, as we talked about, maybe a better term might be iterative
refinement or whatever the term ends up being. But yeah, so there are many different techniques.
And yeah, so this is kind of the general diagram that shows what diffusion models are.
And if we can look at the paper here, which let's pull up the paper.
Yeah, you see here, this is the sort of, they call it directed graphical model.
It's a very complicated term. It's just kind of showing
what's going on in this process. There's a lot of complicated math here, but
we'll highlight some of the key variables and equations here.
So basically the idea is that, okay, so let's see here.
This is an image that we want to generate, right? And so x0 is basically, these are
actually the samples that we want. So we want to, x0 is what we want to generate.
And these would be, yeah, these are images. And we start out with pure noise.
So that's what xT is, pure noise. And the whole idea is that we have two processes.
We have this process where we're going from pure noise to our image.
And we have this process from our image to pure noise.
So the process where we're going from our image to pure noise, this is called the forward
process. Forward. Sorry, my
handwriting is not so good on this. So hopefully it's clear enough.
Let me know if it's not. So we have the forward process, which
is mostly just used for training. Then we also have our reverse process.
This is the reverse process, which I will write up here.
Reverse process. JEREMY: So this is a bit of a summary, I guess,
of what you and Wasim talked about in Lesson TANISHQ: And just, it's just mostly to highlight
now what are the different variables as we look at the code and see the
different variables in the code. JEREMY: Okay, so we'll be focusing today on the
code, but the code will be referring to things by name and those names won't make sense very
much unless we see what they're used for in the math.
Okay. TANISHQ: Yeah.
And I won't dive too much into the math. I just want to focus on these sorts of
variables and equations that we see in the code. So basically the general idea is that
we do these in multiple different steps. We have here from time step 0 all the way to
time step uppercase T. And so there's some fixed number of steps, but then we have this
intermediate process where we're going from some particular time step.
We have this time step lowercase t, which is some noisy image.
And yes, we're transitioning between these two different noisy images.
So we have this, what is sometimes called the transition.
We have this one here. This is like something that's
called the transition kernel or yeah, whatever it is, it basically
is just telling us, you know, how do we go from, you know, one, in this case, we're going
from a less noisy image to a more noisy image and then going backwards, it's going from
a more noisy image to a less noisy image. So let's look at the equations.
JEREMY: So the forward direction is really easy: to make something more noisy,
you just add a bit more noise to it. And the reverse direction is incredibly difficult,
and in particular to go from the far left to the far right is strictly
speaking impossible because none of that person's face
exists anymore. But somewhere in between you could certainly
go from something that's partially noisy to less noisy by a learned model.
TANISHQ: Exactly. And that's like one of the little things I've
done right now in terms of, you know, in terms of I guess the symbols and the math. So yeah, basically I'm just trying to pull
out the, just to write down the equations here.
So we have, let me zoom in a bit. So we have our two, let's see here.
q of xt given x(t minus 1)... Or actually, you know what, maybe it's
just better if I just snipped it from here. So the one that is going from our
forward process is this equation here. So I'll just make that a
little smaller for you guys. So right there.
So that is going, and basically to explain, we have this sort of script, a little bit
of a, maybe a little bit confusing notation here, but basically this is referring to a
normal distribution or a Gaussian distribution. And this is just saying, okay, this is a Gaussian
distribution that's describing this particular variable.
So it's just saying, okay, N is our normal or Gaussian
distribution, and it's representing this variable x of t, or x, sorry,
xt. And then we have here is the mean, and this is
the variance. So just to again, clarify, I think we've talked
about this before as well, but this is a, this is of course a bad drawing of a Gaussian,
but our mean is just, our mean is just this, the middle point here is the mean, and the
variance just kind of describes the sort of spread of the Gaussian distribution.
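The snipped equation is not visible in the transcript. For reference, the single forward step in the DDPM paper (its Equation 2) has this form, with the mean and variance in the positions just described:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)
```

Here the second argument, \sqrt{1-\beta_t}\,x_{t-1}, is the mean and the third, \beta_t \mathbf{I}, is the variance.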
So if you think about it a little further, you have this beta, which is one of the
important variables that kind of describes the diffusion process, beta-t.
So you'll see the beta-t in the code. And basically beta-t increases as t increases.
So basically your beta-t will be greater than your beta-(t minus 1).
So if you think about that a little bit more carefully, you can see that, okay, so at
t-1, at this time point here, and then you're going to the next time point, you're going
to increase your beta-t. So you're increasing the variance, but then you have this
1 - beta-t and take the square root of that and multiply it by x(t minus 1).
So as your t is increasing, this term actually decreases.
So your mean is actually decreasing and you're getting less of the original image,
because the original image is going to be part of x(t minus 1).
JEREMY: And just to let you know, Tanishq, we can't see your pointer.
So if you want to point at things, you would need to highlight them or something.
TANISHQ: So yeah, I'll just, let's see. Yeah.
Basically, I wasn't pointing at anything in particular. I was just saying that, yeah, basically
if we have our xt here, as the time step increases, you're getting less
contribution from your x(t minus 1). And so that means your mean is going towards zero.
And so you end up with a mean of 0 and the variance keeps increasing and basically
you just have a Gaussian distribution and you lose any contribution from the original
image as your time step increases. So that's why when we start out from
x0 and go all the way to our xT here, this becomes pure noise.
It's because we're doing this iterative process where we keep adding noise.
We lose that contribution from the original image and that leads to the image having pure
noise at the end of the process. JEREMY: Something I find useful here is to
consider one extreme, which is to consider x1. So at x1, the mean is going to
be the square root of (1 - beta-1), times x0. The reason that's interesting
is x0 is the original image. So we're taking the original
image and at this point 1 - beta-1 will be pretty
close to 1. So at x1, we're going to have something
where the mean is very close to the image and the variance will be very small.
And so that's why we will have an image that just has a tiny bit of noise.
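The behaviour at both extremes can be checked numerically. The sketch below applies the forward step repeatedly to a toy "image" whose pixels all equal 2. The linear schedule from 1e-4 to 0.02 over 1000 steps is the one used in the DDPM paper; treating it as the schedule here is an assumption, since the notebook's values have not been shown at this point.

```python
import torch

torch.manual_seed(0)
T = 1000
beta = torch.linspace(1e-4, 0.02, T)     # assumed schedule (DDPM paper values)

x0 = torch.full((10_000,), 2.0)          # toy "image": every pixel equals 2
x = x0.clone()
means, stds = [], []
for t in range(T):
    noise = torch.randn_like(x)
    # One forward step: shrink the previous image slightly, add a little noise.
    x = (1 - beta[t]).sqrt() * x + beta[t].sqrt() * noise
    means.append(x.mean().item())
    stds.append(x.std().item())

# means[0] and stds[0] describe x1; means[-1] and stds[-1] describe xT.
```

After one step the pixel mean is still almost 2 with a tiny spread, which is the "tiny bit of noise" case. After all the steps the mean is close to 0 and the spread is close to 1, which is pure noise.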
TANISHQ: Right, right. And then another thing that sometimes it's
easier to write out is, in this case, you can write out q(xt)
directly, because q(xt) is only
dependent on x(t minus 1), and then x(t minus 1) is only
dependent on x(t minus 2), and the noise added at each of these steps is independent.
So based on the different laws of probability, you can get your q(xt) in closed form.
So, yeah, that's what's shown here: q of xt given the original image.
So this is also another way of kind of seeing this more clearly where you can see that.
Anyways, so I'm going back here. So this is another way to see here more directly.
So this is, of course, our clean image. And this is our noisy image.
And so you can also see again, now alpha bar t is dependent on beta t.
Basically it's like the cumulative product of one minus beta. JEREMY: I mean, we'll see
the code for it, I guess. So maybe.
TANISHQ: Yes, yes. So it might be clear to see that this
is alpha bar t or something like this. But basically, basically the idea is that
alpha bar t is going to be, again, less. This is what is going to be
less than alpha bar t minus one. So basically alpha, this keeps decreasing, right?
This decreases as time step increases. And on the other hand, this is going to
be increasing as time step increases. But again, you can see the contribution from
the original image decreases as time step increases while the noise, as shown by the
variance, is increasing while the time step is increasing.
Anyway, so that hopefully clarifies the forward process.
And then the reverse process is basically a neural network, as Jeremy had mentioned.
And yeah, screenshot this.
That's our reverse process. And basically the idea is, well, this is a
neural network and this is also a neural network. Neural network.
And we learn it during the training of the model. But the nice thing about this particular diffusion
model paper that made it so simple was actually, we completely ignored this and actually set
it to constants just based on, you know, big numbers.
JEREMY: We can't see what you're pointing at. So I think it's important to
mention what this is here. TANISHQ: This term here.
So this one, we just kind of ignore and it's just a constant dependent on beta-t.
So you only have one neural network that you need to train, which is basically referring
to this mean. And the nice thing about this diffusion
model paper is that it also reparameterizes the mean into this easier form, where
you do a lot of complicated math, which we'll not
get into here.
training objective where, let's see here. Yeah, you see the simplified training objective.
You instead have this epsilon-theta function. And let me just screenshot that again. This is our loss function that we train
and we have this epsilon-theta function. You can see it's a very
simple loss function, right? This is just a, let me just write this down.
This is just an MSE loss. And we have this epsilon-theta function here.
That is our- JEREMY: …maybe for those of us who are less mathy, it
might not be obvious that it's a simple thing, because it looks quite complicated
to me, but once we see it in code, it'll be simple.
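For reference, the simplified training objective from the DDPM paper (its Equation 14) is the following. It is a squared error between the true noise and the network's prediction of it:

```latex
L_{\text{simple}}(\theta) =
\mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\,
\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\|^2
\Big]
```

The argument passed to \epsilon_\theta is just the noisy image x_t, so the whole expression says: noise an image, ask the network what noise was added, and penalise the squared difference.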
TANISHQ: Yes, yes. Basically you're just doing like, and
you'll see it in code how simple it is. But this is like just an MSE loss.
So we've seen MSE loss before, but you'll see how, yeah, this is basically MSE. So the nice, so just to kind of take a step
back again, what is this epsilon-theta? Because this is like a new thing that
like seems a little bit confusing. Basically epsilon, you can see here, basically,
yeah, so this here is saying, this is actually equivalent to this equation here.
These two are equivalent. This is just another way of saying that,
because basically it's saying, that's xt. So this is giving xt just in a different way.
But epsilon is actually drawn from this normal distribution with a
mean of 0 and a variance of 1. And then you have all these scaling terms
that change the mean to be the same as this equation that we have over here.
So this is our xt. And so what epsilon is, it's actually the noise that we're adding
to our image to make it into a noisy image. And what this neural network is doing
is trying to predict that noise. So what this is actually doing is this is
actually a noise predictor, and it is predicting the noise in the image.
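The equivalence between the two ways of writing xt can be sketched in code. The helper name `noisify`, the image sizes, and the linear schedule (the DDPM paper's values) are assumptions for illustration, not the notebook's code:

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = (1 - beta).cumprod(dim=0)      # cumulative product of (1 - beta)

def noisify(x0, t, eps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, per item in the batch."""
    a = alpha_bar[t].reshape(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

torch.manual_seed(0)
x0 = torch.rand(4, 1, 28, 28)              # stand-in clean images
t = torch.tensor([0, 100, 500, 999])       # a different timestep per image
eps = torch.randn_like(x0)                 # the noise, drawn from N(0, 1)
xt = noisify(x0, t, eps)

# If the noise were predicted perfectly, the noising could be undone exactly.
a = alpha_bar[t].reshape(-1, 1, 1, 1)
x0_recovered = (xt - (1 - a).sqrt() * eps) / a.sqrt()
```

This is why predicting the noise is enough: given xt and a good estimate of epsilon, an estimate of the clean image follows directly.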
And why is that important? Basically the general idea is,
if we were to think about our distribution of data, let's
just think about it in a 2D space. Just here, each data point here represents an
image, and they're in this blob area, which represents a distribution.
So this is in-distribution, and this is out of the distribution.
Out-of-distribution. And basically the idea is that, okay, if we
take an image and we want to generate some random image, if we were to take a random
data point, it would most likely be a noisy image, right?
So if we just generate some random data point,
it's going to be just noise.
But we want to keep adjusting this data point to make it look more like an image from your
distribution. That's kind of the whole idea of this iterative
process that we're doing in our diffusion model.
So the way to get that information is actually to take images from your dataset and actually
add noise to it. So that's what we try to do in this process.
So we have an image here and we add noise to it. And then what we do is we try to train
a neural network to predict the noise. And by predicting the noise and subtracting
it out, we are going back to the distribution. So adding the noise takes you away from the
distribution and then predicting the noise brings you back to the distribution.
So then if we know at any given point in this space how much noise to remove, that tells
you how to keep going towards the data distribution and get a point that
lies within the distribution. So that's why we have noise prediction.
And that's the importance of doing this noise prediction: to be
able to then do this iterative process, where we can start out at a random point,
which would be, for example, pure noise and keep predicting and removing that noise
and walking towards the data distribution. Okay.
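The "keep predicting and removing noise" loop can be sketched with the sampling rule from the DDPM paper (its Algorithm 2, using beta-t as the sampling variance). Everything else here is a toy assumption: the "dataset" is a single point `target`, and `oracle_eps` computes the exact noise for that dataset in place of a trained network.

```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha = 1 - beta
alpha_bar = alpha.cumprod(dim=0)

target = torch.tensor([0.5, -0.3, 0.8])   # toy "data distribution": one single point

def oracle_eps(xt, t):
    """Exact noise for a one-point dataset; stands in for the trained network."""
    return (xt - alpha_bar[t].sqrt() * target) / (1 - alpha_bar[t]).sqrt()

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)                # start from pure noise, x_T
    for t in reversed(range(T)):
        # Fresh noise at every step except the last one.
        z = torch.randn(shape) if t > 0 else torch.zeros(shape)
        eps_hat = eps_model(x, t)
        # Remove a little of the predicted noise, then rescale.
        mean = (x - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
        x = mean + beta[t].sqrt() * z
    return x

torch.manual_seed(0)
out = sample(oracle_eps, target.shape)
```

With a perfect noise predictor, the walk from random noise lands on the data point. A real model only approximates the noise, so it lands somewhere within the learned distribution instead.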
Okay. So yeah, let's get started with the code.
And so here we of course have our imports and we're going to load our dataset.
We're going to work with our Fashion-MNIST dataset, which is what we've been working with
for a while already. And yeah, this is just basically the same
code that we've seen from before in terms of loading the dataset. And then we have our model.
So we want to remove the noise from an image. So our model is
going to take in the previous image, the noisy image, and predict the noise.
So the shapes of the input and the output are the same.
They're going to be in the shape of an image. So what we use is we use a U-Net neural
network, which takes in kind of an input image. JEREMY: And we do see your
pointer now, by the way. So feel free to point at things.
TANISHQ: Yeah. So yeah, it takes in an input image.
And in this case, a U-Net is the purpose, but they can also
be used for any sort of image to image task, where we're going from an input
image and then outputting some other image of some sort.
And we'll talk about... JEREMY: So this is a new architecture, which we
haven't learned about yet, and we will be learning about in the next lesson.
But broadly speaking, those gray arrows going from left to right are a lot like ResNet,
very much like ResNet skip connections. But they're being used in a different way.
Everything else is stuff that we've seen before. So it's basically, we can pretend
those don't exist for now. It's a neural network that the output is the
same size or a similar size to the input. And therefore you can use it to learn how
to go from one image to a different image. TANISHQ: Yeah.
So that's what the U-Net is. And yeah, like Jeremy said,
we'll talk about it more. The sorts of U-Nets that are used for diffusion
models also tend to have some additional tricks, which again, we'll talk
about them later on as well. But yeah, for the time being, we will just
import a U-Net from the Diffusers library, which is the Hugging Face
library for diffusion models. So they have a U-Net implementation
and we'll just be using that for now. And so, yeah, of course,
JEREMY: strictly speaking, we're cheating at this point because we're
using something we haven't written from scratch, but we're only cheating temporarily because
we will be writing it from scratch. TANISHQ: Yeah.
And yeah, so, and then of course we're working with one channel images, our Fashion-MNIST
images are one channel images. So we just have to specify that.
And then of course the channels of the different blocks within the U-Net are also specified.
And then let's go into the training process. So basically the general idea of course
is we want to train with this MSE loss. What we do is we select a random timestep
and then we add noise to our image based on that timestep.
So of course, if we have a very high timestep, we're adding a lot of noise.
If we have a lower timestep, then we're adding very little noise.
So we're going to randomly choose a timestep. And then, yeah, we add the
noise accordingly to the image. And then we pass the noisy image
to a model as well as the timestep. And we are trying to predict the amount of
noise that was in the image and we predict it with the MSE loss.
So we can see all the... JEREMY: I have some pictures of some of these
variables I could share if that would be useful. So I have a version.
So I think Tanishq is sharing notebook number 15. Is that right?
And I've got here notebook number 17. And so I took Tanishq's notebook and just,
as I was starting to understand it, I like to draw pictures for myself
to understand what's going on. So I took the things which are in Tanishq's
class and just put them into a cell. So I just copied and pasted them, although
I replaced the Greek letters with English written out versions.
And then I just plotted them to see what they look like.
So in Tanishq's class, he has this thing called beta, which is just linspace.
So that's just literally a line. So beta, there's going to be a thousand of
them and they're just going to be equally spaced from 0.0001 to 0.02.
And then there's something called sigma, which is the square root of that.
So that's what sigma is going to look like. And then he's also got alphabar, which
is a cumulative product of 1 minus this. And there's what alphabar looks like.
So you can see here, as Tanishq was describing earlier, that when t is higher, this is t
on the x-axis, beta is higher, and when t is higher, alphabar is lower.
So yeah, so if you want to remind yourself, so each of these things, beta, sigma, alphabar,
they're each, they've each got a thousand things in them.
And this is the shape of those thousand things. So this is the amount of variance,
I guess, added at each step. This is the square root of that.
So it's the standard deviation added at each step. And then if we do 1 minus that,
it's just the exact opposite. And then this is what happens if you
multiply them all together up to that point. And the reason you do that is because if you add
noise to something, you add noise to something that you add noise to something that you add
noise to something, then you have to multiply together all that amount of noise
to say how much noise you would get. So yeah, those are my pictures, if that's helpful.
TANISHQ: Yep. Yep.
Good to see the diagram or see how the actual values and how it changes over time.
So yeah. Let's see here.
Sorry. Yeah.
So like Jeremy was showing, we have our linspace for our beta.
In this case, we're using kind of more of the Greek letters.
So you can see the Greek letters that we see in the paper as well as...
Now we have it here in the code as well. And we have our linspace from our
minimum value to our maximum value. And we have some number of steps.
So this is the number of timesteps. So here we use a thousand timesteps, but
that can depend on the type of model that you're training.
And that's one of the parameters of your model or hyperparameters of your model.
JEREMY: And this is the callback you've got here. So this callback is going to be used to set
up the data, I guess, so that you're going to be using this to add the noise so that the
model's then got the data that we're trying to get it to learn to then denoise.
TANISHQ: Yeah. So the callback of course makes life a lot
easier in terms of, yeah, setting up everything and still being able to use, I guess, the
miniai learner with maybe some of these more complicated and maybe a little
bit more unique training loops. So yeah, in this case, we're just able to
use the callback in order to set up the batch that we are passing into our learner.
JEREMY: I just want to mention, when you first did this, you wrote out the Greek letters in English,
alpha and beta and so forth. And at least for my brain, I was finding it
difficult to read because they were literally going off the edge of the page
and I couldn't see it all at once. And so we did a search and replace to
replace it with the actual Greek letters. I still don't know how I feel about it.
I'm finding it easier to read because I can see it all at once.
I don't have to scroll and I don't get overwhelmed.
But when I need to edit the code, I kind of just tend to copy
and paste the Greek letters, which is why we use the actual word beta in
the init parameter list so that somebody using this never has to type a Greek letter.
But I don't know, Johno or Tanishq, if you had any thoughts over the last week or two,
since we made that change about whether you guys like having the Greek letters in there
or not. JOHNO: I like it for this demo in particular.
I don't know that I do this in my code, but because we're looking back and forth between
the paper and the implementation here, I think it works in this case just fine.
TANISHQ: Yeah, I agree. I think it's good for when you're trying to
study something or try to implement something, having the Greek letters is very useful to be
able to, I guess, match the math more closely and it's just easy just to pick the equation
and put it into code or, vice versa, look at the code and try
to match it to the equation. So I think for educational purpose, I
tend to like, I guess, the Greek letters. So yeah. Yeah, so we have our initialization where
we're just defining all these variables. We'll get to the predict in just a moment, but
first I just want to go over the before_batch where we're setting up our
batch to pass into the model. So remember that the model is taking
in our noisy image and the timestep. And of course the target is the actual amount
of noise that we are adding to the image. So basically we generate that noise.
So that's what… JEREMY: So epsilon is that target.
So epsilon is the amount of noise, not the amount of, is the actual noise.
TANISHQ: Yes. Epsilon is the actual noise that we're adding.
And that's the target as well because our model is a noise predicting model.
It's predicting the noise in the image. And so our target should be the noise
[unintelligible] that we're adding to the image during
training. So we have our epsilon and we're just
generating it with this random function. The random normal distribution
with a mean of 0, variance of 1. So that's what that's doing and adding
the appropriate shape and device. Then the batch that we get originally
will contain the clean images. These are the original images from our dataset.
So that's x0. And then what we want to
do is we want to add noise. So we have our alphabar and we have
a random timestep that we select. And then we just simply follow that equation,
which I can, I'll just show in just a moment. JEREMY: That equation, you can
make a tiny bit easier to read. I think if you were to double click on that
first alphabar underscore t, cut it and then paste it, sorry, in the xt equals torch dot
square root, take the thing inside the square root, double click it and paste
it over the top of the word torch. That would be a little bit easier to read. And then you'll do the same for the next one.
TANISHQ: There we go.
Put those parentheses. Yep.
TANISHQ: Yeah, so basically, yeah, so yeah, I
guess I'll just pull up the equation. So let's see, there's then, so there's a section
in the paper that has the nice algorithm. See if I can find it.
No, no, here. It's, I think earlier.
Yes, training. Right, so this, we're just following these
same sort of training steps here, right? Where we select a clean image
that we take from our data set. This fancy kind of equation here is just saying,
take an image from your data set, take a random timestep between this range.
Then this is our epsilon that we're getting, just saying, get some epsilon value.
And then we have our equation for Xt, right? This is the equation here.
You can see that it is square root of alphabar t times x0 plus square root of one minus
alphabar t times epsilon. So that's the same equation
that we have right here, right? And then what we need to do is we
need to pass this into our model. So we have xt and t. So we
set up our batch accordingly. So this is the two things
that we pass into our model. And of course we also have our
target, which is our epsilon. And so that's what this is showing here.
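
A minimal sketch of that batch setup in plain Python, assuming a single image stored as a flat list of floats rather than a batch of tensors. The function name `noisify` is borrowed from later in the lesson; the tiny schedule and "image" are made up for illustration.

```python
import math
import random

def noisify(x0, alphabar, rng=random):
    """Return ((x_t, t), eps) for one clean image x0 given as a flat list of floats."""
    t = rng.randrange(len(alphabar))            # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # target noise: mean 0, variance 1
    a = alphabar[t]
    # x_t = sqrt(alphabar_t) * x0 + sqrt(1 - alphabar_t) * eps
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return (xt, t), eps

# Tiny stand-in schedule and "image" (both hypothetical).
alphabar = [0.99, 0.5, 0.01]
x0 = [0.0, 0.5, -0.5, 1.0]
(xt, t), eps = noisify(x0, alphabar, random.Random(0))
print(t, [round(v, 3) for v in xt])
```

The return value mirrors the batch layout described above: the model inputs `(xt, t)` come first as a tuple, and the target `eps` comes second.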
We passed in our Xt as well as our t here, right? And we pass that into a model.
The model is represented here as epsilon theta. And theta is often used to represent, like, this
is a neural network with some parameters and the parameters are represented by theta.
So epsilon theta is just representing our noise predicting model.
So this is our neural network. So we have passed in our Xt and our t into
a neural network, and we are comparing it to our target here,
which is the actual epsilon. And so that's what we're doing here.
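
Written out, the noising equation and the comparison just described look roughly like this. The notation follows the DDPM paper's training algorithm; the squared-error form is what the MSE loss computes, up to averaging over pixels and the batch.

```latex
\begin{align}
x_t &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I) \\
L   &= \bigl\lVert \epsilon - \epsilon_\theta(x_t, t) \bigr\rVert^2
\end{align}
```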
We have our batch where we have our xt and t and epsilon.
And then here we have our prediction function. And because we actually have, I guess in this
case, we have two things that are in a tuple that we need to pass into our model.
So we just kind of get those elements from our tuple with this.
We get the elements from the tuple, pass it into the model, and then Hugging Face has
its own API in terms of getting the output. So you'd need to call .sample in order
to get the predictions from your model. So we just do that.
And then we do, we have learn.preds and that's what's going to be used later then when we're
trying to do our loss function calculation. JEREMY: So the, just so, I mean, it's just worth
looking at that a little bit more since we haven't quite seen something like this before.
And it's something which I'm not aware of any other framework that would let you do
this, you know, literally replace how prediction works.
And miniai is kind of really fun for this. So because you're inherited from TrainCB,
TrainCB has predict already defined and you've defined a new version.
So it's not going to use the TrainCB version anymore.
It's going to use your version. And what you're doing is instead of passing
learn.batch[0] to the model, you're, you've got a * in front of it.
So the key thing is that *, and we know that learn.batch[0] actually
has two things in it, because the learn.batch that you set at the end of the before_batch
method has two things in learn.batch[0]. So that star will unpack them and send
each one of those as a separate argument. So our model needs to take two things, which
the diffusers U-Net does take two things. So that's the main interesting point.
And then something I find a bit awkward honestly, about a lot of Hugging Face stuff, including
diffusers is that generally their models don't just return the result, but they put it inside
some name. And so that's what happens here.
They put it in something inside something called sample.
So that's why Tanishq added .sample at the end of the predict because of this somewhat
awkward thing, which Hugging Face like to do for some reason.
But yeah, now that you know, I mean, this is something that people often get stuck on.
I see on Kaggle and stuff like that. It's like, how on earth do I use these models?
Because they take things in weird forms and they give back things with weird forms.
Well, this is how. You know, if you inherit from TrainCB, you
can change predict to do whatever you want, which I think is quite sweet.
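
To make the mechanics concrete, here is a minimal sketch with hypothetical stand-in classes. It is not miniai or Diffusers code: `Learner`, `TrainCB`, `FakeUNet` and `FakeOutput` are simplified placeholders that only imitate the behaviour described above.

```python
class FakeOutput:
    """Stand-in for a model output object that wraps the result in .sample."""
    def __init__(self, sample): self.sample = sample

class FakeUNet:
    """Stand-in for a model that takes two arguments: noisy image and timestep."""
    def __call__(self, xt, t): return FakeOutput(f"noise_pred(xt={xt}, t={t})")

class Learner:
    """Minimal holder for the attributes the callbacks touch."""
    def __init__(self, model): self.model, self.batch, self.preds = model, None, None

class TrainCB:
    """Simplified default: pass the first batch element straight to the model."""
    def predict(self, learn): learn.preds = learn.model(learn.batch[0])

class DDPMCB(TrainCB):
    """Override predict: unpack (xt, t) with * and pull the result out of .sample."""
    def predict(self, learn): learn.preds = learn.model(*learn.batch[0]).sample

learn = Learner(FakeUNet())
learn.batch = (("xt", 7), "eps")   # ((noisy image, timestep), target noise)
DDPMCB().predict(learn)
print(learn.preds)
```

With the default `predict`, the two-argument model would receive a single tuple and fail; the overridden version is the only thing that changes.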
TANISHQ: Yep. So yeah, that's the training loop.
And then of course you have your regular training loop that's implemented in miniai where you
are going to have. Yeah.
So you have your loss function calculation, I mean, and the predictions, learn.preds.
And of course the target is our learn.batch[1], which is our epsilon.
So we have those and we pass it into the loss function.
It calculates the loss function and does the back propagation.
So I'll just go over that. We'll get back to the sampling in just a moment.
But just to show the training loop. JEREMY: Most of this is copied from
our, I think it's 14_augment notebook, the way you've got the
tmax and the sched. The only thing I think you've added
here is the DDPM callback, right? TANISHQ: Yes.
The DDPM callback. JEREMY: And you change the loss function.
TANISHQ: Yes. So basically we have to initialize our DDPM
callback with the appropriate arguments. So like the number of timesteps and
the minimum beta and maximum beta. And then yeah, obviously, and then of course
we're using an MSE loss as we talked about. It just becomes a regular training
loop and everything else is from before. Yeah.
So we have your scheduler, your progress bar, all of that we've seen before.
JEREMY: I think that's really cool that we're using basically the same code to train a diffusion model as we've used to train a
classifier just with one extra callback. TANISHQ: Yeah.
Yeah. Yeah.
That's why I think callbacks are very powerful for allowing us to do such things.
It's like pretty, you can take all this code and now
we have a diffusion training loop and we can just call learn.fit and yeah,
you can see we got a nice training loop, nice loss curve.
We can save our model with the torch saving functionality
to be able to save our model and we could
load it in. But now that we have our trained model, then
the question is, what can we do to use it to sample the dataset?
So the basic idea of course was that we have, like basically we're here, right?
We have, let's see here. Okay.
So we have, the basic idea is that we start out with a random data point and of course that's
not going to be within the distribution at first, but now we've learned how to move from
one point towards the data distribution. That's what our noise prediction,
predicting function does. It basically tells you how, you know,
in what direction and how much to, so the basic idea
is that, yeah, I guess I'll start from maybe a new drawing here.
Again, we have, distribution is, and we have a random point and we use our noise predicting
model that we have trained to tell us which direction to move.
So it tells us some direction. Or I guess what's the exact, other area.
Okay. So like here, okay.
So it tells us some direction to move. At first that direction is not going to be
like, you cannot follow that direction all the way to get the correct data point.
Because basically what we were doing is we're trying to reverse the path that we were following
when we were adding noise. So like, cause we had originally data point
and we kept adding noise to the data point and maybe, you know, it
followed some path like this. And we want to reverse that path to get to.
So our noise predicting function will give us an original direction, which, you know,
would be, you know, some kind of, it's going to be kind of tangential to the actual path
at that location. So what we would do is, you know, we would maybe
follow that data point all the way towards, you know, we're just going to
keep following that data point. You know, we're going to try to predict the
fully denoised image by following this noise prediction.
But our fully denoised image is also not going to be a real image.
So what we, so let me, I'll show an example of that over here in the paper, where they
show this a little bit more carefully. So x0, it's there.
So basically you can see the different data...
You can see the different data points here.
It's not going to look anything like our real image.
So you can see all these points, you know, it doesn't look anything… what we would do
is we actually add a little bit of noise back to it.
And we start, we have a new point where then we could maybe estimate a better, get a better
estimate of which direction to move, follow that all the way again, we follow a new point.
And then I can add back a little bit of noise. You get a new estimate, you make a new estimate
of, you know, this noise prediction and removing the noise, you know, follow that
all the way again, completely, and add a little bit of noise again to the
image and converge onto an image. So that's kind of what we're showing here.
JEREMY: That's a lot like SGD, with SGD we don't take the gradient and jump all the way.
We use a learning rate to go some of the way because each of those estimates of where we
want to go is, you know, not that great, but we just do it slowly.
TANISHQ: Exactly. And at the end of the day, that's what
we're doing with this noise prediction. We are predicting the sort of gradient of
this p(x), but of course we need to keep making estimates of that
gradient as we're progressing. So we have to keep evaluating our noise prediction
function to get updated and better estimates of our gradient in order to
finally converge onto our image. So and then you can see that here, you know, we
have this, maybe this fully predicted denoised image which at the beginning doesn't look
anything like a real image, but then as we continue throughout the sampling
process, we finally converge on something that looks
like an actual image. Again, these are CIFAR-10 images and it's
still a little bit maybe unclear about how realistic these images, these very small images
look, but that's kind of the general principle I would say.
And so that's what I can show in the code. This idea of we're going to start out
basically with a random image, right? And this random image is going to be like
a pure noise image and it's not going to be part of the data distribution.
You know, it’s not anything like a real image, it's just a random image.
And so this is going to be our x, I guess, x uppercase t [x_t], right?
That's what we start out with. And we want to go from x
uppercase t all the way to x0. So what we do is we go through each of the
timesteps and we create, we have to put it in this sort of batch format because
that's what our neural network expects. So we just have to format it appropriately.
And we'll get to z in just a moment. I'll explain that in just a moment, but of
course we just take it have similar… alphabar, betabar, which is getting
those variables that we need. JEREMY: And we faked beta bar because
we couldn't figure out how to type it. So we used bbar instead.
TANISHQ: Yeah. So yeah, so yeah, in, yeah, we, yeah, we were
pretty able to get betabar to work, I guess. But anyway, at each step, what we're trying
to do is to try to predict what direction we need to go.
And that direction is given by our noise predicting model, right?
So what we do is we pass in x_t and our current timestep into our model.
And we get this noise prediction and that's the direction that we need to move in.
So basically we take x_t. We first attempt to completely remove the noise, right?
That's what this is doing. That's what x_0_hat is.
That's completely removing the noise. And of course, as we said, that estimate
at the beginning won't be very accurate. And so now what we do is we have some coefficients
here where we have a coefficient of how much that we keep about this estimate of
our denoised image and how much of the original noisy
image we keep. And on top of that, we're going
to add in some additional noise. So that's what we do here.
We have x_0_hat. And so, and we multiply by its coefficient
and we have x_t we multiply it by some coefficient and we also add some additional noise.
That's what the z is. It's just-
JEREMY: That's basically a weighted average of the two plus the noise…
TANISHQ: Exactly. And then the whole idea is that as we get
closer and closer to a timestep equal to 0, our estimate of x0 will be more and more accurate.
So our x0_coeff will get closer to 1 as we go through the process, and our xt_coeff
will get closer and closer to 0. So basically we're going to be weighting more
and more of the x_0_hat estimate and less and less of the x_t as we're getting
closer and closer to our final timestep. And so at the end of the day, we will
have our estimated generated image. So that's kind of an overview
of the sampling process. So yeah.
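
A minimal sketch of one such step in plain Python, working on a flat list of pixel values. The coefficient formulas are an assumption here: they are the posterior-mean weights from the DDPM paper, with betabar taken to mean 1 minus alphabar, and the helper names `coeffs` and `sample_step` are hypothetical.

```python
import math
from itertools import accumulate
from operator import mul

n_steps = 1000
beta = [0.0001 + i * (0.02 - 0.0001) / (n_steps - 1) for i in range(n_steps)]
alpha = [1 - b for b in beta]
alphabar = list(accumulate(alpha, mul))

def coeffs(t):
    """Weights on x_0_hat and x_t for the step from t to t-1 (DDPM posterior mean)."""
    abar_t = alphabar[t]
    abar_prev = alphabar[t - 1] if t > 0 else 1.0   # alphabar "before step 0" is 1
    x0_coeff = math.sqrt(abar_prev) * beta[t] / (1 - abar_t)
    xt_coeff = math.sqrt(alpha[t]) * (1 - abar_prev) / (1 - abar_t)
    return x0_coeff, xt_coeff

def sample_step(xt, noise_pred, t, z):
    """One reverse step: estimate x_0, then mix it with x_t and add fresh noise z."""
    abar_t = alphabar[t]
    # First fully remove the predicted noise to get the estimate x_0_hat.
    x0_hat = [(x - math.sqrt(1 - abar_t) * n) / math.sqrt(abar_t)
              for x, n in zip(xt, noise_pred)]
    x0_coeff, xt_coeff = coeffs(t)
    sigma_t = math.sqrt(beta[t])
    # Weighted mix of x_0_hat and x_t, plus a little noise (z should be zeros at t == 0).
    return [x0_coeff * h + xt_coeff * x + sigma_t * zi
            for h, x, zi in zip(x0_hat, xt, z)]

# The weight shifts from x_t to x_0_hat as t goes down.
for t in (999, 500, 100, 0):
    x0_coeff, xt_coeff = coeffs(t)
    print(f"t={t:3d}  x0_coeff={x0_coeff:.4f}  xt_coeff={xt_coeff:.4f}")
```

In a full sampler this step would sit inside a loop from the last timestep down to 0, with the trained model supplying `noise_pred` each time.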
So yeah, basically the way I implemented it here was I had the sample function that's
part of our callback and it will take in the model and the kind of shape that you want
for your images that you're producing. So like if you want to specify how many images
you produce, that's going to be part of your batch size or whatever.
And you'll just see that in a moment. But yeah, it's just part of the callback.
So then we basically have our DDPM callback and then we could just call the sample method
of our DDPM callback and we pass in our model. And then here you can see we're going to produce,
for example, 16 images and it just has to be a 1 channel image of shape
32 by 32 and we get our samples. And one thing I forgot to note was that I
am collecting the x_t at each timestep. So the predictions here, you can see
that there are a thousand of them. We want the last one because
that is our final generation. So we want the last one and that's what we should-
JEREMY: They are not [sad] actually. TANISHQ: Yeah.
So this is- JEREMY: We've come a long way since DDPM.
So this is, like, slower and less great than it could be.
But considering that, except for the U-Net, we've done this from scratch, you know, actually
from matrix multiplication, I think those are pretty decent.
TANISHQ: Yeah. And we're only trained for about five epochs.
It took like, you know, maybe like four minutes to train this model, something like that.
It's pretty quick. And this is what we get with very little training.
And it's pretty decent. You can see, of course, some clear shirts
and shoes and pants and whatever else. JEREMY: Yeah.
And you can see fabric and it's got texture and things have buckles and-
TANISHQ: Yeah. JEREMY: You know, something to compare, like, we
did generative modeling in the first time we did Part 2 back in the days when Wasserstein
GAN was just new, which was actually created by the same guy that created
PyTorch or one of the two guys, Soumith. And we trained for hours and hours and hours
and got things that I'm not sure were any better than this.
So things have come a long way. TANISHQ: Yeah.
Yeah. And of course, then, yeah, so we can see then
like how this sampling progresses over time, over the multiple timesteps.
So that's what I'm showing here because I collected, during the sampling process, we
are collecting at each timestep what that estimate looks like.
And you can kind of see here. And so this is an estimate of the
noisy image over the timesteps. Oops.
And I guess I had to pause. Yeah.
You can kind of see. But you'll notice that actually, so we actually,
what we did is like, okay, so we selected an image, which is like the ninth image.
So that's, that's this image here. So we're looking at this image
particularly here and we're going over. Yeah.
We have a function here that's showing the i-th timestep during the sampling process of
that image. And we're just getting the images.
And what we are doing is we're only showing basically from timestep 800 to a 1,000.
And here we're just, we're just having it like where it's like, okay, we're looking
at like maybe every 5 steps and we're going from 800 to 990.
And this way it makes it visually easier to see the transition.
But what you'll notice is I didn't start all the way from 0.
I started from 800. And the reason we do that is because actually
between 0 and 800 there's very little change in terms of like, it's just mostly a noisy image.
And it turns out, as I make a note of here, that it's actually
a limitation of the noise schedule that is used in the original DDPM paper.
And especially when applied to some of these smaller images, when we're working with images
of like size 32 by 32 or whatever. And so there are some other papers like the
improved DDPM paper that propose other sorts of noise schedules.
And what I mean by noise schedule is basically how beta is defined basically.
So we had this definition of torch.linspace for our beta, but people have different ways
of defining beta that lead to different properties.
So things like that, people have come up with different improvements and those sorts of
improvements work well when we're working with these smaller images.
And basically the point is like, if we are working from 0 to 800 and it's just mostly
just noise that entire time, we're not actually making full use of all these timesteps.
So it would be nice if we could actually make full use of those time steps and actually
have it do something during that time period. So all these, there are some papers that examine
this a little bit more carefully and it would be kind of interesting for maybe some of you
folks to also look at these papers and see if you can try to implement those sorts of
models with this notebook as a starting point. And it should be a fairly simple change in
terms of like noise schedule or something like that.
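
As one possible starting point for that exercise, here is a sketch of the cosine schedule proposed in the improved DDPM paper, in plain Python. This is not from the notebook: the offset `s=0.008` and the clip at 0.999 are that paper's defaults as best recalled, and the function names are hypothetical.

```python
import math

def cosine_alphabar(t, n_steps, s=0.008):
    """alphabar as a squared-cosine curve, normalised so that alphabar(0) == 1."""
    def f(u):
        return math.cos((u / n_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_betas(n_steps, max_beta=0.999):
    """Per-step betas implied by the cosine alphabar curve, clipped at max_beta."""
    return [min(1 - cosine_alphabar(t + 1, n_steps) / cosine_alphabar(t, n_steps), max_beta)
            for t in range(n_steps)]

n_steps = 1000
beta = cosine_betas(n_steps)

# alphabar falls off more gradually than with the linear schedule,
# so fewer of the timesteps are spent on almost pure noise.
for t in (0, 250, 500, 750, 999):
    print(f"t={t:4d}  beta={beta[t]:.5f}  alphabar={cosine_alphabar(t, n_steps):.4f}")
```

Swapping this in would mean replacing the linspace definition of beta and recomputing sigma and alphabar from it; whether it actually helps on Fashion-MNIST is something to test rather than assume.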
JEREMY: So I actually think this is the start of our next journey, which is our previous journey
was going from being totally rubbish at Fashion-MNIST classification
to being really good at it. I heard you say now we're like a little bit
rubbish at doing Fashion-MNIST generation. And yeah, I think we should all now work from
here over the next few lessons and so forth and people trying things at home and all of
us trying to make better and better generative models, initially of Fashion-MNIST and hopefully
we'll get to the point where we're so good at that, that we're like, oh, this is too easy.
And then we'll pick something harder. And eventually that'll take us to Stable Diffusion and beyond.
I imagine. That's cool.
I got some stuff to show you guys. If you're interested, I tried to better understand
what was going on in Tanishq's notebook and tried doing it in a thousand different ways
and also see if I could just start to make it a bit faster.
So that's what's in notebook 17, which I will share.
So we've already seen the start of notebook 17. Well, the one thing I did just do is just
drew a picture for myself, partly just to remind myself what they,
what the real ones look like. And they definitely have more detail than
the samples that Tanishq was showing. But they're not, you know, they're just 28 by 28.
I mean, they're not super amazing images and they're just black or white.
So even if we're fantastic at this, they're never going to look great because we're using
a small, simple data set. As you always should, when you're doing
any kind of R&D or experiments, you should always use a small and simple data set up
until you're so good at it that it's not challenging
anymore. And even then when you're exploring new ideas,
you should explore them on small, simple data sets first.
Yeah. So after I drew the various things, what I
like to do is one thing I found challenging about working with your class Tanishq is I
find when stuff is inside a class, it's harder for me to explore.
So I copied and pasted it, the before_batch contents and called it noisify.
And so one of the things that's fun to do that is it forces you to figure out what are
the actual parameters to it. And so now that I, rather than putting in the
class, now that I've got all of my, you know, various things to do with, so these are the
three parameters to the DDPM callback's __init__.
And then these things we can calculate from that. So with those then actually
all we need is yeah, what's the image that we're going to
noisify and then what's the, what's the alphabar, which I mean, we can get from here, but
it'd sort of be more general if you can pass in your alphabar.
So yeah, this is just copying and pasting from the class, but the nice thing is then
I could experiment with it. So I can call noisify on my first 25 images
and with, with a random t, each one's got a different random t, and so I can print out
the t and then I could actually use those as titles.
And so this lets me, I thought this is quite nice. I might actually rerun this cause actually
none of these look like anything because as it turns out in this particular
case, all of the t's are over 200. And as Tanishq mentioned, once you're over
200, it's almost impossible to see anything. So let me just rerun this
and see if we get a better, there we go.
There's a better one. So with a t of 7, right?
So remember t naught, t equals naught is the pure image.
So t equals 7, it's just a slightly speckledy image.
And by 67, it's a pretty bad image. And by 94, it's very hard
to see what it is at all. And by 293, maybe I can see a pair of pants.
I'm not sure I can see anything. So yeah.
By the way, there's a handy little, so we've, I think
we've looked at map before in the course. There's an extended
version of map in fastcore. And one of the nice things is you can pass
it a string and it basically just calls this format string if you pass it a
string rather than a function. And so this is going to stringify
everything using its representations. This is how I got the titles
out of it just by the way. So yeah, I found this useful to be
able to draw a picture of everything. And then I wanted to, yeah, look at what,
what else can I do? So then,
you won't be surprised to see, I took the sample method and
turned that into a function. And I actually decided to
pass everything that it needs. Even, I mean, you could actually
calculate pretty much all of these. But I thought since I've calculated
them before, just pass them in. So this is all copied and
pasted from Tanishq's version. And so that means the callback now is tiny, right? Because before_batch is just noisify and the
sample method just calls the sample function. Now, what I did do is I decided just to, yeah,
I wanted to try like as many different ways of doing this as possible.
Partly it's an exercise to help everybody like see all the different ways we can work
with our framework, you know? So I decided not to inherit from TrainCB,
but I instead I inherited from Callback. So that means I can't use Tanishq's
nifty trick of replacing predict. So instead I now need some way to pass in
the two parts of the first element of the tuple as separate things to the
model and return the sample. So how else could we do that?
Well what we could do is we could actually inherit from UNet2DModel, which is what
Tanishq used directly, UNet2DModel, and we could replace the model.
And so we could replace specifically the forward function.
That's the thing that gets called. And we could just call the original forward
function, but rather than passing an x we’re passing a *x, and rather than
returning that, we'll return that .sample. Okay.
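
A minimal sketch of that idea using hypothetical stand-ins so it runs without Diffusers. `BaseUNet` plays the role of UNet2DModel here and `Output` imitates the object whose `.sample` attribute holds the result; neither is the real class.

```python
class Output:
    """Stand-in for the output object; the real one exposes the tensor as .sample."""
    def __init__(self, sample): self.sample = sample

class BaseUNet:
    """Stand-in for the library model (UNet2DModel in the lesson): takes x and t separately."""
    def forward(self, x, t): return Output({"x": x, "t": t})
    def __call__(self, *args): return self.forward(*args)

class UNet(BaseUNet):
    """Same model, but forward takes one tuple and returns the bare result."""
    def forward(self, x): return super().forward(*x).sample

model = UNet()
batch = (([0.1, 0.2], 7), "eps")   # ((noisy image, timestep), target noise)
preds = model(batch[0])            # no custom predict needed any more
print(preds)
```

The trade-off is where the adaptation lives: here it moves out of the training callback and into the model, so any training loop that calls `model(batch[0])` works unchanged.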
So if we do that, then we don't need the TrainCB anymore and we don't need the predict.
And so if you're not working with something as beautifully flexible as miniai, you can
always do this, you know, to make, to replace your model so that it has the interface that
you need it to have. So now again, we did the same as
Tanishq had of create the callback. And now when we create the model, we'll
reuse our UNet class, which we just created. I wanted to see if I can make things faster.
I tried dividing all of Tanishq's channels by two and I found it worked just as well.
One thing I noticed is that it uses group norm in the U-Net, which we have briefly learned
about before and in group norm, it splits the channels up into a certain number of groups.
And I needed to make sure that those groups had more than one thing in them.
So you can actually pass in how many groups you want to use in the normalization.
So that's what this is for. You've got to be a little bit careful of these
things. I didn't think of it at first, and I think the num groups might've
been 32, and I got an error saying you can't split 16 things into 32 groups.
But it also made me realize that even in Tanishq's version, you probably had 32 channels in
the first block with 32 groups. And so maybe the group norm
wouldn't have been working as well. So there are little subtle things to look out for.
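The group-count issue can be sketched with a small check. The helper `channels_per_group` is hypothetical and only mirrors the kind of divisibility test that PyTorch's `GroupNorm` performs; in diffusers the relevant setting is, I believe, called `norm_num_groups`, but treat that name as an assumption.

```python
def channels_per_group(num_channels: int, num_groups: int) -> int:
    """Return how many channels land in each group, or raise if the
    channels cannot be split evenly into that many groups."""
    if num_channels % num_groups != 0:
        raise ValueError(
            f"cannot split {num_channels} channels into {num_groups} groups")
    return num_channels // num_groups

for channels, groups in [(32, 32), (16, 32), (16, 8)]:
    try:
        print(channels, groups, channels_per_group(channels, groups))
    except ValueError as err:
        print("error:", err)
```

The 32 channels with 32 groups case does not raise an error, but each group then holds a single channel, which is the subtle situation mentioned above. Halving the channels to 16 with 32 groups fails outright, while 16 channels with 8 groups gives two channels per group.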
So now that we're not using anything inherited from TrainCB, that means we either need to
use TrainCB itself or just use our TrainLearner, and everything else is the same as what
Tanishq had. So then I wanted to look at the results of
noisify here and we've seen this trick before, which is we call fit, but don't call the training
part of the fit and use the SingleBatchCB callback that we created way back
when we first created Learner. And now learn.batch will contain the tuple
of tuples, which we can then show using that trick.
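Here is a rough sketch of how that single-batch trick works. Everything in it is a simplified stand-in rather than miniai's actual code: `TinyLearner` and `NoisifyCB` are hypothetical, and the "noising" is just a fixed offset, so the only point is to show a callback cancelling the fit after one batch and leaving the processed batch on the learner.

```python
class CancelFitException(Exception):
    """Raised by a callback to stop fitting early."""

class SingleBatchCB:
    # Stop as soon as the first batch has been processed.
    def after_batch(self, learn):
        raise CancelFitException()

class NoisifyCB:
    # Hypothetical stand-in for the DDPM callback: turn each batch into
    # ((noised_input, timestep), target) before the model would see it.
    def before_batch(self, learn):
        x = learn.batch
        learn.batch = ((x + 0.5, 3), 0.5)

class TinyLearner:
    def __init__(self, batches, cbs):
        self.batches, self.cbs = batches, cbs

    def _run(self, name):
        # Call the named hook on every callback that defines it.
        for cb in self.cbs:
            method = getattr(cb, name, None)
            if method is not None:
                method(self)

    def fit(self):
        try:
            for self.batch in self.batches:
                self._run("before_batch")
                # A real learner would predict, compute the loss and step here.
                self._run("after_batch")
        except CancelFitException:
            pass

learn = TinyLearner(batches=[1.0, 2.0, 3.0], cbs=[NoisifyCB(), SingleBatchCB()])
learn.fit()
(xt, t), eps = learn.batch
print(xt, t, eps)
```

After `fit` returns, `learn.batch` still holds the first batch in its tuple-of-tuples form, which is what makes it easy to plot and check.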
So I mean, obviously we'd expect it to look
the same as before, but it's nice. I always like to draw pictures
of everything all along the way, because
the first six to seven times I do anything, I do it wrong.
So given that I know that, I might as well draw a picture to try and see how it's wrong
until it's fixed. It also tells me when it's not wrong.
TANISHQ: Isn't there a show_batch function now that does something similar?
JEREMY: Yes, you wrote that show_image_batch, didn't you?
I can't quite remember. Yeah.
We should remind ourselves how that worked.
That's a good point. Thanks for the reminder.
Okay. So then I'll just go ahead and
do the same thing that Tanishq did. But then the next thing I looked
at was, you know, how am I going to make this train faster?
I want a higher learning rate. And I realized, oddly enough, the diffusers
code does not initialize anything at all. They use the defaults, which just goes
to show that even the experts at Hugging Face don't necessarily
think, oh, maybe the PyTorch defaults aren't perfect for my model.
Of course they're not, because they depend on what activation function you have and
what res blocks you have and so forth. So I wasn't exactly sure how to initialize it.
Partly by chatting to Kat Crowson, who's the author of K-diffusion,
and partly by looking at papers and partly by thinking about my own experience, I ended
up doing a few things. One is I did do the thing that we talked about
a while ago, which is to take every second convolutional layer and zero it out.
You could do the same thing using batch norm, which is what we tried.
And since we've got quite a deep network, that seemed like it might help.
It helps basically by having the non-identity paths in the ResNets do nothing at
first so they can't cause problems. We haven't talked about orthogonalized
weights before, and we probably won't, because you would need to take our computational
linear algebra course to learn about that, which is a great course. Rachel
Thomas did a fantastic job of it. I highly recommend it, but I don't want to
make it a prerequisite. But Kat mentioned
she thought that using orthogonal weights
for the downsamplers was a good idea. And then the up_blocks also
get their second convs set to zero. And something Kat mentioned she found useful, which is
also from, I think it's from the Dhariwal Google paper, is to also zero out the
weights of basically the very last layer. And so it's going to start by predicting
zero as the noise, which is, you know, something that can't hurt.
So that's how I initialized the weights.
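To see why zeroing the second conv makes a residual block start out harmless, here is a toy sketch with scalars standing in for layers. It is not the lesson's `init_ddpm` code; `res_block` and `final_layer` are hypothetical one-number "layers" chosen only to make the arithmetic visible.

```python
def relu(v):
    return max(v, 0.0)

def res_block(x, w1, w2):
    """Scalar toy residual block: identity path plus a two-'conv' path."""
    return x + w2 * relu(w1 * x)

def final_layer(h, w_out, b_out):
    """Scalar toy output layer that predicts the noise."""
    return w_out * h + b_out

x = 1.5
untouched = res_block(x, w1=0.8, w2=0.3)   # second weight non-zero: output differs from x
zeroed = res_block(x, w1=0.8, w2=0.0)      # second weight zeroed: block is the identity
noise_pred = final_layer(zeroed, w_out=0.0, b_out=0.0)  # zeroed last layer predicts 0

print(untouched, zeroed, noise_pred)
```

With the second weight at zero the block passes its input straight through, however deep the stack is, and with the last layer zeroed the very first prediction for the noise is zero. The gradients still reach the zeroed weights, so they move away from zero as training proceeds.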
So I call init_ddpm on my model. Something that I found made a huge difference is I replaced
the normal Adam optimizer with one that has an epsilon of 1e-5. The default,
I think, is 1e-8. And so to remind you, this is where
we divide by the square root of the exponentially weighted moving average of the squared gradients.
When we divide by that, if that's a very, very small number, then it makes
the effective learning rate huge. And so we add this to it to make it not too huge.
And it's nearly always a good idea to make this bigger than the default.
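A small worked example may make the effect of epsilon concrete. This is a sketch of the Adam update size only, assuming bias-corrected moment estimates are already available; `adam_step` is a hypothetical helper, not a library function, and the gradient value is made up for illustration.

```python
import math

def adam_step(m_hat, v_hat, lr, eps):
    """Size of one Adam update given bias-corrected moment estimates."""
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr = 1e-2
g = 1e-7                    # a parameter whose gradients have been tiny
m_hat, v_hat = g, g * g     # moments if the gradient had been constant at g

default = adam_step(m_hat, v_hat, lr, eps=1e-8)
larger = adam_step(m_hat, v_hat, lr, eps=1e-5)

print(f"eps=1e-8: step {default:.2e}, step/gradient {default / g:.0f}")
print(f"eps=1e-5: step {larger:.2e}, step/gradient {larger / g:.0f}")
```

With the small epsilon, a gradient of 1e-7 still produces a step close to the full learning rate, so the step is tens of thousands of times larger than the gradient itself. With epsilon at 1e-5 the same gradient produces a step roughly a hundred times smaller, which is the damping described above.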
I don't know why the default is so small. And I found, until I did this, anytime I tried
to use a reasonably large learning rate, somewhere around the middle of the 1-Cycle
training, it would explode. So that makes a big difference. So this way,
I could get 0.016 after 5 epochs.
And then sampling, so it looks all pretty similar.
We've got some pretty nice textures, I think. So then I was thinking, how do I get faster?
So one way we can make it faster is we can take advantage of something called
mixed precision. So currently we're using
32-bit floating point values. That's the default, and
it's also known as single precision. And GPUs are pretty fast at doing 32-bit
floating point values, but they're much, much, much, much faster at doing 16-bit
floating point values. Now, 16-bit floating point values aren't able to represent a very wide
range of numbers or much precision in the difference between numbers.
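The limits of 16-bit floats can be checked without a GPU. The sketch below uses Python's standard `struct` module, whose `"e"` format is IEEE 754 half precision; `to_fp16` is a hypothetical helper written for this illustration.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

print(to_fp16(1.0001))    # spacing near 1.0 is about 0.001, so this rounds to 1.0
print(to_fp16(2049.0))    # above 2048 only even integers exist, so this rounds to 2048.0
print(to_fp16(1e-8))      # too small to represent: underflows to 0.0
print(to_fp16(65504.0))   # the largest finite half-precision value

try:
    to_fp16(70000.0)      # beyond the largest finite value
except (OverflowError, struct.error) as err:
    print("overflow:", err)
```

Small differences vanish, tiny values such as small gradients become exactly zero, and anything above 65504 cannot be stored at all, which is why casting everything to 16 bit is not enough on its own.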
And so they're quite difficult to use, but if you can, you'll get a huge benefit because
modern GPUs, modern NVIDIA GPUs specifically, have special units that do matrix multiplies
of 16-bit values extremely quickly. You can't just cast everything to 16 bit
because then there's not enough precision to calculate gradients and stuff properly. So we have to use something
called mixed precision. Depending on how enthusiastic I'm feeling,
I guess we ought to do this from scratch as well.
We'll see. We do have an implementation from scratch
because we actually implemented this before NVIDIA implemented it, in
an earlier version of fastai. Anyway, we'll see.
So basically the idea is that we use 32 bit for things where we need 32 bit and we use
16 bit for things where 16 bit is enough. So that's what we're going to do: we're going to use this mixed precision.
But for now we're going to use NVIDIA's, you know, semi-automatic or fairly automatic
code to do that for us. Actually we had a slight change of plan at
this point when we realized this lesson was going to be over three hours in length
and we should actually split it into two. So we're going to wrap up this lesson here and
we're going to come back and implement this mixed precision thing in Lesson 20.
So we'll see you then.