JEREMY: All right, hi gang, and here we are in Lesson 21, joined by the
legends themselves, Johno and Tanishq. Hello. TANISHQ: Hello. JEREMY: And today you'll be shocked to hear
that we are going to look at a Jupyter Notebook. Amazing, right? We're going to look at notebook
22. This is a pretty quick, just, you know, improvement, fairly simple improvement to our
DDPM - DDIM implementation for Fashion-MNIST. And this is all the same so far, but what I've
done is I've made some, one quite significant change and some of the changes we'll be making
today are all about making life simpler. And they're kind of reflecting the way the
papers have been taking things. And it's interesting to see how the papers have not only
made things better, they've made things simpler. And so one of the things that I've noticed in recent papers is
that there's no longer a concept of n steps, which is something we've always had before and
always bothered me a bit, this capital T thing. You know, this t/T, it's basically saying this
is time step number, say 500 out of 1,000, so it's time step 0.5. Why don't I just call
it 0.5? And the answer is, well, we can. So we talked last time about the cosine scheduler.
We didn't end up using it because I came up with an idea which was, you know, simpler and nearly
the same, which is just to change our betamax. But in this next notebook, let's use the cosine
scheduler, but let's try to get rid of the n_steps thing and the capital T thing. So here is
abar again. And now I've got rid of the capital T. So now I'm going to assume that your time step
is between 0 and 1. And it basically represents what percentage of the way through
the diffusion process are you. So 0 would be all noise and 1 would
be all, no sorry, other way around. 0 would be all clean and 1 would be all noise.
So how far through the forward diffusion process. So other than that, this is exactly the
same equation we've already seen. And I realized something else, which is kind of fun,
which is you can take the inverse of that. So you can calculate t. So we would
basically first take the square root and we would then take the inverse cos and we
would then divide by pi over 2, or times 2 over pi. So we can both, so it's interesting now, we
don't, the alpha bar is not something we look up in a list, it's something we calculate with a
function from a float. And so yeah, interestingly, that means we can also calculate t from an alpha
bar. So noisify has changed a little. So now when we get the alpha bar for our time step, we don't
look it up, we just call it, call the function. And now the time step is a random float
between 0 and 1. Actually between 0 and 0.999, which actually I'm sure there's a function I
could have chosen to do a float in this range, but I just clamped it because I was
lazy, couldn't be bothered looking it up. Other than that noisify is exactly the
same. Right, so we're still returning the xt, the time step, which is now a float, and
the noise. That's the thing we're going to try and predict, dependent variable, this
tuple there as our inputs to the model. All right, so here is what that looks like.
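
A minimal sketch of the three pieces just described: the cosine `abar`, its inverse, and the new `noisify`. It is written in plain Python on a flat list of pixel values so that it runs without dependencies; the notebook itself works on PyTorch tensors, and the names and signatures here are assumptions based on the description above rather than the notebook's actual code.

```python
import math
import random

def abar(t):
    """Cosine schedule: alpha bar for a time step t in [0, 1] (0 = clean, 1 = all noise)."""
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    """Inverse of abar: square root, then inverse cos, then times 2/pi."""
    return math.acos(math.sqrt(a)) * 2 / math.pi

def noisify(x0, rng=random):
    """Noise one image (a flat list of pixel values) at a random continuous time step."""
    t = min(rng.random(), 0.999)              # random float, clamped to at most 0.999
    a = abar(t)                               # calculated, not looked up in a list
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise the model will try to predict
    xt = [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]
    return (xt, t), eps                       # (model inputs), target
```

Because `abar` is now a function rather than a table, `inv_abar(abar(0.3))` gives back 0.3 up to floating-point error.
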
So now when we look at our input to our UNet training process, you can see, you know, we've
got a t of 0.05, so 5% of the way through the forward diffusion process, it looks like
this, and 65% through it looks like this. TANISHQ: So now the time step, and
basically the process is more of a kind of a continuous time step
and a continuous process, rather, before we were having these discrete time
steps, here we could, it's just any random value, it could be between 0 and 1.
I think, yeah, that's also something. JEREMY: Yeah, which I found is more convenient,
you know, to have a function to call. Yeah, I find this life a little bit easier. So
the model's the same, the callbacks are the same, the fitting process is the same. And so something
which is kind of fun is that we could now, we do now, create a little denoise function,
so we can take, you know, this batch of data that we generated, the noisified data, so
here it is again, and we can denoise it. So we know the t for each element, obviously,
so remember t is different for each element now. And we can therefore calculate
the alpha bar for each element, and then we can just undo the noisification to
get the denoised version. And so if we do that, there's what we get. And so this is great, right?
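
Undoing the noisification is just the noising equation solved for the clean image. A rough sketch under the same assumptions as before (plain Python, one image as a flat list, hypothetical function names); in practice `pred_noise` would come from the trained UNet:

```python
import math

def abar(t):
    """Cosine schedule: alpha bar for a time step t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def denoise(xt, pred_noise, t):
    """Estimate the clean image from a noised image, a noise prediction and its time step."""
    a = abar(t)
    return [(x - math.sqrt(1 - a) * e) / math.sqrt(a) for x, e in zip(xt, pred_noise)]
```

If the noise prediction were perfect, this would recover the original image exactly; the quality of the one-step result therefore reflects the quality of the noise prediction at that noise level.
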
It shows you what actually happens when we run a single step of the model on varyingly partially
noised images. And this is something you don't see very often, because I guess not many people
are working in these kind of interactive notebook environments where it's really easy to do this
kind of thing, but I think this is really helpful to get a sense of like, okay, if you're 25% of
the way through the forward diffusion process, this is what it looks like when you undo that. If you're 95% of the way through it, this is what
happens when you undo that. So you can see here, it's basically like, oh, I don't really know what
the hell's going on, so it's still a noisy mess. Yeah, I guess my feeling from looking
at this is I'm impressed, you know, like this 45% noise thing, it looks all
noise to me. It's found the long-sleeved top. And yeah, it's actually pretty close to the real
one. I looked it up, or you might see it later, it's a little bit more of a pattern here, but it
even gives a sense of the pattern. So it shows you how impressive this is. So this is 35%, you can
kind of see there's a shoe there, but it's really picked up the shoe nicely. So it's, these are
very impressive models in one step, in my opinion. So okay, so sampling is basically the
same, except now, rather than starting with using the range function to create a time
steps, we use linspace to create our time steps. So our time steps start at, you know,
if we did 1,000, it would be 0.999, and they end at 0, and then they're just
linearly spaced with this number of steps. So other than that, you know, abar we now calculate, and the next abar is going to be
whatever the current step is minus one over steps. So if you're doing 100 steps, then you'd be minus
0.01. So this is just stepping through linearly. And yeah, that's actually it for changes.
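
The time-step bookkeeping for sampling can be sketched like this. It is a plain-Python stand-in for `linspace`, with hypothetical helper names; flooring the next time step at 0 for the final step is an assumption of this sketch, not something stated in the talk.

```python
import math

def abar(t):
    """Cosine schedule: alpha bar for a time step t in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def sampling_times(steps):
    """Linearly spaced time steps from 1 - 1/steps down to 0."""
    start = 1 - 1 / steps
    if steps == 1:
        return [start]
    return [start * (1 - i / (steps - 1)) for i in range(steps)]

def schedule(steps):
    """For each time step, pair the current alpha bar with the next (less noisy) one."""
    out = []
    for t in sampling_times(steps):
        t_next = max(t - 1 / steps, 0.0)   # assumption: floor at 0 on the last step
        out.append((t, abar(t), abar(t_next)))
    return out
```

With `steps=100` the time steps run 0.99, 0.98, and so on down to 0, and each step moves towards an alpha bar that is at least as large as the current one.
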
So if we just do DDIM for 100 steps, you know, that works really well, we
get a FID of three, which is actually quite a bit better than we had on
100 steps for our previous DDIM. So this definitely seems like a good sampling approach. And I know Johno is
going to talk a bit more shortly about, you know, some of the things that can make
better sampling approaches. But yeah, definitely, we can see it making a difference here. Did you guys have anything you wanted to
say about this before we move on? JOHNO: No, but it is a nice transition towards
some of the other things we'll be looking at to start thinking about how do we frame this. And
it's also good, like the idea… So the original DDPM paper has a thousand time steps and a lot
of people follow that. But the idea that you don't have to be bound to that, and maybe it is worth
breaking that convention. I know Tanishq made that meme about, you know, the 15 competing
standards for notation. But yeah, sometimes it's helpful to reframe it, okay, time
goes from zero to one, that can simplify some things. It may complicate others, but yeah, it's
nice to think how you can reframe stuff sometimes. JEREMY: Yeah, and in fact, where we will
head today, by the time we get to notebook 23, we will see, you know, even simpler
notation. And yes, simpler notation generally comes. I think what happens is over time,
people understand better what's the essence of the problem and the approach, and
then that gets reflected in the notation. So, okay, so the next one I
wanted to share is something which is an idea we've been working on for a
while, and it's some new research. So partly, I guess this is an interesting
insight into how we do research. This is 22_noise-spread. And the basic idea of
this was, well, actually, I'm going to take you through it to see what the basic idea is. So
what I'm going to do is I'm going to create, okay, so Fashion-MNIST as before, but
I'm going to create a different kind of model. I'm not going to create
a model that predicts the noise, given the noised image and t. Instead, I'm
going to try to create a model which predicts t, given the noised image. So why did I want to do
that? Well, partly, well, entirely, because I was curious. I felt like when I looked at something
like this, I thought it was pretty obvious, roughly, how much noise each image had.
And so I thought, why are we passing noise, when we call the model, why are we passing in
the noised image and the amount of noise or the t, given that I would have thought the model
could figure out how much noise there is. So I wanted to check my contention, which is
that the model could figure out how much noise there is. So I thought, okay, well, let's create a
model that would try and figure out how much noise there is. So I created a different noisify now,
and this noisify grabs an alpha bar t randomly. And it's just a random number between 0 and 1,
you know, you want one per item in the batch. And so then after just randomly grabbing an alpha
bar t, we then noisify in the usual way. But now our independent variable is the noised image,
and the dependent variable is alpha bar t. And so we're going to try to create a model that
can predict alpha bar t, given a noised image. Okay, so everything else is the same
as usual. And so we can see an example… JOHNO: …you've got alpha
bar t dot squeeze dot logit. JEREMY: Oh yeah, that's true. So the
alpha bar t goes between naught and one. So we've got a choice. Like I mean, we don't have
to do anything, but you know, normally if you've got say between zero and one, you might consider
putting a sigmoid at the end of your model. But I felt like the difference between 0.999 and
0.99 is very significant, you know. So if we do logit, then we don't need the sigmoid at the end
anymore. It'll naturally cover the full range of kind of, you know, it'll be centered at zero,
it'll cover all the normal kind of range of numbers, and it also will treat equal ratios as
equally important at both ends of the spectrum. So that was my hypothesis was that using logit
would be better. I did test it and it was actually very dramatically better. So without this logit
here, my model didn't work well at all. And so this is like an example of where thinking about
these details is really important. Because if I hadn't have done this, then I would have come
away from this bit of research thinking like, oh, I was wrong. We can't predict noise, noise amount.
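
To make the logit argument concrete, here is a small plain-Python sketch (the function names are just for illustration). It shows that logit is the inverse of sigmoid, so a model trained on logit targets needs no sigmoid at the end, and that equal ratios of odds become equal distances at either end of the range.

```python
import math

def logit(p):
    """Log-odds of a probability-like value in (0, 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# 0.99 vs 0.999 are only 0.009 apart on the raw scale...
raw_gap = 0.999 - 0.99
# ...but in logit space the gap is log(999/99), roughly 2.3,
logit_gap_high = logit(0.999) - logit(0.99)
# and the mirror-image pair near zero is exactly as far apart.
logit_gap_low = logit(0.01) - logit(0.001)
```

To read a prediction back as an alpha bar, apply `sigmoid` to the model output.
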
Yeah, so thanks for pointing that out, Johno. Yeah, so that's why in this example of a
mini-batch, you can see that the numbers can be negative or positive. So 0 would represent
an alpha bar of 0.5. So here 3.05 is not very noised at all, whereas negative one
is pretty noisy. So the idea is that, yeah, given this image, you would have to try to predict
3.05. So one thing I was kind of curious about is, like, it's always useful to know is like, what's
the baseline? Like, what counts as good? You know, because often people will say to me like,
oh, I created a model and the MSE was 2.6. I'll be like, well, is that good? Well,
it's the best I can do, but is it good? Or is it better than random? Or is it
better than predicting the average? So in this case, I was just like, okay, well, what
if we just predicted, actually, this is slightly out of date, I should have said 0 here, rather
than 0.5, but nevermind close enough. So this is before I did the logit thing. So I basically
was looking at like, what's the, you know, loss if you just always predicted a constant,
which as I said, I should have put zero here, haven't updated it. And so it's like, oh, that
would give you a loss of 3.5. Or another way to do it is you could just, just put MSE here
and then look at the MSE loss between 0.5 and your various, just a single mini batch, which we,
yeah, mini batch of alpha bar t’s, logits. Yeah, so, you know, we wanted to get some, you know,
if we're getting something that's about three, then we basically haven't done any better
than random. And so this case, this model, it doesn't actually have anything
to learn. It always returns the same thing. So we can just call fit with
train equals false just to find the loss. So these are just a couple of ways of getting
quickly finding a loss for a baseline naive model. One thing that thankfully PyTorch will warn you
about is if you try to use MSE and your inputs and targets have different shapes, it will broadcast
and give you probably not the result you would expect, and it will give you a warning. So one way
to avoid that is just to use dot flatten on each. So this kind of flattened MSE is useful to avoid
both, avoid the warning and also avoid getting weird errors or weird, sorry, weird results.
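
The constant-prediction baseline can be checked without any model at all. The sketch below is plain Python with hypothetical names; it assumes, as described above, that alpha bar is drawn uniformly from 0 to 1 and that the target is its logit (the tiny clamp that keeps the logit finite is an addition of this sketch). The logit of a uniform random number follows the standard logistic distribution, whose variance is pi squared over 3, about 3.29, so a loss "about three" is what always predicting zero should give.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def mse(preds, targets):
    """Mean squared error over two equal-length flat lists."""
    assert len(preds) == len(targets), "shapes must match before taking the MSE"
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def baseline_mse(constant, n=100_000, seed=0):
    """Loss of a 'model' that always predicts the same constant."""
    rng = random.Random(seed)
    # Assumption: clamp away from 0 and 1 so the logit stays finite.
    targets = [logit(min(max(rng.random(), 1e-6), 1 - 1e-6)) for _ in range(n)]
    return mse([constant] * n, targets)
```

Predicting 0.5 instead of 0 only adds about 0.25 to the loss, which is why the slightly out-of-date constant still gives a usable baseline.
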
So we use that for our loss. So the models, the model that we always use, so it's kind of nice.
We just use our same old model. Nothing changes, even though we're doing something
totally different. Oh, well, okay, that's not quite true. One difference is
that our output, we just have one output now, because this is now a regression model.
It's just trying to predict a single number. And so our Learner now uses MSE as the loss.
Everything else is the same as usual. So we can go ahead and train it and you can see, okay,
the loss is already much better than three, so we're definitely learning something and we end
up with a 0.075 mean squared error. That's pretty good considering, you know, there's a pretty wide
range of numbers we're trying to predict here. So I've got to save that as noise prediction
on sigma. So save that model. And so we can take a look at how it's doing by
grabbing our one batch of noised images, putting it through our tmodel. Actually, it's
really an alpha bar model, but never mind, call it a tmodel. And then we can take a look
to see what it's predicted for each one. And we can compare it to the actual for each
one. And so you can see here, it said, oh, I think this is about 0.91. And actually it is
0.91. So now here it looks like about 0.36. And yeah, it is actually 0.36. So you know, you
can see overall 0.72, it's actually 0.72. Or it's actually right, this one's 0.02 off. But
yeah, my hypothesis was correct, which is that we, you know, we can predict the thing that
we were putting in manually as input. So there's a couple of reasons I was interested in
checking this out. The first was just like, well, yeah, wouldn't it be simpler if we
weren't passing in the t each time? You know, why pass in the t each time? But
it also felt like it would open up a wider range of kind of how we can do sampling. The idea of
doing sampling by like precisely controlling the amount of noise that you try to remove each
time, and then assuming you can remove exactly that amount of noise each time feels limited to
me. So I want to try to remove this constraint. So having built this model, I thought, okay, well,
you know, which is basically like, okay, I think we don't need to pass t in. Let's try it. So what
I then did is I replicated the 22_cosine notebook, I just copied it, pasted it in here. But I made
a couple of changes. The first is that noisify doesn't return t anymore. So there's no
way to cheat. We don't know what t is. And so that means that the UNet now doesn't
have t, so it's actually going to pass 0 every time. So it has no ability to learn from t because it doesn't get t. So it
doesn't really matter what we pass in. We could have changed the UNet to like remove
the conditioning on t. But for research, this is just as good, you know, for finding
out. And it's good to be lazy when doing research. There's no point doing something
a fancy way when you can do it a quick and easy way before you even know if it's going
to work. So yeah, that's the only change. So we can then train the model and we can
check the loss. So the loss here is 0.034. And previously it was 0.033. So
interestingly, you know, maybe it's a tiny bit worse at that,
you know, but it's very close. Okay, so we'll save that
model. And then for sampling, I've got exactly the same DDIM step as usual.
And my sampling is exactly the same as usual, except now when I call the
model, I have no t to pass in. So we just pass in this. I mean, I still
know t because I'm still using the usual sampling approach, but I'm not passing it
to the model. And yeah, we can sample and what happens is actually pretty garbage.
22 is our FID. And as you can see here, you know, some of the images are still really
noisy. So I totally failed. And so that's always a little discouraging when you think
something's going to work and it doesn't. But my reaction to that is like, if I think
something's going to work and it doesn't, is to think, well, I'm just going to have to do a
better job of it. You know, it ought to work. So I tried something different, which is I thought
like, okay, since we're not passing in the t, then we're basically saying like, how much noise
should you be removing? It doesn't know exactly. So it might remove a little bit more noise than
we want or a little bit less noise than we want. And we know from the, you know, testing we did,
that sometimes it's out by like this case 0.02. And I guess if you're out consistently,
sometimes it's, yeah, got to end up not removing all the noise. So the change I made
was to the DDIM step, which is here. And… let me just copy this and get rid of the commented-out
section just to make it a bit easier to read. Okay. So the DDIM step, this is the normal
DDIM step. Okay. And so step one is the same. So don't worry about that. Cause it's
the same as we've seen before. But what I did was I actually used my tmodel. So I
passed the noised image into my tmodel, which is actually an alpha bar model
to get the predicted alpha bar. And this is, remember, the predicted alpha bar
for each image, because we know from here that sometimes, so sometimes it did a pretty
good job, right? But sometimes it didn't. So I felt like, okay, we need a
predicted alpha bar for each image. What I then discovered is, sometimes that could
be like really too low, right? So what I wanted to make sure of is it wasn't too crazy. So
I then found the median for a mini batch of all the predicted alpha bars, and I clamped
it to not be too far away from the median. And so then what I did when I did my
x_0_hat is rather than using alpha bar t, I used the estimated alpha bar t for each
image, clamped to be not too far away from the median. And so this way it was updating it
based on the amount of noise that actually seems to be left behind rather than the assumed
amount of noise that should be left behind. You know, if we assume it's
removed the correct amount. And then everything else is
the same. So when I did that, so whoa, made all the difference. And here it is.
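
The modified step can be sketched roughly as follows. This is plain Python on lists, not the notebook's code: the helper names, and in particular the choice to clamp each prediction to within a factor of two of the batch median, are assumptions made for illustration; the talk only says the predictions were clamped to stay near the median.

```python
import math
from statistics import median

def clamp_to_median(pred_abars, factor=2.0):
    """Keep each predicted alpha bar (assumed to lie in (0, 1)) near the batch median."""
    med = median(pred_abars)
    lo, hi = med / factor, min(med * factor, 1.0)   # hypothetical clamp range
    return [min(max(a, lo), hi) for a in pred_abars]

def x0_hat(xt, pred_noise, a):
    """Clean-image estimate for one image, using its own estimated alpha bar `a`."""
    return [(x - math.sqrt(1 - a) * e) / math.sqrt(a) for x, e in zip(xt, pred_noise)]
```

In a batch, each image would get its own clamped value from `clamp_to_median` in place of the scheduled alpha bar, and the rest of the DDIM step would proceed unchanged.
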
They are beautiful pieces of clothing. So 3.88 versus 3.2. That's possibly close enough.
Like I'd have to run it a few times, you know, my guess is maybe it's a
tiny bit worse. But it's pretty close. But like this definitely gives
me some encouragement that, you know, even though this is like something
I just did in a couple of days, whereas the with-t approaches have been
developed since 2015 and we're now in 2023. I, you know, I would expect it's quite likely
that these kind of no-t approaches could eventually surpass the t-based approaches. And like one thing that definitely makes me
think there's room to improve is if I plot the FID or the KID for each sample
during the reverse diffusion process, it actually gets worse for a while. I'm
like, okay, well that's, that's a bad sign. I have no idea why that's
happening, but it's a sign that, you know, if we could improve each step, then
one would assume we could get better than 3.8. So yeah, Tanishkq, John, I don't have any thoughts
about that or questions or comments or… JOHNO: Maybe to just like to highlight the
research process a little bit, it wasn't like this linear thing of like, oh, here's this issue not
performing as well as we thought, oh, here's the fix. We just kept this. You know, this was like
multiple days of like discussing and like Jeremy saying like, you know, I'm tearing my hair out. You
guys have any ideas? And, oh, what about this? And, oh, in the DDIM paper,
they do this clamping, maybe that'll help, you know? So there's a lot of back
and forth and also a lot of like, you saw the code that was commented out there,
print xt.min, xt.max, alpha bar, pred, you know, just like seeing, oh, okay. You know, my
average prediction is about what I would expect, but sometimes the min or the max goes, you
know, 2, 3, 8, 16, 150, 212 million, infinity, you know, maybe like one or two little
baddies that would just skyrocket out. Yeah. And so that kind of like debugging
and exploring and printing the results. JEREMY: And actually our initial discussions
about this idea, I kind of said to you guys before Lesson 1 of Part 2, I said, like, it
feels to me like we shouldn't need the t thing. And so it's actually been like bumbling
away in the background for months. Yeah. JOHNO: And I guess, I mean, we should also
mention, we have tried this, like a friend of ours trained a no-t version of stable diffusion for
us. And we did the same sort of thing. I trained a pretty bad t predictor and it sort of generates
samples. So we're not like focusing on that large scale stuff yet, but it is fun to like, every
now and again, got this idea from Fashion-MNIST,
and seeing, okay, this does seem like maybe it'll work. And so down the line that future
plan is to say, let's actually, you know, spend the time train a proper model and see,
yeah, see how well that does. If it's interesting. JEREMY: You say a friend of ours, we can be more
specific. It's Robin, one of the two lead authors of the stable diffusion paper who yeah, actually
has been fine tuning a real stable diffusion model which is without t and it's looking super
encouraging. So yeah, that'll be fun to play with, with this new, you know, we'll have to train
a t predictor for that. See how it looks. Yep. All right. So I guess the other area we've been
talking about kind of doing some research on is this weird thing that came up over the last two
weeks with our bug in the DDPM implementation, where we accidentally weren't doing it
from minus 1 to 1 for the input range, it turned out that actually being from minus
one to one wasn't a very good idea anyway. And so we ended up centering it as
being from minus 0.5 to 0.5. And Johno and Tanishq have managed to actually
find a paper. Well I say find a paper. A paper has come out in the last 24 hours which has coincidentally cast some light on this
and has also cited a paper that we weren't aware of which was not released in the last 24 hours. So
Johno, are you going to tell us a bit about that? JOHNO: Yeah, sure I can do that. So it's
funny, this was such perfect timing because I actually got up early this morning planning
to run with the different input scalings and the cosine schedule that Jeremy was showing
and some of the other schedulers we look at, I thought it might be nice for the lesson
to have a little plot of like what is the FID with these different solvers and input
scalings, but it was going to be a lot of work. I was like, I'm not looking forward to doing
the groundwork. And then Tanishq sent me this paper which AK [@_akhaliq] had just tweeted
out because he reviews everything that comes up on arXiv every day: “On the Importance
of Noise Scheduling for Diffusion Models”. And this is by a researcher at the Google Brain
team who's also done a really cool recent paper on something called a Recurrent Interface
Network outside of the scope of this lesson, but also worth checking out. Yeah,
so this paper they're hoping to study this noise scheduling and the strategies that
you take for that and they want to show that number 1) noise scheduling is crucial for
performance and the optimal one depends on the task. When increasing the image size,
the noise scheduling that you want changes and scaling the input data by some factor
is a good strategy for working with this. JEREMY: And that's the bit
we've been talking about, right? JOHNO: Yeah, that's what we've been doing
where we said, oh, do we scale from minus 0.5 to 0.5 or minus 1 to 1 or do we normalize?
And so they demonstrate the effectiveness by training a really good high resolution
model on ImageNet. So a class-conditional model. JEREMY: It looks great. JOHNO: Yeah, amazing samples.
I'll show them later. So I really liked this paper. It's very short
and concise and it just gets all the information across. And so they introduce it here. We have
this noising process our noisify function where we have square root of something times X plus
square root of one minus that something times the noise. And here they use gamma, gamma
of t, which is often used for the continuous time case. So instead of the alpha bar and the
beta bar schedule for a thousand time steps,
tells you what your alpha bar should be. JEREMY: Okay. So that's how our function is
actually called abar, but it's the same thing. JOHNO: Yeah. Same thing. Takes in a time step from
0 to 1 and then that's used to noise the image. JEREMY: Interestingly, what they're showing here
actually is something that we had discovered and I've been complaining about that my DDIMs with
an eta of less than one weren't working, which is to say when I added extra noise to the image, it
wasn't working. And what they're showing here is like, oh yeah, duh, if you use a small image, then
adding extra noise is probably not a good idea.
JOHNO: Yeah. And so they, they, they use a lot of reference in this paper to like information
being destroyed and signal to noise ratios, and that's really helpful for thinking about it
because it's not something that's obvious, but at 64 by 64 pixels adjacent pixels might have
much less in common versus the same amount of noise added at a much higher resolution, the
noise kind of averages out and you can still see a lot of the image. So yeah, that's one
thing they highlight is that the same noise level for different image sizes might have
it, it might be a harder or easier task. And so they investigate some strategies for
this. They look at the different noise schedule functions. So we've seen the original version from
the DDPM paper. We've seen the cosine schedule and we've seen, I think we might look at, or
the next thing that Jeremy's going to show us, a sigmoid based schedule. And so they show the
continuous time versions of that and they plot how you can change various parameters to get
these different gamma functions or in our case, the alpha bar where we're starting at all image,
no noise at t equals zero, moving to all noise, no image at t equals one, but the path that
you take, it's going to be different for these different classes of functions and
parameters and the signal to noise ratio, that's what this, or the log signal to noise
ratio is going to change over that time as well. And so that's one of the knobs we can tweak. We're
saying our diffusion model isn't training that well. We think it might be related to the noise
schedule and so on. One of the things you could do is try different noise schedules, either
changing the parameters in one class of noise schedule or switching from a linear to a cosine
to a sigmoid. And then the second strategy is kind of what we were doing in those experiments,
which is just to add some scaling factor to x0. JEREMY: Well we were accidentally using b of 0.5. JOHNO: Exactly. And so that's a second dial
that you can tweak is to say keeping your noise schedule fixed, maybe just scale x0, which
is going to change the ratio of signal to noise. JEREMY: And so that's what Figure 4 in c)
there is what we were accidentally doing. JOHNO: Yes. Yeah, exactly. And so see if
we can get to, oh yeah, so that again, changes the signal to noise for different scalings
you get. So that's fine. So they have a compound, they have a strategy that combines some of those
things. And this is the important part. They do their experiments. And so they have a nice
table of investigating different schedules, cosine schedules and sigmoid schedules, and in
bold are the best results. And you can see for 64 by 64 images versus 128 versus 256, the best
schedule is not necessarily always the same. And so that's like important finding number one,
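
The two dials discussed here, the schedule function and the input scale, both act on the signal-to-noise ratio, which a few lines can make explicit. This sketch assumes the noising process is sqrt(gamma(t)) times b times x0 plus sqrt(1 - gamma(t)) times the noise, with x0 of unit variance; that is my reading of the setup being described, and the function names are hypothetical.

```python
import math

def gamma_cosine(t):
    """Cosine schedule: the same function the earlier notebook calls abar."""
    return math.cos(t * math.pi / 2) ** 2

def log_snr(t, b=1.0, gamma=gamma_cosine):
    """Log signal-to-noise ratio at time t (0 < t <= 1) for input scale b.

    Assumes x0 has unit variance, so signal variance is b^2 * gamma(t)
    and noise variance is 1 - gamma(t).
    """
    g = gamma(t)
    return math.log(b * b * g / (1 - g))
```

Under these assumptions `log_snr(0.5)` is 0 for b = 1 (equal signal and noise), and setting b = 0.5 shifts the whole curve down by log 4, so every time step becomes noisier without touching the schedule itself.
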
depending on what your data looks like using a different noise schedule might be optimal. There's
no one true best schedule. There's no one value of beta min and beta max that's just magically
the best. Likewise for this input scaling at different sizes with whatever schedules they
tested and different values were kind of optimal. And so, yeah, it's just a really great
illustration, I guess, that this is another design choice that's implicit or explicitly part
of your diffusion model training and sampling is how are you dealing with this noise schedule?
What schedule are you following? What scaling are you doing with your inputs? And by using
this thinking and doing these experiments, and they come up with a kind of rule of thumb
for how to scale the image based on image size, they show that they can, as they increase the
resolution, they can still maintain really good performance. Where previously it was quite
hard to train a really large resolution pixel space model and they're able to
do that. They get some advantage from their fancy Recurrent Interface Network, but
still it's kind of cool that they can say, look, we get state of the art high
quality at 512 by 512 or 1024 by 1024 samples on class conditioned ImageNet and
using this approach to really like consider how well do you train, how many steps do we need to
take? One of the other things in this table is that they compare it to previous approaches. Oh,
we used, you know, a third of the training steps and for the same other settings
and we get better performance. And just because we've chosen
that input scaling better. And yeah, so that's the paper. Really
nice, great work to the team. And that was- JEREMY: I'd love to, you got up in the morning and
thought, oh, it's going to be a hassle training all these different models I need to train
for different input scalings and different sampling approaches. I might just look at Twitter first. And then you looked at Twitter
and there was a paper saying like, Hey!, we just did a bunch of experiments for
different noise schedules and input scaling. JOHNO: Yeah. JEREMY: Does your life always work that
way, Johno? That seems quite blessed. JOHNO: Yeah. It's very lucky like that. Yeah. If
you wait long enough, someone else will do it. TANISHQ: That's why it's always that the time
when AK [@_akhaliq] starts posting on Twitter, it's like my favorite hour of the day.
It's just for all the papers to be posted. JEREMY: Oh, well thank you for that.
So let me switch to notebook 23 because this notebook is actually largely an
implementation of some ideas from this paper that everybody tends to just call it Karras,
unfair ‘cause there's other people. But I will do it anyway, Karras paper. And the
reason we're going to look at this is because in this paper, the authors actually take a
much more explicit look at the question of input scaling. Their approach was not apparently to accidentally put a bug in their code and
then take it out, find it worked worse and then just put it back in again. Their approach
was actually to think, how should things be? So that's an interesting approach to doing things
and I guess it works for them. So that's fine. TANISHQ: I think our approach is more exciting. JEREMY: Yeah, exactly. Our approach is much more
fun because you never quite know what's going to happen. And so, yeah, in their approach,
they actually tried to say like, okay, given all the things that are coming into our
model, how can we have them all nicely balanced? So we will skip back and forth
between the notebook and the paper. So the start of this is all the same,
except now we are actually going to do it minus 1 to 1 because we're not going to rely
on accidental bugs anymore, but instead we're going to rely on the Karras paper's carefully
designed scaling. I say that except that I put a bug in the notebook as well. One of
the things that's in the Karras paper is what is the standard deviation of the actual
data which I calculated for a batch. However, this used to say minus 0.5. I used
to do the minus 0.5 to 0.5 thing. And so this is actually the standard deviation
of the data before when it was still minus 0.5. So this is actually half the real standard
deviation. For reasons I don't yet understand, this is giving me better scaled results.
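As a quick check of the factor of two described here, the sketch below rescales a handful of made-up pixel values (not Fashion-MNIST data) from the -0.5 to 0.5 range to the -1 to 1 range and compares the standard deviations. The variable names are illustrative only.

```python
import statistics

# Hypothetical pixel values already scaled to the range -0.5..0.5
pixels_half = [-0.5, -0.5, -0.4, -0.1, 0.0, 0.2, 0.5, 0.5]
# The same pixels rescaled to the range -1..1
pixels_full = [2 * p for p in pixels_half]

std_half = statistics.pstdev(pixels_half)
std_full = statistics.pstdev(pixels_full)
# Doubling every value doubles the standard deviation
print(std_half, std_full, std_full / std_half)
```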
So this actually should be 0.66. So there's still a bug here and the bug still
seems to work better. So we've still got some mysteries involved. So we're going to leave this.
So it's actually not 0.33, it's actually 0.66. Okay, so the basic idea of this
paper, actually I'll come back. Well let me have a little think. Yeah, okay. Now
we're going to start here. So the basic idea of this paper is to say, you know what, sometimes
maybe predicting the noise is a bad idea. So like you can either try and predict the noise
or you can try and predict the clean image and each of those can be a better
idea in different situations. If you're given something
which is nearly pure noise, you know, the model's given something which
is nearly pure noise and is then asked to predict the noise, that's basically a waste
of time because the whole thing's noise. If you do the opposite, which is you try to get
it to predict the clean image, well then if you give it an image that's nearly clean and try to
predict the clean image, that's nearly a waste of time as well. So you want something which is like,
regardless of how noisy the image is, you want it to be kind of like an equally difficult problem
to solve. And so what Karras do is they basically use this new thing called Cskip, which is a
number which is basically saying like, you know what we should do for the training target,
is not just predict the noise all the time, not just predict the clean image all the time,
but predict a kind of lerped version of one or the other depending on how noisy it is. So here y is
the plain image and n is the noise. So y plus n is the noised image. And so if Cskip was zero,
then we would be predicting the clean image. And if Cskip was one, we would be predicting
y minus (y plus n), so the y's cancel and we would be predicting the noise. And so you can decide by picking a different Cskip whether you're predicting
the clean image or the noise. And so, as you can see from the way they've
written it, they make this a function. They make it a function of sigma. Now this is
where we've got to a point now where we've kind of got a fairly much simpler notation.
There's no more alpha bars, no more alphas, no more betas, no more beta bars. There's just
a single thing called sigma. Unfortunately, sigma is the same thing as alpha bar used to be.
Right. So we've simplified it, but we've also made things more confusing by using an existing symbol for
something totally different. So this is alpha bar. Okay. So there's going to be a function that says,
depending on how much noise there is, we'll either predict the noise or we'll predict the clean
image or we'll predict something between the two. So in the paper, they showed this chart where
they basically said like, okay, let's look at the loss to see how good are we with a trained
model at predicting when sigma is really low. So when there's very small alpha bar or when sigma
is in the middle or when sigma is really high. And they basically said, you know what, when
it's nearly all noise or nearly no noise, you know, we're basically not
able to do anything at all. You know, we're basically good at doing
things when there's a medium amount of noise. So when deciding, okay, what sigmas
are we going to send to this thing? The first thing we need to do is to figure out
some sigmas. And they said, okay, well let's pick a distribution of sigmas that matches this
red curve here, right, as you can see. And so this is a normally distributed curve where this is on a
log scale. So this is actually a log normal curve. So to get the sigmas that they're going to use,
they picked a normally distributed random number and then they exp'd it. And this is
called a log normal distribution. And so they used a mean of minus 1.2 and a
standard deviation of 1.2. Zero is one standard deviation above that mean, so about one sixth of the time they're going to be
getting a number that's bigger than zero here. And e to the zero is one. So about one sixth
of the time they're going to be picking sigmas that are bigger than one.
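The sampling recipe just described (draw a normal random number, then exponentiate it) can be sketched in plain Python. This is an illustrative stand-in, not the notebook's code: `sample_sigmas` and the seed are made-up names, and it also lets you check what fraction of the sampled sigmas ends up above one.

```python
import math
import random

def sample_sigmas(n, mean=-1.2, std=1.2, seed=0):
    """Draw n sigmas as exp of a normal sample, i.e. from a log normal."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(mean, std)) for _ in range(n)]

sigmas = sample_sigmas(100_000)
# Empirical fraction of sigmas above 1 (equivalently, normal samples above 0)
frac_above_one = sum(s > 1 for s in sigmas) / len(sigmas)
# Closed form for P(N(mean, std) > 0) via the complementary error function
exact = 0.5 * math.erfc((0 - (-1.2)) / (1.2 * math.sqrt(2)))
print(f"empirical {frac_above_one:.3f}, exact {exact:.3f}")
```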
And so here's a histogram I drew of the sigmas that we're going to
be using. And so it's nearly always, you know, less than five. But sometimes it's way
out here. And so it's quite hard to read these histograms. So this really nice library called
seaborn, which is built on top of Matplotlib, has some more sophisticated and often nicer
looking plots. And one of them they have is called a kdeplot, which is a kernel density plot. It's
a histogram, but it's smooth. And so I clipped it at 10 so that you could see it better. So you can
basically see that the vast majority of the time it's going to be somewhere, you know, about 0.4
or 0.5. But sometimes it's going to be really big. So our noisify is going to pick a sigma
using that log normal distribution. And then we're going to get the noise as
usual. But now we're going to calculate c_skip, right? Because we're going to do that
thing we just saw. We're going to find something between the plain image and the noised input.
So what do we use for c_skip? We calculate it here. And so what we do is we say, what's the
total amount of variance at some level of sigma? Well, it's going to be sigma squared. That's the
definition of the variance of the noise. But we also have the variance of the data itself,
right? So if we add those two together, we'll get the total variance. And
so what the Karras paper said to do is to do the variance of the data divided by
the total variance and use that for c_skip. So that means that if your total variance is really
big, so in other words, it's got a lot of noise, then c_skip is going to be really
small. So if you've got a lot of noise, then this bit here will be really small.
So that means if there's a lot of noise, try to predict the original image, right? That
makes sense because predicting the noise would be too easy. If there's hardly any noise, then
this will be, total variance will be really small, right? So c_skip will be really big. And
so if there's hardly any noise, then try to predict the noise. And so
that's basically what this c_skip does. So it's a kind of slightly weird idea is that our
target, the thing we're trying to do actually is not the input image, sorry, the original image.
It's not the noise, but it's somewhere between the two. And I've found the easiest way to
understand that is to draw a picture of it. So here is some examples of noised
input, right? With various amounts of… with various sigmas. Remember sigma
is alpha bar, right? So here's an example with very little noise, 0.06.
And so in this case, the target is predict the noise, right? So that's the
hard thing to do, is predict the noise. Or else here's an example, 4.53, which is nearly
all noise. So for nearly all noise, the target is predict the image, right? And then for something
which is a little bit between the two, like here, 0.64, the target is predict some
of the noise and some of the image. So that's the idea of Karras. And so what this does is it's making the
problem to be solved by the UNet equally difficult, regardless of what sigma is.
It doesn't solve our input scaling problem. It solves our kind of difficulty scaling problem.
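To make the c_skip behaviour concrete, here is a minimal sketch using the formula as described (variance of the data divided by total variance) and the three sigmas from the pictures. The helper name `c_skip` and passing `sig_data` as an argument are choices made for this illustration; 0.33 is the value the notebook is said to keep.

```python
def c_skip(sig, sig_data):
    """Variance of the data divided by the total variance."""
    return sig_data**2 / (sig**2 + sig_data**2)

sig_data = 0.33  # value kept in the notebook; 0.66 is described as the correct one
for sig in (0.06, 0.64, 4.53):
    cs = c_skip(sig, sig_data)
    # Unscaled target: x0 - c_skip*(x0 + sig*noise)
    #                = (1 - c_skip)*x0 - c_skip*sig*noise
    print(f"sigma={sig}: c_skip={cs:.3f}, "
          f"image weight={1 - cs:.3f}, noise weight={-cs * sig:.3f}")
```

With little noise c_skip is close to one, so the target is mostly noise; with a lot of noise c_skip is close to zero, so the target is mostly the image.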
To solve the input scaling problem, they do it. TANISHQ: I just want to make one quick note. And
so this idea of interpolating between the noise and the image is similar to what's called the
V-Objective as well. So there's also a similar kind of, it's quite similar to what Karras
et al. has. But that's also now been used in a lot of different models. For example,
Stable Diffusion 2.0 was trained with this sort of V-Objective. So people are using this
sort of methodology and getting good results. So it's an actual practical thing that people are
doing. So yeah, just want to make a note of that. JEREMY: Yeah. As is the case of basically
all papers created by NVIDIA researchers, of which this is one, it flies under
the radar and everybody ignores it. The V-Objective paper came from the senior
author, was Tim Salimans, which is Google, right? So anything from Google and OpenAI, everybody
listens to. So yeah, although Karras, I think has done the more complete version of this.
And in fact, the V-Objective was almost like mentioned in passing in the
distillation paper. But yeah, that's the one that everybody has ended up
looking at. But I think this is the more complete… TANISHQ: I think what happened with
the V-Objective is not many people paid attention to it. I think folks like Kat
and Robin, these sorts of folks are actually paying attention to that V-Objective
in that Google Brain paper. But then also this paper did a much more principled
analysis of this sort of thing. So yeah, I think it's very interesting how, yeah. Sometimes
even these sort of side notes in papers that maybe people don't pay much attention to,
they can actually be quite important. JEREMY: Yeah. Yeah. So okay, so the noised input
as usual is the input image plus the noise times the sigma. But then, and then as we discussed, we
decide how to kind of decide what our target is. But then we actually take that noised input
and we scale it up or down by this number. And the target, we also scale up or down by this
number. And those are both calculated in this thing as well. So here's c_out and here's c_in. Now I just wanted to show one
example of where these numbers come from because for a while they all seem pretty mysterious to
me and I felt like I'd never be smart enough to understand them, particularly because they
were explained in the mathematical appendix of this paper, which are always the bits I
don't understand, until I actually try to, and then it tends to turn out they're not so bad
after all, which was certainly the case here. TANISHQ: I think it was up, it was B something
I think. So the B6 I think, is the other one. JEREMY: Oh, yeah. So in appendix B6,
which does look pretty terrifying, but if you actually look at, for example,
what we were just talking about, c_in, it's like, how do they calculate? So c_in is
this. Now this is the variance of the noise, this is the variance of the data, add them
together to get the total variance, square root, that's the total standard deviation. So it's just the
inverse of the total standard deviation, which is what we have here. Where does that come
from? Well they just said, you know what? The inputs for a model should have unit variance.
Now we know that. We've done that to death in this course. So they just said, all right,
so well the inputs to the model is the clean data plus the noise times some number we're
going to calculate, and we want that to be one. Okay, so the variance of the clean images plus the
noise is equal to the variance of the clean images plus the variance of the noise.
Okay, so if we want that to be, if we want variance to be one, then divide
both sides by this and take the square root, and that tells us that our multiplier has to be
one over this. That's it. So it's like literally, you know, classical math. The only bit you have
to know is that the variance of two independent things added together is the two variances added
together, which is not rocket science either. JOHNO: And in this context, like why we want to
do this? When we looked at those sigmas that you're plotting, like the distribution, you've
got some that are fairly low, but you've also got some where the standard deviation sigma is
like 40, right? So the variance is super high. JEREMY: Yes. JOHNO: And so we don't want to feed something with
standard deviation 40 into our model. You would like it to be closer to unit variance. So we're
thinking, okay, well, if you divide by roughly 40, that would scale it down. But then we've also got
some extra variance from our data. It's just like 40 plus variance of the data of a little bit. We want to scale back down by
that to get unit variance. JEREMY: Yeah. I mean, I love this paper
because it's basically just doing what we spent weeks doing. I feel like everything that
we've done that's improved every model has always been one thing, which is, can we get
mean zero, variance one inputs to our model and for all of our activations? And then the
only other thing is include enough compute by adding enough layers and enough activations.
Those two things seem to be all that matters. Basically, well, I guess ResNets added an extra
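The unit-variance argument for c_in walked through above can be written out in a couple of lines. This is a sketch using the discussion's own symbols (y for the clean image, n for the noise), with sigma_data for the data's standard deviation, and it assumes the image and the noise are independent.

```latex
% y: clean image with variance \sigma_{data}^2
% n: independent noise with variance \sigma^2
\begin{align*}
\operatorname{Var}\big[c_{in}\,(y + n)\big]
  &= c_{in}^2\,\big(\sigma_{data}^2 + \sigma^2\big) = 1 \\
\Rightarrow\quad c_{in}
  &= \frac{1}{\sqrt{\sigma^2 + \sigma_{data}^2}}
\end{align*}
```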
cool little thing to that, which is to make it even smoother by giving this kind of like identity
path. So yeah, basically trying to make things as smooth as possible and as equal
everywhere as possible. So yeah, this is what they've done. So they did that
for the inputs and then they've also done it for the outputs. And for the
outputs, you know, it's basically the same idea, you know, and they have basically
the same kind of analysis to show that. And so with this, so now, yeah, we've
basically, we've got our noised input. We've got the, you know, kind of linear version
somewhere between X nought and the noised input. We've got the scaling of the output and we've got
the scaling of the input. So now for the inputs to our model, we're going to have the scaled noise.
We're going to have the sigma and we're going to have the target, which is somewhere between
the image and the noise. And so, yeah, so I've, you know, never seen anybody draw a picture
of this before. So it was really cool when, you know, being in a notebook, being able to see
like, oh, that's what they're doing, you know? So yeah, have a good look at this notebook
to see exactly what's going on. Cause I think it gives you a really good intuition
around what problem it's trying to solve. So then I actually checked the noised input has
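For readers without the notebook open, here is a rough sketch of the pieces just listed, on scalar "pixels" rather than image tensors. It assumes the c_out definition from the Karras paper (sigma times sigma_data over the total standard deviation) and uses synthetic zero-mean Gaussian data whose standard deviation equals `sig_data`, which real pixel data is not. The function names echo the ones mentioned in the discussion, but the bodies are an illustration rather than the notebook's code.

```python
import math
import random
import statistics

def scalings(sig, sig_data):
    totvar = sig**2 + sig_data**2            # noise variance + data variance
    c_skip = sig_data**2 / totvar
    c_out = sig * sig_data / math.sqrt(totvar)
    c_in = 1 / math.sqrt(totvar)             # inverse of the total std
    return c_skip, c_out, c_in

def noisify(x0, sig_data, rng):
    sig = math.exp(rng.gauss(-1.2, 1.2))     # log normal sigma
    noise = rng.gauss(0, 1)
    c_skip, c_out, c_in = scalings(sig, sig_data)
    noised_input = x0 + noise * sig
    target = (x0 - c_skip * noised_input) / c_out
    return noised_input * c_in, sig, target  # scaled input, sigma, target

rng = random.Random(0)
sig_data = 0.66
data = [rng.gauss(0, sig_data) for _ in range(50_000)]   # synthetic stand-in data
scaled_inputs, sigs, targets = zip(*(noisify(x, sig_data, rng) for x in data))
in_std = statistics.pstdev(scaled_inputs)
tgt_std = statistics.pstdev(targets)
print(f"scaled input std {in_std:.3f}, target std {tgt_std:.3f}")
```

Under these idealised assumptions both standard deviations come out close to one, which is the property the scalings are designed to give.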
a standard deviation of one, but the mean's not zero. And of course, why would it be? We didn't
do anything, you know, the only thing Karras cared about was having the variance one. We could
easily adjust the input and output to have a mean of zero as well. That's something I think we or
somebody should try. Cause I think it does seem to help a bit as we saw with that generalized
ReLU stuff we did. But it's less important than the variance. And so same with the target.
It's got the one and yeah, this is where if I changed this to the correct value, which is 0.66,
then actually it's slightly further away from one both here and here, quite a lot further away. And
maybe that's because actually the data's, well, we know the data's not Gaussian distributed. Pixel
data definitely isn't Gaussian distributed. So this bug turned out better. Okay. So the UNet's the same, the
initialization's the same. This is all the same. Train it for a while. We can't compare the
losses, right? Because our target's different. So, but what we can do is we can create
a denoise that just undoes, as per usual, the thing
we had in noisify, right? And so for x nought, you're going to multiply the target by
c_out and then add c_skip times noised_input. Here it is, multiply by c_out, add noised_input times
c_skip. Okay. So we can denoise. So let's grab our sigmas from the actual batch we had. Let's
calculate c_skip, c_out and c_in for the sigmas in our mini batch. Let's use the model to predict
the target given the noised input and the sigmas and then denoise it. And so here's our noised
input, which we've already seen, and here's our predictions. And these are
absolutely remarkable in my opinion. Yeah. Like this one here, I can barely see it.
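The denoise step described above is just the algebraic inverse of how the target was built. A small sketch on scalars, assuming the Karras c_out definition and a hypothetical helper name `denoise_from_target`, shows that a perfect target prediction recovers the clean value:

```python
import math

def scalings(sig, sig_data):
    totvar = sig**2 + sig_data**2
    return sig_data**2 / totvar, sig * sig_data / math.sqrt(totvar), 1 / math.sqrt(totvar)

def denoise_from_target(target_pred, noised_input, sig, sig_data):
    # Multiply the predicted target by c_out, then add c_skip times the noised input
    c_skip, c_out, _ = scalings(sig, sig_data)
    return target_pred * c_out + noised_input * c_skip

# Build one training example by hand (made-up numbers)
x0, noise, sig, sig_data = 0.3, -1.1, 0.64, 0.33
c_skip, c_out, c_in = scalings(sig, sig_data)
noised_input = x0 + noise * sig
target = (x0 - c_skip * noised_input) / c_out

# Pretend the model predicted the target perfectly
recovered = denoise_from_target(target, noised_input, sig, sig_data)
print(x0, recovered)
```

A real model's prediction is imperfect, so the recovered image is only an estimate, which is what the pictures in the notebook show.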
You know, it's really found. Look at the shirt. There's the shirt here. It's actually really
finding the little thing on the front and let me show you. Here's what it should look like. And in
cases where the sigma is pretty high, like here, you can see it's really like saying like, I
don't know, maybe it's shoes, but it could be something else. Is it shoes? Yeah, it wasn't
shoes, but at least it's kind of got the, you know, the bulk of the pixels in
the right spot. Yeah. Something like this one is 4.5. Has no idea what it is. It's
like, oh, maybe it's shoes. Maybe it's pants. You know, it turns out it is shoes. Yeah. So
I think that's fascinating how well it can do. And then the other thing I did, which I
thought was fun was I just created, so I just did a sigma of 80, which is actually what they
do when they're doing sampling from pure noise. That's what they consider the pure noise level.
So I just created some pure noise and denoised it just for one step. And so here's what happens
when you denoise it for one step. And you can see it's kind of overlaid all the possibilities.
It's like, I can see a pair of shoes here, a pair of pants here, a top here. And sometimes
it's kind of like more confident that the noise is actually a pair of pants. And sometimes
it's more confident that it's actually shoes. But you can really get a sense of how like from
pure noise, it starts to make a call about like what this noise is actually covering up. And
this is also the bit which I feel is like, I'm the least convinced about when it comes to
diffusion models. This first step of going from like pure noise to something and like trying to
have a good mix of all the possible somethings. I don't know, it feels a bit handwavy to me.
It clearly works quite well, but I'm not sure if it's like we're getting the full range of
possibilities. And I feel like some of the papers we're starting to see are starting to say
like, you know what, maybe this is not quite the right approach. Then maybe later in the course,
we'll look at some of the ones that look at what we call VQ models and tokenized stuff.
Anyway, I thought this was pretty interesting to see these pictures, which I don't think,
yeah, I've never seen any pictures like this before. So I think this is a fun result from
doing all this stuff in notebooks step by step. Okay, so sampling. So one of the nice things
with this is the sampling becomes much, much, much simpler. And so, and I should mention
a lot of the code that I'm using, particularly in the sampling section is heavily inspired by,
and some of it's actually copied and pasted from Kat's k-diffusion repo, which is, I think
I mentioned before, some of the nicest generative modeling code or maybe the nicest
generative modeling code I've ever seen. It's really great. So before we talk about the actual
sampling, the first thing we need to talk about is what sigma do we use at each reverse time step.
And in the past, we've always, well, nearly always done something, which I think has always
felt sketchy as all hell, which is we've just linearly gone down the sigmas or the alpha bars
or the t’s. So here, when we're sampling in the previous notebook, we used linspace. So I always
felt like that was questionable. And I felt like at the start, you probably like it was just
noise anyway. So who cared? Who cares? So I, in DDPM_v3, I experimented with something
that I thought intuitively made more sense. I don't know if you remember this one, but I
actually said, oh, let's, for the first hundred time steps, let's actually only run the model
every 10 times. And then for the next hundred, let's run it every nine times. The next one hundred,
let's run it every eight times. So basically at the start, be much less careful. And so Karras
actually ran a whole bunch of experiments. And they said, yeah, you know what, at the start of
sampling, you know, you can start with a high sigma, but then like step to a much lower sigma
in the next step and then a much lower sigma in the next step. And then the further along you
are, step by smaller and smaller steps so that you spend a lot more time fine tuning carefully
at the end and not very much time at the start. Now, this has its own problems. And in
fact, a paper just came out today, which we probably won't talk about today, but maybe
another time, which talked about the problem, which is that in these very early steps, this is the
bit where you're trying to create a composition that makes sense. Now for Fashion-MNIST, we
don't have much composing to do. It's just a piece of clothing. But if you're trying to
do an astronaut riding a horse, you know, you've got to think about how all those pieces fit
together. And this is where that happens. And so I do worry that the Karras approach is maybe not
giving that enough time. But as I've said, that's really the same as this first step,
that whole piece feels a bit wrong to me. But aside from that, I think this makes a
lot of sense, which is that, yeah, the sampling, you should jump, you know, by big steps early
on and small steps later on and make sure that the fine details are just so. So that's what
this function does, is it creates this plot. Now it's this schedule of reverse diffusion
sigma steps. It's a bit of a weird function in that it's the, the rho-th root of sigma,
where rho is seven. So the seventh root of sigma is basically what it's scaling on. But
the answer to why it's that is because they tried it and it turned out to work pretty
well. Do you guys remember where this was? TANISHQ: This is the
truncation error analysis, D1. JEREMY: Nice memory. So this image here –so thanks
to Tanishq for reminding me where this is– shows FID as a function of rho. So it's basically,
what root are we taking? And they basically said, like, if you take the fifth root and up,
it seems to work well, basically. So yeah, so that's a perfectly good way to do
things is just to try things and see what works. And you'll notice they tried things,
just like we love, on small datasets, not as small as us because we're the king of small
datasets, but smallish, CIFAR-10, ImageNet-64. That's the way to do things. So I saw like –it
might've even been the CEO of Hugging Face the other day– tweets something saying only people
with huge amounts of GPUs can do research now. And I think it totally misunderstands how research
is done, which is research is done on very small datasets. That's the actual research. And then
when you're all done, you scale it up at the end. I think we're kind of pushing the
envelope in terms of like, yeah, how much can you do? And yeah, we've like
re-covered this kind of main substantive path of diffusion models history step-by-step showing
every improvement and seeing clear improvements across all the papers using nothing but
Fashion-MNIST running on a single GPU in like 15 minutes of training or something per model.
So yeah, definitely don't need lots of GPUs. Anyway. Okay. So this is the sigma we're going
to jump to. So the denoising is going to involve calculating the c_skip, c_out and c_in and calling
our model with the c_in scaled data and the sigma and then scaling it with c_out and then doing the
c_skip. Okay. So that's just undoing the noisify. So check this out. There's all that's required to
do one step of denoising for the simplest kind of scheduler, which is sorry, the simplest
kind of sampler, which is called Euler. So we basically say, okay,
what's the sigma at time step i, what's sigma two, the sigma at the next time step. And
now when I'm talking about time step, I'm really talking about like the step from
this function, right? So this is, this is – JOHNO: Sampling step. JEREMY: Sampling step, yeah. Okay.
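The schedule function being referred to here can be sketched as follows: space the steps evenly in the rho-th root of sigma, then raise back to the power rho. The sigma of 80 and rho of seven come from the discussion; the `sigma_min` default of 0.01 and the trailing zero are assumptions made for this sketch.

```python
def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    """Decreasing sigmas for n sampling steps, followed by a final 0."""
    min_inv_rho = sigma_min ** (1 / rho)     # rho-th root of the smallest sigma
    max_inv_rho = sigma_max ** (1 / rho)     # rho-th root of the largest sigma
    # Linear ramp in the rho-th root, mapped back by raising to the power rho
    sigs = [(max_inv_rho + i / (n - 1) * (min_inv_rho - max_inv_rho)) ** rho
            for i in range(n)]
    return sigs + [0.0]

sigs = sigmas_karras(100)
print([round(s, 3) for s in sigs[:5]], "...", [round(s, 3) for s in sigs[-5:]])
```

Printing the first and last few values shows the pattern described: big jumps in sigma early on and very small ones near the end.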
So then denoise –using the function– and then we say, okay, well just
send back whatever you were given, plus move a little bit in the direction of the
denoised image. So the direction is x minus denoised. So that's the noise, that's the gradient
as we discussed right back in the first lesson of this part. So we'll take the noise. If we divide
it by sigma, we get a slope. That's how much noise is there per sigma. And then the amount
that we're stepping is sigma two minus sigma one. So take that slope and multiply it by the
change, right? So that's the distance to travel towards the noise, that this fraction, you
know, or you could also think of it this way. And I know this is a very obvious
algebraic change, but if we move this over here, you could also think of this as
being, oh, of the total amount of noise, the change in sigma we're doing, what
percentage is that? Okay, well that's the amount we should step. Right? So there's two
ways of thinking about the same thing. So again, this is just, you know,
high school math. Well I mean, actually my seven year old daughter has done all
these things. It's plus minus dividing times. So we're going to need to do this once per
sampling step. So here's a thing called sample, which does that. That's going to
go through each sampling step, call our sampler, which initially we're going to
do sample Euler, right? With that information, add it to our list of results and do it again.
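Put together, the Euler step and the sampling loop might look like the sketch below. To keep it runnable without a trained model it works on a single scalar and uses a toy denoiser (the ideal denoiser for zero-mean Gaussian data with standard deviation 0.5); that denoiser, the `sigma_min` default and the seed handling are all assumptions for illustration.

```python
import random

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    # Decreasing sigmas, evenly spaced in sigma**(1/rho), ending at 0
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

def toy_denoise(x, sig, sig_data=0.5):
    # Stand-in for the trained model: ideal denoiser for zero-mean Gaussian data
    return x * sig_data**2 / (sig_data**2 + sig**2)

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    denoised = denoise(x, sig)
    # (x - denoised) / sig is the slope; (sig2 - sig) is how far we step
    return x + (x - denoised) / sig * (sig2 - sig)

def sample(sampler, denoise, steps=100, sigma_max=80.0, seed=0):
    rng = random.Random(seed)
    sigs = sigmas_karras(steps, sigma_max=sigma_max)
    x = rng.gauss(0, 1) * sigma_max          # pure noise at the highest sigma
    preds = []
    for i in range(len(sigs) - 1):           # one sampler call per sampling step
        x = sampler(x, sigs, i, denoise)
        preds.append(x)
    return preds

preds = sample(sample_euler, toy_denoise)
print(len(preds), preds[-1])
```

Because `sample` takes the sampler as an argument, other step functions with the same signature can be swapped in without changing the loop.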
So that's it. That's all the sampling is. And of course we need to grab
our list of sigmas to start with. So I think that's pretty cool. And at the very
start we need to create our pure noise image. And so the amount of noise we
start with has got a sigma of 80. Okay, so if we call sample using sample Euler,
we get back some very nice looking images and, believe it or not, our FID is 1.98. So this
extremely simple sampler, three lines of code plus a loop has given us a FID of 1.98, which is
clearly substantially better than our cosine. Now we can improve it from there. So one
potential improvement is to, you might've noticed we added no new noise at all, right? This
is a deterministic sampler, right? There's no rand anywhere here. So we can do something called
an Ancestral Euler Sampler, which does add rand, right? So we basically do the denoising in the
usual way, but then we also add some rand. And so what we do need to make sure is given that
we're adding a certain amount of randomness, we need to remove that amount of randomness
from the step that we take. So I won't go into the details, but basically there's our way of
calculating how much new randomness and how much just going back in the existing direction do we
do. And so there's the amount in the existing direction and there's the amount in the new
random direction. And you can just pass in eta, which is just going to, when we pass it into here,
is going to scale that. So if we scale it by half, so basically half of it is new noise and
half of it is going in the direction that we thought we should go, that makes it better
still. Again with a hundred steps and just make sure I'm comparing to the same. Yep. A hundred
steps. Okay. So it's fair. Like with like. Okay. So that's adding a bit of extra noise. Now then the… something that I think we might've
mentioned back in the first lesson of this part is something called Heun's method. And Heun's method does something which we can
pictorially see here to decide where to go, which is basically we say, okay, where
are we right now? What's the, you know, at our current point, what's the direction? So we
take the tangent line, the slope, right? That's basically all it does is it takes a slope. So
it's not, here's a slope, you know? Okay. And so if we take that slope and that
would take us to a new spot and then at that new spot, we can then
calculate a slope at the new spot as well. And at the new spot, the slope is something else.
So that's it here, right? And then you say like, okay, well, let's go halfway between the two
and let's actually follow that line. And so basically it's saying like, okay, each of
these slopes is going to be inaccurate. But what we could do is calculate the slope of
where we are, the slope of where we're going and then go halfway between the two. I actually
find it easier to look at in code personally. I just kind of delete a whole bunch of stuff
that's totally irrelevant to this conversation. So take a look at this compared to Euler. So here's our Euler, right? So we're going
to do the same first line exactly the same, right? Then the denoising is exactly the same,
right? And then this step here is exactly the same. I've actually just done it in multiple
steps for no particular reason. And then say, okay, well, if this is the last step, then we're
done. So actually the last step is Euler. But then what we do is we then say, well, that's okay
for an Euler step, this is where we'd go. Well, what does that look like if we denoise
it? So this calls the model the second time, right? And where would that take us if we took
an Euler step there? And so here, if we took an Euler step there, what's the slope? And so what
we then do is we say, oh, okay, well, it's just, just like in the picture, let's take the average.
Okay, so let's take the average and then use that for the step. So that's all the Heun sampler
does is it just takes the average of the slope where we're at and the slope where
the Euler method would have taken us. And so if we now, so notice that it called the
model twice for a single step. So to be fair, since we've been taking a hundred steps with
Euler, we should take 50 steps with Heun, right? Because it's going to call the model twice.
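The Heun step described above can be sketched next to Euler on a toy problem where the exact answer is known. Everything besides the two step functions is scaffolding invented for this comparison: a scalar x, an ideal denoiser for zero-mean Gaussian data with standard deviation 0.5 standing in for the model, and an assumed `sigma_min` of 0.01. It says nothing about FID, only about how closely each step rule follows the underlying curve.

```python
import math

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

SIG_DATA = 0.5  # toy data: 1-D Gaussian with this standard deviation

def toy_denoise(x, sig):
    # Ideal denoiser for zero-mean Gaussian data; stands in for the trained model
    return x * SIG_DATA**2 / (SIG_DATA**2 + sig**2)

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    d = (x - denoise(x, sig)) / sig              # slope where we are
    return x + d * (sig2 - sig)

def sample_heun(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    d = (x - denoise(x, sig)) / sig              # slope where we are
    x_euler = x + d * (sig2 - sig)               # where an Euler step would land
    if sig2 == 0:
        return x_euler                           # last step is plain Euler
    d2 = (x_euler - denoise(x_euler, sig2)) / sig2   # slope at the Euler point
    return x + (d + d2) / 2 * (sig2 - sig)       # step along the average slope

def run(sampler, steps, x_start=80.0):
    sigs = sigmas_karras(steps)
    x = x_start
    for i in range(len(sigs) - 1):
        x = sampler(x, sigs, i, toy_denoise)
    return x

# For this toy problem the exact end point of the curve is known in closed form
exact = 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
euler_err = abs(run(sample_euler, 20) / exact - 1)
heun_err = abs(run(sample_heun, 20) / exact - 1)
print(f"Euler relative error {euler_err:.4f}, Heun relative error {heun_err:.4f}")
```

Note that `sample_heun` calls the denoiser twice on every step except the last, which is why the step count is halved when comparing like with like.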
And still that is now, whoa, we beat 1, which is pretty amazing. And so we could keep going,
check this out. We can even go down to 20. This is actually doing 40 model evaluations and this is
better than our best Euler, which is pretty crazy. Now something which you might've noticed is kind
of weird about this or kind of silly about this is we're calling the model twice just
in order to average them, but we already have two model results without calling it twice,
because we could have just looked at the previous time step. And so something called the LMS
sampler does that instead. And so the LMS sampler, if I call it with 20, it actually literally
does 20 evaluations and actually it beats Euler with a hundred evaluations. And so
LMS, I won't go into the details too much. It didn't actually fit into my little sampling very
well. So I basically largely copied and pasted Kat's code. But the key thing it does is look
at, it gets the current sig –sigma. It does the denoising, it calculates the slope and it stores
the slope in a list, right? And then it grabs the first one from the list. So it's kind of
keeping a list of up to, in this case, four at a time. And so then it uses up to the last four to basically
guess, kind of, the curvature of this and take the next step. So that's pretty smart. And yeah, so
I think if you wanted to do super fast sampling, it seems like a pretty good way to do it.
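To illustrate the idea of reusing earlier slopes instead of calling the model a second time, here is a deliberately simpler stand-in. It is not the LMS sampler from k-diffusion: it keeps only one previous slope and applies a variable-step two-step Adams-Bashforth update. The toy denoiser, the scalar x and all names are assumptions made for this sketch.

```python
import math

def sigmas_karras(n, sigma_min=0.01, sigma_max=80.0, rho=7.0):
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)] + [0.0]

SIG_DATA = 0.5
calls = {"n": 0}

def toy_denoise(x, sig):
    calls["n"] += 1                              # count "model" evaluations
    return x * SIG_DATA**2 / (SIG_DATA**2 + sig**2)

def sample_euler(x, sigs, i, denoise):
    sig, sig2 = sigs[i], sigs[i + 1]
    return x + (x - denoise(x, sig)) / sig * (sig2 - sig)

def make_two_step_sampler():
    prev = {}                                    # previous slope and step size
    def sampler(x, sigs, i, denoise):
        sig, sig2 = sigs[i], sigs[i + 1]
        d = (x - denoise(x, sig)) / sig          # the only model call this step
        h = sig2 - sig
        if "d" in prev:
            r = h / prev["h"]
            # Extrapolate the slope using the one stored from the previous step
            d_eff = (1 + r / 2) * d - (r / 2) * prev["d"]
        else:
            d_eff = d                            # no history yet: plain Euler
        prev["d"], prev["h"] = d, h
        return x + d_eff * h
    return sampler

def run(sampler, steps, x_start=80.0):
    sigs = sigmas_karras(steps)
    x = x_start
    for i in range(len(sigs) - 1):
        x = sampler(x, sigs, i, toy_denoise)
    return x

exact = 80.0 * SIG_DATA / math.sqrt(SIG_DATA**2 + 80.0**2)
euler_err = abs(run(sample_euler, 20) / exact - 1)
calls["n"] = 0
multistep_err = abs(run(make_two_step_sampler(), 20) / exact - 1)
multistep_calls = calls["n"]
print(f"{multistep_calls} model calls, error {multistep_err:.4f} vs Euler {euler_err:.4f}")
```

The point of the design is the call count: one denoiser evaluation per step, with the extra accuracy coming from history that was already computed.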
And I think Johno, you were telling me that, or maybe it's Pedro was saying that
currently people have started to move away, that this was very popular, but people
started to move towards a new sampler, which is a bit similar called the
DPM++ sampler, something like that. TANISHQ: Yeah. Yeah. Yeah. JOHNO: Yeah. JEREMY: But I think it's
the same idea. So it kind of keeps a list of recent results and use
that. I'll have to check it more closely. JOHNO: The similar idea is like, if it's
done more than one step, then it's using some history to the next thing. JEREMY: Yeah. [unintelligible] in Heun
doesn't make a huge amount of sense, I guess, from that perspective. I mean, still works very
well. This makes more sense. So then, we can compare if we use an actual mini-batch of data,
we get about 0.5. So yeah, I feel like this is quite a stunning result to get very close to real data, at least in terms of
FID. You know, really with 40 model evaluations. And the entire, nearly the entire thing here is
by making sure we've got unit variance inputs, unit variance outputs, and kind of equally
difficult problems to solve in our loss function. JOHNO: Yeah. Plus having that different schedule
for sampling, that's completely unrelated to the training schedule. So one of the big things
with Karras et al's paper was they also could apply this to like, oh, existing diffusion
models that have been trained by other papers, we can use our sampler and in fewer steps
get better results without any of the other changes. And yeah, I mean, they do a
little bit of rearranging equations to get the other papers versions into
their c_skip, c_in, c_out framework. But then, yeah, it's really nice that these
ideas can be applied to. So for example, I think Stable Diffusion, especially version one, was
trained with DDPM-style training, epsilon objective, whatever. But you can now get these different
samplers and different sampling schedules and things like that and use that to sample it and do
it in 15, 20 steps and get pretty nice samples. JEREMY: Yeah. You know, and another
nice thing about this paper is they, you know, in fact, it's the name of
the paper, “Elucidating the Design Space of Diffusion-Based …” models. You know, they looked
at various different papers and approaches and tried to say, like, oh, you know what? These
are all doing the same thing when we kind of parameterize things in this way. And if you fill
in these parameters, you get this paper and these parameters, you get that paper, you know? And
then, so, we found a better set of parameters, which… it was very nice to code because, you know,
it really actually ended up simplifying things a whole lot. And so if you look through the
notebook carefully, which I hope everybody will, you'll see, you know, that the code
is really clear and simple compared to previous, all the previous ones, in
my opinion. Like I feel like every notebook we've done from DDPM onwards,
the code's got easier to understand and the results… TANISHQ: And just to, again, clarify, like,
how this connects with some of the previous papers that we've looked at. So like, for example,
with the DDIM, the deterministic, that's again, the deterministic approach, that's similar to the
Euler method sampler that we were just looking at, which was completely deterministic. And then
something like the Euler Ancestral sampler that we were looking at is similar to
the standard DDPM approach, which was kind of a more stochastic approach.
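To make that contrast concrete, here is a small sketch of one deterministic Euler step next to one ancestral step, on scalars. The function names and the `eta` parameter are assumptions for illustration, and the way the target noise level is split into a deterministic part and a fresh-noise part follows a commonly used formulation rather than anything confirmed to be the notebook's exact code.

```python
import math

def ancestral_sigmas(sigma_from, sigma_to, eta=1.0):
    # Split the target noise level into a deterministic part (sigma_down)
    # and freshly added noise (sigma_up), so that
    # sigma_down**2 + sigma_up**2 == sigma_to**2.
    sigma_up = min(
        sigma_to,
        eta * math.sqrt(sigma_to**2 * (sigma_from**2 - sigma_to**2) / sigma_from**2),
    )
    sigma_down = math.sqrt(sigma_to**2 - sigma_up**2)
    return sigma_down, sigma_up

def euler_step(x, denoised, sigma_from, sigma_to):
    # Deterministic: follow the slope straight to sigma_to (DDIM-like).
    d = (x - denoised) / sigma_from
    return x + d * (sigma_to - sigma_from)

def euler_ancestral_step(x, denoised, sigma_from, sigma_to, noise, eta=1.0):
    # Stochastic: step further down to sigma_down, then add noise back
    # up to sigma_to (DDPM-like). `noise` is a standard normal sample.
    sigma_down, sigma_up = ancestral_sigmas(sigma_from, sigma_to, eta)
    d = (x - denoised) / sigma_from
    x = x + d * (sigma_down - sigma_from)
    return x + noise * sigma_up
```

Setting `eta=0.0` makes `sigma_up` zero, so the ancestral step collapses to the plain Euler step; that is one way to see the deterministic and stochastic samplers as two settings of the same update.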
So, again, there's just all these sorts of connections that then are kind of nice to
see, again, the sorts of connections between the different papers and how they change it, how
they can be expressed in this common framework. JEREMY: Yeah. Thanks, Tanishq. So we definitely
now are at the point where we can show you the UNet next time. And so I think we're, unless any
of us come up with interesting new insights on the unconditional diffusion sampling, training
and sampling process, we might be putting that aside for a while. And instead we're going
to be looking at creating a good quality UNet from scratch. And we're going to look at a
different data set to do that. This is starting to scale things up a bit, as Johno mentioned in
the last lesson. So we're going to be using a 64 by 64 pixel ImageNet subset, called Tiny ImageNet.
So we'll start looking at some 3-channel images. So I'm sure we're all sick of looking at black
and white shoes. So now we get to look at cliff dwellings and trolley buses and koala bears and
yeah, 200 different things. So that'll be nice. Yeah. All right. Well, thank you, Johno. Thank
you, Tanishq. That was fun as always. And next time will be Lesson 22. Bye. JOHNO: This was Lesson 22. JEREMY: Oh, no way. Okay. See you.