Lesson 21: Deep Learning Foundations to Stable Diffusion ...

JEREMY: Hello, Johno.

Hello, Tanishq. Are you guys ready for Lesson 21?

JOHNO: Ready… TANISHQ: Yep, I'm excited.

JEREMY: I don't know what I would have said if you had said no.

So good. I'm actually particularly excited because I

had a little bit of a sneak preview of something that Johno has been working on, which I think

is a super cool demo of what's possible with very little code with miniai.

So let me turn it over to Johno. JOHNO: Great.

Thanks Jeremy. Yeah, so as you'll see when it's back to Jeremy

to talk through some of the experiments and things we've been doing, we've been using

the Fashion-MNIST dataset at a really small scale to really rapidly try out these different

ideas and see some maybe nuances or things that we'd like to explore further.

And so as we were doing that, I started to think that maybe it was about time to explore

just ramping up the level, like seeing if we can go to the next, like slightly larger

datasets, slightly harder difficulty, just to double check that these ideas still hold

for longer training runs and yeah, different, more difficult data.

JEREMY: That's a really good idea because I feel like pretty confident

that the learnings from Fashion-MNIST are going to move across like most of the

time these things seem to, but sometimes they don't and it can be very hard to predict.

So this seems like a very, a very wise choice. JOHNO: Yeah.

And so, we'll keep ramping up, but as a next step, one above Fashion-MNIST, I thought

I'd look at this dataset called CIFAR-10. And so the CIFAR-10 dataset is a very popular dataset

originally for things like image classification, but also now for all of this genera–

any paper on generative modeling. And it's kind of like the smallest

dataset that you'll see in these papers. And so, yeah, if you look at

the classification results, for example, pretty much every classification

paper since they started tracking has reported results on CIFAR-10 as well as their larger

datasets and likewise with image generation, very, very popular.

All of the recent diffusion papers will usually report

CIFAR-10 at the end maybe ImageNet and then whatever large, massive

dataset they're training on. JEREMY: So we were, we were somewhat

notable in 2018 for managing to train. So for CIFAR-10, 94% classification

is kind of the benchmark. So there was a competition a few years ago

where we managed to get to that point at a cost of like 26 cents worth of AWS time, I

think, which won a big global competition. So I actually hate CIFAR-10, but we had

some real fun with it a few years ago. JOHNO: Yeah.

And it's good. It's a nice dataset for quickly testing things

out, but we'll talk about why we also like, us as a group, don't like it at all.

And we'll pretty soon move on to something better. So one of the things you'll notice in this

notebook, I'm basically using all of the same code that Jeremy's going to

be looking at and explaining. So I won't go into too much, but

the dataset's also on Hugging Face. So we can load it just like

we did the Fashion-MNIST. The images are three channels

rather than single channel. So the shape of the data is slightly

different to what we've been working with. That's weird.

Yeah. So we have, instead of a single channel image,

we have a three channel red, green and blue image.

And this is what a batch of data looks like. JEREMY: And you've got 32 images in your batch.

So that's Batch by Channel by Height by Width, right?

JOHNO: Yeah. Yeah.

Batch by Channel by Height and Width. JEREMY: I was a little

confused by the 32 by 32. JOHNO: Oh yeah. JEREMY: I got it now.
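As a small illustration of the Batch by Channel by Height by Width layout discussed here, the sketch below builds a stand-in tensor with the same shape as the CIFAR-10 batch (32 images, 3 channels, 32 by 32 pixels). The zero-filled tensor is only a placeholder, not real data.

```python
import torch

# Stand-in batch with the layout discussed above: 32 RGB images of 32x32 pixels.
batch = torch.zeros(32, 3, 32, 32)
bs, channels, height, width = batch.shape
print(bs, channels, height, width)

# A different batch size leaves the per-image shape (C, H, W) unchanged.
smaller = batch[:8]
print(tuple(smaller.shape))
```

The first 32 is the batch size and can be anything; the last two 32s are the image height and width.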

JOHNO: Batch size can be arbitrary. So if you plot these, one of the things, if

you look at this, okay, I can see these are different classes.

Like I know this is an airplane, a frog, an airplane, but it's actually a puzzle with

an airplane on the cover, a bird, a horse, a car. That one, you squint, you can tell

it's a deer, but only if you really know what you're

looking for. And so when we started to talk about generating

these images, this is actually quite frustrating. Like this, if I generated this, I'd say this

might be the model doing a really bad job. And, but it's actually that

this is a boat, this is a dog. It's just that this is what the data looks like.

JEREMY: And so I've actually got something that can help you out.

I'll show later today, which is something like this.

It's really actually hard to see whether it's good because the images are bad.

It can be helpful to have a metric that can tell how good generated samples are.

So I'll be showing a metric for that later today. JOHNO: Yeah.

And that'll be great. And I hope to have like automated, but anyway, I

just wanted to flag like for visually inspecting these it's not great.

And so we don't really like CIFAR-10 because it's hard to tell.

And but still a good, a good one to test with. So for the noisify and everything, I'm following

what Jeremy is going to be showing exactly. The code works without any changes because

we're adding random noise in the same shape as our data.

So even though our data now has three channels, the

noisify function still works fine. If we try and visualize the noisified images

because we're adding noise in the Red, Green, Blue channels.

And some of that's, you know, kind of extreme values.

Yeah. It looks slightly different.

Looks all crazy RGB. But you can see, for example, this frog doesn't

have as much noise and it's vaguely visible. But it's a nearly impossible

task to look at this and tell what image is hiding under all of that noise.

JEREMY: So I think this is really neat that you could use the same noisify.

Yeah. And it's still, it still works.

It's not just that shape thing, but I guess just thanks to kind of PyTorch's broadcasting

kind of stuff, this often happens. You can kind of change the dimensions

of things and it just keeps working. JOHNO: Exactly.

And we've been paying attention to those broadcasting rules

and the right dimensions and so on. Cool.
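To make the broadcasting point concrete, here is a minimal sketch of a noisify-style function. It is not the notebook's exact code: the linear beta schedule, the 1000 steps and the name `alphabar` are assumptions chosen for illustration. The part that matters is the reshape to `(n, 1, 1, 1)`, which lets one noise level per image broadcast over any number of channels.

```python
import torch

def noisify(x0, alphabar, n_steps=1000):
    """Noise each image in x0 at a random timestep (illustrative sketch)."""
    n = x0.shape[0]
    t = torch.randint(0, n_steps, (n,))
    eps = torch.randn_like(x0)                 # noise takes the shape of the data
    abar_t = alphabar[t].reshape(-1, 1, 1, 1)  # (n,1,1,1) broadcasts over C, H, W
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return (xt, t), eps

# Assumed linear beta schedule, for illustration only.
betas = torch.linspace(1e-4, 0.02, 1000)
alphabar = (1 - betas).cumprod(dim=0)

gray = torch.rand(4, 1, 28, 28)   # single-channel, Fashion-MNIST-like
rgb = torch.rand(4, 3, 32, 32)    # three-channel, CIFAR-10-like
(xt_gray, _), _ = noisify(gray, alphabar)
(xt_rgb, _), _ = noisify(rgb, alphabar)
print(xt_gray.shape, xt_rgb.shape)
```

The same function handles both inputs because nothing in it hard-codes the channel count.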

So I'm going to use the same sort of approach to loading the UNet, except that now obviously

I need to specify three input channels and three output channels because we're working

with three channel images. But I did want to explore for this demo, like,

okay, how could I maybe justify wanting to do this kind of experiment tracking

thing that I'll talk about? And so I'm bumping up the size

of the model substantially. I've gone from, this is the default settings

that we were using for Fashion-MNIST, but the diffusers default UNet has, what,

nearly 20 times as many parameters, 274 million versus

15 million. So we're going to try a larger model.
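A parameter comparison like this one can be reproduced for any PyTorch model by summing the element counts of its parameters. The two tiny models below are made-up stand-ins (not the UNets from the lesson), so the numbers are only there to show the mechanics.

```python
import torch.nn as nn

def count_params(model):
    """Total number of scalar parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters())

# Hypothetical small and wide models, for illustration only.
small = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1))
wide = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 3, 3, padding=1))

print(count_params(small), count_params(wide))
print(f"ratio: {count_params(wide) / count_params(small):.1f}x")
```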

We're going to try some longer training. And so I could just do the same training that

we've always done just in the notebook, set up a Learner with ProgressCB to kind

of plot the loss, track some metrics. But yeah, I don't know about you, but once

it's beyond a few minutes training, I quickly get impatient and I have to wait for

it to finish before we can sample. So I'm doing the DDPM sample, but I have to,

I actually interrupted the training to say, I just want to get a look at what it looks

like initially and to plot some samples. And again, the sampling function works without

any modification, but I'm passing in my size to be a three channel image.

Yeah. And so this is like, we could do it like this,

but at some point I would like to a) keep track of what experiments I've tried and b)

be able to see things as it's going over time, including like, I would love to see what the

samples look like if you generate it after the first epoch, after the second epoch.

And so that's where my like little callback that I've been playing with comes in.

JEREMY: So just before you do that, I'll just mention, like, I mean,

there are simple ways you could do that, right?

Like, you know, one popular way a lot of people do is that they'll save some sample images

as files every epoch or two, or we could like the same way that we have an updating plot

as we train with fastprogress, we could have an updating set of sample images.

So there's a few ways we could, we could solve that.

That wouldn't handle the tracking that you mentioned of like looking over time at how

different changes have improved things or made them worse, whatever that would, I guess

would require you kind of like saving multiple versions of a notebook or keeping some kind

of research journal or something. That'd be a bit fiddly.

JOHNO: It is. And all of that's doable, but I also find

like I'm a little bit lazy sometimes. Maybe I don't write down what I'm trying or

yeah, I've saved untitled number 37 notebook. So yeah, the idea that I wanted to show here

is just that there are lots of other solutions for this kind of experiment tracking and logging.

And one that I really like is called Weights and Biases (W&B).

So I'll explain what's going on in the code here, but I'm running a training with this

additional Weights and Biases callback. And what it's doing is it's allowing

me to log whatever I'd like. And so I can log samples at different.

JEREMY: Okay. So just switching to a

website here called wandb.ai. So that's where your callback

is sending information to. JOHNO: Yeah.

So Weights and Biases accounts are free for personal and academic use.

And it's very, very like, I don't think I know anyone who writes Weights and Biases,

but it's a very nice service. You sign in and you log in on your computer

or you get a little authentication token. And then you're able to log these experiments

and you can log into different projects. What it gives you is for each experiment,

anything that you call Weights and Biases dot log at any step in the training, that's

getting logged and sent to their server and stored somewhere where you can

later like access it and display it. They have these plots that you can, you know,

visualize easily and you can also share them very easily and like these reports that

integrate this data sort of interactively. And why that's nice is that later, like you

can go and look at, so this is another project that I'm logging to.

You can log multiple runs with different settings. And for each of those, you have all of these

things that you've tracked, like your training loss and validation, but you can also track

your learning rate if you're doing a learning rate schedule.

And you can save your model as an artifact and it'll get saved on their server.

So you can see exactly what run produced, what model.

It logs the code too, if you set that up: you can pass,

you know, save_code equals True. And then it creates a copy of your whole Python

environment, what libraries were installed, what code you ran.

So for being able to come back later and say, oh, these images here, these look really good.

I can go back and see, oh, that was this experiment here.

I can check what settings I used. In the initialization, you can log whatever

configuration details you'd like and any comments. And yeah, there's other frameworks for this.

JEREMY: Yeah, in some ways it's kind of, initially when I first saw Weights and Biases, it felt

a bit weird to me actually, like sending your information off to an external website because,

I mean, before Weights and Biases existed, the most popular way to do this was something

called TensorBoard, which Google provides, which is actually a lot like this, but it's

a little server that runs on your computer. And so like when you log things, it just puts

it into this little database on your computer, which is totally fine.

But I guess actually there are some benefits to having somebody else run this service,

you know, instead of running your own little TensorBoard or whatever server, you know,

one is that you can have multiple people working on a project collaborating.

So I've done that before where we will each be sending like different sets of hyper parameters

and then they'll end up in the same place. Or if you want to be really antisocial, you

know, you can interrupt your romantic dinner and look at your phone to see

how your training's going. So like, yeah, I'm not going to say it's like

always the best approach to doing things, but I think there's definitely

benefits to using this kind of service. And it looks like you're showing us that you

can also create reports for sharing this, which is also pretty nifty.

JOHNO: Yeah. Yeah.

I mean, like for working with other people or like you want to show somebody the final

results and being able to, yeah, like pull together the results from some different runs

or just say, oh, look, by the way, here's a set of examples from my two most recent

and things tracked at different steps. What do you think of this?

And yeah, being able to have this like place where everyone can go and they can inspect

the different loss curves for any run. They can say, oh, you know, what were

the, what was the batch size for this? Let me go look at the info there.

Okay. I didn't log it, but I logged how

many epochs and the learning rate. So yeah, I find it quite nice, especially

in a team or if you're doing lots and lots of experiments to be able to like have this

permanent record that somebody else deals with, and they host the storage and the tracking.

Yeah, it's quite nice. JEREMY: Wait, and this is all

the code you had to write? That's amazing.

JOHNO: Yeah. So this is using the callback system.

The way Weights and Biases works is that you start a… you start an experiment with this

wandb.init, and you can specify any like configurational settings that you use

there. And then anything you need to log is

wandb.log and you pass in whatever the name of your value is.

So again, logging the loss and then the value, and once you're done, wandb.finish and

that syncs everything up and sends it to the server.
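The init, log, finish lifecycle described here can be sketched as follows. So that it runs offline, the example uses `PrintTracker`, a hypothetical stand-in that exposes the same three calls; the project name and config values are also made up.

```python
class PrintTracker:
    """Hypothetical offline stand-in exposing init / log / finish."""
    def __init__(self):
        self.records = []
    def init(self, project=None, config=None):
        print(f"run started: project={project}, config={config}")
    def log(self, data):
        self.records.append(data)
    def finish(self):
        print(f"run finished with {len(self.records)} logged steps")

def run_experiment(tracker, config, losses):
    tracker.init(project="ddpm_cifar10", config=config)  # project name is made up
    for loss in losses:                # stands in for the training loop
        tracker.log({"loss": loss})    # a dict of name -> value, once per step
    tracker.finish()                   # flush and close the run

tracker = PrintTracker()
run_experiment(tracker, {"lr": 1e-3, "epochs": 1}, [0.9, 0.5, 0.3])
```

With the wandb package installed and an account logged in, passing the module itself, as in `run_experiment(wandb, ...)`, should behave the same way, since `wandb.init`, `wandb.log` and `wandb.finish` are module-level functions.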

JEREMY: Oh, this is wild the way you've inherited

from MetricsCB and you replaced that _log that we previously used to allow

fastprogress to do the logging, and you've replaced it to allow Weights

and Biases to do the logging. So yeah, it's really sweet.

JOHNO: Yeah. Yeah.

So this is using the callback system. I wanted to do the things that MetricsCB normally

does, which is tracking different metrics that you pass in.

So this will still do that. And I just offload to the super, like the

original MetricsCB method for things like the after_batch.

But in addition to that, I'd also like to log to Weights and Biases.

And so before I fit, I initialize the experiment. Every batch I'm going to log

the loss. After every epoch the default metrics callback is going

to accumulate the metrics and so on. And then it's going to call this _log function.

So I chose to modify that to say, I'm going to log my training loss

if it's training. I'm going to log my validation loss if I'm

doing validation and I'd like it to log some samples and Weights and Biases is quite

flexible in terms of what you can log. You can create images or

videos or audio or whatever. But it also takes a matplotlib figure.

And so I'm generating samples and plotting them with show_image and spitting back that

matplotlib figure, which I can then log. And that becomes these pretty

pictures that you can see over time. Like every time that log function runs, which

is after every epoch, you can go in and see what the images look like.
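The pattern described here, inheriting from a metrics callback and overriding only its `_log` method, can be sketched without miniai. `MetricsBase` below is a simplified stand-in rather than the real MetricsCB, and `log_fn` and `sample_fn` are hypothetical parameters. Logging samples together with the validation loss is also an assumption about when the samples get logged.

```python
class MetricsBase:
    """Simplified stand-in for a metrics callback: builds a dict, then calls _log."""
    def _log(self, d):
        print(d)
    def after_epoch(self, epoch, training, loss):
        self._log({"epoch": epoch, "train": "train" if training else "eval", "loss": loss})

class TrackerMetrics(MetricsBase):
    def __init__(self, log_fn, sample_fn=None):
        self.log_fn, self.sample_fn = log_fn, sample_fn
    def _log(self, d):
        if d["train"] == "train":
            self.log_fn({"train_loss": d["loss"]})
        else:
            payload = {"val_loss": d["loss"]}
            if self.sample_fn is not None:
                payload["samples"] = self.sample_fn()  # e.g. a figure of generated images
            self.log_fn(payload)
        super()._log(d)  # keep the printed output as well

sent = []  # collects what would be sent to the tracking service
cb = TrackerMetrics(sent.append, sample_fn=lambda: "figure placeholder")
cb.after_epoch(0, True, 0.42)
cb.after_epoch(0, False, 0.40)
```

Everything else in the base class is inherited unchanged, which is why so little code is needed.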

JEREMY: So where do you think you can make your code even simpler in the future?

If we had show_images, maybe it could have like an optional return fig parameter that

returns the figure and then we could replace those four lines of code with one, I suspect.

JOHNO: Yeah. Yeah.

And I mean this, I just sort of threw this together.

It's quite early still. You could also, what I've done in the past

is usually just create a PIL image where you can, you know, make a grid or overlay

text or whatever else you'd like. And then just log that as wandb.Image.

Otherwise like apart from that, I'm just passing in this callback as an extra callback to,

um, my set of callbacks for the learner, instead of a metric callback.

And so when I call .fit, I still get my little progress bar.

I still get this printed out version because my log function still also prints those metrics

just for debugging. But instead of having to like watch the progress

in the notebook, I can set this running disconnect from the server, go have dinner and then

I can check on my phone or whatever. What do the samples look like?

And okay, cool. It's starting to look like less than random

nonsense, but still not necessarily recognizable. Maybe we need to train for longer.

That can be the next experiment. What I should probably do next

is think of some extra metrics, but Jeremy's going to

talk about that. So for now, that's pretty much all I had to

show is just to say, yeah, it's worth as you move to these longer, you know, 10 minutes,

one hour, 10 hours, these experiments, it's worth setting up a bit of infrastructure for

yourself so that you know what were the settings I used.

And maybe you're saving the model so you have the artifact as a result.

And yeah, I like this Weights and Biases approach, but there's lots of others.

The main thing is that you're doing something to track these experiments beyond just, you

know, creating plenty of different versions of your notebook.

JEREMY: I love it. TANISHQ: One thing I was going to note that I

don't know if many people know, but like Weights and Biases can also save the exact code

that you used for that run. So like if you make any changes to your code

and then you, you know, then you don't know which version of your code you used

for this particular experiment. So then you can figure out

exactly what code you used. So it's all completely reproducible.

And so I love, you know, Weights and Biases, all these different features it has. And I use Weights and Biases all the time

for my own research, like almost daily. Like I had to put a run just last

night and check on it today morning. So it's like, I use it all the time for my own

research and yeah, like I use it especially to just know like, oh, this

run had this particular config. And then like, yeah, the models go

straight into Weights and Biases. And then if I want to run a model on the test

set, I literally actually take it off of Weights and Biases, like download it from Weights

and Biases and run it on the test set. So I use it all the time.

And also just having the ability to have everything reproducible

and know exactly what you were doing is very convenient instead of having

to like manually track it in some sort of like, I guess a big Excel sheet or some

sort of journal or something like that. Sometimes this is, you know, this

is a lot more convenient, I feel so. Yeah.

JEREMY: Lest we get into too much shilling for Weights and Biases, I'm going to put a slightly

alternative point of view, which is, I don't use it or any experiment tracking framework

myself. Which is not to say maybe I could get some

benefits by doing so, but I fairly intentionally don't because I don't want to make it easy

for myself to try a thousand different hyperparameters or do kind of like undirected, you know, sampling of things.

I like to be very like directed, you know. And so that's kind of the workflow I'm

looking for is one that allows that to happen, right? Constantly going back

and refactoring and thinking, what did I learn and how do I change things

from here and never kind of doing like 17 learning rates and 6 architectures and whatever.

Now obviously that's not something that Johno is doing at the moment.

But it would be so easy for him to do, if he wanted to.

JOHNO: I can now make a script that just does a hundred runs

with different models and different tasks and then I can look at my Weights and

Biases and say filter by the best loss. JEREMY: Yeah.

JOHNO: Which is very tempting. JEREMY: I would say to people like, yeah,

definitely be aware that these tools exist. And I definitely agree that as we do this,

which is early 2023, Weights and Biases is by far the best one I've seen.

It has by far the best integration with fastai. And as of today, if Johno's pushed it yet, it

has by far the best integration with miniai. I think also fastai is the best library

for using with Weights and Biases. It works in both ways.

So yeah, know it's there. Consider using it, but also consider not going

crazy on experiments because, you know, I think experiments have their place clearly,

but also carefully thought out hypotheses, testing them, changing your code is

overall the approach that I think is best. Well, thank you, Johno.

I think that's awesome. I got some fun stuff to share as

well, or at least I think it's fun. And what I wanted to share is like, well…

first of all, I should say we had said, we all had said that we were

going to look at UNets this week. We are not going to look at UNets this week,

but we have good reason, which is that we had said, we're going to go from

foundations to Stable Diffusion. Well, that was also a lie because we're

actually going beyond Stable Diffusion. And so we're actually going to start

showing today some new research directions. I'm going to describe the process

that I'm using at the moment to investigate some new

research directions. And we're also going to be looking at some

other people's research directions that have gone beyond Stable Diffusion

over the past few months. So we will get to UNets, but we haven't quite

finished, you know, as it turns out, the training and sampling yet.

Now one challenge that I was having as I started

experimenting with new things was started getting to the point where actually the generated

images looked pretty good and it felt like you know, almost like being a parent, you

know, each time a new set of images would come out, I would want to convince myself

that these were the most beautiful. And so, yeah, like when

they're crap, it's obvious they're crap, you know, but when they're starting to look pretty

good, it's very easy to convince yourself you're improving.

So I wanted to, you know, have a metric which could tell me how good they were.

Now unfortunately there is no such metric. There's no metric that actually says, do these

images, would these images look to a human being like pictures of clothes?

Because only talking to a person can do that. But there are some metrics which

give you an approximation of that. And as it turns out, these metrics are not

actually, they're not actually a replacement for human beings looking at things.

But they're a useful addition. So I mean, I certainly found them useful.

So I'm going to show you the two most common, well there's really the one most common metric,

which is called FID. And I'm going to show another

one called KID. So let me describe and show how they work.

And I'm going to demonstrate them using the model we trained in the last lesson, which

was in DDPM2. And you might remember we

trained one with mixed precision and we saved it as

fashion_ddpm_mp for mixed precision. Okay, so this is all the usual imports and stuff.

This is all the usual stuff. But there's a slight difference this time,

which is that we're going to try to get the FID for a model we've already trained.

So basically to get the model we've already trained to get its FID, we can just torch.load

it and then dot cuda() to pop it on the GPU. So I'm going to call that the smodel, which

is the model for samples, the samples model. And this is just a copied and

pasted DDPM from the last time. So that's for sampling.

So we're going to do sampling from that model. And so once we've sampled from the model,

we're then going to try and calculate this score called the FID.

Now what the FID is going to do is it's not going to say how good are these images.

It's going to say how similar are they to real images.

And so the way we're going to do that is we're going to actually look specifically at for

the images that we generated in these samples, we're going to look at some statistics of

some of the activations. So what we're going to do, we've generated

these samples and we're going to create a new Dataloaders, which

contains no training batches. And it contains one validation

batch, which contains the samples. It doesn't actually matter

what the dependent variable is. So I just put in the same dependent

variable that we already had. And then what we're going to do is we're going

to use that to extract some features from a model.

Now, what do we mean by that? So if you remember back to notebook 14, we

created this thing called summary and summary shows us at different blocks of our model,

there are various different output shapes. In this case, it's a batch size of 1024.

And so after the first block, we had 16 channels, 28 by 28.

And then we had 32 channels, 14 by 14 and so forth until at the, just before the final

linear layer, we had the 1024 items in the batch and we had 512 channels with no height and

width. Now the idea of FID and KID is that

the distribution of these 512 channels for a real image has

a particular kind of like signature, right? It looks a particular way.

And so what we're going to do is we're going to take our samples.

We're going to run it through a model that's learned to predict, you know, fashion classes,

and we're going to grab this layer, right? And then we're going to average

it across a batch, right? To get 512 numbers.

And that's going to represent the mean of each of those channels.

So those channels might represent, for example, you know, does it have a pointed collar?

Does it have, you know, smooth fabric? Does it have sharp heels and so forth?

Right. And you could recognize that something is

probably not a normal fashion image if it says, oh yes, it's got sharp

heels and flowing fabric. It's like, oh, that doesn't

sound like anything we recognize. So there are certain kind of like sets of

means of these activations that don't make sense.

So that's... JOHNO: This is a metric for, it's not a metric

for an individual image necessarily, but it's across a whole lot of images.

So if I generate a bunch of fashion images and I want to say, does this look like a bunch

of fashion images? If I look at the mean, like maybe X percent

have this feature and X percent have that feature.

So if I'm looking at those means, as like comparing the distribution, within all these

images I generated, do roughly the same amount have sharp collars as those in the…

JEREMY: Yeah, that's a very good point too. JOHNO: Well, the features generated are similar.

JEREMY: Yeah. And it's actually going to get

even more sophisticated than that, but let's just start at that level,

which is this features dot mean. So the basic idea here is that we're going

to take our samples and we're going to pass them through a pre-trained model that has

learned to predict what type of fashion something is.

And of course we train some of those in this notebook.

And specifically we trained a nice 20 epoch one in the data augmentation section, which

had a 94.3% accuracy. And so if we pass our samples through this

model, we would expect to get some useful features.

One thing that I found made this a bit complicated though, is that this model was trained using

data that had gone through this transformation of subtracting the mean and dividing by the

standard deviation. And that's not what we're creating in our samples.

And so generally speaking, samples in most of these kinds of diffusion models tend to

be between negative one and one. So I actually added a new section to the very

bottom of this notebook, which simply replaces the transform with something that goes from

negative one to one, and just creates those Dataloaders and then trains

something that can classify fashion. And I saved this as not data_aug, but data_aug2.

So this is just exactly the same as before, but it's a fashion classifier where the inputs

are expected to be between minus one and one. Having said that, it turns out that our

samples are not between minus one and one. But actually, if you go

back and you look at DDPM2, we just use TF.to_tensor, and that

actually makes images that are between zero and one.

So actually that's a bug. Okay, so our images have a bug, which

is they go between zero and one. So we'll look at fixing that in a moment.

But for now, we're just trying to get the FID of our existing model.

So let's do that. So what we need to do is we need to take the

output of our model, and we need to multiply by two, so that'll be between

zero and two, and subtract one. So that'll change our samples to be between

minus one and one, and we can now pass them through our pre-trained fashion classifier. Okay, so now how do we get the output of

that pooling layer? Because that's actually what we want, to remind you.
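The rescaling step mentioned just above, multiplying by two and subtracting one, can be sketched in a couple of lines. The random tensor is only a stand-in for samples in the zero-to-one range.

```python
import torch

samples01 = torch.rand(16, 1, 32, 32)   # stand-in for generated samples in [0, 1]
samples = samples01 * 2 - 1             # [0, 1] -> [0, 2] -> [-1, 1]
print(samples.min().item(), samples.max().item())
```

Going back the other way is `(samples + 1) / 2`.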

We want the output of this layer. So just to kind of flex our PyTorch muscles,

I'm going to show a couple of ways to do it. So we're going to load the model I

just trained, the data_aug2 model. And what we could do is, of

course, we could use a hook. And we have a hook's callback.

So we could just create a function, which just appends the output.

So very straightforward. Okay, because that's what we want. We want the output.

And specifically, it's, so we've got these are all sequentials.

So we can just go through and go, oh, one, two, three, four, five, the layer that we

want. Okay, and so that's the

module that we want to hook. So once we've hooked that, we

can pass that as a callback. And we can then, it's a bit weird calling

fit, I suppose, because we're saying train equals false, but we're just basically capturing.

This is just to make one batch go through and grab the outputs.

So this means now in our hook, there's now going to be a thing called outp, because we

put it there. And we can grab, for example,

a few of those to have a look. And yep, here, we've got a

64 by 512 set of features. Okay, so that's one way we can do it.
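The lesson does this with miniai's hooks callback. As a rough equivalent in plain PyTorch, the sketch below uses `register_forward_hook` on a tiny stand-in classifier (much smaller than the real one, so the features are 64 by 16 rather than 64 by 512).

```python
import torch
import torch.nn as nn

# Tiny stand-in classifier; like the lesson's model, it is an nn.Sequential.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # pooled features: (batch, 16)
    nn.Linear(16, 10),
)

captured = []
def save_output(module, inputs, output):
    captured.append(output.detach())         # the hook just appends the output

handle = model[5].register_forward_hook(save_output)  # hook the layer after pooling
with torch.no_grad():
    model(torch.randn(64, 1, 28, 28))        # one batch, no training
handle.remove()                               # always remove hooks when done
print(captured[0].shape)
```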

Another way we could do it is that actually sequential models are what are called, in Python,

collections: they have a certain API that they're expected to support.

And one of the things a collection can do, like a list, is you can call del to delete something.

So we can delete this layer and this layer and be left with just these layers.

And once we do that, that means we can just call capture_preds, because now they don't

have the last two layers. So we can just delete layers eight

and seven, call capture_preds. And one nice thing about this is

it's going to give us the entire 10,000 images in the

test set. So that's what I ended up deciding to do.
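The delete-the-head approach can be sketched on a small stand-in model. The layer sizes here are made up; the point is that `nn.Sequential` supports `del`, and that deleting the higher index first keeps the remaining indices valid.

```python
import torch
import torch.nn as nn

# Stand-in classifier ending in a linear layer plus batchnorm head.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 10), nn.BatchNorm1d(10),
)

del model[5]   # remove the last layer first...
del model[4]   # ...then the one before it

model.eval()
with torch.no_grad():
    feats = model(torch.randn(4, 1, 28, 28))
print(len(model), feats.shape)   # the model now outputs pooled features
```

After the two deletions, calling the model directly returns the pooled features, so no hook is needed.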

There's lots of other ways I played around with which worked, but I decided to show these

two as being two pretty good techniques. Okay, so now we've got what 10,000 real images

look like at the end of the pooling layer. So now we need to do the same for our sample.

So we'll load up our fashion_ddpm_mp, we’ll call sample.

Let's just grab 256 images for now, make them go between minus one and one, make sure they

look okay. And as I described before, created Dataloaders

where the validation set just has one batch, which contains our samples and call capture_preds.

Okay, so that's going to give us our features. And the reason why is because we're passing the

sample to model and model is the classifier. Okay, which we've deleted

the last two layers from. So that's going to give us our 256 by 512.

So now we can get the means. Now that's not really enough to tell us

whether something looks like real images. So maybe I should draw here.

So we started out with our batch of 256 and our channels of 512.

And we squished them by taking their mean. It's now just a 256 vector.

No, this is the wrong way around. We squished it this way, to 512, because

this is the mean for each channel. And we did exactly the same

thing for the much bigger, full set of real images.

So this is our samples. And this is our real. But when we squish it, that's

10,000 by 512, we get again 512. So we could now compare these two.

But you know, you could absolutely have some samples that don't look anything like images,

but have similar averages for each channel. So we do a second thing, which

is we create a covariance matrix. Now if you've forgotten what this is, you

should go back to our previous lesson where we looked at it.

But just to remind you, a covariance matrix says, in this case, we do it across the channels,

it's going to be 512 by 512. So it's going to take each of these columns.

And it says, in each cell, so here's cell 1, 1. Basically it says, what's the difference between…

it basically saying what's the difference between each row, each element here, and the

mean of the whole column, multiplied by the exactly the same thing for a different column.

Now on the diagonal, it's the same column twice. So that means that these on the

diagonal are just the variances. But more interestingly, the ones in the off

diagonal, like here, is actually saying, what's the relationship between

column one and column two. So if column one and column two are

uncorrelated, then this would be zero. If they were identical, then it would

be the same as the variance in here. So it's how correlated are they.

And why is this interesting? Well, if we do the same, exactly the same

thing for the reals, that's going to give us another 512 by 512.

And it's going to say things like, so let's say this first column was kind of like, you

know, does it have pointy heels? And sorry, heels, can't spell.

And the second one might be, does it have flowing fabric?

Right. And this is where we say, okay, if, you know,

generally speaking, you would expect these to be negatively correlated.

Right? So over here in the reals, this is

probably going to have a negative, right? Whereas if over here it was like zero or even

worse if it's positive, it'd be like, oh, those are probably not real, right?

Because it's very unlikely you're going to have images where pointy heels

are positively associated with flowing fabric. So we're basically looking for two data sets

where their covariance matrices are kind of the same and their means

are also kind of the same. All right.
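
A small sketch of the two summaries described above: the per-channel mean vector and the channel-by-channel covariance matrix. The feature tensors here are random placeholders with the shapes from the lesson (256 samples and 10,000 real images, 512 channels each), and the function name calc_stats is illustrative.

```python
import torch

def calc_stats(feats):
    # feats: (n_items, n_channels).
    # Returns the mean of each channel and the covariance across channels.
    # Tensor.cov() treats rows as variables, hence the transpose.
    return feats.mean(0), feats.T.cov()

torch.manual_seed(0)
sample_feats = torch.randn(256, 512)    # placeholder for sample features
real_feats = torch.randn(10000, 512)    # placeholder for real-image features

m_s, c_s = calc_stats(sample_feats)
m_r, c_r = calc_stats(real_feats)
print(m_s.shape, c_s.shape)  # torch.Size([512]) torch.Size([512, 512])
```

The diagonal of each covariance matrix is the variance of each channel, and the off-diagonal cells say how pairs of channels move together.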

So there are ways of comparing these, you know,

basically comparing two sets of data to say, are they, you know,

from the same distribution? And you can broadly think of it as being like,

oh, do they have pretty similar covariance matrices?

Do they have pretty similar mean vectors? And so this is basically what the

Fréchet Inception Distance does. Does that make sense so far, guys?

JOHNO: Yes. It's only striking me now how strong the

similarity is to when we were talking about the style loss and those kinds of things: how do we measure the types of features that

occur together without worrying about which items in the data set they occur in?

JEREMY: The Gram matrices or whatever. JOHNO: Exactly.

JEREMY: Yeah. JOHNO: Yeah.

JEREMY: Now the particular way of comparing, so, okay, so

I've got the means and I've got the covariances for my samples.

And I've actually just created this little _calc_stats, right?

So I always, I'm showing you how I build things, not just things that are built, right?

So I always create things step by step and check their shapes, right?

And then I copy the cells and merge them into

functions. So here's something that gets the

means and the covariance matrix. So then I basically call that both for

my sample features and for the features of the actual data set, that is, the

test set. Now what I do with that is, once I have

those stats, I can calculate this thing called the Fréchet Inception

Distance, which is here. And basically what happens is we multiply

together the two covariance matrices and that's now going to make them like bigger, right?

So we now need to basically scale that down again. Now, if we were working with, you know, plain

numbers rather than matrices, and you multiplied two things together, then to bring

it back down to the original scale, you could take

the square root, right? In particular, if you multiplied something by itself and took

the square root, you'd get back to the original. And so we need to do exactly the same

thing to renormalize these matrices. The problem is that we've got matrices and

we need to take the matrix square root. Now the matrix square root, you might not

have come across this before, but it exists. And it's the thing where the matrix square

root of the matrix A times itself is A. Now I'm going to slightly cheat because we've used the float square root before and we did

not re-implement it from scratch because it's in the Python standard library.

And also it wouldn't be particularly interesting, but basically the way you can calculate the

float square root from scratch is by using, there's lots of ways, but you know, the classic

way that you might've done it in high school is to use Newton's method, which is where

you basically can solve, if you're trying to calculate, you know, A equals the root

X, then you're basically saying A squared equals X, which means you're saying A squared

minus X equals zero. And that's an equation that you can solve

and you can solve it by basically taking the derivative and taking a step along

the derivative a bunch of times. You can basically do the same thing

to calculate the matrix square root. And so here it is, right?
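
Before the matrix version, here is a sketch of the plain float case just described: Newton's method applied to a squared minus x equals zero. The function name, starting guess and tolerance are illustrative choices.

```python
def newton_sqrt(x, n_iters=100, tol=1e-12):
    # Solve f(a) = a*a - x = 0 with Newton steps: a <- a - f(a) / f'(a).
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    a = x if x >= 1 else 1.0          # any positive starting guess works
    for _ in range(n_iters):
        a_new = a - (a * a - x) / (2 * a)
        if abs(a_new - a) < tol:      # stop once the steps become tiny
            return a_new
        a = a_new
    return a

print(newton_sqrt(2.0))
```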

It's the Newton method, but because it's the matrices,

it's slightly more complicated. So it's the Newton-Schulz method and I'm not

going to go through it, but it's basically the same deal you go through up to a hundred

iterations and you basically do something like traveling along that kind of derivative.

And then you say, okay, well, the result times itself ought to equal the original matrix.

So let's subtract the matrix times itself from the original matrix and see whether the

absolute value is small and if it is, we've calculated it.

Okay. So that's basically how we

do a matrix square root. So we do that.
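
A sketch of a Newton-Schulz style iteration for the matrix square root, following the description above: iterate up to a hundred times, and stop when the result times itself is close to the original matrix. This is one common form of the iteration, written under the assumption that the input is symmetric positive definite (as a covariance matrix typically is); it is not necessarily line for line what the notebook does.

```python
import torch

def sqrtm_newton_schulz(mat, n_iters=100, tol=1e-10):
    mat = mat.double()
    norm = mat.norm()                      # Frobenius norm, used for scaling
    y = mat / norm                         # y converges to sqrt(mat / norm)
    eye = torch.eye(mat.shape[0], dtype=mat.dtype)
    z = eye.clone()                        # z converges to the inverse of y
    for _ in range(n_iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y, z = y @ t, t @ z
        res = y * norm.sqrt()              # undo the scaling
        # Done once res @ res reproduces the original matrix closely enough.
        if (mat - res @ res).abs().max() < tol:
            break
    return res

a = torch.tensor([[4.0, 1.0], [1.0, 3.0]])
root = sqrtm_newton_schulz(a)
print((root @ root - a.double()).abs().max())
```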

And so now that we have, strictly speaking, implemented it from scratch, we're allowed to

use the one that already exists. PyTorch doesn't have one, sadly, so we have

to use the one from SciPy, scipy.linalg. So this is basically going to give us a measure

of similarity between the two covariance matrices. And then here's the measure

of similarity between the two mean vectors, which is just

the sum of squared errors. And then basically for reasons that aren't

interesting, it's just normalizing, we add what's called the trace, which

is the sum of the diagonal elements, of each covariance matrix. And we subtract two times

the trace of this matrix square root. And that's called the Fréchet Inception Distance.
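
For reference, the standard way this distance is usually written, with the subscripts s and r (introduced here for the sample and real statistics) on the mean vectors and covariance matrices:

```latex
\mathrm{FID} = \lVert \mu_s - \mu_r \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_s + \Sigma_r - 2\,(\Sigma_s \Sigma_r)^{1/2} \right)
```

The first term is the sum of squared differences between the means; the second combines the traces of the two covariance matrices with twice the trace of the matrix square root of their product.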

I'm being a bit hand-wavy on the math because I don't think it's particularly relevant to

anything, but it gives you a number which represents how similar, you know, these stats

for the samples are to these stats for some real data. Now it's weird it's called Fréchet Inception

Distance when we've done nothing to do with inception.

The reason why is that people do not normally use the fastai Part 2 custom

Fashion-MNIST data_aug2 dot pickle. They normally use a more famous model.

They normally use the inception model, which was an ImageNet winning model from Google

Brain from a few years ago. There's no reason whatsoever that

inception is a good model to use for this. It just happens to be the one

which the original paper used. And as a result, everybody now uses that not

because they are sheep, but because you want to be able to compare your results

with other people's results. But we actually don't; we want

to compare our results with our other results. And we're going to get a much more accurate

metric if we use a model that's good specifically at recognizing fashion.

So that's why we're using this. Very, very few people bother to do this.

Most people just pip install pytorch-fid or whatever it's called and use Inception, but

actually, unless you're comparing to papers, it's better

to use a model that you've trained on your data and you know is good at that.

So I guess this is not a FID. It's a, well maybe FID now

stands for Fashion-MNIST. I don't know what it stands for.

I should have done something. TANISHQ: I wanted to bring up two other

caveats of FID, especially in papers. The other thing is that FID is dependent

on the number of samples that you use. So when you're measuring

FID, it's more accurate if you use more samples and it's less accurate if you use fewer samples.

JEREMY: Well, I think it's not just less accurate, it's actually biased.

So if you use fewer samples, it's too high specifically.

TANISHQ: Yeah. So in papers, you'll see them

report how many samples they used. And so even then comparing to other papers

and comparing between different models and different things, you want to make sure that

you're comparing with the same number of samples. Otherwise it might just be high because they

just used fewer samples or something like this.

So you want to make sure that's comparable. And then the other thing, which I

guess is kind of a side effect of using the Inception network in these papers, is the

fact that all of these are at a size of 299 by 299, which is the size that

the Inception model was trained on. So actually when you're applying this Inception

network for measuring this distance, you're going to be resizing your images to 299 by

299, which in some cases may not make much sense.

So like in our case, we're working with 32 by 32 or 28 by 28 images.

These are very small images. And if you resize it to 299 or in other cases,

this is now kind of an issue with some of these latest models, you have these

large 512 by 512 or 1024 by 1024 images. And then you're kind of shrinking these images

to 299 by 299 and you're losing a lot of that detail and quality in those images. So actually it's kind of become a problem

with some of these latest papers when you look at the FID scores and

how they're comparing them. And then visually when you see them, you can

kind of notice, oh yeah, these are much better images, but the FID score doesn't capture

that as well because you're actually using these much smaller images.

So there are a bunch of different caveats. And so FID, it's very good for like, yeah,

it's nice and simple and automated for this sort of comparison, but you have to be aware

of all these different caveats of this metric as well.

JEREMY: So excellent segue because we're going to look at exactly those two things right now.

And in fact, there is a metric that compares the two distributions in a way that is not

biased. So it's not necessarily higher or lower if

you use more or less samples and it's called the KID or KID, which is the

Kernel Inception Distance. It's actually significantly simpler to

calculate than the Fréchet Inception Distance. And basically what you do is you create a

bunch of groups, a bunch of partitions, and you go through each of those partitions and

you grab a few of your Xs at a time and a few of your Ys at a time.

And then you calculate something called the MMD, which is here, which is basically the…

again, the details don't really matter. We basically do a matrix product

and we actually take the cube of it. This K is for kernel, and we basically

do that for the first sample compared to itself, the second compared to itself

and the first compared to the second. And we then normalize them in various ways

and add the two compared-with-themselves terms together and subtract the compared-with-each-other one.

And this one actually does not use the stats. It doesn't use the means and covariance matrices.

It uses the features directly. And the actual final result is basically the

mean of this calculated across different little batches.
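
A sketch of the calculation just described: a cubic kernel on the raw features, the within-set terms added, the cross term subtracted, and the result averaged over little groups. The function names, the group count and the random test features are illustrative assumptions.

```python
import torch

def squared_mmd(x, y):
    # Cubic polynomial kernel on the features: k(a, b) = (a.b / d + 1) ** 3
    def k(a, b):
        return (a @ b.T / a.shape[-1] + 1) ** 3
    m, n = x.shape[0], y.shape[0]
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    # Drop the diagonal (each item compared with itself) from within-set sums.
    kxx_sum = kxx.sum() - kxx.diagonal().sum()
    kyy_sum = kyy.sum() - kyy.diagonal().sum()
    return kxx_sum / (m * (m - 1)) + kyy_sum / (n * (n - 1)) - 2 * kxy.sum() / (m * n)

def kid(x, y, n_groups=10):
    # Mean of the squared MMD over matching chunks of the two feature sets.
    pairs = zip(x.chunk(n_groups), y.chunk(n_groups))
    return torch.stack([squared_mmd(a, b) for a, b in pairs]).mean().item()

torch.manual_seed(0)
x = torch.randn(200, 16)
y = torch.randn(200, 16)          # same distribution as x
z = torch.randn(200, 16) + 3      # clearly shifted distribution
print(kid(x, y), kid(x, z))
```

Two sets from the same distribution should score close to zero, and the shifted set should score much higher.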

Yeah, again, the math doesn't really matter as to, you know, exactly why all these are

exactly what they are, but it's going to give you again, a measure of the similarity of

these two distributions. At first I was confused as to why more people

weren't using this because people don't tend to use this and it doesn't

have this nasty bias problem. And now that I've been using it for a while,

I know why, which is that it has a very high variance, which means when I call it multiple

times with just like samples with different random seeds, I get very different values. And so I actually haven't

found this useful at all. So we're left in the situation where, yeah,

we don't actually have a good unbiased metric. And I think that's the truth of

where we are with best practices. And even if we did, all it would tell you is

how similar the distributions are to each other.

It doesn't actually tell you whether they look any good really.

So that's why in pretty much all good papers, they have a section on human testing.

But I've definitely found FID useful for comparing fashion images,

particularly because humans are good at looking at faces that are reasonably high resolution

and being like, oh, that eye looks kind of weird, but we're not good at looking at 28

by 28 fashion images. So it's particularly helpful for

stuff that our brains aren't good at. So I basically wrapped this up into a class,

which I called ImageEval for evaluating images. And so what you're going to do is you're going

to pass in a pre-trained model for a classifier and your Dataloaders, which is the thing that

we're going to use to, to basically calculate the real images.

So that's going to be, you know, the Dataloaders that were in this learn.

So the real images. And so what it's going to do in this class

that again, I, this is just copying and pasting the previous lines of code

and putting them into a class. This is going to be then something

that we call capture_preds on to get our features for

the real images. And then we can also calculate

the stats for the real images. And so then we can call fid by calling _calc_fid,

which is the thing we already had passing in the stats for the real images and calculated

stats for the features from our samples, where the features, the thing that we've

seen before, we pass in our samples, any random Y value

is fine. So I just have a single tensor

there and call capture_preds. So we can now create an ImageEval mod object

passing in our classifier, passing in our Dataloaders with the real data,

any other callbacks you want. And if we call fid, it takes about a quarter of

a second and 33.9 is the FID for our samples. Okay, then KID:

KID’s going to be on a very different scale. It's only 0.05.

So KIDs are generally much smaller than FIDs. So we're mainly going to be looking at FIDs.

And so here's what happens if we call FID on sample 0 and then sample 50 and then

sample 100 and so forth all the way up to 900. And then we also do samples, 975, 990, 999.

And so you can see over time, our samples’ FID’s improved.

So that's a good little test. There's something curious about the fact

that they stopped improving about here. So that's interesting.

I have not seen anybody plot this graph before. I don't know if Johno or Tanishq, if you guys

have, I feel like it's something people should be looking at because it's really telling you:

is your sampling making consistent improvements. JOHNO: And to clarify, this is like the predicted

denoised sample at the different stages during sampling, right?

JEREMY: Yes, exactly. JOHNO: If I was to stop sampling now and just

go straight to the predicted x0, what would the FID be?

JEREMY: So I just want to check our samples.

Yeah, we add the x0_hat at each time. Yep.

Yep. Exactly.

Same for KID. And I was hoping that they

would look the same and they do. So that's encouraging that KID and FID

are basically measuring the same thing. And then something else that I haven't seen

people do, but I think is a very good idea, is to take the FID of an actual batch of data. Okay.

And so that tells us how good we could get. Now that's a bit unfair because of the

different sizes: our data is 512, our sample was 256, but anyway, it's

a pretty huge difference. And then, yeah, the second thing that Tanishq

talked about, which I thought I'd actually show, is what does it take to

get, you know, a real FID with the Inception network?

So I didn't particularly feel like re-implementing the Inception network.

So I guess I'm cheating here. I'm just going to grab it from pytorch_fid.

But there's absolutely no reason to study the Inception networks.

It's totally obsolete at this point. And as Tanishq mentioned, it wants 299 by

299 images, although actually you can just pass resize_input to have that done for you.

It also expects three channel images. So what I did is I created a wrapper for an

Inceptionv3 model that when you call forward, it takes your batch and replicates

the channel three times. So that's basically creating a three channel

version of a black and white image just by replicating it three times.
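
A sketch of that kind of wrapper. The inner model below is a tiny hypothetical stand-in for a three-channel feature extractor, not the real Inception network, and the class name is illustrative.

```python
import torch
from torch import nn

class ThreeChannelWrapper(nn.Module):
    # Lets a model that expects 3-channel input accept 1-channel images
    # by replicating the single channel three times.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x.repeat(1, 3, 1, 1))

# Hypothetical stand-in for an RGB feature extractor.
rgb_model = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())

wrapped = ThreeChannelWrapper(rgb_model)
out = wrapped(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 4])
```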

So with that wrapping, and again, this is good, like flexing of your PyTorch muscles,

you know, try to make sure you can replicate this, that you can, yeah, get an Inception

model working on your Fashion-MNIST samples. And yeah, then from there we can just

pass that to our ImageEval instead. And so on our samples, that gives us 63.8

and on a real batch of data, it gets 27.9. And I find this a good sign that

Inception is much less effective than our real Fashion-MNIST classifier, because that's

only a ratio of three or so. The fact that our FID for real data using

a real classifier was 6.6, I think that's pretty encouraging, you know.

Yeah, so that is that. And we now have a FID, more specifically,

we now have an ImageEval class. Did you guys have any questions or

comments about that before we keep going? TANISHQ: No, let's do it.

JOHNO: I think, again, that pretty much every other FID you see

reported is going to be, you know, set up for CIFAR-10: tiny 32 by 32 pixel images resized

up to 299 and fed through an Inception that was trained on ImageNet, not CIFAR-10.

So yeah, it's worth bearing in mind that once again, this is a slightly weird metric, and even things

like the image resizing algorithms in PyTorch and TensorFlow

might be slightly different. Or you know, if you saved your images as JPEGs

and then reloaded them, your FID might be twice as bad.

JEREMY: Yeah, it makes a big difference. JOHNO: Yeah, exactly.

So just to reiterate, the takeaway from all of this that I get is

that it's really useful when everything's the same: if you're using the same

backbone model, the same approach, the same number of samples, then you

can compare it apples to apples. But yeah, for one set of experiments, a FID of

30 might be good because of the way everything's set up and for another, that might be terrible.

So trying to compare to a paper or whatever is tricky at best.

JEREMY: So I'm going to say... TANISHQ: I guess maybe the approach is that like,

if you're doing your own experiments, you know, these sorts of metrics are good, but then

if you're going to compare to other models, it's best to rely on human studies.

And yeah, I think that's kind of the sort of approach

or mindset that we should be having when it comes to this.

JEREMY: Yeah, or both, you know. But yeah, so we're going to see this

is going to be very useful for us. And we're just going to be using the same,

pretty much all the time, we're going to use the same number of samples and we're going to

use the same Fashion-MNIST specific classifier. So the first thing I wanted to do was fix

our bug and to remind you, the bug was that we were feeding into our UNet, in DDPM_v2

and the original DDPM, images that were from zero to one.

And yeah, that's wrong. Nobody does that.

Everybody feeds in images that are from minus one to one.

So that's very easy to fix. You just… JOHNO: Jeremy, just to

ask, like, why is that a bug? JEREMY: Why is it a bug?

I mean, it's like, everybody knows it's a bug because that's what everybody does.

Like I've never seen anybody do anything else and it's very easy to fix.

So I fixed it by adding this to DDPM_v2 and I reran it and it didn't work.

It made it worse. And this was the start of, you know, a few

horrible days of pain because like when you, you know, fix a bug and it makes things worse,

that generally suggests there's some other bug somewhere else that somehow

has offset your first bug. And so, you know, I basically

went back through every other notebook, every cell, and I did find at least one bug

elsewhere, which is that we hadn't been shuffling our training sets the whole time.

So I fixed that, but it's got absolutely nothing to do with this.

And I ended up going through everything from scratch three times, rerunning everything

three times, checking every intermediate output three times.

So days of, you know, depressing and annoying

work and made no progress at all. At which point I then asked Johno's question

to myself more carefully and provided a less flippant response to myself, which was, well,

I don't know why everybody does this, actually. So I asked Tanishq and Johno

and Pedro, and I was like, have you guys seen any math, papers, whatever, that's based

on this particular input range? And yeah, you guys were all like, no, I haven't.

It's just, it's just what everybody does. So at that point it raised the

possibility that like, okay, maybe, maybe what everybody

does is not the right thing to do. And is there any reason to believe

it is the right thing to do? And given that it seemed like fixing

the bug made it worse, maybe not. And then, but then it's like, well, okay,

we, we are pretty confident from everything we've learned and discussed that having

centered data is better than uncentered data. So having data that go from

zero to one clearly seems weird. So maybe the issue is not that we've changed

the center, but that we've scaled it down. So rather than having a range

of two, it's got a range of one. So at that point, you know, I, I did

something very simple, which was, I did this, I subtracted

0.5. So now rather than going from naught

to one, it goes from minus 0.5 to 0.5. And so the theory here then was, okay, if

our hypothesis is correct, which is that the negative one to one range has no foundational

reason for being, and we've accidentally hit on something, which is that a range of one is

better than a range of two, then this should be better still, because this is a

range of one and it's centered properly. And so this is DDPM_v3 and I ran that and yes, it appeared to be better.
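
To make the three input ranges concrete, here is a tiny sketch. The function names are illustrative, and pixel values are assumed to start out in zero to one.

```python
def to_minus_one_one(x):
    # [0, 1] -> [-1, 1]: centered, range of two (the usual convention)
    return x * 2 - 1

def to_minus_half_half(x):
    # [0, 1] -> [-0.5, 0.5]: centered, range of one (the variant tried here)
    return x - 0.5

pixels = [0.0, 0.25, 0.5, 1.0]
print([to_minus_one_one(p) for p in pixels])    # [-1.0, -0.5, 0.0, 1.0]
print([to_minus_half_half(p) for p in pixels])  # [-0.5, -0.25, 0.0, 0.5]
```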

And this is great because now I've got FID. I was able to run FID on DDPM_v2 and on

DDPM_v3 and it was dramatically, dramatically, dramatically

better. And in fact, I was running a lot, a lot of

other experiments at the time, which we'll talk about soon.

And all of my experiments were totally falling apart when I fixed the bug.

And once I did this, all the things that I thought weren't working suddenly

started working. So you know, this is often

the case, I guess, is that bugs can highlight, you know, accidental

discoveries and the trick is always to be, you know, careful enough to recognize when

that's happened. You know, this is, some people might

remember the story, this is how the noble gases were

discovered. A chemistry experiment went wrong and left

behind some strange bubbles at the bottom of the test tube.

And most people would just be like, huh, oops, bubbles.

But people who are careful enough actually went, no, there shouldn't be bubbles there.

Let's test them carefully. It's like, they don't react.

Again, most people would be like, oh, that didn't work.

The reaction failed. But you know, if you're really careful, you'll

be like, oh, maybe the fact that they don't react is the interesting thing.

So yes, being careful is a fair feature. JOHNO: When you say things like it didn't work

or it was worse, when you first showed us this thing, I kind of said, oh, the images looked fine.

The FID was slightly worse. But it was okay.

And if you trained it longer, it eventually got better.

Mostly. There were some things that sampling occasionally

went wrong, one image in a hundred or something like that.

But it was like, this isn't like everything completely fell apart.

No, it's just the truth. The performance was slightly worse than expected.

If you were doing the run and gun, try a bunch of things, it was like, oh, well, I just doubled

my training time and set a few runs going and looked at the Weights and Biases stats

later and oh, that seems like it's better now. We just needed to train for longer.

And if you have infinite GPUs and lots of money, you wouldn't notice this.

So yeah, it wasn't like, you know, yeah, the fact that you picked up on it showed that

you had this like deep intuition for where it should be at this stage in training versus

where it was, what the samples should look like. And there was the FID as well, to say

like, okay, I would have expected a FID of nine and I'm getting 14.

What's up here? And that was enough to start asking these

questions, and we all jumped in to think about, you know, where this came from.

JEREMY: Yeah. I mean, definitely, I drive

people crazy that I work with. I don't know why you guys aren't crazy yet,

but with this kind of like, no, I need to know exactly why, you know, this

is not exactly what we expected. But yeah, this is why: you

know, when something's mysterious and weird, it means that there's something you didn't

understand and that's an opportunity to learn something new.

So that's what we did. And… so that was quite exciting because

yeah, going minus 0.5 to 0.5 made the FID better still. And I definitely, yeah, I moved

from this frame of mind of, like,

total depression. I was so mad.

I still remember when I spoke to Johno, I was just so upset, and, you know, then

I was suddenly like, oh my gosh, we're actually onto something.

So, started experimenting more and you know, a bit more confidence at this point, I guess.

And one thing I started looking at was our, our schedule, you know, I, you know, we'd

always been copying and pasting this standard, again, set of stuff.

And I started questioning everything. Why is this a standard?

Like, why are these numbers here? You know, and I don't see any particular

reason why those numbers were there. And I thought, well, we should

maybe experiment with them. So to make it easier, I created a little

function that would return a schedule. Now you could create a new class for a schedule,

but something that's really cool is there's a thing in Python called SimpleNamespace,

which is a lot like a struct in C. It basically lets you wrap up a little bunch of keys and

values as if it's an object. So I created this little SimpleNamespace, which

contains our alphas, our alpha bars and our sigmas, for our normal betamax equals

0.02 linspace. This is what we always do. And then yeah, there's another paper which

mentions an alternative approach, which is a cosine schedule, which is where

you basically take t as a fraction of big T, times pi

over two, and set alpha bar equal to the cosine of that, squared. And if you make that your alpha bar, you can

then basically reverse back out to calculate what alpha must have been.

And so we can create a schedule for this cosine schedule as well.
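
A sketch of the two schedule functions described above, each returning a SimpleNamespace. It uses plain Python lists rather than tensors to stay self-contained. The betamin default of 0.0001 is an assumption (it is not stated in this part of the lesson), and taking alpha bar before step 0 as 1 is an illustrative choice.

```python
import math
from types import SimpleNamespace

def lin_sched(betamin=0.0001, betamax=0.02, n_steps=1000):
    # Linearly spaced betas; alpha = 1 - beta; abar = running product of alphas.
    betas = [betamin + (betamax - betamin) * i / (n_steps - 1) for i in range(n_steps)]
    alphas = [1 - b for b in betas]
    abars, prod = [], 1.0
    for a in alphas:
        prod *= a
        abars.append(prod)
    return SimpleNamespace(a=alphas, abar=abars, sig=[math.sqrt(b) for b in betas])

def cos_abar(t, n_steps):
    return math.cos(t / n_steps * math.pi / 2) ** 2

def cos_sched(n_steps=1000):
    abars = [cos_abar(t, n_steps) for t in range(n_steps)]
    # Back out alpha from consecutive abar values (abar before step 0 taken as 1).
    alphas = [abars[0]] + [abars[t] / abars[t - 1] for t in range(1, n_steps)]
    return SimpleNamespace(a=alphas, abar=abars, sig=[math.sqrt(1 - a) for a in alphas])

lin, cos = lin_sched(), cos_sched()
print(lin.abar[0], lin.abar[-1], cos.abar[0], cos.abar[-1])
```

Because the fields are accessed as attributes (lin.abar, cos.a), the rest of the code does not care which schedule produced them.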

And yeah, this cosine schedule is, I think, pretty recognized as being better than this

linear schedule. And so I thought, okay, it'll be

interesting to look at how they compare. And in fact, really all that

matters is the alpha bar. The alpha bar is, you know, the total

amount of noise that you're adding. So in DDPM, when we do noisify, you know,

it's alpha bar that we're actually using. JOHNO: That's the amount of the image, and one

minus alpha bar is the amount of noise. JEREMY: Exactly.
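
Written out in the standard DDPM form (the symbols are the usual ones from the DDPM paper, not taken from the notebook), the noised image at step t is:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```

So when alpha bar is close to one, x_t is almost all image, and when it is close to zero, x_t is almost all noise.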

Yeah. So I just printed those out, plotted those

for the normal linear schedule and this cosine schedule.

And you can really see the linear schedule. It really sucks badly.

It's got a lot of time steps where it's basically about zero.

And you know, that's something we can't really do anything with, you know, whereas the cosine

schedule is really nice and smooth and there's not many steps which are nearly zero or nearly

one. So I thought, so I was kind of inclined to

try using the cosine schedule, but then I thought, well, it'd be easy enough to get

rid of this big flat bit by just decreasing betamax.

That'd be another thing we can do. So I tried, oh, sorry, first of all, I should

mention that the other thing that's really important is the slope of these

curves, because that's how much things are stepping during

the sampling process. And so here's the slope of the lin and the

cosine, and you can see the cosine slope. Really nice, right?

You have this nice smooth curve, whereas the linear is just a disaster.

So yeah, if I change betamax to 0.01, that actually gets you nearly the same curve as

the cosine. So I thought that was very interesting.
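
A rough way to check the "big flat bit" numerically: count the time steps where alpha bar is nearly zero for each schedule. The 0.01 threshold and the betamin of 0.0001 are assumptions made for this sketch.

```python
import math

def lin_abar(betamax, betamin=0.0001, n_steps=1000):
    out, prod = [], 1.0
    for i in range(n_steps):
        prod *= 1 - (betamin + (betamax - betamin) * i / (n_steps - 1))
        out.append(prod)
    return out

def cos_abar(n_steps=1000):
    return [math.cos(t / n_steps * math.pi / 2) ** 2 for t in range(n_steps)]

def n_flat(abar, eps=0.01):
    # Number of time steps where almost no image signal is left.
    return sum(a < eps for a in abar)

counts = {
    "linear, betamax 0.02": n_flat(lin_abar(0.02)),
    "linear, betamax 0.01": n_flat(lin_abar(0.01)),
    "cosine": n_flat(cos_abar()),
}
print(counts)
```

The linear schedule with betamax 0.02 spends far more steps in the nearly-zero region than either of the other two.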

It kind of made me think like, why on earth does everybody always use 0.02 as the default?

And so we actually talked to Robin, who was one of the two lead authors on the Stable

Diffusion paper, and in fact we talked about all of these things, and he said, oh yeah,

we noticed not exactly this, but we experimented with everything, and we noticed that when

we decreased betamax, we got better results. And so actually Stable

Diffusion uses betamax of 0.012. I think that might be a little bit higher than

they should have picked, but it's certainly a lot better than the normal default.

So it's interesting talking to Robin to see all of these kinds of experiments and things

that we tried out, they had been there as well and noticed the same things.

JOHNO: The input range as well: they have this magical factor of 0.18215 or whatever that they scale

the latents by. And if you ask why, they're like, oh yeah,

we wanted the latents to be in a roughly uniform range, whatever.

But that's also, like, reducing the range of your inputs to something reasonable.

JEREMY: I think exactly, we independently discovered and they

independently discovered this idea. Yeah, exactly.

Yeah, exactly. So we'll be talking more about like what's

actually going on with that maybe next lesson. Anyway, so here's the curves as

well, they're also pretty close. So at this point I was kind of thinking, well,

I'd like to like change as little as possible, so I'm going to keep using a linear schedule,

but I'm just going to change betamax to 0.01. For my next, you know, version of DDPM.

So that's what I've got here, linear schedule betamax 0.01.

And so that I wouldn't really have to change any of my code, I ended up just putting those

in the same variable names that I've always used. So then noisify is exactly the

same as it always has been. So now I just repeat everything

that I've done before. So now when I show a batch of data, I can

already see that there's, you know, more actually recognizable images, which

I think is very encouraging. Previously they, like almost all of them had

been pure noise, which is not a good sign. So okay, so now I just train

it exactly the same as DDPM_v2. And so I save this as Fashion DDPM_v3.

Oh, and then the other things I've done here is, you know, this turned out to work pretty

well. I actually decided, let's keep going even further.

So I actually doubled all of my channels from before.

And I also increased the number of epochs by three, because things are going so well,

I was like, how well could they go? So we've got a bigger model trained

for longer, so it takes a few minutes. That's what the 25 here is the number of epochs.

So samples exactly the same as it always has been. So create 512 samples.

And here they are, and they definitely look to me, you know, great, like that.

I'm not sure I could recognize whether these are real samples or generated samples.

But luckily, we know we can test them. So we can load up our data_aug_2, delete

the last two layers, pass that to ImageVal, and get a FID for our samples.

And it's 8. And then I chose 512 for a reason,

because that's our batch size. So then I can compare that like with like

for the FID for the actual data at 6.6. So this is like hugely exciting to me, we've

got down to a FID that is nearly as good as real images.

So I feel like this is, you know, in terms of image

quality for small, unconditional sampling, I feel like we're

done, you know, pretty much. And so at this point, I was like, okay, well,

can we make it faster, you know, at the same quality.

And I just wanted to experiment with a few things, like really obvious ideas.

And in particular, I thought, we're calling this 1,000 times, which means we're calling

this 1,000 times, just running the model. And that's slow.

And most of the time, you just move a tiny bit. So the model is pretty much the same.

It's, you know, the noise being predicted is pretty much the same.

So I just did something really obvious, which is I decided, let's only call the model every

third time, you know, and maybe also just the last 50 to help it fine tune.

I don't know if that's necessary, other than that it's exactly the same.
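
The skipping rule can be sketched as follows. This is a hedged illustration, not the notebook's code: `should_call_model` and `predict_noise` are hypothetical names, and it assumes that on skipped steps the most recent noise prediction is simply reused.

```python
def should_call_model(t, every=3, final_steps=50):
    """True on every `every`-th step and on all of the last `final_steps` steps."""
    return t % every == 0 or t < final_steps

def predict_noise(t):
    """Stand-in for the expensive model call."""
    return 0.0

n_steps, calls, noise = 1000, 0, None
for t in reversed(range(n_steps)):      # 999, 998, ..., 0
    if should_call_model(t):
        noise = predict_noise(t)        # refresh the noise estimate
        calls += 1
    # ...the usual DDPM update for step t would go here, reusing `noise`

print(f"{calls} model calls instead of {n_steps}")
```

The update itself still runs at all 1,000 steps; only the model evaluations are skipped, which is where nearly all of the time goes.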

So now this is basically three times faster. And yeah, samples look basically the same.

So the FID is 9.78 versus 8.1. And this is like within

the normal variance of FID. So I don't know, like you'd have to run this

a few times or use bigger samples, but this is basically saying like, yeah, you probably

don't need to call the model a thousand times. I did something else slightly weird, which

is I basically said like, oh, let's create a different like schedule for how often we

call the model, which is I created this thing called sample_at.

So I basically said when you're for the first few time

steps, just do it every 10 and then for the next 2, every 9 and then

the next 3, every 8, so forth. And just for the last 100, do it every 1.
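
One way to build a set like this is shown below. Treat the exact formula as an assumption chosen to match the description (sparse calls while the image is very noisy, every step near the end), not as the definitive `sample_at` from the notebook.

```python
n_steps = 1000

# (t + 101) // 100 is 1 for the last ~100 steps (t near 0) and grows to about 10
# for the first, noisiest steps (t near 999); a step is kept when (t + 101) is a
# multiple of that divisor.
sample_at = {t for t in range(n_steps) if (t + 101) % ((t + 101) // 100) == 0}

print(len(sample_at), "model calls out of", n_steps)
print(sorted(sample_at)[-5:])   # the sparse calls at the noisy end
```

Sampling then proceeds as before, calling the model only when `t in sample_at`.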

So that makes it even faster. Samples look good.

This is, you know, it's definitely worse though now,

you know, but it's still not bad. So yeah, I kind of felt like,

all right, this is encouraging. And this, this stuff before we fixed the minus

one to one thing was, they looked really bad, you know, um, that's why I was

thinking that my code is full of bugs. So at this point I'm thinking, okay, okay, okay. We can create extremely high

quality samples using DDPM. What's the like, you know, best

paper out there for doing it faster. And the most popular paper

for doing it faster is DDIM. So I thought we might switch to this next.

So we're now at the point where we're not actually going to retrain our model at all.

If you noticed with these different sampling approaches, I didn't retrain the model at

all, but just say, okay, we've got a model. The model knows how best

to estimate the noise in an image. How do we use that to call it multiple times

to denoise using iterative refinement as Johno calls it.

And so DDIM is a, another way of doing that. So what we're going to do, I'm going to show

you how I built my own DDIM from scratch and I kind of cheated, which is, there's

already an existing one in diffusers. So I decided I will use that first,

make sure everything works and then I'll try and re-implement

it from scratch myself. So that's kind of like when there's an existing

thing that works, you know, that's what I like to do.

And it's been really good to have my own DDIM from scratch because now I can modify it,

you know, and I've made it much more concise code than the diffusers version.

So, now, we had created this class called UNet, which passed the tuple of Xs through

as individual parameters and returned the dot sample.

But not surprisingly, given that this comes from diffusers and we want to use the diffusers

schedulers, the diffusers schedulers assume this has not happened.

It wants the X, you know, as a tuple and it expects to find the thing called dot sample.

So here's something crazy. When we save this thing, this pickle, it doesn't

really know anything about the code, right? It just knows that it's from a class called UNet.

So we can actually lie. We can say, oh yeah, that class called UNet

it's actually the same as a UNet2DModel with no other changes and Python doesn't know

or care, right? So this, we can now load up this

model and it's going to use this UNet. Okay.
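
The trick relies on the fact that a pickle stores a class by module and name, not by its code. Here is a small standard-library demonstration, assuming nothing about diffusers or miniai: the module name `demo_models` and both classes are made up for the example. `torch.save` and `torch.load` on a whole model use pickle underneath, which is why the same idea applies there.

```python
import pickle
import sys
import types

# A throwaway module to hold the class; "demo_models" is a made-up name.
demo = types.ModuleType("demo_models")
sys.modules["demo_models"] = demo

class UNet:
    def forward(self, x):
        return f"original forward({x})"

# Make the class look like it lives at demo_models.UNet.
UNet.__module__, UNet.__qualname__ = "demo_models", "UNet"
demo.UNet = UNet

saved = pickle.dumps(UNet())   # records "demo_models.UNet", not the method code

class PlainUNet:
    def forward(self, x):
        return f"replacement forward({x})"

demo.UNet = PlainUNet          # "lie" about what the name UNet refers to

restored = pickle.loads(saved)
print(type(restored).__name__, restored.forward(1))
```

The restored object keeps whatever state was saved but picks up the methods of whichever class the name resolves to at load time.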

So this is where it's useful to understand how Python works behind the scenes, right?

It's a very simple programming language. So we've now got a model which we've trained,

but it's not, it's just going to, you know, use the dot sample.

And that means we can use it directly with the diffusers schedulers.

So we'll start by actually repeating what we already know how to do, which is use a

DDPM scheduler. So we have to tell it what beta we used to train.

And so we can grab some random data. And so we could say, okay, we're

going to start at time step 999. So let's create a batch of data

and then predict the noise. And then this is the way that diffusers thing

works is you call scheduler.step and that's the thing which does those lines.

That's the thing that calculates x_t given noise. So that's what scheduler.step does.

That's why you pass in x_t and the time step and the noise.

And that's going to give you a new set. And so I ran that as usual first cell by cell

to make sure I understood how it all worked. I then copied those cells and merged

them together and chucked them in a loop. So this is now going to go through all the

time steps, use a progress bar to see how we're going, get the noise, call step and append.

So this is just DDPM, but using diffusers and not surprisingly it gives us, you know,

basically the same results as, you know, nice results, very nice results that we got from

our own DDPM. And so we can now use the same code we've

used before to create our image evaluator. And I decided, yeah, we're now going to

go right up to 2,048 images at a time. So it's now, this is the size I found it's

big enough that it's reasonably stable. And so we're now down to 3.7 for our FID,

whereas the data itself has a FID of 1.9. So again, showing that our DDPM is basically

very nearly unrecognizably different from real data using its distribution

of those activations. So then we can switch to DDIM

by using the DDIM scheduler. And so with DDIM, you can say, I

don't want to do all thousand steps. I just want to do 333 steps, so every third.

So that's basically a bit like, a bit like this sample skip of doing every third, but

DDIM as we'll see, does it in a smarter way. And so here's exactly the same code basically

as before, but I put it into a little function. Okay. So I can basically pass in my

model, the size, the scheduler. And then there's a parameter called eta,

which is basically how much noise to add. So just add all the noise.

And so this is now going to be three times faster.

And yeah, the FID's basically the same. That's encouraging.

So I went down to 200 steps. It's basically the same. 100 steps.

And at this point, okay, the FID's getting worse. And then 50 steps.

We're still. 25 steps. We're still...

It's interesting, like when you get down to 25 steps, like, well, what does it look like?

And you can see that they're kind of like, they're too smooth.

You know, they don't have interesting, you know, fabric swirls so much, or buckles, or

logos, or patterns as much, you know, as the, these ones, they've got a lot more texture

to them. So that's kind of what tends to happen.

So you can still like get something out pretty fast, but that's, that's kind of how they

suffer. So okay.

So how does DDIM work? Well DDIM, it's nice.

It's actually, in my opinion, it makes things a lot easier than DDPM.

So there's basically an equation from the paper, which Tanishq will explain shortly.

But basically what you do is I've actually grabbed the sample function from here and

I split it out into two bits. One bit is the bit that says, what are the

time steps, creates that random starting point, loops through, finds what my current a bar

is, gets the noise, and then, basically does the same as sched.step.

It calls some function, right? And then that's been pulled out.

So this allows me to now create my own different steps.
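
The split described here, a fixed loop plus a pluggable step function, might look roughly like this. It is a sketch on plain Python lists with a toy schedule and a stand-in model; `sample`, `deterministic_step`, and the `skip` argument are illustrative names, not the notebook's exact signatures.

```python
import math
import random

def sample(step_fn, predict_noise, abars, n_values=4, skip=10, seed=0):
    """Generic sampler: the loop is fixed, the update rule is passed in."""
    rng = random.Random(seed)
    tsteps = list(reversed(range(0, len(abars), skip)))    # 990, 980, ..., 0
    x_t = [rng.gauss(0.0, 1.0) for _ in range(n_values)]   # random starting point
    for i, t in enumerate(tsteps):
        abar_t = abars[t]
        # alpha bar at the step we jump to; 1.0 means "no noise left"
        abar_t1 = abars[tsteps[i + 1]] if i + 1 < len(tsteps) else 1.0
        noise = predict_noise(x_t, t)
        x_t = step_fn(x_t, noise, abar_t, abar_t1)
    return x_t

def deterministic_step(x_t, noise, abar_t, abar_t1):
    """Placeholder update (DDIM with no added noise), just to exercise the loop."""
    out = []
    for x, e in zip(x_t, noise):
        x0_hat = (x - math.sqrt(1 - abar_t) * e) / math.sqrt(abar_t)
        out.append(math.sqrt(abar_t1) * x0_hat + math.sqrt(1 - abar_t1) * e)
    return out

# Toy linear schedule and a stand-in "model" that always predicts zero noise.
abars, running = [], 1.0
for i in range(1000):
    running *= 1.0 - (0.0001 + (0.01 - 0.0001) * i / 999)
    abars.append(running)

result = sample(deterministic_step, lambda x, t: [0.0] * len(x), abars)
print(len(result))
```

Swapping in a different update rule then only means passing a different `step_fn`; the loop never changes.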

So I created a DDIM step and basically all I did was I took this equation and I turned

it into code. Actually this, this one is the

second equation from the paper. Now it's a bit confusing, which is

that the notation here is different. DDPM, what it calls alpha

bar, this paper calls alpha. So you've got to look out for that.

So basically you'll see, I basically go, I've got here x_t, x_t minus, okay.

One minus alpha bar is what we've called beta bar.

So beta bar dot square root times noise. This here is the, this is the neural net.

So this here is the noise. Okay. And here I've got my next x_t is, oh, sorry.

Yes. Here's my a bar t one square root times this,

and you can see here, it says predicted X naught.

Here's my predicted X naught plus beta bar t one minus sigma squared square root. Again, here's noise.

It's the same thing as here. Okay.

And then plus a bit of random noise, which we

only do if you're not at the last step. So yeah, so I can call that.
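
As a rough, scalar-only sketch of the step just walked through (the real version works on tensors, and the argument names here are assumptions): predict x0, move back along the predicted noise direction, then optionally add fresh noise scaled by sigma.

```python
import math
import random

def ddim_step(x_t, noise, abar_t, abar_t1, eta=1.0, last_step=False, rng=random):
    """One DDIM update for a single scalar "pixel"; returns (x0_hat, x_prev)."""
    bbar_t, bbar_t1 = 1 - abar_t, 1 - abar_t1   # the "beta bar" terms
    # Amount of fresh noise to add; eta=0 removes it entirely.
    sigma = eta * math.sqrt(bbar_t1 / bbar_t) * math.sqrt(1 - abar_t / abar_t1)
    # Predicted clean value: undo the noising equation.
    x0_hat = (x_t - math.sqrt(bbar_t) * noise) / math.sqrt(abar_t)
    # Move back towards x_t along the predicted noise direction.
    x_prev = math.sqrt(abar_t1) * x0_hat + math.sqrt(max(bbar_t1 - sigma**2, 0.0)) * noise
    if not last_step:
        x_prev += sigma * rng.gauss(0.0, 1.0)
    return x0_hat, x_prev

# Check with a known clean value and known noise, using eta=0 (deterministic).
x0, eps, abar_t, abar_t1 = 0.3, -1.2, 0.5, 0.8
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1 - abar_t) * eps
x0_hat, x_prev = ddim_step(x_t, eps, abar_t, abar_t1, eta=0.0)
print(x0_hat, x_prev)
```

When the noise prediction is exact and eta is zero, the step recovers x0 and lands on the same noising trajectory at the less noisy level, which is a handy sanity check for any implementation.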

So I just did it for, so I'd rather than saying a hundred

steps, I said, skip every skip 10 steps.

So do 10 steps at a time. So it's basically going to be a hundred steps.

And so you can see here it actually just happened to do a bit better for my hundred steps.

It's not bad at all. So yeah, I mean, this has been getting to

this point, it's been a bit of a lifesaver to be honest, because you know, I can now

run a two, you know, two, that batch of 2,048 samples.

I can sample them in under a minute which doesn't feel painful.

And so I'm now at a point where I can actually get a pretty

good measure of how I'm doing in a pretty reasonable amount of time.

And I can, you know, easily compare it. And I got to admit, you know, the difference

between a FID of 5 and 8 and 11, I can't necessarily tell the difference.

So for fashion, I think FID is better than my eyes for this, as long as I use a consistent

sample size. So yeah, Tanishq, did you want to talk a bit

about, you know, the ideas of why we do this or where it comes from or what the notation means?

JOHNO: Can I say a little bit before we do that, which is just that what you have there, Jeremy,

which is like a screenshot from the paper and then the code that, as closely as possible,

tries to follow that. Like the difference that makes for people is huge.

Like I've got a little research team that I'm doing some contract work with.

And the fact that like it's called alpha in the DDIM paper and alpha bar elsewhere.

And then in the code that they were copying and pasting from, it was called A and B for

alpha and beta. And it's like you can get things kind of

working by copying and pasting into things. And it's all just sort of kind of works, but

just spending that time to literally take two screenshots from equation 14 and 16 from

the paper and put them in there and rewrite the code so that it, you know, with some comments

and things to say like, this is what this is, this is that part from the equation.

It's like, you know, the, the look of pain on their face when I said, Oh, by the way,

did you notice that like, it's called alpha there and alpha bar there?

They're like, yes, how could they do that? You know, it's just like, you could just tell

how many hours have been spent, you know, like grinding and saying what's wrong here.

JEREMY: And doing notebooks. Yeah, and building this stuff in

notebooks is such a good idea. Like we're doing with miniai because the,

you know, the next engineer to come along and work on that and see the equation

right there and you can add prose and stuff. So I think, you know, nbdev works particularly

well for this, this kind of development. Yeah.

Thanks, Johno. TANISHQ: Yeah, before, I talk about this,

I just wanted to briefly, in the context of all of these differing notations, I recently

created this meme, which I thought was, was relevant in terms of like each paper basically

has a different diffusion model notation. So it's just like this, but they all try to

come up with their own universal notation and it's just, just keeps proliferating.

JEREMY: Let's just agree we should all use APL.

TANISHQ: Yes, exactly. We need to implement diffusion

models in APL somehow. So yeah, the paper that we're, that, you know,

Jeremy had implemented was this Denoising Diffusion Implicit Model paper.

And if you look at the paper again, you could see like the notation could be again, a little

bit intimidating, but when we walk through it, we'll see, it's not too bad actually.

So I'll just bring up, I guess, some of the important equations and also comparing and

contrasting you know, DDPM and the notation of DDPM and the equations with DDIM.

JEREMY: Not only is it not too bad, I actually discovered it's making

life a lot, the DDIM notation and equations are a lot

easier to work with than DDPM. So I found my life is better

since I discovered DDIM. TANISHQ: Yes, yes, I think a lot of

people prefer to use DDIM as well. So yeah, basically in both DDIM and

DDPM, we have this same sort of equation. This equation is exactly the same.

This is telling us the predicted denoised image. So we predict our, but basically we

predict the, you can see my pointer, right? Just want to confirm.

JEREMY: By the way, the little double-headed arrow in the top right, does that, if you click

that till you get more room for us to see what's going on.

The double-headed arrow just above the, yeah. TANISHQ: Yeah, sorry.

Yeah. That’s much better.

So we have our predicted noise. So our model is predicting the noise in the image.

It is also passed the time step, but this is just omitted.

It basically kind of is given in the XT, but our model also takes in the time step.

And so it's predicting the noise in this XT, our noisy image.

And we are trying to remove the noise. That's what this whole term

here is, remove the noise. So because our noise that we're

predicting is unit variance noise. So we have to scale the variance of our noise

appropriately to remove it from our noisy image.

So we have to scale the noise and subtract it out of the original image.

And that's how we would get our predicted denoised image.

JEREMY: And I think we have derived this one before by looking at the

equation for XT in the noisify function and rearranging it to solve for X nought.

And that's what you get. TANISHQ: Yes, that's basically what this is.

So basically the idea is, okay, instead of,

yeah, noisifying it where we're starting out with X0 and some noise and get an XT, we're

doing the opposite where we have some noise and we have XT.

So how can we get X0? So that's what this equation is.

So that's the predicted X0 or our predicted clean image.
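
Written out in DDPM notation, where alpha bar is the cumulative product of the alphas, the rearrangement being described is the following (the symbols are the standard ones, assumed to match the slide):

```latex
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
\quad\Longrightarrow\quad
\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
```

The only change from the noising equation is that the true noise is replaced by the model's prediction.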

And this equation is the same for both DDPM and DDIM, but these distributions are what's

different between DDPM and DDIM. So we have this distribution, which tells us,

okay, so if we have XT, which is our current noisy image; and X0, which is our clean image,

can we find out what some sort of intermediate noisy image is in that process?

And that's XT minus one. So we have a distribution for that.

And so that tells us how to get such an image. And so in the DDPM paper, they

derive this distribution and explain the math behind it.

But yeah, basically they have some equation. So you have, again, a Gaussian distribution

with some sort of mean and variance, but it's again some form of you have this

sort of interpolation between your original clean image and your

noisy image, and that gives you your intermediate slightly less noisy image.

So that's what this is giving. Given a clean image and a noisy image,

you get your slightly less noisy image. And so the sampling procedure that we do with

DDPM basically is predict the X0 and then plug it into this distribution to give

you your slightly less noisy image. So maybe it's worth drawing that out.

So if we had, let's say, some sort of, I don't know, I'm just making some sort of, I don't

know, maybe a lot of some sort of better, yeah, some sort of.

So then in this case, I'm showing a one dimensional example.

Let's say you have some sort of a point. So it's kind of a one dimensional example

that's still in the sort of 2D space. But let's say you have some, any point on

this represents an actual image that you want to sample from, right?

So this is where your distribution of actual images would lie.

And you want to estimate this. So this sort of algorithm that we've been

seeing here says that, okay, if we take some random point, this is some random

point that we choose when we start out. And what we did is we learned this function,

the score function to take us to this manifold, but it's only going to be accurate in some space. So it's going to be accurate.

It would be accurate in some area. So we get an estimate of the score function

and it tells us the direction to move in. And it's going to give us the direction

to predict our denoised image, right? So basically, let's say your score function

is actually in reality some curve, okay? So it's in reality some curve that

points to your, oops, it points here. So that's your score function.

And you know the value here. JEREMY: Just to clarify,

the score function basically means your gradient, yeah?

TANISHQ: Yes, yes. It's a gradient.

So we're again doing some form of, in this case, I guess you would say gradient ascent,

because you're not really minimizing the score, you're maximizing it.

You want, so you're maximizing the likelihood of that data point being an actual data point.

You want to go towards it. So you're doing the sort

of gradient ascent process. So you're following the gradient to get to that.

So when we estimate epsilon, theta, and predict our noise, what we're doing is we're getting

the score value here. And then so we can follow that

and we follow it to some point. And being kind of exaggerating here, but

this point will now represent our X0 hat. So yeah, our X0 hat.

And in reality, maybe that's not going to be some point that is an actual point.

It wouldn't be next to the distribution. So it's not going to be a very good estimate

of a clean image at the beginning, but we only had that estimate at the beginning at

this point, and we have to follow it all the way to some place.

So this is where we follow it to. And then we want to find

some sort of X t minus one. So that's what our next point is.

And so that's what our second distribution tells us.

And it basically takes us all the way back to maybe some point here.

And now we can re-estimate the score function or our gradient over there, do this prediction

of noise. And it may be more accurate of a score function.

And maybe we go somewhere here, and then we re-estimate and get another point, and then

we follow it. And so that's kind of this iterative process

where we're trying to follow the score function to your own point.

And in order to do so, we first have to estimate our X0 hat, and then basically add back some

noise and to get a little bit, get a new estimate and keep following and add back a little bit

more noise and keep estimating. So that's what we're doing

here in these two steps. We have our X0 hat, and then

we have this distribution. And that's how we do it with regular DDPM.

And I think that's maybe where the sort of breaking it up in two steps is a little bit

clearer. And I don't think the DDPM paper really

clarifies that or really talks about it too much. But the DDIM paper also really hammers that

point home, I think, and especially in their update equation.

So that's the DDPM, but then with DDIM... Okay, go ahead.

JOHNO: DDPM, just the one thing is that you look at your prediction, use that to make a step,

but you also add back some additional noise that's always fixed.

TANISHQ: Right. JOHNO: DDPM there's no parameter to control

how much extra noise you add back at each step. TANISHQ: Right, exactly.

So when you're... Let's see here.

So yeah, basically you won't be exactly at this point.

You could be... You're in that general vicinity.

Adding that noise also helps with... You don't want to fall into specific modes

where it's like, oh, this is the most likely data point.

You want to add some noise where you can explore other data points as well.

So yeah, the noise also can help. That's something you really

can't control with DDPM. And that's something that DDIM explores a

little bit further, is in terms of the noise and even trying to get rid of

the noise altogether in DDIM. So with the DDIM paper, the main

difference is literally this one equation. That's all really it is in terms of changing

this distribution where you predict the less noisy equation.

The less noisy... Sorry, the less noisy image. And basically, as you can see, you have this

additional parameter now, which is sigma. And the sigma controls how much noise, like

we were just mentioning, is going to be part of this process.

And you can actually, for example, if you want, you could set sigma to zero.

And then you can see here, now you have a variance that would be zero.

And so this becomes a completely deterministic process.

So if you want, this could be completely deterministic.
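
For reference, the DDIM update being discussed can be written as below, in the DDIM paper's notation, where alpha corresponds to what DDPM calls alpha bar. The three terms are the predicted x0, the direction pointing back to x_t, and the random noise:

```latex
x_{t-1} = \sqrt{\alpha_{t-1}}
          \left( \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} \right)
        + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\;\epsilon_\theta^{(t)}(x_t)
        + \sigma_t\,\epsilon_t
```

Setting sigma to zero removes the last term, so the same starting noise always produces the same image.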

So that's one aspect of it. And then, yeah, so the other aspect...

So one of the reasons it's called DDIM and not DDPM is because it's not probabilistic

anymore. It can be made deterministic.

So the name was changed for that reason. But the other thing is you would think that

you've changed the model altogether with a new distribution altogether.

And so you think, oh, wouldn't you have to train a different model for this purpose?

But it turns out the math works out where the same model objective would work well with

this distribution as well. In fact, I think that's what they were setting

out from the very beginning is what kind of other models can we get with the same objective?

And so this is what they're able to do is you can have this new parameter that you can

introduce, in this case, kind of controlling the stochasticity of the model.

And you can still use the same exact trained model that you had.

So what this means is that this actually is just a new sampling algorithm and not anything

new with the training itself. And this is just, yeah, just like we talked

about, a new way of sampling the model. And then, so yeah, this is how, given now

this equation, then you can rewrite your X t minus one term.

And again, we're doing the same sort of thing where we split it up into predicting the X0

and then adding back to go back to your Xt. And also if you need to add a little bit of

noise back in, like Jonathan was saying, you can do so, you have this extra term

here and the sigma controls that term. And again, like we said, you have to be, again,

looking at the DDIM equation versus the DDPM equation.

You have to be careful that the alphas here are referring to alpha bars in the DDPM equation.

So that's the other caveat. So yeah, and you have this sigma t set to

this particular value will give you back DDPM. So sometimes instead they will write basically

Jeremy mentioned this sort of, I guess, eta, which is equal to basically, yeah, so it's

just basically eta is, sigma is equal to eta times this coefficient.

So sorry, let me just go back. And so basically, yeah, in reality

you take, I'm not wearing white. You have the eta here.

So it's like, yeah, this is where eta would go. So if it's one, it becomes regular DDPM, and if it's

zero, of course that's the deterministic case. So this is the eta that all these APIs and

in the code that we have, also the code that Jeremy was showing, they have eta equals to

one, which they say is corresponding to regular DDPM. This is actually where the

eta would go in the equation. So finally the not-

JOHNO: It's like, you could pass in sigma, right? Like if you weren't trying to match it in

previous papers, you could just, oh, well, we have this parameter sigma that

controls the amount of noise. So that's just taken sigma scale as an argument.

And for convenience, they said, let's create this new thing, eta, where zero means sigma

is equal to zero, which if you look at the equation that works, one means we match the

DDPM, the amount of noise that's in the familiar DDPM.

And so then that gives you a nice slice. You could say eta equals two

or eta equals 0.7 or whatever. But it's got a meaningful unit of one equals
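
A small numerical check of that "one equals DDPM" convention, as a sketch under stated assumptions: `ddim_sigma` is a hypothetical helper, the values are arbitrary, and the comparison is against DDPM's posterior variance (the "beta tilde" form).

```python
import math

def ddim_sigma(abar_t, abar_t1, eta):
    """Noise scale for a step from alpha bar abar_t to the less noisy abar_t1."""
    return eta * math.sqrt((1 - abar_t1) / (1 - abar_t)) * math.sqrt(1 - abar_t / abar_t1)

abar_t1, beta_t = 0.8, 0.02
abar_t = abar_t1 * (1 - beta_t)     # one ordinary DDPM step noisier

# DDPM's posterior variance: (1 - abar_{t-1}) / (1 - abar_t) * beta_t
ddpm_var = (1 - abar_t1) / (1 - abar_t) * beta_t

print(ddim_sigma(abar_t, abar_t1, eta=0.0))                  # no added noise
print(ddim_sigma(abar_t, abar_t1, eta=1.0) ** 2, ddpm_var)   # these agree
```

Values of eta between zero and one then interpolate between the deterministic sampler and DDPM-level noise.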

the same as this previous reference work. JEREMY: Well, it's also convenient because it's

sigma t, which is to say different time steps, unless

you choose eta equals zero, in which case it doesn't matter, different time steps probably

want different amounts of noise. And so here's a reasonable

way of scaling that noise. TANISHQ: Then the last thing of importance, which

is of course, one of the reasons that we were exploring this in the first place, is

to be able to do this rapid sampling. The basic idea here is that you can define

a similar distribution where again, the math works out similarly, where now you have, let's

say you have some subset of diffusion steps. So in this case, it uses tau variable.

So for example, let's say subset of diffusion steps.

So if it's like 10 diffusion steps, then tau one would just be zero, then tau two would

be 10. You just keep going all the way up to

say 1,000, but you've only got the... Sorry, tau two would be 100, and

then you go all the way up to 1,000. And so you'd get 10 diffusion steps.

So that's what they're referring to when they have this tau variable here.
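
To make the tau idea concrete, here is a sketch that picks 10 of 1,000 training steps and looks up the alpha bar for each jump. The evenly spaced choice and the toy linear schedule are assumptions for illustration.

```python
n_train_steps, n_sample_steps = 1000, 10
stride = n_train_steps // n_sample_steps
taus = list(range(0, n_train_steps, stride))    # [0, 100, ..., 900]

# Toy linear schedule to look alpha bar up from; any trained schedule would do.
abars, running = [], 1.0
for i in range(n_train_steps):
    running *= 1.0 - (0.0001 + (0.01 - 0.0001) * i / (n_train_steps - 1))
    abars.append(running)

order = list(reversed(taus))                    # sample from noisy to clean
for t, t_prev in zip(order, order[1:]):
    print(f"jump {t:>3} -> {t_prev:>3}: abar {abars[t]:.4f} -> {abars[t_prev]:.4f}")
```

Each jump only needs the two alpha bar values at its endpoints, so no per-step alphas or betas are required.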

And so you can do this sort of similar equation and similar derivation

to show that this distribution here again, meets the objective

that you use for training. And you can now use this for a faster sampling,

where basically all you have to do is you have to just select the appropriate alpha bar.

And sorry, this one I've written out. So this one, actually the alpha bar is the

regular alpha bar that we've talked about. But basically, sorry, it's a little bit confusing

switching between different notations, but basically you have this distribution and then

you just have to select the appropriate alpha bars and it follows the same in terms of

you have appropriate sampling process. So yeah, and I guess it makes it a lot simpler

in terms of doing this accelerated sampling. Yeah, I guess with any other note, maybe other

comments that maybe you guys had or was this... JEREMY: Well, the key for

me is that in this equation, we just have one, we only need one parameter,

which is the alpha bar or alpha, depending on which notation you use, and everything else is

calculated from that. And so we don't have what DDPM calls the alpha or beta anymore.

And that's more convenient for doing this kind of smaller number of steps because we

can just jump straight from time step to alpha bar.

And we can also then, it's particularly convenient with the cosine schedule because you can calculate

the inverse of the cosine schedule function, which means you can also go from an alpha

bar to a T. So it's really easy to say like, Oh, what

would alpha bar be 10 time steps previously to this one?

You know, it's just, you could just call a function.
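
A minimal sketch of that round trip, assuming the simplified cosine schedule with time scaled to the range 0 to 1 and no small offset term; the function names `abar` and `inv_abar` are chosen for illustration.

```python
import math

def abar(t):
    """Simplified cosine schedule: t in [0, 1] -> alpha bar in [0, 1]."""
    return math.cos(t * math.pi / 2) ** 2

def inv_abar(a):
    """Inverse of abar: alpha bar -> t."""
    return math.acos(math.sqrt(a)) * 2 / math.pi

t = 0.6
a = abar(t)
t_back = inv_abar(a)              # recovers t from alpha bar
a_earlier = abar(t - 10 / 1000)   # alpha bar 10 (of 1,000) steps earlier
print(a, t_back, a_earlier)
```

Because both directions are closed-form, any step size can be used without precomputing a table of alphas and betas.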

We don't need, yeah, we don't need anything else. And so actually the original cosine schedule

paper has to fuss around with various like kind of epsilon style small numbers that they

add to things to avoid getting weird numerical problems.

And so, yeah, when we only deal with alpha bar, all that stuff also goes away.

So yeah, so looking, if you're looking at the DDIM code, you know, it's simpler code

with less parameters than our DDPM code. And of course it's dramatically faster and

it's also more flexible because we've got this eta thing we can play with.

TANISHQ: Yes. Yeah.

And that's the other thing is like this idea of like, yeah, controlling stochasticity.

I think that's something that's interesting to explore.

And we've been exploring that a bit now and I think we'll continue to explore that in

terms of deterministic versus stochasticity. So yeah.

JEREMY: So it's worth talking about just the sigma in the middle equation you've got there.

So you've got the sigma t, epsilon t, adding the random noise and intuitively it makes sense

that if you're adding random noise there, you would need to have less.

You want to move less back towards Xt, which is your noisy image.

So that's why, and you know, you've got the one minus alpha t minus one minus sigma squared,

and then you're taking the square root of that. So basically that's just sigma,

the square root of the squared. So you're subtracting sigma t from the direction

pointing to Xt and adding it to the random noise or vice versa.

So everything's there for a reason, you know. TANISHQ: Yes.

JEREMY: And the predicted X zero, that entire equation we've derived previously.

TANISHQ: And it remains the same in pretty much any diffusion model methodology.

JEREMY: Well, as long as you're using, we'll be talking about actually

some places where it's going to change probably next week.

TANISHQ: Well, yeah, I guess… JEREMY: That's another thing where you're predicting noise.

Yes. TANISHQ: Yes.

Yes. If you're predicting the noise, yes, there'll be.

JEREMY: Okay. TANISHQ: Yeah.

JEREMY: So, so I think, you know, we will, we'll probably,

yeah, let's wrap it up here so that we leave ourselves plenty of time to cover the kind

of new research directions next lesson more in more detail.

I just wanted to mention like in terms of where we're at, just like we hit a kind of

like, okay, we can really predict classes for Fashion-MNIST a few weeks ago where I

think we're there now and like we can do Stable Diffusion, sampling and UNets, except for the

UNet architecture for unconditional generation. Now we basically can do Fashion-MNIST almost.

So it's unrecognizably different to the real samples and DDIM is the scheduler that the

original Stable Diffusion paper used. So yeah, you know, we're actually about to

go beyond Stable Diffusion for our sampling and UNet training now.

So I think we've, yeah, definitely meeting our stretch goals so far and all from scratch

with Weights and Biases, experiment logging. And, you know, if you wanted to have fun,

there's no reason you couldn't like have a little callback that instead logs

things into a SQLite database. And then you could write a little front end

to show your experiments, you know, that'd be fun as well.
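
As a starting point for that idea, here is a hedged sketch using only the standard library. `SQLiteLogger` is a hypothetical helper, not part of miniai, and the run name and metric values are made up; a real callback would call `log` at the end of each epoch.

```python
import sqlite3

class SQLiteLogger:
    """Hypothetical metrics logger backed by SQLite; not part of miniai."""
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metrics "
            "(run TEXT, epoch INTEGER, name TEXT, value REAL)")

    def log(self, run, epoch, **metrics):
        """Store one row per metric for the given run and epoch."""
        rows = [(run, epoch, k, float(v)) for k, v in metrics.items()]
        self.conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)
        self.conn.commit()

    def history(self, run, name):
        """Return [(epoch, value), ...] for one metric of one run."""
        cur = self.conn.execute(
            "SELECT epoch, value FROM metrics WHERE run = ? AND name = ? ORDER BY epoch",
            (run, name))
        return cur.fetchall()

logger = SQLiteLogger()
logger.log("demo_run", 0, loss=0.05, fid=12.0)   # made-up numbers
logger.log("demo_run", 1, loss=0.04, fid=9.5)
print(logger.history("demo_run", "fid"))
```

A front end would then only need to run `SELECT` queries against the same file.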

JOHNO: Yeah. I mean, you could also have it send you a text

message when the loss gets good enough. Yeah.

JEREMY: All right. Well, thanks guys.

That was really fun. JOHNO: Thanks everybody.

JEREMY: All right. Bye.

Okay. TANISHQ: Talk to you later then.

Bye bye.
