Lesson 23: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

Hi everybody, today we are covering lesson 23 

and we're here with Johno and Tanishq. How are you guys both doing? Doing well? Yes, I'm 

doing well. Excited for another lecture. Yeah, likewise. Great. I, shamefully,

have to start with admitting to a bug, which actually is rather... Well, I don't know. 

It kind of messed up things in a sense, but I kind of, I think it's really interesting actually 

what happened. The bug was in notebook 23, the Karras notebook, and it's about measuring the FID. So, to recall, FID measures how similar a bunch of samples from a model are to a bunch of samples of real images, and that similarity is defined as some kind of distance between the distributions of the features in a classifier or some kind of model. So, that means that to get FID, we have to load a model, and we have to pass it some

data loaders so that it can calculate what the samples look like from real images. Now, the 

problem is that the data loaders I was passing actually had images whose pixels were between -0.5 and 0.5. But you might recall that this model that I trained has pixels between -1 and 1. So, what this ImageEval class would have seen, and specifically this c model, which we are getting the features from, is a whole bunch of unusually low-contrast images. So, they wouldn't really have

looked like many things in the data set, because in fact in the data set, I think, particularly 

for Fashion-MNIST, things are pretty consistently normalized in terms of going all the way from 0 to 1 or -1 to 1, or I guess 0 to 255 in the original. And so, as a result, I think what would have happened is that the features that came out of this would have been kind of weird, and they might not have consistently said, oh, these are t-shirt features and these are two features, but they would have said, oh, this is a weird-looking low-contrast image feature. And so, then the shame continues in that I added another bug on top of this bug, which is that when I then did the sampling, I didn't multiply by 2, and the data that I trained it on came from the same data loaders, well, specifically the same transform. Not the noisify transform, but the same transformi, which previously went from -0.5 to 0.5. So, I trained the model using this restricted input space as well. And therefore, it was spitting out things that were between -0.5 and 0.5. And so, the FID then said, wow, these are so similar: the samples are consistently spitting out features of low-contrast things, and all of the real samples are low-contrast things. So, those are really similar. And that's how we got really low numbers. So, those low numbers are wrong. So,

I was a bit surprised, I guess, that the Karras model was doing so much better, and it certainly made me a big believer in the Karras model, but actually it's not doing so much better. So,

once we fix that, the FIDs are actually around 5, 6, 5, and the reals are 2.5. So, to compare, we were getting some pretty good results in 

cosine. So, cosine, yeah, we were getting 3, 4, depending on how many steps we were doing DDIM. So, the result of this is that this somewhat 

odd situation where the cosine model, where we accidentally scaled things to be -0.5 to 0.5 and then, post-sampling, multiply by 2 (so we're not cheating, like the Karras one used to be), is working better than Karras, which, yeah, is a surprise to me, because I was thinking Karras was kind of, in theory, optimally scaling things. But I guess the truth is, it was scaling things to unit variance, but there's nothing in particular to say that's optimal scaling, and so empirically we've found, kind of accidentally, a better way to scale things. And also, our dependent variable is different. Our dependent variable is not that Karras, you know, c mix combination, but our dependent variable is just the noise, the 0, 1 noise, you know, the noise before it's multiplied by alpha bar. Okay, so that's the bug. Anyway, I promised last time we would stop

looking at Fashion-MNIST for a while. So, let's move on to Tiny ImageNet. The reason we're going to do this is that we're going to try to create U-Nets today, and I wanted to show an example of a nice U-Net we can create that combines a lot of the ideas we've been looking at. It's going to be a super-resolution U-Net, and doing super-resolution on Fashion-MNIST isn't going to be very interesting because the maximum training size we have is 28 by 28. So I thought we'd go a little bit bigger than that, to Tiny ImageNet, which is 64 by 64. I found it quite difficult actually to find the Tiny ImageNet data, but eventually I discovered that it's still on the Stanford servers where it was originally created. It's just not linked to from anywhere. So, if this disappears, we will keep our forum and website up to date with other places to find it. Anyway, for now, we can grab the URL from there and unpack it. So, shutil is a very handy little library inside the Python standard library, and one of the things it has is a very handy unpack_archive, which handles zip files, and it's going to put it in our data directory.
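As a rough sketch of the unpacking step just described: `shutil.unpack_archive` infers the archive format from the file extension. The helper name `unpack_dataset` and the tiny demo archive are made up for illustration, and the actual download is left out.

```python
import shutil
import tempfile
from pathlib import Path

def unpack_dataset(archive, dest):
    "Unpack `archive` into `dest`; the format is inferred from the extension."
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(dest))
    return dest

# Demo: build a tiny zip on the spot (stand-in for the real download), then unpack it
work = Path(tempfile.mkdtemp())
src = work/'src'/'tiny-demo'
src.mkdir(parents=True)
(src/'wnids.txt').write_text('n00000001\n')
archive = shutil.make_archive(str(work/'tiny-demo'), 'zip', root_dir=str(work/'src'))
out = unpack_dataset(archive, work/'data')
unpacked = sorted(p.relative_to(out).as_posix() for p in out.rglob('*') if p.is_file())
```

The same call also handles tar archives, so nothing changes if the data is mirrored in another format.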

So, I, yeah, there's a few different ways we could process this and I thought we might 

experiment with some things, but I thought, yeah, it wouldn't be a bad idea to try doing 

things the reasonably kind of manual way just to see what that looks like. And often 

this is the easiest way to do things because, you know, that's a very well-defined set of steps, right? So, step one is to create a dataset. So, a dataset is just literally something that has a length and that you can index into. So, it has to have these two things defined.
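A minimal sketch of a dataset in this sense: a plain class with no base class that defines only `__len__` and `__getitem__`. The class name `FolderDS`, the `**/*.JPEG` pattern, and the throwaway directory tree are assumptions for illustration, not the notebook's exact code.

```python
import tempfile
from glob import glob
from pathlib import Path

class FolderDS:
    "Plain class, no base class: only __len__ and __getitem__ are needed."
    def __init__(self, path):
        # '**' only descends into subdirectories when recursive=True is also passed
        self.files = sorted(glob(str(Path(path)/'**/*.JPEG'), recursive=True))
    def __len__(self): return len(self.files)
    def __getitem__(self, i):
        f = self.files[i]
        return f, Path(f).parent.parent.name  # label = name of the file's grandparent folder

# Demo on a throwaway tree laid out as <category>/images/<file> (made-up names)
root = Path(tempfile.mkdtemp())/'train'
for cat in ('cat_a', 'cat_b'):
    (root/cat/'images').mkdir(parents=True)
    (root/cat/'images'/f'{cat}_0.JPEG').write_bytes(b'')
ds = FolderDS(root)
n_items, first_label = len(ds), ds[0][1]
```

Indexing returns a tuple of two strings, the file path and the category folder name, so anything that expects "a length and an index" can consume it.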

You don't have to inherit from anything, you just have to define these two things. Broadly 

speaking in Python, you generally don't have to inherit from things. You just have to provide 

the methods that are expected. So, our dataset is in a directory called tiny-imagenet-200, and then there's a train directory and a val directory for the training and the validation sets. In the train directory, it's the pretty classic, normal thing: each category (so this is a category) has its images in a separate folder, and specifically in an images subfolder. So, what I wanted to do was to start by grabbing all of the image files in path/train. The Python standard library has a glob function, which searches, recursively if asked to, for everything that matches this specification. So, this specification is the path, then star star, slash, star dot JPEG. And this star star here, I don't know why we need to do it twice, it's a bit weird, but you also need that for it to be recursive. So, to be recursive, you both have to say recursive=True here and also put star star before the slash here. So, it's going to give us a list of all the files inside path/train. And so, then if we index into that training dataset with zero, that will call __getitem__, passing an i of zero. And so we will then return a tuple. One is the thing in self.files[i], which is this file, and then the label for it, and the label is that. So, it's the parent's parent's name. And so that's the name. Okay, so there's a dataset that returns two strings when you index into it, a tuple of two strings. The first is the name of the image file, so the path of the image file, and the second is the name of the category it's in. These weird names are called WordNet categories; they're like codes that indicate concepts, basically, in English. So, one of the reasons I actually used this

particular data set is because it's going to force us to do some more data processing, 

which I think is good practice. That's because weirdly in the validation set, although it's 

in tiny-imagenet-200/val, which is the not-weird part, the weird part is that they are not then in subdirectories organized by label. Instead there is a separate val_annotations.txt file, which looks like this. So it says, for each file name, what category it is. It's also got the bounding box of whereabouts that is, but we're not going to be using that today. I decided to create a dictionary that will tell us, for each file, what category it's in. So in this case here, I'm doing something exactly like a list comprehension, but because it's not in square brackets, it's a generator comprehension, so it'll kind of stream out the results. We're going to go through each line in this file, and we're going to split on tab. So that's going to give us this and this and this, and then we're going to grab the first two. And if you pass a list of lists, or a list of tuples, or whatever, to dict, it will create a dictionary using these pairs as keys and values. So if we have a look, there it is. That's quite a nice, neat way to do it. And if you're not sure, you can just type dict, open bracket, and then hit Shift-Tab a couple of times, and it'll show you the various options. And you can see here I'm doing dict(iterable), because my generator is iterable, and it says, oh, that's exactly as if you had created a dictionary and then gone for k, v in iterable: d[k] = v. So there's a nice little trick. Now we need a dataset that works just like the tiny dataset, but the __getitem__ is going to label things differently. So I just inherited from the tiny dataset. So that means we don't need to do __init__ or __len__ again, and then in __getitem__ again it's going to return the file. This time the label will not be the parent's parent's name, but we will look up, in the annotations dictionary, the name of the file. And so that works. We can check the length

works. So then a fairly generally useful thing that I thought we'd then create is something that lets us transform any dataset. So here's a class that you can pass a dataset to, and you can pass it a transformation for the x, the independent variable, and you can pass it a transformation for the y. And both of them default to noop, that is, no operation, so it just doesn't change it at all. So, a transform dataset: the length of it is just the length of the original dataset. When we call __getitem__, it'll grab the tuple from the dataset we passed in, and it will return that tuple but with transform x and transform y applied to it. Does that make sense so far? Great. Okay. So I don't like working with these n030... things, but the dataset luckily has a wnids.txt file in it. So if I just open it up... Oh, sorry, this one actually is not quite going to help us. This is just a list of all of the WordNet IDs that they have images for. We could have actually got this by simply listing this directory. It would have told us all the IDs, but they've also got a text file containing all of them. We can see that there are 200 categories. Okay. And that's useful because we're going to want to change n030..., etc. into an int. And the way we can change it into an int is by simply saying, oh, we'll call this one zero and this one one, and so forth. So the int-to-string, or ID-to-string, version of this is literally this list. So zero will be that. But the string-to-int version, we do this all the time: it's basically enumerate. So that gives us the index and the value for everything in the list. So those are going to be the keys and values, but actually we're going to invert it to become value: key. And that's what str2id will be. So note here that we have a dictionary comprehension. You can tell because it's got curly brackets and a colon. And so here's our dictionary comprehension. So we could have used that for this as well. We could have done a dictionary comprehension instead. But yeah, so there's lots of ways of doing things. None of them's any better or worse than any other. Okay, so that's the... The WordNet tags, do we have the names for them, or is that something...? Yes, the names I'm going to get to shortly. There's a words.txt.
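The two dictionary idioms discussed above can be sketched side by side. The annotation lines and IDs below are made-up stand-ins with the same tab-separated shape as the real files, and the variable names are assumptions.

```python
# Stand-in for lines of the validation annotations file: tab-separated,
# file name first, WordNet ID second, then bounding-box fields we ignore.
lines = [
    "val_0.JPEG\tn00000002\t0\t32\t44\t62",
    "val_1.JPEG\tn00000003\t52\t55\t57\t59",
]

# dict() accepts any iterable of (key, value) pairs, including a generator
anno = dict(line.split("\t")[:2] for line in lines)

# Index -> ID is just a list; ID -> index inverts enumerate with a dict comprehension
id2str = ["n00000001", "n00000002", "n00000003"]
str2id = {v: k for k, v in enumerate(id2str)}

label_of_val_1 = str2id[anno["val_1.JPEG"]]
```

Either idiom could build either dictionary; the generator form avoids materializing an intermediate list, and the comprehension form makes the key and value roles explicit.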

So yeah. All right, I grabbed one batch of data and grabbed its mean and standard deviation, and so then I've just copied and pasted them in here for normalizing. So my transform x is going to be: I'm going to read the image. If you read it as RGB, that's going to force it to be three channels, because actually some of them are only one channel. We divide it by 255, so it's between zero and one. And then we will normalize. And then for our y's, we will go through this str2id to get the ID and just use that as our tensor. So, you know, doing it manually is actually pretty straightforward, right, because now we just pass those to our TfmDS, our transform dataset. And we can check that. You can see yi is a tensor, but we can look it up to get its value, and xi is an image tensor with three channels. So, channel by height by width, as is normal for PyTorch. So for showing images, it's nice to denormalize them. So that's just denormalizing. And so if we show the image, it's a water jug, I guess. Alright, so now we can create a data loader for our training set. So it's going to contain our transformed training dataset, and we pass in a batch size. This one has to be shuffled. Not sure why I put num_workers equals zero there. Generally, it's pretty good if you've got at least eight cores. Yeah, so we can now grab an x batch and a y batch, and take a look at a denormalized image from there. So there we've got a nice little kitty cat. So I think this is already looking better than Fashion-MNIST. Yeah, so there's this words.txt that they've also provided. And this is actually a list of the entire WordNet hierarchy. So the top of the hierarchy is entity, and one of the entity types is a physical entity or an abstract entity, entities can be things, and so forth. So this is how WordNet is organized. And so this is quite a big file actually. So if we go through each line of that file and again split on tabs (that's what backslash t means), that's going to give us the WordNet ID and then the name of it. So now we can go through all of those. We call them synsets. And if the key is in our list of the 200 that we want, we'll keep it. And we don't really want, like, causal agent, comma, cause, comma, causal agency. The first one generally seems to be the most normal. So I just split on comma and grab the first one. All right. So we could then go

through our y batch and just turn each of those numbers into strings, and then look each of those up in our synsets and join them up, and then use those as titles to see our Egyptian cat, and our cliff, the guacamole, the monarch butterfly, and so forth. And you can see that this is going to be quite tricky, because, like, a cliff (this is a cliff dwelling, for instance) could be quite, you know, complicated. I have a feeling that for this they intentionally picked the categories. So I think the categories might have come from the normal ImageNet, and I think they might have picked a hundred that are designed to be particularly difficult or something, if memory serves correctly. All right. So then we could define a transform batch function with the same basic idea. And that's just going to, yeah, transform the x and the y in a batch. Oh, yes, we're about to use that. I should move that down a bit because we're not quite there yet. Okay. So before that, we can create our data loaders. We created a get_dls back in an earlier lesson, which simply turns that into a data loader and that into a data loader, and this one gets shuffled and that one doesn't, and so forth. Oh, I see, this is where we do our num_workers. Cool. All right. So then... Oh, yes. So then we want to add our data augmentation. So I noticed that training a Tiny ImageNet model, I mean, it's a much harder thing to do than Fashion-MNIST. And overfitting was actually a real challenge.

And I guess it's because 64 by 64 isn't that many pixels. So yeah, so I found I really needed 

data augmentation to make much progress at all. Now, a very common data augmentation is called random resized crop, which is basically to pick one area inside the image, zoom into it, and make that your image. But for such low-resolution images, that tends to work really poorly, because it's going to introduce a lot of blurring artifacts. So instead, for small images, I think it's better to add a bit of padding around them and then randomly pick a 64 by 64 area from that padded area. So it's just going to shift them slightly. It's not a lot of augmentation, but it's something. And then we do our random horizontal flips. And then we'll use that random erase thing that we created earlier. This is just something I was experimenting with. So yeah, so now we can use that batch transform callback, using transform batch, passing in those transforms. So, torchvision transforms: this capital T is torchvision transforms. Because these are all nn.Modules, you can pass them to nn.Sequential to just have each of them called one at a time in a row. There's nothing magic about this. It's just doing function composition. We could easily create our own. In fact, there's also transforms.Compose, which does the same thing. Yeah, I was going to say, we've got a fastcore compose, which, as you can see, basically just says for f in funcs: x = f(x).
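To make the "nothing magic" point concrete, here is a minimal hand-written composition helper. It is a sketch of the idea only, not the fastcore or torchvision implementation, and the two lambdas are placeholders for real transforms.

```python
def compose(*funcs):
    "Return a function that applies `funcs` one after another, left to right."
    def _inner(x):
        for f in funcs:
            x = f(x)
        return x
    return _inner

# Placeholder "transforms": add one, then double
pipeline = compose(lambda x: x + 1, lambda x: x * 2)
result = pipeline(3)  # (3 + 1) * 2
```

With no functions passed, the composed function simply returns its input, which is a convenient default for an optional transform pipeline.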

Yeah, I don't know, there is a torchvision Compose. I think that might be kind of the old way to do it. Is that right? I'm not sure. I have a feeling maybe this is considered the better way now because it's kind of scriptable. I'm not promising that, though. But yeah, that's basically the same thing. Okay, so yeah, we can now create a model as usual. So basically I copied the model with dropout, get_dropmodel, from our earlier Fashion-MNIST stuff. And I started with a kernel size five convolution, and then, yeah, a bunch of res blocks. So this is all what we're used to seeing before. And so we can take a look. In this case, as quite often seems to be the case, we accidentally end up with no random erasing. Let's run it again. It really doesn't want to do random erasing. Here we go. So we can see it. So yeah, there's this very small border you can hardly see sometimes, and a bit of random erasing. And, you know, all of the batch is being transformed, or augmented, in the same way, which is kind of okay. It's certainly faster. It can be a bit of a problem if you have, like, one batch that has lots and lots of augmentation being done to it, and it could be really hard to recognize. And that could cause the loss to be a lot in that batch. And if you've been training for ages, that could kind of jump you out of, you know, the smooth part of the loss surface. That's the one downside of this. So I'm not going to say it's always a good idea to do augmentation at batch level, but it can certainly speed things up a lot if you don't have heaps of CPUs. All right. So you can use that summary

thing we created. There's your model. And yeah, because we're doubling the number of channels as we're decreasing the grid size, our number of megaflops per layer is constant. That's a pretty good sign that we're using compute throughout. So yeah, then we can train it with AdamW, mixed precision, and our augmentations. So then I did the learning rate finder and trained it for 25 epochs, and got nearly 60%, 59%. And yeah, this took quite a while actually to get close to 60%, I've got to admit. And you can see the training set's already up to 91, so we're kind of on the verge of overfitting. Okay. So then I thought, all right, how do we do better? And I wanted to have a sense of how much better we could get. And I tend to like to look at Papers with Code, which is a site that shows papers with their code and also how good the results were that they got. So this is image classification on Tiny ImageNet. And at first I was pretty disheartened to see all these 90%-plus things. But as I looked more closely to see what this was, I realized something. The first thing I noticed is that these ticks here represent extra training data. So these are actually pretrained models that are only fine-tuned on Tiny ImageNet. So that's a total cheat. And then I looked more closely at this one, and actually, these are also using pretrained data. So Papers with Code is actually incorrect. And so, of the first ones which I could clearly kind of replicate and make sense of, the highest one that I'm confident of is this 72%. And so then I kind of wanted to get a sense of, you know, how much work is there to get from like 60% to 70%, and how good is this? So I opened up the paper.

And so here's Tiny ImageNet. Basically, this paper turns out to be about a new type of mixup data augmentation. This is the normal kind of mixup, and this is a special kind of mixup. And on a ResNet-18, I see they're getting like 63, 64 or 65 with various different types of mixup, and kind of 64 or 65 for their special one. And then if they use much bigger models than we're using, they can get up to 66-ish. So that kind of made me think, OK, this classifier is not bad. But there's clearly room to improve it.

And I can't help myself. I always have to try to do better. So this is a good opportunity to learn about a trick that is used in real ResNets, which is: in a real ResNet, we don't just say how many filters or channels or activations per layer and then just go through and do a, you know, stride-2 conv each time. Instead, you can also say the number of res blocks per downsampling layer. So this would say, do three res blocks and then downsample, or downsample and then do three res blocks, or something like that: three res blocks, the first of which or the last of which is a downsample. And then two res blocks with a downsample, and then two res blocks with a downsample. So this has got a total of one, two, three, four, five downsamples. But rather than having one, two, three, four, five res blocks, it's going to have three, four, five, six, seven, eight, nine res blocks. So it's nearly twice as deep. And so the way we do that is we just replace the places where it was saying res block with res_blocks. And that's just a sequential which goes through the number of blocks and adds a res block. And you can do it a couple of ways. In this case, I said if it's the last one, then make it stride 2, otherwise stride 1. So it's going to be downsampling at the end of each set of res blocks. So that's the only thing I changed. I changed res block to res_blocks and passed in the number of blocks, which is this. Okay. So the number of megaflops is now 710-ish, which is more than double. So it should have more opportunity to learn stuff, which also could mean more opportunity to overfit. So again, we do an LR find. And it's 25 epochs, and I didn't actually add more augmentation. Okay. And that got up to nearly 62.
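The grouping of res blocks described above can be sketched as a plan of strides, independent of any framework. The function name `stride_plan` is hypothetical, and the tuple `(3, 2, 2, 1, 1)` is inferred from the counts given in the lecture (nine blocks, five downsamples) rather than read from the notebook.

```python
def stride_plan(nbks):
    "For each group of n blocks: stride 1 for all but the last, stride 2 for the last."
    return [[2 if i == n - 1 else 1 for i in range(n)] for n in nbks]

nbks = (3, 2, 2, 1, 1)  # assumed number of res blocks per downsampling stage
plan = stride_plan(nbks)
n_blocks = sum(len(group) for group in plan)                      # total res blocks
n_down = sum(stride == 2 for group in plan for stride in group)   # total downsamples
```

Each inner list corresponds to one sequential group of res blocks; only its final block changes the grid size, so depth grows without adding extra downsampling steps.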

So that was a good improvement. And, you know, interestingly, it's not overfitting more. It's actually, if anything, less, so there's something about its ability to actually learn this. So I thought, yeah, it'd be nice to train it for longer. So I decided to add more augmentation. And to do that, I decided to use something called TrivialAugment, which is not very well known. And it comes from Frank Hutter's lab.

Frank Hutter is somebody who consistently creates extremely practical, useful improvements, with much less of the nonsense that we often see. And so this one's kind of a bit of a reaction to some previous approaches, such as one called AutoAugment and one called RandAugment. They might have both come from Google Brain, I'm not quite sure, where they used lots of, you know, many, many thousands of TPU hours to optimize how each set of images is augmented. And yeah, what these guys did is they said, well, what if we don't do that, but we just randomly pick a different augmentation for each image? And that's what they did. They just said, algorithm 1 is the procedure: pick an augmentation, pick an amount, do it.
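The three-step procedure just described (pick an augmentation, pick an amount, do it) can be sketched in a few lines. This is a toy illustration only: the "image" is a flat list of pixel values in [0, 1], the four operations are made-up stand-ins for real image augmentations, and the number of strength bins is an assumed value.

```python
import random

# Toy "augmentations" on a flat list of pixel values in [0, 1]; each takes a strength in [0, 1]
def identity(img, s): return list(img)
def brighten(img, s): return [min(1.0, p + 0.5 * s) for p in img]
def darken(img, s):   return [max(0.0, p - 0.5 * s) for p in img]
def flip(img, s):     return img[::-1]  # strength is unused for this op

OPS = [identity, brighten, darken, flip]
N_BINS = 31  # number of discrete strengths (assumed for this sketch)

def trivial_augment(img, rng=random):
    op = rng.choice(OPS)                              # pick an augmentation uniformly at random
    strength = rng.randrange(N_BINS) / (N_BINS - 1)   # pick an amount uniformly at random
    return op(img, strength)                          # do it

rng = random.Random(0)
img = [0.1, 0.5, 0.9]
augmented = [trivial_augment(img, rng) for _ in range(4)]
```

Because the choice is made independently on every call, each item can receive a different operation and strength, with nothing to tune per dataset.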

I feel like they're almost kind of trying to make a point by writing this algorithm here. Yeah, and they basically find this is at least as good as, or often better than, actually, the incredibly resource-intensive ones. The incredibly resource-intensive ones also kind of require a different version for every dataset, which is why they describe this as tuning-free. So, rather nicely, and surprisingly for me, it's actually built into PyTorch. So if we go to PyTorch's website and go to TrivialAugmentWide, they show you some examples of TrivialAugmentWide. We can create our own as well. Now the thing is, I found that doing this at a batch level worked poorly. And I think the reason is what I described earlier. I think sometimes it will pick a really challenging augmentation, you know, and it'll totally mess up the loss function. And if every single image in the batch is like that, then it'll shoot it off into the distant parts of the weight space. Which is a good excuse for me to show how to do augmentations on a per-item level. Now, these actually require, or some of them require,

having a PIL image (the Python Imaging Library image), not a tensor. So I had to change things around. So we have to import Image from PIL, and we have to change our transform x now. And we're going to do the augmentations in there instead, for the training set. So for the training set... well, for both, we're going to pass in something that just says whether you want to do augmentations. So for the training set, we're going to pass aug=True, and for the validation set, we won't. So Image.open is how you create a PIL image object. And then, if we wanted augmentations, do these augmentations, and then convert it into a tensor. So torchvision has a to_tensor we can call. And then we can normalize it. And actually, I decided just to use torchvision's normalize. Either is fine; this one works well. And then again, if you want augmentation, do your random erase. And if you remember, our random erase was designed to use zero-one distributed Gaussian noise, so you want that to happen after normalization. So that's why I do it in this order. So yeah, so now we don't need to use the batch transform thing. We're just doing it all directly in the dataset. So you can see, you know, you can do data augmentation in very simple ways with almost no framework help here. In fact, nothing's coming from a framework really. It's just this little TfmDS we made. And so now, yeah, we just pass that into a DataLoaders get_dls.
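A framework-free sketch of the per-item approach just described: a wrapper dataset applies one function to x and another to y on every access, and the training set simply gets the x transform with augmentation switched on. The names `TfmDS`, `tfmx`, and `aug` follow the lecture's wording as best it can be recovered, and the list-based "images" and toy flip are stand-ins for real image loading and augmentation.

```python
import random
from functools import partial

class TfmDS:
    "Wrap a dataset of (x, y) tuples, applying tfmx to x and tfmy to y on access."
    def __init__(self, ds, tfmx=lambda o: o, tfmy=lambda o: o):
        self.ds, self.tfmx, self.tfmy = ds, tfmx, tfmy
    def __len__(self): return len(self.ds)
    def __getitem__(self, i):
        x, y = self.ds[i]
        return self.tfmx(x), self.tfmy(y)

def tfmx(x, aug=False, rng=random):
    # Stand-in for: open the image, optionally augment, convert, scale
    if aug and rng.random() < 0.5:
        x = x[::-1]                    # toy "flip" augmentation, applied per item
    return [v / 255 for v in x]        # scale 0..255 values to 0..1

raw = [([0, 255], 'a'), ([255, 0], 'b')]   # made-up (pixels, label) pairs
str2id = {'a': 0, 'b': 1}
train_ds = TfmDS(raw, partial(tfmx, aug=True), str2id.__getitem__)
valid_ds = TfmDS(raw, tfmx, str2id.__getitem__)   # no augmentation for validation
```

Since the augmentation runs inside `__getitem__`, every item in a batch gets its own random draw, and no augmentation callback is needed downstream.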

And we don't need any augmentation callback. All right, so now we can keep improving things by doing something called pre-activation ResNets. So if we go back to our original ResNet, you might recall the way we did it. We have this conv block, which consists of two convolutions in a row. The second one has no activation. And to remind you what conv is: we first of all do a conv, and then optionally we do a normalization, and then optionally we do our activation function. And then the second of those has act=None. So basically what this is saying is: convolution, norm, activation, convolution, norm. That's what self.convs is. And then this is the identity path. This does nothing at all if there's no downsampling or no change of channels. And then we apply the activation function, the final activation function, to the whole thing. So that was how the original res block was designed, which is kind of a bit of an accident because, to be honest, when I wrote that I didn't bother looking at the paper, I just did whatever seemed reasonable in my head. But yeah, then looking into it further, I looked at this slightly later paper by the same author as the ResNet paper, Kaiming He. And Kaiming He et al. drew, you know, this version here on the left. As you can see, it's conv, norm, ReLU, conv, norm, add, ReLU. And yeah, he basically pointed out, you know what? Maybe that's not great, because the ReLU is being applied to the addition. So it's not really an identity path at all. So wouldn't it be nice if we could have a pure identity path? And so to do that, he proposed reordering things to go norm, ReLU, conv, norm, ReLU, conv, add. And so this is called a pre-act or pre-activation res block.
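A minimal PyTorch sketch of the reordering described above, assuming batch norm and ReLU as the norm and activation. The names `pre_conv` and `PreActResBlock`, the average-pool and 1x1-conv shortcut, and the layer sizes are illustrative assumptions, not the notebook's exact code.

```python
import torch
from torch import nn

def pre_conv(ni, nf, ks=3, stride=1):
    "Norm -> activation -> conv (the pre-activation ordering)."
    return nn.Sequential(nn.BatchNorm2d(ni), nn.ReLU(),
                         nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2))

class PreActResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        self.convs = nn.Sequential(pre_conv(ni, nf), pre_conv(nf, nf, stride=stride))
        # Identity path: left untouched unless channels or resolution change
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
    def forward(self, x):
        return self.convs(x) + self.idconv(self.pool(x))  # no activation after the add

# A downsampling block halves the grid and changes the channel count
out = PreActResBlock(8, 16, stride=2)(torch.randn(2, 8, 32, 32))

# With the last conv zeroed, a same-shape block passes its input through unchanged,
# negative values included: the skip connection is a pure identity path
same = PreActResBlock(8, 8)
nn.init.zeros_(same.convs[1][2].weight)
nn.init.zeros_(same.convs[1][2].bias)
x = torch.randn(2, 8, 4, 4)
identity_ok = torch.equal(same(x), x)
```

In the original ordering, a final ReLU after the add would clip the negative inputs in that last check, which is the sense in which the earlier block has no true identity path.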

to do norm then act and then conf. So my sequential now has the 

activation in both places. And so yeah, other than that. Oh, and 

then of course, there's no activation happening in the res block because it's all 

happening in the cons. Does that make sense? Yeah, makes sense. Yeah. So this is now the 

site. This is exactly the same except we now need to have an activation and a batch norm 

after all those blocks because previously it finished with an activation norm and activation. 

So it starts with them. So we have to put these at the end. It also means we can't start with a 

res block anymore because if he started with a res block, then it would have an activation function 

at the start, which would throw away half of our data, which would be a bad idea. So you've got 

to be a bit careful with some of the details. But yeah, so now you can see that each image 

is getting its own augmentation. And so this one's been shared looks like it's a door 

or something. Got to tell what the hell it is. It's been shared. This one's been moved. 

That looks like this one's also been shared. And you can also see they've got 

different amounts of random rays on them. So yeah, so I thought I tried to 

change training that for 50 epochs. And that got us to 65% which is, you 

know, as good as nearly as good as the normal mix up things are getting even 

on a resident 50s. This is really good. So I won't spend time on this, but I just 

mentioned I was kind of curious like, I mean, one of the things I should mention also 

is they trained all these for 400 epochs. So I was kind of curious what would 

happen if we trained it a bit longer. I wasn't patient enough to train it for 

400 epochs, but I thought I could do 200 epochs. So I just replicated that 

last one, but made it 200 epochs. And that got us to 67.5, which, yeah, is

better than any of their non-special mixups. So I think it just goes to show you can get,

you know, genuinely state-of-the-art results. And if we used their special mixup, that

would be interesting to try as well, to see if we can match their results there. But

you know, we've built all this from scratch. We didn't do the data augmentation from scratch

because it's not very interesting, but yeah, other than that. So I think that's really

cool. So I know that you did some other experiments with the pre-activation, right?

Yeah. Right. When I saw the pre-activation success, I was quite enthusiastic

about it. So I actually thought, oh, maybe I should go back and actually use

it everywhere. But funnily enough, I think it's weird: it was worse for Fashion-MNIST,

and worse with less data augmentation. I mean, maybe it's not that weird, because the

idea, when He et al. introduced it, they said, this is to train deeper models. You know,

there's a more pure identity path. And so

that should kind of let the gradients flow through it more easily. And so there should be

a smoother loss surface. So yeah, I guess it makes sense that you don't

really see the benefits on less deep models. The bit I'm surprised by, though... because

that sort of justification should be true for smaller ones as well, shouldn't

it? Yeah, it is, but smaller models are going to have a less bumpy surface anyway.

They've just got fewer dimensions to be bumpy on. And, more importantly,

they're less deep, so there's less room for gradients to explode exponentially. So

they're not as sensitive. But yeah, I mean, I can see why they don't necessarily help as 

much, but I don't have any idea why they were worse and they were quite consistently worse. 

Yeah, yeah, I find it quite interesting too. Yeah, it's quite curious. And it's interesting

that when we do these like experiments on things that nowadays are considered pretty fundamental

and foundational, we all the time discover things that nobody seems to have noticed

or written about. There's plenty of room, as a kind of a more experimental researcher,

to do experiments and then go like, oh, that's interesting and then try and figure out what's 

going on. Yeah. I think a lot of researchers go in the opposite direction and they try to start with 

like theoretical assumptions and then test them. But when I think about it, I feel like maybe

the more successful folks, in terms of people who build stuff that actually

gets used, are more experimental first, maybe. Okay, so, shall we have a five minute break,

since we are kind of on the hour? Sure. All right, so let's now look at notebook 25,

super-res. I've just copied a few things from the previous notebook: the transforms and our

datasets, denorm, and our tfm_batch, and our tfmx. Let me show

you using tfm_batch here. Not even using tfm_batch. Let's get rid of

that because that's just confusing. Okay, so... let's figure this out.

So what are we doing here? So we've got our two datasets. All right, so the goal of

this is we're going to do super resolution, not classification. So let's talk about

what that means. What we're going to do is the independent variable will be

scaled down to a 32 by 32 pixel image. And the dependent variable

will be the original image. And so when we do random crop within

a padded image and random flips, both the independent and the dependent variable

need to have had exactly the same random cropping and exactly the same flipping. Otherwise, it can't

say, oh, this is how you do super-res to go from the 32 by 32 to the 64 by 64. Instead it's like, oh,

it has to be flipped around and moved around. So yes, for this kind of image reconstruction

task, it's important to make sure that your augmentation is done in the same way on the

independent and the dependent variable. So that's why we've put it into our dataset. And so this

is something people often get confused about and they don't know how to do it. But it's actually 

pretty straightforward if we do it this way: we just put it straight in the dataset. And

it doesn't require any framework fanciness. Now, what I did do is I then added

random erasing, just to the training set. And the reason for that is I wanted to make

the super resolution task a bit more difficult, which means sometimes it doesn't just do super

resolution, but it also has to replace some of the deleted pixels with proper pixels. And so

it gives it a little bit more to do, you know, which can be quite helpful. It's kind of

a data augmentation technique, and also something to give it more of an opportunity

to learn what the pictures really look like. Okay, so with that in place, both of these

are going to do the padding, random cropping and flipping. The training set will also add random

erasing, and then we create data loaders from those. Would it make sense to use TrivialAugment

here? TrivialAugment, did you say? Yeah. Maybe. Yeah, I don't particularly see a reason not

to, if... well, only if you found that overfitting was a problem. And if you did do

it, you would do it to both independent and dependent variables. So yeah, here you can 

see an example: the independent variable, in this case all of them

actually, has some random erasing; the dependent doesn't. It has to figure out how to

replace that with that. And you can also see that this is very blocky, and this is less blocky.

That's because this has gone down to 32 by 32 pixels, and this one's still at the 64 by

64. So in fact, once you go down that far, the cat's lost its eyes entirely. So it's going to be

quite challenging. It's lost its lines entirely. So super resolution's quite a good task to

try to get a model to learn what pictures look like. Because it has to figure out like 

how to draw an eye and how to draw a cat's whiskers and things like that.

Were you going to say something, Johno? Sorry. I was just going to point out that the datasets

are also simpler because you don't have to load the labels. So there's no difference between the 

train and the validation now. Good point. Yeah, because the label, you know, the actual

dependent variable, is just the picture. Okay, so TfmDS has a tfmx,

which is only applied to the independent variable. The independent variable has applied to

it this pair: resize to 32 by 32, and then interpolate. And what that actually

does is it ends up still with a 64 by 64 image, but the pixels in that image are all like doubled 

up. And so that means that it's still doing super resolution, but it's not actually going from 

32 by 32 to 64 by 64. It's just going from a 64 by 64 where all of the pixels are

like two by two blocks. And it's just a little bit easier that way. We could certainly

create a U-Net that goes from 32 to 64, but if you have the input and output image the same

size, it can make the code a little bit simpler. I originally started doing it by not doing this

interpolate thing. And then I decided it was just getting a bit confusing. And there's

no reason not to do it this way, frankly. OK, so that's our task. And the idea is that then 

if it does a good job of this, you know, you could pass 64 by 64 images into it. And hopefully 

it might turn them into 128 by 128 images. Particularly if you trained it 

on a few different resolutions, you'd expect it to get pretty good at it.
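
The data setup described above can be sketched roughly as follows, assuming PyTorch. The names crop_and_flip, to_lowres and SuperResDS are hypothetical and this is not the notebook's code: the target is augmented once, and the input is then derived from that augmented target, so both share the same crop and flip. Random erasing on the training input is left out to keep the sketch short.

```python
import random
import torch
import torch.nn.functional as F

def crop_and_flip(img, pad=4):
    # One random crop offset and one flip decision per image.
    size = img.shape[-1]
    padded = F.pad(img, (pad, pad, pad, pad))
    top, left = random.randint(0, 2 * pad), random.randint(0, 2 * pad)
    out = padded[:, top:top + size, left:left + size]
    return out.flip(-1) if random.random() < 0.5 else out

def to_lowres(img, small=32):
    # Shrink to small x small, then blow back up by repeating pixels,
    # so the input keeps the same 64x64 shape as the target.
    size = img.shape[-1]
    lo = F.interpolate(img[None], size=(small, small), mode="bilinear", align_corners=False)
    return F.interpolate(lo, size=(size, size), mode="nearest")[0]

class SuperResDS:
    def __init__(self, imgs): self.imgs = imgs
    def __len__(self): return len(self.imgs)
    def __getitem__(self, i):
        y = crop_and_flip(self.imgs[i])  # augment the target once...
        x = to_lowres(y)                 # ...then derive the input from it
        return x, y

ds = SuperResDS(torch.rand(8, 3, 64, 64))  # random stand-in images
x, y = ds[0]
```

Each 2 by 2 block of x holds a single repeated value, which is the "doubled up" pixel effect mentioned above.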

And you could even call it multiple times. But anyway, for this, I was just kind of doing it 

to demonstrate. But we have in previous courses trained, you know, bigger ones for longer with 

larger images. And actually, one of the interesting things is they tend to not only do

super resolution, but they often make the images look better. Because the kind of the pixels 

it fills in, it kind of fills in with like what that image looks like on average, which 

tends to kind of like average out imperfections. So often these super resolution models actually 

improve image quality as well, funnily enough. OK, so let's consider the dumb way to do things.

We've seen a kind of a dumb way to do things before, which is an autoencoder. But we've

got low expectations here, because we've done an autoencoder before, and it was so bad it

actually inspired us to create the learner, if you remember. So that was back in notebook

eight. And so basically what we're going to do is we're going to have a model which looks

a lot like previous models, that starts with a res block, kernel size five. And then it's

got a bunch of res blocks of stride two. But then we're going to have an equal number of

up blocks. And what an up block is going to do is, sequentially, first of all,

it's going to do an UpsamplingNearest2d, which is actually identical to this. Right,

so it's going to just double all the pixels. And then we're going to pass that through a res

block. So it's basically a res block with like a stride of a half, if you like. You know,

it's undoing a stride two; it's upsampling rather than downsampling.

Okay, and then we'll have an extra res block at the end to get it down to

three channels, which is what we need. Okay, so we can do our

learning rate finder on that. And I just trained it pretty briefly, for five epochs.
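
A rough sketch of that autoencoder shape, assuming PyTorch. Plain conv blocks stand in for the lesson's res blocks, and conv_block, up_block, get_autoencoder and the channel counts are illustrative choices rather than the notebook's values.

```python
import torch
from torch import nn

def conv_block(ni, nf, ks=3, stride=1):
    # Stand-in for a res block: conv, norm, activation.
    return nn.Sequential(
        nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2),
        nn.BatchNorm2d(nf),
        nn.ReLU(),
    )

def up_block(ni, nf, ks=3):
    # Double every pixel, then convolve: the reverse of a stride-2 block.
    return nn.Sequential(nn.UpsamplingNearest2d(scale_factor=2), conv_block(ni, nf, ks))

def get_autoencoder(nfs=(32, 64, 128, 256)):
    layers = [conv_block(3, nfs[0], ks=5)]
    layers += [conv_block(nfs[i], nfs[i + 1], stride=2) for i in range(len(nfs) - 1)]
    layers += [up_block(nfs[i], nfs[i - 1]) for i in range(len(nfs) - 1, 0, -1)]
    layers += [nn.Conv2d(nfs[0], 3, 3, padding=1)]  # back down to three channels
    return nn.Sequential(*layers)

model = get_autoencoder()
x = torch.randn(2, 3, 64, 64)
out = model(x)
bottleneck = model[:4](x)  # output of the last stride-2 block
```

With these settings the 64 by 64 input is squeezed to an 8 by 8 grid before being upsampled again, which is why the reconstruction task is hard for a model with no skip connections.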

So this model is basically trying to take the image that we start out with, then kind of squeeze

it into, I guess, a small representation, and then trying to bring that small representation back up

to the full super-res image? Exactly right, Tanishq. And we could have done it without any of the stride

two, you know; I guess we could have just had a whole bunch of stride one layers. There's

a few reasons not to do it that way though, one is obviously just the computation requirements 

are very high because the convolution has to scan over the image. So when you keep it at 64 

by 64, that's a lot of scanning. Another is that you're never kind of forcing it to learn higher

level abstractions by recognizing how to kind of like, you know, use more channels on a 

smaller grid size to represent it. Yeah, it's like the same reason that we in classifiers, 

we don't leave it at stride one the whole time, you know, you end up with something that's 

inefficient and generally not as good. Exactly, yep, thanks for clarifying, Tanishq. Okay, so the

loss goes down, and the loss function I'm using is just MSE here, right, so it's how similar is

each pixel to the pixel it's meant to be. And so then I can call capture_preds to get

the predictions and the targets and inputs, or probabilities, targets and inputs; I can't

quite remember now. So here's our input images. So they're pretty low resolution. And oh, dear, here's our predicted images, so

pretty terrible. So why is that? Well, basically it's kind of like the problem we had with our

earlier autoencoder: it's really difficult to go from like a two by two or four by four or

whatever image into a 64 by 64 image. You know, we're asking it to do something that's just

really challenging and so that would require a much bigger model trained for a much longer 

amount of time, I'm sure it's possible. And in fact, you know, latent diffusion, as we've 

talked about has a model that kind of does exactly that. So in our case, there's no need to make 

it so complicated, we can actually do something dramatically easier, which 

is we can create a U-Net. U-Nets were originally developed in 2015

and they were originally developed for medical imaging, but they've been used very, very widely

since. And I was involved in medical imaging at the time they came out, and certainly they quite

quickly got recognized in medical imaging. They took a little bit longer to get recognized

elsewhere, but nowadays they're pretty universal. And they are used in Stable Diffusion. And

basically some of the details don't matter here; this is like the original paper. So let's focus

on the kind of the broad idea. This thing here, we're going to call the down

sampling path. So in this case they started with 572 by 572 images. And it

looks like they started with one channel images and then they, you know, as we've seen, then they 

took them down to 284 by 284 by 128, and then down to 140 by 140 by 256, and then 68 by 68 by 512, and 32 by

32 by 1024. So here's this down sampling path, right, and then the up sampling path is

exactly what we've seen before, right, so we up sample and have some... I mean, in the original

thing they didn't use ResNets or res blocks, they just used convs, but the idea is the same.

But the trick is these extra things across here, these arrows, which are copy and crop. What we can

do is we can take... so in the up sampling we've got a 512 by 512 here, sorry, a 512 channel

thing here; we can up sample to a 512 channel thing. We can then put it through a

conv to make it into a 256 channel thing, and then what we can do is we can copy across the

activations from here. Now, they actually do things in a slightly weird way, where in the

down sampling they had 136 pixels by 136, and over here they have 104 by 104, so they crop

out the center bit, because of the slightly weird way they did it: they basically

weren't padding things. Nowadays we don't have to worry about that cropping. So what we do

is we literally copy over these activations and we then either concatenate or add, and you

can see in this case they're concatenating; see how there's the white bit and the blue bit? So

they have concatenated the two lots together. So actually I think what they did here is they

went from a 52 by 52 by 512 to 104 by 104 by 256, and I think that's what this little blue

rectangle here is, and then they copied out the 104 by 104 by 256 and then

put the two together to get a 104 by 104 by 512. And so of these activations, half are from the

up sampling and half are from the down sampling from earlier in this whole process. And it might

be easiest to understand why that's interesting when we get all the way back up to the top,

where we've got this 392 by 392 thing. The thing we're copying across now is just two

convolutions away from the original image. So for super resolution, for example, we want it to

look a lot like the original image, so in this case we're actually going to have an entire copy of

something very much like the original image that we can include in these final convolutions.
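
The odd sizes quoted from the paper's diagram (136, 104, 392) fall out of simple arithmetic. The sketch below assumes what the diagram shows: two unpadded 3 by 3 convolutions per level (each trimming 2 pixels), 2 by 2 pooling on the way down, and doubling on the way up. The function name unet_sizes is made up for illustration.

```python
def unet_sizes(size=572, depth=4):
    down = []
    for _ in range(depth):
        size -= 4          # two unpadded 3x3 convs, 2 pixels lost per conv
        down.append(size)  # this is the activation that gets copied across
        size //= 2         # 2x2 pooling halves the grid
    size -= 4              # the two convs at the bottom of the "U"
    pairs = []
    for skip in reversed(down):
        size *= 2                   # upsampling doubles the grid
        pairs.append((skip, size))  # the skip is center-cropped from `skip` to `size`
        size -= 4                   # two more unpadded convs
    return down, pairs, size

down, pairs, final = unet_sizes()
```

Here down holds the saved grid sizes on the way down, and each pair is (size copied across, size it is cropped to), which includes the 136 to 104 crop and the 392 by 392 level at the top.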

And so down here we have, you know, something that's kind of like the somewhat down sampled

version we can use here, and the more down sampled version we can use here. So yeah, that's

how the U-Net works. Do you guys have anything to add, like things that you found

helpful to understand, or anything surprising? I guess it's worth mentioning that these days a lot

of people tend to just add. So you've got, you know, the outputs from the down layer are the

same shape as the inputs for the corresponding up block, and then they just kind of add them.
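
To make the two options concrete, here is a small sketch assuming PyTorch, with random tensors standing in for real activations. The helper center_crop is hypothetical; the shapes follow the 136 and 104 example from the paper's diagram.

```python
import torch

def center_crop(t, size):
    # Keep the central size x size window of the last two dimensions.
    h, w = t.shape[-2:]
    top, left = (h - size) // 2, (w - size) // 2
    return t[..., top:top + size, left:left + size]

skip = torch.randn(1, 256, 136, 136)  # saved on the down path (unpadded convs)
up = torch.randn(1, 256, 104, 104)    # upsampled activations on the up path

# Original paper style: crop the skip, then concatenate on the channel axis.
cat = torch.cat([center_crop(skip, 104), up], dim=1)

# With padded convs the shapes already match, so the skip can simply be added.
skip_same = torch.randn(1, 256, 104, 104)
added = skip_same + up
```

Concatenating doubles the channel count that the next conv has to take in, while adding keeps it unchanged.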

Yeah, particularly for super resolution, adding might make more sense than concatenating, because

you're literally saying, oh, this little 2 by 2 bit is basically the right pixel, but it

just has to be slightly modified on the edges. Yeah, it also makes me think of like a boosting

sort of thing, where if you think about the fact that the information from the original is

just being passed all the way across at that highest skip connection, then the rest of the

network can be effectively producing an update to that rather than having to recreate the whole

image. Or to put it another way, it's like a ResNet: there are skip connections,

right, but the skip connections are jumping from the start to the end, and from a

bit after the start to a bit before the end. And I guess a ResNet's a

bit like boosting too. Yeah. Yeah, I mean, I was kind of going to say,

yeah, basically, compared to like the denoising autoencoder, where we saw

the results were even worse than, I guess, the original image, here I guess the worst it can be

is basically the original image. You know, I guess it's just a similar sort of intuition to the one behind

the ResNet and how that works. So yeah, I mean, it could be worse if these convs at the

end are incapable of undoing what these convs did, which is like one argument for maybe

why there should also be a connection from here over to here and maybe 

a few more cons after that which is something I'm kind of interested in 

and not enough people do in my opinion. Another thing to consider is that they've 

only got two convs down here, but at this point you have the benefit of only being 28 by 28.

You know, why not do more computation at this point? So there's a couple of things that

sometimes people consider, but maybe not enough. So let me try to remember what I did. So in my

U-Net here, we've got the down sampling path, which is a list of res blocks. Now, a ModuleList

is just like a Sequential except it doesn't actually do anything, so then in the forward

we have to go through the down path and do x = l(x) each time. So it's basically, yeah, it's

a Sequential that doesn't actually do anything. And so the up path is exactly the same

as before: it's a bunch of up blocks. And then, like we saw before, the final

one's going to have to go to three channels. But now for our forward, what we're going

to do is we're going to keep track of... since we're going to be copying this over here

and copying this over here, we have to save it during the down sampling path. So we're going

to save it in something called layers. So I actually decided to do the little trick I

mentioned which is to save the very first input. So I saved the very first input I then 

put it through the very first res block. And then we go through each 

in the downward path. And there's actually no need at all for there to 

be an i, l here; it doesn't have to be an enumerate, because we don't use i. Okay, so we go

through the downward path: for l, for layer, so for each layer in the downward

path, append the activations, so that as we go through each one we're going to be able

to copy them over by saving them for later. And then call the layer. Okay, so how many layers

have we got there? There's n layers that we stored away. So now we're going to go through the up sampling

path, and again we're going to call each one. But before we do, we're going to actually do the

thing that Johno mentioned, which is adding rather than concatenating, unless

this is the very first layer, because for the very first up sampling layer there's nothing to copy.

Right, so unless this is the very first up sampling layer, we just add the saved activations and then

call the layer. And then right at the very end we'll add back the very first layer and then

pass it through the very last res block. All right, maybe that last one should

be concatenated, I'm not sure. Anyhoo, this is what I did. Now the next thing

that I wondered about was like how to initialize this and basically what I wanted to 

do is I wanted to initialize this so that when it's untrained, the output

of the model would be identical to the input. Because, like, a reasonable starting point for

what does this look like following super resolution would be this.

You know, that's a reasonable starting point. So I just created this little zero weights thing, which

zeros out the weights and biases of a layer, right. So I created the model, and then I said, okay, let's

look at the very end of the up sampling path. And we'll call that the last res block, and so

let's zero out the very last convolution, and also the id connection. And so that means that,

whatever it does for all this, at the very end it's going to have nothing in there; this will

be zero, so that means that this will be equal to layers zero. And then that means we also want

to make sure that this doesn't change anything. So then we can just zero out the weights

there. That's probably not quite right, is it? I guess I should have actually set

those to like an identity matrix. Maybe I'll try to do that later. But at least it's

something that would be very easy for it to learn. I have a question. Johno? I mean, yeah. The

zero weights: I see a lot of people do a thing where they instead multiply by

1e-3 or 1e-4 to make the weights really small but not completely zero. And

I don't have a good intuition whether, you know, in some sense, having everything set to

zero fires off some warnings that maybe this is going to be perfectly balanced on some saddle

point, or it's not going to have any signal to work with, and whether very small but not quite zero

weights might be better. Do you have any intuition, or not so much intuition but more empirical,

or both? I don't think it's an issue. And I think it comes from, like, a lot of

people's PhD supervisors and stuff, you know, come from back in an era when they were doing like

linear regression with one layer or whatever. And in those cases, yeah, if the weights are all the

same, then no learning can happen, because every weight update is identical. But in this case, all

the previous weights are different, so they all have different gradients and

there's definitely nothing to worry about. I mean, multiplying it by a small number would 

work too; it's not a problem. But yeah, setting it to zeros... honestly, I have to stop

myself. I mean, it's not a problem, but I just... I always have this natural inclination to

not want to set them to zeros because of years of being told not to. But there's

no reason that should be a problem. All right. I was just going to say, again,

that U-Net code is very concise, and it's very interesting to see that the basic ideas

are very simple. Oh, yeah, to see that, I guess. Yeah. Yeah, it's helpful, I think, when we just

get it into a little bit of code, isn't it? Yeah. Thanks. That's very simple code too.
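
For reference, the forward pass and the zero initialization discussed above can be sketched like this, assuming PyTorch. This is a simplified reconstruction rather than the notebook's code: ResBlock here has no activation after the addition, and the names TinyUnetSketch and zero_wgts and the channel counts are illustrative.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=1, act=True):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)]
    if act: layers += [nn.BatchNorm2d(nf), nn.ReLU()]
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    # Simplified stand-in for the lesson's res block.
    def __init__(self, ni, nf, ks=3, stride=1):
        super().__init__()
        self.convs = nn.Sequential(conv(ni, nf, ks), conv(nf, nf, ks, stride=stride, act=False))
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, 1)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
    def forward(self, x): return self.convs(x) + self.idconv(self.pool(x))

def up_block(ni, nf):
    return nn.Sequential(nn.UpsamplingNearest2d(scale_factor=2), ResBlock(ni, nf))

class TinyUnetSketch(nn.Module):
    def __init__(self, nfs=(16, 32, 64)):
        super().__init__()
        self.start = ResBlock(3, nfs[0], ks=5)
        self.dn = nn.ModuleList([ResBlock(nfs[i], nfs[i + 1], stride=2) for i in range(len(nfs) - 1)])
        self.up = nn.ModuleList([up_block(nfs[i], nfs[i - 1]) for i in range(len(nfs) - 1, 0, -1)])
        self.up.append(ResBlock(nfs[0], 3))  # final up-path block goes to 3 channels
        self.end = ResBlock(3, 3)

    def forward(self, x):
        layers = [x]               # save the very first input
        x = self.start(x)
        for l in self.dn:          # a ModuleList has no forward, so loop by hand
            layers.append(x)       # save activations for the up path
            x = l(x)
        n = len(layers)
        for i, l in enumerate(self.up):
            if i != 0: x = x + layers[n - i]  # the first up block has nothing to add
            x = l(x)
        return self.end(x + layers[0])

def zero_wgts(l):
    with torch.no_grad():
        l.weight.zero_()
        l.bias.zero_()

model = TinyUnetSketch()
last_res = model.up[-1]
zero_wgts(last_res.convs[-1][0])   # last conv of the final up-path block
zero_wgts(last_res.idconv)         # its 1x1 conv on the identity path
zero_wgts(model.end.convs[-1][0])  # last conv of the end block

xb = torch.rand(2, 3, 64, 64)
out = model(xb)                    # untrained output equals the input
loss = ((out - torch.rand_like(xb)) ** 2).mean()
loss.backward()
grad = model.end.convs[-1][0].weight.grad
```

In this sketch the zeroed convolutions still receive non-zero gradients, so training can move them away from zero. On the very first backward pass the layers behind them get zero gradient, and that changes as soon as the zeroed layers have been updated.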

Okay, so we do an LR find and then we train. And you can see, previously

our loss, even after five epochs, was .207. And in this case, our loss after one

epoch is .086. So it's obviously much easier. And we end up at .073. Okay. So we can

take a look. Here's our inputs. And there's our outputs. So it's actually better rather

than dramatically worse now. So that's good. Yeah, so it's actually not

bad at all, I would say. And this car definitely looks like... I think

it's like a little over-smooth, you know, I think you could say. So if we look at the

other guy's eyes, the kid's eyes still aren't great. In the original he's actually got proper

pupils. So yeah, it's definitely not recreated the original, but you know, given limited compute

and limited data, like the basic idea is not bad. I do worry that the poor koala like it didn't have 

eyes here, but like it ought to have known there should be eyes in a sense, and it didn't create 

any. And maybe it should have done a better job on the eyes. So my feeling is, and this is a pretty

common way of thinking about this, that when you use mean squared error, MSE, as your loss function

on these kinds of models, you tend to get rather blurry results, because if the model's not sure

what to do, it's going to predict kind of the average, you know. So one good way to fix

that is to use perceptual loss. And I think it was Johno who taught us about perceptual loss,

wasn't it when we did the style transfer stuff. So perceptual loss is this idea that we could look 

... it's kind of similar as well to the FID idea. We could look at some intermediate layer of

a pre-trained model and try to make sure that our output images have the same

features as the real images. And in this case, it ought to be saying, like, in the

real image, you know, if we went to kind of midway through a ResNet, it should be saying like

there should be an eye here. And in this case, this would not represent an eye very well.

So that should give it some useful feedback to improve how it draws an eye here. So

to do perceptual loss, we need a classifier model. So I just used the little, I don't know 

why I used the 25 epoch one, I guess, maybe that's all I had trained with at 

that time. So I used a 25 epoch model. So then, yeah, I just grab a batch of my 

validation set and then we can just try it out by calling the classifier model. And here I'm doing 

it in fp16, just keeping my memory use down. I don't think this .half() would be

necessary since I've got autocast, never mind.

Okay, this is the same code we had before for the synsets. So here is our images. Just looking at some of them there,

I mean, the koala's fine. You know, I wouldn't have picked this as a parking meter.

I wouldn't have picked this as a bow tie. So yeah, so basically what this

is doing here is it's showing us the predictions. So the predictions

are not amazing. Trolley bus, that looks right. This is weird: it's called this one a neck

brace and this one a basketball. The Labrador retriever

it's got right, the tractor it's got right, centipede's right, mushroom's right. So, you

know, you can see our classifier's okay, but it's not amazing. I think this

was one with like a 60% accuracy. But the important thing is it's

got enough features to be able to do an okay job. I have no idea what this is. I'm

pretty sure it's not a goose. Okay, so the model. The model was a very simple, 

just a bunch of res blocks, three, four, five, and then at the end we've

got a pooling, flatten, dropout, linear, batch norm. So we don't need... Yeah, so what we're going to

do is, just to keep things simple, we're just going to grab, I think, the end of the three

res block. And so a simple way to do that is we'll just go from range four to the end of the

model and delete those layers. So if we do that, and then look at the model again.
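
Deleting the trailing layers as just described can be sketched as follows, assuming PyTorch. The classifier here is an untrained toy stand-in with made-up layer sizes, not the lesson's trained model; only the truncation step is the point.

```python
import torch
from torch import nn

def block(ni, nf):
    # Toy stride-2 stage standing in for a res block.
    return nn.Sequential(nn.Conv2d(ni, nf, 3, stride=2, padding=1), nn.ReLU())

# Hypothetical classifier: five stages, then a pooling and linear head.
cmodel = nn.Sequential(
    block(3, 16), block(16, 32), block(32, 64), block(64, 128), block(128, 256),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Dropout(0.2), nn.Linear(256, 200),
)

del cmodel[4:]  # drop everything from index 4 on, keeping layers 0 to 3
cmodel.eval()
with torch.no_grad():
    feats = cmodel(torch.randn(8, 3, 64, 64))
```

Calling the truncated model now returns intermediate feature maps (batch, channels, height, width) instead of class scores, with no hooks needed.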

You can now see I've got zero one, two, three, and that's it. So this model is going 

to, yeah, return the activations after the fourth res block. So for the perceptual

loss, I think we talked about how you could pick a couple of different places; there's

various ways to do it. This is just the simplest. I didn't even have to use hooks or anything. We

can just call cmodel. And in fact, if we do it, just to take a look at what this looks like, and

again, we're going to use mixed precision here, we can grab our y batch

as before, put it through our classifier model. And so now that we've done this, this is now going

to give us those intermediate level features. So the features, what's the shape of

them? It's batch size, 1024, by the number of channels of that layer, by

the height and width of that layer. So those are the features we're

going to be using for the perceptual loss. And so when I was doing this, I kind of

wanted to check whether things were vaguely looking reasonable. So I would expect

that these features from the actual y should be similar to if I use our model. So something that I did, I thought, okay, if we

took that model that we trained, then we would hope that the features were at least of the same 

sign, you know, from the result of the model, as they are in the real images. So this

is just me comparing that, and it's like, oh, yeah, they are generally the same sign.

So these are just little checks that I was doing along the way. And then I also thought

I'd kind of look at the MSE loss along the way. Yeah, so there's no need to keep all those in

there. It was just stuff I was kind of doing to debug as I went; well, not even debug,

but to identify ahead of time if there were any problems. So now we can calculate our loss function. So our

loss function is going to be the MSE loss, just like before, between the input and the target,

just all that's being passed in here, plus the MSE loss between the features we get out of

cmodel and the features we get from the actual target

image. And so the features we can calculate for the target image; now, the target image, we're

not going to be modifying that at all, so we do that bit with no gradient. But we do want to

be able to modify the thing that's generating our input, that is, the model we're trying to

optimize. So we do have gradient for that. So in each case we're calling the classifier

model, once on the target and once on the input. And so those are giving us our features.
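
A minimal sketch of this two-term loss, assuming PyTorch. The feature extractor feat_model is an untrained toy stand-in for the truncated classifier, and comb_loss and feat_scale are illustrative names. The weighting on the feature term is there because the two terms can sit on quite different scales; 0.1 is just the default chosen for this sketch.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical frozen feature extractor (stand-in for the truncated classifier).
feat_model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1),
).eval()
for p in feat_model.parameters():
    p.requires_grad_(False)

def comb_loss(inp, tgt, feat_scale=0.1):
    with torch.no_grad():
        tgt_feat = feat_model(tgt)  # the target is fixed, so no gradient is needed
    inp_feat = feat_model(inp)      # gradients must flow back to whatever produced inp
    return F.mse_loss(inp, tgt) + F.mse_loss(inp_feat, tgt_feat) * feat_scale

pred = torch.rand(4, 3, 64, 64, requires_grad=True)  # stands in for the model's output
target = torch.rand(4, 3, 64, 64)
loss = comb_loss(pred, target)
loss.backward()
same = comb_loss(target, target)
```

The gradient on pred comes from both terms: the pixel term directly, and the feature term by backpropagating through the frozen extractor.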

Now then we add them together, but they're not particularly similar numerically; they're on

very different scales, and we wouldn't want it to focus entirely on one or the other. So I just

ran it for an epoch or two to check what the losses looked like, and I noticed that the feature

loss was about 10 times bigger. So my very hacky way was just to divide it by 10. But honestly,

that detail doesn't tend to matter very much in my opinion, so there's nothing

wrong with doing it in a rather hacky way. There are papers which suggest 

more elegant ways to handle it, which isn't a bad idea to save you a bit of time 

if you're doing a lot of messing around with this. Jeremy, I don't know if you know it, but the 

new VAE decoder from Stability AI for the Stable Diffusion autoencoder: they trained some with just

mean squared error and some with mean squared error combined with the perceptual loss, and they

had a scaling factor of times 0.1. So exactly the same. So the answer is 0.1; that's the

official one. And Andrej Karpathy says that the correct learning rate to use is always 3e-4. So we're

getting all this sorted out now. That's good. All right. So for my U-Net, we're going to

do the same stuff as before in terms of initializing it. Do our LR find. Train

it for 20 epochs, and obviously the loss is not comparable, because this loss now

incorporates the perceptual loss as well. And so this is one of the challenges with these

things: is it better or worse? Well, we just have to take a look and

compare, I guess. Maybe I should copy over our previous model's images, so we can compare.

Okay, there's our inputs. There's our outputs. And yeah, look, he's got pupils

now, which he didn't used to have. Koala still doesn't quite have eyeballs,

but it's definitely less, you know, out-of-focusy looking. So

yeah, I'm just looking at what's going on. Yeah, so some of them are going to be

flipped because this is copied from earlier. So yeah, there's flipping and cropping

going on, so they won't be identical. Yeah, you can also see the background

was all just blurred before, whereas now it's got texture, which, if we look at the

real, the real has texture, you know. So yeah, clearly the perceptual loss has

improved matters. Quite significantly. There's an interesting thing here, which is that 

there's not really any metric we can use now, right? Because if you did mean squared error,

the one that's trained with mean squared error would probably do better, but visually it looks

worse. Yeah, and if we used something like FID, well, that's based on the features of a pre-trained

network. So that would probably be biased towards the one that's trained using those features, the

perceptual loss. And so you get back to this very old-school thing of, well, actually, how

you're choosing is just by looking and evaluating, right? And when you speak to someone like Jason Antic,

who's made a whole career out of, you know, image restoration and super resolution and colorization,

that is a big part of his process. Even now, it's still looking at a bunch of images

to decide whether something is better, rather than relying on these metrics. Yeah, some

PhD student yelled at me on Twitter a few weeks ago for saying, like, look, this cool

thing our student made, look, don't they look better? And he was like, don't you know there's

rigorous ways to measure these things? This is not a rigorous approach at all. It's like,

PhD students, man, they've got all the answers. A human looking at a picture and

deciding if they like it or not? That's insane. Well, I'm a PhD student, and I agree, though,

that we should be looking at them. So yeah, okay, some PhD students are better

than others. That's fair enough. What's this? Right. Okay. So 

talking of trading, let's do that. So we're going to do something which is 

kind of fast.ai's favorite trick, and has been since we first launched, which is gradually

unfreezing pre-trained networks. So in a sense, it seems a bit funny to initialize all of this

down path randomly, because we already have a model that's perfectly capable of doing something

useful on Tiny Imagenet images, which is this. So yeah, what if we took our Unet, right, and

for model.start, which, to remind you, is the res block right at the front, why don't we

use the actual weights of the pre-trained model? And then for each of the bits in the down

sampling path, why don't we use the actual weights from that as well? And so this is

a useful way to understand how we can copy over weights, which is that any part of an

nn.Module is itself an nn.Module, so it has a state_dict, which is a thing you can then

pass to load_state_dict to put it somewhere else. So this is going to fill in the whole res block

called model.start with the whole res block, which is pmodel[0]. So here's how we can copy

across. Yeah, that's the starting one, and then all the down blocks are going to have the rest of it.
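
As a rough illustration of the weight copying just described, here is a minimal sketch. The names `model.start`, `model.dn` and `pmodel` follow how they are spoken in the lesson, but the tiny `block` helper and layer sizes are assumptions made up for this example, not the real res blocks.

```python
import torch
import torch.nn as nn

def block(ni, nf):
    # Hypothetical stand-in for a res block: same architecture on both sides, so state dicts match.
    return nn.Sequential(nn.Conv2d(ni, nf, 3, padding=1), nn.ReLU())

# Assumption: a toy "pre-trained" classifier body whose first layers mirror the Unet's down path.
pmodel = nn.Sequential(block(3, 8), block(8, 16), block(16, 32))

class TinyUnetDown(nn.Module):
    """Only the start block and down path of a Unet, enough to show the copy."""
    def __init__(self):
        super().__init__()
        self.start = block(3, 8)
        self.dn = nn.ModuleList([block(8, 16), block(16, 32)])

model = TinyUnetDown()

# Any submodule is itself an nn.Module, so it has its own state_dict / load_state_dict.
model.start.load_state_dict(pmodel[0].state_dict())
for i, down in enumerate(model.dn):
    down.load_state_dict(pmodel[i + 1].state_dict())
```

`load_state_dict` copies values into the existing parameters, so the two models do not share tensors afterwards; training one leaves the other untouched.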

This is basically going to copy into our model: rather than having random weights, we're going to

have all the weights from our pre-trained model. And then, since they're good at doing

something (they're not doing super resolution, but they're doing something),

why don't we assume that they're good at doing super resolution? So turn off requires_grad.

And so what that means is, if we now train, it's not going to update any of the

parameters in the down blocks. I guess I should have actually set requires_grad to False for

model.start too, now I think about it. And so this is the classic fine_tune approach from

fastai, the library. We're going to do one epoch of just the upsampling path. And that gets us to a

loss of 255. Now, our loss function hasn't changed, so that's totally comparable. So previously

our first epoch was 385. And in fact, after one epoch with frozen weights

for the down path, we've beaten this. Now, this is in a sense totally

cheating, but in a sense it's totally not. It's totally cheating because the

thing we're trying to do, for the perceptual loss, is to generate intermediate

layer activations which are the same as this. And so we're literally using that

to create intermediate layer activations. So obviously that's going to work. But why

is it okay to be cheating? Well, because that's actually what we want. To be able to do

super resolution, we need something that can recognize that there's an eye here. So we already

have something that knows that there's an eye there. And in fact, interestingly,

this thing trained a lot more quickly than this thing. And it turns out it's better at

super resolution than that thing, even though it wasn't trained to do super resolution. And I think

that's because the signal, which is just "what is this?", is a really simple signal to use.
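
The freeze-then-unfreeze pattern under discussion here can be sketched as follows. It is a toy model under stated assumptions: `TinyNet` and `set_trainable` are hypothetical names, `dn` plays the role of the pre-trained down path and `up` the randomly initialized up path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy stand-in: `dn` is the 'pre-trained' part, `up` is the randomly initialized part."""
    def __init__(self):
        super().__init__()
        self.dn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.up = nn.Conv2d(8, 3, 3, padding=1)

    def forward(self, x):
        return self.up(self.dn(x))

def set_trainable(module, trainable):
    # Hypothetical helper: toggles requires_grad on every parameter of a submodule.
    for p in module.parameters():
        p.requires_grad_(trainable)

model = TinyNet()

# Stage 1: freeze the down path and train only the random weights.
set_trainable(model.dn, False)
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

before = model.dn[0].weight.clone()
x = torch.rand(2, 3, 8, 8)
loss = F.mse_loss(model(x), x)
loss.backward()
opt.step()
frozen_unchanged = torch.equal(before, model.dn[0].weight)

# Stage 2: unfreeze everything; a fresh optimizer over all parameters would be built here.
set_trainable(model.dn, True)
```

Frozen parameters receive no gradient at all, so the first stage only ever moves the up path's weights.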

So yeah, so we do that. And then we can basically go through and set requires_grad equal to True

again. And so the basic idea here is that when you've got a bunch of random weights,

which is the whole up sampling path, and a bunch of pre-trained weights, the down sampling path,

don't start by fine-tuning the whole thing, because at the start it's going to be crap, you

know. So just train the random weights for at least an epoch, and then set everything

to unfrozen. And then we'll do 20 epochs on the whole thing. And so we go from 255 to

249 to 198. So it's improved a lot. So, with using these weights,

the in comparing that to the perceptual loss, the perceptual loss is looking at the. And the data 

sample, they throw the super resolution images, as well as we're incorporating the ways 

that's for the down sampling path. And so that's right. And yet I guess the original. 

Downgraded. But we are just asking them. So if you have zeros in the up sampling path 

that it's going to be the same. So it is very easy for it to get the correct

activations in the up sampling path. Yeah, I mean, then it's kind of weird,

because it goes all the way back to the top, creates the image, and then goes into the

cmodel, the classifier, again. But I think it's going to create basically

the same activations. It's a bit confusing and weird. So yeah, I mean, it's not totally

cheating, but it's certainly an easier problem to solve. Yeah. Okay, so let's

get our results again. So there's our inputs. Yeah, so that's looking pretty impressive.

So the kid has a, yeah, definitely looks pretty reasonable now. How looks pretty 

reasonable. We still don't have eyes for the koala, such is life, but definitely

the background textures look way better. The candy store looks

much better than it did. Medicine looks a lot better than it did. So

yeah, I think it looks great. So then we can get better still. This

is not part of the original Unet, but, you know, making better models is often about

where can we squeeze in more computation, give it opportunities to do things. And there's

nothing in particular that says that this down sampling thing is exactly the right thing you need

here. It's being used for two things: one is this conv, and one is this conv, but those are two different

things. And so it's kind of having to learn to squeeze both purposes into one thing. So I had

this idea, and I'm sure lots of people have had this idea, but whatever. I had this idea, which

is, why don't we put some res blocks in here, which are called cross connections, or cross

convs. So I decided that a cross conv is going to be just a res block followed by a conv. And

so the Unet I just copied and pasted, but now, as well as the downs, I've also got

crosses. And so the crosses are cross convs. So now, rather than just adding the layer, I add

the cross conv applied to the layer. Yeah, I really should have added a cross conv for

this one as well, now I think about it. This is probably the one that wants it the most. Well,

another time. Okay, so now again, we can definitely compare. So this was 198. So everything

else was the same. So I did the same thing, because, you know, the down sampling is the same, so

we can still copy in the state dict and set requires_grad. And it's better: 189. Quite a lot better,

really, because, you know, these are hard-to-get improvements. So let's see if we can notice

anything. Hey, look, it's got an eye, just. Yeah. So how about that? At this point, it's

almost quite difficult to see whether it's an improvement or not, but I think there's a bit

of an eye on the koala. I think it's encouraging. Yeah. So that's our super res. Oh, man.

The bad news is we're out of time. Okay. We didn't promise to do diffusion 

Unet this lesson. We built a Unet. We built a Unet. We did. And we did

super resolution with it, and it looks pretty good. So, I've got to admit, I haven't thought about

exercises for people to do. What would be useful things for people to try?

Like, maybe they could create a Unet. They could learn about segmentation, create

a Unet for segmentation, or... Oh, you know, there are a couple of lines where you... I was just

going to say, there were a couple of places where we said, oh, I should have tried this, or tried

that. I think that's how we see. Yeah, basically. I think that's obviously a good next step. I'm

going to say style transfer is a good idea to do, I think, with a Unet. So with style transfer, you

can actually set up a loss function so that you can create a Unet that learns to create images 

that look like Van Gogh. You know, for example. It's a totally different approach. It's a tricky 

one. I think when I was playing with that, it almost helped to not have the skip

connections at the highest resolutions. Otherwise, it just really wants to copy the 

input and modify it slightly. Interesting. Maybe doing. Which one would be better there 

too. Oh, yes, that's a good point. Yeah. Cool. Well, we'll put some stuff up on the 

website about, yeah, you know, ideas. I'm sure some students, hopefully by the time you

watch this, will have some ideas on the forum of things they've tried, too. Yeah, right. Yeah,

colorization is nice, because for colorization the transform is just to grayscale and

back. Oh, yes. And then that's, yeah, that's really... Actually, okay. So there's all kinds of

decrappification you could do, isn't there? If you want to keep it a bit more simple, yes, rather

than doing these two lines of code, you could, yeah, just turn it into black and white. It's a

great point. Or you could delete the center every time, you know, to create something that

learns how to fill in. Or if you delete the left-hand side, that would lead to something

that you can give a photo and it'll invent a little bit more to the left. Yeah, and then

you could keep running it to generate a panorama. Another one you could do would be to, in memory or something, save it as a really

highly compressed JPEG. And so then it would be something that would learn to

remove JPEG artifacts, so then for your old photos that you saved with crappy JPEG

compression, you could bring them back to life. You can probably do, yeah, you can do, I

guess, drawing to painting or something like this, by taking some paintings, passing them

through some edge detection, and using that as your starting point. Sounds interesting. Oh,

what about watermark removal? You could use PIL or whatever to draw watermarks, text, whatever

over the top, which is quite useful for, you know, radiology images and stuff, which sometimes

have personally identifiable information written on them, and you can just learn

to delete it. Yeah, okay, so lots of things people can do. That's awesome. Thanks for

your ideas. Basically any image to image task. Super. All right. Or just make the super res 

better. Try it on full Imagenet, if you like, if you've got lots of hard drive space. Thanks, Johno.

Thanks, Tanishq. See you next time. Thank you.
