Lesson 20: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

Hi and welcome to Lesson 20. In the last lesson 

we were about to learn about implementing mixed precision training. Let's dive into it. Um, so to, and I'm going to fiddle with other 

things just because I want to really experiment. I just love fiddling around. So one thing I 

wanted to do is I wanted to get rid of the DDPMCB entirely. We made it pretty small 

here, but I wanted to remove it. So much as Tanishq said, isn't it great the callbacks 

make everything so cool. I wanted to show we can actually make things so cool without 

callbacks at all. And so to do that, I realized what we could do is we could put noisify inside 

a collation function. So the collation function, if you remember back to our datasets notebook, 

which was going back to notebook five, and you've probably forgotten that by now. So 

go and reread that to remind yourself it's the function that runs to take all of the, 

you know, you've basically each kind of row of data, you know, will be a separate tuple, 

but then it hits the collation function and the collation function turns that into tensors, 

you know, one tensor representing the independent variable, one tensor representing the dependent 

variable, something like that. And the default collation function is called not surprisingly 

default_collate. So if our collation function calls that on our batch and then grabs the 

X part, which is, it's always been the same for the last few things. That's the image cause 

we use, cause data sets uses dictionaries. So we're going to grab the image. Then we 

can call noisify on that collated batch. Then that's exactly the same thing as Tanishq's 

before_batch did, right? Because before_batch is operating on the thing that came out of 

the default_collate function. So we could just do it in a collate function. So if we 

do it here and then we create a DDPM data loader function, which just creates a 

DataLoader from some data set that we pass in with some batch size, with that collation 

function, then we can create our dls not using DataLoaders dot from what's it called? 

DataLoaders.from_dd. But instead of that the, the, the original, you know, the plain init that 

we created for DataLoaders, and again, you should go back and remind yourself of this. 

You just pass in the data loaders for training and test. So there's our two data loaders. 
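
The collation idea described above can be sketched roughly as follows. This is not the notebook's exact code: it assumes a linear beta schedule with 1,000 steps, an "image" dictionary key, and a tiny in-memory dataset, and the names collate_ddpm, dl_ddpm, and toy_ds are illustrative.

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate

# Assumed noise schedule: linear betas over 1,000 steps.
n_steps = 1000
beta = torch.linspace(0.0001, 0.02, n_steps)
alphabar = (1.0 - beta).cumprod(dim=0)

def noisify(x0, alphabar):
    """Return ((noised images, timesteps), noise) for a batch of clean images."""
    n = len(x0)
    t = torch.randint(0, n_steps, (n,), dtype=torch.long)
    eps = torch.randn(x0.shape)
    abar_t = alphabar[t].reshape(-1, 1, 1, 1)
    xt = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    return (xt, t), eps

def collate_ddpm(rows):
    # Collate the per-row dicts into tensors first, then noisify the image batch.
    return noisify(default_collate(rows)["image"], alphabar)

def dl_ddpm(ds, bs=4):
    return DataLoader(ds, batch_size=bs, collate_fn=collate_ddpm)

# Toy stand-in for a dictionary-style dataset (hypothetical data).
toy_ds = [{"image": torch.rand(1, 32, 32), "label": i % 10} for i in range(8)]
(xt, t), eps = next(iter(dl_ddpm(toy_ds)))
print(xt.shape, t.shape, eps.shape)
```

Because the noise is added inside the data loader, the training loop only ever sees batches that are already in the ((xt, t), eps) form, which is why no callback is needed for that step.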

So with that, we don't need a DDPM callback anymore. All right. So now that we've, you know, 

and again, this isn't, this is not required for mixed precision. This is just cause I 

wanted to experiment and flex our muscles a little bit of trying things out. So here's 

our MixedPrecision callback, and this is a training callback. And basically if 

you Google for PyTorch mixed precision, you'll

see that the docs show the typical mixed precision 

basically says: with torch.autocast(device_type='cuda', dtype=torch.float16), get your predictions

and call your loss. So again, remind yourself if you've forgotten that this is called a 

context manager, and context managers, when they start, call something called dunder

enter. And when they finish, they call something called dunder exit. So we could therefore 

put the torch dot autocast into an attribute and call dunder enter before the batch begins.

And then after we've calculated the loss, we want to finish that context manager. So 

after_loss, we call autocast dunder exit. And so I had to add this. So 

you'll find now in the 09_learner, that there's a section

called updated version since the lesson where I've added an after_predict and after_loss

and after_backward and an after_step. And that means that a callback can now insert

code at any point of the training loop. And so we haven't used all of those different

things here, but we certainly do want, yeah, 

after_loss, we need to be able to do that. And then yeah, this is just code that has 

to be run according to the PyTorch docs. So instead of calling loss dot backwards, you 

have to call scaler.scale(loss).backward(). So we replace our backward in the train 

callback with something called scaler.scale(loss).backward(). And then it says that finally,

when you do the step, you don't call optimizer dot step. You call scaler.step(optimizer); 

scaler.update(). So we've replaced step() with scaler.step(); scaler.update(). So 

that does all of the things in here. And the nice thing is now that this exists, we don't 

have to think about any of that. We can add mixed precision to anything which is really 

nice. And so we now, as you'll see, cbs no longer has a DDPMCB. But we do have the 

MixedPrecision and that's a train callback. So we just need a normal Learner, not a 

TrainLearner. And we initialize our DDPM. Now to get benefit from mixed precision, you need 

to do quite a bit at a time. You know, your GPU needs to be busy and on something as small 

as Fashion-MNIST, it's not easy to keep a GPU busy. So that's why I've increased the 

batch size by four times. Now that means that each epoch, it's going to have four times 

less batches because they're bigger. And that means it's got four times less opportunities 

to update. And that's going to be a problem because if I want to have as good a result 

as Tanishq had, and as I've had here in less time, that's the whole purpose of this, is 

to do it in less time. Then I'm going to need to, you know, increase the learning rate and 

maybe also increase the epoch. So increase the epochs up to 8 from 5 and I increase 

the learning rate up to 1e-2. And yeah, I've found I could train it fine with 

that once I used the proper initialization and most importantly use the optimization 

function that has Epsilon of 1e-5. And so this trains, even though it's doing 

more epochs, this trains about twice as fast and gets the same result. Does that make sense so far?
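
The mixed precision steps described above (enter autocast before the batch, exit it after the loss, scale the loss before backward, and step through the scaler) can be sketched as a single plain training step. This is a minimal sketch rather than the MixedPrecision callback itself: the tiny linear model and random data are made up, and on a machine without CUDA both autocast and the scaler are disabled so the code still runs.

```python
import torch
from torch import nn

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

torch.manual_seed(0)
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler as used at the time of the lesson; a no-op when enabled=False.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

xb = torch.randn(16, 8, device=device)
yb = torch.randn(16, 1, device=device)

# "before_batch": create the context manager and enter it by hand.
autocast = torch.autocast(
    device_type=device,
    dtype=torch.float16 if use_cuda else torch.bfloat16,
    enabled=use_cuda,
)
autocast.__enter__()
preds = model(xb)
loss = (preds.float() - yb).pow(2).mean()
# "after_loss": leave the context manager once the loss is computed.
autocast.__exit__(None, None, None)

opt.zero_grad()
scaler.scale(loss).backward()  # replaces loss.backward()
scaler.step(opt)               # replaces opt.step()
scaler.update()
print(loss.item())
```

Calling __enter__ and __exit__ directly is what lets the two halves of the with block live in two different callback methods.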

TANISHQ: Yeah, it was great. JEREMY: Cool. Now the good news is actually we 

don't even need to write all this because there's a nice library from HuggingFace originally 

created by Sylvain who used to work with me at fastai and went to HuggingFace and kept 

on doing awesome work. And he started this project called Accelerate, which he now works

on with another fastai alum named Zach Mueller. And accelerate is a library that provides this 

single Accelerator that does things to accelerate your training loops. And one of the things 

it does is mixed precision training. And it basically handles these things for you. It 

also lets you train on multiple GPU's. It also lets you train on TPUs. So by adding a 

TrainCB subclass that will allow us to use Accelerate, that means we can now hopefully 

use TPUs and multi GPU training and all that kind of thing. So the Accelerate docs show 

that what you have to do to use Accelerate is to create an Accelerator, tell it what 

kind of mixed precision you want to use. So we're going to use 16 bit floating point fp16. And then you have to basically 

call Accelerator.prepare and you pass in your model, your optimizer, 

and your training and validation data loaders. And it returns you back a model and optimizer 

and training and validation data loaders, that they've been wrapped up in Accelerate, 

and Accelerate is going to now do all the things we saw you have to do, automatically. 

And that's why that's almost all the code we need. The only other thing we need is, it 

didn't, we didn't like tell it how to like change our loss function to use Accelerate. 

So we actually have to change backward. That's why we inherit from TrainCB. We have to 

change backward to not call loss.backward, but self dot accelerate dot backward

and pass in loss. Okay. And then I had another idea of 

something I wanted to do, which is, I liked the idea that noisify, I've copied 

noisify here, but rather than returning a tuple of tuples, I just returned a tuple with 

three things. I think this is neater to me. I would like to just have three things in 

the tuple. I don't want to have to modify my model. I don't want to have to modify my 

training callback. I don't want to do anything tricky. I don't even want to have a custom 

collation function‒ Sorry, I want to have a custom collation function, but I want to 

have it‒ I don't want to have a modified model. So I'm going to go back to using a UNet2DModel. 

So how can we use a UNet2DModel when we've now got three things and 

what I did in my modified Learner, just underneath it. Sorry, actually what I did 

was I modified TrainCB to add one parameter, which is number of inputs. And so this tells 

you how many inputs are there to the model. And normally you would expect one input, but 

our model has two inputs. So here we say… Okay, so, AccelerateCB is a TrainCB. So 

we, so when we call it, we say, we're going to have two inputs. And so what that's going 

to do is it's just going to remember how many you asked for. And so when you call predict, 

it's not going to pass learn.batch[0]. It's going to pass *learn.batch[:self.n_inputs].
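
The slicing can be sketched with plain functions; the same split is then reused for the loss, which receives everything after the inputs. The functions predict and get_loss and the toy lambdas are illustrative stand-ins, not the TrainCB methods themselves.

```python
def predict(model, batch, n_inputs=1):
    # The first n_inputs items of the batch go to the model.
    return model(*batch[:n_inputs])

def get_loss(loss_func, preds, batch, n_inputs=1):
    # The loss gets the predictions, then every remaining item of the batch.
    return loss_func(preds, *batch[n_inputs:])

# Toy stand-ins: two inputs (think "noised image" and "timestep"), one target.
batch = (2.0, 3.0, 10.0)
model = lambda xt, t: xt * t
loss_func = lambda pred, target: (pred - target) ** 2

preds = predict(model, batch, n_inputs=2)
loss = get_loss(loss_func, preds, batch, n_inputs=2)
print(preds, loss)
```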

And ditto when you call the loss function, it's going to be the rest. So it's *learn.batch[self.n_inputs:]

onwards. So this way you can have one, two, three, four, five inputs; one, two, 

three, four, five outputs, whatever you like. And it's just up to you then to make sure 

that your model and your loss function take the number of parameters. So the loss function 

is going to first of all, take your preds and then yeah, however many non-inputs you 

have. So that way, yeah, we now don't need to replace anything except that we did need 

to do the thing to make sure that we get the dot sample out. So I just had a little, this 

is the whole DDPMCB callback now, DDPMCB2. So after the predictions are done, replace 

them with the dot sample, right? So that's nice and easy, you know? So we ended up with 

quite a bit of pieces, but they're all very decoupled, you know? So with miniai, you know, 

with, and with AccelerateCB and whatever, because we should export these actually into 

a nice module. Well, if you had all those, then yeah, you wouldn't have AccelerateCB. 

The only thing you would need would be the, would be the noisify and the collation function 

and this tiny callback. And then yeah, here's our learner and fit and we get the same result 

as usual. And this takes basically an identical amount of time because at this point I'm not 

using multi GPU or TPU or whatever, I'm just using mixed precision. So this is just a shortcut 

for this. So it's not a huge shortcut. The main purpose of it really is to allow us to 

use other types of accelerators or multiple accelerators or whatever. So we'll look at 

those later. Does that make sense so far? TANISHQ: Yeah. Oh yeah. Accelerate is 

really powerful and pretty amazing. JEREMY: Yeah, it is. And I know like a lot of, 

yeah, like I know Kat Crowson uses it in all her k-diffusion code, for example. Yeah. It's 

used a lot out there in the real world. I've got one more thing I just want to mention 

briefly, just this sneaky trick. I haven't even bothered training anything with it cause 

it's just a sneaky trick. But sometimes thinking about speed, loading the data is the slow 

bit. And so particularly if you use Kaggle, for example, on Kaggle, you get two 

GPUs, which is amazing, but you only get two CPUs, which is crazy. So it's really hard 

to like take advantage of them because the amount of time it takes to like open a PNG or 

a JPEG, you know, your GPU is sitting around waiting for you. So there's a, if your, you 

know, if your data loading and transformation process is slow and it's difficult to keep 

your GPUs busy, there's a trick you can do, which is you could create a new data loader 

class, which wraps your existing data loader and replaces dunder iter. Now dunder iter 

is the thing that gets called when you use a for loop, right? Or when you use next data, 

it calls this. And when you call this, you just go through the data loader as per usual. 

So that's what dunder iter would normally do. But then you also go through i from 0 to, 

by default, 2, and then you spit out the batch. And what this is going to do is it's 

going to go through the data loader and spit out the batch twice. Why is that interesting? 

Because it means every epoch is going to be twice as long. But it's going to only load 

and augment the data as often as one epoch, but it's going to give you two epochs worth 

of updates. And basically there's no reason to have a whole new batch every time. You 

know, looking at the same batch two or three or four times in a row is totally fine. And

what happens in practice is you look at that batch, you do an update, get to a new part of

the weight space, look at exactly the same batch and find out now where to go in the, in, in 

the weight space. It's still, yeah, basically equally useful. So I just wanted to add this 

little sneaky trick here, particularly because if we start doing more stuff on Kaggle, we'll 

probably want to surprise all the Kagglers with how fast our miniai solutions are. And 

they'll be like, how is that possible? We'll be like, Oh, we're using our, you know, two 

GPUs, extra Accelerate. I was thinking about how do we use the two GPUs and like, Oh, and 

we're, you know, using, you know, getting out, you know, loading, flying through using 

MultDL. I think that'd be pretty sweet. So that's

that. JOHNO: Nice.
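
The data loader wrapper Jeremy describes can be sketched in a few lines. The class name MultDL is the one mentioned above; the mult argument name and the list used in place of a real data loader are assumptions for illustration.

```python
class MultDL:
    """Wrap any iterable data loader and yield each batch `mult` times in a row."""
    def __init__(self, dl, mult=2):
        self.dl, self.mult = dl, mult

    def __len__(self):
        return len(self.dl) * self.mult

    def __iter__(self):
        for batch in self.dl:          # load (and augment) each batch once...
            for _ in range(self.mult):
                yield batch            # ...but hand it to the training loop several times

# A list stands in for a real DataLoader here.
batches = list(MultDL(["batch_a", "batch_b", "batch_c"], mult=2))
print(batches)
```

Each pass over the wrapper costs one pass of data loading but produces mult times as many optimizer steps.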

TANISHQ: Yeah. It's great to see the various different 

ways that we can use miniai to, to do the same thing, I guess. Um, or, you know, however 

you feel like doing it or whatever works best for you.

JEREMY: Yeah. I'd be curious to see if other people 

find other ways to, you know, I'm sure there's so many different ways to, to handle this 

problem. I think it's an interesting, interesting problem to solve. And I think, for homework, 

it'd be useful for people to, um, yeah. Run, run some of their own experiments, maybe either 

use these techniques on other data sets or see if you can come up with other variants 

of these approaches, um, or come up with some different noise schedules, um, to try, um, 

it would all be useful. Any other thoughts of exercises people could try?

TANISHQ: Yeah. I mean, JOHNO: Getting away with 

less than a thousand steps. JEREMY: Yeah. Less than a thousand steps.

JOHNO: Happening in the final 200. So why not just train with only 200 steps?

JEREMY: Yeah. Less steps would be good. Yeah. Because the sampling is 

actually pretty slow. So that's a good point.

TANISHQ: Yeah. Yeah. Yeah. I was going to say something similar 

in terms of like, yeah, many, I guess work with less number of steps, you know, 

you would have to adjust the noise schedule appropriately and you have to, I guess there's 

maybe a little bit more thought into some of these things. Um, or, you know, another, 

uh, aspect is like, um, when you're selecting the time step during training right now, we 

selected randomly kind of uniformly each time step has equal probability of being selected. 

Maybe different probabilities are better. And some papers do analyze that more carefully. 

So that's another thing to play around with as well.

JEREMY: That's almost kind of like, if I guess there are almost two 

ways of doing the same thing in a sense, right? If you change that mapping 

from T to beta, then you could reduce T and have different betas would kind of give you a 

similar result as changing the probabilities of the T's, I think.

TANISHQ: Yeah. I think, I think there's definitely, they're 

both, they're kind of similar, but potentially something complementary happening 

there as well. Uh, and I think those could be some interesting experiments to study that. 

And also the, sort of, noise levels that you do choose affect the sort of behavior of, of 

the sampling process. And of course, what, what features you focus on. Um, and so maybe 

as, as people play around with that, maybe they'll start to, to notice how using, yeah, 

different noise levels or, you know, this different noise schedules affect maybe some 

of the features that you see in the, in the final image. Um, and that could be something 

very interesting to study as well. JEREMY: Great. Well, let me also say it's been 

really fun. I'm doing something a bit different, which is doing a lesson with you guys rather 

than all on my lonesome. I, uh, I hope we can do this again because I've 

really enjoyed it. TANISHQ: Yeah. JEREMY: So of course now we're, you know,

strictly speaking in the recording, we'll, we'll next up see Johno, who's actually already

recorded his, thanks to the Zoom mess-up, but stick around. Uh, so I've already seen it.

Johno’s thing is amazing. So you definitely don't want to miss that.

JOHNO: Hello everyone. Um, so today, uh, depending on the 

order that this ends up happening, you've probably seen Tanishq’s DDPM implementation 

where we're taking the default training callback and doing some more interesting things 

with preparing the data for the learner or interpreting the results. Um, so in these 

two notebooks that I'm going to show, we're going to be doing something similar, just 

exploring like what else can we do besides just the classic kind of classification model 

where we have some inputs and a label and what else can we do with this miniai setup that we 

have. And so in the first one, we can approach a kind of classic, um, AI art, uh, approach 

called Style Transfer. And so the idea here is that we're going to want to somehow create 

an artistic combination of two images where we have the structure and, um, layout of one 

image and the style of another. So we'll look at how we do that. And then we'll also talk 

along the way in terms of like, why is this actually useful beyond just making pretty 

pictures. So to start with, I've got a couple of URLs for images. You're welcome to go and slip 

in your own as well. And definitely recommend trying this notebook with some different ones 

just to see what effects you can get. Um, and we're going to download the image and, 

um, load it up as a tensor. So we have here a three channel image, 256 by 256 pixels. 

And so this is the kind of base image that we're going to start working with. So before 

we talk about styles or anything, um, let's just think what is our goal here, right? We'd 

like to, we'd like to do some sort of training or optimization. We'd like to get to a point 

where we can match some aspect of this image. And so maybe a good place to start is to just 

try and do, well, can we start from a random image and optimize it until it matches pixel 

for pixel? Exactly. And that's going to help us get this. Yeah.

JEREMY: I think what might be helpful is if you type style 

transfer deep learning into Google images, you could maybe show some examples so 

that people will see what the goal is. JOHNO: Yeah, that's a very good point. Um, so 

let's see, this is a good one here. We've got the Mona Lisa as our, our base, but we've managed 

to apply somehow some different artistic styles to that same base structure. So we have the Great 

Wave off Kanagawa, we have The Starry Night by Vincent van Gogh, this is some sort of 

Kandinsky or something. Um, yeah, so this is our end goal to be able to take the overall 

structure and layout of one image and the style from some different reference image.

JEREMY: And in fact, this was the first, um, ever, I think fastai generative modeling lesson

looked at style transfer. It's, um, it's been around for a few years. It's kind of a classic

technique and it's really, um, I think a lot of the students when we first did it, found

it extremely useful, a way of better understanding, like, 

you know, flexing their deep learning muscles, understanding what's going on and also 

created some really interesting new approaches. So hopefully we'll see the same thing again. 

Maybe some students will be able to show some really interesting, um, um, results from this.

JOHNO: Yeah. And I mean, today we're going to focus on kind of the classic approach. Um, but I know one of the previous students from fastai 

did a whole different way of doing that style loss and that we'll maybe post in the forums 

or, you know, I've got some comparisons that you can look at. Um, so yeah, definitely a 

fruitful field still. And I think after the initial hype of like, everyone was excited 

about style transfer apps and things, I don't know, five years ago, um, I feel like there's 

still some things to explore there. Very creative and fun little diversion 

in the deep learning world. Um, okay. So our first step in getting to 

that point is being able to optimize an image. And so up until now we've been optimizing 

like the weights of a neural network. Um, but now you want to go to something a bit 

more simple and we just want to optimize the raw pixels of an image.

JEREMY: Do you mind if we scroll up a bit to the previous code just 

so we can have a look at it? So there's a couple of interesting points about 

this code here is, you know, we're not, we're not cheating. Um, well not really. So we're, 

so yeah, we, we've seen how to download things from the network before. So we're using 

fastcore urlread, cause we're allowed to. And then I think we decided we weren't going to 

write our own JPEG parser. So torchvision actually has a pretty good one, which a lot 

of people don't realize exists. And a lot of people tend to use PIL, but actually 

torchvision has a more performant option. Um, and it's actually quite difficult to find any 

examples of how to use it like this. Um, but here's some code you can borrow.

JOHNO: Yeah. And if you Google “load image from url in pytorch”, all of the examples are

going to use PIL. And that's what I've done historically, um, is use the requests library

to download the URL and then feed that into PIL’s Image.open function. Um,

so yeah, that was fun when I was working with Jeremy on this 

notebook, like that's how I was doing it. It's wait, we're breaking the rules. Uh, 

let's see if we can do this directly into a tensor without this intermediate step of 

loading it with, with Pillow. Um, cool. Okay. So how are we going to do this image optimization? 

Um, well, first thing is we don't really have a dataset of lots of training examples. We 

just have a single like target and a single thing we're optimizing. Um, and so we've built 

this LengthDataset here, which is just going to follow the PyTorch like dataset standard. 

We can tell it how to get a particular item and what our length is. Um, but in this 

case, we're just always going to return 0, 0, and we're not actually going to care about 

the results from this dataset. We just want something that we can pass to the learner 

to do some number of training iterations. So we create like a fake dummy dataset with 

a hundred items. Um, and then we create a data loaders from that, and that's going to 

give us a way to train for some number of steps without really caring about what this 

data is. Um, so does that make sense? JEREMY: Yeah. So just to clarify the reason 

we're doing this. So basically the idea is we're going to, um, start with 

that photo you downloaded. Um, and I guess you're going to be downloading

another photo. So that photo is going to be like the content. We're going to try to make

it continue to look like that lady. And then we're going to try to change the style so

that the style looks like the style of some other picture. And the way we're going to

be doing that is by doing an optimization loop with like SGD or whatever. But

so the idea is that each step of that, we're going to be moving the style somehow of the

image closer and closer to one of those images you downloaded. So it's not that we're going

to be looping through lots of different images, but we're just going to be looping

through steps of an optimization loop. Is that the idea?
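
The dummy dataset Johno describes can be sketched like this. The class name LengthDataset comes from the discussion above; get_dummy_dl is a simplified, hypothetical helper that builds a single PyTorch DataLoader rather than the pair of loaders a learner would normally take.

```python
from torch.utils.data import DataLoader

class LengthDataset:
    """A dataset whose items are meaningless; only its length matters."""
    def __init__(self, length=1):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return 0, 0   # placeholder input and target, never actually used

def get_dummy_dl(length=100):
    return DataLoader(LengthDataset(length), batch_size=1)

dl = get_dummy_dl(100)
n_iterations = sum(1 for _ in dl)   # one optimization step per dummy batch
print(len(dl), n_iterations)
```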

JOHNO: Exactly. Um, and so yeah, we can, we can create 

this data loader. And then in terms of the actual like model that we're optimizing and 

passing to the learner, we've created this TensorModel class, which just has whatever 

tensor we pass in as its parameter. So there's no actual neural network necessarily. We're 

just going to pass in a random image or, or some image shaped thing, a set of 

numbers that we can then optimize. JEREMY: So just in case people have forgotten 

that. So to remind people, when you put something in an nn.Parameter, it doesn't change

it in any way. It's just a normal tensor, but it's stored inside the module as being

a tensor to optimize. So what you're doing here, Johno, I guess, is

to say, I'm not actually optimizing a model at all. I'm optimizing an image, 

the pixels of an image directly. JOHNO: Exactly. Um, and because it's in a 

parameter, if we look at our model, we can see that, for example, model.t… it does 

require grad, right? Because that's already set up because this nn.Module is going to 

look for any parameters. And if our optimizer is looking at, um, let's look 

at the shape of the parameters. Um, so this is the shape

of the parameters that we're optimizing. This is just that tensor that we passed in, the

same shape as our image. And this is what's going to be optimized if we pass this into

any sort of learner fit method. JEREMY: Okay. So this model does have a thing being

passed to forward, which is x, which we're ignoring. And I guess that's just because

our learner passes something in. So we're making life a bit easier for ourselves by

making the model look the way our learner expects. JOHNO: Yeah. JEREMY: And 

we could have used anything like TrainCB or something if we wanted to, but this 

seems like a nice, nice and easy way to do it. JOHNO: Yeah. So, I mean, 

this is the way I've done it. Um, if you do want to use TrainCB, you can 

set it up, um, with a custom predict method that is just going to call the model forward 

method with no parameters. Um, and if you want likewise, just calling the loss function on 

just the predictions. Um, but if you want to skip this, because we take this argument x 

equals zero and never use it, um, that should also work without this callback. So either 

way is fine. Um, this is a nice approach if you have something that you're using an existing 

model, which expects some number of parameters or something. Um, yeah, you can just modify 

that training callback, but we almost don't need to in this case. Um, okay. So let's see, 

let's put this in a learner. Let's optimize it with some loss function. JEREMY: Oh, 

just to clarify, I get it. So the loss, the get_loss you had to change 

because normally we pass a target to the loss function.

JOHNO: Yeah. We, we, yeah, this is learn.preds. I mean learn.batch... 

JEREMY: And again, we could, we could avoid, we could remove that as well if we wanted to, 

by having our loss function, take a target that we then ignore. JOHNO: Yeah, yeah, 

exactly. JEREMY: Cool. JOHNO: Um, so both, both other approaches, I like this because it's, 

we can kind of be building on this idea of modifying the training callback in the DDPM 

example and the other examples. Um, but in this case, it's just these like two lines 

change. This is how we get a model predictions. We just call the forward method, which returns 

this image that we're optimizing and we're going to evaluate this according to some loss 

function that just takes in an image. And so for our first loss function, we're just 

going to use the mean squared error between the image that we are generating, like this 

output of our model and that content image. That's our like target, right? So we're going 

to set up our model, start it out with a random image like this above. We're going to create 

a learner, um, with a dummy data loader for a hundred steps. Our loss function is going 

to be this mean squared error loss function, a set learning rate and an optimizer 

function. The default would probably also work. Um, and if we run this, something's 

going to happen, our loss is going to go from a non-zero number to close to zero. Um, and 

we can look at the final result. Like if we call learn.model and show that as an image 

versus the actual image, we'll see that they look pretty much identical. JEREMY: Yeah. 

So just to clarify, this was like a, a pointless example, but what we did, we started with 

that noisy image you showed above, and then we used SGD to make those 

pixels get closer and closer to the lady in the sunglasses.

Not, you know, not for any particular purpose, but just to show that we can turn noisy pixels

into something else by having it follow a loss function. And this loss function was

just like: make the pixels look as much as possible, like the lady in the sunglasses.
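
The whole pixel-optimization exercise can be sketched without a learner, using a plain loop. This is a simplified stand-in for what the notebook does: a small random tensor replaces the downloaded photo, and the choice of Adam with a learning rate of 0.1 for 100 steps is an assumption made so the toy example converges quickly.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TensorModel(nn.Module):
    """A 'model' whose only parameter is the image being optimized."""
    def __init__(self, t):
        super().__init__()
        self.t = nn.Parameter(t.clone())

    def forward(self, x=0):
        return self.t   # the input is ignored; the output is the image itself

torch.manual_seed(0)
content_im = torch.rand(3, 16, 16)               # stand-in for the content photo
model = TensorModel(torch.rand_like(content_im)) # start from random noise
print(model.t.requires_grad, [p.shape for p in model.parameters()])

opt = torch.optim.Adam(model.parameters(), lr=0.1)
losses = []
for step in range(100):
    loss = F.mse_loss(model(), content_im)  # pixel-wise distance to the target
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```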

JOHNO: Exactly. And so in this case, it's a very simple loss. There's like a one direction

that you update. So it's, it's almost trivial to solve, but it still helps us get like the

framework in place. Um, but just seeing this final result is not very instructive because

you almost think, well, did I get a bug in my code that I just duplicated the image?

How do I know this is actually doing what we expect? And so before we even move on to

any more complicated loss functions, um, I thought it was important to have some sort

of more obvious way of seeing progress. Um, so I've created a little, um, logging callback

here that is just after every batch, um, it's going to store the output as an image.

And then… JEREMY: I guess after every 10 batches here by default.

JOHNO: Oh yes. Yeah. Sorry. So we can set how often it's going to update and then every

10 iterations or 50 iterations, whatever we set the log_every argument to, it's going

to store that in the list. Um, and then after the training is done, after fit, we're just

going to show those images. Um, and so everything else, the 

same as before, the passing in this extra logging callback, it's going to give 

us the kind of progress. And so now you can see, okay, there is actually something happening. 
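
A standalone sketch of the logging idea, assuming a plain loop instead of the miniai callback system. The ImageLog class and its after_batch method are illustrative names; only the log_every argument is taken from the discussion.

```python
import torch

class ImageLog:
    """Keep a clipped copy of the current image every `log_every` steps."""
    def __init__(self, log_every=10):
        self.log_every, self.images, self.i = log_every, [], 0

    def after_batch(self, preds):
        if self.i % self.log_every == 0:
            self.images.append(preds.detach().clip(0, 1).clone())
        self.i += 1

torch.manual_seed(0)
target = torch.rand(3, 16, 16)
image = torch.rand(3, 16, 16, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.1)
log = ImageLog(log_every=10)

for step in range(100):
    loss = (image - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    log.after_batch(image)

# Distance of each stored snapshot from the target; it should shrink over time.
errors = [(im - target).pow(2).mean().item() for im in log.images]
print(len(log.images), errors[0], errors[-1])
```

In a notebook the stored snapshots would be shown as a grid of images after fitting, rather than summarized as numbers.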

We're starting from this noise; after a few iterations, already most of it is gone. And by the end 

of this process, it looks exactly like the content image. JEREMY: So I really like this 

because what you've basically done here is you've now already got all the calling and infrastructure 

in place you need to, basically, create a really wide variety of interesting outputs that could 

either be artistic or, um, or like, you know, they could be more like image reconstruction, 

super resolution, colorization, whatever. And you just have to modify the loss function 

and you, you know, and I really like the way you've created the absolute easiest possible 

first and fully checked it. And before you start doing the fancy stuff and now you kind 

of, I guess you're really comfortable doing the fancy stuff because you 

know, that's all in place. JOHNO: Yeah, exactly. And we know that we're 

going to see some tracking. So hopefully it'll be like visually obvious if things are going 

wrong and we know exactly what we need to modify. If we can now express some desired 

property, it's more interesting than just, like, mean squared error to a target image. 

Then we quickly have everything in place to optimize. And so this is now really fun to 

like, okay, let's think about what other loss functions we could do. Maybe we want it to 

match the image, but also have a particular overall color. Maybe we want some, some more 

complicated thing. And so towards that, like towards starting to get a richer, like 

measure of what this output image looks like. We're going to talk about extracting features 

from a pre-trained network. And this is kind of like the core idea of this notebook is 

that we have these big convolutional neural networks. This one is a much older 

architecture and so relatively simple compared to some

of the big, you know, DenseNets and so on used today.

JEREMY: It's actually a lot like our pre ResNet Fashion-MNIST model 

is basically almost the same as VGG16.

JOHNO: Yeah, yeah, exactly. And so we're feeding in an image and then we have these like convolutional

layers, downsampling, convolution, you know, downsampling with max pooling up until some

final prediction. JEREMY: Oh, can I just point something out? 

There's one big difference here, which is that 7 by 7 by 512, if you can point at that. 

Normally nowadays and in our models, we tried, you know, using an adaptive or 

global pooling to get down to a 1 by 1 by 512, VGG16 does something which is very unusual by 

today's standards, which is it just flattens that out into a 1 by 1 by 4,096. Which 

actually might be a really interesting feature of VGG and I've always felt like people 

might want to consider training, you know, ResNets and stuff without the global 

pooling and instead do the flattening. The reason we don't do the flattening nowadays 

is that that very last linear layer that goes from 1 by 1 by 4,096 to 1 by 1 by 

1,000 ‒because this is an ImageNet model‒ is going to need an awfully big 

weight matrix. You've got a 4,096 x 1,000 weight matrix as a result 

of which this is actually horrifically memory intensive

for a reasonably poor performing model by modern standards. But yeah, I think that doing

that actually also has some, some benefits potentially as well.
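
As a rough check of the sizes being discussed, here is a small arithmetic sketch. It assumes the classifier layout used by torchvision's VGG16 (25088 to 4096 to 4096 to 1000), which is not stated in the lesson; under that assumption most of the head's weights sit in the first linear layer after the flatten, and a global-pooling head is far smaller.

```python
# Shape of the last VGG16 feature map, as mentioned above.
h, w, c = 7, 7, 512
n_classes = 1000  # ImageNet

# Flattening keeps every spatial position of every channel.
flat_features = h * w * c

# Assumption: torchvision's VGG16 classifier, 25088 -> 4096 -> 4096 -> 1000
# (weights only, biases ignored).
vgg_head_weights = flat_features * 4096 + 4096 * 4096 + 4096 * n_classes

# With global pooling, only one value per channel reaches the final linear layer.
pooled_head_weights = c * n_classes

print(f"flattened features:      {flat_features:,}")
print(f"VGG-style head weights:  {vgg_head_weights:,}")
print(f"pooled head weights:     {pooled_head_weights:,}")
```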

JOHNO: Yeah. And in this case, we are not even really 

interested in the classification side. We're more excited about the capacity of this to 

extract different features. And so, the idea here, and maybe I should pull up this classic 

article looking at like, what do neural networks learn and trying to visualize some of these 

features. This is something we've mentioned before with these big pre-trained networks 

is that the early layers tend to pick up on very simple features, edges and shapes and 

textures. And those get mixed together into more complicated textures. And by the way, 

this is just trying to visualize what kind of input maximally activates a particular 

output on each of these layers. And so it's a great way to see what kinds of things that's 

learning. And so you can see as we move deeper and deeper into the network, we're getting more 

and more complicated like hierarchical features. JEREMY: We should mention, so we've looked at 

the Zeiler and Fergus paper before that, which is an earlier version doing something like this 

to see what kind of features were available. So we've linked to this Distill paper from 

the forum and the course lesson page, because it's actually a more modern and fancy 

version kind of, of the same thing. JOHNO: Yeah. Also note the 

names here. All of these people are worth following. Chris does amazing

work on interpretability and Alexander Mordvintsev we'll see in the second notebook that I look

at today and doing all sorts of other cool stuff as well. Anyway, so we want to think

about like, let's extract the outputs of these layers in the hope that 

they give us a representation of our image that's richer than just 

the raw pixels. So we can list... JEREMY: The idea being there 

that if we had another, if we were able to change our image to have

the same features at those various types that you were just showing us, that then it would

like have similar textures or similar kind of higher level concepts or whatever.

JOHNO: Exactly. So if you think of this like 14 by 14 feature map over here, maybe it's capturing

that there's an eye in the top left and some hair on the top right, these kind of abstract

things. And if you change the brightness of the image, it's unlikely that it's going to

change what features are stored there because the network's learned to be somewhat invariant

to these like rough transformations, a bit of noise, a bit of changing texture early

on is not going to affect the fact that it still thinks this looks like a dog and a few

layers before that, that it still thinks that part looks like a nose and that part looks

like an ear. JEREMY: Maybe the more interesting bits then for 

what you're doing are those earlier layers where it's going to be like, there's a whole bunch 

of kind of diagonal lines here, or there's a kind of a loopy bit here. Because then yeah, 

if you replicate those, you're going to get similar textures without changing the semantics.

JOHNO: Exactly. Yeah. So I mean, I guess let's load the model and look at what the layers are. And then in the next section we can try and 

like see what kinds of images work when we optimize towards different layers in there. 

So this is the network we have, Convolutions, ReLUs, Max Pooling, all of this we should 

be familiar with by now. And it's all just in one big nn.Sequential. This doesn't have 

the head. So we said dot features. If you did this without, you'd have then the, this is 

like the features sub network. That's everything up until some point. And then you have the 

flattening and the classification, which we are kind of just throwing away. So this is 

the body of the network. And we're going to try and tag into various layers here and extract 

the outputs. But before we do that, there's one more bit of admin we need to handle. This 

was trained on a normalized version of ImageNet, right? Where you took the dataset mean and 

the dataset standard deviation and use that to normalize your images. So if we want to match 

what the data looked like during training, we need to match that normalization step. 

And we've done this on grayscale images where we just subtract the mean, divide by the standard 

deviation. But with three channel images, these RGB images, we can't get away with just 

saying, let's subtract our mean from our image and divide by the standard deviation. You're 

going to get an error that's going to pop up. And this is because we now need to think 

about broadcasting and these shapes a little bit more carefully than we can with just a 

scalar value. So if we look at the mean here, we just have three values, right? One 

for each channel, the red, green, and blue channels. Whereas our content image has 3 

channels and then 256 by 256 for the spatial dimensions. So if we try and say content image 

divided by the mean or minus the mean, it's going to go from right to left and find the 

first non-unit axis. So anything with a size greater than one, and it's going to try 

and line those up. And in this case, the 3 and the 256, those aren't going to match. And so 

we're going to get an error. More perniciously, if the shape did happen to match, that might 

still not be what you intended. So what we'd like is to have these 3 channels mapped to 

the three channels of our image, and then somehow expand those values out across the 

2 other dimensions. And the way we do that is we just add two additional dimensions on 

the right for our imagenet_mean. And you could also do unsqueeze(-1).unsqueeze(-1).

But this is the kind of syntax that we're using in this course. And now our 

shapes are going to match because we're going to go from right to left. If 

it's a unit dimension, size one, we're going to expand it out to

match the other tensors. And if it's a non-unit dimension, 

then the shapes have to match. And that looks like it's the case. And so 

now with this reshaping operation, we can write a little normalize function, which we 

can then apply to our content image. And I'm just checking the min and the max to make 

sure that this roughly makes sense. We could check the mean as well to make sure that the 

mean is somewhat close to zero. Okay, in this case less maybe because it's a darker image 

than average. But at least we are doing the operation that seems like the math is 

correct. And now the shape is not the same. JEREMY: Maybe doing the channel-wise 

mean would be interesting. JOHNO: Oh, yes. So that would be the mean 

over the dimensions 1 and 2, I think. I think

you have to pass a tuple, 1 comma 2. Yes. I wasn't sure which way around it was.

Yeah, I always forget to. Okay, so our blue channel is brighter than 

the others. And if you go back and look at our image, you could maybe believe that the 

image int is going to be blue and red and the face is going to be just blue. Yeah, okay. 

So that seems to be working. We can double check because now that 

we've implemented it ourselves, torchvision.transforms has a Normalize transform

that you can pass the mean and standard deviation to, and it's going to handle making sure that

the devices match, that the shapes match, et cetera. And you can see if we check the

min and max, it's exactly the same, just a little bit of reassurance that our function

is doing the same thing as this Normalize transform.
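
The broadcasting steps described above can be sketched as follows. This is a minimal illustration rather than the notebook's exact code: `content_im` is a random stand-in for the real content image, and the mean and standard deviation are the commonly used ImageNet channel statistics.

```python
import torch

# Commonly used ImageNet channel statistics (red, green, blue).
imagenet_mean = torch.tensor([0.485, 0.456, 0.406])
imagenet_std = torch.tensor([0.229, 0.224, 0.225])

def normalize(im):
    # im is (3, H, W). Adding two trailing unit dims makes the stats (3, 1, 1),
    # so they broadcast across height and width, one value per channel.
    return (im - imagenet_mean[:, None, None]) / imagenet_std[:, None, None]

torch.manual_seed(0)
content_im = torch.rand(3, 256, 256)  # stand-in for the real content image

# Without the extra dims, broadcasting lines up the last axes (256 vs 3) and fails.
try:
    content_im - imagenet_mean
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

normed = normalize(content_im)
channel_means = normed.mean(dim=(1, 2))  # one mean per colour channel
print(broadcast_failed, normed.shape, channel_means)
```

Passing a tuple to `dim` is what gives the channel-wise mean: the two spatial axes are averaged away and the channel axis is kept.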

JEREMY: I appreciate you not cheating by implementing that. Johno, thank you.

JOHNO: You're welcome. Got to follow the rules. JEREMY: Got to follow the rules.

JOHNO: Okay. So a bit of admin out the way. We can finally say how do we extract the features

from this network. And now if you remember the previous lesson on hooks, that might be

something that springs to mind. I'm going to leave that as an exercise for the reader.

And what we're going to do is we're just going to normalize our input and then we're going

to run through the layers one by one in this sequential stack. We're 

going to pass our x through that layer. And then if we're in one of the 

target layers, which we can specify, we're going to store the outputs of that layer.

JEREMY: And I can't remember if I've used the term “features” before or not, so apologies if I

have, but just to clarify here, when we say features, we just mean the activations of

a layer. And in this case, Johno's picked out two particular layers, 18 and 25. I just want

to, I mean, I'm not sure it matters in this particular case, but there's a bit of a gotcha

you've got here, Johno, which is you should change that default 18, 25 from a list to

a tuple. And the reason for that is that when you use a mutable type, like a list in a Python

default parameter, it does this really weird thing where it actually keeps it around. And

if you change it at all later, then it actually kind of modifies your function. So I would

suggest, yeah, never using a list as a default parameter because at some point it will create

the weirdest bug you've ever had. I say this from experience.
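
The gotcha Jeremy describes can be reproduced in a few lines of plain Python. The function names here are hypothetical and exist only for the illustration.

```python
def add_layer_bad(layer, target_layers=[18, 25]):
    # The default list is created once, when the function is defined,
    # so every call that relies on the default shares the same object.
    target_layers.append(layer)
    return target_layers

def add_layer_good(layer, target_layers=(18, 25)):
    # A tuple cannot be mutated in place, so we build a new one instead.
    return target_layers + (layer,)

first = list(add_layer_bad(1))   # snapshot after the first call
second = list(add_layer_bad(2))  # the "default" has silently grown
good_first = add_layer_good(1)
good_second = add_layer_good(2)

print(first, second)
print(good_first, good_second)
```

With the list default, the second call sees the layer appended by the first call; with the tuple default, each call starts from the same (18, 25).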

JOHNO: Yeah, yeah. That sounds like something that was hard won. All right. I'll change that.

And by the time you see this notebook, that change should be there. All right. So this

is one way to do it. Just manually running through the layers one by one up until whatever

the latest layer we're interested in is. But you could do this just as easily by adding

hooks to the specific layers and then just feeding your data through the whole network

at once and relying on the hooks to store those intermediates.

JEREMY: Yeah. So let's make that homework actually, not just 

an exercise you can do, but yeah, I want, let's make sure everybody does that. 

You can use the, one of the hooks callbacks we had or the hooks context managers we had, or 

you can use register_forward_hook from PyTorch directly.
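
For anyone checking their homework, here is one possible sketch of both approaches side by side. It assumes a tiny untrained `nn.Sequential` called `body` as a stand-in for the VGG16 feature stack, and the layer indices (1, 4) are chosen for this toy model only; the function names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Small, untrained stand-in for the VGG16 feature stack (hypothetical).
body = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
).eval()

def calc_features_loop(x, target_layers=(1, 4)):
    # Run layer by layer, keeping the activations of the chosen layers.
    feats = []
    for i, layer in enumerate(body[:max(target_layers) + 1]):
        x = layer(x)
        if i in target_layers:
            feats.append(x.clone())
    return feats

def calc_features_hooks(x, target_layers=(1, 4)):
    # Same result, but the whole network runs and hooks record the outputs.
    feats = []
    def hook(module, inp, out):
        feats.append(out.clone())
    handles = [body[i].register_forward_hook(hook) for i in target_layers]
    try:
        body(x)
    finally:
        for h in handles:
            h.remove()  # always clean up, or the hooks keep firing
    return feats

x = torch.rand(1, 3, 32, 32)
with torch.no_grad():
    loop_feats = calc_features_loop(x)
    hook_feats = calc_features_hooks(x)
print([f.shape for f in loop_feats])
```

The loop version stops as soon as the deepest target layer has run, while the hook version runs the full stack; for a real VGG body that difference may matter for speed.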

JOHNO: Yeah. And so what we get out here, we're feeding 

in an image that's 256 by 256. And the first layer that we're looking at is this one here. 

And so it's getting halved to 128, then to 64. These ones are just different because it's 

a different starting size. And then to 32 by 32 by 512. And so those are the features 

that we're talking about for that layer 18. It's this thing of shape 512 by 32 by 32 for 

every kind of spatial location in that 32 by 32 grid, we have the output from 512 different 

filters. And so those are going to be the features that we're talking about.

JEREMY: So there's a [unintelligible] being the channels in a single convolution.

JOHNO: Yeah. Okay. So what's the point of this? Well, like I said, we’re 

hoping that we can capture different things at different layers. And so 

to kind of first get a feel for this, like, what if we just compared these feature maps 

we can institute what I'm calling a content loss, or you might see it as a perceptual 

loss. And we're going to focus on a couple of later layers. Again, make sure that this 

is... as I've learned. And what we're going to do is we're going to pass in a target image, 

in this case, our content image. And we're going to calculate those features in 

those target layers. And then in the forward

method, when we're comparing to our inputs, we're going to calculate the features of our

inputs. And we're going to do the mean squared error between those and our target features.

So maybe that's a bad way of explaining it. But the so the…

JEREMY: I can maybe read it back to you to make sure I understand. JOHNO: Yeah. So good idea.

JEREMY: Okay. So this is a loss function you've created. It has a dunder 

call method, which means you can pretend that it's a function. It's 

a callable in Python language. Your forward… your best… So yeah, in nn.Module we would call 

it forward. But in normal Python, we just use dunder call. It's taking one input, 

which is the way you set up your image training callback earlier, it's just going to pass 

in the input, which is this is the image as it's been optimized to so far. So initially, 

it's going to be that random noise. And then the loss you're calculating is the mean squared 

error of how far away is this input image from the target image, the mean squared error 

for each of the layers, by default 18 and 25. And so you're literally actually, it's a 

bit weird, you're actually calling a different neural network; calc_features actually calls 

the neural network, but not because that's the model we're optimizing, but because it's 

actually, the loss function is how far away are we. Yeah. So that's the loss function. And 

so if we, so if you're, so if we, with SGD, optimize that loss function, you're not going 

to get the same pixels. You're going to get, I don't even know what this is going to look 

like. You're going to get some pixels, which have the same activations 

of those features. JOHNO: Yeah. And so if we run that, we see, you can

see the sort of shape of our person there, but it definitely doesn't match on like a

color and style basis. JEREMY: So 18 and 25 remind us how deep they are in the scheme of things.

JOHNO: So these are fairly close towards the end. JEREMY: Okay. So I 

guess color often doesn't have much of a semantic kind of property. So that's 

probably why it doesn't care much about color because it's still going to be an eyeball, 

whether it's green or blue or brown. JOHNO: Yeah. There's something else I should mention, which 

is we aren't constraining our tensor that we're optimizing to be in the same bounds 

as a normal image. And so some of these will also be less than zero or greater than one 

as kind of like almost hacking the neural network to get the same features at those 

deep layers by passing in something that it's never seen during training. And so for display, 

we're clipping it to the same bounds as an image, but you might want to have either some 

sort of sigmoid function or some other way that you clamp your tensor model and to have 

outputs that are like within the allowed range for. JEREMY: Oh, good point. 

Also, it's interesting to note the background hasn't changed much. And 

I guess the reason for that would be that the VGG model you were using in 

the loss function was trained on ImageNet and ImageNet is specifically

about recognizing generally a single big object, like a dog or a boat or whatever.

So it's not going to care about the background and the background probably isn't going to

have much in the way of features at all, which is why it hasn't really changed the background.
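
The content loss read back above can be sketched roughly like this. It is a simplified illustration under stated assumptions: `body` is a small random network standing in for the pretrained VGG16 features, the target layers (1, 4) belong to that toy network, and `ContentLoss` is a hypothetical name rather than the notebook's class.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-in for the pretrained VGG16 body; weights here are random.
body = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
).eval()
for p in body.parameters():
    p.requires_grad_(False)  # only the image is optimized, not the network

def calc_features(x, target_layers=(1, 4)):
    feats = []
    for i, layer in enumerate(body[:max(target_layers) + 1]):
        x = layer(x)
        if i in target_layers:
            feats.append(x.clone())
    return feats

class ContentLoss:
    def __init__(self, target_im, target_layers=(1, 4)):
        self.target_layers = target_layers
        with torch.no_grad():  # target features are computed once and fixed
            self.target_features = calc_features(target_im, target_layers)

    def __call__(self, input_im):
        # Mean squared error between activations, summed over the target layers.
        feats = calc_features(input_im, self.target_layers)
        return sum(F.mse_loss(f, t) for f, t in zip(feats, self.target_features))

content_im = torch.rand(1, 3, 32, 32)              # stand-in content image
loss_fn = ContentLoss(content_im)
im = torch.rand(1, 3, 32, 32, requires_grad=True)  # start from random noise
opt = torch.optim.Adam([im], lr=0.05)

start = loss_fn(im).item()
for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(im)
    loss.backward()
    opt.step()
end = loss_fn(im).item()

# Nothing constrains `im` to [0, 1] while optimizing; clip only for display.
display_im = im.detach().clamp(0, 1)
print(start, end)
```

Note that the network is called inside the loss function, not as the model being trained: the only thing with gradients is the image itself.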

JOHNO: Yeah, exactly. And so, I mean, this is kind of interesting to see how little it looks

like the image while at the same time still being like, if you squint, you can recognize

it. But we can also try passing in earlier layers, right? And 

comparing on those earlier layers and see that we get a completely different 

result because now we're optimizing to some image that is a lot closer to the original. 

It still doesn't look exactly the same. And so there's a few things that I thought were 

worth noting, just potentially of interest. One is that we're looking into these ReLU layers, 

which might mean, for example, that if you're looking at the very early layers, you're missing 

out on some kinds of features. That was one of my guesses as to why this didn't have as 

dark a darks as the input image. And then also we still have this thing where we might 

be going out of bounds to get the same kinds of features. So yeah, you can see how by looking 

at really deep layers, we really don't care about the color or texture at all. We're just 

getting like sunglassesy bits and nosey bits there. By looking at the earlier layers, we 

have much more rigid adherence to the sort of lower level features as well. And so this 

is nice. It gives you a very tunable way to compare two images. You can say, do I care 

that they match exactly on pixels? Then I could use mean square error. But do I care 

quite a lot about the exact match? Then I can use maybe some early layers. But do I 

only care about the overall semantics? In that case, I can go to some deeper 

layers and you can experiment with. JEREMY: If I remember correctly, 

this is also something like the kind of technique that Zeiler and Fergus

and the Distill.pub papers use to like just identify like what do filters look at,

which is like, you can optimize an image to try and maximize a particular filter, for

example, that would be a similar loss function to the one you've built here. And that would

show you, yeah, what they're, what they're looking at.

JOHNO: Yeah. And that would be a really fun little project actually. So do it where you calculate

these feature maps and then just pick one of those 512 features and optimize the image

to maximize that activation. By default, you might get quite a noisy, weird result, like

almost an adversarial input. And so what these feature visualization people do is they add

things like augmentations so that you're optimizing an image that 

even under some augmentations still activates that feature. But, yeah, 

that might be a good one to play with. Cool. Okay. So we have a lot of our infrastructure 

in place. We know how to optimize an image. We know how to extract features from this 

neural network and we're saying this is great for comparing at these different kind of types 

of feature how similar two images are. The final piece that we need for our full style 

transfer, artistic application is to say I'd like to keep the structure of this image, 

but I'd like to have the style come from a different image. And you might think, oh, 

well, that's easy. We just look at the early layers like you've shown us. But there's a 

problem, which is that these feature maps by default, we feed in our image and we get these 

feature maps. They have a spatial component, right? We said we had a 32 by 32 by 512 feature 

map out and each of those locations in that 32 by 32 grid are going to correspond to some 

part of the input image. And so if we just said, let's do mean squared error for the 

activations from some early layers, what we'd be saying is I want the same types of feature, 

like the same style, the same textures, and I want them in the same location. Right? And so 

we can't just get like Van Gogh brushstrokes. We're going to try and have the same colors 

in the same place and the same textures in the same place. And so we're going to get 

something that just matches our image. What we'd like is something that has the same colors 

and textures, but doesn't, but they might be in different parts of the image. So we 

want to get rid of this spatial aspect. JEREMY: So just to clarify, when we're saying to 

it, for example, give it to us in the style of Van Gogh's Starry Night, we're not saying 

in this part of the image, there should be something with this texture, but we're saying 

that the kinds of textures that are used anywhere in that image should also appear in our version, but not necessarily in the same place.

JOHNO: Exactly. And so the solution that Zeiler and Fergus proposed is 

this thing called a Gram Matrix. So what we want is some measure of what kinds 

of styles are present without worrying about where they are. And so there's always a trouble 

trying to represent more than two dimensional things on a 2D grid. But what I've done here 

is I've made our feature map, right, where we have our height and our width, that might 

be 32 by 32 and some number of features, but instead of having those be like a third dimension, 

I've just represented those features as these little colored dots. And so what we're going 

to do with a Gram Matrix is we're going to flatten out our spatial dimension. So we're 

going to reshape this so that we have the width times the height so that, like, the spatial 

location on one axis and the feature dimension on the other. So each of these rows is like 

this is the location here. There's no yellow dots, so we get a zero. There's no green, 

so we get a zero. There is a red and a blue, so we get ones. So we've kind of flattened 

out this feature map into a 2D thing. And then instead of caring about the spatial dimension 

at all, all we care about is which features do we have in general, which types of features, 

and do they occur with each other. And so we're going to get effectively the dot products 

of this row with itself and then this row with the next row and this row with the next 

row. We're saying like for these feature vectors, how correlated are they with each other? 

Right. And so we'll see this in code just now. JEREMY: I think you might've 

said, I might've misheard you, but I just want to make sure I got the

citation here, right? So this idea came from, I don't know if it was first invented in the

Gatys et al. paper, called “A Neural Algorithm of Artistic Style”.

JOHNO: Yeah. Yeah. I meant Gatys, that's the style transfer one. 

Zeiler and Fergus is the feature visualization one. Yeah. Sorry, I got those 

switched. Thanks, Jeremy. Okay. So we are ending up with this kind of like, this Gram 

matrix, this correlation of features. And the way you can read this in this example is 

to say, okay, there are seven reds, right? Red with red, there's seven in total. And if 

you go and count them, there's seven there. And then if I look at any other one in this 

row, like here, there's only one red that occurs alongside a green, right? This is the 

only location where there's one red cell and one green cell. There's three reds that occur 

with the yellow. They're there and there. And so this gram matrix here has no spatial 

component at all. It's just the feature dimension by the feature dimension. But it has a measure 

of how common these features are. Like what's an uncommon one here? Yeah, maybe there's 

only three greens in total, right? And all of them occur alongside a yellow. One of them 

occurs alongside a red. One of them occurs alongside a blue. Yeah, so this is exactly 

what we want. This is some measure of what features are present, where if they occur 

together with other features often, that's a useful thing, but it doesn't have the 

spatial component. We've gotten rid of that. JEREMY: And this is the first clear explanation 

I've ever seen of how a Gram Matrix works. This is such a cool picture. I also want to, maybe 

you can open up the original paper because I'd also like to encourage people to look at 

the original paper because this is something we're trying to practice at this point is 

reading papers. And so hopefully you can take Johno's fantastic explanation 

and yeah, and bring it back to understanding the paper as

well. That's crazy that it's been so far down. JOHNO: 

Oh yeah, it's a different search engine that I'm trying out that has some AI magic, but 

they use Bing for their actual searching. [...] Yeah. So we can quickly 

check the paper. I don't know if I've actually read this paper ‒as horrific as 

that sounds. JEREMY: Not horrific at all. It was a while ago. But I think it's 

got some nice pictures and I'm going to zoom in a bit. JOHNO: Oh,

good idea. Okay. They're the examples. JEREMY: It's 

great examples. Yeah. Love for Kandinsky. JOHNO: Sorry about the doorbell. Okay. Yeah. Gram 

Matrix “inner product between the vectorized feature map”. So those kinds of wordings kind of 

put me off for a while. The way I explained Gram Matrices when I had to deal with them 

at all was to say it's magic that measures, you know, what features are there without 

worrying about where they are and left it at that. But it is worth trying to decode this 

back. They talk about which layers they're looking into. I think in TensorFlow they have 

names. We're just using the index. Okay. Yeah. So it doesn't really explain how the Gram 

Matrix works, but it's something that people use historically in some other contexts 

as well. And for the same kind of measure. JEREMY: Nowadays actually PyTorch has 

named parameters and I don't know if they've updated VGG yet,

but you can name layers of a sequential model as well.

JOHNO: Yeah. Okay. So just quickly, I wanted to implement this diagram 

in code. I should mention these are, these are like zero or one for simplicity, 

but you could have like obviously different size activations and things. The correlation 

idea is still going to be there. Just not as easy to represent visually. And so we're 

going to do it with an einsum because it makes it easy to add later the batch dimension 

and so on. But I wanted to also highlight that this is just this matrix multiplied with 

its own transpose and you're going to get the same result. So yeah, that's our Gram 

Matrix calculation. There's no magic involved there as much as it might seem like it. And 

so we can now use this, like, can we create this measure and then can we...
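
Here is a small sketch of that calculation on a made-up toy feature map. The values are not the ones from the diagram, and the colour names are labels for illustration only; features are laid out as rows and flattened spatial locations as columns.

```python
import torch

# Made-up toy feature map: 4 features (rows) at 6 flattened locations (columns).
t = torch.tensor([[1, 0, 1, 1, 0, 1],   # "red"
                  [0, 1, 0, 1, 0, 0],   # "green"
                  [1, 1, 0, 0, 0, 1],   # "blue"
                  [0, 0, 1, 1, 0, 0]],  # "yellow"
                 dtype=torch.float32)

# Sum over the location index s: how often features f and g occur together.
gram_einsum = torch.einsum('fs,gs->fg', t, t)

# The same thing written as the matrix multiplied by its own transpose.
gram_matmul = t @ t.T

print(gram_einsum)
```

The diagonal counts how often each feature occurs at all, and the off-diagonal entries count co-occurrences, so no information about where a feature sits survives.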

JEREMY: When you look later at things like Word2Vec, I think it's 

got some similarities, this idea of kind of co-occurrence of features. And 

it also reminds me of the CLIP loss, a similar idea of like basically a dot product, but 

in this case with itself, I mean, we've seen how covariance is basically that as well. So 

this idea of kind of like multiplying with your own transpose is a really common mathematical 

technique we've come across three or four times already in this course.

JOHNO: Yeah. And it comes up all over the place even. Yeah. You'll 

see that in protein folding stuff as well. They have a big 

covariance matrix for like... JEREMY: So the difference in each case is like, 

yeah, the difference in each case is the matrix that we're multiplying by its own transpose. 

So for covariance, the matrix is the matrix of differences to the mean, for example. And 

yeah, in this case, the matrix is this flattened picture thing.
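
The covariance case Jeremy mentions follows the same pattern, and can be checked against PyTorch's built-in function. This is a sketch on random data; the variable names are arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 3)         # 100 observations of 3 variables

# The "matrix of differences to the mean".
centered = x - x.mean(dim=0)

# Multiply it by its own transpose and divide by n - 1.
cov_manual = centered.T @ centered / (x.shape[0] - 1)

# torch.cov expects variables in rows and observations in columns.
cov_builtin = torch.cov(x.T)

print(torch.allclose(cov_manual, cov_builtin, atol=1e-5))
```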

JOHNO: Cool. So I have here the calc_grams function that's 

going to do exactly that operation we did above, but we're going to add some 

scaling. And the reason we're adding the scaling is that we have this feature map and we might 

pass in images of different sizes. And so what this gives us is the absolute... You 

can see there's a relation to the number of spatial locations here. And so by scaling 

by this width times height, we're going to get a relative measure as opposed to an absolute 

measure. It just means that the comparisons are going to be valid even for images of different 

sizes. And so that's the only extra complexity here, but we have channels by height, by width, 

image in, and we're going to pass in... Oh, sorry. This is channels being the number of 

features. We're going to pass it in two versions of that, right? Because it's the same image 

in both times. But we're going to map this down to just this features by features, 

but you can't repeat variables in einsum. So that's why it's c and d. And if we run this 

on our style image, you can see I'm targeting five different layers. And for each one, the 

first layer has 64 features. And so we get a 64 by 64 Gram Matrix. The second one has 128 

features. So we get a 128 by 128 Gram Matrix. So this is doing, it seems like what we want. 
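
The effect of the scaling can be seen in a small sketch. Here `gram` is a hypothetical helper that works on a single feature map rather than on the list of layers, and the feature maps are random stand-ins.

```python
import torch

def gram(x):
    # x: (channels, height, width) feature map -> (channels, channels) Gram matrix.
    # Dividing by the number of spatial locations makes image size cancel out.
    c, h, w = x.shape
    return torch.einsum('chw,dhw->cd', x, x) / (h * w)

torch.manual_seed(0)
small = torch.rand(64, 8, 8)
big = small.repeat(1, 2, 2)  # the same "texture" tiled over a 16 by 16 grid

unscaled_small = torch.einsum('chw,dhw->cd', small, small)
unscaled_big = torch.einsum('chw,dhw->cd', big, big)

print(gram(small).shape)
print(torch.allclose(gram(small), gram(big), atol=1e-5))
```

Without the division, the tiled version has entries four times larger simply because it has four times as many locations; with it, the two Gram matrices agree.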

Because this is a list, we can use this attrgot method which I...

JEREMY: Well, actually it's a fastcore capital L, not a list.

JOHNO: Oh, sorry. Yeah. Magic list. And so I like to think.

Magic list. Yeah. So either works. Okay. So let's use 

this as a loss. Just like with the content loss before, we're going to take in a target 

image, which is going to be our style. We're going to calculate these gram matrices for 

that. And then when we get in an input to our loss function, we're going to calculate 

the gram matrices for that and do the mean squared error between the gram matrices. So 

these are the no spatial components, just what features are there, comparing the two 

to make sure that they ideally have the same kinds of features and the same kinds 

of correlations between features. So we can set that up.

We can evaluate it on my image. So our content image at the moment has quite a high loss

when we compare it to our style image. JEREMY: That means that the content image doesn't 

look anything like a spider web in terms of its textures and whatever.
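
A minimal sketch of such a style loss is below. To stay self-contained it takes lists of feature maps directly instead of images, which is a simplification of what is described above; `StyleLoss` and `gram` are hypothetical names and the feature maps are random stand-ins.

```python
import torch
import torch.nn.functional as F

def gram(x):
    # (channels, height, width) -> (channels, channels), scaled by location count.
    c, h, w = x.shape
    return torch.einsum('chw,dhw->cd', x, x) / (h * w)

class StyleLoss:
    def __init__(self, target_feats):
        # target_feats: list of (C, H, W) feature maps from the style image.
        with torch.no_grad():
            self.target_grams = [gram(f) for f in target_feats]

    def __call__(self, input_feats):
        # Compare Gram matrices, not the feature maps themselves.
        return sum(F.mse_loss(gram(f), g)
                   for f, g in zip(input_feats, self.target_grams))

torch.manual_seed(0)
style_feats = [torch.rand(8, 16, 16), torch.rand(16, 8, 8)]
flipped = [f.flip(-1) for f in style_feats]  # same features, mirrored positions
other = [torch.rand(8, 16, 16), torch.rand(16, 8, 8)]

style_loss = StyleLoss(style_feats)
loss_flipped = style_loss(flipped).item()
loss_other = style_loss(other).item()
feature_mse = sum(F.mse_loss(a, b) for a, b in zip(flipped, style_feats)).item()

print(loss_flipped, loss_other, feature_mse)
```

Mirroring the feature maps moves every feature to a different place, so a direct mean squared error on the features is large, but the style loss stays essentially zero because the Gram matrices are unchanged.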

JOHNO: Exactly. So we're going to set up an optimization 

thing here. One difference is that at the moment I'm starting from the content image 

itself rather than optimizing from random noise. You can choose either way. For style 

transfer, it's quite nice to use the content image as the starting point. And so you can 

see at the beginning, it just looks like our content image. But as we do more and more 

steps, we maintain the structure because we're still using the content loss as one component 

of our loss function. But now we also have more and more of the style because of the 

early layers, we're evaluating that style loss and you can see this doesn't have the 

same layout as our spider web, but it has the same kinds of textures and the same types 

of structure there. And so we can check out the final result and you can see it's done 

ostensibly what our goal is. It's taken one image and it's done it in the style of 

another. And to me, this is quite satisfying. JEREMY: And it's actually 

done it in a particularly clever way because look at her arm. Her arm

has the spider web nicely laid out on it and she's almost picking it out with her fingers.

And her face, which is quite important or very important in terms of object recognition,

the model didn't want to mess with the face much at all. So it's kept the spider webs

away from that. Like I think it's the more you look at it, the more impressive it is

in how it's managed to find a way to add spider webs without messing up the overall kind of

semantics of the image. JOHNO: Yeah. Yeah. So this is really fun to 

play with. If you've been running the notebook with the demo images, please like, right now, 

go and find your own pictures, make sure you're not stealing someone's licensed work, but 

there's lots of creative commons images out there and try bash them together. Do it at 

a larger size, get some higher resolution style loss going. And then there's so much 

that you can experiment with. So for example, you can change the content loss to focus on 

maybe an earlier layer as well. You can start from a random image instead of the content 

image, or you can start from the style image and optimize towards the content image. You 

can change how you scale these two components of the loss function. You can change how long 

you train for, what your learning rate is. All of this is up for grabs in terms of what 

you can optimize and what you can explore. And you get different results with different 

scalings and different focus layers. So there's a whole lot of fun experimentation to be done 

in terms of finding a set of parameters that gives you a pleasing result for a 

given style content pair and for a given effect that you

want on the output. TANISHQ: On that note, I wanted to... One of the 

really interesting things about this is just how well VGG works as a network, even though it's 

a very old network. But I think it's also worth playing around with other networks as 

well. I think there are definitely some special properties of VGG that allow it to do well for 

style transfer. And there are a few papers on that. And there are also some papers that 

explore how we can use maybe other networks for style transfer that maintains maybe some 

of these nice properties of VGG. So I think that could be interesting to explore some 

of these papers. And of course we have this very nice framework that allows us to easily 

plug and play different networks and try that aspect out as well.

JEREMY: Yeah. And in particular, I think taking a ConvNeXt or a ResNet or something and replacing

its head with a VGG head would be an interesting thing to try.

JOHNO: Yeah. Tanishq, on the experimentation version, 

one of the things that when we were developing this I said to Jeremy was like, ah, you're 

doing all this work, setting up these callbacks and things, you know, isn't it nicer to just 

have like, here's my image that I'm optimizing, set up an optimizer, set up my loss function 

and do this optimization loop. And the answer is that it is theoretically easier when you 

just want to do this once. And that's why you see in a tutorial or something, you keep 

this as minimal as possible. You just want to show what style loss is. But as soon as 

you say, okay, I'd like to try this again, but adding a different layer. So maybe let me 

do another cell and then copying and pasting over a bunch, you know, and then you say, oh, 

let me add some progress stuff. So images, you know, it gets messy really quickly. As 

soon as you want to save images for a video and you want to mess with the loss function 

and you want to do some sort of annealing on your learning rate, each of these things 

is going to grow this loop into something messier and messier. Yeah. And so I thought 

it was fun. Like I was very quickly a convert to being able to experiment with a completely 

new version with, yeah, minimal lines of code, minimal changes and having everything in its 

own piece, like the image logging, or you wanted to make a little movie showing the 

progress that goes in a separate callback. You want to tweak the model? You're just tweaking 

one thing, and all the other infrastructure can stay the same. Yeah. So that was pretty cool.
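
A minimal sketch of the bare loop Johno describes: the image being optimized, an optimizer, a loss function, and the loop itself. To stay self-contained it uses a small frozen random conv net as a stand-in for the pretrained VGG features used in the lesson, and random tensors in place of real images. Those stand-ins, the learning rate, and the step count are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in feature extractor (an assumption): a frozen random conv net.
# The lesson uses pretrained VGG features here instead.
features = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
).requires_grad_(False)

def gram(feats):
    # feats: (channels, height, width) -> (channels, channels) Gram matrix
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)
    return flat @ flat.t() / (h * w)

content_im = torch.rand(3, 32, 32)  # placeholder for the content image
style_im = torch.rand(3, 32, 32)    # placeholder for the style image

with torch.no_grad():
    target_content = features(content_im[None])[0]
    target_gram = gram(features(style_im[None])[0])

# Start from the content image itself, as in the lesson.
im = content_im.clone().requires_grad_(True)
opt = torch.optim.Adam([im], lr=0.05)

losses = []
for step in range(30):
    opt.zero_grad()
    feats = features(im[None])[0]
    content_loss = F.mse_loss(feats, target_content)
    style_loss = F.mse_loss(gram(feats), target_gram)
    loss = content_loss + style_loss  # the relative scaling is up for grabs
    loss.backward()
    opt.step()
    with torch.no_grad():
        im.clamp_(0, 1)               # keep pixel values in a valid range
    losses.append(loss.item())
```

Every extra wish (logging images, annealing the learning rate, a different layer) has to be threaded through this one loop, which is the messiness the callbacks avoid.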

JEREMY: I mean, there's not like one answer, right? Like it's, yeah. Use 

the right layer of abstraction for what you're doing at the right time. Like 

something I actually think people do too much of when they use the fastai library is jumping 

straight into data blocks, for example, even though the model, you know, they might be 

working on a slightly more custom thing where data blocks, there isn't a data block already 

written for them. And so then step one is like, Oh, write a data block. That's not at 

all easy. And you actually want to be focusing on building your model. So I kind of say to 

people, Oh, well like, you know, go, go down a layer of abstraction. Now I will say I don't 

very often start at the very lowest level of abstraction. So something like the very 

last thing that you showed Johno, just because in my experience, I'm not good enough to do 

that. Right. And so like most of the time I yeah, I'll forget zero grad or I'll, I'll 

just mess up something, especially if I want to like have it run reasonably quickly by 

using like, you know, fp16, mixed precision or you know, or I'll be like, Oh, now I've 

got to like think about how to put metrics in so that I can see it's training properly. 

And I always mess that up. And so I don't often go to that level, but I do like quite 

often start at a reasonably low level. And I think with miniai now we all have this 

tool where we fully understand all the layers and there aren't that many. And yeah, you 

could like write your own TrainCB or whatever. And at least you've got something that makes 

sure, for example, that, Oh, okay. You remember to use torch.no_grad here. And you remember 

to put it in, you know, put it in eval mode there that, you know, those things will be 

done correctly. And you'll be able to easily run a learning rate finder and easily have it 

run on CUDA and you know, or whatever device you're on. So I think, you know, hopefully 

this is a good place for people now to have a framework that they can call their own, 

you know, and use as much or as little of it as makes sense.
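
For reference, a sketch of what the "very lowest level" loop has to get right by hand, with the easy-to-forget steps Jeremy lists marked in comments. The toy model, data, and hyperparameters are placeholders, not anything from the lesson notebooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy model and data, purely illustrative.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(64, 4), torch.randn(64, 1)

for epoch in range(3):
    model.train()                 # training mode, so dropout is active
    opt.zero_grad()               # easy to forget: clear the old gradients
    loss = F.mse_loss(model(xb), yb)
    loss.backward()
    opt.step()

    model.eval()                  # easy to forget: eval mode for validation
    with torch.no_grad():         # easy to forget: no graph needed here
        valid_loss = F.mse_loss(model(xb), yb)
```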

TANISHQ: The other nice thing is of course, like there are multiple ways of doing the same thing

and it's like whatever way maybe works better for you. You can implement that. Like for

example, Jonathan showed with the ImageOpt callback, you know, you could implement that

in different ways and whichever one I guess is easier for you to understand or easier

for you to work with, it's, you can implement it that way. And yeah, miniai is flexible

in multiple ways. So that's the, especially one thing I really enjoy about it.

JEREMY: Yeah. And this is quite an extreme of, like, weirdness, I think, which 

is like, Johno is like using miniai for something that we never really 

considered making it for, which is like, it's not even looping through data. It's just looping 

through loops. So, you know, this is about as weird as it's going to get, I guess.

JOHNO: Yeah. Well the next notebook is about as weird as it's going to get, I think.

JEREMY: Oh great. JOHNO: Yeah. Okay. So what 

we're going to do next is use this kind of style loss in an even funkier way, to train a 

different kind of thing. But before we do that, I did want to just call out like using 

these pre-trained networks as very smart feature extractors is pretty powerful. And unlike 

the kind of fun, crazy example that we're going to look at just now, they also have 

very valid uses. So if you're doing like a super resolution or even something like diffusion, 

adding in a perceptual loss or even a style loss to your target image, it can improve 

things. We've played around with using perceptual loss for diffusion. Or even during, like say 

you want to generate an image that matches a face in some kind of image to image thing 

with stable diffusion, maybe you have an extra guidance function that makes sure 

that structurally it matches, but maybe texturally it doesn't. Maybe you want to pass in a style image and 

have that guide the diffusion process to be a particular style without having to say, 

you know, "in the style of Starry Night". Yeah. And for all sorts of like image to image 

tasks, this perceptual, this idea of like using the features from a network like VGG, 

it does actually have lots of practical uses apart from just this artistic fiddling. 

Okay. So speaking of artistic fiddling, we're going to look at something a little bit more 

niche now called Neural Cellular Automata. And so try and spend about half an hour on 

this before we move on to the next section. And so this is off the beaten track. It's 

a really fun domain of yeah, like combining a lot of different fields, all of which I'm 

quite excited about. And so you may be familiar with like, kind of, classic cellular automata. 

And so if we look at Conway's Game of Life, oops, I misspelled it, but you've probably 

seen this kind of classic… Conway's Game of Life.

JEREMY: Oh, [unintelligible] there when I was a kid.

JOHNO: Yeah. So the idea here is that you have all of these independent cells and each cell can

only see its neighbors and you have some sort of update rule, right? That says if a cell

has three neighbors, it's going to remain the same state for the next one. If it has

only one neighbor, it's going to die in the next iteration. And so this is a really cool

example of like a distributed system, a self-organizing system 

where there's no like global communication or anything. Each cell can only look at its 

immediate neighbors. And typically the rules are really small and simple. And so we can 

use these to model these complex systems. It's very much inspired by biology where we 

actually do have huge arrangements of cells, each of which is only seeing its neighborhood, 

like sensing chemicals in the bloodstream next to it and so on. And yet somehow 

they're able to coordinate together. JEREMY: I watched a really cool 

Kurzgesagt video the other day about ants and I didn't know this before,

maybe everybody else does, but ants like huge ant colonies organized by like having little

chemical signals that the ants around can smell and yeah, it can like organize the entire

massive ant colony just using that. I thought it was crazy, but it sounds really similar.

JOHNO: Yeah. Yeah. And you can do… so I’m doing my tangent. You could do very similar things where you have, yeah, like the trails, 

chemical trails being left, but just like pixel values in some grid and your ants are 

just little tiny agents that have similar rules. And so I should probably link this 

here, but this is exactly that kind of system, right? Each of these little tiny dots, which are almost too 

small to see is leaving behind these different trails. And then that determines the behavior, 

the difference between this and what we're going to do today is that,

JEREMY: Sorry, just to clarify, I think you've taught me before that like actual slime molds kind

of do this right there. They're another example. JOHNO: Yeah, yeah, exactly. There's some limited 

signaling. Each one is like, oh, I'm by food. And then after that, that signal is going 

to propagate and anything that's moving is going to follow. And so, yeah, if you play 

with this kind of simulation, you often get patterns that look exactly like emergent patterns 

in nature, like ants moving to food or you know, corals coordinating and that sort of 

thing. So it's a very biologically inspired field. The difference with our cellular automata 

is that they're going to be, there's nothing moving. Each, each, like, grid cell has its own 

little agent. And so there's no like wandering around. It's just each individual cell 

looking at its neighbors and then updating. JEREMY: And just to clarify, when you say agent, 

that can be really simple. Like I don't really remember, but I vaguely remember that Conway's 

Game of Life. It's kind of like a single kind of if statement. It's like, if there's, I 

don't know, what is it? Two cells around, you get another one or something.

JOHNO: Yeah. Yeah. If there's two or three nearby, you stay alive in the next one. If you're

overcrowded with four or five or there's no one near you with zero or one neighbors, then

you're going to die. So very, very simple rule. But what we're going to do today is

replace that hard coded if statement with a neural network. And in particular, a very

small neural network. So I should start with the paper that inspired me to like even begin

looking at this. So this is by Alexander Mordvintsev and a team with him at Google Brain.

And they built these Neural Cellular Automata. So this is a pixel grid. Every pixel is a

cellular automata that's looking at its neighbors and they can't see the global structure at

all. And it starts out with a single black pixel in the middle. And if you run the simulation,

you can see it builds this little lizard structure, this little emoji.
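
For contrast with the learned rule, the hard-coded Game of Life update discussed above fits in a few lines. This is a sketch using NumPy with wrap-around edges (a choice made here for simplicity); it uses the standard Conway rule, where a live cell survives with two or three neighbors and a dead cell comes alive with exactly three.

```python
import numpy as np

def life_step(grid):
    # Count the 8 neighbours of every cell, wrapping around at the edges.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    survive = (grid == 1) & ((neighbours == 2) | (neighbours == 3))
    born = (grid == 0) & (neighbours == 3)
    return (survive | born).astype(grid.dtype)

# A "blinker": three cells in a row flip between horizontal and vertical.
grid = np.zeros((5, 5), dtype=np.int64)
grid[2, 1:4] = 1
nxt = life_step(grid)
```

The neural version keeps the same structure (look at the neighbors, compute an update) but replaces the two boolean conditions with a tiny network.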

JEREMY: So that's wild to me that, that, that a bunch of pixels 

that only know about their neighbor can actually create such a 

large and sophisticated image. JOHNO: Yeah. They can self assemble into this. 

And what's more, the way that they train them, they are robust. They're able to repair damage. 

And so there's no, it's not perfect, but there's no global signaling. No, no little agent here 

knows what the full picture looks like. It doesn't know where in the picture it is. All 

it knows is that its neighbors have certain values and so it's going to update itself 

to, to match those values. And so you can see us

JEREMY: It does seem like something that ought to have 

a lot of use in the real world with like, I don't know, like having a bunch of drones 

working together when they can't contact, you know, some kind of central base. So I'm 

thinking about like work that Australia, some Australian folks have been involved in where 

they were doing like subterranean, automated subterranean rescue operations. And they've 

got, you literally can't communicate through thousands of meters of rock, 

stuff like that. JOHNO: Yeah. Yeah. So this idea of like self organizing

systems, there's a lot of promise for like nanotechnology and, and things like that that

can do pretty amazing things. This is the blog post that's linked. Yeah. “The Future

of Artificial Intelligence is Self-Organizing and Self-Assembling”. And definitely, yeah.

It's a pattern that's worked really well in nature, right? Like lots of loosely coordinated

cells coming together and talking about deep learning is quite a miracle. And so I think,

yeah, that's an interesting pattern to explore. Okay. So how do we train something

like this? How on earth do you set up your structure so that you can get something that

not only builds out an image or builds out something like a texture, but then is robust

and able to maintain that and keep it going? So the sort of base is that we're going to

set up a neural network with some learnable weights. That's going to apply our little

update rule, right? And this can, this is just going to be a little dense MLP. We can

get our inputs, which is just the neighborhood of the cell. And they sometimes have like

additional channels that aren't shown that the agents can use as communication with their

neighbors. So we can, we can set this up in code. We'll be able to get our neighbors using

maybe convolution or some other method, flatten those out and feed them through a little MLP

and take our outputs and use that as our updates. JEREMY: Just to clarify 

something that I missed originally is: this is not a simplified picture of 

it. This is it like that, that 3 by like, it's literally 3 by 3. You're only allowed 

to see the little things right next to you, or they can be in a different channel. 

JOHNO: Exactly. And this paper has this additional step of like cells being alive or dead. But 

we're going to do one that doesn't even have that. So it's, it's even simpler than this 

diagram. Okay. So to train this, what we could do is we could start from our initial states, 

apply our network over some number of steps, look at the final output and compare it to 

our targets and calculate our loss. And you might think, okay, well, that's pretty cool. 

We can maybe do that. And if you run this, you do indeed get something that after some 

number of steps can learn to grow into something that looks like your target image. But there's 

this problem, which is that you're applying some number of steps and then you're applying 

your loss after that. But that doesn't guarantee that it's going to be stable, [noise 

of something falling] stable longer term. And so we need some additional way to

say, okay, I don't just want to grow into this image. I'd like to then maintain that

shape once I have it. And the solution that this paper 

proposes is to have a pool of training examples, right? And we'll see this in code 

just now. So the idea here is that sometimes we'll start from a random state and we'll 

apply some number of updates. We'll apply our loss function and update our network. 

And then most of the time we'll take that final output and we'll put it back into the 

pool to be used again as a starting point for another round of training. And so this 

means that the network might see the initial state and have to produce the lizard. Or it 

might see a lizard that's already been produced and after some number of steps, it still needs 

to look like that lizard. And so this is adding like an additional constraint that says even 

after many more steps, we'd still like you to look like the final output. And so, yeah, 

it's also nice because like I mentioned here, initially the model ends up in various incorrect 

states that don't look like a lizard, but also don't look like the starting point. And 

it then has to learn to correct those as well. So we get this nice additional robustness 

from this in addition. And you can see here, now they have a thing that is able to grow into 

the lizard and then maintain that structure kind of indefinitely. And in this paper, 

they do this final step where they sometimes chop

off half of the image as additional like augmentation.
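
The pool idea can be sketched independently of any real network. Here `toy_update` is a hypothetical stand-in for the learned update rule, and the pool size, batch size, step count, and reset probability are illustrative values rather than the paper's.

```python
import torch

torch.manual_seed(0)

def toy_update(x):
    # Hypothetical stand-in for the learned rule: nudge every cell toward 1.
    return x + 0.1 * (1 - x)

pool_size, batch_size, n_steps, reset_prob = 16, 4, 5, 0.25
seed_state = torch.zeros(1, 8, 8)                 # initial state of one grid
pool = seed_state.repeat(pool_size, 1, 1, 1)      # (pool_size, 1, 8, 8)

for it in range(10):
    idx = torch.randperm(pool_size)[:batch_size]  # random samples from the pool
    x = pool[idx]                                 # indexing returns a copy
    if torch.rand(()) < reset_prob:
        x[0] = seed_state                         # sometimes start from scratch
    for _ in range(n_steps):
        x = toy_update(x)
    # ... compute the loss on x and take an optimizer step here ...
    pool[idx] = x.detach()                        # reuse outputs as later starting points
```

Because outputs go back into the pool, later batches mix fresh grids with grids that have already been through many updates, which is what pushes the rule toward long-term stability.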

JEREMY: So you could have like a bunch of drones or something that like can only see the ones

nearby and they don't have GPS or something and, you know, a gust of wind could come along and set

them off path and they still reconfigure themselves.

JOHNO: Yeah, yeah, exactly. Half of them go offline and 

run out of battery. That's fine. So a very, very cool paper… But you can see this kind 

of training is a little bit more complicated than oh, we just have a network and some target 

outputs and we optimize it. So we're not going to follow that paper exactly, although it 

should be fairly easy to tweak what we have to match that. We're instead going to go for 

a slightly different one by the same authors where they train even smaller networks to 

match textures. And so you can imagine our style loss is going to come in useful here. We'd 

like to produce a texture without necessarily worrying about the overall structure. We just 

want the style. And so the same sort of idea, the same sort of training, we're going to 

start from random and then after some number of steps, we'd like it to look like our target 

style image. And in fact, actually here's a spider web, which I hadn't mentioned until now.

JEREMY: And that's the one thing that makes a texture a texture in this case. Is it something you

can tile nicely? Is that? JOHNO: Yes. Yeah. And so that tiling is going 

to come almost for free. So we're going to have our input, we're going to look at our neighbors, 

we're going to feed that through a network and produce an output. And every cell is going 

to do the same rule, which will work fine by default if we set this up without thinking 

about tiling at all, except that at the edges, when we do like say a convolution to get our 

neighbors, we need to think about what happens for the neighbors of the cells on the edge, 

which ones should those be? And by default, those will just be padding of zero. And so 

those cells on the edge, a) they'll know they're on the edge. And b) they won't necessarily 

have any communication with the other side. If we want this to tile, what we're going 

to do is we're going to set our padding mode to circular. In other words, the neighbors 

of this top right cell are going to be these cells next to it here and these cells down in 

the bottom corner. And then, for free, we're going to get tiling. Okay. So enough waffle. 
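
A quick way to see what circular padding does, sketched with `torch.nn.functional.pad` on a tiny 3 by 3 grid (the grid values are just labels so the wrap-around is visible):

```python
import torch
import torch.nn.functional as F

# A 3x3 grid whose values label the cells:
# [[0, 1, 2],
#  [3, 4, 5],
#  [6, 7, 8]]
x = torch.arange(9.0).reshape(1, 1, 3, 3)

zero_padded = F.pad(x, (1, 1, 1, 1))                   # default: pad with zeros
circ_padded = F.pad(x, (1, 1, 1, 1), mode='circular')  # wrap around the edges

# 3x3 neighbourhood seen by the top-right cell (value 2) in each case.
top_right_zero = zero_padded[0, 0, 0:3, 2:5]
top_right_circ = circ_padded[0, 0, 0:3, 2:5]
```

With zero padding the top-right cell sees a row and a column of zeros, so it can tell it is on an edge. With circular padding it sees cells from the bottom row and the left column instead, so the grid behaves like a seamless tile.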

Let's get into code. We're going to download our style image. Oops. I need to do my imports. This is going to be our target style image.

And again, feel free to experiment with your own, please. We're going to set up a style

loss just like we did in lesson 17A. The difference being that we're 

going to have a batch dimension to our inputs to this calculate grams function, 

which I didn't do in the style transfer example, because you're always dealing with a single 

image. Everything else is going to be pretty much the same. So we can set up our style 

loss with the target image, and then we can feed in a new image or, in this case, a batch 

of images, and we're going to get back a loss. So we're, we're setting up our, our evaluation. 

We would like after some number of steps, our output to look like a spider web. Okay. 

Let's define our model. And here I'm making a very small model with only four channels 

and the number of hidden neurons in the brain is just going to be 

eight. You can increase these. JEREMY: Something I would be 

inclined to do, people might want to play with in style loss to target

is you're giving all the layers the same weight. A nice addition would be to have

a vector of weights that you could pass in and experiment with.
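
A sketch of the two ideas just mentioned: Gram matrices computed with a batch dimension, and a per-layer weight vector for the style loss. The function names and the random tensors standing in for VGG activations are assumptions made for this example, not the notebook's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def calc_grams(feats):
    # feats: (batch, channels, height, width) -> (batch, channels, channels)
    b, c, h, w = feats.shape
    return torch.einsum('bchw,bdhw->bcd', feats, feats) / (h * w)

def weighted_style_loss(input_feats, target_feats, layer_weights):
    # One feature map per layer. Targets have batch size 1 and are expanded
    # to match the batch of inputs.
    total = 0.0
    for f_in, f_tgt, wt in zip(input_feats, target_feats, layer_weights):
        g_in = calc_grams(f_in)
        g_tgt = calc_grams(f_tgt).expand(f_in.shape[0], -1, -1)
        total = total + wt * F.mse_loss(g_in, g_tgt)
    return total

# Random stand-ins for the activations of two layers.
input_feats = [torch.randn(3, 8, 5, 5), torch.randn(3, 16, 3, 3)]
target_feats = [torch.randn(1, 8, 5, 5), torch.randn(1, 16, 3, 3)]
loss = weighted_style_loss(input_feats, target_feats, layer_weights=[1.0, 0.5])
```

Setting all the weights to 1 recovers the equal weighting described in the lesson.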

JOHNO: Definitely. All right. So the, the world in which the 

cellular automata are going to live is going to be a grid. We're going to have 

some number of them. If we call this function, number of channels and the size, you could 

make it non-square if you care about that. For our perception in this little diagram here, 

we're going to use some hard coded filters and you could have these be learned, right? 

There'd be additional weights in the neural network. The reason they're hard coded is 

because the people who were working behind this paper, they wanted to keep 

the parameter counts really low. Literally like a few hundred

parameters total. And also they were kind of inspired by the…

JEREMY: A few hundred! That's crazy. Like cause we've been, even our 

little Fashion-MNIST models have had quite a few million parameters. JOHNO: 

Yeah. So this is, yeah, this is a total, I should have mentioned, that's one of the 

coolest things about these systems is they, they really can do a lot with very few 

parameters. And so these filters that we're just going to hard code are going to be the identity, 

right? Just looking at itself, and then a couple that are looking at gradients, again inspired 

by biology where even simple cells can sense gradients of chemical concentration. So we're 

going to have these filters. We're going to have a way to apply these filters 

individually. JEREMY: Just to help people understand that that first

one, for example, that's a 3 by 3. It's been kind of like visually flattened

out, but if you were to kind of lay it out, you could see it's an identity matrix.

JOHNO: Yeah. Anyway, so you can see these filters. This 

one is going to sense a horizontal gradient. This one is going to sense a vertical gradient. 

And the final one is called a Sobel filter. Yeah. So we've got some hard coded filters. 

We're going to apply them individually to each channel of the input, rather than 

having a kernel that has separate weights for each channel of the input. And so 

we can make a grid. We can apply our... JEREMY: I haven't seen, I didn't know circular 

was a padding mode before. So that just does the thing you said where it's basically going to 

circle around and kind of copy in the thing from the other side when you 

reach the edge. JOHNO: Yeah. Yeah. And this is very useful for avoiding

issues on the edges with those. You'll see a lot of implementations just deal with the fact

that they have slightly weird pixels around the edge and they don't really look into it.

This is one way to deal with that. Yeah. Okay. So we can make a grid. We can apply our filters

to get our model inputs. And this is going to be 16 inputs, right? Because we have four

channels and four filters. 16 inputs, that's going to be the input to our little brain.
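
A sketch of this perception step. The exact kernel values are assumptions: an identity kernel, two Sobel-style gradient kernels, and a Laplacian-style fourth kernel. The helper names `make_grids` and `perchannel_conv` are chosen for this example.

```python
import torch
import torch.nn.functional as F

identity = torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
laplacian = torch.tensor([[1., 2., 1.], [2., -12., 2.], [1., 2., 1.]])  # assumed values
filters = torch.stack([identity, sobel_x, sobel_x.T, laplacian])       # (4, 3, 3)

def make_grids(n, num_channels=4, size=16):
    # The world the cells live in: n grids, all starting at zero.
    return torch.zeros(n, num_channels, size, size)

def perchannel_conv(x, filters):
    # Apply every filter to every channel separately.
    b, ch, h, w = x.shape
    y = x.reshape(b * ch, 1, h, w)               # each channel becomes its own image
    y = F.pad(y, (1, 1, 1, 1), mode='circular')  # wrap-around neighbours at the edges
    y = F.conv2d(y, filters[:, None])            # (b*ch, n_filters, h, w)
    return y.reshape(b, -1, h, w)                # (b, ch*n_filters, h, w)

x = torch.rand(2, 4, 16, 16)
model_inputs = perchannel_conv(x, filters)       # 4 channels x 4 filters = 16 inputs
```

In this layout, output channel `c * 4 + f` holds filter `f` applied to input channel `c`, so every grid location ends up with its 16 numbers.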

And we have this for every location in the grid. So now how do we implement that little

neural network that we saw? The way it's shown in the diagram is it's a dense linear network.

And we can set that up. We have a linear layer with number of channels by four, which is

the number of filters as its number of inputs. Some hidden number of neurons. 

We have a ReLU. We have a second linear layer that's outputting one output per 

channel as the update. And so if we wanted to use this as our brain, what we'd have to do 

is we'd have to deal with these extra dimensions. So we take our batch by channel by height 

and width. We're going to map the batch and the height and the width all to one dimension 

and the channels to the second dimension. So now we have a big grid of 

16 inputs and lots of examples. JEREMY: I don't think we've 

seen einops.rearrange before. So let's put a bit of a bookmark to come back

to teach people about that in maybe the next- JOHNO: Yeah. Very, very useful function. But it is 

a little complicated because we have to rearrange our inputs into something that has just 16 

features, feed that through the linear layer, and then rearrange the outputs back to match 

the shape of our grid. So you can totally do that. And you can see what parameters we 

have on our brain. We have an 8 by 16 inputs and 8 biases for the first layer. And then 

we just have a 4 by 8 weight matrix for the second linear layer. And I've said bias 

equals false because we're having these networks propose an update. And if we want them to 

be stable, the update is usually going to be zero or close to it. And so there's no 

need for the bias and we want to keep the number of parameters as low as possible. That's 

kind of the name of the game. And so that's why we're setting bias equals false. Okay. 

So this is one way to implement this. It's not particularly fast. We have to do this 

reshaping and then we're feeding these examples through the linear layer. We can cheat by 

using convolution. So this might seem like, wait, that isn't the linear layer. We're going 

to apply this linear network on top of each set of inputs. But we can do that by having 

a filter size of one, a kernel size of one in our convolutional layer. So I have 16 input 

channels in my model input here. And I'm going to have eight output channels from this first 

convolutional layer. And my kernel size is going to be 1 by 1. And then I have ReLU and 

then I have another 1 by 1 convolutional layer. And so we can see this gives me the right 

shape output. And if I look at the parameters, my first convolutional layer has 8 by 16 by 1 

by 1 parameters in its filters. And so maybe spend a little bit of time convincing yourself 

that these two are doing the same operation. JEREMY: Yeah, this definitely is using cheating. I 

mean, this is quite elegant. And in languages like APL, actually, there's an operation called 

stenciling, which is basically the same idea as this idea of like applying 

some computation over a grid. JOHNO: And I should mention that convolutions 

are very efficient. All of our GPUs and things are set up for this kind of operation. And 

what makes neural cellular automata quite exciting is that because we're doing this 

convolution, you have an operation for every pixel that we're applying, right? This is 

looking at the neighborhood and producing an output. There's no global thing that we 

need to handle. And so this is actually exactly what GPUs were designed for. They're designed 

for running some operation for every pixel on your screen to render graphics or show you 

your video game or make your website scroll nice and slick. And so we can take advantage 

of that kind of built in bias of the hardware for doing lots of little operations in parallel 

to make these go really, really fast. And I'll show you just now we can run 

these realtime in the browser, which is quite satisfying. Okay. So now that 

we have all that infrastructure in place, I'm just gonna put it into a class. 
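
One way to convince yourself that the dense brain and the 1 by 1 convolution brain discussed above do the same thing is to copy the weights across and compare outputs. This sketch uses plain `permute` and `reshape` in place of `einops.rearrange`; the variable names are chosen for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_channels, n_filters, hidden_n = 4, 4, 8

dense_brain = nn.Sequential(
    nn.Linear(n_channels * n_filters, hidden_n), nn.ReLU(),
    nn.Linear(hidden_n, n_channels, bias=False))
conv_brain = nn.Sequential(
    nn.Conv2d(n_channels * n_filters, hidden_n, kernel_size=1), nn.ReLU(),
    nn.Conv2d(hidden_n, n_channels, kernel_size=1, bias=False))

# Give both brains identical weights: (out, in) -> (out, in, 1, 1).
with torch.no_grad():
    conv_brain[0].weight.copy_(dense_brain[0].weight[:, :, None, None])
    conv_brain[0].bias.copy_(dense_brain[0].bias)
    conv_brain[2].weight.copy_(dense_brain[2].weight[:, :, None, None])

model_inputs = torch.randn(2, 16, 8, 8)
b, c, h, w = model_inputs.shape

# Dense route: 'b c h w -> (b h w) c', apply the MLP, then undo the reshape.
flat = model_inputs.permute(0, 2, 3, 1).reshape(b * h * w, c)
dense_out = dense_brain(flat).reshape(b, h, w, n_channels).permute(0, 3, 1, 2)

# Convolutional route: no reshaping needed.
conv_out = conv_brain(model_inputs)

n_params = sum(p.numel() for p in conv_brain.parameters())  # 8*16 + 8 + 4*8
```

Both routes produce the same numbers, and the whole brain has 168 parameters.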

My SimpleCA is my cellular automata. We have our little brain two convolutional layers and 

a ReLU. Optionally we can set the weights of the second layer to zero again, because you 

want to start by being very conservative in terms of what updates we produce. Not necessarily 

necessary, but it does help the training. And then in our forward pass… 

JEREMY: I would be inclined, I don't know if it matters, but I'd be inclined

to put that, I would be inclined to do, you know, nn.init constant zero or put that in

a no_grad like often initializing things without no_grad can cause problems.

JOHNO: Okay. And I'll look into that. In the forward method, we're 

going to apply our filters to get our model inputs… JEREMY: Oh, 

you got .data.zero_. Okay. So that's, that's fine. JOHNO: Yeah. I think

this is the built in, the built in method. All right. JEREMY: 

Oh, it's the .data, which is the thing that makes it. Yeah. You don't need 

torch.no_grad because you've got .data. So yeah, it's all good. Cool. JOHNO: 

Okay. And so the forward is applying the filters. It's feeding it through the 

first convolutional layer, then the ReLU, then the second layer. And then it's doing this final 

step, which again goes back to the original paper somewhere in here, they mentioned that 

they are inspired by biology. And one thing that you don't have in a biological system 

is some sort of global clock where everything updates at exactly the same second. It's much 

more random and organic. Each one is almost independent. And so to mirror that, what we 

do here is we create a random update mask. And if you go in, let's just actually write 

this ops. Let's just make, make a, make a cell and check that this is what we're doing. 

So I'm going to just go b, h, w is 1… just to visualize update_rate. There we go. Yeah. 

So this is creating this random mask, some zeros and some ones according to what our 

update rate is. And this is going to determine whether we apply this update to the 

original input or not. JEREMY: It's a lot like dropout. JOHNO: Yeah, exactly.
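
Putting the pieces together, a sketch of a class along the lines described: perception with hard-coded filters, two 1 by 1 convolutions with a ReLU between them, an optionally zeroed second layer, and a random update mask. The kernel values, default sizes, and helper names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
FILTERS = torch.stack([
    torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]]),    # identity
    sobel_x, sobel_x.T,                                          # gradients
    torch.tensor([[1., 2., 1.], [2., -12., 2.], [1., 2., 1.]]),  # assumed fourth kernel
])

def perchannel_conv(x, filters):
    b, ch, h, w = x.shape
    y = F.pad(x.reshape(b * ch, 1, h, w), (1, 1, 1, 1), mode='circular')
    return F.conv2d(y, filters[:, None]).reshape(b, -1, h, w)

class SimpleCA(nn.Module):
    def __init__(self, num_channels=4, hidden_n=8, zero_w2=True):
        super().__init__()
        self.w1 = nn.Conv2d(num_channels * 4, hidden_n, 1)
        self.relu = nn.ReLU()
        self.w2 = nn.Conv2d(hidden_n, num_channels, 1, bias=False)
        if zero_w2:
            self.w2.weight.data.zero_()   # start out proposing no update at all

    def forward(self, x, update_rate=0.5):
        y = perchannel_conv(x, FILTERS.to(x.device))   # perceive the neighbourhood
        y = self.w2(self.relu(self.w1(y)))             # proposed update per cell
        b, c, h, w = y.shape
        # 1 with probability update_rate, else 0; one value per grid location.
        update_mask = (torch.rand(b, 1, h, w, device=x.device) + update_rate).floor()
        return x + y * update_mask

# Visualizing the mask on its own: zeros and ones at the chosen rate.
example_mask = (torch.rand(1, 1, 4, 4) + 0.5).floor()
```

With the second layer zeroed, the first forward pass returns the grid unchanged, which is the conservative starting point described above.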

And why this is nice. If you imagine we start from a perfectly uniform grid and then every

cell is running the exact same rule after one update, we will still have a perfectly

uniform grid. There's no way for there to be any randomness. And so we can never like

break out of that. Whereas once we add these random updates, only a subset of cells are

going to be updated and now there's some differences, they have different 

neighborhoods and things. And so then we get this like added randomness 

in, and this is very much like in a biological system, no cell is going to be identical. So 

that's a little bit of additional complexity, but again, inspired by nature and inspired 

by the paper. With all of this in place, we can do our training. We're going to use the same 

dummy dataset idea as before. We are going to have a progress callback, which is a lot 

of code, but it's all just basically sitting around for doing some plotting. So I'm not 

going to spend too much time on that. And then the fun stuff is going to happen in our 

training callback. And so now we are actually getting deep into the weeds. We're modifying 

our prediction function. This is much more complicated than just feeding a batch of data 

through our model. We are setting up a pool of grids, right? 256 examples. And these are 

all going to start out as just uniform zeros. But every time we call predict, we're going 

to pick some random samples from that pool. We're occasionally going to reset those samples 

to the initial state. And then we can apply the model a number of times. And it's worth 

thinking here, if we're applying this model 50 steps, this is like a 50 layer deep model 

all of a sudden. And so we start to get some of these...

JEREMY: Should it be learn.model rather than self.learn.model?

JOHNO: Oh yes, because I already have learn. Nice. Yeah. So we've got to just be aware that by

applying this a large number of times, we could get something like gradient exploding

and things like that, which we'll deal with a little bit later. But we apply the model

a large number of steps. Then we put those final outputs back in the pool for the next

round of training and we store our predictions. These are the outputs after we've applied

a number of steps. And in the loss, we're going to use a style loss saying, does this

match the style of my target image? And we're going to add an overflow loss that penalizes

it if the values are out of bounds. Just to try and...

JEREMY: Change self.learn here too.

JOHNO: Ah yes, thank you. I think I read this before

we changed the... No, because I've got the callback there. Okay, my bad.
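
The pool-based predict step described above could look roughly like the following. This is a sketch under stated assumptions, not the lesson's callback: TinyCA is a hypothetical stand-in model, and pool_predict, batch_size, n_steps and reset_prob are illustrative names and values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyCA(nn.Module):
    """Hypothetical stand-in for the lesson's model: a residual per-cell update."""
    def __init__(self, channels=4, hidden=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1, padding_mode="circular"),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x, update_rate=0.5):
        # Random per-cell mask: only some cells apply their update.
        mask = (torch.rand(x.shape[0], 1, *x.shape[2:]) < update_rate).float()
        return x + self.net(x) * mask

def pool_predict(model, pool, batch_size=4, n_steps=10, reset_prob=0.25):
    """Sample grids from the pool, run the model repeatedly, write results back."""
    idx = torch.randperm(pool.shape[0])[:batch_size]   # random, distinct samples
    x = pool[idx].clone()
    reset = torch.rand(batch_size) < reset_prob        # occasionally restart a grid
    x[reset] = 0.0                                     # initial state: all zeros
    for _ in range(n_steps):                           # n_steps acts like network depth
        x = model(x)
    pool[idx] = x.detach()                             # store outputs for later rounds
    return x, idx

model = TinyCA()
pool = torch.zeros(256, 4, 16, 16)                     # 256 grids, all starting as zeros
preds, idx = pool_predict(model, pool)
```

Writing the detached outputs back means later batches start from grids that have already been evolved, so the model also sees states many more steps away from the initial one.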

JEREMY: One more self.learn.preds.clamp and the overflow loss one.

JOHNO: Yes, thank you. There we go. And yeah, so get_loss is doing a style_loss plus this overflow

loss just to keep things from growing exponentially out of bounds. 

Again, something that's quite likely to happen when you're applying a large 

number of steps. And so we really want to penalize that.
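
As a rough sketch of that combined loss: the style term below compares Gram matrices computed directly on the grid channels, which is a simplification, since the lesson's own style_loss is not shown in this passage. The [-1, 1] bounds and the overflow_weight parameter are also assumptions.

```python
import torch

def gram(x):
    """Channel-by-channel Gram matrix of a (b, c, h, w) tensor."""
    b, c, h, w = x.shape
    f = x.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (h * w)

def style_loss(preds, target):
    # Simplified stand-in: match Gram matrices of the raw channels.
    return (gram(preds) - gram(target)).pow(2).mean()

def overflow_loss(preds, lo=-1.0, hi=1.0):
    # Zero inside [lo, hi]; grows linearly with the distance outside it.
    return (preds - preds.clamp(lo, hi)).abs().sum()

def get_loss(preds, target, overflow_weight=1.0):
    return style_loss(preds, target) + overflow_weight * overflow_loss(preds)

in_bounds = torch.full((2, 4, 8, 8), 0.5)
too_big = torch.full((1, 1, 1, 2), 3.0)   # two values, each 2.0 above the upper bound
```

Because the overflow term is exactly zero for in-range values, it only affects training when the repeated updates start to blow up.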

And the final thing is in learn.backward, I've added a technique that is probably going

to be quite useful in some other places as well called gradient normalization. And so

we're just running through the parameters of our model and we are normalizing their gradients.

And this means that even if they're really, really tiny or really, really large at the end

of that multiple number of update steps, this is kind of a hack to bring this back into

control. JEREMY: Let's put a bookmark to come back

to that as well in more detail. And I guess that before_fit, maybe we don't need anymore.
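
One way gradient normalization could be written, as a sketch assuming PyTorch: after the backward pass, each parameter's gradient is rescaled to unit norm. The function name normalize_grads and the eps value are assumptions.

```python
import torch
import torch.nn as nn

def normalize_grads(model, eps=1e-8):
    """Rescale each parameter's gradient to (approximately) unit L2 norm."""
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= (p.grad.norm() + eps)

torch.manual_seed(0)
model = nn.Linear(3, 2)

# Deliberately huge loss scale, to mimic gradients exploding over many steps.
loss = (model(torch.ones(1, 3)) * 1e6).sum()
loss.backward()
normalize_grads(model)

grad_norms = [p.grad.norm().item() for p in model.parameters()]
```

After this call every gradient has norm close to 1 whatever its original scale, so the optimizer step size no longer depends on how many times the model was applied.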

JOHNO: Oh right, because this is now default. Okay. Oh, so I should have set this running before

we started talking. It is going to take a little while. But you can see my progress

callback here is scatter plotting the loss. And the reason you'll see in the callback

here, I'm setting the y limits to the minimum of the initial set of losses is just because

the overflow loss is sometimes so much larger than the rest of the loss ones, that you get

this really bad scaling. So using a log scaling and clipping the bounds tends to help just

visualize what's actually important, like the overall trend.

JEREMY: I guess the benefit of just not running it is that

we can see the output without you running it. JOHNO: Oh, right. Yeah. Yeah. So you can see

the outputs here. So what I'm visualizing is the examples that we've drawn from the pool every 

time we're drawing. In this case, I've got a fixed batch size that should probably be 

an argument. But you can take a look at them and kind of compare them to the style target

and see that initially they don't really look too similar. After some training, we get some

definite like webby tendencies. And we can take this model and then apply it to like a 

random grid and log the images every hundred steps or whatever. And you can see that starting 

from this random position, it quite quickly builds this pattern that doesn't look perfectly 

spider webby. But in its defense, this model has 168 parameters.

JEREMY: And it tiles. JOHNO: That to me is like the

magic of these models is that even with very few parameters, they're

able to do something pretty impressive. And if you would like, go back up to where we define the number of channels and the number of

layers. If you give it more channels to work with, 8 or 16, and more hidden neurons,

32 or 64, you still have a tiny model. But it's able to capture some much nicer patterns.
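
For a sense of scale, here is one configuration that gives exactly 168 parameters. The layer layout is an assumption, since the model definition is not part of this passage: 4 channels, 4 fixed filters per channel feeding a 1x1 convolution with 8 hidden units and a bias, then a bias-free 1x1 convolution back to 4 channels. The helper build_update_net is hypothetical.

```python
import torch.nn as nn

def build_update_net(n_channels=4, hidden_n=8, n_filters=4):
    """Hypothetical per-cell update network made of 1x1 convolutions."""
    return nn.Sequential(
        nn.Conv2d(n_channels * n_filters, hidden_n, 1),   # 16*8 weights + 8 biases = 136
        nn.ReLU(),
        nn.Conv2d(hidden_n, n_channels, 1, bias=False),   # 8*4 weights = 32
    )

def count_params(m):
    return sum(p.numel() for p in m.parameters())

small = count_params(build_update_net(4, 8))      # 136 + 32 = 168
larger = count_params(build_update_net(16, 64))   # 64*64 + 64 + 64*16 = 5184
```

Under these assumptions, even the 16-channel, 64-hidden version has only about five thousand parameters.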

So I would say, please on the forums, try some larger sizes. I'll also maybe post some 

results. And just to give you a little preview of what's possible. So I did a project before 

using miniai. So the code's a little messy and hacky. But what I did was I logged the

cellular automata. Well, maybe I should show this. This is way outside of the bounds

for this course, but you can write something called a fragment shader in WebGL. So this 

is designed to run in the browser. It's a little program that runs once for every pixel. 

And so you can see here, I have the weights of my neural network. I'm sampling the

neighborhood of each cell. We have our filters. We have our activation function. This is in

a language called GLSL. We're running through the layers of our network and proposing our 

updates. And this one here, I just had more, I think more hidden neurons, more channels 

and optimized with a slightly different loss function. So it was a style loss plus CLIP to 

the prompt, I think, dragon scales or glowing dragon scales. And you can see this is running in 

realtime or near realtime because I'm recording. And it's interactive. You can click to kind 

of like zero out the grid and then see it like rebuild within that. And so in a similar 

way with Mordvintsev's report, I'm logging these kind of interactive HTML previews. We've 

got some videos and just logging the grids from the different things. And so you can 

see these are still pretty small as far as these networks go. I think they only have 

four channels because I'm working with RGBA shaders but quite fun to see what you can do 

with these. And if you pick the right style images and train for a bit longer and use 

a few more channels, you can do some really fun stuff and you can get really creative 

applying them at different scales. Or I did some messing around with video, which again, is 

just like messing with the inputs to different cells to try and get some cool patterns. So yeah, 

to me, this is a really exciting, fun niche... JEREMY: Amazing

JOHNO: Yeah, I don't know if there's too many practical 

applications at this stage, but I'm already thinking of denoising cellular automata and 

stylizing or image restoration cellular automata. And you can really have a lot of fun with 

this structure. And I also thought it was just a good demo of how far can we push what 

you can do with a training callback to have this pool training and gradient normalization 

and all these extra things added in. Very, very different from, here's a batch of images 

and a batch of labels. So I hope you found that interesting. I'll stop sharing my screen 

and then Jeremy, if you have any questions or follow ups.

JEREMY: No, that's amazing. Thank you so much. I actually have to go, but that's just one of the coolest

things I've seen.
