Lesson 18: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

Hi folks, thanks for joining me for Lesson 18.

We're going to start today in Microsoft Excel. You'll see there's an Excel folder 

actually in the course22p2 repo. And in there there's a spreadsheet 

called graddesc as in gradient descent, which I guess

we should zoom in a bit here. So there's some instructions here, but this is basically 

describing what's in each sheet. We're going to be looking at the various 

SGD accelerated approaches we saw last time, but

done in a spreadsheet. We're going to do something very, very simple, 

which is to try to solve a linear regression. So the actual data was generated with y = 

ax + b, where a, which is the slope, was 2, and b, which is the intercept, was 30. And so you can see we've got 

some random numbers here. And then over here, we've 

got the ax + b calculation. So then what I did is I copied and pasted as 

values, just one set of those random numbers into the next sheet called basic.

This is the basic SGD sheet. So that's what x and y are.

And so the idea is we're going to try to use SGD to learn that the intercept is 30 and

the slope is 2. So the way we do SGD is we, so those 

are, those are our weights or parameters. So the way we do SGD is we start 

out at some random kind of guess. So my random guess is going to be 1 

and 1 for the intercept and slope. And so if we look at the very first data point, 

which is x is 14 and y is 58, the intercept and slope are both 1, then 

we can make a prediction. And so our prediction is just equal 

to slope times x plus the intercept. So the prediction will be 15.

Now, actually, the answer was 58. So we're a long way off.

So we're going to use mean squared error. So the mean squared error is just the 

error, so the difference, squared. Okay, so one way to calculate how much would 

the prediction, sorry, how much would the error change, so how much would the squared 

error, I should say, change if we changed the intercept, which is b, would be just to 

change b by a little bit, change the intercept by a little bit and see what the error is.

So here that's what I've done is I've just added 0.01 to the intercept and then calculated

y and then calculated the difference squared. And so this is what I mean by er b1.

This is the error squared I get if I change b by 0.01.

So it's made the error go down a little bit. So that suggests that we should probably 

increase b, increase the intercept. So we can now calculate the estimated derivative 

by simply taking the change from when we use the actual intercept using 

the intercept plus 0.01. So that's the rise and we divide it by 

the run, which is, as we said, is 0.01. And that gives us the estimated derivative 

of the squared error with respect to b, the intercept.

Okay, so it's about negative 86, 85.99. So we can do exactly the same thing for a, 

so change the slope by 0.01, calculate y, calculate the difference and square it.

And we can calculate the estimated derivative in the same way, rise, which is the difference

divided by run, which is 0.01. And that's quite a big number, minus 1,200.

In both cases, the estimated derivatives are negative.

So that suggests we should increase the intercept and the slope.

And we know that that's true because actually the intercept and the slope are both bigger

than one. The intercept is 30, should be 

30 and the slope should be 2. So there's one way to calculate the derivatives.

Another way is analytically. And the derivative of squared is 2 times.

So here it is here. I've just written it down for you.

So here's the analytic derivative. It's just two times the difference.

And then the derivative for the slope is here. And you can see that the estimated version 

using the rise over run and the little 0.01 change and the actual, they're pretty similar.

Okay. And same thing here, they're pretty similar.
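
The comparison between the estimated and the analytic derivatives can be sketched in a few lines of Python. This is only an illustration of the spreadsheet's first row (x = 14, y = 58, slope and intercept both 1, nudge of 0.01); the function and variable names are made up for this sketch.

```python
# First data point and starting guess, as described for the basic sheet.
x, y = 14.0, 58.0
a, b = 1.0, 1.0      # slope, intercept
eps = 0.01           # the small nudge added to each weight

def sq_err(a, b):
    """Squared error of the prediction a*x + b against y."""
    return (a * x + b - y) ** 2

# Finite differencing: rise over run, one extra evaluation per weight.
est_db = (sq_err(a, b + eps) - sq_err(a, b)) / eps
est_da = (sq_err(a + eps, b) - sq_err(a, b)) / eps

# Analytic derivatives of (a*x + b - y)**2.
diff = a * x + b - y
ana_db = 2 * diff        # with respect to the intercept
ana_da = 2 * diff * x    # with respect to the slope

print(f"d/db: estimated {est_db:.2f}, analytic {ana_db:.2f}")
print(f"d/da: estimated {est_da:.2f}, analytic {ana_da:.2f}")
```

The estimated values come out close to, but not exactly equal to, the analytic ones, which is the kind of agreement you would look for when testing a hand-derived gradient.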

So anytime I calculate gradients kind of analytically, but by hand, 

I always like to test them against doing the actual rise over run 

calculation with some small number. And this is called using the 

finite differencing approach. We only use it for testing because it's slow 

because you have to do a separate calculation for every single weight.

But it's good for testing. We use analytic derivatives 

all the time in real life. Anyway, so however we calculate the 

derivatives, we can now calculate a new slope. So our new slope will be equal to the previous 

slope minus the derivative times the learning rate, which we've just set here at 0.0001.

And we can do the same thing for the intercept as you see.

And so here's our new slope intercept. So we can use that for the second row of data.

So the second row of data is x equals 86, y equals 202.

So our intercept is not one, one anymore. Intercept and slope are not one, 

one, but they're 1.01 and 1.12. So here's, we're just using a formula just 

to point at the new intercept and slope. We can get a new prediction and squared error 

and derivatives, and then we can get another new slope and intercept.

And so that was a pretty good one, actually. It really helped our slope head in the right 

direction, although the intercept's moving pretty slowly.
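
As a rough sketch of the per-row update, assuming the analytic derivatives and the learning rate of 0.0001 from the sheet (the helper name `sgd_step` is invented for this example):

```python
lr = 0.0001  # learning rate used on the basic sheet

def sgd_step(a, b, x, y, lr):
    """One batch-size-1 SGD update of slope a and intercept b."""
    diff = a * x + b - y      # prediction minus target
    grad_b = 2 * diff         # derivative of squared error w.r.t. b
    grad_a = 2 * diff * x     # derivative of squared error w.r.t. a
    return a - lr * grad_a, b - lr * grad_b

# First row: x = 14, y = 58, starting from slope = intercept = 1.
a1, b1 = sgd_step(1.0, 1.0, x=14.0, y=58.0, lr=lr)
print(round(a1, 2), round(b1, 2))

# Second row: x = 86, y = 202, starting from the updated weights.
a2, b2 = sgd_step(a1, b1, x=86.0, y=202.0, lr=lr)
print(round(a2, 2), round(b2, 2))
```

Because the slope's gradient is multiplied by x, the slope moves much further per step than the intercept does, which is the slow intercept progress seen in the sheet.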

And so we can do that for every row of data. Now strictly speaking, this is not mini 

batch gradient descent that we normally do in deep

learning. It's a simpler version where 

every batch is a size one. So I mean, it's still stochastic gradient descent.

It's just not, it's just a batch size of one. I think sometimes it's called online 

gradient descent, if I remember correctly. So we go through every data point in our very 

small data set until we get to the very end. And so at the end of the first epoch, we've 

got an intercept of 1.06 and a slope of 2.57. And those indeed are better estimates 

than our starting estimates of 1, 1. So what I would do is I would copy our slope 

2.57 up to here, 2.57, I'll just type it for now.

And I'll copy our intercept up to here. And then it goes through the entire epoch 

again, and we get another intercept and slope. And so we could keep copying and pasting 

and copying and pasting again and again. And we can watch the root 

mean squared error going down. Now that's pretty boring doing 

that copying and pasting. So what we could do is fire up Visual Basic for Applications.

And sorry, this might be a bit small. I'm not sure how to increase the font size.

And what it shows here. So sorry, this is a bit small.

So you might want to just open it on your own computer to be able to see it clearly.

But basically it shows I've created a little macro where if you click on the reset button,

it's just going to set the slope and constant to 1 and calculate.

And if you click the run button, it's going to go through five times, calling one step.

And what one step is going to do is it's going to copy the slope, last slope to the new slope

and the last constant intercept to the new constant intercept.

And also do the same for the RMSE. And it's actually going to paste it down to 

the bottom for reasons I'll show you in a moment.

So if I now run this, reset and then run.

There we go. You can see it's run it five times.

And each time it's pasted the RMSE. And here's a chart of it showing it going down.

So you can see the new slope is 2.57. Your intercept is 1.27.

I could keep running it another five. So this is just doing copy, paste, 

copy, paste, copy, paste; five times. And you can see that the RMSE is 

very, very, very slowly going down. And the intercept and slope are very, very, 

very slowly getting closer to where they want to be.

The big issue really is that the intercept is meant to be 30.

It looks like it's going to take a very, very long time to get there, but it will get there

eventually if you click run enough times or maybe set the VBA macro to loop more than

five steps at a time. But you can see it's moving very slowly.

And importantly though, you can see like it's kind of taking this linear route every time

these are increasing. So why not increase it by more and more and more?

And so you'll remember from last week that that is what momentum does.

So on the next sheet, we show momentum. And so everything's exactly the same as the 

previous sheet, but this sheet, we didn't bother with the finite differencing.

We just have the analytic derivatives, which are exactly the same as last time.

The data's the same as last time. The slope and intercept are the 

same starting points as last time. And this is the new a and new b that we get. But what we do this time is that we've added 

a momentum term, which we're calling beta. And so the beta is going to these cells here. And what are these cells?

Well, what these cells are is that they're, maybe it's most interesting to take this one

here. What it's doing is it's taking the gradient 

and it's using that to update the weights, but it's also taking the previous update.

So you can see here, the blue one, minus 25. So that is going to get 

multiplied by 0.9, the momentum. And then the derivative is then multiplied by 0.1. So this is momentum, which is 

getting a little bit of each. And so then what we do is we then 

use that instead of the derivative to multiply by our

learning rate. So we keep doing that again and 

again and again, as per usual. And so we've got one column, which is calculating 

the momentum, you know, lerped version of the gradient for both b and for a. And so 

you can see that for this one, it's the same thing.

It looks at what was the previous move. And that's going to be 0.9 of what you're going 

to use for your momentum version gradient. And 0.1 is for this version, 

the momentum gradient. And so then that's again, what we're going 

to use to multiply by the learning rate. And so you can see what happens is when you 

keep moving in the same direction, which here is we're saying the derivative is 

negative again and again and again. So it gets higher and higher and higher. And ditto over here.

And so particularly with this big jump we get, we keep getting big jumps because still

we want to, there's still negative gradient, negative gradient, negative gradient.

So if we, at the end of this, our new, our b and our a have jumped ahead.
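
Here is a minimal sketch of that momentum update, using the 0.9 and 0.1 weighting described above. The constant gradient of -100 and the starting momentum of 0 are made-up values, chosen only to show the steps growing while the gradient keeps the same sign.

```python
beta = 0.9     # momentum
lr = 0.0001    # learning rate, as on the basic sheet

def momentum_step(w, grad, prev_mom, lr=lr, beta=beta):
    """Lerp the previous update with the new gradient, then step."""
    mom = beta * prev_mom + (1 - beta) * grad
    return w - lr * mom, mom

w, mom = 1.0, 0.0          # assumed starting weight and momentum
step_sizes = []
for _ in range(5):
    new_w, mom = momentum_step(w, grad=-100.0, prev_mom=mom)
    step_sizes.append(new_w - w)
    w = new_w

print(step_sizes)  # each step is larger than the one before
```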

And so we can click run. And we can keep clicking it.

And you can see that it's moving, you know, not super fast, but certainly faster than

it was before. So if you haven't used VBA, Visual Basic 

for Applications before, you can hit Alt F11 or Option

F11 to open it. And you may need to go into your preferences 

and turn on the developer tools so that you can see it.

You can also right click and choose assign macro on a button and you can see what macro

has been assigned. So if I hit Alt F11 and I can just double, 

or you can just double click on the sheet name and it'll open it up.

And you can see that this is exactly the same as the previous one.

There's no difference here. Oh, one difference is that to keep 

track of momentum at the very, very end. So I've got my momentum 

values going all the way down. The very last momentum I copy back up to the 

top, each epoch so that we don't lose track of our kind of optimizer state, if you like.

Okay. So that's what momentum looks like.

So, yeah, if you're kind of a more of a visual person like me, you like to see everything

laid out in front of you and like to be able to experiment, which I think is a good idea.

This can be really helpful. So RMSProp, we've seen, and it's very similar 

to momentum, but in this case, instead of keeping track of kind of a lerped moving average 

an exponential moving average of gradients, we're keeping track of a moving 

average of gradients squared. And then rather than simply adding that, using 

that as the gradient, what instead we're doing is we are dividing our gradient 

by the square root of that. And so remember the reason we were doing that 

is to say, if there's very little variation, very little going on in your gradients, 

then you probably want to jump further. So that's RMSProp.
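
A minimal sketch of an RMSProp-style step in plain Python follows. The learning rate, the `eps` term, and initializing the running average to the first squared gradient are assumptions made for this example, not details taken from the spreadsheet.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.03, beta=0.9, eps=1e-8):
    """Divide the gradient by the root of a moving average of grad**2."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg

# Two weights with very different gradient scales (hypothetical values).
small_grad, big_grad = -0.01, -100.0
w_small, _ = rmsprop_step(1.0, small_grad, sq_avg=small_grad ** 2)
w_big, _ = rmsprop_step(1.0, big_grad, sq_avg=big_grad ** 2)

print(w_small - 1.0, w_big - 1.0)  # both moves are close to lr
```

With plain SGD the weight with the tiny gradient would barely move; here both weights move by about the learning rate, because each gradient is normalized by its own recent magnitude.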

And then finally, Adam, remember, was a combination of both. So in Adam, we've got both the lerped version 

of the gradient and we've got the lerped version of the gradient squared.

And then we do both. When we update, we're both dividing 

the gradient by the square root of the lerped, that is, the exponentially

weighted moving average of the squared gradients. And we're also using the momentumized version.
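
Putting the two together, a bare-bones Adam-style step might look like the sketch below. It leaves out the bias correction found in the standard Adam algorithm, and the hyperparameter values are assumptions for illustration only.

```python
import math

def adam_step(w, grad, mom, sq_avg, lr=0.01, beta1=0.9, beta2=0.99, eps=1e-8):
    """Momentum on the gradient, divided by the root of the averaged grad**2."""
    mom = beta1 * mom + (1 - beta1) * grad
    sq_avg = beta2 * sq_avg + (1 - beta2) * grad ** 2
    w = w - lr * mom / (math.sqrt(sq_avg) + eps)
    return w, mom, sq_avg

# One step from zero state with the intercept gradient of the first row.
w, mom, sq_avg = adam_step(1.0, grad=-86.0, mom=0.0, sq_avg=0.0)
print(w, mom, sq_avg)
```

Both pieces of optimizer state (`mom` and `sq_avg`) have to be carried from one step to the next, which is why the Adam sheet has more columns to copy back up at the end of each epoch.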

And so again, we just go through that each time. And so if I reset, run, and so, oh wow, look at that. It jumped up there very quickly because 

remember we wanted to get to 2 and 30. So just two sets.

So that's 5, that's 10 epochs. Now if I keep running it, it's 

kind of now not getting closer. It's kind of jumping up and down 

between pretty much the same values. So probably what we'd need to do is 

decrease the learning rate at that point. And yeah, that's pretty good.

And now it's jumping up and down between the same two values again.

So it'd probably decrease the learning rate a little bit more.

And I kind of like playing around like this because it gives me a really intuitive feeling

for what training looks like. So I've got a question from our YouTube 

chat, which is how is J33 being initialized? So this is just, what happens is we take the 

very last cell, well actually all these last four cells, and we copy them to here as values.

So this is what those looked like in the last epoch.

So basically we go copy and then paste as values.

And then this here just refers back to them as you see. And it's interesting that they're kind of, 

you can see how they're exact opposites of each other, which is really, 

you can really see how they're, it's just fluctuating around

the actual optimum at this point. Okay, thank you to Sam Watkins.

We've now got a nicer sized editor. That's great.

Where are we Adam? Okay, so with Adam, basically it all looks 

pretty much the same, except now we have to copy and paste both our momentums and our squared gradients, and of course the slopes

and intercepts at the end of each step. But other than that, it's 

just doing the same thing. And when we reset it, it just sets 

everything back to their default values. Now one thing that occurred to me, you know, 

when I first wrote this spreadsheet a few years ago was that manually changing 

the learning rate seems pretty annoying. Now of course we can use a scheduler, but 

a scheduler is something we set up ahead of time.

And I did wonder if it's possible to create an automatic scheduler.

And so I created this Adam annealing tab, which honestly I've never really got back

to experimenting with. So if anybody's interested, 

they should check this out. What I did here was I used exactly the same 

spreadsheet as the Adam spreadsheet, but I added an extra, after I do this step, I added an 

extra thing, which is I automatically decreased the learning rate in a certain situation.

And the situation in which I decreased it was I kept track of the average of the squared

gradients. And anytime the average of the squared gradients 

decreased during an epoch, I stored it. So I basically kept track of the 

lowest squared gradients we had. And then what I did was if we got a… if that 

resulted in the gradients, the squared gradients average halving, then I would 

decrease the

learning rate by a factor of 4. So I was keeping track of this gradient ratio.

Now when you see a range like this, you can find what that's referring to by just clicking

up here and finding gradient ratio. And there it is, and you can see that it's 

equal to the ratio between the average of the squared gradients versus the 

minimum that we've seen so far. So this is kind of like, my theory here was 

thinking that, yeah, basically as you train, you kind of get into flatter 

and more stable areas. And as you do that, that's a sign that, you 

know, you might want to decrease your learning rate.

So yeah, if I try that, if I hit run again, it jumps straight to a pretty good value,

but I'm not going to change the learning rate manually.

I just press run and you can see it's changed the learning rate automatically now.

And if I keep hitting run without doing anything, look at that.

It's got pretty good, hasn't it? And the learning rates got lower and lower 

and we basically got almost exactly the right answer.

So yeah, that's a little experiment I tried. So maybe some of you should try experiments 

around whether you can create an automatic annealer using miniai.

I think that would be fun. So that is an excellent segue into our 

notebook because we are going to talk about annealing

now. So we've seen it manually before 

where we've just decreased the learning rate in a notebook

and like run a second cell. And we've seen something in Excel.

But let's look at what we generally do in PyTorch. So we're still in the same notebook as 

last time, the Accelerated SGD notebook. And now that we've re-implemented all the 

main optimizers that people tend to use most of the time from scratch, we 

can use PyTorch's, of course. So let's see, look now at how we can do our own learning rate scheduling or annealing

within the miniai framework. Now we've seen when we implemented the learning 

rate finder that we saw how to create something that adjusts the learning rate.

So just to remind you, this was all we had to do. So we had to go through the optimizer's parameter 

groups, and in each group set the learning rate to times equals some multiplier.
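
As a reminder of what that looks like on a PyTorch optimizer, here is a small sketch. The tiny linear model and the multiplier of 1.3 are placeholders for this example.

```python
from torch import nn, optim

model = nn.Linear(3, 1)                      # placeholder model
opt = optim.SGD(model.parameters(), lr=0.006)

lr_mult = 1.3                                # hypothetical multiplier
for g in opt.param_groups:                   # each group is a dict
    g['lr'] *= lr_mult

print(opt.param_groups[0]['lr'])
```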

That was for the learning rate finder. So since we know how to do that, we're not going 

to bother re-implementing all the schedulers from scratch because we know the basic idea now.

So instead, what we're going to do is have a look inside the torch.optim.lr_scheduler

module and see what's defined in there. So the lr_scheduler module, you can 

hit dot, tab and see what's in there. But something that I quite like to do is to 

use dir because dir(lr_scheduler) is a nice little function that tells you 

everything inside a Python object. And this particular object is a module object 

and it tells you all the stuff in the module. When you use the dot version tab, it doesn't 

show you stuff that starts with an underscore, by the way, because that stuff's considered 

private, whereas dir does show you that stuff. I can kind of see from here 

that the things that start with a capital and then a small

letter look like the things we care about. We probably don't care about this.

We probably don't care about these. So we can just do a little list comprehension 

that checks that the first letter is an uppercase and the second letter is lowercase, and 

then join those all together with a space. And so here is a nice way to get a list of all 

of the schedulers that PyTorch has available. And actually, I couldn't find such a list 

on the PyTorch website in the documentation. So this is actually a handy 

thing to have available. So here's various schedulers we can use.
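
The list comprehension described above might look roughly like this. The length guard is an addition for safety, and depending on the PyTorch version the result can also include a few imported helper names that happen to match the same capitalization pattern.

```python
from torch.optim import lr_scheduler

# Keep names whose first letter is uppercase and second is lowercase.
names = [o for o in dir(lr_scheduler)
         if len(o) > 1 and o[0].isupper() and o[1].islower()]
print(' '.join(names))
```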

And so I thought we might experiment with using Cosine Annealing. So before we do, we have to recognize that 

these PyTorch schedulers work with PyTorch optimizers, not with, of course, 

with our custom SGD class. And PyTorch optimizers have 

a slightly different API. And so we might learn how they work.

So to learn how they work, we need an optimizer. So one easy way to just grab an optimizer 

would be to create a learner, just kind of pretty much any old random learner, and pass 

in that single batch callback that we created. Do you remember that single batch callback?

SingleBatch. It just, after_batch, it cancels the fit.

So it literally just does one batch. And we could fit.

And from that, we've now got a learner and an optimizer.

And so we can do the same thing. We can dir our optimizer to 

see what attributes it has. This is a nice way, or of course, just 

read the documentation in PyTorch. This one is documented, I think, 

showing all the things it can do. As you would expect, it's got the step and 

the zero_grad, like we're familiar with. Or you can just, if you just hit opt.

So you can, the optimizers in PyTorch do actually have a repr, as it's called, which means

you can just type it in and hit shift enter, and you can also see the information about

it this way. Now, an optimizer, it'll tell 

you what kind of optimizer it is. And so in this case, the default optimizer 

for a learner, when we created it, we decided was optim.sgd.SGD.

So we've got an SGD optimizer. And it's got these things called parameter groups.

What are parameter groups? Well, parameter groups are, as it 

suggests, they're groups of parameters. And in fact, we only have one parameter group 

here, which means all of our parameters are in this group.

So let me kind of try and show you. It's a little bit confusing, 

but it's kind of quite neat. So let's grab all of our parameters.

And that's actually a generator. So we have to turn that into an iterator and 

call next, and that will just give us our first parameter.

Okay. Now what we can do is we can then 

check the state of the optimizer. And the state is a dictionary and 

the keys are parameter tensors. So this is kind of pretty interesting because 

you might be, I'm sure you're familiar with dictionaries.

I hope you're familiar with dictionaries. But normally you probably use numbers or 

strings as keys, but actually you can use tensors as

keys. And indeed that's what happens here.

If we look at param, it's a tensor. It's actually a parameter, which remember 

is a tensor that PyTorch knows requires_grad and should be listed in the parameters of the module.

And so we're actually using that to index into the state.

So if you look at opt.state, it's a dictionary where the keys are parameters.

Now what's this for? Well, what we want to be able 

to do is, if you think back to this, we actually had each

parameter, we have state for it. We have the average of the gradients or the 

exponentially weighted moving average of the gradients and of the squared gradients.

And we actually stored them as attributes. So PyTorch does it a bit differently.

It doesn't store them as attributes, but instead the optimizer has a dictionary where you can

index into it using a parameter and that gives you the state.

And so you can see here, this is the exponentially weighted moving averages.

And both because we haven't done any training yet and because we're using non-momentum SGD,

it's None, but that's how it would be stored. So this is really important to 

understand PyTorch optimizers. I quite liked our way of doing it, of just 

storing the state directly as attributes. But this works as well and it's fine.

You just have to know it's there. And then, as I said, rather than 

just having parameters, so we in SGD stored the parameters

directly. But in PyTorch, those parameters 

can be put into groups. And so since we haven't put them into 

groups, the length of param_groups is 1, this is

one group. So here is the param_groups.

And that group contains all of our parameters. Okay.

So pg, just to clarify here what's going on, pg is a dictionary.

It's a parameter group. And to get the keys from a 

dictionary, you can just listify it. That gives you back the keys.

And so this is one quick way of finding out all the keys in a dictionary.

But you can see all the parameters in the group. And you can see all of the hyper parameters, 

the learning rate, the momentum, weight decay, and so forth. So that gives you some background about 

what's going on inside an optimizer. So Siva asks, isn't indexing by tensor just 

like passing a tensor argument to a method? And no, it's not quite the 

same, because this is state. So this is how the optimizer 

stores state about the parameters. It has to be stored somewhere.

For our homemade miniai version, we stored it as attributes on the parameter.

In the PyTorch optimizers, they store it as a dictionary.

So it's just how it's stored. Okay.
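
To poke at this yourself without a learner, a sketch along these lines should work. The small linear model is a stand-in, and momentum is switched on here (unlike the non-momentum SGD in the lesson) purely so that there is some state to look at after one step.

```python
import torch
from torch import nn, optim

model = nn.Linear(3, 1)                       # stand-in model
opt = optim.SGD(model.parameters(), lr=0.006, momentum=0.9)

# parameters() is a generator, so take the first item with iter/next.
param = next(iter(model.parameters()))

# Take one optimizer step so the optimizer has something to remember.
loss = model(torch.randn(4, 3)).pow(2).mean()
loss.backward()
opt.step()

# The state is a dictionary keyed by the parameter tensor itself.
st = opt.state[param]
print(list(st))

# Hyperparameters live in the parameter groups, which are dictionaries too.
pg = opt.param_groups[0]
print(len(opt.param_groups), list(pg))
```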

So with that in mind, let's look at how schedulers work.

So let's create a cosine annealing scheduler. So a scheduler in PyTorch, you 

have to pass it the optimizer. And the reason for that is we want to be able 

to tell it to change the learning rates of our optimizer.

So it needs to know what optimizer to change the learning rates of.

So it can then do that for each set of parameters. And the reason that it does it by parameter 

group is that as we'll learn in a later lesson, for things like transfer learning, we often 

want to adjust the learning rates of the later layers differently to the earlier layers 

and actually have different learning rates. And so that's why we can have different groups 

and the different groups have the different learning rates, momentums, and so forth.

Okay. So we pass in the optimizer and then, if I 

hit shift tab a couple of times, it'll tell me all of the things that you can pass in.

And so it needs to know T_max: how many iterations you're going to do.

And that's because it's trying to do one, you know, half a wave, if you like, of the

cosine curve. So it needs to know how many 

iterations you're going to do. So it needs to know how far to step each time.

So if we're going to do a hundred iterations. So the scheduler is going to 

store the base learning rate. And where did it get that from?

It got it from our optimizer, which we set a learning rate.

Okay. So it's going to steal the optimizer's learning 

rate, and that's going to be the starting learning rate, the base learning rate.

And it's a list because there could be a different one for each parameter group.

We only have one parameter group. You can also get the most recent learning 

rate from a scheduler, which of course is the same.

And so I couldn't find any method in PyTorch to actually plot a scheduler's learning rates.

So I just made a tiny little thing that just created a list, set it to the last learning

rate of the scheduler, which is going to start at 0.006, and then goes through however many

steps you ask for. Steps the optimizer, steps the scheduler.

So this is the thing that causes the scheduler to adjust its learning rate, and then just

append that new learning rate to a list of learning rates and then plot it.
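
A sketch of such a helper, without the plotting, could look like this. The name `sched_lrs` and the stand-in model are assumptions for this example; the returned list can be passed to any plotting function.

```python
from torch import nn, optim
from torch.optim import lr_scheduler

def sched_lrs(sched, steps):
    """Step the optimizer and scheduler, collecting the learning rates."""
    lrs = [sched.get_last_lr()[0]]
    for _ in range(steps):
        sched.optimizer.step()   # step the optimizer first, then the scheduler
        sched.step()
        lrs.append(sched.get_last_lr()[0])
    return lrs

model = nn.Linear(3, 1)                          # stand-in model
opt = optim.SGD(model.parameters(), lr=0.006)
sched = lr_scheduler.CosineAnnealingLR(opt, T_max=100)
print(sched.base_lrs)                            # one entry per parameter group

lrs = sched_lrs(sched, 110)                      # deliberately past T_max
print(lrs[0], lrs[100], lrs[110])
```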

So that's here's, and what I've done here is I've intentionally gone over a hundred

because I had told it I'm going to do a hundred. So I'm going over a hundred and you can see the 

learning rate, if we did a hundred iterations, would start high for a while.

It would then go down and then it would stay low for a while.

And if we intentionally go past the maximum, it'll actually start going up again because

this is a cosine curve. So one of the main things, I guess, I wanted 

to show here is like what it looks like to really investigate in a REPL 

environment, like a notebook, how an object behaves, what's

in it. And this is something I would always want 

to do when I'm using something from an API I'm not very familiar with.

I really want to like see what's in it, see what they do, run it totally independently,

plot anything I can plot. This is how I like to learn 

about the stuff I'm working with. You know, data scientists don't spend all 

of their time just coding, you know, so that means we need, we can't just rely on 

using the same classes and APIs every day. So we have to be very good at 

exploring them and learning about them. And so that's why I think this 

is a really good approach. Okay.

So let's create a scheduler callback. So a scheduler callback is something we're 

going to pass in the scheduling class. But remember then when we… ah, the scheduling 

callable, actually, and remember that when we create the scheduler, we have to 

pass in the optimizer to schedule. And so before_fit, that's the point at which we 

have an optimizer, we will create the scheduling object.

I like this, “schedo”, it's very Australian. So the scheduling object we will create by 

passing the optimizer into the scheduler callable. And then when we do step(), then we'll check 

if we're training and if so, we'll step(). Okay.

So then what's going to call step is after_batch.

So after_batch, we'll call step. And that would be if you want your scheduler 

to update the learning rate every batch. We could also have an epoch 

scheduler callback, which we'll see later and that's just going to be after_epoch.
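
In outline, those callbacks could be sketched as below. The class names are guesses at how they might be named, and a `SimpleNamespace` stands in for the learner so that the snippet runs on its own; in the course framework they would be proper callbacks attached to a real learner.

```python
from functools import partial
from types import SimpleNamespace
from torch import nn, optim
from torch.optim import lr_scheduler

class BaseSchedCB:
    def __init__(self, sched): self.sched = sched      # scheduler callable
    def before_fit(self, learn):
        self.schedo = self.sched(learn.opt)            # now we have an optimizer
    def _step(self, learn):
        if learn.training: self.schedo.step()

class BatchSchedCB(BaseSchedCB):
    def after_batch(self, learn): self._step(learn)    # step every batch

class EpochSchedCB(BaseSchedCB):
    def after_epoch(self, learn): self._step(learn)    # step every epoch

model = nn.Linear(3, 1)                                # stand-in model
opt = optim.SGD(model.parameters(), lr=0.006)
learn = SimpleNamespace(opt=opt, training=True)        # stand-in learner

cb = BatchSchedCB(partial(lr_scheduler.CosineAnnealingLR, T_max=10))
cb.before_fit(learn)
for _ in range(10):                                    # pretend to run 10 batches
    opt.step()
    cb.after_batch(learn)
final_lr = opt.param_groups[0]['lr']
print(final_lr)
```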

Okay. So in order to actually see what the schedule 

is doing, we're going to need to create a new callback to keep track of 

what's going on in our learner. And I figured we could create a recorder callback. And what we're going to do is we're going 

to be passing in the name of the thing that we want to record, that we want to keep track 

of in each batch and a function, which is going to be responsible for 

grabbing the thing that we want. And so in this case, the function here is 

going to grab from the callback, look up its pg property and grab the learning rate.

Where does the pg property come from? Or attribute?

Well, before_fit the recorder callback is going to grab just the first parameter group.

Just so it's like, you've got to pick some parameter group to track.

So we'll just grab the first one. And so then also we're going to create a 

dictionary of all the things that we're recording. So we'll get all the names.

So that's going to be in this case, just lr. And initially it's just going to be an empty list.

And then after_batch, we'll go through each of the items in that dictionary, which in

this case is just lr as the key and the _lr function as the value.

And we will append to that list, call that method, call that function or callable and

pass in this callback. And that's why this is going to get the callback.

And so that's going to basically then have a whole bunch of, you know, dictionary of

the results, you know, of each of these functions after each batch during training.
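
A stripped-down sketch of that recorder, leaving out the plotting, might look like this. The fake optimizer and learner are `SimpleNamespace` stand-ins so that the example runs without any training loop; only the shape of the callback is the point.

```python
from types import SimpleNamespace

def _lr(cb):
    """Grab the learning rate from the callback's tracked parameter group."""
    return cb.pg['lr']

class RecorderCB:
    def __init__(self, **d): self.d = d                # name -> function
    def before_fit(self, learn):
        self.recs = {k: [] for k in self.d}            # one empty list per name
        self.pg = learn.opt.param_groups[0]            # track the first group
    def after_batch(self, learn):
        if not learn.training: return
        for k, v in self.d.items():
            self.recs[k].append(v(self))               # pass in this callback

fake_opt = SimpleNamespace(param_groups=[{'lr': 0.1}])  # stand-in optimizer
learn = SimpleNamespace(opt=fake_opt, training=True)    # stand-in learner

rec = RecorderCB(lr=_lr)
rec.before_fit(learn)
for new_lr in (0.1, 0.05, 0.025):                       # pretend a scheduler ran
    fake_opt.param_groups[0]['lr'] = new_lr
    rec.after_batch(learn)
print(rec.recs)
```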

So we'll just go through and plot them all. And so let me show you what 

that's going to look like. If we… let's create a cosine annealing callable.

So we're going to have to use a partial to say that this callable is going to have T_max

equal to three times, however many mini batches we have in our data loader.

That's because we're going to do 3 epochs. And then we will set it running and 

we're passing in the batch scheduler with the scheduler

callable. And we're also going to pass in our recorder 

callback saying we want to track the learning rate using the _lr function.

We're going to call fit. And oh, this is actually a pretty good accuracy.

We're getting, you know, close to 90% now in only 3 epochs, which is impressive.

And so when we then call rec.plot, it's going to call, remember the rec is the recorder

callback. So it plots the learning rate.

Isn't that sweet? So we could, as I said, we can do exactly 

the same thing, but replace after_batch with after_epoch.

And this will now become a scheduler which steps at the end of each epoch rather than

the end of each batch. So I can do exactly the same thing 

now using an epoch scheduler. So this time T_max is 3 because we're 

only going to be stepping three times. We're not stepping at the end of each 

batch, just at the end of each epoch. So that trains and then we can 

call rec.plot after it trains. And as you can see there, 

it's just stepping 3 times. So you can see here, we're really digging 

in deeply to understanding what's happening in everything in our models.

What do all the activations look like? What do the losses look like?

What do our learning rates look like? And we've built all this from scratch.

So yeah, hopefully that gives you a sense that we can really, yeah, do a lot ourselves.

Now, if you've done the fastai Part 1 course, you'll be very aware of 1-Cycle training,

which was from a terrific paper by Leslie Smith, which I'm not sure it ever got published,

actually. And 1-Cycle training is, 

well, let's take a look at it. So we can just replace our 

scheduler with OneCycleLR scheduler. So that's in PyTorch.

And of course, if it wasn't in PyTorch, we could very easily just write our own.

I’m going to make it a batch scheduler and we're going to train, this time we're going to do 5

epochs. So we're going to train a bit longer.

And so the first thing I'll point out is, hooray, we have got a new record for us, 90.6%.

So that's great. So, and then b), you can see here's the plot.

And now look, two things are being plotted. And that's because I've now passed into the 

recorder callback, a plot of learning rates and also a plot of momentums.

And momentums is going to grab the betas zero, because remember for Adam, it's called beta

zero and beta one. It's momentum of the gradients and 

the momentum of the gradient squared. And you can see what the one cycle is doing 

is the learning rate is starting very low and going up to high and then down again.

But the momentum is starting high and then going down and then up again.
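
To make the shape of those two curves concrete, here is a simplified 1-cycle schedule in plain Python. It is a sketch under stated assumptions, not PyTorch's OneCycleLR implementation: the helper names are invented, the maximum learning rate of 6e-2 is hypothetical, and the default values (30% warmup, momentum between 0.85 and 0.95, and so on) are simply plausible choices for the illustration.

```python
import math

def cos_interp(start, end, pct):
    "Cosine interpolation from start (pct=0) to end (pct=1)."
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))

def one_cycle(step, total, max_lr, pct_start=0.3, div=25.0, final_div=1e4,
              base_mom=0.85, max_mom=0.95):
    "Return (lr, momentum) for one step of a simplified 1-cycle schedule."
    warm = pct_start * total
    if step < warm:
        # Warmup: learning rate rises from max_lr/div to max_lr, momentum falls.
        pct = step / warm
        return cos_interp(max_lr / div, max_lr, pct), cos_interp(max_mom, base_mom, pct)
    # Annealing: learning rate falls to a very small value, momentum rises again.
    pct = (step - warm) / (total - warm)
    return (cos_interp(max_lr, max_lr / (div * final_div), pct),
            cos_interp(base_mom, max_mom, pct))

sched = [one_cycle(s, 100, max_lr=6e-2) for s in range(101)]
lrs, moms = zip(*sched)
```

In this sketch the learning rate peaks at the same step where the momentum bottoms out, which is the mirror-image relationship visible in the two plots.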

So what's the theory here? Well the starting out at a low learning rate 

is particularly important if you have a not perfectly initialized model, which almost 

everybody almost always does, even though we spent a lot of time 

learning to initialize models. We use a lot of models that get more complicated.

And after a while people learn or figure out how to initialize

more complex models properly. So for example, this is a very, very cool paper.

In 2019, this team figured out how to initialize ResNets properly.

We'll be looking at ResNets very shortly. And they discovered when they did 

that they did not need batch norm. They could train networks of 10,000 layers.

And they could get state of the art performance with no batch norm.

And there's actually been something similar for transformers called T-Fixup that does a

similar kind of thing. But anyway, it is quite difficult to initialize models correctly.

Most people fail to. Most people fail to realize that they generally 

don't need tricks like warmup and batch norm if they do initialize them correctly.

In fact, T-Fixup explicitly looks at this. It looks at the difference between 

no warmup versus with warmup with their correct initialization

versus with normal initialization. And you can see these pictures they're 

showing are pretty similar actually. They're very similar to the 

colorful dimension plots. I kind of like our colorful dimension plots 

better in some ways because I think they're easier to read, although I think 

theirs are probably prettier. So there you go, Stefano.

There's something to inspire you if you want to try more things with our colorful dimension

plots. I think it's interesting that some papers 

are actually starting to use a similar idea. I don't know if they got it from us 

or they came up with it independently. But so we do a warmup if our network's not 

quite initialized correctly, then starting at a very low learning rate means it's not 

going to jump off way outside the area where the weights even make sense.

And so then you gradually increase them as the weights move into a part of the space

that does make sense. And then during that time, while we have low 

learning rates, if they keep moving in the same direction, then with this very high 

momentum, they'll move more and more quickly. But if they keep moving in different directions, 

it's just the momentum is going to kind of look at the underlying direction they're moving.

And then once you have got to a good part of the weight space, you can use a very high

learning rate. And with a very high learning rate, 

you wouldn't want so much momentum. So that's why there's low momentum during 

the time when there's high learning rate. And then as we saw in our spreadsheet, which 

did this automatically, as you get closer to the optimal, you generally want 

to decrease the learning rate. And since we're decreasing it 

again, we can increase the momentum. So you can see that starting from random weights, 

we've got a pretty good accuracy on Fashion MNIST with a totally standard convolutional 

neural network, no ResNets, nothing else, everything built from scratch by hand, artisanal 

neural network training, and we've got 90.6% for Fashion-MNIST.

So there you go. All right, let's take a seven minute break.

And I'll see you back shortly. I should warn you, we've got a lot more to cover.

So I hope you're okay with a long lesson today. Okay, we're back.

I just wanted to mention also something we skipped over here, which is this HasLearn

callback. This is more important for the people 

doing the live course than the recordings. If you're doing the recording, 

you will have already seen this. But since I created Learner, actually, Piotr 

Czapla, I don't know how to pronounce your surname, sorry, Piotr, pointed out that there's 

actually kind of a nicer way of handling Learner… previously we were putting the Learner object 

itself into self.learn in each callback. And that meant we were using self.learn.model 

and self.learn.opt and self.learn dot all this all over the place.

It was kind of ugly. So we've modified Learner this week so that, when it calls a callback in

run_cbs, which you might remember is what Learner calls, it instead passes the Learner

as a parameter to the method. So now the Learner no longer goes through the 

callbacks and sets their .learn attribute. But instead in your callbacks, you have to put 

learn as a parameter in all of the callback methods.

So for example, DeviceCB has a before_fit.

So now it's got comma learn here. So now this is not self.learn.

It's just learn. So it does make a lot of the code less yucky 

to not have all this self.learn.pred equals self.learn.model, self.learn.batch.

It's now just learn. It also is good because you don't generally want 

to have both the Learner holding a reference to the callbacks and also the 

callbacks holding a reference back to the Learner, because that creates

something called a cycle. So there's a couple of benefits there.
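
The calling convention can be sketched in a few lines of plain Python. Everything here (`MiniLearner`, `CountBatchesCB`, the `batch_size` attribute) is a hypothetical stand-in invented for the illustration, not the course's Learner; the point is only that the learner is handed to each callback method as an argument, so the callback never stores a reference to it.

```python
class Callback:
    order = 0

class CountBatchesCB(Callback):
    "Hypothetical callback: receives the learner as an argument, keeps no reference to it."
    def __init__(self): self.n = 0
    def after_batch(self, learn): self.n += learn.batch_size

class MiniLearner:
    "Hypothetical stand-in for Learner, just enough to show the calling convention."
    def __init__(self, cbs, batch_size=4):
        self.cbs, self.batch_size = cbs, batch_size

    def run_cbs(self, method_nm):
        # Pass `self` (the learner) into each callback method that exists.
        for cb in sorted(self.cbs, key=lambda c: c.order):
            method = getattr(cb, method_nm, None)
            if method is not None: method(self)

    def fit(self, n_batches):
        self.run_cbs('before_fit')
        for _ in range(n_batches): self.run_cbs('after_batch')
        self.run_cbs('after_fit')

cb = CountBatchesCB()
learn = MiniLearner([cb])
learn.fit(3)
```

After the fit, `cb.n` holds the number of items seen, and `cb` has no `learn` attribute, so there is no reference cycle between the two objects.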

And that reminds me, there's a few other little changes we've made to the code.

And I want to show you a cool little trick for how 

I'm going to find quickly all of the changes that we've made to the code in the last week.

So to do that, we can go to the course repo. And on any repo, you can 

add slash compare in GitHub. And then you can compare across, you 

know, all kinds of different things. But one of the examples they've got here 

is to compare across different times. Look at the master branch now versus one day ago.

So I actually want the master branch now versus seven days ago.

So I just hit this, change this to 7. And there we go.

There's all my commits and I can immediately see the changes from last week.

And so you can basically see what are the things I had to do when I changed things.

So for example, you can see here, all of my self.learn(s) became learn(s).

I added the anneal, that's right, augmentation. And so in learner, I added an lr_find.

Ah yes, I will show you that one. That's pretty fun.

So here's the changes we made to run_cbs, to fit. So this is a nice way I can quickly, yeah, 

find out what I've changed since last time and make sure that I don't forget 

to tell you folks about any of them. Oh yes, cleanup_fit.

I have to tell you about that as well. Okay.

That's a useful reminder. So the main other change to mention is that 

calling the learning rate finder is now easier because I added what's called 

a patch to the learner. fastcore’s patch decorator lets you take a 

function and it will turn that function into a method of this class, of whatever 

class you put after the colon. So this has created a new method 

called lr_find or Learner.lr_find. And what it does is it calls self.fit, where 

self is a Learner, passing in however many epochs you set as the maximum to check for your 

learning rate finder, and what to start the learning rate at.

And then it says to use as callbacks the learning rate finder callback.

Now this is new as well. Learner.fit didn't use to 

have a callbacks parameter. So that's very convenient because what it 

does is it adds those callbacks just during the fit.

So if you pass in callbacks, then it goes through each one 

and appends it to self.cbs and when it's finished 

fitting, it removes them again. So these are callbacks that are just added 

for the period of this one fit, which is what we want for learning rate finder.
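
Here is a minimal sketch of that add-then-remove behaviour, assuming a made-up `MiniLearner` and two placeholder callback classes. The training loop itself is elided; only the bookkeeping of the callback list is shown, and the `try`/`finally` is one reasonable way to make sure the extra callbacks are removed even if fitting raises.

```python
class LRFinderStub: pass   # hypothetical placeholder for a learning rate finder callback
class MetricsStub: pass    # hypothetical placeholder for a permanent callback

class MiniLearner:
    "Hypothetical stand-in for Learner; only the callback bookkeeping is shown."
    def __init__(self, cbs=()): self.cbs = list(cbs)

    def fit(self, n_epochs=1, cbs=()):
        cbs = list(cbs)
        self.cbs.extend(cbs)                 # add the extra callbacks for this fit only
        try:
            # ... the real training loop would run here ...
            return [type(cb).__name__ for cb in self.cbs]
        finally:
            for cb in cbs: self.cbs.remove(cb)   # restore the original callback list

learn = MiniLearner([MetricsStub()])
active = learn.fit(1, cbs=[LRFinderStub()])
```

During the fit both callbacks are active; afterwards `learn.cbs` is back to just the one it was constructed with.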

It should just be added for that one fit. So with this patch in place, all that's required 

to do the learning rate finder is now to create your learner and call .lr_find.

And there you go. Bang.

So patch is a very convenient thing. It's one of these things which Python has a 

lot of kind of like folk wisdom about what isn't considered Pythonic or good.

And a lot of people really don't like patching. In other languages, it's used very 

widely and is considered very good. So I don't tend to have strong opinions 

either way about what's good or what's bad. In fact, instead I just figure out 

what's useful in a particular situation. So in this situation, obviously it's very 

nice to be able to add in this additional functionality to our class.
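
To show the idea behind patching, here is a tiny hand-written decorator that does something similar in spirit. It is a simplified stand-in, not fastcore's implementation, and `MiniLearner` and the body of `lr_find` are hypothetical: the decorator reads the class from the annotation on the first parameter and attaches the function to that class.

```python
def patch(fn):
    "Attach `fn` as a method of the class named in its first parameter's annotation."
    cls = next(iter(fn.__annotations__.values()))
    setattr(cls, fn.__name__, fn)
    return fn

class MiniLearner:
    "Hypothetical stand-in for Learner that just records how fit was called."
    def __init__(self): self.calls = []
    def fit(self, n_epochs, lr=None, cbs=()):
        self.calls.append((n_epochs, lr, len(cbs)))
        return self

@patch
def lr_find(self: MiniLearner, start_lr=1e-5, max_epochs=10):
    # Hypothetical body: a real version would pass a learning rate finder callback.
    return self.fit(max_epochs, lr=start_lr, cbs=['lr-finder-placeholder'])

learn = MiniLearner()
learn.lr_find()
```

The class was defined without an `lr_find` method, yet after the decorated function is defined every instance has one.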

So that's what .lr_find is. And then the only other thing we added to 

the Learner this week was we added a few more parameters to fit.

fit used to just take the number of epochs. As well as the callback parameter, it 

now also has a learning rate parameter. And so you've always been able to provide 

a learning rate to the constructor, but you can override the learning rate for one fit.

So if you pass in a learning rate, it will use it.

And if you don't, it'll use the learning rate passed into the constructor.

And then I also added these two booleans to say when you fit, do you want to do the training

loop and do you want to do the validation loop? So by default, it'll do both.

And you can see here, there's just an if train, do the training loop.

If valid, do the validation loop. I'm not even going to talk about this, but if 

you're interested in testing your understanding of decorators, you might want to think 

about why it is that I didn't have to say with torch.no_grad,

but instead I called torch.no_grad()(function).

If you can get to a point that you understand why that works and what

it does, you'll be on your way to understanding decorators better.
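
For anyone who wants a hint, the same pattern can be reproduced with the standard library alone. This is an analogy rather than PyTorch's own code: `no_tracking` and the `TRACKING` flag are invented for the example, and `contextlib.ContextDecorator` is used because it gives one object that works both in a `with` statement and as a decorator.

```python
from contextlib import ContextDecorator

TRACKING = [True]   # hypothetical global flag standing in for "gradients enabled"

class no_tracking(ContextDecorator):
    "Switch the flag off while active, then restore the previous value."
    def __enter__(self):
        self.prev = TRACKING[0]
        TRACKING[0] = False
        return self
    def __exit__(self, *exc):
        TRACKING[0] = self.prev
        return False

def one_epoch(train):
    return ('train' if train else 'valid', TRACKING[0])

# Form 1: a with-block.
with no_tracking():
    a = one_epoch(False)

# Form 2: no_tracking() builds an instance, calling that instance on a function
# returns a wrapped function, and the final (False) calls the wrapped function.
b = no_tracking()(one_epoch)(False)
```

Both forms run `one_epoch` with the flag switched off and restore it afterwards; the second form is just the decorator applied by hand instead of with the `@` syntax.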

Okay. So that is the end of accel_sgd. ResNets.

Okay, so we are up to 90 point, what was it?

Yeah, let's keep track of this. Oh yeah, 90.6% is what we're up to. Okay.

So to remind you the model, actually, so we're going to open 13_resnet now. And we're going to do the usual 

import and setup initially. And the model that we've been using is 

the same one we've been using for a while, which

is that it's a convolution and an activation and an optional batch norm.

And in our models, we were using batch norm and applying our weight initialization, the

kaiming weight initialization. And then we've got convs that take the 

channels from 1 to 8 to 16 to 32 to 64, and each one

stride-2. And at the end, we then do a flatten.

And so that ended up with a one by one. So that's been the model 

we've been using for a while. So the number of layers is 1, 2, 3, 4.

So 4, 4 convolutional layers with a maximum of 64 channels in the last one.

So can we beat 90.6%?

So before we do a ResNet, I thought, well, let's just see if we can improve the architecture

thoughtfully. So generally speaking, more depth and more 

channels gives the neural net more opportunity to learn.

And since we're pretty good at initializing our neural nets and using batch norm, we should

be able to handle deeper. So one thing we could do is we could, 

let's just remind ourselves of the previous version

so we can compare, is we could go up to 128 channels.

Now the way we'd do that is we could make our very first convolutional layer have a

stride of 1. So that would be one that goes from the one 

input channel to eight output channels, or eight filters, if you like.

So if we make it a stride of 1, then that allows us to have one extra layer.

And then that one extra layer could again double the number of channels and take us

up to 128. So that would make it deeper and 

effectively wider as a result. So we can do our normal BatchNorm2d and our 

new OneCycle learning rate with our scheduler. And the callbacks we're going to use is the 

Device callback, our Metrics, our Progress bar, and our activation stats 

looking for GeneralRelus. And I won't have you watch them train 

because that would be kind of boring. But if I do this with this deeper and eventually 

wider network, this is pretty amazing. We get up to 91.7%.

So that's like quite a big difference. And literally the only difference to our previous 

model is this one line of code, which allowed us to take this instead of going 

from 1 to 64, it goes from 8 to 128. So that's a very small change, 

but it massively improved. So the error rate's gone down by 

over 10%, relatively speaking, in terms of the error

rate. So there's a huge impact we've already had.
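
The arithmetic behind both claims can be checked in a few lines. This is a sketch under assumptions: it supposes 3 by 3 kernels with padding of ks//2, and that the final conv down to 10 channels also has stride 2; the function names are made up for the example.

```python
def conv_out(n, ks=3, stride=2, pad=None):
    "Output grid size of a conv over an n by n input (padding defaults to ks//2)."
    pad = ks // 2 if pad is None else pad
    return (n + 2 * pad - ks) // stride + 1

def trace(strides, n=28):
    "Grid size after each conv, starting from an n by n image."
    sizes = [n]
    for s in strides: sizes.append(conv_out(sizes[-1], stride=s))
    return sizes

old = trace([2, 2, 2, 2, 2])       # 1->8->16->32->64, then the final conv to 10
new = trace([1, 2, 2, 2, 2, 2])    # stride-1 first layer leaves room for a 64->128 layer

# Relative reduction in error rate when accuracy goes from 90.6% to 91.7%.
old_err, new_err = 100 - 90.6, 100 - 91.7
rel_drop = (old_err - new_err) / old_err
```

Here `old` is 28, 14, 7, 4, 2, 1 and `new` is 28, 28, 14, 7, 4, 2, 1, so the stride-1 first layer buys exactly one extra layer before the grid reaches 1 by 1. The error rate falls from 9.4% to 8.3%, a relative drop of a little under 12%.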

Again, 5 epochs. So now what we're going to do is 

we're going to make it deeper still. But Kaiming He et al. noted that there comes 

a point where making neural nets deeper stops working well.

And remember, this is the guy who created the initializer that we know and love.

And he pointed out that even with good initialization, there 

comes a time where adding more layers becomes problematic.

And he pointed out something particularly interesting.

He said, let's take a 20 layer neural network. This is in a paper called “Deep Residual Learning 

for Image Recognition” that introduced ResNets. Let's take a 20 layer network and train it 

for a few, what's that, tens of thousands of iterations and track its test error.

Okay. And now let's do exactly the same thing on 

a 56 layer, otherwise identical, but deeper network. And he pointed out that the 56 layer 

network had a worse error than the 20 layer. And it wasn't just a problem of generalization 

because it was worse on the training set as

well. Now the insight that he had is if you just 

set the additional 36 layers to just identity, you know, identity matrices, they 

should, they would do nothing at all. And so a 56 layer network is a 

superset of a 20 layer network. So it should be at least as 

good, but it's not, it's worse. So clearly the problem here is 

something about training it. And so he and his team came up with a really 

clever insight, which is, can we create a 56 layer network, which has the same training 

dynamics as a 20 layer network or even less? And they realized, yes, you can.

What you could do is you could add something called a shortcut connection.

And basically the idea is that, normally, when we have, you know, our inputs coming into

our convolution. So let's say that's, that was our inputs and 

here's our convolution and here's our outputs. Now if we do this 56 times, that's a lot of 

stacked up convolutions, which are effectively matrix multiplications with a 

lot of opportunity for, you know, gradient explosions and all

that fun stuff. So how could we make it so 

that we have convolutions, but with the training dynamics of a much shallower network.

And here's what he did.

He said, let's actually put two convs in here to make it twice as deep because we are trying

to make things deeper, but then let's add what's called a skip connection where instead

of just being out equals, so this is conv1, this is conv2. Instead of being out

equals and there's a, you know, assume that these include activation functions, equals

conv2 of conv1 of in, right? Instead of just doing that, let's 

make it conv2 of conv1 of in plus in.

Now if we initialize these at first to have weights 

of zero, then initially this will do nothing at all.

It will output zero and therefore at first you'll just get out equals in, which is exactly

what we wanted, right? We actually want it to be 

as if there are no extra layers. And so this way we actually end up with a 

network which can be deep, but also at least when you start training 

behaves as if it's shallow. It's called a residual connection because 

if we subtract in from both sides, then we

would get out minus in equals conv2 of conv1 of in.

In other words, the difference between the endpoint and the starting point, which is

the residual. And so another way of thinking about it 

is that this is calculating a residual. So there's a couple of ways of thinking about it.
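
Written out as equations, using the same names as in the spoken description (and assuming, as stated, that conv1 and conv2 include their activation functions), the three forms are:

```latex
\begin{aligned}
\text{plain stack:}\quad & \text{out} = \text{conv2}(\text{conv1}(\text{in})) \\
\text{with skip connection:}\quad & \text{out} = \text{conv2}(\text{conv1}(\text{in})) + \text{in} \\
\text{residual form:}\quad & \text{out} - \text{in} = \text{conv2}(\text{conv1}(\text{in}))
\end{aligned}
```

If the convs output zero at initialization, the second line reduces to out = in, which is the "as if the extra layers were not there" behaviour.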

And so this thing here is called the Res Block or ResNet Block. Okay, so Sam Watkins has just pointed out 

the confusion here, which is that this only works if, let's put the minus in 

back and put it back over here. This only works if you can add these together.

Now if conv1 and conv2 both have the same number of channels as in, so the same number

of filters, and they also have stride-1, then that will work

fine. You'll end up, that will be exactly the same 

output shape as the input shape and you can add them together.

But if they are not the same, then you're in a bit of trouble.

So what do you do? And the answer which Kaiming He et al. came 

up with is to add a conv on in as well, but

to make it as simple as possible. We call this the identity conv.

It's not really an identity anymore, but we're trying to make it as simple as possible so

that we do as little to mess up these training dynamics as we can.

And the simplest possible convolution is a 1 by 1 filter block, a 1 by 1 kernel,

I guess we should call it. A 1 by 1 kernel size. And using that, and we can also add 

a stride or whatever if we want to. So let me show you the code.

So we're going to create something called a conv block.

Okay, and the conv block is going to do the two convs.

That's going to be a conv block. Okay, so we've got some number of input filters, 

some number of output filters, some stride, some activation functions, possibly a 

normalization and some kernel shape, some kernel size.

Okay, so the second conv is actually going to go from 

output filters to output filters because the first conv is going to be 

from input filters to output filters. So by the time we get to the second 

conv, it's going to be nf to nf. The first conv, we will set stride-1, and 

then the second conv will have the requested stride.

And that way the two convs back to back are going to overall have the requested stride.

So this way, the combination of these two convs is going to eventually take us from

ni to nf in terms of the number of filters, and it's going to have the stride that we

requested. So it's going to be a, the conv block is a 

sequential block consisting of a convolution followed by another convolution, each 

one with the requested kernel size. And the requested activation function 

and the requested normalization layer. The second conv won't have an activation function.

I'll explain why in a moment. And so I mentioned that one way to make this 

as if it didn't exist would be to set the convolutional weights to 

zero and the biases to zero. But actually we would like to have, you 

know, correctly randomly initialized weights. So instead what we can do is 

if you're using batch norm, we can initialize this conv2[1]

will be the batch norm layer. We can initialize the batch norm weights to zero.

Now if you've forgotten what that means, go back and have a look at our implementation

from scratch of batch norm because the batch norm weights is the thing we multiply by.

So do you remember the batch norm? We subtract the exponential moving average 

mean, we divide by the exponential moving average standard deviation, but then we multiply 

by the batch norm's weights and we add back the batch norm's bias; the multiply by 

the weights comes first.

So if we set the batch norm layers weights to zero, we're multiplying by zero.

And so this will cause the initial conv block output to be just all zeros.

And so that's going to give us, what we wanted is that nothing's happening here.

So we just end up with the input with this possible idconv.

So a ResBlock is going to contain those convolutions in the 

convolution block we just discussed, right?

And then we're going to need this idconv. So the idconv is going to be a noop.

So that's nothing at all. If the number of channels in is equal to the 

number of channels out, but otherwise we're going to use a convolution with a 

kernel size of 1 and a stride of 1. And so that is going to, with 

as little work as possible, change the number of filters so that they match.

Also what if the stride's not 1? Well if the stride is 2, actually this isn't 

going to work for any stride, this only works for a stride of 2.

If there's a stride of 2, we will simply average using average pooling.

So this is just saying take the mean of every 2 by 2 set of items in the grid.

So we'll just take the mean. So we basically have here pool of idconv of 

in if the stride is 2 and if the number of filters has changed.

And so that's the minimal amount of work. So here it is, here is the forward pass.

We get our input and on the identity connection we call pool and if stride is 1, that's

a noop, so do nothing at all. We do idconv and if the number of filters 

is not changed, that's also a noop. So this is just the input in that situation.

And then we add that to the result of the convs. And here's something interesting, we then 

apply the activation function to the whole thing.

Okay, so that way, I wouldn't say this is like the only way you can do it, but this

is a way that works pretty well, is to apply the activation function to the result of the

whole resnet block. And that's why I didn't add activation 

function to the second conv. So that's a res block.
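
Putting the pieces just described together, a ResBlock along these lines could look like the sketch below. It is written against plain PyTorch and is not the notebook's exact code: it uses `nn.ReLU` where the lesson uses its own activation, `nn.Identity` where the lesson uses a noop, and the simplified `conv` helper (including its rule for when to use a bias) is an assumption made for the example.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None):
    "Conv, then optional norm, then optional activation (simplified helper)."
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=(norm is None))]
    if norm: layers.append(norm(nf))
    if act: layers.append(act())
    return nn.Sequential(*layers)

def conv_block(ni, nf, stride, act=nn.ReLU, norm=nn.BatchNorm2d, ks=3):
    "Two convs: ni->nf at stride 1, then nf->nf at the requested stride, no final activation."
    conv2 = conv(nf, nf, ks=ks, stride=stride, act=None, norm=norm)
    if norm: nn.init.constant_(conv2[1].weight, 0.)   # zero the batch norm weights
    return nn.Sequential(conv(ni, nf, ks=ks, stride=1, act=act, norm=norm), conv2)

class ResBlock(nn.Module):
    def __init__(self, ni, nf, stride=1, ks=3, act=nn.ReLU, norm=nn.BatchNorm2d):
        super().__init__()
        self.convs = conv_block(ni, nf, stride, act=act, norm=norm, ks=ks)
        # 1x1 conv on the identity path only when the channel count changes.
        self.idconv = nn.Identity() if ni == nf else conv(ni, nf, ks=1, stride=1, act=None)
        # Average pooling on the identity path only when the block downsamples.
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = act()

    def forward(self, x):
        # Activation is applied to the sum, not to the second conv on its own.
        return self.act(self.convs(x) + self.idconv(self.pool(x)))

x = torch.randn(4, 8, 28, 28)
same = ResBlock(8, 8, stride=1)(x)     # shapes match, so the identity path is untouched
down = ResBlock(8, 16, stride=2)(x)    # channels and grid both change
```

Because the second batch norm's weights start at zero, a freshly created block with matching shapes initially outputs just the activation of its input, which is the "starts out as if it's shallow" property.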

So it's not a huge amount of code, right? And so now I've literally copied and pasted 

our get_model, but everywhere that previously we had a conv, I've just 

replaced it with ResBlock. In fact, let's have a look. get_model. Okay, so previously we started with 

conv 1 to 8, now we do ResBlock 1 to 8, stride-1, stride-1, then we added conv from 

number of filters i to number of filters i + 1, now it's ResBlock from number 

of filters i to number of filters i + 1. Okay, so it's exactly the same.

One change I have made though, is I mean, it doesn't actually make any difference at

all, I think it's mathematically identical, is previously the very last conv at the end

went from the 128 channels down to the 10 channels, followed by flatten, but this conv

is actually working on a 1 by 1 input. So an alternative way, which I think makes 

it clearer, is to flatten first and then use a linear layer, because a conv on a 1 by 

1 input is identical to a linear layer. And if that doesn't immediately make sense, 

that's totally fine, but this is one of those places where you should pause and have a little 

stop and think about why a conv on a 1 by scratch conv we did, because 

this is a very important insight. So I think it's very useful with a more complex 

model like this to take a good old look at it to see exactly what the inputs 

and outputs of each layer are. So here's a little function called _print_shape, 

which takes the things that a hook takes and we will print out for each layer, the name 

of the class, the shape of the input and the shape of the output.

So we can get our model, create our learner and use a 

handy little Hooks context manager we built in an earlier lesson and 

call the _print_shape function. And then we will call fit for 1 epoch just 

doing the evaluation not the training. And if we use the SingleBatch callback, it'll 

just do a single batch, pass it through and that hook will, as you see, print out each 

layer, the inputs shape and the output shape. So you can see we're starting with an input 

of a batch size of 1024, 1 channel, 28 by 28. Our first ResBlock was stride-1, 

so we still end up with 28 by 28, but now we've

got 8 channels. And then we gradually decrease the grid 

size to 14, to 7, to 4, to 2, to 1, as we gradually increase the number of channels.

We then flatten it, which gets rid of that 1 by 1, which allows us then to do linear

down to the 10. And then there's some discussion about whether 

you want a batch norm at the end or not. I was finding it quite useful in this case.

So we've got a batch norm at the end. I think this is very useful.

So I decided to create a patch for Learner called summary 

that would do basically exactly the same thing, but it would 

do it as a markdown table. Okay.

So, if we create a TrainLearner with our model and call .summary, this method is now available because we've patched 

that method into the Learner. And it's going to do exactly the same thing 

as our print, but it does it more prettily by using a markdown table, if it's in a 

notebook, otherwise it'll just print it. So fastcore has a handy thing for 

keeping track if you're in a notebook. And in a notebook to make something markdown, 

you can just use IPython.display.Markdown as you see.

And the other thing that I added as well as the input and the output is, I thought, let's

also add in the number of parameters. So we can calculate that as we've seen 

before by summing up the number of elements for each

parameter in that module. And so then I've kind of kept track of that 

as well so that at the end I can also print out the total number of parameters.

So we've got a 1.2 million parameter model, and you can see that there's very few parameters

here in the input. Nearly all the parameters are 

actually in the last layer. Why is that?

Well, you might want to go back to our Excel convolutional spreadsheet to see this.

You have a parameter, for every input channel, you have a set of parameters. They're all going to get added up 

across each of the 3 by 3 in the kernel. And then that's going to be done for 

every output filter, every output channel that you

want. So that's why you're going to end 

up with, in fact, let's take a look. Maybe let's create, let's 

just grab some particular one.

So create our model. And so we'll just have a look at the sizes.

And so you can see here there is this 256 by 256 by 3 by 3.

So that's a lot of parameters. Okay.
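
The count for that weight tensor, and for the last ResBlock as a whole, is easy to work out by hand. The sketch below assumes 3 by 3 kernels, no conv biases, and the channel sizes 8 up to 256 mentioned for this model; batch norm parameters and the identity conv are left out, so these are lower bounds rather than exact totals.

```python
from math import prod

def conv_params(ni, nf, ks=3, bias=False):
    "Weights in a conv layer: one ks x ks kernel per (input channel, output channel) pair."
    return prod((nf, ni, ks, ks)) + (nf if bias else 0)

nfs = (8, 16, 32, 64, 128, 256)

# The second conv of each ResBlock maps nf -> nf.
second_convs = {nf: conv_params(nf, nf) for nf in nfs}

# Last ResBlock: first conv 128 -> 256, second conv 256 -> 256.
last_block = conv_params(128, 256) + conv_params(256, 256)
```

The 256 by 256 by 3 by 3 tensor alone is 589,824 weights, and the two convs of the last block together come to 884,736, compared with only 576 for the 8 to 8 conv near the input.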

So we can call lr_find on that and get a sense of what 

kind of learning rate to use. So I chose 2e-2, so 0.02.

This is our standard learning thing. You don't have to watch it train.

I've just trained it. And so look at this, by using 

ResNet, we've gone up from 91.7 to 92.2%. This just keeps getting better.

So that's pretty nice. And you know, this ResNet is not anything fancy.

It's the simplest possible ResBlock, right? The model is literally copied and pasted from 

before, with each place it said conv replaced with ResBlock.

But we've just been thoughtful about it, you know, and here's something very interesting.

We can actually try lots of other ResNets by grabbing timm.

So that's Ross Wightman's PyTorch image model library.

And if you call timm.list_models('*resnet*'), there's a lot of ResNets.

And I tried quite a few of them. Now one thing that's interesting is, if you 

actually look at the source code for timm, you'll see that the various different ResNets 

like ResNet18, ResNet18d, ResNet10d, they're defined in a very nice way using 

this very elegant configuration. You can see exactly what's different.

So there's basically only one line of code different between each different type of ResNet

for the main ResNets. And so what I did was I tried all the timm 

models I could find, and I even tried importing the underlying things and building 

my own ResNets from those pieces. And the best I found was the ResNet18d. And if I train it in exactly 

the same way, I got to 92%. And so the interesting thing is 

you'll see that's less than our 92.2. And it's not like I tried 

lots of things to get here. This was the very first thing I tried.

Whereas this ResNet18d was after trying lots and lots of different timm models.

And so what this shows is that just a thoughtfully designed kind 

of basic architecture goes a very long way.

It's actually better for this problem than any of the PyTorch image models, ResNets,

that I could find. So I think that's quite amazing, actually.

It's really cool. And it shows that you can create 

a state-of-the-art architecture just by using some common sense.

So I hope that's encouraging. So anyway, so we're up to 92.2%.

We're not done yet. Because we haven't even talked 

about data augmentation. All right.

So, let's keep going. So we're going to make 

everything the same as before. But before we do data augmentation, we're 

going to try to improve our model even further, if we can. So I said it was kind of not constructed 

with any great care and thought, really. Like in terms of this ResNet, we just took 

the ConvNet and replaced it with a ResNet. So it's effectively twice as deep, because 

each conv block has 2 convolutions. But ResNets train better than ConvNets.

So surely we could go deeper and wider still. So I thought, OK, how could we go wider?

And I thought, well, let's take our model. And previously, we were going from 8 up to 256.

What if we could get up to 512? And I thought, OK, well, one way to do that 

would be to make our very first ResBlock not have a kernel size of 

3, but a kernel size of 5. So that means that each 

grid is going to be 5 by 5. That's going to be 25 inputs.

So I think it's fair enough, then, to have 16 outputs.

So if I use a kernel size of 5, 16 outputs, then that means if I keep doubling as before,

I'm going to end up at 512 rather than 256. OK, so that's the only change I made, was 

to add ks equals 5 here and then change to double all the sizes.

And so if I train that, wow, look at this, 92.7%

So we're getting better still. And again, it wasn't with lots of 

trying and failing and whatever. It was just like saying, 

well, this just makes sense. And the first thing I tried, it just worked.

We're just trying to use these sensible, thoughtful approaches.

Next thing I'm going to try isn't necessarily something to make it better, but it's something

to make our ResNet more flexible. Our current ResNet is a bit 

awkward in that the number of stride-2 layers has to be exactly

big enough that the last of them ends up with a 1 by 1 output.

So you can flatten it and do the linear. So that's not very flexible, because what 

if you've got something of a different size? So to make that unnecessary, I've created 

a get_model_2, which goes less far. It has one less layer.

So it only goes up to 256, despite starting at 16. And so because it's got one less layer, that 

means that it's going to end up at 2 by 2. So what do we do?

Well, we can do something very straightforward, which is we can take the mean over the 2 by 2.

And so if we take the mean over the 2 by 2, that's going to give us batch size 

by channels output, which is what we can then put into

our linear layer. So this is called, this ridiculously simple 

thing, is called a Global Average Pooling layer.

That's the Keras term. In PyTorch, it's basically the same.

It's called an Adaptive Average Pooling layer. But in PyTorch, you can cause it to 

have an output other than 1 by 1. But nobody ever really uses it that way.

So they're basically the same thing. This is actually a little bit more convenient 

than the PyTorch version, because you don't have to flatten it.
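
As a rough sketch of what that layer does (the batch and channel sizes below are made up for illustration), taking the mean over the last two axes should give the same values as PyTorch's `nn.AdaptiveAvgPool2d(1)` followed by a flatten:

```python
import torch
from torch import nn

class GlobalAvgPool(nn.Module):
    "Average over the height and width axes, leaving (batch, channels)."
    def forward(self, x):
        return x.mean((-2, -1))

# Illustrative sizes (assumed): 64 images, 256 channels, 2x2 grid.
x = torch.randn(64, 256, 2, 2)
pooled = GlobalAvgPool()(x)

# The PyTorch layer keeps a 1x1 grid, so it needs a flatten afterwards.
reference = nn.Flatten()(nn.AdaptiveAvgPool2d(1)(x))

print(pooled.shape)                                  # torch.Size([64, 256])
print(torch.allclose(pooled, reference, atol=1e-6))  # compares the two versions
```

The result can go straight into a linear layer with 256 inputs.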

So this is Global Average Pooling. So you can see here, after our 

last ResBlock, which gives us a 2 by 2 output, we have GlobalAvgPool

And that's just going to take the mean. And then we can do the Linear, BatchNorm as usual. So I wanted to improve my summary patch to 

include not only the number of parameters, but also the approximate number of megaFLOPs.

So a FLOP is a floating point operation; FLOPS, with a capital S, is floating point operations per second.

I'm not going to promise my calculation is exactly right.

I think the basic idea is right. I just basically actually calculated.

It's not really FLOPs. I actually calculated the 

number of multiplications. So this is not perfectly accurate, 

but it's pretty indicative, I think. So this is the same summary I had before, 

but I added an extra thing, which is a _flops function, where you pass in the weight matrix 

and the height and the width of your grid. Now if the number of dimensions of the weight 

matrix is less than 3, then we're just doing like a linear layer or something.

So actually just the number of elements is the number of FLOPs, because it's just a matrix

multiply. But if you're doing a convolution, so the 

dimension is 4, then you actually do that matrix multiply for everything 

in the height by width grid. So that's how I calculate this 

kind of FLOPs equivalent number. So okay, so if I run that on this model, we 

can now see our number of parameters compared to the ResNet model has gone from 

1.2 million up to 4.9 million. And the reason why is because we've got this, 

we've got this ResBlock that gets all the way up to 512.
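
The counting rule described above can be sketched roughly as follows. The name `approx_mults` and the two example layers are assumptions for illustration, not the notebook's code, and like the version described it counts multiplications rather than true FLOPs.

```python
import torch
from torch import nn

def approx_mults(weight, h, w):
    """Approximate multiplications for one weight tensor.
    h, w: height and width of the grid the layer is applied over."""
    if weight.dim() < 3:      # linear layer or similar: one multiply per element
        return weight.numel()
    if weight.dim() == 4:     # conv: the same multiply repeated at every grid cell
        return weight.numel() * h * w
    return 0                  # other shapes are ignored in this sketch

conv = nn.Conv2d(1, 16, kernel_size=5, padding=2)   # weight shape: 16 x 1 x 5 x 5
lin = nn.Linear(256, 10)                            # weight shape: 10 x 256

print(approx_mults(conv.weight, 28, 28))  # 400 * 784 = 313600
print(approx_mults(lin.weight, 1, 1))     # 2560
```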

And the way we did this is we made that a stride-1 layer.

So that's why you can see here it's gone 2, 2 and it stayed at 2, 2.

So I wanted to make it as similar as possible to the last ones.

It's got the same 512 final number of channels. And so most of the parameters are in that 

last block for the reason we just discussed. Interestingly, though, it's 

not as clear for the megaFLOPs. It is the greatest of them, but in terms of 

number of parameters, I think this has more parameters than all the other 

ones added together by a lot. But that's not true of megaFLOPs.

And that's because this first layer has to be done 28 by 28 times, whereas this layer

only has to be done 2 by 2 times. Anyway, so I tried training that 

and got a pretty similar result, 92.6%. And that kind of made me think, oh, let's 

fiddle around with this a little bit more to see like what kind of things would reduce 

the number of parameters and the megaFLOPs. The reason you care about reducing the number 

of parameters is that it has lower memory requirements. And the reason you want to reduce the 

number of FLOPs is it's less compute. So in this case, what I've done here is I've removed this line of code.

So I've removed the line of code that takes it up to 512.

So that means we don't have this layer anymore. And so the number of parameters has gone 

down from 4.9 million down to 1.2 million. Not a huge impact on the megaFLOPs, 

but a huge impact on the parameters. We've reduced it by like 2 

thirds or 3 quarters or something by getting rid of that.

And if we take the very first ResNet 

block, why is it this 5.3 megaFLOPs?

Because although the very first one starts with just one channel, the first conv, remember

our ResNet blocks have two convs. So the second conv is going 

to be a 16 by 16 by 5 by 5. And again, I'm partly doing this to show you 

the actual details of this architecture, but I'm partly showing it so that you can see 

how to investigate exactly what's going on in your models.
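
To make the arithmetic concrete, here is a back-of-the-envelope sketch that counts only the conv weight tensors (biases, BatchNorm and any shortcut conv are ignored, so the totals are approximate). The 512 by 256 by 3 by 3 shape used for a conv on the 2 by 2 grid is an assumption based on the channel counts mentioned above, and `conv_cost` is a hypothetical helper.

```python
def conv_cost(c_out, c_in, k, h, w):
    "Return (parameters, multiplications) for one conv weight tensor."
    params = c_out * c_in * k * k
    return params, params * h * w

# First ResBlock: two 5x5 convs, both applied across the 28x28 grid.
p1, m1 = conv_cost(16, 1, 5, 28, 28)
p2, m2 = conv_cost(16, 16, 5, 28, 28)
print(p1 + p2)            # 6800 parameters
print((m1 + m2) / 1e6)    # 5.3312 million multiplications

# A 256 -> 512 conv with a 3x3 kernel on a 2x2 grid (assumed shape).
p3, m3 = conv_cost(512, 256, 3, 2, 2)
print(p3)                 # 1179648 parameters
print(m3 / 1e6)           # 4.718592 million multiplications
```

The small early layer costs more compute than the much larger late layer because it is repeated 784 times rather than 4.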

And I really want you to try these. So if we train that one, 

interestingly, even though it's only a quarter or something of

the size, we get the same accuracy, 92.7%. So that's interesting.

Can we make it faster? Well, at this point, this is the obvious place 

to look at is this first ResNet block, because that's where all the megaFLOPs are.

And as I said, the reason is because it's got two convs.

The second one is 16 by 16 channels, 16 channels in, 16 

channels out, and it's doing these 5 by 5 kernels. And it's having to do it 

across the whole 28 by 28 grid. So that's the bulk of the biggest compute. So what we could do is we could replace 

this ResBlock with just one convolution. And if we do that, then you'll 

see that we've now got rid of the 16 by 16 by 5 by 5.

We just got the 16 by 1 by 5 by 5. So the number of megaFLOPs has 

gone down from 18.3 to 13.3. The number of parameters hasn't 

really changed at all, right? Because the number of parameters 

was only 6,800, right? So be very careful that when you see people 

talk about, oh, my model has less parameters. That doesn't mean it's faster.

Okay. Really, it doesn't mean that at all.

There's no particular relationship between parameters and speed.

Even counting megaFLOPs doesn't always work that well, because it doesn't take account

of the amount of things moving through memory. But it's not a bad approximation here.

So here's one which has got much less megaFLOPs. And in this case, it's about 

the same accuracy as well. So I think this is really interesting.

We've managed to build a model that has far less parameters and far less megaFLOPs and

has basically exactly the same accuracy. So I think that's a really 

important thing to keep in mind. And remember, this is still way 

better than the ResNet18d from timm. So we built something that 

is fast, small, and accurate. So the obvious question is, 

what if we train for longer? And the answer is, if we train for longer, 

if we train for 20 epochs, I'm not going to wait for it.

The training accuracy gets up to 0.999. But the validation accuracy is worse.

It's 0.924. And the reason for that is that after 20 epochs, 

it's seen the same picture so many times, it's just memorizing them. And so once you start memorizing, 

things actually go downhill. So we need to regularize.

Now something that we have claimed in the past can regularize is to use weight decay.

But here's where I'm going to point out that weight decay doesn't regularize at all if

you use BatchNorm. And it's fascinating.

For years, people didn't even seem to notice this. And then somebody, I think, finally 

wrote a paper that pointed this out. And people were like, oh, wow, that's weird.

But it's really obvious when you think about it. A BatchNorm layer has a single set of 

coefficients which multiplies an entire layer. So that set of coefficients could 

just be the number 100 in every place. And that's going to multiply the entire previous 

weight matrix or convolution kernel matrix by 100.

As far as weight decay is concerned, that's not much of an impact at all because the BatchNorm

layer has very few weights. So it doesn't really have a 

huge impact on weight decay. But it massively increases the 

effective scale of the weight matrix. So BatchNorm basically lets the neural net 

cheat by increasing the coefficients, the parameters, even nearly as much as it wants 

indirectly just by changing the BatchNorm layer's weights.
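
A closely related way to see this scale interaction is a small experiment. This is a sketch under assumptions: a single conv followed by BatchNorm in training mode, with made-up sizes. Shrinking the conv weights, which is what weight decay pushes towards, cuts the penalty but leaves the normalized output essentially unchanged.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(32, 3, 8, 8)

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)          # training mode: normalizes with batch statistics

shrunk = copy.deepcopy(conv)
with torch.no_grad():
    shrunk.weight.mul_(0.5)     # halve every weight

with torch.no_grad():
    out_full = bn(conv(x))
    out_shrunk = bn(shrunk(x))

# Sum of squared weights, the quantity that weight decay penalizes.
penalty_full = conv.weight.pow(2).sum().item()
penalty_shrunk = shrunk.weight.pow(2).sum().item()

print(penalty_shrunk / penalty_full)                     # about 0.25
print(torch.allclose(out_full, out_shrunk, atol=1e-3))   # equal up to BatchNorm's eps
```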

So weight decay is not going to save us. And that's something really 

important to recognize. Weight decay is not, I mean, with BatchNorm 

layers, I don't see the point of it at all. It does have some, like, there has 

been some studies of what it does. And it does have some weird kind of 

second order effects on the learning rate. But I don't think you should rely on them.

You should use a scheduler for changing the learning rate rather than weird second order

effects caused by weight decay. So instead, we're going to do data augmentation, 

which is where we're going to modify every image a little bit by random change so that 

it doesn't see the same image each time. So there's not any particular reason to 

implement these from scratch, to be honest. We have implemented them 

all from scratch in fastai. So you can certainly look 

them up if you're interested. But it's actually a little bit separate 

to what we're meant to be learning about. So I'm not going to go through it. But yeah, if you're interested, 

go into fastai, vision, augment. And you'll be able to see, for 

example, how do we do flip? And you know, it's just like x.transpose.

Okay, which is not really, yeah, it's not that interesting.

Yeah, how do we do cropping and padding? How do we do random crops, so on and so forth?

Okay, so we're just going to actually, you know, fastai has probably got the best implementation of these, but torchvision’s are fine.

So we'll just use them. And so we've created before 

a batch transform callback. And we used it for normalization, if you remember.

So what we could do is we could create a transform batch function, which transforms the inputs

and transforms the outputs using two different functions.

So that would be an augmentation callback. And so then you would say, okay, for the transform 

batch function, for example, in this case, we want to transform our x's.

And how do we want to transform our x's? And the answer is, we want to transform them 

using this module, which is a sequential module of first of all doing a RandomCrop, 

and then a RandomHorizontalFlip. Now it seems weird to randomly crop a 28 by 

28 image to get a 28 by 28 image, but we can add padding to it.

And so effectively, it's going to randomly add padding on one or both sides to do this

kind of random crop. One thing I did to change 

the BatchTransform callback, can't remember if I've mentioned

this before, but something I changed slightly since we first wrote it, is I added this on_train

and on_val so that it only does it if you said I want to do it on training and it's

training, or I want to do it on validation and it's not training.

And then this is all the code is. So data augmentation, generally speaking, 

shouldn't be done on validation, so we set on_val false.

Okay, so what I'm going to do first of all is I'm going

to use our classic SingleBatchCB trick and fit, in fact, even better, oh yeah, fit(1)

just doing training. And what I'm going to do then is after I 

fit, I can grab the batch out of the learner. And this is a way, this is quite cool, right?

This is a way that I can see exactly what the model sees, right?

So this is not relying on any, you know, approximations.

Remember when we fit, it puts it in the batch that it looks at into learn.batch.

So if we fit for a single batch, we can then grab that batch back out of it and we can

call show_images. And so here you can see 

this little crop it's added. Now something you'll notice is that every 

single image in this batch, notice I grabbed the first 16, so I don't want to show you 

1024, has exactly the same augmentation. And that makes sense, right, because 

we're applying a BatchTransform. Now why is this good and why is it bad?

It's good because this is running on the GPU, right?

Which is great because nowadays very often it's really hard to get enough CPU to feed

your fast GPU fast enough. Particularly if you use something 

like Kaggle or Colab that are really underpowered for

CPU, particularly Kaggle. So this way all of our transformations, all 

of our augmentation is happening on the GPU. On the downside, it means that 

there's a little bit less variety. Every mini batch has the same augmentation.

I don't think the downside matters though, because it's going to see lots of mini batches.

So the fact that each mini batch is going to have a different augmentation is actually

all I care about. So we can see that if we run this multiple times, you can see it's got a different augmentation

in each mini batch. Okay, so I decided actually I'm 

just going to use 1 padding. So I'm just going to do a very, very 

small amount of data augmentation. And I'm going to do 20 epochs 

using OneCycle learning rate. And so this takes quite a while to train, 

so we won't watch it, but check this out. We get to 93.8%.

That's pretty wild. Yeah that's pretty wild.

So I actually went on Twitter and I said to the entire world on Twitter, you know, which

if you're watching this in 2023 and Twitter doesn't exist anymore, ask somebody to tell you

about what Twitter used to be. Hopefully it still does.

Can anybody beat this in 20 epochs? You can use any model you like, any library 

you like, and nobody's got anywhere close. So this is pretty amazing.

And actually, you know, when I had a look at papers with code, there are, you know,

well, I mean, you can see it's right up there, right, with the kind of best models that are

listed, certainly better than these ones. And the better models all use, 

you know, 250 or more epochs. So yeah, if anybody, I'm hoping that somebody 

watching this will find a way to beat this in 20 epochs, that would be really great.

Because as you can see, we haven't really done anything very amazingly weirdly clever.

It's all very, very basic. And actually we can go even 

a bit further than 93.8. Just before we do, I mentioned that since 

this is actually taking a while to train now, I can't remember, it takes like 

10 to 15 seconds per epoch. So you know, you're waiting a few 

minutes, you may as well save it. So you can just call torch.save on a model, 

and then you can load that back later. So something that can make things even better 

is something called test time augmentation. I guess I should write this out properly here.

Test time augmentation. Now test time augmentation actually does 

our BatchTransform callback on validation as well.

And then what we're going to do is we're actually, in this case, we're going to do just a very,

very, very simple test time augmentation, which is we're going to add a BatchTransform

callback that runs on validate and it's not random, but it actually just does a horizontal

flip. Non-random, so it always does a horizontal flip.

And so check this out. What we're going to do is we're going to 

create a new callback called CapturePreds. And after each batch, it's 

just going to append to a list the predictions, and it's going

to append to a different list the targets. And that way we can just call learn.fit, train 

equals False, and it will show us the accuracy. And this is just the same 

number that we saw before. But then what we can do is we can call the 

same thing, but this time with a different callback, which is with the 

horizontal flip callback. And that way it's going to do exactly the 

same thing as before, but every time it's going to do a horizontal flip.

And weirdly enough, that accuracy is slightly higher, which that's not the interesting bit.

The interesting bit is that we've now got two sets of predictions. We've got the sets of predictions 

with the non-flipped version. We've got the set of predictions 

with the flipped version. And what we could do is we could stack 

those together and take the mean. So we're going to take the average of 

the flipped and unflipped predictions. And that gives us a better result still, 94.2%
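
The averaging step can be sketched like this. The tiny linear model and random batch are stand-ins (assumptions) so the snippet runs on its own; in practice the predictions would come from the trained model on the validation set.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Stand-in model and data (assumed): any trained classifier would go here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).eval()
xb = torch.randn(64, 1, 28, 28)

with torch.no_grad():
    preds_plain = model(xb)
    preds_flip = model(xb.flip(-1))    # horizontal flip: reverse the width axis

# Stack the two sets of predictions and average them.
avg_preds = torch.stack([preds_plain, preds_flip]).mean(0)
labels = avg_preds.argmax(dim=1)
print(avg_preds.shape, labels.shape)   # torch.Size([64, 10]) torch.Size([64])
```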

So why is it better? It's because looking at the image from kind 

of like multiple different directions gives it more opportunities to try to 

understand what this is a picture of. And so in this case, I'm just giving it two 

different directions, which is the flipped and unflipped version, and 

then just taking their average. So yeah, this is like a really nice little trick.

Sam's pointed out it's a bit like random forest, which is true.

It's a kind of bagging that we're doing. We're kind of getting multiple 

predictions and bringing them together. And so we can actually, so 94.2 I think is my best 20 epoch result.

And notice I didn't have to do any additional training.

So it still counts as a 20 epoch result. You can do test time augmentation where you 

do, you know, a much wider range of different augmentations that you trained with, and 

then you can use them at test time as well. You know, more, more crops or 

rotations or warps or whatever. I want to show you one of my favorite 

data augmentation approaches, which is called random

erasing. So random erasing, I'll show you 

what it's going to look like. Random erasing, we're going to add a little, 

we're going to basically delete a little bit of each picture and we're going to replace 

it with some random Gaussian noise. Now, in this case, we've just got one patch.

But eventually we're going to do more than one patch.

So I wanted to implement this because remember we have to implement everything from scratch.

And this one's a bit less trivial than the previous transforms.

So we should do it from scratch. And also I'm not sure there's 

that many good implementations. Ross Wightman's Tim I think has one.

And so, and it's also a very good exercise to see how to 

implement this from scratch. So let's grab a batch out of the training set.

And let's just grab the first 16 images. And so then let's grab the 

mean and standard deviation. Okay. And so what we want to do is we wanted to 

delete a patch from each image, but rather than deleting it, deleting it 

would change the statistics, right? If we set those all to zero, the mean and 

standard deviation are now not going to be 0 and 1. But if we replace them with exactly the same 

mean and standard deviation pixels that the picture has, or that our dataset has, 

then it won't change the statistics. So that's why we've grabbed the 

mean and standard deviation. And so we could then try grabbing, 

let's say we want to delete 0.2, so 20% of the height

and width. Then let's find out how big that size is.

So 0.2 of the shape, of the height and of the width, that's the size of the x and y.

And then the starting point, we're just going to randomly grab some starting point, right?

So in this case, we've got the starting point for x is 14, starting point for y is 0,

and then it's going to be a 5 by 5 spot. And then we're going to do a Gaussian or 

normal initialization of our mini batch, everything

in the batch, every channel for this x slice, this y slice, 

and we're going to initialize it with this mean and standard 

deviation, normal random noise. And so that's what this is.

So it's just that tiny little bit of code. So you'll see, I don't 

start by writing a function. I start by writing single lines of code that 

I can run independently and make sure that they all work and that I look at the 

pictures and make sure it's working. Now one thing that's wrong here is that you 

see how the different, you know, this looks black and this looks gray.

Now at first this was confusing me as to what's going on.

What's it changed? Because the original images didn't look like that.

And I realized the problem is that the minimum and the maximum have changed.

It used to be from -0.8 to 2. That was the previous min and max.

Now it goes from -3 to 3. So the noise we've added has the same mean 

and standard deviation, but it doesn't have the same range because the pixels were 

not normally distributed originally. So normally distributed noise actually is wrong.
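
The line-by-line exploration described here might look roughly like the following. The batch is a synthetic stand-in (an assumption), built to be skewed the way image pixels are and then normalized, so the exact numbers will differ from the lesson's.

```python
import torch

torch.manual_seed(0)
# Stand-in batch (assumed): skewed pixel values, normalized to mean 0 and std 1.
raw = torch.rand(16, 1, 28, 28) ** 4
xb = (raw - raw.mean()) / raw.std()
print(xb.min().item(), xb.max().item())         # range before adding noise

pct = 0.2
szx = int(pct * xb.shape[-2])                   # patch height: int(5.6) = 5
szy = int(pct * xb.shape[-1])                   # patch width: also 5
stx = int(torch.rand(1).item() * (1 - pct) * xb.shape[-2])   # random start row
sty = int(torch.rand(1).item() * (1 - pct) * xb.shape[-1])   # random start column

# Fill the same patch in every image and channel with normal noise
# that has the batch's mean and standard deviation.
xm, xs = xb.mean().item(), xb.std().item()
noisy = xb.clone()
noisy[:, :, stx:stx + szx, sty:sty + szy].normal_(mean=xm, std=xs)
print(noisy.min().item(), noisy.max().item())   # the range can widen
```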

So to fix that, I created a new version and I'm putting in a function now.

It does all the same stuff as before, as I just did before, but it clamps the random

pixels to be between min and max. And so it's going to be exactly the same thing, 

but it's going to make sure that it doesn't change the range.

That's really important, I think. Because changing the range really impacts 

your, you know, your activations quite a lot. So here's what that looks like.

And so as you can see now, all of the backgrounds have that nice black and it's still giving

me random pixels. And I can check, and because I've done the 

clamping, you know, and stuff, the mean and standard deviation aren't quite 0, 

1, but they're very, very close. So I'm going to call that good enough. And of course the min and max haven't changed 

because I clamped them to ensure they didn't change.

So that's my random erasing. So that randomly erases one block.

And so I could create a random erase, which will randomly choose up to, in this case,

four blocks. So with that function, oh, that's annoying.

It happened to be zero this time. Okay, we'll just run it again.

This time it's got three, so that's good. So you can see it's got, oh, maybe it's 

four, one, two, three, four blocks. Okay.

So that's what this data augmentation looks like. So we can create a class to 

do this data augmentation. So you'll pass in what percentage to do in 

each block, what the maximum number of blocks to have is, store that away.

And then in the forward, we're just going to call our random arrays function, passing

in the input and passing in the parameters. Great.
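
Putting those pieces together, a sketch of the clamped version, the multi-block wrapper and the module could look like this. The function names follow the description above but the details are assumptions rather than the notebook's exact code, and the batch is a random stand-in.

```python
import random
import torch
from torch import nn

def rand_erase1(x, pct, xm, xs, mn, mx):
    "Overwrite one random patch with noise, then clamp to the original range."
    szx, szy = int(pct * x.shape[-2]), int(pct * x.shape[-1])
    stx = int(random.random() * (1 - pct) * x.shape[-2])
    sty = int(random.random() * (1 - pct) * x.shape[-1])
    x[:, :, stx:stx + szx, sty:sty + szy].normal_(mean=xm, std=xs)
    x.clamp_(mn, mx)

def rand_erase(x, pct=0.2, max_num=4):
    "Erase between 0 and max_num patches, in place, and return x."
    xm, xs = x.mean().item(), x.std().item()
    mn, mx = x.min().item(), x.max().item()
    for _ in range(random.randint(0, max_num)):
        rand_erase1(x, pct, xm, xs, mn, mx)
    return x

class RandErase(nn.Module):
    def __init__(self, pct=0.2, max_num=4):
        super().__init__()
        self.pct, self.max_num = pct, max_num
    def forward(self, x):
        return rand_erase(x, self.pct, self.max_num)

xb = torch.randn(16, 1, 28, 28)            # stand-in batch (assumed)
lo, hi = xb.min().item(), xb.max().item()
out = RandErase(pct=0.2, max_num=4)(xb.clone())
print(out.shape, out.min().item() >= lo, out.max().item() <= hi)
```

Note that this sketch modifies its input in place, which is why the usage line passes a clone.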

So now we can use random crop, random flip and RandErase. Make sure it looks okay.

And so now we're going to go all the way up to 50 epochs.

And so if I run this for 50 epochs, I get 94.6%.

Isn't that crazy? So we're really right up there 

now, up, we're even above this one. So we're somewhere up here.

And this is like stuff people write papers about from 2019, 2020.

Oh look, here's the random erasing paper. That's cool.

So they were way ahead of their time in 2017, but yeah, that would have changed for a lot

longer. Now I was having a think and I realized something, which is like, why, like, how do I, how do

we actually get the correct distribution? Right?

Like in some ways it shouldn't matter, but I was kind of like bothered by this thing

of like, well, we don't actually end up with 0, 1 and this kind of like clamping.

It all feels a bit weird. Like how do we actually replace these pixels 

with something that is guaranteed to be the correct distribution?

And I realized there's actually a very simple answer to this, which is we could copy another

part of the picture over to here. If we copy part of the picture, we're guaranteed 

to have the correct distribution of pixels. And so it wouldn't exactly 

be random erasing anymore. That would be random copying.

Now I'm sure somebody else has invented this. I mean, you know I'm not saying this, 

nobody's ever thought of this before. So if anybody knows a paper that's 

done this, please tell me about it. But I, you know, I think it's a very sensible 

approach and it's very, very easy to implement. So again, we're going to 

implement it all manually, right? So let's get our x mini batch and 

let's get our, again, our size. And again, let's get the x, y that we're going 

to be erasing, but this time we're not erasing, we're copying, so we'll then randomly 

get a different x, y to copy from. And so now it's just, instead of init random 

noise, we just say, replace this slice of the batch with this slice of the batch. And we end up with, you know, you can see 

here, it's kind of copied little bits across. Some of them you can't really see at all.

And some of you can, because I think some of them are black and it's replaced black,

but I guess it's knocked off the end of this shoe, added a little bit extra here, a little

bit extra here. So we can now, again, we'll 

turn it into a function. Once I've tested it in the REPL, 

make sure the function works. And obviously this, in this case, it's copying 

it largely from something that's largely black for a lot of them.

And then again, we can do the thing where we do it multiple times.

And here we go. It's got a couple of random copies.

And so again, turn that into a class, create our transforms.
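
A sketch of the random copy idea under the same assumptions as before (names follow the description, details are illustrative, the batch is a random stand-in). The `clone()` on the source patch is an addition here so that overlapping source and destination regions behave predictably.

```python
import random
import torch
from torch import nn

def rand_copy1(x, pct):
    "Copy one random patch of each image over another random patch of the same image."
    szx, szy = int(pct * x.shape[-2]), int(pct * x.shape[-1])
    def start(n):
        return int(random.random() * (1 - pct) * n)
    stx1, sty1 = start(x.shape[-2]), start(x.shape[-1])   # destination corner
    stx2, sty2 = start(x.shape[-2]), start(x.shape[-1])   # source corner
    src = x[:, :, stx2:stx2 + szx, sty2:sty2 + szy].clone()
    x[:, :, stx1:stx1 + szx, sty1:sty1 + szy] = src

def rand_copy(x, pct=0.2, max_num=4):
    "Copy between 0 and max_num patches, in place, and return x."
    for _ in range(random.randint(0, max_num)):
        rand_copy1(x, pct)
    return x

class RandCopy(nn.Module):
    def __init__(self, pct=0.2, max_num=4):
        super().__init__()
        self.pct, self.max_num = pct, max_num
    def forward(self, x):
        return rand_copy(x, self.pct, self.max_num)

xb = torch.randn(16, 1, 28, 28)            # stand-in batch (assumed)
out = RandCopy()(xb.clone())
# Every output pixel came from the input, so the range cannot grow.
print(out.shape, out.min() >= xb.min(), out.max() <= xb.max())
```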

And again, we, okay. So again, we can have a look at a 

batch to make sure it looks sensible. And do it for, just did it for 

25 epochs here and gets to 94%. Now why did I do it for 25 epochs?

Because I was trying to think about how do I beat my 50 epoch record, which was 94.6%.

And I thought, well, what I could do is I could train for 25 epochs and then I'll train

a whole new model for a different 25 epochs. And I'm going to put it in a 

different learner, learn2, right? This one is 94.1%.

So one of the models was 94.1%. One of them was 94%.

Maybe you can guess what we're going to do next. It's a bit like test time augmentation, but 

rather than that, we're going to grab the predictions of our first learner and grab 

the predictions of our second learner and stack them up and take the mean.
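
As a sketch of that step: the predictions below are synthetic stand-ins (an assumption) for what would be captured from the two learners, built as a true signal plus independent noise so that averaging has something to cancel.

```python
import torch
import torch.nn.functional as F

def accuracy(preds, targets):
    "Fraction of rows whose highest-scoring class matches the target."
    return (preds.argmax(dim=1) == targets).float().mean().item()

torch.manual_seed(0)
targets = torch.randint(0, 10, (1000,))
signal = F.one_hot(targets, 10).float()

# Stand-ins (assumed) for the captured predictions of two separately trained models.
preds1 = signal + 2 * torch.randn(1000, 10)
preds2 = signal + 2 * torch.randn(1000, 10)

# Stack the two sets of predictions and take the mean.
ensemble = torch.stack([preds1, preds2]).mean(0)
print(accuracy(preds1, targets), accuracy(preds2, targets), accuracy(ensemble, targets))
```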

And this is called ensembling. And not surprisingly, the ensemble is better 

than either of the two individual models. Although, unfortunately, I'm afraid 

to say we didn't beat our best. But it's a useful trick.

In this case, I was kind of like trying something 

a bit interesting to see if using the exact same number of epochs, can I get a better 

result by using ensembling instead of training for longer?

And the answer was I couldn't. Maybe it's because the random copy is not as 

good, or maybe I'm using too much augmentation. Who knows?

But it's something that you could experiment with. So Sharwon mentions in the chat that cutmix 

is similar to this, which is actually, that's a good point.

I'd forgotten cutmix, but cutmix, yes, copies it from different images rather than from

the same image. But yeah, it's pretty much 

the same thing, I guess-ish. Well, similar.

Yeah, very similar. All right.

So that brings us to the end of the lesson. And I am so pumped and excited to share this 

with you because I don't know that this has ever been done before, to be able to go from, 

I mean, even in our previous courses, we've never done this before, go from scratch, step 

by step to an absolute state of the art model where we build everything 

ourselves and it runs this quickly. And we're even using our own 

custom ResNet and everything, just using common sense at

every stage. And so hopefully that shows 

that deep learning is not magic, that we can actually build the

pieces ourselves. And yeah, as you'll see, going up to larger 

data sets, absolutely nothing changes. And so it's exactly these techniques.

And this is actually, I do 99% of my research on very small data sets because you can iterate

much more quickly, you can understand them much better.

And I don't think there's ever been a time where I've then gone up to a bigger data set

and my findings didn't continue to hold true. Now, homework, what I would really like you to 

do is to actually do the thing that I didn't do, which is to do the… create your own 

schedulers that work with PyTorch's optimizers. So, I mean, the tricky bit will be making 

sure that you understand the PyTorch API well, which I've really laid out here.

So study this carefully. So create your own cosine 

annealing scheduler from scratch, and then create your own 1-Cycle

scheduler from scratch. And make sure that they work correctly 

with this batch scheduler callback. This will be a very good exercise for you 

in, you know, hopefully getting extremely frustrated as things don't work the way you 

hoped they would and being mystified for a while and then working through it, you know, 

using this very step-by-step approach, lots of experimentation, lots of 

exploration, and then figuring it out. That's the journey I'm hoping you have.

If it's all super easy and you get it first go, then, you know, you have to find something

else to do. But yeah, I'm hoping you'll find it actually, 

you know, surprisingly tricky to get it all working properly and in the process of doing so, 

you're going to have to do a lot of exploration and experimentation.

But you'll realize that it requires no prerequisite knowledge at all.

Okay, so if it doesn't work first time, it's not because there's something that you didn't

learn in graduate school, if only you had done a PhD, whatever.

It's just that you need to dig through, you know, slowly and carefully to see how it all

works. And you know, then see how neat 

and concise you can get it. And the other homework is to try and beat me.

I really, really want people to beat me. Try to beat me on the 5 epoch or the 

20 epoch or the 50 epoch Fashion-MNIST. Ideally, using miniai with things 

that you've added yourself. But you know, you can try grabbing 

other libraries if you like. Well, ideally, if you do grab another library 

and you find you can beat my approach, try to re-implement that library.

That way you are still within the spirit of the game.

Okay, so in our next lesson, Johno and Tanishq and I are going 

to be putting this all together to create a diffusion model from scratch.

And we're actually going to be taking a couple of lessons for this.

Not just a diffusion model, but a variety of interesting generative approaches.

So we're kind of starting to come full circle. So thank you so much for joining 

me on this very extensive journey. And I look forward to hearing 

what you come up with. Please do come and join us on 

forums.fast.ai and share your progress. Bye!
