Hi folks, thanks for joining me for Lesson 18.
We're going to start today in Microsoft Excel. You'll see there's an Excel folder
actually in the course22p2 repo. And in there there's a spreadsheet
called graddesc as in gradient descent, which I guess
we should zoom in a bit here. So there's some instructions here, but this is basically
describing what's in each sheet. We're going to be looking at the various
SGD accelerated approaches we saw last time, but
done in a spreadsheet. We're going to do something very, very simple,
which is to try to solve a linear regression. So the actual data was generated with y =
ax + b, where a, which is the slope, was 2 and b, the intercept, was 30. And so you can see we've got
some random numbers here. And then over here, we've
got the ax + b calculation. So then what I did is I copied and pasted as
values, just one set of those random numbers into the next sheet called basic.
This is the basic SGD sheet. So that's what x and y are.
And so the idea is we're going to try to use SGD to learn that the intercept is 30 and
the slope is 2. Those are our weights, or parameters.
So the way we do SGD is we start
out at some random kind of guess. So my random guess is going to be 1
and 1 for the intercept and slope. And so if we look at the very first data point,
which is x is 14 and y is 58, the intercept and slope are both 1, then
we can make a prediction. And so our prediction is just equal
to slope times x plus the intercept. So the prediction will be 15.
Now, actually, the answer was 58. So we're a long way off.
So we're going to use mean squared error. So the mean squared error is just the
error, so the difference, squared. Okay, so one way to calculate how much
the squared error would change if we changed the intercept, which is b, would be just to
change b by a little bit and see what the error is.
So here that's what I've done is I've just added 0.01 to the intercept and then calculated
y and then calculated the difference squared. And so this is what I mean by er b1.
This is the error squared I get if I change b by 0.01.
So it's made the error go down a little bit. So that suggests that we should probably
increase b, increase the intercept. So we can now calculate the estimated derivative
by simply taking the change from when we use the actual intercept to when we use
the intercept plus 0.01. So that's the rise, and we divide it by
the run, which, as we said, is 0.01. And that gives us the estimated derivative
of the squared error with respect to b, the intercept.
Okay, so it's about negative 86 (-85.99). So we can do exactly the same thing for a,
so change the slope by 0.01, calculate y, calculate the difference and square it.
And we can calculate the estimated derivative in the same way, rise, which is the difference
divided by run, which is 0.01. And that's quite a big number, minus 1,200.
In both cases, the estimated derivatives are negative.
So that suggests we should increase the intercept and the slope.
And we know that that's true because actually the intercept and the slope are both bigger
than 1. The intercept should be
30 and the slope should be 2. So there's one way to calculate the derivatives.
Another way is analytically. The derivative of something squared is 2 times that something.
So here it is here. I've just written it down for you.
So here's the analytic derivative. It's just two times the difference.
And then the derivative for the slope is here. And you can see that the estimated version
using the rise over run and the little 0.01 change and the actual, they're pretty similar.
Okay. And same thing here, they're pretty similar.
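The comparison between the finite-difference estimate and the analytic derivative can be sketched in a few lines of Python. This is only an illustration of the spreadsheet arithmetic for the first data point (x = 14, y = 58, both weights starting at 1); the function and variable names are made up for this sketch and do not come from the spreadsheet.

```python
def sq_err(a, b, x, y):
    """Squared error of the prediction a*x + b against the target y."""
    return (a * x + b - y) ** 2

x, y = 14, 58        # first data point quoted in the lesson
a, b = 1.0, 1.0      # initial guess for slope and intercept
eps = 0.01           # the small nudge used for rise over run

# Finite differencing: nudge one weight, see how the squared error changes.
est_db = (sq_err(a, b + eps, x, y) - sq_err(a, b, x, y)) / eps
est_da = (sq_err(a + eps, b, x, y) - sq_err(a, b, x, y)) / eps

# Analytic derivatives from the chain rule.
diff = a * x + b - y
de_db = 2 * diff
de_da = 2 * diff * x

print(f"b: estimated {est_db:.2f}, analytic {de_db:.2f}")
print(f"a: estimated {est_da:.2f}, analytic {de_da:.2f}")
```

The estimates come out close to, but not exactly equal to, the analytic values, which is the kind of agreement you would hope to see when testing hand-written gradients.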
So anytime I calculate gradients kind of analytically, but by hand,
I always like to test them against doing the actual rise over run
calculation with some small number. And this is called using the
finite differencing approach. We only use it for testing because it's slow
because you have to do a separate calculation for every single weight.
But it's good for testing. We use analytic derivatives
all the time in real life. Anyway, so however we calculate the
derivatives, we can now calculate a new slope. So our new slope will be equal to the previous
slope minus the derivative times the learning rate, which we've just set here at 0.0001.
And we can do the same thing for the intercept as you see.
And so here's our new slope and intercept. So we can use that for the second row of data.
So the second row of data is x equals 86, y equals 202.
So our intercept and slope are not 1 and 1 anymore;
they're 1.01 and 1.12. So here we're just using a formula just
to point at the new intercept and slope. We can get a new prediction and squared error
and derivatives, and then we can get another new slope and intercept.
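As a rough sketch of the row-by-row update, the following Python applies the same rule (new weight = old weight minus derivative times learning rate) to the two rows quoted so far. Only those two rows are used because the rest of the sheet's data is not given here; `sgd_step` is a name invented for this sketch.

```python
def sgd_step(a, b, x, y, lr=0.0001):
    """One SGD update on a single (x, y) row using the analytic gradients."""
    diff = a * x + b - y                    # prediction minus target
    de_da, de_db = 2 * diff * x, 2 * diff   # derivatives of the squared error
    return a - lr * de_da, b - lr * de_db

rows = [(14, 58), (86, 202)]   # the two rows quoted in the lesson
a, b = 1.0, 1.0                # starting slope and intercept
history = []
for x, y in rows:
    a, b = sgd_step(a, b, x, y)
    history.append((a, b))

for step, (slope, intercept) in enumerate(history, start=1):
    print(f"after row {step}: slope={slope:.4f} intercept={intercept:.4f}")
```

After the first row this gives a slope of about 1.12 and an intercept of about 1.01, matching the values read off the sheet.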
And so that was a pretty good one, actually. It really helped our slope head in the right
direction, although the intercept's moving pretty slowly.
And so we can do that for every row of data. Now strictly speaking, this is not the mini-batch
gradient descent that we normally do in deep
learning. It's a simpler version where
every batch is of size one. So I mean, it's still stochastic gradient descent.
It's just a batch size of one. I think sometimes it's called online
gradient descent, if I remember correctly. So we go through every data point in our very
small data set until we get to the very end. And so at the end of the first epoch, we've
got an intercept of 1.06 and a slope of 2.57. And those indeed are better estimates
than our starting estimates of 1, 1. So what I would do is I would copy our slope
2.57 up to here, 2.57, I'll just type it for now.
And I'll copy our intercept up to here. And then it goes through the entire epoch
again, and we get another intercept and slope. And so we could keep copying and pasting
and copying and pasting again and again. And we can watch the root
mean squared error going down. Now that's pretty boring doing
that copying and pasting. So what we could do is fire up Visual Basic for Applications.
And sorry, this might be a bit small. I'm not sure how to increase the font size.
So you might want to just open it on your own computer to be able to see it clearly.
But basically it shows I've created a little macro where if you click on the reset button,
it's just going to set the slope and constant to 1 and calculate.
And if you click the run button, it's going to go through five times, calling one step.
And what one step is going to do is it's going to copy the slope, last slope to the new slope
and the last constant intercept to the new constant intercept.
And also do the same for the RMSE. And it's actually going to paste it down to
the bottom for reasons I'll show you in a moment.
So if I now run this, reset and then run.
There we go. You can see it's run it five times.
And each time it's pasted the RMSE. And here's a chart of it showing it going down.
So you can see the new slope is 2.57. The new intercept is 1.27.
I could keep running it another five. So this is just doing copy, paste,
copy, paste, copy, paste, five times. And you can see that the RMSE is
very, very, very slowly going down. And the intercept and slope are very, very,
very slowly getting closer to where they want to be.
The big issue really is that the intercept is meant to be 30.
It looks like it's going to take a very, very long time to get there, but it will get there
eventually if you click run enough times or maybe set the VBA macro to loop more than
five steps at a time. But you can see it's moving very slowly.
And importantly, though, you can see it's kind of taking this linear route: every time,
these are increasing. So why not increase them by more and more and more?
And so you'll remember from last week that that is what momentum does.
So on the next sheet, we show momentum. And so everything's exactly the same as the
previous sheet, but this sheet, we didn't bother with the finite differencing.
We just have the analytic derivatives, which are exactly the same as last time.
The data's the same as last time. The slope and intercept are the
same starting points as last time. And this is the new a and new b that we get. But what we do this time is that we've added
a momentum term, which we're calling beta. And so the beta is used in these cells here. And what are these cells?
Well, maybe it's most interesting to take this one
here. What it's doing is it's taking the gradient
and it's using that to update the weights, but it's also taking the previous update.
So you can see here, the blue one, minus 25. So that is going to get
multiplied by 0.9, the momentum. And then the derivative is then multiplied by 0.1. So this is momentum, which is
getting a little bit of each. And so then what we do is we then
use that instead of the derivative to multiply by our
learning rate. So we keep doing that again and
again and again, as per usual. And so we've got one column, which is calculating
the momentum, you know, lerped version of the gradient for both b and for a. And so
you can see that for this one, it's the same thing.
It looks at what the previous move was. And that's going to be 0.9 of what you're going
to use for your momentum version of the gradient, and 0.1 comes from the current
gradient. And so then that's again what we're going
to use to multiply by the learning rate. And so you can see what happens is when you
keep moving in the same direction, which here means the derivative is
negative again and again and again, it gets higher and higher and higher. And ditto over here.
And so particularly with this big jump, we keep getting big jumps, because
there's still negative gradient, negative gradient, negative gradient.
So at the end of this, our b and our a have jumped ahead.
And so we can click run. And we can keep clicking it.
And you can see that it's moving, you know, not super fast, but certainly faster than
it was before. So if you haven't used VBA, Visual Basic
for Applications, before, you can hit Alt+F11 or
Option+F11 to open it. And you may need to go into your preferences
and turn on the developer tools so that you can see it.
You can also right click and choose assign macro on a button and you can see what macro
has been assigned. So if I hit Alt F11 and I can just double,
or you can just double click on the sheet name and it'll open it up.
And you can see that this is exactly the same as the previous one.
There's no difference here. Oh, one difference is that to keep
track of momentum at the very, very end. So I've got my momentum
values going all the way down. The very last momentum I copy back up to the
top, each epoch so that we don't lose track of our kind of optimizer state, if you like.
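A minimal Python sketch of the momentum sheet might look like the following. It assumes the same learning rate as the basic sheet (0.0001), which the lesson does not state for this sheet, and it uses made-up, noise-free data generated from y = 2x + 30; only the first two x values match the rows quoted earlier. The momentum values are carried from one epoch to the next, mirroring the copy-back step in the macro.

```python
def momentum_step(a, b, mom_a, mom_b, x, y, lr=0.0001, beta=0.9):
    """One update using a lerped (exponentially weighted) gradient."""
    diff = a * x + b - y
    de_da, de_db = 2 * diff * x, 2 * diff
    mom_a = beta * mom_a + (1 - beta) * de_da   # 0.9 of the old, 0.1 of the new
    mom_b = beta * mom_b + (1 - beta) * de_db
    return a - lr * mom_a, b - lr * mom_b, mom_a, mom_b

# Hypothetical data for illustration: y = 2x + 30 with no noise.
xs = [14, 86, 28, 51, 29, 72, 62, 84, 15, 42]
rows = [(x, 2 * x + 30) for x in xs]

def sse(a, b):
    """Sum of squared errors over the whole data set."""
    return sum((a * x + b - y) ** 2 for x, y in rows)

a, b, mom_a, mom_b = 1.0, 1.0, 0.0, 0.0
first = momentum_step(a, b, mom_a, mom_b, *rows[0])
start_loss = sse(a, b)

for epoch in range(5):
    for x, y in rows:
        # mom_a and mom_b are the optimizer state; they survive across epochs.
        a, b, mom_a, mom_b = momentum_step(a, b, mom_a, mom_b, x, y)
end_loss = sse(a, b)
print(f"slope={a:.3f} intercept={b:.3f} loss {start_loss:.0f} -> {end_loss:.0f}")
```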
Okay. So that's what momentum looks like.
So, yeah, if you're more of a visual person like me, and you like to see everything
laid out in front of you and to be able to experiment, which I think is a good idea,
this can be really helpful. So RMSProp we've seen, and it's very similar
to momentum, but in this case, instead of keeping track of a lerped moving average,
an exponential moving average, of gradients, we're keeping track of a moving
average of gradients squared. And then rather than simply using
that as the gradient, what we're doing instead is dividing our gradient
by the square root of that. And so remember the reason we were doing that
is to say, if there's very little variation, very little going on in your gradients,
then you probably want to jump further. So that's RMSProp.
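Here is a small single-weight sketch of that RMSProp update in Python. The hyperparameter values and the `eps` term (added to avoid dividing by zero) are assumptions for illustration, not values read from the spreadsheet.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.01, beta=0.9, eps=1e-8):
    """Divide the gradient by the root of a moving average of squared gradients."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)
    return w, sq_avg

# Two weights whose gradients differ by a factor of a million.
w_big, _ = rmsprop_step(0.0, 1000.0, 0.0)
w_small, _ = rmsprop_step(0.0, 0.001, 0.0)
print(w_big, w_small)
```

Both weights end up moving by nearly the same amount, so a weight with tiny gradients takes a much larger step relative to its gradient than plain SGD would give it.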
And then finally, Adam, remember, was a combination of both. So in Adam, we've got both the lerped version
of the gradient and we've got the lerped version of the gradient squared.
And then we do both. When we update, we're both dividing
the gradient by the square root of the exponentially
weighted moving average of the squared gradients, and we're also using the momentumized version.
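Combining the two ideas, a single-weight Adam-style step could be sketched as below. This is a simplified illustration: the bias correction that standard Adam applies is left out for brevity, and the hyperparameter values are assumptions rather than the spreadsheet's settings.

```python
import math

def adam_step(w, grad, avg, sq_avg, lr=0.1, beta1=0.9, beta2=0.99, eps=1e-8):
    """Momentum on the gradient, divided by the root of the squared-gradient average."""
    avg = beta1 * avg + (1 - beta1) * grad               # lerped gradient
    sq_avg = beta2 * sq_avg + (1 - beta2) * grad ** 2    # lerped squared gradient
    w = w - lr * avg / (math.sqrt(sq_avg) + eps)
    return w, avg, sq_avg

# First step for the intercept, using the gradient of -86 from the first row.
w, avg, sq_avg = adam_step(1.0, -86.0, 0.0, 0.0)
print(w, avg, sq_avg)
```

With these particular betas the first step works out to almost exactly one learning rate, regardless of how large the gradient is.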
And so again, we just go through that each time. And so if I reset and run, oh wow, look at that. It jumped up there very quickly, because
remember we wanted to get to 2 and 30. So that's just two sets.
So that's 5, that's 10 epochs. Now if I keep running it, it's
kind of now not getting closer. It's kind of jumping up and down
between pretty much the same values. So probably what we'd need to do is
decrease the learning rate at that point. And yeah, that's pretty good.
And now it's jumping up and down between the same two values again.
So I'd probably decrease the learning rate a little bit more.
And I kind of like playing around like this because it gives me a really intuitive feeling
for what training looks like. So I've got a question from our YouTube
chat, which is how is J33 being initialized? So this is just, what happens is we take the
very last cell, well actually all these last four cells, and we copy them to here as values.
So this is what those looked like in the last epoch.
So basically we go copy and then paste as values.
And then this here just refers back to them, as you see. And it's interesting that
you can see how they're exact opposites of each other, so
you can really see how it's just fluctuating around
the actual optimum at this point. Okay, thank you to Sam Watkins.
We've now got a nicer sized editor. That's great.
Where were we? Adam. Okay, so with Adam, basically it all looks
pretty much the same, except now we have to copy and paste both our momentums and our squared gradients, and of course the slopes
and intercepts at the end of each step. But other than that, it's
just doing the same thing. And when we reset it, it just sets
everything back to their default values. Now one thing that occurred to me, you know,
when I first wrote this spreadsheet a few years ago was that manually changing
the learning rate seems pretty annoying. Now of course we can use a scheduler, but
a scheduler is something we set up ahead of time.
And I did wonder if it's possible to create an automatic scheduler.
And so I created this Adam annealing tab, which honestly I've never really got back
to experimenting with. So if anybody's interested,
they should check this out. What I did here was I used exactly the same
spreadsheet as the Adam spreadsheet, but I added an extra, after I do this step, I added an
extra thing, which is I automatically decreased the learning rate in a certain situation.
And the situation in which I decreased it was this: I kept track of the average of the squared
gradients. And anytime the average of the squared gradients
decreased during an epoch, I stored it. So I basically kept track of the
lowest squared gradients we had. And then what I did was, if that
resulted in the squared gradients average halving, then I would
decrease the
learning rate by a factor of 4. So I was keeping track of this gradient ratio.
Now when you see a range like this, you can find what that's referring to by just clicking
up here and finding gradient ratio. And there it is, and you can see that it's
equal to the ratio between the average of the squared gradients versus the
minimum that we've seen so far. So my theory here was
that, basically, as you train, you kind of get into flatter
and more stable areas. And as you do that, that's a sign that, you
know, you might want to decrease your learning rate.
So yeah, if I try that, if I hit run again, it jumps straight to a pretty good value,
but I'm not going to change the learning rate manually.
I just press run and you can see it's changed the learning rate automatically now.
And if I keep hitting run without doing anything, look at that.
It's got pretty good, hasn't it? And the learning rates got lower and lower
and we basically got almost exactly the right answer.
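One possible reading of that annealing rule, sketched in Python, is shown below. The spreadsheet's exact formula is not spelled out in the lesson, so everything here is an assumption: the function name, the threshold of 0.5 for "halving", and the sequence of average squared gradients, which is invented purely to exercise the rule.

```python
def maybe_anneal(lr, avg_sq_grad, min_sq_grad, lr_div=4.0):
    """Divide lr by lr_div when the average squared gradient halves vs. the lowest seen."""
    ratio = avg_sq_grad / min_sq_grad       # the "gradient ratio"
    if ratio <= 0.5:
        lr = lr / lr_div
    # Remember the lowest average squared gradient seen so far.
    return lr, min(min_sq_grad, avg_sq_grad), ratio

lr, lowest = 0.1, 100.0
log = []
for avg in [100.0, 80.0, 30.0, 29.0, 10.0]:   # made-up per-epoch averages
    lr, lowest, ratio = maybe_anneal(lr, avg, lowest)
    log.append((lr, lowest))
    print(f"avg={avg:6.1f} ratio={ratio:.3f} lr={lr:.5f}")
```

In this made-up run the learning rate is cut twice, once when the average drops from 80 to 30 and again when it drops from 29 to 10.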
So yeah, that's a little experiment I tried. So maybe some of you should try experiments
around whether you can create an automatic annealer using miniai.
I think that would be fun. So that is an excellent segue into our
notebook because we are going to talk about annealing
now. So we've seen it manually before
where we've just decreased the learning rate in a notebook
and like run a second cell. And we've seen something in Excel.
But let's look at what we generally do in PyTorch. So we're still in the same notebook as
last time, the Accelerated SGD notebook. And now that we've re-implemented all the
main optimizers that people tend to use most of the time from scratch, we
can use PyTorch's, of course. So let's look now at how we can do our own learning rate scheduling or annealing
within the miniai framework. Now, when we implemented the learning
rate finder, we saw how to create something that adjusts the learning rate.
So just to remind you, this was all we had to do. We had to go through the optimizer's parameter
groups, and in each group do a times-equals on the learning rate with some multiplier.
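As a reminder of what that looks like, here is a minimal sketch using a plain PyTorch optimizer. The tiny model and the multiplier value are placeholders chosen for this example, not the ones from the learning rate finder.

```python
from torch import nn, optim

model = nn.Linear(3, 1)                       # any small model will do
opt = optim.SGD(model.parameters(), lr=0.1)

lr_mult = 1.3                                 # hypothetical multiplier
for g in opt.param_groups:                    # each group is a dict of hyperparameters
    g['lr'] *= lr_mult

new_lr = opt.param_groups[0]['lr']
print(new_lr)
```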
That was for the learning rate finder. So since we know how to do that, we're not going
to bother re-implementing all the schedulers from scratch because we know the basic idea now.
So instead, what we're going to do is have a look inside the torch.optim.lr_scheduler
module and see what's defined in there. So the lr_scheduler module, you can
hit dot, tab and see what's in there. But something that I quite like to do is to
use dir because dir(lr_scheduler) is a nice little function that tells you
everything inside a Python object. And this particular object is a module object
and it tells you all the stuff in the module. When you use the dot-tab version, it doesn't
show you stuff that starts with an underscore, by the way, because that stuff's considered
private, whereas dir does show you that stuff. I can kind of see from here
that the things that start with a capital and then a small
letter look like the things we care about. We probably don't care about this.
We probably don't care about these. So we can just do a little list comprehension
that checks that the first letter is an uppercase and the second letter is lowercase, and
then join those all together with a space. And so here is a nice way to get a list of all
of the schedulers that PyTorch has available. And actually, I couldn't find such a list
on the PyTorch website in the documentation. So this is actually a handy
thing to have available. So here's various schedulers we can use.
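The list comprehension described above can be sketched like this. The length check is an extra guard added here in case the module ever exposes a one-character name; the exact set of names printed depends on the installed PyTorch version, and the filter may also let through a few non-scheduler names that happen to be capitalized.

```python
from torch.optim import lr_scheduler

# Keep names whose first letter is uppercase and second letter is lowercase.
names = [o for o in dir(lr_scheduler)
         if len(o) > 1 and o[0].isupper() and o[1].islower()]
print(' '.join(names))
```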
And so I thought we might experiment with using Cosine Annealing. So before we do, we have to recognize that
these PyTorch schedulers work with PyTorch optimizers, not, of course,
with our custom SGD class. And PyTorch optimizers have
a slightly different API. And so we need to learn how they work.
So to learn how they work, we need an optimizer. So one easy way to just grab an optimizer
would be to create a learner, just kind of pretty much any old random learner, and pass
in that single batch callback that we created. Do you remember that single batch callback?
SingleBatch. In after_batch, it just cancels the fit.
So it literally just does one batch. And we could fit.
And from that, we've now got a learner and an optimizer.
And so we can do the same thing. We can dir our optimizer to
see what attributes it has. This is a nice way, or of course, just
read the documentation in PyTorch. This one is documented, I think,
showing all the things it can do. As you would expect, it's got the step and
the zero_grad, like we're familiar with. Or you can just type opt.
The optimizers in PyTorch do actually have a repr, as it's called, which means
you can just type it in and hit shift enter, and you can also see the information about
it this way. Now, for an optimizer, it'll tell
you what kind of optimizer it is. And so in this case, the default optimizer
for a learner, when we created it, we decided was optim.sgd.SGD.
So we've got an SGD optimizer. And it's got these things called parameter groups.
What are parameter groups? Well, parameter groups are, as the name
suggests, groups of parameters. And in fact, we only have one parameter group
here, which means all of our parameters are in this group.
So let me kind of try and show you. It's a little bit confusing,
but it's kind of quite neat. So let's grab all of our parameters.
And that's actually a generator. So we have to turn that into an iterator and
call next, and that will just give us our first parameter.
Okay. Now what we can do is we can then
check the state of the optimizer. And the state is a dictionary and
the keys are parameter tensors. So this is kind of pretty interesting because
you might be, I'm sure you're familiar with dictionaries.
I hope you're familiar with dictionaries. But normally you probably use numbers or
strings as keys, but actually you can use tensors as
keys. And indeed that's what happens here.
If we look at param, it's a tensor. It's actually a parameter, which, remember,
is a tensor that knows it requires_grad and that gets listed in the parameters of the module.
And so we're actually using that to index into the state.
So if you look at opt.state, it's a dictionary where the keys are parameters.
Now what's this for? Well, if you think back to this, for each
parameter, we have state. We have the average of the gradients, or rather the
exponentially weighted moving average of the gradients, and of the squared gradients.
And we actually stored them as attributes. So PyTorch does it a bit differently.
It doesn't store them as attributes, but instead the optimizer has a dictionary where you can
index into it using a parameter and that gives you the state.
And so you can see here, this is the exponentially weighted moving averages.
And both because we haven't done any training yet and because we're using non-momentum SGD,
it's none, but that's how it would be stored. So this is really important to
understand PyTorch optimizers. I quite liked our way of doing it, of just
storing the state directly as attributes. But this works as well and it's fine.
You just have to know it's there. And then, as I said, in our
SGD class we stored the parameters
directly. But in PyTorch, those parameters
can be put into groups. And so since we haven't put them into
groups, the length of param_groups is 1: there is
one group. So here is the param_groups.
And that group contains all of our parameters. Okay.
So pg, just to clarify here what's going on, pg is a dictionary.
It's a parameter group. And to get the keys from a
dictionary, you can just listify it. That gives you back the keys.
And so this is one quick way of finding out all the keys in a dictionary.
But you can see all the parameters in the group. And you can see all of the hyper parameters,
the learning rate, the momentum, weight decay, and so forth. So that gives you some background about
what's going on inside an optimizer. So Siva asks, isn't indexing by tensor just
like passing a tensor argument to a method? And no, it's not quite the
same, because this is state. So this is how the optimizer
stores state about the parameters. It has to be stored somewhere.
For our homemade miniai version, we stored it as attributes on the parameter.
In the PyTorch optimizers, they store it as a dictionary.
So it's just how it's stored. Okay.
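To poke at this outside the notebook, here is a small self-contained sketch. It uses a throwaway linear model rather than the lesson's learner, and it turns momentum on and takes one step so that the state dictionary has something in it; with non-momentum SGD before training, as in the lesson, there is nothing interesting stored yet.

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 2)
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One backward pass and one step so the optimizer has state to show.
model(torch.randn(8, 4)).sum().backward()
opt.step()

param = next(iter(model.parameters()))   # first parameter of the model
st = opt.state[param]                    # a tensor used as a dictionary key
pg = opt.param_groups[0]                 # the only parameter group
keys = list(pg)                          # listifying a dict gives its keys

print(type(param).__name__, list(st), keys)
```

Here `st` holds the momentum buffer for that one parameter, and `pg` holds the parameters of the group alongside hyperparameters such as the learning rate and momentum.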
So with that in mind, let's look at how schedulers work.
So let's create a cosine annealing scheduler. So for a scheduler in PyTorch, you
have to pass it the optimizer. And the reason for that is we want to be able
to tell it to change the learning rates of our optimizer.
So it needs to know what optimizer to change the learning rates of.
So it can then do that for each set of parameters. And the reason that it does it by parameter
group is that as we'll learn in a later lesson, for things like transfer learning, we often
want to adjust the learning rates of the later layers differently to the earlier layers
and actually have different learning rates. And so that's why we can have different groups
and the different groups have the different learning rates, momentums, and so forth.
Okay. So we pass in the optimizer and then, if I
hit shift tab a couple of times, it'll tell me all of the things that you can pass in.
And so it needs to know T_max: how many iterations you're going to do.
And that's because it's trying to do one, you know, half a wave, if you like, of the
cosine curve. So it needs to know how many
iterations you're going to do. So it needs to know how far to step each time.
So let's say we're going to do a hundred iterations. The scheduler is going to
store the base learning rate. And where did it get that from?
It got it from our optimizer, on which we set a learning rate.
Okay. So it's going to steal the optimizers learning
rate, and that's going to be the starting learning rate, the base learning rate.
And it's a list because there could be a different one for each parameter group.
We only have one parameter group. You can also get the most recent learning
rate from a scheduler, which of course is the same.
And so I couldn't find any method in PyTorch to actually plot a scheduler's learning rates.
So I just made a tiny little thing that just created a list, set it to the last learning
rate of the scheduler, which is going to start at 0.006, and then goes through however many
steps you ask for. Steps the optimizer, steps the scheduler.
So this is the thing that causes the scheduler to adjust its learning rate, and then just
append that new learning rate to a list of learning rates and then plot it.
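A sketch of that helper, without the plotting so that it runs anywhere, might look like this. The tiny model is a stand-in for the lesson's learner, and the function here returns the list instead of plotting it; you could pass the result to matplotlib's plot if you want the picture.

```python
from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(2, 1)                           # stand-in model
opt = optim.SGD(model.parameters(), lr=0.006)
sched = lr_scheduler.CosineAnnealingLR(opt, T_max=100)

def sched_lrs(sched, steps):
    """Collect the learning rate before stepping and after each of `steps` steps."""
    lrs = [sched.get_last_lr()[0]]
    for _ in range(steps):
        sched.optimizer.step()    # step the optimizer first, then the scheduler
        sched.step()
        lrs.append(sched.get_last_lr()[0])
    return lrs

lrs = sched_lrs(sched, 110)       # deliberately more than T_max
print(sched.base_lrs, lrs[0], lrs[50], lrs[100], lrs[110])
```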
So here it is, and what I've done here is I've intentionally gone over a hundred,
because I had told it I'm going to do a hundred. So I'm going over a hundred and you can see the
learning rate, if we did a hundred iterations, would start high for a while.
It would then go down and then it would stay low for a while.
And if we intentionally go past the maximum, it'll actually start going up again because
this is a cosine curve. So one of the main things, I guess, I wanted
to show here is like what it looks like to really investigate in a REPL
environment, like a notebook, how an object behaves, what's
in it. And this is something I would always want
to do when I'm using something from an API I'm not very familiar with.
I really want to like see what's in it, see what they do, run it totally independently,
plot anything I can plot. This is how I like to learn
about the stuff I'm working with. You know, data scientists don't spend all
of their time just coding, so that means we can't just rely on
using the same classes and APIs every day. So we have to be very good at
exploring them and learning about them. And so that's why I think this
is a really good approach. Okay.
So let's create a scheduler callback. So a scheduler callback is something we're
going to pass the scheduling callable into.
And remember that when we create the scheduler, we have to
pass in the optimizer to schedule. And so before_fit, that's the point at which we
have an optimizer, we will create the scheduling object.
I like this, “schedo”, it's very Australian. So the scheduling object we will create by
passing the optimizer into the scheduler callable. And then when we do step(), then we'll check
if we're training and if so, we'll step(). Okay.
So then what's going to call step is after_batch.
So after_batch, we'll call step. And that would be if you want your scheduler
to update the learning rate every batch. We could also have an epoch
scheduler callback, which we'll see later and that's just going to be after_epoch.
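The callbacks described here can be sketched roughly as follows. To keep the example self-contained, the classes do not inherit from miniai's callback base class, and `learn` is a hypothetical stand-in object that only has the two attributes the callback reads (`opt` and `training`); the loop simply pretends to be ten training batches.

```python
from functools import partial
from types import SimpleNamespace
from torch import nn, optim
from torch.optim import lr_scheduler

class BaseSchedCB:
    """Build the scheduler in before_fit; step it only while training."""
    def __init__(self, sched): self.sched = sched
    def before_fit(self, learn): self.schedo = self.sched(learn.opt)
    def _step(self, learn):
        if learn.training: self.schedo.step()

class BatchSchedCB(BaseSchedCB):
    def after_batch(self, learn): self._step(learn)

class EpochSchedCB(BaseSchedCB):
    def after_epoch(self, learn): self._step(learn)

# Hypothetical stand-in for a learner.
model = nn.Linear(2, 1)
learn = SimpleNamespace(opt=optim.SGD(model.parameters(), lr=0.006), training=True)

cb = BatchSchedCB(partial(lr_scheduler.CosineAnnealingLR, T_max=10))
cb.before_fit(learn)
for _ in range(10):               # pretend each iteration is one training batch
    learn.opt.step()
    cb.after_batch(learn)
final_lr = learn.opt.param_groups[0]['lr']
print(final_lr)
```

Passing the scheduler as a callable (here via `partial`) is what lets the callback delay construction until an optimizer exists.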
Okay. So in order to actually see what the schedule
is doing, we're going to need to create a new callback to keep track of
what's going on in our learner. And I figured we could create a recorder callback. And what we're going to do is we're going
to be passing in the name of the thing that we want to record, that we want to keep track
of in each batch and a function, which is going to be responsible for
grabbing the thing that we want. And so in this case, the function here is
going to take the callback, look up its pg attribute and grab the learning rate.
Where does the pg attribute come from?
Well, before_fit the recorder callback is going to grab just the first parameter group.
Just so it's like, you've got to pick some parameter group to track.
So we'll just grab the first one. And so then also we're going to create a
dictionary of all the things that we're recording. So we'll get all the names.
So that's going to be in this case, just lr. And initially it's just going to be an empty list.
And then after_batch, we'll go through each of the items in that dictionary, which in
this case is just lr as the key and the _lr function as the value.
And we will call that function or callable, passing in this callback, and append the
result to that list. And that's why this function receives the callback.
And so that's going to basically then have a whole bunch of, you know, dictionary of
the results, you know, of each of these functions after each batch during training.
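A stripped-down version of such a recorder, leaving out the plotting, could look like the sketch below. As before, the class stands alone rather than inheriting from miniai's callback base class, and the learner and optimizer are hypothetical stand-ins built from plain dictionaries so the example has no dependencies.

```python
from types import SimpleNamespace

class RecorderCB:
    """Record the value of each named function after every training batch."""
    def __init__(self, **d): self.d = d
    def before_fit(self, learn):
        self.recs = {k: [] for k in self.d}       # one empty list per name
        self.pg = learn.opt.param_groups[0]       # track the first parameter group
    def after_batch(self, learn):
        if not learn.training: return
        for k, v in self.d.items(): self.recs[k].append(v(self))

def _lr(cb): return cb.pg['lr']

# Hypothetical stand-ins for an optimizer and a learner.
opt = SimpleNamespace(param_groups=[{'lr': 0.1}])
learn = SimpleNamespace(opt=opt, training=True)

rec = RecorderCB(lr=_lr)
rec.before_fit(learn)
for _ in range(3):
    rec.after_batch(learn)
    opt.param_groups[0]['lr'] *= 0.5              # pretend a scheduler halved the lr
print(rec.recs)
```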
So we'll just go through and plot them all. And so let me show you what
that's going to look like. If we… let's create a cosine annealing callable.
So we're going to have to use a partial to say that this callable is going to have T_max
equal to three times, however many mini batches we have in our data loader.
That's because we're going to do 3 epochs. And then we will set it running and
we're passing in the batch scheduler with the scheduler
callable. And we're also going to pass in our recorder
callback saying we want to track the learning rate using the _lr function.
We're going to call fit. And oh, this is actually a pretty good accuracy.
We're getting, you know, close to 90% now in only 3 epochs, which is impressive.
And so when we then call rec.plot, it's going to call, remember the rec is the recorder
callback. So it plots the learning rate.
Isn't that sweet? So we could, as I said, we can do exactly
the same thing, but replace after_batch with after_epoch.
And this will now become a scheduler which steps at the end of each epoch rather than
the end of each batch. So I can do exactly the same thing
now using an epoch scheduler. So this time T_max is 3 because we're
only going to be stepping three times. We're not stepping at the end of each
batch, just at the end of each epoch. So that trains, and then we can
call rec.plot after it trains. And as you can see there,
it's just stepping 3 times. So you can see here, we're really digging
in deeply to understanding what's happening in everything in our models.
What do all the activations look like? What do the losses look like?
What do our learning rates look like? And we've built all this from scratch.
So yeah, hopefully that gives you a sense that we can really, yeah, do a lot ourselves.
Now, if you've done the fastai Part 1 course, you'll be very aware of 1-Cycle training,
which was from a terrific paper by Leslie Smith, which I'm not sure it ever got published,
actually. And 1-Cycle training is,
well, let's take a look at it. So we can just replace our
scheduler with OneCycleLR scheduler. So that's in PyTorch.
And of course, if it wasn't in PyTorch, we could very easily just write our own.
I’m going to make it a batch scheduler and we're going to train, this time we're going to do 5
epochs. So we're going to train a bit longer.
And so the first thing I'll point out is, hooray, we have got a new record for us, 90.6%.
So that's great. And then secondly, you can see here's the plot.
And now look, two things are being plotted. And that's because I've now passed into the
recorder callback, a plot of learning rates and also a plot of momentums.
And momentums is going to grab betas[0], because remember for Adam, there are two betas,
betas[0] and betas[1]. They're the momentum of the gradients and
the momentum of the gradient squared. And you can see what the one cycle is doing
is the learning rate is starting very low and going up to high and then down again.
But the momentum is starting high and then going down and then up again.
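The shape of those two curves can be reproduced with a small sketch that uses PyTorch's `OneCycleLR` directly, outside of any Learner. The tiny linear model, `max_lr=0.1`, and `total_steps=100` are assumptions for illustration, not the values used in the notebook.

```python
import torch

model = torch.nn.Linear(4, 2)                       # stand-in model (assumption)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
total_steps = 100                                   # e.g. epochs * batches per epoch (assumed)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=total_steps)

x = torch.randn(8, 4)
lrs, moms = [], []
for _ in range(total_steps):
    lrs.append(opt.param_groups[0]['lr'])           # learning rate used for this batch
    moms.append(opt.param_groups[0]['betas'][0])    # Adam's first beta plays the role of momentum
    model(x).sum().backward()                       # dummy loss, just so there are gradients
    opt.step()
    opt.zero_grad()
    sched.step()                                    # batch scheduler: step after every batch
```

Plotting `lrs` and `moms` should show the learning rate rising then falling while the momentum does the opposite.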
So what's the theory here? Well the starting out at a low learning rate
is particularly important if you have a not perfectly initialized model, which almost
everybody almost always does, even though we spent a lot of time
learning to initialize models. We use a lot of models that get more complicated.
And after a while people learn or figure out how to initialize
more complex models properly. So for example, this is a very, very cool paper.
In 2019, this team figured out how to initialize ResNets properly.
We'll be looking at ResNets very shortly. And they discovered when they did
that they did not need batch norm. They could train networks of 10,000 layers.
And they could get state of the art performance with no batch norm.
And there's actually been something similar for transformers called T-Fixup that does a
similar kind of thing. But anyway, it is quite difficult to initialize models correctly.
Most people fail to. Most people fail to realize that they generally
don't need tricks like warmup and batch norm if they do initialize them correctly.
In fact, T-Fixup explicitly looks at this. It looks at the difference between
no warmup versus with warmup with their correct initialization
versus with normal initialization. And you can see these pictures they're
showing are pretty similar actually. They're very similar to the
colorful dimension plots. I kind of like our colorful dimension plots
better in some ways because I think they're easier to read, although I think
theirs are probably prettier. So there you go, Stefano.
There's something to inspire you if you want to try more things with our colorful dimension
plots. I think it's interesting that some papers
are actually starting to use a similar idea. I don't know if they got it from us
or they came up with it independently. But so we do a warmup because, if our network's not
quite initialized correctly, then starting at a very low learning rate means it's not
going to jump off way outside the area where the weights even make sense.
And so then you gradually increase them as the weights move into a part of the space
that does make sense. And then during that time, while we have low
learning rates, if they keep moving in the same direction, then with this very high
momentum, they'll move more and more quickly. But if they keep moving in different directions,
then the momentum is just going to kind of look at the underlying direction they're moving.
And then once you have got to a good part of the weight space, you can use a very high
learning rate. And with a very high learning rate,
you wouldn't want so much momentum. So that's why there's low momentum during
the time when there's high learning rate. And then as we saw in our spreadsheet, which
did this automatically, as you get closer to the optimal, you generally want
to decrease the learning rate. And since we're decreasing it
again, we can increase the momentum. So you can see that starting from random weights,
we've got a pretty good accuracy on Fashion MNIST with a totally standard convolutional
neural network, no ResNets, nothing else, everything built from scratch by hand, artisanal
neural network training, and we've got 90.6% for Fashion-MNIST.
So there you go. All right, let's take a seven minute break.
And I'll see you back shortly. I should warn you, we've got a lot more to cover.
So I hope you're okay with a long lesson today. Okay, we're back.
I just wanted to mention also something we skipped over here, which is this HasLearn
callback. This is more important for the people
doing the live course than the recordings. If you're doing the recording,
you will have already seen this. But since I created Learner, actually, Piotr
Czapla, I don't know how to pronounce your surname, sorry, Piotr, pointed out that there's
actually kind of a nicer way of handling Learner… previously we were putting the Learner object
itself into self.learn in each callback. And that meant we were using self.learn.model
and self.learn.opt and self.learn dot all this all over the place.
It was kind of ugly. So we've modified Learner this week so that, when it calls the callbacks in
run_cbs, which you might remember is what Learner calls, it passes the Learner
as a parameter to the method. So now the Learner no longer goes through the
callbacks and sets their .learn attribute. But instead in your callbacks, you have to put
learn as a parameter in all of the callback methods.
So for example, DeviceCB has a before_fit.
So now it's got comma learn here. So now this is not self.learn.
It's just learn. So it does make a lot of the code less yucky
to not have all this self.learn.pred equals self.learn.model, self.learn.batch.
It's now just learn. It's also good because you don't generally want
the Learner to hold a reference to the callbacks while the
callbacks also hold a reference back to the Learner, because that creates
something called a cycle. So there's a couple of benefits there.
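The calling convention can be sketched with a toy example in plain Python. `MiniLearner` and `CountBatchesCB` are hypothetical stand-ins invented for this illustration, not the course's classes; the only point is that `run_cbs` hands the learner to each callback method instead of the callback storing it.

```python
class Callback:
    order = 0

class CountBatchesCB(Callback):
    "Hypothetical callback: `learn` arrives as an argument, there is no `self.learn`."
    def __init__(self): self.n = 0
    def before_fit(self, learn): self.n = 0
    def after_batch(self, learn):
        self.n += 1
        learn.last_seen = self.n          # the callback can still read and write learner state

class MiniLearner:
    "Toy stand-in for a Learner, only to show how callbacks get called."
    def __init__(self, batches, cbs): self.batches, self.cbs = batches, list(cbs)

    def run_cbs(self, method_nm):
        # Call `method_nm` on every callback that defines it, passing the learner in.
        for cb in sorted(self.cbs, key=lambda cb: cb.order):
            method = getattr(cb, method_nm, None)
            if method is not None: method(self)

    def fit(self):
        self.run_cbs('before_fit')
        for self.batch in self.batches: self.run_cbs('after_batch')
        self.run_cbs('after_fit')

cb = CountBatchesCB()
learn = MiniLearner(batches=range(5), cbs=[cb])
learn.fit()
```

Because the callback never stores the learner, there is no reference from the callback back to the learner.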
And that reminds me, there's a few other little changes we've made to the code.
And I want to show you a cool little trick for how
I'm going to find quickly all of the changes that we've made to the code in the last week.
So to do that, we can go to the course repo. And on any repo, you can
add slash compare in GitHub. And then you can compare across, you
know, all kinds of different things. But one of the examples they've got here
is to compare across different times. Look at the master branch now versus one day ago.
So I actually want the master branch now versus seven days ago.
So I just hit this, change this to 7. And there we go.
There's all my commits and I can immediately see the changes from last week.
And so you can basically see what are the things I had to do when I changed things.
So for example, you can see here, all of my self.learn(s) became learn(s).
I added the anneal, that's right, augmentation. And so in learner, I added an lr_find.
Ah yes, I will show you that one. That's pretty fun.
So here's the changes we made to run_cbs, to fit. So this is a nice way I can quickly, yeah,
find out what I've changed since last time and make sure that I don't forget
to tell you folks about any of them. Oh yes, cleanup_fit.
I have to tell you about that as well. Okay.
That's a useful reminder. So the main other change to mention is that
calling the learning rate finder is now easier because I added what's called
a patch to the learner. fastcore’s patch decorator lets you take a
function and it will turn that function into a method of this class, of whatever
class you put after the colon. So this has created a new method
called lr_find or Learner.lr_find. And what it does is it calls self.fit, where
self is a Learner, passing in however many epochs you set as the maximum for your
learning rate finder, and what to start the learning rate at.
And then it says to use as callbacks the learning rate finder callback.
Now this is new as well. Learner.fit didn't use to
have a callbacks parameter. So that's very convenient because what it
does is it adds those callbacks just during the fit.
So if you pass in callbacks, then it goes through each one
and appends it to self.cbs and when it's finished
fitting, it removes them again. So these are callbacks that are just added
for the period of this one fit, which is what we want for learning rate finder.
It should just be added for that one fit. So with this patch in place, all that's required
to do the learning rate finder is now to create your learner and call .lr_find.
And there you go. Bang.
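Both ideas, patching a method onto a class and adding callbacks only for the duration of one fit, can be sketched in plain Python. The `patch` function below is a simplified stand-in written for this example (fastcore's real decorator does more), and `MiniLearner` plus the string "callbacks" are hypothetical placeholders.

```python
import inspect

def patch(f):
    "Simplified stand-in for fastcore's `patch`: attach `f` to the class annotating its first parameter."
    first_param = next(iter(inspect.signature(f).parameters.values()))
    setattr(first_param.annotation, f.__name__, f)
    return f

class MiniLearner:
    "Toy learner; `fit` only records which callbacks were active during the fit."
    def __init__(self, cbs=()): self.cbs, self.log = list(cbs), []

    def fit(self, n_epochs, lr=None, cbs=()):
        extra = list(cbs)
        self.cbs += extra                           # add the temporary callbacks
        try:
            self.log.append((n_epochs, lr, list(self.cbs)))
        finally:
            for cb in extra: self.cbs.remove(cb)    # and remove them once fitting is done

@patch
def lr_find(self: MiniLearner, start_lr=1e-5, max_epochs=10):
    # 'LRFinderCB' is a placeholder string standing in for a real callback object.
    self.fit(max_epochs, lr=start_lr, cbs=['LRFinderCB'])

learn = MiniLearner(cbs=['MetricsCB'])
learn.lr_find()
```

After the call, `learn.cbs` contains only the original callback again, which is the behavior wanted for a learning rate finder.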
So patch is a very convenient thing. It's one of these things which Python has a
lot of kind of like folk wisdom about what isn't considered Pythonic or good.
And a lot of people really don't like patching. In other languages, it's used very
widely and is considered very good. So I don't tend to have strong opinions
either way about what's good or what's bad. In fact, instead I just figure out
what's useful in a particular situation. So in this situation, obviously it's very
nice to be able to add in this additional functionality to our class.
So that's what .lr_find is. And then the only other thing we added to
the Learner this week was we added a few more parameters to fit.
fit used to just take the number of epochs. As well as the callback parameter, it
now also has a learning rate parameter. And so you've always been able to provide
a learning rate to the constructor, but you can override the learning rate for one fit.
So if you pass in the learning rate, it will use it.
And if you don't, it'll use the learning rate passed into the constructor.
And then I also added these two booleans to say when you fit, do you want to do the training
loop and do you want to do the validation loop? So by default, it'll do both.
And you can see here, there's just an if train, do the training loop.
If valid, do the validation loop. I'm not even going to talk about this, but if
you're interested in testing your understanding of decorators, you might want to think
about why it is that I didn't have to say with torch.no_grad,
but instead I called torch.no_grad parenthesis function.
That will be a very, if you can get to a point that you understand why that works and what
it does, you'll be on your way to understanding decorators better.
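For anyone who wants a hint, the general pattern can be sketched with the standard library alone. `contextlib.ContextDecorator` produces objects that work both as a `with` block and as a decorator; `torch.no_grad` is documented as usable in both ways too. The `recording` class is a toy written for this example.

```python
from contextlib import ContextDecorator

class recording(ContextDecorator):
    "Toy context manager that logs when it is entered and exited."
    def __init__(self, log): self.log = log
    def __enter__(self):
        self.log.append('enter')
        return self
    def __exit__(self, *exc):
        self.log.append('exit')
        return False

log = []

def one_epoch(): log.append('work')

# Form 1: the usual `with` block around the call.
with recording(log): one_epoch()

# Form 2: call the instance on the function. This returns a wrapped function,
# which is what writing `@recording(log)` above a `def` would do.
wrapped = recording(log)(one_epoch)
wrapped()
```

Both forms run the same enter, work, exit sequence; the second one packages it into a reusable function.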
Okay. So that is the end of accel_sgd. ResNets.
Okay, so we are up to 90 point, what was it?
Yeah, let's keep track of this. Oh yeah, 90.6% is what we're up to. Okay.
So to remind you the model, actually, so we're going to open 13_resnet now. And we're going to do the usual
import and setup initially. And the model that we've been using is
the same one we've been using for a while, which
is that it's a convolution and an activation and an optional batch norm.
And in our models, we were using batch norm and applying our weight initialization, the
kaiming weight initialization. And then we've got convs that take the
channels from 1 to 8 to 16 to 32 to 64, and each one
stride-2. And at the end, we then do a flatten.
And so that ended up with a one by one. So that's been the model
we've been using for a while. So the number of layers is 1, 2, 3, 4.
So 4, 4 convolutional layers with a maximum of 64 channels in the last one.
So can we beat 90.6%?
So before we do a ResNet, I thought, well, let's just see if we can improve the architecture
thoughtfully. So generally speaking, more depth and more
channels gives the neural net more opportunity to learn.
And since we're pretty good at initializing our neural nets and using batch norm, we should
be able to handle deeper. So one thing we could do is we could,
let's just remind ourselves of the previous version
so we can compare, is we could go up to 128 channels.
Now the way we'd do that is we could make our very first convolutional layer have a
stride of 1. So that would be one that goes from the one
input channel to eight output channels, or eight filters, if you like.
So if we make it a stride of 1, then that allows us to have one extra layer.
And then that one extra layer could again double the number of channels and take us
up to 128. So that would make it deeper and
effectively wider as a result. So we can do our normal BatchNorm2d and our
new OneCycle learning rate with our scheduler. And the callbacks we're going to use is the
Device callback, our Metrics, our Progress bar, and our activation stats
looking for GeneralRelus. And I won't have you watch them train
because that would be kind of boring. But if I do this with this deeper and eventually
wider network, this is pretty amazing. We get up to 91.7%.
So that's like quite a big difference. And literally the only difference to our previous
model is this one line of code, which allowed us to take this instead of going
from 1 to 64, it goes from 8 to 128. So that's a very small change,
but it massively improved. So the error rate's gone down by
over 10%, relatively speaking, in terms of the error
rate. So there's a huge impact we've already had.
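As a quick check on that relative figure, using the 90.6% and 91.7% accuracies quoted above:

```python
prev_acc, new_acc = 0.906, 0.917                 # accuracies quoted in the lesson
prev_err, new_err = 1 - prev_acc, 1 - new_acc    # 9.4% and 8.3% error
relative_drop = (prev_err - new_err) / prev_err  # fraction of the old error that disappeared
print(f"{relative_drop:.1%}")
```

That works out to a bit under 12% of the previous error rate.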
Again, 5 epochs. So now what we're going to do is
we're going to make it deeper still. But Kaiming He et al. noted that there comes
a point where making neural nets deeper stops working well.
And remember, this is the guy who created the initializer that we know and love.
And he pointed out that even with good initialization, there
comes a time where adding more layers becomes problematic.
And he pointed out something particularly interesting.
He said, let's take a 20 layer neural network. This is in a paper called “Deep Residual Learning
for Image Recognition” that introduced ResNets. Let's take a 20 layer network and train it
for a few, what's that, tens of thousands of iterations and track its test error.
Okay. And now let's do exactly the same thing on
a 56 layer, otherwise identical, but deeper network. And he pointed out that the 56 layer
network had a worse error than the 20 layer. And it wasn't just a problem of generalization
because it was worse on the training set as
well. Now the insight that he had is if you just
set the additional 36 layers to just identity, you know, identity matrices, they
should, they would do nothing at all. And so a 56 layer network is a
superset of a 20 layer network. So it should be at least as
good, but it's not, it's worse. So clearly the problem here is
something about training it. And so him and his team came up with a really
clever insight, which is, can we create a 56 layer network, which has the same training
dynamics as a 20 layer network or even less? And they realized, yes, you can.
What you could do is you could add something called a shortcut connection.
And basically the idea is that, normally, when we have, you know, our inputs coming into
our convolution. So let's say that's, that was our inputs and
here's our convolution and here's our outputs. Now if we do this 56 times, that's a lot of
stacked up convolutions, which are effectively matrix multiplications with a
lot of opportunity for, you know, gradient explosions and all
that fun stuff. So how could we make it so
that we have convolutions, but with the training dynamics of a much shallower network.
And here's what he did.
He said, let's actually put two convs in here to make it twice as deep because we are trying
to make things deeper, but then let's add what's called a skip connection. So this is
conv1, this is conv2, and assume that these include activation functions. Instead of
just being out equals conv2 of conv1 of in, right, let's
make it out equals conv2 of conv1 of in, plus in.
Now if we initialize these at first to have weights
of zero, then initially this will do nothing at all.
It will output zero and therefore at first you'll just get out equals in, which is exactly
what we wanted, right? We actually want it to be
as if there are no extra layers. And so this way we actually end up with a
network which can be deep, but also at least when you start training
behaves as if it's shallow. It's called a residual connection because
if we subtract in from both sides, then we
would get out minus in equals conv2 of conv1 of in.
In other words, the difference between the endpoint and the starting point, which is
the residual. And so another way of thinking about it
is that this is calculating a residual. So there's a couple of ways of thinking about it.
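Written out as equations, with conv1 and conv2 standing for the two convolutions (including their activations, as assumed above), the two views of the block are:

```latex
\begin{aligned}
\text{out} &= \text{conv2}(\text{conv1}(\text{in})) + \text{in} \\
\text{out} - \text{in} &= \text{conv2}(\text{conv1}(\text{in}))
\end{aligned}
```

The first line is the skip connection; the second is the same statement rearranged, so the convolutions only have to learn the residual.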
And so this thing here is called the Res Block or ResNet Block. Okay, so Sam Watkins has just pointed out
the confusion here, which is that this only works if, let's put the minus in
back and put it back over here. This only works if you can add these together.
Now if conv1 and conv2 both have the same number of channels as in, the same number
of filters, and they also have stride-1, then that will work
fine. You'll end up with exactly the same
output shape as the input shape and you can add them together.
But if they are not the same, then you're in a bit of trouble.
So what do you do? And the answer which Kaiming He et al. came
up with is to add a conv on in as well, but
to make it as simple as possible. We call this the identity conv.
It's not really an identity anymore, but we're trying to make it as simple as possible so
that we do as little to mess up these training dynamics as we can.
And the simplest possible convolution is a 1 by 1 filter block, a 1 by 1 kernel,
I guess we should call it. A 1 by 1 kernel size. And using that, and we can also add
a stride or whatever if we want to. So let me show you the code.
So we're going to create something called a conv block.
Okay, and the conv block is going to do the two convs.
That's going to be a conv block. Okay, so we've got some number of input filters,
some number of output filters, some stride, some activation functions, possibly a
normalization and some kernel shape, some kernel size.
Okay, so the second conv is actually going to go from
output filters to output filters because the first conv is going to be
from input filters to output filters. So by the time we get to the second
conv, it's going to be nf to nf. The first conv, we will set stride-1, and
then the second conv will have the requested stride.
And that way the two convs back to back are going to overall have the requested stride.
So this way, the combination of these two convs is going to eventually take us from
ni to nf in terms of the number of filters, and it's going to have the stride that we
requested. So it's going to be a, the conv block is a
sequential block consisting of a convolution followed by another convolution, each
one with the requested kernel size. And the requested activation function
and the requested normalization layer. The second conv won't have an activation function.
I'll explain why in a moment. And so I mentioned that one way to make this
as if it didn't exist would be to set the convolutional weights to
zero and the biases to zero. But actually we would like to have, you
know, correctly randomly initialized weights. So instead what we can do is
if you're using batch norm (conv2[1]
will be the batch norm layer), we can initialize the batch norm weights to zero.
Now if you've forgotten what that means, go back and have a look at our implementation
from scratch of batch norm because the batch norm weights is the thing we multiply by.
So do you remember the batch norm? We subtract the mean and we divide by the
standard deviation (batch statistics during training, exponential moving averages at inference), and then
we multiply by the batch norm's weights and add
the batch norm's bias.
So if we set the batch norm layer's weights to zero, we're multiplying by zero.
And so this will cause the initial conv block output to be just all zeros.
And so that's going to give us, what we wanted is that nothing's happening here.
So we just end up with the input with this possible idconv.
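The effect of zeroing the batch norm weights is easy to check in isolation. This is a minimal sketch with a standalone `nn.BatchNorm2d`; the channel count and input shape are arbitrary assumptions.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(8)
nn.init.zeros_(bn.weight)          # the multiplicative term becomes 0
# bn.bias is already initialized to 0 by PyTorch

x = torch.randn(16, 8, 7, 7)       # arbitrary activations, as if from a preceding conv
out = bn(x)                        # normalize, multiply by 0, add 0
print(out.abs().max())
```

Whatever the preceding convolution produces, the block's output starts out as all zeros, while the conv weights themselves keep their normal random initialization.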
So a ResBlock is going to contain those convolutions in the
convolution block we just discussed, right?
And then we're going to need this idconv. So the idconv is going to be a noop,
so that's nothing at all, if the number of channels in is equal to the
number of channels out, but otherwise we're going to use a convolution with a
kernel size of 1 and a stride of 1. And so that is going to, with
as little work as possible, change the number of filters so that they match.
Also what if the stride's not 1? Well if the stride is 2, actually this isn't
going to work for any stride, this only works for a stride of 2.
If there's a stride of 2, we will simply average using average pooling.
So this is just saying take the mean of every set of 2 items in the grid.
So we'll just take the mean. So we basically have here pool of idconv of
in, if the stride is 2 and if the number of filters has changed.
And so that's the minimal amount of work. So here it is, here is the forward pass.
We get our input and on the identity connection we call pool and if stride is 1, that's
a noop, so do nothing at all. We do idconv and if the number of filters
is not changed, that's also a noop. So this is just the input in that situation.
And then we add that to the result of the convs. And here's something interesting, we then
apply the activation function to the whole thing.
Okay, so that way, I wouldn't say this is like the only way you can do it, but this
is a way that works pretty well, is to apply the activation function to the result of the
whole resnet block. And that's why I didn't add activation
function to the second conv. So that's a res block.
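Putting the pieces described above together, a self-contained sketch of such a block might look like the following. It is written against plain PyTorch rather than the course's helpers, so the `conv` helper's layout, the use of `nn.ReLU` and `nn.Identity`, and the `ceil_mode=True` pooling are assumptions for this example.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=nn.BatchNorm2d):
    "Conv, then optional norm, then optional activation (assumed layout)."
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)]
    if norm: layers.append(norm(nf))
    if act: layers.append(act())
    return nn.Sequential(*layers)

def conv_block(ni, nf, stride, ks=3):
    conv1 = conv(ni, nf, ks=ks, stride=1)                  # ni -> nf, keeps the grid size
    conv2 = conv(nf, nf, ks=ks, stride=stride, act=None)   # nf -> nf, applies the requested stride
    nn.init.zeros_(conv2[1].weight)                        # zero batch norm weight: block starts at 0
    return nn.Sequential(conv1, conv2)

class ResBlock(nn.Module):
    "Sketch of a residual block; only stride 1 or 2 is supported."
    def __init__(self, ni, nf, stride=1, ks=3):
        super().__init__()
        self.convs = conv_block(ni, nf, stride, ks=ks)
        # 1x1 conv only when the channel count changes, otherwise pass the input through.
        self.idconv = nn.Identity() if ni == nf else conv(ni, nf, ks=1, stride=1, act=None, norm=None)
        # Average pooling only when downsampling.
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.convs(x) + self.idconv(self.pool(x)))

x = torch.randn(4, 8, 7, 7)
same = ResBlock(8, 8, stride=1)(x)      # identity path is the raw input
down = ResBlock(8, 16, stride=2)(x)     # identity path is pooled, then 1x1-convolved
```

With the zero-initialized batch norm, the stride-1 block initially returns just the activation applied to its input, and the stride-2 block halves the grid (rounding up) while changing the channel count.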
So it's not a huge amount of code, right? And so now I've literally copied and pasted
our get_model, but everywhere that previously we had a conv, I've just
replaced it with ResBlock. In fact, let's have a look. get_model. Okay, so previously we started with
conv 1 to 8, now we do ResBlock 1 to 8, stride-1, stride-1, then we added conv from
number of filters i to number of filters i + 1, now it's ResBlock from number
of filters i to number of filters i + 1. Okay, so it's exactly the same.
One change I have made though, is I mean, it doesn't actually make any difference at
all, I think it's mathematically identical, is previously the very last conv at the end
went from the 128 channels down to the 10 channels, followed by flatten, but this conv
is actually working on a 1 by 1 input. So an alternative way, which I think makes
it clearer, is to flatten first and then use a linear layer because a conv on a 1 by
1 input is identical to a linear layer. And if that doesn't immediately make sense,
that's totally fine, but this is one of those places where you should pause and have a little
stop and think about why a conv on a 1 by 1 input is identical to a linear layer, perhaps looking back at the from-scratch conv we did, because
this is a very important insight. So I think it's very useful with a more complex
model like this to take a good old look at it to see exactly what the inputs
and outputs of each layer are. So here's a little function called _print_shape,
which takes the things that a hook takes and we will print out for each layer, the name
of the class, the shape of the input and the shape of the output.
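The same idea can be sketched with PyTorch's own forward hooks, without the course's Hooks helper. The small model below is a stand-in (the lesson's model is built from ResBlocks), and `record_shape` is a hypothetical name; it collects the shapes in a list rather than printing them immediately.

```python
import torch
from torch import nn

shapes = []

def record_shape(module, inp, outp):
    "Forward hook: PyTorch passes the module, a tuple of inputs, and the output."
    shapes.append((type(module).__name__, tuple(inp[0].shape), tuple(outp.shape)))

# A small stand-in model, for illustration only.
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.Flatten())

handles = [layer.register_forward_hook(record_shape) for layer in model]
with torch.no_grad():
    model(torch.randn(2, 1, 28, 28))       # one forward pass, no training
for h in handles: h.remove()               # detach the hooks afterwards

for name, inp_shape, out_shape in shapes: print(name, inp_shape, out_shape)
```

Removing the hooks afterwards matters; otherwise every later forward pass would keep appending to `shapes`.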
So we can get our model, create our learner and use a
handy little Hooks context manager we built in an earlier lesson and
call the _print_shape function. And then we will call fit for 1 epoch just
doing the evaluation not the training. And if we use the SingleBatch callback, it'll
just do a single batch, pass it through and that hook will, as you see, print out each
layer, the inputs shape and the output shape. So you can see we're starting with an input
of a batch size of 1024, 1 channel, 28 by 28. Our first ResBlock was stride-1,
so we still end up with 28 by 28, but now we've
got 8 channels. And then we gradually decrease the grid
size to 14, to 7, to 4, to 2, to 1, as we gradually increase the number of channels.
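That sequence of grid sizes follows from the usual convolution output-size formula. Here is a minimal sketch, assuming a kernel size of 3 with padding of `ks // 2`; the function name is made up for this example.

```python
def conv_out_size(n, ks=3, stride=2):
    "Output grid size for a conv with padding ks // 2 (assumed padding scheme)."
    pad = ks // 2
    return (n + 2 * pad - ks) // stride + 1

sizes = [28]
for stride in (1, 2, 2, 2, 2, 2):        # one stride-1 block, then five stride-2 blocks
    sizes.append(conv_out_size(sizes[-1], stride=stride))
print(sizes)
```

Note that 7 goes to 4 rather than 3.5, because the padding lets the last window fit.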
We then flatten it, which gets rid of that 1 by 1, which allows us then to do linear
to get down to the 10. And then there's some discussion about whether
you want a batch norm at the end or not. I was finding it quite useful in this case.
So we've got a batch norm at the end. I think this is very useful.
So I decided to create a patch for Learner called summary
that would do basically exactly the same thing, but it would
do it as a markdown table. Okay.
So, if we create a TrainLearner with our model and call .summary, this method is now available because it's been patched
into the Learner. And it's going to do exactly the same thing
as our print, but it does it more prettily by using a markdown table, if it's in a
notebook, otherwise it'll just print it. So fastcore has a handy thing for
keeping track if you're in a notebook. And in a notebook to make something markdown,
you can just use IPython.display.Markdown as you see.
And the other thing that I added as well as the input and the output is, I thought, let's
also add in the number of parameters. So we can calculate that as we've seen
before by summing up the number of elements for each
parameter in that module. And so then I've kind of kept track of that
as well so that at the end I can also print out the total number of parameters.
So we've got a 1.2 million parameter model, and you can see that there's very few parameters
here in the input. Nearly all the parameters are
actually in the last layer. Why is that?
Well, you might want to go back to our Excel convolutional spreadsheet to see this.
For every input channel, you have a set of parameters. They're all going to get added up
across each of the 3 by 3 in the kernel. And then that's going to be done for
every output filter, every output channel that you
want. So that's why you're going to end
up with, in fact, let's take a look. Maybe let's create, let's
just grab some particular one.
So create our model. And so we'll just have a look at the sizes.
And so you can see here there is this 256 by 256 by 3 by 3.
So that's a lot of parameters. Okay.
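The count for a single layer of that shape can be checked directly. This sketch compares the "sum of elements of each parameter" approach with the closed-form count for a 3 by 3 convolution from 256 to 256 channels; the layer is created standalone here rather than taken from the model.

```python
from torch import nn

layer = nn.Conv2d(256, 256, kernel_size=3)

# Sum the number of elements of every parameter tensor in the module.
n_params = sum(p.numel() for p in layer.parameters())

# Closed form: one 3x3 kernel per (output channel, input channel) pair, plus one bias per output.
by_formula = 256 * 256 * 3 * 3 + 256

# For comparison, a first layer going from 1 input channel to 8 filters.
first = sum(p.numel() for p in nn.Conv2d(1, 8, kernel_size=3).parameters())
print(n_params, by_formula, first)
```

The parameter count grows with the product of input and output channels, which is why the widest layers dominate.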
So we can call lr_find on that and get a sense of what
kind of learning rate to use. So I chose 2e-2, so 0.02.
This is our standard learning thing. You don't have to watch it train.
I've just trained it. And so look at this, by using
ResNet, we've gone up from 91.7% to 92.2%. This just keeps getting better.
So that's pretty nice. And you know, this ResNet is not anything fancy.
It's the simplest possible ResBlock, right? The model is literally copied and pasted from
before and replace each place it said conv with ResBlock.
But we've just been thoughtful about it, you know, and here's something very interesting.
We can actually try lots of other ResNets by grabbing timm.
So that's Ross Wightman's PyTorch image model library.
And if you call timm.list_models('*resnet*'), there's a lot of ResNets.
And I tried quite a few of them. Now one thing that's interesting is, if you
actually look at the source code for timm, you'll see that the various different ResNets
like ResNet18, ResNet18d, ResNet10d, they're defined in a very nice way using
this very elegant configuration. You can see exactly what's different.
So there's basically only one line of code different between each different type of ResNet
for the main ResNets. And so what I did was I tried all the timm
models I could find, and I even tried importing the underlying things and building
my own ResNets from those pieces. And the best I found was the ResNet18d. And if I train it in exactly
the same way, I got to 92%. And so the interesting thing is
you'll see that's less than our 92.2%. And it's not like I tried
lots of things to get here. This was the very first thing I tried.
Whereas this ResNet18d was after trying lots and lots of different timm models.
And so what this shows is that just a thoughtfully designed, kind
of basic architecture goes a very long way.
It's actually better for this problem than any of the PyTorch image models, ResNets,
that I could find. So I think that's quite amazing, actually.
It's really cool. And it shows that you can create
a state-of-the-art architecture just by using some common sense.
So I hope that's encouraging. So anyway, so we're up to 92.2%.
We're not done yet. Because we haven't even talked
about data augmentation. All right.
So, let's keep going. So we're going to make
everything the same as before. But before we do data augmentation, we're
going to try to improve our model even further, if we can. So I said it was kind of not constructed
with any great care and thought, really. Like in terms of this ResNet, we just took
the ConvNet and replaced it with a ResNet. So it's effectively twice as deep, because
each conv block has 2 convolutions. But ResNets train better than ConvNets.
So surely we could go deeper and wider still. So I thought, OK, how could we go wider?
And I thought, well, let's take our model. And previously, we were going from 8 up to 256.
What if we could get up to 512? And I thought, OK, well, one way to do that
would be to make our very first ResBlock not have a kernel size of
3, but a kernel size of 5. So that means that each
grid is going to be 5 by 5. That's going to be 25 inputs.
So I think it's fair enough, then, to have 16 outputs.
So if I use a kernel size of 5, 16 outputs, then that means if I keep doubling as before,
I'm going to end up at 512 rather than 256. OK, so that's the only change I made, was
to add ks equals 5 here and then change to double all the sizes.
And so if I train that, wow, look at this, 92.7%
So we're getting better still. And again, it wasn't with lots of
trying and failing and whatever. It was just like saying,
well, this just makes sense. And the first thing I tried, it just worked.
We're just trying to use these sensible, thoughtful approaches.
Next thing I'm going to try isn't necessarily something to make it better, but it's something
to make our ResNet more flexible. Our current ResNet is a bit
awkward in that the number of stride-2 layers has to be exactly
big enough that the last of them ends up with a 1 by 1 output.
So you can flatten it and do the linear. So that's not very flexible, because what
if you've got something of a different size? So to make that necessary, I've created
a get_model_2, which goes less far. It has one less layer.
So it only goes up to 256, despite starting at 16. And so because it's got one less layer, that
means that it's going to end up at the 2 by 2. So what do we do?
Well, we can do something very straightforward, which is we can take the mean over the 2 by 2.
And so if we take the mean over the 2 by 2, that's going to give us a batch size
by channels output, which is what we can then put into
our linear layer. So this is called, this ridiculously simple
thing, is called a Global Average Pooling layer.
That's the Keras term. In PyTorch, it's basically the same.
It's called an Adaptive Average Pooling layer. But in PyTorch, you can cause it to
have an output other than 1 by 1. But nobody ever really uses it that way.
So they're basically the same thing. This is actually a little bit more convenient
than the PyTorch version, because you don't have to flatten it.
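A layer like this takes only a couple of lines. The sketch below, with an arbitrary input shape, also checks it against PyTorch's `nn.AdaptiveAvgPool2d(1)` followed by a flatten.

```python
import torch
from torch import nn

class GlobalAvgPool(nn.Module):
    "Mean over the last two (height, width) dimensions: (bs, ch, h, w) -> (bs, ch)."
    def forward(self, x): return x.mean((-2, -1))

x = torch.randn(4, 256, 2, 2)                          # e.g. a 2 by 2 grid with 256 channels
ours = GlobalAvgPool()(x)
theirs = nn.Flatten()(nn.AdaptiveAvgPool2d(1)(x))      # PyTorch's layer needs a flatten afterwards
print(ours.shape, theirs.shape)
```

Because the mean is taken over whatever grid is left, the same model works for inputs whose final grid is not 1 by 1.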
So this is Global Average Pooling. So you can see here, after our
last ResBlock, which gives us a 2 by 2 output, we have GlobalAvgPool.
And that's just going to take the mean. And then we can do the Linear, BatchNorm as usual. So I wanted to improve my summary patch to
include not only the number of parameters, but also the approximate number of megaFLOPs.
So a FLOP is a floating point operation; FLOPS, strictly speaking, is floating point operations per second.
I'm not going to promise my calculation is exactly right.
I think the basic idea is right.
It's not really FLOPs. I actually just calculated the
number of multiplications. So this is not perfectly accurate,
but it's pretty indicative, I think. So this is the same summary I had before,
but I added an extra thing, which is a _flops function, where you pass in the weight matrix
and the height and the width of your grid. Now if the number of dimensions of the weight
matrix is less than 3, then we're just doing like a linear layer or something.
So actually just the number of elements is the number of FLOPs, because it's just a matrix
multiply. But if you're doing a convolution, so the
dimension is 4, then you actually do that matrix multiply for everything
in the height by width grid. So that's how I calculate this
kind of FLOPs equivalent number. So okay, so if I run that on this model, we
can now see our number of parameters compared to the ResNet model has gone from
1.2 million up to 4.9 million. And the reason why is because we've got this,
we've got this ResBlock that gets all the way up to 512.
And the way we did this is we made that a stride-1 layer.
So that's why you can see here it's gone 2, 2 and it stayed at 2, 2.
So I wanted to make it as similar as possible to the last ones.
It's got the same 512 final number of channels. And so most of the parameters are in that
last block for the reason we just discussed. Interestingly, though, it's
not as clear for the megaFLOPs. It is the greatest of them, but in terms of
number of parameters, I think this has more parameters than all the other
ones added together by a lot. But that's not true of megaFLOPs.
And that's because this first layer has to be done 28 by 28 times, whereas this layer
only has to be done 2 by 2 times. Anyway, so I tried training that
and got a pretty similar result, 92.6%. And that kind of made me think, oh, let's
fiddle around with this a little bit more to see like what kind of things would reduce
the number of parameters and the megaFLOPs. The reason you care about reducing the number
of parameters is that it has lower memory requirements. And the reason you want to reduce the
number of FLOPs is it's less compute. So in this case, what I've done here is I've removed this line of code.
So I've removed the line of code that takes it up to 512.
So that means we don't have this layer anymore. And so the number of parameters has gone
down from 4.9 million down to 1.2 million. Not a huge impact on the megaFLOPs,
but a huge impact on the parameters. We've reduced it by like 2
thirds or 3 quarters or something by getting rid of that.
And you can see that the, if we take the very first ResNet
block, the number of parameters is, you know, why is it this 5.3 megaFLOPs?
Because although the very first one starts with just one channel, the first conv, remember
our ResNet blocks have two convs. So the second conv is going
to be a 16 by 16 by 5 by 5. And again, I'm partly doing this to show you
the actual details of this architecture, but I'm partly showing it so that you can see
how to investigate exactly what's going on in your models.
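The counting rule described above can be sketched in plain Python. This is a rough approximation working on weight shapes rather than real tensors; the name `approx_flops` and the 3 by 3 kernel assumed for the 512-channel conv are illustrative, not taken from the notebook.

```python
from math import prod

def approx_flops(weight_shape, h, w):
    """Rough multiplication count for one layer, given its weight shape and output grid."""
    if len(weight_shape) < 3:
        # Linear layer or similar: one multiply per weight element.
        return prod(weight_shape)
    if len(weight_shape) == 4:
        # Convolution: the same weights are applied at every position of the h x w grid.
        return prod(weight_shape) * h * w
    return 0

# Shapes are (out_channels, in_channels, kernel_h, kernel_w).
first_conv = (16, 1, 5, 5)     # first conv of the first ResBlock
second_conv = (16, 16, 5, 5)   # second conv of the first ResBlock
last_conv = (512, 256, 3, 3)   # assumed 3x3 conv going up to 512 channels

first_block_params = prod(first_conv) + prod(second_conv)
first_block_mflops = (approx_flops(first_conv, 28, 28)
                      + approx_flops(second_conv, 28, 28)) / 1e6
last_conv_params = prod(last_conv)
last_conv_mflops = approx_flops(last_conv, 2, 2) / 1e6

print(first_block_params, round(first_block_mflops, 1))
print(last_conv_params, round(last_conv_mflops, 1))
```

Under these assumptions the two early convs hold only 6,800 weights but cost about 5.3 megaFLOPs on the 28 by 28 grid, while the late conv holds over a million weights and costs less, because it only runs on a 2 by 2 grid.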
And I really want you to try these. So if we train that one,
interestingly, even though it's only a quarter or something of
the size, we get the same accuracy, 92.7%. So that's interesting.
Can we make it faster? Well, at this point, this is the obvious place
to look at is this first ResNet block, because that's where all the megaFLOPs are.
And as I said, the reason is because it's got two convs.
The second one is 16 by 16 channels, 16 channels in, 16
channels out, and it's doing these And it's having to do it
across the whole 28 by 28 grid. So that's the bulk of the biggest compute. So what we could do is we could replace
this ResBlock with just one convolution. And if we do that, then you'll
see that we've now got rid of the 16 by 16 by 5 by 5.
We just got the 16 by 1 by 5 by 5. So the number of megaFLOPs has
gone down from 18.3 to 13.3. The number of parameters hasn't
really changed at all, right? Because the number of parameters
was only 6,800, right? So be very careful that when you see people
talk about, oh, my model has less parameters. That doesn't mean it's faster.
Okay. Really, it doesn't mean that at all.
There's no particular relationship between parameters and speed.
Even counting megaFLOPs doesn't always work that well, because it doesn't take account
of the amount of things moving through memory. But it's not a bad approximation here.
So here's one which has got much less megaFLOPs. And in this case, it's about
the same accuracy as well. So I think this is really interesting.
We've managed to build a model that has far less parameters and far less megaFLOPs and
has basically exactly the same accuracy. So I think that's a really
important thing to keep in mind. And remember, this is still way
better than the ResNet18d from timm. So we built something that
is fast, small, and accurate. So the obvious question is,
what if we train for longer? And the answer is, if we train for longer,
if we train for 20 epochs, I'm not going to wait for it.
The training accuracy gets up to 0.999. But the validation accuracy is worse.
It's 0.924. And the reason for that is that after 20 epochs,
it's seen the same picture so many times, it's just memorizing them. And so once you start memorizing,
things actually go downhill. So we need to regularize.
Now something that we have claimed in the past can regularize is to use weight decay.
But here's where I'm going to point out that weight decay doesn't regularize at all if
you use BatchNorm. And it's fascinating.
For years, people didn't even seem to notice this. And then somebody, I think, finally
wrote a paper that pointed this out. And people were like, oh, wow, that's weird.
But it's really obvious when you think about it. A BatchNorm layer has a single set of
coefficients which multiplies an entire layer. So that set of coefficients could
just be the number 100 in every place. And that's going to multiply the entire previous
weight matrix or convolution kernel matrix by 100.
As far as weight decay is concerned, that's not much of an impact at all because the BatchNorm
layer has very few weights. So it doesn't really have a
huge impact on weight decay. But it massively increases the
effective scale of the weight matrix. So BatchNorm basically lets the neural net
cheat by increasing the coefficients, the parameters, even nearly as much as it wants
indirectly just by changing the BatchNorm layer's weights.
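One way to see this numerically is the small plain-Python sketch below. It is an illustration of the scale argument, not code from the lesson: the toy linear layer, the batch of 8 random inputs and the factor of 100 are all assumptions. Shrinking the weights by 100 makes the weight decay penalty tiny, yet the output after normalization is unchanged.

```python
import math
import random

def linear(xs, w):
    """Toy layer: one output per input row, a dot product with w."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]

def batchnorm(zs, gamma=1.0, beta=0.0):
    """Normalize a batch of scalars to mean 0, std 1, then scale and shift."""
    mean = sum(zs) / len(zs)
    var = sum((z - mean) ** 2 for z in zs) / len(zs)
    return [gamma * (z - mean) / math.sqrt(var) + beta for z in zs]

random.seed(0)
xs = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]

w = [0.5, -1.2, 2.0]
w_small = [wi / 100 for wi in w]   # what heavy weight decay might push towards

out = batchnorm(linear(xs, w))
out_small = batchnorm(linear(xs, w_small))

max_diff = max(abs(a - b) for a, b in zip(out, out_small))
l2 = sum(wi ** 2 for wi in w)
l2_small = sum(wi ** 2 for wi in w_small)
print(max_diff, l2, l2_small)
```

The penalty on the weights drops by a factor of 10,000 while the normalized activations stay the same up to rounding, so the penalty is not constraining what the layer computes.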
So weight decay is not going to save us. And that's something really
important to recognize. Weight decay is not, I mean, with BatchNorm
layers, I don't see the point of it at all. It does have some, like, there has
been some studies of what it does. And it does have some weird kind of
second order effects on the learning rate. But I don't think you should rely on them.
You should use a scheduler for changing the learning rate rather than weird second order
effects caused by weight decay. So instead, we're going to do data augmentation,
which is where we're going to modify every image a little bit by random change so that
it doesn't see the same image each time. So there's not any particular reason to
implement these from scratch, to be honest. We have implemented them
all from scratch in fastai. So you can certainly look
them up if you're interested. But it's actually a little bit separate
to what we're meant to be learning about. So I'm not going to go through it. But yeah, if you're interested,
go into fastai, vision, augment. And you'll be able to see, for
example, how do we do flip? And you know, it's just like x.transpose.
Okay, which is not really, yeah, it's not that interesting.
Yeah, how do we do cropping and padding? How do we do random crops, so on and so forth?
Okay, so we're just going to actually, you know, fastai has probably got the best implementation of these, but torchvision’s are fine.
So we'll just use them. And so we've created before
a batch transform callback. And we used it for normalization, if you remember.
So what we could do is we could create a transform batch function, which transforms the inputs
and transforms the outputs using two different functions.
So that would be an augmentation callback. And so then you would say, okay, for the transform
batch function, for example, in this case, we want to transform our x's.
And how do we want to transform our x's? And the answer is, we want to transform them
using this module, which is a sequential module of first of all doing a RandomCrop,
and then a RandomHorizontalFlip. Now it seems weird to randomly crop a 28 by
28 image to get a 28 by 28 image, but we can add padding to it.
And so effectively, it's going to randomly add padding on one or both sides to do this
kind of random crop. One thing I did to change
the BatchTransform callback, can't remember if I've mentioned
this before, but something I changed slightly since we first wrote it, is I added this on_train
and on_val so that it only does it if you said I want to do it on training and it's
training, or I want to do it on validation and it's not training.
And then this is all the code is. So data augmentation, generally speaking,
shouldn't be done on validation, so we set on_val false.
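The on_train and on_val logic can be sketched without the rest of the library. This is a standalone illustration: the stand-in learner built from `SimpleNamespace` and the `double_inputs` transform are hypothetical, and the real callback would subclass the course's callback base class.

```python
from types import SimpleNamespace

class BatchTransformCB:
    """Apply `tfm` to the learner's batch, only in the phases that are switched on."""
    def __init__(self, tfm, on_train=True, on_val=True):
        self.tfm, self.on_train, self.on_val = tfm, on_train, on_val

    def before_batch(self, learn):
        if (self.on_train and learn.training) or (self.on_val and not learn.training):
            learn.batch = self.tfm(learn.batch)

def double_inputs(batch):
    # Hypothetical transform: change the inputs, leave the targets alone.
    xb, yb = batch
    return [x * 2 for x in xb], yb

# Augmentation-style usage: transform during training only.
cb = BatchTransformCB(double_inputs, on_train=True, on_val=False)

train_learn = SimpleNamespace(training=True, batch=([1, 2, 3], [0, 1, 0]))
val_learn = SimpleNamespace(training=False, batch=([1, 2, 3], [0, 1, 0]))
cb.before_batch(train_learn)
cb.before_batch(val_learn)
print(train_learn.batch, val_learn.batch)
```

With on_val set to False, the validation batch passes through untouched.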
Okay, so what I'm going to do first of all is I'm going
to use our classic SingleBatchCB trick and fit, in fact, even better, oh yeah, fit(1)
just doing training. And what I'm going to do then is after I
fit, I can grab the batch out of the learner. And this is a way, this is quite cool, right?
This is a way that I can see exactly what the model sees, right?
So this is not relying on any, you know, approximations.
Remember when we fit, it puts it in the batch that it looks at into learn.batch.
So if we fit for a single batch, we can then grab that batch back out of it and we can
call show_images. And so here you can see
this little crop it's added. Now something you'll notice is that every
single image in this batch, notice I grabbed the first 16, so I don't want to show you
1024, has exactly the same augmentation. And that makes sense, right, because
we're applying a BatchTransform. Now why is this good and why is it bad?
It's good because this is running on the GPU, right?
Which is great because nowadays very often it's really hard to get enough CPU to feed
your fast GPU fast enough. Particularly if you use something
like Kaggle or Colab that are really underpowered for
CPU, particularly Kaggle. So this way all of our transformations, all
of our augmentation is happening on the GPU. On the downside, it means that
there's a little bit less variety. Every mini batch has the same augmentation.
I don't think the downside matters though, because it's going to see lots of mini batches.
So the fact that each mini batch is going to have a different augmentation is actually
all I care about. So we can see that if we run this multiple times, you can see it's got a different augmentation
in each mini batch. Okay, so I decided actually I'm
just going to use 1 padding. So I'm just going to do a very, very
small amount of data augmentation. And I'm going to do 20 epochs
using OneCycle learning rate. And so this takes quite a while to train,
so we won't watch it, but check this out. We get to 93.8%.
That's pretty wild. Yeah that's pretty wild.
So I actually went on Twitter and I said to the entire world on Twitter, you know, which
if you're watching this in 2023, if Twitter doesn't exist yet, ask somebody to tell you
about what Twitter used to be. Hopefully it still does.
Can anybody beat this in 20 epochs? You can use any model you like, any library
you like, and nobody's got anywhere close. So this is pretty amazing.
And actually, you know, when I had a look at papers with code, there are, you know,
well, I mean, you can see it's right up there, right, with the kind of best models that are
listed, certainly better than these ones. And the better models all use,
you know, 250 or more epochs. So yeah, if anybody, I'm hoping that somebody
watching this will find a way to beat this in 20 epochs, that would be really great.
Because as you can see, we haven't really done anything very amazingly weirdly clever.
It's all very, very basic. And actually we can go even
a bit further than 93.8. Just before we do, I mentioned that since
this is actually taking a while to train now, I can't remember, it takes like
10 to 15 seconds per epoch. So you know, you're waiting a few
minutes, you may as well save it. So you can just call torch.save on a model,
and then you can load that back later. So something that can make things even better
is something called test time augmentation. I guess I should write this out properly here.
Test, text, test time augmentation. Now test time augmentation actually does
our BatchTransform callback on validation as well.
And then what we're going to do is we're actually, in this case, we're going to do just a very,
very, very simple test time augmentation, which is we're going to add a BatchTransform
callback that runs on validate and it's not random, but it actually just does a horizontal
flip. Non-random, so it always does a horizontal flip.
And so check this out. What we're going to do is we're going to
create a new callback called CapturePreds. And after each batch, it's
just going to append to a list the predictions, and it's going
to append to a different list the targets. And that way we can just call learn.fit, train
equals False, and it will show us the accuracy. And this is just the same
number that we saw before. But then what we can do is we can call the
same thing, but this time with a different callback, which is with the
horizontal flip callback. And that way it's going to do exactly the
same thing as before, but in every time it's going to do a horizontal flip.
And weirdly enough, that accuracy is slightly higher, which that's not the interesting bit.
The interesting bit is that we've now got two sets of predictions. We've got the sets of predictions
with the non-flipped version. We've got the set of predictions
with the flipped version. And what we could do is we could stack
those together and take the mean. So we're going to take the average of
the flipped and unflipped predictions. And that gives us a better result still, 94.2%
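The stack-and-average step looks roughly like the sketch below. The two prediction tensors here are random stand-ins (an assumption for illustration); in practice they would be the predictions captured from the unflipped and flipped validation passes.

```python
import torch

torch.manual_seed(0)
n_items, n_classes = 6, 10

# Stand-ins for the captured predictions from the two validation passes.
preds_plain = torch.randn(n_items, n_classes)
preds_flip = torch.randn(n_items, n_classes)

# Stack along a new leading axis, then average over it.
avg_preds = torch.stack([preds_plain, preds_flip]).mean(0)

# The predicted class is the argmax of the averaged predictions.
labels = avg_preds.argmax(dim=1)
print(avg_preds.shape, labels)
```

The same pattern works for more than two augmented passes: add them to the list being stacked.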
So why is it better? It's because looking at the image from kind
of like multiple different directions gives it more opportunities to try to
understand what this is a picture of. And so in this case, I'm just giving it two
different directions, which is the flipped and unflipped version, and
then just taking their average. So yeah, this is like a really nice little trick.
Sam's pointed out it's a bit like random forest, which is true.
It's a kind of bagging that we're doing. We're kind of getting multiple
predictions and bringing them together. And so we can actually, so 94.2 I think is my best 20 epoch result.
And notice I didn't have to do any additional training.
So it still counts as a 20 epoch result. You can do test time augmentation where you
do, you know, a much wider range of different augmentations that you trained with, and
then you can use them at test time as well. You know, more, more crops or
rotations or warps or whatever. I want to show you one of my favorite
data augmentation approaches, which is called random
erasing. So random erasing, I'll show you
what it's going to look like. Random erasing, we're going to add a little,
we're going to basically delete a little bit of each picture and we're going to replace
it with some random Gaussian noise. Now, in this case, we've just got one patch.
But eventually we're going to do more than one patch.
So I wanted to implement this because remember we have to implement everything from scratch.
And this one's a bit less trivial than the previous transforms.
So we should do it from scratch. And also I'm not sure there's
that many good implementations. Ross Wightman's timm I think has one.
And so, and it's also a very good exercise to see how to
implement this from scratch. So let's grab a batch out of the training set.
And let's just grab the first 16 images. And so then let's grab the
mean and standard deviation. Okay. And so what we want to do is we wanted to
delete a patch from each image, but rather than deleting it, deleting it
would change the statistics, right? If we set those all to zero, the mean and
standard deviation are now not going to be 0, 1. But if we replace them with exactly the same
mean and standard deviation pixels that the picture has, or that our dataset has,
then it won't change the statistics. So that's why we've grabbed the
mean and standard deviation. And so we could then try grabbing,
let's say we want to delete 0.2, so 20% of the height
and width. Then let's find out how big that size is.
So 0.2 of the shape, of the height and of the width, that's the size of the x and y.
And then the starting point, we're just going to randomly grab some starting point, right?
So in this case, we've got the starting point for x is 14, starting point for y is 0,
and then it's going to be a 5 by 5 spot. And then we're going to do a Gaussian or
normal initialization of our mini batch, everything
in the batch, every channel for this x slice, this y slice,
and we're going to initialize it with this mean and standard
deviation, normal random noise. And so that's what this is.
So it's just that tiny little bit of code. So you'll see, I don't
start by writing a function. I start by writing single lines of code that
I can run independently and make sure that they all work and that I look at the
pictures and make sure it's working. Now one thing that's wrong here is that you
see how the different, you know, this looks black and this looks gray.
Now at first this was confusing me as to what's going on.
What's it changed? Because the original images didn't look like that.
And I realized the problem is that the minimum and the maximum have changed.
It used to be from -0.8 to 2. That was the previous min and max.
Now it goes from -3 to 3. So the noise we've added has the same mean
and standard deviation, but it doesn't have the same range because the pixels were
not normally distributed originally. So normally distributed noise actually is wrong.
So to fix that, I created a new version and I'm putting in a function now.
It does all the same stuff as before, as I just did before, but it clamps the random
pixels to be between min and max. And so it's going to be exactly the same thing,
but it's going to make sure that it doesn't change the range.
That's really important, I think. Because changing the range really impacts
your, you know, your activations quite a lot. So here's what that looks like.
And so as you can see now, all of the backgrounds have that nice black and it's still giving
me random pixels. And I can check, and because I've done the
clamping, you know, and stuff, the mean and standard deviation aren't quite 0,
1, but they're very, very close. So I'm going to call that good enough. And of course the min and max haven't changed
because I clamped them to ensure they didn't change.
So that's my random erasing. So that randomly erases one block.
And so I could create a random erase, which will randomly choose up to, in this case,
four blocks. So with that function, oh, that's annoying.
It happened to be zero this time. Okay, we'll just run it again.
This time it's got three, so that's good. So you can see it's got, oh, maybe it's
four, one, two, three, four blocks. Okay.
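Putting those steps together, a random erasing function might look like the sketch below. It follows the description above (a patch of 20% of the height and width, noise with the batch mean and standard deviation, clamped to the original range, up to four patches), but the function names and the random test batch are assumptions.

```python
import random
import torch

def _rand_erase1(x, pct, mean, std, mn, mx):
    """Overwrite one random patch (same location for the whole batch) with clamped noise."""
    h, w = x.shape[-2], x.shape[-1]
    szx, szy = int(pct * h), int(pct * w)
    stx = int(random.random() * (1 - pct) * h)
    sty = int(random.random() * (1 - pct) * w)
    # In-place Gaussian noise on the slice, matching the batch statistics.
    x[:, :, stx:stx + szx, sty:sty + szy].normal_(mean, std)
    # Keep the range the same as before the noise was added.
    x.clamp_(mn, mx)

def rand_erase(x, pct=0.2, max_num=4):
    """Erase between 0 and max_num patches, modifying x in place."""
    mean, std = x.mean().item(), x.std().item()
    mn, mx = x.min().item(), x.max().item()
    for _ in range(random.randint(0, max_num)):
        _rand_erase1(x, pct, mean, std, mn, mx)
    return x

random.seed(0)
torch.manual_seed(0)
xb = torch.randn(16, 1, 28, 28)        # stand-in for a normalized mini batch
lo, hi = xb.min().item(), xb.max().item()
erased = rand_erase(xb.clone())
print(erased.mean().item(), erased.std().item(), erased.min().item(), erased.max().item())
```

Because the number of patches is drawn from 0 up to the maximum, some calls leave the batch untouched, which is what happened in the first run above.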
So that's what this data augmentation looks like. So we can create a class to
do this data augmentation. So you'll pass in what percentage to do in
each block, what the maximum number of blocks to have is, store that away.
And then in the forward, we're just going to call our random erase function, passing
in the input and passing in the parameters. Great.
So now we can use random crop, random flip and random RandErase. Make sure it looks okay.
And so now we're going to go all the way up to 50 epochs.
And so if I run this for 50 epochs, I get 94.6%.
Isn't that crazy? So we're really right up there
now, up, we're even above this one. So we're somewhere up here.
And this is like stuff people write papers about from 2019, 2020.
Oh look, here's the random erasing paper. That's cool.
So they were way ahead of their time in 2017, but yeah, that would have changed for a lot
longer. Now I was having a think and I realized something, which is like, why, like, how do I, how do
we actually get the correct distribution? Right?
Like in some ways it shouldn't matter, but I was kind of like bothered by this thing
of like, well, we don't actually end up with 0, 1 and this kind of like clamping.
It all feels a bit weird. Like how do we actually replace these pixels
with something that is guaranteed to be the correct distribution?
And I realized there's actually a very simple answer to this, which is we could copy another
part of the picture over to here. If we copy part of the picture, we're guaranteed
to have the correct distribution of pixels. And so it wouldn't exactly
be random erasing anymore. That would be random copying.
Now I'm sure somebody else has invented this. I mean, you know I'm not saying this,
nobody's ever thought of this before. So if anybody knows a paper that's
done this, please tell me about it. But I, you know, I think it's a very sensible
approach and it's very, very easy to implement. So again, we're going to
implement it all manually, right? So let's get our x mini batch and
let's get our, again, our size. And again, let's get the x, y that we're going
to be erasing, but this time we're not erasing, we're copying, so we'll then randomly
get a different x, y to copy from. And so now it's just, instead of init random
noise, we just say, replace this slice of the batch with this slice of the batch. And we end up with, you know, you can see
here, it's kind of copied little bits across. Some of them you can't really see at all.
And some of you can, because I think some of them are black and it's replaced black,
but I guess it's knocked off the end of this shoe, added a little bit extra here, a little
bit extra here. So we can now, again, we'll
turn it into a function. Once I've tested it in the REPL,
make sure the function works. And obviously this, in this case, it's copying
it largely from something that's largely black for a lot of them.
And then again, we can do the thing where we do it multiple times.
And here we go. It's got a couple of random copies.
And so again, turn that into a class, create our transforms.
And again, we, okay. So again, we can have a look at a
batch to make sure it looks sensible. And do it for, just did it for
25 epochs here and gets to 94%. Now why did I do it for 25 epochs?
Because I was trying to think about how do I beat my 50 epoch record, which was 94.6%.
And I thought, well, what I could do is I could train for 25 epochs and then I'll train
a whole new model for a different 25 epochs. And I'm going to put it in a
different learner, learn2, right? This one is 94.1%.
So one of the models was 94.1%. One of them was 94%.
Maybe you can guess what we're going to do next. It's a bit like test time augmentation, but
rather than that, we're going to grab the predictions of our first learner and grab
the predictions of our second learner and stack them up and take the mean.
And this is called ensembling. And not surprisingly, the ensemble is better
than either of the two individual models at Although, unfortunately, I'm afraid
to say we didn't beat our best. But it's a useful trick and
particularly useful trick. In this case, I was kind of like trying something
a bit interesting to see if using the exact same number of epochs, can I get a better
result by using ensembling instead of training for longer?
And the answer was I couldn't. Maybe it's because the random copy is not as
good, or maybe I'm using too much augmentation. Who knows?
But it's something that you could experiment with. So Sharwon mentions in the chat that CutMix
is similar to this, which is actually, that's a good point.
I'd forgotten CutMix, but CutMix, yes, copies it from different images rather than from
the same image. But yeah, it's pretty much
the same thing, I guess-ish. Well, similar.
Yeah, very similar. All right.
So that brings us to the end of the lesson. And I am so pumped and excited to share this
with you because I don't know that this has ever been done before, to be able to go from,
I mean, even in our previous courses, we've never done this before, go from scratch, step
by step to an absolute state of the art model where we build everything
ourselves and it runs this quickly. And we're even using our own
custom ResNet and everything, just using common sense at
every stage. And so hopefully that shows
that deep learning is not magic, that we can actually build the
pieces ourselves. And yeah, as you'll see, going up to larger
data sets, absolutely nothing changes. And so it's exactly these techniques.
And this is actually, I do 99% of my research on very small data sets because you can iterate
much more quickly, you can understand them much better.
And I don't think there's ever been a time where I've then gone up to a bigger data set
and my findings didn't continue to hold true. Now, homework, what I would really like you to
do is to actually do the thing that I didn't do, which is to do the… create your own
schedulers that work with PyTorch's optimizers.
sure that you understand the PyTorch API well, which I've really laid out here.
So study this carefully. So create your own cosine
annealing scheduler from scratch, and then create your own 1-Cycle
scheduler from scratch. And make sure that they work correctly
with this batch scheduler callback. This will be a very good exercise for you
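As a starting point only, here is the standard cosine annealing formula as a plain function. The function name, the step count and the learning rates are illustrative assumptions; wiring it up to PyTorch's optimizers and the batch scheduler callback is the part left as the exercise.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step`, following half a cosine from lr_max down to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Illustrative schedule: 10 steps from 0.1 down to 0.
schedule = [cosine_lr(s, 10, lr_max=0.1) for s in range(11)]
print([round(lr, 4) for lr in schedule])
```

A 1-Cycle schedule can be built from two such pieces, one warming up and one annealing down.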
in, you know, hopefully getting extremely frustrated as things don't work the way you
hoped they would and being mystified for a while and then working through it, you know,
using this very step-by-step approach, lots of experimentation, lots of
exploration, and then figuring it out. That's the journey I'm hoping you have.
If it's all super easy and you get it first go, then, you know, you have to find something
else to do. But yeah, I'm hoping you'll find it actually,
you know, surprisingly tricky to get it all working properly and in the process of doing so,
you're going to have to do a lot of exploration and experimentation.
But you'll realize that it requires no prerequisite knowledge at all.
Okay, so if it doesn't work first time, it's not because there's something that you didn't
learn in graduate school, if only you had done a PhD, whatever.
It's just that you need to dig through, you know, slowly and carefully to see how it all
works. And you know, then see how neat
and concise you can get it. And the other homework is to try and beat me.
I really, really want people to beat me. Try to beat me on the 5 epoch or the
20 epoch or the 50 epoch Fashion-MNIST. Ideally, using miniai with things
that you've added yourself. But you know, you can try grabbing
other libraries if you like. Well, ideally, if you do grab another library
and you find you can beat my approach, try to re-implement that library.
That way you are still within the spirit of the game.
Okay, so in our next lesson, Johno and Tanishq and I are going
to be putting this all together to create a diffusion model from scratch.
And we're actually going to be taking a couple of lessons for this.
Not just a diffusion model, but a variety of interesting generative approaches.
So we're kind of starting to come full circle. So thank you so much for joining
me on this very extensive journey. And I look forward to hearing
what you come up with. Please do come and join us on
forums.fast.ai and share your progress. Bye!