Hi everybody and welcome to Lesson 17
of Practical Deep Learning for Coders. I'm really excited about what we're going
to look at over the next lesson or two. It's actually been turning out really
well, much better than I could have hoped. So I can't wait to dive in.
Before I do, I'm just going to mention a couple of minor changes that
I made to our miniai library this week. One was I went back to our Callback class
in the learner notebook and I did decide in the end to add a dunder getattr to it
that just adds these four attributes. And for these four attributes,
it passes it down to self.learn. So in a callback, you'll be able to
refer to model to get self.learn.model, opt will be self.learn.opt, batch will be
self.learn.batch, epoch will be self.learn.epoch. You can change these, you know, you could subclass the callback and
add your own to underscore forward, or you could remove things from
underscore forward or whatever. But I felt like these four things I access
a lot and I was sick of typing self.learn. And then I added one more property, which is
in a callback there'll be a self.training, which saves from typing self.learn.model.training.
Since we have model, you can get rid of the learn. But still, I mean, you so often
have to check the training. Now you can just get self.training in a callback.
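As a rough illustration of the forwarding described above, here is a minimal sketch in plain Python. The `Learner` and `Model` classes are hypothetical stand-ins, and the attribute name `_fwd` is an assumption about how "underscore forward" is spelled; this is not the miniai source.

```python
class Model:
    """Hypothetical stand-in for a PyTorch module; only the `training` flag matters here."""
    training = True

class Learner:
    """Hypothetical stand-in for the miniai Learner: it just holds what a callback reads."""
    def __init__(self, model, opt=None):
        self.model, self.opt = model, opt
        self.batch, self.epoch = None, 0

class Callback:
    # Names that are looked up on self.learn when they are missing on the callback.
    _fwd = ('model', 'opt', 'batch', 'epoch')

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup has already failed.
        if name in self._fwd:
            return getattr(self.learn, name)
        raise AttributeError(name)

    @property
    def training(self):
        # Shortcut for self.learn.model.training.
        return self.model.training

cb = Callback()
cb.learn = Learner(Model(), opt='sgd')
print(cb.model is cb.learn.model, cb.opt, cb.epoch, cb.training)
```

Because the forwarding lives in `__getattr__`, it only affects reads. Assigning `cb.batch = ...` would create a new attribute on the callback rather than changing the learner's batch.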
So that was one change I made. The second change I made was I found myself
getting a bit bored of adding TrainCB every time. So what I did was I took the four training
methods from the MomentumLearner subclass, and I've moved them into a TrainLearner
subclass along with zero_grad. So now MomentumLearner actually inherits
from TrainLearner and just adds momentum. There's kind of a quirky momentum method and
changes zero_grad to do the momentum thing. So yeah, so we'll be using TrainLearner
quite a bit over the next lesson or two. So TrainLearner is just a Learner
which has the usual training. It's exactly the same that fastai2 has or
you'd have in most PyTorch training loops. And obviously by using this, you lose the
ability to change these with a callback. So it's a little bit less flexible.
Okay, so those are little changes. And then I made some changes to what we looked
at last week, which is the activations notebook. Specifically,
I added a HooksCallback. So previously, we had a Hooks class, and it
didn't really require too much ceremony to use, but I thought we could make
it even simpler and a bit more fastai-ish or miniai-ish by
putting hooks into a callback. So this callback, as usual, you pass a function
that's going to be called for your hook. And you can optionally pass it a filter
as to what modules you want to hook. And then in before_fit, it will
filter the modules in the Learner. And this is one of these things we can now get rid
of, we don't need the .learn here because model is one of the four things we have a shortcut to.
And then here, we're going to create the Hooks object and put it in hooks. And so one thing that's convenient here is the
hook function. Now (and we can get rid of learn.model here too) you don't have
to worry about checking in your hook functions whether you're in training or not: it
always checks whether it's in training, and if so, it calls the hook function you passed
in. And after it finishes, it removes the hooks. And you can iterate through the hooks and get the
length of the hooks because it just passes iteration and length down to self.hooks.
So to show you how this works, we can create a HooksCallback.
We can use the same append_stats. And then we can run the model. So we just added that as an extra
callback to our fit function. I don't remember if we had the extra callbacks
before. I'm not sure we did. So just to explain: I just added extra callbacks here in the fit function and we're just
adding any extra callbacks here. So now we've got that callback that we
created, and because we can iterate through it and so forth, we can just iterate through that callback
as if it's hooks and plot in the usual way. So that's a convenient little
thing I added. OK. And then I took our colorful dimension
stuff, which Stefano and I came up with a few years ago, and decided to wrap
all that up in a callback as well. So I've actually subclassed here our hooks
callback to create an ActivationStats. And what that's going to do is it's going to
use this append_stats, which appends the means, the standard deviations and the histograms.
And oh, and I changed that very slightly. Also, the thing which creates these kind
of dead plots, I changed it to just get the ratio of the very first, very smallest
histogram bin to the rest of the bins. So these are really kind of more
like very dead at this point. So these graphs look a little bit different. OK, so, yeah, so I subclassed
the hooks callback and added the colorful dimension method,
dead chart method and a plot stats method. So to see them at work, say we want to
get the activations on all of the convs. We create our ActivationStats,
add that as an extra callback, and train our model. And then we can
call color_dim to get that plot, dead_chart to get that plot and
plot_stats to get that plot. So now we have absolutely no excuse for
not getting all of these really fantastic, informative visualizations of what's going
on inside our model, because it's literally as easy as adding one line of code.
And just putting that in your callbacks. So I really think that couldn't be easier.
And so I hope that, even for models you thought were training really well,
you'll try using this, because you might be surprised
to discover that they're not. OK, so those are some changes.
Pretty minor, but hopefully useful. And so today, and over the next lesson
or two, we're going to look at trying to get to an important milestone, which is
to try to get Fashion-MNIST training to an accuracy of 90 percent or more, which
is certainly not the end of the road. But it's not bad if we look at paperswithcode. So 90 percent would be a 10 percent error.
There's folks that have got down to 3 or 4 percent error in the very
best, which is very impressive. But, you know, 10 percent
error wouldn't be way off what's in this papers leaderboard.
I don't know how far we'll get eventually, but without using even any architectural
changes, no ResNets or anything, we're trying to get to 10 percent error.
All right, so the first few cells are just copied from earlier.
And so here's a ridiculously simple model.
All I did here was I said, OK, well, the very first convolution is
taking a 3 by 3 by 1 channel input, which is 9 values. So we should compress
it at least a little bit. So I made it 8 channels
output for the convolution. And then I just doubled it to 16,
doubled it to 32, doubled it to 64. And so that's going to get to,
as you see, a 14 by 14 image, then a 7 by 7, a 4 by 4, a 2 by 2.
And then this one gets us to a 1 by 1. So, of course, we get the 10 classes.
So there was no thought at all behind, really, this architecture.
This pure, just pure convolutional architecture. And remember, this Flatten at the end is necessary
to get rid of the unit axis that we end up with because this is a 1 by 1.
OK, so let's do a learning rate finder on this very simple model.
And what I found was that this model is so bad that
when I tried to use the learning rate finder kind of in the usual way,
which would be just to say, you know, start at 1e-5 or
1e-4, say, and then run it, it kind of looks ridiculous. It's
impossible to see what's going on. So if you remember, we added that
multiplier, we called it lr_mult or gamma is what they called it in PyTorch.
So we ended up calling it gamma. So I dialed that way down to make
it much more gradual, which means I have to dial up the starting learning rate.
And only then did I manage even to get the learning rate finder to tell us anything useful.
OK, so. So there we are. So that's our learning rate finder.
Come back to these three later. So I tried using a learning rate of 0.2, and
after trying a few different values, 0.4, 0.1, 0.2 seems about the highest we can get up to.
Even this actually is too high, I found. Much lower and it didn't train much at all.
You can see what happens if I do: it starts training and then,
yeah, we lose it, which is unfortunate. And you can see that in the colorful dimension plot.
We get this classic, you know, getting activations, crashing,
getting activations, crashing. And you can kind of see the key problem
here really is that we don't have zero-mean, standard-deviation-1 layers at the start,
so we certainly don't keep them throughout. And this is a problem. Something I've got to
mention, by the way: when you're training stuff in Jupyter notebooks
(this is just a new thing we've just added), you can easily run out of
GPU memory, and it turns out there's two reasons why you can particularly run out of GPU memory
if you run a few cells in a Jupyter notebook. The first is that, kind of for your convenience,
Jupyter notebook (you may or may not know this) actually stores the results
of your previous few evaluations. If you just type underscore, it tells
you the very last thing you evaluated. And you can do more underscores
to go backwards further in time. Or you can also use numbers:
to get Out 16, for example, that would be _16. Now, the reason this is an issue is that
if one of your outputs is a big CUDA tensor and you've shown it in a cell, that's going
to keep that GPU memory basically forever. And so that's a bit of a problem.
So if you are running out of memory, one thing you'd want to do is clean out
all of those underscore blah things. I found that there's actually some function that
nearly does that in the IPython source code. So I copied the important bits
out of it and put it in here. So if you call clean_ipython_hist, it will
—don't worry about the lines of code at all. This is just a thing that you can
use to get back that GPU memory. The second thing, which Piotr figured out in
the last week or so, is that you also have… if you have a CUDA error at any point or
even any kind of exception at any point, then the exception object is actually stored
by Python and any tensors that were allocated anywhere in that trace, in that traceback,
will stay allocated basically forever. And again, that's a big problem.
So I created this clean trace back function based on Piotr’s code, which gets rid of that.
So this is particularly problematic because if you have a CUDA out of memory error and then
you try to rerun it, you'll still have a CUDA out of memory error because all the memory that
was allocated before is now in that trace back. So basically, any time you get a CUDA out of
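The way a stored exception keeps frame locals alive can be seen without a GPU. This is a minimal sketch using only the standard library; `FakeTensor` and `clean_tb_sketch` are hypothetical names for illustration, and this is not the function from the lesson's notebook.

```python
import gc
import traceback
import weakref

class FakeTensor:
    """Hypothetical stand-in for a large CUDA tensor (no GPU involved)."""
    def __init__(self, n):
        self.data = bytearray(n)

probe = {}

def failing_step():
    big = FakeTensor(1_000_000)        # local variable living in this frame
    probe['ref'] = weakref.ref(big)    # lets us ask later whether `big` is still alive
    raise RuntimeError('simulated CUDA out of memory')

def clean_tb_sketch(exc):
    """Release the frame locals referenced by a stored exception's traceback."""
    traceback.clear_frames(exc.__traceback__)
    exc.__traceback__ = None

try:
    failing_step()
except RuntimeError as e:
    stored = e   # mimics the interpreter remembering the last exception

alive_before = probe['ref']() is not None
clean_tb_sketch(stored)
gc.collect()
alive_after = probe['ref']() is not None
print(alive_before, alive_after)
```

The first flag should be True (the traceback still references the frame that owns `big`) and the second False (clearing the frames lets the object be freed). With real CUDA tensors you would additionally empty the CUDA cache, as the transcript describes.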
memory error or any kind of error with memory, you can call clean_mem and that will
clean the memory in your trace back. It will clean the memory
used in your Jupyter history. Do a garbage collect.
Empty the CUDA cache and that will basically, should give you a totally clean GPU.
You don't have to restart your notebook. OK, so Sam asked a very good question in the chat.
He's asking: I thought we were
training an autoencoder, or are we training a classifier, or what?
So just to remind you guys, yes, we started doing this autoencoder back in notebook 8 and we decided, oh,
we don't have the tools to make this work yet.
tools and then come back to it. So in creating the tools,
we're doing a classifier. We try to make a really good
Fashion-MNIST classifier. Well, we try to create tools
which hopefully have a side effect giving us a really good classifier and then using
those tools, we hope that will allow us to create a really good autoencoder.
So, yes, we're kind of like gradually unwinding and we'll come back to
where we were actually trying to get to. So that's why we're doing this this classifier.
The techniques and library pieces we're building will be all very necessary.
OK, so A, why do we need a 0 mean, 1 standard deviation?
And B, how do we get it?
So first of all, on the why. So if you think about what a neural net
does, a deep learning net specifically. It takes an input and it puts it
through a whole bunch of matrix multiplications and of course there are
activation functions sandwiched in there. Don't worry about the activation functions.
That doesn't change the argument. So let's just imagine we start with some bunch of, some matrix.
Right?. Imagine the 50 deep neural net.
So 50 deep neural net, basically, if we ignore the activation functions is taking
the previous input and doing a matrix multiply by some, initially, some random weights.
So these are all. Yeah, these are just a bunch of random weights.
And these are actually randn. And is mean 0, variance 1.
And if we run this. After 50 times of multiplying by a
matrix by matrix by matrix by matrix. We end up with.
NaNs. That's no good.
So that might be that our matrix, the numbers in our matrix, were too big.
So each time we multiply the numbers were getting bigger and bigger and bigger.
So maybe we should make them a bit smaller. Okay, so in the matrix we are
multiplying by, let's try scaling it by 0.01. And we multiply that lots of
times. Oh, now we've got zeros. Now, of course, mathematically speaking,
the first result isn't actually NaN. It's actually some really big number. Mathematically speaking, this
isn't really zero. It's a really small number. But computers can't handle really, really
small numbers or really, really big numbers. So really, really big numbers eventually just
get called NaN and really, really small numbers eventually just get called zero.
So basically they get washed out. And in fact, even if you don't get a NaN
or even if you don't quite get a zero, for numbers that are extremely big, the internal representation
has no ability to discriminate between even slightly similar numbers.
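These floating point effects can be checked directly. The sketch below uses Python's built-in floats, which are 64-bit, so the thresholds are much further out than for the 32-bit tensors in the lesson, but the behaviour is the same in kind.

```python
import math

big = 1e16
gap_near_one = math.ulp(1.0)   # distance from 1.0 to the next representable float
gap_near_big = math.ulp(big)   # distance from 1e16 to the next representable float
lost = (big + 1.0) == big      # adding 1 is too small a change to be stored

overflow = 1e308 * 10          # too big to represent: becomes inf
not_a_number = overflow - overflow   # inf minus inf is nan
underflow = 1e-308 * 1e-308    # too small to represent: becomes 0.0

print(gap_near_one, gap_near_big, lost)
print(overflow, not_a_number, underflow)
```

The gap between neighbouring floats grows with the magnitude of the number, which is why values far from zero can no longer be told apart.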
Basically, in the way a floating point is stored, the further you get away from
zero, the less accurate the numbers are. So, yeah, this is a problem. So we have
to scale weight matrices exactly right. And we have to scale them in such
a way that the standard deviation at every point stays at one
and the mean stays at zero. So there's actually a paper that describes how to
do this for multiplying lots of matrices together. And this paper basically just went
through it. It's actually pretty simple math. Let's see, what did they do? All right. Yeah, so they looked at gradients
and the propagation of gradients, and they came up with a particular weight
initialization of using a uniform with one over root n as
the bounds of that uniform.
with various different activation functions. And as a result, we now have this
way of initializing neural networks, which is called either Glorot
initialization or Xavier initialization. And, yeah, this is the amount that we scale
our initialization, our random numbers by, where nin is the number of inputs.
So in our case, we have 100 inputs. And so root 100 is 10. So 1/10 is 0.1.
And so if we actually run that, if we start with our random numbers, and then we multiply
by random numbers times 0.1, which is, this is the Glorot initialization, you
can see we do end up with numbers that are actually reasonable.
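The three experiments described above (no scaling, scaling by 0.01, scaling by 1 over root n) can be approximated in plain Python. This is a sketch under assumptions: it pushes a single 100-element vector rather than a batch through 50 random layers, uses 64-bit floats, and the function name `run_layers` is made up for illustration. With 64-bit floats the unscaled case reaches roughly 1e50 instead of overflowing, whereas 32-bit floats top out near 3.4e38, which is where the NaNs in the lesson come from.

```python
import math
import random

def run_layers(scale, n=100, depth=50, seed=0):
    """Multiply a random vector by `depth` random n-by-n matrices scaled by `scale`.

    Returns the root-mean-square of the final vector.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(depth):
        # Each output is a fresh random row dotted with x: y_i = sum_k w_ik * x_k.
        x = [sum(rng.gauss(0, 1) * scale * xk for xk in x) for _ in range(n)]
    return math.sqrt(sum(v * v for v in x) / n)

rms_unscaled = run_layers(1.0)                # grows by about 10x per layer
rms_tiny = run_layers(0.01)                   # shrinks by about 10x per layer
rms_glorot = run_layers(1 / math.sqrt(100))   # stays at a reasonable size

print(rms_unscaled, rms_tiny, rms_glorot)
```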
So that's pretty cool. So just, I mean, just some background in case
you're not familiar with some of these details. What exactly do we mean by variance?
So if we take a tensor, let's call it t and just put 1, 2, 4, 18 in it.
The mean of that is simply the sum divided by the count.
So that's 6.25. Now, basically, we want to come up with a measure of
how far away each data point is from the mean. That tells you how much variation there is.
So if you've got kind of like
a whole bunch of data points, and they're all pretty similar to each other,
then the mean would be about here, right? And the average distance away of each
point from the mean is not very far. Whereas if you had dots which were
very widely spread all over the place, then you might end up with the
same mean, but the distance from each point to the mean is now quite a long way.
So that's what we want. We want some measure of kind of how far away the
points are on average from the mean. So here we could do that. We can take
our tensor, we can subtract the mean. And then take the mean of that.
Ah, that doesn't work. Because we've got some numbers
that are bigger than the mean and some that are smaller than the mean.
And so if you average them all out, then by definition you actually get zero. So instead you could either square those
differences and that will give you something. And you could also take the square root of that
if you wanted to, to get it back to the same kind of scale.
Or you could take the absolute differences. OK. So actually I'm doing this in
two steps here. So for the first one, here it is on a different scale.
And then add square root, get it on the same scale.
So 6.87 and 5.88 are quite similar, right? But they're mathematically not quite the
same. But they're both similar ideas. So this is the mean absolute difference. And this is called the standard deviation.
And this is called the variance. So the reason that the standard deviation
is bigger than the mean absolute difference is because in our original data,
one of the numbers is much bigger than the others. And so when we square it, that number
ends up having an outsized influence. And so that's a bit of an issue in general with
standard deviation and variance is that outliers like this have an outsized influence.
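The numbers quoted above can be reproduced in a few lines of plain Python. This sketch uses lists rather than tensors and divides by the number of points, matching the "mean of the squared differences" described in the transcript (note that PyTorch's own `std()` applies a correction by default, so it gives a slightly different value).

```python
import math

t = [1.0, 2.0, 4.0, 18.0]
n = len(t)

mean = sum(t) / n                                  # 6.25
diffs = [x - mean for x in t]
mean_diff = sum(diffs) / n                         # positives and negatives cancel
variance = sum(d * d for d in diffs) / n           # mean of squared differences
std = math.sqrt(variance)                          # back on the original scale
mean_abs_dev = sum(abs(d) for d in diffs) / n      # mean absolute difference

print(mean, mean_diff, variance, std, mean_abs_dev)
```

The standard deviation comes out near 6.87 and the mean absolute difference near 5.88; the gap between them is driven by the single large value, 18, whose squared difference dominates the variance.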
So you've got to be a bit careful. OK. So here's the formula for the standard
deviation, which is normally written as sigma. So it's just going to be each of our
data points minus the mean, squared, plus the next data point minus
the mean, squared, and so forth for all the data points, and then divide that by
the number of data points and take the square root. And OK. So one thing I'd point out here
is that the mean absolute deviation isn't used as much as the standard deviation,
because mathematicians find it difficult to use. But we're not mathematicians.
We have computers so we can use it. OK, now, variance, we can
calculate like this, as we said, the mean of the square of the differences.
And if you feel like doing some math, you discover that actually this is
exactly the same, as you can see. And this is actually nice because this is
showing that the mean of the squared data points minus the square of the mean of the
data points is also the variance. And this is very helpful because it means
you actually never have to calculate this. You can just calculate the mean. So
with just the data points on their own, you can actually calculate the variance.
This is a really nice shortcut. This is how we normally calculate variance.
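A quick check that the two forms agree, using the same four data points. In symbols, the definition is E[(X - E[X])^2] and the shortcut is E[X^2] - (E[X])^2.

```python
t = [1.0, 2.0, 4.0, 18.0]
n = len(t)
mean = sum(t) / n

# Definition: mean of the squared differences from the mean.
var_definition = sum((x - mean) ** 2 for x in t) / n

# Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in t) / n - mean ** 2

print(var_definition, var_shortcut)
```

The shortcut only needs running totals of the values and of their squares, so it can be computed in a single pass over the data.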
And so there is the LaTeX version, which, of course, I didn't write myself. I stole
it from the Wikipedia LaTeX because I'm lazy. Now, there's a very, very similar idea, which is covariance, and it has already come
up a little bit in the first lesson or two, and particularly in the extra math
lesson that Wasim and Tanishq did. So a covariance tells you how much two
things vary, not just on their own, but together. And there's a definition here in math, but I like code, so we'll see the
code. So here's our tensor again. Now we're going to want to have two things. So
let's create something called u, which is just two times our tensor with a bit of randomness.
So here it is. Now you can see that u and t are very closely correlated here.
But they're not perfectly correlated. So the covariance tells us how
they vary together and separately. So you can see this is
exactly the same thing we had before: each data point minus its mean. But
now we've got two different tensors. So we're also going to do the
other data points minus their mean. And we multiply them together.
So it's actually the same thing as standard deviation, where it's kind
of like the covariance with itself in a sense. Right. And so that's a product we can calculate. And then what we then do is
we take the mean of that. And that gives us the covariance
between those two tensors. And you can see that's quite a high
number. And if we compare it to two things that aren't very related at all,
so let's create a totally random tensor, v. So this is not related to t. And
we do exactly the same thing. So take the difference of t to its
mean and v to its mean, multiply them, and take the mean of that. That's a very small number.
And so you can see covariance is basically telling us how related are these two tensors.
So covariance and variance are basically the same thing. You can
think of variance as being covariance with itself. And you can change this mathematical version,
which is the one we just created in code, to this version, just like we have for variance.
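Here is a small sketch of both forms of covariance in plain Python. The values in `u` and `v` are hand-picked for illustration (roughly 2 times `t` with some noise, and an unrelated set of numbers) and are not the values from the lesson's notebook.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Mean of the products of each pair's differences from their own means."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def cov_shortcut(xs, ys):
    """Mean of the products minus the product of the means."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

t = [1.0, 2.0, 4.0, 18.0]
u = [2.1, 3.8, 8.3, 35.5]    # assumed values: about 2 * t, with a little noise
v = [0.5, -0.8, 0.9, 0.1]    # assumed values: unrelated to t

cov_tu = cov(t, u)           # large: t and u move together
cov_tv = cov(t, v)           # close to zero: t and v do not
print(cov_tu, cov_tv, cov_shortcut(t, u), cov(t, t))
```

Passing the same list twice, `cov(t, t)`, gives the variance of `t`, which is the sense in which variance is covariance with itself.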
There's an easier to calculate version. Which, as you can see, gives
exactly the same answer. OK, so if you haven't done stuff with covariance
much before, you should experiment a bit with it by creating a few different plots
and experimenting with those. And finally, the Pearson correlation
coefficient, which is normally called r or rho, is just the covariance divided
by the product of the standard deviations. So you've probably
seen that number many times. It's just a scaled version of the same thing. OK, so with that in mind, here is how
Xavier init or Glorot init is derived. So when you do a matrix multiplication, for each of the yi's, we're adding
together all of these products. So we've got ai,0 times x0
plus ai,1 times x1, etc. And we can write that in sigma notation. So
we're adding up together all of the aik's with all of the xk's.
This is the stuff that we did in our first lesson of Part 2.
And so here it is in pure Python code. And here it is in NumPy code.
Now at the very beginning, our vector has a mean of about 0 and a standard deviation
of about 1 because that's what we asked for.
That's a standard deviation of 1, mean of 0. That's what randn is.
OK, so let's create some random numbers and we can confirm, yeah, they have a mean
of about 0 and a standard deviation of about 1. So if we chose weights for a, that have a mean of 0, we can compute
the standard deviation quite easily. So let's do that.
So 100 times, let's try creating our x and let's try
creating something to multiply it by. And we'll do the matrix multiplication.
And we're going to get the mean and the mean of the squares. And the mean of the squares is very close to the dimension of our matrix, 100. So I won't go into it, I mean, you can look at
it if you like, but basically as long as the elements in a and x are independent, which
obviously they are because they're random, then we're going to end up with a mean of 0 and
a standard deviation of 1 for these products. And so we can try it if we create a random number, normally distributed random number and
then a second random number, multiply them together and then do it a bunch of times.
And you can see here we've got our 0, 1. So that's the reason why we need
this math dot square root 100. We don't normally worry about the
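Both simulations can be sketched in plain Python. The sizes here (64 outputs, 100 trials, 10,000 products) are arbitrary choices for speed, not the ones used in the lesson.

```python
import random

rng = random.Random(0)
n_in, n_out, trials = 100, 64, 100

mean_sum, sqr_sum = 0.0, 0.0
for _ in range(trials):
    x = [rng.gauss(0, 1) for _ in range(n_in)]
    # y = a @ x with every entry of a drawn from a standard normal.
    y = [sum(rng.gauss(0, 1) * xk for xk in x) for _ in range(n_out)]
    mean_sum += sum(y) / n_out
    sqr_sum += sum(v * v for v in y) / n_out

mean_y = mean_sum / trials       # close to 0
mean_sq_y = sqr_sum / trials     # close to n_in, i.e. about 100

# A single product of two independent standard normals: mean 0, mean square 1.
prods = [rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(10_000)]
prod_mean = sum(prods) / len(prods)
prod_mean_sq = sum(p * p for p in prods) / len(prods)

print(mean_y, mean_sq_y, prod_mean, prod_mean_sq)
```

Each output adds up 100 such products, so its mean square is about 100 and its standard deviation about 10; dividing the weights by the square root of 100 brings that back to 1.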
mathematical reasons why things are exactly as they are. But yeah, I thought I would just dive into this
one because sometimes it's fun to go through it. And so you can check out the paper
if you want to look at that in more detail or experiment with these,
with these little simulations. Now the problem is that that doesn't work.
It doesn't work for us because we use rectified linear units, which is not
something that Xavier Glorot looked at. Let's take a look. Let's create a
couple of matrices. This is 200 by 100. So we've got a matrix
and a vector. This is 200. And then let's create a couple of weight matrices,
two weight matrices and two bias vectors. OK, so we've got some input data x's and y's and
we've got some weight matrices and bias vectors. So let's create a linear layer function,
which we've done lots of times before. And let's start going through a little
neural net. Imagine this is the forward pass of our neural net.
So we're going to apply our linear layer to the x’s with our first set of weights
and our first set of biases and see what the mean and standard deviation is.
OK, it's about 0 and about 1. So that's good news. And the
reason why, is because we have 100 inputs and we divided it by square root 100,
just like Glorot told us to. And our second one has 50 inputs and we divide by square root of 50.
And so this all ought to work right. And so far it is. But now we're going to
mess everything up by doing ReLU. So after we do a ReLU, look, we don't
have a 0 mean or 1 standard deviation anymore. So if we go through that and create a
deep neural network with Glorot initialization, but with a ReLU: oh, dear.
It's disappeared. It's all gone to 0. And you can see why, right? After a matrix multiply and
a ReLU, our means and variances are going down. And of course, they're going
down because a ReLU squishes it. So I'm not going to worry about the math of why, but a very important paper indeed
called “Delving Deep into Rectifiers: Surpassing Human-Level Performance
on ImageNet Classification” by Kaiming He et al, came up with a new init,
which is just like Glorot initialization. But remember, the Glorot
initialization was one over root n. This one is the root of 2 over n. And again, n
is the number of inputs. So let's try it. So we've got 100 inputs. So we have
to multiply it by the root of 2 over 100. And there we go. You can see we are,
in fact, getting some non-zero numbers. That's very encouraging, even after going
through 50 layers of depth. So that's good news. So this is
either called Kaiming initialization or called He initialization.
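The difference between the two scalings once a ReLU is added can be sketched in plain Python. As before this is an illustration under assumptions (a single 100-element vector, 50 layers, no biases, a made-up function name), not the lesson's code.

```python
import math
import random

def forward_relu(scale, n=100, depth=50, seed=0):
    """Push a random vector through `depth` layers of (random matmul, then ReLU).

    Weights are standard normal values multiplied by `scale`.
    Returns the root-mean-square of the final activations.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(depth):
        x = [max(0.0, sum(rng.gauss(0, 1) * scale * xk for xk in x))
             for _ in range(n)]
    return math.sqrt(sum(v * v for v in x) / n)

rms_glorot = forward_relu(math.sqrt(1 / 100))    # 1 / sqrt(n): shrinks every layer
rms_kaiming = forward_relu(math.sqrt(2 / 100))   # sqrt(2 / n): roughly holds its size

print(rms_glorot, rms_kaiming)
```

The ReLU zeroes about half of each layer's outputs, which halves the mean square; the extra factor of 2 inside the square root compensates for that.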
And notice it looks like it's spelled “he”, but it's a Chinese surname. So
it's actually pronounced “huh”. OK, maybe that's why a lot of people
increasingly call it Kaiming initialization: they don't have to say his surname, which is a
little bit harder to pronounce. All right.
that we know what initialization function to use for a deep neural network
with a ReLU activation function? The trick is to use a method called
apply, which all nn.Modules have. So if we grab our model, we can apply any function
we like. For example, let's apply the function. Print the name of the type. So here you can
see it's going through and it's printing out all of the modules that are inside
our model. And notice that our model has modules inside modules.
It's a Conv in a Sequential in a Sequential, but model.apply goes through
all of them regardless of their depth. So we can
apply an init function, which simply does normally distributed
random numbers times square root of 2 over the number of inputs.
That's such an easy thing, it's not even worth writing, and it's already
been written, but that's all it does. It just does that one thing. It's called
init.kaiming_normal_. As we've seen before, if there's an underscore at the
end of a PyTorch method name, that means that it changes something in place.
So init.kaiming_normal_ will modify this weight matrix so that it has been initialized
with normally distributed random numbers based on root of 2 divided by the number of inputs.
Now you can't do that to a Sequential layer or a ReLU layer or a Flatten layer.
So we should check that the module is a conv or linear layer. And then we can
just say model.apply the function. And so if we do that. And now,
I can use our learning rate finder callbacks that we created earlier.
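To make the `apply` pattern concrete without depending on PyTorch, here is a toy sketch. The `Module`, `Sequential`, `ReLU`, `Flatten` and `Linear` classes below are hypothetical stand-ins that only imitate the recursion; in real code the init function would do the type check on the actual PyTorch layer classes and call `init.kaiming_normal_` on the weight.

```python
import math
import random

class Module:
    """Toy stand-in for a PyTorch module: just enough to show how apply recurses."""
    def __init__(self, *children):
        self.children = list(children)

    def apply(self, fn):
        for child in self.children:   # visit nested modules first, at any depth
            child.apply(fn)
        fn(self)                      # then the module itself
        return self                   # returning self lets calls be chained

class Sequential(Module): pass
class ReLU(Module): pass
class Flatten(Module): pass

class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.n_in = n_in
        self.weight = [[0.0] * n_in for _ in range(n_out)]

rng = random.Random(0)

def init_weights(m):
    # Only layers that own a weight matrix are touched; everything else is skipped.
    if isinstance(m, Linear):
        std = math.sqrt(2 / m.n_in)   # Kaiming scaling: root of 2 over the inputs
        for row in m.weight:
            for k in range(m.n_in):
                row[k] = rng.gauss(0, std)

visited = []
model = Sequential(Sequential(Linear(100, 50), ReLU()), Linear(50, 10), Flatten())
returned = model.apply(lambda m: visited.append(type(m).__name__))
model.apply(init_weights)
print(visited)
```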
And this time I don't have to worry about… actually we can create our own ones
because we don't need to use even the weird gamma thing anymore.
So let's go back and copy that. Let's get rid of this gamma equals one point
one. It shouldn't be necessary anymore. And we can probably make that 4 now. Oh,
I need to have it recreate the model. There we go. Okay, so that's
looking much more sensible. So at least we've got to a point where the
learning rate finder works. That's a good sign. So now when we create our Learner, we're
going to use our MomentumLearner still, after we get the model, we will apply
init_weights and apply also returns the model. So we can actually, this is actually going to
return the model with the initialization applied. While I wait, I will answer questions. Okay, so Fabrizio asks, why do we double the
number of filters in successive convolutions? So what's happening is in
each stride-2 convolution. These are all stride-2 convolutions. So this is
changing the grid size from 28 by 28 to 14 by 14. So it's reducing the size of the
grid by a factor of 4 in total. So basically, so as we go from 1 to 8
from this one to this one, same deal, we're going from 14 by 14 to 7 by 7.
So it reduced the grid size by 4, we want it to learn something.
And if you use, if you give it exactly the same kind of number of units or
activations, there's not really, it's not really forcing it to learn things as much.
So ideally, as we decrease the grid size, we want to have enough channels that
you end up with a few less activations than before it, but not too many less. So if
we double the number of channels, then that means we've decreased the grid size by a factor
of 4, increased the channel count by a factor of 2. So overall, the number of activations
has decreased by a factor of 2. And so that's what we want. We want to
be kind of forcing it to find ways of compressing the information intelligently.
As it goes down. Also, we kind of want to be having a roughly similar amount of compute,
roughly similar amount through the neural net. So as we decrease the grid size, we can add
more channels because decreasing the grid size decreases the amount of compute.
Increasing the channels then gives it more things to compute. So we're kind of
getting this nice compromise between, yeah, between the, kind of, amount of compute
that it's doing, but also giving it some kind of compression work to do.
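The arithmetic behind this answer can be tabulated with a few lines of Python. The kernel size of 3 and padding of 1 are assumptions; they are consistent with the 14, 7, 4, 2, 1 grid sizes quoted earlier for a 28 by 28 input.

```python
def conv_out(size, ks=3, stride=2, pad=1):
    """Output width of a strided convolution (assumed ks=3, pad=1)."""
    return (size + 2 * pad - ks) // stride + 1

channels = [8, 16, 32, 64, 10]   # output channels of each conv in the simple model
size = 28
rows = []
for ch in channels:
    size = conv_out(size)
    rows.append((ch, size, ch * size * size))   # (channels, grid width, activations)

for ch, grid, acts in rows:
    print(f"{ch:3d} channels, {grid:2d}x{grid:<2d} grid, {acts:5d} activations")
```

Each stride-2 layer cuts the grid area by roughly 4 while the channels double, so the activation count falls by about half per layer (1568, 784, 512, 256) until the final 10 outputs.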
That's the, kind of, the basic idea. Well, still not able to train. Well,
OK, if we leave it for a while. OK, it's not great, but it is
actually starting to train. That's encouraging. And we got up
to a 70% accuracy so we can see. Not surprisingly, we're getting
these spikes and spikes. And so in the statistics, you can see that.
Well, it didn't quite work. We don't have a mean of 0. We don't have a standard
deviation of 1, even at the start. Why is that? Well, it's because
we forgot something critical. If you go back to our original point, even when
we had our… let's go to the Kaiming version. Even when we had the correctly normalized
matrix that we're multiplying by. Well, you also have to have a correctly
normalized input matrix. And we never did anything to normalize our inputs.
So our inputs, actually, if we just get the first x mini batch
and get its mean and standard deviation, it has a mean of 0.28 and a standard deviation of 0.35.
So we actually didn't even start with a 0, 1 input.
We started with a mean above 0 and a standard deviation beneath 1.
So it was very hard for it. So, using the init helped, at least we're able to train a little bit.
But it's not quite what we want. We actually need to modify our inputs so they have
a mean of 0 and a standard deviation of 1.
So we could create a callback to do that. So a callback, let's create a batch transform
callback. And so we're going to pass in a function that's going to transform every batch.
And so just in the before_batch, we will set the batch to be equal to the
function applied to the batch. Now, note, by the way, we
don't need self.learn.batch on the right-hand side here, because batch is
one of the four things that we proxy down to the Learner automatically.
But we do need it on the left-hand side, because the proxying is only in the getattr, remember.
So be very careful. I might just leave it the same on both sides, just so
that people don't get confused. OK, so let's create a function
_norm that subtracts the mean and divides by the standard deviation.
And so, remember, a batch has an x and a y. So it's the x part where we subtract the
mean and divide by the standard deviation. And so the new batch will be that as the x and
the y will be exactly the same as it was before. So let's create an instance of the normalization,
of the BatchTransformCB, which is going to do the normalization function.
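A minimal sketch of this idea, assuming the 0.28 and 0.35 statistics quoted above. The `BatchTransform` class and the `SimpleNamespace` learner are hypothetical stand-ins so the example runs on its own; they are not the miniai classes, and the inputs are synthetic rather than real images.

```python
import torch
from types import SimpleNamespace

# Statistics quoted in the lesson for the first training mini-batch.
XMEAN, XSTD = 0.28, 0.35

def norm_batch(batch):
    """Normalize the inputs only; the targets pass through unchanged."""
    x, y = batch
    return (x - XMEAN) / XSTD, y

class BatchTransform:
    """Hypothetical stand-in for the callback: applies `tfm` before each batch."""
    def __init__(self, tfm): self.tfm = tfm
    def before_batch(self, learn): learn.batch = self.tfm(learn.batch)

# Synthetic inputs with roughly the quoted statistics (not real image data).
torch.manual_seed(0)
x = torch.randn(64, 1, 28, 28) * XSTD + XMEAN
y = torch.zeros(64, dtype=torch.long)

learn = SimpleNamespace(batch=(x, y))   # minimal fake learner for the demo
BatchTransform(norm_batch).before_batch(learn)
xb, yb = learn.batch
print(xb.mean().item(), xb.std().item())   # close to 0 and 1
```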
We'll call it _norm, so we can pass that as an
additional callback to our Learner. And now. That's looking a lot
better, as you can see here. All we had to do was make sure that our input matrix had a mean of 0 and a standard
deviation of 1, and that all of our weight matrices were correctly normalized as well.
And we didn't have to use any tricks at all. It was able to train and got
it to an accuracy of 85 percent. And so if we look at the color_dim and
stats, look at this, it looks beautiful. Now this is layer one. This is layer
two, three, four. It's still not perfect. I mean, there's some randomness, right? And
we've got, what is it, like seven or eight layers? So that randomness does build up as you
go through the layers, and by the last one it still gets a bit ugly, and you can kind
of see it bouncing around here as a result. And you can see that also in the
means and standard deviations. There's some other reasons this is
happening. We'll see in a moment. But this is the first time we've really got our
even somewhat deep convolutional model to train. And so this is a really exciting step. You
know, we have, from scratch, in a sequence of 11 notebooks managed to create a real convolutional
neural network that is training properly. So I think that's pretty amazing.
Now, we don't have to use a callback for this. The other thing we could do to modify the input
data, of course, is to use the with_transform method from the Hugging Face datasets library.
So we could modify our transformi to subtract the mean and
divide by the standard deviation. And then recreate our DataLoaders. And if
we now get a batch out of that and check it, it's now got, yep, a mean of 0
and the standard deviation of 1. So we could also do it this way. So generally
speaking, for stuff that needs to, kind of, dynamically modify the batch, you can often
do it either in your data processing code, or you can do it in a callback.
And neither is right or wrong. They both work well. And you can see
whichever one works best for you. Okay, now I'm going to show you something amazing. Okay, so it's great this is training
well, but when you look at our stats, despite what we did with the normalized input and
the normalized, the normalized weight matrices, we don't have a mean of 0. And we don't have a
standard deviation of 1, even from the start. So why is that? Well, the problem is that we were putting our data through a ReLU.
And our activation stats are looking at the output of those ReLU blocks, because
that's kind of the end of each, you know, that's the activation of each combination of
matrix multiplication and activation function. And since a ReLU removes all of the negative
numbers, it's impossible for the output of a ReLU to have a mean of 0, unless,
literally, every single number is 0, because it has got no negatives.
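A quick numerical check of that claim, as a sketch on synthetic data: a perfectly normalized input goes into a ReLU, and the output mean ends up well above 0. The value 1/sqrt(2*pi) is the expected ReLU output for a standard normal input.

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(100_000)   # mean 0, standard deviation 1
out = torch.relu(x)        # negatives become 0, positives are kept

print(x.mean().item(), x.std().item())
print(out.mean().item(), out.std().item())

# For a standard normal input the expected ReLU output is 1/sqrt(2*pi).
expected_mean = 1 / math.sqrt(2 * math.pi)
print(expected_mean)
```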
So ReLU, seems to me, to be fundamentally incompatible with the idea of a correctly
calibrated bunch of layers in a neural net. So I came up with this idea of saying, well,
why don't we take our normal ReLU and have the ability to subtract something from it.
And so we just take the result of our ReLU and subtract. I mean, I
can write this in a more obvious way; it's exactly the same as just using -=.
Why don't I just do that? We'll subtract something from our ReLU.
That will allow us to pull the whole thing down so that the bottom of our ReLU is
underneath the x-axis and it has negatives. And that would allow us to have a mean of
0. And while we're there, let's also do something that's existed for a while.
I didn't come up with this idea, which is just to do a leaky ReLU,
which is where we say, let's not have the negatives be totally flat, just truncated. Instead, let's have those
numbers scaled down by some constant factor. Let me show you what that looks like. So those
two together, I'm going to call general ReLU, which is where we do this thing called leaky
ReLU, which is where we make it so it's not flat under 0, but instead just less sloped.
And we also subtract something from it. So for example, I've created a little
function here for plotting a function. So let's plot the general ReLU function
with a leakiness of 0.1. So that will mean there's a 0.1 slope underneath
the, under 0, and we'll subtract 0.4. And so you can see above 0, it's just a normal y
equals x line, but it's been pushed down by 0.4. And then when it's less than zero, it's not flat
anymore, but it's just got a slope of one tenth. And so this is now something which if you find
the right amount to subtract for each amount of leakiness, you can make a mean of 0.
And I actually found that this particular combination gives us a mean of 0 or thereabouts.
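Here is a minimal sketch of such an activation, assuming it is built on PyTorch's `F.leaky_relu`. It only covers the leak and the subtraction described so far, and the notebook's own class may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralRelu(nn.Module):
    """Leaky ReLU with slope `leak` below zero, shifted down by `sub`."""
    def __init__(self, leak=0.0, sub=0.0):
        super().__init__()
        self.leak, self.sub = leak, sub

    def forward(self, x):
        return F.leaky_relu(x, self.leak) - self.sub

act = GeneralRelu(leak=0.1, sub=0.4)
print(act(torch.tensor([-1.0, 0.0, 2.0])))   # slope 0.1 below zero, then shifted by 0.4

torch.manual_seed(0)
acts = act(torch.randn(100_000))
print(acts.mean().item())                    # near 0 for unit-normal inputs
```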
So let's now create a new convolution function where we can
actually change what activation function is used, that gives us the ability to change the
activation functions in our neural nets. Let's change get_model to allow
it to take an activation function which is passed into the layers.
And while we're there, let's also make it easy to change the number of filters.
So we're going to pass in a list of the number of filters in each layer and we will default it
to the numbers in each layer that we've discussed. And so we're just going to go through, in a list
comprehension, creating a convolution that goes from this number
of filters to the next number of filters. And we'll pop that all into a sequential
along with a flatten at the end. And while we're there, we also then need to
be careful about init_weights because this is something that people tend to forget.
Which is that Kaiming initialization, by default, only applies
to layers that have a ReLU activation function. We don't have ReLU anymore.
We actually have leaky ReLU. The fact that we're subtracting a
bit from it doesn't change things, but the fact that it's leaky does.
Now luckily, a lot of people don't know this, but actually PyTorch's Kaiming
normal has an adjustment for leaky ReLUs. Weirdly enough, they just call it a.
So if you pass your leaky ReLU's leak factor into the Kaiming normal
initialization as a, then you'll get the correct
initialization for a leaky ReLU. So we need to change init_weights
now to pass in the leakiness. All right. So let's put all this together.
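The pieces described so far could be sketched like this. The default filter list, the 10 outputs and the use of `nn.LeakyReLU` as a stand-in for the general ReLU are assumptions made to keep the example self-contained; `nn.init.kaiming_normal_` and its `a` argument are the real PyTorch API.

```python
from functools import partial
import torch
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU):
    """A stride-2 convolution, optionally followed by an activation."""
    layer = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    return nn.Sequential(layer, act()) if act else layer

def get_model(act=nn.ReLU, nfs=(1, 8, 16, 32, 64), n_out=10):
    """Stack convs from one filter count to the next, then flatten."""
    layers = [conv(nfs[i], nfs[i + 1], act=act) for i in range(len(nfs) - 1)]
    return nn.Sequential(*layers, conv(nfs[-1], n_out, act=None), nn.Flatten())

def init_weights(m, leaky=0.0):
    """Kaiming-normal init; `a` tells PyTorch the negative slope."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, a=leaky)

torch.manual_seed(0)
act = partial(nn.LeakyReLU, 0.1)     # stand-in for the lesson's general ReLU
model = get_model(act=act)
model.apply(partial(init_weights, leaky=0.1))

print(model(torch.randn(2, 1, 28, 28)).shape)
```

Using `partial` means the model-building code only ever sees a zero-argument activation factory and a one-argument init function, so the leak value is set in one place for each.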
So, our general ReLU activation function is GeneralRelu with a leak of
0.1 and a subtract of 0.4. So we use partial to create a function
that has those built-in parameters. For ActivationStats, we need to update it
now to look for GeneralRelu, not nn.ReLU. Okay. And then our init_weights function, we're
going to have a partial with leaky equals 0.1. So we'll call that our init weights. Great. So now we'll get our model, using that new
activation function and that new init weights. And we'll fit that. Oh, that's encouraging. Accuracy of 845, which is
about as high as we got to at the end previously. Wow. Look at that. So we're
up to an accuracy of 87 percent.
And let's take a look. Yeah, I mean, look, we still got a little bit of
a spike, but it's almost smooth and flat. And let's have a look here. Look at that. A mean,
it's standing at about 0. Standard deviation. Standard deviation is still a bit low, but
it's coming up around 1. It's not too bad. Generally around 0.8. So it's all
looking pretty encouraging, I think. And oh, yeah. Look, the percentage of
dead units in each layer is very small. So finally, we've really trained, you know, got
some very nice looking training graphs here. And yeah, it's interesting that
we had to literally invent our own activation function to make this work.
And I think that gives you a sense of how few people actually care
about this, which is crazy, because as you can see, it's in some
ways, it's the only thing that matters. And it's not at all mathematically
difficult to make it all work. And it's not at all computationally
difficult to see whether it's working. But other frameworks don't even
let you plot these kinds of things. So nobody even knows that they've
completely messed up their initialization. So, yeah, now, you know.
Now, some very nice news. So the first thing to be aware of, which is tricky, is
a lot of models use more complicated activation functions nowadays, rather than ReLU or
leaky ReLU or even this general version. You need to initialize your neural
network correctly, and most people don't. And sometimes nobody's even figured
out or bothered to try to figure out what the correct initialization to use is.
But there's actually a very cool trick which almost nobody knows about, which is a
paper called “All You Need Is a Good Init”, which Dmytro Mishkin wrote a few years ago.
And what Dmytro showed is that there's actually a completely general way of initializing
any neural network correctly, regardless of what activation functions are in it.
And it uses a very, very simple idea. And the idea is create your model,
initialize it however you like, and then go through and put a
single batch of data through. And look at the first layer, see what the mean
and standard deviation through the first layer is. And if the
standard deviation is too big, divide the weight matrix by a bit.
If the mean is too high, subtract a bit off the bias.
And do that repeatedly for the first layer until you get the correct mean and standard deviation.
And then go to the second layer, do the same thing.
Third layer, do the same thing, and so forth. So we can do that using hooks, right?
So we could create a little, so this is called Layer-wise Sequential Unit Variance, LSUV.
We can create a little LSUV stats that will grab the mean of the activations of a layer and the
standard deviation of the activations of a layer. And we will create a hook with that function.
And what we're going to do is, with that hook in place to find out the
mean and standard deviation of the layer, we will go through and run the model, get the
standard deviation and mean, and see if the standard deviation is not 1 or the mean is not 0.
And we will subtract the mean from the bias and we will divide the weight
matrix by the standard deviation. And we will keep doing that until we get
a standard deviation of 1 and a mean of 0. And so by making that a hook, what we will do is we will grab all
the ReLU's and all the convs, right? And so just to show you what happens
there, once I've got all the ReLU's and all the convs, I can use zip.
So zip in Python takes a bunch of lists and pairs up their items: the first items together,
the second items, the third items and so forth. So if I go through the zip of ReLU's and
convs and just print them out, you can see it prints out the first ReLU and the first conv,
the second ReLU and the second conv, the third ReLU and the
third conv and so forth. We use zip all the time in Python.
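A tiny illustration of zip, using strings as stand-ins for the actual modules:

```python
relus = ["relu1", "relu2", "relu3"]
convs = ["conv1", "conv2", "conv3"]

# zip yields one tuple per position: (first, first), (second, second), ...
pairs = list(zip(relus, convs))
for relu, conv in pairs:
    print(relu, conv)
```

In Python 3, zip returns a lazy iterator and stops at the shortest input, which is why the result is wrapped in `list` here.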
So it's a really important thing to be aware of. So we could go through the ReLU's and the convs
and call lsuv_init, passing in those module pairs. Sorry, passing in, yeah, passing
in the ReLU and the conv. And then for each one, oh, and
we're going to do that on the batch. And of course, we need to put the batch
on the correct device for our model. And so now that I've done that, it ran almost instantly.
It's now made all the biases and weights correct, so that they give us 0, 1.
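The loop described above could be sketched as follows, using a plain PyTorch forward hook. This is an illustrative version rather than the notebook's code: `ShiftedLeakyReLU` is a hypothetical stand-in for the general ReLU, the two-layer model and random batch are made up for the demo, and `max_iters` is an added safety limit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftedLeakyReLU(nn.Module):
    """Hypothetical stand-in for the lesson's general ReLU (leak 0.1, sub 0.4)."""
    def forward(self, x): return F.leaky_relu(x, 0.1) - 0.4

def lsuv_init(model, act, layer, xb, tol=1e-3, max_iters=100):
    """Adjust `layer` until the output of `act` has mean ~0 and std ~1 on `xb`."""
    stats = {}
    def record(module, inp, outp):
        stats["mean"], stats["std"] = outp.mean(), outp.std()
    handle = act.register_forward_hook(record)
    try:
        with torch.no_grad():
            for _ in range(max_iters):
                model(xb)                       # the hook fills in `stats`
                if abs(stats["mean"]) < tol and abs(stats["std"] - 1) < tol: break
                layer.bias -= stats["mean"]     # mean too high: shift down
                layer.weight /= stats["std"]    # spread too wide: shrink
    finally:
        handle.remove()
    return stats["mean"].item(), stats["std"].item()

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, stride=2, padding=1), ShiftedLeakyReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), ShiftedLeakyReLU(),
    nn.Flatten())
xb = torch.randn(64, 1, 28, 28)     # synthetic batch for the demo

acts = [m for m in model if isinstance(m, ShiftedLeakyReLU)]
convs = [m for m in model if isinstance(m, nn.Conv2d)]
results = [lsuv_init(model, a, c, xb) for a, c in zip(acts, convs)]
print(results)   # (mean, std) per layer after initialization
```

Layers are processed in order, so each later layer is tuned on the already-corrected output of the layers before it.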
And now if I train it, there it is. So we didn't do any initialization at all
of the model other than just call lsuv_init. And this time we've got an accuracy of 0.86 versus 0.87 previously.
So pretty much the same thing, close enough. And actually, if you want to actually
see that happening, I guess what we could do, it's going to be pretty obvious,
after we run this, we could say print(h.mean, h.std).
Actually, we could do it before and afterwards, right?
So we could say, right, before and after. There we go. Yeah, so the first layer started
at a mean of -0.13 and a standard deviation of 0.46. And it kept doing the divide,
subtract, divide, subtract, divide, subtract until eventually it got
to mean of 0, standard deviation of 1. And then it went to the next layer and it
kept going, going, going until that was 0, 1. And then the third layer
and then the fourth layer. And so at that point, all of the layers had
a mean of 0 and a standard deviation of 1. So I guess one thing with LSUV is that, you know,
it's kind of very mathematically convenient. We don't have to spend any
time thinking about, you know, if we've invented a new activation function
or we're using some activation function where nobody seems to have figured out the correct
initialization for it, we can just use LSUV. It did require a little bit more fiddling
around with hooks and stuff to get it to work. And I haven't even put this into,
like, a callback or anything. So if you yeah, if you decide you
want to try using this in some of your models, it might be a good idea.
And it would actually be good homework to see if you can come up with a callback
that does LSUV initialization for you. That would be pretty cool, wouldn't
it? In before_fit, I guess it would be. You'd have to be a bit careful
because if you ran fit multiple times, it would actually initialize it each time.
That would be one issue with that to think about. OK, so something which is quite
similar to LSUV is batch normalization. So we're going to have a seven minute break and
then we're going to come back and we're going to talk about batch normalization.
See you in seven minutes. OK, hi, let's do this, Batch Normalization.
Batch Normalization was such an important paper. I remember when it came out, I was at Enlitic,
my medical startup, and I… think that's right. And everybody was talking about it.
And in particular, they were talking about this graph that basically
showed what it used to be like until batch norm to train a model on ImageNet.
How many training steps you'd have to do
to get to a certain accuracy. And then they showed what
you could do with batch-norm. So much faster. It was amazing. And we all
thought that can't be true, but it was true. So basically, the key idea of batch-norm
is that with LSUV and input normalization and Kaiming init, we are normalizing the
layers, each layer's inputs before training. But the distribution of each layer's
inputs changes during training. And that's a problem. So you end up having to
decrease your learning rates. And as we've said, you have to be very
careful about parameter initialization. So the fact that the layers' inputs change during
training, they call internal covariate shift, which for some reason, a lot of people tend to
find a confusing statement or confusing name, but it's very clear to me.
And you can fix it by normalizing layer inputs during training.
So you're making the normalization a part of the model architecture, and you perform
the normalization for each mini batch. Now, I'm actually not going to
start with batch normalization. I'm going to start with something that came
out one year later called layer normalization, because layer normalization is simpler.
Let's do the simpler one first. So Layer Normalization came out as a… this group of fellows, the last
of whom I'm sure you've heard of. And it's probably easiest to
explain by showing you the code. So if you're thinking, “Layer Normalization”,
wow, it's a whole paper, a Geoffrey Hinton paper must be complicated.
No, the whole thing is this code. What is layer normalization?
Well, we can create a module. And we're going to pass in. We don't need to pass in anything, actually,
you can totally ignore the parameters for now. In fact, what we're going to do is we're
going to have a single number called mult, for the multiplier and a single number called add.
That's the thing we're going to add. And we're going to start off by multiplying
things by 1 and adding 0. So we're going to start off
by doing nothing at all. Okay, this is the layer.
It has a forward function. And in the forward function, so
remember that, by default, we have NCHW. We have batch by channel by height by width.
We're going to take the mean over the channel, height and width.
So we're just going to find the mean activation for each input in the mini batch.
And when I say input, though, remember that this is going to be, this is a layer, right?
So we can put this layer anywhere we like. So it's the input to that layer. And we'll
do the same thing for finding the variance. Okay. And then we're going to normalize
our data by subtracting the mean and dividing by the square root of the variance,
which of course is the standard deviation. We're going to add a very small number,
by default 1e-5 to the denominator. Just in case the variance is 0 or ridiculously
small, this will keep the number from going giant. Just if we happen to get something
with a very small variance. This idea of an epsilon as being something
we add to a divisor is really, really common. And in general, you should not
assume that the defaults are correct. Very often the defaults are too small
for algorithms that use an epsilon. Okay. So here we are, as you can
see, we are normalizing the batch. I mean, I can call it a batch, but just remember,
it isn't necessarily the first layer, right? So it's wherever, whichever
layer we decide to put this in. So we normalize it.
Now the thing is, maybe, we don't want it to be normalized.
Maybe we want it to have something other than a unit variance and something
other than zero mean. Well, what we do is we then multiply
it back by self.mult and add self.add. Now remember self.mult starts at 1 and self.add starts at 0.
So at first that does nothing at all. So at first this is just normalizing the data.
So that's good. But because these are parameters,
these two numbers are learnable. That means that the SGD algorithm can change them.
So there's a very subtle thing going on here, which is that in fact, this might
not be normalizing the data at all, or normalizing the inputs to
the next layer at all, because self.mult and self.add could be anything.
So I tend to think that when people think about these kind of things like layer
normalization and batch normalization, thinking of this as normalization in some
ways is not the right way to think of it. It's actually doing something I think
to really… well, it's definitely normalizing it for the initial layers.
And we don't really need LSUV anymore if we have this in here, because it's
going to normalize it automatically. So that's handy.
But after a few batches, it's not really normalizing at all.
But what it is doing is, previously, this idea of like, how big are the numbers overall
and how much variation do they have overall, was kind of built into every single number
in the weight matrix and in the bias vector. This way, those two things have
been turned into just two numbers. And I think this makes training
a lot easier for it, basically, to just have just two numbers that it can focus on
to change this overall positioning and variation. So there's something very subtle going
on here, because it's not just doing normalization, at least not after
the first few batches are complete, because it can learn to create any
distribution of outputs it wants. So there's our layer.
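Putting that description into code gives a module along these lines. It is a sketch that follows the walkthrough above (a single `mult` and `add`, statistics over channel, height and width, epsilon of 1e-5); the unused `dummy` argument is an assumption, included so the layer can be constructed the same way as a norm that does need a size.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalize each item over (channel, height, width), then rescale."""
    def __init__(self, dummy=None, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.mult = nn.Parameter(torch.tensor(1.0))  # learnable scale, starts as a no-op
        self.add = nn.Parameter(torch.tensor(0.0))   # learnable shift, starts as a no-op

    def forward(self, x):
        m = x.mean((1, 2, 3), keepdim=True)   # one mean per item in the batch
        v = x.var((1, 2, 3), keepdim=True)    # one variance per item in the batch
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mult + self.add

torch.manual_seed(0)
x = torch.randn(4, 8, 14, 14) * 3 + 2      # NCHW activations, far from 0, 1
out = LayerNorm()(x)
print(out.mean((1, 2, 3)), out.std((1, 2, 3)))
```

Because `mult` and `add` are parameters, the optimizer is free to move the output away from 0, 1 once training starts; only the freshly constructed layer is guaranteed to normalize.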
So we're going to need to change our conv function again.
Previously, we changed it to add activation function to be modifiable.
Now we're going to also change it to allow us to add normalization layers to the end.
So our basic layers, well, we start off by adding our Conv2d, as usual.
And then, if you're doing normalization, we will append the normalization
layer with this many inputs. Now, in fact, LayerNorm doesn't
care how many inputs, so I just ignore it, but you'll see BatchNorm does care.
If you've got an activation function, add it. And so our convolutional layer is
actually a sequential bunch of layers. Now, one thing that's interesting, I think,
is that for bias in the conv, if you're using, well, this isn't quite true, is it?
I was going to say if you're using LayerNorm, you don't need bias, but actually you kind of do. So maybe we should actually change that.
For BatchNorm, we won't need bias, but actually for this one we do.
So let me put this back. bias=True. bias=bias.
OK. So then these initial layers are here.
So they all have bias. And then we've got bias=false
(OOPS, it should be True). OK. So now in our model, we're going to
add layer normalization to every layer except for the last one.
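A sketch of a conv helper with an optional normalization layer, as described. The signature is an assumption, and `nn.GroupNorm` with a single group is used only as a readily available stand-in that also normalizes each item over channel, height and width; the lesson uses its own LayerNorm class.

```python
import torch
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=True):
    """Conv2d, then an optional normalization layer, then an optional activation."""
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2, bias=bias)]
    if norm: layers.append(norm(nf))   # norm receives the number of filters
    if act: layers.append(act())
    return nn.Sequential(*layers)

# Stand-in for a layer norm over (channel, height, width).
layer_norm = lambda nf: nn.GroupNorm(1, nf)

block = conv(1, 8, norm=layer_norm)
last = conv(8, 10, act=None)           # final layer: no norm, no activation
print(block)
print(last(block(torch.randn(2, 1, 28, 28))).shape)
```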
And let's see how we go. Oh, nice, 0.873.
OK, versus 0.860 and 0.872 before. So we've just beaten our best by a little bit.
So that's cool. So the thing about these
normalization layers is, though, that they do cause a lot of challenges in models.
And generally speaking, ever since batch norm appeared, well, there's been this kind
of like big change of view towards it. At first people were like, oh,
my God, batch norm is our savior. And it kind of was, it let us train much deeper
models and get great results and train quickly. But then increasingly people realized
it also added a lot of complexity. These learnable parameters turned
out to create all kinds of complexity. And in particular batch norm, which we'll see
in a minute, created all kinds of complexity. So there has been a tendency in
recent years to be trying to get rid of or at least reduce the
use of these kinds of layers. So, knowing how to actually initialize your models
correctly in the first place is becoming increasingly important as people try to move away
from these normalization layers. So I will say that.
So they're still very helpful, but they're not a silver bullet, as it turns out.
All right. So now let's look at BatchNorm. So BatchNorm is still not huge, but
it's a little bit bigger than LayerNorm. And you'll see that we've now, we've
got the mult and add as before. But it's not just one number to add or
one number to multiply, but actually we've got a whole bunch of them.
And the reason is that we're going to have one for every channel.
And so now when we take the mean and the variance, we're actually taking it over the batch
dimension and the height and width dimensions. So we're ending up with one mean per
channel and one variance per channel. So just like before, once we
get our means and variances, we subtract the means and divide
by the square root of the epsilon-modified variances. And just like before, we then
multiply by mult and add add. But now we're actually multiplying by a vector
of mults and we're adding a vector of adds. And that's why we have to
pass in the number of filters, because we have to know how many ones and how
many zeros we have in our initial mults and adds. So the main difference, in a sense, is that
we have one per channel and that we're also taking the average across all of the things in the batch.
Whereas in LayerNorm, we didn't. Each thing in the batch had its own
separate normalization it was doing. Then there's something else in BatchNorm, which
is a bit tricky, which is that during training, we are not just subtracting the mean and
dividing by the standard deviation; we're also keeping an exponentially weighted moving average of the
means and the variances of the last few batches. That's what this is doing.
So we start out, so we basically create something called
vars and something called means. And initially the variances are
all 1 and the means are all 0. And there's one per channel just
like before or one per filter. This is number of filters.
Same idea, I guess. Filters we tend to actually use inside the model
and channels we tend to use as the first input. So I should probably say filters.
Either works though. So we get our, let's for example,
we get our mean per filter. And then what we do is we
use this thing called lerp. So what lerp does is it takes two numbers, in this
case, I'm going to take 5 and 15, or two tensors (they could be vectors or matrices), and
it creates a weighted average of them. And the amount of weight it
uses is this third number here. Let me explain.
In this case, if I put 0.5, it's going to take half of the first number plus half of the second number.
So we end up with just the mean. But what if we used 0.75? Then that's going to take 0.75 times
the second number plus 0.25 of the first number. So it basically kind of allows
it to be on like a sliding scale. So one extreme would be to take all of the second
number, so that would be lerp with 1 there. And the other extreme would
be all of the first number. And then you can slide
anywhere between them, like so. With 0.1, that's exactly the same as saying
5 times 0.9 plus 15 times 0.1. So this number here is how much
of the second number we have. And 1 minus that is how much
of the first number we have. And as you can
with most PyTorch things, you can also move the first parameter out front and call lerp as a method on it.
And get exactly the same result. So that's what lerp is.
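The same numbers in code, using `torch.lerp(start, end, weight)`, which computes `start + weight * (end - start)`:

```python
import torch

a, b = torch.tensor(5.0), torch.tensor(15.0)

half = torch.lerp(a, b, 0.5)       # 0.5*a  + 0.5*b
most_b = torch.lerp(a, b, 0.75)    # 0.25*a + 0.75*b
mostly_a = torch.lerp(a, b, 0.1)   # 0.9*a  + 0.1*b
as_method = a.lerp(b, 0.1)         # method form, same result as the line above

print(half, most_b, mostly_a, as_method)
```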
So what we're doing here is we're doing an in-place lerp.
So we're replacing self.means with 1 minus momentum times self.means,
plus momentum times this particular mini-batch's mean.
So this is basically doing Momentum again, which is why we indeed are calling
the parameter mom for momentum. So with a mom of 0.1, which I kind of think is
the opposite of what I'd expect momentum to mean, I'd expect it to be 0.9.
But with a mom of 0.1, it's saying that for each mini-batch, self.means will
be 0.1 of this particular mini-batch's mean, and 0.9 of the previous one.
The previous sequence, in fact. And that ends up giving us what's called
an exponentially weighted moving average. And we do the same thing for variances.
Okay, so that's only updated during training. And then during inference, we just
use the saved means and variances. So this, and then why do we have
buffers? What does that mean? These buffers mean that these means and variances
will be actually saved as part of the model. So it's important to understand that this
information about the means and variances that your model saw are saved in the model.
And this is the key thing which makes batch norm very tricky to deal
with and particularly tricky, as we'll see in later lessons
with transfer learning. But what this does do is that it means that we're
going to get something that's much smoother. You know, a single weird mini batch
shouldn't screw things around too much. And because we're averaging across the mini
batch, it's also going to make things smoother. So this whole thing should lead
to a pretty nice, smooth training. So we can train this.
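A sketch of a batch norm layer along the lines described: per-channel `mults` and `adds`, statistics over batch, height and width, and running `means` and `vars` kept in buffers and updated with an in-place lerp. The exact shapes and names are assumptions for illustration; in this version the current batch's statistics are used during training and the saved ones at inference.

```python
import torch
import torch.nn as nn

class BatchNorm(nn.Module):
    """Per-channel normalization over (batch, height, width) with running stats."""
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones(nf, 1, 1))    # one scale per channel
        self.adds = nn.Parameter(torch.zeros(nf, 1, 1))    # one shift per channel
        # Buffers are saved with the model but are not trained by the optimizer.
        self.register_buffer("vars", torch.ones(1, nf, 1, 1))
        self.register_buffer("means", torch.zeros(1, nf, 1, 1))

    def update_stats(self, x):
        m = x.mean((0, 2, 3), keepdim=True)
        v = x.var((0, 2, 3), keepdim=True)
        self.means.lerp_(m, self.mom)   # exponentially weighted moving average
        self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m, v = self.update_stats(x)
        else:
            m, v = self.means, self.vars   # inference uses the saved statistics
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mults + self.adds

torch.manual_seed(0)
bn = BatchNorm(8)
x = torch.randn(32, 8, 14, 14) * 3 + 2
out = bn(x)                        # training mode: batch statistics
print(out.mean((0, 2, 3)))         # close to 0 for every channel
print(bn.means.flatten())          # moved 10% of the way towards the batch means
```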
So we're going to, this time, we're going to use our BatchNorm layer for norm.
Oh, actually, we need to put the bias thing. Is that right? Oh, no, it's no, that's fine. Okay. And one interesting thing I found here is I was
able to now finally increase the learning rate up to 0.4 for the first time.
So each time I was really trying to see if I can push the learning rate.
And I'm now able to double the learning rate and still, as you can see, it's training
very smoothly, which is really cool. So there's actually a number of different types
of layer-based normalization we can use. In this lesson, we've specifically
seen Batch Norm and Layer Norm. I wanted to mention that there's
also Instance Norm and Group Norm. And this picture from the Group
Norm paper explains what happens. What it's showing is that we've got here the NCHW.
And so they've kind of flattened HW into a single axis since they can't draw 4D cubes.
And what they're saying is in Batch Norm, all this blue stuff is what we average over.
So we average across the batch and across the height and width.
And we end up with one, therefore, normalization number per channel.
So you can kind of slide these blue blocks across. So Batch Norm is averaging over
the batch and height and width. Layer Norm, as we learned, averages over
the channel and the height and the width. And it has a separate one
per item in the mini batch. I mean, kind of. It's a bit subtle, right?
Because, remember, the overall mult and add, it just had literally a
single number for each, right? So it's not quite as simple as
this, but that's a general idea. Instance Norm, which we're not looking at
today, only averages across height and width. So there's going to be a separate one for every
channel and every element of the mini batch. And then finally, Group Norm, which I'm
quite fond of, is like Instance Norm, but it arbitrarily basically groups
a bunch of channels together. And you can decide how many groups of
channels there are and averages over them. Group Norm tends to be a bit slow,
unfortunately, because the way these things are implemented is a bit tricky.
But Group Norm does allow you to, yeah, avoid some of the challenges of some of the other methods.
So it's worth trying if you can. And of course, Batch Norm has the additional
thing of the kind of momentum-based statistics. But in general, the idea of like, do
you use momentum-based statistics? Do you store things per channel or a single
mean and variance in your buffers or whatever? You know, all that kind of stuff,
along with what do you average over? They're all somewhat independent choices you can
make, and particular combinations of those have been given particular names.
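To make the "what do you average over" choice concrete, this sketch computes only the means for each variant on a random NCHW tensor and prints how many statistics each one produces. The tensor size and the choice of 4 groups are arbitrary, picked for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 8, 5, 5)              # N, C, H, W

batch_norm_mean = x.mean((0, 2, 3))       # one per channel
layer_norm_mean = x.mean((1, 2, 3))       # one per item in the batch
instance_norm_mean = x.mean((2, 3))       # one per item and channel

groups = 4                                # illustrative: 4 groups of 2 channels
group_norm_mean = x.reshape(16, groups, -1).mean(-1)   # one per item and group

for name, m in [("batch", batch_norm_mean), ("layer", layer_norm_mean),
                ("instance", instance_norm_mean), ("group", group_norm_mean)]:
    print(name, tuple(m.shape))
```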
And so there we go. OK, so we're getting, you know, we've got
some good initialization methods here. Let's try putting them all together. And one other thing we can do is we've been
using a batch size of 1024 for speed purposes. If we drop it down a bit to 256,
it's going to mean that it's going to get to see more mini batches.
So that should improve performance. And so we're trying to get to 90%, remember?
So let's do all this. This time we'll use, PyTorch
has its own BatchNorm. We'll just use PyTorch’s.
There's nothing wrong with ours, but we try to switch to PyTorch’s when
something we've recreated exists there. We'll use our MomentumLearner.
And we'll fit for 3 epochs. And so as you can see, it's going
a little bit more slowly now. And then the other thing I'm going
to do is, I'm going to decrease the learning rate and keep the existing model.
And then train for a little bit longer. The idea being that, as the, you know, as it's
kind of getting close to a pretty good answer, maybe it just wants to be able
to fine tune that a little bit. And so by decreasing the learning rate, we
give it a chance to fine tune a little bit. So let's see, how are we going?
So we got to 87.8% accuracy after three epochs, which is an improvement,
I guess, mainly thanks to, well, basically thanks to using
this smaller mini batch size. Now with a smaller mini batch size, you
do have to decrease the learning rate. So I found I could still get away
with 0.2, which is pretty cool. And look at this after just one more
epoch by decreasing the learning rate, we've got up to 89.7.
Oh, we didn't make it. 89.9. So towards 90%, but not quite 90%, 89.9.
So we're going to have to do some more work to get up to our magical 90% number.
But we are getting pretty close. All right.
So that is the end of initialization, an incredibly
important topic, as hopefully you've seen. Accelerated SGD. Let's see if we can use this to
get us up to 90 plus, above 90%. So let's do our normal imports
and data setup as usual. And so just to summarize what
we've got, we've got our MetricsCB. We've got our ActivationStats on the GeneralRelu.
So callbacks are going to be the DeviceCB, put it on CUDA or whatever, the metrics,
the progress bar, the activation stats. Our activation function is going
to be a GeneralRelu with 0.1 leakiness and 0.4 subtraction.
And we've got the init_weights, which we need to tell it about how leaky they are.
And then if we're doing a learning rate finder, we've got a different set of callbacks.
So there's no real reason to have a progress bar callback with a learning rate finder, I guess.
It's pretty short anyway. Oh, which reminds me, there was one little
thing I didn't mention in initializing, which is a fun trick you might
want to play around with. And in fact, Sam Watkins asked a question
earlier in the chat and I didn't answer it because it's actually exactly here.
In GeneralRelu, I added a second thing you might have seen, which is the maximum value.
And if the maximum value is set, then I clamp the value to be no more than the maximum.
So basically, as a result, let's say you set it to 3, then the line would go up to here like it
does here, and then it would go up to three like it does here, and then it would be flat.
And using that can be a nice way. I mean, that probably got higher
up to about six, but that can be a nice way to avoid numbers getting too big.
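The clamping idea just described can be sketched as follows. This is a minimal sketch, not necessarily the exact miniai code: the argument names `leak`, `sub` and `maxv` are assumptions, and the example values (0.1 leak, 0.4 subtraction, maximum of 3) are taken from the discussion above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralRelu(nn.Module):
    """Leaky ReLU with an optional downward shift and an optional ceiling.
    Argument names are illustrative assumptions."""
    def __init__(self, leak=None, sub=None, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        # Leaky slope for negative inputs if a leak is given, else a plain ReLU.
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        # Shift everything down by `sub`.
        if self.sub is not None: x = x - self.sub
        # Flatten the line at `maxv` so activations cannot grow past it.
        if self.maxv is not None: x = x.clamp_max(self.maxv)
        return x

act = GeneralRelu(leak=0.1, sub=0.4, maxv=3.0)
out = act(torch.tensor([-10.0, 0.0, 1.0, 10.0]))
print(out)  # the last input is clamped to the maximum of 3
```

With a maximum set, large inputs all map to the same ceiling, so any initialization scheme would need to be re-checked against the resulting activation statistics.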
And maybe if you really wanted to have fun, you could do kind of like a leaky maximum, which
I haven't tried yet, where maybe at the top it kind of goes like, you know, 10 times smaller,
kind of just exactly like the leaky could be. So anyway, if you do that,
you'd need to make sure that the, you know, that you're still getting
0, 1 layers with your initialization. But that would be something you
could consider playing with. Okay, so let's create our own little SGD class.
So an SGD class is going to need to know what parameters to optimize.
And if you remember, the module.parameters() method returns a generator.
So we call list on it to turn that into a list, so it's
forced to be a fixed collection, not something that's going to change.
We're going to need to know the learning rate. We're going to need to know the weight
decay, which we'll look at a bit in a moment. And for reasons we'll discuss later, we also want
to keep track of what batch number are we up to. So an optimizer basically has two
things, a step and a zero_grad. So what steps going to do is, obviously,
with no_grad, because this is not part of the thing that we're optimizing.
This is the optimization itself. We go through each tensor of parameters
and we do a step of the optimizer. And we'll come back to this in a moment. We do
a step of the regularizer and we keep track of what batch number we're up to.
And so what does SGD do in a step of the optimizer?
It subtracts out from the parameter its gradient times the learning rate.
So that's an SGD optimization step. And to zero the gradients, we go
through each parameter and we zero it. And that's in torch.no_grad. Oh, I guess it's not.
It uses .data instead. If you use .data, then you don't need to say the no_grad.
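Putting the pieces described so far together, a minimal sketch of such an SGD class might look like this. It assumes the method names `opt_step` and `reg_step` for the two per-parameter steps and a counter `i` for the batch number; the regularization step is called first here, and the toy usage at the bottom is only an illustration.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.0):
        self.params = list(params)  # materialize the generator once
        self.lr, self.wd = lr, wd
        self.i = 0                  # how many steps (batches) we have taken

    def step(self):
        # The update itself is not part of the computation being differentiated.
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1

    def opt_step(self, p):
        # Plain SGD: subtract gradient times learning rate.
        p -= p.grad * self.lr

    def reg_step(self, p):
        # Weight decay, applied directly to the weights.
        if self.wd != 0: p *= 1 - self.lr * self.wd

    def zero_grad(self):
        # Going through .data avoids needing a no_grad block here.
        for p in self.params: p.grad.data.zero_()

# Toy usage: minimize sum(w**2), whose gradient is 2w.
w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = SGD([w], lr=0.1)
loss = (w ** 2).sum()
loss.backward()   # w.grad is now [2, -4]
opt.step()        # w becomes [1 - 0.2, -2 + 0.4]
opt.zero_grad()
```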
Just a little typing saver. OK, so let's create a TrainLearner.
So it's a Learner with a training callback kind of built in and we're going to set the optimization
function to be this SGD we just wrote. And we'll use the BatchNorm model with the
weight initialization we've used before. And if we train it, then just this is just to give
us basically the same results we've had before. While this is training, I'm going
to talk about regularization. Hopefully you remember from Part 1 of this course
or from your other learning what weight decay is. And so just to remind you, weight decay or
L2 regularization are kind of the same thing. And basically what we're doing is we're
saying let's add the square of the weights to the loss function.
Now, if we add the square of the weights to the loss function, so,
whatever our loss function is, we'll just call it loss,
we're adding plus the sum of the square of the weights.
So that's our L. And so the only thing we actually
care about is the derivative of that. And the derivative of that is
equal to the derivative of the loss plus the derivative of this,
which is just 2w for each weight. And then what we do is we multiply this bit here
by some constant, which is the weight decay. So we call that weight decay. And so
since the weight decay could directly incorporate the number, the 2, we can
actually just delete that entirely and just multiply the weights by the weight decay.
I'm doing this very quickly because we have already covered it in Part 1. So this is hopefully something that
you've all seen before. So we can do weight decay by taking our gradients
and adding on the weight decay times the weights. And so as a result, then in SGD,
because that's part of the gradient. Oh, man, I got it the wrong way around.
Need to do. That first.
I guess it does. Well, whatever. OK. So since that's part of the gradient, then
in the optimization step, that's using the gradient and it's subtracting
out gradient times learning rate. But what you could do is because we're
just ending up doing p.grad * self.lr. And the p.grad update is
just to add in wd * weight. We could simply skip updating the gradients
and instead directly update the weights. To subtract out the learning
rate times the wd times weight. So they would be mathematically
identical. And that is what we've done here in the regularization step.
We basically say: if you've got weight decay, then
just take p *= 1 - lr * wd, which is mathematically the same as this.
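The equivalence between the two forms can be checked numerically. This is a small sketch with made-up values: `grad` is a random stand-in for the gradient of the unregularized loss, and the `lr` and `wd` values are arbitrary.

```python
import torch

torch.manual_seed(0)
lr, wd = 0.4, 0.01
w = torch.randn(5)
grad = torch.randn(5)  # stand-in for the gradient of the loss without weight decay

# Option A: fold weight decay into the gradient, then take a plain SGD step.
w_a = w - lr * (grad + wd * w)

# Option B: shrink the weights directly, then step with the original gradient.
w_b = w * (1 - lr * wd) - lr * grad

print(torch.allclose(w_a, w_b))
```

Expanding option A gives w - lr*grad - lr*wd*w, which is the same expression as option B, so the two differ only by floating point rounding.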
Because we've got weight on both sides. So that's why the regularization
is here inside our SGD. And yes, that's finished running. That's good.
We've got an 85% accuracy that all looks fine. And we're able to train it a high
learning rate of 0.4. That's pretty cool. So now let's add momentum. Now, we had a kind of a
hacky momentum learner before, but we're actually, momentum should be in an optimizer, really.
And so let's talk a bit about what momentum actually is.
So let's just create some data. So our xs is just going to be
equally spaced numbers from -4 to 4. A hundred of them. And our
ys is just going to be xs divided by 3, squared, 1 minus
that, plus some randomization. And so these dots here is our random data.
I'm going to show you what momentum is by example. And this is something that
Sylvain Gugger helped build. So thank you, Sylvain, for our book,
actually, if memory serves correctly. Actually, it might even be the course before
that. What we're going to do is we're going to show you what momentum looks like for
a range of different levels of momentum. These are the different levels we're
going to use. So let's take a beta of 0.5. So that's going to be our first one. So we're
going to do a scatter plot of our xs and ys. It's the blue dots. And then we're going to go
through each of the ys and we're going to do, this hopefully looks familiar. This is
doing a lerp. We're going to take our previous average, which will start at 0,
times beta, which is 0.5, plus 1 minus beta, that's 0.5, times our new data point.
And then we'll append that to this red line. And we'll do that for all the data points and then
plot them. And you can see what happens when we do that is that the red line becomes less bumpy.
Right? Because each one is half this exact dot and half of whatever the red line previously was.
So again, this is an exponentially weighted moving average. And so we could
have implemented this using lerp. So as the beta gets higher, it's saying: do more
of just be wherever the red line used to be and less of where this particular data point is.
And so that means when we have these kind of outliers, the red line doesn't
jump around as much, as you see. But if your momentum gets too high,
it doesn't follow what's going on at all. And in fact, it's way behind.
Right? When you're using momentum, it's always going to be partially responding
to how things were many batches ago. And so even at a beta of 0.9 here,
the red line is offset to the right because, again, it's taking a while for it to recognize
that things have changed, because each time 0.9 of it is where the red line used to be
and only 0.1 of it is what this data point says. So that's what momentum does.
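The red-line calculation can be sketched in a few lines. The data follows the description above (100 equally spaced xs from -4 to 4, ys equal to 1 minus (x/3) squared plus noise); the noise scale of 0.1, the helper name `ewma` and the particular list of betas are assumptions for illustration, and plotting is left out.

```python
import torch

def ewma(ys, beta):
    """Exponentially weighted moving average, starting from 0 (hypothetical helper)."""
    avg, out = 0.0, []
    for y in ys:
        # The lerp: keep `beta` of the old average, take `1 - beta` of the new point.
        avg = beta * avg + (1 - beta) * float(y)
        out.append(avg)
    return torch.tensor(out)

torch.manual_seed(42)
xs = torch.linspace(-4, 4, 100)
ys = 1 - (xs / 3) ** 2 + torch.randn(100) * 0.1  # noise scale is an assumption

# Higher beta gives a smoother line that lags further behind the data.
smoothed = {beta: ewma(ys, beta) for beta in (0.5, 0.7, 0.9, 0.99)}
```

Because the average starts at 0, the first few values are pulled towards 0 for high betas, which is the bias that comes up again with Adam.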
So the reason that momentum is useful is because when you have a loss function
that's actually kind of like very, very bumpy. Like that, right? You want to be able to follow
the actual curve, right? So using momentum, you don't quite get that, but you kind of
get a version of that that's offset to the right a little bit, but still, you
know, hopefully spending a lot more time, you don't really want to be heading off
in this direction, which you would if you follow the line and then this direction,
which you would if you follow the line. You really want to be following
the average of those directions. And that's what momentum lets you do. So to use momentum, we will inherit from SGD and we will override the
definition of the optimization step. Remember, there were two things that step called: it called
the regularization step and the optimization step. So we're going to modify the
optimization step. We're not just going to do minus equals grad times self.lr. Instead, when we
create a Momentum object, we will tell it what momentum
we want, or default to 0.9, and store that away. And then
in the optimization step, for each parameter (because remember,
the optimization step is being called for each parameter in a model,
so that's each layer's weights and each layer's biases, for example),
we'll find out for that parameter: have we ever stored away its moving
average of gradients before? And if we haven't, then we'll set them
to 0 initially, just like we did here. And then we will do our lerp, right?
So we're going to say the exponentially weighted moving average of
gradients is equal to whatever it used to be times the momentum, plus
this actual new batch's gradients times one minus momentum.
So that's just doing the lerp, as we discussed. And so then we're just going
to do exactly the same as the SGD update step. But instead of multiplying by p.grad,
we're multiplying it by p.grad_avg. So there's a cool little trick here, right? Which
is that we are basically inventing a brand new attribute, putting it inside the parameter tensor.
And that attribute is where we're storing away the moving average, exponentially
weighted moving average of gradients for that particular parameter.
So as we look through the parameters, we don't have to do any special work to get access to that.
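A minimal sketch of the momentum optimizer described here, including a compact copy of the SGD base class so the example runs on its own. The attribute name `grad_avg` comes from the discussion; the argument name `mom` and the toy values at the bottom are assumptions.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.0):
        self.params, self.lr, self.wd, self.i = list(params), lr, wd, 0
    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1
    def opt_step(self, p): p -= p.grad * self.lr
    def reg_step(self, p):
        if self.wd != 0: p *= 1 - self.lr * self.wd
    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

class Momentum(SGD):
    def __init__(self, params, lr, wd=0.0, mom=0.9):
        super().__init__(params, lr=lr, wd=wd)
        self.mom = mom

    def opt_step(self, p):
        # Store the moving average as a new attribute on the parameter tensor itself.
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad)
        # Lerp between the old average and this batch's gradient.
        p.grad_avg = p.grad_avg * self.mom + p.grad * (1 - self.mom)
        # Same update as SGD, but using the averaged gradient.
        p -= self.lr * p.grad_avg

# Toy check with a hand-set gradient of 2.
w = torch.tensor([1.0], requires_grad=True)
w.grad = torch.tensor([2.0])
opt = Momentum([w], lr=0.5, mom=0.9)
opt.step()  # grad_avg = 0.2,  w = 1.0 - 0.5 * 0.2  = 0.9
opt.step()  # grad_avg = 0.38, w = 0.9 - 0.5 * 0.38 = 0.71
```

Keeping the state on the parameter means the optimizer needs no separate dictionary keyed by parameter.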
So I think that's pretty handy. All right.
So one interesting thing, very interesting here I found is I could really
hike the learning rate way up to 1.5. And the reason why is because we're not
getting these huge bumps anymore. And so by getting rid of the huge bumps, the
whole thing's just a whole lot smoother. So previously we got up to 85%.
Because we've gone back to our 1024 batch size and just 3 epochs and a constant learning rate.
And look at that. We've gone up to 87.6%. So it's really improved things. And the loss
function is nice and smooth, as you can see. OK. And so then in our color_dim plot, you
can see this is really the smoothest we've seen.
And it's a bit different to the momentum learner because the momentum
learner didn't have this one minus part. It wasn't lerping. It was, it was basically
always including all of the grad plus a bit of the momentum part.
So this is yeah, this is a different, better approach, I think. And yeah, we've got a really nice, smooth result.
One person's asking, “don't we get a similar effect, I think, in terms of the smoothness
if we increase the batch size?”, which we do. But if you just increase the batch size,
you're giving it less opportunities to update. So having a really big batch
size is actually not great. Yann LeCun, who created the first really
successful ConvNets, including LeNet-5, says he thinks the ideal batch size,
if you can get away with it, is 1. But it's just slow. You want it to have as
many opportunities to update as possible. There's this weird thing recently
where people seem to be trying to create really large batch sizes, which
to me is, yeah, doesn't make any sense. We want the smallest batch size we
can get away with, generally speaking, to give it the most chances to update.
So this has done a great job of that. And we're getting very good results despite
using only 3 epochs with a very large batch size. Okay, so that's called Momentum.
Now, something that was developed in a course or announced in a
Coursera course back in maybe 2012, 2013 by Geoffrey Hinton, has never
been published, is called RMSProp. Let's have it running while we talk about it.
RMSProp is going to update the optimization step using something very similar to
Momentum, but rather than lerping on the p.grad, we're going to lerp on p.grad squared.
And just to keep it kind of consistent, we won't call it mom, we'll call it
sqr_mom, but this is just the multiplier. And what are we doing with the grad squared?
Well, the idea is that a large grad squared indicates a large variance of gradients.
So what we're then going to do is divide by the square root of that plus epsilon.
Now, you'll see I've actually been a bit all over the place here. With my batch norm,
I put the epsilon inside the square root. In this case, I'm putting the
epsilon outside the square root. It does make a difference. And so be careful
as to how your epsilon is being interpreted. Generally speaking, I can't
remember if I've been exactly right, but I've tried to be consistent with
the papers or normal implementations. This is a very common cause of
confusion and errors, though. So what we're doing here is we're dividing
the gradient by the amount of variation. So the square root of the moving
average of gradient squared. And so the idea here is that if the gradient
has been moving around all over the place, then we don't really know what it is.
Right? So we shouldn't do a very big update. If the gradient is very, very
much the same all the time, then we're very confident about it,
so we do want it to be a big update. I have no idea why we're doing this in
two steps. Let's just pop this over here. Now, because we are dividing our gradient by
this generally possibly rather small number, we generally have to decrease the learning rate.
So bring the learning rate back to point one. And as you see, it's training.
It's not amazing, but it's training OK. So RMSProp can be quite nice.
It’s a bit bumpy there, isn't it? I mean, I could try decreasing it a little bit.
Maybe down to 3e-3 instead. That's a bit better. And a bit smoother.
That's probably good. Let's see what the colorful dimension plot looks like, shall we?
We know again, it's very nice, isn't it? That's great.
Now, one thing I did, which I don't think I've seen done before,
I don't remember people talking about, is I actually decided not to do
the normal thing of initializing to zeros, because if I initialize to zeros,
then my initial denominator here will basically be zero plus epsilon, which will mean
my initial learning rate will be very, very high, which I certainly don't
want. So I actually initialized it, at first, to just whatever the first
mini batch's gradient is, squared. And I think this is a really useful
little trick for using RMSProp. Momentum can be a bit aggressive
sometimes for some really finicky learning methods, finicky architectures.
And so RMSProp can be a good way to get reasonably fast optimization of very finicky architectures.
And in particular, EfficientNet is an architecture which people have generally
trained best with RMSProp. So you don't see it a whole lot,
but you know, in some ways it's just historical interest, but you see it a bit.
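The RMSProp step described above, including the trick of initializing the squared average from the first mini batch's squared gradient, can be sketched like this. The name `sqr_mom` and the epsilon placement (outside the square root) follow the discussion; the default values 0.99 and 1e-5 and the toy gradient are assumptions. A compact SGD base class is repeated so the example runs on its own.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.0):
        self.params, self.lr, self.wd, self.i = list(params), lr, wd, 0
    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1
    def opt_step(self, p): p -= p.grad * self.lr
    def reg_step(self, p):
        if self.wd != 0: p *= 1 - self.lr * self.wd
    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

class RMSProp(SGD):
    def __init__(self, params, lr, wd=0.0, sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.sqr_mom, self.eps = sqr_mom, eps

    def opt_step(self, p):
        # Start from the first batch's squared gradient rather than zeros,
        # so the first denominator is not just epsilon.
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = p.grad ** 2
        # Lerp on the squared gradient.
        p.sqr_avg = p.sqr_avg * self.sqr_mom + p.grad ** 2 * (1 - self.sqr_mom)
        # Divide by the root of the moving average; epsilon sits outside the root.
        p -= self.lr * p.grad / (p.sqr_avg.sqrt() + self.eps)

# Toy check: two gradients of very different size.
w = torch.tensor([1.0, 1.0], requires_grad=True)
w.grad = torch.tensor([4.0, -0.01])
opt = RMSProp([w], lr=0.1)
opt.step()
print(w)  # both entries move by roughly lr, whatever the gradient's magnitude
```

On the first step the division turns each gradient into roughly plus or minus 1, which is why the learning rate has to be much smaller than with plain momentum.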
But I mean, the thing we really want to look at is RMSProp plus Momentum together and
RMSProp plus Momentum together exists. It has a name. You will have heard the name
many times. Name is Adam. Adam is literally just RMSProp and Momentum.
So we, rather annoyingly, call them beta1 and beta2.
They should be called momentum and square momentum or momentum of squares, I suppose.
So beta1 is just the momentum from the momentum optimizer. beta2 is just the momentum
for the squares from the RMSProp optimizer. So we'll store those away. And just
like RMSProp, we need the epsilon. So I'm going to, as before, store away the
gradient average and the square average. And then we're going to do our lerping. But there's a nice little trick here,
which is in order to avoid doing this, where we just put the initial batch
gradients as our starting values, we're going to use zeros as our starting values.
And then we're going to unbias them. So basically, the idea is that for the
very first mini batch, if you have zero here being lerped with the gradient,
then the first mini batch will obviously be closer to zero than it should be.
But we know exactly how much closer it should be to zero, which is just: it's
going to be self.beta1 times closer, at least in the first mini batch, because that's
what we've lerped with. And then in the second mini batch it will be self.beta1 squared.
And in the third mini batch it'll be self.beta1 cubed, and so forth.
And that's why we had this self.i back in our SGD, which was keeping track of
what mini batch we’re up to. So we need that in order to do
this unbiasing of the average. Oh, dear, I'm not unbiasing
the square of the average. Am I? No, I'm not.
Oops. So we need to do that here as well. I wonder if this is going to help things a little bit.
unbias_sqr_avg is going to be p.sqr_avg. And that will be beta2. And so we
will use those unbiased versions. So this unbiasing only matters
for the first few mini batches, where otherwise it would be too close to zero,
you know, be closer to zero than it should be. Right.
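Here is a minimal sketch of Adam as described: momentum plus RMSProp, zero starting values, and both averages unbiased using the batch counter `self.i`. The attribute names `avg` and `sqr_avg`, the default values, and the placement of epsilon outside the square root (as in the RMSProp discussion) are assumptions; other implementations place epsilon differently. A compact SGD base class is repeated so the example runs on its own.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.0):
        self.params, self.lr, self.wd, self.i = list(params), lr, wd, 0
    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1
    def opt_step(self, p): p -= p.grad * self.lr
    def reg_step(self, p):
        if self.wd != 0: p *= 1 - self.lr * self.wd
    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

class Adam(SGD):
    def __init__(self, params, lr, wd=0.0, beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def opt_step(self, p):
        # Both moving averages start at zero and are corrected below.
        if not hasattr(p, 'avg'): p.avg = torch.zeros_like(p.grad)
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad)
        # Momentum part: lerp on the gradient, then undo the pull towards zero.
        p.avg = self.beta1 * p.avg + (1 - self.beta1) * p.grad
        unbias_avg = p.avg / (1 - self.beta1 ** (self.i + 1))
        # RMSProp part: lerp on the squared gradient, unbiased the same way with beta2.
        p.sqr_avg = self.beta2 * p.sqr_avg + (1 - self.beta2) * p.grad ** 2
        unbias_sqr_avg = p.sqr_avg / (1 - self.beta2 ** (self.i + 1))
        p -= self.lr * unbias_avg / (unbias_sqr_avg.sqrt() + self.eps)

# Toy check: on the very first step the unbiased averages equal the raw
# gradient and its square, so each weight moves by roughly lr.
w = torch.tensor([1.0, 1.0], requires_grad=True)
w.grad = torch.tensor([4.0, -0.01])
opt = Adam([w], lr=0.1)
opt.step()
```

The correction factor 1 - beta**(i+1) approaches 1 as the batch count grows, so it only changes the first few updates.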
So we run that. And so, again, you know, we've,
you would expect the learning rate to be similar to what RMSProp needs
because we're doing that same division. So we actually do have the
same learning rate here. And yeah, so we're up to 86.5% accuracy.
So that's pretty good. I think. Yeah, it's actually a bit less
good than Momentum, which is fine. You know, obviously you can fiddle around.
Well, Momentum we had 0.9. Yeah. So you can fiddle around
with different values of beta2, beta1. See if you can beat the Momentum
version. I suspect you probably can. OK. We're a bit out of time, aren't we? All right.
I am excited about the next bit, but I want to spend time doing it properly
so I won't rush through it now. But instead, we're going to do it
next time. So I will. Yes, I will. Give you a hint that in our next
lesson, we will, in fact, get above 90%. And it's got some very cool stuff to show
you. I can't wait to show you that then. But, you know, I think in the meantime, let's
give ourselves a pat on the back that we have successfully implemented.
You know, I mean, think about all this stuff we've got running and happening and we've
done the whole thing from scratch using nothing but what's in the Python standard library.
We've re implemented everything and we understand exactly what's going on.
So I think this is really quite terrifically cool. Personally, I hope you feel the same way and
look forward to seeing you in the next lesson. Thanks. Bye.