Lesson 16: Deep Learning Foundations to Stable Diffusion ...

Hi there and welcome to Lesson 16 where we

are working on building our first flexible training framework: the learner.

And I've got some very good news, which is that I have thought of a way of doing it a

little bit more gradually and simply, actually, than last time.

So that should make things a bit easier. So we're going to take it a bit more step by step.

So we're working in the 09_learner notebook today. And we've seen already this

Basic Callbacks Learner. And so the idea is that… we've seen so far

this Learner which wasn't flexible at all, but it had all the basic pieces,

which is we've got a fit method, we’re hardcoding that we

can only calculate accuracy and average loss. We're hardcoding, we're putting

things on a default device, hardcoding a single learning

rate. But the basic idea is here,

we go through each epoch and call one_epoch to train or evaluate

depending on this flag. And then we loop through

each batch in the DataLoader. And one_batch is going to grab the x and y

parts of the batch, call the model, call the loss function, and if we're

training, do the backward pass. And then print out, well, calculate

the statistics for our accuracy. And then at the end of an epoch, print that out.

So it wasn't very flexible. But it did do something.

So that's good. So what we're going to do now is

an intermediate step. We're going to look at what I'm

calling a Basic Callbacks Learner. And it actually has nearly all the

functionality of the full thing. After we look at this Basic Callbacks Learner,

we're then going to, after creating some callbacks and metrics, we're going to look at

something called the flexible learner. So let's go step by step.

So the Basic Callbacks Learner looks very similar to the previous Learner.

It's got a fit function, which is going to go through each epoch, calling one_epoch with

training on and then training off. And then one_epoch will go through each batch

and call one_batch and one_batch will call the model, the loss_func.

And if we're training it will do the backward step.

So that's all pretty similar, but there's a few more things going on here.

For example, if we have a look at fit, you'll see that after creating the optimizer, so

we call self.opt_func, so opt_func here defaults to SGD.

So we instantiate an SGD object passing in our model's parameters and the requested learning

rate. And then before we start looping through one

epoch at a time, now we've set epochs here, we first of all call self.callback

and passing in ‘before_fit’. Now what does that do?

So self.callback is here and it takes some method names,

in this case it's ‘before_fit’, and it calls a function

called run callbacks (run_cbs). It passes in a list of our callbacks and

the method name, in this case ‘before_fit’. So run callbacks is something that's going

to go through each callback, sorted in order of their ‘order’ attribute. And so there's a base class for our callbacks

which has an order of 0, so our callbacks are all going to have the same

order of 0 unless you ask otherwise. So here's an example of a callback.
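As a rough sketch of the mechanism just described, assuming callbacks are plain objects with an `order` attribute: run_cbs sorts the callbacks and calls the named method on each one that defines it. HelloCB and EarlyCB are made-up names for illustration, and the notebook's real version may differ in details.

```python
from operator import attrgetter

class Callback:
    order = 0  # default; a subclass can override this to run earlier or later

def run_cbs(cbs, method_nm):
    # Visit callbacks in ascending `order`; call the method only if it exists.
    for cb in sorted(cbs, key=attrgetter('order')):
        method = getattr(cb, method_nm, None)
        if method is not None:
            method()

calls = []

class HelloCB(Callback):       # hypothetical example callback
    def before_fit(self): calls.append('hello')

class EarlyCB(Callback):       # hypothetical: a lower order runs first
    order = -1
    def before_fit(self): calls.append('early')

run_cbs([HelloCB(), EarlyCB()], 'before_fit')
run_cbs([HelloCB(), EarlyCB()], 'after_fit')   # nobody defines it, so nothing happens
print(calls)
```

Because missing methods are simply skipped, a callback only has to define the hooks it cares about.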

So before we look at how callbacks work, let's just run a callback.

So we can create a ridiculously simple callback called completion

callback (CompletionCB), which, before we start fitting a new model,

will set its count attribute to 0. And each batch it will increment that, and

after completing the fitting process it will print out how many batches we've done.

So before we even train a model, we could just run manually before_fit, after_batch,

and after_fit using this run_cbs. And you can see it's ended up

saying “Completed 1 batches”. So what did that do?

So it went through each of the cbs in this list, there's only one, so it's going to look

at the one cb, and it's going to try to use getattr to find an attribute with this name,

which is ‘before_fit’. So if we try that manually, so this is the

kind of thing I want you to do if you find anything difficult to understand

is do it all manually. So create a callback, set it to cbs[0],

just like you're doing in a loop, right? And then find out what happens

if we call this and pass in this, and you'll see it's

returned a method. And then what happens to that method?

It gets called. So let's try calling it.

There we are. So that's what happened when we

call the before_fit, which doesn't do anything very interesting.

But if we then call after_batch, and then we call after_fit, there it is, right?
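Here is a minimal version of that manual experiment, written as a sketch rather than a copy of the notebook cell. It assumes CompletionCB is a plain class with the three methods described above.

```python
class CompletionCB:
    order = 0
    def before_fit(self): self.count = 0
    def after_batch(self): self.count += 1
    def after_fit(self): print(f'Completed {self.count} batches')

cbs = [CompletionCB()]
cb = cbs[0]

# Look the method up by name, exactly as the loop would
method = getattr(cb, 'before_fit', None)
print(method)      # a bound method object
method()           # calling it sets cb.count to 0

getattr(cb, 'after_batch')()
getattr(cb, 'after_fit')()   # prints "Completed 1 batches"
```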

So yeah, make sure you don't just run code willy-nilly, but understand it by experimenting

with it. And I don't always experiment

with it myself in these classes. Often I'm leaving that to you, but sometimes

I'm trying to give you a sense of how I would experiment with code if I was learning it.

So then having done that, I would then go ahead and delete those cells.

But you can see I'm using this interactive notebook environment to explore and learn

and understand. And so now we've got an end.

If I haven't created a simple example of something to make it really easy to understand, you

should do that, right? Don't just use what I've already created

or what somebody else has already created. So we've now got something that

works totally independently. We can see how it works.

This is what a callback does. So a callback is something

which will look at a class. A callback is a class where you can define one

or more of before_fit, after_fit, before_batch, after_batch, before_epoch and after_epoch.

So it's going to go through and run all the callbacks that have a before_fit method before

we start fitting. Then it'll go through each epoch and call

one_epoch with training and one epoch with evaluation.

And then when that's all done, it will call after_fit callbacks.

And one_epoch will, before it starts on enumerating through the

batches, it will call before_epoch. And when it's done, it will call after_epoch.

The other thing you'll notice is that there's a try immediately before every before

method, and immediately after every after method there's an except.

And each one has a different thing to look for, CancelFitException, CancelEpochException,

and CancelBatchException. So here's the bit which goes through each

batch, calls before_batch, processes the batch, calls after_batch.

And if there's an exception that's of type CancelBatchException, it gets ignored.

So what's that for? So the reason we have this is that any of

our callbacks could raise any one of these three exceptions to say, I don't

want to do this batch, please. So maybe we'll look at an

example of that in a moment. So we can now train with this.
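To make the try/except structure concrete, here is a stripped-down sketch of the loop with no model in it. It assumes one pass per epoch instead of separate training and evaluation passes, and "processing" a batch just records it. LoopSketch and SkipOddBatchesCB are hypothetical names.

```python
class CancelFitException(Exception): pass
class CancelEpochException(Exception): pass
class CancelBatchException(Exception): pass

class LoopSketch:
    def __init__(self, batches, cbs):
        self.batches, self.cbs = batches, cbs
        self.processed = []

    def callback(self, method_nm):
        for cb in sorted(self.cbs, key=lambda cb: cb.order):
            method = getattr(cb, method_nm, None)
            if method is not None: method()

    def one_batch(self):
        self.processed.append(self.batch)   # stand-in for the real work

    def one_epoch(self):
        try:
            self.callback('before_epoch')
            for self.iter, self.batch in enumerate(self.batches):
                try:
                    self.callback('before_batch')
                    self.one_batch()
                    self.callback('after_batch')
                except CancelBatchException: pass   # skip just this batch
            self.callback('after_epoch')
        except CancelEpochException: pass           # skip the rest of the epoch

    def fit(self, n_epochs):
        for cb in self.cbs: cb.learn = self         # give callbacks access to the loop
        try:
            self.callback('before_fit')
            for self.epoch in range(n_epochs): self.one_epoch()
            self.callback('after_fit')
        except CancelFitException: pass             # stop training entirely

class SkipOddBatchesCB:
    order = 0
    def before_batch(self):
        if self.learn.iter % 2 == 1: raise CancelBatchException()

loop = LoopSketch(batches=['b0', 'b1', 'b2', 'b3'], cbs=[SkipOddBatchesCB()])
loop.fit(1)
print(loop.processed)
```

Each exception type is caught at exactly one level, so a callback chooses how much work to cancel by choosing which exception to raise.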

So let's create a little get_model function that creates a sequential model with just

some linear layers. And then we'll call fit.

And it's not telling us anything interesting because the only callback we added was the

completion callback. That's fine.

It's training, it's doing something. And we now have a trained model.

I just didn't print out any metrics or anything because we don't have any callbacks for that.

That's the basic idea. So we could create a, maybe we could call

it a SingleBatchCB, which after_batch, after a single batch, it

raises a CancelFitException. So that's a pretty, I mean, I suppose that

could be kind of useful actually, if you want to just run one batch of your model to make sure it works.

So we could try that. So now we're going to add to

our list of callbacks, the single batch callback.

Let's try it. And in fact, you know, we probably want this.

Let's just have a think here. That's fine.

Let's run it. There we go.

So it ran and nothing happened. And the reason nothing happened is

because this canceled before this ran. So we could make this run second

by setting its order to be higher. And we could say just order=1

because the default order is 0. And we sort in order of the order attribute.

Actually let's use CancelEpochException. There we go.

That way it'll run the final fit. There we are.

So it did one batch for the training and one batch for the evaluation.
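The effect of `order` and of raising CancelEpochException can be reproduced with a toy loop. This is only a sketch: the `fit` function below is a hypothetical stand-in for the Learner that fires just the hooks needed here.

```python
class CancelEpochException(Exception): pass

class CompletionCB:
    order = 0
    def before_fit(self): self.count = 0
    def after_batch(self): self.count += 1
    def after_fit(self): print(f'Completed {self.count} batches')

class SingleBatchCB:
    order = 1   # higher than the default, so it runs after CompletionCB
    def after_batch(self): raise CancelEpochException()

def run_cbs(cbs, method_nm):
    for cb in sorted(cbs, key=lambda cb: cb.order):
        method = getattr(cb, method_nm, None)
        if method is not None: method()

def fit(cbs, n_batches=5):
    run_cbs(cbs, 'before_fit')
    for phase in ('train', 'eval'):        # one training pass, one evaluation pass
        try:
            for _ in range(n_batches):
                run_cbs(cbs, 'after_batch')
        except CancelEpochException: pass  # ends this pass only
    run_cbs(cbs, 'after_fit')

counter = CompletionCB()
fit([SingleBatchCB(), counter])            # prints "Completed 2 batches"
```

With `order = 0` on both callbacks, the cancel could fire before the counter was incremented, which is the "nothing happened" behaviour seen earlier.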

So that's a total of two batches. So remember, callbacks are not

a special magic part of like the Python language or anything.

It's just a name we use to refer to these functions or classes

or callables, more accurately, that we pass into something that will then

call back to that callable at particular times. And I think these are kind of interesting

kinds of callbacks because these callbacks have multiple methods in them.

So is each method a callback? Is each class with all those methods a callback?

I don't know. I tend to think of the class with all

the methods in as a single callback. I'm not sure if we have

great nomenclature for this. All right.

So let's actually try to get this doing something more

interesting by not modifying the learner at all, but just by adding callbacks, because

that's the great hope of callbacks, right? So it would be very nice if it

told us the accuracy and the loss. So to do that, it would be great to have

a class that can keep track of a metric. So I've created here a Metric class.

And maybe before we look at it, we'll see how it works.

You could create, for example, an accuracy metric by defining the calculation necessary

to calculate the accuracy metric, which is the mean of how often do the inputs equal

the targets. And the idea is you could then

create an accuracy metric object. You could add a batch of inputs and targets

and add another batch of inputs and targets and get the value.

And there you would get the 0.45 accuracy. Or another way you could do it would be just to

create a metric which simply gets the weighted average, for example, of your loss.

So you could add 0.6 as the loss with a batch size of 32, 0.9 as a loss and a batch size

of 2. And then that's going to give us a weighted

average loss of 0.62, which is equal to this weighted average calculation.

So that's like one way we could kind of make it easy to calculate metrics.

So here's the class. Basically we're going to keep track of all

of the actual values that we're averaging and the number in each mini batch.

And so when you add a mini batch, we call calculate, which for example, for Accuracy,

remember this is going to override the parent classes calculate.

So it does the calculation here. And then we'll add that to our list of values.

We will add to our list of batch sizes, the current batch size.

And then when you calculate the value, we will calculate the weighted sum.

Sorry, the weighted mean, weighted average. Now notice that here value, I didn't

have to put parentheses after it. And that's because it's a property.

I think we've seen this before. So just to remind you, property just means

you don't have to put parentheses after it to get the calculation to happen.
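A plain-Python sketch of such a Metric class follows. It assumes lists instead of tensors, and the input values are illustrative, chosen so the results match the 0.45 and 0.62 figures mentioned above.

```python
class Metric:
    def __init__(self): self.reset()
    def reset(self): self.vals, self.ns = [], []
    def add(self, inp, targ=None, n=1):
        self.last = self.calc(inp, targ)
        self.vals.append(self.last)     # per-batch value
        self.ns.append(n)               # per-batch weight (batch size)
    @property
    def value(self):
        # Weighted mean of the per-batch values
        return sum(v * n for v, n in zip(self.vals, self.ns)) / sum(self.ns)
    def calc(self, inps, targs): return inps

class Accuracy(Metric):
    def calc(self, inps, targs):        # overrides the parent's calc
        return sum(i == t for i, t in zip(inps, targs)) / len(inps)

acc = Accuracy()
acc.add([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0])   # 3 of 6 correct
acc.add([1, 1, 2, 0, 1], [0, 1, 1, 2, 1])         # 2 of 5 correct
print(round(acc.value, 2))                        # no parentheses: it's a property

loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
print(round(loss.value, 2), round((0.6 * 32 + 0.9 * 2) / (32 + 2), 2))
```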

All right. So just let me know if anybody's got

any questions up to here, of course. So we now need some way to use this metric

in a callback to actually print out. The first thing I'm going to do though, is I'm

going to create one more useful callback first, a very simple one, just two lines

of code, called the device callback. And that is something which is going to allow

us to use CUDA or the Apple GPU or whatever without the complications we had before of,

you know, how do we have multiple processes in our DataLoader and also use our

device and not have everything fall over. So the way we could do it is we could say

before fit, put the model onto the default device and before each batch is run, put that

batch onto the device, because look what happened in the, this is really, really

important. In the Learner absolutely everything is put inside self

dot, which means it's all modifiable. So the loop variables are self dot the iteration number and

self dot the batch itself, enumerating the DataLoader. And then we call one_batch, but before it,

we call the callback so we can modify this. Now how does the callback

get access to the Learner? Well what actually happens is we go through

each of our callbacks and put, set an attribute called learn equal to the Learner.

And so that means in the callback itself, we can say self.learn.model.

And actually we could make this a bit better, I think.

So make it like, maybe you don't want to use a default device.

So this is where I would be inclined to add a constructor and set device and we could

default it to the default device, of course. And then we could use that instead and

that would give us a bit more flexibility. So if you wanted to train on some

different device, then you could. I think that might be a slight improvement. Okay.
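A sketch of the device callback with the constructor suggested above. To stay free of dependencies it relies only on objects having a `.to(device)` method, as PyTorch modules and tensors do. FakeModel and FakeTensor are stand-ins invented for this example, and the `'cpu'` default is an assumption, not the notebook's default device.

```python
from types import SimpleNamespace

class DeviceCB:
    order = 0
    def __init__(self, device='cpu'):      # assumed default for this sketch
        self.device = device
    def before_fit(self):
        self.learn.model.to(self.device)
    def before_batch(self):
        # Replace the learner's batch with a copy living on the device
        self.learn.batch = tuple(b.to(self.device) for b in self.learn.batch)

class FakeTensor:                           # stand-in: .to() returns a moved copy
    def __init__(self, device='cpu'): self.device = device
    def to(self, device): return FakeTensor(device)

class FakeModel:                            # stand-in: .to() moves in place
    device = 'cpu'
    def to(self, device):
        self.device = device
        return self

learn = SimpleNamespace(model=FakeModel(), batch=(FakeTensor(), FakeTensor()))
cb = DeviceCB(device='cuda:0')
cb.learn = learn
cb.before_fit()
cb.before_batch()
print(learn.model.device, [b.device for b in learn.batch])
```

The callback works only because the loop stores the batch on the learner, so it can be swapped out before one_batch runs.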

So there's a callback we can use to put things on CUDA and we could check that it works by

just quickly going back to our old Learner here, remove the SingleBatchCB() and replace

it with DeviceCB(). Yep still works.

So that's a good sign. Okay so now let's do our Metrics.

Now of course we couldn't use metrics

until we built them by hand. The good news is we don't have to write every

single metric now by hand because they already exist in a fairly new project called torcheval,

which is an official PyTorch project. And so torcheval is something that gives

us… actually, I came across it after I had created my own metric class.

But it actually looks pretty similar to the one that I built earlier.

So you can install it with pip. I'm not sure if it's on Conda yet, but it probably

will be soon by the time you see the video. I think it's pure Python anyway so

it doesn't matter how you install it. And yeah it has a pretty similar approach

where you call .update and you call .compute so they're slightly different names but they're

basically super similar to the thing that we just built.

But there's a nice good list of metrics to pick from. So because we've already built our own now

that means we're allowed to use theirs. So we can import the MulticlassAccuracy

metric and the Mean metric. And just to show you they look very, very similar.

If we create a MulticlassAccuracy, we can pass in a mini-batch of inputs and targets

and compute and that all works nicely. Now these, in fact, it's exactly

the same as what I wrote. We both added this thing called

reset which basically resets it. And so obviously we're going to be wanting to

do that probably at the start of each epoch. And so if you reset it and then try to compute

you'll get NaN because you can't get accuracy. Accuracy is meaningless when

you don't have any data yet. So let's create a metrics callback

(MetricsCB) so we can print out our metrics. I've got some ideas to improve this which

maybe I'll do this week, but here's a basic working version, slightly hacky, but it's not too bad. So generally speaking one thing I noticed

actually is I don't know if this is considered a bug but a lot of the metrics didn't seem

to work correctly in torcheval when I had tensors that were on the

GPU and had requires_grad set. So I created a little to_cpu function which

I think is very useful, and that's just going to detach the tensor (detach takes the tensor and

removes all the gradient history, the computation history used to calculate a

gradient) and put it on the CPU. It will do the same for dictionaries of

tensors, lists of tensors and tuples of tensors. So our metrics callback basically

here's how we're going to use it. So let's run it.

So here we're creating a metrics callback object and saying we want to create a metric

called accuracy. That's what's going to print out. And this is the metrics object we're

going to use to calculate accuracy. And so then we just pass that

in as one of our callbacks. And so you can see what it's going to do is it's

going to print out the epoch number, whether it's training or evaluating, so training set

or validation set, and it'll print out our metrics and our current status.

Actually we can simplify that. We don't need to print those

bits because it's all in the dictionary now.

Let's do that. There we go.

So let's take a look at how this works. So we are going to be creating with, for the

callback we're going to be passing in the names and metric objects

for the metrics to track and print. So here it is here, **metrics.

So we've seen ** before. And as a little shortcut, I decided that it

might be nice if you didn't want to write accuracy equals, you could

just remove that and run it. And if you do that, then it will give it a

name and it'll just use the same name as the class.

And so that's why you can pass them in either way. So ms will be a tuple.

With *ms, you're just passing a list of positional

arguments, which get turned into a tuple, or you can pass in named arguments that

will be turned into a dictionary. If you pass in positional arguments, then

I'm going to turn them into named arguments in the dictionary by just

grabbing the name from their type. So that's where this comes from.

That's all that's going on here. Just a little shortcut, a bit of convenience.
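The naming shortcut can be sketched on its own, with only the constructor shown. The Accuracy and Mean classes here are empty placeholders, not real metric implementations.

```python
class MetricsCB:
    def __init__(self, *ms, **metrics):
        # Positional metrics get a name taken from their class
        for m in ms:
            metrics[type(m).__name__] = m
        self.metrics = metrics

class Accuracy: pass   # placeholder
class Mean: pass       # placeholder

named = MetricsCB(accuracy=Accuracy())
unnamed = MetricsCB(Accuracy(), Mean())
print(list(named.metrics))     # ['accuracy']
print(list(unnamed.metrics))   # ['Accuracy', 'Mean']
```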

So we store that away. And this is, yeah, this is the bit I think

I can simplify a little bit, but I'm just adding manually an additional metric,

which is I'm going to call the loss. And that's just going to be the

weighted average of the losses. So before we start fitting, we're going to

actually tell the Learner that we are the Metrics callback. And so you'll see later where

we're going to actually use this. Before each epoch, we will

reset all of our metrics. After each epoch, we will

create a dictionary of the keys and values, which are the actual

strings that we want to print out. And we will call _log, which

for now we'll just print them. And then after each batch, this is

the key thing, we're going to actually grab the input

and target. We're going to put them on the CPU.

And then we're going to go through each of our metrics and call .update.

So remember the update in the metric is the thing that actually says here's a batch of

data, right? So we're passing in the batch of data,

which is the predictions and the targets. And then we'll do the same thing

for our special loss metric, passing in the actual loss and

the size of our mini-batch. And so that's how we're able

to get this, yeah, this actual running on the NVIDIA GPU

and showing our metrics. And obviously there's a lot of room to improve

how this is displayed, but all the information we need is here, and it's just a

case of changing that function. Okay.
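Putting those hooks together, here is a dependency-free sketch of the metrics callback lifecycle. It makes three assumptions: metrics are duck-typed objects with update, compute and reset; the to_cpu step is left out because there are no tensors; and the callback is driven by hand with a SimpleNamespace standing in for the Learner.

```python
from types import SimpleNamespace

class Mean:
    def __init__(self): self.reset()
    def reset(self): self.total, self.weight = 0.0, 0.0
    def update(self, value, weight=1):
        self.total += value * weight
        self.weight += weight
    def compute(self):
        return self.total / self.weight if self.weight else float('nan')

class Accuracy(Mean):
    def update(self, preds, targs):
        correct = sum(p == t for p, t in zip(preds, targs))
        super().update(correct / len(targs), weight=len(targs))

class MetricsCB:
    order = 0
    def __init__(self, **metrics):
        self.metrics = metrics
        self.all_metrics = dict(metrics)
        self.all_metrics['loss'] = self.loss = Mean()   # the extra loss metric
    def _log(self, d): print(d)
    def before_fit(self): self.learn.metrics = self     # register with the learner
    def before_epoch(self):
        for m in self.all_metrics.values(): m.reset()
    def after_batch(self):
        x, y = self.learn.batch
        for m in self.metrics.values(): m.update(self.learn.preds, y)
        self.loss.update(self.learn.loss, weight=len(x))
    def after_epoch(self):
        log = {k: f'{m.compute():.3f}' for k, m in self.all_metrics.items()}
        log['epoch'] = self.learn.epoch
        self._log(log)

cb = MetricsCB(accuracy=Accuracy())
cb.learn = learn = SimpleNamespace(epoch=0)
cb.before_fit()
cb.before_epoch()
learn.batch, learn.preds, learn.loss = ([1, 2, 3, 4], [0, 1, 1, 0]), [0, 1, 0, 0], 0.5
cb.after_batch()
learn.batch, learn.preds, learn.loss = ([5, 6], [1, 1]), [1, 1], 0.2
cb.after_batch()
cb.after_epoch()
```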

So that's our kind of like intermediate complexity learner.

We can make it more sophisticated, but it's still exactly, it's still going to fit in

a single screen of code. So this is kind of my goal here was to

keep everything in a single screen of code. This first bit is exactly the same as before,

but you'll see that the one_epoch and fit and batch has gone from,

let's see what it was before. It's gone from quite a lot of code, all this, to much less code.

And the trick to doing that is I decided to use a @contextmanager.

We're going to learn more about context managers in the next notebook, but basically I originally

last week I was saying I was going to do this with a decorator, but I realized a context

manager is better. Basically what we're going to do

is we're going to call our before and after callbacks in

a try-except block. And to say that we want to use the callbacks

and the try and except block, we're going to use a with statement.

So in Python, a with statement says: for everything in that block, call our context manager before

and after it. Now there's a few ways to do that, but one

really easy one is using this context manager decorator and everything up until the

yield statement is called before your code. Where it says yield, it then calls your code

and then everything after the yield is called after your code.

So in this case, it's going to be try, self.callback, before_name, where name is fit.

And then it will call for self.epoch etc. Because that's where the yield is.

And then it'll call self.callback(after_fit), except.

Okay and now we need to grab the CancelFitException.

So all of the global variables that you have in Python live inside a special dictionary that you get by calling

globals(). So this dictionary contains all of your module-level variables.

So I can just look up in that dictionary the variable called CancelFit (with a capital

F) Exception. So this is except CancelFitException.

So this is exactly the same then as this code, except the nice

thing is now I only have to write it once rather than at least three times.

And I'm probably going to want more of them. So I tend to think it's worth refactoring

when you have duplicate code, particularly here.

We had the same code three times. So that's going to be more

of a maintenance headache. We're probably going to want to

add callbacks to more things later. So by putting it into a context manager just once,

I think we're going to reduce our maintenance burden.
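The context-manager idea can be sketched in isolation. Assumptions: the method is called `cb_ctx` here, `callback` just records event names instead of running real callbacks, and the three exception classes live at module level so the `globals()` lookup can find them.

```python
from contextlib import contextmanager

class CancelFitException(Exception): pass
class CancelEpochException(Exception): pass
class CancelBatchException(Exception): pass

class CtxSketch:
    def __init__(self): self.events = []

    def callback(self, method_nm):
        self.events.append(method_nm)   # stand-in for running the callbacks

    @contextmanager
    def cb_ctx(self, nm):
        try:
            self.callback(f'before_{nm}')
            yield                        # the body of the `with` block runs here
            self.callback(f'after_{nm}')
        except globals()[f'Cancel{nm.title()}Exception']:
            pass                         # a matching cancel just ends this stage

    def fit(self, batches):
        with self.cb_ctx('fit'):
            with self.cb_ctx('epoch'):
                for self.batch in batches:
                    with self.cb_ctx('batch'):
                        if self.batch == 'skip': raise CancelBatchException()
                        self.events.append(f'process {self.batch}')

s = CtxSketch()
s.fit(['a', 'skip', 'b'])
print(s.events)
```

Note that the skipped batch gets a before_batch event but no after_batch, since the exception jumps straight from the yield to the except clause.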

That's what we do because I've had a similar thing in fast.ai for some years now, and it's

been quite convenient. So that's what this context manager is about. Yeah, other than that, the

code's exactly the same. So we create our optimizer and then with our

callback context manager for fit, go through each epoch, call one_epoch, set it to training

or non-training mode based on the argument we pass in, grab the training or validation

set based on the argument we pass in, and then using the context manager for epoch,

go through each batch in the DataLoader, and then for each batch in the

DataLoader using the batch context. Now this is where something

gets quite interesting. We call predict, get_loss, and if we're

training, backward, step and zero_grad. But previously we actually called

self.model, etc., self.loss_func, etc. So we go through each batch and

call before_batch, do the batch. Oh, sorry, that's our slow version.

Wait, what are we doing? Oh, yes, we're going to be over here. Okay, back where we are, yes. So previously we were calling the model,

calling the loss_func, calling loss.backward, opt.step, opt.zero_grad, but now we are calling instead, self.predict,

self.get_loss, self.backward, and how on earth is that working

–because they're not defined here at all. And so the reason I've decided to do

this is it gives us a lot of flexibility. We can now actually create our own way of

doing predict, get_loss, backward, step, and zero_grad, in different situations, and

we're going to see some of those situations. So what happens if we call

self.predict and it doesn't exist? Well, it doesn't necessarily cause an error.

What actually happens is it calls a special magic method in Python called dunder getattr,

as we've seen before. And what I'm doing here is I'm saying, okay,

well, if it's one of these special five things, don't raise an AttributeError, which

is the default thing it does, but instead… create a callback, or actually I should say

call self.callback, passing in that name. So it's actually going to

call self.callback(‘predict’). And self.callback is exactly the same as before.

And so what that means now is to make this work exactly the same as it did before, I

need a callback which does these five things. And here it is.

I'm going to call it train callback. So here are the five things, predict,

get_loss, backward, step, and zero_grad. So they're here, predict, get_loss,

backward, step, and zero_grad. Okay, so they're almost exactly the same as

what they looked like in our intermediate Learner, except now I just need to have self.learn

in front of everything, because, we remember, this is a Callback, it's not the Learner.

And so for a Callback, the Callback can access the Learner using self.learn.

So self.learn.preds, there's self.learn.model, passing in self.learn.batch, and just the

independent variables. Ditto for the loss, calls the loss function, backward, step, zero_grad.
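Here is a small sketch of that delegation, assuming a cut-down LearnerSketch class (a hypothetical name) that has only the pieces needed to show `__getattr__` forwarding to callbacks. The model and loss function are plain Python lambdas standing in for the real thing, so only predict and get_loss are exercised.

```python
from functools import partial

class TrainCB:
    order = 0
    def predict(self): self.learn.preds = self.learn.model(self.learn.batch[0])
    def get_loss(self): self.learn.loss = self.learn.loss_func(self.learn.preds, self.learn.batch[1])
    def backward(self): self.learn.loss.backward()
    def step(self): self.learn.opt.step()
    def zero_grad(self): self.learn.opt.zero_grad()

class LearnerSketch:
    def __init__(self, model, loss_func, cbs):
        self.model, self.loss_func, self.cbs = model, loss_func, cbs
        for cb in cbs: cb.learn = self

    def callback(self, method_nm):
        for cb in sorted(self.cbs, key=lambda cb: cb.order):
            method = getattr(cb, method_nm, None)
            if method is not None: method()

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup has failed
        if name in ('predict', 'get_loss', 'backward', 'step', 'zero_grad'):
            return partial(self.callback, name)
        raise AttributeError(name)

learn = LearnerSketch(model=lambda x: 2 * x,
                      loss_func=lambda pred, targ: (pred - targ) ** 2,
                      cbs=[TrainCB()])
learn.batch = (3, 5)
learn.predict()     # not defined on LearnerSketch, so it is routed to TrainCB.predict
learn.get_loss()
print(learn.preds, learn.loss)
```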

So that's, at this point, this isn't doing anything that it wasn't doing before.

But the nice thing is now if you want to use HuggingFace Accelerate, or you want something

that works on HuggingFace-style data dictionaries, or whatever,

you can actually change exactly how it behaves by creating

a Callback for training. And if you want everything except one thing

to be the same, you can inherit from TrainCB. So this is, I've not tried this before,

I haven't seen this done anywhere else. So it's a bit of an experiment.

So I'm interested to hear how you go with it. And then finally, I thought it'd

be nice to have a progress bar. So let's create a progress callback.

And the progress bar is going to show on it our current loss and going to create a plot

of it. So I'm going to use a project that we created

called fastprogress, mainly created by the wonderful Sylvain.

And basically, fastprogress is a very nice way to create

very flexible progress bars. So let me show you what it looks like first.

So let's get the model and train. And as you can see, it actually in real

time updates the graph and everything. There you go.

That's pretty cool. So that's the progress bar, the metrics callback,

the device callback, and the training callback all in action.

So before we fit, we actually have to set self.learn.epochs.

Now that might look a little bit weird, but self.learn.epochs is the thing that we loop

through for self.epoch in. So we can change that.

So it's not just a normal range, but instead it is a progress bar around a range.

We can then check, remember I told you that the Learner is

going to have the metrics attribute applied? We can then say, oh, if the Learner has a

metrics attribute, then let's replace the _log method there with ours.

And our one instead will write to the progress bar.

Now this is pretty simple. It looks very similar to before, but we could

easily replace this, for example, with something that creates an HTML table, which is another

thing fastprogress does or other stuff like that.

So you can see, we can modify. The nice thing is we can modify

how our metrics are displayed. So that's a very powerful thing that Python

lets us do is actually replace one piece of code with another.

And that's the whole purpose of why the metrics callback had this _log separately.

So why didn't I just say print here? That's because this way classes can

replace how the metrics are displayed. So we could change that to like send them

over to Weights and Biases, for example, or, you know, create visualizations or so forth. So before_epoch, we do a very similar thing.
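The `_log` replacement trick does not depend on fastprogress, so it can be sketched without it. TableLogCB is a hypothetical callback that collects rows in a list instead of printing, and the MetricsCB here is cut down to the two methods that matter for this example.

```python
from types import SimpleNamespace

class MetricsCB:
    order = 0
    def before_fit(self): self.learn.metrics = self
    def _log(self, d): print(d)
    def after_epoch(self): self._log({'loss': '0.500', 'epoch': 0})

class TableLogCB:                     # hypothetical name
    order = MetricsCB.order + 1       # run after MetricsCB has registered itself
    def __init__(self): self.rows = []
    def before_fit(self):
        if hasattr(self.learn, 'metrics'):
            # Swap the metrics callback's _log for ours
            self.learn.metrics._log = self._log
    def _log(self, d): self.rows.append(d)

learn = SimpleNamespace()
metrics_cb, table_cb = MetricsCB(), TableLogCB()
metrics_cb.learn = table_cb.learn = learn
for cb in sorted([table_cb, metrics_cb], key=lambda cb: cb.order):
    cb.before_fit()
metrics_cb.after_epoch()              # nothing is printed; the row is collected
print(table_cb.rows)
```

The higher `order` matters: the replacement can only happen once the metrics callback has set `learn.metrics` in its own before_fit.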

The self.learn.dl iterator. We change it to have a

progress bar wrapped around it. And then after each bar, we set the

progress bar's comment to be the loss. It's going to show the loss on

the progress bar as it goes. And if we've asked for a plot, then we

will append the losses to a list of losses. And we will update the graph

with the losses and the batch numbers.

So there we have it. We have a nice working Learner, which is, I

think, probably the most flexible training loop that, I

hope, has ever been written. Because I think the fastai2 one was the most

flexible that had ever been written before. And this is more flexible.

And the nice thing is, you can make this your own. You know, you can, you know, fully

understand this training loop. So it's kind of like you can use a framework,

but it's a framework in which you're totally in control of it.

And you can make it work exactly how you want to. Ideally, not by changing the Learner

itself, ideally by creating callbacks. But if you want to, you could certainly, like,

look at that, the whole Learner fits on a single screen.

So you could certainly change that. We haven't added inference yet, although

that shouldn't be too much to add. I guess we have to do that at some point. Okay, now, interestingly, I love

this about Python, it's so flexible. When we said self.predict, self.get_loss,

I said if they don't exist, then it's going to use getattr, and it's going to

try to find those in the callbacks. And in fact, you could have multiple callbacks

that define these things, and then they would chain them together, which

would be kind of interesting. But there's another way we could make these

exist, which is that we could subclass this. So let's not use TrainCB, just

to show us how this would work. Instead, we're going to use a subclass.

So here I'm going to subclass Learner, and I'm going to override the five.

Well, it's not exactly overriding, I didn't have any definition of them before.

So I'm going to define the five directly in the Learner subclass.

So that way, it's never going to end up going to getattr, because getattr is only called

if something doesn't exist. So here, it's basically, all these five are

exactly the same as in our train callback, except we don't need self.learn anymore,

we can just use self because we're now in the

Learner. But I've changed zero_grad

to do something a bit crazy. I'm not sure if this has been done before,

I haven't seen it, but maybe it's an old trick that I just haven't come across.

But it occurred to me, zero_grad, which remember is the thing

that we call after we take the optimizer step, doesn't actually

have to zero the gradients at all. What if instead of zeroing the gradients,

we multiplied them by some number, like say 0.85?

Well, what would that do? Well, what it would do is it would mean that

your previous gradients would still be there, but they would be reduced a bit.

And remember, what happens in PyTorch is PyTorch always adds the gradients to the existing

gradients. And that's why we normally have to call zero_grad.

But if instead we multiply the gradients by some number, I mean, we should really make

this a parameter. Let's do that, shall we?

So let's create a parameter. So probably there's a few ways we could do this.

Well, let's do it properly. We've got a little bit of time.

So we could say, well, maybe I'll just copy

and paste all those over here. And we'll add momentum equals 0.85.

self.momentum equals momentum. And then super.

So make sure you call the super classes, passing in all the stuff. We could use delegates for this and kwargs.

That would be possibly another way of doing it. But let's just do this for now.

Okay. And then so there we wouldn't make it 0.85.

We would make it self.momentum. So you'll see now still trains, but there's

no TrainCB callback anymore in my list. I don't need one because I have defined

the five methods in the subclass. Now this training at the same learning

rate for the same time, the accuracy, let's run

them all. Yeah, this is a lot like

a gradient accumulation callback. Kind of cooler, I think. Okay.

So, let's see, the loss has gone from 0.8 to 0.55 and the accuracy has gone from about

0.7 to about 0.8. So they've improved.

Why is that? Well, we're going to be learning a lot

more about this pretty shortly, but basically what's happening here is we have just

implemented in a very interesting way, which I haven't seen done

before, something called momentum. And basically what momentum does is it say,

imagine you've got some kind of complex contour loss surface.

So imagine these are hills with a marble, very similar.

And your marble's up here. What would normally happen with gradient descent

is it would go in the direction downhill. Which is this way.

So it would go over here and then over here. Right?

Very slow. What momentum does is the first step's the same. And then the second step says, oh, I wanted

to go this way, but I'm going to add together the previous direction plus the new direction,

but reduce the previous direction a bit. So that would actually make me end up about here.

And then the next one does the same thing. And so momentum basically makes you much

more quickly go to your destination. So normally momentum is done, the reason I

did it this way, partly to show you, is just a bit of fun, a bit of interest, but it's

very useful because normally momentum, you have to store a complete copy basically of

all the gradients, the momentum version of the gradients, so that you can kind of keep

track of that running exponentially weighted moving average.

But using this trick, you're actually using the .grad attributes themselves to store the exponentially

weighted moving average. So anyway, there's a little bit of fun, which

hopefully, particularly those of you who are interested in accelerated optimizers and

memory saving might find a bit inspiring. All right.
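The trick described above can be sketched in plain Python, assuming (as in PyTorch) that each backward pass adds the new gradient onto whatever is already stored. The function name `accumulate` and the toy gradient values are illustrative assumptions, not code from the notebook.

```python
def accumulate(grads, momentum):
    """Simulate the stored gradient across batches when zeroing is replaced by scaling."""
    stored = 0.0
    history = []
    for g in grads:
        stored += g              # the backward pass adds the new gradient to what is stored
        history.append(stored)   # the optimizer step would use this value
        stored *= momentum       # instead of zeroing, keep a decayed copy for next time
    return history

grads = [1.0, 1.0, 1.0, 1.0]
plain = accumulate(grads, momentum=0.0)      # scaling by 0 is ordinary zeroing
with_mom = accumulate(grads, momentum=0.5)   # each step also sees the decayed history
print(plain)
print(with_mom)
```

No second buffer is needed here: the decayed history lives in the same variable that holds the gradient.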

There's one more callback I'm going to show before the

break, which is the wonderful learning rate finder. I'm assuming that anybody who's watching this

already is familiar with the learning rate finder from fastai.

If you're not, there's lots of videos and tutorials around about it.

It's an idea that comes from a paper by Leslie Smith from a few years ago.

And the basic idea is that we will increase the learning rate.

I should have put titles on this. The X axis here is learning rate.

The Y axis here is loss. We increase the learning rate gradually over

time and we plot the loss against the learning rate and we find how high can we bring the

learning rate up before the loss starts getting worse.

And you kind of want roughly where about the steepest

slope is, so probably here it'd be about 0.1. So it'd be nice to create a learning rate finder.

So here's a learning rate finder callback. So what a learning rate finder needs to do,

well, you have to tell it how much to multiply the learning rate by each batch.

Let's say we add 30% to the learning rate each batch.

So we'll store that. So before we fit, we obviously need to keep

track of the learning rates and we need to keep track of the losses because those

are the things that we put on a plot. The other thing we have to do is

decide when do we stop training. So when has it clearly gone off the rails?

And I decided that if the loss is three times higher than the minimum loss we've seen, then

we should stop. So we're going to keep track of the minimum loss.

And so let's just initially set that to infinity. It's a nice big number.

Well, not quite a number, but a number-ish like thing. So then after every batch, first of

all, let's check that we're training. If we're not training, then

we don't want to do anything. We don't use the learning

rate finder during validation. So here's a really handy thing.

Just raise CancelEpochException, and that stops it from doing that epoch entirely.

So just to see how that works, you can see here one_epoch

does with the callback context manager epoch, and that will

say, oh, it's got canceled. So it goes straight to the except, which is

going to go all the way to the end of that code and it's going to skip it. So you can see that we're using exceptions

as control structures, which is actually a really powerful programming technique that

is really underutilized, in my opinion. Like a lot of things I do, it's

actually somewhat controversial. Some people think it's a bad idea, but I find

it actually makes my code more concise and more maintainable and more powerful.

So I like it. So let's see.
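Here is a minimal sketch of exceptions used as control structures. The exception names come from the lesson; the `fit` and `one_epoch` functions below are simplified stand-ins (an assumption), not the Learner's actual code.

```python
class CancelEpochException(Exception): pass
class CancelFitException(Exception): pass

def one_epoch(batches, log):
    try:
        for b in batches:
            if b < 0: raise CancelEpochException()    # skip the rest of this epoch
            if b > 100: raise CancelFitException()    # stop the whole fit
            log.append(b)
    except CancelEpochException:
        log.append("epoch cancelled")

def fit(epochs, log):
    try:
        for batches in epochs:
            one_epoch(batches, log)
    except CancelFitException:
        log.append("fit cancelled")

log = []
fit([[1, 2], [3, -1, 4], [5, 999, 6], [7]], log)
print(log)
```

The inner loops need no flags or early returns. Raising the exception unwinds to whichever level handles it.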

So we've got our CancelEpochException. So then we're just going to keep

track of our learning rates. The learning rates, we're going to learn a

lot more about optimizers shortly, so I won't worry too much about this. But basically the learning rates are

stored by PyTorch inside the optimizer. And they're actually stored in

things called parameter groups. So don't worry too much about the details,

but we can grab the learning rate from that dictionary.

And we'll learn more about that shortly. We've got to keep track of the loss,

append it to our list of losses. And if it's less than the minimum we've

seen, then record it as the minimum. And if it's greater than three times

the minimum, then look at this. This is really cool.

CancelFitException. So this will stop everything

in a very nice, clean way. No need for lots of returns and

conditionals or stuff like that. Just raise the CancelFitException.

And then finally, we've got to actually update our learning rate to 1.3 times the previous

one. And so basically the way you do it in PyTorch

is you have to go through each parameter group and grab the learning rate in the

dictionary and multiply it by lr_mult. So yeah, you've already seen it run.
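The logic just described can be sketched without any training at all. In this sketch `fake_loss` is a made-up loss curve and `param_groups` is a plain list of dicts imitating where PyTorch optimizers keep the learning rate; both are assumptions for illustration only.

```python
import math

def fake_loss(lr):
    # Hypothetical loss curve: improves as lr grows, then blows up past lr = 1.
    return 1.0 / (1.0 + 10 * lr) + max(0.0, lr - 1.0) ** 2

def lr_find(start_lr=1e-4, lr_mult=1.3, max_batches=1000):
    param_groups = [{"lr": start_lr}]       # stand-in for the optimizer's parameter groups
    lrs, losses, min_loss = [], [], math.inf
    for _ in range(max_batches):
        lr = param_groups[0]["lr"]
        loss = fake_loss(lr)
        lrs.append(lr)
        losses.append(loss)
        if loss < min_loss: min_loss = loss
        if loss > min_loss * 3: break       # the callback raises CancelFitException at this point
        for g in param_groups: g["lr"] *= lr_mult
    return lrs, losses

lrs, losses = lr_find()
print(len(lrs), lrs[-1], losses[-1])
```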

And at the end of running, you will find that the callback will

now contain an lrs and a losses. So for this callback, I can't just

add it directly to the callback list. I need to instantiate it first.

And the reason I need to instantiate it first is because I need to be able to grab its learning

rates and its losses. And in fact, you know, we could grab

that whole thing and move it in here. There's no reason callbacks only have

to have the callback things, right? So we could do this. And now that's just going to become self. There we go.

And so then we can train it again and we could just call lrfind.plot.

So callbacks can really be, you know, quite self-contained nice things, as you can see.

So there's a more sophisticated callback and I think it's doing a lot of really nice stuff

here. You might've come across something in

PyTorch called learning rate schedulers. And in fact, we could implement this whole

thing with a learning rate scheduler. It won't actually save that much time.

But I just want to show you when you use stuff in PyTorch like learning rate schedulers,

you're actually using things that are extremely simple.

The learning rate scheduler basically does this one line of code for us.

So I'm going to now create a new LRFinderCB. And this time I'm going to use the PyTorch's

ExponentialLR scheduler, which is here. So this is coming.

It's interesting that actually the documentation of this is kind of actually wrong.

It claims that it decays the learning rate of each parameter group by gamma.

So gamma is just some number you pass in. I don't know why this has to be a Greek letter,

but it sounds more fancy than multiplying by an LR multiplier.

It says every epoch. But it's not actually done every epoch at all.

What actually happens is, in PyTorch, the schedulers have a step

method and the decay happens each time you call step.

And if you set gamma, which is actually lr_mult, to a number bigger than one, it's not a decay.

It's an increase. So the difference now, I guess I'll

copy and paste the previous version. Okay, so the previous version is on the top.

So the main difference here is that before_fit, we're going to create something called

a self.sched equal to the scheduler. And the scheduler, because it's going to be

adjusting the learning rates, it actually needs access to the optimizer. So we pass in the optimizer and

the learning rate multiplier. And so then in after_batch, rather than having

this line of code, we replace it with this line of code, self.sched.step().

So that's the only difference. And you know, I mean, we're not gaining much,

as I said, by using the PyTorch ExponentialLR scheduler, but I mainly wanted to do it so

you can see that these things like PyTorch schedulers are not doing anything magic.

They're just doing that one line of code for us. And so I run it again using this new version.
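To check that claim, this sketch compares the manual multiplication with `ExponentialLR` on a throwaway optimizer. The single dummy parameter and the starting learning rate of 0.001 are assumptions chosen only to make the example self-contained.

```python
import torch
from torch.optim.lr_scheduler import ExponentialLR

lr_mult = 1.3
params = [torch.nn.Parameter(torch.zeros(1))]

# Manual version: multiply the lr stored in every parameter group.
opt = torch.optim.SGD(params, lr=0.001)
manual_lrs = []
for _ in range(3):
    manual_lrs.append(opt.param_groups[0]["lr"])
    for g in opt.param_groups: g["lr"] *= lr_mult

# Scheduler version: gamma plays the role of lr_mult and is applied on each step() call.
opt = torch.optim.SGD(params, lr=0.001)
sched = ExponentialLR(opt, gamma=lr_mult)
sched_lrs = []
for _ in range(3):
    sched_lrs.append(opt.param_groups[0]["lr"])
    opt.step()      # called before sched.step() so PyTorch does not warn about ordering
    sched.step()

print(manual_lrs, sched_lrs)
```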

Oopsy-daisy. Oh, I forgot to run this line of code. There we go.

And I guess I should also add the nice little plot method.

Maybe we'll just move it to the bottom. There. lrfind.plot. There we go.

Put that one back to how it was. All right.

Perfect timing. So we added a few very important things in here.

So make sure we export. And we'll be able to use them shortly. All right.

Let's have an 8-minute break. Let's just have a 10-minute break.

So I'll see you back here at 8 past. All right.

Welcome back. One suggestion, which I really like,

is we could rename plot to after_fit, because that means we should be able

to then just call learn.fit and delete the next one.

And let's see. That didn't work.

Why not? Oh, no.

That doesn't work, does it? Because the... you know what?

I think the callback here could go into a finally block, actually. That would actually allow us to always

call the callback, even if we've canceled. I think that's reasonable.

That may have its own confusions. Anyway, we could try it for now, because

that would let us put this after_fit in. There we go.

So that automatically runs that.
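The effect of moving the "after" step into a finally block can be sketched like this. The `fit` function here is a simplified stand-in and the event strings are illustrative assumptions; the real Learner routes this through its callback machinery.

```python
class CancelFitException(Exception): pass

def fit(batches, events):
    try:
        events.append("before_fit")
        for b in batches:
            if b == "diverged": raise CancelFitException()
            events.append(b)
    except CancelFitException:
        pass                          # cancelling is not an error, just an early exit
    finally:
        events.append("after_fit")    # runs whether or not the fit was cancelled

events = []
fit(["b0", "b1", "diverged", "b3"], events)
print(events)
```

Without the finally block, a cancelled fit would never reach the after_fit step, which is why renaming plot to after_fit did not work at first.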

So that's an interesting idea. I think I quite like it.

Cool. So let's now look at notebook 10.

So I feel like this is the next big piece we need. So we've got a pretty good

system now for training models. What I think we're really missing, though, is

a way to identify how our models are training. And so to identify how our models are training,

we need to be able to look inside them and see what's going on while they train.

We don't currently have any way to do that. And therefore, it's very hard for

us to diagnose and fix problems. Most people have no way of

looking inside their models. And so most people have no way to

properly diagnose and fix models. And that's why most people, when they have a

problem with training their model, randomly try things until something

starts hopefully working. We're not going to do that.

We're going to do it properly. So we can import the stuff that

we just created in the learner. And the first thing I'm going to

introduce now is a set_seed function. We've been using torch.manual_seed before.

We know all about RNGs, random number generators. We've actually got three of them. PyTorch’s, NumPy's and Python's.

Let's seed all of them. But also in PyTorch, you can use a flag to

ask it to use deterministic algorithms so things should be reproducible.

As we discussed before, you shouldn't always just make things reproducible.

But for lessons, I think this is useful. So here's a function that lets

you set a reproducible seed. All right, let's use the same data set

as before, the Fashion-MNIST dataset. We'll load it up in the same way.

And let's create a model that looks very similar to our previous models.

This one might be a bit bigger, mightn't it? I didn't actually check.

Okay. So let's use MultiClassAccuracy again.

Same callbacks that we used before. We'll use the TrainCB version

for no particular reason. And generally speaking, we want to train as

fast as possible, not just because we don't like wasting time, but actually more importantly

because the higher the learning rate you train at, the more you're able to find, often,

a more generalizable set of weights. Training quickly also means that we can

look at each batch, each item in the data less

often. So we're going to have less

issues with overfitting. And generally speaking, if we can train at

a high learning rate, then that means that we're learning to train in a stable way.

And stable training is very good. So let's try setting up a high learning

rate of 0.6 and see what happens. So here's a function that's just going to

create our Learner with our callbacks and fit it and return the Learner

in case we want to use it. And it's training up and

then it suddenly fell apart. So it's going well for a while and

then it stopped training nicely. So one nice thing about this graph is that

we can immediately see when it stops training well, which is very useful. So what happened there?

Why did it go badly? I mean, we can guess that it might've been

because of our high learning rate, but what's really going on?

So let's try to look inside it. So one way to look inside it would be

we could create our own SequentialModel. Just like the sequential model we've built before.

Do you remember we created one using nn.ModuleList in a previous lesson?

If you've forgotten, go back and check that out. And when we call that model, we go through

each layer and just call the layer. And what we could do is we could add something

in addition, which is at each layer, we could also get the mean of that layer and the standard

deviation of that layer and append them to a couple of different lists and activation

means and activation standard deviations. This is going to contain, after we call this

model, it's going to contain the means and standard deviations for each layer.

And then we could define dunder iter, which makes this into an iterator, as being, let's

say just, oh, just when you iterate through this model, you can iterate through the layers.

So we can then train this model in the usual way. And this is going to give us exactly the same

outcome as before, because I'm using the same seed.

So you can see it looks identical. But the difference is instead

of using nn.Sequential, we've now used something that's actually saved

the means and standard deviations of each layer. And so therefore we can plot them. Okay, so here we've plotted the activation means.

And notice that we've done it for every batch. So that's why along the x-axis here we have

batch number and on the y-axis we have the activation means and then

we have it for each layer. The numbering starts at zero rather than one, so

the first layer here is blue, second layer is

orange, third layer green, fourth layer red, and fifth layer

is that kind of mauve-y color. And look what's happened.

The activations have started pretty small, close to zero, and have increased at an exponentially

increasing rate, and then have crashed, and then have increased again at an exponentially increasing

rate and crashed again. And increased again, crashed again.

And each time they've gone up, they've gone up even higher, and they've crashed, in this

case even lower. And what happens?

Well, what's happening here when our activations are really close to zero?

Well when your activations are really close to zero, that means that the inputs to each

layer are numbers very close to zero. As a result of which, of course,

the outputs are very close to zero. Because we're doing just matrix multiplies.

And so this is a disaster. When activations are very close

to zero, they're dead units. They're not able to do anything.

And you can see for ages here it's not training at all.

And this is the activation means. The standard deviations

tell an even stronger story. So generally speaking, you want the means

of the activations to be about zero, and the standard deviations to be about one.

Mean of zero is fine as long as they're spread around zero.

But a standard deviation of close to zero is terrible, because that means all of the

activations are about the same. So here, after batch 30, all of the activations

are close to zero, and all of the standard deviations are close to zero, so all the numbers

are about the same, and they're about zero. So nothing's going on.

And you can see the same things happening with standard deviations.

You start with not very much variety in the weights.

It exponentially increases how much variety there is, and then it crashes again.

Exponentially increases, crashes again. This is a classic shape of bad behavior.

And with these two plots, you can really understand what's going on in your model.

And if you train a model, and at the end of it, you kind of think, well, I wonder if this

is any good. If you haven't looked at this plot, you don't

know, because you haven't checked to see whether it's training nicely.

Maybe it could be a lot better. If you can get something, we'll see some nicer

training pictures later, but generally speaking, you want something where your mean is always

about zero, and your variance is always about one.

Standard deviation is always about one. And if you see that, then it's a pretty

good chance you're training properly. If you don't see that, you're most

certainly not training properly. Okay, so what I'm going to do in the rest

of this part of the lesson is explain how to do this in a more elegant way.

Because as I say, being able to look inside your models is such a critically important

thing to building and debugging models. We don't have to do it manually.

We don't have to create our own sequential model. We can actually use a PyTorch thing called hooks.

So as it says here, a hook is called when a layer that it's registered to is executed

during the forward pass. That's called a forward hook, or the backward

pass, and that's called a backward hook. And so the key thing about hooks is

we don't have to rewrite the model. We can add them to any existing model.

So we can just use standard nn.Sequential, passing in our layers, which were these ones

here. And so we're still going to have something

to keep track of the activation means and standard deviation.

So just create an empty list, for now, for each layer in the model. And let's create a little function that's

going to be called, because a hook is going to call a function during the forward

pass for a forward hook, or the backward pass for a backward hook.

So it's got a function called append_stats. It's going to be passed the hook number,

sorry, the layer number, the module, and the input

and the output. So we're going to be grabbing the output's

mean and putting it in activation means, and the output's standard deviation and putting

it in activation standard deviations. So here's how you do it.

We've got a model. You go through each layer of the model

and you call on it register_forward_hook. That's part of PyTorch.

And we don't need to write it ourselves because we already did, right?

It's just doing the same thing as this basically. And what function is always going to be called?

The function that's going to be called is the append_stats function passing in, remember

partial is the equivalent of saying append_stats passing in i as the first element, the first

argument. So if we now fit that model,

it trains in the usual way. But after each layer, it's going to call this. And so you can see we get

exactly the same thing as before. So one question we get here is what's the

difference between a hook and a callback? Nothing at all.

Hooks and callbacks are the same thing. It's just that PyTorch defines hooks and

they call them hooks instead of callbacks. They are less flexible than

the callbacks that we used in the Learner because you don't

have access to all the available state, and you can't change things.

But they're a particular kind of callback. It's just setting a piece of code that's

going to be run for us when something happens. And in this case, there's something that happens

is that either a layer in the forward pass is called or a layer in the

backward pass is called. I guess you could describe the function that's

being called back as the callback and the thing that's doing the callback as the hook. I'm not sure if that level of distinction

is important, but maybe you could do that. So anyway, this is a little bit fussy of creating

globals and appending to them and stuff like that.
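The globals-based version described above can be sketched as follows. The small model and the random input are assumptions standing in for the lesson's network and a real training run.

```python
from functools import partial
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
act_means = [[] for _ in model]   # global lists, one per layer
act_stds = [[] for _ in model]

def append_stats(i, mod, inp, outp):
    # i is bound by partial; PyTorch supplies the module, its input and its output.
    act_means[i].append(outp.detach().mean().item())
    act_stds[i].append(outp.detach().std().item())

handles = [layer.register_forward_hook(partial(append_stats, i))
           for i, layer in enumerate(model)]

model(torch.randn(32, 8))         # hooks fire once per layer
for h in handles: h.remove()      # stop recording
model(torch.randn(32, 8))         # nothing is appended this time
print(act_means)
```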

So let's try to simplify this a little bit. So what I did here was I

created a class called Hook. So this class, when we create it, we're going

to pass in the module that we're hooking. So we call m.register_forward_hook

and we call the function. We pass the function that we want to be given.

And so here's the, we pass the function and we're also going to pass in the Hook class

to the function. Let's also define a remove because this is

actually the thing that, this is actually the thing that removes the hook.

We don't want it sitting around forever. This is called, dunder del is called

by Python when an object is freed. So when that happens, we should

also make sure that we remove this. Okay.

So append_stats, now we're going to replace it. It's going to get passed the Hook instead,

because that's what we asked to be passed. And if there's no .stats attribute

in there yet, then let's create one. And then we're going to be passed the activations. So put that on the CPU and append

the mean and the standard deviation. And now the nice thing is that the stats are

actually inside this object, which is convenient. So now we can do exactly the same thing as

before, but we don't have to set any of that global stuff or whatever.

We can just say, okay, our hooks is a Hook with that layer and that function for all

of the model's layers. And so just calling it has called

register_forward_hook for us. So now when we fit that, it's

going to run with the hooks. There we go.
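A sketch of a Hook class consistent with the description above. The tiny model and random input are assumptions, and the stats are stored as a pair of lists on the Hook object.

```python
from functools import partial
import torch
from torch import nn

class Hook:
    def __init__(self, m, f):
        # partial(f, self) means f receives this Hook as its first argument.
        self.hook = m.register_forward_hook(partial(f, self))
    def remove(self): self.hook.remove()
    def __del__(self): self.remove()

def append_stats(hook, mod, inp, outp):
    if not hasattr(hook, "stats"): hook.stats = ([], [])
    acts = outp.detach().cpu()
    hook.stats[0].append(acts.mean().item())
    hook.stats[1].append(acts.std().item())

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
hooks = [Hook(layer, append_stats) for layer in model]
model(torch.randn(32, 8))
for h in hooks: h.remove()
print([h.stats[0] for h in hooks])
```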

It trains. We need to do it too. So then it trains and we get exactly the same

shape as usual and we get back the same results as usual.

But as we can see, we're gradually making this

more convenient, which is nice. So we can make it nicer still because generally

speaking, we're going to be adding multiple hooks and this stuff of, you

know, this list comprehension, whatever, it's a bit inconvenient.

So let's create a Hooks class. So first of all, we'll see how

the Hooks class works in practice. So in the Hooks class, the way we're going

to use it is we're going to call with Hooks, pass in the model, pass in the function to

use as our hook, and then we'll fit the model and that's it.

It’s going to be literally just one extra line of code to set up the whole thing.

And then when we then… you can then go through each Hook and plot the mean and standard deviation

of each layer. So that's how the Hooks class

is going to make things much easier. So the Hooks class, as you can see, we're

using a, making it a context manager. And we want to be able to loop through it.

We want to be able to index into it. So it's quite a lot of behavior we want, believe

it or not, all that behavior is in this tiny little thing.

And we're going to use the most flexible general way of creating context managers.

Now context managers are things that we can say “with”. The general way of creating a context

manager is to create a class and define two special things dunder enter and dunder exit.

dunder enter is a function that's going to be called when it hits the with statement.

And if you add an as blah after it, then the contents of this variable will be whatever

is returned from dunder enter. And as you can see, we just

returned the object itself. So the Hooks object is

going to be stored in hooks. Now interestingly, the Hooks

class inherits from list. You can do this, you can actually

inherit from stuff like list in Python. So a Hooks, the Hooks object is a list.

And therefore we need to call the superclass's constructor.

And we're going to pass in a, that list comprehension we saw,

that list of hooks, where it's going to hook into each module in the list

of modules we asked to hook into. Now we're passing in a model here, but because

the model is an nn.Sequential, you can actually loop through an nn.Sequential and

it returns each of the layers. So this is actually very, very

nice and concise and convenient. So that's the constructor.

dunder enter just returns it. dunder exit is what's called automatically

at the end of the whole block. So when this whole thing's finished,

it's going to remove the hooks. And removing the hooks is just going

to go through each Hook and remove it. The reason we can do “for h in self”

is because, remember, this is a list. And then finally we've got

a dunder del like before. And I also added a dunder delitem.

This is the thing that lets you delete a single Hook from the list which will remove that

one Hook and call the list's __delitem__. So there's our whole thing.
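Putting those pieces together, here is a sketch of a list-based context manager for hooks. The `record_mean` function and the small model are assumptions for illustration; the Hook class is repeated so the block runs on its own.

```python
from functools import partial
import torch
from torch import nn

class Hook:
    def __init__(self, m, f): self.hook = m.register_forward_hook(partial(f, self))
    def remove(self): self.hook.remove()
    def __del__(self): self.remove()

class Hooks(list):
    def __init__(self, ms, f): super().__init__([Hook(m, f) for m in ms])
    def __enter__(self, *args): return self
    def __exit__(self, *args): self.remove()
    def __del__(self): self.remove()
    def __delitem__(self, i):
        self[i].remove()              # unregister that one hook first
        super().__delitem__(i)        # then do the normal list deletion
    def remove(self):
        for h in self: h.remove()     # works because self is a list of Hook objects

def record_mean(hook, mod, inp, outp):
    if not hasattr(hook, "stats"): hook.stats = []
    hook.stats.append(outp.detach().mean().item())

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
with Hooks(model, record_mean) as hooks:
    model(torch.randn(32, 8))         # recorded
model(torch.randn(32, 8))             # not recorded: hooks were removed on exit
print([h.stats for h in hooks])
```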

This one's optional. This is the one that lets us remove a

single Hook rather than all of them. So let's just understand some

of what's going on there. So here's a dummy context manager.

As you can see here, it's got a dunder enter, which is going to return itself and it's going

to print something. So you can see here I call with DummyCtxMgr.

And so therefore it prints “let's go!” first. The second thing it's going to do is call

this code inside the context manager. So we've got “as dcm”, so that's itself.

And so it's going to call hello, which prints “hello.”

So here it is. And then finally it's going to automatically

call exit, dunder exit, which is “all done!”. So here's “all done!”.
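A dummy context manager along the lines just described might look like this. It is a sketch that uses the same printed messages, not necessarily the notebook's exact cell.

```python
class DummyCtxMgr:
    def __enter__(self, *args):
        print("let's go!")
        return self                   # this is what "as dcm" receives
    def __exit__(self, *args):
        print("all done!")            # called automatically when the block ends
    def hello(self):
        print("hello.")

with DummyCtxMgr() as dcm:
    dcm.hello()
```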

So again, if you haven't used context managers before, you want to be creating little samples

like this yourself and getting them to work. So this is your key homework for this week.

For anything in the lesson where we're using a part of Python you're not a hundred percent

familiar with, create from scratch some simple kind of dummy version

that fully explores what it's doing. If you're familiar with all the

Python pieces, then

do the same thing with the PyTorch pieces, like hooks and

so forth. And so I just wanted to show you also

what it's like to inherit from list. So here I'm inheriting from list and

I could redefine how dunder delitem works. So now I can create a DummyList and it looks

exactly the same as usual, but now if I delete an item from the list, it's going

to call my overridden version and then it will call

the original version. And so the list has now had that

item removed, and it did this at the same time. So you can see, you can actually, yeah, modify

how Python works or create your own things that get all the behavior or the convenience

of Python classes like this one and add stuff to them.

So that's what's happening there. Okay.
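A small sketch of inheriting from list and overriding deletion. The printed message is an illustrative assumption.

```python
class DummyList(list):
    def __delitem__(self, i):
        print(f"Say bye to item {i}")   # our extra behavior
        super().__delitem__(i)          # then the original list behavior

dml = DummyList([1, 3, 2])
del dml[2]
print(dml)
```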

So that's our hooks class. So the next bit was developed, largely developed,

the last time I think it was that we did a Part 2 course in San Francisco with Stefano.

So many thanks to him for helping get this next bit looking great.

We're going to create my favorite single image explanations of

what's going on inside a model. We call them the colorful

dimension, and they're histograms. We're going to take our same append_stats.

These are all the same as before. We're going to add an extra line of code,

which is to get a histogram of the absolute values of the activations.

So a histogram, to remind you, is something that takes a collection of numbers and tells

you how frequent each group of numbers is. And we're going to create

50 bins for our histogram. So we will use our hooks that we just

created, and we're going to use this new version of

append_stats. So it's going to train as before, but now

we're going to, in addition, have this extra thing in stats, which is

going to contain a histogram. And so with that, we're now going

to create this amazing plot. Now what this plot is showing

is for the first, second, third, and fourth layers, what does

the training look like? And you can immediately see the basic idea

is that we're seeing the same pattern. But what is this pattern showing?

What exactly is going on in these pictures? So I think it might be best if we try and draw a picture of this.

So let's take a normal

histogram where we basically have grouped all the data into bins, and then we have counts of

how much is in each bin. So for example, this will be like the value

of the activations, and it might be, say, from 0 to 10, and then from

10 to 20, and from 20 to 30. And these are generally equally spaced bins.

Okay. And then here is the count.

So that's the number of items with that range of values.

So this is called a histogram. Okay.
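As a concrete sketch of that binning, here is a plain-Python histogram over equally spaced bins. The function name `histogram` and the sample activation values are assumptions; the lesson itself computes histograms on tensors.

```python
def histogram(values, bins, lo, hi):
    """Count how many values fall in each of `bins` equal-width ranges over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), bins - 1)   # the top edge goes in the last bin
            counts[idx] += 1
    return counts

acts = [-0.2, 0.1, 0.4, 3.5, 12.0, 18.0, 25.0, 29.9]
abs_acts = [abs(a) for a in acts]                  # the lesson bins absolute values
counts = histogram(abs_acts, bins=3, lo=0, hi=30)  # bins: 0 to 10, 10 to 20, 20 to 30
print(counts)
```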

So what Stefano and I did was we actually turned that histogram, that whole histogram, into a single column of pixels.

So if I take one column of pixels, that's actually one histogram.

And the way we do it is we take these numbers. So let's say it's like 14, that

one's like 2, 7, 9, 11, 3, 2, 4, 2. And so then what we do is we

turn it into a single column. And so in this case we've got 1, 2, 3, 4, 5, 6, 7, 8, 9 groups, right?

So we would create our 9 groups. Sorry, they were meant to be evenly

spaced, but I did not do a very good job. Got our 9 groups.

And so we take the first group, it's 14. And what we do is we color it with a gradient

and a color according to how big that number is.

So 14 is a real big number. So depending on what gradient we

use, maybe red's really, really big. And the next one's really small,

which might be like green. And then the next one's quite big

in the middle, which is like blue. Next one's getting quite, quite bigger still.

So maybe it's just a little bit, sorry, should go back to red.

Go back to more red. Next one's bigger still, so it's even more red, and so forth.

So basically we're taking the histogram and turning it into a color-coded single-column

plot, if that makes sense. And so what that means is that at the

very, so let's take layer number two here. Layer number two, we can

take the very first column. And so in the color scheme that

actually Matplotlib's picked here, yellow is the most common and

then light green is less common. And then light blue is less

common and then dark blue is 0. So you can see the vast majority is 0 and

there's a few with slightly bigger numbers, which is exactly the same as what

we saw for the layer at index one. Here it is, right?

The average is pretty close to 0.

The standard deviation is pretty small. This is giving us more information, however.

So as we train at this point here, there are quite a few activations that are a lot larger,

as you can see. And still the vast majority

of them are very small. There's a few big ones, they've still

got a bright yellow bar at the bottom. The other thing to notice here is what's happened

is we've taken those stats, those histograms, we've stacked them all up into a single

tensor, and then we've taken their log. Now log1p is just log of the number plus one.

That's because we've got zeros here. And so just taking the log is going to kind of let us see the full range more clearly.

So that's what the log's for. So basically what we'd really ideally like

to see here is that this whole thing should be a kind of more like a rectangle.

The maximum should be not changing very much. There shouldn't be a thick yellow bar at the

bottom, but instead it should be a nice even gradient matching a normal distribution.

Each single column of pixels wants to be kind of like a normal distribution, so gradually

decreasing the number of activations. That's what we're aiming for.

There's another really important and actually easier to read version of this, which is what

if we just took those first two bottom pixels and counted up how

many activations were in them?

In the bottom two pixels, we've got the smallest

two equally sized groups of activations. We don't want there to be too many of them

because those are basically dead or nearly dead activations.

They're much, much, much smaller than the big ones.

And so taking the ratio between those bottom two groups and the total basically tells us

what percentage have zero or near zero or extremely small magnitudes.

And remember that these are with absolute values. So if we plot those, you can see how bad this is.
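The ratio just described can be sketched in plain Python. The function name `dead_ratio` and the two example histograms are made up for illustration; each row stands for one batch's histogram of absolute activation values, lowest bin first.

```python
def dead_ratio(hist_per_batch):
    """For each batch, the share of activations that fall in the two lowest bins."""
    return [sum(h[:2]) / sum(h) for h in hist_per_batch]

hists = [
    [10, 10, 40, 30, 10],   # healthier: most activations are away from zero
    [70, 20, 5, 3, 2],      # unhealthy: most activations sit in the bottom two bins
]
ratios = dead_ratio(hists)
print(ratios)
```

A ratio near 1 means almost every activation in that layer is zero or close to it for that batch.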

And in particular, for example, at the final layer, nearly from the very start, really,

nearly all of the activations are just about entirely disabled.

So this is bad news. And if you've got a model where most of

your model is close to 0, then most of your model is doing no work.

And so it's really not working. So it may look like at the very

end, things were improving. But as you can see from

this chart, that's not true. The vast majority are still inactive.

Generally speaking, I've found that if early in training you see this rise and crash, rise and

crash pattern at all, you should stop and restart training because your model will probably

never recover. Too many of the activations

have gone off the rails. So we want it to look kind of like this the

whole time, but with less of this very thick yellow bar, which is showing us most are inactive. Okay.

So that's our activations. So we've got really now all of the kind of key

pieces I think we need to be able to flexibly change how we train models and to understand

what's going on inside our models. And so from this point, we've kind of like

drilled down as deep as we need to go and we can now start to come back up again and

put together the pieces, building up what are all of the things that are going to

help us train models reliably and quickly. And then hopefully we're going to be able to

successfully create from scratch some really high quality generative models

and other models along the way. Okay.

I think that's everything for this class.

But next class, we're going to start looking at things like initialization.

It's a really important topic. If you want to do some revision before then

just make sure that you're very comfortable with things like standard deviations and stuff

like that because we're using that quite a lot for next time.

And yeah, thanks for joining me. Look forward to the next lesson.

See you again.
