Lesson 15: Deep Learning Foundations to Stable Diffusion

Jeremy Howard

Full Transcript

Hi all and welcome to Lesson 15. And what 

we're going to endeavor to do today is to create a convolutional autoencoder. 

And in the process, we will see why doing that well is a tricky thing to do and 

time permitting, we will begin to work on a framework, a deep learning framework 

to make life a lot easier. Not sure how far we'll get on that today time wise. So 

let's see how we go and get straight into it. Okay. So today, let's start by talking before 

we can create a convolutional autoencoder, we need to talk about convolutions and what are 

they and what are they for. Broadly speaking, convolutions are something that allows us to 

tell our neural network a little bit about the structure of the problem. That's going to make 

it a lot easier for it to solve the problem. And in particular, the structure of our problem 

is we're doing things with images. Images are laid out on a grid, a 2d grid for black and white 

or a 3d for color or a 4d for a color video or whatever. And so we would say, you know, there's 

a relationship between the pixels going across and the pixels going down. They tend to be similar 

to each other, differences in those pixels across those dimensions tend to have meaning. Sets, 

patterns of pixels that appear in different places often represent the same thing. So for 

example, a cat in the top left is still a cat, even if it's in the bottom right. These kinds 

of, this kind of prior information is something that is naturally captured by a Convolutional 

Neural Network, something that uses convolutions. Generally speaking, this is a good thing 

because it means that we will be able to use less parameters and less computation because 

more of that information about the problem we're solving is kind of encoded directly into our 

architecture. There are other architectures that don't encode that prior information as strongly, 

such as a Multi-Layer Perceptron, which we've been looking at so far, or a Transformers network, 

which we haven't looked at yet. Those kinds of architectures could potentially give us, or they 

do give us more flexibility and given enough time, compute and data, they could potentially find 

things that maybe CNNs would struggle to find. So we're not always going to use 

Convolutional Neural Networks, but they're a pretty good starting point and 

certainly something important to understand. They're not just used for images. We can also 

take advantage of one-dimensional convolutions for language-based tasks, for instance. 

So convolutions come up a lot. So in this notebook, one thing you'll 

notice that might be of interest is we are importing stuff from miniai now. Now 

miniai is this little library that we're starting to create and we're creating it 

using nbdev. So we've got a miniai.training and a miniai.datasets. And so if we look, 

for example, at the datasets notebook, it starts with something that says that the 

default export is called datasets. And some of the cells have an export directive on them. 

And at the very bottom, we have something that calls nbdev_export(). Now what that's going to do 

is it's going to create a file called datasets.py just here, datasets.py. And it contains those cells that we exported. And why does it, why is it called miniai.datasets? 

That's because everything for nbdev is stored in settings.ini. And there's something here 

saying create a library libname called miniai. You can't use this library until you install it. 

Now we haven't uploaded it to PyPI, like made it a pip installable package from a public server. 

But you can actually install a local directory as if it's a Python module that you've kind of 

installed from the internet. And to do that, you say pip install, in the usual way, but 

you say -e, that stands for editable. And that means set up the current directory as 

a Python module. Well, current directory, actually any directory you like, I just put 

dot to mean the current directory. And so you'll see that's going to go ahead and actually 

install my library. And so after I've done that, I can now import things from 

that library as you see. Okay. So this is just the same as before. We're 

going to grab our MNIST dataset and we're going to create a Convolutional Neural Network on 

it. So before we do that, we're going to talk about what are convolutions. And one of my 

favorite descriptions of convolutions comes from a student in our, I think it was our very 

first course, Matt Kleinsmith, who wrote this really nice Medium article, CNNs from different 

viewpoints, which I'm going to steal from. And here's the basic idea. Say that this is our image. 

It's a 3 by 3 image with 9 pixels labeled from A to J as capital letters. Now a convolution 

uses something called a kernel and a kernel is just another tensor. In this case, it's a 2 

by 2 matrix again. So it's, I mean, this one's, we're going to have alpha, beta, gamma, 

delta, alpha, beta, gamma, delta as our four values in this convolution. Now in this 

kernel, oh, now one thing I'll mention, I can't remember if I've said this before, is the 

Greek letters are things that you want to be able to, I think I have mentioned this, you want to 

be able to pronounce them. So if you don't know how to read these and say what these names are, 

make sure you head over to Wikipedia or whatever, and learn the names of all the Greek letters so 

that you can, cause they come up all the time. Okay. So what happens when we apply a convolution 

with this 2 by 2 kernel to this 3 by 3 image? I mean, it doesn't have to be an image 

it's in this case, it's just a rank 2 tensor, but it might represent an image. What happens 

is we take the kernel and we overlay it over the first little 2 by 2 sub grid, like so. And 

specifically what we do is we match color to color. So the output of this first 2 by 2 

overlay would be alpha times a plus beta times b plus gamma times d plus delta times 

e, and that would yield some value P, and that's going to end up in the top left of a 2 by 

2 output. So the top right of the 2 by 2 output, we're going to slide, it's like a sliding window, 

we're going to slide our kernel over to here and apply each of our coefficients to 

these respectively colored squares. And then ditto for the bottom left 

and then ditto for the bottom right. So we end up with this equation. P, as 

we discussed, is alpha a plus beta b plus gamma d plus delta e, plus some bias 

term. Q, to the top right, as you can see, it's just alpha, in this case, times b. And 

so we just take multiplying them together and adding them up, multiply together, add 

them up, multiply together and add them up. So we're basically, you can imagine that 

we're basically flattening these out into rank 1 tensors into vectors and then doing a dot 

product would be one way of thinking about what's happening as we slide this kernel over these 

windows. And so this is called a convolution. So let's try and create a convolution. So, 

for example, let's grab our training images. And take a look at one. And let's create a 3 by 3 kernel. So remember, 

a kernel is just ‒we've already‒ kernel appears a lot of times in computer science and math. 

We've already seen the term kernel to mean a piece of code that we run on a GPU across lots 

of parallel kind of virtual devices or potentially in a grid. There's a similar idea here. We've 

got a computation, which is in this case, kind of this dot product or something like a dot product 

sliding over, occurring lots of times over a grid. But it's yeah, it's a bit different. 
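
As a concrete version of the sliding multiply-and-add described above, here is a minimal pure-Python sketch. The numbers in `image` and `kernel` are made up for illustration (they stand in for the lettered pixels and the four Greek-letter coefficients), and `conv2d_naive` is a hypothetical helper name, not something from the lesson notebook.

```python
# Slide a 2x2 kernel over a 3x3 image (no padding, stride 1).
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]      # illustrative pixel values
kernel = [[1, 2],
          [3, 4]]        # illustrative alpha, beta, gamma, delta

def conv2d_naive(image, kernel, bias=0):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # element-wise multiply the window by the kernel, then sum
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(kh) for dj in range(kw))
            row.append(total + bias)
        out.append(row)
    return out

result = conv2d_naive(image, kernel)
print(result)  # [[37, 47], [67, 77]]
```

The top-left output, 37, is 1*1 + 2*2 + 3*4 + 4*5, which is the same pattern as alpha a plus beta b plus gamma d plus delta e.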

That's kind of another use of the word kernel. So in this case, a kernel is a, in 

this case, it's going to be a rank 2 tensor. And so let's create a kernel with these values 

in the 3 by 3 matrix, rank 2 tensor. And we could draw what that looks like. Not surprising, 

it just looks like a bunch of lines. Oops. OK, so what would happen if we slide this over 

just these nine pixels over this 28 by 28? Well, what's going to happen is 

if we've got some‒ the top left, for example, 3 by 3 section has these names, 

then we're going to end up with negative a1, because the top three are all negative. Right? 

Negative a1, minus a2, minus a3. The next are just zero. So that won't do anything. And then plus 

a7, plus a8, plus a9. Why is that interesting? That's interesting. Well, let's try 

here. What I've done here is I've grabbed just the first 13 rows and first 23 columns of 

our image, and I'm actually showing the numbers and also using gray kind of conditional 

formatting, if you like, or the equivalent in pandas to show this top bit. So we're looking at 

just this top bit. So what happens if we take rows 3, 4 and 5? Remember, this is not inclusive, 

right? So it's rows 3, 4 and 5, columns 14, 15, 16… 14, 15, 16. So we're looking at this, these 

three here. What's that going to give us if we multiply it by this kernel? It gives us a fairly 

large positive value because the three that we have negatives on is the top row. Well, they're 

all zero. And the three that we have positives on, they're all close to 1. So we end up with 

quite a large number. What about the same columns but for rows 7, 8, 9? 7, 8, 9. Here, the 

top is all positive and the bottom is all zero. So that means that we're going to get a lot of 

negative terms. And not surprisingly, that's exactly what we see. If we do this, kind of a dot 

product equivalent, which all you need in NumPy to do that is just an element-wise multiplication 

followed by a sum, right? So that's going to be quite a large negative number. And so perhaps 

you're seeing what this is doing. And maybe you got a hint from the name of the tensor we created. 

It's something that is going to find the top edge. So this one is a top edge, so it's a positive. 
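
To make the sign of the response concrete, here is a small sketch using synthetic 3 by 3 windows rather than the real MNIST pixels. The kernel values follow the description above (negative top row, zero middle row, positive bottom row); the two windows are invented for illustration.

```python
import torch

# Kernel with -1 on the top row and +1 on the bottom row, as described above.
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])

# Synthetic windows standing in for patches of a digit.
top_of_stroke = torch.tensor([[0., 0., 0.],
                              [1., 1., 1.],
                              [1., 1., 1.]])     # dark above, bright below
bottom_of_stroke = torch.tensor([[1., 1., 1.],
                                 [1., 1., 1.],
                                 [0., 0., 0.]])  # bright above, dark below

# Element-wise multiplication followed by a sum.
top_response = (top_of_stroke * top_edge).sum()
bottom_response = (bottom_of_stroke * top_edge).sum()
print(top_response, bottom_response)  # tensor(3.) tensor(-3.)
```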

And this one is a bottom edge, so it's a negative. So we would like to apply that, this kernel, 

to every single 3 by 3 section in here. So we could do that by creating a little 

apply_kernel function that takes some particular row and some particular column and some particular 

tensor as a kernel and does that multiplication .sum() that we just saw. So 

for example, we could replicate this one by calling apply_kernel(). And this 

here is the center of that 3 by 3 grid area. And so there's that same number, 2.97. So now 

we could apply that kernel to every one of the 3 by 3 sections. 3 by 3 windows in this 28 by 28 

image. So we're going to be sliding over like this red bit sliding over here. But we've actually 

got a 28 by 28 input, not just a 5 by 5 input. So to get all of the coordinates, let's just 

simplify it to do 5 by 5. We can go, we can create a list comprehension. We can take 

i through every value in range(5). And then for each of those, we can take j for every value in 

range(5). And so if we just look at that tuple, you can see we get a list of lists containing 

all of those coordinates. So this is a list comprehension in a list comprehension, which when 

you first see it may be surprising or confusing, but it's a really helpful idiom. And I 

certainly recommend getting used to it. Now, what we're going to do is we're not just 

going to create this tuple, but we're actually going to call apply_kernel() for each of those. So 

if we go through from 1 to 27, well, actually 1 to 26, because 27 is exclusive. So we're going to go 

through everything from 1 to 26. And then for each of those, go through from 1 to 26 again and call 

apply_kernel(). And that's going to give us the result of applying that convolutional kernel to 

every one of those coordinates. And there's the result. And you can see what it's done as we 

hoped is it is highlighting the top edges. So yeah, you might find that kind of surprising 

that it's that easy to do this kind of image processing. We're literally just doing an element 

wise multiplication and a sum for each window. Okay, so that is called a convolution. So we 

can do another convolution. This time we could do one with the left edge tensor. So as you 

can see, it looks just a rotated version or transposed version, I guess, of our top 

edge tensor, here's what it looks like. And so if we apply that kernel, so this time 

we're going to apply the left edge kernel. And so notice here that we're 

actually passing in a function, right? We're passing in a function. Sorry, not 

a function, is it? It's just a tensor actually. So we're going to pass in the left edge tensor for the same list comprehension, in 

a list comprehension. And this time we're getting back the left edges. It's 

highlighting all of the left edges in the digit. So yeah, this is basically what's happening here 

is that a 2 by 2 can be looped over an image, creating these outputs. Now you'll see here that in the process 

of doing so, we are losing the outermost pixels of our image. We'll learn about 

how to fix that later. But just for now, notice that as we are putting our 3 by 3 through, 

for example, in this 5 by 5 , there's only one, two, three places that we can put it going across, 

not five places because we need some kind of edge. All right, so that's cool. That's 

a convolution. And hopefully if you remember back to kind of the Zeiler 

and Fergus pictures from Lesson 1, you might recognize that the kind of first layer 

of a convolutional network is often looking for kind of edges and gradients and things like 

that. And this is how it does it. And then the convolutions on top of convolutions 

with nonlinear activations between them can combine those into curves, or corners 

or stuff like that, and so on and so forth. Okay, so how do we do this quickly? Because 

currently this is going to be super, super slow doing this in Python. So one of the very earliest 

or probably the earliest publicly available general purpose deep learning, GPU accelerated 

deep learning thing I saw was called Caffe. That was created by somebody called Yangqing Jia. 

And he actually described what happened, how Caffe went about implementing a fast convolution on a 

GPU. And basically he said, well, I had two months to do it and I had to finish my thesis. And so I 

ended up doing something where I said, well, there was some other code out there, Krizhevsky, who you 

might have come across, him and Hinton set up a little startup which Google bought, and that kind 

of became the start of Google's deep learning, Google Brain, basically. And so Krizhevsky 

had all this fancy stuff in his library, but Yangqing Jia said, oh, I didn't know how to 

do all that stuff. So I said, well, I already know how to multiply matrices, so maybe I can convert 

a convolution into a matrix multiplication. And so that became known as im2col. im2col is a way of converting a convolution 

into a matrix multiply. And so actually, I don't know if, I suspect Yangqing Jia kind of 

accidentally reinvented it, because it actually had been around for a while, even at the point 

that he was writing his thesis, I believe. So it was actually, this is the place 

I believe it was created in this paper. So that was in 2006, which is a while ago. 

And so this is actually from that paper. And what they describe is, let's 

say you are putting this 2 by 2 kernel over this 3 by 3 bit of an image. So 

here you've got this window needs to match to this bit of this window. What you could 

do is you could unwrap this to 1, 2, 1, 2 downwards to here. 1, 2, 1, 2, to unroll it 

like so. And you could unroll the kernel here. Yeah, so this is 1, 2, 1, 1. So this bit is here, 

1, 2, 1, 1. And then you could unroll the kernel 1, 1, 2, 2 to here, 1, 1, 2, 2. And then once 

they've been flattened out and moved in that way, and then you'll do exactly the same 

thing for this next patch here, 2, 0, 1, 3. You flatten it out and put it here, 2, 

0, 1, 3. So if you basically take those kernels and flatten them out in this format, then you 

end up with a matrix multiply. If you multiply this matrix by this matrix, you'll end up with 

the output that you want from the convolution. So this is basically a way of unrolling your kernels and your input features into 

matrices, such that when you do the matrix multiply, you get the right answer. So it's a kind of 

a nifty trick. And so that is called im2col. I guess we're kind of cheating a little bit. 
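
For a feel of what the unrolling does, here is a toy sketch for a single-channel 2D input. It is not the linked NumPy implementation and not what Caffe does internally; `im2col` here is a hypothetical helper that simply copies each window into a column with a Python loop.

```python
import torch

def im2col(x, kh, kw):
    """Unroll every kh x kw window of a 2D tensor into one column."""
    h, w = x.shape
    cols = [x[i:i + kh, j:j + kw].reshape(-1)
            for i in range(h - kh + 1)
            for j in range(w - kw + 1)]
    return torch.stack(cols, dim=1)   # shape: (kh*kw, number_of_windows)

x = torch.arange(9.).reshape(3, 3)          # toy 3x3 "image" holding 0..8
k = torch.tensor([[1., 2.], [3., 4.]])      # toy 2x2 kernel

cols = im2col(x, 2, 2)                      # shape (4, 4)
out = (k.reshape(-1) @ cols).reshape(2, 2)  # matrix multiply, then fold back
print(out)
# tensor([[27., 37.],
#         [57., 67.]])
```

Each column of `cols` is one flattened window, so the single matrix multiply computes every window's dot product with the flattened kernel at once.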

Implementing that is kind of boring. It's just a bunch of copying and tensor manipulation. So I 

actually haven't done it. Instead, I've linked to a NumPy implementation, which is here. And 

it also, part of it is this get_indices(), which is here. And as you can see, it's a little 

bit tedious with repeats and tiles and reshapes and whatnot. So I'm not going to call it homework, 

but if you want to practice your tensor indexing manipulation skills, try creating a PyTorch 

version from scratch. I got to admit, I didn't bother. Instead, I used the one that's built 

into PyTorch. And in PyTorch, it's called unfold. So if we take our image and PyTorch expects 

there to be a batch axis and a channel dimension. So we'll add two unit leading 

dimensions to it. Then we can unfold our input for a 3 by 3. And that will give us a 9 by 

676 input. And so then we can take that… and then we'll take our kernel and just flatten it 

out into a vector. So view() changes the shape and -1 just says dump everything into this dimension. 

So that's going to create a 9 long vector, length 9 vector. And so now we can do the matrix multiply 

just like they've done here of the kernel matrix. That's our weights by the unrolled input features. And so that gives us a 676 long. We can then view 

that as 26 by 26 and we get back as we hoped our left edge tensor result. And so this is how 

we can kind of from scratch create a better implementation of convolutions. The reason I'm 

cheating, I'm allowed to cheat here is because we did actually create convolutions from 

scratch. We're not always creating the GPU optimized versions from scratch, which was never 

something I promised. So I think that's fair. But it's cool that we can kind of hack it 

out a GPU optimized version in the same way that the kind of original deep learning 

library did. So if we use apply_kernel(), we get nearly nine milliseconds. If 

we use unfold() with matrix multiply, we get 20 microseconds. So that's what about 

400 times faster. So that's pretty cool. Now, of course, we don't have to use 

unfold() and matrix multiply because PyTorch has a conv2d(). So we can run that. 
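
The three approaches discussed so far can be checked against each other. This is a sketch under stated assumptions: `img` is a random stand-in for an MNIST digit, the values in `kernel` are illustrative rather than copied from the notebook, and `apply_kernel` is re-created here from its description (row and column give the centre of the 3 by 3 window).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.rand(28, 28)                     # random stand-in for an MNIST digit
kernel = torch.tensor([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]])       # illustrative edge kernel

def apply_kernel(row, col, kernel):
    # (row, col) is the centre of the 3x3 window
    return (img[row - 1:row + 2, col - 1:col + 2] * kernel).sum().item()

# 1. Naive Python loops over every valid window centre (1..26)
naive = torch.tensor([[apply_kernel(i, j, kernel) for j in range(1, 27)]
                      for i in range(1, 27)])

# 2. unfold (im2col) followed by a matrix multiply
inp = img[None, None]                        # add batch and channel axes
unfolded = F.unfold(inp, (3, 3))[0]          # shape (9, 676)
via_unfold = (kernel.view(-1) @ unfolded).view(26, 26)

# 3. PyTorch's built-in convolution
via_conv2d = F.conv2d(inp, kernel[None, None])[0, 0]

print(unfolded.shape, via_unfold.shape, via_conv2d.shape)
print(torch.allclose(naive, via_unfold, atol=1e-5),
      torch.allclose(via_unfold, via_conv2d, atol=1e-5))
```

All three produce the same 26 by 26 result; only the speed differs.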

And that interestingly is about the same speed, at least on CPU. But this would 

also work on GPU just as well. Yeah, I'm not sure this will always be the 

case. In this case, it's a pretty small image. I haven't experimented a whole lot 

to see whereabouts there's a big difference in speeds between these. Obviously, I always 

just use F.conv2d. But if there's some more tricky convolution you need to do with some weird 

thing around channels or dimensions or something, you can always try this unfold trick. 

It's nice to know it's there, I think. So we could do the same thing for diagonal edges. So here's our diagonal edge 

kernel or the other diagonal. So if we just grab the first 16 images. Then we can do a convolution on our whole 

batch with all of our kernels at once. So this is a nice optimized thing that 

we can do. And you end up with your 26 by 26. You've got your 4 kernels and you've got 

your 16 images. And so that's summarized here. So that's generally what we're doing to get 

good GPU acceleration is we're doing a bunch of kernels and a bunch of images all at once 

across all of their pixels. And so here we go. That's what happens when we take a look at 

our various kernels for a particular image. Left edge, I guess top edge and then 

diagonal top left and top right. OK, so that is optimized convolutions 

on… and that works just as well on CPU or GPU. Obviously, GPU will 

be faster if you have one. Now, how do we deal with the 

problem that we're losing one pixel on each side? What we can do 

is we can add something called padding. And for padding, what we basically do 

is rather than starting our window here, we start it right over here and actually would be 

up one as well. And so these 3 on the left here. We just take the input for each of those 

as 0. So we're basically just assuming that they're all zero. I mean, there's 

other options we could choose. We could assume they're the same as the one next 

to them. There's various things we can do, but the simplest and the one we normally 

do is just assume that they're zero. So now. So let's say, for example, this 

is, this is called one pixel padding. Let's say we did 2 pixel padding. So we had 

2 pixel padding with a 5 by 5 input. OK, and a 4 by 4 kernel. So that gray is our kernel. 

Right, then we're going to start right up way over here on the corner. OK, and then you can see 

what happens as we slide. The kernel over, there's all the spots that it's going to take, and so that 

this dotted line area is the area that we're kind of effectively going through, but all of these 

white bits, we're just going to treat as zero. And so and then this is this green is the output 

size we end up with, which is going to be 6 by 6. For a 5 by 5 input, I should mention. Even numbered edge kernels are not used 

very often, we normally use odd numbered kernels. If you use, for example, a 3 

by 3 kernel and 1 pixel of padding, you will get back the same size you start with. 
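
The batched convolution and the padding cases described above can be checked by looking at output shapes. This is a minimal sketch with random tensors standing in for the 16 images and the 4 kernels.

```python
import torch
import torch.nn.functional as F

xb = torch.rand(16, 1, 28, 28)      # 16 single-channel images (random stand-ins)
kernels = torch.rand(4, 1, 3, 3)    # 4 kernels: (out_channels, in_channels, height, width)

no_pad = F.conv2d(xb, kernels)               # loses one pixel on each side
same_pad = F.conv2d(xb, kernels, padding=1)  # zero padding of 1 keeps 28x28

print(no_pad.shape)    # torch.Size([16, 4, 26, 26])
print(same_pad.shape)  # torch.Size([16, 4, 28, 28])

# The even-sized example: 5x5 input, 4x4 kernel, 2 pixels of padding.
even = F.conv2d(torch.rand(1, 1, 5, 5), torch.rand(1, 1, 4, 4), padding=2)
print(even.shape)      # torch.Size([1, 1, 6, 6])
```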

If you use 5 by 5 with 2 pixels of padding, you'll end up with the same size you start 

with. So generally odd numbered edge size kernels are easier to deal with to make sure 

you end up with the same thing you start with. OK, so yeah, so as it says here, you've got an 

odd numbered size ks by ks size kernel, then ks truncate divide 2 (ks//2), that's what slash 

slash means, will give you the right padding size. And so another trick you can do is you don't always have 

to just move your window across by one each time. You could move it by a different amount each time. 

The amount you move it by is called the stride. So, for example, here's a case of doing a stride 

2, so a stride 2 padding 1. So we start out here and then we jump across 2 and then we jump across 

2 and then we go to the next row. So that's called a stride 2 convolution. Stride 2 convolutions 

are handy because they actually reduce the dimensionality of your input by a factor of 2. 
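
The grid sizes involved can be worked out with the usual output-size arithmetic for a convolution without dilation. `conv_out_size` is a hypothetical helper written for this note, not part of PyTorch or miniai.

```python
def conv_out_size(n, ks, stride=1, padding=0):
    """Output length along one axis for an input of length n."""
    return (n + 2 * padding - ks) // stride + 1

print(conv_out_size(28, 3))             # 26: no padding loses a pixel on each side
print(conv_out_size(28, 3, padding=1))  # 28: padding of ks//2 keeps the size
print(conv_out_size(5, 4, padding=2))   # 6: the even-sized kernel example

# Repeated stride-2, padding-1 convolutions starting from a 28x28 grid.
size, sizes = 28, []
for _ in range(5):
    size = conv_out_size(size, 3, stride=2, padding=1)
    sizes.append(size)
print(sizes)                            # [14, 7, 4, 2, 1]
```

Five such convolutions take a 28 by 28 grid down to 1 by 1, roughly halving each time (rounding up when the size is odd).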

And that's actually what we want to do a lot. For example, with an autoencoder, we want to 

do that. And in fact, for most classification architectures, we do exactly that. We keep 

on reducing the, kind of the grid size by a factor of 2 again and again and again using 

stride 2 convolutions with padding of 1. So that's strides and padding. So let's go ahead 

and create a ConvNet using these approaches. So we're going to put, get our size of our 

training set. This is all the same as before, number of categories, number of 

digits, size of our hidden layer. So. Previously, with our sequential linear 

models, with our MLPs, we basically went from the number of pixels to the number 

of hidden and then a ReLU and then the number of hidden to the number of outputs. 

So here's the equivalent. With a convolution. Now, the problem is that you can't just 

do that because the output is not now 10 probabilities for each item in our batch, 

but it's 10 probabilities for each item in our batch for each of 28 by 28 pixels because 

we don't even have a stride or anything. So you can't just use the same simple approach that we 

had for MLP. We have to be a bit more careful. So to make life easier, let's create a little conv 

function that does a Conv2d with a stride of 2 optionally followed by an activation. So if act 

is true, we will add in a ReLU activation. So this is going to either return a Conv2d or a little 

Sequential containing a Conv2d followed by a ReLU. And so now we can create a CNN 

from scratch as a sequential model. And so since activation is True by default, this 

is going to take our 28 by 28 image, starting with 1 channel and creating an output of 4 channels. 

So this is the number of in, this is the number of filters. Sometimes we'll say filters to 

describe the number of, kind of, channels that our convolution has, that's the number of 

outputs. And it's very similar to the idea of the number of outputs in a linear layer, except 

this is the number of outputs in your convolution. So what I like to do when I create stuff like 

this is I add a little comment just to remind myself what is my grid size after this. So I had 

a 28 by 28 input. So then I've then put it through a stride-2 conv. So the output of this will be 

14 by 14. So then we'll do the same thing again, but this time we'll go from a 4 channel input 

to an 8 channel output and then from 8 to 16. So by this point, we're now down to 

a 4 by 4. And then down to a 2 by 2, and then finally we're down to a 1 by 1. So on 

the very last layer, we won't add an activation. And the very last layer is going to create 10 

outputs. And since we're now down to a 1 by 1, we can just call Flatten() and that's going to remove 

those unnecessary unit axes. So if we take that, pop a mini batch through it, we end up with 

exactly what we want, 16 by 10. So for each of our 16 images, we've got 10 probabilities of 

each possible digit. So if we take our training set and make it into 28 by 28 images, and 

we do the same thing for a validation set, and then we create two datasets, one for each, 

which are called train dataset and valid dataset. And we're now going to train this on the GPU. Now, 

if you've got a Mac, you can use a device called, well, if you've got an Apple Silicon Mac, you've 

got a device called MPS, which is going to use your Mac's GPU. Or if you've got Nvidia, you can 

use CUDA, which will use your Nvidia GPU. CUDA's 10 times or more, possibly much more faster than 

a Mac. So you definitely want to use Nvidia if you can. But if you're just running it on a 

Mac laptop or whatever, you can use MPS. So basically, you want to know what device to use. 

Do we want to use CUDA or MPS? You can check torch.backends.mps.is_available() 

to see if you're running on a Mac with MPS. You can check torch.cuda.is_available() to see 

if you've got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course, 

you'll have to use the CPU to do computation. So I've created a little function here to_device, which takes a tensor or a dictionary 

or a list of tensors or whatever, and a device to move it to, and it just goes 

through and moves everything onto that device. Or if it's a dictionary, a dictionary of 

things, values moved onto that device. So there's a handy little 

function. And so we can create a custom collate function, which calls the 

PyTorch default collation function and then puts those tensors onto our device. 
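
The device selection and the two helpers just described could look something like the following. This is a sketch written from the description above, not the miniai source: the names `def_device`, `to_device` and `collate_device` and their exact behaviour are assumptions.

```python
import torch
from torch.utils.data import default_collate

# Pick the best available device, falling back to the CPU.
if torch.cuda.is_available():
    def_device = 'cuda'
elif torch.backends.mps.is_available():
    def_device = 'mps'
else:
    def_device = 'cpu'

def to_device(x, device=def_device):
    """Recursively move tensors (or containers of tensors) to `device`."""
    if isinstance(x, torch.Tensor):
        return x.to(device)
    if isinstance(x, dict):
        return {k: to_device(v, device) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        return type(x)(to_device(o, device) for o in x)
    return x  # leave anything else untouched

def collate_device(batch):
    """Collate with PyTorch's default function, then move the result."""
    return to_device(default_collate(batch))

# Two fake (image, label) samples standing in for dataset items.
samples = [(torch.zeros(1, 28, 28), 3), (torch.ones(1, 28, 28), 7)]
xb, yb = collate_device(samples)
print(xb.shape, yb.tolist(), xb.device.type)
```

A function like `collate_device` can be passed as the `collate_fn` argument of a PyTorch `DataLoader`, so every batch arrives already on the chosen device.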

And so with that, we've now got enough to train this neural net on the GPU. We created 

this get_dls function in the last lesson. So we're going to use that, passing in the 

datasets that we just created and our custom collation function. We're going to create 

our optimizer using our CNN's parameters. And then we call fit(). Now fit(), remember, we also created 

in our last lesson and it's done. So then what I did then was I reduced 

the learning rate by a factor of four and ran it again. And eventually, yeah, I got to a 

fairly similar accuracy to what we did on our MLP. So, yeah, we've got a convolutional network 

working. I think that's pretty encouraging. And it's nice that to train it, we didn't have 

to write much code, right? We were able to use code that we had already built. We were 

able to use the Dataset class that we made, the get_dls function that we made, 

and the fit function that we made. And, you know, because those things are 

written in a fairly general way, they work just as well for a ConvNet as they did for an 

MLP. Nothing had to change. So that was nice. Notice I had to take the model and put it on 

the device as well. So that will go through and basically put all of the tensors that are in that 

model onto the MPS or CUDA device, if appropriate. So if we've got a batch size of 64, 

and as we do, 1 channel, 28 by 28, so then our axes are batch, channel, height, 

width. So normally this is referred to as NCHW. So N, generally when you see N in 

a paper or whatever, in this way, it's referring to the batch size. N 

being the number, that's the mnemonic, the number of items in the batch. C is the 

number of channels, height by width, NCHW. TensorFlow doesn't use that. TensorFlow uses 

NHWC. So we generally call that channels-last, since channels are at the end. And this one we 

normally call channels-first. Now, of course, it's not actually channels-first. It's actually 

channels-second, but we ignore the batch bit. In some models, particularly some more modern 

models, it turns out that channels-last is faster. So PyTorch has recently added 

support for channels-last. And so you'll see that being used more and more as well. All right, so a couple of comments and questions 

from our chat. The first is Sam Watkins pointing out that we've actually had a bit of a win here, 

which is that the number of parameters in our CNN is pretty small by comparison. So in the MLP 

version, the number of parameters is equal to basically the size of this matrix, so m times nh. Oh, plus the number in this, 

which will be nh times 10. And something that at some point we probably 

should do is actually create something that allows us to automatically 

calculate the number of parameters. And I'm ignoring the bias there, of course. Let's see, what would be a good way to do that? Maybe np.product(). There we go. So what we could do is just 

calculate this automatically by doing a little list comprehension here. 
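
Here is one way that count could be done, as a self-contained sketch. Several details are assumptions made to keep it runnable: a hidden size of 50 for the MLP, a 16-to-16 layer for the step down to 2 by 2, and the helper name `count_params`. The `conv` function is re-created from its description earlier in the lesson.

```python
import math
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # stride-2 convolution, optionally followed by a ReLU
    layer = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    return nn.Sequential(layer, nn.ReLU()) if act else layer

mlp = nn.Sequential(nn.Linear(28 * 28, 50), nn.ReLU(), nn.Linear(50, 10))
simple_cnn = nn.Sequential(
    conv(1, 4),               # 14x14
    conv(4, 8),               # 7x7
    conv(8, 16),              # 4x4
    conv(16, 16),             # 2x2 (assumed channel count)
    conv(16, 10, act=False),  # 1x1
    nn.Flatten(),
)

def count_params(model):
    # one entry per weight or bias tensor, then add them up
    return torch.tensor([math.prod(p.shape) for p in model.parameters()]).sum().item()

print(count_params(mlp), count_params(simple_cnn))  # 39760 5274
print(simple_cnn(torch.rand(16, 1, 28, 28)).shape)  # torch.Size([16, 10])
```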

So there's the number of parameters across all of the different layers, so both bias 

and weights. And then we could, I guess, just, well, we could just use, well, let's use 

PyTorch. So we could turn that into a tensor and sum it up. Oops. So that's the number in 

our MLP. And then the number in our simple CNN. So that's pretty cool. We've gone down from 40,000 

to 5,000 and got about the same number there. Oh, thank you, Jonathan. Jonathan's reminding me that 

there's a better way than np.product(o.shape), which is just to say o dot 

number of elements: o.numel(). Same thing. Very nice. Now, one person asked a very 

good question, which is, I thought Convolutional Neural Networks can 

handle any sized image. And actually, no, this convolutional network cannot handle any 

sized image. This Convolutional Neural Network only handles images that, once they go through 

these stride-2 convs, end up with a 1 by 1. Because otherwise, you can't dot flatten it 

and end up with 16 by 10. So we will learn how to create convnets that can handle any sized 

input. But there's nothing particularly about a convnet that necessitates that it has 

to be any sized input that it can handle. OK, so just let's briefly finish this section off 

by talking about this, particularly I want to talk about the idea of receptive field. Consider this 

1 input channel, 4 output channel, 3 by 3 kernel. So that's just to show you what we're doing here. 
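
For reference, the parameter shapes of such a layer can be inspected directly. This is a small sketch using a freshly created layer rather than the trained model; the stride and padding values are assumptions and do not affect the shapes.

```python
import torch
from torch import nn

conv1 = nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1)
print(conv1.weight.shape)  # torch.Size([4, 1, 3, 3]): outputs, inputs, height, width
print(conv1.bias.shape)    # torch.Size([4]): one bias per output channel

# Layers nested inside Sequential containers are reached by chained indexing.
block = nn.Sequential(nn.Sequential(conv1, nn.ReLU()))
print(block[0][0] is conv1)  # True
```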

conv1, well, actually, so simple_cnn. simple_cnn. This is the model we created. Remember, it was 

like a Sequential model containing Sequential models, because that's how our conv function 

worked. So simple_cnn[0] is our first layer. It contains both a Conv and a ReLU. So simple_cnn[0][0] 

is the actual Conv. So if we grab that, call it conv1, it's a 4 by 1 by 3 by 3. So, 

number of outputs, number of input channels, and height by width of the kernel. And then 

it's got its bias as well. So that's how we could deconstruct what's going on with our weight 

matrices or our parameters inside a convolution. Now, I'm going to switch over to Excel. So in 

the lesson notes on the course website or on the forum, you'll find we've got an Excel 

workbook. Oh, Wasim reminded me that there is 

a nice trick we can do. I do want to do that actually because I love this trick. Oh, I just deleted everything though. Let's put 

it all back. Here we go. Which is you actually don't need square brackets. The square brackets is 

a list comprehension. Without the square brackets, it's called a generator. And it, oh, no, you can't 

use it there. Maybe that only works with NumPy. Ah, OK. So that's the list. No, that doesn't work either. So much for that. 
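As a small aside on the list comprehension versus generator point, here is a sketch that uses only the standard library. The shapes are hypothetical stand-ins for the weight and bias of a single conv layer (1 input channel, 4 output channels, 3 by 3 kernel), and math.prod plays the role that o.numel() plays on a tensor. Python's built-in sum accepts either form.

```python
from math import prod

# Hypothetical parameter shapes: conv weight (out, in, height, width) and its bias
shapes = [(4, 1, 3, 3), (4,)]

# With square brackets: a list comprehension builds the whole list first
as_list = sum([prod(s) for s in shapes])

# Without square brackets: a generator expression feeds values to sum one at a time
as_gen = sum(prod(s) for s in shapes)

print(as_list, as_gen)
```

Both give the same total; the generator form just avoids creating the intermediate list.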

I'm kind of curious now. Maybe torch.sum. No, just sum. Oh, OK. Well, I don't want to 

use Python sum. That's interesting. I feel like all of them should handle 

generators, but there you go. OK. So open up the conv-example spreadsheet. And 

what you'll see on the conv-example worksheet page is something that looks a lot like the number 

7. And indeed, this is the number 7 that I got straight from MNIST. Let's see. OK. So you can 

see over here, we have a number 7. This is a number 7 from MNIST that I have copied into 

Excel. And then you can see over here, we've got the top edge kernel being applied. And over 

here, we've got a right edge kernel being applied. This might be surprising you because you 

might be thinking, wait a minute, Jeremy, Microsoft Excel doesn't do Convolutional 

Neural Networks. Well, actually, it does. So if I zoom in in Excel, you'll see, actually, 

these numbers are, in fact, conditional formatting applied to a bunch of spreadsheet cells. And so 

what I did was I copied the actual pixel values into Excel and then applied conditional 

formatting. And so now you can see what the digit is actually made of. So you can see here I've created our top edge filter. And here 

I've created our left edge filter. And so here I am applying that filter to that window. And 

so here you can see it looks a lot like NumPy. It's just a sum product. And you might not be 

aware of this, but in Excel, you can actually do broadcasting. You have to hit Apple 

Shift Enter or Control Shift Enter, and it puts these little curly brackets 

around it. It's called an array formula. It basically lets you do broadcasting or 

simple broadcasting in Excel. And so here's how you could say… this is how I created 

this top edge filtered version in Excel. And the left edge version is exactly the same, 

just a different kernel. And as you can see, if I click on it, it's applying this filter 

to this input area and so forth. OK, so we can then… I just 

arbitrarily pick some different values here. And so something to notice now 

in my second layer, so here's Conv1, here's Conv2, it's got a bit more work to do. We actually 

need two filters because we need to add together this bit here applied to this with 

this kernel applied and this bit here with this kernel applied. So you actually 

need one set of 3 by 3 for each input. And also, I want two separate outputs, so I 

actually end up needing a 2 by 2 by 3 by 3 weights matrix or weights tensor, I should say, 

which you might remember is exactly what we had in PyTorch. We had a rank 4 tensor. So if I have a 

look at this one, you see exactly the same thing. This input is using this kernel applied 

to here and this kernel applied to here. So that's important to remember that you 

have these rank 4 tensors. And so then rather than doing stride-2 conv, I did something else 

which is actually a bit out of favor nowadays, but it's another option, which is to do something 

called max-pooling to reduce my dimensionality. So you can see here I've got 28 by 28. 

I've reduced it down here to 14 by 14. And the way I did it was simply to take 

the max of each little 2 by 2 area. So that's all that's been done 

there. So that's called max-pooling. And so max-pooling has the same effect as a 

stride-2 conv, not mathematically identical, the same effect, which is it does a convolution 

and reduces the grid size by 2 on each dimension. So then how do we create a single 

output if we don't keep doing this until we get to 1 by 1, which 

I'm too lazy to do in Excel? Well, one approach, and again, this is a little 

bit out of favor as well, but one approach we can do is we can take every one of these, we've 

now got 14 by 14, and apply a dense layer to it. And so what I've done here is I've got a 

big, imagine this is basically all being flattened out into a vector. And so here 

we've got the sum product of this by this, plus the sum product of this by this, 

and that gives us a single number. And so that is how we could then optimize 

that in order to optimize our weight matrices. Now, and then, you know, the more modern approach, we don't use this kind of dense layer much 

anymore. It still appears a bit. The main place that you see this used is in a network 

called VGG, which is very old now. I thought it might be 2013 or something, but it's actually 

still used. And that's because for certain things like something called style transfer or in 

general perceptual losses, people still find VGG seems to work better. So you still actually 

see this approach nowadays sometimes. The more common approach, however, nowadays is 

we take the penultimate layer and we just simply take the average of all of the activations. So 

the, you know, the nowadays we would simply, the Excel way of doing it would be literally 

simply say AVERAGE of the penultimate layer. And that is called global average pooling. 
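The two pooling ideas described above can be sketched in plain Python, working on nested lists the way the spreadsheet works on cells. The function names and the tiny 4 by 4 grid are made up for illustration; a real network would use PyTorch's pooling layers on tensors instead.

```python
def max_pool_2x2(grid):
    """Take the max of each non-overlapping 2x2 block (height and width must be even)."""
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def global_avg_pool(grid):
    """Average every activation in the grid down to a single number."""
    vals = [v for row in grid for v in row]
    return sum(vals) / len(vals)

# Hypothetical 4x4 activations
acts = [[1, 2, 0, 0],
        [3, 4, 0, 1],
        [0, 0, 5, 6],
        [1, 0, 7, 8]]

pooled = max_pool_2x2(acts)       # 4x4 grid becomes 2x2: [[4, 1], [1, 8]]
avg = global_avg_pool(pooled)     # one number for the whole grid: 3.5
print(pooled, avg)
```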

Everything has to have a fancy word, a fancy phrase, but that's all it is. Take the average 

is called global average pooling, or you could take the max, whatever that 

would be global max pooling. So anyway, the main reason I wanted to show you 

this was to do something which I think is pretty interesting, which is to take something in our, 

I'm just going to zoom out a little bit here. Let's take something in our max pool here. And I'm going to say trace 

precedents to show you here is the area that it's coming from. Okay. So it's coming from these 

four numbers. Now I trace precedents again, saying what's actually impacting this. 

Obviously the kernel's impacting it. And then you can see that the input area here is a 

bit bigger. And then if I trace precedents again, then you can see the input area is bigger still. 

So this number here is calculated from all of these numbers in the input. This area in the 

input is called the receptive field of this unit. And so the receptive field in this case 

is 1, 2, 3, 4, 5, 6 by 6, right? And that means that a pixel way up here in the top, 

right? Has literally no ability to impact that activation. It's not 

part of its receptive field. If you have a whole bunch of stride 2 convs, each 

time you have one, the receptive field is going to get twice as big. So the receptive field at the 

end of a deep network is actually very large. But the inputs closest to the middle of the receptive 

field have the biggest kind of say in the output because they implicitly appear the most often in 

all of these kind of dot products that are inside this convolutional window. So the receptive field 

is not just like a single binary on off thing. Certainly all the stuff that's not got precedents 

here is not part of it at all. But the closer to the center of the receptive field, the 

more impact it's going to have, the more ability it's got to change this number. So the 

receptive field is a really important concept and yeah, fiddling, playing around with Excel's 

precedent arrows, I think is a nice way to see that, at least in my opinion. 
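The size of a receptive field can also be worked out with a few lines of arithmetic. This is a sketch using the standard recurrence for stacked layers, and it assumes the spreadsheet's layers are two 3 by 3 convs with stride 1 followed by a 2 by 2 max pool with stride 2. The function name is made up for illustration.

```python
def receptive_field(layers):
    """Width of the input region seen by one output unit.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size, jump = 1, 1   # jump = distance in input pixels between neighbouring units
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size

# Assumed spreadsheet layout: conv 3x3, conv 3x3, then max pool 2x2 with stride 2
print(receptive_field([(3, 1), (3, 1), (2, 2)]))   # 6

# Three stride-2 3x3 convs: the field roughly doubles with each layer (3, 7, 15)
print(receptive_field([(3, 2)] * 3))               # 15
```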

And apart from anything else, it's great fun creating a Convolutional 

Neural Network in Excel. I thought so anyway. Okay, so let's take a seven minute break. I'll see you back after that to talk about 

a convolutional autoencoder. All right. Okay, welcome back. We're going to have 

a look now at the autoencoder notebook. So we're just going to import all of our 

usual stuff and we've got one more of our own modules to import now as well. And this 

time we are going to switch to a different, we're going to switch to a different 

dataset, which is the Fashion MNIST dataset. We can take advantage of the 

stuff that we did in 05_datasets and the Hugging Face stuff to load it. 

So we've seen this a little bit before. Back in our datasets one here. And we never actually built any models 

with it. So let's first of all do that. So this is just, I'm going to convert 

each thing, each image into a tensor, and that's going to be an in-place transform. 
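As a reminder of what an in-place transform means, here is a minimal sketch in plain Python. It assumes the decorator simply calls the wrapped function for its side effect and then returns the batch; the scaling step is a stand-in for converting each image to a tensor, and the batch contents are made up.

```python
def inplace(f):
    """Wrap a function that mutates a batch so that it also returns the batch."""
    def _f(b):
        f(b)
        return b
    return _f

@inplace
def transformi(b):
    # Stand-in for the tensor conversion: scale pixel values from 0..255 to 0..1
    b['image'] = [[p / 255 for p in img] for img in b['image']]

# Hypothetical batch with two tiny "images"
batch = {'image': [[0, 255], [51, 102]], 'label': [9, 2]}
out = transformi(batch)
print(out is batch, out['image'][0])
```

Without the decorator, transformi would return None, which is not what a dataset transform is expected to hand back.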

Remember we created this decorator. And so we can call with_transform on the 

dataset dictionary. This 

is all stuff we've done before. And so here we have our example of a sneaker. All right. And we will create our collation 

function, collating a dictionary for that dataset. That's something you should remind yourself. We 

built that ourselves in the datasets notebook. And let's actually make our collate function something that does to_device(), 

which we wrote in our last notebook. And we'll create a little data_loaders function 

here, which is going to go through each item in the dataset dictionary and create a DataLoader for 

it and give us a dictionary of data loaders. Okay. So, okay. So now we've got a data loader for training 

and a data loader for validation. So we can grab the x and y batch by just calling next() 

on that iterator as we've done before. We can grab the, let's look at each 

of these in turn actually. We've done all this before, but it's a couple of weeks ago. So just to remind you, we can 

get the names of the features. And so we can then get, create an itemgetter() 

for our y's. And we can, so we'll call that the label getter. We can apply that to our labels to 

get the titles of everything in our mini batch. And we can then call our 

show_images() that we created with that mini batch, with those titles. And 

here we have our Fashion MNIST mini batch. Okay. So let's create a classifier and 

we're just gonna use exactly the same code, copy and pasted from the previous 

notebook. So here is our sequential model. And we are going to grab the parameters of the CNN and the CNN I've actually 

moved it over to the device. The default device was what we created in our 

last notebook. And as you can see, it's fitting. Now our first problem is it's fitting 

very slowly, which is kind of annoying. So why is it running pretty slowly? Let's think 

about, let's have a look at our dataset. So when it's finally finished, let's take a 

look at an item from the dataset. Actually, let's not look at the dataset. Let's actually 

go all the way back to the dataset dictionary so before it gets transformed dataset dictionary 

and let's grab the training part of that and let's grab one item. And 

actually we can see here the problem. For MNIST, we had all of the data loaded into 

memory into a single big tensor, but this Hugging Face one is created in a much more kind 

of normal way, which is, each image is a totally separate PNG image. It's not all pre-converted 

into a single thing. Why is that a problem? Well, the reason it's a problem is that 

our DataLoader is spending all of its time decoding these PNGs. So if I train here, okay, so while I'm training, I can type htop 

and you can see that basically my CPU is 100% used. Now that's a bit weird because I've 

actually got 64 CPUs. Why is it using just one of them is the first problem. But why 

does it matter that it's using 100% CPU? Well, the reason it matters, 

let's run it again so you can see, why does it matter that our CPU is 100% 

and why is it making it so slow? Well, the reason why is if we look at 

nvidia-smi dmon that will monitor our GPUs utilization. I've got three GPUs, so 

I say to choose just the zeroth index one. And you'll see this column here, sm, this stands 

for streaming multiprocessor. It's like the, it's the equivalent of like CPU usage. And 

generally we're only using up 1% of our one GPU. So no wonder it's so slow. So the first thing 

we wanna do then is try to make things faster. Now, to make things faster, we wanna be 

using more than one CPU to decode our PNGs. And as it turns out, that's actually 

pretty easy to do. You just have to add an extra argument to your data loaders, which is here, num_workers. And so I 

can say use eight CPUs, for example. Now, if I create, I recreate the data loaders 

and then try to create, get the next one. Oh, now I've got an error. And the error is 

rather quirky. And what it's saying is, oh, you're now trying to use multiple processes. And 

generally in Python and PyTorch, using multiple processes, things start to get complicated. 

And one of the things that absolutely just doesn't work is you can't actually have your 

DataLoader put things onto the GPU in your separate processes. It just doesn't work. So the 

reason for this error is actually because of the fact that we used a collate function that put 

things on the device. That's incompatible, unfortunately, with using multiple 

workers. So that's a problem. And the answer to that problem, sadly, is that we would have to actually rewrite our 

fit function entirely. So there's annoying thing number one, and we don't want to be rewriting 

our fit function again and again. We want to have a single fit function. So, okay. So there's 

a problem that we're gonna have to think about. Problem number two is that this is not very 

accurate, 87%. Well, I mean, is it accurate? It's easy enough to find out. There's a 

really nice website called paperswithcode. And it will show you a little leaderboard. And we 

can see whether we're any good. And the answer is we're not very good at 

all. So these papers had 96%, 94%, 92%. So yeah, we're not looking 

great. So how do we improve that? There's a lot of things we could try, but 

pretty much all of them are going to involve modifying our fit function again and in reasonably complicated ways. So we still 

got a bit of an issue there. Let's put that aside because what we actually 

wanted to do is create an auto-encoder. So to remind you about what an auto-encoder is, 

and we're gonna be able to go into a bit more detail now, we're gonna start with our input 

image, which is gonna be 28 by 28. So it's the number 3, right? And it's a 28 by 28. And we're 

gonna put it through, for example, a stride-2 conv, stride-2. And that's going 

to have an output of a 14 by 14. And we can have more channels. So say maybe 

4, so this is 28 by 28 by 1. Let's do 14 by 14 by 2. So we've reduced the height and width 

by 2, but added an extra channel. So overall, this is a 2x decrease in the number of activations. And 

then we could do another stride-2 conv, and that would give us a 7 by 7. And again, 

we can choose however many channels we want, but let's say we choose 4. So now compared to our 

original, we've now got a times 4 reduction. And so we could do that a few times, or we could 

just stay there. And so this is compressing. And so then what we could do is then somehow 

have a convolution layer or group of layers, which does a convolution 

and also increases the size. There is actually something 

called a transposed convolution, which I'll leave you to look up if you're 

interested, which can do that. Also known, rather weirdly, as a stride-1/2 convolution. But 

there's actually a really simple way to do this, which is to say, let's say you've got a bunch of 

pixels. Let's say we've got 3 by 3 pixels that look like this, 1, 0, 1, 1, 

say. We could make that into a 6 by 6 very easily, which is we could simply, let's get these out. We could simply 

copy that pixel there into the first 4, copy that pixel there into these 4. And so you can 

see, and then copy this pixel here into these 4. And so we're simply turning each pixel into 4 

pixels. And so this is called nearest neighbor upsampling. Now that's not a convolution, 

that's just copying. But what we could then do is we could then apply a stride-1 

convolution to that. Right? And that would allow us to double the grid size with the 

convolution. And that's what we're gonna do. So our autoencoder is gonna need a deconvolutional 

layer, and that's gonna contain two layers, up sampling nearest neighbor, scale factor of 

2, followed by a conv2d with a stride of one. Okay. And you can see for padding, I 

just put kernel size // 2. So that's a truncating division, cause that always 

works for any odd sized kernel. As before, we will have an optional activation function, and 

then we will create a sequential using *layers. So that's gonna pass in each layer as a separate 

argument, which is what Sequential() expects. Okay. So let's write a new fit function. It goes 

through, I just basically copied it over from our previous one, going through each epoch, but 

I've pulled out eval into a separate function, but it's basically doing the same thing. Okay. So here is our auto 

encoder. And so we're going to, it's a bit tricky because I wanted to go 

down by 1, 2, 3, to get to a 4 by 4 by 8, but starting at 28 by 28, you can't divide 

that three times and get an integer. So what I first do is I zero pad. So add padding 

of 2 on each side to get a 32 by 32 input. So if I then do a conv with 2 channel output, that 

gives us 16 by 16 by 2, and then again to get an 8 by 8 by 4, and then again to get a 4 by 4 by 

8. So this is doing an 8x compression, and then we can call deconv() to do exactly the same thing 

in reverse. The final one with no activation, and then we can truncate off those two pixels 

off the edge. Slightly surprisingly, PyTorch lets you pass -2 to zero padding to crop off the 

final 2 pixels. And then we'll add a Sigmoid(), which will force everything to go between 

0 and 1, which of course is what we need. And then we will use mse_loss to compare 

those pixels to our input pixels. And so a big difference we've got here now 

is that our loss function is being applied to the output of the model and the input itself, 

right? We don't have yb here, we have xb. So we're trying to recreate our original. And 

again, this is a bit annoying that we have to create our own fit function. Anyway, so we can 

now see what is the mse_loss, and it's not, like, gonna be particularly human readable, 

but it's a number we can see if it goes down. And so then we can create, then we can do our SGD with the 

parameters of our auto-encoder, with mse_loss, call that fit 

function, we just wrote, and I won't wait for it to run, cause as you can see, 

it's really slow for reasons we've discussed. I've run it before. And what we want is 

to see that the original, which is here, which is here, gets recreated. 
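The architecture described above can be sketched in PyTorch as follows. This is a reconstruction from the description rather than a copy of the notebook: the conv and deconv helpers, their argument names and the batch of random "images" are assumptions, and no training is done here. It only checks that the shapes work out and that the loss compares the output with the input.

```python
import torch
from torch import nn
import torch.nn.functional as F

def conv(ni, nf, ks=3, stride=2, act=True):
    """Stride-2 conv that halves the grid size, with an optional ReLU."""
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def deconv(ni, nf, ks=3, act=True):
    """Nearest neighbor upsampling (doubles the grid) followed by a stride-1 conv."""
    layers = [nn.UpsamplingNearest2d(scale_factor=2),
              nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks // 2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

autoencoder = nn.Sequential(
    nn.ZeroPad2d(2),           # 28x28 -> 32x32
    conv(1, 2),                # 16x16x2
    conv(2, 4),                # 8x8x4
    conv(4, 8),                # 4x4x8
    deconv(8, 4),              # 8x8x4
    deconv(4, 2),              # 16x16x2
    deconv(2, 1, act=False),   # 32x32x1
    nn.ZeroPad2d(-2),          # negative padding crops back to 28x28
    nn.Sigmoid(),              # force outputs into the range 0 to 1
)

xb = torch.rand(16, 1, 28, 28)        # hypothetical mini batch of images
encoded = autoencoder[:4](xb)         # output of the compressing half
out = autoencoder(xb)
loss = F.mse_loss(out, xb)            # the target is the input itself, not yb
print(encoded.shape, out.shape, loss.item())
```

The padded input has 32 * 32 = 1024 values and the bottleneck has 4 * 4 * 8 = 128, which is where the 8x figure comes from.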

And the answer is, oh, not really. I mean, they're roughly the same things, 

but there's no point having an auto-encoder which can't even recreate the originals. The idea 

would be that if these looked almost identical to these, then we'd say, wow, this is a fantastic 

network at compressing things by eight times. So I found this like very fiddly to try and 

get this to work at all. Something that I discovered can get it to start training is 

to start with a really low learning rate for a few epochs, and then increase the 

learning rate after a few epochs. I mean, at least it gets it to train and show something 

vaguely sensible, but let's see. Yeah, that still looks pretty crummy. This one here I got actually 

by switching to Adam, and I actually removed the tricky bit. I removed these two as well.

But yeah, I couldn't get this to like recreate anything very reasonable 

in any reasonable amount of time. And why is this not working very well? There's so 

many reasons it could be. Like do we need a better optimizer? Do we need a better architecture? 

Do we need to use a Variational Auto-Encoder? There's a thousand things we could try, 

but doing it like this is going to drive us crazy. We need to be able to really rapidly 

try things and all kinds of different things. And so what I often see in projects or on 

Kaggle or whatever, people's code looks kind of like this. It's all like manual. And then their 

iteration speed is too slow. We need to be able to really rapidly try things. So we're not gonna keep 

doing stuff manually anymore. This is where we take a halt and we say, okay, let's build 

up a framework that we can use to rapidly try things and to understand when things 

are working and when things aren't working. So we're gonna start creating a learner. So what is a learner? It's basically the 

idea is this learner is gonna be something that we build, which will allow us to 

try like anything that we can imagine very quickly. And we will build that on top 

of that learner things that will allow us to introspect what's going on inside our model, 

will allow us to do multi-process CUDA to go fast. It will allow us to add things like 

data augmentation. It will allow us to try a wide variety of architectures quickly 

and so forth. So that's gonna be the idea. And of course we're gonna create it from scratch. 

And so let's start with Fashion MNIST like before. And let's create a DataLoaders class, which is gonna look a bit like what we 

had before, where we're just going to pass in, this is just, this couldn't be 

simpler, right? We're just gonna pass in two DataLoaders and store them away. And I'm gonna 

create a @classmethod from dataset dictionary. And what that's gonna do is 

it's gonna call DataLoader on each of the dataset dictionary items with 

our batch size and instantiate our class. So if you haven't seen @classmethod before, it's 

what allows us to say DataLoaders dot something in order to construct this. We could have put this in 

__init__ just as well, but we'll be building more complex DataLoaders things later. So I thought we 

might start by getting the basic structure right. So this is all pretty much the same as 

what we've had before. I'm not doing anything on the device here, cause 

as we know that didn't really work. Okay. Oh, this is an old thing. 
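For anyone who has not used @classmethod as an alternative constructor, here is a minimal sketch of the pattern in plain Python. The batches helper is a hypothetical stand-in for PyTorch's DataLoader, and the method name from_dd and the toy dataset dictionary are assumptions made for illustration.

```python
def batches(items, batch_size):
    """Hypothetical stand-in for a DataLoader: split a list into mini batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

class DataLoaders:
    def __init__(self, train, valid):
        # Just store the two loaders away
        self.train, self.valid = train, valid

    @classmethod
    def from_dd(cls, dd, batch_size):
        # cls is the class itself, so this builds and returns a new instance
        return cls(*[batches(ds, batch_size) for ds in dd.values()])

# Toy "dataset dictionary" with a training split and a validation split
dd = {'train': list(range(10)), 'test': list(range(4))}
dls = DataLoaders.from_dd(dd, batch_size=4)
print(len(dls.train), dls.valid)
```

The call site reads DataLoaders.from_dd(...), with no instance needed beforehand, which is the point of a class method.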

I don't need to_cuda() anymore. So we're gonna use to_device(), 

which I think came from. There we go. So here's an example of 

a very simple Learner that fits on one screen. And this is basically 

gonna replace our fit function. So a Learner is gonna be something that is 

going to train or learn a particular model using a particular set of DataLoaders, a 

particular loss function, some particular learning rate and some particular optimizer 

or some particular optimization function. Now, normally I, you know, most people 

would often kind of store each of these away separately by writing like 

self.model equals model, blah, blah, blah, right? And as I think we've 

talked about before, that's, you know, that kind of huge amounts of boilerplate. It 

just, it's more stuff that you can get wrong. And it's more stuff that you have to 

read to understand the code and yeah, don't like that kind of repetition. So instead we just call 

fastcore.store_attr() to do that all in one line. Okay, so that's the basic idea with 

a class is to think about what's the information it's gonna need. So you pass 

that all to the constructor, store it away. And then our fit function is going to, 

we've got the basic stuff that we have for keeping track of accuracy. So this will only work for stuff that's a 

classification where we can use accuracy. Put the model on our device, create the optimizer, store how many epochs we're 

going through. Then for each epoch, we'll call the one epoch function and the one epoch function, 

we're gonna either do train or evaluation. So we pass in True if we're training and False if 

we're evaluating. And they're basically almost the same. We basically set the model to training 

mode or not. We then decide whether to use the validation set or the training set based on 

whether we're training. And then we go through each batch in the DataLoader and 

call one batch. And one batch is then the thing which is going to 

put our batch onto the device, call our model, call our loss function. And then, 

if we're training, then do our backward step, our optimizer step and our zero gradient. 

And then finally calculate our metrics or our stats. And so here's where we calculate our 

metrics. So that's basically what we have there. So let's go back to using an MLP. We call fit() and away it goes. This is an error here, pointed out 

by Kevin. Thank you. self.model.to(). One thing I guess we could try now is we think 

that maybe we can use more than one process. So let's try that. Oh, it's so fast. I didn't even see. There it goes. You can see all four CPUs 

being used at once. Bang, it's done. Okay, so that's pretty great. Let's see how fast 

it looks here. Bump, bump. All right, lovely. Okay, so that's a good sign. We've got a learner 

that can fit things, but it's not very flexible. It's not gonna help us, for example, with our 

autoencoder, because there's no way of like, just like, you know, changing which 

things are used for predicting with, or for calculating with. We can't use it for 

anything except things that involve accuracy with a binary classification. Sorry… is that 

right? Sorry, yeah, a multi-class classification. It's not flexible at all, but it's a 

start. And so I wanted to basically put this all on one screen so you can 

see what the basic Learner looks like. All right, so how do we do things 

other than multi-class accuracy? I decided to create a Metric class. And basically 

a Metric class is something where we are going to define subclasses of it that calculate 

particular metrics. So for example, here, I've got a subclass of a Metric called Accuracy. 

So if you haven't done subclasses before, you can basically think of this as saying, 

please copy and paste all the code from here into here for me, but the bit that says def 

calc(), replace it with this version. So in fact, this would be identical to copying and pasting 

this whole thing, typing Accuracy here, and replacing the definition of calc() with 

that. That's what is happening here when we do subclassing. So it's basically copying and 

pasting all that code in there for us. It's actually more powerful than that. There's more 

we can do with it, but in this case, this is all that's happening with this subclassing. And this 

is called, actually I'll leave that, that's fine. Okay, so the Accuracy metric is here, and 

then this is kind of our really basic Metric, which is we're gonna use for just for loss. And so 

what happens is we're going to, let's for example, create an Accuracy metric object. We're 

basically gonna add in mini batches of data, right? So for example, here's a mini batch of 

inputs and predictions. Here's another mini batch of inputs and predictions. And then we're gonna 

call .value and it will calculate the accuracy. Now .value is a neat little thing. It 

doesn't require parentheses after it because it's called a property. And 

so a property is something that just calculates automatically without having to 

put parentheses. That's all a property is, well, property getter anyway. And so they look 

like this, you give it a name. And so we are going to be, each time we call add(), we are 

gonna be storing that input and that target. And also the number of items 

in the mini batch optionally. For now, that's just always gonna be one. 

And you can see here that we then call .calc(), which is gonna call the accuracy 

calc(). So just see how often they're equal. And then we're going to 

append to the list of values that calculation. And we're also gonna append 

to the list of ns, in this case just one. And so then to calculate the value, we just do that. 
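Here is a sketch of the Metric idea in plain Python, so it runs without PyTorch. It follows the description above (add() stores a calculated value and a count, value is a property, Accuracy only overrides calc()), but the exact attribute names and the use of lists instead of tensors are assumptions.

```python
class Metric:
    def __init__(self):
        self.reset()

    def reset(self):
        self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        # Store the per-batch result and how many items it represents
        self.vals.append(self.calc(inp, targ))
        self.ns.append(n)

    @property
    def value(self):
        # Weighted average of the stored values; no parentheses needed to read it
        return sum(v * n for v, n in zip(self.vals, self.ns)) / sum(self.ns)

    def calc(self, inps, targs):
        # Base behaviour: pass the input straight through (useful for a loss)
        return inps

class Accuracy(Metric):
    def calc(self, inps, targs):
        # Fraction of positions where prediction and target agree
        return sum(i == t for i, t in zip(inps, targs)) / len(inps)

acc = Accuracy()
acc.add([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0])   # 3 of 6 match
acc.add([1, 1, 2, 0, 1], [0, 1, 1, 2, 1])         # 2 of 5 match
print(acc.value)
```

Because n defaults to one here, the two mini batches count equally; passing the real batch size as n would weight them by their number of items.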

So that's all that's happening for Accuracy. And then we can do for loss, we can just use 

Metric directly, cause Metric directly will just calculate the average of whatever it's passed. 

So we can say, oh, add the number 0.6. So the target's optional. And we're saying this is a 

mini batch of size 32. So it's gonna be the n. And then add the value 0.9 with a mini batch size 

of 2, and then get the value. And as you can see, that's exactly the same as the weighted average 

of 0.6 and 0.9 with weights of 32 and 2. So we've created a Metric class. And so 

that's something that we can use to create any metric we like just by overriding calc(). Or we could create things totally from scratch 

as long as they have an add() and a value. Okay, so we're now going to change our Learner. 

And what we're gonna do is we're going to keep the same basic structure. So there's gonna be 

fit(). It's gonna go through each epoch. It's gonna call one_epoch() passing in True and False 

as for training and validation. one_epoch() is going to go through each batch in the DataLoader 

and call one_batch(). one_batch() is going to do the prediction, get_loss(), and if it's training, 

it's gonna do the backward() step and zero_grad(). But there's a few other things 

going on. So let's take a look. Well, actually let's just look at 

it in use first. So when we use it, we're gonna be creating a Learner() with 

the model, data loaders, loss function, learning rate, and some callbacks, which 

we'll learn about in a moment. And we call fit() and it's gonna do our thing. And 

look, we're gonna have charts and stuff. All right, so the basic idea is gonna 

look very similar. So we're gonna call fit(). So when we construct it, we're gonna be 

passing in exactly the same things as before, but we've got one extra thing, callbacks, which 

we'll see in a moment, store the attributes as before and we're gonna be doing some stuff 

with the callbacks. So when we call fit() for this number of epochs, we're gonna 

store away how many epochs we're gonna do. We're also gonna store away 

the actual range that we're going to loop through as self.epoch. So 

here's that looping through self.epoch. We're gonna create the optimizer using 

the optimizer function and the parameters. And then we're gonna call _fit(). Now what on earth is _fit()? Why 

didn't we just copy and paste this into here? Why do this? It's because we've 

created this special decorator with callbacks. What does that do? So it's up here with 

callbacks. With callbacks is a class. It's gonna just store one thing, which 

is the name. In this case, the name is ‘fit’. And what it's gonna do is… now this is the 

decorator, right? So when we call it, remember, decorators get passed a function. So it's gonna 

get passed this whole function and that's gonna be called f. So dunder call, remember is what 

happens when a class is treated, an object is treated as if it's a function. So it's gonna get 

passed this function. So this function is _fit. And so what we wanna do is we wanna return a 

different function. It's going to of course call the function that we were asked to call using 

the arguments and keyword arguments we were asked to use. But before it calls that function, it's 

going to call a special method called callback, passing in the string before, in this case, 

before underscore fit. After it's completed, it's gonna call that method called callback 

and passing the string after underscore fit. And it's gonna wrap the whole 

thing in a try except block. And it's going to be looking for an exception 

called CancelFitException. And if it gets one, it's not gonna complain. So let me explain 

what's going on with all of those things. Let's look at an example of a callback. So for example, here is a Callback called 

DeviceCB, device callback. And before_fit() will be called automatically before that _fit() 

method is called. And it's going to put the model onto our device, CUDA or MPS, if we have one, 

otherwise it'll just be on CPU. So what's gonna happen here? So we're gonna 

call fit(). It's gonna go through these lines of code. It's then gonna call _fit(). _fit() is not 

this function. _fit() is this function, where f is this function. So it's going to call our 

learner's callback(), passing in ‘before_fit’. And callback() is defined here. What's 

callback() gonna do? It's gonna be passed the string ‘before_fit’. It's going to 

then go through each of our callbacks sorted based on their order. And you can 

see here, our callbacks can have an order and it's going to look at that callback and 

try to get an attribute called before_fit. And it will find one. And so then it's going to 

call that method. Now, if that method doesn't exist, it doesn't appear at all, then getattr() 

will return this instead: identity. identity is a function just here. All 

it does is return whatever arguments it gets passed. And if it's not 

passed any arguments, it just returns. So there's a lot of Python going on here. And 

that is why we did that foundations lesson. And so for people who haven't done a lot of 

this Python, there's gonna be a lot of stuff to experiment with and learn about. 
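
To make those pieces concrete, here is a minimal sketch of the pattern described above, written without PyTorch so it runs anywhere. It is an illustration under assumptions, not the notebook's actual code: the names with_callbacks, TinyLearner, LogCB and NoHooksCB are hypothetical, the "training" is a stand-in, and CancelFitException is caught by name directly as a simplification.

```python
from operator import attrgetter

class CancelFitException(Exception):
    """Raised by a callback to stop fitting without it counting as an error."""

def identity(*args):
    # Fallback used when a callback does not define the requested method.
    if not args: return None
    return args[0] if len(args) == 1 else args

class with_callbacks:
    """Decorator class: stores a name, wraps a method with before_/after_ hooks."""
    def __init__(self, name): self.name = name

    def __call__(self, f):
        # f is the undecorated method (here, _fit); we return its replacement.
        def _f(learner, *args, **kwargs):
            try:
                learner.callback(f'before_{self.name}')
                f(learner, *args, **kwargs)
                learner.callback(f'after_{self.name}')
            except CancelFitException:
                pass  # a callback asked to stop; don't complain
        return _f

class Callback:
    order = 0  # callbacks run in ascending order

class LogCB(Callback):
    # Hypothetical callback that records which hooks fired.
    def __init__(self): self.events = []
    def before_fit(self): self.events.append('before_fit')
    def after_fit(self): self.events.append('after_fit')

class NoHooksCB(Callback):
    # Defines no hooks at all, so getattr falls back to identity.
    order = -1

class TinyLearner:
    def __init__(self, callbacks): self.callbacks = callbacks

    def callback(self, method_name):
        # Look up the hook on each callback, sorted by order, and call it.
        for cb in sorted(self.callbacks, key=attrgetter('order')):
            getattr(cb, method_name, identity)()

    def fit(self, n_epochs):
        self.n_epochs = n_epochs
        self.epochs = range(n_epochs)
        self._fit()

    @with_callbacks('fit')
    def _fit(self):
        self.seen = [epoch for epoch in self.epochs]  # stand-in for training

log = LogCB()
learner = TinyLearner([log, NoHooksCB()])
learner.fit(2)
print(log.events)
```

Calling learner.fit(2) runs the wrapped _fit(), so the logging callback sees before_fit and then after_fit, while the callback with no hooks is skipped silently thanks to the identity default.
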

And so do ask on the forums, if any of these bits get confusing, but the 

best way to learn about these things is to open up this Jupyter notebook and try and 

create really simple versions of things. So for example, let's try identity(). 

identity(), how exactly does identity work? I can call it with nothing and get nothing back. I can call it 

with 1 and get back 1. I can call it with ‘a’ and get back ‘a’. Call it with ‘a’, 1 and get ‘a’, 1. And how is it doing that exactly? So remember 

we can add a break point and this would be a great time to really test your debugging skills. 
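
As a reference while experimenting, here is one way an identity function with that behavior could be written. This is a sketch that reproduces the calls described above; the notebook's own definition may be written differently.

```python
def identity(*args):
    # *args collects all positional arguments into a tuple.
    if not args:
        return None        # called with nothing: returns nothing
    if len(args) == 1:
        return args[0]     # one argument: hand it back unchanged
    return args            # several arguments: hand them back as a tuple

print(identity())          # None
print(identity(1))         # 1
print(identity('a'))       # a
print(identity('a', 1))    # ('a', 1)
```

To step through it, add a call to Python's built-in breakpoint() as the first line of the function body and call it again.
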

Okay, so remember in our debugger, we can hit h to find out what the commands are, but you 

really should do a tutorial on the debugger if you're not familiar with it. And then we can 

step through each one. So I can now print args. And there's actually a trick which I like, which is 

that args is actually a command, funnily enough, which will just tell you the arguments to any 

function, regardless of what they're called, which is kind of nice. And so then 

we can step through by pressing n and after this, we can check 

like, okay, what is x now? And what is args now? Right? So remember 

to really experiment with these things. So anyway, we're gonna talk about 

this a lot more in the next lesson. But before that, if you're not 

familiar with try-except blocks, you know, spend some time practicing them. 
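
For practice, here is a small try-except exercise in the spirit of the cancel exception mentioned earlier. It is only a sketch: run and the three step functions are hypothetical names made up for this example. The point is that the except clause swallows only the exception it names, and anything else still propagates.

```python
class CancelFitException(Exception):
    """Signals 'please stop', as opposed to a real error."""

def run(step):
    try:
        step()
        return 'finished'
    except CancelFitException:
        # Only this exception type is swallowed here.
        return 'cancelled'

def ok_step(): pass
def cancelling_step(): raise CancelFitException()
def broken_step(): raise ValueError('real bug')

print(run(ok_step))            # finished
print(run(cancelling_step))    # cancelled
try:
    run(broken_step)
except ValueError as e:
    # ValueError is not caught inside run(), so it reaches us here.
    print('propagated:', e)    # propagated: real bug
```
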

If you're not familiar with decorators, well, we've seen them before. So go back 

and look at them again really carefully. If you're not familiar with the debugger, practice 

with that. If you haven't spent much time with getattr, remind yourself about that. So try to get yourself really 

familiar and comfortable as much as possible with the pieces, because 

if you're not comfortable with the pieces, then the way we put the pieces together is 

gonna be confusing. There's actually something in education in kind of the theory of education 

called cognitive load theory. And basically, cognitive load theory 

says, if you're trying to learn something, but your cognitive load is really high because 

of lots of other things going on at the same time, you're not gonna learn it. So it's gonna 

be hard for you to learn this framework that we're building if you have too much cognitive 

load of like what the hell's a decorator or what the hell's getattr or what does sorted do 

or what's partial, you know, all these things. Now, I actually spent quite a bit of time 

trying to make this as simple as possible, but also as flexible as it needs 

to be for the rest of the course. And this is as simple as I could get 

it. So these are kind of things that you actually do have to learn. But in doing 

so, you're gonna be able to write some really powerful and general code yourself. So hopefully 

you'll find this a really valuable and mind expanding exercise in bringing high level software 

engineering skills to your data science work. Okay, so with that, this looks like a good place to leave it and look forward 

to seeing you next time. Bye.
