Lesson 15: Deep Learning Foundations to Stable Diffusion ...

Hi all and welcome to Lesson 15. And what

we're going to endeavor to do today is to create a convolutional autoencoder.

And in the process, we will see why doing that well is a tricky thing to do and

time permitting, we will begin to work on a framework, a deep learning framework

to make life a lot easier. Not sure how far we'll get on that today time wise. So

let's see how we go and get straight into it. Okay. So before

we can create a convolutional autoencoder, we need to talk about convolutions: what are

they, and what are they for? Broadly speaking, convolutions are something that allows us to

tell our neural network a little bit about the structure of the problem. That's going to make

it a lot easier for it to solve the problem. And in particular, the structure of our problem

is we're doing things with images. Images are laid out on a grid: a 2D grid for black and white,

a 3D grid for color, a 4D grid for color video, and so on. And so we would say there's

a relationship between the pixels going across and the pixels going down. They tend to be similar

to each other, and differences in those pixels across those dimensions tend to have meaning.

Patterns of pixels that appear in different places often represent the same thing. So for

example, a cat in the top left is still a cat, even if it's in the bottom right. This

kind of prior information is something that is naturally captured by a Convolutional

Neural Network, something that uses convolutions. Generally speaking, this is a good thing

because it means that we will be able to use fewer parameters and less computation because

more of that information about the problem we're solving is kind of encoded directly into our

architecture. There are other architectures that don't encode that prior information as strongly,

such as a Multi-Layer Perceptron, which we've been looking at so far, or a Transformer network,

which we haven't looked at yet. Those kinds of architectures do give us

more flexibility, and given enough time, compute and data, they could potentially find

things that maybe CNNs would struggle to find. So we're not always going to use

Convolutional Neural Networks, but they're a pretty good starting point and

certainly something important to understand. They're not just used for images. We can also

take advantage of one-dimensional convolutions for language-based tasks, for instance.

So convolutions come up a lot. So in this notebook, one thing you'll

notice that might be of interest is we are importing stuff from miniai now. Now

miniai is this little library that we're starting to create and we're creating it

using nbdev. So we've got a miniai.training and a miniai.datasets. And so if we look,

for example, at the datasets notebook, it starts with something that says that the

default export is called datasets. And some of the cells have an export directive on them.

And at the very bottom, we have something that calls nbdev_export(). Now what that's going to do

is create a file called datasets.py, and it contains those cells that we exported. And why is it called miniai.datasets?

That's because everything for nbdev is stored in settings.ini. And there's something here

saying create a library libname called miniai. You can't use this library until you install it.

Now we haven't uploaded it to PyPI, that is, made it a pip-installable package on a public server.

But you can actually install a local directory as if it's a Python module that you've kind of

installed from the internet. And to do that, you say pip install in the usual way, but

you add -e, which stands for editable, so the command is pip install -e . with a dot at the end. And that means set up the current directory as

a Python module. Well, actually any directory you like; I just put

dot to mean the current directory. And so you'll see that's going to go ahead and actually

install my library. And so after I've done that, I can now import things from

that library as you see. Okay. So this is just the same as before. We're

going to grab our MNIST dataset and we're going to create a Convolutional Neural Network on

it. So before we do that, we're going to talk about what are convolutions. And one of my

favorite descriptions of convolutions comes from a student in our, I think it was our very

first course, Matt Kleinsmith, who wrote this really nice Medium article, CNNs from different

viewpoints, which I'm going to steal from. And here's the basic idea. Say that this is our image.

It's a 3 by 3 image with 9 pixels labeled from A to I as capital letters. Now a convolution

uses something called a kernel, and a kernel is just another tensor. In this case, it's a 2

by 2 matrix. We're going to have alpha, beta, gamma,

delta as our four values in this kernel. Now, one thing I'll mention,

I can't remember if I've said this before, is that the

Greek letters are things that you want to

be able to pronounce. So if you don't know how to read these and say what their names are,

make sure you head over to Wikipedia or wherever and learn the names of all the Greek letters,

because they come up all the time. Okay. So what happens when we apply a convolution

with this 2 by 2 kernel to this 3 by 3 image? I mean, it doesn't have to be an image;

in this case, it's just a rank 2 tensor, but it might represent an image. What happens

is we take the kernel and we overlay it over the first little 2 by 2 sub grid, like so. And

specifically what we do is we match color to color. So the output of this first 2 by 2

overlay would be alpha times a plus beta times b plus gamma times d plus delta times

e, and that would yield some value P, and that's going to end up in the top left of a 2 by

2 output. So the top right of the 2 by 2 output, we're going to slide, it's like a sliding window,

we're going to slide our kernel over to here and apply each of our coefficients to

these respectively colored squares. And then ditto for the bottom left

and then ditto for the bottom right. So we end up with these equations. P, as

we discussed, is alpha a plus beta b plus gamma d plus delta e, plus some bias

term. Q, in the top right, is the same, except that alpha in this case multiplies b, and so on. So

we're just multiplying them together and adding them up, multiply together, add

them up, multiply together and add them up. So you can imagine that

we're basically flattening these out into rank 1 tensors into vectors and then doing a dot

product would be one way of thinking about what's happening as we slide this kernel over these

windows. And so this is called a convolution. So let's try and create a convolution. So,

for example, let's grab our training images and take a look at one. And let's create a 3 by 3 kernel. Now,

the word kernel appears a lot in computer science and math.

We've already seen the term kernel to mean a piece of code that we run on a GPU across lots

of parallel virtual devices, potentially in a grid. There's a similar idea here. We've

got a computation, which in this case is this dot product, or something like a dot product,

sliding over, occurring lots of times over a grid. But it's a bit different.

It's another use of the word kernel. So in this case, a kernel

is going to be a rank 2 tensor. And so let's create a kernel with these values

in the 3 by 3 matrix, rank 2 tensor. And we could draw what that looks like. Not surprising,

it just looks like a bunch of lines. OK, so what would happen if we slide this

over this 28 by 28 image, nine pixels at a time? Well, what's going to happen is,

if the top-left 3 by 3 section, for example, has these names,

then we're going to end up with negative a1, because the top three are all negative, right?

Negative a1, minus a2, minus a3. The next row is just zeros, so that won't do anything. And then plus

a7, plus a8, plus a9. Why is that interesting? Well, let's try it

here. What I've done here is I've grabbed just the first 13 rows and first 23 columns of

our image, and I'm actually showing the numbers and also using gray kind of conditional

formatting, if you like, or the equivalent in pandas to show this top bit. So we're looking at

just this top bit. So what happens if we take rows 3, 4 and 5? Remember, the end of the slice is not inclusive,

right? So it's rows 3, 4 and 5, and columns 14, 15, 16. So we're looking at these

three here. What's that going to give us if we multiply it by this kernel? It gives us a fairly

large positive value, because the three that we have negatives on are the top row, and they're

all zero. And the three that we have positives on are all close to 1. So we end up with

quite a large number. What about the same columns but for rows 7, 8, 9? 7, 8, 9. Here, the

top is all positive and the bottom is all zero. So that means that we're going to get a lot of

negative terms. And not surprisingly, that's exactly what we see. If we do this, kind of a dot

product equivalent, which all you need in NumPy to do that is just an element-wise multiplication

followed by a sum, right? So that's going to be quite a large negative number. And so perhaps

you're seeing what this is doing. And maybe you got a hint from the name of the tensor we created.

It's something that is going to find the top edge. So this one is a top edge, so it's a positive.
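
A minimal sketch of that calculation, assuming PyTorch. The two 3 by 3 windows are made-up values (0 for dark, 1 for bright) standing in for the MNIST patches, not the real pixels, and window_response is a hypothetical helper name used only for this illustration.

```python
import torch

# Top-edge kernel as described: negatives on the top row, zeros in the middle, positives below.
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])

# Illustrative windows only (assumed values, not real MNIST pixels).
window_top_edge = torch.tensor([[0., 0., 0.],
                                [1., 1., 1.],
                                [1., 1., 1.]])     # dark above, bright below
window_bottom_edge = torch.tensor([[1., 1., 1.],
                                   [1., 1., 1.],
                                   [0., 0., 0.]])  # bright above, dark below

def window_response(window, kernel):
    # Element-wise multiplication followed by a sum: the "dot product equivalent".
    return (window * kernel).sum().item()

top_response = window_response(window_top_edge, top_edge)
bottom_response = window_response(window_bottom_edge, top_edge)
print(top_response, bottom_response)
```

The first response comes out positive and the second negative, which is the sign pattern the lesson uses to tell a top edge from a bottom edge.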

And the second window, rows 7 to 9, is a bottom edge, so it's a negative. So we would like to apply this kernel

to every single 3 by 3 section in here. So we could do that by creating a little

apply_kernel function that takes some particular row and some particular column and some particular

tensor as a kernel and does that multiplication .sum() that we just saw. So

for example, we could replicate this one by calling apply_kernel(). And this

here is the center of that 3 by 3 grid area. And so there's that same number, 2.97. So now

we could apply that kernel to every one of the 3 by 3 sections. 3 by 3 windows in this 28 by 28

image. So we're going to be sliding over like this red bit sliding over here. But we've actually

got a 28 by 28 input, not just a 5 by 5 input. So to get all of the coordinates, let's just

simplify it to 5 by 5. We can create a list comprehension. We can take

i through every value in range(5). And then for each of those, we can take j for every value in

range(5). And so if we just look at the tuple (i, j), you can see we get a list of lists containing

all of those coordinates. So this is a list comprehension in a list comprehension, which when

you first see it may be surprising or confusing, but it's a really helpful idiom. And I

certainly recommend getting used to it. Now, what we're going to do is we're not just

going to create this tuple, but we're actually going to call apply_kernel() for each of those. So

if we go through from 1 to 27, well, actually 1 to 26, because 27 is exclusive. So we're going to go

through everything from 1 to 26. And then for each of those, go through from 1 to 26 again and call

apply_kernel(). And that's going to give us the result of applying that convolutional kernel to

every one of those coordinates. And there's the result. And you can see what it's done as we

hoped is it is highlighting the top edges. So yeah, you might find that kind of surprising

that it's that easy to do this kind of image processing. We're literally just doing an element

wise multiplication and a sum for each window. Okay, so that is called a convolution. So we

can do another convolution. This time we could do one with the left edge tensor. So as you

can see, it looks just like a rotated version, or transposed version, I guess, of our top

edge tensor. Here's what it looks like. And so this time

we're going to apply the left edge kernel. And so notice here that we're

passing in a tensor, not a function.

So we're going to pass in the left edge tensor for the same list comprehension in

a list comprehension. And this time we're getting back the left edges. It's

highlighting all of the left edges in the digit. So yeah, this is basically what's happening here

is that a 2 by 2 can be looped over an image, creating these outputs. Now you'll see here that in the process

of doing so, we are losing the outermost pixels of our image. We'll learn about

how to fix that later. But just for now, notice that as we are putting our 3 by 3 through,

for example, this 5 by 5, there's only one, two, three places that we can put it going across,

not five places, because the kernel has to fit inside the edges. All right, so that's cool. That's

a convolution. And hopefully if you remember back to kind of the Zeiler

and Fergus pictures from Lesson 1, you might recognize that the kind of first layer

of a convolutional network is often looking for kind of edges and gradients and things like

that. And this is how it does it. And then the convolutions on top of convolutions

with nonlinear activations between them can combine those into curves, or corners

or stuff like that, and so on and so forth. Okay, so how do we do this quickly? Because

currently this is going to be super, super slow doing this in Python. So one of the very earliest

or probably the earliest publicly available general purpose deep learning, GPU accelerated

deep learning thing I saw was called Caffe. That was created by somebody called Yangqing Jia.

And he actually described how Caffe went about implementing a fast convolution on a

GPU. And basically he said, well, I had two months to do it and I had to finish my thesis.

There was some other code out there from Krizhevsky, who you

might have come across; he and Hinton set up a little startup which Google bought, and that kind

of became the start of Google's deep learning, Google Brain, basically. And so Krizhevsky

had all this fancy stuff in his library, but Yangqing Jia said, oh, I didn't know how to

do all that stuff. So I said, well, I already know how to multiply matrices, so maybe I can convert

a convolution into a matrix multiplication. And so that became known as im2col. im2col is a way of converting a convolution

into a matrix multiply. And actually, I suspect Yangqing Jia kind of

accidentally reinvented it, because it had been around for a while, even at the point

that he was writing his thesis, I believe. This is the place

I believe it was created, in this paper. So that was in 2006, which is a while ago.

And so this is actually from that paper. And what they describe is, let's

say you are putting this 2 by 2 kernel over this 3 by 3 bit of an image. So

here you've got this window needs to match to this bit of this window. What you could

do is you could unwrap this to 1, 2, 1, 2 downwards to here. 1, 2, 1, 2, to unroll it

like so. And you could unroll the kernel here. Yeah, so this is 1, 2, 1, 1. So this bit is here,

1, 2, 1, 1. And then you could unroll the kernel 1, 1, 2, 2 to here, 1, 1, 2, 2. And then once

they've been flattened out and moved in that way, and then you'll do exactly the same

thing for this next patch here, 2, 0, 1, 3. You flatten it out and put it here, 2,

0, 1, 3. So if you basically take those kernels and flatten them out in this format, then you

end up with a matrix multiply. If you multiply this matrix by this matrix, you'll end up with

the output that you want from the convolution. So this is basically a way of unrolling your kernels and your input features into

matrices, such that when you do the matrix multiply, you get the right answer. So it's

a nifty trick. And so that is called im2col. I guess we're kind of cheating a little bit.

Implementing that is kind of boring. It's just a bunch of copying and tensor manipulation. So I

actually haven't done it. Instead, I've linked to a NumPy implementation, which is here. And

it also, part of it is this get_indices(), which is here. And as you can see, it's a little

bit tedious with repeats and tiles and reshapes and whatnot. So I'm not going to call it homework,

but if you want to practice your tensor indexing manipulation skills, try creating a PyTorch

version from scratch. I got to admit, I didn't bother. Instead, I used the one that's built

into PyTorch. And in PyTorch, it's called unfold. So we take our image, and PyTorch expects

there to be a batch axis and a channel dimension, so we'll add two leading unit

dimensions to it. Then we can unfold our input for a 3 by 3 kernel. And that will give us a 9 by

676 input. And then we'll take our kernel and just flatten it

out into a vector. So view() changes the shape and -1 just says dump everything into this dimension.

So that's going to create a length 9 vector. And so now we can do the matrix multiply

just like they've done here, of the kernel matrix, that's our weights, by the unrolled input features. And so that gives us a length 676 vector. We can then view

that as 26 by 26 and we get back, as we hoped, our left edge tensor result. And so this is how

we can, from scratch, create a better implementation of convolutions. The reason

I'm allowed to cheat here is because we did actually create convolutions from

scratch. We're not always creating the GPU optimized versions from scratch, which was never

something I promised. So I think that's fair. But it's cool that we can hack

out a GPU optimized version in the same way that the original deep learning

library did. So if we use apply_kernel(), we get nearly nine milliseconds. If

we use unfold() with matrix multiply, we get 20 microseconds. So that's about

400 times faster. So that's pretty cool. Now, of course, we don't have to use

unfold() and matrix multiply because PyTorch has a conv2d(). So we can run that.
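
A rough sketch of the three approaches side by side, assuming PyTorch. The image here is random data standing in for an MNIST digit, and the left_edge values are an assumption (the transcript only says it looks like a transposed top edge kernel). The point is only that the slow loop, the unfold trick, and F.conv2d all produce the same 26 by 26 result.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
img = torch.rand(28, 28)  # random stand-in for a 28x28 MNIST image

# Assumed values: the transpose of the top-edge kernel.
left_edge = torch.tensor([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])

def apply_kernel(row, col, kernel):
    # (row, col) is the centre of the 3x3 window.
    return (img[row-1:row+2, col-1:col+2] * kernel).sum()

# 1. Pure Python: a list comprehension in a list comprehension over the valid centres 1..26.
rng = range(1, 27)
slow = torch.tensor([[apply_kernel(i, j, left_edge).item() for j in rng] for i in rng])

# 2. im2col via F.unfold: add batch and channel axes, unroll every 3x3 window into a column.
inp = img[None, None]             # shape (1, 1, 28, 28)
cols = F.unfold(inp, (3, 3))[0]   # shape (9, 676)
w = left_edge.view(-1)            # length 9 vector
via_unfold = (w @ cols).view(26, 26)

# 3. Built-in convolution; the weight has shape (out_channels, in_channels, height, width).
via_conv2d = F.conv2d(inp, left_edge[None, None])[0, 0]

print(slow.shape, via_unfold.shape, via_conv2d.shape)
```

Wrapping each of the three in a timing call such as %timeit in a notebook is one way to reproduce the kind of speed comparison described here; the exact numbers will depend on the machine.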

And that interestingly is about the same speed, at least on CPU. But this would

also work on GPU just as well. Yeah, I'm not sure this will always be the

case. In this case, it's a pretty small image. I haven't experimented a whole lot

to see whereabouts there's a big difference in speeds between these. Obviously, I always

just use F.conv2d. But if there's some more tricky convolution you need to do with some weird

thing around channels or dimensions or something, you can always try this unfold trick.

It's nice to know it's there, I think. So we could do the same thing for diagonal edges. So here's our diagonal edge

kernel, and the other diagonal. So if we just grab the first 16 images, then we can do a convolution on our whole

batch with all of our kernels at once. So this is a nice optimized thing that

we can do. And you end up with your 26 by 26 outputs, for each of your 4 kernels and each of

your 16 images. And so that's summarized here. So that's generally what we're doing to get

good GPU acceleration is we're doing a bunch of kernels and a bunch of images all at once

across all of their pixels. And so here we go. That's what happens when we take a look at

our various kernels for a particular image. Left edge, I guess top edge and then

diagonal top left and top right. OK, so that is optimized convolutions,

and that works just as well on CPU or GPU. Obviously, GPU will

be faster if you have one. Now, how do we deal with the

problem that we're losing one pixel on each side? What we can do

is we can add something called padding. And for padding, what we basically do

is rather than starting our window here, we start it right over here, and actually it would be

up one as well. And so for these 3 on the left here, we just take the input for each of those

as 0. So we're basically just assuming that they're all zero. I mean, there are

other options we could choose. We could assume they're the same as the one next

to them. There's various things we can do, but the simplest and the one we normally

do is just assume that they're zero. So

this is called one pixel padding. Let's say we did 2 pixel padding. So we had

2 pixel padding with a 5 by 5 input, and a 4 by 4 kernel. So that gray is our kernel.

Right, then we're going to start way over here on the corner. OK, and then you can see

what happens as we slide the kernel over. There's all the spots that it's going to take, and so

this dotted line area is the area that we're effectively going through, but all of these

white bits, we're just going to treat as zero. And then this green is the output

size we end up with, which is going to be 6 by 6 for a 5 by 5 input. I should mention, even-numbered kernel sizes are not used

very often. We normally use odd-numbered kernels. If you use, for example, a 3

by 3 kernel and 1 pixel of padding, you will get back the same size you start with.

If you use 5 by 5 with 2 pixels of padding, you'll end up with the same size you start

with. So generally odd-numbered kernel sizes are easier to deal with to make sure

you end up with the same size you start with. OK, so as it says here, if you've got an

odd-numbered ks by ks kernel, then ks truncate-divided by 2 (ks//2), that's what slash

slash means, will give you the right amount of padding. And so another trick you can do is you don't always have

to just move your window across by one each time. You could move it by a different amount each time.

The amount you move it by is called the stride. So, for example, here's a case of doing a stride

2, so a stride 2 padding 1. So we start out here and then we jump across 2 and then we jump across

2 and then we go to the next row. So that's called a stride 2 convolution. Stride 2 convolutions

are handy because they actually reduce the dimensionality of your input by a factor of 2.
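
A small sketch of the size arithmetic for padding and stride, assuming PyTorch. The formula is the standard one for a single spatial dimension; conv_out_size and actual_out_size are hypothetical helper names, and each case is checked against what F.conv2d actually returns.

```python
import torch
import torch.nn.functional as F

def conv_out_size(n, ks, stride=1, padding=0):
    # Output size along one dimension: floor((n + 2*padding - ks) / stride) + 1
    return (n + 2 * padding - ks) // stride + 1

def actual_out_size(n, ks, stride=1, padding=0):
    # Run a real convolution on zeros and read the resulting grid size.
    x = torch.zeros(1, 1, n, n)
    w = torch.zeros(1, 1, ks, ks)
    return F.conv2d(x, w, stride=stride, padding=padding).shape[-1]

cases = [
    dict(n=28, ks=3),                       # no padding: one pixel lost per side
    dict(n=5, ks=4, padding=2),             # the 2 pixel padding, 4 by 4 kernel example
    dict(n=28, ks=3, padding=1),            # ks // 2 padding keeps the size
    dict(n=28, ks=5, padding=2),            # ks // 2 padding keeps the size
    dict(n=28, ks=3, stride=2, padding=1),  # stride 2 halves the grid
]
sizes = [conv_out_size(**c) for c in cases]
for c, s in zip(cases, sizes):
    assert s == actual_out_size(**c)
    print(c, "->", s)
```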

And that's actually what we want to do a lot. For example, with an autoencoder, we want to

do that. And in fact, for most classification architectures, we do exactly that. We keep

on reducing the, kind of the grid size by a factor of 2 again and again and again using

stride 2 convolutions with padding of 1. So that's strides and padding. So let's go ahead

and create a ConvNet using these approaches. So we're going to get the size of our

training set. This is all the same as before: number of categories, that is, number of

digits, and the size of our hidden layer. So previously, with our sequential linear

models, with our MLPs, we basically went from the number of pixels to the number

of hidden and then a ReLU and then the number of hidden to the number of outputs.

So here's the equivalent. With a convolution. Now, the problem is that you can't just

do that because the output is not now 10 probabilities for each item in our batch,

but it's 10 probabilities for each item in our batch for each of 28 by 28 pixels because

we don't even have a stride or anything. So you can't just use the same simple approach that we

had for MLP. We have to be a bit more careful. So to make life easier, let's create a little conv

function that does a Conv2d with a stride of 2 optionally followed by an activation. So if act

is true, we will add in a ReLU activation. So this is going to either return a Conv2d or a little

Sequential containing a Conv2d followed by a ReLU. And so now we can create a CNN

from scratch as a sequential model. And so since activation is True by default, this

is going to take our 28 by 28 image, starting with 1 channel and creating an output of 4 channels.
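
A sketch of the conv helper just described, and of the sequential model this part of the lesson walks through, assuming PyTorch. The default kernel size of 3, the ks // 2 padding, the 16 channel width of the fourth layer, and the 7 by 7 grid size are not stated in the narration and are assumptions here; the comments track the grid size after each layer.

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # Stride-2 Conv2d with ks // 2 padding, optionally followed by a ReLU.
    res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    if act:
        res = nn.Sequential(res, nn.ReLU())
    return res

simple_cnn = nn.Sequential(
    conv(1, 4),               # 14x14
    conv(4, 8),               # 7x7
    conv(8, 16),              # 4x4
    conv(16, 16),             # 2x2
    conv(16, 10, act=False),  # 1x1
    nn.Flatten(),             # drop the trailing unit axes
)

xb = torch.randn(16, 1, 28, 28)  # a fake mini batch of 16 single-channel images
out = simple_cnn(xb)
print(out.shape)
```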

So this is the number of input channels, and this is the number of filters. Sometimes we'll say filters to

describe the number of channels that our convolution has; that's the number of

outputs. And it's very similar to the idea of the number of outputs in a linear layer, except

this is the number of outputs in your convolution. So what I like to do when I create stuff like

this is I add a little comment just to remind myself what is my grid size after this. So I had

a 28 by 28 input, and I've put it through a stride-2 conv. So the output of this will be

14 by 14. Then we'll do the same thing again, but this time we'll go from a 4 channel input

to an 8 channel output, which is 7 by 7, and then from 8 to 16. So by this point, we're now down to

a 4 by 4. And then down to a 2 by 2, and then finally we're down to a 1 by 1. So on

the very last layer, we won't add an activation. And the very last layer is going to create 10

outputs. And since we're now down to a 1 by 1, we can just call Flatten() and that's going to remove

those unnecessary unit axes. So if we take that and pop a mini batch through it, we end up with

exactly what we want, 16 by 10. So for each of our 16 images, we've got 10 probabilities, one for

each possible digit. So we take our training set and make it into 28 by 28 images,

do the same thing for the validation set, and then create two datasets, one for each,

called the train dataset and the valid dataset. And we're now going to train this on the GPU. Now,

if you've got an Apple Silicon Mac, you've

got a device called MPS, which is going to use your Mac's GPU. Or if you've got Nvidia, you can

use CUDA, which will use your Nvidia GPU. CUDA is 10 times or more, possibly much more, faster than

a Mac. So you definitely want to use Nvidia if you can. But if you're just running it on a

Mac laptop or whatever, you can use MPS. So basically, you want to know what device to use.

Do we want to use CUDA or MPS? You can check torch.backends.mps.is_available()

to see if you're running on a Mac with MPS. You can check torch.cuda.is_available() to see

if you've got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course,

you'll have to use the CPU to do computation. So I've created a little function here, to_device, which takes a tensor or a dictionary

or a list of tensors or whatever, and a device to move it to, and it just goes

through and moves everything onto that device. Or if it's a dictionary, it gives back a dictionary with the

values moved onto that device. So that's a handy little

function. And so we can create a custom collate function, which calls the

PyTorch default collation function and then puts those tensors onto our device.
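
One way the device check and the two helpers might look, assuming a PyTorch version recent enough to have torch.backends.mps. The names to_device and collate_device follow the narration, but the bodies are a sketch rather than the actual miniai code, and pick_device and def_device are hypothetical names.

```python
import torch
from torch.utils.data.dataloader import default_collate

def pick_device():
    # Prefer Apple Silicon's MPS, then CUDA, and fall back to the CPU.
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

def_device = pick_device()

def to_device(x, device=def_device):
    # Move a tensor, or a dict/list/tuple of tensors, onto the given device.
    if isinstance(x, torch.Tensor):
        return x.to(device)
    if isinstance(x, dict):
        return {k: to_device(v, device) for k, v in x.items()}
    if isinstance(x, (list, tuple)):
        return type(x)(to_device(o, device) for o in x)
    return x  # anything else is returned unchanged

def collate_device(batch):
    # Default PyTorch collation, then move the collated tensors to the device.
    return to_device(default_collate(batch))

# Tiny usage example with two fake (image, label) samples.
batch = [(torch.zeros(1, 28, 28), 3), (torch.ones(1, 28, 28), 5)]
xb, yb = collate_device(batch)
print(xb.shape, yb, xb.device)
```

A function like collate_device can then be handed to a DataLoader through its collate_fn argument, which is how each mini batch ends up on the chosen device.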

And so with that, we've now got enough to train this neural net on the GPU. We created

this get_dls function in the last lesson. So we're going to use that, passing in the

datasets that we just created and our custom collation function. We're going to create

our optimizer using our CNN's parameters. And then we call fit(). Now fit(), remember, we also created

in our last lesson. And it's done. So what I

did then was reduce the learning rate by a factor of four and run it again. And eventually I got to a

fairly similar accuracy to what we got with our MLP. So, yeah, we've got a convolutional network

working. I think that's pretty encouraging. And it's nice that to train it, we didn't have

to write much code, right? We were able to use code that we had already built. We were

able to use the Dataset class that we made, the get_dls function that we made,

and the fit function that we made. And, you know, because those things are

written in a fairly general way, they work just as well for a ConvNet as they did for an

MLP. Nothing had to change. So that was nice. Notice I had to take the model and put it on

the device as well. So that will go through and basically put all of the tensors that are in that

model onto the MPS or CUDA device, if appropriate. So we've got a batch size of 64,

1 channel, and 28 by 28, so our axes are batch, channel, height,

width. So normally this is referred to as NCHW. Generally when you see N in

a paper or whatever, used in this way, it's referring to the batch size: N

being the number, that's the mnemonic, the number of items in the batch. C is the

number of channels, then height by width: NCHW. TensorFlow doesn't use that. TensorFlow uses

NHWC. So we generally call that channels-last, since channels are at the end. And this one we

normally call channels-first. Now, of course, it's not actually channels-first. It's actually

channels-second, but we ignore the batch bit. In some models, particularly some more modern

models, it turns out that channels-last is faster. So PyTorch has recently added

support for channels-last. And so you'll see that being used more and more as well. All right, so a couple of comments and questions

from our chat. The first is Sam Watkins pointing out that we've actually had a bit of a win here,

which is that the number of parameters in our CNN is pretty small by comparison. So in the MLP

version, the number of parameters is basically equal to the size of this matrix, so m times nh, plus the number in this one,

which will be nh times 10. And something that at some point we probably

should do is actually create something that allows us to automatically

calculate the number of parameters. And I'm ignoring the bias there, of course. Let's see, what would be a good way to do that? Maybe np.product(). There we go. So what we could do is just

calculate this automatically by doing a little list comprehension here.

So there's the number of parameters across all of the different layers, so both bias

and weights. And then we could, I guess, just, well, we could just use, well, let's use

PyTorch. So we could turn that into a tensor and sum it up. Oops. So that's the number in

our MLP. And then the number in our simple CNN. So that's pretty cool. We've gone down from 40,000

to 5,000 parameters and got about the same accuracy. Oh, thank you, Jonathan. Jonathan's reminding me that

there's a better way than np.product(o.shape), which is just to say o dot

number of elements: o.numel(). Same thing. Very nice. Now, one person asked a very

good question, which is, I thought Convolutional Neural Networks can

handle any sized image. And actually, no, this convolutional network cannot handle any

sized image. This Convolutional Neural Network only handles images that, once they go through

these stride-2 convs, end up with a 1 by 1. Because otherwise, you can't flatten it

and end up with 16 by 10. So we will learn how to create convnets that can handle any sized

input. But there's nothing about a convnet in particular that guarantees it

can handle any sized input. OK, so let's briefly finish this section off

by talking about this, particularly I want to talk about the idea of receptive field. Consider this

1 input channel, 4 output channel, 3 by 3 kernel. So that's just to show you what we're doing here.

conv1, well, actually, simple_cnn. This is the model we created. Remember, it was

like a Sequential model containing Sequential models, because that's how our conv function

worked. So simple_cnn[0] is our first layer. It contains both a Conv and a ReLU. So simple_cnn[0][0]

is the actual Conv. So if we grab that, call it conv1, its weight is a 4 by 1 by 3 by 3. So,

number of outputs, number of input channels, and height by width of the kernel. And then

it's got its bias as well. So that's how we could deconstruct what's going on with our weight

matrices or our parameters inside a convolution. Now, I'm going to switch over to Excel. So in

the lesson notes on the course website or on the forum, you'll

see we've got an Excel workbook. Oh, Wasim reminded me that there is

a nice trick we can do. I do want to do that actually because I love this trick. Oh, I just deleted everything though. Let's put

it all back. Here we go. Which is you actually don't need square brackets. The square brackets is

a list comprehension. Without the square brackets, it's called a generator. And it, oh, no, you can't

use it there. Maybe that only works with NumPy. Ah, OK. So that's the list. No, that doesn't work either. So much for that.
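
For what it's worth, here is a minimal sketch of the parameter count being discussed. The layer sizes are assumptions chosen to roughly match the "40,000 versus 5,000" figures mentioned above, not a copy of the notebook, and the ReLUs are left out because they have no parameters. Python's built-in `sum` is the one that accepts a generator expression directly.

```python
import torch
from torch import nn

# Assumed MLP: 784 inputs, 50 hidden units, 10 outputs.
mlp = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))

# Assumed simple CNN: five stride-2 3x3 convs taking 28x28 down to 1x1, then a flatten.
channels = [1, 4, 8, 16, 16, 10]
cnn = nn.Sequential(
    *[nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)
      for ni, nf in zip(channels, channels[1:])],
    nn.Flatten(),
)

def n_params(model):
    # No square brackets needed: the built-in sum consumes the generator lazily.
    return sum(p.numel() for p in model.parameters())

mlp_params = n_params(mlp)
cnn_params = n_params(cnn)
out_shape = tuple(cnn(torch.zeros(16, 1, 28, 28)).shape)
print(mlp_params, cnn_params, out_shape)
```

With these assumed sizes the flatten only produces a batch-by-10 output because the 28 by 28 input shrinks to exactly 1 by 1, which is the point made above about this network not handling arbitrary image sizes.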

I'm kind of curious now. Maybe torch.sum. No, just sum. Oh, OK. Well, I don't want to

use Python sum. That's interesting. I feel like all of them should handle

generators, but there you go. OK. So open up the conv-example spreadsheet. And

what you'll see on the conv-example worksheet page is something that looks a lot like the number

7. And indeed, this is the number 7 that I got straight from MNIST. Let's see. OK. So you can

see over here, we have a number 7. This is a number 7 from MNIST that I have copied into

Excel. And then you can see over here, we've got the top edge kernel being applied. And over

here, we've got a right edge kernel being applied. This might be surprising you because you

might be thinking, wait a minute, Jeremy, Microsoft Excel doesn't do Convolutional

Neural Networks. Well, actually, it does. So if I zoom in in Excel, you'll see, actually,

these numbers are, in fact, conditional formatting applied to a bunch of spreadsheet cells. And so

what I did was I copied the actual pixel values into Excel and then applied conditional

formatting. And so now you can see what the digit is actually made of. So you can see here I've created our top edge filter. And here

I've created our left edge filter. And so here I am applying that filter to that window. And

so here you can see it looks a lot like NumPy. It's just a sum product. And you might not be

aware of this, but in Excel, you can actually do broadcasting. You have to hit Apple

Shift Enter or Control Shift Enter, and it puts these little curly brackets

around it. It's called an array formula. It basically lets you do broadcasting or

simple broadcasting in Excel. And so here's how you could say… this is how I created

this top edge filtered version in Excel. And the left edge version is exactly the same,

just a different kernel. And as you can see, if I click on it, it's applying this filter

to this input area and so forth. OK, so we can then… I just

arbitrarily pick some different values here. And so something to notice now

in my second layer, so here's Conv1, is Conv2, it's got a bit more work to do. We actually

need two filters because we need to add together this bit here applied to this with

this kernel applied and this bit here with this kernel applied. So you actually

need one set of 3 by 3 for each input. And also, I want two separate outputs, so I

actually end up needing a 2 by 2 by 3 by 3 weights matrix or weights tensor, I should say,

which you might remember is exactly what we had in PyTorch. We had a rank 4 tensor. So if I have a

look at this one, you see exactly the same thing. This input is using this kernel applied

to here and this kernel applied to here. So that's important to remember that you

have these rank 4 tensors. And so then rather than doing stride-2 conv, I did something else

which is actually a bit out of favor nowadays, but it's another option, which is to do something

called max-pooling to reduce my dimensionality. So you can see here I've got 28 by 28.

I've reduced it down here to 14 by 14. And the way I did it was simply to take

the max of each little 2 by 2 area. So that's all that's been done

there. So that's called max-pooling. And so a conv followed by max-pooling has the same effect as a

stride-2 conv, not mathematically identical, but the same effect, which is it does a convolution

and reduces the grid size by a factor of 2 on each dimension. So then how do we create a single

output if we don't keep doing this until we get to 1 by 1, which

I'm too lazy to do in Excel? Well, one approach, and again, this is a little

bit out of favor as well, but one approach we can do is we can take every one of these, we've

now got 14 by 14, and apply a dense layer to it. And so what I've done here is I've got a

big, imagine this is basically all being flattened out into a vector. And so here

we've got the sum product of this by this, plus the sum product of this by this,

and that gives us a single number. And so that is how we could then optimize

that in order to optimize our weight matrices. Now, and then, you know, the more modern approach, we don't use this kind of dense layer much

anymore. It still appears a bit. The main place that you see this used is in a network

called VGG, which is very old now. I thought it might be 2013 or something, but it's actually

still used. And that's because for certain things like something called style transfer or in

general perceptual losses, people still find VGG seems to work better. So you still actually

see this approach nowadays sometimes. The more common approach, however, nowadays is

we take the penultimate layer and we just simply take the average of all of the activations. So

the, you know, the nowadays we would simply, the Excel way of doing it would be literally

simply say AVERAGE of the penultimate layer. And that is called global average pooling.
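
The two pooling ideas from the spreadsheet can be sketched in PyTorch like this. The tensor sizes are made up for illustration, and this is just one way to write it, not necessarily how the course code does it.

```python
import torch
import torch.nn.functional as F

# Max-pooling: take the max of each non-overlapping 2x2 area, halving height and width.
x = torch.arange(16.0).reshape(1, 1, 4, 4)
pooled = F.max_pool2d(x, kernel_size=2)

# Global average pooling: average every activation in each channel of the
# penultimate layer, leaving one number per channel.
acts = torch.rand(8, 16, 14, 14)          # (batch, channels, height, width)
gap = acts.mean(dim=(-2, -1))             # shape (8, 16)

# Global max pooling is the same idea with max in place of mean.
gmp = acts.amax(dim=(-2, -1))             # shape (8, 16)

print(pooled.squeeze(), gap.shape, gmp.shape)
```

Because the average is taken over whatever grid size arrives, a head built this way does not depend on the input ending up at exactly 1 by 1.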

Everything has to have a fancy word, a fancy phrase, but that's all it is. Taking the average

is called global average pooling, or you could take the max, whatever that

would be global max pooling. So anyway, the main reason I wanted to show you

this was to do something which I think is pretty interesting, which is to take something in our,

I'm just going to zoom out a little bit here. Let's take something in our max pool here. And I'm going to say Trace

Precedents to show you the area that it's coming from. Okay. So it's coming from these

four numbers. Now if I trace precedents again, it's saying what's actually impacting this.

Obviously the kernel's impacting it. And then you can see that the input area here is a

bit bigger. And then if I trace precedents again, then you can see the input area is bigger still.

So this number here is calculated from all of these numbers in the input. This area in the

input is called the receptive field of this unit. And so the receptive field in this case

is 1, 2, 3, 4, 5, 6 by 6, right? And that means that a pixel way up here in the top,

right? Has literally no ability to impact that activation. It's not

part of its receptive field. If you have a whole bunch of stride 2 convs, each

time you have one, the receptive field is going to get twice as big. So the receptive field at the

end of a deep network is actually very large. But the inputs closest to the middle of the receptive

field have the biggest kind of say in the output because they implicitly appear the most often in

all of these kind of dot products that are inside this convolutional window. So the receptive field

is not just like a single binary on off thing. Certainly all the stuff that's not got precedent arrows

here is not part of it at all. But the closer to the center of the receptive field, the

more impact it's going to have, the more ability it's got to change this number. So the

receptive field is a really important concept and yeah, fiddling, playing around with Excel's

precedent arrows, I think is a nice way to see that, at least in my opinion.
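
The size of a receptive field can also be worked out with a few lines of arithmetic. This sketch assumes the spreadsheet stacks two 3 by 3 stride-1 convs followed by a 2 by 2 max pool, which is consistent with the 6 by 6 area counted above; `receptive_field` is a hypothetical helper name, not something from the course library.

```python
def receptive_field(layers):
    """Receptive field width of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, input side first.
    """
    size, jump = 1, 1           # one unit sees itself; adjacent units are 1 input pixel apart
    for ks, stride in layers:
        size += (ks - 1) * jump  # each extra kernel tap reaches `jump` input pixels further
        jump *= stride           # strides spread adjacent units further apart in the input
    return size

# Two 3x3 stride-1 convs then a 2x2 stride-2 max pool (assumed spreadsheet layout).
excel_rf = receptive_field([(3, 1), (3, 1), (2, 2)])

# A stack of 3x3 stride-2 convs: the field roughly doubles with each layer.
stride2_rfs = [receptive_field([(3, 2)] * n) for n in range(1, 5)]
print(excel_rf, stride2_rfs)
```

This only gives the extent of the field. It says nothing about the point made above that pixels near the center carry more weight than pixels near the edge.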

And apart from anything else, it's great fun creating a Convolutional

Neural Network in Excel. I thought so anyway. Okay, so let's take a seven minute break. I'll see you back after that to talk about

a convolutional autoencoder. All right. Okay, welcome back. We're going to have

a look now at the autoencoder notebook. So we're just going to import all of our

usual stuff and we've got one more of our own modules to import now as well. And this

time we are going to switch to a different, we're going to switch to a different

dataset, which is the fashion MNIST dataset. We can take advantage of the

stuff that we did in 05_datasets and the Hugging Face stuff to load it.

So we've seen this a little bit before. Back in our datasets one here. And we never actually built any models

with it. So let's first of all do that. So this is just, I'm going to convert

each thing, each image into a tensor, and that's going to be an in-place transform.

Remember we created this decorator. And so we can call the dataset dictionary's

with_transform. This

is all stuff we've done before. And so here we have our example of a sneaker. All right. And we will create our collation

function, collating a dictionary for that dataset. That's something you should remind yourself. We

built that ourselves in the datasets notebook. And let's actually make our collate function something that does to_device(),

which we wrote in our last notebook. And we'll create a little data_loaders function

here, which is going to go through each item in the dataset dictionary and create a DataLoader for

it and give us a dictionary of data loaders. Okay. So, okay. So now we've got a data loader for training

and a data loader for validation. So we can grab the x and y batch by just calling next()

on that iterator as we've done before. We can grab the, let's look at each

of these in turn actually. We've done all this before, but it's a couple of weeks ago. So just to remind you, we can

get the names of the features. And so we can then get, create an itemgetter()

for our y's. And we can, so we'll call that the label getter. We can apply that to our labels to

get the titles of everything in our mini batch. And we can then call our

show_images() that we created with that mini batch, with those titles. And

here we have our fashion MNIST mini batch. Okay. So let's create a classifier and

we're just gonna use exactly the same code, copy and pasted from the previous

notebook. So here is our sequential model. And we are going to grab the parameters of the CNN and the CNN I've actually

moved it over to the device. The default device was what we created in our

last notebook. And as you can see, it's fitting. Now our first problem is it's fitting

very slowly, which is kind of annoying. So why is it running pretty slowly? Let's think

about, let's have a look at our dataset. So when it's finally finished, let's take a

look at an item from the dataset. Actually, let's not look at the dataset. Let's actually

go all the way back to the dataset dictionary so before it gets transformed dataset dictionary

and let's grab the training part of that and let's grab one item. And

actually we can see here the problem. For MNIST, we had all of the data loaded into

memory into a single big tensor, but this Hugging Face one is created in a much more kind

of normal way, which is, each image is a totally separate PNG image. It's not all pre-converted

into a single thing. Why is that a problem? Well, the reason it's a problem is that

our DataLoader is spending all of its time decoding these PNGs. So if I train here, okay, so while I'm training, I can type htop

and you can see that basically my CPU is 100% used. Now that's a bit weird because I've

actually got 64 CPUs. Why is it using just one of them is the first problem. But why

does it matter that it's using 100% CPU? Well, the reason it matters,

let's run it again so you can see, why does it matter that our CPU is 100%

and why is it making it so slow? Well, the reason why is if we look at

nvidia-smi dmon that will monitor our GPUs utilization. I've got three GPUs, so

I say to choose just the zeroth index one. And you'll see this column here, sm, this stands

for streaming multiprocessor. It's the equivalent of CPU usage. And

generally we're only using up 1% of our one GPU. So no wonder it's so slow. So the first thing

we wanna do then is try to make things faster. Now, to make things faster, we wanna be

using more than one CPU to decode our PNGs. And as it turns out, that's actually

pretty easy to do. You just have to add an extra argument to your data loaders, which is here, num_workers. And so I

can say use eight CPUs, for example. Now, if I create, I recreate the data loaders

and then try to create, get the next one. Oh, now I've got an error. And the error is

rather quirky. And what it's saying is, oh, you're now trying to use multiple processes. And

generally in Python and PyTorch, using multiple processes, things start to get complicated.

And one of the things that absolutely just doesn't work is you can't actually have your

DataLoader put things onto the GPU in your separate processes. It just doesn't work. So the

reason for this error is actually because of the fact that we used a collate function that put

things on the device. That's incompatible, unfortunately, with using multiple

workers. So that's a problem. And the answer to that problem, sadly, is that we would have to actually rewrite our

fit function entirely. So there's annoying thing number one, and we don't want to be rewriting

our fit function again and again. We want to have a single fit function. So, okay. So there's

a problem that we're gonna have to think about. Problem number two is that this is not very

accurate, 87%. Well, I mean, is it accurate? It's easy enough to find out. There's a

really nice website called paperswithcode. And it has a little leaderboard. And we

can see whether we're any good. And the answer is we're not very good at

all. So these papers had 96%, 94%, 92%. So yeah, we're not looking

great. So how do we improve that? There's a lot of things we could try, but

pretty much all of them are going to involve modifying our fit function again and in reasonably complicated ways. So we still

got a bit of an issue there. Let's put that aside because what we actually

wanted to do is create an auto-encoder. So to remind you about what an auto-encoder is,

and we're gonna be able to go into a bit more detail now, we're gonna start with our input

image, which is gonna be 28 by 28. So it's the number 3, right? And it's a 28 by 28. And we're

gonna put it through, for example, a stride-2 conv, stride-2. And that's going

to have an output of a 14 by 14. And we can have more channels. So say maybe

2, so this is 28 by 28 by 1. Let's do 14 by 14 by 2. So we've reduced the height and width

by 2, but added an extra channel. So overall, this is a 2x decrease in the number of activations (784 down to 392). And

then we could do another stride-2 conv, and that would give us a 7 by 7. And again,

we can choose however many channels we want, but let's say we choose 4. So now compared to our

original, we've now got a times 4 reduction. And so we could do that a few times, or we could

just stay there. And so this is compressing. And so then what we could do is then somehow

have a convolution layer or group of layers, which does a convolution

and also increases the size. There is actually something

called a transposed convolution, which I'll leave you to look up if you're

interested, which can do that. It's also known, rather weirdly, as a stride-1/2 convolution. But

there's actually a really simple way to do this, which is to say, let's say you've got a bunch of

pixels. Let's say we've got 3 by 3 pixels that look like this, 1, 0, 1, 1,

say. We could make that into a 6 by 6 very easily, which is we could simply, let's get these out. We could simply

copy that pixel there into the first 4, copy that pixel there into these 4. And so you can

see, and then copy this pixel here into these 4. And so we're simply turning each pixel into 4

pixels. And so this is called nearest neighbor upsampling. Now that's not a convolution,

that's just copying. But what we could then do is we could then apply a stride-1

convolution to that. Right? And that would allow us to double the grid size with the

convolution. And that's what we're gonna do. So our autoencoder is gonna need a deconvolutional

layer, and that's gonna contain two layers, up sampling nearest neighbor, scale factor of

2, followed by a conv2d with a stride of one. Okay. And you can see for padding, I

just put kernel size // 2. So that's a truncating division, cause that always

works for any odd sized kernel. As before, we will have an optional activation function, and

then we will create a sequential using *layers. So that's gonna pass in each layer as a separate

argument, which is what Sequential() expects. Okay. So let's write a new fit function. It goes

through, I just basically copied it over from our previous one, going through each epoch, but

I've pulled out eval into a separate function, but it's basically doing the same thing. Okay. So here is our auto

encoder. And so we're going to, it's a bit tricky because I wanted to go

down by 1, 2, 3, to get to a 4 by 4 by 8, but starting at 28 by 28, you can't divide

that three times and get an integer. So what I first do is I zero pad. So add padding

of 2 on each side to get a 32 by 32 input. So if I then do a conv with 2 channel output, that

gives us 16 by 16 by 2, and then again to get an 8 by 8 by 4, and then again to get a 4 by 4 by

8. So this is doing an 8x compression, and then we can call deconv() to do exactly the same thing

in reverse. The final one with no activation, and then we can truncate off those two pixels

off the edge. Slightly surprisingly, PyTorch lets you pass -2 to ZeroPad2d to crop off the

final 2 pixels. And then we'll add a Sigmoid(), which will force everything to go between

0 and 1, which of course is what we need. And then we will use mse_loss to compare

those pixels to our input pixels. And so a big difference we've got here now

is that our loss function is being applied to the output of the model and the input itself,

right? We don't have yb here, we have xb. So we're trying to recreate our original. And

again, this is a bit annoying that we have to create our own fit function. Anyway, so we can

now see what is the mse_loss, and it's not, like, gonna be particularly human readable,

but it's a number we can see if it goes down. And so then we can create, then we can do our SGD with the

parameters of our auto-encoder, with mse_loss, call that fit

function, we just wrote, and I won't wait for it to run, cause as you can see,

it's really slow for reasons we've discussed. I've run it before. And what we want is

to see that the original, which is here, which is here, gets recreated.

And the answer is, oh, not really. I mean, they're roughly the same things,

but there's no point having an auto-encoder which can't even recreate the originals. The idea

would be that if these looked almost identical to these, then we'd say, wow, this is a fantastic

network at compressing things by eight times. So I found this like very fiddly to try and

get this to work at all. Something that I discovered can get it to start training is

to start with a really low learning rate for a few epochs, and then increase the

learning rate after a few epochs. I mean, at least it gets it to train and show something

vaguely sensible, but let's see. Yeah, that still looks pretty crummy. This one here I got actually

by switching to Adam, and I actually removed the tricky bit. I removed these two as well.
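
For reference, the autoencoder described above can be sketched roughly as follows. This follows the spoken description (zero pad to 32 by 32, three stride-2 convs down to 4 by 4 by 8, three nearest-neighbor upsample plus stride-1 conv blocks back up, crop with a padding of -2, then a sigmoid). The use of ReLU as the activation and the exact helper signatures are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

def conv(ni, nf, ks=3, stride=2, act=True):
    # Stride-2 conv halves the grid size; ks // 2 padding works for any odd kernel size.
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def deconv(ni, nf, ks=3, act=True):
    # Nearest-neighbor upsampling copies each pixel into a 2x2 block (no learning),
    # then a stride-1 conv does the learned part at the doubled grid size.
    layers = [nn.UpsamplingNearest2d(scale_factor=2),
              nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks // 2)]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

ae = nn.Sequential(
    nn.ZeroPad2d(2),            # 28x28 -> 32x32
    conv(1, 2),                 # 16x16x2
    conv(2, 4),                 # 8x8x4
    conv(4, 8),                 # 4x4x8
    deconv(8, 4),               # 8x8x4
    deconv(4, 2),               # 16x16x2
    deconv(2, 1, act=False),    # 32x32x1
    nn.ZeroPad2d(-2),           # negative padding crops back to 28x28
    nn.Sigmoid(),               # outputs between 0 and 1
)

xb = torch.rand(16, 1, 28, 28)          # random stand-in for a Fashion MNIST batch
bottleneck = ae[:4](xb)                 # output of the encoder half
out = ae(xb)
loss = F.mse_loss(out, xb)              # compared against the input, not against labels
print(bottleneck.shape, out.shape, loss.item())
```

The bottleneck holds 4 * 4 * 8 = 128 values against 32 * 32 = 1024 in the padded input, which is where the 8x figure comes from.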

But yeah, I couldn't get this to like recreate anything very reasonable

or any reasonable amount of time. And why is this not working very well? There's so

many reasons it could be. Like do we need a better optimizer? Do we need a better architecture?

Do we need to use a Variational Auto-Encoder? There's a thousand things we could try,

but doing it like this is going to drive us crazy. We need to be able to really rapidly

try things and all kinds of different things. And so what I often see in projects or on

Kaggle or whatever, people's code looks kind of like this. It's all like manual. And then their

iteration speed is too slow. We need to be able to really rapidly try things. So we're not gonna keep

doing stuff manually anymore. This is where we take a halt and we say, okay, let's build

up a framework that we can use to rapidly try things and to understand when things

are working and when things aren't working. So we're gonna start creating a learner. So what is a learner? It's basically the

idea is this learner is gonna be something that we build, which will allow us to

try like anything that we can imagine very quickly. And we will build that on top

of that learner things that will allow us to introspect what's going on inside our model,

will allow us to do multi-process CUDA to go fast. It will allow us to add things like

data augmentation. It will allow us to try a wide variety of architectures quickly

and so forth. So that's gonna be the idea. And of course we're gonna create it from scratch.

And so let's start with Fashion MNIST like before. And let's create a DataLoaders class, which is gonna look a bit like what we

had before, where we're just going to pass in, this is just, this couldn't be

simpler, right? We're just gonna pass in two DataLoaders and store them away. And I'm gonna

create a @classmethod from dataset dictionary. And what that's gonna do is

it's gonna call DataLoader on each of the dataset dictionary items with

our batch size and instantiate our class. So if you haven't seen @classmethod before, it's

what allows us to say DataLoaders dot something in order to construct this. We could have put this in

__init__ just as well, but we'll be building more complex DataLoaders things later. So I thought we

might start by getting the basic structure right. So this is all pretty much the same as

what we've had before. I'm not doing anything on the device here, cause

as we know that didn't really work. Okay. Oh, this is an old thing.
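
A minimal sketch of a DataLoaders class along the lines just described. The method name `from_dd`, the use of `TensorDataset`, and the random data are assumptions made to keep the example self-contained; the real notebook works from a Hugging Face dataset dictionary and passes its own collate function.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class DataLoaders:
    def __init__(self, *dls):
        # Just store the training and validation DataLoaders.
        self.train, self.valid = dls[:2]

    @classmethod
    def from_dd(cls, dd, batch_size, **kwargs):
        # One DataLoader per dataset in the dictionary; extra keyword arguments
        # such as num_workers are passed straight through to DataLoader.
        return cls(*[DataLoader(ds, batch_size, **kwargs) for ds in dd.values()])

# Random stand-in data in place of Fashion MNIST.
dd = {
    'train': TensorDataset(torch.randn(100, 784), torch.randint(0, 10, (100,))),
    'test': TensorDataset(torch.randn(40, 784), torch.randint(0, 10, (40,))),
}

# num_workers=0 keeps this runnable anywhere; a larger value uses worker processes.
dls = DataLoaders.from_dd(dd, batch_size=32, num_workers=0)
xb, yb = next(iter(dls.train))
print(xb.shape, yb.shape, len(dls.valid))
```

Nothing here touches the device, so the batches stay on the CPU and can be produced by worker processes. Moving them to the GPU is left to whatever consumes the batches.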

I don't need to_cuda() anymore. So we're gonna use to_device(),

which I think came from. There we go. So here's an example of

a very simple Learner that fits on one screen. And this is basically

gonna replace our fit function. So a Learner is gonna be something that is

going to train or learn a particular model using a particular set of DataLoaders, a

particular loss function, some particular learning rate and some particular optimizer

or some particular optimization function. Now, normally I, you know, most people

would often kind of store each of these away separately by writing like

self.model equals model, blah, blah, blah, right? And as I think we've

talked about before, that's, you know, a huge amount of boilerplate. It's

just more stuff that you can get wrong. And it's more stuff that you have to

read to understand the code and yeah, I don't like that kind of repetition. So instead we just call

fastcore.store_attr() to do that all in one line. Okay, so that's the basic idea with

a class is to think about what's the information it's gonna need. So you pass

that all to the constructor, store it away. And then our fit function is going to,

we've got the basic stuff that we have for keeping track of accuracy. So this only works for stuff that's a

classification where we can use accuracy. Put the model on our device, create the optimizer, store how many epochs we're

going through. Then for each epoch, we'll call the one epoch function and the one epoch function,

we're gonna either do train or evaluation. So we pass in True if we're training and False if

we're evaluating. And they're basically almost the same. We basically set the model to training

mode or not. We then decide whether to use the validation set or the training set based on

whether we're training. And then we go through each batch in the DataLoader and

call one batch. And one batch is then the thing which is going to

put our batch onto the device, call our model, call our loss function. And then,

if we're training, then do our backward step, our optimizer step and our zero gradient.

And then finally calculate our metrics or our stats. And so here's where we calculate our

metrics. So that's basically what we have there. So let's go back to using an MLP. We call fit() and the way it goes. This is an error here, pointed out

by Kevin. Thank you. self.model.to(). One thing I guess we could try now is we think

that maybe we can use more than one process. So let's try that. Oh, it's so fast. I didn't even see. There it goes. You can see all four CPUs

being used at once. Bang, it's done. Okay, so that's pretty great. Let's see how fast

it looks here. Bump, bump. All right, lovely. Okay, so that's a good sign. We've got a learner

that can fit things, but it's not very flexible. It's not gonna help us, for example, with our

autoencoder, because there's no way of like, just like, you know, changing which

things are used for predicting with, or for calculating with. We can't use it for

anything except things that involve accuracy with a binary classification. Sorry… is that

right? Sorry, yeah, a multi-class classification. It's not flexible at all, but it's a

start. And so I wanted to basically put this all on one screen so you can

see what the basic Learner looks like. All right, so how do we do things

other than multi-class accuracy? I decided to create a Metric class. And basically

a Metric class is something where we are going to define subclasses of it that calculate

particular metrics. So for example, here, I've got a subclass of a Metric called Accuracy.

So if you haven't done subclasses before, you can basically think of this as saying,

please copy and paste all the code from here into here for me, but the bit that says def

calc(), replace it with this version. So in fact, this would be identical to copying and pasting

this whole thing, typing Accuracy here, and replacing the definition of calc() with

that. That's what is happening here when we do subclassing. So it's basically copying and

pasting all that code in there for us. It's actually more powerful than that. There's more

we can do with it, but in this case, this is all that's happening with this subclassing. And this

is called, actually I'll leave that, that's fine. Okay, so the Accuracy metric is here, and

then this is kind of our really basic Metric, which is we're gonna use for just for loss. And so

what happens is we're going to, let's for example, create an Accuracy metric object. We're

basically gonna add in mini batches of data, right? So for example, here's a mini batch of

inputs and predictions. Here's another mini batch of inputs and predictions. And then we're gonna

call .value and it will calculate the accuracy. Now .value is a neat little thing. It

doesn't require parentheses after it because it's called a property. And

so a property is something that just calculates automatically without having to

put parentheses. That's all a property is, well, property getter anyway. And so they look

like this, you give it a name. And so we are going to be, each time we call add(), we are

gonna be storing that input and that target. And also the number of items

in the mini batch optionally. For now, that's just always gonna be one.

And you can see here that we then call .calc(), which is gonna call the accuracy

calc(). So just see how often they're equal. And then we're going to

append to the list of values that calculation. And we're also gonna append

to the list of ns, in this case just one. And so then to calculate the value, we just do that.

So that's all that's happening for Accuracy. And then we can do for loss, we can just use

Metric directly, cause Metric directly will just calculate the average of whatever it's passed.
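
Here is a small sketch of the Metric idea. It uses plain Python lists and numbers in place of tensors so that it runs without PyTorch, which is an assumption of this example and not how the notebook stores things.

```python
class Metric:
    def __init__(self):
        self.reset()

    def reset(self):
        self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        # Store the per-batch result and the weight (batch size) to give it.
        self.last = self.calc(inp, targ)
        self.vals.append(self.last)
        self.ns.append(n)

    @property
    def value(self):
        # A property is read without parentheses: m.value, not m.value().
        return sum(v * n for v, n in zip(self.vals, self.ns)) / sum(self.ns)

    def calc(self, inps, targs):
        # The base class just passes the number through, so value is a weighted average.
        return inps


class Accuracy(Metric):
    def calc(self, inps, targs):
        # Fraction of positions where input and target agree.
        return sum(i == t for i, t in zip(inps, targs)) / len(inps)


acc = Accuracy()
acc.add([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0])
acc.add([1, 1, 2, 0, 1], [0, 1, 1, 2, 1])

loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
print(acc.value, loss.value)
```

Only `calc` differs between the two classes. Everything else in `Accuracy` is inherited from `Metric`.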

So we can say, oh, add the number 0.6. So the target's optional. And we're saying this is a

mini batch of size 32. So it's gonna be the n. And then add the value 0.9 with a mini batch size

of 2, and then get the value. And as you can see, that's exactly the same as the weighted average

of 0.6 and 0.9 with weights of 32 and 2. So we've created a Metric class. And so

that's something that we can use to create any metric we like just by overriding calc(). Or we could create totally new things from scratch

as long as they have an add() and a value. Okay, so we're now going to change our Learner.
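
Before the changes, it may help to have the basic Learner in front of you. This is a simplified sketch based on the description given earlier, not the notebook's code: attributes are stored by hand in place of fastcore's store_attr, the device handling is left out so it runs on the CPU, the two DataLoaders are passed separately, and the data is synthetic.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class Learner:
    def __init__(self, model, train_dl, valid_dl, loss_func, lr, opt_func=optim.SGD):
        self.model, self.train_dl, self.valid_dl = model, train_dl, valid_dl
        self.loss_func, self.lr, self.opt_func = loss_func, lr, opt_func

    def one_batch(self):
        self.xb, self.yb = self.batch
        self.preds = self.model(self.xb)
        self.loss = self.loss_func(self.preds, self.yb)
        if self.model.training:
            self.loss.backward()
            self.opt.step()
            self.opt.zero_grad()
        with torch.no_grad():
            self.calc_stats()

    def calc_stats(self):
        # Hard-coded to classification accuracy, which is the inflexibility discussed above.
        n = len(self.xb)
        self.accs.append((self.preds.argmax(dim=1) == self.yb).float().sum().item())
        self.losses.append(self.loss.item() * n)
        self.ns.append(n)

    def one_epoch(self, train):
        self.model.train(train)
        dl = self.train_dl if train else self.valid_dl
        for self.batch in dl:
            self.one_batch()
        n = sum(self.ns)
        # Running averages over everything seen so far (stats are never reset here).
        print(self.epoch, train, sum(self.losses) / n, sum(self.accs) / n)

    def fit(self, n_epochs):
        self.accs, self.losses, self.ns = [], [], []
        self.opt = self.opt_func(self.model.parameters(), self.lr)
        for self.epoch in range(n_epochs):
            self.one_epoch(True)
            with torch.no_grad():
                self.one_epoch(False)

# Synthetic two-class problem standing in for Fashion MNIST.
torch.manual_seed(0)
x = torch.randn(320, 20)
y = (x[:, 0] > 0).long()
train_dl = DataLoader(TensorDataset(x[:256], y[:256]), batch_size=32)
valid_dl = DataLoader(TensorDataset(x[256:], y[256:]), batch_size=32)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))

learn = Learner(model, train_dl, valid_dl, F.cross_entropy, lr=0.1)
learn.fit(2)
```

Swapping in an autoencoder would mean editing `one_batch` and `calc_stats` directly, since both assume a label `yb` and an accuracy.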

And what we're gonna do is we're going to keep the same basic structure. So there's gonna be

fit(). It's gonna go through each epoch. It's gonna call one_epoch() passing in True and False

as for training and validation. one_epoch() is going to go through each batch in the DataLoader

and call one_batch(). one_batch() is going to do the prediction, get_loss(), and if it's training,

it's gonna do the backward() step and zero_grad(). But there's a few other things

going on. So let's take a look. Well, actually let's just look at

it in use first. So when we use it, we're gonna be creating a Learner() with

the model, data loaders, loss function, learning rate, and some callbacks, which

we'll learn about in a moment. And we call fit() and it's gonna do our thing. And

look, we're gonna have charts and stuff. All right, so the basic idea is gonna

look very similar. So we're gonna call fit(). So when we construct it, we're gonna be

passing in exactly the same things as before, but we've got one extra thing, callbacks, which

we'll see in a moment, store the attributes as before and we're gonna be doing some stuff

with the callbacks. So when we call fit() for this number of epochs, we're gonna

store away how many epochs we're gonna do. We're also gonna store away

the actual range that we're going to loop through as self.epoch. So

here's that looping through self.epoch. We're gonna create the optimizer using

the optimizer function and the parameters. And then we're gonna call _fit(). Now what on earth is _fit()? Why

didn't we just copy and paste? So this into here, why do this? It's because we've

created this special decorator with callbacks. What does that do? So it's up here with

callbacks. With callbacks is a class. It's gonna just store one thing, which

is the name. In this case, the name is ‘fit’. And what it's gonna do is… now this is the

decorator, right? So when we call it, remember, decorators get passed a function. So it's gonna

get passed this whole function and that's gonna be called f. So dunder call, remember is what

happens when a class is treated, an object is treated as if it's a function. So it's gonna get

passed this function. So this function is _fit. And so what we wanna do is we wanna return a

different function. It's going to of course call the function that we were asked to call using

the arguments and keyword arguments we were asked to use. But before it calls that function, it's

going to call a special method called callback, passing in the string before, in this case,

before underscore fit. After it's completed, it's gonna call that method called callback

and passing the string after underscore fit. And it's gonna wrap the whole

thing in a try except block. And it's going to be looking for an exception

called CancelFitException. And if it gets one, it's not gonna complain. So let me explain

what's going on with all of those things. Let's look at an example of a callback. So for example, here is a Callback called

DeviceCB, device callback. And before_fit() will be called automatically before that underscore fit

method is called. And it's going to put the model onto our device, CUDA or MPS, if we have one,

otherwise it'll just be on CPU. So what's gonna happen here? So it's going to call, we're gonna

call fit(). It's gonna go through these lines of code. It's then gonna call _fit(). _fit() is not

this function. _fit() is this function with f is this function. So it's going to call our

learner dot callback passing in ‘before_fit’. And callback() is defined here. What's

callback() gonna do? It's gonna be passed the string ‘before_fit’. It's going to

then go through each of our callbacks sorted based on their order. And you can

see here, our callbacks can have an order and it's going to look at that callback and

try to get an attribute called before_fit. And it will find one. And so then it's going to

call that method. Now, if that method doesn't exist, if it doesn't appear at all, then getattr()

will return this instead. Identity is a function just here. This is an identity function. All

it does is whatever arguments it gets passed, it returns them. And if it's not

passed any arguments, it just returns. So there's a lot of Python going on here. And

that is why we did that foundations lesson. And so for people who haven't done a lot of

this Python, there's gonna be a lot of stuff to experiment with and learn about.
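To make those pieces concrete, here is a minimal sketch of how the decorator, the callback() method, and getattr with an identity default could fit together. It is an illustration under assumptions, not the notebook's actual code: the names with_callbacks, ToyLearner, LogCB and StopCB are hypothetical, and the toy learner does no training at all.

```python
from operator import attrgetter

class CancelFitException(Exception):
    """Raised by a callback to stop fitting quietly."""

def identity(*args):
    # Return whatever was passed in; with no arguments, return None.
    if not args: return None
    return args[0] if len(args) == 1 else args

class with_callbacks:
    """Hypothetical decorator class: it stores only a name, e.g. 'fit'."""
    def __init__(self, name): self.name = name

    def __call__(self, f):
        # f is the decorated function (here, _fit). Return a wrapper around it.
        def _f(obj, *args, **kwargs):
            try:
                obj.callback(f'before_{self.name}')
                f(obj, *args, **kwargs)
                obj.callback(f'after_{self.name}')
            except CancelFitException:
                pass  # a callback asked to cancel, so don't complain
        return _f

class Callback:
    order = 0  # callbacks are run in ascending order

class LogCB(Callback):
    def __init__(self, log): self.log = log
    def before_fit(self): self.log.append('before_fit')
    def after_fit(self): self.log.append('after_fit')

class StopCB(Callback):
    order = 1
    def before_fit(self): raise CancelFitException()

class ToyLearner:
    def __init__(self, cbs): self.cbs, self.ran = cbs, False

    def callback(self, method_name):
        for cb in sorted(self.cbs, key=attrgetter('order')):
            # If the callback has no such method, fall back to identity.
            getattr(cb, method_name, identity)()

    @with_callbacks('fit')
    def _fit(self): self.ran = True

    def fit(self): self._fit()

log = []
ToyLearner([LogCB(log), Callback()]).fit()
print(log)
```

Adding StopCB() to the list of callbacks is a quick way to see the try-except at work: the body of _fit() never runs, and no error is shown.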

And so do ask on the forums, if any of these bits get confusing, but the

best way to learn about these things is to open up this Jupyter notebook and try and

create really simple versions of things. So for example, let's try identity().
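As one possible simple version to play with, here is a sketch of an identity function that behaves the way the calls described next do. The body is an assumption; the notebook's own definition may be written differently.

```python
def identity(*args):
    # No arguments: nothing to return.
    if not args: return None
    # Split off the first argument from the rest.
    x, *rest = args
    # One argument comes back as-is; several come back as a tuple.
    return (x, *rest) if rest else x

print(identity())
print(identity(1))
print(identity('a'))
print(identity('a', 1))
```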

identity(): how exactly does identity work? I can call it with nothing and it gets nothing back. I can call it

with 1, it gets back 1. I could call it with ‘a’, gets back ‘a’. Call it with ‘a’, 1 and get ‘a’, 1. And how is it doing that exactly? So remember

we can add a break point and this would be a great time to really test your debugging skills.

Okay, so remember in our debugger, we can hit h to find out what the commands are, but you

really should do a tutorial on the debugger if you're not familiar with it. And then we can

step through each one. So I can now print args. And there's actually a trick which I like, which is

that args is actually a command, funnily enough, which will just tell you the arguments to any

function, regardless of what they're called, which is kind of nice. And so then

we can step through by pressing n and after this, we can check

like, okay, what is x now? And what is args now? Right? So remember

to really experiment with these things. So anyway, we're gonna talk about

this a lot more in the next lesson. But before that, if you're not

familiar with try-except blocks, you know, spend some time practicing them.

If you're not familiar with decorators, well, we've seen them before. So go back

and look at them again really carefully. If you're not familiar with the debugger, practice

with that. If you haven't spent much time with getattr, remind yourself about that. So try to get yourself really

familiar and comfortable as much as possible with the pieces, because

if you're not comfortable with the pieces, then the way we put the pieces together is

gonna be confusing. There's actually something in education, in the theory of education,

called cognitive load theory. And basically, cognitive load theory

says, if you're trying to learn something, but your cognitive load is really high because

of lots of other things going on at the same time, you're not gonna learn it. So it's gonna

be hard for you to learn this framework that we're building if you have too much cognitive

load of like what the hell's a decorator or what the hell's getattr or what does sorted do

or what's partial, you know, all these things. Now, I actually spent quite a bit of time

trying to make this as simple as possible, but also as flexible as it needs

to be for the rest of the course. And this is as simple as I could get

it. So these are kind of things that you actually do have to learn. But in doing

so, you're gonna be able to write some really powerful and general code yourself. So hopefully

you'll find this a really valuable and mind expanding exercise in bringing high level software

engineering skills to your data science work. Okay, so with that, this looks like a good place to leave it and look forward

to seeing you next time. Bye.
