Hi all and welcome to Lesson 15. And what
we're going to endeavor to do today is to create a convolutional autoencoder.
And in the process, we will see why doing that well is a tricky thing to do and
time permitting, we will begin to work on a framework, a deep learning framework
to make life a lot easier. Not sure how far we'll get on that today time wise. So
let's see how we go and get straight into it. Okay. So today, before
we can create a convolutional autoencoder, we need to talk about convolutions: what
they are and what they're for. Broadly speaking, convolutions are something that allows us to
tell our neural network a little bit about the structure of the problem. That's going to make
it a lot easier for it to solve the problem. And in particular, the structure of our problem
is we're doing things with images. Images are laid out on a grid, a 2d grid for black and white
or a 3d for color or a 4d for a color video or whatever. And so we would say, you know, there's
a relationship between the pixels going across and the pixels going down. They tend to be similar
to each other, differences in those pixels across those dimensions tend to have meaning. Sets,
patterns of pixels that appear in different places often represent the same thing. So for
example, a cat in the top left is still a cat, even if it's in the bottom right. This kind
of prior information is something that is naturally captured by a Convolutional
Neural Network, something that uses convolutions. Generally speaking, this is a good thing
because it means that we will be able to use fewer parameters and less computation because
more of that information about the problem we're solving is kind of encoded directly into our
architecture. There are other architectures that don't encode that prior information as strongly,
such as a Multi-Layer Perceptron, which we've been looking at so far, or a Transformer network,
which we haven't looked at yet. Those kinds of architectures could potentially give us, or they
do give us more flexibility and given enough time, compute and data, they could potentially find
things that maybe CNNs would struggle to find. So we're not always going to use
Convolutional Neural Networks, but they're a pretty good starting point and
certainly something important to understand. They're not just used for images. We can also
take advantage of one-dimensional convolutions for language-based tasks, for instance.
So convolutions come up a lot. So in this notebook, one thing you'll
notice that might be of interest is we are importing stuff from miniai now. Now
miniai is this little library that we're starting to create and we're creating it
using nbdev. So we've got a miniai.training and a miniai.datasets. And so if we look,
for example, at the datasets notebook, it starts with something that says that the
default export is called datasets. And some of the cells have an export directive on them.
And at the very bottom, we had something that called nbdev_export(). Now what that's going to do
is it's going to create a file called datasets.py, just here. And it contains those cells that we exported. And why is it called miniai.datasets?
That's because everything for nbdev is stored in settings.ini. And there's something here
saying create a library, libname, called miniai. You can't use this library until you install it.
Now we haven't uploaded it to PyPI, like made it a pip installable package from a public server.
But you can actually install a local directory as if it's a Python module that you've kind of
installed from the internet. And to do that, you say pip install, in the usual way, but
you say -e, that stands for editable. And that means set up the current directory as
a Python module. Well, current directory, actually any directory you like, I just put
dot to mean the current directory. And so you'll see that's going to go ahead and actually
install my library. And so after I've done that, I can now import things from
that library as you see. Okay. So this is just the same as before. We're
going to grab our MNIST dataset and we're going to create a Convolutional Neural Network on
it. So before we do that, we're going to talk about what are convolutions. And one of my
favorite descriptions of convolutions comes from a student in our, I think it was our very
first course, Matt Kleinsmith, who wrote this really nice Medium article, CNNs from different
viewpoints, which I'm going to steal from. And here's the basic idea. Say that this is our image.
It's a 3 by 3 image with 9 pixels labeled from A to I as capital letters. Now a convolution
uses something called a kernel, and a kernel is just another tensor. In this case, it's a 2
by 2 matrix, again. So we're going to have alpha, beta, gamma,
delta as our four values in this kernel. Now,
one thing I'll mention, and I think I have mentioned this before, is that the
Greek letters are things that you want to
be able to pronounce. So if you don't know how to read these and say what their names are,
make sure you head over to Wikipedia or whatever and learn the names of all the Greek letters,
because they come up all the time. Okay. So what happens when we apply a convolution
with this 2 by 2 kernel to this 3 by 3 image? I mean, it doesn't have to be an image
it's in this case, it's just a rank 2 tensor, but it might represent an image. What happens
is we take the kernel and we overlay it over the first little 2 by 2 sub grid, like so. And
specifically what we do is we match color to color. So the output of this first 2 by 2
overlay would be alpha times a plus beta times b plus gamma times d plus delta times
e, and that would yield some value P, and that's going to end up in the top left of a 2 by
2 output. So the top right of the 2 by 2 output, we're going to slide, it's like a sliding window,
we're going to slide our kernel over to here and apply each of our coefficients to
these respectively colored squares. And then ditto for the bottom left
and then ditto for the bottom right. So we end up with this equation. P, as
we discussed, is alpha a plus beta b plus gamma d plus delta e, plus some bias
term. Q, at the top right, as you can see, it's just alpha, in this case, times b, and so on. And
so we're just multiplying them together and adding them up, multiply together, add
them up, multiply together and add them up. So we're basically, you can imagine that
we're basically flattening these out into rank 1 tensors into vectors and then doing a dot
product would be one way of thinking about what's happening as we slide this kernel over these
windows. And so this is called a convolution. So let's try and create a convolution. So,
for example, let's grab our training images. And take a look at one. And let's create a 3 by 3 kernel. So remember,
the word kernel appears a lot of times in computer science and math.
We've already seen the term kernel to mean a piece of code that we run on a GPU across lots
of parallel kind of virtual devices or potentially in a grid. There's a similar idea here. We've
got a computation, which is in this case, kind of this dot product or something like a dot product
sliding over, occurring lots of times over a grid. But it's yeah, it's a bit different.
That's kind of another use of the word kernel. So in this case, a kernel is
going to be a rank 2 tensor. And so let's create a kernel with these values
in a 3 by 3 matrix, a rank 2 tensor. And we could draw what that looks like. Not surprisingly,
it just looks like a bunch of lines. Oops. OK, so what would happen if we slide this over
just these nine pixels, over this 28 by 28? Well, what's going to happen is,
if the top left 3 by 3 section, for example, has these names,
then we're going to end up with negative a1, because the top three are all negative, right?
Negative a1, minus a2, minus a3. The next are just zero, so they won't do anything. And then plus
a7, plus a8, plus a9. Why is that interesting? Well, let's try
here. What I've done here is I've grabbed just the first 13 rows and first 23 columns of
our image, and I'm actually showing the numbers and also using gray kind of conditional
formatting, if you like, or the equivalent in pandas to show this top bit. So we're looking at
just this top bit. So what happens if we take rows 3, 4 and 5? Remember, the end of a slice is not inclusive,
right? So it's rows 3, 4 and 5, columns 14, 15, 16. So we're looking at these
three here. What's that going to give us if we multiply it by this kernel? It gives us a fairly
large positive value because the three that we have negatives on are the top row, and they're
all zero. And the three that we have positives on, they're all close to 1. So we end up with
quite a large number. What about the same columns but for rows 7, 8, 9? Here, the
top is all positive and the bottom is all zero. So that means that we're going to get a lot of
negative terms. And not surprisingly, that's exactly what we see. If we do this, kind of a dot
product equivalent, which all you need in NumPy to do that is just an element-wise multiplication
followed by a sum, right? So that's going to be quite a large negative number. And so perhaps
you're seeing what this is doing. And maybe you got a hint from the name of the tensor we created.
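As a minimal sketch of that multiply-and-sum, assuming PyTorch and two made-up 3 by 3 patches rather than the actual MNIST rows shown in the notebook (the kernel values follow the description above: negatives on the top row, zeros in the middle, positives on the bottom; the patch names are hypothetical):

```python
import torch

# Kernel as described: negative top row, zero middle row, positive bottom row.
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])

# Two made-up patches (assumptions, not real MNIST pixels).
dark_above_bright = torch.tensor([[0., 0., 0.],
                                  [1., 1., 1.],
                                  [1., 1., 1.]])
bright_above_dark = torch.tensor([[1., 1., 1.],
                                  [1., 1., 1.],
                                  [0., 0., 0.]])

# Element-wise multiplication followed by a sum.
pos = (dark_above_bright * top_edge).sum()   # only the bottom row contributes: +3
neg = (bright_above_dark * top_edge).sum()   # only the top row contributes: -3
print(pos.item(), neg.item())
```

The sign of the result is what distinguishes the two cases; the size depends on how bright the pixels are.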
It's something that is going to find the top edge. So this one is a top edge, so it's a positive.
And this one is a bottom edge, so it's a negative. So we would like to apply that, this kernel,
to every single 3 by 3 section in here. So we could do that by creating a little
apply_kernel function that takes some particular row and some particular column and some particular
tensor as a kernel and does that multiplication .sum() that we just saw. So
for example, we could replicate this one by calling apply_kernel(). And this
here is the center of that 3 by 3 grid area. And so there's that same number, 2.97. So now
we could apply that kernel to every one of the 3 by 3 sections. 3 by 3 windows in this 28 by 28
image. So we're going to be sliding over like this red bit sliding over here. But we've actually
got a 28 by 28 input, not just a 5 by 5 input. So to get all of the coordinates, let's just
simplify it and do 5 by 5. We can create a list comprehension. We can take
i through every value in range(5). And then for each of those, we can take j for every value in
range(5). And so if we just look at that tuple, you can see we get a list of lists containing
all of those coordinates. So this is a list comprehension in a list comprehension, which when
you first see it may be surprising or confusing, but it's a really helpful idiom. And I
certainly recommend getting used to it. Now, what we're going to do is we're not just
going to create this tuple, but we're actually going to call apply_kernel() for each of those. So
if we go through from 1 to 27, well, actually 1 to 26, because 27 is exclusive. So we're going to go
through everything from 1 to 26. And then for each of those, go through from 1 to 26 again and call
apply_kernel(). And that's going to give us the result of applying that convolutional kernel to
every one of those coordinates. And there's the result. And you can see what it's done as we
hoped is it is highlighting the top edges. So yeah, you might find that kind of surprising
that it's that easy to do this kind of image processing. We're literally just doing an element
wise multiplication and a sum for each window. Okay, so that is called a convolution. So we
can do another convolution. This time we could do one with the left edge tensor. So as you
can see, it looks just like a rotated version, or transposed version, I guess, of our top
edge tensor. Here's what it looks like. And so if we apply that kernel, so this time
we're going to apply the left edge kernel. And notice here that what we're
passing in isn't a function, it's just a tensor.
So we're going to pass in the left edge tensor for the same list comprehension in
a list comprehension. And this time we're getting back the left edges. It's
highlighting all of the left edges in the digit. So yeah, this is basically what's happening here
is that a 2 by 2 kernel can be looped over an image, creating these outputs. Now you'll see here that in the process
of doing so, we are losing the outermost pixels of our image. We'll learn about
how to fix that later. But just for now, notice that as we are putting our 3 by 3 through,
for example, this 5 by 5, there are only one, two, three places that we can put it going across,
not five places, because we need some kind of edge. All right, so that's cool. That's
a convolution. And hopefully if you remember back to kind of the Zeiler
and Fergus pictures from Lesson 1, you might recognize that the kind of first layer
of a convolutional network is often looking for kind of edges and gradients and things like
that. And this is how it does it. And then the convolutions on top of convolutions
with nonlinear activations between them can combine those into curves, or corners
or stuff like that, and so on and so forth. Okay, so how do we do this quickly? Because
currently this is going to be super, super slow doing this in Python. So one of the very earliest
or probably the earliest publicly available general purpose deep learning, GPU accelerated
deep learning thing I saw was called Caffe. That was created by somebody called Yangqing Jia.
And he actually described what happened, how Caffe went about implementing a fast convolution on a
GPU. And basically he said, well, I had two months to do it and I had to finish my thesis. And so I
ended up doing something where I said, well, there was some other code out there, Krizhevsky, who you
might have come across, him and Hinton set up a little startup which Google bought, and that kind
of became the start of Google's deep learning, Google Brain, basically. And so Krizhevsky
had all this fancy stuff in his library, but Yangqing Jia said, oh, I didn't know how to
do all that stuff. So I said, well, I already know how to multiply matrices, so maybe I can convert
a convolution into a matrix multiplication. And so that became known as im2col. im2col is a way of converting a convolution
into a matrix multiply. And actually, I suspect Yangqing Jia kind of
accidentally reinvented it, because it actually had been around for a while, even at the point
that he was writing his thesis, I believe. So this is the place
I believe it was created, in this paper. So that was in 2006, which is a while ago.
And so this is actually from that paper. And what they describe is, let's
say you are putting this 2 by 2 kernel over this 3 by 3 bit of an image. So
here you've got this window needs to match to this bit of this window. What you could
do is you could unwrap this to 1, 2, 1, 2 downwards to here. 1, 2, 1, 2, to unroll it
like so. And you could unroll the kernel here. Yeah, so this is 1, 2, 1, 1. So this bit is here,
1, 2, 1, 1. And then you could unroll the kernel 1, 1, 2, 2 to here, 1, 1, 2, 2. And then once
they've been flattened out and moved in that way, and then you'll do exactly the same
thing for this next patch here, 2, 0, 1, 3. You flatten it out and put it here, 2,
0, 1, 3. So if you basically take those kernels and flatten them out in this format, then you
end up with a matrix multiply. If you multiply this matrix by this matrix, you'll end up with
the output that you want from the convolution. So this is basically a way of unrolling your kernels and your input features into
matrices, such that when you do the matrix multiply, you get the right answer. So it's kind of
a nifty trick. And so that is called im2col. I guess we're kind of cheating a little bit.
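To make the unrolling concrete, here is a rough sketch for a single-channel input, assuming PyTorch. The helper name `im2col` and the input and kernel values are made up for illustration, and this version puts each window in a column (the paper's figure lays them out as rows), so it is one possible layout rather than the paper's or Caffe's implementation:

```python
import torch

def im2col(x, kh, kw):
    """Unroll every kh x kw window of a 2D tensor into one column."""
    h, w = x.shape
    cols = [x[i:i + kh, j:j + kw].reshape(-1)
            for i in range(h - kh + 1)
            for j in range(w - kw + 1)]
    return torch.stack(cols, dim=1)          # shape: (kh * kw, number_of_windows)

# Made-up 3x3 input and 2x2 kernel (illustrative values only).
x = torch.arange(9.).reshape(3, 3)
k = torch.tensor([[1., 2.],
                  [3., 4.]])

cols = im2col(x, 2, 2)                       # (4, 4): four windows of four values each
out = (k.reshape(-1) @ cols).reshape(2, 2)   # one matrix multiply covers every window

# Reference: the explicit sliding-window multiply-and-sum.
ref = torch.tensor([[(x[i:i + 2, j:j + 2] * k).sum().item() for j in range(2)]
                    for i in range(2)])
print(out)
print(torch.equal(out, ref))
```

The Python loops here only build the unrolled matrix; the arithmetic itself is a single matrix multiply, which is the part a GPU is good at.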
Implementing that is kind of boring. It's just a bunch of copying and tensor manipulation. So I
actually haven't done it. Instead, I've linked to a NumPy implementation, which is here. And
it also, part of it is this get_indices(), which is here. And as you can see, it's a little
bit tedious with repeats and tiles and reshapes and whatnot. So I'm not going to call it homework,
but if you want to practice your tensor indexing manipulation skills, try creating a PyTorch
version from scratch. I got to admit, I didn't bother. Instead, I used the one that's built
into PyTorch. And in PyTorch, it's called unfold. So we take our image, and PyTorch expects
there to be a batch dimension and a channel dimension, so we'll add two unit leading
dimensions to it. Then we can unfold our input for a 3 by 3. And that will give us a 9 by
676 input. And then we'll take our kernel and just flatten it
out into a vector. So view() changes the shape and -1 just says dump everything into this dimension.
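A small sketch of those two steps, assuming a random 28 by 28 tensor as a stand-in for the MNIST image and assumed values for the left edge kernel (neither is taken from the notebook):

```python
import torch
import torch.nn.functional as F

x = torch.rand(28, 28)               # stand-in for one image (random values, an assumption)
inp = x[None, None, :, :]            # add unit batch and channel dimensions: (1, 1, 28, 28)

inp_unf = F.unfold(inp, (3, 3))[0]   # every 3x3 window becomes one column
print(inp_unf.shape)                 # torch.Size([9, 676]), since 26 * 26 = 676

kernel = torch.tensor([[-1., 1., 0.],
                       [-1., 1., 0.],
                       [-1., 1., 0.]])   # left-edge style kernel (values assumed)
w = kernel.view(-1)                  # -1 lets PyTorch infer the size of the single dimension
print(w.shape)                       # torch.Size([9])
```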
So that's going to create a length 9 vector. And so now we can do the matrix multiply
just like they've done here, of the kernel matrix, that's our weights, by the unrolled input features. And so that gives us a length 676 vector. We can then view
that as 26 by 26 and we get back, as we hoped, our left edge tensor result. And so this is how
we can kind of from scratch create a better implementation of convolutions. The reason I'm
cheating, I'm allowed to cheat here is because we did actually create convolutions from
scratch. We're not always creating the GPU optimized versions from scratch, which was never
something I promised. So I think that's fair. But it's cool that we can kind of hack
out a GPU optimized version in the same way that the kind of original deep learning
library did. So if we use apply_kernel(), we get nearly nine milliseconds. If
we use unfold() with matrix multiply, we get 20 microseconds. So that's about
400 times faster. So that's pretty cool. Now, of course, we don't have to use
unfold() and matrix multiply because PyTorch has a conv2d(). So we can run that.
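As a quick check that the two routes agree, here is a sketch using random stand-ins for the image and the kernel (both are assumptions, not the notebook's data). It compares results only and makes no claim about timings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
inp = torch.rand(1, 1, 28, 28)       # random stand-in for a batch of one image
kernel = torch.rand(3, 3)            # random stand-in for an edge kernel

# Route 1: unfold, then a matrix multiply, then view the result as a grid.
out_unf = (kernel.view(-1) @ F.unfold(inp, (3, 3))[0]).view(26, 26)

# Route 2: the built-in convolution. The weight needs shape
# (out_channels, in_channels, kernel_height, kernel_width).
out_conv = F.conv2d(inp, kernel[None, None])[0, 0]

print(torch.allclose(out_unf, out_conv, atol=1e-5))
```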
And that interestingly is about the same speed, at least on CPU. But this would
also work on GPU just as well. Yeah, I'm not sure this will always be the
case. In this case, it's a pretty small image. I haven't experimented a whole lot
to see whereabouts there's a big difference in speeds between these. Obviously, I always
just use F.conv2d. But if there's some more tricky convolution you need to do with some weird
thing around channels or dimensions or something, you can always try this unfold trick.
It's nice to know it's there, I think. So we could do the same thing for diagonal edges. So here's our diagonal edge
kernel, and the other diagonal. So if we just grab the first 16 images, then we can do a convolution on our whole
batch with all of our kernels at once. So this is a nice optimized thing that
we can do. And you end up with your 26 by 26. You've got your 4 kernels and you've got
your 16 images. And so that's summarized here. So that's generally what we're doing to get
good GPU acceleration is we're doing a bunch of kernels and a bunch of images all at once
across all of their pixels. And so here we go. That's what happens when we take a look at
our various kernels for a particular image. Left edge, I guess top edge and then
diagonal top left and top right. OK, so that is optimized convolutions,
and that works just as well on CPU or GPU. Obviously, GPU will
be faster if you have one. Now, how do we deal with the
problem that we're losing one pixel on each side? What we can do
is we can add something called padding. And for padding, what we basically do
is rather than starting our window here, we start it right over here, and actually it would be
up one as well. And so for these 3 on the left here, we just take the input for each of those
as 0. So we're basically just assuming that they're all zero. I mean, there are
other options we could choose. We could assume they're the same as the one next
to them. There's various things we can do, but the simplest, and the one we normally
do, is just assume that they're zero. So this
is called one pixel padding. Let's say we did 2 pixel padding. So we had
2 pixel padding with a 5 by 5 input, OK, and a 4 by 4 kernel. So that gray is our kernel.
Right, then we're going to start way over here on the corner. OK, and then you can see
what happens as we slide the kernel over. There's all the spots that it's going to take, and so
this dotted line area is the area that we're kind of effectively going through, but all of these
white bits, we're just going to treat as zero. And then this green is the output
size we end up with, which is going to be 6 by 6 for a 5 by 5 input. I should mention, even numbered edge kernels are not used
very often. We normally use odd numbered kernels. If you use, for example, a 3
by 3 kernel and 1 pixel of padding, you will get back the same size you start with.
If you use 5 by 5 with 2 pixels of padding, you'll end up with the same size you start
with. So generally odd numbered edge size kernels are easier to deal with, to make sure
you end up with the same thing you start with. OK, so yeah, so as it says here, if you've got an
odd numbered size, ks by ks, kernel, then ks truncate divide 2, ks//2 (that's what slash
slash means), will give you the right size. And so another trick you can do is you don't always have
to just move your window across by one each time. You could move it by a different amount each time.
The amount you move it by is called the stride. So, for example, here's a case of doing a stride
2, so a stride 2 padding 1. So we start out here and then we jump across 2 and then we jump across
2 and then we go to the next row. So that's called a stride 2 convolution. Stride 2 convolutions
are handy because they actually reduce the dimensionality of your input by a factor of 2.
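The sizes mentioned for padding and stride all follow from the standard output size formula for a convolution along one dimension. Here is a small sketch; the helper name `conv_out_size` is made up for illustration:

```python
def conv_out_size(n, ks, pad=0, stride=1):
    """Output length along one dimension for an input of length n."""
    return (n + 2 * pad - ks) // stride + 1

print(conv_out_size(28, 3))                    # 26: no padding loses a pixel on each side
print(conv_out_size(28, 3, pad=1))             # 28: padding of ks // 2 keeps the size
print(conv_out_size(5, 4, pad=2))              # 6: the 5x5 input, 4x4 kernel, 2 pixel padding case
print(conv_out_size(28, 3, pad=1, stride=2))   # 14: stride 2 halves the grid

# Repeated 3x3, stride-2, padding-1 convolutions starting from a 28x28 grid.
n, sizes = 28, []
while n > 1:
    n = conv_out_size(n, 3, pad=1, stride=2)
    sizes.append(n)
print(sizes)                                   # [14, 7, 4, 2, 1]
```

Note that 7 does not halve evenly, so the grid goes from 7 to 4 rather than 3.5.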
And that's actually what we want to do a lot. For example, with an autoencoder, we want to
do that. And in fact, for most classification architectures, we do exactly that. We keep
on reducing the, kind of the grid size by a factor of 2 again and again and again using
stride 2 convolutions with padding of 1. So that's strides and padding. So let's go ahead
and create a ConvNet using these approaches. So we're going to get the size of our
training set. This is all the same as before: number of categories, number of
digits, size of our hidden layer. So previously, with our sequential linear
models, with our MLPs, we basically went from the number of pixels to the number
of hidden and then a ReLU and then the number of hidden to the number of outputs.
So here's the equivalent. With a convolution. Now, the problem is that you can't just
do that because the output is not now 10 probabilities for each item in our batch,
but it's 10 probabilities for each item in our batch for each of 28 by 28 pixels because
we don't even have a stride or anything. So you can't just use the same simple approach that we
had for MLP. We have to be a bit more careful. So to make life easier, let's create a little conv
function that does a Conv2d with a stride of 2 optionally followed by an activation. So if act
is true, we will add in a ReLU activation. So this is going to either return a Conv2d or a little
Sequential containing a Conv2d followed by a ReLU. And so now we can create a CNN
from scratch as a sequential model. And so since activation is True by default, this
is going to take our 28 by 28 image, starting with 1 channel and creating an output of 4 channels.
So this is the number of inputs, this is the number of filters. Sometimes we'll say filters to
describe the number of, kind of, channels that our convolution has; that's the number of
outputs. And it's very similar to the idea of the number of outputs in a linear layer, except
this is the number of outputs in your convolution. So what I like to do when I create stuff like
this is I add a little comment just to remind myself what is my grid size after this. So I had
a 28 by 28 input. So then I've then put it through a stride-2 conv. So the output of this will be
14 by 14. So then we'll do the same thing again, but this time we'll go from a 4 channel input
to an 8 channel output, which takes us to 7 by 7, and then from 8 to 16. So by this point, we're now down to
a 4 by 4. And then down to a 2 by 2, and then finally we're down to a 1 by 1. So on
the very last layer, we won't add an activation. And the very last layer is going to create 10
outputs. And since we're now down to a 1 by 1, we can just call Flatten() and that's going to remove
those unnecessary unit axes. So if we take that and pop a mini batch through it, we end up with
exactly what we want, 16 by 10. So for each of our 16 images, we've got 10 probabilities, one for
each possible digit. So if we take our training set and make it into 28 by 28 images, and
we do the same thing for a validation set, and then we create two datasets, one for each,
which are called train dataset and valid dataset. And we're now going to train this on the GPU. Now,
if you've got a Mac, you can use a device called, well, if you've got an Apple Silicon Mac, you've
got a device called MPS, which is going to use your Mac's GPU. Or if you've got Nvidia, you can
use CUDA, which will use your Nvidia GPU. CUDA's 10 times or more, possibly much more, faster than
a Mac. So you definitely want to use Nvidia if you can. But if you're just running it on a
Mac laptop or whatever, you can use MPS. So basically, you want to know what device to use.
Do we want to use CUDA or MPS? You can check torch.backends.mps.is_available()
to see if you're running on a Mac with MPS. You can check torch.cuda.is_available() to see
if you've got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course,
you'll have to use the CPU to do computation. So I've created a little function here to_device, which takes a tensor or a dictionary
or a list of tensors or whatever, and a device to move it to, and it just goes
through and moves everything onto that device. Or if it's a dictionary, it's a dictionary of
values moved onto that device. So there's a handy little
function. And so we can create a custom collate function, which calls the
PyTorch default collation function and then puts those tensors onto our device.
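A sketch of what those pieces could look like, assuming a reasonably recent PyTorch. Only `to_device` is named in the lesson; `def_device` and `collate_device` are names assumed here, and this is an approximation of the idea rather than the library's actual code:

```python
import torch
from torch.utils.data import default_collate

# Pick the best available device, falling back to the CPU.
def_device = ('mps' if torch.backends.mps.is_available()
              else 'cuda' if torch.cuda.is_available()
              else 'cpu')

def to_device(x, device=def_device):
    """Recursively move tensors, or containers of tensors, to `device`."""
    if isinstance(x, torch.Tensor): return x.to(device)
    if isinstance(x, dict): return {k: to_device(v, device) for k, v in x.items()}
    if isinstance(x, (list, tuple)): return type(x)(to_device(o, device) for o in x)
    return x   # anything else is returned unchanged

def collate_device(batch):
    """Collate with PyTorch's default function, then move the result to the device."""
    return to_device(default_collate(batch))

# Usage with a tiny made-up batch of (input, label) pairs.
xb, yb = collate_device([(torch.zeros(2), 1), (torch.ones(2), 0)])
print(xb.device, xb.shape, yb.shape)
```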
And so with that, we've now got enough to train this neural net on the GPU. We created
this get_dls function in the last lesson. So we're going to use that, passing in the
datasets that we just created and our custom collation function. We're going to create
our optimizer using our CNN's parameters. And then we call fit(). Now fit(), remember, we also created
in our last lesson, and it's done. So then what I did then was I reduced
the learning rate by a factor of four and ran it again. And eventually, yeah, I got to a
fairly similar accuracy to what we did on our MLP. So, yeah, we've got a convolutional network
working. I think that's pretty encouraging. And it's nice that to train it, we didn't have
to write much code, right? We were able to use code that we had already built. We were
able to use the Dataset class that we made, the get_dls function that we made,
and the fit function that we made. And, you know, because those things are
written in a fairly general way, they work just as well for a ConvNet as they did for an
MLP. Nothing had to change. So that was nice. Notice I had to take the model and put it on
the device as well. So that will go through and basically put all of the tensors that are in that
model onto the MPS or CUDA device, if appropriate. So if we've got a batch size of 64,
and as we do, 1 channel, 28 by 28, so then our axes are batch, channel, height,
width. So normally this is referred to as NCHW. So N, generally when you see N in
a paper or whatever, in this way, it's referring to the batch size. N
being the number, that's the mnemonic, the number of items in the batch. C is the
number of channels, height by width, NCHW. TensorFlow doesn't use that. TensorFlow uses
NHWC. So we generally call that channels-last, since channels are at the end. And this one we
normally call channels-first. Now, of course, it's not actually channels-first. It's actually
channels-second, but we ignore the batch bit. In some models, particularly some more modern
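To make the two axis orders concrete, here is a tiny sketch using a zero tensor as a stand-in for a real batch (an assumption for illustration):

```python
import torch

xb = torch.zeros(64, 1, 28, 28)      # N, C, H, W: 64 one-channel 28x28 images
print(xb.shape)                      # torch.Size([64, 1, 28, 28])

xb_nhwc = xb.permute(0, 2, 3, 1)     # reorder the axes to N, H, W, C
print(xb_nhwc.shape)                 # torch.Size([64, 28, 28, 1])
```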
models, it turns out the channels-last is faster. So PyTorch has recently added
support for channels-last. And so you'll see that being used more and more as well. All right, so a couple of comments and questions
from our chat. The first is Sam Watkins pointing out that we've actually had a bit of a win here,
which is that the number of parameters in our CNN is pretty small by comparison. So in the MLP
version, the number of parameters is equal to basically the size of this matrix, so m times nh. Oh, plus the number in this,
which will be nh times 10. And something that at some point we probably
should do is actually create something that allows us to automatically
calculate the number of parameters. And I'm ignoring the bias there, of course. Let's see, what would be a good way to do that? Maybe np.product(). There we go. So what we could do is just
calculate this automatically by doing a little list comprehension here.
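Here is a rough reconstruction of that count, assuming models along the lines described earlier. The `conv` helper uses padding of ks//2; the 16 to 16 layer and the hidden size of 50 are assumptions, since the transcript doesn't state them; and `n_params` is a made-up helper name:

```python
import math
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    """Stride-2 Conv2d with padding ks // 2, optionally followed by a ReLU."""
    res = nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2)
    return nn.Sequential(res, nn.ReLU()) if act else res

simple_cnn = nn.Sequential(
    conv(1, 4),                # 14x14
    conv(4, 8),                # 7x7
    conv(8, 16),               # 4x4
    conv(16, 16),              # 2x2 (layer width assumed)
    conv(16, 10, act=False),   # 1x1
    nn.Flatten(),
)

# MLP for comparison; the hidden size of 50 is an assumption.
mlp = nn.Sequential(nn.Linear(28 * 28, 50), nn.ReLU(), nn.Linear(50, 10))

def n_params(model):
    """Count weights and biases: product of each parameter's shape, summed via a tensor."""
    return torch.tensor([math.prod(p.shape) for p in model.parameters()]).sum().item()

print(n_params(mlp), n_params(simple_cnn))            # 39760 5274
print(simple_cnn(torch.zeros(16, 1, 28, 28)).shape)   # torch.Size([16, 10])
```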
So there's the number of parameters across all of the different layers, so both bias
and weights. And then we could, I guess, just, well, we could just use, well, let's use
PyTorch. So we could turn that into a tensor and sum it up. Oops. So that's the number in
our MLP. And then the number in our simple CNN. So that's pretty cool. We've gone down from 40,000
to 5,000 and got about the same number there. Oh, thank you, Jonathan. Jonathan's reminding me that
there's a better way than np.product(o.shape), which is just to say o dot
number of elements: o.numel(). Same thing. Very nice. Now, one person asked a very
good question, which is, I thought Convolutional Neural Networks can
handle any sized image. And actually, no, this convolutional network cannot handle any
sized image. This Convolutional Neural Network only handles images that, once they go through
these stride-2 convs, end up with a 1 by 1. Because otherwise, you can't dot flatten it
and end up with 16 by 10. So we will learn how to create convnets that can handle any sized
input. But there's nothing in particular about a convnet that guarantees it
can handle any sized input. OK, so just let's briefly finish this section off
by talking about this, particularly I want to talk about the idea of receptive field. Consider this
1 input channel, 4 output channel, 3 by 3 kernel. So that's just to show you what we're doing here.
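A standalone sketch of such a layer, assuming stride 2 and padding 1 as in the conv function described earlier, showing the shapes of its parameters:

```python
import torch
from torch import nn

conv1 = nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1)
print(conv1.weight.shape)   # torch.Size([4, 1, 3, 3]): outputs, inputs, kernel height, kernel width
print(conv1.bias.shape)     # torch.Size([4]): one bias per output channel
```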
conv1, well, actually, so simple_cnn. simple_cnn. This is the model we created. Remember, it was
like a Sequential model containing Sequential models, because that's how our conv function
worked. So simple_cnn[0] is our first layer. It contains both a Conv and a ReLU. So
simple_cnn[0][0] is the actual Conv. So if we grab that, call it conv1, its weight is 4 by 1 by 3 by 3. So,
number of outputs, number of input channels, and height by width of the kernel. And then
it's got its bias as well. So that's how we could deconstruct what's going on with our weight
matrices or our parameters inside a convolution. Now, I'm going to switch over to Excel. So in
the lesson notes on the course website or on the forum, you'll
see we've got an Excel workbook. Oh, Wasim reminded me that there is
a nice trick we can do. I do want to do that actually because I love this trick. Oh, I just deleted everything though. Let's put
it all back. Here we go. Which is you actually don't need square brackets. The square brackets is
a list comprehension. Without the square brackets, it's called a generator. And it, oh, no, you can't
use it there. Maybe that only works with NumPy. Ah, OK. So that's the list. No, that doesn't work either. So much for that.
I'm kind of curious now. Maybe torch.sum. No, just sum. Oh, OK. Well, I don't want to
use Python sum. That's interesting. I feel like all of them should handle
generators, but there you go. OK. So open up the conv-example spreadsheet. And
what you'll see on the conv-example worksheet page is something that looks a lot like the number
7. And indeed, this is the number 7 that I got straight from MNIST. Let's see. OK. So you can
see over here, we have a number 7. This is a number 7 from MNIST that I have copied into
Excel. And then you can see over here, we've got the top edge kernel being applied. And over
here, we've got a right edge kernel being applied. This might be surprising to you because you
might be thinking, wait a minute, Jeremy, Microsoft Excel doesn't do Convolutional
Neural Networks. Well, actually, it does. So if I zoom in in Excel, you'll see, actually,
these numbers are, in fact, conditional formatting applied to a bunch of spreadsheet cells. And so
what I did was I copied the actual pixel values into Excel and then applied conditional
formatting. And so now you can see what the digit is actually made of. So you can see here I've created our top edge filter. And here
I've created our left edge filter. And so here I am applying that filter to that window. And
so here you can see it looks a lot like NumPy. It's just a sum product. And you might not be
aware of this, but in Excel, you can actually do broadcasting. You have to hit Apple
Shift Enter or Control Shift Enter, and it puts these little curly brackets
around it. It's called an array formula. It basically lets you do broadcasting or
simple broadcasting in Excel. And so here's how you could say… this is how I created
this top edge filtered version in Excel. And the left edge version is exactly the same,
just a different kernel. And as you can see, if I click on it, it's applying this filter
to this input area and so forth. OK, so we can then… I just
arbitrarily pick some different values here. And so something to notice now
in my second layer (so here's Conv1, here's Conv2) is that it's got a bit more work to do. We actually
need two filters because we need to add together this bit here applied to this with
this kernel applied and this bit here with this kernel applied. So you actually
need one set of 3 by 3 for each input. And also, I want two separate outputs, so I
actually end up needing a 2 by 2 by 3 by 3 weights matrix or weights tensor, I should say,
which you might remember is exactly what we had in PyTorch. We had a rank 4 tensor. So if I have a
look at this one, you see exactly the same thing. This input is using this kernel applied
to here and this kernel applied to here. So that's important to remember that you
have these rank 4 tensors. And so then rather than doing stride-2 conv, I did something else
which is actually a bit out of favor nowadays, but it's another option, which is to do something
called max-pooling to reduce my dimensionality. So you can see here I've got 28 by 28.
I've reduced it down here to 14 by 14. And the way I did it was simply to take
the max of each little 2 by 2 area. So that's all that's been done
there. So that's called max-pooling. And so max-pooling has the same effect as a
stride-2 conv, not mathematically identical, the same effect, which is it does a convolution
and reduces the grid size by 2 on each dimension. So then how do we create a single
output if we don't keep doing this until we get to 1 by 1, which
I'm too lazy to do in Excel? Well, one approach, and again, this is a little
bit out of favor as well, but one approach we can do is we can take every one of these, we've
now got 14 by 14, and apply a dense layer to it. And so what I've done here is I've got a
big, imagine this is basically all being flattened out into a vector. And so here
we've got the sum product of this by this, plus the sum product of this by this,
and that gives us a single number. And so that is how we could then optimize
that in order to optimize our weight matrices. Now, and then, you know, the more modern approach, we don't use this kind of dense layer much
anymore. It still appears a bit. The main place that you see this used is in a network
called VGG, which is very old now. I thought it might be 2013 or something, but it's actually
still used. And that's because for certain things like something called style transfer or in
general perceptual losses, people still find VGG seems to work better. So you still actually
see this approach nowadays sometimes. The more common approach, however, nowadays is
we take the penultimate layer and we just simply take the average of all of the activations. So
the, you know, the nowadays we would simply, the Excel way of doing it would be literally
simply say AVERAGE of the penultimate layer. And that is called global average pooling.
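To make the two pooling ideas concrete, here is a small plain-Python sketch, spelled out cell by cell much like the spreadsheet. The function names and the tiny 4 by 4 grid are made up for illustration; in PyTorch you would normally use the built-in pooling layers instead.

```python
def max_pool_2x2(grid):
    """Take the max of each non-overlapping 2x2 block, halving height and width."""
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def global_avg_pool(grid):
    """Average every activation in the grid down to a single number."""
    cells = [v for row in grid for v in row]
    return sum(cells) / len(cells)

grid = [[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11],
        [12, 13, 14, 15]]

pooled = max_pool_2x2(grid)    # 4x4 grid reduced to 2x2
avg = global_avg_pool(grid)    # one number for the whole grid
print(pooled, avg)
```

Swapping the average for a max over all cells would give the global max pooling variant.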
Everything has to have a fancy word, a fancy phrase, but that's all it is. Take the average
is called global average pooling, or you could take the max, whatever that
would be global max pooling. So anyway, the main reason I wanted to show you
this was to do something which I think is pretty interesting, which is to take something in our,
I'm just going to zoom out a little bit here. Let's take something in our max pool here. And I'm going to say Trace
Precedents to show you the area that it's coming from. Okay. So it's coming from these
four numbers. Now if I trace precedents again, that's saying what's actually impacting this.
Obviously the kernel's impacting it. And then you can see that the input area here is a
bit bigger. And then if I trace precedents again, then you can see the input area is bigger still.
So this number here is calculated from all of these numbers in the input. This area in the
input is called the receptive field of this unit. And so the receptive field in this case
is 1, 2, 3, 4, 5, 6 by 6, right? And that means that a pixel way up here in the top,
right? Has literally no ability to impact that activation. It's not
part of its receptive field. If you have a whole bunch of stride 2 convs, each
time you have one, the receptive field is going to get twice as big. So the receptive field at the
end of a deep network is actually very large. But the inputs closest to the middle of the receptive
field have the biggest kind of say in the output because they implicitly appear the most often in
all of these kind of dot products that are inside this convolutional window. So the receptive field
is not just like a single binary on off thing. Certainly all the stuff that's not got precedents
here is not part of it at all. But the closer to the center of the receptive field, the
more impact it's going to have, the more ability it's got to change this number. So the
receptive field is a really important concept and yeah, fiddling, playing around with Excel's
precedent arrows, I think is a nice way to see that, at least in my opinion.
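The size of a receptive field can also be worked out with a little arithmetic rather than by tracing arrows. The sketch below uses a standard recurrence, and it assumes the spreadsheet network is two 3 by 3 stride-1 convs followed by a 2 by 2 max pool, as described above. The function name and the layer lists are illustrative only. Note that it only gives the extent of the receptive field, not how much more the central pixels matter.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs from input to output.
    Returns the side length, in input pixels, that one output unit can see."""
    rf, jump = 1, 1   # jump: distance in input pixels between neighbouring units at this depth
    for ks, stride in layers:
        rf += (ks - 1) * jump
        jump *= stride
    return rf

# Assumed layout of the spreadsheet network: conv 3x3, conv 3x3, then 2x2 max pool
rf_excel = receptive_field([(3, 1), (3, 1), (2, 2)])

# A stack of 1 to 4 stride-2 3x3 convs: the field roughly doubles with each extra layer
rf_strided = [receptive_field([(3, 2)] * n) for n in range(1, 5)]
print(rf_excel, rf_strided)
```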
And apart from anything else, it's great fun creating a Convolutional
Neural Network in Excel. I thought so anyway. Okay, so let's take a seven minute break. I'll see you back after that to talk about
a convolutional autoencoder. All right. Okay, welcome back. We're going to have
a look now at the autoencoder notebook. So we're just going to import all of our
usual stuff and we've got one more of our own modules to import now as well. And this
time we are going to switch to a different, we're going to switch to a different
dataset, which is the fashion MNIST dataset. We can take advantage of the
stuff that we did in 05_datasets and the Hugging Face stuff to load it.
So we've seen this a little bit before. Back in our datasets one here. And we never actually built any models
with it. So let's first of all do that. So this is just, I'm going to convert
each thing, each image into a tensor, and that's going to be an in-place transform.
Remember we created this decorator. And so we can call the dataset dictionary's
with_transform. This
is all stuff we've done before. And so here we have our example of a sneaker. All right. And we will create our collation
function, collating a dictionary for that dataset. That's something you should remind yourself. We
built that ourselves in the datasets notebook. And let's actually make our collate function something that does to_device(),
which we wrote in our last notebook. And we'll create a little data_loaders function
here, which is going to go through each item in the dataset dictionary and create a DataLoader for
it and give us a dictionary of data loaders. Okay. So, okay. So now we've got a data loader for training
and a data loader for validation. So we can grab the x and y batch by just calling next()
on that iterator as we've done before. We can grab the, let's look at each
of these in turn actually. We've done all this before, but it's a couple of weeks ago. So just to remind you, we can
get the names of the features. And so we can then get, create an itemgetter()
for our y's. And we can, so we'll call that the label getter. We can apply that to our labels to
get the titles of everything in our mini batch. And we can then call our
show_images() that we created with that mini batch, with those titles. And
here we have our fashion MNIST mini batch. Okay. So let's create a classifier and
we're just gonna use exactly the same code, copy and pasted from the previous
notebook. So here is our sequential model. And we are going to grab the parameters of the CNN and the CNN I've actually
moved it over to the device. The default device was what we created in our
last notebook. And as you can see, it's fitting. Now our first problem is it's fitting
very slowly, which is kind of annoying. So why is it running pretty slowly? Let's think
about, let's have a look at our dataset. So when it's finally finished, let's take a
look at an item from the dataset. Actually, let's not look at the dataset. Let's actually
go all the way back to the dataset dictionary so before it gets transformed dataset dictionary
and let's grab the training part of that and let's grab one item. And
actually we can see here the problem. For MNIST, we had all of the data loaded into
memory into a single big tensor, but this Hugging Face one is created in a much more kind
of normal way, which is, each image is a totally separate PNG image. It's not all pre-converted
into a single thing. Why is that a problem? Well, the reason it's a problem is that
our DataLoader is spending all of its time decoding these PNGs. So if I train here, okay, so while I'm training, I can type htop
and you can see that basically my CPU is 100% used. Now that's a bit weird because I've
actually got 64 CPUs. Why is it using just one of them is the first problem. But why
does it matter that it's using 100% CPU? Well, the reason it matters,
let's run it again so you can see, why does it matter that our CPU is 100%
and why is it making it so slow? Well, the reason why is if we look at
nvidia-smi dmon that will monitor our GPUs utilization. I've got three GPUs, so
I say to choose just the zeroth index one. And you'll see this column here, sm, this stands
for streaming multiprocessor. It's like the, it's the equivalent of like CPU usage. And
generally we're only using up 1% of our one GPU. So no wonder it's so slow. So the first thing
we wanna do then is try to make things faster. Now, to make things faster, we wanna be
using more than one CPU to decode our PNGs. And as it turns out, that's actually
pretty easy to do. You just have to add an extra argument to your data loaders, which is here, num_workers. And so I
can say use eight CPUs, for example. Now, if I create, I recreate the data loaders
and then try to create, get the next one. Oh, now I've got an error. And the error is
rather quirky. And what it's saying is, oh, you're now trying to use multiple processes. And
generally in Python and PyTorch, using multiple processes, things start to get complicated.
And one of the things that absolutely just doesn't work is you can't actually have your
DataLoader put things onto the GPU in your separate processes. It just doesn't work. So the
reason for this error is actually because of the fact that we used a collate function that put
things on the device. That's incompatible, unfortunately, with using multiple
workers. So that's a problem. And the answer to that problem, sadly, is that we would have to actually rewrite our
fit function entirely. So there's annoying thing number one, and we don't want to be rewriting
our fit function again and again. We want to have a single fit function. So, okay. So there's
a problem that we're gonna have to think about. Problem number two is that this is not very
accurate, 87%. Well, I mean, is it accurate? It's easy enough to find out. There's a
really nice website called paperswithcode. And it will show you a little leaderboard. And we
can see whether we're any good. And the answer is we're not very good at
all. So these papers had 96%, 94%, 92%. So yeah, we're not looking
great. So how do we improve that? There's a lot of things we could try, but
pretty much all of them are going to involve modifying our fit function again and in reasonably complicated ways. So we still
got a bit of an issue there. Let's put that aside because what we actually
wanted to do is create an auto-encoder. So to remind you about what an auto-encoder is,
and we're gonna be able to go into a bit more detail now, we're gonna start with our input
image, which is gonna be 28 by 28. So it's the number 3, right? And it's a 28 by 28. And we're
gonna put it through, for example, a stride-2 conv, stride-2. And that's going
to have an output of a 14 by 14. And we can have more channels. So say maybe
4, so this is 28 by 28 by 1. Let's do 14 by 14 by 2. So we've reduced the height and width
by 2, but added an extra channel. So overall, this is a 2x decrease in the number of activations. And
then we could do another stride-2 conv, and that would give us a 7 by 7. And again,
we can choose however many channels we want, but let's say we choose 4. So now compared to our
original, we've now got a times 4 reduction. And so we could do that a few times, or we could
just stay there. And so this is compressing. And so then what we could do is then somehow
have a convolution layer or group of layers, which does a convolution
and also increases the size. There is actually something
called a transposed convolution, which I'll leave you to look up if you're
interested, which can do that. Also known as a rather weirdly, a stride-1/2 convolution. But
there's actually a really simple way to do this, which is to say, let's say you've got a bunch of
pixels. Let's say we've got 3 by 3 pixels that look like this, 1, 0, 1, 1,
say. We could make that into a 6 by 6 very easily, which is we could simply, let's get these out. We could simply
copy that pixel there into the first 4, copy that pixel there into these 4. And so you can
see, and then copy this pixel here into these 4. And so we're simply turning each pixel into 4
pixels. And so this is called nearest neighbor upsampling. Now that's not a convolution,
that's just copying. But what we could then do is we could then apply a stride-1
convolution to that. Right? And that would allow us to double the grid size with the
convolution. And that's what we're gonna do. So our autoencoder is gonna need a deconvolutional
layer, and that's gonna contain two layers, up sampling nearest neighbor, scale factor of
2, followed by a conv2d with a stride of one. Okay. And you can see for padding, I
just put kernel size // 2. So that's a truncating division, cause that always
works for any odd sized kernel. As before, we will have an optional activation function, and
then we will create a sequential using *layers. So that's gonna pass in each layer as a separate
argument, which is what Sequential() expects. Okay. So let's write a new fit function. It goes
through, I just basically copied it over from our previous one, going through each epoch, but
I've pulled out eval into a separate function, but it's basically doing the same thing. Okay. So here is our auto
encoder. And so we're going to, it's a bit tricky because I wanted to go
down by 1, 2, 3, to get to a 4 by 4 by 8, but starting at 28 by 28, you can't divide
that three times and get an integer. So what I first do is I zero pad. So add padding
of 2 on each side to get a 32 by 32 input. So if I then do a conv with 2 channel output, that
gives us 16 by 16 by 2, and then again to get an 8 by 8 by 4, and then again to get a 4 by 4 by
8. So this is doing an 8x compression, and then we can call deconv() to do exactly the same thing
in reverse. The final one with no activation, and then we can truncate off those two pixels
off the edge. Slightly surprisingly, PyTorch lets you pass -2 to zero padding to crop off the
final 2 pixels. And then we'll add a Sigmoid(), which will force everything to go between
0 and 1, which of course is what we need. And then we will use mse_loss to compare
those pixels to our input pixels. And so a big difference we've got here now
is that our loss function is being applied to the output of the model and the input itself,
right? We don't have yb here, we have xb. So we're trying to recreate our original. And
again, this is a bit annoying that we have to create our own fit function. Anyway, so we can
now see what is the mse_loss, and it's not, like, gonna be particularly human readable,
but it's a number we can see if it goes down. And so then we can create, then we can do our SGD with the
parameters of our auto-encoder, with mse_loss, call that fit
function, we just wrote, and I won't wait for it to run, cause as you can see,
it's really slow for reasons we've discussed. I've run it before. And what we want is
to see that the original, which is here, gets recreated.
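Here is a rough sketch of the autoencoder just described: zero pad to 32 by 32, three stride-2 convs down to 4 by 4 by 8, three upsample-then-conv blocks back up, crop, and a sigmoid. The channel counts and padding come from the description above. The conv helper, the choice of ReLU as the activation, and the random batch are assumptions for illustration, so treat this as an approximation rather than the notebook's exact code.

```python
import torch
from torch import nn
import torch.nn.functional as F

def conv(ni, nf, ks=3, stride=2, act=True):
    # Stride-2 conv halves the grid size; ks//2 padding works for any odd kernel size
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)]
    if act: layers.append(nn.ReLU())   # assumed activation
    return nn.Sequential(*layers)

def deconv(ni, nf, ks=3, act=True):
    # Nearest neighbor upsampling copies each pixel into a 2x2 block,
    # then a stride-1 conv learns what to do with the enlarged grid
    layers = [nn.UpsamplingNearest2d(scale_factor=2),
              nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks // 2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

ae = nn.Sequential(
    nn.ZeroPad2d(2),           # 28x28 -> 32x32
    conv(1, 2),                # 16x16x2
    conv(2, 4),                # 8x8x4
    conv(4, 8),                # 4x4x8
    deconv(8, 4),              # 8x8x4
    deconv(4, 2),              # 16x16x2
    deconv(2, 1, act=False),   # 32x32x1
    nn.ZeroPad2d(-2),          # negative padding crops back to 28x28
    nn.Sigmoid(),              # force outputs into the range 0 to 1
)

xb = torch.rand(16, 1, 28, 28)   # random stand-in for a mini batch of images
out = ae(xb)
bottleneck = ae[:4](xb)          # activations after the three stride-2 convs
loss = F.mse_loss(out, xb)       # compare the output to the input itself, not to yb
print(out.shape, bottleneck.shape, loss.item())
```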
And the answer is, oh, not really. I mean, they're roughly the same things,
but there's no point having an auto-encoder which can't even recreate the originals. The idea
would be that if these looked almost identical to these, then we'd say, wow, this is a fantastic
network at compressing things by eight times. So I found this like very fiddly to try and
get this to work at all. Something that I discovered can get it to start training is
to start with a really low learning rate for a few epochs, and then increase the
learning rate after a few epochs. I mean, at least it gets it to train and show something
vaguely sensible, but let's see. Yeah, that still looks pretty crummy. This one here I got actually
by switching to Adam, and I actually removed the tricky bit. I removed these two as well.
But yeah, I couldn't get this to like recreate anything very reasonable
in any reasonable amount of time. And why is this not working very well? There's so
many reasons it could be. Like do we need a better optimizer? Do we need a better architecture?
Do we need to use a Variational Auto-Encoder? There's a thousand things we could try,
but doing it like this is going to drive us crazy. We need to be able to really rapidly
try things and all kinds of different things. And so what I often see in projects or on
Kaggle or whatever, people's code looks kind of like this. It's all like manual. And then their
iteration speed is too slow. We need to be able to really rapidly try things. So we're not gonna keep
doing stuff manually anymore. This is where we take a halt and we say, okay, let's build
up a framework that we can use to rapidly try things and to understand when things
are working and when things aren't working. So we're gonna start creating a learner. So what is a learner? It's basically the
idea is this learner is gonna be something that we build, which will allow us to
try like anything that we can imagine very quickly. And we will build on top
of that learner things that will allow us to introspect what's going on inside our model,
will allow us to do multi-process CUDA to go fast. It will allow us to add things like
data augmentation. It will allow us to try a wide variety of architectures quickly
and so forth. So that's gonna be the idea. And of course we're gonna create it from scratch.
And so let's start with Fashion MNIST like before. And let's create a DataLoaders class, which is gonna look a bit like what we
had before, where we're just going to pass in, this is just, this couldn't be
simpler, right? We're just gonna pass in two DataLoaders and store them away. And I'm gonna
create a @classmethod from dataset dictionary. And what that's gonna do is
it's gonna call DataLoader on each of the dataset dictionary items with
our batch size and instantiate our class. So if you haven't seen @classmethod before, it's
what allows us to say DataLoaders dot something in order to construct this. We could have put this in
__init__ just as well, but we'll be building more complex DataLoaders things later. So I thought we
might start by getting the basic structure right. So this is all pretty much the same as
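A minimal sketch of that DataLoaders class might look like the following. It assumes PyTorch's own DataLoader and a plain dictionary of toy datasets in place of the Hugging Face dataset dictionary. The from_dd name and the toy data are illustrative, and the lesson's version also passes a collate function, which is left out here.

```python
from torch.utils.data import DataLoader

class DataLoaders:
    def __init__(self, *dls):
        # Just store the two data loaders away
        self.train, self.valid = dls[:2]

    @classmethod
    def from_dd(cls, dd, batch_size, **kwargs):
        # One DataLoader per split; extra kwargs (e.g. num_workers) are passed straight through
        return cls(*[DataLoader(ds, batch_size=batch_size, **kwargs) for ds in dd.values()])

# Toy stand-in for a dataset dictionary: each item is an (x, y) pair
dd = {'train': [(float(i), i % 2) for i in range(10)],
      'test':  [(float(i), i % 2) for i in range(4)]}

dls = DataLoaders.from_dd(dd, batch_size=4)   # constructed via the classmethod
xb, yb = next(iter(dls.train))
print(xb.shape, yb.shape, len(dls.valid))
```

Nothing here touches the device, so the same class keeps working if num_workers is passed in.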
what we've had before. I'm not doing anything on the device here, cause
as we know that didn't really work. Okay. Oh, this is an old thing.
I don't need to_cuda() anymore. So we're gonna use to_device(),
which I think came from. There we go. So here's an example of
a very simple Learner that fits on one screen. And this is basically
gonna replace our fit function. So a Learner is gonna be something that is
going to train or learn a particular model using a particular set of DataLoaders, a
particular loss function, some particular learning rate and some particular optimizer
or some particular optimization function. Now, normally I, you know, most people
would often kind of store each of these away separately by writing like
self.model equals model, blah, blah, blah, right? And as I think we've
talked about before, that's, you know, that kind of huge amount of boilerplate. It
just, it's more stuff that you can get wrong. And it's more stuff that you have to
read to understand the code and yeah, don't like that kind of repetition. So instead we just call
fastcore.store_attr() to do that all in one line. Okay, so that's the basic idea with
a class is to think about what's the information it's gonna need. So you pass
that all to the constructor, store it away. And then our fit function is going to,
we've got the basic stuff that we have for keeping track of accuracy. So this will only work for stuff that's a
classification where we can use accuracy. Put the model on our device, create the optimizer, store how many epochs we're
going through. Then for each epoch, we'll call the one epoch function and the one epoch function,
we're gonna either do train or evaluation. So we pass in True if we're training and False if
we're evaluating. And they're basically almost the same. We basically set the model to training
mode or not. We then decide whether to use the validation set or the training set based on
whether we're training. And then we go through each batch in the DataLoader and
call one batch. And one batch is then the thing which is going to
put our batch onto the device, call our model, call our loss function. And then,
if we're training, then do our backward step, our optimizer step and our zero gradient.
And then finally calculate our metrics or our stats. And so here's where we calculate our
metrics. So that's basically what we have there. So let's go back to using an MLP. We call fit() and away it goes. This is an error here, pointed out
by Kevin. Thank you. self.model.to(). One thing I guess we could try now is we think
that maybe we can use more than one process. So let's try that. Oh, it's so fast. I didn't even see. There it goes. You can see all four CPUs
being used at once. Bang, it's done. Okay, so that's pretty great. Let's see how fast
it looks here. Bump, bump. All right, lovely. Okay, so that's a good sign. We've got a learner
that can fit things, but it's not very flexible. It's not gonna help us, for example, with our
autoencoder, because there's no way of like, just like, you know, changing which
things are used for predicting with, or for calculating with. We can't use it for
anything except things that involve accuracy with a binary classification. Sorry… is that
right? Sorry, yeah, a multi-class classification. It's not flexible at all, but it's a
start. And so I wanted to basically put this all on one screen so you can
see what the basic Learner looks like. All right, so how do we do things
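As a rough, CPU-only sketch of a Learner with that shape: fit calls one_epoch, which calls one_batch, with accuracy hard-coded as the only metric. The attributes are stored by hand here instead of using fastcore's store_attr(), the device handling is left out, and the random data and tiny MLP are made up purely so the example runs.

```python
from types import SimpleNamespace
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class Learner:
    def __init__(self, model, dls, loss_func, lr, opt_func=optim.SGD):
        self.model, self.dls, self.loss_func = model, dls, loss_func
        self.lr, self.opt_func = lr, opt_func

    def one_batch(self, batch):
        xb, yb = batch                      # moving the batch to a device is omitted here
        preds = self.model(xb)
        loss = self.loss_func(preds, yb)
        if self.model.training:
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()
        with torch.no_grad():               # stats only make sense for classification
            n = len(xb)
            self.accs.append((preds.argmax(dim=1) == yb).float().sum().item())
            self.losses.append(loss.item() * n)
            self.ns.append(n)

    def one_epoch(self, train):
        self.model.train(train)             # training mode or evaluation mode
        self.accs, self.losses, self.ns = [], [], []
        dl = self.dls.train if train else self.dls.valid
        for batch in dl: self.one_batch(batch)
        n = sum(self.ns)
        return sum(self.losses) / n, sum(self.accs) / n

    def fit(self, n_epochs):
        self.opt = self.opt_func(self.model.parameters(), lr=self.lr)
        history = []
        for epoch in range(n_epochs):
            trn = self.one_epoch(True)
            with torch.no_grad(): val = self.one_epoch(False)
            history.append((epoch, trn, val))
        return history

# Made-up two-class problem so the sketch can run end to end
torch.manual_seed(0)
x = torch.randn(64, 10)
y = (x.sum(dim=1) > 0).long()
dls = SimpleNamespace(train=DataLoader(TensorDataset(x[:48], y[:48]), batch_size=16),
                      valid=DataLoader(TensorDataset(x[48:], y[48:]), batch_size=16))
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
history = Learner(model, dls, F.cross_entropy, lr=0.1).fit(2)
print(history)
```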
other than multi-class accuracy? I decided to create a Metric class. And basically
a Metric class is something where we are going to define subclasses of it that calculate
particular metrics. So for example, here, I've got a subclass of a Metric called Accuracy.
So if you haven't done subclasses before, you can basically think of this as saying,
please copy and paste all the code from here into here for me, but the bit that says def
calc(), replace it with this version. So in fact, this would be identical to copying and pasting
this whole thing, typing Accuracy here, and replacing the definition of calc() with
that. That's what is happening here when we do subclassing. So it's basically copying and
pasting all that code in there for us. It's actually more powerful than that. There's more
we can do with it, but in this case, this is all that's happening with this subclassing. And this
is called, actually I'll leave that, that's fine. Okay, so the Accuracy metric is here, and
then this is kind of our really basic Metric, which is we're gonna use for just for loss. And so
what happens is we're going to, let's for example, create an Accuracy metric object. We're
basically gonna add in mini batches of data, right? So for example, here's a mini batch of
inputs and predictions. Here's another mini batch of inputs and predictions. And then we're gonna
call .value and it will calculate the accuracy. Now .value is a neat little thing. It
doesn't require parentheses after it because it's called a property. And
so a property is something that just calculates automatically without having to
put parentheses. That's all a property is, well, property getter anyway. And so they look
like this, you give it a name. And so we are going to be, each time we call add(), we are
gonna be storing that input and that target. And also the number of items
in the mini batch optionally. For now, that's just always gonna be one.
And you can see here that we then call .calc(), which is gonna call the accuracy
calc(). So just see how often they're equal. And then we're going to
append to the list of values that calculation. And we're also gonna append
to the list of ns, in this case just one. And so then to calculate the value, we just do that.
So that's all that's happening for Accuracy. And then we can do for loss, we can just use
Metric directly, cause Metric directly will just calculate the average of whatever it's passed.
So we can say, oh, add the number 0.6. So the target's optional. And we're saying this is a
mini batch of size 32. So it's gonna be the n. And then add the value 0.9 with a mini batch size
of 2, and then get the value. And as you can see, that's exactly the same as the weighted average
of 0.6 and 0.9 with weights of 32 and 2. So we've created a Metric class. And so
that's something that we can use to create any metric we like just by overriding calc(). Or we could create things totally from scratch
as long as they have an add() and a value. Okay, so we're now going to change our Learner.
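The Metric idea described above can be sketched in plain Python as follows. The lesson's version works on tensors. This one uses lists so that it has no dependencies, and the sample mini batches are made up for illustration.

```python
class Metric:
    def __init__(self): self.reset()

    def reset(self): self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        # Store this mini batch's result and how many items it should count for
        self.last = self.calc(inp, targ)
        self.vals.append(self.last)
        self.ns.append(n)

    @property
    def value(self):
        # Weighted average of everything added so far; no parentheses needed to read it
        return sum(v * n for v, n in zip(self.vals, self.ns)) / sum(self.ns)

    def calc(self, inps, targs): return inps   # base class: just pass the number through

class Accuracy(Metric):
    # Same as Metric, except calc() is replaced: how often are they equal?
    def calc(self, inps, targs):
        return sum(i == t for i, t in zip(inps, targs)) / len(inps)

acc = Accuracy()
acc.add([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 1, 0])   # 3 of 6 match
acc.add([1, 1, 2, 0, 1], [0, 1, 1, 2, 1])         # 2 of 5 match
acc_value = acc.value

loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
loss_value = loss.value
print(acc_value, loss_value)
```

With n left at 1 for both accuracy batches, the result is a plain average of the two per-batch accuracies. The loss example is weighted by mini batch size.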
And what we're gonna do is we're going to keep the same basic structure. So there's gonna be
fit(). It's gonna go through each epoch. It's gonna call one_epoch() passing in True and False
as for training and validation. one_epoch() is going to go through each batch in the DataLoader
and call one_batch(). one_batch() is going to do the prediction, get_loss(), and if it's training,
it's gonna do the backward() step and zero_grad(). But there's a few other things
going on. So let's take a look. Well, actually let's just look at
it in use first. So when we use it, we're gonna be creating a Learner() with
the model, data loaders, loss function, learning rate, and some callbacks, which
we'll learn about in a moment. And we call fit() and it's gonna do our thing. And
look, we're gonna have charts and stuff. All right, so the basic idea is gonna
look very similar. So we're gonna call fit(). So when we construct it, we're gonna be
passing in exactly the same things as before, but we've got one extra thing, callbacks, which
we'll see in a moment, store the attributes as before and we're gonna be doing some stuff
with the callbacks. So when we call fit() for this number of epochs, we're gonna
store away how many epochs we're gonna do. We're also gonna store away
the actual range that we're going to loop through as self.epochs. So
here's that looping through self.epochs. We're gonna create the optimizer using
the optimizer function and the parameters. And then we're gonna call _fit(). Now what on earth is _fit()? Why
didn't we just copy and paste this into here? Why do this? It's because we've
created this special decorator with callbacks. What does that do? So it's up here with
callbacks. With callbacks is a class. It's gonna just store one thing, which
is the name. In this case, the name is ‘fit’. And what it's gonna do is… now this is the
decorator, right? So when we call it, remember, decorators get passed a function. So it's gonna
get passed this whole function and that's gonna be called f. So dunder call, remember is what
happens when a class is treated, an object is treated as if it's a function. So it's gonna get
passed this function. So this function is _fit. And so what we wanna do is we wanna return a
different function. It's going to of course call the function that we were asked to call using
the arguments and keyword arguments we were asked to use. But before it calls that function, it's
going to call a special method called callback, passing in the string before, in this case,
before underscore fit. After it's completed, it's gonna call that method called callback
and passing the string after underscore fit. And it's gonna wrap the whole
thing in a try except block. And it's going to be looking for an exception
called CancelFitException. And if it gets one, it's not gonna complain. So let me explain
what's going on with all of those things. Let's look at an example of a callback. So for example, here is a Callback called
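A stripped-down sketch of that decorator is below. The class name with_cbs, the Demo classes, and the log list are stand-ins invented for this example. It also hard-codes CancelFitException, so it only illustrates the 'fit' case described above.

```python
class CancelFitException(Exception): pass

class with_cbs:
    def __init__(self, nm): self.nm = nm          # the only thing stored is the name, e.g. 'fit'

    def __call__(self, f):
        # Called with the decorated function; returns a different function that wraps it
        def _f(o, *args, **kwargs):
            try:
                o.callback(f'before_{self.nm}')
                f(o, *args, **kwargs)
                o.callback(f'after_{self.nm}')
            except CancelFitException: pass       # a cancelled fit is not an error
        return _f

class Demo:
    def __init__(self): self.log = []
    def callback(self, name): self.log.append(name)

    @with_cbs('fit')
    def _fit(self): self.log.append('fit body')

class CancellingDemo(Demo):
    def callback(self, name):
        self.log.append(name)
        if name == 'before_fit': raise CancelFitException()

    @with_cbs('fit')
    def _fit(self): self.log.append('fit body')

d = Demo(); d._fit()
c = CancellingDemo(); c._fit()    # raises inside before_fit, which is swallowed by the wrapper
print(d.log, c.log)
```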
DeviceCB, device callback. And before_fit() will be called automatically before that underscore fit
method is called. And it's going to put the model onto our device, CUDA or MPS, if we have one,
otherwise it'll just be on GPU. So what's gonna happen here? So it's going to call, we're gonna
call fit(). It's gonna go through these lines of code. It's then gonna call _fit(). _fit() is not
this function. _fit() is this function with f is this function. So it's going to call our
learner dot callback passing in before_fit. And callback() is defined here. What's
callback() gonna do? It's gonna be passed the string before_fit(). It's going to
then go through each of our callbacks sorted based on their order. And you can
see here, our callbacks can have an order and it's going to look at that callback and
try to get an attribute called before_fit. And it will find one. And so then it's going to
call that method. Now, if that method doesn't exist, it doesn't appear at all, then getattr()
will return this instead. Identity is a function just here. This is an identity function. All
it does is whatever arguments it gets passed, it returns them. And if it's not
passed any arguments, it just returns. So there's a lot of Python going on here. And
that is why we did that foundations lesson. And so for people who haven't done a lot of
this Python, there's gonna be a lot of stuff to experiment with and learn about.
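
To make the pieces described above concrete, here is a minimal sketch of the mechanism, not the notebook's actual code. It assumes a decorator factory named with_cbs and uses hypothetical stand-ins (TinyLearner, LogCB, StopCB) so it runs without PyTorch. It shows the before/after calls, the try-except around CancelFitException, the sorting of callbacks by order, and getattr() falling back to identity when a callback lacks the method.

```python
from operator import attrgetter


class CancelFitException(Exception):
    """Raised by a callback to stop fitting without an error being reported."""


def identity(*args):
    """Fallback for callbacks that lack the requested method."""
    if not args:
        return None
    x, *rest = args
    return (x,) + tuple(rest) if rest else x


def with_cbs(nm):
    """Assumed name: wrap a method with before_<nm> / after_<nm> callbacks."""
    def decorator(f):
        def wrapped(o, *args, **kwargs):
            try:
                o.callback(f'before_{nm}')
                f(o, *args, **kwargs)
                o.callback(f'after_{nm}')
            except CancelFitException:
                pass  # a callback asked to stop, so don't complain
        return wrapped
    return decorator


class Callback:
    order = 0  # callbacks run in ascending order


class LogCB(Callback):
    """Hypothetical callback that records which hooks fired."""
    order = 1
    def __init__(self): self.events = []
    def before_fit(self): self.events.append('before_fit')
    def after_fit(self): self.events.append('after_fit')


class StopCB(Callback):
    """Hypothetical callback that cancels the fit before it starts."""
    order = 2
    def before_fit(self): raise CancelFitException()


class TinyLearner:
    """Hypothetical stand-in for the learner; it does no real training."""
    def __init__(self, cbs):
        self.cbs = cbs
        self.fitted = False

    def callback(self, method_nm):
        # Look up the hook on each callback; use identity if it is missing.
        for cb in sorted(self.cbs, key=attrgetter('order')):
            getattr(cb, method_nm, identity)()

    def fit(self):
        self._fit()

    @with_cbs('fit')
    def _fit(self):
        self.fitted = True


log = LogCB()
learn = TinyLearner([log, Callback()])  # the bare Callback has no hooks
learn.fit()
print(log.events)    # ['before_fit', 'after_fit']

log2 = LogCB()
learn2 = TinyLearner([StopCB(), log2])
learn2.fit()         # no exception escapes
print(log2.events, learn2.fitted)  # ['before_fit'] False
```

In the second run, LogCB fires first because its order is lower, then StopCB raises CancelFitException, so the body of _fit() and the after_fit hooks are skipped.
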
And so do ask on the forums, if any of these bits get confusing, but the
best way to learn about these things is to open up this Jupyter notebook and try and
create really simple versions of things. So for example, let's try identity().
identity(), how exactly does identity work? I can call it with nothing and I get nothing. I can call it
with 1, I get back 1. I could call it with 'a', get back 'a'. Call it with 'a', 1 and get 'a', 1. And how is it doing that exactly? So remember
we can add a break point and this would be a great time to really test your debugging skills.
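
As a starting point for that experiment, here is one way identity() could be written so that it behaves as described. Treat it as a sketch consistent with the calls above, not necessarily the exact definition in the notebook. The commented-out breakpoint() line is where you might drop into the debugger.

```python
def identity(*args):
    # breakpoint()  # uncomment to step through: h for help, n for next, args to list arguments
    if not args:
        return None          # called with nothing, returns nothing
    x, *args = args          # split off the first argument from the rest
    return (x,) + tuple(args) if args else x


print(identity())        # None
print(identity(1))       # 1
print(identity('a'))     # a
print(identity('a', 1))  # ('a', 1)
```

With a single argument you get that argument back unchanged, and with several you get them back as a tuple.
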
Okay, so remember in our debugger, we can hit h to find out what the commands are, but you
really should do a tutorial on the debugger if you're not familiar with it. And then we can
step through each one. So I can now print args. And there's actually a trick which I like is
that args is actually a command, funnily enough, which will just tell you the arguments to any
function, regardless of what they're called, which is kind of nice. And so then
we can step through by pressing n and after this, we can check
like, okay, what is x now? And what is args now? Right? So remember
to really experiment with these things. So anyway, we're gonna talk about
this a lot more in the next lesson. But before that, if you're not
familiar with try-except blocks, you know, spend some time practicing them.
If you're not familiar with decorators, well, we've seen them before. So go back
and look at them again really carefully. If you're not familiar with the debugger, practice
with that. If you haven't spent much time with getattr, remind yourself about that. So try to get yourself really
familiar and comfortable as much as possible with the pieces, because
if you're not comfortable with the pieces, then the way we put the pieces together is
gonna be confusing. There's actually something in the theory of education
called cognitive load theory. And basically, cognitive load theory
says, if you're trying to learn something, but your cognitive load is really high because
of lots of other things going on at the same time, you're not gonna learn it. So it's gonna
be hard for you to learn this framework that we're building if you have too much cognitive
load of like what the hell's a decorator or what the hell's getattr or what does sorted do
or what's partial, you know, all these things. Now, I actually spent quite a bit of time
trying to make this as simple as possible, but also as flexible as it needs
to be for the rest of the course. And this is as simple as I could get
it. So these are kind of things that you actually do have to learn. But in doing
so, you're gonna be able to write some really powerful and general code yourself. So hopefully
you'll find this a really valuable and mind expanding exercise in bringing high level software
engineering skills to your data science work. Okay, so with that, this looks like a good place to leave it and look forward
to seeing you next time. Bye.