MIT6036L03F19f

So this was good for one dimension. What if we have multiple dimensions? Let's do the multiple-dimension case. It's just general gradient descent. I hardly need to write this down again, but maybe I will, just so that we can look at the types of things. I'll write down all the things and we'll look at their shapes.

Okay, so the first thing: imagine you have a function. Let me not use d; let me use, I don't know, m, because it's not necessarily the dimension of our data. Let's imagine that our function takes m arguments. So our thetas are d by, no, not d, excuse me, m by 1. Okay, so theta has dimension m by 1. Our function takes an m by 1 vector, really, is a way to think about it, and gives back a scalar. Let's just assume for right now that the output of f is a scalar; for all of this we'll assume that.

So now the question is, what about the gradient? Many of you have looked at gradients, some of you more recently than others, so let's just remember the definition, because it's not super terrifying even though the symbol is cool. So what is this? The gradient of f with respect to theta. First of all, it's really important to know its dimension. Its dimension is m by 1, so it's going to be a vector of the same shape as theta, and it's really just a vector of the partial derivatives. The first element is the partial derivative of f with respect to parameter 1, the next element is partial f partial parameter 2, and so on down to partial f partial parameter m. So that's the gradient: just the vector of partial derivatives. Sometimes we'll use some fancy gradient calculus; actually, we use it off and on throughout the class. It makes writing stuff on the board super compact and beautiful. If it doesn't make sense to you, you can always, always, always go back and just compute the entries of this vector, and usually you just have to compute one or two, because they just depend on indices in some systematic way.

So this is a gradient, but you could think of it as a direction, right? I remember, I think I was in high school or college, probably college, and I was learning to ski. I was taking a ski lesson, and the ski instructor was trying to explain this idea that there was a steepest way down the hill, and he called it the fall line. The idea is that you're supposed to keep your skis, you know, so that you don't fall down. Anyway, "keep your skis orthogonal to the gradient" would have been a very simple way to say the thing he was trying to say, but he had to invent all these words. Finally I understood: okay, orthogonal to the gradient, we're good. So the gradient is the steepest way down the hill, or the steepest way up the hill. Actually, excuse me, the gradient is the steepest way up the hill. The direction is the same but the sign is different.

Okay, so that's the gradient. Good. So now we have theta init. Theta init has got to be m by 1 also, and that's just the point we're going to start at on the hill. Then we're going to execute the same algorithm. These two guys are still scalars. I'm not going to write all the bookkeeping stuff; the only thing that's different is that now we say theta t is theta t minus 1 minus eta times, where before there was the derivative, the gradient of f at theta t minus 1. I mean, it's the gradient with respect to theta, but evaluated at this particular place, right? So we're standing here on the mountain, we figure out what's the steepest way down, and we take a step. Now we look and we see where the steepest way down is, this way, and we take a step, and so on. So that's gradient descent; now it's really gradient descent. Good. Questions about this? Termination is the same.

All right, let's see some examples. So what do we have put together now? We have a hypothesis class, which gives us out numbers between 0 and 1. We have a loss function that consumes numbers between 0 and 1, which gives us an objective function J. And now we have an algorithm for trying to find the minimum.
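
[Editor's sketch] The multi-dimensional update described in the lecture (theta t is theta t minus 1 minus eta times the gradient of f at theta t minus 1) can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not code from the lecture: the function names, the example objective, the step size, and the stopping rule (stop when f changes by less than epsilon) are all chosen here for illustration. The helper numerical_gradient shows the "compute the entries of the vector one at a time" idea by approximating each partial derivative separately.

```python
# Illustrative sketch only. All names below (numerical_gradient,
# gradient_descent, f, grad_f) are hypothetical, not from the lecture.

def numerical_gradient(f, theta, h=1e-6):
    """Approximate the gradient of f at theta, one partial derivative at a time."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        # Central difference for the partial derivative with respect to theta[i].
        grad.append((f(up) - f(down)) / (2 * h))
    return grad  # same length (shape) as theta


def gradient_descent(f, grad_f, theta_init, eta, epsilon, max_iter=10000):
    """Repeat theta_t = theta_{t-1} - eta * grad_f(theta_{t-1})."""
    theta = list(theta_init)
    for _ in range(max_iter):
        g = grad_f(theta)
        new_theta = [t - eta * g_i for t, g_i in zip(theta, g)]
        # Assumed stopping rule: stop when f changes by less than epsilon.
        if abs(f(new_theta) - f(theta)) < epsilon:
            return new_theta
        theta = new_theta
    return theta


# Example objective (an assumption for illustration): a bowl whose
# minimum is at theta = (1, -2). It takes m = 2 arguments and returns a scalar.
def f(theta):
    return (theta[0] - 1) ** 2 + 2 * (theta[1] + 2) ** 2


# Analytic gradient: the vector of partial derivatives of f.
def grad_f(theta):
    return [2 * (theta[0] - 1), 4 * (theta[1] + 2)]


theta_star = gradient_descent(f, grad_f, theta_init=[0.0, 0.0],
                              eta=0.1, epsilon=1e-12)
approx_grad = numerical_gradient(f, [0.0, 0.0])
```

Here eta and epsilon stay scalars while theta, theta_init, and the gradient all have the same length m. Comparing approx_grad against grad_f([0.0, 0.0]) is one way to check a hand-derived gradient entry by entry.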

