MIT6036L03F19e

so now let's say we have this hypothesis class these logistic linear classifiers and we have this loss function and so now we have an objective right now let's we can ignore the lambda and the regularizer we have an objective and so now what we'd like to do is find the Thetas that minimize the objective right we think that's going to give us the best classifier and we would like to not have to be smart about inventing a new algorithm we would like to just use some old algorithm that optimization people now so this is the simplest possible optimization algorithm know the simplest possible one we already explored at the beginning which is maybe like generate hypotheses at random and see how good they are you could still do that but here's something slightly better people who really work in optimization have much cleverer things to do but this one is simple and efficient and easy to understand so we'll use it to start with actually we'll use it most of the time mostly because it's efficient okay so gradient descent ok so fundamentally the idea is maybe we have some function of some parameter values so you might say oh this is some function value f of some input value X you'd think about it that way if we're just like thinking about how to find the minimum of a function and leave it like that and what gradient descent does it's an iterative algorithm it guesses an X to start with not smart just somehow guesses an ax comes up here looks at the derivative so in this case the derivative is pointing this way well the tribute actually is pointing this way right it says Oh to get bigger you should go up so the dur boot is pointing this way so the Dravida is going that way and we're gonna take a step in the negative direction of the derivative which means we're gonna take a step this way get a new X come up here see what the derivative is like it's pointing that way and we did the derivative again is pointing this way we're gonna go on the negative derivative take a step come over here see what it looks like negative derivative is this way take step come up here negative derivative is like this take a step oh look there the derivative seems to be pointing this way all right I'm gonna go this way and back and forth and eventually I'm gonna arrive at an answer so that's the intuitive idea so we we sit at it sit somewhere compute the derivative take step some more computer derivative take a step okay so let me just write it down in one dimension one dimension is awesome so 1 d 1 DG d Wendie gradient descent for this you need an initial theta some function you're trying to optimize the derivative of some function you're trying to optimize and a little epsilon which I'll talk about okay so the F here this is the F theta in it or it could be X in it either way I guess I wrote this code as if it's f of theta it's just a formal parameter it doesn't doesn't matter I mean we're gonna use this technique to optimize loss functions on hypotheses because we're machine learning people but this technique is super useful for minimizing all kinds of functions it's not just about this okay so how does the algorithm go we set our initial theta value to be theta in it and we set some time index to be 0 and we loop and we'll increment the T and we do the important step which is this we set our new theta values to be our old theta values - we're up here talked about ADA in a minute okay this is this is the only important really well there are two important lines in this program but this is the most important one so what is this this says write your finger theta your theta was this value that was moving around right you're trying to find a good theta value so you got your finger on this theta and you say what's the derivative right there at that data so if we're doing this one to mention the derivative move is a scalar right it might be plus 3 or minus 2 or something like that the derivative is a scalar and what we're gonna do is we're gonna take our old theta value and subtract something times the derivative to get a new theta value this thing that looks like an N is called 8 huh actually ada and it's a call it's really a step size parameter it says how big are the steps that I should take in particular how much I'm going to find out this derivative and multiply it by some step size okay so I'm going to do this go around in a loop until I don't want to go around loop anymore and the question is when is that so when do I not want to go around the loop and there's a bunch of ways you can terminate but this is the one that will generalize best to other cases until F of theta at time t minus f of theta at time t minus one is small so epsilon is like a tolerance it's like a small number it's basically the idea is that you're gonna keep doing this thing trying to find the optimum until the values not changing very much and then you're done okay that's the outer rhythm oh and then what do we do we return theta return theta T okay questions about the algorithm yep let's say what can we say about this algorithm right so what does it always work what does it do what does it mean to work right because you might have a function that looks like that and if your initial value is here right if your initial X is here it might be that you'll find your way to here but you might not find your way to here so let's establish a little terminology this is a global minimum this man a mom a good okay global minimum and this is a local minimum right so the the slope is zero and the gradient is is you know positive in each direction I mean if you if you move a little bit this way you go up or you move a little bit this way you go up so that makes a local minimum and you might not be able to get out of here right so what we can promise you is that if your function has a single local minimum we'll go there and in general we can I will threat the conditions down but we can kind of promise that we can get you to a local minimum we cannot promise that we will get you to a global minimum right so the theorem is if F is convex which means basically somehow shaped like this you can read the in more detail in the notes about that then for any desired it's a little bit tricky this theorem accuracy Epsilon there is some step size such that G D will converge to a theta within Epsilon of the optimum okay what's sneaky about this theorem is that there is some ADA okay so let's just kind of the lab actually is gonna have you mess around the step size a lot and get some intuition for what happens but fundamentally the issue is that if you pick ADA to be very small then you take like teeny little steps and then it takes like forever but if you pick eight it to be too big you risk like taking enormous steps and then the whole thing like doesn't converge like it goes nuts so you'll have to play around with this kind of feeling for it but there's a fundamental a fundamental kind of difficulty which is there is no obvious way to know what your ADA should be so it's another thing that you have to kind of play around with so that's that's what that's a little bit of trouble okay we are almost done talking about gradient descent but this isn't this is great in the sense of one dimension they're called derivative this out yep right question is yeah and I've written this down really informally if you want the formal definition post on Piazza I will find you a good one yeah other questions okay this is cool if we have a one dimensional parameter family which is not usual for us right usually our theta is a whole vector of parameters and remember now here when I'm going to talk about theta when I talked about J of theta over there theta is like all our parameters right so if for the linear hypothesis class it's going to be our normal theta vector plus all so the theta naught right so all of our parameters we're going to find values of all of our parameters together that minimize our objective function

Full Transcript

Need a transcript for another video?