MIT6036L01j

We're going to talk about two algorithms for finding a linear classifier given a data set. So, given the data set D, which h do we pick? If you think about it this way: once I've specified the big class, script H, which is linear classifiers, what specifies the particular h is really this vector theta and theta naught. That's what specifies the particular hypothesis.

All right, so let's talk about it. I said there were several ways to do this, and we're going to start with "be dumb", my favorite method. So here's an algorithm. This is our first machine learning algorithm, and you're going to implement it in the homework assignment: random linear classifier. It takes in a data set and a k; I'll explain k in a minute. I'll write it down now and then we'll talk about it. I have to put a superscript here, theta superscript j. OK, there we go. So there's our first machine learning algorithm. Let's talk about it.

First of all, what this algorithm takes as input is two things, and already there's something interesting about the two things. This D is a data set; it's a training data set like the kind we talked about already. k is a positive integer, and it's a kind of thing that we call a hyperparameter. So this is cool already, right? Regular old computer programming has parameters, but we have hyperparameters. By hyperparameter, what we mean is that it's something that affects how the machine learning algorithm works. It's a parameter of the machine learning algorithm; it's not a parameter of the hypothesis. We'll see what role k plays here after we talk through the algorithm.

What we're going to do is go around k times, and each time we're going to pick a random theta and theta naught. So we're going to try this one, and that one, and this one. We're going to pick a bunch of crazy separators by using the random number generator.

And then what are we going to do? Partly I like to teach you this algorithm because this is a kind of notation that some people haven't run into, and we use it a lot: arg min. So we generated a bunch of hypotheses. Our job is to do as well as we can at minimizing our error on the training data. So if we want to minimize our error on the training data, and we just generated a bunch of random hypotheses, what we're going to do is compute the training error for each of our hypotheses, and then let j star be the index, the value of j, that makes this as small as possible. So j star is the index of the best of these guys, and that's the one we're going to return.

So how does this work? We generate a bunch of random hypotheses, we see which one scores the best, and we return it. Cool. So what should happen as k gets bigger and bigger? We should get a better answer, and eventually we should hit the optimum, right? So you might imagine a plot: as we increase k, we plot E_n, the training error. If we only try one, it's not going to work very well, but as we try more it's going to work better, and eventually it's not going to work any better. Something like that. So k is a hyperparameter: it's a parameter that governs how well this algorithm is going to work. If we make it really, really big, we'll get a good answer, but if we don't, we might not.

Yeah? Say that again? It depends. This all has to do with the loss function. Notice the loss function comes to the party here twice. The loss function is embedded in our training error; our training error depends on our loss function. So this says: pick the hypothesis that has the lowest average loss on our training data. So it comes in here, playing a role in which hypothesis we pick, and then it comes in here when we draw this plot, because this is what we actually care about, at least for right now.

OK, so there's an algorithm. Machine learning algorithm, certified. Not broken. Maybe not the best thing, but not totally crazy. We're going to ask you to implement this machine learning algorithm this week. But it has the right shape: it has the right kind of inputs, the right kind of outputs, it does the job.

One thing, for those of you who are sticklers for notation, and I know there are a few of you in here, but it lets me say something. I defined error on the hypothesis space, and here I've written error in terms of these parameters. This is just slightly heinous, right? We'd say that the error here is really the error of h. And this lets me bring up another notational conundrum, which is why I don't write it that way. Statisticians use this notation with the little dot in the center; you might run into it. It means the function of this argument that you get by filling in the other parameters like this. Computer scientists have a better notation for this, but only computer scientists know the lambda thing, and not all computer scientists do. Every computer scientist should know the lambda thing; if you don't know the lambda thing, we will teach it to you. But anyway, the question really is: how do I name the hypothesis that I get when I specify a theta and a theta naught? So this is one way to name that function, that hypothesis; this is another way to name it; and here we were slightly lazy and just put the parameters in there.

Yep? OK, that's a great question. Whenever anybody draws a plot it's really easy to nod your head, but it is really good to not nod your head and say "why?" The question was: why would it look like this? It seems like if you have a few points it's easier than if you have a lot of points. Notice what I'm plotting here. I am keeping the same training set the whole time, and I am measuring how good my hypothesis is on that training set. What I'm varying here is this hyperparameter of my algorithm. So you could run this crazy algorithm with k equals one: it generates one hypothesis at random and returns it. Now, if I generate one hypothesis at random and return it, it's going to stink. But if I generate a hundred thousand, the best one might be pretty good. So this is a function of how many I try. Great, good, I'm glad you're looking at that. OK, so that's our not-so-smart but not totally stupid algorithm.
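
The pseudocode written on the board is not captured in the transcript, so here is a rough Python sketch of the random linear classifier as described in the lecture. Several details are assumptions made for illustration only: the function names (predict, training_error, random_linear_classifier), labels being +1 or -1, the loss being 0-1 loss, and the random parameters being drawn from a standard Gaussian. The lecture does not say how theta and theta naught are sampled.

```python
import random


def predict(x, theta, theta_0):
    """Linear classifier: +1 if theta . x + theta_0 > 0, otherwise -1."""
    score = sum(t * xi for t, xi in zip(theta, x)) + theta_0
    return 1 if score > 0 else -1


def training_error(data, theta, theta_0):
    """Average 0-1 loss on the training set (assumed loss function)."""
    mistakes = sum(1 for x, y in data if predict(x, theta, theta_0) != y)
    return mistakes / len(data)


def random_linear_classifier(data, k, rng=None):
    """Try k random (theta, theta_0) pairs; return the lowest-error one."""
    if rng is None:
        rng = random.Random()
    d = len(data[0][0])  # dimension of each input point
    candidates = []
    for _ in range(k):
        # Assumed sampling scheme: independent standard Gaussians.
        theta = [rng.gauss(0.0, 1.0) for _ in range(d)]
        theta_0 = rng.gauss(0.0, 1.0)
        candidates.append((theta, theta_0))
    # arg min: j_star is the index j whose hypothesis has the smallest
    # training error among the k candidates.
    j_star = min(range(k), key=lambda j: training_error(data, *candidates[j]))
    return candidates[j_star]


# Tiny made-up data set of (point, label) pairs, for illustration only.
toy_data = [((1.0, 2.0), 1), ((2.0, 1.0), 1),
            ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
theta, theta_0 = random_linear_classifier(toy_data, k=100)

# The "lambda thing": naming the hypothesis you get by fixing theta, theta_0.
h = lambda x: predict(x, theta, theta_0)
print(training_error(toy_data, theta, theta_0), h((3.0, 3.0)))
```

With a fixed seed, the candidates tried for a small k are a prefix of those tried for a larger k, so the training error of the returned hypothesis can only stay the same or go down as k grows. That matches the shape of the plot described above.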
