MIT6036L03F19a

Our plan for today is to keep talking about supervised learning, and keep talking about classification, but to introduce a way of thinking about solving machine learning problems and a way of deriving algorithms. We'll start out just looking at linear separators, like we were doing with the perceptron, but this basic underlying set of ideas will go forward and be the foundation of much more complicated learning methods that we'll look at later.

So fundamentally, the idea here is that instead of trying to intuit cool learning algorithms like the perceptron, we're going to come up with a kind of machinery that will let us derive machine learning algorithms for arbitrarily complicated problems. And the way we're going to think about it is to think about turning a machine learning problem into an optimization problem. Optimization is an important field in applied math, numerical methods, and computer science. It's a way of taking some function and trying to use computational methods to find a minimum or a maximum of that function. So we're going to see if we can turn machine learning problems into optimization problems, and then take advantage of the huge amount of stuff that people already know about how to solve optimization problems. So that's our plan: we're going to do ML as optimization.

Fundamentally, when you do optimization, what you do is write down an objective function. People use different names for it; J is a common one: J(Θ). So J is going to be our objective function, and the Θ here are some parameters. Now, we're going to do a little bit of a notational shift. You might have noticed it when I wrote it in the notes; it's really hard for me to do it font-wise on the board, but I used a capital theta here. When we talk about machine learning as optimization, we'll talk about parameters generically as Θ. That capital theta just means all the parameters
in my problem. So if what we're doing is linear separators, this capital Θ will actually stand for our vector θ and the offset θ₀, if we have one. Later on it might stand for a whole set of weight matrices or something like that in a neural network. So Θ is all the parameters of my problem.

So what I want to do is say: OK, my problem as a machine learning person is that I want to find a set of parameters. We're still looking for a hypothesis in a hypothesis class, and we'll see what the class looks like, but what we want to do is find some parameters of a hypothesis in my class, and we want to find ones that optimize some objective function J. So generally the optimization problem is that we want to find a Θ*, which is the arg min over Θ of J(Θ). That is to say, I want to find the Θ that minimizes J(Θ). That's going to be the thing that we're trying to do.

OK, so this is a general way of thinking about an optimization problem in machine learning. We can set up a bunch of different optimization problems, but by far the most typical kind of objective function that we'll write down for machine learning looks like this: we want to find some parameters Θ that minimize the sum of two terms. So we're going to have a sum over our data points, and I'm going to write this L and come back to it. Oops, did I close this? No, I didn't; one more parenthesis. OK, so this is the general form of the thing that we'll look at.

The first term here is an average over all of our data points, and this L is our loss function. We talked about 0-1 loss before. So this here, h(x_i; Θ), is the prediction we would make if Θ were our parameters. This is our guess of the answer for question x_i, and y_i is what our data set tells us the answer should be. So this is a loss function that takes the guess and the actual value and tells us how sad we are
that we made this guess when that was really the answer. So that's the loss function: how unhappy are we with this guess? That's this whole first part. What's another name for this whole first term? I can wait. What's another name for this term? Yes, the training error, good. OK, I know it's early, and many of you don't normally come, so thank you for coming.

So this is the training error: the sum over our training set of how unhappy we are about the predictions we're making on the training set. And we talked a little bit at the very beginning about how it's all well and good to minimize the training error, but in fact our problem is to predict well not just on the training data but on test data that we haven't seen yet. So at the same time that we're introducing this idea of machine learning as an optimization problem, we're going to introduce this extra term. Over several weeks we'll keep using a term like this and see how it affects the kind of predictions we make, but let me just talk to you about it a little bit.

This term right here is called a regularizer, and generally speaking it's some kind of penalty on the thetas. It's some way of saying: I would like you to pick some thetas that work well on the data, but don't go crazy with that. Don't try your hardest to fit the data exactly. Maybe you should just relax a little bit; maybe it'll work better if you don't try super hard to fit the data. We'll see some examples of that today, we'll see examples next week, and we'll keep seeing examples of what this regularizer might be like.

This λ is a constant, and it's a knob that you can turn. Knobs that you can turn in machine learning are often called hyperparameters; we talked about that before. And λ governs the degree to which you want to try to fit your training data really well versus the
degree to which you might want your hypothesis to be simple. So that's going to be a knob that we'll play with, and we'll again practice ways of setting it. A common way of writing down a regularizer, and the one that we'll pursue... well, I'll come back to the regularizer; we'll just leave it at that for now.

OK, so the cool thing is that if we can turn a machine learning problem into an optimization problem like this, then we can apply general-purpose techniques. So the plan for the lecture today is that I'm going to introduce a loss function and a hypothesis class that give us another algorithm for finding linear separators; that's going to be the first part. And the second part is that we're going to study a simple algorithm for actually optimizing this objective. So that's the plan. And I had a kind of terminological crisis, so I'm going to introduce two names for the same thing, and hopefully you'll see what
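
The general form of the objective described in this lecture, an average loss over the training set plus λ times a regularizer, can be written as J(Θ) = (1/n) Σ_i L(h(x_i; Θ), y_i) + λ R(Θ), with Θ* = arg min over Θ of J(Θ). The following is a minimal sketch of that form, not code from the lecture. The function names (`objective`, `linear_classifier`, `zero_one_loss`, `squared_norm`), the symbol R, the choice of the squared norm as the regularizer, the toy data, and the small candidate set searched by brute force are all assumptions made for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_classifier(x, theta, theta_0):
    """Hypothesis h(x; Theta): +1 if theta . x + theta_0 > 0, else -1."""
    return 1 if dot(theta, x) + theta_0 > 0 else -1

def zero_one_loss(guess, actual):
    """0-1 loss: 0 if the guess matches the actual label, 1 otherwise."""
    return 0.0 if guess == actual else 1.0

def squared_norm(theta):
    """Assumed regularizer R(Theta): squared length of the theta vector."""
    return dot(theta, theta)

def objective(theta, theta_0, X, y, lam,
              loss=zero_one_loss, regularizer=squared_norm):
    """J(Theta) = average training loss + lam * regularizer."""
    n = len(y)
    training_error = sum(
        loss(linear_classifier(X[i], theta, theta_0), y[i]) for i in range(n)
    ) / n
    return training_error + lam * regularizer(theta)

# Hypothetical toy data set: four 2-D points with labels +1 / -1.
X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -1.0), (-2.0, 0.5)]
y = [1, 1, -1, -1]

# Brute-force "arg min" over a tiny, hand-picked set of candidate thetas
# (theta_0 fixed at 0). This is a stand-in for a real optimizer.
candidates = [(1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
lam = 0.1
best = min(candidates, key=lambda th: objective(th, 0.0, X, y, lam))
print("best theta among candidates:", best)
```

With λ set to 0 the objective reduces to the training error alone, so any candidate that separates the toy data scores equally well. Raising λ makes the penalty on the thetas count for more, which is the "knob" behavior described above.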
