MIT6036L03F19b

So we're going to talk about something which in the notes we call linear logistic classifiers, because that's a better name for what it is. But if you go out in the world and read about this topic, it will be called logistic regression, which is a less great name but the one people mostly use. We're going to look at a hypothesis class and a loss function. Often the design of the hypothesis class and the design of the loss function should go together in some way, so we'll look at a pair of these that makes sense together.

Okay, so first of all, instead of inventing a new loss function, let's look at the last loss function we thought much about. When we talked about the perceptron, we thought about zero-one loss. As a reminder, the zero-one loss of a guess and an actual value is 1 if the guess doesn't equal the actual, and 0 otherwise. There's nothing wrong at all, conceptually, with saying: my hypothesis class is going to be linear separators, just as we saw before, with a theta and a theta naught, and I'm going to use this loss function. Never mind the regularizer stuff for a minute; we'll come back to it.

An interesting question to ask would be: can I find a theta and a theta naught that minimize this loss function? Is that something the perceptron can do? Let me ask this question: does the perceptron algorithm minimize zero-one loss on a data set, or under what conditions does it? Okay, it promises to minimize this loss if the minimal loss is zero. That is to say, if the data is linearly separable, the perceptron promises to find you a linear separator. But the perceptron doesn't promise anything, and the averaged perceptron doesn't promise anything, if the data is not linearly separable. So if you have a data set that looks like + + + + + - - - - -, oh, and there's one +
over there, you might be happy to have a separator like that. The perceptron doesn't promise to give you that. That separator would make, I'm sorry, my pluses and minuses are not very clear, this would make total loss 1 on that data set, and that's not such a bad answer, but we can't rely on the perceptron to do that.

Okay, so we could say: all right, since you're so fancy talking about optimization here, maybe we can just set ourselves the problem of minimizing zero-one loss on the data, and find the separator that minimizes the loss. And the answer is that that's a completely well-formed problem. You can ask what separator minimizes this loss function. The problem is that it's computationally very difficult to find. In fact, computer scientists don't know any efficient algorithms for finding the minimizing hypothesis.

One of the things that's interesting is that writing down the J function is the machine learning problem, in a sense. We have to think about our problem, what kind of hypothesis class we want, and so on. Once we write down the loss function, which helps us write down J, then it becomes a computer science problem, an algorithm problem. And it turns out that this problem, minimizing zero-one loss, is NP-hard, so we don't expect there to be an efficient algorithm for solving it. That makes us sad, but that's a kind of fundamental truth of computer science.

Okay, so then what do we do? Well, we try to make the problem easier in a way that gives us computational leverage but doesn't change the actual statement of the problem too much. So now I want to illustrate the idea in one dimension. We're going to spend some time today with one-dimensional classification problems, which are really simple, but they help us think about what's going on. So if I had, okay, we'll do it this way, we
had some data like this in one dimension, then a linear separator is really just some point in between here, and zero-one loss says: either I get it completely right or completely wrong.

Another way we can think about this is that instead of our guess being 0 or 1, we can let our guess be something that's smooth. So that's what we're going to do. You can think of the zero-one loss function as basically: either I have it right, so my loss is 0, or I have it wrong, so my loss is 1. Or you can say: either the output of my classifier is 0, or the output of my classifier is 1. That makes it hard to deal with, so we're going to think about smoothing that thing out.

So here's our new hypothesis class. A linear logistic classifier takes an x, a theta, and a theta naught, and the prediction it makes is going to be this. It used to be that this thing was the sign function. When we talked before about the linear classifiers, when we talked about the perceptron, we said we'll take the sign function. I might as well address this now: the sign function gives you either the value -1 or +1. Our old linear hypotheses said: I'm going to take the sign of this, giving us +1 or -1.

So now, instead of the sign, we're going to use this function sigma. That's the sigmoid function, and sigma of some value z is 1 over 1 plus, I always have to look, e to the minus z. That is, sigma(z) = 1 / (1 + e^(-z)). And that sigmoid function, if this is z, looks like this: it goes between 0 and 1, and it crosses right here at 0.5. So when z is 0, this is 0.5. When z is negative, we get a value between 0 and 0.5; when z is positive, we get a value between 0.5 and 1. So you can interpret it as something kind of like the sign function, but
it's smooth. Okay, so that's sigma, the sigmoid function.

And why are we doing this? The reason is that it gives us a way to talk about, if we're making, let's say, a wrong classification, how wrong it is. Before, when we just had this step function, if we were just barely wrong or totally, desperately wrong, the loss would be the same. And that would mean that an algorithm that's trying to find a good place to put the separator wouldn't have any idea whether making an incremental change would make things better. So what we're going to do is try to find loss functions that are smooth, so that the smoothness will actually help us algorithmically to find a good solution. That's why we're doing what we're doing: not because it gives us a better class of hypotheses, in fact in the end it gives us the same class of hypotheses, but because this loss function gives the algorithm a way to say warmer, warmer, warmer, warmer, so it can search its way to a better solution.

Okay, so that's what we're up to. Any questions about this stuff so far? Oh, the little lambda up there? Positive. I can't leap up there and get that back down. It's just a positive constant. Yeah. Okay, good.

So this is a linear logistic classifier, almost. It's not quite a classifier: it's going to give us a value in the interval (0, 1). It can't actually give you exactly 0 or exactly 1 unless you have infinite parameters, but it gives you something between 0 and 1. In a minute we're going to interpret it as a probability: we can think of it as the probability that x is positive. So the closer to 1 it is, the more sure we are that x should be positive; the closer to 0 it is, the more sure we are that x should be negative. When it's 50/50, we really don't
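
[Editor's sketch] The contrast described in the lecture, a sign-based linear classifier versus a sigmoid-based one, can be tried out in one dimension with a few lines of Python. This is a minimal sketch under stated assumptions, not the course's code: the function names, the data points, and the parameter values theta = 1, theta_0 = 0 are all made up for illustration, and treating an input of exactly 0 as the -1 class is a convention chosen here.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)); maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sign_predict(x, theta, theta_0):
    """Old linear classifier in 1-D: +1 if theta*x + theta_0 > 0, else -1 (assumed convention)."""
    return 1 if theta * x + theta_0 > 0 else -1


def llc_predict(x, theta, theta_0):
    """Linear logistic classifier output in 1-D: sigma(theta*x + theta_0), a value in (0, 1)."""
    return sigmoid(theta * x + theta_0)


def zero_one_loss(guess, actual):
    """1 if the guess differs from the actual label, 0 otherwise."""
    return 0 if guess == actual else 1


# Hypothetical 1-D data set with labels in {-1, +1} (illustrative only).
xs = [-3.0, -2.0, -0.5, 1.0, 2.5]
ys = [-1, -1, -1, 1, 1]
theta, theta_0 = 1.0, 0.0  # hypothetical separator at x = 0

total_loss = sum(
    zero_one_loss(sign_predict(x, theta, theta_0), y) for x, y in zip(xs, ys)
)
print("total zero-one loss:", total_loss)

# The sign output is identical for points barely and far on the same side,
# while the sigmoid output moves smoothly with distance from the separator.
for x in (-4.0, -0.1, 0.1, 4.0):
    print(x, sign_predict(x, theta, theta_0), round(llc_predict(x, theta, theta_0), 3))
```

The loop at the end is the point of the sketch: x = -0.1 and x = -4.0 get the same sign prediction, but different sigmoid outputs, which is the "warmer, warmer" signal a search procedure can use. Note that math.exp raises OverflowError for very large arguments, so a production version would need a numerically safer sigmoid.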
