Improve Machine Learning Binary Classification using Boosting by Maulik Dave on July 13, 2020 1,216 views
The theorists have come out and done something very remarkable, in the end, you have to say, wow, these are such powerful ideas. The idea we are looking into is easy to implement, and it’s extremely powerful in what it does, and it’s the essential item in anybody’s repertoire of learning “the machine-learning” and it’s algorithms. So it’s about letting multiple methods work on your behalf. Now we are looking for a herd that can be smarter than the individuals in the herd. The whole work has to do with binary classification.
- Tumor detection (Benign or Malignant)
- Cat or Dog
We will be diving into the binary classification, not about finding the right letter in the alphabet that’s written on the page, that’s 26 way choices. We have a set of classifiers,
- H : produces either +1 or -1
- If Benign gives +1 , otherwise -1
- Error rate : 0 to 1 (in terms of the fraction of the cases got wrong on a sample set)
- Ideally we would like our error rate to be 0, and dead if it is 1.
- Generally it is in the middle, let’s say 0.4, just a little bit better than flipping a coin.
- That’s called WEAK CLASSIFIER
So, the strong classifier indicates the error rate close to 0. What if we get the strong classifier, combining several of these weak classifiers, and get a vote of each one.
- H (x) = sign( h1(x) + h2(x) + h3(x) )
- The big classifier, works on sample X, and produces the results of weak classifiers, h1, h2, and h3, and add them vote with the sign
We are looking for wrapping this stuff over the sample set. We will also take care of how well the model works on the new sample as we need to take care of overfitting issues.
We want to look and see if each of the h producing +1, or -1 , this arrangement, where each of these produces, adding up and taking the sign, is going to give us better results than the tests individually.
Now it looks simple, but never it is, what about the wrong answers, wrong answers produced by h1, h2, or h3. We have to think where two out of the three or all three, get it wrong, is sufficiently big so as to be worse than 1 of the individual tests. We need to look into the algorithm that will help pick the particular weak classifier to plug in. We will use many classifiers and produces it’s error rates, and the one which is having the lowest, will pick, and that’s our first step, described below:
- DATA ————————> h1
- DATA with Exaggeration of h1 errors ————————> h2
- DATA with Exaggeration of h1 != h2 ————————> h3
So, H(x) depends on h1, h2, and h3, if that’s a good idea, and that gives a better answer than any of the individual tests, maybe we can make this a little bit recursive, and h1 is actually not an atomic test. And maybe it’s the vote of three other tests.
- h1 is vote of three other tests like h11, h12, and h13
- Similar for h2, and h3
DECISION TREES
The decision tree stump is a single test. It’s not a complete tree that will divide up the samples into homogeneous groups. It’s just what you can do with one test. So each decision tree stump is a classifier.
Similarly, you can use boosting with any kind of classifier.
Each of the samples has weight incorporated with it, W1, W2, and W3. So the error is just summing up the number of samples that were not correct. And that will be the fraction of samples that you didn’t get right. And that will be the error rate. Error rate is equal to the sum over the things you got wrong in the current step times the weights of those that were wrong. In step one, everything’s got the same weight, it gets the same weight, and if we find a way to change their weights going downstream so as to, for example, highly exaggerated that third sample then W3 will go up relative to W1 and W2. The one thing we want to be sure of is that no matter how we adjust the weights, the sum of the weights over the whole space is equal to 1.
In other words, we want to choose the wrights so they emphasize some of the samples, but we also want to put a constraint on the weights such that all of them are added together summing to one.
H(x) = sign ( h1(x) + h2 + h3 + ….. Other classifiers )
It looks almost like a scoring polynomial. And now we add weights.
At the end, we understood how boosting works in general, there are ways in which we can implement AdaBoost, Gradient Boosting, XGBoost. Boosting algorithms represent a different approach of modeling in machine learning. Making a strong model, using weak models, and also to fix weakness. As you might have a better understanding of it theoretically, you can try applying XGBoost, or any Boosting mechanism from the ML library and improve the prediction results.