Datasets can be categorized into two types: balanced and imbalanced. Having a balanced dataset is important because, when every class is present in roughly equal amounts and none of them dominates, the model gives each class the same weight and prediction accuracy improves. Handling imbalanced data, on the other hand, can be difficult: it skews the results and noticeably reduces prediction accuracy. An imbalanced dataset is one in which a single class dominates the others. Such a dataset creates a bias in the predictions, pushing the model toward the majority class and therefore toward more wrong predictions, lower accuracy, and higher error. So, here are some techniques for detecting and handling the dominant classes.
The first technique is to use evaluation techniques that reveal which class is dominant. Simply counting the labels shows the class distribution, and evaluating the model with per-class metrics such as precision, recall, and F1-score, rather than plain accuracy, exposes how the imbalance affects the predictions. A minimal sketch is shown below.
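The following sketch illustrates this idea with scikit-learn; the toy dataset, the choice of LogisticRegression, and the 95/5 class split are assumptions made purely for illustration.

```python
# Spotting the dominant class: count the labels, then look at per-class
# metrics instead of overall accuracy.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy dataset where class 0 heavily dominates class 1 (an assumption for the demo).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # e.g. Counter({0: 949, 1: 51}) -> class 0 dominates

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Per-class precision/recall/F1 shows how poorly the minority class is predicted,
# even when the headline accuracy looks high.
print(classification_report(y_test, model.predict(X_test)))
```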
The second technique is resampling, which comes in two forms: under-sampling and over-sampling. In under-sampling, as the name suggests, data points are dropped from the class with the larger number of points until it is comparable in size to the minority class, and a new dataset is built from the result for further modelling. In over-sampling, extra samples of the rare class are generated rather than removing the abundant ones; a common way to do this is SMOTE, a technique in which the algorithm finds the k nearest neighbours of each minority point and creates synthetic data points between them. Both approaches are sketched below.
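Here is a rough sketch of both resampling strategies using the imbalanced-learn package, continuing with the toy X and y from the previous snippet (both of which are assumptions for illustration).

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Under-sampling: drop points from the majority class until the classes match.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))  # both classes now have the minority-class count

# Over-sampling with SMOTE: synthesize new minority points by interpolating
# between each minority sample and its k nearest minority neighbours.
X_over, y_over = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y_over))   # both classes now have the majority-class count
```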
Then there is the Balanced Bagging Classifier from imbalanced-learn, which works like scikit-learn's bagging classifier but with extra features for imbalanced data. It provides two special arguments, sampling_strategy and replacement, through which we can decide which classes get resampled inside each bag and whether that resampling is done with replacement. A short sketch follows.
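The sketch below assumes a recent version of imbalanced-learn and reuses the toy X and y from the earlier snippets; the parameter values shown are just one reasonable configuration, not the only one.

```python
from imblearn.ensemble import BalancedBaggingClassifier

clf = BalancedBaggingClassifier(
    sampling_strategy="auto",   # resample each bag so the classes are balanced
    replacement=False,          # draw the resampled points without replacement
    random_state=42,
)
clf.fit(X, y)
print(clf.predict(X[:5]))       # predictions from the balanced ensemble
```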