Data handling is one of the most important concepts in data science. Its techniques help clean an acquired data set: removing outliers where they exist, treating duplicate values, and transforming the data so that it can be used to train a machine learning model and improve the accuracy of its predictions.
Applying these techniques to the challenges described below makes the data set cleaner, more useful, and better suited to model training.
NaN Values:
NaN values are among the most annoying parts of a data set. They mark missing entries: places the data provider left empty, or fields that were never filled in at the source from which the data was collected. How to handle them is case specific, and the data set itself usually suggests the answer. Common options are replacing the missing entries using a measure of central tendency, or removing the affected rows or columns. So, let’s look at these options more specifically:
- Replacing values at NaN entries: First count the NaN values, for example by subtracting the number of non-null values in each column from the number of rows in the data frame. Then, if the data allows it, replace the missing entries with the mean; if not, use the median or the mode. In some data sets a fixed value is the appropriate replacement.
- Removing NaN values: Whether to remove NaN values depends entirely on the situation. If removing them does not noticeably affect the data set, you may drop the affected rows or the whole column.
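The two approaches above can be sketched with pandas roughly as follows. This is a minimal illustration, not a recipe: the data frame, its column names (`age`, `city`), and the choice of mean for the numeric column and mode for the text column are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Count NaN entries per column: number of rows minus non-null count.
nan_counts = len(df) - df.count()

# Replace: fill the numeric column with its mean and the text column
# with its mode (the most frequent value).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Remove: drop every row that contains at least one NaN.
dropped_rows = df.dropna()

# Remove: drop every column that contains at least one NaN.
dropped_cols = df.dropna(axis=1)
```

Which variant is appropriate depends on the data: dropping columns here would discard everything, which is exactly the kind of case where replacing values is the better choice.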
Duplicate Values:
Duplicate values can make predictions less accurate and affect the training of the model, because repeated records bias the data toward whatever they represent. Sometimes it is necessary to remove duplicates and sometimes it is not, depending on whether the repeated rows are errors or genuine repeated observations.
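As a rough sketch of how duplicates can be found and removed with pandas, assuming a small made-up data frame with hypothetical columns `id` and `score`:

```python
import pandas as pd

# Hypothetical data; the second and third rows are identical.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [90, 85, 85, 70],
})

# Boolean mask: True for each row that repeats an earlier row.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# Keep the first occurrence of each row and drop the repeats.
deduped = df.drop_duplicates().reset_index(drop=True)
```

Inspecting `dup_mask` before dropping anything is a cheap way to check whether the duplicates look like errors or like legitimate repeated records.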
Outliers:
Outliers are one of the most interesting challenges faced while cleaning data. They are sometimes informative and at other times harmful to the model's predictions. Depending on the case, outliers are treated either by removing them or by leaving them in the data.
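The text does not name a detection method, so the sketch below assumes one common convention: the interquartile range (IQR) rule, which flags values more than 1.5 × IQR outside the first and third quartiles. The sample values and the 1.5 multiplier are illustrative assumptions, not something prescribed here.

```python
import pandas as pd

# Hypothetical measurements; 95 is far from the rest.
values = pd.Series([10, 12, 11, 13, 12, 95])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles
# is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the flagged values.
without_outliers = values[~is_outlier]

# Option 2: ignore them, i.e. keep `values` unchanged.
```

Flagging first and deciding second keeps both options open, which matches the point that outliers are sometimes worth keeping.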
Type Conversion:
This is also one of the biggest challenges faced while cleaning data. Here we find the columns that are stored with the wrong data type and cast them to the correct one. In Python, this can be done with the ‘astype’ method of the pandas library.
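A minimal sketch of such a conversion, assuming a made-up data frame whose numeric columns (`quantity` and `price`, both hypothetical names) were read in as text:

```python
import pandas as pd

# Hypothetical data: numbers stored as strings.
df = pd.DataFrame({
    "quantity": ["3", "7", "12"],
    "price": ["1.50", "2.25", "10.00"],
})

# Cast each column to the type it should have had.
converted = df.astype({"quantity": "int64", "price": "float64"})

# Arithmetic now works on the converted columns.
total = (converted["quantity"] * converted["price"]).sum()
```

Note that `astype` raises an error if a value cannot be converted (for example a stray non-numeric string), so messy columns may need cleaning before the cast.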
