Machine Learning is like a cup of tea: to get the best taste, we need to set the proportions of all the ingredients properly. If there is too much sugar the tea will be overly sweet, and if there are too many tea leaves it will taste bitter. We also need to remember that tea leaves and sugar are not the same unless we put them in a common context in which their attributes can be compared. Similarly, in machine learning we need to bring all the features into the same context, so that no single feature dominates the model simply because of its large magnitude.
Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the inputs, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.
Feature scaling is a crucial pre-processing step before creating a model, as it can make the difference between a weak machine learning model and a strong one.
Two of the most popular methods for feature scaling are:
- Standard Scaling: This method performs standardization. It is a common requirement for many machine learning models, which may behave badly if the individual features do not look more or less like standard normally distributed data. The idea behind StandardScaler is that it transforms your data such that its distribution has a mean of 0 and a standard deviation of 1.
In the case of multivariate data, this is done feature-wise (in other words, independently for each column of the data).
Each value in the dataset has the mean subtracted from it and is then divided by the standard deviation of the whole dataset (or of that feature, in the multivariate case): z = (x − μ) / σ.
Python Code:
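Below is a minimal sketch of standard scaling using scikit-learn's StandardScaler. The dataset values here are made-up illustrative assumptions (two features on very different scales), not data from the original article:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
# (e.g. age in years and income in thousands).
X = np.array([[25.0, 40.0],
              [35.0, 60.0],
              [45.0, 80.0]])

print("Without scaling:")
print(X)

# StandardScaler removes each column's mean and divides by its
# standard deviation, so every feature ends up with mean 0 and std 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("With scaling:")
print(X_scaled)
# Each column is now centered at 0 with unit standard deviation:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
```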
Without scaling, the features keep their original magnitudes; with scaling, every feature is centered at 0 with a standard deviation of 1.
- Normalization: This method rescales features so that they appear on a similar scale across all records and fields, typically mapping each feature to the range [0, 1] via x' = (x − min) / (max − min). In machine learning, features with higher values dominate the learning process, but that does not mean those variables are more important for predicting the outcome of the model. After applying normalization, all variables have a similar weight in the model, improving the stability and performance of the learning algorithm.
Python Code:
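A minimal sketch of min-max normalization using scikit-learn's MinMaxScaler, again on made-up illustrative values (the dataset is an assumption, not from the original article):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The same kind of illustrative data: features on different scales.
X = np.array([[25.0, 40.0],
              [35.0, 60.0],
              [45.0, 80.0]])

print("Without normalization:")
print(X)

# MinMaxScaler maps each column to [0, 1] using
# x' = (x - min) / (max - min), computed per feature.
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

print("With normalization:")
print(X_norm)
# Every feature now lies in [0, 1]:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```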
Without normalization, each feature spans its original range; with normalization, every feature lies between 0 and 1.
Applications (algorithms where feature scaling matters; see the sketch after the list)
- K-Means (cluster assignments depend on Euclidean distances between points)
- K-Nearest-Neighbors (predictions depend on distances to neighboring points)
- Principal Component Analysis (components follow the directions of highest variance, which raw magnitudes distort)
- Gradient Descent (features on similar scales help the optimizer converge faster)
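As a quick illustration of the list above, the sketch below compares k-nearest neighbors accuracy with and without standard scaling. The choice of scikit-learn's wine dataset and the split parameters are assumptions made for demonstration, not from the original article:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine dataset has features on very different scales
# (e.g. proline in the hundreds, other features below 1).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# KNN on raw features: large-magnitude features dominate the distances.
raw_knn = KNeighborsClassifier()
raw_knn.fit(X_train, y_train)
print("Accuracy without scaling:", raw_knn.score(X_test, y_test))

# The same model with standard scaling applied first.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)
print("Accuracy with scaling:", scaled_knn.score(X_test, y_test))
```

On data like this, the scaled pipeline typically scores noticeably higher, since the distance computations are no longer dominated by the largest-magnitude features.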