When there are too few features in the data, the model is likely to underfit, and when there are too many features, the model is likely to overfit. As the number of features grows, the data also becomes increasingly sparse, making patterns harder to learn; this phenomenon is known as the curse of dimensionality.
Principal component analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction. Dimensionality reduction refers to techniques that reduce the number of input variables in the training data.
The following are some of the applications of principal component analysis (PCA):
- To visualize high-dimensional data.
- To improve classification performance.
- To obtain a compact description of the data.
- To capture as much of the variation in the data as possible.
- To reduce the number of dimensions in the dataset.
- To find patterns in a high-dimensional dataset.
- To remove noise.
There are numerous approaches to dimensionality reduction, but the majority of them fall into one of two categories:
- Feature Extraction
- Feature Elimination
Feature extraction means combining the input variables in a specific way to create new variables, then keeping the most valuable of these and dropping the "least important" ones. The new variables still retain information from all of the original variables, and after PCA each of the "new" variables is uncorrelated with the others.
Feature elimination is exactly what it sounds like: we reduce the number of features by removing some of them outright. Its main advantages are simplicity and the interpretability of the remaining variables; the drawback is that we lose whatever information the dropped variables carried.
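A minimal sketch (using NumPy and scikit-learn, with made-up toy data) illustrating the claim above that the "new" variables produced by PCA are uncorrelated with each other:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: the second feature is strongly correlated with the first
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.5, size=200),
    rng.normal(size=200),
])

# Transform the correlated inputs into principal components
Z = PCA(n_components=3).fit_transform(X)

# The covariance matrix of the transformed data is (numerically) diagonal,
# i.e. the components are uncorrelated
cov = np.cov(Z, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.allclose(off_diag, 0))  # True
```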
When to Use?
Do you want to limit the number of variables but can’t seem to figure out which ones to eliminate entirely?
Do you want to make certain that your variables are unrelated to one another?
Do you mind if your independent variables are less interpretable?
PCA is a useful technique if you answered "yes" to all three questions. If you answered "no" to question 3, PCA should not be used.
PCA procedure steps
The main steps in Principal Component Analysis are listed below.
- Standardize the data.
- Compute the covariance matrix.
- Find the eigenvectors and eigenvalues of the covariance matrix.
- Sort the eigenvectors by eigenvalue and project the scaled data onto the top components.
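The steps above can be sketched from scratch with NumPy (illustrative only, on made-up random data; in practice you would use a library implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))           # toy data: 100 samples, 3 features

# 1. Standardize the data (zero mean, unit variance per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Find the eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric

# 4. Sort by eigenvalue (descending) and project the scaled data
#    onto the top principal components
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]      # keep the top 2 components
X_pca = X_std @ components
print(X_pca.shape)  # (100, 2)
```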
Features Ignored by PCA
- Features that are linearly dependent on (collinear with) other features.
- Constant features (zero variance).
- Constant-but-noisy features (very low variance).
Features Retained by PCA
- Non-collinear features (low covariance with other features).
- Features with high variance.
For the code, you may refer to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
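A short usage example based on the `sklearn.decomposition.PCA` API linked above, applied here to the built-in Iris dataset as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first, as described in the procedure steps above
X_std = StandardScaler().fit_transform(X)

# Reduce the 4 original features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                          # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The `explained_variance_ratio_` attribute is useful for deciding how many components to keep: it reports the fraction of the total variance each component captures.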