Understanding Dimensionality Reduction

In data science we often work with large datasets, and complex data can have a very large number of features (columns). The number of columns can sometimes exceed 1000, and working with such datasets is challenging: analyzing every feature individually and checking how each one correlates with all the others is not feasible, so it is hard to get insight from the data. To solve this problem, we look for a way to represent the same data in fewer columns without losing much information.

What is Dimensionality Reduction?

Dimensionality reduction is a technique for reducing the dimensions of a dataset, where the dimension of a dataset is the number of features (columns) it has. In other words, dimensionality reduction means reducing the number of features (columns) in the dataset while preserving most of the information it originally contained.

Mathematically, it means reducing an MxN matrix (M rows, N columns) to an MxP matrix, where P << N.
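To make the MxN to MxP idea concrete, here is a minimal sketch in Python. It assumes NumPy is available and uses PCA (computed with an SVD) as the reduction method. PCA is just one possible choice and is not something this post has covered yet. The function name reduce_columns and the random data are made up for illustration.

```python
import numpy as np

def reduce_columns(X, p):
    """Project an M x N matrix onto its top-p principal directions (PCA via SVD)."""
    # Center each column so the directions describe variance, not the mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Keep only the first p directions: result has shape (M, p)
    return X_centered @ Vt[:p].T

rng = np.random.default_rng(0)   # fixed seed, illustrative data only
X = rng.normal(size=(200, 50))   # M = 200 rows, N = 50 columns
Z = reduce_columns(X, 3)         # P = 3 columns

print(X.shape, "->", Z.shape)    # (200, 50) -> (200, 3)
```

The number of rows stays the same and only the number of columns shrinks. How much information survives depends on the data: purely random columns like these compress poorly, while real datasets with related features usually do much better.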

The need for Dimensionality Reduction

  • Visualization: One of the most important parts of data analysis is data visualisation. In data visualisation, we plot one feature against others in 2D or 3D space to study the relationship between them: how are they related, what is the correlation between them, and what is the pattern? It also helps in identifying groups or clusters in the data at a very early stage of the analysis. We can't apply this directly to high-dimensional data, because each feature represents one dimension, so 1000 features would mean studying the data in 1000 dimensions, which is not practical when a plot is limited to 3 dimensions. A better way is to reduce the data from 1000 features to 3 and show it on a 3D plot. The plot is still meaningful as long as the reduction has not lost much of the information.

  • Fewer columns means less storage space is required for the data.
  • Redundant columns are removed or combined, which reduces multicollinearity. Techniques such as PCA, for example, produce new features that are uncorrelated with each other.
  • Machine learning algorithms train faster because the data has fewer features.
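The storage and multicollinearity points above can be checked on a small example. This is only a sketch, assuming NumPy and again using PCA via SVD as the reduction method. The data is synthetic: 6 columns deliberately built from just 2 underlying signals, so the columns are highly redundant.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, illustrative data only
signals = rng.normal(size=(1000, 2))           # 2 underlying signals
mix = rng.normal(size=(2, 6))                  # how each column combines them
# 6 redundant columns: linear mixes of the 2 signals plus a little noise
X = signals @ mix + 0.01 * rng.normal(size=(1000, 6))

# PCA via SVD: center, then project onto the top 2 directions
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                              # shape (1000, 2)

# Share of the total variance kept by the 2 new columns
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
# Correlation between the 2 new columns
corr = np.corrcoef(Z, rowvar=False)[0, 1]

print("bytes before:", X.nbytes, "after:", Z.nbytes)   # 48000 vs 16000
print("variance kept:", explained)
print("correlation between new columns:", corr)
```

Here the reduced data takes a third of the memory, keeps almost all of the variance, and its two columns are uncorrelated up to floating-point error. Two columns can also be drawn directly on a 2D scatter plot, which ties back to the visualization point.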
That's all for dimensionality reduction. I will try to introduce some dimensionality reduction techniques in a future post.


