Machine Learning | Data Science: Handling Missing Values in a Dataset in Python and R
- May 22, 2018
- By Pawan Prasad
Data cleaning is the first step you must take to get better insight from your data and better performance from your machine learning models. It is the most important part of data preparation. One of the main challenges in data cleaning is handling missing values in the dataset. A missing value is a piece of information that was not available when the data was created, or whose value is unknown. In simple words, it is NULL.
Here I will explain the different ways of dealing with missing values in datasets.
Ignore observations:
In this approach, we delete the observations (rows) that contain null values, so they are not used in model training or testing. This is done before splitting the data into training and testing sets. Deleting rows has little impact on model training when the dataset is large and well distributed and the records with missing data make up a very small share of it, roughly less than 5%. However, when the dataset is not large, this is not an optimal approach, because significant information can be lost when incomplete rows are discarded, and that affects our learning models. If we are working on a multivariable analysis and one column has a large number of missing values, it is better to drop that column from the dataset.
Deleting rows with missing values in Python:
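Before deleting anything, it helps to measure how much data is actually missing. The sketch below is only an illustration: the toy DataFrame, its column names, and the 50% cut-off for dropping a column are assumptions, not values from a real dataset. It computes the share of missing values per column and the share of incomplete rows, then applies the rough 5% rule described above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38, 29, 50, 45, 33, 27],
    "salary": [50, 64, 58, np.nan, 72, 61, 90, 85, 66, 55],
    "notes":  [np.nan, np.nan, "a", np.nan, np.nan, np.nan, "b", np.nan, np.nan, np.nan],
})

# Share of missing values in each column.
missing_per_column = df.isna().mean()

# Drop columns where most values are missing (0.5 is an assumed cut-off).
sparse_columns = missing_per_column[missing_per_column > 0.5].index.tolist()
reduced = df.drop(columns=sparse_columns)

# Share of rows that still contain at least one missing value.
incomplete_row_share = reduced.isna().any(axis=1).mean()

# Delete the incomplete rows only if they are a small part of the data.
# In this toy frame 20% of the rows are incomplete, so nothing is dropped.
if incomplete_row_share < 0.05:
    cleaned = reduced.dropna()
else:
    cleaned = reduced
```

On a real dataset the two thresholds are judgment calls; the point is to look at these shares before choosing between deleting rows, dropping a column, or imputing.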
import pandas as pd
df = pd.read_csv('Sample.csv',encoding='utf-8')
df = df.dropna(how='any',axis=0)
The two dropna parameters used above, as described in the pandas documentation:
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Determines whether rows or columns that contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Passing a tuple or list to drop on multiple axes is deprecated since version 0.23.0; pass a single axis instead.
how : {‘any’, ‘all’}, default ‘any’
Determines whether a row or column is removed when it has at least one NA (‘any’) or only when all of its values are NA (‘all’).
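To make the effect of how and axis concrete, here is a minimal sketch on a three-row DataFrame. The data and the variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one NA, row 2 is entirely NA.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, np.nan],
})

rows_any = df.dropna(how="any", axis=0)  # keeps only row 0
rows_all = df.dropna(how="all", axis=0)  # drops only row 2, which is all NA
cols_any = df.dropna(how="any", axis=1)  # every column has an NA, so none remain
```

Note that how='any' with axis=1 can remove every column if each one has even a single gap, which is why column-wise deletion is usually reserved for columns that are mostly empty.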
Deleting rows with missing values in R:
dataset[complete.cases(dataset), ]
Note: This keeps only the rows that have no missing value (NA) in any column. complete.cases can also be applied to selected columns, so that a row is removed only when one of those columns is missing:
dataset[complete.cases(dataset[ , 2:3]),]
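A similar column-restricted deletion is available in pandas through the subset parameter of dropna. The sketch below assumes a hypothetical dataset whose second and third columns are named age and salary; those names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; columns 2 and 3 (R's 1-based 2:3) are "age" and "salary".
dataset = pd.DataFrame({
    "name":   ["Ann", None, "Cid", "Dee"],
    "age":    [25, 32, np.nan, 41],
    "salary": [50.0, 64.0, 58.0, np.nan],
})

# Drop a row only if "age" or "salary" is missing; a missing "name" is ignored.
cleaned = dataset.dropna(subset=["age", "salary"])
```

Here the row with a missing name is kept, because name is not listed in subset.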
Statistical Approach:
There is another way to handle missing data without losing information. The basic idea is that, instead of deleting rows, we substitute each missing value with a sensible estimate. We often use the mean of the column to replace nulls, but the median and the mode can also be used. In Python, the scikit-learn library provides an imputer class (Imputer in older releases, SimpleImputer in sklearn.impute since version 0.20) that can replace nulls with the mean, median, or mode.
When the strategy parameter is set to mean, each missing value is replaced by the mean of its column. The median and the mode work the same way, with the strategies median and most_frequent.
Python
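In Python, we use the SimpleImputer class from sklearn.impute (the older sklearn.preprocessing.Imputer was deprecated in scikit-learn 0.20 and removed in 0.22).
import numpy as np
from sklearn.impute import SimpleImputer
# X is assumed to be a NumPy feature array whose columns 1 to 4 are numeric
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # always column-wise, so no axis argument
imputer = imputer.fit(X[:, 1:5])  # learn the mean of each column
X[:, 1:5] = imputer.transform(X[:, 1:5])  # fill the missing entries
Note: we can use different strategies, such as mean, median, and mode (called most_frequent in scikit-learn).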
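The difference between the strategies is easiest to see on a single column. The following sketch assumes a recent scikit-learn release, where the imputer is available as SimpleImputer in sklearn.impute; the values in X are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature column with a missing last entry (toy values for illustration).
X = np.array([[1.0], [1.0], [2.0], [4.0], [12.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the column statistic and fills the gap in one step.
    filled[strategy] = float(imputer.fit_transform(X)[-1, 0])

# filled maps each strategy to the value used for the missing entry:
# mean gives 4.0, median gives 2.0, most_frequent gives 1.0
```

With a skewed column like this one, the large value 12 pulls the mean upward, so the median or mode may be the more representative replacement.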
R
In R, the approach is different: base R has no imputer class, so we replace the missing values directly.
emp$age = ifelse(is.na(emp$age), ave(emp$age, FUN = function(x) mean(x, na.rm = TRUE)),emp$age)
emp$age refers to the age column in the emp dataset.
I hope you liked this post. Please share your feedback and questions in the comments.