Machine Learning | Data Science: Handling Missing Values in a Dataset in Python and R

May 22, 2018
By Pawan Prasad

Data cleaning is one of the first steps you should take to get better insight from your data and better performance from your machine learning models. It is among the most important parts of data preparation. One of the main challenges in data cleaning is handling missing values in the dataset. A missing value is a piece of information that was not available when the data was created, or whose value is unknown. In simple words, it is NULL (NaN in pandas, NA in R).

Here I will explain the different ways of dealing with missing values in datasets.

Ignore observations:

In this approach, we delete the observations (rows) that contain null values, so they are not used in model training or testing. This is done before splitting the data into training and testing sets. This method doesn't affect model training much when the dataset is large and well distributed and the number of records with missing data is small, roughly less than 5% of the dataset. However, when the dataset is not large, it is not an optimal approach, because significant information can be lost when incomplete rows are discarded, and that can hurt our learning models.
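As a rough sketch of the rule of thumb above, the share of incomplete rows can be checked before anything is dropped. The helper name drop_rows_if_few_missing, the max_share parameter, and the toy data are all hypothetical and only for illustration; the 5% figure is the approximate threshold mentioned in the text, not a hard rule.

```python
import numpy as np
import pandas as pd

def drop_rows_if_few_missing(df, max_share=0.05):
    """Drop incomplete rows only when they are a small share of the data."""
    # Share of rows that contain at least one missing value
    incomplete_share = df.isna().any(axis=1).mean()
    if incomplete_share <= max_share:
        return df.dropna(how="any", axis=0), incomplete_share
    # Too many rows would be lost; return the data unchanged
    return df, incomplete_share

# Hypothetical toy data: 1 of 4 rows is incomplete
toy = pd.DataFrame({"age": [25, np.nan, 41, 29], "salary": [50, 62, 71, 55]})
cleaned, share = drop_rows_if_few_missing(toy)
print(share, len(cleaned))
```

Here a quarter of the rows are incomplete, so the sketch leaves the data alone, and another method of handling the missing values would be the next thing to consider.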

If we are working with multivariate analysis and a column has a large number of missing values, it's often better to drop that column from the dataset.

Deleting the rows with missing values in Python:
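One possible way to apply this to columns is to measure the fraction of missing values per column and drop the ones above a cutoff. The function name drop_sparse_columns, the 50% cutoff, and the sample columns are assumptions made for this sketch; the post itself does not prescribe a specific threshold.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # per-column fraction of missing values
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]

# Hypothetical toy data: 'city' is missing in 3 of 4 rows
df = pd.DataFrame({
    "age": [25, np.nan, 41, 29],
    "city": [np.nan, np.nan, np.nan, "Pune"],
    "salary": [50, 62, 71, 55],
})
reduced = drop_sparse_columns(df, max_missing=0.5)
print(list(reduced.columns))
```

The 'age' column survives because only a quarter of its values are missing; those remaining gaps still need to be handled by one of the other methods.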

       
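import pandas as pd

# Load the data; 'Sample.csv' is the example file name used in this post
df = pd.read_csv('Sample.csv', encoding='utf-8')
# Drop every row that contains at least one missing value
df = df.dropna(how='any', axis=0)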

The two dropna parameters used here, as described in the pandas documentation:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Determines whether rows or columns that contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values.
Note: passing a tuple or list to drop on multiple axes was deprecated in version 0.23.0 and is no longer supported in current pandas releases; pass a single axis.

how : {‘any’, ‘all’}, default ‘any’
Determines whether a row or column is removed when it has at least one NA (‘any’) or only when all of its values are NA (‘all’).
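To make the effect of these parameters concrete, here is a small sketch on a made-up DataFrame. The column names a, b, c are illustrative only, and the subset argument is an additional dropna parameter that is not covered in the description above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, np.nan],
})

rows_any = df.dropna(how="any", axis=0)     # keeps only row 0
rows_all = df.dropna(how="all", axis=0)     # drops only row 2, which is entirely NaN
cols_any = df.dropna(how="any", axis=1)     # every column has a NaN, so no columns remain
rows_subset = df.dropna(subset=["b", "c"])  # checks only columns b and c

print(len(rows_any), len(rows_all), cols_any.shape[1], len(rows_subset))
```

The how='all' variant is the conservative choice when only completely empty records should be discarded.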

Deleting the rows with missing values in R:

       
# Keep only the rows that have no missing values
dataset <- dataset[complete.cases(dataset), ]

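Note: This removes every row in which any column is NA (R's marker for a missing value). complete.cases can also be applied to selected columns, so that only rows with missing values in those columns are removed:

# Remove only the rows that have NA in column 2 or column 3
dataset[complete.cases(dataset[ , 2:3]), ]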

Statistical Approach:

There is another way to handle missing data without losing information. The basic idea is that, instead of deleting rows, we substitute each missing value with a sensible estimate. We often use the mean of the column to replace nulls, but the median and mode can also be used.
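A small worked example may help show why the choice between mean, median, and mode matters. The ages below are made-up values with one deliberate outlier, and pandas' fillna is used here only as one simple way to do the substitution.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (92) and one missing value
ages = pd.Series([20, 22, 22, 24, 92, np.nan])

mean_value = ages.mean()          # (20 + 22 + 22 + 24 + 92) / 5 = 36.0
median_value = ages.median()      # middle of the sorted values = 22.0
mode_value = ages.mode().iloc[0]  # most frequent value = 22.0

# Replace the missing entry with the chosen statistic
filled_mean = ages.fillna(mean_value)
filled_median = ages.fillna(median_value)
print(mean_value, median_value, mode_value)
```

With an outlier in the column, the mean is pulled well above the typical value, so the median or mode can be the more sensible replacement in such cases.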

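In Python, the scikit-learn library provides an imputer class for preprocessing that can replace nulls with the mean, median, or mode. In older releases this was Imputer in sklearn.preprocessing; it was deprecated in version 0.20 and removed in 0.22 in favor of SimpleImputer in sklearn.impute.

The strategy parameter is set to 'mean', which means each missing value is replaced by the mean of its column. The same applies to the median ('median') and the mode ('most_frequent').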

Python

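In Python, we use the SimpleImputer class from sklearn.impute. It replaces the Imputer class from sklearn.preprocessing, which was removed in scikit-learn 0.22.

import numpy as np
from sklearn.impute import SimpleImputer

# X is assumed to be a NumPy feature array whose columns 1 to 4 are numeric.
# SimpleImputer always works column by column, so there is no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:5])
X[:, 1:5] = imputer.transform(X[:, 1:5])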

Note: we can use different strategies such as 'mean', 'median', and 'most_frequent' (the mode).
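The following sketch compares the three strategies on a tiny made-up array. It assumes a recent scikit-learn release where SimpleImputer lives in sklearn.impute; the arrays X_known and X_new are hypothetical and stand in for data the imputer learns from and data it is later applied to.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 has known values 2, 2, 5; column 1 has 10, 10, 40
X_known = np.array([[2.0, 10.0],
                    [np.nan, 10.0],
                    [2.0, np.nan],
                    [5.0, 40.0]])
X_new = np.array([[np.nan, np.nan]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(X_known)                              # learn one statistic per column
    results[strategy] = imputer.transform(X_new)[0]   # fill the gaps in new data

for strategy, filled in results.items():
    print(strategy, filled)
```

Because fit and transform are separate steps, the statistics learned from one array can be reused on another, which is the same pattern as in the snippet above.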

R

In R this is done differently, since base R has no built-in imputer class:

# Replace each missing age with the mean of the non-missing ages
emp$age = ifelse(is.na(emp$age), ave(emp$age, FUN = function(x) mean(x, na.rm = TRUE)), emp$age)

Here emp$age refers to the age column of the emp dataset.

I hope you liked this post. Please share your feedback and questions in the comments.
