Removing Missing Values (Hotel Booking Dataset)

Removing missing values from your dataset is one of the first thing any data analyst or data scientist performs. Machine learning model performance may get affected by missing values in the dataset. To understand how to remove missing values from any dataset, I am taking a rather simple dataset known as Hotel Booking dataset available at following link on Kaggle website:

Hotel booking demand

My setup for performing data analytics is as follows:

  1. Vs Code as editor
  2. Python 3.7
  3. Juypter notebook
  4. Anaconda installed for managing the packages

Importing the dataset into VScode

The first step is to import the dataset downloaded from kaggle website into our workspace. I am keeping the dataset in the same folder where my notebook will be stored. So, I can import the data with its name only. If you have your dataset at differernt location you may need to provide the fill path to the dataset.

We can see that dataset provided by the kaggle website is in .csv format and to import it we need pandas package. Pandas package provides read_csv function to read any csv file as a dataframe. Use the following code to import the dataset.

import pandas as pd
hotel_raw=pd.read_csv('hotel_bookings.csv', header=0)
hotel_raw.head()

header =0 is provided as an argument to read_csv function because we want first row of our raw data as the column names. After importing lets check the head of the dataframe to confirm that it has been loaded successfully.

Identifying the columns which contains missing values

I have defined a function below which prints the name of columns which have missing values, number of missing values and percentage of missing values.

def missing_value(df):
    col_na=list(df[df.columns[df.isna().any()]])
    missing_numbers=list(df[col_na].isna().sum())
    ratio_na=list(map(lambda num: (num/len(df))*100, missing_numbers))
    final_na=pd.DataFrame({'Column_name':col_na,'Missing_num':missing_numbers,'Ratio':ratio_na})
    print(final_na)

when we call this function with our dataset missing_value(hotel_raw) we get the following output:

  Column_name    Missing_num      Ratio
0    children            4     0.003350
1     country          488     0.408744
2       agent        16340    13.686238
3     company       112593    94.306893

Now we can see that company column has highest number of missing values where as children column has lowest number of missing values. Lets first tackle the company column and see what we can do.

Company and Agent column

When we carefully analyzed the company and agent column from the dataset description given on the website, we came to know that company column has the company number and agent column has the agent number who booked the hotel reservation. Many people do not use any specific company or agent to book the rooms so this was the reason that these columns have large number of missing values. Actually the value represented by np.nan is actually states that no company was used for the booking. Following two actions need to be done to make these column cleaner.

  1. We need to change the data type of these columns to category because each value is an ID rather than an integer value.
  2. We need to replace the np.nan value with a better string say 'not_using_company' which clearly indicate that a paticular booking was not done by any company.

We can achive above two targets using single line of code which is as follows:

hotel_raw['company']=hotel_raw['company'].astype('category').replace({np.nan: "Not_using_company"})
hotel_raw['agent']=hotel_raw['agent'].astype('category').replace({np.nan: "Not_using_agent"})

We have used astype() method to change the data type of company and agent column to category. And when we need to replace a value with another value in category data type base columns we need to use dictionary. So a dictionary was created and it was mapped with the values of the given columns.

Children Column

The children column has only 4 missing values which when dataset was carefully analyzed was done when there were no children with the people who booked the room. However, most of the columns have been written 0. What we can do is that we can replace the np.nan value with 0 which is better representation. This can be done by following code:

hotel_raw.loc[hotel_raw['children'].isna(), 'children']=0

In this case we have used .loc method to locate the appropiate row and column of values which are np.nan in children column and thus these values were replaced by 0 integer value.

Country Column

As we have done for other columns, firstly we need to analyse the dataset before taking any decisions regarding the missing values.  We are going to do same for the country column. When we carefully analysed our dataset, we concluded that most of the values related to the country are available and only 488 values are missing. It is a better solution to drop the rows which has missing value rather than imputing these values. So, we are going to drop these rows using drop method as follows.

hotel_raw=hotel_raw.drop(hotel_raw[hotel_raw['country'].isna()].index)

As to delete values using drop method, we need to provide the index value of that row. So, for that first we need to identify the rows which do not have values related to the country column and then we need to detect their index which can be provided to the drop method.

Conclusion

In this blog we have chosen rather a quite simple data set which is hotel booking data set. Then we search for the columns which has some missing values. We took these columns one by one and try to solve the issue of missing values. Our first impression was to impute these values so that no row needs to be dropped but, in some columns, we have no other choice So, we had also done that. I hope in this blog you have learned how to impute as well as drop the missing values from a given data set in the coming blogs, we’re going to work on the same dataset to perform some other data analytics processes like dimension reduction, exploratory data analysis and to make some good plots.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s