Removing missing values from your dataset is one of the first tasks any data analyst or data scientist performs, because missing values can hurt the performance of machine learning models. To show how to handle them, I am using a fairly simple dataset, the Hotel Booking dataset, available at the following link on the Kaggle website:
My setup for performing data analytics is as follows:
- VS Code as editor
- Python 3.7
- Jupyter Notebook
- Anaconda installed for managing the packages
Importing the dataset into VS Code
The first step is to import the dataset downloaded from the Kaggle website into our workspace. I am keeping the dataset in the same folder where my notebook is stored, so I can import it by its file name only. If your dataset is at a different location, you may need to provide the full path to it.
The dataset provided on the Kaggle website is in .csv format, and to import it we need the pandas package. Pandas provides the read_csv function to read any CSV file as a DataFrame. Use the following code to import the dataset.
import pandas as pd
hotel_raw=pd.read_csv('hotel_bookings.csv', header=0)
hotel_raw.head()
header=0 is passed as an argument to read_csv because we want the first row of the raw data to be used as the column names. After importing, let's check the head of the DataFrame to confirm that it has been loaded successfully.
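As an extra, optional sanity check (this step is my own addition, not part of the original walkthrough), we can also look at the shape of the DataFrame and the data types pandas inferred for each column:
# Optional: confirm how many rows/columns were loaded and the inferred dtypes
print(hotel_raw.shape)
print(hotel_raw.dtypes)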

Identifying the columns which contain missing values
I have defined a function below which prints the names of the columns that have missing values, the number of missing values in each, and the percentage of missing values.
def missing_value(df):
    # Columns that contain at least one missing value
    col_na = list(df[df.columns[df.isna().any()]])
    # Number of missing values in each of those columns
    missing_numbers = list(df[col_na].isna().sum())
    # Missing values as a percentage of the total number of rows
    ratio_na = list(map(lambda num: (num / len(df)) * 100, missing_numbers))
    final_na = pd.DataFrame({'Column_name': col_na, 'Missing_num': missing_numbers, 'Ratio': ratio_na})
    print(final_na)
When we call this function on our dataset with missing_value(hotel_raw), we get the following output:
  Column_name  Missing_num      Ratio
0    children            4   0.003350
1     country          488   0.408744
2       agent        16340  13.686238
3     company       112593  94.306893
Now we can see that the company column has the highest number of missing values, whereas the children column has the lowest. Let's first tackle the company column and see what we can do.
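Before deciding anything, it can also help to glance at the non-missing values these columns actually hold. The snippet below is just an exploratory sketch of my own, not part of the original analysis:
# A quick look at the most frequent non-missing values in the company and agent columns
print(hotel_raw['company'].value_counts().head())
print(hotel_raw['agent'].value_counts().head())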
Company and Agent Columns
When we carefully analyzed the company and agent columns using the dataset description given on the website, we learned that the company column holds the ID of the company and the agent column holds the ID of the agent that booked the hotel reservation. Many people do not use any company or agent to book their rooms, which is why these columns have a large number of missing values. In other words, the np.nan value here really states that no company (or agent) was used for the booking. The following two actions need to be taken to make these columns cleaner.
- We need to change the data type of these columns to category, because each value is an ID rather than an integer value.
- We need to replace the np.nan value with a better string, say 'Not_using_company' (and similarly 'Not_using_agent'), which clearly indicates that a particular booking was not done through any company or agent.
We can achieve both targets with a single line of code per column, as follows:
import numpy as np  # needed for np.nan

hotel_raw['company'] = hotel_raw['company'].astype('category').replace({np.nan: "Not_using_company"})
hotel_raw['agent'] = hotel_raw['agent'].astype('category').replace({np.nan: "Not_using_agent"})
We have used the astype() method to change the data type of the company and agent columns to category. And when we need to replace one value with another in category-typed columns, we pass a dictionary, so a dictionary mapping the old value to the new one was created for each of the given columns.
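To double-check the result of these replacements, a small verification sketch (my own addition, not part of the original walkthrough) could look like this:
# Inspect the cleaned column: remaining missing values, its dtype, and the labels it now holds
print(hotel_raw['company'].isna().sum())
print(hotel_raw['company'].dtype)
print(hotel_raw['company'].value_counts().head())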
Children Column
The children column has only 4 missing values. When the dataset was carefully analyzed, it turned out that these occur when there were no children with the people who booked the room; in most other rows of this column, that situation is simply recorded as 0. So we can replace the np.nan values with 0, which is a better representation. This can be done with the following code:
hotel_raw.loc[hotel_raw['children'].isna(), 'children']=0
In this case we have used the .loc accessor to locate the rows of the children column whose values are np.nan, and those values were replaced by the integer value 0.
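An equivalent, slightly more compact alternative (not what the article used above, but standard pandas) is fillna:
# Alternative: fill the missing values in the children column with 0
hotel_raw['children'] = hotel_raw['children'].fillna(0)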
Country Column
As we have done for the other columns, we first need to analyse the dataset before taking any decision about the missing values, and we are going to do the same for the country column. When we carefully analysed our dataset, we concluded that most of the values for country are available and only 488 are missing. Here it is a better solution to drop the rows that have a missing value rather than to impute them, so we are going to drop these rows using the drop method as follows.
hotel_raw=hotel_raw.drop(hotel_raw[hotel_raw['country'].isna()].index)
To delete rows with the drop method, we need to provide the index values of those rows. So we first identify the rows that have no value in the country column and then pass their index to the drop method.
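For completeness, pandas also offers dropna, which achieves the same thing in one call; this alternative is my own addition and is not the approach used above:
# Alternative: drop every row whose country value is missing
hotel_raw = hotel_raw.dropna(subset=['country'])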
Conclusion
In this blog we chose a fairly simple data set, the hotel booking data set. We then searched for the columns that have missing values, took these columns one by one, and tried to solve the issue of the missing values in each. Our first instinct was to impute these values so that no row would need to be dropped, but for some columns we had no other choice, so we dropped rows as well. I hope this blog has shown you how to impute as well as drop missing values from a given data set. In the coming blogs, we're going to work on the same dataset to perform some other data analytics processes, such as dimension reduction, exploratory data analysis, and making some good plots.
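As a final sanity check (my own addition), you can rerun the helper function from earlier; after all the steps above, it should no longer report any column with missing values:
# Re-check for missing values after all the cleaning steps
missing_value(hotel_raw)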