Data visualization is an important component in any data science application also called as Graphical Exploratory Analysis. As they say, a picture says a thousand words. Similarly, data visualization helps us to understand the data points. In this blog, we will use the office dataset available at Kaggle which shows the ratings of different episodes of The Office series which is currently streamed on Netflix. The dataset can be found at:
All the commands run on the dataset can be found in my GitHub repository located at
https://github.com/rajsandhu1989/data_EDA_visualization.git
Importing the Dataset
Lets now import the dataset into the workspace of VS Code. I am using Juypter Notebook inside the VS code, you can use whatever IDE you may like.
First of all import necessary packages for working of the project:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Import the dataset using pandas library:
office_df=pd.read_csv("the_office_series.csv")
#View the dataset as it is imported
office_df.head()

Data Preprocessing
First examine the dtypes and null values in the dataset using following code:
office_df.info()

We can see that Dtype of Date is not DateTime type which is wrong, we should change it. Secondly, GuestStart column has some null values. After examining the dataset it was found that if any episode does not have any guest star it was kept as NaN value. It would be better to change that to Zero for better visualization and prediction.
Changing the Dtype of Date Column
Pandas provide us feature to change the dtype of any column to datetime. It can also infer the dates if there are some error.
office_df['Date']=pd.to_datetime(office_df['Date'])
print(office_df['Date'].dtypes)
Now the Date column will be converted to DateTime. This is very useful in visualization also.
Removing the Missing Value
office_df=office_df.fillna({'GuestStars':0})
This will insert 0 at the place of NaN in the table. Now all columns are cleaned and ready for making visualization.
Renaming and adjusting the Episode Column
After carefully examining the dataset it was observed that the Episode_no column is not present in the dataset and a not required column named Unnamed: 0 is at first column. So, I decided to change this column to episode number and start this column from 1 rather than 0 as it is in the dataset.
office_df= office_df.rename(columns={"Unnamed: 0":"Episode_no"})
office_df['Episode_no']=office_df["Episode_no"]+1
Dataset Visualizations
Now our dataset doesnot have any missing value and all dtypes are set to desired ones. Lets dive into graphical EDA and see what we have in our dataset.See viewership based on episodes number
Episode Number vs Viewership (Matplotlib)
plt.scatter(office_df['Episode_no'], office_df['Viewership'])
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.show()

Episode Number vs Viewership (Seaborn)
sns.scatterplot(x='Episode_no', y='Viewership', data=office_df)
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.show()

Data Cleaning
Above Figure is good. We can see that viewership declined with time. Plus there is one outlier with 22.5 million viwership. Graphical EDA helps to identify the outliers also. We need to remove this one by identifying it.
#Find the season number where outlier was obsereved
season_no=int(office_df.loc[office_df['Viewership']>15, 'Season'])
#Find the average of rating of that season
season_avg=office_df.loc[office_df['Season']==season_no, 'Viewership'].mean()
#Replace outlier with the average value
office_df.loc[data['Viewership']>15, 'Viewership']=season_avg
The Above code replaces the outlier value with the average viewership of that season. Again create the scatter plot and with the removal of outlier now we can see the trend far better.
plt.scatter(office_df['Episode_no'], office_df['Viewership'])
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.show()

Now its good we can see that viewership increases with time and was stable for a while then suddenly decreases close to the end season. Further, we can see that it got increased in final episode also.
Viewership vs Episode with ratings
Now lets see if viewership affected the average rating of an episode and vice-versa with this plot. As we know that rating is a continous variable in our dataset which can be divided into different categories of ranges. We will create a new column in our dataset with categorical rating of every episode.
#Define the range value
range=[0,7,8,9,10]
#Give the names of labels
names=['0-7','7-8','8-9','9-10']
office_df['rating_cat']=pd.cut(office_df['Ratings'], bins=range, labels=names)
office_df['rating_cat'].head()
Now creates scatter plot using seaborn with this added information.

We can see from the above scatter plot that most of the episods have rating between 8-9 during initial and middle seasons and at the end this rating drop to 7-8 category. However, episodes were evenly distributed among initial and middle season with 9-10 rating category. Final episode have only two final episodes with rating in 9-10 category. Lets divide this graph based on seaons, we can draw a line for average of different seasons.
sns.scatterplot(x='Episode_no', y='Viewership', hue='rating_cat', data=office_df, hue_order=['0-7','7-8','8-9','9-10'])
plt.plot(season_mid, season_avg, marker='D', color='black')
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.savefig('epi_view_rating.png')
plt.show()

Above graph shows that average viewership of The Office season increased in 2nd season and dropped after 7th season.
Viewership and Guest Stars
Lets see if having a guest star helps in getting higher viewerships. Add size factor to the scatter plot created above. Before doing that we need to add a new column which stores if guest start was present or not in the episode.
office_df['has_guest']='NA'
office_df.loc[data['GuestStars']==0, 'has_guest']='No'
office_df.loc[data['GuestStars']!=0, 'has_guest']='Yes'
Now create the scatter graph
sns.scatterplot(x='Episode_no', y='Viewership', hue='rating_cat', data=office_df, size='has_guest', sizes=[25,250])
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.show()

According to the above graph, it is clear that having a guest start does not have any impact on the rating as well as the viewership of the season The Office.
Beautification of Graphs
Till now we were using defaut setting of matplotlib which needs beautification for better visualization. See many beautifications done by following code.
sns.set_style("darkgrid")
sns.set_context("paper", font_scale=1.2)
sns.scatterplot(x='Episode_no', y='Viewership', hue='rating_cat', data=office_df,hue_order=['0-7','7-8','8-9','9-10'], size='has_guest',
sizes=[25,250], style='rating_cat')
# calc the trendline
z = np.polyfit(office_df.Episode_no, office_df.Viewership, 2)
p = np.poly1d(z)
plt.plot(office_df.Episode_no,p(office_df.Episode_no),color="black", label="Trend Line")
plt.plot(office_df.Episode_no, office_df.moving_avg, color='magenta', label="Moving Average")
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.xlabel('Episode Number')
plt.ylabel('Viewership in Millions')
plt.title('Episodes vs Viewerships')
plt.figure(figsize=(10,6))
plt.tight_layout()
plt.savefig('epi_view_guest.png')
plt.show()

Conclusion
This was rather a simple dataset, but we have done preprocessing and cleaning along with G-EDA to understand what data is representing. Furthermore, we have answered few questions which are
- Rating decreases with decrease in the viewership or vice-versa may be true.
- Having guest star does not affected the viewership as well as rating.
Informative article! Gud Work.
LikeLike