Credit: ZeeMELT and Kyoorius
Introduction
TV Serials and family dramas have a special place in every Indian’s heart. Nothing can ever replace the iconic “Dhum Ta Terenana” score that amplifies the tension in the air or the “Saas Bahu” dramatic tropes introduced into the Indian Entertainment Industry by these TV Serials.
From classics like “Kyunki Saas Bhi Kabhi Bahu Thi” and “Sasural Simar Ka” to modern entries like “Shark Tank”, this industry and its culture are ever-evolving and uniquely creative.
It's only fitting, then, that when I found a dataset about Hindi TV Serials, I immediately decided to analyze it and draw some interesting insights from it.
The Dataset
Let us start by looking at the dataset I am going to use for this analysis. Titled “Hindi TV Serials”, it contains almost 800 entries, each with the name of the serial, its cast, its IMDB rating and an overview.
It contains all the TV Serials aired on the following channels from 1988 to the present day (May 2022):
- Sab TV
- Sony TV
- Colors TV
- StarPlus
- Zee TV
Technically, the dataset is distributed as a CSV file (181.76 kB) and has 736 unique values spread over the following columns:
- Name
- Ratings
- genres
- overview
- Year
- Cast
Example Values from the Dataset
Name | Ratings | genres | overview | Year | Cast |
---|---|---|---|---|---|
Kyunki Saas Bhi Kabhi Bahu Thi | 1.6 | "Comedy, Drama, Family" | A mother-in-law's struggle to put up with her three bahu's. The three bahu's have grown up sons. The bahu's sons start to get involved with having girlfriends and the bahu's try and break their relationships up. | 2000–2008 | "Smriti Malhotra-Irani ,Ronit Roy ,Amar Upadhyay ,Sudha Shivpuri" |
Kahaani Ghar Ghar Kii | 2.1 | Drama | "The show explored the worlds of its protagonists Parvati Aggarwal and Om Aggarwal, who live in a joint family where by Parvati is an ideal daughter-in-law of Aggarwal family and Om the ideal son." | 2000–2008 | "Sakshi Tanwar ,Kiran Karmarkar ,Mita Vashisht ,Ali Asgar" |
I will be analyzing the relationships and insights that each of these columns provides once it is properly cleaned and arranged.
Setting up the Environment
I start by importing the necessary modules for this project:
- pandas
- numpy
- matplotlib
Then the dataset is loaded into the environment with the pd.read_csv() function.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dfmain = pd.read_csv("Hindi TV Serials.csv")
The IMDB ratings
The IMDB ratings are going to be very important throughout this analysis as a way to judge the quality and popularity of a TV Show whenever applicable.
But before we dive into how other parameters relate to and affect the IMDB rating of a show, let us look at these ratings on their own.
Top 5 shows by IMDB ratings
We use the sort_values() function to list the shows in descending order of their IMDB ratings.
print(dfmain.sort_values(["Ratings"], ascending=False))
Output:
Name Ratings ... Year Cast
407 Mitegi Laxman Rekha 9.7 ... 2018 Aayesha Vindhara ,Ankita Goraya ,Rajeev Saxena...
242 Shobha Somnath Ki 9.4 ... 2011–2012 Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh M...
79 Love U Zindagi 9.4 ... 2011 NaN
586 Wagle Ki Duniya 9.2 ... 2021– Sumeet Raghavan ,Pariva Pranati ,Sheehan Kapah...
742 Jagannath Aur Purvi Ki Dosti Anokhi 9.2 ... 2022– Rajendra Gupta ,Sushmita Mukherjee ,Ismeet Koh...
.. ... ... ... ... ...
(remaining output omitted due to irrelevancy)
As is clearly discernible, the top 5 shows according to their ratings are:
- Mitegi Laxman Rekha (9.7)
- Shobha Somnath Ki (9.4)
- Love U Zindagi (9.4)
- Wagle Ki Duniya (9.2)
- Jagannath Aur Purvi ki Dosti Anokhi (9.2)
Well, I am not sure I agree with these results, but if you say so, IMDB, if you say so...
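Since only the first few rows matter here, the ranking can also be cut down to size directly. The following is a minimal sketch, not the code used for the output above; `toy` is a hypothetical stand-in for `dfmain`, filled with a handful of names and ratings copied from this article.

```python
import pandas as pd

# Toy stand-in for dfmain; names and ratings are copied from the article.
toy = pd.DataFrame({
    "Name": ["Mitegi Laxman Rekha", "Shobha Somnath Ki", "Love U Zindagi",
             "Wagle Ki Duniya", "C.I.D.", "Kahaani Ghar Ghar Kii"],
    "Ratings": [9.7, 9.4, 9.4, 9.2, 6.8, 2.1],
})

# nlargest keeps only the n rows with the highest values in the given column.
top3 = toy.nlargest(3, "Ratings")[["Name", "Ratings"]]
print(top3)
```

On the real data, `dfmain.nlargest(5, "Ratings")` should give the same five shows without printing the rest of the table.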
The Cast and The Artists
Analyzing the cast column can provide some interesting statistics to look at, but there is a serious problem that limits us from using it to any useful extent.
The problem is the format in which these values are stored in the dataset.
For example, take the value of the "Cast" column in the row for Shobha Somnath Ki:
|Cast|
|---|
|Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora|
This value is troublesome as it is stored as a single <str> type object and thus it is not possible to calculate or discern any data for individual cast members.
Cleaning Data: Solving the Cast Problem
Thankfully, as elaborated by Max Hilsdorf in his Medium blog, the string in each cell can be converted into a list, and subsequently flattened into a one-dimensional Series so that functions like value_counts() and groupby() can work on it.
But his solution does not apply to our problem without extensive modifications, as the values we wish to convert do not have any pre-existing list syntax. Therefore we need to rewrite each cell in the Cast column as a string in list syntax, i.e. ["a","b","c",...].
We can implement this by writing a function that takes input in the format we have, adds the square brackets and the quotation marks, and returns it in the format we need. This is my implementation of such a function:
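def clean_artist_list(list_):
    if type(list_) is str:
        # Wrap the cell in brackets, then quote every comma-separated name
        list_ = "[" + list_ + "]"
        list_ = list_.replace(',', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        # Drop the space left before each closing quote
        list_ = list_.replace(' "', '"')
        return list_
    else:
        # Non-string cells (e.g. NaN) become an empty list
        return "[]"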
This function also handles disruptive data. I mainly encountered some float values (missing cells, which pandas reads as NaN) that threw errors because they could not be treated like strings.
After applying this function, followed by Python's eval() function, we have the required list datatypes.
dfmain["Cast"] = dfmain["Cast"].apply(clean_artist_list)
dfmain["Cast"] = dfmain["Cast"].apply(eval)
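As a side note, the same lists can be produced without building a string for eval(), which matters if a name ever contains a quotation mark or if the file comes from an untrusted source. The following is a minimal sketch of that alternative, not the approach used in this article; `split_names` is a hypothetical helper name.

```python
def split_names(cell):
    """Turn 'A ,B ,C' into ['A', 'B', 'C']; non-strings (e.g. NaN) give []."""
    if not isinstance(cell, str):
        return []
    # Split on commas, trim surrounding whitespace, drop empty pieces.
    return [name.strip() for name in cell.split(",") if name.strip()]

example = "Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora"
print(split_names(example))
print(split_names(float("nan")))
```

It would be applied the same way, as `dfmain["Cast"].apply(split_names)`, and replaces both the cleaning step and the eval() step.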
Before proceeding, we also need a function that flattens these 2D lists to 1D. For that we will use:
def to_1D(series):
    # Flatten a Series of lists into one Series with an element per row
    return pd.Series([x for _list in series for x in _list])
Top Rated Artist
Now that we can use the Cast data properly, let's find out which artist has the best average IMDB rating across the shows they worked in.
df_cast_imdb = dfmain.groupby(to_1D(dfmain["Cast"])).mean(numeric_only=True)  # numeric_only is required on pandas 2.x
print(df_cast_imdb.sort_values(["Ratings"],ascending=False))
Output:
Ratings
Tusharr Khanna 9.2
Sahil Mehta 9.2
Vrajesh Hirjee 9.2
Gautami Kapoor 9.1
Vaidehi Amrute 9.1
... ...
(remaining output omitted due to irrelevancy)
The artist with the best mean IMDB rating is Tusharr Khanna (tied at 9.2 with Sahil Mehta and Vrajesh Hirjee). He has worked in "Pyaar Tune Kia Kya", "Piyaa Albela" and "Bekaboo".
This however does not necessarily reflect any superiority in acting or talent, but it may show (at least to people who believe in it) some signs of luck an artist brings to a set.
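One thing that may be worth double-checking in the groupby() call above: when the key is a Series, pandas aligns it to the DataFrame by index label. The Series returned by to_1D() has a fresh 0, 1, 2, ... index over all cast members, so its labels may not line up with the show each artist belongs to. The sketch below shows an alternative based on DataFrame.explode(), using made-up toy data rather than the real dataset.

```python
import pandas as pd

# Toy data (made up for illustration); "Cast" already holds Python lists.
toy = pd.DataFrame({
    "Name": ["Show A", "Show B", "Show C"],
    "Ratings": [8.0, 6.0, 7.0],
    "Cast": [["Actor X", "Actor Y"], ["Actor X"], ["Actor Z", "Actor Y"]],
})

# explode() gives each cast member their own row and repeats the show's
# other columns, so every name stays attached to the right rating.
per_artist = toy.explode("Cast")

mean_by_artist = (
    per_artist.groupby("Cast")["Ratings"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_artist)
```

Running the equivalent on `dfmain` and comparing it with the output above would show whether the two approaches agree.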
Most Experienced Artist
Now let us move to a more concrete relation: which actor has worked in the most TV shows?
It should be noted that this dataset only lists the leading cast members in the Cast column, so artists with minor roles are not properly recognized in this analysis.
print(to_1D(dfmain["Cast"]).value_counts())
Output:
Ronit Roy 9
Jennifer Winget 8
Seema Kapoor 7
Sangeeta Ghosh 7
Shahab Khan 7
..
(remaining output omitted due to irrelevancy)
Ronit Roy, having worked in 9 shows, comes out as the most experienced artist in this dataset. No wonder I see him in every other serious father type role.
Genre
It's either comedy (the family kind) or drama (also the family kind) with Indian TV Serials. But don't take my word for it, let us see for ourselves the genre dynamics of Indian TV.
Cleaning Data: Genre
The genres column has the same problem we faced above with the cast. The function gets one small addition to handle redundancies caused by whitespace characters.
def clean_genre_list(list_):
    if type(list_) is str:
        list_ = "[" + list_ + "]"
        list_ = list_.replace(',', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        list_ = list_.replace(' "', '"')
        # Remove remaining spaces so " Drama" and "Drama" count as one genre
        list_ = list_.replace(" ","")
        return list_
    else:
        return "[]"
It is then applied in the same way as the Cast solution.
dfmain["genres"] = dfmain["genres"].apply(clean_genre_list)
dfmain["genres"] = dfmain["genres"].apply(eval)
Most Acclaimed Genre
First, let's look at which genre claims the best mean IMDB rating and garners the best audience response.
df_genre_imdb = dfmain.groupby(to_1D(dfmain["genres"])).mean(numeric_only=True)  # numeric_only is required on pandas 2.x
print(df_genre_imdb.sort_values(["Ratings"],ascending=False))
Output:
Ratings
War 6.900000
Horror 6.684211
Adventure 6.680000
Biography 6.650000
Sport 6.500000
Family 6.443478
Crime 6.271429
History 6.162500
Action 5.966667
Comedy 5.961644
(remaining output omitted due to irrelevancy)
Humans do love war, huh.
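A mean on its own hides how many shows stand behind it, and a genre with only a few entries can top the table quite easily. The sketch below, on made-up toy data, reports the number of shows next to the mean; the cut-off of 2 shows is an arbitrary assumption for illustration.

```python
import pandas as pd

# Toy data (made up); "genres" already holds Python lists.
toy = pd.DataFrame({
    "Ratings": [9.0, 6.0, 7.0, 5.0],
    "genres": [["War"], ["Drama", "Family"], ["Drama"], ["Drama", "Comedy"]],
})

# explode() puts each genre on its own row next to its show's rating;
# agg() with keyword arguments names the resulting columns.
genre_stats = (
    toy.explode("genres")
    .groupby("genres")["Ratings"]
    .agg(mean_rating="mean", shows="count")
    .sort_values("mean_rating", ascending=False)
)
print(genre_stats)

# Hypothetical cut-off: only keep genres backed by at least 2 shows.
print(genre_stats[genre_stats["shows"] >= 2])
```

In the toy data "War" leads with a single show, while "Drama" is the only genre that survives the cut-off.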
Bigger Genre
Next, let's look at which genre creators love the most, and thus build the most shows around.
df_genre_count = to_1D(dfmain["genres"]).value_counts()
print(df_genre_count)
df_genre_count.plot(kind = 'bar')
plt.show()
Instead of the text output, a visual representation is more suitable here, so we generate a bar graph using the Series.plot() function.
Output:
So THAT is why Indian households end up being so dramatic...
Release Year
Shows like "Sarabhai vs Sarabhai" were definitely much ahead of their time. But let's look at how time affected the rest of Indian TV.
Cleaning Data: Years
To make use of the data in the Year column, we need to convert it into a form that is less haphazard than the original.
I created two new columns based on the Year column:
- First Year: This column tracks the year in which the show started airing.
- Years Run: This column tracks how long a show ran.
These columns were created with the following code:
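def findstart(list_):
    # Keep the first four characters, i.e. the starting year
    if type(list_) is str:
        list_ = list_[:4]
        return list_
    else:
        return ""

def duration(list_):
    if type(list_) is str:
        # Only closed ranges such as "2000–2008" (9 characters) have a known length
        if len(list_) == 9 and list_[0] != "I":
            l1 = int(list_[:4])
            l2 = int(list_[5:])
            return l2 - l1
        else:
            return 0
    else:
        return 0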
dfmain["First Year"] = dfmain["Year"].apply(findstart)
dfmain["Years Run"] = dfmain["Year"].apply(duration)
The code was made to handle edge cases like wrong datatype and the weird "I XX" values in the Year column.
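The slicing above depends on the exact length of each value. A regular expression is one way to make the parsing less sensitive to the format. The following is a minimal sketch under the assumption that the "I" values look like `"I 2019"` (their exact form is not shown here); `parse_years` is a hypothetical helper name.

```python
import re

def parse_years(value):
    """Return (first_year, years_run) from a Year cell, or (None, 0)."""
    if not isinstance(value, str):
        return None, 0
    # Pull out every 4-digit run, ignoring any prefix such as "I ".
    years = [int(y) for y in re.findall(r"\d{4}", value)]
    if not years:
        return None, 0
    # A single year or an open range ("2021–") has no known end year.
    years_run = years[1] - years[0] if len(years) > 1 else 0
    return years[0], years_run

print(parse_years("2000–2008"))
print(parse_years("2021–"))
print(parse_years("I 2019"))
print(parse_years(float("nan")))
```

Unlike findstart(), this recovers the year from the prefixed values instead of leaving them as oddities to drop later, which is a design choice rather than a requirement.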
Busiest Year
Which year was the busiest for the creators? We can use the following code to visualize the frequency of productions across years.
df_year_count = dfmain["First Year"].value_counts().sort_index()
df_year_count = df_year_count.iloc[:-4] #removing the weird I values
df_year_count.plot(kind = 'bar')
plt.show()
2017 brought us shows like "Naagin 2", "Yeh Rishta Kya Kehlata Hai" and "Yeh Hein Mohabbatein". In total, it records the production of 59 shows, compared to the runner-up, 2018, with 46 shows.
Longest Running Show
Indian shows like "Sasural Simar Ka" and "Kyunki Saas Bhi Kabhi Bahu Thi" are infamous for running long enough to be part of a late teenager's life since birth. So it's only natural to find out which show actually has the longest runtime.
print(dfmain.sort_values(["Years Run"], ascending=False))
Output:
Name Ratings ... First Year Years Run
720 C.I.D. 6.8 ... 1998 20
255 Hum Paanch 8.2 ... 1995 11
536 Yes Boss 8.4 ... 1999 10
0 Kyunki Saas Bhi Kabhi Bahu Thi 1.6 ... 2000 8
1 Kahaani Ghar Ghar Kii 2.1 ... 2000 8
.. ... ... ... ... ...
(remaining output omitted due to irrelevancy)
"C.I.D." is no doubt part of every Indian's life. With iconic characters like ACP Pradyuman, Abhijit, and Daya, and a premise revolving around crime in India, it's not a surprise that it had a runtime of 20 years.
Analyzing the Overviews
Here comes the part I was most excited for. The written descriptions and overviews of these shows could surely provide me some very interesting insights that could have been the highlights of this project.
Unfortunately, after cleaning the data and writing the code to analyze it, it was shocking to see how useless the ordeal was. The data was neither sufficient nor of good enough quality to let me draw any real conclusions from it.
But I will still show the method I used to clean and try analyzing the data.
Cleaning Data: Description
Similar to the approach I took for the other columns, I decided to convert the string values to lists, with every word being an element of the list. Additionally, all words were turned to lowercase and any special characters were removed to minimize redundancy.
def clean_ovw_list(list_):
    if type(list_) is str:
        list_ = "[" + list_ + "]"
        # removing all the special characters
        list_ = list_.replace(',', '')
        list_ = list_.replace('.', '')
        list_ = list_.replace('"', '')
        list_ = list_.replace('(', '')
        list_ = list_.replace(')', '')
        list_ = list_.replace('-', '')
        list_ = list_.replace('»', '')
        # every space becomes a separator between quoted words
        list_ = list_.replace(' ', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        list_ = list_.replace(' "', '"')
        # converting to lower case
        list_ = list_.lower()
        return list_
    else:
        return "[]"
The function was applied:
dfmain["overview"] = dfmain["overview"].apply(clean_ovw_list)
dfmain["overview"] = dfmain["overview"].apply(eval)
Now we have data that we can supposedly work on.
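Splitting on every single space has a side effect: two spaces in a row probably produce an empty word, and removing hyphens glues "mother-in-law" into one unrecognizable token. The following is a minimal sketch of a regex-based tokenizer that avoids both; `tokenize` is a hypothetical helper name and the pattern is just one reasonable choice.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words as a list."""
    if not isinstance(text, str):
        return []
    # Letters, optionally joined by hyphens or apostrophes, form one word;
    # punctuation, digits and repeated spaces never yield empty tokens.
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

sample = "A mother-in-law's  struggle to put up with her three bahu's."
print(tokenize(sample))
```

It could replace both the cleaning function and the eval() step, as `dfmain["overview"].apply(tokenize)`.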
Usage of words over time
I planned to analyze multiple words like "love", "hate", "mother", "mother-in-law", "brother", etc. and their usage over time in the descriptions of TV Serials and even plot graphs showing interesting relations between the trends of different words.
This code gives the count of the words used grouped by years:
df_ovwcount = dfmain.groupby(['First Year',to_1D(dfmain["overview"])]).count().reset_index()
The following code could be used to plot how often words occur over time, and also to contrast different words.
#Selecting and plotting the first word
df_selectedword = df_ovwcount[df_ovwcount["level_1"].isin(["First Word"])]
plt.plot(df_selectedword["First Year"],df_selectedword["overview"])
#Selecting and plotting the second word
df_selectedword = df_ovwcount[df_ovwcount["level_1"].isin(["Second Word"])]
plt.plot(df_selectedword["First Year"],df_selectedword["overview"])
plt.xticks(rotation=90)
plt.show()
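For reference, the year-by-word counts can also be built as a single table, which keeps every word tied to the year of its own show. This is a minimal sketch on made-up toy data, not the code used in this article.

```python
import pandas as pd

# Toy data (made up); "overview" already holds lists of lowercase words.
toy = pd.DataFrame({
    "First Year": ["2000", "2000", "2001"],
    "overview": [["love", "family"], ["love"], ["family", "hate"]],
})

# One row per (show, word), keeping each word next to its show's year.
words = toy.explode("overview")

# Rows = years, columns = words, cells = number of occurrences.
counts = (
    words.groupby(["First Year", "overview"])
    .size()
    .unstack(fill_value=0)
)
print(counts)

# Yearly counts for two words of interest.
print(counts[["love", "family"]])
```

Each column of `counts` is a yearly series for one word, so passing `counts.index` and `counts["love"]` to plt.plot() would draw one line per word.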
A visualization generated through this code (provided better data) could have looked like this:
This data could have led to a lot of other interesting analyses too, but unfortunately that was not possible.
Most Used Word
We can still draw some simple insights from this data. Let us find out the 50 most used words in the descriptions for Indian TV Serials.
df_ovw_count_simple = to_1D(dfmain["overview"]).value_counts()
print(df_ovw_count_simple.head(50))
Output:
1843
a 856
the 848
and 647
of 588
to 394
is 338
her 314
in 302
who 201
with 191
story 185
their 158
his 140
on 129
family 128
love 125
an 125
plot 119
add 118
see 117
full 117
summary 114
for 113
from 111
life 107
she 105
by 103
girl 84
as 79
that 79
two 76
are 73
show 72
they 71
but 71
when 66
young 57
about 57
around 56
this 53
lives 52
it 51
has 49
he 49
married 47
series 47
one 44
other 42
revolves 41
Some significant meaningful words come out to be "family", "love" and "life"... That is some Fast & Furious philosophy it seems.
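Most of the list above is filler words, and tokens like "plot", "add", "see", "full" and "summary" seem to come from IMDb's "Add a Plot" and "See full summary" placeholder text rather than from real descriptions. The sketch below filters both groups before counting. The stop-word list is a small hypothetical sample, and `top_words` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical, deliberately small stop-word list; extend as needed.
STOP_WORDS = {"", "a", "the", "and", "of", "to", "is", "her", "in", "who",
              "with", "their", "his", "on", "an", "for", "from", "she", "by"}
# Tokens that seem to come from IMDb's "Add a Plot" / "See full summary" text.
PLACEHOLDER_WORDS = {"plot", "add", "see", "full", "summary"}

def top_words(token_lists, n=3):
    """Count tokens across all lists, skipping stop words and placeholders."""
    skip = STOP_WORDS | PLACEHOLDER_WORDS
    counter = Counter(
        word for tokens in token_lists for word in tokens if word not in skip
    )
    return counter.most_common(n)

# Toy overviews (made up), already split into lowercase words.
toy_overviews = [
    ["a", "story", "of", "love", "and", "family"],
    ["see", "full", "summary"],
    ["the", "family", "and", "her", "life"],
]
print(top_words(toy_overviews))
```

On the real data it would be called as `top_words(dfmain["overview"], n=50)`, after the overview column has been converted to lists.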
Conclusion
Indian TV is definitely an interesting place to observe and analyze. This project aimed at looking at some of the angles of the vast possibilities that are present with proper datasets.
But the tip of the iceberg that we touched also gave us some interesting results:
- Top 5 Indian TV Shows by IMDB Rating.
- Artists with the best mean IMDB Rating.
- Artists with the most experience.
- Genre with the best mean IMDB Rating.
- Genre with the most available content.
- The release frequency of shows over the years.
- The longest running shows.
- Usage of certain words in the overviews of TV shows over time.
- Most used words in TV Show descriptions.
This project also helped me cement my skills in data analysis, especially learning how to analyze a varied dataset in multi-faceted fashion.
I also gained experience in cleaning data and in handling list-like values in cells so that their elements can be treated individually.
Thank you to everyone who actually stuck with reading till here, it was very fun for me to work on this project.