Hello my dear Kagglers. As you all know, I love Kaggle and its community. I spend most of my time surfing my Kaggle feed, scrolling through discussion forums, and appreciating the effort put in by various Kagglers through their unique and interesting ways of storytelling.
So, this morning, while following my usual Kaggle routine, I came across this dataset named Chai Time Data Science | CTDS.Show, provided by Mr. Vopani and Mr. Sanyam Bhutani.
At first glance, I was like: what's this? How do they know I'm having a tea break?
Oh no buddy! I was wrong. It's CTDS.Show :)
Chai Time Data Science show is a Podcast + Video + Blog based show for interviews with Practitioners, Kagglers & Researchers and all things Data Science.
CTDS.Show, driven by the community under the supervision of Mr. Sanyam Bhutani, is going to celebrate its one-year anniversary on 21st June, 2020, and to mark this achievement they decided to run a Kaggle contest around the dataset covering all of the 75+ ML Hero interviews on the series.
According to our Host, the competition is aimed at articulating insights from the interviews with ML Heroes. Provided a dataset consisting of detailed stats and transcripts of CTDS.Show, the goal is to use these to come up with interesting insights or stories based on the 100+ interviews with ML Heroes.
We have our Dataset containing :
Description.csv : This file consists of the description texts from YouTube and Audio
Episodes.csv : This file contains the statistics of all the Episodes of the Chai Time Data Science show.
YouTube Thumbnail Types.csv : This file consists of the description of the YouTube thumbnail types used for the episodes
Anchor Thumbnail Types.csv : This file consists of the description of the Anchor/Audio thumbnail types used for the episodes
Raw Subtitles : Directory containing 74 text files having raw subtitles of all the episodes
Cleaned Subtitles : Directory containing cleaned subtitles (in CSV format)
Hmm.. Seems we have some stories to talk about..
Congratulations to CTDS.Show on their one-year anniversary. Let's get it started :)
By the way, this is going to be a long kernel. So, hey! Looking for a guide :) ?
0. Importing Necessary Libraries
1. A Closer look to our Dataset
1.1. Exploring YouTube Thumbnail Types.csv
1.2. Exploring Anchor Thumbnail Types.csv
1.3. Exploring Description.csv
1.4. Exploring Episodes.csv
1.4.1. Missing Values ?
1.4.2. M0-M8 Episodes
1.4.3. Solving the Mystery of Missing Values
1.4.4. Is it a Gender Biased Show?
1.4.5. Time for a Chai Break
1.4.6. How to get More Audience?
1.4.7. Youtube Favors CTDS?
1.4.8. Do Thumbnails really matter?
1.4.9. How much Viewers wanna watch?
1.4.10. Performance on Other Platforms
1.4.11. Distribution of Heroes by Country and Nationality
1.4.12. Any Relation between Release Dates of Episodes?
1.4.13. Do I know about the Release of the anniversary interview episode?
1.5. Exploring Raw / Cleaned Subtitles
1.5.1. A Small Shoutout to Ramshankar Yadhunath
1.5.2. Intro is Bad for CTDS ?
1.5.3. Who Speaks More ?
1.5.4. Frequency of Questions Per Episode
1.5.5. Favourite Text ?
import os
import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
!pip install pywaffle
from pywaffle import Waffle
from bokeh.layouts import column, row
from bokeh.models.tools import HoverTool
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, output_notebook, show
output_notebook()
from IPython.display import IFrame
pd.set_option('display.max_columns', None)
Let's dive into each and every aspect of our dataset step by step, in order to get every inch out of it...
As per our knowledge, this file consists of the description of the YouTube thumbnail types used for the episodes. Let's explore more about it...
YouTube_df=pd.read_csv("../input/chai-time-data-science/YouTube Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(YouTube_df.shape[0], YouTube_df.shape[1]))
YouTube_df.head()
So, basically CTDS uses 4 types of thumbnails in their YouTube videos. It's 2020 and people still use the YouTube default image as a thumbnail!
Hmm... a smart decision or just a blind arrow? We'll figure it out in our further analysis ...
So, this file consists of the description of the Anchor/Audio thumbnail types
Anchor_df=pd.read_csv("../input/chai-time-data-science/Anchor Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Anchor_df.shape[0], Anchor_df.shape[1]))
Anchor_df.head()
It's quite similar to YouTube Thumbnail Types.
If you are wondering what Anchor is: it's a free platform for podcast creation.
IFrame('https://anchor.fm/chaitimedatascience', width=800, height=450)
This file consists of the description texts from YouTube and Audio
Des_df=pd.read_csv("../input/chai-time-data-science/Description.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Des_df.shape[0], Des_df.shape[1]))
Des_df.head()
So, we have a description for every episode. Let's have a closer look at what we have here.
def show_description(specific_id=None, top_des=None):
    if specific_id is not None:
        print(Des_df[Des_df.episode_id==specific_id].description.tolist()[0])
    if top_des is not None:
        for each_des in range(top_des):
            print(Des_df.description.tolist()[each_des])
            print("-"*100)
⚒️ About the Function :
In order to explore our Descriptions, I just wrote a small script. It has two options:
show_description("E1")
show_description(top_des=3)
🧠 My Conclusion:
This file contains the statistics of all the Episodes of the Chai Time Data Science show.
Okay ! So, it's the big boy itself ..
Episode_df=pd.read_csv("../input/chai-time-data-science/Episodes.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Episode_df.shape[0], Episode_df.shape[1]))
Episode_df.head()
Whew! That's a lot of features.
I'm sure we're going to uncover some interesting insights from this metadata.
If you've reached this far, then please bear with me for a couple of minutes more..
Now, we're going to take a big sip of our "Chai" :)
Before diving into our analysis, let's check for missing values in our CSV..
For this purpose, I'm going to use a library named missingno.
Just use this line :
import missingno as msno
missingno helps us deal with missing values in a dataset with the help of visualisations. With over 2k stars on GitHub, this library is already very popular.
msno.matrix(Episode_df)
Aah shit ! Here We go again..
📌 Observations :
There is also a sparkline chart on the right side of the plot. It summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
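To make that sparkline concrete, the per-row completeness it summarizes can be computed directly with pandas. A minimal sketch on a toy DataFrame (not our actual Episodes.csv):

```python
import pandas as pd

# Toy frame with holes, standing in for Episode_df
df = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": [None, None, 3, 4],
    "c": [1, 2, 3, None],
})

# Non-null count per row: exactly what missingno's sparkline tracks
row_completeness = df.notnull().sum(axis=1)
print(row_completeness.min(), row_completeness.max())  # → 1 3
```

The row with the minimum count (here, row 1 with a single non-null value) is the maximum-nullity row the sparkline points at.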
Well, before making any false claims, let's explore it a bit more..
temp=Episode_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]
Source=ColumnDataSource(temp)
tooltips = [
("Feature Name", "@Name"),
("No of Missing entites", "@Count")
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.major_label_orientation = np.pi / 8
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
show(fig1)
📌 Observations :
🧠 My Conclusion:
Let's find out..
Episode_df[Episode_df.heroes.isnull()]
💭 Interesting..
But What are these M0-M8 episodes .. ?
🧠 My Conclusion:
temp=[id for id in Episode_df.episode_id if id.startswith('M')]
fastai_df=Episode_df[Episode_df.episode_id.isin(temp)]
Episode_df=Episode_df[~Episode_df.episode_id.isin(temp)]
Also, ignoring "E0" and "E69" for now ...
dummy_df=Episode_df[(Episode_df.episode_id!="E0") & (Episode_df.episode_id!="E69")]
msno.matrix(dummy_df)
temp=dummy_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]
Source=ColumnDataSource(temp)
tooltips = [
("Feature Name", "@Name"),
("No of Missing entites", "@Count")
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.major_label_orientation = np.pi / 4
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
show(fig1)
Now, that's much better..
But we still have a lot of missing values.
parent = []
names = []
values = []
temp = dummy_df.groupby(["category"]).heroes_gender.value_counts()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])
df1 = pd.DataFrame(dict(names=names, parents=parent, values=values))

parent = []
names = []
values = []
temp = dummy_df.groupby(["category", "heroes_gender"]).heroes_kaggle_username.count()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])
df2 = pd.DataFrame(dict(names=names, parents=parent, values=values))
fig = px.sunburst(df1, path=['names', 'parents'], values='values', color='parents',hover_data=["names"], title="Heroes associated with Categories")
fig.update_traces(
textinfo='percent entry+label',
hovertemplate = "Industry:%{label}: <br>Count: %{value}"
)
fig.show()
fig = px.sunburst(df2, path=['names', 'parents'], values='values', color='parents', title="Heroes associated with Categories having Kaggle Account")
fig.update_traces(
textinfo='percent entry+label',
hovertemplate = "Industry:%{label}: <br>Count: %{value}"
)
fig.show()
📌 Observations :
Because of this Kaggle platform, I now have approximately a 42% chance of becoming a CTDS Hero :) ...
Ahem ahem... Focus, RsTaK, focus.. Let's get back to our work.
Wait? Guess I missed something.. What's that gender ratio?
gender = Episode_df.heroes_gender.value_counts()
fig = plt.figure(
FigureClass=Waffle,
rows=5,
columns=12,
values=gender,
colors = ('#20639B', '#ED553B'),
title={'label': 'Gender Distribution', 'loc': 'left'},
labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.4), 'ncol': len(gender), 'framealpha': 0},
font_size=30,
icons = 'child',
figsize=(12, 5),
icon_legend=True
)
Jokes apart, we can't make any strong statement from this.
But yeah, I'm hoping for more female Heroes :D
🧠 My Conclusion:
I won't talk much about the relation of gender with other features because:
So, we cannot conclude any relation with other features.
Even if we somehow observed a positive result for female Heroes, I would say it would just be a coincidence. There are factors other than gender that may have produced such a result.
With such an imbalanced and small sample size for female Heroes, we cannot make any strong statement here.
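To put a number on that uncertainty: with only a handful of female-hero episodes out of roughly 70 interviews, even a simple binomial confidence interval on the proportion comes out very wide. A sketch with illustrative counts (7 out of 70 is an assumption for the example, not the dataset's exact figure):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: 7 female heroes across 70 episodes
lo, hi = wilson_interval(7, 70)
print("{:.3f} - {:.3f}".format(lo, hi))  # roughly 0.05 - 0.19: far too wide for strong claims
```

An interval spanning 5% to 19% is exactly why no strong gender-related conclusion survives this sample size.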
dummy_df[dummy_df.apple_listeners.isnull()]
📌 Observations :
Following our analysis, We realized :
With this, We have solved all the mysteries related to the Missing Data. Now we can finally explore other aspects of this CSV
But before that..
While having a sip of my Chai (tea), I'm just curious: why is this show named "Chai Time Data Science"?
Well, I don't have a solid answer for this, but maybe it's just because our Host loves Chai? Hmmm.. So you want to say our Host is a more hardcore Chai lover than me?
Hey! Hold my Chai..
fig = go.Figure([go.Pie(labels=Episode_df.flavour_of_tea.value_counts().index.to_list(),values=Episode_df.flavour_of_tea.value_counts().values,hovertemplate = '<br>Type: %{label}</br>Count: %{value}<br>Popularity: %{percent}</br>', name = '')])
fig.update_layout(title_text="What Host drinks everytime ?", template="plotly_white", title_x=0.45, title_y = 1)
fig.data[0].marker.line.color = 'rgb(255, 255, 255)'
fig.data[0].marker.line.width = 2
fig.update_traces(hole=.4,)
fig.show()
📌 Observations :
Masala Chai (count=16) and Ginger Chai (count=16) seem to be the favourite Chais of our Host, followed by Herbal Tea (count=11) and Sulemani Chai (count=11).
Also, our Host seems to be quite experimental with Chai. He has a variety of flavours up his sleeve.
Oh man! This time you win. You're a real Chai lover.
Now, One Question arises..❓
So, does the Host drinking a specific Chai at a specific time have any relation with other factors or the success of CTDS?
🧠 My Conclusion:
Well, as a reward for your victory in that Chai Lover Challenge, I'll try to assist CTDS on how to get more audience 😄
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)
fastai_df.release_date = pd.to_datetime(fastai_df.release_date)
Source2 = ColumnDataSource(fastai_df)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("CTR", "@youtube_ctr"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Subscriber Gain", "@youtube_subscribers"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "CTR Per Episode")
fig1.line("release_date", "youtube_ctr", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="youtube_ctr")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_ctr", alpha=0.2, fill_color='#55FF88', legend_label="youtube_ctr")
fig1.line("release_date", Episode_df.youtube_ctr.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Youtube CTR Mean : {:.3f}".format(Episode_df.youtube_ctr.mean()))
fig1.circle(x="release_date", y="youtube_ctr", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Click Per Impression"
fig1.grid.grid_line_color="#feffff"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"
fig2.grid.grid_line_color="#feffff"
show(column(fig1, fig2))
📌 Observations :
From the Graphs, We can see :
🤖 M0-M8 Series :
💭 Interestingly..,
But Why ❓
There's a huge possibility of such cases. But in conclusion, we can say a high CTR reflects cases like:
📃 I don't know how the YouTube algorithm works. But for the sake of answering the exceptional case of E19, my hypothetical answer would be:
Okay, what about the organic reach of the channel, or the reach via Heroes?
Source = ColumnDataSource(Episode_df)
Source2 = ColumnDataSource(fastai_df)
tooltips = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Impression Views", "@youtube_impression_views"),
("Non Impression Views", "@youtube_nonimpression_views"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Subscriber Gain", "@youtube_subscribers"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Impression-Non Impression Views Per Episode")
fig1.line("release_date", "youtube_impression_views", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Impression Views")
fig1.line("release_date", "youtube_nonimpression_views", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Non Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_impression_views", alpha=0.2, fill_color='#55FF88', legend_label="Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_nonimpression_views", alpha=0.2, fill_color='#e09d53', legend_label="Non Impression Views")
fig1.circle(x="release_date", y="youtube_impression_views", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series Impression Views")
fig1.circle(x="release_date", y="youtube_nonimpression_views", source = Source2, color = "#2d3328", alpha = 0.8, legend_label="M0-M8 Series Non Impression Views")
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Total Views"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"
show(column(fig1, fig2))
📌 Observations :
data1={
"Youtube Impressions":Episode_df.youtube_impressions.sum(),
"Youtube Impression Views": Episode_df.youtube_impression_views.sum(),
"Youtube NonImpression Views" : Episode_df.youtube_nonimpression_views.sum()
}
text=("Youtube Impressions","Youtube Impression Views","Youtube NonImpression Views")
fig = go.Figure(go.Funnelarea(
textinfo= "text+value",
text =list(data1.keys()),
values = list(data1.values()),
title = {"position": "top center", "text": "Youtube and Views"},
name = '', showlegend=False,customdata=['Video Thumbnail shown to Someone', 'Views From Youtube Impressions', 'Views without Youtube Impressions'], hovertemplate = '%{customdata} <br>Count: %{value}</br>'
))
fig.show()
📌 Observations :
A few things to note here :
It seems clear that the YouTube thumbnail and video title are the important factors in deciding whether a person will click on the video or not.
Wait, you want some figures?
colors = ["red", "olive", "darkred", "goldenrod"]
index={
0:"YouTube default image",
1:"YouTube default image with custom annotation",
2:"Mini Series: Custom Image with annotations",
3:"Custom image with CTDS branding, Title and Tags"
}
p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS CTR")
base, lower, upper = [], [], []
for each_thumbnail_ref in index:
    # M0-M8 mini-series episodes live in fastai_df; everything else in Episode_df
    if each_thumbnail_ref == 2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type == each_thumbnail_ref].youtube_ctr
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type == each_thumbnail_ref].youtube_ctr
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)
    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean CTR for Thumbnail Type {} : {:.3f}".format(index[each_thumbnail_ref], temp.mean()))
source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
p.add_layout(
    Whisker(source=source_error, base="base", lower="lower", upper="upper")
)
show(p)
📌 Observations :
From the above whisker plot:
In short, don't use the default YouTube image as a thumbnail.
In order to get a significant insight, I'll calculate the percentage of each Episode watched..
a = Episode_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]].copy()
a["percentage"] = (a.youtube_avg_watch_duration / a.episode_duration) * 100
b = fastai_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]].copy()
b["percentage"] = (b.youtube_avg_watch_duration / b.episode_duration) * 100
temp = pd.concat([a, b]).reset_index(drop=True)
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Duration", "@episode_duration"),
("Youtube Avg Watch_duration Views", "@youtube_avg_watch_duration"),
("Percentage of video watched", "@percentage"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_range = temp["episode_id"].values, title = "Percentage of Episode Watched")
fig1.line("episode_id", "percentage", source = Source, color = "#03c2fc", alpha = 0.8)
fig1.line("episode_id", temp.percentage.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Mean : {:.3f}".format(temp.percentage.mean()))
fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Percentage"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
📌 Observations :
How does this make sense ❓
But why did such outliers occur ❓
In this fast-moving world, humans get bored of things very easily. E0 and the M series, having low episode durations, made viewers watch a larger share of them.
Whether they'll subscribe to the channel or not is a different thing. That depends on the content.
In order to give more to viewers and the community, shorter episodes could be a big step.
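One way to sanity-check the "shorter episodes get watched more fully" hunch is a plain correlation between duration and percentage watched. A sketch on made-up numbers shaped like the hypothesis; on the real data you would call the same `.corr()` on `temp[["episode_duration", "percentage"]]`:

```python
import pandas as pd

# Made-up durations (seconds) and watch percentages, illustrative only
toy = pd.DataFrame({
    "episode_duration": [600, 1200, 2400, 3600, 5400],
    "percentage":       [55.0, 40.0, 22.0, 15.0, 10.0],
})

r = toy["episode_duration"].corr(toy["percentage"])  # Pearson r
print(round(r, 2))  # strongly negative: longer episode, smaller share watched
```

A strongly negative `r` on the real data would back up the "shorter episodes" recommendation; a weak one would mean the outliers are doing all the work.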
colors = ["red", "olive", "darkred", "goldenrod"]
index={
0:"YouTube default playlist image",
1:"CTDS Branding",
2:"Mini Series: Custom Image with annotations",
3:"Custom image with CTDS branding, Title and Tags"
}
p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS Anchor Plays")
base, lower, upper = [], [], []
for each_thumbnail_ref in index:
    # M0-M8 mini-series episodes live in fastai_df; everything else in Episode_df
    if each_thumbnail_ref == 2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type == each_thumbnail_ref].anchor_plays
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type == each_thumbnail_ref].anchor_plays
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)
    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean Anchor Plays for Thumbnail Type {} : {:.3f}".format(index[each_thumbnail_ref], temp.mean()))
source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
p.add_layout(
    Whisker(source=source_error, base="base", lower="lower", upper="upper")
)
show(p)
📌 Observations :
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Anchor Plays", "@anchor_plays"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Spotify Starts Plays", "@spotify_starts"),
("Spotify Streams", "@spotify_streams"),
("Spotify Listeners", "@spotify_listeners"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Anchor Plays Per Episode")
fig1.line("release_date", "anchor_plays", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Anchor Plays")
fig1.line("release_date", Episode_df.anchor_plays.mean(), source = Source, color = "#f2a652", alpha = 0.8, line_dash="dashed", legend_label="Anchor Plays Mean : {:.3f}".format(Episode_df.anchor_plays.mean()))
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Anchor Plays"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Performance on Spotify Per Episode")
fig2.line("release_date", "spotify_starts", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Spotify Starts Plays")
fig2.line("release_date", "spotify_streams", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Spotify Streams")
fig2.line("release_date", "spotify_listeners", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Spotify Listeners")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Total Plays"
show(column(fig1,fig2))
temp=Episode_df.groupby(["heroes_location", "heroes"])["heroes_nationality"].value_counts()
parent=[]
names =[]
values=[]
heroes=[]
for k in temp.index:
    parent.append(k[0])
    heroes.append(k[1])
    names.append(k[2])
    values.append(temp.loc[k])
df = pd.DataFrame(
dict(names=names, parents=parent,values=values, heroes=heroes))
df["World"] = "World"
fig = px.treemap(
df,
path=['World', 'parents','names','heroes'], values='values',color='parents')
fig.update_layout(
width=1000,
height=700,
title_text="Distribution of Heroes by Country and Nationality")
fig.show()
a = Episode_df.release_date
b = (a - a.shift(periods=1, fill_value=pd.Timestamp("2019-07-21"))).dt.days
d = {'episode_id':Episode_df.episode_id, 'heroes':Episode_df.heroes, 'release_date': Episode_df.release_date, 'day_difference': b}
temp = pd.DataFrame(d)
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Day Difference", "@day_difference"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_axis_type = "datetime", title = "Day difference between Each Release Date")
fig1.line("release_date", "day_difference", source = Source, color = "#03c2fc", alpha = 0.8)
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Date"
fig1.yaxis.axis_label = "No of Days"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
📌 Observations :
Though, I visited his YouTube channel and manually examined his release pattern:
Episode Id | Release | Day Difference |
---|---|---|
E75 | 2020-06-18 | 4 |
E76 | 2020-06-21 | 3 |
E77 | 2020-06-28 | 7 |
E78 | 2020-07-02 | 4 |
E79 | 2020-07-09 | 7 |
E80 | 2020-07-12 | 3 |
Maybe he's experimenting with a new pattern.
Can we pin-point when the 1-year anniversary interview episode will release ❓ Actually, no!
Though a small pattern can be observed in the release dates, he has a bit of an odd recording pattern:
As per his release pattern, he's been releasing his episodes after 3 or 4 days.
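That cadence can also be read off programmatically. A quick sketch using the release dates from the table above:

```python
import pandas as pd

# Release dates of E75-E80, taken from the table above
dates = pd.to_datetime([
    "2020-06-18", "2020-06-21", "2020-06-28",
    "2020-07-02", "2020-07-09", "2020-07-12",
])

# Day gaps between consecutive releases
gaps = pd.Series(dates).diff().dt.days.dropna().astype(int)
print(gaps.tolist())  # → [3, 7, 4, 7, 3]
```

Mostly 3-4 day gaps with the occasional 7, which matches the "after 3 or 4 days" reading, with some weeks skipped.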
If I'm correct, then Mr. Sanyam Bhutani, please don't forget to give a small shoutout to me 😄
So, We have 2 directories here :
def show_script(id):
    return pd.read_csv("../input/chai-time-data-science/Cleaned Subtitles/{}.csv".format(id))
df = show_script("E1")
df
I would like to give a small shoutout to Ramshankar Yadhunath for providing a feature engineering script in his kernel.
Hey guys, if you've followed me till here, then don't forget to check out his kernel too.
# feature engineer the transcript features
def conv_to_sec(x):
    """ Time to seconds """
    t_list = x.split(":")
    if len(t_list) == 2:
        m = t_list[0]
        s = t_list[1]
        time = int(m) * 60 + int(s)
    else:
        h = t_list[0]
        m = t_list[1]
        s = t_list[2]
        time = int(h) * 60 * 60 + int(m) * 60 + int(s)
    return time

def get_durations(nums, size):
    """ Get durations i.e. the time for which each speaker spoke continuously """
    diffs = []
    for i in range(size - 1):
        diffs.append(nums[i + 1] - nums[i])
    diffs.append(30)  # standard value for all end of the episode CFA by Sanyam
    return diffs

def transform_transcript(sub, episode_id):
    """ Transform the transcript of the given episode """
    # create the time second feature that converts the time into the unified qty. of seconds
    sub["Time_sec"] = sub["Time"].apply(conv_to_sec)
    # get durations
    sub["Duration"] = get_durations(sub["Time_sec"], sub.shape[0])
    # providing an identity to each transcript
    sub["Episode_ID"] = episode_id
    sub = sub[["Episode_ID", "Time", "Time_sec", "Duration", "Speaker", "Text"]]
    return sub

def combine_transcripts(sub_dir):
    """ Combine all the 75 transcripts of the ML Heroes Interviews together as one dataframe """
    episodes = []
    for i in range(1, 76):
        file = "E" + str(i) + ".csv"
        try:
            sub_epi = pd.read_csv(os.path.join(sub_dir, file))
            sub_epi = transform_transcript(sub_epi, ("E" + str(i)))
            episodes.append(sub_epi)
        except FileNotFoundError:  # some episode numbers have no transcript file
            continue
    return pd.concat(episodes, ignore_index=True)
# create the combined transcript dataset
sub_dir = "../input/chai-time-data-science/Cleaned Subtitles"
transcripts = combine_transcripts(sub_dir)
transcripts.head()
Now we have some data to work with.
Thanking Ramshankar Yadhunath once again, let's get started..
In that case, how much does the intro hurt CTDS in terms of intro duration?
Let's find out...
temp = Episode_df[["episode_id", "youtube_avg_watch_duration"]].copy()
temp = temp[(temp.episode_id != "E0") & (temp.episode_id != "E4")]
intro = []
for i in transcripts.Episode_ID.unique():
    intro.append(transcripts[transcripts.Episode_ID == i].iloc[0].Duration)
temp["Intro_Duration"] = intro
temp["diff"] = temp.youtube_avg_watch_duration - temp.Intro_Duration
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Youtube Avg Watch_duration Views", "@youtube_avg_watch_duration"),
("Intro Duration", "@Intro_Duration"),
("Avg Duration of Content Watched", "@diff"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 600, x_range = temp["episode_id"].values, title = "Impact of Intro Durations")
fig1.line("episode_id", "youtube_avg_watch_duration", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Youtube Avg Watch_duration Views")
fig1.line("episode_id", "Intro_Duration", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Intro Duration")
fig1.line("episode_id", "diff", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Avg Duration of Content Watched")
fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Percentage"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 5 minutes".format(len(temp[temp["diff"]<300])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 4 minutes".format(len(temp[temp["diff"]<240])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 3 minutes".format(len(temp[temp["diff"]<180])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 2 minutes".format(len(temp[temp["diff"]<120])/len(temp)*100))
print("In {} case, Viewer left in the Intro Duration".format(len(temp[temp["diff"]<0])))
🧠 My Conclusion:
There's lots of things to improve.
With 45.95% of Episodes having Avg Duration of Content Watched less than 3 minutes, We can hardly gain any useful insight or can comment on quality of Content delivered.
But Okay! We can have some fun though 😄
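The next cells tally how often the host and the hero each speak per episode. The core counting idea can be sketched standalone on a toy transcript (the data below is made up; the column names mirror the dataset's Episode_ID / Speaker / Text):

```python
import pandas as pd

# Toy transcript frame -- the real rows come from the Cleaned Subtitles CSVs
toy = pd.DataFrame({
    "Episode_ID": ["E1", "E1", "E1", "E2", "E2"],
    "Speaker": ["Sanyam Bhutani", "Hero", "Sanyam Bhutani", "Hero", "Hero"],
    "Text": ["hi", "hello", "so", "well", "thanks"],
})

# Rows where the host speaks, counted per episode
host_counts = toy[toy.Speaker == "Sanyam Bhutani"].groupby("Episode_ID").size()
# Rows where anyone other than the host (i.e. the hero) speaks
hero_counts = toy[toy.Speaker != "Sanyam Bhutani"].groupby("Episode_ID").size()

print(host_counts.to_dict())  # {'E1': 2}
print(hero_counts.to_dict())  # {'E1': 1, 'E2': 2}
```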
host_text = []
hero_text = []
for i in transcripts.Episode_ID.unique():
host_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker=="Sanyam Bhutani")].Text])
hero_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker!="Sanyam Bhutani")].Text])
temp_host={}
temp_hero={}
for i in range(len(transcripts.Episode_ID.unique())):
host_text_count = len(host_text[i][1])
hero_text_count = len(hero_text[i][1])
temp_host[hero_text[i][0]]=host_text_count
temp_hero[hero_text[i][0]]=hero_text_count
def getkey(d):
    # Return the dictionary's keys as a list
    return list(d.keys())
def getvalue(d):
    # Return the dictionary's values as a list
    return list(d.values())
Source = ColumnDataSource(data=dict(
x=getkey(temp_host),
y=getvalue(temp_host),
a=getkey(temp_hero),
b=getvalue(temp_hero),
))
tooltips = [
("Episode Id", "@x"),
("No of Times Host Speaks", "@y"),
("No of Times Hero Speaks", "@b"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, tooltips=tooltips,plot_height = 400, x_range = getkey(temp_host), title = "Who Speaks More ?")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Times Host Speaks")
fig1.vbar("a", top = "b", source = Source, width = 0.4, color = "#e7f207", alpha=.8, legend_label="No of Times Hero Speaks")
fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
ques=0
total_ques={}
for episode in range(len(transcripts.Episode_ID.unique())):
for each_text in range(len(host_text[episode][1])):
ques += host_text[episode][1].reset_index().iloc[each_text].Text.count("?")
total_ques[hero_text[episode][0]]= ques
ques=0
from statistics import mean
Source = ColumnDataSource(data=dict(
x=getkey(total_ques),
y=getvalue(total_ques),
))
tooltips = [
("Episode Id", "@x"),
("No of Questions", "@y"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, plot_height = 400,tooltips=tooltips, x_range = getkey(temp_host), title = "Questions asked Per Episode")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Questions asked Per Episode")
fig1.line("x", mean(getvalue(total_ques)), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Average Questions : {:.3f}".format(mean(getvalue(total_ques))))
fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "No of Questions"
fig1.legend.location = "top_left"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
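The question tally above simply counts "?" characters in the host's lines; here is a tiny standalone sketch of that step (the utterances are made up):

```python
# Made-up host utterances for one episode
host_lines = [
    "Welcome to the show. How did you get started in data science?",
    "Interesting. What came next? And why Kaggle?",
    "Thanks a lot for joining!",
]

# Every '?' is treated as one question, the same idea as the loop above
n_questions = sum(line.count("?") for line in host_lines)
print(n_questions)  # 3
```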
⚒️ About the Function:
Well, I'm going to write a small function: pass it a Hero's name and it will plot the 7 short phrases that person speaks most often.
But before that, I'd like to give a small shoutout to Parul Pandey for providing a text cleaning script in her Kernel.
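A quick note on the counting step used inside the function: `nltk.FreqDist(...).most_common(7)` behaves just like `collections.Counter.most_common`, sketched here on made-up utterances:

```python
from collections import Counter

# Made-up short utterances; the real code feeds in a speaker's transcript lines
utterances = ["awesome", "thank you", "awesome", "right", "awesome", "right"]

# Top 2 most frequent items, as (item, count) pairs sorted by count
top2 = Counter(utterances).most_common(2)
print(top2)  # [('awesome', 3), ('right', 2)]
```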
import re
import nltk
from statistics import mean
from collections import Counter
import string
def clean_text(text):
'''Make text lowercase, remove text in square brackets, remove links,
remove punctuation and remove words containing numbers.'''
text = text.lower()
text = re.sub(r'\[.*?\]', '', text)
text = re.sub(r'https?://\S+|www\.\S+', '', text)
text = re.sub(r'<.*?>+', '', text)
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
text = re.sub(r'\n', '', text)
text = re.sub(r'\w*\d\w*', '', text)
return text
def text_preprocessing(text):
"""
Cleaning and parsing the text.
"""
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
nopunc = clean_text(text)
tokenized_text = tokenizer.tokenize(nopunc)
#remove_stopwords = [w for w in tokenized_text if w not in stopwords.words('english')]
combined_text = ' '.join(tokenized_text)
return combined_text
transcripts['Text'] = transcripts['Text'].apply(str).apply(lambda x: text_preprocessing(x))
def get_data(speakername=None):
label=[]
value=[]
text_data=transcripts[(transcripts.Speaker==speakername)].Text.tolist()
temp=list(filter(lambda x: x.count(" ")<10 , text_data))
freq=nltk.FreqDist(temp).most_common(7)
for each in freq:
label.append(each[0])
value.append(each[1])
Source = ColumnDataSource(data=dict(
x=label,
y=value,
))
tooltips = [
("Favourite Text", "@x"),
("Frequency", "@y"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, tooltips=tooltips, plot_height = 400, x_range = label, title = "Favourite Text")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.axis_label = "Text"
fig1.yaxis.axis_label = "Frequency"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
get_data(speakername="Sanyam Bhutani")
📌 Observations:
With this, I end my analysis of the Chai Time Data Science | CTDS.Show dataset provided by Mr. Vopani and Mr. Sanyam Bhutani.
It was a wonderful experience for me.
If my analysis or way of storytelling has hurt any sentiments, I apologize for that.
And yes, congratulations to Chai Time Data Science | CTDS.Show on completing a successful 1-year journey.
Now I can finally enjoy my Chai break in peace :)