Hello my dear Kagglers. As you all know, I love Kaggle and its community. I spend most of my time surfing my Kaggle feed, scrolling through discussion forums, and appreciating the effort put in by various Kagglers through their unique and interesting ways of storytelling.
So, this morning, while following my usual Kaggle routine, I came across this dataset named Chai Time Data Science | CTDS.Show, provided by Mr. Vopani and Mr. Sanyam Bhutani.
At first glance, I was like: what's this? How do they know I'm having a tea break?
Oh no buddy! I was wrong. It's CTDS.Show :)
Chai Time Data Science show is a Podcast + Video + Blog based show for interviews with Practitioners, Kagglers & Researchers and all things Data Science.
CTDS.Show, driven by the community under the supervision of Mr. Sanyam Bhutani, is going to celebrate its one-year anniversary on 21st June, 2020, and to mark this achievement they decided to run a Kaggle contest around the dataset covering all of the 75+ ML Hero interviews on the series.
According to our Host, the competition is aimed at articulating insights from the interviews with ML Heroes. Provided a dataset consisting of detailed stats and transcripts of CTDS.Show, the goal is to use these to come up with interesting insights or stories based on the 100+ interviews with ML Heroes.
We have our Dataset containing :
Description.csv : This file consists of the description texts from YouTube and Audio
Episodes.csv : This file contains the statistics of all the Episodes of the Chai Time Data Science show.
YouTube Thumbnail Types.csv : This file consists of the description of the YouTube thumbnail types used for the episodes
Anchor Thumbnail Types.csv : This file consists of the description of the Anchor/Audio thumbnail types used for the episodes
Raw Subtitles : Directory containing 74 text files having raw subtitles of all the episodes
Cleaned Subtitles : Directory containing cleaned subtitles (in CSV format)
Hmm.. Seems we have some stories to talk about..
Congratulations to CTDS.Show on their one-year anniversary. Let's get it started :)
By the way, this is going to be a long kernel. So, hey! Looking for a guide :) ?
0. Importing Necessary Libraries
1. A Closer look to our Dataset
1.1. Exploring YouTube Thumbnail Types.csv
1.2. Exploring Anchor Thumbnail Types.csv
1.3. Exploring Description.csv
1.4. Exploring Episodes.csv
1.4.1. Missing Values ?
1.4.2. M0-M8 Episodes
1.4.3. Solving the Mystery of Missing Values
1.4.4. Is it a Gender Biased Show?
1.4.5. Time for a Chai Break
1.4.6. How to get More Audience?
1.4.7. Youtube Favors CTDS?
1.4.8. Do Thumbnails really matter?
1.4.9. How much Viewers wanna watch?
1.4.10. Performance on Other Platforms
1.4.11. Distribution of Heroes by Country and Nationality
1.4.12. Any Relation between Release Dates of Episodes?
1.4.13. Do I know about the Release of the anniversary interview episode?
1.5. Exploring Raw / Cleaned Subtitles
1.5.1. A Small Shoutout to Ramshankar Yadhunath
1.5.2. Intro is Bad for CTDS ?
1.5.3. Who Speaks More ?
1.5.4. Frequency of Questions Per Episode
1.5.5. Favourite Text ?
import os
import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
!pip install pywaffle
from pywaffle import Waffle
from bokeh.layouts import column, row
from bokeh.models.tools import HoverTool
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, output_notebook, show
output_notebook()
from IPython.display import IFrame
pd.set_option('display.max_columns', None)
Let's dive into each and every aspect of our dataset step by step, in order to get every inch out of it...
As per our knowledge, this file consists of the description of the YouTube thumbnail types used for the episodes. Let's explore more about it...
YouTube_df=pd.read_csv("../input/chai-time-data-science/YouTube Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(YouTube_df.shape[0], YouTube_df.shape[1]))
YouTube_df.head()
So, basically CTDS uses 4 types of thumbnails in their YouTube videos. It's 2020 and people still use the YouTube default image as a thumbnail!
Hmm... a smart decision or just a blind arrow? We'll figure it out in our further analysis ...
So, this file consists of the description of the Anchor/Audio thumbnail types
Anchor_df=pd.read_csv("../input/chai-time-data-science/Anchor Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Anchor_df.shape[0], Anchor_df.shape[1]))
Anchor_df.head()
It's quite similar to YouTube Thumbnail Types.
If you are wondering what Anchor is: it's a free platform for podcast creation.
IFrame('https://anchor.fm/chaitimedatascience', width=800, height=450)
This file consists of the description texts from YouTube and Audio
Des_df=pd.read_csv("../input/chai-time-data-science/Description.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Des_df.shape[0], Des_df.shape[1]))
Des_df.head()
So, we have a description for every episode. Let's have a closer look at what we have here.
def show_description(specific_id=None, top_des=None):
    if specific_id is not None:
        print(Des_df[Des_df.episode_id==specific_id].description.tolist()[0])
    if top_des is not None:
        for each_des in range(top_des):
            print(Des_df.description.tolist()[each_des])
            print("-"*100)
⚒️ About the Function :
In order to explore our Descriptions, I just wrote a small script. It has two options:
show_description("E1")
show_description(top_des=3)
🧠 My Conclusion:
This file contains the statistics of all the Episodes of the Chai Time Data Science show.
Okay ! So, it's the big boy itself ..
Episode_df=pd.read_csv("../input/chai-time-data-science/Episodes.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Episode_df.shape[0], Episode_df.shape[1]))
Episode_df.head()
Whew! That's a lot of features.
I'm sure we're going to uncover some interesting insights from this metadata.
If you've reached this far, then please bear with me for a couple of minutes more..
Now, we're going to take a big sip of our "Chai" :)
Before diving into our analysis, let's check for missing values in our CSV..
For this purpose, I'm going to use a library named missingno.
Just use this line :
import missingno as msno
missingno helps us deal with missing values in a dataset with the help of visualisations. With over 2k stars on GitHub, this library is already very popular.
msno.matrix(Episode_df)
Aah shit ! Here We go again..
📌 Observations :
There is also a sparkline chart on the right side of the plot. It summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
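To make that sparkline concrete, the per-row completeness it summarizes can be computed directly with pandas. A minimal sketch on a toy DataFrame (not our actual Episodes.csv):

```python
import pandas as pd

# Toy frame with holes, standing in for Episode_df
df = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": [None, None, 3, 4],
    "c": [1, 2, 3, None],
})

# Non-null count per row: exactly what missingno's sparkline tracks
row_completeness = df.notnull().sum(axis=1)
print(row_completeness.min(), row_completeness.max())  # → 1 3
```

The row with the minimum count (here, row 1 with a single non-null value) is the maximum-nullity row the sparkline points at.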
Well, before making any false claims, let's explore it a bit more..
temp=Episode_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]
Source=ColumnDataSource(temp)
tooltips = [
("Feature Name", "@Name"),
("No of Missing entites", "@Count")
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.major_label_orientation = np.pi / 8
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
show(fig1)
📌 Observations :
🧠 My Conclusion:
Let's find out..
Episode_df[Episode_df.heroes.isnull()]
💭 Interesting..
But What are these M0-M8 episodes .. ?
🧠 My Conclusion:
temp=[id for id in Episode_df.episode_id if id.startswith('M')]
fastai_df=Episode_df[Episode_df.episode_id.isin(temp)]
Episode_df=Episode_df[~Episode_df.episode_id.isin(temp)]
Also, ignoring "E0" and "E69" for now ...
dummy_df=Episode_df[(Episode_df.episode_id!="E0") & (Episode_df.episode_id!="E69")]
msno.matrix(dummy_df)
temp=dummy_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]
Source=ColumnDataSource(temp)
tooltips = [
("Feature Name", "@Name"),
("No of Missing entites", "@Count")
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.major_label_orientation = np.pi / 4
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
show(fig1)
Now, that's much better..
But we still have a lot of missing values.
parent = []
names = []
values = []
temp = dummy_df.groupby(["category"]).heroes_gender.value_counts()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])
df1 = pd.DataFrame(dict(names=names, parents=parent, values=values))

parent = []
names = []
values = []
temp = dummy_df.groupby(["category", "heroes_gender"]).heroes_kaggle_username.count()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])
df2 = pd.DataFrame(dict(names=names, parents=parent, values=values))
fig = px.sunburst(df1, path=['names', 'parents'], values='values', color='parents',hover_data=["names"], title="Heroes associated with Categories")
fig.update_traces(
textinfo='percent entry+label',
hovertemplate = "Industry:%{label}: <br>Count: %{value}"
)
fig.show()
fig = px.sunburst(df2, path=['names', 'parents'], values='values', color='parents', title="Heroes associated with Categories having Kaggle Account")
fig.update_traces(
textinfo='percent entry+label',
hovertemplate = "Industry:%{label}: <br>Count: %{value}"
)
fig.show()
📌 Observations :
Because of this Kaggle platform, I now have approximately a 42% chance of becoming a CTDS Hero :) ...
Ahem ahem... Focus, RsTaK, focus.. Let's get back to our work.
Wait? Guess I missed something.. What's that gender ratio?
gender = Episode_df.heroes_gender.value_counts()
fig = plt.figure(
FigureClass=Waffle,
rows=5,
columns=12,
values=gender,
colors = ('#20639B', '#ED553B'),
title={'label': 'Gender Distribution', 'loc': 'left'},
labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.4), 'ncol': len(gender), 'framealpha': 0},
font_size=30,
icons = 'child',
figsize=(12, 5),
icon_legend=True
)
Jokes apart, we can't make any strong statement from this.
But yeah, I'm hoping for more female Heroes :D
🧠 My Conclusion:
I won't talk much about the relation of gender with other features because:
So, we cannot conclude any relation with other features.
Even if we somehow observed a positive result for female Heroes, I would say it would just be a coincidence. There are factors other than gender that may have produced such a result.
With such an imbalanced and small sample size for female Heroes, we cannot make any strong statement here.
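To put a number on that uncertainty: with only a handful of female-hero episodes out of roughly 70 interviews, even a simple binomial confidence interval on the proportion comes out very wide. A sketch with illustrative counts (7 out of 70 is an assumption for the example, not the dataset's exact figure):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: 7 female heroes across 70 episodes
lo, hi = wilson_interval(7, 70)
print("{:.3f} - {:.3f}".format(lo, hi))  # roughly 0.05 - 0.19: far too wide for strong claims
```

An interval spanning 5% to 19% is exactly why no strong gender-related conclusion survives this sample size.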
dummy_df[dummy_df.apple_listeners.isnull()]
📌 Observations :
Following our analysis, We realized :
With this, We have solved all the mysteries related to the Missing Data. Now we can finally explore other aspects of this CSV
But before that..
While having a sip of my Chai (tea), I'm just curious: why is this show named "Chai Time Data Science"?
Well, I don't have a solid answer for this, but maybe it's just because our Host loves Chai? Hmmm.. So you want to say our Host is a more hardcore Chai lover than me?
Hey! Hold my Chai..
fig = go.Figure([go.Pie(labels=Episode_df.flavour_of_tea.value_counts().index.to_list(),values=Episode_df.flavour_of_tea.value_counts().values,hovertemplate = '<br>Type: %{label}</br>Count: %{value}<br>Popularity: %{percent}</br>', name = '')])
fig.update_layout(title_text="What Host drinks everytime ?", template="plotly_white", title_x=0.45, title_y = 1)
fig.data[0].marker.line.color = 'rgb(255, 255, 255)'
fig.data[0].marker.line.width = 2
fig.update_traces(hole=.4,)
fig.show()
📌 Observations :
Masala Chai (count=16) and Ginger Chai (count=16) seem to be the favourite Chais of our Host, followed by Herbal Tea (count=11) and Sulemani Chai (count=11).
Also, our Host seems to be quite experimental with Chai. He has a variety of flavours up his sleeve.
Oh man! This time you win. You're a real Chai lover.
Now, One Question arises..❓
So, does the Host drinking a specific Chai at a specific time have any relation with other factors or the success of CTDS?
🧠 My Conclusion:
Well, as a reward for your victory in that Chai Lover Challenge, I'll try to assist CTDS on how to get more audience 😄
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)
fastai_df.release_date = pd.to_datetime(fastai_df.release_date)
Source2 = ColumnDataSource(fastai_df)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("CTR", "@youtube_ctr"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Subscriber Gain", "@youtube_subscribers"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "CTR Per Episode")
fig1.line("release_date", "youtube_ctr", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="youtube_ctr")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_ctr", alpha=0.2, fill_color='#55FF88', legend_label="youtube_ctr")
fig1.line("release_date", Episode_df.youtube_ctr.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Youtube CTR Mean : {:.3f}".format(Episode_df.youtube_ctr.mean()))
fig1.circle(x="release_date", y="youtube_ctr", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Click Per Impression"
fig1.grid.grid_line_color="#feffff"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"
fig2.grid.grid_line_color="#feffff"
show(column(fig1, fig2))
📌 Observations :
From the Graphs, We can see :
🤖 M0-M8 Series :
💭 Interestingly..,
But Why ❓
There's a huge possibility of such cases. But in conclusion, we can say a high CTR reflects cases like:
📃 I don't know how the YouTube algorithm works. But for the sake of answering the exceptional case of E19, my hypothetical answer would be:
Okay, what about the organic reach of the channel, or the reach via Heroes?
Source = ColumnDataSource(Episode_df)
Source2 = ColumnDataSource(fastai_df)
tooltips = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Impression Views", "@youtube_impression_views"),
("Non Impression Views", "@youtube_nonimpression_views"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Subscriber Gain", "@youtube_subscribers"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Impression-Non Impression Views Per Episode")
fig1.line("release_date", "youtube_impression_views", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Impression Views")
fig1.line("release_date", "youtube_nonimpression_views", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Non Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_impression_views", alpha=0.2, fill_color='#55FF88', legend_label="Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_nonimpression_views", alpha=0.2, fill_color='#e09d53', legend_label="Non Impression Views")
fig1.circle(x="release_date", y="youtube_impression_views", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series Impression Views")
fig1.circle(x="release_date", y="youtube_nonimpression_views", source = Source2, color = "#2d3328", alpha = 0.8, legend_label="M0-M8 Series Non Impression Views")
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Total Views"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"
show(column(fig1, fig2))
📌 Observations :
data1={
"Youtube Impressions":Episode_df.youtube_impressions.sum(),
"Youtube Impression Views": Episode_df.youtube_impression_views.sum(),
"Youtube NonImpression Views" : Episode_df.youtube_nonimpression_views.sum()
}
text=("Youtube Impressions","Youtube Impression Views","Youtube NonImpression Views")
fig = go.Figure(go.Funnelarea(
textinfo= "text+value",
text =list(data1.keys()),
values = list(data1.values()),
title = {"position": "top center", "text": "Youtube and Views"},
name = '', showlegend=False,customdata=['Video Thumbnail shown to Someone', 'Views From Youtube Impressions', 'Views without Youtube Impressions'], hovertemplate = '%{customdata} <br>Count: %{value}</br>'
))
fig.show()
📌 Observations :
A few things to note here :
It seems clear that the YouTube thumbnail and video title are the important factors in deciding whether a person will click on the video or not.
Wait, you want some figures?
colors = ["red", "olive", "darkred", "goldenrod"]
index={
0:"YouTube default image",
1:"YouTube default image with custom annotation",
2:"Mini Series: Custom Image with annotations",
3:"Custom image with CTDS branding, Title and Tags"
}
p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS CTR")
base, lower, upper = [], [], []
for each_thumbnail_ref in index:
    # M0-M8 mini-series episodes live in fastai_df; everything else in Episode_df
    if each_thumbnail_ref == 2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type == each_thumbnail_ref].youtube_ctr
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type == each_thumbnail_ref].youtube_ctr
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)
    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean CTR for Thumbnail Type {} : {:.3f}".format(index[each_thumbnail_ref], temp.mean()))
source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
p.add_layout(
    Whisker(source=source_error, base="base", lower="lower", upper="upper")
)
show(p)
📌 Observations :
From the above whisker plot:
In short, don't use the default YouTube image as a thumbnail.
In order to get a significant insight, I'll calculate the percentage of each Episode watched..
a = Episode_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]].copy()
a["percentage"] = (a.youtube_avg_watch_duration / a.episode_duration) * 100
b = fastai_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]].copy()
b["percentage"] = (b.youtube_avg_watch_duration / b.episode_duration) * 100
temp = pd.concat([a, b]).reset_index(drop=True)
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Duration", "@episode_duration"),
("Youtube Avg Watch_duration Views", "@youtube_avg_watch_duration"),
("Percentage of video watched", "@percentage"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_range = temp["episode_id"].values, title = "Percentage of Episode Watched")
fig1.line("episode_id", "percentage", source = Source, color = "#03c2fc", alpha = 0.8)
fig1.line("episode_id", temp.percentage.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Mean : {:.3f}".format(temp.percentage.mean()))
fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Percentage"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
📌 Observations :
How does this make sense ❓
But why did such outliers occur ❓
In this fast-moving world, humans get bored of things very easily. E0 and the M series, having low episode durations, made viewers watch a larger share of them.
Whether they'll subscribe to the channel or not is a different thing. That depends on the content.
In order to give more to viewers and the community, shorter episodes could be a big step.
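One way to sanity-check the "shorter episodes get watched more fully" hunch is a plain correlation between duration and percentage watched. A sketch on made-up numbers shaped like the hypothesis; on the real data you would call the same `.corr()` on `temp[["episode_duration", "percentage"]]`:

```python
import pandas as pd

# Made-up durations (seconds) and watch percentages, illustrative only
toy = pd.DataFrame({
    "episode_duration": [600, 1200, 2400, 3600, 5400],
    "percentage":       [55.0, 40.0, 22.0, 15.0, 10.0],
})

r = toy["episode_duration"].corr(toy["percentage"])  # Pearson r
print(round(r, 2))  # strongly negative: longer episode, smaller share watched
```

A strongly negative `r` on the real data would back up the "shorter episodes" recommendation; a weak one would mean the outliers are doing all the work.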
colors = ["red", "olive", "darkred", "goldenrod"]
index={
0:"YouTube default playlist image",
1:"CTDS Branding",
2:"Mini Series: Custom Image with annotations",
3:"Custom image with CTDS branding, Title and Tags"
}
p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS Anchor Plays")
base, lower, upper = [], [], []
for each_thumbnail_ref in index:
    # M0-M8 mini-series episodes live in fastai_df; everything else in Episode_df
    if each_thumbnail_ref == 2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type == each_thumbnail_ref].anchor_plays
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type == each_thumbnail_ref].anchor_plays
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)
    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean Anchor Plays for Thumbnail Type {} : {:.3f}".format(index[each_thumbnail_ref], temp.mean()))
source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
p.add_layout(
    Whisker(source=source_error, base="base", lower="lower", upper="upper")
)
show(p)
📌 Observations :
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)
tooltips = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Anchor Plays", "@anchor_plays"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
tooltips2 = [
("Episode Id", "@episode_id"),
("Episode Title", "@episode_name"),
("Hero Present", "@heroes"),
("Spotify Starts Plays", "@spotify_starts"),
("Spotify Streams", "@spotify_streams"),
("Spotify Listeners", "@spotify_listeners"),
("Category", "@category"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Anchor Plays Per Episode")
fig1.line("release_date", "anchor_plays", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Anchor Plays")
fig1.line("release_date", Episode_df.anchor_plays.mean(), source = Source, color = "#f2a652", alpha = 0.8, line_dash="dashed", legend_label="Anchor Plays Mean : {:.3f}".format(Episode_df.anchor_plays.mean()))
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Anchor Plays"
fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Performance on Spotify Per Episode")
fig2.line("release_date", "spotify_starts", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Spotify Starts Plays")
fig2.line("release_date", "spotify_streams", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Spotify Streams")
fig2.line("release_date", "spotify_listeners", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Spotify Listeners")
fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Total Plays"
show(column(fig1,fig2))
temp=Episode_df.groupby(["heroes_location", "heroes"])["heroes_nationality"].value_counts()
parent=[]
names =[]
values=[]
heroes=[]
for k in temp.index:
    parent.append(k[0])
    heroes.append(k[1])
    names.append(k[2])
    values.append(temp.loc[k])
df = pd.DataFrame(
dict(names=names, parents=parent,values=values, heroes=heroes))
df["World"] = "World"
fig = px.treemap(
df,
path=['World', 'parents','names','heroes'], values='values',color='parents')
fig.update_layout(
width=1000,
height=700,
title_text="Distribution of Heroes by Country and Nationality")
fig.show()
a = Episode_df.release_date
b = (a - a.shift(periods=1, fill_value=pd.Timestamp("2019-07-21"))).dt.days
d = {'episode_id':Episode_df.episode_id, 'heroes':Episode_df.heroes, 'release_date': Episode_df.release_date, 'day_difference': b}
temp = pd.DataFrame(d)
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Hero Present", "@heroes"),
("Day Difference", "@day_difference"),
("Date", "@release_date{%F}"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_axis_type = "datetime", title = "Day difference between Each Release Date")
fig1.line("release_date", "day_difference", source = Source, color = "#03c2fc", alpha = 0.8)
fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Date"
fig1.yaxis.axis_label = "No of Days"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
📌 Observations :
Though, I visited his YouTube channel and manually examined his release pattern:
Episode Id | Release | Day Difference |
---|---|---|
E75 | 2020-06-18 | 4 |
E76 | 2020-06-21 | 3 |
E77 | 2020-06-28 | 7 |
E78 | 2020-07-02 | 4 |
E79 | 2020-07-09 | 7 |
E80 | 2020-07-12 | 3 |
Maybe he's experimenting with a new pattern.
Can we pin-point when the 1-year anniversary interview episode will release ❓ Actually, no!
Though a small pattern can be observed in the release dates, he has a bit of an odd recording pattern:
As per his release pattern, he's been releasing his episodes after 3 or 4 days.
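That cadence can also be read off programmatically. A quick sketch using the release dates from the table above:

```python
import pandas as pd

# Release dates of E75-E80, taken from the table above
dates = pd.to_datetime([
    "2020-06-18", "2020-06-21", "2020-06-28",
    "2020-07-02", "2020-07-09", "2020-07-12",
])

# Day gaps between consecutive releases
gaps = pd.Series(dates).diff().dt.days.dropna().astype(int)
print(gaps.tolist())  # → [3, 7, 4, 7, 3]
```

Mostly 3-4 day gaps with the occasional 7, which matches the "after 3 or 4 days" reading, with some weeks skipped.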
If I'm correct, then Mr. Sanyam Bhutani, please don't forget to give a small shoutout to me 😄
So, We have 2 directories here :
def show_script(id):
    return pd.read_csv("../input/chai-time-data-science/Cleaned Subtitles/{}.csv".format(id))
df = show_script("E1")
df
I would like to give a small shoutout to Ramshankar Yadhunath for providing a feature engineering script in his kernel.
Hey guys, if you've followed me till here, then don't forget to check out his kernel too.
# feature engineer the transcript features
def conv_to_sec(x):
    """ Time to seconds """
    t_list = x.split(":")
    if len(t_list) == 2:
        m = t_list[0]
        s = t_list[1]
        time = int(m) * 60 + int(s)
    else:
        h = t_list[0]
        m = t_list[1]
        s = t_list[2]
        time = int(h) * 60 * 60 + int(m) * 60 + int(s)
    return time

def get_durations(nums, size):
    """ Get durations i.e. the time for which each speaker spoke continuously """
    diffs = []
    for i in range(size - 1):
        diffs.append(nums[i + 1] - nums[i])
    diffs.append(30)  # standard value for all end of the episode CFA by Sanyam
    return diffs

def transform_transcript(sub, episode_id):
    """ Transform the transcript of the given episode """
    # create the time second feature that converts the time into the unified qty. of seconds
    sub["Time_sec"] = sub["Time"].apply(conv_to_sec)
    # get durations
    sub["Duration"] = get_durations(sub["Time_sec"], sub.shape[0])
    # providing an identity to each transcript
    sub["Episode_ID"] = episode_id
    sub = sub[["Episode_ID", "Time", "Time_sec", "Duration", "Speaker", "Text"]]
    return sub

def combine_transcripts(sub_dir):
    """ Combine all the 75 transcripts of the ML Heroes Interviews together as one dataframe """
    episodes = []
    for i in range(1, 76):
        file = "E" + str(i) + ".csv"
        try:
            sub_epi = pd.read_csv(os.path.join(sub_dir, file))
            sub_epi = transform_transcript(sub_epi, ("E" + str(i)))
            episodes.append(sub_epi)
        except FileNotFoundError:  # some episode numbers have no transcript file
            continue
    return pd.concat(episodes, ignore_index=True)
# create the combined transcript dataset
sub_dir = "../input/chai-time-data-science/Cleaned Subtitles"
transcripts = combine_transcripts(sub_dir)
transcripts.head()
Now we have some data to work with.
Thanking Ramshankar Yadhunath once again, let's get started..
In that case, how much does the intro hurt CTDS in terms of intro duration?
Let's find out...
temp = Episode_df[["episode_id", "youtube_avg_watch_duration"]].copy()
temp = temp[(temp.episode_id != "E0") & (temp.episode_id != "E4")]
intro = []
for i in transcripts.Episode_ID.unique():
    intro.append(transcripts[transcripts.Episode_ID == i].iloc[0].Duration)
temp["Intro_Duration"] = intro
temp["diff"] = temp.youtube_avg_watch_duration - temp.Intro_Duration
Source = ColumnDataSource(temp)
tooltips = [
("Episode Id", "@episode_id"),
("Youtube Avg Watch_duration Views", "@youtube_avg_watch_duration"),
("Intro Duration", "@Intro_Duration"),
("Avg Duration of Content Watched", "@diff"),
]
fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 600, x_range = temp["episode_id"].values, title = "Impact of Intro Durations")
fig1.line("episode_id", "youtube_avg_watch_duration", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Youtube Avg Watch_duration Views")
fig1.line("episode_id", "Intro_Duration", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Intro Duration")
fig1.line("episode_id", "diff", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Avg Duration of Content Watched")
fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Percentage"
fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 5 minutes".format(len(temp[temp["diff"]<300])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 4 minutes".format(len(temp[temp["diff"]<240])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 3 minutes".format(len(temp[temp["diff"]<180])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 2 minutes".format(len(temp[temp["diff"]<120])/len(temp)*100))
print("In {} case, Viewer left in the Intro Duration".format(len(temp[temp["diff"]<0])))
🧠 My Conclusion:
There's lots of things to improve.
With 45.95% of Episodes having Avg Duration of Content Watched less than 3 minutes, We can hardly gain any useful insight or can comment on quality of Content delivered.
But Okay! We can have some fun though 😄
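The next cells tally how often the host and the hero each speak per episode. The core counting idea can be sketched standalone on a toy transcript (the data below is made up; the column names mirror the dataset's Episode_ID / Speaker / Text):

```python
import pandas as pd

# Toy transcript frame -- the real rows come from the Cleaned Subtitles CSVs
toy = pd.DataFrame({
    "Episode_ID": ["E1", "E1", "E1", "E2", "E2"],
    "Speaker": ["Sanyam Bhutani", "Hero", "Sanyam Bhutani", "Hero", "Hero"],
    "Text": ["hi", "hello", "so", "well", "thanks"],
})

# Rows where the host speaks, counted per episode
host_counts = toy[toy.Speaker == "Sanyam Bhutani"].groupby("Episode_ID").size()
# Rows where anyone other than the host (i.e. the hero) speaks
hero_counts = toy[toy.Speaker != "Sanyam Bhutani"].groupby("Episode_ID").size()

print(host_counts.to_dict())  # {'E1': 2}
print(hero_counts.to_dict())  # {'E1': 1, 'E2': 2}
```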
host_text = []
hero_text = []
for i in transcripts.Episode_ID.unique():
host_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker=="Sanyam Bhutani")].Text])
hero_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker!="Sanyam Bhutani")].Text])
temp_host={}
temp_hero={}
for i in range(len(transcripts.Episode_ID.unique())):
host_text_count = len(host_text[i][1])
hero_text_count = len(hero_text[i][1])
temp_host[hero_text[i][0]]=host_text_count
temp_hero[hero_text[i][0]]=hero_text_count
def getkey(d):
    # Return the dictionary's keys as a list
    return list(d.keys())
def getvalue(d):
    # Return the dictionary's values as a list
    return list(d.values())
Source = ColumnDataSource(data=dict(
x=getkey(temp_host),
y=getvalue(temp_host),
a=getkey(temp_hero),
b=getvalue(temp_hero),
))
tooltips = [
("Episode Id", "@x"),
("No of Times Host Speaks", "@y"),
("No of Times Hero Speaks", "@b"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, tooltips=tooltips,plot_height = 400, x_range = getkey(temp_host), title = "Who Speaks More ?")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Times Host Speaks")
fig1.vbar("a", top = "b", source = Source, width = 0.4, color = "#e7f207", alpha=.8, legend_label="No of Times Hero Speaks")
fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "Count"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
ques=0
total_ques={}
for episode in range(len(transcripts.Episode_ID.unique())):
for each_text in range(len(host_text[episode][1])):
ques += host_text[episode][1].reset_index().iloc[each_text].Text.count("?")
total_ques[hero_text[episode][0]]= ques
ques=0
from statistics import mean
Source = ColumnDataSource(data=dict(
x=getkey(total_ques),
y=getvalue(total_ques),
))
tooltips = [
("Episode Id", "@x"),
("No of Questions", "@y"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, plot_height = 400,tooltips=tooltips, x_range = getkey(temp_host), title = "Questions asked Per Episode")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Questions asked Per Episode")
fig1.line("x", mean(getvalue(total_ques)), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Average Questions : {:.3f}".format(mean(getvalue(total_ques))))
fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "No of Questions"
fig1.legend.location = "top_left"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
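The question tally above simply counts "?" characters in the host's lines; here is a tiny standalone sketch of that step (the utterances are made up):

```python
# Made-up host utterances for one episode
host_lines = [
    "Welcome to the show. How did you get started in data science?",
    "Interesting. What came next? And why Kaggle?",
    "Thanks a lot for joining!",
]

# Every '?' is treated as one question, the same idea as the loop above
n_questions = sum(line.count("?") for line in host_lines)
print(n_questions)  # 3
```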
⚒️ About the Function:
Well, I'm going to write a small function: pass it a Hero's name and it will plot the 7 short phrases that person speaks most often.
But before that, I'd like to give a small shoutout to Parul Pandey for providing a text cleaning script in her Kernel.
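A quick note on the counting step used inside the function: `nltk.FreqDist(...).most_common(7)` behaves just like `collections.Counter.most_common`, sketched here on made-up utterances:

```python
from collections import Counter

# Made-up short utterances; the real code feeds in a speaker's transcript lines
utterances = ["awesome", "thank you", "awesome", "right", "awesome", "right"]

# Top 2 most frequent items, as (item, count) pairs sorted by count
top2 = Counter(utterances).most_common(2)
print(top2)  # [('awesome', 3), ('right', 2)]
```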
import re
import nltk
from statistics import mean
from collections import Counter
import string
def clean_text(text):
'''Make text lowercase, remove text in square brackets, remove links,
remove punctuation and remove words containing numbers.'''
text = text.lower()
text = re.sub(r'\[.*?\]', '', text)
text = re.sub(r'https?://\S+|www\.\S+', '', text)
text = re.sub(r'<.*?>+', '', text)
text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
text = re.sub(r'\n', '', text)
text = re.sub(r'\w*\d\w*', '', text)
return text
def text_preprocessing(text):
"""
Cleaning and parsing the text.
"""
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
nopunc = clean_text(text)
tokenized_text = tokenizer.tokenize(nopunc)
#remove_stopwords = [w for w in tokenized_text if w not in stopwords.words('english')]
combined_text = ' '.join(tokenized_text)
return combined_text
transcripts['Text'] = transcripts['Text'].apply(str).apply(lambda x: text_preprocessing(x))
def get_data(speakername=None):
label=[]
value=[]
text_data=transcripts[(transcripts.Speaker==speakername)].Text.tolist()
temp=list(filter(lambda x: x.count(" ")<10 , text_data))
freq=nltk.FreqDist(temp).most_common(7)
for each in freq:
label.append(each[0])
value.append(each[1])
Source = ColumnDataSource(data=dict(
x=label,
y=value,
))
tooltips = [
("Favourite Text", "@x"),
("Frequency", "@y"),
]
fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, tooltips=tooltips, plot_height = 400, x_range = label, title = "Favourite Text")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)
fig1.xaxis.axis_label = "Text"
fig1.yaxis.axis_label = "Frequency"
fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4
show(fig1)
get_data(speakername="Sanyam Bhutani")
📌 Observations:
With this, I end my analysis of the Chai Time Data Science | CTDS.Show dataset provided by Mr. Vopani and Mr. Sanyam Bhutani.
It was a wonderful experience for me.
If my analysis or way of storytelling has hurt any sentiments, I apologize for that.
And yes, congratulations to Chai Time Data Science | CTDS.Show on completing a successful 1-year journey.
Now I can finally enjoy my Chai break in peace :)