Solving the Mystery of Chai Time Data Science

A Data Science podcast series by Sanyam Bhutani



Mr. RsTaK, Where are we?

Hello my dear Kagglers. As you all know, I love Kaggle and its community. I spend most of my time surfing my Kaggle feed, scrolling through the discussion forums, and appreciating the effort various Kagglers put into their unique and interesting ways of storytelling.

So, this morning while following my usual Kaggle routine, I came across this dataset named Chai Time Data Science | CTDS.Show, provided by Mr. Vopani and Mr. Sanyam Bhutani. At first glance, I was like: what's this? How do they know I'm having a tea break? Oh no buddy! I was wrong. It's CTDS.Show :)


Chai Time Data Science (CTDS.Show)

Chai Time Data Science (CTDS.Show) is a podcast + video + blog show for interviews with Practitioners, Kagglers & Researchers and all things Data Science.

CTDS.Show, driven by the community under the supervision of Mr. Sanyam Bhutani, celebrates its 1-year anniversary on 21st June, 2020, and to mark this achievement they decided to run a Kaggle contest around a dataset covering all of the 75+ ML Hero interviews on the series.

According to our Host, the competition is aimed at articulating insights from the interviews with ML Heroes. Provided a dataset consisting of detailed stats and transcripts of CTDS.Show, the goal is to use these to come up with interesting insights or stories based on the 75+ interviews with ML Heroes.

Our dataset contains:

  • Description.csv : the description texts from YouTube and Anchor (audio)

  • Episodes.csv : statistics for all the episodes of the Chai Time Data Science show

  • YouTube Thumbnail Types.csv : descriptions of the thumbnail types used for the YouTube videos

  • Anchor Thumbnail Types.csv : descriptions of the thumbnail types used for the Anchor/Audio episodes

  • Raw Subtitles : a directory containing 74 text files with the raw subtitles of all the episodes

  • Cleaned Subtitles : a directory containing the cleaned subtitles (in CSV format)
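A quick way to confirm these contents yourself (my own throwaway snippet, using the same input path the rest of this kernel reads from):

import os
print(os.listdir("../input/chai-time-data-science"))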

Hmm.. Seems we have some stories to talk about..

Congratulating CTDS.Show on their 1-year anniversary, let's get it started :)


Btw, this is gonna be a long kernel. So, hey! Looking for a guide :) ?

  0. Importing Necessary Libraries

  1. A Closer Look at our Dataset

    1.1. Exploring YouTube Thumbnail Types.csv
    1.2. Exploring Anchor Thumbnail Types.csv
    1.3. Exploring Description.csv
    1.4. Exploring Episodes.csv

      1.4.1. Missing Values ?
      1.4.2. M0-M8 Episodes
      1.4.3. Solving the Mystery of Missing Values
      1.4.4. Is it a Gender Biased Show?
      1.4.5. Time for a Chai Break
      1.4.6. How to get More Audience?
      1.4.7. Youtube Favors CTDS?
      1.4.8. Do Thumbnails really matter?
      1.4.9. How much Viewers wanna watch?
      1.4.10. Performance on Other Platforms
      1.4.11. Distribution of Heroes by Country and Nationality
      1.4.12. Any Relation between Release Dates of Episodes?
      1.4.13. Do I know when the anniversary interview episode will release?

    1.5. Exploring Raw / Cleaned Subtitles
      1.5.1. A Small Shoutout to Ramshankar Yadhunath
      1.5.2. Intro is Bad for CTDS ?
      1.5.3. Who Speaks More ?
      1.5.4. Frequency of Questions Per Episode
      1.5.5. Favourite Text ?

  2. End Notes

  3. Credits

Note : Sometimes, Plotly graphs fail to render with the kernel. Please reload the page in that case
Importing Necessary Libraries

Go back to our Guide

In [1]:
import os

import warnings
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import missingno as msno

import plotly.express as px

import plotly.graph_objects as go
from plotly.subplots import make_subplots

!pip install pywaffle
from pywaffle import Waffle

from bokeh.layouts import column, row
from bokeh.models.tools import HoverTool
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, output_notebook, show

output_notebook()

from IPython.display import IFrame

pd.set_option('display.max_columns', None)
Collecting pywaffle

  Downloading pywaffle-0.5.1-py2.py3-none-any.whl (525 kB)

     |████████████████████████████████| 525 kB 5.1 MB/s 

Requirement already satisfied: matplotlib in /opt/conda/lib/python3.7/site-packages (from pywaffle) (3.2.1)

Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pywaffle) (2.4.7)

Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pywaffle) (1.18.1)

Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pywaffle) (0.10.0)

Requirement already satisfied: python-dateutil>=2.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pywaffle) (2.8.1)

Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.7/site-packages (from matplotlib->pywaffle) (1.2.0)

Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->pywaffle) (1.14.0)

Installing collected packages: pywaffle

Successfully installed pywaffle-0.5.1

Loading BokehJS ...
A Closer Look at our Dataset

Go back to our Guide

Let's dive into each and every aspect of our dataset, step by step, in order to get every inch out of it...

Exploring YouTube Thumbnail Types.csv

Go back to our Guide

As we know, this file describes the thumbnail types used for the YouTube videos. Let's explore more about it...

In [2]:
YouTube_df=pd.read_csv("../input/chai-time-data-science/YouTube Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(YouTube_df.shape[0], YouTube_df.shape[1]))
YouTube_df.head()
No of Datapoints : 4
No of Features : 6
Out[2]:
youtube_thumbnail_type description youtube_default annotation mini_series ctds_brand
0 0 YouTube default image 1 0 0 0
1 1 YouTube default image with custom annotation 1 1 0 0
2 2 Mini Series: Custom Image with annotations 0 1 1 0
3 3 Custom image with CTDS branding, Title and Tags 0 1 0 1

So, basically CTDS uses 4 types of thumbnails in their YouTube videos. It's 2020 and people still use the YouTube default image as a thumbnail!

Hmm... a smart decision or just a blind arrow? We'll figure it out in our further analysis ...

Exploring Anchor Thumbnail Types.csv

Go back to our Guide

So, this file describes the thumbnail types used for the Anchor/Audio episodes

In [3]:
Anchor_df=pd.read_csv("../input/chai-time-data-science/Anchor Thumbnail Types.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Anchor_df.shape[0], Anchor_df.shape[1]))
Anchor_df.head()
No of Datapoints : 4
No of Features : 6
Out[3]:
anchor_thumbnail_type description same_as_youtube title episode_details ctds_brand
0 0 YouTube default playlist image 1 0 0 0
1 1 CTDS Branding 1 0 0 0
2 2 Mini Series: Custom Image with annotations 1 1 1 0
3 3 Custom image with CTDS branding, Title and Tags 1 1 1 1

It's quite similar to the YouTube thumbnail types.

If you are wondering what Anchor is, it's a free platform for podcast creation

In [4]:
IFrame('https://anchor.fm/chaitimedatascience', width=800, height=450)
Out[4]:
Exploring Description.csv

Go back to our Guide

This file consists of the description texts from YouTube and Anchor (audio)

In [5]:
Des_df=pd.read_csv("../input/chai-time-data-science/Description.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Des_df.shape[0], Des_df.shape[1]))
Des_df.head()
No of Datapoints : 85
No of Features : 2
Out[5]:
episode_id description
0 E0 Interview with ML Hero Series: https://medium....
1 E1 In the first Episode, Sanyam Bhutani interview...
2 E2 Audio Version Available here: https://anchor.f...
3 E3 Audio Version available here: https://anchor.f...
4 E4 In this Conversation, Sanyam Bhutani interview...

So, we have a description for every episode. Let's have a close look at what we have here

In [6]:
def show_description(specific_id=None, top_des=None):
    
    if specific_id is not None:
        print(Des_df[Des_df.episode_id==specific_id].description.tolist()[0])
        
    if top_des is not None:
        for each_des in range(top_des):  
            print(Des_df.description.tolist()[each_des])
            print("-"*100)

⚒️ About the Function :

In order to explore our descriptions, I wrote a small script. It has two options:

  • Either you provide a specific episode id (specific_id) to look at that particular description
  • Or you provide a number (top_des) and the script displays the descriptions of the first top_des episodes
In [7]:
show_description("E1")
In the first Episode, Sanyam Bhutani interviews Kaggle Triple Grandmaster: Abhishek Thakur. They talk about Abhishek's journey into Data Science and Kaggle; his Kaggle Experience and current projects. 

Interview with Machine Learning Hero Series: https://medium.com/dsnet/interviews-with-machine-learning-heroes-ad9358385278 

Follow:
Abhishek Thakur: https://www.kaggle.com/abhishek
https://www.linkedin.com/in/abhisvnit/
https://twitter.com/abhi1thakur

Sanyam Bhutani: https://twitter.com/bhutanisanyam1

About:
http://chaitimedatascience.com/
A show for Interviews with Practitioners, Kagglers & Researchers and all things Data Science hosted by Sanyam Bhutani. 

You can expect weekly episodes every Sunday, Thursday available as Video, Podcast, and blogposts.

If you'd like to support the podcast: https://www.patreon.com/chaitimedatascience
Intro track: 
Flow by LiQWYD https://soundcloud.com/liqwyd
In [8]:
show_description(top_des=3)
Interview with ML Hero Series: https://medium.com/p/bfaaf38df219

http://chaitimedatascience.com/
A show for Interviews with Practitioners, Kagglers & Researchers and all things Data Science hosted by Sanyam Bhutani. 

You can expect weekly episodes every Sunday, Thursday available as Video, Podcast, and blogpost.

Intro track: 
Flow by LiQWYD https://soundcloud.com/liqwyd
----------------------------------------------------------------------------------------------------
In the first Episode, Sanyam Bhutani interviews Kaggle Triple Grandmaster: Abhishek Thakur. They talk about Abhishek's journey into Data Science and Kaggle; his Kaggle Experience and current projects. 

Interview with Machine Learning Hero Series: https://medium.com/dsnet/interviews-with-machine-learning-heroes-ad9358385278 

Follow:
Abhishek Thakur: https://www.kaggle.com/abhishek
https://www.linkedin.com/in/abhisvnit/
https://twitter.com/abhi1thakur

Sanyam Bhutani: https://twitter.com/bhutanisanyam1

About:
http://chaitimedatascience.com/
A show for Interviews with Practitioners, Kagglers & Researchers and all things Data Science hosted by Sanyam Bhutani. 

You can expect weekly episodes every Sunday, Thursday available as Video, Podcast, and blogposts.

If you'd like to support the podcast: https://www.patreon.com/chaitimedatascience
Intro track: 
Flow by LiQWYD https://soundcloud.com/liqwyd
----------------------------------------------------------------------------------------------------
Audio Version Available here: https://anchor.fm/chaitimedatascience

In this Episode, Sanyam Bhutani interviews Kaggle Competition Master: Ryan Chesler. They talk about Ryan's journey into Data Science and Kaggle; his Kaggle Experience and current projects as well as his solution to the Jigsaw Unintended Bias in Toxicity Classification Kaggle Competition

Interview with Machine Learning Hero Series: https://medium.com/dsnet/interviews-with-machine-learning-heroes-ad9358385278 

Follow:
Ryan Chesler: https://www.kaggle.com/ryches
https://www.linkedin.com/in/ryan-chesler/
https://twitter.com/ryan_chesler

Sanyam Bhutani: https://twitter.com/bhutanisanyam1

About:
http://chaitimedatascience.com/
A show for Interviews with Practitioners, Kagglers & Researchers and all things Data Science hosted by Sanyam Bhutani. 

You can expect weekly episodes every Sunday, Thursday available as Video, Podcast, and blogposts.

If you'd like to support the podcast: https://www.patreon.com/chaitimedatascience
Intro track: 
Flow by LiQWYD https://soundcloud.com/liqwyd
----------------------------------------------------------------------------------------------------
Advice : Feel free to play with the function show_description() to look over various descriptions in one go


🧠 My Conclusion:

  • I went through some of the descriptions and realized they just contain URLs, the necessary links to social media profiles, a short description of the current show, and some announcements regarding future releases
  • I'm not gonna put much stress on this area because I don't think there's much to scrape from them. Right now, let's move ahead.
Exploring Episodes.csv

Go back to our Guide

This file contains the statistics of all the Episodes of the Chai Time Data Science show.

Okay ! So, it's the big boy itself ..

In [9]:
Episode_df=pd.read_csv("../input/chai-time-data-science/Episodes.csv")
print("No of Datapoints : {}\nNo of Features : {}".format(Episode_df.shape[0], Episode_df.shape[1]))
Episode_df.head()
No of Datapoints : 85
No of Features : 36
Out[9]:
episode_id episode_name heroes heroes_gender heroes_location heroes_nationality heroes_kaggle_username heroes_twitter_handle category flavour_of_tea recording_date recording_time release_date episode_duration youtube_url youtube_thumbnail_type youtube_impressions youtube_impression_views youtube_ctr youtube_nonimpression_views youtube_views youtube_watch_hours youtube_avg_watch_duration youtube_likes youtube_dislikes youtube_comments youtube_subscribers anchor_url anchor_thumbnail_type anchor_plays spotify_starts spotify_streams spotify_listeners apple_listeners apple_listened_hours apple_avg_listen_duration
0 E0 Chai Time Data Science Launch Announcement NaN NaN NaN NaN NaN NaN Other Masala Chai 2019-07-15 Evening 2019-07-21 157 https://www.youtube.com/watch?v=Ko_gxs42lM8 1 4433 86 1.94 45 131 3 82 4 0 2 3 https://anchor.fm/chaitimedatascience/episodes... 0.0 553.0 491.0 262.0 359.0 29.0 1.0 117.0
1 E1 Kaggle Triple Grandmaster, Abhishek Thakur Int... Abhishek Thakur Male Norway India abhishek abhi1thakur Kaggle Ginger Chai 2019-07-14 Evening 2019-07-22 2995 https://www.youtube.com/watch?v=Ezbo57Z33N8 0 25212 845 3.35 683 1528 142 335 55 0 5 60 https://anchor.fm/chaitimedatascience/episodes... 0.0 1271.0 826.0 608.0 456.0 56.0 25.0 1621.0
2 E2 Interview with Kaggle Master, ML Engineer: Rya... Ryan Chesler Male USA USA ryches ryan_chesler Kaggle Masala Chai 2019-07-20 Afternoon 2019-07-26 2118 https://www.youtube.com/watch?v=SJVMSKig14k 0 3282 84 2.56 44 128 14 394 7 0 1 3 https://anchor.fm/chaitimedatascience/episodes... 0.0 681.0 398.0 274.0 214.0 19.0 10.0 1879.0
3 E3 Interview with CEO of SharpestMinds, Edouard H... Edouard Harris Male Canada Canada NaN neutronsNeurons Industry Kashmiri Kahwa 2019-07-23 Night 2019-07-29 3072 https://www.youtube.com/watch?v=69urmSt34Ac 0 2376 38 1.60 57 95 11 417 2 0 0 1 https://anchor.fm/chaitimedatascience/episodes... 0.0 638.0 334.0 230.0 169.0 10.0 4.0 1344.0
4 E4 Data Science for Good: City of LA Kaggle Winni... Shivam Bansal Male Singapore India shivamb shivamshaz Kaggle Apple Cinnamon 2019-07-14 Morning 2019-08-02 1048 https://www.youtube.com/watch?v=wMYX3KABHCk 0 3884 116 2.99 36 152 9 213 4 0 0 4 https://anchor.fm/chaitimedatascience/episodes... 0.0 495.0 201.0 139.0 123.0 17.0 3.0 633.0

Wew! That's a lot of features.

I'm sure we're gonna extract some interesting insights from this metadata. If you've read this far, then please bear with me for a couple of minutes more..

Now, we're gonna take a big sip of our "Chai" :)

Missing Values

Before diving into our analysis, let's check for missing values in our CSV..

For this purpose, I'm gonna use the library missingno.

Just use this line :

import missingno as msno

missingno helps us deal with missing values in a dataset with the help of visualisations. With over 2k stars on GitHub, this library is already very popular.

In [10]:
msno.matrix(Episode_df)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f841c41dbd0>

Aah shit! Here we go again..

📌 Observations :

  • We can clearly see that heroes_kaggle_username and heroes_twitter_handle have lots of missing values
  • We can observe a bunch of data missing from the columns heroes through heroes_twitter_handle in a continuous way (that big block region), which suggests a specific reason for the data missing at those points
  • A few datapoints are also missing in the anchor, spotify and apple sections, i.e. missing data for the podcasts

There is also a sparkline on the right side of the plot. It summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.

Well, before making any false claims, let's explore further..

In [11]:
temp=Episode_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]

Source=ColumnDataSource(temp)

tooltips = [
    
    ("Feature Name", "@Name"),
    ("No of Missing entites", "@Count")
]

fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)

fig1.xaxis.major_label_orientation = np.pi / 8
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"

fig1.grid.grid_line_color="#feffff"


show(fig1)

📌 Observations :

  • The columns from heroes to heroes_nationality have the same amount of missing data. Seems we can find a reasonable relation between them
  • About 45.88% (39 of 85) and 22.35% (19 of 85) of the data is missing in heroes_kaggle_username and heroes_twitter_handle respectively
  • We have just 1 missing value in the anchor and spotify sections and 2 in the apple section, which is quite easy to handle
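Those percentages are easy to re-derive. A quick throwaway check (my own snippet, not one of the original cells):

missing_pct = (Episode_df.isnull().sum() / len(Episode_df) * 100).round(2)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))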


🧠 My Conclusion:

  • Come on, I don't understand. The Chai Time Data Science show is about interviews with our Heroes. So how do we have 11 missing values in the heroes feature?


Let's find out..

In [12]:
Episode_df[Episode_df.heroes.isnull()]
Out[12]:
episode_id episode_name heroes heroes_gender heroes_location heroes_nationality heroes_kaggle_username heroes_twitter_handle category flavour_of_tea recording_date recording_time release_date episode_duration youtube_url youtube_thumbnail_type youtube_impressions youtube_impression_views youtube_ctr youtube_nonimpression_views youtube_views youtube_watch_hours youtube_avg_watch_duration youtube_likes youtube_dislikes youtube_comments youtube_subscribers anchor_url anchor_thumbnail_type anchor_plays spotify_starts spotify_streams spotify_listeners apple_listeners apple_listened_hours apple_avg_listen_duration
0 E0 Chai Time Data Science Launch Announcement NaN NaN NaN NaN NaN NaN Other Masala Chai 2019-07-15 Evening 2019-07-21 157 https://www.youtube.com/watch?v=Ko_gxs42lM8 1 4433 86 1.94 45 131 3 82 4 0 2 3 https://anchor.fm/chaitimedatascience/episodes... 0.0 553.0 491.0 262.0 359.0 29.0 1.0 117.0
46 M0 00 Introduction & About: fast.ai 2019 & Things... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 624 https://www.youtube.com/watch?v=rzuIkj8lymc 2 3789 139 3.67 162 301 15 179 15 0 2 10 https://anchor.fm/chaitimedatascience/episodes... 2.0 308.0 49.0 33.0 35.0 6.0 1.0 463.0
47 M1 01: Lesson-1 Image Classification | fast.ai 20... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 341 https://www.youtube.com/watch?v=RKtfgXz7Qo0 2 4643 163 3.51 56 219 7 115 8 0 2 1 https://anchor.fm/chaitimedatascience/episodes... 2.0 368.0 37.0 32.0 29.0 10.0 1.0 504.0
48 M2 02: Lesson-2 Production & SGD From Scratch | f... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 316 https://www.youtube.com/watch?v=ahdybq2V-38 2 3144 63 2.00 37 100 3 108 2 1 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 317.0 33.0 21.0 24.0 8.0 1.0 312.0
49 M3 03: Lesson-3 Multi-label; SGD from scratch | f... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 332 https://www.youtube.com/watch?v=Z-waVKLcLJE 2 2436 52 2.13 28 80 3 135 2 0 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 276.0 20.0 13.0 16.0 11.0 1.0 260.0
50 M4 04: Lesson-4 NLP:Tabular Data; Recsys | fast.a... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 281 https://www.youtube.com/watch?v=5CW3QdGdr8c 2 2592 40 1.54 23 63 2 114 3 0 0 1 https://anchor.fm/chaitimedatascience/episodes... 2.0 301.0 24.0 17.0 17.0 10.0 7.0 2547.0
51 M5 05: Lesson 5: Backprop; Neural Nets from scrat... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 311 https://www.youtube.com/watch?v=RIGlXwvUo_Q 2 2536 26 1.03 11 37 1 97 0 0 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 279.0 18.0 16.0 15.0 15.0 2.0 479.0
52 M6 06: Lesson-6 CNN Deep Dive; Ethics | fast.ai 2... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 412 https://www.youtube.com/watch?v=nAE8tq_SIXo 2 3572 49 1.37 33 82 2 88 2 0 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 275.0 27.0 13.0 17.0 11.0 2.0 515.0
53 M7 07: Lesson-7 ResNet; U-Net; GANs | fast.ai 201... NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 467 https://www.youtube.com/watch?v=0eWG6apI1iY 2 2381 22 0.92 20 42 2 171 1 0 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 281.0 19.0 14.0 17.0 9.0 1.0 288.0
54 M8 08: Where to go from here, General fast.ai advice NaN NaN NaN NaN NaN NaN Other Kesar Rose Chai 2020-02-26 Night 2020-03-07 605 https://www.youtube.com/watch?v=oOr-7hYaU8o 2 2133 33 1.55 11 44 2 164 1 0 0 0 https://anchor.fm/chaitimedatascience/episodes... 2.0 376.0 26.0 17.0 22.0 8.0 1.0 301.0
78 E69 Birthday Special AMA: Answering Questions from... NaN NaN NaN NaN NaN NaN Other Masala Chai 2020-05-27 Morning 2020-05-27 3984 https://www.youtube.com/watch?v=hyJhwWshfbY 3 3698 163 4.41 338 501 55 395 36 1 3 15 https://anchor.fm/chaitimedatascience/episodes... 3.0 342.0 24.0 16.0 16.0 17.0 9.0 1992.0

💭 Interesting..

  • episode_id "E0" was all about Chai Time Data Science Launch Announcement
  • episode_id "E69" was Birthday Special It make sense why there's no hero for the following episodes

But What are these M0-M8 episodes .. ?

M0-M8 Episodes

  • Looking around for a while, I realized M0-M8 was a small mini-series based on fast.ai summaries and the things Jeremy says to do, all released on the same date.

🧠 My Conclusion:

  • Well, for the sake of our analysis I'll treat them as outliers for the current CSV and analyse them separately. So I'm gonna remove them from this CSV, storing them separately for later analysis
In [13]:
m_ids = [eid for eid in Episode_df.episode_id if eid.startswith('M')]
fastai_df = Episode_df[Episode_df.episode_id.isin(m_ids)].copy()  # keep the M series aside for separate analysis
Episode_df = Episode_df[~Episode_df.episode_id.isin(m_ids)]

Also, ignoring "E0" and "E69" for right now ...

In [14]:
dummy_df=Episode_df[(Episode_df.episode_id!="E0") & (Episode_df.episode_id!="E69")]
In [15]:
msno.matrix(dummy_df)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f841c21c710>
In [16]:
temp=dummy_df.isnull().sum().reset_index().rename(columns={"index": "Name", 0: "Count"})
temp=temp[temp.Count!=0]

Source=ColumnDataSource(temp)

tooltips = [
    ("Feature Name", "@Name"),
    ("No of Missing entites", "@Count")
]

fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400,tooltips=tooltips, x_range = temp["Name"].values, title = "Count of Missing Values")
fig1.vbar("Name", top = "Count", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)

fig1.xaxis.major_label_orientation = np.pi / 4
fig1.xaxis.axis_label = "Features"
fig1.yaxis.axis_label = "Count"

fig1.grid.grid_line_color="#feffff"

show(fig1)

Now, that's much better..

But we still have a lot of missing values

Solving the Mystery of Missing Values
In [17]:
parent=[]
names =[]
values=[]
temp=dummy_df.groupby(["category"]).heroes_gender.value_counts()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])

df1 = pd.DataFrame(
    dict(names=names, parents=parent,values=values))


parent=[]
names =[]
values=[]
temp=dummy_df.groupby(["category","heroes_gender"]).heroes_kaggle_username.count()
for k in temp.index:
    parent.append(k[0])
    names.append(k[1])
    values.append(temp.loc[k])

df2 = pd.DataFrame(
    dict(names=names, parents=parent,values=values))


fig = px.sunburst(df1, path=['names', 'parents'], values='values', color='parents',hover_data=["names"], title="Heroes associated with Categories")
fig.update_traces( 
                 textinfo='percent entry+label',
                 hovertemplate = "Industry:%{label}: <br>Count: %{value}"
                )
fig.show()


fig = px.sunburst(df2, path=['names', 'parents'], values='values', color='parents', title="Heroes associated with Categories having Kaggle Account")
fig.update_traces( 
                 textinfo='percent entry+label',
                 hovertemplate = "Industry:%{label}: <br>Count: %{value}"
                )
fig.show()

📌 Observations :

  • Heroes associated with the "Kaggle" category are expected to have a Kaggle account
  • Ignoring the counts from the "Kaggle" category (74-31=43), out of the remaining 43 only 15 Heroes have a Kaggle account.
  • This explains all 28 missing values in our CSV
  • Similarly, we have 8 Heroes who don't have a Twitter handle. It's okay. Even I don't have a Twitter handle :D
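If you want to re-derive those counts, here's a small check of mine (using dummy_df from above):

non_kaggle = dummy_df[dummy_df.category != "Kaggle"]
print("Heroes outside the Kaggle category :", len(non_kaggle))
print("...of whom have a Kaggle username  :", non_kaggle.heroes_kaggle_username.notnull().sum())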

Wanna know a fun fact?

Because of the Kaggle platform, I now have an approx 42% chance of becoming a CTDS Hero :) ...

Ahem ahem... Focus RsTaK, focus.. Let's get back to our work.

Wait? Guess I missed something.. What's that gender ratio?

Is it a Gender Biased Show?
In [18]:
gender = Episode_df.heroes_gender.value_counts()

fig = plt.figure(
    FigureClass=Waffle, 
    rows=5,
    columns=12,
    values=gender,
    colors = ('#20639B', '#ED553B'),
    title={'label': 'Gender Distribution', 'loc': 'left'},
    labels=["{}({})".format(a, b) for a, b in zip(gender.index, gender) ],
    legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.4), 'ncol': len(Episode_df), 'framealpha': 0},
    font_size=30, 
    icons = 'child',
    figsize=(12, 5),  
    icon_legend=True
)

Jokes apart, we can't make any strong statement based on this.

But yea, I'm hoping for more female Heroes :D

🧠 My Conclusion:

I won't talk much about the relation of gender with other features because:

  • The gender feature is highly biased towards one category

So we cannot conclude any relation with other features.

  • If we had a good gender ratio, then we could have talked about the impact of gender

Even if we somehow observed a positive result for the female gender, I would say it is just a coincidence. There are other factors apart from gender that may have produced that positive result.

With such a biased and small sample size for females, we cannot make any strong statement on that

In [19]:
dummy_df[dummy_df.apple_listeners.isnull()]
Out[19]:
episode_id episode_name heroes heroes_gender heroes_location heroes_nationality heroes_kaggle_username heroes_twitter_handle category flavour_of_tea recording_date recording_time release_date episode_duration youtube_url youtube_thumbnail_type youtube_impressions youtube_impression_views youtube_ctr youtube_nonimpression_views youtube_views youtube_watch_hours youtube_avg_watch_duration youtube_likes youtube_dislikes youtube_comments youtube_subscribers anchor_url anchor_thumbnail_type anchor_plays spotify_starts spotify_streams spotify_listeners apple_listeners apple_listened_hours apple_avg_listen_duration
22 E12 Freelancing in Machine Learning | Interview wi... Tuatini Godard Male France France ekami66 NaN Industry Kashmiri Kahwa 2019-07-11 Morning 2019-10-29 2684 https://www.youtube.com/watch?v=AwJpKBMog6c 0 3659 61 1.67 53 114 17 537 4 0 0 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
84 E75 Rachel Thomas | Fast.ai | Applied Ethics | Top... Rachel Thomas Female USA USA NaN math_rachel Industry Masala Chai 2020-06-16 Night 2020-06-18 2214 https://www.youtube.com/watch?v=tq_XcFubgKo&li... 3 1931 115 5.96 164 279 23 297 20 0 1 3 https://anchor.fm/chaitimedatascience/episodes... 3.0 247.0 17.0 10.0 13.0 NaN NaN NaN

📌 Observations :

Following our analysis, we realized:

  • CTDS had an episode with Tuatini Godard, episode_id "E12", exclusively on YouTube, although it was an audio-only video (if that makes sense :D) released on YouTube
  • If it was an audio-only version, then why wasn't it released on the other platforms? Hmmm... interesting. Well, I think Mr. Sanyam Bhutani can answer this best.
  • Similarly, CTDS had an episode with Rachel Thomas released on every platform except Apple

With this, we have solved all the mysteries related to the missing data. Now we can finally explore other aspects of this CSV.

But before that..

Time for a Chai Break

While having a sip of my Chai (tea), I'm just curious: why is this show named "Chai Time Data Science"?

Well, I don't have a solid answer for this, but maybe it's just because our Host loves Chai? Hmmm.. so you wanna say our Host is a more hardcore Chai lover than me?

Hey! Hold my Chai..

In [20]:
fig = go.Figure([go.Pie(labels=Episode_df.flavour_of_tea.value_counts().index.to_list(),values=Episode_df.flavour_of_tea.value_counts().values,hovertemplate = '<br>Type: %{label}</br>Count: %{value}<br>Popularity: %{percent}</br>', name = '')])
fig.update_layout(title_text="What Host drinks everytime ?", template="plotly_white", title_x=0.45, title_y = 1)
fig.data[0].marker.line.color = 'rgb(255, 255, 255)'
fig.data[0].marker.line.width = 2
fig.update_traces(hole=.4,)
fig.show()

📌 Observations :

  • Masala Chai (count=16) and Ginger Chai (count=16) seem to be the favourite Chais of our Host, followed by Herbal Tea (count=11) and Sulemani Chai (count=11)

  • Also, our Host seems to be quite experimental with Chai. He has a variety of flavours in his belly

Oh man! This time you win. You're a real Chai lover

Now, One Question arises..❓

So, does the Host drinking a specific Chai at a specific time have any relation with other factors, or with the success of CTDS?

🧠 My Conclusion:

  • Thinking practically, I don't think drinking a particular Chai at a particular time can have any real impact on the show.
  • Believing in such things is an example of superstition.
  • No doubt, as per the data it may correlate with other factors. But to address any such claim, I would like to quote a famous sentence used in statistics:


Correlation does not imply Causation

How to get More Audience?

Well, as a reward for your victory in that Chai-lover challenge, I'll try to assist CTDS with how to get more audience 😄

  • Of course, CTDS.Show episodes are a kind of gem: fully informative, covering interviews with some successful people
  • But talking statistically here, we're gonna define the success of an episode by the amount of audience it gathered
In [22]:
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)
fastai_df.release_date = pd.to_datetime(fastai_df.release_date)
Source2 = ColumnDataSource(fastai_df)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("Episode Title", "@episode_name"),
    ("Hero Present", "@heroes"),
    ("CTR", "@youtube_ctr"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]

tooltips2 = [
    ("Episode Id", "@episode_id"),
    ("Episode Title", "@episode_name"),
    ("Hero Present", "@heroes"),
    ("Subscriber Gain", "@youtube_subscribers"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]


fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "CTR Per Episode")
fig1.line("release_date", "youtube_ctr", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="youtube_ctr")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_ctr", alpha=0.2, fill_color='#55FF88', legend_label="youtube_ctr")
fig1.line("release_date", Episode_df.youtube_ctr.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Youtube CTR Mean : {:.3f}".format(Episode_df.youtube_ctr.mean()))
fig1.circle(x="release_date", y="youtube_ctr", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")

fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Click Per Impression"

fig1.grid.grid_line_color="#feffff"

fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")


fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"

fig2.grid.grid_line_color="#feffff"

show(column(fig1, fig2))

📌 Observations :

From the graphs, we can see:

  • On average, CTDS episodes have a CTR of 2.702
  • 38 out of 76 episodes (exactly 50%) have a CTR above the average
  • Episodes E1, E27 and E49 seem to have been lucky for CTDS in terms of subscriber count
  • Episode E19 had the best CTR (8.460), which is self-explanatory from the episode title: everyone loves to hear about MOOCs and ML interviews in industry
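That 50% figure is easy to verify (my own snippet; note that Episode_df here still contains E0 and E69, only the M series was split off):

above = (Episode_df.youtube_ctr > Episode_df.youtube_ctr.mean()).sum()
print("{} of {} episodes have a CTR above the mean".format(above, len(Episode_df)))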

🤖 M0-M8 Series :

  • Despite being related to fast.ai, M0-M7 don't perform that well compared to the other fast.ai-related videos
  • M0 and M1 received a decent CTR, but the rest of the M series sits quite a bit below the average CTR
  • M0 and M1 also had a better impact on subscriber gain than the rest of the M series, but overall the series didn't perform well on subscriber gain
  • All M0-M8 episodes were released on the same day, which could explain this: M0-M1, despite a good CTR, failed to hold viewers' interest in the rest of the series
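To put the M series in context, a quick comparison of the two means (my own snippet):

print("M-series mean CTR : {:.3f}".format(fastai_df.youtube_ctr.mean()))
print("E-episode mean CTR : {:.3f}".format(Episode_df.youtube_ctr.mean()))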

💭 Interestingly..

  • Episode E19, despite having the best CTR so far (8.460), didn't contribute much to subscriber count (only 7 subscribers gained)

But why ❓

  • CTR doesn't mean that a person likes the content, or that he/she will watch the video for long or subscribe to the channel
  • Maybe the video was recommended in his/her feed and he/she clicked on it just to check it out
  • Maybe he/she didn't like the content
  • Maybe he/she accidentally clicked on the video

There's a huge possibility of such cases. In conclusion, we can say a high CTR reflects cases like:

  • People clicked on the video maybe because the thumbnail or the title was appealing, or maybe because of the hero mentioned in the title/thumbnail

📃 I don't know how the YouTube algorithm works. But for the sake of answering the exceptional case of E19, my hypothetical answer would be:

  • The title contains the word "MOOC". Since nowadays everyone wants to break into Data Science, the YouTube algorithm may have suggested the video to people looking for MOOCs
  • Most other episodes have similar titles stating "Interview with ...", or contain terms that aren't that beginner-friendly, resulting in a lower CTR
  • Supporting my hypothesis, observe E27 (fast.ai in the title, a famous MOOC), E38 (a title with "Youngest Grandmaster" may have attracted clicks), E49 (getting started in Data Science), E60 (terms like NLP and open-source projects) and E75 (again fast.ai)
  • You can argue about E12, which has the word "Freelancing" in the title. Well, there will always be exceptions

Okay, what about the organic reach of the channel, or reach via Heroes?

In [23]:
Source = ColumnDataSource(Episode_df)
Source2 = ColumnDataSource(fastai_df)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("Hero Present", "@heroes"),
    ("Impression Views", "@youtube_impression_views"),
    ("Non Impression Views", "@youtube_nonimpression_views"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]

tooltips2 = [
    ("Episode Id", "@episode_id"),
    ("Hero Present", "@heroes"),
    ("Subscriber Gain", "@youtube_subscribers"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]


fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Impression-Non Impression Views Per Episode")
fig1.line("release_date", "youtube_impression_views", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Impression Views")
fig1.line("release_date", "youtube_nonimpression_views", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Non Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_impression_views", alpha=0.2, fill_color='#55FF88', legend_label="Impression Views")
fig1.varea(source=Source, x="release_date", y1=0, y2="youtube_nonimpression_views", alpha=0.2, fill_color='#e09d53', legend_label="Non Impression Views")
fig1.circle(x="release_date", y="youtube_impression_views", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series Impression Views")
fig1.circle(x="release_date", y="youtube_nonimpression_views", source = Source2, color = "#2d3328", alpha = 0.8, legend_label="M0-M8 Series Non Impression Views")



fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Total Views"

fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Subscriber Gain Per Episode")
fig2.line("release_date", "youtube_subscribers", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Subscribers")
fig2.varea(source=Source, x="release_date", y1=0, y2="youtube_subscribers", alpha=0.2, fill_color='#55FF88', legend_label="Subscribers")
fig2.circle(x="release_date", y="youtube_subscribers", source = Source2, color = "#5bab37", alpha = 0.8, legend_label="M0-M8 Series")


fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Subscriber Count"


show(column(fig1, fig2))

📌 Observations :

  • Mostly, Non-Impression Views are greater than Impression Views. CTDS seems to have a loyal fan base that shares the videos, producing more Non-Impression Views
  • In some cases, there's a sharp increase in views and a big difference between Impression and Non-Impression Views.
  • People love to see specific Heroes. The choice of hero does matter
  • Total Views (especially Non-Impression Views) definitely play a role in subscriber gain
  • Though the M series doesn't perform well overall, if you look carefully you'll realise it has better Impression Views than Non-Impression Views
Youtube Favors CTDS?
In [24]:
data1={
      "Youtube Impressions":Episode_df.youtube_impressions.sum(), 
      "Youtube Impression Views": Episode_df.youtube_impression_views.sum(), 
      "Youtube NonImpression Views" : Episode_df.youtube_nonimpression_views.sum()
     }

text=("Youtube Impressions","Youtube Impression Views","Youtube NonImpression Views")

fig = go.Figure(go.Funnelarea(
    textinfo= "text+value",
    text =list(data1.keys()),
    values = list(data1.values()),
    title = {"position": "top center", "text": "Youtube and Views"},
      name = '', showlegend=False,customdata=['Video Thumbnail shown to Someone', 'Views From Youtube Impressions', 'Views without Youtube Impressions'], hovertemplate = '%{customdata} <br>Count: %{value}</br>'
  ))
fig.show()

📌 Observations :

A few things to note here:

  • Well, I haven't cracked the YouTube algorithm, but it seems YouTube has its blessings over CTDS
  • CTDS episodes convert only 2.84% of YouTube Impressions into viewers
  • 65.12% of CTDS views are Non-Impression Views
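Both figures fall straight out of the sums used in the funnel above; a quick re-check (my own snippet):

imp = Episode_df.youtube_impressions.sum()
imp_views = Episode_df.youtube_impression_views.sum()
non_imp = Episode_df.youtube_nonimpression_views.sum()
print("Impressions converted into views : {:.2f}%".format(imp_views / imp * 100))
print("Share of Non-Impression views : {:.2f}%".format(non_imp / (imp_views + non_imp) * 100))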

It seems clear that the YouTube thumbnail and the video title are the important factors in deciding whether a person will click on a video or not.

Wait, you want some figures?

Do Thumbnails really matter ?
In [25]:
colors = ["red", "olive", "darkred", "goldenrod"]

index={
    0:"YouTube default image",
    1:"YouTube default image with custom annotation",
    2:"Mini Series: Custom Image with annotations",
    3:"Custom image with CTDS branding, Title and Tags"
}

p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS CTR")

base, lower, upper = [], [], []

for each_thumbnail_ref in index:
    if each_thumbnail_ref==2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type==each_thumbnail_ref].youtube_ctr 
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type==each_thumbnail_ref].youtube_ctr
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)

    source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
    p.add_layout(
        Whisker(source=source_error, base="base", lower="lower", upper="upper")
    )

    tooltips = [
        ("Episode Id", "@episode_id"),
        ("Hero Present", "@heroes"),
        ]

    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean CTR for Thumbnail Type {} : {:.3f} ".format(index[each_thumbnail_ref], temp.mean()))
show(p)
Mean CTR for Thumbnail Type YouTube default image : 2.565 
Mean CTR for Thumbnail Type YouTube default image with custom annotation : 2.725 
Mean CTR for Thumbnail Type Mini Series: Custom Image with annotations : 1.969 
Mean CTR for Thumbnail Type Custom image with CTDS branding, Title and Tags : 3.115 

📌 Observations :

From the above whisker plot (mean ± one standard deviation):

  • It seems the type of thumbnail does have some impact on CTR
  • Despite the YouTube default image being used most of the time, its average CTR is the lowest compared to the other thumbnail types
  • Since the counts of the other thumbnail types are small, we can't say which thumbnail is best
  • CTR depends on other factors too, like the title and the hero featured in the episode. Still, we can fairly confidently say that thumbnails other than the YouTube default image attract more users to click on the video
  • As we said about the M series, M0-M1 failed to keep people interested in the series.
  • Although the M series' mean CTR is the lowest, we can observe that M0-M1 have a better CTR than the majority of episodes with the YouTube default thumbnail.

In short, don't use the default YouTube image as a thumbnail
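For the record, here's how often each YouTube thumbnail type was actually used (a quick check of mine; remember the M series lives in fastai_df after the split above):

print(Episode_df.youtube_thumbnail_type.value_counts().sort_index())
print(fastai_df.youtube_thumbnail_type.value_counts().sort_index())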

How much Viewers wanna watch?

Episodes have different durations.

In order to get a meaningful insight, I'll calculate the percentage of each episode that was watched..

In [26]:
a=Episode_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]]
a["percentage"]=(a.youtube_avg_watch_duration/a.episode_duration)*100

b=fastai_df[["episode_id", "episode_duration", "youtube_avg_watch_duration"]]
b["percentage"]=(b.youtube_avg_watch_duration/b.episode_duration)*100

temp=a.append(b).reset_index().drop(["index"], axis=1)

Source = ColumnDataSource(temp)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("Episode Duration", "@episode_duration"),
    ("Youtube Avg Watch_duration Views", "@youtube_avg_watch_duration"),
    ("Percentage of video watched", "@percentage"),
    ]


fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_range  = temp["episode_id"].values, title = "Percentage of Episode Watched")
fig1.line("episode_id", "percentage", source = Source, color = "#03c2fc", alpha = 0.8)
fig1.line("episode_id", temp.percentage.mean(), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Mean : {:.3f}".format(temp.percentage.mean()))

fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Percentage"

fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))

📌 Observations :

  • On average, viewers watch 13.065% of an episode
  • But most episodes have a watched percentage below this threshold.

How does that make sense ❓

  • That's because we have some outliers like E0 and the M series, with watched percentages over 20%.

But why did such outliers occur ❓

  • Because they have a low episode duration

In this fast-moving world, humans get bored of things very easily. E0 and the M series, having low episode durations, got viewers to watch a larger share of them.

Whether they subscribe to the channel or not is a different thing. That depends on the content.

In order to give more to viewers and the community, shorter episodes could be a big step
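To see those outliers explicitly, we can reuse temp from the cell above (it already has the percentage column):

outliers = temp[temp.percentage > 20]
print(outliers[["episode_id", "episode_duration", "percentage"]])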

Performance on Other Platforms
In [27]:
colors = ["red", "olive", "darkred", "goldenrod"]

index={
    0:"YouTube default playlist image",
    1:"CTDS Branding",
    2:"Mini Series: Custom Image with annotations",
    3:"Custom image with CTDS branding, Title and Tags"
}

p = figure(background_fill_color="#ebf4f6", plot_width=600, plot_height=300, title="Thumbnail Type VS Anchor Plays")

base, lower, upper = [], [], []

for each_thumbnail_ref in index:
    if each_thumbnail_ref==2:
        temp = fastai_df[fastai_df.youtube_thumbnail_type==each_thumbnail_ref].anchor_plays 
    else:
        temp = Episode_df[Episode_df.youtube_thumbnail_type==each_thumbnail_ref].anchor_plays
    mpgs_mean = temp.mean()
    mpgs_std = temp.std()
    lower.append(mpgs_mean - mpgs_std)
    upper.append(mpgs_mean + mpgs_std)
    base.append(each_thumbnail_ref)

    source_error = ColumnDataSource(data=dict(base=base, lower=lower, upper=upper))
    p.add_layout(
        Whisker(source=source_error, base="base", lower="lower", upper="upper")
    )

    tooltips = [
        ("Episode Id", "@episode_id"),
        ("Hero Present", "@heroes"),
        ]

    color = colors[each_thumbnail_ref % len(colors)]
    p.circle(y=temp, x=each_thumbnail_ref, color=color, legend_label=index[each_thumbnail_ref])
    print("Mean Anchor Plays for Thumbnail Type {} : {:.3f} ".format(index[each_thumbnail_ref], temp.mean()))
show(p)
Mean Anchor Plays for Thumbnail Type YouTube default playlist image : 620.939 
Mean Anchor Plays for Thumbnail Type CTDS Branding : 534.500 
Mean Anchor Plays for Thumbnail Type Mini Series: Custom Image with annotations : 309.000 
Mean Anchor Plays for Thumbnail Type Custom image with CTDS branding, Title and Tags : 387.375 

📌 Observations :

  • 55.40% of the Anchor thumbnails have CTDS branding
  • But on average, podcasts with the YouTube default playlist image perform better in terms of Anchor plays
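To double-check that share, here's a one-liner of mine (normalize=True turns the counts into fractions):

print((Episode_df.anchor_thumbnail_type.value_counts(normalize=True) * 100).round(2))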
In [28]:
Episode_df.release_date = pd.to_datetime(Episode_df.release_date)
Source = ColumnDataSource(Episode_df)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("Episode Title", "@episode_name"),
    ("Hero Present", "@heroes"),
    ("Anchor Plays", "@anchor_plays"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]

tooltips2 = [
    ("Episode Id", "@episode_id"),
    ("Episode Title", "@episode_name"),
    ("Hero Present", "@heroes"),
    ("Spotify Starts Plays", "@spotify_starts"),
    ("Spotify Streams", "@spotify_streams"),
    ("Spotify Listeners", "@spotify_listeners"),
    ("Category", "@category"),
    ("Date", "@release_date{%F}"),
    ]


fig1 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Anchor Plays Per Episode")
fig1.line("release_date", "anchor_plays", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Anchor Plays")
fig1.line("release_date", Episode_df.anchor_plays.mean(), source = Source, color = "#f2a652", alpha = 0.8, line_dash="dashed", legend_label="Anchor Plays Mean : {:.3f}".format(Episode_df.youtube_ctr.mean()))


fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Release Date"
fig1.yaxis.axis_label = "Anchor Plays"

fig2 = figure(background_fill_color="#ebf4f6", plot_width = 600, plot_height = 400, x_axis_type = "datetime", title = "Performance on Spotify Per Episode")
fig2.line("release_date", "spotify_starts", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Spotify Starts Plays")
fig2.line("release_date", "spotify_streams", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Spotify Streams")
fig2.line("release_date", "spotify_listeners", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Spotify Listeners")


fig2.add_tools(HoverTool(tooltips=tooltips2,formatters={"@release_date": "datetime"}))
fig2.xaxis.axis_label = "Release Date"
fig2.yaxis.axis_label = "Total Plays"


show(column(fig1,fig2))
  • It's 2020 and it seems nowadays people aren't much into podcasts and audio
Distribution of Heroes by Country and Nationality
In [29]:
temp=Episode_df.groupby(["heroes_location", "heroes"])["heroes_nationality"].value_counts()

parent=[]
names =[]
values=[]
heroes=[]
for k in temp.index:
    parent.append(k[0])
    heroes.append(k[1])
    names.append(k[2])
    values.append(temp.loc[k])

df = pd.DataFrame(
    dict(names=names, parents=parent,values=values, heroes=heroes))
df["World"] = "World"

fig = px.treemap(
    df,
    path=['World', 'parents','names','heroes'], values='values',color='parents')

fig.update_layout( 
    width=1000,
    height=700,
    title_text="Distribution of Heores by Country and Nationality")
fig.show()
  • Most of our Heroes live in the USA, but there's quite a range of diversity in the Heroes' nationalities within each country, which is good to know
Any Relation between Release Dates of Episodes?
In [30]:
a=Episode_df.release_date
b=(a-a.shift(periods=1, fill_value='2019-07-21')).astype('timedelta64[D]')
d = {'episode_id':Episode_df.episode_id, 'heroes':Episode_df.heroes, 'release_date': Episode_df.release_date, 'day_difference': b}
temp = pd.DataFrame(d)

Source = ColumnDataSource(temp)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("Hero Present", "@heroes"),
    ("Day Difference", "@day_difference"),
    ("Date", "@release_date{%F}"),
    ]

fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 400, x_axis_type  = "datetime", title = "Day difference between Each Release Date")
fig1.line("release_date", "day_difference", source = Source, color = "#03c2fc", alpha = 0.8)

fig1.add_tools(HoverTool(tooltips=tooltips,formatters={"@release_date": "datetime"}))
fig1.xaxis.axis_label = "Date"
fig1.yaxis.axis_label = "No of Days"

fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))

📌 Observations :

  • Seems 2020 made Sanyam a bit more consistent with his release dates, with a difference of 3 or 4 days between releases up to the end of the dataset window (18th June)
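A quick way to see that consistency in numbers, reusing temp from the cell above:

rel_2020 = temp[temp.release_date >= "2020-01-01"]
print(rel_2020.day_difference.value_counts())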

Do I know when the anniversary interview episode will release?

Because of time shortage, I didn't scrape new data myself.

Though I did visit his YouTube channel and manually examined his release patterns:

Episode Id Release Day Difference
E75 2020-06-18 4
E76 2020-06-21 3
E77 2020-06-28 7
E78 2020-07-02 4
E79 2020-07-09 7
E80 2020-07-12 3

Maybe he's experimenting with a new pattern

Can we pinpoint when the 1-year anniversary interview episode will be released ❓ Actually, no!

Though a small pattern can be observed in the release dates, he has a somewhat odd recording pattern:

  • Who knows, he may have 3-4 videos already recorded and ready to be released.
  • Even if he records the anniversary interview episode today, we cannot say when he'll release it

As per his release pattern, he's been releasing episodes every 3 or 4 days.

  • Considering E77 and E79 as exceptions, he'll most probably release E81 on 2020-07-15 or 2020-07-16
  • If he's experimenting with a new pattern (a 7-day gap after one video), then E81 will be released on 2020-07-19, followed by E82 on 2020-07-22 or 2020-07-23 (see the sketch below)
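As a toy sketch of those two hypotheses (the dates are hardcoded from the manually examined table above, not pulled from the dataset; pandas is already imported as pd):

last_release = pd.Timestamp("2020-07-12")  # E80, from the table above
for gap in (3, 4):
    print("E81 if the 3-4 day pattern holds :", (last_release + pd.Timedelta(days=gap)).date())
print("E81 if the 7-day pattern holds :", (last_release + pd.Timedelta(days=7)).date())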

If I'm correct, then Mr. Sanyam Bhutani, please don't forget to give a small shoutout to me 😄

Exploring Raw / Cleaned Subtitles
Go back to our Guide


So, we have 2 directories here:

  • Raw Subtitles : transcripts in text format
  • Cleaned Subtitles : transcripts in CSV format with timestamps
In [31]:
def show_script(id):
    return pd.read_csv("../input/chai-time-data-science/Cleaned Subtitles/{}.csv".format(id))
In [32]:
df = show_script("E1")
df
Out[32]:
Time Speaker Text
0 0:13 Sanyam Bhutani Hey, this is Sanyam Bhutani and you're listeni...
1 1:49 Abhishek Thakur Thank you very much for the invitation. It's a...
2 1:53 Sanyam Bhutani Today, you're the world's only Triple Grandmas...
3 2:12 Abhishek Thakur Yeah cool story. Data science was never my int...
4 2:41 Sanyam Bhutani And this was before the boom had happened. And...
... ... ... ...
221 48:57 Sanyam Bhutani Not recently.
222 49:00 Abhishek Thakur See you there soon.
223 49:01 Sanyam Bhutani Alright. Thanks. Thanks a lot.
224 49:03 Abhishek Thakur Thank you. Bye bye.
225 49:16 Sanyam Bhutani Thank you so much for listening to this episod...

226 rows × 3 columns

A Small Shoutout to Ramshankar Yadhunath

I would like to give a small shoutout to Ramshankar Yadhunath for providing a feature-engineering script in his Kernel.

Hey guys, if you've followed me till here, then don't forget to check out his Kernel too.

In [33]:
# feature engineer the transcript features
def conv_to_sec(x):
    """ Time to seconds """

    t_list = x.split(":")
    if len(t_list) == 2:
        m = t_list[0]
        s = t_list[1]
        time = int(m) * 60 + int(s)
    else:
        h = t_list[0]
        m = t_list[1]
        s = t_list[2]
        time = int(h) * 60 * 60 + int(m) * 60 + int(s)
    return time


def get_durations(nums, size):
    """ Get durations i.e the time for which each speaker spoke continuously """

    diffs = []
    for i in range(size - 1):
        diffs.append(nums[i + 1] - nums[i])
    diffs.append(30)  # standard value for all end of the episode CFA by Sanyam
    return diffs


def transform_transcript(sub, episode_id):
    """ Transform the transcript of the given episode """

    # create the time second feature that converts the time into the unified qty. of seconds
    sub["Time_sec"] = sub["Time"].apply(conv_to_sec)

    # get durations
    sub["Duration"] = get_durations(sub["Time_sec"], sub.shape[0])

    # providing an identity to each transcript
    sub["Episode_ID"] = episode_id
    sub = sub[["Episode_ID", "Time", "Time_sec", "Duration", "Speaker", "Text"]]

    return sub


def combine_transcripts(sub_dir):
    """ Combine all the 75 transcripts of the ML Heroes Interviews together as one dataframe """

    episodes = []
    for i in range(1, 76):
        file = "E" + str(i) + ".csv"
        try:
            sub_epi = pd.read_csv(os.path.join(sub_dir, file))
            sub_epi = transform_transcript(sub_epi, ("E" + str(i)))
            episodes.append(sub_epi)
        except:
            continue
    return pd.concat(episodes, ignore_index=True)


# create the combined transcript dataset
sub_dir = "../input/chai-time-data-science/Cleaned Subtitles"
transcripts = combine_transcripts(sub_dir)
transcripts.head()
Out[33]:
Episode_ID Time Time_sec Duration Speaker Text
0 E1 0:13 13 96 Sanyam Bhutani Hey, this is Sanyam Bhutani and you're listeni...
1 E1 1:49 109 4 Abhishek Thakur Thank you very much for the invitation. It's a...
2 E1 1:53 113 19 Sanyam Bhutani Today, you're the world's only Triple Grandmas...
3 E1 2:12 132 29 Abhishek Thakur Yeah cool story. Data science was never my int...
4 E1 2:41 161 8 Sanyam Bhutani And this was before the boom had happened. And...

Now, we have some data to work with.

Thanking Ramshankar Yadhunath once again, let's get it started ..

Note : Transcripts for E0 and E4 are missing

Intro is Bad for CTDS ?

From our previous analysis, we realised the majority of episodes have quite a low watch time, i.e. less than 13.065% of the total duration on average.

In such a case, how much does the CTDS intro hurt the show, in terms of intro duration?

Let's find out...

In [34]:
# Average watch duration per episode (E0 and E4 dropped: their transcripts are missing)
temp = Episode_df[["episode_id","youtube_avg_watch_duration"]]
temp = temp[(temp.episode_id!="E0") & (temp.episode_id!="E4")]

# Intro duration = duration of the first utterance of each episode
intro = []
for i in transcripts.Episode_ID.unique():
    intro.append(transcripts[transcripts.Episode_ID==i].iloc[0].Duration)
temp["Intro_Duration"] = intro

# Content actually watched on average, once the intro is subtracted
temp["diff"] = temp.youtube_avg_watch_duration - temp.Intro_Duration

Source = ColumnDataSource(temp)

tooltips = [
    ("Episode Id", "@episode_id"),
    ("YouTube Avg Watch Duration", "@youtube_avg_watch_duration"),
    ("Intro Duration", "@Intro_Duration"),
    ("Avg Duration of Content Watched", "@diff"),
    ]


fig1 = figure(background_fill_color="#ebf4f6", plot_width = 1000, plot_height = 600, x_range  = temp["episode_id"].values, title = "Impact of Intro Durations")
fig1.line("episode_id", "youtube_avg_watch_duration", source = Source, color = "#03c2fc", alpha = 0.8, legend_label="Youtube Avg Watch_duration Views")
fig1.line("episode_id", "Intro_Duration", source = Source, color = "#f2a652", alpha = 0.8, legend_label="Intro Duration")
fig1.line("episode_id", "diff", source = Source, color = "#03fc5a", alpha = 0.8, legend_label="Avg Duration of Content Watched")


fig1.add_tools(HoverTool(tooltips=tooltips))
fig1.xaxis.axis_label = "Episode Id"
fig1.yaxis.axis_label = "Duration (seconds)"

fig1.xaxis.major_label_orientation = np.pi / 3
show(column(fig1))
In [35]:
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 5 minutes".format(len(temp[temp["diff"]<300])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 4 minutes".format(len(temp[temp["diff"]<240])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 3 minutes".format(len(temp[temp["diff"]<180])/len(temp)*100))
print("{:.2f} % of Episodes have Avg Duration of Content Watched less than 2 minutes".format(len(temp[temp["diff"]<120])/len(temp)*100))
print("In {} case, Viewer left in the Intro Duration".format(len(temp[temp["diff"]<0])))
81.08 % of Episodes have Avg Duration of Content Watched less than 5 minutes
72.97 % of Episodes have Avg Duration of Content Watched less than 4 minutes
45.95 % of Episodes have Avg Duration of Content Watched less than 3 minutes
22.97 % of Episodes have Avg Duration of Content Watched less than 2 minutes
In 1 case, the viewer left during the intro itself
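
For a fuller picture than these thresholds, here is a quick distribution summary (my addition) of the average content watched beyond the intro, in seconds:

In [ ]:
# Summary statistics of the average content watched beyond the intro (seconds)
print(temp["diff"].describe())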

🧠 My Conclusion:

  • Observing the graph and stats, it's clear that it's high time to shorten the intro
  • We don't have transcripts for the M series, where the percentage of the episode watched was higher, i.e. those episodes had a short duration
  • Concluding from the analysis, we can now confidently say that shorter episodes will definitely help

There are lots of things to improve:

  • Shorter videos can be delivered, highlighting the important aspects of the shows
  • Short summaries can be provided in the description; after reading them, viewers might commit to a longer show (depending on their interest in the topic reflected in the summary)
  • The full-length show can be provided as a podcast on Apple, Spotify and Anchor, so a viewer who enjoyed the shorter videos and summaries can get the full show from there

With 45.95% of episodes having an average duration of content watched below 3 minutes, we can hardly gain any useful insight or comment on the quality of the content delivered.

But okay! We can have some fun though 😄

Who Speaks More ?

In [36]:
# Split each episode's transcript into host (Sanyam Bhutani) and hero utterances
host_text = []
hero_text = []
for i in transcripts.Episode_ID.unique():
    host_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker=="Sanyam Bhutani")].Text])
    hero_text.append([i, transcripts[(transcripts.Episode_ID==i) & (transcripts.Speaker!="Sanyam Bhutani")].Text])

# Count the number of utterances per episode for host and hero
temp_host = {}
temp_hero = {}
for i in range(len(transcripts.Episode_ID.unique())):
    host_text_count = len(host_text[i][1])
    hero_text_count = len(hero_text[i][1])
    temp_host[hero_text[i][0]] = host_text_count
    temp_hero[hero_text[i][0]] = hero_text_count
    
def getkey(d):
    """ Return the keys of a dict as a list """
    return list(d.keys())

def getvalue(d):
    """ Return the values of a dict as a list """
    return list(d.values())
In [37]:
Source = ColumnDataSource(data=dict(
    x=getkey(temp_host),
    y=getvalue(temp_host),
    a=getkey(temp_hero),
    b=getvalue(temp_hero),
))

tooltips = [
    ("Episode Id", "@x"),
    ("No of Times Host Speaks", "@y"),
    ("No of Times Hero Speaks", "@b"),
]

fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, tooltips=tooltips,plot_height = 400, x_range = getkey(temp_host), title = "Who Speaks More ?")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Times Host Speaks")
fig1.vbar("a", top = "b", source = Source, width = 0.4, color = "#e7f207", alpha=.8, legend_label="No of Times Hero Speaks")

fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "Count"

fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4

show(fig1)
  • Excluding a few episodes, the ratio between the number of times each one speaks is quite well maintained (a quick check follows below)
  • E69 was an AMA episode; that's why there is no Hero
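
As promised, a back-of-the-envelope check (my own sketch, not part of the original analysis) of the host-to-hero utterance ratio:

In [ ]:
# Host-to-hero utterance ratio per episode (episodes with no hero lines, e.g. the AMA, are skipped)
from statistics import mean
ratios = [temp_host[ep] / temp_hero[ep] for ep in temp_host if temp_hero[ep] > 0]
print("Mean host/hero utterance ratio: {:.2f}".format(mean(ratios)))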

Frequency of Questions Per Episode

In [38]:
# Count the question marks in the host's lines, episode by episode
ques = 0
total_ques = {}
for episode in range(len(transcripts.Episode_ID.unique())):
    for each_text in range(len(host_text[episode][1])):
        ques += host_text[episode][1].reset_index().iloc[each_text].Text.count("?")
    total_ques[hero_text[episode][0]] = ques
    ques = 0
In [39]:
from statistics import mean 
Source = ColumnDataSource(data=dict(
    x=getkey(total_ques),
    y=getvalue(total_ques),
))

tooltips = [
    ("Episode Id", "@x"),
    ("No of Questions", "@y"),
]

fig1 = figure(background_fill_color="#ebf4f6",plot_width = 1000, plot_height = 400,tooltips=tooltips, x_range = getkey(temp_host), title = "Questions asked Per Episode")
fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8, legend_label="No of Questions asked Per Episode")
fig1.line("x", mean(getvalue(total_ques)), source = Source, color = "#f2a652", alpha = 0.8,line_dash="dashed", legend_label="Average Questions : {:.3f}".format(mean(getvalue(total_ques))))

fig1.xaxis.axis_label = "Episode"
fig1.yaxis.axis_label = "No of Questions"

fig1.legend.location = "top_left"

fig1.grid.grid_line_color="#feffff"
fig1.xaxis.major_label_orientation = np.pi / 4

show(fig1)
  • On average, around 30 questions are asked by the host per episode (a vectorised double-check follows below)
  • E69 being an AMA episode explains why it has such a high number of questions
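
The same counts can be reproduced with a vectorised pandas snippet (my sketch; note it must run before the text cleaning below, which strips punctuation):

In [ ]:
# Questions per episode, counted directly from the host's lines with pandas
host_q = (transcripts.loc[transcripts.Speaker == "Sanyam Bhutani", ["Episode_ID", "Text"]]
          .assign(q=lambda d: d.Text.astype(str).str.count(r"\?"))
          .groupby("Episode_ID").q.sum())
print("Average questions per episode: {:.3f}".format(host_q.mean()))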

Favourite Text ?

⚒️ About the Function :

Well, I'm gonna write a small function: you pass a Hero's name and it will plot the 7 most common short utterances spoken by that person.

But before that, I would like to give a small shoutout to Parul Pandey for providing a text cleaning script in her Kernel.

In [40]:
import re
import string
import nltk

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


def text_preprocessing(text):
    """
    Cleaning and parsing the text.

    """
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    nopunc = clean_text(text)
    tokenized_text = tokenizer.tokenize(nopunc)
    #remove_stopwords = [w for w in tokenized_text if w not in stopwords.words('english')]
    combined_text = ' '.join(tokenized_text)
    return combined_text
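
If you'd rather surface content words instead of fillers, the commented-out stopword line can be brought back to life; here is a hedged sketch (it assumes NLTK's stopword corpus has been downloaded):

In [ ]:
# Optional: remove English stopwords before counting favourite utterances
# (one-time setup: import nltk; nltk.download('stopwords'))
from nltk.corpus import stopwords

def remove_stopwords(text):
    sw = set(stopwords.words('english'))
    return ' '.join(w for w in text.split() if w not in sw)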
In [41]:
transcripts['Text'] = transcripts['Text'].apply(str).apply(lambda x: text_preprocessing(x))
In [42]:
def get_data(speakername=None):
    """ Plot the 7 most common short utterances of the given speaker """
    label = []
    value = []

    text_data = transcripts[(transcripts.Speaker==speakername)].Text.tolist()
    # keep only short utterances (fewer than 10 spaces) to surface catchphrases
    temp = list(filter(lambda x: x.count(" ") < 10, text_data))

    # 7 most frequent utterances and their counts
    freq = nltk.FreqDist(temp).most_common(7)
    for each in freq:
        label.append(each[0])
        value.append(each[1])
        
        
    Source = ColumnDataSource(data=dict(
        x=label,
        y=value,
    ))

    tooltips = [
        ("Favourite Text", "@x"),
        ("Frequency", "@y"),
    ]

    fig1 = figure(background_fill_color="#ebf4f6",plot_width = 600, tooltips=tooltips, plot_height = 400, x_range = label, title = "Favourite Text")
    fig1.vbar("x", top = "y", source = Source, width = 0.4, color = "#76b4bd", alpha=.8)

    fig1.xaxis.axis_label = "Text"
    fig1.yaxis.axis_label = "Frequency"


    fig1.grid.grid_line_color="#feffff"
    fig1.xaxis.major_label_orientation = np.pi / 4

    show(fig1)
In [43]:
get_data(speakername="Sanyam Bhutani")

📌 Observations :

  • "Okay" and "Yeah" seem to be the favourite words of Sanyam Bhutani
  • He seems to have a different laugh for every scenario, I guess 😄
  • Sanyam Bhutani speaks in every transcript we have, so it's only natural that high-frequency utterances show up for him alone
  • But you can still try other speakers. Who knows, I might be missing something interesting 😄
Tip: Pass your favourite hero's name to the function get_data() and you're good to go
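
For example, to look at the Hero of Episode 1 (any name that appears in the Speaker column works):

In [ ]:
# Example usage: favourite short utterances of a hero
get_data(speakername="Abhishek Thakur")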

End Notes

With this, I end my analysis of this dataset named Chai Time Data Science | CTDS.Show provided by Mr. Vopani and Mr. Sanyam Bhutani.

It was a wonderful experience for me.

If my analysis or way of storytelling has somehow hurt any sentiments, I apologize for that.

And yea, congratulations to Chai Time Data Science | CTDS.Show for completing a successful 1 year journey.

Now I can finally enjoy my Chai break in peace :)