Homework 2: Data Collection

## Notebook Presentation [6 points]
# For web and html
import requests
from time import sleep
from bs4 import BeautifulSoup

#For working with raw files
import zipfile

#For working with data
import pandas as pd
from datetime import datetime

Problem 1: What’s the secret code? [4 points]

Q 1. Use the library we learned in class to get the page’s HTML. [ 1 point ]. See 2:36 in the Solutions Video

BONUS_URL = "https://csc380.beingenfa.com/Bonus/1.html"
bonus_req_obj = requests.get(BONUS_URL)
<Response [200]>
bonus_page_html = bonus_req_obj.text
Q 2. Find the section with the secret code by using Beautiful Soup’s find function [ 2 points ]. See 5:53 in the Solutions Video

bonus_bs4_obj = BeautifulSoup(bonus_page_html,"html.parser")
secret_section = bonus_bs4_obj.find("div","Secret")
Q 3. Clean up the secret code and print it as “The Secret Code is: CSC380” [ 1 point ]. See 10:30 in the Solutions Video

secret_code_tag = list(secret_section.children)[1]
secret_code = secret_code_tag.name
# print() already separates its arguments with one space, so the label needs no trailing space
print("The Secret Code is:", secret_code)
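The three steps of Problem 1 can be tried offline on a small stand-in page. This is only a sketch: `sample_html` is hypothetical markup, not the real bonus page, so the child index and the cleanup step may differ there. It shows that a string passed as the second argument to `find` is matched against the CSS class, and that a tag's `.name` is the tag name while `get_text` returns its contents.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the bonus page; the real page may differ.
sample_html = """
<html><body>
  <div class="Intro">Welcome</div>
  <div class="Secret">
    <p>  CSC380  </p>
  </div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A plain string as the second argument is treated as a CSS class filter.
secret_div = soup.find("div", "Secret")

# .children yields whitespace text nodes as well as tags:
# [whitespace, <p>, whitespace], so the tag sits at index 1 here.
code_tag = list(secret_div.children)[1]

print(code_tag.name)  # the tag's name, not its contents

# get_text(strip=True) returns the contents with surrounding whitespace removed.
secret = code_tag.get_text(strip=True)
message = f"The Secret Code is: {secret}"
print(message)
```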
Problem 2. Random Facts API [ 11 points ].

RANDOM_FACT_WEBSITE_URL = "https://uselessfacts.jsph.pl"
RANDOM_FACTS_ENDPOINT = "/api/v2/facts/random"
TODAY_RANDOM_FACT_ENDPOINT = "/api/v2/facts/today"

Q1. Find the URL for random facts API. [1 point]. See 13:38 in the Solutions Video
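The later cells use `random_facts_api` and `todays_random_fact_api` without showing where they are defined. One plausible definition, assuming the full URLs are simply the site URL followed by each endpoint path:

```python
RANDOM_FACT_WEBSITE_URL = "https://uselessfacts.jsph.pl"
RANDOM_FACTS_ENDPOINT = "/api/v2/facts/random"
TODAY_RANDOM_FACT_ENDPOINT = "/api/v2/facts/today"

# Assumption: base URL + endpoint path gives the full API URL.
random_facts_api = RANDOM_FACT_WEBSITE_URL + RANDOM_FACTS_ENDPOINT
todays_random_fact_api = RANDOM_FACT_WEBSITE_URL + TODAY_RANDOM_FACT_ENDPOINT

print(random_facts_api)
print(todays_random_fact_api)
```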


Qs. 2, 3, 4. Collect 10 Random Facts [3 points]. See 16:48 in the Solutions Video

random_facts_list_of_json = []
for fact_no in range(nos_of_facts):

    random_fact_req_obj = requests.get(random_facts_api)

    if random_fact_req_obj.status_code == 200:
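The body of the status check is not shown above. A minimal sketch of one way the collection loop could be completed follows, assuming each successful response's JSON is appended to the list and that a short pause is taken between requests. `collect_facts` and `FakeResponse` are hypothetical helpers, introduced only so the sketch runs without network access.

```python
from time import sleep


def collect_facts(get, url, nos_of_facts=10, pause_seconds=1.0):
    """Call get(url) repeatedly and keep the JSON body of every 200 response."""
    facts = []
    for _ in range(nos_of_facts):
        response = get(url)
        if response.status_code == 200:
            facts.append(response.json())
        sleep(pause_seconds)  # pause between requests to be polite to the API
    return facts


class FakeResponse:
    """Hypothetical stand-in for requests.Response so the sketch runs offline."""
    status_code = 200

    def json(self):
        return {"id": "abc", "text": "A hypothetical fact."}


facts = collect_facts(lambda url: FakeResponse(), "https://example.invalid",
                      nos_of_facts=3, pause_seconds=0)
print(len(facts))
```

With the real API, the call would look like `collect_facts(requests.get, random_facts_api)`.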


Q.5 Creating the Dataframe [1 point]. See 22:26 in the Solutions Video

random_facts_df = pd.DataFrame(random_facts_list_of_json)
Q 6. Display Full Facts [2 points]. See 24:03 in the Solutions Video

random_facts_series = random_facts_df['text'] # Part a
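By default pandas shortens long strings when it displays them, which is why the facts may appear cut off. A sketch of one way to show them in full, using the `display.max_colwidth` option on made-up data:

```python
import pandas as pd

# Hypothetical fact text, long enough to be shortened at the default column width.
long_fact = "A hypothetical fact that is deliberately written to run well past fifty characters."
facts_df = pd.DataFrame({"text": [long_fact, "A short one."]})

# Temporarily lift the column-width limit; None means "no limit".
with pd.option_context("display.max_colwidth", None):
    full_view = repr(facts_df["text"])

print(full_view)
```

`pd.set_option("display.max_colwidth", None)` applies the same setting for the rest of the session instead of only inside the `with` block.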
Q 7. Show 3 random facts from the data frame [1 point]. See 25:58 in the Solutions Video
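No code is shown for this question. One way to draw three random rows is `DataFrame.sample`; the data below is made up, and `random_state` is only set so the sketch is repeatable.

```python
import pandas as pd

# Hypothetical facts standing in for the collected data.
facts_df = pd.DataFrame({"text": [f"Hypothetical fact {i}" for i in range(10)]})

# sample() draws rows without replacement by default.
three_facts = facts_df.sample(n=3, random_state=380)
print(three_facts)
```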

Q8. What is today’s random fact? [3 points]. See 26:29 in the Solutions Video

Part a

today_fact_req_obj = requests.get(todays_random_fact_api)
random_fact_of_the_day = today_fact_req_obj.json()['text'] #Part a
Part b

time_rn = datetime.today()
print('Time right now is :',time_rn.strftime("%Y-%m-%d %I:%M:%S %p"))
Part c

# An f-string avoids the doubled spaces that print() adds between comma-separated arguments
print(f"At {time_rn.strftime('%Y-%m-%d %I:%M:%S %p')}, the random fact of the day is: {random_fact_of_the_day}")
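The format string used in Parts b and c can be checked against a fixed timestamp. In this sketch `fixed_time` and the fact text are made up; `%I` is the 12-hour clock and `%p` is the AM/PM marker, whose spelling depends on the locale.

```python
from datetime import datetime

# A fixed, hypothetical timestamp so the result is repeatable.
fixed_time = datetime(2024, 2, 5, 15, 4, 9)

stamp = fixed_time.strftime("%Y-%m-%d %I:%M:%S %p")
print(stamp)  # 2024-02-05 03:04:09 PM in the default C locale

fact = "A hypothetical fact."
sentence = f"At {stamp}, the random fact of the day is: {fact}"
print(sentence)
```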
Problem 3. Movies and Shows [29 points]

Q1. Download the following dataset [2 points]. See 34:18 in the Solutions Video

dataset_names = ['hulu','disney','prime','netflix']
for dataset_name in dataset_names:
    with zipfile.ZipFile(dataset_name+".zip","r") as zip_ref:
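The body of the `with` block is not shown. Since the next question reads files such as `hulu/hulu_titles.csv`, a reasonable guess is that each archive is extracted into a folder named after the dataset. The sketch below assumes that, and builds a tiny archive in a temporary directory so it runs without the real downloads.

```python
import tempfile
import zipfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())

# Build a small, hypothetical archive standing in for hulu.zip.
archive_path = work_dir / "hulu.zip"
with zipfile.ZipFile(archive_path, "w") as zip_out:
    zip_out.writestr("hulu_titles.csv", "show_id,title\ns1,Example Show\n")

# Assumption: each archive is unpacked into a folder named after the dataset.
target_dir = work_dir / "hulu"
with zipfile.ZipFile(archive_path, "r") as zip_ref:
    zip_ref.extractall(target_dir)

extracted = sorted(p.name for p in target_dir.iterdir())
print(extracted)
```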

Q2. Create one large dataframe [3 points]. See 39:10 in the Solutions Video

Part a

hulu_df = pd.read_csv('hulu/hulu_titles.csv')
netflix_df = pd.read_csv('netflix/netflix_titles.csv')
prime_df = pd.read_csv('prime/amazon_prime_titles.csv')
disney_df = pd.read_csv('disney/disney_plus_titles.csv')
Part b

hulu_df['Platform'] = "Hulu"
netflix_df['Platform'] = "Netflix"
disney_df['Platform'] = "Disney"
prime_df['Platform'] = "Prime"
Part c

all_platforms_df = pd.concat([prime_df,netflix_df,disney_df,hulu_df])
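One detail of `pd.concat` worth knowing: by default each frame keeps its own row labels, so the combined index contains repeats. The sketch below uses two tiny made-up frames to show this, along with `ignore_index=True`, which renumbers the rows. Whether renumbering is wanted here is a judgment call; the solution above does not use it.

```python
import pandas as pd

# Hypothetical miniature versions of two platform frames.
hulu_df = pd.DataFrame({"title": ["Show A", "Show B"], "type": ["TV Show", "Movie"]})
netflix_df = pd.DataFrame({"title": ["Show A", "Show C"], "type": ["TV Show", "Movie"]})

hulu_df["Platform"] = "Hulu"
netflix_df["Platform"] = "Netflix"

combined = pd.concat([hulu_df, netflix_df])
print(combined.index.tolist())    # row labels repeat: [0, 1, 0, 1]

renumbered = pd.concat([hulu_df, netflix_df], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```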
Q3. Longest show and movie [6 points]. See 44:49 for part a, and 55:20 for part b, in the Solutions Video

Part a

shows_df = all_platforms_df[all_platforms_df['type']=='TV Show']
shows_count_df['Number of seasons'] = shows_count_df['Number of seasons'].apply(lambda x : int(x.replace('Seasons','').replace('Season','')))
Number of seasons No of shows
6 7 89
print("After preprocessing , Longest running season appears to be :",shows_count_df['Number of seasons'].max(), "Seasons")
After preprocessing , Longest running season appears to be : 34 Seasons
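The preprocessing step turns strings such as "34 Seasons" into integers so that `max` compares numbers rather than text. A sketch on made-up values, showing the replace-based approach from the solution next to a regex-based alternative (the alternative is not from the solution):

```python
import pandas as pd

# Hypothetical duration strings in the same style as the dataset.
durations = pd.Series(["1 Season", "34 Seasons", "7 Seasons"])

# Same idea as the solution: strip the words, then convert.
# "Seasons" is removed first so that no stray "s" is left behind.
seasons_by_replace = durations.apply(
    lambda x: int(x.replace("Seasons", "").replace("Season", ""))
)

# Alternative: pull out the digits with a regular expression.
seasons_by_regex = durations.str.extract(r"(\d+)", expand=False).astype(int)

print(seasons_by_replace.max())
print(seasons_by_regex.max())
```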

Part b

movies_df = all_platforms_df[all_platforms_df['type']=='Movie']
longest_movies_df = movies_df[movies_df["duration"] == longest_movie_duration ]
Part ii

movies_count_df_ii = movies_df["duration"].value_counts().to_frame()
20 min 25
longest_movies_df = movies_df[movies_df["duration"] == longest_movie_duration ]
Part iii

The max function tried to compare the values in the column, but the column contained two data types: strings, and floats (NaN, used for null values). Python does not allow ordering comparisons between a string and a float, hence the error.
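The error can be reproduced on a few made-up values. The sketch also shows a second pitfall: even after dropping nulls, the maximum of the strings is lexicographic, so the durations need to be converted to numbers first. The cleanup shown is one possible approach, not necessarily the one in the solutions video.

```python
import pandas as pd

# Hypothetical durations: two strings and one missing value (a float NaN).
durations = pd.Series(["90 min", float("nan"), "120 min"], dtype=object)

# Comparing a str with a float raises TypeError.
try:
    max(durations.tolist())
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Dropping nulls removes the error, but strings compare character by character.
non_null = durations.dropna()
lexicographic_max = max(non_null)
print(lexicographic_max)  # "90 min", because "9" sorts after "1"

# Converting to integers gives the numeric maximum.
minutes = non_null.str.replace(" min", "", regex=False).astype(int)
longest_minutes = minutes.max()
print(longest_minutes)
```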

Q 4. Shows streaming on multiple platforms [7 points]. See 1:19:12 in the Solutions Video

show_id type title director cast country date_added release_year rating duration listed_in description Platform
Part a

print("The number of rows is :",all_platforms_df.shape[0])
The number of rows is : 22998

Part b

unique_titles = all_platforms_df['title'].unique()
# alternative solution from student submissions
print(f"Number of unique titles: {all_platforms_df['title'].nunique()}")
Number of unique titles: 22115
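The two approaches can give different counts when the column has missing values: `unique()` keeps the missing value as one of its entries, while `nunique()` skips it unless `dropna=False` is passed. A small sketch on made-up titles:

```python
import pandas as pd

# Hypothetical titles, including one missing value.
titles = pd.Series(["Show A", "Show B", "Show A", None])

count_from_unique = len(titles.unique())           # missing value is counted
count_from_nunique = titles.nunique()              # missing value is skipped
count_with_missing = titles.nunique(dropna=False)  # missing value is counted

print(count_from_unique, count_from_nunique, count_with_missing)
```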

Part c

titles_count_df = all_platforms_df['title'].value_counts().to_frame()
Part d

}, axis = 1, inplace = True)
Part e

    'name' : "Movie or Show Name",
    "title" : "No of Platforms"
}, axis = 1, inplace = True)
Part f

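# value_counts() already returns the per-title counts; taking max() directly avoids relying on
# the column name produced by to_frame(), which differs between pandas versions
max_platform_shows = shows_df['title'].value_counts().max()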
print("The Maximum number of platforms a show is on is ",max_platform_shows)
The Maximum number of platforms a show is on is  3
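Counting rows per title equals counting platforms only if a title appears at most once per platform. The sketch below uses made-up data to show the difference, with a `groupby` count of distinct platforms as an alternative (not the approach used in the solution).

```python
import pandas as pd

# Hypothetical data: "Show B" is listed twice on the same platform.
shows_df = pd.DataFrame({
    "title": ["Show A", "Show A", "Show A", "Show B", "Show B", "Show C"],
    "Platform": ["Hulu", "Netflix", "Prime", "Hulu", "Hulu", "Disney"],
})

# Rows per title, as in the solution.
rows_per_title = shows_df["title"].value_counts()

# Distinct platforms per title.
platforms_per_title = shows_df.groupby("title")["Platform"].nunique()

print(rows_per_title["Show B"], platforms_per_title["Show B"])
print(platforms_per_title.max())
```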

Q5. Favorite show or movie [2 points]. See 1:31:20 in the Solutions Video

Part a

duplicated_shows = shows_df[shows_df.duplicated(subset=['title'], keep = False)]
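The `keep=False` argument matters here: it marks every copy of a repeated title, while the default marks only the second and later copies. A short sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows: "Show A" appears on two platforms.
shows_df = pd.DataFrame({
    "title": ["Show A", "Show B", "Show A", "Show C"],
    "Platform": ["Hulu", "Hulu", "Netflix", "Prime"],
})

# keep=False flags all rows whose title occurs more than once.
all_copies = shows_df[shows_df.duplicated(subset=["title"], keep=False)]

# The default (keep="first") leaves the first occurrence unflagged.
later_copies = shows_df[shows_df.duplicated(subset=["title"])]

print(all_copies)
print(later_copies)
```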
Part b

favorite_movie = 'Everything Everywhere All at Once'
# Case-insensitive match; the vectorized .str accessor also tolerates missing titles
all_platforms_df[all_platforms_df['title'].str.lower() == favorite_movie.lower()]
random_row = all_platforms_df.sample(1)
Q 6. Save Data [1 point]. See 1:39:11 in the Solutions Video

target_file_name = "streaming_titles.csv"
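The save call itself is not shown. A sketch of one way to write the combined frame and read it back, assuming `to_csv` with `index=False` so the row labels are not stored as an extra column. The frame is made up, and the file goes to a temporary directory here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical miniature of the combined frame.
all_platforms_df = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "Platform": ["Hulu", "Netflix"],
})

target_path = Path(tempfile.mkdtemp()) / "streaming_titles.csv"

# index=False keeps the row labels out of the file.
all_platforms_df.to_csv(target_path, index=False)

streaming_titles_df = pd.read_csv(target_path)
print(streaming_titles_df.columns.tolist())
```

Without `index=False`, reading the file back produces an additional `Unnamed: 0` column holding the old row labels.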

Q 7. Name starts with [8 points]. See 1:39:29 in the Solutions Video

Part a

streaming_titles_df = pd.read_csv(target_file_name)
Part b

first_name = "Enfa"
first_letter_match_movies = movies_df[movies_df['title'].str.startswith(first_name[0])]
Part c

first_letter_match_shows = shows_df[shows_df['title'].str.startswith(first_name[0])]
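Two properties of `str.startswith` are worth keeping in mind for these parts: the match is case-sensitive, and a missing title produces a missing value in the mask rather than `False`. The sketch below uses made-up titles and passes `na=False` so missing titles are treated as non-matches. That argument is an addition here, not part of the solution above.

```python
import pandas as pd

# Hypothetical titles, including a missing one and a lowercase one.
movies_df = pd.DataFrame({"title": ["Encanto", "Elf", None, "Brave", "eternals"]})

first_name = "Enfa"

# na=False turns the missing title into False in the mask.
mask = movies_df["title"].str.startswith(first_name[0], na=False)
matches = movies_df[mask]

print(matches["title"].tolist())
```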
Part d

TA_first_letter = 'B'
first_letter_match_movies = movies_df[movies_df['title'].str.startswith(TA_first_letter)]
Part e

first_letter_match_shows = shows_df[shows_df['title'].str.startswith(TA_first_letter)]
Part f

first_letter_match_shows = shows_df[shows_df['title'].str.startswith(TA_first_letter)]
Part g

print("Diff between f and g is", last_name_TA_uniq-first_name_TA_uniq)
Diff between f and g is 0