Lecture 4.1

Continuing Discussion on Pandas - Part 3

import pandas as pd

WORLD_DATA_PATH = "data/spotify-top-50-playlist-songs-anxods/data/spotify-streaming-top-50-world.csv"
USA_DATA_PATH = "data/spotify-top-50-playlist-songs-anxods/data/spotify-streaming-top-50-usa.csv"

world_df = pd.read_csv(WORLD_DATA_PATH)
usa_df = pd.read_csv(USA_DATA_PATH)

world_df.sample()

	date	position	song	artist	popularity	duration_ms	album_type	total_tracks	release_date	is_explicit	album_cover_url
263	2023-05-23	14	El Azul	Junior H & Peso Pluma	95	187225	single	1	2023-02-10	False	https://i.scdn.co/image/ab67616d0000b27333ed35...

usa_df.sample(1)

	date	position	song	artist	popularity	duration_ms	album_type	total_tracks	release_date	is_explicit	album_cover_url
254	2023-05-23	5	Kill Bill	SZA	94	153946	album	23	2022-12-08	False	https://i.scdn.co/image/ab67616d0000b2730c471c...

1. Revisiting groupby

source = “https://www.kaggle.com/code/alenavorushilova/grouping-sorting-and-filtering-data-tutorial”

world_df.groupby("position")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x117d3ae90>

Q:For each artist in the dataset, what was the longest song that made it to top 50?

world_df.sample(1)

	date	position	song	artist	popularity	duration_ms	album_type	total_tracks	release_date	is_explicit	album_cover_url
105	2023-05-20	6	Daylight	David Kushner	97	212953	single	1	2023-04-14	False	https://i.scdn.co/image/ab67616d0000b27395ca6a...

world_df.groupby("artist")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x117d3b880>

world_df.groupby("artist")["duration_ms"].max().to_frame()

	duration_ms
artist
Alessandra	147979
Arctic Monkeys	183956
BTS	229953
Bad Bunny	231704
Beyoncé & Kendrick Lamar	260962
...	...
Tyler, The Creator	180386
Yahritza Y Su Esencia & Grupo Frontera	160517
Yandel	216148
Yng Lvcas & Peso Pluma	234352
d4vd	242484

75 rows × 1 columns

Q: For each position (1 to 50), which artist was in that rank the maximum number of days ( in total , needn’t be consecutive)

world_df.groupby('position')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x117f5c2b0>

world_df.groupby('position')['artist']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x117f5c310>

world_df.groupby('position')['artist'].value_counts().to_frame()

		artist
position	artist
1	Eslabon Armado	31
	Bad Bunny	2
	Bizarrap & Peso Pluma	2
	BTS	1
2	Bizarrap & Peso Pluma	16
...	...	...
50	Lil Durk	1
	Lil Mabu	1
	Natanael Cano & Peso Pluma & Gabito Ballesteros	1
	Sam Smith	1
	Taylor Swift	1

790 rows × 1 columns

2. Introducing MultiIndex in Pandas Dataframes

world_df.index

RangeIndex(start=0, stop=1800, step=1)

world_df_with_rank_counts = world_df.groupby('position')['artist'].value_counts().to_frame()
world_df_with_rank_counts.index

MultiIndex([( 1,                                  'Eslabon Armado'),
            ( 1,                                       'Bad Bunny'),
            ( 1,                           'Bizarrap & Peso Pluma'),
            ( 1,                                             'BTS'),
            ( 2,                           'Bizarrap & Peso Pluma'),
            ( 2,                                       'Bad Bunny'),
            ( 2,                                  'Eslabon Armado'),
            ( 2,                      'Grupo Frontera & Bad Bunny'),
            ( 2,                          'Yng Lvcas & Peso Pluma'),
            ( 3,                                       'Bad Bunny'),
            ...
            (50,                       'DENNIS & MC Kevin o Chris'),
            (50,                                        'Dua Lipa'),
            (50,                           'Eden Muñoz & Junior H'),
            (50,                                  'Eladio Carrion'),
            (50,                                      'Kali Uchis'),
            (50,                                        'Lil Durk'),
            (50,                                        'Lil Mabu'),
            (50, 'Natanael Cano & Peso Pluma & Gabito Ballesteros'),
            (50,                                       'Sam Smith'),
            (50,                                    'Taylor Swift')],
           names=['position', 'artist'], length=790)

Approach 1

world_df_with_rank_counts.index.get_level_values('position')#world_df_with_rank_counts.index.get_level_values(0)

Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  2,  3,
            ...
            50, 50, 50, 50, 50, 50, 50, 50, 50, 50],
           dtype='int64', name='position', length=790)

world_df_with_rank_counts.index.get_level_values('artist') #world_df_with_rank_counts.index.get_level_values(1)

Index(['Eslabon Armado', 'Bad Bunny', 'Bizarrap & Peso Pluma', 'BTS',
       'Bizarrap & Peso Pluma', 'Bad Bunny', 'Eslabon Armado',
       'Grupo Frontera & Bad Bunny', 'Yng Lvcas & Peso Pluma', 'Bad Bunny',
       ...
       'DENNIS & MC Kevin o Chris', 'Dua Lipa', 'Eden Muñoz & Junior H',
       'Eladio Carrion', 'Kali Uchis', 'Lil Durk', 'Lil Mabu',
       'Natanael Cano & Peso Pluma & Gabito Ballesteros', 'Sam Smith',
       'Taylor Swift'],
      dtype='object', name='artist', length=790)

world_df_with_rank_counts['rank'] = world_df_with_rank_counts.index.get_level_values('position')
world_df_with_rank_counts['artist(s)'] = world_df_with_rank_counts.index.get_level_values('artist') 
world_df_with_rank_counts.sample(2)

		artist	rank	artist(s)
position	artist
39	Taylor Swift	1	39	Taylor Swift
46	Dua Lipa	1	46	Dua Lipa

world_df_with_rank_counts.reset_index(drop = True, inplace = True)

world_df_with_rank_counts

	artist	rank	artist(s)
0	31	1	Eslabon Armado
1	2	1	Bad Bunny
2	2	1	Bizarrap & Peso Pluma
3	1	1	BTS
4	16	2	Bizarrap & Peso Pluma
...	...	...	...
785	1	50	Lil Durk
786	1	50	Lil Mabu
787	1	50	Natanael Cano & Peso Pluma & Gabito Ballesteros
788	1	50	Sam Smith
789	1	50	Taylor Swift

790 rows × 3 columns

world_df_with_rank_counts.rename({'artist':'count'},axis = 'columns',inplace=True)

world_df_with_rank_counts

	count	rank	artist(s)
0	31	1	Eslabon Armado
1	2	1	Bad Bunny
2	2	1	Bizarrap & Peso Pluma
3	1	1	BTS
4	16	2	Bizarrap & Peso Pluma
...	...	...	...
785	1	50	Lil Durk
786	1	50	Lil Mabu
787	1	50	Natanael Cano & Peso Pluma & Gabito Ballesteros
788	1	50	Sam Smith
789	1	50	Taylor Swift

790 rows × 3 columns

Approach 2

world_df_with_rank_counts = world_df.groupby('position')['artist'].value_counts().to_frame()
world_df_with_rank_counts.sample(2)

		artist
position	artist
26	Miley Cyrus	1
33	Tina Turner	1

world_df_with_rank_counts.rename({'artist':'count'},axis = 'columns', inplace = True)

world_df_with_rank_counts.reset_index(inplace=True)

world_df_with_rank_counts

	position	artist	count
0	1	Eslabon Armado	31
1	1	Bad Bunny	2
2	1	Bizarrap & Peso Pluma	2
3	1	BTS	1
4	2	Bizarrap & Peso Pluma	16
...	...	...	...
785	50	Lil Durk	1
786	50	Lil Mabu	1
787	50	Natanael Cano & Peso Pluma & Gabito Ballesteros	1
788	50	Sam Smith	1
789	50	Taylor Swift	1

790 rows × 3 columns

world_df_with_rank_counts[world_df_with_rank_counts['position'] ==1]

	position	artist	count
0	1	Eslabon Armado	31
1	1	Bad Bunny	2
2	1	Bizarrap & Peso Pluma	2
3	1	BTS	1

max_count_index = world_df_with_rank_counts[world_df_with_rank_counts['position'] ==1]['count'].idxmax() # new : idxmax()

world_df_with_rank_counts.loc[max_count_index].to_frame().T #Revisiting .loc

	position	artist	count
0	1	Eslabon Armado	31

# for easier view : **Q: For each position (1 to 50), which artist was in that rank the maximum number of days** ( in total , needn't be consecutive)

list_of_dataframes = []
for position in range(1,51):
    max_count_index = world_df_with_rank_counts[world_df_with_rank_counts['position'] ==position]['count'].idxmax()
    list_of_dataframes.append(world_df_with_rank_counts.loc[max_count_index].to_frame().T )

list_of_dataframes[10]

	position	artist	count
69	11	SZA	9

pd.concat(list of dataframes)

merged_df = pd.concat(list_of_dataframes)
merged_df

	position	artist	count
0	1	Eslabon Armado	31
4	2	Bizarrap & Peso Pluma	16
9	3	Bad Bunny	18
14	4	Yng Lvcas & Peso Pluma	13
20	5	Grupo Frontera & Bad Bunny	11
26	6	Miley Cyrus	19
33	7	David Kushner	10
41	8	SZA	9
50	9	Harry Styles	14
59	10	David Kushner	8
69	11	SZA	9
79	12	Fuerza Regida	7
92	13	Feid & Young Miko	11
105	14	Metro Boomin	12
116	15	Metro Boomin	5
130	16	Metro Boomin	8
144	17	Rema	6
163	18	Rema	5
180	19	Metro Boomin	7
196	20	Metro Boomin	8
213	21	Junior H & Peso Pluma	4
232	22	PinkPantheress & Ice Spice	7
250	23	Metro Boomin	5
267	24	Lana Del Rey	4
283	25	The Weeknd	6
303	26	Morgan Wallen	6
323	27	Morgan Wallen	5
341	28	Taylor Swift	8
355	29	Arctic Monkeys	5
373	30	The Weeknd	10
388	31	The Weeknd	6
406	32	Taylor Swift	5
425	33	d4vd	4
446	34	Metro Boomin	4
465	35	The Weeknd	6
485	36	The Weeknd	6
505	37	Yandel	4
529	38	The Weeknd	7
548	39	The Weeknd	7
566	40	Tom Odell	5
585	41	JVKE	5
602	42	Libianca	6
623	43	OneRepublic	5
643	44	Libianca	6
663	45	Yahritza Y Su Esencia & Grupo Frontera	4
687	46	Manuel Turizo	4
710	47	Manuel Turizo	4
730	48	Sam Smith	4
747	49	Taylor Swift	4
769	50	Peso Pluma	4

set_index()

merged_df.set_index("position",drop = True,inplace = True)

del merged_df['count']

merged_df

	artist
position
1	Eslabon Armado
2	Bizarrap & Peso Pluma
3	Bad Bunny
4	Yng Lvcas & Peso Pluma
5	Grupo Frontera & Bad Bunny
6	Miley Cyrus
7	David Kushner
8	SZA
9	Harry Styles
10	David Kushner
11	SZA
12	Fuerza Regida
13	Feid & Young Miko
14	Metro Boomin
15	Metro Boomin
16	Metro Boomin
17	Rema
18	Rema
19	Metro Boomin
20	Metro Boomin
21	Junior H & Peso Pluma
22	PinkPantheress & Ice Spice
23	Metro Boomin
24	Lana Del Rey
25	The Weeknd
26	Morgan Wallen
27	Morgan Wallen
28	Taylor Swift
29	Arctic Monkeys
30	The Weeknd
31	The Weeknd
32	Taylor Swift
33	d4vd
34	Metro Boomin
35	The Weeknd
36	The Weeknd
37	Yandel
38	The Weeknd
39	The Weeknd
40	Tom Odell
41	JVKE
42	Libianca
43	OneRepublic
44	Libianca
45	Yahritza Y Su Esencia & Grupo Frontera
46	Manuel Turizo
47	Manuel Turizo
48	Sam Smith
49	Taylor Swift
50	Peso Pluma

Q. How many times has an artist taken a spot between 1 and 50 ?

For each artist, how many times have they taken a spot between 1 and 50 (both inclusive)

world_df.sample()

	date	position	song	artist	popularity	duration_ms	album_type	total_tracks	release_date	is_explicit	album_cover_url
1458	2023-06-16	9	As It Was	Harry Styles	93	167303	album	13	2022-05-20	False	https://i.scdn.co/image/ab67616d0000b2732e8ed7...

rank_count_df = world_df.groupby('artist')['position'].value_counts().to_frame()
rank_count_df

		position
artist	position
Alessandra	42	1
Alessandra	47	1
Arctic Monkeys	29	5
	25	4
	22	3
...	...	...
d4vd	39	1
	40	1
	41	1
	42	1
	43	1

790 rows × 1 columns

rank_count_df.rename({'position':'number of times'}, axis = 'columns',inplace = True)

rank_count_df.reset_index(inplace = True)

rank_count_df

	artist	position	number of times
0	Alessandra	42	1
1	Alessandra	47	1
2	Arctic Monkeys	29	5
3	Arctic Monkeys	25	4
4	Arctic Monkeys	22	3
...	...	...	...
785	d4vd	39	1
786	d4vd	40	1
787	d4vd	41	1
788	d4vd	42	1
789	d4vd	43	1

790 rows × 3 columns

top_50_count_for_each_artist = rank_count_df.groupby('artist')['number of times'].sum().to_frame()

top_50_count_for_each_artist.sort_values(by = 'number of times', ascending=False)

	number of times
artist
The Weeknd	108
Taylor Swift	92
Metro Boomin	91
Miley Cyrus	72
ROSALÍA & Rauw Alejandro	36
...	...
JISOO	1
Stray Kids	1
Fast & Furious: The Fast Saga	1
Saiko & Feid & Quevedo	1
Eladio Carrion	1

75 rows × 1 columns

3. Joins

(discussed in Lecture 3.3)

world_info = world_df[['date','position','artist','song']]
world_info.sample(3)

	date	position	artist	song
1037	2023-06-07	38	The Weeknd	Die For You
557	2023-05-29	8	SZA	Kill Bill
975	2023-06-06	26	Fast & Furious: The Fast Saga & Jimin & BTS	Angel Pt. 1 (feat. Jimin of BTS, JVKE & Muni L...

world_info.shape

(1800, 4)

world_df['position'] = world_df['position'].astype('int')

usa_info = usa_df[['date','position','artist','song']]
usa_info.sample(3)

	date	position	artist	song
1763	2023-06-22	14	Metro Boomin	Annihilate (Spider-Man: Across the Spider-Vers...
1785	2023-06-22	36	Metro Boomin	Creepin' (with The Weeknd & 21 Savage)
120	2023-05-20	21	Peso Pluma	Por las Noches

usa_info.shape

(1800, 4)

usa_info['position'] = usa_info['position'].astype('int')

/var/folders/r0/bggl6hf15j708chpxqrx5tp00000gn/T/ipykernel_2958/3011614831.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  usa_info['position'] = usa_info['position'].astype('int')

world_info.merge(right=usa_info, on= ['date','artist','song'], how='inner') #default is inner

	date	position_x	artist	song	position_y
0	2023-05-18	1	Eslabon Armado	Ella Baila Sola	1
1	2023-05-18	2	Grupo Frontera & Bad Bunny	un x100to	4
2	2023-05-18	3	Yng Lvcas & Peso Pluma	La Bebe - Remix	8
3	2023-05-18	4	FIFTY FIFTY	Cupid - Twin Ver.	7
4	2023-05-18	5	Miley Cyrus	Flowers	14
...	...	...	...	...	...
983	2023-06-22	37	Junior H & Peso Pluma	El Azul	39
984	2023-06-22	43	Kali Uchis	Moonlight	40
985	2023-06-22	44	PinkPantheress & Ice Spice	Boy's a Liar Pt. 2	18
986	2023-06-22	48	Peso Pluma & Natanael Cano	PRC	37
987	2023-06-22	50	Lil Durk	All My Life (feat. J. Cole)	8

988 rows × 5 columns

3.1 Inner Join

world_info.merge(right=usa_info, on= ['date','artist','song'], how='inner', suffixes=['_world','_usa']) #default is inner

	date	position_world	artist	song	position_usa
0	2023-05-18	1	Eslabon Armado	Ella Baila Sola	1
1	2023-05-18	2	Grupo Frontera & Bad Bunny	un x100to	4
2	2023-05-18	3	Yng Lvcas & Peso Pluma	La Bebe - Remix	8
3	2023-05-18	4	FIFTY FIFTY	Cupid - Twin Ver.	7
4	2023-05-18	5	Miley Cyrus	Flowers	14
...	...	...	...	...	...
983	2023-06-22	37	Junior H & Peso Pluma	El Azul	39
984	2023-06-22	43	Kali Uchis	Moonlight	40
985	2023-06-22	44	PinkPantheress & Ice Spice	Boy's a Liar Pt. 2	18
986	2023-06-22	48	Peso Pluma & Natanael Cano	PRC	37
987	2023-06-22	50	Lil Durk	All My Life (feat. J. Cole)	8

988 rows × 5 columns

On a given date, a particular artist was on BOTH global and US spotify top 50. position_world gives their global spotify position, and position_usa gives their spotify top 50 in US position

3.2 Outer Join

world_info.merge(right=usa_info, on= ['date','artist','song'], how='outer', suffixes=['_world','_usa']) #default is inner

	date	position_world	artist	song	position_usa
0	2023-05-18	1.0	Eslabon Armado	Ella Baila Sola	1.0
1	2023-05-18	2.0	Grupo Frontera & Bad Bunny	un x100to	4.0
2	2023-05-18	3.0	Yng Lvcas & Peso Pluma	La Bebe - Remix	8.0
3	2023-05-18	4.0	FIFTY FIFTY	Cupid - Twin Ver.	7.0
4	2023-05-18	5.0	Miley Cyrus	Flowers	14.0
...	...	...	...	...	...
2607	2023-06-22	NaN	Peso Pluma	Por las Noches	45.0
2608	2023-06-22	NaN	NLE Choppa	SLUT ME OUT	47.0
2609	2023-06-22	NaN	Drake	Jimmy Cooks (feat. 21 Savage)	48.0
2610	2023-06-22	NaN	Kendrick Lamar	Money Trees	49.0
2611	2023-06-22	NaN	SZA	Snooze	50.0

2612 rows × 5 columns

On a given date, a particular artist was on EITHER global or US spotify top 50. position_world gives their global spotify position, and position_usa gives their spotify top 50 in US position

3.3 Left Join

world_info.merge(right=usa_info, on= ['date','artist','song'], how='left', suffixes=['_world','_usa']) #default is inner

	date	position_world	artist	song	position_usa
0	2023-05-18	1	Eslabon Armado	Ella Baila Sola	1.0
1	2023-05-18	2	Grupo Frontera & Bad Bunny	un x100to	4.0
2	2023-05-18	3	Yng Lvcas & Peso Pluma	La Bebe - Remix	8.0
3	2023-05-18	4	FIFTY FIFTY	Cupid - Twin Ver.	7.0
4	2023-05-18	5	Miley Cyrus	Flowers	14.0
...	...	...	...	...	...
1795	2023-06-22	46	Glass Animals	Heat Waves	NaN
1796	2023-06-22	47	Yahritza Y Su Esencia & Grupo Frontera	Frágil	NaN
1797	2023-06-22	48	Peso Pluma & Natanael Cano	PRC	37.0
1798	2023-06-22	49	Dua Lipa	Dance The Night (From Barbie The Album)	NaN
1799	2023-06-22	50	Lil Durk	All My Life (feat. J. Cole)	8.0

1800 rows × 5 columns

On a given date, a particular artist was on global spotify top 50 list. If the same artist for the same song, on the same date, was also on USA top 50, then their rank is given in position_usa. If they were not in USA top 50, position_usa would be NaN( Not a Number)

3.4 Right Join

world_info.merge(right=usa_info, on= ['date','artist','song'], how='right', suffixes=['_world','_usa']) #default is inner

	date	position_world	artist	song	position_usa
0	2023-05-18	1.0	Eslabon Armado	Ella Baila Sola	1
1	2023-05-18	27.0	Morgan Wallen	Last Night	2
2	2023-05-18	26.0	Lil Durk & J. Cole	All My Life (feat. J. Cole)	3
3	2023-05-18	2.0	Grupo Frontera & Bad Bunny	un x100to	4
4	2023-05-18	7.0	SZA	Kill Bill	5
...	...	...	...	...	...
1795	2023-06-22	31.0	The Weeknd	Die For You	46
1796	2023-06-22	NaN	NLE Choppa	SLUT ME OUT	47
1797	2023-06-22	NaN	Drake	Jimmy Cooks (feat. 21 Savage)	48
1798	2023-06-22	NaN	Kendrick Lamar	Money Trees	49
1799	2023-06-22	NaN	SZA	Snooze	50

1800 rows × 5 columns

On a given date, a particular artist was on usa spotify top 50 list. If the same artist for the same song, on the same date, was also on Global top 50, then their rank is given in position_world. If they were not in Global top 50, position_world would be NaN( Not a Number)

3.5 Cross ie Cartisean Product

left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
left

	col1	col2
0	1	3
1	2	4

right = pd.DataFrame({'col3': [5, 6]}) 
right

	col3
0	5
1	6

left.merge(right,how="cross")

	col1	col2	col3
0	1	3	5
1	1	3	6
2	2	4	5
3	2	4	6

Continuing Discussion on Pandas - Part 3

1. Revisiting groupby

2. Introducing MultiIndex in Pandas Dataframes

Approach 1

Approach 2

pd.concat(list of dataframes)

set_index()

3. Joins

3.1 Inner Join

3.2 Outer Join

3.3 Left Join

3.4 Right Join

3.5 Cross ie Cartisean Product

Next Class - 2. Data Cleaning