Objective of the project: The goal of this project is to test the hypothesis that the same songs are popular across different streaming platforms, and to explore the characteristics (genre/album/tempo/duration) of the music that is more popular on each platform.
Source of the data: The dataset comes from Kaggle and is available at https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube
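One way to make the hypothesis concrete is a rank correlation between YouTube views and Spotify streams: if the same songs are popular on both platforms, the two rankings should largely agree. The sketch below is only an illustration on made-up numbers, not values from the dataset. `Views` and `Stream` are the dataset's column names, while `rank_agreement` is a hypothetical helper that computes Spearman's coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

def rank_agreement(df, col_a="Views", col_b="Stream"):
    """Spearman rank correlation between two popularity columns."""
    pair = df[[col_a, col_b]].dropna()
    # Spearman's rho is the Pearson correlation of the ranks.
    return pair[col_a].rank().corr(pair[col_b].rank())

# Illustrative toy data, not real values from Spotify_Youtube.csv.
toy = pd.DataFrame({
    "Track": ["a", "b", "c", "d", "e"],
    "Views": [500, 400, 300, 200, 100],
    "Stream": [50, 40, 30, 20, 10],
})

rho = rank_agreement(toy)
print(round(rho, 3))
```

A value near 1 would support the hypothesis, a value near 0 would suggest that popularity on one platform says little about popularity on the other.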
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.cm as cm
datam= pd.read_csv("Spotify_Youtube.csv")
datam.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Unnamed: 0        20718 non-null  int64
 1   Artist            20718 non-null  object
 2   Url_spotify       20718 non-null  object
 3   Track             20718 non-null  object
 4   Album             20718 non-null  object
 5   Album_type        20718 non-null  object
 6   Uri               20718 non-null  object
 7   Danceability      20716 non-null  float64
 8   Energy            20716 non-null  float64
 9   Key               20716 non-null  float64
 10  Loudness          20716 non-null  float64
 11  Speechiness       20716 non-null  float64
 12  Acousticness      20716 non-null  float64
 13  Instrumentalness  20716 non-null  float64
 14  Liveness          20716 non-null  float64
 15  Valence           20716 non-null  float64
 16  Tempo             20716 non-null  float64
 17  Duration_ms       20716 non-null  float64
 18  Url_youtube       20248 non-null  object
 19  Title             20248 non-null  object
 20  Channel           20248 non-null  object
 21  Views             20248 non-null  float64
 22  Likes             20177 non-null  float64
 23  Comments          20149 non-null  float64
 24  Description       19842 non-null  object
 25  Licensed          20248 non-null  object
 26  official_video    20248 non-null  object
 27  Stream            20142 non-null  float64
dtypes: float64(15), int64(1), object(12)
memory usage: 4.4+ MB
datam.head()
|  | Unnamed: 0 | Artist | Url_spotify | Track | Album | Album_type | Uri | Danceability | Energy | Key | ... | Url_youtube | Title | Channel | Views | Likes | Comments | Description | Licensed | official_video | Stream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Gorillaz | https://open.spotify.com/artist/3AA28KZvwAUcZu... | Feel Good Inc. | Demon Days | album | spotify:track:0d28khcov6AiegSCpG5TuT | 0.818 | 0.705 | 6.0 | ... | https://www.youtube.com/watch?v=HyHNuVaZJ-k | Gorillaz - Feel Good Inc. (Official Video) | Gorillaz | 693555221.0 | 6220896.0 | 169907.0 | Official HD Video for Gorillaz' fantastic trac... | True | True | 1.040235e+09 |
| 1 | 1 | Gorillaz | https://open.spotify.com/artist/3AA28KZvwAUcZu... | Rhinestone Eyes | Plastic Beach | album | spotify:track:1foMv2HQwfQ2vntFf9HFeG | 0.676 | 0.703 | 8.0 | ... | https://www.youtube.com/watch?v=yYDmaexVHic | Gorillaz - Rhinestone Eyes [Storyboard Film] (... | Gorillaz | 72011645.0 | 1079128.0 | 31003.0 | The official video for Gorillaz - Rhinestone E... | True | True | 3.100837e+08 |
| 2 | 2 | Gorillaz | https://open.spotify.com/artist/3AA28KZvwAUcZu... | New Gold (feat. Tame Impala and Bootie Brown) | New Gold (feat. Tame Impala and Bootie Brown) | single | spotify:track:64dLd6rVqDLtkXFYrEUHIU | 0.695 | 0.923 | 1.0 | ... | https://www.youtube.com/watch?v=qJa-VFwPpYA | Gorillaz - New Gold ft. Tame Impala & Bootie B... | Gorillaz | 8435055.0 | 282142.0 | 7399.0 | Gorillaz - New Gold ft. Tame Impala & Bootie B... | True | True | 6.306347e+07 |
| 3 | 3 | Gorillaz | https://open.spotify.com/artist/3AA28KZvwAUcZu... | On Melancholy Hill | Plastic Beach | album | spotify:track:0q6LuUqGLUiCPP1cbdwFs3 | 0.689 | 0.739 | 2.0 | ... | https://www.youtube.com/watch?v=04mfKJWDSzI | Gorillaz - On Melancholy Hill (Official Video) | Gorillaz | 211754952.0 | 1788577.0 | 55229.0 | Follow Gorillaz online:\nhttp://gorillaz.com \... | True | True | 4.346636e+08 |
| 4 | 4 | Gorillaz | https://open.spotify.com/artist/3AA28KZvwAUcZu... | Clint Eastwood | Gorillaz | album | spotify:track:7yMiX7n9SBvadzox8T5jzT | 0.663 | 0.694 | 10.0 | ... | https://www.youtube.com/watch?v=1V_xRb0x9aw | Gorillaz - Clint Eastwood (Official Video) | Gorillaz | 618480958.0 | 6197318.0 | 155930.0 | The official music video for Gorillaz - Clint ... | True | True | 6.172597e+08 |
5 rows × 28 columns
In this dataset we have 28 variables and 20718 entries.
Our next step is to clean the data by deleting the columns that we won't need and checking for missing data.
# Delete the variables we won't need
datam.drop(['Unnamed: 0', 'Url_spotify', 'Uri', 'Url_youtube', 'Title', 'Description'], axis=1, inplace=True)
#checking for missing data
datam.isnull().values.any()
datam.isnull().sum()
Artist                0
Track                 0
Album                 0
Album_type            0
Danceability          2
Energy                2
Key                   2
Loudness              2
Speechiness           2
Acousticness          2
Instrumentalness      2
Liveness              2
Valence               2
Tempo                 2
Duration_ms           2
Channel             470
Views               470
Likes               541
Comments            569
Licensed            470
official_video      470
Stream              576
dtype: int64
#Checking for duplicates:
datam.duplicated().values.any()
False
As we can see, there are no duplicate rows, but some data is missing, so we will drop the rows with missing entries.
#Deleting missing data:
datam.dropna(inplace=True)
datam.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20718 entries, 0 to 20717
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Artist            20718 non-null  object
 1   Track             20718 non-null  object
 2   Album             20718 non-null  object
 3   Album_type        20718 non-null  object
 4   Danceability      20716 non-null  float64
 5   Energy            20716 non-null  float64
 6   Key               20716 non-null  float64
 7   Loudness          20716 non-null  float64
 8   Speechiness       20716 non-null  float64
 9   Acousticness      20716 non-null  float64
 10  Instrumentalness  20716 non-null  float64
 11  Liveness          20716 non-null  float64
 12  Valence           20716 non-null  float64
 13  Tempo             20716 non-null  float64
 14  Duration_ms       20716 non-null  float64
 15  Channel           20248 non-null  object
 16  Views             20248 non-null  float64
 17  Likes             20177 non-null  float64
 18  Comments          20149 non-null  float64
 19  Licensed          20248 non-null  object
 20  official_video    20248 non-null  object
 21  Stream            20142 non-null  float64
dtypes: float64(15), object(7)
memory usage: 3.5+ MB
datam.head()
|  | Artist | Track | Album | Album_type | Danceability | Energy | Key | Loudness | Speechiness | Acousticness | ... | Valence | Tempo | Duration_ms | Channel | Views | Likes | Comments | Licensed | official_video | Stream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gorillaz | Feel Good Inc. | Demon Days | album | 0.818 | 0.705 | 6.0 | -6.679 | 0.1770 | 0.008360 | ... | 0.772 | 138.559 | 222640.0 | Gorillaz | 693555221.0 | 6220896.0 | 169907.0 | True | True | 1.040235e+09 |
| 1 | Gorillaz | Rhinestone Eyes | Plastic Beach | album | 0.676 | 0.703 | 8.0 | -5.815 | 0.0302 | 0.086900 | ... | 0.852 | 92.761 | 200173.0 | Gorillaz | 72011645.0 | 1079128.0 | 31003.0 | True | True | 3.100837e+08 |
| 2 | Gorillaz | New Gold (feat. Tame Impala and Bootie Brown) | New Gold (feat. Tame Impala and Bootie Brown) | single | 0.695 | 0.923 | 1.0 | -3.930 | 0.0522 | 0.042500 | ... | 0.551 | 108.014 | 215150.0 | Gorillaz | 8435055.0 | 282142.0 | 7399.0 | True | True | 6.306347e+07 |
| 3 | Gorillaz | On Melancholy Hill | Plastic Beach | album | 0.689 | 0.739 | 2.0 | -5.810 | 0.0260 | 0.000015 | ... | 0.578 | 120.423 | 233867.0 | Gorillaz | 211754952.0 | 1788577.0 | 55229.0 | True | True | 4.346636e+08 |
| 4 | Gorillaz | Clint Eastwood | Gorillaz | album | 0.663 | 0.694 | 10.0 | -8.627 | 0.1710 | 0.025300 | ... | 0.525 | 167.953 | 340920.0 | Gorillaz | 618480958.0 | 6197318.0 | 155930.0 | True | True | 6.172597e+08 |
5 rows × 22 columns
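After cleaning the data, we are left with 22 variables and 19549 entries. (The `info()` output above still reports 20718 entries with missing values, so it appears to have been captured before the rows were dropped.)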
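Dropping every row with any missing value is the simplest option, but it also discards tracks that only lack, for example, a comment count. The sketch below uses made-up data to show how the row count changes, and contrasts it with the `subset` argument of `dropna`, which only requires selected columns to be present. Which columns to require is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative toy data, not real values from the dataset.
toy = pd.DataFrame({
    "Track": ["a", "b", "c", "d"],
    "Views": [10.0, np.nan, 30.0, 40.0],
    "Stream": [1.0, 2.0, np.nan, 4.0],
    "Comments": [5.0, 6.0, 7.0, np.nan],
})

# Drop rows with a missing value in any column (what the notebook does).
strict = toy.dropna()

# Alternative: only require the two popularity columns to be present.
lenient = toy.dropna(subset=["Views", "Stream"])

print(len(toy), len(strict), len(lenient))  # 4 1 2
```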
License_count = datam['Licensed'].value_counts()
print(License_count)
labels = License_count.index.tolist()
sizes = License_count.values.tolist()
colors = plt.cm.inferno(np.linspace(0.9, 0.8, len(labels)))
plt.pie(sizes, labels=labels, colors= colors, autopct='%1.1f%%', startangle=90)
plt.title('Number of tracks that have a license')
plt.legend(labels, loc='best')
plt.show()
True     14140
False     6108
Name: Licensed, dtype: int64
About 70% of the songs in this dataset are licensed: the counts printed above give 14140 of 20248 tracks, or 69.8%.
Album_count = datam['Album_type'].value_counts()
print(Album_count)
labels = Album_count.index.tolist()
sizes = Album_count.values.tolist()
colors = plt.cm.inferno(np.linspace(0.9, 0.7, len(labels)))
plt.pie(sizes, labels=labels, colors= colors, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of the Types of Albums')
plt.legend(labels, loc='best')
plt.show()
album          14148
single          4689
compilation      712
Name: Album_type, dtype: int64
Most of the tracks in this list come from albums (72.4%), followed by singles (24.0%). Compilations account for only 3.6% of all tracks.
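These shares can also be computed directly instead of being read off the pie chart. A small sketch that rebuilds the `Album_type` column from the counts printed above and uses `value_counts(normalize=True)`, which returns fractions rather than counts:

```python
import pandas as pd

# Rebuild the Album_type column from the counts printed above.
album_type = pd.Series(
    ["album"] * 14148 + ["single"] * 4689 + ["compilation"] * 712,
    name="Album_type",
)

# normalize=True gives fractions; scale to percent and round to one decimal.
shares = (album_type.value_counts(normalize=True) * 100).round(1)
print(shares.to_dict())
```

In the notebook, the same call on `datam['Album_type']` would give the shares for the cleaned data.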
#Changing the measure of duration from milliseconds to minutes
datam['Duration_ms'] = (round(datam['Duration_ms']/(1000*60),2))
datam.rename(columns={'Duration_ms': 'Duration'}, inplace=True)
datam.head()
|  | Artist | Track | Album | Album_type | Danceability | Energy | Key | Loudness | Speechiness | Acousticness | ... | Valence | Tempo | Duration | Channel | Views | Likes | Comments | Licensed | official_video | Stream |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gorillaz | Feel Good Inc. | Demon Days | album | 0.818 | 0.705 | 6.0 | -6.679 | 0.1770 | 0.008360 | ... | 0.772 | 138.559 | 3.71 | Gorillaz | 693555221.0 | 6220896.0 | 169907.0 | True | True | 1.040235e+09 |
| 1 | Gorillaz | Rhinestone Eyes | Plastic Beach | album | 0.676 | 0.703 | 8.0 | -5.815 | 0.0302 | 0.086900 | ... | 0.852 | 92.761 | 3.34 | Gorillaz | 72011645.0 | 1079128.0 | 31003.0 | True | True | 3.100837e+08 |
| 2 | Gorillaz | New Gold (feat. Tame Impala and Bootie Brown) | New Gold (feat. Tame Impala and Bootie Brown) | single | 0.695 | 0.923 | 1.0 | -3.930 | 0.0522 | 0.042500 | ... | 0.551 | 108.014 | 3.59 | Gorillaz | 8435055.0 | 282142.0 | 7399.0 | True | True | 6.306347e+07 |
| 3 | Gorillaz | On Melancholy Hill | Plastic Beach | album | 0.689 | 0.739 | 2.0 | -5.810 | 0.0260 | 0.000015 | ... | 0.578 | 120.423 | 3.90 | Gorillaz | 211754952.0 | 1788577.0 | 55229.0 | True | True | 4.346636e+08 |
| 4 | Gorillaz | Clint Eastwood | Gorillaz | album | 0.663 | 0.694 | 10.0 | -8.627 | 0.1710 | 0.025300 | ... | 0.525 | 167.953 | 5.68 | Gorillaz | 618480958.0 | 6197318.0 | 155930.0 | True | True | 6.172597e+08 |
5 rows × 22 columns
datam['Duration'].describe()
count    20716.000000
mean         3.745316
std          2.079864
min          0.520000
25%          3.000000
50%          3.550000
75%          4.210000
max         77.930000
Name: Duration, dtype: float64
The average song duration is about 3.75 minutes, and the median is 3.55 minutes.
The shortest track lasts about 31 seconds (0.52 minutes) and the longest about 1 hour and 18 minutes (77.93 minutes).
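Because `Duration` is stored in decimal minutes, values such as 0.52 are easy to misread as "52 seconds". A minimal sketch of the conversion to a clock format, where `minutes_to_clock` is a hypothetical helper and not part of the notebook:

```python
def minutes_to_clock(minutes):
    """Format a duration given in decimal minutes as h:mm:ss or m:ss."""
    total_seconds = round(minutes * 60)
    hours, rest = divmod(total_seconds, 3600)
    mins, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{mins:02d}:{secs:02d}"
    return f"{mins}:{secs:02d}"

print(minutes_to_clock(0.52))      # 0:31
print(minutes_to_clock(77.93))     # 1:17:56
print(minutes_to_clock(3.745316))  # 3:45
```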
# Select the top 10 songs by number of views on YouTube
YouTubeTOP10ALL = datam.sort_values(by='Views', ascending=False)[:10]
YouTubeTOP10 = YouTubeTOP10ALL[['Views', 'Track']]
# Create a list of colors from the inferno colormap
colors = plt.cm.inferno(np.linspace(0.9, 0.5, len(YouTubeTOP10)))
# Show the plots next to each other
frame, (H_Plot1, H_Plot2) = plt.subplots(1, 2, figsize=(16, 8))
# Create the horizontal bar plot
H_Plot1 = YouTubeTOP10.plot(kind='barh', x='Track', y='Views', color=colors, ax=H_Plot1)
# Set the title and axis labels
H_Plot1.set_title('Top 10 Songs by Number of Views on YouTube')
H_Plot1.set_xlabel('Views')
H_Plot1.set_ylabel('Song Title')
# Select the top 10 songs by number of streams on Spotify
SpotifyTOP10ALL = datam.sort_values(by='Stream', ascending=False)[:10]
SpotifyTOP10 = SpotifyTOP10ALL[['Stream', 'Track']]
# Create a list of colors from the inferno colormap
colors = plt.cm.inferno(np.linspace(0.9, 0.5, len(SpotifyTOP10)))
# Create the horizontal bar plot
H_Plot2 = SpotifyTOP10.plot(kind='barh', x='Track', y='Stream', color=colors, ax=H_Plot2)
# Set the title and axis labels
H_Plot2.set_title('Top 10 Songs by Number of Streams on Spotify')
H_Plot2.set_xlabel('Streams')
H_Plot2.set(ylabel=None)
# Show the plots
frame.tight_layout()
plt.show()
Aside from "Shape of You", the top 10 songs on Spotify and YouTube are completely different.
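The overlap can be checked programmatically rather than by eye. The sketch below copies the track titles from the two top-10 selections, several of which appear twice, and compares them as sets. The Jaccard index used as a summary is an addition for illustration, not something the notebook computes.

```python
# Track titles copied from the two top-10 selections; some appear twice.
youtube_top = [
    "Despacito", "Despacito", "Shape of You",
    "See You Again (feat. Charlie Puth)",
    "See You Again (feat. Charlie Puth)",
    "Wheels on the Bus", "Uptown Funk (feat. Bruno Mars)",
    "Gangnam Style (강남스타일)", "Sugar", "Roar",
]
spotify_top = [
    "Blinding Lights", "Shape of You", "Someone You Loved",
    "rockstar (feat. 21 Savage)",
    "Sunflower - Spider-Man: Into the Spider-Verse",
    "Sunflower - Spider-Man: Into the Spider-Verse",
    "One Dance", "Closer", "Closer", "Believer",
]

# Sets remove the repeated titles before comparing.
yt, sp = set(youtube_top), set(spotify_top)
shared = yt & sp
jaccard = len(shared) / len(yt | sp)

print(shared)             # {'Shape of You'}
print(len(yt), len(sp))   # 8 8
print(round(jaccard, 3))  # 0.067
```

Each list holds only eight distinct songs. Calling `drop_duplicates(subset='Track')` before slicing would give ten distinct songs per plot; whether that is appropriate depends on why the rows repeat (presumably one row per credited artist, which is an assumption).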
We would like to explore this discrepancy further and compare the most streamed artists on each platform.
# Sum the YouTube views per artist
Artist_Y = datam.groupby('Artist')[['Views']].sum()
# Keep the 10 artists with the most views and turn the 'Artist' index
# back into a column so it can be used as the plot's x variable
top_10_artist_Y = Artist_Y.sort_values(['Views'], ascending=False)[:10].reset_index()
# Sum the Spotify streams per artist
Artist_S = datam.groupby('Artist')[['Stream']].sum()
# Keep the 10 artists with the most streams, again with 'Artist' as a column
top_10_artist_S = Artist_S.sort_values(['Stream'], ascending=False)[:10].reset_index()
# Create a list of colors from the inferno colormap
colors = plt.cm.inferno(np.linspace(0.9, 0.5, len(top_10_artist_Y)))
# Show the plots next to each other
frame2, (H_Plot3, H_Plot4) = plt.subplots(1, 2, figsize=(16, 8))
# Create the horizontal bar plot for YouTube
H_Plot3 = top_10_artist_Y.plot(kind='barh', x='Artist', y='Views', color=colors, ax=H_Plot3)
# Set the title and axis labels
H_Plot3.set_title('Top 10 Streamed Artists on YouTube')
H_Plot3.set_xlabel('Views')
H_Plot3.set_ylabel('Artist')
# Create the horizontal bar plot for Spotify
H_Plot4 = top_10_artist_S.plot(kind='barh', x='Artist', y='Stream', color=colors, ax=H_Plot4)
# Set the title and axis labels
H_Plot4.set_title('Top 10 Streamed Artists on Spotify')
H_Plot4.set_xlabel('Streams')
H_Plot4.set(ylabel=None)
frame2.tight_layout()
plt.show()
As in the song-level comparison, Ed Sheeran, the singer of "Shape of You", is one of the artists present in the top 10 of both platforms. Bruno Mars is also present in both lists.
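The artists common to both rankings can also be found programmatically. A minimal sketch on made-up data (the artist names and numbers are placeholders), using `groupby` to total each metric per artist and `nlargest` to pick the leaders:

```python
import pandas as pd

# Illustrative toy data, not real values from the dataset.
toy = pd.DataFrame({
    "Artist": ["A", "A", "B", "B", "C", "D"],
    "Views":  [10, 20, 5, 5, 50, 1],
    "Stream": [100, 100, 300, 50, 10, 20],
})

# Total views and streams per artist.
totals = toy.groupby("Artist")[["Views", "Stream"]].sum()

# Top 2 artists per platform (the notebook uses the top 10).
top_views = totals["Views"].nlargest(2).index
top_stream = totals["Stream"].nlargest(2).index

# Artists that appear in both rankings.
both = sorted(set(top_views) & set(top_stream))
print(both)  # ['A']
```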
Let's further explore the musical characteristics of the songs that are popular on each streaming service.
Youtube_Songs_Attributs=YouTubeTOP10ALL[['Track','Danceability','Loudness','Speechiness','Acousticness','Liveness','Valence','Tempo']]
print(Youtube_Songs_Attributs)
|  | Track | Danceability | Loudness | Speechiness | Acousticness | Liveness | Valence | Tempo |
|---|---|---|---|---|---|---|---|---|
| 1147 | Despacito | 0.655 | -4.787 | 0.1530 | 0.19800 | 0.0670 | 0.839 | 177.928 |
| 365 | Despacito | 0.655 | -4.787 | 0.1530 | 0.19800 | 0.0670 | 0.839 | 177.928 |
| 12452 | Shape of You | 0.825 | -3.183 | 0.0802 | 0.58100 | 0.0931 | 0.931 | 95.977 |
| 14580 | See You Again (feat. Charlie Puth) | 0.689 | -7.503 | 0.0815 | 0.36900 | 0.0649 | 0.283 | 80.025 |
| 12469 | See You Again (feat. Charlie Puth) | 0.689 | -7.503 | 0.0815 | 0.36900 | 0.0649 | 0.283 | 80.025 |
| 20303 | Wheels on the Bus | 0.941 | -11.920 | 0.0427 | 0.18400 | 0.1570 | 0.965 | 125.021 |
| 10686 | Uptown Funk (feat. Bruno Mars) | 0.856 | -7.223 | 0.0824 | 0.00801 | 0.0344 | 0.928 | 114.988 |
| 8937 | Gangnam Style (강남스타일) | 0.727 | -2.871 | 0.2860 | 0.00417 | 0.0910 | 0.749 | 132.067 |
| 9569 | Sugar | 0.748 | -7.055 | 0.0334 | 0.05910 | 0.0863 | 0.884 | 120.076 |
| 13032 | Roar | 0.671 | -4.821 | 0.0316 | 0.00492 | 0.3540 | 0.436 | 90.003 |
Spotify_Songs_Attributs=SpotifyTOP10ALL[['Track','Danceability','Loudness','Speechiness','Acousticness','Liveness','Valence','Tempo']]
print(Spotify_Songs_Attributs)
|  | Track | Danceability | Loudness | Speechiness | Acousticness | Liveness | Valence | Tempo |
|---|---|---|---|---|---|---|---|---|
| 15250 | Blinding Lights | 0.514 | -5.934 | 0.0598 | 0.00146 | 0.0897 | 0.334 | 171.005 |
| 12452 | Shape of You | 0.825 | -3.183 | 0.0802 | 0.58100 | 0.0931 | 0.931 | 95.977 |
| 19186 | Someone You Loved | 0.501 | -5.679 | 0.0319 | 0.75100 | 0.1050 | 0.446 | 109.891 |
| 17937 | rockstar (feat. 21 Savage) | 0.585 | -6.136 | 0.0712 | 0.12400 | 0.1310 | 0.129 | 159.801 |
| 17445 | Sunflower - Spider-Man: Into the Spider-Verse | 0.755 | -4.368 | 0.0575 | 0.53300 | 0.0685 | 0.925 | 89.960 |
| 17938 | Sunflower - Spider-Man: Into the Spider-Verse | 0.755 | -4.368 | 0.0575 | 0.53300 | 0.0685 | 0.925 | 89.960 |
| 13503 | One Dance | 0.792 | -5.609 | 0.0536 | 0.00776 | 0.3290 | 0.370 | 103.967 |
| 16099 | Closer | 0.748 | -5.599 | 0.0338 | 0.41400 | 0.1110 | 0.661 | 95.010 |
| 16028 | Closer | 0.748 | -5.599 | 0.0338 | 0.41400 | 0.1110 | 0.661 | 95.010 |
| 14030 | Believer | 0.776 | -4.374 | 0.1280 | 0.06220 | 0.0810 | 0.666 | 124.949 |
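A simple way to compare the two groups is to average each attribute per platform. The sketch below does this for `Danceability` and `Valence` only, with the values copied from the two printed top-10 selections. With ten rows per platform, several of them repeated tracks, any difference should be read as a rough indication rather than a finding.

```python
import pandas as pd

# Danceability and Valence copied from the two printed top-10 selections.
youtube = pd.DataFrame({
    "Danceability": [0.655, 0.655, 0.825, 0.689, 0.689,
                     0.941, 0.856, 0.727, 0.748, 0.671],
    "Valence": [0.839, 0.839, 0.931, 0.283, 0.283,
                0.965, 0.928, 0.749, 0.884, 0.436],
})
spotify = pd.DataFrame({
    "Danceability": [0.514, 0.825, 0.501, 0.585, 0.755,
                     0.755, 0.792, 0.748, 0.748, 0.776],
    "Valence": [0.334, 0.931, 0.446, 0.129, 0.925,
                0.925, 0.370, 0.661, 0.661, 0.666],
})

# One row per attribute, one column per platform.
comparison = pd.DataFrame({
    "YouTube": youtube.mean(),
    "Spotify": spotify.mean(),
}).round(3)
comparison["Difference"] = (comparison["YouTube"] - comparison["Spotify"]).round(3)
print(comparison)
```

On these twenty rows, the YouTube top 10 averages slightly higher on both attributes; the same pattern could be applied to the remaining columns of `Youtube_Songs_Attributs` and `Spotify_Songs_Attributs`.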