Inspired by Nerudista’s post on the Tacos de Datos website (in Spanish), I asked Spotify for my data and started making some plots with it. If you want to do the same, head over to this page; once you request your data, it will take a couple of days to become available. In the meantime, you can use this Google Colab, where I’ve created a subset of my data for you to play with.
Among all the information you will receive, there will be some files named following this pattern: StreamingHistoryXX.json, and these are the ones we’ll use throughout this post.
The data
The files mentioned above contain something like this:
[
{
"endTime" : "2019-02-04 17:14",
"artistName" : "MGMT",
"trackName" : "Time to Pretend",
"msPlayed" : 261000
},
{
"endTime" : "2019-02-04 17:18",
"artistName" : "MGMT",
"..."
Where the values are:
endTime: Day and time at which a song finished, in UTC format.
artistName: Name of the artist of the song.
trackName: Name of the song.
msPlayed: For how long (in milliseconds) a song was played.
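To make these fields concrete, here is a minimal sketch that takes the first record from the sample above and converts its values into friendlier units. The timestamp format string is an assumption inferred from that sample, and `record` is just an illustrative name.

```python
from datetime import datetime, timedelta

# First record from the sample above
record = {
    "endTime": "2019-02-04 17:14",
    "artistName": "MGMT",
    "trackName": "Time to Pretend",
    "msPlayed": 261000,
}

# Assumed format, inferred from the sample: year-month-day hour:minute
end_time = datetime.strptime(record["endTime"], "%Y-%m-%d %H:%M")

# msPlayed is in milliseconds; timedelta handles the unit conversion
played = timedelta(milliseconds=record["msPlayed"])
minutes, seconds = divmod(int(played.total_seconds()), 60)

print(f'{record["trackName"]}: {minutes} min {seconds} s, ended at {end_time}')
```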
To load this data into a DataFrame we’ll need this tiny function:
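from glob import glob
import json
import pandas as pd

def read_history():
    """Load every StreamingHistory*.json file into a single DataFrame."""
    history = []
    for file in sorted(glob("StreamingHistory*.json")):
        # Each file holds a JSON list of plays (assumes UTF-8 encoded files)
        with open(file, encoding="utf-8") as readable:
            history.extend(json.load(readable))
    history = pd.DataFrame(history)
    # Parse the endTime strings into proper timestamps
    history["endTime"] = pd.to_datetime(history["endTime"])
    return history

streaming_history = read_history()
streaming_history.head(5)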
Which should give us something like this once we execute it:
| | endTime | artistName | trackName | msPlayed |
|---|---|---|---|---|
| 0 | 2019-02-04 17:14:00 | MGMT | Time to Pretend | 261000 |
| 1 | 2019-02-04 17:18:00 | MGMT | When You Die | 263880 |
| 2 | 2019-02-04 17:23:00 | Mr Dooves | Super Smash Bros. Brawl Main Theme (From “Super Smash Bros. Brawl”) [A Cappella] | 201518 |
Histogram
I’ve always been a fan of the way GitHub shows developer contributions, and the data we got from Spotify seems like the perfect candidate to create something similar; however, we need to perform some transformations first.
Since we are not interested in the time at which each song finished, we’ll get rid of the time part of endTime:
streaming_history["date"] = streaming_history["endTime"].dt.floor('d')
Then we’ll get a count of songs per day using groupby:
by_date = streaming_history.groupby("date")[["trackName"]].count()
by_date = by_date.sort_index()
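To see what these two steps do, here is a small sketch on a toy DataFrame. The first two rows come from the sample shown earlier; the third row is made up purely for illustration, as are the names `toy` and `toy_by_date`.

```python
import pandas as pd

toy = pd.DataFrame({
    "endTime": pd.to_datetime([
        "2019-02-04 17:14",
        "2019-02-04 17:18",
        "2019-02-05 09:30",  # made-up play, for illustration only
    ]),
    "trackName": ["Time to Pretend", "When You Die", "Kids"],
})

# Drop the time of day so that plays on the same day share one key
toy["date"] = toy["endTime"].dt.floor("d")

# Count the plays that fall on each day
toy_by_date = toy.groupby("date")[["trackName"]].count().sort_index()
print(toy_by_date)
```

After the count, the trackName column no longer holds names: it holds the number of songs played on each date.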
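For our plot, we’ll need to know which day of the week and which week of the year each date falls on; we can use the weekday property and the ISO week number:
by_date["weekday"] = by_date.index.weekday
# DatetimeIndex.week was removed in pandas 2.0; isocalendar().week replaces it
by_date["week"] = by_date.index.isocalendar().week.astype(int)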
By the end, our dataframe should look like this:
| date | trackName | weekday | week |
|---|---|---|---|
| 2019-02-04 | 5 | 0 | 6 |
| 2019-02-05 | 8 | 1 | 6 |
| 2019-02-07 | 60 | 3 | 6 |
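As a quick sanity check of the weekday and week columns, the same values can be reproduced with the standard library alone. This is only a sketch to show where the numbers come from; `checks` is an illustrative name.

```python
from datetime import date

checks = []
for day in (date(2019, 2, 4), date(2019, 2, 5), date(2019, 2, 7)):
    # isocalendar() returns (ISO year, ISO week number, ISO weekday)
    iso_week = day.isocalendar()[1]
    # weekday() counts from Monday = 0 to Sunday = 6
    checks.append((day.isoformat(), day.weekday(), iso_week))

for row in checks:
    print(row)
```

Monday is 0 and Sunday is 6, which is why 2019-02-04 (a Monday) gets weekday 0.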
We have almost everything we need; the next step is to get a continuous sequence of numbers for each week. In the dataframe above, the 6th week of 2019 must be week 0, the 7th must be week 1, and so on. I can’t think of a better way to do it than a for loop:
import numpy as np

week = 0
prev_week = by_date.iloc[0]["week"]
continuous_week = np.zeros(len(by_date)).astype(int)
for i, (_, row) in enumerate(by_date.iterrows()):
    # Move on to the next column whenever the calendar week changes
    if row["week"] != prev_week:
        week += 1
        prev_week = row["week"]
    continuous_week[i] = week
by_date["continuous_week"] = continuous_week
by_date.head()
| date | trackName | weekday | week | continuous_week |
|---|---|---|---|---|
| 2019-02-04 | 5 | 0 | 6 | 0 |
| 2019-02-05 | 8 | 1 | 6 | 0 |
| 2019-02-07 | 60 | 3 | 6 | 0 |
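One thing to keep in mind is that the loop adds one every time the week number changes, so a week with no plays at all would not get its own column. If that matters for your data, one possible alternative (a sketch, not the method used in this post) is to count whole weeks from the Monday on or before the first date. The dates below are partly made up for illustration, and `week_index` is an illustrative name.

```python
import pandas as pd

# Illustrative dates: note the gap between 2019-02-07 and 2019-02-20
dates = pd.to_datetime([
    "2019-02-04", "2019-02-05", "2019-02-07",
    "2019-02-20", "2019-12-30", "2020-01-02",
])

# Monday on or before the earliest date
first_day = dates.min()
first_monday = first_day - pd.Timedelta(days=first_day.weekday())

# Whole weeks elapsed since that Monday, computed for all dates at once
week_index = (dates - first_monday).days // 7
print(week_index.tolist())
```

Here the week of 2019-02-11 has no plays, so 2019-02-20 lands in column 2 rather than column 1, and the sequence keeps growing across the change of year.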
Our next step is to generate a matrix with days ✕ weeks as dimensions, where each entry is the number of songs we listened to on that day of that week:
songs = np.full((7, continuous_week.max()+1), np.nan)
for index, row in by_date.iterrows():
    songs[row["weekday"], row["continuous_week"]] = row["trackName"]
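The same filling step can also be written without a loop, using NumPy integer-array indexing. The sketch below uses the three rows from the table above; `songs_demo` and the three input arrays are illustrative names.

```python
import numpy as np

# Values taken from the three rows of the table above
weekday = np.array([0, 1, 3])
week = np.array([0, 0, 0])
counts = np.array([5, 8, 60])

# Start with NaN everywhere so that days without data stay "empty"
songs_demo = np.full((7, week.max() + 1), np.nan)

# Assign every (weekday, week) cell in one go
songs_demo[weekday, week] = counts
print(songs_demo[:, 0])
```

Cells that never receive a count remain NaN, which is what later lets us tell "no data" apart from "zero songs".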
Now we could just plot the matrix songs using seaborn:
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(20,5))
ax = plt.subplot()
sns.heatmap(songs, ax=ax)
But the result is not that great:
If we want it to look better we still need some more code. The first thing to do is to clean up the axis labels:
import calendar
from datetime import timedelta

min_date = streaming_history["endTime"].min()
first_monday = min_date - timedelta(min_date.weekday())
# One Monday per column of the matrix, hence the + 1
mons = [first_monday + timedelta(weeks=wk) for wk in range(continuous_week.max() + 1)]
# Label a column with its month only when the month changes
x_labels = [calendar.month_abbr[mons[0].month]]
x_labels.extend([
    calendar.month_abbr[mons[i].month] if mons[i-1].month != mons[i].month else ""
    for i in range(1, len(mons))])
y_labels = ["Mon", "", "Wed", "", "Fri", "", "Sun"]
The X axis labels are far more complicated than the Y ones because, unlike Y, the X axis is neither fixed nor continuous, so its labels need to be calculated from the data. If you want a more detailed explanation, tell me in the comments or on Twitter at @feregri_no.
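To illustrate the idea behind the X labels, here is a small sketch using only the standard library: it takes six consecutive Mondays starting on 2019-02-04 and writes the month abbreviation only on the first Monday and wherever the month changes. The names `mondays` and `labels` are illustrative.

```python
import calendar
from datetime import date, timedelta

first_monday = date(2019, 2, 4)
mondays = [first_monday + timedelta(weeks=wk) for wk in range(6)]

labels = []
for i, monday in enumerate(mondays):
    # Show the month on the first column and whenever it differs
    # from the month of the previous Monday
    month_changed = i == 0 or mondays[i - 1].month != monday.month
    labels.append(calendar.month_abbr[monday.month] if month_changed else "")

print(labels)
```

The empty strings are what keep the axis uncluttered: only one label per month is drawn.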
After doing this we’ll make some modifications to the colours and the axes:
fig = plt.figure(figsize=(20,5))
ax = plt.subplot()
ax.set_title("My year on Spotify", fontsize=20,pad=40)
ax.xaxis.tick_top()
ax.tick_params(axis='both', which='both',length=0)
ax.set_facecolor("#ebedf0")
fig.patch.set_facecolor('white')
And finally, we can use seaborn’s heatmap again, this time with a few extra arguments that I will explain later:
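sns.heatmap(songs, linewidths=2, linecolor='white', square=True,
            mask=np.isnan(songs), cmap="Greens",
            vmin=0, vmax=100, cbar=False, ax=ax)
# Put one tick at the centre of each row and column so that the number of
# ticks matches the number of labels (recent matplotlib versions require it)
ax.set_yticks(np.arange(len(y_labels)) + 0.5)
ax.set_xticks(np.arange(len(x_labels)) + 0.5)
ax.set_yticklabels(y_labels, rotation=0)
ax.set_xticklabels(x_labels, ha="left")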
The arguments are as follows:
songs: our matrix with shape days ✕ weeks, holding the counts of songs per day,
linewidths: the size of the spacing between each patch,
linecolor: the colour of the spacing between each patch,
square: this tells the function that we want to keep a 1:1 aspect ratio for each patch,
mask: a very interesting argument, it will help us “mask” the patches for which there is no recorded value; this argument should be a boolean matrix of the same dimensions as the data being plotted, where each True means that that specific value must be masked,
cmap: the colormap to be used; luckily for us, the value “Greens” matches the colour palette chosen by GitHub,
vmin: the value that should be considered as the minimum among our values,
vmax: the value that should be considered as the maximum among our values; I’d consider 100 to be the maximum, even though my record sits at 190 in a day!
cbar: a boolean value to indicate whether we want to show the colour bar that usually comes with the heatmap,
ax: the axes our plot should be plotted on.
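To get a feel for mask, vmin and vmax without drawing anything, here is a rough NumPy-only sketch. It is an approximation of the colour scaling, not seaborn’s actual implementation: values are mapped linearly between vmin and vmax, and anything above vmax ends up with the darkest colour. The matrix `toy` and the name `position` are illustrative.

```python
import numpy as np

# Toy matrix: one missing day and one day above the chosen vmax
toy = np.array([[5.0, np.nan],
                [190.0, 60.0]])

# True wherever there is no recorded value; these patches are not drawn
mask = np.isnan(toy)

# Approximate position of each value on the colour scale (0 = lightest,
# 1 = darkest) for vmin=0 and vmax=100
vmin, vmax = 0, 100
position = (np.clip(toy, vmin, vmax) - vmin) / (vmax - vmin)

print(mask)
print(position)
```

With these settings, the record day of 190 songs is drawn with the same colour as a day of 100 songs, which keeps a few extreme days from washing out the rest of the plot.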
And voilà, our plot is ready:
It is up to you to modify the plot, maybe by adding information about the number of songs or by showing the colorbar. A great idea would be to recreate this plot in a framework such as D3.js, but that may well belong to another post. Again, feel free to head over to this Colab and contact me via Twitter at @feregri_no.