
Spotify Music Recommender

by Group 9

Shuying Ni, Mingyue Wei, Zhenwei Wu, Yiqing Zhang

Problem Statement and Motivation

The problem we are trying to solve is to recommend songs based on an existing list of songs, a user-chosen scenario, or both. Specifically, we can take a list of songs as input and output a list of songs similar to the original list; and/or we can take a scenario in which the songs are to be played, e.g. a workout, and recommend songs suitable for that scenario.

Introduction and Description of Data

To start the initial analysis and exploration of the provided data, and because of the limited usage allowed by the Spotify API, we first looked into only the first track file, which contains about 34,000 unique track records. Even though this is a small percentage of the total data, it is sufficient for EDA: the quantitative fields cover almost their whole range (e.g. the "danceability" field spans 0 to 1), and the categorical fields cover almost the whole space. Beyond the basic information (track name, artist name, album), we used the Spotify API to acquire audio features and other relevant data for each track, as well as details of the corresponding artists. The artist file contained duplicate values, and we cleaned it by keeping the 17,000 unique records out of 134,000.

 

We are dealing with mainly two types of data: categorical and quantitative. The former is mostly text, plus some numeric fields like "mode" that only take the values 0 and 1; the latter is all numeric. Categorical variables include track name, artist name, and album name for songs, and genres for artists. Quantitative variables include the audio features (feature scores assigned to each track by Spotify), popularity score, track duration, release date, etc.


We intended to take a content-based approach to recommendation, and looked into which variables could be used to define similarity between songs. We decided to mainly consider quantitative variables, plus the categorical variables genre and mode. Other categorical variables such as track name, artist name, and album are excluded, since matching on them would be trivial. The 11 track features provided by Spotify ('danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo') describe different aspects of a track, and all of them are included. We chose not to use popularity score and track duration, because the former reflects public taste and the latter is not closely related to a song's content.


Missing values are not a big problem except for genres: around 22% of the 17,000 artists are not assigned to any genre. Further inspection reveals that the missing data might not be missing at random, since the majority of artists without a genre are in the lower popularity quartile. This could be a problem if we incorporate genre into the recommendation algorithm, but a possible solution is to impute genre(s) by connecting similar artists based on their tracks.

 

Figure 1 presents the distribution of the number of genres assigned to an artist, with frequency on a log scale. In the artist data, each artist has a genre list whose length ranges from 0 to 25. Around 22% of artists are not assigned any genre, and most artists have 1 to 10 genres. This poses a challenge for including genre in our recommendation system.

artist_genre_dist.png

Figure 1

genre_freq.png

Figure 2

We also investigated the popularity of all 2,207 genres; the top 20 are displayed in Figure 2. The distribution is skewed, with dance pop as the most popular genre. Other subcategories of pop, such as pop and pop rap, are also in the top 20. This illustrates the challenge of assigning a single genre to an artist.


To check the relationship between artists and songs, we examined the number of songs per artist; Figure 3 shows the distribution. As we can see, some artists have over 200 songs. Thus, assigning a single genre to songs based on an artist's genre list would be challenging, as an artist's songs can have different styles.

song_per_artist.png

Figure 3

Literature Review/Related Work

We used a literature review to decide on a specific approach, modeling strategy, and evaluation [1]. Generally, there are three approaches to recommender systems: collaborative filtering, content-based filtering, and a hybrid of the two. Collaborative filtering is based on user-item interactions, assuming that users who agreed in the past will agree in the future. Content-based filtering depends on the features of items or the characteristics of users. We decided to pursue content-based filtering, since we have feature information for songs and artists but do not have the user-item interaction history or user information needed for collaborative filtering.

 

In our content-based filtering approach, we decided to use the cosine similarity measure to find similar songs [2]. The similarity of an item i and an item j is based on the angle between the songs' content-profile vectors x_i and x_j:

    sim(i, j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)

Cosine similarity ranges from -1 to 1 (from 0 to 1 when all feature values are non-negative), with higher values indicating more similar songs.

To measure the effectiveness of a recommender system, common practice is to run online evaluations such as user studies and A/B tests, or offline evaluations [3]. Due to the limitations of our work scope, we were not able to perform such evaluations; this could be a next step for continuing this project.
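The cosine similarity measure can be sketched in a few lines. The feature values below are invented for illustration; the real system compares the 11-dimensional Spotify feature vectors.

```python
# A minimal sketch of the cosine similarity measure, using toy
# three-dimensional content profiles (the real system uses the
# 11 Spotify audio features).
import numpy as np

def cosine_similarity(x_i, x_j):
    """Cosine of the angle between two content-profile vectors."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return float(x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))

# Toy profiles (danceability, energy, valence), invented for illustration.
song_a = [0.8, 0.9, 0.6]
song_b = [0.7, 0.8, 0.5]
similarity = cosine_similarity(song_a, song_b)
```

Identical profiles score 1.0 and orthogonal profiles score 0.0, so ranking candidates by this score surfaces the closest match.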

Modeling Approach

As mentioned in the previous section, collaborative filtering and content-based filtering are two common approaches for a recommendation system. Since we did not have user information at hand but only the track and artist information, we went with the content-based path. That is, we tried to recommend songs similar to the user input.

 

Our baseline model depended solely on k-means clustering to determine similarity: we clustered on the 11 audio features, arbitrarily choosing k = 100. The recommender works by first predicting the cluster of the input track and then randomly selecting one track from that cluster to return as the recommendation.
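The baseline can be sketched as below. The KMeans usage is scikit-learn's; the `tracks` DataFrame, its column names, and the helper names are stand-ins for our actual code.

```python
# A sketch of the baseline recommender: cluster tracks on their audio
# features with k-means, then return a random track from the input
# song's predicted cluster. `tracks` is a stand-in DataFrame.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

FEATURES = ['danceability', 'energy', 'key', 'loudness', 'mode',
            'speechiness', 'acousticness', 'instrumentalness',
            'liveness', 'valence', 'tempo']

def fit_baseline(tracks: pd.DataFrame, k: int = 100) -> KMeans:
    """Fit k-means on the 11 Spotify audio features."""
    return KMeans(n_clusters=k, random_state=0, n_init=10).fit(tracks[FEATURES])

def baseline_recommend(km, tracks, song_features, seed=0):
    """Predict the input song's cluster, then sample one track from it."""
    rng = np.random.default_rng(seed)
    label = km.predict(np.asarray(song_features, dtype=float).reshape(1, -1))[0]
    candidates = tracks.index[km.labels_ == label]
    return tracks.loc[rng.choice(candidates)]
```

In production the model would be fit once on the full 34,000-track set with k = 100; the random draw is what the full recommender later replaces with cosine similarity.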

 

To give an overview of our full recommender, the main approach is still to recommend similar songs, but this time we used cosine similarity to make the recommendation, which is more precise than randomly selecting from a large cluster. We also used cosine similarity to construct a subset of tracks suited to a specific recommendation mode, "workout". More details below.


The full recommender is built using both cosine similarity and k-means clustering. We use cosine similarity to decide the most similar song to recommend, but due to the large number of tracks in our data set, it is extremely inefficient and time-consuming to compute the cosine similarity against every track for each input song. K-means clustering solves this problem: we first divide the original track data set into k clusters using the same features as the cosine similarity computation. The optimal k is chosen by the elbow method using the WSS (within-cluster sum of squared errors), and the resulting k is 100. After clustering, every track in the original set is assigned to a cluster.

wss.png
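The elbow-method search can be sketched as follows; the synthetic feature matrix stands in for the real track features, and `inertia_` is scikit-learn's name for the WSS.

```python
# A sketch of the elbow-method choice of k: fit k-means for a range of
# k values and record the WSS (KMeans' `inertia_`), as in the plot above.
import numpy as np
from sklearn.cluster import KMeans

def wss_curve(X, k_values):
    """WSS for each candidate number of clusters k."""
    return [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in k_values]

X = np.random.default_rng(0).random((60, 11))   # stand-in feature matrix
curve = wss_curve(X, [1, 2, 4, 8])
# WSS shrinks as k grows; the "elbow" where the curve flattens
# (k = 100 on the real data) is the chosen value.
```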

The input to the general recommender is a list of songs provided by the user. Although the input is a list of tracks rather than a single track, we decided not to condense all the songs' features by taking the mean, because a user might like both very energetic songs and very slow songs, and averaging them would land in the middle, which matches neither taste. Thus, when a list of new songs comes in, the model first predicts the cluster label for each song. Then, the songs in the original set assigned to the same cluster as an input song are used to compute cosine similarity with it, and the most similar one (based on the selected features) is returned. This is a one-to-one pairing for each input song, and the result is a list of songs of the same length as the input.
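The cluster-then-compare pipeline can be sketched as below. The fitted k-means model and synthetic `tracks` DataFrame stand in for the real data; the function names are illustrative.

```python
# A sketch of the general recommender: for each input song, predict its
# cluster, then return the most cosine-similar track within that cluster.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def recommend(km, tracks, feature_cols, input_features):
    """Return one recommended track index per input song (1-to-1 pairing)."""
    X = tracks[feature_cols].to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1)
    recs = []
    for song in np.asarray(input_features, dtype=float):
        label = km.predict(song.reshape(1, -1))[0]
        mask = km.labels_ == label                  # restrict to one cluster
        sims = (X[mask] @ song) / (norms[mask] * np.linalg.norm(song))
        recs.append(tracks.index[mask][int(np.argmax(sims))])
    return recs

# Synthetic stand-in for the real track set.
cols = [f'feature_{i}' for i in range(11)]
tracks = pd.DataFrame(np.random.default_rng(0).random((40, 11)), columns=cols)
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(tracks[cols])
recs = recommend(km, tracks, cols, tracks[cols].iloc[:2].to_numpy())
```

Restricting the cosine computation to one cluster is what makes the lookup tractable: only a hundredth of the tracks are scored per input song on average.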

 

Besides this general recommender, we also used the workout track set to implement a special recommendation mode. If the mode "workout" is passed in, our model recommends songs good for a workout, depending on the other input: if no user songs are passed in, the recommender randomly selects 10 songs from the workout track set and returns them. If the user also passes in a song list, the recommender uses cosine similarity to return the most similar song for each input, this time choosing from the workout track set rather than the full set.
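The two branches of the workout mode can be sketched as below; the `workout_tracks` matrix is a synthetic stand-in for our 500-plus workout tracks.

```python
# A sketch of the "workout" mode: with no user songs, sample 10 random
# workout tracks; with user songs, match each against the workout set
# by cosine similarity instead of the full set.
import numpy as np

def workout_recommend(workout_tracks, user_songs=None, seed=0):
    rng = np.random.default_rng(seed)
    if user_songs is None:
        # Mode only: 10 distinct random workout tracks.
        return list(rng.choice(len(workout_tracks), size=10, replace=False))
    # Mode + song list: the most cosine-similar workout track per song.
    norms = np.linalg.norm(workout_tracks, axis=1)
    recs = []
    for song in np.asarray(user_songs, dtype=float):
        sims = (workout_tracks @ song) / (norms * np.linalg.norm(song))
        recs.append(int(np.argmax(sims)))
    return recs

workout_tracks = np.random.default_rng(1).random((30, 11))  # stand-in set
random_ten = workout_recommend(workout_tracks)
matched = workout_recommend(workout_tracks, user_songs=workout_tracks[:2])
```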

Project Trajectory, Results, and Interpretation

There were two changes to our project goals. First, we intended to construct multiple modes for users to choose from, but ended up with one example mode. Second, we intended to use the genre of a song as a predictor, but ended up not using this feature. Details below.


Workout track data set

Our initial idea was to construct a list of songs for the most common listening scenarios, like "workout", "study", "road trip", etc. There were a couple of challenges. First, the data for the initial lists is hard to acquire, so we only provide an example of one scenario; the method can easily be extended to other scenarios. Second, coming up with scenarios is more an art than a science: it would be hard to produce an exhaustive list of the scenarios in which users listen to songs. As a result, we implemented only one scenario, "workout". For this mode, a new list of more than 500 tracks was generated.


Genre Imputation

Since genre could be used in deciding similarity between songs, we wanted to include it in the recommendation model as well. However, there are several problems: 1) genres are assigned to artists rather than tracks, and most artists have more than one genre, so we needed to figure out how to map multiple artist genres to a single track genre; 2) 22% of artists have no genres assigned, so we would need to impute the missing values.

 

Our first approach was to use artists with a single genre as the training data, and use the resulting model to predict genres for songs whose artists have multiple genres. We mapped these artists' genre to their tracks (because there is only one genre, all tracks from such an artist are assigned that genre), and used the tracks' quantitative features as predictors of genre. We then used the trained model to predict a genre for tracks from artists with multiple genres. This method did not work because 1) only 21.6% of artists have a single genre, and 2) many of the genres of the multi-genre artists do not appear among the single-genre labels (the model's response variable), and the model could not predict classes it had never seen.


The second approach was to assign each artist with genres a single genre, merge the artist data with the track data, and then train a classifier. Each artist has a genre list whose length can range from 0 to 25. We started by assigning a single genre to artists with multiple genres, using the most popular genre within the list. After checking the genre distribution, we had over 1,000 unique genres, which could cause the curse of dimensionality under one-hot encoding. We therefore kept genres with frequency greater than a threshold of 32 (0.05% of the total number of songs) and merged all the rest into a single genre called "minor", leaving 148 genres. Since we need to map a genre to each song, we merged the genre-labeled artist data with the songs' track features and used the merged data to train a random forest model. Below is a plot of train and test scores with respect to tree depth. We chose a depth of 12 to avoid overfitting. However, the train score was only 0.42 and the test score 0.25. We also tried an SVM, but it did not perform better: both train and test scores were around 0.22.
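The frequency-threshold merge can be sketched as below; the threshold of 32 comes from the text, while the pandas usage and the toy genre list are our assumptions.

```python
# A sketch of the genre-merging step: keep genres whose frequency
# exceeds a threshold and collapse the rest into a single "minor" genre.
import pandas as pd

def merge_minor_genres(genres: pd.Series, threshold: int = 32) -> pd.Series:
    """Replace genres rarer than `threshold` with the label 'minor'."""
    counts = genres.value_counts()
    keep = set(counts[counts > threshold].index)
    return genres.where(genres.isin(keep), other='minor')

# Toy example: two frequent genres survive, the rare one is merged.
sample = pd.Series(['dance pop'] * 40 + ['pop rap'] * 35 + ['zydeco'] * 3)
merged = merge_minor_genres(sample, threshold=32)
```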

model_score_over_time.png

As a result, we decided not to include genre in our recommender, for several reasons: 1) the prediction accuracy of the model is too low for us to trust the result; 2) even if the imputation were trustworthy, one-hot encoding the genre would produce high-dimensional categorical data, which is not preferred; 3) assigning one genre to all tracks of a multi-genre artist is problematic, because their music can be very different.

 

The final recommendation algorithm has 3 kinds of recommendations: general recommendation based on input list, random recommendation of workout songs, and recommendation of workout songs based on input list.

 

Here is a demo of the three use cases: we randomly selected 10 songs as input (listed below), and the results returned by the three modes are presented.

Lyrics to Go

Clappin' (remix) [feat. Alex Faith, Mission, Gs & Ada-L]

Check My Fresh

Lo Mejor de Mi Vida Eres Tú

Telling The World

Retiro Lo Dicho'

Pornographic

Quién

Shadowfall

I Giorni

Songs similar to inputs


Earthquake

The Difference

The Problem

Doble Vida

Big Empty

Me Duele Amarte

Good Knight (feat. Joey Bada$$, Flatbush Zombies & Dizzy Wright)

I Fall Apart

Fiction Friction

Rest

Random workout songs


Deeper Than The Holler

The News

Need U (100%)

The Seed (2.0)

Same Old Lie

Good For You - Phantoms Remix

One That Got Away

Amplifier

Buckin On Em (feat. Mr Sneed & Relapse)

Your Wish - Naxxos Remix

Workout songs similar to inputs


Sea Calls Me Home

Better When I'm Dancin'

Trampoline Booty

A Hazy Shade of Winter

OTW

When Evening is Overwhelming

Hand Of Doom - Remastered Version

Conrad

All The Way - Remastered

Coulda DJ (Dem Neva Know)

Since it is hard to deploy the mainstream evaluation methods for recommender systems, such as A/B testing and user studies, we listened to the recommended songs ourselves and checked that the results make sense. However, this evaluation is purely qualitative, and the tested sample is very limited.

Conclusion and Future Work

Using data provided by Spotify, we developed a recommender system that takes as input an existing list of songs and/or a scenario in which the list is to be used, and recommends songs based on that input. Although our recommender only supports one scenario for now, the algorithm could easily be applied to other scenarios.

 

The limitations of our project are as follows. First, we could develop a longer list of scenarios for users to choose from; this can be achieved with the same algorithm we used for "workout", but would require more input data. Second, we could incorporate genre as a predictor. It may be hard to label each song with a genre from the data currently provided by Spotify, but other data sources could be tried. Third, the evaluation of our project is purely qualitative and was performed by a very limited number of people. If this system were to go to production, we would want more rigorous testing.


As for data usage, we used both quantitative and categorical features to build our recommendation system. We could extend the project by incorporating text data, such as lyrics, into the system. Potential methods include: 1) running a sentiment analysis algorithm on the lyrics to turn the text into numeric features, so that we can represent the musical context quantitatively, for example with scores for how positive or negative the lyrics are; 2) running a bag-of-words algorithm on the lyrics to find the top N keywords in a song, then applying a word-to-vector (word2vec) algorithm to those keywords to compare two songs' similarity at the lyrics level [4].
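One possible sketch of the lyrics idea, using TF-IDF keyword weighting as a simpler stand-in for a full word2vec pipeline; the toy lyrics are invented, and a real system would need a licensed lyrics source.

```python
# A sketch of lyrics-based similarity: represent each song's lyrics as
# a TF-IDF vector and compare songs with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lyrics = [  # invented toy lyrics for three songs
    "sweat and lights, run all night, push it harder",
    "run all night under the lights, push harder",
    "quiet rain on a slow sunday morning",
]
X = TfidfVectorizer().fit_transform(lyrics)
sims = cosine_similarity(X)
# Songs 0 and 1 share workout-like wording; song 2 does not,
# so sims[0, 1] should exceed sims[0, 2].
```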

References

[1] Aggarwal, Charu C. (2016). Recommender Systems: The Textbook. Springer. ISBN 9783319296579.

[2] Harvard CS 136 Economics and Computation, class 20 lecture notes

[3] A Comparative Analysis of Offline and Online Evaluations. Available at: http://docear.org/papers/a_comparative_analysis_of_offline_and_online_evaluations.pdf.

[4] Agarwal, S. (2017). Word to Vectors — Natural Language Processing. [online] Medium. Available at: https://towardsdatascience.com/word-to-vectors-natural-language-processing-b253dd0b0817.


2019 Fall Harvard AC 209a
