Separation of Users' Loved Songs from Other Songs

Kan Wu

wukan@cs.dartmouth.edu

Introduction

People have been fascinated by music since the dawn of humanity. A wide variety of music genres and styles has evolved, reflecting diversity in personalities, cultures and age groups. It comes as no surprise that human tastes in music are remarkably diverse.

Yahoo! Music has amassed billions of user ratings for musical pieces. When properly analyzed, the raw ratings encode information on how songs are grouped, which hidden patterns link various albums, which artists complement each other, and above all, which songs users would like to listen to.

In this course project, I will participate in the KDD Cup competition, which is sponsored by Yahoo! Music. The competition provides both training data and test data. The training data include the songs with the scores given by users, along with the relationships between songs, albums, artists and genres. In the test data, users and songs are given in pairs, and the goal is to use learning algorithms to separate the songs that a specific user really favors from the other songs.

Methods

The project is a typical recommender system task: the goal of such a system is to recommend to users the information items that are likely to be of interest to them.

Collaborative filtering (CF) is a family of algorithms widely used in recommendation systems. CF methods collect and analyse a large amount of information on users’ behaviour, activity or preferences, and predict what a user will like based on that user's similarity to other users.

As the first step of the project, I will implement one of the simplest collaborative filtering algorithms, Slope One, to predict the scores of the songs in the test data. For each user in the test data, I will pick the three songs with the highest predicted scores and label them as the songs favored by that user.
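
To make the Slope One step concrete, here is a minimal sketch of the weighted variant of the algorithm. It is not the final implementation: the function names `fit_slope_one` and `predict` and the nested-dictionary layout `{user: {item: score}}` are assumptions made for illustration. The sketch stores one deviation per ordered pair of co-rated items, which is quadratic in the number of items a user has rated, so the full competition data would likely need a more memory-conscious approach.

```python
from collections import defaultdict

def fit_slope_one(ratings):
    """Compute average pairwise deviations from {user: {item: score}}."""
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for user_ratings in ratings.values():
        for j, r_j in user_ratings.items():
            for i, r_i in user_ratings.items():
                if i != j:
                    diff_sum[(j, i)] += r_j - r_i
                    count[(j, i)] += 1
    # dev[(j, i)]: average of (score of j - score of i) over users who rated both
    dev = {pair: diff_sum[pair] / count[pair] for pair in count}
    return dev, dict(count)

def predict(user_ratings, target, dev, count):
    """Weighted Slope One prediction of `target` for one user, or None."""
    num, den = 0.0, 0
    for i, r_i in user_ratings.items():
        c = count.get((target, i), 0)
        if c:  # only items co-rated with the target by at least one user
            num += (dev[(target, i)] + r_i) * c
            den += c
    return num / den if den else None

# Toy data (hypothetical): users 0 and 1 rated items 0 and 1; user 2 rated only item 1.
train = {0: {0: 90, 1: 70}, 1: {0: 60, 1: 40}, 2: {1: 50}}
dev, count = fit_slope_one(train)
score = predict(train[2], 0, dev, count)  # item 0 is on average 20 points above item 1
```

In the toy data, item 0 is rated 20 points higher than item 1 by both users who rated the pair, so the prediction for user 2 on item 0 is their score for item 1 plus 20. When no rated item has been co-rated with the target, `predict` returns `None` and a fallback (for example the item's mean score) would be needed.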

Dataset

The data set is split into two subsets: training data and test data. In each subset, the rating data are grouped by user.
- The first line for a user is formatted as: [UserID]|[#UserRatings]\n
- Each of the next [#UserRatings] lines describes a single rating by [UserID]. Rating line format: [ItemID]\t[Score]\n
The scores are integers between 0 and 100, and are withheld from the test set. All user IDs and item IDs are consecutive integers, both starting at zero.
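
The format above can be read with a short parser. The following is a sketch only: the function name `parse_ratings` and the output layout `{user_id: {item_id: score}}` are my own assumptions, and it handles the training format (with scores) rather than the test format, where scores are withheld.

```python
from io import StringIO

def parse_ratings(lines):
    """Parse user-grouped rating lines into {user_id: {item_id: score}}."""
    ratings = {}
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between user blocks
        user_id, n_ratings = header.split("|")
        user_ratings = {}
        # The header announces how many rating lines follow for this user.
        for _ in range(int(n_ratings)):
            item_id, score = next(it).rstrip("\n").split("\t")
            user_ratings[int(item_id)] = int(score)
        ratings[int(user_id)] = user_ratings
    return ratings

# Hypothetical two-user sample in the training format.
sample = "0|2\n10\t90\n11\t50\n1|1\n10\t70\n"
parsed = parse_ratings(StringIO(sample))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the `StringIO` sample.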

For each user in the test set, six items are listed. All of these items are tracks (not albums, artists or genres). Three of the six items have never been rated by the user, whereas the other three were rated "highly" by the user, that is, scored 80 or higher.
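
Since exactly three of the six test items per user are highly rated, the separation can be done by ranking the six predicted scores and labelling the top three. The sketch below assumes a hypothetical helper `label_top_three` that receives one user's predicted scores as `{item_id: score}`, with any missing predictions already replaced by a fallback value. Ties are broken by the order in which the items are listed.

```python
def label_top_three(predicted):
    """Return {item_id: 1 or 0}, marking the three highest-scored items with 1."""
    # sorted() is stable, so equal scores keep their original listing order.
    ranked = sorted(predicted, key=lambda item: predicted[item], reverse=True)
    favored = set(ranked[:3])
    return {item: int(item in favored) for item in predicted}

# Hypothetical predicted scores for one user's six test tracks.
scores = {11: 85.0, 12: 40.0, 13: 92.5, 14: 10.0, 15: 77.0, 16: 60.0}
labels = label_top_three(scores)
```

Here tracks 13, 11 and 15 receive label 1 and the remaining three receive label 0.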

Timeline

By the milestone, I plan to finish implementing the algorithm. At that stage the algorithm will not take into account the relationships between tracks, albums, artists and genres; instead, it will rely only on the song scores in the training data.

After the milestone, I will try to extract features from the relationships between these item types in order to improve accuracy.

References
1. KDD Cup 2011 from Yahoo! Labs http://kddcup.yahoo.com/index.php
2. Netflix Prize http://www.netflixprize.com/
3. Wikipedia: Collaborative Filtering http://en.wikipedia.org/wiki/Collaborative_filtering
