View my profileView my profile
Netflix Contest

Netflix Contest

An improvement of 4.81% over Cinematch at the time of writing. I used a mix of algorithms to get there. The simplest one is based on classifying the movies and then using the ratings of similar movies to make a prediction. Thus to predict the rating for a Movie-User pair (M, U), I compute a weighted average of the ratings of U, for movies that are similar to M (similarity between movies is used as weight). Apparently, in the literature this algorithm is known as k-Nearest Neighbor (thank you Wikipedia!).

How does one find a movie's neighbors? One obvious method is to use the ratings. If two movies are loved or hated by a user, they are probably similar movies. Over a large number of ratings from multiple users, this scheme is able to find similar movies somewhat reliably. You do come across a few "questionable" neighbors (Is there anything in common between Lord of the Rings and Full Metal Jacket?). But by and large this method works quite well and is relatively fast. With my implementation, it takes about 12 hours to make 2 million predictions (running on a Desktop with a 3GHz CPU and 3GB of memory). Computing the neighborhood for all the movies is a one-time operation. Once you have this done, the prediction can be done in a couple of minutes.

Here's the immediate neighborhood of Lord of the Rings: The Return of the King, based on Pearson's r values computed over normalized ratings.

	The Lord of the Rings: The Fellowship of the Ring: Extended Edition 
	Lord of the Rings: The Two Towers: Extended Edition 
	Lord of the Rings: The Return of the King: Extended Edition 
	Lord of the Rings: The Return of the King 
	Full Metal Jacket 
	Lord of the Rings: The Two Towers 
	Harry Potter and the Prisoner of Azkaban 
	Reservoir Dogs 
	Titanic 
	Constantine 
	The Chronicles of Riddick 
	Blade: Trinity 
	Tomb Raider 
	Planet of the Apes 
	Star Wars: Episode V: The Empire Strikes Back 
	Windtalkers 
	Underworld 
	Raiders of the Lost Ark 
	Austin Powers: International Man of Mystery 
	Lord of the Rings: The Fellowship of the Ring 

Admittedly, this looks pretty bad. Why is Austin Powers up there, for heaven's sake?! And yet: this algorithm was able to beat Netflix's own system by 1%.

Another way to determine neighborhoods is to compute the distances over movie aspects (action, romance, comedy etc.) But how to find the aspects for a certain movie in the first place? Singular Value Decomposition is the answer - expained by Simon Funk a lot better than I ever can.

Here's the closest neighbors of LOTR: Return of the King again, this time based on running Pearson over aspects.

	National Geographic: Beyond the Movie: Lord of the Rings: Return of the King 
	Lord of the Rings: The Return of the King: Bonus Material 
	Top Cat: The Complete Series 
	Lord of the Rings: The Fellowship of the Ring: Bonus Material 
	Star Wars: Episode VI: Return of the Jedi 
	Elfen Lied 
	Star Wars: Episode V: The Empire Strikes Back 
	Tour of Duty: Season 3 
	The Battle of Algiers: Bonus Material 2 
	Harry Potter and the Prisoner of Azkaban 
	Rocky & Bullwinkle & Friends: Season 3 
	Star Wars: Episode IV: A New Hope 
	Farscape: The Peacekeeper Wars: Bonus Material 
	Dune: Extended Edition 
	Lord of the Rings: The Two Towers: Bonus Material 
	The Lord of the Rings: The Fellowship of the Ring: Extended Edition 
	Star Wars Trilogy: Bonus Material 
	Lord of the Rings: The Return of the King: Extended Edition 
	Lord of the Rings: The Two Towers: Extended Edition 
	Lord of the Rings: The Return of the King 
	The First World War: The Complete Series 

Ok, this still looks bad - but luckily a different kind of bad. More about this later.

Once the neighborhoods are computed, further tuning can be done by deriving a second level of correlations. The principle behind this is simple - if two movies have a lot of common neighbors, they must be neighbors themselves. This helps to eliminate noise from the results (Full Metal Jacket will no longer be tagged as a neighbor of Return of the King because they do not have many common neighbors). I think this is a novel technique, but haven't done enough research to verify this claim.

	Lord of the Rings: The Two Towers: Extended Edition 
	The Lord of the Rings: The Fellowship of the Ring: Extended Edition 
	Lord of the Rings: The Return of the King: Extended Edition 
	Lord of the Rings: The Return of the King 
	Star Wars: Episode V: The Empire Strikes Back 
	Lord of the Rings: The Two Towers 
	Lord of the Rings: The Fellowship of the Ring 
	Lord of the Rings: The Two Towers: Bonus Material 
	Lord of the Rings: The Fellowship of the Ring: Bonus Material 
	Lord of the Rings: The Return of the King: Bonus Material 
	Star Wars: Episode VI: Return of the Jedi 
	Star Wars: Episode IV: A New Hope 
	Battlestar Galactica: Season 1 
	Lost: Season 1 
	Raiders of the Lost Ark 
	Star Wars Trilogy: Bonus Material 
	Heat: Special Edition: Bonus Material 
	Indiana Jones and the Last Crusade 
	Batman Begins 
	Harry Potter and the Prisoner of Azkaban 

Much better! This was obtained by running a modified Pearson over 'r' values computed earlier. Surprisingly, this did not improve the quality of predictions. My hypothesis is that the noise is helping the predictions be more accurate. Filtering out the noise may not help when you are trying to predict the behavior of human beings that are inherently noisy.

The method that works best for me is a combination of the first two - determine two neighborhoods for each movie, one based on ratings and the other based on aspects and then combine these neighborhoods by averaging the distances (with equal weighting). Since the "really bad neighbors" are different in both cases, the averaging drives them far away. An RMSE of 0.92 is achievable by this method alone.

Some tips for people who may be considering this challenge:

- The database is quite big. Unless you have a supercomputer at your disposal, it is very important to write tight code. Make sure that all the ratings are loaded into RAM so that data accesses are quick. All the accesses had better be O(1). Search operations will kill your program. If you find memory to be a bottleneck, go out and get more memory rather than trade off CPU cycles to optimize memory usage.

- Make use of the symmetry between movies and users. Easy if you are using C++ templates. For example, the k-NN algorithm described above can be run over users as well. Since there are about half a million users in the database, this variant takes forever to run. My initial results weren't encouraging, but your mileage may vary.

- If you are using a programming language with arrays that have zero-based indexes, remap the movie id's to make them zero based. I made the mistake of leaving them as given and it has given me a lot of unnecessary grief.

What I have come to realize is that lack of having any relevant background makes for slow progress. But it also paves the way for outside-the-box ideas. My strategy has been to make up for lack of knowledge with a little creativity and some luck. It has worked reasonably well so far.

Applications:

- Here's a very cool application of Singular Value Decomposition, courtesy Aron.

Apparently, all the pixels within the red box were predicted from SVD matrices derived using the Simon Funk method.

- I wonder if the 80% satisfaction rate of whattorent.com can be improved by better correlating movie aspects with personality types. And of course, this concept can be extended to any product.

- Using creative data collection techniques for better recommender systems - Take YouTube as an example: In addition to the explicit ratings given by users, they could collect statistics from the interactive behavior of users (rewind, skip, repeat etc.). If combined with a fast recommender system that uses pre-computed aspect data for each video clip, the quality of recommendations could be improved dramatically.

IMO, an algorithm that is able to beat Cinematch by a wide margin and is fast enough to run in real-time would be worth a lot more than a million dollars.

anlthms(at)yahoo(dot)com