An improvement of 4.81% over Cinematch at the time of writing. I used a mix of algorithms to get there. The simplest one is based on classifying the movies and then using the ratings of similar movies to make a prediction. Thus to predict the rating for a Movie-User pair (M, U), I compute a weighted average of the ratings of U, for movies that are similar to M (similarity between movies is used as weight). Apparently, in the literature this algorithm is known as k-Nearest Neighbor (thank you Wikipedia!).
How does one find a movie's neighbors? One obvious method is to use the ratings. If two movies are loved or hated by a user, they are probably similar movies. Over a large number of ratings from multiple users, this scheme is able to find similar movies somewhat reliably. You do come across a few "questionable" neighbors (Is there anything in common between Lord of the Rings and Full Metal Jacket?). But by and large this method works quite well and is relatively fast. With my implementation, it takes about 12 hours to make 2 million predictions (running on a Desktop with a 3GHz CPU and 3GB of memory). Computing the neighborhood for all the movies is a one-time operation. Once you have this done, the prediction can be done in a couple of minutes.
Here's the immediate neighborhood of Lord of the Rings: The Return of the King, based on Pearson's r values computed over normalized ratings.
The Lord of the Rings: The Fellowship of the Ring: Extended Edition Lord of the Rings: The Two Towers: Extended Edition Lord of the Rings: The Return of the King: Extended Edition Lord of the Rings: The Return of the King Full Metal Jacket Lord of the Rings: The Two Towers Harry Potter and the Prisoner of Azkaban Reservoir Dogs Titanic Constantine The Chronicles of Riddick Blade: Trinity Tomb Raider Planet of the Apes Star Wars: Episode V: The Empire Strikes Back Windtalkers Underworld Raiders of the Lost Ark Austin Powers: International Man of Mystery Lord of the Rings: The Fellowship of the Ring
Admittedly, this looks pretty bad. Why is Austin Powers up there, for heaven's sake?! And yet: this algorithm was able to beat Netflix's own system by 1%.
Another way to determine neighborhoods is to compute the distances over movie aspects (action, romance, comedy etc.) But how to find the aspects for a certain movie in the first place? Singular Value Decomposition is the answer - expained by Simon Funk a lot better than I ever can.
Here's the closest neighbors of LOTR: Return of the King again, this time based on running Pearson over aspects.
National Geographic: Beyond the Movie: Lord of the Rings: Return of the King Lord of the Rings: The Return of the King: Bonus Material Top Cat: The Complete Series Lord of the Rings: The Fellowship of the Ring: Bonus Material Star Wars: Episode VI: Return of the Jedi Elfen Lied Star Wars: Episode V: The Empire Strikes Back Tour of Duty: Season 3 The Battle of Algiers: Bonus Material 2 Harry Potter and the Prisoner of Azkaban Rocky & Bullwinkle & Friends: Season 3 Star Wars: Episode IV: A New Hope Farscape: The Peacekeeper Wars: Bonus Material Dune: Extended Edition Lord of the Rings: The Two Towers: Bonus Material The Lord of the Rings: The Fellowship of the Ring: Extended Edition Star Wars Trilogy: Bonus Material Lord of the Rings: The Return of the King: Extended Edition Lord of the Rings: The Two Towers: Extended Edition Lord of the Rings: The Return of the King The First World War: The Complete Series
Ok, this still looks bad - but luckily a different kind of bad. More about this later.
Once the neighborhoods are computed, further tuning can be done by deriving a second level of correlations. The principle behind this is simple - if two movies have a lot of common neighbors, they must be neighbors themselves. This helps to eliminate noise from the results (Full Metal Jacket will no longer be tagged as a neighbor of Return of the King because they do not have many common neighbors). I think this is a novel technique, but haven't done enough research to verify this claim.
Lord of the Rings: The Two Towers: Extended Edition The Lord of the Rings: The Fellowship of the Ring: Extended Edition Lord of the Rings: The Return of the King: Extended Edition Lord of the Rings: The Return of the King Star Wars: Episode V: The Empire Strikes Back Lord of the Rings: The Two Towers Lord of the Rings: The Fellowship of the Ring Lord of the Rings: The Two Towers: Bonus Material Lord of the Rings: The Fellowship of the Ring: Bonus Material Lord of the Rings: The Return of the King: Bonus Material Star Wars: Episode VI: Return of the Jedi Star Wars: Episode IV: A New Hope Battlestar Galactica: Season 1 Lost: Season 1 Raiders of the Lost Ark Star Wars Trilogy: Bonus Material Heat: Special Edition: Bonus Material Indiana Jones and the Last Crusade Batman Begins Harry Potter and the Prisoner of Azkaban
Much better! This was obtained by running a modified Pearson over 'r' values computed earlier. Surprisingly, this did not improve the quality of predictions. My hypothesis is that the noise is helping the predictions be more accurate. Filtering out the noise may not help when you are trying to predict the behavior of human beings that are inherently noisy.
The method that works best for me is a combination of the first two - determine two neighborhoods for each movie, one based on ratings and the other based on aspects and then combine these neighborhoods by averaging the distances (with equal weighting). Since the "really bad neighbors" are different in both cases, the averaging drives them far away. An RMSE of 0.92 is achievable by this method alone.
Some tips for people who may be considering this challenge:
- The database is quite big. Unless you have a supercomputer at your disposal, it is very important to write tight code. Make sure that all the ratings are loaded into RAM so that data accesses are quick. All the accesses had better be O(1). Search operations will kill your program. If you find memory to be a bottleneck, go out and get more memory rather than trade off CPU cycles to optimize memory usage.
- Make use of the symmetry between movies and users. Easy if you are using C++ templates. For example, the k-NN algorithm described above can be run over users as well. Since there are about half a million users in the database, this variant takes forever to run. My initial results weren't encouraging, but your mileage may vary.
- If you are using a programming language with arrays that have zero-based indexes, remap the movie id's to make them zero based. I made the mistake of leaving them as given and it has given me a lot of unnecessary grief.
What I have come to realize is that lack of having any relevant background makes for slow progress. But it also paves the way for outside-the-box ideas. My strategy has been to make up for lack of knowledge with a little creativity and some luck. It has worked reasonably well so far.
Apparently, all the pixels within the red box were predicted from SVD matrices derived using the Simon Funk method.
- I wonder if the 80% satisfaction rate of whattorent.com can be improved by better correlating movie aspects with personality types. And of course, this concept can be extended to any product.
- Using creative data collection techniques for better recommender systems - Take YouTube as an example: In addition to the explicit ratings given by users, they could collect statistics from the interactive behavior of users (rewind, skip, repeat etc.). If combined with a fast recommender system that uses pre-computed aspect data for each video clip, the quality of recommendations could be improved dramatically.
IMO, an algorithm that is able to beat Cinematch by a wide margin and is fast enough to run in real-time would be worth a lot more than a million dollars.