Netflix Prize: Update 1

I’ve been working (almost daily) on my Netflix Prize code.  I gave up on using a database after I realized how impractically slow it would be.  Instead, I focused on fitting all of the data into RAM in a useful format.  At one point, I broke down and tried to buy more memory (to upgrade from 1 GB to 2 GB), but that didn’t work out, so I got back to work on keeping things small.

I ended up storing everything in memory twice so that I can quickly search by movie id, by customer id, or by both.  The whole thing, with ratings, counts, averages and standard deviations for movies and customers, comes out to less than 700 MB both on the hard drive and in memory.  My average search time for a particular rating is now about 0.0025 ms (2.5 microseconds).
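
As a rough illustration (not the actual code), the store-it-twice idea can be sketched in Python like this.  The names build_index and lookup, the dictionary-of-sorted-lists layout, and the three sample ratings are all assumptions made up for the example; a real version would need much more compact arrays to stay under 700 MB.

```python
from bisect import bisect_left

def build_index(ratings, key_pos, sub_pos):
    """Group (movie_id, customer_id, rating) triples by one id.

    Each group is a list of (other_id, rating) pairs sorted by other_id,
    so a single rating can be found with a binary search.
    """
    index = {}
    for triple in ratings:
        index.setdefault(triple[key_pos], []).append((triple[sub_pos], triple[2]))
    for entries in index.values():
        entries.sort()
    return index

def lookup(index, key, sub):
    """Return the rating stored under (key, sub), or None if there is none."""
    entries = index.get(key, [])
    # (sub,) sorts just before any (sub, rating) pair, so this lands on the match.
    i = bisect_left(entries, (sub,))
    if i < len(entries) and entries[i][0] == sub:
        return entries[i][1]
    return None

# Made-up sample data: (movie_id, customer_id, rating)
ratings = [(1, 10, 3), (1, 11, 5), (2, 10, 4)]

# The same data stored twice: once keyed by movie, once keyed by customer.
by_movie = build_index(ratings, 0, 1)
by_customer = build_index(ratings, 1, 0)
```

With both copies in place, by_movie[1] lists every rating for movie 1, by_customer[10] lists every rating by customer 10, and lookup(by_movie, 1, 11) finds one specific rating.  The cost is the doubled memory, which is why keeping each entry small matters so much.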

Since I’ve been focused on getting my basic tools working, I haven’t really had time to work on an intelligent prediction algorithm.  So far, my best RMSE is 1.016, which is pretty weak.  I did a quick test to see what would happen if, for each rating, I used either the customer’s average rating or the movie’s average rating, whichever was closer to the correct value.  I was quite surprised to get an RMSE of about 0.84.
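
For anyone who wants to try the same experiment, here is a minimal Python sketch of it.  The function names rmse and oracle_pick and the three sample rows are hypothetical, invented for the example; the numbers are not from the Netflix data.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(squared / len(actuals))

def oracle_pick(movie_avg, customer_avg, actual):
    """Return whichever average is closer to the true rating.

    This peeks at the true rating, so it is a measuring stick,
    not a usable predictor.
    """
    if abs(movie_avg - actual) <= abs(customer_avg - actual):
        return movie_avg
    return customer_avg

# Made-up rows: (movie average, customer average, actual rating)
samples = [(3.5, 4.2, 4), (2.8, 3.9, 3), (4.1, 3.0, 5)]
actuals = [actual for _, _, actual in samples]

movie_rmse = rmse([m for m, _, _ in samples], actuals)
customer_rmse = rmse([c for _, c, _ in samples], actuals)
oracle_rmse = rmse([oracle_pick(m, c, a) for m, c, a in samples], actuals)
```

Because the oracle takes the smaller error on every single rating, oracle_rmse can never be worse than either movie_rmse or customer_rmse on the same data.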

Of course, this isn’t a useful algorithm because in the real world we don’t know the actual ratings.  An interesting approach would be to come up with an accurate way of determining which average is better without knowing the correct rating.  It might be easier than trying to guess the rating directly.  Then again, maybe not.