Thanks to a handy little reminder from ApeJet Delay Mail, here’s my one year update for the Hacker’s Diet. Unsurprisingly, it’s still working. I reached my goal weight and have been able to maintain it within a few pounds. It’s just a matter of measuring your weight every day, calculating your trend weight, and adjusting how many calories you eat accordingly. I’ve created a nice little graph, and here are some links to my previous posts on the topic.
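For anyone curious what "calculating your trend weight" looks like in practice, here is a minimal Python sketch. It assumes the trend is an exponentially smoothed moving average of the daily weigh-ins with a 10% smoothing factor, which is how I understand the Hacker's Diet method; the function name and the `smoothing` parameter are just illustrative.

```python
def trend_weights(daily_weights, smoothing=0.1):
    """Return the smoothed trend for a list of daily weigh-ins.

    Assumption: the trend starts at the first weigh-in and then moves
    `smoothing` (10% by default) of the way toward each new reading.
    """
    trend = []
    current = None
    for weight in daily_weights:
        if current is None:
            current = float(weight)  # seed the trend with the first reading
        else:
            current += smoothing * (weight - current)
        trend.append(current)
    return trend


if __name__ == "__main__":
    for day, value in enumerate(trend_weights([180, 179, 181, 178, 177]), 1):
        print(day, round(value, 2))
```

If the trend sits above the goal, eat a little less; if it sits below, eat a little more. The daily readings bounce around too much to act on directly, which is the whole reason for smoothing them.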
I’ve been working (almost daily) on my Netflix Prize code. I gave up on using a database after I realized how impractically slow it would be. Instead, I focused on fitting all of the data into RAM in a useful format. At one point, I broke down and tried to buy more memory (upgrading from 1 GB to 2 GB), but that didn’t work out, so I got back to work on keeping things small.
I ended up storing everything in memory twice so that I can quickly search by movie ID, customer ID, or both. The whole thing, with ratings, counts, averages and standard deviations for movies and customers, comes out to less than 700 MB both on the hard drive and in memory. My average search time for a particular rating is now about 0.0025 ms.
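The "store everything twice" idea can be sketched roughly as follows. This is only an illustration of the general approach, assuming integer IDs and two sorted copies of the ratings searched with binary search; the class and method names are made up for the example, and plain Python tuples would use far more than 700 MB for the full data set, so a real version would need packed arrays.

```python
from bisect import bisect_left


class RatingStore:
    """Keeps two sorted copies of the ratings: one by movie, one by customer."""

    def __init__(self, triples):
        # triples: iterable of (movie_id, customer_id, rating)
        triples = list(triples)
        self.by_movie = sorted((m, c, r) for m, c, r in triples)
        self.by_customer = sorted((c, m, r) for m, c, r in triples)

    @staticmethod
    def _rows_for(rows, key):
        # (key,) sorts before every (key, x, y), so these two binary
        # searches bracket all rows whose first field equals key.
        lo = bisect_left(rows, (key,))
        hi = bisect_left(rows, (key + 1,))
        return rows[lo:hi]

    def ratings_for_movie(self, movie_id):
        return [(c, r) for _, c, r in self._rows_for(self.by_movie, movie_id)]

    def ratings_for_customer(self, customer_id):
        return [(m, r) for _, m, r in self._rows_for(self.by_customer, customer_id)]

    def rating(self, movie_id, customer_id):
        """Look up one rating, or return None if the pair was never rated."""
        i = bisect_left(self.by_movie, (movie_id, customer_id))
        if i < len(self.by_movie) and self.by_movie[i][:2] == (movie_id, customer_id):
            return self.by_movie[i][2]
        return None
```

The cost is double the memory for the ratings themselves, but every kind of lookup becomes a binary search over a contiguous block instead of a scan.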
Since I’ve been focused on getting my basic tools working, I haven’t really had time to work on an intelligent prediction algorithm. So far, my best RMSE is 1.016, which is pretty weak. I did a quick test to see what would happen if, for each rating, I used whichever of the customer’s average rating or the movie’s average rating was closer to the correct value. I was quite surprised to get an RMSE of about 0.84.
Of course, this isn’t a useful algorithm because in the real world we don’t know the actual ratings. An interesting approach would be to come up with an accurate way of determining which average is better without knowing the correct rating. It might be easier than trying to guess the rating directly. Then again, maybe not.
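The "cheating" test described above can be sketched in a few lines of Python. Treat it as a rough illustration under some assumptions: ratings are (movie_id, customer_id, rating) tuples, and the averages are computed from the same ratings that are being scored, which may not match how the original test was set up. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return sqrt(squared / len(actuals))


def mean_by(ratings, index):
    """Average rating grouped by field `index` (0 = movie, 1 = customer)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in ratings:
        entry = totals[row[index]]
        entry[0] += row[2]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def oracle_rmse(ratings):
    """RMSE when an oracle picks the closer of the two averages each time."""
    movie_avg = mean_by(ratings, 0)
    customer_avg = mean_by(ratings, 1)
    predictions, actuals = [], []
    for movie, customer, rating in ratings:
        candidates = (movie_avg[movie], customer_avg[customer])
        # This line is the cheat: it peeks at the true rating to choose.
        predictions.append(min(candidates, key=lambda p: abs(p - rating)))
        actuals.append(rating)
    return rmse(predictions, actuals)
```

A real predictor would have to replace the `min(...)` line with some rule that chooses between the two averages without looking at `rating`.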
Thanks to BookMooch, I was able to pick up a used copy of Engines of Creation, Eric Drexler’s excellent book on nanotechnology. It’s been on my reading list for years, and I’m glad that I’ve finally gotten around to reading it. My only complaint is that it’s more speculative and less hard science than I was expecting. I probably expected too much from a 20-year-old book.
The material that the book did cover is quite interesting. Imagine machines smaller than blood cells able to fix virtually any disease or wound. Imagine computers millions or billions of times more powerful than today’s machines. Imagine skyscrapers growing out of the ground like weeds. Imagine no more hunger or thirst. All of these things can be made a reality with nanotech.
Of course, any technology powerful enough to do all of the above could also be extremely destructive. The extinction of all life on Earth could be just the beginning. Genetic engineering and artificial intelligence bring similar risks and rewards. (Actually, all three technologies feed on each other making the situation even more perilous.) Drexler does a good job of recognizing the problem and suggesting solutions.
One of the most interesting ideas discussed in the book is hypertext. In 1986, hypertext was still new technology, primarily in development at universities and research institutions. That all changed with the popularization of the World Wide Web. Now, hypertext is part of everyday life. The web that we use today (with wikis, blogs, videos, search engines, web applications, etc.) is even more powerful than the cross-referenced books imagined in Engines of Creation. I suspect this will also be true of the future technologies that we can only dream of today.
For more information on the safe development of these types of technologies, see:
Netflix announced yesterday a contest called the Netflix Prize. The challenge is to create a system that, based on past data, can accurately predict how a customer will rate a movie. They currently have a system to do this called Cinematch. This is what they use to make movie recommendations. Better recommendations mean happier customers. Happier customers mean more money for Netflix. Oh, and the prize part: if you can out-predict Cinematch by 10%, you win $1,000,000 (assuming nobody else does even better).
For the contest, Netflix is supplying over 100 million ratings covering 17,770 movies and nearly half a million customers. For infovores, it’s a gold mine. I’m going to go ahead and try my hand at the prize. I have a reasonably good idea of how to go about creating a ratings prediction system. I highly doubt I’ll be able to come up with anything better than what the Netflix engineers have put together, but it’ll be fun to try.
So far, I’ve pulled all of the data into a nice little (Ha!) MySQL database. My next steps will be to create the tools I’ll need to build and test a very basic prediction system. In the meantime, here are some fun little facts:
* Average movie rating: 3.23
* Average number of ratings for movies: 5655
* Average of customers’ average ratings: 3.67
* Average number of ratings for customers: 209
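Figures like these can be computed from the raw ratings with a short script. The sketch below is an assumption about what each number means (the two rating averages are taken as averages of per-movie and per-customer averages, not one global average over all ratings), and it works on an in-memory list of (movie_id, customer_id, rating) tuples instead of querying MySQL. The function and key names are made up for the example.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summary_stats(ratings):
    """Compute four summary figures from (movie_id, customer_id, rating) rows."""
    by_movie = defaultdict(list)
    by_customer = defaultdict(list)
    for movie, customer, rating in ratings:
        by_movie[movie].append(rating)
        by_customer[customer].append(rating)
    return {
        # Average of each movie's own average rating.
        "avg_movie_rating": mean(mean(r) for r in by_movie.values()),
        "avg_ratings_per_movie": mean(len(r) for r in by_movie.values()),
        # Average of each customer's own average rating.
        "avg_customer_rating": mean(mean(r) for r in by_customer.values()),
        "avg_ratings_per_customer": mean(len(r) for r in by_customer.values()),
    }


if __name__ == "__main__":
    sample = [(1, 10, 5), (1, 11, 3), (2, 10, 1)]
    for name, value in summary_stats(sample).items():
        print(name, value)
```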