I will go into the details later but I am thoroughly convinced that youtube is one of the largest most racist places on the internet. You would be hard-pressed to find a more high profile site with the N-word used so frequently in a derogatory sense. Anyway more recently in parallel I have been looking for some way of getting my hand’s on enough data to put Hadoop to the test. About a week ago the two idea’s converged in my head and I had a vision of creating an application that would consume all of the comments on youtube (or at least as many as it can swallow) and then using a little linguistic regexp magic figure out who is the MOST racist user on Youtube. I started out by writing the basic mapper in python in my previous post. It is only good for a single video however. I new I would need a spider and I was planning on writing my own (python has some awesome dom and xml parsing: beautiful soup mini-dom, expat) but decided that for future use it might be better to make use of something based on lucene. I have been using solr quite a bit lately but was hoping for something a little more lightweight. Luckily I was able to remember the name of the spider that came out shortly after lucene was made available. Nutch! It hasn’t had the glory that Solr has had in recent years (I don’t understand why these two projects exist and don’t converge?) but it turns out that it is now based on Hadoop by default! In fact little to my knowledge but it looks like Hadoop was actually spawned by Nutch?! Anyway I spent some time setting up nutch on a hadoop cluster today and I have to say it is still not the easiest thing in the world to work with. I am still new to Hadoop and was following this Nutch tutorial very closely. Unfortunatley it glazes over setting up the slave nodes so I plan to fill in the details here and offer them to the community. One thing that it fails to mention that I should have know from my other Hadoop tutorials (never underestimate the minutia!) hdfs configurations need to use FULLY qualified hostnames! you can’t just user your machine name even though you may be able to ping each machine on a local subnet with just the hostname and even though you can telnet from the slave machine to say an Httpd instance running on the namenode you WILL get silent failure when trying to telnet to the namenode daemon’s port. Check your firewalls of course but even if they are turned off on both your master and slave it is quite pssible you will just see:
$telnet master 9000
telnet: connect to address 192.168.0.1: Connection refused
however if you try the same thing from the master machine itself it works!? (assuming your name node is currently running and supposedly listening on port 9000)
This of course through me for a loop and I assumed it was everything from firewalls to bad interconnects between my machines. Strange that I never thought about the hostnames. I guess it was because all other services between them have been working fine. If anyone has any idea why hdfs has a problem with this I would love to know.
