HOWTO Run Nutch on a Hadoop Cluster

Amit Jain and Joey Mazzarelli



This document is a modification of http://joey.mazzarelli.com/2007/07/25/nutch-and-hadoop-as-user-with-nfs/, which is based on http://wiki.apache.org/nutch/NutchHadoopTutorial. Those articles assume you have root access, which should be the case if you are going to consume the resources needed to crawl the Internet. However, we want to run this as a normal user, say nutch. Another gotcha is that we are working on a cluster that shares the home directories of users over NFS. For more details and explanation, refer to the original articles.

We need to be able to log in to all the nodes of the cluster over SSH without being prompted for a password. On a properly administered cluster this should already be set up for each user.

ssh-keygen -t rsa

# Leave the passphrase empty.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
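
To confirm that passwordless login works, try connecting to one of the cluster nodes; node1 here is one of the slave nodes used later in this example, and the command should print the remote hostname without prompting for a password:

ssh node1 hostname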

Next, build Nutch using ant.

mkdir -p {~/src/nutch,~/opt/nutch/build}

cd ~/src/nutch

svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk .

echo dist.dir=$HOME/opt/nutch/build > build.properties

ant package


We want two working copies: one for crawling, and another for searching or just poking around with, so that experimenting does not change the settings of the crawling copy.

cd ~/opt/nutch/

cp -r build crawler

cp -r build sandbox


~/opt/nutch now contains three directories:

  1. build: Ant builds Nutch into this directory.

  2. crawler: The instance used for crawling.

  3. sandbox: The instance used for searching the indexes created by the crawls, or for whatever else we want to play with.

If we need to rebuild Nutch, we can copy the result from the build directory into the other two. Since the home directory is mounted over NFS, there is no need to log into the other nodes of the cluster and repeat this process.
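
A sketch of what a rebuild might look like, assuming rsync is available; excluding conf, logs, and pids keeps each instance's customized configuration and runtime directories intact:

cd ~/src/nutch

ant package

# Push the fresh build into both instances, leaving their conf/, logs/ and pids/ alone.

for instance in crawler sandbox; do rsync -a --exclude=conf --exclude=logs --exclude=pids ~/opt/nutch/build/ ~/opt/nutch/$instance/; done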

Now Hadoop needs to be configured. We will only cover the configuration of the crawler instance; the sandbox does not use Hadoop and therefore needs no special setup. We need a place for the logs, for the pid files used to manage processes, and for the filesystem Hadoop uses.

cd ~/opt/nutch/crawler

mkdir {logs,pids}

cd conf

vim hadoop-env.sh

We added or modified the following lines in conf/hadoop-env.sh:

export HOSTNAME=`hostname`

export HADOOP_HOME=/home/nutch/opt/nutch/crawler

export JAVA_HOME=/usr/local/java

export HADOOP_LOG_DIR=${HADOOP_HOME}/logs/${HOSTNAME}

export HADOOP_PID_DIR=${HADOOP_HOME}/pids/${HOSTNAME}

export HADOOP_IDENT_STRING=${USER}_${HOSTNAME}



You will, of course, adjust the path to your home directory and your Java home as needed. Now the XML configuration files need to be set up. See http://wiki.apache.org/lucene-hadoop/HowToConfigure for which files should contain which information. As our example, we use one multi-homed master node (public interface: pack.balihoo.com, private interface: node0) and four slave nodes (node1, node2, node3, node4) on the private subnet.

conf/masters

node0

conf/slaves

node1

node2

node3

node4
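
The node names above have to resolve to the right addresses from every host. If your cluster's name service does not already take care of this, entries along the following lines in each machine's /etc/hosts would do; the private addresses shown are purely hypothetical, and editing /etc/hosts requires root or the cluster administrator:

192.168.1.100  node0

192.168.1.101  node1

192.168.1.102  node2

192.168.1.103  node3

192.168.1.104  node4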


conf/hadoop-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>

<property>

<name>fs.default.name</name>

<value>node0:9000</value>

<description>

The name of the default file system. Either the literal string

"local" or a host:port for NDFS.

</description>

</property>

<property>

<name>mapred.job.tracker</name>

<value>node0:9001</value>

<description>

The host and port that the MapReduce job

tracker runs at. If "local", then jobs are

run in-process as a single map and reduce task.

</description>

</property>


<property>

<name>mapred.tasktracker.tasks.maximum</name>

<value>2</value>

<description>

The maximum number of tasks that will be run simultaneously by

a task tracker. This should be adjusted according to the heap size

per task, the amount of RAM available, and CPU consumption of each task.

Here we assume 2G of RAM per node, with each job taking up to 800M of

heap.

</description>

</property>


<property>

<name>mapred.child.java.opts</name>

<value>-Xmx800m</value>

<description>

You can specify other Java options for each map or reduce task here,

but most likely you will want to adjust the heap size.

</description>

</property>


<property>

<name>dfs.name.dir</name>

<value>/tmp/hadoop-filesystem-nutch/name</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>/tmp/hadoop-filesystem-nutch/data</value>

</property>

<property>

<name>mapred.system.dir</name>

<value>/tmp/hadoop-filesystem-nutch/mapreduce/system</value>

</property>

<property>

<name>mapred.local.dir</name>

<value>/tmp/hadoop-filesystem-nutch/mapreduce/local</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

<description>

Default block replication: the number of datanodes each block is stored on.

Two copies on this four-slave cluster give some redundancy without using

too much space.

</description>

</property>

</configuration>




Next, create conf/mapred-default.xml (or replace it if one already exists) with the following contents. These values should not be placed in hadoop-site.xml.


<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>mapred.map.tasks</name>

<value>41</value>

<description>

This should be a prime number several times larger than the number of

slave hosts, e.g. for 3 nodes set this to 17 or higher. If you get

OutOfMemory errors, increasing this number and the number below can help.

</description>

</property>


<property>

<name>mapred.reduce.tasks</name>

<value>11</value>

<description>

This should be a prime number close to a low multiple of the number of

slave hosts, e.g. for 3 nodes set this to 7 or higher.

</description>

</property>

</configuration>
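
For the example cluster here, the arithmetic behind those values: four slaves with mapred.tasktracker.tasks.maximum set to 2 give 8 task slots in total, so 41 map tasks and 11 reduce tasks keep every slot busy over several waves rather than leaving nodes idle.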




The filesystem Hadoop uses expects to have its own directories to write to on each host, and since our home directory is on NFS, this would be a problem. So we put the filesystem in /tmp, which is mounted separately on each node, and symlink to it for convenience.


cd ~/opt/nutch/crawler

mkdir /tmp/hadoop-filesystem-`whoami`

chmod 700 /tmp/hadoop-filesystem-`whoami`

ln -s /tmp/hadoop-filesystem-`whoami` ./filesystem
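
Because /tmp is local to each machine, the mkdir and chmod have to be repeated on every node, while the symlink lives in the NFS-mounted home directory and so only needs to be made once. A sketch of doing the per-node part in one pass from the master with pdsh (which we also use further below to check for stray processes); the node names and path match this example, and the master itself was handled by the commands just above:

pdsh -w "node[1-4]" "mkdir -p /tmp/hadoop-filesystem-nutch && chmod 700 /tmp/hadoop-filesystem-nutch"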



Once those directories exist on every node (master and slaves), we format the Hadoop filesystem on the master node as shown below.

bin/hadoop namenode -format

Hadoop seems to create the directories for the log files just fine, but not the ones for the pid files, so we create those ourselves:

cd ~/opt/nutch/crawler/conf

for node in $(cat slaves);

do

mkdir ../pids/$node;

done;

mkdir ../pids/pack.balihoo.com

cd ..

bin/start-all.sh


The easiest way to verify that everything started correctly is to run bin/stop-all.sh. If it complains that there was nothing to stop, then something isn't configured correctly; if it claims to have stopped everything, then all is well (remember to run bin/start-all.sh again afterwards). If things don't seem right, make sure no stray Java processes have escaped your attention, and kill those. Check with the following command:

pdsh -a ps augx | grep java
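
Another quick check is to ask the namenode how many datanodes have registered; with the example cluster here you would expect to see four. Depending on the Hadoop version bundled with Nutch, that looks like:

cd ~/opt/nutch/crawler

bin/hadoop dfsadmin -report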

When we run Nutch now, it will use the Hadoop filesystem, so any files that Nutch needs to see have to be put into that filesystem. Here we show the classic example of crawling the Apache Lucene website.

cd ~/opt/nutch/crawler/

mkdir urls

echo 'http://lucene.apache.org' > urls/urllist.txt

This file needs to be put into the Hadoop filesystem.

# Must be running... bin/start-all.sh

cd ~/opt/nutch/crawler

bin/hadoop dfs -put urls urls

# You can verify with:

bin/hadoop dfs -ls

bin/hadoop dfs -cat urls/urllist.txt

More references for configuring Nutch can be found on the Nutch wiki (http://wiki.apache.org/nutch/).

In particular, make sure http.agent.name is set (in conf/nutch-site.xml) and add lucene.apache.org to the URL filter list in conf/crawl-urlfilter.txt.
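
For example, a property along these lines in conf/nutch-site.xml (the agent name is only a placeholder to replace with your own):

<property>

<name>http.agent.name</name>

<value>MyNutchCrawler</value>

<description>Name reported in the HTTP User-Agent header.</description>

</property>

and, in conf/crawl-urlfilter.txt, a line permitting the lucene.apache.org domain (one way to write it):

+^http://([a-z0-9]*\.)*lucene.apache.org/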

Now, finally, run the crawl.

# Again, make sure it is all running

cd ~/opt/nutch/crawler

bin/nutch crawl urls -dir crawled -depth 3

You can monitor the output directly, or open a browser and go to port 50030 of the master node (http://pack.balihoo.com:50030), where the progress of the jobs is easy to follow. Check http://pack.balihoo.com:50030/machines.jsp for task failures. Make sure that port 50030 is open in your iptables rules (as well as in any firewall run by your ISP). If everything is going fine, just wait for the crawl to finish.

After it is done, you can export the data to the sandbox.

cd ~/opt/nutch/crawler

bin/hadoop dfs -copyToLocal crawled ../sandbox/crawl

bin/stop-all.sh
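
Before pointing anything at the copy, you can sanity-check the export; the crawl command normally leaves crawldb, linkdb, segments, indexes, and index directories there, though the exact layout depends on the Nutch version:

ls ~/opt/nutch/sandbox/crawl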

Now you can do whatever you want with that: point Tomcat at it, or query it directly.

cd ~/opt/nutch/sandbox

bin/nutch org.apache.nutch.searcher.NutchBean apache
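
If you go the Tomcat route instead, the Nutch search web application finds the crawl data through its searcher.dir property; a sketch of the entry, with the path matching the sandbox layout used here, that would go into the nutch-site.xml the webapp reads:

<property>

<name>searcher.dir</name>

<value>/home/nutch/opt/nutch/sandbox/crawl</value>

<description>Root of the crawl data used by the search webapp.</description>

</property>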



Running a whole-web crawl with Hadoop is basically the same as in the Nutch tutorial, except that the paths are different since we are using the Hadoop distributed filesystem. After putting the urls/ directory into the Hadoop filesystem as shown earlier, we inject the URLs as follows, assuming the crawl directory is to be named crawl.

bin/nutch inject crawl/crawldb urls



Then we generate the fetch list.

nohup bin/nutch generate crawl/crawldb crawl/segments >& log &



You can watch the output with tail -f log if you wish, or just log out and watch the progress on the Hadoop web monitor at port 50030 on the master node. Next we determine the new segment that was created and fetch it.



path=`bin/hadoop dfs -ls crawl/segments | tail -1 | awk '{print $1}'`

seg=`basename $path`

nohup bin/nutch fetch crawl/segments/$seg >& log &



Next we update the crawl database with the new stuff.

nohup bin/nutch updatedb crawl/crawldb crawl/segments/$seg >& log &



Then we can generate a fetch list of the top Num URLs and repeat the process for the next level, and so on.

nohup bin/nutch generate crawl/crawldb crawl/segments -topN Num >& log &



The above can be automated in a script.
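
A minimal sketch of such a script, built only from the commands shown above. It assumes it is run from ~/opt/nutch/crawler with the Hadoop daemons already started and the urls directory already in the Hadoop filesystem; DEPTH and TOPN are placeholders to tune for your crawl.

#!/bin/bash
# Whole-web crawl loop: inject once, then generate/fetch/updatedb per round.
# Run from ~/opt/nutch/crawler with Hadoop already running.

DEPTH=3      # number of generate/fetch/updatedb rounds
TOPN=1000    # how many top-scoring URLs to fetch in each round

bin/nutch inject crawl/crawldb urls

for ((round = 1; round <= DEPTH; round++)); do
    bin/nutch generate crawl/crawldb crawl/segments -topN $TOPN

    # Pick out the segment that generate just created (the newest one).
    path=`bin/hadoop dfs -ls crawl/segments | tail -1 | awk '{print $1}'`
    seg=`basename $path`

    bin/nutch fetch crawl/segments/$seg
    bin/nutch updatedb crawl/crawldb crawl/segments/$seg
done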