Running MapReduce Jobs

After you launch a cluster, a hadoop-site.xml file is created in the directory ~/.hadoop-cloud/<cluster-name>. You can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable (it is also possible to specify the configuration file on a per-command basis by passing it with the -conf option to Hadoop tools):

% export HADOOP_CONF_DIR=~/.hadoop-cloud/my-hadoop-cluster
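
For example, instead of setting the environment variable you can point an individual command at the configuration file with the -conf option (using the same cluster directory as above):

% hadoop fs -conf ~/.hadoop-cloud/my-hadoop-cluster/hadoop-site.xml -ls /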

To browse HDFS:

% hadoop fs -ls /

Note that the version of Hadoop installed locally should match the version installed on the cluster.
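
A quick way to check is to compare the output of the version command on your local machine with the same command run on the cluster:

% hadoop version # compare this output locally and on the cluster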

To run a job from your local machine:

% hadoop fs -mkdir input # create an input directory
% hadoop fs -put $HADOOP_HOME/LICENSE.txt input # copy a file there
% hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output
% hadoop fs -cat output/part-* | head
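
If you want to keep the results, you can copy them back to your local filesystem; you will also need to remove the output directory before re-running the job (on newer Hadoop releases use -rm -r instead of -rmr). The local directory name here is just an example:

% hadoop fs -get output wordcount-output # copy the results to a local directory
% hadoop fs -rmr output # delete the output directory so the job can be re-run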

The preceding examples assume that you have Hadoop installed on your local machine. You can also run jobs from within the cluster itself.

To run jobs within the cluster:

1. Log in to the namenode:

% hadoop-ec2 login my-hadoop-cluster

2. Run the job:

# hadoop fs -mkdir input
# hadoop fs -put /etc/hadoop/conf/*.xml input
# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# hadoop fs -cat output/part-* | head
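
To collect the results as a single file on the namenode, you can merge the output parts with getmerge (the local filename here is just an example):

# hadoop fs -getmerge output grep-output.txt
# head grep-output.txt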