Support for Amazon Elastic Block Store (EBS) is a Beta feature.
When not in use, an EBS cluster can surrender its EC2 instances, then restart later and continue where it left off. Because the data persists reliably and independently in Amazon's EBS, users no longer need to copy large volumes of data from S3 to local disk on each EC2 instance, and compute costs are saved while the cluster is down.
[Schematic showing how the cluster is set up]
To Use a Persistent Cluster with EBS Storage
First, create a formatted snapshot to serve as the template for the cluster's EBS volumes, so that formatting only has to be done once. The following creates a 100 GB formatted snapshot:
% hadoop-ec2 create-formatted-snapshot my-ebs-cluster 100
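Note the snapshot ID that the command reports; the storage spec file below refers to it. As a quick check, and assuming you have Amazon's EC2 API tools installed (this is standard EC2 tooling, not part of the hadoop-ec2 scripts), you can confirm the snapshot exists; the ID shown is the illustrative one used throughout this section:
% ec2-describe-snapshots snap-268e704f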
Next, create storage for a single NameNode and for two DataNodes. The volumes to create are described in a JSON spec file, which references the snapshot you just created. Here are the contents of a JSON file called my-ebs-cluster-storage-spec.json:
Example contents of my-ebs-cluster-storage-spec.json
{ "nn": [ { "device": "/dev/sdj", "mount_point": "/ebs1", "size_gb": "100", "snapshot_id": "snap-268e704f" }, { "device": "/dev/sdk", "mount_point": "/ebs2", "size_gb": "100", "snapshot_id": "snap-268e704f" } ], "dn": [ { "device": "/dev/sdj", "mount_point": "/ebs1", "size_gb": "100", "snapshot_id": "snap-268e704f" }, { "device": "/dev/sdk", "mount_point": "/ebs2", "size_gb": "100", "snapshot_id": "snap-268e704f" } ] }
Each role (nn and dn) is a key mapping to an array of volume specifications. In this example, each role has two devices (/dev/sdj and /dev/sdk) with different mount points, each generated from an EBS snapshot. The snapshot is the formatted snapshot created earlier, so the volumes you create are pre-formatted. The size of each volume must match the size of the snapshot created earlier.
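Because a typo in the spec file will only surface when volume creation fails, it can be worth checking first that the file is well-formed JSON. A minimal sketch using Python's built-in json.tool module (any JSON validator works equally well):
% python -m json.tool my-ebs-cluster-storage-spec.json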
To use this file to create actual volumes:
% hadoop-ec2 create-storage my-ebs-cluster nn 1 \
    my-ebs-cluster-storage-spec.json
% hadoop-ec2 create-storage my-ebs-cluster dn 2 \
    my-ebs-cluster-storage-spec.json
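The first command creates volumes for one NameNode instance and the second for two DataNode instances; with two volumes per instance in the spec, that is six volumes in total. Assuming your version of the scripts includes a list-storage subcommand (if not, ec2-describe-volumes from the EC2 API tools shows the same information), you can verify that they were created:
% hadoop-ec2 list-storage my-ebs-cluster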
To start the cluster with two slave nodes:
% hadoop-ec2 launch-cluster my-ebs-cluster 1 nn,snn,jt 2 dn,tt
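The arguments launch one master instance running the namenode, secondary namenode, and jobtracker (nn,snn,jt) and two slave instances each running a datanode and tasktracker (dn,tt); the EBS volumes created above are attached to the matching roles. You can watch the instances come up with plain EC2 tooling (not part of the hadoop-ec2 scripts), for example:
% ec2-describe-instances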
To log in and run a job that creates some output:
% hadoop-ec2 login my-ebs-cluster
# hadoop fs -mkdir input
# hadoop fs -put /etc/hadoop/conf/*.xml input
# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output \
    'dfs[a-z.]+'
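The grep example job searches the copied configuration files for strings matching the regular expression dfs[a-z.]+ and writes the matches to the output directory in HDFS. While it runs, you can check its progress with the stock Hadoop job client:
# hadoop job -list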
To view the output:
# hadoop fs -cat output/part-* | head
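If you want the results on the instance's local disk rather than in HDFS, you can copy them out; the destination path here is arbitrary:
# hadoop fs -get output /tmp/grep-output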
To shut down the cluster (the EC2 instances are terminated, but the EBS volumes, and with them the HDFS data, persist):
% hadoop-ec2 terminate-cluster my-ebs-cluster
To restart the cluster and log in (after a short delay to allow the EBS volumes to detach):
% hadoop-ec2 launch-cluster my-ebs-cluster 2
% hadoop-ec2 login my-ebs-cluster
The output from the job you ran before should still be there:
# hadoop fs -cat output/part-* | head
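When you are finished with the cluster for good, terminate it and then delete its volumes so you stop paying for EBS storage. This sketch assumes your version of the scripts includes a delete-storage subcommand; volumes can only be deleted once the instances using them have terminated:
% hadoop-ec2 terminate-cluster my-ebs-cluster
% hadoop-ec2 delete-storage my-ebs-cluster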