Package org.apache.hadoop.hbase.mapred

Provides HBase MapReduce Input/OutputFormats, a table indexing MapReduce job, and utility methods.


Interface Summary
TableMap<K extends WritableComparable<? super K>,V extends Writable> Deprecated.
TableReduce<K extends WritableComparable,V extends Writable> Deprecated.
 

Class Summary
BuildTableIndex Deprecated.
Driver Deprecated.
GroupingTableMap Deprecated.
HRegionPartitioner<K2,V2> Deprecated.
IdentityTableMap Deprecated.
IdentityTableReduce Deprecated.
IndexConfiguration Deprecated.
IndexConfiguration.ColumnConf  
IndexOutputFormat Deprecated.
IndexTableReduce Deprecated.
LuceneDocumentWrapper Deprecated.
RowCounter Deprecated.
TableInputFormat Deprecated.
TableInputFormatBase Deprecated.
TableMapReduceUtil Deprecated.
TableOutputFormat Deprecated.
TableOutputFormat.TableRecordWriter Convert Reduce output (key, value) to (HStoreKey, KeyedDataArrayWritable) and write to an HBase table
TableSplit Deprecated.
 

Package org.apache.hadoop.hbase.mapred Description

Provides HBase MapReduce Input/OutputFormats, a table indexing MapReduce job, and utility methods.

Table of Contents

HBase, MapReduce and the CLASSPATH

MapReduce jobs deployed to a MapReduce cluster do not by default have access to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. You could add hbase-site.xml to $HADOOP_HOME/conf and the hbase-X.X.X.jar to $HADOOP_HOME/lib and copy these changes across your cluster, but the cleanest means of adding HBase configuration and classes to the cluster CLASSPATH is to uncomment HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh and add the path to the hbase jar and to the $HBASE_CONF_DIR directory. Then copy the amended configuration around the cluster. You'll probably need to restart the MapReduce cluster for it to notice the new configuration.

For example, here is how you would amend hadoop-env.sh to add the built hbase jar, the hbase conf, and the PerformanceEvaluation class from the built hbase test jar to the hadoop CLASSPATH:

# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
export HADOOP_CLASSPATH=$HBASE_HOME/build/test:$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf

Expand $HBASE_HOME in the above appropriately to suit your local environment.

After copying the above change around your cluster, this is how you would run the PerformanceEvaluation MR job to put up 4 clients (this presumes a ready MapReduce cluster):

$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4
The PerformanceEvaluation class will be found on the CLASSPATH because you added $HBASE_HOME/build/test to HADOOP_CLASSPATH.

Another possibility, if for example you do not have access to hadoop-env.sh or are unable to restart the hadoop cluster, is to bundle HBase into your MapReduce job jar: add the hbase jar and its dependencies under the job jar's lib/ directory and the hbase conf into the job jar's conf/ directory, as sketched below.
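
For illustration only, such a job jar might be laid out something like this (the jar name and class paths below are made up; the parts that matter are the lib/ and conf/ directories just described):

bulkimport.jar
    com/yourco/YourJobClasses.class    (your job classes)
    lib/hbase-X.X.X.jar                (the hbase jar and its dependencies)
    conf/hbase-site.xml                (your hbase configuration)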

HBase as MapReduce job data source and sink

HBase can be used as a data source, TableInputFormat, and data sink, TableOutputFormat, for MapReduce jobs. Writing MapReduce jobs that read or write HBase, you'll probably want to subclass TableMap and/or TableReduce. See the do-nothing pass-through classes IdentityTableMap and IdentityTableReduce for basic usage. For a more involved example, see BuildTableIndex or review the org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.

When running MapReduce jobs that use hbase as a source or sink, you'll need to specify the source/sink table and column names in your job configuration.
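
As a rough sketch of that wiring (the table and column family names below are made up, and it assumes the deprecated mapred API in this package, where IdentityTableMap passes through ImmutableBytesWritable keys and RowResult values), TableMapReduceUtil can record the source table and columns in the job configuration and set up TableInputFormat in a single call:

package com.example.hbase;                          // hypothetical package name

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class ScanTableSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ScanTableSketch.class);
    job.setJobName("scantable-sketch");
    // Source side: scan the (made-up) 'sourcetable', column family 'contents:',
    // through the do-nothing IdentityTableMap.  This call sets the mapper, the
    // map output key/value classes, TableInputFormat, and records the table and
    // column names in the job configuration.
    TableMapReduceUtil.initTableMapJob("sourcetable", "contents:",
        IdentityTableMap.class, ImmutableBytesWritable.class, RowResult.class, job);
    // No reduce and no file output in this sketch; a real job would emit something
    // from the map, or configure a TableReduce/TableOutputFormat sink instead.
    job.setNumReduceTasks(0);
    job.setOutputFormat(NullOutputFormat.class);
    JobClient.runJob(job);
  }
}

A real job would replace IdentityTableMap with your own TableMap implementation, or add a TableReduce/TableOutputFormat sink as described above.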

Reading from hbase, the TableInputFormat asks hbase for the list of regions and makes a map per region, or mapred.map.tasks maps, whichever is smaller (if your job only has two maps, raise mapred.map.tasks to a number greater than the number of regions). Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per node.

Writing, it may make sense to avoid the reduce step and write back into hbase from inside your map. You'd do this when your job does not need the sort and collation that mapreduce does on the map-emitted data; on insert, hbase 'sorts' anyway, so there is no point double-sorting (and shuffling data around your mapreduce cluster) unless you need to. If you do not need the reduce, you might just have your map emit counts of records processed, so the framework's report at the end of your job has meaning, or set the number of reduces to zero and use TableOutputFormat. See the example code below. If running the reduce step makes sense in your case, it's usually better to have lots of reducers so the load is spread across the hbase cluster.

There is also a new hbase partitioner, the HRegionPartitioner, which partitions the map output so there is one reducer per currently existing region. It is suitable when your table is large and your upload will not greatly alter the number of existing regions when done; otherwise, use the default partitioner.
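
Wiring it in is a one-line change to the JobConf. A minimal sketch, assuming your map output keys are ImmutableBytesWritable row keys and the sink table is already configured (for example via TableOutputFormat), so the partitioner can locate region boundaries:

import org.apache.hadoop.hbase.mapred.HRegionPartitioner;
import org.apache.hadoop.mapred.JobConf;

public class PartitionerSketch {
  // Route each map output record to the reducer responsible for the region that
  // owns its row key, so reducers line up roughly one per region.  Assumes the
  // map output keys are row keys (ImmutableBytesWritable) and the sink table has
  // been configured for the job (e.g. via TableOutputFormat).
  public static void usePerRegionPartitioning(JobConf job) {
    job.setPartitionerClass(HRegionPartitioner.class);
  }
}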

Example Code

Sample Row Counter

See RowCounter. You should be able to run it by doing: % ./bin/hadoop jar hbase-X.X.X.jar. This will invoke the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs offered. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH so the rowcounter gets pointed at the right hbase cluster (or, build a new jar with an appropriate hbase-site.xml built into your job jar).

PerformanceEvaluation

See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs a mapreduce job to run concurrent clients reading and writing hbase.

Sample MR Bulk Uploader

A students/classes example based on a contribution by Naama Kraus, with lots of documentation, can be found over in src/examples/mapred. It's the org.apache.hadoop.hbase.mapred.SampleUploader class. Just copy it under src/java/org/apache/hadoop/hbase/mapred to compile and try it (until we start generating an hbase examples jar). The class reads a data file from HDFS and, per line, does an upload to HBase using TableReduce. Read the class comment for a specification of inputs, prerequisites, etc.

Example to bulk import/load a text file into an HTable

Here's a sample program from Allen Day that takes an HDFS text file path and an HBase table name as inputs, and loads the contents of the text file into the table entirely in the map phase (there is no reduce).

package com.spicylogic.hbase;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Class that adds the parsed line from the input to hbase
 * in the map function.  Map has no emissions and the job
 * has no reduce.
 */
public class BulkImport implements Tool {
  private static final String NAME = "BulkImport";
  private Configuration conf;

  public static class InnerMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private HTable table;
    private HBaseConfiguration HBconf;

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
    throws IOException {
      if ( table == null )
        throw new IOException("table is null");

      // Split input line on tab character
      String [] splits = value.toString().split("\t");
      if ( splits.length != 4 )
        return;

      String rowID = splits[0];
      int timestamp  = Integer.parseInt( splits[1] );
      String colID = splits[2];
      String cellValue = splits[3];

      reporter.setStatus("Map emitting cell for row='" + rowID +
          "', column='" + colID + "', time='" + timestamp + "'");

      BatchUpdate bu = new BatchUpdate( rowID );
      if ( timestamp > 0 )
        bu.setTimestamp( timestamp );

      bu.put(colID, cellValue.getBytes());      
      table.commit( bu );      
    }

    public void configure(JobConf job) {
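      // Called once per task: open the target HTable named by the "input.table" job property.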
      HBconf = new HBaseConfiguration(job);
      try {
        table = new HTable( HBconf, job.get("input.table") );
      } catch (IOException e) {
        // If the table cannot be opened, leave it null; map() will then fail with "table is null".
        e.printStackTrace();
      }
    }
  }

  public JobConf createSubmittableJob(String[] args) {
    JobConf c = new JobConf(getConf(), BulkImport.class);
    c.setJobName(NAME);
    FileInputFormat.setInputPaths(c, new Path(args[0]));

    c.set("input.table", args[1]);
    c.setMapperClass(InnerMap.class);
    c.setNumReduceTasks(0);
    c.setOutputFormat(NullOutputFormat.class);
    return c;
  }

  static int printUsage() {
    System.err.println("Usage: " + NAME + " <input> <table_name>");
    System.err.println("\twhere <input> is a tab-delimited text file with 4 columns.");
    System.err.println("\t\tcolumn 1 = row ID");
    System.err.println("\t\tcolumn 2 = timestamp (use a negative value for current time)");
    System.err.println("\t\tcolumn 3 = column ID");
    System.err.println("\t\tcolumn 4 = cell value");
    return -1;
  } 

  public int run(String[] args) throws Exception {
    // Make sure there are exactly 2 parameters left.
    if (args.length != 2) {
      return printUsage();
    }
    JobClient.runJob(createSubmittableJob(args));
    return 0;
  }

  public Configuration getConf() {
    return this.conf;
  } 

  public void setConf(final Configuration c) {
    this.conf = c;
  }

  public static void main(String[] args) throws Exception {
    int errCode = ToolRunner.run(new Configuration(), new BulkImport(), args);
    System.exit(errCode);
  }
}
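
To try the above, bundle the class into a job jar as described in the CLASSPATH section (the jar name and input path below are made up) and run something like:

$HADOOP_HOME/bin/hadoop jar bulkimport.jar com.spicylogic.hbase.BulkImport /user/you/input.tsv mytable

The input file is the 4-column tab-delimited file described by printUsage() above.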



Copyright © 2009 The Apache Software Foundation