org.apache.hadoop.hbase.mapreduce
Class TableSnapshotInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
      extended by org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat

public final class TableSnapshotInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot. The job bypasses the HBase servers and reads the underlying files (hfile, recovered edits, hlogs, etc.) directly to provide maximum performance. The snapshot does not have to be restored or cloned first. This also allows the MapReduce job to run against an online or offline HBase cluster. The snapshot files can be exported to a pure-HDFS cluster with the ExportSnapshot tool, and this InputFormat can then be used to run the MapReduce job directly over the exported snapshot files.

Usage is similar to TableInputFormat. TableMapReduceUtil.initTableSnapshotMapperJob(String, Scan, Class, Class, Class, Job, boolean, Path) can be used to configure the job.

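   Job job = new Job(conf);
   Scan scan = new Scan();
   // The boolean argument (true) adds the HBase dependency jars to the job;
   // tmpDir is the directory the snapshot is restored into.
   TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, scan,
       MyTableMapper.class, MyMapKeyOutput.class,
       MyMapOutputValueWritable.class, job, true, tmpDir);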
 

Internally, this input format restores the snapshot into the given tmp directory. As with TableInputFormat, one InputSplit is created per region. Each RecordReader opens its region for reading and uses an internal RegionScanner to execute the Scan supplied by the user.

HBase owns all the data and snapshot files on the filesystem. Only the HBase user can read from snapshot files and data files. In normal operation HBase enforces security because all requests are handled by the server layer, and users cannot read from the data files directly. To read snapshot files directly from the filesystem, the user running the MR job must have sufficient permissions to access the snapshot and reference files. This means that to run MapReduce over snapshot files, the MR job has to be run as the HBase user, or the user must have group or other privileges in the filesystem (see HBASE-8369). Note that giving other users read access to snapshot/data files completely circumvents the access control enforced by HBase.
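
The owner, group, and other cases described above can be illustrated with a small, simplified model of the POSIX-style read check that HDFS applies to a file. This is a sketch only: the class SnapshotReadAccess and its canRead method are hypothetical helpers, not part of HBase or Hadoop, and the user, group, and mode values are assumptions chosen for the example. The model ignores ACLs, the HDFS superuser, and the execute permission needed on parent directories.

```java
import java.util.Set;

/**
 * Hypothetical helper: simplified model of an HDFS-style read check.
 * Not part of HBase or Hadoop.
 */
final class SnapshotReadAccess {

    /**
     * Decides whether a user may read a file.
     *
     * @param owner      user that owns the file
     * @param group      group that owns the file
     * @param mode       nine-character mode string such as "rw-r-----"
     * @param user       user running the MR job
     * @param userGroups groups that user belongs to
     */
    static boolean canRead(String owner, String group, String mode,
                           String user, Set<String> userGroups) {
        if (mode.length() != 9) {
            throw new IllegalArgumentException("mode must have 9 characters: " + mode);
        }
        // Exactly one permission class is tested: owner, else group, else other.
        if (user.equals(owner)) {
            return mode.charAt(0) == 'r';
        }
        if (userGroups.contains(group)) {
            return mode.charAt(3) == 'r';
        }
        return mode.charAt(6) == 'r';
    }

    public static void main(String[] args) {
        // Assumed layout: snapshot file owned by hbase:hbase with mode rw-r-----.
        String mode = "rw-r-----";
        // The HBase user itself.
        System.out.println(canRead("hbase", "hbase", mode, "hbase", Set.of("hbase")));
        // A job user that was added to the hbase group.
        System.out.println(canRead("hbase", "hbase", mode, "etl", Set.of("etl", "hbase")));
        // A job user with no relationship to the file.
        System.out.println(canRead("hbase", "hbase", mode, "etl", Set.of("etl")));
    }
}
```

Under these assumed values, the first two users can read the file and the third cannot. Any user for whom this check passes can read the snapshot data without going through the HBase server layer, which is why widening group or other permissions bypasses HBase access control.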


Nested Class Summary
static class TableSnapshotInputFormat.TableSnapshotRegionRecordReader
          Snapshot region record reader.
static class TableSnapshotInputFormat.TableSnapshotRegionSplit
          Snapshot region split.
 
Constructor Summary
TableSnapshotInputFormat()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
           
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
           
static void setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir)
          Set job input.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TableSnapshotInputFormat

public TableSnapshotInputFormat()
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                  org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                           throws IOException
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Throws:
IOException

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
                                                       throws IOException,
                                                              InterruptedException
Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Throws:
IOException
InterruptedException

setInput

public static void setInput(org.apache.hadoop.mapreduce.Job job,
                            String snapshotName,
                            org.apache.hadoop.fs.Path restoreDir)
                     throws IOException
Set job input.

Parameters:
job - The job
snapshotName - The snapshot name
restoreDir - The directory where the temp table will be created
Throws:
IOException - on error


Copyright © 2015 The Apache Software Foundation. All Rights Reserved.