org.apache.hadoop.hbase.mapreduce
Class TableSnapshotInputFormat
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
- Direct Known Subclasses:
- MultiTableSnapshotInputFormat
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class TableSnapshotInputFormat
- extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot. The job
bypasses the HBase servers and reads the underlying files (hfiles, recovered edits,
hlogs, etc.) directly for maximum performance. The snapshot does not need to be restored
to the live cluster or cloned, and the MapReduce job can be run against an online or
offline HBase cluster. The snapshot files can also be exported to a pure-HDFS cluster
using the ExportSnapshot tool, and this InputFormat can then run the MapReduce job
directly over the exported files. The snapshot should not be deleted while there are
jobs reading from its files.
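For context, the snapshot that such a job reads can be taken ahead of time through the
Admin API; the following is a minimal sketch that is not part of this class, with the
table and snapshot names used as placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TakeSnapshot {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      // Snapshot the table; this snapshot (or an ExportSnapshot copy of it)
      // is what TableSnapshotInputFormat will read.
      admin.snapshot("mySnapshot", TableName.valueOf("myTable"));
    }
  }
}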
Usage is similar to TableInputFormat, and
TableMapReduceUtil.initTableSnapshotMapperJob(String, Scan, Class, Class, Class, Job,
boolean, Path)
can be used to configure the job.
Job job = Job.getInstance(conf);
Scan scan = new Scan();
// restoreDir is a temporary directory to restore the snapshot into (see setInput below)
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
    scan, MyTableMapper.class, MyMapKeyOutput.class,
    MyMapOutputValueWritable.class, job, true, restoreDir);
Internally, this input format restores the snapshot into the given tmp directory. As with
TableInputFormat, an InputSplit is created per region. Each RecordReader opens its region
for reading and uses an internal RegionScanner to execute the Scan obtained from the user.
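The mapper named in the snippet above is a user-supplied TableMapper: the key is the row
key and the value holds the cells matched by the Scan. Below is a minimal sketch; the
class name and output types are illustrative, standing in for the placeholder names used
in the snippet.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Illustrative mapper: emits one count per row read from the snapshot.
public class MyTableMapper extends TableMapper<Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text rowKey = new Text();

  @Override
  protected void map(ImmutableBytesWritable key, Result value, Context context)
      throws IOException, InterruptedException {
    rowKey.set(key.copyBytes());  // the key is the row key
    context.write(rowKey, ONE);
  }
}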
HBase owns all the data and snapshot files on the filesystem, and only the 'hbase' user
can read the snapshot and data files.
To read snapshot files directly from the filesystem, the user running the MR job
must have sufficient permissions to access the snapshot and reference files.
This means that to run MapReduce over snapshot files, the job has to be run as the HBase
user, or the user must have group or other privileges on the filesystem (see HBASE-8369).
Note that granting other users read access to the snapshot/data files completely
circumvents the access control enforced by HBase.
- See Also:
TableSnapshotScanner
Method Summary

org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result>
    createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)

List<org.apache.hadoop.mapreduce.InputSplit>
    getSplits(org.apache.hadoop.mapreduce.JobContext job)

static void
    setInput(org.apache.hadoop.mapreduce.Job job, String snapshotName, org.apache.hadoop.fs.Path restoreDir)
    Configures the job to use TableSnapshotInputFormat to read from a snapshot.

Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

TableSnapshotInputFormat
public TableSnapshotInputFormat()

Method Detail

createRecordReader
public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
throws IOException
- Specified by:
createRecordReader
in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
- Throws:
IOException
getSplits
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext job)
throws IOException,
InterruptedException
- Specified by:
getSplits
in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
- Throws:
IOException
InterruptedException
setInput
public static void setInput(org.apache.hadoop.mapreduce.Job job,
String snapshotName,
org.apache.hadoop.fs.Path restoreDir)
throws IOException
- Configures the job to use TableSnapshotInputFormat to read from a snapshot.
- Parameters:
  job - the job to configure
  snapshotName - the name of the snapshot to read from
  restoreDir - a temporary directory to restore the snapshot into. The current user should
  have write permissions to this directory, and it should not be a subdirectory of rootdir.
  After the job is finished, restoreDir can be deleted.
- Throws:
IOException
- if an error occurs
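As a rough sketch of using this class directly (rather than through
TableMapReduceUtil.initTableSnapshotMapperJob), the driver below wires up the input
format, snapshot, and restore directory by hand; the snapshot name, paths, mapper class,
and the Scan serialization via TableInputFormat.SCAN are illustrative assumptions rather
than part of this page.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SnapshotJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "snapshot-scan");
    job.setJarByClass(SnapshotJobDriver.class);

    // Read directly from the snapshot; the restore dir must be writable by
    // the current user and must not be a subdirectory of the HBase rootdir.
    job.setInputFormatClass(TableSnapshotInputFormat.class);
    TableSnapshotInputFormat.setInput(job, "mySnapshot", new Path("/tmp/snapshot-restore"));

    // Serialize the Scan into the job configuration (initTableSnapshotMapperJob
    // normally does this for you); assumes TableInputFormat.SCAN and
    // TableMapReduceUtil.convertScanToString are available in this HBase version.
    Scan scan = new Scan();
    job.getConfiguration().set(TableInputFormat.SCAN,
        TableMapReduceUtil.convertScanToString(scan));

    job.setMapperClass(MyTableMapper.class);   // mapper sketched earlier
    job.setNumReduceTasks(0);                  // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/snapshot-out"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}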