org.apache.hadoop.hbase.mapreduce
Class TableInputFormatBase

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
      extended by org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
Direct Known Subclasses:
TableInputFormat

public abstract class TableInputFormatBase
extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

A base for TableInputFormats. Receives an HTable and a Scan instance that defines the input columns, etc. Subclasses may use other TableRecordReader implementations.

An example of a subclass:

   class ExampleTIF extends TableInputFormatBase implements JobConfigurable {

     public void configure(JobConf job) {
       HTable exampleTable = new HTable(HBaseConfiguration.create(job),
         Bytes.toBytes("exampleTable"));
       // mandatory
       setHTable(exampleTable);
       Scan scan = new Scan();
       scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("columnA"));
       scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("columnB"));
       // optional
       scan.setFilter(new PrefixFilter(Bytes.toBytes("keyPrefix")));
       // mandatory
       setScan(scan);
     }
   }
 


Constructor Summary
TableInputFormatBase()
           
 
Method Summary
 org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Builds a TableRecordReader.
protected  HTable getHTable()
          Allows subclasses to get the HTable.
 Scan getScan()
          Gets the scan defining the actual details like columns etc.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Calculates the splits that will serve as input for the map tasks.
protected  boolean includeRegionInSplit(byte[] startKey, byte[] endKey)
          Test if the given region is to be included in the InputSplit while splitting the regions of a table.
protected  void setHTable(HTable table)
          Allows subclasses to set the HTable.
 void setScan(Scan scan)
          Sets the scan defining the actual details like columns etc.
protected  void setTableRecordReader(TableRecordReader tableRecordReader)
          Allows subclasses to set the TableRecordReader.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TableInputFormatBase

public TableInputFormatBase()
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                  org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                           throws IOException
Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
split - The split to work with.
context - The current context.
Returns:
The newly created record reader.
Throws:
IOException - When creating the reader fails.
See Also:
InputFormat.createRecordReader( org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext)

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
Calculates the splits that will serve as input for the map tasks. The number of splits matches the number of regions in a table.

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
context - The current job context.
Returns:
The list of input splits.
Throws:
IOException - When creating the list of splits fails.
See Also:
InputFormat.getSplits( org.apache.hadoop.mapreduce.JobContext)
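A plain-Java sketch of the range clipping getSplits performs: one split per region whose key range overlaps the scan's [startRow, stopRow) interval, clipped to the scan boundaries. This is an illustration only, not the actual implementation (which also attaches region server locations to each split); String keys stand in for byte[] row keys, and "" plays the role of the empty (unbounded) key.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // A region overlaps the scan when it starts before the scan's stop row
    // and ends after the scan's start row. "" means unbounded on that side.
    static boolean overlaps(String regStart, String regEnd,
                            String scanStart, String scanStop) {
        boolean startsBeforeStop = scanStop.isEmpty() || regStart.compareTo(scanStop) < 0;
        boolean endsAfterStart = regEnd.isEmpty() || regEnd.compareTo(scanStart) > 0;
        return startsBeforeStop && endsAfterStart;
    }

    // Returns one "start|end" entry per region that yields a split,
    // with the key range clipped to the scan boundaries.
    static List<String> splits(String[][] regions, String scanStart, String scanStop) {
        List<String> result = new ArrayList<>();
        for (String[] r : regions) {
            if (!overlaps(r[0], r[1], scanStart, scanStop)) continue;
            String lo = r[0].compareTo(scanStart) > 0 ? r[0] : scanStart;
            String hi = (scanStop.isEmpty()
                || (!r[1].isEmpty() && r[1].compareTo(scanStop) < 0)) ? r[1] : scanStop;
            result.add(lo + "|" + hi);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three regions covering the whole key space, scanned over ["f", "p").
        String[][] regions = { {"", "m"}, {"m", "t"}, {"t", ""} };
        System.out.println(splits(regions, "f", "p")); // [f|m, m|p]
    }
}
```

With an unbounded scan (both boundaries ""), every region produces one split, which is why the number of splits matches the number of regions.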

includeRegionInSplit

protected boolean includeRegionInSplit(byte[] startKey,
                                       byte[] endKey)
Test if the given region is to be included in the InputSplit while splitting the regions of a table.

This optimization is useful when there is a specific reason to exclude an entire region from the map-reduce job (so that it contributes no InputSplit), based on the region's start and end keys. For example, a job may need to remember the last-processed row and repeatedly revisit the [last, current) interval. Besides reducing the number of InputSplits, this also reduces load on the region servers, because row keys are ordered across regions.

Note: It is possible that endKey.length == 0 for the last (most recent) region of a table.
Override this method if you want to exclude regions from the map-reduce job wholesale. By default, no region is excluded (i.e. all regions are included).

Parameters:
startKey - Start key of the region
endKey - End key of the region
Returns:
true, if this region needs to be included as part of the input (default).
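The key-range test such an override would perform can be sketched in plain Java. The names here (lastProcessedKey, compareUnsigned) are illustrative assumptions: compareUnsigned stands in for HBase's Bytes.compareTo, which orders row keys lexicographically as unsigned bytes, and lastProcessedKey is the hypothetical marker of the last-processed row mentioned above.

```java
public class RegionFilterSketch {

    // Unsigned lexicographic comparison, matching HBase row-key ordering.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Include the region only if it may hold keys >= lastProcessedKey.
    // An empty endKey marks the last region of the table, which always qualifies.
    static boolean includeRegion(byte[] endKey, byte[] lastProcessedKey) {
        return endKey.length == 0 || compareUnsigned(endKey, lastProcessedKey) > 0;
    }

    public static void main(String[] args) {
        byte[] last = "row-0500".getBytes();
        System.out.println(includeRegion("row-0400".getBytes(), last)); // false: region is wholly before last
        System.out.println(includeRegion("row-0600".getBytes(), last)); // true
        System.out.println(includeRegion(new byte[0], last));           // true: last region
    }
}
```

An actual override would wire this test into includeRegionInSplit(startKey, endKey), returning false for regions that end before the interval of interest.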

getHTable

protected HTable getHTable()
Allows subclasses to get the HTable.


setHTable

protected void setHTable(HTable table)
Allows subclasses to set the HTable.

Parameters:
table - The table to get the data from.

getScan

public Scan getScan()
Gets the scan defining the actual details like columns etc.

Returns:
The internal scan instance.

setScan

public void setScan(Scan scan)
Sets the scan defining the actual details like columns etc.

Parameters:
scan - The scan to set.

setTableRecordReader

protected void setTableRecordReader(TableRecordReader tableRecordReader)
Allows subclasses to set the TableRecordReader.

Parameters:
tableRecordReader - A different TableRecordReader implementation.


Copyright © 2011 The Apache Software Foundation. All Rights Reserved.