org.apache.hadoop.hbase.mapreduce
Class TableInputFormatBase

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
      extended by org.apache.hadoop.hbase.mapreduce.TableInputFormatBase
Direct Known Subclasses:
TableInputFormat

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class TableInputFormatBase
extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>

A base for TableInputFormats. Receives an HTable and a Scan instance that defines the input columns, etc. Subclasses may use other TableRecordReader implementations.

An example of a subclass:

   public static class ExampleTIF extends TableInputFormatBase implements JobConfigurable {
     // configure() must supply an HTable via setHTable() and a Scan via
     // setScan(); see the sketch below.
   }

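A minimal sketch of the configure() body (the table name "exampleTable" and column family "cf" are placeholders):

   public void configure(JobConf job) {
     try {
       // mandatory: hand the table to the base class
       setHTable(new HTable(HBaseConfiguration.create(job), "exampleTable"));
     } catch (IOException e) {
       throw new RuntimeException(e);
     }
     Scan scan = new Scan();
     // optional: restrict the input to one column family
     scan.addFamily(Bytes.toBytes("cf"));
     // mandatory: hand the scan to the base class
     setScan(scan);
   }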

Field Summary
static String INPUT_AUTOBALANCE_MAXSKEWRATIO
          Specify the ratio for data skew in M/R jobs; it is used together with the hbase.mapreduce.input.autobalance property.
static String MAPREDUCE_INPUT_AUTOBALANCE
          Specify if we enable auto-balance for input in M/R jobs.
static String TABLE_ROW_TEXTKEY
          Specify whether the row keys in the table are text (ASCII 32-126); default is true.
 
Constructor Summary
TableInputFormatBase()
           
 
Method Summary
 List<org.apache.hadoop.mapreduce.InputSplit> calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list, org.apache.hadoop.mapreduce.JobContext context, long average)
          Calculates the number of MapReduce input splits for the map tasks.
 org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Builds a TableRecordReader.
protected  HTable getHTable()
          Allows subclasses to get the HTable.
 Scan getScan()
          Gets the scan that defines the input details, such as the columns.
static byte[] getSplitKey(byte[] start, byte[] end, boolean isText)
          Selects a split point in the region.
 List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Calculates the splits that will serve as input for the map tasks.
protected  Pair<byte[][],byte[][]> getStartEndKeys()
           
protected  boolean includeRegionInSplit(byte[] startKey, byte[] endKey)
          Test if the given region is to be included in the InputSplit while splitting the regions of a table.
 String reverseDNS(InetAddress ipAddress)
           
protected  void setHTable(HTable table)
          Allows subclasses to set the HTable.
 void setScan(Scan scan)
          Sets the scan that defines the input details, such as the columns.
protected  void setTableRecordReader(TableRecordReader tableRecordReader)
          Allows subclasses to set the TableRecordReader.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAPREDUCE_INPUT_AUTOBALANCE

public static final String MAPREDUCE_INPUT_AUTOBALANCE
Specify if we enable auto-balance for input in M/R jobs.

See Also:
Constant Field Values

INPUT_AUTOBALANCE_MAXSKEWRATIO

public static final String INPUT_AUTOBALANCE_MAXSKEWRATIO
Specify the ratio for data skew in M/R jobs; it is used together with the hbase.mapreduce.input.autobalance property.

See Also:
Constant Field Values

TABLE_ROW_TEXTKEY

public static final String TABLE_ROW_TEXTKEY
Specify whether the row keys in the table are text (ASCII 32-126); default is true. False means the table uses binary row keys.

See Also:
Constant Field Values
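
Taken together, a job can opt in to balanced splits by setting these properties before submission; a minimal sketch (the ratio value "3" is illustrative):

   Configuration conf = HBaseConfiguration.create();
   // enable auto-balancing of the input splits
   conf.set(TableInputFormatBase.MAPREDUCE_INPUT_AUTOBALANCE, "true");
   // tolerate regions up to 3x the average size before re-splitting
   conf.set(TableInputFormatBase.INPUT_AUTOBALANCE_MAXSKEWRATIO, "3");
   // row keys are printable text (the default)
   conf.set(TableInputFormatBase.TABLE_ROW_TEXTKEY, "true");
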
Constructor Detail

TableInputFormatBase

public TableInputFormatBase()
Method Detail

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                  org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                           throws IOException
Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
split - The split to work with.
context - The current context.
Returns:
The newly created record reader.
Throws:
IOException - When creating the reader fails.
See Also:
InputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext)

getStartEndKeys

protected Pair<byte[][],byte[][]> getStartEndKeys()
                                           throws IOException
Throws:
IOException

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                       throws IOException
Calculates the splits that will serve as input for the map tasks. The number of splits matches the number of regions in a table.

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
context - The current job context.
Returns:
The list of input splits.
Throws:
IOException - When creating the list of splits fails.
See Also:
InputFormat.getSplits(org.apache.hadoop.mapreduce.JobContext)
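
In practice these splits are consumed transparently once a job is wired up through the concrete TableInputFormat; a minimal driver sketch (the table name "exampleTable" and MyMapper, a TableMapper subclass, are hypothetical):

   Job job = Job.getInstance(conf, "hbase-scan-example");
   job.setJarByClass(MyMapper.class);
   TableMapReduceUtil.initTableMapperJob(
       "exampleTable",     // input table
       new Scan(),         // scan defining the input details
       MyMapper.class,     // mapper class
       Text.class,         // mapper output key
       IntWritable.class,  // mapper output value
       job);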

reverseDNS

public String reverseDNS(InetAddress ipAddress)
                  throws NamingException,
                         UnknownHostException
Throws:
NamingException
UnknownHostException

calculateRebalancedSplits

public List<org.apache.hadoop.mapreduce.InputSplit> calculateRebalancedSplits(List<org.apache.hadoop.mapreduce.InputSplit> list,
                                                                              org.apache.hadoop.mapreduce.JobContext context,
                                                                              long average)
                                                                       throws IOException
Calculates the number of MapReduce input splits for the map tasks. The number of MapReduce input splits depends on the average region size and the "data skew ratio" the user set in the configuration.

Parameters:
list - The list of input splits before balance.
context - The current job context.
average - The average size of all regions.
Returns:
The list of input splits.
Throws:
IOException - When creating the list of splits fails.
See Also:
InputFormat.getSplits(org.apache.hadoop.mapreduce.JobContext)
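
A simplified sketch of the balancing rule (a hypothetical illustration, not the actual implementation; the real method also picks midpoints with getSplitKey(byte[], byte[], boolean) and combines runs of small splits):

   static void classify(List<org.apache.hadoop.mapreduce.InputSplit> splits,
                        long average, double ratio)
       throws IOException, InterruptedException {
     for (org.apache.hadoop.mapreduce.InputSplit split : splits) {
       long length = split.getLength();
       if (length > average * ratio) {
         // oversized region: would be cut in two at a chosen midpoint
       } else if (length < average) {
         // undersized region: candidate for combining with small neighbours
       }
       // otherwise the split is kept as-is
     }
   }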

getSplitKey

public static byte[] getSplitKey(byte[] start,
                                 byte[] end,
                                 boolean isText)
Selects a split point in the region. The selection of the split point is based on a uniform distribution assumption for the keys in a region. Here are some examples:

   startKey: aaabcdefg                         endKey: aaafff                          split point: aaad
   startKey: 111000                            endKey: 1125790                         split point: 111b
   startKey: 1110                              endKey: 1120                            split point: 111_
   startKey: binary key { 13, -19, 126, 127 }  endKey: binary key { 13, -19, 127, 0 }  split point: binary key { 13, -19, 127, -64 }

This method is declared public static to make it easier to test.

Parameters:
start - Start key of the region
end - End key of the region
isText - Whether to use text key mode or binary key mode
Returns:
The split point in the region.
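
Because getSplitKey(byte[], byte[], boolean) is public static, it can be exercised directly; a brief sketch reproducing the first text-key example above:

   byte[] split = TableInputFormatBase.getSplitKey(
       Bytes.toBytes("aaabcdefg"),  // start key of the region
       Bytes.toBytes("aaafff"),     // end key of the region
       true);                       // text key mode
   System.out.println(Bytes.toString(split));  // "aaad", per the examples above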

includeRegionInSplit

protected boolean includeRegionInSplit(byte[] startKey,
                                       byte[] endKey)
Test if the given region is to be included in the InputSplit while splitting the regions of a table.

This optimization is effective when there is a specific reason to exclude an entire region from the M-R job (and hence it does not contribute an InputSplit), based on the start and end keys of that region.
It is useful when we need to remember the last-processed top record and continuously revisit the [last, current) interval for M-R processing. In addition to reducing the number of InputSplits, this also reduces the load on the region server, due to the ordering of the keys.

Note: it is possible that endKey.length == 0, for the last (most recent) region.
Override this method if you want to exclude regions from the M-R job in bulk. By default, no region is excluded (i.e. all regions are included).

Parameters:
startKey - Start key of the region
endKey - End key of the region
Returns:
true, if this region needs to be included as part of the input (default).
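
A minimal sketch of such an override (the checkpoint key "row-12345" is a hypothetical placeholder), extending the concrete TableInputFormat:

   public static class CheckpointTIF extends TableInputFormat {
     private static final byte[] CHECKPOINT = Bytes.toBytes("row-12345");

     @Override
     protected boolean includeRegionInSplit(byte[] startKey, byte[] endKey) {
       // Skip regions that end at or before the checkpoint. An empty
       // endKey marks the last region, which is always included.
       return endKey.length == 0 || Bytes.compareTo(endKey, CHECKPOINT) > 0;
     }
   }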

getHTable

protected HTable getHTable()
Allows subclasses to get the HTable.


setHTable

protected void setHTable(HTable table)
Allows subclasses to set the HTable.

Parameters:
table - The table to get the data from.

getScan

public Scan getScan()
Gets the scan that defines the input details, such as the columns.

Returns:
The internal scan instance.

setScan

public void setScan(Scan scan)
Sets the scan that defines the input details, such as the columns.

Parameters:
scan - The scan to set.
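
A typical scan for MapReduce input restricts the columns and tunes caching; a brief sketch (the column family name "cf" is a placeholder):

   Scan scan = new Scan();
   scan.addFamily(Bytes.toBytes("cf"));  // limit the input to one column family
   scan.setCaching(500);                 // fewer RPC round-trips per mapper
   scan.setCacheBlocks(false);           // full scans should not pollute the block cache
   setScan(scan);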

setTableRecordReader

protected void setTableRecordReader(TableRecordReader tableRecordReader)
Allows subclasses to set the TableRecordReader.

Parameters:
tableRecordReader - A different TableRecordReader implementation.
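
For example, a subclass could install a custom reader from its constructor (MyRecordReader, a hypothetical TableRecordReader subclass, stands in for the custom implementation):

   public static class CustomReaderTIF extends TableInputFormat {
     public CustomReaderTIF() {
       // replace the default TableRecordReader with a custom one
       setTableRecordReader(new MyRecordReader());
     }
   }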


Copyright © 2015 The Apache Software Foundation. All rights reserved.