If HBase is used as an input source for a MapReduce job, for
example, make sure that the input Scan
instance to the MapReduce job has setCaching
set to something greater
than the default (which is 1). Using the default value means that the
map-task will make call back to the region-server for every record
processed. Setting this value to 500, for example, will transfer 500
rows at a time to the client to be processed. There is a cost/benefit to
have the cache value be large because it costs more in memory for both
client and RegionServer, so bigger isn't always better.
Scan settings in MapReduce jobs deserve special attention. Timeouts can result (e.g., UnknownScannerException) in Map tasks if it takes longer to process a batch of records before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occuring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.
Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.
Whenever a Scan is used to process large numbers of rows (and especially when used
as a MapReduce source), be aware of which attributes are selected. If scan.addFamily
is called
then all of the attributes in the specified ColumnFamily will be returned to the client.
If only a small number of the available attributes are to be processed, then only those attributes should be specified
in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in try/catch blocks...
Scan scan = new Scan(); // set attrs... ResultScanner rs = htable.getScanner(scan); try { for (Result r = rs.next(); r != null; r = rs.next()) { // process result... } finally { rs.close(); // always close the ResultScanner! } htable.close();
Scan
instances can be set to use the block cache in the RegionServer via the
setCacheBlocks
method. For input Scans to MapReduce jobs, this should be
false
. For frequently accessed rows, it is advisable to use the block
cache.
When performing a table scan
where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a
MUST_PASS_ALL
operator to the scanner using setFilter
. The filter list
should include both a FirstKeyOnlyFilter
and a KeyOnlyFilter.
Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk
and minimal network traffic to the client for a single row.
When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions then the reads could likely be served from too few nodes.
See Section 10.7.2, “ Table Creation: Pre-Creating Regions ”, as well as Section 10.4, “HBase Configurations”