When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. Puts added via htable.put(Put) and htable.put(List&lt;Put&gt;) wind up in the same write buffer. If autoFlush = false, these messages are not sent until the write buffer is filled. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits.
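For example, a buffered-write pattern might look like the following. This is a minimal sketch; the table name "myTable", column family "cf", qualifier "qual", and the row contents are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "myTable");
htable.setAutoFlush(false);  // buffer Puts client-side instead of sending each one

for (int i = 0; i < 10000; i++) {
  Put put = new Put(Bytes.toBytes("row-" + i));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value-" + i));
  htable.put(put);  // accumulates in the write buffer until it fills
}

htable.flushCommits();  // explicitly flush any remaining buffered Puts
htable.close();         // close() also invokes flushCommits()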
If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching set to something greater than the default (which is 1). Using the default value means that the map-task will make a call back to the region-server for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit trade-off in setting this value higher, because a larger cache consumes more memory on both the client and the RegionServer, so bigger isn't always better.
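For example, configuring the caching on the input Scan is a one-line change. This is a minimal sketch; the value 500 follows the example above.

Scan scan = new Scan();
scan.setCaching(500);  // transfer 500 rows per round-trip instead of the default 1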
Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected. If scan.addFamily is called, then all of the attributes in the specified ColumnFamily will be returned to the client. If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan, because attribute over-selection is a non-trivial performance penalty over large datasets.
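For example, to process a single attribute, prefer addColumn over addFamily. This is a minimal sketch; the family "cf" and qualifier "attr1" are assumed names.

Scan scan = new Scan();
// scan.addFamily(Bytes.toBytes("cf"));  // over-selects: returns every attribute in the family
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));  // returns only the attribute needed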
This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in try/finally blocks:

Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();
Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.
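For example, the two cases might be configured as follows (a minimal sketch):

Scan mrScan = new Scan();
mrScan.setCacheBlocks(false);  // MapReduce input: blocks are read once, so don't evict hot data

Scan hotRowScan = new Scan();
hotRowScan.setCacheBlocks(true);  // frequently accessed rows: keep their blocks cached (true is the default)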
When performing a table scan where only the row keys are needed (no families, qualifiers, values, or timestamps), add a FilterList with a MUST_PASS_ALL operator to the scanner using setFilter. The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter. Using this filter combination will result in a worst-case scenario of a RegionServer reading a single value from disk, with minimal network traffic to the client for a single row.
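For example, a row-key-only scan might be set up as follows (a minimal sketch):

import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // return only the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());       // strip values, returning keys only
scan.setFilter(filters);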