When performing a lot of Puts, make sure that setAutoFlush is set to
false on your HTable instance. Otherwise, the Puts will be sent one at a
time to the regionserver. Puts added via htable.put(Put) and
htable.put(List<Put>) wind up in the same write buffer. If
autoFlush = false, these messages are not sent until the write buffer is
filled. To explicitly flush the messages, call flushCommits. Calling
close on the HTable instance will invoke flushCommits.
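The effect of the write buffer can be illustrated with a toy model. The sketch below is not HBase code: WriteBufferSketch, its byte-array "mutations", and the 1000-byte limit are hypothetical stand-ins, and it assumes that every flush costs exactly one round trip, which simplifies what a real client does. It only shows why buffering reduces the number of calls to the server.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a client-side write buffer. Hypothetical; not the HBase implementation. */
class WriteBufferSketch {
    private final boolean autoFlush;
    private final long bufferLimitBytes;              // assumed buffer size limit
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int rpcCount = 0;                         // simulated round trips

    WriteBufferSketch(boolean autoFlush, long bufferLimitBytes) {
        this.autoFlush = autoFlush;
        this.bufferLimitBytes = bufferLimitBytes;
    }

    /** Single mutation: goes into the shared buffer. */
    void put(byte[] mutation) {
        add(mutation);
        flushIfNeeded();
    }

    /** List of mutations: ends up in the same buffer as single puts. */
    void put(List<byte[]> mutations) {
        for (byte[] m : mutations) {
            add(m);
        }
        flushIfNeeded();
    }

    private void add(byte[] mutation) {
        buffer.add(mutation);
        bufferedBytes += mutation.length;
    }

    private void flushIfNeeded() {
        if (autoFlush || bufferedBytes >= bufferLimitBytes) {
            flushCommits();
        }
    }

    /** Sends everything buffered so far in one simulated round trip. */
    void flushCommits() {
        if (buffer.isEmpty()) {
            return;
        }
        rpcCount++;
        buffer.clear();
        bufferedBytes = 0;
    }

    /** Closing flushes whatever is still buffered. */
    void close() {
        flushCommits();
    }

    int rpcCount() {
        return rpcCount;
    }

    public static void main(String[] args) {
        WriteBufferSketch auto = new WriteBufferSketch(true, 1000);
        WriteBufferSketch buffered = new WriteBufferSketch(false, 1000);
        for (int i = 0; i < 25; i++) {
            auto.put(new byte[100]);
            buffered.put(new byte[100]);
        }
        auto.close();
        buffered.close();
        System.out.println("autoFlush=true  round trips: " + auto.rpcCount());
        System.out.println("autoFlush=false round trips: " + buffered.rpcCount());
    }
}
```

With 25 puts of 100 bytes each, the autoFlush model makes one simulated round trip per put, while the buffered model flushes only when the 1000-byte limit is reached and once more on close.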
If HBase is used as an input source for a MapReduce job, for example,
make sure that the input Scan instance to the MapReduce job has
setCaching set to something greater than the default (which is 1). Using
the default value means that the map task will call back to the
regionserver for every record processed. Setting this value to 500, for
example, will transfer 500 rows at a time to the client to be processed.
A large caching value is a trade-off, because it costs more memory on
both the client and the regionserver, so bigger isn't always better.
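The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a rough estimate only: ScanCachingSketch is a hypothetical helper, the row count and the 1 KB average row size are made-up inputs, and the model ignores the extra calls a real scanner makes to open, finish, and close a scan.

```java
/** Rough estimate of scanner round trips versus memory per batch. Hypothetical helper. */
class ScanCachingSketch {

    /** Number of fetch calls needed when each call returns up to `caching` rows. */
    static long fetchCalls(long totalRows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be at least 1");
        }
        return (totalRows + caching - 1) / caching;   // ceiling division
    }

    /** Approximate bytes held per batch, given an assumed average row size. */
    static long batchBytes(int caching, long avgRowBytes) {
        return caching * avgRowBytes;
    }

    public static void main(String[] args) {
        long totalRows = 1_000_000L;   // assumed table size
        long avgRowBytes = 1024L;      // assumed average row size
        for (int caching : new int[] {1, 500, 10_000}) {
            System.out.println("caching=" + caching
                + " fetch calls=" + fetchCalls(totalRows, caching)
                + " bytes per batch=" + batchBytes(caching, avgRowBytes));
        }
    }
}
```

Under these assumptions, raising the caching value from 1 to 500 cuts the fetch calls for a million rows from 1,000,000 to 2,000, at the price of holding roughly 500 KB per batch instead of 1 KB.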
This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the regionservers. Always enclose ResultScanner processing in a try/finally block, and close the scanner in the finally clause...
    Scan scan = new Scan();
    // set attrs...
    ResultScanner rs = htable.getScanner(scan);
    try {
      for (Result r = rs.next(); r != null; r = rs.next()) {
        // process result...
      }
    } finally {
      rs.close();  // always close the ResultScanner!
    }
    htable.close();
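On Java 7 or later, the same guarantee can be expressed with try-with-resources, which works for any Closeable resource. The sketch below uses a hypothetical FakeScanner over strings rather than a real ResultScanner, purely to show that close is called both on normal completion and when processing throws.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Demonstrates that try-with-resources closes a scanner-like resource. Hypothetical types. */
class ScannerCloseSketch {

    /** Stand-in for a scanner: iterable, closeable, and remembers whether it was closed. */
    static class FakeScanner implements Closeable, Iterable<String> {
        private final List<String> rows;
        boolean closed = false;

        FakeScanner(String... rows) {
            this.rows = Arrays.asList(rows);
        }

        @Override
        public Iterator<String> iterator() {
            return rows.iterator();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Processes rows and throws if it meets `failOn`; the scanner is closed either way. */
    static int process(FakeScanner scanner, String failOn) {
        int processed = 0;
        try (FakeScanner rs = scanner) {
            for (String row : rs) {
                if (row.equals(failOn)) {
                    throw new IllegalStateException("failed on " + row);
                }
                processed++;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        FakeScanner ok = new FakeScanner("r1", "r2", "r3");
        System.out.println("processed=" + process(ok, null) + " closed=" + ok.closed);

        FakeScanner bad = new FakeScanner("r1", "r2", "r3");
        try {
            process(bad, "r2");
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage() + " closed=" + bad.closed);
        }
    }
}
```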
Scan instances can be set to use the block cache in the regionserver via
the setCacheBlocks method. For input Scans to MapReduce jobs, this
should be false. For frequently accessed rows, it is advisable to use
the block cache.