Use the bulk load tool if you can. See Bulk Loads. Otherwise, pay attention to the advice below.
Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too many regions can actually degrade performance. An example of pre-creation using hex keys follows (note: this example may need to be tweaked to the individual application's keys):
public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable(table, splits);
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions - 1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for (int i = 0; i < numRegions - 1; i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}
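For illustration only, a call such as the following would pre-create ten regions spanning 16-character hex keys. It is a sketch against the pre-0.96 client API (HBaseAdmin, HTableDescriptor, HColumnDescriptor); the table name and column family are placeholders, not values from this guide:

// Sketch: pre-create ten regions for a hypothetical table using the helpers above.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor table = new HTableDescriptor("myTable");   // placeholder table name
table.addFamily(new HColumnDescriptor("cf"));               // placeholder column family
// Nine split keys covering hex row keys from all zeros to all f's => ten regions.
byte[][] splits = getHexSplits("0000000000000000", "ffffffffffffffff", 10);
createTable(admin, table, splits);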
The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous HLog writes, but the potential downside is that if the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts.
Deferred log flush can be configured on tables via HTableDescriptor. The default value of hbase.regionserver.optionallogflushinterval is 1000ms.
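As a minimal sketch against the pre-0.96 HTableDescriptor API, deferred log flush can be enabled on the descriptor before the table is created (table and family names are placeholders):

// Sketch: enable deferred log flush for a new table (placeholder names).
HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
desc.setDeferredLogFlush(true);   // WAL edits flushed on hbase.regionserver.optionallogflushinterval
admin.createTable(desc);          // 'admin' is an HBaseAdmin as in the example above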
When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. Puts added via htable.put(Put) and htable.put(List<Put>) wind up in the same write buffer. If autoFlush = false, these messages are not sent until the write buffer is filled. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits.
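A minimal sketch of this pattern against the pre-1.0 HTable client follows; the table name, column family, row-key scheme, and buffer size are illustrative assumptions, not recommendations:

// Sketch: batch Puts through the client-side write buffer (placeholder names and sizes).
Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "myTable");
htable.setAutoFlush(false);
htable.setWriteBufferSize(1024 * 1024 * 12);    // 12 MB buffer; tune for your payload
try {
  for (int i = 0; i < 100000; i++) {
    Put put = new Put(Bytes.toBytes(String.format("%016x", i)));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
    htable.put(put);                            // buffered client-side, not yet an RPC
  }
  htable.flushCommits();                        // send anything still sitting in the buffer
} finally {
  htable.close();                               // close() also invokes flushCommits()
}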
A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the RegionServer will not write the Put to the Write Ahead Log, only into the memstore; however, the consequence is that if there is a RegionServer failure there will be data loss. If writeToWAL(false) is used, do so with extreme caution. In practice, you may find that it makes little difference if your load is well distributed across the cluster.
In general, it is best to use the WAL for Puts, and where loading throughput is a concern to use bulk loading techniques instead.
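If you do decide to skip the WAL despite the caveat above, the flag is set per Put. The sketch below assumes the pre-0.94-era client API, where the method is setWriteToWAL (later clients replace it with a durability setting); the row key, family, and qualifier are placeholders:

// Sketch: disable the WAL for a single Put (data loss risk on RegionServer failure).
Put put = new Put(Bytes.toBytes("some-row-key"));            // placeholder row key
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
put.setWriteToWAL(false);   // memstore only; no HLog edit for this Put
htable.put(put);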
In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own version for those still on 0.90.x or earlier.
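A rough sketch of the idea, not the HTableUtil implementation itself: bucket an assumed List<Put> of pending Puts by the RegionServer hosting each row, then write and flush one group at a time. It assumes the 0.92+ HRegionLocation accessors (getHostname()/getPort()) and the java.util collection classes:

// Sketch only: group Puts by hosting RegionServer before writing them.
Map<String, List<Put>> putsByServer = new HashMap<String, List<Put>>();
for (Put put : puts) {                                        // 'puts' is the assumed List<Put>
  HRegionLocation location = htable.getRegionLocation(put.getRow());
  String server = location.getHostname() + ":" + location.getPort();
  List<Put> group = putsByServer.get(server);
  if (group == null) {
    group = new ArrayList<Put>();
    putsByServer.put(server, group);
  }
  group.add(put);
}
for (List<Put> group : putsByServer.values()) {
  htable.put(group);        // all Puts for one RegionServer enter the write buffer together
  htable.flushCommits();    // one flush per server's worth of Puts
}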
When writing a lot of data to an HBase table from an MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.
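As an illustration, a map-only job can be wired to write straight to the table. The sketch below uses TableMapReduceUtil with a null reducer class and zero reduce tasks; the job name, table name, and MyWritingMapper class (a mapper emitting ImmutableBytesWritable/Put pairs) are hypothetical:

// Sketch: map-only job writing Puts directly to HBase (placeholder names).
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "bulk-write-example");
job.setJarByClass(MyWritingMapper.class);                      // hypothetical mapper class
job.setMapperClass(MyWritingMapper.class);                     // emits (ImmutableBytesWritable, Put)
TableMapReduceUtil.initTableReducerJob("myTable", null, job);  // wires up TableOutputFormat
job.setNumReduceTasks(0);                                      // no Reducer: Puts go straight from the map
job.waitForCompletion(true);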
For summary jobs where HBase is used as a source and a sink, writes will come from the Reducer step (e.g., summarize values and then write out the result). This is a different processing problem than the above case.
If all your data is being written to one region at a time, then re-read the section on processing timeseries data.
Also, see Section 1.6.2, “Table Creation: Pre-Creating Regions”, as well as Section 1.4, “HBase Configurations”.