Chapter 1. Performance Tuning

Table of Contents

1.1. Operating System
1.1.1. Memory
1.1.2. 64-bit
1.1.3. Swapping
1.2. Network
1.2.1. Single Switch
1.2.2. Multiple Switches
1.2.3. Multiple Racks
1.3. Java
1.3.1. The Garbage Collector and HBase
1.4. HBase Configurations
1.4.1. Number of Regions
1.4.2. Managing Compactions
1.4.3. hbase.regionserver.handler.count
1.4.4. hfile.block.cache.size
1.4.5. hbase.regionserver.global.memstore.upperLimit
1.4.6. hbase.regionserver.global.memstore.lowerLimit
1.4.7. hbase.hstore.blockingStoreFiles
1.4.8. hbase.hregion.memstore.block.multiplier
1.5. Schema Design
1.5.1. Number of Column Families
1.5.2. Key and Attribute Lengths
1.5.3. Table RegionSize
1.5.4. Bloom Filters
1.5.5. ColumnFamily BlockSize
1.5.6. In-Memory ColumnFamilies
1.5.7. Compression
1.6. Writing to HBase
1.6.1. Batch Loading
1.6.2. Table Creation: Pre-Creating Regions
1.6.3. Table Creation: Deferred Log Flush
1.6.4. HBase Client: AutoFlush
1.6.5. HBase Client: Turn off WAL on Puts
1.6.6. HBase Client: Group Puts by RegionServer
1.6.7. MapReduce: Skip The Reducer
1.6.8. Anti-Pattern: One Hot Region
1.7. Reading from HBase
1.7.1. Scan Caching
1.7.2. Scan Attribute Selection
1.7.3. Close ResultScanners
1.7.4. Block Cache
1.7.5. Optimal Loading of Row Keys
1.7.6. Concurrency: Monitor Data Spread
1.8. Deleting from HBase
1.8.1. Using HBase Tables as Queues
1.8.2. Delete RPC Behavior
1.9. HDFS
1.9.1. Current Issues With Low-Latency Reads
1.9.2. Performance Comparisons of HBase vs. HDFS
1.10. Amazon EC2

1.1. Operating System

1.1.1. Memory

RAM, RAM, RAM. Don't starve HBase.

1.1.2. 64-bit

Use a 64-bit platform (and 64-bit JVM).

1.1.3. Swapping

Watch out for swapping. Set swappiness to 0.

1.2. Network

Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase performance is the switching hardware that is used. Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more).

Important items to consider:

  • Switching capacity of the device
  • Number of systems connected
  • Uplink capacity

1.2.1. Single Switch

The single most important factor in this configuration is that the switching capacity of the hardware is capable of handling the traffic which can be generated by all systems connected to the switch. Some lower-priced commodity hardware has less switching capacity than a fully populated switch would require.

1.2.2. Multiple Switches

Multiple switches are a potential pitfall in the architecture. The most common configuration of lower-priced hardware is a simple 1Gbps uplink from one switch to another. This often-overlooked pinch point can easily become a bottleneck for cluster communication. Especially with MapReduce jobs that are both reading and writing a lot of data, the communication across this uplink can easily be saturated.

Mitigation of this issue is fairly simple and can be accomplished in multiple ways:

  • Use appropriate hardware for the scale of the cluster which you're attempting to build.
  • Use larger single-switch configurations, i.e., a single 48-port switch as opposed to two 24-port switches.
  • Configure port trunking for uplinks to utilize multiple interfaces and increase cross-switch bandwidth.

1.2.3. Multiple Racks

Multiple rack configurations carry the same potential issues as multiple switches, and can suffer performance degradation from two main areas:

  • Poor switch capacity performance
  • Insufficient uplink to another rack

If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing more of your cluster across racks. The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks. The downside of this method, however, is the overhead of ports that could otherwise be used. For example, creating an 8Gbps port channel from rack A to rack B uses 8 of your 24 ports to communicate between racks, which gives you a poor ROI; using too few ports, however, can mean you're not getting the most out of your cluster.

Using 10GbE links between racks will greatly increase performance, and assuming your switches support a 10GbE uplink or allow for an expansion card, this lets you save your ports for machines as opposed to uplinks.

1.3. Java

1.3.1. The Garbage Collector and HBase

1.3.1.1. Long GC pauses

In his presentation, Avoiding Full GCs with MemStore-Local Allocation Buffers, Todd Lipcon describes two cases of stop-the-world garbage collections common in HBase, especially during loading: CMS failure modes and old generation heap fragmentation brought on by the CMS collector. To address the first, start the CMS earlier than default by adding -XX:CMSInitiatingOccupancyFraction and setting it down from the default. Start at 60 or 70 percent (the lower you bring down the threshold, the more GCing is done, and the more CPU is used). To address the second, fragmentation, issue, Todd added an experimental facility that must be explicitly enabled in HBase 0.90.x (it is on by default in 0.92.x HBase): set hbase.hregion.memstore.mslab.enabled to true in your Configuration. See the cited slides for background and detail[1].

For more information about GC logs, see ???.

1.4. HBase Configurations

See ???.

1.4.1. Number of Regions

The number of regions for an HBase table is driven by the ???. Also, see the architecture section on ???

A lower number of regions is preferred, generally in the range of 20 to low-hundreds per RegionServer. Adjust the regionsize as appropriate to achieve this number.

For the 0.90.x codebase, the upper bound of regionsize is about 4Gb. For the 0.92.x codebase, due to the HFile v2 change, much larger regionsizes can be supported (e.g., 20Gb).

You may need to experiment with this setting based on your hardware configuration and application needs.

1.4.2. Managing Compactions

For larger systems, managing compactions and splits may be something you want to consider.

1.4.3. hbase.regionserver.handler.count

See ???. This setting in essence sets how many requests are concurrently being processed inside the RegionServer at any one time. If set too high, then throughput may suffer as the concurrent requests contend; if set too low, requests will be stuck waiting to get into the machine. You can get a sense of whether you have too little or too many handlers by ??? on an individual RegionServer then tailing its logs (Queued requests consume memory).

1.4.4. hfile.block.cache.size

See ???. A memory setting for the RegionServer process.

1.4.5. hbase.regionserver.global.memstore.upperLimit

See ???. This memory setting is often adjusted for the RegionServer process depending on needs.

1.4.6. hbase.regionserver.global.memstore.lowerLimit

See ???. This memory setting is often adjusted for the RegionServer process depending on needs.

1.4.7. hbase.hstore.blockingStoreFiles

See ???. If there is blocking in the RegionServer logs, increasing this can help.

1.4.8. hbase.hregion.memstore.block.multiplier

See ???. If there is enough RAM, increasing this can help.

1.5. Schema Design

1.5.1. Number of Column Families

See ???.

1.5.2. Key and Attribute Lengths

See ???.

1.5.3. Table RegionSize

The regionsize can be set on a per-table basis via setMaxFileSize on HTableDescriptor in the event where certain tables require different regionsizes than the configured default regionsize.
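
A minimal sketch at table-creation time (the table and family names and the 1 GB figure are only examples, and admin is assumed to be an existing HBaseAdmin instance):

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
desc.setMaxFileSize(1024L * 1024 * 1024);  // split this table's regions at roughly 1 GB
admin.createTable(desc);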

See Section 1.4.1, “Number of Regions” for more information.

1.5.4. Bloom Filters

Bloom Filters can be enabled per-ColumnFamily. Use HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per Column Family. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.
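
A minimal sketch, assuming the StoreFile.BloomType enum from the 0.92.x client API ("cf" is a hypothetical family name):

HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
cfDesc.setBloomFilterType(StoreFile.BloomType.ROW);  // or ROWCOL; NONE is the default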

See HColumnDescriptor and ??? for more information.

1.5.5. ColumnFamily BlockSize

The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).
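
For example, raising the blocksize for a family that stores large cells ("cf" and the 128k figure are only examples):

HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
cfDesc.setBlocksize(128 * 1024);  // 128k blocks instead of the 64k default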

See HColumnDescriptor and ??? for more information.

1.5.6. In-Memory ColumnFamilies

ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the ???, but it is not a guarantee that the entire table will be in memory.
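
For example ("cf" is a hypothetical family name):

HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
cfDesc.setInMemory(true);  // blocks from this family get the highest block-cache priority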

See HColumnDescriptor for more information.

1.5.7. Compression

Production systems should use compression with their ColumnFamily definitions. See ??? for more information.
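
For example, enabling GZIP compression on a family ("cf" is hypothetical; codecs such as LZO require native libraries to be installed separately):

HColumnDescriptor cfDesc = new HColumnDescriptor("cf");
cfDesc.setCompressionType(Compression.Algorithm.GZ);  // or another available codec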

1.6. Writing to HBase

1.6.1. Batch Loading

Use the bulk load tool if you can. See Bulk Loads. Otherwise, pay attention to the below.

1.6.2. Table Creation: Pre-Creating Regions

Tables in HBase are initially created with one region by default. For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster. A useful pattern to speed up the bulk import process is to pre-create empty regions. Be somewhat conservative in this, because too many regions can actually degrade performance. An example of pre-creation using hex keys is as follows (note: this example may need to be tweaked for the individual application's keys):

public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;  
  }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}
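
A hypothetical call to the helpers above, pre-creating ten regions over an even hex keyspace (the table name, family name, and key range are only illustrative, and admin is assumed to be an existing HBaseAdmin):

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
byte[][] splits = getHexSplits("00000000", "ffffffff", 10);  // 9 split points => 10 regions
createTable(admin, desc, splits);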

1.6.3. Table Creation: Deferred Log Flush

The default behavior for Puts using the Write Ahead Log (WAL) is that HLog edits will be written immediately. If deferred log flush is used, WAL edits are kept in memory until the flush period. The benefit is aggregated and asynchronous HLog writes, but the potential downside is that if the RegionServer goes down the yet-to-be-flushed edits are lost. This is safer, however, than not using WAL at all with Puts.

Deferred log flush can be configured on tables via HTableDescriptor. The default value of hbase.regionserver.optionallogflushinterval is 1000ms.
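
A minimal sketch at table-creation time, assuming the HTableDescriptor.setDeferredLogFlush method (the table and family names are hypothetical, and admin is an existing HBaseAdmin):

HTableDescriptor desc = new HTableDescriptor("myTable");
desc.addFamily(new HColumnDescriptor("cf"));
desc.setDeferredLogFlush(true);  // WAL edits are flushed on the configured interval, not per edit
admin.createTable(desc);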

1.6.4. HBase Client: AutoFlush

When performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. Puts added via htable.put(Put) and htable.put(List<Put>) wind up in the same write buffer. If autoFlush = false, these messages are not sent until the write-buffer is filled. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits.
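
A minimal sketch (the table name, column names, and row count are only examples; conf is an existing HBaseConfiguration):

HTable htable = new HTable(conf, "myTable");
htable.setAutoFlush(false);
for (int i = 0; i < 10000; i++) {
  Put put = new Put(Bytes.toBytes("row-" + i));
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
  htable.put(put);       // buffered client-side while autoFlush is off
}
htable.flushCommits();   // send any remaining buffered Puts
htable.close();          // close() also invokes flushCommits()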

1.6.5. HBase Client: Turn off WAL on Puts

A frequently discussed option for increasing throughput on Puts is to call writeToWAL(false). Turning this off means that the RegionServer will not write the Put to the Write Ahead Log, only into the memstore. HOWEVER, the consequence is that if there is a RegionServer failure there will be data loss. If writeToWAL(false) is used, do so with extreme caution. You may find in actuality that it makes little difference if your load is well distributed across the cluster.
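
A minimal sketch ("cf", "q", and the row key are hypothetical; htable is an existing HTable instance):

Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
put.setWriteToWAL(false);  // memstore only; this edit is lost if the RegionServer fails
htable.put(put);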

In general, it is best to use WAL for Puts, and where loading throughput is a concern to use bulk loading techniques instead.

1.6.6. HBase Client: Group Puts by RegionServer

In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility HTableUtil currently on TRUNK that does this, but you can either copy that or implement your own version if you are still on 0.90.x or earlier.

1.6.7. MapReduce: Skip The Reducer

When writing a lot of data to an HBase table from a MR job (e.g., with TableOutputFormat), and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.
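
A hedged sketch of a mapper-only read/write job, following the pattern used by the HBase MapReduce utilities ("sourceTable", "targetTable", and MyMapper are hypothetical; MyMapper would extend TableMapper and emit Puts):

Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "hbase-mapper-only-example");
Scan scan = new Scan();
scan.setCaching(500);        // see Section 1.7.1
scan.setCacheBlocks(false);  // see Section 1.7.4
TableMapReduceUtil.initTableMapperJob("sourceTable", scan, MyMapper.class, null, null, job);
TableMapReduceUtil.initTableReducerJob("targetTable", null, job);  // wires up TableOutputFormat
job.setNumReduceTasks(0);    // no Reducer: Puts emitted by the Mapper go straight to HBase
job.waitForCompletion(true);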

For summary jobs where HBase is used as a source and a sink, writes will be coming from the Reducer step (e.g., summarize values then write out the result). This is a different processing problem than the above case.

1.6.8. Anti-Pattern: One Hot Region

If all your data is being written to one region at a time, then re-read the section on processing timeseries data.

Also, see Section 1.6.2, “Table Creation: Pre-Creating Regions”, as well as Section 1.4, “HBase Configurations”.

1.7. Reading from HBase

1.7.1. Scan Caching

If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching set to something greater than the default (which is 1). Using the default value means that the map-task will make a call back to the RegionServer for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to having a larger cache value because it costs more memory for both client and RegionServer, so bigger isn't always better.
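
A minimal sketch (500 is only an example value; tune it to your row size and per-row processing time, and htable is an existing HTable instance):

Scan scan = new Scan();
scan.setCaching(500);  // ship 500 rows per trip to the RegionServer instead of the default 1
ResultScanner rs = htable.getScanner(scan);  // remember to close the scanner; see Section 1.7.3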

1.7.1.1. Scan Caching in MapReduce Jobs

Scan settings in MapReduce jobs deserve special attention. Timeouts (e.g., UnknownScannerException) can result in Map tasks if processing a batch of records takes too long before the client goes back to the RegionServer for the next set of data. This problem can occur because there is non-trivial processing occurring per row. If you process rows quickly, set caching higher. If you process rows more slowly (e.g., lots of transformations per row, writes), then set caching lower.

Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the processing that is often performed in MapReduce jobs tends to exacerbate this issue.

1.7.2. Scan Attribute Selection

Whenever a Scan is used to process large numbers of rows (and especially when used as a MapReduce source), be aware of which attributes are selected. If scan.addFamily is called then all of the attributes in the specified ColumnFamily will be returned to the client. If only a small number of the available attributes are to be processed, then only those attributes should be specified in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
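
For example (the family and qualifier names are hypothetical):

Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("colA"));  // return only cf:colA
// scan.addFamily(Bytes.toBytes("cf"));  // by contrast, this returns every column in "cf"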

1.7.3. Close ResultScanners

This isn't so much about improving performance but rather avoiding performance problems. If you forget to close ResultScanners you can cause problems on the RegionServers. Always have ResultScanner processing enclosed in a try/finally block...

Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}
htable.close();

1.7.4. Block Cache

Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks method. For input Scans to MapReduce jobs, this should be false. For frequently accessed rows, it is advisable to use the block cache.
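
For example, for a MapReduce input Scan:

Scan scan = new Scan();
scan.setCacheBlocks(false);  // avoid churning the RegionServer block cache with a full-table scan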

1.7.5. Optimal Loading of Row Keys

When performing a table scan where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a MUST_PASS_ALL operator to the scanner using setFilter. The filter list should include both a FirstKeyOnlyFilter and a KeyOnlyFilter. Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk and minimal network traffic to the client for a single row.
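
A minimal sketch of that filter combination:

Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new FirstKeyOnlyFilter());  // stop after the first KeyValue of each row
filters.addFilter(new KeyOnlyFilter());       // strip values, returning only the keys
scan.setFilter(filters);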

1.7.6. Concurrency: Monitor Data Spread

When performing a high number of concurrent reads, monitor the data spread of the target tables. If the target table(s) have too few regions then the reads could likely be served from too few nodes.

See Section 1.6.2, “Table Creation: Pre-Creating Regions”, as well as Section 1.4, “HBase Configurations”.

1.8. Deleting from HBase

1.8.1. Using HBase Tables as Queues

HBase tables are sometimes used as queues. In this case, special care must be taken to regularly perform major compactions on tables used in this manner. As is documented in ???, marking rows as deleted creates additional StoreFiles which then need to be processed on reads. Tombstones only get cleaned up with major compactions.

See also ??? and HBaseAdmin.majorCompact.
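
A minimal sketch of triggering a major compaction from client code ("myQueueTable" is a hypothetical name; conf is an existing HBaseConfiguration):

HBaseAdmin admin = new HBaseAdmin(conf);
admin.majorCompact("myQueueTable");  // asynchronous request; the RegionServers do the work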

1.8.2. Delete RPC Behavior

Be aware that htable.delete(Delete) doesn't use the writeBuffer. It will execute a RegionServer RPC with each invocation. For a large number of deletes, consider htable.delete(List).
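
A minimal sketch of batching deletes (the row keys are purely illustrative; htable is an existing HTable instance):

List<Delete> deletes = new ArrayList<Delete>();
for (int i = 0; i < 1000; i++) {
  deletes.add(new Delete(Bytes.toBytes("row-" + i)));
}
htable.delete(deletes);  // one client call instead of 1000 individual invocations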

See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29

1.9. HDFS

Because HBase runs on HDFS, it is important to understand how HDFS works and how it affects HBase.

1.9.1. Current Issues With Low-Latency Reads

The original use-case for HDFS was batch processing. As such, low-latency reads were historically not a priority. With the increased adoption of HBase this is changing, and several improvements are already in development. See the Umbrella Jira Ticket for HDFS Improvements for HBase.

1.9.2. Performance Comparisons of HBase vs. HDFS

A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as a MapReduce source or sink). The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this processing context. Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS will always be faster in this use-case.

1.10. Amazon EC2

Performance questions are common on Amazon EC2 environments because it is a shared environment. You will not see the same throughput as a dedicated server. In terms of running tests on EC2, run them several times for the same reason (i.e., it's a shared environment and you don't know what else is happening on the server).

If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front, because EC2 issues are practically a separate class of performance issues.



[1] The latest JVMs do better with regard to fragmentation, so make sure you are running a recent release. Read down in the message, Identifying concurrent mode failures caused by fragmentation.