Chapter 1. HBase Operational Management

Table of Contents

1.1. HBase Tools and Utilities
1.1.1. HBase hbck
1.1.2. HFile Tool
1.1.3. WAL Tools
1.1.4. Compression Tool
1.1.5. CopyTable
1.1.6. Export
1.1.7. Import
1.1.8. WALPlayer
1.1.9. RowCounter
1.2. Region Management
1.2.1. Major Compaction
1.2.2. Merge
1.3. Node Management
1.3.1. Node Decommission
1.3.2. Rolling Restart
1.4. Metrics
1.4.1. Metric Setup
1.4.2. RegionServer Metrics
1.5. HBase Monitoring
1.5.1. Slow Query Log
1.6. Cluster Replication
1.7. HBase Backup
1.7.1. Full Shutdown Backup
1.7.2. Live Cluster Backup - Replication
1.7.3. Live Cluster Backup - CopyTable
1.7.4. Live Cluster Backup - Export
1.8. Capacity Planning
1.8.1. Storage
1.8.2. Regions

This chapter covers the operational tools and practices required to run an HBase cluster. Operations is related to the topics of ???, ???, and ??? but is a distinct topic in itself.

1.1. HBase Tools and Utilities

Here we list HBase tools for administration, analysis, fixup, and debugging.

1.1.1. HBase hbck

An fsck for your HBase install.

To run hbck against your HBase cluster, run:

$ ./bin/hbase hbck

At the end of its output, the command prints OK or INCONSISTENCY. If your cluster reports inconsistencies, pass -details to see more detail. Also run hbck a few more times, because an inconsistency may be transient (e.g. the cluster is starting up or a region is splitting). Passing -fix may correct the inconsistency (this is an experimental feature).
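
Because an inconsistency may be transient, the rerun advice above can be automated with a small retry wrapper. The following is a minimal sketch and not part of HBase: the function name hbck_with_retries and its arguments are hypothetical, and it assumes the report contains OK or a word beginning with INCONSISTEN, as described above.

```shell
# Hypothetical helper, not shipped with HBase.
# Usage: hbck_with_retries ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as one report looks clean, 1 if every attempt
# still reports an inconsistency (or shows no recognizable status).
hbck_with_retries() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    report=$("$@" 2>&1) || true
    case $report in
      *INCONSISTEN*) ;;        # inconsistent: fall through and retry
      *OK*) return 0 ;;        # clean report
    esac
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"           # give a transient condition time to clear
    fi
  done
  return 1
}
```

For example, hbck_with_retries 3 60 ./bin/hbase hbck would run hbck up to three times, one minute apart, and only report failure if the inconsistency persists.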

1.1.2. HFile Tool

See ???.

1.1.3. WAL Tools

1.1.3.1. HLog tool

The main method on HLog offers manual split and dump facilities. Pass it WALs or the product of a split, the content of the recovered.edits directory.

You can get a textual dump of a WAL file content by doing the following:

 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012 

The return code will be non-zero if there are issues with the file, so you can test the integrity of a file by redirecting STDOUT to /dev/null and testing the program's return code.
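
The return-code check described above might be wrapped as follows. This is a sketch only: the function name wal_is_readable and the HBASE_BIN variable are hypothetical, with HBASE_BIN defaulting to the launcher path used in this chapter.

```shell
# Hypothetical helper: succeeds only if the HLog dump of the given WAL
# exits with status 0. STDOUT is discarded; errors stay visible.
# HBASE_BIN is an assumed variable naming the hbase launcher.
wal_is_readable() {
  "${HBASE_BIN:-./bin/hbase}" org.apache.hadoop.hbase.regionserver.wal.HLog \
    --dump "$1" > /dev/null
}
```

A caller can then branch on the result, for instance running wal_is_readable on a WAL path and printing a warning when it returns non-zero.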

Similarly you can force a split of a log file directory by doing:

 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/

1.1.4. Compression Tool

See Section 1.1.4, “Compression Tool”.

1.1.5. CopyTable

CopyTable is a utility that can copy part or all of a table, either to the same cluster or to another cluster. The usage is as follows:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename

Options:

  • starttime Beginning of the time range. Without endtime means starttime to forever.
  • endtime End of the time range. Ignored if no starttime is specified.
  • versions Number of cell versions to copy.
  • new.name New table's name.
  • peer.adr Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
  • families Comma-separated list of ColumnFamilies to copy.
  • all.cells Also copy delete markers and uncollected deleted cells (advanced option).

Args:

  • tablename Name of table to copy.

Example of copying a one-hour window of 'TestTable' to a cluster that uses replication:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1265875194289 --endtime=1265878794289 \
  --peer.adr=server1,server2,server3:2181:/hbase TestTable
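
The two timestamps in this example are milliseconds since the epoch and differ by exactly one hour. A minimal sketch of deriving the end of such a window from its start (the variable names are illustrative only):

```shell
# Build the one-hour window used in the example above.
# Timestamps are in milliseconds since the epoch.
start_ms=1265875194289
end_ms=$((start_ms + 60 * 60 * 1000))   # one hour later
echo "--starttime=$start_ms --endtime=$end_ms"
```

This prints --starttime=1265875194289 --endtime=1265878794289, matching the options passed to CopyTable above.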

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

1.1.6. Export

Export is a utility that will dump the contents of a table to HDFS in a sequence file. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

1.1.7. Import

Import is a utility that will load data that has been exported back into HBase. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

1.1.8. WALPlayer

WALPlayer is a utility to replay WAL files into HBase.

The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables.

WALPlayer can also generate HFiles for later bulk importing; in that case, only a single table and no mapping can be specified.

Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]

For example:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2

1.1.9. RowCounter

RowCounter is a utility that will count all the rows of a table. It is a good sanity check that HBase can read all the blocks of a table if there are any concerns about metadata inconsistency.

$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename> [<column1> <column2>...]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

1.2. Region Management

1.2.1. Major Compaction

Major compactions can be requested via the HBase shell or HBaseAdmin.majorCompact.

Note: major compactions do NOT do region merges. See ??? for more information about compactions.

1.2.2. Merge

Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).

$ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>

If you feel you have too many regions and want to consolidate them, Merge is the utility you need. Merge must be run when the cluster is down. See the O'Reilly HBase Book for an example of usage.

Additionally, there is a Ruby script attached to HBASE-1621 for region merging.

1.3. Node Management

1.3.1. Node Decommission

You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:

$ ./bin/hbase-daemon.sh stop regionserver

The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice that the RegionServer is gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.

Disable the Load Balancer before Decommissioning a node

If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer. Avoid any problems by disabling the balancer first. See Load Balancer below.

A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices that the RegionServer's znode is gone. HBase 0.90.2 added a facility for having a node gradually shed its load and then shut itself down: the graceful_stop.sh script. Here is its usage:

$ ./bin/graceful_stop.sh 
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop

To decommission a loaded RegionServer, run the following:

$ ./bin/graceful_stop.sh HOSTNAME

where HOSTNAME is the host carrying the RegionServer you want to decommission.

On HOSTNAME

The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the master UI for how HBase is referring to servers. It's usually the hostname but can also be the FQDN. Whatever HBase is using, this is what you should pass to the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.

The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify that the region is deployed in the new location before it moves the next region, and so on until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will notice that the RegionServer is gone, but all regions will have already been redeployed and, because the RegionServer went down cleanly, there will be no WAL logs to split.

Load Balancer

It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:

hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds

This turns the balancer OFF. To reenable, do:

hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds

1.3.2. Rolling Restart

You can also ask the graceful_stop.sh script to restart a RegionServer after the shutdown AND move its old regions back into place. You might do the latter to retain data locality. A primitive rolling restart might be effected by running something like the following:

$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
            

Tail the output of /tmp/log.txt to follow the script's progress. The above restarts RegionServers only, so update the master separately, before you run the above script. Be sure to disable the load balancer first. Here is a pseudo-script for how you might craft a rolling restart script:

  1. Untar your release, make sure of its configuration and then rsync it across the cluster. If this is 0.90.2, patch it with HBASE-3744 and HBASE-3756.

  2. Run hbck to ensure the cluster is consistent:

    $ ./bin/hbase hbck

    Effect repairs if inconsistent.

  3. Restart the Master:

    $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master

  4. Disable the region balancer:

    $ echo "balance_switch false" | ./bin/hbase shell

  5. Run the graceful_stop.sh script per RegionServer. For example:

    $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
                

    If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (See usage for graceful_stop.sh script).

  6. Restart the Master again. This will clear out the dead servers list and reenable the balancer.

  7. Run hbck to ensure the cluster is consistent.
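
Steps 2 to 7 above might be collected into one script along the following lines. This is a sketch under stated assumptions, not a tested procedure: the names run, rolling_restart, and DRY_RUN are hypothetical, the commands are the ones shown in the steps, and the script is expected to run from the HBase install directory. With DRY_RUN=1 each command is printed rather than executed, which allows the sequence to be reviewed first.

```shell
# Sketch of steps 2 to 7 above; run from the HBase install directory.
# Hypothetical names: run, rolling_restart, and the DRY_RUN variable.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

rolling_restart() {
  servers_file=$1                        # e.g. conf/regionservers
  run ./bin/hbase hbck                   # step 2: review the report
  run ./bin/hbase-daemon.sh stop master  # step 3
  run ./bin/hbase-daemon.sh start master
  run sh -c 'echo "balance_switch false" | ./bin/hbase shell'   # step 4
  for host in $(sort "$servers_file"); do                       # step 5
    run ./bin/graceful_stop.sh --restart --reload --debug "$host" || return 1
  done
  run ./bin/hbase-daemon.sh stop master  # step 6
  run ./bin/hbase-daemon.sh start master
  run ./bin/hbase hbck                   # step 7
}
```

The sketch does not inspect the hbck report itself; an operator would still need to read it and effect repairs before continuing past step 2.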

1.4. Metrics

1.4.1. Metric Setup

See Metrics for an introduction and how to enable Metrics emission.

1.4.2. RegionServer Metrics

1.4.2.1. hbase.regionserver.blockCacheCount

Block cache item count in memory. This is the number of blocks of StoreFiles (HFiles) in the cache.

1.4.2.2. hbase.regionserver.blockCacheEvictedCount

Number of blocks that had to be evicted from the block cache due to heap size constraints.

1.4.2.3. hbase.regionserver.blockCacheFree

Block cache memory available (bytes).

1.4.2.4. hbase.regionserver.blockCacheHitCachingRatio

Block cache hit caching ratio (0 to 100). The cache-hit ratio for reads configured to look in the cache (i.e., cacheBlocks=true).

1.4.2.5. hbase.regionserver.blockCacheHitCount

Number of blocks of StoreFiles (HFiles) read from the cache.

1.4.2.6. hbase.regionserver.blockCacheHitRatio

Block cache hit ratio (0 to 100). Includes all read requests, although those with cacheBlocks=false will always read from disk and be counted as a "cache miss".

1.4.2.7. hbase.regionserver.blockCacheMissCount

Number of blocks of StoreFiles (HFiles) requested but not read from the cache.
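
The hit and miss counters can be related to the hit ratio with simple arithmetic. The sketch below assumes the ratio is hits divided by total requests (hits plus misses), expressed as a truncated percentage; that formula and the function name block_cache_hit_ratio are assumptions made for illustration, not taken from the HBase source.

```shell
# Percentage of block requests served from the cache, from the two counters.
# Assumption: ratio = hits / (hits + misses), truncated to an integer.
block_cache_hit_ratio() {
  hits=$1
  misses=$2
  total=$((hits + misses))
  if [ "$total" -eq 0 ]; then
    echo 0                      # no requests yet
  else
    echo $((100 * hits / total))
  fi
}
```

For instance, 75 hits and 25 misses give a ratio of 75.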

1.4.2.8. hbase.regionserver.blockCacheSize

Block cache size in memory (bytes), i.e., the memory in use by the BlockCache.

1.4.2.9. hbase.regionserver.compactionQueueSize

Size of the compaction queue. This is the number of Stores in the RegionServer that have been targeted for compaction.

1.4.2.10. hbase.regionserver.flushQueueSize

Number of enqueued regions in the MemStore awaiting flush.

1.4.2.11. hbase.regionserver.fsReadLatency_avg_time

Filesystem read latency (ms). This is the average time to read from HDFS.

1.4.2.12. hbase.regionserver.fsReadLatency_num_ops

Filesystem read operations.

1.4.2.13. hbase.regionserver.fsSyncLatency_avg_time

Filesystem sync latency (ms). Latency to sync the write-ahead log records to the filesystem.

1.4.2.14. hbase.regionserver.fsSyncLatency_num_ops

Number of operations to sync the write-ahead log records to the filesystem.

1.4.2.15. hbase.regionserver.fsWriteLatency_avg_time

Filesystem write latency (ms). Total latency for all writers, including StoreFiles and write-ahead log.

1.4.2.16. hbase.regionserver.fsWriteLatency_num_ops

Number of filesystem write operations, including StoreFiles and write-ahead log.

1.4.2.17. hbase.regionserver.memstoreSizeMB

Sum of all the memstore sizes in this RegionServer (MB).

1.4.2.18. hbase.regionserver.regions

Number of regions served by the RegionServer

1.4.2.19. hbase.regionserver.requests

Total number of read and write requests. Requests correspond to RegionServer RPC calls, thus a single Get will result in 1 request, but a Scan with caching set to 1000 will result in 1 request for each 'next' call (i.e., not each row). A bulk-load request will constitute 1 request per HFile.

1.4.2.20. hbase.regionserver.storeFileIndexSizeMB

Sum of all the StoreFile index sizes in this RegionServer (MB).

1.4.2.21. hbase.regionserver.stores

Number of Stores open on the RegionServer. A Store corresponds to a ColumnFamily. For example, if a table (which contains the column family) has 3 regions on a RegionServer, there will be 3 stores open for that column family.

1.4.2.22. hbase.regionserver.storeFiles

Number of StoreFiles open on the RegionServer. A store may have more than one StoreFile (HFile).

1.5. HBase Monitoring

TODO

1.5.1. Slow Query Log

The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output. The thresholds for "too long to run" and "too much output" are configurable, as described below. The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events. It is also prepended with identifying tags (responseTooSlow), (responseTooLarge), (operationTooSlow), and (operationTooLarge) in order to enable easy filtering with grep, in case the user desires to see only slow queries.

1.5.1.1. Configuration

There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.

  • hbase.ipc.warn.response.time Maximum number of milliseconds that a query can be run without being logged. Defaults to 10000, or 10 seconds. Can be set to -1 to disable logging by time.
  • hbase.ipc.warn.response.size Maximum byte size of response that a query can return without being logged. Defaults to 100 megabytes. Can be set to -1 to disable logging by size.

1.5.1.2. Metrics

The slow query log exposes two metrics to JMX.

  • hadoop.regionserver_rpc_slowResponse A global metric reflecting the durations of all responses that triggered logging.
  • hadoop.regionserver_rpc_methodName.aboveOneSec A metric reflecting the durations of all responses that lasted for more than one second.

1.5.1.3. Output

The output is tagged with (operationTooSlow) if the call was a client operation, such as a Put, Get, or Delete, for which detailed fingerprint information is exposed. Otherwise, it is tagged (responseTooSlow) and still produces parseable JSON output, but with less verbose information, covering only the duration and size of the RPC itself. TooLarge is substituted for TooSlow if the response size triggered the logging; TooLarge appears even when both size and duration triggered logging.
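
Since every entry carries one of the four tags, the slow query entries can be pulled out of a RegionServer log with a single pattern. A minimal sketch, in which the function name slow_query_lines is hypothetical:

```shell
# Keep only slow query log entries, reading RegionServer log lines on stdin.
# The four tags are the ones named in this section.
slow_query_lines() {
  grep -E '\((operation|response)Too(Slow|Large)\)'
}
```

It would be fed a RegionServer log on standard input, and its output can be narrowed further, for example to (operationTooLarge) entries only.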

1.5.1.4. Example

2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}

Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port. Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations. In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.

This particular example would indicate that the likely cause of slowness is simply a very large multiPut (about 44 MB in total), as we can tell from the "vlen", or value length, fields of each put in the multiPut.
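
The total payload of a logged operation can be checked by adding up its "vlen" fields. The following sketch uses a hypothetical function name, total_vlen, and relies on grep -o, which GNU and BSD grep provide but POSIX does not require.

```shell
# Sum the "vlen" fields of slow query log lines read on stdin.
# Relies on grep -o (available in GNU and BSD grep).
total_vlen() {
  grep -o '"vlen":[0-9]*' | cut -d: -f2 |
    awk '{ s += $1 } END { printf "%d\n", s }'
}
```

Applied to the example entry above, the four values add up to 44325244 bytes.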

1.6. Cluster Replication

See Cluster Replication.

1.7. HBase Backup

There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has pros and cons.

For additional information, see HBase Backup Options over on the Sematext Blog.

1.7.1. Full Shutdown Backup

Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used in a back-end analytic capacity and not serving front-end web pages. The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata. The obvious con is that the cluster is down. The steps include:

1.7.1.1. Stop HBase

1.7.1.2. Distcp

Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster or to a different cluster.

Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files. Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.

1.7.1.3. Restore (if needed)

The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp. The act of copying these files creates new HDFS metadata. For that reason, this kind of restore does not require restoring the NameNode edits from the time of the HBase backup: it is a restore (via distcp) of a specific HDFS directory (i.e., the HBase part), not of the entire HDFS file system.

1.7.2. Live Cluster Backup - Replication

This approach assumes that there is a second cluster. See the HBase page on replication for more information.

1.7.3. Live Cluster Backup - CopyTable

The Section 1.1.5, “CopyTable” utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.

Since the cluster is up, there is a risk that edits could be missed in the copy process.

1.7.4. Live Cluster Backup - Export

The Section 1.1.6, “Export” approach dumps the content of a table to HDFS on the same cluster. To restore the data, the Section 1.1.7, “Import” utility would be used.

Since the cluster is up, there is a risk that edits could be missed in the export process.

1.8. Capacity Planning

1.8.1. Storage

A common question for HBase administrators is estimating how much storage will be required for an HBase cluster. There are several aspects to consider, the most important of which is what data will be loaded into the cluster. Start with a solid understanding of how HBase handles data internally (KeyValue).

1.8.1.1. KeyValue

HBase storage will be dominated by KeyValues. See ??? and ??? for how HBase stores data internally.

It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other factor.
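
To make the effect of these lengths concrete, the size of a single KeyValue can be approximated. The sketch below assumes the commonly described serialized layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type, plus the row key, family name, qualifier, and value). It ignores compression and block or index overhead, and the function name keyvalue_bytes is hypothetical.

```shell
# Approximate serialized size of one KeyValue, in bytes.
# Assumed layout: key length (4) + value length (4) + row length (2)
# + family length (1) + timestamp (8) + key type (1) = 20 fixed bytes,
# plus the row key, family name, qualifier, and value themselves.
keyvalue_bytes() {
  row_len=$1
  family_len=$2
  qualifier_len=$3
  value_len=$4
  echo $((20 + row_len + family_len + qualifier_len + value_len))
}
```

With a 16-byte rowkey, a 1-byte family name, a 4-byte qualifier, and a 100-byte value, this gives 141 bytes, of which 41 are key overhead repeated for every attribute in the row.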

1.8.1.2. StoreFiles and Blocks

KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis. Blocks are aggregated into StoreFiles. See ???.

1.8.1.3. HDFS Block Replication

Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
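
As a rough illustration of that factor, the raw HDFS footprint is the logical data size multiplied by the replication factor. The sketch below assumes a factor of 3 (the usual HDFS default for dfs.replication) when none is given; the function name hdfs_raw_bytes is hypothetical, and the estimate ignores compression and temporary space used during compactions.

```shell
# Raw HDFS bytes needed for a given logical data size.
# Second argument is the replication factor; 3 is assumed when omitted.
hdfs_raw_bytes() {
  logical_bytes=$1
  replication=${2:-3}
  echo $((logical_bytes * replication))
}
```

For example, 141000000 logical bytes (one million KeyValues of 141 bytes each) would occupy 423000000 bytes at a replication factor of 3.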

1.8.2. Regions

Another common question for HBase administrators is determining the right number of regions per RegionServer. This affects both storage and hardware planning. See ???.