Chapter 15. Troubleshooting and Debugging HBase

Table of Contents

15.1. General Guidelines
15.2. Logs
15.2.1. Log Locations
15.3. Tools
15.3.1. search-hadoop.com
15.3.2. tail
15.3.3. top
15.3.4. jps
15.3.5. jstack
15.3.6. OpenTSDB
15.3.7. clusterssh+top
15.4. Client
15.4.1. ScannerTimeoutException
15.5. RegionServer
15.5.1. Startup Errors
15.5.2. Runtime Errors
15.5.3. Shutdown Errors
15.6. Master
15.6.1. Startup Errors
15.6.2. Shutdown Errors

15.1. General Guidelines

Always start with the master log (TODO: Which lines?). Normally it just prints the same lines over and over; if it doesn't, there's an issue. Google or search-hadoop.com should return some hits for the exceptions you're seeing.
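For instance, a quick way to pull recent problems out of the master log is to grep for the usual log levels and exception markers. This is only a sketch; the exact path and file name depend on your installation, the user running HBase, and the hostname:

    # Assuming the default log layout under $HBASE_HOME/logs; adjust the
    # user and hostname parts of the file name for your installation.
    tail -n 200 $HBASE_HOME/logs/hbase-<user>-master-<hostname>.log

    # Pull out warnings, errors, and exceptions to see where trouble starts:
    grep -E "WARN|ERROR|FATAL|Exception" \
      $HBASE_HOME/logs/hbase-<user>-master-<hostname>.log | less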

An error rarely comes alone in HBase; usually when something goes wrong, hundreds of exceptions and stack traces will follow, coming from all over the place. The best way to approach this type of problem is to walk the log back to where it all began. For example, one trick with RegionServers is that they print some metrics when aborting, so grepping for Dump should get you close to the start of the problem.
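As a sketch (again assuming the default log layout), you can locate the abort dump and then read the log upward from there:

    # Find where the RegionServer started dumping metrics on abort;
    # the lines just above this point usually show the root cause.
    grep -n "Dump" $HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log

    # Or open the log near the end and search backward:
    #   less +G <logfile>, then type ?Dump and press Enter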

RegionServer suicides are “normal”, as this is what they do when something goes wrong. For example, if ulimit and xcievers (the two most important initial settings, see Section 1.3.1.6, “ulimit and nproc”) aren’t changed, at some point the DataNodes will be unable to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database was suddenly unable to access files on your local file system; it’s the same with HBase and HDFS. Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3-part blog post by Todd Lipcon and Section 13.1.1.1, “Long GC pauses” above.
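As a quick sanity check, you can verify both limits from the shell before chasing anything else. This sketch assumes the HDFS configuration lives under $HADOOP_CONF_DIR and that dfs.datanode.max.xcievers has been set there:

    # Run as the user that starts HBase/HDFS:
    ulimit -n   # open file descriptors
    ulimit -u   # max user processes (nproc)

    # Confirm the xciever limit was actually raised in hdfs-site.xml
    # (note the historical misspelling of the property name):
    grep -A1 "dfs.datanode.max.xcievers" $HADOOP_CONF_DIR/hdfs-site.xml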