Below we list important configurations. We've divided this section into required configuration and worth-a-look recommended configurations.
Review Section 1.1.2, "Operating System" and Section 1.1.3, "Hadoop".
If a cluster has a lot of regions, it is possible that an eager-beaver RegionServer which checks in soon after Master start, while all the rest in the cluster are laggardly, will be assigned all of the regions. If there are lots of regions, this first server could buckle under the load. To prevent this scenario, raise hbase.master.wait.on.regionservers.mintostart from its default value of 1. See HBASE-6389 Modify the conditions to ensure that Master waits for sufficient number of Region Servers before starting region assignments for more detail.
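For example, a sketch of the hbase-site.xml entry (the value 3 is illustrative only; tune it to your cluster size):
<property>
  <name>hbase.master.wait.on.regionservers.mintostart</name>
  <value>3</value>
  <description>Illustrative: require at least 3 RegionServers to have checked
    in before the Master starts assigning regions.</description>
</property>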
If the primary Master loses its connection with ZooKeeper, it will fall into a loop where it keeps trying to reconnect. Disable this behavior if you are running more than one Master, i.e. a backup Master. Failing to do so, the dying Master may continue to receive RPCs even though another Master has assumed the role of primary. See the configuration fail.fast.expired.active.master.
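If you do run a backup Master, a sketch of the hbase-site.xml entry (this assumes the property takes a boolean and that true enables the fail-fast behavior; verify against your version's hbase-default.xml):
<property>
  <name>fail.fast.expired.active.master</name>
  <value>true</value>
  <description>Assumed example: have an expired active Master fail fast rather
    than loop on reconnect, since a backup Master is available.</description>
</property>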
The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery. You might like to tune the timeout down to a minute or even less so the Master notices failures sooner. Before changing this value, be sure you have your JVM garbage collection configuration under control; otherwise, a long garbage collection that lasts beyond the ZooKeeper session timeout will take out your RegionServer. (You might be fine with this -- you probably want recovery to start on the server if a RegionServer has been in GC for a long period of time.)
To change this configuration, edit hbase-site.xml, copy the changed file around the cluster, and restart.
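For example, to lower the ZooKeeper session timeout to one minute, a sketch (assuming the zookeeper.session.timeout property, in milliseconds, governs this timeout):
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
  <description>Example only: a one-minute session timeout so the Master notices
    a crashed RegionServer sooner. Get GC under control first.</description>
</property>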
We set this value high to save ourselves having to field noob questions up on the mailing lists asking why a RegionServer went down during a massive import. The usual cause is that their JVM is untuned and they are running into long GC pauses. Our thinking is that while users are getting familiar with HBase, we'd save them having to know all of its intricacies. Later, when they've built some confidence, then they can play with configurations such as this.
See ???.
This is the "...number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown" from the hdfs-default.xml description. If you have more than three or four disks, you might want to set this to 1, or, if you have many disks, to two or more.
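A sketch of the hdfs-site.xml entry (assuming the dfs.datanode.failed.volumes.tolerated property described above; the value is illustrative):
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
  <description>Illustrative: tolerate one failed volume before the datanode
    shuts down; raise to 2 or more on machines with many disks.</description>
</property>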
This setting defines the number of threads that are kept open to answer incoming requests to user tables. The rule of thumb is to keep this number low when the payload per request approaches the MB (big puts, scans using a large cache) and high when the payload is small (gets, small puts, ICVs, deletes). The total size of the queries in progress is limited by the setting "ipc.server.max.callqueue.size".
It is safe to set that number to the maximum number of incoming clients if their payload is small, the typical example being a cluster that serves a website since puts aren't typically buffered and most of the operations are gets.
The reason why it is dangerous to keep this setting high is that the aggregate size of all the puts that are currently happening in a region server may impose too much pressure on its memory, or even trigger an OutOfMemoryError. A region server running on low memory will trigger its JVM's garbage collector to run more frequently up to a point where GC pauses become noticeable (the reason being that all the memory used to keep all the requests' payloads cannot be trashed, no matter how hard the garbage collector tries). After some time, the overall cluster throughput is affected since every request that hits that region server will take longer, which exacerbates the problem even more.
You can get a sense of whether you have too few or too many handlers by ??? on an individual RegionServer and then tailing its logs (queued requests consume memory).
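As a sketch, raising the handler count for a small-payload, get-heavy workload might look like the following (assuming the hbase.regionserver.handler.count property; the value is illustrative, not a recommendation):
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
  <description>Illustrative: more handlers for small gets/puts; keep this low
    when individual request payloads approach the MB range.</description>
</property>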
HBase ships with a reasonable, conservative configuration that will work on nearly all machine types that people might want to test with. If you have larger machines -- where HBase has an 8GB or larger heap -- you might find the following configuration options helpful. TODO.
You should consider enabling ColumnFamily compression. There are several options that are near-frictionless and in almost all cases boost performance by reducing the size of StoreFiles and thus reducing I/O.
See ??? for more information.
HBase uses ??? to recover the memstore data that has not been flushed to disk in case of a RegionServer failure. These WAL files should be configured to be slightly smaller than the HDFS block size (by default, an HDFS block is 64MB and a WAL file is ~60MB).
HBase also has a limit on the number of WAL files, designed to ensure there's never too much data that needs to be replayed during recovery. This limit needs to be set according to memstore configuration, so that all the necessary data would fit. It is recommended to allocate enough WAL files to store at least that much data (when all memstores are close to full). For example, with a 16GB RegionServer heap, default memstore settings (0.4), and default WAL file size (~60MB), 16384MB * 0.4 / 60MB ≈ 109, so the starting point for the WAL file count is ~109. However, as all memstores are not expected to be full all the time, fewer WAL files can be allocated.
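As a sketch, the WAL count limit is commonly governed by hbase.regionserver.maxlogs (an assumption about the property name; check your version's hbase-default.xml). Applying the arithmetic above for a 16GB heap:
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>100</value>
  <description>Illustrative: ~109 WALs would cover all memstores being full
    (16384MB * 0.4 / 60MB); a somewhat lower value is usually enough since
    memstores are rarely all full at once.</description>
</property>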
Rather than let HBase auto-split your regions, manage the splitting manually [11]. With growing amounts of data, splits will continually be needed. Since you always know exactly what regions you have, long-term debugging and profiling are much easier with manual splits. It is hard to trace the logs to understand region-level problems if regions keep splitting and getting renamed. Data offlining bugs + unknown number of split regions == oh crap! If an HLog or StoreFile was mistakenly unprocessed by HBase due to a weird bug and you notice it a day or so later, you can be assured that the regions specified in these files are the same as the current regions, and you have fewer headaches trying to restore/replay your data. You can finely tune your compaction algorithm. With roughly uniform data growth, it's easy to cause split / compaction storms as the regions all roughly hit the same data size at the same time. With manual splits, you can let staggered, time-based major compactions spread out your network IO load.
How do I turn off automatic splitting? Automatic splitting is determined by the configuration value hbase.hregion.max.filesize. It is not recommended that you set this to Long.MAX_VALUE in case you forget about manual splits. A suggested setting is 100GB, which would result in > 1hr major compactions if reached.
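A sketch of the 100GB setting in hbase-site.xml (the value is in bytes):
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
  <description>100GB: regions are effectively split only by manual/rolling
    splits, while still being bounded if a split is forgotten.</description>
</property>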
What's the optimal number of pre-split regions to create? Mileage will vary depending upon your application. You could start low with 10 pre-split regions per server and watch as data grows over time. It's better to err on the side of too few regions and rolling split later. A more complicated answer is that this depends upon the largest StoreFile in your region. With a growing data size, this will get larger over time. You want the largest region to be just big enough that the Store compaction selection algorithm only compacts it due to a timed major compaction. If you don't, your cluster can be prone to compaction storms as the algorithm decides to run major compactions on a large series of regions all at once. Note that compaction storms are due to the uniform data growth, not the manual split decision.
If you pre-split your regions too thin, you can increase the major compaction interval by configuring HConstants.MAJOR_COMPACTION_PERIOD. If your data size grows too large, use the (post-0.90.0 HBase) org.apache.hadoop.hbase.util.RegionSplitter script to perform a network-IO-safe rolling split of all regions.
A common administrative technique is to manage major compactions manually, rather than letting HBase do it. By default, HConstants.MAJOR_COMPACTION_PERIOD is one day, and major compactions may kick in when you least desire them - especially on a busy system. To turn off automatic major compactions, set the value to 0.
It is important to stress that major compactions are absolutely necessary for StoreFile cleanup; the only variable is when they occur. They can be administered through the HBase shell, or via HBaseAdmin.
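A sketch of disabling automatic major compactions in hbase-site.xml (assuming hbase.hregion.majorcompaction is the property behind HConstants.MAJOR_COMPACTION_PERIOD, in milliseconds; a larger non-zero value simply lengthens the interval instead):
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
  <description>0 turns off time-based major compactions; trigger them yourself
    via the shell or HBaseAdmin, e.g. from cron during off-peak hours.</description>
</property>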
For more information about compactions and the compaction file selection process, see ???.
Speculative Execution of MapReduce tasks is on by default, and for HBase clusters it is generally advised to turn off Speculative Execution at a system level unless you need it for a specific case, where it can be configured per-job. Set the properties mapreduce.map.speculative and mapreduce.reduce.speculative to false.
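A sketch of the corresponding entries (in mapred-site.xml or per-job configuration; older MapReduce versions use mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution instead):
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>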
The balancer is a periodic operation run on the Master to redistribute regions on the cluster. It is configured via hbase.balancer.period and defaults to 300000 (5 minutes).
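A sketch of adjusting the period in hbase-site.xml (the ten-minute value below is illustrative only):
<property>
  <name>hbase.balancer.period</name>
  <value>600000</value>
  <description>Illustrative: run the balancer every 10 minutes instead of the
    default 5.</description>
</property>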
See ??? for more information on the LoadBalancer.
Do not turn off the block cache (you'd do it by setting hbase.block.cache.size to zero). Currently we do not do well if you do this because the RegionServer will spend all its time loading hfile indices over and over again. If your working set is such that the block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea of the size you need by surveying RegionServer UIs; you'll see index block size accounted near the top of the webpage).
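If you do tune it, a sketch (the value is the fraction of heap given to the block cache; the figure below is illustrative, not a recommendation):
<property>
  <name>hbase.block.cache.size</name>
  <value>0.25</value>
  <description>Illustrative: keep a non-zero block cache at least large enough
    that hfile indices stay resident; never set this to 0.</description>
</property>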
If a big 40ms or so occasional delay is seen in operations against HBase, try the Nagle's setting. For example, see the user mailing list thread, Inconsistent scan performance with caching set to 1, and the issue cited therein where setting tcpnodelay improved scan speeds. You might also see the graphs at the tail of HBASE-7008 Set scanner caching to a better default, where our Lars Hofhansl tries various data sizes with Nagle's on and off, measuring the effect.
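If you want to experiment with this, a sketch of the client-side entry (assuming the hbase.ipc.client.tcpnodelay property; verify the name, default, and any server-side counterpart against your version's hbase-default.xml):
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
  <description>Assumed example: disable Nagle's algorithm on HBase RPC client
    sockets to avoid ~40ms delays on small requests.</description>
</property>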
This section is about configurations that will make servers come back faster after a failure. See the Devaraj Das and Nicolas Liochon blog post Introduction to HBase Mean Time to Recover (MTTR) for a brief introduction.
The issue HBASE-8354 forces Namenode into loop with lease recovery requests is messy but has a bunch of good discussion toward the end on low timeouts and how to effect faster recovery, including citation of fixes added to HDFS. Read the Varun Sharma comments. The below suggested configurations are Varun's suggestions distilled and tested. Make sure you are running on a late-version HDFS so you have the fixes he refers to and himself added to HDFS that help HBase MTTR (e.g. HDFS-3703, HDFS-3712, and HDFS-4791 -- hadoop 2 for sure has them and late hadoop 1 has some). Set the following in the RegionServer.
<property>
  <name>hbase.lease.recovery.dfs.timeout</name>
  <value>23000</value>
  <description>How much time we allow to elapse between calls to recover lease.
    Should be larger than the dfs timeout.</description>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 60 to 10 seconds.</description>
</property>
And on the NameNode/DataNode side, set the following to enable 'staleness' introduced in HDFS-3703 and HDFS-3912.
<property>
  <name>dfs.client.socket-timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 60 to 10 seconds.</description>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 8 * 60 to 10 seconds.</description>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>3000</value>
  <description>Down from 60 seconds to 3.</description>
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>2</value>
  <description>Down from 45 seconds to 3 (2 == 3 retries).</description>
</property>
<property>
  <name>dfs.namenode.avoid.read.stale.datanode</name>
  <value>true</value>
  <description>Enable stale state in hdfs</description>
</property>
<property>
  <name>dfs.namenode.stale.datanode.interval</name>
  <value>20000</value>
  <description>Down from default 30 seconds</description>
</property>
<property>
  <name>dfs.namenode.avoid.write.stale.datanode</name>
  <value>true</value>
  <description>Enable stale state in hdfs</description>
</property>
[11] What follows is taken from the javadoc at the head of the org.apache.hadoop.hbase.util.RegionSplitter tool added to HBase post-0.90.0 release.