Revision History
Revision 0.92.2
HBase version

Abstract

This is the official book of Apache HBase, a distributed, versioned, column-oriented database built on top of Apache Hadoop and Apache ZooKeeper.


Table of Contents

Preface
1. Getting Started
1.1. Introduction
1.2. Quick Start
1.2.1. Download and unpack the latest stable release
1.2.2. Start HBase
1.2.3. Shell Exercises
1.2.4. Stopping HBase
1.2.5. Where to go next
2. Configuration
2.1. Java
2.2. Operating System
2.2.1. ssh
2.2.2. DNS
2.2.3. NTP
2.2.4. ulimit and nproc
2.2.5. Windows
2.3. Hadoop
2.3.1. Hadoop Security
2.3.2. dfs.datanode.max.xcievers
2.4. HBase run modes: Standalone and Distributed
2.4.1. Standalone HBase
2.4.2. Distributed
2.4.3. Running and Confirming Your Installation
2.5. ZooKeeper
2.5.1. Using existing ZooKeeper ensemble
2.6. Configuration Files
2.6.1. hbase-site.xml and hbase-default.xml
2.6.2. hbase-env.sh
2.6.3. log4j.properties
2.6.4. Client configuration and dependencies for connecting to an HBase cluster
2.7. Example Configurations
2.7.1. Basic Distributed HBase Install
2.8. The Important Configurations
2.8.1. Required Configurations
2.8.2. Recommended Configurations
2.8.3. Other Configurations
2.9. Bloom Filter Configuration
2.9.1. io.hfile.bloom.enabled global kill switch
2.9.2. io.hfile.bloom.error.rate
2.9.3. io.hfile.bloom.max.fold
3. Upgrading
3.1. Upgrading to HBase 0.90.x from 0.20.x or 0.89.x
4. The HBase Shell
4.1. Scripting
4.2. Shell Tricks
4.2.1. irbrc
4.2.2. LOG data to timestamp
4.2.3. Debug
5. Data Model
5.1. Conceptual View
5.2. Physical View
5.3. Table
5.4. Row
5.5. Column Family
5.6. Cells
5.7. Data Model Operations
5.7.1. Get
5.7.2. Put
5.7.3. Scans
5.7.4. Delete
5.8. Versions
5.8.1. Versions and HBase Operations
5.8.2. Current Limitations
6. HBase and Schema Design
6.1. Schema Creation
6.2. On the number of column families
6.2.1. Cardinality of ColumnFamilies
6.3. Rowkey Design
6.3.1. Monotonically Increasing Row Keys/Timeseries Data
6.3.2. Try to minimize row and column sizes
6.3.3. Reverse Timestamps
6.3.4. Rowkeys and ColumnFamilies
6.3.5. Immutability of Rowkeys
6.4. Number of Versions
6.4.1. Maximum Number of Versions
6.4.2. Minimum Number of Versions
6.5. Supported Datatypes
6.5.1. Counters
6.6. Time To Live (TTL)
6.7. Keeping Deleted Cells
6.8. Secondary Indexes and Alternate Query Paths
6.8.1. Filter Query
6.8.2. Periodic-Update Secondary Index
6.8.3. Dual-Write Secondary Index
6.8.4. Summary Tables
6.8.5. Coprocessor Secondary Index
6.9. Schema Design Smackdown
6.9.1. Rows vs. Versions
6.9.2. Rows vs. Columns
6.10. Operational and Performance Configuration Options
7. HBase and MapReduce
7.1. Map-Task Splitting
7.1.1. The Default HBase MapReduce Splitter
7.1.2. Custom Splitters
7.2. HBase MapReduce Examples
7.2.1. HBase MapReduce Read Example
7.2.2. HBase MapReduce Read/Write Example
7.2.3. HBase MapReduce Read/Write Example With Multi-Table Output
7.2.4. HBase MapReduce Summary Example
7.2.5. HBase MapReduce Summary to File Example
7.3. Accessing Other HBase Tables in a MapReduce Job
7.4. Speculative Execution
8. Architecture
8.1. Catalog Tables
8.1.1. ROOT
8.1.2. META
8.1.3. Startup Sequencing
8.2. Client
8.2.1. Connections
8.2.2. WriteBuffer and Batch Methods
8.2.3. External Clients
8.3. Client Filters
8.3.1. Structural
8.3.2. Column Value
8.3.3. Column Value Comparators
8.3.4. KeyValue Metadata
8.3.5. RowKey
8.3.6. Utility
8.4. Master
8.4.1. Startup Behavior
8.4.2. Interface
8.4.3. Processes
8.5. RegionServer
8.5.1. Interface
8.5.2. Processes
8.5.3. Block Cache
8.5.4. Write Ahead Log (WAL)
8.6. Regions
8.6.1. Region Size
8.6.2. Region Splits
8.6.3. Region Load Balancing
8.6.4. Store
8.6.5. Bloom Filters
8.7. HDFS
8.7.1. NameNode
8.7.2. DataNode
9. External APIs
9.1. Non-Java Languages Talking to the JVM
9.2. REST
9.3. Thrift
9.3.1. Filter Language
10. Performance Tuning
10.1. Operating System
10.1.1. Memory
10.1.2. 64-bit
10.1.3. Swapping
10.2. Network
10.2.1. Single Switch
10.2.2. Multiple Switches
10.2.3. Multiple Racks
10.3. Java
10.3.1. The Garbage Collector and HBase
10.4. HBase Configurations
10.4.1. Number of Regions
10.4.2. Managing Compactions
10.4.3. hbase.regionserver.handler.count
10.4.4. hfile.block.cache.size
10.4.5. hbase.regionserver.global.memstore.upperLimit
10.4.6. hbase.regionserver.global.memstore.lowerLimit
10.4.7. hbase.hstore.blockingStoreFiles
10.4.8. hbase.hregion.memstore.block.multiplier
10.5. Schema Design
10.5.1. Number of Column Families
10.5.2. Key and Attribute Lengths
10.5.3. Table RegionSize
10.5.4. Bloom Filters
10.5.5. ColumnFamily BlockSize
10.5.6. In-Memory ColumnFamilies
10.5.7. Compression
10.6. Writing to HBase
10.6.1. Batch Loading
10.6.2. Table Creation: Pre-Creating Regions
10.6.3. Table Creation: Deferred Log Flush
10.6.4. HBase Client: AutoFlush
10.6.5. HBase Client: Turn off WAL on Puts
10.6.6. HBase Client: Group Puts by RegionServer
10.6.7. MapReduce: Skip The Reducer
10.6.8. Anti-Pattern: One Hot Region
10.7. Reading from HBase
10.7.1. Scan Caching
10.7.2. Scan Attribute Selection
10.7.3. Close ResultScanners
10.7.4. Block Cache
10.7.5. Optimal Loading of Row Keys
10.7.6. Concurrency: Monitor Data Spread
10.8. Deleting from HBase
10.8.1. Using HBase Tables as Queues
10.8.2. Delete RPC Behavior
10.9. HDFS
10.9.1. Current Issues With Low-Latency Reads
10.9.2. Performance Comparisons of HBase vs. HDFS
10.10. Amazon EC2
11. Troubleshooting and Debugging HBase
11.1. General Guidelines
11.2. Logs
11.2.1. Log Locations
11.2.2. Log Levels
11.2.3. JVM Garbage Collection Logs
11.3. Tools
11.3.1. Builtin Tools
11.3.2. External Tools
11.4. Client
11.4.1. ScannerTimeoutException or UnknownScannerException
11.4.2. Shell or client application throws lots of scary exceptions during normal operation
11.4.3. Long Client Pauses With Compression
11.4.4. ZooKeeper Client Connection Errors
11.4.5. Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])
11.5. NameNode
11.5.1. HDFS Utilization of Tables and Regions
11.5.2. Browsing HDFS for HBase Objects
11.6. RegionServer
11.6.1. Startup Errors
11.6.2. Runtime Errors
11.6.3. Shutdown Errors
11.7. Master
11.7.1. Startup Errors
11.7.2. Shutdown Errors
11.8. ZooKeeper
11.8.1. Startup Errors
11.8.2. ZooKeeper, The Cluster Canary
11.9. Amazon EC2
11.9.1. ZooKeeper does not seem to work on Amazon EC2
11.9.2. Instability on Amazon EC2
11.9.3. Remote Java Connection into EC2 Cluster Not Working
12. HBase Operational Management
12.1. HBase Tools and Utilities
12.1.1. HBase hbck
12.1.2. HFile Tool
12.1.3. WAL Tools
12.1.4. Compression Tool
12.1.5. CopyTable
12.1.6. Export
12.1.7. Import
12.1.8. RowCounter
12.2. Node Management
12.2.1. Node Decommission
12.2.2. Rolling Restart
12.3. Metrics
12.3.1. Metric Setup
12.3.2. RegionServer Metrics
12.4. HBase Monitoring
12.5. Cluster Replication
12.6. HBase Backup
12.6.1. Full Shutdown Backup
12.6.2. Live Cluster Backup - Replication
12.6.3. Live Cluster Backup - CopyTable
12.6.4. Live Cluster Backup - Export
12.7. Capacity Planning
12.7.1. Storage
12.7.2. Regions
13. Building and Developing HBase
13.1. HBase Repositories
13.1.1. SVN
13.1.2. Git
13.2. IDEs
13.2.1. Eclipse
13.3. Building HBase
13.3.1. Building in snappy compression support
13.3.2. Adding an HBase release to Apache's Maven Repository
13.4. Maven Build Commands
13.4.1. Compile
13.4.2. Run all Unit Tests
13.4.3. Run a Single Unit Test
13.4.4. Run a Few Unit Tests
13.4.5. Run all Unit Tests for a Package
13.4.6. Integration Tests
13.5. Getting Involved
13.5.1. Mailing Lists
13.5.2. Jira
13.6. Developing
13.6.1. Codelines
13.6.2. Unit Tests
13.7. Submitting Patches
13.7.1. Create Patch
13.7.2. Patch File Naming
13.7.3. Unit Tests
13.7.4. Attach Patch to Jira
13.7.5. Common Patch Feedback
13.7.6. ReviewBoard
13.7.7. Committing Patches
A. Compression In HBase
A.1. CompressionTest Tool
A.2. hbase.regionserver.codecs
A.3. LZO
A.4. GZIP
A.5. SNAPPY
B. FAQ
C. YCSB: The Yahoo! Cloud Serving Benchmark and HBase
D. HFile format version 2
D.1. Motivation
D.2. HFile format version 1 overview
D.2.1. Block index format in version 1
D.3. HBase file format with inline blocks (version 2)
D.3.1. Overview
D.3.2. Unified version 2 block format
D.3.3. Block index in version 2
D.3.4. Root block index format in version 2
D.3.5. Non-root block index format in version 2
D.3.6. Bloom filters in version 2
D.3.7. File Info format in versions 1 and 2
D.3.8. Fixed file trailer format differences between versions 1 and 2
E. HBase and the Apache Software Foundation
Index

List of Tables

5.1. Table webtable
5.2. ColumnFamily anchor
5.3. ColumnFamily contents