Overview

HBase includes several methods of loading data into tables. The most straightforward method is to either use the TableOutputFormat class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.

This document describes HBase's bulk load functionality. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster.

Bulk Load Architecture

The HBase bulk load process consists of two main steps.

Preparing data via a MapReduce job

The first step of a bulk load is to generate HBase data files (HFiles) from a MapReduce job using HFileOutputFormat. This output format writes data in HBase's internal storage format so that it can later be loaded into the cluster very efficiently.

To function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. To achieve this, jobs use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
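The partitioning rule described above can be illustrated with a small, self-contained sketch. This is not the TotalOrderPartitioner implementation: the class name RegionPartitionSketch, its helper methods, and the example split points are hypothetical. The sketch assumes that row keys are ordered by lexicographic comparison of unsigned bytes, and that the split points are the start keys of every region except the first.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: maps a row key to the index of the key range (region)
// that contains it, given the sorted split points between regions.
class RegionPartitionSketch {

    // Helper for the example: turns a string into a UTF-8 byte[] row key.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison treating each byte as unsigned (assumed ordering).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the number of split points that are <= rowKey, which is the
    // index of the disjoint range the key falls into. splitPoints must be sorted.
    static int partitionFor(byte[][] splitPoints, byte[] rowKey) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareBytes(splitPoints[mid], rowKey) <= 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Hypothetical table with three regions: [start, "g"), ["g", "p"), ["p", end).
        byte[][] splits = { key("g"), key("p") };
        for (String row : new String[] { "a", "g", "m", "z" }) {
            System.out.println(row + " -> partition " + partitionFor(splits, key(row)));
        }
    }
}
```

Because every reducer receives exactly one such range, each reducer's output covers the key range of a single region, which is the property the bulk load relies on.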

HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.

Completing the data load

After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using a command line tool. This tool iterates through the prepared data files and determines, for each one, the region the file belongs to. It then contacts the appropriate Region Server, which adopts the HFile, moving it into its storage directory and making the data available to clients.

If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the bulk load command line utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
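The splitting behavior can be pictured with the following sketch, which groups the sorted row keys of one prepared file by the region that currently owns them. It is an illustration under simplifying assumptions, not the utility's actual code: the class HFileSplitSketch and its method are hypothetical, row keys are modeled as Java strings (compared by String ordering rather than raw bytes), and the first region's start key is modeled as the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: decides which region each row of a prepared file
// belongs to, and splits the file's rows into one piece per region.
class HFileSplitSketch {

    // regionStartKeys: current start key of every region; must include ""
    // for the first region. sortedRowKeys: the row keys stored in one file.
    // Returns a map from region start key to the rows that region owns.
    static Map<String, List<String>> splitByRegion(TreeSet<String> regionStartKeys,
                                                   List<String> sortedRowKeys) {
        Map<String, List<String>> pieces = new TreeMap<>();
        for (String row : sortedRowKeys) {
            // The owning region is the one with the greatest start key <= row.
            String owner = regionStartKeys.floor(row);
            if (owner == null) {
                throw new IllegalArgumentException("no region covers row " + row);
            }
            pieces.computeIfAbsent(owner, k -> new ArrayList<>()).add(row);
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<String> fileRows = List.of("a", "k", "n", "z");

        // Boundaries at preparation time: a single region, so no split is needed.
        TreeSet<String> before = new TreeSet<>(List.of(""));
        System.out.println(splitByRegion(before, fileRows));

        // The region has since split at "m": the file must be cut in two.
        TreeSet<String> after = new TreeSet<>(List.of("", "m"));
        System.out.println(splitByRegion(after, fileRows));
    }
}
```

When the result contains a single entry, the file can be handed to one Region Server as-is; more than one entry corresponds to the less efficient case in which the file has to be rewritten as several pieces.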

Preparing a bulk load using the importtsv tool

HBase ships with a command line tool called importtsv. This tool is available by running hadoop jar /path/to/hbase-VERSION.jar importtsv. Running this tool with no arguments prints brief usage information:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key.

In order to prepare data for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
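
The rules for the -Dimporttsv.columns value stated in the usage text can be sketched as follows. This is not ImportTsv's own parser: the class ColumnSpecSketch and its methods are hypothetical, and the column names in the example (d:c1, d:c2) are made up for illustration. It only shows the two stated rules: exactly one column must be HBASE_ROW_KEY, and every other column is either a bare column family or a columnfamily:qualifier pair.

```java
// Illustration only: validates and interprets a comma-separated column
// specification of the form accepted by -Dimporttsv.columns.
class ColumnSpecSketch {

    static final String ROW_KEY = "HBASE_ROW_KEY";

    // Returns the position of the row key column in the specification.
    // Throws if the specification names zero or more than one row key column.
    static int rowKeyIndex(String columnsSpec) {
        String[] columns = columnsSpec.split(",");
        int index = -1;
        for (int i = 0; i < columns.length; i++) {
            if (ROW_KEY.equals(columns[i])) {
                if (index != -1) {
                    throw new IllegalArgumentException("more than one " + ROW_KEY + " column");
                }
                index = i;
            }
        }
        if (index == -1) {
            throw new IllegalArgumentException("no " + ROW_KEY + " column");
        }
        return index;
    }

    // Splits "family:qualifier" into {family, qualifier}. A bare family name
    // yields an empty qualifier.
    static String[] familyAndQualifier(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            return new String[] { column, "" };
        }
        return new String[] { column.substring(0, colon), column.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,d:c1,d:c2"; // hypothetical example value
        System.out.println("row key column index: " + rowKeyIndex(spec));
        String[] parts = familyAndQualifier("d:c1");
        System.out.println("family=" + parts[0] + " qualifier=" + parts[1]);
    }
}
```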

Importing the prepared data using the completebulkload tool

After a data import has been prepared using the importtsv tool, the completebulkload tool is used to import the data into the running cluster.

The completebulkload tool takes two arguments: the output path where importtsv wrote its results, and the table name. For example:

$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable

This tool runs quickly. Once it finishes, the new data is visible in the cluster.

Advanced Usage

Although the importtsv tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats. To get started, read through ImportTsv.java and check the Javadoc for HFileOutputFormat.

The import step of the bulk load can also be done programmatically. See the LoadIncrementalHFiles class for more information.