GridMix

Overview

GridMix is a benchmark for Hadoop clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

There exist three versions of the GridMix tool. This document discusses the third (checked into src/contrib), distinct from the two checked into the src/benchmarks sub-directory. While the first two versions of the tool included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. In support of a broader range of deployments and finer-tuned job mixes, this version of the tool will attempt to model the resource profiles of production jobs to identify bottlenecks, guide development, and serve as a replacement for the existing GridMix benchmarks.

To run GridMix, you need a MapReduce job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see Rumen documentation). GridMix also requires input data from which the synthetic jobs will be reading bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If you are running on a new cluster, an optional step generating input data may precede the run.

In order to emulate the load of production jobs from a given cluster on the same or another cluster, follow these steps:

  1. Locate the job history files on the production cluster. This location is specified by the mapreduce.jobtracker.jobhistory.completed.location configuration property of the cluster.
  2. Run Rumen to build a job trace in JSON format for all or select jobs.
  3. (Optional) Use Rumen to fold this job trace to scale the load.
  4. Use GridMix with the job trace on the benchmark cluster.
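The steps above can be sketched as a sequence of commands. This is a hypothetical end-to-end run: the jar names, file paths, trace names, and the folding durations are all assumptions that depend on your installation and workload.

```shell
# 1-2. Build a JSON job trace from the production cluster's job history
#      (paths and jar name are hypothetical).
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json file:///tmp/job-topology.json \
  hdfs:///mapred/history/done

# 3. (Optional) Fold the trace to scale the load, e.g. down to a 1-hour run.
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h -input-cycle 20m \
  file:///tmp/job-trace.json file:///tmp/folded-trace.json

# 4. Run GridMix on the benchmark cluster with the (folded) trace,
#    generating 100 GiB of synthetic input data first.
hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 100g /gridmix/io file:///tmp/folded-trace.json
```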

Jobs submitted by GridMix have names of the form "GRIDMIXnnnnn", where "nnnnn" is a sequence number padded with leading zeroes.

Usage

Basic command-line usage without configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
      

Basic command-line usage with configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
  [-generate <size>] [-users <users-list>] <iopath> <trace>
      
Note
Configuration parameters such as -Dgridmix.client.submit.threads=10 and -Dgridmix.output.directory=foo must appear before the other GridMix arguments, as shown above.

The -generate option is used to generate input data for the synthetic jobs. It accepts size suffixes, e.g. 100g will generate 100 * 2^30 bytes.
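The size suffixes are binary, so 100g means 100 * 2^30 bytes. A quick shell check of that arithmetic (GridMix itself parses the suffix; this only verifies the math):

```shell
# 100g = 100 * 2^30 = 100 * 1024^3 bytes
echo $((100 * 1024 * 1024 * 1024))   # 107374182400
```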

The -users option is used to point to a users-list file (see Emulating Users and Queues).

The <iopath> parameter is the destination directory for generated output and/or the directory from which input data will be read. It can be on either the local file-system or HDFS, but it is highly recommended that it be on the same file-system as the original job mix, so that GridMix places the same load on the local file-system and HDFS as the original jobs did.

The <trace> parameter is a path to a job trace generated by Rumen. This trace can be compressed (it must be readable using one of the compression codecs supported by the cluster) or uncompressed. Use "-" as the value of this parameter if you want to pass an uncompressed trace via the standard input-stream of GridMix.
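For instance, a compressed trace can be decompressed on the fly and fed to GridMix through standard input by passing "-" as the trace argument. The jar name, trace file, and iopath below are assumptions:

```shell
# Stream an uncompressed trace to GridMix via stdin; "-" replaces <trace>.
gunzip -c /tmp/job-trace.json.gz | \
  hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
    /gridmix/io -
```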

The class org.apache.hadoop.mapred.gridmix.Gridmix can be found in the JAR contrib/gridmix/hadoop-$VERSION-gridmix.jar inside your Hadoop installation, where $VERSION corresponds to the version of Hadoop installed. A simple way of ensuring that this class and all its dependencies are loaded correctly is to use the hadoop wrapper script in Hadoop:

hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  [-generate <size>] [-users <users-list>] <iopath> <trace>
      

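As a concrete (hypothetical) example, the invocation below generates 10 GiB of input data under the iopath and then replays a Rumen trace; the Hadoop version in the jar name and the paths are assumptions:

```shell
# $VERSION is the installed Hadoop version; paths are illustrative only.
hadoop jar contrib/gridmix/hadoop-$VERSION-gridmix.jar \
  org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 10g /gridmix/io file:///tmp/job-trace.json
```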
The supported configuration parameters are explained in the following sections.

General Configuration Parameters

gridmix.output.directory
  The directory into which output will be written. If specified, iopath will be relative to this parameter. The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run. The default is "gridmix".
gridmix.client.submit.threads
  The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace. Splits are pre-generated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so you should raise this cautiously. The default is 1 for the SERIAL job-submission policy (see Job Submission Policies) and one more than the number of processors on the client machine for the other policies.
gridmix.submit.multiplier
  The multiplier to accelerate or decelerate the submission of jobs. The time separating two jobs is multiplied by this factor. The default value is 1.0. This is a crude mechanism to size a job trace to a cluster.
gridmix.client.pending.queue.depth
  The depth of the queue of job descriptions awaiting split generation. The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this. The default is 5.
gridmix.gen.blocksize
  The block-size of generated data. The default value is 256 MiB.
gridmix.gen.bytes.per.file
  The maximum bytes written per file. The default value is 1 GiB.
gridmix.min.file.size
  The minimum size of the input files. The default limit is 128 MiB. Tweak this parameter if you see an error-message like "Found no satisfactory file" while testing GridMix with a relatively-small input data-set.
gridmix.max.total.scan
  The maximum size of the input files. The default limit is 100 TiB.

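A hypothetical run that overrides several of the parameters above; recall that the -D options must precede the other GridMix arguments. The jar and trace paths are assumptions, and gridmix.gen.blocksize is assumed here to take a value in bytes:

```shell
hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.output.directory=gridmix-out \
  -Dgridmix.client.submit.threads=4 \
  -Dgridmix.submit.multiplier=0.5 \
  -generate 50g /gridmix/io file:///tmp/job-trace.json
```

Halving gridmix.submit.multiplier as above compresses the inter-job gaps in the trace, which is a crude way to fit a large production trace onto a smaller benchmark cluster in less wall-clock time.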
Job Types

GridMix takes as input a job trace, essentially a stream of JSON-encoded job descriptions. For each job description, the submission client obtains the original job submission time and for each task in that job, the byte and record counts read and written. Given this data, it constructs a synthetic job with the same byte and record patterns as recorded in the trace. It constructs jobs of two types:

LOADJOB
  A synthetic job that emulates the workload recorded in the Rumen trace. The current version emulates only the I/O workload, reproducing it on the benchmark cluster by embedding the detailed I/O information for every map and reduce task, such as the number of bytes and records read and written, in each job's input splits. The map tasks further relay the I/O patterns of the reduce tasks through the intermediate map output data.
SLEEPJOB
  A synthetic job in which each task does nothing but sleep for the duration observed in the production trace. The scalability of the Job Tracker is often limited by how many heartbeats it can handle every second. (Heartbeats are periodic messages sent from Task Trackers to update their status and grab new tasks from the Job Tracker.) Since a benchmark cluster is typically a fraction of the size of a production cluster, the heartbeat traffic generated by the slave nodes is well below the level of the production cluster. One possible solution is to run multiple Task Trackers on each slave node, but then the I/O workload generated by the synthetic jobs would thrash the slave nodes. Hence the need for a job type that generates no I/O.

The following configuration parameters affect the job type:

gridmix.job.type
  The value for this key can be one of LOADJOB or SLEEPJOB. The default value is LOADJOB.
gridmix.key.fraction
  For a LOADJOB type of job, the fraction of a record used for the data for the key. The default value is 0.1.
gridmix.sleep.maptask-only
  For a SLEEPJOB type of job, whether to ignore the reduce tasks for the job. The default is false.
gridmix.sleep.fake-locations
  For a SLEEPJOB type of job, the number of fake locations for map tasks for the job. The default is 0.
gridmix.sleep.max-map-time
  For a SLEEPJOB type of job, the maximum runtime for map tasks for the job in milliseconds. The default is unlimited.
gridmix.sleep.max-reduce-time
  For a SLEEPJOB type of job, the maximum runtime for reduce tasks for the job in milliseconds. The default is unlimited.

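For example, a hypothetical SLEEPJOB run that drops reduce tasks and caps map sleep times at one minute, so a trace containing very long tasks still finishes quickly (jar and trace paths are assumptions):

```shell
hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job.type=SLEEPJOB \
  -Dgridmix.sleep.maptask-only=true \
  -Dgridmix.sleep.max-map-time=60000 \
  /gridmix/io file:///tmp/job-trace.json
```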
Job Submission Policies

GridMix controls the rate of job submission. This control can be based on the trace information or can be based on statistics it gathers from the Job Tracker. Based on the submission policies users define, GridMix uses the respective algorithm to control the job submission. There are currently three types of policies:

STRESS
  Keep submitting jobs so that the cluster remains under stress. In this mode the rate of job submission is controlled by monitoring the real-time load of the cluster, so that a stable level of stress is maintained. Based on the gathered statistics, the cluster is classified as underloaded or overloaded. The cluster is considered underloaded if and only if all three of the following conditions hold:
    1. the number of pending and running jobs is under a threshold TJ
    2. the number of pending and running maps is under a threshold TM
    3. the number of pending and running reduces is under a threshold TR
  The thresholds TJ, TM and TR are proportional to the size of the cluster and to its map and reduce slot capacities, respectively. When the cluster is overloaded, job submission is throttled. In the actual calculation each running task is also weighted by its remaining work; for example, a 90% complete task counts as only 0.1 of a task. Finally, to prevent a very large job from blocking other jobs, the number of pending/waiting tasks each job can contribute is capped.
REPLAY
  In this mode the job trace is replayed faithfully, following exactly the time-intervals given in the actual job trace.
SERIAL
  In this mode the next job is submitted only after the previously submitted job has completed.

The following configuration parameters affect the job submission policy:

gridmix.job-submission.policy
  One of STRESS, REPLAY or SERIAL. In most cases the value will be STRESS or REPLAY. The default value is STRESS.
gridmix.throttle.jobs-to-tracker-ratio
  In STRESS mode, the minimum ratio of running jobs to Task Trackers in a cluster for the cluster to be considered overloaded. This is the threshold TJ referred to earlier. The default is 1.0.
gridmix.throttle.maps.task-to-slot-ratio
  In STRESS mode, the minimum ratio of pending and running map tasks (i.e. incomplete map tasks) to the number of map slots for a cluster for the cluster to be considered overloaded. This is the threshold TM referred to earlier. Running map tasks are counted partially. For example, a 40% complete map task is counted as 0.6 map tasks. The default is 2.0.
gridmix.throttle.reduces.task-to-slot-ratio
  In STRESS mode, the minimum ratio of pending and running reduce tasks (i.e. incomplete reduce tasks) to the number of reduce slots for a cluster for the cluster to be considered overloaded. This is the threshold TR referred to earlier. Running reduce tasks are counted partially. For example, a 30% complete reduce task is counted as 0.7 reduce tasks. The default is 2.5.
gridmix.throttle.maps.max-slot-share-per-job
  In STRESS mode, the maximum share of a cluster's map-slots capacity that can be counted toward a job's incomplete map tasks in overload calculation. The default is 0.1.
gridmix.throttle.reducess.max-slot-share-per-job
  In STRESS mode, the maximum share of a cluster's reduce-slots capacity that can be counted toward a job's incomplete reduce tasks in overload calculation. The default is 0.1.

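A hypothetical STRESS-mode run that tightens the overload thresholds, making the submitter back off sooner; the jar and trace paths are assumptions:

```shell
hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job-submission.policy=STRESS \
  -Dgridmix.throttle.jobs-to-tracker-ratio=0.5 \
  -Dgridmix.throttle.maps.task-to-slot-ratio=1.5 \
  -Dgridmix.throttle.reduces.task-to-slot-ratio=1.5 \
  /gridmix/io file:///tmp/job-trace.json
```

Lowering the ratios below their defaults (1.0, 2.0 and 2.5) means the cluster is declared overloaded with fewer outstanding jobs and tasks, so submission is throttled earlier.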
Emulating Users and Queues

Typical production clusters are shared by many users, and cluster capacity is divided among departments through job queues. Ensuring fairness among jobs from all users, honoring queue capacity-allocation policies, and preventing an ill-behaved job from taking over the cluster add significant complexity to the Hadoop software. To test these areas sufficiently and discover bugs in them, GridMix must emulate the contention among jobs from different users and/or submitted to different queues.

Emulating multiple queues is easy: we simply set up the benchmark cluster with the same queue configuration as the production cluster and configure the synthetic jobs so that they are submitted to the same queues as recorded in the trace. However, not all users appearing in the trace have accounts on the benchmark cluster. Instead, we set up a number of testing user accounts and associate each unique user in the trace with a testing user in a round-robin fashion.

The following configuration parameters affect the emulation of users and queues:

gridmix.job-submission.use-queue-in-trace
  When set to true, GridMix uses exactly the same set of queues as those mentioned in the trace. The default value is false.
gridmix.job-submission.default-queue
  Specifies the default queue to which all the jobs would be submitted. If this parameter is not specified, GridMix uses the default queue defined for the submitting user on the cluster.
gridmix.user.resolve.class
  Specifies which UserResolver implementation to use. We currently have three implementations:
    1. org.apache.hadoop.mapred.gridmix.EchoUserResolver - submits a job as the user who submitted the original job. All the users of the production cluster identified in the job trace must also have accounts on the benchmark cluster in this case.
    2. org.apache.hadoop.mapred.gridmix.SubmitterUserResolver - submits all the jobs as the current GridMix user. In this case we simply map all the users in the trace to the current GridMix user and submit the job.
    3. org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver - maps trace users to test users in a round-robin fashion. In this case we set up a number of testing user accounts and associate each unique user in the trace with a testing user in a round-robin fashion.
  The default is org.apache.hadoop.mapred.gridmix.SubmitterUserResolver.

If the parameter gridmix.user.resolve.class is set to org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver, we need to define a users-list file with a list of test users and groups. This is specified using the -users option to GridMix.

Note
Specifying a users-list file using the -users option is mandatory when using the round-robin user-resolver. Other user-resolvers ignore this option.

A users-list file has one user-group-information (UGI) per line, each UGI of the format:

      <username>,<group>[,group]*
      

For example:

      user1,group1
      user2,group2,group3
      user3,group3,group4
      

In the above example we have defined three users, user1, user2 and user3, with their respective groups. Each unique user in the trace is then associated with one of these users in round-robin fashion. For example, if the trace users are tuser1, tuser2, tuser3, tuser4 and tuser5, the mappings would be:

      tuser1 -> user1
      tuser2 -> user2
      tuser3 -> user3
      tuser4 -> user1
      tuser5 -> user2
      

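Putting this together, a hypothetical run using the round-robin user-resolver with a users-list like the one above (the jar, users-list, and trace paths are assumptions):

```shell
hadoop jar gridmix.jar org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
  -users file:///tmp/users-list.txt \
  /gridmix/io file:///tmp/job-trace.json
```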
Simplifying Assumptions

GridMix will be developed in stages, incorporating feedback and patches from the community. Currently its intent is to evaluate MapReduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and sub-project space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in GridMix:

  • CPU Usage - We have no data for per-task CPU usage, so we cannot even attempt an approximation. GridMix tasks are never CPU-bound independent of I/O, though this surely happens in practice.
  • Filesystem Properties - No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily-used parts of the system - the compression libraries, text processing, streaming, etc. - cannot be meaningfully tested with the current implementation.
  • I/O Rates - The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.
  • Memory Profile - No data on tasks' memory usage over time is available, though the max heap-size is retained.
  • Skew - The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.
  • Job Failure - User code is assumed to be correct.
  • Job Independence - The output or outcome of one job does not affect when or whether a subsequent job will run.

Appendix

Issues tracking the original implementations of GridMix1, GridMix2, and GridMix3 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking the current development of GridMix can be found by searching the same JIRA.