org.apache.cassandra.hadoop
Class ColumnFamilyOutputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.OutputFormat<byte[],java.util.List<IColumn>>
      extended by org.apache.cassandra.hadoop.ColumnFamilyOutputFormat

public class ColumnFamilyOutputFormat
extends org.apache.hadoop.mapreduce.OutputFormat<byte[],java.util.List<IColumn>>

The ColumnFamilyOutputFormat acts as a Hadoop-specific OutputFormat that allows reduce tasks to store keys (and corresponding values) as Cassandra rows (and respective columns) in a given ColumnFamily.

As is the case with the ColumnFamilyInputFormat, you need to set the Keyspace and ColumnFamily in your Hadoop job Configuration. The ConfigHelper class, through its ConfigHelper.setOutputColumnFamily(org.apache.hadoop.conf.Configuration, java.lang.String, java.lang.String) method, is provided to make this simple.
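A minimal job-setup sketch using that helper (the keyspace and column family names "Keyspace1" and "Standard1" are placeholders, and the exact `Job` construction API depends on your Hadoop version):

```java
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Configure a job to write its reduce output into Cassandra.
Job job = new Job(new Configuration(), "cassandra-output-example");
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

// Point the output format at the target keyspace and column family.
ConfigHelper.setOutputColumnFamily(job.getConfiguration(),
                                   "Keyspace1", "Standard1");
```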

For the sake of performance, this class employs a lazy write-back caching mechanism: its record writer batches the mutations it builds from the reduce's inputs in a task-specific map. When the writer is closed, it makes the changes official by sending a batch mutate request to Cassandra.
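As a sketch, a reducer feeding this output format emits a `byte[]` row key and a `java.util.List<IColumn>` of columns. The `Column` constructor shown (name, value, timestamp as `byte[]`/`long`) matches the 0.6-era API but is version-dependent; the reducer name, input types, and "count" column are illustrative assumptions:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.db.Column;
import org.apache.cassandra.db.IColumn;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums counts per word and stores the total as a
// single "count" column in the word's Cassandra row.
public class WordCountReducer
        extends Reducer<Text, IntWritable, byte[], List<IColumn>> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }

        // One column named "count" holding the total, timestamped now.
        List<IColumn> columns = Collections.<IColumn>singletonList(
                new Column("count".getBytes(),
                           String.valueOf(sum).getBytes(),
                           System.currentTimeMillis()));

        // The key bytes become the Cassandra row key; the record writer
        // caches these mutations and flushes them when it is closed.
        context.write(word.getBytes(), columns);
    }
}
```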


Nested Class Summary
 class ColumnFamilyOutputFormat.NullOutputCommitter
          An OutputCommitter that does nothing.
 
Field Summary
static java.lang.String BATCH_THRESHOLD
           
 
Constructor Summary
ColumnFamilyOutputFormat()
           
 
Method Summary
 void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext context)
          Check for validity of the output-specification for the job.
static Cassandra.Client createAuthenticatedClient(org.apache.thrift.transport.TSocket socket, org.apache.hadoop.mapreduce.JobContext context)
          Return a client based on the given socket that points to the configured keyspace, and is logged in with the configured credentials.
 org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Get the output committer for this output format.
 org.apache.hadoop.mapreduce.RecordWriter<byte[],java.util.List<IColumn>> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Get the RecordWriter for the given task.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

BATCH_THRESHOLD

public static final java.lang.String BATCH_THRESHOLD
See Also:
Constant Field Values
Constructor Detail

ColumnFamilyOutputFormat

public ColumnFamilyOutputFormat()
Method Detail

checkOutputSpecs

public void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext context)
Check for validity of the output-specification for the job.

Specified by:
checkOutputSpecs in class org.apache.hadoop.mapreduce.OutputFormat<byte[],java.util.List<IColumn>>
Parameters:
context - information about the job
Throws:
java.io.IOException - when output should not be attempted

getOutputCommitter

public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                               throws java.io.IOException,
                                                                      java.lang.InterruptedException
Get the output committer for this output format. This is responsible for ensuring the output is committed correctly.

This output format employs a lazy write-back caching mechanism, where the RecordWriter collects mutations in a task-specific cache and sends them to Cassandra in a batch mutate request when it is closed. Because the writer itself commits the changes, the ColumnFamilyOutputFormat.NullOutputCommitter returned here has nothing left to do.

Specified by:
getOutputCommitter in class org.apache.hadoop.mapreduce.OutputFormat<byte[],java.util.List<IColumn>>
Parameters:
context - the task context
Returns:
an output committer
Throws:
java.io.IOException
java.lang.InterruptedException

getRecordWriter

public org.apache.hadoop.mapreduce.RecordWriter<byte[],java.util.List<IColumn>> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                         throws java.io.IOException,
                                                                                                java.lang.InterruptedException
Get the RecordWriter for the given task.

As stated above, this RecordWriter merely batches mutations in its task-specific cache; no changes reach the Cassandra server until the writer is closed.

Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.OutputFormat<byte[],java.util.List<IColumn>>
Parameters:
context - the information about the current task.
Returns:
a RecordWriter to write the output for the job.
Throws:
java.io.IOException
java.lang.InterruptedException

createAuthenticatedClient

public static Cassandra.Client createAuthenticatedClient(org.apache.thrift.transport.TSocket socket,
                                                         org.apache.hadoop.mapreduce.JobContext context)
                                                  throws InvalidRequestException,
                                                         org.apache.thrift.TException,
                                                         AuthenticationException,
                                                         AuthorizationException
Return a client based on the given socket that points to the configured keyspace, and is logged in with the configured credentials.

Parameters:
socket - a socket pointing to a particular node, seed or otherwise
context - a job context
Returns:
a cassandra client
Throws:
InvalidRequestException
org.apache.thrift.TException
AuthenticationException
AuthorizationException
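A usage sketch (host, port, and whether the socket must be opened before the call are assumptions; check your cluster's rpc settings and the Thrift transport conventions of your Cassandra version):

```java
import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.transport.TSocket;

// Open a Thrift socket to one node (9160 is the traditional rpc port)
// and obtain a client already bound to the configured keyspace and
// logged in with the configured credentials.
TSocket socket = new TSocket("127.0.0.1", 9160);
socket.open();
Cassandra.Client client =
        ColumnFamilyOutputFormat.createAuthenticatedClient(socket, jobContext);
```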


Copyright © 2010 The Apache Software Foundation