org.apache.pig.backend.hadoop.hbase
Class HBaseStorage

java.lang.Object
  extended by org.apache.pig.builtin.Utf8StorageConverter
      extended by org.apache.pig.backend.hadoop.hbase.HBaseStorage
All Implemented Interfaces:
LoadFunc, Slicer

public class HBaseStorage
extends Utf8StorageConverter
implements Slicer, LoadFunc

A Slicer that splits an HBase table into HBaseSlices, combined with a load function that performs no load operations itself; the actual load operations are done in HBaseSlice.


Nested Class Summary
 
Nested classes/interfaces inherited from interface org.apache.pig.LoadFunc
LoadFunc.RequiredField, LoadFunc.RequiredFieldList, LoadFunc.RequiredFieldResponse
 
Field Summary
 
Fields inherited from class org.apache.pig.builtin.Utf8StorageConverter
mBagFactory, mLog, mTupleFactory
 
Constructor Summary
HBaseStorage(String columnList)
          Constructor.
 
Method Summary
 void bindTo(String fileName, BufferedPositionedInputStream is, long offset, long end)
          Specifies a portion of an InputStream to read tuples.
 Schema determineSchema(String fileName, ExecType execType, DataStorage storage)
          Find the schema from the loader.
 LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFields)
          Indicate to the loader fields that will be needed.
 Tuple getNext()
          Retrieves the next tuple to be processed.
 Slice[] slice(DataStorage store, String tablename)
          Creates slices of data from store at location.
 void validate(DataStorage store, String tablename)
          Checks that location is parsable by this Slicer, and that if the DataStorage is used by the Slicer, it's readable from there.
 
Methods inherited from class org.apache.pig.builtin.Utf8StorageConverter
bytesToBag, bytesToCharArray, bytesToDouble, bytesToFloat, bytesToInteger, bytesToLong, bytesToMap, bytesToTuple, toBytes, toBytes, toBytes, toBytes, toBytes, toBytes, toBytes, toBytes
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.pig.LoadFunc
bytesToBag, bytesToCharArray, bytesToDouble, bytesToFloat, bytesToInteger, bytesToLong, bytesToMap, bytesToTuple
 

Constructor Detail

HBaseStorage

public HBaseStorage(String columnList)
Constructs an HBase table loader to load the cells of the provided columns.

Parameters:
columnList - a space-delimited string listing the columns to load.
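The Javadoc does not show how the column list is tokenized; as a rough, hypothetical sketch of what a space-delimited column string implies (the helper name and the example column names are invented for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class ColumnListDemo {
    // Hypothetical sketch: splits a "cf:a cf:b" style column list on
    // whitespace. The real parsing logic lives inside HBaseStorage.
    static List<String> parseColumns(String columnList) {
        return Arrays.asList(columnList.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Illustrative column names, not taken from the source.
        System.out.println(parseColumns("info:name info:age"));
    }
}
```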
Method Detail

slice

public Slice[] slice(DataStorage store,
                     String tablename)
              throws IOException
Description copied from interface: Slicer
Creates slices of data from store at location.

Specified by:
slice in interface Slicer
Returns:
the Slices to be serialized and sent out to nodes for processing.
Throws:
IOException

validate

public void validate(DataStorage store,
                     String tablename)
              throws IOException
Description copied from interface: Slicer
Checks that location is parsable by this Slicer, and that if the DataStorage is used by the Slicer, it's readable from there. If it isn't, an IOException with a message explaining why will be thrown.

This does not ensure that all the data in location is valid. It's a preflight check that there's some chance of the Slicer working before actual Slices are created and sent off for processing.

Specified by:
validate in interface Slicer
Throws:
IOException

bindTo

public void bindTo(String fileName,
                   BufferedPositionedInputStream is,
                   long offset,
                   long end)
            throws IOException
Description copied from interface: LoadFunc
Specifies a portion of an InputStream to read tuples. Because the starting and ending offsets may not be on record boundaries it is up to the implementor to deal with figuring out the actual starting and ending offsets in such a way that an arbitrarily sliced up file will be processed in its entirety.

A common way of handling slices in the middle of records is to start at the given offset and, if the offset is not zero, skip to the end of the first record (which may be a partial record) before reading tuples. Reading continues until a tuple has been read that ends at an offset past the ending offset.

The load function should not do any buffering on the input stream. Buffering will cause the offsets returned by is.getPos() to be unreliable.

Specified by:
bindTo in interface LoadFunc
Parameters:
fileName - the name of the file to be read
is - the stream representing the file to be processed, and which can also provide its position.
offset - the offset to start reading tuples.
end - the ending offset for reading.
Throws:
IOException
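The boundary-handling convention described above (skip the partial first record when the offset is nonzero, and finish reading the record that straddles the ending offset) can be sketched with a self-contained toy reader over newline-delimited records. This is illustrative code under those stated assumptions, not HBaseStorage's implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class SliceReaderDemo {

    // Reads newline-delimited records assigned to the byte range
    // [offset, end], following the slicing convention described above.
    static List<String> readSlice(byte[] data, long offset, long end) {
        List<String> records = new ArrayList<>();
        try (InputStream is = new ByteArrayInputStream(data)) {
            long pos = is.skip(offset);
            if (offset != 0) {
                // Not at the start of the input: the first record may be
                // partial, so skip forward to the next record boundary.
                int b;
                while ((b = is.read()) != -1) {
                    pos++;
                    if (b == '\n') break;
                }
            }
            // Start a new record only while its first byte is at or before
            // `end`; a record straddling `end` is still read to completion,
            // so every record belongs to exactly one slice.
            while (pos <= end) {
                StringBuilder record = new StringBuilder();
                int b = is.read();
                if (b == -1) break;
                while (b != -1 && b != '\n') {
                    pos++;
                    record.append((char) b);
                    b = is.read();
                }
                if (b == '\n') pos++;
                records.add(record.toString());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes();
        System.out.println(readSlice(data, 0, 3)); // first slice: [aa, bb]
        System.out.println(readSlice(data, 3, 8)); // second slice: [cc]
    }
}
```

Note that the record beginning exactly at a slice boundary is read by the preceding slice and skipped by the following one, so contiguous slices cover every record exactly once.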

determineSchema

public Schema determineSchema(String fileName,
                              ExecType execType,
                              DataStorage storage)
                       throws IOException
Description copied from interface: LoadFunc
Find the schema from the loader. This function will be called at parse time (not run time) to see if the loader can provide a schema for the data. The loader may be able to do this if the data is self-describing (e.g. JSON). If the loader cannot determine the schema, it can return null. LoadFunc implementations which need to open the input "fileName" can use FileLocalizer.open(String fileName, ExecType execType, DataStorage storage) to get an InputStream which they can use to initialize their loader implementation. They can then use this to read the input data to discover the schema. Note: this will work only when the fileName represents a file on the local file system or a Hadoop file system.

Specified by:
determineSchema in interface LoadFunc
Parameters:
fileName - name of the file to be read (this will be the same as the filename in the load statement of the script)
execType - execution mode of the Pig script, one of ExecType.LOCAL or ExecType.MAPREDUCE
storage - the DataStorage object corresponding to the execType
Returns:
a Schema describing the data if possible, or null otherwise.
Throws:
IOException

fieldsToRead

public LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFields)
                                            throws FrontendException
Description copied from interface: LoadFunc
Indicate to the loader which fields will be needed. This can be useful for loaders that access data stored in a columnar format, where indicating the columns to be accessed ahead of time will save scans. If the loader function cannot make use of this information, it is free to ignore it.

Specified by:
fieldsToRead in interface LoadFunc
Parameters:
requiredFields - RequiredFieldList indicating which columns will be needed.
Throws:
FrontendException
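To illustrate why a columnar loader benefits from knowing the required fields up front, here is a toy projection over an in-memory row; the method and data are invented for illustration and stand in for the pruning a real columnar store would do at scan time:

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionDemo {
    // Keeps only the columns the query actually needs; in a columnar
    // store the other columns would never be scanned or deserialized.
    static List<String> project(List<String> row, List<Integer> requiredIndexes) {
        List<String> out = new ArrayList<>();
        for (int i : requiredIndexes) {
            out.add(row.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative row with three columns; only columns 0 and 2 are needed.
        System.out.println(project(List.of("alice", "30", "nyc"), List.of(0, 2)));
    }
}
```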

getNext

public Tuple getNext()
              throws IOException
Description copied from interface: LoadFunc
Retrieves the next tuple to be processed.

Specified by:
getNext in interface LoadFunc
Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException
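The contract above implies the standard consumption loop: call getNext() repeatedly until it returns null. The ToyLoader interface below is a stand-in for Pig's LoadFunc, invented so the sketch is self-contained:

```java
import java.util.Iterator;
import java.util.List;

public class GetNextDemo {
    // Stand-in for LoadFunc: returns the next record, or null when exhausted.
    interface ToyLoader {
        String getNext();
    }

    static int countTuples(ToyLoader loader) {
        int n = 0;
        // null signals "no more tuples to be processed", per the contract.
        for (String t = loader.getNext(); t != null; t = loader.getNext()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Iterator<String> it = List.of("t1", "t2", "t3").iterator();
        System.out.println(countTuples(() -> it.hasNext() ? it.next() : null));
    }
}
```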


Copyright © ${year} The Apache Software Foundation