org.apache.pig.impl.builtin
Class DefaultIndexableLoader

java.lang.Object
  extended by org.apache.pig.impl.builtin.DefaultIndexableLoader
All Implemented Interfaces:
IndexableLoadFunc, LoadFunc

public class DefaultIndexableLoader
extends Object
implements IndexableLoadFunc


Nested Class Summary
 
Nested classes/interfaces inherited from interface org.apache.pig.LoadFunc
LoadFunc.RequiredField, LoadFunc.RequiredFieldList, LoadFunc.RequiredFieldResponse
 
Constructor Summary
DefaultIndexableLoader(String loaderFuncSpec, String indexFile, String indexFileLoadFuncSpec, String scope)
           
 
Method Summary
 void bindTo(String fileName, BufferedPositionedInputStream is, long offset, long end)
          Specifies a portion of an InputStream to read tuples.
 DataBag bytesToBag(byte[] b)
          Cast data from bytes to bag value.
 String bytesToCharArray(byte[] b)
          Cast data from bytes to chararray value.
 Double bytesToDouble(byte[] b)
          Cast data from bytes to double value.
 Float bytesToFloat(byte[] b)
          Cast data from bytes to float value.
 Integer bytesToInteger(byte[] b)
          Cast data from bytes to integer value.
 Long bytesToLong(byte[] b)
          Cast data from bytes to long value.
 Map<String,Object> bytesToMap(byte[] b)
          Cast data from bytes to map value.
 Tuple bytesToTuple(byte[] b)
          Cast data from bytes to tuple value.
 void close()
          A method called by the pig runtime to give an opportunity for implementations to perform cleanup actions like closing the underlying input stream.
 Schema determineSchema(String fileName, ExecType execType, DataStorage storage)
          Find the schema from the loader.
 LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)
          Indicate to the loader fields that will be needed.
 Tuple getNext()
          Retrieves the next tuple to be processed.
 void initialize(org.apache.hadoop.conf.Configuration conf)
          This method is called by the pig runtime to allow the IndexableLoadFunc to perform any initialization actions.
 void seekNear(Tuple keys)
          This method is called by the pig runtime to indicate to the LoadFunc to position its underlying input stream near the keys supplied as the argument.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DefaultIndexableLoader

public DefaultIndexableLoader(String loaderFuncSpec,
                              String indexFile,
                              String indexFileLoadFuncSpec,
                              String scope)
Method Detail

seekNear

public void seekNear(Tuple keys)
              throws IOException
Description copied from interface: IndexableLoadFunc
This method is called by the pig runtime to indicate to the LoadFunc to position its underlying input stream near the keys supplied as the argument. Specifically:
1) If the key(s) are present in the input stream, the loadfunc implementation should set its read position either to the record with the biggest key(s) less than the key(s) supplied in the argument, or to the record with the first occurrence of the key(s) supplied.
2) If the key(s) are absent from the input stream, the implementation should set its read position either to the record with the biggest key(s) less than the key(s) supplied, or to the first record with the smallest key(s) greater than the key(s) supplied.

The description above also holds for data sorted in descending order, with "biggest" and "less than" replaced by "smallest" and "greater than", and vice versa.

Specified by:
seekNear in interface IndexableLoadFunc
Parameters:
keys - Tuple with join keys (which are a prefix of the sort keys of the input data). For example, if the data is sorted on the columns in positions 2, 4 and 5, any of the following Tuples is valid as an argument value:
    (fieldAt(2))
    (fieldAt(2), fieldAt(4))
    (fieldAt(2), fieldAt(4), fieldAt(5))
  The following are some invalid cases:
    (fieldAt(4))
    (fieldAt(2), fieldAt(5))
    (fieldAt(4), fieldAt(5))
Throws:
IOException - When the loadFunc is unable to position to the required point in its input stream
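
The positioning rule for ascending data can be illustrated with a lower-bound binary search. This is only a minimal sketch, not the code used by DefaultIndexableLoader: it assumes a single integer key per record held in an in-memory array, and the class name SeekNearSketch and its method are hypothetical. It returns the first occurrence of the key when present, and otherwise the first record with the smallest greater key, which is one of the two placements the contract allows.

```java
final class SeekNearSketch {

    /**
     * Returns the index of the first record whose key is >= target,
     * assuming sortedKeys is in ascending order (a "lower bound" search).
     * If every key is smaller than target, returns sortedKeys.length.
     */
    static int seekNear(int[] sortedKeys, int target) {
        int lo = 0;
        int hi = sortedKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys[mid] < target) {
                lo = mid + 1;   // target can only be to the right of mid
            } else {
                hi = mid;       // mid may be the first match; keep it in range
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] keys = {1, 3, 3, 7, 9};
        System.out.println(seekNear(keys, 3)); // key present: first occurrence
        System.out.println(seekNear(keys, 5)); // key absent: smallest greater key
    }
}
```

A caller would then read records forward from the returned position until the keys exceed the join key.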

bindTo

public void bindTo(String fileName,
                   BufferedPositionedInputStream is,
                   long offset,
                   long end)
            throws IOException
Description copied from interface: LoadFunc
Specifies a portion of an InputStream to read tuples. Because the starting and ending offsets may not be on record boundaries it is up to the implementor to deal with figuring out the actual starting and ending offsets in such a way that an arbitrarily sliced up file will be processed in its entirety.

A common way of handling slices in the middle of records is to start at the given offset and, if the offset is not zero, skip to the end of the first record (which may be a partial record) before reading tuples. Reading continues until a tuple has been read that ends at an offset past the ending offset.

The load function should not do any buffering on the input stream. Buffering will cause the offsets returned by is.getPos() to be unreliable.

Specified by:
bindTo in interface LoadFunc
Parameters:
fileName - the name of the file to be read
is - the stream representing the file to be processed, and which can also provide its position.
offset - the offset to start reading tuples.
end - the ending offset for reading.
Throws:
IOException
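
The slice-handling approach described above can be sketched as follows. This is an illustration under stated assumptions, not the loader's actual implementation: records are assumed to be newline-delimited, the input is a plain byte array rather than a BufferedPositionedInputStream, and the names SliceReaderSketch and readSlice are hypothetical. A slice skips its (possibly partial) first record when the offset is not zero, and keeps reading any record that starts at or before the ending offset, so adjacent slices cover every record exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SliceReaderSketch {

    /** Reads the newline-delimited records that belong to the slice [offset, end]. */
    static List<String> readSlice(byte[] data, long offset, long end) {
        int pos = (int) offset;
        if (offset != 0) {
            // Skip to the end of the first record, which may be partial;
            // the previous slice is responsible for reading it in full.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> records = new ArrayList<>();
        // Keep reading while a record starts at or before the ending offset,
        // so the last record read may end past it.
        while (pos < data.length && pos <= end) {
            int start = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            records.add(new String(data, start, pos - start, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSlice(data, 0, 4));  // first slice
        System.out.println(readSlice(data, 4, 10)); // second slice
    }
}
```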

bytesToBag

public DataBag bytesToBag(byte[] b)
                   throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to bag value.

Specified by:
bytesToBag in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Bag value.
Throws:
IOException - if the value cannot be cast.

bytesToCharArray

public String bytesToCharArray(byte[] b)
                        throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to chararray value.

Specified by:
bytesToCharArray in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
String value.
Throws:
IOException - if the value cannot be cast.

bytesToDouble

public Double bytesToDouble(byte[] b)
                     throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to double value.

Specified by:
bytesToDouble in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Double value.
Throws:
IOException - if the value cannot be cast.

bytesToFloat

public Float bytesToFloat(byte[] b)
                   throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to float value.

Specified by:
bytesToFloat in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Float value.
Throws:
IOException - if the value cannot be cast.

bytesToInteger

public Integer bytesToInteger(byte[] b)
                       throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to integer value.

Specified by:
bytesToInteger in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Integer value.
Throws:
IOException - if the value cannot be cast.
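
As an illustration of the cast contract shared by the bytesTo* methods, the sketch below converts a byte array to an Integer and reports a failed cast as an IOException. It assumes the bytes hold a UTF-8 text representation of the number; the page does not state which encoding DefaultIndexableLoader expects, and the class name BytesToIntegerSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class BytesToIntegerSketch {

    /** Casts UTF-8 text bytes to an Integer; a null input yields null. */
    static Integer bytesToInteger(byte[] b) throws IOException {
        if (b == null) {
            return null;
        }
        String text = new String(b, StandardCharsets.UTF_8).trim();
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            // Surface the failure the way the interface documents it.
            throw new IOException("Cannot cast to integer: " + text, e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(bytesToInteger("42".getBytes(StandardCharsets.UTF_8)));
    }
}
```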

bytesToLong

public Long bytesToLong(byte[] b)
                 throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to long value.

Specified by:
bytesToLong in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Long value.
Throws:
IOException - if the value cannot be cast.

bytesToMap

public Map<String,Object> bytesToMap(byte[] b)
                              throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to map value.

Specified by:
bytesToMap in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Map value.
Throws:
IOException - if the value cannot be cast.

bytesToTuple

public Tuple bytesToTuple(byte[] b)
                   throws IOException
Description copied from interface: LoadFunc
Cast data from bytes to tuple value.

Specified by:
bytesToTuple in interface LoadFunc
Parameters:
b - byte array to be cast.
Returns:
Tuple value.
Throws:
IOException - if the value cannot be cast.

determineSchema

public Schema determineSchema(String fileName,
                              ExecType execType,
                              DataStorage storage)
                       throws IOException
Description copied from interface: LoadFunc
Find the schema from the loader. This function will be called at parse time (not run time) to see if the loader can provide a schema for the data. The loader may be able to do this if the data is self-describing (e.g. JSON). If the loader cannot determine the schema, it can return null.

LoadFunc implementations that need to open the input "fileName" can use FileLocalizer.open(String fileName, ExecType execType, DataStorage storage) to get an InputStream with which to initialize their loader implementation. They can then use this to read the input data to discover the schema.

Note: this will work only when the fileName represents a file on the local file system or the Hadoop file system.

Specified by:
determineSchema in interface LoadFunc
Parameters:
fileName - Name of the file to be read (this will be the same as the file name in the "load" statement of the script).
execType - execution mode of the pig script, one of ExecType.LOCAL or ExecType.MAPREDUCE
storage - the DataStorage object corresponding to the execType
Returns:
a Schema describing the data if possible, or null otherwise.
Throws:
IOException

fieldsToRead

public LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)
                                            throws FrontendException
Description copied from interface: LoadFunc
Indicate to the loader which fields will be needed. This can be useful for loaders that access data stored in a columnar format, where indicating the columns to be accessed ahead of time will save scans. If the loader function cannot make use of this information, it is free to ignore it.

Specified by:
fieldsToRead in interface LoadFunc
Parameters:
requiredFieldList - RequiredFieldList indicating which columns will be needed.
Throws:
FrontendException

getNext

public Tuple getNext()
              throws IOException
Description copied from interface: LoadFunc
Retrieves the next tuple to be processed.

Specified by:
getNext in interface LoadFunc
Returns:
the next tuple to be processed or null if there are no more tuples to be processed.
Throws:
IOException
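
The null-terminated contract of getNext() implies a simple consumption loop, sketched below. The TupleSource interface is a hypothetical stand-in for the getNext() part of a loader, and a List<Object> stands in for a Pig Tuple so that the example compiles without Pig on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class GetNextLoopSketch {

    /** Hypothetical stand-in for a loader's getNext() method. */
    interface TupleSource {
        List<Object> getNext() throws IOException;
    }

    /** Reads tuples until getNext() returns null, signalling the end of input. */
    static List<List<Object>> drain(TupleSource source) throws IOException {
        List<List<Object>> out = new ArrayList<>();
        List<Object> tuple;
        while ((tuple = source.getNext()) != null) {
            out.add(tuple);
        }
        return out;
    }

    /** Builds a source that yields the given tuples in order, then null. */
    static TupleSource fromList(List<List<Object>> tuples) {
        Iterator<List<Object>> it = tuples.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) throws IOException {
        List<List<Object>> tuples = new ArrayList<>();
        tuples.add(Arrays.<Object>asList(1, "a"));
        tuples.add(Arrays.<Object>asList(2, "b"));
        System.out.println(drain(fromList(tuples)).size());
    }
}
```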

close

public void close()
           throws IOException
Description copied from interface: IndexableLoadFunc
A method called by the pig runtime to give an opportunity for implementations to perform cleanup actions like closing the underlying input stream. This is necessary because, while performing a join, the pig runtime may determine that no further join is possible with the remaining records and may tell the IndexableLoader to clean up by calling this method.

Specified by:
close in interface IndexableLoadFunc
Throws:
IOException - if the loadfunc is unable to perform its close actions.

initialize

public void initialize(org.apache.hadoop.conf.Configuration conf)
                throws IOException
Description copied from interface: IndexableLoadFunc
This method is called by the pig runtime to allow the IndexableLoadFunc to perform any initialization actions.

Specified by:
initialize in interface IndexableLoadFunc
Parameters:
conf - The job configuration object
Throws:
IOException


Copyright © ${year} The Apache Software Foundation