org.apache.pig.piggybank.storage
Class PigStorageSchema

java.lang.Object
  org.apache.pig.LoadFunc
    org.apache.pig.FileInputLoadFunc
      org.apache.pig.builtin.PigStorage
        org.apache.pig.piggybank.storage.PigStorageSchema

- All Implemented Interfaces:
- LoadMetadata, LoadPushDown, OrderedLoadFunc, StoreFuncInterface, StoreMetadata

public class PigStorageSchema
- extends PigStorage
- implements LoadMetadata, StoreMetadata
This Load/Store Func reads and writes metafiles that allow the schema and
aliases to be determined at load time, saving you from having to manually
enter schemas for Pig-generated datasets.
It also creates a ".pig_headers" file that simply lists the delimited aliases.
This is intended to make it easier to export data to tools that can read files
with header lines (just cat the header onto your data).
Due to StoreFunc limitations, the metafiles can only be written in MapReduce
mode. They can be read in both Local and MapReduce mode.
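The description above can be sketched as a Pig Latin round trip. This is a minimal illustration, not taken from the class's own documentation; the paths, aliases, and field names are hypothetical, and it assumes piggybank.jar is available:

```pig
-- Hypothetical input and schema; REGISTER makes piggybank UDFs visible.
REGISTER piggybank.jar;
data = LOAD 'input.tsv' AS (name:chararray, age:int);

-- In MapReduce mode this also writes the schema metafile and the
-- .pig_headers file alongside the data.
STORE data INTO 'output' USING org.apache.pig.piggybank.storage.PigStorageSchema();

-- Load it back without re-declaring the schema; aliases and types
-- are recovered from the stored metafile.
data2 = LOAD 'output' USING org.apache.pig.piggybank.storage.PigStorageSchema();
```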
Methods inherited from class org.apache.pig.builtin.PigStorage:
checkSchema, cleanupOnFailure, equals, equals, getFeatures, getInputFormat, getNext, getOutputFormat, hashCode, prepareToRead, prepareToWrite, pushProjection, putNext, relToAbsPathForStoreLocation, setLocation, setStoreFuncUDFContextSignature, setStoreLocation, setUDFContextSignature
PigStorageSchema
public PigStorageSchema()
PigStorageSchema
public PigStorageSchema(String delim)
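As with PigStorage, the one-argument constructor sets the field delimiter (the no-argument form uses the default tab delimiter). A sketch, with hypothetical paths:

```pig
-- Write comma-delimited fields instead of the default tab.
STORE data INTO 'out_csv'
    USING org.apache.pig.piggybank.storage.PigStorageSchema(',');
```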
getSchema
public ResourceSchema getSchema(String location,
org.apache.hadoop.mapreduce.Job job)
throws IOException
- Description copied from interface: LoadMetadata
- Get a schema for the data to be loaded.
- Specified by: getSchema in interface LoadMetadata
- Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
- Returns:
- schema for the data to be loaded. This schema should represent all tuples of the returned data. If the schema is unknown or it is not possible to return a schema that represents all returned data, then null should be returned. The schema should not be affected by pushProjection; i.e., getSchema should always return the original schema even after pushProjection.
- Throws:
IOException - if an exception occurs while determining the schema
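In practice, getSchema is what lets a script see aliases and types without declaring them. A hypothetical session (path and alias invented for illustration):

```pig
-- The schema returned by getSchema() (read from the stored metafile)
-- is visible to the script, so DESCRIBE shows the recovered aliases.
logs = LOAD 'stored_output' USING org.apache.pig.piggybank.storage.PigStorageSchema();
DESCRIBE logs;
```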
getStatistics
public ResourceStatistics getStatistics(String location,
org.apache.hadoop.mapreduce.Job job)
throws IOException
- Description copied from interface: LoadMetadata
- Get statistics about the data to be loaded. If no statistics are available, then null should be returned.
- Specified by: getStatistics in interface LoadMetadata
- Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
- Returns:
- statistics about the data to be loaded. If no statistics are available, then null should be returned.
- Throws:
IOException - if an exception occurs while retrieving statistics
setPartitionFilter
public void setPartitionFilter(Expression partitionFilter)
throws IOException
- Description copied from interface: LoadMetadata
- Set the filter for partitioning. It is assumed that this filter will only contain references to fields given as partition keys in getPartitionKeys. So if the implementation returns null in LoadMetadata.getPartitionKeys(String, Job), then this method is not called by the Pig runtime. This method is also not called by the Pig runtime if there are no partition filter conditions.
- Specified by: setPartitionFilter in interface LoadMetadata
- Parameters:
partitionFilter - the expression that describes the filter for partitioning
- Throws:
IOException - if the filter is not compatible with the storage mechanism or contains non-partition fields.
getPartitionKeys
public String[] getPartitionKeys(String location,
org.apache.hadoop.mapreduce.Job job)
throws IOException
- Description copied from interface: LoadMetadata
- Find what columns are partition keys for this input.
- Specified by: getPartitionKeys in interface LoadMetadata
- Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
- Returns:
- array of field names of the partition keys. Implementations should return null to indicate that there are no partition keys.
- Throws:
IOException - if an exception occurs while retrieving partition keys
storeSchema
public void storeSchema(ResourceSchema schema,
String location,
org.apache.hadoop.mapreduce.Job job)
throws IOException
- Description copied from interface: StoreMetadata
- Store the schema of the data being written.
- Specified by: storeSchema in interface StoreMetadata
- Parameters:
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
- Throws:
IOException
storeStatistics
public void storeStatistics(ResourceStatistics stats,
String location,
org.apache.hadoop.mapreduce.Job job)
throws IOException
- Description copied from interface: StoreMetadata
- Store statistics about the data being written.
- Specified by: storeStatistics in interface StoreMetadata
- Parameters:
job - The Job object - this should be used only to obtain cluster properties through JobContext.getConfiguration() and not to set/query any runtime job information.
- Throws:
IOException
Copyright © ${year} The Apache Software Foundation