Package org.apache.pig.piggybank.storage

Class Summary
HiveColumnarLoader Loader for Hive RC Columnar files.
Supports the following types:
Hive Type         Pig Type (from DataType)
string            CHARARRAY
int               INTEGER
bigint or long    LONG
float             FLOAT
double            DOUBLE
boolean           BOOLEAN
byte              BYTE
array             TUPLE
map               MAP
Usage 1:
To load a Hive table with columns uid bigint, ts long, arr ARRAY, m MAP:
a = LOAD 'file' USING HiveColumnarLoader("uid bigint, ts long, arr array, m map");
-- to reference the fields
b = FOREACH a GENERATE uid, ts, arr, m;

Usage 2:
To load a Hive table with columns uid bigint, ts long, arr ARRAY, m MAP, only processing dates 2009-10-01 to 2009-10-02 in a date-partitioned Hive table:
a = LOAD 'file' USING HiveColumnarLoader("uid bigint, ts long, arr array, m map", "2009-10-01:2009-10-02");
-- to reference the fields
b = FOREACH a GENERATE uid, ts, arr, m;

Usage 3:
To load a Hive table with columns uid bigint, ts long, arr ARRAY, m MAP, only reading columns uid and ts:
a = LOAD 'file' USING HiveColumnarLoader("uid bigint, ts long, arr array, m map", "", "uid,ts");
-- to reference the fields
b = FOREACH a GENERATE uid, ts, arr, m;

Usage 4:
To load a Hive table with columns uid bigint, ts long, arr ARRAY, m MAP, only reading columns uid and ts for dates 2009-10-01 to 2009-10-02:
a = LOAD 'file' USING HiveColumnarLoader("uid bigint, ts long, arr array, m map", "2009-10-01:2009-10-02", "uid,ts");
-- to reference the fields
b = FOREACH a GENERATE uid, ts, arr, m;

Issues

Table schema definition
The schema definition must be a column name, followed by a space and the column type, then a comma, then (with no space) the next column name, and so on.
So column1 string, column2 string will not work; it must be column1 string,column2 string
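
The formatting rule above can be applied mechanically before a schema string is handed to the loader. What follows is a minimal sketch, not part of piggybank: the class name SchemaStrings and the method normalize are hypothetical, and the sketch assumes that column names and types contain no commas of their own (as in the bare "array" and "map" types used in the usage examples).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of piggybank.
class SchemaStrings {

    // Rewrites "col1 type1,  col2 type2" as "col1 type1,col2 type2".
    // Assumption: no column name or type contains a comma.
    static String normalize(String schema) {
        List<String> columns = new ArrayList<>();
        for (String part : schema.split(",")) {
            // Trim each "name type" pair and collapse inner whitespace to one space.
            String column = part.trim().replaceAll("\\s+", " ");
            if (!column.isEmpty()) {
                columns.add(column);
            }
        }
        // Join with a bare comma so that no space follows it.
        return String.join(",", columns);
    }
}
```

The result of SchemaStrings.normalize could then be used as the first argument to HiveColumnarLoader in place of a hand-written schema string.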

Date partitioning
Hive date partition folders must have the format daydate=[date].
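
To make the folder naming convention concrete, the sketch below expands a date range string such as "2009-10-01:2009-10-02" into the folder names that would have to exist. It is an illustration only and not how the loader itself is implemented: the class DatePartitions and the method folderNames are hypothetical, and the sketch assumes dates in yyyy-MM-dd form and an inclusive end date, as the usage examples suggest.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of piggybank.
class DatePartitions {

    // Expands "start:end" into folder names such as "daydate=2009-10-01".
    // Assumptions: ISO yyyy-MM-dd dates, and the end date is included.
    static List<String> folderNames(String dateRange) {
        String[] bounds = dateRange.split(":");
        if (bounds.length != 2) {
            throw new IllegalArgumentException("expected start:end, got " + dateRange);
        }
        LocalDate start = LocalDate.parse(bounds[0]);
        LocalDate end = LocalDate.parse(bounds[1]);
        List<String> folders = new ArrayList<>();
        for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
            // LocalDate.toString() yields the yyyy-MM-dd form.
            folders.add("daydate=" + day);
        }
        return folders;
    }
}
```

Listing the expected folders this way is a quick check that a table's directory layout matches the daydate=[date] format before running a load with a date range.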

JsonMetadata Reads and Writes metadata using JSON in metafiles next to the data.
MultiStorage The UDF is useful for splitting the output data into a bunch of directories and files dynamically based on user specified key field in the output tuple.
MultiStorage.MultiStorageOutputFormat  
MultiStorage.MultiStorageOutputFormat.MyLineRecordWriter  
MyRegExLoader  
PigStorageSchema This Load/Store Func reads/writes metafiles that allow the schema and aliases to be determined at load time, saving one from having to manually enter schemas for pig-generated datasets.
RegExLoader RegExLoader is an abstract class used to parse logs based on a regular expression.
SequenceFileLoader A Loader for Hadoop-Standard SequenceFiles.
XMLLoader The load function to load an XML file. This implements the LoadFunc interface, which is used to parse records from a dataset.
XMLLoader.XMLFileInputFormat  
XMLLoader.XMLFileRecordReader  
 



Copyright © ${year} The Apache Software Foundation