java.lang.Object
  org.apache.pig.LoadFunc
    org.apache.pig.FileInputLoadFunc
      org.apache.pig.builtin.PigStorage
        org.apache.pig.piggybank.storage.CSVExcelStorage
public class CSVExcelStorage
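CSV loading and storing with support for multi-line fields and for escaping of delimiters and double quotes within fields; uses the CSV conventions of Excel 2007. Arguments allow for control over the field delimiter, whether line breaks are allowed inside fields, and which line break convention is written on store.
Usage: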
STORE x INTO '<destFileName>'
USING CSVExcelStorage(['<delimiter>' [,{'YES_MULTILINE' | 'NO_MULTILINE'} [,{'UNIX' | 'WINDOWS' | 'NOCHANGE'}]]]);
Defaults are comma, 'NO_MULTILINE', and 'NOCHANGE'. The linebreak parameter is only used during store; during load no conversion is performed.
Example:
STORE res INTO '/tmp/result.csv'
USING CSVExcelStorage(',', 'NO_MULTILINE', 'WINDOWS');
This would expect comma-separated files on load, use comma as the field separator on store, treat every newline as a record terminator, and use CRLF (0x0d 0x0a: \r\n) as the line break characters.
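The effect of the linebreak argument on stored text can be pictured with a small standalone sketch. It is only an illustration of the three documented choices, not the class's actual code: the class name `LinebreakSketch`, the enum `Eol`, and the constant `LEAVE_AS_IS` (standing in for the no-conversion default) are hypothetical, and whether the real implementation applies the conversion to record terminators, to line breaks inside fields, or to both is not stated here.

```java
final class LinebreakSketch {
    /** Hypothetical stand-in for the three linebreak choices described above. */
    enum Eol { UNIX, WINDOWS, LEAVE_AS_IS }

    /** Rewrites the line breaks in a piece of text that is about to be stored. */
    static String convert(String text, Eol eol) {
        switch (eol) {
            case UNIX:
                // CRLF becomes a bare LF.
                return text.replace("\r\n", "\n");
            case WINDOWS:
                // Normalize to LF first so an existing CRLF is not turned into CR CR LF.
                return text.replace("\r\n", "\n").replace("\n", "\r\n");
            default:
                // No conversion requested: the text is written as it is.
                return text;
        }
    }

    public static void main(String[] args) {
        String converted = convert("line one\nline two", Eol.WINDOWS);
        // Show the control characters explicitly instead of printing raw line breaks.
        System.out.println(converted.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

A lone carriage return (old Mac style) is left untouched by this sketch in every mode.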
Example:
STORE res INTO '/tmp/result.csv'
USING CSVExcelStorage(',', 'YES_MULTILINE');
This would allow newlines inside of fields. During load, such fields are expected to conform to the Excel requirement that the field is enclosed in double quotes. On store, the chararray containing the field will accordingly be enclosed in double quotes.
Note:
A danger with enabling multiline fields during load is that unbalanced
double quotes will cause slurping up of input until a balancing double
quote is found, or until something breaks. If you are not expecting
newlines within fields it is therefore more robust to use NO_MULTILINE,
which is the default for that reason.
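The slurping behavior described in this note follows from how a multiline reader has to decide where a record ends. A minimal sketch, assuming a record is considered complete once it contains an even number of double quotes (the class `MultilineSketch` and its methods are hypothetical and are not taken from CSVExcelStorage):

```java
import java.util.ArrayList;
import java.util.List;

final class MultilineSketch {
    /** Counts the double-quote characters in the given text. */
    static int countQuotes(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                count++;
            }
        }
        return count;
    }

    /**
     * Groups physical lines into logical records. With multiline enabled, a record
     * stays open while it holds an odd number of double quotes; with multiline
     * disabled, every physical line is its own record.
     */
    static List<String> assemble(List<String> lines, boolean multiline) {
        List<String> records = new ArrayList<>();
        StringBuilder open = null;
        for (String line : lines) {
            if (open == null) {
                open = new StringBuilder(line);
            } else {
                open.append('\n').append(line);
            }
            if (!multiline || countQuotes(open) % 2 == 0) {
                records.add(open.toString());
                open = null;
            }
        }
        if (open != null) {
            // An unbalanced quote swallowed everything up to the end of the input.
            records.add(open.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("a,\"b");   // stray opening quote that is never closed
        lines.add("c,d");
        lines.add("e,f");
        System.out.println(assemble(lines, true).size() + " record(s) with multiline on");
        System.out.println(assemble(lines, false).size() + " record(s) with multiline off");
    }
}
```

In the sketch, a single stray quote in the first line merges all three input lines into one record when multiline is on, which is the failure mode the note warns about.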
Excel expects double quotes within fields to be escaped with a second double quote. When such an embedding of double quotes is used, Excel additionally expects the entire field to be surrounded by double quotes. This package follows that escape mechanism, rather than the use of backslash.
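The escape mechanism can be sketched for the store direction as follows. This is an illustration under assumptions rather than the package's own code: the class `QuoteEscapeSketch` and the method `escape` are hypothetical names, and the choice to also quote fields that contain the delimiter or a line break is an assumption based on the conventions described on this page.

```java
final class QuoteEscapeSketch {
    /**
     * Returns the field as it would be written to a CSV file: embedded double
     * quotes are doubled, and the whole field is wrapped in double quotes when it
     * contains a double quote, the delimiter, or a line break.
     */
    static String escape(String field, char delimiter) {
        boolean needsQuotes = field.indexOf('"') >= 0
                || field.indexOf(delimiter) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        // Double every embedded quote, then enclose the entire field.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(escape("Mac \"the knife\"", ','));
        System.out.println(escape("Cohen", ','));
    }
}
```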
Tested with: Pig 0.8.0, Windows Vista, Excel 2007 SP2 MSO(12.0.6545.5004).
Note:
When a file with newlines embedded in a field is loaded into Excel,
the application does not automatically vertically enlarge the respective
rows. It is therefore easy to miss when fields consist of multiple lines.
To make the multiline rows clear:
Examples:
With multiline turned on:
"Conrad\n
Emil",Dinger,40
is read as (Conrad\nEmil,Dinger,40)
With multiline turned off:
"Conrad\n
Emil",Dinger,40
is read as
(Conrad)
(Emil,Dinger,40)
Always:
"Mac ""the knife""",Cohen,30
is read as (Mac "the knife",Cohen,30)
Jane, "nee, Smith",20
is read as (Jane,nee, Smith,20)
That is, the escape character is the double quote, not backslash.
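The reading rules shown in these examples can be approximated with a short field splitter. It is a sketch under assumptions, not the loader's implementation: `RecordParseSketch` and `parse` are hypothetical names, the input is assumed to be one complete logical record, and whitespace outside the quotes is kept verbatim.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordParseSketch {
    /** Splits one logical record into fields, honoring double-quote enclosure and escaping. */
    static List<String> parse(String record, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    // A doubled quote inside a quoted field stands for one literal quote.
                    current.append('"');
                    i++;
                } else {
                    // An opening or closing quote; it is not part of the field value.
                    inQuotes = !inQuotes;
                }
            } else if (c == delimiter && !inQuotes) {
                // An unquoted delimiter ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                // Everything else, including delimiters and newlines inside quotes, is data.
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"Mac \"\"the knife\"\"\",Cohen,30", ','));
    }
}
```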
Known Issues:
When TAB is used as the field delimiter, Excel does not properly handle newlines embedded in fields. Maybe there is a trick...
Nested Class Summary | |
---|---|
static class | CSVExcelStorage.Linebreaks |
static class | CSVExcelStorage.Multiline |
Nested classes/interfaces inherited from interface org.apache.pig.LoadPushDown |
---|
LoadPushDown.OperatorSet, LoadPushDown.RequiredField, LoadPushDown.RequiredFieldList, LoadPushDown.RequiredFieldResponse |
Field Summary | |
---|---|
protected static byte | DOUBLE_QUOTE |
protected org.apache.hadoop.mapreduce.RecordReader | in |
protected static byte | LINEFEED |
protected static byte | NEWLINE |
protected static byte | RECORD_DEL |
Fields inherited from class org.apache.pig.builtin.PigStorage |
---|
caster, mLog, schema, writer |
Constructor Summary | |
---|---|
CSVExcelStorage() | Constructs a CSVExcel load/store that uses comma as the field delimiter, terminates records on reading a newline within a field (even if the field is enclosed in double quotes), and uses LF as line terminator. |
CSVExcelStorage(String delimiter) | Constructs a CSVExcel load/store that uses specified string as a field delimiter. |
CSVExcelStorage(String delimiter, String multilineTreatment) | Constructs a CSVExcel load/store that uses specified string as a field delimiter, and allows specification whether to handle line breaks within fields. |
CSVExcelStorage(String delimiter, String multilineTreatment, String eolTreatment) | Constructs a CSVExcel load/store that uses specified string as a field delimiter, provides choice whether to manage multiline fields, and specifies chars used for end of line. |
Method Summary | |
---|---|
List<LoadPushDown.OperatorSet> | getFeatures() - Determine the operators that can be pushed to the loader. |
org.apache.hadoop.mapreduce.InputFormat | getInputFormat() - This will be called during planning on the front end. |
Tuple | getNext() - Retrieves the next tuple to be processed. |
void | prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader, PigSplit split) - Initializes LoadFunc for reading data. |
LoadPushDown.RequiredFieldResponse | pushProjection(LoadPushDown.RequiredFieldList requiredFieldList) - Indicate to the loader fields that will be needed. |
void | putNext(Tuple tupleToWrite) - Write a tuple to the data store. |
void | setLocation(String location, org.apache.hadoop.mapreduce.Job job) - Communicate to the loader the location of the object(s) being loaded. |
void | setUDFContextSignature(String signature) - This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc. |
Methods inherited from class org.apache.pig.builtin.PigStorage |
---|
checkSchema, cleanupOnFailure, equals, equals, getOutputFormat, getPartitionKeys, getSchema, getStatistics, hashCode, prepareToWrite, relToAbsPathForStoreLocation, setPartitionFilter, setStoreFuncUDFContextSignature, setStoreLocation, storeSchema, storeStatistics |
Methods inherited from class org.apache.pig.FileInputLoadFunc |
---|
getSplitComparable |
Methods inherited from class org.apache.pig.LoadFunc |
---|
getAbsolutePath, getLoadCaster, getPathStrings, join, relativeToAbsolutePath, warn |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.pig.StoreFuncInterface |
---|
checkSchema, cleanupOnFailure, getOutputFormat, prepareToWrite, relToAbsPathForStoreLocation, setStoreFuncUDFContextSignature, setStoreLocation |
Field Detail |
---|
protected static final byte LINEFEED
protected static final byte NEWLINE
protected static final byte DOUBLE_QUOTE
protected static final byte RECORD_DEL
protected org.apache.hadoop.mapreduce.RecordReader in
Constructor Detail |
---|
public CSVExcelStorage()
Constructs a CSVExcel load/store that uses comma as the field delimiter, terminates records on reading a newline within a field (even if the field is enclosed in double quotes), and uses LF as line terminator.
public CSVExcelStorage(String delimiter)
Constructs a CSVExcel load/store that uses the specified string as a field delimiter.
Parameters:
delimiter - the single byte character that is used to separate fields. ("," is the default.)
public CSVExcelStorage(String delimiter, String multilineTreatment)
Constructs a CSVExcel load/store that uses the specified string as a field delimiter, and allows specification of whether to handle line breaks within fields. Example:
STORE a INTO '/tmp/foo.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');
Parameters:
delimiter - the single byte character that is used to separate fields. ("," is the default.)
multilineTreatment - "YES_MULTILINE" or "NO_MULTILINE" ("NO_MULTILINE" is the default.)
public CSVExcelStorage(String delimiter, String multilineTreatment, String eolTreatment)
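Constructs a CSVExcel load/store that uses the specified string as a field delimiter, provides a choice of whether to manage multiline fields, and specifies the characters used for end of line. The eolTreatment parameter is only relevant for STORE:
STORE a INTO '/tmp/foo.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'WINDOWS');
Parameters:
delimiter - the single byte character that is used to separate fields. ("," is the default.)
multilineTreatment - "YES_MULTILINE" or "NO_MULTILINE" ("NO_MULTILINE" is the default.)
eolTreatment - "UNIX", "WINDOWS", or "NOCHANGE" ("NOCHANGE" is the default.)
Method Detail |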
---|
public void putNext(Tuple tupleToWrite) throws IOException
Description copied from interface: StoreFuncInterface
Write a tuple to the data store.
Specified by: putNext in interface StoreFuncInterface
Overrides: putNext in class PigStorage
Parameters:
tupleToWrite - the tuple to store.
Throws:
IOException - if an exception occurs during the write
public Tuple getNext() throws IOException
Description copied from class: LoadFunc
Retrieves the next tuple to be processed.
Overrides: getNext in class PigStorage
Throws:
IOException - if there is an exception while retrieving the next tuple
public void setLocation(String location, org.apache.hadoop.mapreduce.Job job) throws IOException
Description copied from class: LoadFunc
Communicate to the loader the location of the object(s) being loaded. The location string passed to the LoadFunc here is the one returned by LoadFunc.relativeToAbsolutePath(String, Path). Implementations should use this method to communicate the location (and any other information) to the underlying InputFormat through the Job object. This method is called multiple times in the backend, so implementations should ensure that the repeated calls cause no inconsistent side effects.
Overrides: setLocation in class PigStorage
Parameters:
location - Location as returned by LoadFunc.relativeToAbsolutePath(String, Path)
job - the Job object; store or retrieve earlier stored information from the UDFContext
Throws:
IOException - if the location is not valid.
public org.apache.hadoop.mapreduce.InputFormat getInputFormat()
Description copied from class: LoadFunc
This will be called during planning on the front end.
Overrides: getInputFormat in class PigStorage
public void prepareToRead(org.apache.hadoop.mapreduce.RecordReader reader, PigSplit split)
Description copied from class: LoadFunc
Initializes LoadFunc for reading data.
Overrides: prepareToRead in class PigStorage
Parameters:
reader - RecordReader to be used by this instance of the LoadFunc
split - The input PigSplit to process
public LoadPushDown.RequiredFieldResponse pushProjection(LoadPushDown.RequiredFieldList requiredFieldList) throws FrontendException
Description copied from interface: LoadPushDown
Indicate to the loader fields that will be needed.
Specified by: pushProjection in interface LoadPushDown
Overrides: pushProjection in class PigStorage
Parameters:
requiredFieldList - RequiredFieldList indicating which columns will be needed. This structure is read-only; it must not be changed inside pushProjection.
Throws:
FrontendException
public void setUDFContextSignature(String signature)
Description copied from class: LoadFunc
This method will be called by Pig both in the front end and back end to pass a unique signature to the LoadFunc. The signature can be used to store into the UDFContext any information which the LoadFunc needs to store between various method invocations in the front end and back end. A use case is to store the LoadPushDown.RequiredFieldList passed to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in LoadFunc.getNext(). This method will be called before other methods in LoadFunc.
Overrides: setUDFContextSignature in class PigStorage
Parameters:
signature - a unique signature to identify this LoadFunc
public List<LoadPushDown.OperatorSet> getFeatures()
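Description copied from interface: LoadPushDown
Determine the operators that can be pushed to the loader.
Specified by: getFeatures in interface LoadPushDown
Overrides: getFeatures in class PigStorage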