net.nutch.mapReduce
Interface InputFormat

All Known Implementing Classes:
TextInputFormat

public interface InputFormat

An input data format. Input files are stored in a NutchFileSystem, and the processing of a single input file may be split across multiple machines. Each file is processed as a sequence of records, read through a RecordReader, so files must be split on record boundaries.
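The record-boundary requirement can be illustrated with a small sketch. It assumes newline-delimited text records held in memory, and uses one common convention: a reader owns every line whose first byte falls inside its byte range, skips a partial first line, and reads past the end of its range to finish its last line. The class and method names (LineBoundarySketch, readLines) are hypothetical and are not part of the Nutch API; an actual implementation may handle boundaries differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of net.nutch.mapReduce.
final class LineBoundarySketch {
    // Returns the lines whose first byte lies in [start, start + length).
    static List<String> readLines(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<>();
        if (start < 0 || start >= data.length) {
            return lines; // nothing to read for an empty or out-of-range section
        }
        int end = Math.min(data.length, start + length);
        int pos = start;
        // If the range begins mid-record, that record belongs to the previous
        // range: skip forward to the byte after the next newline.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        // Read whole lines; the last one may extend beyond 'end'.
        while (pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }
}
```

With this convention, two adjacent ranges that cut a line in half still yield every record exactly once, regardless of where the byte offsets fall.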


Nested Class Summary
static interface InputFormat.Split
          A section of an input file.
 
Method Summary
 RecordReader getRecordReader(InputFormat.Split split)
          Construct a RecordReader for an InputFormat.Split.
 InputFormat.Split[] getSplits(NutchFileSystem fs, File[] files, int numSplits)
          Splits a set of input files.
 

Method Detail

getSplits

public InputFormat.Split[] getSplits(NutchFileSystem fs,
                                     File[] files,
                                     int numSplits)
                              throws IOException
Splits a set of input files. One split is created per map task.

Parameters:
fs - the filesystem containing the files to be split
files - the input files to split
numSplits - the desired number of splits
Returns:
the splits
Throws:
IOException
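
As a rough illustration of why numSplits is only a desired count, the sketch below divides a set of files into byte ranges of near-equal size. It is a minimal sketch under stated assumptions: ByteRange is a hypothetical stand-in for InputFormat.Split, files are represented by name and length rather than through a NutchFileSystem, and record boundaries are ignored. It does not describe how any Nutch implementation actually computes splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for InputFormat.Split: a byte range within one file.
final class ByteRange {
    final String file;
    final long start;
    final long length;

    ByteRange(String file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }
}

// Hypothetical sketch: not part of net.nutch.mapReduce.
final class SplitSketch {
    // Cuts each file into ranges of at most ceil(totalBytes / numSplits) bytes.
    static List<ByteRange> computeSplits(String[] names, long[] lengths, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        // Target size per range, rounded up; at least 1 so the loop advances.
        long target = Math.max(1, (total + numSplits - 1) / numSplits);
        List<ByteRange> out = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            long offset = 0;
            while (offset < lengths[i]) {
                long len = Math.min(target, lengths[i] - offset);
                out.add(new ByteRange(names[i], offset, len));
                offset += len;
            }
        }
        return out;
    }
}
```

Because a range in this sketch never spans two files, the number of ranges returned can exceed numSplits when file lengths are not multiples of the target size.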

getRecordReader

public RecordReader getRecordReader(InputFormat.Split split)
                             throws IOException
Construct a RecordReader for an InputFormat.Split.

Parameters:
split - the split
Returns:
a RecordReader
Throws:
IOException
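
The two methods are meant to be used together: the input is first divided into splits, and a separate reader is then constructed for each split. The sketch below shows that flow with simplified, hypothetical stand-ins (SketchFormat, SketchReader, InMemoryFormat, DriverSketch). The signatures of the real InputFormat and RecordReader differ, and the RecordReader methods are not shown on this page, so this should be read as an outline of the calling pattern only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for RecordReader.
interface SketchReader {
    String next(); // returns null when the split is exhausted
}

// Hypothetical, simplified stand-in for InputFormat.
interface SketchFormat {
    List<List<String>> getSplits(List<String> records, int numSplits);
    SketchReader getRecordReader(List<String> split);
}

// Splits an in-memory list of records into contiguous chunks.
final class InMemoryFormat implements SketchFormat {
    public List<List<String>> getSplits(List<String> records, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        int per = Math.max(1, (records.size() + numSplits - 1) / numSplits);
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < records.size(); i += per) {
            splits.add(records.subList(i, Math.min(records.size(), i + per)));
        }
        return splits;
    }

    public SketchReader getRecordReader(List<String> split) {
        Iterator<String> it = split.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }
}

final class DriverSketch {
    // One reader per split; in a real job each split would go to its own map task.
    static List<String> readAll(SketchFormat format, List<String> records, int numSplits) {
        List<String> seen = new ArrayList<>();
        for (List<String> split : format.getSplits(records, numSplits)) {
            SketchReader reader = format.getRecordReader(split);
            for (String r = reader.next(); r != null; r = reader.next()) {
                seen.add(r);
            }
        }
        return seen;
    }
}
```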


Copyright © 2005 The Nutch Organization.