Description:
This processor pulls files from HDFS. The files being pulled MUST be SequenceFile-formatted files. The processor creates a FlowFile for each key/value entry in the ingested SequenceFile.
The created FlowFile's content depends on the value of the optional configuration property FlowFile Content. Currently, there are two choices: VALUE ONLY and KEY VALUE PAIR. With the former, only the SequenceFile value element is written to the FlowFile content. With the latter, both the SequenceFile key and value are written to the FlowFile content as serialized objects; the format is key length (int), key (String), value length (int), value (bytes). The default is VALUE ONLY.
NOTE: This processor loads the entire value entry into memory. Although the size limit for a value entry is 2 GB, memory problems can occur if there are many concurrent tasks and the data being ingested is large.
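For example, a downstream consumer of KEY VALUE PAIR content can recover the key and value by reading the fields in the order listed above. The following is a minimal, illustrative Java sketch, assuming the layout described above (a 4-byte key length, the key bytes, a 4-byte value length, and the value bytes) and a UTF-8 key encoding; it reads the content from a local file whose path is supplied as the first argument. The class name and the encoding assumption are illustrative, not part of the processor.

  import java.io.DataInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;

  // Minimal sketch of a reader for KEY VALUE PAIR content that has been saved to a
  // local file. Assumes: 4-byte key length, key bytes (UTF-8), 4-byte value length,
  // value bytes. These assumptions follow the format description above.
  public class KeyValuePairContentReader {
      public static void main(String[] args) throws IOException {
          try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
              int keyLength = in.readInt();        // key length (int)
              byte[] keyBytes = new byte[keyLength];
              in.readFully(keyBytes);              // key
              String key = new String(keyBytes, StandardCharsets.UTF_8);

              int valueLength = in.readInt();      // value length (int)
              byte[] value = new byte[valueLength];
              in.readFully(value);                 // value (bytes)

              System.out.println("key=" + key + ", value=" + value.length + " bytes");
          }
      }
  }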
Properties:
In the list below, the names of required properties appear in bold. Any other properties (not in bold) are
considered optional. If a property has a default value, it is indicated. If a property supports the use of the
NiFi Expression Language (or simply, "expression language"), that is also indicated. An example configuration
with illustrative values follows the property list.
- Hadoop Configuration Resources
- A file or comma-separated list of files that contain the Hadoop file system configuration.
Without this, Hadoop will search the classpath for 'core-site.xml' and 'hdfs-site.xml' files or will
revert to a default configuration.
- Default value: none
- FlowFile Content
- Indicates whether the FlowFile content should be both the key and the value of the SequenceFile entry, or just the value.
- Default value: VALUE ONLY
- Directory
- The HDFS directory from which files should be read.
- Default value: none
- Recurse Subdirectories
- A Boolean value (true/false); when true, files will also be pulled from subdirectories of the HDFS directory.
- Default value: true
- Keep Source File
- A Boolean value (true/false) indicating whether to keep (true) or delete (false) the file in HDFS
after it has been successfully transferred.
- Default value: false
- File Filter Regex
- A Java Regular Expression for filtering filenames; if a filter is supplied, only files whose
names match the Regular Expression will be fetched; otherwise, all files will be fetched.
- Default value: none
- Filter Match Name Only
- A Boolean value (true/false); when true, the File Filter Regex is matched against the filename only;
otherwise, subdirectory names are included with the filename in the regex comparison.
- Default value: true
- Ignore Dotted Files
- A Boolean value (true/false); when true, files whose names begin with a dot (".") will not
be fetched.
- Default value: true
- Minimum File Age
- The minimum age that a file must be in order to be fetched; any file that is younger than this
amount of time (based on last modification time) will be ignored. The value must be a non-negative
integer and be followed by a time unit, such as nanos, millis, secs, mins, hrs, days.
- Default value: 0 sec
- Maximum File Age
- The maximum age that a file can be in order to be fetched; any file that is older than this amount
of time (based on last modification time) will be ignored. The value must be a non-negative integer
followed by a time unit, such as nanos, millis, secs, mins, hrs, days. Cannot be less than 100 millis.
- Default value: none
- Polling Interval
- Indicates how long to wait between performing directory listings. The value must be a non-negative
integer and be followed by a time unit, such as nanos, millis, secs, mins, hrs, days.
- Default value: 0 sec
- Batch Size
- The maximum number of files to pull in each iteration, based on the configured run schedule.
- Default value: 100
- IO Buffer Size
- The amount of memory to use to buffer file contents during IO. The value must be a data size: an
integer followed by a unit of B, KB, MB, GB, or TB. This overrides the corresponding value in the Hadoop configuration.
- Default value: none
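For illustration only, a configuration that pulls SequenceFiles whose names end in ".seq" from a single HDFS directory might use values such as the following (all paths and values are examples, not defaults):

  Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
  Directory: /data/sequencefiles
  Recurse Subdirectories: false
  Keep Source File: false
  File Filter Regex: .*\.seq
  Minimum File Age: 30 secs
  Batch Size: 50

Because Keep Source File is false here (as it is by default), matching files are deleted from HDFS after a successful transfer; set it to true to leave the source data in place.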
Relationships:
- success
- All files retrieved from HDFS are transferred to this relationship.
- passthrough
- If this processor has an input queue for some reason, then FlowFiles arriving on that input are
transferred to this relationship.
See Also: