Description:
This processor reads files from an HDFS cluster into NiFi FlowFiles.
Modifies Attributes:
Attribute Name | Description
filename | The name of the file that was read from HDFS.
path | The relative path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "./". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "abc/1/2/3".
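The rule for the path attribute can be illustrated with a short sketch. This is not the processor's own code: the function name path_attribute and its arguments are hypothetical, and Python's posixpath module is used only to reproduce the two examples given above.

```python
import posixpath


def path_attribute(directory: str, file_path: str) -> str:
    """Return the 'path' attribute for a file found under `directory`.

    Hypothetical helper: it mirrors the two documented examples and is
    not taken from the NiFi source.
    """
    # Directory that actually contains the file on HDFS.
    parent = posixpath.dirname(file_path)
    # Express that directory relative to the configured Directory property.
    relative = posixpath.relpath(parent, directory)
    # A file sitting directly in Directory is reported as "./".
    return "./" if relative == "." else relative


if __name__ == "__main__":
    print(path_attribute("/tmp", "/tmp/data.csv"))
    print(path_attribute("/tmp", "/tmp/abc/1/2/3/data.csv"))
```

The file name data.csv is a placeholder chosen for the example; only the directory part influences the result.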
Properties:
In the list below, the names of required properties appear in bold. Any other properties (not in bold) are
considered optional. If a property has a default value, it is indicated. If a property supports the use of the
NiFi Expression Language (or simply, "expression language"), that is also indicated.
- Hadoop Configuration Resources
- A file or comma-separated list of files that contain the Hadoop file system configuration.
Without this, Hadoop will search the classpath for a 'core-site.xml' and an 'hdfs-site.xml' file, or will
revert to a default configuration.
- Default value: none
- Directory
- The HDFS directory from which FlowFile content should be read.
- Default value: none
- Recurse Subdirectories
- A Boolean value (true/false). When true, files are also pulled from subdirectories of the HDFS Directory.
- Default value: true
- Keep Source File
- A Boolean value (true/false) that indicates whether to keep (true) or delete (false) the file on HDFS
after it has been successfully transferred.
- Default value: false
- File Filter Regex
- A Java regular expression for filtering filenames. If a filter is supplied, only files whose
names match that regular expression are fetched; otherwise, all files are fetched.
- Default value: none
- Filter Match Name Only
- A Boolean value (true/false). When true, File Filter Regex is matched against the filename only;
otherwise, subdirectory names are included with the filename in the regex comparison.
- Default value: true
- Ignore Dotted Files
- A Boolean value (true/false). When true, files whose names begin with a dot (".") are not
fetched.
- Default value: true
- Minimum File Age
- The minimum age a file must have in order to be fetched; any file that is younger than this
amount of time (based on last modification time) will be ignored. The value must be a non-negative
integer followed by a time unit, such as nanos, millis, secs, mins, hrs, days.
- Default value: 0 sec
- Maximum File Age
- The maximum age a file may have in order to be fetched; any file that is older than this amount
of time (based on last modification time) will be ignored. The value must be a non-negative integer
followed by a time unit, such as nanos, millis, secs, mins, hrs, days. It cannot be less than 100 millis.
- Default value: none
- Polling Interval
- Indicates how long to wait between performing directory listings. The value must be a non-negative
integer followed by a time unit, such as nanos, millis, secs, mins, hrs, days.
- Default value: 0 sec
- Batch Size
- The maximum number of files to pull in each iteration, based on the configured run schedule.
- Default value: 100
- IO Buffer Size
- Amount of memory to use to buffer file contents during IO. The value is a data size: an integer
followed by a unit of B, KB, MB, GB, or TB. This overrides the Hadoop Configuration.
- Default value: none
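How the filtering properties above combine can be sketched as follows. This is an illustrative model, not the processor's implementation: the Candidate class and the functions is_selected and select_batch are hypothetical names, and ages are given directly in milliseconds. Python's re module stands in for Java regular expressions, which is adequate only for simple patterns, and the sketch assumes that the whole name (or relative path) must match the regex.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a file found in a directory listing."""
    relative_path: str  # path relative to Directory, e.g. "sub/b.csv"
    age_millis: int     # time elapsed since the last modification


def is_selected(c: Candidate,
                file_filter_regex: Optional[str] = None,
                match_name_only: bool = True,
                ignore_dotted: bool = True,
                min_age_millis: int = 0,
                max_age_millis: Optional[int] = None) -> bool:
    name = c.relative_path.rsplit("/", 1)[-1]
    # Ignore Dotted Files: skip names that begin with a dot.
    if ignore_dotted and name.startswith("."):
        return False
    # Minimum File Age: skip files younger than the threshold.
    if c.age_millis < min_age_millis:
        return False
    # Maximum File Age: skip files older than the threshold, if one is set.
    if max_age_millis is not None and c.age_millis > max_age_millis:
        return False
    # File Filter Regex, compared with the name only or with the
    # subdirectory names included (Filter Match Name Only).
    if file_filter_regex is not None:
        target = name if match_name_only else c.relative_path
        # Assumption: the entire string must match the expression.
        if re.fullmatch(file_filter_regex, target) is None:
            return False
    return True


def select_batch(candidates: List[Candidate], batch_size: int = 100,
                 **filters) -> List[Candidate]:
    """Apply the filters, then cap the result at Batch Size files."""
    return [c for c in candidates if is_selected(c, **filters)][:batch_size]


if __name__ == "__main__":
    listing = [
        Candidate("a.csv", 5000),
        Candidate(".hidden.csv", 5000),
        Candidate("sub/b.csv", 5000),
        Candidate("c.txt", 5000),
        Candidate("new.csv", 10),
    ]
    picked = select_batch(listing, file_filter_regex=r".*\.csv",
                          min_age_millis=1000)
    print([c.relative_path for c in picked])
```

With Filter Match Name Only left at true, the pattern is tested against b.csv rather than sub/b.csv, so a pattern that refers to a subdirectory name only takes effect when that property is set to false.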
Relationships:
- success
- All files retrieved from HDFS are transferred to this relationship.
- passthrough
- If this processor has an input queue, FlowFiles arriving on that input are
transferred to this relationship.
See Also: