public class ExemptionUrlFilter extends RegexURLFilter implements URLExemptionFilter
A URLExemptionFilter implementation that uses regex configuration to check whether a URL is eligible for exemption from 'db.ignore.external'.
When this filter is enabled, external URLs are checked against the configured sequence of regex rules.
The exemption rule file defaults to db-ignore-external-exemptions.txt on the classpath, but can be overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml.
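The following is a minimal, self-contained sketch of how a sequence of regex rules could decide whether an external URL is exempted. It is not the Nutch implementation: the class `ExemptionRuleSketch`, its `Rule` type, and the methods `parse` and `isExempted` are hypothetical names, and the rule syntax (a leading `+` or `-` followed by a regex, `#` for comments, first matching rule wins) is an assumption modeled on the usual regex URL filter rule files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in, not the Nutch class: models a sequence of signed regex rules.
class ExemptionRuleSketch {

    // One rule: the sign says whether a match exempts (+) or does not exempt (-) the URL.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Parses rule lines, skipping blank lines and '#' comments (assumed file format).
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
        return rules;
    }

    // Rules are tried in order; the first pattern found in toUrl decides.
    // A URL that matches no rule is treated as not exempted.
    static boolean isExempted(List<Rule> rules, String toUrl) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(toUrl).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(Arrays.asList(
                "# exempt image links, whatever the host",
                "+\\.(?:jpg|png|gif)$",
                "-."));
        System.out.println(isExempted(rules, "http://other.example/logo.png"));
        System.out.println(isExempted(rules, "http://other.example/page.html"));
    }
}
```

In this sketch the catch-all `-.` rule is redundant, because an unmatched URL is already treated as not exempted; it is kept only to make the ordering of rules visible.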
See Also: URLExemptionFilter, RegexURLFilter
Modifier and Type | Field and Description |
---|---|
static java.lang.String | DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE |
Inherited fields: URLFILTER_REGEX_FILE, URLFILTER_REGEX_RULES, X_POINT_ID
Constructor and Description |
---|
ExemptionUrlFilter() |
Modifier and Type | Method and Description |
---|---|
boolean | filter(java.lang.String fromUrl, java.lang.String toUrl): Checks if toUrl is exempted when the ignore external is enabled |
java.util.List<java.util.regex.Pattern> | getExemptions() |
protected java.io.Reader | getRulesReader(Configuration conf): Gets reader for regex rules |
static void | main(java.lang.String[] args) |
Inherited methods:
createRule, createRule
filter, getConf, main, setConf
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait (from java.lang.Object)
getConf, setConf
public static final java.lang.String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE
public java.util.List<java.util.regex.Pattern> getExemptions()
public boolean filter(java.lang.String fromUrl, java.lang.String toUrl)
Specified by: filter in interface URLExemptionFilter
Parameters:
fromUrl - the source url which generated the outlink
toUrl - the destination url which needs to be checked for exemption
protected java.io.Reader getRulesReader(Configuration conf) throws java.io.IOException
Overrides: getRulesReader in class RegexURLFilter
Parameters:
conf - is the current configuration.
Throws: java.io.IOException
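As an illustration of the lookup described for the rule file (a property that overrides a default file name, resolved on the classpath), here is a hedged sketch. It does not use the Nutch or Hadoop APIs: `java.util.Properties` stands in for `Configuration`, and the class `RulesReaderSketch` with its methods `resolveFileName` and `openRules` is hypothetical. Only the property name and the default file name are taken from the class description.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical stand-in: java.util.Properties plays the role of the configuration object.
class RulesReaderSketch {
    // Property name and default file name as given in the class description.
    static final String FILE_PROPERTY = "db.ignore.external.exemptions.file";
    static final String DEFAULT_FILE = "db-ignore-external-exemptions.txt";

    // Returns the configured rule file name, or the default when the property is unset.
    static String resolveFileName(Properties conf) {
        return conf.getProperty(FILE_PROPERTY, DEFAULT_FILE);
    }

    // Opens the rule file from the classpath; fails with IOException if it is missing.
    static Reader openRules(Properties conf) throws IOException {
        String name = resolveFileName(conf);
        InputStream in = RulesReaderSketch.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new IOException("Rule file not found on classpath: " + name);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

The UTF-8 charset and the behavior on a missing file are choices made for this sketch, not documented behavior of getRulesReader.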
public static void main(java.lang.String[] args)
Copyright © 2018 The Apache Software Foundation