public class LinksIndexingFilter extends Object implements IndexingFilter
An IndexingFilter that adds outlinks and inlinks field(s) to the document.
To ignore outlinks that point to the same host as the URL being indexed,
use the following settings in your configuration file:
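A sketch of such settings, assuming the property names used by the index-links plugin (`index.links.outlinks.host.ignore`, `index.links.inlinks.host.ignore`, `index.links.hosts.only`). These names are not stated on this page, so check them against the `nutch-default.xml` shipped with your Nutch version before relying on them:

```xml
<!-- Assumed property names; verify against your nutch-default.xml. -->
<property>
  <name>index.links.outlinks.host.ignore</name>
  <value>true</value>
</property>

<!-- Assumed counterpart for inlinks from the same host. -->
<property>
  <name>index.links.inlinks.host.ignore</name>
  <value>true</value>
</property>

<!-- Assumed switch for storing only the host part of each link. -->
<property>
  <name>index.links.hosts.only</name>
  <value>false</value>
</property>
```

These would normally go in `nutch-site.xml`, which overrides the defaults.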
Modifier and Type | Field and Description |
---|---|
static String | LINKS_INLINKS_HOST |
static String | LINKS_ONLY_HOSTS |
static String | LINKS_OUTLINKS_HOST |
static org.slf4j.Logger | LOG |
Fields inherited from interface IndexingFilter: X_POINT_ID
Constructor and Description |
---|
LinksIndexingFilter() |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) Adds fields or otherwise modifies the document that will be indexed for a parse. |
Configuration | getConf() |
void | setConf(Configuration conf) |
public static final String LINKS_OUTLINKS_HOST
public static final String LINKS_INLINKS_HOST
public static final String LINKS_ONLY_HOSTS
public static final org.slf4j.Logger LOG
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Specified by: filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page URL
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Throws:
IndexingException
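To illustrate the same-host rule mentioned in the class summary, here is a minimal sketch that uses only the Java standard library. `SameHostLinkFilter`, `hostOf`, and `dropSameHost` are hypothetical names, not part of Nutch. The real filter works on Nutch's own link objects, and its host comparison may differ in detail.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the Nutch implementation.
class SameHostLinkFilter {

  // Returns the lower-cased host of a URL, or null if it cannot be determined.
  static String hostOf(String url) {
    try {
      String host = new URI(url).getHost();
      return host == null ? null : host.toLowerCase(Locale.ROOT);
    } catch (URISyntaxException e) {
      return null;
    }
  }

  // Keeps only the links whose host differs from the host of pageUrl.
  // Links whose host cannot be determined are kept unchanged.
  static List<String> dropSameHost(String pageUrl, List<String> links) {
    String pageHost = hostOf(pageUrl);
    List<String> kept = new ArrayList<>();
    for (String link : links) {
      String linkHost = hostOf(link);
      if (pageHost == null || linkHost == null || !linkHost.equals(pageHost)) {
        kept.add(link);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> outlinks = Arrays.asList(
        "http://example.com/about",
        "http://other.org/page");
    // Prints the outlinks that leave the host of the indexed page.
    System.out.println(dropSameHost("http://example.com/index.html", outlinks));
  }
}
```

Comparing lower-cased hosts avoids treating `EXAMPLE.com` and `example.com` as different sites. Keeping unparseable links is a conservative choice made for this sketch only.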
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
public Configuration getConf()
Specified by: getConf in interface Configurable
Copyright © 2016 The Apache Software Foundation