public class SitemapProcessor extends Configured implements Tool
Performs sitemap processing by fetching sitemap links, parsing their content, and merging the URLs from the sitemaps (with their metadata) into the existing crawldb.
Nutch's sitemap processing supports two use cases; for details see: https://wiki.apache.org/nutch/SitemapFeature
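The parsing step described above can be illustrated with a minimal, self-contained sketch that uses only the JDK's DOM parser. It is not Nutch's implementation: the class `SitemapParseSketch` and its `Entry` holder are hypothetical names introduced for illustration, and the sketch assumes a plain `urlset` document in the standard sitemap format (no sitemap index files, no compression).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration class; not part of Nutch.
class SitemapParseSketch {

  /** Hypothetical holder for one <url> entry of a sitemap. */
  static final class Entry {
    final String loc;
    final String lastmod; // null when the sitemap omits <lastmod>

    Entry(String loc, String lastmod) {
      this.loc = loc;
      this.lastmod = lastmod;
    }
  }

  /** Extracts the <loc> and optional <lastmod> of every <url> element. */
  static List<Entry> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Sitemaps come from untrusted hosts, so refuse DOCTYPE declarations.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      List<Entry> entries = new ArrayList<>();
      NodeList urls = doc.getElementsByTagName("url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = childText(url, "loc");
        if (loc != null) { // skip <url> elements without a location
          entries.add(new Entry(loc, childText(url, "lastmod")));
        }
      }
      return entries;
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Not a parseable sitemap", e);
    }
  }

  /** Returns the trimmed text of the first child with the given tag, or null. */
  private static String childText(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
  }

  public static void main(String[] args) {
    String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>https://example.org/</loc><lastmod>2019-01-01</lastmod></url>"
        + "</urlset>";
    for (Entry e : parse(xml)) {
      System.out.println(e.loc + " lastmod=" + e.lastmod);
    }
  }
}
```

In a real crawl the document would first be fetched over HTTP and the extracted entries would then be filtered, normalized, and merged into the crawldb; the sketch covers only the extraction of URL and metadata.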
Modifier and Type | Field and Description |
---|---|
static java.lang.String | CURRENT_NAME |
static java.lang.String | LOCK_NAME |
static org.slf4j.Logger | LOG |
static java.text.SimpleDateFormat | sdf |
static java.lang.String | SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT |
static java.lang.String | SITEMAP_OVERWRITE_EXISTING |
static java.lang.String | SITEMAP_REDIR_MAX |
static java.lang.String | SITEMAP_STRICT_PARSING |
static java.lang.String | SITEMAP_URL_FILTERING |
static java.lang.String | SITEMAP_URL_NORMALIZING |
Constructor and Description |
---|
SitemapProcessor() |
Modifier and Type | Method and Description |
---|---|
static void | main(java.lang.String[] args) |
int | run(java.lang.String[] args) |
void | sitemap(Path crawldb, Path hostdb, Path sitemapUrlDir, boolean strict, boolean filter, boolean normalize, int threads) |
static void | usage() |
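Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable: getConf, setConf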
public static final org.slf4j.Logger LOG
public static final java.text.SimpleDateFormat sdf
public static final java.lang.String CURRENT_NAME
public static final java.lang.String LOCK_NAME
public static final java.lang.String SITEMAP_STRICT_PARSING
public static final java.lang.String SITEMAP_URL_FILTERING
public static final java.lang.String SITEMAP_URL_NORMALIZING
public static final java.lang.String SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
public static final java.lang.String SITEMAP_OVERWRITE_EXISTING
public static final java.lang.String SITEMAP_REDIR_MAX
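The name `SITEMAP_OVERWRITE_EXISTING` suggests a setting that decides whether data from a sitemap replaces an entry already present in the crawldb. The sketch below illustrates that merge policy with plain in-memory maps. It is an assumption about the intended semantics, not Nutch's actual merge code: the class `SitemapMergeSketch`, its `Record` type, and the `overwriteExisting` parameter are hypothetical names, and the real crawldb stores richer per-URL data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch.
class SitemapMergeSketch {

  /** Hypothetical per-URL record: where it came from and when it last changed. */
  static final class Record {
    final String source;
    final long modifiedTime;

    Record(String source, long modifiedTime) {
      this.source = source;
      this.modifiedTime = modifiedTime;
    }
  }

  /**
   * Merges sitemap records into a copy of the existing crawldb map.
   * URLs unknown to the crawldb are always added; URLs already present
   * are replaced only when overwriteExisting is true (assumed semantics).
   */
  static Map<String, Record> merge(Map<String, Record> crawlDb,
                                   Map<String, Record> fromSitemap,
                                   boolean overwriteExisting) {
    Map<String, Record> merged = new LinkedHashMap<>(crawlDb);
    for (Map.Entry<String, Record> e : fromSitemap.entrySet()) {
      if (overwriteExisting || !merged.containsKey(e.getKey())) {
        merged.put(e.getKey(), e.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Record> crawlDb = new LinkedHashMap<>();
    crawlDb.put("https://example.org/a", new Record("crawldb", 100L));

    Map<String, Record> fromSitemap = new LinkedHashMap<>();
    fromSitemap.put("https://example.org/a", new Record("sitemap", 200L));
    fromSitemap.put("https://example.org/b", new Record("sitemap", 300L));

    for (boolean overwrite : new boolean[] {false, true}) {
      Map<String, Record> merged = merge(crawlDb, fromSitemap, overwrite);
      System.out.println("overwrite=" + overwrite + ": /a comes from "
          + merged.get("https://example.org/a").source);
    }
  }
}
```

Returning a new map instead of mutating the input keeps the sketch easy to test; a MapReduce job would apply the same per-key decision in its reduce step.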
public void sitemap(Path crawldb, Path hostdb, Path sitemapUrlDir, boolean strict, boolean filter, boolean normalize, int threads) throws java.lang.Exception
Throws: java.lang.Exception
public static void main(java.lang.String[] args) throws java.lang.Exception
Throws: java.lang.Exception
public static void usage()
Copyright © 2019 The Apache Software Foundation