Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.service.impl | |

Modifier and Type | Class and Description |
---|---|
class | CrawlDb: Takes the output of the fetcher and updates the crawldb accordingly. |
class | DeduplicationJob: Generic deduplicator that groups fetched URLs with the same digest and marks all of them as duplicates except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the indexed score). |
class | Generator: Generates a subset of a crawl db to fetch. |
class | Injector: Takes a flat file of URLs and adds them to the database of pages to be crawled. |
class | LinkDb: Maintains an inverted link map, listing incoming links for each URL. |
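
The DeduplicationJob description above amounts to a group-by-digest pass that keeps one winner per group. The following is a minimal, self-contained sketch of that idea, not Nutch's implementation: `DedupSketch`, `Entry`, and `markDuplicates` are hypothetical names, the data lives in memory rather than in a crawldb, and ties on score keep the first entry seen, which is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    /** Hypothetical stand-in for a crawldb record; not a Nutch class. */
    static final class Entry {
        final String url;
        final String digest;   // content signature shared by identical pages
        final float score;     // score as stored in the crawldb
        boolean duplicate;     // set by markDuplicates

        Entry(String url, String digest, float score) {
            this.url = url;
            this.digest = digest;
            this.score = score;
        }
    }

    /**
     * Groups entries by digest and flags every entry except the
     * highest-scoring one in each group as a duplicate.
     * Assumption: on equal scores, the entry seen first is kept.
     */
    static void markDuplicates(List<Entry> entries) {
        Map<String, Entry> bestByDigest = new HashMap<>();
        for (Entry e : entries) {
            Entry best = bestByDigest.get(e.digest);
            if (best == null) {
                bestByDigest.put(e.digest, e);      // first entry with this digest
            } else if (e.score > best.score) {
                best.duplicate = true;              // previous winner loses
                bestByDigest.put(e.digest, e);
            } else {
                e.duplicate = true;                 // current entry loses
            }
        }
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("http://example.com/a", "d1", 1.0f),
                new Entry("http://example.com/b", "d1", 2.5f),
                new Entry("http://example.com/c", "d2", 0.3f));
        markDuplicates(entries);
        for (Entry e : entries) {
            System.out.println(e.url + " duplicate=" + e.duplicate);
        }
    }
}
```

In this sample run, only `http://example.com/a` is flagged, because it shares digest `d1` with a higher-scoring entry.
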

Modifier and Type | Class and Description |
---|---|
class | Fetcher: A queue-based fetcher. |

Modifier and Type | Class and Description |
---|---|
class | ParseSegment |

Modifier and Type | Method and Description |
---|---|
NutchTool | JobFactory.createToolByClassName(String className, Configuration conf) |
NutchTool | JobFactory.createToolByType(JobManager.JobType type, Configuration conf) |
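
The two factory methods listed above differ only in how the tool is looked up: by a job-type key or by a class name. The sketch below illustrates that pattern with plain reflection. It is self-contained and does not use Nutch: `ToolFactorySketch`, `Tool`, `InjectTool`, `FetchTool`, and the two-value `JobType` enum are hypothetical stand-ins, the `Configuration` argument is omitted, and nothing here should be read as the actual behavior of `JobFactory`.

```java
import java.util.EnumMap;
import java.util.Map;

public class ToolFactorySketch {

    /** Hypothetical stand-in for the tool type returned by the factory. */
    public interface Tool {
        String name();
    }

    public static class InjectTool implements Tool {
        public String name() { return "inject"; }
    }

    public static class FetchTool implements Tool {
        public String name() { return "fetch"; }
    }

    /** Hypothetical job types; the real enum is not shown in this page. */
    public enum JobType { INJECT, FETCH }

    // Lookup table from job type to the class that implements it.
    private static final Map<JobType, Class<? extends Tool>> TYPE_TO_CLASS =
            new EnumMap<>(JobType.class);
    static {
        TYPE_TO_CLASS.put(JobType.INJECT, InjectTool.class);
        TYPE_TO_CLASS.put(JobType.FETCH, FetchTool.class);
    }

    // Creates an instance through the public no-argument constructor.
    private static Tool instantiate(Class<? extends Tool> clz) {
        try {
            return clz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clz.getName(), e);
        }
    }

    /** Resolves the class from the lookup table, then instantiates it. */
    public static Tool createToolByType(JobType type) {
        Class<? extends Tool> clz = TYPE_TO_CLASS.get(type);
        if (clz == null) {
            throw new IllegalArgumentException("Unknown job type: " + type);
        }
        return instantiate(clz);
    }

    /** Loads the class by its binary name, checks its type, then instantiates it. */
    public static Tool createToolByClassName(String className) {
        try {
            return instantiate(Class.forName(className).asSubclass(Tool.class));
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown tool class: " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createToolByType(JobType.INJECT).name());
        System.out.println(createToolByClassName(FetchTool.class.getName()).name());
    }
}
```

Lookup by type keeps callers independent of concrete class names, while lookup by class name lets a caller request a tool that has no entry in the table.
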

Constructor and Description |
---|
JobWorker(JobConfig jobConfig, Configuration conf, NutchTool tool): Initializes the JobWorker thread with the job configuration provided by the user. |
Copyright © 2015 The Apache Software Foundation