Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.service.impl | |

Modifier and Type | Class and Description |
---|---|
class | CrawlDb: This class takes the output of the fetcher and updates the crawldb accordingly. |
class | DeduplicationJob: Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). |
class | Generator: Generates a subset of a crawl db to fetch. |
class | Injector: Injector takes a flat file of URLs and merges ("injects") these URLs into the CrawlDb. |
class | LinkDb: Maintains an inverted link map, listing incoming links for each URL. |
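
The grouping rule described for DeduplicationJob can be illustrated with a minimal in-memory sketch. This is not the Nutch implementation: the `DedupSketch` class, its `Entry` record, and the `markDuplicates` method are hypothetical names introduced here, and keeping the first-seen entry on a score tie is an assumption the description does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy version of digest-based deduplication. */
class DedupSketch {

    /** Hypothetical record; real crawldb entries carry more fields. */
    static final class Entry {
        final String url;
        final String digest;
        final float score;
        boolean duplicate;

        Entry(String url, String digest, float score) {
            this.url = url;
            this.digest = digest;
            this.score = score;
        }
    }

    /**
     * Groups entries by digest and flags every entry in a group as a
     * duplicate except the one with the highest score.
     * Assumption: on a tie, the entry seen first is kept.
     */
    static void markDuplicates(List<Entry> entries) {
        Map<String, Entry> bestByDigest = new HashMap<>();
        for (Entry e : entries) {
            Entry best = bestByDigest.get(e.digest);
            if (best == null) {
                // First entry seen for this digest.
                bestByDigest.put(e.digest, e);
            } else if (e.score > best.score) {
                // New winner: the previous best becomes a duplicate.
                best.duplicate = true;
                bestByDigest.put(e.digest, e);
            } else {
                e.duplicate = true;
            }
        }
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("http://a.example/", "d1", 1.0f));
        entries.add(new Entry("http://b.example/", "d1", 2.5f));
        entries.add(new Entry("http://c.example/", "d2", 0.3f));
        markDuplicates(entries);
        for (Entry e : entries) {
            System.out.println(e.url + " duplicate=" + e.duplicate);
        }
    }
}
```

A single pass with a map keyed by digest is enough here because only the current best entry per digest has to be remembered.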

Modifier and Type | Class and Description |
---|---|
class | Fetcher: A queue-based fetcher. |

Modifier and Type | Class and Description |
---|---|
class | IndexingJob: Generic indexer which relies on the plugins implementing IndexWriter. |

Modifier and Type | Class and Description |
---|---|
class | ParseSegment |

Modifier and Type | Method and Description |
---|---|
NutchTool | JobFactory.createToolByClassName(String className, Configuration conf) |
NutchTool | JobFactory.createToolByType(JobManager.JobType type, Configuration conf) |
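
The two factory methods above suggest a common pattern: look up a tool class by name, or map an enumerated job type to a class and then instantiate it. The following is a minimal sketch of that pattern using plain Java reflection. It does not reproduce JobFactory: the `Tool` interface, the `Kind` enum, the two tool classes, and the method names are hypothetical stand-ins, and the Hadoop `Configuration` argument is omitted.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative only: a reflection-based factory, not Nutch's JobFactory. */
class ToolFactorySketch {

    /** Hypothetical stand-in for a tool interface. */
    public interface Tool {
        String name();
    }

    public static class EchoTool implements Tool {
        public String name() { return "echo"; }
    }

    public static class NoopTool implements Tool {
        public String name() { return "noop"; }
    }

    /** Hypothetical job types; the real enum values are not shown here. */
    public enum Kind { ECHO, NOOP }

    private static final Map<Kind, Class<? extends Tool>> BY_KIND =
            new EnumMap<>(Kind.class);
    static {
        BY_KIND.put(Kind.ECHO, EchoTool.class);
        BY_KIND.put(Kind.NOOP, NoopTool.class);
    }

    /** Loads the named class and calls its public no-argument constructor. */
    public static Tool createByClassName(String className) {
        try {
            Class<? extends Tool> clazz =
                    Class.forName(className).asSubclass(Tool.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot create tool: " + className, e);
        }
    }

    /** Resolves the type to a class, then delegates to the by-name path. */
    public static Tool createByKind(Kind kind) {
        return createByClassName(BY_KIND.get(kind).getName());
    }

    public static void main(String[] args) {
        Tool tool = createByKind(Kind.ECHO);
        System.out.println(tool.name());
    }
}
```

Routing the by-type path through the by-name path keeps instantiation and error handling in one place; whether the real factory is structured this way is not stated in this listing.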

Constructor and Description |
---|
JobWorker(JobConfig jobConfig, Configuration conf, NutchTool tool): Initializes the JobWorker thread with the job configuration provided by the user. |
Copyright © 2016 The Apache Software Foundation