Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). |
org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.tld | Top-level domain (TLD) scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL meta tag scoring plugin. |
org.apache.nutch.segment | A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links. |
Modifier and Type | Method and Description |
---|---|
void | LinkDb.map(org.apache.hadoop.io.Text key, ParseData parseData, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,Inlinks> output, org.apache.hadoop.mapred.Reporter reporter) |
Modifier and Type | Method and Description |
---|---|
ParseData | ParseImpl.getData() |
ParseData | Parse.getData() - Other data extracted from the page. |
static ParseData | ParseData.read(DataInput in) |
Modifier and Type | Method and Description |
---|---|
void | ParseResult.put(String key, ParseText text, ParseData data) - Store a result of parsing. |
void | ParseResult.put(org.apache.hadoop.io.Text key, ParseText text, ParseData data) - Store a result of parsing. |
Constructor and Description |
---|
ParseImpl(ParseText text, ParseData data) |
ParseImpl(ParseText text, ParseData data, boolean isCanonical) |
ParseImpl(String text, ParseData data) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | ScoringFilters.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
CrawlDatum | ScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) - Distribute score value from the current page to all its outlinked pages. |
CrawlDatum | AbstractScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
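As a rough illustration of how a composite like ScoringFilters can delegate score distribution to each registered ScoringFilter in turn, here is a minimal, self-contained sketch. The interface, class, and method names below are simplified stand-ins invented for this example (the real API uses Hadoop Text and CrawlDatum types and plugin-based configuration); the point is only the chaining pattern, where each filter's adjustment feeds the next.

```java
import java.util.ArrayList;
import java.util.List;

public class ScoringChainSketch {

    // Hypothetical, simplified stand-in for Nutch's ScoringFilter: adjusts a
    // score contribution that will be spread over a page's outlinks.
    interface SimpleScoringFilter {
        float distributeScoreToOutlinks(String fromUrl, float adjust, int allCount);
    }

    // Composite behaviour: delegate to each filter in turn, feeding the
    // previous filter's adjustment into the next one.
    static float runChain(List<SimpleScoringFilter> filters,
                          String fromUrl, float adjust, int allCount) {
        for (SimpleScoringFilter f : filters) {
            adjust = f.distributeScoreToOutlinks(fromUrl, adjust, allCount);
        }
        return adjust;
    }

    public static void main(String[] args) {
        List<SimpleScoringFilter> filters = new ArrayList<>();
        // First filter: split the score evenly across the outlinks.
        filters.add((url, adjust, n) -> adjust / n);
        // Second filter: dampen the contribution by a fixed factor.
        filters.add((url, adjust, n) -> adjust * 0.85f);
        System.out.println(runChain(filters, "http://example.com/", 1.0f, 4));
    }
}
```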
Modifier and Type | Method and Description |
---|---|
CrawlDatum | DepthScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
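The org.apache.nutch.scoring.depth package description above says crawling stops at a configurable number of "hops" from the seed URLs. The idea can be sketched as follows; this is a hypothetical simplification with invented names (the real DepthScoringFilter tracks depth in CrawlDatum metadata), not the Nutch API itself.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthLimitSketch {

    // Keep a page's outlinks only while they are still within the configured
    // maximum depth; each outlink is one more "hop" from the seed URL.
    static List<String> keepOutlinks(List<String> outlinks, int parentDepth, int maxDepth) {
        List<String> kept = new ArrayList<>();
        int childDepth = parentDepth + 1;   // one hop further from the seed
        if (childDepth <= maxDepth) {
            kept.addAll(outlinks);          // still within the crawl horizon
        }
        return kept;                        // beyond maxDepth: stop crawling
    }

    public static void main(String[] args) {
        List<String> links = List.of("http://a.example/", "http://b.example/");
        System.out.println(keepOutlinks(links, 1, 3).size()); // within depth
        System.out.println(keepOutlinks(links, 3, 3).size()); // past depth
    }
}
```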
Modifier and Type | Method and Description |
---|---|
CrawlDatum | LinkAnalysisScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | OPICScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) - Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks, and apply the result to each outlink. |
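The OPIC description above states the core arithmetic directly: take the page's score, divide it by the number of outlinks, and give each outlink an equal share. A minimal, self-contained sketch of just that arithmetic (the real OPICScoringFilter operates on CrawlDatum objects and reads the score from Fetcher.SCORE_KEY; the names here are illustrative):

```java
import java.util.Arrays;

public class OpicShareSketch {

    // Divide a page's score evenly among its outlinks, as the OPIC filter's
    // description specifies: score / outlinkCount per outlink.
    static float[] distribute(float pageScore, int outlinkCount) {
        float share = pageScore / outlinkCount;  // equal share per outlink
        float[] shares = new float[outlinkCount];
        Arrays.fill(shares, share);
        return shares;
    }

    public static void main(String[] args) {
        float[] shares = distribute(1.0f, 4);
        System.out.println(shares[0]); // each of the 4 outlinks gets 0.25
    }
}
```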
Modifier and Type | Method and Description |
---|---|
CrawlDatum | TLDScoringFilter.distributeScoreToOutlink(org.apache.hadoop.io.Text fromUrl, org.apache.hadoop.io.Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
CrawlDatum | TLDScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum | URLMetaScoringFilter.distributeScoreToOutlinks(org.apache.hadoop.io.Text fromUrl, ParseData parseData, Collection<Map.Entry<org.apache.hadoop.io.Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) - Takes the meta tags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
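The urlmeta description above can be pictured as a lookup-and-copy step: for each tag name configured in "urlmeta.tags", check the parsed page's metadata and, if present, propagate the value to the outlink. The sketch below is a hypothetical simplification, with plain Maps standing in for Nutch's ParseData and CrawlDatum metadata and an invented method name.

```java
import java.util.HashMap;
import java.util.Map;

public class UrlMetaSketch {

    // For each configured tag, look it up in the page's parse metadata and
    // copy any value found onto the outlink's metadata.
    static Map<String, String> propagate(String[] configuredTags,
                                         Map<String, String> parseMeta) {
        Map<String, String> outlinkMeta = new HashMap<>();
        for (String tag : configuredTags) {
            String value = parseMeta.get(tag);   // look for the tag in parseData
            if (value != null) {
                outlinkMeta.put(tag, value);     // propagate it to the outlink
            }
        }
        return outlinkMeta;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = Map.of("keywords", "nutch,crawl", "other", "x");
        // Only "keywords" is both configured and present, so only it survives.
        System.out.println(propagate(new String[] {"keywords", "author"}, parseMeta));
    }
}
```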
Modifier and Type | Method and Description |
---|---|
boolean | SegmentMergeFilters.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) - Iterates over all SegmentMergeFilter extensions; if any of them returns false, this method returns false as well. |
boolean | SegmentMergeFilter.filter(org.apache.hadoop.io.Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked) - The filtering method, which receives all information being merged for a given key (URL). |
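The veto behaviour described above (any one filter returning false rejects the entry) follows a common composite-filter pattern, sketched here in a self-contained form. Simple predicates over a URL key stand in for the real SegmentMergeFilter extension point and its many parameters; the class and method names are illustrative only.

```java
import java.util.List;
import java.util.function.Predicate;

public class MergeFilterSketch {

    // Iterate over all registered filters and return false as soon as any of
    // them rejects the entry, mirroring SegmentMergeFilters' described behaviour.
    static boolean filter(List<Predicate<String>> filters, String key) {
        for (Predicate<String> f : filters) {
            if (!f.test(key)) {
                return false;   // a single veto rejects the entry
            }
        }
        return true;            // kept only if every filter accepts it
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = List.of(
            key -> key.startsWith("http"),    // accept only http(s) URLs
            key -> !key.endsWith(".tmp"));    // reject temporary resources
        System.out.println(filter(filters, "http://example.com/page"));
        System.out.println(filter(filters, "http://example.com/x.tmp"));
    }
}
```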
Copyright © 2014 The Apache Software Foundation