java.lang.Object
    org.apache.hadoop.conf.Configured
        org.apache.nutch.scoring.ScoringFilters

Creates and caches ScoringFilter implementing plugins.

Field Summary
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
    X_POINT_ID

Constructor Summary
ScoringFilters(Configuration conf)

Method Summary
CrawlDatum  distributeScoreToOutlink(UTF8 fromUrl, UTF8 toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount)
    Distribute score value from the current page to all its outlinked pages.
float  generatorSortValue(UTF8 url, CrawlDatum datum, float initSort)
    Calculate a sort value for Generate.
float  indexerScore(UTF8 url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
    This method calculates a Lucene document boost.
void  initialScore(UTF8 url, CrawlDatum datum)
    Calculate a new initial score, used when adding new pages.
void  passScoreAfterParsing(UTF8 url, Content content, Parse parse)
    Currently a part of score distribution is performed using only data coming from the parsing process.
void  passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content)
    This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
void  updateDbScore(UTF8 url, CrawlDatum old, CrawlDatum datum, List inlinked)
    Calculate updated page score during CrawlDb.update().

Methods inherited from class org.apache.hadoop.conf.Configured
    getConf, setConf
Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
    getConf, setConf

Constructor Detail
public ScoringFilters(Configuration conf)

Method Detail

public float generatorSortValue(UTF8 url, CrawlDatum datum, float initSort) throws ScoringFilterException
Specified by:
    generatorSortValue in interface ScoringFilter
Parameters:
    url - url of the page
    datum - page's datum, should not be modified
    initSort - initial sort value, or a value from previous filters in chain
Throws:
    ScoringFilterException
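
The description of initSort ("a value from previous filters in chain") suggests that each filter receives the result of the one before it. A minimal sketch of that chaining follows. SortFilterSketch and SortChainSketch are hypothetical stand-ins, not Nutch classes, and a plain String and float replace UTF8 and CrawlDatum.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a scoring filter; not the Nutch ScoringFilter interface.
interface SortFilterSketch {
    // score stands in for the read-only datum; initSort is the value computed so far.
    float generatorSortValue(String url, float score, float initSort);
}

// Hypothetical stand-in for the class that runs the filter chain.
class SortChainSketch {
    private final List<SortFilterSketch> filters;

    SortChainSketch(List<SortFilterSketch> filters) {
        this.filters = filters;
    }

    // Feed each filter the result of the previous one, starting from initSort.
    float generatorSortValue(String url, float score, float initSort) {
        float sort = initSort;
        for (SortFilterSketch filter : filters) {
            sort = filter.generatorSortValue(url, score, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        SortFilterSketch byScore = (url, score, init) -> init * score;
        SortFilterSketch bonus = (url, score, init) -> init + 0.5f;
        SortChainSketch chain = new SortChainSketch(Arrays.asList(byScore, bonus));
        System.out.println(chain.generatorSortValue("http://example.com/", 2.0f, 1.0f));
    }
}
```

With an empty filter list the sketch returns initSort unchanged, which is the natural behaviour for a chain with nothing in it.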

public void initialScore(UTF8 url, CrawlDatum datum) throws ScoringFilterException
Specified by:
    initialScore in interface ScoringFilter
Parameters:
    url - url of the page
    datum - new datum. Filters will modify it in-place.
Throws:
    ScoringFilterException

public void updateDbScore(UTF8 url, CrawlDatum old, CrawlDatum datum, List inlinked) throws ScoringFilterException
Specified by:
    updateDbScore in interface ScoringFilter
Parameters:
    url - url of the page
    old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
    datum - the new datum, with the original score saved at the time when the fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
    inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
Throws:
    ScoringFilterException
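
The parameter contract above (start from old when it is present, fold in the inlinked scores, write the result into datum) can be sketched as follows. The additive policy is an assumption made for illustration, not something this page specifies, and DatumSketch and UpdateDbScoreSketch are hypothetical stand-ins for the Nutch types.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum: carries only a score.
class DatumSketch {
    float score;

    DatumSketch(float score) {
        this.score = score;
    }
}

class UpdateDbScoreSketch {
    // Assumed policy: start from old's score when old is present, otherwise
    // from datum's own score, then add the scores carried by the inlinks.
    static void updateDbScore(DatumSketch old, DatumSketch datum, List<DatumSketch> inlinked) {
        float start = (old != null) ? old.score : datum.score;
        float added = 0.0f;
        for (DatumSketch inlink : inlinked) {
            added += inlink.score;
        }
        datum.score = start + added; // datum is updated in place
    }

    public static void main(String[] args) {
        DatumSketch old = new DatumSketch(1.0f);
        DatumSketch datum = new DatumSketch(0.25f); // possibly stale value
        updateDbScore(old, datum, Arrays.asList(new DatumSketch(0.5f), new DatumSketch(0.25f)));
        System.out.println(datum.score);
    }
}
```

Note that the stale score held by datum is ignored whenever old is available, which mirrors the guidance given for the old parameter.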

public void passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content) throws ScoringFilterException
Description copied from interface: ScoringFilter
    This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass the value(s) to the mechanism that distributes them to outlinked pages.
Specified by:
    passScoreBeforeParsing in interface ScoringFilter
Parameters:
    url - url of the page
    datum - source datum. NOTE: modifications to this value are not persisted.
    content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
    ScoringFilterException

public void passScoreAfterParsing(UTF8 url, Content content, Parse parse) throws ScoringFilterException
Description copied from interface: ScoringFilter
    Currently a part of score distribution is performed using only data coming from the parsing process.
Specified by:
    passScoreAfterParsing in interface ScoringFilter
Parameters:
    url - page url
    content - original content. NOTE: modifications to this value are not persisted.
    parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
    ScoringFilterException
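
Taken together, passScoreBeforeParsing and passScoreAfterParsing hand a score from the datum to the content metadata and then on to the parse metadata. A minimal sketch of that handoff follows, using plain maps in place of the Nutch metadata objects. The class ScoreHandoffSketch and the key name "sketch.score" are hypothetical; the real metadata key is not given on this page.

```java
import java.util.HashMap;
import java.util.Map;

class ScoreHandoffSketch {
    // Hypothetical metadata key, chosen only for this illustration.
    static final String SCORE_KEY = "sketch.score";

    // Step 1: copy the datum's score into the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // Step 2: copy the score from the content metadata into the parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(Float.parseFloat(parseMeta.get(SCORE_KEY)));
    }
}
```

In both steps only the destination object is modified, matching the notes that changes to the source datum and the original content are not persisted.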

public CrawlDatum distributeScoreToOutlink(UTF8 fromUrl, UTF8 toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) throws ScoringFilterException
Description copied from interface: ScoringFilter
    Distribute score value from the current page to all its outlinked pages.
Specified by:
    distributeScoreToOutlink in interface ScoringFilter
Parameters:
    fromUrl - url of the source page
    toUrl - url of the target page
    parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
    target - target CrawlDatum. NOTE: filters can modify this in-place, all changes will be persisted.
    adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
    allCount - number of all collected outlinks from the source page
    validCount - number of valid outlinks from the source page, i.e. outlinks that are accepted by current URLNormalizers and URLFilters.
Returns:
    a CrawlDatum with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
    ScoringFilterException
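
One way to read the target, adjust and validCount parameters is sketched below. The policy (an equal share per valid outlink, mirrored as a negative adjustment for the source page) is purely an assumption for illustration. LinkDatumSketch, DistributeScoreSketch and the STATUS_LINKED value used here are hypothetical stand-ins, and the source score is passed directly rather than read from ParseData metadata.

```java
// Hypothetical stand-in for CrawlDatum: carries only a score and a status.
class LinkDatumSketch {
    // Placeholder value; not the real CrawlDatum.STATUS_LINKED constant.
    static final int STATUS_LINKED = 1;
    float score;
    int status;
}

class DistributeScoreSketch {
    // Assumed policy: each valid outlink gets an equal share of the source
    // score, and the same amount is recorded as a negative adjustment for the
    // source page. allCount is kept for parity with the documented signature
    // but is unused in this sketch.
    static LinkDatumSketch distributeScoreToOutlink(float sourceScore, LinkDatumSketch target,
            LinkDatumSketch adjust, int allCount, int validCount) {
        if (validCount <= 0) {
            return adjust; // nothing to distribute; adjust is returned as passed in
        }
        float share = sourceScore / validCount;
        target.score += share; // target is modified in place
        if (adjust == null) {
            adjust = new LinkDatumSketch();
            adjust.status = LinkDatumSketch.STATUS_LINKED; // status set on creation
        }
        adjust.score -= share;
        return adjust;
    }

    public static void main(String[] args) {
        LinkDatumSketch target = new LinkDatumSketch();
        LinkDatumSketch adjust = distributeScoreToOutlink(1.0f, target, null, 5, 4);
        System.out.println(target.score + " " + adjust.score + " " + adjust.status);
    }
}
```

Passing the returned adjust back in on the next call lets the sketch accumulate adjustments across all outlinks of one page.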

public float indexerScore(UTF8 url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Description copied from interface: ScoringFilter
    This method calculates a Lucene document boost.
Specified by:
    indexerScore in interface ScoringFilter
Parameters:
    url - url of the page
    doc - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
    dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
    fetchDatum - datum from FetcherOutput (containing among others the fetching status)
    parse - parsing result. NOTE: changes made to this instance are not persisted.
    inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
    initScore - initial boost value for the Lucene document.
Throws:
    ScoringFilterException
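
As an illustration of how a filter might turn the stored page score and initScore into a boost, the sketch below raises the score to a configurable power and multiplies it by the incoming boost. The formula, the scorePower parameter and the class IndexerScoreSketch are assumptions for this example; this page does not say how any particular filter computes the value.

```java
class IndexerScoreSketch {
    // Assumed formula: boost = dbScore ^ scorePower * initScore.
    // dbScore stands in for the score stored in dbDatum.
    static float indexerScore(float dbScore, float initScore, float scorePower) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // A power below 1 dampens the influence of large page scores.
        System.out.println(indexerScore(4.0f, 2.0f, 0.5f));
    }
}
```

Because the result is multiplied by initScore, a filter written this way composes with earlier filters in the chain instead of overwriting their boost.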