org.apache.nutch.scoring
Class ScoringFilters

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.scoring.ScoringFilters
All Implemented Interfaces:
Configurable, Pluggable, ScoringFilter

public class ScoringFilters
extends Configured
implements ScoringFilter

Creates and caches the plugins that implement ScoringFilter.

Author:
Andrzej Bialecki

Field Summary
 
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
X_POINT_ID
 
Constructor Summary
ScoringFilters(Configuration conf)
           
 
Method Summary
 CrawlDatum distributeScoreToOutlink(UTF8 fromUrl, UTF8 toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount)
          Distribute score value from the current page to all its outlinked pages.
 float generatorSortValue(UTF8 url, CrawlDatum datum, float initSort)
          Calculate a sort value for Generate.
 float indexerScore(UTF8 url, Document doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
          This method calculates a Lucene document boost.
 void initialScore(UTF8 url, CrawlDatum datum)
          Calculate a new initial score, used when adding new pages.
 void passScoreAfterParsing(UTF8 url, Content content, Parse parse)
          Currently a part of score distribution is performed using only data coming from the parsing process.
 void passScoreBeforeParsing(UTF8 url, CrawlDatum datum, Content content)
          This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
 void updateDbScore(UTF8 url, CrawlDatum old, CrawlDatum datum, List inlinked)
          Calculate updated page score during CrawlDb.update().
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Constructor Detail

ScoringFilters

public ScoringFilters(Configuration conf)
Method Detail

generatorSortValue

public float generatorSortValue(UTF8 url,
                                CrawlDatum datum,
                                float initSort)
                         throws ScoringFilterException
Calculate a sort value for Generate.

Specified by:
generatorSortValue in interface ScoringFilter
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Throws:
ScoringFilterException
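
The initSort parameter carries the value produced by earlier filters in the chain. The following is a minimal sketch of that chaining idea, not the Nutch implementation: SortValueFilter, SortValueChainSketch and runChain are hypothetical names, and the url and datum arguments are reduced to a String and a float so the example compiles without Nutch on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scoring plugin. The real ScoringFilter method
// takes UTF8 and CrawlDatum arguments and may throw ScoringFilterException.
interface SortValueFilter {
    float generatorSortValue(String url, float datumScore, float initSort);
}

final class SortValueChainSketch {
    // Feeds the value returned by each filter into the next filter, in order.
    static float runChain(List<SortValueFilter> filters, String url,
                          float datumScore, float initSort) {
        float sort = initSort;
        for (SortValueFilter filter : filters) {
            sort = filter.generatorSortValue(url, datumScore, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        List<SortValueFilter> filters = new ArrayList<>();
        filters.add((url, score, init) -> init * score); // weight by page score
        filters.add((url, score, init) -> init + 1.0f);  // add a constant offset
        // Computes (1.5 * 2.0) + 1.0.
        System.out.println(runChain(filters, "http://example.com/", 2.0f, 1.5f));
    }
}
```

With an empty filter list the sketch returns initSort unchanged.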

initialScore

public void initialScore(UTF8 url,
                         CrawlDatum datum)
                  throws ScoringFilterException
Calculate a new initial score, used when adding new pages.

Specified by:
initialScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException

updateDbScore

public void updateDbScore(UTF8 url,
                          CrawlDatum old,
                          CrawlDatum datum,
                          List inlinked)
                   throws ScoringFilterException
Calculate updated page score during CrawlDb.update().

Specified by:
updateDbScore in interface ScoringFilter
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum instances (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException
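
The parameter contract above (start from old when it is not null, update datum in place) can be illustrated with a small sketch. It assumes one possible policy, adding every inlink score to the starting score, which is an assumption for this example and not necessarily what any Nutch plugin does. DbDatum and UpdateDbScoreSketch are hypothetical stand-ins for CrawlDatum and a filter implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CrawlDatum, reduced to a mutable score.
final class DbDatum {
    float score;
    DbDatum(float score) { this.score = score; }
}

final class UpdateDbScoreSketch {
    // Assumed policy for illustration only: take the old score as the starting
    // value when it exists, then add the score of every inlink in this batch.
    static void updateDbScore(DbDatum old, DbDatum datum, List<DbDatum> inlinked) {
        // old may be null for a newly discovered page.
        float base = (old != null) ? old.score : datum.score;
        float adjust = 0.0f;
        for (DbDatum in : inlinked) {
            adjust += in.score;
        }
        datum.score = base + adjust; // datum is updated in place
    }

    public static void main(String[] args) {
        DbDatum datum = new DbDatum(1.0f);
        List<DbDatum> inlinked = new ArrayList<>();
        inlinked.add(new DbDatum(0.5f));
        inlinked.add(new DbDatum(0.25f));
        updateDbScore(new DbDatum(2.0f), datum, inlinked);
        System.out.println(datum.score);
    }
}
```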

passScoreBeforeParsing

public void passScoreBeforeParsing(UTF8 url,
                                   CrawlDatum datum,
                                   Content content)
                            throws ScoringFilterException
Description copied from interface: ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass the value(s) to the mechanism that distributes them to outlinked pages.

Specified by:
passScoreBeforeParsing in interface ScoringFilter
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException

passScoreAfterParsing

public void passScoreAfterParsing(UTF8 url,
                                  Content content,
                                  Parse parse)
                           throws ScoringFilterException
Description copied from interface: ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process. We need this method in order to ensure the presence of score data in these steps.

Specified by:
passScoreAfterParsing in interface ScoringFilter
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException
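
Taken together, passScoreBeforeParsing and passScoreAfterParsing hand score data from the datum to the content metadata and then on to the parse metadata. The sketch below shows that hand-off under simplifying assumptions: metadata is modelled as a plain Map, and the key name example.score, the class ScorePassingSketch and the simplified method signatures are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ScorePassingSketch {
    // Hypothetical metadata key, chosen for this example only.
    static final String SCORE_KEY = "example.score";

    // Before parsing: copy the datum's score into the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the stored score from content metadata to parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(1.5f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta.get(SCORE_KEY));
    }
}
```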

distributeScoreToOutlink

public CrawlDatum distributeScoreToOutlink(UTF8 fromUrl,
                                           UTF8 toUrl,
                                           ParseData parseData,
                                           CrawlDatum target,
                                           CrawlDatum adjust,
                                           int allCount,
                                           int validCount)
                                    throws ScoringFilterException
Description copied from interface: ScoringFilter
Distribute score value from the current page to all its outlinked pages.

Specified by:
distributeScoreToOutlink in interface ScoringFilter
Parameters:
fromUrl - url of the source page
toUrl - url of the target page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
target - target CrawlDatum. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
validCount - number of valid outlinks from the source page, i.e. outlinks that are accepted by the current URLNormalizers and URLFilters.
Returns:
if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException
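
As an illustration of how validCount might be used, the sketch below gives each target an equal share of the source page's score and returns the adjust argument untouched (null when no adjustment is needed). The equal-split policy is an assumption made for this example. OutlinkDatum and DistributeScoreSketch are hypothetical stand-ins, and the source score is passed directly instead of being read from ParseData metadata.

```java
// Hypothetical stand-in for CrawlDatum, reduced to a mutable score.
final class OutlinkDatum {
    float score;
}

final class DistributeScoreSketch {
    // Assumed policy for illustration only: split the source score evenly
    // across the valid outlinks. allCount is accepted but unused here.
    static OutlinkDatum distributeScoreToOutlink(float sourceScore, OutlinkDatum target,
                                                 OutlinkDatum adjust,
                                                 int allCount, int validCount) {
        if (validCount > 0) {
            target.score = sourceScore / validCount; // target is modified in place
        }
        return adjust; // stays null when no adjustment is required
    }

    public static void main(String[] args) {
        OutlinkDatum target = new OutlinkDatum();
        OutlinkDatum adjust = distributeScoreToOutlink(1.0f, target, null, 5, 4);
        System.out.println(target.score + " " + (adjust == null));
    }
}
```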

indexerScore

public float indexerScore(UTF8 url,
                          Document doc,
                          CrawlDatum dbDatum,
                          CrawlDatum fetchDatum,
                          Parse parse,
                          Inlinks inlinks,
                          float initScore)
                   throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates a Lucene document boost.

Specified by:
indexerScore in interface ScoringFilter
Parameters:
url - url of the page
doc - Lucene document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the Lucene document.
Returns:
boost value for the Lucene document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying Lucene document directly.
Throws:
ScoringFilterException
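
One way a filter could derive the returned boost from initScore is to scale it by a power of the page's CrawlDb score. The sketch below shows only that arithmetic. The formula and the scorePower parameter are assumptions for illustration, IndexerScoreSketch is a hypothetical name, and the Document, Parse and Inlinks arguments are omitted.

```java
final class IndexerScoreSketch {
    // Assumed strategy for illustration only: dbScore^scorePower * initScore.
    static float indexerScore(float dbScore, float scorePower, float initScore) {
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // A scorePower below 1 dampens the influence of large CrawlDb scores.
        System.out.println(indexerScore(4.0f, 0.5f, 2.0f));
    }
}
```

Because the returned value is passed to the next filter as its initScore, a chain of such filters composes multiplicatively.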


Copyright © 2006 The Apache Software Foundation