Uses of Class org.apache.nutch.crawl.CrawlDatum
Packages that use CrawlDatum | |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.crawl | Crawl control code. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Maintain Lucene full-text indexes. |
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin. |
org.apache.nutch.indexer.feed | |
org.apache.nutch.indexer.metadata | |
org.apache.nutch.indexer.more | The "more" indexing plugin. |
org.apache.nutch.indexer.solr | |
org.apache.nutch.indexer.staticfield | A simple plugin called at indexing that adds fields with static data. |
org.apache.nutch.indexer.subcollection | |
org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. |
org.apache.nutch.indexer.urlmeta | URL Meta Tag Indexing Plugin. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.apache.nutch.protocol | |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. |
org.apache.nutch.scoring | |
org.apache.nutch.scoring.link | |
org.apache.nutch.scoring.opic | |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring Plugin. |
org.apache.nutch.scoring.webgraph | |
org.apache.nutch.segment | |
org.apache.nutch.tools | |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Uses of CrawlDatum in org.apache.nutch.analysis.lang |
---|
Methods in org.apache.nutch.analysis.lang with parameters of type CrawlDatum | |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.crawl |
---|
Fields in org.apache.nutch.crawl declared as CrawlDatum | |
---|---|
CrawlDatum |
Generator.SelectorEntry.datum
|
Methods in org.apache.nutch.crawl that return CrawlDatum | |
---|---|
CrawlDatum |
AbstractFetchSchedule.forceRefetch(Text url,
CrawlDatum datum,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching. |
CrawlDatum |
FetchSchedule.forceRefetch(Text url,
CrawlDatum datum,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching. |
CrawlDatum |
CrawlDbReader.get(String crawlDb,
String url,
Configuration config)
|
CrawlDatum |
AbstractFetchSchedule.initializeSchedule(Text url,
CrawlDatum datum)
Initialize fetch schedule related data. |
CrawlDatum |
FetchSchedule.initializeSchedule(Text url,
CrawlDatum datum)
Initialize fetch schedule related data. |
static CrawlDatum |
CrawlDatum.read(DataInput in)
|
CrawlDatum |
MimeAdaptiveFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
AbstractFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a
successfully fetched page. |
CrawlDatum |
FetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a
successfully fetched page. |
CrawlDatum |
AdaptiveFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
DefaultFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
AbstractFetchSchedule.setPageGoneSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum |
FetchSchedule.setPageGoneSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum |
AbstractFetchSchedule.setPageRetrySchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
CrawlDatum |
FetchSchedule.setPageRetrySchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
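The forceRefetch contract above (reset fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature so the page is refetched) can be illustrated with a standalone sketch. The `Datum` class and the default interval below are simplified stand-ins, not the real Nutch CrawlDatum API:

```java
// Simplified stand-in for CrawlDatum; field names mirror the javadoc above,
// but this is an illustration, not the real Nutch class.
class Datum {
    long fetchTime;
    int fetchInterval;     // seconds
    long modifiedTime;
    int retriesSinceFetch;
    byte[] signature;
}

public class ForceRefetchSketch {
    // Assumed default interval for the sketch (30 days, in seconds).
    static final int DEFAULT_INTERVAL = 30 * 24 * 3600;

    // Mirrors the documented contract: reset the fields so the page is
    // selected for fetching on the next generate cycle.
    static Datum forceRefetch(Datum datum, long now, boolean asap) {
        datum.fetchInterval = DEFAULT_INTERVAL;
        datum.modifiedTime = 0L;
        datum.retriesSinceFetch = 0;
        datum.signature = null;
        // "asap" schedules the refetch immediately; otherwise after one interval.
        datum.fetchTime = asap ? now : now + datum.fetchInterval * 1000L;
        return datum;
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        d.fetchTime = 999L;
        d.retriesSinceFetch = 3;
        d.signature = new byte[] {1, 2};
        forceRefetch(d, 1000L, true);
        System.out.println(d.fetchTime + " " + d.retriesSinceFetch); // 1000 0
    }
}
```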
Methods in org.apache.nutch.crawl that return types with arguments of type CrawlDatum | |
---|---|
RecordWriter<Text,CrawlDatum> |
CrawlDbReader.CrawlDatumCsvOutputFormat.getRecordWriter(FileSystem fs,
JobConf job,
String name,
Progressable progress)
|
Methods in org.apache.nutch.crawl with parameters of type CrawlDatum | |
---|---|
long |
AbstractFetchSchedule.calculateLastFetchTime(CrawlDatum datum)
Returns the last fetch time of the CrawlDatum. |
long |
FetchSchedule.calculateLastFetchTime(CrawlDatum datum)
Calculates last fetch time of the given CrawlDatum. |
int |
CrawlDatum.compareTo(CrawlDatum that)
Sort by decreasing score. |
CrawlDatum |
AbstractFetchSchedule.forceRefetch(Text url,
CrawlDatum datum,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching. |
CrawlDatum |
FetchSchedule.forceRefetch(Text url,
CrawlDatum datum,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching. |
static boolean |
CrawlDatum.hasDbStatus(CrawlDatum datum)
|
static boolean |
CrawlDatum.hasFetchStatus(CrawlDatum datum)
|
CrawlDatum |
AbstractFetchSchedule.initializeSchedule(Text url,
CrawlDatum datum)
Initialize fetch schedule related data. |
CrawlDatum |
FetchSchedule.initializeSchedule(Text url,
CrawlDatum datum)
Initialize fetch schedule related data. |
void |
Generator.Selector.map(Text key,
CrawlDatum value,
OutputCollector<FloatWritable,Generator.SelectorEntry> output,
Reporter reporter)
Select & invert subset due for fetch. |
void |
CrawlDbReader.CrawlDbTopNMapper.map(Text key,
CrawlDatum value,
OutputCollector<FloatWritable,Text> output,
Reporter reporter)
|
void |
Generator.CrawlDbUpdater.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbReader.CrawlDbDumpMapper.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbFilter.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbReader.CrawlDbStatMapper.map(Text key,
CrawlDatum value,
OutputCollector<Text,LongWritable> output,
Reporter reporter)
|
void |
CrawlDatum.putAllMetaData(CrawlDatum other)
Add all metadata from other CrawlDatum to this CrawlDatum. |
void |
CrawlDatum.set(CrawlDatum that)
Copy the contents of another instance into this instance. |
CrawlDatum |
MimeAdaptiveFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
AbstractFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a
successfully fetched page. |
CrawlDatum |
FetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a
successfully fetched page. |
CrawlDatum |
AdaptiveFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
DefaultFetchSchedule.setFetchSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
CrawlDatum |
AbstractFetchSchedule.setPageGoneSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum |
FetchSchedule.setPageGoneSchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum |
AbstractFetchSchedule.setPageRetrySchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
CrawlDatum |
FetchSchedule.setPageRetrySchedule(Text url,
CrawlDatum datum,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
boolean |
AbstractFetchSchedule.shouldFetch(Text url,
CrawlDatum datum,
long curTime)
Indicates whether the page is suitable for selection in the current fetchlist. |
boolean |
FetchSchedule.shouldFetch(Text url,
CrawlDatum datum,
long curTime)
Indicates whether the page is suitable for selection in the current fetchlist. |
void |
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter.write(Text key,
CrawlDatum value)
|
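CrawlDatum.compareTo above is documented to sort by decreasing score, which is what fetchlist generation relies on when selecting the top-N scoring pages. A standalone sketch of that ordering (the `Entry` record is a simplified stand-in; the real class may break ties on further fields):

```java
import java.util.Arrays;

public class ScoreOrderSketch {
    // Minimal stand-in for a (url, score) pair.
    record Entry(String url, float score) {}

    // Sort by decreasing score, as CrawlDatum.compareTo is documented to do.
    static Entry[] sortByDecreasingScore(Entry[] entries) {
        Entry[] out = entries.clone();
        Arrays.sort(out, (a, b) -> Float.compare(b.score, a.score));
        return out;
    }

    public static void main(String[] args) {
        Entry[] sorted = sortByDecreasingScore(new Entry[] {
            new Entry("http://a/", 0.5f),
            new Entry("http://b/", 2.0f),
            new Entry("http://c/", 1.0f),
        });
        System.out.println(sorted[0].url()); // http://b/
    }
}
```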
Method parameters in org.apache.nutch.crawl with type arguments of type CrawlDatum | |
---|---|
void |
Generator.CrawlDbUpdater.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbReader.CrawlDbDumpMapper.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbFilter.map(Text key,
CrawlDatum value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
Injector.InjectMapper.map(WritableComparable key,
Text value,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
Generator.CrawlDbUpdater.reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbReducer.reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
Injector.InjectReducer.reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDbMerger.Merger.reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
Generator.PartitionReducer.reduce(Text key,
Iterator<Generator.SelectorEntry> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
Uses of CrawlDatum in org.apache.nutch.fetcher |
---|
Methods in org.apache.nutch.fetcher that return CrawlDatum | |
---|---|
CrawlDatum |
FetcherOutput.getCrawlDatum()
|
Method parameters in org.apache.nutch.fetcher with type arguments of type CrawlDatum | |
---|---|
void |
Fetcher.run(RecordReader<Text,CrawlDatum> input,
OutputCollector<Text,NutchWritable> output,
Reporter reporter)
|
Constructors in org.apache.nutch.fetcher with parameters of type CrawlDatum | |
---|---|
FetcherOutput(CrawlDatum crawlDatum,
Content content,
ParseImpl parse)
|
Uses of CrawlDatum in org.apache.nutch.indexer |
---|
Methods in org.apache.nutch.indexer with parameters of type CrawlDatum | |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a parse. |
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Run all defined filters. |
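IndexingFilters.filter above runs every registered IndexingFilter in turn, feeding each filter's output document to the next; in Nutch a filter may return null to veto the document entirely. The pipeline can be sketched with simplified stand-in types (not the real Nutch interfaces):

```java
import java.util.HashMap;
import java.util.List;

public class FilterChainSketch {
    // Simplified stand-ins for NutchDocument and IndexingFilter.
    static class Doc extends HashMap<String, String> {}

    interface Filter {
        Doc filter(Doc doc, String url); // a null return vetoes the document
    }

    // Mirrors IndexingFilters.filter: run all filters, short-circuit on null.
    static Doc runAll(List<Filter> filters, Doc doc, String url) {
        for (Filter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) return null; // document rejected, stop the chain
        }
        return doc;
    }

    public static void main(String[] args) {
        Filter addUrl  = (doc, url) -> { doc.put("url", url); return doc; };
        Filter dropFtp = (doc, url) -> url.startsWith("ftp:") ? null : doc;
        Doc kept    = runAll(List.of(addUrl, dropFtp), new Doc(), "http://example.com/");
        Doc dropped = runAll(List.of(addUrl, dropFtp), new Doc(), "ftp://example.com/");
        System.out.println((kept != null) + " " + (dropped == null)); // true true
    }
}
```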
Uses of CrawlDatum in org.apache.nutch.indexer.anchor |
---|
Methods in org.apache.nutch.indexer.anchor with parameters of type CrawlDatum | |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The AnchorIndexingFilter, which supports boolean configuration settings for the deduplication of anchors. |
Uses of CrawlDatum in org.apache.nutch.indexer.basic |
---|
Methods in org.apache.nutch.indexer.basic with parameters of type CrawlDatum | |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.feed |
---|
Methods in org.apache.nutch.indexer.feed with parameters of type CrawlDatum | |
---|---|
NutchDocument |
FeedIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED, FEED_UPDATED, FEED) and sends them to the Indexer for indexing within the Nutch index. |
Uses of CrawlDatum in org.apache.nutch.indexer.metadata |
---|
Methods in org.apache.nutch.indexer.metadata with parameters of type CrawlDatum | |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.more |
---|
Methods in org.apache.nutch.indexer.more with parameters of type CrawlDatum | |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.solr |
---|
Methods in org.apache.nutch.indexer.solr with parameters of type CrawlDatum | |
---|---|
void |
SolrClean.DBFilter.map(Text key,
CrawlDatum value,
OutputCollector<ByteWritable,Text> output,
Reporter reporter)
|
Uses of CrawlDatum in org.apache.nutch.indexer.staticfield |
---|
Methods in org.apache.nutch.indexer.staticfield with parameters of type CrawlDatum | |
---|---|
NutchDocument |
StaticFieldIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.subcollection |
---|
Methods in org.apache.nutch.indexer.subcollection with parameters of type CrawlDatum | |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.tld |
---|
Methods in org.apache.nutch.indexer.tld with parameters of type CrawlDatum | |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text urlText,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.indexer.urlmeta |
---|
Methods in org.apache.nutch.indexer.urlmeta with parameters of type CrawlDatum | |
---|---|
NutchDocument |
URLMetaIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the CrawlDatum object. |
Uses of CrawlDatum in org.apache.nutch.microformats.reltag |
---|
Methods in org.apache.nutch.microformats.reltag with parameters of type CrawlDatum | |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|
Uses of CrawlDatum in org.apache.nutch.protocol |
---|
Methods in org.apache.nutch.protocol with parameters of type CrawlDatum | |
---|---|
ProtocolOutput |
Protocol.getProtocolOutput(Text url,
CrawlDatum datum)
Returns the Content for a fetchlist entry. |
RobotRules |
Protocol.getRobotRules(Text url,
CrawlDatum datum)
Retrieve robot rules applicable for this url. |
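Each protocol plugin listed below (file, ftp, http, httpclient) implements the same Protocol contract, and the fetcher picks an implementation by URL scheme. A simplified dispatch sketch with stand-in types (the real lookup goes through Nutch's plugin repository, and the real method returns a ProtocolOutput, not a String):

```java
import java.util.Map;

public class ProtocolDispatchSketch {
    // Stand-in for the Protocol contract shown above (simplified signature).
    interface Protocol {
        String getProtocolOutput(String url);
    }

    // Hypothetical scheme-to-plugin table for illustration.
    static final Map<String, Protocol> BY_SCHEME = Map.of(
        "http", url -> "fetched over HTTP: " + url,
        "ftp",  url -> "fetched over FTP: " + url,
        "file", url -> "read from local file: " + url
    );

    // Dispatch a fetch to the plugin registered for the URL's scheme.
    static String fetch(String url) {
        String scheme = url.substring(0, url.indexOf(':'));
        Protocol p = BY_SCHEME.get(scheme);
        if (p == null) throw new IllegalArgumentException("no protocol plugin for " + scheme);
        return p.getProtocolOutput(url);
    }

    public static void main(String[] args) {
        System.out.println(fetch("ftp://example.com/a.txt")); // fetched over FTP: ftp://example.com/a.txt
    }
}
```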
Uses of CrawlDatum in org.apache.nutch.protocol.file |
---|
Methods in org.apache.nutch.protocol.file with parameters of type CrawlDatum | |
---|---|
ProtocolOutput |
File.getProtocolOutput(Text url,
CrawlDatum datum)
|
RobotRules |
File.getRobotRules(Text url,
CrawlDatum datum)
|
Constructors in org.apache.nutch.protocol.file with parameters of type CrawlDatum | |
---|---|
FileResponse(URL url,
CrawlDatum datum,
File file,
Configuration conf)
|
Uses of CrawlDatum in org.apache.nutch.protocol.ftp |
---|
Methods in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum | |
---|---|
ProtocolOutput |
Ftp.getProtocolOutput(Text url,
CrawlDatum datum)
|
RobotRules |
Ftp.getRobotRules(Text url,
CrawlDatum datum)
|
Constructors in org.apache.nutch.protocol.ftp with parameters of type CrawlDatum | |
---|---|
FtpResponse(URL url,
CrawlDatum datum,
Ftp ftp,
Configuration conf)
|
Uses of CrawlDatum in org.apache.nutch.protocol.http |
---|
Methods in org.apache.nutch.protocol.http with parameters of type CrawlDatum | |
---|---|
protected Response |
Http.getResponse(URL url,
CrawlDatum datum,
boolean redirect)
|
Constructors in org.apache.nutch.protocol.http with parameters of type CrawlDatum | |
---|---|
HttpResponse(HttpBase http,
URL url,
CrawlDatum datum)
|
Uses of CrawlDatum in org.apache.nutch.protocol.http.api |
---|
Methods in org.apache.nutch.protocol.http.api with parameters of type CrawlDatum | |
---|---|
ProtocolOutput |
HttpBase.getProtocolOutput(Text url,
CrawlDatum datum)
|
protected abstract Response |
HttpBase.getResponse(URL url,
CrawlDatum datum,
boolean followRedirects)
|
RobotRules |
HttpBase.getRobotRules(Text url,
CrawlDatum datum)
|
Uses of CrawlDatum in org.apache.nutch.protocol.httpclient |
---|
Methods in org.apache.nutch.protocol.httpclient with parameters of type CrawlDatum | |
---|---|
protected Response |
Http.getResponse(URL url,
CrawlDatum datum,
boolean redirect)
Fetches the url with a configured HTTP client and
gets the response. |
Uses of CrawlDatum in org.apache.nutch.scoring |
---|
Methods in org.apache.nutch.scoring that return CrawlDatum | |
---|---|
CrawlDatum |
ScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Distribute score value from the current page to all its outlinked pages. |
CrawlDatum |
ScoringFilters.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
Methods in org.apache.nutch.scoring with parameters of type CrawlDatum | |
---|---|
CrawlDatum |
ScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Distribute score value from the current page to all its outlinked pages. |
CrawlDatum |
ScoringFilters.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
float |
ScoringFilter.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation. |
float |
ScoringFilters.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
Calculate a sort value for Generate. |
float |
ScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
This method calculates a Lucene document boost. |
float |
ScoringFilters.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
|
void |
ScoringFilter.initialScore(Text url,
CrawlDatum datum)
Set an initial score for newly discovered pages. |
void |
ScoringFilters.initialScore(Text url,
CrawlDatum datum)
Calculate a new initial score, used when adding newly discovered pages. |
void |
ScoringFilter.injectedScore(Text url,
CrawlDatum datum)
Set an initial score for newly injected pages. |
void |
ScoringFilters.injectedScore(Text url,
CrawlDatum datum)
Calculate a new initial score, used when injecting new pages. |
void |
ScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
void |
ScoringFilters.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
|
void |
ScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages. |
void |
ScoringFilters.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
Calculate updated page score during CrawlDb.update(). |
Method parameters in org.apache.nutch.scoring with type arguments of type CrawlDatum | |
---|---|
CrawlDatum |
ScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Distribute score value from the current page to all its outlinked pages. |
CrawlDatum |
ScoringFilters.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
void |
ScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages. |
void |
ScoringFilters.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
Calculate updated page score during CrawlDb.update(). |
Uses of CrawlDatum in org.apache.nutch.scoring.link |
---|
Methods in org.apache.nutch.scoring.link that return CrawlDatum | |
---|---|
CrawlDatum |
LinkAnalysisScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
Methods in org.apache.nutch.scoring.link with parameters of type CrawlDatum | |
---|---|
CrawlDatum |
LinkAnalysisScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
float |
LinkAnalysisScoringFilter.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
|
float |
LinkAnalysisScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
|
void |
LinkAnalysisScoringFilter.initialScore(Text url,
CrawlDatum datum)
|
void |
LinkAnalysisScoringFilter.injectedScore(Text url,
CrawlDatum datum)
|
void |
LinkAnalysisScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
|
void |
LinkAnalysisScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
|
Method parameters in org.apache.nutch.scoring.link with type arguments of type CrawlDatum | |
---|---|
CrawlDatum |
LinkAnalysisScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
void |
LinkAnalysisScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
|
Uses of CrawlDatum in org.apache.nutch.scoring.opic |
---|
Methods in org.apache.nutch.scoring.opic that return CrawlDatum | |
---|---|
CrawlDatum |
OPICScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
Methods in org.apache.nutch.scoring.opic with parameters of type CrawlDatum | |
---|---|
CrawlDatum |
OPICScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
float |
OPICScoringFilter.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
Use getScore() . |
float |
OPICScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Dampen the boost value by scorePower. |
void |
OPICScoringFilter.initialScore(Text url,
CrawlDatum datum)
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. |
void |
OPICScoringFilter.injectedScore(Text url,
CrawlDatum datum)
|
void |
OPICScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY. |
void |
OPICScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List inlinked)
Increase the score by a sum of inlinked scores. |
Method parameters in org.apache.nutch.scoring.opic with type arguments of type CrawlDatum | |
---|---|
CrawlDatum |
OPICScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply. |
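OPICScoringFilter.distributeScoreToOutlinks is documented above as dividing the page's score by the number of outlinks. The key OPIC invariant is that "cash" is conserved: the outlinks together receive exactly what the page gives away. A standalone sketch of that division (simplified; the real filter may additionally weight internal versus external links differently):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OpicDistributeSketch {
    // Divide the parent's score evenly across its outlinks, OPIC-style.
    // Repeated outlinks accumulate their shares.
    static Map<String, Float> distribute(float parentScore, List<String> outlinks) {
        Map<String, Float> contributions = new LinkedHashMap<>();
        if (outlinks.isEmpty()) return contributions;
        float share = parentScore / outlinks.size();
        for (String link : outlinks) {
            contributions.merge(link, share, Float::sum);
        }
        return contributions;
    }

    public static void main(String[] args) {
        Map<String, Float> c =
            distribute(1.0f, List.of("http://a/", "http://b/", "http://c/", "http://a/"));
        // Each occurrence gets 0.25; http://a/ appears twice and accumulates 0.5.
        System.out.println(c);
    }
}
```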
Uses of CrawlDatum in org.apache.nutch.scoring.tld |
---|
Methods in org.apache.nutch.scoring.tld that return CrawlDatum | |
---|---|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlink(Text fromUrl,
Text toUrl,
ParseData parseData,
CrawlDatum target,
CrawlDatum adjust,
int allCount,
int validCount)
|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
Methods in org.apache.nutch.scoring.tld with parameters of type CrawlDatum | |
---|---|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlink(Text fromUrl,
Text toUrl,
ParseData parseData,
CrawlDatum target,
CrawlDatum adjust,
int allCount,
int validCount)
|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
float |
TLDScoringFilter.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
|
float |
TLDScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
|
void |
TLDScoringFilter.initialScore(Text url,
CrawlDatum datum)
|
void |
TLDScoringFilter.injectedScore(Text url,
CrawlDatum datum)
|
void |
TLDScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
|
void |
TLDScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
|
Method parameters in org.apache.nutch.scoring.tld with type arguments of type CrawlDatum | |
---|---|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
|
void |
TLDScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List<CrawlDatum> inlinked)
|
Uses of CrawlDatum in org.apache.nutch.scoring.urlmeta |
---|
Methods in org.apache.nutch.scoring.urlmeta that return CrawlDatum | |
---|---|
CrawlDatum |
URLMetaScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
Methods in org.apache.nutch.scoring.urlmeta with parameters of type CrawlDatum | |
---|---|
CrawlDatum |
URLMetaScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
float |
URLMetaScoringFilter.generatorSortValue(Text url,
CrawlDatum datum,
float initSort)
Boilerplate |
float |
URLMetaScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Boilerplate |
void |
URLMetaScoringFilter.initialScore(Text url,
CrawlDatum datum)
Boilerplate |
void |
URLMetaScoringFilter.injectedScore(Text url,
CrawlDatum datum)
Boilerplate |
void |
URLMetaScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
Takes the metadata, specified in your "urlmeta.tags" property, from the datum object and injects it into the content. |
void |
URLMetaScoringFilter.updateDbScore(Text url,
CrawlDatum old,
CrawlDatum datum,
List inlinked)
Boilerplate |
Method parameters in org.apache.nutch.scoring.urlmeta with type arguments of type CrawlDatum | |
---|---|
CrawlDatum |
URLMetaScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for them inside the parseData object. |
Uses of CrawlDatum in org.apache.nutch.scoring.webgraph |
---|
Method parameters in org.apache.nutch.scoring.webgraph with type arguments of type CrawlDatum | |
---|---|
void |
ScoreUpdater.reduce(Text key,
Iterator<ObjectWritable> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
Creates new CrawlDatum objects with the updated score from the NodeDb or with a cleared score. |
Uses of CrawlDatum in org.apache.nutch.segment |
---|
Methods in org.apache.nutch.segment with parameters of type CrawlDatum | |
---|---|
boolean |
SegmentMergeFilters.filter(WritableComparable key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
Iterates over all SegmentMergeFilter extensions; if any of them returns false, this returns false as well. |
boolean |
SegmentMergeFilter.filter(WritableComparable key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL). |
Method parameters in org.apache.nutch.segment with type arguments of type CrawlDatum | |
---|---|
boolean |
SegmentMergeFilters.filter(WritableComparable key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
Iterates over all SegmentMergeFilter extensions; if any of them returns false, this returns false as well. |
boolean |
SegmentMergeFilter.filter(WritableComparable key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL). |
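SegmentMergeFilters above is a conjunction over extensions: an entry survives the merge only if every SegmentMergeFilter accepts it. That short-circuit AND can be sketched standalone (a Predicate over the URL stands in for the full key/datum/content signature shown above):

```java
import java.util.List;
import java.util.function.Predicate;

public class MergeFilterSketch {
    // Mirrors SegmentMergeFilters.filter: accept only if every extension
    // accepts; stop at the first rejection.
    static boolean filter(List<Predicate<String>> extensions, String url) {
        for (Predicate<String> ext : extensions) {
            if (!ext.test(url)) return false; // short-circuit on first veto
        }
        return true;
    }

    public static void main(String[] args) {
        List<Predicate<String>> exts = List.of(
            u -> u.startsWith("http"),   // keep only http(s) entries
            u -> !u.endsWith(".tmp")     // drop temporary resources
        );
        System.out.println(filter(exts, "http://example.com/page"));  // true
        System.out.println(filter(exts, "http://example.com/x.tmp")); // false
    }
}
```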
Uses of CrawlDatum in org.apache.nutch.tools |
---|
Methods in org.apache.nutch.tools with parameters of type CrawlDatum | |
---|---|
void |
CrawlDBScanner.map(Text url,
CrawlDatum crawlDatum,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
Method parameters in org.apache.nutch.tools with type arguments of type CrawlDatum | |
---|---|
void |
CrawlDBScanner.map(Text url,
CrawlDatum crawlDatum,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
CrawlDBScanner.reduce(Text key,
Iterator<CrawlDatum> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
void |
FreeGenerator.FG.reduce(Text key,
Iterator<Generator.SelectorEntry> values,
OutputCollector<Text,CrawlDatum> output,
Reporter reporter)
|
Uses of CrawlDatum in org.creativecommons.nutch |
---|
Methods in org.creativecommons.nutch with parameters of type CrawlDatum | |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
|