Interface | Description |
---|---|
FetchSchedule | This interface defines the contract for implementations that manipulate fetch times and re-fetch intervals. |
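
To make "manipulating fetch times and re-fetch intervals" concrete, here is a minimal sketch of one way an adaptive schedule could adjust an interval: shorten it when a page has changed and lengthen it when it has not, clamped to a fixed range. It does not use the Nutch `FetchSchedule` API. The class name, method name, rates, and bounds are all hypothetical values chosen for illustration, not Nutch defaults.

```java
public class AdaptiveIntervalSketch {
    // Hypothetical tuning values for illustration only (not Nutch defaults).
    public static final double INC_RATE = 0.4;   // growth when the page is unchanged
    public static final double DEC_RATE = 0.2;   // shrinkage when the page has changed
    public static final long MIN_INTERVAL_SEC = 60L;
    public static final long MAX_INTERVAL_SEC = 30L * 24 * 3600;

    /** Returns the next re-fetch interval in seconds, clamped to [MIN, MAX]. */
    public static long nextInterval(long currentSec, boolean pageChanged) {
        double next = pageChanged
                ? currentSec * (1.0 - DEC_RATE)   // changed: fetch sooner next time
                : currentSec * (1.0 + INC_RATE);  // unchanged: wait longer next time
        long rounded = Math.round(next);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, rounded));
    }

    public static void main(String[] args) {
        long interval = 86_400L;                    // start at one day
        interval = nextInterval(interval, true);    // page changed
        System.out.println("after change: " + interval);
        interval = nextInterval(interval, false);   // page unchanged
        System.out.println("after no change: " + interval);
    }
}
```

Clamping keeps a frequently changing page from being fetched in a tight loop and keeps a static page from dropping out of the crawl entirely.
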
Class | Description |
---|---|
AbstractFetchSchedule | This class provides common methods for implementations of FetchSchedule. |
AdaptiveFetchSchedule | This class implements an adaptive re-fetch algorithm. |
Crawl | |
CrawlDatum | |
CrawlDatum.Comparator | A Comparator optimized for CrawlDatum. |
CrawlDb | This class takes the output of the fetcher and updates the crawldb accordingly. |
CrawlDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the CrawlDb manipulation code. |
CrawlDbMerger | This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. |
CrawlDbMerger.Merger | |
CrawlDbReader | Read utility for the CrawlDb. |
CrawlDbReader.CrawlDatumCsvOutputFormat | |
CrawlDbReader.CrawlDatumCsvOutputFormat.LineRecordWriter | |
CrawlDbReader.CrawlDbDumpMapper | |
CrawlDbReader.CrawlDbStatCombiner | |
CrawlDbReader.CrawlDbStatMapper | |
CrawlDbReader.CrawlDbStatReducer | |
CrawlDbReader.CrawlDbTopNMapper | |
CrawlDbReader.CrawlDbTopNReducer | |
CrawlDbReducer | Merge new page entries with existing entries. |
DefaultFetchSchedule | This class implements the default re-fetch schedule. |
FetchScheduleFactory | Creates and caches a FetchSchedule implementation. |
Generator | Generates a subset of a crawl db to fetch. |
Generator.CrawlDbUpdater | Update the CrawlDb so that the next generate won't include the same URLs. |
Generator.DecreasingFloatComparator | |
Generator.GeneratorOutputFormat | |
Generator.HashComparator | Sort fetch lists by hash of URL. |
Generator.PartitionReducer | |
Generator.Selector | Selects entries due for fetch. |
Generator.SelectorEntry | |
Generator.SelectorInverseMapper | |
Injector | This class takes a flat file of URLs and adds them to the list of pages to be crawled. |
Injector.InjectMapper | Normalize and filter injected URLs. |
Injector.InjectReducer | Combine multiple new entries for a URL. |
Inlink | |
Inlinks | A list of Inlink objects. |
LinkDb | Maintains an inverted link map, listing incoming links for each URL. |
LinkDbFilter | This class provides a way to separate the URL normalization and filtering steps from the rest of the LinkDb manipulation code. |
LinkDbMerger | This tool merges several LinkDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited URLs and links. |
LinkDbReader | Read utility for the LinkDb. |
MapWritable | Deprecated. Use org.apache.hadoop.io.MapWritable instead. |
MD5Signature | Default implementation of a page signature. |
MimeAdaptiveFetchSchedule | Extension of AdaptiveFetchSchedule that allows for more flexible configuration of DEC and INC factors for various MIME types. |
NutchWritable | |
Signature | |
SignatureComparator | |
SignatureFactory | Factory class that instantiates a Signature implementation according to the current Configuration. |
TextProfileSignature | An implementation of a page signature. |
URLPartitioner | Partition URLs by host, domain name, or IP, depending on the value of the parameter 'partition.url.mode', which can be 'byHost', 'byDomain', or 'byIP'. |
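
The host-based mode described for URLPartitioner can be illustrated with a small standalone sketch: derive a key from the URL's host and map its hash onto a fixed number of partitions, so that all URLs from one host land in the same partition. This is not the Nutch implementation and uses no Nutch or Hadoop classes. The class and method names are hypothetical, and only the 'byHost' idea is covered (no domain or IP lookup).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class HostPartitionSketch {
    /** Returns the lower-cased host of the URL, or the raw URL if no host can be parsed. */
    public static String hostKey(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return url; // fall back to the whole string for unparseable input
        }
    }

    /** Maps a URL to a partition index in [0, numPartitions) based on its host. */
    public static int partitionFor(String url, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(hostKey(url).hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int n = 8;
        System.out.println(partitionFor("http://example.org/a", n));
        System.out.println(partitionFor("http://EXAMPLE.org/b?x=1", n)); // same host, same partition
        System.out.println(partitionFor("http://example.net/", n));
    }
}
```

Grouping by host in this way is what lets a crawler apply per-host politeness limits inside a single partition.
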
Copyright © 2013 The Apache Software Foundation