java.lang.Object
  org.apache.nutch.net.BasicUrlNormalizer
    org.apache.nutch.net.RegexUrlNormalizer
Allows users to apply regex substitutions to any URL that is encountered, which is useful for stripping session IDs from URLs.
To use this class, specify it as the URL normalizer in nutch-site.xml or nutch-default.xml by setting the urlnormalizer.class property to org.apache.nutch.net.RegexUrlNormalizer. The urlnormalizer.regex.file property should also be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
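To make the session-ID use case concrete, here is a minimal sketch of a single pattern/substitution pair. It is not this class's implementation: the constructor signature indicates the class uses Jakarta ORO patterns, while the sketch swaps in java.util.regex from the JDK so that it is self-contained. The class name SessionIdStripSketch, the method stripSessionId, and the jsessionid pattern are all assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: names and pattern are hypothetical, not part of Nutch.
final class SessionIdStripSketch {

    // Assumed rule: match ";jsessionid=<value>" up to the next '?', '#',
    // or the end of the string.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Replace every match with the empty string, i.e. an empty substitution.
    static String stripSessionId(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "http://example.com/page?x=1"
        System.out.println(
                stripSessionId("http://example.com/page;jsessionid=ABC123?x=1"));
    }
}
```

URLs that do not contain the pattern pass through unchanged, so a rule like this can be applied to every URL encountered.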
Field Summary |
Fields inherited from class org.apache.nutch.net.BasicUrlNormalizer |
LOG |
Constructor Summary | |
RegexUrlNormalizer()
The default constructor, called from UrlNormalizerFactory.getNormalizer() via normalizerClass.newInstance(). |
|
RegexUrlNormalizer(String filename)
Constructor that takes the file name directly, so that it is not looked up in the configuration files. |
Method Summary | |
static void |
main(String[] args)
Prints the patterns and substitutions found in the configuration file. |
String |
normalize(String urlString)
Normalizes any URLs by calling super.basicNormalize() and regexSub(). |
String |
regexNormalize(String urlString)
Performs the replacements by iterating through all the regex patterns. |
void |
setConf(Configuration conf)
|
Methods inherited from class org.apache.nutch.net.BasicUrlNormalizer |
getConf |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.hadoop.conf.Configurable |
getConf |
Constructor Detail |
public RegexUrlNormalizer()
public RegexUrlNormalizer(String filename) throws IOException, org.apache.oro.text.regex.MalformedPatternException
Method Detail |
public String regexNormalize(String urlString)
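The method summary describes regexNormalize as iterating through all the regex patterns. The sketch below shows one way such a loop can look, under stated assumptions: it uses java.util.regex rather than the Jakarta ORO library this class depends on, and RuleChainSketch, Rule, add, apply, demoChain, and the two example rules are hypothetical names and values, not taken from Nutch. Each rule runs on the output of the previous one, so the order of the rules matters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: an ordered chain of pattern/substitution rules.
final class RuleChainSketch {

    // Hypothetical holder for one compiled pattern and its substitution.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RuleChainSketch add(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
        return this;
    }

    // Apply every rule in insertion order; each rule sees the previous result.
    String apply(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    // Two assumed rules: drop a "sid" query parameter while keeping its
    // leading separator, then remove a dangling '?' or '&' left at the end.
    static RuleChainSketch demoChain() {
        return new RuleChainSketch()
                .add("([?&])sid=[^&#]*&?", "$1")
                .add("[?&]$", "");
    }

    public static void main(String[] args) {
        RuleChainSketch chain = demoChain();
        // prints "http://example.com/a?x=1"
        System.out.println(chain.apply("http://example.com/a?x=1&sid=42"));
    }
}
```

In this sketch the second rule exists only to clean up after the first, which is why it is added last.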
public String normalize(String urlString) throws MalformedURLException
Specified by:
normalize in interface UrlNormalizer
Overrides:
normalize in class BasicUrlNormalizer
Throws:
MalformedURLException
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class BasicUrlNormalizer
public static void main(String[] args) throws org.apache.oro.text.regex.MalformedPatternException, IOException
Throws:
org.apache.oro.text.regex.MalformedPatternException
IOException