java.lang.Object
  net.nutch.net.BasicUrlNormalizer
    net.nutch.net.RegexUrlNormalizer
Applies user-defined regex substitutions to every URL that is encountered, which is useful for tasks such as stripping session IDs from URLs.
To use this class, specify it as the URL normalizer in nutch-site.xml or nutch-default.xml by setting the urlnormalizer.class property to net.nutch.net.RegexUrlNormalizer. Also set the urlnormalizer.regex.file property to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
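To make the session-ID use case concrete, here is a minimal sketch of the kind of substitution such a normalizer can apply. It is not the Nutch implementation: it uses java.util.regex rather than the Jakarta ORO classes named in the signatures on this page, and the class name SessionIdStripExample, the method stripSessionId, and the jsessionid pattern are all assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: illustrates one regex
// substitution that removes a session ID from a URL.
class SessionIdStripExample {
    // Assumed pattern: a ";jsessionid=..." path parameter, matched
    // case-insensitively up to the start of the query or fragment.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // Returns the URL with every match of the pattern replaced by "".
    static String stripSessionId(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // prints "http://example.com/page?x=1"
        System.out.println(stripSessionId(
            "http://example.com/page;jsessionid=ABC123?x=1"));
    }
}
```

URLs that contain no match pass through unchanged, so a rule like this can be applied to every URL without first checking whether it carries a session ID.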
Field Summary |
Fields inherited from class net.nutch.net.BasicUrlNormalizer |
LOG |
Constructor Summary | |
RegexUrlNormalizer()
Default constructor which gets the file name from either nutch-site.xml or nutch-default.xml and reads that configuration file. |
|
RegexUrlNormalizer(String filename)
Constructor which can be passed the file name, so it doesn't look in the configuration files for it. |
Method Summary | |
static void |
main(String[] args)
Prints the patterns and substitutions that are in the configuration file. |
String |
normalize(String urlString)
Normalizes any URLs by calling super.basicNormalize() and regexSub(). |
String |
regexNormalize(String urlString)
Performs the replacements by iterating through all the regex patterns. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
public RegexUrlNormalizer() throws IOException, org.apache.oro.text.regex.MalformedPatternException
public RegexUrlNormalizer(String filename) throws IOException, org.apache.oro.text.regex.MalformedPatternException
Method Detail |
public String regexNormalize(String urlString)
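The summary for regexNormalize describes iterating through all the regex patterns and applying each replacement. The sketch below shows that general technique under stated assumptions: it uses java.util.regex instead of Jakarta ORO, and the names RuleListExample, Rule, addRule, and applyAll are hypothetical, not taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not the Nutch class: an ordered list of
// pattern/substitution pairs applied one after another.
class RuleListExample {
    // One compiled pattern together with its replacement text.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String regex, String substitution) {
        rules.add(new Rule(regex, substitution));
    }

    // Applies every rule in insertion order; each rule operates on the
    // output of the previous one. Note that replaceAll interprets "$1"
    // style group references in the substitution string.
    String applyAll(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }
}
```

Because each rule sees the result of the one before it, order matters. For example, a rule that removes a session ID may leave a dangling "?" that a later rule then removes.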
public String normalize(String urlString) throws MalformedURLException
Specified by: normalize in interface UrlNormalizer
Overrides: normalize in class BasicUrlNormalizer
Throws: MalformedURLException
public static void main(String[] args) throws org.apache.oro.text.regex.MalformedPatternException, IOException
Throws: org.apache.oro.text.regex.MalformedPatternException, IOException