net.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
net.nutch.analysis.lang.HTMLLanguageParser
- All Implemented Interfaces:
- HtmlParseFilter
- public class HTMLLanguageParser
- extends Object
- implements HtmlParseFilter
Adds metadata identifying the language of the document, if one is found.
We could also run statistical analysis here, but since this filter only sees HTML, we'd miss all other formats.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
META_LANG_NAME
public static final String META_LANG_NAME
- See Also:
- Constant Field Values
LOG
public static final Logger LOG
HTMLLanguageParser
public HTMLLanguageParser()
filter
public Parse filter(Content content, Parse parse, DocumentFragment doc) throws ParseException
- Scans the HTML document for possible indications of the content language:
- 1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
- 2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
- 3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
Only the first occurrence of a language is stored.
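The lookup described above can be illustrated with a minimal sketch. It is not the Nutch implementation: the class LangSketch and its methods findLanguage and detect are hypothetical names, it uses only the standard org.w3c.dom API, and it assumes that "first occurrence" means the first hint met in document order. The input is parsed as well-formed XHTML with lowercase attribute names, which is a simplification of real-world HTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not part of net.nutch.analysis.lang.
public class LangSketch {

    /** Returns the first language hint found in document order, or null if none. */
    static String findLanguage(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            // 1. <html lang="...">
            if (tag.equals("html") && el.hasAttribute("lang")) {
                return el.getAttribute("lang");
            }
            // 2. <meta name="dc.language" content="...">
            // 3. <meta http-equiv="content-language" content="...">
            if (tag.equals("meta") && el.hasAttribute("content")) {
                // getAttribute returns "" when the attribute is absent
                String name = el.getAttribute("name");
                String httpEquiv = el.getAttribute("http-equiv");
                if (name.equalsIgnoreCase("dc.language")
                        || httpEquiv.equalsIgnoreCase("content-language")) {
                    return el.getAttribute("content");
                }
            }
        }
        // Depth-first walk; stop at the first hint so later ones are ignored.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findLanguage(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Parses a well-formed XHTML string and looks for a language hint. */
    static String detect(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return findLanguage(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/>"
                + "</head><body/></html>";
        System.out.println(detect(page));
    }
}
```

Because the html element precedes any meta element in a well-formed page, a document-order walk checks the three sources in the order listed above. A real filter would receive an already-parsed DocumentFragment and would record the result as parse metadata instead of returning it.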
- Specified by:
- filter in interface HtmlParseFilter
- Throws:
- ParseException
Copyright © 2005 The Nutch Organization.