net.nutch.analysis.lang
Class HTMLLanguageParser

java.lang.Object
  extended by net.nutch.analysis.lang.HTMLLanguageParser
All Implemented Interfaces:
HtmlParseFilter

public class HTMLLanguageParser
extends Object
implements HtmlParseFilter

Adds metadata identifying the language of the document, if found. We could also run statistical analysis here, but we would miss all other formats.


Field Summary
static Logger LOG
           
static String META_LANG_NAME
           
 
Fields inherited from interface net.nutch.parse.HtmlParseFilter
X_POINT_ID
 
Constructor Summary
HTMLLanguageParser()
           
 
Method Summary
 Parse filter(Content content, Parse parse, DocumentFragment doc)
          Scan the HTML document looking for possible indications of content language.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

META_LANG_NAME

public static final String META_LANG_NAME
See Also:
Constant Field Values

LOG

public static final Logger LOG
Constructor Detail

HTMLLanguageParser

public HTMLLanguageParser()
Method Detail

filter

public Parse filter(Content content,
                    Parse parse,
                    DocumentFragment doc)
             throws ParseException
Scan the HTML document looking for possible indications of content language:
  1. html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
  2. meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
  3. meta http-equiv (content-language) (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
Only the first occurrence of language is stored.

    Specified by:
    filter in interface HtmlParseFilter
    Throws:
    ParseException
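
The three checks listed for filter can be illustrated with a minimal, standalone sketch. It is not the Nutch implementation: the class LanguageHintSketch and the helpers parse, firstMatch and findLanguage are hypothetical names, the input is assumed to be well-formed XHTML parsed with the JDK's own XML parser, and the checks are assumed to be applied in the listed order (the page does not say whether priority order or document order decides which occurrence is "first").

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
public class LanguageHintSketch {

    // Parses a well-formed XHTML string into a DOM tree
    // (a stand-in for the already parsed document passed to a filter).
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    // Depth-first search for the first <tag> element that carries a non-empty
    // valueAttr and, when keyAttr is non-null, whose keyAttr equals keyValue
    // (case-insensitive). Returns the valueAttr value, or null if none matches.
    static String firstMatch(Node node, String tag, String keyAttr,
                             String keyValue, String valueAttr) {
        if (node instanceof Element) {
            Element e = (Element) node;
            boolean keyOk = keyAttr == null
                    || keyValue.equalsIgnoreCase(e.getAttribute(keyAttr));
            String value = e.getAttribute(valueAttr).trim();
            if (tag.equalsIgnoreCase(e.getTagName()) && keyOk && !value.isEmpty()) {
                return value;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String found = firstMatch(c, tag, keyAttr, keyValue, valueAttr);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Applies the three checks in the listed order and keeps only the first hit.
    static String findLanguage(Node root) {
        String lang = firstMatch(root, "html", null, null, "lang");
        if (lang == null) {
            lang = firstMatch(root, "meta", "name", "dc.language", "content");
        }
        if (lang == null) {
            lang = firstMatch(root, "meta", "http-equiv", "content-language", "content");
        }
        return lang;
    }

    public static void main(String[] args) {
        Node both = parse("<html lang=\"en\"><head>"
                + "<meta name=\"dc.language\" content=\"fr\"/></head></html>");
        Node metaOnly = parse("<html><head>"
                + "<meta http-equiv=\"Content-Language\" content=\"de\"/></head></html>");
        System.out.println(findLanguage(both));     // prints "en"
        System.out.println(findLanguage(metaOnly)); // prints "de"
    }
}
```

In this sketch a document with no language hint yields null, which mirrors the "if found" wording of the class description: nothing is stored when no indication is present.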


Copyright © 2005 The Nutch Organization.