public class HTMLDocument extends Object
Modifier and Type | Class and Description |
---|---|
static class |
HTMLDocument.TextField
This class represents text extracted from the HTML DOM, together with
the node from which that text was retrieved.
|
Constructor and Description |
---|
HTMLDocument(Node document)
Constructor accepting the root node.
|
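The constructor takes a W3C DOM `Node` as the document root. The following is a minimal sketch of one way to obtain such a node using only the JDK XML parser. The `RootNodeSketch` class and its `parseXhtml` helper are hypothetical names introduced for illustration, and the sketch assumes well-formed XHTML input; real-world HTML usually needs a more tolerant parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RootNodeSketch {

    // Hypothetical helper: parses well-formed XHTML into a W3C DOM tree.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document root = parseXhtml("<html lang='en'><body><p id='intro'>Hello</p></body></html>");
        // With the library on the classpath, the root could then be wrapped:
        // HTMLDocument html = new HTMLDocument(root);
        System.out.println(root.getDocumentElement().getNodeName()); // prints "html"
    }
}
```

Because `Document` is itself a `Node`, the parsed tree can be passed to the constructor directly.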
Modifier and Type | Method and Description |
---|---|
static String |
extractRelTag(NamedNodeMap attributes)
Extracts the href specific rel-tag string.
|
static String |
extractRelTag(String hrefAttributeContent)
Extracts the href specific rel-tag string.
|
HTMLDocument.TextField[] |
extractRelTagNodes()
Extracts all the
rel tag nodes. |
String |
find(String xpath) |
List<Node> |
findAll(String xpath) |
List<Node> |
findAllByClassName(String clazz)
Finds all the nodes by class name.
|
Node |
findMicroformattedObjectNode(String objectTag,
String name) |
String |
findMicroformattedValue(String objectTag,
String object,
String fieldTag,
String field,
String key) |
Node |
findNodeById(String id) |
String |
getDefaultLanguage()
Returns the document default language.
|
Node |
getDocument() |
String[] |
getPathToLocalRoot()
Returns the sequence of ancestors from the document root to the local root (document).
|
HTMLDocument.TextField[] |
getPluralTextField(String className)
Returns a plural text field.
|
HTMLDocument.TextField[] |
getPluralUrlField(String className)
Returns the list of URLs associated with the fields marked with class className.
|
HTMLDocument.TextField |
getSingularTextField(String className)
Returns a singular text field.
|
HTMLDocument.TextField |
getSingularUrlField(String className)
Returns the URL associated with the field marked with class className.
|
String |
getText()
Returns the text contained inside a node if leaf,
null otherwise. |
String |
readAttribute(String attribute)
Reads an attribute without risking a NullPointerException; if the
attribute is missing, an empty string is returned.
|
static String |
readNodeContent(Node node,
boolean prettify)
Reads the text content of the given node and returns it.
|
static HTMLDocument.TextField |
readTextField(Node node)
Reads a text field from the given node and returns it.
|
static void |
readUrlField(List<HTMLDocument.TextField> res,
Node node)
Reads a URL field from the given node, adding the content to the given res list.
|
org.openrdf.model.URI |
resolveURI(String uri) |
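public HTMLDocument(Node document)
Constructor accepting the root node.
Parameters:
document - the root node.

public static HTMLDocument.TextField readTextField(Node node)
Reads a text field from the given node.
Parameters:
node - the node from which to read the content.

public static void readUrlField(List<HTMLDocument.TextField> res, Node node)
Reads a URL field from the given node, adding the content to the given res list.
Parameters:
res - the list to which the content is added.
node - the node from which to read the URL field.

public static String extractRelTag(String hrefAttributeContent)
Extracts the href specific rel-tag string.
Parameters:
hrefAttributeContent - the content of the href attribute.

public static String extractRelTag(NamedNodeMap attributes)
Extracts the href specific rel-tag string.
Parameters:
attributes - the list of attributes of a node.

public static String readNodeContent(Node node, boolean prettify)
Reads the text content of the given node and returns it. If the prettify flag is true the text is cleaned up.
Parameters:
node - the node whose content is read.
prettify - if true, blank chars will be removed.

public org.openrdf.model.URI resolveURI(String uri) throws ExtractionException
Throws:
ExtractionException - if the base URI is invalid.

public String findMicroformattedValue(String objectTag, String object, String fieldTag, String field, String key)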
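The rel-tag extraction described for `extractRelTag(String hrefAttributeContent)` can be sketched as follows. This is not the library's implementation; it assumes the rel-tag microformat convention that the tag is the last path segment of the href, and the class name `RelTagSketch` is hypothetical. URL decoding of the segment is deliberately left out.

```java
public class RelTagSketch {

    /**
     * Returns the last non-empty path segment of an href value,
     * ignoring any query string or fragment (illustrative helper).
     */
    public static String extractRelTag(String hrefAttributeContent) {
        // Drop the query string and fragment, if present.
        String path = hrefAttributeContent.split("[?#]", 2)[0];
        // Walk the segments backwards so a trailing slash is skipped.
        String[] segments = path.split("/");
        for (int i = segments.length - 1; i >= 0; i--) {
            if (!segments[i].isEmpty()) {
                return segments[i];
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(extractRelTag("http://example.org/tags/java"));
        // prints "java"
        System.out.println(extractRelTag("http://example.org/tags/semantic-web/?page=2#top"));
        // prints "semantic-web"
    }
}
```

Returning an empty string for an href without any path segment is a design choice of this sketch; the documented method may behave differently in that case.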
public Node getDocument()

public HTMLDocument.TextField getSingularTextField(String className)
Returns a singular text field.
Parameters:
className - name of class containing text.

public HTMLDocument.TextField[] getPluralTextField(String className)
Returns a plural text field.
Parameters:
className - name of class node containing text.

public HTMLDocument.TextField getSingularUrlField(String className)
Returns the URL associated with the field marked with class className.
Parameters:
className - name of node class containing the URL field.

public HTMLDocument.TextField[] getPluralUrlField(String className)
Returns the list of URLs associated with the fields marked with class className.
Parameters:
className - name of node class containing the URL field.
Returns:
the HTMLDocument.TextField instances found.

public Node findMicroformattedObjectNode(String objectTag, String name)
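public String readAttribute(String attribute)
Reads an attribute without risking a NullPointerException; if the attribute is missing, an empty string is returned.
Parameters:
attribute - the attribute name.

public List<Node> findAllByClassName(String clazz)
Finds all the nodes by class name.
Parameters:
clazz - the class name.

public String getText()
Returns the text contained inside a node if leaf, null otherwise.

public String getDefaultLanguage()
Returns the document default language, or null if there is none.

public String[] getPathToLocalRoot()
Returns the sequence of ancestors from the document root to the local root (document).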
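The null-safe attribute read and the class-name lookup described for `readAttribute` and `findAllByClassName` can be illustrated with plain W3C DOM calls. This is a sketch under stated assumptions, not the library's code: the class `ClassNameLookupSketch`, its static helpers, and the extra `Node` parameter are hypothetical, and it assumes that a node matches when `clazz` is one of the whitespace-separated tokens of its `class` attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ClassNameLookupSketch {

    /** Null-safe attribute read: returns "" when the attribute is absent. */
    static String readAttribute(Node node, String attribute) {
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes == null) {
            return "";
        }
        Node attr = attributes.getNamedItem(attribute);
        return attr == null ? "" : attr.getNodeValue();
    }

    /** Collects, in document order, every node whose class attribute lists clazz. */
    static List<Node> findAllByClassName(Node root, String clazz) {
        List<Node> result = new ArrayList<>();
        collect(root, clazz, result);
        return result;
    }

    private static void collect(Node node, String clazz, List<Node> result) {
        for (String token : readAttribute(node, "class").trim().split("\\s+")) {
            if (token.equals(clazz)) {
                result.add(node);
                break;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), clazz, result);
        }
    }

    /** Hypothetical helper: parses well-formed XHTML into a DOM tree. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div class='vcard'><span class='fn n'>Jane</span>"
                + "<a class='url' href='http://example.org/'>home</a></div>");
        Node link = findAllByClassName(doc, "url").get(0);
        System.out.println(findAllByClassName(doc, "fn").size()); // prints 1
        System.out.println(readAttribute(link, "href"));          // prints http://example.org/
        System.out.println(readAttribute(link, "title").isEmpty()); // prints true
    }
}
```

Returning an empty string instead of null lets callers chain string operations on the result without a null check, which matches the behavior the `readAttribute` description states.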
public HTMLDocument.TextField[] extractRelTagNodes()
Extracts all the rel tag nodes.

Copyright © 2010-2013 The Apache Software Foundation. All Rights Reserved.