Xerces::DOMPARSE - A Perl module for parsing DOMs.
    # Here's an example that reads in an XML file named on the
    # command line, removes all formatting, re-adds formatting
    # and then prints the DOM to standard output.

    use Xerces::DOM;
    use Xerces::DOMPARSE;

    my $parser = new DOM::Parser ();
    $parser->parse ($ARGV[0]);
    my $doc = $parser->getDocument ();

    DOMPARSE::unformat ($doc);
    DOMPARSE::format ($doc);
    DOMPARSE::print (\*STDOUT, $doc);
Use this module in conjunction with Xerces::DOM. Once you have read an XML file into a DOM tree in memory, this module provides routines for recursive descent parsing of the DOM tree. It also provides three concrete and useful functions to format, unformat and print DOM trees, all of which are built on the more general parsing functions.
Processes $node and its children recursively and removes all white space text nodes. It is often difficult to process a DOM tree that contains formatting while preserving reasonable formatting. Use unformat to remove formatting, then process the unformatted DOM, then use format to add back formatting that is reasonable for the new tree.
Processes $node and its children recursively and introduces white space text nodes to create a DOM tree that will print with reasonable indents and newlines. Only call format on a DOM tree that has no formatting white space in it; otherwise the results will be incorrect. Call unformat first to remove formatting white space.
You can optionally set the string variable $DOMPARSE::INDENT to the indent characters you want to use. By default it is a single tab.
Processes $node and its children recursively and prints the DOM tree to $file_handle as a standard XML file. You can override printing behavior by supplying any of several "printer" functions.
    $DOMPARSE::NODE_PRINTER
    $DOMPARSE::DOCUMENT_NODE_PRINTER
    $DOMPARSE::DOCUMENT_TYPE_NODE_PRINTER
    $DOMPARSE::COMMENT_NODE_PRINTER
    $DOMPARSE::TEXT_NODE_PRINTER
    $DOMPARSE::CDATA_SECTION_NODE_PRINTER
    $DOMPARSE::ELEMENT_NODE_PRINTER
    $DOMPARSE::ENTITY_REFERENCE_NODE_PRINTER
    $DOMPARSE::PROCESSING_INSTRUCTION_NODE_PRINTER
    $DOMPARSE::ATTRIBUTE_PRINTER
Some of these printers call other printers. For example, $DOMPARSE::NODE_PRINTER determines the node type and calls the corresponding printer for that type, e.g. $DOMPARSE::ELEMENT_NODE_PRINTER. So if you replace the printer for a node type that has children, you are responsible for calling the child node printers yourself.
All printers take two parameters, a file handle and the node. See DOMPARSE::parse_nodes and DOMPARSE::parse_child_nodes for details.
It is easy to write a replacement printer that adds value and then calls the default processing, as follows.
    my $original_text_node_printer = $DOMPARSE::TEXT_NODE_PRINTER;
    $DOMPARSE::TEXT_NODE_PRINTER = \&my_text_node_printer;

    sub my_text_node_printer {
        my ($fh, $node) = @_;
        # look at the text node and do something extra
        return &$original_text_node_printer ($fh, $node);
    }
The $DOMPARSE::ESCAPE variable (integer) controls whether special XML characters are escaped, e.g. whether an ampersand "&" is printed as "&amp;". Set $DOMPARSE::ESCAPE to 1 (default) to escape special characters, or to 0 to print characters literally.
Call print_string whenever you need to expand special characters (& < > ") to their escape sequence equivalents. print_string is used extensively by the default implementation of DOMPARSE::print. When you replace any of the node printers, be careful to use it to print node and attribute names and values (but probably not anything else).
The print function respects the global $DOMPARSE::ESCAPE flag. By default it is set to true (1) and escape conversion is performed. Set it to false (0) when you don't want escape conversion.
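The escape conversion described above can be pictured with a small standalone sketch. It does not call the module: `escape_xml`, `%ESCAPES` and the package variable `$ESCAPE` are hypothetical stand-ins for print_string and $DOMPARSE::ESCAPE, and the real function writes to a file handle and may handle a different set of characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $DOMPARSE::ESCAPE:
# 1 means "escape special characters", 0 means "print literally".
our $ESCAPE = 1;

# Special XML characters and their escape sequences.
my %ESCAPES = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Hypothetical helper: returns $text with special characters escaped
# when $ESCAPE is true, or unchanged when it is false.
sub escape_xml {
    my ($text) = @_;
    return $text unless $ESCAPE;
    # A single pass, so the "&" of an inserted escape is never escaped again.
    $text =~ s/([&<>"])/$ESCAPES{$1}/g;
    return $text;
}

print escape_xml('a < b && c > "d"'), "\n";
```

Doing the substitution in one pass matters: replacing "&" in a separate step after the other characters would turn "&lt;" into "&amp;lt;".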
Call parse_nodes to parse $node and all of its children recursively. Each node will be visited and your parsing function, $process_node, will be called. Optional data $data will be passed through if provided.
Your parsing function must have the following signature.
process_node ($node, $data)
If it returns 1 then children of $node will also be parsed. If it returns 0 then they won't. It is common to use one parsing function to get to a certain level in the DOM tree, then to return 0 and to call parse_child_nodes to parse nodes under that level with a different processing function.
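The return-value contract and the two-stage pattern can be tried out without a DOM at all. The sketch below is only an illustration: `walk_nodes` and `walk_child_nodes` are hypothetical stand-ins for parse_nodes and parse_child_nodes, their argument order is an assumption, and the tree is a toy structure of hashes rather than a Xerces DOM.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parse_nodes: visits $node, then descends
# into its children only if the callback returned a true value.
sub walk_nodes {
    my ($node, $process_node, $data) = @_;
    if ($process_node->($node, $data)) {
        walk_child_nodes($node, $process_node, $data);
    }
}

# Hypothetical stand-in for parse_child_nodes: like walk_nodes,
# except that $node itself is not visited.
sub walk_child_nodes {
    my ($node, $process_node, $data) = @_;
    walk_nodes($_, $process_node, $data) for @{ $node->{children} || [] };
}

# A toy tree of hashes, not a real DOM.
my $tree = {
    name     => 'catalog',
    children => [
        { name => 'book', children => [ { name => 'title' }, { name => 'author' } ] },
        { name => 'book', children => [ { name => 'title' } ] },
    ],
};

# Second-stage callback: records every node found under a book.
sub process_book_part {
    my ($node, $data) = @_;
    push @$data, "  $node->{name}";
    return 1;
}

# First-stage callback: descends until it reaches a book, hands the
# book's children to the second callback, then returns 0 so that the
# default descent does not visit them again.
sub process_top {
    my ($node, $data) = @_;
    push @$data, $node->{name};
    if ($node->{name} eq 'book') {
        walk_child_nodes($node, \&process_book_part, $data);
        return 0;
    }
    return 1;
}

my @visited;
walk_nodes($tree, \&process_top, \@visited);
print "$_\n" for @visited;
```

Here the optional data argument carries an array reference that both callbacks append to, which is one typical use for pass-through data.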
Call parse_child_nodes to parse the children of $node recursively. This is just like parse_nodes except that $node itself is not parsed.
Looks up the DOM tree until it finds the document node associated with the given $node. Then returns the document node.
Returns the depth of the specified $node in the DOM document. The document has depth 0, the root node has depth 1, and so on.
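Both the document lookup and the depth calculation amount to following parent links upward. The sketch below illustrates the idea on toy hash nodes with a `parent` key. `find_document` and `node_depth` are hypothetical names, and nothing here is taken from the module's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: follows parent links until a node with no
# parent (the document) is reached, and returns that node.
sub find_document {
    my ($node) = @_;
    $node = $node->{parent} while defined $node->{parent};
    return $node;
}

# Hypothetical helper: counts the parent links between $node and the
# document, so the document itself has depth 0.
sub node_depth {
    my ($node) = @_;
    my $depth = 0;
    while (defined $node->{parent}) {
        $node = $node->{parent};
        $depth++;
    }
    return $depth;
}

# Toy nodes, not a real DOM: document -> root element -> child element.
my $document = { name => '#document' };
my $root     = { name => 'catalog', parent => $document };
my $book     = { name => 'book',    parent => $root };

printf "%s has depth %d\n", $_->{name}, node_depth($_)
    for ($document, $root, $book);
```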
It is common practice to have an element node that encloses a single text node. If you know you have such a node, you can call element_text to directly access the enclosed text as a string. This is faster than accessing the enclosed text node and then getting the value of it.
Inserts $new_node in the DOM tree immediately before and as a sibling of $ref_node. It is safe to call insert_before while in the middle of parsing a DOM tree if $ref_node is the current node being parsed. The newly inserted node will not be parsed.
Inserts $new_node in the DOM tree immediately after and as a sibling of $ref_node. It is safe to call insert_after while in the middle of parsing a DOM tree if $ref_node is the current node being parsed. The newly inserted node will not be parsed.
Removes $node from the DOM tree. It is safe to call remove while in the middle of parsing a DOM tree if $node is the current node being parsed. The next node to be parsed will be the same one that would have been parsed had $node not been removed, e.g. $node's next sibling.
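One way a traversal can offer this guarantee is to decide which siblings to visit before calling the callback on any of them. The sketch below shows that idea on a toy tree of hashes. `make_node`, `remove_node` and `walk_nodes` are hypothetical helpers, and whether the module itself works this way is an assumption, not something stated above.

```perl
use strict;
use warnings;

# Builds a toy node and gives each child a parent link. Not a real DOM node.
sub make_node {
    my ($name, @children) = @_;
    my $node = { name => $name, children => [@children] };
    $_->{parent} = $node for @children;
    return $node;
}

# Hypothetical stand-in for remove: detaches $node from its parent.
sub remove_node {
    my ($node) = @_;
    my $parent = delete $node->{parent} or return;
    @{ $parent->{children} } = grep { $_ != $node } @{ $parent->{children} };
}

# Hypothetical stand-in for parse_nodes. It iterates over a snapshot of
# the child list, so removing the current node cannot shift or skip
# the siblings that follow it.
sub walk_nodes {
    my ($node, $process_node) = @_;
    return unless $process_node->($node);
    my @snapshot = @{ $node->{children} };
    walk_nodes($_, $process_node) for @snapshot;
}

my $list = make_node('list', map { make_node($_) } qw(a drop b drop c));

my @visited;
walk_nodes($list, sub {
    my ($node) = @_;
    push @visited, $node->{name};
    if ($node->{name} eq 'drop') {
        remove_node($node);   # remove the node currently being visited
        return 0;             # nothing below it needs visiting
    }
    return 1;
});

print "visited:   @visited\n";
print "remaining: @{[ map { $_->{name} } @{ $list->{children} } ]}\n";
```

Every node is still visited once, in document order, and only the nodes named `drop` are gone from the tree afterwards.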
Tom Watson <rtwatson@us.ibm.com> wrote version 1.0 and submitted it to the XML Apache project <http://xml.apache.org>, where you can contribute to future versions and where the corresponding C++ and Java parsers are also developed as open source projects.