XML::Twig - A perl module for processing huge XML documents in tree mode.
Small documents
    my $twig=XML::Twig->new();    # create the twig
    $twig->parsefile( 'doc.xml'); # build it
    my_process( $twig);           # use twig methods to process it
    $twig->print;                 # output the twig
Huge documents
    my $twig=XML::Twig->new(
        twig_handlers =>
          { title  => sub { $_->set_gi( 'h2') }, # change title tags to h2
            para   => sub { $_->set_gi( 'p')  }, # change para to p
            hidden => sub { $_->delete;       }, # remove hidden elements
            list   => \&my_list_process,         # process list elements
            div    => sub { $_[0]->flush;     }, # output and free memory
          },
        PrettyPrint => 'indented',               # output will be nicely formatted
        EmptyTags   => 'html',                   # outputs <empty_tag />
    );
    $twig->parsefile( 'doc.xml');                # parse, triggering the handlers
    $twig->flush;                                # flush the end of the document
See XML::Twig 101 for other ways to use the module, as a filter for example
This module provides a way to process XML documents. It is built on top of XML::Parser.
The module offers a tree interface to the document, while allowing you to output the parts of it that have been completely processed.
It allows minimal resource (CPU and memory) usage by building the tree only for the parts of the document that need actual processing, through the use of the twig_roots and twig_print_outside_roots options. The finish and finish_print methods also help to improve performance.
XML::Twig tries to make simple things easy, so it does its best to take care of a lot of the (usually) annoying (but sometimes necessary) features that come with XML and XML::Parser.
XML::Twig can be used either on "small" XML documents (that fit in memory) or on huge ones, by processing parts of the document and outputting or discarding them once they are processed.
    my $t= XML::Twig->new();
    $t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');
    my $root= $t->root;
    $root->set_gi( 'html');                  # change doc to html
    my $title= $root->first_child( 'tit');   # get the title
    $title->set_gi( 'h1');                   # turn it into h1
    my @para= $root->children( 'para');      # get the para children
    foreach my $para (@para)
      { $para->set_gi( 'p'); }               # turn them into p
    $t->print;                               # output the document
Other useful methods include:
att: $elt->{'att'}->{'type'} returns the type attribute for an element,
set_att: $elt->set_att( type => "important") sets the type attribute to the important value,
next_sibling: $elt->{next_sibling} returns the next sibling in the document (in the example $title->{next_sibling} is the first para, while $elt->next_sibling( 'table') is the next table sibling).
The document can also be transformed through the use of the cut, copy, paste and move methods, for example:
    $title->cut;
    $title->paste( 'after', $p);
And much, much more, see Elt.
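As a small illustration of the cut and paste methods mentioned above, here is a sketch that moves the title after the first para. The document string and variable names are made up for the example, and the accessor calls (first_child, sprint) are used in their method form.

```perl
use strict;
use warnings;
use XML::Twig;

# build a small in-memory document (sample data, not from a real file)
my $t = XML::Twig->new();
$t->parse( '<d><tit>title</tit><para>para1</para><para>p2</para></d>');

my $root  = $t->root;
my $title = $root->first_child( 'tit');   # element to move
my $p     = $root->first_child( 'para');  # element used as the anchor

$title->cut;                    # detach the title from the tree
$title->paste( 'after', $p);    # re-attach it right after the first para

my $result = $root->sprint;     # serialize the modified tree to a string
print "$result\n";
```

The cut element stays usable after it is detached, which is why it can be pasted back somewhere else in the same tree.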
One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (storing an XML document in memory as a tree is quite memory-expensive, the expansion factor often being around 10).
To do this you can define handlers that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit, using the navigation and the cut-n-paste methods, plus lots of convenient ones like prefix.
Once the element is completely processed you can then flush it, which will output it and free the memory. You can also purge it if you don't need to output it (if you are just extracting some data from the document for example). The handler will be called again once the next relevant element has been parsed.
    my $t= XML::Twig->new( twig_handlers =>
                             { section => \&section,
                               para    => sub { $_->set_gi( 'p'); },
                             },
                         );
    $t->parsefile( 'doc.xml');
    $t->flush; # don't forget to flush one last time in the end or anything
               # after the last </section> tag will not be output

    # the handler is called once a section is completely parsed, ie when
    # the end tag for section is found, it receives the twig itself and
    # the element (including all its sub-elements) as arguments
    sub section
      { my( $t, $section)= @_;      # arguments for all twig_handlers
        $section->set_gi( 'div');   # change the gi, my favourite method...
        # let's use the attribute nb as a prefix to the title
        my $title= $section->first_child( 'title'); # find the title
        my $nb= $title->{'att'}->{'nb'}; # get the attribute
        $title->prefix( "$nb - ");  # easy isn't it?
        $section->flush;            # outputs the section and frees memory
      }
    my $t= XML::Twig->new( twig_handlers =>
                             { 'section/title' => \&print_elt_text} );
    $t->parsefile( 'doc.xml');
    sub print_elt_text
      { my( $t, $elt)= @_;
        print $elt->text;
      }

    my $t= XML::Twig->new( twig_handlers =>
                             { 'section[@level="1"]' => \&print_elt_text } );
    $t->parsefile( 'doc.xml');
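The two trigger styles shown above (a partial path and an attribute condition) can be tried on an in-memory document. This is only a sketch: the sample document, the @titles array and the $level1 counter are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;       # text of every title found directly under a section
my $level1 = 0;   # number of section elements carrying level="1"

my $t = XML::Twig->new(
    twig_handlers => {
        # partial path: only title elements whose parent is a section
        'section/title'       => sub { push @titles, $_->text; 1; },
        # attribute condition on a simple gi
        'section[@level="1"]' => sub { $level1++; 1; },
    },
);

# sample document: the top-level title is not inside a section
$t->parse( '<doc><title>Doc title</title>'
         . '<section level="1"><title>Intro</title><para>p1</para></section>'
         . '<section level="2"><title>Details</title></section></doc>');

print join( ', ', @titles), "\n";
print "level 1 sections: $level1\n";
```

Both handlers return a true value so that any other applicable handler would still be called.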
There is of course more to it: you can trigger handlers on more elaborate conditions than just the name of the element, section/title for example.
You can also use TwigStartHandlers to process an element as soon as the start tag is found. Besides prefix you can also use suffix.
The twig_roots mode builds only the required sub-trees from the document. Anything outside of the twig roots will just be ignored:
    my $t= XML::Twig->new(
        # the twig will include just the root and selected titles
        twig_roots => { 'section/title' => \&print_elt_text,
                        'annex/title'   => \&print_elt_text
                      }
    );
    $t->parsefile( 'doc.xml');

    sub print_elt_text
      { my( $t, $elt)= @_;
        print $elt->text;    # print the text (including sub-element texts)
        $t->purge;           # frees the memory
      }
You can use that mode when you want to process parts of a document but are not interested in the rest, and you don't want to pay the price, either in time or memory, of building the tree for it.
You can combine the twig_roots and the twig_print_outside_roots options to build filters, which let you modify selected elements and will output the rest of the document as is.
This would convert prices in $ to prices in Euro in a document:
    my $t= XML::Twig->new(
        twig_roots => { 'price' => \&convert, },   # process prices
        twig_print_outside_roots => 1,             # print the rest
    );
    $t->parsefile( 'doc.xml');

    sub convert
      { my( $t, $price)= @_;
        my $currency= $price->{'att'}->{'currency'};    # get the currency
        if( $currency eq 'USD')
          { my $usd_price= $price->text;                # get the price
            # %rate is just a conversion table
            my $euro_price= $usd_price * $rate{usd2euro};
            $price->set_text( $euro_price);             # set the new price
            $price->set_att( currency => 'EUR');        # don't forget this!
          }
        $price->print;                                  # output the price
      }
keep_spaces, keep_spaces_in and discard_spaces_in options.
keep_encoding option
XML::Twig provides the safe_parse and the safe_parsefile methods, which wrap the parse in an eval and return either the parsed twig or 0 in case of failure.
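A minimal sketch of how safe_parse might be used to check several inputs without dying on the first malformed one. The two document strings are made up, and relying on $@ for the error message is an assumption based on the method wrapping the parse in an eval.

```perl
use strict;
use warnings;
use XML::Twig;

# one well-formed and one malformed document (sample data)
my $good_xml = '<doc><p>ok</p></doc>';
my $bad_xml  = '<doc><p>not closed</doc>';

# safe_parse returns the twig on success and a false value on failure
my $good = XML::Twig->new->safe_parse( $good_xml);
my $bad  = XML::Twig->new->safe_parse( $bad_xml);

print $good ? "first document parsed\n"  : "first document failed\n";
print $bad  ? "second document parsed\n" : "second document failed\n";
```

Testing the return value for truth, rather than comparing it to 0, keeps the code working whether the failure value is 0 or undef.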
A twig is a subclass of XML::Parser, so all XML::Parser methods can be called on a twig object, including parse and parsefile. setHandlers on the other hand cannot be used, see BUGS.
The idea is to support a useful but efficient (thus limited) subset of XPATH. A fuller expression set will be supported in the future, as users ask for more and as I manage to implement it efficiently. This will never encompass all of XPATH due to the streaming nature of parsing (no lookahead after the element end tag).
A generic_attribute_condition is a condition on an attribute, in the form *[@att="val"] or *[@att]. Single quotes can be used instead of double quotes and the leading '*' is actually optional. No matter what the gi of the element is, the handler will be triggered either if the attribute has the specified value or if it just exists.
A string_condition is a condition on the content of an element, in the form gi[string()="foo"]. Single quotes can be used instead of double quotes; at the moment you cannot escape the quotes (this will be added as soon as I dig out my copy of Mastering Regular Expressions from its storage box). The text returned is, as per what I (and Matt Sergeant!) understood from the XPATH spec, the concatenation of all the text in the element, excluding all markup. Thus to call a handler on the element <p>text <b>bold</b></p> the appropriate condition is p[string()="text bold"]. Note that this is not exactly conformant to the XPATH spec, it just tries to mimic it while still being quite concise.
An extension of that notation is gi[string(child_gi)="foo"], where the handler will be called if a child of a gi element has a text value of foo. At the moment only direct children of the gi element are checked.
If you need to test on descendants of the element let me know. The fix is trivial but would slow down the checks, so I'd like to keep it the way it is.
A regexp_condition is a condition on the content of an element, in the form gi[string()=~ /foo/]. This is the same as a string condition except that the text of the element is matched against the regexp. The i, m, s and o modifiers can be used on the regexp.
The gi[string(child_gi)=~ /foo/] extension is also supported.
An attribute_condition is a simple condition on an attribute of the current element, in the form gi[@att="val"] (single quotes can be used instead of double quotes, you cannot escape quotes either). If several attribute_conditions are true for the same element, all the handlers can be called in turn (in the order in which they were first defined). If the ="val" part is omitted (the condition is then gi[@att]) then the handler is triggered if the attribute actually exists for the element, no matter what its value is.
A full_path looks like '/doc/section/chapter/title', it starts with a / then gives all the gi's to the element. The handler will be called if the path to the current element (in the input document) is exactly as defined by the full_path.
A partial_path is like a full_path except it does not start with a /: 'chapter/title' for example. The handler will be called if the path to the element (in the input document) ends as defined in the partial_path.
WARNING: (hopefully temporary) at the moment string_condition, regexp_condition and attribute_condition are only supported on a simple gi, not on a path.
A gi (generic identifier) is just a tag name.
A special gi _all_ is used to call a function for each element. The special gi _default_ is used to call a handler for each element that does NOT have a specific handler.
The order of precedence to trigger a handler is: generic_attribute_condition, string_condition, regexp_condition, attribute_condition, full_path, longer partial_path, shorter partial_path, gi, _default_ .
Important: once a handler has been triggered, if it returns 0 then no other handler is called, except an _all_ handler, which will be called anyway.
If a handler returns a true value and other handlers apply, then the next applicable handler will be called. Rinse, lather, repeat.
When an element is CLOSED the corresponding handler is called, with 2 arguments: the twig and the Element. The twig includes the document tree that has been built so far, the element is the complete sub-tree for the element. $_ is also set to the element.
Text is stored in elements where gi is #PCDATA (due to mixed content, text and sub-element in an element there is no way to store the text as just an attribute of the enclosing element).
Warning: if you have used purge or flush on the twig the element might not be complete, some of its children might have been entirely flushed or purged, and the start tag might even have been printed (by flush) already, so changing its gi might not give the expected result.
More generally, the full_path, partial_path and gi expressions are evaluated against the input document. This means that even if you have changed the gi of an element (changing the gi of a parent element from a handler for example) the change will not impact the expression evaluation. Attributes in attribute_condition are different though. As the initial value of an attribute is not stored, the handler will be triggered if the current attribute/value pair is found when the element end tag is found. Although this can be quite confusing it should not impact most users, and it allows others to play clever tricks with temporary attributes. Let me know if this is a problem for you.
Example:
    my $t= XML::Twig->new( twig_roots => { title => 1, subtitle => 1});
    $t->parsefile( file);

    my $t= XML::Twig->new( twig_roots => { 'section/title' => 1});
    $t->parsefile( file);
returns a twig containing a document including only title and subtitle elements, as children of the root element.
You can use generic_attribute_condition, attribute_condition, full_path, partial_path, gi, _default_ and _all_ to trigger the building of the twig. string_condition and regexp_condition cannot be used as the content of the element, and the string, have not yet been parsed when the condition is checked.
WARNING: paths are checked against the document. Even if the twig_roots option is used they will be checked against the full document tree, not the virtual tree created by XML::Twig.
WARNING: twig_roots elements should NOT be nested, that would hopelessly confuse XML::Twig ;--(
Note: you can set handlers (twig_handlers) using twig_roots.
Example:
    my $t= XML::Twig->new( twig_roots =>
                             { title    => sub { $_[1]->print;},
                               subtitle => \&process_subtitle
                             }
                         );
    $t->parsefile( file);
Example:
    my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                           twig_print_outside_roots => 1,
                         );
    $t->parsefile( file);
    { my $nb;
      sub number_title
        { my( $twig, $title)= @_;
          $nb++;
          $title->prefix( "$nb ");
          $title->print;
        }
    }
This example prints the document outside of the title element, calls number_title for each title element, prints it, and then resumes printing the document. The twig is built only for the title elements.
You can use generic_attribute_condition, attribute_condition, full_path, partial_path, gi, _default_ and _all_ to trigger the handler.
string_condition and regexp_condition cannot be used as the content of the element, and the string, have not yet been parsed when the condition is checked.
The main uses for those handlers are to change the tag name (you might have to do it as soon as you find the open tag if you plan to flush the twig at some point in the element), and to create temporary attributes that will be used when processing sub-elements with TwigHandlers.
You should also use it to change tags if you use flush. If you change the tag in a regular TwigHandler then the start tag might already have been flushed.
Note: StartTag handlers can be called outside of twig_roots if this argument is used. In this case handlers are called with the following arguments: $t (the twig), $gi (the gi of the element) and %att (a hash of the attributes of the element).
If the twig_print_outside_roots argument is also used then the start tag will be printed if the last handler called returns a true value. If it does not then the start tag will not be printed (so you can print a modified string yourself for example).
Note that you can use the ignore method in start_tag_handlers (and only there).
twig_handlers are called when an element is completely parsed, so why have this redundant option? There is only one use for end_tag_handlers: when using the twig_roots option, to trigger a handler for an element outside the roots. It is for example very useful to number titles in a document using nested sections:
    my @no= (0);
    my $no;
    my $t= XML::Twig->new(
        start_tag_handlers => { section => sub { $no[$#no]++;
                                                 $no= join '.', @no;
                                                 push @no, 0;
                                               }
                              },
        twig_roots         => { title => sub { $_[1]->prefix( $no);
                                               $_[1]->print;
                                             }
                              },
        end_tag_handlers   => { section => sub { pop @no; } },
        twig_print_outside_roots => 1
    );
    $t->parsefile( $file);
Using the end_tag_handlers argument without twig_roots will result in an error.
Example:
my $twig= XML::Twig->new( ignore_elts => { elt => 1 }); $twig->parsefile( 'doc.xml');
This will build the complete twig for the document, except that all elt
elements (and their children) will be left out.
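To see the effect of ignore_elts on a tree, here is a sketch using an in-memory document. The element name hidden and the sample content are invented for the illustration; the option is used exactly as in the example above, with a hash mapping the element name to 1.

```perl
use strict;
use warnings;
use XML::Twig;

# elements named 'hidden' (and everything inside them) are left out of the twig
my $t = XML::Twig->new( ignore_elts => { hidden => 1 });

$t->parse( '<doc><p>kept</p>'
         . '<hidden><p>secret</p></hidden>'
         . '<p>also kept</p></doc>');

my $out = $t->root->sprint;   # serialize what was actually built
print "$out\n";
```

Since the ignored sub-trees are never built, this is cheaper than building them and deleting them from a handler afterwards.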
See the t/test6.t test file to see what results you can expect from the various encoding options.
WARNING: if the original encoding is multi-byte then attribute parsing will be EXTREMELY unsafe under any Perl before 5.6, as it uses regular expressions which do not deal properly with multi-byte characters. You can specify an alternate function to parse the start tags with the parse_start_tag option (see below)
WARNING: this option is NOT used when parsing with the non-blocking parser (parse_start, parse_more, parse_done methods) which you probably should not use with XML::Twig anyway as they are totally untested!
print, sprint, flush).
Pre-defined filters are:
iconv library to find out which encodings are available on your system)
    my $conv = XML::Twig::iconv_convert( 'latin1');
    my $t = XML::Twig->new(output_filter => $conv);

    my $conv = XML::Twig::unicode_convert( 'latin1');
    my $t = XML::Twig->new(output_filter => $conv);
Note that the text and att methods do not use the filter, so their results are always in unicode.
original_string() method) and returns a gi and the attributes in a hash (or in a list attribute_name/attribute value).
print, sprint, flush and xml_string.
Note that in the twig the entity will be stored as an element with a gi '#ENT'. The entity will not be expanded there, so you might want to process the entities before outputting it.
Note that to do this the module will generate a temporary file in the current directory. If this is a problem let me know and I will add an option to specify an alternate directory.
See DTD Handling for more information
BUGS
This is quite ugly but better than none, and it is very safe, the document will still be valid (conforming to its DTD).
This is how the SGML parser sgmls splits documents, hence the name.
WARNING: this option leaves the document well-formed but might make it invalid (not conformant to its DTD). If you have elements declared as
<!ELEMENT foo (#PCDATA|bar)>
then a foo element including a bar one will be printed as
    <foo>
    <bar>bar is just pcdata</bar>
    </foo>
This is invalid, as the parser will take the line break after the foo tag as a sign that the element contains PCDATA, it will then die when it finds the bar tag. This may or may not be important for you, but be aware of it!
nice (and with the same warning) but indents elements according to their level
indented)
Bug: comments in the middle of a text element such as
<p>text <!-- comment --> more text</p>
are output at the end of the text:
<p>text more text <!-- comment --></p>
gi is #COMMENT) this can interfere with processing if you expect $elt->{first_child} to be an element but find a comment there. Validation will not protect you from this as comments can happen anywhere. You can use $elt->first_child( 'gi') (which is a good habit anyway) to get where you want. Consider using
drop, keep (default) or process
Note that you can also set PI handlers in the twig_handlers option:
    '?'       => \&handler
    '?target' => \&handler2
The handlers will be called with 2 parameters, the twig and the PI element, if pi is set to process, and with 3, the twig, the target and the data, if pi is set to keep. Of course they will not be called if pi is set to drop.
If pi is set to keep the handler should return a string that will be used as-is as the PI text (it should look like "<?target data?>", or '' if you want to remove the PI).
Only one handler will be called, ?target, or ? if no specific handler for that target is available.
Note: I _HATE_ the Java-like name of arguments used by most XML modules. As XML::Twig is based on XML::Parser I kept the style, but you can also use a more perlish naming convention, using twig_print_outside_roots instead of TwigPrintOutsideRoots or pretty_print instead of PrettyPrint. XML::Twig then normalizes all the argument names.
A die call is thrown if a parse error occurs. Otherwise it will return the twig built by the parse. Use safe_parse if you want the parsing to return even when an error occurs.
Open FILE for reading, then call parse with the open handle. The file is closed no matter how parse returns.
A die call is thrown if a parse error occurs. Otherwise it will return the twig built by the parse. Use safe_parsefile if you want the parsing to return even when an error occurs.
If the $optional_user_agent argument is given then that user agent is used, otherwise a new one is created.
Note that the parsing still stops as soon as an error is detected, there is no way to keep going after an error.
Note that the parsing still stops as soon as an error is detected, there is no way to keep going after an error.
undef if the attribute is not defined)
BUGS
flush takes an optional filehandle as an argument.
options: use the Update_DTD option if you have updated the (internal) DTD and/or the entity list and you want the updated DTD to be output
The PrettyPrint option sets the pretty printing of the document.
Example:
    $t->flush( Update_DTD => 1);
    $t->flush( \*FILE, Update_DTD => 1);
    $t->flush( \*FILE);
options: see flush.
options: see flush.
options: see flush.
Note that this method can also be called on an element. If the element is a parent of the current element then this element will be ignored (the twig will not be built any more for it and what has already been built will be deleted)
WARNING: the pretty print style is a GLOBAL variable, so once set it's applied to ALL print's (and sprint's). Same goes if you use XML::Twig with mod_perl . This should not be a problem as the XML that's generated is valid anyway, and XML processors (as well as HTML processors, including browsers) should not care. Let me know if this is a big problem, but at the moment the performance/cleanliness trade-off clearly favors the global approach.
normal outputs an empty tag '<tag/>', html adds a space '<tag />' and expand outputs '<tag></tag>'
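The three styles can be compared side by side with a short sketch. It assumes the style is switched with the set_empty_tag_style method on the twig, and it uses a made-up document containing a single empty br element. The style is reset to normal at the end because these output settings are global.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new;
$t->parse( '<doc><br/></doc>');      # sample document with one empty element

my %output;                          # style name => serialized document
for my $style ( qw( normal html expand)) {
    $t->set_empty_tag_style( $style);
    $output{$style} = $t->root->sprint;
}
$t->set_empty_tag_style( 'normal');  # restore the default style

printf "%-6s %s\n", $_, $output{$_} for qw( normal html expand);
```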
options: see flush.
options: see flush.
Inherited methods are:
depth in_element within_element context current_line current_column current_byte position_in_context base current_element element_index namespace eq_name generate_ns_name new_ns_prefixes expand_ns_prefix current_ns_prefixes recognized_string original_string xpcroak xpcarp
path($gi)
Reclaims properly the memory used by an XML::Twig object. As the object has circular references it never goes out of scope, so if you want to parse lots of XML documents then the memory leak becomes a problem. Use $twig->dispose to clear this problem.
The print outputs XML data so base entities are escaped.
generic identifier, the tag name in SGML parlance).
Similar methods are available for the other navigation methods:
last_child_text, prev_sibling_text, next_sibling_text, prev_elt_text, next_elt_text, child_text, parent_text
undef otherwise
if( $elt->first_child_matches( 'title')) ...
is equivalent to
if( $elt->{first_child} && $elt->{first_child}->passes( 'title'))
first_child_is is another name for this method
Similar methods are available for the other navigation methods:
last_child_matches, prev_sibling_matches, next_sibling_matches, prev_elt_matches, next_elt_matches, child_matches, parent_matches
The $optional_elt is the root of a subtree. When the next_elt is out of the subtree then the method returns undef. You can then walk a sub tree with:
    my $elt= $subtree_root;
    while( $elt= $elt->next_elt( $subtree_root))
      { # insert processing code here
      }
getElementsByTagName instead)
NOTE: the element itself is not part of the list, in order to include it you will have to write:
my @array= ($elt, $elt->ancestors)
undef
You can actually set several attributes this way:
$elt->set_att( att1 => "val1", att2 => "val2");
You can actually delete several attributes at once:
$elt->del_att( 'att1', 'att2', 'att3');
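A short sketch combining both calls on a freshly created element. The element, attribute names and values are invented for the example, and the values are read back with the att method rather than through the hash.

```perl
use strict;
use warnings;
use XML::Twig;

# a standalone element, not attached to any twig (sample data)
my $elt = XML::Twig::Elt->new( para => { align => 'center' }, 'some text');

$elt->set_att( type => 'important', lang => 'en');  # set two attributes at once
$elt->del_att( 'align', 'lang');                    # delete two attributes at once

my $type  = $elt->att( 'type');    # still set
my $align = $elt->att( 'align');   # undef, the attribute was deleted

print "type: $type\n";
print "align is ", ( defined $align ? "'$align'" : 'not set'), "\n";
```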
The optional position element can be:
If the option is asis then the prefix is added asis: it is created in a separate PCDATA element with an asis property. You can then write:
    $elt1->prefix( '<b>', 'asis');
to create a <b> in the output of print.
If the option is asis then the suffix is added asis: it is created in a separate PCDATA element with an asis property. You can then write:
$elt2->suffix( '<b>', 'asis');
PCDATA or CDATA) element in 2 at $offset, the original element now holds the first part of the string and a new element holds the right part. The new element is returned
If the element is not a text element then the first text child of the element is split
if $elt is <p>tati tata <b>tutu tati titi</b> tata tati tata</p>
$elt->split( qr/(ta)ti/, 'foo', {type => 'toto'} )
will change $elt to
<p><foo type="toto">ta</foo> tata <b>tutu <foo type="toto">ta</foo> titi</b> tata <foo type="toto">ta</foo> tata</p>
The regexp can be passed either as a string or as qr// (perl 5.005 and later), it defaults to \s+ just as the split built-in (but this would be quite a useless behaviour without the $optional_tag parameter)
$optional_tag defaults to PCDATA or CDATA, depending on the initial element type
The list of descendants is returned (including un-touched original elements and newly created ones)
Examples:
    my $elt= XML::Twig::Elt->new();
    my $elt= XML::Twig::Elt->new( 'para', { align => 'center' });
    my $elt= XML::Twig::Elt->new( 'para', { align => 'center' }, 'foo');
    my $elt= XML::Twig::Elt->new( 'br', '#EMPTY');
    my $elt= XML::Twig::Elt->new( 'para');
    my $elt= XML::Twig::Elt->new( 'para', 'this is a para');
    my $elt= XML::Twig::Elt->new( 'para', $elt3, 'another para');
The strings are not parsed, the element is not attached to any twig.
WARNING: if you rely on IDs then you will have to set the id yourself. At this point the element does not belong to a twig yet, so the ID attribute is not known and it won't be stored in the ID list.
As obviously the element does not exist beforehand this method has to be called on the class:
my $elt= parse XML::Twig::Elt( "<a> string to parse, with <sub/> <elements>, actually tons of </elements> h</a>");
A subset of the XPATH abbreviated syntax is covered:
    gi
    gi[1] (or any other positive number)
    gi[last()]
    gi[@att] (the attribute exists for the element)
    gi[@att="val"]
    gi[@att=~ /regexp/]
    gi[@att1="val1" and @att2="val2"]
    gi[@att1="val1" or @att2="val2"]
    gi[string()="toto"] (returns gi elements whose text (as per the text method) is toto)
    gi[string()=~/regexp/] (returns gi elements whose text (as per the text method) matches regexp)
    expressions can start with / (search starts at the document root)
    expressions can start with . (search starts at the current element)
    // can be used to get all descendants instead of just direct children
    * matches any gi
So the following examples from the XPATH recommendation (http://www.w3.org/TR/xpath.html#path-abbrev) work:
    para                       selects the para element children of the context node
    *                          selects all element children of the context node
    para[1]                    selects the first para child of the context node
    para[last()]               selects the last para child of the context node
    */para                     selects all para grandchildren of the context node
    /doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
    chapter//para              selects the para element descendants of the chapter element children of the context node
    //para                     selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
    //olist/item               selects all the item elements in the same document as the context node that have an olist parent
    .//para                    selects the para element descendants of the context node
    ..                         selects the parent of the context node
    para[@type="warning"]      selects all para children of the context node that have a type attribute with value warning
    employee[@secretary and @assistant] selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
The elements will be returned in the document order.
If $optional_offset is used then only one element will be returned, the one with the appropriate offset in the list, starting at 0
Quoting and interpolating variables can be a pain when the Perl syntax and the XPATH syntax collide, so here are some more examples to get you started:
    my $p1= "p1";
    my $p2= "p2";
    my @res= $t->get_xpath( "p[string( '$p1') or string( '$p2')]");

    my $a= "a1";
    my @res= $t->get_xpath( "//*[\@att=\"$a\"]");   # \@ stops Perl from interpolating @att

    my $val= "a1";
    my $exp= "//p[ \@att='$val']"; # you need to use \@ or you will get a warning
    my @res= $t->get_xpath( $exp);
XML::Twig does not provide full XPATH support. If that's what you want then look no further than the XML::XPath module on CPAN.
Note that the only supported regexps delimiters are / and that you must backslash all / in regexps AND in regular strings.
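The quoting rules above can be checked against a small in-memory document. This is a sketch under assumptions: the document, the attribute name type and the variable names are all made up, and only expression forms from the supported list are used.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new;
$t->parse( '<doc><p type="warning">w1</p><p>plain</p>'
         . '<section><p type="warning">w2</p></section></doc>');

# single-quoted Perl string: no interpolation, so @type needs no backslash
my @all = $t->get_xpath( '//p[@type="warning"]');

# relative expression from the root: direct children only
my @direct = $t->root->get_xpath( 'p[@type="warning"]');

# double-quoted Perl string: \@ keeps Perl from interpolating an array
my $type   = 'warning';
my @interp = $t->get_xpath( "//p[\@type='$type']");

my $all_text    = join ' ', map { $_->text } @all;
my $direct_text = join ' ', map { $_->text } @direct;
print "all: $all_text\ndirect: $direct_text\n";
```

Using single-quoted Perl strings whenever no variable needs interpolating avoids most of the collisions between the two syntaxes.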
$p->insert( table => { border=> 1}, 'tr', 'td')
puts $p in a table with a visible border, a single tr and a single td and returns the table element:
<p><table border="1"><tr><td>original content of p</td></tr></table></p>
new and a paste: creates a new element using $gi, $opt_atts_hashref and @opt_content, which are arguments similar to those for new, then pastes it, using $opt_position or 'first_child', relative to $elt.
Returns the newly created element
The optional_atts argument is the ref of a hash of attributes. If this argument is used then the previous attributes are deleted, otherwise they are left untouched.
WARNING: if you rely on IDs then you will have to set the id yourself. At this point the element does not belong to a twig yet, so the ID attribute is not known and it won't be stored in the ID list.
A content of '#EMPTY' creates an empty element;
WARNING: in a tree created using the twig_roots option this will not return the level in the document tree, level 0 will be the document root, level 1 will be the twig_roots elements. During the parsing (in a TwigHandler) you can use the depth method on the twig object to get the real parsing depth.
a < b it will be output as such and not as a &lt; b. This can be useful to create text elements that will be output as markup. Note that all PCDATA descendants of the element are also marked as having the property (they are the ones impacted by the change).
undef)
elt_id to change the id attribute name
$a is the <A>..</A> element, $b is the <B>...</B> element
    document                          $a->cmp( $b)
    <A> ... </A> ... <B> ... </B>     -1
    <A> ... <B> ... </B> ... </A>     -1
    <B> ... </B> ... <A> ... </A>      1
    <B> ... <A> ... </A> ... </B>      1
    $a == $b                           0
    $a and $b not in the same tree     undef
if( $a->cmp( $b) == -1) { return 1; } else { return 0; }
if( $a->cmp( $b) == -1) { return 1; } else { return 0; }
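The comparison table can be exercised with a small sketch. The document and the variable names ($first, $nested, $second) are made up for the example; the before call is assumed to follow the code shown above, returning 1 when cmp returns -1.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new;
$t->parse( '<doc><A><C/></A><B/></doc>');   # sample document

my $first  = $t->root->first_child( 'A');
my $nested = $first->first_child( 'C');     # C is inside A
my $second = $t->root->first_child( 'B');   # B follows A

my @results = ( $first->cmp( $second),   # A starts before B
                $second->cmp( $first),   # B starts after A
                $first->cmp( $nested),   # A starts before the element it contains
                $first->cmp( $first),    # same element
              );
my $is_before = $first->before( $second);

print "cmp results: @results\n";
print "A is before B\n" if $is_before;
```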
Those methods should not be used, unless of course you find some creative and interesting, not to mention useful, ways to do it.
Most of the navigation functions accept a condition as an optional argument. The first element (or all elements for children or ancestors) that passes the condition is returned.
The condition can be
    gi
    /regexp/
    gi[@att]
    gi[@att="val"]
    gi[@att=~/regexp/]
    gi[text()="blah"]
    gi[text(subelt)="blah"]
    gi[text()=~ /blah/]
    gi[text(subelt)=~ /blah/]
    *[@att] (the * is actually optional)
    *[@att="val"]
    *[@att=~/regexp/]
qr// (hence this is available only on perl 5.005 and above)
See the test file in t/test[1-n].t Additional examples (and a complete tutorial) can be found at http://www.xmltwig.com/
To figure out what flush does call the following script with an xml file and an element name as arguments
use XML::Twig;
    my ($file, $elt)= @ARGV;
    my $t= XML::Twig->new( twig_handlers =>
        { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
    $t->parsefile( $file, ErrorContext => 2);
    $t->flush;
    print "\n";
There are 3 possibilities here. They are:
If you use the load_DTD option when creating the twig the DTD information and the entity declarations can be accessed.
The DTD and the entity declarations will be flush'ed (or print'ed) either as is (if they have not been modified) or as reconstructed (poorly, comments are lost, order is not kept, due to its content this DTD should not be viewed by anyone) if they have been modified. You can also modify them directly by changing the $twig->{twig_doctype}->{internal} field (straight from XML::Parser, see the Doctype handler doc).
If you use the load_DTD when creating the twig the DTD information and the entity declarations can be accessed. The entity declarations will be flush'ed (or print'ed) either as is (if they have not been modified) or as reconstructed (badly, comments are lost, order is not kept).
You can change the doctype through the $twig->set_doctype method and print the dtd through the $twig->dtd_text or $twig->dtd_print methods.
If you need to modify the entity list this is probably the easiest way to do it.
If you set handlers and use flush, do not forget to flush the twig one last time AFTER the parsing, or you might be missing the end of the document.
Remember that element handlers are called when the element is CLOSED, so if you have handlers for nested elements the inner handlers will be called first. It makes it for example trickier than it would seem to number nested clauses.
att="val&ent;" will be turned into att => val, unless you use the keep_encoding argument to XML::Twig->new
So use XML::Twig with standalone documents, or with documents referring to an external DTD, but don't expect it to properly parse and even output back the DTD.
dispose method to free that memory after you are done.
If you create elements the same thing might happen, use the delete method to get rid of them.
Alternatively installing the WeakRef module on a version of Perl that supports it will get rid of the memory leaks automagically.
    $twig->change_gi( $old1, $new);
    $twig->change_gi( $old2, $new);
    $twig->change_gi( $new, $even_newer);
These are the things that can mess up calling code, especially if threaded. They might also cause problem under mod_perl.
    %base_ent= ( '>' => '&gt;',
                 '<' => '&lt;',
                 '&' => '&amp;',
                 "'" => '&apos;',
                 '"' => '&quot;',
               );
    CDATA_START   = "<![CDATA[";
    CDATA_END     = "]]>";
    PI_START      = "<?";
    PI_END        = "?>";
    COMMENT_START = "<!--";
    COMMENT_END   = "-->";
pretty print styles
( $NSGMLS, $NICE, $INDENTED, $RECORD1, $RECORD2)= (1..5);
empty tag output style
( $HTML, $EXPAND)= (1..2);
$empty_tag_style can mess up HTML browsers though and changing $ID would most likely create problems.
    $pretty=0;           # pretty print style
    $quote='"';          # quote for attributes
    $INDENT= ' ';        # indent for indented pretty print
    $empty_tag_style= 0; # how to display empty tags
    $ID                  # attribute used as an id ('id' by default)

    %gi2index;           # gi => index
    @index2gi;           # list of gi's
A future version will try to support this while trying not to be too hard on performance (at least when a single twig is used!).
You can use the benchmark_twig file to do additional benchmarks.
Please send me benchmark information for additional systems.
Michel Rodriguez <m.v.rodriguez@ieee.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Bug reports and comments to m.v.rodriguez@ieee.org
The XML::Twig page is at http://www.xmltwig.com/xmltwig/ It includes examples and a tutorial at http://www.xmltwig.com/xmltwig/tutorial/index.html
XML::Parser