xml.etree.ElementTree vs. lxml.etree
When working with XML in Python, I often start with xml.etree.ElementTree from Python’s standard library. It ships with Python and is sufficient for most projects. But with very large XML documents (> 1 GiB), parsing becomes slow, and this is where lxml.etree shines.
While porting some code from xml.etree.ElementTree to lxml, I found some slight differences in their implementations, despite the shared API.
Namespaces
XML namespaces can be used in tag names with the special syntax {namespace-uri}tag-name.
To map a namespace to a specific prefix with stdlib etree, the prefix/URI combination has to be registered globally.
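A minimal sketch of the global registration, using the standard library's register_namespace function. The namespace URI and the "ex" prefix are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and prefix, chosen only for illustration.
NS = "http://example.com/ns"

# The registration is global: it affects every later serialization
# done by xml.etree.ElementTree in this process.
ET.register_namespace("ex", NS)

# The triple braces in the f-string produce "{http://example.com/ns}root".
root = ET.Element(f"{{{NS}}}root")
ET.SubElement(root, f"{{{NS}}}child")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <ex:root xmlns:ex="http://example.com/ns"><ex:child /></ex:root>
```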
lxml supports the same {namespace-uri}tag-name syntax, but also allows registering
a namespace map (a dict mapping namespace prefixes to URIs) on every Element
during creation via the nsmap keyword argument.
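The same document built with lxml could look roughly like this. It assumes lxml is installed; the URI and prefix are again invented for the example.

```python
from lxml import etree

# Hypothetical namespace URI and prefix, chosen only for illustration.
NS = "http://example.com/ns"

# The prefix mapping is attached to this element only, not to the process.
root = etree.Element(f"{{{NS}}}root", nsmap={"ex": NS})
etree.SubElement(root, f"{{{NS}}}child")

xml_bytes = etree.tostring(root)
print(xml_bytes.decode())
# <ex:root xmlns:ex="http://example.com/ns"><ex:child/></ex:root>
```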
Default Namespace
With stdlib etree, registering the default namespace requires passing an empty string "" as the prefix.
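As a small sketch, registering an empty prefix makes the serializer emit a plain xmlns attribute. The URI is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen only for illustration.
DEFAULT_NS = "http://example.com/default"

# An empty prefix marks this URI as the default namespace, globally.
ET.register_namespace("", DEFAULT_NS)

root = ET.Element(f"{{{DEFAULT_NS}}}root")
ET.SubElement(root, f"{{{DEFAULT_NS}}}child")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <root xmlns="http://example.com/default"><child /></root>
```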
With lxml, registering the default namespace requires using None as the key for
the prefix in the namespace mapping dict. Using an empty string "" instead
will raise an exception.
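A short sketch of both cases with lxml, assuming it is installed. The URI is a placeholder, and the example catches a broad Exception because it makes no assumption about the exact exception type.

```python
from lxml import etree

# Hypothetical namespace URI, chosen only for illustration.
DEFAULT_NS = "http://example.com/default"

# None as the key declares the default namespace for this element.
root = etree.Element(f"{{{DEFAULT_NS}}}root", nsmap={None: DEFAULT_NS})
xml_bytes = etree.tostring(root)
print(xml_bytes.decode())
# <root xmlns="http://example.com/default"/>

# An empty string as the key is rejected.
try:
    etree.Element(f"{{{DEFAULT_NS}}}root", nsmap={"": DEFAULT_NS})
except Exception as exc:
    empty_prefix_error = exc
else:
    empty_prefix_error = None

print(type(empty_prefix_error).__name__, empty_prefix_error)
```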
With lxml it is not possible to register a default namespace globally.
In the end, the namespace handling of lxml is superior because namespaces are not defined globally. For example, this allows code to be tested independently, without interference from globally registered prefixes.
Serialization
When writing an XML document with stdlib etree, only the namespaces that are actually used will be serialized. Registered but unused namespaces are skipped and not included in the output.
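To illustrate, here is a sketch that registers two made-up namespaces but uses only one of them in the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and prefixes, chosen only for illustration.
USED_NS = "http://example.com/used"
UNUSED_NS = "http://example.com/unused"

ET.register_namespace("used", USED_NS)
ET.register_namespace("unused", UNUSED_NS)

# Only USED_NS appears in the tree.
root = ET.Element(f"{{{USED_NS}}}root")

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <used:root xmlns:used="http://example.com/used" />
```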
With lxml, all namespaces passed in a namespace map will be included in the output, whether they are used or not.
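The equivalent sketch with lxml, assuming it is installed, keeps both declarations. The URIs and prefixes are placeholders.

```python
from lxml import etree

# Hypothetical namespace URIs and prefixes, chosen only for illustration.
USED_NS = "http://example.com/used"
UNUSED_NS = "http://example.com/unused"

# Both namespaces are in the map, but only USED_NS appears in a tag.
root = etree.Element(
    f"{{{USED_NS}}}root",
    nsmap={"used": USED_NS, "unused": UNUSED_NS},
)

xml_bytes = etree.tostring(root)
print(xml_bytes.decode())  # both xmlns declarations are part of the output
```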
With lxml, setting xml_declaration to True and using utf-8 as encoding will write
the encoding in upper case, as <?xml version='1.0' encoding='UTF-8'?>.
stdlib etree supports a unicode “encoding”. When using this “encoding”, it is
possible to pass a TextIOBase stream to the write function, and utf-8 is used as the real
encoding.
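A minimal sketch of writing to a text stream with stdlib etree. io.StringIO stands in for any TextIOBase stream here, and the element content is made up.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("root")
root.text = "héllo"

# With encoding="unicode" the serializer writes str, not bytes,
# so a text stream is the right target.
text_stream = io.StringIO()
ET.ElementTree(root).write(text_stream, encoding="unicode")

result = text_stream.getvalue()
print(result)
# <root>héllo</root>
```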
lxml etree only supports binary streams for the write function, for example io.BytesIO or a file opened in binary mode.
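With lxml, a sketch of the write call therefore targets a binary stream. It assumes lxml is installed and uses io.BytesIO as a stand-in for a file opened in binary mode. It also shows the upper-case encoding in the XML declaration.

```python
import io
from lxml import etree

root = etree.Element("root")

# lxml writes bytes, so the target has to be a binary stream.
binary_stream = io.BytesIO()
etree.ElementTree(root).write(
    binary_stream,
    encoding="utf-8",
    xml_declaration=True,
)

data = binary_stream.getvalue()
print(data.decode("utf-8"))
# <?xml version='1.0' encoding='UTF-8'?>
# <root/>
```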