xml.etree.ElementTree vs. lxml.etree

When working with XML in Python, I usually start with xml.etree.ElementTree from Python’s standard library. It ships with Python and is sufficient for most projects. But with very large XML documents (> 1 GiB), parsing becomes slow, and this is where lxml.etree shines.

While porting some code from xml.etree.ElementTree to lxml, I found some slight differences between the two implementations, despite them sharing the same API.

Namespaces

XML namespaces can be used in tags with the special syntax {namespace-uri}tag-name. To map a namespace to a specific prefix with stdlib etree, the prefix/URI combination has to be registered globally.
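A minimal sketch of the global registration with the standard library. The namespace URI and the prefix "ex" are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen only for illustration.
NS = "http://example.com/ns"

# Global registration: affects every later serialization in this process.
ET.register_namespace("ex", NS)

# Tags use the {namespace-uri}tag-name syntax.
root = ET.Element(f"{{{NS}}}root")
ET.SubElement(root, f"{{{NS}}}child")

xml_bytes = ET.tostring(root)
print(xml_bytes)
```

Because the mapping lives in a module-level registry, any other code in the same process that serializes this namespace gets the same prefix.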

lxml supports the same {namespace-uri}tag-name syntax, but also allows registering a namespace map (a dict mapping namespace prefixes to URIs) on every Element at creation time via the nsmap keyword argument.
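A small sketch of the per-element namespace map, assuming lxml is installed. The URI and prefix are again made up for this example.

```python
from lxml import etree

# Hypothetical namespace URI, chosen only for illustration.
NS = "http://example.com/ns"

# The prefix mapping is attached to this element, not registered globally.
root = etree.Element(f"{{{NS}}}root", nsmap={"ex": NS})
etree.SubElement(root, f"{{{NS}}}child")

xml_bytes = etree.tostring(root)
print(xml_bytes)
```

Child elements created in the same namespace pick up the prefix declared on the ancestor, so the map only needs to be passed once on the root.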

Default Namespace

With stdlib etree, registering the default namespace requires passing an empty string "" as the prefix.
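A short sketch of the empty-string prefix with the standard library, using a made-up namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen only for illustration.
NS = "http://example.com/ns"

# An empty prefix registers NS as the default namespace (globally).
ET.register_namespace("", NS)

root = ET.Element(f"{{{NS}}}root")
xml_bytes = ET.tostring(root)
print(xml_bytes)
```

The serialized root element carries a plain xmlns attribute and no prefix on the tag.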

With lxml, registering the default namespace requires using None as the key for the prefix in the namespace mapping dict. Using an empty string "" instead raises an exception.
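A sketch of both cases, assuming lxml is installed. The namespace URI is made up, and the exact exception type is deliberately not assumed, so the example catches a generic Exception.

```python
from lxml import etree

# Hypothetical namespace URI, chosen only for illustration.
NS = "http://example.com/ns"

# None as the key declares the default namespace on this element.
root = etree.Element(f"{{{NS}}}root", nsmap={None: NS})
xml_bytes = etree.tostring(root)
print(xml_bytes)

# An empty string as the key is rejected.
error = None
try:
    etree.Element(f"{{{NS}}}root", nsmap={"": NS})
except Exception as exc:  # exact exception type not assumed here
    error = exc
print(type(error).__name__, error)
```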

With lxml, it is not possible to register a default namespace globally.

In the end, the namespace handling of lxml is superior because namespaces are not defined globally. For example, this allows code using different namespace maps to be tested independently, without interference.

Serialization

When writing an XML document with stdlib etree, only the namespaces that are actually used are serialized. Namespaces that are registered but unused are skipped and not included in the output.
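A minimal sketch of this behavior with the standard library. Both namespace URIs and prefixes are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, chosen only for illustration.
USED_NS = "http://example.com/used"
UNUSED_NS = "http://example.com/unused"

ET.register_namespace("used", USED_NS)
ET.register_namespace("unused", UNUSED_NS)

# Only USED_NS appears in a tag, so only it is declared in the output.
root = ET.Element(f"{{{USED_NS}}}root")
xml_bytes = ET.tostring(root)
print(xml_bytes)
```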

With lxml, all namespaces passed in a namespace map are included in the output, whether they are used or not.
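The same experiment with lxml, assuming it is installed. Both namespace URIs and prefixes are made up for the example.

```python
from lxml import etree

# Hypothetical namespace URIs, chosen only for illustration.
USED_NS = "http://example.com/used"
UNUSED_NS = "http://example.com/unused"

# Both entries of the map are declared on the element, used or not.
root = etree.Element(
    f"{{{USED_NS}}}root",
    nsmap={"used": USED_NS, "unused": UNUSED_NS},
)
xml_bytes = etree.tostring(root)
print(xml_bytes)
```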

With lxml, setting xml_declaration to True and using utf-8 as the encoding writes the encoding name in upper case: <?xml version='1.0' encoding='UTF-8'?>
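A quick way to check the declaration line yourself, assuming lxml is installed. The sketch prints the first line instead of hard-coding it, since the exact output may depend on the lxml version.

```python
import io

from lxml import etree

tree = etree.ElementTree(etree.Element("root"))

# lxml's write() needs a binary stream; BytesIO serves as an in-memory file.
buffer = io.BytesIO()
tree.write(buffer, xml_declaration=True, encoding="utf-8")

# The XML declaration is the first line of the output.
first_line = buffer.getvalue().splitlines()[0]
print(first_line)
```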

stdlib etree supports a special unicode “encoding”. With this “encoding”, it is possible to pass a TextIOBase stream to the write function, and utf-8 is used as the real encoding.
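A small sketch of writing to a text stream with the standard library. io.StringIO stands in for any TextIOBase stream, such as a file opened in text mode.

```python
import io
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.Element("root"))

# encoding="unicode" makes write() emit str instead of bytes,
# so a text stream is the right target.
text_stream = io.StringIO()
tree.write(text_stream, encoding="unicode")

xml_text = text_stream.getvalue()
print(xml_text)
```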

lxml etree only supports binary streams (for example RawIOBase) for the write function.
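A sketch of the binary-stream variant, assuming lxml is installed. io.BytesIO is used here as an in-memory stand-in for a file opened in binary mode. Decoding the bytes afterwards is one possible way to get text, not something lxml requires.

```python
import io

from lxml import etree

tree = etree.ElementTree(etree.Element("root"))

# write() produces bytes, so the target must accept bytes.
binary_stream = io.BytesIO()
tree.write(binary_stream, encoding="utf-8")

xml_bytes = binary_stream.getvalue()
xml_text = xml_bytes.decode("utf-8")  # decode manually if text is needed
print(xml_text)
```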