| author | Stefan Behnel <stefan_ml@behnel.de> | 2012-11-23 08:47:53 +0100 |
|---|---|---|
| committer | Stefan Behnel <stefan_ml@behnel.de> | 2012-11-23 08:47:53 +0100 |
| commit | 7e9a3b21258a14d4bb6937c11fbeb4374e69ae22 (patch) | |
| tree | 055e442b8b7772dabfbb66bd895d6d72d3be6997 /doc/tutorial.txt | |
| parent | a9ffa44ccdb3fa2f10cb5b30afd364c936ea7f8c (diff) | |
| download | python-lxml-7e9a3b21258a14d4bb6937c11fbeb4374e69ae22.tar.gz | |
rewrite tutorial section on ElementTree class
Diffstat (limited to 'doc/tutorial.txt')
| -rw-r--r-- | doc/tutorial.txt | 82 |
1 files changed, 50 insertions, 32 deletions
diff --git a/doc/tutorial.txt b/doc/tutorial.txt
index d1f96c4a..1dcc0769 100644
--- a/doc/tutorial.txt
+++ b/doc/tutorial.txt
@@ -623,61 +623,66 @@ might become handy. Just pass the ``unicode`` type as encoding:
     u'HelloW\xf6rld'
 
 The W3C has a good `article about the Unicode character set and
-character encodings`_.
-
-.. _`article about the Unicode character set and character encodings`: http://www.w3.org/International/tutorials/tutorial-char-enc/
+character encodings
+<http://www.w3.org/International/tutorials/tutorial-char-enc/>`_.
 
 
 The ElementTree class
 =====================
 
 An ``ElementTree`` is mainly a document wrapper around a tree with a
-root node. It provides a couple of methods for parsing, serialisation
-and general document handling. One of the bigger differences is that
-it serialises as a complete document, as opposed to a single
-``Element``. This includes top-level processing instructions and
-comments, as well as a DOCTYPE and other DTD content in the document:
+root node. It provides a couple of methods for serialisation and
+general document handling.
 
 .. sourcecode:: pycon
 
-    >>> tree = etree.parse(StringIO('''\
+    >>> root = etree.XML('''\
     ... <?xml version="1.0"?>
-    ... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
+    ... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
     ... <root>
     ...   <a>&tasty;</a>
     ... </root>
-    ... '''))
+    ... ''')
 
+    >>> tree = etree.ElementTree(root)
+    >>> print(tree.docinfo.xml_version)
+    1.0
     >>> print(tree.docinfo.doctype)
     <!DOCTYPE root SYSTEM "test">
 
-    >>> # lxml 1.3.4 and later
-    >>> print(etree.tostring(tree))
-    <!DOCTYPE root SYSTEM "test" [
-    <!ENTITY tasty "eggs">
-    ]>
-    <root>
-      <a>eggs</a>
-    </root>
+An ``ElementTree`` is also what you get back when you call the
+``parse()`` function to parse files or file-like objects (see the
+parsing section below).
+
+One of the important differences is that the ``ElementTree`` class
+serialises as a complete document, as opposed to a single ``Element``.
+This includes top-level processing instructions and comments, as well
+as a DOCTYPE and other DTD content in the document:
 
-    >>> # lxml 1.3.4 and later
-    >>> print(etree.tostring(etree.ElementTree(tree.getroot())))
+.. sourcecode:: pycon
+
+    >>> print(etree.tostring(tree))  # lxml 1.3.4 and later
     <!DOCTYPE root SYSTEM "test" [
-    <!ENTITY tasty "eggs">
+    <!ENTITY tasty "parsnips">
     ]>
     <root>
-      <a>eggs</a>
+      <a>parsnips</a>
     </root>
 
-    >>> # ElementTree and lxml <= 1.3.3
+In the original xml.etree.ElementTree implementation and in lxml
+up to 1.3.3, the output looks the same as when serialising only
+the root Element:
+
+.. sourcecode:: pycon
+
     >>> print(etree.tostring(tree.getroot()))
     <root>
-      <a>eggs</a>
+      <a>parsnips</a>
     </root>
 
-Note that this has changed in lxml 1.3.4 to match the behaviour of
-lxml 2.0. Before, the examples were serialised without DTD content,
-which made lxml loose DTD information in an input-output cycle.
+This serialisation behaviour has changed in lxml 1.3.4. Before,
+the tree was serialised without DTD content, which made lxml
+loose DTD information in an input-output cycle.
 
 
 Parsing from strings and files
@@ -721,17 +726,26 @@ commonly used to write XML literals right into the source:
     >>> etree.tostring(root)
     b'<root>data</root>'
 
+There is also a corresponding function ``HTML()`` for HTML literals.
+
 
 The parse() function
 --------------------
 
-The ``parse()`` function is used to parse from files and file-like objects:
+The ``parse()`` function is used to parse from files and file-like objects.
+
+As an example of such a file-like object, the following code uses the
+``StringIO`` class for reading from a string instead of an external file.
+That class comes from the ``StringIO`` module in Python 2. In Python 2.6
+and later, including Python 3.x, you would rather use the ``BytesIO`` class
+from the ``io`` module. However, in real life, you would obviously avoid
+doing this all together and use the string parsing functions above.
 
 .. sourcecode:: pycon
 
-    >>> some_file_like = StringIO("<root>data</root>")
+    >>> some_file_like_object = StringIO("<root>data</root>")
 
-    >>> tree = etree.parse(some_file_like)
+    >>> tree = etree.parse(some_file_like_object)
 
     >>> etree.tostring(tree)
     b'<root>data</root>'
@@ -763,7 +777,11 @@ The ``parse()`` function supports any of the following sources:
 * an HTTP or FTP URL string
 
 Note that passing a filename or URL is usually faster than passing an
-open file.
+open file or file-like object. However, the HTTP/FTP client in libxml2
+is rather simple, so things like HTTP authentication require a dedicated
+URL request library, e.g. ``urllib2`` or ``request``. These libraries
+usually provide a file-like object for the result that you can parse
+from while the response is streaming in.
 
 
 Parser objects
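
Note on the ``parse()`` hunk above: the added paragraph recommends ``BytesIO`` from the ``io`` module on Python 2.6 and later, while the example in the patch still uses ``StringIO``. The following is only a sketch of that recommendation, not part of the commit. It assumes the standard library's ``xml.etree.ElementTree`` (the "original" implementation the patch refers to) in place of ``lxml.etree``, so that it runs without lxml installed; the name ``root_from_string`` is made up for illustration.

```python
# Illustrative sketch only: uses the standard library's xml.etree.ElementTree
# as a stand-in for lxml.etree (an assumption made so the example runs
# without third-party packages).
from io import BytesIO
import xml.etree.ElementTree as ET

xml_bytes = b"<root>data</root>"

# parse() accepts a file-like object and returns an ElementTree,
# i.e. the document wrapper around the root element.
some_file_like_object = BytesIO(xml_bytes)
tree = ET.parse(some_file_like_object)
root = tree.getroot()

# For data that is already in memory, the string parsing function is
# the simpler route and returns the root Element directly.
root_from_string = ET.fromstring(xml_bytes)

# Serialise the root element back to bytes.
serialised = ET.tostring(root)
print(serialised)
```

With ``lxml.etree`` the same calls should apply, with the difference described in the first hunk: serialising the ``ElementTree`` itself also emits document-level content such as the DOCTYPE.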
