| author | scoder <none@none> | 2008-07-16 08:58:10 +0200 |
|---|---|---|
| committer | scoder <none@none> | 2008-07-16 08:58:10 +0200 |
| commit | 8fa6dc2d0b5870f7abcbfd82a36599d1bef7e9ec (patch) | |
| tree | bb8295487819b968e1188cf04bc454e64a8e6767 /doc/html5parser.txt | |
| parent | 745f4d3898f29271fdb061d0a7c2f897dcc66be9 (diff) | |
[svn r3900] r4637@delle: sbehnel | 2008-07-16 08:55:48 +0200
html5lib parser module provided by Armin Ronacher
--HG--
branch : trunk
Diffstat (limited to 'doc/html5parser.txt')
| -rw-r--r-- | doc/html5parser.txt | 80 |
1 files changed, 80 insertions, 0 deletions
diff --git a/doc/html5parser.txt b/doc/html5parser.txt
new file mode 100644
index 00000000..3c8b6ffe
--- /dev/null
+++ b/doc/html5parser.txt
@@ -0,0 +1,80 @@
+===============
+html5lib Parser
+===============
+
+`html5lib`_ is a Python package that implements the HTML5 parsing algorithm
+which is heavily influenced by current browsers and based on the `WHATWG
+HTML5 specification`_.
+
+.. _html5lib: http://code.google.com/p/html5lib/
+.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
+.. _WHATWG HTML5 specification: http://www.whatwg.org/specs/web-apps/current-work/
+
+lxml can benefit from the parsing capabilities of `html5lib` through
+the ``lxml.html.html5parser`` module. It provides a similar interface
+to the ``lxml.html`` module by providing ``fromstring()``,
+``parse()``, ``document_fromstring()``, ``fragment_fromstring()`` and
+``fragments_fromstring()`` that work like the regular html parsing
+functions.
+
+
+Differences to regular HTML parsing
+===================================
+
+There are a few differences in the returned tree to the regular HTML
+parsing functions from ``lxml.html``. html5lib normalizes some elements
+and element structures to a common format. For example even if a table
+does not have a `tbody` html5lib will inject one automatically:
+
+.. sourcecode:: pycon
+
+    >>> from lxml.html import tostring, html5parser
+    >>> tostring(html5parser.fromstring("<table><td>foo"))
+    '<table><tbody><tr><td>foo</td></tr></tbody></table>'
+
+Also the parameters the functions accept are different.
+
+
+Function Reference
+==================
+
+``parse(filename_url_or_file)``:
+    Parses the named file or url, or if the object has a ``.read()``
+    method, parses from that.
+
+``document_fromstring(html, guess_charset=True)``:
+    Parses a document from the given string. This always creates a
+    correct HTML document, which means the parent node is ``<html>``,
+    and there is a body and possibly a head.
+
+    If a bytestring is passed and ``guess_charset`` is true the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fragment_fromstring(string, create_parent=False, guess_charset=False)``:
+    Returns an HTML fragment from a string. The fragment must contain
+    just a single element, unless ``create_parent`` is given;
+    e.g., ``fragment_fromstring(string, create_parent='div')`` will
+    wrap the element in a ``<div>``. If ``create_parent`` is true the
+    default parent tag (div) is used.
+
+    If a bytestring is passed and ``guess_charset`` is true the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fragments_fromstring(string, no_leading_text=False, parser=None)``:
+    Returns a list of the elements found in the fragment. The first item in
+    the list may be a string. If ``no_leading_text`` is true, then it will
+    be an error if there is leading text, and it will always be a list of
+    only elements.
+
+    If a bytestring is passed and ``guess_charset`` is true the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fromstring(string)``:
+    Returns ``document_fromstring`` or ``fragment_fromstring``, based
+    on whether the string looks like a full document, or just a
+    fragment.
+
+Additionally all parsing functions accept a ``parser`` keyword argument
+that can be set to a custom parser instance. To create custom parsers
+you can subclass the ``HTMLParser`` and ``XHTMLParser`` from the same
+module. Note that these are the parser classes provided by html5lib.
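
The added documentation says ``fromstring(string)`` picks between ``document_fromstring`` and ``fragment_fromstring`` "based on whether the string looks like a full document", but does not spell out the check. The sketch below illustrates one plausible heuristic using only the standard library: treat the input as a full document when it starts with ``<html`` or a doctype. The names ``looks_like_document`` and ``choose_parser`` are hypothetical and are not part of lxml or html5lib; the real test in ``lxml.html.html5parser`` may differ.

```python
import re

# Hypothetical helper, not part of lxml or html5lib: matches input whose
# first non-whitespace content is an <html> tag or a doctype declaration.
_FULL_DOCUMENT = re.compile(r"\s*<(?:html|!doctype)\b", re.IGNORECASE)


def looks_like_document(markup: str) -> bool:
    """Return True if the markup appears to be a complete HTML document."""
    return _FULL_DOCUMENT.match(markup) is not None


def choose_parser(markup: str) -> str:
    """Name the function a fromstring()-style dispatcher would delegate to."""
    if looks_like_document(markup):
        return "document_fromstring"
    return "fragment_fromstring"
```

Under this assumption, the ``<table><td>foo`` input from the documentation's example would be routed to fragment parsing, while anything beginning with a doctype would be parsed as a whole document.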
