[svn r3900] r4637@delle: sbehnel | 2008-07-16 08:55:48 +0200

html5lib parser module provided by Armin Ronacher --HG-- branch : trunk
author: scoder <none@none> 2008-07-16 08:58:10 +0200
committer: scoder <none@none> 2008-07-16 08:58:10 +0200
commit: 8fa6dc2d0b5870f7abcbfd82a36599d1bef7e9ec (patch)
tree: bb8295487819b968e1188cf04bc454e64a8e6767 /doc/html5parser.txt
parent: 745f4d3898f29271fdb061d0a7c2f897dcc66be9 (diff)
download: python-lxml-8fa6dc2d0b5870f7abcbfd82a36599d1bef7e9ec.tar.gz
1 files changed, 80 insertions, 0 deletions
diff --git a/doc/html5parser.txt b/doc/html5parser.txt
new file mode 100644
index 00000000..3c8b6ffe
--- /dev/null
+++ b/doc/html5parser.txt
@@ -0,0 +1,80 @@
+===============
+html5lib Parser
+===============
+
+`html5lib`_ is a Python package that implements the HTML5 parsing algorithm,
+which is heavily influenced by current browsers and based on the `WHATWG
+HTML5 specification`_.
+
+.. _html5lib: http://code.google.com/p/html5lib/
+.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
+.. _WHATWG HTML5 specification: http://www.whatwg.org/specs/web-apps/current-work/
+
+lxml can benefit from the parsing capabilities of `html5lib`_ through
+the ``lxml.html.html5parser`` module.  Its interface is similar to that
+of the ``lxml.html`` module: it provides ``fromstring()``,
+``parse()``, ``document_fromstring()``, ``fragment_fromstring()`` and
+``fragments_fromstring()``, which work like the regular HTML parsing
+functions.
+
+
+Differences to regular HTML parsing
+===================================
+
+The returned tree differs in a few ways from what the regular HTML
+parsing functions in ``lxml.html`` produce.  html5lib normalizes some elements
+and element structures to a common format.  For example, even if a table
+does not have a ``tbody``, html5lib will inject one automatically:
+
+.. sourcecode:: pycon
+
+    >>> from lxml.html import tostring, html5parser
+    >>> tostring(html5parser.fromstring("<table><td>foo"))
+    '<table><tbody><tr><td>foo</td></tr></tbody></table>'
+
+The functions also accept different parameters.
+
+
+Function Reference
+==================
+
+``parse(filename_url_or_file)``:
+    Parses the named file or URL, or if the object has a ``.read()``
+    method, parses from that.
+
+``document_fromstring(html, guess_charset=True)``:
+    Parses a document from the given string.  This always creates a
+    correct HTML document, which means the parent node is ``<html>``,
+    and there is a body and possibly a head.
+
+    If a bytestring is passed and ``guess_charset`` is true, the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fragment_fromstring(string, create_parent=False, guess_charset=False)``:
+    Returns an HTML fragment from a string.  The fragment must contain
+    just a single element, unless ``create_parent`` is given;
+    e.g., ``fragment_fromstring(string, create_parent='div')`` will
+    wrap the element in a ``<div>``.  If ``create_parent`` is true, the
+    default parent tag (``div``) is used.
+
+    If a bytestring is passed and ``guess_charset`` is true, the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fragments_fromstring(string, no_leading_text=False, parser=None)``:
+    Returns a list of the elements found in the fragment.  The first item in
+    the list may be a string.  If ``no_leading_text`` is true, then it will
+    be an error if there is leading text, and it will always be a list of
+    only elements.
+
+    If a bytestring is passed and ``guess_charset`` is true, the chardet
+    library (if installed) will guess the charset if ambiguities exist.
+
+``fromstring(string)``:
+    Returns the result of ``document_fromstring`` or ``fragment_fromstring``,
+    depending on whether the string looks like a full document or just a
+    fragment.
+
+Additionally, all parsing functions accept a ``parser`` keyword argument
+that can be set to a custom parser instance.  To create custom parsers,
+you can subclass the ``HTMLParser`` and ``XHTMLParser`` classes from the same
+module.  Note that these are the parser classes provided by html5lib.
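
The fromstring() entry in the patch above says only that the choice between document_fromstring and fragment_fromstring depends on whether the string "looks like" a full document. The following is a minimal sketch of one plausible heuristic, using only the standard library. The helper names looks_like_full_document and choose_parser_function and the regular expression are illustrative assumptions; they are not taken from this patch and may not match what the module actually does.

```python
import re

# Hypothetical heuristic (an assumption, not the patch's code): treat the
# input as a full document when its first non-whitespace markup is a
# doctype declaration or an <html> start tag.
_FULL_DOCUMENT = re.compile(r"\s*<(?:!doctype|html)\b", re.IGNORECASE)


def looks_like_full_document(markup):
    """Return True if the text appears to start a complete HTML document."""
    return _FULL_DOCUMENT.match(markup) is not None


def choose_parser_function(markup):
    """Name the parsing function a fromstring()-style dispatcher might pick."""
    if looks_like_full_document(markup):
        return "document_fromstring"
    return "fragment_fromstring"


if __name__ == "__main__":
    samples = (
        "<!DOCTYPE html><html><body>hi</body></html>",
        "<table><td>foo",
    )
    for sample in samples:
        print(sample, "->", choose_parser_function(sample))
```

The sketch only inspects the start of the text, so it is cheap but easy to fool (for example by a leading comment); a caller that knows what kind of input it has can call document_fromstring or fragment_fromstring directly.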