[svn r4116] r5055@delle: sbehnel | 2009-02-27 12:04:42 +0100

FAQ update: clean up threading sections, reference dev-works article

--HG--
branch : trunk
author: scoder <none@none> 2009-02-27 14:47:17 +0100
committer: scoder <none@none> 2009-02-27 14:47:17 +0100
commit: becf112a6e9aa4820277fed36aea1f6adf7884cd (patch)
tree: eb0345b5044b2eb482e9ed5231ad5f6b59edb7f1
parent: 4e87849220a64450bee2ef38a6931e496ba6b9d9 (diff)
download: python-lxml-becf112a6e9aa4820277fed36aea1f6adf7884cd.tar.gz
1 files changed, 68 insertions, 50 deletions
diff --git a/doc/FAQ.txt b/doc/FAQ.txt
index 88dbb7ba..3cfa95c1 100644
--- a/doc/FAQ.txt
+++ b/doc/FAQ.txt
@@ -96,7 +96,9 @@ tasks in ElementTree and lxml.etree.  To learn using
 ``lxml.objectify``, read the `objectify documentation`_.
 
 John Shipman has written another tutorial called `Python XML
-processing with lxml`_ that contains lots of examples.
+processing with lxml`_ that contains lots of examples.  Liza Daly
+wrote a nice article about high-performance aspects when `parsing
+large files with lxml`_.
 
 .. _`lxml.etree Tutorial`:      tutorial.html
 .. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
@@ -104,6 +106,8 @@ processing with lxml`_ that contains lots of examples.
 .. _`objectify documentation`:  objectify.html
 .. _`Python XML processing with lxml`: http://www.nmt.edu/tcc/help/pubs/pylxml/
 .. _`element library`:          http://effbot.org/zone/element-lib.htm
+.. _`parsing large files with lxml`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
+
 
 Where can I find more documentation about lxml?
 -----------------------------------------------
@@ -194,7 +198,10 @@ Zope3 and some of its extensions have good support for lxml:
 * zif.sedna_, an XQuery based interface to the Sedna OpenSource XML database
 
 And don't miss the quotes by our generally happy_ users_, and other
-`sites that link to lxml`_.
+`sites that link to lxml`_.  As `Liza Daly`_ puts it: "Many software
+products come with the pick-two caveat, meaning that you must choose
+only two: speed, flexibility, or readability.  When used carefully,
+lxml can provide all three."
 
 .. _Zope: http://www.zope.org/
 .. _Plone: http://www.plone.org/
@@ -215,6 +222,7 @@ And don't miss the quotes by our generally happy_ users_, and other
 .. _happy: http://thread.gmane.org/gmane.comp.python.lxml.devel/3244/focus=3244
 .. _users: http://article.gmane.org/gmane.comp.python.lxml.devel/3246
 .. _`sites that link to lxml`: http://www.google.com/search?as_lq=http:%2F%2Fcodespeak.net%2Flxml
+.. _`Liza Daly`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
 
 
 What is the difference between lxml.etree and lxml.objectify?
@@ -619,8 +627,8 @@ lock) internally when parsing from disk and memory, as long as you use
 either the default parser (which is replicated for each thread) or
 create a parser for each thread yourself.  lxml also allows
 concurrency during validation (RelaxNG and XMLSchema) and XSL
-transformation.  You can share RelaxNG, XMLSchema and (with
-restrictions) XSLT objects between threads.
+transformation.  You can share RelaxNG, XMLSchema and XSLT objects
+between threads.
 
 While you can also share parsers between threads, this will serialize
 the access to each of them, so it is better to ``.copy()`` parsers or
@@ -629,19 +637,16 @@ configuration.  The same applies to the XPath evaluators, which use an
 internal lock to protect their prepared evaluation contexts.  It is
 therefore best to use separate evaluator instances in threads.
 
-Due to the way libxslt handles threading, applying a stylesheets is
-most efficient if it was parsed in the same thread that executes it.
-One way to achieve this is by caching stylesheets in thread-local
-storage.
-
-Warning: Before lxml 2.2, there were various issues when moving
-subtrees between different threads.  If you need code to run with
-older versions, you should generally avoid modifying trees in other
-threads than the one it was generated in.  Although this should work
-in many cases, there are certain scenarios where the termination of a
-thread that parsed a tree can crash the application if subtrees of
-this tree were moved to other documents.  You should be on the safe
-side when passing trees between threads if you either
+Warning: Before lxml 2.2, and especially before 2.1, there were
+various issues when moving subtrees between different threads, or when
+applying XSLT objects from one thread to trees parsed or modified in
+another.  If you need code to run with older versions, you should
+generally avoid modifying trees in other threads than the one it was
+generated in.  Although this should work in many cases, there are
+certain scenarios where the termination of a thread that parsed a tree
+can crash the application if subtrees of this tree were moved to other
+documents.  You should be on the safe side when passing trees between
+threads if you either
 
 - do not modify these trees and do not move their elements to other
   trees, or
@@ -650,6 +655,13 @@ side when passing trees between threads if you either
   use (e.g. by using a fixed size thread-pool or long-running threads
   in processing chains)
 
+Since lxml 2.2, even multi-thread pipelines are supported. However,
+note that it is more efficient to do all tree work inside one thread,
+than to let multiple threads work on a tree one after the other. This
+is because trees inherit state from the thread that created them,
+which must be maintained when the tree is modified inside another
+thread.
+
 
 Does my program run faster if I use threads?
 --------------------------------------------
@@ -657,11 +669,13 @@ Does my program run faster if I use threads?
 Depends.  The best way to answer this is timing and profiling.
 
 The global interpreter lock (GIL) in Python serializes access to the
-interpreter, so if the majority of your processing is done in Python code
-(walking trees, modifying elements, etc.), your gain will be close to 0.  The
-more of your XML processing moves into lxml, however, the higher your gain.
-If your application is bound by XML parsing and serialisation, or by complex
-XSLTs, your speedup on multi-processor machines can be substantial.
+interpreter, so if the majority of your processing is done in Python
+code (walking trees, modifying elements, etc.), your gain will be
+close to zero.  The more of your XML processing moves into lxml,
+however, the higher your gain.  If your application is bound by XML
+parsing and serialisation, or by very selective XPath expressions and
+complex XSLTs, your speedup on multi-processor machines can be
+substantial.
 
 See the question above to learn which operations free the GIL to support
 multi-threading.
@@ -670,30 +684,28 @@ multi-threading.
 Would my single-threaded program run faster if I turned off threading?
 ----------------------------------------------------------------------
 
-Quite likely, yes.  You can see for yourself by compiling lxml
-entirely without threading support.  Pass the ``--without-threading``
-option to setup.py when building lxml from source.  You can also build
-libxml2 without pthread support (``--without-pthreads`` option), which
-may add another bit of performance.  Note that this will leave
-internal data structures entirely without thread protection, so make
-sure you really do not use lxml outside of the main application thread
-in this case.
+Possibly, yes.  You can see for yourself by compiling lxml entirely
+without threading support.  Pass the ``--without-threading`` option to
+setup.py when building lxml from source.  You can also build libxml2
+without pthread support (``--without-pthreads`` option), which may add
+another bit of performance.  Note that this will leave internal data
+structures entirely without thread protection, so make sure you really
+do not use lxml outside of the main application thread in this case.
 
 
 Why can't I reuse XSLT stylesheets in other threads?
 ----------------------------------------------------
 
-Since lxml 2.0, you can.  However, it is a lot more efficient to use
-stylesheets in the thread that created them.  This is due to some
-interfering optimisations in libxslt and lxml.etree.  It is therefore
-a good idea to cache them in thread local storage (see Python's
-threading module).  lxml cannot easily do this for you, as it cannot
-know when to discard them from such a cache.
+Since later lxml 2.0 versions, you can do this.  There is some
+overhead involved as the result document needs an additional cleanup
+traversal when the input document and/or the stylesheet were created
+in other threads.  However, on a multi-processor machine, the gain of
+freeing the GIL easily covers this drawback.
 
-If you use very complex stylesheets or create stylesheets
-programmatically, you should do so in the main thread, and then copy
-them into the thread cache using the ``copy`` module from the standard
-library.
+If you need even the last bit of performance, consider keeping (a copy
+of) the stylesheet in thread-local storage, and try creating the input
+document(s) in the same thread.  And do not forget to benchmark your
+code to see if the increased code complexity is really worth it.
 
 
 My program crashes when run with mod_python/Pyro/Zope/Plone/...
@@ -709,10 +721,11 @@ predictable way.  If you encounter crashes in one of these systems, but your
 code runs perfectly when started by hand, the following gives you a few hints
 for possible approaches to solve your specific problem:
 
-* make sure you use recent versions of libxml2, libxslt and lxml.  The libxml2
-  developers keep fixing bugs in each release, and lxml also tries to become
-  more robust against possible pitfalls.  So newer versions might already fix
-  your problem in a reliable way.
+* make sure you use recent versions of libxml2, libxslt and lxml.  The
+  libxml2 developers keep fixing bugs in each release, and lxml also
+  tries to become more robust against possible pitfalls.  So newer
+  versions might already fix your problem in a reliable way.  Version
+  2.2 of lxml contains many improvements.
 
 * make sure the library versions you installed are really used.  Do
   not rely on what your operating system tells you!  Print the version
@@ -736,14 +749,15 @@ for possible approaches to solve your specific problem:
   from crashing, which should be worth more to you than peek performance.
   Remember that lxml is fast anyway, so concurrency may not even be worth it.
 
-* avoid doing fancy XSLT stuff like foreign document access or passing in
-  subtrees trough XSLT variables.  This might or might not work, depending on
-  your specific usage.
+* look out for fancy XSLT stuff like foreign document access or
+  passing in subtrees trough XSLT variables.  This might or might not
+  work, depending on your specific usage.  Again, later versions of
+  lxml and libxslt provide safer support here.
 
 * try copying trees at suspicious places in your code and working with
-  those instead of a tree shared between threads.  A good candidate
-  might be the result of an XSLT or the stylesheet itself, if it
-  traverses thread boundaries.
+  those instead of a tree shared between threads.  Note that the
+  copying must happen inside the target thread to be effective, not in
+  the thread that created the tree.
 
 * try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
   instead of sharing one.  Also see the question above.
@@ -756,6 +770,10 @@ for possible approaches to solve your specific problem:
   of lxml, libxml2 and libxslt you are using (see the question on reporting
   a bug).
 
+Note that most of these options will degrade performance and/or your
+code quality.  If you are unsure what to do, please ask on the mailing
+list.
+
 
 Parsing and Serialisation
 =========================
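
The updated FAQ text in this patch suggests keeping (a copy of) an XSLT stylesheet in thread-local storage when the last bit of performance matters. A minimal sketch of that idea follows. It is not code from lxml or from this commit. The names `thread_local_stylesheet` and `lxml_stylesheet_builder` are hypothetical helpers introduced only for illustration, and the sketch assumes each thread can build its own stylesheet from a file path using `etree.XSLT(etree.parse(path))`.

```python
import threading

# One storage object; each thread sees its own attributes on it.
_local = threading.local()


def thread_local_stylesheet(key, build):
    """Return this thread's cached object for `key`.

    `build` is a zero-argument callable that is invoked at most once
    per thread and key. (Hypothetical helper, not part of lxml.)
    """
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if key not in cache:
        cache[key] = build()
    return cache[key]


def lxml_stylesheet_builder(path):
    """Return a builder that parses the stylesheet at `path` with lxml.

    Assumption: lxml is installed and `path` names an XSLT file.
    The import is done lazily so the cache helper works without lxml.
    """
    def build():
        from lxml import etree
        return etree.XSLT(etree.parse(path))
    return build
```

A worker thread would then call something like `thread_local_stylesheet("report", lxml_stylesheet_builder("report.xsl"))`, where `report.xsl` is a placeholder file name, and apply the returned object to its input documents. As the FAQ itself advises, benchmark before adopting this, since sharing a single stylesheet between threads may already be fast enough.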