=====================================
lxml FAQ - Frequently Asked Questions
=====================================

.. meta::
  :description: Frequently Asked Questions about lxml (FAQ)
  :keywords: lxml, lxml.etree, FAQ, frequently asked questions

Frequently asked questions on lxml.  See also the notes on compatibility_ to
ElementTree_.

.. _compatibility: compatibility.html
.. _ElementTree:   http://effbot.org/zone/element-index.htm
.. _`build instructions`: build.html
.. _`MacOS-X` : build.html#building-lxml-on-macos-x

.. contents::
..
   1  General Questions
     1.1  Is there a tutorial?
     1.2  Where can I find more documentation about lxml?
     1.3  What standards does lxml implement?
     1.4  Who uses lxml?
     1.5  What is the difference between lxml.etree and lxml.objectify?
     1.6  How can I make my application run faster?
     1.7  What about that trailing text on serialised Elements?
     1.8  How can I find out if an Element is a comment or PI?
     1.9  How can I map an XML tree into a dict of dicts?
     1.10 Why does lxml sometimes return 'str' values for text in Python 2?
   2  Installation
     2.1  Which version of libxml2 and libxslt should I use or require?
     2.2  Where are the binary builds?
     2.3  Why do I get errors about missing UCS4 symbols when installing lxml?
   3  Contributing
     3.1  Why is lxml not written in Python?
     3.2  How can I contribute?
   4  Bugs
     4.1  My application crashes!
     4.2  My application crashes on MacOS-X!
     4.3  I think I have found a bug in lxml. What should I do?
     4.4  How do I know a bug is really in lxml and not in libxml2?
   5  Threading
     5.1  Can I use threads to concurrently access the lxml API?
     5.2  Does my program run faster if I use threads?
     5.3  Would my single-threaded program run faster if I turned off threading?
     5.4  Why can't I reuse XSLT stylesheets in other threads?
     5.5  My program crashes when run with mod_python/Pyro/Zope/Plone/...
   6  Parsing and Serialisation
     6.1  Why doesn't the ``pretty_print`` option reformat my XML output?
     6.2  Why can't lxml parse my XML from unicode strings?
     6.3  What is the difference between str(xslt(doc)) and xslt(doc).write() ?
     6.4  Why can't I just delete parents or clear the root node in iterparse()?
     6.5  How do I output null characters in XML text?
     6.6  Is lxml vulnerable to XML bombs?
     6.7  Can lxml parse from file objects opened in unicode mode?
   7  XPath and Document Traversal
     7.1  What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
     7.2  Why doesn't ``findall()`` support full XPath expressions?
     7.3  How can I find out which namespace prefixes are used in a document?
     7.4  How can I specify a default namespace for XPath expressions?

..
  >>> import sys
  >>> from lxml import etree as _etree
  >>> if sys.version_info[0] >= 3:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if isinstance(s, bytes): s = s.decode("utf-8") # CR
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  ... else:
  ...   class etree_mock(object):
  ...     def __getattr__(self, name): return getattr(_etree, name)
  ...     def tostring(self, *args, **kwargs):
  ...       s = _etree.tostring(*args, **kwargs)
  ...       if s[-1] == '\n': s = s[:-1]
  ...       return s
  >>> etree = etree_mock()


General Questions
=================

Is there a tutorial?
--------------------

Read the `lxml.etree Tutorial`_.  While this is still work in progress
(just as any good documentation), it provides an overview of the most
important concepts in ``lxml.etree``.  If you want to help out,
improving the tutorial is a very good place to start.

There is also a `tutorial for ElementTree`_ which works for
``lxml.etree``.  The documentation of the `extended etree API`_ also
contains many examples for ``lxml.etree``.  Fredrik Lundh's `element
library`_ contains a lot of nice recipes that show how to solve common
tasks in ElementTree and lxml.etree.  To learn how to use
``lxml.objectify``, read the `objectify documentation`_.

John Shipman has written another tutorial called `Python XML
processing with lxml`_ that contains lots of examples.  Liza Daly
wrote a nice article about high-performance aspects when `parsing
large files with lxml`_.

.. _`lxml.etree Tutorial`:      tutorial.html
.. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
.. _`extended etree API`:        api.html
.. _`objectify documentation`:  objectify.html
.. _`Python XML processing with lxml`: http://www.nmt.edu/tcc/help/pubs/pylxml/
.. _`element library`:          http://effbot.org/zone/element-lib.htm
.. _`parsing large files with lxml`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/


Where can I find more documentation about lxml?
-----------------------------------------------

There is a lot of documentation on the web and also in the Python
standard library documentation, as lxml implements the well-known
`ElementTree API`_ and tries to follow its documentation as closely as
possible.  The recipes in Fredrik Lundh's `element library`_ are
generally worth taking a look at.  There are a couple of issues where
lxml cannot keep up compatibility.  They are described in the
compatibility_ documentation.

The lxml specific extensions to the API are described by individual
files in the ``doc`` directory of the source distribution and on `the
web page`_.

The `generated API documentation`_ is a comprehensive API reference
for the lxml package.

.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
.. _`the web page`:    http://lxml.de/#documentation
.. _`generated API documentation`: api/index.html


What standards does lxml implement?
-----------------------------------

Compliance with XML standards depends on the support in libxml2 and libxslt.
Here is a quote from `http://xmlsoft.org/ <http://xmlsoft.org/>`_:

  In most cases libxml2 tries to implement the specifications in a relatively
  strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests
  from the OASIS XML Tests Suite.

lxml currently supports libxml2 2.6.20 or later, which has even better
support for various XML standards.  The important ones are:

* XML 1.0
* HTML 4
* XML namespaces
* XML Schema 1.0
* XPath 1.0
* XInclude 1.0
* XSLT 1.0
* EXSLT
* XML catalogs
* canonical XML
* RelaxNG
* xml:id
* xml:base

Support for XML Schema is currently not 100% complete in libxml2, but
is definitely very close to compliance.  Schematron is supported in
two ways, the best being the original ISO Schematron reference
implementation via XSLT.  libxml2 also supports loading documents
through HTTP and FTP.


Who uses lxml?
--------------

As an XML library, lxml is often used under the hood of in-house
server applications, such as web servers or applications that
facilitate some kind of content management.  Many people who deploy
Zope_, Plone_ or Django_ use it together with lxml in the background,
without speaking publicly about it.  Therefore, it is hard to get an
idea of who uses it, and the following list of 'users and projects we
know of' is very far from a complete list of lxml's users.

Also note that compatibility with the ElementTree library means that
projects do not need a hard dependency on lxml - as long as they do
not take advantage of lxml's enhanced feature set.

* `cssutils <http://code.google.com/p/cssutils/source/browse/trunk/examples/style.py?r=917>`_,
  a CSS parser and toolkit, can be used with ``lxml.cssselect``
* `Deliverance <http://www.openplans.org/projects/deliverance/project-home>`_,
  a content theming tool
* `Enfold Proxy 4 <http://www.enfoldsystems.com/Products/Proxy/4>`_,
  a web server accelerator with on-the-fly XSLT processing
* `Inteproxy <http://lists.wald.intevation.org/pipermail/inteproxy-devel/2007-February/000000.html>`_,
  a secure HTTP proxy
* `lwebstring <http://pypi.python.org/pypi/lwebstring>`_,
  an XML template engine
* `OpenXMLlib <http://permalink.gmane.org/gmane.comp.python.lxml.devel/3250>`_,
  a library for handling OpenXML document meta data
* `PsychoPy <http://www.psychopy.org/>`_,
  psychology software in Python
* `Pycoon <http://pypi.python.org/pypi/pycoon>`_,
  a WSGI web development framework based on XML pipelines
* `PyQuery <http://pypi.python.org/pypi/pyquery>`_,
  a query framework for XML/HTML, similar to jQuery for JavaScript
* `python-docx <http://github.com/mikemaccana/python-docx>`_,
  a package for handling Microsoft's Word OpenXML format
* `Rambler <http://beta.rambler.ru/srch?query=python+lxml&searchtype=web>`_,
  a meta search engine that aggregates different data sources
* `rdfadict <http://pypi.python.org/pypi/rdfadict>`_,
  an RDFa parser with a simple dictionary-like interface.
* `xupdate-processor <http://pypi.python.org/pypi/xupdate-processor>`_,
  an XUpdate implementation for lxml.etree
* `Diazo <http://docs.diazo.org/>`_,
  an XSLT-under-the-hood web site theming engine

Zope3 and some of its extensions have good support for lxml:

* `gocept.lxml <http://pypi.python.org/pypi/gocept.lxml>`_,
  Zope3 interface bindings for lxml
* `z3c.rml <http://pypi.python.org/pypi/z3c.rml>`_,
  an implementation of ReportLab's RML format
* `zif.sedna <http://pypi.python.org/pypi/zif.sedna>`_,
  an XQuery based interface to the Sedna OpenSource XML database

And don't miss the quotes by our generally happy_ users_, and other
`sites that link to lxml`_.   As `Liza Daly`_ puts it: "Many software
products come with the pick-two caveat, meaning that you must choose
only two: speed, flexibility, or readability.  When used carefully,
lxml can provide all three."

.. _Zope: http://www.zope.org/
.. _Plone: http://www.plone.org/
.. _Django: https://www.djangoproject.com/

.. _happy: http://thread.gmane.org/gmane.comp.python.lxml.devel/3244/focus=3244
.. _users: http://article.gmane.org/gmane.comp.python.lxml.devel/3246
.. _`sites that link to lxml`: http://www.google.com/search?as_lq=http:%2F%2Flxml.de%2F
.. _`Liza Daly`: http://www.ibm.com/developerworks/xml/library/x-hiperfparse/


What is the difference between lxml.etree and lxml.objectify?
-------------------------------------------------------------

The two modules provide different ways of handling XML.  However, objectify
builds on top of lxml.etree and therefore inherits most of its capabilities
and a large portion of its API.

* lxml.etree is a generic API for XML and HTML handling.  It aims for
  ElementTree compatibility_ and supports the entire XML infoset.  It is well
  suited for both mixed content and data centric XML.  Its generality makes it
  the best choice for most applications.

* lxml.objectify is a specialized API for XML data handling in a Python object
  syntax.  It provides a very natural way to deal with data fields stored in a
  structurally well defined XML format.  Data is automatically converted to
  Python data types and can be manipulated with normal Python operators.  Look
  at the examples in the `objectify documentation`_ to see what it feels like
  to use it.

  Objectify is not well suited for mixed contents or HTML documents.  As it is
  built on top of lxml.etree, however, it inherits the normal support for
  XPath, XSLT or validation.
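
As a rough illustration of the difference, the following sketch reads the
same small document through both APIs.  The document and its element names
(``order``, ``quantity``, ``price``) are made up for this example.

.. sourcecode:: python

    from lxml import etree, objectify

    # Hypothetical data-centric document, used only for illustration.
    XML = b"<order><quantity>3</quantity><price>2.5</price></order>"

    # lxml.etree: text values are strings, so conversion is explicit.
    root = etree.fromstring(XML)
    etree_total = int(root.findtext("quantity")) * float(root.findtext("price"))

    # lxml.objectify: children are reached as attributes and their values
    # behave like the matching Python number types.
    order = objectify.fromstring(XML)
    objectify_total = order.quantity * order.price

    print(etree_total, objectify_total)  # 7.5 7.5

Both variants compute the same result.  The objectify version is shorter for
this kind of data, while the etree version makes each conversion explicit
and also works for mixed content.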


How can I make my application run faster?
-----------------------------------------

lxml.etree is a very fast library for processing XML.  There are, however, `a
few caveats`_ involved in the mapping of the powerful libxml2 library to the
simple and convenient ElementTree API.  Not all operations are as fast as the
simplicity of the API might suggest, while some use cases can heavily benefit
from finding the right way of doing them.  The `benchmark page`_ has a
comparison to other ElementTree implementations and a number of tips for
performance tweaking.  As with any Python application, the rule of thumb is:
the more of your processing runs in C, the faster your application gets.  See
also the section on threading_.

.. _`a few caveats`:  performance.html#the-elementtree-api
.. _`benchmark page`: performance.html
.. _threading:        #threading


What about that trailing text on serialised Elements?
-----------------------------------------------------

The ElementTree tree model defines an Element as a container with a tag name,
contained text, child Elements and a tail text.  This means that whenever you
serialise an Element, you will get all parts of that Element:

.. sourcecode:: pycon

    >>> root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    >>> print(etree.tostring(root[0]))
    <tag>text<child/></tag>tail

Here is an example that shows why not serialising the tail would be
even more surprising from an object point of view:

.. sourcecode:: pycon

    >>> root = etree.Element("test")

    >>> root.text = "TEXT"
    >>> print(etree.tostring(root))
    <test>TEXT</test>

    >>> root.tail = "TAIL"
    >>> print(etree.tostring(root))
    <test>TEXT</test>TAIL

    >>> root.tail = None
    >>> print(etree.tostring(root))
    <test>TEXT</test>

Just imagine a Python list where you append an item and it doesn't
show up when you look at the list.

The ``.tail`` property is a huge simplification for the tree model as
it keeps text nodes out of the list of children and makes
access to them quick and simple.  So this is a benefit in most
applications and simplifies many, many XML tree algorithms.

However, in document-like XML (and especially HTML), the above result can be
unexpected to new users and can sometimes require a bit more overhead.  A good
way to deal with this is to use helper functions that copy the Element without
its tail.  The ``lxml.html`` package also deals with this in a couple of
places, as most HTML algorithms benefit from a tail-free behaviour.
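
A minimal sketch of such a helper might look as follows.  The function name
``without_tail`` is an assumption made for this example and not part of
lxml.  For serialisation alone, ``etree.tostring()`` also accepts a
``with_tail=False`` keyword argument that leaves the tree untouched.

.. sourcecode:: python

    import copy
    from lxml import etree

    def without_tail(element):
        """Return a deep copy of ``element`` with its tail text removed.

        Hypothetical helper for illustration, not an lxml API.
        """
        clone = copy.deepcopy(element)
        clone.tail = None
        return clone

    root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    tag = root[0]

    # Variant 1: serialise a tail-free copy.
    copied = etree.tostring(without_tail(tag))

    # Variant 2: ask the serialiser to skip the tail.
    direct = etree.tostring(tag, with_tail=False)

    print(copied)  # b'<tag>text<child/></tag>'
    print(direct)  # b'<tag>text<child/></tag>'

The original element keeps its tail in both cases, so the surrounding tree
is not modified.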


How can I find out if an Element is a comment or PI?
----------------------------------------------------

.. sourcecode:: pycon

    >>> root = etree.XML("<?my PI?><root><!-- empty --></root>")

    >>> root.tag
    'root'
    >>> root.getprevious().tag is etree.PI
    True
    >>> root[0].tag is etree.Comment
    True
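
The same identity test can be wrapped in a small function when walking over
mixed children.  This is only a sketch: the name ``node_kind`` is made up
for this example.  To skip comments and processing instructions entirely,
pass ``etree.Element`` as the tag filter to ``iter()``.

.. sourcecode:: python

    from lxml import etree

    def node_kind(node):
        """Classify a tree node (hypothetical helper for illustration)."""
        if node.tag is etree.Comment:
            return "comment"
        if node.tag is etree.PI:
            return "pi"
        return "element"

    root = etree.XML("<root><!-- empty --><child/><?my PI?></root>")

    kinds = [node_kind(node) for node in root]
    print(kinds)  # ['comment', 'element', 'pi']

    # Only real Elements, in document order, including the root itself.
    elements_only = [el.tag for el in root.iter(etree.Element)]
    print(elements_only)  # ['root', 'child']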


How can I map an XML tree into a dict of dicts?
-----------------------------------------------

I'm glad you asked.

.. sourcecode:: python

    def recursive_dict(element):
         return element.tag, \
                dict(map(recursive_dict, element)) or element.text
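
As a usage sketch, here is what the function returns for a small made-up
document.  Some limits follow directly from the implementation: repeated
sibling tags overwrite each other in the dict, and attributes and tail text
are ignored.

.. sourcecode:: python

    from lxml import etree

    def recursive_dict(element):
        # Same function as above, repeated so the example is self-contained.
        return element.tag, \
               dict(map(recursive_dict, element)) or element.text

    # Hypothetical input document, used only for illustration.
    root = etree.XML("<root><a>1</a><b><c>2</c><d/></b></root>")

    tag, content = recursive_dict(root)
    print((tag, content))
    # ('root', {'a': '1', 'b': {'c': '2', 'd': None}})

Leaf elements map to their text (``None`` for empty elements), all other
elements map to a dict of their children.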


Why does lxml sometimes return 'str' values for text in Python 2?
-----------------------------------------------------------------

In Python 2, lxml's API returns byte strings for plain ASCII text
values, be it for tag names or text in Element content.  This is the
same behaviour as known from ElementTree.  The reasoning is that ASCII
encoded byte strings are compatible with Unicode strings in Python 2,
but consume less memory (usually by a factor of 2 or 4) and are faster
to create because they do not require decoding.  Plain ASCII string
values are very common in XML, so this optimisation is generally worth
it.

In Python 3, lxml always returns Unicode strings for text and names,
as does ElementTree.  Since Python 3.3, Unicode strings containing
only characters that can be encoded in ASCII or Latin-1 are generally
as efficient as byte strings.  In older versions of Python 3, the
above mentioned drawbacks apply.


Installation
============

Which version of libxml2 and libxslt should I use or require?
-------------------------------------------------------------

It really depends on your application, but the rule of thumb is: more recent
versions contain fewer bugs and provide more features.

* Do not use libxml2 2.6.27 if you want to use XPath (including XSLT).  You
  will get crashes when XPath errors occur during the evaluation (e.g. for
  unknown functions).  This happens inside the evaluation call to libxml2, so
  there is nothing that lxml can do about it.

* Try to use versions of both libraries that were released together.  At least
  the libxml2 version should not be older than the libxslt version.

* If you use XML Schema or Schematron which are still under development, the
  most recent version of libxml2 is usually a good bet.

* The same applies to XPath, where a substantial number of bugs and memory
  leaks were fixed over time.  If you encounter crashes or memory leaks in
  XPath applications, try a more recent version of libxml2.

* For parsing and fixing broken HTML, lxml requires at least libxml2 2.6.21.

* For the normal tree handling, however, any libxml2 version starting with
  2.6.20 should do.

Read the `release notes of libxml2`_ and the `release notes of libxslt`_ to
see when (or if) a specific bug has been fixed.
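
The version constraints listed above can be checked at runtime, since
``lxml.etree`` exposes the libxml2 version in use as a tuple of integers.
The following is a minimal sketch.  The function name ``libxml2_warnings``
and the message texts are assumptions made for this example.

.. sourcecode:: python

    from lxml import etree

    def libxml2_warnings(version=etree.LIBXML_VERSION):
        """Return warnings for a libxml2 version tuple like (2, 6, 27).

        Hypothetical helper that encodes the rules from this section.
        """
        warnings = []
        if version < (2, 6, 20):
            warnings.append("older than 2.6.20: too old for tree handling")
        if version < (2, 6, 21):
            warnings.append("older than 2.6.21: cannot fix broken HTML")
        if version == (2, 6, 27):
            warnings.append("2.6.27: XPath errors may crash, avoid XPath/XSLT")
        return warnings

    # Check the library version that lxml is actually running with.
    for message in libxml2_warnings():
        print("libxml2 " + message)

An application could run such a check at startup and refuse to use XPath or
the HTML parser when the installed library is known to be problematic.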

.. _`release notes of libxml2`: http://xmlsoft.org/news.html
.. _`release notes of libxslt`: http://xmlsoft.org/XSLT/news.html


Where are the binary builds?
----------------------------

Binary builds are most often requested by users of Microsoft Windows.
Two of the major design issues of this operating system make it
non-trivial for its users to build lxml: the lack of a pre-installed
standard compiler and the missing package management.

We previously provided Windows binaries through PyPI, but no
longer do so due to the high maintenance overhead they introduce and
the difficulty in supporting different system configurations.
Christoph Gohlke generously provides `unofficial lxml binary builds
for Windows <http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml>`_ that
are usually very up to date.  Consider using them if you prefer a
binary build over a signed official source release.


Why do I get errors about missing UCS4 symbols when installing lxml?
--------------------------------------------------------------------

Most likely, you use a Python installation that was configured for internal
use of UCS2 unicode, meaning 16-bit unicode.  The lxml egg distributions are
generally compiled on platforms that use UCS4, a 32-bit unicode encoding, as
this is used on the majority of platforms.  Sadly, the two are not compatible, so
the eggs can only support the one they were compiled with.

This means that you have to compile lxml from sources for your system.  Note
that you do not need Cython for this, the lxml source distribution is directly
compilable on both platform types.  See the `build instructions`_ on how to do
this.
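
To find out which kind of build a given interpreter is, you can inspect
``sys.maxunicode``.  This is a small sketch, and the function name
``unicode_build`` is made up for this example.  Note that Python 3.3 and
later no longer distinguish between the two build types and always report
the full Unicode range.

.. sourcecode:: python

    import sys

    def unicode_build():
        """Return 'UCS2' for narrow builds and 'UCS4' for wide builds."""
        # Narrow (16-bit) builds report 0xFFFF as the largest code point.
        return "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"

    print(unicode_build())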


Contributing
============

Why is lxml not written in Python?
----------------------------------

It *almost* is.

lxml is not written in plain Python, because it interfaces with two C
libraries: libxml2 and libxslt.  Accessing them at the C-level is
required for performance reasons.

However, to avoid writing plain C-code and caring too much about the
details of built-in types and reference counting, lxml is written in
Cython_, a Python-like language that is translated into C-code.
Chances are that if you know Python, you can write `code that Cython
accepts`_.  Again, the C-ish style used in the lxml code is just for
performance optimisations.  If you want to contribute, don't bother
with the details, a Python implementation of your contribution is
better than none.  And keep in mind that lxml's flexible API often
favours an implementation of features in pure Python, without
bothering with C-code at all.  For example, the ``lxml.html`` package
is entirely written in Python.

Please contact the `mailing list`_ if you need any help.

.. _Cython: http://www.cython.org/
.. _`code that Cython accepts`: http://docs.cython.org/docs/tutorial.html


How can I contribute?
---------------------

If you find something that you would like lxml to do (or do better),
then please tell us about it on the `mailing list`_.  Patches are
always appreciated, especially when accompanied by unit tests and
documentation (doctests would be great).  See the ``tests``
subdirectories in the lxml source tree (below the ``src`` directory)
and the ReST_ `text files`_ in the ``doc`` directory.

We also have a `list of missing features`_ that we would like to
implement but didn't due to lack of time.  If *you* find the time,
patches are very welcome.

.. _ReST: http://docutils.sourceforge.net/rst.html
.. _`text files`: https://github.com/lxml/lxml/tree/master/doc
.. _`list of missing features`: https://github.com/lxml/lxml/blob/master/IDEAS.txt

Besides enhancing the code, there are a lot of places where you can help the
project and its user base.  You can

* spread the word and write about lxml.  Many users (especially new Python
  users) have not yet heard about lxml, although our user base is constantly
  growing.  If you write your own blog and feel like saying something about
  lxml, go ahead and do so.  If we think your contribution or criticism is
  valuable to other users, we may even put a link or a quote on the project
  page.

* provide code examples for the general usage of lxml or specific problems
  solved with lxml.  Readable code is a very good way of showing how a library
  can be used and what great things you can do with it.  Again, if we hear
  about it, we can set a link on the project page.

* work on the documentation.  The web page is generated from a set of ReST_
  `text files`_.  It is meant both as a representative project page for lxml
  and as a site for documenting lxml's API and usage.  If you have questions
  or an idea how to make it more readable and accessible while you are reading
  it, please send a comment to the `mailing list`_.

* enhance the web site. We put some work into making the web site
  usable, understandable and also easy to find, but there are always
  things that can be done better.  You may notice that we are not
  top-ranked when searching the web for "Python and XML", so maybe you
  have an idea how to improve that.

* help with the tutorial.  A tutorial is the most important starting point for
  new users, so it is important for us to provide an easy to understand guide
  into lxml.  As with all documentation, the tutorial is work in progress, so we
  appreciate every helping hand.

* improve the docstrings.  lxml uses docstrings to support Python's integrated
  online ``help()`` function.  However, sometimes these are not sufficient to
  grasp the details of the function in question.  If you find such a place,
  you can try to write up a better description and send it to the `mailing
  list`_.


Bugs
====

My application crashes!
-----------------------

One of the goals of lxml is "no segfaults", so if there is no clear
warning in the documentation that you were doing something potentially
harmful, you have found a bug and we would like to hear about it.
Please report this bug to the `mailing list`_.  See the section on bug
reporting to learn how to do that.

If your application (or e.g. your web container) uses threads, please
see the FAQ section on threading_ to check if you touch on one of the
potential pitfalls.

In any case, try to reproduce the problem with the latest versions of
libxml2 and libxslt.  From time to time, bugs and race conditions are found
in these libraries, so a more recent version might already contain a fix for
your problem.

Remember: even if you see lxml appear in a crash stack trace, it is
not necessarily lxml that *caused* the crash.


My application crashes on MacOS-X!
----------------------------------

This was a common problem up to lxml 2.1.x.  Since lxml 2.2, the only
officially supported way to use it on this platform is through a
static build against freshly downloaded versions of libxml2 and
libxslt.  See the build instructions for `MacOS-X`_.


I think I have found a bug in lxml. What should I do?
-----------------------------------------------------

First, you should look at the `current developer changelog`_ to see if this
is a known problem that has already been fixed in the master branch since the
release you are using.

.. _`current developer changelog`: https://github.com/lxml/lxml/blob/master/CHANGES.txt

Also, the 'crash' section above has some good advice on what to try to see if
the problem is really in lxml - and not in your setup.  Believe it or not,
that happens more often than you might think, especially when old libraries
or even multiple library versions are installed.

You should always try to reproduce the problem with the latest
versions of libxml2 and libxslt - and make sure they are used.
``lxml.etree`` can tell you what it runs with:

.. sourcecode:: python

   import sys
   from lxml import etree

   print("%-20s: %s" % ('Python',           sys.version_info))
   print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
   print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
   print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
   print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
   print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

If you can figure out that the problem is not in lxml but in the
underlying libxml2 or libxslt, you can ask right on the respective
mailing lists, which may considerably reduce the time to find a fix or
work-around.  See the next question for some hints on how to do that.

Otherwise, we would really like to hear about it.  Please report it to
the `mailing list`_ so that we can fix it.  It is very helpful in this
case if you can come up with a short code snippet that demonstrates
your problem.  If others can reproduce and see the problem, it is much
easier for them to fix it - and maybe even easier for you to describe
it and get people convinced that it really is a problem to fix.

It is important that you always report the version of lxml, libxml2
and libxslt that you get from the code snippet above.  If we do not
know the library versions you are using, we will ask back, so it will
take longer for you to get a helpful answer.

Since as a user of lxml you are likely a programmer, you might find
`this article on bug reports`_ an interesting read.

.. _`mailing list`: http://lxml.de/mailinglist/
.. _`this article on bug reports`: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html


How do I know a bug is really in lxml and not in libxml2?
---------------------------------------------------------

A large part of lxml's functionality is implemented by libxml2 and
libxslt, so problems that you encounter may be in one or the other.
Knowing the right place to ask will reduce the time it takes to fix
the problem, or to find a work-around.

Both libxml2 and libxslt come with their own command line frontends,
namely ``xmllint`` and ``xsltproc``.  If you encounter problems with
XSLT processing for specific stylesheets or with validation for
specific schemas, try to run the XSLT with ``xsltproc`` or the
validation with ``xmllint`` respectively to find out if it fails there
as well.  If it does, please report directly to the mailing lists of
the respective project, namely:

* `libxml2 mailing list <http://mail.gnome.org/mailman/listinfo/xml>`_
* `libxslt mailing list <http://mail.gnome.org/mailman/listinfo/xslt>`_

On the other hand, everything that seems to be related to Python code,
including custom resolvers, custom XPath functions, etc. is likely
outside of the scope of libxml2/libxslt.  If you encounter problems
here or you are not sure where the problem may come from, please
ask on the lxml mailing list first.

In any case, a good explanation of the problem including some simple
test code and some input data will help us (or the libxml2 developers)
see and understand the problem, which largely increases your chance of
getting help.  See the question above for a few hints on what is
helpful here.


Threading
=========

Can I use threads to concurrently access the lxml API?
------------------------------------------------------

Short answer: yes, if you use lxml 2.2 or later.

Since version 1.1, lxml frees the GIL (Python's global interpreter
lock) internally when parsing from disk and memory, as long as you use
either the default parser (which is replicated for each thread) or
create a parser for each thread yourself.  lxml also allows
concurrency during validation (RelaxNG and XMLSchema) and XSL
transformation.  You can share RelaxNG, XMLSchema and XSLT objects
between threads.

While you can also share parsers between threads, this will serialize
the access to each of them, so it is better to ``.copy()`` parsers or
to just use the default parser if you do not need any special
configuration.  The same applies to the XPath evaluators, which use an
internal lock to protect their prepared evaluation contexts.  It is
therefore best to use separate evaluator instances in threads.
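
As a rough sketch of this advice, each worker below parses with the
default parser and builds its own ``XPath`` evaluator.  The function
name ``count_items`` and the sample documents are made up for
illustration only.

.. sourcecode:: python

   import threading
   from lxml import etree

   def count_items(xml_bytes, results, index):
       # fromstring() uses the default parser, which lxml keeps per thread
       root = etree.fromstring(xml_bytes)
       # a separate evaluator per thread avoids waiting on its internal lock
       find_count = etree.XPath("count(//item)")
       results[index] = int(find_count(root))

   documents = [b"<root><item/><item/></root>", b"<root><item/></root>"]
   results = [None] * len(documents)
   threads = [
       threading.Thread(target=count_items, args=(doc, results, i))
       for i, doc in enumerate(documents)
   ]
   for thread in threads:
       thread.start()
   for thread in threads:
       thread.join()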

Warning: Before lxml 2.2, and especially before 2.1, there were
various issues when moving subtrees between different threads, or when
applying XSLT objects from one thread to trees parsed or modified in
another.  If you need code to run with older versions, you should
generally avoid modifying trees in threads other than the one they
were generated in.  Although this should work in many cases, there are
certain scenarios where the termination of a thread that parsed a tree
can crash the application if subtrees of this tree were moved to other
documents.  You should be on the safe side when passing trees between
threads if you either

- do not modify these trees and do not move their elements to other
  trees, or

- do not terminate threads while the trees they parsed are still in
  use (e.g. by using a fixed size thread-pool or long-running threads
  in processing chains)

Since lxml 2.2, even multi-thread pipelines are supported. However,
note that it is more efficient to do all tree work inside one thread
than to let multiple threads work on a tree one after the other. This
is because trees inherit state from the thread that created them,
which must be maintained when the tree is modified inside another
thread.


Does my program run faster if I use threads?
--------------------------------------------

Depends.  The best way to answer this is timing and profiling.

The global interpreter lock (GIL) in Python serializes access to the
interpreter, so if the majority of your processing is done in Python
code (walking trees, modifying elements, etc.), your gain will be
close to zero.  The more of your XML processing moves into lxml,
however, the higher your gain.  If your application is bound by XML
parsing and serialisation, or by very selective XPath expressions and
complex XSLTs, your speedup on multi-processor machines can be
substantial.

See the question above to learn which operations free the GIL to support
multi-threading.
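
A minimal timing sketch, assuming a parse-heavy workload, could look
like the following.  The document size, the number of runs and the
pool size are arbitrary choices, and the timings you get will depend
on your machine and library versions.

.. sourcecode:: python

   import time
   from concurrent.futures import ThreadPoolExecutor
   from lxml import etree

   # hypothetical test document, large enough for parsing to dominate
   DATA = b"<root>" + b"<item>text</item>" * 20000 + b"</root>"

   def parse_and_count(data):
       # parsing happens inside lxml, the child count is taken in Python
       return len(etree.fromstring(data))

   def timed(run):
       start = time.perf_counter()
       result = run()
       return result, time.perf_counter() - start

   serial, serial_time = timed(
       lambda: [parse_and_count(DATA) for _ in range(8)])

   with ThreadPoolExecutor(max_workers=4) as pool:
       threaded, threaded_time = timed(
           lambda: list(pool.map(parse_and_count, [DATA] * 8)))

   print("serial:   %.3fs" % serial_time)
   print("threaded: %.3fs" % threaded_time)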


Would my single-threaded program run faster if I turned off threading?
----------------------------------------------------------------------

Possibly, yes.  You can see for yourself by compiling lxml entirely
without threading support.  Pass the ``--without-threading`` option to
setup.py when building lxml from source.  You can also build libxml2
without pthread support (``--without-pthreads`` option), which may add
another bit of performance.  Note that this will leave internal data
structures entirely without thread protection, so make sure you really
do not use lxml outside of the main application thread in this case.


Why can't I reuse XSLT stylesheets in other threads?
----------------------------------------------------

Since later lxml 2.0 versions, you can do this.  There is some
overhead involved as the result document needs an additional cleanup
traversal when the input document and/or the stylesheet were created
in other threads.  However, on a multi-processor machine, the gain of
freeing the GIL easily covers this drawback.

If you need even the last bit of performance, consider keeping (a copy
of) the stylesheet in thread-local storage, and try creating the input
document(s) in the same thread.  And do not forget to benchmark your
code to see if the increased code complexity is really worth it.
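
One possible way to keep a stylesheet per thread is
``threading.local()``.  The following is only a sketch; the helper
names ``get_transform`` and ``worker`` and the stylesheet itself are
invented for the example.

.. sourcecode:: python

   import threading
   from lxml import etree

   STYLESHEET = (
       b'<xsl:stylesheet version="1.0"'
       b' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
       b'<xsl:output method="text"/>'
       b'<xsl:template match="/">'
       b'<xsl:value-of select="count(//item)"/>'
       b'</xsl:template>'
       b'</xsl:stylesheet>'
   )

   _local = threading.local()

   def get_transform():
       # compile the stylesheet once per thread and reuse it afterwards
       transform = getattr(_local, "transform", None)
       if transform is None:
           transform = etree.XSLT(etree.fromstring(STYLESHEET))
           _local.transform = transform
       return transform

   def worker(xml_bytes, results, index):
       # the input document is created in the same thread as the stylesheet
       doc = etree.fromstring(xml_bytes)
       results[index] = str(get_transform()(doc))

   inputs = [b"<root><item/><item/></root>", b"<root><item/></root>"]
   results = [None] * len(inputs)
   threads = [
       threading.Thread(target=worker, args=(data, results, i))
       for i, data in enumerate(inputs)
   ]
   for thread in threads:
       thread.start()
   for thread in threads:
       thread.join()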


My program crashes when run with mod_python/Pyro/Zope/Plone/...
---------------------------------------------------------------

These environments can use threads in a way that may not make it obvious when
threads are created and what happens in which thread.  This makes it hard to
ensure lxml's threading support is used in a reliable way.  Sadly, if problems
arise, they are as diverse as the applications, so it is difficult to provide
any generally applicable solution.  Also, these environments are so complex
that problems become hard to debug and even harder to reproduce in a
predictable way.  If you encounter crashes in one of these systems, but your
code runs perfectly when started by hand, the following gives you a few hints
for possible approaches to solve your specific problem:

* make sure you use recent versions of libxml2, libxslt and lxml.  The
  libxml2 developers keep fixing bugs in each release, and lxml also
  tries to become more robust against possible pitfalls.  So newer
  versions might already fix your problem in a reliable way.  Version
  2.2 of lxml contains many improvements.

* make sure the library versions you installed are really used.  Do
  not rely on what your operating system tells you!  Print the version
  constants in ``lxml.etree`` from within your runtime environment to
  make sure it is the case.  This is especially a problem under
  MacOS-X when newer library versions were installed in addition to
  the outdated system libraries.  Please read the bugs section
  regarding MacOS-X in this FAQ.

* if you use ``mod_python``, try setting this option:

      PythonInterpreter main_interpreter

  There was a discussion on the mailing list about this problem:

      http://comments.gmane.org/gmane.comp.python.lxml.devel/2942

* in a threaded environment, try to initially import ``lxml.etree``
  from the main application thread instead of doing first-time imports
  separately in each spawned worker thread.  If you cannot control the
  thread spawning of your web/application server, an import of
  ``lxml.etree`` in sitecustomize.py or usercustomize.py may still do
  the trick.

* compile lxml without threading support by running ``setup.py`` with the
  ``--without-threading`` option.  While this might be slower in certain
  scenarios on multi-processor systems, it *might* also keep your application
  from crashing, which should be worth more to you than peak performance.
  Remember that lxml is fast anyway, so concurrency may not even be worth it.

* look out for fancy XSLT stuff like foreign document access or
  passing in subtrees through XSLT variables.  This might or might not
  work, depending on your specific usage.  Again, later versions of
  lxml and libxslt provide safer support here.

* try copying trees at suspicious places in your code and working with
  those instead of a tree shared between threads.  Note that the
  copying must happen inside the target thread to be effective, not in
  the thread that created the tree.  Serialising in one thread and
  parsing in another is also a simple (and fast) way of separating
  thread contexts.

* try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
  instead of sharing one.  Also see the question above.

* you can try to serialise suspicious parts of your code with explicit thread
  locks, thus disabling the concurrency of the runtime system.

* report back on the mailing list to see if there are other ways to work
  around your specific problems.  Do not forget to report the version numbers
  of lxml, libxml2 and libxslt you are using (see the question on reporting
  a bug).
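
To check which library versions are really in use, as suggested in
the hints above, you can run something like this from within the
affected runtime environment (for example inside a request handler):

.. sourcecode:: python

   from lxml import etree

   # versions that lxml was built against and that it found at runtime
   print("lxml.etree:       ", etree.LXML_VERSION)
   print("libxml2 used:     ", etree.LIBXML_VERSION)
   print("libxml2 compiled: ", etree.LIBXML_COMPILED_VERSION)
   print("libxslt used:     ", etree.LIBXSLT_VERSION)
   print("libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION)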

Note that most of these options will degrade performance and/or your
code quality.  If you are unsure what to do, please ask on the mailing
list.
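
As an illustration of the "serialise in one thread, parse in another"
hint, the sketch below hands a byte string between two threads
instead of a tree.  The ``producer`` and ``consumer`` functions and
the shared ``box`` dictionary are hypothetical; a real application
would more likely use a queue.

.. sourcecode:: python

   import threading
   from lxml import etree

   def producer(box):
       root = etree.fromstring(b"<root><item/></root>")
       # hand over plain bytes, not the tree itself
       box["xml"] = etree.tostring(root)

   def consumer(box):
       # the tree is created in the thread that goes on to modify it
       root = etree.fromstring(box["xml"])
       etree.SubElement(root, "added")
       box["result"] = etree.tostring(root)

   box = {}
   for target in (producer, consumer):
       thread = threading.Thread(target=target, args=(box,))
       thread.start()
       thread.join()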


Parsing and Serialisation
=========================

..
    making doctest happy:

    >>> try: from StringIO import StringIO
    ... except ImportError: from io import StringIO # Py3
    >>> filename = StringIO("<root/>")


Why doesn't the ``pretty_print`` option reformat my XML output?
---------------------------------------------------------------

Pretty printing (or formatting) an XML document means adding white space to
the content.  These modifications are harmless if they only impact elements in
the document that do not carry (text) data.  They corrupt your data if they
impact elements that contain data.  If lxml cannot distinguish between
whitespace and data, it will not alter your data.  Whitespace is therefore
only added between nodes that do not contain data.  This is always the case
for trees constructed element-by-element, so no problems should be expected
here.  For parsed trees, a good way to ensure that no conflicting whitespace
is left in the tree is the ``remove_blank_text`` option:

.. sourcecode:: pycon

   >>> parser = etree.XMLParser(remove_blank_text=True)
   >>> tree = etree.parse(filename, parser)

This will allow the parser to drop blank text nodes when constructing the
tree.  If you now call a serialization function to pretty print this tree,
lxml can add fresh whitespace to the XML tree to indent it.
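
A self-contained sketch of the whole round trip, using a made-up
input document with irregular whitespace:

.. sourcecode:: python

   from io import BytesIO
   from lxml import etree

   source = BytesIO(b"<root>\n   <a>\n <b>text</b></a>\n</root>")

   # drop the blank text nodes while parsing ...
   parser = etree.XMLParser(remove_blank_text=True)
   tree = etree.parse(source, parser)

   # ... so that the serialiser is free to add fresh indentation
   pretty = etree.tostring(tree, pretty_print=True).decode("ascii")
   print(pretty)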

Note that the ``remove_blank_text`` option also uses a heuristic if it
has no definite knowledge about the document's ignorable whitespace.
It will keep blank text nodes that appear after non-blank text nodes
at the same level.  This is to prevent document-style XML from
breaking.

If you want to be sure all blank text is removed, you have to use
either a DTD to tell the parser which whitespace it can safely ignore,
or remove the ignorable whitespace manually after parsing, e.g. by
setting all tail text to None:

.. sourcecode:: python

   for element in root.iter():
       element.tail = None

Fredrik Lundh also has a Python-level function for indenting XML by
appending whitespace to tags.  It can be found on his `element
library`_ recipe page.


Why can't lxml parse my XML from unicode strings?
-------------------------------------------------

lxml can read Python unicode strings and even tries to support them if libxml2
does not.  However, if the unicode string declares an XML encoding internally
(``<?xml encoding="..."?>``), parsing is bound to fail, as this encoding is
most likely not the real encoding used in Python unicode.  The same is true
for HTML unicode strings that contain charset meta tags, although the problems
may be more subtle here.  The libxml2 HTML parser may not be able to parse the
meta tags in broken HTML and may end up ignoring them, so even if parsing
succeeds, later handling may still fail with character encoding errors.

Note that Python uses different encodings for unicode on different platforms,
so even specifying the real internal unicode encoding is not portable between
Python interpreters.  Don't do it.

Python unicode strings with XML data or HTML data that carry encoding
information are broken.  lxml will not parse them.  You must provide parsable
data in a valid encoding.
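
The following sketch shows the behaviour with a small made-up
document.  It assumes that the rejection surfaces as a ``ValueError``,
which is what lxml releases we are aware of raise; the exact message
may differ between versions.

.. sourcecode:: python

   from lxml import etree

   text = u'<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\xe9</root>'

   rejected = False
   try:
       # a unicode string that declares an encoding is refused
       etree.fromstring(text)
   except ValueError:
       rejected = True

   # the same data as bytes in the declared encoding parses fine
   root = etree.fromstring(text.encode("ISO-8859-1"))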


What is the difference between str(xslt(doc)) and xslt(doc).write() ?
---------------------------------------------------------------------

The str() implementation of the XSLTResultTree class (a subclass of the
ElementTree class) knows about the output method chosen in the stylesheet
(xsl:output), whereas write() does not.  If you call write(), the result will be a
normal XML tree serialization in the requested encoding.  Calling this method
may also fail for XSLT results that are not XML trees (e.g. string results).

If you call str(), it will return the serialized result as specified by the
XSL transform.  This correctly serializes string results to encoded Python
strings and honours ``xsl:output`` options like ``indent``.  This almost
certainly does what you want, so you should only use ``write()`` if you are
sure that the XSLT result is an XML tree and you want to override the encoding
and indentation options requested by the stylesheet.
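
A small sketch with a stylesheet that uses the ``text`` output
method.  The stylesheet and the input document are invented for this
example.

.. sourcecode:: python

   from lxml import etree

   stylesheet = etree.XML(
       b'<xsl:stylesheet version="1.0"'
       b' xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
       b'<xsl:output method="text"/>'
       b'<xsl:template match="/">'
       b'Hello, <xsl:value-of select="/name"/>'
       b'</xsl:template>'
       b'</xsl:stylesheet>'
   )
   transform = etree.XSLT(stylesheet)
   result = transform(etree.XML(b"<name>world</name>"))

   # str() honours xsl:output, so the plain text result comes back as is
   text = str(result)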


Why can't I just delete parents or clear the root node in iterparse()?
----------------------------------------------------------------------

The ``iterparse()`` implementation is based on the libxml2 parser.  It
requires the tree to be intact to finish parsing.  If you delete or modify
parents of the current node, chances are you modify the structure in a way
that breaks the parser.  Normally, this will result in a segfault.  Please
refer to the `iterparse section`_ of the lxml API documentation to find out
what you can do and what you can't do.

.. _`iterparse section`: parsing.html#iterparse-and-iterwalk
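
A cautious pattern, sketched below with a made-up document, is to
clear only the element whose ``end`` event was just delivered and to
leave the root and all other ancestors untouched:

.. sourcecode:: python

   from io import BytesIO
   from lxml import etree

   source = BytesIO(
       b"<root><item>1</item><item>2</item><item>3</item></root>")

   total = 0
   for event, element in etree.iterparse(source, events=("end",), tag="item"):
       total += int(element.text)
       # free the content of the finished element, not its parents
       element.clear()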


How do I output null characters in XML text?
--------------------------------------------

Don't.  What you would produce is not well-formed XML.  XML parsers
will refuse to parse a document that contains null characters.  The
right way to embed binary data in XML is using a text encoding such as
uuencode or base64.
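
For example, binary data can be wrapped with the standard library's
``base64`` module.  The element name ``blob`` and the payload are
arbitrary.

.. sourcecode:: python

   import base64
   from lxml import etree

   payload = b"\x00\x01binary\x00data"

   element = etree.Element("blob")
   # base64 output is plain ASCII, so it is safe as XML text
   element.text = base64.b64encode(payload).decode("ascii")
   serialized = etree.tostring(element)

   # the receiving side decodes the text content back into bytes
   restored = base64.b64decode(etree.fromstring(serialized).text)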


Is lxml vulnerable to XML bombs?
--------------------------------

This has nothing to do with lxml itself, only with the parser of
libxml2.  Since libxml2 version 2.7, the parser imposes hard security
limits on input documents to prevent DoS attacks with forged input
data.  Since lxml 2.2.1, you can disable these limits with the
``huge_tree`` parser option if you need to parse *really* large,
trusted documents.  All lxml versions will leave these restrictions
enabled by default.

Note that libxml2 versions of the 2.6 series do not restrict their
parser and are therefore vulnerable to DoS attacks.
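
If you do need to lift the limits for a trusted document, a sketch of
the parser setup looks like this (the tiny input merely stands in for
a really large file):

.. sourcecode:: python

   from lxml import etree

   # only use this parser for input that you fully trust
   trusting_parser = etree.XMLParser(huge_tree=True)
   root = etree.fromstring(b"<root><data/></root>", trusting_parser)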


Can lxml parse from file objects opened in unicode/text mode?
-------------------------------------------------------------

Technically, yes. However, you likely do not want to do that, because
it is extremely inefficient. The text encoding that libxml2 uses
internally is UTF-8, so parsing from a Unicode file means that Python
first reads a chunk of data from the file, then decodes it into a new
buffer, and then copies it into a new unicode string object, just to
let libxml2 make yet another copy while encoding it down into UTF-8
in order to parse it. It's clear that this involves a lot more
recoding and copying than when parsing straight from the bytes that
the file contains.

If you really know the encoding better than the parser (e.g. when
parsing HTML that lacks a content declaration), then instead of passing
an encoding parameter into the file object when opening it, create a
new instance of an XMLParser or HTMLParser and pass the encoding into
its constructor. Afterwards, use that parser for parsing, e.g. by
passing it into the ``etree.parse(file, parser)`` function.
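
A minimal sketch of this approach, using an in-memory byte stream in
place of a file opened in binary mode and assuming the data is known
to be ISO-8859-1:

.. sourcecode:: python

   from io import BytesIO
   from lxml import etree

   # stands in for open(some_path, "rb"); the HTML has no charset hint
   raw = BytesIO(
       u"<html><body><p>caf\xe9</p></body></html>".encode("iso-8859-1"))

   # tell the parser the encoding instead of decoding in Python
   parser = etree.HTMLParser(encoding="iso-8859-1")
   tree = etree.parse(raw, parser)
   paragraph = tree.findtext(".//p")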


XPath and Document Traversal
============================

What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
--------------------------------------------------------------------

``findall()`` is part of the original `ElementTree API`_.  It supports a
`simple subset of the XPath language`_, without predicates, conditions and
other advanced features.  It is very handy for finding specific tags in a
tree.  Another important difference is namespace handling, which uses the
``{namespace}tagname`` notation.  This is not supported by XPath.  The
findall, find and findtext methods are compatible with other ElementTree
implementations and allow writing portable code that runs on ElementTree,
cElementTree and lxml.etree.

``xpath()``, on the other hand, supports the complete power of the XPath
language, including predicates, XPath functions and Python extension
functions.  The syntax is defined by the `XPath specification`_.  If you need
the expressiveness and selectivity of XPath, the ``xpath()`` method, the
``XPath`` class and the ``XPathEvaluator`` are the best choice_.

.. _`simple subset of the XPath language`: http://effbot.org/zone/element-xpath.htm
.. _`XPath specification`:                 http://www.w3.org/TR/xpath
.. _choice:                                performance.html#xpath
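
To illustrate the two notations side by side, here is a short sketch
with a made-up namespace URI and document:

.. sourcecode:: python

   from lxml import etree

   NS = "http://example.com/ns"
   root = etree.fromstring(
       b'<doc xmlns="http://example.com/ns">'
       b'<entry id="1">a</entry><entry id="2">b</entry></doc>'
   )

   # ElementTree style: the namespace goes into the tag name in braces
   entries = root.findall("{%s}entry" % NS)

   # XPath style: a prefix is mapped to the namespace URI
   second = root.xpath('//n:entry[@id = "2"]/text()', namespaces={"n": NS})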


Why doesn't ``findall()`` support full XPath expressions?
---------------------------------------------------------

It was decided that it is more important to keep compatibility with
ElementTree_ to simplify code migration between the libraries.  The main
difference compared to XPath is the ``{namespace}tagname`` notation used in
``findall()``, which is not valid XPath.

ElementTree and lxml.etree use the same implementation, which ensures 100%
compatibility.  Note that ``findall()`` is `so fast`_ in lxml that a native
implementation would not bring any performance benefits.

.. _`so fast`: performance.html#tree-traversal


How can I find out which namespace prefixes are used in a document?
-------------------------------------------------------------------

You can traverse the document (``root.iter()``) and collect the prefix
attributes from all Elements into a set.  However, it is unlikely that you
really want to do that.  You do not need these prefixes, honestly.  You only
need the namespace URIs.  All namespace comparisons use these, so feel free to
make up your own prefixes when you use XPath expressions or extension
functions.
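
If you do want to look at them, the following sketch collects both
the prefixes and the namespace URIs of a small made-up document.
Elements without a namespace show up as ``None``.

.. sourcecode:: python

   from lxml import etree

   root = etree.fromstring(
       b'<a:root xmlns:a="http://example.com/a"'
       b' xmlns:b="http://example.com/b">'
       b'<b:child/><plain/></a:root>'
   )

   # the prefixes are a serialisation detail of this particular document
   prefixes = {element.prefix for element in root.iter()}

   # the URIs are what namespace comparisons actually use
   uris = {etree.QName(element).namespace for element in root.iter()}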

The only place where you might consider specifying prefixes is the
serialization of Elements that were created through the API.  Here, you can
specify a prefix mapping through the ``nsmap`` argument when creating the root
Element.  Its children will then inherit this prefix for serialization.
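
A short sketch of this, with an arbitrary prefix ``ex`` and a made-up
namespace URI:

.. sourcecode:: python

   from lxml import etree

   NS = "http://example.com/ns"

   # the prefix mapping is given once, on the root Element
   root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})
   child = etree.SubElement(root, "{%s}child" % NS)

   serialized = etree.tostring(root)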


How can I specify a default namespace for XPath expressions?
------------------------------------------------------------

You can't.  In XPath, there is no such thing as a default namespace.  Just use
an arbitrary prefix and let the namespace dictionary of the XPath evaluators
map it to your namespace.  See also the question above.
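
For example, with a document that declares a default namespace, an
unprefixed XPath name only matches elements in no namespace.  The
prefix ``any`` and the namespace URI below are arbitrary choices.

.. sourcecode:: python

   from lxml import etree

   NS = "http://example.com/ns"
   root = etree.fromstring(
       b'<doc xmlns="http://example.com/ns"><entry/><entry/></doc>')

   # unprefixed names do not pick up the document's default namespace
   unprefixed = root.xpath("//entry")

   # any prefix works as long as it maps to the right URI
   find_entries = etree.XPath("//any:entry", namespaces={"any": NS})
   prefixed = find_entries(root)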