field | value | date
---|---|---
author | Daniel Veillard <veillard@src.gnome.org> | 1999-09-04 18:27:23 +0000
committer | Daniel Veillard <veillard@src.gnome.org> | 1999-09-04 18:27:23 +0000
commit | c8eab3a22c212711ec7be4a65c8e6cfc7c351f86 (patch) |
tree | aa266cc6b965e568891a7cbc0f3343d6f8c451ef |
parent | 6bd26dc2d0d57212c9aa3925a9985deca51e58af (diff) |
download | libxml2-c8eab3a22c212711ec7be4a65c8e6cfc7c351f86.tar.gz |
Updated the documentation, Daniel.
-rw-r--r-- | ChangeLog | 4 |
-rw-r--r-- | doc/xml.html | 276 |
2 files changed, 264 insertions, 16 deletions
@@ -1,3 +1,7 @@ +Sat Sep 4 20:25:46 CEST 1999 Daniel Veillard <Daniel.Veillard@w3.org> + + * doc/xml.html : updated the documentation + Fri Sep 3 00:01:08 CEST 1999 Daniel Veillard <Daniel.Veillard@w3.org> * xmlmemory.[ch] Makefile.am :added a memory wrapper to chase diff --git a/doc/xml.html b/doc/xml.html index 749d3481..49e08997 100644 --- a/doc/xml.html +++ b/doc/xml.html @@ -9,12 +9,16 @@ <body bgcolor="#ffffff"> <h1 align="center">The XML library for Gnome</h1> +<h2 style="text-align: center">libxml, a.k.a. gnome-xml</h2> + +<p></p> + <p>This document describes the <a href="http://www.w3.org/XML/">XML</a> library provideed in the <a href="http://www.gnome.org/">Gnome</a> framework. -XML is a standard to build tag based structured documents/data. </p> +XML is a standard to build tag based structured documents/data.</p> <p>The internal document repesentation is as close as possible to the <a -href="http://www.w3.org/DOM/">DOM</a> interfaces. </p> +href="http://www.w3.org/DOM/">DOM</a> interfaces.</p> <p>Libxml also has a <a href="http://www.megginson.com/SAX/index.html">SAX interface</a>, <a href="mailto:james@daa.com.au">James Henstridge</a> made <a @@ -23,10 +27,6 @@ documentation</a> expaining how to use it. The interface is as compatible as possible with <a href="http://www.jclark.com/xml/expat.html">Expat</a> one.</p> -<p>The code is commented in a <a href=""></a>way which allow <a -href="http://rpmfind.net/veillard/XML/libxml.html">extensive documentation</a> -to be automatically extracted.</p> - <p>There is also a mailing-list <a href="mailto:xml@rufus.w3.org">xml@rufus.w3.org</a> for libxml, with an <a href="http://rpmfind.net/veillard/XML/messages">on-line archive</a>. 
To @@ -46,10 +46,19 @@ uses it for his implementation of <a href="http://www.w3.org/Graphics/SVG/">SVG</a> called <a href="http://www.levien.com/svg/">gill</a>.</p> -<h2>xml</h2> +<h2>Extensive documentation</h2> + +<p>The code is commented in a <a href=""></a>way which allow <a +href="http://rpmfind.net/veillard/XML/libxml.html">extensive documentation</a> +to be automatically extracted.</p> + +<p>At some point I will change the back-end to produce XML documentation in +addition to SGML Docbook and HTML.</p> -<p>XML is a standard for markup based structured documents, here is <a -name="example">an example</a>:</p> +<h2>XML</h2> + +<p><a href="http://www.w3.org/TR/REC-xml">XML is a standard</a> for markup +based structured documents, here is <a name="example">an example</a>:</p> <pre><?xml version="1.0"?> <EXAMPLE prop1="gnome is great" prop2="&amp; linux too"> <head> @@ -70,6 +79,12 @@ to be closed</strong> XML is pedantic about this, not that for example the image tag has no content (just an attribute) and is closed by ending up the tag with <code>/></code>.</p> +<p>XML can be applied sucessfully to a wide range or usage from long term +structured document maintenance where it follows the steps of SGML to simple +data encoding mechanism like configuration file format (glade), spreadsheets +(gnumeric), or even shorter lived document like in WebDAV where it is used to +encode remote call between a client and a server.</p> + <h2>The tree output</h2> <p>The parser returns a tree built during the document analysis. The value @@ -125,6 +140,66 @@ standalone=true <p>This should be useful to learn the internal representation model.</p> +<h2>The SAX interface</h2> + +<p>Sometimes the DOM tree output is just to large to fit reasonably into +memory. In that case and if you don't expect to save back the XML document +loaded using libxml, it's better to use the SAX interface of libxml. SAX is a +<strong>callback based interface</strong> to the parser. 
Before parsing, the +application layer register a customized set of callbacks which will be called +by the library as it progresses through the XML input.</p> + +<p>To get a more detailed step-by-step guidance on using the SAX interface of +libxml, <a href="mailto:james@daa.com.au">James Henstridge</a> made <a +href="http://www.daa.com.au/~james/gnome/xml-sax/xml-sax.html">a nice +documentation.</a></p> + +<p>You can debug the SAX behaviour by using the <strong>testSAX</strong> +program located in the gnome-xml module (it's usually not shipped in the +binary packages of libxml, but you can also find it in the tar source +distribution). Here is the sequence of callback that would be generated when +parsing the example given before as reported by testSAX:</p> +<pre>SAX.setDocumentLocator() +SAX.startDocument() +SAX.getEntity(amp) +SAX.startElement(EXAMPLE, prop1='gnome is great', prop2='&amp; linux too') +SAX.characters( , 3) +SAX.startElement(head) +SAX.characters( , 4) +SAX.startElement(title) +SAX.characters(Welcome to Gnome, 16) +SAX.endElement(title) +SAX.characters( , 3) +SAX.endElement(head) +SAX.characters( , 3) +SAX.startElement(chapter) +SAX.characters( , 4) +SAX.startElement(title) +SAX.characters(The Linux adventure, 19) +SAX.endElement(title) +SAX.characters( , 4) +SAX.startElement(p) +SAX.characters(bla bla bla ..., 15) +SAX.endElement(p) +SAX.characters( , 4) +SAX.startElement(image, href='linus.gif') +SAX.endElement(image) +SAX.characters( , 4) +SAX.startElement(p) +SAX.characters(..., 3) +SAX.endElement(p) +SAX.characters( , 3) +SAX.endElement(chapter) +SAX.characters( , 1) +SAX.endElement(EXAMPLE) +SAX.endDocument()</pre> + +<p>Most of the other functionnalities of libxml are based on the DOM tree +building facility, so nearly everything up to the end of this document +presuppose the use of the standard DOM tree build. 
Note that the DOM tree +itself is built by a set of registered default callbacks, without internal +specific interface.</p> + <h2>The XML library interfaces</h2> <p>This section is directly intended to help programmers getting bootstrapped @@ -132,8 +207,7 @@ using the XML library from the C language. It doesn't intent to be extensive, I hope the automatically generated docs will provide the completeness required, but as a separated set of documents. The interfaces of the XML library are by principle low level, there is nearly zero abstration. Those -interested in a higher level API should <a href="#DOM">look at DOM</a> -(unfortunately not completed).</p> +interested in a higher level API should <a href="#DOM">look at DOM</a>.</p> <h3>Invoking the parser</h3> @@ -290,6 +364,165 @@ individually for one file:</p> </dd> </dl> +<h2>Entities or no entities</h2> + +<p>Entities principle is similar to simple C macros. They define an +abbreviation for a given string that you can reuse many time through the +content of your document. They are especially useful when frequent occurrences +of a given string may occur within a document or to confine the change needed +to a document to a restricted area in the internal subset of the document (at +the beginning). Example:</p> +<pre>1 <?xml version="1.0"?> +2 <!DOCTYPE EXAMPLE SYSTEM "example.dtd" [ +3 <!ENTITY xml "Extensible Markup Language"> +4 ]> +5 <EXAMPLE> +6 &xml; +7 </EXAMPLE> + +</pre> + +<p>Line 3 declares the xml entity. Line 6 uses the xml entity, by prefixing +it's name with '&' and following it by ';' without any spaces added. 
+There are 5 predefined entities in libxml allowing to escape charaters with +predefined meaning in some parts of the xml document content: +<strong>&lt;</strong> for the letter '<', <strong>&gt;</strong> for +the letter '>', <strong>&apos;</strong> for the letter ''', +<strong>&quot;</strong> for the letter '"', and +<strong>&amp;</strong> for the letter '&'.</p> + +<p>One of the problems related to entities is that you may want the parser to +substitute entities content to see the replacement text in your application, +or you may prefer keeping entities references as such in the content to be +able to save the document back without loosing this usually precious +information (if the user went through the pain of explicitley defining +entities, he may have a a rather negative attitude if you blindly susbtitute +them as saving time). The function <a +href="gnome-xml-parser.html#XMLSUBSTITUTEENTITIESDEFAULT">xmlSubstituteEntitiesDefault()</a> +allows to check and change the behaviour, which is to not substitute entities +by default.</p> + +<p>Here is the DOM tree built by libxml for the previous document in the +default case:</p> +<pre>/gnome/src/gnome-xml -> ./tester --debug test/ent1 +DOCUMENT +version=1.0 + ELEMENT EXAMPLE + TEXT + content= + ENTITY_REF + INTERNAL_GENERAL_ENTITY xml + content=Extensible Markup Language + TEXT + content=</pre> + +<p>And here is the result when substituting entities:</p> +<pre>/gnome/src/gnome-xml -> ./tester --debug --noent test/ent1 +DOCUMENT +version=1.0 + ELEMENT EXAMPLE + TEXT + content= Extensible Markup Language</pre> + +<p>So entities or no entities ? 
Basically it depends on your use case, I +suggest to keep the non-substituting default behaviour and avoid using +entities in your XML document or data if you are not willing to handle the +entity references elements in the DOM tree.</p> + +<p>Note that at save time libxml enforce the conversion of the predefined +entities where necessary to prevent well-formedness problems, and will also +transparently replace those with chars (i.e. will not generate entity +reference elements in the DOM tree nor call the reference() SAX callback when +finding them in the input).</p> + +<h2>Namespaces</h2> + +<p>The libxml library implement namespace @@ support by recognizing namespace +contructs in the input, and does namespace lookup automatically when building +the DOM tree. A namespace declaration is associated with an in-memory +structure and all elements or attributes within that namespace point to it. +Hence testing the namespace is a simple and fast equality operation at the +user level. </p> + +<p>I suggest it that people using libxml use a namespace, and declare it on +the root element of their document as the default namespace. Then they dont +need to happend the prefix in the content but we will have a basis for future +semantic refinement and merging of data from different sources. This doesn't +augment significantly the size of the XML output, but significantly increase +it's value in the long-term.</p> + +<p>Concerning the namespace value, this has to be an URL, but this doesn't +have to point to any existing resource on the Web. I suggest using an URL +within a domain you control, which makes sense and if possible holding some +kind of versionning informations. For example +<code>"http://www.gnome.org/gnumeric/1.0"</code> is a good namespace scheme. 
+Then when you load a file, make sure that a namespace carrying the +version-independant prefix is installed on the root element of your document, +and if the version information don't match something you know, warn the user +and be liberal in what you accept as the input. Also do *not* try to base +namespace checking on the prefix value <foo:text> may be exactly the same +as <bar:text> in another document, what really matter is the URI +associated with the element or the attribute, not the prefix string which is +just a shortcut for the full URI.</p> + +<p>@@Interfaces@@</p> + +<p>@@Examples@@</p> + +<p>Usually people object using namespace in the case of validation, I object +this and will make sure that using namespaces won't break validity checking, +so even is you plan or are using validation I strongly suggest to add +namespaces to your document. A default namespace scheme +<code>xmlns="http://...."</code> should not break validity even on less +flexible parsers. Now using namespace to mix and differenciate content coming +from mutliple Dtd will certainly break current validation schemes, I will try +to provide ways to do this, but this may not be portable or standardized.</p> + +<h2>Validation, or are you afraid of DTDs ?</h2> + +<p>Well what is validation and what is a DTD ?</p> + +<p>Validation is the process of checking a document against a set of +construction rules, a <strong>DTD</strong> (Document Type Definition) is such +a set of rules.</p> + +<p>The validation process and building DTDs are the two most difficult parts +of XML life cycle. Briefly a DTD defines all the possibles element to be +found within your document, what is the formal shape of your document tree (by +defining the allowed content of an element, either text, a regular expression +for the allowed list of children, or mixed content i.e. both text and childs). +The DTD also defines the allowed attributes for all elements and the types of +the attributes. 
For more detailed informations, I suggest to read the related +parts of the XML specification, the examples found under +gnome-xml/test/valid/dtd and the large amount of books available on XML. The +dia example in gnome-xml/test/valid should be both simple and complete enough +to allow you to build your own.</p> + +<p>A word of warning, building a good DTD which will fit your needs of your +application in the long-term is far from trivial, however the extra level of +quality it can insure is well worth the price for some sets of applications or +if you already have already a DTD defined for your application field.</p> + +<p>The validation is not completely finished but in a (very IMHO) usable +state. Until a real validation interface is defined the way to do it is to +define and set the <strong>xmlDoValidityCheckingDefaultValue</strong> external +variable to 1, this will of course be changed at some point:</p> + +<p>extern int xmlDoValidityCheckingDefaultValue;</p> + +<p>...</p> + +<p>xmlDoValidityCheckingDefaultValue = 1;</p> + +<p></p> + +<p>To handle external entities, use the function +<strong>xmlSetExternalEntityLoader</strong>(xmlExternalEntityLoader f); to +link in you HTTP/FTP/Entities database library to the standard libxml +core.</p> + +<p>@@interfaces@@</p> + <h2><a name="DOM">DOM Principles</a></h2> <p><a href="http://www.w3.org/DOM/">DOM</a> stands for the <em>Document Object @@ -306,7 +539,14 @@ presents on other programs like this:</p> <p>This should help greatly doing things like modifying a gnumeric spreadsheet embedded in a GWP document for example.</p> -<h3><a name="Example">A real example</a></h3> +<p>The current DOM implementation on top of libxml is the <a +href="http://cvs.gnome.org/lxr/source/gdome/">gdome Gnome module</a>, this is +a full DOM interface, thanks to <a href="mailto:raph@levien.com">Raph +Levien</a>.</p> + +<p>The gnome-dom module in the Gnome CVS base is obsolete</p> + +<h2><a name="Example">A real example</a></h2> <p>Here is 
a real size example, where the actual content of the application data is not kept in the DOM tree but uses internal structures. It is based on @@ -368,8 +608,7 @@ base</a>:</p> </gjob:Job> </gjob:Jobs> -</gjob:Helping> -</pre> +</gjob:Helping></pre> <p>While loading the XML file into an internal DOM tree is a matter of calling only a couple of functions, browsing the tree to gather the informations and @@ -501,8 +740,13 @@ produce the code needed to import and export the content between C data and XML storage. This is left as an exercise to the reader :-)</p> <p>Feel free to use <a href="gjobread.c">the code for the full C parsing -example</a> as a template,</p> +example</a> as a template, it is also available with Makefile in the Gnome CVS +base under gnome-xml/example</p> + +<p></p> + +<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p> -<p> <a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p> +<p>$Id$</p> </body> </html>