An Introduction to XML for
Technical Users
Anthony B. Coates
Equity Systems Pty Ltd
A subsidiary of Reuters Plc.
Copyright © MM Reuters Plc.
tony.coates@reuters.com
2000/03/21
Contents
- The What, Why, and How of XML
- What is XML?
- Why Use XML?
- How Do You Get Into XML?
- How - XML Viewers
- How - XML Editors
- How - XML Parsers
- How - XSL & XSLT
- How - XPath/XPointer/XLink
- How - XML in Databases
- How - XML and Objects
- How - Messaging in XML
- How - XML Repositories
- XML Tips
- References - XML
- References - XML Standards
- References - XML Tools
- References - XML Repositories
- References - Industry Schemas
- References - Articles
- References - Other
- XML is the acronym for eXtensible Markup Language
- XML is a textual encoding
system for creating
structured documents that can
be understood by computers
- XML is a Recommendation of
the W3C
- Charles Goldfarb at IBM,
recognising that custom
document formats impeded
enterprise data sharing,
developed GML, the Generalised Markup Language
- GML evolved into SGML, the Standard Generalised Markup
Language, which remained largely
unknown and unused outside of
the publishing industry
- SGML has so much functionality
that it is difficult to write
parsers which read it
- Few users need more than a
fraction of the features
- When HTML was created at
CERN, they had the foresight
to use SGML as the basis
- HTML documents, in principle,
are SGML documents which
conform to one of the HTML
DTDs
- DTDs (Document Type
Definitions) are the SGML way
to describe the allowable tag
structure (syntax) for a
particular type of document
- HTML authors are often
oblivious of the HTML DTDs,
as the browser makers ignored
them and added their own tags
- Led to the balkanisation of HTML
- However, HTML made everyone
familiar with SGML
angle-bracket tags:
<html>...</html>
- HTML browsers tend to be
tolerant of HTML errors,
and this is abused
- Many Web pages have incorrect
HTML which displays OK
- OK for people, but a nightmare
for computers, which need
documents to be simple and
correct
- XML looks like HTML, but you
can choose your own tags and
attributes:
<?xml version="1.0"?>
<my-document>
<author name="A.B.Coates"/>
<my-title>
Document Title
</my-title>
<my-section>
<my-title>
Section Title
</my-title>
This is the text
of my section.
</my-section>
</my-document>
- XML can encode data, not just
documents for reading
<?xml version="1.0"?>
<order supplier="Babies Online">
<item code="XYZ-2000"
name="No-Leak Nappies"/>
<unit-price currency="AUD">5</unit-price>
<volume>10000</volume>
<valid-until>20000630</valid-until>
</order>
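
Because the order above is plain structured text, a program can read it with any XML parser. The following is a minimal sketch in Python using the standard-library ElementTree parser; the function name `order_total` is hypothetical and is not part of the seminar material.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# The order document from the slide, reproduced as a string.
ORDER_XML = """<?xml version="1.0"?>
<order supplier="Babies Online">
  <item code="XYZ-2000" name="No-Leak Nappies"/>
  <unit-price currency="AUD">5</unit-price>
  <volume>10000</volume>
  <valid-until>20000630</valid-until>
</order>"""


def order_total(xml_text):
    """Return (currency, total price) for an order shaped like the one above."""
    order = ET.fromstring(xml_text)
    price = order.find("unit-price")
    volume = int(order.findtext("volume"))
    # Decimal avoids floating-point rounding on money values.
    return price.get("currency"), Decimal(price.text) * volume


currency, total = order_total(ORDER_XML)
print(order_total(ORDER_XML))
```

The program never has to guess where a field starts or ends; the tags and attributes carry that information.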
- XML namespaces avoid problems when
the same tag is used
in different contexts
<?xml version="1.0"?>
<html xmlns:fn="www.furniture.com/products">
<body>
<!-- Here is an XHTML table -->
<table>
<tr><td>1</td><td>2</td></tr>
</table>
<!-- Here is another kind of table -->
<fn:table>
<fn:material>wood</fn:material>
<fn:legs>6</fn:legs>
</fn:table>
</body>
</html>
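
A namespace-aware parser keeps the two kinds of table apart. This is a small sketch, assuming Python's standard-library ElementTree; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The page from the slide, with the comments omitted.
PAGE_XML = """<?xml version="1.0"?>
<html xmlns:fn="www.furniture.com/products">
  <body>
    <table>
      <tr><td>1</td><td>2</td></tr>
    </table>
    <fn:table>
      <fn:material>wood</fn:material>
      <fn:legs>6</fn:legs>
    </fn:table>
  </body>
</html>"""

# Prefix-to-URI mapping used for lookups. Only the URI matters;
# the prefix chosen here need not match the one in the document.
NS = {"fn": "www.furniture.com/products"}

root = ET.fromstring(PAGE_XML)
html_tables = root.findall("body/table")              # tables in no namespace
furniture_tables = root.findall("body/fn:table", NS)  # furniture tables only
legs = furniture_tables[0].findtext("fn:legs", namespaces=NS)

print(len(html_tables), len(furniture_tables), legs)
```

Internally the parser stores the furniture tag as `{www.furniture.com/products}table`, so it can never be confused with the plain `table`.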
- XML is a simple and
underwhelming concept
- Avoids major complexities of
SGML and HTML (from computer's
perspective)
- XML is Unicode, not ASCII,
and is 100% case-sensitive
- Tags are enclosed in
angle-brackets, and can have
attributes in quotes
- Tags can contain other tags
or text or both
- However, such mixed content is best avoided, if possible
- Tags must be closed:
XHTML sample:
<p>A paragraph of text.</p>
<br></br>
<br/>
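
Unlike an HTML browser, an XML parser rejects a document whose tags are not closed or whose case does not match. A minimal sketch in Python (standard-library ElementTree; the helper name `is_well_formed` is hypothetical):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


closed = "<p>A paragraph of text.<br/></p>"
unclosed = "<p>A paragraph of text.<br></p>"   # <br> is never closed
wrong_case = "<P>A paragraph of text.</p>"     # <P> and </p> are different tags

for sample in (closed, unclosed, wrong_case):
    print(is_well_formed(sample), sample)
```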
- XML files should start with an XML declaration:
<?xml version="1.0"?>
- XML files can optionally have an SGML-style Document
Type Definition (DTD) which
defines the allowable tag
structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE account SYSTEM "sample.dtd">
<account accountNumber="AC11223344">
<customer>
<name firstName="John" lastName="Citizen"/>
</customer>
<balance>1234.56</balance>
</account>
<!-- A name contains neither tags nor text,
but has two attributes.
The "lastName" is required,
while the "firstName" is optional.
-->
<!ELEMENT name EMPTY>
<!ATTLIST name
firstName CDATA #IMPLIED
lastName CDATA #REQUIRED
>
<!-- A customer has a name. -->
<!ELEMENT customer (name)>
<!-- The balance is in text format. -->
<!ELEMENT balance (#PCDATA)>
<!-- Each account belongs to one or
more customers, and has a balance.
-->
<!ELEMENT account (customer+,balance)>
<!ATTLIST account
accountNumber ID #REQUIRED
>
- More recently, XML files can optionally have a Schema, which defines
not only the tag structure,
but the data type(s) of the
content
<?xml version="1.0" encoding="UTF-8"?>
<account xmlns="sample.xsd"
accountNumber="AC11223344">
<customer>
<name firstName="John" lastName="Citizen"/>
</customer>
<balance>1234.56</balance>
</account>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema PUBLIC
"-//W3C/DTD XML Schema Version 1.0//EN"
"http://www.w3.org/XML/Group/
1999/09/23-xmlschema/structures/
structures.dtd"
>
<schema>
<!-- A name contains neither tags nor text,
but has two attributes.
The "lastName" is required,
while the "firstName" is optional.
-->
<element name="name">
<archetype content="empty">
<attribute name="firstName"
type="string"/>
<attribute name="lastName"
type="string"/>
</archetype>
</element>
<!-- A customer has a name. -->
<element name="customer">
<archetype content="elemOnly">
<element ref="name" minOccurs='1'
maxOccurs='1'/>
</archetype>
</element>
<!-- The balance is a decimal number. -->
<element name="balance" type="decimal"/>
<!-- Each account belongs to one or
more customers, and has a balance.
-->
<element name="account">
<archetype content="elemOnly">
<element ref="customer" minOccurs='1'
maxOccurs='*'/>
<element ref="balance" minOccurs='1'
maxOccurs='1'/>
<attribute name="xmlns"
type="string"/>
<attribute name="accountNumber"
type="string"/>
</archetype>
</element>
</schema>
- Lots of information is tied
up in documents with custom
formats
- To use these documents in
automated processes is
difficult, because you need
to create custom parsers for
each document format
- This makes
computer-to-computer
transactions difficult to
implement across and between
enterprises
- Custom formats have software
maintenance requirements which
are a black hole for
recurrent expenditure
- XML is simple and (generally)
human-readable, capable of
representing complex data
structures
- XML is not anonymous binary
- Simpler to parse than other
formats, for example HTML
or SGML
- XML tags must always be
closed, so a computer always
knows when it has reached the
end of an XML file
- Easier to learn and quicker
to implement than CORBA
or DCOM, at some penalty in raw
speed
- Easy to transmit using Web
servers through firewalls and
over an Intranet or the
Internet
- Rapidly being accepted as the
industry standard portable
data format
- Not controlled by any one
vendor, but contributed to by
many
- Microsoft and Sun may
disagree on which OS's and
languages are best, but both
now support XML as the data
format of choice
- Oracle and IBM's DB2
both support XML input and
output, as do a growing number
of databases
- Lots of software support, much
available free of charge
- Custom binary formats use less
bandwidth, but bandwidth gets
cheaper while developers'
salaries do not
- XML saves development time,
and hence money
- The W3C and IETF are looking
into how XML can be
transferred more efficiently
using HTTP and TCP/IP
- XML will become a built-in
data type in CORBA
- WML, the WAP
equivalent of HTML, is XML
- WAP transmits XML as
compressed WBXML
(WAP Binary XML)
- HTML 4 is being reformulated
as XHTML
- IE5 and Communicator 4 can
display XHTML (if written
sympathetically)
- Soon, XHTML will be
modularised
- For example, if you need
tables in your XML document,
you will be able to use just
the tables from XHTML, without
the rest
- SVG, the new Web format for
vector graphics, is also XML
- Supports both animation and
interactivity
- XML data can be transformed
on the fly into SVG images
- Unlike HTML, XML generally
does not contain formatting
information
- A common expectation is that
you can view XML in a browser
like HTML, setting fonts and
colours and layout
- Except for XHTML, raw XML can
only be directly displayed as
a tree of tags
- Apache Xerces-J parser comes
with a simple tree viewer in
Java
- Otherwise XML needs
either extra formatting
information, or to be
converted to a viewable
format
- By default, IE5 uses an
inbuilt XSL stylesheet (see later)
to convert XML to HTML for
display
- Allows tags to be expanded
or collapsed by clicking,
like an outliner
- IE5 and the Mozilla preview
of Communicator 6 allow CSS
(Cascading Style Sheets)
to be applied to XML
- Colours and fonts can be set,
and tags can be treated like
HTML paragraphs, list items,
or other layout
- XML structure needs to closely
match desired visual layout
- The CSS stylesheet can be
specified in an XML document
using an xml-stylesheet processing instruction:
<?xml-stylesheet href="style.css"
type="text/css"?>
- XML editors can generally be
divided into low-end tree
editors and high-end document
processors
- Low-end editors generally
display XML as a tree of tags
- When you click on a tag,
you can see all of its content
and edit it
- High-end editors may also
present a tag tree, but
make the XML look like a
word-processor document
- Visual formatting of tags can
be controlled, and the visually
formatted text can be edited
directly
- Important to be able to
see as much as possible
in one view, as humans
do parallel processing
- Free low-end editors are
good for getting into XML,
but tiresome for everyday use
- Low-end editors suitable when
XML is used for forms
- Microsoft's XML Notepad is a good (Windows-only)
low-end editor which
does not require a DTD
- IBM's Xeena is a good DTD-driven
low-end editor written in
Java (used for this seminar)
- Arbortext's Adept and SoftQuad's XMetal are good high-end editors,
with SGML backgrounds
- There are many other editors,
too numerous to list here,
that are equally worth a look
- Get an XML parser - there is a
good choice
- First parsers written in Java
(which supports Unicode), but
C/C++ parsers now available
- Sun's XML parser will be in
future JDKs; likely to be most
widely used Java XML parser
- Apache's Xerces parser
(Java or C++, donated by IBM) is a good
server-side choice
- James Clark's expat parser (written in C) is used by Perl,
Python, and PHP. Will be used
in Communicator 6
- Microsoft has included XML
parsers in IE4 and IE5,
and these are COM components
which other desktop applications
can use
- Oracle XML parser
(Java, C, C++, PL/SQL) should
be popular in Oracle shops,
but not tied to DB
- All of these parsers are free
- SAX - Simple API for XML
- SAX converts an XML document
into an event stream
(open tag, text, close tag)
- Small memory footprint, good
for large documents, cannot
go backwards
- Essentially a lexical analyser
for XML
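
The event stream can be seen with a few lines of code. This is a sketch using Python's standard-library `xml.sax` module, not one of the parsers listed above; the class name `EventLogger` is hypothetical.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event as an (event, value) pair."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("open", name))

    def characters(self, content):
        # A parser may deliver long text in several chunks;
        # this short sample arrives in one piece.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("close", name))


handler = EventLogger()
xml.sax.parseString(b"<account><balance>1234.56</balance></account>", handler)

for event in handler.events:
    print(event)
```

The handler only ever sees the current event, which is why memory use stays small and why the application must keep any state it needs later.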
- DOM - Document Object Model
- DOM represents an XML document
as a fully navigable tree
of nodes (tags, attributes,
text)
- Uses more memory than SAX,
but editable
- Supported by both IE and Communicator
- Not XML nor HTML specific,
DOM API can be used by any
application to make structured
data navigable and/or editable
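
By contrast, a DOM tree can be walked in any direction and edited in place. A minimal sketch with Python's standard-library `xml.dom.minidom`; the new balance value is made up for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<account><balance>1234.56</balance></account>")

# Navigate down to the balance element, then back up to its parent.
balance = doc.getElementsByTagName("balance")[0]
parent_name = balance.parentNode.tagName

# Edit the tree in memory: change the text and add an attribute.
balance.firstChild.data = "2000.00"
doc.documentElement.setAttribute("accountNumber", "AC11223344")

result = doc.documentElement.toxml()
print(parent_name)
print(result)
```

The whole document is held in memory, which is the price paid for being able to navigate and edit it.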
- Parsers only parse,
they do not write applications
- The code required to deal with
complicated XML documents
may still require significant
development time
- XSL - eXtensible Stylesheet
Language
- XSLT - scripting language
for XML transformations, written in XML
- Converts XML to XML, HTML,
or plain text
- Recognises patterns in XML
document structure, and
replaces them as appropriate
- The default XSL script can be
specified in an XML document
using an xml-stylesheet processing instruction:
<?xml-stylesheet href="transform.xsl"
type="text/xml"?>
- Can add, delete, modify,
and sort content of XML
documents
- Can call external scripts
(Java, JavaScript, NetRexx,
and others)
- Available in Java or C++,
some versions compile scripts
for extra speed
- Apache's Xalan (was LotusXSL)
in Java and C/C++,
James Clark's XT in Java,
MS implementation in COM,
Oracle implementation in
Java, C/C++, and PL/SQL,
plus others
- Out-of-date versions in IE4
and IE5, allow XML to be
converted to HTML on the fly
by browser. Look for updates
when spec is finalised (soon)
- Perl, Python, PHP, and ASP/JSP
are alternatives, but not as
well focussed on
transformation
- Apache's Xalan was used to
produce these slides
- XSL spec includes 'formatting
objects' which provide
desktop publishing functionality,
but not many implementations yet
- Apache FOP for Acrobat,
PassiveTex, TexML
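
As noted above, a general-purpose language is an alternative to XSLT for simple conversions. The sketch below is not XSLT; it does the same kind of XML-to-HTML rewriting by hand in Python, on a trimmed version of the earlier `my-document` sample. The mapping of titles to `h1` and `h2` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed version of the my-document example (section text omitted).
SOURCE = """<my-document>
  <my-title>Document Title</my-title>
  <my-section>
    <my-title>Section Title</my-title>
  </my-section>
</my-document>"""


def to_html(doc):
    """Map the document title to <h1> and each section title to <h2>."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = doc.findtext("my-title").strip()
    for section in doc.findall("my-section"):
        ET.SubElement(body, "h2").text = section.findtext("my-title").strip()
    return ET.tostring(html, encoding="unicode")


html_text = to_html(ET.fromstring(SOURCE))
print(html_text)
```

Every new pattern needs another hand-written branch here, which is the work an XSLT template rule is designed to take over.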
- XML supports much richer
document linking than HTML
- HTML introduced the
world to hypertext links
- The only HTML linking tag
is <a>
- An HTML link connects one
location in a document
to an HTML document or a
named location in an
HTML document
- HTML links are one-way only
- XPath allows location in
a document to be specified
by structure, not just name
doc.xml#chapter[3]/section[2]/paragraph[4]
- Reasonably robust even
as document changes
- XPointer extends XPath
to allow ranges and not just locations
to be specified
- XLinks can function like
normal HTML links, but
any XML tag can be used
as an XLink
- An XLink can link to
multiple locations,
with the user given
the choice
- An XLink does not
have to be a link
- XLinks can specify an
external document (or
section thereof) to
use in its place
- XLink allows out-of-line links
- Out-of-line links do not
refer directly to another
document, but to an entry
in a link database
- If a location changes,
it only has to be changed
in one place in the link
database
- Fixes a major problem
in keeping HTML-style links
consistent in richly
interconnected sites
- Out-of-line links
also allow many-to-many
hyperlinking
- XPath is heavily used by XSLT
- Stand-alone versions for use
with the DOM are appearing
- XPointer/XLink specifications
have gone through many changes,
so these are not yet
widely implemented
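
Addressing by structure, as in the `chapter[3]/section[2]/paragraph[4]` example, can be tried with the limited XPath subset in Python's standard-library ElementTree. This is a sketch only; the generated book and the helper `build_book` are hypothetical test data.

```python
import xml.etree.ElementTree as ET


def build_book(chapters=3, sections=2, paragraphs=4):
    """Build a small book whose paragraphs record their own position."""
    book = ET.Element("book")
    for c in range(1, chapters + 1):
        chapter = ET.SubElement(book, "chapter")
        for s in range(1, sections + 1):
            section = ET.SubElement(chapter, "section")
            for p in range(1, paragraphs + 1):
                ET.SubElement(section, "paragraph").text = f"c{c} s{s} p{p}"
    return book


book = build_book()

# Positions count from 1, as in XPath.
target = book.find("chapter[3]/section[2]/paragraph[4]")
print(target.text)
```

No element needed an ID or a name anchor; the path alone identifies the paragraph.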
- XML has a tree structure
which does not fit
directly into relational
database tables
- Tag order is important,
whereas tables can be
re-ordered with impunity
- Nonetheless, Oracle and
IBM (DB2) are providing
software to map XML
into tables, so these
DBs can store/retrieve
XML
- Other relational DB
vendors can be expected
to do the same
- Need special query
support, as the XML
document may be
difficult to query
from the tables
- Object databases
and hierarchical databases
are a better fit for XML
- Software AG's Tamino is a high-end XML data
server based on a hierarchical
database, and does all
input and output using XML
- XML queries are supported,
and all informational messages
from the DB are in XML
- XML Query language is
still under development
- Originally based on SQL,
but changing to suit XML
needs more specifically
- When all data is in XML,
applications can be
ignorant of whether they
are using a DB, a file,
or an Internet connection
- More flexibility,
easier to maintain
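
The mismatch between a tree and relational tables can be made concrete. The sketch below is one simple mapping, not the one used by Oracle or DB2: each element becomes a row that records its parent and its position among its siblings, so tag order survives. It uses Python's standard-library `sqlite3`; the `node` table layout is an assumption, and attributes are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

ACCOUNT_XML = """<account accountNumber="AC11223344">
  <customer><name firstName="John" lastName="Citizen"/></customer>
  <balance>1234.56</balance>
</account>"""


def shred(conn, elem, parent_id=None, position=0):
    """Store one element as a row, then recurse into its children."""
    cur = conn.execute(
        "INSERT INTO node (parent_id, position, tag, text) VALUES (?, ?, ?, ?)",
        (parent_id, position, elem.tag, (elem.text or "").strip()),
    )
    node_id = cur.lastrowid
    for pos, child in enumerate(elem):
        shred(conn, child, node_id, pos)
    return node_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER,"
    " position INTEGER, tag TEXT, text TEXT)"
)
root_id = shred(conn, ET.fromstring(ACCOUNT_XML))

# Children of the root, in document order.
children = conn.execute(
    "SELECT tag FROM node WHERE parent_id = ? ORDER BY position", (root_id,)
).fetchall()
balance = conn.execute("SELECT text FROM node WHERE tag = 'balance'").fetchone()[0]
print(children, balance)
```

Even this small case needs an explicit `position` column and a parent join to rebuild the document, which is why dedicated XML query support matters.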
- Sun will shortly be
releasing an XML compiler
- Given an XML Schema,
creates matching
Java classes to store
documents/data
- Loading, storing,
and conversion to/from XML
can be hidden from
Java application
- Similar should appear
for C++ and others
eventually
- Easier to view and
understand complex
data structures in
one XML document
than in many Java/C++
files
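
What such a generated class might do can be sketched by hand. The example below is not the output of Sun's tool; it is a hand-written Python class for the `name` element from the DTD example, with hypothetical method names `from_xml` and `to_xml`.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Name:
    """In-memory form of <name firstName="..." lastName="..."/>."""

    last_name: str
    first_name: str = ""  # optional, as in the DTD

    @classmethod
    def from_xml(cls, elem):
        return cls(
            last_name=elem.get("lastName"),
            first_name=elem.get("firstName", ""),
        )

    def to_xml(self):
        elem = ET.Element("name")
        if self.first_name:
            elem.set("firstName", self.first_name)
        elem.set("lastName", self.last_name)
        return elem


name = Name.from_xml(ET.fromstring('<name firstName="John" lastName="Citizen"/>'))
round_trip = ET.tostring(name.to_xml(), encoding="unicode")
print(name)
print(round_trip)
```

The application works with `name.last_name` and never touches the XML directly; loading and storing stay inside the class.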
- XML can be used to send
portable messages between
computers
- Objects stored in XML
can be used to send
command requests with
parameters, with results
returned as XML objects
- Alternatives, if less
robust, are XML-RPC (Remote Procedure Calls) and SOAP (Simple Object Access
Protocol), which use
XML over HTTP to run
specific applications
on remote computers and
return the results
- Quick to set up, work
through firewalls, useful
for short-term implementations
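
An XML-RPC request and its reply are ordinary XML documents. The sketch below uses Python's standard-library `xmlrpc.client` to build and decode both without any network connection; the method name `getBalance` and its parameter are hypothetical.

```python
import xmlrpc.client

# Encode a call to a (hypothetical) remote method as an XML request.
request_xml = xmlrpc.client.dumps(("AC11223344",), methodname="getBalance")

# The receiving side decodes the request back into parameters and a method name.
params, method = xmlrpc.client.loads(request_xml)

# Encode the reply as an XML response, then decode it on the caller's side.
response_xml = xmlrpc.client.dumps((1234.56,), methodresponse=True)
(result,), _ = xmlrpc.client.loads(response_xml)

print(request_xml)
print(method, params, result)
```

In a real deployment the request would travel as the body of an HTTP POST, which is why these calls pass through firewalls so easily.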
- XML can allow business
information to be shared,
but only if all parties
are using the same tag
structure
- Expect to see Schemas
appearing for most
vertical markets
- XML repositories for standard
DTDs and Schemas are being set
up for this purpose
- Microsoft plans to
use BizTalk to provide
schema content for a
BizTalk Server product
- Schemas for BizTalk
must follow BizTalk-specific
rules
- OASIS is seen as
vendor-neutral
- OASIS hosts UN's ebXML schema for business-to-business
transactions
- ebXML is part of a UN
initiative to globalise
electronic commerce
- Industry schemas may not
necessarily be located
in a repository
- Example is FpML (Financial Products Markup Language)
- Beware that some schemas
have restrictive usage
agreements
- Open, industry-wide or
multi-vendor schemas
are likely to be the
safest choices
- Keep XML documents as simple
as possible; try not to
mix formatting with data
- The XML source for this
seminar is far more readable
than the HTML source
- Avoid mixed content,
if possible, to
simplify processing
- Write documents that
can be usefully parsed
using SAX
- If it works with SAX,
it will work with DOM,
but maybe not vice-versa
- Use existing DTDs/Schemas
wherever possible
- Using just a subset of the
tags is fine
- If you cannot wait for a
standard, or the standard is
too complicated for your
internal needs, use XSLT or
similar to transform your
internal format to the
standard and back again
- Can save your enterprise
from having to change
documents as a standard
evolves
- When generating text files,
for example log files,
use XML unless there is a
reason not to
- The extra effort of adding
tags is minimal
- Large files
or large numbers of files
are easier to reliably
search, index, format, and/or
process when in XML rather
than plain text
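
The log-file tip can be illustrated with a short sketch in Python (standard-library ElementTree). The `log` and `entry` tag names, the attributes, and the sample messages are all assumptions for this example.

```python
import xml.etree.ElementTree as ET

log = ET.Element("log")


def add_entry(level, message, time):
    """Append one log entry; the writer escapes special characters."""
    entry = ET.SubElement(log, "entry", level=level, time=time)
    entry.text = message
    return entry


add_entry("INFO", "Service started", "2000-03-21T09:00:00")
add_entry("ERROR", "Disk quota exceeded for <tmp>", "2000-03-21T09:05:00")

log_text = ET.tostring(log, encoding="unicode")

# Searching is a structural query, not a fragile text match.
errors = [
    entry.text
    for entry in ET.fromstring(log_text).findall("entry[@level='ERROR']")
]
print(log_text)
print(errors)
```

A plain-text search for "ERROR" would also match messages that merely mention the word; the attribute query does not.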
-
XML.com
-
http://www.XML.com/
A good place to start learning about XML.
The Annotated XML Spec is noteworthy,
but not compulsory reading.
-
xmlhack.com
-
http://www.xmlhack.com/
News and articles for XML developers.
-
The XML Cover Pages
-
http://www.oasis-open.org/cover/sgml-xml.html
Robin Cover's encyclopædic reference
for XML standards, articles,
and tools.
-
XML @ W3C
-
http://www.w3.org/XML/
The W3C's XML site. This is where
the standards are. Some of them
can be heavy reading.
-
XML Namespaces
-
http://www.w3.org/TR/REC-xml-names
-
XML Schemas #0: Primer
-
http://www.w3.org/TR/xmlschema-0/
-
XML Schemas #1: Structures
-
http://www.w3.org/TR/xmlschema-1/
-
XML Schemas #2: Datatypes
-
http://www.w3.org/TR/xmlschema-2/
-
XHTML
-
http://www.w3.org/TR/xhtml1
XHTML is the modular XML
replacement for HTML 4.
-
SAX
-
http://www.megginson.com/SAX/index.html
Simple API for XML.
Converts XML into a stream
of events.
-
DOM @ W3C
-
http://www.w3.org/DOM/
Document Object Model.
Converts XML into a tree of
nodes.
-
XSL @ W3C
-
http://www.w3.org/Style/XSL
Extensible Stylesheet
Language.
-
XSLT Specification
-
http://www.w3.org/TR/xslt
XSL Transformations.
-
XSL Specification
-
http://www.w3.org/TR/xsl/
XSL Formatting Objects.
-
XPath Specification
-
http://www.w3.org/TR/xpath
How to specify a location
in an XML document.
-
CSS @ W3C
-
http://www.w3.org/Style/CSS
Cascading Style Sheets.
-
XPointer Specification
-
http://www.w3.org/TR/xptr
How to specify a location
or range in an XML document.
-
XLink Specification
-
http://www.w3.org/TR/xlink/
How to do hyperlinking in XML.
-
XML Query Requirements
-
http://www.w3.org/TR/xmlquery-req
The latest progress on the
XML Query specification.
-
IE5
-
http://www.microsoft.com/windows/ie/default.htm
Displays XML as a tree by
default, and can display
XML using CSS or XSLT.
-
Mozilla (Communicator 6 preview)
-
http://www.mozilla.org/
Can display XML using CSS.
-
Apache XML Tools
-
http://xml.apache.org/
Xerces parser, Xalan XSLT,
FOP XSL Formatting Objects,
and Cocoon servlet for
transforming server-side XML
to client-side HTML or XML.
Xerces and Xalan available either
as Java or C++.
-
James Clark's XML Resources
-
http://www.jclark.com/xml/
XP parser (Java), expat parser (C), XT (XSLT in Java), and XML test cases.
Self-funded, James Clark produces some of the best XML & SGML software
available today, and makes it available for free.
-
Java Technology & XML @ Sun
-
http://java.sun.com/xml/
JAXP (Java API for XML Parsing) is Sun's new standard API
for pluggable Java XML parsers, based on SAX and DOM.
-
XML @ IBM developerWorks
-
http://www.ibm.com/developer/xml/
News, articles, and software.
-
IBM alphaWorks
-
http://www.alphaWorks.ibm.com/
IBM's alphaWorks has an
amazing range of XML (and
other) tools to try for free;
some of them very useful.
-
XML & XSL @ Microsoft
-
http://msdn.microsoft.com/xml/default.asp
Check out Microsoft's XML parser and XSLT engine here.
-
Oracle XML Developer's Kit
-
http://technet.oracle.com/tech/xml/
Oracle XML parser and XSLT engine.
Java, C/C++, and PL/SQL.
-
Tamino
-
http://www.softwareag.com/tamino/
XML server based on a
hierarchical database.
Everything going in or out
is XML.
-
Bluestone XML Suite
-
http://www.bluestone.com/SaISAPI.dll/SaServletEngine.class/products/dynamic.jsp?pid=60
Well-known XML server.
-
eXcelon B2B Solutions
-
http://www.objectdesign.com/products/index.html
Well-known XML server.
-
DataChannel Server
-
http://www.datachannel.com/
DataChannel's Chief Scientist,
Norbert Mikula, wrote one of
the first XML parsers (as
his PhD project).
-
Web Distributed Data eXchange
-
http://www.wddx.org/
A system for exchanging XML
documents via the Web.
-
Arbortext Adept
-
http://www.arbortext.com/Products/ADEPT_Series/adept_series.html
A good, high-end XML editor.
-
SoftQuad XMetal
-
http://www.xmetal.com/
A good, high-end XML editor.
-
IBM Xeena
-
http://www.alphaworks.ibm.com/tech/xeena
A free XML editor in Java.
Requires a DTD, and makes
sure that your document fits
the DTD.
-
Microsoft XML Notepad
-
http://msdn.microsoft.com/xml/notepad/intro.asp
A free XML editor for Windows.
Allows you to write freeform
XML without a DTD.
-
XML.org
-
http://www.XML.org/
-
BizTalk.com
-
http://www.BizTalk.com/
-
schema.net
-
http://www.schema.net/
-
IBM XML specification for business-to-business transactions
-
http://www-4.ibm.com/software/developer/library/tpaml.html
-
UN's ebXML specification for
global e-commerce
-
http://www.ebXML.org/
-
Financial Products Markup Language specification
-
http://www.FpML.org/
-
XML for the Absolute Beginner
-
http://www.javaworld.com/jw-04-1999/jw-04-xml.html?032499txt
JavaWorld article.
-
Schema Repositories: What's at Stake?
-
http://www.xml.com/pub/2000/01/26/feature/index.html
An interesting discussion
of the politics of schema
repositories.
-
Inside SOAP
-
http://xml.com/pub/2000/02/09/feature/index.html?wwwrrr_20000209.txt
A good place to start if you
are interested in SOAP or
XML-RPC for communicating
with remote applications via
XML.
-
XML JavaBeans: Part 1
-
http://www.javaworld.com/javaworld/jw-02-1999/jw-02-beans.html
JavaWorld article.
-
XML JavaBeans: Part 2
-
http://www.javaworld.com/javaworld/jw-03-1999/jw-03-beans.html
-
XML JavaBeans: Part 3
-
http://www.javaworld.com/javaworld/jw-07-1999/jw-07-beans.html
-
Design Patterns in XML Applications: Part 1
-
http://xml.com/pub/2000/01/19/feature/index.html?wwwrrr_20000119.txt
If you are familiar with
design patterns, an article
worth reading. If not, worth
reading anyway before you
design an XML DTD or Schema.
-
Design Patterns in XML Applications: Part 2
-
http://xml.com/pub/2000/02/16/feature/index.html?wwwrrr_20000216.txt
-
xml-dev Archive (new)
-
http://www.egroups.com/group/xml-dev/
The "xml-dev" mailing list
is the #1 list for XML
developers.
-
xml-dev Archive (old)
-
http://www.lists.ic.ac.uk/hypermail/xml-dev/
-
XSL-List Archive
-
http://www.mulberrytech.com/xsl/xsl-list
The "XSL-List" mailing list
is the #1 list for XSL
developers and users.
-
Coins (JavaBeans serialised as XML)
-
http://www.jxml.com/coins/
Software for converting data
between Java classes and
XML documents.
-
CORBA & XML Resource Page
-
http://www.omg.org/xml/
-
Unicode
-
http://www.unicode.org/
The Unicode system for
encoding the world's
languages. Supported natively
by XML and Java.
-
Source files for this seminar
-
../README.html