Chapter 15. What is XML, and Why Should I Care?

Tod A. Olson

Programmer/Analyst
The University of Chicago
Integrated Library Systems

Table of Contents

What is XML?
A descriptive markup language
Structure in XML: elements and attributes
Closing tags required: making structure explicit
Entity References
XML declarations
Processing instructions
Comments
CDATA sections
Well-formed documents
XML vocabularies: schemas, namespaces, and validation
Advantages of XML
Companion technologies
Why should I care?
Examples of XML-based standards
Summary
Bibliography

Abstract

Since its introduction in 1998, XML has become a well-established technology, having gained widespread adoption for the storing and exchanging of information, including in the library community. This essay will introduce the reader to XML, explain its syntax and how it is used as a means of encoding information, and follow up with why it is important to the library community.

What is XML?

The eXtensible Markup Language (XML) provides a general-purpose syntax for encoding information. XML is designed to be easily processed by machine and yet human-readable, while addressing practical concerns of building Web-based applications and drawing on experience with the earlier Standard Generalized Markup Language (SGML). XML is a Recommendation of the World Wide Web Consortium (W3C); the 1.0 Recommendation is available at http://www.w3.org/TR/2004/REC-xml-20040204/. The W3C develops the technologies used by the Web. More information about the W3C and its activities is at http://www.w3.org/.

A descriptive markup language

XML is a descriptive markup language. It provides a syntax that lets us add descriptive tags to an electronic text document. Take a number like 1966. On its own, it could represent a number of things: possibly a year, a street address, or a mortgage payment. When we add descriptive tags to the number, its role becomes clear:

<date>1966</date>

We can treat <date>1966</date> appropriately in our processing, whether indexing, exchanging data with another system or formatting for display to a person. In XML terms, date is an example of an element.

While XML defines a syntax for marking up or tagging information in documents, it does not define a list of allowed tags. Instead, a document author is free to invent the tags that are needed. There are, however, mechanisms for declaring exactly what tags may appear in a document. Declaring what tags may appear defines a markup language. For this reason, XML is a metalanguage; it is a language for defining markup languages.

For more on text markup, see Coombs, Renear, and DeRose (1987).

Structure in XML: elements and attributes

Elements provide the structure of information in XML; they contain and describe information. An element is marked by a start tag and an end tag. The start and end tags look like <elt> and </elt>, where elt is the element name. Any text between them is the element content. In the above example, "date" is an element, and the tags <date> and </date> mark the text "1966" as a "date".

Elements can contain text characters and other elements. For example:

<para>An <card>Ace</card> beats a <card>King</card>.</para>

or

<originInfo>
  <place>New York</place>
  <publisher>Bantam</publisher>
  <dateIssued>1971</dateIssued>
  <copyrightDate>1968</copyrightDate>
</originInfo>

The nesting of elements allows a hierarchical document structure. Any information that can be modeled as a hierarchy can be represented in XML.
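
As a minimal sketch of how software sees this hierarchy, the following Python fragment parses the originInfo example and walks its child elements. The choice of Python and of its standard xml.etree.ElementTree module is an assumption made for illustration; any XML parser would expose the same structure.

```python
import xml.etree.ElementTree as ET

# The originInfo example from the text, as a string.
ORIGIN_INFO = """<originInfo>
  <place>New York</place>
  <publisher>Bantam</publisher>
  <dateIssued>1971</dateIssued>
  <copyrightDate>1968</copyrightDate>
</originInfo>"""

root = ET.fromstring(ORIGIN_INFO)

# Each child element becomes a (name, content) pair.
children = [(child.tag, child.text) for child in root]

for name, content in children:
    print(f"{root.tag} > {name}: {content}")
```

The parser hands back the parent element with its children in document order, so the program never has to look at angle brackets itself.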

Element names are case-sensitive. For example, <place> and <Place> are start tags for two different elements, and must be closed by </place> and </Place> respectively.

Elements may have attributes, which may be thought of as metadata for the element. An attribute appears in the opening tag of an element as a name, an equals sign, and a value in quotes. Whitespace around the equals sign is allowed. Either single or double quotes may be used. A value begins with the quote character and continues until the next occurrence of the same quote character. For example, in the Metadata Object Description Schema (MODS) title element, the type attribute can be used to specify what kind of title is being tagged. Here the uniform title is distinguished from the "default" title on the title page of a piano score:

<titleInfo type="uniform">
    <title>Waltzes, piano</title>
    <partNumber>op. 34</partNumber>
</titleInfo>
<titleInfo>
    <title>Trois valses brillantes</title>
    <subTitle>pour le piano : op. 34, no. 1-3</subTitle>
</titleInfo>

Similarly, in MODS a personal name can be distinguished from a corporate name, and the parts of a name can also be distinguished:

<name type="personal">
  <namePart type="family">Chopin</namePart>
  <namePart type="given">Frédéric</namePart>
  <namePart type="date">1810-1849</namePart>
</name>

Elements may have more than one attribute, but the attributes must all have different names.

XML requires that there be only one root element in a document, one top-level element that contains all of the document content, including all other elements. That is, elements nest in a hierarchy, and XML wants a document to have only one hierarchy. Looking at the two sample XML fragments immediately above, the first fragment could not stand alone as a legal XML document, because there are two titleInfo elements at the top level, with no element containing them both. The second fragment is a legal XML document, because the name element acts as a single root element.
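
The single-root rule and the attribute syntax can both be observed with a parser. The sketch below, again assuming Python's standard xml.etree.ElementTree module, feeds the two fragments above to the parser; has_single_root is a hypothetical helper name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

TWO_ROOTS = """<titleInfo type="uniform">
    <title>Waltzes, piano</title>
    <partNumber>op. 34</partNumber>
</titleInfo>
<titleInfo>
    <title>Trois valses brillantes</title>
    <subTitle>pour le piano : op. 34, no. 1-3</subTitle>
</titleInfo>"""

ONE_ROOT = """<name type="personal">
  <namePart type="family">Chopin</namePart>
  <namePart type="given">Frédéric</namePart>
  <namePart type="date">1810-1849</namePart>
</name>"""

def has_single_root(text):
    """Return True if the text parses as one complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

name = ET.fromstring(ONE_ROOT)
# Attributes are read by name; here each namePart is keyed by its type.
parts = {part.get("type"): part.text for part in name}
```

Under these assumptions, has_single_root(TWO_ROOTS) is false and has_single_root(ONE_ROOT) is true, and parts maps "family", "given", and "date" to the corresponding name parts.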

Some elements have no content, but are meaningful just by existing at a certain place in a document. If an element has no content, it may use the empty tag syntax: <elt />. For example, in the Text Encoding Initiative (TEI) vocabulary, used for scholarly markup of literary texts, often of older printed materials, <lb/> and <pb/> show where a line break or page break occurred in an original print edition of a document, just as <br/> shows a line break in XHTML. (Space after the element name is permitted, but not required; <br/> and <br /> are equivalent.)

Empty elements are often used with attributes. In TEI, the ptr element can be used to make simple cross-references within a document. ptr is always empty, but it has a target attribute which contains a unique identifier for the target of the cross-reference. In XHTML, the link element is empty, but uses attributes to describe and locate the document that is linked to. In the following example, the link attributes tell a Web browser where to find some external document (href), that this document is a stylesheet for the current XHTML document (rel), and that the stylesheet is written in CSS (type):

<link href='mystylesheet.css' rel='stylesheet' type='text/css' />

Closing tags required: making structure explicit

If the XML syntax seems similar to HTML, it is because HTML and XML share some common ancestry: HTML is an application of the Standard Generalized Markup Language (SGML), and XML began as a simplified version of SGML. Unlike HTML (and SGML), XML requires closing tags, and empty elements have a special syntax. This makes it easy to process XML documents, whether as files or data streams, without special knowledge of the particular vocabulary employed. Consider this example from HTML:

<ul>
<li>A list item
<ul>
<li>Another list item

Is the second list an entirely separate list from the first, or a sub-list? Without closing tags, the structure is implicit, and requires special knowledge of HTML to process. There is a similar issue with empty tags in HTML: a <br> tag is empty; it does not begin a new element, and there will never be a </br>. This is a special property of the br element in HTML. There is no syntactic clue to help us; we need special knowledge of br in order to understand the structure of any document that contains it. So we see that the document structure is not determined entirely through syntax, but depends on the semantics of the elements.

By requiring closing tags, XML forces the document author to be explicit. In XHTML, the author will write

<ul>
  <li>A list item</li>
</ul>
<ul>
  <li>An item in a separate list</li>
</ul>

or

<ul>
  <li>A list item
    <ul>
      <li>An item in a sub-list</li>
    </ul>
  </li>
</ul>

The document structure is unambiguous, even if you don't know the first thing about XHTML. Similarly, if <br/> appears in our XHTML document, we know exactly what its effect on the document structure is: it is an element with no content. No special knowledge of the br element is required in XHTML. The document structure in XML is determined unambiguously by the syntax, and is not dependent on the element semantics.

Entity References

An entity can be thought of as a piece of text that has a name. An entity reference allows us to refer to that text by name and use it in a document.

We have seen that some characters have special meaning in the XML syntax. For example, “<” always signals the beginning of an element tag. The character “<” is illegal in element content because it would complicate parsing: the software would have no simple way to determine whether the “<” starts an element or is just a character. For those cases where we need to use “<” as part of our data, XML provides a predefined entity reference, “&lt;”, where the name “lt” is mnemonic for less than. An entity reference begins with an ampersand and ends with a semicolon, with the entity name in between. The ampersand is therefore also a special character, and we must always use “&amp;” when an ampersand is part of the element content or an attribute value. There are only five such predefined entities, defined for the characters that signal elements, attributes, and entities. They appear below:

Character   Name                          Entity reference
<           less than                     &lt;
>           greater than                  &gt;
"           quote, or double quote        &quot;
'           apostrophe, or single quote   &apos;
&           ampersand                     &amp;

There are also character entities for referring to any character by its decimal or hexadecimal value. For the section sign, §, the character entity references would be &#167; and &#xA7;, respectively. This is useful when input mechanisms do not allow a character to be entered directly, or to be certain that the document can be deciphered by humans working in computing environments which do not handle Unicode well.
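
To make the round trip concrete, here is a minimal Python sketch, assuming the standard library's xml.sax.saxutils.escape function and xml.etree.ElementTree parser. It escapes the special characters in a piece of text, embeds the result in an element together with the two references for the section sign, and lets the parser turn the references back into characters. The note element and the sample text are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() replaces &, < and > with their predefined entity references.
raw = "AT&T says a < b"
escaped = escape(raw)

# A parser turns entity and character references back into characters.
elem = ET.fromstring(f"<note>{escaped} &#167; &#xA7;</note>")
```

Here escaped holds the text with &amp; and &lt; substituted, while elem.text holds the original characters again, followed by two section signs.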

Document Type Definitions (DTDs), discussed briefly below, allow the definition of custom entities above and beyond the predefined entities and character entities. Custom entities can be used to name text that is referred to repeatedly in a document, for instance to ensure that some string, such as a specific name or phrase, is always repeated exactly. The DocBook DTD, for example, defines a large number of entities that give mnemonic names to characters that are not always easy to enter from the keyboard, such as &copy;, which is replaced during processing with the copyright symbol, ©.

XML declarations

An XML declaration gives information about the XML version, and possibly the character encoding, used by the document. The following XML declaration asserts that the document conforms to XML 1.0, and that characters are encoded in UTF-8, the 8-bit Unicode encoding:

<?xml version="1.0" encoding="UTF-8"?>

(By default, XML documents use the Unicode character set, in particular, the UTF-8 encoding. This allows XML documents to carry information in many languages.)

The XML declaration allows us to be certain that a document really is XML, and not SGML or some other similar-looking format. According to the XML 1.0 Recommendation, § 2.8, an XML document should have an XML declaration, and if an XML declaration is present, it must be the first statement in the document.
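
Most XML libraries can write the declaration for us. As a small sketch, assuming Python 3.8 or later and its standard xml.etree.ElementTree module, the following serializes the date element from earlier in this essay with an XML declaration in front of it.

```python
import xml.etree.ElementTree as ET

date = ET.Element("date")
date.text = "1966"

# xml_declaration=True (Python 3.8 and later) asks for the declaration
# to be written ahead of the root element; the result is a bytes object.
serialized = ET.tostring(date, encoding="UTF-8", xml_declaration=True)
print(serialized.decode("UTF-8"))
```

This particular library writes the declaration with single quotes, which XML permits just as it does for attribute values.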

Processing instructions

Processing instructions allow instructions to specific XML systems to be embedded in XML documents. Processing instructions follow this pattern:

<?target instruction?>

where target is a name that a specific processing system will recognize, and instruction is a sequence of text that tells the processing system to take some action. Processing systems that do not recognize a target name will ignore that instruction. For example, the following XML fragment is meant to be processed by a PHP system:

<p>Today is <?php echo date("M d, Y")?>.</p>

PHP will recognize the target of the processing instruction, php, and follow the instructions: it will get the current date, format it, and place the result into the document where the processing instruction was. The output will be something like this:

<p>Today is Jan 23, 2006.</p>

Target names that begin with “XML”, in any combination of upper or lower case, are reserved. These target names may be used by the XML specification or for standardization. For example, the XML specification defines the target xml to be the XML declaration. As another example, xml-stylesheet is a standardized processing instruction which allows a stylesheet to be associated with an XML document. Here, a CSS stylesheet is associated with the document:

<?xml-stylesheet type="text/css" href="mystyle.css"?>

For details on xml-stylesheet, see http://www.w3.org/TR/xml-stylesheet/.
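
A processing system finds its instructions by looking at the target name. The sketch below assumes Python's standard xml.dom.minidom module and a small invented document; it collects the target and instruction text of each processing instruction that appears before the root element.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="mystyle.css"?>
<doc>text</doc>"""

dom = minidom.parseString(DOC)

# Collect (target, instruction) pairs for every processing instruction
# found at the top level of the document.
instructions = [
    (node.target, node.data)
    for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
```

Note that the instruction text is handed over as one uninterpreted string; it is up to the system that recognizes the target to make sense of it. The XML declaration on the first line is not reported as a processing instruction by this parser.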

Comments

Comments in XML documents mark some part of the text as not being part of the document's data or structure, to be ignored during processing. Comments are useful for a variety of reasons, such as recording commentary for people who may edit the document in the future, or for temporarily “turning off” a part of the document, perhaps for testing purposes. Comments begin with <!-- and end with -->, as follows:

<!-- This is a comment ... -->
                    
<!-- <para>... and this paragraph has been commented out.</para> -->

The string -- is not allowed within a comment. SGML uses this notation to structure comments in a certain way, and XML rejects that extra complexity. XML forbids the string within comments to preserve compatibility with SGML processors, where it can cause errors if the precise structure is not observed.

CDATA sections

Character data is any data in an XML document that is not part of the markup. Mostly, this refers to the text content of elements, but not the element tags themselves, entity references, comments, processing instructions, or the like. A CDATA section marks the enclosed block of text as character data in which there is no markup and all characters are treated literally; that is, there are no elements and no entities. Put another way, inside a CDATA section, the special meanings of “<” and “&” are turned off. A CDATA section begins with <![CDATA[ and ends with ]]>.

CDATA sections can be very useful when some of the text looks like markup but should not be interpreted as markup, or uses notations with “<” or “&”. In the XML source for this essay, many of the examples of XML markup are enclosed in CDATA sections. For example, the example in the section called “Processing instructions” is coded as follows:

<![CDATA[<p>Today is <?php echo date("M d, Y")?>.</p>]]>
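
The effect is easy to confirm with a parser. In this sketch, which assumes Python's standard xml.etree.ElementTree module and wraps the CDATA section in an invented example element, the parser reports no child elements at all; everything inside the section arrives as plain text.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<example><![CDATA[<p>Today is <?php echo date("M d, Y")?>.</p>]]>'
    "</example>"
)

example = ET.fromstring(SOURCE)

# The would-be markup inside the CDATA section is ordinary text content.
literal = example.text
child_count = len(example)
```

After parsing, child_count is zero and literal contains the p tags and the processing instruction as literal characters.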

Well-formed documents

A document is well-formed if the basic syntax and nesting rules are followed. The most important of these rules are:

  1. An XML document has a single root element.

  2. Elements must be closed; elements must be marked with a start tag and a matching end tag, or by an empty tag.

  3. Elements nest properly. A child element must be entirely contained within its parent; partial overlap with other elements is not allowed.

  4. An element may not have multiple attributes of the same name.

  5. Attribute values must be quoted.

  6. “<” and “&” must not appear directly as characters in element content or attribute values, but must be represented by their entity references.

The rules for well-formedness promote clarity and unambiguous encoding of information in XML documents, and make it relatively easy for software to parse an XML document.

To explore this notion briefly, consider the fragment below, which violates rule 6 above by having an unencoded < in an element's character data:

<rule>a<b and b<c implies a<c</rule>

Any XML software will expect "<" to signal the beginning of a new element. "<b and" will look like part of an opening tag with an attribute, but the equals sign required in an attribute never appears. If "<" were allowed here, any software parsing this XML would have to decide from the characters that follow whether the "<" is the beginning of a tag or just more text. This would add unnecessary complexity to the software, violating the principle that XML parsers be straightforward to implement. This rule is similar to the requirement that tags be closed, which we examined earlier, in that it helps to ensure that the document structure is unambiguous and easy for software to determine.
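
Because every conforming parser must reject a document that is not well-formed, a parser doubles as a well-formedness checker. The following sketch assumes Python's standard xml.etree.ElementTree module; is_well_formed is a hypothetical helper, and the sample fragments other than the rule example are invented to exercise some of the rules listed above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a complete document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "unescaped <": "<rule>a<b and b<c implies a<c</rule>",
    "escaped <": "<rule>a&lt;b and b&lt;c implies a&lt;c</rule>",
    "overlapping elements": "<a><b></a></b>",
    "duplicate attribute": '<name type="x" type="y"/>',
    "unquoted attribute": "<name type=personal/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Only the fragment that uses &lt; is accepted; the parser stops at the first violation in each of the others.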

XML vocabularies: schemas, namespaces, and validation

While XML itself does not define a set of elements, it can be useful to formally define the vocabulary of allowed elements in a document, including each element's attributes and its content model (what the element may contain: other elements and/or text). Formal definition of the vocabulary helps to establish a shared understanding of the documents. Often a community will have a set of common needs, and having an agreed-upon, formally defined vocabulary helps meet these needs. This is especially true if exchanging the documents is important. If a vocabulary is formalized (the elements, attributes, and content models expressed) in a machine-readable way, the document can be validated, that is, it can be checked to ensure that it conforms to the formalized vocabulary.

Sometimes, we will also want to combine elements from different vocabularies in the same document. Usually, a vocabulary is designed for a specific problem domain, but sometimes the information we need to record cuts across more than one domain. For example, METS is a vocabulary that defines structural metadata for electronic objects in digital libraries, but relies on other vocabularies for descriptive metadata.

To motivate the need for formal XML vocabularies, consider the following fictitious invoice:

<?xml version="1.0" ?>
<invoice>
    <vendorId>1324123</vendorId>
    <account>6393487</account>
    <item>
        <descr>A Spectre is Haunting Texas, by Fritz Leiber (pbk)</descr>
        <price currency="us">6.95</price>
    </item>
</invoice>

Some software would probably act on this document to verify vendor IDs, check account numbers, deduct money from an account and send it to the vendor. So the producer of the document and the recipient must have a shared understanding of the elements and the document structure. If the producer makes a change to the elements or structure of the document, this has consequences for the recipient. Validation helps to ensure that all parties adhere to the shared understanding of the document.
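
To give a feel for what a validator does on the recipient's behalf, here is a deliberately hand-rolled check written in Python with the standard xml.etree.ElementTree module. It is not a schema language and not how validation is normally done; the rules it encodes (which children are required, and that price carries a currency attribute) are assumptions read off the sample invoice, and invoice_problems is a hypothetical function name. A real system would express these rules in a schema and let a validating parser apply them.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for the invoice vocabulary: required children of
# <invoice>, and required children of each <item>.
REQUIRED_IN_INVOICE = ("vendorId", "account", "item")
REQUIRED_IN_ITEM = ("descr", "price")

def invoice_problems(text):
    """Return a list of readable problems; an empty list means acceptable."""
    root = ET.fromstring(text)
    problems = []
    if root.tag != "invoice":
        problems.append(f"root element is {root.tag}, expected invoice")
    for name in REQUIRED_IN_INVOICE:
        if root.find(name) is None:
            problems.append(f"invoice is missing {name}")
    for item in root.findall("item"):
        for name in REQUIRED_IN_ITEM:
            if item.find(name) is None:
                problems.append(f"item is missing {name}")
        price = item.find("price")
        if price is not None and "currency" not in price.attrib:
            problems.append("price is missing the currency attribute")
    return problems

GOOD = """<invoice>
    <vendorId>1324123</vendorId>
    <account>6393487</account>
    <item>
        <descr>A Spectre is Haunting Texas, by Fritz Leiber (pbk)</descr>
        <price currency="us">6.95</price>
    </item>
</invoice>"""

# The producer renames <account> to <acct> without telling the recipient.
CHANGED = GOOD.replace("account>", "acct>")
```

The changed document is still well-formed, so a parser alone would not complain; only a check against the agreed vocabulary reveals that the account element has gone missing.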

The mechanisms for formalizing XML vocabularies are generally called schemas. There are three major XML schema languages that are important in the library world: Document Type Definition (DTD), XML Schema, and RELAX NG. DTDs are an older way of declaring elements and their attributes and content models, inherited from SGML. They are flexible, modular, and well understood, but use their own special syntax rather than XML syntax. XML Schema and RELAX NG use XML syntax for these declarations. All three are well supported by XML editors and processing tools, and are used to develop important XML vocabularies.

XML documents may specify which vocabulary they adhere to. A document adhering to the DocBook DTD, for example, may look like this:

<?xml version="1.0" ?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
                         "http://www.docbook.org/xml/4.4/docbookx.dtd"[
<!ENTITY % guimenu.module "IGNORE">
]>
<chapter>
  <title>What is XML and Why Should I Care?</title>
  <abstract>
     <para>Since its introduction in 1998, XML has become...</para>
  </abstract>
  <section>
    <title>What is XML?</title>
    <para>The eXtensible Markup Language (XML) provides a general-purpose syntax for encoding information...</para>
  </section>
</chapter>

This document begins with the XML declaration. Next is the document type declaration which specifies chapter as the root element, gives the public identifying string for the DocBook version 4.4 DTD, and shows where to find the DTD so that the document may be checked for validity. DTDs can be written to be very modular, with components that can be turned on or off, new components added, or existing components modified. Parameter entities are used to customize DTDs. In this example, the parameter entity guimenu.module is used to disable the DocBook module that defines markup for writing about GUI menus. There are many such DocBook modules for marking up specific textual features that may be enabled, disabled, or modified using parameter entities. New elements and attributes may be defined and integrated into the content model. This should hint at the power and flexibility of expression that can be retained even when the vocabulary has been formalized. DTDs use a syntax that does not resemble the rest of XML, but is retained from SGML.

XML Schema and RELAX NG have different modularity and extensibility mechanisms, which will not be discussed here. They do not require a DOCTYPE declaration, but use a different mechanism, namespaces, for identifying their schema, as shown below. They do not allow the definition of entities, but do allow more sophisticated control over element content, such as specifying that an element may only contain numeric data.

Namespaces use a Uniform Resource Identifier (URI) to identify the vocabularies to which elements and attributes in a document belong. A namespace declaration is used to associate the namespace URI with a prefix, and that prefix can be applied to elements and attributes, marking them as belonging to the namespace. The syntax for the namespace declaration is "xmlns:", the prefix, "=", and the namespace URI in quotes, as in this template:

xmlns:prefix="URI"

A namespace declaration appears as an attribute to an element. The prefix is defined for use with the current element and any of its attributes or content. If the prefix is omitted from the namespace declaration

xmlns="URI"

the URI identifies the default namespace for the current element and all elements it contains. Any element that is not prefixed is assumed to be part of the namespace identified by the URI. Attributes are the exception: an unprefixed attribute is not placed in the default namespace. The URI for any vocabulary is typically defined as part of the formalization of that vocabulary.

For example, the two examples below are equivalent in that the mods element is identified as being part of the MODS v.3 namespace:

<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">…</mods:mods>
<mods xmlns ="http://www.loc.gov/mods/v3">…</mods>
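
A namespace-aware parser confirms the equivalence. In this sketch, which assumes Python's standard xml.etree.ElementTree module, both forms are parsed (written here as empty elements for brevity) and the parser reports the same expanded name for each.

```python
import xml.etree.ElementTree as ET

PREFIXED = '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3"/>'
DEFAULT = '<mods xmlns="http://www.loc.gov/mods/v3"/>'

# ElementTree reports names in the form {namespace-URI}local-name.
prefixed_tag = ET.fromstring(PREFIXED).tag
default_tag = ET.fromstring(DEFAULT).tag

print(prefixed_tag)
print(default_tag)
```

The prefix itself does not survive parsing; what identifies the element is the pair of namespace URI and local name.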

If two elements in different vocabularies have the same name, there is no conflict, because they will have prefixes associated with different URIs. For example, consider the following document:

<?xml version = "1.0" encoding = "UTF-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/" 
  xmlns:mods="http://www.loc.gov/mods/v3" 
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <mets:dmdSec ID="HZN5117992MODS">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo>
            <mods:title>Marche funèbre</mods:title>
            <mods:subTitle>tiré de la sonate</mods:subTitle>
          </mods:titleInfo>
          <mods:name type="personal">
            <mods:namePart>Chopin, Frédéric</mods:namePart>
            <mods:namePart type="date">1810-1849</mods:namePart>
          </mods:name>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:dmdSec ID="HZN5117992DC">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dc:title>Marche funèbre : tiré de la sonate</dc:title>
        <dc:creator>Chopin, Frédéric, 1810-1849</dc:creator>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="score" DMDID="HZN5117992MODS HZN5117992DC">
      <mets:div ORDER="1" ORDERLABEL="" LABEL="Cover with title" TYPE="page">...</mets:div>
      <mets:div ORDER="2" ORDERLABEL="2" LABEL="" TYPE="page">...</mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

This METS document is adapted from one used to represent a piano score as a digital object. The two mets:dmdSec elements contain descriptive metadata as defined by schemas external to METS, specifically MODS and Dublin Core. The mets:structMap element contains structural metadata for the digital score object, which is METS' specialty. The file section, with images of the score, is omitted for brevity.

Consider the namespace declarations in the mets:mets element:

xmlns:mets="http://www.loc.gov/METS/" 
xmlns:mods="http://www.loc.gov/mods/v3" 
xmlns:dc="http://purl.org/dc/elements/1.1/"

The prefix mets is bound to the URI http://www.loc.gov/METS/, the METS namespace. Similarly, the prefix mods is bound to the URI for the MODS namespace, and dc to the URI for the Dublin Core namespace. By using these namespace prefixes, each element is unambiguously identified as belonging to a specific namespace. Grouping elements by namespace helps software know how to process the elements. During validation, for example, the software can recognize all elements that will be found in a schema describing the namespace identified by http://www.loc.gov/METS/, the METS namespace. Namespaces also distinguish between elements of the same name defined in different vocabularies. MODS and Dublin Core both define elements named "title", but with different meanings. By declaring namespaces for MODS and Dublin Core, and using the namespace prefixes, mods:title and dc:title are clearly distinct.
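
The following sketch shows software telling the two title elements apart. It assumes Python's standard xml.etree.ElementTree module and uses an abridged version of the METS document above; the NS mapping is local to the program, and only its URIs need to match those declared in the document.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI bindings used in the search expressions below.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# An abridged version of the METS example.
DOC = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
  xmlns:mods="http://www.loc.gov/mods/v3"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <mets:dmdSec ID="HZN5117992MODS">
    <mets:mdWrap MDTYPE="MODS"><mets:xmlData>
      <mods:mods><mods:titleInfo>
        <mods:title>Marche funèbre</mods:title>
      </mods:titleInfo></mods:mods>
    </mets:xmlData></mets:mdWrap>
  </mets:dmdSec>
  <mets:dmdSec ID="HZN5117992DC">
    <mets:mdWrap MDTYPE="DC"><mets:xmlData>
      <dc:title>Marche funèbre : tiré de la sonate</dc:title>
    </mets:xmlData></mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""

root = ET.fromstring(DOC)

# The same local name, title, is selected separately in each namespace.
mods_titles = [e.text for e in root.iterfind(".//mods:title", NS)]
dc_titles = [e.text for e in root.iterfind(".//dc:title", NS)]
```

Each search returns exactly one title, and neither picks up the other vocabulary's element.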

We can even specify where to find a schema document. The XML Schema vocabulary, for example, defines an attribute, schemaLocation, that associates a namespace URI with a location where the schema definition can be found. For example, adding the following attributes to mets:mets would declare the XML Schema instance namespace and use that namespace to express where to find the MODS v.3 schema:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-0.xsd"

Be aware that DTDs predate namespaces, and are not namespace aware. If namespaces are important for your information, you will be dealing with XML Schema or RELAX NG, not with DTDs.

Advantages of XML

Naturally, proponents of XML like to point out the advantages of XML as a data format. This essay will touch only on a few of these advantages; more thorough discussions may easily be found in the materials listed at the end of this essay.

XML, as an information container, is system independent. XML was designed to be processed by a wide variety of systems, and is often used to communicate information between systems. In part this system independence comes from XML's being a text format, not a binary format, with a very simple, dependable syntax and structure. A simple text editor is all you need to write an XML document, though a modern Web browser will help check that XML documents are well-formed, and possibly that they are valid. Contrast that to the binary format of popular word processors, where the ability to read and write a document relies on software with an awareness of the binary format.

XML is extensible. Because its creators knew that they could not determine all possible uses of XML, XML does not define a specific tagset. Instead, it allows one to define the elements needed for a particular situation. Even formalized vocabularies can be designed to be customized.

Finally, XML separates semantics from presentation. This advantage of XML comes from its being a descriptive markup language. Consider the first example:

<date>1966</date>

By marking the role of a part of the document, we can index it to support searching by dates, or we can format it in some special way when printing or displaying it on a screen. This may seem typical when the XML document resembles a set of data fields, such as the invoice example. But XML also accommodates prose-like journal articles, where the benefits of driving both searching and presentation from the same markup tag are more apparent. The searching granularity is limited only by what we are willing to mark up in the documents; if desired, each document may be treated as a little database. And because presentation can be driven by the role of any component, we have tremendous flexibility in presentation. The date above, or a title, and so on, could be displayed in bold, in a special font, or just the same as any surrounding text, just by recognizing the date tags, and all of the documents in a collection could use the same presentation. The presentation can be tailored to the specific use, for example, using different fonts in print and online versions, or the presentation can be changed to match the latest look of your institution's website. All of this flexibility is available with no need to edit the underlying XML.

Companion technologies

XML has a number of companion technologies for manipulating and otherwise operating on XML documents.

There are standard application programming interfaces (APIs) for XML. Applications need to extract information from XML documents for use in a variety of ways, often for display to the user or in communication with other applications. The XML APIs give the programmer a standard way to extract this information. An application can load a module that implements the API, and the programmer can rely on that module to parse the XML documents without having to implement the XML parser directly.

One of the earliest XML APIs is the Simple API for XML (SAX). SAX uses an event-based model, viewing XML documents as a linear sequence of opening and closing tags, text, and the like. A SAX-based application registers special code to be invoked when the different types of XML events are encountered. SAX is fast and does not require the whole document to be read into memory, but it is up to the application to keep track of the document structure. SAX can be difficult to use for complex processing, especially if multiple passes over the document are required. SAX was developed for Java, but has been used as a model for XML parsing in other languages.
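
As a brief illustration of the event-based model, the sketch below uses Python's standard xml.sax module, one of the SAX adaptations for other languages. OutlineHandler is a hypothetical class written for this example; it records each element name along with its nesting depth, and it has to maintain that depth itself because SAX does not.

```python
import xml.sax

class OutlineHandler(xml.sax.ContentHandler):
    """Record each element name together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # the application tracks structure itself
        self.outline = []

    def startElement(self, name, attrs):
        # Called once for every start tag, in document order.
        self.outline.append((self.depth, name))
        self.depth += 1

    def endElement(self, name):
        # Called once for every end tag.
        self.depth -= 1

DOC = b"""<originInfo>
  <place>New York</place>
  <publisher>Bantam</publisher>
</originInfo>"""

handler = OutlineHandler()
xml.sax.parseString(DOC, handler)

for depth, name in handler.outline:
    print("  " * depth + name)
```

The handler never holds more than a counter and a list of names, which is why this style of processing suits documents too large to load whole.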

The Document Object Model (DOM) is a tree-based programming interface for XML defined by the W3C. The DOM stores the entire document in memory, represented as a hierarchy of nodes. There is a root document node, and under this is a hierarchy of nodes representing every element, attribute, text segment and the like. The DOM provides mechanisms to navigate this hierarchy. The DOM also allows for document node structures to be created or modified in memory and then output as XML. By representing the entire document structure in memory and providing mechanisms to navigate that structure, the DOM makes complex processing much easier than in an event-based model, but tends to demand more memory, which can be prohibitive for manipulating very large documents.
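
For comparison with the event-based style, here is a minimal DOM sketch. It assumes Python's standard xml.dom.minidom module, a lightweight implementation of the DOM interface, and a shortened originInfo element; it navigates to a text node, adds a new element in memory, and writes the result back out as XML.

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<originInfo><place>New York</place>"
    "<publisher>Bantam</publisher></originInfo>"
)
root = dom.documentElement

# Navigate: the text of <place> is a text node beneath the element node.
place = root.getElementsByTagName("place")[0]
place_text = place.firstChild.data

# Modify: build a new element in memory and attach it to the tree.
issued = dom.createElement("dateIssued")
issued.appendChild(dom.createTextNode("1971"))
root.appendChild(issued)

# Serialize the modified tree back to XML text.
output = root.toxml()
print(output)
```

Because the whole tree is available at once, the program can move freely between nodes and make changes in any order before serializing.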

XPath is a language for identifying parts of XML documents based on their elements, attributes, and location in the document structure. For example, we can use XPath to select the first author in a MODS bibliographic record, or all of the authors of any related items that are present in the record. XPath is most often used as a component of other XML technologies, such as XSLT.
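
The sketch below gives a taste of XPath expressions. It assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath, and a hypothetical MODS record with two names invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# A hypothetical MODS record with two names.
RECORD = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:name type="personal">
    <mods:namePart>Chopin, Frédéric</mods:namePart>
    <mods:namePart type="date">1810-1849</mods:namePart>
  </mods:name>
  <mods:name type="personal">
    <mods:namePart>Liszt, Franz</mods:namePart>
  </mods:name>
</mods:mods>"""

record = ET.fromstring(RECORD)

# mods:name[1] selects the first name element; positions start at 1.
first_author = record.find("mods:name[1]/mods:namePart", NS).text

# [@type='date'] keeps only nameParts whose type attribute is "date".
dates = [
    e.text for e in record.findall(".//mods:namePart[@type='date']", NS)
]
```

The first expression walks a path of element names with a position test, and the second filters on an attribute value, which are the two kinds of selection described above.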

Extensible Stylesheet Language Transformations (XSLT) allow us to transform XML documents into other XML vocabularies, HTML, or any other text representation. For example, we might transform a MODS record into MARCXML or Dublin Core for export into some other system, transform it into HTML to display to a user with a Web browser, or transform it into plain text for emailing to a user. An XSLT stylesheet is written as an XML document, and uses XPath to identify specific parts of the source document for specific transformation.

These technologies and others provide a common, system-independent framework for working with, manipulating, and repurposing XML.

Why should I care?

Libraries can use XML as a common information format to exchange data with other information providers, to arrange interoperability between our own information systems, and to present information to our users. There are already a number of XML-aware systems and XML-based standards used by libraries today. This is particularly true in digital libraries. Because XML has been widely adopted for Web-based applications, a large number of tools exist to assist in building XML-based systems. The library community can share XML-based tools and standards with other industries, leveraging their experiences, and contributing ours.

XML is important to libraries as a format for manipulating information. The ability to transform XML allows it to be reused in different contexts. XML-aware systems can be made to interoperate by arranging for them to exchange XML documents. The combination of reuse and interoperability means we can use XML to knit systems together, whether to integrate different systems in our users' information landscape, or to streamline our own business processes.

XML offers flexibility in exchanging and presenting information. A benefit of descriptive markup is that data elements are explicitly marked, and knowledge of those elements allows us to transform the data as needed for exchange and presentation. Some integrated library systems use XML internally to manipulate information and format it for presentation to the user. At least one OpenURL resolver uses XML to import and export subscription data.

Exposing XML interfaces to systems can enable interoperability between library applications. This strategy is well established in many industries. For example, businesses in a supply chain may use XML to exchange purchase orders and fulfillment information in real time. The NISO Circulation Interchange Protocol (NCIP) is an XML-based protocol for exchanging library circulation information. In pilot projects, some integrated library systems are allowing remote interlibrary loan (ILL) systems to charge books out directly, greatly streamlining ILL processing. This sort of interoperability recalls the way MARC provides interoperability between cataloging systems, allowing cooperative cataloging arrangements to develop. The difference is that XML is a more flexible information container that is recognized by many more systems, and XML is often exchanged in a real-time, transaction-based environment.

Digital libraries rely heavily on XML in their production streams. The digital objects and their metadata are often created and disseminated in XML, typically using specific XML vocabularies. The display of the objects is often based on transforming that XML. Digital library standards tend to favor XML for representing metadata or digital objects themselves. This is due in part to the descriptive markup and system-independent aspects of XML, and the ability to easily process the XML. An understanding of XML is critical for the planning and implementation of digital library projects.

XML is a data format that libraries have in common with the larger world. A benefit of using standards that are not confined to the library community is that we can build on the experiences, knowledge, and techniques developed outside the library community. We do not have to rely only on library vendors to develop XML tools, or develop them ourselves, and the time and effort saved can be applied to those activities that the standards are intended to support. It also means that XML tools are tested more broadly than library-specific tools, development costs are shared among more users, and there are more tools available. To give an obvious example, there are more options for editing and processing XML records than MARC records.

All of the major programming languages, and many of the minor ones, offer some form of support for XML. This may include packages for generating or parsing XML. There are several implementations of XSLT in different programming languages. One common processing model is for an application to receive an XML document, call an XSLT processor to do the transformation, and work with the results. Without XSLT, any such manipulation would involve writing custom XML-handling code for each task.

XML editors assist with the creation and maintenance of valid XML documents. XML databases have been developed to allow native storage and retrieval of XML documents. This discussion only hints at the XML tools that are available.

Because XML is not a language, but a framework for developing markup languages, all of these XML tools are applicable to nearly any XML-based vocabulary.

Examples of XML-based standards

There are many XML-based standards that are used in libraries. The standards briefly described below include some of the ones more commonly encountered in the digital library community. Only a few are strictly library standards, which helps to illustrate that, as libraries, we are interacting with other players in the information landscape.

TEI

The Text Encoding Initiative (TEI) is a flexible, well-established tagset for scholarly markup, supporting a wide range of documents, including prose, poetry, plays, and dictionaries. TEI is designed to be both modular and extensible. TEI has been through several revisions. Beginning as an SGML DTD in the late 1980s, the TEI DTD was later modified to support both SGML and XML. The most recent version, TEI P5, is implemented as a RELAX NG schema. TEI is used by a number of projects focusing on online scholarly texts, such as Brown University's Women Writers Project and the Oxford Text Archive. A partial list of TEI-based projects may be found at http://www.tei-c.org/Applications/.

The TEI began as a research effort of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. It is currently being developed by the TEI Consortium. For more information on TEI, see http://www.tei-c.org/.

DocBook

Though not a library standard per se, DocBook is a freely available, extensible tagset for technical documentation. As with TEI, DocBook began as an SGML DTD, currently supports both SGML and XML, and the next version will be expressed as a RELAX NG schema.

DocBook is used for providing much of the documentation used by library systems offices. Computer books published by O'Reilly and Associates are coded in DocBook. These source documents are used to produce both the print books and their online equivalents. [8] Sun Microsystems uses a DocBook variant for the online manual pages in Solaris. The manual you are currently reading is marked up in DocBook.

DocBook development is sponsored by the Organization for the Advancement of Structured Information Standards (OASIS). For more information, see http://www.docbook.org/ and http://www.oasis-open.org/.

EAD

Encoded Archival Description (EAD) is used for the electronic markup of archival finding aids. EAD began as an SGML DTD, initially developed at the University of California, Berkeley, Library, and is currently an XML DTD maintained by the Library of Congress in cooperation with the Society of American Archivists. For more information, see http://www.loc.gov/ead/.

MODS and MADS

The Metadata Object Description Schema (MODS) was developed to preserve MARC-like semantics for bibliographic metadata in an XML format, but in a form that is somewhat simpler than MARC, uses text tags rather than numeric tags, is extensible, and is more friendly to electronic resources. Several of the examples above are based on MODS. The Metadata Authority Description Schema (MADS) is a companion to MODS for encoding authority data. The Library of Congress is the maintenance agency for both MODS and MADS.

More information about MODS and MADS can be found at http://www.loc.gov/standards/mods/ and http://www.loc.gov/standards/mads/.

METS

The Metadata Encoding and Transmission Standard (METS) schema provides a way to bundle up all of the files, metadata, and structural information for an electronic object, and draws on the experiences of the Making of America 2 project. Objects are modeled as a hierarchy of components, or nodes. Each node may have associated with it files or portions of files, descriptive metadata, and administrative metadata (technical metadata, preservation metadata, and rights metadata). Linkages between nodes of the hierarchy may also be recorded.

METS defines structural metadata and a file manifest, but relies on other schemas for descriptive and administrative metadata, which are being developed by those with the appropriate expertise. This allows METS implementers to use the best descriptive and administrative metadata standards for their projects. VRA (Visual Resources Association) Core and MODS solve different problems for different communities, and either could be used for descriptive metadata in METS. METS has become well established in the digital library world. It has been used to model books, musical scores, 45 RPM vinyl records, video segments with transcripts, and even entire websites. METS is maintained by the Library of Congress.

For more information, see http://www.loc.gov/standards/mets/.

OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is an XML-based protocol that allows the exchange of metadata for digital collections. Metadata from a variety of collections can be gathered, or harvested, into repositories. Services can be built on top of these repositories to enable improved access to the materials. For example, the metadata for digital collections housed at separate institutions can be gathered into a repository, and that repository might allow all of the objects from all collections to be searched as though they were one collection.
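A harvesting exchange can be sketched in a few lines of Python. The verb ListRecords, the metadataPrefix parameter, the oai_dc prefix, and the namespaces come from the OAI-PMH specification; the repository address, the record identifiers, and the titles are invented, and the response is abridged (a real response also carries elements such as responseDate and request). To stay self-contained, the sketch parses a canned response rather than contacting a server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository address, for illustration only.
BASE_URL = "https://repository.example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abridged, invented response of the kind a repository might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First sample object</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repository.example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Second sample object</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the request a harvester would send to ask for records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def harvest_titles(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


request_url = list_records_url(BASE_URL)
harvested = harvest_titles(SAMPLE_RESPONSE)
print(request_url)
print(harvested)
```

A service provider would repeat this for each repository it harvests and store the results in one place, which is what allows separate collections to be searched as though they were one.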

For more information, see http://www.openarchives.org/.

ONIX

The ONline Information eXchange (ONIX) is a metadata standard used in the publishing and distribution supply chain. ONIX for Books is well-established for transmitting information about books between parties in the book supply chain, and is available as either a DTD or XML Schema. ONIX for Serials is being developed as a set of XML Schemas, each focused on communicating particular information about serials. Though not a library standard, ONIX offers some interesting prospects for libraries to enhance their interactions with their suppliers. For example, Serials Release Notification (SRN) focuses on article- or issue-level information for serials. This offers future possibilities for automatically adjusting serials claiming schedules based on a publisher's most recent release information.

ONIX is maintained by EDItEUR, an international group of participants in the book and serials industries. For more information, see http://www.editeur.org/.

Summary

XML is a general-purpose information container, combining a well-defined syntax with flexibility in representation of information. XML is important to libraries for a number of reasons. XML is essential to digital library literacy. The library community can share XML-based tools and standards with other industries, leveraging their experiences. XML is already an important part of our environment, and will only become more so.

Bibliography

Caplan, Priscilla. Metadata Fundamentals for all Librarians. Chicago: American Library Association, 2003.

Coombs, James H., Allen Renear, and Steven J. DeRose. “Markup Systems and the Future of Scholarly Text Processing”. Communications of the Association for Computing Machinery 30, no. 11 (November 1987): 933-947, http://doi.acm.org/10.1145/32206.32209.

DeRose, Steve, David Durand, Elli Mylonas, and Allen Renear. “What is Text, Really?” Journal of Computer Documentation 21, no. 3 (August 1997): 1-24, http://doi.acm.org/10.1145/264842.264843.

Ray, Erik. Learning XML. 2nd ed. Cambridge, Mass.: O'Reilly, 2003.

Harold, Elliotte Rusty, and W. Scott Means. XML in a Nutshell. 3rd ed. Sebastopol, CA: O'Reilly, 2004.

World Wide Web Consortium. Extensible Markup Language (XML) 1.0. 3rd ed. http://www.w3.org/TR/2004/REC-xml-20040204/.



[6] Some XML processors make it possible to examine the contents of comments, but they are not required to.

[7] The MARC format for bibliographic records is an excellent example of this idea. MARC development was motivated by the desire to share certain documents, bibliographic records, among libraries; its success made large-scale cooperative cataloging possible.

[8] This illustrates the power of descriptive markup: one input, multiple outputs.