Defining Documents

Document Type Definitions

Table of Contents

What is a DTD?
Defining the Document
Defining Elements
Repetition+
Choices | choices | choices
Tricks and tips
Adding text
Adding Attributes
Unique Identifiers
Good Document Design
Entities and Other Useful(?) Stuff
What are Entities?

So far, we've looked primarily at existing document standards, such as XHTML and DocBook. These are defined through extensive discussion by groups of experts, and are thoroughly documented and described in various human-readable ways. However, their “canonical” definition lies within their Document Type Definition (DTD), a document (or more likely, a set of documents) that can be read by editors, validators and other software and used to determine whether any given document conforms to that standard.

So, the DTD is a formal specification of a particular document type; it defines what is valid for each instance of that document type. It defines the sequence of allowable elements and controls which entities and attributes may be used within each element. Most XML authoring software can read DTDs and can validate your documents by comparing them to your DTD. Some software will also use the DTD to restrict you to writing only valid markup as you author your documents.

For some uses of XML, there's no available standard covering the specific markup we need to store our data with. Perhaps we need to store very detailed data for a specific subject domain. Or perhaps we want to create documents that have a specific structure tied to their purpose. For these types of applications, we need to create a new definition, and write our own DTD.

DTDs are quite simple to write, but only allow very basic definitions of the elements and attributes that make up the application. For more detailed control over the structure of our instance documents, we can use more modern definition languages such as W3C XML Schema or RELAX NG, both of which we'll cover in a later section. But as DTDs are easier to read and write, we'll start with them for our new XML application.

What is a DTD?

To begin with, let's have a look at the example DTD shown in Example 10, “A simple DTD for a jokebook”. This gives us a very simple document structure with some metadata (the bookinfo element) and a series of one or more jokes, all contained within the root node, called jokebook. The simplest document that can be created with this DTD is shown in Example 11, “A minimal jokebook document”.

Example 10. A simple DTD for a jokebook

  <!ELEMENT jokebook (bookinfo, joke+)>
  <!ELEMENT bookinfo (title, editor+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT editor (#PCDATA)>
  <!ELEMENT joke (simplejoke+)>
  <!ELEMENT simplejoke (question, punchline)>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT punchline (#PCDATA)>
					


Example 11. A minimal jokebook document

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE jokebook SYSTEM "jokebook.dtd">
  <jokebook>
    <bookinfo>
      <title/>
      <editor/>
    </bookinfo>
    <joke>
      <simplejoke>
        <question/>
        <punchline/>
      </simplejoke>
    </joke>
  </jokebook>
					


You should be able to roughly match the element names in the XML document with the element definitions in the DTD - in the next few sections we'll look at the exact syntax that allows us to define this document structure.
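
To make the matching easier, here is a sketch of a slightly fuller document that should still validate against the jokebook DTD. The text content is placeholder material of our own; only the element names come from the DTD. Note how editor+ and joke+ let us repeat those elements:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE jokebook SYSTEM "jokebook.dtd">
  <jokebook>
    <bookinfo>
      <title>Placeholder title</title>
      <editor>First Editor</editor>
      <editor>Second Editor</editor>
    </bookinfo>
    <joke>
      <simplejoke>
        <question>First question goes here</question>
        <punchline>First punchline goes here</punchline>
      </simplejoke>
    </joke>
    <joke>
      <simplejoke>
        <question>Second question goes here</question>
        <punchline>Second punchline goes here</punchline>
      </simplejoke>
    </joke>
  </jokebook>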

Defining the Document

The basic structure of a DTD comprises a sequence of definitions, of four types:

  • Elements (or tags): <!ELEMENT ....>

  • Attributes: <!ATTLIST ....>

  • Entities: <!ENTITY ....>

  • Notations: <!NOTATION ....>

The order of the definitions is generally not important, unless you're importing definitions from elsewhere (more on this another time).

The DTD doesn't follow the usual XML rules of well-formedness, since it is not an XML document itself. However, the syntax is strict and case-sensitive.

Defining Elements

As you've probably guessed, the element definitions provide the basic tagset to be used in the document. They also define the strict order and repetition of elements within the document.

A typical element definition may look like this:

  <!ELEMENT simplejoke (question, punchline)>			
            

This gives us the following markup:

  <simplejoke>
    <question></question>
    <punchline></punchline>
  </simplejoke>
                
            

This says that every <simplejoke> must contain one and only one <question>, followed by one and only one <punchline>. The simplejoke is the parent element, and question and punchline are its children.
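
It can help to see what this definition rules out. The following fragments are illustrative sketches of markup that a validator should reject against the simplejoke definition:

  <!-- Not valid: the children are in the wrong order -->
  <simplejoke>
    <punchline></punchline>
    <question></question>
  </simplejoke>

  <!-- Not valid: the punchline is missing -->
  <simplejoke>
    <question></question>
  </simplejoke>

  <!-- Not valid: only one question is allowed -->
  <simplejoke>
    <question></question>
    <question></question>
    <punchline></punchline>
  </simplejoke>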

Repetition+

Obviously, you are able to allow more than one of each element:

  <!ELEMENT jokebook (joke+)>
			

which means that a jokebook may contain one or more jokes:

  <jokebook>
    <joke></joke>
    ... more <joke> elements ...
  </jokebook>

Similarly, elements may be marked as optional (?) or as appearing zero or more times (*). You can also declare an element's entire content as ANY, which allows any mixture of text and declared elements inside it (though this is rarely used).
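
As a sketch of these markers in use, here is a variant of the bookinfo definition. The subtitle, note and scratchpad elements are hypothetical names invented for this illustration and are not part of the jokebook DTD:

  <!-- a title, an optional subtitle, one or more editors,
       then any number of notes (including none) -->
  <!ELEMENT bookinfo (title, subtitle?, editor+, note*)>
  <!ELEMENT subtitle (#PCDATA)>
  <!ELEMENT note (#PCDATA)>

  <!-- ANY replaces the whole content model -->
  <!ELEMENT scratchpad ANY>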

Choices | choices | choices

Alternatives can also be marked:

  <!ELEMENT joke (simplejoke | knockknockjoke | 
      doctordoctorjoke | limerick)>
			

so that a joke may contain one (and only one) of several types. A valid pattern might be:

  <joke><limerick></limerick></joke>

We can also allow multiple choices from a list in any order, by saying:

  <!ELEMENT myjokes (simplejoke | knockknockjoke | 
      doctordoctorjoke | limerick)+ >
			

which creates an unordered list of one or more jokes of the listed type, perhaps like this:

  <myjokes>
    <knockknockjoke></knockknockjoke>
    <limerick></limerick>
    <limerick></limerick>
    <simplejoke></simplejoke>
    <limerick></limerick>
  </myjokes>

Remember that, for these to validate, each of these individual element types will also need to be defined, even if only as #PCDATA.
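
For instance, the myjokes markup above would validate with placeholder definitions along these lines. This is only a sketch: each joke type is declared as plain text for now, and would be refined into a proper structure later.

  <!ELEMENT simplejoke (#PCDATA)>
  <!ELEMENT knockknockjoke (#PCDATA)>
  <!ELEMENT doctordoctorjoke (#PCDATA)>
  <!ELEMENT limerick (#PCDATA)>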

Tricks and tips

Controlling and limiting multiple items is not always easy with DTDs, especially where there's a need for a finite maximum, or for a minimum other than zero or one. Various tricks are needed to work within these conventions, such as:

  <!ELEMENT limerick (line, line, line, line, line)>
				

to define a strictly five-line verse, or

  <!ELEMENT double-entendre (phrase, meaning, meaning+)>
				

to define it as a phrase followed by two or more meanings:

  <double-entendre>
    <phrase></phrase>
    <meaning></meaning>
    <meaning></meaning>
    ... more <meaning>s if required ...
  </double-entendre>

This method becomes quite a limitation if you want to define an element with, perhaps, twenty child elements. Both W3C XML Schema and RELAX NG provide more effective ways of defining multiple child elements.
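
The same trick stretches to a bounded range. As a sketch (the limit of two to four meanings is our own assumption, not part of the jokebook design), optional items can be nested so that each extra meaning is only allowed once the previous one is present:

  <!-- a phrase followed by two, three or four meanings -->
  <!ELEMENT double-entendre (phrase, meaning, meaning,
      (meaning, meaning?)?)>

The nesting is deliberate. A flatter model such as (phrase, meaning, meaning, meaning?, meaning?) is ambiguous, since a third meaning could match either optional slot, and XML 1.0 treats such non-deterministic content models as an error.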

Adding text

So far, we've only defined tags which contain other tags, so we need a way of allowing tags to contain arbitrary text. We can do this using a #PCDATA clause:

  <!ELEMENT phrase (#PCDATA)>
				

which allows any text except tags to be included as part of the phrase. Allowing a mixture of tags and text is trickier, because DTDs only permit it in one form: a choice list that begins with #PCDATA and is followed by * (zero or more). You can't place #PCDATA in a sequence list, so:

  <!ELEMENT line (#PCDATA | emphasis | rhyme)*>
				

is valid, and could give the following markup:

  <line>There <emphasis>was</emphasis> a 
    young lady from <rhyme>Crewe</rhyme>
  </line>

but the next mixes #PCDATA within a sequence, so isn't a valid DTD definition:

  <!ELEMENT double-entendre (phrase, meaning, meaning+, #PCDATA)>
				

In this case, you'll need to define another element which then contains the arbitrary text itself. Generally, though, mixing text and elements is frowned upon as poor document design, unless the document calls for “inline” markup (as, for example, the inline elements in XHTML such as <strong>, <a>, etc.).
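
A minimal sketch of that workaround follows. The remark element is a hypothetical name chosen for this illustration:

  <!ELEMENT double-entendre (phrase, meaning, meaning+, remark)>
  <!ELEMENT remark (#PCDATA)>

which would allow markup such as:

  <double-entendre>
    <phrase></phrase>
    <meaning></meaning>
    <meaning></meaning>
    <remark>Arbitrary text goes here</remark>
  </double-entendre>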

Adding Attributes

Often it's preferable to define attributes to specify more detailed information about an element. In HTML, attributes were often used to modify the presentation of the element on the screen; in XML, the attributes should only be used to describe content.

Attributes are often used to provide metadata, information about the data contained in the element, such as its source, language or accuracy.

So we may want to write something like this in our XML document:

  <joke author="Lee Evans" cert="18"> ...... </joke>
            

which we can define as:

  <!ATTLIST joke author CDATA #IMPLIED>
            

The ATTLIST is followed by the element it applies to, then the name of the attribute it is defining. In this case it is an arbitrary text field (CDATA -- note no #P!), and is optional (#IMPLIED).

Explicit values can also be specified here:

  <!ATTLIST joke cert (U | 12 | 15 | 18) #REQUIRED>
            

so that whoever writes the document must choose one of the listed certificate values.

And a default value may be given:

  <!ATTLIST joke author CDATA "anonymous">
            

where, if otherwise unspecified, the value will default to "anonymous".

You can also specify a fixed value:

  <!ATTLIST joke language CDATA #FIXED "English">
            

where the attribute always has the given constant value: the parser supplies it if the document omits it, and a document that does specify it must use exactly that value.
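
Several attributes for the same element can be declared in a single ATTLIST. The following sketch combines the definitions above (the column layout is just for readability):

  <!ATTLIST joke
      author   CDATA              "anonymous"
      cert     (U | 12 | 15 | 18) #REQUIRED
      language CDATA              #FIXED "English">

With this in place, a document only needs to supply the required attribute:

  <joke cert="12"> ...... </joke>

and a validating parser should report author as "anonymous" and language as "English". Bear in mind that if the same attribute is declared more than once for an element, only the first declaration counts, so you would pick either the #IMPLIED or the "anonymous" form for author, not both.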

Unique Identifiers

It's common to want to refer to a certain section of a document using some kind of identifier, so provision has been made for this using the special attribute type ID. An attribute of this type acts as a unique marker within each document: no two elements may share the same ID value (HTML's id attribute serves the same purpose). In particular, many database applications of XML will generate or read this attribute as part of their key for the data.

  <!ATTLIST joke code ID #REQUIRED>
                

You can also define an attribute to be a list of such IDs, as cross-references:

  <!ATTLIST joke seealso IDREFS #IMPLIED>
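
As a sketch of how these might look in a document (the code values are invented for illustration, and we assume only the two declarations above apply to joke):

  <joke code="j001"> ...... </joke>
  <joke code="j002" seealso="j001"> ...... </joke>
  <joke code="j003" seealso="j001 j002"> ...... </joke>

Values of type ID must be valid XML names, so they can't begin with a digit: code="001" would be rejected, which is why the examples use a letter prefix. Each name listed in seealso is separated by a space and must match the ID of some element in the same document.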
                
[Note] Practical task

Develop a document type of your own. To begin, think of a type of document that has a simple structure, and that records the same information regularly within that structure. A simple example might be an address book - each entry follows a regular pattern and records the same type of information for each person.

Begin by marking up a “typical” document, creating the tags as you go. Then deconstruct the marked up document into a series of element definitions in a separate file. As you go, try creating a new document using your DTD in <oXygen/>, using the validation facility to test your DTD.

This works best with regular, well-structured documents, such as meeting minutes, restaurant menus, recipes, product specifications (e.g. cars, computers,...) and so on.

As an example, I've created a DTD for the module outlines that we produce to define each module (e.g. this one) that you take (see the outlines folder for other related files).

Good Document Design

Over the coming weeks, we'll build up a picture of how to analyse data and build a document design that models the data accurately and flexibly. For now, we need to bear in mind a few steps towards this process:

  • identify your basic data items

  • group related data items together

  • organise these groups into a hierarchical (tree-like) structure

  • examine the level of detail recorded for each data item - can it be broken down into further elements?

  • determine whether inline markup is needed within text elements

  • look at the metadata (attributes) needed for each data item

Try to design for reusability; don't be restricted by presentation issues or by a specific output format for the document. Most XML documents have a life after their initial purpose, often unexpected, and planning for this should be part of your design. For example, if recording personal names, splitting the first name and surname into separate data elements can allow more flexible processing, perhaps producing lists sorted by surname or first name:

  <personname>
    <firstname>Gary</firstname>
    <surname>Stringer</surname>
  </personname>
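
A DTD fragment for this markup might look like the following sketch, which assumes that both parts of the name are always present and always appear in this order:

  <!ELEMENT personname (firstname, surname)>
  <!ELEMENT firstname (#PCDATA)>
  <!ELEMENT surname (#PCDATA)>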
                        
                        

Above all, your document design should be a semantic representation of the structured data contained within your documents.

We'll revisit this topic in more depth in the next chapter.

Entities and Other Useful(?) Stuff

[Important] Warning!

Most of the techniques described in the rest of this chapter are gradually being replaced by newer XML-related standards.

For example, the use of entities for accented characters is made redundant by the adoption of Unicode; notations are more commonly dealt with by embedding markup (via namespaces), etc.

What are Entities?

Entities are used for several reasons:

  • to create a shorthand for entering often-used text;

  • to define user-friendly names for special characters;

  • to include material from an external file;

  • to define material that should not be parsed.

Defining a text entity (an internal general entity) in the DTD:

  <!ENTITY uoe "University of Exeter">
                    

then using the entity in an XML document:

  <p>The &uoe; is a very rainy place</p>
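
To try this without a separate DTD file, the declarations can be placed in the document itself, inside the square brackets of the DOCTYPE. This is a minimal sketch that treats p as the root element purely for illustration:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE p [
    <!ELEMENT p (#PCDATA)>
    <!ENTITY uoe "University of Exeter">
  ]>
  <p>The &uoe; is a very rainy place</p>

After parsing, the content of the p element reads "The University of Exeter is a very rainy place".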
                    

Inserting Accents and Symbols

Remember the encoding parameter in the XML declaration, which we mentioned way back in the introduction in Example 2, “An example of a home-made markup language in XML”?

  <?xml version="1.0" encoding="UTF-8"?>
                        

An XML document can be written as a Unicode document, which allows the use of a vast array of multinational characters. Unicode should be used wherever possible, and should cope with most situations you come across. However, when dealing with legacy data, it's sometimes necessary to deal with other encodings.

In the past, it was more usual to write XML as plain ASCII, a standard format that almost all text editors use. Since ASCII doesn't have provision for multinational and symbol characters, we need a way of defining them, and inserting them easily into our text.

Special characters are already part of the XML standard; any Unicode character can be inserted using a character reference, which looks similar to an entity reference and uses the character's Unicode code point number in decimal or hexadecimal, e.g. an e-acute is &#233; (or &#xE9; in hexadecimal).

As you can see, this isn't exactly an easy-to-remember way of inserting special characters, so we usually define entities as more memorable references. In XHTML, for example, we can use the entity &eacute; which is defined as:

  <!ENTITY eacute "&#233;"> <!-- small e, acute accent -->
                        

There's a list of character references for commonly used characters in Appendix C of Castro (2001), and there are numerous more complete lists on the web.

[Note] Exercises

Using the DTD and XML data documents you created for the previous exercise, try the following:

  1. Add a few standard accented characters (e.g. umlauts or acute/grave accents) as entities to your DTD.

  2. Add a lang="____" attribute to one of the elements in your DTD.

  3. Add some multilingual text to your documents, with the relevant language attribute set.

    Hint: you'll also need to allow multiple instances of the tag containing the multilingual text.

External Entities

An external unparsed entity refers to a chunk of non-XML data, typically binary, held in a separate file. It's commonly employed for data such as graphical logos or icons that are used frequently within all documents of the type being defined; it's not normally used for one-off graphics such as diagrams, illustrations or other “content-related” items. Here's how it works, starting with the entity declarations in the DTD (each name after NDATA must match a declared notation, as described under Notations):

  <!ENTITY unilogo SYSTEM "logo-large.jpg" NDATA jpg>
  <!ENTITY cmitlogo SYSTEM "cmit-logo.gif" NDATA gif>
                

Creating an attribute that refers to that data:

  <!ELEMENT logo (alternatetext?)>
  <!ATTLIST logo image ENTITY #REQUIRED>
  <!ELEMENT alternatetext (#PCDATA)>
                        

then referring to that picture in the XML file:

  <logo image="cmitlogo">
    <alternatetext>Creative Media and 
      Information Technology.</alternatetext>
  </logo>
                        
[Important] Including images as entities

In practice, this method of including an image is clumsy and very restrictive, since the location of the image file must be defined in the DTD. It's much more usual to merely indicate a filename as a standard attribute and use a stylesheet or script to insert the image.

The entity method can, however, be useful to include very commonly used items such as a corporate logo, or regularly used icons, as part of the text of the document.

Notations

When including unparsed content, the applications processing the data need to know something about the format of the data included, in order to process it correctly. For this we need to create a <!NOTATION> entry. So for example, we might have:

  <!NOTATION jpg SYSTEM "image/jpeg">
  <!NOTATION svg SYSTEM "image/svg+xml">
                        

to allow us to use two different formats for graphical data. Note that the second, SVG, is also an XML document, though we don't want to parse it - it should be passed directly to the application.

The value given for each type of file is called a MIME-type, and is a standard code that most web browsers and data processing systems can use. There are numerous lists of MIME-types on the web; the most authoritative is at IANA, and a slightly friendlier list can be seen at W3Schools.
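
Putting the pieces of this section together, a DTD fragment for the logo example might look like the following sketch. The gif notation is our own addition here, needed because the cmitlogo entity is declared as NDATA gif:

  <!-- notations name the data formats -->
  <!NOTATION jpg SYSTEM "image/jpeg">
  <!NOTATION gif SYSTEM "image/gif">

  <!-- unparsed entities point at the files -->
  <!ENTITY unilogo SYSTEM "logo-large.jpg" NDATA jpg>
  <!ENTITY cmitlogo SYSTEM "cmit-logo.gif" NDATA gif>

  <!-- the logo element refers to an entity by name -->
  <!ELEMENT logo (alternatetext?)>
  <!ATTLIST logo image ENTITY #REQUIRED>
  <!ELEMENT alternatetext (#PCDATA)>

A validating parser checks that the value of the image attribute names a declared unparsed entity, and that the entity's notation has itself been declared.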