A Simple XML File Format

In recent weeks, I’ve spent many a sleepless night pondering the seemingly simple task of designing a file format for storing the settings from a blog publishing system; essentially a number of key/value pairs. The only requirement was that the format be a dialect of XML.

First Attempt

My first, naïve attempt looked something like this:

<blog_settings>
  <blog_name>Slashdot</blog_name>
  <blog_description>News for nerds</blog_description>
  <blog_searchable>true</blog_searchable>
  <blog_adult_content>false</blog_adult_content>
  ...
</blog_settings>
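For reference, reading a format like this takes very little code. Here is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the load_settings helper and the SETTINGS_XML constant are names I've made up for illustration, and only the four settings shown above are included.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: just the four settings listed in the article.
SETTINGS_XML = """\
<blog_settings>
  <blog_name>Slashdot</blog_name>
  <blog_description>News for nerds</blog_description>
  <blog_searchable>true</blog_searchable>
  <blog_adult_content>false</blog_adult_content>
</blog_settings>
"""


def load_settings(xml_text: str) -> dict:
    """Return a dict mapping each child element's tag to its text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


settings = load_settings(SETTINGS_XML)
print(settings["blog_name"])
```

Note that every value comes back as a plain string, so the caller still has to decide what "true" means.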

At first glance, such a format might seem perfectly adequate, but the more experienced among you are sure to have spotted the many shortcomings. Let’s focus on just one of the settings:

<blog_searchable>true</blog_searchable>

Sure, that’s enough for us to determine the current value of the setting, but what do we know about its past? When was it created? When last updated? Who was responsible? And what does it really mean?

There’s a lot we can do to improve on that.

Identity

Consider the question of identity. When reading this format, how can anyone be sure that these settings refer to our blogging system?

An XML namespace might help, but even better would be a universally unique identifier for each individual setting. Let’s create an id element, with a tag URI as our identifier.

<id>tag:example.org,2009-04-01:setting.blog_searchable</id>

And while that’s perfect for a machine, we’ll likely also want something a little more descriptive for any humans viewing the data.

<title>Can this blog be indexed by search engines</title>

Already a major improvement.
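The tag URI used for the id follows the pattern tag:authority,date:specific. As a rough sketch, assuming we mint one identifier per setting, a small Python helper could build them; make_tag_uri and its parameter names are my own invention, not part of any library.

```python
from datetime import date


def make_tag_uri(authority: str, minted: date, specific: str) -> str:
    """Build a tag URI of the form tag:authority,date:specific."""
    return f"tag:{authority},{minted.isoformat()}:{specific}"


print(make_tag_uri("example.org", date(2009, 4, 1), "setting.blog_searchable"))
# prints: tag:example.org,2009-04-01:setting.blog_searchable
```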

History

As mentioned earlier, it would be immensely useful to know a little of the history behind each setting. For example, when was the setting first created, or last updated? Let’s add a couple of date elements – we’ll call them published and updated – with RFC 3339 timestamps.

<published>2009-04-01T01:23:45.67Z</published>
<updated>2009-04-01T08:34:57.13Z</updated>
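Producing timestamps in this shape is straightforward. The following is a sketch only, assuming we always write UTC with two fractional digits as in the examples above; the rfc3339 function name is hypothetical.

```python
from datetime import datetime, timezone


def rfc3339(moment: datetime) -> str:
    """Format an aware datetime as an RFC 3339 UTC timestamp with centiseconds."""
    utc = moment.astimezone(timezone.utc)
    centiseconds = utc.microsecond // 10_000  # truncate to two fractional digits
    return f"{utc.strftime('%Y-%m-%dT%H:%M:%S')}.{centiseconds:02d}Z"


published = datetime(2009, 4, 1, 1, 23, 45, 670_000, tzinfo=timezone.utc)
print(rfc3339(published))
# prints: 2009-04-01T01:23:45.67Z
```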

And of course it’s even more important to know who last changed the setting. Not just their name – having their website URI provides an extra level of accountability. It makes sense to group these details together, so we’ll also add an enclosing author element.

<author>
  <name>Mr J. Smith</name>
  <uri>http://example.org/~johns</uri>
  <email>fake.email@example.org</email>
</author>

Note that a real email address, while potentially useful, would just be a target for spammers. It does no harm to include a fake address, though, and the data looks more complete that way.

If a setting has been edited by more than one person, we can easily credit the others by adding contributor elements that use the same format as author.

The Good Stuff

Most important of all, though, is the value of the setting. For this we’re going to add a content element containing a representation of the value, with a type attribute identifying the type of the representation.

In the case of the blog_searchable setting, it’s tempting just to use the string true for its value. However, considering it’s a boolean, we really ought to use a genuine boolean data type. How about this?

<content type="application/x-boolean">AQ==</content>

For the type attribute, we’ve used a boolean MIME type which can be registered at a later date.[1] For the content itself, we’ve base64-encoded an 8-bit representation of the number one, signifying a boolean value of true.
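To make the encoding concrete, here is a minimal sketch using Python's standard base64 module. The function names are hypothetical, and representing false as a single zero byte is my assumption, since only the true case is spelled out above.

```python
import base64


def encode_boolean(value: bool) -> str:
    """Base64-encode one byte: 0x01 for True, 0x00 for False (assumed)."""
    return base64.b64encode(bytes([1 if value else 0])).decode("ascii")


def decode_boolean(text: str) -> bool:
    """Treat any decoded payload other than a single zero byte as True."""
    return base64.b64decode(text) != b"\x00"


print(encode_boolean(True))
# prints: AQ==
print(decode_boolean("AQ=="))
# prints: True
```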

But for those who find base64 confusing, it’s also worth including a plain-text summary as a backup.

<summary>true</summary>

Web 2.0

Finally, since internet APIs are all the rage these days, it’ll be nice to have a couple of link elements, making our settings more Web 2.0 compliant.

One link might provide a URI from which the value of the setting can be retrieved. Another could provide a URI to which updates to the setting can be PUT.

<link rel="self" href="http://example.org/get/blog_searchable" />
<link rel="edit" href="http://example.org/set/blog_searchable" />

The exact protocol associated with these operations is left as an exercise for the reader. There’s no limit to the number or type of links that can be added, so go wild.

The Final Product

Once we gather all of these details together into an enclosing element, the end result might look something like this:

<entry>
  <id>tag:example.org,2009-04-01:setting.blog_searchable</id>
  <title>Can this blog be indexed by search engines</title>
  <published>2009-04-01T01:23:45.67Z</published>
  <updated>2009-04-01T08:34:57.13Z</updated>
  <author>
    <name>Mr J. Smith</name>
    <uri>http://example.org/~johns</uri>
    <email>fake.email@example.org</email>
  </author>
  <contributor>
    <name>Mrs A. Fule</name>
    <uri>http://example.org/~aprilf</uri>
    <email>fake.email@example.org</email>
  </contributor>
  <content type="application/x-boolean">AQ==</content>
  <summary>true</summary>
  <link rel="self" href="http://example.org/get/blog_searchable" />
  <link rel="edit" href="http://example.org/set/blog_searchable" />
</entry>
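Reading a setting back out is a little more involved than before. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the entry; the read_setting helper, the shape of the dict it returns, and the rule that any payload other than a zero byte means true are all assumptions of mine.

```python
import base64
import xml.etree.ElementTree as ET

# A trimmed copy of the entry above (author and contributor omitted for brevity).
ENTRY_XML = """\
<entry>
  <id>tag:example.org,2009-04-01:setting.blog_searchable</id>
  <updated>2009-04-01T08:34:57.13Z</updated>
  <content type="application/x-boolean">AQ==</content>
  <summary>true</summary>
  <link rel="self" href="http://example.org/get/blog_searchable" />
  <link rel="edit" href="http://example.org/set/blog_searchable" />
</entry>
"""


def read_setting(xml_text: str) -> dict:
    """Extract the id, decoded boolean value, summary, and links from one entry."""
    entry = ET.fromstring(xml_text)
    raw = base64.b64decode(entry.findtext("content"))
    return {
        "id": entry.findtext("id"),
        "value": raw != b"\x00",  # assumed: anything but a zero byte is true
        "summary": entry.findtext("summary"),
        "links": {link.get("rel"): link.get("href") for link in entry.findall("link")},
    }


setting = read_setting(ENTRY_XML)
print(setting["id"], setting["value"])
```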

Of course, that’s just one setting. Our full data format will have a number of these entry elements enclosed in a top-level root element. A complete example would be far too large to include here.

I’ll admit it’s more complicated to parse, more effort to produce, and an order of magnitude larger, but it has a whole lot more functionality. Plus it’s Web 2.0 compliant![2]

It certainly has come a long way from our first, humble proposal. Was it worth the effort?

Footnotes

  1. Not necessarily by me.
  2. Full Web 2.0 compliance cannot be guaranteed.