詹姆斯 James

Encoding RSS Titles

June 13th, 2006

Pessimist: One who, when he has the choice of two evils, chooses both. Oscar Wilde, 1854–1900

The main problem with the RSS title element is how to deal with the characters & and <. They have to be encoded at least once as a requirement of XML, but aggregators that treat titles as HTML may expect a second level of encoding in order to comply with HTML. 1
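To make the XML requirement concrete, here is a minimal sketch using Python's standard-library parser. The helper name `is_well_formed` is made up for illustration; it simply wraps the candidate text in a `title` element and reports whether the parser accepts it.

```python
import xml.etree.ElementTree as ET

def is_well_formed(title_markup: str) -> bool:
    """Return True if <title>title_markup</title> parses as XML."""
    try:
        ET.fromstring(f"<title>{title_markup}</title>")
        return True
    except ET.ParseError:
        return False

# Raw & and < are rejected; the same titles encoded once are accepted.
for candidate in ["AT&T", "AT&amp;T", "A < B", "A &lt; B"]:
    print(f"{candidate!r:12} well-formed: {is_well_formed(candidate)}")
```

This only checks the first layer of encoding. Whether a second layer is needed depends on what the aggregator does with the text after parsing.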

As an aid to RSS feed producers, I decided to put together a set of tests that could be run through my aggregator collection to determine, once and for all, the best form of encoding to use.

First I chose a few titles that I thought would be a fair representation of real-world use cases: 2

  1. AT&T
  2. Bill & Ted
  3. The &amp; entity
  4. I <3 N.Y.
  5. A < B
  6. A<B
  7. The <title> element

Then I encoded them in every way imaginable. It turns out there are a lot of ways.

XML has essentially four encoding variations: entity references, two forms of numeric character references (hex and decimal), and CDATA sections. In HTML only the first three really apply, 3 but allowing no encoding at all as a fourth option, we end up with sixteen combinations between the two – more when you account for case. In total there were 176 tests.
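The sixteen combinations can be enumerated mechanically. The following is a rough sketch, not the script used for the original tests: the function names and the `HTML_LAYER`/`XML_LAYER` tables are invented here, and only & and < are handled. It applies one of four HTML-level choices first, then one of four XML-level choices on top.

```python
from itertools import product

# Named entity and code point for the two characters the article focuses on.
SPECIAL = {"&": ("amp", 38), "<": ("lt", 60)}

def _encode(text, render):
    """Replace each special character using the supplied render function."""
    return "".join(render(*SPECIAL[ch]) if ch in SPECIAL else ch for ch in text)

def entity(text):
    return _encode(text, lambda name, code: f"&{name};")

def decimal(text):
    return _encode(text, lambda name, code: f"&#{code};")

def hexadecimal(text):
    return _encode(text, lambda name, code: f"&#x{code:X};")

def cdata(text):
    # Only safe when the text does not itself contain "]]>".
    return f"<![CDATA[{text}]]>"

def unencoded(text):
    return text

# Hypothetical lookup tables: inner (HTML) layer first, outer (XML) layer second.
HTML_LAYER = {"none": unencoded, "entity": entity, "decimal": decimal, "hex": hexadecimal}
XML_LAYER = {"entity": entity, "decimal": decimal, "hex": hexadecimal, "cdata": cdata}

def variants(title):
    """Map (html_choice, xml_choice) to the encoded title for all 16 pairs."""
    return {
        (html_name, xml_name): xml_enc(html_enc(title))
        for (html_name, html_enc), (xml_name, xml_enc)
        in product(HTML_LAYER.items(), XML_LAYER.items())
    }

for (html_name, xml_name), encoded in variants("AT&T").items():
    print(f"HTML {html_name:8} XML {xml_name:8} {encoded}")
```

The sketch ignores the upper/lowercase variations mentioned above, so it produces sixteen strings per title rather than the full set of tests.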

Right off the bat you can eliminate XML CDATA sections, which fail to work in a number of aggregators. The same goes for HTML numeric character references in one form or another. Bloglines was the only aggregator that had problems with XML character references when mixed with HTML encoding, but those combinations are best avoided too. This cuts down the number of choices quite considerably.

Now let’s look at our first title: AT&T. After eliminating all the problematic encodings, we can narrow things down to four forms that have a reasonable chance of working:

  1. AT&amp;T
  2. AT&#38;T
  3. AT&#x26;T
  4. AT&amp;amp;T

Most aggregators have no problem with any of these. Unfortunately there were three 4 that couldn’t handle the last one, and two 5 that could only handle that one.
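One way to see why the fourth form splits the field is to model the two obvious reading strategies: take the parsed XML text literally, or decode it once more as HTML. This is a simplified model with made-up helper names, and it is not meant to reproduce the behaviour of any particular aggregator (it does not explain the two that accepted only the last form, for instance).

```python
import html
import xml.etree.ElementTree as ET

def xml_text(encoded: str) -> str:
    """Character data an XML parser reports for <title>encoded</title>."""
    return ET.fromstring(f"<title>{encoded}</title>").text

def as_plain_text(encoded: str) -> str:
    """Model of an aggregator that displays the parsed text as-is."""
    return xml_text(encoded)

def as_html(encoded: str) -> str:
    """Model of an aggregator that treats the parsed text as HTML."""
    return html.unescape(xml_text(encoded))

for form in ["AT&amp;T", "AT&#38;T", "AT&#x26;T", "AT&amp;amp;T"]:
    print(f"{form:14} text: {as_plain_text(form):10} html: {as_html(form)}")
```

Under this model the plain-text reader shows a stray `&amp;` for the double-encoded form, while the HTML reader shows AT&T for all four.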

The second title, Bill & Ted, produced essentially the same results, 6 but the third was more interesting. After all the obvious duds had been removed, we were left with the following four possibilities:

  1. The &#38;amp; entity
  2. The &#x26;amp; entity
  3. The &amp;amp;amp; entity
  4. The &amp;#38;amp; entity

Three aggregators 4 only handled the first two, while most others only handled the last two. BottomFeeder and Thunderbird only handled 3, GreatNews only handled 4, and BlogBridge handled none. Surprisingly, Bloglines handled all four.

On to our next test: I <3 N.Y. Once again, after filtering out the duds, we were left with four choices: 7

  1. I &lt;3 N.Y.
  2. I &#60;3 N.Y.
  3. I &#x3C;3 N.Y.
  4. I &amp;lt;3 N.Y.

Most aggregators handled all of them but, as usual, there were three 4 that couldn’t handle the last one, and five 8 that could only handle the last one. BottomFeeder couldn’t handle any.

The fifth title, A < B, worked out more or less the same, 9 but with the spaces removed (the sixth title) the results are quite different. There are only three options worth considering: 7

  1. A&#x3C;B
  2. A&amp;lt;B
  3. A&amp;#60;B

Once again, we had the usual three aggregators 4 that only handled the first option, GreatNews only handled the last, Thunderbird handled 1 and 2, and there were four aggregators 10 that supported all three. For most, though, the only supported options were 2 and 3. BottomFeeder handled none.
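The results above do not say why removing the spaces matters. One plausible explanation, offered here as an assumption rather than something the tests established, is that an HTML parser only treats < as the start of a tag when a letter (or /, ! or ?) follows it. A small check along those lines, with a hypothetical `looks_like_markup` helper:

```python
import re

# Assumption: "<" opens markup only when followed by a letter, "/", "!" or "?".
TAG_START = re.compile(r"<[A-Za-z/!?]")

def looks_like_markup(decoded_title: str) -> bool:
    """True if the decoded title contains something an HTML parser may read as a tag."""
    return TAG_START.search(decoded_title) is not None

for title in ["I <3 N.Y.", "A < B", "A<B", "The <title> element"]:
    print(f"{title!r:24} {looks_like_markup(title)}")
```

By this test titles 4 and 5 are harmless when read as HTML, while titles 6 and 7 are not, which lines up with the way the results group together.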

The results for the last test, The <title> element, were mostly the same. 11

As you can see, the ideal choice of encoding depends to a large extent on the type of string you’re dealing with, as well as the aggregators you wish to support. However, looking at the results across all test cases, it essentially boils down to single encoding (using hexadecimal character references) vs. double encoding (using entity references for both).
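For reference, the two forms can be generated for all seven titles as follows. This is a sketch under the assumption that only & and < need encoding; the function names are illustrative, and > is deliberately left alone even though some aggregators want it encoded too.

```python
TITLES = [
    "AT&T",
    "Bill & Ted",
    "The &amp; entity",
    "I <3 N.Y.",
    "A < B",
    "A<B",
    "The <title> element",
]

def single_hex(title: str) -> str:
    """Single encoding: hexadecimal character references, applied once."""
    return title.replace("&", "&#x26;").replace("<", "&#x3C;")

def entity_refs(text: str) -> str:
    """One layer of entity references (& must be replaced before <)."""
    return text.replace("&", "&amp;").replace("<", "&lt;")

def double_entity(title: str) -> str:
    """Double encoding: entity references for the HTML layer, then the XML layer."""
    return entity_refs(entity_refs(title))

for number, title in enumerate(TITLES, start=1):
    print(f"{number}. {single_hex(title):32} {double_entity(title)}")
```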

On that basis, I’ve summarised the results for our seven titles using just those two forms:

Aggregator    Single Encoding    Double Encoding
              1 2 3 4 5 6 7      1 2 3 4 5 6 7
AmphetaDesk 0.93.1
Attensa Online
BlogBridge 2.16
Bloglines
BottomFeeder 4.2
FeedDemon 2.0.0.11
FeedExplorer 1.1.7
FeedReader 3.01
Firefox 1.5.0.4 12
Firefox 2.0a 13
Google Reader
GreatNews 1.0.0.368
Internet Explorer 7.0.5346.5
JetBrains Omea 2.0 (671.6)
My Yahoo!
Netvibes
Newsgator Online
Newz Crawler 1.8.0 (3312)
Pluck 14
RSS Bandit 1.3.0.42
RSSOwl 1.2.1
RssReader 1.0.88.0
SharpReader 0.9.6.0
Snarfer 0.4.0
Thunderbird 1.5.0.4

Clearly if you want to support Firefox or Internet Explorer you’ve got no choice but to use the single encoding option. For certain strings, though, that would mean losing support for at least twenty other aggregators. No matter what you do, you can’t win.

Perhaps the only solution, barring a miraculous change of policy on the part of certain browser authors, is to just use both.

1 Although it shouldn’t be necessary, some aggregators require that > be encoded too.
2 For those of you unfamiliar with emoticons, title number 4 says I ♥ N.Y.
3 CDATA sections, while valid, are rarely supported in HTML.
4 Firefox, Internet Explorer and RSSOwl.
5 BottomFeeder and Netvibes.
6 Same set of options, but Netvibes handled all four this time.
7 Note that hexadecimal character references are also valid in lowercase.
8 Attensa Online, FeedDemon, FeedReader, My Yahoo! and Newz Crawler.
9 Same set of options, but Attensa Online handled all four this time.
10 BlogBridge, Bloglines, JetBrains Omea and Newsgator Online.
11 Same set of options, but JetBrains Omea was the only aggregator that supported all three; RSSOwl supported none.
12 Firefox Live Bookmarks.
13 The new feed preview.
14 Using the Firefox extension.