Right-To-Left Content in RSS

One of the difficulties in dealing with international content is the complication of right-to-left writing systems. There are essentially two issues that need to be dealt with: initialisation of the base directionality, which is necessary for the Unicode Bidirectional Algorithm to function correctly; and the right-alignment of paragraphs, with elements like tables and lists appropriately reversed.

The first part can be achieved, with reasonable effort, using RLE (U+202B) and PDF (U+202C) control codes; the second part is more complicated. Ideally you would wrap your content in an HTML div element with the dir attribute set to "rtl", taking care of both directionality and alignment in one step. Unfortunately that isn’t a viable option for RSS 0.91 and 1.0 feeds, neither of which allows markup in the description element.⁠1
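As a minimal sketch of the first part, here is how a plain-text description might be wrapped in RLE and PDF in Python. The function name embed_rtl is hypothetical, and this only sets the base directionality for the Bidirectional Algorithm; it does nothing for alignment.

```python
# Unicode directional formatting characters mentioned above.
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed_rtl(text: str) -> str:
    """Wrap plain text in an RLE ... PDF pair (hypothetical helper)."""
    return f"{RLE}{text}{PDF}"
```

A feed generator could apply something like this to each description before escaping it for XML.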

One solution that has been proposed is the addition of a dir attribute to RSS itself.⁠2 On encountering such an attribute, an aggregator would know to align the content appropriately. However, there is very little chance of that idea making it into an official specification: RSS 2.0 is frozen and can’t be updated; RSS 0.91 has essentially been superseded by 2.0; and the RSS-DEV WG stopped working on RSS 1.0 a long time ago.

The end result is that there are a good many RSS feeds out there containing right-to-left content, with no indication to aggregators of how that content ought to be rendered.

In an effort to determine how best to cope with such feeds, I put together a couple of tests to see what other aggregators were doing. It turns out, not very much. Most would position the content correctly if it used markup in the description to set the directionality, but there were several aggregators that couldn’t even manage that successfully.

IE7 used the RSS language element to trigger a right-to-left layout for the entire feed,⁠3 but I don’t think that’s such a great idea. Even when that information is accurate, it’s of no use with a language like Azerbaijani that uses multiple scripts with different directions, or with feeds that have a mixture of entries in different languages.

Arabic Snarfer Screenshot

So what is the answer?

For Snarfer we decided that the most practical solution would be to try and detect the directionality of the content ourselves. If it appeared to be using a right-to-left script, we would wrap it in a div with an appropriate dir attribute before passing it on to the HTML renderer.
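The wrapping step might look roughly like the following in Python. This is only an illustration, not Snarfer's actual code; the function name wrap_for_renderer and the is_rtl flag are assumptions, with the flag standing in for whatever the detection step decides.

```python
def wrap_for_renderer(html_content: str, is_rtl: bool) -> str:
    """Wrap content in a div with dir="rtl" when it was judged to be RTL.

    Hypothetical helper: content judged to be left-to-right is passed
    through unchanged.
    """
    if is_rtl:
        return f'<div dir="rtl">{html_content}</div>'
    return html_content
```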

The algorithm we use looks something like this:

  1. Initialise a counter to zero.
  2. Look at the first n characters of the content with markup stripped.
  3. If a character is from an RTL script,⁠4 increment the counter by one.
  4. If a character is from an LTR script,⁠5 decrement the counter by one.
  5. Once n characters have been processed, if the counter is positive, the content is considered to be RTL.
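
The steps above can be sketched in Python using the standard unicodedata module. Several things here are assumptions rather than part of the description above: the name looks_rtl, the default of n = 100, and the crude regular expression used to strip markup. The sketch also counts bidirectional type AL as RTL, because Python reports Arabic letters as AL rather than R.

```python
import re
import unicodedata

# Bidirectional types counted as RTL. AL is an assumption added here,
# since unicodedata classifies Arabic letters as AL, not R.
RTL_TYPES = {"R", "AL", "AN"}
# Only type L counts as LTR (not EN).
LTR_TYPES = {"L"}

# Crude tag stripper, good enough for a sketch (assumption).
TAG_RE = re.compile(r"<[^>]*>")


def looks_rtl(content: str, n: int = 100) -> bool:
    """Guess whether content is predominantly right-to-left."""
    text = TAG_RE.sub("", content)  # step 2: strip markup
    counter = 0                     # step 1
    for ch in text[:n]:             # step 2: first n characters
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_TYPES:       # step 3
            counter += 1
        elif bidi in LTR_TYPES:     # step 4
            counter -= 1
    return counter > 0              # step 5
```

Digits, spaces and punctuation fall into neither set, so they leave the counter untouched. Content with no strongly directional characters at all ends with a counter of zero and is treated as left-to-right.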

Of course it’s not difficult to imagine a situation in which this algorithm would fail, but for the most part it performs remarkably well. If nothing else, it’s a good deal better than any of the current alternatives.

Footnotes

  1. Even for some RSS 2.0 feeds this would be impractical, since many still use the 0.91 content model, which limits their description elements to plain text.
  2. This was one of the issues discussed by Ian Forrester at Xtech 2005.
  3. According to Robert Sayre, Firefox will most likely implement something similar in a future release.
  4. Essentially character types R and AN.
  5. Only character type L (not EN).