詹姆斯 James

RSS Duplicate Detection

August 18th, 2006

In order to be irreplaceable one must always be different. Gabrielle “Coco” Chanel, 1883–1971

Detecting duplicate items in an RSS feed is something of a black art. How does one uniquely identify an item in a feed while still allowing for that item to be updated? RSS 2.0 has a guid element that fits the bill perfectly, but it’s not a required element and many feeds don’t use it. As a result, aggregator authors are left guessing, and nearly every one of them guesses differently.

I can’t say for sure what algorithms applications are using, but after running 150 tests on more than 20 different aggregators, I think I have a fair idea how many of them work.

As you would expect, most aggregators treat the guid as the key element for determining duplicates. This is pretty straightforward. If two items have the same guid they are considered duplicates; if their guids differ then they are considered different.

If a feed doesn’t contain guids, though, aggregators will most likely resort to one of three general strategies – all of which involve the link element in some way.

What to do when there is no guid

Some will use the link as the primary fallback – for two items to be considered different, their links must be different. Only if items have no link (and obviously no guid) will they compare other elements such as the title or description (sometimes the date). IE7, RSS Bandit and Google Reader are all aggregators that have taken this approach.
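As a rough illustration, the link-first fallback could be sketched as below. The `Item` class and the `is_duplicate_link_first` function are hypothetical names invented for this example, and the handling of a pair where only one item has a guid or a link is an assumption; the test results don't reveal how IE7, RSS Bandit or Google Reader actually implement this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for a parsed RSS item."""
    guid: Optional[str] = None
    link: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None


def is_duplicate_link_first(a: Item, b: Item) -> bool:
    """guid first, then link, then title/description (assumed ordering)."""
    # Assumption: the guid only decides when both items carry one.
    if a.guid is not None and b.guid is not None:
        return a.guid == b.guid
    # Primary fallback: items that both have links are compared on the link alone.
    if a.link is not None and b.link is not None:
        return a.link == b.link
    # Last resort: no usable guid or link, so compare the remaining text.
    return (a.title, a.description) == (b.title, b.description)


old = Item(link="http://example.com/post", title="Helo world")
new = Item(link="http://example.com/post", title="Hello world")
print(is_duplicate_link_first(old, new))  # True: same link, so treated as an update
```

With this approach a corrected title is treated as an update to the existing item, because the link alone decides.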

Another common technique is to fall back to either the link or the title element (sometimes other elements too). A difference in any one of these is enough for the aggregator to consider the items different. FeedDemon, NewsRiver and Snarfer do something like this, although they all differ in their choice of applicable elements.
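A minimal sketch of this "any difference counts" rule follows. The function name `is_duplicate_any_field` and the default field list are assumptions made for illustration; each real aggregator picks its own set of elements.

```python
from typing import Mapping, Optional, Sequence


def is_duplicate_any_field(
    a: Mapping[str, Optional[str]],
    b: Mapping[str, Optional[str]],
    fields: Sequence[str] = ("link", "title"),  # assumed default; varies by aggregator
) -> bool:
    """Two items are duplicates only if every compared field matches."""
    return all(a.get(f) == b.get(f) for f in fields)


old = {"link": "http://example.com/post", "title": "Helo world"}
new = {"link": "http://example.com/post", "title": "Hello world"}
print(is_duplicate_any_field(old, new))                    # False: the title changed
print(is_duplicate_any_field(old, new, fields=("link",)))  # True: only the link is compared
```

The cost of this strategy is visible in the first call: a one-letter title correction is enough to produce a new item.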

Less commonly, there are some aggregators that require a difference in the link element in addition to either the title or the description (Sharpreader and Omea Reader being two examples). If there are no link elements then things get slightly more complicated, but the basic idea is the same.
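The stricter rule, where a changed link is not enough on its own, might be sketched like this. `is_different_link_and_text` is a hypothetical name, and the sketch deliberately leaves out the more complicated case where the items have no link elements at all.

```python
def is_different_link_and_text(a: dict, b: dict) -> bool:
    """Items differ only if the link differs AND the title or description differs."""
    link_differs = a.get("link") != b.get("link")
    text_differs = (
        a.get("title") != b.get("title")
        or a.get("description") != b.get("description")
    )
    return link_differs and text_differs


base = {"link": "http://example.com/1", "title": "Post", "description": "Body"}
moved = dict(base, link="http://example.com/one")           # only the link changed
rewritten = dict(moved, title="Another post")               # link and title changed

print(is_different_link_and_text(base, moved))      # False: still the same item
print(is_different_link_and_text(base, rewritten))  # True: a new item
```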

Those that ignore the guid

As I said initially, most aggregators consider the guid the key element for determining duplicates, but that’s not always the case. For some, it is the link element that is the most significant distinguishing factor.

For these aggregators, two items with exactly matching guids will still be considered different unless their link elements also match. Similarly, two items with differing guids may still be considered duplicates of each other unless their link elements are also different. The title and description elements can also come into play, but the guid by itself is never enough.
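Reduced to its simplest form, a link-dominant comparison could look like the sketch below. This is a deliberate simplification: the hypothetical `is_duplicate_link_dominant` function ignores the guid entirely, assumes both items have a link, and does not model the title and description checks that some of these aggregators add.

```python
def is_duplicate_link_dominant(a: dict, b: dict) -> bool:
    """Ignore the guid; the link decides (simplifying assumption: both have links)."""
    return a.get("link") == b.get("link")


# Matching guids, different links: still treated as two separate items.
print(is_duplicate_link_dominant(
    {"guid": "abc", "link": "http://example.com/1"},
    {"guid": "abc", "link": "http://example.com/2"},
))  # False

# Different guids, matching links: treated as the same item.
print(is_duplicate_link_dominant(
    {"guid": "abc", "link": "http://example.com/1"},
    {"guid": "xyz", "link": "http://example.com/1"},
))  # True
```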

BlogBridge, RssReader and Thunderbird are examples of aggregators that do something along these lines, although their specific implementations are quite different.

Other special cases

Finally, I should mention that there are a few aggregators that don’t really fall into any of the categories described above. Rojo, for example, uses the title element as a fallback when there is no guid. BottomFeeder uses the description element. FeedReader accepts either – in other words, two items would be considered different if either their titles or descriptions were different.

Bloglines appears to support guids with a fallback to link plus title or description (much like Sharpreader). However, if the guids are not permalinks, the fact that two items have the same guid will not be enough for them to be considered the same – their titles and descriptions would also have to match.

Similarly, if two items have different guids, that is not enough for those items to be considered different – either their titles or descriptions must also be different. This is not uncommon though; several other aggregators do something similar, most likely as a means of handling badly generated feeds that assign new guids on every refresh.
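To make the behaviour observed for Bloglines a little more concrete, here is one way the guid handling could be written down. It is only a reading of the test results, not Bloglines' actual code: the `Item` class and function name are hypothetical, the no-guid fallback is left out, and applying the "different guids" rule regardless of the permalink flag is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical parsed item that always carries a guid."""
    guid: str
    title: str
    description: str
    # RSS 2.0 treats a missing isPermaLink attribute as true.
    is_permalink: bool = True


def is_duplicate_guid_with_text_check(a: Item, b: Item) -> bool:
    text_matches = a.title == b.title and a.description == b.description
    if a.guid == b.guid:
        # A shared permalink guid is enough on its own.
        if a.is_permalink and b.is_permalink:
            return True
        # A shared non-permalink guid also needs matching title and description.
        return text_matches
    # Different guids are not enough to separate the items:
    # the title or description must differ as well.
    return text_matches


a = Item("tag:example.com,2006:1", "Post", "Body", is_permalink=False)
b = Item("tag:example.com,2006:1", "Post (updated)", "Body", is_permalink=False)
print(is_duplicate_guid_with_text_check(a, b))  # False: same guid, but the title changed
```

The final branch is what protects against feeds that assign new guids on every refresh: if nothing but the guid has changed, the items are still treated as the same.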

Recommendations for publishers

The most obvious recommendation is that you should always include guids in your feeds. If you don’t know what that entails, I would urge you to read Mark Pilgrim’s excellent article on ID creation. Some of what he describes there is specific to the Atom syndication format, but much of it applies equally well to RSS.

In addition, I would recommend you also include a unique link element for each item in your feed, to allow for aggregators that don’t handle guids very well. No two items should ever have the same link element, and ideally a link should never change (if you do update a link, be aware that it could show up as a new item for some aggregators).
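Publishers who want to check their own feed against these two recommendations could use something like the following sketch, built on Python's standard `xml.etree.ElementTree` module. The `find_feed_problems` function and the sample feed are made up for illustration, and the check only looks at a single snapshot of the feed, so it cannot tell whether a link or guid has changed over time.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List


def find_feed_problems(rss_xml: str) -> List[str]:
    """Report items with missing or repeated guid/link values."""
    problems = []
    items = ET.fromstring(rss_xml).findall("./channel/item")
    for tag in ("guid", "link"):
        # Treat an absent or empty element as missing.
        values = [(item.findtext(tag) or "").strip() for item in items]
        missing = values.count("")
        if missing:
            problems.append(f"{missing} item(s) without a {tag}")
        for value, n in Counter(v for v in values if v).items():
            if n > 1:
                problems.append(f"{tag} {value!r} used by {n} items")
    return problems


FEED = """<rss version="2.0"><channel>
  <title>Example</title><link>http://example.com/</link>
  <description>Example feed</description>
  <item><title>One</title><link>http://example.com/1</link>
        <guid>http://example.com/1</guid></item>
  <item><title>Two</title><link>http://example.com/1</link></item>
</channel></rss>"""

for problem in find_feed_problems(FEED):
    print(problem)
```

For the sample feed this reports one item without a guid and one link shared by two items.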

Finally, although this is not essential, it is advisable that you refrain from updating your article titles if at all possible. There are at least two aggregators that will consider an entry with an altered title to be a completely new post – somewhat annoying to readers when all you’ve done is make a spelling correction in your title.

Recommendations for aggregators

As you’ve seen, there are a number of techniques that aggregators can use to detect duplicates. In the few that I tested for this article, I encountered at least 15 variations. There’s no way I can point to any one in particular and say that’s the best way to do it.

I will say, however, that supporting guids seems like an essential starting point. Also, when a feed doesn’t contain guids, the link element is probably a good fallback (possibly combined with or as an alternative to other elements). After that, it’s really a matter of personal preference.
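One possible starting point, and no more than that, is to derive a lookup key from the guid with the link as a fallback, then add or update items in a store under that key. The names `item_key` and `merge_items`, the key prefixes, and the title-plus-description last resort are all assumptions made for this sketch.

```python
from typing import Dict, List, Optional


def item_key(item: Dict[str, Optional[str]]) -> Optional[str]:
    """Prefer the guid, fall back to the link; None if neither is present."""
    if item.get("guid"):
        return "guid:" + str(item["guid"])
    if item.get("link"):
        return "link:" + str(item["link"])
    return None


def merge_items(store: Dict[str, dict], fetched: List[dict]) -> int:
    """Add or update fetched items in store; return how many were new."""
    new_count = 0
    for item in fetched:
        key = item_key(item)
        if key is None:
            # Hypothetical last resort when there is neither guid nor link.
            key = "text:" + repr((item.get("title"), item.get("description")))
        if key not in store:
            new_count += 1
        store[key] = item  # an existing entry is overwritten, i.e. updated
    return new_count


store: Dict[str, dict] = {}
print(merge_items(store, [{"guid": "1", "title": "Helo world"}]))   # 1 new item
print(merge_items(store, [{"guid": "1", "title": "Hello world"}]))  # 0 new items
print(store["guid:1"]["title"])                                     # Hello world
```

Keying on an identifier rather than comparing whole items is what lets an item be updated without being shown again as new.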

Whatever you do though – however carefully you think things through – you can always be sure that some feed, somewhere, will still manage to screw things up completely. But that’s what makes RSS so much fun.