Microformats, dark data and CSS - part 1

There was a bit of a kerfuffle when the BBC dropped support for microformats in their program listings.

You can’t argue with their reasons: data for microformats was being read aloud by screen-readers, popping up as tool-tips, and confusing the hell out of their users.

The microformats community rallied to solve the issue, but Auntie Beeb rejected all their proposals and, for lack of community support, also back-tracked on their own proposal (inserting data- prefixed values into the class attribute).

You can’t blame them for rejecting the microformats community’s proposals. This proposal feels particularly torturous:

<p>
  To be held on
  <span class="dtstart dtend">
    <abbr class="value" title="1998-03-12">12 March 1998</abbr>
  </span>
  from
  <span class="dtstart">
    <abbr class="value" title="08:30">8:30am</abbr>
    <abbr class="value" title="-0500">EST</abbr>
  </span>
  until
  <span class="dtend">
    <abbr class="value" title="09:30">9:30am</abbr>
    <abbr class="value" title="-0500">EST</abbr>
  </span>
</p>

That suggestion seems to be the result of a 30-minute brain-fart by the microformats community’s spiritual leader and, like the BBC, I find it “complicated” (I doubt that was the first word that sprang to mind, though).
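To get a feel for what that markup asks of a parser, here is a rough Python sketch that stitches the fragments back into full timestamps. It is not how any real microformats parser works: the names FragmentCollector and reassemble are made up for this illustration, and the joining rule (the first fragment is the date, everything after it is time and timezone) is an assumption read off this one sample.

```python
from html.parser import HTMLParser

# The proposal's markup, with only the whitespace tidied.
MARKUP = """
<p>To be held on
  <span class="dtstart dtend">
    <abbr class="value" title="1998-03-12">12 March 1998</abbr>
  </span>
  from
  <span class="dtstart">
    <abbr class="value" title="08:30">8:30am</abbr>
    <abbr class="value" title="-0500">EST</abbr>
  </span>
  until
  <span class="dtend">
    <abbr class="value" title="09:30">9:30am</abbr>
    <abbr class="value" title="-0500">EST</abbr>
  </span>
</p>
"""

PROPERTIES = ("dtstart", "dtend")


class FragmentCollector(HTMLParser):
    """Hypothetical helper: gathers abbr.value titles per property name."""

    def __init__(self):
        super().__init__()
        self.open_properties = []  # properties named on the current span
        self.fragments = {}        # property name -> list of title values

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        tokens = (attributes.get("class") or "").split()
        if tag == "span":
            self.open_properties = [t for t in tokens if t in PROPERTIES]
        elif tag == "abbr" and "value" in tokens:
            # One fragment may feed several properties (the shared date).
            for name in self.open_properties:
                self.fragments.setdefault(name, []).append(
                    attributes.get("title") or "")

    def handle_endtag(self, tag):
        # Assumes spans are not nested, which holds for this sample only.
        if tag == "span":
            self.open_properties = []


def reassemble(markup):
    """Join fragments as date + 'T' + time + offset (assumed ordering)."""
    collector = FragmentCollector()
    collector.feed(markup)
    collector.close()
    return {name: parts[0] + "T" + "".join(parts[1:])
            for name, parts in collector.fragments.items()}


print(reassemble(MARKUP))
```

For the sample above this yields 1998-03-12T08:30-0500 for dtstart and 1998-03-12T09:30-0500 for dtend: eight elements and a stateful parser to recover two values.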

Let’s go back to the specs and see what HTML gives us to work with. Of the hundred-plus attributes in HTML 4, only a handful apply to all the elements we’d want to tag (not just <abbr>, but also <span>, <a> and <li>, amongst several others).

The only attributes available to insert microformat data and still pass validation are: class, dir, id, lang, style and title.

We’ve dismissed all event-handling attributes (onclick and the rest of the on* family) as they’re supposed to contain script, and we can quickly dismiss a few more attributes now:

dir can only contain two values (ltr and rtl).

id must be unique per page; that’s a restriction we can’t work with for microformats.

The style attribute is supposed to hold CSS declarations. It could be subverted using a vendor prefix, e.g. style="-mf-dtstart: ...", but you’d buy yourself a place in hell, and you’d never get support from the microformats community.

That leaves us with lang, title and class.

Why lang?

Good question.

The lang attribute indicates what language the content of an element is held in. Language codes were defined in RFC 1766, since replaced by RFC 3066, which has itself been replaced by a pair of RFCs: 4646 and 4647.

Amongst the long lists of language codes, you can find a few options that look as though they should let you embed machine-data without it being read aloud. For example, the language code zxx indicates there’s “no linguistic content”, so you might think a screen-reader would simply skip the content:

<p>
  To be held on 12 March 1998
  <span class="dtstart dtend" lang="zxx">1998-03-12</span>
  from 8:30am EST
  <span class="dtstart" lang="zxx">08:30-0500</span>
  until 9:30am EST
  <span class="dtend" lang="zxx">09:30-0500</span>
</p>

Sadly, a quick trial using the accessibility features on a Mac reveals the content is still read aloud. I suspect the other possible language codes - art for “artificial languages”, and the x- prefix for “private use” - would suffer the same fate.

Also, the content would need hiding from sighted users with a simple CSS rule: [lang|=zxx] { display: none } - but this would fail under various viewing conditions (e.g. when the data is syndicated via RSS without styles).
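For anyone unfamiliar with the |= operator in that rule: CSS 2.1 defines it to match an attribute whose value is exactly the given string, or begins with that string immediately followed by a hyphen. A minimal Python sketch of the matching logic follows; the function name is invented for illustration, and the comparison is case-sensitive, which is a simplification.

```python
def matches_dash_prefix(attribute_value, wanted):
    """Mimic the CSS [att|=val] test: exact match, or 'val-' prefix.

    attribute_value is None when the element lacks the attribute.
    Case-sensitive comparison is a simplification made for this sketch.
    """
    if attribute_value is None:
        return False
    return attribute_value == wanted or attribute_value.startswith(wanted + "-")


# A few values checked against [lang|=zxx]
for value in ("zxx", "zxx-x-machine", "zxxy", "en", None):
    print(repr(value), matches_dash_prefix(value, "zxx"))
```

So the rule would hide lang="zxx" and any sub-tagged variant of it, but not an unrelated code that merely starts with the same three letters.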

Finally, the whole idea of marking machine-data using the lang attribute may be frowned upon. RFC 4646 includes this:

Language tags are used to help identify languages, whether spoken, written, signed, or otherwise signaled, for the purpose of communication. This includes constructed and artificial languages, but excludes languages not intended primarily for human communication, such as programming languages.

Looks like the lang attribute’s a bit of a no-no then. That leaves us with class and title.

title vs. class

We’ve already mentioned the accessibility issues with title. The BBC have done the research and user-testing other people only think about. Their results show it’s not just screen-readers that get confused by machine-data in title attributes - sighted users are baffled too.

Fragmenting one machine-readable value into three doesn’t solve the problem - it exacerbates it. The number of elements with mystical tool-tips increases; while human-friendliness increases for some (dates), it decreases for others (timezones).

The title attribute is meant for humans, no matter how you spin it.

All hail the mighty class

I must apologize: did I just have a “title vs. class” debate without mentioning class?

I guess I did. title got disqualified, leaving class to win by default.

(Small side-note: out of all the attributes on the short-list, class is the only one the HTML 4 DTD declares as plain CDATA - i.e. general-purpose Character DATA - rather than as a more specific type such as %Text; or %StyleSheet;. Just thought that might be worth a mention.)

Having whittled down our options to one winner, we now need to decide how we’re going to organize our data and cram it into the class attribute. We need to take into account the definition of class: the attribute value is treated as an unordered list of white-space separated tokens.
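What “unordered list of white-space separated tokens” means in practice can be sketched in a few lines of Python. The helper name class_tokens is made up, and str.split() treats a slightly wider set of characters as white space than HTML does, so take this as an approximation rather than a faithful tokenizer.

```python
def class_tokens(class_value):
    """Split a class attribute value into an unordered set of tokens.

    str.split() with no arguments collapses runs of white space and
    ignores leading/trailing white space; a frozenset then discards
    both ordering and duplicates.
    """
    return frozenset(class_value.split())


first = class_tokens("dtstart dtend")
second = class_tokens("  dtend\tdtstart dtend ")

print(sorted(first))    # sorted only to get a stable display order
print(first == second)  # the two attribute values are equivalent
```

Order, spacing and repetition all vanish, so any scheme for packing data into class has to keep each piece of data inside a single token and cannot rely on one token following another.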

That’s for part 2, where I also propose tweaking a CSS3 selector to turn it from a single niche application into a general-purpose tool in a web designer’s arsenal.

Update: This post predates HTML support for two features which cover the use-cases microformats tried to address: microdata and custom data attributes.