Archive for July, 2008

Tweaking PNG transparency with ImageMagick

Tuesday, July 22nd, 2008

This took me way too long to find out, so I thought I’d blog here and hopefully save someone else some time.

ImageMagick is a great Swiss-army-knife type tool, with a shed-load of options for converting and combining images. Unfortunately, the sheer number of options can make it time-consuming and frustrating to find the one you want.

My aim was simple: given a PNG, make the whole thing semi-transparent.

Searching Google using “transparent” and “opacity” drew a blank - all I got was instructions on how to set transparency for certain colours - not what I wanted to do.

The word I was missing was “alpha”, and the magic incantation for changing the opacity of the whole image is:


convert input.png -channel Alpha -evaluate Divide 2 output.png

In my case, I wanted to set the PNG to be 50% transparent (hence “Divide 2”). Of course, you can change that number to whatever works for you.
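As a rough illustration of what dividing the alpha channel means for the pixel data, the sketch below halves the alpha byte of every pixel in a flat RGBA array. The function name divideAlpha and the four-bytes-per-pixel layout are assumptions made for this example; it is not how ImageMagick is implemented internally.

```javascript
// Hypothetical helper: scales down the alpha channel of raw RGBA pixel data.
// Assumes 4 bytes per pixel in R, G, B, A order (as in a canvas ImageData).
function divideAlpha(rgba, divisor) {
  const out = Uint8ClampedArray.from(rgba);
  // Every 4th byte, starting at index 3, is an alpha value.
  for (let i = 3; i < out.length; i += 4) {
    out[i] = Math.round(out[i] / divisor);
  }
  return out;
}

// One opaque red pixel and one already half-transparent blue pixel.
const pixels = [255, 0, 0, 255, 0, 0, 255, 128];
const halved = divideAlpha(pixels, 2);
console.log(Array.from(halved)); // [255, 0, 0, 128, 0, 0, 255, 64]
```

Note that pixels which were already partly transparent end up more transparent still, because the division is relative to each pixel's existing alpha.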

Microformats, dark data and CSS - part 2

Monday, July 21st, 2008

The first part of this article considered over 100 HTML 4 attributes and came to the conclusion class was the only one suitable for storing machine data (i.e. data specifically inserted and intended for machine parsing.)

In this second part, I’ll review several ways to store data in the class attribute, determine the ‘best’ method, and suggest a CSS implementation change that is (IMO) both trivial and immensely beneficial.

We start by considering the definition of the class attribute, how its value is interpreted, and what restrictions this places on us for storing data.

Isn’t class object-oriented?

Some people say class has an object-oriented use as though (X)HTML and CSS are object-oriented languages, with inheritance based on class values. But that’s not how things work: inheritance is based on parent/child relationships, with everything else determined by “the cascade”.

Let me illustrate with a contact directory example I hope isn’t too contrived.

Contact phone numbers are styled using common fonts and padding, but with different background-images based on the type of phone number (home, work, fax etc.) Using a top-level concept class of tel, we “subclass” using home, work and fax.

Phone numbers can be output and formatted using multiple classes working together:

<span class="tel home">+1 212 123 1234</span>

Because home is such a generic term, we’d write CSS using a 2-class selector like this:


.tel { font: ...; padding-left: 16px; background: transparent no-repeat left center; }
.tel.home { background-image: url(icons/tel-home.gif); }
.tel.work { background-image: url(icons/tel-work.gif); }
.tel.fax { background-image: url(icons/tel-fax.gif); }

Dropping tel from the mark-up would cause all styling to be lost - the value home on its own does not encapsulate enough information to determine its position in a class hierarchy. Later on, I’ll come back to this and suggest hyphenation as an option that may embody a class relationship more explicitly.

Unordered class data

By definition, class is an unordered set of white-space separated values. The values “tel home” and “home tel” should be treated the same, with the CSS selector “.tel.home” applying with equal specificity to both numbers below.


<span class="tel home">+1 212 12341 12112</span>
<span class="home tel">+1 212 12341 12112</span>
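A script that wants the same order-independent behaviour can treat the attribute value as a set of tokens. This is only a minimal sketch; classSet and hasClasses are hypothetical helper names, not part of any library mentioned here.

```javascript
// Hypothetical helper: turns a class attribute value into an unordered set of tokens.
function classSet(classAttr) {
  return new Set(classAttr.split(/\s+/).filter(Boolean));
}

// True when every requested class is present, regardless of the order
// in which the classes were written in the mark-up.
function hasClasses(classAttr, required) {
  const tokens = classSet(classAttr);
  return required.every((name) => tokens.has(name));
}

console.log(hasClasses("tel home", ["tel", "home"])); // true
console.log(hasClasses("home tel", ["tel", "home"])); // true
console.log(hasClasses("home", ["tel", "home"]));     // false
```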

We must bear this ordering-independence in mind when storing data in class. Trying to store multiple bits of data in sequential order cannot work - e.g. a conference schedule:


...
<li><span class="dtstart 9:00 dtend 9:15" title="9am">09:00</span> - Registration</li>
<li><span class="dtstart 9:15 dtend 10:30" title="9:15am">09:15</span> - Keynote</li>
<li><span class="dtstart 10:30 dtend 10:45" title="10:30am">10:30</span> - Coffee</li>
<li><span class="dtstart 10:45 dtend 12:00" title="10:45am">10:45</span> - Session 1</li>
<li><span class="dtstart 13:00 dtend 14:00" title="1pm">13:00</span> - Session 1</li>
...

Note: in this example, humans are supposed to infer end-times by looking at the start-time of the following event. We include machine-data for end-times because “inference” is not easy for programmers to implement.

Although the order is clear and correct in the mark-up, browsers, parsers and libraries have no obligation to maintain the order when accessed. e.g. a “classes” method could return an arbitrarily ordered array of classes:


// fetch the classes for the first item in the schedule:
var classes = $('.dtstart:nth(0)').classes();
// may output: ["9:00", "9:15", "dtend", "dtstart"]

Without further labouring, the take-home point is: data in class-values cannot rely on ordering.

The necessity of prefixes

You may not be 100% certain how your content will be processed or transformed, or what corruption it may suffer; but you can at least attempt to mitigate disaster.

For example: times embedded in machine-data can be arbitrarily precise, from specifying years on their own (“2008”), to fully specifying a time-zone and exact second of an event (“20080721T124032+0100”). The longer format is unlikely to cause confusion (to machines), but the shorter variants could easily be mistaken for model numbers, e.g. the ISSN of periodicals for sale:


<li><a href="..." class="issn 02624079 dtstart 20080719" title="New Scientist dated 19th July 2008">New Scientist no. 2665</a></li>

As we can’t rely on ordering, we need to join the data-type and the data-value together. A few approaches have been suggested, including wrapping the value, or concatenating the pieces with an arbitrary separator - I suggest using a hyphen, which I’ll justify in a minute:


<a href="..." class="issn{02624079} dtstart{20080719}">
<a href="..." class="issn#02624079 dtstart#20080719">
<a href="..." class="issn-02624079 dtstart-20080719">
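To show why joining type and value removes the dependence on ordering, here is a small sketch of a parser for the hyphenated form. The name parseClassData is made up for this example, and splitting at the first hyphen is an assumption about how such a parser might behave.

```javascript
// Hypothetical parser: splits each class token at its first hyphen into a
// data-type and a data-value. Tokens without a hyphen are ignored.
function parseClassData(classAttr) {
  const data = {};
  for (const token of classAttr.split(/\s+/).filter(Boolean)) {
    const cut = token.indexOf("-");
    if (cut > 0) {
      data[token.slice(0, cut)] = token.slice(cut + 1);
    }
  }
  return data;
}

// The same data is recovered whichever way round the classes are written.
const forwards = parseClassData("issn-02624079 dtstart-20080719");
const backwards = parseClassData("dtstart-20080719 issn-02624079");
console.log(forwards.issn, forwards.dtstart);   // 02624079 20080719
console.log(backwards.issn, backwards.dtstart); // 02624079 20080719
```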

The hyphenated-prefix selector [attribute|=prefix]

CSS 2 introduced several attribute selectors, including one I’m calling the hyphenated-prefix selector.

The specification admits the primary purpose of this selector is for matching language subcodes; i.e. where CSS rules need only apply to content written in some subset of natural languages:


[lang|=en] blockquote, [lang|=en] q, blockquote[lang|=en], q[lang|=en] { quotes: '“' '”'; }
[lang|=de] blockquote, [lang|=de] q, blockquote[lang|=de], q[lang|=de] { quotes: '«' '»'; }

The rules above specify different quote-marks for German and English. Using the prefix selector means the appropriate rule applies to all English languages, including “en-GB” and “en-US”, as well as content marked no more specifically than lang="en". Similarly, the ‘de’ rule applies to all German languages.

However, this selector can just as easily be applied to classes. We can rewrite the telephone-number example as:


<span class="tel-work">+1 212 800 1234</span>
<span class="tel-home">+1 212 123 1234</span>

[class|=tel] { font: …; padding-left: 16px; background: transparent no-repeat left center; }
.tel-home { background-image: url(icons/tel-home.gif); }
.tel-work { background-image: url(icons/tel-work.gif); }
.tel-fax { background-image: url(icons/tel-fax.gif); }

Relaxing the hyphenated-prefix rules

Sadly, the hyphenated-prefix is overly-restricted. In the following example, only one rule applies:


[class|=issn] { font-weight: bold }
[class|=dtstart] { background-image: url(bg/microformat.gif); }

<li><a href="…" class="issn-02624079 dtstart-20080719" title="New Scientist dated 19th July 2008">New Scientist no. 2665</a></li>

The problem is due to the way [attribute|=prefix] is defined:

Match when the element’s “att” attribute value is a hyphen-separated list of “words”, beginning with “val”. The match always starts at the beginning of the attribute value. This is primarily intended to allow language subcode matches (e.g., the “lang” attribute in HTML) as described in RFC 1766 ([RFC1766]).

(Emphasis added.)

If the definition had instead been made to cater for a white space separated set of hyphenated tokens, we’d be in a much better position for styling and parsing machine-data microformats today.
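The gap between the two behaviours can be sketched in a few lines. Both function names below are hypothetical: matchesAsSpecified mimics the CSS 2 definition quoted above, and matchesRelaxed applies the same test to each white space separated token instead of to the whole attribute value.

```javascript
// [att|=val] as CSS 2 defines it: the whole attribute value is either exactly
// the prefix, or starts with the prefix immediately followed by a hyphen.
function matchesAsSpecified(attributeValue, prefix) {
  return attributeValue === prefix || attributeValue.startsWith(prefix + "-");
}

// The relaxed behaviour argued for here: run the same test per token.
function matchesRelaxed(attributeValue, prefix) {
  return attributeValue
    .split(/\s+/)
    .some((token) => matchesAsSpecified(token, prefix));
}

const value = "issn-02624079 dtstart-20080719";
console.log(matchesAsSpecified(value, "issn"));    // true
console.log(matchesAsSpecified(value, "dtstart")); // false: not at the start
console.log(matchesRelaxed(value, "dtstart"));     // true
```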

[attribute|=prefix] implementations

Surprisingly, the big four browsers (including IE7) all support the hyphenated-prefix selector. JavaScript library support is lacking, though: jQuery, the library I use daily, doesn’t handle the hyphenated-prefix selector, although it’s a simple patch.

Assuming JavaScript libraries (or microformat parsers) already implement attribute-selectors, it’s a simple matter to support white space separated hyphenated-prefixes. The key regular-expression is:

/(^|\s)prefix(-|\s|$)/

Assuming your users know what they’re doing and are willing to fix their own issues after throwing something stupid at your library, the regular-expression is easily built and executed on the fly:


new RegExp("(^|\\s)" + prefix + "(-|\\s|$)").test(attribute)

Or (I think), in XPath 2.0:


//*[matches(@attribute, "(^|\s)prefix(-|\s|$)")]
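If you would rather not assume well-behaved input in the JavaScript version, a slightly more defensive sketch escapes the prefix before building the expression. The helper names escapeRegExp and hasHyphenatedPrefix are hypothetical; the pattern itself is the one given above.

```javascript
// Escapes characters that have a special meaning in a regular expression,
// so a prefix such as "tel.home" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Tests for a white space separated, hyphenated-prefix match.
function hasHyphenatedPrefix(attribute, prefix) {
  const pattern = new RegExp("(^|\\s)" + escapeRegExp(prefix) + "(-|\\s|$)");
  return pattern.test(attribute);
}

console.log(hasHyphenatedPrefix("issn-02624079 dtstart-20080719", "dtstart")); // true
console.log(hasHyphenatedPrefix("dtstarted-20080719", "dtstart"));             // false
console.log(hasHyphenatedPrefix("telxhome-1", "tel.home"));                    // false
```

Without the escaping step, the last call would return true, because the unescaped dot matches any character.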

Encoding data

Quotations and ampersands aside (naturally taken care of by normal (X)HTML encoding rules) there’s an obvious problem when data-values contain white space. Fortunately, there’s also an obvious solution, as several methods exist to encode arbitrary data into continuous strings without any white space. In JavaScript, the methods available include escape, encodeURI and encodeURIComponent, and I’d suggest encodeURI as the best option: it strikes a good balance between encoding data safely and not being so aggressive that the result becomes unreadable to humans.
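A minimal round-trip sketch of that suggestion follows. The helper names encodeClassData and decodeClassData, and the "title" data-type, are assumptions made for illustration only.

```javascript
// Hypothetical helpers: build and read back a single class token of the
// form type-value, using encodeURI so the value contains no white space.
function encodeClassData(type, value) {
  return type + "-" + encodeURI(value);
}

function decodeClassData(token) {
  const cut = token.indexOf("-"); // split at the first hyphen only
  return {
    type: token.slice(0, cut),
    value: decodeURI(token.slice(cut + 1)),
  };
}

const token = encodeClassData("title", "New Scientist");
console.log(token);                        // title-New%20Scientist
console.log(decodeClassData(token).value); // New Scientist
```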

Simplicity is the key

The success of microformats depends on simplicity: using a few attributes and a handful of patterns to invisibly add extra layers of information to existing content.

Hopefully, I haven’t suggested anything in conflict with existing microformats. Hyphenated-prefixes should be viewed as an additional tool in your arsenal, not as a competing or successor solution.

With a more flexible definition of that damned attribute selector, I’m sure the unAPI folks would have produced an even simpler specification, and the arguments around the microformats datetime design pattern would have been resolved years ago.

Though I’m sure it doesn’t show, I’ve written and rewritten this article many times, but it doesn’t get any more complex:

If you want to piggy-back machine-data on existing content, use the class attribute. Separate data-types from data-values using a hyphen, and encode the data using something equivalent to JavaScript’s encodeURI.

That’s all folks

Microformats, dark data and CSS - part 1

Tuesday, July 15th, 2008

There was a bit of a kerfuffle when the BBC dropped support for microformats in their program listings.

You can’t argue with their reasons: data for microformats was being read aloud by screen-readers, popping up as tool-tips, and confusing the hell out of their users.

The microformats community rallied to solve the issue, but Auntie Beeb rejected all their proposals; and due to lack of community support, also back-tracked on their own proposal (inserting data- prefixed values into the class attribute.)

You can’t blame them for rejecting the microformats community’s proposals. This proposal feels particularly torturous:


<p>
  To be held on
  <span class="dtstart dtend">
    <abbr class="value" title="1998-03-12">
      12 March 1998
    </abbr>
  </span>
  from
  <span class="dtstart">
    <abbr class="value" title="08:30">
      8:30am
    </abbr>
    <abbr class="value" title="-0500">
      EST
    </abbr>
  </span>
  until
  <span class="dtend">
    <abbr class="value" title="09:30">
      9:30am
    </abbr>
    <abbr class="value" title="-0500">
      EST
    </abbr>
  </span>
</p>

That suggestion seems to be the result of a 30 minute brain-fart by microformat’s spiritual leader and, like the BBC, I find it “complicated” (I doubt that was the first word that sprang to mind though.)

Let’s go back to the specs and see what HTML gives us to work with. Considering over one hundred attributes in HTML 4, only a handful apply to the elements we’d want to tag (not just <abbr>, but also <span>, <a>, <li> amongst several others.)

The only attributes available to insert microformat data and still pass validation are: class, dir, id, lang, style and title.

We’ve dismissed all event-handling attributes (onwhatever) as they’re supposed to contain script, and we can quickly dismiss a few more attributes now:

dir can only contain two values (ltr and rtl.)

id must be unique per page; that’s a restriction we can’t work with for microformats.

style attributes are supposed to hold CSS properties. It could be subverted using a vendor prefix e.g. style="-mf-dtstart: ...". However, you’d buy yourself a place in hell, and you’d never get support from the microformats community.

That leaves us with lang, title and class.

Why lang?

Good question.

The lang attribute indicates what language the content of an element is held in. Language codes were defined in RFC 1766, since replaced by RFC 3066, which has itself been replaced by a duo of RFCs: 4646 and 4647.

Amongst the long lists of language codes, you can find a few options that should let you embed machine-data without it being read-aloud. e.g. the language-code zxx indicates there’s “no linguistic content”, so you might think a screen-reader would simply skip the content:


<p>
  To be held on 12 March 1998
  <span class="dtstart dtend" lang="zxx">1998-03-12</span>
  from 8:30am EST
  <span class="dtstart" lang="zxx">08:30-0500</span>
  until 9:30am EST
  <span class="dtend" lang="zxx">09:30-0500</span>
</p>

Sadly, a quick trial using accessibility features on a Mac reveals the content is still read aloud. I suspect the other possible language codes - art for “artificial languages”, and the x- prefix for “private use” - would suffer the same fate.

Also, the content would need hiding from sighted users using a simple CSS rule: [lang|=zxx] { display: none } - but this would fail under various viewing conditions (e.g. syndicating data via RSS without styles.)

Finally, the whole idea of marking machine-data using the lang attribute may be frowned upon. RFC 4646 includes this:

Language tags are used to help identify languages, whether spoken, written, signed, or otherwise signaled, for the purpose of communication. This includes constructed and artificial languages, but excludes languages not intended primarily for human communication, such as programming languages.

Looks like the lang attribute’s a bit of a no-no then. That leaves us with class and title.

title vs. class

We’ve already mentioned the accessibility issues with title. The BBC have done the research and user-testing other people only think about. Their results show it’s not just screen-readers that get confused by machine-data in title attributes - sighted users are baffled too.

Fragmenting one machine-readable value into three doesn’t solve the problem - it exacerbates it. The number of elements with mystical tool-tips increases; while human-friendliness increases for some (dates), it decreases for others (timezones).

The title attribute is meant for humans, no matter how you spin it.

All hail the mighty class

I must apologize: did I just have a “title vs. class” debate without mentioning class?

I guess I did. title got disqualified, leaving class to win by default.

(small side-note: out of all the attributes on the short-list, class is the only one defined to contain CDATA - i.e. general-purpose Character DATA. Just thought that might be worth a mention.)

Having whittled down our options to one winner, we now need to decide how we’re going to organize our data, and cram it into the class attribute. We need to take into account the definition of class, and that the attribute-value is treated as an unordered list of white-space separated tokens.

That’s for part 2, where I also propose tweaking a CSS3 selector to turn it from a single niche application into a general-purpose tool in a web designer’s arsenal.

PHP grievance number 1

Thursday, July 10th, 2008

There’s a lot to hate about PHP.

Maybe that’s harsh: nothing’s perfect, every language has its strengths and weaknesses, and no one ever suggested using PHP for everything.

Bearing that in mind, I’m using PHP daily, and you get used to a lot of quirks and foibles, and it’s easy to forget how truly shit it is.

Take arrays for example. Please; please take ‘em.

Skipping the “needle, haystack” parameter ordering farce, there are two things I dislike.

  1. array access warnings
  2. array access

First, the warnings.

The language designers must have made a choice between checked vs. unchecked array access: should this throw a warning, or shouldn’t it:


$titles = array("Philosopher's Stone", "Chamber of Secrets");

// accessing an element that doesn't exist:
$title = $titles[2];

They decided it should - it’s just a “Notice”; easily hidden with appropriate php.ini settings. I can live with that; it’s been a while, but doesn’t Java throw a wobbly when you fall off the end of an array too?

My grievance is here, when you make a typo:


$titles = array("Philosopher's Stone", "Chamber of Secrets");
$tiltes[2] = "Prisoner of Azkaban";

No errors, no warnings, no notices. Idiot-fingers just created a new array $tiltes, assigning a vitally important piece of information to a variable that shouldn’t exist!

That’s a bit stupid, isn’t it? Bit of a design-decision inconsistency?

Onto array access: it would be nice - and when I say nice, I mean obvious, natural and expected - to write:


$head = $dom->get_elements_by_tagname('head')[0];

PHP’s get_elements_by_tagname returns an array; we’re expecting a single <head> element, so we try to skip the bullshit and access it straight off. But we can’t, because PHP’s compiler can’t handle array access on function return values.

Object access? Sure, no problem. Calling a method of an object in an array inside another array is no problem:

$book['chapter']['title']->display('html')

But if you want to access an array element returned from a function, choose a different language.

Fixing mysqldump character-encoding in Vim

Wednesday, July 9th, 2008

If you find yourself in a position where your mysqldump backup/restore process isn’t working, it’s worth checking for character-encoding issues - and the best way to do this is often to look at the SQL in your backup file.

To tell vim you prefer working in unicode, you may have added some settings to your .vimrc:

set encoding=utf-8 fenc=utf-8

That doesn’t get vim to treat existing files as UTF-8 though! Vim tries to figure-out the encoding itself, and may well get it wrong.

Looking at the mysqldump file, you could see garbage like:

One naÃ¯ve approach

Something’s horribly wrong there. It looks like an encoding issue, and a quick :set fenc shows you whether vim opened the file in latin1 or utf-8.
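To see where that kind of garbage comes from, the Node.js sketch below encodes a string as UTF-8 and then decodes the very same bytes as latin1, which is what an editor does when it guesses the wrong encoding. It assumes Node's global Buffer is available; the variable names are made up for the example.

```javascript
// Node.js sketch: the same bytes read with two different encodings.
const original = "One na\u00EFve approach"; // \u00EF is i with diaeresis
const bytes = Buffer.from(original, "utf8");

// Read back as latin1: each of the two UTF-8 bytes becomes its own character.
const misread = bytes.toString("latin1");
// Read back as UTF-8: the two bytes are recombined into one character.
const reread = bytes.toString("utf8");

console.log(misread === "One na\u00C3\u00AFve approach"); // true
console.log(reread === original);                         // true
```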

You can force vim to re-open the file in utf-8 using:

:e ++enc=utf8 %

Hopefully, you now see:

One naïve approach

Now you know you’ve got unicode in your mysqldump, you can fix the restore process with a bit of search-and-replace on connection settings and table-creation statements, i.e. you’re looking for lines like:

/*!40101 SET NAMES latin1 */;

…and…

ENGINE=MyISAM DEFAULT CHARSET=latin1;

Switch those from latin1 to utf8 and, fingers-crossed, you should be able to restore your db backup, and upgrade all your tables to utf8 in the process.
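If you’d rather script that search-and-replace than do it by hand, a sketch along these lines should cover the two kinds of line shown above. The function name switchDumpToUtf8 is hypothetical, and it assumes the dump only mentions latin1 in SET NAMES and DEFAULT CHARSET clauses; check the result before restoring anything.

```javascript
// Hypothetical helper: rewrites the latin1 connection and table settings
// in the text of a mysqldump file to utf8, leaving everything else alone.
function switchDumpToUtf8(sql) {
  return sql
    .replace(/(SET NAMES )latin1\b/g, "$1utf8")
    .replace(/(DEFAULT CHARSET=)latin1\b/g, "$1utf8");
}

const sample = [
  "/*!40101 SET NAMES latin1 */;",
  ") ENGINE=MyISAM DEFAULT CHARSET=latin1;",
].join("\n");

console.log(switchDumpToUtf8(sample));
// /*!40101 SET NAMES utf8 */;
// ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```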