W3C Overview of Alt Usage

There is some contention over whether the advice that the current HTML5 specification gives about using alt is appropriate. One overarching part of the disagreement concerns whether the alt text should be intended as a replacement for the image, rather than as content of "equivalent purpose".

To see whether real-world data might help make progress on this disagreement, I made a quick pass through the Paciello Group Dataset to extract alt usage. I make no claim that this is sufficient research, but I'm sharing it early in case it helps.

I extracted every img[alt] that had surrounding text inside the same container (where the container is the first ancestor that isn't phrasing content), and placed the alt in the context of that surrounding text (the alt itself is shown in italicised red). Obviously, judging actual usage requires human analysis (or more multilingual NLP-fu than I can muster), but I have found scanning the shorter version of the generated report already quite instructive, at least for the languages I understand.
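
For readers who want to try something similar, here is a minimal sketch of that kind of extraction in Python. It is not the script used to generate the reports. The names (PHRASING, TreeBuilder, alt_in_context, both_sides) are hypothetical, the phrasing-content list is an abbreviated assumption, and the standard-library html.parser does not apply HTML5 error recovery such as implicitly closed p elements, so a conformant HTML5 parser would be a better fit for real-world pages.

```python
from html.parser import HTMLParser

# Assumption: an abbreviated list of phrasing-content elements, not the full set.
PHRASING = {"a", "abbr", "b", "bdi", "bdo", "br", "cite", "code", "data", "dfn",
            "em", "i", "img", "kbd", "label", "mark", "q", "s", "samp", "small",
            "span", "strong", "sub", "sup", "time", "u", "var", "wbr"}
# Void elements never have children, so they must not become the open element.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input", "link",
        "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None, alt=None):
        self.tag, self.parent, self.alt = tag, parent, alt
        self.children = []  # mix of Node objects and plain text strings


class TreeBuilder(HTMLParser):
    """Builds a very small element tree and remembers every img with an alt."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A valueless alt attribute still matches img[alt]; treat it as empty.
        alt = (attrs["alt"] or "") if "alt" in attrs else None
        node = Node(tag, self.current, alt)
        self.current.children.append(node)
        if tag == "img" and alt is not None:
            self.images.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def container_of(img):
    """Return the first ancestor of img that is not phrasing content."""
    node = img.parent
    while node.parent is not None and node.tag in PHRASING:
        node = node.parent
    return node


def split_text(container, img):
    """Collect the container's text before and after img, in document order."""
    before, after, seen = [], [], False

    def walk(node):
        nonlocal seen
        for child in node.children:
            if child is img:
                seen = True
            elif isinstance(child, str):
                (after if seen else before).append(child)
            elif child.tag not in ("script", "style"):
                walk(child)

    walk(container)
    collapse = lambda parts: " ".join("".join(parts).split())
    return collapse(before), collapse(after)


def alt_in_context(html, both_sides=False):
    """Return (text_before, alt, text_after) for each img[alt] with nearby text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = []
    for img in builder.images:
        before, after = split_text(container_of(img), img)
        keep = (before and after) if both_sides else (before or after)
        if keep:
            results.append((before, img.alt, after))
    return results
```

Calling alt_in_context(page_source) keeps every image that has text on at least one side within its container; passing both_sides=True keeps only images that sit inside the textual flow.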

Suggestions for refining this analysis — or extracting more useful data in a similar manner — are very much welcome.

Be warned that these generated reports are a bit big. The full report is 13MB; it contains all instances of alt that had text on either side. The inline report is smaller at 2MB; it contains only cases in which the alt had text on both sides. The latter makes it easier to spot images in the textual flow and see how developers in the wild actually use them. Loading either report can take a while, so there's also a zipball.