There is some contention as to whether the advice that the current HTML5
specification gives about using alt is appropriate. One overarching
part of the disagreement concerns whether the alt text should be intended as a replacement
for the image, rather than as content of "equivalent purpose".
To see whether some data might help make progress on this disagreement, I made a quick pass
through the Paciello
Group Dataset to extract alt usage. I make no claim that this is sufficient
research, but I'm sharing it early in case it helps.
I extracted every img[alt] that had surrounding text inside
the same container (where the container is the first ancestor that isn't phrasing content),
and placed the alt (in italicised red) in the context of that surrounding text. Obviously, assessing
the actual usage in practice requires human analysis (or more multilingual NLP-fu than I
can muster), but I have found that scanning the shorter version of the generated report is already quite
instructive about actual usage (at least for the languages I understand).
Suggestions for refining this analysis — or extracting more useful data in a similar manner — are very much welcome.
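For anyone who wants to try something similar, here is a minimal sketch of the extraction described above. It is not the script that produced the reports. It assumes Python's standard html.parser, an abbreviated list of phrasing-content elements (the real list in the specification is longer), and hypothetical names such as AltInContext and extract_alt_contexts. Each hit records the text before the image, the alt, the text after it, and whether there was text on both sides.

```python
from html.parser import HTMLParser

# Assumption: an abbreviated subset of HTML phrasing-content elements.
PHRASING = {
    "a", "abbr", "b", "bdi", "bdo", "br", "cite", "code", "dfn", "em", "i",
    "img", "kbd", "mark", "q", "s", "samp", "small", "span", "strong",
    "sub", "sup", "time", "u", "var", "wbr",
}
# Void elements never get an end tag, so they must not open a container.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AltInContext(HTMLParser):
    """Collects img[alt] values together with the text around them."""

    def __init__(self):
        super().__init__()
        # Stack of open containers: (tag, [("text" | "alt", value), ...]).
        self.stack = [("#root", [])]
        self.hits = []  # (before, alt, after, both_sides)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if "alt" in attrs:  # matches img[alt], including alt=""
                self.stack[-1][1].append(("alt", attrs["alt"] or ""))
        elif tag not in PHRASING and tag not in VOID:
            # First non-phrasing ancestor becomes the container.
            self.stack.append((tag, []))

    def handle_data(self, data):
        self.stack[-1][1].append(("text", data))

    def handle_endtag(self, tag):
        if tag in PHRASING or tag in VOID:
            return
        if not any(open_tag == tag for open_tag, _ in self.stack[1:]):
            return  # stray end tag, ignore it
        # Pop (and report) unclosed containers until the matching one.
        while True:
            open_tag, parts = self.stack.pop()
            self._flush(parts)
            if open_tag == tag:
                break

    def close(self):
        super().close()
        while self.stack:
            self._flush(self.stack.pop()[1])

    def _flush(self, parts):
        for i, (kind, alt) in enumerate(parts):
            if kind != "alt":
                continue
            before = "".join(v for k, v in parts[:i] if k == "text")
            after = "".join(v for k, v in parts[i + 1:] if k == "text")
            before = " ".join(before.split())
            after = " ".join(after.split())
            if before or after:  # keep only alts with surrounding text
                self.hits.append((before, alt, after, bool(before and after)))


def extract_alt_contexts(html):
    parser = AltInContext()
    parser.feed(html)
    parser.close()
    return parser.hits


sample = (
    '<div><p>Click <img src="a.png" alt="the icon"> to continue.</p>'
    '<p><img src="b.png" alt="Logo"> Acme Corp</p>'
    '<p><img src="c.png" alt="Alone"></p></div>'
)
for before, alt, after, both_sides in extract_alt_contexts(sample):
    print(f"{before} [{alt}] {after} (both sides: {both_sides})")
```

In this sketch, hits with text on at least one side correspond to the full report, and hits where both_sides is true correspond to the inline report. Real-world markup with unclosed or misnested tags would call for a proper HTML5 parser rather than this simple stack.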
Be warned that these generated reports are a bit big. The
full report is 13MB; it contains all instances of alt
that had text on either side. The inline report is
smaller at 2MB; it only contains cases in which the alt had text on both sides. The
latter makes it easier to detect cases of images in the textual flow to see how developers in the
wild actually use them. Loading these can take a while though, so there's a zipball.