Charl van Niekerk » Blog

The 'lang' Attribute

Since we're focusing on accessibility now (and rightfully so) I thought I would kick off with an issue that has long bothered me: multilingual text.

I find that one of the most practical yet heavily underused attributes is the lang attribute (and by that I obviously also mean xml:lang). Yes, many people like me have been using it for years now, but far too few IMHO. Far too little has been said about this issue in recent years, and I feel it's about time we started making people aware of it: why we need it and how to use it properly.

Marking up languages with these attributes doesn't only apply to the root element and the base language of a document (although that is of course also very important, especially for search engines). Marking up languages specifically and thoroughly is very important if speech-enabled UAs are to pronounce the text correctly.
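As a minimal sketch, assuming an XHTML 1.0 document written mostly in English (the title and sentences are placeholders, not this weblog's actual markup), the base language goes on the root element and any exception is marked where it occurs:

```html
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Placeholder title</title>
  </head>
  <body>
    <!-- Inherits lang="en" from the root element -->
    <p>This paragraph is in English.</p>
    <!-- Overrides the inherited language for this paragraph only -->
    <p lang="nl" xml:lang="nl">Deze alinea is in het Nederlands.</p>
  </body>
</html>
```

Both attributes are given here because the same markup may be read by HTML and XML tools alike.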

For example, let's say I want to throw in a bit of common French: Au revoir! Let's say an aural UA reads this out to a person with limited vision. Just imagine if that French was pronounced as English! That wouldn't sound correct at all, would it? Even people (like me) with very limited skills in French would think it sounds rather weird! That's why it needs to be marked up appropriately as fr so that it can be pronounced correctly by a UA with a French engine (most should have one on a default install - French is a major western language).
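A sketch of that markup, assuming the surrounding page is English (the sentence around the French phrase is a placeholder):

```html
<p>
  Time to say goodbye:
  <!-- Only the French phrase switches language; the rest inherits English -->
  <span lang="fr" xml:lang="fr">Au revoir!</span>
</p>
```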

Ideally, you should accurately mark up every single piece of text as the language it's written in. This sounds pretty straightforward, and usually it is, but sometimes it can become quite tricky.

Some languages, like Dutch, have certain words that are taken from English and then adapted somewhat to become Dutch. For example, the word downloaden (which means "to download", naturally). Semantically, this word has been adopted from English and adapted to Dutch, thereby making it Dutch. In other words, if the rest of the document is in Dutch you shouldn't really have to bother marking this up as another language.

When this word is pronounced by a computer, it will probably be pronounced incorrectly, because I strongly doubt that the "w" would be pronounced by it as it is normally pronounced in Dutch. Some people will therefore be tempted to mark it up as English. However, this is still incorrect from a semantic point of view. Downloaden shouldn't be pronounced in an English voice, but rather in a typical Dutch voice, otherwise it would sound completely unnatural in a Dutch piece of text.
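To make this concrete, here is a small sketch (the Dutch sentence is a placeholder of mine, meaning "You can download the file here"). The loanword simply inherits the paragraph's language:

```html
<p lang="nl" xml:lang="nl">
  <!-- "downloaden" is treated as Dutch, so it gets no lang override of its own -->
  Je kunt het bestand hier downloaden.
</p>
```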

Most languages on earth will have at least some words (some languages more than others) that are pronounced a bit funny. Therefore, it's the job of the creators of the relevant aural engine to cater for the various needs of their respective language. We, as web developers, can only aid them so far.

One of the facts that people often ignore (or want to ignore) is that proper nouns ("names of places or things" for those that don't speak geek) should also be marked up according to language.

For example, my name (pronounced in a typical English fashion) might not sound funny (well, probably it does, but they're expecting it to sound funny) to an American that doesn't know anything about South African names in the first place. However, for the average South African, to listen to my name being pronounced in English would sound totally ridiculous and often incomprehensible to them. They would expect it to be pronounced as in Afrikaans.

Therefore, you would like to have the average South African using an aural UA that might have an Afrikaans engine installed hear my name pronounced as it should be. Every time you mention my name or this weblog's title, you should preferably mark it up in the applicable language if possible. For example:
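A minimal sketch of such markup, assuming an English-language page (the sentence around the name is a placeholder; af is the language code for Afrikaans):

```html
<p lang="en" xml:lang="en">
  This post was written by
  <!-- The name is Afrikaans, so it overrides the English of the paragraph -->
  <span lang="af" xml:lang="af">Charl van Niekerk</span>.
</p>
```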

Note that I'm not listing every possible scenario and combination since that would just be a waste of time. You're not a total idiot, are you? ;-)

Marking up languages for names is indeed often very difficult. For example, when you want to mark up the language of the name of the author of a post or a comment. This will essentially mean that you need an extra textbox on your comment form to allow your users to type their name's language code.
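One way such a form could look is sketched below. The action URL and the field names are hypothetical, chosen only for illustration:

```html
<form action="/comments" method="post">
  <p>
    <label for="author">Name</label>
    <input type="text" id="author" name="author" />
  </p>
  <p>
    <!-- Hypothetical extra field: a language code such as "af" or "sv" -->
    <label for="author-lang">Language code of your name</label>
    <input type="text" id="author-lang" name="author-lang" size="5" />
  </p>
  <p><input type="submit" value="Post comment" /></p>
</form>
```

The submitted code would then be written into the lang and xml:lang attributes of whatever element wraps the commenter's name.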

However, sometimes you get people whose full name contains more than one language. For example, take a Russian woman marrying a Greek man. That woman may then have a Russian first name and a Greek surname. This essentially means that you'll have to start allowing span elements with lang/xml:lang attributes into the name field. This could get a little tricky, especially if you need to insert that name into the value of an attribute (for example, the title attribute of a link element). Therefore, you'll just have to strip the markup in those.
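The following sketch shows both cases. The name is invented for illustration, and the href value is a placeholder:

```html
<!-- In element content, each part of the name carries its own language -->
<p>
  Comment by
  <span lang="ru" xml:lang="ru">Ольга</span>
  <span lang="el" xml:lang="el">Παπαδοπούλου</span>
</p>

<!-- Attribute values cannot contain elements, so the spans are stripped -->
<p><a href="#comment-1" title="Comment by Ольга Παπαδοπούλου">Permalink</a></p>
```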

Anyway, there are workarounds for everything, and we'll just have to live with the limitations. For example, you're out of luck in attributes and also in poorly-designed elements like the title element which doesn't really allow you to mark up anything properly. For example, no span elements are allowed, which means you'll have to mark the entire title up in one element. This is troublesome for titles as on this weblog, which I would have liked to mark up like this:

<title><span lang="af">Charl van Niekerk</span>'s Weblog: Some Post Title</title>

Very stupid not to be able to do this IMHO! But luckily, this will all change with XHTML 2.0 (thanks, Lachlan) since apparently in the current working draft at time of writing, span elements are allowed into the title element!
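In the meantime, here is a sketch of what current markup does permit, assuming the rest of the page is English. The title element accepts lang and xml:lang itself, so the whole title can carry one language, at the cost of leaving the Afrikaans name unmarked:

```html
<!-- One language for the entire title; no child elements are allowed here -->
<title lang="en" xml:lang="en">Charl van Niekerk's Weblog: Some Post Title</title>
```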

Yeah, this indeed goes to a whole new level of markup purism and software requirements... Who ever said life is going to be simple? :-)

15 Comments

Comment by Blogger Lachlan Hunt on Tuesday, February 15, 2005 9:50:00 AM

You're correct about there being no other elements allowed within the title, but the title element does allow the lang and xml:lang attributes.

Comment by Blogger Charl van Niekerk on Tuesday, February 15, 2005 10:12:00 AM

Ah, thanks. I should have checked that up in the spec, but I'm a little lazy today. Post corrected.

Comment by Anonymous Tommy Olsson on Tuesday, February 15, 2005 2:25:00 PM

Good article!

Foreign words that are truly incorporated into another language, such as your example of "downloaden", shouldn't have to be marked up. It's the responsibility of the speech software to know about these things.

One shortcoming that I find annoying is marking up foreign acronyms and abbreviations. For instance, if I mark up "CSS" as an abbreviation, I want the title to be read in English ("cascading style sheets"), but in a Swedish text I don't want the abbreviation itself to be read as "cee ess ess". I want it to be read out with the Swedish names for those letters (approximately "ceh ess ess"). This would require a rather silly amount of markup:
<abbr lang="en" title="Cascading Style Sheets"><span lang="sv">CSS</span></abbr>

Needless to say, I don't add that SPAN, but that would make it sound silly in a Swedish screen reader. :(

Comment by Blogger Charl van Niekerk on Tuesday, February 15, 2005 3:55:00 PM

Very interesting point raised about abbreviations!!!

With the great complexity in marking up abbreviations (post will follow), it leads to only one conclusion: Abbreviations should be marked up by computers, not by humans. Like I said, post will follow, but it's good that you brought up this point so that I can address it there.

Comment by Blogger Anne on Tuesday, February 15, 2005 4:15:00 PM

I do not get the "XHTML 2 will fix this" part.

Comment by Blogger Charl van Niekerk on Tuesday, February 15, 2005 4:52:00 PM

Ah, sorry that was quite stupid of me. I should have clarified. Just read it again.

Comment by Blogger Anne on Tuesday, February 15, 2005 6:55:00 PM

So now I can use markup inside TITLE. Nifty. I wonder how multiple lines are handled inside user agents.

It does not really seem like the authors of XHTML 2 pay attention to any of that.

Comment by Blogger Charl van Niekerk on Wednesday, February 16, 2005 7:11:00 AM

Probably strip all line breaks and insert spaces between them. But yeah, that should be defined. XHTML 2.0 is rather complex, and needs a lot of work. But it's still only a working draft, so I'm not going to start criticising them until it at least reaches candidate recommendation level.

Comment by Blogger Anne on Wednesday, February 16, 2005 10:59:00 AM

That would be stupid. Sorry to say so, but when it reaches CR it also reaches call for implementations, and it is generally too late to let them redefine stuff unless critical issues are found.

You should send in every nit you find right away so they can fix it. And do a full review when it reaches LC (last call).

Comment by Blogger Charl van Niekerk on Wednesday, February 16, 2005 4:42:00 PM

I wasn't referring to criticizing the specification (or at least sending in helpful suggestions), I meant criticizing the W3C for creating crappy specifications.

Yes, naturally all suggestions should be sent in ahead of time. I would have done that myself, but believe it or not, I've only been subscribed to www-html for a few days now. Maybe I'll raise that point later if nobody else does.

Comment by Anonymous David Spinar on Sunday, February 20, 2005 12:40:00 AM

Hello, I would like to stress two points:

1. At present, common screen readers do not use the "lang" attribute to change the pronunciation. In 1999 (when WCAG 1.0 was written) that was the plan, but today's practice is different.

2. Using the "lang" attribute twice in an abbreviation (once on abbr, once on span) doesn't make any sense. Screen readers will read the content of the "title" attribute, not the abbreviation.

Comment by Blogger Charl van Niekerk on Monday, February 21, 2005 8:41:00 AM

Your first point is totally correct. I meant to include this in my post, but it must have slipped my mind for some reason.

However, I do feel it is important to use the 'lang' attribute anyway, because one day (hopefully), most common screen readers will support it, so using the 'lang' attribute already is good for forwards compatibility.

Also, I believe that one of the major reasons screen readers don't support this is that it isn't commonly used in the first place. If it starts to be used commonly, they might see the benefits and start to support it.

Your second point is incorrect though. I don't see anything wrong with screen readers reading out the abbreviation itself. If I needed to use a screen reader, I would much rather have it tell me 'see-es-es' than 'cascading style sheets' every time.

Comment by Anonymous Jim on Saturday, March 05, 2005 1:19:00 AM

<span lang="af">Charl van Niekerk</span>'s

Beware of separating punctuation like this with closing tags. As I recall, one of the major aural browsers reads them out as separate words, so something like:

<a href="...">This is a link</a>.

..will be read out as "This is a link full-stop" instead of "This is a link".

Comment by Blogger Charl van Niekerk on Monday, March 07, 2005 8:06:00 AM

Using it as I did is the only way to keep it semantically correct. The apostrophe, followed by the "s" is not Afrikaans; it would have been replaced by a separate word "se", which I don't want to do. The "'s" is English, and therefore needs to be marked up as such.

Personally, I don't really care about what some major browser does. This site isn't even compatible with IE, which is a major browser. However, that is of course different for commercial sites, in which a lot of cross-browser testing needs to be done. But as pointed out previously, proper semantic support in aural user agents is very limited anyway.

Comment by Anonymous Annerose on Tuesday, July 10, 2007 2:56:00 PM

These comments have been invaluable to me as is this whole site. I thank you for your comment.

Copyright © 2004-2008 Charl van Niekerk. All articles are released under the Creative Commons Attribution 2.5 South Africa licence, unless where otherwise stated.