Tuesday, 11 December 2007

Multi-lingualism and the masses

Multi-lingualism, and the provision of multi-lingual services, is one of those problems that just keeps on giving. Like digging a hole in sand which just keeps filling with water as fast as you can shovel it out again, or the loose thread which unravels your clothes when you pull on it. I remember being told, back at the start, that multi-lingualism was a solved problem; that i18n allowed us to keep our language separate from our application.

When the first major work was done on DSpace to convert the UI away from being strictly UK to being internationalised, there was great cause for celebration. This initial step was extremely large, and DSpace has reaped the benefits of having an internationalised UI, with translations into 19 languages at time of writing. It's also helped me, among others, understand where else we might want to go with the internationalisation of the platform, and what the issues are. This post is designed to allow me to enumerate the issues that I've so far come up against or across, to suggest some directions where possible, but mostly just to help organise thoughts.

So lets start with the UI. It turns out that there are a couple of questions which immediately come to the fore once you have a basically international interface. The first is whether display semantics should be embedded in your international tags. My gut reaction was, of course, no ... but, suppose, for example, emphasised text needs to be done differently in different locales? The second is in the granularity of the language tags, and the way that they appear on the page. Suppose it is better in one language to reverse the order of two distinct tags, to dispense with one altogether, or to add additional ones? All of these require modifications in the pages which call the language specific messages, not in the messages themselves. Is there a technical solution to these problems? (I don't know, by the way, but I'm open to suggestion).

We also have the problem of wholesale documentation. User and Administrator help, and system documentation. Not only are they vast, but they are often changing, and maintaining many versions of them is a serious undertaking. It seems inappropriate to use i18n tagging to do documentation, so a different approach is necessary. The idea of the "language pack" would be to include not only custom i18n tags, but also language specific documentation, and all of the other things that I'm going to waffle about below.

Something else happens in the UI which is nothing to do with the page layout. Data is displayed. It is not uncommon to see DSpace instances with hacked attempts at creating multi-lingual application data such as Community and Collection structures, because the tools simply don't yet exist to manage them properly. For example:


where the English and Swedish terms are included in the single field for the benefit of their national and international readership.

Capturing all data in a multi-lingual way is very very hard, mostly because of the work involved. But DSpace should be offering multi-lingual administrator controlled data such as Communities and Collections, and at least offering the possibility of multi-lingual items. The application challenges here are to:

  • Capture the data in multiple languages

  • Store the data in multiple languages

  • Offer administrator tools for adding translations (automated?)

  • Disseminate in the correct language.

Dissemination in the correct language ought not to be too much hassle through the UI (and DSpace already offers tools to switch UI language), but I wonder how much of a difficulty this would be for packaging? Or other types of interoperability? Do we need to start adding language qualifiers to everything? And what happens if the language you are interested in isn't available, or is only partial for what you are looking at? Defining a fall-back chain shouldn't be too hard, but perhaps that fall-back chain is user specific; suppose I'm English, but I also understand German and French: I don't want the application to fall back from English to Russian, for example.

This post was actually motivated by a discussion I have been having about multi-lingual taxonomies, and using URIs to store the vocabulary terms, instead of the terms themselves. In this particular space, URIs are a good solution, because they are tied to a specific, recognised wording. It does place a burden on the UI, though, to be able to hide the URI from the user during deposit and dissemination.

But the same approach could, in theory, be used to offer multi-lingual browse and search results across an entire database. Imagine: each indexable field is collected in its many languages, a single (internal) URI is assigned to that cluster of terms, and that URI is stored instead of the value. With a lot of computational effort you could produce a map of URIs to all the same terms in all the different languages in the database and their corresponding digital objects, which you could offer to your users through search or browse interfaces (I'd not like to be the one to have to implement this, and iron out the wrinkles which I'm blatantly overlooking here).

There are many other corner areas of applications which include language-specifics, and it's going to take me a while to gather the list of what they are. Here are a few which aren't covered by the above:

  • system configuration

  • code exceptions and errors

  • application email notifications

A second major step has been taken for DSpace 1.5 with regard to multi-lingualism, in the form of Claudia Jürgen's work on submission configuraton, help files, emails and front page news. The natural progression would be onto multi-lingual application metadata, and from there the stars ...

No comments: