Friday 14 December 2007

Pointless Password Pedantry

Nobody trusts me, and nobody can agree on what the best way of making me trustworthy is.

This is the sense that I get from password form schemes, when I'm signing up for new services. I don't know about you, but I have literally tens of passwords to remember, and so, sensibly, I have devised a personal algorithm to generate passwords in different situations, rather than doing something deeply insecure like writing them down in a text file on my desktop (yes, people really do do this, even with system root passwords!).

Without giving away too much, my password algorithm allows me to domain or namespace my passwords both in terms of the service they are for, and the context they are being used in. Further, there is a feedback loop between these two components which explains how to modify the password further in a way which is not possible to predict in advance, and upon which a further set of standard modifications is then applied. The result: easy to reconstruct without the aid of memory but totally unguessable passwords. They include alphanumeric characters, special characters and both capital and lower case letters. They are a paragon of good password design.

So why oh why oh why do different services have such wildly different notions of a "good" password? Let me give you some examples. Sourceforge don't permit special characters in their passwords! eBuyer don't permit passwords of more than 20 characters (the passwords that my algorithm generates can be extremely long, adding to their security). My online bank requires 2 digits and 2 capital letters, and disallows certain special characters. So I still have to remember which services require which variations on the algorithm, and I'm constantly having to make new adjustments to it. The problem is that the requirements of many services conflict with one another: you MUST have special characters, you MUST NOT have special characters. How's a security-conscious person going to win? I suppose I could start writing my passwords down in a plain text file on my desktop ...

Why don't these systems just implement something like:

http://rumkin.com/tools/password/passchk.php

and reject passwords that come out at less than "Reasonable"?
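The idea can be sketched in a few lines. To be clear, this is a naive entropy estimate of my own devising, not the linked checker's actual algorithm; the character-pool model and the 56-bit cut-off for "Reasonable" are assumptions:

```java
public class PasswordCheck {

    // Estimate entropy as length * log2(pool size), where the pool is the
    // union of character classes the password draws from. This is a naive
    // model; real checkers also penalise dictionary words and patterns.
    static double estimateBits(String password) {
        int pool = 0;
        if (password.chars().anyMatch(Character::isLowerCase)) pool += 26;
        if (password.chars().anyMatch(Character::isUpperCase)) pool += 26;
        if (password.chars().anyMatch(Character::isDigit)) pool += 10;
        if (password.chars().anyMatch(c -> !Character.isLetterOrDigit(c))) pool += 33;
        return pool == 0 ? 0 : password.length() * (Math.log(pool) / Math.log(2));
    }

    // The 56-bit threshold for "Reasonable" is an assumed value.
    static boolean isReasonable(String password) {
        return estimateBits(password) >= 56;
    }

    public static void main(String[] args) {
        System.out.println(isReasonable("password"));               // too weak
        System.out.println(isReasonable("c0rrect-H0rse-b4ttery!")); // long and mixed
    }
}
```

A service which checked estimated entropy like this, instead of mandating or banning particular character classes, would accept any sufficiently strong password regardless of how it was composed.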

Thursday 13 December 2007

The Data Access Layer Divide

Warning: technical post.

One of the things that has been giving me consternation this week is the division between the data storage layer and the application layer. A colleague of mine has been working hard on this problem for some months for DSpace, and his work will form the backbone of the 1.6 release next year. As a new HP Labs employee, I'm just getting involved in this work too, with my focus currently on identifiers for objects in the system (not just content objects, but everything from access policies to user accounts).

We are replacing the default Handle mechanism for exposing URLs in DSpace with an entirely portable identification mechanism which should support whatever identifier scheme you want to put on top of it. DSpace is going to provide its own local identification through UUIDs, so that we can decouple the identification of artifacts in the system from the specific implementation of the storage engine. That is, at the moment, database ids are passed around and used with little thought. But what happens if the data storage layer is replaced with something which doesn't use database ids? It's not at all inconceivable. Hence the introduction of the UUID.
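As a sketch of the idea (with hypothetical class and method names, not DSpace's actual API), an application-level identifier can simply wrap a java.util.UUID, leaving room for external schemes to be layered on top:

```java
import java.util.UUID;

// Illustrative only: an application-level object identifier backed by a
// UUID, independent of whatever id the storage engine uses internally.
public class ObjectIdentifier {
    private final UUID uuid;

    public ObjectIdentifier() { this.uuid = UUID.randomUUID(); }
    public ObjectIdentifier(UUID uuid) { this.uuid = uuid; }

    public UUID getUUID() { return uuid; }

    // A canonical string form; external identifier schemes (Handles or
    // anything else) can resolve to this rather than to a database row.
    public String getCanonicalForm() { return "uuid:" + uuid.toString(); }
}
```

The point of the wrapper is that nothing in it refers to the storage engine: swap Postgres for something else and every identifier in circulation remains valid.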

Now, here's where it gets tricky. The UUID becomes an application level identifier for system artifacts. Fine. The database is free to give columns in tables integer ids, and use them to maintain its own referential integrity. Fine.

I have several questions, and some half-answers for you:

- Why is this a problem?

Suppose I have two modules which store in the database. Let's use a DSpace example of Item and Bitstream objects (DSpace object model sticklers: I know what I'm about to say isn't really true; it's for the purposes of example): I want to store the Item, I want to store the Bitstream, and I want to preserve the relationship between them. Therefore, the Item storage module needs to know how to identify the Bitstream (or vice versa). If I want, I can use the UUIDs: nice long strings, which may have implications for my database performance. Why use a relational database if I'm going to burden it with looking up long strings when it could be using nice small integers?

So the problem is: how does the Item get to find out the Bitstream storage id?

- How far up the API can I pass the database id?

The answer to this is "not very far". In fact, it looks like I can't even pass it as far as the DAO API.

- Can I use a RelationalDatabase interface?

The best solution I've come up with so far is to allow my DAO to implement a RelationalDatabase interface, so that other DAO implementations can inspect it to see if they can get database ids out of it. Is that a good solution? I don't know, I'm asking you!
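Here is roughly what I have in mind, with every name invented for illustration: the Bitstream DAO advertises that it lives in a relational database, and the Item DAO inspects it before deciding whether to store a cheap integer foreign key or fall back to the portable UUID:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch, not DSpace's real API: a marker-plus-lookup
// interface that relational DAO implementations can choose to expose.
interface RelationalDatabase {
    Integer getDatabaseID(UUID uuid); // null if the UUID is unknown
}

class BitstreamDAO implements RelationalDatabase {
    private final Map<UUID, Integer> rows = new HashMap<>();
    private int nextId = 1;

    // Stand-in for an INSERT that assigns a serial integer primary key.
    public UUID store() {
        UUID uuid = UUID.randomUUID();
        rows.put(uuid, nextId++);
        return uuid;
    }

    @Override
    public Integer getDatabaseID(UUID uuid) { return rows.get(uuid); }
}

class ItemDAO {
    // Decide what to persist as the reference to a bitstream.
    public String referenceFor(Object bitstreamDAO, UUID bitstreamUUID) {
        if (bitstreamDAO instanceof RelationalDatabase) {
            Integer id = ((RelationalDatabase) bitstreamDAO).getDatabaseID(bitstreamUUID);
            if (id != null) return "int:" + id;  // cheap integer foreign key
        }
        return "uuid:" + bitstreamUUID;          // portable fall-back
    }
}
```

The database id never escapes above the DAO layer: it only ever travels sideways, between two DAOs that have both confirmed they are talking to the same kind of store.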

- What's the point?

At the moment the DSpace API is awash with references to the database id. It's fine for the time being, and most people will never get upset about it. But it bothers engineers, and it will bother people who want to try and implement novel storage technologies behind DSpace.

The title of this post reflects my current feeling that these two particular layers of the system, the application and the data storage, have, at some point, to collide; can we really engineer it so that no damage occurs? Answers on a postcard.

Wednesday 12 December 2007

OAI-ORE Alpha Specifications

The ORE Project has released the first draft of the specifications for public consumption. A final Technical Committee meeting is due in January next year, which may cause changes to this initial draft:

http://www.openarchives.org/ore/0.1/toc

BMC and the Free Open Repository Trial

Our good buddies at BioMedCentral's Open Repository team have released the latest upgrade to their service, and are offering 3 month trial repositories for evaluation. From the DSpace home page:


BioMed Central announced the latest upgrades to Open Repository, the open access publisher's hosted repository solution. Open Repository offers institutions a cost-effective repository solution (setup, hosting and maintenance) which includes new DSpace features, customization options and an improved user interface. Along with the announcement of the upgrades, Open Repository is offering a free 3-month pilot repository, so institutions can test the suitability of the service without obligation. See the full articles in Weekly News Digest and in AlphaGalileo.

Tuesday 11 December 2007

Multi-lingualism and the masses

Multi-lingualism, and the provision of multi-lingual services, is one of those problems that just keeps on giving. Like digging a hole in sand which just keeps filling with water as fast as you can shovel it out again, or the loose thread which unravels your clothes when you pull on it. I remember being told, back at the start, that multi-lingualism was a solved problem; that i18n allowed us to keep our language separate from our application.

When the first major work was done on DSpace to convert the UI away from being strictly UK English to being internationalised, there was great cause for celebration. This initial step was extremely large, and DSpace has reaped the benefits of having an internationalised UI, with translations into 19 languages at the time of writing. It's also helped me, among others, understand where else we might want to go with the internationalisation of the platform, and what the issues are. This post is designed to allow me to enumerate the issues that I've so far come up against or across, to suggest some directions where possible, but mostly just to help organise my thoughts.

So let's start with the UI. It turns out that there are a couple of questions which immediately come to the fore once you have a basically international interface. The first is whether display semantics should be embedded in your international tags. My gut reaction was, of course, no ... but suppose, for example, that emphasised text needs to be done differently in different locales? The second is in the granularity of the language tags, and the way that they appear on the page. Suppose it is better in one language to reverse the order of two distinct tags, to dispense with one altogether, or to add additional ones? All of these require modifications in the pages which call the language-specific messages, not in the messages themselves. Is there a technical solution to these problems? (I don't know, by the way, but I'm open to suggestions).
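For the ordering question, one partial answer already in stock Java is to make the tags coarser-grained and use java.text.MessageFormat placeholders, so that each locale's pattern can reorder (or drop) the pieces itself. The patterns below are invented for illustration:

```java
import java.text.MessageFormat;

// Because the placeholders are numbered, a translation is free to emit
// them in whatever order its grammar prefers; the calling page does not
// need to change.
public class OrderingDemo {
    public static void main(String[] args) {
        // English pattern: collection first, then the depositor.
        String en = "Deposited into {0} by {1}";
        // A hypothetical locale that prefers the reverse order:
        String xx = "{1} deposited this into {0}";

        System.out.println(MessageFormat.format(en, "Theses", "Richard"));
        System.out.println(MessageFormat.format(xx, "Theses", "Richard"));
    }
}
```

This doesn't solve everything (it can't add a wholly new tag to a page), but it does push word-order decisions into the messages, where the translators are.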

We also have the problem of wholesale documentation. User and Administrator help, and system documentation. Not only are they vast, but they are often changing, and maintaining many versions of them is a serious undertaking. It seems inappropriate to use i18n tagging to do documentation, so a different approach is necessary. The idea of the "language pack" would be to include not only custom i18n tags, but also language specific documentation, and all of the other things that I'm going to waffle about below.

Something else happens in the UI which is nothing to do with the page layout. Data is displayed. It is not uncommon to see DSpace instances with hacked attempts at creating multi-lingual application data such as Community and Collection structures, because the tools simply don't yet exist to manage them properly. For example:

https://gupea.ub.gu.se/dspace/community-list

where the English and Swedish terms are included in the single field for the benefit of their national and international readership.

Capturing all data in a multi-lingual way is very very hard, mostly because of the work involved. But DSpace should be offering multi-lingual administrator controlled data such as Communities and Collections, and at least offering the possibility of multi-lingual items. The application challenges here are to:


  • Capture the data in multiple languages

  • Store the data in multiple languages

  • Offer administrator tools for adding translations (automated?)

  • Disseminate in the correct language.


Dissemination in the correct language ought not to be too much hassle through the UI (and DSpace already offers tools to switch UI language), but I wonder how much of a difficulty this would be for packaging? Or other types of interoperability? Do we need to start adding language qualifiers to everything? And what happens if the language you are interested in isn't available, or is only partial for what you are looking at? Defining a fall-back chain shouldn't be too hard, but perhaps that fall-back chain is user specific; suppose I'm English, but I also understand German and French: I don't want the application to fall back from English to Russian, for example.
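A user-specific fall-back chain, at least, is straightforward to sketch. This is a toy of my own, not anything DSpace does today; the method and class names are invented:

```java
import java.util.List;
import java.util.Map;

public class Fallback {
    // translations maps a language code to that language's text; chain is
    // the user's preference order (e.g. en, de, fr). Try each preferred
    // language before resorting to whatever translation exists at all.
    public static String resolve(Map<String, String> translations, List<String> chain) {
        for (String lang : chain) {
            if (translations.containsKey(lang)) {
                return translations.get(lang);
            }
        }
        // Nothing in the user's chain is available: last resort is any
        // translation we happen to have, rather than nothing.
        return translations.values().stream().findFirst().orElse(null);
    }
}
```

The interesting policy questions sit outside this little function: where the per-user chain is stored, and whether the last-resort language should be a site-wide default rather than an arbitrary one.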

This post was actually motivated by a discussion I have been having about multi-lingual taxonomies, and using URIs to store the vocabulary terms, instead of the terms themselves. In this particular space, URIs are a good solution, because they are tied to a specific, recognised wording. It does place a burden on the UI, though, to be able to hide the URI from the user during deposit and dissemination.

But the same approach could, in theory, be used to offer multi-lingual browse and search results across an entire database. Imagine: each indexable field is collected in its many languages, a single (internal) URI is assigned to that cluster of terms, and that URI is stored instead of the value. With a lot of computational effort you could produce a map of URIs to all the same terms in all the different languages in the database and their corresponding digital objects, which you could offer to your users through search or browse interfaces (I'd not like to be the one to have to implement this, and iron out the wrinkles which I'm blatantly overlooking here).
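As a toy illustration of that idea, with invented URIs and terms: each concept URI clusters its per-language terms, and a reverse index lets a search term entered in any language resolve to the same URI, and from there to the reader's language:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a miniature version of the URI-per-concept index.
public class TermCluster {
    // concept URI -> (language code -> term in that language)
    private final Map<String, Map<String, String>> byUri = new HashMap<>();
    // any term in any language -> its concept URI
    private final Map<String, String> uriByTerm = new HashMap<>();

    public void add(String uri, String lang, String term) {
        byUri.computeIfAbsent(uri, k -> new HashMap<>()).put(lang, term);
        uriByTerm.put(term, uri);
    }

    // Search in any language, display in the reader's language.
    public String lookup(String term, String displayLang) {
        String uri = uriByTerm.get(term);
        return uri == null ? null : byUri.get(uri).get(displayLang);
    }
}
```

The real wrinkles (homonyms that map one term to several concepts, partial clusters with no term in the reader's language) are exactly the ones glossed over here.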

There are many other corner areas of applications which include language-specifics, and it's going to take me a while to gather the list of what they are. Here are a few which aren't covered by the above:

  • system configuration

  • code exceptions and errors

  • application email notifications


A second major step has been taken for DSpace 1.5 with regard to multi-lingualism, in the form of Claudia Jürgen's work on submission configuration, help files, emails and front page news. The natural progression would be onto multi-lingual application metadata, and from there the stars ...

Friday 7 December 2007

CRIG Meeting Day 2 (2)

Topics for today:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference#Friday_December_7th

The ones that interest me the most are probably these:

- Death to Packages

Not really Death to Packages, but let's not forget that packaging sometimes isn't what we want to do or what we can do.

- Get What?

This harks back to my ORE interest, as to what is available at the URLs, and what that means for something like content negotiation.

- One Put to Multiple Places

Really important to distributed information systems (e.g. EThOSnet integration into local institutions). Also, this relates, for me, to the unpackaging question, because it introduces differences in what the various systems might be expecting.

- Web 2.0 interfaces (ok, ok)

I'm interested in web services. Yes it's a bit trendy. But it is useful.

- Core Services of a Repository

For repository core architecture, this is important. With my DSpace hat on, I'd like to see what sorts of things an internal service architecture or API ought to be able to support.

CRIG Meeting Day 2 (1)

It's first thing on day two. I'm late because I have to get all the way across town, which takes a surprisingly long time in London. I should have just stayed at a nearby hotel. Oh well.

The remainder of yesterday was interesting. Live blogging is difficult, as the conference is extremely mobile. Today I will have to pick a point and hide in a corner to get you up to date.

In the afternoon we discussed the CRIG scenarios, and then implemented something called a Dotmocracy, which involves sticking dots (like house points at school) next to topics which came up that we were interested in. When we start up today, the first order of business will be to see which topics made the cut. From what I saw at the end of the day, this will include Federated Searching, Google Search, and package deconstruction (my personal favourite this week).

As a brief aside, one running theme has been "no more standards". As it happens, I disagree with this. We're never going to get everyone thinking the same way and working the same way. That's why there are so many standards, and why new ones get made all the time. It's the way of the world. At least with a standard, once you have implemented one, you have a way of telling people what you did, unlike the home-grown, undocumented solutions which are the alternative.

Right, I suppose I'd better get my skates on.

Thursday 6 December 2007

CRIG Meeting Day 1 (2)

http://en.wikipedia.org/wiki/Unconference

See also Jim Downing's live blogging.

We've just done a round of preliminary unconferencing, where the CRIG Podcast topics were brainstormed onto flip charts. Not sure how useful that's going to be, but I'm going to approach the whole thing with an open mind. I've got my marker pen, my balloon, and my three dots.

Wish me luck ...

CRIG Meeting Day 1 (1)

Some live blogging; may be slightly malformed, as this is happening inline, with no post-editing.

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference

Les Carr and Jim Downing have introduced us to the first day of the CRIG workshop. We're unconferencing, which means that there's no programme! We're going to try and stay at the abstract, high-level discussion rather than talk about technology.

David Flanders outlines the meeting philosophy. The outputs aimed for from the meeting include ideas (blue-sky thinking), standards and scenarios, and how they can be linked together. The outputs will be taken to OR08. The best way for a group to produce good stuff is for everyone to think for themselves. Makes me think of an article I read recently:

http://www7.nationalgeographic.com/ngm/0707/feature5/index.html

We are not about creating new specs.

Julie then brings us some stuff about SWORD. See my previous post on this. We are going to have implementations for arXiv, White Rose Research Online and Jorum, plus a SPECTRa deposit client, and later an article in Ariadne and a presentation at OR08.

Break time ... tea and coffee!