Friday, 14 December 2007

Pointless Password Pedantry

Nobody trusts me, and nobody can agree on what the best way of making me trustworthy is.

This is the sense that I get from password form schemes, when I'm signing up for new services. I don't know about you, but I have literally tens of passwords to remember, and so, sensibly, I have devised a personal algorithm to generate passwords in different situations, rather than doing something deeply insecure like writing them down in a text file on my desktop (yes, people really do do this, even with system root passwords!).

Without giving away too much, my password algorithm allows me to domain or namespace my passwords both in terms of the service they are for, and the context they are being used in. There is also a feedback loop between these two components which determines a modification to the password that cannot be predicted in advance, and a further set of standard modifications is then applied on top of that. The result: passwords which are easy to reconstruct without the aid of memory but totally unguessable. They include alphanumeric characters, special characters and both capital and lower case letters. They are a paragon of good password design.

So why, oh why, oh why do different services have such wildly different notions of a "good" password? Let me give you some examples. Sourceforge don't permit special characters in their passwords! eBuyer don't permit passwords of more than 20 characters (the passwords that my algorithm generates can be extremely long, adding to their security). My online bank requires 2 digits and 2 capital letters, and disallows certain special characters. So I still have to remember which services require which variations on the algorithm, and I'm constantly having to make new adjustments to it. The problem is that many services' requirements conflict with one another: you MUST have special characters; you MUST NOT have special characters. How's a security-conscious person going to win? I suppose I could start writing my passwords down in a plain text file on my desktop ...

Why don't these systems just implement something like:

and reject passwords that come out at less than "Reasonable"?
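For illustration, here is a minimal sketch of what such a strength check might look like. The scoring weights, thresholds and category names are entirely my own invention, not any service's actual rules:

```java
// A minimal password strength estimator: award points for length and
// character variety, then map the score onto named categories.
// All weights, thresholds and category names are invented for illustration.
public class PasswordStrength {

    public static int score(String password) {
        int score = Math.min(password.length(), 20);           // length, capped
        if (password.matches(".*[a-z].*")) score += 5;         // lower case
        if (password.matches(".*[A-Z].*")) score += 5;         // upper case
        if (password.matches(".*[0-9].*")) score += 5;         // digits
        if (password.matches(".*[^a-zA-Z0-9].*")) score += 10; // specials
        return score;
    }

    public static String rating(String password) {
        int s = score(password);
        if (s < 15) return "Weak";
        if (s < 30) return "Reasonable";
        return "Strong";
    }
}
```

A signup form could then simply reject anything rating below "Reasonable", instead of dictating exactly which characters must or must not appear.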

Thursday, 13 December 2007

The Data Access Layer Divide

Warning: technical post.

One of the things that has been giving me consternation this week is the division between the data storage layer and the application layer. A colleague of mine has been working hard on this problem for some months for DSpace, and his work will form the backbone of the 1.6 release next year. As a new HP Labs employee, I'm just getting involved in this work too, with my focus currently on identifiers for objects in the system (not just content objects, but everything from access policies to user accounts).

We are replacing the default Handle mechanism for exposing URLs in DSpace with an entirely portable identification mechanism which should support whatever identifier scheme you want to put on top of it. DSpace is going to provide its own local identification through UUIDs, so that we can decouple the identification of artifacts in the system from the specific implementation of the storage engine. That is, at the moment, database ids are passed around and used with little thought. But what happens if the data storage layer is replaced with something which doesn't use database ids? That's entirely conceivable. Hence the introduction of the UUID.
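As a sketch of the principle (class and method names here are mine, not the real API under development): every artifact mints a storage-independent UUID at creation, and that is what the application layer passes around.

```java
import java.util.UUID;

// Sketch of application-level identification decoupled from storage:
// every system artifact carries a UUID, regardless of what primary key
// (if any) the storage engine uses internally.
// Class and method names are illustrative, not the real DSpace API.
public class Artifact {
    private final UUID id;

    public Artifact() {
        this.id = UUID.randomUUID(); // storage-independent identifier
    }

    public UUID getID() {
        return id;
    }
}
```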

Now, here's where it gets tricky. The UUID becomes an application level identifier for system artifacts. Fine. The database is free to give columns in tables integer ids, and use them to maintain its own referential integrity. Fine.

I have several questions, and some half-answers for you:

- Why is this a problem?

Suppose I have two modules which store in the database. Let's use a DSpace example of Item and Bitstream objects (DSpace object model sticklers: I know what I'm about to say isn't really true; it's for the purposes of example): I want to store the Item, I want to store the Bitstream, and I want to preserve the relationship between them. Therefore, the Item storage module needs to know how to identify the Bitstream (or vice versa). If I want, I can use the UUIDs, which are nice long strings, but that may have implications for my database performance; why use a relational database if I'm going to burden it with looking up long strings when it could be using nice small integers?

So the problem is: how does the Item get to find out the Bitstream storage id?

- How far up the API can I pass the database id?

The answer to this is "not very far". In fact, it looks like I can't even pass it as far as the DAO API.

- Can I use a RelationalDatabase interface?

The best solution I've come up with so far is to allow my DAO to implement a RelationalDatabase interface, so that other DAO implementations can inspect it to see if they can get database ids out of it. Is that a good solution? I don't know, I'm asking you!
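Roughly, the idea is something like this (all names invented for illustration; this is not the actual DSpace code):

```java
import java.util.UUID;

// A capability interface: a DAO backed by a relational database can
// expose the integer id it uses internally for a given UUID.
interface RelationalDatabase {
    Integer getDatabaseID(UUID uuid); // null if the id is unknown
}

// A storage module that happens to be relational advertises the fact
// by implementing the interface.
class BitstreamDAO implements RelationalDatabase {
    public Integer getDatabaseID(UUID uuid) {
        return 42; // in reality, a lookup against the bitstream table
    }
}

class ItemDAO {
    // When recording the Item-Bitstream relationship, use the cheap
    // integer key if the other DAO can supply one; otherwise fall back
    // to the (longer, slower to index) UUID string.
    public String foreignKeyFor(Object dao, UUID bitstreamID) {
        if (dao instanceof RelationalDatabase) {
            Integer dbid = ((RelationalDatabase) dao).getDatabaseID(bitstreamID);
            if (dbid != null) {
                return dbid.toString();
            }
        }
        return bitstreamID.toString();
    }
}
```

Whether runtime inspection like this is good design is exactly the question: it keeps database ids out of the public API, but it does let one storage implementation peer at another's internals.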

- What's the point?

At the moment the DSpace API is awash with references to the database id. It's fine for the time being, and most people will never get upset about it. But it bothers engineers, and it will bother people who want to try and implement novel storage technologies behind DSpace.

The title of this post reflects my current feeling that these two particular layers of the system, the application and the data storage, have, at some point, to collide; can we really engineer it so that no damage occurs? Answers on a postcard.

Wednesday, 12 December 2007

OAI-ORE Alpha Specifications

The ORE Project has released the first draft of the specifications for public consumption. A final Technical Committee meeting is due in January next year, which may cause changes to this initial draft:

BMC and the Free Open Repository Trial

Our good buddies at BioMedCentral's Open Repository team have released the latest upgrade to their service, and are offering 3 month trial repositories for evaluation. From the DSpace home page:

BioMed Central announced the latest upgrades to Open Repository, the open access publisher's hosted repository solution. Open Repository offers institutions a cost-effective repository solution (setup, hosting and maintenance) which includes new DSpace features, customization options and an improved user interface. Along with the announcement of the upgrades, Open Repository is offering a free 3-month pilot repository, so institutions can test the suitability of the service without obligation. See the full articles in Weekly News Digest and in AlphaGalileo.

Tuesday, 11 December 2007

Multi-lingualism and the masses

Multi-lingualism, and the provision of multi-lingual services, is one of those problems that just keeps on giving. It's like digging a hole in sand which fills with water as fast as you can shovel it out, or the loose thread which unravels your clothes when you pull on it. I remember being told, back at the start, that multi-lingualism was a solved problem; that i18n allowed us to keep our language separate from our application.

When the first major work was done on DSpace to convert the UI from being UK English only to being internationalised, there was great cause for celebration. This initial step was extremely large, and DSpace has reaped the benefits of having an internationalised UI, with translations into 19 languages at the time of writing. It's also helped me, among others, understand where else we might want to go with the internationalisation of the platform, and what the issues are. This post is designed to allow me to enumerate the issues that I've so far come up against or across, to suggest some directions where possible, but mostly just to help organise my thoughts.

So let's start with the UI. It turns out that there are a couple of questions which immediately come to the fore once you have a basically international interface. The first is whether display semantics should be embedded in your international tags. My gut reaction was, of course, no ... but suppose, for example, that emphasised text needs to be done differently in different locales? The second is in the granularity of the language tags, and the way that they appear on the page. Suppose it is better in one language to reverse the order of two distinct tags, to dispense with one altogether, or to add additional ones? All of these require modifications in the pages which call the language-specific messages, not in the messages themselves. Is there a technical solution to these problems? (I don't know, by the way, but I'm open to suggestion.)
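One partial technical answer to the ordering problem, at least in Java, is to keep whole sentences as single message keys with numbered placeholders, so that each translation is free to reorder (or omit) the parameters. A sketch, with made-up message strings:

```java
import java.text.MessageFormat;

// Numbered placeholders let each locale's pattern put the parameters in
// whatever order its grammar needs; the calling page stays the same.
// Example patterns (invented):
//   English:             "Submitted to {0} by {1}"
//   A reordering locale: "{1} har sendt inn til {0}"
public class Messages {
    public static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }
}
```

This only helps within a single message, though: it does nothing for reordering or dropping whole tags across a page, which still needs changes in the page itself.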

We also have the problem of wholesale documentation. User and Administrator help, and system documentation. Not only are they vast, but they are often changing, and maintaining many versions of them is a serious undertaking. It seems inappropriate to use i18n tagging to do documentation, so a different approach is necessary. The idea of the "language pack" would be to include not only custom i18n tags, but also language specific documentation, and all of the other things that I'm going to waffle about below.

Something else happens in the UI which is nothing to do with the page layout. Data is displayed. It is not uncommon to see DSpace instances with hacked attempts at creating multi-lingual application data such as Community and Collection structures, because the tools simply don't yet exist to manage them properly. For example:

where the English and Swedish terms are included in the single field for the benefit of their national and international readership.

Capturing all data in a multi-lingual way is very very hard, mostly because of the work involved. But DSpace should be offering multi-lingual administrator controlled data such as Communities and Collections, and at least offering the possibility of multi-lingual items. The application challenges here are to:

  • Capture the data in multiple languages

  • Store the data in multiple languages

  • Offer administrator tools for adding translations (automated?)

  • Disseminate in the correct language.

Dissemination in the correct language ought not to be too much hassle through the UI (and DSpace already offers tools to switch UI language), but I wonder how much of a difficulty this would be for packaging? Or other types of interoperability? Do we need to start adding language qualifiers to everything? And what happens if the language you are interested in isn't available, or is only partial for what you are looking at? Defining a fall-back chain shouldn't be too hard, but perhaps that fall-back chain is user specific; suppose I'm English, but I also understand German and French: I don't want the application to fall back from English to Russian, for example.
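A per-user fall-back chain is at least simple to sketch (all names and defaults here are invented for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a user-specific language fall-back chain: try each of the
// user's preferred languages in order, and only then fall back to a
// site-wide default. Names are illustrative, not from any real API.
public class FallbackChain {
    public static String resolve(Map<String, String> translations,
                                 List<String> preferred,
                                 String siteDefault) {
        for (String lang : preferred) {
            if (translations.containsKey(lang)) {
                return translations.get(lang);
            }
        }
        // no preferred language available: use the site default, if any
        return translations.getOrDefault(siteDefault, null);
    }
}
```

The hard part is not the lookup but the data: knowing each user's chain, and deciding what to do when a translation is only partial.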

This post was actually motivated by a discussion I have been having about multi-lingual taxonomies, and using URIs to store the vocabulary terms, instead of the terms themselves. In this particular space, URIs are a good solution, because they are tied to a specific, recognised wording. It does place a burden on the UI, though, to be able to hide the URI from the user during deposit and dissemination.

But the same approach could, in theory, be used to offer multi-lingual browse and search results across an entire database. Imagine: each indexable field is collected in its many languages, a single (internal) URI is assigned to that cluster of terms, and that URI is stored instead of the value. With a lot of computational effort you could produce a map of URIs to all the same terms in all the different languages in the database and their corresponding digital objects, which you could offer to your users through search or browse interfaces (I'd not like to be the one to have to implement this, and iron out the wrinkles which I'm blatantly overlooking here).
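In sketch form, the clustering idea would be something like this, with all URIs and terms made up:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the URI-per-term-cluster idea: the index stores one internal
// URI per concept, and a separate map carries that concept's label in
// each language. All URIs and terms here are invented for illustration.
public class TermCluster {
    private final Map<String, Map<String, String>> labels = new HashMap<>();

    public void addLabel(String uri, String lang, String term) {
        labels.computeIfAbsent(uri, k -> new HashMap<>()).put(lang, term);
    }

    public String labelFor(String uri, String lang) {
        Map<String, String> forUri = labels.get(uri);
        return forUri == null ? null : forUri.get(lang);
    }
}
```

Search and browse would then index and compare only the URIs; the cluster is resolved back into the reader's language at display time.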

There are many other corner areas of applications which include language-specifics, and it's going to take me a while to gather the list of what they are. Here are a few which aren't covered by the above:

  • system configuration

  • code exceptions and errors

  • application email notifications

A second major step has been taken for DSpace 1.5 with regard to multi-lingualism, in the form of Claudia Jürgen's work on submission configuration, help files, emails and front page news. The natural progression would be onto multi-lingual application metadata, and from there the stars ...

Friday, 7 December 2007

CRIG Meeting Day 2 (2)

Topics for today:

The ones that interest me the most are probably these:

- Death to Packages

Not really Death to Packages, but let's not forget that packaging sometimes isn't what we want to do, or what we can do.

- Get What?

This harks to my ORE interest, as to what is available under the URLs, and what that means for something like content negotiation.

- One Put to Multiple Places

Really important to distributed information systems (e.g. EThOSnet integration into local institutions). Also, this relates, for me, to the unpackaging question, because it introduces differences between what systems might all be expecting.

- Web 2.0 interfaces (ok, ok)

I'm interested in web services. Yes it's a bit trendy. But it is useful.

- Core Services of a Repository

For repository core architecture, this is important. With my DSpace hat on, I'd like to see what sorts of things an internal service architecture or API ought to be able to support.

CRIG Meeting Day 2 (1)

It's first thing on day two. I'm late because I have to get all the way across town, which takes a surprisingly long time in London. I should have just stayed at a nearby hotel. Oh well.

The remainder of yesterday was interesting. Live blogging is difficult, as the conference is extremely mobile. Today I will have to pick a spot and hide in a corner to get you up to date.

In the afternoon we discussed the CRIG scenarios, and then implemented something called a Dotmocracy, which involves sticking dots (like house points at school) next to the topics that emerged which we were interested in. When we start up today, the first order of business will be to see which topics made the cut. From what I saw at the end of the day, this will include Federated Searching, Google Search, and package deconstruction (my personal favourite this week).

As a brief aside, one running theme has been "no more standards". As it happens, I disagree with this. We're never going to get everyone thinking the same and working the same. That's why there are so many standards, and why new ones get made all the time. It's the way of the world. With a standard, though, once you have implemented it, you at least have a way of telling people what you did, unlike the home-grown, undocumented solutions which are the alternative.

Right, I suppose I'd better get my skates on.

Thursday, 6 December 2007

CRIG Meeting Day 1 (2)

See also Jim Downing's live blogging.

We've just done a round of preliminary unconferencing, where the CRIG Podcast topics were brainstormed onto flip charts. Not sure how useful that's going to be, but I'm going to approach the whole thing with an open mind. I've got my marker pen, my balloon, and my three dots.

wish me luck ...

CRIG Meeting Day 1 (1)

Some live blogging; may be slightly malformed, as this is happening inline, with no post-editing.

Les Carr and Jim Downing have introduced us to the CRIG workshop first day. We're unconferencing, which means that there's no programme! We're going to try and stay at the abstract or high-level discussion, and not try to talk about technology.

David Flanders outlines the meeting philosophy. The outputs aimed for from the meeting include ideas (bluesky), standards, and scenarios, and how they can be linked together. The outputs will be taken to OR08. The best way for a group to produce good stuff is for everyone to think for themselves. Makes me think of an article I read recently:

We are not about creating new specs.

Julie then brings us some stuff about SWORD. See my previous post on this. We are going to have implementations for xrXiv, White Rose Research Online and Jorum, plus a SPECTRa deposit client, and later an article in Ariadne and a presentation at OR08.

Break time ... tea and coffee!

Friday, 30 November 2007

CRIG Podcast

A couple of weeks ago the JISC CRIG (Common Repository Interfaces Group) organised a series of telephone debates on important areas for it. These have now been edited into short commentaries which might be of interest to you, and are aimed at priming and informing the upcoming "unconference" to be held 6/7 December in London:

The "unconference" will take place at Birkbeck College in Bloomsbury, London. Take a listen, and enjoy. Yours truly appears in the "Get and Put within Repositories" and the "Object Interoperability" discussions.

Thursday, 8 November 2007

SWORD 1.0 Released

Just a quick heads up to say that the SWORD 1.0 release is now out and ready for download from SourceForge:

Here you will find the common java library which supports repositories wanting to implement SWORD, plus implementations for DSpace and Fedora. There is also a client (with GUI and CLI versions) which you can use to deposit content into the repositories.

The DSpace implementation is designed only to work with the forthcoming DSpace 1.5 (which is currently in Alpha release). Your feedback and experiences with the code would be much appreciated. We expect to be making refinements to the DSpace implementation up until DSpace 1.5 is released as stable.

Wednesday, 31 October 2007

Scandinavian Dugnad

I was invited by the Scandinavian DSpace User Group meeting to join them in their first official meeting yesterday in Oslo. It was great to see so many people representing a small-ish geographical area and a reasonably small population all together from 4 nations (Norway, Sweden, Finland and Denmark) to talk about DSpace. Probably 35 people all-in, with plans to extend the group to be the Nordic DSpace User Group to include members from Iceland, and perhaps even the Faroe Islands, and Greenland (if DSpace instances appear there).

In the grand traditions of Open Source and Open Access, I borrowed presentations given at the recent DSpace User Group Rome, and gave them an update on the state of the DSpace Foundation, DSpace 2.0, and then went on to produce some original slides telling folks how to get involved in DSpace developments. Hopefully all the content will be available on the web soon.

As your humble chronicler struggled with his sub-par Norwegian, he picked up some interesting things. There is good user-end development going on in Scandinavia which could be harnessed to bring improvements to the DSpace UI. There are also increasingly many requests for "Integration with ...", where the object of integration is one of a variety of library information systems. Statistics are high on the agenda here, as they are everywhere else. There is also a base of experts in multi-language problems, stemming from these being polyglot nations with additional letters in their native alphabets.

It's clear where the future of repositories lies in Scandinavian nations, where the national interest and the community feature prominently in society and culture. Bibsys, a major supplier of library systems and services in Norway (and organisers of the meeting), have 29 DSpace clients on their books already, and are looking at tighter integration between it and their other products, right down to the information model level. National research reporting systems are much desired as repository data sources, and internal information systems at each institution are starting to feed into their public repositories.

With such a big user group, and such a community focus, there is little doubt in my mind that the Nordic user group will be a great asset to the DSpace users in that region, and probably to the DSpace community as a whole.

PS Dugnad is a Norwegian word effectively referring to voluntary, communal work which benefits the community to some degree, but is also social and enjoyable for the participants. It also formed the basis of the 2006 DSpace User Group Meeting in Bergen.

Friday, 26 October 2007

Exciting news from the pages of the Chronicles

Some of you will already know this, but for the benefit of those that don't but would like to, here is some job-related news on my part.

With the recent launch of Spiral, I have felt free to consider again my place in the world, the work I do on Open Source and Open Access, and my general future, knowing that if I were to leave Imperial College, I would not be leaving having achieved nothing visible.

I have, therefore, decided to make a move from the academic into the commercial sector, and have taken up a position with HP Labs to work with DSpace especially in the context of India, where it has become extremely popular. So towards the end of next month you will see the "About Me" section of this blog get updated, and I may vanish off the radar for a week or two while I get myself up and running in this new post.

I'm greatly looking forward to working with the DSpace folks in HP Labs Bristol, Bangalore and Vermont!

Thursday, 25 October 2007

DSpace 1.5 Alpha with experimental binary distribution

The DSpace 1.5 Alpha has now been released and we encourage you to download this exciting new release of DSpace and try it out.

There are big changes in this code base, both in terms of functionality and organisation. First, we are now using Maven to manage our build process, and have carved the application into a set of core modules which can be used to assemble your desired DSpace instance. For example, the JSP UI and the Manakin UI are now available as separate UI modules, and you may build either or both of these. We are taking an important step down the road, here, to allowing for community developments to be more easily created, and also more easily shared. You should be able, with a little tinkering, to provide separate code packages which can be dropped in alongside the dspace core modules, and built along with them. There are many stages to go through before this process is complete or perfect, so we encourage you to try out this new mechanism, and to let us know how you get on, or what changes you would make. Oh, and please do share your modules with the community! Props to Mark Diggory and the MIT guys for this restructuring work.

The second big and most exciting thing is that Manakin is now part of our standard distribution, and we want to see it taking over from the JSP UI over the next few major releases. A big hand for Scott Phillips and the Texas A&M guys for getting this code into the distribution; they have worked really hard.

In addition to this, we have an Event System which should help us start to decouple tightly integrated parts of the repository, from Richard Rodgers and the guys at MIT. Browsing is now done with a heavily configurable system written initially by myself, but with significant assistance from Graham Triggs at BioMed Central. Tim Donohue's much desired Configurable Submission system is now integrated with both JSP and Manakin interfaces and is part of the release too.

Further to this we have a bunch of other functionality including: IP Authentication, better metadata and schema registry import, move items from one collection to another, metadata export, configurable multilingualism support, Google and html sitemap generator, Community and Sub-Communities as OAI Sets, and Item metadata in XHTML head <meta> elements.

All in all, a good looking release. There will be a testathon organised shortly which will be announced on the mailing lists, so that we can run this up to beta and then into final release as soon as possible. There's lots to test, so please lend a hand.

We are also experimenting with a binary release, which can be downloaded from the same page as the source release. We are interested in how people get on with this, so let us know on the mailing lists.

Come and get it:

DSpace User Group 2007, Rome

Last week was the annual DSpace User Group Meeting, this year held in Rome, hosted by the Food and Agriculture Organization of the United Nations:

These guys have an interest in DSpace for sharing knowledge throughout the developing world, and kindly offered to run the user group this year. The FAO building is set at the east end of the incredible Circus Maximus, and just 5 minutes up the road from the Colosseum. And we could see all of this from the 8th floor terrace cafe where lunch and coffee were served every day.

The presentations for this event are mostly available online, at:

If there are presenters reading this whose papers are not yet online, please contact the conference organisers so they can make them available.

I felt that this year the balance between technical and non-technical presentations was struck particularly well. While there were streams of non-technical presentations, there were highly technical tracks for the developers among us to attend. Specifically worth a mention was Scott Phillips' Introduction to Manakin, which is something we will all need to get to grips with in the long run, and something which I knew woefully little about. After that session, though, I'm confident about getting stuck in.

The quality of the work going on with DSpace is definitely reaching a high degree of maturity, with increasingly many developments leveraging the latest features of DSpace in new and innovative ways. For me this suggests that our platform has approached a critical point where we must, as a community, find a way to make these developments easier to share, easier to adopt, and easier to write.

So thanks from me to the organisers. It was great to see the usual suspects again, but it was equally great to put faces to names from the mailing lists and IRC. See you all next year!

Wednesday, 24 October 2007

my my where did the summer go

OK, ok, it's been a long long time since I updated. Did I say at the beginning that this was an experiment in seeing if I was capable of maintaining a blog? If I didn't I should have done.

But there's a good reason that I've not updated for a while: I've been working flat out on the Imperial College Digital Repository, Spir@l, and am pleased to finally announce in a quiet way that we are officially LIVE:

On the outside it doesn't look too serious. A standard-looking DSpace, I hear you say, with an Imperial College site template on it. And you'd be right, but only about the tip of the iceberg.

Without wishing to blow my own trumpet (modesty is the third or fourth best thing about me), please do check out the article which I co-wrote with my good colleague Fereshteh Afshari:

And you may also be interested in my presentation at the recent DSpace User Group Meeting in Rome 2007 (more on that later, maybe):

I could probably be persuaded to write a little here about how it works; maybe you'll even get snippets from the monolithic technical documentation that I'm in the middle of writing.

Oh, and there's more news, but now I've got your attention again you have to wait for the next installment.

Thursday, 10 May 2007

EThOSnet Kick-Off

On Tuesday of this week the EThOSnet Project Board met for the first time to kick off this significant new project. For background, this project is the successor to the EThOS project, which in turn grew out of the Scottish projects: Theses Alive at Edinburgh, DAEDALUS at Glasgow, and Electronic Theses at the Robert Gordon University.

The aim of EThOSnet is to take the work done under EThOS and bring it up to a point where UK institutions can actually start to become early adopters, to start to digitise the back-catalogue of print theses in the UK, investigate technology for the current and the future incarnations of the system, and to basically kick-start a genuinely viable service for deposit and dissemination of UK theses.

At this stage, the project does not have a Project Manager, which is causing minor hold-ups initially, but the Project Director, Clare Jenkins, Director of Library Services at Imperial College, has stepped in to hold things together until one is appointed (we are expecting to hear very soon). In the interim, the Project Board has also been put in place to check that all 7 Work Packages have the things they need to get going.

Of these 7 work packages, the first and last are concerned with project management and exit strategy, and the meat of the project will take place in packages 2-6. Details of these work packages are available in the project proposal, which will hopefully be available on the JISC website soon.

A quick summary, then, of some of the changes and more concrete decisions that we made during the meeting:

  • We have set a pleasingly high target of 20,000 digitised theses and 3,000 born-digital theses by the end of the project. This will be sourced from the many institutions who have already expressed an interest in adopting the service, before the project is even going!

  • The first port of call for the technology is to smooth the integration of the existing software tools for repository users. I would hope to have something which works well for DSpace available quickly, and general enough to be part of the main distribution. EPrints is already fully compliant, and Fedora has representatives from the University of Hull looking after it.

  • Communications will be done primarily through a soon-to-exist project wiki, and it is hoped that the existing E-Theses UK list will be used more heavily than it is already. Imperial College has agreed to host the existing ethos website, the wiki, and potentially the toolkit if necessary (currently hosted at RGU).

  • Toolkit development will be ongoing, with work being done on it within a wiki, but with the option to move to some XML format for the final product.

This is a very big project, and I can't possibly represent everything that came out of Tuesday's meeting here. In the near future expect to see links to the project wiki appear and more information to come out.

Wednesday, 2 May 2007

vive la revolucion

Today I'm happy to see a major hardware manufacturer teaming up with a major Linux distro, and doing so in a nice visible place like the BBC:

I've been a Linux user for some years now, but when I first made the switch from the competition it was still a very difficult thing to do, even as a professional computer geek. Ubuntu seems pretty good, and hopefully it will help encourage non-expert users to have it installed before they even get their laptop home.

Tuesday, 24 April 2007

Ridiculously well integrated IDE and Application

This week I have been playing around a lot with Eclipse, which is an Integrated Development Environment platform, geared principally (but by no means exclusively) towards Java development. I've been an Eclipse user for some time, but principally for its ability to mediate nicely with my version control server (Subversion), and the pleasantness of using the graphical editing tools. I was aware, though, of the huge potential it has for rapid application development, and finally got around to putting some time aside to investigate.

The product of my attempts has been documented here, on the DSpace wiki:

HOWTO Integrate DSpace with Eclipse and Tomcat

Along the way I found all sorts of goodies, like the Database Explorer tools, that allow me to execute SQL directly from files open in the editor onto my running database, and a variety of graphical and semi-graphical tools for editing my files. The full power of the source code analysis for a properly set-up project is staggering, as are the refactoring tools.

The real kicker, though, the feature that makes this effort all worthwhile is that I can now run DSpace (or any other web application) from within Eclipse. It looks after controlling Tomcat, and ties in the Tomcat debugger to Eclipse, so that (and this is the really cool bit) I can set points in my application source for execution to halt while I examine the state of the machine. So, I load a web page, which invokes code within which I have set "breakpoints", and Eclipse immediately takes over, opens up a debugging environment, and allows me to step through the code, line by line if I like, examining the in-memory objects all the way. Awesome. In the original sense of the word.

I've seen other people use similar functionality (for example in Visual Studio), so I'm glad I've replicated it in Eclipse. Of course, now I'll discover that everyone has been doing this for years, and I'm the last to catch on. But if I'm not the last to catch on, or if you at least want to figure out how to get DSpace working in this environment (non-trivial), then I strongly encourage you to give this a go.

Monday, 23 April 2007

Google Summer of Code Go-Ahead

At the start of last week Google announced the students who had been funded under this year's Summer of Code project. DSpace is pleased to be able to say that it has 5 students being funded to do development for the platform over the next 3 to 4 months. The details of the projects are as follows:

DSpace Content Integrity Service

- Student: Jiahui Wang
- Mentor: Jim Rutherford
- Mentor Backup: Scott Phillips

Portable citations: moving citations between DSpace and bibliographic software managers

- Student: Jodi Schneider
- Mentor: Stuart Lewis
- Mentor Backups: Claudia Jurgen, Christophe Dupriez

DSpace Versioning

- Student: Robert Graham
- Mentor: Robert Tansley
- Mentor Backups: Mark Diggory + Scott Phillips


- Student: Federico Paparoni
- Mentor: Richard Jones
- Mentor Backup: Stuart Lewis

Visualization Artifacts for Manakin/DSpace

- Student: Brian Eoff
- Mentor: Scott Phillips
- Mentor Backup: Mark Diggory

An overview of DSpace involvement is available here:

We are currently involved in the preliminary discussions with the students over their work, and design and development will start in earnest during May. Watch this space for updates.

OAI-5 Presentations

Presentations from OAI-5 are now available online, as well as videos of some of the presentations:

Tuesday, 17 April 2007

Configurable Browse System released

Well, things have been a little quiet on here, what with my extended Easter break. But now I'm back in the office and getting on top of things again. There are a few things to update on, which I will do over the next couple of days. The most exciting thing today is that I have finally finished the release candidate code for the new Browse system for DSpace. If you are a DSpace user, please check it out at:

I have been using the core of this for some time, so it should be stable. Your feedback would be welcome.

Thursday, 22 March 2007

Crawling like Ants

During experiments yesterday I discovered that running Java processes from inside the Ant build tool can have unforeseen performance issues. I had written an Ant task to build the browse indices for my DSpace system. This involves producing 9 separate indices for around 80,000 records, and is not a rapid process at the best of times. Previous executions of this code have yielded index rates of approximately 10 - 15 items per second, which I was pretty happy with. Running it in Ant, though, dropped my performance right down to a low of 1 item every 2 seconds! After this had run for several hours I killed it and tried again directly from the command line; up came the performance again to its usual standard.

So, what is going on? Here are some details:

- while indexing in Ant, the box was under almost no load - no physical memory shortages, disk I/O bottlenecks, etc.

- I toyed with the idea that memory allocation to the JVM was the problem, but I've seen the indexer run with different memory allocations, and it has so far never caused a speed problem (just OutOfMemory errors)
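One thing I plan to try (a guess on my part, not a confirmed diagnosis): by default Ant's <java> task runs the class inside Ant's own JVM, so the indexer inherits Ant's heap settings, classloader and garbage collection behaviour rather than the ones it gets from the command line. Forking a fresh JVM should at least make the two environments comparable. A sketch of what that looks like, where the target name, class name and heap size are illustrative rather than copied from the real build file:

```xml
<target name="build-indices">
  <java classname="org.dspace.browse.IndexBrowse"
        fork="true"
        failonerror="true">
    <!-- fork="true" launches a separate JVM, so the task gets
         its own heap settings instead of sharing Ant's -->
    <jvmarg value="-Xmx512m"/>
    <classpath refid="build.classpath"/>
  </java>
</target>
```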

Answers on a postcard. Or in a comment.

Monday, 19 March 2007

Repository 66 and the Google Map Adventure

Tim Brody from the University of Southampton has just blogged some URLs to add repository locations to Google Earth.

I thought it would be worth adding that the University of Aberystwyth's repository guru Stuart Lewis has been running Repository66 for a couple of months now, with the same premise (except you don't have to download Google Earth to use it).

Thursday, 15 March 2007

Google Summer of Code

DSpace is pleased to announce that it has been accepted as a mentoring organisation in Google's Summer of Code 2007.

The following DSpace developers are officially mentoring for the period:

Robert Tansley (Google)
Jim Rutherford (HP)
Richard Jones (Imperial College)
Stuart Lewis (University of Aberystwyth)
Claudia Jürgen (Universität Dortmund)
Scott Phillips (Texas A&M University)

I am very excited about it, and we have the makings of plenty of ideas for the developments that can be undertaken.

Wednesday, 14 March 2007

Blackwell vs Norway (part 2)

Just a quick update to note that the University of Bergen has made a new 3 year deal with Blackwell, after quite a long negotiation period. I don't have any particular details, but I am sure that they will be reported on by UiB when they are ready.

Wednesday, 7 March 2007

IR Manager site and mailing list

Dorothea Salo has produced a new site and a potential set of resources for non-software-specific IR management issues:

It has a weblog, forum and mailing list. It will be interesting to see if this takes off alongside the many other disparate resources for repository managers, from software-specific lists such as DSpace General to broader-ranging lists such as American Scientist and SPARC OA. Perhaps, in the long run, the forum might be the source of some kind of generalised How-To or FAQ for repository management, which would be a valuable resource.

Tuesday, 6 March 2007

Blackwell vs Norway

Last week, news of the fracas between the Norwegian university libraries and Blackwell was reported on Peter Suber's Open Access News. This was conveniently timed, because I have just spent the weekend in Norway visiting old colleagues, and getting the low-down on this matter direct from the source.

Norway is lucky to have a man with the wisdom and experience of Ole Gunnar Evensen heading up the team running the negotiations with Blackwell. And whatever happens when they have the final meeting in a week's time, this is a significant event in the "serials crisis" story. If things go well, Blackwell will accede to the university consortium's requests over pricing; but even if things go as badly as they could, Blackwell will undoubtedly look like unreasonable bullies to all of their other customers.

Ole Gunnar cited to me 3 principal problems that they are encountering (copy taken from this article in På Høyden):

  • The publishers fix their prices on the basis of how many subscriptions the institution has had, and the libraries must thereby pay for subscriptions that individual departments or research centres have had on the side.

  • A high annual price rise in the contractual period is being demanded, in Blackwell’s case 7 per cent.

  • The publishing houses concede a discount for transition to pure electronic subscriptions, but this is much lower than what the publishers actually save. (‘The discount is normally around 10 per cent, but on top of that we have VAT, in Norway’s case 25 per cent’, explains the head of the Acquisitions Division at UB, Ole Gunnar Evensen.)

I, for one, wish Ole G and Kari Garnes the best of luck next week for their showdown.

Friday, 23 February 2007

JISC Capital Circular 4/06 outcomes

Today has been an exciting day. The projects I am potentially involved in that have so far been announced as funded under the last round of JISC bids, from November last year, are as follows:

SWORD - Repository Deposit API development work in association with Aberystwyth, Southampton, Hull, Cambridge, Birkbeck (University of London), National Library of Wales, and Intralect, as a DSpace advisor and developer

EThOSnet - A major e-theses project following on from the great work of the recently completed EThOS project. Imperial is pleased to be leading this project, with partners from the following institutions: Leicester, Warwick, the British Library, Nottingham, Hull, Glasgow, Birmingham, National Library of Scotland, Edinburgh, Southampton, Cranfield, Robert Gordon University, Aberystwyth, Cardiff, Loughborough, National Library of Wales, and Exeter. What a team, and what a great looking project. My role is yet to be formalised, but hopefully it will be somewhere in the area of software development ;)

The future for repositories at Imperial looks bright. Today we completed our first UAT for our upcoming IR service "Spir@l", and over the course of this year we are due to go live with that service and our own internal e-theses management system. The outcomes of these two projects will no doubt play a role in shaping our repository environment, which I hope will rapidly become one to be proud of.

Tuesday, 30 January 2007

AAP PR campaign: opinion

The last week or so has seen an explosion of discussion over the American Association of Publishers' hiring of a well-known PR firm whose director is known as the 'pit bull' of the PR community. Others have covered the details, so I won't go over them here; instead check out the coverage at Peter Suber's blog.

It's already been pretty heavily commented on, so I wasn't going to add anything, but I've not yet seen the words of warning that immediately sprang to my mind when I read about this. Most commentary has been of the "they know they're backed into a corner, and they're fooling nobody" line. While I agree that those of us on the other side of the fence are not fooled by this, it is not us that they are concerned with. If Microsoft want to outdo Apple, they don't market to Apple employees, saying "we're better than you, so just give up". Whether or not we know this is just FUD is irrelevant - it is the people who ultimately make the decisions who are the targets of a campaign like this, and those people are our practicing academics and, to a degree, members of the public.

We are all aware that people will believe the most ridiculous things if they're told them the right way, and being a top academic does not change that (I've seen some "interesting" opinions on OA from very senior staff). The battle is between the links twixt us and the academics and the links twixt publishers and the academics. If the AAP can convince their authors that OA is bad/wrong/immoral/censorship then we have a serious problem on our hands.

An analogous situation might be the Linux vs Windows argument, which has been raging for some time in this PR zone. Linux might be in the right, but at every step of the way Microsoft have yet more cards to play to maintain their stranglehold monopoly. I don't think we've seen the end of this; in fact, I would say that we are only just past the fuseki, and the middle-game is now underway. We cannot allow ourselves to relax in the knowledge that the publishers have admitted that we're right, and a threat, because we've always known that.

How does a loose community (by necessity) such as the Open Access community combat a well-directed organisation which is seriously motivated to see itself prevail? If you know the answer to that, then it won't just be this dispute that we can solve.

DSpace Architecture Review Report

Following a meeting at MIT, in Cambridge, Massachusetts in October 2006, the findings of the DSpace Architecture Review have now been presented at Open Repositories 2007:

Many thanks to John Mark Ockerbloom for all his work, and for guiding the architecture review group.

Monday, 29 January 2007

The Institutional Repository: sales figures for year 1

Self-indulgent though it might be, I am pleased to report that The Institutional Repository, the book I co-wrote with Theo Andrew and John MacColl of the University of Edinburgh, has sold a total of 737 copies in its first year. This is well in excess of what I had anticipated, so big thanks to any and all of you out there who purchased a copy.

Open Repositories 2007: preliminary feedback

I was, for various reasons, unable to attend the Open Repositories 2007 conference in San Antonio last week. Although the presentations themselves don't appear to have made it online yet (I'll post when they do), there has been plenty of blogging going on, especially over at Jim Downing's blog:

And to save the clicks, here's some that he's linked to:

Plus some great summaries from Dorothea Salo, for which, as a non-attendee, I am very grateful:

Promise I'll be there for Open Repositories 2008 in Southampton. It is only just up the road!

EDIT: All the links were broken the first time I published this. All fixed now.

Thursday, 25 January 2007

EPrints 3.0 Released

The release of 3.0 was announced yesterday at the Open Repositories 2007 conference in San Antonio

DSpace developer though I am, the Open Source voice in me reminds us that diversity in software and breadth of choice is part of the point, so congratulations to the team on their latest release.

Mind you, despite the press release saying "EPrints is already the world’s leading software for producing open access institutional repositories", the Registry of Open Access Repositories (ROAR) lists (at time of writing) 218 EPrints repositories and 223 DSpace ones. I'm just saying ;)

Thursday, 18 January 2007

Knowledge Exchange Workshop 16 - 17 January

I have just returned from a very interesting workshop organised by the Knowledge Exchange organisation, on the topic of Interoperability and Institutional Repositories. There were around 70 experts from the 4 countries involved in Knowledge Exchange (UK, Denmark, The Netherlands, and Germany) discussing the following broad topics in the context of interoperability:

  • e-theses


  • Research Paper Metadata

  • Usage Statistics

  • Exchanging Research Information

  • Author Identification

The findings of each group should be made public shortly, and I will be sure to post the location of any resources that I am aware of.

In the meantime I can present only the outline of the findings of the group I was in: Exchanging Research Information. This was focussed on the possibility of integration or interoperation between Current Research Information Systems (CRIS) and Open Access Repositories (OAR). There were representatives from both communities, and a large part of the meeting was spent on each of us coming to understand the other. The Common European Research Information Format (CERIF) was introduced to us, in light of the upcoming release of the latest revision.

It was initially felt, especially by the CRIS community, that interactions between CRIS and OARs would be very one-way, with the CRIS simply making the relevant information available to the OARs. This doesn't strike me as the definition of interoperability, so it was necessary for us to examine what the relationship between the data held by each system really was.

The approach taken was to analyse a simple use case for the following features:

1) What information it would need to encompass
2) Where the information could be obtained
3) Where the information would be of interest

The use case is the traditional repository use case of "Deposit", although it was necessary to formulate this in a more general way as a "Publication Registration Process". This allowed us to successfully abstract away from where the User Interface for such a registration process lay, and thus to take away some of the arguments over whether this was the domain of the CRIS or the OAR.

Throughout the meeting, the discussion was very wide ranging, but out of it were extracted some important similarities and differences between CRIS and OARs. The most basic formulation of the key difference is as follows: the CRIS's primary interest is in high-quality, accurate metadata, while the OAR's primary interest is in content, and it can live with lower-quality metadata. This exposes two things: how the CRIS can be of benefit to the OAR, and how the problem domains do not overlap quite as much as it might first appear. My conclusion from this is that the interoperability we are talking about is actually about finding the layer at which these two systems' domains can be stitched together for the benefit of the research community.

With this discussion under our belts, then, we enumerated first the information that CRIS are interested in, and then the information that OARs are interested in. The following list is not exhaustive, but gives an example of the differing perspectives:

  • CRIS

    • project information

    • bibliographic metadata

    • researcher role

    • scientific impact

  • OAR

    • bibliographic metadata

    • administrative metadata (technical, preservation, etc)

    • collection/group information

    • full-text / content

    • {persistent} identifier

The resulting analysis of the use case showed that information needed to come from all corners to achieve this process, including the special case of author information, which may come to the process from yet another system, albeit via the CRIS.

The general consensus of the meeting was that a working group needs to look closely at the interactions going on in this and other use cases, and specify a set of interfaces and content models that can allow for interchange of the relevant data. This should be followed by a reference implementation and service. It was proposed that a project looking at these issues might take e-theses and other grey literature as its basis, as they may prove to be the easiest place to start.

It was good to see plenty of crossover between this and other strands. The bibliographic metadata obviously mattered to the Research Paper Metadata group, while starting with e-theses and grey literature will matter to the E-Theses group. That author names may have to come from some third-party system may well be connected to the Author Identifier group, and since interoperability is of the essence, you can barely go any distance before considering at least the base problems which OAI-PMH addresses.

All in, an interesting meeting, and I'm looking forward to seeing the reports that will be published by the group moderators in due time.

Wednesday, 17 January 2007

EC Petition for Open Access

The Knowledge Exchange organisation has set up an online petition to be sent to the European Commission in support of the recent recommendations in the following study:

Study on the Economic and Technical Evolution of the Scientific Publication Markets of Europe

Currently, the recommendations are being lobbied against by publisher groups, so Knowledge Exchange feel that the other side needs to be represented. Therefore, if you would like to support the study, you can find some more information and option to sign at the following location:

The principal concern is to support recommendation A1:


Saturday, 13 January 2007

ORE Technical Committee Meeting 11 - 12 January

On 11 and 12 of January, 13 members of the ORE Technical Committee met at Columbia University in New York for the first face-to-face meeting of this project. Attendants were (in no particular order): Tony Hammond (Nature Publishing), Michael Nelson (Old Dominion University), Pete Johnstone (Eduserv, on behalf of Andy Powell), Ray Plante (NCSA), David Fulker (UCAR), Richard Jones (Imperial College London), Peter Murray (OhioLINK), Jeff Young (OCLC), Rob Sanderson (University of Liverpool), Tim DiLauro (Johns Hopkins University), Simeon Warner (Cornell), and of course Herbert van de Sompel (LANL) and Carl Lagoze (Cornell).

The results of this meeting are due to be reported at Open Repositories 2007 at the end of this month, once they have been formalised from the complex debate and discussion that occurred at the meeting, so I won't attempt to summarise outcomes in any detail.

We began with an overview of the problem domain: compound digital objects in a heterogeneous environment, which must be operable within the web architecture. One of the core outcomes of the project, therefore, will be a specification for describing these objects and their internal and external relationships. Each of the attendant committee members was given the opportunity to present their thoughts on the initial documentation for the project. These ranged from commentary on a privately circulated white paper through to suggestions on implementation technologies or methodologies that might be appropriate.

On the second day of the meeting we moved on to start formalising the goals for the various aspects of the project. This included our communication channels, our use cases, what we understand by the format that will help us describe structures and relationships, and our forthcoming work and subsequent meetings.

Communication for the project will happen through private mailing lists and a wiki. All outcomes from the project will be pushed out to the ORE website, and later there may be a project blog when there are findings to disseminate. We also specified 6 use cases and assigned members of the technical committee to examine the use case titles and develop some working "stories" from them. These use cases should be ready in time for presentation at Open Repositories 2007.

Overall, it feels like we covered significant ground in just two short days, although I for one found the results of the meeting quite complex, and in need of some significant work to distil into coherent results. Carl and Herbert will be carrying out this analysis in the coming weeks, after which the meeting results will be made available.

Tuesday, 9 January 2007

UK and Ireland DSpace User Group Videoconference presentation

Videos and presentation slides are now available from the UK&I DSUG held on 24 November 2006.



Many thanks to Stuart Lewis at the University of Wales Aberystwyth for organising this event, and for making the presentations and videos available in their institutional repository.

At the moment, these presentations are only available in Windows Media and Real Media formats, due to limitations at the video editing suite in Aberystwyth.

The presentations given at this meeting were under the following titles:

  • Inside, Outside, Where Have We Been? The Who - of DSpace
    development in Trinity College Dublin (along with the why, the
    what and the how)

  • Distributing repository functions with DSpace [yours truly]

  • Next Steps for the China Digital Museum Project

  • What OR did next, or administering admins in a hosted repository

  • Thanks Google! A love-hate relationship

  • An update from the DSpace Architecture and Technology Review [yours truly]