Friday, 22 December 2006

Christmas Vacation Notes

It's Christmas time, and so I probably won't be updating again until after the holidays unless anything specific comes up.

In the new year I will be attending the following events, so you can expect updates about them in time:

  • ORE Technical Committee meeting, Columbia University, New York. The first of four official meetings of the ORE technical committee, on 11 - 12 January.

  • Knowledge Exchange Workshop. Held in Utrecht on 16 - 17 January.

I have also recently been kindly invited to join the JISC CRIG (Common Repository Interfaces Group), which will meet early next year.

Merry Christmas Everyone.

Wednesday, 20 December 2006

Calgary reverses decision on DSpace

Just a few hours after yesterday's post looking at Calgary's reasons for rejecting DSpace as a platform for e-theses, an email came round the DSpace General list indicating that, after the amount of support and feedback they'd had, Calgary have rescinded their decision to use WebGencat and decided to go back to DSpace after all.

A triumph for scholarly communication!

Tuesday, 19 December 2006

Calgary Rejects DSpace for E-Theses Archives

The Rejection of D-Space : Selecting Theses Database Software at the University of Calgary Archives

This article outlines reasons that the University of Calgary Archives chose not to implement DSpace to manage their e-theses, although they are still using it for the Institutional Repository system.

I'd like to include some responses to the significant issues that Calgary encountered:

  • Searching Issues: I think that the main problem here is a lack of information about the customisations available in both the search and the browse. The existing browse has a very primitive customisation capability, but the search indexes are extremely flexible, and probably the addition of search.index.<x> = would be sufficient to construct accurate searches or reports on specific years. The responsibility that DSpace must admit to is that the search interface is not the most flexible, and it is not possible to construct searches like this without knowing exactly what you are doing. This is a very serious issue, because the report cites a lack of sufficient search indices as a major concern, and yet a quick peek into the dspace.cfg file and a quick tweak of search/advanced.jsp would solve all of these problems.

  • Browse by date: just to blow my own trumpet for a moment, when the browse code I've been working on is finished, it will be possible to index by date, and browse only items within a specific period. So if date.submitted = 1994, then you can browse on all date.submitted where it is 1994 and so on

  • Counting results: using the mechanisms above it would be possible to get the results counted correctly. Even if not, recourse back to generating SQL queries to do the counts would be much more desirable than counting by hand, and since our code base is open, you are free to write in the functionality

  • Cannot do X: much of the content of this article is about how the UI cannot do X (allow you to search for more than 3 terms at a time | search on the relevant fields). DSpace is an Open Source product, where words like "cannot" should not be used unless you really have looked into it. The underlying search engine can do all of the things required for Calgary, and all it requires is the alteration of the UI to support it

  • Reporting Issues: as has already been noted at time of writing on the DSpace lists, DSpace is not a reporting tool. I understand Calgary's pain here, as it is part of my remit to report on the content and activity of our repository. I will be using a combination of the log file analysis, the web server logs, and potentially the Minho stats add-on. I would welcome a suite of reporting tools for the platform, though

  • Sorting search results: That DSpace can't do this is a complete nuisance, and I agree that it should be able to do so. Lucene, which is the search engine upon which DSpace relies for this functionality, does support result sorting, so if anyone wants to have a go at adding in the feature, I expect you would be very popular

  • Public and Private Display: This is also a potential shortcoming of DSpace, and was certainly not a design goal in the initial case. DSpace was originally intended to help you achieve open access, and so does not do so well with the ideas of public and private views. To fix this here at Imperial we have one public repository and X private repositories which deal content to the public one on demand (and yes, one of those private repositories is an under-construction e-theses repository built on DSpace).

  • Scalability: I feel that Calgary didn't do their research regarding scalability here. They worry that adding 600 new items annually might be a problem. Cambridge have 200,000 records in their public repository, and we have 75,000 records in our private repository, and while there are scalability bumps, they are being ironed out, and they don't show up until you get into the tens of thousands of records at all. I blame this on the over-cited (so I won't) DSpace Scalability Issues page on the wiki being people's only source of information.
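
To make the browse-by-date and counting points above a little more concrete, here is a minimal sketch in plain Java. It is not DSpace code: the Thesis class and the method names are hypothetical stand-ins, and a real implementation would presumably work against the browse tables or an SQL query rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are DSpace classes.
class ThesisCountSketch {

    // A minimal stand-in for an item with a title and a submission year.
    static class Thesis {
        final String title;
        final int yearSubmitted;

        Thesis(String title, int yearSubmitted) {
            this.title = title;
            this.yearSubmitted = yearSubmitted;
        }
    }

    // Return only the theses submitted in the given year (a "browse by date" filter).
    static List<Thesis> browseByYear(List<Thesis> all, int year) {
        List<Thesis> matches = new ArrayList<>();
        for (Thesis t : all) {
            if (t.yearSubmitted == year) {
                matches.add(t);
            }
        }
        return matches;
    }

    // Tally theses per submission year, with the years in ascending order.
    static Map<Integer, Integer> countByYear(List<Thesis> all) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Thesis t : all) {
            counts.merge(t.yearSubmitted, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Thesis> theses = List.of(
            new Thesis("Glacial Hydrology", 1994),
            new Thesis("Prairie Grasses", 1994),
            new Thesis("Oil Sands Policy", 1995));
        System.out.println(browseByYear(theses, 1994).size() + " theses submitted in 1994");
        System.out.println("Counts per year: " + countByYear(theses));
    }
}
```

The point is only that both the restricted browse and the per-year tally are a few lines of code once the submission date is indexed, which is far preferable to counting by hand.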

So what can we, as the DSpace community, learn from this? The first thing is probably about documentation - we are very good at documentation, but sometimes it can be hard to find what you are looking for. Could we do more, and if so what?

The second is about customisability of our UI. Manakin will hopefully lay to rest many of the problems that we come up against with our UI customisations, but we should also bear in mind that library administrators want graphical interfaces to modify their configuration in all senses. I can give you a concrete example of where having our config defined through the UI would be useful: At Imperial College if we want to change the configuration and restart the application server to pick up the changes, we need to go through an official "Change Request" process. If this could be done through the UI, though, this would become an administrative task, and would not require the extra bureaucracy! In addition, it will make it easier for non-technical folks to understand what options are available to them.

Third is about the nature of Open Source. DSpace has many known problems or discrepancies between what the UI will allow and what the underlying application will actually support. The DSpace core is much more powerful than the UI would have you believe, and those of us who spend most of our time "under the hood" can testify to the things that you can make it do if you know what you are doing. The problem at Calgary was that they didn't appear to understand that in order to make the system work in their exact niche, it was going to require some modification. This is a different issue, in my mind, than not having the resources to undertake said modifications.

Nonetheless, if WebGencat meets their needs, and DSpace does not, then that is a success for the diversity of software products available and the evaluation process in this part of the market. We (the DSpace community) need to learn from the feedback in the report, and hopefully use it to make our system better.

Wednesday, 13 December 2006

Richard Poynder interviews Professor Tony Hey

Respected Open Access journalist and blogger Richard Poynder interviews Professor Tony Hey of Microsoft and The University of Southampton on his career, his move to Microsoft, his stance on Open Source and Open Access:

This interview was probably kick-started at least in part because Southampton's world class Institutional Repository package can now be run smoothly (I'm told) on Windows, and because of Microsoft's apparent interest in funding Open Source development work (but not GNU-licensed work, only BSD or similar).

There are sentiments expressed in this interview which set my Open Source alarm bells ringing. Phrases like Microsoft "promise not to sue you" don't sit well with me, and the idea that "if Microsoft doesn't patent these technologies, someone else will" doesn't strike me as getting to the nub of the argument. Nor does the claim that the GPL "denies the existence of a software industry". These don't strike me as statements on behalf of a company that really wants to share, and my paranoia alarm says that things will only continue like this for as long as it suits them. Be on your guard.

As ever, a thorough treatment by Richard Poynder.

Monday, 11 December 2006

OAI-ORE Project Briefing

Carl Lagoze and Herbert Van de Sompel recently presented a briefing of the new OAI-ORE project at the CNI Fall Task Force meeting, and have made the slides available on the OAI-ORE website:

"The Canonical Representation Format (CaRF) [is the] Format to express a
manifest of all available Representations (and Resources) for a

Fleshing out the CaRF is probably the core effort of OAI-ORE"

Friday, 8 December 2006

Sorting in databases

Discovered via a circuitous route, Jim Downing notes that Dorothea Salo has a great tip for fixing sort ordering in the DSpace browse. Since I'm working on this feature as we speak, it's important for me to be able to take this on board. In fact, I have the following feature planned:

Allow each field to request a Normaliser for its entry into the sort_value column of the browse system. Using the PluginManager for DSpace, this might look like this:

String myValue = "some value to be normalised";
String myLang = "en"; // this is the language I want to normalise into
Normaliser myNormaliser = (Normaliser) PluginManager.getNamedPlugin(Normaliser.class, myLang);
myValue = myNormaliser.normalise(myValue);

our configuration for this would then probably just be something like the following: = \
org.dspace.browse.EnglishNormaliser = en, \
org.dspace.browse.NorwegianNormaliser = no

and so forth. Then, the way your normaliser works would be up to you, and perhaps for Dorothea's example, you would just need to maintain a mapping file of Unicode values and their target English representation.
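
As a rough idea of what one such normaliser might look like, here is a self-contained sketch. Everything in it is an assumption for illustration: the Normaliser interface and EnglishNormaliser class follow the names proposed above but are not part of any released DSpace API, the small in-code table stands in for the mapping file, and the PluginManager lookup is left out so that the example compiles on its own. It combines the mapping idea with the JDK's own java.text.Normalizer, which decomposes most accented letters so that the accents can be stripped.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: these names follow the proposal in this post and are
// not part of any released DSpace API.
class NormaliserSketch {

    // The contract each language-specific normaliser would fulfil.
    interface Normaliser {
        String normalise(String value);
    }

    // Folds accented Latin text to plain lower-case ASCII for sorting.
    static class EnglishNormaliser implements Normaliser {

        // Characters that Unicode decomposition leaves untouched. In practice
        // this table would be loaded from a mapping file.
        private static final Map<Character, String> MAPPING = Map.of(
            '\u00F8', "o",    // o with stroke
            '\u00D8', "O",    // O with stroke
            '\u00E6', "ae",   // ae ligature
            '\u00C6', "AE",   // AE ligature
            '\u00DF', "ss");  // sharp s

        @Override
        public String normalise(String value) {
            // First apply the explicit character mapping.
            StringBuilder mapped = new StringBuilder();
            for (char c : value.toCharArray()) {
                String replacement = MAPPING.get(c);
                if (replacement != null) {
                    mapped.append(replacement);
                } else {
                    mapped.append(c);
                }
            }
            // Then decompose accented letters and drop the combining marks.
            String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        Normaliser normaliser = new EnglishNormaliser();
        System.out.println(normaliser.normalise("\u00C5ngstr\u00F6m"));
    }
}
```

The value returned by normalise is what would go into the sort_value column, while the original value is kept for display.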

DSpace 1.4.1 Released

Scott Yeadon announces the release of DSpace 1.4.1 to the DSpace community. This release is principally bug fixes, stability and security enhancements, and minor feature improvements.

Important things to note would be: Stats from older versions of DSpace are supported alongside the new log file format; improvement of HTTP status codes in relevant contexts; updated libraries and preventing spammers from using the Feedback page.

The DSpace community is pleased to announce the release of DSpace 1.4.1.

This stable release is primarily a bug fix release incorporating
numerous bugs/enhancements. Refer to the CHANGES file within the
distribution for the full list of enhancements.

The documentation for this release is bundled within the package. Note
that the site is in the process of being migrated to the
Wiki, so the most up-to-date version of the documentation is only
available within the 1.4.1 distribution.

DSpace 1.4.1 can be downloaded from the files area at or from CVS using the tag

Please use the mailing lists available at to provide feedback on this

Those wishing to do development work with DSpace are strongly encouraged
to obtain the source code using CVS. This is very straightforward and a
guide to doing this is available here:

We would also like to take this opportunity to invite you all to
participate in the DSpace development process. Extra developer
hands are always welcome, but there are other ways you can help:

- Test the system and report bugs
- Provide documentation (for end users and institutions, as well as
- Provide or update language packs
- Share your deployment experiences
- Donate content and metadata for testing and research
- Share your technical experience and ideas

Please visit the DSpace Wiki to see the various resources and
collaboration tools available to the DSpace community:

Thanks to everyone who contributed to this release, and to Scott and Claudia for all their work making it ready.

Friday, 1 December 2006


A few weeks ago Herbert van de Sompel invited me to be on the ORE Technical Committee. The website for this interesting new project is here:

The first meeting of the group is in January 2007.

Unknown Namecheck

A new e-theses blog (thanks to Peter Suber for the link) has seen fit to reference a whole bunch of articles that I have written (or which have been written by or in collaboration with a good friend and colleague from the University of Edinburgh) over the last few years on the topics of DSpace and/or E-Theses.

Since this is my blog, here are the ones that were by or partially by me:

Open Source, Open Access and E-Theses

The Tapir: Adding E-Theses functionality to DSpace

A couple of references to chapters in our 2006 book "The Institutional Repository" (Jones, R; Andrew, T; MacColl, J):

The Institutional Repository in the Digital Library

Case Study: The Edinburgh Research Archive

And if you are interested in the topic, you would be interested in the work of this man:

Intellectual Property and Electronic Theses

It's a strange experience when you discover other people posting about your work without your knowledge. A good feeling. But strange. And hopefully one day something that I can be allowed to get used to.

Wednesday, 29 November 2006

DSpace Browse Code Redevelopment

Here at Imperial we have a 70,000 strong set of records for academic publications that we have to deal with. The current browse code for DSpace is pretty inflexible and hides some scary scalability problems. For example, if you have 3000 records all produced by the same author, and you attempt to browse all the publications by that author, it will instantiate an Item object for each of those 3000 items and display them all to the user. This can Cause Things To Be Slow.

A long while back I wrote some code which allowed you to specify which metadata fields you wanted to bind to the existing 3 browse indices (later increased to 4 by the addition of a subject browse). As an engineer, I was bothered by the idea that you couldn't just define your indices in real time, or if not real time at least in configuration, so at the same time I started to consider rewriting the browse system. To that end, I produced an initial prototype of a generalised browse patch, which was attached to the patch tracker as #1480998.

The consequence of this second development was that the author browse problem could be quickly discovered in other contexts, and, more problematically, more likely ones. For example, we store workflow information about our 70,000 records, and at the very start, when the first data import has completed, we have every one of those records with the same status ("new"). In this case, if you select our "Browse by Item Status" -> "new" option configured using the patch, the system attempts to display to you 70,000 instantiated Items. I don't know anyone who has yet waited to see if the page will ever display.

A second problem was discovered while attempting to fix the first: paging the results of a "second level browse" was impossible, because the focus of the browse and the value of the browse are conflated in the existing code. Pagination simply could not be applied to a specific value browse (e.g. browse by status where status = new). A new understanding of the browse process was needed.

This is what has led us to redevelop the browse code to fix both of these issues. The development process is being live documented on the DSpace Wiki, at the URL:

Once finished, this code will be made available to the community via the patch tracker. Keep an eye on the wiki to watch its progress.
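
To illustrate the idea rather than the actual patch, here is a minimal sketch of a paged value browse. All of the names are hypothetical and none of this is the real DSpace browse API. It keeps the index, the value selected within it, and the page position as separate pieces of state, and it returns only the item IDs for the requested page, so that only those items would ever need to be instantiated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative, not the DSpace browse API.
class PagedBrowseSketch {

    // Keeps the index being browsed separate from the value selected within
    // it and from the position of the current page.
    static class BrowseScope {
        final String index;  // e.g. "status"
        final String value;  // e.g. "new"
        final int offset;    // zero-based position of the first result wanted
        final int pageSize;  // maximum number of results on one page

        BrowseScope(String index, String value, int offset, int pageSize) {
            this.index = index;
            this.value = value;
            this.offset = offset;
            this.pageSize = pageSize;
        }
    }

    // Stand-in for one row of the table backing a single browse index.
    static class IndexRow {
        final int itemId;
        final String value;

        IndexRow(int itemId, String value) {
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Return the item IDs for one page of a value browse. The rows are
    // assumed to belong to the index named in the scope.
    static List<Integer> pageOfItemIds(List<IndexRow> indexRows, BrowseScope scope) {
        List<Integer> page = new ArrayList<>();
        int seen = 0; // matching rows encountered so far
        for (IndexRow row : indexRows) {
            if (!row.value.equals(scope.value)) {
                continue;
            }
            if (seen >= scope.offset && page.size() < scope.pageSize) {
                page.add(row.itemId);
            }
            seen++;
            if (page.size() == scope.pageSize) {
                break; // the page is full, so stop scanning
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<IndexRow> rows = new ArrayList<>();
        for (int id = 1; id <= 10; id++) {
            rows.add(new IndexRow(id, "new"));
        }
        BrowseScope scope = new BrowseScope("status", "new", 4, 3);
        System.out.println("Item IDs on this page: " + pageOfItemIds(rows, scope));
    }
}
```

With 70,000 records all marked "new", a page size of 20 would mean instantiating 20 Items per request instead of all 70,000. In a real database-backed version the offset and page size would presumably be pushed down into the query itself rather than applied in a loop.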

Welcome, folks, to my new and experimental blog. In a largely informal way I intend to use this to document some of my work, as much for my own use as anyone else's. It will mostly consist of my DSpace related thoughts and ramblings, as well as any other topics in the information sciences and technologies which catch my eye. Of course, it will depend on how much time I have to write about these things.