Tuesday 19 December 2006

Calgary Rejects DSpace for E-Theses Archives

The Rejection of D-Space : Selecting Theses Database Software at the University of Calgary Archives

This article outlines reasons that the University of Calgary Archives chose not to implement DSpace to manage their e-theses, although they are still using it for the Institutional Repository system.

I'd like to include some responses to the significant issues that Calgary encountered:


  • Searching Issues: I think the main problem here is a lack of information about the customisations available in both search and browse. The existing browse has very primitive customisation capabilities, but the search indexes are extremely flexible, and adding search.index.<x> = dc.date.accessioned would probably be sufficient to construct accurate searches or reports on specific years. What DSpace must take responsibility for is that the search interface is not the most flexible, and that it is not possible to construct searches like this without knowing exactly what you are doing. This is a very serious issue: the report cites a lack of sufficient search indices as a major concern, yet a quick peek into the dspace.cfg file and a quick tweak of search/advanced.jsp would solve all of these problems.


  • Browse by date: just to blow my own trumpet for a moment, when the browse code I've been working on is finished, it will be possible to index by date, and browse only items within a specific period. For example, you could browse only those items where date.submitted is 1994, and so on


  • Counting results: using the mechanisms above it would be possible to get the results counted correctly. Even if not, falling back on SQL queries to do the counts would be much more desirable than counting by hand, and since our code base is open, you are free to write in the functionality


  • Cannot do X: much of the content of this article is about how the UI cannot do X (allow you to search for more than 3 terms at a time | search on the relevant fields). DSpace is an Open Source product, where words like "cannot" should not be used unless you really have looked into it. The underlying search engine can do all of the things required for Calgary, and all it requires is the alteration of the UI to support it


  • Reporting Issues: as has already been noted at time of writing on the DSpace lists, DSpace is not a reporting tool. I understand Calgary's pain here, as it is part of my remit to report on the content and activity of our repository. I will be using a combination of the log file analysis, the web server logs, and potentially the Minho stats add-on. I would welcome a suite of reporting tools for the platform, though


  • Sorting search results: That DSpace can't do this is a complete nuisance, and I agree that it should be able to. Lucene, which is the search engine upon which DSpace relies for this functionality, does support result sorting, so if anyone wants to have a go at adding in the feature, I expect you would be very popular


  • Public and Private Display: This is also a potential shortcoming of DSpace, and was certainly not a design goal in the initial case. DSpace was originally intended to help you achieve open access, and so does not do so well with the ideas of public and private views. To fix this here at Imperial we have one public repository and X private repositories which deal content to the public one on demand (and yes, one of those private repositories is an under-construction e-theses repository built on DSpace).


  • Scalability: I feel that Calgary didn't do their research regarding scalability here. They worry that adding 600 new items annually might be a problem. Cambridge have 200,000 records in their public repository, and we have 75,000 records in our private repository, and while there are scalability bumps, they are being ironed out, and they don't show up at all until you get into the tens of thousands of records. I blame this on the over-cited (so I won't) DSpace Scalability Issues page on the wiki being people's only source of information.
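
To make the points about counting and browsing by date more concrete, here is a minimal Python sketch of letting a database do the per-year counts and the "browse within a period" listing instead of counting by hand. Everything in it is an assumption for illustration: the thesis table, its columns, and the function names are hypothetical and are not the real DSpace schema or API, and it uses an in-memory SQLite database rather than the database DSpace actually runs on.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified table.

    This is NOT the DSpace schema; it only stands in for "items with a
    submission date" so the queries below have something to run against.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE thesis ("
        "id INTEGER PRIMARY KEY, title TEXT, date_submitted TEXT)"
    )
    rows = [
        (1, "Thesis A", "1994-03-01"),
        (2, "Thesis B", "1994-11-15"),
        (3, "Thesis C", "1995-06-30"),
        (4, "Thesis D", "1996-01-10"),
    ]
    conn.executemany("INSERT INTO thesis VALUES (?, ?, ?)", rows)
    return conn


def count_by_year(conn):
    """Return {year: number_of_theses}, counted by the database."""
    cur = conn.execute(
        "SELECT substr(date_submitted, 1, 4) AS year, COUNT(*) "
        "FROM thesis GROUP BY year ORDER BY year"
    )
    return dict(cur.fetchall())


def browse_period(conn, first_year, last_year):
    """Return titles submitted in [first_year, last_year], oldest first."""
    cur = conn.execute(
        "SELECT title FROM thesis "
        "WHERE CAST(substr(date_submitted, 1, 4) AS INTEGER) BETWEEN ? AND ? "
        "ORDER BY date_submitted",
        (first_year, last_year),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    demo = build_demo_db()
    print(count_by_year(demo))
    print(browse_period(demo, 1994, 1995))
```

The sorting here is done with a plain SQL ORDER BY, which is a stand-in and not the Lucene-based sorting mentioned above; the point is only that counts and date-restricted listings are a query away once the data is reachable.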




So what can we, as the DSpace community, learn from this? The first thing is probably about documentation - we are very good at documentation, but sometimes it can be hard to find what you are looking for. Could we do more, and if so what?

The second is about customisability of our UI. Manakin will hopefully lay to rest many of the problems that we come up against with our UI customisations, but we should also bear in mind that library administrators want graphical interfaces to modify their configuration in all senses. I can give you a concrete example of where having our config defined through the UI would be useful: At Imperial College if we want to change the configuration and restart the application server to pick up the changes, we need to go through an official "Change Request" process. If this could be done through the UI, though, this would become an administrative task, and would not require the extra bureaucracy! In addition, it will make it easier for non-technical folks to understand what options are available to them.

Third is about the nature of Open Source. DSpace has many known problems or discrepancies between what the UI will allow and what the underlying application will actually support. The DSpace core is much more powerful than the UI would have you believe, and those of us who spend most of our time "under the hood" can testify to the things that you can make it do if you know what you are doing. The problem at Calgary was that they didn't appear to understand that in order to make the system work in their exact niche, it was going to require some modification. This is a different issue, in my mind, than not having the resources to undertake said modifications.

Nonetheless, if WebGencat meets their needs, and DSpace does not, then that is a success for the diversity of software products available and the evaluation process in this part of the market. We (the DSpace community) need to learn from the feedback in the report, and hopefully use it to make our system better.
