Wednesday 29 November 2006

DSpace Browse Code Redevelopment

Here at Imperial we have a 70,000-strong set of records for academic publications that we have to deal with. The current browse code for DSpace is pretty inflexible and hides some scary scalability problems. For example, if you have 3000 records all produced by the same author, and you attempt to browse all the publications by that author, it will instantiate an Item object for each of those 3000 records and display them all to the user. This can Cause Things To Be Slow.
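To make the cost concrete, here is a minimal sketch of the alternative: build objects only for the page being displayed. This is not DSpace code. SketchItem is a hypothetical stand-in for a heavyweight Item, and PagedLoader, loadPage, offset and pageSize are names invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a heavyweight repository item (NOT the DSpace Item class).
final class SketchItem {
    static int instantiated = 0; // counts how many objects have been built
    final int id;

    SketchItem(int id) {
        this.id = id;
        instantiated++;
    }
}

// Hypothetical helper: instantiate only the items that fall on the requested page.
final class PagedLoader {
    static List<SketchItem> loadPage(List<Integer> matchingIds, int offset, int pageSize) {
        if (offset < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and pageSize > 0");
        }
        // Clamp the window to the size of the result set.
        int from = Math.min(offset, matchingIds.size());
        int to = (int) Math.min((long) from + pageSize, matchingIds.size());

        List<SketchItem> page = new ArrayList<>();
        for (int id : matchingIds.subList(from, to)) {
            page.add(new SketchItem(id)); // the expensive step, done once per displayed row
        }
        return page;
    }
}
```

With 3000 matching IDs and a page size of 20, a call such as PagedLoader.loadPage(ids, 40, 20) builds 20 objects rather than 3000. A real implementation would presumably push the offset and limit down into the database query rather than slice a list in memory.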

A long while back I wrote some code which allowed you to specify which metadata fields you wanted to bind to the existing 3 browse indices (later increased to 4 by the addition of a subject browse). As an engineer, I was bothered by the idea that you couldn't just define your indices in real time, or if not in real time then at least in configuration, so at the same time I started to consider rewriting the browse system. To that end, I produced an initial prototype of a generalised browse patch, which was attached to the patch tracker as #1480998.

The consequence of this second development was that the author browse problem quickly turned up in other contexts and, more problematically, in more likely ones. For example, we store workflow information about our 70,000 records, and at the very start, when the first data import has completed, every one of those records has the same status ("new"). In this case, if you select our "Browse by Item Status" -> "new" option, configured using the patch, the system attempts to display 70,000 instantiated Items to you. I don't know of anyone who has yet waited to see if the page will ever display.

A second problem was discovered while attempting to fix the first: paging the results of a "second level browse"* was impossible. The focus of the browse and the value of the browse are conflated in the existing code, so pagination cannot be applied to a browse on a specific value (e.g. browse by status where status = new). A new understanding of the browse process was needed.
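One way to picture the separation is a request object that carries the value and the focus as independent fields. The sketch below is an illustration only, not the design on the wiki. It assumes "focus" means the position in the result list at which display starts, and BrowseEntry, BrowseRequest and BrowseSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical index row: the indexed value (e.g. a status) and the item's title.
final class BrowseEntry {
    final String value;
    final String title;

    BrowseEntry(String value, String title) {
        this.value = value;
        this.title = title;
    }
}

// Hypothetical request: value and focus are separate, so both can be set at once.
final class BrowseRequest {
    final String value;  // the value being browsed, e.g. "new"; null means the whole index
    final int offset;    // assumed meaning of "focus": where in the results the page starts
    final int pageSize;  // how many results to show

    BrowseRequest(String value, int offset, int pageSize) {
        this.value = value;
        this.offset = offset;
        this.pageSize = pageSize;
    }
}

final class BrowseSketch {
    // Filter by value first, then apply the offset and page size to the filtered results.
    static List<String> titles(List<BrowseEntry> index, BrowseRequest req) {
        List<String> out = new ArrayList<>();
        int matched = 0;
        for (BrowseEntry e : index) {
            if (req.value != null && !req.value.equals(e.value)) {
                continue; // not part of this value browse
            }
            if (matched++ < req.offset) {
                continue; // before the start of the requested page
            }
            if (out.size() == req.pageSize) {
                break; // page is full
            }
            out.add(e.title);
        }
        return out;
    }
}
```

A request such as new BrowseRequest("new", 2, 2) then asks for the second page of two results within status = new, which is the combination described above as impossible when the two notions share one field.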

This is what has led us to redevelop the browse code to fix both of these issues. The development process is being documented live on the DSpace Wiki, at the URL:

http://wiki.dspace.org/index.php/DynamicBrowsePrototype

Once finished, this code will be made available to the community via the patch tracker. Keep an eye on the wiki to watch its progress.

Welcome, folks, to my new and experimental blog. In a largely informal way I intend to use this to document some of my work, as much for my own use as anyone else's. It will mostly consist of my DSpace related thoughts and ramblings, as well as any other topics in the information sciences and technologies which catch my eye. Of course, it will depend on how much time I have to write about these things.