Tuesday 22 January 2008

Fine Grained Repository Interoperability: can't package, won't package

Sadly (although some of you may not agree!), my paper proposed for this year's Open Repositories conference in Southampton has not made it through the Programme Committee. I include here, therefore, my submission so that it may live on, and you can get an idea of the sorts of things I was thinking about talking about.

The reasons given for not accepting it are probably valid; mostly concerning a lack of focus. Honestly, I thought it did a pretty good job of saying what I would talk about, but such is life.




What is the point of interoperability, what might it allow us to achieve, and why aren't we very good at it yet?

Interoperability is a loosely defined concept. It can allow systems to talk to each other about the information that they hold, about the information that they can disseminate, and to interchange that information. It can allow us to tie systems together to improve ingest and dissemination of repository holdings, and to distribute repository functions across multiple systems. It ought even to allow us to offer repository services to systems which don't do so natively, improving the richness of the information space; repository interoperability is not just about repository to repository, it is also about cross-system communications. The maturing set of repositories such as DSpace, Fedora and EPrints, and other information systems such as publications management tools and research information systems, as well as home-spun solutions, are making the task of taking on the interoperability beast both tangible and urgent.

Traditional approaches to interoperability have often centred around moving packaged information between systems (often other repositories). The effect this has is to introduce a black-box problem concerning the content of the package itself. We are no longer transferring information, we are transferring data! It therefore becomes necessary to introduce package descriptors which allow the endpoint to re-interpret the package correctly, to turn it back into information. But this constrains us very tightly in the form of our packages, and introduces a great risk of data loss. Furthermore, it means that we cannot perform temporally and spatially disparate interoperability on an object level (that is, assemble an object's content over a period of time, and from a variety of sources). A more general approach to information interchange may be more powerful.

This paper brings together a number of sources. It discusses some of the work undertaken at Imperial College London to connect a distributed repository system (built on top of DSpace) to an existing information environment. This provides repository services to existing systems, and offers library administrators custom repository management tools in an integrated way. It also considers some of the thoughts arising from the JISC Common Repository Interfaces Group (CRIG) in this area, as well as some speculative proposals for future work and further ideas that may need to be explored.

Where do we start? The most basic way to address this problem is to break the idea of the package down into its most simple component parts in the context of a repository: the object metadata, the file content, and the use rights metadata. Using this approach, you can go a surprisingly long way down the interoperability route without adding further complexity. At the heart of the Imperial College Digital Repository is a set of web services which deal with exactly this fine structure of the package, because the content for the repository may be fed from a number of sources over a period of time, and thus there never is a definitive package.
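
As a rough illustration of this decomposition, the sketch below models an object as three independently supplied parts. It is not the Imperial College web service interface; the class name `RepositoryObject`, its methods and the sample values are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RepositoryObject:
    """Hypothetical model of a deconstructed package (names are illustrative)."""

    identifier: str
    metadata: Dict[str, List[str]] = field(default_factory=dict)  # object metadata
    files: Dict[str, bytes] = field(default_factory=dict)  # file content, keyed by name
    rights: Optional[str] = None  # use rights metadata

    def add_metadata(self, element: str, value: str) -> None:
        """Append one metadata value; elements may repeat."""
        self.metadata.setdefault(element, []).append(value)

    def add_file(self, name: str, content: bytes) -> None:
        """Attach or overwrite a file by name."""
        self.files[name] = content

    def set_rights(self, statement: str) -> None:
        """Record the use rights statement."""
        self.rights = statement

    def missing_parts(self) -> List[str]:
        """List which of the three component parts have not arrived yet."""
        missing = []
        if not self.metadata:
            missing.append("metadata")
        if not self.files:
            missing.append("files")
        if self.rights is None:
            missing.append("rights")
        return missing


# Parts arrive from different systems at different times.
obj = RepositoryObject("item-1")
obj.add_metadata("title", "An example thesis")  # e.g. from a publications system
obj.add_file("thesis.pdf", b"%PDF-")  # e.g. later, from the depositor
print(obj.missing_parts())  # ['rights']
```

Because each part has its own operation, no single call ever needs to carry a complete package, which is the property the text relies on.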

These sorts of operations are not new, though, and a variety of approaches to them have already been taken. For example, WebDAV offers extensions to HTTP to deal with objects using operations such as PUT, COPY or MOVE, which could be used to achieve the effects that we desire. The real challenge, therefore, is not in the mechanics of the web services which we use to exchange details about this deconstructed package, but in the additional complexities which we can introduce to enhance the interoperability of our systems and provide the value-added services which repositories wish to offer.

Consider some other features of interoperability which might be desirable:

- fine grained or partial metadata records. We may wish to ingest partial records from a variety of sources to assemble into a single record, or disseminate only subsets of our stored metadata.
- file metadata, or any other sub-structure element of the object. This may include bibliographic, administrative or technical metadata.
- object structural information, to allow complex hierarchies and relationships to be expressed and modified.
- versioning, and other inter-object relationships.
- workflow status: when performing deposit across multiple systems, it may be necessary to be aware of the status of the object in each system to calculate an overall state.
- state and provenance reporting, to offer feedback on the repository state to other information systems, administrators or users.
- statistics, to allow content delivery services to aggregate statistics globally.
- identifiers, to support multiple identification schemes.
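
To make the workflow status item above more concrete, here is a minimal sketch of calculating an overall state from per-system statuses. The status names, their ordering and the rule that the least advanced system determines the overall state are all assumptions chosen for illustration; no existing standard or system is being described.

```python
from enum import IntEnum
from typing import Mapping


class Status(IntEnum):
    """Hypothetical deposit statuses, ordered from least to most advanced."""

    REJECTED = 0
    NOT_STARTED = 1
    IN_REVIEW = 2
    ARCHIVED = 3


def overall_status(per_system: Mapping[str, Status]) -> Status:
    """Assumed rule: the object is only as far along as its slowest system."""
    if not per_system:
        return Status.NOT_STARTED
    return min(per_system.values())


# The system names here are placeholders.
state = overall_status({"repository": Status.ARCHIVED, "cris": Status.IN_REVIEW})
print(state.name)  # IN_REVIEW
```

A real deployment would need an agreed vocabulary of states before any such calculation could be shared between systems, which is part of why this area lacks a clear approach.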

Techniques such as application profiling for metadata allow us to frame entire metadata records in terms of their interpretation (e.g. the Scholarly Works Application Profile (SWAP)), but should also be used to frame individual metadata elements. Object structural data can be encoded using standards such as METS, which can also help us with attaching metadata to sub-structures of the object itself, such as its files. Versioning and other inter-object relationships could be expressed using an RDF approach, and perhaps the OAI-ORE project will offer some guidance. But other operations, such as workflow status and state and provenance reporting, do not have such clear approaches. Meanwhile, the Interoperable Repository Statistics (IRS) project has looked at the statistics problem, and the RIDIR project is looking into interoperable identifiers. In these latter cases, can we ever consider providing access to their outcomes or services through some general fine grained interface?

The Imperial College Digital Repository offers limited file metadata, which is attached during upload and exposed as a descriptive metadata section of a METS record detailing the entire digital object. It can deal with the idea that some metadata comes from one source, while other metadata comes from another, allowing for a primitive partial metadata interchange process. Conversely, it will also deal with multiple metadata records for the same item. Also introduced are custom workflow metadata fields which allow some basic interaction between different systems to track deposit of objects from the point of view of the administrator, the depositor and the systems themselves. In addition, there is an extensible notifications engine which is used to produce periodic reports to all depositors whose content has undergone some sort of modification or interesting event in a given time period. This notifications engine is behind a very generic web service which offers extreme flexibility within the context of the College's information environment.
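
The partial metadata interchange described above might look something like the following sketch, which assembles one record from several partial records and remembers which source supplied each value. The function name `merge_partial_records`, the record shape and the source names are assumptions for illustration, not the repository's actual implementation.

```python
from typing import Dict, Iterable, List, Tuple

# A partial record: (source name, {element: [values]}). Hypothetical shape.
PartialRecord = Tuple[str, Dict[str, List[str]]]


def merge_partial_records(
    records: Iterable[PartialRecord],
) -> Dict[str, List[Tuple[str, str]]]:
    """Combine partial records into one, keeping (value, source) pairs.

    A value already present for an element is not added again, so the
    first source to supply it is the one recorded.
    """
    merged: Dict[str, List[Tuple[str, str]]] = {}
    for source, elements in records:
        for element, values in elements.items():
            existing = merged.setdefault(element, [])
            for value in values:
                if all(value != seen for seen, _ in existing):
                    existing.append((value, source))
    return merged


record = merge_partial_records([
    ("publications-system", {"title": ["An example thesis"], "author": ["A. Author"]}),
    ("library-staff", {"title": ["An example thesis"], "subject": ["Repositories"]}),
])
print(record["title"])  # [('An example thesis', 'publications-system')]
```

Keeping the source alongside each value is one simple way to leave room for the state and provenance reporting mentioned earlier.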

Important work in the fields that will help achieve this interoperability includes the SWORD deposit mechanism, which currently deals with packages but may be extensible to include these much-needed enhancements. Meanwhile, OAI-ORE will be able to provide the semantics for complex objects, which will no doubt assist in framing the problems that interoperability faces in a manner in which they can be solved.

Other examples of the spaces in which interoperability needs to work include the EThOSnet project, the UK national e-theses effort, where it is conceivable that institutions may want to provide their own e-theses submission system with integration into the central hub to offer seamless distributed submission. Another is the relationship between Current Research Information Systems (CRIS) and open access repositories, which could offer a full-stack information environment for researchers and administrators alike. The possibilities are extensive and the benefit to the research community would be truly great. HP Labs is actively researching in these and related areas with its continued work on the DSpace platform.

