The Chronicles of Richard: sword


Monday, 9 June 2008

ORE software libraries from Foresite

The Foresite [1] project is pleased to announce the initial code of two software libraries for constructing, parsing, manipulating and serialising OAI-ORE [2] Resource Maps. These libraries are being written in Java and Python, and can be used generically to provide advanced functionality to OAI-ORE aware applications, and are compliant with the latest release (0.9) of the specification. The software is open source, released under a BSD licence, and is available from a Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet, and are lacking good documentation for this early release, but we will be continuing to develop this software throughout the project and hope that it will be of use to the community immediately and beyond the end of the project.

Both libraries support parsing and serialising in ATOM, RDF/XML, N3, N-Triples, Turtle and RDFa.
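As a rough illustration of what serialising a Resource Map involves, the sketch below builds the triples for a tiny Resource Map by hand and writes them out as N-Triples using only the Python standard library. It does not use the Foresite libraries or their API. The example.org URIs and the function names are invented for illustration, and the ore:describes and ore:aggregates terms are assumed to live in the ORE terms namespace shown.

```python
# Assumed ORE vocabulary namespace; the URIs below are purely illustrative.
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map_triples(rem_uri, aggregation_uri, resource_uris):
    """Return (subject, predicate, object) URI triples for a minimal Resource Map."""
    # The Resource Map describes exactly one Aggregation ...
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    # ... and the Aggregation aggregates each resource.
    for uri in resource_uris:
        triples.append((aggregation_uri, ORE + "aggregates", uri))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples, one statement per line."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


triples = resource_map_triples(
    "http://example.org/rem/1",
    "http://example.org/aggregation/1",
    ["http://example.org/article.pdf", "http://example.org/data.csv"],
)
print(to_ntriples(triples), end="")
```

A real library would also handle literals, blank nodes and the other serialisations listed above; this only shows the shape of the data.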

Foresite is a JISC [3] funded project which aims to produce a demonstrator and test of the OAI-ORE standard by creating Resource Maps of journals and their contents held in JSTOR [4], and delivering them as ATOM documents via the SWORD [5] interface to DSpace [6]. DSpace will ingest these resource maps, and convert them into repository items which reference content which continues to reside in JSTOR. The Python library is being used to generate the resource maps from JSTOR and the Java library is being used to provide all the ingest, transformation and dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us have your feedback via the Google group:

foresite@googlegroups.com

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD): http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/

Tuesday, 22 January 2008

Fine Grained Repository Interoperability: can't package, won't package

Sadly (although some of you may not agree!), my paper proposed for this year's Open Repositories conference in Southampton has not made it through the Programme Committee. I include here, therefore, my submission so that it may live on, and you can get an idea of the sorts of things I was thinking about talking about.

The reasons given for not accepting it are probably valid; mostly concerning a lack of focus. Honestly, I thought it did a pretty good job of saying what I would talk about, but such is life.

What is the point of interoperability, what might it allow us to achieve, and why aren't we very good at it yet?

Interoperability is a loosely defined concept. It can allow systems to talk to each other about the information that they hold, about the information that they can disseminate, and to interchange that information. It can allow us to tie systems together to improve ingest and dissemination of repository holdings, and allows us to distribute repository functions across multiple systems. It ought even to allow us to offer repository services to systems which don't do so natively, improving the richness of the information space; repository interoperability is not just about repository to repository, it is also about cross-system communications. The maturing set of repositories such as DSpace, Fedora and EPrints and other information systems such as publications management tools and research information systems, as well as home-spun solutions are making the task of taking on the interoperability beast both tangible and urgent.

Traditional approaches to interoperability have often centred around moving packaged information between systems (often other repositories). The effect this has is to introduce a black-box problem concerning the content of the package itself. We are no longer transferring information, we are transferring data! It therefore becomes necessary to introduce package descriptors which allow the endpoint to re-interpret the package correctly, to turn it back into information. But this constrains us very tightly in the form of our packages, and introduces a great risk of data loss. Furthermore, it means that we cannot perform temporally and spatially disparate interoperability on an object level (that is, assemble an object's content over a period of time, and from a variety of sources). A more general approach to information interchange may be more powerful.

This paper brings together a number of sources. It discusses some of the work undertaken at Imperial College London to connect a distributed repository system (built on top of DSpace) to an existing information environment. This provides repository services to existing systems, and offers library administrators custom repository management tools in an integrated way. It also considers some of the thoughts arising from the JISC Common Repository Interfaces Group (CRIG) in this area, as well as some speculative proposals for future work and further ideas that may need to be explored.

Where do we start? The most basic way to address this problem is to break the idea of the package down into its most simple component parts in the context of a repository: the object metadata, the file content, and the use rights metadata. Using this approach, you can go a surprisingly long way down the interoperability route without adding further complexity. At the heart of the Imperial College Digital Repository is a set of web services which deal with exactly this fine structure of the package, because the content for the repository may be fed from a number of sources over a period of time, and thus there never is a definitive package.
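To make the idea of the deconstructed package concrete, here is a minimal sketch of an object whose three component parts (object metadata, file content and use rights metadata) arrive separately and at different times. It is not the Imperial College web service interface; the class, method and field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryObject:
    """An object assembled piecemeal rather than from one definitive package."""
    identifier: str
    metadata: dict = field(default_factory=dict)  # object metadata
    files: dict = field(default_factory=dict)     # file name -> file content
    rights: dict = field(default_factory=dict)    # use rights metadata

    def put_metadata(self, fields):
        self.metadata.update(fields)

    def put_file(self, name, content):
        self.files[name] = content

    def put_rights(self, fields):
        self.rights.update(fields)

    def missing_parts(self):
        """Names of the component parts that no source has supplied yet."""
        parts = {"metadata": self.metadata, "files": self.files, "rights": self.rights}
        return [name for name, value in parts.items() if not value]


obj = RepositoryObject("item-1")
obj.put_metadata({"title": "An example thesis"})  # e.g. from a publications system
obj.put_file("thesis.pdf", b"%PDF-...")           # e.g. uploaded later by the author
print(obj.missing_parts())  # ['rights']
```

Each part can be supplied by a different system, so the object is usable (if incomplete) at every stage.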

These sorts of operations are not new, though, and there are a variety of approaches to it which have already been undertaken. For example, WebDAV offers extensions to HTTP to deal with objects using operations such as PUT, COPY or MOVE which could be used to achieve the effects that we desire. The real challenge, therefore, is not in the mechanics of the web services which we use to exchange details about this deconstructed package, but is in the additional complexities which we can introduce to enhance the interoperability of our systems and provide the value-added services which repositories wish to offer.

Consider some other features of interoperability which might be desirable:

- fine grained or partial metadata records. We may wish to ingest partial records from a variety of sources to assemble into a single record, or disseminate only subsets of our stored metadata.
- file metadata, or any other sub-structure element of the object. This may include bibliographic, administrative or technical metadata.
- object structural information, to allow complex hierarchies and relationships to be expressed and modified.
- versioning, and other inter-object relationships.
- workflow status: if performing deposit across multiple systems, it may be necessary to be aware of the status of the object in each system to calculate an overall state.
- state and provenance reporting, to offer feedback on the repository state to other information systems, administrators or users.
- statistics, to allow content delivery services to aggregate statistics globally.
- identifiers, to support multiple identification schemes.
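Taking the workflow status item from the list above as an example, one simple way to calculate an overall state is to treat the object as only as far along as its least-advanced system. This is a sketch under that assumption; the status names and system names are hypothetical, and a real workflow would likely need a richer model.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical per-system deposit statuses, ordered by progress."""
    NOT_STARTED = 0
    IN_PROGRESS = 1
    COMPLETE = 2


def overall_state(per_system):
    """Return the least-advanced status reported by any system."""
    if not per_system:
        return Status.NOT_STARTED
    return min(per_system.values())


states = {"submission-system": Status.COMPLETE, "repository": Status.IN_PROGRESS}
print(overall_state(states).name)  # IN_PROGRESS
```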

Techniques such as application profiling for metadata allow us to frame entire metadata records in terms of their interpretation (e.g. the Scholarly Works Application Profile (SWAP)), but should also be used to frame individual metadata elements. Object structural data can be encoded using standards such as METS, which can also help us with attaching metadata to sub-structures of the object itself, such as its files. Versioning, and other inter-object relationships could be achieved using an RDF approach, and perhaps the OAI-ORE project will offer some guidance. But other operations such as workflow status, and state and provenance reporting do not have such clear approaches. Meanwhile, the Interoperable Repository Statistics (IRS) project has looked at the statistics problem, and the RIDIR project is looking into interoperable identifiers. In these latter cases, can we ever consider providing access to their outcomes or services through some general fine grained interface?

The Imperial College Digital Repository offers limited file metadata which is attached during upload and exposed as part of a METS record, detailing the entire digital object, as a descriptive metadata section. It can deal with the idea that some metadata comes from one source, while other metadata comes from another, allowing for a primitive partial metadata interchange process. Conversely, it will also deal with multiple metadata records for the same item. Also introduced are custom workflow metadata fields which allow some basic interaction between different systems to track deposit of objects from the point of view of the administrator, the depositor and the systems themselves. In addition, there is an extensible notifications engine which is used to produce periodic reports to all depositors whose content has undergone some sort of modification or interesting event in a given time period. This notifications engine is behind a very generic web service which offers extreme flexibility within the context of the College's information environment.

Important work in the fields that will help achieve this interoperability includes the SWORD deposit mechanism, which currently deals with packages but may be extensible to include these much needed enhancements. Meanwhile, OAI-ORE will be able to provide the semantics for complex objects, which will no doubt assist in framing the problems that interoperability faces in a manner in which they can be solved.

Other examples of the spaces in which interoperability needs to work would include the EThOSnet project, the UK national e-theses effort, where it is conceivable that institutions may want to provide their own e-theses submission system with integration into the central hub to offer seamless distributed submission. Or in the relationship between Current Research Information Systems (CRIS) and open access repositories, to offer a full-stack information environment for researchers and administrators alike. The possibilities are extensive and the benefit to the research community would be truly great. HP Labs is actively researching in these and related areas with its continued work on the DSpace platform.

Monday, 21 January 2008

SWORD/ORE

Last week I was at the ORE meeting in Washington DC, and presented some thoughts regarding SWORD and its relationship to ORE. The slides I presented can be found here:

http://wiki.dspace.org/static_files/1/1d/Sword-ore.pdf

[Be warned that discussion on these slides ensued, and they therefore don't reflect the most recent thinking on the topic]

The overall approach of using SWORD as the infrastructure to do deposit for ORE seems sound. There are three main approaches identified:

- SWORD is used to deposit the URI of a Resource Map onto a repository
- SWORD is used to deposit the Resource Map as XML onto a repository
- SWORD is used to deposit a package containing the digital object and its Resource Map onto a repository

In terms of complications there are two primary ones which concern me the most:

- Mapping of the SWORD levels to the usage of ORE.

The principal issue is that level 1 implies level 0, and therefore level 2 implies level 1 and level 0. The inclusion of semantics to support ORE specifics could invoke a new level, and if this level is (for argument's sake) level 3, it implies all the levels beneath it, whatever they might require. Since the service, by this stage, is becoming complex in itself, such a linear relationship might not follow.

A brief option discussed at the meeting would be to modularise the SWORD support instead of implementing a level based approach. That is, the service document would describe the actual services offered by the server, such as ORE support, NoOp support, Verbose support and so forth, with no recourse to "bundles" of functionality labelled by linear levelling.
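The difference between the two approaches can be sketched as follows. With linear levels, claiming a level commits the server to everything beneath it; with modular support, the service document simply lists what is offered. The assignment of features to levels here is invented for illustration and is not taken from the SWORD specification, and the function names are hypothetical.

```python
# Hypothetical assignment of features to levels, for illustration only.
FEATURES_BY_LEVEL = {0: {"deposit"}, 1: {"noop", "verbose"}, 2: {"ore"}}


def level_features(level):
    """Linear levelling: a level implies every feature of the levels beneath it."""
    features = set()
    for n in range(level + 1):
        features |= FEATURES_BY_LEVEL.get(n, set())
    return features


def can_serve(advertised, required):
    """Modular support: compare the advertised features with what the client needs."""
    return set(required) <= set(advertised)


# A server that supports ORE deposit but not NoOp or Verbose.
advertised = {"deposit", "ore"}

print(can_serve(advertised, {"deposit", "ore"}))  # True
# Under linear levels, offering ORE would mean claiming level 2, which
# over-promises the features this server does not have:
print(sorted(level_features(2) - advertised))  # ['noop', 'verbose']
```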

- Scalability of the service document

The mechanisms imposed by ORE allow for complex objects to be attached to other complex objects as aggregated resources (ORE term). This means that you could have a resource map which you wish to tell a repository describes a new part of an existing complex object. In order to do this, the service document will need to supply the appropriate deposit URI for a segment of an existing repository item. In DSpace semantics, for example, we may be adding a cluster of files to an existing item, and would therefore require the deposit URI of the item itself. To do otherwise would be to limit the applicability of ORE within SWORD and the repository model. Our current service document is a flat document describing what is pragmatically assumed (correctly, in virtually all cases) to be a small selection of deposit URIs. The same will not be true of item level deposit targets, which could be a very large number of possible deposit targets. Furthermore, in repositories which exploit the full descriptive capabilities of ORE, the number of deposit targets could be identical to the number of aggregations described (which can be more than one per resource map), which has the potential to be a very large number.

The consequences are for the scalability of response time, which is a platform-specific issue, and for the scalability and usefulness of the document itself. It may be more useful to navigate hierarchically through the different levels of the service document in order to identify deposit nodes.
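Hierarchical navigation might look something like the sketch below, in which each service document lists only its immediate children and the client requests one level at a time until it reaches the deposit node it wants. The URIs, the in-memory stand-in for the server and the function names are all hypothetical; nothing here reflects an agreed SWORD mechanism.

```python
# Hypothetical server: each service document lists only its immediate children.
SERVICE_DOCUMENTS = {
    "/servicedocument": [
        "/servicedocument/collection-1",
        "/servicedocument/collection-2",
    ],
    "/servicedocument/collection-1": ["/servicedocument/collection-1/item-7"],
    "/servicedocument/collection-2": [],
    "/servicedocument/collection-1/item-7": [],
}


def fetch(uri):
    """Stand-in for an HTTP GET of one level of the service document."""
    return SERVICE_DOCUMENTS[uri]


def find_deposit_node(root, path):
    """Walk down one level per request, following the named path segments."""
    current = root
    requests_made = 0
    for segment in path:
        children = fetch(current)
        requests_made += 1
        target = f"{current}/{segment}"
        if target not in children:
            raise KeyError(f"{segment!r} not found under {current}")
        current = target
    return current, requests_made


node, requests_made = find_deposit_node("/servicedocument", ["collection-1", "item-7"])
print(node, requests_made)  # /servicedocument/collection-1/item-7 2
```

The client never downloads the full list of item level deposit targets, only the branches it actually walks.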

Any feedback on this topic is probably most useful in the ORE Google Group.

Friday, 7 December 2007

CRIG Meeting Day 2 (1)

It's first thing on day two. I'm late because I have to get all the way across town, which takes a surprisingly long time in London. I should have just stayed at a nearby hotel. Oh well.

The remainder of yesterday was interesting. Scope for live blogging is difficult, as the conference is extremely mobile. Today I will have to pick a point and hide in a corner to get you up to date.

In the afternoon we discussed the CRIG scenarios, and then implemented something called a Dotmocracy, which involves sticking dots (like house points at school) next to topics which appeared which we were interested in. When we start up today, the first order of business will be to see what topics made the cut. From what I saw at the end of the day, this will include Federated Searching, Google Search, and package deconstruction (my personal favourite this week).

As a brief aside, one running theme has been "no more standards". As it happens, I disagree with this. We're never going to get everyone thinking the same and working the same. That's why there are so many standards, and why new ones get made all the time. It's the way of the world. With a standard, though, once you have implemented one you at least have a way of telling people what you did, unlike the home-grown, undocumented solutions which are the alternative.

Right, I suppose I'd better get my skates on.

Thursday, 6 December 2007

CRIG Meeting Day 1 (2)

http://en.wikipedia.org/wiki/Unconference

See also Jim Downing's live blogging.

We've just done a round of preliminary unconferencing, where the CRIG Podcast topics were brainstormed onto flip charts. Not sure how useful that's going to be, but I'm going to approach the whole thing with an open mind. I've got my marker pen, my balloon, and my three dots.

wish me luck ...

CRIG Meeting Day 1 (1)

Some live blogging; may be slightly malformed, as this is happening inline, with no post-editing.

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference

Les Carr and Jim Downing have introduced us to the CRIG workshop first day. We're unconferencing which means that there's not a programme! We're going to try and stay at the abstract or high level discussion, not try to talk about technology.

David Flanders outlines the meeting philosophy. The outputs aimed for the meeting include: ideas (bluesky), standards and scenarios and how they can be linked together. The outputs will be taken to OR08. The best way for a group to produce good stuff is for everyone to think about themselves. Makes me think of an article I read recently:

http://www7.nationalgeographic.com/ngm/0707/feature5/index.html

We are not about creating new specs.

Julie then brings us some stuff about SWORD. See my previous post on this. We are going to have implementations for arXiv, White Rose Research Online and Jorum. There will also be a SPECTRa deposit client, and later an article in Ariadne and a presentation at OR08.

Break time ... tea and coffee!

Friday, 30 November 2007

CRIG Podcast

A couple of weeks ago the JISC CRIG (Common Repository Interfaces Group) organised a series of telephone debates on important areas for it. These have now been edited into short commentaries which might be of interest to you, and are aimed at priming and informing the upcoming "unconference" to be held 6/7 December in London:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Podcasts

The "unconference" will take place at Birkbeck College in Bloomsbury, London. Take a listen, and enjoy. Yours truly appears in the "Get and Put within Repositories" and the "Object Interoperability" discussions.

Thursday, 8 November 2007

SWORD 1.0 Released

Just a quick heads up to say that the SWORD 1.0 release is now out and ready for download from SourceForge:

http://sourceforge.net/projects/sword-app/

Here you will find the common java library which supports repositories wanting to implement SWORD, plus implementations for DSpace and Fedora. There is also a client (with GUI and CLI versions) which you can use to deposit content into the repositories.

The DSpace implementation is designed only to work with the forthcoming DSpace 1.5 (which is currently in Alpha release). Your feedback and experiences with the code would be much appreciated. We expect to be making refinements to the DSpace implementation up until DSpace 1.5 is released as stable.

Friday, 23 February 2007

JISC Capital Circular 4/06 outcomes

Today has been an exciting day. Projects that I am potentially involved in which have so far been announced as funded under the last round of JISC bids from November last year are as follows:

SWORD - Repository Deposit API development work in association with Aberystwyth, Southampton, Hull, Cambridge, Birkbeck (University of London), National Library of Wales, and Intralect, as a DSpace advisor and developer

EThOSnet - A major e-theses project following on from the great work of the recently completed EThOS project. Imperial is pleased to be leading this project, with partners from the following institutions: Leicester, Warwick, the British Library, Nottingham, Hull, Glasgow, Birmingham, National Library of Scotland, Edinburgh, Southampton, Cranfield, Robert Gordon University, Aberystwyth, Cardiff, Loughborough, National Library of Wales, and Exeter. What a team, and what a great looking project. My role is yet to be formalised, but hopefully somewhere in the area of the software development ;)

The future for repositories at Imperial looks bright. Today we completed our first UAT for our upcoming IR service "Spir@l", and we are due, over the course of this year, to go live with that service and our own internal e-theses management system. The outcomes of these two projects will no doubt also play a role in shaping our repository environment, which I hope will rapidly become one to be proud of.
