Monday, 9 June 2008

ORE software libraries from Foresite

The Foresite [1] project is pleased to announce the initial code release of two software libraries for constructing, parsing, manipulating and serialising OAI-ORE [2] Resource Maps. The libraries are written in Java and Python, can be used generically to provide advanced functionality to OAI-ORE aware applications, and are compliant with the latest release (0.9) of the specification. The software is open source, released under a BSD licence, and is available from a Google Code repository:

http://code.google.com/p/foresite-toolkit/

The implementations are not yet complete, and documentation is thin in this early release, but we will continue to develop the software throughout the project and hope that it will be of use to the community immediately and beyond the end of the project.

Both libraries support parsing and serialising in ATOM, RDF/XML, N3, N-Triples, Turtle and RDFa.
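
For a flavour of the Python library, here is a sketch that builds a one-resource aggregation and serialises its Resource Map as RDF/XML. The class and method names follow the toolkit's early documentation as I recall it, so please verify them against the code in the repository:

# A sketch of building and serialising a Resource Map with the Python
# library; names are as I recall them from the early foresite docs.
from foresite import Aggregation, AggregatedResource, ResourceMap, RdfLibSerializer

agg = Aggregation('http://example.org/aggregation/1')   # URIs are illustrative
res = AggregatedResource('http://example.org/file.pdf')
res._dc.title = 'An aggregated resource'
agg.add_resource(res)

rem = ResourceMap('http://example.org/rem/1')
rem.set_aggregation(agg)

rem.register_serialization(RdfLibSerializer('xml'))     # RDF/XML output
remdoc = rem.get_serialization()
print(remdoc.data)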

Foresite is a JISC [3] funded project which aims to produce a demonstrator and test of the OAI-ORE standard by creating Resource Maps of journals and their contents held in JSTOR [4], and delivering them as ATOM documents via the SWORD [5] interface to DSpace [6]. DSpace will ingest these resource maps and convert them into repository items which reference content that continues to reside in JSTOR. The Python library is being used to generate the resource maps from JSTOR, and the Java library is being used to provide all the ingest, transformation and dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us have your feedback via the Google group:

foresite@googlegroups.com

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD): http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/

Wednesday, 23 January 2008

European ORE Roll-Out at Open Repositories 2008

The European leg of the ORE roll-out has been announced and will occur on the final day of the Open Repositories 2008 conference in Southampton, UK. This is to complement the meeting at Johns Hopkins University in Baltimore on March 3. From the email circular:


A meeting will be held on April 4, 2008 at the University of Southampton, in conjunction with Open Repositories 2008, to roll-out the beta release of the OAI-ORE specifications. This meeting is the European follow-on to a meeting that will be held in the USA on March 3, 2008 at Johns Hopkins University.

The OAI-ORE specifications describe a data model to identify and describe aggregations of web resources, and they introduce machine-readable formats to describe these aggregations based on ATOM and RDF/XML. The current, alpha version of the OAI-ORE specifications is at http://www.openarchives.org/ore/0.1/.

Additional details for the OAI-ORE European Open Meeting are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/EUKickoffPressrelease.pdf

- The registration site for the event:

http://regonline.com/eu-oai-ore

Note that registration is required and space is limited.

Tuesday, 22 January 2008

Fine Grained Repository Interoperability: can't package, won't package

Sadly (although some of you may not agree!), the paper I proposed for this year's Open Repositories conference in Southampton has not made it through the Programme Committee. I therefore include my submission here, so that it may live on and so you can get an idea of the sorts of things I was planning to talk about.

The reasons given for not accepting it are probably valid, mostly concerning a lack of focus. Honestly, I thought it did a pretty good job of saying what I would talk about, but such is life.




What is the point of interoperability, what might it allow us to achieve, and why aren't we very good at it yet?

Interoperability is a loosely defined concept. It can allow systems to talk to each other about the information that they hold and can disseminate, and to interchange that information. It can allow us to tie systems together to improve ingest and dissemination of repository holdings, and to distribute repository functions across multiple systems. It ought even to allow us to offer repository services to systems which don't provide them natively, improving the richness of the information space; repository interoperability is not just about repository-to-repository communication, it is also about cross-system communication. The maturing set of repositories such as DSpace, Fedora and EPrints, other information systems such as publications management tools and research information systems, and assorted home-spun solutions all make the task of taking on the interoperability beast both tangible and urgent.

Traditional approaches to interoperability have often centred on moving packaged information between systems (often other repositories). The effect is to introduce a black-box problem concerning the content of the package itself: we are no longer transferring information, we are transferring data! It therefore becomes necessary to introduce package descriptors which allow the endpoint to re-interpret the package correctly, turning it back into information. But this constrains the form of our packages very tightly, and introduces a great risk of data loss. Furthermore, it means that we cannot perform temporally and spatially disparate interoperability at the object level (that is, assemble an object's content over a period of time, and from a variety of sources). A more general approach to information interchange may be more powerful.

This paper brings together a number of sources. It discusses some of the work undertaken at Imperial College London to connect a distributed repository system (built on top of DSpace) to an existing information environment. This provides repository services to existing systems, and offers library administrators custom repository management tools in an integrated way. It also considers some of the thoughts arising from the JISC Common Repository Interfaces Group (CRIG) in this area, as well as some speculative proposals for future work and further ideas that may need to be explored.

Where do we start? The most basic way to address this problem is to break the idea of the package down into its most simple component parts in the context of a repository: the object metadata, the file content, and the use rights metadata. Using this approach, you can go a surprisingly long way down the interoperability route without adding further complexity. At the heart of the Imperial College Digital Repository is a set of web services which deal with exactly this fine structure of the package, because the content for the repository may be fed from a number of sources over a period of time, and thus there never is a definitive package.

These sorts of operations are not new, though, and a variety of approaches to them have already been undertaken. For example, WebDAV offers extensions to HTTP to deal with objects using operations such as PUT, COPY or MOVE, which could be used to achieve the effects we desire. The real challenge, therefore, is not in the mechanics of the web services we use to exchange details about this deconstructed package, but in the additional complexities we can introduce to enhance the interoperability of our systems and provide the value-added services which repositories wish to offer.
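
As a sketch of what this deconstruction might look like on the wire, consider depositing the three component parts of an object through separate HTTP calls, possibly from different systems and at different times. The endpoints and payloads below are hypothetical illustrations, not the Imperial College web services themselves:

# Hypothetical fine-grained deposit: metadata, file content and use rights
# travel separately rather than inside one opaque package. All endpoint
# paths and field names are illustrative, not a real repository API.
import requests

BASE = "https://repo.example.org/objects/1234"  # hypothetical object URI

# Deposit (or update) the descriptive metadata on its own.
requests.put(BASE + "/metadata",
             json={"title": "Some article", "creator": "A. N. Author"})

# Add the file content, possibly from a different source, at a later time.
with open("article.pdf", "rb") as f:
    requests.put(BASE + "/files/article.pdf", data=f,
                 headers={"Content-Type": "application/pdf"})

# Attach the use-rights metadata as a third independent part.
requests.put(BASE + "/rights", json={"licence": "CC-BY"})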

Consider some other features of interoperability which might be desirable:

- fine grained or partial metadata records. We may wish to ingest partial records from a variety of sources to assemble into a single record, or disseminate only subsets of our stored metadata (a merge sketch follows this list).
- file metadata, or any other sub-structure element of the object. This may include bibliographic, administrative or technical metadata.
- object structural information, to allow complex hierarchies and relationships to be expressed and modified.
- versioning, and other inter-object relationships.
- workflow status: if performing deposit across multiple systems, it may be necessary to be aware of the status of the object in each system to calculate an overall state.
- state and provenance reporting, to offer feedback on the repository state to other information systems, administrators or users.
- statistics, to allow content delivery services to aggregate statistics globally.
- identifiers, to support multiple identification schemes.
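
For the first of these, assembling a single record from partial metadata supplied by different sources over time might look something like the following sketch; the field names and the keep-first merge policy are purely illustrative:

# Hypothetical sketch of assembling one record from partial metadata
# records; each source contributes values without overwriting earlier ones.
def merge_partial_records(*partials):
    record = {}
    for partial in partials:
        for field, values in partial.items():
            # Keep existing values and append only genuinely new ones.
            record.setdefault(field, []).extend(
                v for v in values if v not in record.get(field, []))
    return record

from_cris = {"title": ["On Interoperability"], "creator": ["R. Jones"]}
from_publisher = {"title": ["On Interoperability"],
                  "identifier": ["doi:10.9999/example"]}
print(merge_partial_records(from_cris, from_publisher))
# {'title': ['On Interoperability'], 'creator': ['R. Jones'],
#  'identifier': ['doi:10.9999/example']}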

Techniques such as application profiling for metadata allow us to frame entire metadata records in terms of their interpretation (e.g. the Scholarly Works Application Profile (SWAP)), but should also be used to frame individual metadata elements. Object structural data can be encoded using standards such as METS, which can also help us with attaching metadata to sub-structures of the object itself, such as its files. Versioning and other inter-object relationships could be expressed using an RDF approach, and perhaps the OAI-ORE project will offer some guidance. But other operations, such as workflow status and state and provenance reporting, do not have such clear approaches. Meanwhile, the Interoperable Repository Statistics (IRS) project has looked at the statistics problem, and the RIDIR project is looking into interoperable identifiers. In these latter cases, can we ever consider providing access to their outcomes or services through some general fine grained interface?
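
For the versioning case, a minimal sketch of the RDF approach using rdflib might look like the following; the choice of Dublin Core terms for the version links is my own illustration, since the paragraph above only speculates that OAI-ORE may offer guidance here:

# Expressing inter-object version relationships as RDF triples with rdflib.
# Object URIs are hypothetical and the predicates are one possible choice.
from rdflib import Graph, Namespace, URIRef

DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
v1 = URIRef("https://repo.example.org/objects/1234/v1")
v2 = URIRef("https://repo.example.org/objects/1234/v2")

g.add((v2, DCTERMS.replaces, v1))      # version 2 supersedes version 1
g.add((v1, DCTERMS.isReplacedBy, v2))  # and the inverse link

print(g.serialize(format="turtle"))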

The Imperial College Digital Repository offers limited file metadata which is attached during upload and exposed, as a descriptive metadata section, within a METS record detailing the entire digital object. It can deal with the idea that some metadata comes from one source while other metadata comes from another, allowing for a primitive partial metadata interchange process. Conversely, it will also deal with multiple metadata records for the same item. Also introduced are custom workflow metadata fields which allow some basic interaction between different systems to track the deposit of objects from the points of view of the administrator, the depositor and the systems themselves. In addition, there is an extensible notifications engine which is used to produce periodic reports to all depositors whose content has undergone some sort of modification or interesting event in a given time period. This notifications engine sits behind a very generic web service which offers extreme flexibility within the context of the College's information environment.

Important work in the field that will help achieve this interoperability includes the SWORD deposit mechanism, which currently deals with packages but may be extensible to include these much-needed enhancements. Meanwhile, OAI-ORE will be able to provide the semantics for complex objects, which will no doubt assist in framing the problems that interoperability faces in a manner in which they can be solved.

Other examples of the spaces in which interoperability needs to work include the EThOSnet project, the UK national e-theses effort, where it is conceivable that institutions may want to provide their own e-theses submission system with integration into the central hub to offer seamless distributed submission; or the relationship between Current Research Information Systems (CRIS) and open access repositories, to offer a full-stack information environment for researchers and administrators alike. The possibilities are extensive and the benefit to the research community would be truly great. HP Labs is actively researching these and related areas with its continued work on the DSpace platform.


Monday, 21 January 2008

SWORD/ORE

Last week I was at the ORE meeting in Washington DC, and presented some thoughts regarding SWORD and its relationship to ORE. The slides I presented can be found here:

http://wiki.dspace.org/static_files/1/1d/Sword-ore.pdf

[Be warned that discussion on these slides ensued, and they therefore don't reflect the most recent thinking on the topic]

The overall approach of using SWORD as the infrastructure for ORE deposit seems sound. Three main approaches were identified:

- SWORD is used to deposit the URI of a Resource Map onto a repository
- SWORD is used to deposit the Resource Map as XML onto a repository
- SWORD is used to deposit a package containing the digital object and its Resource Map onto a repository
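
As a rough sketch of the second approach, the deposit could be a plain HTTP POST of the serialised Resource Map to a SWORD deposit URI. The deposit URI and credentials below are hypothetical, and the headers follow the SWORD profile of the time as I understand it, so treat this as illustrative:

# POSTing an Atom-serialised Resource Map to a hypothetical SWORD endpoint.
import requests

DEPOSIT_URI = "https://repo.example.org/sword/collection/123"  # hypothetical

with open("resourcemap.atom", "rb") as f:
    response = requests.post(
        DEPOSIT_URI,
        data=f,
        headers={
            "Content-Type": "application/atom+xml",
            "X-No-Op": "false",   # a real deposit, not a dry run
            "X-Verbose": "true",  # ask the server for a verbose response
        },
        auth=("depositor", "secret"),
    )
print(response.status_code)  # expect 201 Created on success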

There are two primary complications which concern me most:

- Mapping of the SWORD levels to the usage of ORE.

The principal issue is that level 1 implies level 0, and therefore level 2 implies level 1 and level 0. The inclusion of semantics to support ORE specifics could invoke a new level, and if this level is (for argument's sake) level 3, it implies all the levels beneath it, whatever they might require. Since the service, by this stage, is becoming complex in itself, such a linear relationship might not follow.

One option briefly discussed at the meeting would be to modularise the SWORD support instead of implementing a level-based approach. That is, the service document would describe the actual services offered by the server, such as ORE support, NoOp support, Verbose support and so forth, with no recourse to "bundles" of functionality labelled by linear levelling.
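
To illustrate the modular idea, a client would simply test for the individual capabilities a server advertises, with no level arithmetic. This is purely a sketch; the capability names come from the examples above and the parsing of the service document is elided:

# A client-side capability check against a modular service document.
# The capability names are illustrative; parsing is assumed done elsewhere.
advertised = {"ore", "noop", "verbose"}  # as parsed from the service document

def supports_ore_deposit(capabilities):
    # Only the 'ore' module matters; nothing about "lower" levels is implied.
    return "ore" in capabilities

print(supports_ore_deposit(advertised))  # True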

- Scalability of the service document

The mechanisms provided by ORE allow complex objects to be attached to other complex objects as aggregated resources (an ORE term). This means that you may have a resource map which, you wish to tell a repository, describes a new part of an existing complex object. In order to do this, the service document will need to supply the appropriate deposit URI for a segment of an existing repository item. In DSpace semantics, for example, we may be adding a cluster of files to an existing item, and would therefore require the deposit URI of the item itself. To do otherwise would be to limit the applicability of ORE within SWORD and the repository model. Our current service document is a flat document describing what is pragmatically assumed (correctly, in virtually all cases) to be a small selection of deposit URIs. The same will not be true of item-level deposit targets, of which there could be a very large number. Furthermore, in repositories which exploit the full descriptive capabilities of ORE, the number of deposit targets could equal the number of aggregations described (which can be more than one per resource map), which has the potential to be a very large number.

The consequences are for the scalability of response time, which is a platform-specific issue, and for the scalability and usefulness of the document itself. It may be more useful to navigate hierarchically through the different levels of the service document in order to identify deposit nodes.
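
A sketch of that hierarchical navigation: rather than receiving one flat list of every deposit target, the client walks a tree and expands only the branches it cares about. The document structure and names here are hypothetical:

# Walking a nested (hypothetical) service document to find a deposit URI,
# expanding only the branches on the requested path.
def find_deposit_target(node, path):
    if not path:
        return node.get("deposit_uri")
    for child in node.get("children", []):
        if child["title"] == path[0]:
            return find_deposit_target(child, path[1:])
    return None

service_doc = {
    "title": "Repository",
    "children": [
        {"title": "Community A", "children": [
            {"title": "Item 42",
             "deposit_uri": "https://repo.example.org/sword/item/42"},
        ]},
    ],
}

print(find_deposit_target(service_doc, ["Community A", "Item 42"]))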

Any feedback on this topic is probably best directed to the ORE Google Group.

Wednesday, 12 December 2007

OAI-ORE Alpha Specifications

The ORE Project has released the first draft of the specifications for public consumption. A final Technical Committee meeting is due in January next year, which may result in changes to this initial draft:

http://www.openarchives.org/ore/0.1/toc

Friday, 7 December 2007

CRIG Meeting Day 2 (2)

Topics for today:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Unconference#Friday_December_7th

The ones that interest me the most are probably these:

- Death to Packages

Not really Death to Packages, but let's not forget that packaging sometimes isn't what we want to do, or what we can do.

- Get What?

This harks back to my ORE interest: what is available at the URLs, and what that means for something like content negotiation.

- One Put to Multiple Places

Really important to distributed information systems (e.g. EThOSnet integration into local institutions). This also relates, for me, to the unpackaging question, because it introduces differences in what the various systems might be expecting.

- Web 2.0 interfaces (ok, ok)

I'm interested in web services. Yes, it's a bit trendy, but it is useful.

- Core Services of a Repository

For repository core architecture, this is important. With my DSpace hat on, I'd like to see what sorts of things an internal service architecture or API ought to be able to support.

Friday, 30 November 2007

CRIG Podcast

A couple of weeks ago the JISC CRIG (Common Repository Interfaces Group) organised a series of telephone debates on areas important to it. These have now been edited into short commentaries which might be of interest to you, and are aimed at priming and informing the upcoming "unconference" to be held on 6/7 December in London:

http://www.ukoln.ac.uk/repositories/digirep/index/CRIG_Podcasts

The "unconference" will take place at Birkbeck College in Bloomsbury, London. Take a listen, and enjoy. Yours truly appears in the "Get and Put within Repositories" and the "Object Interoperability" discussions.

Saturday, 13 January 2007

ORE Technical Committee Meeting 11 - 12 January

On 11 and 12 January, 13 members of the ORE Technical Committee met at Columbia University in New York for the first face-to-face meeting of the project. The attendees were (in no particular order): Tony Hammond (Nature Publishing), Michael Nelson (Old Dominion University), Pete Johnston (Eduserv, on behalf of Andy Powell), Ray Plante (NCSA), David Fulker (UCAR), Richard Jones (Imperial College London), Peter Murray (OhioLINK), Jeff Young (OCLC), Rob Sanderson (University of Liverpool), Tim DiLauro (Johns Hopkins University), Simeon Warner (Cornell), and of course Herbert van de Sompel (LANL) and Carl Lagoze (Cornell).

The results of this meeting are due to be reported at Open Repositories 2007 at the end of this month, once they have been formalised from the complex debate and discussion that occurred at the meeting, so I won't attempt to summarise the outcomes in any detail.

We began with an overview of the problem domain: compound digital objects in a heterogeneous environment, which must be operable within the web architecture. One of the core outcomes of the project, therefore, will be a specification for describing these objects and their internal and external relationships. Each of the attending committee members was given the opportunity to present their thoughts on the initial documentation for the project. These ranged from commentary on a privately circulated white paper through to suggestions on implementation technologies or methodologies that might be appropriate.

On the second day of the meeting we moved on to start formalising the goals for the various aspects of the project. This included our communication channels, our use cases, what we understand by the format that will help us describe structures and relationships, and our forthcoming work and subsequent meetings.

Communication for the project will happen through private mailing lists and a wiki. All outcomes from the project will be pushed out to the ORE website, and later there may be a project blog when there are findings to disseminate. We also specified six use cases and assigned members of the technical committee to examine the use case titles and develop working "stories" from them. These use cases should be ready in time for presentation at Open Repositories 2007.

Overall, it feels like we covered significant ground in just two short days, although I for one found the results of the meeting quite complex and in need of significant work to shape into coherent outputs. Carl and Herbert will be carrying out this analysis in the coming weeks, after which the meeting results will be made available.