Monday 22 March 2010

An Analytical Anniversary

Today is my anniversary.  I have been at Symplectic Ltd for one of your Earth "years".  And a very busy one it has been, what with writing repository integration tools for our research management system to deposit content into DSpace, EPrints and Fedora, plus supporting the integration into a number of other platforms.  I thought it would be fun to do a bit of a breakdown of the code that I've written from scratch in the last 12 months (which I'm counting as 233 working days).  I'm going to do an analysis of the following areas of productivity:

  • lines of code
  • lines of inline code commentary
  • number of A4 pages of documentation (end user, administrator and technical)
  • number of version control commits

Let's start from the bottom and work upwards.

Number of version control commits

Total: 700

Per day: 3

I tend to commit units of work, so this might suggest that I do 3 bits of functionality every day.  In reality I quite often also commit quick bug fixes (so that I can record in the commit log the fix details), or at the end of a day/week, when I want to know that my code is safe from hardware theft, nuclear disaster, etc.

Number of A4 pages of documentation

Total: 72

Per day: 0.31

Not everyone writes their documentation in A4 form any more, and it's true that some of my dox take the form of web pages, but as a commercial software house we tend to produce nicely formatted end-user and administrator documentation.  In addition, at a geek level I rather enjoy a well-laid-out printable document, so I do my technical dox that way too.

The amount of documentation is relatively small, but it doesn't take into account a lot of informal documentation.  More importantly, though, we are only now reaching the tail end of the first version of our Repository Tools software, and its documentation is still in development.  I expect the number of pages to triple or quadruple over the next few weeks.

Lines of Code and Lines of Commentary

I wrote a script which analysed my outputs.  Ironically, it's written in Python, which isn't one of the languages that I use professionally, so it's not included in this analysis (and none of my personal programming projects are therefore included).  This analysis covers all of my final code on my anniversary (23rd March), and does not take into account prototyping or refactoring of any kind.  Note also that blank lines are not counted.
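
The real script is the Python file linked under "Source Code" at the end of this post; the gist of it is just a walk over the source tree, classifying each non-blank line as code or comment based on the file extension.  A much-simplified sketch of that idea (the extensions and comment markers here are illustrative, and it only handles single-line comment styles):

import os

# Illustrative mapping of extensions to single-line comment markers;
# the real script (linked below) is more thorough than this.
COMMENT_MARKERS = {
    ".java": "//",
    ".xsl": "<!--",
    ".jsp": "<%--",
    ".pl": "#",
    ".pm": "#",
}

def count_file(path, marker):
    """Count non-blank code and comment lines in a single file."""
    code, comments = 0, 0
    with open(path, errors="replace") as f:
        for line in f:
            stripped = line.strip()
            if not stripped:
                continue                      # blank lines are not counted
            if stripped.startswith(marker):
                comments += 1
            else:
                code += 1
    return code, comments

def count_tree(root):
    totals = {}                               # extension -> (files, code, comments)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in COMMENT_MARKERS:
                continue
            code, comments = count_file(os.path.join(dirpath, name),
                                        COMMENT_MARKERS[ext])
            files, c, m = totals.get(ext, (0, 0, 0))
            totals[ext] = (files + 1, c + code, m + comments)
    return totals

if __name__ == "__main__":
    for ext, (files, code, comments) in sorted(count_tree(".").items()):
        print(f"{ext} ({files} files) :: code: {code}; comments: {comments}")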

Line Counts:

XML (107 Files) :: Lines of Code: 17819; Lines of Inline Comments: 420

XML isn't really programming, but it was interesting to see how much I actually work with it.  This figure is not used in any of the statistics below.  Some of these files are large metadata documents and some are configuration (Maven build files, Ant build files, web server config, etc).


XSLT (36 Files) :: Lines of Code: 8502; Lines of Inline Comments: 2762
JAVA (181 Files) :: Lines of Code: 22350; Lines of Inline Comments: 7565
JSP (16 Files) :: Lines of Code: 2847; Lines of Inline Comments: 1
PERL (58 Files) :: Lines of Code: 6506; Lines of Inline Comments: 1699
---------------
TOTAL (291 Files) :: Lines of Code: 40205; Lines of Inline Comments: 12027

I remember once being told that 30k lines of code a year was pretty reasonable for a developer.  I feel quite chuffed!


Lines of code/comments per day:

XSLT :: Lines of Code: 36; Lines of Inline Comments: 12
JAVA :: Lines of Code: 96; Lines of Inline Comments: 32
JSP :: Lines of Code: 12; Lines of Inline Comments: 0
PERL :: Lines of Code: 28; Lines of Inline Comments: 7
---------------
TOTAL :: Lines of Code: 173; Lines of Inline Comments: 52

It looks much less impressive when you look at it on a daily basis.  We just have to remember that this is 173 wonderful lines of code every day!

Comment to code ratio (comments/code):

XSLT :: 0.33
JAVA :: 0.34
JSP :: 0
PERL :: 0.26
---------------
TOTAL :: 0.30

It was interesting to see that my commenting ratio is fairly stable at about 30% of the code line count.  I didn't plan that or anything.  This includes block comments for classes and methods, and inline programmer documentation.  The reason for the shortfall in Perl is suggested below.  Notice that I didn't write any comments in the JSPs: I only use this code for testing, and it is less carefully curated.

Some Perl comments don't start with a comment character at all - they are POD block comments, delimited by =xxx and =cut respectively, which are difficult to parse out for analysis.  The Perl code line counts are therefore an overestimate and the comment counts an underestimate.  Assuming that about a third of the total Perl lines are actually comments, more likely figures are:

PERL (58 Files) :: Lines of Code: 5498; Lines of Inline Comments: 2707
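
For what it's worth, the POD problem is solvable with a little state in the counter.  A rough sketch (again, not the real script) which treats everything between an opening =directive and the closing =cut as commentary:

def count_perl(path):
    """Count Perl code and comment lines, treating POD blocks
    (=pod/=head1/... through =cut) as comments rather than code."""
    code, comments = 0, 0
    in_pod = False
    with open(path, errors="replace") as f:
        for line in f:
            stripped = line.strip()
            if not stripped:
                continue                          # blank lines still excluded
            if in_pod:
                comments += 1
                if stripped.startswith("=cut"):
                    in_pod = False                # =cut closes the POD block
            elif stripped.startswith("=") and stripped[1:2].isalpha():
                in_pod = True                     # a POD directive opens a block
                comments += 1
            elif stripped.startswith("#"):
                comments += 1
            else:
                code += 1
    return code, comments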

Amount of testing code (testing/production):

9937 / 30268 = 0.33

This is the total amount of code that I wrote to test the other code that I wrote.  So nearly 10k lines of code are there purely to demonstrate that the other 30k lines of code are working.  I'm not going to suggest that this 33% is a linear relationship as the projects increase in size, but maybe we'll find out next year.  Incidentally, the test code that I analysed was the third version of my test framework, so in reality I wrote quite a few more lines of code (perhaps 3 or 4k) before reaching the final version used above.

Note that I'm a big fan of Behaviour Driven Development, and this does tend to cause testing code to be fairly extensive in its own right.
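
To give a flavour of why that is: in BDD each behaviour tends to be spelled out as its own given/when/then scenario rather than packed into one assert-heavy test, so the line count adds up quickly.  A toy sketch of the style in Python (plain unittest rather than a dedicated BDD framework, with a made-up class under test):

import unittest

class Basket:
    """Toy class under test (entirely made up for illustration)."""
    def __init__(self):
        self.items = []
    def add(self, item):
        self.items.append(item)
    def total(self):
        return sum(self.items)

class WhenAnItemIsAddedToAnEmptyBasket(unittest.TestCase):
    def setUp(self):
        # Given an empty basket
        self.basket = Basket()
        # When a single item is added
        self.basket.add(10)

    def test_then_the_basket_contains_one_item(self):
        self.assertEqual(len(self.basket.items), 1)

    def test_then_the_total_equals_the_item_price(self):
        self.assertEqual(self.basket.total(), 10)

if __name__ == "__main__":
    unittest.main()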

Number of new files per day:

XSLT :: 0.15
JAVA :: 0.78
JSP :: 0.07
PERL :: 0.25
---------------
TOTAL :: 1.25

In reality, of course, I create lots and lots of new files over a short period of time, and then nothing for ages.


Average file length:

Excluding blank lines: 179
Including blank lines: 211
Spaciousness (including/excluding): 1.18

What is spaciousness?  It's a measure of how I tend to space my code.  Everyone, I have noticed, is fairly different in this regard - I wonder what other people's spaciousness is?
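
For the curious, spaciousness is simply the ratio of all lines to non-blank lines, so it falls out of the same pass over the files.  A minimal sketch:

def spaciousness(paths):
    """Ratio of all lines to non-blank lines across a set of files."""
    all_lines, non_blank = 0, 0
    for path in paths:
        with open(path, errors="replace") as f:
            for line in f:
                all_lines += 1
                if line.strip():
                    non_blank += 1
    return all_lines / non_blank if non_blank else 0.0

# A spaciousness of 1.18 means roughly one blank line for every five or six
# non-blank lines.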

Source Code

Do you want to have a go at this yourself?  Blogger doesn't make attaching files particularly easy, so you can get this from the nice folks at pastebin, who say this shouldn't ever time out: http://pastebin.com/GVkHd7tB.

Monday 9 June 2008

ORE software libraries from Foresite

The Foresite [1] project is pleased to announce the initial code release of two software libraries for constructing, parsing, manipulating and serialising OAI-ORE [2] Resource Maps. These libraries are being written in Java and Python; they can be used generically to provide advanced functionality to OAI-ORE aware applications, and are compliant with the latest release (0.9) of the specification. The software is open source, released under a BSD licence, and is available from a Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet and lack good documentation at this early stage, but we will continue to develop this software throughout the project and hope that it will be of use to the community both immediately and beyond the end of the project.

Both libraries support parsing and serialising in: ATOM, RDF/XML, N3, N-Triples, Turtle and RDFa

Foresite is a JISC [3] funded project which aims to produce a demonstrator and test of the OAI-ORE standard by creating Resource Maps of journals and their contents held in JSTOR [4], and delivering them as ATOM documents via the SWORD [5] interface to DSpace [6]. DSpace will ingest these resource maps, and convert them into repository items which reference content which continues to reside in JSTOR. The Python library is being used to generate the resource maps from JSTOR and the Java library is being used to provide all the ingest, transformation and dissemination support required in DSpace.
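
To give a feel for what a Resource Map actually contains, here is a sketch of the underlying data model using rdflib directly (this is not the Foresite API itself, and the URIs are made up): a Resource Map describes an Aggregation, and the Aggregation aggregates the resources that make up the compound object, using the ORE terms vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

rem = URIRef("http://example.org/rem/article-1")            # made-up URIs
agg = URIRef("http://example.org/aggregation/article-1")

g = Graph()
g.bind("ore", ORE)
g.bind("dcterms", DCTERMS)

# The Resource Map describes the Aggregation...
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((rem, DCTERMS.creator, Literal("Foresite demo")))

# ...and the Aggregation aggregates the resources in the compound object.
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, URIRef("http://example.org/article-1/article.pdf")))
g.add((agg, ORE.aggregates, URIRef("http://example.org/article-1/metadata.xml")))

print(g.serialize(format="turtle"))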

Please feel free to download and play with the source code, and let us have your feedback via the Google group:

foresite@googlegroups.com

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD):
http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/

Friday 15 February 2008

DSpace 1.5 Beta 1 Released

I'm pleased to be able to relay that DSpace 1.5 has been released for beta testing. Particularly big thanks to Scott Philips, the release coordinator and lead Manakin developer, for his contributions to it. From the email announcement:


The first beta for DSpace 1.5 has been released. You may either check out the new tag directly from SVN or download the release from SourceForge. On SourceForge you will note that there are two types of releases:

dspace-1.5.0-beta1-release
dspace-1.5.0-beta1-src-release

- The "dspace-1.5.0-beta1-release" is a binary download that just contains dspace, it's manual, configuration, and a few other essential items. Use this package if you want to download DSpace pre-compiled and get it up running with no customizations.

- The other release, "dspace-1.5.0-beta1-src-release" is a full copy of the DSpace source code that you can modify and customize. Use this release as an alternative to checking out a copy of the source directly from SVN.


Sourceforge download URL:
http://sourceforge.net/project/showfiles.php?group_id=19984


There is going to be a full week testathon next week, which we encourage everyone to get involved in. Please do download and install either or both of the available releases, and let us know how you get on. Give it your best shot to break them, and if you do and are able to, consider sending us a patch to fix what was broken. The developers will be available (depending on time zone) in the DSpace IRC channel to help with diagnoses, fixes and any other questions:

server: irc.freenode.net
channel: #dspace

See you there!

Thursday 24 January 2008

CRIG Flipchart Outputs

The JISC CRIG meeting which I previously live-blogged from has now had its output formulated into a series of slides with annotations on Flickr, which can be found here:

http://www.flickr.com/photos/wocrig/

The process by which this was achieved was an intense round of brainstorming sessions culminating in a room full of topic-spaced flip chart sheets. We then performed a Dotmocracy, and the results that you see on the Flickr page are the ideas which made it through the process with some interest invested in them.

Wednesday 23 January 2008

European ORE Roll-Out at Open Repositories 2008

The European leg of the ORE roll-out has been announced and will occur on the final day of the Open Repositories 2008 conference in Southampton, UK. This is to complement the meeting at Johns Hopkins University in Baltimore on March 3. From the email circular:


A meeting will be held on April 4, 2008 at the University of Southampton, in conjunction with Open Repositories 2008, to roll-out the beta release of the OAI-ORE specifications. This meeting is the European follow-on to a meeting that will be held in the USA on March 3, 2008 at Johns Hopkins University.

The OAI-ORE specifications describe a data model to identify and describe aggregations of web resources, and they introduce machine-readable formats to describe these aggregations based on ATOM and RDF/XML. The current, alpha version of the OAI-ORE specifications is at http://www.openarchives.org/ore/0.1/.

Additional details for the OAI-ORE European Open Meeting are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/EUKickoffPressrelease.pdf

- The registration site for the event:

http://regonline.com/eu-oai-ore

Note that registration is required and space is limited.

Tuesday 22 January 2008

Fine Grained Repository Interoperability: can't package, won't package

Sadly (although some of you may not agree!), my paper proposed for this year's Open Repositories conference in Southampton has not made it through the Programme Committee. I include here, therefore, my submission so that it may live on, and you can get an idea of the sorts of things I was thinking about talking about.

The reasons given for not accepting it are probably valid, mostly concerning a lack of focus. Honestly, I thought it did a pretty good job of saying what I would talk about, but such is life.




What is the point of interoperability, what might it allow us to achieve, and why aren't we very good at it yet?

Interoperability is a loosely defined concept. It can allow systems to talk to each other about the information that they hold, about the information that they can disseminate, and to interchange that information. It can allow us to tie systems together to improve ingest and dissemination of repository holdings, and allows us to distribute repository functions across multiple systems. It ought even to allow us to offer repository services to systems which don't do so natively, improving the richness of the information space; repository interoperability is not just about repository to repository, it is also about cross-system communications. The maturing set of repositories such as DSpace, Fedora and EPrints, and other information systems such as publications management tools and research information systems, as well as home-spun solutions, are making the task of taking on the interoperability beast both tangible and urgent.

Traditional approaches to interoperability have often centred around moving packaged information between systems (often other repositories). The effect this has is to introduce a black-box problem concerning the content of the package itself. We are no longer transferring information, we are transferring data! It therefore becomes necessary to introduce package descriptors which allow the endpoint to re-interpret the package correctly, to turn it back into information. But this constrains us very tightly in the form of our packages, and introduces a great risk of data loss. Furthermore, it means that we cannot perform temporally and spatially disparate interoperability on an object level (that is, assemble an object's content over a period of time, and from a variety of sources). A more general approach to information interchange may be more powerful.

This paper brings together a number of sources. It discusses some of the work undertaken at Imperial College London to connect a distributed repository system (built on top of DSpace) to an existing information environment. This provides repository services to existing systems, and offers library administrators custom repository management tools in an integrated way. It also considers some of the thoughts arising from the JISC Common Repository Interfaces Group (CRIG) in this area, as well as some speculative proposals for future work and further ideas that may need to be explored.

Where do we start? The most basic way to address this problem is to break the idea of the package down into its most simple component parts in the context of a repository: the object metadata, the file content, and the use rights metadata. Using this approach, you can go a surprisingly long way down the interoperability route without adding further complexity. At the heart of the Imperial College Digital Repository is a set of web services which deal with exactly this fine structure of the package, because the content for the repository may be fed from a number of sources over a period of time, and thus there never is a definitive package.

These sorts of operations are not new, though, and there are a variety of approaches to it which have already been undertaken. For example, WebDAV offers extensions to HTTP to deal with objects using operations such as PUT, COPY or MOVE which could be used to achieve the effects that we desire. The real challenge, therefore, is not in the mechanics of the web services which we use to exchange details about this deconstructed package, but is in the additional complexities which we can introduce to enhance the interoperability of our systems and provide the value-added services which repositories wish to offer.
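
To make that concrete, the interaction I have in mind looks less like "POST one package" and more like a handful of verbs applied to the component parts of an object. A hypothetical sketch (the host, paths and endpoints are invented purely for illustration; WebDAV-style verbs such as MOVE can be issued over the same structure):

# Hypothetical fine-grained deposit: metadata, file content and rights are
# pushed to an object separately, rather than as one opaque package.
# The host and all paths are invented for illustration only.
import http.client

def send(conn, method, path, body=None, headers=None):
    """Issue one request and return the status code."""
    conn.request(method, path, body, headers or {})
    resp = conn.getresponse()
    resp.read()                  # drain the body so the connection can be reused
    return resp.status

conn = http.client.HTTPConnection("repository.example.org")

# 1. Add (partial) descriptive metadata to the object
with open("metadata-fragment.xml", "rb") as f:
    print(send(conn, "PUT", "/objects/1234/metadata/dc", f.read(),
               {"Content-Type": "application/xml"}))

# 2. Add a content file, possibly arriving from a different source system
with open("article.pdf", "rb") as f:
    print(send(conn, "PUT", "/objects/1234/files/article.pdf", f.read(),
               {"Content-Type": "application/pdf"}))

# 3. Attach a rights statement as its own component
print(send(conn, "PUT", "/objects/1234/rights",
           b"<rights>In copyright</rights>",
           {"Content-Type": "application/xml"}))

# A WebDAV-style MOVE is just another verb over the same structure
print(send(conn, "MOVE", "/objects/1234/files/article.pdf",
           None, {"Destination": "/objects/5678/files/article.pdf"}))

conn.close()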

Consider some other features of interoperability which might be desirable:

- fine grained or partial metadata records. We may wish to ingest partial records from a variety of sources to assemble into a single record, or disseminate only subsets of our stored metadata.
- file metadata, or any other sub-structure element of the object. This may include bibliographic, administrative or technical metadata.
- object structural information, to allow complex hierarchies and relationships to be expressed and modified.
- versioning, and other inter-object relationships.
- workflow status, if performing deposit across multiple systems, it may be necessary to be aware of the status of the object in each system to calculate an overall state.
- state and provenance reporting, to offer feedback on the repository state to other information systems, administrators or users.
- statistics, to allow content delivery services to aggregate statistics globally.
- identifiers, to support multiple identification schemes.

Techniques such as application profiling for metadata allow us to frame entire metadata records in terms of their interpretation (e.g. the Scholarly Works Application Profile (SWAP)), but should also be used to frame individual metadata elements. Object structural data can be encoded using standards such as METS, which can also help us with attaching metadata to sub-structures of the object itself, such as its files. Versioning, and other inter-object relationships could be achieved using an RDF approach, and perhaps the OAI-ORE project will offer some guidance. But other operations such as workflow status, and state and provenance reporting do not have such clear approaches. Meanwhile, the Interoperable Repository Statistics (IRS) project has looked at the statistics problem, and the RIDIR project is looking into interoperable identifiers. In these latter cases, can we ever consider providing access to their outcomes or services through some general fine grained interface?

The Imperial College Digital Repository offers limited file metadata which is attached during upload and exposed as a descriptive metadata section within a METS record describing the entire digital object. It can deal with the idea that some metadata comes from one source, while other metadata comes from another, allowing for a primitive partial metadata interchange process. Conversely, it will also deal with multiple metadata records for the same item. Also introduced are custom workflow metadata fields which allow some basic interaction between different systems to track deposit of objects from the point of view of the administrator, the depositor and the systems themselves. In addition, there is an extensible notifications engine which is used to produce periodic reports to all depositors whose content has undergone some sort of modification or interesting event in a given time period. This notifications engine is behind a very generic web service which offers extreme flexibility within the context of the College's information environment.

Important work in the fields that will help achieve this interoperability includes the SWORD deposit mechanism, which currently deals with packages but may be extensible to include these much needed enhancements. Meanwhile, the OAI-ORE project will be able to provide the semantics for complex objects, which will no doubt assist in framing the problems that interoperability faces in a manner in which they can be solved.

Other examples of the spaces in which interoperability needs to work would include the EThOSnet project, the UK national e-theses effort, where it is conceivable that institutions may want to provide their own e-theses submission system with integration into the central hub to offer seamless distributed submission. Or in the relationship between Current Research Information Systems (CRIS) and open access repositories, to offer a full-stack information environment for researchers and administrators alike. The possibilities are extensive and the benefit to the research community would be truly great. HP Labs is actively researching in these and related areas with its continued work on the DSpace platform.


Monday 21 January 2008

SWORD/ORE

Last week I was at the ORE meeting in Washington DC, and presented some thoughts regarding SWORD and its relationship to ORE. The slides I presented can be found here:

http://wiki.dspace.org/static_files/1/1d/Sword-ore.pdf

[Be warned that discussion on these slides ensued, and they therefore don't reflect the most recent thinking on the topic]

The overall approach of using SWORD as the infrastructure to do deposit for ORE seems sound. There are three main approaches identified:

- SWORD is used to deposit the URI of a Resource Map onto a repository
- SWORD is used to deposit the Resource Map as XML onto a repository
- SWORD is used to deposit a package containing the digital object and its Resource Map onto a repository
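
For the second of those options, the mechanics on the wire are essentially an ordinary SWORD deposit with the serialised Resource Map as the request body. A rough sketch (the host and deposit URI are invented, and the extension headers are the SWORD 1.x ones as I recall them, so check the profile rather than trusting this):

# Sketch only: deposit a serialised Resource Map to a SWORD deposit URI.
# The host and deposit path are invented; the X-* headers are from the
# SWORD 1.x profile as I remember it - verify against the spec.
import base64
import http.client

DEPOSIT_PATH = "/sword/deposit/collection-123"     # hypothetical
credentials = base64.b64encode(b"user:password").decode()

with open("resourcemap.xml", "rb") as f:
    body = f.read()

conn = http.client.HTTPConnection("repository.example.org")
conn.request("POST", DEPOSIT_PATH, body, headers={
    "Content-Type": "application/xml",
    "Content-Disposition": "filename=resourcemap.xml",
    "Authorization": "Basic " + credentials,
    "X-No-Op": "true",        # dry run, per the SWORD profile
    "X-Verbose": "true",      # ask the server to explain what it would do
})
resp = conn.getresponse()
print(resp.status)
print(resp.read().decode("utf-8", "replace"))      # the ATOM entry response
conn.close()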

In terms of complications there are two primary ones which concern me the most:

- Mapping of the SWORD levels to the usage of ORE.

The principal issue is that level 1 implies level 0, and therefore level 2 implies level 1 and level 0. The inclusion of semantics to support ORE specifics could invoke a new level, and if this level is (for argument's sake) level 3, it implies all the levels beneath it, whatever they might require. Since the service, by this stage, is becoming complex in itself, such a linear relationship might not follow.

A brief option discussed at the meeting would be to modularise the SWORD support instead of implementing a level based approach. That is, the service document would describe the actual services offered by the server, such as ORE support, NoOp support, Verbose support and so forth, with no recourse to "bundles" of functionality labelled by linear levelling.

- Scalability of the service document

The mechanisms imposed by ORE allow for complex objects to be attached to other complex objects as aggregated resources (an ORE term). This means that you could have a resource map which, you wish to tell a repository, describes a new part of an existing complex object. In order to do this, the service document will need to supply the appropriate deposit URI for a segment of an existing repository item. In DSpace semantics, for example, we may be adding a cluster of files to an existing item, and would therefore require the deposit URI of the item itself. To do otherwise would be to limit the applicability of ORE within SWORD and the repository model. Our current service document is a flat document describing what is pragmatically assumed (correctly, in virtually all cases) to be a small selection of deposit URIs. The same will not be true of item level deposit targets, of which there could be a very large number. Furthermore, in repositories which exploit the full descriptive capabilities of ORE, the number of deposit targets could be identical to the number of aggregations described (which can be more than one per resource map), which again has the potential to be a very large number.

The consequences are in the scalability of response time, which is a platform-specific issue, and in the scalability and usefulness of the service document itself. It may be more useful to navigate hierarchically through the different levels of the service document in order to identify deposit nodes.

Any feedback on this topic is probably most useful in the ORE Google Group.