The Chronicles of Richard: code


Monday, 22 March 2010

An Analytical Anniversary

Today is my anniversary. I have been at Symplectic Ltd for one of your Earth "years". And a very busy one it has been, what with writing repository integration tools for our research management system to deposit content into DSpace, EPrints and Fedora, plus supporting the integration into a number of other platforms. I thought it would be fun to do a bit of a breakdown of the code that I've written from scratch in the last 12 months (which I'm counting as 233 working days). I'm going to do an analysis of the following areas of productivity:

lines of code
lines of inline code commentary
number of A4 pages of documentation (end user, administrator and technical)
number of version control commits

Let's start from the bottom and work upwards.

Number of version control commits

Total: 700

Per day: 3

I tend to commit units of work, so this might suggest that I do 3 bits of functionality every day. In reality I quite often also commit quick bug fixes (so that I can record in the commit log the fix details), or at the end of a day/week, when I want to know that my code is safe from hardware theft, nuclear disaster, etc.

Number of A4 pages of documentation

Total: 72

Per day: 0.31

Not everyone writes their documentation in A4 form any more, and it's true that some of my dox take the form of web pages, but as a commercial software house we tend to produce well formatted, nice end-user and administrator documentation. In addition, I rather enjoy at a geek level a nice printable document that's well laid out, so I do my technical dox that way too.

The amount of documentation is relatively small, but it doesn't take into account a lot of informal documentation. More importantly, though, at the back end of the first version of our Repository Tools software, the documentation is still in development. I expect the number of pages to probably triple or quadruple over the next few weeks.

Lines of Code and Lines of Commentary

I wrote a script which analysed my outputs. Ironically, it's written in Python, which isn't one of the languages that I use professionally, so it's not included in this analysis (and none of my personal programming projects are therefore included). This analysis covers all of my final code on my anniversary (23rd March), and does not take into account prototyping or refactoring of any kind. Note also that blank lines are not counted.
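As a rough illustration of the counting rule described above (and not the script linked at the end of this post, which is in Python), here is a minimal Java sketch. It assumes a comment is any line whose first non-whitespace characters are a single-line comment prefix such as `//` or `#`; block comments are not handled. The class name `LineCounter` and the prefix rule are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: not the author's script, just the counting rule described above.
class LineCounter {
    final int code;
    final int comments;

    LineCounter(int code, int comments) {
        this.code = code;
        this.comments = comments;
    }

    // Counts non-blank lines, treating a line as a comment when, after trimming,
    // it starts with the given prefix (an assumed, simplified rule).
    static LineCounter count(List<String> lines, String commentPrefix) {
        int code = 0;
        int comments = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines are not counted at all
            }
            if (trimmed.startsWith(commentPrefix)) {
                comments++;
            } else {
                code++;
            }
        }
        return new LineCounter(code, comments);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "// add two numbers",
            "int sum = a + b;",
            "",
            "    // report it",
            "System.out.println(sum);");
        LineCounter result = count(sample, "//");
        System.out.println("code=" + result.code + " comments=" + result.comments);
    }
}
```

A real analysis would also need per-language rules (XML and XSLT comments, Java block comments and so on), which this sketch leaves out.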

Line Counts:

XML (107 Files) :: Lines of Code: 17819; Lines of Inline Comments: 420

XML isn't really programming, but it was interesting to see how much I actually work with it. This figure is not used in any of the below statistics. Some of these are large metadata documents and some are configuration (Maven build files, Ant build files, web server config, etc).

XSLT (36 Files) :: Lines of Code: 8502; Lines of Inline Comments: 2762
JAVA (181 Files) :: Lines of Code: 22350; Lines of Inline Comments: 7565
JSP (16 Files) :: Lines of Code: 2847; Lines of Inline Comments: 1
PERL (58 Files) :: Lines of Code: 6506; Lines of Inline Comments: 1699
---------------
TOTAL (291 Files) :: Lines of Code: 40205; Lines of Inline Comments: 12027

I remember once being told that 30k lines of code a year was pretty reasonable for a developer. I feel quite chuffed!

Lines of code/comments per day:

XSLT :: Lines of Code: 36; Lines of Inline Comments: 12
JAVA :: Lines of Code: 96; Lines of Inline Comments: 32
JSP :: Lines of Code: 12; Lines of Inline Comments: 0
PERL :: Lines of Code: 28; Lines of Inline Comments: 7
---------------
TOTAL :: Lines of Code: 173; Lines of Inline Comments: 52

It looks much less impressive when you look at it on a daily basis. We just have to remember that this is 173 wonderful lines of code every day!

Comment to code ratio (comments/code):

XSLT :: 0.33
JAVA :: 0.34
JSP :: 0
PERL :: 0.26
---------------
TOTAL :: 0.30

It was interesting to see that my commenting ratio is fairly stable at about 30% of the overall codebase size. I didn't plan that or anything. This includes block comments for classes and methods, and inline programmer documentation. The reason for the shortfall in Perl is suggested below. Notice that I didn't write any comments in the JSPs because I only use this code for testing, so it is less carefully curated.

Some Perl comments don't start with anything specific - they are block comments starting and ending with =xxx and =cut respectively, which is difficult to parse out for analysis easily. Therefore the Perl code line counts are overestimates and the comment counts are underestimates. More likely figures, assuming comments make up about 0.33 of the 8205 combined code and comment lines, are:

PERL (58 Files) :: Lines of Code: 5498; Lines of Inline Comments: 2707
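For what it's worth, the block comments described above can be picked out with a small state machine. The following Java sketch is an assumption about how one might do it, not what the original script does: any line beginning with `=` followed by a letter opens a block, a line beginning with `=cut` closes it, and everything in between counts as commentary. The class name `PerlLineCounter` is made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a block-comment-aware Perl line counter.
class PerlLineCounter {
    // Returns {codeLines, commentLines}; blank lines are ignored everywhere.
    static int[] count(List<String> lines) {
        int code = 0;
        int comments = 0;
        boolean inBlock = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            if (inBlock) {
                comments++;
                if (line.startsWith("=cut")) {
                    inBlock = false; // the =cut line itself is counted as commentary
                }
            } else if (line.matches("=[a-zA-Z].*")) {
                comments++;
                // a stray =cut outside a block does not open a new one
                inBlock = !line.startsWith("=cut");
            } else if (line.trim().startsWith("#")) {
                comments++; // note: this also counts a #! line as a comment
            } else {
                code++;
            }
        }
        return new int[] { code, comments };
    }

    public static void main(String[] args) {
        int[] counts = count(Arrays.asList(
            "use strict;",
            "=head1 NAME",
            "",
            "Some docs",
            "=cut",
            "# a comment",
            "my $x = 1;"));
        System.out.println("code=" + counts[0] + " comments=" + counts[1]);
    }
}
```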

Amount of testing code (testing/production):

9937 / 30268 = 0.33

This is the total amount of code that I wrote to test the other code that I wrote. So nearly 10k lines of code are there purely to demonstrate that the other 30k lines of code are working. I'm not going to suggest that this 33% is a linear relationship as the projects increase in size, but maybe we'll find out next year. Incidentally, the test code that I analysed was the third version of my test framework, so in reality I wrote quite a few more lines of code (perhaps 3 or 4k) before reaching the final version used above.

Note that I'm a big fan of Behaviour Driven Development, and this does tend to cause testing code to be fairly extensive in its own right.

Number of new files per day:

XSLT :: 0.15
JAVA :: 0.78
JSP :: 0.07
PERL :: 0.25
---------------
TOTAL :: 1.25

In reality, of course, I create lots and lots of new files over a short period of time, and then nothing for ages.

Average file length:

Excluding blank lines: 179
Including blank lines: 211
Spaciousness (including/excluding): 1.18

What is spaciousness? It's a measure of how I tend to space my code. Everyone, I have noticed, is fairly different in this regard - I wonder what other people's spaciousness is?
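If you want to work out your own figure, the calculation is just total lines divided by non-blank lines (211 / 179 ≈ 1.18 in my case). A minimal Java sketch, with a made-up class name, might be:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: spaciousness = lines including blanks / lines excluding blanks.
class Spaciousness {
    static double of(List<String> lines) {
        int nonBlank = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                nonBlank++;
            }
        }
        if (nonBlank == 0) {
            throw new IllegalArgumentException("no non-blank lines to measure");
        }
        return (double) lines.size() / nonBlank;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "", "b", "c", "", "d");
        System.out.println(of(sample)); // 6 lines, 4 of them non-blank
    }
}
```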

Source Code

Do you want to have a go at this yourself? Blogger doesn't make attaching files particularly easy, so you can get this from the nice folks at pastebin, who say this shouldn't ever time out: http://pastebin.com/GVkHd7tB.

Monday, 9 June 2008

ORE software libraries from Foresite

The Foresite [1] project is pleased to announce the initial code of two software libraries for constructing, parsing, manipulating and serialising OAI-ORE [2] Resource Maps. These libraries are being written in Java and Python, and can be used generically to provide advanced functionality to OAI-ORE aware applications, and are compliant with the latest release (0.9) of the specification. The software is open source, released under a BSD licence, and is available from a Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet, and are lacking good documentation for this early release, but we will be continuing to develop this software throughout the project and hope that it will be of use to the community immediately and beyond the end of the project.

Both libraries support parsing and serialising in: ATOM, RDF/XML, N3, N-Triples, Turtle and RDFa

Foresite is a JISC [3] funded project which aims to produce a demonstrator and test of the OAI-ORE standard by creating Resource Maps of journals and their contents held in JSTOR [4], and delivering them as ATOM documents via the SWORD [5] interface to DSpace [6]. DSpace will ingest these resource maps, and convert them into repository items which reference content which continues to reside in JSTOR. The Python library is being used to generate the resource maps from JSTOR and the Java library is being used to provide all the ingest, transformation and dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us have your feedback via the Google group:

foresite@googlegroups.com

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD): http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/

Friday, 15 February 2008

DSpace 1.5 Beta 1 Released

I'm pleased to be able to relay that DSpace 1.5 has been released for beta testing. Particularly big thanks to Scott Phillips, the release coordinator and lead Manakin developer, for his contributions to it. From the email announcement:

The first beta for DSpace 1.5 has been released. You may either check out the new tag directly from SVN or download the release from SourceForge. On SourceForge you will note that there are two types of releases:

dspace-1.5.0-beta1-release
dspace-1.5.0-beta1-src-release

- The "dspace-1.5.0-beta1-release" is a binary download that just contains DSpace, its manual, configuration, and a few other essential items. Use this package if you want to download DSpace pre-compiled and get it up and running with no customizations.

- The other release, "dspace-1.5.0-beta1-src-release" is a full copy of the DSpace source code that you can modify and customize. Use this release as an alternative to checking out a copy of the source directly from SVN.

Sourceforge download URL:
http://sourceforge.net/project/showfiles.php?group_id=19984

There is going to be a full-week testathon next week, which we encourage everyone to get involved in. Please do download and install either or both of the available releases, and let us know how you get on. Give it your best shot to break them, and if you do and are able to, consider sending us a patch to fix what was broken. The developers will be available (depending on time zone) in the DSpace IRC channel to help with diagnoses and fixes and any other questions:

server: irc.freenode.net
channel: #dspace

See you there!

Thursday, 13 December 2007

The Data Access Layer Divide

Warning: technical post.

One of the things that has been giving me consternation this week is the division between the data storage layer and the application layer. A colleague of mine has been working hard on this problem for some months for DSpace, and his work will form the backbone of the 1.6 release next year. As a new HP Labs employee, I'm just getting involved in this work too, with my focus currently on identifiers for objects in the system (not just content objects, but everything from access policies to user accounts).

We are replacing the default Handle mechanism for exposing URLs in DSpace with an entirely portable identification mechanism which should support whatever identifier scheme you want to put on top of it. DSpace is going to provide its own local identification through UUIDs, so that we can try to break the dependency of identification of artifacts in the system away from the specific implementation of the storage engine. That is, at the moment, database ids are passed around and used with little thought. But what happens if the data storage layer is replaced with something which doesn't use database ids? It's not even slightly inconceivable. Hence the introduction of the UUID.

Now, here's where it gets tricky. The UUID becomes an application level identifier for system artifacts. Fine. The database is free to give columns in tables integer ids, and use them to maintain its own referential integrity. Fine.

I have several questions, and some half-answers for you:

- Why is this a problem?

Suppose I have two modules which store in the database. Let's use a DSpace example of Item and Bitstream objects (DSpace object model sticklers: I know what I'm about to say isn't really true, it's for the purposes of example): I want to store the Item, I want to store the Bitstream, and I want to preserve the relationship between them. Therefore, the Item storage module needs to know how to identify the Bitstream (or vice versa). If I want, I can use the UUIDs, nice long strings, which may have implications on my database performance; why use a relational database if I'm going to burden it with looking up long strings when it could be using nice small integers?

So the problem is: how does the Item get to find out the Bitstream storage id?

- How far up the API can I pass the database id?

The answer to this is "not very far". In fact, it looks like I can't even pass it as far as the DAO API.

- Can I use a RelationalDatabase interface?

The best solution I've come up with so far is to allow my DAO to implement a RelationalDatabase interface, so that other DAO implementations can inspect it to see if they can get database ids out of it. Is that a good solution? I don't know, I'm asking you!

- What's the point?

At the moment the DSpace API is awash with references to the database id. It's fine for the time being, and most people will never get upset about it. But it bothers engineers, and it will bother people who want to try and implement novel storage technologies behind DSpace.
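To make the RelationalDatabase idea above a bit more concrete, here is a minimal Java sketch. Every name in it (`RelationalDatabase`, `BitstreamDAO`, `RelationalBitstreamDAO`, `RelationalItemDAO`, `linkKey`) is hypothetical and invented for illustration; none of it is existing DSpace API, and an in-memory map stands in for the database table. The point is only to show one DAO inspecting another to see whether an integer id is on offer, and falling back to the UUID when it isn't.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;
import java.util.UUID;

// All names here are hypothetical illustrations, not DSpace's actual API.

// Implemented only by DAOs that happen to be backed by a relational database.
interface RelationalDatabase {
    // Returns the internal integer id for the object with this UUID, if it is stored.
    OptionalInt getDatabaseId(UUID uuid);
}

// The storage-neutral contract: only UUIDs cross this boundary.
interface BitstreamDAO {
    void store(UUID bitstreamId);
}

class RelationalBitstreamDAO implements BitstreamDAO, RelationalDatabase {
    private final Map<UUID, Integer> rows = new HashMap<>(); // stands in for a table
    private int nextId = 1;

    public void store(UUID bitstreamId) {
        if (!rows.containsKey(bitstreamId)) {
            rows.put(bitstreamId, nextId++);
        }
    }

    public OptionalInt getDatabaseId(UUID uuid) {
        Integer id = rows.get(uuid);
        return id == null ? OptionalInt.empty() : OptionalInt.of(id);
    }
}

class RelationalItemDAO {
    // Decides what the Item side would use to reference the Bitstream:
    // the small integer if the other DAO can supply one, otherwise the UUID.
    static String linkKey(BitstreamDAO bitstreams, UUID bitstreamId) {
        if (bitstreams instanceof RelationalDatabase) {
            OptionalInt dbId = ((RelationalDatabase) bitstreams).getDatabaseId(bitstreamId);
            if (dbId.isPresent()) {
                return "int:" + dbId.getAsInt();
            }
        }
        return "uuid:" + bitstreamId;
    }
}
```

The database id never appears in the `BitstreamDAO` contract itself, so a non-relational implementation simply doesn't implement the extra interface. Whether the `instanceof` check is an acceptable price is exactly the open question.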

The title of this post reflects my current feeling that these two particular layers of the system, the application and the data storage, have, at some point, to collide; can we really engineer it so that no damage occurs? Answers on a postcard.

Tuesday, 11 December 2007

Multi-lingualism and the masses

Multi-lingualism, and the provision of multi-lingual services, is one of those problems that just keeps on giving. Like digging a hole in sand which just keeps filling with water as fast as you can shovel it out again, or the loose thread which unravels your clothes when you pull on it. I remember being told, back at the start, that multi-lingualism was a solved problem; that i18n allowed us to keep our language separate from our application.

When the first major work was done on DSpace to convert the UI away from being strictly UK to being internationalised, there was great cause for celebration. This initial step was extremely large, and DSpace has reaped the benefits of having an internationalised UI, with translations into 19 languages at time of writing. It's also helped me, among others, understand where else we might want to go with the internationalisation of the platform, and what the issues are. This post is designed to allow me to enumerate the issues that I've so far come up against or across, to suggest some directions where possible, but mostly just to help organise thoughts.

So let's start with the UI. It turns out that there are a couple of questions which immediately come to the fore once you have a basically international interface. The first is whether display semantics should be embedded in your international tags. My gut reaction was, of course, no ... but, suppose, for example, emphasised text needs to be done differently in different locales? The second is in the granularity of the language tags, and the way that they appear on the page. Suppose it is better in one language to reverse the order of two distinct tags, to dispense with one altogether, or to add additional ones? All of these require modifications in the pages which call the language specific messages, not in the messages themselves. Is there a technical solution to these problems? (I don't know, by the way, but I'm open to suggestion).

We also have the problem of wholesale documentation. User and Administrator help, and system documentation. Not only are they vast, but they are often changing, and maintaining many versions of them is a serious undertaking. It seems inappropriate to use i18n tagging to do documentation, so a different approach is necessary. The idea of the "language pack" would be to include not only custom i18n tags, but also language specific documentation, and all of the other things that I'm going to waffle about below.

Something else happens in the UI which is nothing to do with the page layout. Data is displayed. It is not uncommon to see DSpace instances with hacked attempts at creating multi-lingual application data such as Community and Collection structures, because the tools simply don't yet exist to manage them properly. For example:

https://gupea.ub.gu.se/dspace/community-list

where the English and Swedish terms are included in the single field for the benefit of their national and international readership.

Capturing all data in a multi-lingual way is very very hard, mostly because of the work involved. But DSpace should be offering multi-lingual administrator controlled data such as Communities and Collections, and at least offering the possibility of multi-lingual items. The application challenges here are to:

Capture the data in multiple languages

Store the data in multiple languages

Offer administrator tools for adding translations (automated?)

Disseminate in the correct language.

Dissemination in the correct language ought not to be too much hassle through the UI (and DSpace already offers tools to switch UI language), but I wonder how much of a difficulty this would be for packaging? Or other types of interoperability? Do we need to start adding language qualifiers to everything? And what happens if the language you are interested in isn't available, or is only partial for what you are looking at? Defining a fall-back chain shouldn't be too hard, but perhaps that fall-back chain is user specific; suppose I'm English, but I also understand German and French: I don't want the application to fall back from English to Russian, for example.
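A user-specific fall-back chain of the kind described above could be as simple as an ordered list of language codes that is walked until a translation is found. This is a minimal Java sketch under that assumption; the class and method names are invented, and it says nothing about how DSpace would store the translations.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a per-user language fall-back chain.
class LanguageFallback {
    // Returns the first available translation in the user's order of preference,
    // or empty if none of the user's languages is available.
    static Optional<String> resolve(Map<String, String> translations,
                                    List<String> preferredLanguages) {
        for (String lang : preferredLanguages) {
            String value = translations.get(lang);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> title = new HashMap<>();
        title.put("de", "Sammlung");
        title.put("ru", "Kollektsiya");

        // An English speaker who also reads German and French.
        List<String> chain = Arrays.asList("en", "de", "fr");
        System.out.println(resolve(title, chain).orElse("(no translation available)"));
    }
}
```

Returning an empty result rather than silently dropping to some system default leaves the caller to decide what to show when nothing in the user's chain is available.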

This post was actually motivated by a discussion I have been having about multi-lingual taxonomies, and using URIs to store the vocabulary terms, instead of the terms themselves. In this particular space, URIs are a good solution, because they are tied to a specific, recognised wording. It does place a burden on the UI, though, to be able to hide the URI from the user during deposit and dissemination.

But the same approach could, in theory, be used to offer multi-lingual browse and search results across an entire database. Imagine: each indexable field is collected in its many languages, a single (internal) URI is assigned to that cluster of terms, and that URI is stored instead of the value. With a lot of computational effort you could produce a map of URIs to all the same terms in all the different languages in the database and their corresponding digital objects, which you could offer to your users through search or browse interfaces (I'd not like to be the one to have to implement this, and iron out the wrinkles which I'm blatantly overlooking here).
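A toy version of that map is sketched below in Java, purely to show the shape of the data: one internal URI per cluster of equivalent terms, with a lookup in each direction. The class name, the `urn:example:` URIs and the case-insensitive matching are all assumptions made for the example, and none of the hard parts (deciding which terms are equivalent, keeping the map up to date) are addressed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: clusters of equivalent terms keyed by an internal URI.
class TermClusters {
    // uri -> (language code -> label in that language)
    private final Map<String, Map<String, String>> labelsByUri = new HashMap<>();
    // "lang|normalised label" -> uri
    private final Map<String, String> uriByTerm = new HashMap<>();

    private static String key(String lang, String label) {
        return lang + "|" + label.trim().toLowerCase(Locale.ROOT);
    }

    void add(String uri, String lang, String label) {
        labelsByUri.computeIfAbsent(uri, k -> new HashMap<>()).put(lang, label);
        uriByTerm.put(key(lang, label), uri);
    }

    // Finds the cluster URI for a term as typed by a user in a given language.
    Optional<String> uriFor(String lang, String label) {
        return Optional.ofNullable(uriByTerm.get(key(lang, label)));
    }

    // Finds the label to display for a stored URI in the reader's language.
    Optional<String> label(String uri, String lang) {
        Map<String, String> labels = labelsByUri.get(uri);
        return labels == null
            ? Optional.<String>empty()
            : Optional.ofNullable(labels.get(lang));
    }

    public static void main(String[] args) {
        TermClusters clusters = new TermClusters();
        String uri = "urn:example:term:chemistry"; // made-up internal URI
        clusters.add(uri, "en", "Chemistry");
        clusters.add(uri, "sv", "Kemi");

        // A Swedish search term resolves to the cluster, which displays in English.
        Optional<String> found = clusters.uriFor("sv", "kemi");
        System.out.println(found.flatMap(u -> clusters.label(u, "en")).orElse("(unknown)"));
    }
}
```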

There are many other corner areas of applications which include language-specifics, and it's going to take me a while to gather the list of what they are. Here are a few which aren't covered by the above:

system configuration

code exceptions and errors

application email notifications

A second major step has been taken for DSpace 1.5 with regard to multi-lingualism, in the form of Claudia Jürgen's work on submission configuration, help files, emails and front page news. The natural progression would be onto multi-lingual application metadata, and from there the stars ...

Thursday, 8 November 2007

SWORD 1.0 Released

Just a quick heads up to say that the SWORD 1.0 release is now out and ready for download from SourceForge:

http://sourceforge.net/projects/sword-app/

Here you will find the common java library which supports repositories wanting to implement SWORD, plus implementations for DSpace and Fedora. There is also a client (with GUI and CLI versions) which you can use to deposit content into the repositories.

The DSpace implementation is designed only to work with the forthcoming DSpace 1.5 (which is currently in Alpha release). Your feedback and experiences with the code would be much appreciated. We expect to be making refinements to the DSpace implementation up until DSpace 1.5 is released as stable.

Thursday, 25 October 2007

DSpace 1.5 Alpha with experimental binary distribution

The DSpace 1.5 Alpha has now been released and we encourage you to download this exciting new release of DSpace and try it out.

There are big changes in this code base, both in terms of functionality and organisation. First, we are now using Maven to manage our build process, and have carved the application into a set of core modules which can be used to assemble your desired DSpace instance. For example, the JSP UI and the Manakin UI are now available as separate UI modules, and you may build either or both of these. We are taking an important step down the road, here, to allowing for community developments to be more easily created, and also more easily shared. You should be able, with a little tinkering, to provide separate code packages which can be dropped in alongside the dspace core modules, and built along with them. There are many stages to go through before this process is complete or perfect, so we encourage you to try out this new mechanism, and to let us know how you get on, or what changes you would make. Oh, and please do share your modules with the community! Props to Mark Diggory and the MIT guys for this restructuring work.

The second big and most exciting thing is that Manakin is now part of our standard distribution, and we want to see it taking over from the JSP UI over the next few major releases. A big hand for Scott Phillips and the Texas A&M guys for getting this code into the distribution; they have worked really hard.

In addition to this, we have an Event System which should help us start to decouple tightly integrated parts of the repository, from Richard Rodgers and the guys at MIT. Browsing is now done with a heavily configurable system written initially by myself, but with significant assistance from Graham Triggs at BioMed Central. Tim Donohue's much desired Configurable Submission system is now integrated with both JSP and Manakin interfaces and is part of the release too.

Further to this we have a bunch of other functionality including: IP Authentication, better metadata and schema registry import, move items from one collection to another, metadata export, configurable multilingualism support, Google and html sitemap generator, Community and Sub-Communities as OAI Sets, and Item metadata in XHTML head <meta> elements.

All in all, a good looking release. There will be a testathon organised shortly which will be announced on the mailing lists, so that we can run this up to beta and then into final release as soon as possible. There's lots to test, so please lend a hand.

We are also experimenting with a binary release, which can be downloaded from the same page as the source release. We are interested in how people get on with this, so let us know on the mailing lists.

Come and get it:

http://sourceforge.net/project/showfiles.php?group_id=19984

Tuesday, 24 April 2007

Ridiculously well integrated IDE and Application

This week I have been playing around a lot with Eclipse, which is an Integrated Development Environment platform, geared principally (but by no means exclusively) towards Java development. I've been an Eclipse user for some time, but principally for its ability to mediate nicely with my version control server (Subversion), and the pleasantness of using the graphical editing tools. I was aware, though, of the huge potential it has for rapid application development, and finally got around to putting some time aside to investigate.

The product of my attempts has been documented here, on the DSpace wiki:

HOWTO Integrate DSpace with Eclipse and Tomcat

Along the way I found all sorts of goodies, like the Database Explorer tools, that allow me to execute SQL directly from files open in the editor onto my running database, and a variety of graphical and semi-graphical tools for editing my files. The full power of the source code analysis for a properly set-up project is staggering, as are the refactoring tools.

The real kicker, though, the feature that makes this effort all worthwhile is that I can now run DSpace (or any other web application) from within Eclipse. It looks after controlling Tomcat, and ties in the Tomcat debugger to Eclipse, so that (and this is the really cool bit) I can set points in my application source for execution to halt while I examine the state of the machine. So, I load a web page, which invokes code within which I have set "breakpoints", and Eclipse immediately takes over, opens up a debugging environment, and allows me to step through the code, line by line if I like, examining the in-memory objects all the way. Awesome. In the original sense of the word.

I've seen other people use similar functionality (for example in Visual Studio), so I'm glad I've replicated it in Eclipse. Of course, now I'll discover that everyone has been doing this for years, and I'm the last to catch on. But if I'm not the last to catch on, or at least to figure out how to get DSpace working in this environment (non-trivial) then I strongly encourage you to give this a go.

Tuesday, 17 April 2007

Configurable Browse System released

Well, things have been a little quiet on here, what with my extended Easter break. But now I'm back in the office and getting on top of things again. There are a few things to update on, which I will be doing over the next couple of days. The most exciting thing today is that I have finally finished the release candidate code for the new Browse system for DSpace. If you are a DSpace user, please check it out at:

https://sourceforge.net/tracker/index.php?func=detail&aid=1702233&group_id=19984&atid=319984

I have been using the core of this for some time, so it should be stable. Your feedback would be welcome.

Thursday, 22 March 2007

Crawling like Ants

During experiments yesterday I discovered that running Java processes from inside the Ant build tool can have unforeseen performance issues. I had written an Ant task to build the browse indices for my DSpace system. This involves producing 9 separate indices for around 80,000 records, and is not a rapid process at the best of times. Previous executions of this code have yielded index rates of approximately 10 - 15 items per second, which I was pretty happy with. Running it in Ant, though, dropped my performance right down to a low of 1 item every 2 seconds! After this had run for several hours I killed it, and tried it again directly from the command line; up came the performance again to the usual standard.

So, what is going on? Here are some details:

- while indexing in Ant, the box was under almost no load - no physical memory shortages, disk io bottlenecks, etc.

- I toyed with the idea that memory allocation to the JVM was the problem, but I've seen the indexer run with different memory allocations, and it has so far never caused a speed problem (just OutOfMemory errors)

Answers on a postcard. Or in a comment.

Thursday, 15 March 2007

Google Summer of Code

DSpace is pleased to announce that it has been accepted as a mentoring organisation in Google's Summer of Code 2007.

http://code.google.com/soc/dspace/about.html

The following DSpace developers are officially mentoring for the period:

Robert Tansley (Google)
Jim Rutherford (HP)
Richard Jones (Imperial College)
Stuart Lewis (University of Aberystwyth)
Claudia Jürgen (Universität Dortmund)
Scott Phillips (Texas A&M University)

I am very excited about it, and we have the workings of plenty of ideas for the developments that can be undertaken.

Tuesday, 30 January 2007

DSpace Architecture Review Report

Following a meeting at MIT, in Cambridge, Massachusetts in October 2006, the findings of the DSpace Architecture Review have now been presented at Open Repositories 2007:

http://wiki.dspace.org/index.php/ArchReviewReport

Many thanks to John Mark Ockerbloom for all his work, and for guiding the architecture review group.

Friday, 8 December 2006

Sorting in databases

Discovered via a circuitous route, Jim Downing notes that Dorothea Salo has a great tip for fixing sort ordering in the DSpace browse. Since I'm working on this feature as we speak, it's important for me to be able to take this on board. In fact, I have the following feature planned:

Allow each field to request a Normaliser for its entry into the sort_value column of the browse system. Using the PluginManager for DSpace, this might look like this:

    String myValue = "some value to be normalised";
    String myLang = "en"; // this is the language I want to normalise into
    Normaliser myNormaliser = (Normaliser) PluginManager.getNamedPlugin(Normaliser.class, myLang);
    myValue = myNormaliser.normalise(myValue);

Our configuration for this would then probably just be something like the following:

    plugin.named.org.dspace.browse.Normaliser = \
        org.dspace.browse.EnglishNormaliser = en, \
        org.dspace.browse.NorwegianNormaliser = no

and so forth. Then, the way your normaliser works would be up to you, and perhaps for Dorothea's example, you need to just maintain a mapping file of unicode values and their target English representation.
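As a sketch of what the mapping-based normaliser mentioned above might look like, here is a minimal Java example. The `Normaliser` interface is declared here only to match the call shape used earlier in this post; it and `MappingNormaliser` are assumptions, not existing DSpace classes, and the mapping would in practice be loaded from a file rather than built in code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface matching the planned normalise(String) call; not a DSpace class.
interface Normaliser {
    String normalise(String value);
}

// Lower-cases the value, then replaces each mapped character with its target string.
class MappingNormaliser implements Normaliser {
    private final Map<Character, String> mapping;

    MappingNormaliser(Map<Character, String> mapping) {
        this.mapping = new HashMap<>(mapping);
    }

    public String normalise(String value) {
        String lower = value.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            String replacement = mapping.get(c);
            // unmapped characters pass through unchanged
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> mapping = new HashMap<>();
        mapping.put('\u00e5', "a");  // a with ring above (example mapping only)
        mapping.put('\u00f8', "o");  // o with stroke
        mapping.put('\u00e6', "ae"); // ae ligature
        Normaliser normaliser = new MappingNormaliser(mapping);
        System.out.println(normaliser.normalise("\u00c5g\u00e5rd"));
    }
}
```

Because the value is lower-cased before the lookup, the mapping file only needs lower-case entries. The sort_value column would then hold the normalised form while the display value stays untouched.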
