CLARIN DSpace at the Oxford Text Archive

Submitted by Linda Stokman on 28 August 2019

Blog post by Ondřej Košarko (LINDAT/CLARIN), who received a CLARIN Mobility Grant in July 2019 to visit the Oxford Text Archive in the UK.

The Oxford Text Archive (OTA) collects, catalogues, preserves and distributes high-quality digital resources for research and teaching. It currently holds thousands of texts in more than 25 different languages, and is actively working to extend its catalogue of holdings. The OTA relies upon deposits from the wider community as the primary source of high-quality materials. The OTA is part of the CLARIN European Research Infrastructure; it is registered as a CLARIN C-centre, and is part of the University of Oxford's contribution to the CLARIN-UK Consortium.

The OTA became interested in a new repository system. Its current production service is hosted by their IT Services department, and it was agreed in 2017 that the service would move, along with the administration and staff, to the Bodleian Libraries, the central library system of the University. The platform the OTA currently uses is a custom solution, developed over many years and spread across a number of systems. A new solution was required that could easily be hosted on a single server within the Bodleian Libraries virtual hosting system.

The LINDAT/CLARIN team, which I am part of, has been collaborating with the OTA for quite some time. Among other things, we share our experience of running a CLARIN DSpace repository.

A CLARIN DSpace installation at the OTA would offer the following:

a turnkey solution with community support from CLARIN experts
an important step on the route towards fulfilling the requirements of a CLARIN B Centre
added functionality, including more sophisticated search based on a wide range of metadata categories
an opportunity to make -compliant archival packages for migration to a future platform, e.g. a planned future Oxford Libraries Fedora implementation.

The CLARIN Mobility Grant made it possible to help set up a CLARIN DSpace repository at the Oxford Text Archive (OTA) and to provide hands-on experience in managing and customizing the installation. The visit took place from July 1st to July 4th in Oxford, UK.

The installation process

On the first day I met up with Martin Wynne (function) and Mark Rogerson. We started with git and GitHub, quickly working through branches with merge conflicts, remotes with invalid credentials, an outdated upstream remote and various other things that had piled up while Mark was setting up a testing installation. In the end we synchronized the ee-dev/clarin-dspace fork with the upstream ufal/clarin-dspace, and moved all the customizations for the OTA to a branch of their own.

Afterwards we went through the “standard” repository stack installation: compiling nginx with the AJP and Shibboleth modules, connecting nginx to a servlet container (Tomcat), compiling the Shibboleth service provider (as the official builds come without FastCGI support) and connecting that to nginx, preparing the PostgreSQL databases, and compiling the most recent version of the repository (with the OTA changes) and deploying it to Tomcat.

This was done on a new machine, which will become the public-facing “production” server. There was a handle server running on the test machine, so we reset the configuration and sent an update to CNRI (the handle server binds to a particular IP address, and that was different on the production machine). Some of the Shibboleth configuration from the testing machine could be reused, but since the domain name of the production machine was different, we had to send the updated service provider metadata to the UK identity federation.

After the installation we did a basic setup of collections and communities, and went through the authorization and authentication configuration, e.g. assigning users to a particular group based on Shibboleth attributes.
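The idea of attribute-based group assignment can be sketched in a few lines of Python. This is only an illustration of the logic, not how CLARIN DSpace is actually configured: the attribute names, values and group names below are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules: (attribute name, required value, repository group).
# None of these names come from the real OTA or DSpace configuration.
RULES: List[Tuple[str, str, str]] = [
    ("affiliation", "staff@example.org", "Depositors"),
    ("entitlement", "urn:example:ota:admin", "Administrators"),
]


def groups_for(attributes: Dict[str, List[str]]) -> Set[str]:
    """Return the groups whose rule matches one of the user's attribute values."""
    return {
        group
        for name, value, group in RULES
        if value in attributes.get(name, [])
    }


if __name__ == "__main__":
    user = {"affiliation": ["member@example.org", "staff@example.org"]}
    print(groups_for(user))
```

Attributes are modelled as lists because an identity provider may release several values for the same attribute.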

Solving issues

When the installation was done, we went through the issues we had discussed before the visit. One of those was how to handle the original (old) URLs and IDs. We were looking for a way to redirect the PURLs for the records in the old system so that they map to CLARIN DSpace items (and/or to the handles assigned to these items). The PURLs contain the identifier assigned to a particular item, so we decided to do something similar for the handles. We can’t influence the prefix part (that’s assigned by CNRI), but the rest of the handle (the suffix) is completely in our hands, so we used the item identifier there too. This should make it fairly easy to map between the PURLs and the handles; in fact, no mapping is needed, just a rewrite. Enabling items to be identified by legacy identifiers in the handles will help ensure that legacy citations remain valid. It also seems much more user-friendly when all of the IDs used to refer to an item look more or less the same, compared to using two or more independent numbering schemes. In the end this was just a small change to the export/import scripts between the old and the new systems.
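Because the handle suffix reuses the legacy item identifier, the redirect reduces to a string rewrite. A minimal sketch in Python, assuming a hypothetical PURL path layout (`/purl/<id>`) and a placeholder handle prefix; the real prefix is the one assigned by CNRI and the real PURL layout may differ:

```python
import re
from typing import Optional

# Placeholder prefix for illustration; the real one is assigned by CNRI.
HANDLE_PREFIX = "20.500.00000"

# Hypothetical layout of a legacy PURL path: /purl/<item identifier>
LEGACY_PATH = re.compile(r"^/purl/(?P<item_id>[A-Za-z0-9._-]+)$")


def handle_for(item_id: str) -> str:
    """Build a handle whose suffix is the legacy item identifier."""
    return f"{HANDLE_PREFIX}/{item_id}"


def rewrite_purl(path: str) -> Optional[str]:
    """Map a legacy PURL path to a handle URL, or None if it is not a PURL."""
    match = LEGACY_PATH.match(path)
    if match is None:
        return None
    # hdl.handle.net is the public handle resolver (proxy).
    return "https://hdl.handle.net/" + handle_for(match.group("item_id"))


if __name__ == "__main__":
    print(rewrite_purl("/purl/1234"))
```

No lookup table is involved: the only state is the prefix, which is why a rewrite rule in the web server would be enough in practice.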

We also spent time going through how to customize the repository: not only its look, but also the filters, facets and browse indexes that help the user navigate and search the content of the repository. These sometimes interact in non-obvious ways, and it’s necessary to follow a few conventions.

Faced with an issue

One issue we didn’t get to solve is related to a date facet. For date facets the repository provides a sort of “zooming”: it divides the dates into buckets (e.g. years 1400-1499, 1500-1599, etc.), and after a bucket is selected the remaining items are bucketed again, but with a smaller gap (e.g. if you selected the 1400-1499 bucket, you might see buckets for 1400-1409, 1410-1419, etc.). The gap is computed automatically so that the items are split into a fixed number of buckets. This usually works well, but some of the OTA records have Before the Common Era (BCE) dates or date ranges. Neither the date parser nor the bucketing algorithm plays well with this kind of data. We decided that, since it couldn’t be fixed at the moment, it would be best to create an additional metadata field during the import which would provide at least the first level of the buckets.
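The workaround, computing a first-level bucket at import time, might look roughly like the following Python sketch. It is not the repository's own bucketing code: the accepted date format, the fixed 100-year gap and the label format are assumptions made for illustration, and date ranges are not handled.

```python
import re
from typing import Optional

# Assumed input format: a year of up to four digits, optionally followed by an era.
YEAR = re.compile(r"^\s*(\d{1,4})\s*(BCE|BC|CE|AD)?\s*$", re.IGNORECASE)


def parse_year(text: str) -> Optional[int]:
    """Parse '1450' or '400 BCE' into a signed year; BCE years are negative."""
    match = YEAR.match(text)
    if match is None:
        return None
    year = int(match.group(1))
    era = (match.group(2) or "CE").upper()
    return -year if era in ("BCE", "BC") else year


def label(year: int) -> str:
    """Render a signed year for display."""
    return f"{-year} BCE" if year < 0 else str(year)


def first_level_bucket(year: int, gap: int = 100) -> str:
    """Return the label of the gap-sized bucket containing the year."""
    # Floor division rounds towards minus infinity, so negative (BCE)
    # years land in the correct bucket without special-casing.
    start = (year // gap) * gap
    end = start + gap - 1
    return f"{label(start)}-{label(end)}"


if __name__ == "__main__":
    for raw in ("1450", "400 BCE"):
        print(raw, "->", first_level_bucket(parse_year(raw)))
```

The sketch treats years as plain signed integers and ignores the missing year zero in the historical calendar, which only matters for the bucket that straddles the era boundary.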

Conclusion

The visit was beneficial both for the OTA and for me. The OTA now has an almost production-ready installation and has gained a better understanding of the repository and the software involved. For me, it provided experience with CLARIN DSpace running under different circumstances and showed areas where improvement would be helpful.

The Bodleian Libraries at the Osney One Building