Mobility Grant: Preparing Kielipankki Corpus for EuReCo

Submitted by Karina Berger on 19 November 2024

Written by Harald Lüngen

With my CLARIN mobility grant, I visited CSC, the Finnish IT Centre for Science in Espoo, Finland, which hosts the Finnish corpus archive Kielipankki that is part of FIN-CLARIN.

My official stay according to the grant was from 29 August until 6 September 2024, but my employer gave me the opportunity to stay a bit longer at CSC, and I had planned the scope of my project for the extended stay. In my project, I wanted to prepare a subcorpus from the Kielipankki archives for the European Reference Corpus initiative EuReCo, and also to try to set up a KorAP instance for EuReCo at CSC.

EuReCo is an initiative that aims to create (pairs of) comparable corpora by making existing corpora maintained by (and often legally bound to) CLARIN hosting centres interoperable through implementation-based (as opposed to specification-based) interoperability. KorAP, the corpus management system maintained at my home institution IDS, is currently used as the reference implementation, i.e. the search engine used for the EuReCo pilot projects.

I thought that the corpus KLK-FI-V2-VRT (a newspaper corpus created by the Finnish National Library) in Kielipankki should be suitable for integration into EuReCo, because it is very large and contemporary and contains newspaper and magazine text that would be compatible with the other corpora already in EuReCo. Due to its legal status (license CLARIN RES+NC v2.1), it cannot be published on servers outside CSC and would therefore be a good showcase of the benefits of EuReCo.

My first task at CSC was to familiarise myself with the corpus KLK-FI, its legal status and its internal structure, in particular the VRT encoding and its annotations. For this I consulted with the admin group at CSC, led by Katri Tegel and Martin Matthiesen, as well as with the Kielipankki management group, led by Krister Lindén and Mietta Lennes. I would like to thank them all for their warm welcome, for letting me take part in their group meetings, and for the chance to get to know the team members who maintain this impressive corpus archive and service. I gained many interesting and relevant insights from my discussions with them.

My main task was to implement a conversion pipeline from VRT to I5 or, alternatively, TEI, which are the formats that can be read by KorAP. TEI is of course also a well-established corpus exchange format, especially in the CLARIN sphere. At home I prefer to use XSLT 3.0 with streaming facilities to manipulate large XML files, because XSLT stylesheets are declarative and hence easier to maintain. However, the (commercial) Saxon processor required for XSLT 3.0 was not available at CSC, so I resorted to Perl and its XML::Twig library for XML streaming. I have made the vrt2tei conversion pipeline I implemented at CSC available in the IDS Gerrit repository (see here).

*KLK-FI in a KorAP instance (not accessible from the web). The foundry names 'malt' and 'spacy' are used as dummies until a TurkuNLP foundry title is available in KorAP.*

When applying the pipeline, one can choose between the output formats TEI proper and I5. For an appropriate representation of the CoNLL-U columns in VRT that represent dependency analyses (called head and deprel), I defined two new attributes for <w> in I5, named @head and @deprel. I also got in touch with the TEI SIG for linguists about this, who confirmed that they aim to introduce the same or similar attributes in TEI proper as well.
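
To illustrate the idea of carrying the dependency columns over into @head and @deprel, here is a minimal sketch in Python. It is not the vrt2tei pipeline itself (which is written in Perl with XML::Twig). The column order in `COLUMNS`, the function name `vrt_to_w_elements`, the mapping of `<sentence>` to `<s>`, and the assumption that token fields arrive as unescaped text are all hypothetical choices made for this example and may not match the actual KLK-FI VRT layout.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical column order for the tab-separated token lines;
# the real KLK-FI VRT files may use a different layout.
COLUMNS = ("word", "lemma", "pos", "head", "deprel")

def vrt_to_w_elements(lines):
    """Stream over VRT lines and yield one line of TEI-like markup each.

    Sentence tags become <s>...</s>; token lines become <w> elements whose
    annotation columns, including head and deprel, are stored as attributes.
    All other structural tags are skipped in this sketch.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<sentence"):
            yield "<s>"
        elif line.startswith("</sentence"):
            yield "</s>"
        elif line.startswith("<"):
            continue  # e.g. <text ...> or <paragraph ...>, ignored here
        else:
            # Assumes token fields are plain, unescaped text.
            fields = dict(zip(COLUMNS, line.split("\t")))
            attrs = " ".join(
                f"{name}={quoteattr(fields[name])}"
                for name in COLUMNS[1:]
                if name in fields
            )
            yield f"<w {attrs}>{escape(fields['word'])}</w>"
```

Because the function is a generator that handles one line at a time, it can be fed an open file handle and never needs to hold the whole corpus in memory. Under the assumed column order, the token line `Kissa`, `kissa`, `NOUN`, `2`, `nsubj` (tab-separated) would come out as `<w lemma="kissa" pos="NOUN" head="2" deprel="nsubj">Kissa</w>`.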

I converted a subcorpus of KLK-FI-V2-VRT comprising the most recent 21 years, from 2001 to 2021, and 21 major national and regional newspaper titles from all over Finland, generating an I5 encoding for it. Its size is more than 4 billion tokens. I then applied the KorAP indexing pipeline to the I5 corpus, but unfortunately it turned out that the user authentication in KorAP is not compatible with the one used at CSC. Since user authentication is crucial due to the restricted license of the corpus, we had to postpone the plan to set up a KorAP instance on a CSC server. For the time being, and as a proof of concept, I indexed the corpus in a KorAP instance on my project notebook, which is not accessible via the web.

In the future, a KorAP instance should be set up at CSC, and both CSC and IDS are still committed to this goal. The prerequisites for this are that Shibboleth must be enabled for KorAP, and OAuth2 must be implemented for Kielipankki first. The vrt2tei pipeline is currently tuned to the VRT of the KLK-FI corpus. It is expected that it will work just as well on the sister corpus KLK-SV (the Swedish-language newspaper corpus of the Finnish National Library). With some extensions, it should also be adaptable to convert (parts of) the Suomi24 corpus in VRT, which has richer textual metadata.