L2L From Lemmatisers to Linkers

Submitted by Karina Berger on 7 September 2023

Motivation and Goals

Interoperability is a crucial feature for the exchange and utilisation of data. CLARIN promotes interoperability by endorsing the adoption of open science standards. There is, however, a more intuitive notion of interoperability that involves language resources: be they corpora or lexicons, language resources are all made of a basic ingredient, i.e. words. Interoperable content should embed the information that the same lexical item is used across different resources.

The main goal of L2L is to experiment with this idea to connect resources available in CLARIN’s using a lexically-based approach, and to build a network of corpora and lexicons linked at the level of words. Central to this project is the task of lemmatisation: the connection between lexicons and corpora is attainable by linking corpus tokens to the canonical forms that are also found in lexicons. This vision of interoperability builds upon the results of the Linguistic Linked Open Data community, and was implemented by the project LiLa - Linking Latin for Latin. We plan to test it in a larger language-independent context.

The goals of L2L were therefore:

To provide a survey of the lexicons and corpora in the VLO, to identify the resources that could be leveraged to build the network that we have in mind
To implement a prototype of this architecture, by creating a demo of a language-independent lemmatiser/linker with pointers to resources in the VLO; the tool should replicate the same architecture as the LiLa Text Linker
To integrate the LiLa Text Linker into the CLARIN Switchboard.

*Overview of the network of words and corpora in the CLARIN Textual resources linker. Larger nodes are corpora, smaller nodes are lemmas. Try it live!*

Outcomes

Tools:
- A demo of our CLARIN Textual resources linker is available here. Our prototype is designed to be language-independent; the demo shows results for Italian and links only 14 textual resources from the VLO, due the limitations that we discuss in the documentation.
- The LiLa Text Linker now has a new (see an example of a response here) compliant with the Switchboard guidelines. We made a pull request to include the tool into the Switchboard Tool Registry.
Documentation:
- A survey of lexical and textual resources that can be used to build and improve an architecture of interoperable resources based on lemmatisation and the LOD paradigm
- A post containing a detailed discussion and documentation of the CLARIN Textual resources linker
- Two Jupyter notebooks with the Python code used to generate and explore the datasets in the Text resources linker.
Datasets:
- An SQL file with a dump of the relational database produced by the backend component of the CLARIN Textual resources linker
- Three RDF files generated from the DB tables and containing:
  - The Italian lemma bank of ca 82k canonical forms, based on the Ontolex and Lexinfo models
  - The frequencies of the lemmas in the 14 textual resources scanned, expressed using the FrAC extension of Ontolex
  - One lexical resource (the sentiment lexicon OpeNER) whose entries and polarity values have been modelled using the Ontolex and Marl ontologies.

*The adjective 'grandezza' (greatness) and the network of six resources where it is attested in the CLARIN Textual resources linker.*

Lessons Learned

The lemmatiser/linker approach to interoperability is achievable and opens many avenues for joint exploration of language resources in a large infrastructure such as CLARIN.
The survey revealed quite a surprising number of LOD-compliant lexicons and (though in more limited supply) corpora.
Our greatest challenge was that both the back- and the frontend of our demo project require a considerable amount of computational power.
We will continue looking for more efficient solutions and ways to improve our prototype.

The Team

Francesco Mambrini, project leader
Marco Passarotti, project consultant
Giovanni Moretti, developer.