Using CLARIN in Training and Education @CLARIN2023

Tuesday, 17 October 2023, 11:00 - 13:00

About

This workshop will showcase the latest training and educational initiatives, as well as learning materials created within the CLARIN network. It will be held as a parallel session during the CLARIN Annual Conference, 16-18 October 2023, in Leuven.

The workshop is scheduled for Tuesday, 17 October 2023, from 11:00 to 13:00 CEST and will be a hybrid event. Due to limited seating, this event is only available for virtual attendance (please use the conference registration form).

Programme

Slides for all presentations.

View the recording.

11:00 - 12:00	Presentations of Accepted Abstracts
11:00 - 11:10 – Welcome & Introduction by Francesca Frontini
11:10 - 11:20 – Privacy by Design in Linguistic Research by Henk van den Heuvel
11:20 - 11:30 – Teaching Syntax with CLARIN Corpora and Resources by Antonio Balvet
11:30 - 11:40 – Learning Programming in Python for Linguistics and Language Studies by Koenraad De Smedt
11:40 - 11:50 – NLP Annotation for Digital Scholars by Maarten Janssen and Silvie Cinková
11:50 - 12:00 – DH-Course Registry: A Bridge Between Infrastructures, DH Masters Degrees and Industry? by Edward Gray, Vicky Garnett, Tom Gheldof, Adeline Joffres, Iulianna van der Lek, Amelia Sanz
12:00 - 12:10	Q & A
12:10 - 13:00	Reusing the UPSKILLS Learning Content (The workshop participants are invited to browse the UPSKILLS learning content on Moodle before the workshop. During the workshop, we will discuss possible scenarios of reusing the materials in teaching and training.)
12:10 - 12:20 – Overview of the UPSKILLS Learning Content by Stavros Assimakopoulos
12:20 - 12:35 – Introduction to Language Data: Standards and Repositories by Iulianna van der Lek
12:35 - 12:50 – Automatic Speech Recognition and Forced Alignment by Louis ten Bosch
12:50 - 13:00	Discussion & Wrap-Up

Abstracts

Privacy By Design in Linguistic Research

Henk van den Heuvel (Radboud University, the Netherlands)

This presentation shares the experience of reusing and adapting the educational materials from the CLARIN Learning Hub and the DELAD project for a workshop at AITLA 2023. The workshop was motivated by, and tuned towards, the sensitive data typically associated with atypical speech that we deal with at CLARIN's Knowledge Center for Atypical Communication Expertise (ACE: https://ace.ruhosting.nl/), coordinated by Henk van den Heuvel. The Privacy by Design in Linguistic Research learning content is built around three components: (1) an introduction to the GDPR and its impact on linguistic research, (2) a group discussion of two use cases, and (3) a DPIA roleplay. Parts 1 and 2 are based on the author’s experience as a data steward at the Faculty of Arts at Radboud University. For more details about these educational materials, see this CLARIN Impact Story: Navigating GDPR with Innovative Educational Materials.

Teaching Syntax With CLARIN Corpora and Resources

Antonio Balvet (University of Lille, France)

What if learning syntax could become a gamifiable, highly engaging activity instead of a boring topic? Have you ever dreamed you could generate tons of language exercises for your Moodle class based on authentic texts instead of made-up sentences?

Join Antonio Balvet as he introduces a new platform that transforms CoNLL-structured syntactic annotations into Moodle-compatible quizzes. The demonstration will centre on French, but the scripts can be applied to any CoNLL corpus available from the Universal Dependencies web repository. The code for the current version of the corpus2quiz processing chain is available at https://github.com/abalvet/ACE.
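To give a rough idea of what such a transformation involves, here is a minimal sketch in Python. It is not the corpus2quiz code. It assumes the input is in the CoNLL-U format used by Universal Dependencies and that the quiz is exported in Moodle's GIFT import format. The sample sentence, the list of answer choices and all function names are made up for illustration.

```python
# Hypothetical sketch: one CoNLL-U sentence -> one Moodle GIFT multiple-choice question.
# Not the corpus2quiz implementation; names and sample data are illustrative only.

UPOS_CHOICES = ["NOUN", "VERB", "ADJ", "ADV", "DET", "ADP", "PRON"]

# A tiny hand-written CoNLL-U sentence (10 tab-separated columns per token).
SAMPLE = (
    "# text = Le chat dort.\n"
    "1\tLe\tle\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t2\tdet\t_\t_\n"
    "2\tchat\tchat\tNOUN\t_\tGender=Masc|Number=Sing\t3\tnsubj\t_\t_\n"
    "3\tdort\tdormir\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)


def parse_sentence(block):
    """Return the sentence text and a list of token dicts from one CoNLL-U block."""
    text, tokens = "", []
    for line in block.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                tokens.append({"form": cols[1], "lemma": cols[2], "upos": cols[3]})
    return text, tokens


def gift_escape(s):
    """Backslash-escape the characters that are special in GIFT."""
    for ch in "~=#{}:":
        s = s.replace(ch, "\\" + ch)
    return s


def gift_question(text, token, choices=UPOS_CHOICES):
    """Build a GIFT question asking for the part of speech of one token."""
    options = []
    for tag in choices:
        prefix = "=" if tag == token["upos"] else "~"  # '=' marks the right answer
        options.append(prefix + tag)
    stem = 'In the sentence "' + text + '", what is the part of speech of "' + token["form"] + '"?'
    title = "::POS " + gift_escape(token["form"]) + "::"
    return title + " " + gift_escape(stem) + " {" + " ".join(options) + "}"


text, tokens = parse_sentence(SAMPLE)
# Only ask about tokens whose tag is among the answer choices (e.g. not PUNCT).
questions = [gift_question(text, t) for t in tokens if t["upos"] in UPOS_CHOICES]
print("\n\n".join(questions))
```

The output can be saved to a text file and loaded through Moodle's question import, choosing GIFT as the format. A real pipeline would also need to handle distractor selection, multiword tokens and question types other than multiple choice.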

Learning Programming in Python for Linguistics and Language Studies

Koenraad De Smedt (University of Bergen, Norway)

This virtual course offers basic knowledge and skills in programming. There are many Python courses, but this one mostly focuses on text processing and data analysis related to linguistics, language studies, digital humanities and cognitive science. The core of the course consists of a series of Jupyter notebooks that combine examples of Python code with explanatory text. The notebooks demonstrate simple language processing and aggregating and visualising qualitative and quantitative data, including data from real language research, CLARINO, and other sources. They also suggest exercises that should be solvable based on the given examples and explanations. Ideally, the course should be presented by a teacher, and the exercises should be supervised, but the modules are also suitable for self-study.
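As an illustration of the kind of short, language-focused example that fits in a notebook cell, here is a sketch that counts word frequencies in a text. It is not taken from the course materials; the sample text and the `tokenize` helper are invented for this purpose.

```python
# Illustrative only: a simple word-frequency count, not taken from the course notebooks.
import re
from collections import Counter

text = "The cat sat on the mat. The mat was flat."


def tokenize(s):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", s.lower())


freq = Counter(tokenize(text))

# Show the three most frequent word forms with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

An exercise in the same spirit could ask students to exclude function words or to compare frequencies across two texts.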

Experience with the course shows that students get started quickly because they do not have to install any software. The combination of text and code in Jupyter Notebooks makes the course largely self-explanatory. Students in linguistics and language studies benefit from the focus on language. Still, the course has also attracted students from information science, communication and media studies, cognitive science, digital culture, computer science, and digital security.

The exercises stimulate active learning. Although the course is self-contained and suitable for self-study, experience shows that its use in classroom teaching is preferable, especially for absolute beginners, because such teaching allows interaction through questions and answers if something is not well understood. Also, sessions offering help with the exercises were appreciated.

After several iterations, the course is now fairly stable, but further improvements are possible. Several examples could be made even more relevant to language studies, and using more language datasets from CLARIN is being considered. There are no solutions to exercises, but students have requested these, especially for self-study. The addition of quizzes may be considered. The Google Colaboratory platform is very easy to use, but it has limitations, and its conditions for use may change. Alternative platforms, such as Deepnote, Kaggle or Binder, have been successfully tested but are not essentially better. Some students prefer to run the Jupyter notebooks on their own machines. Ideally, the code and runtime should be hosted on an academic cloud service, such as NIRD (Norway), but that involved too many administrative hurdles and did not provide all the required packages.

The course materials are licensed under Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). Citation: Introduction to programming for NLP with Python. Web-based course at the University of Bergen. https://mitt.uib.no/courses/38115.

NLP Annotation for Digital Scholars

Maarten Janssen and Silvie Cinková (Charles University, Czech Republic)

We are proposing a mentoring and assistance pipeline for individual scholars or groups from the Digital Humanities (DH) community to create training data for NLP tools in their languages and specific domains (e.g., poetry, historical texts), drawing on Universal Dependencies as the current standard for linguistic annotation.

DH typically deals with out-of-domain texts. Since corpus annotation is no longer a topic for NLP research, it is the DH researchers themselves who are in charge of creating training corpora for their languages and domains, to be included in the regular updates of established NLP workflows (e.g. UDPipe) as well as in ad hoc ones (e.g. the various spaCy models on Hugging Face). However, the DH researchers need a friendly nudge to get going. In our previous teaching events, we have demonstrated to our students that it is daunting work which, on the other hand, does not require the extent of technical (programming) skills they tend to fear.

We continue elaborating on 'NLP annotation for scholars' as a pedagogical concept. The corpus data can be random data for the given domain, but the more typical use case is that of a scholar already having a corpus they are interested in, for which no adequate tagger exists. This corpus can already be loaded with other annotations, e.g. XML-TEI links to a facsimile or an audio track. We teach our students to operationalise their research questions in common linguistic terms, formulate them in terms of Universal Dependencies, and query them with a corpus query language. For all these steps, we use TEITOK - an online environment explicitly designed to integrate the NLP annotation in complex document structures and make the resulting corpus searchable across the different annotation layers.

The first step is to select a portion of text in the pre-established corpus collection and annotate it manually, either from scratch or after pre-processing by a (sub-optimal) tagger. For the manual annotation, the system will ask for the correct lemma, POS tag, and morphological features for each word. By default, syntactic dependencies will not be annotated in this step since scholars with no linguistic background get easily discouraged by mentions of syntax, whereas they usually have a good grasp of morphological categories. The manual annotation is used iteratively to train and improve the tagger and to facilitate further annotation with improved pre-processing. With deep learning, less training data is needed to reach adequate tagger accuracy.
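The annotation record described above (lemma, POS tag and morphological features, without syntax) can be pictured as a partially filled CoNLL-U row. The sketch below is an assumption about how such a record might be serialised; it is not TEITOK code, and the `TokenAnnotation` class is hypothetical. It relies only on the CoNLL-U convention that an underscore marks an unspecified field.

```python
# Hypothetical sketch: a morphology-only annotation written as a CoNLL-U row.
# Not TEITOK code; the class and field names are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class TokenAnnotation:
    index: int   # 1-based position of the word in the sentence
    form: str    # the word as it appears in the text
    lemma: str
    upos: str    # Universal Dependencies POS tag
    feats: dict = field(default_factory=dict)  # e.g. {"Number": "Plur"}

    def to_conllu(self):
        """Serialise to the 10 CoNLL-U columns, leaving syntax unspecified."""
        # Features are written as Name=Value pairs, sorted by name.
        pairs = sorted(self.feats.items(), key=lambda kv: kv[0].lower())
        feats = "|".join(name + "=" + value for name, value in pairs) or "_"
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        cols = [str(self.index), self.form, self.lemma, self.upos,
                "_", feats, "_", "_", "_", "_"]
        return "\t".join(cols)


token = TokenAnnotation(1, "chats", "chat", "NOUN", {"Number": "Plur", "Gender": "Masc"})
print(token.to_conllu())
```

Rows like this leave the HEAD and DEPREL columns empty, so they are suitable as tagger and lemmatiser training data but would not pass validation as a complete Universal Dependencies treebank.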

This process thus provides, from the very start, an automatic tagger and lemmatiser, which will become increasingly accurate with more training data. In the set-up we provide for this, the model will be available for download, and there will also be an online interface where anyone can use the newly trained tagger. As is done by UDPipe, it will include authors' details to ensure their due credit. The session will demonstrate the current setup and its future prospects.

DH-Course Registry: A Bridge Between Infrastructures, DH Masters Degrees and Industry?

Edward Gray (Huma-Num, CNRS & DARIAH-EU), Vicky Garnett (DARIAH-EU), Tom Gheldof (KU Leuven), Adeline Joffres (Huma-Num & CNRS), Iulianna van der Lek (CLARIN ERIC), Amelia Sanz (Complutense University of Madrid)

ERICs such as CLARIN and DARIAH are well-placed to provide a conduit between industry and education, given our wide-ranging contacts with both communities. DARIAH and CLARIN already collaborate closely within the context of the DH Course Registry, maintained by both infrastructures. The registry, a platform to collect metadata on digital humanities programmes across Europe, served as a glue between the research infrastructures and the DH programmes, leading to a new joint initiative.

In the spring of 2023, we set out to explore effective strategies and best practices for facilitating the career success of graduates of DH Master’s programmes in the private sector. The skills acquired within Digital Humanities (DH) postgraduate degrees are interdisciplinary and, therefore, transferable, something that has been recognised among larger multinational companies. Moreover, a strong humanities background and familiarity with our methods can benefit the commercial sector. Yet among small and medium enterprises (SMEs), employing a graduate from a field still in its relative infancy compared with more traditional disciplines can be considered a risk. It therefore becomes necessary to identify the gaps between the current provision of training among DH scholars at a Master’s level and the needs of companies and future employers of DH graduates.

It is necessary to foster internships that encourage and nurture experimental data spaces between cultural heritage, industry and academia, and CLARIN and DARIAH are ideal forums to cultivate these synergies. To do so, we contacted a series of DH Master’s Programme leaders (directors, coordinators and/or representatives) to learn how they approached the issue of internships with private industry. At the DARIAH Annual Event 2023 in Budapest, Hungary, this work culminated in a joint initiative led by CLARIN and DARIAH that brought together 25 DH Master’s heads (https://doi.org/10.5281/zenodo.8071224) to examine these questions more closely during a pre-conference workshop, which resulted in a White Paper. This presentation is, therefore, a continuation of this conversation with the CLARIN community to ensure that our impact is as wide and representative as possible.

CLARIN in the UPSKILLS Project

Stavros Assimakopoulos (University of Malta), Iulianna van der Lek (CLARIN ERIC), Louis ten Bosch (Radboud University)

In the second part of the workshop, we will give an introduction to UPSKILLS, an Erasmus+ strategic partnership project (2020-2023), which aimed to identify and tackle the gaps and mismatches in skills for linguistics and language students through the development of a new curriculum component and supporting learning content to be embedded in existing programmes. The UPSKILLS consortium partners, including CLARIN, developed 11 learning blocks on various topics, which are accessible for browsing and download from the project website. After an introduction to the project by Stavros Assimakopoulos, we will give a demo of the learning blocks produced by CLARIN in the project, namely Automatic Speech Recognition and Forced Alignment, and Introduction to Language Data: Standards and Repositories. For a more complete overview of CLARIN's contribution to the UPSKILLS project, please see the CLARIN Learning Hub.

Contact

If you have questions about this event, please get in touch with Iulianna van der Lek at training [at] clarin.eu.

For questions and more information about the CLARIN Annual Conference, please visit our conference web page or email events [at] clarin.eu.

Curious to learn more about CLARIN's training and educational programs? Please visit our Learning Hub.

Address

Location: Irish College Leuven
Leuven
Belgium