Workshop on Comparable and Interoperable Corpora of Academic Texts @CLARIN2024

Thursday, 17 October 2024, 14:00 - 18:00

Promotional image for the CLARIN workshop 2024 in Barcelona.

General Information

Date: 17 October 2024

Time: 14:00 - 18:00 CEST

Location: Barceló Sants, Barcelona, Spain (at CLARIN2024)

Key Deadlines

Submission deadline: 15 July 2024

Notification of acceptance: 9 August 2024 (closed)

If you are attending the CLARIN Annual Conference and would like to join this workshop, feel free to send an email to events[at]clarin[dot]eu.

About the Workshop

The CLARIN 2024 Post-Conference workshop on Comparable and Interoperable Corpora of Academic Texts aims to bring together experts and enthusiasts from CLARIN partners to discuss the creation, management, and application of academic text corpora.

Academic texts, such as academic papers and theses, serve as sources for sharing innovative research findings, theories, and methodologies across the academic community. Often accessible via open-source infrastructures, they are invaluable resources for comparative large-scale language research, including terminology extraction. Analysing academic texts from various disciplines also promotes cross-disciplinary insights and interdisciplinary collaboration.

The workshop will mainly focus on written text data; research on other data can be presented as well.

Programme

14:00	Welcome & opening
14:10	Presentation 1: Tomaž Erjavec, "Corpora of Slovenian academic texts"
	Abstract: We present corpora of Slovenian academic writing that have been compiled from the data held in the Open Science Slovenia portal. The first is the 1.4 billion word KAS corpus from 2019, which consists of bachelor's, master's and doctoral theses. It contains rich metadata, has marked page breaks with links to page images, and is linguistically annotated, including term candidates. The second is OSS from 2023, which is similar to KAS but contains newer texts as well as types of scientific writing other than theses, and comprises 2.4 billion words. Both corpora are openly available for research via the CLARIN.SI repository and the CLARIN.SI concordancers.
14:30	Presentation 2: Vanja Štefanec, Daša Farkaš, Matea Filko and Marko Tadić, "Croatian Scientific Corpus"
	Abstract: To be added
14:50	Presentation 3: Roberts Darģis, "Corpus of Latvian PhD Theses"
	Abstract: The corpus consists of PhD theses and abstracts published from 1993 until 2020. It contains over 21 million tokens from 1,449 documents, offering a rich resource for linguistic and academic research. It is morphologically annotated and available in the NoSketch Engine for analysis and in the CLARIN-LV repository for download. Each document's metadata includes the title, year, URL of the source document, and field. The field is categorized into 6 main categories and 30 subcategories. More information: https://korpuss.lv/en/id/Disert%C4%81cijas
15:10	Presentation 4: Marc Kupietz, Peter Leinen and Nils Diewald, "Towards a Very Large German Academic Corpus - Step 1: Building and Making Available a Corpus of 10,000 Doctoral Dissertations"
	Abstract: This paper outlines our ongoing effort to create and make accessible a large German academic corpus, starting with 10,000 doctoral dissertations, in a lawful and interoperable way, by running an instance of the corpus analysis platform KorAP in the German National Library. We address our approach to the legal challenges associated with this endeavor, provide technical details on the conversion and annotation pipeline, highlight key features of the analysis platform, and sketch the integration into broader frameworks, including CLARIN, CENL and EuReCo, a European initiative for providing comparable corpora.
15:30	Coffee break
15:50	Presentation on FCS: Erik Körner
16:05	Presentation 5: Anje Müller Gjesdal and Marita Kristiansen, "Academic corpora and specialised neology: examples from the nature and environment subject fields"
	Abstract: This paper reports on the creation and use of an academic corpus to study neology in the nature and environment subject fields. The paper describes issues encountered in creating the corpus, as well as use cases for the corpus, including neology analyses. Further, we discuss possible extensions to create additional corpora of other academic text genres to enable several axes of research, including neology detection and terminology extraction.
16:25	Presentation 6: Sofia Nasopoulou, "Transforming research publications into a Knowledge Graph"
	Abstract: To address the risk of missed or duplicated work, especially in multidisciplinary fields, we propose a workflow for generating Knowledge Graphs (KGs) from academic papers. We extend the Scholarly Ontology (SO) by adding two entities: Activity, representing research methods in context, and Finding, denoting outcomes. Using spaCy and RoBERTa-base models, we annotate sentences from 3,081 JSTOR publications to fine-tune, train and evaluate entity extraction models, achieving F1 scores of 74.33 for Activity and 82.22 for Finding. Entities are linked via proximity-based inferencing rules and transformed into RDF triples, enabling semantically complex queries.
16:45	Conclusion and outlook: next steps

Topics of Interest

We welcome submissions on a wide range of topics related to the development and utilisation of comparable and interoperable corpora of academic texts, including but not limited to:
•    Design and creation of monolingual and multilingual academic text corpora
•    Linguistic annotation and metadata of academic text corpora
•    Standards and methods for interoperability and comparability of academic text corpora
•    Use cases and applications for academic text corpora across various academic disciplines
•    Ethical and legal considerations in data collection.

Submission Guidelines

Authors are invited to present their ideas and existing resources at the workshop. Extended abstracts should include a description of the activities, as well as the names and affiliations of the presenters. Please prepare your abstracts according to the following guidelines:

•    Length: Extended abstracts of 500 to 1000 words (without references)
•    Language: Submissions should be written in English
•    Submission: Via the conference management system [link].

Submission for this workshop is closed.

Publication

Selected workshop abstracts will be published in the CLARIN2024 Conference Proceedings. Full papers based on workshop presentations may be submitted for publication in the CLARIN2024 post-conference volume.

Accommodation Information

Funds to cover accommodation expenses (max. 2 nights) are available for workshop participants. For attendees of the CLARIN Annual Conference 2024, one additional night of accommodation is covered.

Contact Information

For more information about the workshop or in case of any questions, please contact Andreas Witt (witt[at]ids-mannheim[dot]de) and Laura Herzberg (herzberg[at]ids-mannheim[dot]de).

We look forward to your submissions and to welcoming you to Barcelona in October!

The Workshop Committee
Tomaž Erjavec
Laura Herzberg
Tanja Wissik
Andreas Witt

Address

Barcelona
Spain