Tour de CLARIN: Interview with Marin Laak

Submitted by Jakob Lenardič on 10 December 2018

Tour de CLARIN highlights prominent User Involvement (UI) activities of a particular CLARIN national consortium. This time the focus is on Estonia and Marin Laak, a senior researcher and principal investigator of the Estonian Literary Museum who was one of the developers of the Estonian cultural history web portal “Kreutzwald’s Century” that gives access to a vast amount of the digitized literary legacy.

1. Could you briefly tell us about your academic background?

I studied at the University of Tartu and got my bachelor’s degree in Estonian language and literature and then obtained my M.A. and Ph.D. degrees in literary studies. I have always been interested in collaboration with linguists, since I believe that literary studies are not possible without paying close attention to the language, which is clearly both the material and the base for the creation of literature. Throughout my career as a literary scholar I have been interested in large content-based models and literary environments, which I explored in my doctoral dissertation called “Non-linear Models of Literary history: The Problems of Text and Context in the Digital Environment”. I worked on the first Estonian project that developed a hypertext environment which linked various types of texts together. While the links were created manually at first, in the next project we developed software to generate them automatically, which required close work with the textual resources. Making these resources accessible and usable for a wide range of scientific and educational purposes has always been one of my priorities. In this sense, the collaboration with the Estonian CLARIN consortium has been a dream project for me.

2. How did you get involved with the Estonian CLARIN consortium?

The Estonian Literary Museum has been a member of the Estonian CLARIN consortium since 2016 and they have always supported my research. Together we have created a small and efficient working group that is developing the first tagged literary corpus for Estonian. I am very grateful to the team and the synergy that we have established working together. I have been involved in the Digital Humanities since 1997, when my old friend Neeme Kahusk (now a member of the CLARIN Estonia staff) advised me to participate in the call for the Tiigrihüpe (Est. Tiger’s Leap) project proposals. This was an initiative of the Estonian government that started in 1997 and heavily invested in the development and expansion of computer skills and network infrastructures in Estonia, with a particular emphasis on education. I was then the only non-linguist in the team, and in a couple of years we put together an extensive corpus of literary criticism texts, which was linked with textual interconnections to a larger hypertext network.

Having observed the work of linguists since the 1990s, I have witnessed a huge qualitative leap in their research. The potential of textual resources that are tagged morphologically and syntactically has grown significantly and has led to countless new possibilities for contextual research. For this reason, I believe that computational linguists should strive to make their tools more helpful and user-friendly for literary scholars. To make this possible, we first need to overcome the challenges set by the diachronic changes of language.

3. You are one of the authors of the Estonian cultural history web portal Kreutzwald’s Century. Why is this portal important for Digital Humanities in Estonia?

Kreutzwald’s Century is a unique project that is named after a literary exhibition dedicated to the cultural legacy of the Estonian writer and publicist Friedrich Reinhold Kreutzwald. The portal was created as a non-linear environment model for new literary history studies and is actually the starting point of the digitization of all books ever published in Estonian. It is an immense leap forward in the context of the massive digitization of cultural legacy that is taking place nowadays. Currently, the portal gives access to 268 author biographies, more than 10,000 photos and more than 2,000 event descriptions based on newspaper material. More than 300 older fictional works in Estonian are accessible in the e-pub format, and the publicly available text corpora contain 13,808 pages or 24,859,487 characters. The portal is widely used in education: in 2018, we registered around a million clicks monthly (which is almost comparable to the Estonian population, which is slightly over a million people) and around 2,000 unique visitors. We have manually checked and corrected the optical character recognition output, in spite of the large amount of work this entailed. As a result, the portal is the biggest and most accurate literary textual resource portal in Estonia.

4. How can corpus linguistics be applied to the research of cultural and literary history? Why is textual annotation relevant for literary studies? How does CLARIN Estonia help researchers in non-technical fields like literary theory to apply computational methodologies?

With the support of the Estonian government, all types of cultural legacy (printed books, archival documents, etc.) are being massively digitized and made accessible as open data. Consequently, the quantity of texts is growing exponentially. However, the methods that literary scholars use are still the same as those from decades ago: they are mostly based on close reading, which is a time-consuming method with a narrow focus and a lot of limitations for large-scope research. It is not a local problem, as I see the same tendencies at the international level. Literary scholars worldwide already have access to large amounts of data and create new resources themselves, but our vision for textual resources and the possibilities of their usage has not yet reached the level of computational linguists. Linguists have worked with morphologically, syntactically and even semantically tagged resources for decades, and have developed new annotation layers and new research methods to meet new opportunities. That is exactly the challenge literary scholars are facing now. We need to work out proper annotation layers and tagsets to address the content-driven research questions that are in our focus. We need to address the challenges of having simultaneous access to large collections of data where we can, by relying on linguistic information, trace the connections between texts and authors, the developments of literary means, changes in poetics, and so forth. We need the expertise of linguists to develop the theory and practice of annotation. At the same time, we need to learn how to pose new research questions and solve research problems in literary studies and humanities in the digital framework. We already strive to make the materials we work with broadly accessible, and our next step is to enhance their quality for scientific usage.

5. Could you describe how the Estonian Literary Museum collaborates with CLARIN Estonia on the digitisation of textual cultural heritage and its transformation into machine-readable research data?

As a pilot project, we have put together a morphologically tagged corpus out of approximately a thousand pages of handwritten letters by two Estonian writers, Johannes Semper and Johannes Barbarus, from 1910 to 1940. The corpus is publicly available via the Estonian interface of the corpus query system KORP. Our work is described in a paper submitted to DHN2019, titled “Literary Studies Meet Corpus Linguistics: Estonian Pilot Project of Private Letters in KORP” (authors Marin Laak and Kaarel Veskis from Estonian Literary Museum; Olga Gerassimenko, Neeme Kahusk and Kadri Vider from the University of Tartu). We are going to use this corpus to test the possibilities that linguistic annotation opens for the studies of literary content and literary history. Together with our international colleagues, we will discuss how research questions in literary studies relate to KORP collections and the possible adaptations of KORP functionalities for literary scholars at DHN2019, as well as at the Research Data and Humanities conference in 2019.

Estonia is expecting an explosive growth of digital heritage and textual resources. Preparations for massive digitization of cultural heritage started in 2018 as part of the national programme, and the creation of different digital resources is the current priority of Estonian memory institutions. Additionally, our institution already has a lot of digitized contemporary data for life-writing studies. The crucial question for us is how to bridge the gap between the research possibilities offered by contemporary language technologies on the one hand and the ever-increasing volumes of texts and other digital data produced by memory institutions on the other. We therefore need to rethink how the empirical object is defined in literary studies in general and to propose new research questions. The ability to compare text strategies, rhetorical and stylistic patterns in literary, religious and political text corpora should give us new insights into the way ideology, rhetoric and identity presentations interact. To do that, we have to learn to search not only for linguistic patterns but also for the cultural threads in literary texts. Such threads show how ideas and thoughts travel from one text to another and from one period to the next. We need to unite the expertise of literary scholars, linguists and computational experts to make this possible, and we need to organise our textual resources wisely according to their genre, creation period and other metadata. Thankfully, the Estonian CLARIN centre offers the needed expertise for transforming our data into valuable and reliable text resources, which was already achieved in the case of the Kreutzwald’s Century materials and is currently taking place with the Corpus of Estonian Literary Criticism.

My collaboration with CLARIN Estonia is a continuation of my work in the European Union East project CULTOS: Cultural Units of Learning - Tools and Services. I lead the project “Formal and informal networks of literature based on sources of cultural history” and I believe that the new technical opportunities offered by the consortium are helping us advance our research. Our interdisciplinary practical work, which has involved the preparation of a literary corpus for KORP, has been a synergetic team effort, and I have the best hopes for our future work together.

6. Are there any tools and resources provided by the Estonian consortium that you use in your work and you would like to single out as inspiring for other Digital Humanities researchers?

The tool I am currently fascinated by is the corpus query system KORP. We learned a lot about the KORP functionalities, such as flexible search options and statistics. We would love to promote the research possibilities with KORP among our colleagues and adapt KORP functionalities for literary studies. I would love to work on further developments of KORP together with the international community.

7. In your opinion, how can research infrastructures like CLARIN help museums (staff and visitors alike)?

The Estonian Literary Museum is not really a visitor-type museum; rather, it functions as a leading memory institution and research centre. Along with the Centre of Excellence in Estonian Studies, we will benefit from our partnership with CLARIN by being able to rely on CLARIN’s ability to create, maintain and enhance the usability of data collections.

The interview with Marin was conducted in Estonian on 5 December 2018 at Marin’s workplace at the Estonian Literary Museum by Olga Gerassimenko and Kadri Vider. The interview has been translated by Olga Gerassimenko and edited by Darja Fišer and Jakob Lenardič.
