Tour de CLARIN: Interview with Mikel Iruskieta

Submitted by Jakob Lenardič on 22 November 2019

Mikel Iruskieta is a computational linguist who is part of the Ixa Research Group and the Didactics of Language and Literature Department at the University of the Basque Country. He has collaborated with the CLARIN IMPACT CKC K-Centre, which has helped him and his colleagues digitize Basque texts. The interview was conducted via e-mail.

1. Could you briefly describe your academic and research background?

My current research focuses on the didactics and analysis of the Basque language, mostly regarding discourse parsing and the evaluation of discourse structure. For the last five years, I have mainly worked on adapting language technologies for teaching and learning purposes. With that goal, I have created and now co-lead a postgraduate programme in Basque (University Specialist in ICT and Digital Competences in Education, Continuing Education and Language Teaching) and a research group working in Digital Humanities and Education. Our aim is to build a research community that will conduct research and teach in Basque by adopting a critical approach and using language technologies in a pedagogical context. In this postgraduate programme (a summary and student projects can be accessed here), my colleagues and I are developing a new framework of socio-tech pedagogy for Basque that will cover the following topics:

The Basics of Technology and Pedagogy
Formal Education and Technology
Continuing Education and Technology
Language Teaching and Technology Development
Society and Education, Opportunities and Risks of Technology
E-learning: Approaches and Resources, and
Digital Research: Methods and Resources.

2. Does the fact that Basque is a language isolate have any bearing on the development of language tools tailored to it?

The history and current situation of the Basque language are both complex and interesting. Basque has a relatively small community of speakers (751,700 active and 1,185,500 passive speakers), which lives in contact with three powerful language communities: Spanish and French (as official languages in the Basque Country) and English (as a foreign language). It is also not sufficiently supported by official language policies. As a result, Basque is still considered an under-resourced language. In this context, the work of the Ixa research group is highly valuable. They have developed basic resources for Basque (as well as for other languages) which are used by the research community, for example IXApipes (a modular set of NLP tools that provides easy access to NLP technology for several languages and whose components can be picked and swapped as needed) and ANALHITZA (a web service for analyzing Basque, Spanish and English texts that requires no technical experience). Many more basic and advanced tools and resources for Basque can be found on the website of HiTZ: Basque Center for Language Technology.

3. How did you get involved with the IMPACT K-Centre and how did they help you with your research?

I learned about the IMPACT K-Centre when they joined CLARIN. Because I was working on several different digitization projects for Basque and for Spanish, I immediately got in touch with them and asked for their help. Isabel Martinez Sempere, the manager of IMPACT, helped me solve a digitization issue that I encountered when I was analyzing the most frequently occurring words in Pulgarcito, a Cuban children’s magazine from 1919–1920. This magazine consists of very diverse materials, such as drawings and handwritten texts, which are normally very difficult to digitize. I first tried a commercial OCR tool, but the results were very poor. I then got in touch with IMPACT, telling them that I needed good-quality OCR results presented in a machine-readable format like XML. IMPACT promptly responded to my request and managed to digitize the entire magazine within a week, with significantly fewer errors than the commercial OCR tool had produced.

In another project, which was led by the Ixa Group but also involved the Basque Ikastola Schools and the Faculty of Informatics, we had three corpora that contained texts for children aged 4–6. The first corpus is a Basque collection of stories that is used in education. The second is a corpus of old European fairytales, such as Rapunzel, Beauty and the Beast, Sleeping Beauty, and Snow White, which have been translated and adapted into Basque. The third corpus is a modern version of the European fairytales which have been adapted for co-educative purposes, meaning that they are suitable for mixed-gender classrooms.

However, the co-educative modern version wasn't machine-readable, so we asked IMPACT if they could give us an OCR version of this collection. IMPACT were again happy to do so, and their experts extracted all the pages from the corpus and performed OCR with ABBYY FineReader (version SDK 11) on the Basque texts.

4. Can you share any interesting results?

As soon as IMPACT digitized the fairytale corpora, my colleagues and I used the ANALHITZA tool to determine whether the texts in the corpora contained gender-inclusive language from the perspective of the characters’ roles in the narrative. To this end, we analyzed several expressions, such as eder (beautiful), polit (beautiful), gaizto (evil), and indarra (power), which we extracted from the OCRed corpora with Voyant Tools.
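The keyword-in-context (KWIC) extraction described above can be illustrated with a minimal Python sketch. This is not how Voyant Tools or ANALHITZA work internally; the function name `kwic`, the five-token window, and the prefix match on the stem (a rough stand-in for handling Basque inflection, such as indarrak or indarrez) are all assumptions made for illustration.

```python
import string

def kwic(text, stem, window=5):
    """Return (left, term, right) tuples for every token that starts with `stem`.

    Hypothetical helper: whitespace tokenization and prefix matching are
    simplifications, not a real morphological analysis of Basque.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        # Strip surrounding punctuation so that "indarrak," still matches.
        bare = token.strip(string.punctuation).lower()
        if bare.startswith(stem):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Sample sentence taken from the traditional fairytale corpus quoted in this interview.
sample = "eta orduan, braust!, Gretelek bere indar guztiarekin bultzatu zuen sorgina"
for left, term, right in kwic(sample, "indar"):
    print(f"{left} | {term} | {right}")
```

Each hit would still need the manual interpretation step (deciding whether the term refers to a female or a male character), which a sketch like this cannot automate.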

In the traditional fairytale corpus, it turned out that expressions associated with concepts such as beauty and fear (e.g., eder “beautiful”) were almost exclusively used in reference to female characters, while expressions related to concepts such as power (e.g., indarra “strength”) were used to refer to male characters. Such a sharp linguistic division between the two genders in learning materials for very young children reinforces problematic gender dichotomies like the idea that male characters inherently play an “active” and adventurous role in the story, whereas female characters are “passive”, dependent characters associated with concepts such as home but not power.

Let’s give concrete examples from the two corpora. In the traditional fairytales corpus (Table 1), the noun indar (power) and its inflectional variants refer to male characters 4 out of 5 times. By contrast, in the modern co-educative corpus (Table 2), indar is now used 6 of 10 times in reference to female characters, so the usage is almost evenly split between female and male characters, which is desirable if one wants to ensure that the language is used gender-inclusively.

Document	Left	Term	Right	Manual interpretation
7	bat ikusi zuen eta, azkeneko	indarrak	ateraz, haraino joan zen. Hondartza	Male character
5	eta orduan, braust!, Gretelek bere	indar	guztiarekin bultzatu zuen sorgina labe	Female character
5	eta orduan, braust!, Gretelek bere	indar	guztiarekin bultzatu zuen sorgina labe	Male character
7	asko nekatu zen. Ez zuen	indarrik	igerian jarritzeko eta itoko zela	Male character
7	txiki-txiki batzuk ziren. Gulliver	indarka	hasi zen bere burua askatzeko	Male character

Table 1: Usage of the expression indar (power) in the traditional fairytale corpus, where it is associated with male characters in 4 out of 5 cases. For instance, the first KWIC line (bat ikusi zuen eta, azkeneko indarrak ateraz, haraino joan zen), which roughly translates into English as “He saw one other person, and, drawing his last strength, he went on”, describes an action of a male character. By contrast, the second KWIC line (eta orduan, braust!, Gretelek bere indar guztiarekin bultzatu zuen sorgina labe), which roughly translates as “And then, Gretel with all of her strength pushed the witch”, describes the action of a female character.

Document	Left	Term	Right	Manual interpretation
2	Edurne Zuri erabat suspertu zen,	indarrez	bete zen eta bizitza berriari	Female character
4	nagusia zen jadanik; bera, ordea,	indartsua	eta arina zen. Murruetatik gora	Male character
1	bihurri batetik gora hasi zen.	indar	bitxi batek tira egiten zion	Female character
5	atera zuen leihotik eta bere	indar	guztiekin egin zuen garrasi: -Kaixoooooo	Female character
4	zitekeen herrixkara. Aitak ez zituen	indarrak	sobera zituen Ederrak. -Ederki, halaxe	Male character
4	ere handik joan nahi. Neskak	indarrez	estutu zion eskua, eta eskatu	Female character
4	eta gerritik zintzilik zituen giltzak	indarrez	kentzen zizkion bitartean- Eta niri	Female character
4	etorri zenetik, Ederra piztia baino	indartsuago	sentitu zen. Bira egin, eta	Female character
0	hunkituta. Galtza igo zion, zangoak	indartsuak	eta ile ugariz beterik zeuden	Male character
0	jarri, eztarria garbitu eta ahots	indartsuz	esan zuen: -Gustuko zaitut, Monty	Male character

Table 2: Usage of the expression indar (power) in the modern, co-educative corpus, where it is used 6 times with female and 4 times with male characters. For instance, the first KWIC line (Edurne Zuri erabat suspertu zen, indarrez bete zen eta bizitza berriari) roughly translates as “Snow White was completely revived, full of strength (indarrez) and new life”, where the concept of power is associated with Snow White (Edurne Zuri), a female character. The second line (nagusia zen jadanik; bera, ordea, indartsua eta arina zen. Murruetatik gora) roughly translates as “he was already the boss; but he was strong (indartsua) and agile”, where strength is associated with a male character.
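The proportions reported for the two tables can be checked with a short tally. The sketch below simply transcribes the "Manual interpretation" column of Table 1 and Table 2 in row order; the variable names and the `gender_shares` helper are hypothetical and are not part of the original analysis.

```python
from collections import Counter

# Labels transcribed from the "Manual interpretation" column, in row order.
traditional = ["Male", "Female", "Male", "Male", "Male"]            # Table 1
modern = ["Female", "Male", "Female", "Female", "Male",
          "Female", "Female", "Female", "Male", "Male"]             # Table 2

def gender_shares(labels):
    """Return the share of occurrences per label as a fraction of all rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

for name, labels in [("traditional", traditional), ("modern", modern)]:
    shares = gender_shares(labels)
    print(name, {label: f"{share:.0%}" for label, share in shares.items()})
```

With only 5 and 10 occurrences respectively, these shares are descriptive counts rather than statistically robust estimates.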

5. What’s your vision for the future of the IMPACT K-Centre?

Generally speaking, Artificial Intelligence and the work of the IMPACT K-Centre are crucial for exploring the past of humanity. There are many documents which are still not available in a machine-readable form. Making these information sources accessible, analyzing the languages in them, linking objects, data and documents, enriching texts with metadata, and making the resulting virtual corpus accessible or referenceable through decentralised collections will enable a new way to interact with our past, understand the present and plan for the future. Researchers will be able to manage a large amount of data and finish their research in less time, allowing them to devote more of their time to the truly interesting, ground-breaking research questions rather than to non-innovative technical tasks.

As for my more concrete wishes for the future development of the IMPACT infrastructure, an important topic that would be highly valuable to tackle next is handwritten text. For us, this would be particularly valuable because we have a large handwritten learner corpus of Basque annotated with errors (some sample material can be consulted here) by professional language testers who have passed a rigorous ALTE audit. Digitizing handwritten text, however, is notoriously difficult in comparison to printed text. Nevertheless, I think the process can be streamlined with the development of new machine learning techniques. Luckily, IMPACT already provides an enormous amount of OCRed data which could be used to train new models for digitizing handwritten texts.
