CLARIN Estonia introduces EstNLTK

Submitted by Jakob Lenardič on 26 November 2018

Blog post written by Krista Liin, edited by Darja Fišer and Jakob Lenardič

When working with texts it is often difficult to extract the necessary information, especially if the texts are in a morphologically complex language such as Estonian. To find out which locations or individuals are mentioned in the text, you’d need to perform a full language processing workflow, from tokenization and finding base word forms up to detecting named entities. The next challenge is getting all those steps to work together.

EstNLTK, the Estonian Natural Language Toolkit, brings together previously developed Estonian tools and resources in a common environment, making them easily accessible. The toolkit is a set of Python libraries modelled on NLTK. It provides the following NLP components for the processing and analysis of the Estonian language: tokenization, spelling correction, pronunciation clues for stress and palatalization, the detection of paragraph, sentence and clause boundaries, verb chain tagging (such as ‘oli läinud tooma’ - ‘had gone to bring’), morphological synthesis, named entity recognition, and a WordNet module.

EstNLTK is open source and is available for Linux, macOS and Windows. Anaconda packages are available for researchers who want to use the toolchain as part of the Anaconda data science distribution. EstNLTK can also be used as a Docker image, which allows researchers to skip the installation process and access the toolchain directly from a web browser in a Jupyter notebook (and copy any tutorials to work with). In addition to the Python libraries, parts of EstNLTK can also be accessed as a web service or as a WebLicht service.

Figure 1: Using EstNLTK in a WebLicht workflow for morphological analysis

The documentation that accompanies EstNLTK includes tutorials that cover several NLP tasks, from basic text operations such as finding base word forms (which is not very easy for a morphologically rich language such as Estonian) to more interesting tasks such as mapping time expressions, recognizing named entities, or querying the Estonian WordNet and tagging words in text with their meanings and related synsets. Over the years, several morphological and syntactic analyzers have been created for Estonian, and EstNLTK attempts to incorporate them all. To make it easier to work with large text corpora, EstNLTK has a database module that integrates with Elasticsearch. There are also tutorials on how to use the Estonian Reference Corpus or Wikipedia data in EstNLTK.

Figure 2: Using the different integrated dependency syntax parsers

Although the newer versions of EstNLTK allow researchers to choose among different tools in the language processing workflow, it is also possible to simply use the default options and get the end results. As can be seen in Figure 2, the default option for dependency syntax is the statistical MaltParser model. However, it is also possible to work with the rule-based Constraint Grammar parser EstCG instead.

Figure 3: The NLP tasks performed by the EstNLTK toolkit - Part-of-Speech tagging and Named Entity recognition

Figure 3 shows an example of the standard NLP tasks performed by EstNLTK. In this example, the toolkit is applied to the sentence “Mark Fišel ütles, et Londoni lend, mis pidi täna hommikul kell 4:30 Tallinna saabuma, hilineb mootori starteri rikke tõttu ning peaks Tallinna jõudma ööl vastu homset kell 02:20” (“Mark Fišel said that the flight from London, which was scheduled to land in Tallinn this morning at 4:30, is delayed due to an engine starter malfunction and should arrive in Tallinn tomorrow night at 02:20”). The text formatting chosen here shows lemmas with annotations for persons (red), locations (green), verbs (magenta), nouns (blue) and time expressions (underlined). The example was run on 2018-11-20, so the time values were calculated with respect to that date.
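To make the last point concrete, here is a minimal sketch of what calculating time values with respect to a date means for the two time expressions in the example sentence. It uses only the Python standard library and is not EstNLTK's time expression tagger: the `resolve` helper and the hand-assigned day offsets are assumptions made purely for illustration, and “ööl vastu homset” is read here as the early hours of the following day.

```python
from datetime import date, datetime, time, timedelta


def resolve(anchor: date, day_offset: int, clock: str) -> datetime:
    """Hypothetical helper (not an EstNLTK function): turn a relative day
    offset and a clock time such as "4:30" into an absolute datetime."""
    hour, minute = (int(part) for part in clock.split(":"))
    return datetime.combine(anchor + timedelta(days=day_offset), time(hour, minute))


# The date on which the example was run, as stated in the text.
anchor = date(2018, 11, 20)

# "täna hommikul kell 4:30" (this morning at 4:30): same day as the anchor.
scheduled = resolve(anchor, 0, "4:30")

# "ööl vastu homset kell 02:20" (the night leading into tomorrow, at 02:20):
# assumed here to fall on the day after the anchor.
expected = resolve(anchor, 1, "02:20")

print(scheduled.isoformat())  # 2018-11-20T04:30:00
print(expected.isoformat())   # 2018-11-21T02:20:00
print(expected - scheduled)   # 21:50:00
```

Running the same sentence through such a resolver on a different day would yield different absolute values, which is why the run date matters when interpreting the annotated output.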

EstNLTK is highly interoperable and has been integrated into several widely used applications, such as Feelingstream, which uses it for opinion mining, and the TEXTA toolkit, which relies on its morphological analysis and NER for text mining. The toolkit is fairly robust, and it has also been used to work with non-contemporary texts, such as communal court minute books from the late 19th century, which did not follow modern spelling and were often written in local dialects. Kersti Lust from the National Archives of Estonia, Kadri Muischnek from the Chair of Language Technology at the University of Tartu and several of their colleagues worked together to make the collection of almost 3000 texts from 22 different parishes browsable by annotating it with (standardized) lemmas and named entities, which makes it easier to study the interactions between the people mentioned in the minutes. Although manual correction is still needed, the automatic annotation worked very well, except for the Southern Estonian dialects, which differ a lot from contemporary Estonian, even syntactically.

EstNLTK has been developed under the NPELT programme by Sven Laur and colleagues.

Click here to read more about Tour de CLARIN