Tour de CLARIN: CLARIN LINDAT introduces UDPipe

Submitted by karolina@clarin.eu on 14 May 2018

Blog post written by Barbora Hladka and Jakob Lenardič

UDPipe is a state-of-the-art pipeline that performs several complex annotation tasks: sentence segmentation, tokenisation, Part-of-Speech tagging, lemmatisation and dependency parsing, all to a high degree of precision. The architecture of UDPipe employs a deep neural network, and its language models are trained on the Universal Dependencies treebanks, which are distributed through LINDAT. UDPipe can be used to annotate and parse texts in over 50 languages, among them Irish and non-Indo-European languages such as Arabic, Indonesian and Tamil. It is developed at the Institute of Formal and Applied Linguistics at Charles University and can be freely used for non-commercial purposes.

UDPipe is available as a downloadable program compatible with Linux, Windows and OS X, as a library for programming languages such as C++, Python, Perl, R, Java and C#, and as an easy-to-use web application. Researchers who wish to run UDPipe as a standalone program on their own computers must also download a language model from one of the following model sets, which are described in detail in the UDPipe User's Manual:

the Universal Dependencies 1.2 Models, which contain cross-linguistically consistent treebank annotation models for 33 languages,
the Universal Dependencies 2.0 Models, which are an updated version of the former and contain annotation models for over 50 languages, and
the CoNLL17 Shared Task Baseline UD 2.0 Models, which contain a different version of the Universal Dependencies 2.0 models.
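
For readers who prefer the library route, the following is a minimal Python sketch of how a downloaded model might be used. It assumes that the `ufal.udpipe` bindings are installed and that a model file has been downloaded beforehand; the function name `annotate` and any model path passed to it are our own illustrative choices, not part of UDPipe.

```python
def annotate(text, model_path):
    """Run tokenisation, tagging and parsing on raw text; return CoNLL-U."""
    # Imported here so that the file can be loaded even without the bindings.
    from ufal.udpipe import Model, Pipeline, ProcessingError

    # Model.load returns None when the model file cannot be read.
    model = Model.load(model_path)
    if model is None:
        raise FileNotFoundError(f"Cannot load UDPipe model from {model_path!r}")

    # "tokenize" asks UDPipe to segment and tokenise raw text itself;
    # Pipeline.DEFAULT keeps the model's default tagger and parser settings;
    # "conllu" selects CoNLL-U as the output format.
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

    error = ProcessingError()
    conllu = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return conllu
```

Calling `annotate("John is very happy to have met Mary.", "path/to/english-model.udpipe")`, where the path is a placeholder for whichever model was downloaded, should return the annotated sentence as CoNLL-U text.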

The UDPipe Web Application is provided through the LINDAT infrastructure. It is very easy to use: researchers need only select a language from one of the three model sets and input the text (or upload whole files) they wish to have annotated. The results can be visualised either as a tree structure, which shows the syntactic dependencies (Figure 1), or in table form, where each individual word is accompanied by its Part-of-Speech label as well as a more complex set of grammatical features, such as case, person, gender and tense (Figure 2).
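
Both views are drawn from the same underlying annotation, which UDPipe can output in the tab-separated CoNLL-U format with ten columns per word. The sketch below shows how such output might be read back into a table of words and a list of dependency edges. The sample annotation is hand-written for illustration and is not actual UDPipe output; the function names `parse_conllu` and `dependency_edges` are our own.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hand-written, illustrative annotation; NOT actual UDPipe output.
SAMPLE = """\
# text = John is happy.
1\tJohn\tJohn\tPROPN\tNNP\tNumber=Sing\t3\tnsubj\t_\t_
2\tis\tbe\tAUX\tVBZ\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t3\tcop\t_\t_
3\thappy\thappy\tADJ\tJJ\tDegree=Pos\t0\troot\t_\t_
4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\tSpaceAfter=No
"""


def parse_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comments
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"Expected 10 columns, got {len(columns)}: {line!r}")
        token = dict(zip(FIELDS, columns))
        if not token["id"].isdigit():
            continue                  # skip multiword ranges (1-2) and empty nodes (1.1)
        token["id"] = int(token["id"])
        token["head"] = int(token["head"]) if token["head"].isdigit() else None
        token["feats"] = ({} if token["feats"] == "_" else
                          dict(pair.split("=", 1) for pair in token["feats"].split("|")))
        current.append(token)
    if current:                       # input may not end with a blank line
        sentences.append(current)
    return sentences


def dependency_edges(sentence):
    """Return (head form, relation, dependent form) triples for one sentence."""
    forms = {token["id"]: token["form"] for token in sentence}
    forms[0] = "ROOT"                 # head 0 marks the root of the tree
    return [(forms.get(token["head"], "?"), token["deprel"], token["form"])
            for token in sentence]
```

The token dictionaries correspond to the rows of the table view, and the edges returned by `dependency_edges` correspond to the arcs of the tree view.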

The flexibility of UDPipe has recently been demonstrated in the CoNLL 2017 shared task, which was of crucial importance for the development and research of dependency parsing. In the shared task, UDPipe was used to process raw text in 40+ languages on the basis of the Universal Dependencies models with very high precision, which shows that UDPipe can also be easily adapted to annotate and parse new languages. The CoNLL 2018 shared task is a follow-up to the CoNLL 2017 task, and UDPipe is used as its baseline system.

For more details on UDPipe, see Straka and Straková (2017) and Straka et al. (2016):

Milan Straka and Jana Straková. Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, August 2017.
Milan Straka, Jan Hajič and Jana Straková. UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016.

Figure 1: The tree structure of a complex English raising construction. Apart from visualising the sentential structure, the tree structure also shows the parts of speech and syntactic features of the constituents.

Figure 2: UDPipe shows the grammatical features of the sentence "John is very happy to have met Mary" in table form. Note that it is able to detect very complex features, such as the perfect (i.e. past tense) use of the infinitive in the subordinate clause.

Click here to read more about Tour de CLARIN.