Tour de CLARIN: CLARIN LINDAT introduces UDPipe

Submitted by karolina@clarin.eu on

Blog post written by Barbora Hladka and Jakob Lenardič


UDPipe is a state-of-the-art pipeline that performs several complex annotation tasks to a high degree of precision: sentence segmentation, tokenisation, Part-of-Speech tagging, lemmatisation and dependency parsing. Its architecture employs a deep neural network, and its language models are trained on the Universal Dependencies treebanks available from LINDAT. UDPipe can be used to annotate and parse texts in over 50 languages, many of which are non-Indo-European, such as Arabic, Indonesian and Tamil. It is developed at the Institute of Formal and Applied Linguistics at Charles University and can be freely used for non-commercial purposes.

UDPipe is available as a downloadable program compatible with Linux, Windows and OS X, as a library for programming languages such as C++, Python, Perl, R, Java and C#, and as an easy-to-use web application. Researchers who wish to run UDPipe as a standalone program on their own computers must also download one of the Universal Dependencies language models, which are described in detail in the UDPipe User's Manual.
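For readers who prefer the Python library, the following is a minimal sketch of how a text might be annotated with the `ufal.udpipe` binding. It assumes the package is installed (for example with `pip install ufal.udpipe`) and that a language model has already been downloaded; the model path in the usage comment is a hypothetical placeholder, not a real file name.

```python
def annotate(text, model_path):
    """Run tokenisation, tagging and parsing on `text`; return CoNLL-U output.

    `model_path` is assumed to point to a UDPipe model file that was
    downloaded separately.
    """
    # Imported here so that the module loads even where the binding is missing.
    from ufal.udpipe import Model, Pipeline, ProcessingError

    model = Model.load(model_path)
    if model is None:
        raise RuntimeError("Cannot load UDPipe model from " + model_path)

    # "tokenize" reads raw text; DEFAULT selects the model's own tagger
    # and parser settings; "conllu" is the output format.
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT,
                        Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    output = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return output


# Hypothetical usage, with a placeholder path:
# print(annotate("John is very happy to have met Mary.", "path/to/model.udpipe"))
```

The function returns the annotation as a CoNLL-U string, one token per line, which can then be saved or processed further.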

The UDPipe Web Application is provided through the LINDAT infrastructure. It is easy to use: researchers need only select one of the many languages in one of the three training models and input the text (or upload whole files) they wish to have annotated. The results can be visualised either as a tree structure, which shows the syntactic dependencies (Figure 1), or in table form, where each individual word is accompanied by its Part-of-Speech label as well as a more complex set of grammatical features, such as case, person, gender, and tense (Figure 2).
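The table and tree views correspond to columns of the CoNLL-U format that UDPipe produces. The sketch below shows one way such output could be read in Python. The sample is hand-written for illustration and is not actual UDPipe output, and the helper names `parse_conllu` and `dependents` are our own. For brevity, the parser treats its input as a single sentence and assumes every token has a numeric head.

```python
# Hand-written CoNLL-U sample for illustration only (not real UDPipe output).
SAMPLE = (
    "# text = John is happy.\n"
    "1\tJohn\tJohn\tPROPN\t_\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "2\tis\tbe\tAUX\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t3\tcop\t_\t_\n"
    "3\thappy\thappy\tADJ\t_\tDegree=Pos\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return the tokens of a single-sentence CoNLL-U string as dicts."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        token = dict(zip(FIELDS, cols))
        feats = token["feats"]
        # Turn "Case=Nom|Number=Sing" into a dict; "_" means no features.
        token["feats"] = ({} if feats == "_" else
                          dict(pair.split("=", 1) for pair in feats.split("|")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


def dependents(tokens):
    """Map each head id (0 is the root) to the word forms that depend on it."""
    tree = {}
    for token in tokens:
        tree.setdefault(token["head"], []).append(token["form"])
    return tree


if __name__ == "__main__":
    rows = parse_conllu(SAMPLE)
    for row in rows:  # table view: word, lemma, POS, features
        print(row["form"], row["lemma"], row["upos"], row["feats"], sep="\t")
    print(dependents(rows))  # tree view: head id -> dependent words
```

The first loop mirrors the table view, with one row per word, while `dependents` recovers the head-to-dependent relations that the tree view draws.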

The flexibility of UDPipe was recently demonstrated in the CoNLL 2017 shared task, which was of crucial importance for research on and development of dependency parsing. In the shared task, UDPipe was used to process raw text in 40+ languages based on the Universal Dependencies models with very high precision, which shows that UDPipe can also be easily adapted to annotate and parse new languages. The CoNLL 2018 shared task is a follow-up to CoNLL 2017, and UDPipe is used as its baseline system.

For more details on UDPipe, see Straka and Straková (2017) and Straka et al. (2016).

Figure 1: The tree structure of a complex English raising construction. Apart from visualising the sentential structure, the tree structure also shows the parts of speech and syntactic features of the constituents.

 

Figure 2: UDPipe shows the grammatical features of the sentence "John is very happy to have met Mary" in table form. Note that it is able to detect very complex features, such as the perfect (i.e. past tense) use of the infinitive in the subordinate clause.

 

