Blog authors: Han Sloetjes and Marie Hinrichs
The most recent version of the multimedia annotation tool ELAN offers a new level of integration with WebLicht, a framework developed as part of the CLARIN infrastructure. As a desktop tool for manual annotation of audio and video recordings, ELAN allows users to perform common tasks such as segmenting, labelling, transcribing and translating. The tool offers what is needed to add, for example, part-of-speech annotations by hand, but in many cases this is far too time-consuming (and therefore far too expensive). WebLicht, on the other hand, is a web-based framework for building and executing processing chains. The registered NLP tools in a chain can be used to automatically enrich the (mostly textual) input with lemmas, part-of-speech tags and so on.
Why could interaction with WebLicht be of interest to ELAN users?
In many projects, creating the segmentation and a basic transcription for a set of recordings consumes a considerable part of the budget, so that other levels of annotation cannot even be considered. It is of course possible to export the transcription text, feed it to an NLP tool (or a chain of tools) and further process the results in some way. The advantage of the integrated approach is that the time alignment of the annotations (e.g. sentences) is maintained when the additional layers of tokens and part-of-speech tags are added. Part-of-speech tagging and lemmatization are of interest to users who work with a language that is supported by at least one of the WebLicht tools.
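The point about time alignment can be illustrated with a small sketch. It is not ELAN's implementation; the class and function names (Sentence, Token, attach_tokens) are hypothetical and chosen only for illustration. The idea shown is that tokens returned by a tagger are stored as children of the sentence annotation they came from, so they keep that sentence's time span instead of losing it in an export and re-import round trip.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sentence:
    """A time-aligned annotation (hypothetical model, not ELAN's own classes)."""
    start_ms: int
    end_ms: int
    text: str


@dataclass(frozen=True)
class Token:
    """A token that refers back to the sentence it was derived from."""
    parent: Sentence
    index: int
    text: str
    pos: str

    @property
    def time_span(self) -> Tuple[int, int]:
        # The token has no times of its own; it inherits the parent's interval.
        return (self.parent.start_ms, self.parent.end_ms)


def attach_tokens(sentence: Sentence, tagged: List[Tuple[str, str]]) -> List[Token]:
    """Turn (word, tag) pairs from a tagger into child annotations of the sentence."""
    return [Token(sentence, i, word, pos) for i, (word, pos) in enumerate(tagged)]


sentence = Sentence(start_ms=1200, end_ms=2850, text="the dog barks")
tokens = attach_tokens(sentence, [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")])
```

Every token in this sketch can still be located in the recording through its parent sentence, which is the property that is lost when the text is processed outside the annotation document.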
How does it work behind the scenes?
ELAN currently uses the harvesting service provided by WebLicht to acquire the list of available services. Depending on information provided by the annotator, the list is filtered (roughly: either a list of tokenizers or a list of part-of-speech taggers and lemmatizers is shown) and presented to the user. After a service has been selected, ELAN sends the data to that service, either as plain text or in WebLicht’s Text Corpus Format (TCF), and waits for the response, which is returned in TCF. The output is then converted to tiers and annotations and added to the annotation document. There are some limitations on the types of tiers in ELAN that can be used as input, and the filtering of services is based on the type of data normally annotated in ELAN.
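The filtering step described above could look roughly like the following sketch. It assumes a simplified, hypothetical service record (Service, with a kind and a list of languages) rather than the actual metadata returned by the WebLicht harvesting service, and the service names are invented. The language filter is an added assumption; the text above only states that the list is narrowed to tokenizers or to taggers and lemmatizers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Service:
    """Hypothetical, simplified description of a harvested service."""
    name: str
    kind: str                   # "tokenizer", "pos-tagger" or "lemmatizer"
    languages: Tuple[str, ...]  # language codes the service claims to support


def filter_services(services: List[Service],
                    input_is_tokenized: bool,
                    language: str) -> List[Service]:
    """Show tokenizers for untokenized input, taggers and lemmatizers otherwise."""
    wanted = {"pos-tagger", "lemmatizer"} if input_is_tokenized else {"tokenizer"}
    return [s for s in services if s.kind in wanted and language in s.languages]


# Invented example entries, standing in for the harvested list.
services = [
    Service("example-tokenizer", "tokenizer", ("de", "en")),
    Service("example-tagger", "pos-tagger", ("de",)),
    Service("example-lemmatizer", "lemmatizer", ("en",)),
]

for_sentences = filter_services(services, input_is_tokenized=False, language="de")
for_tokens = filter_services(services, input_is_tokenized=True, language="de")
```

Here for_sentences contains only the tokenizer and for_tokens only the German tagger, mirroring the two lists the annotator would be offered depending on the kind of tier selected as input.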
What could be next?
The types of tiers that can serve as input could be extended to make the interaction more flexible. ELAN might also connect to the one WebLicht service that accepts a tool chain as input (rather than connecting directly to individual tool services), so that centralized statistics on the use of the different tools are supported.
Read on about ELAN's integration with CLARIN webservices:
- Using WebLicht directly from ELAN (from the ELAN manual)
- Using WebMAUS directly from ELAN for automated segmentation and labelling (paper)