New collections and resources available in the VLO

Submitted by e.gorgaini@uu.nl on 12 March 2020

With the Virtual Language Observatory (VLO), CLARIN provides a means of exploring language resources and tools. The content made available through the VLO is provided by CLARIN centres as well as a number of external sources. This content is far from static: three times a week, CLARIN “harvests” resource descriptions (metadata) from these sources to ensure an up-to-date reflection of the status of available language resources, services and tools. New sources are introduced on a regular basis as well. As a result, the number of resources and collections that can be discovered by searching and browsing through the VLO has been steadily expanding over the years.

This post provides an update on the content that can be discovered through the VLO by highlighting a number of interesting recent additions. First, we will illustrate newly selected content aggregated by and retrieved from Europeana. Then we will showcase four centres that have started providing metadata to the VLO in the past half year: the Lund University Humanities Lab, the ZIM Centre for Information Modelling in Graz, the CLARIN centre for Latvian language resources and tools (Riga) and the Speech Synthesis and Recognition Laboratory of UIIP NAS Belarus.

Note that many CLARIN centres and other providers are constantly adding new resources to their repositories and catalogues. So when visiting the VLO, make sure to also look for resources matching your interest beyond the collections highlighted below!


Updated selection from Europeana

By Twan Goosen

Europeana is the European digital platform for cultural heritage. Digitised cultural resources from a wide range of cultural institutions all across Europe are made accessible via Europeana. Using Europeana’s APIs, we have established an integration of a selection of cultural heritage objects relevant to the CLARIN community into the VLO (see this blog post for details). Recently, this selection has been re-evaluated and updated, resulting in a total of 180,500 harvested records - an increase of almost 60,000 records compared to the previous selection. As a result, Europeana is now the largest provider of metadata in terms of individual records in the VLO.

In addition to updates to collections that were already available, about 20 entirely new collections have been added. A few highlights are:

National Library of Serbia: about 600 digitised books, issues of periodicals and other textual resources from various time periods
National Library of Latvia: almost 2000 digitised historical Latvian books
Library of the Alliance Israélite Universelle: over 3000 digitised documents in a variety of languages
Dutch National Library: almost 400 digitised historical children’s books, and over 1000 scanned issues of periodicals of the 20th century Dutch women’s rights movement
Opera del Vocabolario Italiano (CNR): 3500 high quality digitised manuscripts

Most of these resources (except for the digitised manuscripts) are available as PDFs with embedded full text content. Various tools that can process such files are available via the Language Resource Switchboard; the text could also be extracted using freely available tools. All mentioned collections are openly available and free reuse is permitted.

These and many other cultural heritage resources can be browsed and searched via the Virtual Language Observatory.

Lund University Humanities Lab

By Johan Frid

The Lund University Humanities Lab is an interdisciplinary department for research technology and training. It is a CLARIN K- and C-center and it is also a node within the Swe-Clarin infrastructure in Sweden.

The lab's corpus server hosts two sets of corpora: the Lund Corpora, and the Repository and Workspace for Austroasiatic Intangible Heritage (RWAAI). The facility contains a wide variety of data types including audio, video, text, images, eye-movement data, and related materials. Furthermore, it aims to provide a dynamic, digital workspace for contributors where they can store, curate and reuse their collections.

The Lund Corpora archive data collected by researchers from Lund University, ranging from major world languages to lesser-described minority languages. The diverse collections include longitudinal child language studies, adult language acquisition data, Swedish and Estonian dialect surveys, corpora with linked eye-tracking data, and language documentation research.

RWAAI is a unique digital resource preserving multidisciplinary research collections documenting the languages and cultures of communities from the Austroasiatic language family of Mainland Southeast Asia and India. RWAAI’s collections span over half a century of research in fields such as linguistics, anthropology, botany, ethnomusicology, and human ecology. More than 50 predominantly endangered minority languages are currently represented in the collection. RWAAI accepts analogue and digital collections from researchers in any field that focuses on language and/or cultural heritage of Austroasiatic communities. It specializes in the digital preservation of endangered analogue collections.

More information is available on the Lund University Humanities Lab website, or by accessing the Corpus Server directly. These resources can now also be browsed and searched via the Virtual Language Observatory.

ZIM Centre for Information Modelling (Graz)

By Gerlinde Schneider

The Centre for Information Modelling (ZIM) focuses on applied research and development in the area of information and data processing in the humanities, with special emphasis on digital scholarly editing, long-term preservation, digital museology and semantic web technologies. The centre is an important partner and contributor to research projects within the Austrian and international scientific community as well as to non-university organisations such as museums and libraries.

Via its certified repository infrastructure GAMS, the ZIM hosts research data from more than 80 projects, mainly from the fields of digital scholarly editing, cultural heritage and, increasingly, linguistics and digital philology. GAMS serves as a central platform not only for digital preservation and data management of the centre's research assets, but also for publication and data retrieval; it forms the core on which the establishment of a CLARIN B-centre at the University of Graz is based.

The GAMS repository provides access to almost 4000 annotated multilingual periodical volumes from the project The Spectators in the international context. The Spectators are a journalistic genre that originated in England at the beginning of the 18th century, spread rapidly across Europe and became an important feature of the discourse system of the Enlightenment. Currently, versions and analyses of English, French, German, Italian, Spanish and Portuguese texts are provided. They are annotated for simple text structures, for narrative forms, and for named entities such as persons, places and works. All resources can be downloaded in the TEI/XML P5 format under a free license. The Spectators' texts are the first of a series of resources from the GAMS repository that will gradually be made available via the Virtual Language Observatory.
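
For readers who want to work with such downloads, the following is a minimal Python sketch of how named-entity annotations could be counted in a TEI-encoded XML file. It is an illustration only: the sample document is invented, and the assumption that persons and places are marked with the standard TEI elements persName and placeName has not been checked against the actual Spectators files, which may use a different encoding. The function name count_entities is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of TEI P5 documents, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented miniature document; real files from the repository may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Mr. Spectator</persName> walked through
    <placeName>London</placeName> and later met
    <persName>Sir Roger</persName> in <placeName>London</placeName>.</p>
  </body></text>
</TEI>"""


def count_entities(tei_xml, tags=("persName", "placeName")):
    """Return {tag: Counter of entity strings} for the given TEI elements.

    The default tag names are an assumption about the encoding.
    """
    root = ET.fromstring(tei_xml)
    counts = {tag: Counter() for tag in tags}
    for tag in tags:
        for element in root.iter(TEI_NS + tag):
            # Join nested text and collapse line breaks and repeated spaces.
            name = " ".join("".join(element.itertext()).split())
            if name:
                counts[tag][name] += 1
    return counts


if __name__ == "__main__":
    for tag, counter in count_entities(SAMPLE).items():
        print(tag, dict(counter))
```

To apply the sketch to a downloaded file, read its contents into a string and pass it to count_entities, adjusting the tag names to whatever the files actually use.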

CLARIN Centre of Latvian language resources and tools (Riga)

By Inguna Skadiņa

CLARIN-LV - the CLARIN Centre of Latvian language resources and tools - has recently become a C-center in the CLARIN infrastructure. The Center is maintained by the national coordinator - the Artificial Intelligence Laboratory (AiLab) of the Institute of Mathematics and Computer Science (IMCS), at the University of Latvia (UL). The laboratory has been conducting research on natural language processing and has been providing access to different language resources for about 30 years.

The first collection of language resources in the CLARIN-LV repository is the Language Resources and Tools (LRT) of the AiLab IMCS UL. The collection demonstrates the variety of Latvian LRTs through the inclusion of:

Monolingual (e.g., LVK 2018 - the largest balanced corpus of modern Latvian) and multilingual (e.g., Lithuanian-Latvian parallel corpus) corpora;
Text and audio corpora (e.g., annotated longitudinal corpus of Latvian children's language);
Lexicons (e.g., tezaurs.lv);
Browsable and downloadable resources;
Tools for language processing (e.g., NLP-pipe, for details see also article in Tour de Clarin).

Although most LRTs are developed for Latvian, our collection also includes the Latgalian language corpus.

The aim of CLARIN-LV for 2020 is to extend the current collection of AiLab IMCS UL language resources, as well as to add new collections of other language resources developed by different stakeholders in Latvia. The CLARIN-LV language resources and tools can now also be browsed and searched via the Virtual Language Observatory.

Speech Synthesis and Recognition Laboratory of UIIP NAS Belarus (Minsk)

By Yuras Hetsevich

The Speech Synthesis and Recognition Laboratory of UIIP NAS Belarus works in the fields of text and speech processing on the basis of human-human, human-machine and machine-machine communications. The Lab has expertise in building systems for stationary, mobile and web-based platforms for the Belarusian, Russian and English languages.

A dedicated platform has been developed and is continuously updated to provide users with a set of 50+ tools (services) for text, voice and other data processing. The services are grouped into thematic domains for more convenient use in specific fields of application.

The approach taken for each service is simple: the user can run the service with a single click, after which the test input data is processed and the results are shown. The user is then invited to enter their own data and adjust the settings before running the tool again. This approach helps students and researchers to get up to speed and test new hypotheses faster.

The platform provides tools for tokenization, morphological analysis, part-of-speech tagging, frequency counting, spell checking and text-to-speech, as well as a voiced electronic grammatical dictionary and many others.
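
To give a rough idea of what two of these task types involve, here is a minimal Python sketch of tokenization followed by frequency counting. It is a generic illustration and not the laboratory's implementation: the regular expression, the lowercasing step and the function names tokenize and word_frequencies are all assumptions made for this example.

```python
import re
from collections import Counter

# \w+ matches runs of Unicode letters, digits and underscores; the optional
# group keeps apostrophe-joined forms such as "don't" together as one token.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(text):
    """Split text into lowercased word tokens, dropping punctuation."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    sample = "Мова і мова. Don't stop!"
    print(tokenize(sample))
    print(word_frequencies(sample).most_common(3))
```

A production tokenizer for Belarusian or Russian would need language-specific rules (abbreviations, hyphenated forms, numerals) that this sketch deliberately leaves out.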

The lab has recently started creating metadata for all of its online resources, which means that some of the services are now available via the VLO. All services can also be accessed through the platform directly. More information is available on the Speech Synthesis and Recognition Laboratory of UIIP NAS Belarus website.

This blog post was written by: Johan Frid (Lund University), Twan Goosen (CLARIN ERIC), Yuras Hetsevich (National Academy of Sciences of Belarus), Gerlinde Schneider (University of Graz), Inguna Skadiņa (Institute of Mathematics and Computer Science, University of Latvia)