Data

CLARIN provides access to digital language data. The datasets cover various dimensions (language, modality, time span, etc.) and are hosted in a distributed way by the CLARIN centres.

CLARIN's language resources and tools can be explored via the individual repositories and via our unified catalogue, the Virtual Language Observatory (VLO).

Virtual Language Observatory

The Virtual Language Observatory (VLO) provides a means of exploring language resources and tools. Its aim is to provide an easy-to-use interface, allowing for a uniform search and discovery process for a large number of resources from a wide variety of domains. Facets make it easy to explore and access available resources, and a powerful query syntax supports more targeted searches. The VLO also makes it easy to review processing options for discovered resources via the Language Resource Switchboard, and to create virtual collections based on search results via the Virtual Collection Registry.

The following example selections and queries offer starting points for exploring:

Resources for spoken French
Corpora with Polish content
All records from the Language Bank of Finland
Searching for a general term: 'slovenian news sentiment'

More information is available in the VLO’s integrated help page.

Repositories

ASV Leipzig

The Leipzig Corpora Collection (LCC) provides over 500 corpora and monolingual dictionaries for more than 250 languages. The project continuously processes freely available text material, mostly obtained through extensive web crawling. It has a strong focus on news material and on text resources for lesser-resourced languages.

Featured example: the German news corpus for 2020 with over 540 million tokens

ARCHE

ARCHE (A Resource Centre for the HumanitiEs) is a service that offers stable and persistent hosting as well as the dissemination of digital research data and resources for the Austrian humanities community.

Featured example: the Austrian Baroque Corpus

Bavarian Archive for Speech Signals

The Bavarian Archive for Speech Signals at the Ludwig-Maximilians-Universität Munich (BAS) provides online access to a large collection of corpora of spoken German and maintains a suite of web services that offer automatic speech annotation and other phonetic tools.

Featured example: Alcohol Language Corpus (BAS ALC), a collection of speech recordings in different speaking styles produced by sober and intoxicated speakers. This corpus can be used to investigate the influence of alcohol on articulation as well as to test detection algorithms intended to prevent driving under the influence.

Berlin-Brandenburg Academy of Sciences and Humanities

The CLARIN centre at the BBAW focuses on historical text corpora (predominantly provided by the 'Deutsches Textarchiv'/German Text Archive, DTA) as well as on lexical resources (e.g. dictionaries provided by the 'Digitales Wörterbuch der Deutschen Sprache'/Digital Dictionary of the German Language, DWDS).

Featured example: the German Text Archive (DTA)

Center of Estonian Language Resources

CELR provides information about and access to more than 120 corpora (text, audio, parallel, annotated, treebanks, modern, historical, fiction, news, web sources, etc.) and more than 90 lexical resources (monolingual, multilingual, dialects, named entities, wordnets, etc.) in Estonian and its dialects, especially Võro.

Featured example: the Place Names Database (KNAB)

Common Language Resources and Technology Infrastructure, Slovenia (CLARIN.SI)

CLARIN.SI is the Slovene national consortium of the European research infrastructure CLARIN. Its goal is to support research communities from the humanities, social sciences and other language-related disciplines by providing language resources and technologies, expertise, and knowledge transfer.

Featured example: ParlaMint - Multilingual comparable corpora of parliamentary debates

Eberhard Karls Universität Tübingen

The CLARIN Center at the University of Tübingen focuses on the long-term preservation and availability of language resources primarily from linguistics and related fields. Resources include linguistically annotated corpora (treebanks), lexical resources (e.g. wordnets and word embeddings), and quantitative research data (e.g. obtained by psycholinguistic experiments).

Featured example: Early New High German Treebank (Referenzkorpus Frühneuhochdeutsch: Baumbank.UP)

Leibniz-Institut für Deutsche Sprache

As an institutional repository, the IDS Repository provides long-term archiving of linguistic resources in the field of German studies. It holds a number of written and spoken language resources, such as the German Reference Corpus (DeReKo) and the Archive of Spoken German (AGD).

Featured example: the German Reference Corpus (DeReKo)

LINDAT/CLARIAH-CZ

LINDAT/CLARIAH-CZ is a distributed node of the Czech CLARIN and DARIAH consortia of thirteen institutions located in Prague, Pilsen and Brno. It runs a certified repository with openly accessible language resources and digital humanities data, tools, and models.

Featured example: Universal Dependencies

MPI for Psycholinguistics

The Language Archive at the Max Planck Institute for Psycholinguistics contains materials on a large variety of languages from around the world, including audio and video recordings, texts, photographs, etc. Its main goal is to provide a unique record of how people around the world speak in everyday family life.

Featured example: Documentation of Endangered Languages (DOBES)

Språkbanken Text

Språkbanken Text forms part of Nationella språkbanken (the National Language Bank), a national e-infrastructure supporting research based on language data. It is the coordinating node of SWE-CLARIN and operates a CLARIN B-centre that provides modern and historical Swedish texts in digital formats, along with language-technology-based text analysis tools for research.

Featured example: the Riksdag's open data

The ILC4CLARIN Centre at the Institute for Computational Linguistics

ILC4CLARIN is the national centre of the CLARIN-IT infrastructure and provides Italian researchers with a repository to preserve and access language resources and tools. Its focus is on Italian and classical languages (Ancient Greek, Latin, Classical Arabic).

Featured example: ItalWordnet

The Language Bank of Finland

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland. FIN-CLARIN helps researchers share language resources. The Language Bank of Finland is the service centre providing language materials and tools for the research community.

Featured example: Suomi24

ZIM Centre for Information Modelling

The ZIM focuses on applied research and development in the area of information and data processing in the humanities. Its repository, GAMS, serves as a central platform for the preservation, management, and publication of research assets and hosts data mainly from the fields of digital scholarly editions, cultural heritage, digital philology and linguistics.

Featured example: The Spectators in the international context

CLARINO Bergen Center

The CLARINO Bergen Center provides access to language data and analysis tools through a repository, a corpus management and search platform, a treebanking platform, and a component metadata editor. It is also integrating access to a Term Portal, a Medieval Nordic Text Archive, and other collections.

Featured example: INESS

PORTULAN CLARIN

The PORTULAN CLARIN repository provides long-term archiving and access for hundreds of language resources, such as language data or tools.

Featured example: CINTIL-DependencyBank

The CLARIN Centre at University of Copenhagen

CLARIN-DK is an infrastructure where researchers can deposit, share and download language-based material, i.e. texts, transcriptions, lexicons, word lists, and audio and video files.

Featured example: The Grundtvig's Works Corpus
