According to the linguist Geoffrey Leech (2002), a "reference corpus is designed to provide comprehensive information about the language […] It has to be a general corpus of wide coverage of the language, and hopefully it will be treated by its user community as some kind of “standard” for the language." Reference corpora thus contrast with specialised corpus families (e.g., parliamentary corpora or corpora of computer-mediated communication) in that they aim for comprehensive genre coverage, typically sampling a diverse set of primarily written genres.

The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation.

Reference corpora in the CLARIN infrastructure

Corpus Language Description Availability

AbNC: Abkhaz National Corpus

Size: 10 million words 
Annotation: MSD-tagged, lemmatized 


This corpus includes Abkhaz texts published between 1920 and 2016. The corpus is encoded in .

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO distribution).

For the relevant publication, see Meurer (2018)


Bulgarian National Reference Corpus (BNRC)

Size: 70 million tokens 
Annotation: tokenized, PoS-tagged 
Licence: Individual terms of agreement


This corpus includes Bulgarian texts taken from news media, literature, and administrative documents between 1997 and 2002.

The tokenised corpus is available through WebCLaRK, while the PoS-tagged version is available only upon request.

For the relevant publication, see Simov et al. (2004)


Croatian language corpus Riznica 0.1

Size: 101.8 million tokens, 85.3 million words, 4.7 million sentences, 14,781 texts 
Annotation: sentence segmented, PoS-tagged, lemmatized 
Licence: CC BY-NC-SA 4.0


This corpus includes Croatian texts taken from fiction (28%) and specialised texts (72%).

The corpus is available for online browsing via noSketch Engine and KonText and for download from the CLARIN.SI repository.

For the relevant publication, see Ćavar and Brozović Rončević (2012)




Croatian National Corpus

Size: 101 million tokens


This corpus includes Croatian texts taken from newspapers, magazines, popular texts, and fiction.

The corpus is available for online browsing through the noSketch Engine.

For the relevant publication, see Tadić (2002)


SYN2005: balanced corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)


This corpus includes Czech texts published between 2000 and 2004. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)



SYN2010: balanced corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)


This corpus includes Czech fiction, professional literature, newspapers, and other texts published between 2005 and 2009. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)



SYN2015: representative corpus of written Czech

Size: 100 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Czech National Corpus (Shuffled Corpus Data)


This corpus includes Czech fiction, professional literature, newspapers, and other texts published between 2010 and 2014. The corpus is encoded in XML.

The corpus is available for online browsing through the KonText concordancer and can be downloaded from the LINDAT repository.

For the relevant publication, see Hnátková et al. (2014)



DK-CLARIN Reference Corpus of General Danish

Size: 45.1 million words 
Annotation: PoS-tagged, sentence and paragraph segmentation, lemmatized 


This corpus includes Danish texts published between 2008 and 2011.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source and year of publication.

The corpus is available for download from the CLARIN-DK repository.



Size: 500 million words 
Annotation: PoS-tagged, lemmatized, named entities; coreference annotation and annotation of spatial and temporal relations for the manually annotated SoNaR-1 subset 
Licence: Terms of Agreement


This corpus includes representative Dutch texts (fiction, brochures, magazines, legal texts, newspapers, parliamentary proceedings, and computer-mediated communication).

Aside from written materials, the corpus also contains transcriptions of spoken language. The corpus is encoded in FoLiA.

The corpus is available for online browsing through the OpenSONAR concordancer and can be downloaded from the Dutch Language Institute (CLARIAH-NL).


Download subset 1

Download subset 2

Corpus of Contemporary American English – Kielipankki version

Size: 440 million words, 190,000 texts 
Annotation: PoS-tagged, lemmatized 
Licence: CLARIN ACA (online version), CLARIN RES (downloadable version)

English (American)

This corpus includes American English texts evenly divided into the spoken, fiction, magazine, newspaper, and academic genres (around 88 million words each) published between 1990 and 2012.

The corpus is available for download from the Finnish Language Bank as well as for online browsing through the concordancer Korp (FIN-CLARIN distribution).



British National Corpus

Size: 100 million words 
Annotation: PoS-tagged, lemmatized 
Licence: BNC User Licence (restricted for the downloadable version)

English (British)

This corpus includes English texts (fiction, magazines, newspapers, and academic writing) published between 1980 and 1993.

The corpus is encoded in TEI. Non-linguistic metadata include contextual and bibliographic information. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer and can be downloaded from the Oxford Text Archive (CLARIN-UK).



Estonian National Corpus 2019

Size: 1.5 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-SA


This corpus includes Estonian texts published between 1990 and 2019. Among other subcorpora, it contains the Estonian Reference Corpus.

The corpus is available for download (CELR distribution).


Estonian Reference Corpus

Size: 175 million words 
Annotation: MSD-tagged, lemmatized 
Licence: free for non-commercial use


This corpus includes Estonian texts (fiction, PhD theses, newspapers, magazines, parliamentary transcriptions, computer-mediated communication) published between 1990 and 2007. The corpus is encoded in TEI.

The corpus is available for online browsing through a dedicated concordancer and is available for download from CELR.




Size: 31.7 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-SA


This corpus includes German texts in a wide variety of genres published from 1947 onwards. Non-linguistic metadata include rich bibliographic information and partial layout information.

Part of the corpus is available for download from a dedicated webpage (CLARIN-D distribution), while the entire corpus can be queried online through the COSMAS II platform.

For the relevant publication, see Kupietz et al. (2018)



Corpus of Greek Texts

Size: 27.6 million words 
Licence: CC-BY-NC, ACA


This corpus includes representative Greek texts published between 1990 and 2010. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Goutsos (2010)


Diachronic corpus of Greek of the 20th century

Size: 20 million words 
Licence: CC BY-NC


This corpus includes Greek texts published in the 20th century.

The corpus is available for download from CLARIN:EL.


Hellenic National Corpus

Size: 47 million words 
Annotation: sentence segmented 
Licence: proprietary


This corpus includes Greek texts published from 1990 onwards.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Gavrilidou (2002)


Hungarian National Corpus

Size: 190 million tokens 
Annotation: PoS-tagged 
Licence: free after registration


This corpus includes Hungarian texts (newspapers, literature, scientific articles, official and personal documents).

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Váradi (2002)


The Icelandic Gigaword Corpus

Size: 1.9 billion words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY and a special user licence


This corpus includes Icelandic texts (newspapers, parliamentary proceedings, adjudications, fiction and non-fiction) published until 2017.

The corpus is encoded in TEI. Non-linguistic metadata include bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing and download through CLARIN-IS (in two subsets, each with its own licence).

For the relevant publication, see Steingrímsson et al. (2018)


Download subset 1

Download subset 2

Balanced Corpus of Modern Latvian (LVK2022)

Size: 122.9 million tokens 
Annotation: MSD-tagged, lemmatized


This corpus includes texts from journalism, fiction, science, Wikipedia, legal documents, parliamentary transcripts, and subtitles.

The corpus is available for online browsing through the noSketch Engine concordancer.


Corpus of the Contemporary Lithuanian Language

Size: 208.4 million tokens 
Annotation: MSD-tagged, lemmatized 


This corpus includes Lithuanian texts (mostly newspapers but also fiction, non-fiction, and specialised magazines) published between 1990 and 2008.

The corpus is encoded in TEI. Non-linguistic metadata includes bibliographic information. Aside from written materials, the corpus also contains transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.


The Lexicographic Corpus for Norwegian Bokmål (LBK)

Size: 100 million tokens 
Annotation: PoS-tagged, lemmatized 

Norwegian (Bokmål)

This corpus includes representative Norwegian (Bokmål) texts (newspapers and periodicals, non-fiction, fiction, TV subtitles, and small print) published between 1985 and 2013.

The corpus is available for online browsing through the concordancer Glossa (CLARINO).

For the relevant publication, see Lain Knudsen and Vatvedt Fjeld (2013)


Norsk Ordboks Nynorskkorpus (NNK)

Size: 107.8 million words 
Annotation: MSD-tagged, lemmatized 

Norwegian (Nynorsk)

This corpus includes representative Norwegian (Nynorsk) texts published between 1866 and 2012. The corpus is encoded in XML.

The corpus is available for online browsing through the Corpuscle concordancer (CLARINO).


National Corpus of Polish

Size: 1.8 billion tokens 
Annotation: MSD-tagged, lemmatized


This is a written and spoken corpus that includes representative Polish texts published between 1945 and 2010.

The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author. Aside from written materials, the corpus also includes transcriptions of spoken language.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Przepiórkowski et al. (2012)


Corpus of combined Slovenian corpora metaFida 1.0

Size: 6 billion tokens 
Annotation: MSD-tagged (MULTEXT-East), lemmatised, normalised 
Licence: various


This corpus combines a number of existing Slovenian corpora available through the CLARIN.SI concordancers and thus provides a unified search across all of them. metaFida contains over 4.7 billion words (6 billion tokens) from 15 million texts published between 1584 and 2022, drawn from 34 corpora.

metaFida retains only the information that is common to most of the included corpora. The structure is kept shallow (text and paragraph only), which makes it easier to create subcorpora or to limit a search to individual text types. All metaFida positional attributes (word, normalised form, lemma, MULTEXT-East MSD in Slovenian and English) are considered to have multiple values, separated by a space.

Concordancer (noSketchEngine)

Concordancer (KonText)
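The multi-valued positional attributes mentioned in the metaFida description can be illustrated with a minimal Python sketch. It assumes a token has already been read into a dictionary of attribute strings; the attribute names (`norm`, `msd_en`), the helper functions, and the example token are hypothetical and are not taken from the corpus or from any CLARIN.SI tool.

```python
def split_multivalue(token, sep=" "):
    """Return a copy of `token` in which every attribute string
    is split into a list of its individual values."""
    return {name: value.split(sep) for name, value in token.items()}


def has_value(token, attribute, wanted, sep=" "):
    """Check whether `wanted` is one of the values of `attribute`."""
    return wanted in token[attribute].split(sep)


# Hypothetical token: the MSD attribute carries two space-separated values.
token = {"word": "Kje", "norm": "kje", "lemma": "kje", "msd_en": "Rgp Cs"}

values = split_multivalue(token)
print(values["msd_en"])                  # the two MSD values as a list
print(has_value(token, "msd_en", "Cs"))  # membership test on one value
```

Treating each attribute as a list rather than a single string means a search for one value is not defeated by the presence of a second value on the same token.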


Spoken corpus Gos 2.0

Size: 1,534 texts; 127,604 utterances; 2,462,368 words 
Annotation: PoS-tagged, lemmatised, phonetically and orthographically transcribed 
Licence: CC BY-SA 4.0


This corpus contains transcripts of radio and TV shows, school lessons, private conversations, and business meetings. It is composed of three sources: Spoken corpus Gos 1.1 (112 hours, 1 million words), Spoken corpus Gos VideoLectures 4.2 (22 hours, 179,000 words), and a selection from the ASR database ARTUR 1.0 (185 hours, 1.2 million words).

The corpus is available for download from CLARIN.SI and for online browsing through a dedicated web concordancer.

For the relevant publication, see Verdonik and Zwitter-Vitez (2011)

Concordancer (noSketchEngine)

Concordancer (KonText)


Written corpus ccGigafida 1.0

Size: 126.9 million tokens, 103.2 million words, 31,722 texts 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY-NC-SA 4.0


This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author.

This corpus is a downloadable subset of the representative Gigafida corpus (version 1). It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar (2012)


Written corpus ccKres 1.0

Size: 12.2 million tokens, 9.8 million words 
Annotation: MSD-tagged, lemmatized 
Licence: CC-BY


This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011. The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author.

This corpus is a downloadable subset of the balanced Kres corpus. It can be downloaded from the CLARIN.SI repository.

For the relevant publication, see Erjavec and Logar (2012)


Written corpus Gigafida 2.0

Size: 1.3 billion tokens, 1.1 billion words, 38,310 texts 
Annotation: MSD-tagged, lemmatized 
Licence: Individual terms of agreement


This corpus includes representative Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2018. 

The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author.

The corpus is available for online browsing through the noSketch Engine concordancer (CLARIN.SI distribution), as well as through a dedicated search engine.

For the relevant publication, see Krek et al. (2018)



Written corpus Kres 1.0

Size: 99 million words 
Annotation: MSD-tagged, lemmatized 
Licence: Individual terms of agreement


This corpus includes balanced Slovenian texts (newspapers, magazines, computer-mediated communication, fiction and non-fiction) published between 1990 and 2011.

This corpus is a balanced subset of the representative Gigafida corpus (version 1). The corpus is encoded in TEI. Non-linguistic metadata includes information on source, year of publication, text type, title, and author.

The corpus is available for online browsing through a dedicated concordancer.

For the relevant publication, see Krek et al. (2018)


CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh

Size: 11 million words 
Licence: CC BY-NC-SA 4.0


This corpus contains spoken, written and digital (e-language) Welsh. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

The corpus is available for online browsing through a dedicated webpage and by request.

For the relevant publication, see Knight et al. (2020)




