
Historical Corpora

The CLARIN infrastructure offers access to more than 80 historical corpora, covering almost all of the languages spoken in countries that are either members or observers in CLARIN ERIC. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content, or the inclusion of new corpora, send us an email (resource-families [at]).


Historical corpora in the CLARIN infrastructure

Monolingual corpora

Corpus Language Description Availability

Open Richly Annotated Cuneiform Corpus, Korp Version

Size: 1,600,563 tokens 
Annotation: tokenised, lemmatised, PoS-tagged, semantically annotated 
Licence: CC-BY-SA


This corpus contains cuneiform texts from ancient history.

The texts come from the Oracc project and include collections such as the Corpus of Ancient Mesopotamian Scholarship, The Digital Corpus of Cuneiform Lexical Texts, and Royal Inscriptions of Babylonia online.

The corpus is available through the concordancer Korp and for download from the repository of FIN-CLARIN.



Greek Medieval Texts

Size: 3.4 million words 
Licence: CC-BY

Ancient Greek

This corpus contains texts from the 4th to the 16th century.

The texts belong to the following categories: religious, poetical-literary, political, and historical texts, as well as hymns and epigrams.

The corpus is available for download from the clarin:el repository.


The Diorisis Ancient Greek Corpus

Size: 10.2 million words 
Annotation: PoS-tagged, lemmatised 
Licence: CC BY 4.0

Ancient Greek

This corpus consists of 820 texts spanning from the beginnings of the Ancient Greek literary tradition (Homer) to the fifth century AD.

The texts are sourced from the Perseus Canonical Greek Lit Repository, "The Little Sailing" digital library, and the Bibliotheca Augustana digital library.

The corpus is available for download from Figshare.

For the relevant publication, see Vatri and McGillivray (2018).


Sheffield Corpus of Chinese

Size: 148,876 words 
Annotation: no annotation 
Licence: CC-BY-NC-SA 3.0


This corpus contains three texts (two non-fictional and one fictional) from the Medieval and Modern Chinese periods.

The text "Zhuzi Yulei" is similar in genre to sermons and vernacular dialogues, and is representative of Medieval Chinese. The two other texts are the novel "Shuihu Zhuan", which is from the Ming Dynasty (1368–1644), and the novel "Rulin Waishi", which is from the Qing Dynasty (1644–1911).

The corpus is available for download from the Oxford Text Archive.


Brieven als buit (Letters as loot)

Size: 460,000 words 
Annotation: lemmatised, PoS-tagged, grammatically tagged 


This corpus contains 40,000 letters from the 17th to the 19th century.

These letters were sent home by sailors and others from abroad, and also in the other direction by those who stayed behind and needed to keep in touch with their loved ones. Many letters did not reach their destinations: they were taken as loot by privateers and confiscated by the High Court of Admiralty during the wars fought between the Netherlands and England.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rutten and van der Wal (2014).


Corpus Gysseling

Size: 1.5 million words 
Annotation: PoS-tagged, lemmatised 
Licence: INT Licence for researchers


This corpus contains texts from the 13th century.

The texts were prepared and originally published in the 1970s and 1980s by the Ghent linguist Maurits Gysseling.

The corpus is available for download from the Instituut voor de Nederlandse Taal and through a dedicated concordancer.



A Corpus of English Dialogues 1560-1760 (CED)

Size: 1.2 million words 
Annotation: no annotation 
Licence: Oxford Text Archive licence


This corpus contains dialogues from literary and didactic works from 1560 to 1760.

There are five text-types in the CED. The text-types representative of constructed dialogue are drama comedy, didactic works (language manuals and other handbooks) and fiction; the text-types representative of authentic dialogue are trial proceedings and witness depositions. In addition, a small group of miscellaneous dialogic texts is included in the collection.

The corpus is available for download from the Oxford Text Archive.


Corpus of Early English Correspondence Sampler (CEECS)

Size: 450,000 words 
Annotation: no annotation 
Licence: Oxford Text Archive licence


This corpus contains 1147 letters from 1418 to 1680.

The corpus was created from the larger Corpus of Early English Correspondence.

The corpus is available for download from the Oxford Text Archive.


Corpus of Late Modern English prose / David Denison

Size: 580,056 words 
Annotation: no annotation 
Licence: Oxford Text Archive licence


This corpus contains fictional texts from 1837 to 1926.

The corpus is available for download from the Oxford Text Archive.


Hansard Corpus

Size: 1.6 billion tokens 
Annotation: tokenised, PoS-tagged, lemmatised, semantic tags


This corpus contains parliamentary debates from 1803 to 2005.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rayson et al. (2015).


Helsinki Corpus of Scottish Correspondence (1540-1750)

Size: 500,000 tokens 
Annotation: tokenised 


This corpus contains personal correspondence from 1540 to 1750.

The corpus consists of transcripts of original letter manuscripts. The texts are reproduced without any modernisation or normalisation. Language-external variables such as date, region, gender, addressee, hand and script type have been coded.

The writers originate from fifteen different regions of Scotland. A fifth of the correspondents in the corpus are women.

The corpus is available through the concordancer Korp.


Older Scottish texts: the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith

Size: 877,000 tokens 
Annotation: tokenised 
Licence: CC-BY-NC-SA 3.0


This corpus contains texts from 1450 to 1600.

The corpus is available for download from the Oxford Text Archive.


Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn

Size: 431,013 words 
Licence: CC-BY-NC-SA 3.0


This corpus contains pamphlets of the American Revolution from 1750 to 1776.

The corpus is available for download from the Oxford Text Archive.


Parsed Corpus of Early English Correspondence (PCEEC)

Size: 2.2 million words 
Annotation: tokenised, PoS-tagged, syntactically parsed 
Licence: Oxford Text Archive licence


This corpus contains correspondence from around 1410 to 1681.

There are 4970 personal letters by 666 writers. The letters have been selected to be as socially representative of the literate social ranks of the time as possible.

This corpus is available for download from the Oxford Text Archive.


Royal Society Corpus (Version 4.0)

Size: 35 million tokens 
Annotation: PoS-tagged using PennTreebank tagset, lemmatised, normalised 
Licence: CC-BY-NC-SA-4.0


This corpus contains articles from the Philosophical Transactions of the Royal Society of London journal from 1665 to 1869.

The corpus is available for download from the CLARIN-D repository as well as through a concordancer.



The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose

Size: 300,000 words 
Annotation: COCOA-style 
Licence: Oxford Text Archive licence


This corpus contains texts from 1761 to 1790.

The corpus is available for download from the Oxford Text Archive.


The Lampeter Corpus of Early Modern English Tracts

Size: 50,797,916 words 
Annotation: no linguistic annotation 
Licence: CC-BY-NC-SA 3.0


This corpus contains tracts from 1640 to 1740.

The corpus is available for download from the Oxford Text Archive.


The Lancaster Newsbooks Corpus

Size: 3,001,604 words 
Licence: CC-BY-NC-SA 3.0


This corpus contains two collections of English printed pamphlets, books, and newspapers from 1654 to 1655.

The corpus is available for download from the Oxford Text Archive.


Corpus of Historical American English - Kielipankki Korp version 2017H1

Size: 385 million tokens 
Annotation: tokenised 
Licence: CLARIN ACA

English (American)

This corpus contains texts from 1810 to 2009.

Each decade has roughly the same balance of fiction, popular magazine, newspaper, and non-fiction books.

The corpus is available through the concordancer Korp.


The Corpus of Late Modern English Texts, version 3.1

Size: 34 million words 
Annotation: PoS-tagged 
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains texts written by British and Irish authors from 1710 to 1920.

In terms of genre, the texts correspond to narrative fiction and non-fiction, drama, letters, treatises, and miscellaneous written works.

The corpus is available for download from a CLARIN-D repository.


The Old Bailey Corpus

Size: 134 million words 
Annotation: detailed sociobiographical, pragmatic and textual annotation 
Licence: CC-BY-NC-SA 4.0

English (Late Modern)

This corpus contains proceedings of the Old Bailey (i.e., legal documents) from 1674 to 1913.

The corpus is available for download from the CLARIN-D repository and through the CQPConcordancer.

For the corpus manual, see Huber et al. (2016).



Helsinki corpus of English texts

Size: 240,000 words 
Licence: Oxford Text Archive licence

English (Old and Middle)

This corpus contains religious and fictional texts from 730 to 1710.

See the project page for a list of all the texts included in the corpus.

The corpus is available for download from the Oxford Text Archive.


The York-Helsinki parsed corpus of Old English poetry (YCOEP)

Size: 71,500 words 
Annotation: syntactically-parsed 
Licence: Oxford Text Archive licence

English (Old)

This corpus contains poems from 730 to 1710.

The corpus contains a selection of poems taken from the Old English subpart of the Helsinki Corpus of English Texts.

The corpus is available for download from the Oxford Text Archive.


Corpus of Old Written Estonian

Size: 2 million tokens 
Annotation: tokenised. 16th- to 18th-century texts have been tagged with contemporary Estonian, morphological and language information. 19th-century texts are unannotated.
Licence: CC-BY


This corpus covers secular and religious texts from the 16th to the 18th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Kingisepp et al. (2004).


Classics of Finnish Literature, Kielipankki Version

Size: 1.5 million words 
Licence: EUPL v.1.1 SA


This corpus contains literary texts from 1880 to 1949.

In terms of genre, the texts correspond to prose fiction, plays, poetry and aphorisms.

The corpus is available through the concordancer Korp (FIN-CLARIN).


Corpus of Old Literary Finnish

Size: 4.1 million words 
Annotation: MSD-tagged, syntactically parsed 
Licence: EUPL v.1.1 SA


This corpus contains both literary and non-literary texts from 1543 to 1810.

In terms of genre, the texts correspond to bible translations and religious texts (for instance, all of the clergyman Mikael Agricola's Finnish works), legal texts, poems, and texts concerning agriculture, nature, health, and so on.

The corpus is available through the concordancer Korp.


The Finnish Gutenberg Corpus

Size: 34.5 million words 
Licence: CC-BY


This corpus contains books published up to 1925 that are made available through the Gutenberg project.

The corpus is available through the concordancer Korp.


The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 5.2 billion tokens 
Annotation: tokenised 
Licence: CC-BY-SA


This corpus contains newspaper articles from 1840 to 2011.

For a comprehensive list of newspapers included in the corpus, see here.

The corpus is available through the concordancer Korp.


The Morpho-Syntactic Database of Mikael Agricola's Works

Size: 428,300 tokens 
Annotation: tokenised, PoS-tagged, morphological components and syntactic function 
Licence: CC-BY-ND


This corpus contains texts from 1544 to 1551 written by the clergyman Mikael Agricola.

The corpus is available through the concordancer Korp.


Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version

Size: 48 texts 
Licence: CC-BY-NC-ND


This corpus contains literary texts from 1543 to 1791.

This corpus complements the Corpus of Old Literary Finnish available through FIN-CLARIN.


Partonopeus de Blois: transcriptions of all manuscripts and fragments

Size: 21,736,766 words 
Annotation: no linguistic annotation 
Licence: CC BY-NC-SA 3.0

French (Old)

This corpus contains transcriptions of the manuscripts and fragments of the romance Partonopeus de Blois.

The corpus is available for download from the Oxford Text Archive.


Syntactic Reference Corpus of Medieval French

Size: 245,000 tokens 
Annotation: tokenised, syntactically-parsed 

French (Old)

This corpus contains texts from the 9th to the 13th century.

The syntactic categories of the SRCMF annotation and the grammatical principles of the annotation are explained in detail in the documentation.

The corpus is available for download from a dedicated webpage.

For the relevant publication, see Stein (2013).


Austrian Baroque Corpus

Size: 200,000 tokens 
Annotation: tokenised, PoS-tagged, lemmatised, named entities


This corpus contains sermons from 1650 to 1750.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Resch et al. (2016).


DDR-Presseportal (GDR press portal)



This corpus contains newspaper texts from 1945 to 1994.

The corpus is available through a concordancer provided by CLARIN-D.


Deutsches Textarchiv (DTA)

Size: 215,168,761 tokens 
Annotation: tokenised, PoS-tagged, lemmatised 


This corpus contains texts from the 17th to the 20th century.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Haaf and Thomas (2016).


GerManC. A Historical Corpus of German Newspapers 1650-1800

Size: 700,000 words 
Annotation: no annotation 
Licence: CC-BY-NC-SA 3.0


This corpus contains personal letters, sermons and fictional, scholarly (i.e., humanities), scientific and legal texts from 1650 to 1800.

The corpus is available for download from the Oxford Text Archive.


Mannheimer Korpus Historischer Zeitungen und Zeitschriften

Size: 3532 pages


This corpus contains texts from the 18th and 19th centuries.

The corpus is available for download directly through the VLO.


Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus)

Size: 2.5 million tokens 
Annotation: tokenised, PoS-tagged, lemmatised, normalised, morphosyntactic description 
Licence: CC-BY-SA 4.0


This corpus contains texts from 1050 to 1350.

The corpus is available for download from the Deutsches Textarchiv and through a concordancer.

For the relevant publication, see Klein and Dipper (2016).



SaCoCo—Saarbrücken Cookbook Corpus

Size: 436,000 tokens 
Annotation: PoS-tagged using the STTS tagset, lemmatised, normalised 
Licence: CC-BY-NC-SA-3.0


This corpus contains historical cookbook recipes from 1569 to 1800, as well as contemporary ones from 2012.

The corpus is available through the CQPweb concordancer provided by CLARIN-D.


The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700)

Size: 120,000 tokens 
Annotation: Lite markup, no linguistic annotation 
Licence: CC-BY-NC-SA 3.0


This corpus contains medical writing from 1500 to 1700.

The texts are taken primarily from digital facsimile copies available online via the University of Würzburg’s library interface, particularly from the subcategory pertaining to gynaecology.

The corpus is available for download from the Oxford Text Archive.


B4 Historisches Predigtenkorpus zum Nachfeld

Size: 92,500 tokens 
Annotation: tokenised, syntactic and discursive annotation 

German (Middle High)

This corpus contains sermons from an Upper German (Bavarian-Alemannic) dialect area.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.



B4 Ludolf

Size: 6,690 tokens 
Annotation: tokenised, tagged for clause type and grammatical function 

German (Middle High)

This corpus contains texts from a journey diary from 1350.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.



Reference Corpus Middle Low German/Low Rhenish (1200-1650)

Size: 200,700 tokens 
Annotation: tokenised, MSD-tagged 
Licence: CC-BY

German (Middle Low)

This corpus contains texts from the 13th century to the middle of the 17th century.

The corpus is available for download from the repository of the University of Hamburg and through the ANNIS environment.

For the relevant publication, see Schröder (2014).



OROSSIMO Corpus – History

Size: 553,000 tokens 
Annotation: structural annotation (paragraph) 
Licence: CC-BY


This corpus contains historic academic texts.

The corpus is available for download from the clarin:el repository.


Hungarian Historical Corpus

Size: 30 million words


This corpus contains historical texts from the 18th century to the 2000s.

The corpus is available through a dedicated concordancer.



The Saga Corpus

Size: 1.5 million tokens 
Annotation: tokenised, PoS-tagged, lemmatised, normalised orthography
Licence: CC-BY 4.0

Icelandic (Old)

This corpus contains Old Icelandic (Old Norse) narrative texts from the 13th to the 15th century.

The corpus is available for download from CLARIN-IS and for search through the concordancer Korp.

For the relevant publication, see Rögnvaldsson and Helgadóttir (2011).




Size: 16.6 million words 
Annotation: unannotated 
Licence: ODC Attribution License (ODC-By)


This corpus contains Italian-language newspapers published in the United States between 1898 and 1920. It includes seven newspapers published in California, Massachusetts, Pennsylvania, Vermont, and West Virginia, with the following titles: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia.

The corpus is available for download from the repository of the University of Utrecht.


LatinISE corpus (version 4)

Size: 13.3 million tokens 
Annotation: sentence-segmented, PoS-tagged, lemmatised
Licence: CC BY-NC-SA 4.0


This corpus consists of Latin texts from the 2nd century B.C. to the 21st century. Non-linguistic metadata include information on genre, title, century and specific date.

The corpus is available for download from LINDAT and for search online through Sketch Engine.

For the relevant publication, see McGillivray and Kilgarriff (2015).




Size: 1.6 million tokens 
Annotation: tokenised, MSD-tagged, lemmatised 
Licence: CC-BY

Old Norse

This corpus contains Medieval Nordic texts.

The corpus is available for download and through the concordancer Corpuscle.




Size: 16 million tokens 
Licence: CC-BY-SA


This corpus contains newspaper articles from 1945 to 1954.

The corpus is available through a dedicated concordancer.


Polish language of the 1960s

Size: 500,000 words 
Annotation: MSD-tagged 
Licence: CC-BY-NC-SA 3.0


This corpus contains essays, news articles, and scientific and literary texts from 1963 to 1967.

The corpus is available for download from the Oxford Text Archive.



Size: 3.5 million words 
Licence: CC-BY-NC-ND


This is a corpus of historical, religious, notarial, literary texts in prose and verse.

The corpus is available from PORTULAN.



Portuguese Parish Memories (1758)

Licence: CC BY


This is a corpus of historical surveys from the 18th century.

The corpus is available from PORTULAN.


Corpus of biblical text in Scots / John Kirk

Size: 35,506 words 
Annotation: no annotation 
Licence: Oxford Text Archive licence


This corpus contains Biblical texts.

The corpus is available for download from the Oxford Text Archive.


The Helsinki corpus of Older Scots : [1450-1700]

Size: 1,940,706 words 
Annotation: no annotation 
Licence: CC-BY-NC-SA 3.0


This corpus contains texts of different domains and genres (e.g., burgh records, diaries, pamphlets, scientific treatises, sermons) from 1450 to 1700.

The corpus is available for download from the Oxford Text Archive.


Digital library and corpus of historical Slovene IMP 1.1

Size: 17.7 million tokens 
Annotation: tokenised, lemmatised, PoS-tagged 
Licence: CC-BY-SA 4.0


This corpus contains 658 unique texts from 1584 to 1919.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2015).



Reference corpus of historical Slovene goo300k 1.2

Size: 300,000 tokens 
Annotation: manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words 
Licence: CC-BY 4.0


This corpus contains 89 unique texts from 1584 to 1899.

The corpus is available for download from the CLARIN.SI repository and through the concordancer KonText.

For the relevant publication, see Erjavec (2012).



The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 3.5 billion tokens 
Annotation: tokenised 
Licence: CC-BY-SA


This corpus contains newspaper articles from 1770 to 1950.

The corpus is available through the concordancer Korp.


Historical Corpus of the Welsh Language 1500-1850

Size: 420,000 words


This corpus contains 30 texts from 1500 to 1850.

The corpus is available for download from a dedicated website and through a dedicated concordancer.


Multilingual corpora

Corpus Language Description Availability

"PolDiLemma" Middle Polish Diachrone Lemmatised Corpus

Size: 7 million tokens 
Annotation: tokenised, lemmatised 
Licence: CC BY-NC-SA 4.0

Czech, German, Latin, Polish

This corpus contains political, religious and scientific texts from the 16th to the 18th century.

The corpus is available for download from the CLARIN-D repository.


Medieval Charter Sections Corpus

Size: 57 chapters 
Annotation: manually-tagged, named entities 
Licence: CC-BY-NC-SA 4.0

Czech, Latin

This corpus contains Latin charters created in the era of John the Blind, King of Bohemia.

The corpus is available for download from LINDAT.

For the relevant publication, see Galuščáková and Neužilová (2018).


Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo

Size: 4000 words 
Annotation: no linguistic annotation 
Licence: Oxford Text Archive licence

English (Middle), Hebrew

This corpus contains literary texts from 1100 to 1400.

The corpus is available for download from the Oxford Text Archive.


Dictionary of Old English Corpus in Electronic Form (DOEC)

Annotation: no linguistic annotation 
Licence: Oxford Text Archive licence

English (Old), Latin

This corpus contains 3037 texts from 600 to 1150.

The corpus is available for download from the Oxford Text Archive.


The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)

Size: 1.5 million words 
Annotation: syntactically-parsed 
Licence: Oxford Text Archive licence

English (Old), Latin

This corpus contains fictional texts from 600 to 1150.

The corpus is available for download from the Oxford Text Archive.


Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA)

Size: 128,000 words 
Annotation: MSD-tagged, syntactically parsed 

English, German, Latin, Old Norse, Swedish

This corpus contains texts written in the Late Old Swedish period (from 1375 to 1550).

The corpus is available for download from the repository of the University of Hamburg.


The Electronic Text Corpus of Sumerian Literature. Revised edition

Size: 5,151,373 words 
Annotation: Each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation. 
Licence: CC-BY-NC-SA 3.0

English, Sumerian

This corpus contains transliterations and English translations of 394 Sumerian compositions from approximately 2100 to 1700 BCE.

The corpus is available for download from the Oxford Text Archive.


Finnish Folk Poetry

Size: 7.1 million words 
Annotation: normalised (added diacritics) 
Licence: CC-BY-NC

Finnish, Karelian, Ludian, Latin, Swedish, Olonets, Izhorian, Votic

This corpus contains poems from 1564 to 1939.

The corpus is available through the concordancer Korp.


Corpus of Early Modern Finnish, Kielipankki Version

Size: 8.6 million words 
Annotation: no linguistic annotation 
Licence: EUPL v.1.1 SA

Finnish, Russian, German, Latin

This corpus contains texts from 1809 to 1899.

The corpus is available through the concordancer Korp.


Aleksis Kivi Corpus (SKS)

Size: 413,700 words 
Annotation: MSD-tagged, syntactically parsed 
Licence: CC-BY-NC

Finnish, Swedish

This corpus contains the works by Finnish author Aleksis Kivi from 1855 to 1871.

The corpus is available through the concordancer Korp.


Classics Library of the National Library of Finland - Kielipankki version

Licence: CC-BY

Finnish, Swedish

This corpus will contain literary texts from 1549 to 1944.

The Letters of Paul Sinebrychoff, Kielipankki Version

Size: 8.6 million words 
Annotation: Finnish subset: MSD-tagged, syntactically parsed; Swedish subset: no linguistic annotation 
Licence: CC-BY

Finnish, Swedish

This corpus contains letters from 1895 to 1909.

The corpus is available through a dedicated online search environment.


The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 8.7 billion words 
Licence: CC-BY

Finnish, Swedish

This corpus contains newspaper articles from 1770 to 2011.

The corpus is available through the concordancer Korp.


The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Licence: CC-BY

Finnish, Swedish

This corpus contains newspaper articles from 1771 to 1874.

The corpus is available for download.

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920)

Size: 8.7 billion tokens 
Annotation: tokenised 

Finnish, Swedish

This corpus contains newspaper articles from 1875 to 1920.

The corpus is available for download from the Language Bank of Finland.


Carniolan Provincial Assembly corpus Kranjska 1.0

Size: 10.9 million words 
Annotation: tokenised, MSD-tagged, lemmatised 
Licence: CC-BY 4.0

German, Slovenian

The corpus contains meeting proceedings of 694 sessions of the Carniolan Provincial Assembly from 1861 to 1913.

The source data (scanned and OCR-processed PDF documents) originally come from The Digital Library of Slovenia and History of Slovenia - SIstory portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later in the Latin script.

The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Language was detected at the sentence level using Lingua: roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit for both Slovenian and German.

The documents are in the Parla-CLARIN compliant TEI XML format. Each session is in one file.
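Since each session is stored as one TEI XML file, a per-session file can be inspected with a standard XML parser. The following is a minimal sketch, not the corpus authors' tooling. It assumes that utterances are encoded as TEI `<u>` elements carrying a `who` attribute, as the Parla-CLARIN recommendations describe; the exact element layout of the Kranjska files should be checked against the corpus documentation. The sample document, speaker identifiers and function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI namespace; Parla-CLARIN documents are TEI documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature session file, invented for illustration only.
SAMPLE_SESSION = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="debateSection">
    <u who="#speaker-a" xml:lang="sl"><seg>Prvi stavek.</seg></u>
    <u who="#speaker-b" xml:lang="de"><seg>Erster Satz.</seg><seg>Zweiter Satz.</seg></u>
    <u who="#speaker-a" xml:lang="sl"><seg>Drugi stavek.</seg></u>
  </div></body></text>
</TEI>"""


def utterances_per_speaker(xml_text: str) -> Counter:
    """Count <u> elements per value of their 'who' attribute."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Utterances without a 'who' attribute are grouped under 'unknown'.
        counts[u.get("who", "unknown")] += 1
    return counts


counts = utterances_per_speaker(SAMPLE_SESSION)
for speaker, n in counts.most_common():
    print(speaker, n)
```

For a real session file, the same function could be applied to the file's contents after reading it from disk.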

For the relevant publication, see Marolt et al. (2023).


B4 Tatian Corpus of Deviating Examples 2.1

Size: 11,300 tokens 
Annotation: tokenised, MSD-tagged 
Licence: CC-BY

Latin, German (Old High)

This corpus contains the OHG Tatian, which is one of the largest prose texts from the Old High German period.

The corpus is available for download and through a concordancer from the repository of the University of Hamburg.



Språkbanken's historical corpora

Size: 1.34 billion tokens 
Annotation: tokenised, PoS-tagged, lemmatised, syntactically parsed, word sense (for materials more recent than 1800) 
Licence: CC-BY

Swedish, German, French and others

This collection of corpora contains – among others – diachronic legal texts, Bible translations, medieval letters, digitized newspapers from the Swedish National Library and 19th century fiction from the Swedish Literature Bank.

The corpora are available through the concordancer Korp.


Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0

Size: 34,542 utterances; 578,958 sentences; 13,271,885 words; 15,403 pages 
Annotation: tokenised, MSD-tagged, lemmatised 
Licence: CC BY 4.0

Croatian, Serbian, Slovenian

This historical parliamentary corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions.

The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory portal. The images were OCR processed and the results saved as pdf, docx and txt. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet.

The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Lingua was used for language detection on the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script.
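The sentence-level shares reported above came from running the Lingua language detector. As a rough illustration of a simpler, related idea, the sketch below labels each sentence by its dominant script (Cyrillic or Latin) and computes the share of each. It is not the method used for the corpus: script alone cannot separate Croatian from Slovenian, or detect German and French. The sample sentences and function names are invented for this example.

```python
import unicodedata
from collections import Counter


def dominant_script(sentence: str) -> str:
    """Return 'Cyrillic', 'Latin' or 'unknown' by counting letters per script."""
    tally = Counter()
    for ch in sentence:
        if ch.isalpha():
            # Unicode character names start with the script name for these scripts.
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                tally["Cyrillic"] += 1
            elif name.startswith("LATIN"):
                tally["Latin"] += 1
    if not tally:
        return "unknown"
    return tally.most_common(1)[0][0]


def script_shares(sentences):
    """Map each script label to its share of the given sentences."""
    labels = Counter(dominant_script(s) for s in sentences)
    total = sum(labels.values())
    return {script: count / total for script, count in labels.items()}


# Hypothetical sample sentences, not taken from the corpus.
sample = [
    "Господо посланици, седница је отворена.",
    "Gospodo, sjednica je otvorena.",
    "Gospodje, seja je odprta.",
    "Молим за реч.",
]
shares = script_shares(sample)
print(shares)
```

A script count like this is only a coarse first pass; telling apart languages that share a script requires a statistical detector such as the one the corpus creators used.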

The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.



Other historical corpora

Monolingual corpora

Corpus Language Description Availability


Size: 4 million tokens 
Annotation: basic structural markup 
Licence: CC-BY-NC-SA


This corpus contains texts from the 14th to the 20th century.

The corpus is available through a dedicated concordancer.





The corpus contains texts from 1600 to 1999.

The corpus is available through the CQPConcordancer.



Size: 74 million tokens 
Annotation: no linguistic annotation 
Licence: CC-0


This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1700 to 1800.

The corpus is available for download from a dedicated webpage and through a dedicated concordancer.




Size: 766 million tokens 
Annotation: no linguistic annotation 
Licence: CC-0


This corpus contains texts (literature, philosophy, politics, religion, geography, science and all other areas of human endeavour) from 1450 to 1750.

The corpus is available through a dedicated concordancer.



Size: 766 million tokens 
Annotation: no linguistic annotation 
Licence: CC-0


This corpus contains American texts from 1640 to 1821.

The corpus is available through a dedicated concordancer.


Historical Corpora at Lancaster University

Annotation: tokenised, PoS-tagged, partial semantic tagging (USAS system)


The corpus contains texts in various domains (e.g., fiction, newspaper texts, religious texts) from 1500 on.

The corpus is available through the CQPConcordancer.



Size: 300 million words 
Annotation: PoS-tagged, lemmatised


This corpus contains texts from the 10th to the 21st century.

The corpus is available through a dedicated concordancer (restricted access).


Corpus of Old and Middle Hungarian court records and private correspondence

Size: 850,000 words 
Annotation: tokenised, MSD-tagged, lemmatised, sociolinguistic metadata


This corpus contains private letters and testimonies from the 16th to the 18th century.

The corpus is available through a dedicated concordancer.


Old Hungarian Corpus

Size: 3 million tokens 
Annotation: tokenised, partially normalized, partially MSD-tagged


This corpus contains texts (codices, letters) from the 12th to the 17th century.

The corpus is available for download from a dedicated webpage and through a dedicated concordancer.



Corpus testuale del Tesoro della Lingua Italiana delle Origini

Size: 23 million tokens 
Annotation: tokenised, lemmatised


This corpus contains early Italian texts before 1375.

The corpus is available through a dedicated concordancer.





This corpus contains texts from 1861 to 1945.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Rossini Favretti et al. (2011).


M.I.DIA. (Morfologia dell'Italiano in DIAcronia)

Size: 7.5 million tokens 
Annotation: tokenised 
Licence: CC-BY-NC 4.0


This corpus contains texts from the 13th to the 20th century.

The corpus is available through a dedicated concordancer.


Corpus of the 19. century Polish (Korpus polszczyzny XIX-wiecznej)

Size: 625,000 tokens 
Annotation: tokenised, PoS-tagged, lemmatised, transliteration, transcription


This corpus contains texts from 1830 to 1918.

The corpus is available for download through a dedicated webpage.


The Electronic Corpus of 17th- and 18th-century Polish Texts (Elektroniczny Korpus Tekstów Polskich z XVII i XVIII w.)

Size: 13.5 million tokens 
Annotation: tokenised, partially PoS-tagged, structural annotation


This corpus contains texts from 1601 to 1772.

The corpus is available through a dedicated concordancer.

A manually annotated subset is available here.

For the relevant publication, see Gruszczyński et al. (2021).


IMPACT GT corpus (Korpus GT projektu IMPACT)

Size: 1.5 million tokens 
Annotation: transcription


This corpus contains texts from 1570 to 1756.

The corpus is available through a dedicated concordancer.

For the relevant publication, see Bień (2012).


Parsed Corpus of Historical Portuguese

Size: 3.3 million words 
Annotation: tokenised, PoS-tagged (2 million), treebanked (1.2 million)


This corpus contains 76 texts written by authors born between 1380 and 1881.

The corpus is available for download and through a dedicated concordancer.



Multilingual corpora

Corpus Language Description Availability

Bundesblatt/Feuille fédérale/Foglio federale

Size: 203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian) 
Annotation: tokenised, syntactically parsed

German, French, Italian

This corpus contains texts from 1849 to 2014.

The corpus is available through the CQPWeb concordancer.


Corpus of old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500)

Size: 620,000 tokens 
Annotation: tokenised

Polish, Latin

This corpus contains texts until 1500.

The corpus is available for download from a dedicated webpage.


Corpus of the 16. century Polish (Korpus polszczyzny XVI wieku)

Annotation: lemmatised, transliteration

Polish, Latin

This corpus contains texts from the 16th century.

The corpus is available through a dedicated concordancer.


eFontes Mediae et Infimae Latinitatis Polonorum (Elektroniczny korpus polskiej łaciny średniowiecznej)

Size: 5 million tokens 
Annotation: tokenised, lemmatised

Polish, Latin

This corpus contains texts from the 11th to the middle of the 16th century.

The corpus is available through a dedicated concordancer.


XV century New Testament translations (Piętnastowieczne przekłady Nowego Testamentu – elektroniczna konkordancja staropolska)

Size: 400,000 tokens 
Annotation: tokenised

Polish, Latin

This corpus contains Biblical texts from 1380 to 1500.

This corpus is available through a dedicated concordancer.


Additional Materials

  • Presentations on historical newspaper corpora at the CLARIN-PLUS workshop 'Working with Digital Collections of Newspapers', 19-21 September 2016, Leuven, Belgium. [html]
  • Videolectures of the CLARIN-PLUS workshop. [html]

List of Publications on Historical Corpora

[Bień 2012] Janusz Bień. 2012. Delivering the IMPACT project Polish Ground-Truth texts with Poliqarp for DjVu.

[Erjavec 2012] Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene.

[Erjavec 2015] Tomaž Erjavec. 2015. The IMP historical Slovene language resources. 

[Galuščáková and Neužilová 2018] Petra Galuščáková and Lucie Neužilová. 2018. Low Resource Methods for Medieval Document Sections Analysis.

[Gruszczyński et al. 2021] Włodzimierz Gruszczyński, Dorota Adamiec, Renata Bronikowska, Witold Kieraś, Emanuel Modrzejewski, Aleksandra Wieczorek, and Marcin Woliński. 2021. The Electronic Corpus of 17th- and 18th-century Polish Texts.

[Haaf and Thomas 2016] Susanne Haaf and Christian Thomas. 2016. The Historical Corpora of the German Text Archive as a basis for research into linguistic history.

[Huber et al. 2016] Magnus Huber, Magnus Nissel, Karin Puga. 2016. The Old Bailey Corpus 2.0, 1720-1913 Manual. 

[Kingisepp et al. 2004] Valve-Liivi Kingisepp, Külli Prillop, Külli Habicht. 2004. EESTI VANA KIRJAKEELE KORPUS: MIS TEHTUD, MIS TEOKSIL.

[Klein and Dipper 2016] Thomas Klein and Stefanie Dipper. 2016. Handbuch zum Referenzkorpus Mittelhochdeutsch.

[McGillivray and Kilgarriff 2015] Barbara McGillivray and Adam Kilgarriff. 2015. Tools for historical corpus research, and a corpus of Latin.

[Rayson et al. 2015] Paul Rayson, Alistair Baron, Scott Piao, Steve Wattam. 2015. Large-scale Time-sensitive Semantic Analysis of Historical Corpora. 

[Rossini Favretti et al. 2011] Rema Rossini Favretti, Fabio Tamburini, Andrea Zaninello. 2011. Exploiting corpus evidence for automatic sense induction.

[Rögnvaldsson and Helgadóttir 2011] Eiríkur Rögnvaldsson and Sigrún Helgadóttir. 2011. Morphosyntactic Tagging of Old Icelandic Texts and Its Use in Studying Syntactic Variation and Change. In C. Sporleder, A.P.J. van den Bosch and K.A. Zervanou (eds.): Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, pp. 63-76. Springer, Berlin.

[Rutten and van der Wal 2014] Gijsbert Rutten and Marijke van der Wal. 2014. Letters as Loot. A sociolinguistic approach to seventeenth- and eighteenth-century Dutch.

[Resch et al. 2016] Claudia Resch, Ulrike Czeitschner, Eva Wohlfarter, Barbara Krautgartner. 2016. Introducing the Austrian Baroque Corpus: Annotation and Application of a Thematic Research Collection.

[Schröder 2014] Ingrid Schröder. 2014. The Reference Corpus: New Perspectives for Middle Low German Grammar.

[Stein 2013] Achim Stein. 2013. Diachronic syntax based on constituency and dependency annotated corpora: theoretical and methodological issues.

[Vatri and McGillivray 2018] Alessandro Vatri and Barbara McGillivray. 2018. The Diorisis Ancient Greek Corpus.