Newspaper Corpora

Digital newspaper collections are a rich source of information for researchers across the Humanities and Social Sciences. They are especially valuable for synchronic and diachronic studies in fields ranging from history and media and communication studies to lexicography, for which newspapers are a rich source of neologisms and other lexicographic phenomena.

The CLARIN infrastructure gives access to 31 newspaper corpora, 6 of which are multilingual and 25 monolingual. The available corpora contain newspaper articles in languages such as Arabic, Czech, Finnish, French, German, Greek, Italian, Norwegian, Polish and Swedish. Almost a third of the newspaper corpora are historical, with the oldest articles dating from the 18th century. The majority of them are richly tagged and available under public licences. We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

Additionally, the CLARIN infrastructure gives access to the entire Europeana historical newspaper collection, which is listed here in the section The Europeana Collection. The collection is divided into 9 subsets by country. Each subset corresponds to a CLARIN Virtual Collection, which includes a link to the VLO with parameters that select the relevant country's newspaper records, a link to the full metadata archive, and links to the metadata records for all the newspaper titles. The latter provide access to the records for specific years, where you can directly browse the individual newspaper issues.

The Europeana collection can be accessed directly through the VLO. For instance, the newspaper Sakala, which is part of the Estonian collection, consists of 64 annual issues published between 1878 and 1944. Each issue has its own VLO entry nested under the main newspaper entry, from which the individual issues can be browsed as scans or downloaded.

The newspaper issues included in the Europeana Newspapers collection can also be browsed and viewed through the thematic collection on Europeana’s portal.

For comments, changes to the existing content or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

Newspaper Corpora in the CLARIN Infrastructure

Monolingual Corpora

Corpus Language Description Availability

SYN2006PUB: corpus of Czech newspapers

Size: 300 million tokens
Annotation: tokenised, lemmatised, PoS-tagged
Licence: CC-BY

Czech

This corpus contains articles from 11 Czech newspapers from 1989 to 2004.

The corpus is available for download from the Czech repository LINDAT.

Download

SYN2013PUB: corpus of written Czech newspapers

Size: 935 million tokens
Annotation: tokenised, lemmatised, MSD-tagged
Licence: Czech National Corpus (Shuffled Corpus Data)

Czech

This corpus contains articles from Czech newspapers from 2005 to 2009.

The corpus is available for download from the Czech repository LINDAT.

Download

The Karelian Finnish Newspaper Corpus

Size: 500,000 tokens
Licence: CLARIN ACA

Finnish

This corpus contains articles from the Finnish newspaper Karjalan Sanomat from 2012 to 2014.

The corpus is available through the concordancer Korp.

Concordancer

Corpus journalistique issu de l'Est Républicain

Annotation: MSD-tagged, lemmatised
Licence: CC-BY

French

This corpus contains articles from the French newspaper l'Est Républicain from 1999 to 2003.

The corpus is available for download from Ortolang.

Download

Tübingen Treebank of Written German / Newspaper Corpus

Size: 1.8 million tokens
Annotation: tokenised, MSD-tagged, lemmatised, syntactic constituency, named entities
Licence: CLARIN RES

German

This corpus contains articles from the German newspaper Die Tageszeitung.

The corpus is available through a dedicated concordancer with an institutional account.

Concordancer

TIGER Corpus

Size: 900,000 tokens
Annotation: tokenised, PoS-tagged, parsed, lemmatised
Licence: CLARIN PUB

German

This corpus contains articles from the German newspaper Frankfurter Rundschau.

The corpus is available for download from a dedicated webpage.

Download

Mannheim Corpus of Historical Newspapers and Magazines

Size: 4.1 million tokens
Annotation: tokenised

German

This corpus contains articles from 21 German newspapers from the 18th and 19th centuries.

The corpus is available for download from the CLARIN-D repository.

Download

Corpus "Library and Information Centre - Newspapers"

Size: 20 units
Licence: CC-BY-NC-SA

Greek

The corpus contains newspaper articles.

The corpus is available for download from the CLARIN:EL repository.

Download

The image of Germany in the Greek press

Size: 3.5 million tokens, 7650 texts
Annotation: tokenised, lemmatised

Greek

The corpus consists of newspaper articles from three Greek newspapers (Ta Nea, Risospastis, and To Vima) dealing with Germany from the Greek perspective.

Bibliographical information is encoded in the path to each file, which is composed of the newspaper title, year, month, day, and rubric. The lemmata are stored in a separate tree with the same structure; the text files in that tree contain one lemma per line.
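The path convention described above can be unpacked programmatically. The following is a minimal sketch, assuming the fields appear as nested directories in the order listed (title, year, month, day, rubric) directly above the text file. The example path, the directory names and the helper names `parse_article_path` and `parse_lemmas` are hypothetical and are not taken from the corpus documentation.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List


@dataclass(frozen=True)
class ArticleInfo:
    newspaper: str
    year: int
    month: int
    day: int
    rubric: str
    filename: str


def parse_article_path(path: str) -> ArticleInfo:
    """Read bibliographic fields from the last six path components.

    Assumed (hypothetical) layout: <title>/<year>/<month>/<day>/<rubric>/<file>
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 6:
        raise ValueError(f"path too short to carry all fields: {path!r}")
    newspaper, year, month, day, rubric, filename = parts[-6:]
    return ArticleInfo(newspaper, int(year), int(month), int(day), rubric, filename)


def parse_lemmas(text: str) -> List[str]:
    """Split the content of a lemma file (one lemma per line) into a list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Usage with an invented path and invented lemma content.
info = parse_article_path("corpus/ta_nea/2011/05/17/politics/article_001.txt")
lemmas = parse_lemmas("lemma_a\nlemma_b\n\nlemma_c\n")
print(info.newspaper, info.year, info.month, info.day, info.rubric)
print(lemmas)
```

Because the lemma tree mirrors the text tree, the same path parser should apply to both, provided the assumed layout matches the downloaded archive.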

The corpus is available for download from CLARIN-D (Saarland University B-centre).

For the relevant publication, see Tsotsou (2019).

Download

Modern Greek Texts Corpus - "Makedonia" newspaper

Size: 3 million tokens
Licence: CC-BY-NC-SA

Greek

This corpus contains newspaper articles on various topics (politics, economy, sports).

The corpus is available for download from the CLARIN:EL repository.

Download

Modern Greek Texts Corpus - "Ta Nea" newspaper

Size: 2 million words
Licence: CC-BY-NC-SA

Greek

This corpus contains newspaper articles on various topics (politics, economy, sports).

The corpus is available for download from the CLARIN:EL repository.

Download

The Norwegian Newspaper Corpus

Size: 700 million tokens
Annotation: multitagged
Licence: CC-BY

Norwegian

This corpus contains articles from 24 Norwegian newspapers from 1998 onwards.

The corpus is available through the concordancer Corpuscle.

Concordancer

ChronoPress Corpus of Polish Press Texts

Size: 20 million tokens
Annotation: tokenised, PoS-tagged, named entities
Licence: CLARIN PUB

Polish

This corpus contains articles from various Polish newspapers from 1945 to 1962.

The corpus is available through a dedicated concordancer.

Concordancer

8 sidor

Size: 678,000 tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper 8 sidor from 2003 to 2012.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Dagny

Size: 8.1 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Dagny from 1886 to 1913.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

DN 1987

Size: 5 million tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper Dagens Nyheter from 1987.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

GP 1994 and 2001-2011

Size: 271 million tokens
Annotation: tokenised, PoS-tagged, parsed, compounds
Licence: CC-BY

Swedish

This corpus contains articles from the Swedish newspaper Göteborgsposten from 1994 and from 2001 to 2011.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Hertha

Size: 3.8 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Hertha from 1914 to 2015.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Idun

Size: 2 million tokens
Annotation: tokenised, PoS-tagged, parsed

Swedish

This corpus contains articles from the newspaper Idun from 1887 to 1917.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Kvinnornas Tidning

Size: 5.5 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Kvinnornas Tidning for the period between 1921 and 1925.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Morgonbris

Size: 3.5 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Morgonbris from 1904 to 1924. 

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Rösträtt för Kvinnor

Size: 2.2 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Rösträtt för Kvinnor from 1912 to 1919.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Smittskydd

Size: 691,000 tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from the newspaper Smittskydd from 2002 to 2010.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

The Webbnyheter corpus

Size: 272 million tokens
Annotation: tokenised, PoS-tagged, parsed
Licence: CC-BY

Swedish

This corpus contains articles from various Swedish online newspapers from 2001 to 2013.

The corpus is available for download from Språkbanken and can be accessed through the concordancer Korp.

Concordancer

Download

Multilingual Corpora

Corpus Language Description Availability

Parallel Global Voices

Size: 8 million units
Licence: CC-BY

40 languages

This corpus contains articles from the Global Voices website (https://globalvoices.org/), where volunteers publish and translate news stories in more than 40 languages.

Download

ACCURAT corpus of comparable sentences

Size: 23,820 sentences
Licence: CC-BY

English-Croatian, English-Greek, English-Estonian, English-Latvian, English-Lithuanian, English-Romanian, English-Slovenian, Greek-Romanian, Latvian-Lithuanian, Romanian-German, Romanian-Lithuanian and German-English

This corpus contains sentence pairs extracted from comparable news corpora.

The corpus is available for download from the CLARIN:EL repository.

Download

SETIMES - A parallel corpus of the Balkan languages

Size: 341.83 million tokens
Annotation: sentence-aligned
Licence: Open For Reuse With Restrictions

Romanian, Turkish, Serbian, English, Bulgarian, Macedonian, Croatian, Greek, Albanian

This parallel corpus contains online news articles extracted from the SETimes webpage.

The corpus is available for download from the CLARIN:EL repository.

Download

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version

Size: 8.8 billion tokens
Annotation: tokenised, MSD-tagged, syntactically parsed
Licence: CC-BY

Swedish and Finnish

This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1770 to 2011.

The corpus can be accessed through the concordancer Korp.

Concordancer

The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)

Licence: CC-BY

Swedish and Finnish

This corpus contains articles from a large variety of Finnish and Swedish newspapers (over 100 for each language) from 1771 to 1874.

The corpus can be downloaded from FIN-CLARIN.

Download

Corpora of Newspaper Texts

Size: 435 million tokens
Annotation: tokenised
Licence: under negotiation

Swedish, English and Finnish

This corpus contains articles from a variety of Swedish, English and Finnish newspapers.

The corpus can be found in the FIN-CLARIN repository, although its availability and licence are still under negotiation.

 

The Europeana Collection

Corpus Language Description Availability

Europeana historical newspapers: Netherlands

Size: 2,869,483,985 words
Licence: Public

Dutch, French, English, Spanish (Castilian), Hebrew, Western Frisian, German, Punjabi, Arabic

This corpus contains 4266 issues of 164 newspapers published in the Netherlands between 1618 and 1940.

VLO

Europeana historical newspapers: Estonia

Size: 351,656,185 words
Licence: Public

Estonian, Russian, German

This corpus contains 92,558 issues of 40 newspapers published in Estonia between 1852 and 1946.

VLO

Europeana historical newspapers: Finland

Size: 393,776,815 words
Licence: Public

Finnish, Swedish

This corpus contains 24,164 issues of 10 newspapers published in Finland between 1900 and 1910.

VLO

Europeana historical newspapers: Luxembourg

Size: 29,266,765 words
Licence: Public

French

This corpus contains 1225 issues of 2 newspapers published in Luxembourg between 1704 and 1794.

VLO

Europeana historical newspapers: Germany

Size: 5,593,768,847 words
Licence: Public

German, English

This corpus contains 126,564 issues of 11 newspapers published in Germany (chiefly Berlin and Hamburg) between 1792 and 1945.

VLO

Europeana historical newspapers: Austria

Size: 2,351,079,191 words
Licence: Public

German, Modern Greek, Croatian

This corpus contains 147,515 issues of 77 newspapers published in Austria between 1683 and 1930.

VLO

Europeana historical newspapers: Latvia

Size: 964,243,746 words
Licence: Public

Latvian, Russian, German, Polish, Estonian

This corpus contains 67,870 issues of 77 newspapers published in Latvia between 1868 and 1955.

VLO

Europeana historical newspapers: Poland

Size: 181,102,489 words
Licence: Public

Polish, German, Ukrainian, Russian

This corpus contains 15,130 issues of 10 newspapers published in Poland between 1866 and 1939.

VLO

Europeana historical newspapers: Serbia

Size: 338,080,416 words
Licence: Public

Serbian

This corpus contains 22,087 issues of 44 newspapers published in Serbia between 1830 and 1944.

VLO

Other Newspaper Corpora

Monolingual Corpora

Corpus Language Description Availability

Zurich English Newspaper Corpus

Size: 1.6 million tokens
Annotation: tokenised
Licence: public

English

This corpus contains articles from various English newspapers (mainly newspapers from London) from the 17th and 18th centuries.

For access, contact the authors.

deu_newscrawl_2011

Size: 426 million tokens
Annotation: tokenised

German

This corpus contains articles from various German newspapers from 2011.

The corpus is available through a dedicated concordancer.

Concordancer

CRIPCO

Size: 43,000 documents
Annotation: coreference resolution
Licence: proprietary

Italian

This corpus contains articles from the Italian newspaper L’Adige from 1999 to 2006.

The corpus is available for download through META-SHARE.

Download

"LA REPUBBLICA" CORPUS

Size: 380 million tokens
Annotation: tokenised, PoS-tagged, lemmatised
Licence: CC-BY

Italian

The corpus contains articles from the Italian newspaper La Repubblica.

The corpus is available through the noSketch Engine concordancer.

Concordancer

WItaC - NewsReader Wikinews Italian Corpus

Size: 40,231 tokens
Annotation: entities, events, event factuality, temporal information, semantic roles, and intra-document and cross-document event and entity coreference
Licence: CC-BY

Italian

This corpus contains Italian translations of 120 English Wikinews articles.

The corpus is available for download from a dedicated website.

For the relevant publication, see Minard et al. (2016).

Download

Corpus of Contemporary Serbian Newspapers and Magazines

Size: 916 million tokens
Annotation: tokenised, PoS-tagged and lemmatised
Licence: CC-BY

Serbian

This corpus contains articles from over 100 Serbian newspapers from 2004 to 2012.

For access, contact the resource manager.

Multilingual Corpora

Corpus Language Description Availability

Europeana Newspapers NER Corpora

Size: 500,000 tokens (182,483 Dutch; 207,000 French; 96,735 German)
Annotation: named entities
Licence: CC-ZERO

Dutch, French and German

This corpus contains articles from Europeana newspapers for the following time periods: 1811-1856 for the Dutch subcorpus, 1871-1916 for the French subcorpus, and 1926 for the German subcorpus.

The corpus is available for download from the KB Lab.

For the relevant publication, see Neudecker (2016).

Download

Timestamped JSI web corpus

Size: 35 billion tokens
Annotation: tokenised, PoS-tagged

18 languages

This corpus contains articles from the JSI Newsfeed from 2014 to 2017.

The corpus is available through noSketch Engine.

For the relevant publication, see Bušta et al. (2017).

Concordancer

Additional Materials

CLARIN-PLUS workshop: 'Working with Digital Collections of Newspapers', 19-21 September 2016, Leuven, Belgium. [html]

Videolectures of the CLARIN-PLUS workshop. [html]

Workshop 'Hacking the News: from digitised newspapers to the archived-web: an introductory workshop to text and data-mining', 5-6 March 2018, Helsinki, Finland. [html]

Slides for 'Hacking the News' workshop. [gdoc]

Publications on the Newspaper Corpora

[Bušta et al. 2017] Jan Bušta, Ondřej Herman, Miloš Jakubíček, Simon Krek, Blaž Novak. 2017. JSI Newsfeed Corpus. [pdf]

[Minard et al. 2016] Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, Chantal van Son. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus.

[Neudecker 2016] Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers.