CLARIN Resource Families: Spoken Corpora

Submitted by Linda Stokman on 17 May 2021

The CLARIN Resource Families initiative provides a user-friendly overview of the available language resources in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies.

This month, CLARIN highlights spoken corpora.

Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues, and are often aligned with the accompanying recordings. They are an invaluable resource for many fields of linguistic research, including phonology, conversation analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata.

There are 90 spoken corpora in the CLARIN infrastructure: 79 contain both the transcriptions and the associated recordings, and 11 contain only the transcriptions.

See the overview