Corpora of Disordered Speech

The CLARIN Resource Family ‘Corpora for Speech with Disorders’ (CSD) covers a specific kind of speech data within the CLARIN research infrastructure: recordings of individuals with communication disorders.

These CSD are invaluable resources for education and research. However, they are costly and hard to build, and they can be difficult to share because of issues such as preserving the privacy and confidentiality of the participants and the extra work and cost that may be required to format the datasets for comparable sharing and hosting in a repository. Overcoming these challenges is important, as sharing data enables better science in the future. Re-analysis of raw data improves the reproducibility and robustness of research. The availability of datasets allows other research teams to answer different research questions, which maximizes the value of the data collected and in turn increases the impact of the original investigators' research. The availability of data also facilitates systematic reviews and meta-analyses. Comparable datasets can be pooled to form a larger dataset that permits more sophisticated analyses. Pooling similar datasets also allows cross-linguistic research between countries and the investigation of rare conditions, for which a single research centre often cannot collect data from a sufficient number of participants. Hence, the discipline would benefit greatly if more researchers in clinical linguistics and phonetics, and in speech and language therapy (or speech-language pathology), considered sharing speech data. This CLARIN Resource Family is designed to serve exactly this objective.

To establish this Resource Family page, the following action lines were followed:

  1. We made an inventory of the material (datasets and resources) offered through DELAD and CLARIN centres with expertise in CSD:
    1. For DELAD, we started from the resources listed at https://delad.ruhosting.nl/wordpress/data-inventory/ and consulted the contributors;
    2. For TalkBank, we concentrated on the resources at https://phonbank.talkbank.org/access/Clinical/ and the relevant resources in TalkBank's Clinical Banks and FluencyBank;
  2. We made an inventory of any other relevant datasets in CLARIN’s Virtual Language Observatory (VLO);
  3. We made an inventory of datasets in other CRFs which may qualify as members of the new CRF by contacting the rights holders;
  4. We issued a questionnaire in which we asked everyone in DELAD’s and CLARIN’s networks to contribute relevant data sources.

Furthermore, a special CMDI profile for this resource family was created in the CMDI component registry. The profile, named CorpusCollection_CSD, was derived from the CorpusCollection profile and extended with two additional metadata elements: LanguageDisorders and SpeechSoundDisorders. The values for these metadata elements are provided as closed sets to foster interoperability and can be viewed here. For corpora not yet included, we created metadata files with this profile to make them visible and accessible via the VLO. We recommend that other contributors to this resource family use this profile, as should other contributors to the CLARIN Virtual Language Observatory who hold recordings of individuals with communication disorders.

Contact persons: Nicola Bessell (N.Bessell [at] ucc.ie) and Alice Lee (a.lee [at] ucc.ie) (DELAD, K-centre ACE)

For comments, changes to the existing content, or the inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Corpora of disordered speech in the CLARIN infrastructure

Corpus Language Description Availability

AphasiaBank

Size: 380 MB transcripts, 827 GB media 
Annotation: CHAT and CA/CHAT 
Licence: email request for access

Cantonese, Croatian, English, French, German, Greek, Hungarian, Italian, Japanese, Mandarin, Romanian, Spanish

This is a corpus of multimedia interactions for the study of communication in aphasia.

Access to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0

Size: 6760 texts, 34469 sentences, 426187 tokens 
Annotation: MULTEXT-East tagset 
Licence: CC-BY-SA 4.0

Croatian

The corpus consists of texts produced by nonprofessional typical speakers and speakers with different language disorders (developmental language disorder, dyslexia, traumatic brain injury, aphasia, other).

Roughly half of the corpus consists of texts of typical speakers, and the other half of speakers with language disorders.

Language samples were elicited by six groups of tasks representing different writing styles (descriptive, expository, narrative, and letter) and different levels of formality.

For the relevant publication, see Kuvač Kraljević et al. (2021)

Download

ADHD and SLI corpus UvA database

Size: 4 GB (67 recordings) of 26 Dutch children with ADHD, 19 Dutch children with SLI, 22 Dutch control children 
Annotation: Transcriptions (CHAT-format) 
Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)

Dutch

This corpus aims to compare the language and executive functioning profiles of children with ADHD to children with Specific Language Impairment and children with Tourette’s Disorder.

Download

Bilingual deaf children RU-Kentalis database

Size: 4 GB complete video recordings. 1 GB selected parts video recordings. 0.1 GB selected parts transcripts. 0.5 GB test and background data of 11 deaf children, longitudinal, 104 recordings 
Annotation: CHAT-like format for 104 recordings 
Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)

Dutch

The corpus is used for investigating the bilingual language and communication development of young deaf children in Sign Language of the Netherlands (SLN) and Dutch.

For the relevant publication, see Klatter-Folmer et al. (2016)

Download

SLI RU-Kentalis database

Size: 2 GB 
Annotation: Praat transcripts 
Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)

Dutch

The corpus was collected to investigate the expression of spatial relations by children with SLI and normally developing children in their spoken language production.

Download

Dutch Corpus of Pathological and Normal Speech (COPAS) 

Size: 319 speakers, of whom 122 are normal controls and 197 have a speech disorder. Corpus size: 1.3 GB 
Annotation: Orthographic transcription 
Licence: Academic, bespoke

Dutch (Flemish)

This corpus has been constructed within the framework of the project Speech Algorithms for Clinical and Educational applications (SPACE).

For the relevant publication, see Middag et al. (2010)

Download

FluencyBank

Size: 481 MB transcripts, 207 GB media 
Annotation: CHAT and CA/CHAT 
Licence: email request for access

Dutch, English, French, German

This corpus is intended for the study of fluency development.

Participants include typically-developing monolingual and bilingual children, children and adults who stutter (C/AWS) or who clutter (C/AWC), and second language learners.

Access to the research data in FluencyBank is password protected and restricted to members of the FluencyBank consortium group, although a subset of the corpus is publicly available.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

ASDBank

Size: 42 MB transcripts, 401 MB media 
Annotation: CHAT and CA/CHAT 
Licence: open access

Dutch, English, French, Greek, Mandarin, Spanish

This is a corpus of multimedia interactions for the study of communication in autism-spectrum disorder.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

Deaf adults RU database

Size: 2 GB of 46 deaf Dutch adults, 38 hearing Turkish adults, 24 hearing Moroccan adults, 10 Dutch controls 
Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)

Dutch, Turkish, Moroccan

This corpus was compiled to investigate the acquisition of Dutch by deaf Dutch adults (late L1/early L2) in comparison with hearing Turkish and Moroccan-Arabic adults.

For the relevant publication, see Parriger (2012)

Download

TBIBank

Size: 63 MB transcripts, 98 GB media 
Annotation: CHAT and CA/CHAT 
Licence: email request for access

English

This is a corpus of multimedia interactions for the study of communication in people with traumatic brain injury.

Access to the data in TBIBank is password protected and restricted to members of the TBIBank consortium group.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

PsychosisBank

Size: Not available 
Annotation: CHAT and CA/CHAT 
Licence: email request for access

English (various dialects), Spanish

This is a corpus intended for the study of language in psychosis.

The site is noted as under construction.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

 

Alzheimer's Dementia Recognition through Spontaneous Speech (audio only): The ADReSSo Challenge

Annotation: CHAT and CA/CHAT 
Licence: email request for access

English, German, Mandarin, Spanish, Taiwanese

This is a corpus of multimedia interactions for the study of communication in dementia.

Access to the data in DementiaBank is password protected and restricted to members of the DementiaBank consortium group.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

RHDBank

Size: 30 MB transcripts, 28 GB media 
Annotation: CHAT and CA/CHAT 
Licence: email request for access

English, Spanish

This is a corpus of multimedia interactions for the study of communication in people with Right Hemisphere Damage (RHD).

Access to the data in RHDBank is password protected and restricted to members of the RHDBank consortium group.

Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.

Browse

DemCorpus-Basilicata: Dementia Corpus

Size: 08:50 hours 
Licence: Processed data available by request

Italian

This corpus consists of semi-spontaneous speech data produced by elderly residents of the Basilicata region in Italy.

In total, 40 individuals participated. The patient group consists of 20 participants with a diagnosis of dementia (9 cases of Alzheimer’s disease, 2 patients with mixed dementia, 5 patients with not-further-specified dementia, 3 patients with vascular dementia, and 1 patient with frontotemporal dementia).

The control group consists of 20 healthy individuals matched for age, gender, and geographical origin. Three linguistic tasks were administered to all participants: two narrative tasks (the first about an excursion or a trip, the second about Christmas festivities) and an image description task. This resulted in 8 hours and 50 minutes of recorded semi-spontaneous speech, which was then transcribed, segmented, and annotated using ELAN.

For the relevant publication, see Martinelli et al. (2022)

 

ItaASD: Italian speech corpus Autism Spectrum Disorder

Size: 04:19 hours 
Annotation: Orthographic

Italian

This is a corpus of semi-spontaneous speech produced by 34 children between 6 and 13 years of age, resident in the Campania region of Italy.

Half of the participating children were diagnosed with high-functioning Autism Spectrum Disorder, and the other half were neurotypical children matched for age, gender, and geographical origin.

All participants were administered three tasks: a complex image description task, a story-telling task, and a story-retelling task. This resulted in 4 hours and 19 minutes of recorded speech, which was then transcribed and annotated using ELAN.

 

OPLON: Opportunities for active and healthy LONgevity

Size: 06:50 hours

Italian

This corpus consists of semi-spontaneous speech data collected from 96 elderly participants who were divided into two groups: the pathological and the control group.

The pathological group refers to three categories: (i) 16 participants with amnestic Mild Cognitive Impairment (MCI), (ii) 16 participants with multiple-domain MCI, and (iii) 16 participants with Early Dementia (probable Alzheimer Dementia, Fronto-Temporal Dementia, Mixed Dementia, and Lewy Body Dementia).

The control group includes 48 healthy individuals matched for gender, age, educational level, and geographical origin. The corpus was subjected to PoS Tagging and Dependency Parsing (CoNLL format).

 

Polish Cued Speech Corpus of Hearing-Impaired Children

Size: 20 children (11 girls and 9 boys) 
Annotation: CHAT format 
Licence: open access or through email request for access

Polish

This is a corpus of recordings of the DIA (Dutch Intelligibility Assessment).

The corpus also contains a variety of other samples like reading passages, isolated sentences and recordings of spontaneous speech.

The corpus contains samples of 187 speakers with a speech disorder and samples of 122 speakers without a speech disorder.

Download

Other corpora of disordered speech

Corpus Language Description Availability

Perceptual Voice Qualities Database

Size: 296 audio files of varying sizes 
Licence: CC 4.0

English

This corpus contains voice samples which have been rated by experienced voice professionals (at least 3 different raters with a minimum of 2 years’ clinical experience) in order to provide educators with standardized materials to better train pre-service clinical voice professionals.

For the relevant publication, see Kempster (2007)

Browse or download

TORGO

Size: 18 GB (original TORGO database) 
Licence: CC-BY

English

This is a corpus of dysarthric articulation and consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), which are two of the most prevalent causes of speech disability, and matched controls.

This dataset contains 2000 samples for dysarthric males, dysarthric females, non-dysarthric males, and non-dysarthric females.

For the relevant publication, see Rudzicz et al. (2012)

Browse

University College London Archive of Stuttered Speech (UCLASS)

Size: 56 files 
Annotation: None 
Licence: open access

English

This corpus consists of data from a study by Howell, Davis, Bartrip, and Wormald (2004).

The study looked at the fluency-enhancing effects of speaking at the same time as a frequency-shifted version of the voice.

There were 14 speakers and four recordings per speaker, making 56 files in all. Recordings are in SFS format. The four recordings for each speaker cover two texts and two readings of each text.

For the relevant publication, see Howell et al. (2004)

Download

Speech Exemplar and Evaluation Database (SEED)

Licence: Access by registration

English (American)

This corpus includes recordings of single words and continuous speech samples that provide examples of speakers with and without speech disorders.

For the relevant publication, see Atkins et al. (2020)

Browse

STAR Child speech-error database

Size: 162 audio files 
Annotation: orthographic, phonemic, phonetic 
Licence: CC BY-NC-ND

English (Scottish)

This is a collection of multiple audio-articulatory speech-disorder corpora.

The corpus consists of composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio.

Recordings in this database are of single words or short phrases produced by child speakers who were either reading orthographic stimuli from a screen, naming pictures, or repeating words produced by a researcher. Phonemic transcriptions are provided so that those who are not familiar with the (rhotic) central Scottish accent can identify the speech sound targets.

For the relevant publication, see Lawson et al. (2023)

Browse

STAR Disordered child-speech sentences database

Size: 18 speakers 
Annotation: orthographic, phonemic, phonetic 
Licence: CC BY-NC-ND

English (Scottish)

This is a collection of multiple audio-articulatory speech-disorder corpora.

Database items are composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio.

Recordings in this database are of sentences produced by child speakers (aged 6,1-13,4) who were either reading orthographic stimuli from a screen, or repeating sentences produced by a researcher. Diagnoses are based on clinicians' reports.

For the relevant publication, see Lawson et al. (2023)

Browse

The Cleft Dataset

Size: 11 speakers 
Annotation: Orthographic, phonetic 
Licence: open access

English (Scottish)

This is a corpus of ultrasound and audio recordings from children with cleft lip and palate.

For the relevant publication, see Cleland et al. (2020)

Download

Ultraphonix 

Size: 19 hours 
Annotation: Orthographic, phonetic 
Licence: open access

English (Scottish)

This is a corpus of ultrasound and audio recordings from children with speech sound disorders. It contains data from 20 speakers (16 male, 4 female), aged 6-13 years.

For the relevant publication, see Eshky et al. (2018)

Download

Ultrax 2020 Dataset

Size: 37 speakers 
Annotation: Orthographic, phonetic 
Licence: open access

English (Scottish)

This is a corpus of ultrasound tongue imaging and audio data, gathered from children with speech sound disorders by speech and language therapists in hospital environments.

It contains data from 37 speakers (11 female and 26 male), aged 5-12 years. There is one recording per child.

The following metadata are available for each recording: speech waveform, raw ultrasound data, ultrasound parameters, and prompt text with date/time of utterance recording.

For the relevant publication, see Eshky et al. (2018)

Download

Ultrax Speech Sound Disorders

Size: 11 hours 
Annotation: Orthographic, phonetic 
Licence: open access

English (Scottish)

This is a corpus of ultrasound and audio recordings from children with speech sound disorders.

It contains data from 8 speakers (2 female and 6 male), aged 5-10 years.

For the relevant publication, see Eshky et al. (2018)

Download

Phonological Development Tools and Cross-Linguistic Phonology Project

Size: 4 speakers for transcription resource 
Annotation: Phonemic and phonetic transcription 
Licence: CC 4.0 Non-commercial

English, French, Spanish, Mandarin, Cantonese, Slovenian

This corpus is used for investigating phonological development across languages, and for evaluating intervention outcomes given a nonlinear phonological approach and ultrasound intervention outcomes across speech disorders.

Browse

Plan-V Aphasia Corpus

Size: 1.84 MB 
Annotation: Sentence, utterance, clause, POS 
Licence: CC-BY 4.0

Greek (Modern)

This corpus contains spoken discourse data collected from Greek-speaking People with Aphasia (PWA) and from neurotypical adults.

For the relevant publication, see Stamouli et al. (2023)

Download

EWA DB Early Warning of Alzheimers speech database

Size: 150 hours 
Licence: Non-commercial and commercial options

Slovak

This corpus contains data from three clinical groups (Alzheimer's disease, Parkinson's disease, and mild cognitive impairment) and a control group of healthy subjects.

Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures).

 

Ahoslabi-esophageal speech database

Size: 10.8 hours 
Licence: Non Commercial Use - ELRA END USER

Spanish, Castilian

This corpus primarily consists of recordings of 31 laryngectomees (27 males and 4 females) pronouncing 100 phonetically balanced sentences.

Esophageal voices were recorded in a soundproof recording cubicle with a Neumann microphone.

The corpus also includes parallel recordings of the sentences by 9 healthy speakers (6 males and 3 females) to facilitate speech processing tasks that require small parallel corpora, such as voice conversion or synthetic speech adaptation. Apart from the sentences, the database also contains 4 sustained vowels and a small set of isolated words (14) which can be very valuable for research on esophageal speech analysis, diagnosis and evaluation.

For the relevant publication, see Serrano García (2021)

 

The SSNCE Database of Tamil Dysarthric Speech

Size: 30 speakers 
Annotation: phonetic 
Licence: LDC

Tamil

This is a corpus of Tamil Dysarthric Speech.

The corpus contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).

The non-dysarthric speakers consisted of five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 to 37 years old.

In total, each speaker recorded 365 utterances consisting of single words and of sentences that included a combination of common and uncommon Tamil phrases.

The corpus includes time-aligned phonetic transcripts for all collected speech data. Additional documentation includes phoneme mappings and speaker metadata. Audio data is presented as 16-bit, 16 kHz FLAC-compressed linear PCM WAV. Transcripts are presented as UTF-8 encoded plain text.

Download