Corpora of Disordered Speech

The CLARIN Resource Family ‘Corpora for Speech with Disorders’ (CSD) covers a specific kind of speech data within the CLARIN research infrastructure: recordings of individuals with communication disorders.

These CSD are invaluable resources for education and research. However, they are costly and hard to build, and they can be difficult to share because of issues such as preserving the privacy and confidentiality of the participants and the extra work and cost that may be required to format the datasets for sharing in a comparable form and hosting them in a repository. Overcoming these challenges is important, as sharing data enables better science in the future. Re-analysis of raw data improves the reproducibility and robustness of research. The availability of datasets allows other research teams to answer different research questions, which maximizes the value of the data collected and in turn increases the impact of the original investigators' research. The availability of data also facilitates systematic reviews and meta-analyses. Comparable datasets can be pooled to form a larger dataset that permits more sophisticated analyses. Pooling similar datasets also allows cross-linguistic research between countries and the investigation of rare conditions, for which a single research centre often cannot collect data from a sufficient number of participants. Hence, it would greatly benefit the discipline if more researchers in clinical linguistics and phonetics, and in speech and language therapy (or speech-language pathology), considered sharing speech data. This CLARIN Resource Family is designed to serve exactly this objective.

To establish this Resource Family page, three action lines were followed.

We made an inventory of the material (datasets and resources) offered through DELAD and CLARIN centres with expertise in CSD;
1. For DELAD we started from the resources listed at https://delad.ruhosting.nl/wordpress/data-inventory/ and consulted the contributors.
2. For TalkBank we concentrated on the resources in https://phonbank.talkbank.org/access/Clinical/ and the relevant resources in TalkBank's Clinical Banks and FluencyBank.
We made an inventory of any other relevant datasets in CLARIN’s Virtual Language Observatory (VLO).
We made an inventory of datasets in other CRFs which may qualify as members of the new CRF by contacting the right holders;
We issued a questionnaire in which we asked everyone in DELAD’s and CLARIN’s networks to contribute relevant data sources.

Furthermore, a special CMDI profile for this resource family was created in the CMDI component registry. The profile was derived from the CorpusCollection profile and extended with two additional metadata elements: LanguageDisorders and SpeechSoundDisorders. The values for these metadata elements are provided as closed sets to foster interoperability and can be viewed here. The name of the profile is CorpusCollection_CSD. For those corpora not yet in the VLO, we created metadata files with this profile to make them visible and accessible via the VLO. We recommend that other contributors to this resource family use this profile, as should other contributors of recordings of individuals with communication disorders to the CLARIN Virtual Language Observatory.
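As a rough illustration of how closed value sets support interoperability, the following Python sketch builds a simplified record carrying the two additional elements and rejects values outside the allowed set. It is not a conformant CMDI record: the vocabulary entries, the `Name` element, and the `build_record` helper are hypothetical placeholders, and the actual value sets and structure are those defined by the CorpusCollection_CSD profile in the CMDI component registry.

```python
import xml.etree.ElementTree as ET

# Hypothetical closed vocabularies, for illustration only. The real value
# sets are defined in the CorpusCollection_CSD profile.
ALLOWED = {
    "LanguageDisorders": {"aphasia", "developmental language disorder", "none"},
    "SpeechSoundDisorders": {"dysarthria", "stuttering", "none"},
}


def build_record(name: str, language_disorder: str,
                 speech_sound_disorder: str) -> ET.Element:
    """Build a simplified (non-CMDI-conformant) record with the two elements."""
    values = {
        "LanguageDisorders": language_disorder,
        "SpeechSoundDisorders": speech_sound_disorder,
    }
    # Enforce the closed sets: any value outside the vocabulary is rejected.
    for element, value in values.items():
        if value not in ALLOWED[element]:
            raise ValueError(f"{value!r} is not in the closed set for {element}")

    root = ET.Element("CorpusCollection_CSD")
    ET.SubElement(root, "Name").text = name  # hypothetical descriptive field
    for element, value in values.items():
        ET.SubElement(root, element).text = value
    return root


record = build_record("Example corpus", "aphasia", "none")
print(ET.tostring(record, encoding="unicode"))
```

Because every record draws its values from the same fixed lists, a harvester such as the VLO can filter on these elements without having to reconcile free-text spellings.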

Contact persons: N.Bessell [at] ucc.ie (Nicola Bessell) and a.lee [at] ucc.ie (Alice Lee) (DELAD, K-centre ACE)

For comments, changes to the existing content, or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Corpora of disordered speech in the CLARIN infrastructure

Corpus	Language	Description	Availability
AphasiaBank Size: 380 MB transcripts, 827 GB media Annotation: CHAT and CA/CHAT Licence: email request for access	Cantonese, Croatian, English, French, German, Greek, Hungarian, Italian, Japanese, Mandarin, Romanian, Spanish	This is a corpus of multimedia interactions for the study of communication in aphasia. Access to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
Croatian corpus of non-professional written language by typical speakers and speakers with language disorders RAPUT 1.0 Size: 6760 texts, 34469 sentences, 426187 tokens Annotation: MULTEXT-East tagset Licence: CC-BY-SA 4.0	Croatian	The corpus consists of texts produced by nonprofessional typical speakers and speakers with different language disorders (developmental language disorder, dyslexia, traumatic brain injury, aphasia, other). Roughly half of the corpus consists of texts of typical speakers, and the other half of speakers with language disorders. Language samples were elicited by six groups of tasks representing different writing styles (descriptive, expository, narrative, and letter) and different levels of formality. For the relevant publication, see Kuvač Kraljević et al. (2021)	Download
ADHD and SLI corpus UvA database Size: 4 GB (67 recordings) of 26 Dutch children with ADHD, 19 Dutch children with SLI, 22 Dutch control children Annotation: Transcriptions (CHAT-format) Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)	Dutch	This corpus aims to compare the language and executive functioning profiles of children with ADHD to children with Specific Language Impairment and children with Tourette’s Disorder.	Download
Bilingual deaf children RU-Kentalis database Size: 4 GB complete video recordings. 1 GB selected parts video recordings. 0,1 GB selected parts transcripts. 0,5 GB test and background data of 11 deaf children, longitudinal, 104 recordings Annotation: CHAT-like format for 104 recordings Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)	Dutch	The corpus is used for investigating the bilingual language and communication development of young deaf children in Sign Language of the Netherlands (SLN) and Dutch. For the relevant publication, see Klatter-Folmer et al. (2016)	Download
SLI RU-Kentalis database Size: 2 GB Annotation: Praat transcripts Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)	Dutch	The corpus has been collected to investigate the expression of spatial relations by children with SLI and normally developing children in their spoken language production.	Download
Dutch Corpus of Pathological and Normal Speech (COPAS) Size: 319 speakers of which 122 normal controls and 197 with a speech disorder. Corpus size: 1.3 GB Annotation: Orthographic transcription Licence: Academic, bespoke	Dutch (Flemish)	This corpus has been constructed within the framework of the project Speech Algorithms for Clinical and Educational applications (SPACE). For the relevant publication, see Middag et al. (2010)	Download
FluencyBank Size: 481 MB transcripts, 207 GB media Annotation: CHAT and CA/CHAT Licence: email request for access	Dutch, English, French, German	This corpus is intended for the study of fluency development. Participants include typically-developing monolingual and bilingual children, children and adults who stutter (C/AWS) or who clutter (C/AWC), and second language learners. Access to the research data in FluencyBank is password protected and restricted to members of the FluencyBank consortium group, although a subset of the corpus is publicly available. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
ASDBank Size: 42 MB transcripts, 401 MB media Annotation: CHAT and CA/CHAT Licence: open access	Dutch, English, French, Greek, Mandarin, Spanish	This is a corpus of multimedia interactions for the study of communication in autism-spectrum disorder. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
Deaf adults RU database Size: 2 GB of 46 deaf Dutch adults, 38 hearing Turkish adults, 24 hearing Moroccan adults, 10 Dutch controls Licence: CLARIN PUB (Transcriptions), CLARIN RESTRICTED (Recordings)	Dutch, Turkish, Moroccan	This corpus aims at the investigation of the acquisition of Dutch by deaf Dutch adults (late L1/early L2) and its comparison with acquisition by hearing Turkish and Moroccan-Arabic speakers. For the relevant publication, see Parriger (2012)	Download
TBIBank Size: 63 MB transcripts, 98 GB media Annotation: CHAT and CA/CHAT Licence: email request for access	English	This is a corpus of multimedia interactions for the study of communication in people with traumatic brain injury. Access to the data in TBIBank is password protected and restricted to members of the TBIBank consortium group. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
PsychosisBank Size: Not available Annotation: CHAT and CA/CHAT Licence: email request for access	English (various dialects), Spanish	This is a corpus intended for the study of language in psychosis. The site is noted as under construction. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.
Alzheimer's Dementia Recognition through Spontaneous Speech (audio only): The ADReSSo Challenge Annotation: CHAT and CA/CHAT Licence: email request for access	English, German, Mandarin, Spanish, Taiwanese	This is a corpus of multimedia interactions for the study of communication in dementia. Access to the data in DementiaBank is password protected and restricted to members of the DementiaBank consortium group. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
RHDBank Size: 30 MB transcripts, 28 GB media Annotation: CHAT and CA/CHAT Licence: email request for access	English, Spanish	This is a corpus of multimedia interactions for the study of communication in people with Right Hemisphere Damage (RHD). Access to the data in RHDBank is password protected and restricted to members of the RHDBank consortium group. Data in TalkBank use a consistent XML-compatible representation called CHAT. All of the data is transcribed in CHAT and CA/CHAT formats.	Browse
DemCorpus-Basilicata: Dementia Corpus Size: 08:50 hours Licence: Processed data available by request	Italian	This corpus consists of semi-spontaneous speech data produced by elderly residents of the Basilicata region in Italy. In total, 40 individuals participated. The patient group consists of 20 participants with a diagnosis of dementia (9 cases of Alzheimer’s disease, 2 patients with mixed dementia, 5 patients with not-further-specified dementia, 3 patients with vascular dementia, and 1 patient with frontotemporal dementia). The control group consists of 20 healthy individuals matched for age, gender, and geographical origin. Three linguistic tasks were administered to all participants: two narrative tasks (the first about an excursion or a trip, the second about Christmas festivities) and an image description task. This resulted in 8 hours and 50 minutes of recorded semi-spontaneous speech, which was then transcribed, segmented, and annotated using ELAN. For the relevant publication, see Martinelli et al. (2022)
ItaASD: Italian speech corpus Autism Spectrum Disorder Size: 04:19 hours Annotation: Orthographic	Italian	This is a corpus of semi-spontaneous speech produced by 34 children between 6 and 13 years of age, residents in the Campania region of Italy. Half of the participating children were diagnosed with high-functioning Autism Spectrum Disorder, and the other half were neurotypical children matched for age, gender, and geographical origin. All participants were administered three tasks: a complex image description task, a story-telling task, and a story-retelling task. This resulted in 4 hours and 19 minutes of recorded speech, which was then transcribed and annotated using ELAN.
OPLON: Opportunities for active and healthy LONgevity Size: 06:50 hours	Italian	This corpus consists of semi-spontaneous speech data collected from 96 elderly participants who were divided into two groups: the pathological and the control group. The pathological group refers to three categories: (i) 16 participants with amnestic Mild Cognitive Impairment (MCI), (ii) 16 participants with multiple-domain MCI, and (iii) 16 participants with Early Dementia (probable Alzheimer Dementia, Fronto-Temporal Dementia, Mixed Dementia, and Lewy Body Dementia). The control group includes 48 healthy individuals matched for gender, age, educational level, and geographical origin. The corpus was subjected to PoS Tagging and Dependency Parsing (CoNLL format).
Polish Cued Speech Corpus of Hearing-Impaired Children Size: 20 children (11 girls and 9 boys) Annotation: CHAT format Licence: open access or through email request for access	Polish	This is a corpus of recordings of the DIA (Dutch Intelligibility Assessment). The corpus also contains a variety of other samples, such as reading passages, isolated sentences, and recordings of spontaneous speech. The corpus contains samples of 187 speakers with a speech disorder and samples of 122 speakers without a speech disorder.	Download

Other corpora of disordered speech

Corpus	Language	Description	Availability
Perceptual Voice Qualities Database Size: 296 audio files of varying sizes Licence: CC 4.0	English	This corpus contains voice samples which have been rated by experienced voice professionals (at least 3 different raters with a minimum of 2 years’ clinical experience) in order to provide educators with standardized materials to better train pre-service clinical voice professionals. For the relevant publication, see Kempster (2007)	Browse or download
TORGO Size: The original TORGO database contains 18 GB of data Licence: CC-BY	English	This is a corpus of dysarthric articulation. It consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), which are two of the most prevalent causes of speech disability, and from matched controls. This dataset contains 2000 samples for dysarthric males, dysarthric females, non-dysarthric males, and non-dysarthric females. For the relevant publication, see Rudzicz et al. (2012)	Browse
University College London Archive of Stuttered Speech (UCLASS) Size: 56 files Annotation: None Licence: open access	English	This corpus consists of data from a study by Howell, Davis, Bartrip, and Wormald (2004). The study looked at the fluency-enhancing effects of speaking at the same time as a frequency-shifted version of the voice. There were 14 speakers and four recordings per speaker, making 56 files in all. Recordings are in SFS format. The four recordings for a speaker were for two texts and two readings of each text. For the relevant publication, see Howell et al. (2004)	Download
Speech Exemplar and Evaluation Database (SEED) Licence: Access by registration	English (American)	This corpus includes recordings of single words and continuous speech samples that provide examples of speakers with and without speech disorders. For the relevant publication, see Atkins et al. (2020)	Browse
STAR Child speech-error database Size: 162 audio files Annotation: orthographic, phonemic, phonetic Licence: CC BY-NC-ND	English (Scottish)	This is a collection of multiple audio-articulatory speech-disorder corpora. The corpus consists of composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio. Recordings in this database are of single words, or short phrases, produced by child speakers who were either reading orthographic stimuli from a screen, naming pictures, or repeating words produced by a researcher. Phonemic transcriptions are provided so that those who are not familiar with the (rhotic) central Scottish accent can be aware of the speech sound targets. For the relevant publication, see Lawson et al. (2023)	Browse
STAR Disordered child-speech sentences database Size: 18 speakers Annotation: orthographic, phonemic, phonetic Licence: CC BY-NC-ND	English (Scottish)	This is a collection of multiple audio-articulatory speech-disorder corpora. Database items are composite videos containing (i) midsagittal tongue movement, imaged with ultrasound tongue imaging (UTI), (ii) optional profile lip movement, recorded with a headset-mounted camera, and (iii) synchronised audio. Recordings in this database are of sentences produced by child speakers (aged 6,1-13,4) who were either reading orthographic stimuli from a screen, or repeating sentences produced by a researcher. Diagnoses are based on clinicians' reports. For the relevant publication, see Lawson et al. (2023)	Browse
The Cleft Dataset Size: 11 speakers Annotation: Orthographic, phonetic Licence: open access	English (Scottish)	This is a corpus of ultrasound and audio recorded with children with cleft lip and palate. For the relevant publication, see Cleland et al. (2020)	Download
Ultraphonix Size: 19 hours Annotation: Orthographic, phonetic Licence: open access	English (Scottish)	This is a corpus of ultrasound and audio recordings from children with speech sound disorders. It contains data from 20 speakers (16 male, 4 female), aged 6-13 years. For the relevant publication, see Eshky et al. (2018)	Download
Ultrax 2020 Dataset Size: 37 speakers Annotation: Orthographic, phonetic Licence: open access	English (Scottish)	This is a corpus of ultrasound tongue imaging and audio data, gathered from children with speech sound disorders by speech and language therapists in hospital environments. It contains data from 37 speakers (11 female, 26 male), aged 5-12 years. There is one recording per child. The following are available for each recording: speech waveform, raw ultrasound data, ultrasound parameters, and prompt text with date/time of utterance recording. For the relevant publication, see Eshky et al. (2018)	Download
Ultrax Speech Sound Disorders Size: 11 hours Annotation: Orthographic, phonetic Licence: open access	English (Scottish)	This is a corpus of ultrasound and audio recordings from children with speech sound disorders. It contains data from 8 speakers (2 female and 6 male), aged 5-10 years. For the relevant publication, see Eshky et al. (2018)	Download
Phonological Development Tools and Cross-Linguistic Phonology Project Size: 4 speakers for transcription resource Annotation: Phonemic and phonetic transcription Licence: CC 4.0 Non-commercial	English, French, Spanish, Mandarin, Cantonese, Slovenian	This corpus is used to investigate phonological development across languages, and to evaluate intervention outcomes given a nonlinear phonological approach and ultrasound intervention outcomes across speech disorders.	Browse
Plan-V Aphasia Corpus Size: 1.84 MB Annotation: Sentence, utterance, clause, POS Licence: CC-BY 4.0	Greek (Modern)	This corpus contains spoken discourse data collected from Greek-speaking People with Aphasia (PWA) and from neurotypical adults. For the relevant publication, see Stamouli et al. (2023)	Download
EWA DB Early Warning of Alzheimers speech database Size: 150 hours Licence: Non-commercial and commercial options	Slovak	This corpus contains data from 3 clinical groups: Alzheimer's disease, Parkinson's disease, mild cognitive impairment, and a control group of healthy subjects. Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures).
Ahoslabi-esophageal speech database Size: 10.8 hours Licence: Non Commercial Use - ELRA END USER	Spanish, Castilian	This corpus primarily consists of recordings of 31 laryngectomees (27 males and 4 females) pronouncing 100 phonetically balanced sentences. Esophageal voices were recorded in a soundproof recording cubicle with a Neumann microphone. The corpus also includes parallel recordings of the sentences by 9 healthy speakers (6 males and 3 females) to facilitate speech processing tasks that require small parallel corpora, such as voice conversion or synthetic speech adaptation. Apart from the sentences, the database also contains 4 sustained vowels and a small set of 14 isolated words, which can be very valuable for research on esophageal speech analysis, diagnosis, and evaluation. For the relevant publication, see Serrano García (2021)
The SSNCE Database of Tamil Dysarthric Speech Size: 30 speakers Annotation: phonetic Licence: LDC	Tamil	This is a corpus of Tamil dysarthric speech. The corpus contains approximately eight hours of Tamil speech data, time-aligned transcripts, and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers). The non-dysarthric speakers consisted of five female and five male subjects. The dysarthric speakers (7 female, 13 male) reported a diagnosis of cerebral palsy and ranged in age from 12 to 37 years old. In total, each speaker recorded 365 utterances consisting of single words and of sentences that included a combination of common and uncommon Tamil phrases. The corpus includes time-aligned phonetic transcripts for all collected speech data. Additional documentation includes phoneme mappings and speaker metadata. Audio data is presented as 16-bit 16 kHz FLAC-compressed linear PCM WAV. Transcripts are presented as UTF-8 encoded plain text.	Download