L2 Learner Corpora

L2 learner corpora play a crucial role in second language research and pedagogy, allowing for a systematic study of how learners of a second language acquire the new language on a lexical as well as syntactic level, and how this acquisition is influenced by their native language. A special characteristic of this type of corpus is the markup of learners' errors and prosodic features.

The CLARIN infrastructure provides access to 75 L2 learner corpora. Fourteen corpora are multilingual, while the rest provide written, spoken and even videotaped forms of monolingual L2 data in the following languages: Arabic, Czech, English, Finnish, French, German, Hungarian, Icelandic, Italian, Mandarin, Norwegian, Spanish, and Swedish. Many of these corpora are available through public licences.

We first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

 

Monolingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

CzeSL – Czech as a Second Language

Size: 0.9 million words 
Annotation: tokenised, PoS-tagged, lemmatised, error labels 
Licence: CC-BY

Czech

This corpus contains essays written in 2013 by learners from 54 L1 backgrounds.

The corpus is available for download from LINDAT.

For the relevant publication, see Rosen (2016).

Download

British Academic Written English Corpus

Size: 2761 texts 
Licence: CC-BY

English

This is primarily an L1 corpus, although it also contains L2 texts.

The corpus is available for download from the University of Oxford Text Archive.

Download

CORYL (Corpus of Young Learner Language)

Size: 191,568 tokens 
Annotation: tokenised, anonymised, error labels, linked to CEFR levels 
Licence: CC-BY

English

This corpus contains English texts written by Norwegian primary school pupils (7th, 10th, and 11th grade).

The corpus is available through Corpuscle, provided by CLARINO.

Browse

ETS Corpus of Non-Native Written English

Size: 12,100 essays (1100 / language) 
Licence: restricted

English

This corpus contains texts written by learners from 11 L1 backgrounds as part of an international test of academic English proficiency. Prompts as well as proficiency levels are part of the metadata.

The corpus is available for download from the LDC catalogue.

Download

ICLE International Corpus of Learner English

Size: 3 million words

English

This corpus contains texts written by learners of English from 14 L1 backgrounds.

The corpus can be 

 

The Hanken Corpus of Academic Writing

Size: 500,000 words 
Licence: CC-BY

English

This corpus contains academic texts written by Finnish and Swedish native speakers.

The corpus is still under development.

 

The Uppsala Student English corpus

Size: 1.2 million tokens 
Annotation: tokenised 
Licence: CC-BY

English

This corpus contains essays written during the first three semesters of English studies at Uppsala University; most of the essays were written during the first semester. The corpus contains text files, each with a student ID and a text ID that includes the course level, and information about the different prompts is available.

The corpus is available for download from the University of Oxford Text Archive.

Download

International Corpus of Learner Finnish (ICLFI) Corpus

Size: 1 million words 
Annotation: MSD-tagged 
Licence: CLARIN RES

Finnish

This corpus contains fictional (e.g., letters, narratives) and non-fictional (e.g., essays) texts.

The corpus provides information on a large number of variables concerning the linguistic background of the learner, the learning task, the learning context, etc. It is available through Korp.

For the relevant publication, see Jantunen (2011).

Browse

Testipiste Corpus

Size: 840,000 tokens 
Annotation: tokenised 
Licence: CLARIN RES

Finnish

This corpus contains essays written by adult migrants from various L1 backgrounds.

The corpus will be made available through Korp.

 

The Advanced Finnish Learners’ Corpus

Size: 288,000 tokens 
Annotation: tokenised, MSD-tagged, lemmatised 
Licence: CLARIN RES

Finnish

This corpus contains academic texts written by MA students and collected in 2009.

The corpus consists of two subcorpora, the Exam Essays Subcorpus and the Course Papers Subcorpus, both of which are also available through Korp.

Browse

Download

Download

Commented Learner Corpus Academic Writing

Size: 853 texts 
Licence: CC BY-NC-SA 3.0

German

This corpus contains texts written by students at the University of Hamburg from various L1 backgrounds.

The corpus is available for download through the repository of the University of Hamburg.

Download

ASK – Norsk andrespråkskorpus

Size: 618,000 tokens 
Annotation: tokenised, PoS-tagged, errors 
Licence: CLARIN RES

Norwegian

This corpus contains essays and tests written by students from 10 L1 backgrounds. It also contains L1 control essays.

The corpus is available through a dedicated online interface provided by CLARINO.

Browse

Slovene learner corpus KOST 1.0

Size: 6311 texts, 1 million words 
Annotation: annotated with rich author and text metadata 
Licence: CC BY-SA 4.0

Slovenian

KOST (Korpus slovenščine kot tujega jezika), a corpus of Slovene as a foreign language, contains 6,311 texts (just over 1 million words) written by adult speakers for whom Slovene is not their first language. The corpus offers insights into Slovene as produced by those who are still learning it as a second or foreign language, and in particular into the most common errors that occur in this process. KOST is therefore aimed at all those working with Slovene as a second or foreign language. The texts were mainly written at lectorates and in courses of Slovene as a second or foreign language.

Most of the authors of these texts speak Serbian, Bosnian and Macedonian as their first language, but texts by speakers of other languages are also included. The authors are at different proficiency levels in Slovene, from beginners to advanced. For each contributor, information is available on gender, year of birth, country, first language and other languages they speak, employment status and education, and prior experience of learning Slovene. For each text, there is also information on the time and circumstances of creation (exam or homework), the programme in which it was produced, input type (digital or hand-written), language level and the grade. For part of the corpus, the texts are also available in a corrected version, which can likewise be accessed through the concordancers (noSketchEngine or KonText). The tokens of the original and corrected texts are linked (one group of links per paragraph), and the links are categorised into 23 error types.

The corpus is available for download from the Slovenian repository CLARIN.SI and can be queried through noSketchEngine and KonText concordancers.

For the relevant publication, see Stritar Kučuk (2022).

Concordancer (noSketchEngine)

Concordancer (KonText)

Download

FinSveStud 79-80

Size: 175,000 tokens 
Annotation: tokenised, lemmatised 
Licence: CLARIN RES

Swedish

This corpus contains texts written by students with Finnish as their L1 background.

The corpus is available through Korp.

Browse

SpIn

Size: 46,911 tokens; 4,302 sentences 
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms 
Licence: CC-BY

Swedish

This corpus contains essays from a Language Introduction course for newly arrived students (256 essays; 166 students, some of whom are recurrent) – i.e., course preparation for Swedish upper-intermediate school (gymnasium-level). It is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (A1-B2). See the metadata description for further details on the automatic and manual annotation.

The corpus is available through Korp and for download in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Browse (Korp)

Online (application)

SW1203-essays

Size: 52,528 tokens; 3,145 sentences 
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms 
Licence: CC-BY

Swedish

This corpus contains essays from a preparatory university course, with three essays written by (almost) all students: (1) entrance essay; (2) mid-term essay; (3) final exam essay; plus (4) a final exam retake for some students. The corpus is therefore longitudinal in a way. It is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (B1-C2). See the metadata description for further details on the automatic and manual annotation.

The corpus is available for download from the Språkbanken Resource List, through Korp, and for download in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Browse (Korp)

Online (application)

Download

SweLL-gold

Size: 147,842 tokens (original version), 151,851 (normalized version); 7,807 sentences (original), 8,137 sentences (normalized) 
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms 
Licence: CC-BY

Swedish

This corpus contains essays from various education establishments in Sweden for non-Swedish speaking adult learners.

Aside from the automatic linguistic annotation, the corpus is manually annotated at the following levels: pseudonymization, normalization, and correction annotation. See the metadata description for further details on the automatic and manual annotation. While the SweLL-pilot corpus was collected in 2006–2016, SweLL-gold was collected in 2017–2020.

The corpus is available through Korp and for download in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2019).

Browse (original)

Browse (normalized)

Online (application)

Tisus corpus

Size: 60,632 tokens; 3,422 sentences 
Annotation: tokenised, PoS-tagged, MSD-tagged, lemgrams, compounds word forms 
Licence: CC-BY

Swedish

This corpus contains essays from a test situation written by adult learners (105 essays, 105 students; one essay per student). The essays are argumentative on the topic of stress, written at an advanced level. This is a subcorpus of the SweLL-pilot corpus.

Aside from the automatic linguistic annotation, the corpus is manually annotated for CEFR labels (B2-C1). See the metadata description for further details on the automatic and manual annotation.

The corpus is available for download from Språkbanken, through Korp, and in Språkbanken Text / the SweLL infrastructure through an individual application form.

For the relevant publication, see Volodina et al. (2016).

Browse (Korp)

Online (application)

Download

Spoken corpora

Corpus Language Description Availability

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Czech

The corpus contains speech recordings of ~32 German children learning Czech (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

The Anglish Corpus 

Annotation: interpausal units 
Licence: CLARIN RES

English

This corpus contains various speech tasks performed by French native speakers and the associated transcriptions.

The corpus is available for download from Ortolang.

Download

The Barcelona English Language Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of Spanish children and teenagers learning English in Barcelona across 4 tasks (written composition, oral narrative, oral interview and role play).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Muñoz (2006)

Browse

Download

The Barraja-Rohan Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of adult international students who spoke English as a second language and who had newly arrived at an Australian university. These undergraduate international students from various Asian backgrounds interacted over a period of seven months with Australian graduate students who were native speakers of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Barraja-Rohan (2013)

Browse

Download

The Connolly Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of 60 Japanese high school students learning English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The CUHK corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 6 children learning English in Hong Kong.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see MacWhinney (2016)

Browse

Download

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of ~32 German children learning English (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

GLBCC (Giessen - Long Beach Chaplin Corpus)

Size: 2472 words/transcript 
Licence: CC-BY

English

This corpus contains film retellings performed by English and German native speakers.

The corpus is available for download from the University of Oxford Text Archive.

Download

A Learners' Corpus of Reading Texts

Licence: CLARIN RES

English

This corpus contains unprepared readings by first-year students at an English department who speak French as a native language.

The corpus is available for download from Ortolang.

Download

The Markee Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 3 students learning English as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Markee (2000)

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 95 students learning English in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The QATAR Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains recorded interviews involving 19 Qatari learners of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhao and MacWhinney (2010)

Browse

Download

The Vercellotti Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains recordings of adult learners entering an Intensive English Program (IEP) in the United States during the year 2010. Tasks include 2-minute monologues.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vercellotti (2017)

Browse

Download

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

The corpus contains speech recordings of ~32 German children learning French (type of study: interview).

The corpus is a part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

French Learner Language Oral Corpora (FLLOC)

Size: 1375 transcripts 
Annotation: MSD-tagged 
Licence: CC-BY

French

This corpus contains various narrative and interactive speech tasks performed by English and Dutch native speakers.

The corpus is available for download from the University of Oxford Text Archive. The transcripts and audio files can also be downloaded and browsed through TalkBank.

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 28 British undergraduates learning French before, during and after a year abroad. Tasks include oral interviews and story retellings, in addition to argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 18 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The Newcastle Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains intermediate level spoken French from 17-18 year old second language learners, in years 12 to 13 of UK secondary education.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 40 students learning French in France as a second language (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The Trinity College (TCD) Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains recordings of 5 children (2 Irish, 1 Polish, 2 Cambodian) learning French in a school in France.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The Reading Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains oral proficiency interviews with 34 16-year-olds learning French in South Wales.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chambers and Richards (1995)

Browse

Download

The UWI Corpus

Size: 15,068 tokens 
Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus consists of 25 recorded interviews with learners of French (9 adult learners) in Jamaica.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Péters (2017)

Browse

Download

The Dimroth SLA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

The corpus contains speech recordings of 47 students learning German (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Dimroth (2008)

Browse

Download

Hamburg Modern Times Corpus

Size: 24,000 words 
Annotation: prosody 
Licence: CLARIN RES

German

This corpus contains film retellings and the accompanying transcriptions.

The corpus is available for download from the HZSK CLARIN-D repository.

Download

The RyanDan Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

The corpus contains recordings of 4 Carnegie Mellon University students learning German.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Walter (2020)

Browse

Download

The VYSA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

This corpus contains recordings of 3 high school students learning German abroad while living with German-speaking host families and attending German secondary schools in standard German-speaking urban and peri-urban regions of Germany.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Young-Scholten and Langer (2015)

Browse

Download

The Theodórsdóttir corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Icelandic

The corpus contains recordings obtained in a longitudinal case study of L2 Icelandic.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Theodórsdóttir (2018)

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Italian

This corpus contains speech recordings of 95 students learning Italian in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The COPA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of ~120 college students learning Mandarin in Hong Kong (type of study: responses to questions).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhang (2009)

Browse

Download

The HKPU Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of 20 college students learning Mandarin in Hong Kong. The tasks involve oral interviews.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chang et al. (2013)

Browse

Download

LANGMAN

Size: 11 subcorpora 
Annotation: error coding 
Licence: CC-BY

Hungarian

This corpus is a spoken corpus involving Chinese native speakers who learn Hungarian as a second language.

The subcorpora are available for download and online browsing through TalkBank.

Browse

Download

The BCN-L2 Corpus

Annotation: error coding 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Berber students learning Spanish. The participants were 88 native speakers of Moroccan Arabic (Darija) and 26 speakers of Berber (Amazigh) living in Catalonia.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Bet et al. (2016)

Browse

Download

The Díaz Rodríguez Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Indo-European and Asian learners, both semi-spontaneous and experimental, obtained in Barcelona, Spain (type of study: naturalistic, longitudinal).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Díaz (2002)

Browse

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 27 British undergraduates learning Spanish before, during and after a year abroad. Tasks include oral interviews and story retellings, in addition to argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 33 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The Liceras Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 11 students learning Spanish as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Liceras et al. (1999)

Browse

Download

The Nebrija-CORELE-UA Corpus

Size: 1 hour 27 minutes, 10,292 words 
Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains 10 recorded interviews involving students of Spanish as a Foreign Language at the University of Alicante, in Alicante.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Medina Soler (2017)

Browse

Download

The Nebrija-INMIGRA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus consists of oral interviews carried out in the context of the LETRA test of Spanish for immigrant workers. It is made up of semi-guided interviews carried out in Spanish which last approximately 10 minutes each. The participants are immigrants from 11 different countries who live in the Autonomous Community of Madrid (Spain).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Liceras (2017)

Browse

Download

The Nebrija-OAP Corpus

Size: 9 hours 19 minutes, 49,718 words 
Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains 67 videotaped presentations involving 95 North American students of Spanish as a Foreign Language at Nebrija University in Madrid.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vergara Padilla (2017)

Browse

Download

The Nebrija-WOCAE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of emails written and read by 28 Chinese students.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vergara Padilla (2017)

Browse

Download

The Nicolás Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of two children from Morocco learning Spanish in Spain.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see de Benito (2016)

Browse

Download

The SPLLOC1 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains recordings of L2 Spanish in a classroom context. There were 20 learners, all of whom were English native speakers, at each of 3 levels: beginners (Year 9 students aged 13-14), intermediate students (A2 students aged 17-18), and fourth year undergraduates.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Mitchell et al. (2008)

Browse

Download

The SPLLOC2 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus is an extension of the SPLLOC1 Corpus.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Mitchell et al. (2008)

Browse

Download

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Czech

The corpus contains speech recordings of ~32 German children learning Czech (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

The Anglish Corpus 

Annotation: interpausal units 
Licence: CLARIN RES

English

This corpus contains various speech tasks performed by French native speakers and the associated transcriptions.

The corpus is available for download from Ortolang.

Download

The Barcelona English Language Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of Spanish children and teenagers learning English in Barcelona across 4 tasks (written composition, oral narrative, oral interview and role play).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Muñoz (2006)

Browse

Download

The Barraja-Rohan Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of adult international students who spoke English as a second language and who had newly arrived at an Australian university. These undergraduate international students from various Asian backgrounds interacted over a period of seven months with Australian graduate students who were native speakers of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Barraja-Rohan (2013)

Browse

Download

The Connolly Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of 60 Japanese high school students learning English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The CUHK corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 6 children learning English in Hong Kong.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see MacWhinney (2016)

Browse

Download

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

The corpus contains speech recordings of ~32 German children learning English (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

GLBCC (Giessen - Long Beach Chaplin Corpus)

Size: 2472 words/transcript 
Licence: CC-BY

English

This corpus contains film retellings performed by English and German native speakers.

The corpus is available for download from the University of Oxford Text Archive.

Download

A Learners' Corpus of Reading Texts

Licence: CLARIN RES

English

This corpus contains unprepared readings by first-year students at an English department who speak French as a native language.

The corpus is available for download from Ortolang.

Download

The Markee Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 3 students learning English as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Markee (2000)

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains speech recordings of 95 students learning English in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The QATAR Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains recorded interviews involving 19 Qatari learners of English.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhao and MacWhinney (2010)

Browse

Download

The Vercellotti Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

English

This corpus contains recordings of adult learners entering an Intensive English Program (IEP) in the United States during the year 2010. Tasks include 2-minute monologues.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Vercellotti (2017)

Browse

Download

The Dresden Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

The corpus contains speech recordings of ~32 German children learning French (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Kubanek-German (2000)

Browse

Download

French Learner Language Oral Corpora (FLLOC)

Size: 1375 transcripts 
Annotation: MSD-tagged 
Licence: CC-BY

French

This corpus contains various narrative and interactive speech tasks performed by English and Dutch native speakers.

The corpus is available for download from the University of Oxford Text Archive. The transcripts and audio files can also be downloaded and browsed through TalkBank.

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 28 British undergraduates learning French before, during and after a year abroad. Tasks include oral interviews and story retellings, as well as argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 18 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The Newcastle Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains intermediate-level spoken French from 17- to 18-year-old second language learners in years 12 to 13 of UK secondary education.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains speech recordings of 40 students learning French in France as a second language (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The Trinity College (TCD) Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains recordings of 5 children (2 Irish, 1 Polish, 2 Cambodian) learning French in a school in France.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

The Reading Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains oral proficiency interviews with 34 16-year-olds learning French in South Wales.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chambers and Richards (1995)

Browse

Download

The UWI Corpus

Size: 15,068 tokens 
Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus consists of 25 recorded interviews with learners of French (9 adult learners) in Jamaica.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Péters (2017)

Browse

Download

The Dimroth SLA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

The corpus contains speech recordings of 47 students learning German (type of study: interview).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Dimroth (2008)

Browse

Download

Hamburg Modern Times Corpus

Size: 24,000 words 
Annotation: prosody 
Licence: CLARIN RES

German

This corpus contains film retellings and the accompanying transcriptions.

The corpus is available for download from the HZSK CLARIN-D repository.

Download

The RyanDan Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

The corpus contains recordings of 4 Carnegie Mellon University students learning German.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Walter (2020)

Browse

Download

The VYSA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

German

This corpus contains recordings of 3 high school students learning German abroad while living with German-speaking host families and attending German secondary schools in standard German-speaking urban and peri-urban regions of Germany.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Young-Scholten and Langer (2015)

Browse

Download

The Theodórsdóttir corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Icelandic

The corpus contains recordings obtained in a longitudinal case study of L2 Icelandic.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Theodórsdóttir (2018)

Browse

Download

The PAROLE Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Italian

This corpus contains speech recordings of 95 students learning Italian in France (type of study: tasks/storytelling).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Hilton (2009)

Browse

Download

The COPA Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of ~120 college students learning Mandarin in Hong Kong (type of study: responses to questions).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Zhang (2009)

Browse

Download

The HKPU Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Mandarin

This corpus contains speech recordings of 20 college students learning Mandarin in Hong Kong. The tasks involve oral interviews.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Chang et al. (2013)

Browse

Download

LANGMAN

Size: 11 subcorpora 
Annotation: error coding 
Licence: CC-BY

Hungarian

This corpus is a spoken corpus involving Chinese native speakers who learn Hungarian as a second language.

The subcorpora are available for online browsing and download via TalkBank.

Browse

Download

The BCN-L2 Corpus

Annotation: error coding 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of Arabic- and Berber-speaking learners of Spanish. The participants were 88 native speakers of Moroccan Arabic (Darija) and 26 speakers of Berber (Amazigh) living in Catalonia.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Bel et al. (2016)

Browse

Download

The Díaz Rodríguez Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings, both semi-spontaneous and experimental, of Indo-European and Asian learners, obtained in Barcelona, Spain (type of study: naturalistic, longitudinal).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Díaz (2002)

Browse

Download

The LANGSNAP Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 27 British undergraduates learning Spanish before, during and after a year abroad. Tasks include oral interviews and story retellings, as well as argumentative writing tasks.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The LANGSNAP3 Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus is a 3-year follow-up to the LANGSNAP Corpus, involving 33 participants.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Tracy-Ventura and Huensch (2018)

Browse

Download

The Liceras Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

Spanish

This corpus contains speech recordings of 11 students learning Spanish as a second language.

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

For the relevant publication, see Liceras et al. (1999)

Browse

Download

Video and multimodal corpora

Corpus Language Description Availability

Arabic Learner Corpus

Size: 0.3 million tokens 
Annotation: tokenised 
Licence: CLARIN RES

Arabic

This corpus contains essays written by students from 67 L1 backgrounds. It also contains recordings of speech tasks and associated transcriptions.

The corpus is available for download from the LDC catalogue.

Download

English as a Foreign Language Corpus

Size: 24 hours 
Licence: Under Negotiation

English

The corpus contains videotaped lessons involving students at Finnish secondary schools.

The Long Second Corpus

Licence: Under Negotiation

Finnish

This corpus contains written texts, audio recordings and videotaped lessons involving immigrants from the following L1 backgrounds: Estonian, Macedonian, Kurdish, Portuguese, Russian, and English.

The corpus is still in preparation. It is set to be made available on the LAT platform.

The van Compernolle Corpus

Annotation: audio/transcription linking 
Licence: public (acknowledgment required)

French

This corpus contains a recorded examination of classroom interactional practices and actions in a beginning-level ESL reading class. Analytic foci include aspects of speech delivery and timing as well as nonverbal behaviors (e.g., eye gaze, gesture).

The corpus is part of the SLABank collection, which is a component of TalkBank dedicated to providing corpora for the study of second language acquisition and learning.

The corpus is available for online browsing and download via TalkBank.

Browse

Download

Multilingual L2 learner corpora in the CLARIN infrastructure

Written corpora

Corpus Language Description Availability

MERLIN Written Learner Corpus for Czech, German, Italian 1.1

Size: 2287 texts 
Annotation: a wide range of language characteristics that provide researchers with concrete examples of learner performance and progress across multiple proficiency levels. 
Licence: CC BY-SA 4.0

Czech, German, Italian

This corpus contains learner texts produced in standardized language certifications covering CEFR levels A1-C1.

The corpus is available for download from the Eurac Research CLARIN Centre Repository.

Download

CEFLING Project Corpus

 

Finnish and English

This corpus contains texts written by primary and secondary school students (years 7-9).

DIALUKI: Diagnosing reading and writing in a second or foreign language

Size: 8,600 texts 
Licence: CLARIN RES

Finnish and English

This corpus contains texts both in Finnish (written by Russian native speakers) and English (written by Finnish native speakers).

The corpus will be made available through Korp.

 

Topling - Paths in Second Language Acquisition

Size: 165,000 tokens 
Annotation: tokenised 
Licence: CLARIN End User Licence Agreement

Finnish, English, Swedish

This corpus contains written texts in English, Swedish and Finnish produced by students in the Finnish educational system. It is an extension of the CEFLING corpus, which it also includes.

The corpus is available through the concordancer Korp.

Browse

Kolipsi Corpus Family 

Size: 5500 texts; 1.15 million tokens 
Annotation: sentence splitting, tokenised, lemmatised, PoS-tagged, manual annotation (see description) 
Licence: CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)

German, Italian

The Kolipsi Corpus Family is a collection of Italian and German learner texts that were collected in the course of the KOLIPSI project in 2007/2008 (Kolipsi-1) and a follow-up study in 2014/2015 (Kolipsi-2). The aim of the original project and the follow-up study was to analyse the second language competences of South-Tyrolean pupils from upper secondary schools (between 16 and 18 years old), and to contextualize the results of such investigation by commenting on crucial sociolinguistic and psychosocial aspects that influence it. The results of the follow-up study should be compared to the results of the original KOLIPSI project.

All sub-corpora of the Kolipsi Corpus Family contain manually performed transcription annotations. Transcription annotations reflect surface features of the text, such as the graphical arrangement, and include error annotation on the orthographic level.

Both subcorpora of KOLIPSI are available for download from the Eurac Research CLARIN Centre Repository. In addition, the family is also available for online browsing through the ANNIS concordancer.

Download (Kolipsi-1)

Download (Kolipsi-2)

Browse

LEONIDE - Longitudinal Learner Corpus in Italiano, Deutsch and English 

Size: 2510 texts; 240,000 tokens 
Annotation: sentence splitting, tokenised, lemmatised, PoS-tagged, manual annotation (see description) 
Licence: CLARIN ACADEMIC END-USER LICENCE (ACA-BY-NC-NORED 1.0)

Italian, German, English

This is a longitudinal corpus of student essays documenting the language competences and writing development of lower secondary school students in three different languages.

The texts were collected over the span of 3 consecutive years (2015-2018) in public middle schools. The pupils were 11 years old at the beginning of the data collection and 13 years old at the end. In each grade, two written texts were collected that differ with respect to genre: the first text was elicited using a picture story re-telling task; the second text is an opinion text on different aspects related to the pupils’ life and public discourse.

The corpus is fully anonymised. Manual annotation includes target hypotheses correcting orthographic errors in the text, as well as annotations of structural elements (paragraphs, line breaks, bullet points, symbols or emoticons etc.), foreign word insertions and transcript surface features (e.g. deletions, corrections or insertions by the student, unreadable or ambiguous items).

The corpus is available for download from the Eurac Research CLARIN Centre Repository.

For the relevant publication, see Glaznieks et al. (2021)

Download

Spoken corpora

Corpus Language Description Availability

AixOx

Size: 40 minutes/task 
Licence: restricted

English and French

This corpus contains readings of written texts performed by French and English native speakers.

LeaP: Learning the Prosody of a Foreign Language

Size: 31 hours 
Annotation: PoS-tagged, lemmatised, prosody

English and German

This corpus contains recordings of English and German spoken by non-native speakers from 31 different native language backgrounds.

The corpus is available for download from the Language Archive.

Download

Repiso/Contrefactualité

Licence: CLARIN RES

French, Italian, Spanish

This corpus contains recordings of counterfactual sentences.

Openprodat

Licence: GNU General Public License

Dutch, English, French, German, Italian, Arabic, Spanish, Hungarian, Japanese, Thai, Norwegian, Chinese

This corpus contains paragraph readings by participants in their L1 and in as many L2s as they felt they could manage.

The corpus is available for download from Ortolang.

For the relevant publication, see Hirst et al. (2013)

Download

GeWiss

Size: 1.4 million tokens 
Annotation: code switching

German (L2 and L1), English, Polish, Italian (L1)

This corpus contains L1 and L2 transcripts and audio recordings of spoken German academic discourse, as well as L1 data of spoken English, Polish, and Italian academic discourse.

For the relevant publication, see Fandrych et al. (2014).

Browse

Video and multimodal corpora

Corpus Language Description Availability

TAITO: Written and Oral Data of the TAITO-project

Licence: Under Negotiation

English, French, German, Italian, Swedish

This corpus contains texts written by undergraduate students at the beginning of their studies and videotaped discussions.

YKI National Certificates corpus

Licence: CLARIN RES

Italian, Swedish, Spanish, English, Finnish, German, French, Russian

This corpus contains written and speech tasks.

Other L2 Learner Corpora

An additional 128 L2 learner corpora that are not part of the CLARIN infrastructure are listed on the website of the Catholic University of Louvain.

See also LADDER (Learners' digital communication: a corpus for pragmatic competences in Italian L1/L2). This downloadable corpus consists of emails and instant messages written by (i) German learners of Italian at CEFR levels A2 to C1, most of whom are students living in Tyrol (Austria), and (ii) native speakers of Italian, most of whom are students from Rome (Italy). For a related publication, see Brocca (2021).

Additional Materials

CLARIN workshop on Interoperability of Second Language Resources and Tools, 6-8 December 2017, Gothenburg, Sweden. [html]

Publications on L2 Learner Corpora

[Barraja-Rohan 2013] Anne-Marie Barraja-Rohan. 2013. Second Language Interactional Competence and its Development: A Study of International Students in Australia.

[Bel et al. 2016] Aurora Bel, Estela García-Alcaraz, and Elisa Rosado. 2016. Reference comprehension and production in bilingual Spanish. In Language Acquisition Beyond Parameters: Studies in honour of Juana M. Liceras, edited by Anahí Alba de la Fuente, Elena Valenzuela, and Cristina Martínez Sanz, 37–70.

[de Benito 2016] Estrella Nicolás de Benito. 2016. La adquisición del sintagma determinante en español por niños de lengua materna árabe marroquí. Doctoral dissertation.

[Chang et al. 2013] A. Chang, Z.H. Feng, and W.C. Yang. 2013. A new multimedia shared L2 spoken Mandarin Chinese corpus: construction and linguistic analyses. In Proceedings of the 21st Annual Meeting of the International Association of Chinese Linguistics.

[Chambers and Richards 1995] Francine Chambers and Brian Richards. 1995. The "free conversation" and the assessment of oral proficiency. Language Learning Journal, 11: 6–10. 

[Dimroth 2008] Christine Dimroth. 2008. Age Effects on the Process of L2 Acquisition? Evidence From the Acquisition of Negation and Finiteness in L2 German. Language Learning, 58 (1): 117–150.

[Díaz 2002] Lourdes Díaz. 2002. Interferencias discursivas de hablantes bilingües castellano/catalán: uso oral y escrito. In Seminari sobre les llengües i educació de l’Estat, edited by J. Perera.

[Hirst et al. 2013] Daniel Hirst, Brigitte Bigi, Hyongsil Cho, Hongwei Ding, Sophie Herment, and Ting Wang. 2013. Building OMProDat: an open multilingual prosodic database.

[Hilton 2009] Heather Hilton. 2009. Annotation and Analyses of Temporal Aspects of Spoken Fluency. CALICO Journal, 26 (3): 644–661.

[Kubanek-German 2000] Angelika Kubanek-German. 2000. Early Language Programmes in Germany. In An Early Start: Young Learners and Modern Languages in Europe and Beyond.

[Jantunen 2011] Jarmo Harri Jantunen. 2011. Kansainvälinen oppijansuomen korpus (ICLFI): typologia, taustamuuttujat ja annotointi.

[Liceras 2017] Juana M. Liceras. 2017. Herramientas para abordar el análisis de la gramática no nativa de los inmigrantes. In La formación de los docentes de español para inmigrantes en distintos contextos educativos, edited by Dimitrinka Georgieva Níkleva.

[Liceras et al. 1999] J.M. Liceras, E. Valenzuela, and L. Díaz. 1999. L1/L2 Spanish grammars and the pragmatic deficit hypothesis. Second Language Research, 15 (2): 161–190.

[MacWhinney 2016] Brian MacWhinney. 2016. A Shared Platform for Studying Second Language Acquisition. Language Learning, 67 (1).

[Markee 2000] Numa P. Markee. 2000. Conversation Analysis. Mahwah, New Jersey: Erlbaum.

[Medina Soler 2017] Isabela Medina Soler. 2017. La atenuación en el discurso oral de estudiantes de e/le universitarios con nivel b1 en contexto de inmersión para los actos de habla disentivo.

[Mitchell et al. 2008] Rosamond Mitchell, Laura Domínguez, María Arceh, Florence Myles, and Emma Marsden. 2008. SPLLOC: A new corpus for Spanish second language acquisition research. Eurosla Yearbook, 8 (1): 287–304.

[Muñoz 2006] Carmen Muñoz (editor). 2006. Age and the Rate of Foreign Language Learning. Great Britain: Cromwell Press Ltd.

[Orr and Quené 2017] Rosemary Orr and Hugo Quené. 2017. D-LUCEA: Curation of the UCU Accent Project Data.

[Péters 2017] Hugues Péters. 2017. Comportements d'autocorrection et d'hésitation manifestés par les apprenants de FLE au cours de conversations orales spontanées. Bulletin VALS-ASLA, special issue, 2: 133–145.

[Rosen 2016] Alexandr Rosen. 2016. Building and using corpora of non-native Czech. 

[Theodórsdóttir 2018] Guðrún Theodórsdóttir. 2018. L2 Teaching in the Wild: A Closer Look at Correction and Explanation Practices in Everyday L2 Interaction. The Modern Language Journal, 102 (1).

[Tracy-Ventura and Huensch 2018] Nicole Tracy-Ventura and Amanda Huensch. 2018. The potential of publicly shared longitudinal learner corpora in SLA research. In Critical Reflections on Data in Second Language Acquisition, edited by Aarnes Gudmestad and Amanda Edmonds, 149–170.

[Vercellotti 2015] Mary Lou Vercellotti. 2015. The Development of Complexity, Accuracy, and Fluency in Second Language Performance: A Longitudinal Study. Applied Linguistics, 38 (1): 90–111.

[Volodina et al. 2016] Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, and Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. In Proceedings of LREC 2016, Slovenia.

[Volodina et al. 2019] Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, and Mats Wirén. 2019. The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. (Non-final version)

[Vergara Padilla 2017] María Ángeles Vergara Padilla. 2017. La influencia de las tipologías textuales en la fluidez. Las presentaciones académicas orales de aprendientes estadounidenses de ele.

[Walter 2020] Daniel Walter. 2020. Student Uses of the First Language for L2 Classroom Interactions.

[Young-Scholten and Langer 2015] Martha Young-Scholten and Monika Langer. 2015. The role of orthographic input in second language German: Evidence from naturalistic adult learners’ production. Applied Psycholinguistics, 36 (1): 93–114.

[Zhang 2009] Yanhui Zhang. 2009. A Tutor for Learning Chinese Sounds through Pinyin (Unpublished Doctoral Dissertation). Carnegie Mellon University.

[Zhao and MacWhinney 2009] Yun Zhao and Brian MacWhinney. 2009. Competing Cues: A Corpus-based Study of the English Tense-Aspect in Second Language Acquisition. In Proceedings of the 34th annual Boston University Conference on Language Development, edited by Katie Franich, Kate M. Iserman, and Lauren L. Keil, 503–514.