
CLARIN in the EOSC: Demonstration

Diagram showing the workflow in this demonstration

Introduction

This demonstration is intended to showcase the use of CLARIN data and tools in the context of the European Open Science Cloud. It was made for the EOSC launch event in November 2018. More information on CLARIN's role in the EOSC can be found here.

Update: Please note that as of 2024, the EOSC portal has been discontinued. However, the records on CLARIN resources can still be found via the SSH Open Marketplace.

The Research Question

Do parliamentary speeches of female and male members of parliament differ?

If they do, what are typical topics for each group?

The Dataset

Dr. Maciej Ogrodniczuk selected the following dataset based on the Polish Parliamentary Corpus: utterances from male and female Members of Parliament (MPs), extracted from the 8th term of the Sejm (the current term when this demonstration was made), between 2015-11-12 (session 1, day 1) and 2016-10-21 (session 28, day 3).

For both groups, the following sample was extracted from his dataset:

Number of          Female MPs   Male MPs
Utterances             19,983     19,983
Speakers                  126        291
Words ("tokens")      957,228    969,514

The individual utterances per MP are stored in separate text files, with a prefix indicating the sex of the speaker, e.g. f-AgataBorowiec.txt stands for utterances from the female MP named Agata Borowiec.
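The naming convention above makes it easy to split the files into the two groups programmatically. The following is a minimal sketch, assuming that files for male MPs use an `m-` prefix (only the `f-` prefix is shown above) and using a made-up second file name for illustration.

```python
from collections import defaultdict
from pathlib import PurePath


def group_by_sex(filenames):
    """Group per-MP utterance files by the sex prefix in their file name."""
    groups = defaultdict(list)
    for name in filenames:
        stem = PurePath(name).stem               # e.g. "f-AgataBorowiec"
        prefix, _, speaker = stem.partition("-")  # split at the first hyphen
        groups[prefix].append(speaker)
    return dict(groups)


# "m-JanKowalski.txt" is a hypothetical file name; the "m-" prefix is an assumption.
example_files = ["f-AgataBorowiec.txt", "m-JanKowalski.txt"]
print(group_by_sex(example_files))
```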

All these files are available as a zipfile that can be downloaded from B2SHARE.

Learn more

... about the Polish parliamentary corpus in the following publications:

... about parliamentary corpora and their applications on the CLARIN website.

... about other parliamentary corpora on the CLARIN website.

Searching for tools & processing the dataset

1. EOSC Portal Search

2. B2DROP file upload

  • The researcher uses single sign-on (B2ACCESS) to log in to his B2DROP workspace.

  • There he uploads the dataset and shares the file with a Share Link.

 

  • Now he clicks on the … icon next to the file and selects Switchboard.

3. Language Resource Switchboard

  • After being redirected to the Language Resource Switchboard, he indicates that the input file contains Polish data.

  • Then he clicks on Show Tools.

  • This combination yields one available tool, WebSty, in the category Stylometry. He clicks on this tool to see more details.

  • Now he invokes WebSty via the Click to start tool button.

 

4. WebSty tool

  • Now that the WebSty application is displayed, the researcher selects the appropriate parameters to run a noun-based comparison between the female and male MPs:

    • As Method of Analysis he chooses Content Similarity.

    • In advanced options he selects Choice of features and then clicks on the tab BOW ("bag of words").

 

  • Then he clicks on the Analyze button. This starts a parallel computation, which entails:

    • Performing a linguistic analysis of all sentences to find the part of speech of each word. This is required to identify the nouns.

    • At the same time a morphosyntactic analysis is made to determine the lemma (the uninflected base form) of the nouns. Especially for languages with a rich case system – like Polish – this is a very important step.

    • Based on the results of the linguistic analysis described above, the similarity between the nouns used by the female and the male group is calculated.

 

 

  • Once the processing is over, the researcher scrolls down to the Results section.

  • He now clicks on Importance of features to inspect which nouns are used differently by the two groups.

 

  • As grouping method he now selects first level (the female vs. male group) and then he clicks the Analyze button.

  • Now the researcher has access to a table in the application with the nouns that are statistically more likely to be used by the female MPs, in descending order of statistical significance. In the Results section he can also download the outcomes as an Excel table (available via B2SHARE).

Word (lemma)         English translation
dziecko              child
niepełnosprawna      disabled woman
niepełnosprawność    disability
opieka               care
pracownik            employee
rodzina              family
edukacja             education
kobieta              woman
praca                work
aborcja              abortion
matka                mother
placówka             establishment
rodzic               parent
cel                  goal
mała                 small one (child)
zdrowie              health

 

  • Using these results, the researcher can conclude that there is indeed a significant difference in the topics that the female MPs address: they speak more than their male colleagues about topics such as healthcare and family structures.
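To make the idea behind the bag-of-words comparison more concrete, here is a small sketch in Python. It is not WebSty's implementation: the cosine similarity and the log-likelihood (G2) ranking are stand-ins chosen for illustration, the function names are hypothetical, and the lemma counts are made-up toy numbers rather than values from the dataset.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def log_likelihood(count_a, total_a, count_b, total_b):
    """Log-likelihood (G2) score for one word observed in two corpora."""
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def distinctive_nouns(group_a: Counter, group_b: Counter, top_n=3):
    """Nouns relatively more frequent in group_a, highest G2 score first."""
    total_a, total_b = sum(group_a.values()), sum(group_b.values())
    scored = []
    for word, count_a in group_a.items():
        count_b = group_b[word]  # Counter returns 0 for unseen words
        if count_a / total_a > count_b / total_b:
            score = log_likelihood(count_a, total_a, count_b, total_b)
            scored.append((score, word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Toy lemma counts, invented for this example only.
female = Counter({"dziecko": 8, "rodzina": 5, "ustawa": 4, "budżet": 1})
male = Counter({"dziecko": 2, "rodzina": 1, "ustawa": 5, "budżet": 6})

print(round(cosine_similarity(female, male), 3))
print(distinctive_nouns(female, male))
```

In the actual workflow, the counts would come from the lemmatised nouns produced by the linguistic analysis, not from hand-written dictionaries.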

Learn more

... about WebSty:

 

5. Publishing the results in B2SHARE

  • At the EOSC Portal marketplace, the researcher searches for "publish research data". There he finds B2SHARE as a potential publication platform.

  • From there he navigates to B2SHARE and uses single sign-on to authenticate. Since he authenticated to B2DROP before, he no longer needs to enter a username and password.

  • He clicks on the Create a new record button, enters a title, selects the CLARIN community and finally clicks on Create Draft Record.

  • After entering the necessary metadata, he checks the Submit draft for publication checkbox and clicks on the Save and Publish button. This will make the dataset available via a persistent identifier.

  • His submissions will be findable in the Virtual Language Observatory and B2FIND within a day.
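The same steps can in principle be scripted instead of clicked through. The sketch below only builds the two request bodies and sends nothing. The field names (`titles`, `community`, `open_access`) and the `publication_state` patch reflect our understanding of the B2SHARE REST API and should be checked against its current documentation; the function names, the record title and the community identifier are placeholders.

```python
import json


def draft_record_body(title, community_id, open_access=True):
    """Body for creating a draft record (field names are assumptions)."""
    return {
        "titles": [{"title": title}],
        "community": community_id,
        "open_access": open_access,
    }


def publish_patch():
    """JSON Patch that submits a draft for publication (assumed path and value)."""
    return [{"op": "add", "path": "/publication_state", "value": "submitted"}]


# Placeholder values; the real CLARIN community identifier must be looked up in B2SHARE.
COMMUNITY_ID = "00000000-0000-0000-0000-000000000000"
body = draft_record_body("Example record title", COMMUNITY_ID)

print(json.dumps(body, indent=2))
print(json.dumps(publish_patch()))
```

Sending these bodies would additionally require a personal access token and the record's draft URL, both of which are left out here on purpose.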

Alternative data publication option

Many CLARIN centres provide depositing services.

Acknowledgements

  • Maciej Ogrodniczuk – providing the dataset and expertise, presenting the case at the EOSC launch event
  • Tomasz Walkowiak – providing support and suggestions for WebSty and related CLARIN-PL tools
  • Claus Zinn – designing, implementing and configuring the Language Resource Switchboard
  • Darja Fišer – providing input on the research question and feedback on the implementation of this demonstration