
Standards and Formats


Basic Principles

CLARIN adheres to the following principles:

  • Open standards are preferred over proprietary standards
  • Formats and protocols should be:
    • Well-documented
    • Verifiable
    • Proven (being used in practice)
  • Text-based formats are (where possible) preferred over binary formats
  • When digitising an analogue signal, either no compression or lossless compression is recommended

Data-deposition Formats

One of the questions to ask yourself before you deposit data at a centre is whether the format the data is in is acceptable to that centre: Is there a more suitable format? Which formats should I avoid?

The answer depends on the data itself, on the purpose of the data (is it corpus documentation or is it corpus data; is it a dictionary or a list of participants in a speech event, etc.), but also on the centre itself.

Several CLARIN centres have published information on the formats they recommend for depositing language research and technology data. A CLARIN service has since emerged that aggregates this information for the benefit of users and centres: see the searchable and sortable deposition format section of the Standards Information System (SIS).

As of mid-2024, not all centres curate their format recommendations in the SIS yet, so the two lists below serve as additional pointers. Neither list is diligently maintained. The first is expected to grow, while the target pages in the second may stay in place, may move, or may be migrated to the SIS listing. In case of doubt, please search the list of individual centres provided by the SIS.

List of centres that have deposited and maintain their format-related recommendations in the Standards Information System:

 

Below is a list of explicit data-deposition-format recommendations maintained by individual centres. SIS versions of the format recommendations exist for some of these centres, but they should be treated with caution if the SIS page displays a warning about lack of curation. Links to the SIS variants are provided in brackets after each centre name. Note that this list is not actively updated.


Learn More

For old times' sake

Documents that may be partially obsolete and of mostly historical value: