A Recap of the CLARIN Café on Online and Desktop Tools for Querying a Language Corpus

Submitted by e.gorgaini@uu.nl on 20 October 2022

Text and corpus analysis lie at the heart of digital scholarship in the humanities and social sciences, and a wide range of software tools are available in this domain. These software tools represent prime examples of the ways in which language technologies can support research across a range of disciplines, and they are therefore central to CLARIN’s mission. CLARIN is gathering information and publishing a Resource Family to provide users, both expert and novice, with a free online guide to the discovery and use of these vital tools. The CLARIN Café in September 2022 presented some of the key applications currently in use among research communities, and invited feedback on the Resource Families project.

New Resource Families Coming Up

Martin Wynne, National Coordinator for CLARIN-UK, presented the project and the draft registry of tools in an online spreadsheet. Participants in the Café were invited to offer additions, suggestions and corrections. The Resource Families aim to be comprehensive lists of all the software applications that are available for corpus analysis, both for desktop use and online, with some key information about each one to help users find them and choose between them for a particular research goal. Martin outlined how we define a ‘corpus analysis tool’ in this context. It was decided to rely on the view of the late John Sinclair (and others) that the basic operations of corpus linguistics involve ‘corpus, concordance, collocation’. So we include tools that can, at a minimum, handle a corpus, show concordances and (preferably) calculate collocations. Most of the tools listed so far can do a lot more than this, but they need to be able to do the basics to be included.
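To make the ‘corpus, concordance, collocation’ criterion concrete, here is a minimal Python sketch of the two basic operations. It is an illustration only and does not reflect how any of the listed tools is implemented: the toy corpus, the naive tokenizer and the function names are all assumptions made for this example, and the collocation step uses raw co-occurrence counts where real tools typically apply an association measure.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately naive)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, node, width=3):
    """Return key-word-in-context lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=2):
    """Count the words that occur within `window` tokens either side of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(span)
    return counts


# Toy corpus, invented for this example.
corpus = "The cat sat on the mat. The dog sat on the rug. The cat saw the dog."
tokens = tokenize(corpus)

# Print each hit with the node word aligned in the middle.
for left, node, right in concordance(tokens, "sat", width=2):
    print(f"{left:>12} | {node} | {right}")

# List the most frequent neighbours of the node word.
print(collocates(tokens, "sat", window=2).most_common(3))
```

A tool meeting the inclusion criterion would offer at least the first of these operations on a user-supplied or pre-loaded corpus, and preferably the second as well.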

LancsBox, AntConc, Kontext

As examples of the sorts of tools that are covered by the Resource Families, three guest presenters gave introductions to their software.

#LancsBox was presented by Vaclav Brezina from Lancaster University, as an example of a desktop tool which comes pre-bundled with a number of corpora and useful lexical data, and has a powerful array of corpus creation, analysis and visualisation tools. The Wizard Tool, which makes a prose research report from a series of corpus queries, was particularly impressive, pushing the boundaries of what it is possible for corpus software to achieve. The ability to deal with very large corpora is also notable, and in this respect, one of the key differences between the capabilities of desktop versus online server-based systems appears to be breaking down.

AntConc was presented by Laurence Anthony from Waseda University in Japan, as another example of a corpus analysis toolkit for concordancing and text analysis. The huge number of users and citations for AntConc is a tribute to the ease with which it can be downloaded, set up and used with texts and corpora on the user’s desktop computer. The latest version (4.1.2) has a new and improved look and feel, with an interface ‘nativised’ for Windows and macOS, and a fully-featured Linux version is also available.

Michal Kren from Charles University, Prague presented Kontext as an example of an online corpus query interface in use in a CLARIN centre, where it serves as the interface to the Czech National Corpus and other corpora. Like #LancsBox, it uses the powerful Corpus Query Language (CQL), which lets users construct complex queries that draw on the grammatical annotation in the corpora. The interface also accommodates parallel corpora.
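As a rough illustration of what a query over grammatical annotation involves, the Python sketch below matches a sequence of attribute constraints against a small annotated token list. It mimics the spirit of a CQL query such as `[lemma="run"] [tag="N.*"]`, but it is not the implementation used by Kontext or #LancsBox: the token data, the attribute names (`word`, `lemma`, `tag`) and the function names are assumptions made for this example, and the attributes actually available depend on how each corpus is annotated.

```python
import re

# Hypothetical annotated corpus: each token has a word form, a lemma and a POS tag.
tokens = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "runs", "lemma": "run", "tag": "VBZ"},
    {"word": "marathons", "lemma": "marathon", "tag": "NNS"},
    {"word": "and", "lemma": "and", "tag": "CC"},
    {"word": "ran", "lemma": "run", "tag": "VBD"},
    {"word": "yesterday", "lemma": "yesterday", "tag": "NN"},
]

# One dict of attribute -> regular expression per token position.
# Roughly corresponds to the CQL query: [lemma="run"] [tag="N.*"]
pattern = [{"lemma": "run"}, {"tag": "N.*"}]


def token_matches(token, constraints):
    """True if every constrained attribute fully matches its regular expression."""
    return all(
        re.fullmatch(regex, token.get(attr, "")) is not None
        for attr, regex in constraints.items()
    )


def find_matches(tokens, pattern):
    """Slide the pattern over the token list and collect the matching word sequences."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append([t["word"] for t in window])
    return hits


print(find_matches(tokens, pattern))
```

Because the first constraint is on the lemma rather than the word form, the query finds both ‘runs’ and ‘ran’, which is the kind of generalisation that annotation-aware querying makes possible.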

Conclusions and Discussion

A lively discussion ensued, focussing initially on the question of sustainability of software. AntConc had been presented as a tool created, distributed and managed by one person, with great success, but it was noted that this poses an obvious risk to long-term sustainability. Discussion ranged over the pros and cons of open-source software and community development models. There was also reference to the desirability of tools being transparent rather than ‘black boxes’, although this carries a risk of complexity and confusion for the user when all of the inner workings are made apparent.

It is hoped that CLARIN can play a role in supporting both the sustainability of tools (by offering support for licensing and curation) and their interoperability. The Language Resources Switchboard aims to connect datasets with applicable tools, and the Federated Content Search offers the user a way to submit a query to multiple online corpus interfaces. On the subject of interoperability of tools and datasets, the MTSV file format for interoperability in corpus linguistics was also raised.

The Resource Families will be published online in October 2022, and will be curated by CLARIN to ensure that they are accurate and up to date. Registries of this kind typically go out of date quickly, but the hope is that by embedding the ongoing curation in a research infrastructure with a long-term perspective and with the backing of national infrastructures and funders, the chances of long-term sustainability are higher.

Additional information on this CLARIN Café and the slides of the event are available on the event page.