Behind the Scenes: Peter Wittenburg

In our 'Behind the Scenes' series, we introduce the people who work for and use our infrastructure. In the series, we feature pioneers, researchers, ambassadors, committee chairs, PhD students, and more. First in our series is Peter Wittenburg, one of CLARIN’s ‘founding fathers’.

Please introduce yourself. What is your background?

I have a degree in electrical engineering from TU Berlin, specialising in early IT and pattern recognition. I worked as head of the technical department at the Max-Planck-Institute for Psycholinguistics in Nijmegen for about 35 years. After my official engagements in projects such as CLARIN and , I devoted my time to first co-start in 2012/13 and then in 2019/2020 the Fair Digital Objects Forum (FDOF), all with the intention to accelerate building a Global Integrated Dataspace. We just celebrated our second international FDOF meeting.

You are one of CLARIN’s pioneers. How did you first get involved?

Research at the MPI was increasingly dependent on data from other institutes, but in the early 2000s, researchers in the area of language resources did not yet think of sharing their resources. Therefore, together with Martin Everaert, we formulated the vision that ‘distributed data’ stored in many databases should be seen as a research infrastructure of high value, and we submitted an EC proposal to virtually integrate such databases. This proposal was rejected since the EC did not yet have a mechanism for such infrastructure building. However, by 2005 it was indicated that there would be a call for distributed infrastructures, with the condition that there should only be one proposal ‘per domain’. Together with a few well-known researchers from the field, we organised a workshop in Paris, which completely failed. The colleagues from the eastern networks and those from the western networks could not come together. Understanding the reasons would have required an ethnological study (distrust, lack of common ground, different characters, etc.). Highly frustrated, we went home.

'I am convinced that CLARIN has done an extremely good job compared with other initiatives in Europe and the US. It helped to structure the domain (adding the notion of trustworthy and sustainable centres, etc.), unlocked huge amounts of data, raised awareness about data and data sharing, and created some standards such as CMDI, which may survive as the most important contribution.'

Peter Wittenburg, one of CLARIN's 'founding fathers'

The following Monday morning my boss, Pim Levelt, asked me about the results. After listening to my story of frustration, he suggested that I call Tamás Váradi and propose a visit to Budapest that same week, expressing the conviction that together we could build bridges. After another chat with my colleagues from UU, where Steven Krauwer was now in charge, I called Tamás. On Friday, I was warmly welcomed in Tamás's institute, where his boss, a well-known writer in Hungary, was also around. After some excellent interactions, a concert in the Liszt Hall and some good food, I left Budapest on Saturday with the clear feeling that writing a joint proposal was indeed possible.

After some more interactions clarifying the work packages, we invited a few experts to a meeting in Berlin to discuss content and leadership. Astonishingly, this went very smoothly and resulted in clear responsibilities, so that we were able to write an excellent ESFRI proposal, which was accepted. We formed an executive board of eight colleagues that worked perfectly well together in a goal-driven manner during the first three years. I took the role of leading the technical work, and together with Erhard Hinrichs I also managed to convince the German government to contribute substantially.

I deliberately retired from my CLARIN engagement after three years to bring younger colleagues with new ideas into the leadership.

Can you describe what the first few years were like?

Starting CLARIN was an adventure, since no one had a clear idea of what a research infrastructure based on distributed repositories could deliver. On the one hand, infrastructures must produce some commons (tools, portals, standards, etc.). On the other hand, we knew that researchers were not really interested in commons, but in tools that facilitate their concrete work. This antagonism kept us busy for the first three years, and there were many good ideas on how to progress in both of these areas.

One could also see differences between the national-level and the EC-level activities. The national-level activities generally had more concrete goals. But for an infrastructure builder, which was my role, it is impossible to address all the wishes of thousands of individuals. You need to take decisions on commons without being sure whether they will finally be used in practice. This is not at all surprising. Let us remember that when the internet and the web were introduced, my scientific directors in both cases initially spoke of useless tools for some nerds.

What are some of your favourite memories of CLARIN’s early days?

I definitely have good memories of these first three years. Let me mention two points that immediately come to mind. First, whenever I went to Budapest to visit Tamás, I had to drink at least one glass of pálinka, a hard liquor I was not used to, but Tamás did not give me a chance to escape! Second, whenever we met in Copenhagen, Bente organised a lunch in a special restaurant where we got excellent smørrebrød.

CLARIN's executive board in 2009 (front, left to right): Martin Wynne, Tamás Váradi, Bente Maegaard, Dan Cristea, and (back, left to right) Steven Krauwer, Kimmo Koskenniemi, Erhard Hinrichs, and Peter Wittenburg.

CLARIN is now more than 10 years old, and an established RI in SSH. Does this meet the expectations you had in 2012?

To answer this question, one can take two views: (1) one can compare CLARIN’s work with the other ESFRI/ERICs and (2) one can try to understand where we are in research/data infrastructure building. I am convinced that CLARIN has done an extremely good job compared with other initiatives in Europe and the US. It helped to structure the domain (adding the notion of trustworthy and sustainable centres, etc.), unlocked huge amounts of data, raised awareness about data and data sharing, created some standards such as CMDI, which may survive as the most important contribution, and provided some tools which, however, will be replaced by next generations of technology.

However, when looking with a broader perspective, including the recent discussions about FAIR principles, , dataspaces and FAIR Digital Objects, I can conclude that we are still far away from a common data infrastructure that reduces the enormous inefficiencies of using data across silos within and across domains. As one of the excellent examples, CLARIN has shown what the possibilities and limitations of these early research infrastructures are. However, despite the achievements, basically none of the ESFRI ERICs are self-sustainable yet. This is not a critique, since we do not yet have a proper model for a general data infrastructure landscape. Also the recent developments around generative AI indicate, much more than in earlier years, how important ‘fair’ data sharing will be. Will CLARIN as well as the other ESFRIs help in making AI broadly available?

Is there anything that has surprised you in CLARIN’s development?

Looking back and comparing it to other ESFRI projects (not all made it to ERICs), I am astonished at how smooth our development was in general. In all three years, we had only two or three real clashes, if I remember correctly. Of course we had some difficult discussions and decisions to make, but I think that, perhaps by accident, we managed to establish an excellent leadership and did a great job of getting many of the core people and centres in Europe on board. When looking at other large initiatives, this is not what you can expect, since in such ESFRI projects one has to navigate between different cultures, backgrounds, interests, expectations and characters. If we exclude SHARE for a moment, we were the first ESFRI project to receive the status of an ERIC. Was it surprising? Looking back I would say no, thanks to all the CLARIN colleagues, especially the executive board colleagues.

How do you see CLARIN 10 years from now?

My feeling is that the ESFRI research infrastructures will need new impulses after almost 15 years, otherwise they will reach a state of saturation and lose momentum and impact. All ESFRI projects will have to determine which of their results are being broadly used, which can be maintained and at what cost, and how they can bring their huge knowledge into the new upcoming discussions and challenges. Having seen some of the plans for the next five years, I am not convinced that all ESFRI ERICs will survive. This may sound like a pessimistic view, but I would like to turn it into an optimistic statement: (1) The ESFRI/ERICs have created an enormous amount of knowledge and knowledgeable people, as well as some structures that may survive. Make optimal use of this potential for advancing the domain. (2) Check early enough what can be maintained at minimal cost by a few strong institutions, such as the CMDI framework, training courses, large-scale dissemination, etc. (3) Be careful with tools, since their maintenance is expensive and technology innovation cycles are short.

CLARIN may have a role in the future landscape, but perhaps based on a different funding model and certainly by addressing new challenges, such as those now being taken up by the Finnish CLARIN partners. As far as I can see, the Finnish colleagues have a clear program to build LLMs for minority languages, to help fine-tune models for specific applications, etc. These national programs have a clear benefit and extend the scope of activities. Other ESFRI/ERICs invest in comprehensive workflow frameworks at European level, but to me it is not clear in all cases whether there will be an outcome that will be taken up.

What I certainly miss is a European-level discussion about what the ‘gold nuggets’ are after 15 years of investment in the ESFRI projects. Instead, new programs and terms such as EOSC and ‘dataspaces’ are being introduced without building on achievements that will continue to have a structuring role. As already indicated, DDI (social science), CMDI (language resources), RO-Crate (biomed), CMIP (climate) etc. are examples of great building blocks, and they were already almost FAIR before the term was launched. The identification of ‘strong centres’, as done in CLARIN for example, will also have a structuring impact.