
Federated Content Search: completeness and Googleology

Submitted by Thorsten Trippel on

Federated Content Search (FCS) is a technology for accessing and searching resources that are hosted at different locations. Using a common search page, a user can search simultaneously over the resources available at various places. Technically, this means that each location provides a search service for its own resources, and the FCS client sends a query to all of them. In CLARIN, FCS is used to search within language resources at different CLARIN centres: a user searches simultaneously at various centres and receives the answers to a query on a common website. With this information alone, it becomes clear what many users may be looking for: a kind of super-Google for language-related resources. In this blog post, I will briefly discuss some of the limits and options of FCS as I see them.
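The fan-out pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual CLARIN FCS protocol: the two endpoint functions are invented stand-ins for per-centre search services, where a real client would instead send the query to each centre's search interface over HTTP.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for per-centre search services; a real FCS
# client would send the query to each endpoint over HTTP instead of
# calling local functions.
def endpoint_a(query):
    corpus = ["the cat sat", "a cat ran"]
    return [s for s in corpus if query in s]

def endpoint_b(query):
    corpus = ["cat and dog", "no match here"]
    return [s for s in corpus if query in s]

def federated_search(query, endpoints):
    """Send the same query to every endpoint and merge the answers."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda ep: ep(query), endpoints)
    merged = []
    for endpoint_results in result_lists:
        merged.extend(endpoint_results)
    return merged

print(federated_search("cat", [endpoint_a, endpoint_b]))
```

The key point is that the client owns no index of its own; it only aggregates whatever each remote search service returns.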

FCS and Googleology: Kilgarriff’s reply to research using Google

In his 2007 article Googleology is bad science (Computational Linguistics 33 (1): 147-151), Adam Kilgarriff discusses the use of commercial search engines and their limits for linguistic research. He points out that search engines do not lemmatize or part-of-speech tag the data, that they support only a limited query syntax, that they do not return all results, and that they index pages rather than instances. Against this background, Federated Content Search for language resources sounds like the solution: linguists lemmatise and provide part-of-speech tags, and corpus query engines support all kinds of sophisticated search syntax and provide (with known limits) a complete set of results when their search functions are used on their data. With this in mind, it is easy to imagine a linguist in Star Trek fashion: "Computer, give me all ditransitive verbs that are stressed on the second syllable, are preceded by a numeral and followed by a quantifier, and compare the statistical distribution to …" The dream of a corpus linguist, a nightmare for implementers. The analyst will say that this is virtually impossible and can only lead to inflated expectations if pursued. But why? Kilgarriff provides reasons for rejecting the idea that results from commercial search engines might be sound, complete, and above all sufficiently specific for interesting questions. All of these reasons also hold for a search over distributed language resources, maybe even more so. Let me take a brief look.

Lemmatization and Part-of-Speech tagging

The first issue he raises is that Google & Co. do not provide lemmatization and part-of-speech tagging. Most corpus linguists see lemmatization and part-of-speech tagging as the baseline for all corpora. They work with (written) texts, often in languages for which computational linguists have produced some tools; there are common tools and known procedures. But this reference to corpus linguists' practice already includes a few very fundamental restrictions: not every resource is a corpus, not every language resource is provided by computational linguists, and not everything is based on written material. And of course there may not even be taggers and lemmatizers available for minority, understudied or, as they are sometimes called, "commercially less interesting" languages. This means that only a portion (though probably a large one) of the material will be rich enough to provide lemmatization and POS tags, and these cannot easily be added "on the fly" to other types of resources. In fact, lemmatization and POS tags do not make much sense for resources such as lexical databases or experimental data containing lists of numeric values, for example from psycholinguistic reading-time studies. Unless Federated Content Search is restricted to specific types of resources, Kilgarriff's fundamental point of criticism is also true for FCS: lemmatization and POS tags may not be available. Or, to put it more generally: the material does not contain the structured information required for specific queries relying on those structures.

Limited search syntax

The second issue raised by Kilgarriff is the limited search syntax of commercial search engines. This is of course also true of search functions in corpus linguistics. Corpus linguists build their material and tools with a specific purpose, be it syntax analysis, lexical semantics, coreference analysis, etc. If the query language is general enough, all kinds of information explicitly contained in the data can be queried and related to other information contained in the same resource. But most corpus-linguistic search tools do not have such a general query language; they restrict themselves to the "interesting" questions implementers want to be able to answer: words with specific annotations, but not syllables or phrases, and certainly not structures crossing sentence boundaries. Of course there are solutions for each of these, but they are not necessarily available at the institution hosting the resource. And this is a fundamental difference from commercial search engines, which basically copy the data wherever they find it onto their own servers and index it there. Federated Content Search is distributed, which means that there is not one index and not one search infrastructure, but many that have to work together. Some of this is due to licensing issues, some to missing infrastructure for processing large data sets centrally. But it also means that Federated Content Search can only fully support those queries that can be expressed in all FCS environments. As these are very heterogeneous, the resulting query options will be a subset of the query options at each individual system, resulting in fewer syntax options. Extending this syntax may mean changing search functions at various institutions, possibly looking into old, unmaintained code, and requiring testing and implementation of new features, possibly without funding. The FCS search syntax will therefore be very limited in its expressivity.
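The "subset of the query options" argument can be made concrete as a set intersection: a federated query may only use features that every endpoint supports. A minimal sketch, with entirely invented centre names and feature labels:

```python
from functools import reduce

# Hypothetical capability sets for three endpoints; the centre names
# and feature labels are invented for illustration only.
capabilities = {
    "centre_a": {"lemma", "pos", "phrase", "regex"},
    "centre_b": {"lemma", "pos", "phrase"},
    "centre_c": {"lemma", "phrase", "wildcard"},
}

def shared_features(caps):
    """Features usable in a federated query: the intersection of
    the feature sets of all participating endpoints."""
    return reduce(set.intersection, caps.values())

print(sorted(shared_features(capabilities)))  # ['lemma', 'phrase']
```

Adding one endpoint with a poor query language shrinks this intersection for everyone, which is exactly why extending the common syntax requires coordinated changes across institutions.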

Completeness and Indexing

Kilgarriff's third and fourth issues address questions of completeness: commercial search engines constrain the number of queries and results, and they index pages instead of instances. In practice this is also true in an FCS environment if the servers only return a fixed number of results ("the first n hits"), if there is duplicated information (from two or more centres hosting the same corpora, or two or more versions of a resource on the same server), etc. Additionally, the same search executed with a different search function may yield different results. Anticipating well-defined, sound results in an FCS environment would require insight into the algorithms, detailed statistics, etc. And this seems impossible in a distributed environment with a possibly arbitrary number of endpoints, even though within a research infrastructure this will still be a small number. The heterogeneity of tagsets and annotations, formats and query languages does not make the situation for the results any better.
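The duplication problem can be illustrated with a small sketch: when two centres host (a version of) the same corpus, a naive merge of their already-truncated hit lists both repeats hits and silently misses everything beyond each centre's cap. The hit lists below are invented examples.

```python
def merge_deduplicated(per_endpoint_hits):
    """Merge per-endpoint hit lists, dropping exact duplicates
    (e.g. the same sentence returned by two centres hosting the
    same corpus), while preserving first-seen order."""
    seen = set()
    merged = []
    for hits in per_endpoint_hits:
        for hit in hits:
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
    return merged

hits_a = ["the cat sat", "a cat ran"]   # "first n hits" from centre A
hits_b = ["a cat ran", "cat and dog"]   # same corpus also hosted at B
print(merge_deduplicated([hits_a, hits_b]))
```

Note that this only removes exact duplicates; it cannot recover the hits each server cut off at its result limit, which is why counts from such a merge are not sound statistics.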

Applications for FCS in CLARIN

The purpose of FCS in CLARIN will have to be something else: locating centres and resources that show the potential to be useful for answering research questions on concrete linguistic forms, for example. A full-text search will identify institutions that have resources with specific content, pointing to services where the full expressivity of the query languages supported by an individual centre - including its specialized syntax - can be used. FCS will hence provide an overview, not statistics, sophisticated searches or in-depth analysis. It is a great tool for identifying candidate resources, but not for analysing them in depth.

Future use of FCS for language resources

Regarding the search results, I should add another development that will become available: the use of results as input for further analysis. Though at present FCS cannot (and maybe will not) be used for complex questions, it seems obvious that the results can be used as input for webservice-based analysis, such as part-of-speech tagging, parsing, named-entity recognition, etc. In fact, I am positive that we are going to see an implementation where some group combines FCS in CLARIN with the WebLicht webservice infrastructure. I have not seen that yet, but for me this is a fascinating perspective, and I am looking forward to having a look at such an integrated system in a couple of weeks.
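The pipeline idea — search hits flowing into an annotation step — can be sketched as follows. The tagger here is a deliberately trivial stand-in, not WebLicht's actual interface: a real pipeline would send each hit to an annotation webservice and collect the returned annotations.

```python
# Toy stand-in for an annotation webservice; a real pipeline would
# POST each hit to a tagging service instead of calling this function.
def toy_pos_tagger(sentence):
    # Crude invented rule, for illustration only: capitalised words
    # are tagged NOUN, everything else X.
    return [(word, "NOUN" if word[0].isupper() else "X")
            for word in sentence.split()]

def annotate_results(search_results, tagger):
    """Feed federated search hits into a downstream analysis step."""
    return {hit: tagger(hit) for hit in search_results}

hits = ["Berlin is big", "a cat ran"]  # e.g. merged FCS results
for hit, tags in annotate_results(hits, toy_pos_tagger).items():
    print(hit, tags)
```

The interesting part is the shape of the pipeline, not the tagger: once FCS results are machine-readable, any webservice-based analysis can be plugged in as the `tagger` step.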