Question answering (QA) systems have recently become commercially viable (and lucrative) products, thanks to increasing investment in research and the development of intuitive, easy-to-use interfaces. Interactions with QA systems have become increasingly frequent and natural, and consumers’ expectations of their capabilities keep growing. Such systems are now available across a variety of settings, devices and languages, and their exploding usage in real (non-experimental) settings has boosted the demand for resilient systems that can cope with high-volume demand.
The Scalable Question Answering (SQA) challenge was successfully executed on the HOBBIT platform at the ESWC conference this year, with the aim of providing an up-to-date benchmark for assessing and comparing state-of-the-art systems that mediate between a large volume of users, expressing their information needs in natural language, and RDF data. In particular, successful approaches to this challenge were able to scale up to big data volumes, handle a vast number of questions and accelerate the question answering process, so that the highest possible number of questions could be answered as accurately as possible in the shortest time.
The dataset was derived from the award-nominated LC-QuAD dataset, which comprises 5000 questions of variable complexity and their corresponding SPARQL queries over DBpedia. In contrast to the analogous challenge task run at ESWC in 2017, the adoption of this new dataset ensured an increase in the complexity of the questions and the introduction of “noise” in the form of spelling mistakes and anomalies as a way to simulate a noisy real-world scenario in which questions may be served to the system imperfectly as a result, for instance, of speech recognition failures or typing errors.
The benchmark (see our previous post for more details) sends one question to the QA system at the start, two more questions after one minute, and in general n+1 new questions after n minutes. One minute after the last set of questions is dispatched, the benchmark closes and the evaluation is generated. Along with the usual measures of precision, recall and F-measure, an additional measure was introduced as the main ranking criterion: the Response Power. This was defined as the harmonic mean of three measures: precision, recall and the ratio between processed questions (an empty answer is considered as processed, a missing answer is considered as unprocessed) and the total number of questions sent to the system.
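To make the schedule and the ranking metric concrete, here is a minimal Python sketch of both. This is our own illustration, not the benchmark's actual code: the function names are ours, and we assume the scores are plain floats in [0, 1].

```python
def questions_sent(minutes_elapsed):
    """Cumulative number of questions dispatched by the benchmark.

    One question at the start, two more after one minute, and n+1 new
    questions after n minutes, so the running total is a triangular number.
    """
    return (minutes_elapsed + 1) * (minutes_elapsed + 2) // 2


def response_power(precision, recall, processed, total):
    """Harmonic mean of precision, recall and the processed-question ratio.

    An empty answer counts as processed; a missing answer does not.
    """
    ratio = processed / total
    values = [precision, recall, ratio]
    # The harmonic mean is zero if any component is zero.
    if any(v == 0 for v in values):
        return 0.0
    return 3 / sum(1 / v for v in values)


# A system that answers half the questions with perfect precision and
# recall on those it processes still pays a price for the missing half:
print(questions_sent(2))                      # questions sent after 2 minutes
print(response_power(1.0, 1.0, 50, 100))      # penalised by the 0.5 ratio
```

The harmonic mean punishes imbalance: a system cannot compensate for ignoring questions by being very precise on the few it does answer, which is exactly the behaviour a scalability benchmark wants to discourage.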
Out of the more than twenty teams who expressed an interest in the challenge, three (from Canada, Finland and France) were able to submit their systems and present them at the conference (which was a requirement). The final winner was the WDAqua-core1 system with a response power of 0.472, followed by GQA (0.028) and LAMA (0.019).
We would like to thank Meltwater for sponsoring a social dinner for the Question Answering community at ESWC 2018, as well as for providing the 300€ prize for the SQA challenge winner, and Springer for offering a 100€ voucher for the runner-up.