Question Answering (QA) systems aim to provide correct answers to natural language queries, typically on the basis of one or more structured datasets which constitute their “knowledge base”. These systems are becoming increasingly available for both domain-specific applications and broad everyday use (just think of Amazon’s Alexa and Google Home). However, significant challenges still need to be addressed, and these challenges evolve constantly, as do the systems and methodologies that address them. Hence the need for comparable and reliable ratings of QA systems.
Benchmarking a number of QA systems typically consists of assigning each system one or more performance scores, based on the answers it computes for the same set of natural language questions. A QA benchmark therefore makes it possible to rank question answering systems by their performance and to formulate statements about their excellence and quality (or their lack thereof). Our goal within the HOBBIT project is to provide a platform for benchmarking QA systems on a number of tasks and related key performance indicators (KPIs).
Tasks and KPIs are specific to the challenge(s) that a system intends to address. For instance, if a QA system aims at answering complex queries, precision and recall are relevant indicators; for another system that aims at answering a large number of simple queries in a short time, some responsiveness metric will also be relevant. The HOBBIT Question Answering Benchmark focuses on:
- Multilinguality (Multilingual QA task), that is, the ability of a system to answer the same questions formulated in multiple languages.
- Source heterogeneity (Hybrid QA task), in our case the ability to address questions that can only be answered by resorting to a combination of structured datasets and information expressed in free text.
- Scalability (Large-scale QA task), here the performance of a system in answering questions at an increasing pace.
For all tasks we used the DBpedia dataset as the knowledge base of reference.
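As a rough illustration of how precision and recall could be computed for a single question, the following Python sketch compares the answer set returned by a system with a gold-standard answer set. The function names and the handling of empty answer sets are assumptions made for this example; they do not necessarily match the measures as implemented in the HOBBIT platform.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for one question.

    gold      -- iterable of correct answers (e.g. DBpedia URIs or literals)
    predicted -- iterable of answers returned by the system
    """
    gold, predicted = set(gold), set(predicted)
    # Assumed convention: if no answer is expected and none is given,
    # the system is counted as fully correct.
    if not gold and not predicted:
        return 1.0, 1.0, 1.0
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_question_scores):
    """Average (precision, recall, F1) tuples over a non-empty list of questions."""
    n = len(per_question_scores)
    return tuple(sum(score[i] for score in per_question_scores) / n
                 for i in range(3))
```

Averaging the per-question scores over the whole question set (a macro average) is one common way to obtain a single figure per system; other aggregation choices are possible.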
Multilingual QA task
As interest in automatic question answering applications widens beyond academia and English-speaking corporate domains, it becomes increasingly compelling to provide solutions in languages other than English. With this benchmark, systems can be evaluated on a set of 50 hand-crafted questions, each available in eight languages: English, German, French, Spanish, Italian, Dutch, Romanian and Farsi.
Hybrid QA task
Well-curated, comprehensively structured knowledge bases represent the ideal source of information for an automated answering system. In reality, however, these sources constitute only a negligible portion of the potentially usable information, most of which is (so far) still expressed in the format preferred by human beings: free text. This will remain the case for the foreseeable future, hence versatile systems should also be able to cope with unstructured information. The Hybrid QA task enables the benchmarking of such systems by challenging them with hybrid questions, that is, questions requiring the retrieval of information from both a subset of a structured source and an associated free-text document. In our case, these correspond to a set of DBpedia triples and a textual abstract, respectively.
Large-scale QA task
Finally, as big data becomes… bigger and faster, so does the requirement for QA systems to scale up in terms of volume and velocity. For the large-scale QA task, we automatically generated 1.6M questions (and their associated SPARQL queries) from an initial set of 150 question templates, by substituting the entities in this small set with other entities of the same type. The questions are sent to the system being benchmarked at exponentially decreasing time intervals, putting the system under increasing stress.
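The two ideas behind this task, generating questions from templates and shrinking the time between consecutive questions, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the placeholder syntax (`{entity}`, `{entity_uri}`), the example template, the initial interval and the decay factor are hypothetical and are not taken from the actual benchmark.

```python
def instantiate(template, entities):
    """Create (question, SPARQL query) pairs from one template.

    template -- dict with 'question' and 'sparql' strings containing the
                hypothetical placeholders {entity} and {entity_uri}
    entities -- list of (label, uri) pairs for entities of the same type
    """
    return [
        (template["question"].replace("{entity}", label),
         template["sparql"].replace("{entity_uri}", uri))
        for label, uri in entities
    ]


def send_intervals(initial_seconds, decay, count):
    """Waiting times before each of `count` questions.

    Each interval is the previous one multiplied by `decay` (0 < decay < 1),
    so the gaps shrink exponentially and the load on the system grows.
    """
    return [initial_seconds * decay ** i for i in range(count)]


# Illustrative template and entities (not part of the real template set).
template = {
    "question": "In which country is {entity}?",
    "sparql": "SELECT ?c WHERE { <{entity_uri}> "
              "<http://dbpedia.org/ontology/country> ?c }",
}
cities = [
    ("Vienna", "http://dbpedia.org/resource/Vienna"),
    ("Ljubljana", "http://dbpedia.org/resource/Ljubljana"),
]
pairs = instantiate(template, cities)
intervals = send_intervals(initial_seconds=8.0, decay=0.5, count=4)
```

With these assumed parameters, each interval is half the previous one, so the system receives questions at a steadily increasing rate until it can no longer keep up.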
What is happening now?
We are already making an impact on the research community: the QALD-7 challenge is ongoing and competing systems are being benchmarked. The winners will be announced at the end of ESWC 2017 in Slovenia. We are also preparing for the QALD-8 challenge, which will run in the context of ISWC 2017 in Vienna and will allow wider participation by removing the requirement for representatives of competing systems to submit a publication presenting their work alongside their participation in the challenge.
What is next?
Having concluded the first phase of our project and provided a fully functioning benchmark environment for QA systems, we can now look back at our journey, take stock of where we stand and plan for the road we have left to travel. We already have many ideas for improving our tasks, from the possibility of choosing alternative knowledge bases to the introduction of new KPIs and questions. But we are sure the QA community will be there to suggest many more…
Authors: Bastian Haarmann, Giulio Napolitano